proxywhirl.fetchers

Proxy fetching and parsing functionality.

This module provides tools for fetching proxies from various sources and parsing different formats (JSON, CSV, plain text, HTML tables).

Classes

CSVParser

Parse CSV-formatted proxy lists.

GeonodeParser

Parse GeoNode API JSON response format.

HTMLTableParser

Parse HTML table-formatted proxy lists.

JSONParser

Parse JSON-formatted proxy lists.

PlainTextParser

Parse plain text proxy lists (one per line).

ProxyFetcher

Fetch proxies from various sources.

ProxyValidator

Validate proxy connectivity with metrics collection.

ValidationResult

Result of proxy validation with timing metrics.

Functions

deduplicate_proxies(proxies)

Deduplicate proxies by URL+Port combination.

Module Contents

class proxywhirl.fetchers.CSVParser(has_header=True, columns=None, skip_invalid=False)[source]

Parse CSV-formatted proxy lists.

Initialize CSV parser.

Parameters:
  • has_header (bool) – Whether CSV has header row

  • columns (list[str] | None) – Column names if no header

  • skip_invalid (bool) – Skip malformed rows instead of raising error

parse(data)[source]

Parse CSV proxy data.

Parameters:

data (str) – CSV string to parse

Returns:

List of proxy dictionaries

Raises:

ProxyFetchError – If CSV is malformed and skip_invalid is False

Return type:

list[dict[str, Any]]
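
The parsing behavior described above can be sketched with the stdlib csv module. This is an illustrative re-implementation, not the library's actual code; the function name and `"host"`/`"port"` column names are assumptions:

```python
import csv
import io

def parse_csv_proxies(data, has_header=True, columns=None, skip_invalid=False):
    """Illustrative sketch: parse a CSV proxy list into a list of dicts."""
    rows = list(csv.reader(io.StringIO(data)))
    if has_header:
        header, rows = rows[0], rows[1:]
    else:
        header = columns or []  # column names must be supplied when no header
    proxies = []
    for row in rows:
        if len(row) != len(header):
            if skip_invalid:
                continue  # drop malformed rows instead of raising
            raise ValueError(f"malformed row: {row!r}")
        proxies.append(dict(zip(header, row)))
    return proxies
```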

class proxywhirl.fetchers.GeonodeParser[source]

Parse GeoNode API JSON response format.

The GeoNode API returns {"data": [{"ip": "…", "port": "…", "protocols": ["http"]}, …]}; this parser extracts each entry and transforms it to the standard format {"url": "http://ip:port", "protocol": "http"}.

parse(data)[source]

Parse GeoNode API response.

Parameters:

data (str) – JSON string from GeoNode API

Returns:

List of proxy dictionaries in standard format

Return type:

list[dict[str, Any]]
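
The transformation described above can be sketched as a pure function. This is an illustrative re-implementation of the documented mapping, not the library's actual code:

```python
import json

def parse_geonode(data):
    """Illustrative sketch: flatten GeoNode's {"data": [...]} response
    into standard-format proxy dicts."""
    entries = json.loads(data).get("data", [])
    proxies = []
    for entry in entries:
        ip, port = entry["ip"], entry["port"]
        # Emit one standard-format dict per advertised protocol
        for protocol in entry.get("protocols", []):
            proxies.append({"url": f"{protocol}://{ip}:{port}", "protocol": protocol})
    return proxies
```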

class proxywhirl.fetchers.HTMLTableParser(table_selector='table', column_map=None, column_indices=None)[source]

Parse HTML table-formatted proxy lists.

Initialize HTML table parser.

Parameters:
  • table_selector (str) – CSS selector for table element

  • column_map (dict[str, str] | None) – Map header names to proxy fields

  • column_indices (dict[str, int] | None) – Map field names to column indices

parse(data)[source]

Parse HTML table proxy data.

Parameters:

data (str) – HTML string containing table

Returns:

List of proxy dictionaries

Return type:

list[dict[str, Any]]
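
Table extraction by column index, as described above, can be sketched with the stdlib html.parser. This is an illustrative re-implementation under assumed defaults (`host` in column 0, `port` in column 1), not the library's actual code:

```python
from html.parser import HTMLParser

class _TableRows(HTMLParser):
    """Collect the text of each td/th cell, grouped by tr."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = ""
    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data
    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append((self._cell or "").strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def parse_proxy_table(html, column_indices=None):
    """Illustrative sketch: pull host/port cells out of an HTML table by index."""
    idx = column_indices or {"host": 0, "port": 1}
    parser = _TableRows()
    parser.feed(html)
    proxies = []
    for row in parser.rows:
        if len(row) <= max(idx.values()) or not row[idx["port"]].isdigit():
            continue  # skip header rows and malformed rows
        proxies.append({"url": f'http://{row[idx["host"]]}:{row[idx["port"]]}'})
    return proxies
```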

class proxywhirl.fetchers.JSONParser(key=None, required_fields=None)[source]

Parse JSON-formatted proxy lists.

Initialize JSON parser.

Parameters:
  • key (str | None) – Optional key to extract from JSON object

  • required_fields (list[str] | None) – Fields that must be present in each proxy

parse(data)[source]

Parse JSON proxy data.

Parameters:

data (str) – JSON string to parse

Returns:

List of proxy dictionaries

Return type:

list[dict[str, Any]]
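
The `key` and `required_fields` behavior described above can be sketched as follows. This is an illustrative re-implementation, not the library's actual code:

```python
import json

def parse_json_proxies(data, key=None, required_fields=None):
    """Illustrative sketch: parse a JSON proxy list, optionally nested under `key`,
    checking that each entry carries the required fields."""
    parsed = json.loads(data)
    items = parsed[key] if key else parsed
    for item in items:
        for field in required_fields or []:
            if field not in item:
                raise ValueError(f"missing required field {field!r} in {item!r}")
    return items
```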

class proxywhirl.fetchers.PlainTextParser(skip_invalid=True)[source]

Parse plain text proxy lists (one per line).

Initialize plain text parser.

Parameters:

skip_invalid (bool) – Skip invalid URLs instead of raising error

parse(data)[source]

Parse plain text proxy data.

Parameters:

data (str) – Plain text string with one proxy per line. Supported formats: IP:PORT, http://IP:PORT, socks5://IP:PORT

Returns:

List of proxy dictionaries with 'url' key

Raises:

ProxyFetchError – If invalid proxy format is encountered and skip_invalid=False

Return type:

list[dict[str, Any]]
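
The line-by-line parsing described above can be sketched as follows. This is an illustrative re-implementation (the default scheme for bare IP:PORT entries is an assumption), not the library's actual code:

```python
def parse_plain_text(data, skip_invalid=True):
    """Illustrative sketch: one proxy per line; a bare IP:PORT is assumed
    to be an http proxy."""
    proxies = []
    for line in data.splitlines():
        line = line.strip()
        if not line:
            continue
        if "://" not in line:
            line = f"http://{line}"  # assumed default scheme for bare IP:PORT
        scheme, _, rest = line.partition("://")
        host, _, port = rest.rpartition(":")
        if not (host and port.isdigit()):
            if skip_invalid:
                continue  # drop invalid lines instead of raising
            raise ValueError(f"invalid proxy line: {line!r}")
        proxies.append({"url": line})
    return proxies
```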

class proxywhirl.fetchers.ProxyFetcher(sources=None, validator=None)[source]

Fetch proxies from various sources.

Initialize proxy fetcher.

Parameters:
  • sources (list[proxywhirl.models.ProxySourceConfig] | None) – List of proxy source configurations

  • validator (ProxyValidator | None) – ProxyValidator instance for validating fetched proxies

add_source(source)[source]

Add a proxy source.

Parameters:

source (proxywhirl.models.ProxySourceConfig) – Proxy source configuration to add

Return type:

None

async close()[source]

Close client connection and cleanup resources.

Return type:

None

async fetch_all(validate=True, deduplicate=True, fetch_progress_callback=None, validate_progress_callback=None)[source]

Fetch proxies from all configured sources.

Parameters:
  • validate (bool) – Whether to validate proxies before returning

  • deduplicate (bool) – Whether to deduplicate proxies

  • fetch_progress_callback (Any | None) – Optional callback(completed, total, proxies_found) for fetch progress

  • validate_progress_callback (Any | None) – Optional callback(completed, total, valid_count) for validation progress

Returns:

List of proxy dictionaries

Return type:

list[dict[str, Any]]
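
The orchestration described above (fetch every source concurrently, reporting progress as each finishes) can be sketched with asyncio. This is an illustrative sketch with stand-in fetcher callables, not the library's actual implementation:

```python
import asyncio

async def fetch_all_sketch(fetchers, progress_callback=None):
    """Illustrative sketch: run all source fetchers concurrently and invoke
    callback(completed, total, proxies_found) as each one completes."""
    results, completed = [], 0
    tasks = [asyncio.ensure_future(f()) for f in fetchers]
    for task in asyncio.as_completed(tasks):
        results.extend(await task)
        completed += 1
        if progress_callback:
            progress_callback(completed, len(tasks), len(results))
    return results
```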

async fetch_from_source(source)[source]

Fetch proxies from a single source.

Includes automatic retry with exponential backoff for:
  • HTTP 429 (Too Many Requests) – respects Retry-After header

  • HTTP 503 (Service Unavailable)

  • HTTP 502 (Bad Gateway)

  • HTTP 504 (Gateway Timeout)

  • Network timeouts

Parameters:

source (proxywhirl.models.ProxySourceConfig) – Proxy source configuration

Returns:

List of proxy dictionaries

Raises:

ProxyFetchError – If fetching fails after retries

Return type:

list[dict[str, Any]]
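
The retry policy described above can be sketched as follows. This is an illustrative sketch with a stand-in `fetch` callable returning (status, retry_after, body) tuples; the actual library presumably works against real HTTP responses:

```python
import asyncio

RETRYABLE = {429, 502, 503, 504}

async def fetch_with_retry(fetch, max_attempts=3, base_delay=0.01):
    """Illustrative sketch: exponential backoff on retryable HTTP statuses,
    honoring a Retry-After hint for 429."""
    for attempt in range(max_attempts):
        status, retry_after, body = await fetch()
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_attempts - 1:
            raise RuntimeError(f"fetch failed with HTTP {status}")
        # Retry-After (seconds) wins for 429; otherwise back off exponentially
        delay = retry_after if status == 429 and retry_after else base_delay * 2 ** attempt
        await asyncio.sleep(delay)
```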

remove_source(url)[source]

Remove a proxy source by URL.

Parameters:

url (str) – URL of source to remove

Return type:

None

async start_periodic_refresh(callback=None, interval=None)[source]

Start periodic proxy refresh.

Parameters:
  • callback (Any | None) – Optional callback to invoke with new proxies

  • interval (int | None) – Override default refresh interval (seconds)

Return type:

None

class proxywhirl.fetchers.ProxyValidator(timeout=5.0, test_url=None, level=None, concurrency=50)[source]

Validate proxy connectivity with metrics collection.

Initialize proxy validator.

Parameters:
  • timeout (float) – Connection timeout in seconds

  • test_url (str | None) – URL to use for connectivity testing. If None, rotates between multiple fast endpoints (Google, Cloudflare, etc.)

  • level (proxywhirl.models.ValidationLevel | None) – Validation level (BASIC, STANDARD, FULL). Defaults to STANDARD.

  • concurrency (int) – Maximum number of concurrent validations

async check_anonymity(proxy_url=None)[source]

Check proxy anonymity level by detecting IP leakage using shared client.

Tests if the proxy reveals the real IP address or proxy usage through HTTP headers like X-Forwarded-For, Via, X-Real-IP, etc.

Parameters:

proxy_url (str | None) – Full proxy URL (e.g., "http://proxy.example.com:8080"). If None, makes the request without a proxy (for testing)

Returns:
  • "transparent": Proxy leaks the real IP address

  • "anonymous": Proxy hides the IP but reveals proxy usage via headers

  • "elite": Proxy completely hides both the IP and proxy usage

  • "unknown" or None: Could not determine (an error occurred)

Return type:

str | None
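
The header-based classification described above can be sketched as a pure function over the headers an echo endpoint reports back. This is an illustrative sketch, not the library's actual logic:

```python
def classify_anonymity(real_ip, echoed_headers):
    """Illustrative sketch: classify a proxy from the request headers that
    the target server actually received."""
    leak_headers = ("X-Forwarded-For", "X-Real-IP", "Via")
    values = {h: echoed_headers.get(h, "") for h in leak_headers}
    if real_ip in "".join(values.values()):
        return "transparent"  # real client IP was visible to the target
    if any(values.values()):
        return "anonymous"    # proxy usage revealed, but IP hidden
    return "elite"            # neither IP nor proxy usage revealed
```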

async close()[source]

Close all client connections and cleanup resources.

Return type:

None

async validate(proxy)[source]

Validate proxy connectivity with fast TCP pre-check and timing metrics.

Parameters:

proxy (dict[str, Any]) – Proxy dictionary with 'url' key (e.g., "http://1.2.3.4:8080")

Returns:

ValidationResult with is_valid flag and response_time_ms (if successful)

Return type:

ValidationResult

async validate_batch(proxies, progress_callback=None)[source]

Validate multiple proxies in parallel with concurrency control and metrics.

Uses asyncio.Semaphore to limit concurrent validations based on the configured concurrency limit. Records response time for valid proxies.

Parameters:
  • proxies (list[dict[str, Any]]) – List of proxy dictionaries

  • progress_callback (Any | None) – Optional callback(completed, total, valid_count) for progress

Returns:

List of working proxies with response_time_ms added to each

Return type:

list[dict[str, Any]]
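
The semaphore-bounded pattern described above can be sketched with asyncio. This is an illustrative sketch with a stand-in `validate` coroutine, not the library's actual implementation:

```python
import asyncio

async def validate_batch_sketch(proxies, validate, concurrency=50):
    """Illustrative sketch: validate proxies in parallel, with at most
    `concurrency` checks in flight at once."""
    sem = asyncio.Semaphore(concurrency)

    async def check(proxy):
        async with sem:  # bound the number of concurrent validations
            return proxy if await validate(proxy) else None

    results = await asyncio.gather(*(check(p) for p in proxies))
    return [p for p in results if p is not None]  # keep only working proxies
```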

async validate_https_capability_batch(http_proxies, concurrency=500, max_results=None, progress_callback=None)[source]

Test already-validated HTTP proxies for HTTPS/CONNECT support.

Many free proxy lists label HTTP proxies as "HTTPS" without actually testing CONNECT tunneling. This method takes working HTTP proxies and tests each via the CONNECT method against an HTTPS endpoint. Proxies that pass are returned as https:// entries ready for DB insertion.

Parameters:
  • http_proxies (list[dict[str, Any]]) – Already-validated HTTP proxy dicts (protocol='http').

  • concurrency (int) – Max concurrent HTTPS tests (default 500).

  • max_results (int | None) – Stop early once this many HTTPS-capable proxies are found.

  • progress_callback (Any | None) – Optional callback(completed, total, valid_count).

Returns:

Proxy dicts with protocol='https' and url='https://ip:port' for each HTTP proxy that successfully tunnels HTTPS via CONNECT.

Return type:

list[dict[str, Any]]

property test_url: str[source]

Get current test URL, rotating through multiple endpoints.

Return type:

str

class proxywhirl.fetchers.ValidationResult[source]

Bases: NamedTuple

Result of proxy validation with timing metrics.

proxywhirl.fetchers.deduplicate_proxies(proxies)[source]

Deduplicate proxies by URL+Port combination.

Hostnames are normalized to lowercase since DNS names are case-insensitive (RFC 4343). This ensures that "PROXY.EXAMPLE.COM:8080" and "proxy.example.com:8080" are correctly identified as duplicates.

Handles edge cases including:
  • IPv6 addresses: [2001:db8::1]:8080 (preserved as-is; already case-insensitive)

  • IDN hostnames: Прокси.рф:8080 (lowercased correctly)

  • Mixed-case DNS names: Proxy.EXAMPLE.com:8080 (lowercased for comparison)

Parameters:

proxies (list[dict[str, Any]]) – List of proxy dictionaries

Returns:

Deduplicated list (keeps first occurrence)

Return type:

list[dict[str, Any]]
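
The deduplication described above can be sketched as follows. This is an illustrative re-implementation under one plausible reading (the key is the lowercased host:port, ignoring scheme), not the library's actual code:

```python
def deduplicate_proxies_sketch(proxies):
    """Illustrative sketch: keep the first occurrence of each proxy,
    comparing lowercased host:port."""
    seen, unique = set(), []
    for proxy in proxies:
        scheme, _, hostport = proxy["url"].partition("://")
        # DNS names are case-insensitive (RFC 4343); bracketed IPv6
        # literals pass through .lower() unchanged
        key = hostport.lower()
        if key not in seen:
            seen.add(key)
            unique.append(proxy)
    return unique
```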