proxywhirl.fetchers
Proxy fetching and parsing functionality.
This module provides tools for fetching proxies from various sources and parsing different formats (JSON, CSV, plain text, HTML tables).
Classes
CSVParser | Parse CSV-formatted proxy lists.
GeonodeParser | Parse GeoNode API JSON response format.
HTMLTableParser | Parse HTML table-formatted proxy lists.
JSONParser | Parse JSON-formatted proxy lists.
PlainTextParser | Parse plain text proxy lists (one per line).
ProxyFetcher | Fetch proxies from various sources.
ProxyValidator | Validate proxy connectivity with metrics collection.
ValidationResult | Result of proxy validation with timing metrics.
Functions
deduplicate_proxies(proxies) | Deduplicate proxies by URL+Port combination.
Module Contents
- class proxywhirl.fetchers.CSVParser(has_header=True, columns=None, skip_invalid=False)
Parse CSV-formatted proxy lists.
Initialize CSV parser.
- Parameters:
- class proxywhirl.fetchers.GeonodeParser
Parse GeoNode API JSON response format.
The GeoNode API returns: {"data": [{"ip": "…", "port": "…", "protocols": ["http"]}, …]}
This parser extracts each entry and transforms it to the standard format: {"url": "http://ip:port", "protocol": "http"}
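The transformation described above can be sketched roughly as follows. This is an illustration only, not the library's implementation: the function name `geonode_to_standard` is hypothetical, and both skipping incomplete records and using the first entry of `protocols` are assumptions.

```python
import json
from typing import Any


def geonode_to_standard(raw: str) -> list[dict[str, Any]]:
    """Convert a GeoNode-style JSON payload into url/protocol dicts (hypothetical helper)."""
    payload = json.loads(raw)
    proxies: list[dict[str, Any]] = []
    for item in payload.get("data", []):
        ip, port = item.get("ip"), item.get("port")
        protocols = item.get("protocols") or []
        if not ip or not port or not protocols:
            continue  # assumption: records missing ip, port, or protocols are skipped
        protocol = protocols[0]  # assumption: the first listed protocol is used
        proxies.append({"url": f"{protocol}://{ip}:{port}", "protocol": protocol})
    return proxies
```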
- class proxywhirl.fetchers.HTMLTableParser(table_selector='table', column_map=None, column_indices=None)
Parse HTML table-formatted proxy lists.
Initialize HTML table parser.
- Parameters:
- class proxywhirl.fetchers.JSONParser(key=None, required_fields=None)
Parse JSON-formatted proxy lists.
Initialize JSON parser.
- Parameters:
- parse(data)
Parse JSON proxy data.
- Parameters:
data (str) – JSON string to parse
- Returns:
List of proxy dictionaries
- Raises:
ProxyFetchError – If JSON is invalid
ProxyValidationError – If required fields are missing
- Return type:
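A minimal sketch of the parse-then-check behaviour described above, written without importing proxywhirl. The parameter descriptions for `key` and `required_fields` are not given here, so their meaning in this sketch is an assumption: `key` is taken to name a top-level field holding the proxy list, and `required_fields` to list keys every entry must contain. The function name `parse_json_proxies` is hypothetical, and the two exception classes are local stand-ins for the library's exceptions of the same names.

```python
import json
from typing import Any, Optional, Sequence


class ProxyFetchError(Exception):
    """Local stand-in for the library exception of the same name."""


class ProxyValidationError(Exception):
    """Local stand-in for the library exception of the same name."""


def parse_json_proxies(
    data: str,
    key: Optional[str] = None,
    required_fields: Optional[Sequence[str]] = None,
) -> list[dict[str, Any]]:
    """Parse a JSON proxy list and verify required fields (hypothetical helper)."""
    try:
        payload = json.loads(data)
    except json.JSONDecodeError as exc:
        raise ProxyFetchError(f"Invalid JSON: {exc}") from exc

    if key is not None:
        # Assumption: `key` selects the list from a top-level object.
        if not isinstance(payload, dict) or key not in payload:
            raise ProxyFetchError(f"Key {key!r} not found in JSON payload")
        payload = payload[key]
    if not isinstance(payload, list):
        raise ProxyFetchError("Expected a JSON array of proxies")

    for index, entry in enumerate(payload):
        missing = [field for field in (required_fields or []) if field not in entry]
        if missing:
            raise ProxyValidationError(f"Entry {index} is missing fields: {missing}")
    return payload
```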
- class proxywhirl.fetchers.PlainTextParser(skip_invalid=True)
Parse plain text proxy lists (one per line).
Initialize plain text parser.
- Parameters:
skip_invalid (bool) – Skip invalid URLs instead of raising error
- parse(data)
Parse plain text proxy data.
- Parameters:
data (str) – Plain text string with one proxy per line. Supported formats: IP:PORT, http://IP:PORT, socks5://IP:PORT
- Returns:
List of proxy dictionaries with ‘url’ key
- Raises:
ProxyFetchError – If invalid proxy format is encountered and skip_invalid=False
- Return type:
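The line formats listed above could be handled along these lines. This is a sketch under stated assumptions rather than the parser's actual code: `parse_plain_text` is a hypothetical name, a bare IP:PORT is assumed to default to the http scheme, ValueError stands in for ProxyFetchError, and bracketed IPv6 hosts are not covered.

```python
import re

# Optional scheme, then host:port. Bracketed IPv6 hosts are out of scope for this sketch.
_PROXY_RE = re.compile(
    r"^(?:(?P<scheme>[a-z][a-z0-9+.-]*)://)?(?P<host>[^\s:/]+):(?P<port>\d{1,5})$",
    re.IGNORECASE,
)


def parse_plain_text(data: str, skip_invalid: bool = True) -> list[dict[str, str]]:
    """Parse one proxy per line into dicts with a 'url' key (hypothetical helper)."""
    proxies: list[dict[str, str]] = []
    for line_no, raw_line in enumerate(data.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no proxy
        match = _PROXY_RE.match(line)
        if match is None or not 0 < int(match["port"]) <= 65535:
            if skip_invalid:
                continue
            # ValueError stands in for the library's ProxyFetchError.
            raise ValueError(f"Invalid proxy format on line {line_no}: {line!r}")
        scheme = (match["scheme"] or "http").lower()  # assumption: bare IP:PORT means http
        proxies.append({"url": f"{scheme}://{match['host']}:{match['port']}"})
    return proxies
```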
- class proxywhirl.fetchers.ProxyFetcher(sources=None, validator=None)
Fetch proxies from various sources.
Initialize proxy fetcher.
- Parameters:
sources (list[proxywhirl.models.ProxySourceConfig] | None) – List of proxy source configurations
validator (ProxyValidator | None) – ProxyValidator instance for validating fetched proxies
- add_source(source)
Add a proxy source.
- Parameters:
source (proxywhirl.models.ProxySourceConfig) – Proxy source configuration to add
- Return type:
None
- async fetch_all(validate=True, deduplicate=True, fetch_progress_callback=None, validate_progress_callback=None)
Fetch proxies from all configured sources.
- Parameters:
validate (bool) – Whether to validate proxies before returning
deduplicate (bool) – Whether to deduplicate proxies
fetch_progress_callback (Any | None) – Optional callback(completed, total, proxies_found) for fetch progress
validate_progress_callback (Any | None) – Optional callback(completed, total, valid_count) for validation progress
- Returns:
List of proxy dictionaries
- Return type:
- async fetch_from_source(source)
Fetch proxies from a single source.
Includes automatic retry with exponential backoff for:
- HTTP 429 (Too Many Requests); respects the Retry-After header
- HTTP 503 (Service Unavailable)
- HTTP 502 (Bad Gateway)
- HTTP 504 (Gateway Timeout)
- Network timeouts
- Parameters:
source (proxywhirl.models.ProxySourceConfig) – Proxy source configuration
- Returns:
List of proxy dictionaries
- Raises:
ProxyFetchError – If fetching fails after retries
- Return type:
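The retry policy described above might look roughly like the sketch below. It is not the library's code: `backoff_delay`, `fetch_with_retry`, and the `do_request` hook are hypothetical, the base delay (1 s), cap (30 s), and retry count are assumed values, RuntimeError stands in for ProxyFetchError, and only the numeric form of Retry-After is handled.

```python
import asyncio
from typing import Awaitable, Callable, Mapping, Optional, Tuple

RETRYABLE_STATUSES = {429, 502, 503, 504}

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def backoff_delay(
    attempt: int, retry_after: Optional[str] = None, base: float = 1.0, cap: float = 30.0
) -> float:
    """Seconds to wait after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return min(float(retry_after), cap)  # honor a numeric Retry-After value
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    return min(base * (2 ** attempt), cap)  # exponential growth, capped


async def fetch_with_retry(
    do_request: Callable[[], Awaitable[Response]],
    max_retries: int = 3,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Tuple[int, str]:
    """Call `do_request` until it returns a non-retryable status or retries run out."""
    for attempt in range(max_retries + 1):
        headers: Mapping[str, str] = {}
        status: Optional[int] = None
        try:
            status, headers, body = await do_request()
            if status not in RETRYABLE_STATUSES:
                return status, body
        except asyncio.TimeoutError:
            status = None  # a network timeout is retried like a transient status
        if attempt < max_retries:
            # Header lookup is case-sensitive here because `headers` is a plain mapping.
            retry_after = headers.get("Retry-After") if status == 429 else None
            await sleep(backoff_delay(attempt, retry_after))
    # RuntimeError stands in for the library's ProxyFetchError.
    raise RuntimeError(f"fetch failed after {max_retries} retries")
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the delays be observed without actually waiting.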
- class proxywhirl.fetchers.ProxyValidator(timeout=5.0, test_url=None, level=None, concurrency=50)
Validate proxy connectivity with metrics collection.
Initialize proxy validator.
- Parameters:
timeout (float) – Connection timeout in seconds
test_url (str | None) – URL to use for connectivity testing. If None, rotates between multiple fast endpoints (Google, Cloudflare, etc.)
level (proxywhirl.models.ValidationLevel | None) – Validation level (BASIC, STANDARD, FULL). Defaults to STANDARD.
concurrency (int) – Maximum number of concurrent validations
- async check_anonymity(proxy_url=None)
Check proxy anonymity level by detecting IP leakage using shared client.
Tests if the proxy reveals the real IP address or proxy usage through HTTP headers like X-Forwarded-For, Via, X-Real-IP, etc.
- Parameters:
proxy_url (str | None) – Full proxy URL (e.g., “http://proxy.example.com:8080”). If None, makes the request without a proxy (for testing)
- Returns:
One of the following anonymity levels:
- “transparent”: Proxy leaks real IP address
- “anonymous”: Proxy hides IP but reveals proxy usage via headers
- “elite”: Proxy completely hides both IP and proxy usage
- “unknown” or None: Could not determine (error occurred)
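The three determinate levels can be illustrated with a small classifier over what a header-echo endpoint reports having received. This is a sketch, not the method's actual logic: `classify_anonymity` and its parameters are hypothetical, and the header set is limited to the three headers named above.

```python
from typing import Mapping

# Headers named in the description; a real check may inspect more (assumption).
REVEALING_HEADERS = ("x-forwarded-for", "via", "x-real-ip")


def _mentions(ip: str, value: str) -> bool:
    """True if `ip` appears as a whole comma-separated token in `value`."""
    return ip in {token.strip() for token in value.split(",")}


def classify_anonymity(real_ip: str, seen_headers: Mapping[str, str], seen_origin: str) -> str:
    """Classify a proxy from the request as seen by the target (hypothetical helper).

    real_ip:      the client's address without any proxy
    seen_headers: request headers the target received
    seen_origin:  source address the target observed
    """
    headers = {name.lower(): value for name, value in seen_headers.items()}
    if _mentions(real_ip, seen_origin) or any(_mentions(real_ip, v) for v in headers.values()):
        return "transparent"  # the real IP reached the target
    if any(name in headers for name in REVEALING_HEADERS):
        return "anonymous"  # IP hidden, but proxy usage is visible
    return "elite"  # neither the IP nor proxy usage is visible
```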
- async validate(proxy)
Validate proxy connectivity with fast TCP pre-check and timing metrics.
- Parameters:
proxy (dict[str, Any]) – Proxy dictionary with ‘url’ key (e.g., “http://1.2.3.4:8080”)
- Returns:
ValidationResult with is_valid flag and response_time_ms (if successful)
- Return type:
- async validate_batch(proxies, progress_callback=None)
Validate multiple proxies in parallel with concurrency control and metrics.
Uses asyncio.Semaphore to limit concurrent validations based on the configured concurrency limit. Records response time for valid proxies.
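The semaphore pattern mentioned above can be sketched as follows. This is a generic illustration, not the method's implementation: `validate_batch_sketch` and the `check` coroutine are hypothetical, and storing the timing under a `response_time_ms` key on the proxy dict is an assumption made for the example.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def validate_batch_sketch(
    proxies: list[dict[str, Any]],
    check: Callable[[dict[str, Any]], Awaitable[bool]],
    concurrency: int = 50,
) -> list[dict[str, Any]]:
    """Run `check` on every proxy with at most `concurrency` checks in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(proxy: dict[str, Any]):
        async with semaphore:  # blocks while `concurrency` checks are already running
            started = time.perf_counter()
            ok = await check(proxy)
            elapsed_ms = (time.perf_counter() - started) * 1000
        # Only valid proxies get a recorded response time.
        return {**proxy, "response_time_ms": elapsed_ms} if ok else None

    results = await asyncio.gather(*(run_one(p) for p in proxies))
    return [r for r in results if r is not None]
```

All tasks are created up front and the semaphore gates entry to the check itself, so the concurrency limit applies to in-flight checks rather than to task creation.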
- async validate_https_capability_batch(http_proxies, concurrency=500, max_results=None, progress_callback=None)
Test already-validated HTTP proxies for HTTPS/CONNECT support.
Many free proxy lists label HTTP proxies as “HTTPS” after testing CONNECT tunneling. This method takes working HTTP proxies and tests each via the CONNECT method against an HTTPS endpoint. Proxies that pass are returned as https:// entries ready for DB insertion.
- Parameters:
http_proxies (list[dict[str, Any]]) – Already-validated HTTP proxy dicts (protocol='http').
concurrency (int) – Max concurrent HTTPS tests (default 500).
max_results (int | None) – Stop early once this many HTTPS-capable proxies are found.
progress_callback (Any | None) – Optional callback(completed, total, valid_count).
- Returns:
Proxy dicts with protocol='https' and url='https://ip:port' for each HTTP proxy that successfully tunnels HTTPS via CONNECT.
- Return type:
- class proxywhirl.fetchers.ValidationResult
Bases: NamedTuple
Result of proxy validation with timing metrics.
- proxywhirl.fetchers.deduplicate_proxies(proxies)
Deduplicate proxies by URL+Port combination.
Hostnames are normalized to lowercase since DNS names are case-insensitive (RFC 4343). This ensures that “PROXY.EXAMPLE.COM:8080” and “proxy.example.com:8080” are correctly identified as duplicates.
Handles edge cases including:
- IPv6 addresses: [2001:db8::1]:8080 (preserved as-is, already case-insensitive)
- IDN hostnames: Прокси.рф:8080 (lowercased correctly)
- Mixed-case DNS names: Proxy.EXAMPLE.com:8080 (lowercased for comparison)
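The normalization described above can be approximated with the standard library. This is a sketch only: `dedupe_by_host_port` is a hypothetical name, keeping the first occurrence of each duplicate is an assumption, and the key here is (lowercased host, port) with the scheme ignored, which may differ from the library's exact rule.

```python
from typing import Any
from urllib.parse import urlsplit


def dedupe_by_host_port(proxies: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop proxies whose lowercased host and port were already seen (hypothetical helper)."""
    seen: set[tuple[str, Any]] = set()
    unique: list[dict[str, Any]] = []
    for proxy in proxies:
        url = proxy["url"]
        # A bare "host:port" needs a leading "//" so urlsplit treats it as a netloc.
        parts = urlsplit(url if "://" in url else f"//{url}")
        host = (parts.hostname or "").lower()  # covers ASCII, IDN, and IPv6 hex digits
        key = (host, parts.port)
        if key in seen:
            continue  # assumption: the first occurrence wins
        seen.add(key)
        unique.append(proxy)
    return unique
```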