The 2026 Requests guide for production scraping

Python’s Requests library is still the workhorse of most scraping stacks I run. It isn’t the flashiest tool: it has no async support and no browser rendering. But it’s predictable, well documented, and has more than a decade of production use behind it. When someone asks me what to learn first for HTTP-level scraping, Requests is still the answer.

This guide is for operators who’ve played with Requests in tutorials and now need to run it in a real environment: rotating proxies, retrying on failures, managing sessions across hundreds of concurrent workers, and not getting your IP pool burned in 48 hours. I’m writing this from the perspective of running scrapers for price intelligence and lead generation pipelines, not toy projects. If you’re still doing requests.get(url) with no headers and a residential IP that cost you real money, this is for you.

By the end you’ll have a production-ready Requests setup: session management, exponential backoff, proxy rotation wired in, and a pattern you can hand off to a worker pool without it falling apart. I’ll also flag where Requests genuinely hits its ceiling so you know when to reach for something else.

what you need

  • Python 3.10+ (I’m on 3.12 in 2026; there’s no reason to be below 3.10)
  • requests 2.32+ and urllib3 2.x (check your transitive deps; older urllib3 has known TLS issues)
  • A proxy provider with rotating residential or datacenter IPs. I use ProxyScraping’s residential plan at around $4/GB at time of writing. Check their current pricing at proxyscraping.com.
  • A target URL you’re authorised to scrape (check the site’s robots.txt and terms of service)
  • Basic familiarity with pip and virtual environments
  • Optional but useful: a Redis or SQLite instance for deduplicating URLs if you’re crawling

Rough monthly cost to get started: $10-30 for proxy bandwidth at moderate volume, $0 for the library itself.
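
For the optional deduplication store in the list above, here is a minimal sketch using the standard library’s sqlite3 module. The table name `seen` and the helpers `open_seen_store` and `mark_if_new` are hypothetical names chosen for illustration, not part of any library, and the in-memory database is only there to keep the example self-contained.

```python
import sqlite3


def open_seen_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small SQLite store of URLs already queued."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")
    return conn


def mark_if_new(conn: sqlite3.Connection, url: str) -> bool:
    """Record the URL and return True only the first time it is seen."""
    # INSERT OR IGNORE leaves the table unchanged when the key already exists.
    cur = conn.execute("INSERT OR IGNORE INTO seen (url) VALUES (?)", (url,))
    conn.commit()
    return cur.rowcount == 1


store = open_seen_store()
candidates = [
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/1",  # duplicate, should be skipped
]
fresh = [url for url in candidates if mark_if_new(store, url)]
```

In a real crawl you would pass a file path instead of ":memory:" so the store survives restarts.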

step by step

1. set up your environment

Create a clean virtualenv and install dependencies:

python3 -m venv scraper-env
source scraper-env/bin/activate
pip install requests==2.32.3 urllib3==2.2.1 tenacity==8.3.0

Pin your versions. I’ve been burned by urllib3 2.0 breaking SSL cert handling on a Tuesday deploy. Put a requirements.txt in the project root from day one.

expected output: pip installs without errors, and python -c "import requests; print(requests.__version__)" prints 2.32.3.

if it breaks: if you get SSL errors on install, your system CA bundle may be stale. Run pip install --upgrade certifi and retry.

2. build a session with realistic headers

Never use bare requests.get() in production. Always use a Session object. It handles cookie persistence and connection pooling, and it lets you set default headers once:

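import requests

def make_session() -> requests.Session:
    """Build a Session with browser-like default headers."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
        # Only advertise "br" if the brotli package is installed. Without it,
        # Requests cannot decode Brotli bodies and response.text comes back garbled.
        "Accept-Encoding": "gzip, deflate",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    })
    session.max_redirects = 5  # fail fast on redirect loops (the default is 30)
    return session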

Use a Chrome user-agent that matches a real, current browser version. Sites fingerprint on whether the UA is consistent with the other headers. Check whatismybrowser.com for current Chrome UA strings, because they change frequently.

expected output: your session object is ready to make requests with realistic headers baked in.

if it breaks: if the target site is still blocking you, check whether it returns a 403 or a redirect to a captcha page. Different responses need different fixes.
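
To make that triage concrete, here is a small sketch of a classifier you could run on a suspect response. `classify_block` and the `CAPTCHA_HINTS` markers are hypothetical and need tuning per target. Note that Requests follows redirects by default, so to see the redirect itself you would either pass allow_redirects=False or inspect response.history.

```python
from typing import Optional

# Assumed markers for challenge pages; real targets will need their own list.
CAPTCHA_HINTS = ("captcha",)
REDIRECT_CODES = (301, 302, 303, 307, 308)


def classify_block(status_code: int, location: Optional[str], body: str) -> str:
    """Return a coarse label for why a response looks blocked."""
    if status_code in REDIRECT_CODES and location:
        if any(hint in location.lower() for hint in CAPTCHA_HINTS):
            return "captcha_redirect"
    if status_code == 403:
        return "forbidden"
    if status_code == 200 and any(hint in body.lower() for hint in CAPTCHA_HINTS):
        return "captcha_page"
    return "ok"
```

With a Requests response you would call it as classify_block(response.status_code, response.headers.get("Location"), response.text). A "forbidden" result usually points at headers or IP reputation, while the captcha labels suggest the target wants a browser.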

3. wire in proxy rotation

ProxyScraping and most residential proxy providers give you a single gateway endpoint that rotates IPs per request or per session, depending on your config. Here’s how to attach it:

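import os

# Read credentials from the environment; a missing variable raises KeyError at startup.
PROXY_USER = os.environ["PROXY_USER"]
PROXY_PASS = os.environ["PROXY_PASS"]
PROXY_HOST = "gate.proxyscraping.com:9999"

# Both keys use an http:// proxy URL on purpose: HTTPS traffic is tunnelled
# through the proxy with CONNECT, so the proxy URL itself stays http://.
proxies = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}",
}

# session is the object returned by make_session() in step 2
session.proxies.update(proxies)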

Never hardcode credentials. Pull them from environment variables or a secrets manager. I use python-dotenv locally and environment injection in production containers.

expected output: session.get("https://httpbin.org/ip") returns an IP that isn’t yours.

if it breaks: if you get ProxyError: Cannot connect to proxy, double-check the gateway hostname and port from your provider dashboard. If auth fails, verify you’re using the right credential format. Some providers use user-country-US:pass style strings for geo-targeting.
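
One more auth failure worth ruling out: passwords containing characters like @ or : break the proxy URL unless they are percent-encoded. The sketch below builds the proxies dict with encoded credentials. `build_proxy_url` and `proxies_for` are hypothetical helpers, and the -country-XX username suffix is only the example format mentioned above; your provider’s format may differ.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str,
                    country: Optional[str] = None) -> str:
    """Return an http:// proxy URL with percent-encoded credentials."""
    if country:
        # Assumed geo-targeting format; check your provider's documentation.
        user = f"{user}-country-{country}"
    # safe="" forces reserved characters such as "@" and ":" to be encoded.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}"


def proxies_for(user: str, password: str, host: str,
                country: Optional[str] = None) -> dict:
    """Build the mapping Requests expects in session.proxies."""
    url = build_proxy_url(user, password, host, country)
    return {"http": url, "https": url}
```

You would then call session.proxies.update(proxies_for(PROXY_USER, PROXY_PASS, PROXY_HOST)) in place of the hand-built dict.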

4. add retries with exponential backoff

This is where most beginner scrapers fall apart. A raw retry loop with time.sleep(1) will hammer a target and burn your IPs. Use tenacity for proper backoff:

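import requests
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

@retry(
    stop=stop_after_attempt(4),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    retry=retry_if_exception_type((requests.exceptions.ConnectionError, requests.exceptions.Timeout)),
    reraise=True,  # surface the original exception instead of tenacity's RetryError
)
def fetch(session: requests.Session, url: str) -> requests.Response:
    # timeout is (connect timeout, read timeout) in seconds
    response = session.get(url, timeout=(5, 20))
    # Raises HTTPError on 4xx/5xx; HTTPError is not in the retry list above
    response.raise_for_status()
    return response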

The timeout=(5, 20) tuple is connect timeout and read timeout separately. Always set both. A hung connection without a timeout will block your worker indefinitely.

expected output: on transient failures the function retries with increasing delays. After 4 attempts it raises and you can log and move on.

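if it breaks: note that fetch only retries on connection errors and timeouts. raise_for_status() raises HTTPError, which is not in the retry list, so a 404 fails immediately, but so do a 429 and a 503. If you add HTTPError to the list, you will start retrying 404s too, and that’s probably wrong. Add explicit status code handling before the raise so you retry rate limits and server errors and skip permanent errors.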
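
One way to express that status handling, as a sketch: raise a dedicated exception for statuses worth retrying and a different one for permanent failures. `RetryableStatusError`, `PermanentStatusError`, `check_status`, and the exact set of retryable codes are all assumptions of mine, not part of Requests or tenacity.

```python
# Assumed set of statuses worth another attempt; adjust per target.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


class RetryableStatusError(Exception):
    """Status code that may succeed on a later attempt."""


class PermanentStatusError(Exception):
    """Status code that will not improve with retries (for example 404)."""


def check_status(status_code: int) -> None:
    """Raise the matching exception for error statuses; return on success."""
    if status_code in RETRYABLE_STATUSES:
        raise RetryableStatusError(f"retryable status {status_code}")
    if status_code >= 400:
        raise PermanentStatusError(f"permanent status {status_code}")
```

Inside fetch you would call check_status(response.status_code) in place of response.raise_for_status(), and add RetryableStatusError to the tuple passed to retry_if_exception_type. A 403 lands in the permanent bucket here; if you rotate IPs on every attempt, moving it to the retryable set is a reasonable variation.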

5. handle response validation properly

Don’t trust the HTTP status code alone. Sites sometimes return 200 with a CAPTCHA page or a login redirect. Validate the actual content:

def is_valid_response(response: requests.Response, expected_text: str) -> bool:
    if response.status_code != 200:
        return False
    if expected_text not in response.text:
        return False
    if len(response.content) < 500:
        return False
    return True

Pick expected_text based on something structural in the target page: a section heading, a known CSS class, anything that wouldn’t appear in a block page.

expected output: your scraper correctly identifies block pages instead of storing garbage HTML.

if it breaks: if the expected text keeps changing, consider checking for the absence of block indicators (words like “Access Denied”, “Cloudflare”) rather than presence of expected content.
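
A sketch of that inverted check follows. `looks_blocked` is a hypothetical helper, and the indicator list is an assumption: the first two entries come from the examples above, and "captcha" is my addition.

```python
from typing import Iterable

# Lowercase markers that suggest a block page rather than real content.
BLOCK_INDICATORS = ("access denied", "cloudflare", "captcha")


def looks_blocked(html: str, indicators: Iterable[str] = BLOCK_INDICATORS) -> bool:
    """Return True if any block indicator appears in the page (case-insensitive)."""
    lowered = html.lower()
    return any(marker in lowered for marker in indicators)
```

Be careful with broad markers: a legitimate page that loads a script from a Cloudflare-hosted CDN would also match "cloudflare", so it is worth testing the list against a few known-good pages before trusting it.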

6. manage connection pools for concurrency

If you’re running multiple workers, each Session maintains its own connection pool. The default pool size is 10. For high-concurrency setups you need to configure this explicitly using HTTPAdapter:

from requests.adapters import HTTPAdapter

adapter = HTTPAdapter(
    pool_connections=20,
    pool_maxsize=50,
    max_retries=0,  # we handle retries with tenacity, not urllib3
)
session.mount("https://", adapter)
session.mount("http://", adapter)

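Keep max_retries=0 on the adapter. It is already the HTTPAdapter default, but setting it explicitly documents the intent and guards against someone later passing in a urllib3 Retry object. If both urllib3 and tenacity retry, you get compounding retry storms that hammer your proxy budget.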

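expected output: under load your logs don’t fill up with urllib3’s “Connection pool is full, discarding connection” warnings, which is what an undersized pool produces with the default non-blocking pool settings.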

if it breaks: if you’re using threading, make sure each thread has its own Session object. Sessions are not thread-safe: share adapters, but not sessions, across threads.
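
If you want one long-lived session per worker thread, so the pool settings above actually get reused between requests, a thread-local cache is one way to do it. This is a sketch: `get_thread_session` is a hypothetical helper, and it takes the factory as a parameter so the example runs without Requests installed.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def get_thread_session(factory: Callable[[], T]) -> T:
    """Return this thread's session, creating it with factory() on first use."""
    session = getattr(_thread_state, "session", None)
    if session is None:
        session = factory()
        _thread_state.session = session
    return session
```

In a worker you would call get_thread_session(make_session) and mount the adapter inside the factory. The trade-off is that cookies persist across every URL a thread handles, which may or may not be what you want for a given target.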

7. integrate with a simple worker pool

Here’s a minimal threading pattern that holds up at moderate scale:

from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_url(url: str) -> dict:
    session = make_session()
    session.proxies.update(proxies)
    adapter = HTTPAdapter(pool_connections=5, pool_maxsize=10, max_retries=0)
    session.mount("https://", adapter)
    session.mount("http://", adapter)

    try:
        response = fetch(session, url)
        return {"url": url, "status": "ok", "html": response.text}
    except Exception as e:
        return {"url": url, "status": "error", "error": str(e)}
    finally:
        session.close()

urls = ["https://example.com/page/1", "https://example.com/page/2"]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape_url, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()
        print(result["url"], result["status"])

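Creating and closing a session per task keeps cookies and other state from leaking between requests. You give up connection reuse, so each task pays for a fresh TCP and TLS handshake, but in my experience that overhead is negligible compared to network latency through a proxy.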

expected output: 10 concurrent requests run without session conflicts or leaked connections.

if it breaks: if you hit OS-level file descriptor limits (OSError: [Errno 24] Too many open files), increase your ulimit with ulimit -n 4096 or configure it in your container’s limits.

8. log everything you need to debug later

Production scrapers need structured logs. At minimum log: URL, status code, response time, proxy used (or at least proxy hash), and any error type.

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)

def fetch_with_logging(session, url):
    start = time.monotonic()
    try:
        response = fetch(session, url)
        elapsed = time.monotonic() - start
        logger.info("ok url=%s status=%d elapsed=%.2fs", url, response.status_code, elapsed)
        return response
    except Exception as e:
        elapsed = time.monotonic() - start
        logger.error("fail url=%s error=%s elapsed=%.2fs", url, type(e).__name__, elapsed)
        raise

Store logs somewhere you can query: CloudWatch, Datadog, or even plain files rotated with logrotate. You’ll thank yourself the first time a target site changes its block pattern at 3am.

expected output: every request produces a structured log line with timing.

if it breaks: if log output is missing in production, check whether your logger is being silenced by a parent logger or a library that calls logging.basicConfig before yours does.
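
For the “proxy hash” field mentioned at the top of this step, here is one possible sketch that identifies a proxy in logs without writing its password anywhere. `proxy_fingerprint` is a hypothetical helper, and the 12-character truncation is an arbitrary choice.

```python
import hashlib
from urllib.parse import urlsplit


def proxy_fingerprint(proxy_url: str) -> str:
    """Return a short, stable hash of a proxy URL that ignores the password."""
    parts = urlsplit(proxy_url)
    # Only username, host, and port feed the hash; the password never does.
    identity = f"{parts.username or ''}@{parts.hostname or ''}:{parts.port or ''}"
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()[:12]
```

You could add proxy=%s to the log format in fetch_with_logging and pass proxy_fingerprint(session.proxies["https"]). With a rotating gateway this identifies the gateway and account, not the exit IP, so it is most useful when you run several pools or geo-targeted usernames side by side.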

common pitfalls

using a single global session across threads. Requests sessions are not thread-safe. Using one session from multiple threads causes race conditions on headers and cookies. One session per thread, always.

ignoring timeout parameters. Calling session.get(url) with no timeout in production will eventually leave you with hung workers that never recover. Set both connect and read timeouts explicitly every time.

retrying on 4xx errors. A 429 (rate limit) deserves a retry with backoff. A 404 does not. A 403 from a block page might, if you rotate IPs. Handle status codes explicitly instead of blanket-retrying on raise_for_status().

burning your best IPs on cheap targets. Residential IPs cost real money. If a target doesn’t block datacenter IPs, use datacenter IPs and save residential bandwidth for sites that need it. Most providers let you switch proxy types on the same account.

scraping without checking robots.txt. This is a legal and ethical issue, not just a technical one. The Robots Exclusion Protocol is now an official RFC (9309). Courts in multiple jurisdictions have begun treating robots.txt disregard as a factor in access-related claims. This is not legal advice; consult a lawyer for your specific use case.
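
Checking robots.txt programmatically takes only a few lines with the standard library’s urllib.robotparser. The sketch below parses a robots.txt body you have already downloaded; the sample rules, the "price-bot" user agent, and the `build_robots_checker` helper are made up for illustration. The stdlib parser predates RFC 9309, so treat it as an approximation that may not match the RFC in every edge case.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

checker = build_robots_checker(SAMPLE_ROBOTS)
allowed = checker.can_fetch("price-bot", "https://example.com/products/1")
blocked = checker.can_fetch("price-bot", "https://example.com/private/report")
```

Fetch the file once per host, cache the parser, and call can_fetch before queueing each URL.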

scaling this

10x (hundreds of URLs/hour). The threading pattern above handles this. Tune max_workers based on your proxy bandwidth and target response times. Monitor error rates and adjust backoff parameters.

100x (thousands of URLs/hour). Switch from threading to async. aiohttp or httpx with asyncio will handle concurrent connections more efficiently than a thread pool at this scale. Your Requests sessions become aiohttp.ClientSession (or httpx.AsyncClient) objects. The proxy and retry patterns stay the same conceptually. If you’re running multi-account operations at this scale, antidetect browser setups may also become relevant; there’s a useful breakdown of the tooling landscape at antidetectreview.org/blog/.
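
The shape of the async version can be sketched with the standard library alone. Here `fetch_one` is a stand-in for whatever aiohttp or httpx call you end up using, and `fetch_all`, `fake_fetch`, and the limit value are hypothetical. The semaphore plays the role max_workers played in the thread pool.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


async def fetch_all(urls: Iterable[str],
                    fetch_one: Callable[[str], Awaitable[str]],
                    limit: int = 50) -> list:
    """Run fetch_one over urls with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                body = await fetch_one(url)
                return {"url": url, "status": "ok", "html": body}
            except Exception as exc:
                return {"url": url, "status": "error", "error": str(exc)}

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Placeholder for a real HTTP call; fails for URLs ending in /bad."""
    await asyncio.sleep(0)
    if url.endswith("/bad"):
        raise RuntimeError("boom")
    return f"<html>{url}</html>"


results = asyncio.run(
    fetch_all(["https://example.com/ok", "https://example.com/bad"], fake_fetch, limit=2)
)
```

The result dicts deliberately mirror the ones returned by scrape_url in step 7, so downstream code doesn’t need to change when you swap the transport.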

1000x (tens of thousands of URLs/hour). You’re now in distributed scraping territory. A single machine won’t cut it. You need a job queue (Celery with Redis, or a managed queue like SQS), multiple worker nodes, and centralized proxy management so different workers don’t overlap on the same IP at the same time. Your retry and session logic stays but needs to be stateless enough to run anywhere in your fleet. You’ll also want proxy spend tracking because at this scale costs compound fast.
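
Proxy spend tracking can start very simply. The sketch below accumulates response sizes per worker and prices them at the roughly $4/GB residential rate quoted earlier. `SpendTracker` is a hypothetical class, decimal gigabytes are an assumption, and it counts only response bodies, so your provider’s metered figure (which includes request and TLS overhead) will come out higher.

```python
from collections import defaultdict

BYTES_PER_GB = 1_000_000_000  # assumption: provider bills in decimal gigabytes


class SpendTracker:
    """Accumulate downloaded bytes per worker and estimate bandwidth cost."""

    def __init__(self, price_per_gb: float = 4.0) -> None:
        self.price_per_gb = price_per_gb
        self.bytes_by_worker = defaultdict(int)

    def record(self, worker_id: str, response_bytes: int) -> None:
        """Add one response's size to the worker's running total."""
        self.bytes_by_worker[worker_id] += response_bytes

    def total_bytes(self) -> int:
        return sum(self.bytes_by_worker.values())

    def cost_usd(self) -> float:
        """Estimated spend across all workers at the configured rate."""
        return self.total_bytes() / BYTES_PER_GB * self.price_per_gb
```

In a worker you would call tracker.record(worker_id, len(response.content)) after each fetch. Across a fleet, the per-worker totals would live in a shared store such as Redis and not in process memory.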

where to go next

If you’re also doing anything on the wallet or airdrop farming side of data collection, airdropfarming.org/blog/ has context on how scraping fits into on-chain data workflows.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
