The 2026 httpx guide for production scraping
Most scraping tutorials still start with requests. it works, but it is synchronous, it has no built-in HTTP/2 support, and when you are pulling a few thousand URLs an hour the single-threaded model becomes a bottleneck you feel in your wallet, not just your runtime. i moved my production scrapers to httpx in late 2024 and have not looked back. the library gives you a requests-compatible interface, native async, HTTP/2, and a clean way to plug in proxy rotation, all in one package.
this guide is for operators running scrapers at scale: e-commerce price monitors, SERP trackers, lead-gen pipelines, anything that hits hundreds or thousands of endpoints per session. if you are a hobbyist scraping one site once a week, requests is fine. if you are billing clients by the data point or running arbitrage bots that need fresh data every few minutes, read on.
by the end of this tutorial you will have a working async httpx scraper with proxy rotation, retry logic, connection pooling tuned for throughput, and a structure that holds up past 100k requests a day.
what you need
- python 3.11+ (3.12 preferred; asyncio improvements are real)
- httpx 0.27+ (asyncio ships with the stdlib; anyio is pulled in as an httpx dependency)
- a proxy provider with rotating residential or datacenter IPs. i use ProxyScraping’s residential plan ($75/month for 10 GB as of may 2026) for most jobs
- a linux VPS with at least 2 vCPUs and 4 GB RAM for serious async workloads. hetzner CPX21 at ~€8/month is my default starting point
- python packages: httpx[http2], tenacity for retries, asyncio-throttle for rate limiting
- basic familiarity with Python async/await syntax
optional but recommended: a redis instance for URL queuing if you scale past a single machine.
step by step
step 1: install dependencies
pip install "httpx[http2]" tenacity asyncio-throttle
the [http2] extra pulls in h2, which enables HTTP/2 negotiation via ALPN. many modern CDNs prefer HTTP/2 and you get header compression and multiplexing for free. verify the install worked:
python -c "import httpx; print(httpx.__version__)"
expected output: 0.27.x or higher. if you see an older version, run pip install --upgrade "httpx[http2]" (the quotes keep shells like zsh from interpreting the brackets).
if it breaks: some environments have a conflicting h2 version pinned by another package. run pip install "h2>=4.0,<5" explicitly, then retry.
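if you would rather check the install from a script than by eye, here is a minimal sketch using only the stdlib. the helper names (major_minor, check_install) are mine, not part of httpx, and the (0, 27) floor simply mirrors the version this guide assumes.

```python
import importlib.util
import re
from importlib import metadata


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.27.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def check_install(minimum: tuple[int, int] = (0, 27)) -> dict:
    """Report whether httpx meets the minimum version and whether h2 is importable."""
    report = {
        "httpx": None,
        "httpx_ok": False,
        # find_spec returns None when the top-level module is not installed
        "h2": importlib.util.find_spec("h2") is not None,
    }
    try:
        report["httpx"] = metadata.version("httpx")
    except metadata.PackageNotFoundError:
        return report
    report["httpx_ok"] = major_minor(report["httpx"]) >= minimum
    return report


print(check_install())
```

if h2 comes back False, the [http2] extra did not install and the client cannot negotiate HTTP/2.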
step 2: create a reusable async client
the single biggest mistake i see from people coming from [requests](https://requests.readthedocs.io/) is creating a new client per request. that kills connection pooling and makes you pay for DNS resolution and a fresh TCP/TLS handshake every time. create one client, reuse it across your coroutines.
import httpx
import asyncio

async def make_client(proxy_url: str | None = None) -> httpx.AsyncClient:
    # http2 and proxy are set on the client itself so they also apply to
    # proxied connections. transport-level retries stay at their default of 0;
    # we handle retries ourselves with tenacity.
    client = httpx.AsyncClient(
        http2=True,
        proxy=proxy_url,  # the singular `proxy` argument needs httpx 0.26+
        timeout=httpx.Timeout(connect=10.0, read=30.0, write=10.0, pool=5.0),
        limits=httpx.Limits(max_connections=200, max_keepalive_connections=50),
        follow_redirects=True,
        headers={
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
        },
    )
    return client
expected output: no errors. the client is not yet connected to anything; connections open lazily on first request.
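if it breaks: if you get a TypeError about an unexpected keyword argument, check your httpx version. the singular proxy argument arrived in 0.26; the older proxies argument was deprecated at the same time and removed in 0.28. make sure you are on 0.27+.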
step 3: add retry logic with tenacity
production scraping means 5xx errors, timeouts, and the occasional connection reset. wrap your fetch function with tenacity so transient failures retry automatically without crashing your pipeline.
import httpx
from tenacity import (
    retry,
    stop_after_attempt,
    wait_exponential,
    retry_if_exception_type,
)

@retry(
    # NetworkError covers ConnectError plus read/write failures such as connection resets
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    wait=wait_exponential(multiplier=1, min=2, max=30),
    stop=stop_after_attempt(4),
    reraise=True,
)
async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
    response = await client.get(url)
    response.raise_for_status()  # raises httpx.HTTPStatusError on 4xx/5xx
    return response
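this retries on connection-level errors and timeouts, backing off exponentially. HTTP status errors are not retried: raise_for_status() raises httpx.HTTPStatusError, which the decorator does not match, so both 4xx and 5xx responses surface immediately. that is deliberate for 4xx, because a 403 usually means the target has flagged you rather than hit a transient failure, and hammering on it wastes proxy bandwidth. if you want 5xx responses retried as well, you need a custom retry predicate.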
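to see what the backoff settings cost in wall-clock time, here is a rough worked example. it assumes tenacity computes each wait as multiplier * 2^(attempt - 1) clamped to [min, max], which is my reading of the 8.x behaviour; check it against the version you have installed. backoff_schedule is a hypothetical helper, not a tenacity function.

```python
def backoff_schedule(attempts, multiplier=1, min_wait=2, max_wait=30, exp_base=2):
    """Waits (in seconds) between attempts under the assumed clamp formula."""
    waits = []
    for n in range(1, attempts):  # no wait follows the final attempt
        raw = multiplier * exp_base ** (n - 1)
        waits.append(max(min_wait, min(raw, max_wait)))
    return waits


schedule = backoff_schedule(4)
print(schedule)       # [2, 2, 4]
print(sum(schedule))  # 8
```

under that assumption a URL that fails all four attempts spends about 8 seconds waiting, on top of whatever the timeouts themselves consumed.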
if it breaks: if you are getting RuntimeError: cannot reuse already awaited coroutine, make sure the @retry decorator wraps an async def, not a coroutine object. tenacity 8.2+ handles async natively.
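if you do want some status codes retried, one option is a predicate that inspects the exception's response. this is a sketch, not the setup used above: the RETRYABLE_STATUSES set is my assumption about which codes are usually transient, and the function duck-types on .response.status_code so it works with httpx.HTTPStatusError without importing httpx.

```python
# Status codes treated as transient. This set is an assumption; tune it per target.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def should_retry(exc: BaseException) -> bool:
    """Return True when the exception carries a response with a retryable status."""
    response = getattr(exc, "response", None)
    status = getattr(response, "status_code", None)
    return status in RETRYABLE_STATUSES
```

with tenacity you would combine it with the existing condition, along the lines of retry=retry_if_exception_type((...)) | retry_if_exception(should_retry). a 403 or 404 still fails on the first attempt.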
step 4: plug in proxy rotation
ProxyScraping’s rotating residential gateway gives you a single endpoint; each request goes out through a different exit IP. the format is a standard HTTP proxy URL:
PROXY_URL = "http://user:<password>@<gateway-host>:9999"  # substitute your own password and the gateway host from your dashboard
for jobs where you need sticky sessions (same IP across a multi-step login flow), append a session ID to the username:
session_id = "mysession42"
PROXY_URL = f"http://user-session-{session_id}:<password>@<gateway-host>:9999"  # same placeholders as above
pass the proxy URL into make_client() from step 2. one client, one rotating gateway. if you need geographic targeting, ProxyScraping exposes country codes as URL parameters; check their dashboard for the exact format.
if it breaks: if you get 407 [Proxy Authentication](https://proxyscraping.org/blog/proxy-authentication-user-pass-vs-ip-whitelist-trade-offs) Required, double-check your credentials in the dashboard. some providers also block certain ports; test with curl -x "$PROXY_URL" https://httpbin.org/ip before debugging in Python.
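one frequent cause of a 407 is a password containing @ or :, which breaks the URL unless it is percent-encoded. here is a small sketch of a builder that handles that and the sticky-session suffix. build_proxy_url is a hypothetical helper, gateway.example.com is a stand-in host, and the user-session-<id> username format is the one shown above; confirm it against your provider's docs.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, session_id=None):
    """Assemble an HTTP proxy URL, percent-encoding the credentials."""
    if session_id:
        # sticky sessions: append the session id to the username
        user = f"{user}-session-{session_id}"
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


print(build_proxy_url("user", "p@ss:1", "gateway.example.com", 9999, "mysession42"))
# http://user-session-mysession42:p%40ss%3A1@gateway.example.com:9999
```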
step 5: run concurrent requests with asyncio.gather
now we tie it together. a simple scraper that hits a list of URLs concurrently:
import asyncio
from asyncio_throttle import Throttler

async def scrape_all(urls: list[str], proxy_url: str, rps: float = 5.0) -> list[dict]:
    throttler = Throttler(rate_limit=rps)
    async with await make_client(proxy_url) as client:
        async def fetch_one(url: str) -> dict:
            async with throttler:
                try:
                    resp = await fetch(client, url)
                    return {"url": url, "status": resp.status_code, "body": resp.text}
                except Exception as e:
                    # record the failure instead of letting one URL abort the batch
                    return {"url": url, "error": str(e)}
        results = await asyncio.gather(*[fetch_one(u) for u in urls])
    return list(results)

if __name__ == "__main__":
    urls = ["https://httpbin.org/get"] * 20
    data = asyncio.run(scrape_all(urls, proxy_url=PROXY_URL))
    for row in data[:3]:
        # failed rows carry "error" instead of "status"
        print(row.get("status", row.get("error")), row["url"])
expected output: three lines reading 200 https://httpbin.org/get, since the loop prints only the first three rows. runtime for 20 requests at 5 rps should be roughly 4 seconds.
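if it breaks: with the default return_exceptions=False, asyncio.gather does not swallow exceptions; it propagates the first one to the caller while the other tasks keep running. fetch_one already catches its own errors, so this should be rare, but if the whole gather crashes, pass return_exceptions=True and inspect results for exception objects.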
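a small stdlib-only illustration of what return_exceptions=True gives you. job is a stand-in for a fetch that sometimes fails; nothing here touches the network.

```python
import asyncio


async def job(n: int) -> int:
    """Stand-in for a fetch: every third job fails."""
    await asyncio.sleep(0)
    if n % 3 == 0:
        raise ValueError(f"job {n} failed")
    return n * 10


async def run_all(count: int):
    # exceptions come back as values in the result list, in input order
    outcomes = await asyncio.gather(
        *(job(n) for n in range(1, count + 1)), return_exceptions=True
    )
    ok = [o for o in outcomes if not isinstance(o, BaseException)]
    failed = [o for o in outcomes if isinstance(o, BaseException)]
    return ok, failed


ok, failed = asyncio.run(run_all(6))
print(ok)           # [10, 20, 40, 50]
print(len(failed))  # 2
```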
step 6: handle TLS fingerprinting
by 2026 most serious anti-bot stacks (Cloudflare, Akamai, DataDome) check your TLS ClientHello fingerprint, not just your user-agent. stock Python ssl sends a fingerprint that says “i am python”, even if your headers say Chrome. httpx alone only gets you part of the way: http2 + ALPN moves you closer to a browser fingerprint, but for hardened targets you need curl-cffi, which wraps libcurl compiled against BoringSSL.
for targets you know fingerprint-check, switch from the httpx client to curl-cffi:
pip install curl-cffi
from curl_cffi.requests import AsyncSession

async def fetch_with_impersonation(url: str, proxy: str) -> str:
    async with AsyncSession(impersonate="chrome124") as session:
        r = await session.get(url, proxy=proxy)
        return r.text
you lose some httpx features but gain a real Chrome TLS fingerprint. i keep both in the same codebase and route by target.
if it breaks: curl-cffi ships pre-compiled wheels for linux x86_64 and arm64. if you are on an unusual arch, you may need to build from source, which requires libcurl-dev.
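routing by target can be as simple as a hostname lookup. the sketch below is one way to do it, not a description of my production router: HARDENED_HOSTS and pick_backend are hypothetical names, and shop.example.com is a placeholder.

```python
from urllib.parse import urlsplit

# Hosts known to fingerprint-check (hypothetical example entry).
HARDENED_HOSTS = {"shop.example.com"}


def pick_backend(url: str, hardened=HARDENED_HOSTS) -> str:
    """Return 'curl_cffi' for hardened hosts and their subdomains, else 'httpx'."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host == h or host.endswith("." + h) for h in hardened):
        return "curl_cffi"
    return "httpx"


print(pick_backend("https://shop.example.com/p/1"))  # curl_cffi
print(pick_backend("https://httpbin.org/get"))       # httpx
```

the caller then dispatches to fetch() or fetch_with_impersonation() based on the returned label.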
step 7: parse and store results
httpx returns response text and bytes. for HTML, use selectolax (faster than BeautifulSoup) or parsel (xpath + css, familiar if you know Scrapy). for JSON APIs, resp.json() works out of the box.
from selectolax.parser import HTMLParser

def extract_title(html: str) -> str | None:
    tree = HTMLParser(html)
    node = tree.css_first("title")
    return node.text(strip=True) if node else None
store results to postgres, sqlite, or a jsonl file depending on volume. for anything above 10k rows per run, postgres with COPY is faster than individual inserts.
if it breaks: selectolax raises on malformed HTML sometimes. wrap in a try/except and log the URL for manual review.
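for the sqlite option, a batched insert inside one transaction is usually enough at small volumes. this is a minimal sketch with an assumed schema (a pages table keyed by url); the table name and columns are illustrative, not something the scraper above produces by itself.

```python
import sqlite3


def store_rows(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Upsert rows keyed by URL and return the total row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT PRIMARY KEY, status INTEGER, title TEXT)"
    )
    with conn:  # one transaction: commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO pages (url, status, title) "
            "VALUES (:url, :status, :title)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]


conn = sqlite3.connect(":memory:")
rows = [
    {"url": "https://example.com/a", "status": 200, "title": "A"},
    {"url": "https://example.com/b", "status": 200, "title": "B"},
    {"url": "https://example.com/a", "status": 200, "title": "A (refetched)"},
]
print(store_rows(conn, rows))  # 2
```

the third row replaces the first because they share a URL, so re-running a job does not duplicate pages.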
common pitfalls
creating a new client per request. i mentioned this above but it kills you twice: no connection reuse, and you hit the OS file descriptor limit fast on Linux (default 1024 open files per process). one client, reused across all coroutines.
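trusting status codes alone. a 200 that returns a CAPTCHA page is not success. check that resp.text contains what you expect: a simple "expected_keyword" in resp.text check after every fetch catches this early. in production prefer an explicit if over assert, since assert statements are stripped when Python runs with -O.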
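a rough sketch of that check as a reusable function. the BLOCK_MARKERS strings are my guesses at common block-page phrases, not an exhaustive or authoritative list, and a legitimate page that mentions one of them would be rejected, so tune them per target.

```python
# Phrases that often show up on block or challenge pages (assumed; tune per target).
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")


def looks_valid(status: int, body: str, expected_keyword: str) -> bool:
    """True only for a 200 whose body has the expected content and no block marker."""
    if status != 200:
        return False
    lowered = body.lower()
    if any(marker in lowered for marker in BLOCK_MARKERS):
        return False
    return expected_keyword.lower() in lowered


print(looks_valid(200, "<h1>Price: $19.99</h1>", "price"))          # True
print(looks_valid(200, "<h1>Please complete the CAPTCHA</h1>", "price"))  # False
```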
not rate limiting. asyncio.gather with 500 URLs and no throttle will launch all 500 requests at once, capped only by the client's connection pool limit. that burns proxy bandwidth, triggers IP bans, and can DoS a small target. use asyncio-throttle or a semaphore. for most targets i start at 2-5 rps per IP and tune up from there.
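the semaphore variant needs nothing beyond the stdlib. note that it caps how many requests are in flight, not requests per second, so it complements a throttler rather than replacing it. bounded_gather and fake_fetch are illustrative names; fake_fetch stands in for a real request.

```python
import asyncio


async def bounded_gather(factories, limit: int):
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(factory):
        async with sem:
            return await factory()

    return await asyncio.gather(*(run(f) for f in factories))


async def demo(total: int = 20, limit: int = 5):
    active = 0
    peak = 0

    async def fake_fetch():
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # track the highest concurrency observed
        await asyncio.sleep(0.01)
        active -= 1
        return "ok"

    results = await bounded_gather([fake_fetch] * total, limit)
    return len(results), peak


print(asyncio.run(demo()))  # (20, 5)
```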
leaking the async client. always use async with or explicitly call await client.aclose(). unclosed clients leave TCP connections dangling and you will see “too many open files” errors after long runs.
using residential proxies for everything. residential IPs are 5-20x more expensive than datacenter. for targets that do not fingerprint-check, datacenter proxies are fine and cheaper. save residential bandwidth for Cloudflare-protected or high-value targets. see the proxy type comparison on this site for a cost breakdown.
scaling this
10x (hundreds of URLs/hour). one process, one async client, asyncio.gather with a throttler. the setup from this tutorial handles this fine on a single 2-vCPU VPS. monitor memory; if you are storing full response bodies in a list you will OOM around 50k responses.
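one way around the memory ceiling is to write each result to disk as it completes instead of collecting everything in a list. a minimal sketch using asyncio.as_completed and JSONL follows; fake_fetch and the example.com URLs are placeholders for real fetches. the file writes are synchronous, which is fine for small rows but worth revisiting for large bodies.

```python
import asyncio
import json


async def fake_fetch(n: int) -> dict:
    """Placeholder for a real fetch returning a JSON-serialisable row."""
    await asyncio.sleep(0)
    return {"url": f"https://example.com/{n}", "status": 200}


async def stream_to_jsonl(coros, path: str) -> int:
    """Append each result to `path` as soon as it finishes; return rows written."""
    written = 0
    with open(path, "a", encoding="utf-8") as fh:
        for next_done in asyncio.as_completed(coros):
            row = await next_done
            fh.write(json.dumps(row) + "\n")
            written += 1
    return written
```

usage would look like asyncio.run(stream_to_jsonl([fake_fetch(i) for i in range(1000)], "out.jsonl")). rows land in completion order, not input order, so keep the URL in every row.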
100x (thousands of URLs/hour). split your URL list across multiple worker processes using multiprocessing or run multiple VPS instances behind a job queue. redis + rq is my default stack at this tier. each worker gets its own httpx client and proxy credential. bump your proxy plan; 10 GB/month does not last long at this volume.
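to put a number on how fast a plan drains, here is a back-of-envelope calculation. the 200 KB average page size is purely an assumption for illustration; measure your own targets, since compressed JSON endpoints and image-heavy pages differ by orders of magnitude.

```python
def pages_per_plan(plan_gb: float, avg_page_kb: float) -> int:
    """Pages a bandwidth plan covers, using decimal units (1 GB = 1,000,000 KB)."""
    return int(plan_gb * 1_000_000 / avg_page_kb)


def days_of_runway(plan_gb: float, avg_page_kb: float, requests_per_day: int) -> float:
    """How many days the plan lasts at a given daily request volume."""
    return pages_per_plan(plan_gb, avg_page_kb) / requests_per_day


print(pages_per_plan(10, 200))           # 50000
print(days_of_runway(10, 200, 100_000))  # 0.5
```

under that assumption a 10 GB plan covers about 50,000 pages, which is half a day at 100k requests per day.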
1000x (tens of thousands of URLs/hour). you are now looking at distributed crawl infrastructure. each machine runs an async worker pool. a central redis cluster or kafka topic holds the URL frontier. you need proxy failover, not just rotation: if one gateway goes down your entire fleet stalls. most operators at this scale use two or three proxy providers simultaneously and route by target or fail over automatically. revisit your storage layer too: postgres write throughput is a ceiling you will hit. consider columnar stores like ClickHouse for analytics workloads or time-series data. multiaccounting operators running parallel sessions at this scale often maintain separate browser fingerprint pools; see the guides at multiaccountops.com/blog/ for how that workflow differs from pure httpx scraping.
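failover can start out very simple: keep an ordered list of gateways and skip any that failed recently. the class below is a sketch of that idea under my own assumptions (a fixed cooldown, priority by list order); GatewayPool and the gateway labels are hypothetical, and a production version would also want health checks and per-target routing.

```python
import time


class GatewayPool:
    """Pick the first gateway that is not in its post-failure cooldown."""

    def __init__(self, gateways, cooldown: float = 60.0):
        self.gateways = list(gateways)  # list order is priority order
        self.cooldown = cooldown
        self.down_until = {}

    def mark_failed(self, gateway, now=None):
        now = time.monotonic() if now is None else now
        self.down_until[gateway] = now + self.cooldown

    def pick(self, now=None):
        now = time.monotonic() if now is None else now
        for gateway in self.gateways:
            if self.down_until.get(gateway, float("-inf")) <= now:
                return gateway
        raise RuntimeError("all proxy gateways are cooling down")


pool = GatewayPool(["provider-a", "provider-b"], cooldown=60.0)
pool.mark_failed("provider-a", now=0.0)
print(pool.pick(now=1.0))   # provider-b
print(pool.pick(now=61.0))  # provider-a
```

the now parameter exists only to make the behaviour testable; in normal use you would call mark_failed() and pick() without it.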
the httpx documentation on connection pooling covers the tuning parameters in detail. the Python asyncio documentation is the authoritative reference for the concurrency model. for HTTP/2 semantics, the RFC 9113 spec explains why multiplexing matters at scale.
where to go next
- ProxyScraping residential proxy review: a hands-on cost and performance test of the proxy provider i use in this tutorial, with benchmarks against Bright Data and Oxylabs.
- curl-cffi vs httpx for anti-bot bypass: when to swap transports, which targets each handles, and how to route between them in the same pipeline.
for fingerprint-hardened targets where httpx alone is not enough, antidetectreview.org/blog/ covers browser automation tools that complement the httpx stack when you need full session simulation.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.