The 2026 Nodriver guide for production scraping
Most scraping libraries that were “undetectable” in 2023 are now fingerprinted at the TCP handshake level before your first request lands. Playwright gets caught on TLS fingerprints. Selenium leaves DOM artifacts even when you patch them. Puppeteer-extra with stealth plugin still leaks via timing signatures on high-value targets like LinkedIn, Zillow, or any site running Cloudflare’s bot score tier.
Nodriver, built by ultrafunkamsterdam and maintained on GitHub, takes a different approach. Instead of patching Chrome, it drives a real, unmodified Chrome binary over the Chrome DevTools Protocol directly, bypassing the WebDriver interface entirely. No navigator.webdriver, no CDP command artifacts in the page’s JS context, no chromedriver process to fingerprint. This matters in 2026 because bot detection has moved well past user-agent checks.
This guide is for operators already running scrapers in production, or developers who need to graduate from basic HTTP scrapers to browser automation for JavaScript-heavy targets. By the end you will have a working Nodriver setup with proxy rotation, session management, and a deployment pattern that scales from a single VPS to a distributed fleet.
what you need
- Python 3.11+: Nodriver’s async API relies on asyncio features stabilised in 3.11
- Chrome or Chromium 120+: Nodriver itself downloads a matching Chromium build on first run, or you point it at an existing binary
- A Linux VPS or bare metal box: Ubuntu 22.04 LTS works well; avoid Windows for production, where process management is messier
- Proxies: rotating residential or ISP proxies. Datacenter proxies will get flagged on most Cloudflare-protected targets. Budget roughly $3-8/GB for residential; ProxyScraping has ISP proxies worth testing for mid-tier targets
- nodriver Python package: currently 0.36 on PyPI as of May 2026. Install via pip
- Virtual display: xvfb or Xvfb for headless operation on servers without a display server
- RAM: Chrome is heavy. Budget 300-500 MB per concurrent browser instance. A box running 20 parallel sessions needs at least 12 GB RAM
step by step
1. install and verify nodriver
pip install nodriver
# verify
python -c "import nodriver; print(nodriver.__version__)"
Nodriver pulls a Chromium build on first run if you do not specify a browser executable. On a fresh VPS this download is about 200 MB. If you want to pin a specific Chrome binary:
import nodriver as uc
browser = await uc.start(browser_executable_path="/usr/bin/google-chrome-stable")
if it breaks: If Chrome fails to launch with a sandbox error on Linux, run it as a non-root user (preferred) or add --no-sandbox to the browser args. Chrome refuses to start as root with the sandbox enabled, so running as root forces --no-sandbox; if you must do that in production, understand the security implications.
2. set up a virtual display for headless servers
Chrome in nodriver is not “headless” in the traditional --headless sense; it renders to a real display. On a server without one, you need Xvfb.
sudo apt-get install -y xvfb
Xvfb :99 -screen 0 1920x1080x24 &
export DISPLAY=:99
For a production setup, wrap this in a systemd unit so the display comes up at boot:
[Unit]
Description=Virtual framebuffer for Chrome scraping
[Service]
ExecStart=/usr/bin/Xvfb :99 -screen 0 1920x1080x24
Restart=always
[Install]
WantedBy=multi-user.target
if it breaks: If you see cannot connect to X server, check $DISPLAY is set and Xvfb is running. ps aux | grep Xvfb confirms it.
3. write your first nodriver script
import asyncio
import nodriver as uc


async def scrape(url: str) -> str:
    browser = await uc.start()
    try:
        page = await browser.get(url)
        await page.sleep(2)  # let JS render
        return await page.get_content()
    finally:
        browser.stop()  # always release the Chrome process


if __name__ == "__main__":
    html = asyncio.run(scrape("https://example.com"))
    print(html[:500])
Run it. You should see HTML output. The key thing nodriver does here is launch Chrome with a real user profile context and no WebDriver flags set in the JS environment.
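if it breaks: ModuleNotFoundError means your pip install is in the wrong virtualenv. TimeoutError on the get() call usually means the page is actively blocking or your network is slow. To bound navigation time, wrap the call: asyncio.wait_for(browser.get(url), timeout=30). get() itself does not appear to accept a timeout argument.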
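One library-independent way to cap how long a navigation may take is asyncio.wait_for, which works on any awaitable. The sketch below is an illustration only: with_timeout and slow_page_load are hypothetical names, and slow_page_load merely sleeps in place of a real browser.get(url) call.

```python
import asyncio


async def with_timeout(awaitable, seconds: float, default=None):
    """Await `awaitable`; return `default` if it runs longer than `seconds`."""
    try:
        return await asyncio.wait_for(awaitable, timeout=seconds)
    except asyncio.TimeoutError:
        return default


async def slow_page_load(delay: float) -> str:
    # Hypothetical stand-in for `browser.get(url)`; it only sleeps.
    await asyncio.sleep(delay)
    return "<html>loaded</html>"


async def demo() -> tuple:
    fast = await with_timeout(slow_page_load(0.01), seconds=1.0)
    slow = await with_timeout(slow_page_load(1.0), seconds=0.05, default="timed out")
    return fast, slow
```

Returning a default instead of raising keeps a worker loop alive when one page hangs; re-raise instead if you would rather route the failure to retry logic.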
4. add proxy rotation
Nodriver passes Chrome args at launch, so proxy configuration goes there. The cleanest pattern for rotation is to restart a browser instance per domain or per N requests.
import random
import nodriver as uc

# Placeholder endpoints. Chrome ignores credentials embedded in
# --proxy-server, so these should be IP-allowlisted with your provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]


async def scrape_with_proxy(url: str) -> str:
    proxy = random.choice(PROXIES)
    browser = await uc.start(
        browser_args=[f"--proxy-server={proxy}"]
    )
    try:
        page = await browser.get(url)
        await page.sleep(2)
        return await page.get_content()
    finally:
        browser.stop()
Chrome does not read user:pass credentials from the --proxy-server flag. For authenticated proxies, either allowlist your server’s IP with the provider, run a local forwarding proxy that adds the credentials upstream, or answer the auth challenge over CDP (the Fetch domain’s authRequired event).
if it breaks: If pages load but show “your IP is blocked”, the proxy is being detected. Switch from datacenter to residential proxies. If Chrome throws a proxy auth dialog, the proxy expects credentials that Chrome was not given: check that your IP is allowlisted, and verify the scheme and host format against your proxy provider’s docs.
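The "restart per N requests" policy can be kept separate from the browser code. The following is a minimal sketch, not part of nodriver: ProxyRotator and its requests_per_proxy parameter are hypothetical names. It hands out the current proxy and signals when the caller should stop the browser and start a new one with the next proxy.

```python
from itertools import cycle


class ProxyRotator:
    """Round-robin over proxies, switching after `requests_per_proxy` uses."""

    def __init__(self, proxies: list[str], requests_per_proxy: int = 25):
        if not proxies:
            raise ValueError("at least one proxy is required")
        if requests_per_proxy < 1:
            raise ValueError("requests_per_proxy must be at least 1")
        self._cycle = cycle(proxies)
        self._limit = requests_per_proxy
        self._current = next(self._cycle)
        self._used = 0

    def next_proxy(self) -> tuple[str, bool]:
        """Return (proxy, restart_needed) for the next request."""
        restart = False
        if self._used >= self._limit:
            # Budget for the current proxy is spent: move to the next one
            # and tell the caller to relaunch the browser with it.
            self._current = next(self._cycle)
            self._used = 0
            restart = True
        self._used += 1
        return self._current, restart
```

A worker would call next_proxy() before each page and, when restart_needed is true, stop the running browser and launch a new one with the returned proxy in --proxy-server.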
5. handle anti-bot challenges
Nodriver sidesteps most passive fingerprinting checks because you are running a real Chrome binary. But active challenges like Cloudflare Turnstile or hCaptcha still require either:
- waiting for the challenge to auto-solve (Cloudflare often does this for clean residential IPs)
- using a CAPTCHA solving service like 2captcha or CapMonster
For Cloudflare, the pattern is to wait and poll:
async def wait_for_cf(page, timeout=30):
    # poll roughly once per second until the interstitial title disappears
    for _ in range(timeout):
        title = await page.evaluate("document.title")
        if "Just a moment" not in title:
            return True
        await page.sleep(1)
    return False


page = await browser.get("https://target.com")
solved = await wait_for_cf(page)
if not solved:
    raise RuntimeError("Cloudflare challenge not resolved")
For Turnstile or hCaptcha, the 2captcha API accepts the challenge parameters and returns a token you inject. See 2captcha’s documentation for the current integration spec; the API has been stable since 2022.
if it breaks: If the challenge never resolves with a residential proxy, try a different IP. Cloudflare’s bot score is partly IP reputation-based.
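The wait-and-poll idea generalises to any condition, and a wall-clock deadline keeps the timeout honest even when each check is slow. This is a sketch under assumptions: poll_until is a hypothetical helper, and the demo feeds it a canned sequence of page titles instead of evaluating document.title in a browser.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until(
    check: Callable[[], Awaitable[bool]],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Call `check` every `interval` seconds until it is true or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        if await check():
            return True
        if time.monotonic() + interval > deadline:
            return False  # another sleep would overshoot the deadline
        await asyncio.sleep(interval)


async def demo() -> tuple[bool, bool]:
    # Canned titles standing in for successive `document.title` reads.
    titles = iter(["Just a moment...", "Just a moment...", "Product listing"])

    async def challenge_cleared() -> bool:
        return "Just a moment" not in next(titles)

    async def never_clears() -> bool:
        return False

    cleared = await poll_until(challenge_cleared, timeout=1.0, interval=0.01)
    stuck = await poll_until(never_clears, timeout=0.05, interval=0.01)
    return cleared, stuck
```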
6. extract data and handle navigation
Nodriver exposes CSS selectors and JavaScript evaluation for extraction:
# find an element by CSS selector
# (select() takes a selector; find() searches by text content)
element = await page.select("h1")
text = element.text
# or run JS directly
price = await page.evaluate(
    "document.querySelector('.price').innerText"
)
# click and wait for the new content
button = await page.select("button.load-more")
await button.click()
await page.sleep(1.5)
For pages with dynamic pagination or infinite scroll, combine click() with sleep() and re-query the DOM. Nodriver has no direct waitForNavigation equivalent, so a short sleep followed by a fresh query is the practical approach for most production use cases. select() also keeps retrying until its timeout expires, which covers elements that appear late.
if it breaks: A timeout from the element lookup (or a None result, depending on the nodriver version) means the selector did not match. Use the browser’s DevTools locally to verify your selectors before running headless.
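For "load more" buttons and infinite scroll, the click, sleep, re-query cycle needs a stopping rule. One option, sketched here with hypothetical names, is to stop once the item count no longer grows. load_more and count_items are callbacks you would supply (for example a button click and a JS evaluation that counts result nodes); the demo uses fakes so it runs without a browser.

```python
import asyncio
from typing import Awaitable, Callable


async def load_until_stable(
    load_more: Callable[[], Awaitable[None]],
    count_items: Callable[[], Awaitable[int]],
    pause: float = 1.5,
    max_rounds: int = 50,
) -> int:
    """Trigger `load_more` until the item count stops growing; return the final count."""
    seen = await count_items()
    for _ in range(max_rounds):
        await load_more()
        await asyncio.sleep(pause)  # give the page time to render new items
        now = await count_items()
        if now <= seen:
            break  # nothing new appeared, assume the list is exhausted
        seen = now
    return seen


async def demo() -> int:
    # Fake page: starts with 10 items, each click adds 10, capped at 30.
    state = {"items": 10}

    async def fake_click() -> None:
        state["items"] = min(state["items"] + 10, 30)

    async def fake_count() -> int:
        return state["items"]

    return await load_until_stable(fake_click, fake_count, pause=0.0)
```

The max_rounds cap guards against feeds that never end.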
7. manage sessions and cookies
For sites that require login, persist the Chrome profile directory between sessions:
browser = await uc.start(
    user_data_dir="/var/scrapers/profiles/account_001"
)
Chrome writes cookies, local storage, and cached credentials to this directory. On the next launch, the session resumes. This is more reliable than manually injecting cookies because a login often depends on more than cookies, and which storage a site relies on varies from site to site.
if it breaks: If the profile directory gets corrupted (this happens on hard kills), delete the SingletonLock file inside the profile directory and retry. If the session is expired, you need to re-authenticate.
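Removing the stale lock can be scripted so a supervisor runs it before each launch. This is a minimal sketch: clear_stale_lock is a hypothetical helper, and it assumes no Chrome process is currently using the profile, since deleting the lock under a live browser risks real corruption.

```python
from pathlib import Path


def clear_stale_lock(profile_dir: str) -> bool:
    """Delete SingletonLock from a profile directory; return True if one was removed."""
    lock = Path(profile_dir) / "SingletonLock"
    # On Linux the lock is usually a symlink whose target does not exist,
    # so exists() alone can report False; check is_symlink() as well.
    if lock.is_symlink() or lock.exists():
        lock.unlink()
        return True
    return False
```

Calling clear_stale_lock("/var/scrapers/profiles/account_001") right before uc.start() for that profile keeps a hard kill from blocking the next run.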
8. wrap everything in a job queue
For anything beyond 5-10 pages, you need a queue. The simplest production pattern is a Redis list with a worker pool:
import asyncio
import nodriver as uc
import redis.asyncio as aioredis


async def worker(queue: aioredis.Redis, proxy: str):
    while True:
        url = await queue.lpop("scrape_queue")
        if not url:
            break  # queue drained
        browser = await uc.start(browser_args=[f"--proxy-server={proxy}"])
        try:
            page = await browser.get(url.decode())
            await page.sleep(2)
            content = await page.get_content()
            await queue.rpush("results", content)
        finally:
            browser.stop()  # release Chrome even if the page fails


async def main():
    r = aioredis.from_url("redis://localhost")
    # placeholder endpoints, one worker per proxy; Chrome ignores
    # credentials embedded in --proxy-server
    proxies = ["http://host1:port", "http://host2:port"]
    tasks = [worker(r, proxy) for proxy in proxies]
    await asyncio.gather(*tasks)
This is a minimal example. In production you will want error handling, retry logic, and a results schema, but the shape is correct.
if it breaks: Redis ConnectionRefusedError means Redis is not running or you are connecting to the wrong host/port.
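The retry logic mentioned above can live in a small wrapper around each job. This is a sketch, not the article's implementation: with_retries and its parameters are hypothetical, and the demo job fails on purpose twice before succeeding.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    job: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Run `job`, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await job()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


async def demo() -> tuple[str, int]:
    calls = 0

    async def flaky_scrape() -> str:
        nonlocal calls
        calls += 1
        if calls < 3:
            raise RuntimeError("simulated navigation failure")
        return "<html>ok</html>"

    html = await with_retries(flaky_scrape, attempts=4, base_delay=0.01)
    return html, calls
```

In the worker, the job would be a closure that launches the browser, loads one URL, and returns the content, so every retry starts from a fresh Chrome instance.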
common pitfalls
reusing a single browser instance for too many pages. Memory leaks build up. Restart the browser every 50-100 pages or every hour, whichever comes first.
ignoring Chrome process cleanup. If your script crashes, Chrome processes keep running. Add a try/finally block that calls browser.stop(), and run a cron job that kills orphaned Chrome processes: pkill -f "google-chrome" is blunt but effective during debugging.
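The try/finally advice can be packaged once as an async context manager so no call site forgets it. In this sketch managed_browser is a hypothetical helper, and FakeBrowser stands in for a real browser object so the example runs without Chrome. It assumes stop() is a plain synchronous method, as in the scripts in this guide.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Awaitable, Callable


@asynccontextmanager
async def managed_browser(start: Callable[[], Awaitable[Any]]) -> AsyncIterator[Any]:
    """Start a browser and guarantee stop() runs, even if the body raises."""
    browser = await start()
    try:
        yield browser
    finally:
        browser.stop()


class FakeBrowser:
    """Stand-in for a real browser object; only records that stop() ran."""

    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


async def demo() -> bool:
    fake = FakeBrowser()

    async def start() -> FakeBrowser:
        return fake

    try:
        async with managed_browser(start):
            raise RuntimeError("simulated crash mid-scrape")
    except RuntimeError:
        pass
    return fake.stopped
```

With nodriver this would presumably be used as async with managed_browser(uc.start) as browser. A SIGKILL still bypasses finally blocks, which is why the cron cleanup remains useful.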
using datacenter proxies on Cloudflare-protected targets. Cloudflare’s risk scoring in 2026 flags entire datacenter ASNs. For mid-tier targets ISP proxies work; for high-value targets, residential only.
parsing HTML immediately after get(). JavaScript-heavy pages render asynchronously. A 1-2 second sleep after navigation is the practical default. For more precise control, evaluate a JavaScript expression that returns true when your target element exists.
running too many parallel sessions per machine. Chrome is 300-500 MB per instance. 20 sessions on an 8 GB machine will swap. Keep session count at (RAM_GB - 2) / 0.5 as a rough ceiling. For multi-account operations at scale, the patterns at multiaccountops.com/blog/ cover machine allocation strategies that apply directly here.
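The ceiling formula is easy to encode so provisioning scripts can apply it. This is a small sketch: max_sessions is a hypothetical helper, with the 2 GB reserve and 0.5 GB per session taken from the rule of thumb above.

```python
import math


def max_sessions(ram_gb: float, reserved_gb: float = 2.0, per_session_gb: float = 0.5) -> int:
    """Rough ceiling on concurrent Chrome sessions: (RAM_GB - 2) / 0.5."""
    return max(0, math.floor((ram_gb - reserved_gb) / per_session_gb))


print(max_sessions(8))   # 12
print(max_sessions(12))  # 20
```

By this estimate an 8 GB machine tops out at 12 sessions, which is why 20 sessions on it will swap.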
scaling this
10x (50-100 pages/hour): A single 8-core, 16 GB VPS handles this. Run 10-15 concurrent browser sessions with the Redis queue pattern above. A single rotating residential proxy pool with 5-10 exit IPs is sufficient. ProxyScraping’s API makes rotation straightforward.
100x (500-1000 pages/hour): You need multiple machines. Introduce a central job coordinator, shared Redis (Redis Cloud free tier covers this volume), and separate your proxy pool by target domain to avoid IP collisions. At this scale, restart Chrome between every job to keep memory clean. Also start tracking per-IP ban rates.
1000x (5000+ pages/hour): You are now running a distributed browser fleet. Each machine runs 15-20 Chrome instances, you have 5-10 machines, and your proxy spend is $200-500/month minimum for residential. At this scale, build a proxy health checker that removes banned IPs from rotation automatically. Consider a dedicated proxy manager like Bright Data’s Proxy Manager (open source on GitHub) to handle rotation at the fleet level. Your bottleneck shifts from scraping to parsing and storage. Pipeline results into a message queue like Kafka or SQS rather than Redis.
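A proxy health checker can start as a per-IP counter with a ban-rate threshold. The sketch below is one possible shape, not a reference design: ProxyHealth, max_ban_rate, and min_samples are hypothetical names, and what counts as "banned" (a 403, a challenge page that never clears) is left to the caller.

```python
from collections import defaultdict


class ProxyHealth:
    """Track per-proxy ban rates and filter out proxies above a threshold."""

    def __init__(self, max_ban_rate: float = 0.3, min_samples: int = 10):
        self.max_ban_rate = max_ban_rate
        self.min_samples = min_samples
        self._total: dict[str, int] = defaultdict(int)
        self._banned: dict[str, int] = defaultdict(int)

    def record(self, proxy: str, banned: bool) -> None:
        """Record the outcome of one request made through `proxy`."""
        self._total[proxy] += 1
        if banned:
            self._banned[proxy] += 1

    def ban_rate(self, proxy: str) -> float:
        total = self._total.get(proxy, 0)
        return self._banned.get(proxy, 0) / total if total else 0.0

    def healthy(self, proxies: list[str]) -> list[str]:
        """Keep proxies with too few samples to judge or an acceptable ban rate."""
        return [
            p for p in proxies
            if self._total.get(p, 0) < self.min_samples
            or self.ban_rate(p) <= self.max_ban_rate
        ]
```

Workers call record() after each request, and the coordinator calls healthy() when handing out proxies. The min_samples floor keeps a single early failure from evicting a good IP.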
where to go next
- Proxyscraping residential proxy review and setup guide: covers proxy pool configuration for browser automation specifically
- Fingerprint evasion in 2026: TLS, HTTP/2, and browser entropy. A deeper dive into the detection surface nodriver reduces but does not eliminate
- Running parallel browser sessions with asyncio and Redis queues: extends the queue pattern above into a production-grade worker pool
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.