The 2026 Selenium guide for production scraping

Selenium still works. I know that’s not the flashiest way to start, but after years of operators abandoning it for Playwright or Puppeteer, a lot of people are surprised to discover that Selenium 4.x with undetected-chromedriver holds up well against modern bot detection stacks. The tools around it have matured. The community is large. And for teams already running Python pipelines, it fits naturally.

That said, using Selenium the way most tutorials teach it (spinning up a plain Chrome instance, making requests, parsing HTML) will get you blocked within minutes on any serious target. This guide covers how I actually run Selenium in production: proxy rotation, fingerprint hardening, session management, and the infrastructure decisions that matter when you’re running hundreds of concurrent browsers.

This is for operators who have some Python experience and want to run scrapers that last more than a day. By the end you’ll have a working scraper template, an understanding of where Selenium falls apart, and a clear picture of what it costs to scale.

what you need

  • Python 3.11 or 3.12 (3.13 has minor compatibility issues with some Selenium plugins as of May 2026)
  • selenium 4.20+ and undetected-chromedriver 3.5+ (install via pip)
  • Chrome or Chromium 124+ installed locally or via Docker
  • A rotating proxy provider. I use ProxyScraping for residential proxies. at time of writing, residential bandwidth runs $3-8/GB depending on plan tier. datacenter proxies are cheaper but get blocked faster on most targets
  • A Linux server or VPS with at least 2 vCPUs and 4GB RAM per 5 concurrent browser instances. DigitalOcean, Hetzner, or a bare metal box all work
  • Optional but recommended: BrowserMob Proxy or mitmproxy if you need to inspect network traffic during development
  • beautifulsoup4 and lxml for parsing, sqlalchemy or psycopg2 for storage

Estimated monthly cost for a 50-thread operation: $80-150 in proxies, $40-80 in server compute.

step by step

step 1: install and verify your environment

pip install selenium==4.20.0 undetected-chromedriver==3.5.5 beautifulsoup4 lxml requests

Verify Chrome is accessible:

google-chrome --version
# or
chromium-browser --version

If you’re on a headless server without a display, install xvfb:

sudo apt-get install xvfb

You won’t need it if you run headless mode (which we’ll configure), but having it available saves debugging time.

if it breaks: if undetected-chromedriver fails to patch Chrome, check that your Chrome version matches the driver version it downloads. run uc.Chrome(version_main=124) to pin the version explicitly.
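
One way to avoid hard-coding the pinned version is to read it from the installed browser. This is a minimal sketch, assuming the `--version` output has the usual `Google Chrome 124.0.x.y` shape. `chrome_major_version` and `installed_chrome_major` are hypothetical helper names, not part of undetected-chromedriver.

```python
import re
import subprocess


def chrome_major_version(version_output: str) -> int:
    """Extract the major version from text like 'Google Chrome 124.0.6367.91'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in: {version_output!r}")
    return int(match.group(1))


def installed_chrome_major(binary: str = "google-chrome") -> int:
    """Run `<binary> --version` and return the major version number."""
    output = subprocess.run(
        [binary, "--version"], capture_output=True, text=True, check=True
    ).stdout
    return chrome_major_version(output)
```

The result can then be passed along as uc.Chrome(version_main=installed_chrome_major()), so the pin follows whatever Chrome is actually on the box.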

step 2: create a hardened browser factory

Don’t instantiate browsers inline. use a factory function so you can swap configurations without touching your scraping logic.

import undetected_chromedriver as uc

def make_driver(proxy: str | None = None, headless: bool = True) -> uc.Chrome:
    options = uc.ChromeOptions()

    if headless:
        options.add_argument("--headless=new")  # new headless mode, less detectable

    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--window-size=1920,1080")
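    options.add_argument("--lang=en-US")  # takes a locale; q-values belong in the Accept-Language header

    if proxy:
        # Chrome ignores credentials embedded in --proxy-server, so pass scheme://host:port
        # here and authenticate by IP allowlist or a proxy-auth extension
        options.add_argument(f"--proxy-server={proxy}")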

    driver = uc.Chrome(options=options, use_subprocess=True)
    driver.set_page_load_timeout(30)
    driver.implicitly_wait(0)  # never use implicit waits in production

    return driver

The Selenium 4 documentation covers all ChromeOptions in detail. the --headless=new flag matters: old headless mode was a separate browser implementation, and its behavior differs from regular Chrome in ways that detection systems flag immediately.

if it breaks: on some VPS providers, --no-sandbox is required because the default Chrome sandbox requires kernel features that aren’t available. if you see “DevToolsActivePort file doesn’t exist”, add --remote-debugging-port=9222.

step 3: integrate proxy rotation

A static proxy is a single point of failure. you need rotation. here’s a simple approach using a proxy list:

import random
import time
from contextlib import contextmanager

PROXY_LIST = [
    "http://user:pass@proxy1.example.com:8080",  # placeholder host and credentials
    "http://user:pass@proxy2.example.com:8080",  # placeholder host and credentials
    # ... pulled from your proxy provider's API
]

@contextmanager
def managed_driver(proxy_list: list):
    proxy = random.choice(proxy_list)
    driver = make_driver(proxy=proxy)
    try:
        yield driver
    finally:
        driver.quit()

For ProxyScraping’s residential pool, you get a rotating endpoint that handles rotation server-side, which simplifies this considerably. check your provider’s docs for the sticky session vs. rotating session tradeoff: sticky sessions are useful when you need to stay on the same IP across a multi-page flow (login, then scrape), rotating is better for independent requests.

if it breaks: if proxies time out frequently, filter your list by response time before adding them to the pool. a dead proxy in your list will cause random 30-second hangs.
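
The filtering step mentioned above could look roughly like this. It is a sketch under assumptions: `probe_proxy` and `filter_proxies` are hypothetical names, https://example.com stands in for whatever lightweight URL you choose to test against, and the 3-second cutoff is arbitrary.

```python
import time
import urllib.request
from typing import Callable, Iterable


def probe_proxy(proxy: str, test_url: str = "https://example.com", timeout: float = 5.0) -> float:
    """Return the seconds taken to fetch test_url through proxy; raises on failure."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    start = time.monotonic()
    with opener.open(test_url, timeout=timeout) as response:
        response.read(1)  # one byte is enough to prove the tunnel works
    return time.monotonic() - start


def filter_proxies(
    proxies: Iterable[str],
    max_seconds: float = 3.0,
    probe: Callable[[str], float] = probe_proxy,
) -> list[str]:
    """Keep only proxies whose probe succeeds within max_seconds."""
    healthy = []
    for proxy in proxies:
        try:
            elapsed = probe(proxy)
        except Exception:
            continue  # dead or misconfigured proxy: drop it
        if elapsed <= max_seconds:
            healthy.append(proxy)
    return healthy
```

Running filter_proxies(PROXY_LIST) before each batch, rather than once at startup, keeps proxies that die mid-run from lingering in the pool. the probe is injectable so you can swap in your provider's health-check API if it has one.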

step 4: implement explicit waits correctly

This is where most Selenium tutorials go wrong. implicit waits interact badly with explicit waits and create unpredictable timing. use only explicit waits.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def wait_for_element(driver, selector: str, timeout: int = 10):
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
    except TimeoutException:
        return None

def wait_for_url_change(driver, original_url: str, timeout: int = 15):
    WebDriverWait(driver, timeout).until(
        lambda d: d.current_url != original_url
    )

The W3C WebDriver specification defines what “element present” means at the protocol level. understanding this helps when you’re debugging whether a wait condition is semantically correct or just accidentally working.

if it breaks: if presence_of_element_located returns too early (the DOM node exists but content hasn’t loaded), switch to visibility_of_element_located or write a custom expected condition that checks the element’s text content.
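
A custom condition of that kind is just a callable that takes the driver. The sketch below assumes you only care that the element has non-blank text. `element_has_text` is a hypothetical helper, not something Selenium ships.

```python
from typing import Callable, Tuple

# A locator is a (strategy, value) pair such as (By.CSS_SELECTOR, ".price");
# By.CSS_SELECTOR is the plain string "css selector".
Locator = Tuple[str, str]


def element_has_text(locator: Locator) -> Callable:
    """Custom wait condition: truthy only once the element has non-blank text."""

    def condition(driver):
        # find_element raises NoSuchElementException while the node is missing;
        # WebDriverWait ignores that exception by default and polls again.
        element = driver.find_element(*locator)
        return element if element.text.strip() else False

    return condition
```

Usage mirrors the built-in conditions: WebDriverWait(driver, 10).until(element_has_text((By.CSS_SELECTOR, ".price"))). if the page re-renders the node between polls you may also want to pass ignored_exceptions=[StaleElementReferenceException] to WebDriverWait.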

step 5: handle JavaScript-heavy pages

For pages that load data via XHR after initial render, you have two options: wait for the DOM to settle, or intercept network requests.

import json
import time

def get_json_from_xhr(driver, url: str, trigger_selector: str = None) -> list[str]:
    """Load url and return the URLs of API responses seen in the performance log."""
    # enable network events via the Chrome DevTools Protocol
    driver.execute_cdp_cmd("Network.enable", {})

    # navigate to page
    driver.get(url)

    if trigger_selector:
        el = wait_for_element(driver, trigger_selector)
        if el:
            el.click()

    time.sleep(2)  # give XHR time to fire

    # scan the performance log for network responses
    api_urls = []
    for entry in driver.get_log("performance"):
        message = json.loads(entry["message"])["message"]
        if message["method"] == "Network.responseReceived":
            response_url = message["params"]["response"]["url"]
            if "api" in response_url:
                api_urls.append(response_url)
    return api_urls

This uses the Chrome DevTools Protocol directly, which Selenium 4 exposes via execute_cdp_cmd. it’s a more surgical approach than trying to parse rendered HTML for data that was clearly meant to come from an API.

if it breaks: CDP performance logs must be enabled in Chrome options: options.set_capability("goog:loggingPrefs", {"performance": "ALL"}).
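
To get from a logged response to its actual payload, you need the request id that Chrome attaches to each Network.responseReceived event. The sketch below assumes performance logging is enabled as described. `api_responses` and `fetch_body` are hypothetical helper names, and the "api" substring match is only a placeholder filter.

```python
import json


def api_responses(log_entries: list[dict], needle: str = "api") -> list[dict]:
    """Return request id, URL, and status for responses whose URL contains needle."""
    found = []
    for entry in log_entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue
        params = message["params"]
        response = params["response"]
        if needle in response["url"]:
            found.append({
                "request_id": params["requestId"],
                "url": response["url"],
                "status": response["status"],
            })
    return found


def fetch_body(driver, request_id: str) -> str:
    """Ask Chrome for a captured response body via CDP."""
    result = driver.execute_cdp_cmd("Network.getResponseBody", {"requestId": request_id})
    return result["body"]  # base64 text when result["base64Encoded"] is true
```

Call api_responses(driver.get_log("performance")) and then fetch_body for each hit while the page is still open. Chrome can discard bodies after navigation, so expect fetch_body to fail occasionally and wrap it accordingly.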

step 6: manage cookies and sessions

For sites requiring login:

import pickle
import os

def save_cookies(driver, path: str):
    with open(path, "wb") as f:
        pickle.dump(driver.get_cookies(), f)

def load_cookies(driver, path: str, domain: str):
    driver.get(f"https://{domain}")  # must navigate to domain first
    if os.path.exists(path):
        with open(path, "rb") as f:
            cookies = pickle.load(f)
        for cookie in cookies:
            driver.add_cookie(cookie)
        driver.refresh()

Keep sessions alive by refreshing every 15-20 minutes if you’re doing long-running scrapes. some sites expire sessions on inactivity faster than you’d expect.

if it breaks: cookie domains must match exactly. a cookie saved from www.example.com won’t load if you navigate to example.com first. strip www. consistently.

step 7: parse and store results

from bs4 import BeautifulSoup
import psycopg2

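def parse_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    results = []
    for item in soup.select(".product-card"):
        title = item.select_one("h2")
        price = item.select_one(".price")
        link = item.select_one("a[href]")
        if title is None or price is None or link is None:
            continue  # skip incomplete cards instead of raising AttributeError
        results.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True),
            "url": link["href"],
        })
    return results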

def store_results(conn, results: list[dict]):
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO scraped_items (title, price, url, scraped_at) "
            "VALUES (%(title)s, %(price)s, %(url)s, NOW()) "
            "ON CONFLICT (url) DO UPDATE SET price = EXCLUDED.price, scraped_at = NOW()",
            results
        )
    conn.commit()

if it breaks: if selectors break after a site update, log the raw HTML alongside your parse errors so you can diagnose what changed without re-running the scraper.
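
One simple way to keep that raw HTML is to write a snapshot file whenever a parse returns nothing or raises. A minimal sketch, where `save_html_snapshot` and the `snapshots` directory name are assumptions of mine:

```python
import hashlib
import time
from pathlib import Path


def save_html_snapshot(html: str, url: str, directory: str = "snapshots") -> Path:
    """Write raw HTML to disk so a failed parse can be inspected later."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    # short URL hash keeps filenames filesystem-safe and groups repeats of one page
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = folder / f"{int(time.time())}_{url_hash}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Log the returned path next to the error message. on a busy scraper you will want to cap or rotate this directory, since snapshots of full pages add up quickly.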

step 8: build a retry loop

import logging

logger = logging.getLogger(__name__)

def scrape_with_retry(url: str, proxy_list: list, max_attempts: int = 3) -> list[dict]:
    for attempt in range(max_attempts):
        try:
            with managed_driver(proxy_list) as driver:
                driver.get(url)
                wait_for_element(driver, "body", timeout=15)
                html = driver.page_source
                return parse_page(html)
        except Exception as e:
            logger.warning(f"attempt {attempt + 1} failed for {url}: {e}")
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # exponential backoff, skipped after the final attempt
    logger.error(f"all attempts failed for {url}")
    return []

if it breaks: if you see the same error on every attempt, the problem is structural (wrong selector, blocked IP range, site change), not transient. don’t burn retries on something that needs investigation.
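
If you collect the exceptions from each attempt, spotting the "same error every time" case can be automated. A rough sketch, with `looks_structural` as a hypothetical helper name:

```python
def looks_structural(errors: list[Exception], min_attempts: int = 3) -> bool:
    """True when every attempt failed the same way, which suggests a non-transient cause."""
    if len(errors) < min_attempts:
        return False
    signatures = {(type(e).__name__, str(e)) for e in errors}
    return len(signatures) == 1
```

Selenium exception messages sometimes embed session-specific details, so comparing only type(e).__name__ may be the more practical signature in real runs. either way, a True result is a cue to park the URL for investigation rather than requeue it.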

common pitfalls

leaving implicit waits on. the default Selenium setup uses a 0-second implicit wait. many tutorials set it to something like 10 seconds “for safety.” this interacts badly with explicit waits and creates unpredictable behavior. set it to 0 and use explicit waits everywhere.

running one proxy per scraper process. if your scraper and its proxy die together, you lose the session. decouple them. a proxy pool manager (Squid, or a provider’s rotating endpoint) should be independent of your scraper process.

not rotating user agents and accept-language headers. browser fingerprinting looks at more than just IP. a fleet of browsers all reporting the same user agent string is a trivial signal. use a library like fake-useragent or maintain your own list of recent Chrome UA strings.

scraping without rate limits. hitting a site at full speed isn’t just rude, it’s a fast path to a subnet ban. add random delays of 1-4 seconds between requests. mimic human timing patterns.
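
A randomized pause along those lines takes only a few lines. This is a sketch, with `polite_pause` as a hypothetical helper and the 1-4 second defaults taken from the range above:

```python
import random
import time
from typing import Callable


def polite_pause(
    min_seconds: float = 1.0,
    max_seconds: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Call it between page loads inside each worker. a uniform distribution is the simplest choice, not necessarily the most human-looking one.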

ignoring memory leaks in long-running processes. each Chrome instance consumes memory, and if you’re not explicitly calling driver.quit(), processes accumulate. use context managers (as shown above) or add a watchdog that kills zombie Chrome processes.

For more on running multiple browser identities without overlap, the antidetect browser comparison at antidetectreview.org/blog/ is a useful reference if you’re considering moving beyond raw Selenium to a more managed fingerprint environment.

scaling this

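10x (50 concurrent browsers): run multiple threads with a ThreadPoolExecutor. give each thread its own driver instance. use a proxy pool with at least 3x as many proxies as threads to allow rotation. one server with 8 vCPUs and 16GB RAM handles this on light pages; the sizing rule from “what you need” (2 vCPUs and 4GB RAM per 5 browsers) works out to 20 vCPUs and 40GB RAM, so budget closer to that for heavy targets.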

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(scrape_with_retry, url, PROXY_LIST) for url in url_list]
    results = [f.result() for f in futures]

100x (500 concurrent browsers): you need horizontal scaling. containerize your scraper with Docker, use a job queue (Redis + RQ or Celery), and distribute workers across multiple servers. at this scale, a centralized proxy management layer becomes important. look at Squid as a proxy cache/router, or use a provider with an enterprise API endpoint.

1000x: at this point Selenium’s per-process overhead becomes a real cost. each Chrome instance is 150-300MB of RAM. 1000 concurrent browsers means 150-300GB of RAM across your fleet. you’ll likely need to mix Selenium with lighter tools: use Selenium only for pages that require actual browser execution, and route simpler requests through httpx or curl_cffi directly. orchestration via Kubernetes makes sense here. budget $2,000-5,000/month in infrastructure at this scale.
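
The RAM arithmetic above is easy to rerun for your own fleet size. A small sketch, using the 150-300MB per Chrome instance figure from this section; `fleet_ram_gb` and `servers_needed` are hypothetical helper names:

```python
def fleet_ram_gb(
    browsers: int, mb_per_browser: tuple[int, int] = (150, 300)
) -> tuple[float, float]:
    """Low and high RAM estimates in GB (1 GB = 1000 MB, matching the figures above)."""
    low, high = mb_per_browser
    return browsers * low / 1000, browsers * high / 1000


def servers_needed(browsers: int, browsers_per_server: int) -> int:
    """Ceiling division: how many servers of a given capacity cover the fleet."""
    return -(-browsers // browsers_per_server)
```

fleet_ram_gb(1000) gives the 150-300GB range quoted above. note that this counts browser processes only: the 4GB per 5 browsers rule from “what you need” is 800MB per browser, a more conservative number to plan servers around.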


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
