The 2026 Selenium guide for production scraping
Selenium still works. I know that’s not the flashiest way to start, but after years of operators abandoning it for Playwright or Puppeteer, a lot of people are surprised to discover that Selenium 4.x with undetected-chromedriver holds up well against modern bot detection stacks. The tools around it have matured. The community is large. And for teams already running Python pipelines, it fits naturally.
That said, using Selenium the way most tutorials teach it (spinning up a plain Chrome instance, making requests, parsing HTML) will get you blocked within minutes on any serious target. This guide covers how I actually run Selenium in production: proxy rotation, fingerprint hardening, session management, and the infrastructure decisions that matter when you’re running hundreds of concurrent browsers.
This is for operators who have some Python experience and want to run scrapers that last more than a day. By the end you’ll have a working scraper template, an understanding of where Selenium falls apart, and a clear picture of what it costs to scale.
what you need
- Python 3.11 or 3.12 (3.13 has minor compatibility issues with some Selenium plugins as of May 2026)
- selenium 4.20+ and undetected-chromedriver 3.5+ (install via pip)
- Chrome or Chromium 124+ installed locally or via Docker
- A rotating proxy provider. I use ProxyScraping for residential proxies. at time of writing, residential bandwidth runs $3-8/GB depending on plan tier. datacenter proxies are cheaper but get blocked faster on most targets
- A Linux server or VPS with at least 2 vCPUs and 4GB RAM per 5 concurrent browser instances. DigitalOcean, Hetzner, or a bare metal box all work
- Optional but recommended: BrowserMob Proxy or mitmproxy if you need to inspect network traffic during development
- beautifulsoup4 and lxml for parsing, sqlalchemy or psycopg2 for storage
Estimated monthly cost for a 50-thread operation: $80-150 in proxies, $40-80 in server compute.
step by step
step 1: install and verify your environment
```bash
pip install selenium==4.20.0 undetected-chromedriver==3.5.5 beautifulsoup4 lxml requests
```
Verify Chrome is accessible:
```bash
google-chrome --version
# or
chromium-browser --version
```
If you’re on a headless server without a display, install xvfb:
```bash
sudo apt-get install xvfb
```
You won’t need it if you run headless mode (which we’ll configure), but having it available saves debugging time.
if it breaks: if undetected-chromedriver fails to patch Chrome, check that your Chrome version matches the driver version it downloads. run uc.Chrome(version_main=124) to pin the version explicitly.
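If you would rather not hardcode the pin, the major version can be read from the `--version` output. This is a minimal sketch, assuming the output looks like `Google Chrome 124.0.6367.91` or `Chromium 124.0.6367.91`. `chrome_major_version` is a hypothetical helper name, not part of undetected-chromedriver.

```python
import re


def chrome_major_version(version_output: str) -> int:
    """Extract the major version from `google-chrome --version` style output."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no Chrome version found in: {version_output!r}")
    return int(match.group(1))


# in practice, feed it the stdout of the --version command from step 1
major = chrome_major_version("Google Chrome 124.0.6367.91")
# then pin the driver with: uc.Chrome(version_main=major)
```

Raising on an unparseable string is deliberate here: a silent fallback would hide exactly the version mismatch you are trying to catch.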
step 2: create a hardened browser factory
Don’t instantiate browsers inline. use a factory function so you can swap configurations without touching your scraping logic.
```python
import undetected_chromedriver as uc


def make_driver(proxy: str | None = None, headless: bool = True) -> uc.Chrome:
    """Build a Chrome instance with the hardening flags used throughout this guide."""
    options = uc.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")  # new headless mode, less detectable
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--window-size=1920,1080")
    options.add_argument("--lang=en-US")  # --lang takes a locale, not an Accept-Language header value
    if proxy:
        # Chrome ignores credentials embedded in --proxy-server; for authenticated
        # proxies, use IP allowlisting or an auth helper on top of this flag
        options.add_argument(f"--proxy-server={proxy}")
    driver = uc.Chrome(options=options, use_subprocess=True)
    driver.set_page_load_timeout(30)
    driver.implicitly_wait(0)  # never use implicit waits in production
    return driver
```
The Selenium 4 documentation covers all ChromeOptions in detail. the --headless=new flag matters: the old headless mode was a separate browser implementation, and its differences from regular Chrome are the kind of thing detection systems flag immediately.
if it breaks: on some VPS providers, --no-sandbox is required because the default Chrome sandbox requires kernel features that aren’t available. if you see “DevToolsActivePort file doesn’t exist”, add --remote-debugging-port=9222.
step 3: integrate proxy rotation
A static proxy is a single point of failure. you need rotation. here’s a simple approach using a proxy list:
```python
import random
import time
from contextlib import contextmanager

# placeholders: substitute your provider's hosts and credentials
PROXY_LIST = [
    "http://user:pass@<proxy-host-1>:8080",
    "http://user:pass@<proxy-host-2>:8080",
    # ... pulled from your proxy provider's API
]


@contextmanager
def managed_driver(proxy_list: list):
    """Yield a driver on a randomly chosen proxy and always quit it afterwards."""
    proxy = random.choice(proxy_list)
    driver = make_driver(proxy=proxy)
    try:
        yield driver
    finally:
        driver.quit()
```
For ProxyScraping’s residential pool, you get a rotating endpoint that handles rotation server-side, which simplifies this considerably. check your provider’s docs for the sticky session vs. rotating session tradeoff: sticky sessions are useful when you need to stay on the same IP across a multi-page flow (login, then scrape), rotating is better for independent requests.
if it breaks: if proxies time out frequently, filter your list by response time before adding them to the pool. a dead proxy in your list will cause random 30-second hangs.
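The filtering step mentioned above can be sketched like this. `filter_fast_proxies` and the `probe` callable are hypothetical names. The probe is whatever check you trust (it should raise on failure and carry its own timeout), which keeps the filter itself free of network code.

```python
import time
from typing import Callable


def filter_fast_proxies(
    proxies: list[str],
    probe: Callable[[str], None],
    max_seconds: float = 5.0,
) -> list[str]:
    """Keep proxies whose probe call succeeds within max_seconds."""
    healthy = []
    for proxy in proxies:
        started = time.monotonic()
        try:
            probe(proxy)  # expected to raise on connection failure
        except Exception:
            continue  # dead proxy: drop it from the pool
        if time.monotonic() - started <= max_seconds:
            healthy.append(proxy)
    return healthy
```

One possible probe, assuming the `requests` package from step 1 and a test URL of your choosing: `lambda p: requests.get(test_url, proxies={"http": p, "https": p}, timeout=5).raise_for_status()`. The check runs sequentially, so for large lists you would probably want to run it in a thread pool.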
step 4: implement explicit waits correctly
This is where most Selenium tutorials go wrong. implicit waits interact badly with explicit waits and create unpredictable timing. use only explicit waits.
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException


def wait_for_element(driver, selector: str, timeout: int = 10):
    """Return the element once it is present in the DOM, or None on timeout."""
    try:
        return WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
    except TimeoutException:
        return None


def wait_for_url_change(driver, original_url: str, timeout: int = 15):
    """Block until the URL differs from original_url; raises TimeoutException otherwise."""
    WebDriverWait(driver, timeout).until(
        lambda d: d.current_url != original_url
    )
```
The W3C WebDriver specification defines what “element present” means at the protocol level. understanding this helps when you’re debugging whether a wait condition is semantically correct or just accidentally working.
if it breaks: if presence_of_element_located returns too early (the DOM node exists but content hasn’t loaded), switch to visibility_of_element_located or write a custom expected condition that checks the element’s text content.
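A custom condition that checks text content might look like the sketch below. `element_has_text` is a hypothetical name. It relies only on the fact that `WebDriverWait.until` calls the condition with the driver and keeps polling while the result is falsy.

```python
class element_has_text:
    """Wait condition: the first element matching `locator` exists and has non-empty text."""

    def __init__(self, locator: tuple[str, str]):
        self.locator = locator  # e.g. (By.CSS_SELECTOR, ".price")

    def __call__(self, driver):
        elements = driver.find_elements(*self.locator)
        if elements and elements[0].text.strip():
            return elements[0]  # truthy result ends the wait
        return False  # keep polling until the timeout


# usage, with the imports from step 4:
# price = WebDriverWait(driver, 10).until(element_has_text((By.CSS_SELECTOR, ".price")))
```

It uses `find_elements` rather than `find_element` so a missing node yields an empty list instead of an exception. It only inspects the first match, so pick a selector that is specific enough.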
step 5: handle JavaScript-heavy pages
For pages that load data via XHR after initial render, you have two options: wait for the DOM to settle, or intercept network requests.
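```python
import json
import time

from selenium.common.exceptions import WebDriverException


def get_json_from_xhr(driver, url: str, trigger_selector: str | None = None) -> list[dict]:
    """Return parsed JSON bodies of responses whose URL contains "api"."""
    # enable network tracking via Chrome DevTools Protocol
    driver.execute_cdp_cmd("Network.enable", {})
    # navigate to page
    driver.get(url)
    if trigger_selector:
        el = wait_for_element(driver, trigger_selector)
        if el:
            el.click()
    time.sleep(2)  # give XHR time to fire

    captured = []
    # the performance log holds one JSON-encoded CDP event per entry
    for entry in driver.get_log("performance"):
        message = json.loads(entry["message"])["message"]
        if message["method"] != "Network.responseReceived":
            continue
        response_url = message["params"]["response"]["url"]
        if "api" not in response_url:
            continue
        try:
            body = driver.execute_cdp_cmd(
                "Network.getResponseBody", {"requestId": message["params"]["requestId"]}
            )
            captured.append({"url": response_url, "data": json.loads(body["body"])})
        except (WebDriverException, ValueError):
            continue  # body no longer available in Chrome, or not JSON
    return captured
```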
This uses the Chrome DevTools Protocol directly, which Selenium 4 exposes via execute_cdp_cmd. it’s a more surgical approach than trying to parse rendered HTML for data that was clearly meant to come from an API.
if it breaks: CDP performance logs must be enabled in Chrome options: options.set_capability("goog:loggingPrefs", {"performance": "ALL"}).
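The filtering part of that loop can be pulled into a pure function, which lets you test it against recorded log entries without starting a browser. This is a sketch: `api_response_urls` and the `marker` parameter are hypothetical names, and the entry shape (a dict whose `"message"` key holds a JSON string) is the one the code above already assumes.

```python
import json


def api_response_urls(log_entries: list[dict], marker: str = "api") -> list[str]:
    """Return URLs of Network.responseReceived events whose URL contains `marker`."""
    urls = []
    for entry in log_entries:
        message = json.loads(entry["message"])["message"]
        if message.get("method") != "Network.responseReceived":
            continue  # ignore requestWillBeSent, loadingFinished, etc.
        url = message["params"]["response"]["url"]
        if marker in url:
            urls.append(url)
    return urls


# with a live driver: api_response_urls(driver.get_log("performance"))
```

A plain substring match on "api" is crude. Once you know the target's endpoint paths, passing a more specific `marker` cuts down on false positives.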
step 6: manage cookies and sessions
For sites requiring login:
```python
import pickle
import os


def save_cookies(driver, path: str):
    with open(path, "wb") as f:
        pickle.dump(driver.get_cookies(), f)


def load_cookies(driver, path: str, domain: str):
    driver.get(f"https://{domain}")  # must navigate to domain first
    if os.path.exists(path):
        with open(path, "rb") as f:
            cookies = pickle.load(f)
        for cookie in cookies:
            driver.add_cookie(cookie)
        driver.refresh()
```
Keep sessions alive by refreshing every 15-20 minutes if you’re doing long-running scrapes. some sites expire sessions on inactivity faster than you’d expect.
if it breaks: cookie domains must match exactly. a cookie saved from www.example.com won’t load if you navigate to example.com first. strip www. consistently.
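One way to keep the www handling consistent is to normalize the hostname in a single place and derive the cookie file name from it. A minimal sketch with hypothetical helper names: it assumes the site actually serves on the bare domain, which you should verify per target.

```python
def normalize_domain(domain: str) -> str:
    """Lowercase a hostname and drop a leading dot and "www." prefix."""
    host = domain.strip().lower().lstrip(".")
    return host.removeprefix("www.")  # str.removeprefix needs Python 3.9+


def cookie_file_for(domain: str, directory: str = "cookies") -> str:
    """One cookie file per normalized domain, so www and non-www share a session."""
    return f"{directory}/{normalize_domain(domain)}.pkl"
```

Passing `normalize_domain(domain)` to `load_cookies` and `cookie_file_for(domain)` as the path keeps the save and load sides pointed at the same host.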
step 7: parse and store results
```python
from bs4 import BeautifulSoup
import psycopg2


def parse_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    results = []
    for item in soup.select(".product-card"):
        results.append({
            "title": item.select_one("h2").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
            "url": item.select_one("a")["href"],
        })
    return results


def store_results(conn, results: list[dict]):
    with conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO scraped_items (title, price, url, scraped_at) "
            "VALUES (%(title)s, %(price)s, %(url)s, NOW()) "
            "ON CONFLICT (url) DO UPDATE SET price = EXCLUDED.price, scraped_at = NOW()",
            results
        )
    conn.commit()
```
if it breaks: if selectors break after a site update, log the raw HTML alongside your parse errors so you can diagnose what changed without re-running the scraper.
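Saving the raw HTML can be as simple as the sketch below. `save_html_snapshot` and the `parse_failures` directory are hypothetical names. The file name combines a timestamp with a short hash of the URL so snapshots sort by time and stay filesystem-safe.

```python
import hashlib
import time
from pathlib import Path


def save_html_snapshot(html: str, url: str, directory: str = "parse_failures") -> Path:
    """Write the raw HTML of a page that failed to parse and return the file path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    # short URL hash identifies the page; timestamp orders the snapshots
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    path = out_dir / f"{int(time.time())}_{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

A typical call site wraps `parse_page(html)` in a `try` block, catches `AttributeError`, `TypeError` and `KeyError` (what a missing `h2`, `.price` or `href` produces), logs the returned path, and re-raises.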
step 8: build a retry loop
```python
import logging
import time

logger = logging.getLogger(__name__)


def scrape_with_retry(url: str, proxy_list: list, max_attempts: int = 3) -> list[dict]:
    for attempt in range(max_attempts):
        try:
            with managed_driver(proxy_list) as driver:
                driver.get(url)
                # wait_for_element returns None on timeout, so treat that as a failure
                if wait_for_element(driver, "body", timeout=15) is None:
                    raise TimeoutError(f"page body never appeared for {url}")
                html = driver.page_source
                return parse_page(html)
        except Exception as e:
            logger.warning(f"attempt {attempt + 1} failed for {url}: {e}")
            if attempt < max_attempts - 1:
                time.sleep(2 ** attempt)  # exponential backoff, skipped after the last attempt
    logger.error(f"all attempts failed for {url}")
    return []
```
if it breaks: if you see the same error on every attempt, the problem is structural (wrong selector, blocked IP range, site change), not transient. don’t burn retries on something that needs investigation.
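A rough way to detect that situation in code is to compare the most recent errors and stop early when they are identical. This is a sketch, not a tuned heuristic: `is_structural_failure` is a hypothetical helper, and comparing full messages may be too strict when exception text includes session-specific details (in that case compare types only).

```python
def is_structural_failure(errors: list[Exception], min_repeats: int = 2) -> bool:
    """True when the last `min_repeats` errors share the same type and message."""
    if len(errors) < min_repeats:
        return False
    recent = errors[-min_repeats:]
    signatures = {(type(e).__name__, str(e)) for e in recent}
    return len(signatures) == 1
```

Inside a retry loop you would append each caught exception to a list and break out once this returns True, rather than sleeping through the remaining attempts.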
common pitfalls
leaving implicit waits on. the default Selenium setup uses a 0-second implicit wait. many tutorials set it to something like 10 seconds “for safety.” this interacts badly with explicit waits and creates unpredictable behavior. set it to 0 and use explicit waits everywhere.
running one proxy per scraper process. if your scraper and its proxy die together, you lose the session. decouple them. a proxy pool manager (Squid, or a provider’s rotating endpoint) should be independent of your scraper process.
not rotating user agents and accept-language headers. browser fingerprinting looks at more than just IP. a fleet of browsers all reporting the same user agent string is a trivial signal. use a library like fake-useragent or maintain your own list of recent Chrome UA strings.
scraping without rate limits. hitting a site at full speed isn’t just rude, it’s a fast path to a subnet ban. add random delays of 1-4 seconds between requests. mimic human timing patterns.
ignoring memory leaks in long-running processes. each Chrome instance consumes memory, and if you’re not explicitly calling driver.quit(), processes accumulate. use context managers (as shown above) or add a watchdog that kills zombie Chrome processes.
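For the rate-limit pitfall, the random delay fits in a few lines. A minimal sketch with a hypothetical helper name: the `sleep` parameter exists only so the function can be exercised in tests without actually waiting.

```python
import random
import time


def polite_delay(min_seconds: float = 1.0, max_seconds: float = 4.0, sleep=time.sleep) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Call it between page loads within the same session. A uniform distribution is the simplest choice and is an assumption on my part: real human timing is burstier, so treat this as a floor rather than a disguise.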
For more on running multiple browser identities without overlap, the antidetect browser comparison at antidetectreview.org/blog/ is a useful reference if you’re considering moving beyond raw Selenium to a more managed fingerprint environment.
scaling this
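10x (50 concurrent browsers): run multiple threads with a ThreadPoolExecutor. give each thread its own driver instance. use a proxy pool with at least 3x as many proxies as threads to allow rotation. one server with 8 vCPUs and 16GB RAM handles this when pages are light (50 browsers at 150-300MB each is 7.5-15GB). by the more conservative rule from the requirements list, 2 vCPUs and 4GB per 5 browsers, budget closer to 20 vCPUs and 40GB for heavy pages.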
```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(scrape_with_retry, url, PROXY_LIST) for url in url_list]
    results = [f.result() for f in futures]
```
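The snippet above returns one list per URL, in submission order. If you want a single flat list and want results as soon as each URL finishes, a variant using `as_completed` might look like this. `run_all` and its `worker` parameter are hypothetical names; `worker` is any callable that takes a URL and returns a list of dicts.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(urls, worker, max_workers: int = 50) -> list[dict]:
    """Run worker(url) across a thread pool and flatten the per-URL result lists."""
    items = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            try:
                items.extend(future.result())
            except Exception as exc:
                # one bad URL should not take down the whole batch
                print(f"{futures[future]} failed: {exc}")
    return items
```

With the functions from this guide, the call would be `run_all(url_list, lambda u: scrape_with_retry(u, PROXY_LIST))`. Results arrive in completion order, so keep the URL in each item if ordering matters.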
100x (500 concurrent browsers): you need horizontal scaling. containerize your scraper with Docker, use a job queue (Redis + RQ or Celery), and distribute workers across multiple servers. at this scale, a centralized proxy management layer becomes important. look at Squid as a proxy cache/router, or use a provider with an enterprise API endpoint.
1000x: at this point Selenium’s per-process overhead becomes a real cost. each Chrome instance is 150-300MB of RAM. 1000 concurrent browsers means 150-300GB of RAM across your fleet. you’ll likely need to mix Selenium with lighter tools: use Selenium only for pages that require actual browser execution, and route simpler requests through httpx or curl_cffi directly. orchestration via Kubernetes makes sense here. budget $2,000-5,000/month in infrastructure at this scale.
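The RAM arithmetic and the routing idea can both be sketched in a few lines. Everything named here is hypothetical: `fleet_ram_gb` just multiplies out the 150-300MB per-instance figure, and `needs_browser` assumes you maintain your own set of hosts that you have confirmed need JavaScript rendering.

```python
from urllib.parse import urlparse


def fleet_ram_gb(browsers: int, mb_per_browser: tuple[int, int] = (150, 300)) -> tuple[float, float]:
    """Low/high RAM estimate in GB (1 GB = 1000 MB, matching the round figures in the text)."""
    low_mb, high_mb = mb_per_browser
    return browsers * low_mb / 1000, browsers * high_mb / 1000


# hypothetical: hosts you have confirmed need a real browser
JS_REQUIRED_HOSTS = frozenset({"app.example.com"})


def needs_browser(url: str, js_hosts: frozenset = JS_REQUIRED_HOSTS) -> bool:
    """Route to Selenium only when the URL's host is known to need JavaScript."""
    return urlparse(url).hostname in js_hosts
```

A dispatcher would send URLs where `needs_browser` is False to a plain HTTP client and reserve browser slots for the rest. A static host list is the simplest possible policy; per-path rules or a fallback on empty parse results are natural next steps.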
where to go next
- Rotating proxies for scrapers: a practical setup guide covers proxy pool management in depth, including how to handle IP reputation scoring and ban detection
- Playwright vs Selenium in 2026: which to use for production scraping walks through the specific cases where Playwright’s auto-wait model saves real development time versus Selenium’s more explicit control
- Browse the full scraping library for tutorials on parsing, storage, and anti-detection
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.