The 2026 Botasaurus guide for production scraping

Most scraping frameworks fail the same way: they work fine in a dev environment, then fall apart the moment a real target site loads a Cloudflare challenge or starts fingerprinting your TLS handshake. I’ve burned enough time on Scrapy spiders and raw requests sessions that now, when I’m starting a new data collection project, I reach for Botasaurus first. it handles the tedious parts (browser fingerprinting, request interception, parallel job management) so I can focus on the actual data.

This guide is for operators who already write Python and want to run Botasaurus in production, not just for a one-off scrape. if you’re collecting product data, monitoring SERPs, or aggregating listings at scale, this is the workflow that has held up for me. by the end you’ll have a working scraper with proxy rotation, parallel workers, structured output, and a deployment pattern you can run on a bare VPS.

One honest caveat before we start: Botasaurus does a lot, but it isn’t magic. it still needs good proxies underneath it, and it won’t save you from a site that’s doing aggressive bot scoring on behavior patterns. the tool reduces friction; it doesn’t end the cat-and-mouse game.

what you need

Python 3.9 or higher. Botasaurus uses some typing features and async patterns that break on older versions.
Google Chrome or Chromium installed on the machine you’re running from. the framework ships with its own chromedriver management but needs a real browser binary.
Botasaurus package from PyPI. install it once and it pulls in undetected-chromedriver and its other dependencies automatically.
A proxy provider account. for most production jobs I use rotating residential proxies from ProxyScraping. datacenter proxies work for low-sensitivity targets. expect to pay roughly $3-8 per GB for residential, $0.50-2 per GB for datacenter, depending on the provider and plan.
A Linux VPS with at least 2 vCPUs and 4 GB RAM if you’re running more than 4 parallel browser workers. DigitalOcean, Hetzner, or Vultr all work. headless Chrome is memory-hungry.
pip, venv, and optionally Docker if you prefer containerized deploys.

Budget estimate for a modest production run: $20-40/month on a Hetzner CX31 (2 vCPU, 8 GB RAM), plus proxy costs that scale with your volume.
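
The arithmetic behind that estimate is simple enough to sketch. the function below adds a fixed VPS cost to a per-GB proxy cost, using the price ranges quoted above; the 50 GB monthly traffic figure is a made-up example, not a measurement.

```python
def monthly_cost(vps_usd, traffic_gb, proxy_usd_per_gb):
    """Rough monthly spend: fixed VPS cost plus metered proxy traffic."""
    return vps_usd + traffic_gb * proxy_usd_per_gb


# Hypothetical workload: 50 GB of residential proxy traffic per month.
TRAFFIC_GB = 50

# Low end: $20 VPS, $3/GB residential. High end: $40 VPS, $8/GB residential.
low = monthly_cost(vps_usd=20, traffic_gb=TRAFFIC_GB, proxy_usd_per_gb=3)
high = monthly_cost(vps_usd=40, traffic_gb=TRAFFIC_GB, proxy_usd_per_gb=8)

print(f"estimated monthly cost: ${low}-${high}")
```

at this volume the proxy bill is several times the VPS bill, which is why proxy pricing deserves more attention than server pricing.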

step by step

step 1: install Botasaurus in a clean environment

python3 -m venv botenv
source botenv/bin/activate
pip install botasaurus

Expected output: pip resolves and installs botasaurus along with undetected-chromedriver, requests, and a handful of utility packages. the install takes 30-60 seconds on a fresh VPS.

if it breaks: if you see a chromedriver version mismatch error later, run pip install --upgrade botasaurus first. the package pins chromedriver versions and newer Chrome binaries sometimes get ahead of older package releases.

step 2: write your first scraper with the `@browser` decorator

Botasaurus uses a decorator pattern. the @browser decorator wraps a function and handles browser setup, teardown, and retry logic for you.

from botasaurus.browser import browser, Driver

@browser
def scrape_title(driver: Driver, data):
    driver.get(data["url"])
    title = driver.title
    return {"url": data["url"], "title": title}

if __name__ == "__main__":
    results = scrape_title([
        {"url": "https://example.com"},
        {"url": "https://example.org"},
    ])
    print(results)

run it with python scraper.py. you should see a Chrome window open (unless headless mode is on) and a list of dicts printed, one per input URL. the output is already structured and ready to serialize.

if it breaks: ModuleNotFoundError: No module named 'botasaurus' means you’re outside the venv. activate it with source botenv/bin/activate and retry.

step 3: add proxy rotation

this is where things get real. pass a proxy string into the decorator. for a rotating endpoint from ProxyScraping (or any provider that gives you a gateway URL), it looks like this:

from botasaurus.browser import browser, Driver

# placeholder host and credentials; substitute your provider's gateway
PROXY = "http://username:password@proxy-host:10000"

@browser(proxy=PROXY)
def scrape_with_proxy(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "ip_seen": driver.get_current_ip()}

for providers that rotate on each connection, a single gateway URL is enough. for sticky sessions (same IP for a multi-page crawl), many providers have you append a session ID to the username, for example username-session-abc123. the exact format varies, so check your provider’s docs.

if it breaks: if you’re getting 407 Proxy Authentication Required, double-check that your proxy credentials don’t contain special characters that need URL encoding. an @ in a password will break the URL unless it’s encoded as %40.
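
a small helper avoids the encoding problem entirely. this is a minimal sketch using only the standard library; the function name build_proxy_url, the host gateway.example, and the -session- suffix format are illustrative assumptions, so adjust them to whatever your provider documents.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, session_id=None):
    """Build an HTTP proxy URL with percent-encoded credentials.

    The "-session-<id>" username suffix is an assumed sticky-session
    format; providers differ, so treat it as a placeholder.
    """
    if session_id:
        username = f"{username}-session-{session_id}"
    # safe="" forces reserved characters such as "@" and ":" to be encoded.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


# Hypothetical credentials containing characters that would break a raw URL.
PROXY = build_proxy_url("username", "p@ss:word", "gateway.example", 10000)
print(PROXY)
```

the resulting string can be passed wherever the guide uses PROXY.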

step 4: run jobs in parallel

the parallel argument on @browser tells Botasaurus how many browser instances to spin up simultaneously. it handles the worker pool internally.

from botasaurus.browser import browser, Driver

@browser(proxy=PROXY, parallel=4)
def scrape_parallel(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "title": driver.title}

results = scrape_parallel([
    {"url": "https://books.toscrape.com/catalogue/page-1.html"},
    {"url": "https://books.toscrape.com/catalogue/page-2.html"},
    {"url": "https://books.toscrape.com/catalogue/page-3.html"},
    {"url": "https://books.toscrape.com/catalogue/page-4.html"},
])

start at parallel=2 on a 2 vCPU machine and watch memory before bumping higher. each headless Chrome worker can eat 300-600 MB depending on the page complexity.
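
to turn that rule of thumb into a number, here is a rough sizing sketch. it assumes the worst-case 600 MB per worker from above, reserves 1 GB for the OS and your Python process, and caps the starting point at one worker per vCPU; all three defaults are assumptions to tune against what htop actually shows.

```python
def starting_parallel(total_ram_mb, vcpus, per_worker_mb=600, reserve_mb=1024):
    """Conservative starting value for parallel= on a given machine."""
    usable_mb = max(total_ram_mb - reserve_mb, 0)
    by_memory = usable_mb // per_worker_mb  # workers that fit in RAM
    by_cpu = vcpus                          # assumed cap: one worker per vCPU
    return max(1, min(by_memory, by_cpu))


print(starting_parallel(total_ram_mb=4096, vcpus=2))  # 4 GB, 2 vCPU box
print(starting_parallel(total_ram_mb=8192, vcpus=8))  # 8 GB, 8 vCPU box
```

treat the result as a floor to ramp up from, not a ceiling.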

if it breaks: if workers are crashing silently, set headless=False temporarily to watch what’s actually happening in the browser. most silent failures are either proxy timeouts or JavaScript that Botasaurus’s underlying undetected-chromedriver isn’t handling right.

step 5: handle blocks and retries

Botasaurus has a built-in retry mechanism, but you need to tell it what counts as a failure. the cleanest pattern is to check for block signals inside your scrape function and raise an exception:

from botasaurus.browser import browser, Driver
from botasaurus.exceptions import CloudflareException

@browser(proxy=PROXY, parallel=4, max_retry=3)
def scrape_with_retry(driver: Driver, data):
    driver.get(data["url"])

    # check for block signals
    if "Access denied" in driver.title or "Just a moment" in driver.title:
        raise CloudflareException("Blocked on: " + data["url"])

    return {"url": data["url"], "title": driver.title}

max_retry=3 will retry failed URLs up to 3 times with fresh browser instances. combine this with a rotating proxy endpoint and most transient blocks resolve themselves on retry.
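
if you want to see the retry idea in isolation, the sketch below reproduces it in plain Python. it is not how Botasaurus implements retries internally; fetch_title stands in for whatever loads the page, and BlockedError and looks_blocked are hypothetical names. it assumes max_retry=3 means one initial attempt plus up to three retries.

```python
import time

# Title fragments treated as block signals (same ones checked above).
BLOCK_MARKERS = ("Access denied", "Just a moment")


class BlockedError(Exception):
    """Raised when every attempt returned a block or challenge page."""


def looks_blocked(title):
    return any(marker in title for marker in BLOCK_MARKERS)


def fetch_with_retry(fetch_title, url, max_retry=3, delay_seconds=0.0):
    """Call fetch_title(url) once, then retry up to max_retry times."""
    last_title = ""
    for attempt in range(1 + max_retry):
        last_title = fetch_title(url)
        if not looks_blocked(last_title):
            return last_title
        if attempt < max_retry:
            time.sleep(delay_seconds)  # pause before the next attempt
    raise BlockedError(f"still blocked after {max_retry} retries: {url}")
```

keeping the block check in one function makes it easy to add new markers as you discover them.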

if it breaks: if every retry hits the same block, the issue is usually IP quality or fingerprinting, not a transient hiccup. switch to residential proxies, or check out how anti-detect browser setups handle canvas and WebGL fingerprints over at antidetectreview.org/blog/ for a deeper look at the fingerprinting layer.

step 6: use `@request` for lighter pages

not every URL needs a full browser. Botasaurus ships a @request decorator that uses a plain HTTP session with rotated headers but no JavaScript rendering. it’s 10-20x faster and uses a fraction of the memory.

from botasaurus.request import request, Request

@request(proxy=PROXY, parallel=10)
def scrape_api(request: Request, data):
    response = request.get(data["url"])
    return {"url": data["url"], "status": response.status_code}

use @request for APIs, sitemaps, and any page where you’ve confirmed the data is in the raw HTML. use @browser for JavaScript-rendered content, login flows, and anything behind Cloudflare.

if it breaks: getting 403s with @request that don’t appear with @browser? the target is checking TLS fingerprints or header order. fall back to @browser for that domain, or look into a JA3 spoofing proxy layer.
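
one way to manage that per-domain fallback is to split your URL list before handing it to either decorator. a minimal sketch, assuming you maintain a set of hosts that are known to need a real browser; the hostnames and the function name split_by_engine are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical hosts that returned 403s to plain HTTP requests.
BROWSER_ONLY_HOSTS = {"shop.example.com"}


def split_by_engine(urls, browser_hosts=BROWSER_ONLY_HOSTS):
    """Return (browser_jobs, request_jobs) as lists of {"url": ...} dicts."""
    browser_jobs, request_jobs = [], []
    for url in urls:
        host = urlparse(url).hostname or ""
        target = browser_jobs if host in browser_hosts else request_jobs
        target.append({"url": url})
    return browser_jobs, request_jobs


browser_jobs, request_jobs = split_by_engine([
    "https://shop.example.com/p/1",
    "https://api.example.org/items",
    "https://shop.example.com/p/2",
])
print(len(browser_jobs), "browser jobs,", len(request_jobs), "request jobs")
```

each list is already in the shape the scrape functions in this guide expect as input.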

step 7: save output to JSON or a database

by default Botasaurus saves results to an output/ folder as JSON files named after your scraper function. for production I usually push straight to a database instead:

import psycopg2

def save_to_postgres(results):
    # connection string is a placeholder; use your own credentials
    conn = psycopg2.connect("postgresql://user:pass@localhost/scrapedb")
    try:
        # "with conn" commits on success and rolls back on error
        with conn, conn.cursor() as cur:
            cur.executemany(
                "INSERT INTO results (url, title, scraped_at) VALUES (%s, %s, NOW())",
                [(row["url"], row.get("title", "")) for row in results],
            )
    finally:
        conn.close()

# urls is a list of {"url": ...} dicts, as in step 4
results = scrape_parallel(urls)
save_to_postgres(results)

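if it breaks: connection refused errors usually mean PostgreSQL isn’t listening on the expected host or port, or your VPS firewall is blocking the connection. check listen_addresses and port in postgresql.conf and your firewall rules first. pg_hba.conf controls authentication, and a problem there produces an authentication error rather than a refused connection.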
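
to test the save logic before a Postgres instance exists, an in-memory SQLite database from the standard library works as a stand-in. this is a sketch, not a replacement for the Postgres path: SQLite uses ? placeholders instead of %s and CURRENT_TIMESTAMP instead of NOW(), and the save_rows name and sample rows are made up for illustration.

```python
import sqlite3


def save_rows(conn, results):
    """Insert scraped rows in one batch; skip rows with no URL."""
    rows = [(r["url"], r.get("title", "")) for r in results if r.get("url")]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO results (url, title, scraped_at) "
            "VALUES (?, ?, CURRENT_TIMESTAMP)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (url TEXT NOT NULL, title TEXT, scraped_at TEXT)")

saved = save_rows(conn, [
    {"url": "https://example.com", "title": "Example Domain"},
    {"url": "https://example.org"},   # no title: stored as empty string
    {"title": "row without a url"},   # skipped
])
print(saved, "rows saved")
```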

step 8: deploy to a VPS and run on a schedule

on a Hetzner or DigitalOcean VPS, the simplest production deploy is a systemd service or a cron job. here’s the cron approach:

# edit crontab with: crontab -e
0 6 * * * /home/ubuntu/botenv/bin/python /home/ubuntu/scraper/main.py >> /var/log/scraper.log 2>&1

for longer jobs where you want process supervision, write a systemd unit file. set Restart=on-failure and RestartSec=60 so the service recovers from crashes automatically.
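
a minimal unit file along those lines might look like this. the paths match the cron example above; the unit name scraper.service and User=ubuntu are assumptions based on those paths, so change them to fit your box.

```ini
# /etc/systemd/system/scraper.service (hypothetical name)
[Unit]
Description=Botasaurus scraper
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/scraper
ExecStart=/home/ubuntu/botenv/bin/python /home/ubuntu/scraper/main.py
Restart=on-failure
RestartSec=60

[Install]
WantedBy=multi-user.target
```

after saving the file, run sudo systemctl daemon-reload and then sudo systemctl enable --now scraper.service. journalctl -u scraper.service shows the logs.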

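if it breaks: if Chrome crashes silently under cron but works fine interactively, it’s usually a display environment issue, since cron jobs run without your desktop session’s environment. either run headless (headless=True in the decorator), or start a virtual display such as Xvfb on :99 and set DISPLAY=:99 for the job. setting DISPLAY alone does nothing if no X server is listening on that display.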

common pitfalls

running too many parallel workers too fast. starting at parallel=10 on a 2-core machine will thrash memory and make your scrapes slower, not faster. ramp up gradually and monitor with htop.

using datacenter proxies on targets that fingerprint IP reputation. e-commerce and travel sites use IP scoring services. residential proxies cost more but clear the reputation bar. see the proxy reviews on /blog/ for a breakdown of when each type makes sense.

not rotating user agents and accept-language headers. Botasaurus handles a lot of this automatically with undetected-chromedriver, but if you’re using @request, you need to pass realistic headers explicitly.

ignoring the output folder until it’s 20 GB. Botasaurus saves JSONs by default and doesn’t prune old runs. set up a cleanup cron or redirect output to a database from day one.

scraping without checking robots.txt. Botasaurus won’t stop you from ignoring robots exclusion protocol rules. whether to respect them is a legal and operational question specific to your use case. this is not legal advice; consult your own counsel on compliance obligations.
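
for the output folder pitfall, the cleanup job can be a few lines of standard-library Python. a minimal sketch, assuming results land as .json files directly inside one folder and that seven days of history is enough; the function name and the retention period are both placeholders.

```python
import time
from pathlib import Path


def prune_old_json(folder, max_age_days=7, now=None):
    """Delete *.json files in folder older than max_age_days.

    Returns the names of the deleted files. Other file types are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    for path in sorted(Path(folder).glob("*.json")):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path.name)
    return deleted
```

call prune_old_json("output") at the end of your scraper run, or from its own cron entry.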

scaling this

10x (basic production): four parallel @browser workers on a single VPS, one rotating residential proxy endpoint, cron scheduling. handles most data collection jobs at this scale without architectural changes.

100x: multiple VPS instances running the same scraper, jobs coordinated through a Redis queue (RQ or Celery both work). proxy costs become the dominant variable here. consider a dedicated proxy plan for scraping to get better per-GB rates at volume.

1000x: at this point you’re operating a distributed crawl cluster. Botasaurus itself doesn’t change much, but the infrastructure around it does: container orchestration (Kubernetes or Nomad), centralized logging (Loki or ELK), per-domain rate limiting enforced at the queue level, and proxy provider contracts with SLA guarantees. you’ll also want to split @request and @browser workers onto separate node pools since their resource profiles are completely different.
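
per-domain rate limiting is the piece people most often hand-wave, so here is a single-process sketch of the idea. it only tracks when each domain may next be hit; a real cluster would keep this state in the shared queue backend rather than in a Python dict. the class name and the injectable clock are illustrative choices, not part of Botasaurus.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Space out requests to the same domain by a minimum interval."""

    def __init__(self, min_interval_seconds, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock          # injectable so the logic can be tested
        self._next_allowed = {}     # domain -> earliest allowed timestamp

    def wait_time(self, url):
        """Reserve the next slot for url's domain; return seconds to wait."""
        domain = urlparse(url).netloc
        now = self.clock()
        allowed_at = max(now, self._next_allowed.get(domain, now))
        self._next_allowed[domain] = allowed_at + self.min_interval
        return allowed_at - now


limiter = DomainRateLimiter(min_interval_seconds=2.0)
delay = limiter.wait_time("https://books.toscrape.com/catalogue/page-1.html")
print(f"first request to a new domain waits {delay:.1f}s")
```

a worker would sleep for the returned delay, or requeue the job with that delay, before fetching.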

where to go next

Undetected-chromedriver deep dive: Botasaurus builds on top of undetected-chromedriver. understanding what it does under the hood helps you debug the 20% of cases where Botasaurus doesn’t handle fingerprinting automatically.
Best proxies for web scraping in 2026: proxy choice is the biggest variable in production scraping success rates. this review covers residential, datacenter, and ISP proxy providers with actual test data.
Scraping at scale with Playwright and rotating proxies: if you hit a target that Botasaurus’s Chrome setup can’t handle, Playwright is the next tool to reach for. the proxy integration pattern is similar.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.