
The 2026 Botasaurus guide for production scraping


Most scraping frameworks fail the same way: they work fine in a dev environment, then fall apart the moment a real target site loads a Cloudflare challenge or starts fingerprinting your TLS handshake. I’ve burned enough time on Scrapy spiders and raw requests sessions that now, when I’m starting a new data collection project, I reach for Botasaurus first. it handles the tedious parts (browser fingerprinting, request interception, parallel job management) so I can focus on the actual data.

This guide is for operators who already write Python and want to run Botasaurus in production, not just for a one-off scrape. if you’re collecting product data, monitoring SERPs, or aggregating listings at scale, this is the workflow that has held up for me. by the end you’ll have a working scraper with proxy rotation, parallel workers, structured output, and a deployment pattern you can run on a bare VPS.

One honest caveat before we start: Botasaurus does a lot, but it isn’t magic. it still needs good proxies underneath it, and it won’t save you from a site that’s doing aggressive bot scoring on behavior patterns. the tool reduces friction; it doesn’t end the cat-and-mouse game.


what you need

  • Python 3.9 or higher. Botasaurus uses some typing features and async patterns that break on older versions.
  • Google Chrome or Chromium installed on the machine you’re running from. the framework ships with its own chromedriver management but needs a real browser binary.
  • Botasaurus package from PyPI. install it once, it pulls undetected-chromedriver and its other dependencies automatically.
  • A proxy provider account. for most production jobs I use rotating residential proxies from ProxyScraping. datacenter proxies work for low-sensitivity targets. expect to pay roughly $3-8 per GB for residential, $0.50-2 per GB for datacenter, depending on the provider and plan.
  • A Linux VPS with at least 2 vCPUs and 4 GB RAM if you’re running more than 4 parallel browser workers. DigitalOcean, Hetzner, or Vultr all work. headless Chrome is memory-hungry.
  • pip, venv, and optionally Docker if you prefer containerized deploys.

Budget estimate for a modest production run: $20-40/month on a Hetzner CX31 (2 vCPU, 8 GB RAM), plus proxy costs that scale with your volume.


step by step

step 1: install Botasaurus in a clean environment

python3 -m venv botenv
source botenv/bin/activate
pip install botasaurus

Expected output: pip resolves and installs botasaurus along with undetected-chromedriver, requests, and a handful of utility packages. the install takes 30-60 seconds on a fresh VPS.

if it breaks: if you see a chromedriver version mismatch error later, run pip install --upgrade botasaurus first. the package pins chromedriver versions and newer Chrome binaries sometimes get ahead of older package releases.
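before moving on, a quick preflight check can catch the two prerequisites that most often bite on a fresh VPS: the Python version and a missing browser binary. this is a minimal sketch, not part of Botasaurus. the function names and the list of binary names are my assumptions, so adjust them for your distro.

```python
import shutil
import sys

# Assumed binary names; different distros package Chrome/Chromium differently.
CHROME_CANDIDATES = (
    "google-chrome",
    "google-chrome-stable",
    "chromium",
    "chromium-browser",
)


def find_chrome(candidates=CHROME_CANDIDATES):
    """Return the path of the first Chrome/Chromium binary on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def preflight(min_python=(3, 9)):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if sys.version_info < min_python:
        found = sys.version.split()[0]
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, found {found}"
        )
    if find_chrome() is None:
        problems.append("no Chrome or Chromium binary found on PATH")
    return problems
```

call preflight() at the top of your entry script and bail out if the returned list is non-empty.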

step 2: write your first scraper with the @browser decorator

Botasaurus uses a decorator pattern. the @browser decorator wraps a function and handles browser setup, teardown, and retry logic for you.

from botasaurus.browser import browser, Driver

@browser
def scrape_title(driver: Driver, data):
    driver.get(data["url"])
    title = driver.title
    return {"url": data["url"], "title": title}

if __name__ == "__main__":
    results = scrape_title([
        {"url": "https://example.com"},
        {"url": "https://example.org"},
    ])
    print(results)

run it with python scraper.py. a Chrome window opens unless headless mode is on (pass headless=True to the decorator to hide it), and the call returns a list of dicts. the output is already structured and ready to serialize.

if it breaks: ModuleNotFoundError: No module named 'botasaurus' means you’re outside the venv. activate it with source botenv/bin/activate and retry.

step 3: add proxy rotation

this is where things get real. pass a proxy string into the decorator. for a rotating endpoint from ProxyScraping (or any provider that gives you a gateway URL), it looks like this:

from botasaurus.browser import browser, Driver

PROXY = "http://username:password@proxy.example.com:10000"  # placeholder credentials and gateway host

@browser(proxy=PROXY)
def scrape_with_proxy(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "ip_seen": driver.get_current_ip()}

for providers that rotate on each connection, a single gateway URL is enough. for sticky sessions (same IP for a multi-page crawl), append a session ID to the username: username-session-abc123.

if it breaks: if you’re getting 407 Proxy Authentication Required, double-check that your proxy credentials don’t contain special characters that need URL encoding. an @ in a password will break the URL unless it’s encoded as %40.
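to avoid the encoding trap entirely, build the proxy URL from its parts instead of pasting a string together. the sketch below uses only the standard library. build_proxy_url is a hypothetical helper of mine, and the -session-<id> suffix follows the sticky-session convention mentioned above, which varies by provider.

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, session_id=None, scheme="http"):
    """Build a proxy URL with percent-encoded credentials.

    The "-session-<id>" username suffix is an assumed provider convention;
    check your provider's docs for the exact format.
    """
    if session_id:
        username = f"{username}-session-{session_id}"
    # safe="" forces reserved characters such as @ : / to be percent-encoded.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"
```

the result can be passed straight to the proxy argument of the decorator.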

step 4: run jobs in parallel

the parallel argument on @browser tells Botasaurus how many browser instances to spin up simultaneously. it handles the worker pool internally.

from botasaurus.browser import browser, Driver

@browser(proxy=PROXY, parallel=4)
def scrape_parallel(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "title": driver.title}

results = scrape_parallel([
    {"url": "https://books.toscrape.com/catalogue/page-1.html"},
    {"url": "https://books.toscrape.com/catalogue/page-2.html"},
    {"url": "https://books.toscrape.com/catalogue/page-3.html"},
    {"url": "https://books.toscrape.com/catalogue/page-4.html"},
])

start at parallel=2 on a 2 vCPU machine and watch memory before bumping higher. each headless Chrome worker can eat 300-600 MB depending on the page complexity.

if it breaks: if workers are crashing silently, set headless=False temporarily to watch what’s actually happening in the browser. most silent failures are either proxy timeouts or JavaScript that Botasaurus’s underlying undetected-chromedriver isn’t handling right.
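if you’d rather compute a starting value for parallel than guess, here is a rough sizing sketch. it takes the upper end of the 300-600 MB per-worker figure above. the 1 GB reserve for the OS and the two-workers-per-core cap are my assumptions, not Botasaurus defaults.

```python
import os


def detect_total_ram_mb():
    """Return total physical RAM in MB, or None where sysconf is unavailable."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        return None
    return (page_size * page_count) // (1024 * 1024)


def suggest_parallel(total_ram_mb, cpu_count,
                     per_worker_mb=600, reserve_mb=1024, workers_per_cpu=2):
    """Suggest a worker count bounded by both memory and CPU, never below 1."""
    by_memory = max((total_ram_mb - reserve_mb) // per_worker_mb, 1)
    by_cpu = max(cpu_count * workers_per_cpu, 1)
    return int(min(by_memory, by_cpu))
```

treat the result as a ceiling to ramp toward, and keep watching memory while the first few runs are in flight.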

step 5: handle blocks and retries

Botasaurus has a built-in retry mechanism, but you need to tell it what counts as a failure. the cleanest pattern is to check for block signals inside your scrape function and raise an exception:

from botasaurus.browser import browser, Driver
from botasaurus.exceptions import CloudflareException

@browser(proxy=PROXY, parallel=4, max_retry=3)
def scrape_with_retry(driver: Driver, data):
    driver.get(data["url"])

    # check for block signals
    if "Access denied" in driver.title or "Just a moment" in driver.title:
        raise CloudflareException("Blocked on: " + data["url"])

    return {"url": data["url"], "title": driver.title}

max_retry=3 will retry failed URLs up to 3 times with fresh browser instances. combine this with a rotating proxy endpoint and most transient blocks resolve themselves on retry.

if it breaks: if every retry hits the same block, the issue is usually IP quality or fingerprinting, not a transient hiccup. switch to residential proxies, or check out how anti-detect browser setups handle canvas and WebGL fingerprints over at antidetectreview.org/blog/ for a deeper look at the fingerprinting layer.
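the title check in the scraper above is case-sensitive and lives inline, which gets awkward once you track more than two markers. a small pure function keeps it testable without a browser. the marker list here is my own starting set (an assumption), so extend it with whatever your targets actually show.

```python
# Assumed block markers, compared case-insensitively against the page title.
BLOCK_MARKERS = ("access denied", "just a moment", "attention required")


def looks_blocked(title, markers=BLOCK_MARKERS):
    """Return True if the page title contains any known block marker."""
    normalized = (title or "").casefold()
    return any(marker in normalized for marker in markers)
```

inside the scrape function you would call looks_blocked(driver.title) and raise on True, exactly as before.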

step 6: use @request for lighter pages

not every URL needs a full browser. Botasaurus ships a @request decorator that uses a plain HTTP session with rotated headers but no JavaScript rendering. it’s 10-20x faster and uses a fraction of the memory.

from botasaurus.request import request, Request

@request(proxy=PROXY, parallel=10)
def scrape_api(request: Request, data):
    response = request.get(data["url"])
    return {"url": data["url"], "status": response.status_code}

use @request for APIs, sitemaps, and any page where you’ve confirmed the data is in the raw HTML. use @browser for JavaScript-rendered content, login flows, and anything behind Cloudflare.

if it breaks: getting 403s with @request that don’t appear with @browser? the target is checking TLS fingerprints or header order. fall back to @browser for that domain, or look into a JA3 spoofing proxy layer.
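in practice a job usually mixes both kinds of URL, so it helps to split the input before handing it to either decorator. this is a sketch under one assumption: you maintain your own set of hostnames that need a real browser. split_by_fetcher and browser_hosts are hypothetical names.

```python
from urllib.parse import urlsplit


def split_by_fetcher(urls, browser_hosts):
    """Split URLs into (light, heavy) lists of {"url": ...} dicts.

    "heavy" URLs belong to hosts that need a full browser; everything else
    is assumed to work over a plain HTTP session.
    """
    light, heavy = [], []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        target = heavy if host in browser_hosts else light
        target.append({"url": url})
    return light, heavy
```

feed the first list to your @request function and the second to your @browser function.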

step 7: save output to JSON or a database

by default Botasaurus saves results to an output/ folder as JSON files named after your scraper function. for production I usually push straight to a database instead:

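import psycopg2

def save_to_postgres(results):
    # connection string is a placeholder; load real credentials from the environment
    conn = psycopg2.connect("postgresql://user:pass@localhost/scrapedb")
    try:
        # "with conn" commits on success and rolls back if an insert raises
        with conn, conn.cursor() as cur:
            for row in results:
                cur.execute(
                    "INSERT INTO results (url, title, scraped_at) VALUES (%s, %s, NOW())",
                    (row["url"], row.get("title", ""))
                )
    finally:
        conn.close()

# urls is the same kind of list of {"url": ...} dicts passed to the scraper in step 4
results = scrape_parallel(urls)
save_to_postgres(results)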

if it breaks: connection refused errors usually mean the database isn’t listening on the expected port, or your VPS firewall is blocking the connection. check pg_hba.conf and your firewall rules first.
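for a dry run before Postgres is set up, the same shape works against SQLite from the standard library. to be clear, this swaps in a different database than the one used above. the table layout mirrors the results table, and making url the primary key (so reruns overwrite instead of duplicating) is my assumption.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS results (
    url TEXT PRIMARY KEY,
    title TEXT,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def save_to_sqlite(conn, results):
    """Insert or overwrite one row per URL and return the number of rows written."""
    conn.execute(SCHEMA)
    rows = [(row["url"], row.get("title", "")) for row in results]
    with conn:  # commits on success, rolls back if an insert raises
        conn.executemany(
            "INSERT OR REPLACE INTO results (url, title) VALUES (?, ?)",
            rows,
        )
    return len(rows)
```

open the connection with sqlite3.connect("scrape.db") for a file on disk, or ":memory:" for a throwaway test.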

step 8: deploy to a VPS and run on a schedule

on a Hetzner or DigitalOcean VPS, the simplest production deploy is a systemd service or a cron job. here’s the cron approach:

# edit crontab with: crontab -e
0 6 * * * /home/ubuntu/botenv/bin/python /home/ubuntu/scraper/main.py >> /var/log/scraper.log 2>&1

for longer jobs where you want process supervision, write a systemd unit file. set Restart=on-failure and RestartSec=60 so the service recovers from crashes automatically.
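a sketch of that unit file, rendered from Python so the paths stay in one place. the paths reuse the ones from the cron line above and the two Restart settings come from the paragraph above. the description, the network-online dependency, and the render_unit helper itself are my assumptions.

```python
UNIT_TEMPLATE = """\
[Unit]
Description={description}
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User={user}
WorkingDirectory={workdir}
ExecStart={python} {script}
Restart=on-failure
RestartSec=60

[Install]
WantedBy=multi-user.target
"""


def render_unit(python="/home/ubuntu/botenv/bin/python",
                script="/home/ubuntu/scraper/main.py",
                user="ubuntu",
                workdir="/home/ubuntu/scraper",
                description="Botasaurus scraper"):
    """Return the text of a systemd service unit for the scraper."""
    return UNIT_TEMPLATE.format(
        python=python,
        script=script,
        user=user,
        workdir=workdir,
        description=description,
    )
```

write the output to a file under /etc/systemd/system/ (scraper.service is a name I made up), then run systemctl daemon-reload and systemctl enable --now on it.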

if it breaks: if Chrome crashes silently under cron but works fine interactively, it’s usually a display environment issue: cron jobs have no X display. either run headless (headless=True in the decorator) or start a virtual display such as Xvfb on :99 and set DISPLAY=:99 in the crontab.


common pitfalls

running too many parallel workers too fast. starting at parallel=10 on a 2-core machine will thrash memory and make your scrapes slower, not faster. ramp up gradually and monitor with htop.

using datacenter proxies on targets that fingerprint IP reputation. e-commerce and travel sites use IP scoring services. residential proxies cost more but clear the reputation bar. see the proxy reviews on /blog/ for a breakdown of when each type makes sense.

not rotating user agents and accept-language headers. Botasaurus handles a lot of this automatically with undetected-chromedriver, but if you’re using @request, you need to pass realistic headers explicitly.

ignoring the output folder until it’s 20 GB. Botasaurus saves JSONs by default and doesn’t prune old runs. set up a cleanup cron or redirect output to a database from day one.
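a cleanup job for that folder can be a few lines of standard-library Python. this is a minimal sketch: the 14-day retention window and the prune_old_json name are my assumptions, and it only touches top-level .json files in the folder you point it at.

```python
import time
from pathlib import Path


def prune_old_json(folder, max_age_days=14, now=None, dry_run=False):
    """Delete *.json files older than max_age_days and return their names.

    With dry_run=True nothing is deleted; the names are only reported.
    """
    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400
    removed = []
    for path in Path(folder).glob("*.json"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

run it with dry_run=True first, then schedule it from cron next to the scraper.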

scraping without checking robots.txt. Botasaurus won’t stop you from ignoring robots exclusion protocol rules. whether to respect them is a legal and operational question specific to your use case. this is not legal advice; consult your own counsel on compliance obligations.
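if you do decide to honor robots.txt, the standard library can evaluate the rules for you. the sketch below parses a robots.txt body you have already fetched, so there is no network call here. build_robots_checker is a hypothetical helper and the "*" user agent is a placeholder.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed
```

filter your URL list through the returned function before passing it to the scraper.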


scaling this

10x (basic production): four parallel @browser workers on a single VPS, one rotating residential proxy endpoint, cron scheduling. handles most data collection jobs at this scale without architectural changes.

100x: multiple VPS instances running the same scraper, jobs coordinated through a Redis queue (RQ or Celery both work). proxy costs become the dominant variable here. consider a dedicated proxy plan for scraping to get better per-GB rates at volume.

1000x: at this point you’re operating a distributed crawl cluster. Botasaurus itself doesn’t change much, but the infrastructure around it does: container orchestration (Kubernetes or Nomad), centralized logging (Loki or ELK), per-domain rate limiting enforced at the queue level, and proxy provider contracts with SLA guarantees. you’ll also want to split @request and @browser workers onto separate node pools since their resource profiles are completely different.
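per-domain rate limiting is the piece of that list that is easiest to get wrong, so here is the core idea in isolation. this is a single-process sketch with an injectable clock. a real cluster would keep this state in the queue layer or a shared store, and the class name and one-second default are my assumptions.

```python
import time
from urllib.parse import urlsplit


class DomainRateLimiter:
    """Space out requests to the same host by a minimum interval."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url):
        """Reserve the next slot for the URL's host and return seconds to wait."""
        host = (urlsplit(url).hostname or "").lower()
        now = self.clock()
        slot = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = slot + self.min_interval_s
        return slot - now
```

a worker would call wait_time(url), sleep for the returned number of seconds, and then dispatch the job.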




Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
