The 2026 Botasaurus guide for production scraping
Most scraping frameworks fail the same way: they work fine in a dev environment, then fall apart the moment a real target site loads a Cloudflare challenge or starts fingerprinting your TLS handshake. I’ve burned enough time on Scrapy spiders and raw requests sessions that now, when I’m starting a new data collection project, I reach for Botasaurus first. It handles the tedious parts (browser fingerprinting, request interception, parallel job management) so I can focus on the actual data.
This guide is for operators who already write Python and want to run Botasaurus in production, not just for a one-off scrape. If you’re collecting product data, monitoring SERPs, or aggregating listings at scale, this is the workflow that has held up for me. By the end you’ll have a working scraper with proxy rotation, parallel workers, structured output, and a deployment pattern you can run on a bare VPS.
One honest caveat before we start: Botasaurus does a lot, but it isn’t magic. It still needs good proxies underneath it, and it won’t save you from a site that’s doing aggressive bot scoring on behavior patterns. The tool reduces friction; it doesn’t eliminate the cat-and-mouse game entirely.
what you need
- Python 3.9 or higher. Botasaurus uses some typing features and async patterns that break on older versions.
- Google Chrome or Chromium installed on the machine you’re running from. The framework ships with its own chromedriver management but needs a real browser binary.
- Botasaurus package from PyPI. Install it once and it pulls undetected-chromedriver and its other dependencies automatically.
- A proxy provider account. For most production jobs I use rotating residential proxies from ProxyScraping. Datacenter proxies work for low-sensitivity targets. Expect to pay roughly $3-8 per GB for residential and $0.50-2 per GB for datacenter, depending on the provider and plan.
- A Linux VPS with at least 2 vCPUs and 4 GB RAM if you’re running more than 4 parallel browser workers. DigitalOcean, Hetzner, or Vultr all work. Headless Chrome is memory-hungry.
- pip, venv, and optionally Docker if you prefer containerized deploys.
Budget estimate for a modest production run: $20-40/month on a Hetzner CX31 (2 vCPU, 8 GB RAM), plus proxy costs that scale with your volume.
step by step
step 1: install Botasaurus in a clean environment
python3 -m venv botenv
source botenv/bin/activate
pip install botasaurus
Expected output: pip resolves and installs botasaurus along with undetected-chromedriver, requests, and a handful of utility packages. The install takes 30-60 seconds on a fresh VPS.
if it breaks: if you see a chromedriver version mismatch error later, run pip install --upgrade botasaurus first. The package pins chromedriver versions, and newer Chrome binaries sometimes get ahead of older package releases.
step 2: write your first scraper with the @browser decorator
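Botasaurus uses a decorator pattern. The @browser decorator wraps a function and handles browser setup, teardown, and retry logic for you. Call the decorated function with a list of inputs and it runs once per item, passing each item in as data.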
from botasaurus.browser import browser, Driver


@browser
def scrape_title(driver: Driver, data):
    # data is one item from the list passed to scrape_title()
    driver.get(data["url"])
    title = driver.title
    return {"url": data["url"], "title": title}


if __name__ == "__main__":
    results = scrape_title([
        {"url": "https://example.com"},
        {"url": "https://example.org"},
    ])
    print(results)
Run it with python scraper.py. You should see a Chrome window open (or nothing visible, if headless mode is on) and a list of dicts printed to the terminal. The output is already structured and ready to serialize.
if it breaks: ModuleNotFoundError: No module named 'botasaurus' means you’re outside the venv. Activate it with source botenv/bin/activate and retry.
step 3: add proxy rotation
This is where things get real. Pass a proxy string into the decorator. For a rotating endpoint from ProxyScraping (or any provider that gives you a gateway URL), it looks like this:
from botasaurus.browser import browser, Driver

# GATEWAY_HOST is a placeholder: substitute your provider's gateway hostname
PROXY = "http://username:password@GATEWAY_HOST:10000"


@browser(proxy=PROXY)
def scrape_with_proxy(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "ip_seen": driver.get_current_ip()}
For providers that rotate on each connection, a single gateway URL is enough. For sticky sessions (same IP for a multi-page crawl), append a session ID to the username, for example username-session-abc123. The exact suffix format varies by provider, so check their docs.
if it breaks: if you’re getting 407 Proxy Authentication Required, double-check that your proxy credentials don’t contain special characters that need URL encoding. An @ in a password will break the URL unless it’s encoded as %40.
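The encoding problem above is easy to rule out by building the proxy URL in code instead of by hand. Here is a minimal sketch using only the standard library. The function name build_proxy_url, the example host, and the -session- suffix are illustrative assumptions, not part of Botasaurus or any particular provider's API.

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(username: str, password: str, host: str, port: int,
                    session_id: Optional[str] = None) -> str:
    """Return an http:// proxy URL with percent-encoded credentials."""
    if session_id:
        # Sticky-session suffix; the exact format is provider-specific (assumption).
        username = f"{username}-session-{session_id}"
    # safe="" makes quote() encode every reserved character, including @ : /
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"http://{user}:{pwd}@{host}:{port}"


if __name__ == "__main__":
    # proxy.example.com is a placeholder host
    print(build_proxy_url("username", "p@ss:word", "proxy.example.com", 10000))
```

The returned string can be passed straight to the proxy argument of the decorator. Keeping credentials in environment variables and assembling the URL at startup also keeps them out of version control.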
step 4: run jobs in parallel
The parallel argument on @browser tells Botasaurus how many browser instances to spin up simultaneously. It handles the worker pool internally.
from botasaurus.browser import browser, Driver

# PROXY is the gateway URL defined in step 3


@browser(proxy=PROXY, parallel=4)
def scrape_parallel(driver: Driver, data):
    driver.get(data["url"])
    return {"url": data["url"], "title": driver.title}


results = scrape_parallel([
    {"url": "https://books.toscrape.com/catalogue/page-1.html"},
    {"url": "https://books.toscrape.com/catalogue/page-2.html"},
    {"url": "https://books.toscrape.com/catalogue/page-3.html"},
    {"url": "https://books.toscrape.com/catalogue/page-4.html"},
])
Start at parallel=2 on a 2 vCPU machine and watch memory before bumping higher. Each headless Chrome worker can eat 300-600 MB depending on page complexity.
if it breaks: if workers are crashing silently, set headless=False temporarily to watch what’s actually happening in the browser. Most silent failures are either proxy timeouts or JavaScript that Botasaurus’s underlying undetected-chromedriver isn’t handling right.
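To turn the memory figures above into a starting value for parallel, a back-of-the-envelope calculation is enough. The sketch below is only that: the function name, the 1 GB reserved for the OS, and the cap of two workers per vCPU are my assumptions, not Botasaurus defaults. The 600 MB per worker is the upper end of the range quoted above.

```python
def max_parallel_workers(total_ram_mb: int, vcpus: int,
                         per_worker_mb: int = 600,
                         reserved_mb: int = 1024) -> int:
    """Estimate a conservative worker count from RAM and vCPU count."""
    # RAM left after reserving headroom for the OS and the Python process (assumed 1 GB)
    usable_mb = max(total_ram_mb - reserved_mb, 0)
    by_memory = usable_mb // per_worker_mb
    # Assumed rule of thumb: no more than two browser workers per vCPU
    by_cpu = vcpus * 2
    # Always allow at least one worker
    return max(1, min(by_memory, by_cpu))


if __name__ == "__main__":
    print(max_parallel_workers(4096, 2))  # 4 GB RAM, 2 vCPUs
```

Treat the result as a ceiling to ramp toward, not a target. Heavy pages can push a worker past 600 MB, so watch htop during the first real run.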
step 5: handle blocks and retries
Botasaurus has a built-in retry mechanism, but you need to tell it what counts as a failure. The cleanest pattern is to check for block signals inside your scrape function and raise an exception:
from botasaurus.browser import browser, Driver
from botasaurus.exceptions import CloudflareException


@browser(proxy=PROXY, parallel=4, max_retry=3)
def scrape_with_retry(driver: Driver, data):
    driver.get(data["url"])
    # check for block signals
    if "Access denied" in driver.title or "Just a moment" in driver.title:
        raise CloudflareException("Blocked on: " + data["url"])
    return {"url": data["url"], "title": driver.title}
max_retry=3 will retry failed URLs up to 3 times with fresh browser instances. Combine this with a rotating proxy endpoint and most transient blocks resolve themselves on retry.
if it breaks: if every retry hits the same block, the issue is usually IP quality or fingerprinting, not a transient hiccup. Switch to residential proxies, or check out how anti-detect browser setups handle canvas and WebGL fingerprints over at antidetectreview.org/blog/ for a deeper look at the fingerprinting layer.
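The title check in this step is case-sensitive and hard-coded, which gets awkward once you track more than two block pages. A small helper keeps the markers in one place. This is a sketch: looks_blocked and BLOCK_MARKERS are hypothetical names, and "attention required" is a marker I added (it appears in the title of Cloudflare's block page) on top of the two from the example above.

```python
# Lowercase substrings that suggest a block or challenge page.
# The first two come from the example above; the third is an added assumption.
BLOCK_MARKERS = ("access denied", "just a moment", "attention required")


def looks_blocked(title, markers=BLOCK_MARKERS) -> bool:
    """Return True if the page title contains any known block marker."""
    lowered = (title or "").lower()
    return any(marker in lowered for marker in markers)


if __name__ == "__main__":
    print(looks_blocked("Just a moment..."))  # challenge page title
    print(looks_blocked("Example Domain"))    # normal page title
```

Inside the scrape function, the condition then becomes if looks_blocked(driver.title), followed by the same raise as before. Extend the tuple as you meet new block pages on your targets.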
step 6: use @request for lighter pages
Not every URL needs a full browser. Botasaurus ships a @request decorator that uses a plain HTTP session with rotated headers but no JavaScript rendering. It’s 10-20x faster and uses a fraction of the memory.
from botasaurus.request import request, Request


@request(proxy=PROXY, parallel=10)
def scrape_api(request: Request, data):
    response = request.get(data["url"])
    return {"url": data["url"], "status": response.status_code}
Use @request for APIs, sitemaps, and any page where you’ve confirmed the data is in the raw HTML. Use @browser for JavaScript-rendered content, login flows, and anything behind Cloudflare.
if it breaks: getting 403s with @request that don’t appear with @browser? The target is checking TLS fingerprints or header order. Fall back to @browser for that domain, or look into a JA3 spoofing proxy layer.
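Once some domains need @browser and others are fine with @request, it helps to route URLs by hostname before dispatching them. The following is a minimal sketch with a hypothetical helper, split_by_fetcher. The set of browser-only domains is something you maintain yourself from observed 403s; nothing here is a Botasaurus feature.

```python
from urllib.parse import urlparse


def split_by_fetcher(urls, browser_domains):
    """Split URLs into (browser_jobs, request_jobs) by exact hostname match."""
    browser_jobs, request_jobs = [], []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        target = browser_jobs if host in browser_domains else request_jobs
        # Same {"url": ...} shape that the scrape functions in this guide expect
        target.append({"url": url})
    return browser_jobs, request_jobs


if __name__ == "__main__":
    browser_jobs, request_jobs = split_by_fetcher(
        ["https://shop.example.com/p/1", "https://api.example.org/items"],
        {"shop.example.com"},
    )
    print(browser_jobs, request_jobs)
```

Each list can then go to the matching decorated function, so the expensive browser workers only see the URLs that actually need them.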
step 7: save output to JSON or a database
By default Botasaurus saves results to an output/ folder as JSON files named after your scraper function. For production I usually push straight to a database instead:
import psycopg2


def save_to_postgres(results):
    # one connection and one transaction for the whole batch
    conn = psycopg2.connect("postgresql://user:pass@localhost/scrapedb")
    cur = conn.cursor()
    for row in results:
        cur.execute(
            "INSERT INTO results (url, title, scraped_at) VALUES (%s, %s, NOW())",
            (row["url"], row.get("title", "")),
        )
    conn.commit()
    cur.close()
    conn.close()


# urls is your list of {"url": ...} dicts, as in step 4
results = scrape_parallel(urls)
save_to_postgres(results)
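if it breaks: connection refused errors usually mean the database isn’t listening on the expected host or port, or your VPS firewall is blocking the connection. Check listen_addresses and port in postgresql.conf and your firewall rules first. pg_hba.conf only comes into play once the server accepts the connection and then rejects the login.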
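One thing the insert loop above doesn’t guard against is rows that are missing a URL or duplicated within a run. A small normalizing step before the insert avoids a KeyError mid-batch. This is a sketch under assumptions: to_rows is a hypothetical helper, and I’m assuming a failed job may show up in the results as None or as a dict without a url key, which you should verify against your own runs.

```python
def to_rows(results):
    """Build (url, title) tuples for insertion, skipping empty and duplicate rows."""
    rows = []
    seen = set()
    for item in results:
        # Tolerate None entries and dicts with no "url" key (assumed failure shapes)
        url = (item or {}).get("url")
        if not url or url in seen:
            continue
        seen.add(url)
        rows.append((url, item.get("title", "")))
    return rows


if __name__ == "__main__":
    sample = [{"url": "a", "title": "A"}, {"url": "a"}, None, {"url": "b"}]
    print(to_rows(sample))
```

The resulting list can be passed to psycopg2’s cursor.executemany with the same INSERT statement, which replaces the explicit for loop.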
step 8: deploy to a VPS and run on a schedule
On a Hetzner or DigitalOcean VPS, the simplest production deploy is a systemd service or a cron job. Here’s the cron approach, which runs the scraper daily at 06:00 server time:
# edit crontab with: crontab -e
0 6 * * * /home/ubuntu/botenv/bin/python /home/ubuntu/scraper/main.py >> /var/log/scraper.log 2>&1
For longer jobs where you want process supervision, write a systemd unit file. Set Restart=on-failure and RestartSec=60 so the service recovers from crashes automatically.
if it breaks: if Chrome crashes silently under cron but works fine interactively, it’s usually a display environment issue. Run headless (headless=True in the decorator), or start a virtual display such as Xvfb and set DISPLAY=:99 to match it.
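For reference, a unit file along those lines might look like the sketch below. The paths are taken from the cron example above. The unit name, the ubuntu user, and the network-online dependency are assumptions you should adapt to your own box.

```ini
# /etc/systemd/system/scraper.service (hypothetical file name)
[Unit]
Description=Botasaurus scraper
After=network-online.target
Wants=network-online.target

[Service]
# Assumed user, inferred from the /home/ubuntu paths
User=ubuntu
WorkingDirectory=/home/ubuntu/scraper
ExecStart=/home/ubuntu/botenv/bin/python /home/ubuntu/scraper/main.py
# Restart only on non-zero exit or crash, after a 60 second pause
Restart=on-failure
RestartSec=60

[Install]
WantedBy=multi-user.target
```

Note that Restart=on-failure does nothing when the script exits cleanly, so this shape suits a long-running worker. For a scheduled batch job, a systemd timer or the cron line above still does the scheduling.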
common pitfalls
Running too many parallel workers too fast. Starting at parallel=10 on a 2-core machine will thrash memory and make your scrapes slower, not faster. Ramp up gradually and monitor with htop.
Using datacenter proxies on targets that score IP reputation. E-commerce and travel sites use IP scoring services. Residential proxies cost more but clear the reputation bar. See the proxy reviews on /blog/ for a breakdown of when each type makes sense.
Not rotating user agents and accept-language headers. Botasaurus handles a lot of this automatically with undetected-chromedriver, but if you’re using @request, you need to pass realistic headers explicitly.
Ignoring the output folder until it’s 20 GB. Botasaurus saves JSONs by default and doesn’t prune old runs. Set up a cleanup cron or redirect output to a database from day one.
Scraping without checking robots.txt. Botasaurus won’t stop you from ignoring Robots Exclusion Protocol rules. Whether to respect them is a legal and operational question specific to your use case. This is not legal advice; consult your own counsel on compliance obligations.
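For the output folder pitfall, the cleanup job can be a few lines of standard-library Python run from cron. This is a sketch: prune_old_json and the 14-day default are my choices. It only touches *.json files directly inside the folder you pass in, so point it at the output/ directory and nothing broader.

```python
import time
from pathlib import Path


def prune_old_json(folder, max_age_days=14, now=None):
    """Delete *.json files older than max_age_days and return the deleted paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    deleted = []
    # glob("*.json") is non-recursive: subfolders are left alone
    for path in Path(folder).glob("*.json"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            deleted.append(path)
    return deleted


if __name__ == "__main__":
    folder = Path("output")
    if folder.is_dir():
        removed = prune_old_json(folder)
        print(f"removed {len(removed)} old file(s)")
```

Age is measured by file modification time, so a file that gets rewritten on every run will never be pruned. That is usually what you want for the latest result set.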
scaling this
10x (basic production): four parallel @browser workers on a single VPS, one rotating residential proxy endpoint, cron scheduling. This handles most data collection jobs at this scale without architectural changes.
100x: multiple VPS instances running the same scraper, jobs coordinated through a Redis queue (RQ or Celery both work). Proxy costs become the dominant variable here. Consider a dedicated proxy plan for scraping to get better per-GB rates at volume.
1000x: at this point you’re operating a distributed crawl cluster. Botasaurus itself doesn’t change much, but the infrastructure around it does: container orchestration (Kubernetes or Nomad), centralized logging (Loki or ELK), per-domain rate limiting enforced at the queue level, and proxy provider contracts with SLA guarantees. You’ll also want to split @request and @browser workers onto separate node pools since their resource profiles are completely different.
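Per-domain rate limiting is the piece of that list that is easiest to get wrong, so here is the core idea as a minimal in-process sketch. DomainRateLimiter is a hypothetical class, not part of Botasaurus. In a real cluster the per-domain state would have to live somewhere shared, such as Redis, so that every worker sees the same schedule.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Hand out per-domain time slots spaced at least min_interval_s apart."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._next_allowed = {}  # hostname -> earliest time of the next request

    def wait_time(self, url):
        """Reserve the next slot for the URL's domain and return seconds to wait."""
        host = (urlparse(url).hostname or "").lower()
        now = self.clock()
        slot = max(now, self._next_allowed.get(host, now))
        # Reserving immediately keeps queued requests for one domain evenly spaced
        self._next_allowed[host] = slot + self.min_interval_s
        return slot - now


if __name__ == "__main__":
    limiter = DomainRateLimiter(min_interval_s=2.0)
    for url in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
        print(url, round(limiter.wait_time(url), 2))
```

The dispatcher calls wait_time before enqueueing a URL and delays the job by the returned number of seconds. Different domains never block each other, which is the point of enforcing the limit at the queue level and not inside each worker.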
where to go next
- Undetected-chromedriver deep dive: Botasaurus builds on top of undetected-chromedriver. Understanding what it does under the hood helps you debug the 20% of cases where Botasaurus doesn’t handle fingerprinting automatically.
- Best proxies for web scraping in 2026: proxy choice is the biggest variable in production scraping success rates. This review covers residential, datacenter, and ISP proxy providers with actual test data.
- Scraping at scale with Playwright and rotating proxies: if you hit a target that Botasaurus’s Chrome setup can’t handle, Playwright is the next tool to reach for. The proxy integration pattern is similar.
Written by Xavier Fok
disclosure: this article may contain affiliate links. If you buy through them we may earn a commission at no extra cost to you. Verdicts are independent of payouts. Last reviewed by Xavier Fok on 2026-05-19.