The 2026 ScrapingBee guide for production scraping

Most scraping tutorials show you a curl command that works once, then stop. the reality of running scrapers in production is messier: sites update their bot detection, JavaScript payloads grow heavier, and proxy pools rotate unpredictably. i’ve been running scraping pipelines out of Singapore for a few years now, and the tools that survive production are ones that abstract away the browser infrastructure so you can focus on the data layer.

ScrapingBee is a managed scraping API that spins up headless Chrome instances on their end and returns rendered HTML to you over HTTP. you send a request, they handle the browser, the proxy rotation, and most CAPTCHA scenarios. the tradeoff is cost per request versus the ops overhead of running your own cluster. for pipelines under a few million requests a month, the math usually favors the managed API.

this guide is for operators who have already done some scraping and want to move a pipeline to ScrapingBee, or who are evaluating it against rolling their own Playwright cluster. by the end you’ll have a working Python scraper using the ScrapingBee API, with retry logic, error handling, and a pattern that scales from a hundred to a hundred thousand requests.


what you need

  • a ScrapingBee account with API key. the free trial gives 1,000 credits. paid plans start around $49/month for the Freelance tier (check their current pricing page as it changes)
  • Python 3.10+ with requests and python-dotenv installed
  • basic familiarity with HTTP: status codes, headers, query parameters. MDN’s HTTP headers reference is a good bookmark if you’re rusty
  • a .env file to hold secrets, never hardcoded
  • a target site you have permission to scrape, or a sandbox like books.toscrape.com
  • roughly $10-50/month budget for testing at meaningful volume before committing to a plan

step by step

step 1: get your API key and test the basics

after creating your account, find your API key in the dashboard under Account Settings. drop it in a .env file immediately:

SCRAPINGBEE_API_KEY=your_key_here

install dependencies:

pip install requests python-dotenv

the core ScrapingBee pattern is a GET request to their API endpoint with your key and the target URL as query parameters. here’s the minimal version:

import os
import requests
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.getenv("SCRAPINGBEE_API_KEY")

def fetch(url: str) -> str:
    response = requests.get(
        "https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": API_KEY,
            "url": url,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.text

html = fetch("https://books.toscrape.com/")
print(html[:500])

expected output: the first 500 characters of the rendered HTML from books.toscrape.com. if you see a 401, the API key is wrong. if you see a 422, the url parameter is malformed.

if it breaks: check that your .env is in the same directory you’re running Python from, and that load_dotenv() is called before you read the variable.
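a missing key is easy to overlook: os.getenv returns None when the variable is unset, and requests drops query parameters whose value is None, so the call goes out with no api_key at all. a minimal sketch of a fail-fast check follows. require_api_key is a hypothetical helper name, not part of ScrapingBee or python-dotenv.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the ScrapingBee key, or raise with a hint if it is missing."""
    # hypothetical helper: reads the same variable name used in the .env file above
    key = env.get("SCRAPINGBEE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "SCRAPINGBEE_API_KEY is not set; check that .env is in the working "
            "directory and that load_dotenv() ran first"
        )
    return key
```

calling API_KEY = require_api_key() right after load_dotenv() turns a confusing 401 into an immediate, readable error.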


step 2: enable JavaScript rendering

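ScrapingBee’s render_js parameter controls whether the page runs in a headless browser before the HTML comes back. their API documentation has listed render_js as on by default, so confirm the current default and set the parameter explicitly either way. rendering is what you need for sites that load content via XHR or a front-end framework, but it costs more credits per request (typically 5 credits vs 1 for non-rendered), so turn it on only where the page needs it.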

def fetch_js(url: str) -> str:
    response = requests.get(
        "https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": API_KEY,
            "url": url,
            "render_js": "true",
        },
        timeout=90,  # JS rendering needs more time
    )
    response.raise_for_status()
    return response.text

expected output: fully rendered HTML including content that was injected by JavaScript after page load.

if it breaks: if the content still looks empty, the site may be checking for WebGL or canvas fingerprints. look into ScrapingBee’s premium_proxy parameter, which routes through residential IPs with more realistic browser profiles. if you’re researching browser fingerprinting in depth, the antidetectreview.org blog covers the fingerprint surface area in detail.
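to put the credit difference from this step in numbers, here is a rough estimator. it assumes the 1-credit and 5-credit figures quoted above, which may not match your plan or other parameters such as premium_proxy. estimate_credits is an illustrative name.

```python
def estimate_credits(pages: int, js_fraction: float,
                     plain_cost: int = 1, js_cost: int = 5) -> int:
    """Credits for `pages` requests when `js_fraction` of them use render_js."""
    # plain_cost and js_cost are assumptions taken from the figures in this guide
    if not 0.0 <= js_fraction <= 1.0:
        raise ValueError("js_fraction must be between 0 and 1")
    js_pages = round(pages * js_fraction)
    return js_pages * js_cost + (pages - js_pages) * plain_cost


print(estimate_credits(10_000, 1.0))  # 50000
print(estimate_credits(10_000, 0.2))  # 18000
```

rendering only the 20% of pages that need it cuts the bill for this hypothetical job from 50,000 credits to 18,000.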


step 3: pass custom headers and cookies

some sites check for realistic User-Agent strings or require a session cookie to show content. ScrapingBee can pass both through to the target: its documentation has described a forward_headers flag for request headers and a cookies parameter for cookies. check their official API documentation for the exact parameter names and value formats, since they update the API periodically. the example below sends cookies only.

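def fetch_with_headers(url: str, cookies: dict[str, str] | None = None) -> str:
    params = {
        "api_key": API_KEY,
        "url": url,
        "render_js": "true",
    }
    if cookies:
        # assumption: ScrapingBee's docs describe the cookies parameter as a
        # "name=value;name2=value2" string; verify against the current docs
        params["cookies"] = ";".join(f"{name}={value}" for name, value in cookies.items())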

    response = requests.get(
        "https://app.scrapingbee.com/api/v1/",
        params=params,
        timeout=90,
    )
    response.raise_for_status()
    return response.text

expected output: page HTML that reflects the session state implied by your cookies, useful for scraping behind a login wall you’re authorized to access.

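if it breaks: requests percent-encodes query parameters for you, so the usual failure is a cookie value that contains a delimiter character such as a semicolon or equals sign. percent-encode individual values with urllib.parse.quote before building the parameter, and don’t encode the whole string a second time.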
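a small sketch of that sanitizing step, independent of how the cookies are serialized afterwards. quote_cookie_values is a hypothetical helper, and whether the target site expects percent-encoded cookie values is an assumption you should check against a real browser session.

```python
from urllib.parse import quote


def quote_cookie_values(cookies: dict[str, str]) -> dict[str, str]:
    """Percent-encode each cookie value so delimiters cannot leak into the parameter."""
    # safe="" also encodes "/", which quote() leaves untouched by default
    return {name: quote(value, safe="") for name, value in cookies.items()}


print(quote_cookie_values({"sid": "a b;c=d"}))  # {'sid': 'a%20b%3Bc%3Dd'}
```

pass the returned dict to fetch_with_headers in place of the raw one.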


step 4: build a retry wrapper with exponential backoff

ScrapingBee returns a 429 when you hit rate limits and occasionally a 500 on their end when a browser instance fails. production code needs to handle both without crashing your pipeline.

import time
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def fetch_with_retry(url: str, render_js: bool = False, max_retries: int = 4) -> str | None:
    params = {
        "api_key": API_KEY,
        "url": url,
        "render_js": "true" if render_js else "false",
    }
    for attempt in range(max_retries):
        try:
            response = requests.get(
                "https://app.scrapingbee.com/api/v1/",
                params=params,
                timeout=90,
            )
            if response.status_code == 429:
                wait = 2 ** attempt
                logger.warning(f"rate limited, waiting {wait}s (attempt {attempt+1})")
                time.sleep(wait)
                continue
            response.raise_for_status()
            return response.text
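        except (requests.exceptions.Timeout, requests.exceptions.ConnectionError) as e:
            # network-level failure: back off and try again
            logger.warning(f"{type(e).__name__} on attempt {attempt+1} for {url}")
            time.sleep(2 ** attempt)
        except requests.exceptions.HTTPError as e:
            # log the start of the response body; it often names the failure reason
            logger.error(f"HTTP error {e.response.status_code} for {url}: {e.response.text[:200]}")
            if e.response.status_code < 500:
                return None  # client error, don't retry
            time.sleep(2 ** attempt)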
    logger.error(f"all retries exhausted for {url}")
    return None

expected output: either the HTML string on success, or None after all retries fail, with logged context at every failure point.

if it breaks: if you’re seeing persistent 500s, add the ScrapingBee response body to your logs. they often include a human-readable reason like ERR_PROXY_CONNECTION_FAILED that tells you exactly what went wrong.
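one refinement worth considering for the wrapper in this step: when several workers hit a 429 at the same moment, plain 2 ** attempt delays make them all retry together. adding random jitter spreads the retries out. this is a sketch of the "full jitter" variant, not something ScrapingBee prescribes. backoff_delay and its defaults are illustrative.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

to use it, replace time.sleep(2 ** attempt) with time.sleep(backoff_delay(attempt)). the cap keeps late attempts from sleeping for minutes.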


step 5: parse the response with BeautifulSoup

once you have the HTML, parsing is standard. install beautifulsoup4 and lxml:

pip install beautifulsoup4 lxml

from bs4 import BeautifulSoup

def extract_book_titles(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    titles = [h3.find("a")["title"] for h3 in soup.select("article.product_pod h3")]
    return titles

html = fetch_with_retry("https://books.toscrape.com/")
if html:
    print(extract_book_titles(html))

expected output: a list of book title strings from the page.

if it breaks: if your CSS selectors return nothing, dump soup.prettify() to a file and inspect the actual DOM. ScrapingBee occasionally returns a bot-detection interstitial instead of the real page if the target site is aggressively defended. check for the word “captcha” or “access denied” in the response before parsing.
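the check described above can be sketched as a plain substring test. the marker list is only the two phrases mentioned in this step. real block pages vary, and a legitimate page can mention "captcha" too, so treat a match as a signal to inspect rather than proof. looks_blocked is a hypothetical helper name.

```python
BLOCK_MARKERS = ("captcha", "access denied")


def looks_blocked(html: str, markers: tuple[str, ...] = BLOCK_MARKERS) -> bool:
    """True if the page text contains any known block-page phrase (case-insensitive)."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)


print(looks_blocked("<h1>Access Denied</h1>"))          # True
print(looks_blocked("<h3><a title='A Book'></a></h3>"))  # False
```

pairing this with a positive check, such as requiring at least one article.product_pod element, catches both block pages and silently empty ones.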


step 6: handle pagination at scale

most real scraping jobs involve paginated listings. chain requests with a concurrency limit so you don’t slam ScrapingBee’s rate limits or burn credits faster than your plan allows:

import concurrent.futures

def scrape_paginated(base_url: str, page_count: int, workers: int = 5) -> list[str]:
    urls = [f"{base_url}?page={i}" for i in range(1, page_count + 1)]
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(fetch_with_retry, url): url for url in urls}
        for future in concurrent.futures.as_completed(futures):
            html = future.result()
            if html:
                results.append(html)
    return results

expected output: a list of HTML strings, one per successfully fetched page, in completion order rather than page order. pages that exhaust their retries are dropped from the list.

if it breaks: if you’re seeing a lot of 429s, reduce workers to 2-3. your ScrapingBee plan has a concurrent request cap that varies by tier.
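if you need to know which page each result came from, or which pages failed, a variant that keeps the URL alongside the HTML helps. this is a sketch, not a drop-in replacement: fetch_all is a hypothetical name, and the fetch argument stands in for fetch_with_retry or any function that returns HTML or None.

```python
from collections.abc import Callable
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def fetch_all(urls: list[str], fetch: Callable[[str], Optional[str]],
              workers: int = 5) -> tuple[dict[str, str], list[str]]:
    """Fetch urls concurrently; return ({url: html} in input order, failed urls)."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the order of `urls`, not completion order
        bodies = list(executor.map(fetch, urls))
    pages = {url: html for url, html in zip(urls, bodies) if html}
    failed = [url for url, html in zip(urls, bodies) if not html]
    return pages, failed
```

the failed list can go straight back into a second pass with fewer workers, instead of being lost.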


step 7: store results and track credit usage

ScrapingBee returns your remaining credits in the response header spb-remaining-api-calls. log it so you don’t run out mid-pipeline:

def fetch_tracked(url: str) -> tuple[str | None, int | None]:
    response = requests.get(
        "https://app.scrapingbee.com/api/v1/",
        params={"api_key": API_KEY, "url": url},
        timeout=60,
    )
    remaining = response.headers.get("spb-remaining-api-calls")
    response.raise_for_status()
    return response.text, int(remaining) if remaining else None

html, credits_left = fetch_tracked("https://books.toscrape.com/")
if credits_left is not None and credits_left < 500:
    logger.warning(f"low credits: {credits_left} remaining")

persist your scraped HTML to disk or object storage before parsing so you don’t have to re-fetch on a parse failure:

import pathlib

def save_html(url: str, html: str, output_dir: str = "raw_html") -> None:
    pathlib.Path(output_dir).mkdir(exist_ok=True)
    filename = url.replace("https://", "").replace("/", "_")[:100] + ".html"
    (pathlib.Path(output_dir) / filename).write_text(html, encoding="utf-8")

if it breaks: file name sanitization is rough above. if you have unusual URLs, switch to a hash: import hashlib; filename = hashlib.md5(url.encode()).hexdigest() + ".html".
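a slightly fuller version of the hashing idea, as a sketch. it swaps md5 for SHA-256 (either works for file naming) and keeps the hostname in the name so the directory stays readable. cache_filename is an illustrative name.

```python
import hashlib
import re
from urllib.parse import urlsplit


def cache_filename(url: str) -> str:
    """Stable, filesystem-safe name: sanitized hostname plus a short URL hash."""
    host = urlsplit(url).hostname or "unknown"
    # keep only characters that are safe in file names on common filesystems
    host = re.sub(r"[^A-Za-z0-9.-]", "_", host)
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{host}_{digest}.html"
```

the same URL always maps to the same file, and query strings or long paths can no longer produce invalid or truncated names.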


common pitfalls

using JS rendering for every request. JS rendering costs 5x the credits. audit your target pages first. static HTML endpoints like sitemaps, RSS feeds, and many product listing pages don’t need it. benchmark with and without before defaulting to render_js=True everywhere.

not caching raw HTML. if your parse logic has a bug on page 50,000, you want to fix the parser and reprocess, not re-scrape. always write raw HTML to disk or S3 before you transform it. the rotating proxies guide on this site has more on structuring your pipeline to separate fetch from parse.

ignoring the spb-remaining-api-calls header. pipelines that silently run to zero credits stop in the middle of a job with no useful error message. build credit alerting from day one.

hardcoding delays instead of backing off on 429. a fixed time.sleep(2) between requests is wasteful when you’re under your rate limit and insufficient when you’re over it. exponential backoff keyed on the 429 response is the right pattern, as shown in step 4.

not checking what you actually got. ScrapingBee returns 200 even when the target site served a CAPTCHA page or a login redirect. always assert that key content exists in the response before treating the request as successful.


scaling this

10x (thousands of requests/day): the pattern above works as-is. use 3-5 concurrent workers, cache to local disk, and monitor the credits header. the Freelance plan is usually enough here.

100x (tens of thousands/day): move your pipeline off a laptop and onto a small VPS or a scheduled cloud function. swap local disk caching for S3. add a job queue (Redis + RQ, or SQS) so you can restart failed batches without re-scraping successes. upgrade to ScrapingBee’s Startup or Business tier. for proxy strategy at this scale, see the scraper library overview on this site.
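before reaching for a full job queue, the "restart without re-scraping successes" part can be sketched with nothing more than the raw HTML cache. the helper below assumes one cached file per URL, named by a SHA-256 hash. cache_path and pending_urls are hypothetical names, and with S3 you would swap the file checks for object lookups.

```python
import hashlib
import pathlib


def cache_path(url: str, cache_dir: str = "raw_html") -> pathlib.Path:
    """Hypothetical naming scheme: one file per URL, named by its SHA-256 hash."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return pathlib.Path(cache_dir) / f"{digest}.html"


def pending_urls(urls: list[str], cache_dir: str = "raw_html") -> list[str]:
    """URLs with no non-empty cached file yet, i.e. the ones still worth fetching."""
    todo = []
    for url in urls:
        path = cache_path(url, cache_dir)
        if not (path.is_file() and path.stat().st_size > 0):
            todo.append(url)
    return todo
```

running pending_urls over the full URL list at the start of each batch means a crashed job only pays credits for the pages it never saved.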

1000x (hundreds of thousands/day): at this volume you’re spending serious money on a managed API. run a cost-per-successful-fetch calculation quarterly and compare against self-hosted Playwright with a residential proxy pool. the ops overhead goes up but the per-request cost drops significantly. also start thinking about domain-specific rate limiting: spreading requests to the same domain across a longer window reduces blocks regardless of which infrastructure you use. RFC 9110, the HTTP semantics spec, is worth reading if you want to understand exactly what request and response behavior you’re trying to replicate at scale.
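domain-specific rate limiting can be sketched as a small throttle that enforces a minimum gap per hostname. this is a single-threaded illustration under that assumption. DomainThrottle is a hypothetical class, and with a thread pool you would need a lock around wait().

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum gap between requests to the same hostname."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._next_allowed: dict[str, float] = {}

    def wait(self, url: str) -> float:
        """Block until the url's host may be hit again; return the seconds slept."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # schedule the next slot for this host, whether or not we had to sleep
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

calling throttle.wait(url) before each fetch keeps requests to one domain spaced out while leaving other domains unaffected.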


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
