The 2026 BeautifulSoup guide for production scraping
BeautifulSoup gets dismissed a lot by people who’ve moved on to Playwright or Scrapy. i get it. but after running scrapers in production for a few years now, i keep coming back to bs4 for a specific class of jobs: static or lightly dynamic HTML pages where you control the fetch layer yourself and want surgical precision over what gets parsed. it is not a full scraping framework. that’s the point.
this guide is for operators who are past the tutorial phase. you’ve scraped a few hundred pages, maybe you’ve hit rate limits or CAPTCHAs, and now you want to build something that runs reliably without babysitting it. i’ll walk through setup, a working scraper pattern, the mistakes that cost me hours, and what the architecture looks like when you scale past a few thousand pages a day.
by the end you’ll have a production-ready scraping pattern using BeautifulSoup 4 with proper proxy rotation, error handling, and a storage layer. the code here runs on Python 3.11+ and i’ll be specific about versions, packages, and costs throughout.
what you need
- Python 3.11 or 3.12. bs4 runs on older versions but async support and typing are cleaner on 3.11+.
- beautifulsoup4 4.12.x and lxml as the parser. lxml is faster than html.parser for large documents. install both.
- requests 2.31+ or httpx 0.27+ for the fetch layer. i prefer httpx for async workloads.
- a rotating proxy provider. for production you need this. proxyscraping.com’s residential plan starts around $3/GB as of mid-2026. datacenter proxies are cheaper but get blocked more on protected targets.
- a storage layer. sqlite works fine up to ~500k rows. beyond that, postgres or a file-based approach (jsonlines to S3/R2) is more practical.
- a VPS or small server. a $6/month Hetzner CAX11 (2 vCPU ARM, 4GB RAM) handles 50-100 concurrent requests without complaint.
- optional: redis for deduplication and job queuing if you’re running multi-process.
estimated monthly cost for a moderate scraping operation (100k pages/day): proxy bandwidth $30-80, VPS $6-12, storage negligible. nothing exotic.
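the proxy line is the one that moves most, because it scales with response size rather than page count. here is a rough estimator as a sketch. the function name and the inputs in the example call are illustrative assumptions, not measurements from a real target, so plug in your own average response size.

```python
def monthly_proxy_cost(pages_per_day: int, avg_kb_per_page: float,
                       usd_per_gb: float, days: int = 30) -> float:
    """Estimate monthly proxy bandwidth cost in USD (decimal units: 1 GB = 1,000,000 KB)."""
    total_kb = pages_per_day * days * avg_kb_per_page
    total_gb = total_kb / 1_000_000
    return round(total_gb * usd_per_gb, 2)


# hypothetical inputs: 1,000 pages/day at 100 KB each on a $3/GB plan
estimate = monthly_proxy_cost(pages_per_day=1_000, avg_kb_per_page=100, usd_per_gb=3.0)
print(estimate)
```

measure the compressed response size on a sample of real pages before trusting any number this produces.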
step by step
step 1: install dependencies cleanly
create a virtual environment and pin your versions. don’t install into system python.
python3.12 -m venv .venv
source .venv/bin/activate
pip install beautifulsoup4==4.12.3 lxml==5.2.1 httpx==0.27.0 tenacity==8.3.0
pip freeze > requirements.txt
tenacity handles retry logic declaratively. it saves a lot of boilerplate later.
if it breaks: if lxml fails to build, install system dependencies first: sudo apt install libxml2-dev libxslt1-dev python3-dev on Debian/Ubuntu.
step 2: write the fetch layer with proxy support
separate your fetch logic from your parsing logic from day one. here’s a minimal fetch module:
import random

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

PROXIES = [
    # placeholder endpoints; load from env or config file in production
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
]

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

# retry up to 3 attempts with exponential backoff (2s to 10s between attempts)
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=2, max=10))
def fetch(url: str, timeout: int = 15) -> str:
    # a proxy is picked per attempt, so a retry can land on a different one
    proxy = random.choice(PROXIES)
    with httpx.Client(proxy=proxy, headers=HEADERS, timeout=timeout, follow_redirects=True) as client:
        response = client.get(url)
        # raise on 4xx/5xx so tenacity treats the attempt as failed
        response.raise_for_status()
        return response.text
if it breaks: if you get ProxyError or ConnectTimeout constantly, the proxy credentials or endpoint format may be wrong. check your provider’s httpx-specific docs, since the way you pass a proxy differs from requests: requests takes a proxies dict, while httpx 0.26+ takes a single proxy= argument.
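the hardcoded PROXIES list above is only a placeholder. one way to load it from the environment instead is sketched below. the variable name SCRAPER_PROXIES and both helper names are assumptions for illustration, not part of any provider’s tooling.

```python
import os
import random


def load_proxies(env_var: str = "SCRAPER_PROXIES") -> list[str]:
    """Read a comma-separated list of proxy URLs from an environment variable."""
    raw = os.environ.get(env_var, "")
    proxies = [p.strip() for p in raw.split(",") if p.strip()]
    if not proxies:
        # fail loudly at startup rather than fetching without a proxy
        raise RuntimeError(f"no proxies configured in {env_var}")
    return proxies


def pick_proxy(proxies: list[str]) -> str:
    """Pick one proxy at random, the same way fetch() does with its list."""
    return random.choice(proxies)
```

failing at startup when the variable is empty is deliberate: a scraper that quietly falls back to your server’s own IP is harder to notice than one that refuses to run.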
step 3: parse with BeautifulSoup using lxml
always specify the parser explicitly. if you don’t, bs4 picks one based on what’s installed and your behavior changes across environments.
from bs4 import BeautifulSoup

def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "lxml")
    # select_one returns None when nothing matches, so guard every field
    name = soup.select_one("h1.product-title")
    price = soup.select_one("[data-price]")
    description = soup.select_one("div.description p")
    return {
        "name": name.get_text(strip=True) if name else None,
        "price": price["data-price"] if price else None,
        "description": description.get_text(strip=True) if description else None,
    }
use select_one with CSS selectors rather than find. CSS selectors are more readable, easier to copy from browser devtools, and handle complex queries (attribute selectors, pseudo-classes) more naturally. the BeautifulSoup 4 documentation covers the full selector support.
if it breaks: if fields come back as None, open the raw HTML in a file and inspect it. often the selector matches in devtools but not in the fetched HTML because the content is rendered client-side via JavaScript. bs4 cannot execute JS. if the data isn’t in the raw HTML response, you need Playwright or a JS-rendering service, not bs4.
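a small helper makes that check repeatable. this is a sketch: the function name and the default file path are made up for illustration.

```python
from pathlib import Path


def dump_and_check(html: str, expected_text: str, path: str = "debug_page.html") -> bool:
    """Save the fetched HTML for inspection and report whether expected_text is in it."""
    Path(path).write_text(html, encoding="utf-8")
    return expected_text in html
```

if this returns False for text you can see in the browser, the content is most likely rendered client-side and no selector change will find it.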
step 4: handle encoding and malformed HTML
real-world HTML is messy. lxml is more forgiving than html.parser but you still need to handle encoding correctly.
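import httpx

def fetch_with_encoding(url: str) -> str:
    # pass the same proxy and headers as in step 2; only the basics are shown here
    with httpx.Client(timeout=15, follow_redirects=True) as client:
        response = client.get(url)
        # httpx takes the charset from the Content-Type header
        # and falls back to utf-8 when none is declared.
        # force it if you know the site uses a specific charset
        if "charset" not in response.headers.get("content-type", ""):
            response.encoding = "utf-8"
        return response.text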
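for sites that serve latin-1 or windows-1252 (older e-commerce sites in Southeast Asia especially), httpx decodes correctly as long as the server declares the charset in the Content-Type header. when the header is missing or wrong, httpx assumes utf-8 by default and you get garbled characters. in that case, print response.encoding before returning and override it.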
if it breaks: try response.content (bytes) and pass it directly to BeautifulSoup: BeautifulSoup(response.content, "lxml"). bs4 will handle encoding detection internally using the charset declared in the HTML meta tags.
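if you would rather control the decoding yourself, an ordered fallback over the raw bytes is one option. this is a sketch, not something httpx or bs4 does for you: the function name and the candidate list are assumptions. utf-8 goes first because it is strict, while single-byte encodings accept almost any input.

```python
def decode_with_fallback(
    raw: bytes, candidates: tuple[str, ...] = ("utf-8", "windows-1252")
) -> str:
    """Try each candidate encoding in order and return the first clean decode."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # last resort: decode with replacement characters rather than crash the batch
    return raw.decode(candidates[0], errors="replace")
```

call it on response.content and hand the resulting string to BeautifulSoup as usual.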
step 5: extract structured data at scale
for list pages with many items, select returns all matches as a list. process them in a loop and collect into a list of dicts before writing to storage.
def parse_listing_page(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items = []
    for card in soup.select("div.product-card"):
        link = card.select_one("a.product-link")
        title = card.select_one("span.title")
        price = card.select_one("span.price")
        # a card without a link has no usable identity, so skip it
        if not link:
            continue
        items.append({
            "url": link.get("href"),
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items
guard against missing elements explicitly. select_one returns None when nothing matches, and calling .get_text() on None raises AttributeError. that will either crash the whole batch or, if a broad except swallows it, silently drop rows.
if it breaks: add a temporary print(card) inside the loop on the first few iterations to see what bs4 is actually finding. selectors that look correct sometimes fail because of extra whitespace in class names or because the site uses dynamically generated class strings (common with CSS-in-JS frameworks).
step 6: write to storage
for sqlite, sqlite3 from stdlib is enough for single-process scrapers.
import json
import sqlite3
from datetime import datetime, timezone

def save_items(items: list[dict], db_path: str = "scrape.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data TEXT,
            scraped_at TEXT
        )
    """)
    for item in items:
        # timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12
        scraped_at = datetime.now(timezone.utc).isoformat()
        conn.execute(
            "INSERT OR IGNORE INTO products (url, data, scraped_at) VALUES (?, ?, ?)",
            (item.get("url"), json.dumps(item), scraped_at),
        )
    conn.commit()
    conn.close()
INSERT OR IGNORE with a UNIQUE constraint on URL is your deduplication. simple and works fine up to a few hundred thousand rows.
if it breaks: if you get database is locked errors, you’re running multiple processes writing to the same sqlite file. switch to WAL mode (conn.execute("PRAGMA journal_mode=WAL")) or move to postgres.
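a connection helper that applies both settings could look like this. it is a sketch: the function name and the 30-second timeout are arbitrary choices. WAL lets readers proceed while a write is in progress, but sqlite still allows only one writer at a time, so the timeout controls how long a second writer waits instead of failing immediately.

```python
import sqlite3


def open_db(db_path: str, busy_timeout_s: float = 30.0) -> sqlite3.Connection:
    """Open a sqlite connection in WAL mode with a generous lock timeout."""
    # timeout: how long this connection waits on a locked database before raising
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s)
    # the journal mode is stored in the database file, so it persists across connections
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        raise RuntimeError(f"could not enable WAL, journal_mode is {mode}")
    return conn
```

if several processes still contend heavily for writes after this, that is the signal to move to postgres.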
step 7: put it together with pagination
most scraping jobs involve paginated list pages followed by detail pages. here’s the full loop:
import time

BASE_URL = "https://example.com/products?page={}"

# uses fetch, parse_listing_page and save_items from the earlier steps
def scrape_all_pages(max_pages: int = 50):
    all_items = []
    for page_num in range(1, max_pages + 1):
        url = BASE_URL.format(page_num)
        try:
            html = fetch(url)
            items = parse_listing_page(html)
            if not items:
                print(f"no items on page {page_num}, stopping")
                break
            all_items.extend(items)
            save_items(items)
            time.sleep(1.5)  # be a reasonable citizen
        except Exception as e:
            print(f"page {page_num} failed: {e}")
            continue
    return all_items
the time.sleep(1.5) is not just politeness. sites with rate limiting will block you faster if you hit them at full speed. before running any production scraper, also check the site’s robots.txt (the robots exclusion protocol) and honor any crawl-delay directive it sets.
if it breaks: if pagination stops early, check whether the site uses query parameters, path segments, or POST requests for pagination. use browser devtools network tab to see the exact request the next-page button fires.
common pitfalls
1. using html.parser instead of lxml in production. html.parser is slower and handles malformed HTML differently than lxml. if you test locally with html.parser and deploy with lxml (or vice versa), selectors that work in testing can fail silently in production. always specify the parser explicitly and use lxml for production.
2. assuming the DOM matches what devtools shows. the browser renders the page after executing JavaScript. bs4 only sees the raw server response. a common trap is copying a selector from Chrome devtools and wondering why bs4 returns nothing. before debugging your parsing code, check the raw HTML: curl -s "https://target.com/page" | grep "your-expected-text".
3. no retry logic on the fetch layer. network errors, proxy failures, and transient 5xx responses are normal. without retries you lose data silently. use tenacity or equivalent and log every retry so you can monitor proxy health over time.
4. storing raw HTML instead of parsed data. storing raw HTML is tempting because you can re-parse later. the problem is storage cost grows fast and re-parsing pipelines add complexity. parse on ingest, store structured data, and only keep raw HTML if you genuinely expect the parsing logic to change.
5. hardcoding selectors without any fallback. sites change their markup. if your selector breaks, the field goes to None and you might not notice for days. add a check after scraping: compare the null rate per field against a baseline. if more than 10% of records have a null price field, something changed upstream.
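the null-rate check from pitfall 5 fits in a few lines. this is a sketch with made-up function names, and the 10% threshold is the figure from the text above rather than a universal rule.

```python
def null_rates(records: list[dict], fields: list[str]) -> dict[str, float]:
    """Fraction of records in which each field is missing or None."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) is None) / len(records)
        for field in fields
    }


def fields_over_threshold(
    records: list[dict], fields: list[str], threshold: float = 0.10
) -> list[str]:
    """Names of the fields whose null rate exceeds the threshold."""
    rates = null_rates(records, fields)
    return [field for field in fields if rates[field] > threshold]
```

run it over each batch after parsing and alert on a non-empty result. comparing against the previous run’s rates works better than a fixed threshold for fields that are legitimately sparse.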
scaling this
10x (1k pages/day to 10k pages/day). the single-process pattern above handles this on a $6 VPS without issue. add a URL queue (a simple sqlite table with a status column works) so you can restart without re-scraping.
100x (10k to 100k pages/day). move to async. replace httpx.Client with httpx.AsyncClient and use asyncio.gather with a semaphore to cap concurrency. 50-100 concurrent requests is usually the sweet spot before you start exhausting proxy connections. redis becomes useful here as a shared job queue across multiple workers. the httpx async documentation covers the patterns well.
at this scale you also need proxy pool management. a static list of proxies is not enough. use a provider API to rotate sessions, or use a residential gateway that handles rotation for you. bandwidth costs become a real line item: budget for it upfront.
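if you manage the pool yourself rather than through a provider gateway, the minimum is tracking which proxies keep failing. the class below is a sketch of that idea only. it is not a provider API, and the class name, method names and failure limit are assumptions.

```python
import random


class ProxyPool:
    """Track consecutive failures per proxy and bench the ones that keep failing."""

    def __init__(self, proxies: list[str], max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = {proxy: 0 for proxy in proxies}

    def healthy(self) -> list[str]:
        return [p for p, count in self.failures.items() if count < self.max_failures]

    def pick(self) -> str:
        candidates = self.healthy()
        if not candidates:
            raise RuntimeError("no healthy proxies left")
        return random.choice(candidates)

    def report(self, proxy: str, ok: bool) -> None:
        # a success resets the counter, a failure increments it
        self.failures[proxy] = 0 if ok else self.failures[proxy] + 1
```

call report after every request. a real pool would also let benched proxies back in after a cooldown, which this version leaves out.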
1000x (100k to 1M+ pages/day). bs4’s parsing speed becomes a bottleneck before anything else. lxml’s native API (without bs4 wrapping it) is 3-5x faster for pure extraction tasks. consider whether you actually need bs4 at this point or whether raw lxml or a purpose-built library like selectolax makes more sense. the fetch layer also needs geographic distribution at this scale since all requests originating from one datacenter region get rate-limited faster. if you’re running browser fingerprinting evasion alongside scraping at this scale, the antidetect browser review site covers that tooling in detail.
storage transitions from sqlite to postgres or columnar formats (parquet on object storage) somewhere between 500k and 5M rows depending on query patterns.
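for the file-based route mentioned in the requirements (jsonlines shipped to object storage), the local half is simple. this sketch covers only writing and reading the file. the upload to S3 or R2 is left out, and the function names are illustrative.

```python
import json
from pathlib import Path


def append_jsonl(items: list[dict], path: str) -> int:
    """Append one JSON object per line and return the number of lines written."""
    with Path(path).open("a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


def read_jsonl(path: str) -> list[dict]:
    """Load every non-empty line back into a dict."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

append-only files have no deduplication of their own, so keep the URL queue or a seen-set as the source of truth for what has been scraped.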
where to go next
- how to rotate proxies in Python with requests and httpx covers proxy pool management and session rotation patterns in depth
- requests vs httpx in 2026: which to use for scraping compares the two fetch libraries with benchmarks and a decision framework
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.