How to scrape Booking.com at scale in 2026 with proxies that work
Booking.com is one of the most aggressively anti-bot platforms on the internet. they run Akamai Bot Manager, enforce strict TLS fingerprinting, serve javascript challenges on suspicious sessions, and quietly return stale or fabricated pricing data to IPs they’ve flagged, without ever returning a 403. that last part is the one that burns operators. you think you’re collecting clean data and only find out during QA a week later that half your records are garbage.
this guide is for data engineers, travel tech operators, and price intelligence teams who need reliable hotel availability, room-level pricing, review counts, and listing metadata from Booking.com. i’m assuming you already know why you’re scraping it; you need the how. by the end of this you’ll have a working Python pipeline that uses rotating residential proxies, handles js-rendered content, validates output, and can be pushed from 100 requests/day to 100,000 requests/day without rebuilding from scratch.
this is not legal advice. scraping public data occupies a legally contested space. review Booking.com’s terms of service and consult your own legal counsel before collecting data commercially.
what you need
- python 3.11+ with playwright, httpx, parsel, pydantic, redis-py
- rotating residential proxy plan: minimum 10GB for a test run. ProxyScraping residential starts around $4/GB, Bright Data and Oxylabs are 5-8x that but have better Booking.com-specific success rates. i’ll use ProxyScraping in this guide
- a Redis instance: for request deduplication and session state. a $7/month DigitalOcean droplet is fine
- postgres or bigquery: for storing results. postgres works fine under 10M rows
- a Booking.com URL list: property IDs, search result pages, or destination slugs depending on your use case
- budget estimate: 10,000 property pages at ~150KB each = ~1.5GB proxy traffic = roughly $6-10. factor in retries and you’re closer to $15-20 per 10k properties
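the budget arithmetic above can be sketched as a small helper, so you can plug in your own page weight and per-GB price. the function name and the retry_overhead multiplier are my own assumptions for illustration; the 2.5 figure is not measured, it is just the multiplier that lands in the $15-20 range quoted above.

```python
def estimate_proxy_cost(
    pages: int,
    kb_per_page: float = 150.0,
    usd_per_gb: float = 4.0,
    retry_overhead: float = 1.0,
) -> float:
    """Return an estimated proxy spend in USD.

    retry_overhead is a multiplier on traffic: 1.0 means no retries,
    2.5 means retries and wasted requests add 150% on top (hypothetical).
    """
    gigabytes = pages * kb_per_page / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return round(gigabytes * usd_per_gb * retry_overhead, 2)


# 10,000 pages at ~150KB each is 1.5GB of traffic
base_cost = estimate_proxy_cost(10_000)
padded_cost = estimate_proxy_cost(10_000, retry_overhead=2.5)
```

a full browser render usually pulls more than the HTML document alone, so treat kb_per_page as something to measure on your own first test run rather than a constant.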
step by step
step 1: set up your environment
python -m venv venv && source venv/bin/activate
pip install playwright httpx parsel pydantic redis psycopg2-binary
playwright install chromium
create a .env file:
PROXY_HOST=rp.proxyscraping.com
PROXY_PORT=31112
PROXY_USER=your_username
PROXY_PASS=your_password
REDIS_URL=redis://localhost:6379/0
PG_DSN=postgresql://user:pass@localhost/booking
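nothing in the pip install line above reads a .env file on its own, so here is a minimal stdlib sketch that loads it and assembles a proxy URL from the PROXY_* variables. load_env and build_proxy_url are hypothetical helper names, and the http:// scheme is an assumption; check your provider’s dashboard for the scheme it expects.

```python
import os
from urllib.parse import quote


def load_env(path: str = ".env") -> None:
    """Copy KEY=VALUE lines from a .env file into os.environ.

    Variables that are already set in the environment are left untouched.
    """
    if not os.path.exists(path):
        return
    with open(path, encoding="utf-8") as fh:
        for raw_line in fh:
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())


def build_proxy_url() -> str:
    """Assemble scheme://user:pass@host:port from the PROXY_* variables."""
    # percent-encode credentials so characters like "@" or ":" survive URL parsing
    user = quote(os.environ["PROXY_USER"], safe="")
    password = quote(os.environ["PROXY_PASS"], safe="")
    host = os.environ["PROXY_HOST"]
    port = os.environ["PROXY_PORT"]
    return f"http://{user}:{password}@{host}:{port}"
```

call load_env() once at startup, then build_proxy_url() wherever a proxy_url argument is needed. if you would rather use python-dotenv, that works too, but it is an extra dependency that is not in the install line above.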
if it breaks: if playwright install fails on a headless server, run playwright install-deps chromium first. on Ubuntu 22.04 you may also need apt-get install -y libnss3 libatk-bridge2.0-0.
step 2: build a session factory with proper browser fingerprints
Booking.com’s bot detection leans heavily on TLS fingerprint matching and navigator properties. a bare [requests](https://requests.readthedocs.io/) session will get you blocked within minutes. use playwright with a real chromium binary and randomize the viewport and locale on each session.
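import asyncio
import os
import random
from urllib.parse import unquote, urlsplit

from playwright.async_api import async_playwright

LOCALES = ["en-GB", "en-US", "de-DE", "fr-FR", "nl-NL"]
VIEWPORTS = [(1280, 800), (1366, 768), (1440, 900), (1920, 1080)]

async def make_browser_context(playwright, proxy_url: str, timezone_id: str = "Europe/London"):
    # playwright takes proxy credentials as separate fields, so split them out of the URL
    parts = urlsplit(proxy_url)
    host = parts.netloc.rpartition("@")[2]
    proxy = {"server": f"{parts.scheme}://{host}"}
    if parts.username:
        proxy["username"] = unquote(parts.username)
        proxy["password"] = unquote(parts.password or "")
    browser = await playwright.chromium.launch(headless=True, proxy=proxy)
    width, height = random.choice(VIEWPORTS)
    ctx = await browser.new_context(
        locale=random.choice(LOCALES),
        viewport={"width": width, "height": height},
        user_agent=None,  # let playwright use chromium's native UA
        timezone_id=timezone_id,  # should match the proxy's exit country
    )
    return browser, ctx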
the timezone_id matters. booking.com’s js checks Intl.DateTimeFormat().resolvedOptions().timeZone and compares it against the IP geolocation. set it to a timezone that matches your proxy’s exit country.
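one way to keep timezone and locale aligned with the exit country is a small lookup table. this is only a sketch: EXIT_PROFILES and context_options are hypothetical names, the table covers just the five locales used in this guide, and a country with several timezones (the US here) needs a finer mapping if your provider exposes city or state targeting.

```python
# hypothetical mapping from proxy exit country to browser context settings
EXIT_PROFILES = {
    "GB": {"timezone_id": "Europe/London", "locale": "en-GB"},
    "US": {"timezone_id": "America/New_York", "locale": "en-US"},
    "DE": {"timezone_id": "Europe/Berlin", "locale": "de-DE"},
    "FR": {"timezone_id": "Europe/Paris", "locale": "fr-FR"},
    "NL": {"timezone_id": "Europe/Amsterdam", "locale": "nl-NL"},
}


def context_options(exit_country: str) -> dict:
    """Return timezone_id and locale for a two-letter exit country code."""
    code = exit_country.upper()
    if code not in EXIT_PROFILES:
        # fail loudly instead of silently falling back to a mismatched timezone
        raise ValueError(f"no browser profile defined for exit country {code!r}")
    return dict(EXIT_PROFILES[code])
```

both keys are regular new_context arguments in playwright, so the result can be unpacked directly: await browser.new_context(**context_options("DE")).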
if it breaks: if you see net::ERR_PROXY_CONNECTION_FAILED, confirm your proxy credentials are correct and that the proxy supports HTTPS CONNECT tunneling, not just HTTP forward proxying.
step 3: construct target URLs correctly
for property-level pages the canonical pattern is:
https://www.booking.com/hotel/{country_code}/{property_slug}.html?checkin=2026-06-01&checkout=2026-06-03&group_adults=2&no_rooms=1
always include checkin/checkout dates. without them, Booking.com returns a different page layout and suppresses pricing entirely. use dates 2-4 weeks out for realistic pricing. for search result pages:
https://www.booking.com/searchresults.html?ss=Singapore&checkin=2026-06-01&checkout=2026-06-03&group_adults=2
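hardcoded dates go stale quickly, so it helps to generate both URL patterns with a rolling check-in date. a minimal sketch, assuming the two URL shapes shown above; stay_params, property_url, and search_url are hypothetical helper names, and the 21-day default is just a point inside the 2-4 week window.

```python
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlencode


def stay_params(days_out: int = 21, nights: int = 2, adults: int = 2,
                rooms: int = 1, today: Optional[date] = None) -> dict:
    """Build the date and occupancy query params for a stay starting days_out from today."""
    checkin = (today or date.today()) + timedelta(days=days_out)
    checkout = checkin + timedelta(days=nights)
    return {
        "checkin": checkin.isoformat(),
        "checkout": checkout.isoformat(),
        "group_adults": adults,
        "no_rooms": rooms,
    }


def property_url(country_code: str, property_slug: str, **stay) -> str:
    """URL for a single property page, always with dates attached."""
    query = urlencode(stay_params(**stay))
    return f"https://www.booking.com/hotel/{country_code}/{property_slug}.html?{query}"


def search_url(destination: str, **stay) -> str:
    """URL for a search results page; urlencode escapes spaces in the destination."""
    query = urlencode({"ss": destination, **stay_params(**stay)})
    return f"https://www.booking.com/searchresults.html?{query}"
```

the today argument exists only to make the output reproducible in tests; in the pipeline you would leave it out.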
store your URL queue in Redis as a sorted set, scored by priority:
import os

import redis

r = redis.from_url(os.getenv("REDIS_URL"))

def enqueue(urls: list[str], priority: float = 1.0):
    # one pipeline round trip for the whole batch instead of one per URL
    pipe = r.pipeline()
    for url in urls:
        pipe.zadd("queue:booking", {url: priority})
    pipe.execute()
if it breaks: if results come back with different hotel counts than expected, check that you’re not hitting a regional redirect. Booking.com will redirect booking.com to booking.com/en-gb/ or similar based on IP geolocation. follow redirects and log the final URL.
step 4: extract data with parsel, validate with pydantic
once playwright has rendered the page, pass the HTML to parsel (not BeautifulSoup, parsel is faster and supports both CSS and XPath):
import re
from typing import Optional

from parsel import Selector
from pydantic import BaseModel

class PropertyListing(BaseModel):
    property_id: str
    name: str
    price_eur: Optional[float] = None
    review_score: Optional[float] = None
    review_count: Optional[int] = None
    stars: Optional[int] = None
    url: str

def _to_float(text: str) -> Optional[float]:
    # strip the currency symbol, thousands separators and whitespace (including nbsp)
    cleaned = re.sub(r"[€,\s]", "", text)
    return float(cleaned) if re.fullmatch(r"\d+(?:\.\d+)?", cleaned) else None

def parse_property(html: str, url: str) -> PropertyListing:
    sel = Selector(html)
    name = sel.css('h2[data-testid="title"]::text').get("").strip()
    price_raw = sel.css('[data-testid="price-and-discounted-price"] span::text').getall()
    score_text = sel.css('[data-testid="review-score-right-component"] div::text').get("")
    review_count_text = sel.css('[data-testid="review-score-right-component"] span::text').get("")
    review_digits = "".join(filter(str.isdigit, review_count_text))
    # the slug is the last path segment, minus the query string and ".html"
    prop_id = url.split("?")[0].split("/")[-1].replace(".html", "")
    return PropertyListing(
        property_id=prop_id,
        name=name,
        price_eur=_to_float("".join(price_raw)),
        review_score=_to_float(score_text),
        review_count=int(review_digits) if review_digits else None,
        # no star elements found is stored as unknown rather than zero stars
        stars=len(sel.css('[data-testid="rating-stars"] span').getall()) or None,
        url=url,
    )
if it breaks: Booking.com updates its data-testid attributes frequently. if name comes back empty, open the rendered HTML in browser devtools and search for the hotel name text to find the new selector. keep selectors in a config file, not hardcoded in the parser.
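the config-file idea could look something like this: each field maps to an ordered list of candidate selectors, and the parser tries them in turn. everything here is a sketch under assumptions. SELECTORS_JSON stands in for a selectors.json file on disk, first_match is a hypothetical helper, and the h1::text fallback is invented for illustration, not a selector i have verified on the live site.

```python
import json
from typing import Callable, Optional

# stand-in for the contents of a selectors.json file (hypothetical)
SELECTORS_JSON = """
{
  "name": ["h2[data-testid='title']::text", "h1::text"],
  "score": ["[data-testid='review-score-right-component'] div::text"]
}
"""

SELECTORS = json.loads(SELECTORS_JSON)


def first_match(css_get: Callable[[str], Optional[str]], field: str) -> Optional[str]:
    """Try each candidate selector for a field and return the first non-empty hit.

    css_get is any callable that takes a CSS query and returns text or None.
    """
    for query in SELECTORS[field]:
        value = css_get(query)
        if value and value.strip():
            return value.strip()
    return None
```

with parsel the callable is a one-liner: first_match(lambda q: sel.css(q).get(), "name"). when a selector breaks, you edit the JSON and redeploy nothing.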
step 5: implement retry logic with exponential backoff
a 200 response does not mean success on Booking.com. watch for these patterns in the response HTML that indicate a soft block:
- <title>Access Denied</title>
- <div class="av-container"> (Akamai interstitial)
- price fields present but all showing the same suspiciously round number
import asyncio
import random

from playwright.async_api import async_playwright

MAX_RETRIES = 4
BASE_DELAY = 2.0

async def fetch_with_retry(url: str, proxy_url: str) -> str | None:
    for attempt in range(MAX_RETRIES):
        try:
            async with async_playwright() as p:
                browser, ctx = await make_browser_context(p, proxy_url)
                try:
                    page = await ctx.new_page()
                    await page.goto(url, wait_until="networkidle", timeout=30000)
                    html = await page.content()
                finally:
                    await browser.close()  # close the browser even when goto times out
            if "Access Denied" in html or "av-container" in html:
                raise ValueError("soft block detected")
            return html
        except Exception:
            if attempt == MAX_RETRIES - 1:
                break  # no point sleeping after the final attempt
            # exponential backoff with jitter: roughly 2s, 4s, 8s plus up to 1s
            delay = BASE_DELAY * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
    return None
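the retry loop only checks the two HTML markers. the third pattern, identical round prices, needs the parsed numbers rather than the raw HTML. a rough sketch of both checks as standalone functions; the names, the minimum sample of 5, and the multiple-of-50 rule are assumptions you should tune against your own data.

```python
SOFT_BLOCK_MARKERS = ("<title>Access Denied</title>", 'class="av-container"')


def has_block_marker(html: str) -> bool:
    """True when the page contains one of the known interstitial markers."""
    return any(marker in html for marker in SOFT_BLOCK_MARKERS)


def prices_look_fabricated(prices: list, min_sample: int = 5, round_to: int = 50) -> bool:
    """Flag a batch when every price is identical and a multiple of round_to.

    Small batches are never flagged, because two or three equal prices
    can easily be genuine.
    """
    if len(prices) < min_sample:
        return False
    first = prices[0]
    return all(price == first for price in prices) and first % round_to == 0
```

feed prices_look_fabricated a list of prices from one search page or one batch of properties served through the same exit IP. a hit is a reason to requeue the batch, not proof of a block.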
if it breaks: if you’re hitting soft blocks consistently on the same proxy exit IPs, lower your concurrency or switch to a proxy provider with a larger residential pool. more IPs per country = better rotation coverage.
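lowering concurrency is easiest when it is a single number. a minimal sketch using asyncio.Semaphore; crawl is a hypothetical wrapper, and the fetch argument is whatever coroutine you use to get a page.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional


async def crawl(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[Optional[str]]],
    max_concurrency: int = 5,
) -> dict:
    """Fetch all urls with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str):
        # each task waits here until one of the slots is free
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

with the retry function from this step, usage would look like asyncio.run(crawl(urls, lambda u: fetch_with_retry(u, proxy_url), max_concurrency=3)). failed URLs come back with a None value, so they are easy to requeue.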
step 6: write results to postgres and mark deduplication
import hashlib

import psycopg2

def save_listing(conn, listing: PropertyListing, raw_html: str):
    # hash the raw HTML so unchanged pages can be detected later
    html_hash = hashlib.sha256(raw_html.encode()).hexdigest()
    with conn.cursor() as cur:
        # the conflict target must match a unique index on (property_id, (scraped_at::date));
        # an expression inside the target needs its own parentheses
        cur.execute("""
            INSERT INTO booking_listings
                (property_id, name, price_eur, review_score, review_count, stars, url, html_hash, scraped_at)
            VALUES (%s,%s,%s,%s,%s,%s,%s,%s,NOW())
            ON CONFLICT (property_id, (scraped_at::date)) DO UPDATE
            SET price_eur = EXCLUDED.price_eur,
                html_hash = EXCLUDED.html_hash
        """, (listing.property_id, listing.name, listing.price_eur,
              listing.review_score, listing.review_count, listing.stars,
              listing.url, html_hash))
    conn.commit()
mark the URL as done in Redis:
r.zrem("queue:booking", url)
r.sadd("done:booking", url)
if it breaks: if you see unique constraint violations, check that your scraped_at::date cast matches your postgres timezone setting. set timezone = 'UTC' in postgresql.conf to avoid surprises.
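the upsert only works if the table has a unique index on the property and the scrape date. the guide does not show the schema, so the following is my assumption of a layout that fits the INSERT columns, wrapped in a hypothetical ensure_schema helper. note that scraped_at is a plain timestamp (no time zone): postgres will not index a date cast of a timestamptz column, because that cast depends on the session timezone.

```python
# assumed schema; column types are a guess that fits the PropertyListing fields
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS booking_listings (
    property_id   text       NOT NULL,
    name          text       NOT NULL,
    price_eur     numeric(10, 2),
    review_score  numeric(3, 1),
    review_count  integer,
    stars         smallint,
    url           text       NOT NULL,
    html_hash     char(64)   NOT NULL,
    scraped_at    timestamp  NOT NULL
);

-- one row per property per day; the expression needs its own parentheses
CREATE UNIQUE INDEX IF NOT EXISTS booking_listings_property_day
    ON booking_listings (property_id, (scraped_at::date));
"""


def ensure_schema(conn) -> None:
    """Create the table and the per-day unique index if they do not exist yet."""
    with conn.cursor() as cur:
        cur.execute(SCHEMA_SQL)
    conn.commit()
```

because NOW() is converted to the session timezone when it is stored in a plain timestamp column, the UTC setting recommended above also decides which calendar day a late-night scrape lands on.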
common pitfalls
1. trusting 200 OK responses blindly. Booking.com returns 200 with degraded content to suspected bots. always validate that key fields (price, name) are present before marking a record as clean.
2. reusing playwright browser contexts across requests. each context accumulates cookies and local storage state that Booking.com’s fingerprinting engine tracks across sessions. create a fresh context per request, or at most per batch of 3-5 requests from the same proxy IP.
3. scraping without check-in dates. pages without dates return a layout variant that suppresses prices and availability. every URL in your queue should include checkin, checkout, group_adults, and no_rooms params.
4. ignoring IP geolocation/timezone mismatch. using a German residential proxy but setting browser timezone to Asia/Singapore is a fingerprinting signal. match your playwright timezone_id to the proxy’s exit country.
5. not refreshing data frequently enough. Booking.com pricing updates multiple times per day. if you’re selling price intelligence, once-a-day scrapes aren’t competitive. plan your pipeline for 4-6 hour refresh cycles on high-value properties.
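pitfall 1 is the cheapest one to automate. a small sketch of a field-level check that runs before a record is marked clean; validation_errors is a hypothetical helper, and the 0-10 range assumes Booking.com’s 10-point review scale.

```python
from typing import Optional


def validation_errors(name: str, price_eur: Optional[float],
                      review_score: Optional[float]) -> list:
    """Return a list of problems; an empty list means the record looks clean."""
    errors = []
    if not name.strip():
        errors.append("missing name")
    if price_eur is None:
        errors.append("missing price")
    elif price_eur <= 0:
        errors.append("non-positive price")
    # a missing score is fine (new listings), an out-of-range one is not
    if review_score is not None and not 0 <= review_score <= 10:
        errors.append("review score out of range")
    return errors
```

call it as validation_errors(listing.name, listing.price_eur, listing.review_score) and only remove the URL from the queue when the list is empty. logging the error strings per proxy exit country makes it easier to spot which part of the pool is being served degraded pages.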
scaling this
10x (1,000 properties/day): a single Python process with asyncio and 5-10 concurrent playwright instances handles this. one t3.medium EC2 instance (~$30/month) is sufficient.
100x (10,000 properties/day): split the URL queue into shards and run multiple worker processes across 2-3 machines. use Redis streams instead of sorted sets for better multi-consumer support. proxy spend hits $150-200/month at this scale. if you’re managing multiple scraping targets alongside this, the account management patterns at multiaccountops.com/blog/ are worth reading for keeping proxy pools cleanly separated per target.
1000x (100,000+ properties/day): you need a proper job queue (Celery with Redis broker, or a managed queue like SQS). containerize the worker with Docker, run on Kubernetes or ECS with autoscaling. at this volume, consider a dedicated residential proxy plan with a Booking.com-specific proxy rotation API rather than a generic rotating endpoint. also factor in storage: 100k properties at 150KB HTML each = 15GB/day of raw storage if you’re keeping originals.
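splitting the URL queue into shards can be as simple as hashing each URL to a shard number, so every worker owns a stable slice of the queue. a sketch under assumptions: shard_for and queue_key are hypothetical helpers, and the queue:booking:{n} key layout is my extension of the single queue:booking key used earlier.

```python
import hashlib


def shard_for(url: str, shard_count: int) -> int:
    """Map a URL to a shard index in [0, shard_count).

    sha256 is used instead of the built-in hash(), which is randomized
    per process and would send the same URL to different shards.
    """
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def queue_key(url: str, shard_count: int) -> str:
    """Redis key of the shard queue that owns this URL."""
    return f"queue:booking:{shard_for(url, shard_count)}"
```

changing shard_count reshuffles most URLs, so drain the queues before resizing. if resizing happens often, consistent hashing is the usual answer, at the cost of more code.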
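HTTP semantics are defined in RFC 9110 and HTTP/2 itself in RFC 9113, and both matter at scale. Booking.com uses HTTP/2 aggressively. a scraper that doesn’t negotiate H2 during the TLS handshake looks different from a real browser before it even sends a request, and is likely to fail TLS-level fingerprint checks. playwright handles this natively, but if you switch to a pure httpx-based approach for performance, verify that H2 is enabled: httpx only speaks HTTP/2 when it is installed with the http2 extra (pip install "httpx[http2]") and the client is created with http2=True.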
to understand which signals Booking.com is likely using to classify bots, the MDN documentation on the User-Agent header and surrounding request headers is a useful reference for what a “normal” browser request looks like.
where to go next
- residential proxy providers compared for scraping in 2026 - a breakdown of ProxyScraping, Bright Data, Oxylabs, Smartproxy, and IPRoyal tested against common anti-bot targets
- how to scrape Airbnb listings with playwright and rotating proxies - same core technique applied to Airbnb, which runs a different bot detection stack
- building a price monitoring pipeline with postgres and dbt - turning raw scraped data into a queryable price history dataset
browse more guides at the proxyscraping.org blog index.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.