
How to bypass Cloudflare 403s with Playwright plus residential proxies

You spin up a Playwright script, point it at a target site, and within three requests you’re staring at a 403. sometimes it’s a Cloudflare interstitial. sometimes it’s a silent block, HTML that looks right but contains no data. either way, your scraper is dead and you burned proxy credits on nothing.

this is not a beginner problem. i’ve seen it happen to people running mature scraping operations who switched frameworks, upgraded Node, or hit a site that silently tightened its bot rules. the combination of Cloudflare’s bot management product and a naive Playwright setup is one of the most common failure patterns i encounter, and fixing it properly requires understanding a few distinct layers: TLS fingerprinting, browser automation signals, IP reputation, and behavioral analysis. getting one of them right and ignoring the others will still get you blocked.

the good news is that residential proxies plus a properly configured Playwright context can handle a large share of real-world Cloudflare deployments. the bad news is “properly configured” is doing a lot of work in that sentence. this post is the setup i actually use in production, including the parts that took me longer than i’d like to admit to figure out.

background and prior art

Cloudflare’s bot detection has gone through several distinct generations. the original challenge page (the “one moment while we check your browser” spinner from the early 2010s) was a JavaScript proof-of-work check that any headless browser could solve with minor effort. Cloudflare Bot Management, which became a paid product around 2019-2020, introduced behavioral scoring, TLS fingerprinting via JA3 hashes, and machine learning models trained on real traffic patterns. the current iteration also does browser environment fingerprinting: it checks for properties like navigator.webdriver, canvas rendering signatures, WebGL renderer strings, and timing anomalies in how JavaScript events fire.

the research community has been tracking this arms race. the Cloudflare bot management documentation acknowledges it scores every request and assigns a bot score of 1-99, where 1 means almost certainly automated and 99 means almost certainly human. sites using the Business or Enterprise plan can set firewall rules that block below a threshold score. the threshold is set by the site operator, not Cloudflare, so the same bot score that passes on one site might block on another. this is important, because it means there’s no single fix. you’re fighting both Cloudflare’s scoring and whatever the target site’s operator has configured.

on the tooling side, the go-to stealth library for years was puppeteer-extra-plugin-stealth, which patches a set of known automation leaks in Chromium. playwright-stealth ports came later. more recently, projects like Camoufox take a different approach by shipping a patched Firefox binary rather than trying to patch leaks at the JS layer. each approach has tradeoffs we’ll get into.

the core mechanism

when Playwright launches a Chromium context without any stealth configuration, it leaks automation signals at multiple layers simultaneously.

layer 1: TLS fingerprinting. before your HTTP request even arrives at the application layer, your TLS handshake is fingerprinted. the JA3 fingerprint is computed from the combination of TLS version, cipher suites, extensions, elliptic curves, and elliptic curve point formats your client advertises. a stock Playwright/Chromium build has a known JA3 signature that differs from organic Chrome traffic because the cipher suite ordering is slightly different and some extensions that real Chrome sends are missing or reordered. Cloudflare logs these and scores them. if your JA3 matches a known scraper or headless browser fingerprint, you’re starting with a bot-like score before any JavaScript runs. the more recent JA4 fingerprint adds more signal and is harder to spoof.

residential proxies don’t help here, because the TLS handshake still originates from your machine, not the proxy. to actually fix JA3 you need either a patched Chromium that reorders ciphers to match real Chrome, or a library like cycletls for non-browser HTTP. for browser-based scraping, the practical options are Camoufox (uses Firefox’s TLS stack, which has a different but legitimate JA3) or a custom Chromium build.
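to make the fingerprint concrete, here’s a minimal sketch of how the pre-hash JA3 string is assembled from the five fields listed above. the field names, the ClientHello interface, and the sample values are my own illustration, not a capture from any real browser. the published JA3 fingerprint is the MD5 of this string, which i leave out to keep the sketch dependency-free.

```typescript
// fields a client advertises in its TLS ClientHello, as decimal numbers
// (hypothetical shape, for illustration only)
interface ClientHello {
  tlsVersion: number;
  cipherSuites: number[];
  extensions: number[];
  ellipticCurves: number[];
  pointFormats: number[];
}

// GREASE placeholders (0x0a0a, 0x1a1a, ... 0xfafa) are ignored by JA3
const isGrease = (v: number): boolean =>
  (v & 0x0f0f) === 0x0a0a && (v >> 8) === (v & 0xff);

// builds the pre-hash string: five comma-separated fields,
// with the values inside each field joined by dashes
function ja3String(hello: ClientHello): string {
  const join = (values: number[]): string =>
    values.filter(v => !isGrease(v)).join('-');
  return [
    String(hello.tlsVersion),
    join(hello.cipherSuites),
    join(hello.extensions),
    join(hello.ellipticCurves),
    join(hello.pointFormats),
  ].join(',');
}

// made-up sample values; 2570 (0x0a0a) is a GREASE value and gets dropped
const sampleHello: ClientHello = {
  tlsVersion: 771,
  cipherSuites: [2570, 4865, 4866, 4867],
  extensions: [0, 23, 65281],
  ellipticCurves: [29, 23, 24],
  pointFormats: [0],
};
const sampleJa3 = ja3String(sampleHello);
```

the point of the sketch: ordering matters. two clients that support the same ciphers but advertise them in a different order produce different strings, and therefore different hashes.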

layer 2: browser automation signals. Chrome’s automation flag sets navigator.webdriver = true. this is detectable with one line of JavaScript. beyond that, Playwright contexts leak through: missing or misconfigured window.chrome object, absent browser plugins, canvas fingerprint anomalies (headless Chromium renders slightly differently than headed Chrome on the same machine), and WebGL renderer strings that reveal Mesa or SwiftShader instead of a real GPU.

the standard fix is playwright-extra with the stealth plugin, or Camoufox. here’s a minimal stealth setup with playwright-extra:

import { chromium } from 'playwright-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

chromium.use(StealthPlugin());

const browser = await chromium.launch({ headless: true });
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
  viewport: { width: 1280, height: 800 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
});

the user agent alone doesn’t help. you need the locale, timezone, and viewport to be internally consistent. Cloudflare’s fingerprinting checks whether your Accept-Language header matches your navigator.language, whether your timezone offset matches your claimed timezone, and whether your screen dimensions are plausible for the claimed device.

layer 3: IP reputation. this is where residential proxies earn their price. datacenter IPs, even rotating ones, carry a reputation score that Cloudflare maintains. ASN-level blocks are common for the major datacenter ranges (AWS, GCP, DigitalOcean, Hetzner). residential IPs route through ISP-assigned consumer ranges and look far less bot-like by default. the tradeoff is cost: residential bandwidth from providers like Bright Data or Oxylabs runs $8-15/GB versus $0.50-2/GB for datacenter.

the proxy configuration in Playwright is straightforward:

const context = await browser.newContext({
  proxy: {
    server: 'http://gate.smartproxy.com:10000',
    username: 'your_username',
    password: 'your_password',
  },
  // ...other options
});

for sticky sessions (same IP across multiple requests), use the session ID parameter your provider supports. Bright Data uses brd-session-XXXXX in the username. Oxylabs uses user-XXXXX-session-XXXXX. the exact format varies by provider, so check their docs.
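a small helper keeps the session ID handling in one place. this is a sketch under assumptions: the `<user>-session-<id>` username template, the stickyProxy and newSessionId names, and the gateway URL in the usage note are all hypothetical, so swap in whatever format your provider documents.

```typescript
// same shape as the proxy option passed to browser.newContext()
interface ProxyConfig {
  server: string;
  username: string;
  password: string;
}

// hypothetical username template: '<user>-session-<id>'
function stickyProxy(
  server: string,
  user: string,
  password: string,
  sessionId: string,
): ProxyConfig {
  // keep IDs alphanumeric so they can't collide with the dash separators
  if (!/^[A-Za-z0-9]+$/.test(sessionId)) {
    throw new Error('session id should be alphanumeric');
  }
  return { server, username: `${user}-session-${sessionId}`, password };
}

// generate a fresh ID when you want a new exit IP
const newSessionId = (): string =>
  Date.now().toString(36) + Math.floor(Math.random() * 1e6).toString(36);
```

usage would look like `browser.newContext({ proxy: stickyProxy('http://gate.provider.com:10000', 'user', 'pass', id) })`. reuse the same id for every context that should share an IP, and call newSessionId() only when you want to rotate.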

layer 4: behavioral signals. this is the layer that gets you even after you’ve fixed 1-3. real users move mice, scroll, have navigation history, have cookies from prior visits, take non-zero time to read pages before clicking. Playwright scripts that hit a page and immediately extract data with zero dwell time look robotic because they are. Cloudflare’s behavioral scoring weights timing patterns, event sequences, and interaction entropy.

minimum viable behavioral mimicry:

// random delay helper
const sleep = (ms: number) => new Promise(r => setTimeout(r, ms));
const jitter = (base: number, variance: number) =>
  base + (Math.random() - 0.5) * 2 * variance;

// after page load, wait before interacting
await page.goto(url);
await sleep(jitter(2000, 800));

// simulate a scroll
await page.mouse.move(640, 400);
await page.evaluate(() => window.scrollBy(0, 300));
await sleep(jitter(1500, 500));

this won’t fool a sophisticated ML model but it shifts the score enough to matter on sites with moderate bot protection.

worked examples

example 1: e-commerce product page scraping at scale.

a client was scraping product prices from a mid-tier e-commerce site running Cloudflare Business. they were using vanilla Playwright with datacenter proxies and hitting a ~90% block rate. i rebuilt the setup with playwright-extra stealth, Smartproxy residential rotating proxies ($7.50/GB at the time, billed monthly), and added a 1.5-3 second random delay post-load before extraction. block rate dropped to roughly 8%. the remaining 8% were pages where Cloudflare served a Turnstile challenge (the interactive CAPTCHA successor), which required either a CAPTCHA solving service or manual handling. total proxy cost per 10,000 pages went from $0.40 (datacenter, mostly wasted) to about $2.10 (residential, mostly successful), but actual cost per successfully scraped page dropped significantly.
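the cost claim is easier to see as arithmetic. this sketch plugs in the figures from the example above (the block rates are approximate, so treat the outputs as rough); the costPerSuccess name is mine.

```typescript
// cost per page that actually yields data, given a batch cost and a block rate
function costPerSuccess(
  costPerBatch: number,
  pagesPerBatch: number,
  blockRate: number,
): number {
  const successfulPages = pagesPerBatch * (1 - blockRate);
  if (successfulPages <= 0) throw new Error('no successful pages');
  return costPerBatch / successfulPages;
}

// figures from example 1, per 10,000 pages attempted
const datacenterCost = costPerSuccess(0.40, 10000, 0.90);  // 0.40 / ~1,000 pages
const residentialCost = costPerSuccess(2.10, 10000, 0.08); // 2.10 / ~9,200 pages
```

that works out to roughly $0.0004 per successful page on datacenter versus about $0.00023 on residential, so the per-success cost falls by a bit over 40% even though the batch cost is about five times higher.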

example 2: job board monitoring with session persistence.

a different use case: monitoring a job board that required login. this site used Cloudflare Enterprise with stricter behavioral scoring. the trick that worked here was maintaining persistent browser contexts rather than creating a fresh context per run. we stored the browser context state (cookies plus localStorage) to disk after a successful human-verified login, then loaded that state for automated runs:

// save context after manual login
await context.storageState({ path: './session.json' });

// load for automated run
const context = await browser.newContext({
  storageState: './session.json',
  proxy: { server: '...', username: '...', password: '...' },
  // ...
});

the persistent session with established cookies looked much less bot-like to Cloudflare’s scoring because it had behavioral history. we still rotated proxies but used sticky sessions to avoid the IP changing mid-session, which itself triggers anomaly detection. this approach worked cleanly for about 2 weeks before the session expired and required renewal.

example 3: Camoufox for sites with aggressive TLS fingerprinting.

one site we were targeting had apparently configured their Cloudflare ruleset to block specific JA3 hashes. we could confirm this because requests from patched Chromium were blocked while identical requests from Firefox (manual testing) went through. switching the stack to Camoufox, which ships a patched Firefox 128 build with automation signals removed, resolved the JA3 mismatch and dropped blocks from 100% to under 5%:

from camoufox.sync_api import Camoufox

with Camoufox(headless=True, proxy={
    "server": "http://gate.provider.com:10000",
    "username": "user",
    "password": "pass"
}) as browser:
    page = browser.new_page()
    page.goto("https://target.com/path")
    data = page.inner_text(".product-price")

Camoufox’s Python API mirrors Playwright’s closely. the main operational downside is that Firefox is slower than Chromium and has slightly less ecosystem support for complex JS interactions, but for data extraction tasks it’s usually fine.

edge cases and failure modes

Turnstile challenges. Cloudflare Turnstile replaced the older hCaptcha integration for many sites. unlike the old challenge page, Turnstile can be invisible (no user interaction required) or interactive. the invisible variant is solved automatically by Cloudflare’s client-side JS based on browser telemetry. a good stealth setup passes it without intervention most of the time. the interactive variant requires either a CAPTCHA solving service (2captcha, CapSolver, etc.) integrated into your flow, or you need to accept that those pages will fail. don’t build pipelines that assume 100% Turnstile pass rates.

IP warming. fresh residential proxy sessions occasionally start with a bot-like score, especially on high-value targets. this is because the IP may have been used by another customer for activity that raised flags. the fix is retrying with a different session on 403s rather than retrying the same IP. implement exponential backoff with session rotation:

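// assumes `browser` is an already-launched Playwright Browser
// and `sleep` is the delay helper defined earlier
async function fetchWithRetry(url: string, maxRetries = 3): Promise<string> {
  for (let i = 0; i < maxRetries; i++) {
    // new session ID per attempt so the provider assigns a different exit IP
    const sessionId = `sess-${Date.now()}-${Math.random().toString(36).slice(2)}`;
    const context = await browser.newContext({
      proxy: {
        server: 'http://gate.provider.com:10000',
        username: `user-session-${sessionId}`,
        password: 'pass',
      },
      // ...
    });
    try {
      const page = await context.newPage();
      const response = await page.goto(url);
      if (response?.status() === 403) throw new Error('blocked');
      return await page.content();
    } catch (e) {
      if (i === maxRetries - 1) throw e;
    } finally {
      // close the context on success and failure alike to avoid leaks
      await context.close();
    }
    // exponential backoff before the next attempt: 1s, 2s, 4s, ...
    await sleep(1000 * 2 ** i);
  }
  throw new Error('maxRetries must be at least 1');
}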

header inconsistency. Playwright sets some default headers that don’t match what a real Chrome browser sends. in particular, the Accept header for navigation requests, Sec-Fetch-* headers, and the order of headers in the HTTP/2 pseudo-header block all differ from organic Chrome. Cloudflare’s HTTP/2 fingerprinting (sometimes called Akamai fingerprinting or H2 fingerprinting) can detect these mismatches. setting headers explicitly to match real Chrome fixes the header values, though it cannot change the pseudo-header order, which the browser’s network stack controls:

await context.setExtraHTTPHeaders({
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br',
  'Upgrade-Insecure-Requests': '1',
});

this is also why copying headers from a browser DevTools capture is worth doing when debugging a specific target.

geolocation mismatches. if your proxy IP resolves to Singapore but your browser locale is en-US with timezone America/New_York, that’s a signal. either use proxies in the geographic region appropriate for the target site, or make your browser locale match your proxy region. for US-targeted sites, use US residential proxies. Smartproxy and Oxylabs both support country and even city-level targeting. the extra specificity usually costs the same per GB but requires adjusting your proxy URL parameters.
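one way to keep this consistent is to derive the browser locale and timezone from the proxy country instead of hardcoding them. a minimal sketch, assuming you know the country code you requested from the provider; the GEO_PROFILES table and the profileForProxyCountry name are illustrative and the mapping is deliberately tiny.

```typescript
// the two newContext options that should agree with the proxy's location
interface GeoProfile {
  locale: string;
  timezoneId: string;
}

// illustrative mapping only: extend it per country (or per city for
// countries that span several time zones, such as the US)
const GEO_PROFILES: Record<string, GeoProfile> = {
  US: { locale: 'en-US', timezoneId: 'America/New_York' },
  GB: { locale: 'en-GB', timezoneId: 'Europe/London' },
  DE: { locale: 'de-DE', timezoneId: 'Europe/Berlin' },
  SG: { locale: 'en-SG', timezoneId: 'Asia/Singapore' },
};

// fail loudly rather than silently falling back to a mismatched profile
function profileForProxyCountry(countryCode: string): GeoProfile {
  const profile = GEO_PROFILES[countryCode.toUpperCase()];
  if (!profile) throw new Error(`no geo profile for ${countryCode}`);
  return profile;
}
```

the result can be spread into the context options, e.g. `browser.newContext({ ...profileForProxyCountry('SG'), proxy })`. throwing on an unknown country is a design choice: a missing profile is easier to debug than a quiet en-US default behind a Singapore IP.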

rate limits masquerading as bot blocks. not every 403 is a bot detection block. some are pure rate limits. if you see 403s only after N successful requests in a time window, you’re hitting rate limiting, not bot scoring. the fix is concurrency and timing management, not stealth improvements. i’ve wasted hours tuning fingerprinting when the actual problem was 50 concurrent workers hammering the same endpoint. check your request logs before assuming it’s a bot detection issue.
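a crude first-pass check on your logs can tell the two apart before you touch any fingerprinting. this is a heuristic sketch, not a diagnosis: the LogEntry shape, the diagnose403s name, and the default of 5 successes are assumptions, and it should be run on entries from a single IP or session.

```typescript
type Diagnosis = 'no-blocks' | 'likely-bot-block' | 'likely-rate-limit';

// hypothetical log record: one entry per request
interface LogEntry {
  timestampMs: number;
  status: number;
}

// if the first 403 only shows up after a run of 200s, suspect rate limiting;
// if requests are blocked (almost) from the start, suspect bot detection
function diagnose403s(entries: LogEntry[], minSuccessesBefore = 5): Diagnosis {
  const sorted = [...entries].sort((a, b) => a.timestampMs - b.timestampMs);
  const firstBlock = sorted.findIndex(e => e.status === 403);
  if (firstBlock === -1) return 'no-blocks';
  const successesBefore = sorted
    .slice(0, firstBlock)
    .filter(e => e.status === 200).length;
  return successesBefore >= minSuccessesBefore
    ? 'likely-rate-limit'
    : 'likely-bot-block';
}
```

if it reports a likely rate limit, lower concurrency and widen delays first, then re-run before changing anything in the stealth setup.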

what we learned in production

the single biggest lesson is that stealth is a moving target. a setup that works today may fail in three months because Cloudflare pushed an update to their ML models, a target site tightened their bot score threshold, or your proxy provider’s IP ranges got flagged. build monitoring in from the start: log response status codes, response sizes, and whether the HTML you got contains the content you expect. a 200 with Cloudflare’s challenge HTML is just as bad as a 403. scraper health monitoring should look at data yield, not just HTTP status.
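here’s roughly what yield-based health checking can look like. treat it as a sketch: the challenge markers are strings i’d expect on a Cloudflare challenge page, not an official or stable list, and expectedMarker is whatever string proves the page has your data (a CSS class, a JSON key). all names are illustrative.

```typescript
type Health = 'ok' | 'blocked' | 'challenge' | 'empty';

interface FetchResult {
  status: number;
  body: string;
}

// assumed markers for challenge HTML; verify against your own captures
const CHALLENGE_MARKERS = ['Just a moment...', 'cf-chl'];

function classify(result: FetchResult, expectedMarker: string): Health {
  if (result.status === 403 || result.status === 429) return 'blocked';
  // expected content present: count it as a success regardless of other markup
  if (result.body.includes(expectedMarker)) return 'ok';
  if (CHALLENGE_MARKERS.some(m => result.body.includes(m))) return 'challenge';
  // a 200 that looks right but contains no data
  return 'empty';
}

// fraction of requests that produced usable data
function yieldRate(results: Health[]): number {
  if (results.length === 0) return 0;
  return results.filter(h => h === 'ok').length / results.length;
}
```

alerting on yieldRate dropping below a baseline catches the silent-block case that status-code monitoring misses.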

the second lesson is that the Playwright documentation on browser contexts is worth reading carefully, because context isolation is where a lot of practitioners leave performance on the table. creating a new browser process per request is expensive. creating a new context per request (within the same browser process) is cheap and gives you cookie/storage isolation. for most scraping workloads, one browser process with N concurrent contexts is the right architecture, not N browser processes. this matters for residential proxy cost too, since more efficient context reuse means fewer wasted proxy connections.
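the one-browser, N-contexts architecture can be sketched as a small worker pool. the interfaces below are minimal structural stand-ins i wrote so the sketch is self-contained; Playwright’s Browser, BrowserContext, and Page should satisfy them, but that’s an assumption to check against your version. scrapeAll is a hypothetical name, and proxy and stealth options are omitted.

```typescript
// minimal structural types standing in for Playwright's objects
interface PageLike {
  goto(url: string): Promise<unknown>;
  content(): Promise<string>;
}
interface ContextLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface BrowserLike {
  newContext(): Promise<ContextLike>;
}

// one browser process, at most `concurrency` contexts open at a time
async function scrapeAll(
  browser: BrowserLike,
  urls: string[],
  concurrency: number,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  let next = 0;
  // each worker pulls the next URL until the queue is drained
  const worker = async (): Promise<void> => {
    while (next < urls.length) {
      const url = urls[next++];
      const context = await browser.newContext(); // isolated cookies/storage
      try {
        const page = await context.newPage();
        await page.goto(url);
        results.set(url, await page.content());
      } finally {
        await context.close();
      }
    }
  };
  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

as written, one failing URL rejects the whole batch. in production you’d likely catch per-URL errors inside the worker and record them instead.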

on proxy selection specifically: for most Cloudflare targets i’d start with Smartproxy or ProxyScrape’s residential plan before going to Bright Data, because the cost difference is meaningful at scale and the quality is close enough for most targets. Bright Data’s SERP API and ISP proxy products are worth the premium for the hardest targets, but they’re overkill for the median case. if you’re running antidetect browser workflows rather than pure Playwright, the discussion over at antidetectreview.org has good comparative analysis of which browsers hold up best against Cloudflare specifically.

the third lesson: don’t fight Cloudflare when there’s a legitimate API. a surprising number of sites with aggressive front-end bot protection have mobile apps backed by less-protected JSON APIs, or offer official data access programs. i’ve saved weeks of scraping engineering time by checking for a mobile API first. this isn’t always possible, but it’s worth 30 minutes of investigation before building a scraping stack.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-22.