The 2026 Playwright guide for production scraping
Most scraping tutorials show you how to click a button and grab a title tag. that’s fine for a weekend project. but when you’re running scraping infrastructure at scale, you start hitting a different class of problem: browser fingerprinting, session management, proxy integration, memory leaks on long-running crawlers, and servers that go down at 3am. this guide is for operators who are past the tutorial phase and want to build something that actually holds up.
Playwright, maintained by Microsoft, is the browser automation library I default to in 2026. it handles Chromium, Firefox, and WebKit from a single API, has better async support than Puppeteer, and has an active release cadence. i’ve run it against e-commerce sites, job boards, and travel aggregators, across both Singapore cloud infrastructure and offshore VPS providers. the toolchain works, but only if you configure it correctly from the start.
by the end of this guide you’ll have a working Playwright scraper with proxy support, stealth configuration, structured data extraction, and enough error handling to survive a multi-hour run without babysitting it. i’ll also cover what breaks when you scale from 10 to 1000 concurrent browsers.
what you need
- Node.js 18 or above. Playwright’s async browser context API requires modern V8 features. check with node --version.
- Playwright npm package. npm install playwright pulls in the library. you install browsers separately with npx playwright install chromium. installing all three browsers is about 300MB on disk.
- A VPS or dedicated server. scraping from your local machine works for testing but not production. i use Hetzner CX21 instances (2 vCPU, 4GB RAM) at around €4.15/month for low-concurrency jobs. for high-concurrency work, 8GB RAM minimum per node.
- Residential or datacenter proxies. Playwright does not rotate IPs for you. you need a proxy provider. datacenter proxies cost roughly $0.50-$2 per GB depending on provider and are fine for many targets. residential proxies run $5-$15 per GB and are necessary for tighter anti-bot targets.
- A process manager. PM2 or systemd to keep workers running and restart on crash.
- Basic TypeScript or JavaScript knowledge. the examples below are TypeScript but work fine as plain JS with minor syntax changes.
step by step
step 1: install and verify
mkdir scraper && cd scraper
npm init -y
npm install playwright
npx playwright install chromium
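verify the install works before touching anything else. the snippet uses top-level await, so run it as an ES module (a .mjs file, or a project with "type": "module" in package.json):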
import { chromium } from 'playwright';
const browser = await chromium.launch({ headless: true });
const page = await browser.newPage();
await page.goto('https://example.com');
console.log(await page.title());
await browser.close();
expected output: Example Domain printed to stdout. if you get a browserType.launch: Failed to launch error, run npx playwright install-deps chromium to pull in missing system libraries. this is the most common failure on fresh Ubuntu servers.
if it breaks: when Chromium runs as root or inside a container that blocks its sandbox, the launch fails until you pass --no-sandbox. add args: ['--no-sandbox'] to your launch() call. note that disabling the sandbox is a security tradeoff and should only happen inside containers or isolated VMs, not on a shared host.
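one way to keep the sandbox decision explicit is to derive the launch options from the environment instead of hardcoding the flag. a minimal sketch, assuming a hypothetical SCRAPER_IN_CONTAINER variable that your deployment sets (it is not a Playwright convention); the returned object only covers the two launch options used in this guide.

```typescript
// Subset of the options object passed to chromium.launch() in this guide.
interface LaunchOptions {
  headless: boolean;
  args: string[];
}

// SCRAPER_IN_CONTAINER is a made-up variable name for this sketch.
// The sandbox is only disabled when the deployment explicitly opts in.
function buildLaunchOptions(env: Record<string, string | undefined>): LaunchOptions {
  const args: string[] = [];
  if (env.SCRAPER_IN_CONTAINER === '1') {
    args.push('--no-sandbox');
  }
  return { headless: true, args };
}
```

usage would look like chromium.launch(buildLaunchOptions(process.env)), so a shared host that never sets the variable keeps the sandbox on.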
step 2: configure browser context for stealth
a default Playwright browser is detectable. the navigator.webdriver flag is set, the viewport is suspiciously consistent, and there is no realistic browsing history. most serious anti-bot systems check these signals.
const browser = await chromium.launch({
headless: true,
args: ['--no-sandbox', '--disable-blink-features=AutomationControlled']
});
const context = await browser.newContext({
viewport: { width: 1366, height: 768 },
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
locale: 'en-US',
timezoneId: 'America/New_York',
});
the --disable-blink-features=AutomationControlled flag stops Chromium from setting navigator.webdriver to true. the user agent string matters: use a realistic, recent Chrome version, and keep it consistent with the viewport, locale, and timezone you set. MDN’s User-Agent documentation describes the string format; cross-reference it with real traffic distributions before settling on a value.
if it breaks: if the site is still detecting you after stealth config, the issue is usually canvas fingerprinting or WebGL. consider playwright-extra with the stealth plugin, which patches additional fingerprint surfaces. see the Playwright docs on browser contexts for the full list of configurable signals.
step 3: add proxy support
const context = await browser.newContext({
proxy: {
server: 'http://proxy.example.com:8080',
username: 'user',
password: 'pass',
},
// ... other context options
});
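set the proxy on the context rather than on the browser launch. launch() accepts a proxy option too, but it applies to every context in that browser. per-context proxies let you run multiple contexts with different proxy endpoints inside a single browser process, which matters for performance.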
for rotating proxies, create a new context per request or per session rather than per page. closing and reopening pages inside the same context reuses the same IP if your proxy provider uses session-based rotation.
if it breaks: if you get net::ERR_PROXY_CONNECTION_FAILED, verify the proxy is reachable from your server with curl --proxy http://user:pass@proxy:8080 https://httpbin.org/ip. Playwright proxy errors are often network routing issues on the VPS side, not Playwright itself.
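proxy providers usually hand you endpoints as URLs with embedded credentials, while the context option wants server, username, and password as separate fields. a small sketch of that conversion plus a round-robin pick per session; parseProxyUrl and proxyForSession are hypothetical helper names, and the endpoint list is whatever your provider gives you.

```typescript
// Same field names as the proxy option shown above.
interface ProxySettings {
  server: string;
  username?: string;
  password?: string;
}

// Split "http://user:pass@host:port" into separate fields.
// Credentials are percent-decoded because URL keeps them encoded.
function parseProxyUrl(proxyUrl: string): ProxySettings {
  const u = new URL(proxyUrl);
  const settings: ProxySettings = { server: `${u.protocol}//${u.host}` };
  if (u.username) settings.username = decodeURIComponent(u.username);
  if (u.password) settings.password = decodeURIComponent(u.password);
  return settings;
}

// Round-robin: session N gets endpoint N modulo the list length.
function proxyForSession(proxies: string[], sessionIndex: number): ProxySettings {
  if (proxies.length === 0) throw new Error('proxy list is empty');
  return parseProxyUrl(proxies[sessionIndex % proxies.length]);
}
```

the result can be passed straight into browser.newContext({ proxy: proxyForSession(list, i) }) when you create a context per session.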
step 4: navigate and wait correctly
this is where most scrapers fail. by default page.goto(url) resolves on the load event, which says nothing about whether client-side JavaScript has rendered the data you need.
const page = await context.newPage();
await page.goto('https://target-site.com/products', {
waitUntil: 'domcontentloaded',
timeout: 30000,
});
// wait for a specific element that signals the data is ready
await page.waitForSelector('[data-testid="product-list"]', { timeout: 15000 });
use waitUntil: 'domcontentloaded' over 'networkidle' for speed. networkidle waits until there have been no network connections for at least 500ms, which can take 10+ seconds on ad-heavy pages. instead, wait for the specific element you need.
if it breaks: if waitForSelector times out, the element might be inside an iframe or shadow DOM. use page.frameLocator('iframe').locator('[data-testid="product-list"]') for iframe cases. for shadow DOM, Playwright’s CSS locators pierce open shadow roots by default, so a closed shadow root is the case that needs a different approach.
step 5: extract structured data
const products = await page.evaluate(() => {
return Array.from(document.querySelectorAll('.product-card')).map(card => ({
title: card.querySelector('.product-title')?.textContent?.trim(),
price: card.querySelector('.price')?.textContent?.trim(),
url: card.querySelector('a')?.href,
}));
});
page.evaluate() runs inside the page’s JavaScript context and returns serializable data. avoid pulling raw HTML back and parsing it in Node: you pay the serialization cost for far more data than you need.
if it breaks: if evaluate() returns empty arrays, check the selectors in browser devtools first. sites frequently change class names. use data-testid or aria-label attributes when available as they change less often.
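the extraction above returns whatever text the page had, including undefined fields when a selector misses. a sketch of a Node-side cleanup pass that drops incomplete cards and turns the price text into a number; it assumes prices formatted like "$1,299.00" (comma as thousands separator), which will not hold for every locale.

```typescript
// Shape returned by the page.evaluate() extraction: every field may be missing.
interface RawProduct {
  title?: string;
  price?: string;
  url?: string;
}

interface Product {
  title: string;
  price: number;
  url: string;
}

// Assumes "1,299.00"-style formatting: commas are thousands separators.
function parsePrice(text: string): number | null {
  const cleaned = text.replace(/[^0-9.,]/g, '').replace(/,/g, '');
  if (cleaned === '') return null;
  const value = Number(cleaned);
  return Number.isFinite(value) ? value : null;
}

// Keep only cards with all three fields and a parseable price.
function normalizeProducts(raw: RawProduct[]): Product[] {
  const out: Product[] = [];
  for (const item of raw) {
    if (!item.title || !item.price || !item.url) continue;
    const price = parsePrice(item.price);
    if (price === null) continue;
    out.push({ title: item.title, price, url: item.url });
  }
  return out;
}
```

comparing raw.length with the normalized length per page is a cheap signal that a selector has started missing.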
step 6: handle errors and retries
production scrapers need retry logic. network errors, timeouts, and transient blocks are guaranteed at scale.
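async function scrapeWithRetry(url: string, maxRetries = 3): Promise<any> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    // a fresh page per attempt; if this throws, the context itself is likely dead
    const page = await context.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 30000 });
      await page.waitForSelector('.product-card', { timeout: 15000 });
      return await page.evaluate(/* extraction logic */);
    } catch (err) {
      // err is unknown in strict TypeScript, so narrow before reading .message
      const message = err instanceof Error ? err.message : String(err);
      console.error(`attempt ${attempt} failed for ${url}:`, message);
      if (attempt === maxRetries) throw err;
    } finally {
      // runs on success and on failure, so failed attempts never leak pages
      await page.close();
    }
    // wait before the next attempt: 2s after the first failure, 4s after the second
    await new Promise(resolve => setTimeout(resolve, 2000 * attempt));
  }
}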
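the delay grows linearly here (2s after the first failure, 4s after the second). backing off reduces hammering on a target when it is already struggling. for true exponential backoff, double the delay on each attempt instead of multiplying by the attempt number.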
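a sketch of a delay helper that doubles the wait on each attempt and adds jitter, which could stand in for the fixed 2000 * attempt in the loop above. backoffDelayMs and its default values are assumptions for illustration; the random source is injectable only so the function can be tested deterministically.

```typescript
// Exponential backoff with jitter.
// attempt 1 -> up to baseMs, attempt 2 -> up to 2x baseMs, attempt 3 -> up to 4x, capped at capMs.
// The result is drawn from the upper half of that range so workers that
// failed together do not all retry at the same instant.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 2000,
  capMs: number = 60000,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  const half = exponential / 2;
  return Math.round(half + random() * half);
}
```

inside the retry loop this would be used as await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt))).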
if it breaks: if all retries fail consistently, the issue is usually a block, not a transient error. page.goto() does not throw on HTTP error statuses, so check the status() of the response it returns: a 403 means the server is rejecting you. a timeout on waitForSelector means the page is loading but not rendering your selector.
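telling a block from a transient failure is easier if every failed attempt gets a label before it is logged. a rough sketch, assuming you capture the HTTP status from the response that page.goto() returns and the thrown error, if any. treating 429 as a block and matching on the error name 'TimeoutError' are assumptions of this sketch, not something the guide's code does.

```typescript
type FailureKind = 'blocked' | 'timeout' | 'transient';

// status: HTTP status of the navigation response, or null if there was none.
// err: whatever the attempt threw, or undefined if nothing was thrown.
// Returns null when nothing looks wrong.
function classifyFailure(status: number | null, err: unknown): FailureKind | null {
  // 403 is the case named in the text; 429 is added here as an assumption.
  if (status === 403 || status === 429) return 'blocked';
  // Assumes timeout errors carry the name 'TimeoutError'.
  if (err instanceof Error && err.name === 'TimeoutError') return 'timeout';
  if (err !== undefined && err !== null) return 'transient';
  if (status !== null && status >= 500) return 'transient';
  return null;
}
```

a run where most labels are 'blocked' calls for different proxies or slower pacing, while mostly 'timeout' points at selectors or rendering.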
step 7: manage memory with context pooling
the most common production failure mode is memory growth. opening a new browser per URL will exhaust RAM in under an hour on any reasonable workload.
const browser = await chromium.launch({ headless: true });
const POOL_SIZE = 5;
const contexts = await Promise.all(
Array.from({ length: POOL_SIZE }, () =>
browser.newContext({ /* your stealth config */ })
)
);
// use a queue and assign URLs to contexts round-robin or via a semaphore
keep browsers alive and recycle contexts. close individual pages after use, not contexts. close contexts only when the context has accumulated state you want to discard (cookies, storage).
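the queue mentioned in the comment above can be as simple as one loop per context pulling from a shared index. a minimal sketch, with runWithPool as a hypothetical name; it is generic over the context type so it works with whatever browser.newContext() returned, and the worker callback is where your page logic would go.

```typescript
// Run `worker` over every URL using a fixed set of contexts.
// Each context processes one URL at a time, so concurrency equals contexts.length.
// Results come back in the same order as `urls`.
async function runWithPool<C, R>(
  urls: string[],
  contexts: C[],
  worker: (context: C, url: string) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(urls.length);
  let next = 0;
  const lanes = contexts.map(async (context) => {
    // The check and the increment run without an await in between,
    // so two lanes can never claim the same index.
    while (next < urls.length) {
      const index = next++;
      results[index] = await worker(context, urls[index]);
    }
  });
  await Promise.all(lanes);
  return results;
}
```

note that a worker that throws rejects the whole batch in this sketch, so in practice the worker should be something like the retry function from step 6 wrapped to return an error value instead of throwing.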
if it breaks: if memory still grows, add '--js-flags=--max-old-space-size=512' to your browser launch args to cap the V8 heap at 512MB. pass it as a single array entry without inner quotes, since there is no shell to strip them. also check for page.on('response') listeners that accumulate without being removed.
step 8: run in production with PM2
npm install -g pm2
pm2 start dist/scraper.js --name scraper-worker -i 2
pm2 save
pm2 startup
-i 2 runs two worker processes. PM2 restarts workers on crash and persists across reboots with pm2 startup. logs go to ~/.pm2/logs/.
if it breaks: if workers keep crashing with OOM errors, reduce the Playwright pool size inside each worker or increase the server RAM. 4GB RAM handles roughly 3-5 concurrent Playwright contexts with Chromium depending on page complexity.
common pitfalls
using networkidle universally. it sounds safe but it is slow and unreliable on pages with long-polling or analytics scripts. learn to use waitForSelector with the specific element you care about.
ignoring browser console errors. add page.on('console', msg => ...) during development. JavaScript errors in the page often explain why your selectors return nothing.
not rotating user agents. sending the exact same user agent string on every request from the same IP is a fingerprint. maintain a small pool of realistic UA strings and rotate them per context.
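rotating the user agent alone can make things worse if the rest of the context does not match it. a sketch of rotating whole profiles per context instead; the three profiles are example values only (the first reuses the settings from step 2), and profileForContext is a hypothetical helper name.

```typescript
// Field names match the browser.newContext() options used in step 2.
interface BrowserProfile {
  userAgent: string;
  viewport: { width: number; height: number };
  locale: string;
  timezoneId: string;
}

// Example values for illustration; refresh them against real traffic data.
const PROFILES: BrowserProfile[] = [
  {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1366, height: 768 },
    locale: 'en-US',
    timezoneId: 'America/New_York',
  },
  {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1440, height: 900 },
    locale: 'en-US',
    timezoneId: 'America/Los_Angeles',
  },
  {
    userAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    viewport: { width: 1920, height: 1080 },
    locale: 'en-GB',
    timezoneId: 'Europe/London',
  },
];

// Context N gets profile N modulo the pool size, so a context keeps
// one identity for its whole lifetime.
function profileForContext(contextIndex: number, profiles: BrowserProfile[] = PROFILES): BrowserProfile {
  if (profiles.length === 0) throw new Error('profile pool is empty');
  return profiles[contextIndex % profiles.length];
}
```

each pooled context would then be created with browser.newContext(profileForContext(i)).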
keeping pages open too long. each open page holds memory. close pages as soon as you have extracted the data. a scraper that opens 50 pages and forgets to close them will OOM within an hour.
no rate limiting. hitting the same domain at 50 requests per second from one IP gets you blocked within minutes and may also create legal risk depending on the target’s terms. add a deliberate delay between requests to the same domain. this is not legal advice; consult your own counsel for your specific situation.
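a per-domain delay can be enforced with a small bookkeeping class that tells each caller how long to wait. a sketch under the assumption that one process owns all requests to a domain; DomainThrottle is a hypothetical name, and the clock is passed in so the behavior is easy to check.

```typescript
// Spaces out requests to the same hostname by at least minIntervalMs.
class DomainThrottle {
  private minIntervalMs: number;
  private nextAllowedAt = new Map<string, number>();

  constructor(minIntervalMs: number) {
    this.minIntervalMs = minIntervalMs;
  }

  // Reserve the next free slot for this URL's host and return how many
  // milliseconds the caller should sleep before making the request.
  reserve(url: string, nowMs: number = Date.now()): number {
    const host = new URL(url).hostname;
    const earliest = this.nextAllowedAt.get(host) ?? 0;
    const startAt = Math.max(nowMs, earliest);
    this.nextAllowedAt.set(host, startAt + this.minIntervalMs);
    return startAt - nowMs;
  }
}
```

a worker would call const wait = throttle.reserve(url) and sleep for that long before page.goto(url). across several PM2 workers the state would have to live somewhere shared, which this sketch does not cover.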
scaling this
10x (50 concurrent browsers): you need a server with at least 16GB RAM and multiple cores. split into multiple Node processes using PM2 cluster mode or separate workers per domain. at this level, proxy management becomes critical. you want sticky sessions to avoid mid-flow IP changes.
100x (500 concurrent browsers): single-machine Playwright does not scale here. move to a distributed job queue. Redis with Bull or BullMQ is a common pattern. each worker node pulls URLs from the queue, scrapes, and pushes results to a database. Kubernetes simplifies worker scaling. if you’re managing multi-account workflows at this scale, the operational patterns at multiaccountops.com/blog/ are worth reading, particularly for session isolation.
1000x: at this level you likely need a managed browser fleet, either a self-hosted cluster of Playwright Docker containers or a third-party browser cloud. the economics shift: managed browser services charge per minute of browser time and the cost can exceed self-hosted infrastructure quickly, so model the numbers before committing. monitoring becomes non-optional. you need metrics on success rates, block rates, and latency per domain.
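the three metrics named above can start life as an in-process counter before you wire up a real metrics stack. a minimal sketch, with DomainMetrics and the outcome labels as assumptions for illustration; a production setup would export these numbers to a monitoring system rather than keep them in memory.

```typescript
type Outcome = 'success' | 'blocked' | 'error';

interface DomainCounters {
  total: number;
  success: number;
  blocked: number;
  error: number;
  totalLatencyMs: number;
}

interface DomainSummary {
  successRate: number;
  blockRate: number;
  avgLatencyMs: number;
}

// Aggregates request outcomes and latency per hostname.
class DomainMetrics {
  private counters = new Map<string, DomainCounters>();

  record(url: string, outcome: Outcome, latencyMs: number): void {
    const host = new URL(url).hostname;
    const c = this.counters.get(host) ?? { total: 0, success: 0, blocked: 0, error: 0, totalLatencyMs: 0 };
    c.total += 1;
    c[outcome] += 1;
    c.totalLatencyMs += latencyMs;
    this.counters.set(host, c);
  }

  // Returns null for a host that has no recorded requests.
  summary(host: string): DomainSummary | null {
    const c = this.counters.get(host);
    if (!c) return null;
    return {
      successRate: c.success / c.total,
      blockRate: c.blocked / c.total,
      avgLatencyMs: c.totalLatencyMs / c.total,
    };
  }
}
```

a rising blockRate on one domain while others stay flat usually means that target changed its defenses, not that your infrastructure broke.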
where to go next
- rotating proxies with Playwright covers per-request proxy rotation patterns and how to handle sticky sessions with residential providers.
- Playwright vs Puppeteer in 2026 breaks down where each library has the edge and when to switch.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.