The 2026 Puppeteer guide for production scraping
Most Puppeteer tutorials show you how to take a screenshot or scrape a static page in 20 lines. that’s fine for demos. it’s useless when you’re pulling product data from a JavaScript-heavy e-commerce site at scale, rotating proxies, and dealing with CAPTCHAs at 3am.
I’ve been running scraping infrastructure in production since 2021, mostly for price monitoring and lead generation. Puppeteer, maintained by the Chrome team at Google, remains my default tool for browser automation when a site requires JavaScript execution. it’s not the cheapest option in CPU or memory, but it’s the most predictable when sites render content client-side. as of May 2026, Puppeteer is on version 22.x and ships with Chrome 124+ by default via the puppeteer package.
this guide is for operators who already know a bit of Node.js and want a working scraper they can actually deploy. by the end you’ll have a script that launches a stealth browser, rotates proxies, extracts structured data, and fails gracefully instead of silently.
what you need
- Node.js 20+ (LTS, download from nodejs.org)
- npm or pnpm for package management
- A proxy provider with residential or datacenter IPs, such as ProxyScraping (starting around $3/GB for residential), Bright Data, or Oxylabs. datacenter is cheaper but gets blocked more often.
- A Linux VPS or container with at least 2 vCPUs and 4 GB RAM per 3-5 concurrent Puppeteer instances. Chromium is hungry.
- Basic knowledge of CSS selectors or XPath for element targeting
- A target site you have permission to scrape, or a sandbox like toscrape.com
- Optional: a CAPTCHA solving service (2Captcha or CapSolver, around $1-3 per 1000 solves) if the target uses Cloudflare Turnstile or hCaptcha
budget estimate for a small production setup: $20-50/month for a decent VPS plus proxy costs depending on volume.
step by step
step 1: initialise the project and install dependencies
mkdir my-scraper && cd my-scraper
npm init -y
npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
[puppeteer](https://pptr.dev/) itself downloads a compatible Chrome for Testing build during install, so the first install can take 2-3 minutes depending on your connection. puppeteer-extra is a wrapper that enables plugins; puppeteer-extra-plugin-stealth patches around 20 known browser fingerprinting vectors like navigator.webdriver.
if it breaks: if the browser download fails on a headless server, set PUPPETEER_SKIP_DOWNLOAD=true and point executablePath to a system Chrome install instead.
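the executablePath fallback can be sketched as a small resolver. this is a minimal sketch, not part of Puppeteer: the resolveChromePath name and the candidate paths are assumptions for illustration, so adjust them to wherever your distribution installs Chrome or Chromium.

```javascript
const fs = require('node:fs');

// Hypothetical list of common Linux install locations; not exhaustive.
const DEFAULT_CANDIDATES = [
  '/usr/bin/google-chrome-stable',
  '/usr/bin/google-chrome',
  '/usr/bin/chromium',
  '/usr/bin/chromium-browser',
];

// Returns the first usable browser path, or undefined so that
// puppeteer.launch() falls back to its own downloaded binary.
function resolveChromePath(
  env = process.env,
  candidates = DEFAULT_CANDIDATES,
  exists = fs.existsSync
) {
  // An explicit CHROME_PATH always wins.
  if (env.CHROME_PATH) return env.CHROME_PATH;
  return candidates.find((candidate) => exists(candidate));
}
```

pass the result as executablePath: resolveChromePath() when launching. the env and exists parameters are only there so the function can be exercised without touching the real filesystem.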
step 2: launch a stealth browser
const puppeteer = require('puppeteer-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');
puppeteer.use(StealthPlugin());
async function launchBrowser(proxyUrl) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      `--proxy-server=${proxyUrl}`,
    ],
    executablePath: process.env.CHROME_PATH || undefined,
  });
  return browser;
}
--no-sandbox and --disable-setuid-sandbox are required when running as root in Docker. --disable-dev-shm-usage prevents Chrome from crashing in containers with a small /dev/shm (Docker's default is 64 MB) by writing shared memory files to /tmp instead.
if it breaks: if you see error while loading shared libraries on Ubuntu, run npx puppeteer browsers install chrome to reinstall the bundled binary. if the error persists, the library named in the message is a missing system package that you need to install with apt.
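one detail worth guarding against: launchBrowser interpolates proxyUrl into the flag unconditionally, so calling it without a proxy produces the literal flag --proxy-server=undefined. a minimal sketch of an options builder that only adds the flag when a proxy is configured (buildLaunchOptions is a hypothetical helper name, not a Puppeteer API):

```javascript
// Builds the options object for puppeteer.launch(). Both fields are optional.
function buildLaunchOptions({ proxyUrl, chromePath } = {}) {
  const args = [
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-dev-shm-usage',
  ];
  // Only add the proxy flag when a proxy is actually configured.
  if (proxyUrl) args.push(`--proxy-server=${proxyUrl}`);

  const options = { headless: true, args };
  // Leave executablePath unset so Puppeteer uses its own binary by default.
  if (chromePath) options.executablePath = chromePath;
  return options;
}
```

usage would look like puppeteer.launch(buildLaunchOptions({ proxyUrl: 'http://proxy.example.com:8000' })), where the host is a placeholder. this also makes it easy to run the same scraper without a proxy during local testing.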
step 3: route traffic through a proxy with authentication
async function setupPage(browser, username, password) {
  const page = await browser.newPage();
  await page.authenticate({ username, password });
  await page.setViewport({ width: 1280, height: 800 });
  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'
  );
  return page;
}
the page.authenticate call handles proxy auth challenges automatically. set a realistic viewport and user agent, because many anti-bot systems flag headless defaults like HeadlessChrome in the UA string. the stealth plugin handles most of this automatically, but an explicit UA override adds a second layer.
if it breaks: if the proxy returns a 407 auth error, double-check that your proxy URL format is http://host:port with the credentials passed separately via authenticate(). some providers document their endpoints as http://user:pass@host:port, but Chrome ignores credentials embedded in --proxy-server, and putting them in launch args would also expose them in process lists.
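if your provider hands you a single URL with credentials embedded, you can split it into the two pieces the steps above expect. a minimal sketch using Node's built-in URL class; splitProxyUrl is a hypothetical helper name and the host below is a placeholder:

```javascript
// Splits "http://user:pass@host:port" into a credential-free server string
// for --proxy-server and a username/password pair for page.authenticate().
function splitProxyUrl(proxyUrl) {
  const parsed = new URL(proxyUrl);
  return {
    server: `${parsed.protocol}//${parsed.host}`,
    // URL keeps credentials percent-encoded, so decode them before use.
    username: decodeURIComponent(parsed.username),
    password: decodeURIComponent(parsed.password),
  };
}
```

with that in place, server goes to launchBrowser and username and password go to setupPage, so the credentials never appear in the Chrome command line.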
step 4: navigate to the target and wait for content
async function navigate(page, url) {
  await page.goto(url, {
    waitUntil: 'networkidle2',
    timeout: 30000,
  });
  // wait for a specific element if the page is JS-heavy
  await page.waitForSelector('.product-list', { timeout: 15000 });
}
networkidle2 waits until there are no more than 2 open network connections for 500ms. it’s a reasonable signal that a SPA has finished its initial data fetch. for pages that load content lazily on scroll, you’ll need to scroll first and then wait.
if it breaks: networkidle2 hangs on sites with persistent WebSocket connections or polling intervals. switch to domcontentloaded and add a manual waitForSelector for the element you actually need.
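the fallback described above can also be automated. this is a sketch under one assumption: that Puppeteer's navigation timeouts surface as errors whose name is TimeoutError. navigateWithFallback is a hypothetical name, and the selector is whatever element you actually need.

```javascript
// Tries the stricter networkidle2 wait first, then falls back to
// domcontentloaded if the page never goes idle (WebSockets, polling).
async function navigateWithFallback(page, url, selector, timeout = 30000) {
  try {
    await page.goto(url, { waitUntil: 'networkidle2', timeout });
  } catch (err) {
    // Assumption: timeouts are reported as errors named "TimeoutError".
    // Anything else (DNS failure, proxy error) is rethrown untouched.
    if (err.name !== 'TimeoutError') throw err;
    await page.goto(url, { waitUntil: 'domcontentloaded', timeout });
  }
  // Either way, the element we need is the real readiness signal.
  await page.waitForSelector(selector, { timeout: 15000 });
}
```

the trade-off is that a hanging site costs you the full first timeout before the fallback kicks in. for targets you already know keep a connection open, skip the first attempt and go straight to domcontentloaded.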
step 5: extract structured data
async function extractProducts(page) {
  const products = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('.product-card')).map(card => ({
      name: card.querySelector('.product-title')?.textContent.trim(),
      price: card.querySelector('.price')?.textContent.trim(),
      url: card.querySelector('a')?.href,
    }));
  });
  return products;
}
page.evaluate() runs inside the browser context, so you get access to the DOM but not Node.js. keep this function lean and its return value serialisable: you can’t pass functions or class instances back out. for large pages with hundreds of items, this approach is faster than page.$$(selector) followed by repeated getProperty calls.
if it breaks: if textContent comes back empty, the element might be rendering inside a Shadow DOM. use page.$$eval with shadowRoot traversal, or reach for Playwright which has better built-in Shadow DOM support.
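because the extraction uses optional chaining, a selector that stops matching yields records with missing fields rather than an error. a small check on the Node side makes that visible. this is a minimal sketch; partitionProducts and the list of required fields are assumptions for illustration.

```javascript
// Splits scraped records into usable and rejected sets. A record is
// rejected when any required field is missing or an empty string, which
// is what you get when a selector inside the card matches nothing.
function partitionProducts(products, required = ['name', 'price', 'url']) {
  const valid = [];
  const rejected = [];
  for (const product of products) {
    const missing = required.filter((field) => !product[field]);
    if (missing.length === 0) valid.push(product);
    else rejected.push({ product, missing });
  }
  return { valid, rejected };
}
```

a sudden jump in the rejected count is usually the first sign that the target changed its markup, so it is worth logging on every run.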
step 6: handle infinite scroll or pagination
async function scrollToBottom(page) {
  let previousHeight = 0;
  while (true) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) break;
    previousHeight = currentHeight;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await new Promise(r => setTimeout(r, 1500));
  }
}
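the loop scrolls until the page stops growing. the 1500ms sleep gives the JS time to fetch and render the next batch. tune this delay based on the site’s actual load time. for paginated sites, loop through page URLs or click the “next” button with page.click(). if the click triggers a full page load, start waitForNavigation before clicking and await both together with Promise.all, otherwise the navigation can finish before you start listening for it.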
if it breaks: some sites detect automated scrolling by velocity. add jitter: await new Promise(r => setTimeout(r, 1000 + Math.random() * 1000)).
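the while (true) loop above never exits on a feed that keeps growing, so it helps to cap the number of rounds as well as jitter the delay. a sketch combining both; scrollWithLimit, jitteredDelay, and the default of 30 rounds are assumptions, not values from any library.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Base delay plus a random extra of up to spreadMs milliseconds.
function jitteredDelay(baseMs, spreadMs, random = Math.random) {
  return Math.round(baseMs + random() * spreadMs);
}

// Scrolls until the page height stops changing or maxRounds is reached.
// Returns the number of scrolls performed.
async function scrollWithLimit(
  page,
  { maxRounds = 30, baseMs = 1000, spreadMs = 1000, sleep = defaultSleep } = {}
) {
  let previousHeight = 0;
  for (let round = 0; round < maxRounds; round++) {
    const currentHeight = await page.evaluate(() => document.body.scrollHeight);
    if (currentHeight === previousHeight) return round;
    previousHeight = currentHeight;
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await sleep(jitteredDelay(baseMs, spreadMs));
  }
  return maxRounds;
}
```

if the function returns maxRounds, you hit the cap rather than the real bottom, which is worth logging so you know the data may be incomplete.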
step 7: write error handling and retry logic
async function scrapeWithRetry(url, proxyUrl, proxyUser, proxyPass, retries = 3) {
  let browser;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      browser = await launchBrowser(proxyUrl);
      const page = await setupPage(browser, proxyUser, proxyPass);
      await navigate(page, url);
      const data = await extractProducts(page);
      return data;
    } catch (err) {
      console.error(`attempt ${attempt} failed: ${err.message}`);
      if (attempt === retries) throw err;
      await new Promise(r => setTimeout(r, 2000 * attempt));
    } finally {
      if (browser) await browser.close();
    }
  }
}
always close the browser in a finally block. Chromium processes leak badly if you kill them mid-flight. the backoff is linear (2s after the first failure, 4s after the second) and gives transient network errors time to clear before the next attempt.
if it breaks: if you’re hitting memory limits, the browser may crash silently and the catch block never fires. add a browser.on('disconnected', ...) event listener to detect this and trigger cleanup.
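the delay in scrapeWithRetry is computed as 2000 * attempt, which grows linearly. if you want the wait to double on each failure instead, a capped exponential variant could look like the sketch below. backoffDelay, withRetry, and the 30-second cap are assumptions chosen for illustration.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the next attempt: baseMs * 2^(attempt - 1), capped at maxMs.
// With the defaults this yields 2s, 4s, 8s, 16s, then 30s from there on.
function backoffDelay(attempt, baseMs = 2000, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Runs task(attempt) until it resolves or the attempts run out.
// The last error is rethrown so the caller can log or alert on it.
async function withRetry(task, { retries = 3, sleep = defaultSleep } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      if (attempt === retries) throw err;
      await sleep(backoffDelay(attempt));
    }
  }
}
```

the task still needs its own try/finally to close the browser, exactly as in the function above. withRetry only decides when to try again.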
step 8: save output to disk or a database
const fs = require('fs');
async function saveResults(data, filename) {
  fs.writeFileSync(filename, JSON.stringify(data, null, 2));
  console.log(`saved ${data.length} records to ${filename}`);
}
for small runs, JSON files are fine. for anything beyond a few thousand records, pipe directly into PostgreSQL, SQLite, or a cloud bucket. if you’re running concurrent scrapers, writes to a single file will corrupt data. use a queue-based approach or write per-job files and merge later.
if it breaks: if the file is empty, page.evaluate likely returned an empty array because the selector didn’t match. log the page HTML with await page.content() to debug.
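the per-job-files-then-merge approach can be sketched with nothing but the standard library. writeJobFile and mergeJobFiles are hypothetical helper names, and the sketch assumes each job ID is unique and safe to use as a filename.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Writes one job's records to its own file. Writing to a temp name and
// renaming means a reader never sees a half-written file.
function writeJobFile(dir, jobId, records) {
  fs.mkdirSync(dir, { recursive: true });
  const finalPath = path.join(dir, `${jobId}.json`);
  const tempPath = `${finalPath}.tmp`;
  fs.writeFileSync(tempPath, JSON.stringify(records));
  fs.renameSync(tempPath, finalPath);
  return finalPath;
}

// Reads every completed job file in dir and concatenates the records.
// Files still being written end in .tmp and are skipped.
function mergeJobFiles(dir) {
  return fs
    .readdirSync(dir)
    .filter((name) => name.endsWith('.json'))
    .sort()
    .flatMap((name) => JSON.parse(fs.readFileSync(path.join(dir, name), 'utf8')));
}
```

since no two workers ever write to the same path, there is nothing to lock. the merge step can run on a schedule or once at the end of a batch.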
common pitfalls
not rotating proxies. a single IP scraping the same site will get blocked within minutes on most production targets. rotate proxies per request or per session. residential IPs from providers like ProxyScraping are harder to block than datacenter ranges.
running too many concurrent browsers. each Chromium instance uses roughly 300-500 MB of RAM at idle and spikes higher during page loads. on a 4 GB server, 3-4 concurrent instances is the realistic ceiling before you start swapping to disk.
ignoring robots.txt ethics. this doesn’t mean you must obey it, but you should know what it says. sites that explicitly forbid scraping and notice aggressive traffic may respond with legal letters. for antidetect and multi-account use cases that overlap with scraping, the antidetectreview.org/blog/ community has useful operator-perspective discussions on risk management.
using waitForTimeout everywhere. page.waitForTimeout(3000) is a fixed sleep, and the method has been removed from recent Puppeteer releases. a fixed sleep either wastes time on fast pages or fails on slow ones. use waitForSelector, waitForFunction, or waitForNavigation instead; they resolve as soon as the condition is met.
not handling session state. if a site requires login, store cookies after authentication and reuse them. re-logging in for every request burns time and looks like a bot. save cookies with page.cookies() and restore with page.setCookie().
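the cookie save-and-restore pattern from the last pitfall can be sketched as two helpers around page.cookies() and page.setCookie(). the helper names and the idea of storing cookies in a local JSON file are assumptions. check whether your Puppeteer version marks these page-level cookie methods as deprecated in favour of browser-level ones.

```javascript
const fs = require('node:fs');

// Persists the page's cookies after a successful login.
// Returns how many cookies were written.
async function saveCookies(page, filePath) {
  const cookies = await page.cookies();
  fs.writeFileSync(filePath, JSON.stringify(cookies, null, 2));
  return cookies.length;
}

// Restores previously saved cookies. Returns false if there is no file yet,
// so the caller knows it still has to log in.
async function restoreCookies(page, filePath) {
  if (!fs.existsSync(filePath)) return false;
  const cookies = JSON.parse(fs.readFileSync(filePath, 'utf8'));
  if (cookies.length > 0) await page.setCookie(...cookies);
  return true;
}
```

call restoreCookies before the first page.goto and only run the login flow when it returns false. treat the cookie file like a credential: keep it out of version control and off shared volumes.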
scaling this
10x (100 requests/day). a single VPS with 2 concurrent browsers and a cron job handles this fine. no queue needed. log to flat files.
100x (1000 requests/day). you need a proper job queue. Bull with Redis works well. run 2-4 browser workers as separate Node processes. add a proxy pool with health checks so bad IPs get retired automatically. see our guide on building a proxy rotation system for the implementation.
1000x (10,000+ requests/day). containerise each browser worker in Docker. orchestrate with Kubernetes or Nomad. use a cloud object store (S3, R2) for output. consider switching compute-heavy pages to a dedicated scraping API like ProxyScraping’s rendering endpoint, which offloads the Chromium cost entirely. at this scale your proxy bill can easily exceed your compute bill, so cache aggressively and deduplicate URLs before queuing.
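the proxy pool with health checks mentioned for the 100x tier does not need to be elaborate to start with. below is a minimal in-memory sketch: round-robin rotation with passive health tracking, meaning a proxy is retired after a streak of reported failures rather than probed actively. the ProxyPool class and the threshold of 3 failures are assumptions for illustration.

```javascript
// Round-robin proxy pool that retires a proxy after repeated failures.
class ProxyPool {
  constructor(proxies, maxFailures = 3) {
    this.entries = proxies.map((url) => ({ url, failures: 0 }));
    this.maxFailures = maxFailures;
    this.cursor = 0;
  }

  // Returns the next healthy proxy URL, or null when every proxy is retired.
  next() {
    for (let i = 0; i < this.entries.length; i++) {
      const entry = this.entries[this.cursor];
      this.cursor = (this.cursor + 1) % this.entries.length;
      if (entry.failures < this.maxFailures) return entry.url;
    }
    return null;
  }

  // Call after a blocked or failed request through this proxy.
  reportFailure(url) {
    const entry = this.entries.find((e) => e.url === url);
    if (entry) entry.failures += 1;
  }

  // Call after a successful request; clears the failure streak.
  reportSuccess(url) {
    const entry = this.entries.find((e) => e.url === url);
    if (entry) entry.failures = 0;
  }
}
```

with several worker processes the state would need to live somewhere shared, such as the Redis instance already backing the queue. the interface can stay the same.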
the Puppeteer documentation on performance covers browser context reuse (browser.createBrowserContext()) which lets you share a single Chromium process across multiple isolated sessions, cutting memory use by roughly 40% in practice.
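a sketch of that context-reuse pattern: one long-lived browser, with each job running in its own context that is torn down afterwards. withIsolatedPage is a hypothetical wrapper name; the calls it relies on are browser.createBrowserContext(), context.newPage(), and context.close().

```javascript
// Runs task(page) inside a fresh, isolated browser context on a shared
// browser, and always closes the context afterwards.
async function withIsolatedPage(browser, task) {
  const context = await browser.createBrowserContext();
  try {
    const page = await context.newPage();
    return await task(page);
  } finally {
    // Closing the context also closes the pages that belong to it.
    await context.close();
  }
}
```

each context gets its own cookies and storage, so sessions do not bleed into each other. the catch is that all contexts share one Chromium process, so a browser crash takes every in-flight job down with it.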
where to go next
- Playwright vs Puppeteer for production scraping in 2026 covers when to switch frameworks and what you gain or lose
- Choosing a residential proxy provider: 2026 benchmarks has throughput and block-rate data across major providers
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.