The 2026 Crawlee guide for production scraping

Most web scraping tutorials show you how to fetch a page. that’s the easy part. what they skip is everything that breaks at scale: request queues that survive crashes, proxy rotation that doesn’t burn IPs, browser fingerprints that don’t scream “bot”, and storage that doesn’t fill your disk on night one. i’ve hit all of those walls.

Crawlee is an open-source TypeScript/JavaScript scraping framework maintained by Apify. it unifies Cheerio (fast HTML parsing), Puppeteer, and Playwright under one API, and ships with a request queue, key-value store, dataset storage, and proxy management built in. it’s not the only option, but for teams already in the Node ecosystem it removes a lot of glue code that you’d otherwise maintain yourself.

this guide is for operators who have scraped before but want a production-quality setup, not a weekend script. by the end you’ll have a working Crawlee project with persistent queues, rotating proxies, and a deployment path, whether that’s self-hosted or on Apify’s cloud. i’ll also cover the mistakes i’ve made so you don’t have to.

what you need

  • Node.js 18 or higher. Crawlee v3 dropped support for older versions. grab the latest LTS from nodejs.org.
  • npm or yarn. npm ships with Node, yarn is optional.
  • A proxy provider. for anything beyond hobby projects you need rotating residential or datacenter proxies. proxyscraping has a proxy API and rotation docs worth reading before you start.
  • Docker (optional but recommended for production). containerizing your scraper makes deployment consistent.
  • Apify account (optional). Crawlee deploys natively to Apify’s platform. free tier exists but paid plans start around $49/month as of May 2026 for Actor compute.
  • Target site awareness. check the site’s robots.txt and terms before you scrape. this is not legal advice, but ignoring those documents has consequences.

estimated infrastructure cost for a small production job: $20-80/month depending on proxy volume and whether you self-host or use Apify.

step by step

step 1: initialize your project

create a fresh directory and scaffold the project.

mkdir my-crawler && cd my-crawler
npm init -y

then install Crawlee with your preferred crawler type. i use the full bundle for flexibility:

npm install crawlee

if you plan to use PlaywrightCrawler (full-browser rendering), install the playwright package as well and download a browser binary:

npm install playwright
npx playwright install chromium

expected output: node_modules folder, package.json updated.
if it breaks: if playwright install fails on Linux, you’re likely missing system deps. run npx playwright install-deps chromium as root.


step 2: pick your crawler type

Crawlee gives you three main crawlers (others such as HttpCrawler and JSDOMCrawler exist, but these three cover most jobs). picking the wrong one wastes money and time.

  • CheerioCrawler: parses raw HTML with jQuery-style selectors. fast, low memory, no JavaScript execution. use this when the data is in the initial HTML response.
  • PlaywrightCrawler: full headless browser via Playwright. handles SPAs, infinite scroll, click interactions. slower and heavier.
  • PuppeteerCrawler: same idea as Playwright but uses the Puppeteer API. Playwright is generally preferred now.

for most e-commerce or news scraping, try CheerioCrawler first. only reach for Playwright when the content requires JS execution.

if it breaks: if you’re getting empty results from Cheerio, the page is probably client-rendered. switch to PlaywrightCrawler and verify with page.content().


step 3: write your first crawler

here’s a minimal but real CheerioCrawler setup. save it as src/main.js and set "type": "module" in package.json so the import syntax and top-level await work:

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
    // cap the crawl while developing
    maxRequestsPerCrawl: 200,
    async requestHandler({ $, request, enqueueLinks }) {
        const title = $('h1').first().text().trim();
        const url = request.url;

        await Dataset.pushData({ url, title });

        // follow internal links; by default enqueueLinks resolves relative
        // hrefs and stays on the same hostname
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);

expected output: a storage/datasets/default/ folder appears with JSON files, one per page.
if it breaks: if you see 403 errors immediately, the site is blocking your user-agent or IP. jump to step 5 (proxies) before debugging further.


step 4: configure the request queue

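by default Crawlee uses a file-based RequestQueue in ./storage/, so queue state lives on disk rather than only in memory. one catch: Crawlee purges the default storages at the start of every run unless you tell it not to. to resume a crashed job where it left off, disable the purge with CRAWLEE_PURGE_ON_START=0 (or purgeOnStart: false in Configuration). that alone saves you from re-scraping thousands of pages.

with purging disabled, start fresh instead of resuming by deleting the directory:

rm -rf ./storage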

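to set a custom storage directory (useful in Docker), either set the CRAWLEE_STORAGE_DIR environment variable or configure it in code before any storage is opened: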

import { Configuration } from 'crawlee';

Configuration.getGlobalConfig().set('storageClientOptions', {
    localDataDirectory: '/data/crawlee-storage',
});

expected output: queue state and datasets written to your specified path.
if it breaks: permission errors on the storage path are common in containers. make sure your Docker volume mount is writable by the process user.


step 5: add proxy rotation

this is where most scrapers either win or lose. Crawlee has a ProxyConfiguration class that handles rotation automatically per request.

import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@host:10000',
        'http://user:pass@host:10001',
        'http://user:pass@host:10002',
    ],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxRequestsPerCrawl: 500,
    async requestHandler({ $, request, proxyInfo }) {
        console.log(`Fetched ${request.url} via ${proxyInfo?.url}`);
        // your handler logic
    },
});

for residential proxy rotation i pull from a pool endpoint instead of hardcoded IPs. most providers give you a gateway URL like http://user:pass@host:port that rotates automatically on each connection.

expected output: each log line shows the proxy URL used, with requests spread across the list.
if it breaks: if all requests fail, test your proxy URL directly with curl -x "http://user:pass@host:port" https://httpbin.org/ip. if that fails too, the credentials or endpoint are wrong.
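
one way to keep proxy credentials out of source and out of logs is sketched below. it assumes a hypothetical PROXY_URLS environment variable holding a comma-separated list of proxy URLs. loadProxyUrls and redactProxyUrl are illustrative helpers, not part of Crawlee.

```javascript
// Hypothetical helpers (not part of Crawlee). PROXY_URLS is an assumed
// environment variable holding comma-separated proxy URLs.
function loadProxyUrls(env = process.env) {
    const urls = (env.PROXY_URLS ?? '')
        .split(',')
        .map((entry) => entry.trim())
        .filter(Boolean);

    if (urls.length === 0) {
        throw new Error('PROXY_URLS is empty; set it before starting the crawler');
    }
    for (const url of urls) {
        // new URL() throws on malformed input, so bad entries fail at startup
        new URL(url);
    }
    return urls;
}

// Strip the username and password so the proxy can be logged safely.
function redactProxyUrl(proxyUrl) {
    const parsed = new URL(proxyUrl);
    parsed.username = '';
    parsed.password = '';
    return parsed.href;
}
```

the array returned by loadProxyUrls() is meant to be passed as proxyUrls to ProxyConfiguration, and redactProxyUrl can wrap the proxy URL before it reaches console.log.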


step 6: handle pagination

most real targets require following next-page links or incrementing page parameters. here’s a pattern that works reliably:

async requestHandler({ $, request, enqueueLinks }) {
    // scrape current page
    $('.product-card').each((_, el) => {
        // extract data
    });

    // enqueue next page, resolving a relative href against the current page URL
    const nextHref = $('a.pagination-next').attr('href');
    if (nextHref) {
        const nextUrl = new URL(nextHref, request.loadedUrl ?? request.url).href;
        await enqueueLinks({ urls: [nextUrl] });
    }
},

enqueueLinks deduplicates URLs against the existing queue automatically, so you won’t loop on pages you’ve already visited.

expected output: queue grows as pagination runs, then drains.
if it breaks: if pagination loops infinitely, the next-page link may be relative and resolve incorrectly. log nextHref before enqueuing and verify the full URL is what you expect.
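
the relative-link problem comes down to how an href resolves against the URL of the page it was found on. here’s a small sketch using the standard URL constructor. resolveNextUrl is an illustrative helper and the example.com URLs are placeholders.

```javascript
// Resolve a next-page href against the URL of the page it was found on.
// Returns null when the href is missing or cannot be parsed.
function resolveNextUrl(nextHref, pageUrl) {
    if (!nextHref) return null;
    try {
        return new URL(nextHref, pageUrl).href;
    } catch {
        return null;
    }
}

const pageUrl = 'https://example.com/products/list?page=1';
console.log(resolveNextUrl('?page=2', pageUrl));      // https://example.com/products/list?page=2
console.log(resolveNextUrl('/sale?page=2', pageUrl)); // https://example.com/sale?page=2
console.log(resolveNextUrl('archive/2', pageUrl));    // https://example.com/products/archive/2
```

logging the resolved value instead of the raw href makes it obvious when a site’s pagination links point somewhere you didn’t expect.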


step 7: export your data

Crawlee’s Dataset API writes to JSON by default. to export as CSV or to push to an external store:

import { Dataset } from 'crawlee';

// export all collected data to CSV
const dataset = await Dataset.open();
await dataset.exportToCSV('results');

this writes results.csv to ./storage/key_value_stores/default/. for pushing to a database, call your DB client directly inside the requestHandler. i typically batch inserts in groups of 100 to reduce round-trips.

expected output: results.csv in the key-value store directory.
if it breaks: if the CSV is empty, make sure Dataset.pushData() was called at least once before export.
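
the batching idea can be sketched as a small buffer that flushes every 100 rows. insertMany here is a hypothetical callback standing in for whatever your database client exposes, and createBatchWriter is an illustrative name, not a Crawlee API.

```javascript
// Minimal batching sketch. insertMany is a hypothetical callback standing in
// for a real database client call; it receives an array of rows.
function createBatchWriter(insertMany, batchSize = 100) {
    let buffer = [];

    async function flush() {
        if (buffer.length === 0) return;
        const batch = buffer;
        buffer = []; // swap first so rows added during the insert are kept
        await insertMany(batch);
    }

    async function add(row) {
        buffer.push(row);
        if (buffer.length >= batchSize) {
            await flush();
        }
    }

    return { add, flush };
}
```

inside the requestHandler you would await writer.add(row) for each record, then await writer.flush() once after crawler.run() finishes so the final partial batch is written. this sketch does not retry a failed insert, so a production version would need its own error handling.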


step 8: deploy to production

for self-hosted, containerize it:

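FROM node:20-slim

WORKDIR /app
COPY package*.json ./
RUN npm ci

# install Chromium plus its system deps, using the Playwright version from package.json
# (skip this line if you only use CheerioCrawler)
RUN npx playwright install --with-deps chromium

COPY . .

CMD ["node", "src/main.js"]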

for Apify’s managed platform, install their CLI and push:

npm install -g apify-cli
apify login
apify push

Apify handles scaling, scheduling, and storage. for background on how operators manage multi-account or multi-session scraping setups, the folks at multiaccountops.com/blog/ cover some useful patterns around session isolation and fingerprint separation that apply here too.

expected output: Actor deployed, visible in Apify console.
if it breaks: if the Actor fails on startup, check the build logs for missing system dependencies. Playwright in particular needs system libraries that aren’t in the default Node image.

common pitfalls

1. using PlaywrightCrawler for everything. it’s tempting to default to the browser crawler because it handles more cases. but PlaywrightCrawler uses 5-10x more memory and is significantly slower. always try Cheerio first and only escalate when the page actually needs a browser.

2. ignoring the request queue storage location. if you run your scraper in a temp container without mounting a persistent volume, the queue is lost on restart. i’ve watched long-running jobs vanish this way. always mount your storage directory.

3. hardcoding proxy credentials in source. proxy credentials rotate and expire. use environment variables and never commit them to git. use process.env.PROXY_URL and pass the variable at runtime.

4. not setting maxRequestsPerCrawl in development. without this limit, a development run against a large site will enqueue millions of URLs and chew through your proxy allowance. set a low cap (50-200) while developing, remove it in production.

5. relying on page.waitForTimeout with fixed delays. hardcoded waitForTimeout(3000) calls are brittle and slow. use waitForSelector with a specific element that signals the content has loaded. it’s faster and more reliable across page load variance.
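
for pitfall 4, one option is to derive the cap from the environment so a development run can’t go uncapped by accident. this is a sketch: CRAWL_LIMIT is a hypothetical variable name, the default of 100 is arbitrary, and resolveMaxRequests is not a Crawlee function.

```javascript
// Sketch: pick a request cap from the environment. CRAWL_LIMIT is a
// hypothetical variable name; NODE_ENV follows the usual Node convention.
function resolveMaxRequests(env = process.env) {
    if (env.CRAWL_LIMIT !== undefined) {
        const limit = Number(env.CRAWL_LIMIT);
        if (!Number.isInteger(limit) || limit <= 0) {
            throw new Error(`CRAWL_LIMIT must be a positive integer, got "${env.CRAWL_LIMIT}"`);
        }
        return limit;
    }
    // no explicit limit: uncapped in production, a small cap everywhere else
    return env.NODE_ENV === 'production' ? undefined : 100;
}
```

the result is intended for the maxRequestsPerCrawl option, with undefined meaning no cap.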

scaling this

10x (a few hundred pages/day). a single Node process on a $6/month VPS with a cheap datacenter proxy pool handles this easily. the default file-based storage is fine.

100x (tens of thousands of pages/day). you’ll hit concurrency limits on a single machine. split work across multiple Actors on Apify, or run multiple containers in parallel with a shared Redis-backed queue. Crawlee supports custom StorageClient implementations, so you can swap in Redis or a remote key-value store. proxy spend becomes real here, expect $50-150/month depending on target.

1000x (millions of pages/day). file-based storage is out. you need a distributed queue (Redis or a managed queue service), a dedicated proxy pool with residential or ISP proxies, and browser fingerprint randomization if you’re using Playwright. at this level the browser fingerprinting angle matters significantly. sites run sophisticated bot detection that correlates canvas fingerprints, WebGL hashes, and timing patterns. antidetect browser reviews at antidetectreview.org/blog/ go deeper on that topic if you’re hitting detection walls. Crawlee’s PlaywrightCrawler supports custom browser launch options where you can inject fingerprint-spoofing plugins.

one thing that doesn’t change at any scale: the request queue logic. Crawlee’s deduplication and retry handling scales well. what changes is the infrastructure underneath it.
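
to make the deduplication idea concrete, here is a simplified illustration of URL-based dedup. it is not Crawlee’s implementation. the normalization rules (drop the fragment, sort query parameters) are assumptions chosen for the sketch, and dedupKey and SeenSet are made-up names.

```javascript
// Simplified illustration of URL-based deduplication. This is NOT Crawlee's
// implementation; the normalization rules are assumptions for the sketch.
function dedupKey(rawUrl) {
    const url = new URL(rawUrl);
    url.hash = '';           // fragments do not change the fetched page
    url.searchParams.sort(); // ?a=1&b=2 and ?b=2&a=1 become the same key
    return url.href;
}

class SeenSet {
    constructor() {
        this.keys = new Set();
    }

    // Returns true the first time a URL is offered, false for repeats.
    addIfNew(rawUrl) {
        const key = dedupKey(rawUrl);
        if (this.keys.has(key)) return false;
        this.keys.add(key);
        return true;
    }
}
```

whatever backs the queue at scale (files, Redis, a managed service), the same principle applies: dedup works on a normalized key rather than the raw string, so the key function has to be identical on every worker.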

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
