
The 2026 Apify SDK guide for production scraping

Most scraping tutorials stop at “get your first response.” that’s fine for a weekend project. it’s not fine when you’re running scrapers that need to stay up, rotate proxies, respect rate limits, and dump clean data somewhere useful at scale.

the Apify SDK is one of the more mature options for production-grade scraping on Node.js. it wraps Crawlee, the open-source web crawling library, and adds a layer for deploying to Apify’s cloud platform, managing storage, and handling Actor inputs and outputs. you can run it locally or push it to Apify’s infrastructure. both are useful depending on your setup.

this guide is for operators who already know JavaScript and want a working scraper that doesn’t fall apart after a few thousand requests. by the end, you’ll have an Actor that crawls a target site, handles retries and proxy rotation, persists data to Apify’s dataset, and is ready to schedule or trigger via API.


what you need

  • Node.js 18+ installed locally
  • Apify account at apify.com (free tier works for dev, pay-as-you-go for production, roughly $0.004 per actor compute unit)
  • Apify CLI (npm install -g apify-cli)
  • Proxies: residential or datacenter depending on your target. Apify’s built-in proxy pool starts at $12.50/GB for residential. you can also plug in your own.
  • A target URL: pick a site you have permission to scrape. public data, your own properties, or a site where you’ve confirmed scraping is permitted.
  • basic familiarity with async/await in JavaScript
  • optionally, Playwright for JS-rendered pages (adds browser overhead, plan accordingly)

step by step

step 1: scaffold a new actor

apify create my-scraper

the CLI will ask which template you want. pick “Crawlee + CheerioCrawler” for HTML-only sites, or “Crawlee + PlaywrightCrawler” for JS-rendered pages. CheerioCrawler is faster and cheaper, so default to it unless you need browser rendering.

expected output: a folder my-scraper/ with src/main.js, package.json, and .actor/actor.json.

if it breaks: if apify isn’t found after install, check your PATH or run via npx apify-cli create.


step 2: understand the project structure

open src/main.js. the scaffold gives you something like this:

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const { startUrls = [{ url: 'https://example.com' }] } = input ?? {};

const crawler = new CheerioCrawler({
    requestHandler: async ({ $, request }) => {
        const title = $('title').text();
        await Actor.pushData({ url: request.url, title });
    },
});

await crawler.run(startUrls);
await Actor.exit();

Actor.init() prepares the runtime: on the Apify platform it connects to the platform’s storage and event system; locally it falls back to file-based storage under ./storage, configured through environment variables. Actor.getInput() reads the JSON input you pass when triggering the actor. Actor.pushData() saves records to the default dataset.

if it breaks: if you see an error about a missing APIFY_TOKEN (usually because the code touches a platform feature such as Apify Proxy), run apify login, or set APIFY_TOKEN in your environment to the token from your Apify account settings.


step 3: add request queue and crawl depth control

for anything beyond a single URL, use a request queue. Crawlee manages this automatically, but you should configure maxRequestsPerCrawl and maxConcurrency explicitly:

const crawler = new CheerioCrawler({
    maxConcurrency: 10,
    maxRequestsPerCrawl: 500,
    requestHandlerTimeoutSecs: 30,

    requestHandler: async ({ $, request, enqueueLinks }) => {
        const title = $('title').text();
        await Actor.pushData({ url: request.url, title });

        // follow internal links
        await enqueueLinks({
            globs: ['https://target-site.com/**'],
        });
    },
});

maxConcurrency: 10 keeps you from hammering a server. enqueueLinks with a glob filter prevents crawling off-domain. maxRequestsPerCrawl is your circuit breaker.

if it breaks: if the queue grows unbounded and memory spikes, lower maxConcurrency and double-check your glob filter. locally, Crawlee persists the request queue to disk under storage/request_queues/; on Apify’s platform it uses the platform’s request queue storage.
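a glob keeps the crawl on-domain, but it won’t stop the queue filling up with PDFs, images, or the same page under different tracking parameters. one option is to pass a filter as transformRequestFunction to enqueueLinks; returning a falsy value skips the link. the sketch below is only an illustration: filterRequest, the extension list, and the utm_ rule are assumptions, not something the scaffold ships with.

```javascript
// hypothetical filter for enqueueLinks({ transformRequestFunction: filterRequest })
const SKIPPED_EXTENSIONS = ['.pdf', '.zip', '.jpg', '.jpeg', '.png', '.gif'];

function filterRequest(requestOptions) {
    const url = new URL(requestOptions.url);
    const path = url.pathname.toLowerCase();

    // returning false tells enqueueLinks to drop this link
    if (SKIPPED_EXTENSIONS.some((ext) => path.endsWith(ext))) return false;

    // strip utm_* params and the fragment so one page maps to one URL
    for (const key of [...url.searchParams.keys()]) {
        if (key.startsWith('utm_')) url.searchParams.delete(key);
    }
    url.hash = '';

    return { ...requestOptions, url: url.toString() };
}
```

wire it in with enqueueLinks({ globs: [...], transformRequestFunction: filterRequest }). because the function is pure, you can unit test it without starting a crawl.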


step 4: configure proxy rotation

this is where most scrapers fail in production. without proxy rotation, your IP gets blocked after a few hundred requests on any serious site.

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],   // omit groups to fall back to the datacenter proxies on your account
    countryCode: 'US',
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    maxConcurrency: 10,
    requestHandler: async ({ $, request }) => {
        await Actor.pushData({
            url: request.url,
            title: $('title').text(),
        });
    },
});

Actor.createProxyConfiguration() integrates with Apify’s built-in proxy pool. if you have your own proxies, pass them directly:

const proxyConfiguration = await Actor.createProxyConfiguration({
    proxyUrls: [
        // placeholders: substitute your own credentials and proxy hosts
        'http://USERNAME:PASSWORD@PROXY_HOST_1:8000',
        'http://USERNAME:PASSWORD@PROXY_HOST_2:8000',
    ],
});

if it breaks: residential proxies are slow. if you’re seeing timeouts, increase requestHandlerTimeoutSecs to 60 or even 90. if you’re seeing 407 errors, check your proxy credentials.
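many 407s trace back to a malformed entry in a custom proxy list, so it can help to check the list before handing it to Actor.createProxyConfiguration(). this is a minimal sketch using only the built-in URL class; validateProxyUrls is a hypothetical helper, and it only checks shape, not whether the proxy actually accepts the credentials.

```javascript
// hypothetical pre-flight check for a list of custom proxy URLs
function validateProxyUrls(proxyUrls) {
    const problems = [];
    proxyUrls.forEach((raw, index) => {
        let parsed;
        try {
            parsed = new URL(raw);
        } catch {
            // report the index, not the raw string, to keep credentials out of logs
            problems.push(`entry ${index}: not a valid URL`);
            return;
        }
        if (!parsed.username || !parsed.password) {
            problems.push(`entry ${index} (${parsed.host}): missing username or password`);
        }
    });
    return problems;
}
```

call it at startup and throw if the returned array is non-empty; failing fast is cheaper than burning retries on a proxy that was never going to authenticate.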


step 5: handle errors and retries

the Apify SDK docs cover this in detail. the short version: Crawlee retries failed requests up to maxRequestRetries times (default 3). you can hook into failures to log or re-classify:

const crawler = new CheerioCrawler({
    maxRequestRetries: 3,
    requestHandlerTimeoutSecs: 30,

    failedRequestHandler: async ({ request }, error) => {
        console.error(`Request ${request.url} failed: ${error.message}`);
        await Actor.pushData({
            url: request.url,
            error: error.message,
            failed: true,
        });
    },

    requestHandler: async ({ $, request, response }) => {
        // CheerioCrawler hands you the Node.js response, so the code is on statusCode
        if (response.statusCode !== 200) {
            throw new Error(`Unexpected status: ${response.statusCode}`);
        }
        await Actor.pushData({ url: request.url, title: $('title').text() });
    },
});

pushing failed requests to the dataset lets you audit what broke without losing track of it.

if it breaks: if you’re hitting the same URL repeatedly and failing, add it to a blocklist or use request.noRetry = true inside the handler to skip retries for known-bad URLs.
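deciding when to set request.noRetry is easier with the policy in one place. the function below is a sketch of one possible policy, not Crawlee’s built-in behaviour: shouldRetry and the choice of which status codes count as transient are assumptions you should tune per target.

```javascript
// hypothetical retry policy keyed on the HTTP status code
const TRANSIENT_4XX = new Set([403, 408, 429]); // possible block, timeout, rate limit

function shouldRetry(statusCode) {
    if (statusCode >= 500) return true;            // server-side errors are often temporary
    if (TRANSIENT_4XX.has(statusCode)) return true; // a new proxy or a pause may help
    if (statusCode >= 400) return false;           // 404, 410 and similar won't fix themselves
    return true;                                   // anything else: leave the default behaviour
}
```

inside requestHandler you could then write if (!shouldRetry(response.statusCode)) request.noRetry = true; just before throwing, so dead URLs fail once instead of using up every retry.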


step 6: define actor input schema

hardcoded URLs are fine for testing. for production, define an INPUT_SCHEMA.json so anyone triggering the actor can configure it without touching code:

{
    "title": "My Scraper",
    "description": "Crawls a site and extracts titles.",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "URLs to start crawling from.",
            "editor": "requestListSources"
        },
        "maxRequestsPerCrawl": {
            "title": "Max requests",
            "type": "integer",
            "description": "Upper limit on the number of pages the crawl will request.",
            "default": 100
        }
    }
}

place this file in .actor/INPUT_SCHEMA.json. Apify’s UI will render a form from it. this is what makes actors shareable and schedulable without code changes.

if it breaks: if the Apify UI shows a blank input form, check that schemaVersion is set to 1 and that the JSON is valid.
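the schema only describes the input; the code still has to read maxRequestsPerCrawl and cope with a missing or malformed value, for example on local runs with no input at all. a minimal sketch, assuming the same defaults as the schema and the scaffold: normalizeInput is a hypothetical helper name.

```javascript
// hypothetical helper: applies defaults to whatever Actor.getInput() returned
function normalizeInput(input) {
    const { startUrls, maxRequestsPerCrawl } = input ?? {};

    // fall back to the scaffold's example URL when nothing usable was passed
    const urls = Array.isArray(startUrls) && startUrls.length > 0
        ? startUrls
        : [{ url: 'https://example.com' }];

    // mirror the schema default of 100 and reject zero, negatives, and non-integers
    const limit = Number.isInteger(maxRequestsPerCrawl) && maxRequestsPerCrawl > 0
        ? maxRequestsPerCrawl
        : 100;

    return { startUrls: urls, maxRequestsPerCrawl: limit };
}
```

in main.js this would look like const { startUrls, maxRequestsPerCrawl } = normalizeInput(await Actor.getInput()); with maxRequestsPerCrawl then passed into the CheerioCrawler options.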


step 7: run locally, then push

test locally first:

cd my-scraper
npm install
apify run

this runs the actor on your machine (picking up the credentials saved by apify login when the code needs platform features) and writes output to storage/datasets/default/. check the JSON files there to confirm your data looks right.

when you’re satisfied:

apify push

this uploads the actor to your Apify account. go to the Apify console, find the actor, and trigger a test run with real input. check the “Dataset” tab for output and the “Log” tab for any runtime errors.

if it breaks: if apify push fails with authentication errors, run apify login first and paste your API token from the Apify console.


step 8: schedule or trigger via API

once the actor is on the platform, you can schedule it or call it from external systems:

curl -X POST \
  "https://api.apify.com/v2/acts/YOUR_USERNAME~my-scraper/runs" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{"startUrls":[{"url":"https://target-site.com"}],"maxRequestsPerCrawl":500}'

the API returns a run ID. poll GET /v2/actor-runs/{runId} to check status, or set up a webhook to POST to your endpoint when the run finishes. this is how you integrate scraping into a pipeline without running infrastructure yourself.

if it breaks: 401 errors mean the token is wrong or missing. 429 means you’ve hit rate limits. check Apify’s API reference for rate limit headers.
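if you go the polling route, a small loop with a cap on attempts is usually enough. the sketch below assumes Node.js 18+ (for the global fetch) and the run statuses SUCCEEDED, FAILED, ABORTED, and TIMED-OUT as the terminal ones; waitForRun and its options are hypothetical names, and fetchImpl is injectable purely so the function can be tested without network access.

```javascript
// statuses after which a run will not change any more (assumed from the API docs)
const TERMINAL_STATUSES = new Set(['SUCCEEDED', 'FAILED', 'ABORTED', 'TIMED-OUT']);

// hypothetical helper: polls GET /v2/actor-runs/{runId} until the run finishes
async function waitForRun(runId, token, { fetchImpl = fetch, intervalMs = 5000, maxPolls = 120 } = {}) {
    for (let attempt = 0; attempt < maxPolls; attempt++) {
        const res = await fetchImpl(`https://api.apify.com/v2/actor-runs/${runId}`, {
            headers: { Authorization: `Bearer ${token}` },
        });
        if (!res.ok) throw new Error(`Status check failed: HTTP ${res.status}`);

        // the API wraps the run object in a "data" envelope
        const { data } = await res.json();
        if (TERMINAL_STATUSES.has(data.status)) return data;

        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    throw new Error(`Run ${runId} did not finish after ${maxPolls} polls`);
}
```

the returned object is the run record, so the caller can branch on its status field. for long crawls a webhook avoids the polling traffic entirely.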


common pitfalls

using PlaywrightCrawler when you don’t need to. browser-based crawling is far heavier on compute, often by an order of magnitude, because every page loads in a real browser. check whether the data is in the initial HTML before reaching for Playwright. use curl or view-source: in the browser to confirm.

not setting maxRequestsPerCrawl. without it, a crawl can queue tens of thousands of URLs and run for hours. always set a ceiling in dev, even if you remove it in production.

ignoring the failed request log. most operators only look at the happy-path data. failed requests often reveal pattern mismatches or newly blocked routes. check failedRequestHandler output on every production run.

flat proxy configuration across all domains. if you’re scraping multiple targets in one actor, some may need residential proxies and others won’t. the simplest fix is one crawler per target group, each with its own proxyConfiguration. proxyConfiguration.newUrl() is there when you need a proxy URL for requests you make yourself.

not handling pagination. enqueueLinks finds links on the page, but pagination often uses query params or POST requests that aren’t standard <a> tags. build explicit pagination logic for these cases rather than relying on auto-discovery.
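for the query-param case, explicit pagination can be as simple as generating the page URLs up front. a minimal sketch, assuming the target paginates with a numeric query parameter (here page) and that you already know the last page number; buildPageUrls is a hypothetical helper.

```javascript
// hypothetical helper: builds listing URLs for pages firstPage..lastPage (inclusive)
function buildPageUrls(baseUrl, firstPage, lastPage, param = 'page') {
    const urls = [];
    for (let page = firstPage; page <= lastPage; page++) {
        const url = new URL(baseUrl);
        // set() overwrites an existing value and keeps the other query params
        url.searchParams.set(param, String(page));
        urls.push(url.toString());
    }
    return urls;
}
```

the result can go straight into crawler.run() as start URLs, or into crawler.addRequests() once the first page has told you how many pages exist. POST-based pagination needs request objects with a method and payload instead, which this sketch doesn’t cover.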


scaling this

10x (hundreds of requests/day): the default free tier handles this. set maxConcurrency to 5-10, residential proxies optional, schedule via cron in the Apify console.

100x (thousands of requests/day): you’ll hit memory limits on small actor instances. raise the run memory to 1024 MB or more, either in the actor’s run options in the console or with the memory parameter when you start a run via the API. split large crawls into multiple actor runs by domain or URL range, triggered via the API. watch your proxy costs, they add up fast at this level.

1000x (tens of thousands of requests/day): multi-actor architectures become necessary. a coordinator actor splits work and starts crawler runs via Actor.call() or Actor.start(), or through the Apify API (Actor.metamorph() replaces the current run with another actor rather than spawning new ones). use Apify’s key-value store for shared state between runs. at this scale, proxy quality matters more than price per GB, because residential pool reliability varies by provider. some operators in the multi-account space have written about managing scrapers at this scale at multiaccountops.com/blog/ if you want operator-level perspective.
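splitting a crawl into several runs mostly comes down to cutting the URL list into run-sized inputs. the sketch below shows one way to do that, assuming each run takes the startUrls input shape used earlier in this guide; splitIntoRunInputs and the batch size are hypothetical.

```javascript
// hypothetical helper: turns a flat URL list into one input object per actor run
function splitIntoRunInputs(urls, urlsPerRun) {
    if (!Number.isInteger(urlsPerRun) || urlsPerRun < 1) {
        throw new RangeError('urlsPerRun must be a positive integer');
    }
    const inputs = [];
    for (let i = 0; i < urls.length; i += urlsPerRun) {
        // each input matches the { startUrls: [{ url }] } shape the actor expects
        const batch = urls.slice(i, i + urlsPerRun).map((url) => ({ url }));
        inputs.push({ startUrls: batch });
    }
    return inputs;
}
```

a coordinator can then loop over the returned inputs and start one run per entry, either through the API call from step 8 or from inside another actor.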

build monitoring in from the start. log a run summary to your own database at the end of every actor run, including request counts, failure rates, and dataset size. you want to catch degradation before it becomes an outage.
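a run summary can start as a single function that turns raw counts into a failure rate and a degraded flag. this is a sketch under assumptions: the field names requestsFinished and requestsFailed follow the statistics object that, as far as I know, crawler.run() resolves to, while summarizeRun, itemCount, and the 10% threshold are hypothetical.

```javascript
// hypothetical helper: condenses run counters into a record you can store and alert on
function summarizeRun({ requestsFinished, requestsFailed, itemCount }, failureThreshold = 0.1) {
    const total = requestsFinished + requestsFailed;
    // avoid dividing by zero when a run processed nothing
    const failureRate = total === 0 ? 0 : requestsFailed / total;
    return {
        total,
        requestsFinished,
        requestsFailed,
        itemCount,
        failureRate: Number(failureRate.toFixed(4)),
        degraded: failureRate > failureThreshold,
        finishedAt: new Date().toISOString(),
    };
}
```

write the returned object to your own database at the end of each run and compare it against previous runs; a rising failureRate or a falling itemCount is usually the first visible sign of a block or a layout change.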



Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
