The 2026 Cheerio guide for production scraping

Most scraping tutorials start with Playwright or Puppeteer, as if every site you’ll ever touch requires a full browser. that assumption is expensive and slow. the reality I’ve found running scrapers out of Singapore is that a large share of the pages you need (product listings, forum threads, news archives, shipping rate pages) are plain server-rendered HTML. no JavaScript execution needed. for those targets, Cheerio is still the right tool in 2026, and it’s an order of magnitude cheaper to run than headless Chrome.

Cheerio is a fast, jQuery-compatible HTML parser for Node.js. it doesn’t run JavaScript, it doesn’t spin up a browser process, and it doesn’t care about cookies, service workers, or WebGL fingerprinting. you fetch raw HTML with an HTTP client and Cheerio gives you a familiar selector API on top of it. the mental model is simple: $('div.product-title').text(). if you’ve written any jQuery you’re already 80% there.

this guide is for operators who have already scraped something small, maybe a few hundred pages with a hand-rolled script, and now want to run it at production volume without the infrastructure exploding. by the end you’ll have a working scraper that handles pagination, rotates proxies, respects rate limits at a basic level, and writes structured output. i’m assuming you’re running Node.js 20 or 22 and are comfortable with async/await.

what you need

  • Node.js 20+ (LTS). check with node -v. download from nodejs.org
  • npm or pnpm for package management
  • cheerio (npm install cheerio) – the HTML parser
  • got (npm install got@11) – an HTTP client that handles retries and timeouts cleanly. got 12 and later are ESM-only; v11 is the last release that works with require() in CommonJS
  • cheerio reached a stable 1.0 release on npm in 2024 after years in RC. the API is stable
  • a proxy provider for anything beyond trivial volume. I use residential or datacenter proxies from proxyscraping.com. got does not proxy natively, but it accepts rotating proxies in the standard HTTP proxy URL format (http://user:pass@host:port) through a proxy agent, covered in step 4
  • a target site’s robots.txt reviewed before you start. robotstxt.org explains the spec. ignoring it is your legal exposure, not mine
  • ~$10-30/month in proxy costs for moderate volume (100k-1M requests/month depending on provider and pool type)
  • a basic understanding of CSS selectors. if you need a refresher, the MDN selector reference is the authoritative source
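
the robots.txt review from the list above can be scripted. what follows is a minimal sketch, not a full RFC 9309 implementation: it assumes you only care about the User-agent: * group, it treats rules as plain path prefixes, and it ignores the * and $ wildcards. parseRobots and isAllowed are hypothetical names used for illustration.

```javascript
// Collect Allow/Disallow rules from every group that names "User-agent: *".
// Assumption: wildcard patterns (* and $) are NOT interpreted here.
function parseRobots(text) {
  const rules = [];
  let groupAgents = [];
  let inRules = false; // true once the current group has started listing rules

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*/, '').trim(); // drop comments
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === 'user-agent') {
      // a user-agent line after rules starts a new group
      if (inRules) {
        groupAgents = [];
        inRules = false;
      }
      groupAgents.push(value.toLowerCase());
    } else if (field === 'allow' || field === 'disallow') {
      inRules = true;
      // an empty Disallow means "nothing is disallowed", so skip it
      if (groupAgents.includes('*') && value !== '') {
        rules.push({ allow: field === 'allow', path: value });
      }
    }
  }
  return rules;
}

// Longest matching prefix wins; Allow wins a tie; no match means allowed.
function isAllowed(rules, path) {
  let best = null;
  for (const rule of rules) {
    if (!path.startsWith(rule.path)) continue;
    const longer = !best || rule.path.length > best.path.length;
    const tieAllow = best && rule.path.length === best.path.length && rule.allow;
    if (longer || tieAllow) best = rule;
  }
  return best ? best.allow : true;
}
```

to use it, fetch https://your-target.com/robots.txt with got, pass the body to parseRobots, and call isAllowed(rules, '/listings') before queueing a URL. if the site relies on wildcard rules, reach for a maintained parser instead.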

step by step

step 1: initialize the project

mkdir cheerio-scraper && cd cheerio-scraper
npm init -y
npm install cheerio got@11 p-limit@3

p-limit is for concurrency control. you’ll want it. without it you’ll either flood the target or flood yourself.

the examples below use CommonJS for broader compatibility, so stay on got v11 and p-limit v3 and use require(). got 12+ and p-limit 4+ are ESM-only; if you’d rather use them, add "type": "module" to your package.json and switch the examples to import syntax.

expected output: a node_modules folder and a package.json with the three dependencies listed.

if it breaks: if require('got') or require('p-limit') throws ERR_REQUIRE_ESM (or, on recent Node releases, you get “got is not a function”), npm resolved an ESM-only release. pin the versions explicitly with npm install got@11 p-limit@3.

step 2: fetch a page and verify the HTML

before writing any selectors, confirm you can retrieve the page and that it’s returning actual HTML, not a JavaScript bundle.

// fetch-test.js
const got = require('got');

async function main() {
  const response = await got('https://example.com', {
    headers: {
      'User-Agent': 'Mozilla/5.0 (compatible; MyResearchBot/1.0)',
    },
    timeout: { request: 10000 },
  });
  console.log(response.statusCode);
  console.log(response.body.slice(0, 500));
}

main();

run with node fetch-test.js. you should see 200 and the start of an HTML document.

if it breaks: status 403 or a Cloudflare challenge page means you need a real browser or better headers at minimum. Cheerio won’t help you past a JS challenge wall. status 429 means slow down. check your headers closely: many sites block the default got user-agent string.
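
the triage above can be turned into a small helper so every response goes through the same checks. this is a sketch under assumptions: triage and retryAfterMs are hypothetical names, the verdict labels are arbitrary, and the “not-html” test is a rough heuristic (content-type plus an <html tag), not a reliable challenge-page detector. Retry-After is a standard HTTP header that carries either a number of seconds or an HTTP date.

```javascript
// Convert a Retry-After header value into milliseconds to wait.
// Returns null when the header is missing or unparseable.
function retryAfterMs(value, now = Date.now()) {
  if (value === undefined || value === null || value === '') return null;
  const text = String(value).trim();
  if (/^\d+$/.test(text)) return Number(text) * 1000; // delta-seconds form
  const date = Date.parse(text); // HTTP-date form
  return Number.isNaN(date) ? null : Math.max(0, date - now);
}

// Classify a response. `headers` uses lowercase keys, as Node and got provide them.
function triage(statusCode, headers = {}, body = '') {
  if (statusCode === 429) {
    return { verdict: 'slow-down', waitMs: retryAfterMs(headers['retry-after']) };
  }
  if (statusCode === 403) {
    return { verdict: 'blocked', waitMs: null };
  }
  const type = String(headers['content-type'] || '');
  if (!type.includes('html') || !/<html[\s>]/i.test(body)) {
    return { verdict: 'not-html', waitMs: null };
  }
  return { verdict: 'ok', waitMs: null };
}
```

note that got throws on 4xx/5xx by default, so for 403 and 429 you would call triage with error.response.statusCode and error.response.headers inside a catch block.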

step 3: load HTML into Cheerio and write your first selectors

const got = require('got');
const cheerio = require('cheerio');

async function scrape(url) {
  const { body } = await got(url, {
    headers: { 'User-Agent': 'Mozilla/5.0 (compatible; MyResearchBot/1.0)' },
  });

  const $ = cheerio.load(body);

  // example: grab all article titles from a news listing page
  const titles = [];
  $('h2.article-title a').each((i, el) => {
    titles.push({
      title: $(el).text().trim(),
      href: $(el).attr('href'),
    });
  });

  return titles;
}

scrape('https://your-target.com/news').then(console.log);

the $ variable works like jQuery. .each() iterates over matched elements, $(el).text() extracts inner text, $(el).attr('href') gets an attribute. open Chrome DevTools on your target page, right-click an element, and use “Copy selector” as a starting point, then tighten it manually. auto-generated selectors from DevTools are often brittle.

if it breaks: if .text() returns empty strings, the content is probably rendered by JavaScript and you need a headless browser. if your selector matches nothing, double check that the HTML in body actually contains the element. console.log(body) and search for the text you expect.
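
one thing the selector example leaves open: $(el).attr('href') returns whatever is in the markup, which is often a relative path, a fragment, or a mailto: link. a minimal sketch of normalizing those with Node’s built-in URL class follows. absolutize is a hypothetical helper name, and dropping the #fragment is a choice made here for deduplication, not something the scraper requires.

```javascript
// Resolve an href from the page against the URL the page was fetched from.
// Returns null for missing hrefs, unparseable values, and non-http(s) schemes.
function absolutize(href, pageUrl) {
  if (!href) return null;
  try {
    const url = new URL(href, pageUrl);
    if (url.protocol !== 'http:' && url.protocol !== 'https:') return null;
    url.hash = ''; // "/a#top" and "/a" point at the same document
    return url.href;
  } catch {
    return null;
  }
}
```

in the step 3 loop this would replace the raw attribute, e.g. href: absolutize($(el).attr('href'), url).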

step 4: add proxy support

for any non-trivial scraping run you need to route requests through proxies. got supports HTTP proxies via the hpagent package or via the agent option with https-proxy-agent.

npm install https-proxy-agent

const got = require('got');
const { HttpsProxyAgent } = require('https-proxy-agent');

// placeholder: substitute your provider's username, password, host, and port
const PROXY_URL = 'http://username:password@host:9000';

const client = got.extend({
  agent: {
    https: new HttpsProxyAgent(PROXY_URL),
  },
  headers: {
    'User-Agent': 'Mozilla/5.0 (compatible; MyResearchBot/1.0)',
  },
  timeout: { request: 15000 },
  retry: { limit: 2, statusCodes: [429, 500, 502, 503] },
});

the retry option tells got to retry automatically on those status codes. with a rotating proxy endpoint, each request typically exits from a different IP, but check your provider’s docs for whether rotation happens per request or per session. that matters for multi-page flows that depend on a session.

if it breaks: ECONNREFUSED or ETIMEDOUT usually means the proxy host or port is wrong, or the proxy is unreachable. a 407 status means the proxy rejected your credentials. if you get HTML error pages instead of your target, the proxy itself is working but the target blocked it. try a different proxy pool (residential vs datacenter).
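
a common cause of proxy auth failures is a password containing @, : or / pasted straight into the proxy URL. a minimal sketch of building the URL safely follows. buildProxyUrl is a hypothetical helper, and the host and credentials in the usage note are placeholders.

```javascript
// Build an HTTP proxy URL, percent-encoding the credentials so that
// characters like "@" and ":" cannot be mistaken for URL delimiters.
function buildProxyUrl({ host, port, username, password }) {
  const user = encodeURIComponent(username);
  const pass = encodeURIComponent(password);
  const proxyUrl = `http://${user}:${pass}@${host}:${port}`;
  new URL(proxyUrl); // throws early if host or port make the URL invalid
  return proxyUrl;
}
```

you would then pass the result to new HttpsProxyAgent(buildProxyUrl({ ... })), ideally reading the four values from environment variables instead of hard-coding them. check your proxy agent’s docs to confirm it decodes percent-encoded credentials before sending them.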

step 5: handle pagination

most real targets span multiple pages. two common patterns: page number in the URL (?page=2), and a “next” link you follow.

async function scrapeAllPages(baseUrl, maxPages = 50) {
  const results = [];
  let page = 1;
  let hasMore = true;

  while (hasMore && page <= maxPages) {
    const url = `${baseUrl}?page=${page}`;
    const { body } = await client(url);
    const $ = cheerio.load(body);

    const items = [];
    $('div.listing-item').each((i, el) => {
      items.push({
        name: $(el).find('.item-name').text().trim(),
        price: $(el).find('.item-price').text().trim(),
      });
    });

    results.push(...items);

    // check if there's a next page link
    hasMore = $('a.pagination-next').length > 0;
    page++;

    // basic rate limit: 1 second between pages
    await new Promise(r => setTimeout(r, 1000));
  }

  return results;
}

always set a maxPages guard. without it a pagination bug can put you in an infinite loop that runs your proxy bill into the ceiling.

if it breaks: if you’re getting the same page repeatedly, check whether the “next” selector is correct or whether the site uses a different pagination scheme like cursor-based links.
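
two small helpers make the loop above more robust. this is a sketch, assuming the site paginates with a page query parameter: pageUrl builds the URL without breaking a base URL that already has a query string, and isRepeat detects the “same page again” case by hashing the extracted items. both names are hypothetical.

```javascript
const crypto = require('node:crypto');

// Set ?page=N on a URL, preserving any existing query parameters.
function pageUrl(baseUrl, page) {
  const url = new URL(baseUrl);
  url.searchParams.set('page', String(page));
  return url.href;
}

// Stable hash of the extracted items for one page.
function fingerprint(items) {
  return crypto.createHash('sha1').update(JSON.stringify(items)).digest('hex');
}

// Returns true if this exact list of items was already seen; otherwise
// records it in `seen` (a Set) and returns false.
function isRepeat(seen, items) {
  const key = fingerprint(items);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}
```

inside the while loop you would create const seen = new Set() once, then break out when isRepeat(seen, items) returns true. some sites serve the last page again for any out-of-range page number, and this catches that even when the “next” selector is wrong.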

step 6: run concurrent requests with a concurrency limit

sequential scraping is slow. but unconstrained parallelism will get you blocked or crash your process. p-limit lets you set a ceiling.

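const pLimit = require('p-limit'); // p-limit@3; v4 and later are ESM-only
const limit = pLimit(5); // max 5 concurrent requests

// scrapeOne(url) is your per-URL function, e.g. scrape() from step 3
async function scrapeUrls(urls) {
  // each task waits for a free slot before it starts
  const tasks = urls.map(url =>
    limit(() => scrapeOne(url))
  );
  // Promise.all rejects as soon as any single task fails
  return Promise.all(tasks);
}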

for most residential proxy pools I start at 5 concurrent requests and increase if I’m not seeing blocks. for datacenter proxies on forgiving targets I’ll go to 20-30.

if it breaks: if you’re seeing lots of 429s, reduce the limit and add more delay. if you’re seeing connection errors, your proxy provider may have a concurrency cap on your account tier.
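
with Promise.all, one failed URL rejects the whole batch and you lose the results that did succeed. a sketch of an alternative using Promise.allSettled follows. scrapeUrlsSettled is a hypothetical name, scrapeOne is whatever per-URL function you pass in, and limit is expected to be the function returned by pLimit(n). the default used here runs everything unthrottled and exists only to keep the sketch self-contained.

```javascript
// Run scrapeOne over all URLs and split the outcome into successes and
// failures instead of rejecting on the first error.
async function scrapeUrlsSettled(urls, scrapeOne, limit = async (fn) => fn()) {
  const settled = await Promise.allSettled(
    urls.map((url) => limit(() => scrapeOne(url)))
  );

  const ok = [];
  const failed = [];
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      ok.push(result.value);
    } else {
      const reason = result.reason;
      failed.push({
        url: urls[i],
        reason: reason && reason.message ? reason.message : String(reason),
      });
    }
  });
  return { ok, failed };
}
```

the failed list is what you would feed into a retry pass with a lower concurrency limit.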

step 7: write output to JSON

const fs = require('fs');

async function main() {
  const data = await scrapeAllPages('https://your-target.com/listings');
  fs.writeFileSync('output.json', JSON.stringify(data, null, 2));
  console.log(`wrote ${data.length} records`);
}

main();

for large runs (100k+ records) don’t buffer everything in memory. write to a newline-delimited JSON file (.ndjson) and append as you go, or write directly to a database.

if it breaks: JSON.stringify on a very large array can exhaust the heap. passing --max-old-space-size=4096 to node buys some room as a stopgap, but the durable fix is to switch to streaming writes.
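
the append-as-you-go approach can be sketched in a few lines. appendRecords and readRecords are hypothetical helper names, and the synchronous append is a simplification that is fine at one write per page. for very high write rates you would likely want a write stream instead.

```javascript
const fs = require('node:fs');

// Append records to a newline-delimited JSON file, one object per line.
// Returns the number of records written.
function appendRecords(file, records) {
  if (records.length === 0) return 0;
  const lines = records.map((record) => JSON.stringify(record)).join('\n') + '\n';
  fs.appendFileSync(file, lines);
  return records.length;
}

// Read the file back into an array. This loads everything into memory,
// so it is meant for spot checks, not for the 100k+ case.
function readRecords(file) {
  return fs
    .readFileSync(file, 'utf8')
    .split('\n')
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}
```

in scrapeAllPages you would call appendRecords('output.ndjson', items) once per page instead of pushing into results, so a crash on page 400 still leaves pages 1-399 on disk.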

step 8: validate and clean extracted data before saving

raw scraped text is messy. prices come with currency symbols and commas, dates come in regional formats, whitespace is inconsistent.

function cleanPrice(raw) {
  // strips everything except digits and dots ("$1,299.00" -> 1299),
  // then parses a float. returns NaN if no digits remain
  return parseFloat(raw.replace(/[^0-9.]/g, ''));
}

function cleanText(raw) {
  return raw.replace(/\s+/g, ' ').trim();
}

do this at extraction time, not later. cleaning downstream means re-running the scraper if you discover a format issue.

if it breaks: NaN from parseFloat means your regex is stripping too much or the input format wasn’t what you expected. log the raw string before cleaning to see what you’re actually getting.
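
cleanPrice assumes a dot is the decimal separator, so a European-style “1.299,00” comes out wrong. a sketch of a more tolerant parser follows, under one stated assumption: the last separator is the decimal mark only when it is followed by one or two digits, and anything else is a thousands separator. that means an input like “1.234” is read as 1234. parsePrice is a hypothetical name.

```javascript
// Parse a price string in either "1,299.00" or "1.299,00" style.
// Returns a number, or null when the input contains no digits.
function parsePrice(raw) {
  const digits = String(raw).replace(/[^0-9.,]/g, '');
  if (!/\d/.test(digits)) return null;

  const lastSep = Math.max(digits.lastIndexOf('.'), digits.lastIndexOf(','));
  const trailing = digits.length - lastSep - 1; // digits after the last separator
  let normalized;
  if (lastSep !== -1 && trailing <= 2) {
    // last separator is the decimal mark; strip all earlier separators
    normalized =
      digits.slice(0, lastSep).replace(/[.,]/g, '') + '.' + digits.slice(lastSep + 1);
  } else {
    // every separator is a thousands separator
    normalized = digits.replace(/[.,]/g, '');
  }

  const value = Number.parseFloat(normalized);
  return Number.isFinite(value) ? value : null;
}
```

returning null instead of NaN makes missing prices easy to filter and survives JSON.stringify, which turns NaN into null anyway.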

common pitfalls

using too-specific selectors. copying a CSS selector from DevTools often gives you something like #main > div:nth-child(3) > ul > li:nth-child(2) > a. that breaks the moment the site does any layout change. write selectors that match semantic structure: class names, data attributes, ARIA labels. prefer .product-title over div > h3:nth-child(2).

no timeout on requests. got has no timeout by default. a single hung connection stalls a sequential scraper outright, and under p-limit it holds a slot until enough of them pile up to stall the whole queue. always set timeout: { request: 15000 } at minimum.

not checking robots.txt before starting. I’ve seen operators spend days building a scraper only to realize the target explicitly disallows crawling in robots.txt, and their legal team then kills the project. check it first. it’s one HTTP request. the format is specified in RFC 9309 (https://www.rfc-editor.org/rfc/rfc9309.html).

assuming the HTML is well-formed. real-world HTML is a mess. Cheerio is forgiving and will parse malformed markup, but your selectors may match unexpected elements. always add .trim() to text output and test against a sample of 20-30 real pages, not just one.

ignoring response body on non-200 status. got throws on 4xx/5xx by default. if you want to inspect the error body (useful for understanding why you’re being blocked), catch the error and read error.response.body.
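
a sketch of that catch-and-inspect step follows. describeFailure is a hypothetical helper. it assumes only what the paragraph above states, that got’s HTTP errors carry the response on error.response, and it falls back gracefully for network errors that have no response at all.

```javascript
// Summarize a failed request for logging: HTTP status (if any), the
// error code (if any), and the first 200 characters of the body.
function describeFailure(error) {
  const response = error && error.response;
  const code = (error && error.code) || null;
  if (!response) {
    // network-level failure (DNS, timeout, refused connection): no body
    return { status: null, code, snippet: null };
  }
  const body = typeof response.body === 'string' ? response.body : '';
  return {
    status: response.statusCode,
    code,
    snippet: body.replace(/\s+/g, ' ').trim().slice(0, 200),
  };
}
```

typical use is try { await client(url) } catch (err) { console.error(url, describeFailure(err)) }. the snippet is usually enough to tell a rate-limit page from a captcha page from a proxy error.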

scaling this

at 10x (tens of thousands of pages): the single-file script starts to feel painful. split into a URL queue, a worker that scrapes one URL, and an output writer. use a simple SQLite table as your queue to track which URLs are done, pending, or failed. this lets you resume after crashes without re-scraping.

at 100x (hundreds of thousands of pages): SQLite becomes a bottleneck under concurrent writes. switch to Postgres or use a proper job queue like BullMQ backed by Redis. run workers in separate Node processes or on separate machines. proxy costs become significant at this scale: track your cost-per-1k-requests and pick the cheapest pool that has acceptable success rates. I wrote about proxy cost comparisons in the proxyscraping.com proxy provider review.

at 1000x (millions of pages): Cheerio itself isn’t the bottleneck. network I/O and proxy availability are. you’re now looking at distributed workers, proxy failover logic, per-domain rate limiting, and structured observability. consider whether you can push the work to a scraping API instead of running raw infrastructure. for anti-detect requirements at this scale, the antidetect browser comparison at antidetectreview.org covers what operators use for managed browser fingerprinting at scale. also revisit whether every page actually needs scraping or whether you can use sitemaps, RSS feeds, or official APIs for a subset of your data needs.
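
per-domain rate limiting is the piece of that list that is easiest to get wrong, so here is a minimal in-process sketch. DomainThrottle is a hypothetical class, not a library API. it only coordinates requests inside one Node process, so distributed workers would need the same bookkeeping in shared storage such as Redis.

```javascript
// Enforces a minimum gap between requests to the same hostname.
class DomainThrottle {
  constructor(minGapMs) {
    this.minGapMs = minGapMs;
    this.nextSlot = new Map(); // hostname -> earliest time the next request may start
  }

  // Reserve the next slot for this URL's host and return how many
  // milliseconds the caller should wait before sending the request.
  reserve(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const slot = Math.max(now, this.nextSlot.get(host) || 0);
    this.nextSlot.set(host, slot + this.minGapMs);
    return slot - now;
  }
}
```

a worker would call const wait = throttle.reserve(url), sleep for wait milliseconds, then make the request. because the slot is reserved at call time, concurrent workers hitting the same host queue up behind each other while other hosts proceed immediately.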

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
