
The 2026 Scrapy guide for production scraping


Most Scrapy tutorials stop at “run your first spider.” that’s fine for learning, but the gap between a working spider and a production scraper that runs unattended for months is enormous. i’ve run scrapers commercially since 2019, and i keep coming back to Scrapy for batch, structured data extraction because nothing else gives you the same combination of speed, middleware support, and battle-tested reliability at the Python layer.

this guide is for operators who already know Python and want to deploy Scrapy at real scale: rotating proxies, persistent item pipelines, autothrottle tuning, and monitoring. if you’re still on “hello world” Scrapy, start at the official Scrapy docs and come back here.

by the end you’ll have a spider that respects rate limits, rotates residential proxies via middleware, writes clean items to a Postgres database, and logs errors in a way you can actually act on. i’ll also flag the scaling decisions you’ll need to revisit when volume grows.


what you need

  • Python 3.11+: recommended. Scrapy 2.11 itself only requires Python 3.8 or newer
  • Scrapy 2.11: install via pip install scrapy==2.11.2; check the PyPI release page for the latest stable
  • A proxy provider: residential or datacenter depending on target. ProxyScraping.org’s residential pool works here; budget $15-50/month for moderate volume
  • PostgreSQL 15+: for the item pipeline. a $7/month Render.com or Railway instance is enough to start
  • psycopg2: pip install psycopg2-binary
  • Scrapyd or a plain Linux VPS: i use a $6/month Hetzner CX11 for single-spider deployments
  • A targets list: URLs in a CSV or database table, not hardcoded in the spider
  • Optional: a Sentry DSN. the free tier is sufficient for error tracking

step by step

step 1: create a clean project structure

scrapy startproject myproject
cd myproject
scrapy genspider product_spider example.com

the default layout is fine. the files you’ll touch most:

myproject/
  spiders/product_spider.py
  settings.py
  middlewares.py
  pipelines.py
  items.py

keep one spider per file. don’t put business logic in settings.py.

if it breaks: if the scrapy command isn’t found after pip install, your venv probably isn’t activated. run source venv/bin/activate and try again.


step 2: define your items properly

vague item definitions cause messy data downstream. be explicit in items.py:

import scrapy

class ProductItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    currency = scrapy.Field()
    scraped_at = scrapy.Field()
    spider_name = scrapy.Field()

always add scraped_at and spider_name. you’ll want them when debugging a pipeline six months from now.

if it breaks: if you’re getting KeyError in the pipeline, you either assigned a field that isn’t declared in items.py or read one the spider never set. Scrapy items only accept declared fields.


step 3: write a spider that reads URLs from a source

hardcoding URLs in spiders is the number one reason scrapers break silently. read from a file or database instead:

import scrapy
import csv
from datetime import datetime, timezone
from myproject.items import ProductItem

class ProductSpider(scrapy.Spider):
    name = "product_spider"

    def start_requests(self):
        with open("urls.csv", "r") as f:
            reader = csv.DictReader(f)
            for row in reader:
                yield scrapy.Request(
                    url=row["url"],
                    callback=self.parse,
                    errback=self.handle_error,
                    meta={"source_id": row["id"]},
                )

    def parse(self, response):
        item = ProductItem()
        item["url"] = response.url
        item["title"] = response.css("h1.product-title::text").get("").strip()
        item["price"] = response.css("span.price::text").get("").strip()
        item["currency"] = "USD"
        item["scraped_at"] = datetime.now(timezone.utc).isoformat()
        item["spider_name"] = self.name
        yield item

    def handle_error(self, failure):
        # pass values as logger arguments so formatting only happens if the line is emitted
        self.logger.error("Request failed: %s: %s", failure.request.url, failure.value)

if it breaks: FileNotFoundError on urls.csv means your working directory isn’t the project root. run scrapy from the directory containing scrapy.cfg.
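the spider above trusts urls.csv completely, so a small loader that validates rows before the crawl can save a wasted run. a minimal sketch, assuming the id and url columns used above; load_targets is a hypothetical helper, not part of Scrapy:

```python
import csv
import io
from urllib.parse import urlsplit


def load_targets(csv_file):
    """Return unique, well-formed rows from a urls.csv-style file object.

    Hypothetical helper: assumes the "id" and "url" columns the spider reads.
    """
    reader = csv.DictReader(csv_file)
    missing = {"id", "url"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"urls.csv is missing columns: {sorted(missing)}")

    seen = set()
    targets = []
    for row in reader:
        url = (row["url"] or "").strip()
        parts = urlsplit(url)
        # skip blanks and anything that is not an absolute http(s) URL
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # skip exact duplicates so the same page is not requested twice
        if url in seen:
            continue
        seen.add(url)
        targets.append({"id": row["id"], "url": url})
    return targets


# small in-memory demo; in the spider you would pass open("urls.csv", newline="")
sample = io.StringIO(
    "id,url\n"
    "1,https://example.com/a\n"
    "2,https://example.com/a\n"
    "3,ftp://example.com/b\n"
    "4,\n"
    "5,https://example.com/c\n"
)
targets = load_targets(sample)
```

in start_requests you could then loop over load_targets(f) instead of the raw DictReader rows.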


step 4: set up proxy rotation middleware

this is where most tutorials skip ahead too fast. a proper proxy middleware rotates per-request, not per-spider-start. create a custom middleware in middlewares.py:

import random

class RotatingProxyMiddleware:
    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        proxies = crawler.settings.getlist("PROXY_LIST")
        return cls(proxies)

    def process_request(self, request, spider):
        if self.proxies:
            proxy = random.choice(self.proxies)
            request.meta["proxy"] = proxy

in settings.py:

PROXY_LIST = [
    # placeholder credentials and host; substitute your provider's values
    "http://user:password@proxy.example.com:10000",
    "http://user:password@proxy.example.com:10001",
    # add more endpoints from your provider's dashboard
]

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RotatingProxyMiddleware": 350,
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
}

for high-volume work i use sticky session proxies per domain: one session per target domain keeps login cookies consistent. ProxyScraping’s residential pool supports session stickiness via the session query parameter.
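if your provider has no session feature, one rough way to approximate per-domain stickiness is to hash the target host onto the proxy list. this is a sketch of that alternative, not the provider’s session parameter; pick_sticky_proxy is a hypothetical helper name:

```python
import hashlib
from urllib.parse import urlsplit


def pick_sticky_proxy(url, proxies):
    """Map a URL's host to one proxy so every request to that host reuses it."""
    if not proxies:
        raise ValueError("proxy list is empty")
    host = urlsplit(url).hostname or ""
    # hashlib gives a digest that is stable across runs, unlike the built-in
    # hash(), which is salted per process for strings
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(proxies)
    return proxies[index]
```

inside process_request this would replace random.choice(self.proxies) with pick_sticky_proxy(request.url, self.proxies). the trade-off is that a banned proxy stays attached to its domain until you change the list.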

if it breaks: 407 [Proxy Authentication](https://proxyscraping.org/blog/proxy-authentication-user-pass-vs-ip-whitelist-trade-offs) Required means the credentials in your proxy URL are wrong. double-check the username and password, and percent-encode either one if it contains @ or :.
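percent-encoding the credentials is easy to get wrong by hand, so here is a minimal sketch using the standard library. build_proxy_url is a hypothetical helper and the host is a placeholder:

```python
from urllib.parse import quote


def build_proxy_url(username, password, host, port, scheme="http"):
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" forces reserved characters such as @, : and / to be encoded
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"{scheme}://{user}:{pwd}@{host}:{port}"


# a password containing both @ and : survives as a single URL component
proxy_url = build_proxy_url("user", "p@ss:word", "proxy.example.com", 10000)
```

you can generate PROXY_LIST entries this way from credentials stored in environment variables rather than typing encoded URLs into settings.py.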


step 5: configure autothrottle and concurrency

Scrapy’s AutoThrottle extension is underused. it adjusts download delay dynamically based on server latency. in settings.py:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
AUTOTHROTTLE_DEBUG = False  # set True briefly to tune

CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4
DOWNLOAD_TIMEOUT = 20

RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 429, 403]

CONCURRENT_REQUESTS_PER_DOMAIN = 4 is the most important setting for not getting banned. i’ve seen operators run 64 concurrent requests per domain on a residential proxy and wonder why the IP pool burns through in two hours.

if it breaks: if you’re seeing mostly 429s, drop AUTOTHROTTLE_TARGET_CONCURRENCY to 1.0 and watch the debug logs. the server is telling you to slow down.
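to make the tuning less of a black box, this is roughly the delay update that the AutoThrottle documentation describes, written as a standalone function. it is a paraphrase for reasoning about the settings, not Scrapy’s own code; next_delay and its parameter names are my own:

```python
def next_delay(current_delay, latency, status,
               target_concurrency=4.0, min_delay=0.0, max_delay=30.0):
    """Approximate AutoThrottle's per-response download delay update.

    min_delay plays the role of DOWNLOAD_DELAY, max_delay of AUTOTHROTTLE_MAX_DELAY.
    """
    # delay that would keep target_concurrency requests in flight at this latency
    target_delay = latency / target_concurrency
    # move halfway from the current delay toward the target
    new_delay = (current_delay + target_delay) / 2.0
    # never settle below the target delay
    new_delay = max(target_delay, new_delay)
    # clamp to the configured bounds
    new_delay = min(max(min_delay, new_delay), max_delay)
    # a non-200 response is not allowed to lower the delay
    if status != 200 and new_delay <= current_delay:
        return current_delay
    return new_delay
```

with a 2 second latency and a current delay of 1 second, a target concurrency of 4.0 pulls the delay down toward 0.5 seconds, while 1.0 pushes it up to 2 seconds. that is why lowering the target is the first thing to try when 429s show up.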


step 6: write items to Postgres via a pipeline

in pipelines.py:

import psycopg2
import os

class PostgresPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(os.environ["DATABASE_URL"])
        self.cur = self.conn.cursor()
        self.cur.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                url TEXT UNIQUE,
                title TEXT,
                price TEXT,
                currency TEXT,
                scraped_at TIMESTAMPTZ,
                spider_name TEXT
            )
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.commit()
        self.cur.close()
        self.conn.close()

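    def process_item(self, item, spider):
        try:
            self.cur.execute("""
                INSERT INTO products (url, title, price, currency, scraped_at, spider_name)
                VALUES (%s, %s, %s, %s, %s, %s)
                ON CONFLICT (url) DO UPDATE SET
                    title = EXCLUDED.title,
                    price = EXCLUDED.price,
                    currency = EXCLUDED.currency,
                    scraped_at = EXCLUDED.scraped_at
            """, (
                item["url"], item["title"], item["price"],
                item["currency"], item["scraped_at"], item["spider_name"]
            ))
            # commit per item so a crash mid-crawl doesn't lose everything so far
            self.conn.commit()
        except psycopg2.Error:
            # a failed statement aborts the open transaction; roll back so
            # later items can still be written, then let Scrapy log the error
            self.conn.rollback()
            raise
        return item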

enable it in settings.py:

ITEM_PIPELINES = {
    "myproject.pipelines.PostgresPipeline": 300,
}

set DATABASE_URL as an environment variable, not in settings.py. secrets in source control are how data breaches start.

if it breaks: psycopg2.OperationalError: could not connect almost always means the DATABASE_URL format is wrong. it should be postgresql://user:password@host:5432/dbname.
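a quick pre-flight check can catch a malformed URL before psycopg2 does. a minimal sketch using only the standard library; check_database_url is a hypothetical helper and only checks shape, not whether the server is reachable:

```python
from urllib.parse import urlsplit


def check_database_url(url):
    """Return a list of problems found in a postgresql:// connection URL."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.username:
        problems.append("missing user")
    if not parts.path.lstrip("/"):
        problems.append("missing database name")
    try:
        parts.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError:
        problems.append("invalid port")
    return problems
```

calling this at the top of open_spider and logging the result gives a clearer message than the driver’s connection error.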


step 7: add basic monitoring

Scrapy’s built-in stats collector dumps its stats to the log at the end of each crawl. for anything in production, push those stats somewhere you’ll actually see them. the simplest approach is an extension that posts to a webhook on close:

# in extensions.py
from scrapy import signals
import requests
import os

class CrawlStatsNotifier:
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

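    def spider_closed(self, spider, reason):
        stats = spider.crawler.stats.get_stats()
        webhook = os.environ.get("STATS_WEBHOOK_URL")
        if not webhook:
            return
        payload = {
            "spider": spider.name,
            "reason": reason,
            "items_scraped": stats.get("item_scraped_count", 0),
            "requests_count": stats.get("downloader/request_count", 0),
            # spider exceptions are counted per exception class
            # (spider_exceptions/<ClassName>), so report ERROR-level log lines
            "error_count": stats.get("log_count/ERROR", 0),
        }
        try:
            requests.post(webhook, json=payload, timeout=5)
        except requests.RequestException as exc:
            # a dead webhook should not turn a clean crawl into a failed one
            spider.logger.warning("stats webhook failed: %s", exc)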

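a free Slack incoming webhook or a Discord webhook works fine here, with one caveat: Slack reads the message from a text key and Discord from a content key, so format the stats into a string for those targets instead of posting the raw dictionary.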
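chat webhooks expect a message string rather than arbitrary JSON keys, so a small formatter helps. a sketch under the assumption that you post to Slack or Discord; the function names are hypothetical and the stat keys are standard Scrapy stats:

```python
def format_stats_message(spider_name, reason, stats):
    """Render crawl stats as one line of text for a chat webhook."""
    return (
        f"{spider_name} finished ({reason}): "
        f"{stats.get('item_scraped_count', 0)} items, "
        f"{stats.get('downloader/request_count', 0)} requests, "
        f"{stats.get('log_count/ERROR', 0)} errors logged"
    )


def slack_payload(message):
    # Slack incoming webhooks read the message from a "text" key
    return {"text": message}


def discord_payload(message):
    # Discord webhooks read the message from a "content" key
    return {"content": message}
```

in the extension you would pass slack_payload(format_stats_message(spider.name, reason, stats)) as the json argument.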

if it breaks: if the extension silently doesn’t fire, check that it’s listed in EXTENSIONS in settings.py with a priority number and that the class path is correct.
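for reference, the registration looks like this. the module path assumes the class lives in myproject/extensions.py as in the snippet above, and 500 is an arbitrary order value:

```python
# settings.py
EXTENSIONS = {
    # dotted path to the extension class: order value
    "myproject.extensions.CrawlStatsNotifier": 500,
}
```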


common pitfalls

1. not handling redirects explicitly. by default Scrapy follows redirects. if a site redirects logged-out users to a login page, you’ll silently scrape login forms instead of product data. set REDIRECT_MAX_TIMES = 3 and log redirect counts in your stats.

2. ignoring robots.txt in production. projects created with scrapy startproject respect robots.txt out of the box (the generated settings.py sets ROBOTSTXT_OBEY = True). some operators disable this to scrape more aggressively. before you do, check the site’s terms and whether the data you’re collecting qualifies as personal data under applicable law (this is not legal advice). the robots exclusion protocol, RFC 9309, is worth reading once.

3. using a single Scrapy process for too long. memory leaks in long-running spiders are real. for crawls over 500k pages, i restart the spider in batches using a URL queue rather than one infinite crawl. Scrapyd helps manage this.

4. not deduplicating URLs before the crawl. Scrapy has a request fingerprint deduplicator built in (DUPEFILTER_CLASS), but by default it’s in-memory and resets on restart unless you run the crawl with JOBDIR set. for multi-day crawls, use a persistent bloom filter or a Redis-backed dedup. scrapy-redis handles this.

5. writing all items to one table forever. after a few months you’ll have a 50GB products table with no partitioning and slow queries. partition by month from day one, or at least add an index on scraped_at.
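for pitfall 1, a cheap guard is to compare the URL you asked for with the URL you ended up on. a minimal sketch; the marker paths are hypothetical and need adjusting to the target site:

```python
from urllib.parse import urlsplit

# hypothetical path fragments; replace with the target site's real login URLs
LOGIN_MARKERS = ("/login", "/signin", "/account/auth")


def looks_like_login_redirect(requested_url, final_url, markers=LOGIN_MARKERS):
    """True when a request ended on a login-style page it did not ask for."""
    requested_path = urlsplit(requested_url).path.lower()
    final_path = urlsplit(final_url).path.lower()
    if requested_path == final_path:
        return False
    return any(marker in final_path for marker in markers)
```

in a parse callback, the originally requested URL is the first entry of response.meta.get("redirect_urls", [response.url]) and the final one is response.url. when the check returns True, log it and skip the item rather than storing an empty title and price.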


scaling this

10x (100-200 concurrent requests): the setup above handles this fine. tune CONCURRENT_REQUESTS and your proxy pool size. a single CX21 Hetzner box ($14/month) is enough compute.

100x (1,000+ concurrent requests): you need to distribute across multiple Scrapy processes, likely using Scrapyd + a URL queue in Redis. scrapy-redis replaces the default scheduler and deduplicator with Redis-backed versions. at this level proxy spend dominates your cost; expect $200-500/month for a residential pool with this throughput.

1000x (10,000+ concurrent requests): Scrapy alone probably isn’t the right answer here. you’re looking at distributed orchestration: a job queue (Celery or Dramatiq), multiple workers, a proxy load balancer, and dedicated monitoring. some teams swap Scrapy’s download layer for an async HTTP client like httpx at this scale. if you’re running anti-detect browser-level scraping at volume, the multiaccountops.com blog has operator-level write-ups on managing session farms that are worth reading alongside this.

the database also becomes a bottleneck. partition your tables, use COPY instead of INSERT for bulk writes, and consider a columnar store like DuckDB for analytics queries on scraped data.
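as a rough illustration of the COPY route, the sketch below buffers items as CSV and hands them to psycopg2’s cursor.copy_expert. products_staging is a hypothetical staging table with the same columns as products, and the helper names are my own:

```python
import csv
import io

COLUMNS = ("url", "title", "price", "currency", "scraped_at", "spider_name")


def items_to_csv_buffer(items, columns=COLUMNS):
    """Serialize dict-like items into an in-memory CSV file for COPY."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for item in items:
        # note: in COPY's csv format an empty unquoted field is read as NULL
        writer.writerow([item[col] for col in columns])
    buf.seek(0)
    return buf


def copy_items(cur, items, table="products_staging", columns=COLUMNS):
    """Bulk-load items through a psycopg2 cursor using COPY ... FROM STDIN."""
    sql = f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"
    cur.copy_expert(sql, items_to_csv_buffer(items, columns))
```

COPY has no ON CONFLICT clause, which is why this loads into a staging table. one common pattern is to follow it with an INSERT ... SELECT ... ON CONFLICT from staging into products, then truncate staging.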




Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
