
How to scrape Hacker News at scale in 2026 with proxies that work

Hacker News is one of the cleanest public data sources on the internet. No paywalls, no login gates, no heavy JavaScript rendering for the core content. Y Combinator maintains a public Firebase API that returns structured JSON, and the site itself is mostly server-rendered HTML that any HTTP client can parse. So why do operators still get blocked when they try to pull data at scale?

Because the volume itself is the signal. If you’re pulling thousands of items per hour, the same IP is making hundreds of sequential requests that look nothing like a human browsing pattern. Rate limits kick in, you start seeing 429s and 503s, and your pipeline stops. I’ve built three HN scrapers over the years: one for a newsletter aggregator, one for a trend-detection tool, and one for a client tracking competitor link placements. The pattern is always the same: it works fine on a laptop and falls apart at 10x or 100x.

This tutorial is for operators who need continuous, reliable HN data, whether for competitive intelligence, content trend analysis, link tracking, or feeding an automated pipeline. I’ll walk through the exact Python setup I use, including the proxy rotation layer that makes it work without burning through IPs. By the end you’ll have a working scraper running a continuous loop, writing to a local database, and handling errors without dying silently.

What you need

  • Python 3.10+ with requests, beautifulsoup4, and sqlite3 (stdlib)
  • A rotating residential proxy service. I use ProxyScraping.com residential proxies, running roughly $3-8/GB depending on the plan tier
  • Optional: Redis for distributed deduplication if you’re running multiple workers
  • Optional: a small VPS (Hetzner CX22 at ~$4/month, DigitalOcean Basic Droplet at $6/month) for running the loop continuously
  • A text editor and basic comfort with running Python scripts from the command line

Rough cost at modest scale: $10-30/month for proxies, $5-6/month for a VPS. Under $40/month total for a pipeline pulling several thousand items per day.

Step by step

Step 1: read the official API first

Before scraping HTML, check what the API gives you for free. The Hacker News Firebase API is documented, maintained by YC, and significantly more tolerant of automated traffic than the HTML site.

import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"

def get_top_stories():
    r = requests.get(f"{BASE_URL}/topstories.json", timeout=10)
    r.raise_for_status()
    return r.json()  # list of up to 500 item IDs

def get_item(item_id):
    r = requests.get(f"{BASE_URL}/item/{item_id}.json", timeout=10)
    r.raise_for_status()
    return r.json()

story_ids = get_top_stories()
print(get_item(story_ids[0]))

Expected output: a dict with keys id, title, url, score, by, time, descendants, kids.

If it breaks: the API returns null for deleted or dead items. Add if data: before processing any item response.

Step 2: check robots.txt before touching the HTML site

Pull https://news.ycombinator.com/robots.txt and read it. HN allows crawling of most paths, including item pages and the front page, but restricts a few paths like /r/. The robots exclusion protocol (RFC 9309) is not legally binding in most jurisdictions, but respecting it is good practice and reduces your chance of an IP ban. Stay inside those bounds.

The Firebase API has no officially published rate limit, but in practice anything above 1 request per second from a single IP triggers throttling. The HTML site is tighter, rate-limiting around 30-50 requests per minute per IP.

If it breaks: if you see 429 responses, slow down before adding proxies. Adding proxies to a fundamentally too-fast request loop just burns through your proxy quota faster.
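
The robots.txt check can also be done in code rather than by eye. Below is a minimal sketch using the standard library's urllib.robotparser. The SAMPLE_ROBOTS body is a hypothetical stand-in, not HN's actual file, and the helper names build_parser and is_allowed are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body for illustration only; download the live file
# from https://news.ycombinator.com/robots.txt before relying on any rule.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /login
Crawl-delay: 30
"""

def build_parser(robots_text):
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(SAMPLE_ROBOTS)
item_ok = is_allowed(parser, "https://news.ycombinator.com/item?id=1")
login_ok = is_allowed(parser, "https://news.ycombinator.com/login")
delay = parser.crawl_delay("*")
print(item_ok, login_ok, delay)
```

Calling is_allowed before each HTML request keeps the scraper inside whatever the live file says, and crawl_delay gives a floor for the sleep between requests if the file declares one.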

Step 3: build a single-IP scraper with delay

import time
import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"

def fetch_items(item_ids, delay=0.6):
    results = []
    for item_id in item_ids:
        try:
            r = requests.get(
                f"{BASE_URL}/item/{item_id}.json",
                timeout=10
            )
            r.raise_for_status()
            data = r.json()
            if data:
                results.append(data)
        except requests.RequestException as e:
            print(f"error on {item_id}: {e}")
        time.sleep(delay)
    return results

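# Fetch the ID list with a timeout and status check, same as the item calls
resp = requests.get(f"{BASE_URL}/topstories.json", timeout=10)
resp.raise_for_status()
story_ids = resp.json()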
stories = fetch_items(story_ids[:50])
print(f"fetched {len(stories)} stories")

Expected output: a list of story dicts, printed count at the end.

If it breaks: increase timeout to 15s. The Firebase endpoints can be slow, especially under load.

Step 4: add rotating residential proxies

This is the step that separates a prototype from a production pipeline. You need your requests to exit through different IPs so no single IP accumulates enough requests to trigger a block. I route through ProxyScraping’s residential rotating gateway, which handles IP rotation internally and presents a single endpoint.

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "gate.proxyscraping.com"
PROXY_PORT = 31112  # residential rotating, verify in your dashboard

PROXIES = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
}

def fetch_with_proxy(url):
    r = requests.get(url, proxies=PROXIES, headers=HEADERS, timeout=15)
    r.raise_for_status()
    return r.json()

Expected output: same data as before, but requests now exit through rotating residential IPs.

If it breaks: check your proxy credentials in the ProxyScraping dashboard. Verify you’re using the correct port for your plan tier; residential and datacenter ports are different. If you’re getting 407 auth errors, make sure special characters in your password are URL-encoded exactly once, not twice.
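
One way to rule out credential-encoding problems is to build the proxy URL through a small helper. This is a sketch only: build_proxy_url is a hypothetical helper, and the password and host below are made up for illustration.

```python
from urllib.parse import quote

def build_proxy_url(user, password, host, port):
    """Build an HTTP proxy URL with credentials percent-encoded exactly once."""
    # safe="" also encodes "/", ":" and "@", which would otherwise be read
    # as URL delimiters inside the credentials.
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

# Hypothetical credentials and host, for illustration only.
proxy_url = build_proxy_url("your_username", "p@ss:w/rd", "gate.example.com", 31112)
print(proxy_url)
```

Pass the raw password to the helper. Feeding it an already-encoded string is exactly the double-encoding case that produces 407 errors.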

Step 5: scrape comment text from the HTML site

For a story item, the Firebase API returns metadata and the kids array of child IDs, not the comment text itself; reading comments through the API takes one item request per comment. To pull every comment on a story in a single request, parse the HTML page instead.

from bs4 import BeautifulSoup

def get_comments(story_id):
    url = f"https://news.ycombinator.com/item?id={story_id}"
    r = requests.get(url, proxies=PROXIES, headers=HEADERS, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    comments = []
    for span in soup.select(".commtext"):
        text = span.get_text(separator=" ", strip=True)
        if text:
            comments.append(text)
    return comments

Expected output: a list of comment strings for the story.

If it breaks: HN’s HTML class names have been stable since at least 2022, but inspect the page source if .commtext returns nothing. The structure occasionally shifts after site updates.
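
Where one request per comment is acceptable, the kids arrays can also be walked through the item endpoint instead of parsing HTML. The sketch below assumes comment items carry their body as HTML in a text field. walk_comments and FAKE_ITEMS are hypothetical names, and the stand-in data exists only so the example runs offline.

```python
import html
import re

def walk_comments(item_id, fetch_item, max_items=200):
    """Collect comment text depth-first starting from a story or comment ID.

    fetch_item is any callable that takes an item ID and returns the decoded
    JSON dict (or None for deleted items), such as get_item from Step 1.
    """
    texts = []
    stack = [item_id]
    seen = 0
    while stack and seen < max_items:
        item = fetch_item(stack.pop())
        seen += 1
        if not item:
            continue  # deleted or dead items come back as null
        if item.get("type") == "comment" and item.get("text"):
            # Replace tags with spaces, then unescape entities.
            plain = re.sub(r"<[^>]+>", " ", item["text"])
            texts.append(html.unescape(plain).strip())
        # Push children in reverse so they are visited in listed order.
        stack.extend(reversed(item.get("kids", [])))
    return texts

# Stand-in data so the sketch runs offline; real calls would hit the API.
FAKE_ITEMS = {
    1: {"id": 1, "type": "story", "kids": [2, 3]},
    2: {"id": 2, "type": "comment", "text": "first &amp; best", "kids": [4]},
    3: {"id": 3, "type": "comment", "text": "<p>second</p>"},
    4: {"id": 4, "type": "comment", "text": "reply"},
}
comments = walk_comments(1, FAKE_ITEMS.get)
print(comments)
```

The max_items cap bounds the number of requests a single large thread can trigger, which matters once each fetch goes through a metered proxy.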

Step 6: add SQLite deduplication

Running a continuous loop means you’ll encounter the same story IDs on every poll. Track what you’ve already fetched so you’re not re-inserting the same rows.

import sqlite3

conn = sqlite3.connect("hn_stories.db")
c = conn.cursor()
c.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT,
        score INTEGER,
        author TEXT,
        fetched_at INTEGER DEFAULT (strftime('%s','now'))
    )
""")
conn.commit()

def save_story(story):
    c.execute(
        "INSERT OR IGNORE INTO stories (id, title, url, score, author) "
        "VALUES (?, ?, ?, ?, ?)",
        (story.get("id"), story.get("title"), story.get("url"),
         story.get("score"), story.get("by"))
    )
    conn.commit()

def already_fetched(item_id):
    c.execute("SELECT 1 FROM stories WHERE id = ?", (item_id,))
    return c.fetchone() is not None

If it breaks: SQLite doesn’t handle concurrent writes well. If you run several worker processes at once, switch to PostgreSQL. If you run several threads inside one process, add a threading lock around the write calls.
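
For the single-process, multi-thread case, a lock around the write path can be sketched as follows. It uses an in-memory database and a trimmed two-column table so it runs on its own. save_story_locked and worker are hypothetical names, and the lock does nothing for separate processes.

```python
import sqlite3
import threading

# check_same_thread=False lets several threads share one connection;
# the lock below is what makes that sharing safe.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE stories (id INTEGER PRIMARY KEY, title TEXT)")
write_lock = threading.Lock()

def save_story_locked(story):
    """Insert one story while holding the lock so writes never interleave."""
    with write_lock:
        conn.execute(
            "INSERT OR IGNORE INTO stories (id, title) VALUES (?, ?)",
            (story["id"], story.get("title")),
        )
        conn.commit()

def worker(stories):
    for story in stories:
        save_story_locked(story)

# Two threads write overlapping IDs; INSERT OR IGNORE drops the duplicates.
batch_a = [{"id": i, "title": f"story {i}"} for i in range(0, 60)]
batch_b = [{"id": i, "title": f"story {i}"} for i in range(40, 100)]
threads = [threading.Thread(target=worker, args=(b,)) for b in (batch_a, batch_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()

row_count = conn.execute("SELECT COUNT(*) FROM stories").fetchone()[0]
print(row_count)
```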

Step 7: wire everything into a continuous loop

import time

def run_pipeline(poll_interval=300):
    print("starting HN pipeline")
    while True:
        try:
            story_ids = fetch_with_proxy(f"{BASE_URL}/topstories.json")
            new_ids = [sid for sid in story_ids if not already_fetched(sid)]
            print(f"{len(new_ids)} new stories to fetch")
            for sid in new_ids:
                story = fetch_with_proxy(f"{BASE_URL}/item/{sid}.json")
                if story:
                    save_story(story)
                time.sleep(0.4)
        except Exception as e:
            print(f"pipeline error: {e}")
            time.sleep(30)
            continue  # retry after 30s instead of also waiting poll_interval
        print(f"sleeping {poll_interval}s")
        time.sleep(poll_interval)

run_pipeline()

Expected output: a growing SQLite database, with status lines printed every 5 minutes.

If it breaks: the outer try/except catches most runtime failures and resumes after 30 seconds. Watch your log output for repeated errors, which usually mean a credential issue or proxy misconfiguration, not a transient network blip.

Common pitfalls

Using datacenter proxies on the HTML site. Datacenter IPs are fine for the Firebase API, which doesn’t do IP reputation scoring aggressively. The HTML site at news.ycombinator.com is more selective and will block datacenter ranges faster, especially for comment page requests. Residential proxies cost more per GB but fail less often, so total effective cost is often similar.

Static User-Agent strings. Sending python-requests/2.31.0 on every request is an instant fingerprint. Rotate through realistic browser User-Agent strings. You don’t need a large list; five or six current Chrome and Firefox strings on Linux and macOS are enough.
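
A rotation helper can be as small as the sketch below. The strings in USER_AGENTS are illustrative and will age, so refresh them from current browser releases. random_headers is a hypothetical helper name.

```python
import random

# Illustrative strings only; refresh them from current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
]

def random_headers(rng=random):
    """Return request headers with a User-Agent picked at random."""
    return {"User-Agent": rng.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

Calling random_headers() per request, in place of a module-level HEADERS constant, is enough to vary the string across the pipeline.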

Ignoring backpressure signals. A 429 or 503 means back off, not retry immediately. Use exponential backoff: wait 2 seconds on the first retry, 4 on the second, 8 on the third. Retrying immediately on a rate-limited IP wastes proxy bandwidth and doesn’t help.
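
The 2, 4, 8 second schedule can be sketched as a small wrapper. fetch_with_backoff and FakeResponse are hypothetical names. do_request stands in for any call that returns an object with a status_code attribute, such as a lambda around requests.get, and sleep is injectable so the demo below finishes instantly.

```python
import time

RETRYABLE = {429, 503}

def fetch_with_backoff(do_request, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call do_request() and retry on 429/503 with doubling delays."""
    response = do_request()
    for attempt in range(max_retries):
        if response.status_code not in RETRYABLE:
            return response
        sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
        response = do_request()
    return response  # may still be rate-limited; the caller decides what next

# Offline demo: a fake endpoint that fails three times, then succeeds.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

statuses = iter([429, 503, 429, 200])
delays = []
final = fetch_with_backoff(lambda: FakeResponse(next(statuses)), sleep=delays.append)
print(final.status_code, delays)
```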

Over-polling. HN’s top stories list updates every few minutes at most. Polling every 10 seconds achieves nothing except burning bandwidth and increasing your IP exposure. Every 5 minutes for top stories is plenty for most use cases. Every 15-30 minutes for the “new” feed is fine.

No structured error logging. Silent failures are the worst kind. If a request fails and you only print it to stdout, you’ll think the pipeline is healthy while silently missing data. Write errors to a log file with timestamps and item IDs so you can audit gaps later.
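
A minimal sketch of that kind of error log, using the standard logging module. The logger name hn_pipeline, the default file name hn_errors.log, and both helper functions are assumptions made for illustration.

```python
import logging
import os
import tempfile

def build_error_logger(path="hn_errors.log"):
    """Return a logger that appends timestamped lines to the file at path."""
    logger = logging.getLogger("hn_pipeline")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger

def log_fetch_error(logger, item_id, error):
    """Record a failed fetch with the item ID so gaps can be audited later."""
    logger.error("item_id=%s error=%s", item_id, error)

# Demo writes to a temporary directory so it leaves no file in the project.
log_path = os.path.join(tempfile.mkdtemp(), "hn_errors.log")
logger = build_error_logger(log_path)
log_fetch_error(logger, 12345, "HTTP 503")

with open(log_path, encoding="utf-8") as f:
    logged = f.read()
print(logged)
```

Because each line carries item_id=..., a later audit can find the missing IDs with a plain text search and re-queue them.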

Scaling this

10x (a few thousand items per day): the setup above works without modification. SQLite handles this comfortably. One process, one proxy connection, 5-minute polling. Proxy costs stay under $5/month.

100x (tens of thousands of items per day, including comment HTML): move to async with httpx and asyncio to run multiple requests concurrently. Replace SQLite with PostgreSQL. Run 3-5 worker processes behind a task queue. At this scale you’re pulling comment HTML for many stories, which multiplies your bandwidth usage. Budget $30-80/month in proxy costs. A dedicated VPS is worth it here for stable uptime.
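
The concurrency part of the 100x tier can be sketched with asyncio alone. fetch_all and fake_fetch are hypothetical names. In a real pipeline fetch_one would presumably wrap an httpx.AsyncClient request, which is left out here so the example runs offline.

```python
import asyncio

async def fetch_all(item_ids, fetch_one, max_concurrency=5):
    """Fetch many items concurrently with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item_id):
        async with semaphore:  # wait here if too many requests are running
            return await fetch_one(item_id)

    # gather returns results in the same order as the input IDs
    return await asyncio.gather(*(bounded(i) for i in item_ids))

# Stand-in coroutine so the sketch runs without network access.
async def fake_fetch(item_id):
    await asyncio.sleep(0.01)
    return {"id": item_id}

results = asyncio.run(fetch_all(range(1, 21), fake_fetch))
print(len(results))
```

The semaphore is the throttle: raising max_concurrency raises throughput and proxy bandwidth together, so it is worth tuning against the rate limits described in Step 2.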

1000x (hundreds of thousands of items per day, full historical backfill or real-time comment indexing): you need a proper job queue. Celery with Redis works, or a managed queue like AWS SQS. Shard proxy traffic across multiple accounts or pool configurations to avoid rate limits within the proxy network itself. At this scale you’re looking at $150-400/month in proxy costs. Export from Postgres into ClickHouse or BigQuery for analytical queries, because Postgres row scans get slow past ~50M rows without careful indexing.

One thing that comes up at the high end, especially if you’re running this pipeline as a service for multiple clients: each client’s traffic should use a separate proxy pool. If one client’s scraping behavior triggers a soft block, it shouldn’t affect the others. There’s a useful breakdown of IP isolation practices for multi-tenant setups at multiaccountops.com/blog/.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
