How to scrape Hacker News at scale in 2026 with proxies that work
Hacker News is one of the cleanest public data sources on the internet. No paywalls, no login gates, no heavy JavaScript rendering for the core content. Y Combinator maintains a public Firebase API that returns structured JSON, and the site itself is mostly server-rendered HTML that any HTTP client can parse. So why do operators still get blocked when they try to pull data at scale?
Because the volume itself is the signal. If you’re pulling thousands of items per hour, the same IP is making hundreds of sequential requests that look nothing like a human browsing pattern. Rate limits kick in, you start seeing 429s and 503s, and your pipeline stops. I’ve built three different HN scrapers over the years, one for a newsletter aggregator, one for a trend-detection tool, one for a client tracking competitor link placements. The pattern is always the same: works fine on a laptop, falls apart at 10x or 100x.
This tutorial is for operators who need continuous, reliable HN data, whether for competitive intelligence, content trend analysis, link tracking, or feeding an automated pipeline. I’ll walk through the exact Python setup I use, including the proxy rotation layer that makes it work without burning through IPs. By the end you’ll have a working scraper running a continuous loop, writing to a local database, and handling errors without dying silently.
What you need
- Python 3.10+ with requests, beautifulsoup4, and sqlite3 (stdlib)
- A rotating residential proxy service. I use ProxyScraping.com residential proxies, running roughly $3-8/GB depending on the plan tier
- Optional: Redis for distributed deduplication if you’re running multiple workers
- Optional: a small VPS (Hetzner CX22 at ~$4/month, DigitalOcean Basic Droplet at $6/month) for running the loop continuously
- A text editor and basic comfort with running Python scripts from the command line
Rough cost at modest scale: $10-30/month for proxies, $5-6/month for a VPS. Under $40/month total for a pipeline pulling several thousand items per day.
Step by step
Step 1: read the official API first
Before scraping HTML, check what the API gives you for free. The Hacker News Firebase API is documented, maintained by YC, and significantly more tolerant of automated traffic than the HTML site.
```python
import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def get_top_stories():
    """Return the current top story IDs (a list of up to 500 item IDs)."""
    r = requests.get(f"{BASE_URL}/topstories.json", timeout=10)
    r.raise_for_status()
    return r.json()


def get_item(item_id):
    """Return one item as a dict, or None when the API responds with null."""
    r = requests.get(f"{BASE_URL}/item/{item_id}.json", timeout=10)
    r.raise_for_status()
    return r.json()


story_ids = get_top_stories()
print(get_item(story_ids[0]))
```
Expected output: a dict with keys id, title, url, score, by, time, descendants, kids.
If it breaks: the API returns null for deleted or dead items. Add if data: before processing any item response.
Step 2: check robots.txt before touching the HTML site
Pull https://news.ycombinator.com/robots.txt and read it. HN allows crawling of most paths, including item pages and the front page, but restricts a few paths like /r/. The robots exclusion protocol is not legally binding in most jurisdictions, but respecting it is good practice and reduces your chance of an IP ban. Stay inside those bounds.
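The check above can also be automated with the standard library's robots.txt parser. This is a minimal sketch: the rules in SAMPLE_ROBOTS are illustrative placeholders so the example runs offline, not HN's actual file, and the helper names build_parser and is_allowed are hypothetical. Against the live site you would call set_url() and read() on the parser instead of feeding it sample text.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (an assumption); fetch the live file for the real ones.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 30
"""


def build_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://news.ycombinator.com/item?id=1"))
print(is_allowed(parser, "https://news.ycombinator.com/private/x"))
print(parser.crawl_delay("*"))  # the Crawl-delay value, if the file sets one
```

Running the URL check before each new path pattern you add keeps the scraper inside whatever the site currently publishes.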
The Firebase API has no officially published rate limit, but in practice anything above 1 request per second from a single IP triggers throttling. The HTML site is tighter, rate-limiting around 30-50 requests per minute per IP.
If it breaks: if you see 429 responses, slow down before adding proxies. Adding proxies to a fundamentally too-fast request loop just burns through your proxy quota faster.
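One way to slow down is to enforce a minimum gap between requests instead of sprinkling sleep calls through the code. The sketch below is an assumption about how you might structure it: MinIntervalThrottle is a hypothetical helper, and the clock and sleep parameters exist only so the behavior can be tested without waiting.

```python
import time


class MinIntervalThrottle:
    """Block until at least min_interval seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage: create one throttle per target host and call wait() before each request.
# throttle = MinIntervalThrottle(1.0)   # at most ~1 request per second
# throttle.wait()
# r = requests.get(url, timeout=10)
```

Because the throttle measures elapsed time, slow responses do not add extra delay on top of the interval.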
Step 3: build a single-IP scraper with delay
```python
import time

import requests

BASE_URL = "https://hacker-news.firebaseio.com/v0"


def fetch_items(item_ids, delay=0.6):
    """Fetch items one at a time from a single IP, pausing between requests."""
    results = []
    for item_id in item_ids:
        try:
            r = requests.get(
                f"{BASE_URL}/item/{item_id}.json",
                timeout=10
            )
            r.raise_for_status()
            data = r.json()
            if data:  # deleted or dead items come back as null
                results.append(data)
        except requests.RequestException as e:
            print(f"error on {item_id}: {e}")
        time.sleep(delay)  # pause after every request, including failed ones
    return results


story_ids = requests.get(f"{BASE_URL}/topstories.json", timeout=10).json()
stories = fetch_items(story_ids[:50])
print(f"fetched {len(stories)} stories")
```
Expected output: a list of story dicts, printed count at the end.
If it breaks: increase timeout to 15s. The Firebase endpoints can be slow, especially under load.
Step 4: add rotating residential proxies
This is the step that separates a prototype from a production pipeline. You need your requests to exit through different IPs so no single IP accumulates enough requests to trigger a block. I route through ProxyScraping’s residential rotating gateway, which handles IP rotation internally and presents a single endpoint.
```python
import requests

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "gate.proxyscraping.com"
PROXY_PORT = 31112  # residential rotating, verify in your dashboard

# Both schemes go through the same HTTP proxy endpoint.
PROXIES = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    )
}


def fetch_with_proxy(url):
    """GET a JSON endpoint through the rotating proxy gateway."""
    r = requests.get(url, proxies=PROXIES, headers=HEADERS, timeout=15)
    r.raise_for_status()
    return r.json()
```
Expected output: same data as before, but requests now exit through rotating residential IPs.
If it breaks: check your proxy credentials in the ProxyScraping dashboard. Verify you’re using the correct port for your plan tier; residential and datacenter ports are different. If you’re getting 407 auth errors, check how special characters in your password are encoded in the proxy URL: they need to be percent-encoded exactly once, not twice.
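Encoding problems are easiest to avoid by building the proxy URL in one place. The sketch below percent-encodes the username and password once with urllib.parse.quote; build_proxy_url is a hypothetical helper, and the password shown is a made-up example.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port):
    """Build an HTTP proxy URL with the credentials percent-encoded once."""
    user_q = quote(user, safe="")      # safe="" also encodes "/" and ":"
    pass_q = quote(password, safe="")
    return f"http://{user_q}:{pass_q}@{host}:{port}"


# Made-up password containing characters that would otherwise break the URL.
url = build_proxy_url("your_username", "p@ss:w/rd", "gate.proxyscraping.com", 31112)
print(url)
```

Pass the raw password from your dashboard into the helper; encoding a password that is already encoded is what produces the double-encoding failure.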
Step 5: scrape comment text from the HTML site
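A story item from the Firebase API carries only the kids array of top-level comment IDs, not the comment text. Each comment is its own item with a text field, so reading a full thread through the API costs one request per comment. To pull a whole thread in far fewer requests, parse the HTML site instead.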
```python
import requests
from bs4 import BeautifulSoup

# PROXIES and HEADERS are the dicts defined in Step 4.


def get_comments(story_id):
    """Return the text of every comment rendered on a story's HTML page."""
    url = f"https://news.ycombinator.com/item?id={story_id}"
    r = requests.get(url, proxies=PROXIES, headers=HEADERS, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    comments = []
    for span in soup.select(".commtext"):
        text = span.get_text(separator=" ", strip=True)
        if text:
            comments.append(text)
    return comments
```
Expected output: a list of comment strings for the story.
If it breaks: HN’s HTML class names have been stable since at least 2022, but inspect the page source if .commtext returns nothing. The structure occasionally shifts after site updates.
Step 6: add SQLite deduplication
Running a continuous loop means you’ll encounter the same story IDs on every poll. Track what you’ve already fetched so you’re not re-inserting the same rows.
```python
import sqlite3

conn = sqlite3.connect("hn_stories.db")
c = conn.cursor()
c.execute("""
    CREATE TABLE IF NOT EXISTS stories (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT,
        score INTEGER,
        author TEXT,
        fetched_at INTEGER DEFAULT (strftime('%s','now'))
    )
""")
conn.commit()


def save_story(story):
    """Insert a story; the primary key makes repeat inserts a no-op."""
    c.execute(
        "INSERT OR IGNORE INTO stories (id, title, url, score, author) "
        "VALUES (?, ?, ?, ?, ?)",
        (story.get("id"), story.get("title"), story.get("url"),
         story.get("score"), story.get("by"))
    )
    conn.commit()


def already_fetched(item_id):
    """Return True if this item ID is already in the stories table."""
    c.execute("SELECT 1 FROM stories WHERE id = ?", (item_id,))
    return c.fetchone() is not None
```
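If it breaks: SQLite allows only one writer at a time, so concurrent writes from several workers can fail with "database is locked" errors. A threading lock around the write calls is enough when the workers are threads inside one process. For separate worker processes, switch to PostgreSQL.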
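Before moving off SQLite, two settings can reduce lock errors: write-ahead logging, which lets readers continue while a write is in progress, and a busy timeout, which makes a blocked connection wait instead of failing immediately. This is a sketch of a mitigation rather than a replacement for a server database; open_db is a hypothetical helper and the 30-second timeout is an arbitrary choice.

```python
import sqlite3


def open_db(path, busy_timeout_s=30.0):
    """Open a SQLite database with WAL journaling and a lock wait timeout.

    Returns the connection and the journal mode SQLite reports.
    """
    # timeout: how long a connection waits on a lock before raising an error.
    conn = sqlite3.connect(path, timeout=busy_timeout_s)
    # WAL is persistent for file-backed databases once it has been set.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


# Usage:
# conn, mode = open_db("hn_stories.db")
# print(mode)  # the journal mode now in effect
```

There is still only one writer at a time under WAL, so heavy multi-process write loads remain a reason to switch databases.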
Step 7: wire everything into a continuous loop
```python
import time

# Uses BASE_URL, fetch_with_proxy, already_fetched, and save_story
# from the earlier steps.


def run_pipeline(poll_interval=300):
    print("starting HN pipeline")
    while True:
        try:
            story_ids = fetch_with_proxy(f"{BASE_URL}/topstories.json")
            new_ids = [sid for sid in story_ids if not already_fetched(sid)]
            print(f"{len(new_ids)} new stories to fetch")
            for sid in new_ids:
                story = fetch_with_proxy(f"{BASE_URL}/item/{sid}.json")
                if story:
                    save_story(story)
                time.sleep(0.4)  # pause between item requests
        except Exception as e:
            print(f"pipeline error: {e}")
            time.sleep(30)
            continue  # retry right after the 30s pause, skipping the long sleep
        print(f"sleeping {poll_interval}s")
        time.sleep(poll_interval)


run_pipeline()
```
Expected output: a growing SQLite database, with status lines printed every 5 minutes.
If it breaks: the outer try/except catches most runtime failures and resumes after 30 seconds. Watch your log output for repeated errors, which usually mean a credential issue or proxy misconfiguration, not a transient network blip.
Common pitfalls
Using datacenter proxies on the HTML site. Datacenter IPs are fine for the Firebase API, which doesn’t do IP reputation scoring aggressively. The HTML site at news.ycombinator.com is more selective and will block datacenter ranges faster, especially for comment page requests. Residential proxies cost more per GB but fail less often, so total effective cost is often similar.
Static User-Agent strings. Sending python-requests/2.31.0 on every request is an instant fingerprint. Rotate through realistic browser User-Agent strings. You don’t need a large list; five or six current Chrome and Firefox strings on Linux and macOS are enough.
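A minimal sketch of that rotation, which could stand in for the fixed HEADERS dict from Step 4. The strings below are illustrative examples of the format, not a maintained list, so treat the version numbers as assumptions and refresh them periodically. random_headers is a hypothetical helper.

```python
import random

# Illustrative User-Agent strings (assumed versions); keep these current.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) "
    "Gecko/20100101 Firefox/125.0",
]


def random_headers(rng=random):
    """Return request headers with a User-Agent picked at random from the list."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


print(random_headers())
```

Call random_headers() once per request and pass the result as the headers argument.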
Ignoring backpressure signals. A 429 or 503 means back off, not retry immediately. Use exponential backoff: wait 2 seconds on the first retry, 4 on the second, 8 on the third. Retrying immediately on a rate-limited IP wastes proxy bandwidth and doesn’t help.
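The 2, 4, 8 second schedule can be sketched as a small retry wrapper. The names here are hypothetical: fetch is any callable you supply that returns a (status_code, body) pair, for example a thin wrapper around requests.get, and the sleep parameter is injectable so the schedule can be checked without waiting.

```python
import time

RETRYABLE = {429, 503}  # status codes treated as "back off and retry"


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), backing off exponentially on 429/503 responses.

    Waits base_delay, then 2x, then 4x between attempts, and returns the
    last (status, body) pair once the retries are used up.
    """
    for attempt in range(max_retries + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt == max_retries:
            break  # out of retries; hand the failure back to the caller
        sleep(base_delay * (2 ** attempt))
    return status, body
```

Returning the final status instead of raising lets the caller decide whether to log the gap, skip the item, or pause the whole loop.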
Over-polling. HN’s top stories list updates every few minutes at most. Polling every 10 seconds achieves nothing except burning bandwidth and increasing your IP exposure. Every 5 minutes for top stories is plenty for most use cases. Every 15-30 minutes for the “new” feed is fine.
No structured error logging. Silent failures are the worst kind. If a request fails and you only print it to stdout, you’ll think the pipeline is healthy while silently missing data. Write errors to a log file with timestamps and item IDs so you can audit gaps later.
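A minimal sketch of that kind of log using the standard logging module. The logger name, file path, and line format are assumptions chosen for illustration, and make_error_logger and log_fetch_error are hypothetical helpers.

```python
import logging


def make_error_logger(path, name="hn_pipeline"):
    """Return a logger that appends timestamped lines to the file at path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate lines if called more than once
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


def log_fetch_error(logger, item_id, error):
    """Record a failed fetch with the item ID so gaps can be audited later."""
    logger.error("item_id=%s error=%s", item_id, error)


# Usage:
# logger = make_error_logger("hn_errors.log")
# log_fetch_error(logger, 123, "timeout")
```

Keeping the item ID in a fixed key=value position makes it easy to grep the file for IDs that need refetching.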
Scaling this
10x (a few thousand items per day): the setup above works without modification. SQLite handles this comfortably. One process, one proxy connection, 5-minute polling. Proxy costs stay under $5/month.
100x (tens of thousands of items per day, including comment HTML): move to async with httpx and asyncio to run multiple requests concurrently. Replace SQLite with PostgreSQL. Run 3-5 worker processes behind a task queue. At this scale you’re pulling comment HTML for many stories, which multiplies your bandwidth usage. Budget $30-80/month in proxy costs. A dedicated VPS is worth it here for stable uptime.
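The core of the async version is bounding how many requests are in flight at once. The sketch below shows only that part with the standard library: fetch_all and fake_fetch are hypothetical names, and fake_fetch is a stand-in for a real coroutine that would issue the request with an async HTTP client such as httpx.

```python
import asyncio


async def fetch_all(item_ids, fetch_one, max_concurrency=5):
    """Run fetch_one(item_id) for every ID with at most max_concurrency in flight.

    Results come back in the same order as item_ids.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item_id):
        async with sem:  # wait here while max_concurrency fetches are running
            return await fetch_one(item_id)

    return await asyncio.gather(*(bounded(i) for i in item_ids))


async def fake_fetch(item_id):
    """Placeholder for a real network call; sleeps briefly and returns a dict."""
    await asyncio.sleep(0.01)
    return {"id": item_id}


results = asyncio.run(fetch_all([1, 2, 3], fake_fetch, max_concurrency=2))
print(results)
```

Start with a low max_concurrency and raise it only while error rates stay flat, since concurrency multiplies the request rate each exit IP sees.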
1000x (hundreds of thousands of items per day, full historical backfill or real-time comment indexing): you need a proper job queue. Celery with Redis works, or a managed queue like AWS SQS. Shard proxy traffic across multiple accounts or pool configurations to avoid rate limits within the proxy network itself. At this scale you’re looking at $150-400/month in proxy costs. Export from Postgres into ClickHouse or BigQuery for analytical queries, because Postgres row scans get slow past ~50M rows without careful indexing.
One thing that comes up at the high end, especially if you’re running this pipeline as a service for multiple clients: each client’s traffic should use a separate proxy pool. If one client’s scraping behavior triggers a soft block, it shouldn’t affect the others. There’s a useful breakdown of IP isolation practices for multi-tenant setups at multiaccountops.com/blog/.
Where to go next
- “Residential vs datacenter proxies: which one do you actually need” breaks down the cost-versus-reliability tradeoff with real benchmarks across scraping targets including social news sites.
- “How to scrape Reddit at scale with rotating proxies” covers the same pattern for Reddit’s API and HTML, with notes on OAuth token management and subreddit-level rate limits.
- The full tutorial index has guides on exporting scraped data to BigQuery, running scrapers on a schedule with cron, and managing proxy credentials securely.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.