How to scrape Reddit at scale in 2026 with proxies that work
Reddit is one of the richest public datasets on the internet. Millions of threads, comments, votes, and user signals updated in near-real-time across tens of thousands of communities. if you’re building sentiment analysis, training datasets, market research pipelines, or community monitoring tools, Reddit data is often irreplaceable.
The problem is that Reddit actively defends against scraping. After major API pricing changes in 2023, the free tier is capped at 100 requests per minute for OAuth-authenticated apps, with commercial usage requiring a data licensing agreement for large-scale pulls. The old days of hammering the JSON API with no auth are over. Get it wrong and you’ll eat 429s, IP bans, and account suspensions before you’ve collected anything useful.
This guide is for operators who need to pull Reddit data at scale, whether that’s 50,000 posts a day or 5 million. i’ll walk through the full stack: auth setup, proxy configuration, rate handling, and what changes as volume grows. you’ll come out with a working scraper and a clear picture of where the edge cases live.
what you need
- Python 3.10+ with praw, requests, httpx, and tenacity installed
- A Reddit developer app (free, created at reddit.com/prefs/apps) for OAuth credentials
- Rotating residential proxies, minimum 1 GB for testing. datacenter proxies work at low volume but get flagged fast at scale. residential pools from providers like Smartproxy (starts around $7/GB), Bright Data ($8.40/GB), or ProxyScraping’s residential plans are workable options
- A proxy manager or rotation endpoint that supports per-request IP rotation
- A Postgres or SQLite instance for deduplication and checkpointing (you will need this)
- Basic familiarity with OAuth2 flows and reading HTTP response headers
- Budget estimate: $20-80/month for proxies at moderate scale (100k-500k requests/month), plus Reddit API costs if you trigger commercial thresholds
step by step
step 1: create your Reddit app credentials
Go to reddit.com/prefs/apps and create a “script” type app. note down your client_id (shown under the app name) and client_secret, then set a user_agent string that follows Reddit’s required format: platform:app_id:version (by /u/yourusername).
Reddit’s API documentation is explicit that a descriptive user agent is required. using a generic string like python-requests/2.31 will get you rate-limited faster than almost anything else.
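a small helper can keep that format consistent when you end up managing several apps. this is only a sketch: build_user_agent and its parameter names are hypothetical and not part of PRAW.

```python
def build_user_agent(platform: str, app_id: str, version: str, username: str) -> str:
    """Format a user agent as platform:app_id:version (by /u/username)."""
    fields = {
        "platform": platform,
        "app_id": app_id,
        "version": version,
        "username": username,
    }
    # Refuse empty or whitespace-only parts so a generic-looking string never ships.
    for name, value in fields.items():
        if not value or not value.strip():
            raise ValueError(f"{name} must be non-empty")
    return f"{platform}:{app_id}:{version} (by /u/{username})"


print(build_user_agent("linux", "myresearchtool", "v1.0", "yourusername"))
```

the result can be passed straight to the user_agent argument used in the steps that follow.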
pip install praw requests httpx tenacity
if it breaks: if your app shows “invalid_grant” on first auth, check the username and password first, then double-check that the app type is “script” not “web app”. web apps require a redirect URI flow.
step 2: authenticate and test rate limits
import praw
import time

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="linux:myresearchtool:v1.0 (by /u/yourusername)",
    username="yourusername",
    password="yourpassword"
)

# test with a small pull
subreddit = reddit.subreddit("python")
for post in subreddit.hot(limit=10):
    print(post.title, post.score)
    time.sleep(0.6)  # stay well under 100 req/min
run this first without proxies to confirm credentials work. you should see posts returned with no errors. PRAW handles token refresh automatically, so don’t try to manage that yourself.
if it breaks: a ResponseException: 401 means bad credentials. a 429 at this stage usually means you’re running multiple instances with the same credentials; consolidate them.
step 3: configure your proxy rotation
PRAW uses Python’s [requests](https://requests.readthedocs.io/) under the hood via its requestor layer. the cleanest way to inject proxies is by passing a custom requests.Session:
import praw
import requests

# Sticky session endpoint from your proxy provider
# placeholder: replace user, password, PROXY_HOST, and the port with your actual proxy rotation endpoint
PROXY_ENDPOINT = "http://user:password@PROXY_HOST:10000"

session = requests.Session()
session.proxies = {
    "http": PROXY_ENDPOINT,
    "https": PROXY_ENDPOINT,
}

# hand the proxied session to PRAW's requestor
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="linux:myresearchtool:v1.0 (by /u/yourusername)",
    username="yourusername",
    password="yourpassword",
    requestor_kwargs={"session": session}
)
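at this point, all PRAW requests will route through the proxy. verify the exit IP by sending a request through the same session to an IP-echo endpoint (your provider may document one) and confirming the address returned is not your own. reddit.auth.limits only reports rate limit counters, not the IP Reddit sees.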
if it breaks: if you get SSL errors through the proxy, some providers require you to disable SSL verification for their MITM layer. set session.verify = False only if your provider instructs this, and only if you trust the proxy chain. better providers give you a CA cert to install instead.
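another thing that can quietly break the proxy setup in this step is a username or password containing characters like @, : or /, which are reserved in URLs. the sketch below percent-encodes the credentials before building the endpoint string. build_proxy_url, proxies_for, and the example host are hypothetical names for illustration, not part of any provider's SDK.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int,
                    scheme: str = "http") -> str:
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" makes quote() encode "/" as well as "@" and ":".
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def proxies_for(proxy_url: str) -> dict:
    """Return the mapping shape that requests expects for session.proxies."""
    return {"http": proxy_url, "https": proxy_url}


# Hypothetical credentials and host, for illustration only.
url = build_proxy_url("user", "p@ss:word", "proxy.example.com", 10000)
print(proxies_for(url))
```

assigning the returned dict to session.proxies gives the same configuration as the hand-written mapping above.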
step 4: build a rate-aware request loop
Reddit’s API returns your remaining request budget in response headers (X-Ratelimit-Remaining, X-Ratelimit-Reset, X-Ratelimit-Used). PRAW exposes these via reddit.auth.limits. build your loop around this:
import time

from tenacity import retry, wait_exponential, stop_after_attempt

# retries the whole pull with exponential backoff (4s to 60s), up to 5 attempts
@retry(wait=wait_exponential(multiplier=1, min=4, max=60), stop=stop_after_attempt(5))
def fetch_posts(subreddit_name, limit=100):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []
    for post in subreddit.new(limit=limit):
        posts.append({
            "id": post.id,
            "title": post.title,
            "score": post.score,
            "url": post.url,
            "created_utc": post.created_utc,
            "num_comments": post.num_comments,
            "subreddit": subreddit_name,
        })
        limits = reddit.auth.limits
        # values can be None until PRAW has seen rate limit headers
        remaining = limits.get("remaining")
        if remaining is not None and remaining < 10:
            reset_time = limits.get("reset_timestamp") or time.time() + 60
            sleep_for = max(0, reset_time - time.time()) + 2
            print(f"rate limit low, sleeping {sleep_for:.1f}s")
            time.sleep(sleep_for)
    return posts
if it breaks: if you’re getting persistent 429s despite sleeping, you may have multiple processes sharing one set of credentials. each OAuth token has its own rate limit bucket, so you need one credential set per worker process.
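the budget check in this step can also be pulled out into a pure function, which makes it easy to test without touching the network. this is a sketch under assumptions: seconds_to_wait is a hypothetical helper, the threshold of 10 and the 2 second padding mirror the loop above, and the input is assumed to be a dict shaped like reddit.auth.limits, whose values may be None.

```python
import time
from typing import Optional


def seconds_to_wait(limits: dict, threshold: int = 10, padding: float = 2.0,
                    now: Optional[float] = None) -> float:
    """Return how long to sleep for a limits dict; 0.0 means keep going."""
    now = time.time() if now is None else now
    remaining = limits.get("remaining")
    # Unknown or comfortable budget: no need to pause.
    if remaining is None or remaining >= threshold:
        return 0.0
    reset_at = limits.get("reset_timestamp")
    if reset_at is None:
        reset_at = now + 60  # assume a 60 second window when the reset time is unknown
    return max(0.0, reset_at - now) + padding


# Budget nearly spent, window resets 30 seconds from "now".
print(seconds_to_wait({"remaining": 3, "reset_timestamp": 1030.0}, now=1000.0))  # 32.0
```

inside the loop, a call like time.sleep(seconds_to_wait(reddit.auth.limits)) would then replace the inline arithmetic.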
step 5: checkpoint and deduplicate with a local database
at any real scale you will hit network errors, proxy timeouts, and process crashes. without checkpointing you re-scrape from scratch every time. use SQLite for small runs, Postgres for anything serious:
import sqlite3
import time

conn = sqlite3.connect("reddit_scrape.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        id TEXT PRIMARY KEY,
        title TEXT,
        score INTEGER,
        url TEXT,
        created_utc REAL,
        num_comments INTEGER,
        subreddit TEXT,
        scraped_at REAL
    )
""")
conn.commit()

def save_posts(posts):
    # INSERT OR IGNORE skips rows whose id is already stored
    conn.executemany("""
        INSERT OR IGNORE INTO posts
        (id, title, score, url, created_utc, num_comments, subreddit, scraped_at)
        VALUES (:id, :title, :score, :url, :created_utc, :num_comments, :subreddit, :scraped_at)
    """, [{**p, "scraped_at": time.time()} for p in posts])
    conn.commit()
the INSERT OR IGNORE on the primary key means re-running the scraper won’t duplicate data. track which subreddits you’ve completed in a separate table.
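the separate progress table could look something like this. it is a minimal sketch: the subreddit_progress table and the init_progress, mark_done, and pending helpers are hypothetical names, and the demo uses an in-memory database so it does not touch reddit_scrape.db.

```python
import sqlite3
import time


def init_progress(db: sqlite3.Connection) -> None:
    """Create the checkpoint table if it does not exist yet."""
    db.execute("""
        CREATE TABLE IF NOT EXISTS subreddit_progress (
            subreddit TEXT PRIMARY KEY,
            completed_at REAL
        )
    """)
    db.commit()


def mark_done(db: sqlite3.Connection, subreddit: str) -> None:
    """Record that a subreddit finished; re-marking just updates the timestamp."""
    db.execute(
        "INSERT OR REPLACE INTO subreddit_progress (subreddit, completed_at) VALUES (?, ?)",
        (subreddit, time.time()),
    )
    db.commit()


def pending(db: sqlite3.Connection, subreddits: list) -> list:
    """Return the subreddits that have not been marked done, in input order."""
    done = {row[0] for row in db.execute("SELECT subreddit FROM subreddit_progress")}
    return [s for s in subreddits if s not in done]


demo_db = sqlite3.connect(":memory:")
init_progress(demo_db)
mark_done(demo_db, "python")
print(pending(demo_db, ["python", "datascience"]))  # ['datascience']
```

on restart, feeding pending(...) into the scrape loop skips everything that already completed.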
if it breaks: SQLite will throw database is locked under concurrent writes. either switch to Postgres or use WAL mode: conn.execute("PRAGMA journal_mode=WAL").
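if you stay on SQLite with several threads, note that a sqlite3 connection can by default only be used from the thread that created it. one way to combine that constraint with WAL mode is a connection per thread, sketched below. get_conn is a hypothetical helper, and the 30 second busy timeout is an arbitrary assumption.

```python
import sqlite3
import threading

_local = threading.local()


def get_conn(path: str = "reddit_scrape.db") -> sqlite3.Connection:
    """Return the calling thread's connection for path, creating it on first use."""
    conns = getattr(_local, "conns", None)
    if conns is None:
        conns = _local.conns = {}
    if path not in conns:
        # timeout makes a writer wait for a lock instead of failing immediately.
        conn = sqlite3.connect(path, timeout=30)
        conn.execute("PRAGMA journal_mode=WAL")
        conns[path] = conn
    return conns[path]
```

a save function would then call get_conn() at the top instead of relying on a module-level connection.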
step 6: handle comment trees
post listings are the easy part. comment trees are where most scrapers break. Reddit nests comments up to 10 levels deep and paginates with “more” objects that require additional API calls:
def fetch_comments(post_id, max_comments=500):
    submission = reddit.submission(id=post_id)
    submission.comments.replace_more(limit=0)  # skip "load more" for speed
    comments = []
    for comment in submission.comments.list():
        if len(comments) >= max_comments:
            break
        if hasattr(comment, 'body'):
            comments.append({
                "id": comment.id,
                "post_id": post_id,
                "body": comment.body,
                "score": comment.score,
                "created_utc": comment.created_utc,
                "parent_id": comment.parent_id,
            })
    return comments
replace_more(limit=0) skips the “load more” nodes entirely. if you need full thread depth, use replace_more(limit=None) but expect each “more” node to cost one API call. at high volume, those add up fast.
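if it breaks: replace_more can time out on very large threads (10k+ comments). wrap it in a try/except for prawcore.exceptions.PrawcoreException (network and server errors are raised by prawcore, not as praw.exceptions.PRAWException) and fall back to limit=0.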
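that fallback might be sketched as follows. expand_comments is a hypothetical helper, and the exception types are passed in rather than hard-coded, because which exception a timeout surfaces as depends on your PRAW and prawcore versions; treat that part as an assumption to verify.

```python
def expand_comments(submission, errors, full_limit=None):
    """Try a full "more" expansion; on one of `errors`, fall back to limit=0.

    Returns the limit that was actually applied.
    """
    try:
        submission.comments.replace_more(limit=full_limit)
        return full_limit
    except errors:
        # Drop the remaining "more" nodes instead of fetching them.
        submission.comments.replace_more(limit=0)
        return 0
```

a call could look like expand_comments(reddit.submission(id=post_id), errors=(prawcore.exceptions.PrawcoreException,)), after which submission.comments.list() is iterated as in fetch_comments.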
step 7: parallelize with multiple credential sets
to get past the 100 req/min ceiling, you need multiple Reddit apps. each app authenticates independently and gets its own rate limit bucket. a clean pattern is a worker pool:
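import concurrent.futures
import queue

import praw

credential_sets = [
    {"client_id": "id1", "client_secret": "sec1", "username": "user1", "password": "pass1"},
    {"client_id": "id2", "client_secret": "sec2", "username": "user2", "password": "pass2"},
    # add as many as you have
]
reddit_instances = [
    praw.Reddit(**creds, user_agent="linux:myresearchtool:v1.0 (by /u/mainaccount)")
    for creds in credential_sets
]
subreddits_to_scrape = ["python", "datascience", "machinelearning", "learnprogramming"]

def fetch_posts_with_instance(reddit_instance, subreddit_name, limit=100):
    # same fields as fetch_posts in step 4, but bound to the given instance
    return [
        {"id": p.id, "title": p.title, "score": p.score, "url": p.url,
         "created_utc": p.created_utc, "num_comments": p.num_comments,
         "subreddit": subreddit_name}
        for p in reddit_instance.subreddit(subreddit_name).new(limit=limit)
    ]

def worker(reddit_instance, subreddit_queue):
    while True:
        try:
            sub = subreddit_queue.get_nowait()
        except queue.Empty:
            break
        posts = fetch_posts_with_instance(reddit_instance, sub)
        # sqlite3 connections are tied to their creating thread by default,
        # so save_posts needs a per-thread connection (or Postgres) here
        save_posts(posts)

q = queue.Queue()
for sub in subreddits_to_scrape:
    q.put(sub)
with concurrent.futures.ThreadPoolExecutor(max_workers=len(reddit_instances)) as executor:
    futures = [executor.submit(worker, inst, q) for inst in reddit_instances]
    concurrent.futures.wait(futures)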
each thread uses a different reddit_instance and therefore a different rate limit bucket. pair each instance with its own proxy rotation endpoint to avoid IP reuse across workers.
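the pairing itself can be kept explicit so that no two workers end up on the same endpoint by accident. a minimal sketch, where pair_workers and the endpoint strings are hypothetical:

```python
def pair_workers(credential_sets: list, proxy_endpoints: list) -> list:
    """Pair each credential set with exactly one proxy endpoint."""
    if len(proxy_endpoints) < len(credential_sets):
        raise ValueError("need at least one proxy endpoint per credential set")
    return [
        {"creds": creds, "proxies": {"http": proxy, "https": proxy}}
        for creds, proxy in zip(credential_sets, proxy_endpoints)
    ]


# Hypothetical values, for illustration only.
pairs = pair_workers(
    [{"client_id": "id1"}, {"client_id": "id2"}],
    ["http://proxy-a.example.com:10000", "http://proxy-b.example.com:10000"],
)
print(pairs[1]["proxies"]["https"])  # http://proxy-b.example.com:10000
```

for each pair, you would create a requests.Session, assign pair["proxies"] to session.proxies, and pass the session through requestor_kwargs as in step 3.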
if it breaks: if Reddit starts returning 403s across all credential sets simultaneously, you’ve likely triggered account-level action. pause for 24 hours and review your user agent strings and request patterns.
common pitfalls
using datacenter proxies for high-volume pulls. datacenter IPs are flagged heavily by Reddit’s anti-abuse systems. they work for light testing but you’ll see escalating 429s and eventual IP blocks within hours at serious volume. residential or mobile proxies are the only durable choice above a few thousand requests per day. see the best rotating residential proxies guide for a current comparison.
sharing one OAuth token across multiple processes. the 100 req/min limit is per token, not per IP. running four processes against one credential set doesn’t multiply your throughput, it multiplies your 429s. one credential set per worker process is the rule.
ignoring created_utc for historical pulls. Reddit’s new listing only goes back about 1000 posts, and the after pagination parameter only pages within that window. for older data you need a different approach entirely (Pushshift was the standard option but has had availability issues; as of 2026, verify its current status before depending on it). plan your data collection window before starting.
not rotating user agents alongside proxies. rotating IPs while keeping the same user agent string is a half-measure. Reddit correlates both. if you’re using a proxy pool, vary the user agent per credential set as well, keeping each one in the descriptive format from step 1.
skipping deduplication. pagination gaps and restarts will generate duplicates if you’re not checkpointing by post ID. INSERT OR IGNORE on the primary key is the minimum acceptable setup.
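for the created_utc pitfall, a quick coverage check can tell you whether a listing actually reached the start of your collection window or ran out before it. this is a sketch with a hypothetical helper, split_by_window, operating on post dicts shaped like the ones built in step 4.

```python
def split_by_window(posts: list, start_utc: float, end_utc: float):
    """Return (posts inside [start_utc, end_utc), whether the listing reached start_utc)."""
    in_window = [p for p in posts if start_utc <= p["created_utc"] < end_utc]
    # Seeing any post older than the window start means the window is fully covered.
    reached_start = any(p["created_utc"] < start_utc for p in posts)
    return in_window, reached_start


sample = [
    {"id": "a", "created_utc": 300.0},
    {"id": "b", "created_utc": 200.0},
    {"id": "c", "created_utc": 50.0},
]
kept, covered = split_by_window(sample, start_utc=100.0, end_utc=250.0)
print([p["id"] for p in kept], covered)  # ['b'] True
```

if the second value comes back False, the listing ended before the window start, so the oldest part of the window is likely missing.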
scaling this
10x (1M requests/month): you need at minimum 5-10 credential sets, residential proxies with a sticky session option, and Postgres instead of SQLite. at this level your main constraint is credential management, not infrastructure. keep a spreadsheet of which accounts are in good standing.
100x (10M requests/month): you’re past what a single machine handles cleanly. move to a job queue (Redis + Celery or similar), distribute workers across multiple VMs, and use a proxy provider with a dedicated pool rather than shared residential. budget $200-600/month for proxies alone at this tier. you may also need to look at Reddit’s commercial data licensing, depending on your use case. this is where operators doing this for business purposes should talk to a lawyer (this is not legal advice).
1000x (100M+ requests/month): at this scale you’re almost certainly in the territory where Reddit wants a formal data agreement. the technical stack becomes less of a constraint than the legal and commercial relationship. some operators at this scale also supplement API scraping with browser automation for content not exposed in the API, pairing headless browsers with antidetect profiles. if that’s relevant to your operation, antidetectreview.org/blog/ covers the tooling landscape well.
for both the 100x and 1000x tiers, consider multi-account infrastructure for running the Reddit apps themselves. the patterns for managing credential pools at scale overlap heavily with broader multi-account operations, which is covered in more depth at multiaccountops.com/blog/.
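as a rough sanity check on these tiers, the arithmetic behind the credential count can be written down. the sketch below assumes the 100 req/min per-credential ceiling quoted in this guide and a utilization factor (the fraction of that ceiling you actually sustain) that is purely a guess; credential_sets_needed is a hypothetical helper.

```python
import math

REQUESTS_PER_MINUTE = 100  # per-credential ceiling quoted in this guide


def credential_sets_needed(requests_per_month: int, utilization: float = 0.25,
                           days: int = 30) -> int:
    """Estimate the minimum number of credential sets for a monthly volume."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Requests one credential set can make in a month at the given utilization.
    per_credential = REQUESTS_PER_MINUTE * 60 * 24 * days * utilization
    return max(1, math.ceil(requests_per_month / per_credential))


for volume in (1_000_000, 10_000_000, 100_000_000):
    print(volume, credential_sets_needed(volume))
```

the result is a floor from arithmetic alone. the tier recommendations above are more conservative, and account suspensions or uneven schedules will push the real number higher.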
where to go next
- How to scrape Twitter/X at scale in 2026 covers a similar operator setup for X’s API, including the differences in rate limit architecture and which proxy types hold up.
- Best rotating residential proxies for scraping in 2026 reviews current residential proxy providers with actual throughput tests, useful before you commit to a provider at scale.
- The official PRAW documentation is the authoritative reference for every method used in this guide. when in doubt, read the source.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.