How to scrape GitHub at scale in 2026 with proxies that work
GitHub is one of the most data-rich targets you can scrape: repositories, contributors, star counts, commit histories, dependency graphs, issue timelines. if you are building a developer intelligence tool, a VC deal-sourcing pipeline, an open-source health monitor, or just benchmarking your competitors’ engineering output, GitHub is where the signal lives.
the problem is that GitHub’s rate limits are brutal by design. unauthenticated requests cap at 60 per hour per IP. authenticated requests with a personal access token get you 5,000 per hour. if you are trying to enumerate millions of repos, that math does not work. you hit the wall within minutes, your IP gets flagged, and you are back to square one.
this guide is for operators who already know how to write basic Python and want a working production setup, not a toy demo. by the end you will have a scraper that rotates tokens and proxy IPs, respects rate limit headers, stores clean output to disk or a database, and can be scaled horizontally without getting your infrastructure blocked.
what you need
- python 3.11+ with requests, httpx, or aiohttp. i use httpx for async work.
- GitHub personal access tokens (PATs), as many as you can generate across accounts. each authenticated token gives you 5,000 requests/hour against the REST API.
- a rotating residential or datacenter proxy pool. datacenter proxies are cheaper and fast but easier to block. residential IPs are slower and pricier but survive longer on GitHub. budget $50-200/month depending on volume.
- a proxy provider that supports HTTP/S and SOCKS5. ProxyScraping, Bright Data, Oxylabs, and Smartproxy all work. ProxyScraping’s datacenter pool starts around $7/month for lighter loads.
- a storage backend: SQLite for prototyping, Postgres for anything real, or flat JSONL files if you are feeding a pipeline.
- a VPS or cloud instance to run the scraper. a $6/month Hetzner CX22 in Falkenstein handles moderate throughput fine.
rough cost for a mid-scale operation scraping 500k repo records per month: $15-40 in proxies and $6-12 in compute. PATs are free if you can generate enough accounts.
step by step
step 1: understand the rate limit landscape
before writing a single line of code, read GitHub’s rate limit documentation in full. key numbers:
- unauthenticated: 60 requests/hour per IP
- authenticated PAT: 5,000 requests/hour per token
- GitHub Apps: up to 15,000 requests/hour depending on installation
- search API: 30 requests/minute (authenticated), 10 unauthenticated
the search API is the most restrictive and the most tempting. avoid relying on it as your primary enumeration method.
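to see these buckets for a specific token, you can read the JSON that the /rate_limit endpoint returns. the sketch below only parses such a payload; the function name summarize_rate_limit and the SAMPLE_PAYLOAD values are made up for illustration, and the payload shape (a resources object with limit, remaining, and reset per bucket) is my reading of GitHub's documented response, so verify it against a live call.

```python
from datetime import datetime, timezone


def summarize_rate_limit(payload: dict) -> dict:
    """Map each rate limit bucket to its remaining quota and reset time (UTC)."""
    summary = {}
    for name, bucket in payload.get("resources", {}).items():
        # "reset" is a Unix timestamp in seconds
        reset_at = datetime.fromtimestamp(bucket["reset"], tz=timezone.utc)
        summary[name] = {
            "remaining": bucket["remaining"],
            "limit": bucket["limit"],
            "resets_at": reset_at.isoformat(),
        }
    return summary


# hypothetical payload, trimmed to two buckets for illustration
SAMPLE_PAYLOAD = {
    "resources": {
        "core": {"limit": 5000, "remaining": 4990, "reset": 1767225600, "used": 10},
        "search": {"limit": 30, "remaining": 30, "reset": 1767222060, "used": 0},
    }
}

if __name__ == "__main__":
    for bucket, info in summarize_rate_limit(SAMPLE_PAYLOAD).items():
        print(bucket, info)
```

the core and search buckets are tracked separately, which is why burning through search quota does not touch your 5,000/hour core allowance.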
if it breaks: if you are hitting 403s from the start, check whether your IP range is already flagged. try a fresh residential IP before assuming your token is the issue.
step 2: generate a token pool
log into GitHub, go to Settings > Developer Settings > Personal Access Tokens > Tokens (classic). create tokens with public_repo and read:user scopes. for private repo access you need repo scope but that is a different threat model.
store tokens in a plain text file, one per line, or in a single comma-separated environment variable. never hardcode them, and keep the file out of version control.
# tokens.txt
ghp_abc123...
ghp_def456...
ghp_ghi789...
if you have multiple GitHub accounts, generate one PAT per account. each is a separate rate limit bucket.
if it breaks: GitHub started enforcing stricter PAT creation limits in late 2024 across accounts that look related by IP or payment method. generate tokens from different IPs if possible.
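a small loader keeps both storage options behind one function. this is a sketch, not the only way to do it: the load_tokens name and the GITHUB_TOKENS variable are assumptions of mine, and it skips blank lines and comment lines such as the "# tokens.txt" header shown above so they never end up in an Authorization header.

```python
import os
from pathlib import Path


def load_tokens(path: str = "tokens.txt", env_var: str = "GITHUB_TOKENS") -> list[str]:
    """Load tokens from a comma-separated env var if set, else from a file."""
    raw = os.environ.get(env_var)
    if raw:
        candidates = raw.split(",")
    else:
        candidates = Path(path).read_text().splitlines()

    tokens: list[str] = []
    for line in candidates:
        line = line.strip()
        # skip blanks and comment lines
        if not line or line.startswith("#"):
            continue
        # drop duplicates while keeping the original order
        if line not in tokens:
            tokens.append(line)

    if not tokens:
        raise ValueError("no tokens found")
    return tokens
```

de-duplicating matters because a token listed twice does not get a second rate limit bucket; it just gets hit twice as often.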
step 3: configure your proxy pool
with ProxyScraping or any rotating proxy provider, you get a gateway endpoint that swaps the exit IP on each request or on a timer. set it up as an environment variable:
# placeholder credentials and host; substitute your provider's gateway
export PROXY_URL="http://user:pass@proxy.example.com:8080"
test it before integrating:
curl -x "$PROXY_URL" https://api.github.com/rate_limit
you should see a response showing your remaining quota. if you see a 407, your credentials are wrong. if you see a timeout, the proxy endpoint is down or the port is blocked by your VPS firewall.
if it breaks: some proxy providers block api.github.com on their shared datacenter pools to avoid getting the range blacklisted. switch to residential or contact support to get the block lifted.
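one detail the URL format hides: if the proxy username or password contains characters like @, : or /, they have to be percent-encoded or the URL parses wrong and you get a confusing 407. a minimal sketch, assuming you keep the parts separate and assemble the URL in code; build_proxy_url, redact_proxy_url, and the example.com host are illustrative names, not anything a provider ships.

```python
from urllib.parse import quote, urlsplit


def build_proxy_url(user: str, password: str, host: str, port: int,
                    scheme: str = "http") -> str:
    """Assemble a proxy URL, percent-encoding the credentials."""
    safe_user = quote(user, safe="")
    safe_pass = quote(password, safe="")
    return f"{scheme}://{safe_user}:{safe_pass}@{host}:{port}"


def redact_proxy_url(url: str) -> str:
    """Return the URL with the password masked, for log output."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.username}:***@{parts.hostname}:{parts.port}"


if __name__ == "__main__":
    url = build_proxy_url("user", "p@ss:w/rd", "proxy.example.com", 8080)
    print(redact_proxy_url(url))
```

logging the redacted form instead of the raw URL keeps proxy credentials out of your log files.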
step 4: build the token and proxy rotation logic
the core pattern is a round-robin token queue combined with a sticky or rotating proxy. here is a minimal working implementation:
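import itertools
import time
from pathlib import Path

import httpx

# one token per line; skip blanks and comment lines such as "# tokens.txt"
TOKENS = [
    line.strip()
    for line in Path("tokens.txt").read_text().splitlines()
    if line.strip() and not line.startswith("#")
]
# placeholder credentials and host; substitute your provider's gateway
PROXY_URL = "http://user:pass@proxy.example.com:8080"
token_cycle = itertools.cycle(TOKENS)


def get_headers() -> dict:
    token = next(token_cycle)
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }


def fetch(url: str, max_retries: int = 5) -> dict | list:
    # bounded retry loop instead of unbounded recursion
    for _ in range(max_retries):
        headers = get_headers()  # each attempt uses the next token in the cycle
        with httpx.Client(proxy=PROXY_URL, timeout=15) as client:
            resp = client.get(url, headers=headers)
        # GitHub signals rate limiting with either 403 or 429
        if resp.status_code in (403, 429):
            retry_after = int(resp.headers.get("Retry-After", 60))
            print(f"rate limited, sleeping {retry_after}s")
            time.sleep(retry_after)
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")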
this is synchronous and single-threaded. it works for testing. for production you want async concurrency, which i cover in step 6.
if it breaks: if you get repeated 403s on a specific token, that token may be suspended. pull it out of rotation and generate a fresh one.
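plain round-robin keeps handing out a token even after it has been rate limited or suspended. one way to handle that, sketched under my own assumptions, is a small pool that tracks a cooldown per token and lets you retire dead ones. TokenPool and its method names are hypothetical, and the injectable clock is only there to make the behavior testable.

```python
import time
from typing import Callable, Iterable


class TokenPool:
    """Round-robin token pool that skips tokens that are cooling down."""

    def __init__(self, tokens: Iterable[str], clock: Callable[[], float] = time.time):
        self._tokens = list(tokens)
        self._blocked_until = {t: 0.0 for t in self._tokens}
        self._clock = clock
        self._next = 0

    def acquire(self) -> str:
        """Return the next usable token, or raise if every token is blocked."""
        now = self._clock()
        for _ in range(len(self._tokens)):
            token = self._tokens[self._next]
            self._next = (self._next + 1) % len(self._tokens)
            if self._blocked_until[token] <= now:
                return token
        raise RuntimeError("no usable tokens right now")

    def cool_down(self, token: str, seconds: float) -> None:
        """Take a token out of rotation for a while, e.g. after a 403 or 429."""
        self._blocked_until[token] = self._clock() + seconds

    def retire(self, token: str) -> None:
        """Remove a token permanently, e.g. one that looks suspended."""
        self._tokens.remove(token)
        del self._blocked_until[token]
        self._next = 0
```

in the fetch loop you would call cool_down with the Retry-After value instead of sleeping the whole process, so the other tokens keep working while one waits out its window.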
step 5: enumerate repos with the REST API
GitHub’s /repositories endpoint is the cleanest way to enumerate public repos by ID. it returns repos in ascending order of creation ID, which means you can paginate deterministically using the since parameter.
def scrape_repos(start_id: int = 0, limit: int = 10000):
    results = []
    since = start_id
    while len(results) < limit:
        url = f"https://api.github.com/repositories?since={since}&per_page=100"
        data = fetch(url)
        if not data:
            break
        results.extend(data)
        since = data[-1]["id"]
        print(f"fetched {len(results)} repos, last id={since}")
    return results
each page gives you up to 100 repos. at 5,000 requests/hour per token and 100 repos per request, one token can pull 500,000 repo records per hour on the repository list endpoint. this is why the list endpoint is the right starting point for bulk collection.
if it breaks: if data comes back as an empty list before you expect it, you may have hit the end of the repo index or a transient error. add a short sleep and retry before giving up.
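the retry-before-giving-up advice can be folded into the pagination loop itself. here is a sketch that takes the page fetcher as a parameter so it can be exercised without touching the network; iter_repos, fetch_page, and the retry defaults are illustrative choices of mine, not tuned values.

```python
import time
from typing import Callable, Iterator


def iter_repos(
    fetch_page: Callable[[int], list],
    start_id: int = 0,
    empty_retries: int = 2,
    retry_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield repos page by page using since-based pagination.

    An empty page is retried up to `empty_retries` times before it is
    treated as the real end of the index.
    """
    since = start_id
    empty_seen = 0
    while True:
        page = fetch_page(since)
        if not page:
            if empty_seen >= empty_retries:
                return
            empty_seen += 1
            sleep(retry_delay)
            continue
        empty_seen = 0
        for repo in page:
            yield repo
        # the next page starts after the highest ID seen so far
        since = page[-1]["id"]
```

with the fetch function from step 4, fetch_page could be something like lambda since: fetch(f"https://api.github.com/repositories?since={since}"). yielding repos one at a time also means you can write them to disk as you go instead of holding the full result list in memory.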
step 6: add async concurrency for real throughput
once the single-threaded version is working, move to asyncio and httpx.AsyncClient. with four tokens and a rotating proxy pool, you can safely run four concurrent workers without tripping rate limits.
import asyncio

import httpx


async def async_fetch(client: httpx.AsyncClient, url: str, token: str,
                      max_retries: int = 5) -> dict | list:
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    # bounded retry loop instead of unbounded recursion
    for _ in range(max_retries):
        resp = await client.get(url, headers=headers)
        if resp.status_code in (403, 429):
            await asyncio.sleep(int(resp.headers.get("Retry-After", 60)))
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")


async def main():
    # TOKENS is the list loaded in step 4
    # placeholder credentials and host; substitute your provider's gateway
    proxy = "http://user:pass@proxy.example.com:8080"
    async with httpx.AsyncClient(proxy=proxy, timeout=20) as client:
        # one request per token, each starting at a different repo ID
        tasks = [
            async_fetch(client, f"https://api.github.com/repositories?since={i*1000}", token)
            for i, token in enumerate(TOKENS[:4])
        ]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    return results


if __name__ == "__main__":
    asyncio.run(main())
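if it breaks: if you get ssl: certificate_verify_failed errors through the proxy, the proxy is probably intercepting TLS with its own certificate. the quick workaround is adding verify=False to the client constructor, but that turns off certificate verification for every request the client makes, which means anything on the path can read your tokens. if your provider publishes a CA certificate, pass verify an SSL context that trusts it instead. only use verify=False if you trust your proxy provider.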
step 7: store output and handle restarts
write output to JSONL (one JSON object per line) and record the ID of the last repo written in a state file. this lets you resume without re-scraping.
import json
from pathlib import Path


def save_repos(repos: list, output_file: str = "repos.jsonl", state_file: str = "state.txt"):
    # append each repo as one JSON line
    with open(output_file, "a") as f:
        for repo in repos:
            f.write(json.dumps(repo) + "\n")
    # overwrite the state file with the last ID written
    if repos:
        with open(state_file, "w") as f:
            f.write(str(repos[-1]["id"]))


def load_state(state_file: str = "state.txt") -> int:
    try:
        return int(Path(state_file).read_text().strip())
    except FileNotFoundError:
        return 0  # no state yet, start from the beginning
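if it breaks: if your JSONL file has partial writes from a crash, run python -c "import json; [json.loads(l) for l in open('repos.jsonl')]" to confirm it. the command raises json.JSONDecodeError at the first corrupt line, though the traceback does not tell you which line of the file that is. because the file is append-only, the damage is almost always the final line, so truncate the file there.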
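if you would rather have the line number and the exact truncation point, a short script can report both. this is a sketch under one assumption of mine: a final line with no trailing newline counts as a partial write even when it happens to parse. first_corrupt_line and truncate_at are hypothetical helper names.

```python
import json


def first_corrupt_line(path: str):
    """Return (line_number, byte_offset) of the first bad line, or None.

    A line is bad if it is not valid JSON or is missing its trailing
    newline, which is what a write interrupted by a crash looks like.
    """
    offset = 0
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                if not raw.endswith(b"\n"):
                    raise ValueError("missing trailing newline")
                json.loads(raw)
            except ValueError:  # JSONDecodeError is a subclass of ValueError
                return lineno, offset
            offset += len(raw)
    return None


def truncate_at(path: str, offset: int) -> None:
    """Cut the file off at the given byte offset."""
    with open(path, "r+b") as f:
        f.truncate(offset)
```

take a copy of the file before truncating. after truncating, set state.txt to the ID on the last surviving line so the scraper resumes from data you actually have.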
common pitfalls
relying on the search API for bulk work. the search API is rate-limited to 30 requests/minute. it is good for targeted queries, not enumeration. use /repositories?since= for bulk collection instead.
not reading the X-RateLimit-Remaining header. GitHub tells you exactly how many requests you have left in the current window. ignoring this header and sleeping on a fixed timer wastes time when you have quota left and causes retries when you are already at zero.
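a header-driven wait could look like the sketch below. it prefers Retry-After when present, otherwise sleeps until X-RateLimit-Reset only when X-RateLimit-Remaining is zero. seconds_to_wait is a name i made up, and the one second of padding is an arbitrary guard against clock skew. the keys are looked up in lowercase, which works with httpx's case-insensitive headers; a plain dict needs lowercase keys.

```python
import time
from typing import Mapping, Optional


def seconds_to_wait(headers: Mapping[str, str], now: Optional[float] = None,
                    floor: float = 0.0) -> float:
    """Decide how long to sleep based on GitHub's rate limit headers."""
    now = time.time() if now is None else now

    # an explicit Retry-After always wins
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return max(float(retry_after), floor)

    remaining = headers.get("x-ratelimit-remaining")
    reset = headers.get("x-ratelimit-reset")  # Unix timestamp in seconds
    if remaining is not None and reset is not None and int(remaining) == 0:
        # quota exhausted: wait for the window to reset, plus 1s of padding
        return max(float(reset) - now + 1.0, floor)

    # quota left, no need to wait beyond the floor
    return floor
```

call it on every response, not just on errors, so a worker backs off before it spends a request on a guaranteed failure.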
using datacenter proxies on the authenticated API long-term. GitHub appears to weigh the reputation of the IP making authenticated requests. datacenter IPs from AWS or DigitalOcean ranges get flagged faster than residential IPs. if you are doing high-volume authenticated requests, residential proxies last longer. for fingerprinting and browser-based access, the antidetect angle is worth reading about at antidetectreview.org/blog/.
generating too many PATs from the same IP. GitHub tracks the IP you use to generate tokens. if you create 20 tokens from your home IP, they may all get suspended together. use different exit IPs when generating tokens across accounts.
not handling secondary rate limits. GitHub has a secondary rate limit that kicks in when you make too many requests in a short burst, even if your per-hour quota is intact. keep concurrent requests per token under 10 and add a 50-100ms delay between requests per worker.
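one way to enforce both limits per token is a semaphore for concurrency plus a short pause before each slot is released. a minimal sketch; TokenThrottle is a hypothetical helper, and the defaults of 4 concurrent requests and 100ms are placeholders inside the range suggested above, not measured values.

```python
import asyncio
from typing import Any, Awaitable, Callable


class TokenThrottle:
    """Cap in-flight requests for one token and space them out."""

    def __init__(self, max_concurrent: int = 4, min_delay: float = 0.1):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._min_delay = min_delay

    async def run(self, coro_fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        async with self._sem:
            try:
                return await coro_fn(*args)
            finally:
                # hold the slot briefly so requests on it are spaced out
                await asyncio.sleep(self._min_delay)
```

create one throttle per token and route that token's calls through it, for example await throttle.run(async_fetch, client, url, token) with the async_fetch from step 6.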
scaling this
10x (hundreds of thousands of records/month): the setup above works fine. one VPS, four to eight tokens, a $15/month rotating datacenter proxy plan. the bottleneck is token count. acquire more accounts or look into GitHub Apps for higher quotas.
100x (tens of millions of records/month): you need a job queue. replace the simple loop with Celery or a lightweight queue like RQ, where each task fetches one page of repos and writes output to S3 or a Postgres table. run multiple VPS workers pulling from the same queue. your proxy spend climbs to $100-300/month. switch to residential proxies if your 403 rate climbs above 5%.
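because since-based pagination is sequential, queue tasks are easier to define as repo ID ranges than as individual pages: each worker walks its own range and stops at the boundary. the sketch below is queue-agnostic and uses no Celery or RQ calls; make_range_tasks, walk_range, and the task dict shape are my own assumptions for illustration.

```python
from typing import Callable, Iterator


def make_range_tasks(start_id: int, end_id: int, chunk: int) -> list[dict]:
    """Split (start_id, end_id] into tasks covering `chunk` IDs each."""
    tasks = []
    since = start_id
    while since < end_id:
        until = min(since + chunk, end_id)
        # a task covers IDs greater than `since`, up to and including `until`
        tasks.append({"since": since, "until": until})
        since = until
    return tasks


def walk_range(task: dict, fetch_page: Callable[[int], list]) -> Iterator[dict]:
    """Yield the repos inside one task's ID range, then stop."""
    since = task["since"]
    while since < task["until"]:
        page = fetch_page(since)
        if not page:
            return
        for repo in page:
            if repo["id"] > task["until"]:
                return  # crossed into the next task's range
            yield repo
        since = page[-1]["id"]
```

each task dict is small and serializable, so it can go on whatever queue you pick. a failed task can be retried on its own without disturbing its neighbors, and ranges never overlap, so workers do not write duplicate records.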
1000x (hundreds of millions of records, continuous crawl): at this scale you are re-crawling the same repos for freshness, not just doing a one-time enumeration. you need a crawler state table tracking last-seen timestamps, a priority queue that re-fetches high-activity repos more often, and a dedicated infrastructure team or managed scraping infrastructure. proxy costs at this level start at $500+/month. at this scale GitHub may also reach out directly, so get a legal review of GitHub’s Terms of Service before you get there. this is not legal advice.
a useful pattern at the 1000x level: combine GitHub’s public events API (which has its own rate limit) with the REST API to detect repo changes in near real-time and only re-fetch repos that have actually changed. each events feed returns at most 300 recent events, which is enough to build a lightweight change detector as long as you poll often enough not to miss any.
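the priority queue that re-fetches active repos more often can start as a heap keyed on the next due time. this sketch shrinks the re-crawl interval as the recent event count grows; RecrawlScheduler, the interval formula, and the default intervals are all illustrative assumptions, not a tuned policy.

```python
import heapq


class RecrawlScheduler:
    """Schedule re-crawls so busier repos come due sooner."""

    def __init__(self, base_interval: float = 86400.0, min_interval: float = 3600.0):
        self.base_interval = base_interval
        self.min_interval = min_interval
        self._heap: list[tuple[float, int]] = []  # (due_time, repo_id)

    def schedule(self, repo_id: int, now: float, recent_events: int) -> None:
        """Queue a repo; more recent events means a shorter wait."""
        interval = max(self.min_interval, self.base_interval / (1 + recent_events))
        heapq.heappush(self._heap, (now + interval, repo_id))

    def due(self, now: float) -> list[int]:
        """Pop and return every repo whose due time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```

after re-fetching a repo, call schedule again with its fresh event count. an in-memory heap does not survive restarts, so at this scale the same idea would live in the crawler state table as a next_due column with an index on it.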
where to go next
if GitHub is your first large-scale scrape, the patterns here apply to most API-based targets. the next logical reads on this site:
- how to scrape LinkedIn at scale in 2026 with rotating proxies covers a harder target where anti-bot measures are more aggressive and browser fingerprinting matters more than IP rotation
- best residential proxy providers for web scraping in 2026 is our current comparison of proxy providers by price, pool size, and GitHub success rate
- the full scraping tutorial index has everything else organized by target and difficulty
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.