How to scrape Crunchbase at scale in 2026 with proxies that work
Crunchbase sits behind one of the more aggressive anti-scraping stacks you will encounter on a business data site. Cloudflare WAF, JavaScript fingerprinting, rate limits per IP, and login walls that expire sessions mid-crawl. If you have tried naive scraping with requests or a simple proxy, you have probably hit a 403 within the first few hundred pages and spent the next hour debugging headers.
This guide is for operators who need Crunchbase data at volume: VC analysts pulling fresh funding rounds, sales teams enriching leads, or founders benchmarking competitors. i will walk through the exact toolchain i use, the proxy tier that actually passes, and a Python pipeline you can run today. you will not get a toy example that dies at page 10. the outcome is a repeatable scraper that handles thousands of company profiles per day without burning your proxy pool.
Crunchbase does offer an official API with paid tiers starting at $29/month for basic access. if your use case fits within the API’s data model, start there. this tutorial is for when the API does not cover what you need, when you want real-time data outside the API’s refresh cycle, or when you need fields the API does not expose.
what you need
- Python 3.11+ with playwright, httpx, parsel, and python-dotenv
- Rotating residential proxies (datacenter proxies do not work reliably on Crunchbase as of 2026). budget $50-150/month at 10-30 GB depending on volume. i use ProxyScrape residential rotating proxies for this site’s tests, which run about $4/GB
- A Crunchbase Pro account ($29/month) if you need data behind the login wall (employee count, funding details, investor lists). free tier exposes very little
- A Redis instance for deduplication and crawl queue management. a $7/month DigitalOcean managed Redis works fine
- A Postgres database for storing output. Supabase free tier is enough to start
- A Linux VPS with at least 2 vCPU and 4GB RAM for running Playwright and Chromium. a $12/month Hetzner CX22 works
- Optional: 2Captcha or CapSolver (~$1 per 1000 solves) for CAPTCHA fallback when Cloudflare challenges slip through
step by step
step 1: understand what you are actually requesting
Before writing a line of code, check Crunchbase’s robots.txt. as of early 2026 it disallows /people/, /organization/, and most profile paths for generic bots. this does not make scraping illegal, but it tells you what Crunchbase considers sensitive and where their defenses will be strongest. plan your crawl paths accordingly and build in respectful delays.
if it breaks: if robots.txt has changed since this article was published, re-read it before crawling. mismatched disallow rules will cause unnecessary blocks.
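A minimal sketch of that robots.txt check, using the standard library's urllib.robotparser. the sample rules are illustrative stand-ins built from the paths named above, not a copy of Crunchbase's live file, and allowed_paths is a hypothetical helper name. in a real run you would download the current robots.txt and pass its text in.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only (assumption); fetch the live robots.txt before a real crawl.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /people/
Disallow: /organization/
"""

def allowed_paths(robots_txt: str, paths: list, agent: str = "*") -> dict:
    """Return {path: bool} saying whether `agent` may fetch each path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}

report = allowed_paths(SAMPLE_ROBOTS, ["/organization/example-co", "/robots.txt"])
for path, ok in report.items():
    print(f"{path}: {'allowed' if ok else 'disallowed'}")
```

logging this report at the start of every crawl run gives you a record of what the rules were when you collected the data.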
step 2: install dependencies and set up the environment
python -m venv venv && source venv/bin/activate
pip install playwright httpx parsel python-dotenv redis psycopg2-binary
playwright install chromium
Create a .env file:
PROXY_HOST=rp.proxyscrape.com
PROXY_PORT=6060
PROXY_USER=your_username
PROXY_PASS=your_password
CB_SESSION_COOKIE=your_crunchbase_session_cookie
REDIS_URL=redis://localhost:6379
DATABASE_URL=postgresql://user:pass@host/db
if it breaks: playwright install can fail on headless VPS setups. run playwright install-deps chromium first, then retry.
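A missing variable in .env tends to surface much later as a confusing proxy or database error, so it can help to fail fast at startup. this is a small sketch, assuming the variable names from the .env file above; missing_settings is a hypothetical helper, not part of python-dotenv.

```python
import os
from typing import Mapping, Optional

# Names taken from the .env file shown above.
REQUIRED_VARS = [
    "PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS",
    "CB_SESSION_COOKIE", "REDIS_URL", "DATABASE_URL",
]

def missing_settings(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit(f"missing settings in .env: {', '.join(missing)}")
```

call it right after load_dotenv() so the check sees the values loaded from the file.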
step 3: extract your session cookie from a logged-in browser
Crunchbase assigns _cb_ses and cb_auth cookies on login. the cleanest way to get them without automating login (which triggers aggressive bot signals) is to log in manually in Chrome, open DevTools, go to Application > Cookies, and copy the values.
import os
from dotenv import load_dotenv

load_dotenv()
SESSION_COOKIE = os.getenv("CB_SESSION_COOKIE")

# the leading dot makes the cookie apply to subdomains such as www.crunchbase.com
cookies = [
    {"name": "_cb_ses", "value": SESSION_COOKIE, "domain": ".crunchbase.com", "path": "/"}
]
Session cookies typically last 7-14 days. build a reminder into your cron to refresh them. i keep a note in multiaccountops.com/blog/ on managing session rotation across multiple scraping accounts if you need to run parallel sessions without tripping duplicate-session detection.
if it breaks: if you see a redirect to /login mid-crawl, your session has expired. Playwright follows the 302 automatically, so check the final page URL in your error handler and alert immediately rather than letting the pipeline silently fail.
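One way to catch an expired session is to inspect the URL the browser ended up on. this is a sketch, assuming the login page lives at /login as described above; SessionExpired and ensure_not_login are hypothetical names.

```python
from urllib.parse import urlsplit

class SessionExpired(Exception):
    """Raised when a navigation ends on the login page."""

def ensure_not_login(final_url: str) -> None:
    """Raise SessionExpired if the final URL path is /login or below it."""
    path = urlsplit(final_url).path.rstrip("/")
    if path == "/login" or path.startswith("/login/"):
        raise SessionExpired(f"redirected to login: {final_url}")
```

with Playwright you would call ensure_not_login(page.url) right after page.goto(...), then stop the worker and send your alert when it raises.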
step 4: configure Playwright with your residential proxy
Playwright’s browser contexts support per-context proxy settings, which means you can spin up many contexts with different proxy endpoints from the same browser instance.
import asyncio
import os

from playwright.async_api import async_playwright

# proxy settings come from the .env file created in step 2
PROXY = {
    "server": f"http://{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}",
    "username": os.getenv("PROXY_USER"),
    "password": os.getenv("PROXY_PASS"),
}

async def get_page_html(url: str, cookies: list) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            context = await browser.new_context(
                proxy=PROXY,
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
                viewport={"width": 1280, "height": 800},
                locale="en-US",
            )
            await context.add_cookies(cookies)
            page = await context.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)
            return await page.content()
        finally:
            # close the browser even when navigation fails or times out
            await browser.close()
if it breaks: wait_until="networkidle" can time out on pages with continuous polling. switch to wait_until="domcontentloaded" and add an explicit page.wait_for_selector(".profile-layout-header") instead.
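The fallback wait strategy can be sketched as a small helper that takes an already-open Playwright page. load_profile is a hypothetical name, and the selector is the one mentioned above, which may change with a frontend deploy.

```python
# Selector named in the text above; treat it as an assumption that can go stale.
PROFILE_SELECTOR = ".profile-layout-header"

async def load_profile(page, url: str, timeout_ms: int = 30_000) -> str:
    """Navigate, wait for the profile header to render, and return the HTML."""
    # domcontentloaded fires early, so background polling cannot stall the wait
    await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
    # the explicit selector wait confirms the profile actually rendered
    await page.wait_for_selector(PROFILE_SELECTOR, timeout=timeout_ms)
    return await page.content()
```

inside get_page_html this would replace the page.goto and page.content calls. a timeout from wait_for_selector is also a useful signal that you landed on a challenge or login page instead of a profile.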
step 5: parse the data you need
Crunchbase renders most data server-side into a <script id="ng-state"> tag as a JSON blob. parsing this is more reliable than scraping DOM elements, which change with every frontend deploy.
from parsel import Selector
import json

def extract_org_data(html: str) -> dict:
    sel = Selector(text=html)
    # the embedded state lives in <script id="ng-state"> as a JSON string
    raw = sel.css("script#ng-state::text").get("")
    if not raw:
        return {}
    try:
        data = json.loads(raw)
        # navigate the nested structure to company fields
        org = data.get("HttpState", {})
        return org
    except json.JSONDecodeError:
        return {}
You will need to inspect the JSON structure for the specific fields you want (funding total, last round date, headcount, investors). the structure has changed twice in the past year, so write a schema version check into your parser and alert when fields go missing.
if it breaks: if ng-state returns empty, Crunchbase may have changed the script tag ID. inspect the raw HTML and search for window.__INITIAL_STATE__ or similar alternatives.
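A schema check can be as simple as walking a list of expected key paths and reporting the ones that are gone. this is a sketch only: the paths in REQUIRED_PATHS are hypothetical placeholders, since the real structure has to come from inspecting the JSON yourself, and dig and missing_fields are illustrative helper names.

```python
from typing import Any

# Hypothetical key paths for illustration; replace with the ones you find in ng-state.
REQUIRED_PATHS = {
    "funding_total": ("properties", "funding_total", "value_usd"),
    "founded_on": ("properties", "founded_on"),
    "headcount": ("properties", "num_employees_enum"),
}

def dig(data: Any, path: tuple) -> Any:
    """Follow `path` through nested dicts; return None if any key is missing."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def missing_fields(data: dict, required: dict = REQUIRED_PATHS) -> list:
    """Return the sorted names of required fields that are absent or null."""
    return sorted(name for name, path in required.items() if dig(data, path) is None)
```

if missing_fields returns the same names for every profile in a run, the layout has probably changed and the parser needs updating before you crawl further.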
step 6: build a queued crawl pipeline with Redis deduplication
import redis
import psycopg2

r = redis.from_url(os.getenv("REDIS_URL"))
conn = psycopg2.connect(os.getenv("DATABASE_URL"))

def enqueue_orgs(slugs: list[str]):
    # skip slugs that were already scraped successfully
    for slug in slugs:
        if not r.sismember("cb:seen", slug):
            r.lpush("cb:queue", slug)

def process_queue():
    while True:
        # brpop returns None once the queue has stayed empty for 5 seconds
        item = r.brpop("cb:queue", timeout=5)
        if not item:
            break
        _, slug = item
        slug = slug.decode()
        url = f"https://www.crunchbase.com/organization/{slug}"
        html = asyncio.run(get_page_html(url, cookies))
        data = extract_org_data(html)
        if data:
            save_to_db(conn, slug, data)
            # mark as seen only after a successful save so failures can be re-enqueued
            r.sadd("cb:seen", slug)
if it breaks: Redis brpop will block indefinitely if you set timeout to 0. always set a timeout and build a graceful shutdown handler.
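A graceful shutdown handler can be sketched with a flag that the signal handler flips and the worker loop checks between items. StopFlag and drain are hypothetical names, and pop stands in for a call like r.brpop("cb:queue", timeout=5).

```python
import signal

class StopFlag:
    """Set to True when SIGINT or SIGTERM arrives."""

    def __init__(self):
        self.stop = False

    def install(self) -> None:
        # register the same handler for Ctrl+C and for `kill <pid>`
        signal.signal(signal.SIGINT, self._handle)
        signal.signal(signal.SIGTERM, self._handle)

    def _handle(self, signum, frame) -> None:
        self.stop = True

def drain(pop, handle, flag: StopFlag, timeout: int = 5) -> int:
    """Process items until the queue is empty or a stop was requested."""
    processed = 0
    while not flag.stop:
        item = pop(timeout)  # expected to return None when the queue stays empty
        if item is None:
            break
        handle(item)
        processed += 1
    return processed
```

because the flag is only checked between items, the profile being scraped when the signal arrives still finishes and gets saved, and nothing is left half-written.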
step 7: store and validate output
def save_to_db(conn, slug: str, data: dict):
    cur = conn.cursor()
    # upsert keyed on slug so a re-crawl overwrites the previous snapshot
    cur.execute("""
        INSERT INTO crunchbase_orgs (slug, raw_json, scraped_at)
        VALUES (%s, %s, NOW())
        ON CONFLICT (slug) DO UPDATE SET raw_json = EXCLUDED.raw_json, scraped_at = NOW()
    """, (slug, json.dumps(data)))
    conn.commit()
    cur.close()
validate that key fields (funding total, founded date, location) are non-null before committing. a silent null in 40% of rows will ruin your analysis downstream. log every null-rate by field per crawl run.
if it breaks: if you see ON CONFLICT errors, your slug column may not have a unique constraint. add one: ALTER TABLE crunchbase_orgs ADD CONSTRAINT crunchbase_orgs_slug_key UNIQUE (slug);
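The per-field null-rate log mentioned above can be computed with a few lines. this is a sketch, assuming each scraped profile has been flattened into a dict; the field names in the usage lines are placeholders.

```python
def null_rates(rows: list, fields: list) -> dict:
    """Return the share of rows (0.0 to 1.0) where each field is missing or None."""
    if not rows:
        return {field: 0.0 for field in fields}
    total = len(rows)
    return {
        field: sum(1 for row in rows if row.get(field) is None) / total
        for field in fields
    }

# Placeholder rows and field names for illustration.
rows = [
    {"funding_total": 1_000_000, "founded_on": None},
    {"funding_total": None, "founded_on": None},
]
rates = null_rates(rows, ["funding_total", "founded_on", "location"])
```

comparing these rates between runs is what catches a silent parser break: a field that jumps from 5% null to 90% null almost never reflects a real change in the data.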
common pitfalls
using datacenter proxies. this is the number one reason pipelines die within an hour. Crunchbase’s bot detection cross-references IP reputation databases. residential rotating proxies from a quality provider are not optional here.
not rotating user agents and accept-language headers. a single user-agent string across thousands of requests is a strong bot signal even with rotating IPs. keep a pool of 10-20 realistic Chrome user agents and rotate them per request. HTTP/1.1 semantics define these as optional, but Crunchbase treats consistent headers as a fingerprint.
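A header pool can be sketched like this. the user agent and accept-language strings are placeholders to swap for current real-browser values, and pick_headers is a hypothetical helper.

```python
import random

# Placeholder pool (assumption); keep 10-20 current Chrome strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.9", "en-US,en;q=0.8,de;q=0.5"]

def pick_headers(rng=random) -> dict:
    """Pick one user agent and one accept-language value at random."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": rng.choice(ACCEPT_LANGUAGES),
    }
```

with Playwright, the user agent goes into browser.new_context(user_agent=...) and the accept-language value into extra_http_headers, so each new context gets a fresh pair.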
hammering pagination without delays. Crunchbase rate-limits aggressively on search and list endpoints. add a random sleep of 2-5 seconds per page. the extra time is worth it versus burning your session.
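The randomized delay is a one-liner, sketched here with the 2-5 second range from above as defaults. polite_delay is a hypothetical name.

```python
import random

def polite_delay(rng=random, low: float = 2.0, high: float = 5.0) -> float:
    """Return a random number of seconds to wait before the next page."""
    return rng.uniform(low, high)
```

in the synchronous queue loop you would call time.sleep(polite_delay()) after each page. inside async code, use await asyncio.sleep(polite_delay()) so other workers keep running during the pause.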
ignoring CAPTCHA failure modes. when Cloudflare throws a JS challenge and your solver fails, the scraper continues, saves empty HTML, and your database fills with garbage rows. always check the page title or a sentinel selector before saving.
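A sentinel check before saving might look like the sketch below. the challenge titles are assumptions based on common Cloudflare interstitial pages, the ng-state marker comes from step 5, and is_real_profile is a hypothetical helper.

```python
import re

# Assumed titles of Cloudflare challenge and block pages; verify against what you see.
CHALLENGE_TITLES = ("just a moment", "attention required")
# The embedded state tag described in step 5.
SENTINEL = 'id="ng-state"'

def is_real_profile(html: str) -> bool:
    """Return True only if the page is not a challenge and carries the state tag."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip().lower() if match else ""
    if any(marker in title for marker in CHALLENGE_TITLES):
        return False
    return SENTINEL in html
```

call it on the HTML before extract_org_data, and push the slug back onto the queue instead of saving when it returns False.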
not refreshing session cookies on a schedule. most operators set this up once and forget it. add a weekly cron that alerts you when the session cookie age exceeds 10 days.
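The age check for that cron can be sketched as below, assuming you record the timestamp when you copy the cookie out of DevTools. cookie_is_stale is a hypothetical helper and the 10-day limit comes from the text above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_COOKIE_AGE = timedelta(days=10)

def cookie_is_stale(copied_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True when the cookie was copied more than 10 days ago."""
    now = now or datetime.now(timezone.utc)
    return now - copied_at > MAX_COOKIE_AGE
```

store copied_at as a timezone-aware UTC datetime next to the cookie value so the subtraction does not mix naive and aware datetimes.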
scaling this
10x (thousands of profiles/day). run three or four async workers on a single VPS with the queued pipeline above. set concurrency to 4-6 Playwright contexts, each with a different proxy endpoint. at this scale your main bottleneck is proxy bandwidth, not compute.
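Capping concurrency at a handful of contexts can be sketched with an asyncio.Semaphore. crawl_bounded is a hypothetical helper, and fetch stands in for a coroutine like get_page_html with the cookies already bound.

```python
import asyncio

async def crawl_bounded(urls: list, fetch, limit: int = 4) -> list:
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(url: str):
        # each task waits here until one of the `limit` slots frees up
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

a semaphore keeps the worker count fixed no matter how many URLs are queued, which makes proxy bandwidth per hour easy to predict.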
100x (tens of thousands/day). move to a task queue like Celery or RQ with multiple worker machines. run workers on Hetzner CCX23 instances (~$25/month each). centralize your Redis queue and Postgres on managed services. monitor proxy IP rotation to confirm you are not reusing the same IP more than once per 5 minutes.
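Monitoring IP reuse can be sketched as a small tracker keyed on exit IP. this assumes you can learn which exit IP served each request (for example from a provider header or an IP echo endpoint), and IpReuseMonitor is a hypothetical class.

```python
class IpReuseMonitor:
    """Count how often an exit IP comes back within the reuse window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window = window_seconds
        self.last_seen = {}  # exit IP -> timestamp of its most recent request
        self.reuses = 0

    def record(self, ip: str, now: float) -> bool:
        """Record a request from `ip` at time `now`; return True if it was a reuse."""
        previous = self.last_seen.get(ip)
        self.last_seen[ip] = now
        reused = previous is not None and now - previous < self.window
        if reused:
            self.reuses += 1
        return reused
```

pass time.monotonic() as now in production. a steadily rising reuses count means the provider's pool is smaller than advertised for your targeting settings.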
1000x (hundreds of thousands/day). at this volume you need a dedicated proxy pool, not a shared residential plan. negotiate a dedicated IP pool with your proxy provider or stack multiple providers for redundancy. run workers across at least 3 geographic regions (US, EU, APAC) to match Crunchbase’s CDN edge behavior. implement a circuit breaker that pauses crawls per region when block rates exceed 15%. your pipeline architecture should look like a proper distributed system, not a scaled-up script, and you should review our guide to scaling proxy scraping infrastructure before committing to this tier.
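The per-region circuit breaker can be sketched as a rolling window of outcomes. the 15% threshold comes from the text above, while the window size, the minimum sample count, and the RegionBreaker name are assumptions for illustration.

```python
from collections import defaultdict, deque

class RegionBreaker:
    """Pause a region when its recent block rate exceeds the threshold."""

    def __init__(self, threshold: float = 0.15, window: int = 200, min_samples: int = 20):
        self.threshold = threshold
        self.min_samples = min_samples
        # one rolling window of True (blocked) / False (ok) outcomes per region
        self.results = defaultdict(lambda: deque(maxlen=window))

    def record(self, region: str, blocked: bool) -> None:
        self.results[region].append(blocked)

    def block_rate(self, region: str) -> float:
        samples = self.results[region]
        return sum(samples) / len(samples) if samples else 0.0

    def is_open(self, region: str) -> bool:
        """True means: stop sending requests through this region for now."""
        enough = len(self.results[region]) >= self.min_samples
        return enough and self.block_rate(region) > self.threshold
```

the minimum sample count stops a single early block from pausing a whole region. a production version would also need a cooldown after which the breaker lets a few test requests through.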
where to go next
- Best residential proxies for scraping in 2026 covers how to evaluate proxy providers by block rate, speed, and price per GB, with direct comparisons for high-volume use cases.
- How to scrape LinkedIn at scale with proxies walks through a similar pipeline for LinkedIn company pages, which share many of the same Cloudflare and fingerprinting defenses as Crunchbase.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.