How to scrape Capterra at scale in 2026 with proxies that work

Capterra is one of the most valuable B2B datasets on the internet. Over 2 million verified reviews across 900+ software categories, with structured product metadata, pricing signals, and competitor positioning all sitting in public HTML. if you’re building a competitive intelligence tool, training a recommendation model, or running a lead-gen operation targeting software buyers, Capterra is a primary source, not a secondary one.

the problem is Capterra is owned by Gartner, and the engineering team there knows exactly what scrapers look like. they run Cloudflare, fingerprint TLS handshakes, throttle by IP subnet, and serve honeypot links to bots that don’t behave like real users. a naive requests script hitting Capterra in 2026 will get blocked within minutes, sometimes within seconds.

this tutorial is for operators who already understand basic web scraping and want a production-grade setup that actually runs at volume. by the end you’ll have a working pipeline with rotating residential proxies, browser-level rendering where needed, and a schema that captures listings and reviews cleanly. i’ll flag what breaks and why throughout.

what you need

  • Python 3.11+ with httpx, playwright, parsel, and tenacity
  • Rotating residential proxy pool. datacenter IPs get flagged immediately on Capterra. you need residential. ProxyScraping’s residential plan starts at $3/GB and includes country-level targeting, which matters for Capterra since geo-gating affects which reviews you see
  • A proxy endpoint with session stickiness support. you need the same IP for at least 3-5 requests per product page, otherwise the session breaks mid-crawl
  • PostgreSQL or BigQuery for storage. SQLite works for prototyping, but it allows only one writer at a time, so it becomes a bottleneck once several workers write concurrently (in practice somewhere past 50k rows)
  • A machine with at least 4 vCPUs. Playwright headless browsers are memory-hungry. a $24/month Hetzner CPX31 (4 vCPUs, 8GB RAM) handles about 8 concurrent browser contexts
  • Playwright installed with Chromium: playwright install chromium
  • budget estimate: $30-80/month in proxy bandwidth for a 10k product crawl depending on how many review pages you pull

step by step

step 1: read robots.txt and understand what Capterra permits

before writing a single line of scraper code, pull https://www.capterra.com/robots.txt and read it. as of 2026 it disallows /profile/ paths for some user-agents, allows category listing pages, and sets a crawl-delay directive that most scrapers ignore. ignoring crawl-delay is one of the fastest ways to trigger rate limits.

this is not legal advice. scraping publicly available data is a contested area in US law following the hiQ v. LinkedIn line of cases, and rules differ by jurisdiction. consult a lawyer if your use case is commercial at scale.

curl -s https://www.capterra.com/robots.txt | head -60

note which paths are disallowed and build them into your URL filter from day one.
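
a minimal sketch of that URL filter, using the standard library's urllib.robotparser. the SAMPLE_ROBOTS text is a made-up stand-in, not Capterra's real file, and load_rules and is_allowed are hypothetical helper names. in a real run you would feed in the text you pulled with curl above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical sample for illustration only. This is NOT Capterra's real
# robots.txt; always fetch and read the live file.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 3
Disallow: /profile/
"""


def load_rules(robots_text: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "*") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


rules = load_rules(SAMPLE_ROBOTS)
```

call is_allowed(rules, url) before a URL enters the queue, and read rules.crawl_delay("*") to seed the sleep value in step 7 instead of hardcoding it.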

step 2: map the URL structure

Capterra’s category pages follow a clean pattern:

https://www.capterra.com/project-management-software/
https://www.capterra.com/crm-software/

product profile pages look like:

https://www.capterra.com/p/12345/productname/

reviews live at:

https://www.capterra.com/reviews/12345/productname/

start by scraping the category index pages to build your product ID list. don’t guess IDs, enumerate them from the category pages. there are roughly 900 category landing pages you can hit systematically.

if it breaks: category pages sometimes serve a JavaScript-rendered shell on first hit. if you see an empty response body or a Cloudflare challenge page, you need to move to browser rendering in step 4 rather than plain HTTP.

step 3: set up proxy rotation with session pinning

ProxyScraping provides a gateway endpoint you configure in your HTTP client. for Capterra you want session stickiness so the same IP handles the category page and the subsequent product clicks within a session.

import httpx
import random
import string

def make_session_id(length=8):
    return ''.join(random.choices(string.ascii_lowercase + string.digits, k=length))

PROXY_USER = "your_username"
PROXY_PASS = "your_password"
PROXY_HOST = "gate.proxyscraping.com"
PROXY_PORT = 31112

def build_proxy_url(session_id: str, country: str = "us") -> str:
    user = f"{PROXY_USER}-session-{session_id}-country-{country}"
    return f"http://{user}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"

def get_client(session_id: str) -> httpx.Client:
    proxy_url = build_proxy_url(session_id)
    return httpx.Client(
        proxy=proxy_url,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        },
        timeout=30.0,
        follow_redirects=True,
    )

pin one session ID per product you’re crawling. rotate to a new session ID between products.

if it breaks: if you get 403s consistently, check that you’re sending a realistic browser header set. an absent or malformed Accept-Encoding header is a strong bot signal. httpx sends one by default, so make sure nothing in your code strips or overrides it.

step 4: add Playwright for JavaScript-heavy pages

some Capterra pages, particularly review listing pages past page 2, require JavaScript execution to load review content. use Playwright for these rather than httpx.

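import asyncio
from urllib.parse import urlsplit

from playwright.async_api import async_playwright

def playwright_proxy(proxy_url: str) -> dict:
    # Playwright takes proxy credentials as separate fields; Chromium ignores
    # a user:pass pair embedded in the server URL
    parts = urlsplit(proxy_url)
    return {
        "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
        "username": parts.username or "",
        "password": parts.password or "",
    }

async def fetch_reviews_page(url: str, proxy_url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=playwright_proxy(proxy_url),
        )
        try:
            context = await browser.new_context(
                viewport={"width": 1440, "height": 900},
                locale="en-US",
                timezone_id="America/New_York",
            )
            page = await context.new_page()
            await page.goto(url, wait_until="networkidle", timeout=45000)
            return await page.content()
        finally:
            # close the browser even if navigation times out
            await browser.close()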

run this only where necessary. Playwright adds 2-4 seconds per page and uses 200-400MB RAM per context. for category listing pages and product metadata, plain httpx is faster and cheaper.

if it breaks: if Playwright hits a Cloudflare interstitial and hangs, add a wait after navigation: await page.wait_for_timeout(3000) before reading content. if that still fails, check whether your residential IP is flagged by running it through a clean browser manually.

step 5: parse product data with parsel

from parsel import Selector

def parse_product_page(html: str) -> dict:
    sel = Selector(text=html)

    return {
        "name": sel.css('h1[data-testid="vendor-title"]::text').get("").strip(),
        "rating": sel.css('[data-testid="overall-rating"]::attr(aria-label)').get(""),
        "review_count": sel.css('[data-testid="review-count"]::text').get("").strip(),
        "price_note": sel.css('[data-testid="pricing-summary"]::text').get("").strip(),
        "categories": sel.css('[data-testid="category-tag"]::text').getall(),
        "description": sel.css('div[data-testid="product-description"] p::text').get("").strip(),
    }

Capterra’s HTML uses data-testid attributes fairly consistently, which is good for scraper stability. these attributes change less often than class names. that said, run a diff of the HTML structure monthly; Gartner has historically pushed frontend updates in Q1.

if it breaks: if data-testid attributes are missing, you may be getting a Cloudflare bot-check page. log the raw HTML to a file and inspect it. a 200 status with a challenge page is common when the proxy IP is flagged.
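
a rough way to catch the 200-with-challenge case before it reaches the parser is a string check on the raw HTML. the markers below are heuristics based on what Cloudflare interstitials commonly contain; they are assumptions, can change without notice, and should be confirmed against the HTML you actually logged.

```python
# Heuristic markers, lowercased. These are assumptions; verify them against
# a challenge page you captured yourself.
CHALLENGE_MARKERS = (
    "<title>just a moment",
    "_cf_chl_opt",
)


def looks_like_challenge(html: str) -> bool:
    """Return True if the HTML contains any known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


def is_usable_product_html(html: str) -> bool:
    """Accept a page only if it is non-empty, not a challenge, and has data-testid hooks."""
    return bool(html.strip()) and not looks_like_challenge(html) and "data-testid" in html
```

when is_usable_product_html returns False, raising an exception inside the fetch function lets the retry logic in step 7 pick a new session instead of storing empty fields.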

step 6: store and deduplicate

import psycopg2
from datetime import datetime, timezone

def upsert_product(conn, product: dict, url: str):
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    scraped_at = datetime.now(timezone.utc)
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO capterra_products (url, name, rating, review_count, price_note, scraped_at)
            VALUES (%s, %s, %s, %s, %s, %s)
            ON CONFLICT (url) DO UPDATE SET
                name = EXCLUDED.name,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                price_note = EXCLUDED.price_note,
                scraped_at = EXCLUDED.scraped_at
        """, (url, product["name"], product["rating"], product["review_count"], product["price_note"], scraped_at))
    conn.commit()

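use ON CONFLICT DO UPDATE so re-runs update stale records rather than failing or duplicating. ON CONFLICT (url) only works if url has a unique constraint or unique index, so create that first, and add a regular index on scraped_at for freshness queries.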

if it breaks: if you see duplicate rows for the same product despite the upsert, check that your URL normalization is consistent. ON CONFLICT (url) only matches identical strings, and Capterra sometimes redirects /p/12345/name/ to /p/12345/name, so the trailing slash matters for dedup keys.
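
one way to keep the dedup key stable is to normalize every product URL before it touches the database. this is a sketch under the assumption that product pages carry no meaningful query string; normalize_url is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Force https, lowercase the host, drop query and fragment, and end with one slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") + "/"
    return urlunsplit(("https", parts.netloc.lower(), path, "", ""))
```

pass normalize_url(url) as the url argument to upsert_product. don't use it for paginated review URLs, where the query string carries the page number.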

step 7: build the crawl queue and rate-limit properly

import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=20))
async def scrape_product(url: str, session_id: str):
    # fetch, parse, store
    pass

async def run_queue(urls: list[str], concurrency: int = 8):
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_scrape(url):
        async with semaphore:
            session_id = make_session_id()
            await scrape_product(url, session_id)
            await asyncio.sleep(2.5)  # respect crawl-delay

    await asyncio.gather(*[bounded_scrape(url) for url in urls])

the 2.5 second delay per slot at 8 concurrency caps you at about 3 requests per second aggregate (8 / 2.5 = 3.2), and real throughput is lower once fetch time is added to each slot. treat that as a ceiling, and raise the sleep if the crawl-delay in robots.txt asks for more. pushing past 5 req/s on Capterra without residential proxies and browser fingerprinting is how you get subnet bans.

if it breaks: if you see a spike in 429 responses, drop concurrency to 4 and increase sleep to 5 seconds. 429s on Capterra are IP-level, not account-level, so rotating to a new proxy session usually clears them.
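
the back-off rule above (halve concurrency, double the sleep) can be captured in a small state object, sketched here. the Throttle class, the spike helper, and the 10% threshold are all hypothetical choices, not something Capterra documents.

```python
from dataclasses import dataclass


@dataclass
class Throttle:
    concurrency: int = 8
    delay: float = 2.5
    min_concurrency: int = 1
    max_delay: float = 30.0

    def on_429_spike(self) -> None:
        """Halve concurrency and double the per-slot delay, within bounds."""
        self.concurrency = max(self.min_concurrency, self.concurrency // 2)
        self.delay = min(self.max_delay, self.delay * 2)

    def max_rate(self) -> float:
        """Upper bound on requests per second, ignoring fetch time."""
        return self.concurrency / self.delay


def spike(statuses: list[int], threshold: float = 0.1) -> bool:
    """True if more than `threshold` of the recent status codes are 429."""
    if not statuses:
        return False
    return statuses.count(429) / len(statuses) > threshold
```

an asyncio.Semaphore can't be resized in place, so a practical pattern is to crawl in batches and create a fresh semaphore from throttle.concurrency at the start of each batch.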

common pitfalls

using datacenter proxies. Capterra’s Cloudflare configuration explicitly targets ASN ranges associated with proxy providers and data centers. residential IPs from real ISPs pass cleanly. datacenter IPs from AWS, GCP, or Hetzner ranges get challenged almost immediately. this is the single biggest reason scrapers fail on Capterra in 2026.

ignoring TLS fingerprinting. if you’re using a standard Python [requests](https://requests.readthedocs.io/) or httpx build, your TLS ClientHello looks different from a Chrome browser. Cloudflare reads this. using httpx with a patched TLS stack or running actual Chromium via Playwright for flagged pages solves this. for most Capterra pages, Playwright is the more reliable path once you hit detection.

scraping reviews without session continuity. reviews past page 1 often require a valid session cookie that was set on the product page. if you jump straight to /reviews/12345/productname/?page=3, you’ll get a partial or empty response. always start a session from the product root.
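
a sketch of what "start from the product root" looks like as a fetch order. it assumes the ?page=N query parameter from the example above and the URL shapes from step 2. review_crawl_plan is a hypothetical helper that only builds the list; fetching each URL with one pinned session and one cookie jar is left to your client.

```python
def review_crawl_plan(product_id: str, slug: str, pages: int) -> list[str]:
    """Return URLs in fetch order: product page, reviews page 1, then pages 2..pages."""
    base = "https://www.capterra.com"
    reviews = f"{base}/reviews/{product_id}/{slug}/"
    plan = [f"{base}/p/{product_id}/{slug}/", reviews]
    plan.extend(f"{reviews}?page={n}" for n in range(2, pages + 1))
    return plan
```

fetch the whole plan with a single session ID so the cookies set on the product page are still present when the later review pages load.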

not caching category page results. category index pages list 25-50 products each and change slowly. scraping them repeatedly to rebuild your product URL list wastes bandwidth and proxy quota. cache the product URL list to disk and refresh it weekly, not per-run.

storing raw HTML as your primary record. raw HTML is 100-300KB per page and Capterra’s structure changes. parse to structured fields immediately and keep the HTML only for debugging, or not at all.

scaling this

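at 10x (100k products): the single-machine Playwright setup still works. add a Redis queue to coordinate work across restarts. your main cost is proxy bandwidth, roughly 0.5-1GB per 1000 products with reviews, so 50-100GB for a full pass. at the $3/GB entry price that is $150-300/month. the $50-150/month range is only reachable with volume pricing around $1-1.5/GB, so negotiate a rate before you commit to this scale.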
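
the bandwidth budget is simple enough to check in code before you buy a plan. this sketch just multiplies the per-1000-product bandwidth by a per-GB price. the $3/GB default is the entry price quoted earlier in this article, and proxy_cost_usd is a hypothetical helper.

```python
def proxy_cost_usd(products: int, gb_per_1000: float, usd_per_gb: float = 3.0) -> float:
    """Estimate the proxy bandwidth cost of one full crawl pass, in USD."""
    return products / 1000 * gb_per_1000 * usd_per_gb


# Low and high estimates for a 100k product pass at the default rate.
low = proxy_cost_usd(100_000, 0.5)
high = proxy_cost_usd(100_000, 1.0)
```

swap in your own negotiated rate and a gb_per_1000 figure measured from a small test crawl; both inputs move the total far more than any code optimization will.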

at 100x (1 million products): you’re now running multiple machines. distribute the queue with Celery or a managed queue like SQS. run Playwright workers on separate machines from your queue coordinator. proxy bandwidth becomes your dominant cost. consider caching aggressively at the CDN layer and only re-scraping products that have review count changes. if you’re managing browser fingerprints across a fleet, the antidetect browser comparisons at antidetectreview.org/blog/ are worth reading for evaluating whether tools like Multilogin or AdsPower make sense at this volume.
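
re-scraping only what changed comes down to comparing the review count you stored last time with the one you see now. a sketch follows. it assumes you can read current counts cheaply (for example from listing pages, which is an assumption about what those pages expose), and products_to_rescrape is a hypothetical helper.

```python
def products_to_rescrape(previous: dict[str, int], current: dict[str, int]) -> list[str]:
    """Return URLs whose review count changed, plus URLs not seen before."""
    return sorted(
        url for url, count in current.items() if previous.get(url) != count
    )
```

feed the result into the crawl queue in place of the full product list, and run a full refresh on a slower schedule to catch edits that don't move the count.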

at 1000x (multi-million scale): at this point you’re running a data product. consider whether scraping is still the right architecture or whether a Capterra data licensing agreement makes more economic sense. the engineering cost of maintaining a fleet at this scale, including proxy rotation, fingerprint management, and schema drift, is non-trivial. if scraping is still the answer, dedicated residential proxy pools with guaranteed bandwidth agreements from providers are cheaper per GB than pay-as-you-go.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
