How to scrape Trustpilot at scale in 2026 with proxies that work

Trustpilot is one of the cleanest structured review sources on the web. Reviews are public and consistently paginated, and the data (company name, star rating, review body, date, reviewer location, and reply thread) is all in the HTML, with no JavaScript rendering tricks needed on most pages. The problem is volume. If you’re pulling 50 companies, you can get away with a basic requests script and a cheap datacenter proxy. If you’re pulling 5,000 companies, or monitoring 500 daily, Trustpilot will block you within hours. Their infrastructure sits behind Cloudflare and they fingerprint headers, TLS handshakes, and request cadence. Most tutorials written before 2024 are outdated on that point.

This guide is for operators building competitive intelligence pipelines, reputation monitoring tools, or review aggregators. I’m not going to pretend this is a casual afternoon project. You’ll need working knowledge of Python, access to residential proxies, and some tolerance for debugging rate-limit responses. The outcome, if you follow this correctly, is a scraper that can pull tens of thousands of reviews per day reliably with a cost structure you can actually plan around.

Before anything else: read Trustpilot’s terms for businesses and check their robots.txt. The /review/ path is not disallowed for crawlers as of writing, but Trustpilot explicitly restricts automated data collection in their ToS. Whether that matters in your jurisdiction is a legal question, not a technical one. This is not legal advice. Know your local rules before you build.

what you need

  • Python 3.11+ with playwright, httpx, parsel, and tenacity installed
  • Residential proxy plan. Datacenter IPs get blocked within minutes on high-volume runs. I use Bright Data residential proxies (around $8.40/GB as of Q1 2026) or Smartproxy (~$7/GB). Either works. Budget $30-100/month for moderate scale.
  • A proxy rotation endpoint that supports HTTP CONNECT tunneling, not just HTTP forwarding. Bright Data’s gateway format is brd.superproxy.io:22225.
  • A Trustpilot API key (optional but useful for company lookup). Register at documentation.trustpilot.com; the free tier gives you read access to public business units.
  • A target list. A CSV of company domain names (e.g. amazon.com, booking.com). Trustpilot review URLs follow the pattern trustpilot.com/review/{domain}.
  • A storage layer. SQLite is fine for under 1M rows. Postgres or BigQuery if you’re going bigger.
  • Infrastructure cost: a $12/month VPS (Hetzner CX22 or equivalent) is enough to run the scraper process. The proxy bill dominates.
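
Since the proxy bill dominates, it helps to estimate it before you commit to a plan. Here is a rough sketch of the arithmetic, assuming per-GB billing and an average review page of about 200 KB over the wire. The 200 KB figure and the function name are placeholders, not measurements, so measure your own responses first.

```python
def estimate_monthly_proxy_cost(
    pages_per_day: int,
    kb_per_page: float,
    usd_per_gb: float,
    days: int = 30,
) -> float:
    """Rough monthly proxy bill in USD under per-GB billing."""
    # Assumes decimal gigabytes (1 GB = 1,000,000 KB) as the billing unit.
    gb_per_month = pages_per_day * kb_per_page * days / 1_000_000
    return round(gb_per_month * usd_per_gb, 2)


# 1,000 pages/day at an ASSUMED 200 KB/page and $8.40/GB
print(estimate_monthly_proxy_cost(1_000, 200, 8.40))
```

With those inputs the estimate is roughly $50 a month. Playwright runs will be heavier, because a browser also downloads scripts, styles, and images.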

step by step

step 1: install dependencies and configure your proxy

pip install playwright httpx parsel tenacity
playwright install chromium

Set your proxy credentials as environment variables. Never hardcode them.

export PROXY_HOST="brd.superproxy.io"
export PROXY_PORT="22225"
export PROXY_USER="your-username"
export PROXY_PASS="your-password"

Test the connection before writing any scraping logic:

curl -x "http://$PROXY_USER:$PROXY_PASS@$PROXY_HOST:$PROXY_PORT" \
  "https://httpbin.org/ip"

You should see a residential IP, not your server IP. If it returns your server IP, your proxy config is wrong. If it times out, check the port and whether your VPS firewall allows outbound on that port.
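
One config mistake the curl test will not explain is a password containing characters like @ or :, which corrupts the proxy URL. A minimal sketch of building the URL defensively from the same PROXY_* variables follows. The helper name build_proxy_url is my own, not part of any library.

```python
import os
from urllib.parse import quote

REQUIRED = ("PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS")


def build_proxy_url(env=None) -> str:
    """Assemble the proxy URL from PROXY_* settings, failing early if one is missing."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing proxy settings: {', '.join(missing)}")
    # Percent-encode credentials so '@' or ':' inside them cannot be
    # mistaken for URL delimiters.
    user = quote(env["PROXY_USER"], safe="")
    password = quote(env["PROXY_PASS"], safe="")
    return f"http://{user}:{password}@{env['PROXY_HOST']}:{env['PROXY_PORT']}"
```

Passing a plain dict instead of os.environ makes the function easy to check without touching your shell.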

step 2: check the page structure

Open a Trustpilot review page in your browser and inspect the HTML. As of early 2026, reviews live inside <article> tags with the attribute data-service-review-card-paper="true". Each card contains:

  • Star rating: div[data-service-review-rating] attribute value
  • Review title: h2 inside the card
  • Body: p[data-service-review-text-typography="true"]
  • Date: time element with an ISO datetime attribute
  • Reviewer name: span[data-consumer-name-typography="true"]

Page 1 is trustpilot.com/review/{domain}, page 2 onwards is trustpilot.com/review/{domain}?page=2. Total page count is in the pagination nav. Extract it from the last page link to know when to stop.
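
The URL pattern and the last-page lookup can be sketched with the standard library alone. This version assumes pagination links carry a page=N query parameter in their href and simply takes the largest N. That is an assumption about the markup, so verify it against the live page.

```python
import re

# Matches href="...?page=N" or href="...&page=N" (also the &amp; form).
PAGE_PARAM = re.compile(r'href="[^"]*(?:\?|&amp;|&)page=(\d+)')


def last_page_number(html: str) -> int:
    """Highest page=N found in any link, or 1 when there is no pagination."""
    pages = [int(n) for n in PAGE_PARAM.findall(html)]
    return max(pages, default=1)


def page_url(domain: str, page: int) -> str:
    """Page 1 has no query string; later pages use ?page=N."""
    base = f"https://www.trustpilot.com/review/{domain}"
    return base if page == 1 else f"{base}?page={page}"
```

Taking the maximum rather than the last link in document order avoids depending on how the nav is laid out.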

If it breaks: Trustpilot occasionally ships CSS class changes. If your selectors stop returning data, re-inspect the live page. The data-* attributes have been more stable than class names in my experience.

step 3: write the core fetcher with retry logic

Use httpx for the HTTP layer and tenacity for retries. Don’t use [requests](https://requests.readthedocs.io/) for anything proxy-heavy at scale; in my experience it handles connection pooling worse.

import os, httpx, time, random
from tenacity import retry, stop_after_attempt, wait_exponential

PROXY = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@{os.environ['PROXY_HOST']}:{os.environ['PROXY_PORT']}"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

@retry(stop=stop_after_attempt(4), wait=wait_exponential(multiplier=2, min=3, max=30))
def fetch_page(url: str) -> str:
    with httpx.Client(proxy=PROXY, headers=HEADERS, timeout=20) as client:
        resp = client.get(url)
        if resp.status_code == 429:
            raise Exception("rate limited, retrying")
        if resp.status_code != 200:
            raise Exception(f"unexpected status {resp.status_code}")
        return resp.text

Expected output: raw HTML string. If you’re consistently hitting 403, move to the Playwright path in step 5.

If it breaks: a 403 that persists after rotation usually means Cloudflare has flagged your TLS fingerprint, not just your IP. Switch to Playwright or a scraping browser API.

step 4: parse reviews with parsel

from parsel import Selector

def parse_reviews(html: str) -> list[dict]:
    sel = Selector(text=html)
    reviews = []
    for card in sel.css('article[data-service-review-card-paper]'):
        reviews.append({
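            # Per step 2, the rating attribute sits on a div inside the card, not on the <article>.
            "rating": card.css("div[data-service-review-rating]::attr(data-service-review-rating)").get(),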
            "title": card.css("h2 ::text").get("").strip(),
            "body": card.css('p[data-service-review-text-typography] ::text').getall(),
            "date": card.css("time ::attr(datetime)").get(""),
            "reviewer": card.css('span[data-consumer-name-typography] ::text').get("").strip(),
        })
    return reviews

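If it breaks: if reviews comes back empty, the page is probably rendering client-side, which Trustpilot does for some paths. Use the Playwright fetcher instead. If that also returns nothing, your selectors are likely stale; re-inspect the page as in step 2.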
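
The parser returns raw strings, and body comes back as a list of text fragments. If you would rather store typed values, a small normalization pass is one option. This is a sketch, not part of the pipeline above, and normalize_review is a name I made up for it.

```python
from datetime import datetime


def _parse_iso(value):
    """Parse an ISO 8601 timestamp; return None if it is missing or malformed."""
    if not value:
        return None
    try:
        # Swap a trailing "Z" for an explicit offset so older Pythons accept it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def normalize_review(raw: dict) -> dict:
    """Convert the raw strings from parse_reviews() into typed values."""
    rating = raw.get("rating")
    fragments = raw.get("body") or []
    return {
        "rating": int(rating) if rating and rating.isdigit() else None,
        "title": (raw.get("title") or "").strip(),
        # Join the text fragments into one string, dropping empty pieces.
        "body": " ".join(part.strip() for part in fragments if part.strip()),
        "date": _parse_iso(raw.get("date")),
        "reviewer": (raw.get("reviewer") or "").strip(),
    }
```

A rating of None after normalization is a useful early warning that the rating selector has drifted.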

step 5: handle javascript-rendered pages with Playwright

Some Trustpilot pages, particularly company profiles with custom widgets, require a real browser. For those:

from playwright.sync_api import sync_playwright

def fetch_with_browser(url: str, proxy_config: dict) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy=proxy_config
        )
        context = browser.new_context(
            user_agent=HEADERS["User-Agent"],
            locale="en-US"
        )
        page = context.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=30000)
        page.wait_for_selector('article[data-service-review-card-paper]', timeout=10000)
        html = page.content()
        browser.close()
        return html

proxy_config = {
    "server": f"http://{os.environ['PROXY_HOST']}:{os.environ['PROXY_PORT']}",
    "username": os.environ['PROXY_USER'],
    "password": os.environ['PROXY_PASS'],
}

Playwright uses a real Chromium TLS stack so it passes most Cloudflare JS challenges automatically. The tradeoff is speed and RAM: it’s 5-10x slower than raw HTTP, and each browser context uses ~150MB.

If it breaks: if Cloudflare is still blocking you with Playwright, add a random 2-5 second page.wait_for_timeout() before extracting content and rotate proxies between each context launch, not each page load.

step 6: paginate and store

import sqlite3, json

def scrape_company(domain: str, db: sqlite3.Connection):
    base_url = f"https://www.trustpilot.com/review/{domain}"
    page = 1
    while True:
        url = base_url if page == 1 else f"{base_url}?page={page}"
        try:
            html = fetch_page(url)
        except Exception as e:
            print(f"failed on {domain} page {page}: {e}")
            break
        reviews = parse_reviews(html)
        if not reviews:
            break
        for r in reviews:
            db.execute(
                "INSERT OR IGNORE INTO reviews (domain, rating, title, body, date, reviewer) VALUES (?,?,?,?,?,?)",
                (domain, r["rating"], r["title"], json.dumps(r["body"]), r["date"], r["reviewer"])
            )
        db.commit()
        page += 1
        time.sleep(random.uniform(1.5, 3.5))

The INSERT OR IGNORE above, combined with the unique index on (domain, date, reviewer) created in step 7, makes reruns idempotent.
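
To see the idempotency in isolation, here is a small self-contained demo against a throwaway in-memory database, using the same table shape as step 7. The sample row is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # throwaway database, nothing touches disk
db.execute("""
    CREATE TABLE reviews (
        domain TEXT, rating TEXT, title TEXT,
        body TEXT, date TEXT, reviewer TEXT,
        UNIQUE(domain, date, reviewer)
    )
""")

row = ("example.com", "5", "Great", "Fast delivery", "2026-01-15T10:30:00Z", "Ann")
for _ in range(2):  # simulate scraping the same review on two separate runs
    db.execute(
        "INSERT OR IGNORE INTO reviews (domain, rating, title, body, date, reviewer) "
        "VALUES (?,?,?,?,?,?)",
        row,
    )
db.commit()

count = db.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
print(count)  # 1: the second insert hit the unique constraint and was skipped
```

The tradeoff is that two reviews from the same reviewer name with an identical timestamp would collapse into one row.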

step 7: run against your target list

import csv

conn = sqlite3.connect("trustpilot.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS reviews (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        domain TEXT, rating TEXT, title TEXT,
        body TEXT, date TEXT, reviewer TEXT,
        UNIQUE(domain, date, reviewer)
    )
""")

with open("targets.csv") as f:
    for row in csv.DictReader(f):
        print(f"scraping {row['domain']}")
        scrape_company(row["domain"], conn)
        time.sleep(random.uniform(3, 7))  # between companies

conn.close()

Expected output: trustpilot.db with all reviews. Verify by running SELECT COUNT(*) FROM reviews;.

common pitfalls

Using datacenter proxies. I see this constantly in operator forums, including some threads at multiaccountops.com/blog/. Datacenter IPs are trivially identified by Trustpilot’s Cloudflare config. You’ll get 10-50 requests before a block. Residential or mobile proxies only.

Not rotating proxies between companies. Pulling 200 pages from the same exit IP looks like an obvious crawler pattern. Configure your proxy provider to rotate on each request or at minimum between each company domain.
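
Per-company rotation is usually controlled by a session token in the proxy username. The sketch below assumes your provider treats a -session-<id> username suffix as a sticky session. That suffix format is an assumption and both function names are hypothetical, so check your provider's documentation for the real syntax.

```python
import uuid
from typing import Dict, List, Optional


def session_username(base_user: str, session_id: Optional[str] = None) -> str:
    """Append a session suffix to the proxy username.

    ASSUMPTION: the provider maps each distinct "-session-<id>" suffix to its
    own exit IP. The exact format varies by provider.
    """
    session_id = session_id or uuid.uuid4().hex[:8]
    return f"{base_user}-session-{session_id}"


def usernames_per_company(base_user: str, domains: List[str]) -> Dict[str, str]:
    """Give every company domain its own freshly generated session username."""
    return {domain: session_username(base_user) for domain in domains}
```

You would then build the proxy URL with the per-company username instead of the bare PROXY_USER value.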

Ignoring the Retry-After header. When you get a 429, read the Retry-After value and actually wait that long. The fetch_page function in step 3 only backs off exponentially and never looks at the header. Hammering through rate limits escalates you to a harder block faster.
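
Retry-After can be either a number of seconds or an HTTP date, so parsing it takes a little care. Here is a minimal sketch using only the standard library. The function name and the 30-second default are my own choices.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(
    value: Optional[str],
    default: float = 30.0,
    now: Optional[datetime] = None,
) -> float:
    """Seconds to wait for a Retry-After header (delta-seconds or HTTP-date)."""
    if not value:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means no extra wait is needed.
    return max(0.0, (when - now).total_seconds())
```

In fetch_page you could sleep for retry_after_seconds(resp.headers.get("Retry-After")) before raising on a 429, so tenacity's backoff is added on top of the server's requested wait.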

Not handling pagination termination properly. The loop in step 6 stops only when a page parses to an empty list. If out-of-range page numbers ever come back with review cards (a redirect to the last page, for example), the loop never ends and burns your proxy bandwidth quietly. Always cap at a maximum page count (say, 500) as a safety ceiling.
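
One way to add that ceiling is to move the loop into a generator that takes the fetch and parse functions as arguments. This is a sketch under that design; iter_pages is a hypothetical helper, not something from the script above.

```python
from typing import Callable, Iterator

MAX_PAGES = 500  # safety ceiling suggested in the pitfall above


def iter_pages(
    fetch: Callable[[int], str],
    parse: Callable[[str], list],
    max_pages: int = MAX_PAGES,
) -> Iterator[list]:
    """Yield parsed pages until one is empty or the ceiling is reached."""
    for page in range(1, max_pages + 1):
        items = parse(fetch(page))
        if not items:
            return  # normal termination: no more reviews
        yield items
    # Reaching this point means the ceiling was hit; worth logging in real use.
```

Because the loop is a bounded range rather than while True, it cannot run past max_pages no matter what the site returns.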

Scraping too fast. A sub-second delay between requests is a fingerprint. Add realistic jitter: 1.5-4 seconds between pages, 5-10 seconds between companies. The script above sleeps 3-7 seconds between companies, so raise that if you start seeing blocks. Slow is cheap and effective.

scaling this

10x (hundreds of companies): the single-threaded script above handles this. Run it on a cron job nightly. Total proxy cost: under $10/month. SQLite holds fine at this size.

100x (thousands of companies): move to a job queue. Celery with Redis or a simple Postgres-backed queue works. Run 4-8 concurrent workers, each with its own proxy session. Switch from SQLite to Postgres. At this level you’re probably spending $50-150/month on proxies. See the residential proxy comparison on this blog for current pricing across Bright Data, Oxylabs, and Smartproxy.
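
Before reaching for Celery, it can help to see the worker pattern in its simplest form. The sketch below is a stand-in built on standard-library threads and an in-process queue, not Celery or a Postgres-backed queue. The run_workers name and its scrape callback are hypothetical.

```python
import queue
import threading
from typing import Callable, Iterable, List


def run_workers(
    domains: Iterable[str],
    scrape: Callable[[str], None],
    num_workers: int = 4,
) -> List[str]:
    """Drain a queue of domains with a fixed pool of threads; return failed domains."""
    jobs: "queue.Queue[str]" = queue.Queue()
    for domain in domains:
        jobs.put(domain)

    failed: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                domain = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                scrape(domain)
            except Exception:
                with lock:  # guard the shared failure list
                    failed.append(domain)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failed
```

Each worker should open its own database connection and proxy session inside scrape. A SQLite connection in particular should not be shared across threads, which is one more reason to move to Postgres at this scale.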

1000x (tens of thousands of companies, daily refresh): the scraper architecture doesn’t change much but the ops overhead does. You’ll want: distributed workers (k8s or EC2 auto-scaling groups), a dedicated proxy subaccount per worker pool so you can rotate credentials independently, and a monitoring layer that alerts on error rate spikes, which usually mean a selector change or a new Cloudflare rule. At this scale, proxy cost is $300-800/month. Evaluate whether a commercial scraping API like Bright Data’s SERP API or ScraperAPI makes more sense than managing the proxy layer yourself. The unit economics often favor managed APIs above 500k requests/month.
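
The monitoring layer does not need to be elaborate to catch a selector change or a new block rule. Here is a minimal sketch of a rolling error-rate check. The window size, threshold, and class name are illustrative assumptions, not tuned values.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the most recent request outcomes and flag error-rate spikes."""

    def __init__(self, window: int = 200, threshold: float = 0.2, min_samples: int = 50):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True when the alert threshold is crossed."""
        self.results.append(ok)
        if len(self.results) < self.min_samples:
            return False  # not enough data to judge yet
        return self.error_rate() >= self.threshold
```

Counting an empty parse result as a failure, not only HTTP errors, is what lets this catch silent selector breakage.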

One thing that changes at 1000x that most people underestimate: deduplication and schema drift. Trustpilot’s HTML structure has shifted 3-4 times since 2022 based on my own monitoring. Build your parser as a versioned module so you can hot-swap it without touching the pipeline.
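
A versioned parser module can be as simple as a registry keyed by a layout version. The sketch below uses placeholder parser bodies; the version labels and the review-card-legacy marker are invented for the demo. In practice each function would hold the parsel code for one layout.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

PARSERS: Dict[str, Callable[[str], List[dict]]] = {}


def register(version: str):
    """Decorator that files a parser function under a version label."""
    def decorator(fn):
        PARSERS[version] = fn
        return fn
    return decorator


@register("2026-01")
def parse_v2026_01(html: str) -> List[dict]:
    # Placeholder: stands in for the step 4 parser.
    return [{"parser": "2026-01"}] if "data-service-review-card-paper" in html else []


@register("2024-06")
def parse_v2024_06(html: str) -> List[dict]:
    # Placeholder for an older layout; the marker string is made up.
    return [{"parser": "2024-06"}] if "review-card-legacy" in html else []


def parse_with_fallback(
    html: str, order: Sequence[str] = ("2026-01", "2024-06")
) -> Tuple[Optional[str], List[dict]]:
    """Try parsers in the given order; return (version, reviews) for the first hit."""
    for version in order:
        reviews = PARSERS[version](html)
        if reviews:
            return version, reviews
    return None, []
```

Storing the returned version next to each row also tells you later which layout produced which data.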

where to go next

If you’re building multi-account or browser-automation pipelines around this data, the antidetect browser reviews at antidetectreview.org/blog/ are worth reading alongside this. Profile fingerprint management matters if you’re logging into platforms rather than scraping public pages.

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
