How to scrape G2 at scale in 2026 with proxies that work

G2 is one of the most valuable sources of B2B software review data on the internet. vendors, investors, and competitive intelligence teams all want the same thing: category rankings, star distributions, reviewer job titles, and sentiment trends across thousands of products. the problem is that G2 actively protects this data. it uses Cloudflare, behavioral fingerprinting, and aggressive rate limits that will block a naive scraper within minutes.

this guide is for operators who already understand what proxies are and want a working extraction pipeline, not a toy script. i’ll cover the full stack: environment setup, JavaScript rendering, proxy rotation, and what changes when you go from one product page to a full category crawl. by the end you’ll have a repeatable process you can run on a schedule.

before anything else, read G2’s terms of service. scraping public review data for competitive intelligence sits in a legal grey area. this is not legal advice. you are responsible for your own compliance decisions.


what you need

  • python 3.11+ with httpx, playwright, parsel, and pandas
  • proxyscraping.com residential proxies, rotating plan. the entry tier starts around $3/GB and gives you access to the residential pool, which is what G2 actually requires. datacenter proxies fail within a few hundred requests
  • a proxy manager or your own rotation logic, covered in step 4
  • a headless browser runtime: Playwright with chromium is the standard choice in 2026. install via playwright install chromium
  • optional: a Supabase or Postgres instance for structured storage if you’re doing more than a one-off pull
  • estimated cost for a 10,000-review run: roughly 2-4 GB of proxy traffic ($6-12) plus compute time on a small VPS. plan for more if you’re re-fetching pages for fresh data
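
the cost estimate above can be reproduced with a few lines of arithmetic. this is a rough sketch, not a billing calculator: estimate_run_cost is a hypothetical helper, the 25 reviews per page figure comes from step 6, and the 7.5 MB per rendered page is an assumption chosen as the midpoint implied by 2-4 GB spread over 400 pages.

```python
import math

def estimate_run_cost(reviews, reviews_per_page=25, mb_per_page=7.5, usd_per_gb=3.0):
    """Return (pages, gigabytes, usd) for a single pass over the review pages."""
    pages = math.ceil(reviews / reviews_per_page)
    gigabytes = pages * mb_per_page / 1000  # decimal GB (assumption)
    return pages, gigabytes, gigabytes * usd_per_gb

pages, gb, usd = estimate_run_cost(10_000)
print(f"{pages} pages, {gb:.1f} GB, ${usd:.2f}")  # 400 pages, 3.0 GB, $9.00
```

measure your own per-page traffic on a small run and replace mb_per_page before trusting the number. retries and re-fetches are not counted here.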

step by step

step 1: map the pages you need

G2 product pages follow a predictable URL pattern:

https://www.g2.com/products/{product-slug}/reviews
https://www.g2.com/products/{product-slug}/reviews?page=2

category pages list products:

https://www.g2.com/categories/{category-slug}

start by building a seed list. if you want all CRM tools, grab the category page and extract product slugs first. do not start crawling review pages until you have a clean seed list, or you’ll end up with a chaotic job queue.

seed_products = [
    "salesforce-crm",
    "hubspot-crm",
    "pipedrive",
    # add more
]

base_url = "https://www.g2.com/products/{slug}/reviews?page={page}"
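
the seed list above is hand-written. if you would rather build it from a category page you have already rendered, something like the following could work. it is a sketch under one big assumption: that product links in the category HTML contain /products/{slug} in their href, with lowercase slugs. extract_slugs and PRODUCT_HREF are hypothetical names, and the sample HTML is made up for illustration.

```python
import re

# Assumed link shape: href="/products/<slug>..." or the absolute g2.com equivalent.
PRODUCT_HREF = re.compile(r'href="(?:https://www\.g2\.com)?/products/([a-z0-9-]+)')

def extract_slugs(category_html):
    """Return product slugs in first-seen order, without duplicates."""
    return list(dict.fromkeys(PRODUCT_HREF.findall(category_html)))

sample = """
<a href="/products/salesforce-crm/reviews">Salesforce</a>
<a href="https://www.g2.com/products/hubspot-crm/reviews?page=2">HubSpot</a>
<a href="/products/salesforce-crm/pricing">pricing</a>
<a href="/categories/crm">CRM</a>
"""
print(extract_slugs(sample))  # ['salesforce-crm', 'hubspot-crm']
```

check the output by eye before queuing anything. a regex over HTML is brittle, and a wrong slug costs proxy traffic.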

if it breaks: G2 occasionally changes slug format after rebrands. if a URL returns a 301, follow the redirect once and update your seed list.

step 2: set up your playwright context with proxy injection

G2 requires JavaScript rendering. the review content is loaded via client-side React calls. plain httpx requests will get you the HTML shell with no review data inside.

Playwright’s Python SDK lets you inject a proxy at the browser context level, which is cleaner than system-level proxy configuration.

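from playwright.async_api import async_playwright

PROXY_SERVER = "http://rotating.proxyscraping.com:8000"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

# async API, so this composes with the async stealth patch and crawl loop in later steps.
# usage: async with async_playwright() as p: browser, context = await get_browser_context(p)
async def get_browser_context(p):
    browser = await p.chromium.launch(headless=True)
    # the proxy is set per context, so each context can carry its own session
    context = await browser.new_context(
        proxy={
            "server": PROXY_SERVER,
            "username": PROXY_USER,
            "password": PROXY_PASS,
        },
        viewport={"width": 1440, "height": 900},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    )
    return browser, context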

if it breaks: if context creation hangs, your proxy server string may be wrong. test with curl -x http://user:pass@host:port https://httpbin.org/ip first.

step 3: fingerprint hardening

G2 uses behavioral fingerprinting on top of Cloudflare. a stock Playwright browser context leaks automation signals. at minimum you need to:

  • set a realistic user_agent (done above)
  • randomize viewport within plausible desktop ranges
  • add a random mouse movement before the first click or scroll
  • set navigator.webdriver to undefined via page script injection

async def stealth_patch(page):
    # init scripts run before any page script, so call this before page.goto()
    await page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        });
    """)
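
the viewport and mouse-movement items from the list could be sketched like this. the resolution list, the jitter amounts, and the helper names random_viewport and wiggle_mouse are all assumptions for illustration. the only Playwright call used is page.mouse.move(x, y, steps=...), on an async page.

```python
import asyncio
import random

# Common desktop resolutions; the list is an illustrative assumption.
DESKTOP_VIEWPORTS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]

def random_viewport(rng=random):
    """Pick a plausible desktop viewport dict suitable for new_context(viewport=...)."""
    width, height = rng.choice(DESKTOP_VIEWPORTS)
    # Shave a few pixels so sessions do not all share the exact same size.
    return {"width": width - rng.randint(0, 16), "height": height - rng.randint(0, 16)}

async def wiggle_mouse(page, viewport, moves=3, rng=random):
    """Move the pointer to a few random points inside the viewport."""
    for _ in range(moves):
        x = rng.randint(0, viewport["width"] - 1)
        y = rng.randint(0, viewport["height"] - 1)
        # steps > 1 makes Playwright emit intermediate mousemove events
        await page.mouse.move(x, y, steps=rng.randint(8, 20))
        await asyncio.sleep(rng.uniform(0.1, 0.4))
```

pass the same viewport dict to both new_context and wiggle_mouse so the pointer stays inside the window. this does not make a session look human on its own. it only removes two obvious constants.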

there are libraries like playwright-stealth that automate most of this. as of mid-2026, the maintained fork at pypi.org/project/playwright-stealth covers the major signal vectors. install it and apply it to the page before navigation. in the 1.x releases that is stealth_sync(page), or stealth_async(page) with the async API. newer releases changed the entry point, so check the README for the version you install.

if it breaks: if you’re still hitting Cloudflare challenge pages (you’ll see a 403 with a turnstile widget in the HTML), your residential proxy IPs may have been flagged. switch to a fresh IP range and reduce request frequency.

step 4: proxy rotation strategy

proxyscraping.com’s rotating residential endpoint assigns a new IP per connection by default. for G2 specifically, sticky sessions (same IP for 10-30 minutes) perform better because the anti-bot system tracks session continuity. if you rotate IPs on every page load, the behavioral profile looks inhuman.

configure a sticky session via the username parameter:

username: user-youruser-session-abc123
password: yourpassword

generate a new session ID every 15-20 pages or every 10 minutes, whichever comes first. here is a minimal rotation wrapper:

import random
import string
import time

class StickyProxyRotator:
    def __init__(self, base_user, password, host, port):
        self.base_user = base_user
        self.password = password
        self.host = host
        self.port = port
        self.session_id = self._new_session()
        self.page_count = 0
        self.session_start = time.time()

    def _new_session(self):
        return ''.join(random.choices(string.ascii_lowercase, k=8))

    def get_proxy(self):
        if self.page_count >= 15 or (time.time() - self.session_start) > 600:
            self.session_id = self._new_session()
            self.page_count = 0
            self.session_start = time.time()
        self.page_count += 1
        user = f"{self.base_user}-session-{self.session_id}"
        return {"server": f"http://{self.host}:{self.port}", "username": user, "password": self.password}

if it breaks: some residential IPs are flagged at the ISP level by G2. if you see consistent failures from a specific session, kill the session immediately instead of waiting.
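
killing a session early needs a failure counter, which the rotator above does not have. one possible shape, as a standalone sketch: SessionGuard is a hypothetical class, the threshold of 3 consecutive failures is an assumption, and the username format follows the -session- convention shown earlier in this step.

```python
import secrets

class SessionGuard:
    """Tracks consecutive failures for one sticky session and retires it early."""

    def __init__(self, base_user, max_consecutive_failures=3):
        self.base_user = base_user
        self.max_failures = max_consecutive_failures
        self.failures = 0
        self.session_id = secrets.token_hex(4)

    @property
    def username(self):
        return f"{self.base_user}-session-{self.session_id}"

    def record(self, ok):
        """Record one request outcome; return True if the session was replaced."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.session_id = secrets.token_hex(4)  # fresh sticky session
            self.failures = 0
            return True
        return False
```

when record returns True, build a new browser context with the new username. an existing context keeps using the proxy credentials it was created with.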

step 5: extract review data

once a page loads, reviews live in structured HTML. use parsel to query them:

from parsel import Selector

def parse_reviews(html):
    sel = Selector(text=html)
    reviews = []
    for card in sel.css('[itemprop="review"]'):
        reviews.append({
            "reviewer": card.css('[itemprop="author"] ::text').get("").strip(),
            "title": card.css('[itemprop="name"] ::text').get("").strip(),
            "rating": card.css('[itemprop="ratingValue"] ::attr(content)').get(""),
            "body": " ".join(card.css('[itemprop="reviewBody"] ::text').getall()).strip(),
            "date": card.css('[itemprop="datePublished"] ::attr(content)').get(""),
        })
    return reviews

G2 uses schema.org Review markup, which tends to survive redesigns better than CSS class names. do not rely on class names.

if it breaks: if [itemprop="review"] matches nothing, the page may not have fully rendered. wait for the selector before reading the HTML, e.g. page.wait_for_selector('[itemprop="review"]', timeout=15000), and raise the timeout if needed. if it still fails, check the raw HTML for a Cloudflare challenge body.
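
the wait-then-check sequence could be wrapped in one helper. a sketch, assuming you pass in an async Playwright BrowserContext: fetch_rendered and looks_like_challenge are hypothetical names, and the two marker strings are assumptions about what a Cloudflare challenge body contains, so adjust them to what you actually see.

```python
# Substrings assumed to indicate a Cloudflare challenge body; adjust to what you observe.
CHALLENGE_MARKERS = ("cf-turnstile", "challenge-platform")

def looks_like_challenge(html):
    """Return True if the HTML contains any of the assumed challenge markers."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)

async def fetch_rendered(context, url, selector='[itemprop="review"]', timeout_ms=15000):
    """Open url in a new page of an existing context and return the rendered HTML."""
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded")
        try:
            await page.wait_for_selector(selector, timeout=timeout_ms)
        except Exception:
            # Playwright raises its own TimeoutError here. it is swallowed so the
            # caller can inspect the HTML for a challenge body instead.
            pass
        return await page.content()
    finally:
        await page.close()
```

the broad except keeps the sketch free of a Playwright import. in real code, catch playwright.async_api.TimeoutError specifically so other errors still surface.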

step 6: pagination and delay logic

G2 shows up to 25 reviews per page. a product with 500 reviews has 20 pages. crawl them sequentially with a jittered delay:

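import asyncio
import random

# fetch_page(url) is your own coroutine that returns the rendered HTML for a URL
# (for example, Playwright's page.content() after the review selector appears).
# parse_reviews is the function from step 5.
async def crawl_product(slug, max_pages=20):
    results = []
    for page_num in range(1, max_pages + 1):
        url = f"https://www.g2.com/products/{slug}/reviews?page={page_num}"
        html = await fetch_page(url)
        page_reviews = parse_reviews(html)
        if not page_reviews:
            break  # empty page: end of reviews, or a soft block
        results.extend(page_reviews)
        await asyncio.sleep(random.uniform(4, 9))  # jittered delay between pages
    return results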

the 4-9 second range is conservative. you can push to 2-5 seconds on fresh residential IPs, but slower is safer when you’re doing a first run on a new category.

if it breaks: if you get empty results on page 5+ of a product with known reviews, G2 may be soft-blocking your session. rotate to a new session and retry from that page.
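
retrying from the failed page instead of restarting the product could look like the sketch below. crawl_with_resume is a hypothetical name, and fetch, parse, and rotate are injected callables you supply: an async fetcher, a parser like the one in step 5, and a function that starts a new sticky session. the jittered sleep is left out to keep the sketch short.

```python
async def crawl_with_resume(slug, fetch, parse, rotate, max_pages=20, retries_per_page=1):
    """Crawl review pages in order; on an empty page, rotate and retry that page."""
    results = []
    for page_num in range(1, max_pages + 1):
        url = f"https://www.g2.com/products/{slug}/reviews?page={page_num}"
        reviews = parse(await fetch(url))
        attempts = 0
        while not reviews and attempts < retries_per_page:
            rotate()  # new sticky session before retrying the same page
            attempts += 1
            reviews = parse(await fetch(url))
        if not reviews:
            break  # still empty after retrying: treat as the real end of the list
        results.extend(reviews)
    return results
```

the trade-off is one wasted rotation at the genuine end of every product, since the last empty page is retried too. if you know the review count up front, cap max_pages to avoid that.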

step 7: store and deduplicate

pipe results to a dataframe and export to CSV for small runs, or insert into Postgres for anything ongoing:

import pandas as pd

# all_reviews: the combined list of review dicts collected from every crawled product
df = pd.DataFrame(all_reviews)
df = df.drop_duplicates(subset=["reviewer", "date", "title"])
df.to_csv("g2_reviews.csv", index=False)

if you’re tracking changes over time, add slug and scraped_at columns to each row (parse_reviews in step 5 does not emit either) and use an upsert strategy on (slug, reviewer, date) as your composite key.
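
a sketch of that upsert, using the standard-library sqlite3 module as a stand-in so it runs anywhere. Postgres accepts the same INSERT ... ON CONFLICT (...) DO UPDATE form, though the placeholder syntax depends on your driver. the table layout and the upsert_reviews name are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS reviews (
    slug TEXT NOT NULL,
    reviewer TEXT NOT NULL,
    date TEXT NOT NULL,
    title TEXT,
    rating TEXT,
    body TEXT,
    scraped_at TEXT NOT NULL,
    PRIMARY KEY (slug, reviewer, date)
)
"""

UPSERT = """
INSERT INTO reviews (slug, reviewer, date, title, rating, body, scraped_at)
VALUES (:slug, :reviewer, :date, :title, :rating, :body, :scraped_at)
ON CONFLICT (slug, reviewer, date) DO UPDATE SET
    title = excluded.title,
    rating = excluded.rating,
    body = excluded.body,
    scraped_at = excluded.scraped_at
"""

def upsert_reviews(conn, slug, reviews):
    """Insert review dicts for one product, overwriting rows that share the composite key."""
    now = datetime.now(timezone.utc).isoformat()
    rows = [{**review, "slug": slug, "scraped_at": now} for review in reviews]
    with conn:  # commits on success, rolls back on error
        conn.executemany(UPSERT, rows)
```

usage is conn = sqlite3.connect("g2.db"), conn.execute(SCHEMA), then upsert_reviews(conn, "pipedrive", reviews) after each product. one caveat: a reviewer who posts two reviews of the same product on the same date would collapse into one row under this key.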

if it breaks: duplicate reviews usually mean you crawled the same page twice after a retry. add page-level deduplication before the product-level dedup.


common pitfalls

using datacenter proxies. G2 blocks most datacenter IP ranges at the ASN level. residential or mobile proxies only. this is the single biggest reason scrapers fail on G2.

rotating IPs too fast. a new IP on every page request is a textbook bot signal. G2’s behavioral model expects sessions. use sticky sessions as described in step 4.

scraping without JavaScript rendering. you will get the HTML shell and no review data. if you see empty itemprop fields, this is your problem.

ignoring rate limits after a block. if you keep hitting the same endpoint after a 429 or a Cloudflare challenge, you will get your IP range burned. back off for 30+ minutes and rotate your session.

no deduplication on retries. failed jobs get retried, pages get re-crawled, and you end up with duplicate rows in your dataset. always deduplicate before analysis.


scaling this

10x (hundreds of products): run multiple product slugs concurrently with asyncio.gather, capped at 3-5 concurrent sessions. keep a shared proxy rotator so sessions don’t collide. a single VPS with 4 cores handles this without issues.
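
capping concurrency with asyncio.gather is usually done with a semaphore. a minimal sketch, assuming crawl_one is your own coroutine that takes a slug and returns a list of reviews. crawl_many is a hypothetical name, and the default of 4 sits inside the 3-5 range above.

```python
import asyncio

async def crawl_many(slugs, crawl_one, max_concurrent=4):
    """Run crawl_one(slug) for every slug with at most max_concurrent in flight.

    Returns a list of (slug, reviews, error) tuples in the same order as slugs.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(slug):
        async with semaphore:
            try:
                return slug, await crawl_one(slug), None
            except Exception as exc:
                # one failed product should not cancel the rest of the batch
                return slug, [], exc

    return await asyncio.gather(*(guarded(slug) for slug in slugs))
```

errors are returned instead of raised so you can requeue the failed slugs after the batch finishes.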

100x (full categories, tens of thousands of reviews): introduce a proper job queue. Redis with RQ is the simple option, Celery if you need more control. split the workload across 2-3 workers, each with its own proxy session pool. at this scale, proxy traffic costs start mattering, so instrument your GB usage per run.

1000x (platform-wide crawl, multiple categories, recurring): you need distributed workers, a database-backed job queue, and a proxy budget of $50-200/month depending on refresh frequency. consider caching pages locally for 24-48 hours to avoid re-fetching unchanged reviews. this is also where you should think about whether a commercial G2 data API makes more economic sense than your own infrastructure. for multi-account and platform-scale data operations, the patterns at multiaccountops.com/blog/ cover infrastructure choices worth reading before you commit to a full build.
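
the 24-48 hour local page cache mentioned above can be as simple as files on disk keyed by a hash of the URL. a sketch using only the standard library: PageCache is a hypothetical class, and using file modification time as the age is a simplifying assumption.

```python
import hashlib
import time
from pathlib import Path

class PageCache:
    """Stores fetched HTML on disk and serves it back until it is older than ttl_seconds."""

    def __init__(self, directory, ttl_seconds=24 * 3600):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.ttl_seconds = ttl_seconds

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get(self, url, now=None):
        """Return cached HTML, or None if missing or stale."""
        path = self._path(url)
        if not path.exists():
            return None
        age = (time.time() if now is None else now) - path.stat().st_mtime
        if age > self.ttl_seconds:
            return None  # stale: caller should re-fetch
        return path.read_text(encoding="utf-8")

    def put(self, url, html):
        self._path(url).write_text(html, encoding="utf-8")
```

check cache.get(url) before fetching and call cache.put(url, html) only for pages that parsed successfully, so challenge pages are never cached.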

at any scale, build in a circuit breaker: if error rate on a worker exceeds 20% in a 10-minute window, pause that worker and alert. silent failures at scale are expensive.
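
one way to sketch that circuit breaker is a sliding window of request outcomes. the 20% threshold and 10-minute window come from the paragraph above. the CircuitBreaker class and its min_requests floor, which avoids tripping on a handful of requests, are assumptions.

```python
import time
from collections import deque

class CircuitBreaker:
    """Opens when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self, max_error_rate=0.20, window_seconds=600, min_requests=20):
        self.max_error_rate = max_error_rate
        self.window_seconds = window_seconds
        self.min_requests = min_requests
        self.events = deque()  # (timestamp, ok) pairs, oldest first

    def _trim(self, now):
        # drop outcomes that have fallen out of the window
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def record(self, ok, now=None):
        now = time.time() if now is None else now
        self.events.append((now, ok))
        self._trim(now)

    def is_open(self, now=None):
        """Return True if the worker should pause."""
        now = time.time() if now is None else now
        self._trim(now)
        if len(self.events) < self.min_requests:
            return False
        errors = sum(1 for _, ok in self.events if not ok)
        return errors / len(self.events) > self.max_error_rate
```

call record after every request and check is_open before pulling the next job. the alerting side is up to whatever you already use.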


where to go next

for all scraping guides and proxy reviews, see the blog index.


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
