How to scrape Google SERP at scale in 2026 with proxies that work

Google SERP scraping is one of the most demanded and most blocked scraping tasks out there. Google has had years to refine its bot detection, and if you’re hitting it with bare datacenter IPs, a simple requests loop, and no fingerprint management, you’ll see CAPTCHAs within minutes. i’ve burned through more than a few proxy budgets learning this the hard way in Singapore, where residential IP pools for Southeast Asia are thinner than most vendors admit.

This guide is for operators who need rank tracking data, SERP feature monitoring, or competitive intelligence at real volume: 10,000 to 1,000,000 queries a month. it covers the actual setup, not the theory. if you’re scraping 50 keywords manually for a one-off project, SerpApi at $50/month is probably fine and you should stop reading here. if you need to own the pipeline, control your data freshness, and keep costs below $2 per 1,000 results, read on.

By the end you’ll have a working Python scraper using Playwright, a rotating residential proxy integration, a basic retry layer, and a pattern you can scale from a laptop to a distributed queue system.


what you need

  • Python 3.11+ with playwright, httpx, tenacity, and redis installed
  • Rotating residential proxy account: Bright Data, Oxylabs, or Smartproxy. budget $50-150/month to start. residential is non-negotiable for Google at scale. datacenter IPs get flagged within seconds on most queries
  • A Redis instance for the job queue. a $7/month Upstash instance works fine at 10k-100k queries/month
  • A Linux VPS or Docker environment for running workers. i use Hetzner CX21 at €5.29/month
  • Basic knowledge of CSS selectors and how Google structures its SERP HTML
  • A proxy provider that gives you sticky sessions (10-30 minute session persistence), not just random rotation on every request

Estimated monthly cost at 100k queries: $80-160 in proxy bandwidth, $7-15 in compute, depending on concurrency.
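A quick sanity check of that estimate against the $2 per 1,000 target, using only the figures quoted above. the function name is illustrative, and the arithmetic assumes proxy and compute are your only costs.

```python
def cost_per_thousand(proxy_usd: float, compute_usd: float, queries: int) -> float:
    """Monthly spend divided by the number of thousand-query batches."""
    return (proxy_usd + compute_usd) / (queries / 1000)

# Low and high ends of the estimate at 100k queries/month.
low = cost_per_thousand(proxy_usd=80, compute_usd=7, queries=100_000)
high = cost_per_thousand(proxy_usd=160, compute_usd=15, queries=100_000)

print(f"${low:.2f} to ${high:.2f} per 1,000 queries")
# prints "$0.87 to $1.75 per 1,000 queries"
```

Both ends land under $2 per 1,000, but the high end leaves little room for retries, so failed requests eat into the margin quickly.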


step by step

step 1: install dependencies and set up playwright

pip install playwright httpx tenacity redis python-dotenv
playwright install chromium

Playwright’s Chromium is the foundation, run headless in this guide. Google’s bot detection checks for WebGL fingerprints, canvas fingerprints, navigator properties, and timing anomalies that a plain HTTP client like [requests](https://requests.readthedocs.io/) simply cannot fake. Playwright’s browser automation docs cover the full API, but you only need a small subset for SERP scraping.

Create a .env file:

PROXY_HOST=your-proxy-gateway.com
PROXY_PORT=22225
PROXY_USER=your-username
PROXY_PASS=your-password
REDIS_URL=redis://localhost:6379

if it breaks: if playwright install fails on a headless VPS, run playwright install-deps chromium first. Hetzner Ubuntu images are missing several Chromium runtime libs by default.

step 2: configure your proxy with sticky sessions

Residential proxy gateways let you request a sticky session via a session ID in the username string. this keeps the same exit IP for a full request cycle, which matters because Google sometimes checks IP consistency mid-page.

import os
import random
from dotenv import load_dotenv

load_dotenv()

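def get_proxy_config(session_id: str | None = None) -> dict:
    # No session ID supplied: pick a random one, which starts a new sticky session.
    if session_id is None:
        session_id = str(random.randint(10000, 99999))

    # Bright Data style: the session ID rides along in the username.
    user = f"{os.getenv('PROXY_USER')}-session-{session_id}"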
    return {
        "server": f"http://{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}",
        "username": user,
        "password": os.getenv("PROXY_PASS"),
    }

Check your provider’s docs for their exact sticky session syntax. Bright Data uses -session-XXXX, Oxylabs uses -sessid-XXXX. the pattern above assumes Bright Data.
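If you switch providers often, one option is to keep the session syntax in a lookup table instead of hard-coding it. a minimal sketch: the two tokens are the ones quoted above, while the provider keys and the function name are made up for illustration. confirm the exact format in your provider's docs before relying on it.

```python
# Session tokens per provider, as quoted in the text above.
# The dictionary keys are illustrative labels, not official identifiers.
SESSION_TOKENS = {
    "brightdata": "-session-",
    "oxylabs": "-sessid-",
}

def build_proxy_username(base_user: str, session_id: str, provider: str = "brightdata") -> str:
    """Append the provider's sticky-session token and session ID to the username."""
    try:
        token = SESSION_TOKENS[provider]
    except KeyError:
        raise ValueError(f"no session syntax known for provider {provider!r}") from None
    return f"{base_user}{token}{session_id}"

print(build_proxy_username("your-username", "48213"))
# prints "your-username-session-48213"
```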

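if it breaks: a 407 Proxy Authentication Required means the gateway rejected your credentials. check the username (including the session suffix) and password first, then confirm that your proxy allowlist includes your VPS IP. many residential providers also restrict access by IP, not just credentials.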

step 3: build the Playwright scraping function

import asyncio
import random
from urllib.parse import quote_plus

from playwright.async_api import async_playwright

async def scrape_serp(query: str, session_id: str, country: str = "us") -> dict:
    proxy_config = get_proxy_config(session_id)

    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=proxy_config,
            args=["--no-sandbox", "--disable-blink-features=AutomationControlled"],
        )
        try:
            context = await browser.new_context(
                locale="en-US",
                timezone_id="America/New_York",
                viewport={"width": 1280, "height": 800},
                user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            )

            page = await context.new_page()
            # URL-encode the query so spaces, "&", and "+" survive intact.
            url = f"https://www.google.com/search?q={quote_plus(query)}&gl={country}&hl=en&num=10"

            await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(random.randint(1200, 2800))  # human-ish delay

            # Pull title, link, and snippet out of each organic result block.
            results = await page.evaluate("""
                () => Array.from(document.querySelectorAll('div.g')).map(el => ({
                    title: el.querySelector('h3')?.innerText,
                    url: el.querySelector('a')?.href,
                    snippet: el.querySelector('.VwiC3b')?.innerText,
                })).filter(r => r.title && r.url)
            """)
            return {"query": query, "results": results, "session_id": session_id}
        finally:
            # Close the browser on every code path, including timeouts.
            await browser.close()

Google’s SERP HTML changes frequently. .VwiC3b is the snippet class as of early 2026, but validate it against live HTML regularly. i check this monthly by inspecting a live SERP.

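if it breaks: if you’re getting an empty results array, check whether Google served a CAPTCHA page. add a check such as: if "captcha" in page.url or "/sorry/" in page.url: raise CaptchaError(). Google’s block page has typically lived under a /sorry/ path, so the URL may not contain the word captcha at all. confirm the marker against a blocked response you’ve actually seen. then you can route those jobs to a different proxy session.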
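The CAPTCHA check can live in a small helper that you call right after page.goto. this is a sketch, not a complete detector: the "/sorry/" marker is an assumption based on where Google's block page has commonly been served, and the helper names are made up. it inspects only the URL path, so a search for "captcha solver" doesn't trip it.

```python
from urllib.parse import urlsplit

class CaptchaError(Exception):
    """Raised when a challenge page is served instead of results."""

# Assumed markers: verify both against a real blocked response.
PATH_MARKERS = ("captcha", "/sorry/")

def is_blocked(url: str) -> bool:
    """True if the URL path (not the query string) looks like a challenge page."""
    path = urlsplit(url).path.lower()
    return any(marker in path for marker in PATH_MARKERS)

def ensure_not_blocked(url: str) -> None:
    """Raise CaptchaError so the caller can rotate the session and requeue."""
    if is_blocked(url):
        raise CaptchaError(f"challenge page served: {url}")
```

In the scraper this would be a single line after navigation: ensure_not_blocked(page.url).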

step 4: add retry logic with tenacity

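from playwright.async_api import TimeoutError as PlaywrightTimeoutError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

class CaptchaError(Exception):
    """Google served a challenge page instead of results."""

class ProxyError(Exception):
    """The proxy gateway refused or dropped the connection."""

@retry(
    # Playwright raises its own TimeoutError, which is not a subclass of the builtin one.
    retry=retry_if_exception_type((ProxyError, TimeoutError, PlaywrightTimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=20),
    reraise=True,  # surface the original exception instead of tenacity's RetryError
)
async def scrape_with_retry(query: str, country: str = "us") -> dict:
    # A new session ID on every attempt, so a retry never reuses the same sticky session.
    session_id = str(random.randint(10000, 99999))
    return await scrape_serp(query, session_id, country)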

tenacity handles the exponential backoff. don’t retry on CaptchaError with the same session. rotate the session ID and reduce your concurrency instead.

if it breaks: if tenacity is retrying endlessly, add logging to each retry attempt. silent failure loops burn proxy bandwidth fast.
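To make the "rotate the session and log every attempt" advice concrete, here is a standard-library sketch of the same idea without tenacity. it is an illustration of the pattern, not the article's retry layer: run_with_fresh_sessions and its parameters are hypothetical, and scrape stands for any coroutine that takes (query, session_id).

```python
import asyncio
import logging
import random

log = logging.getLogger("serp.retry")

class CaptchaError(Exception):
    """Raised by the scraper when a challenge page is served."""

async def run_with_fresh_sessions(scrape, query, max_attempts=3,
                                  base_delay=4.0, max_delay=20.0):
    """Call scrape(query, session_id) with a new session ID on every attempt."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        session_id = str(random.randint(10000, 99999))  # never reuse a blocked session
        try:
            return await scrape(query, session_id)
        except (CaptchaError, TimeoutError) as exc:
            last_error = exc
            # Log every failure so a silent loop can't quietly burn bandwidth.
            log.warning("attempt %d/%d failed for %r on session %s: %r",
                        attempt, max_attempts, query, session_id, exc)
            if attempt < max_attempts:
                # Exponential backoff, capped at max_delay seconds.
                await asyncio.sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise last_error
```

The delays mirror the min=4, max=20 settings above. reducing concurrency after repeated CAPTCHAs is left to the caller.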

step 5: set up a Redis job queue

import json
import os

import redis

r = redis.from_url(os.getenv("REDIS_URL"))

def enqueue_queries(queries: list[str], country: str = "us"):
    # Push to the left of the list; workers pop from the right, so order is FIFO.
    for q in queries:
        job = json.dumps({"query": q, "country": country, "attempts": 0})
        r.lpush("serp_queue", job)

def dequeue_job() -> dict | None:
    # Non-blocking pop; returns None when the queue is empty.
    item = r.rpop("serp_queue")
    return json.loads(item) if item else None

def store_result(result: dict):
    r.lpush("serp_results", json.dumps(result))

This gives you a simple FIFO queue. for production i’d use Celery with Redis broker or a proper task queue, but this pattern works up to about 50k queries/day on a single worker.

if it breaks: if Redis connection drops intermittently on Upstash, pass socket_keepalive=True to redis.from_url. switching to redis.StrictRedis won’t help, because in current redis-py it is just an alias of redis.Redis. Upstash’s free tier also has a max connection limit.

step 6: run the worker loop

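import asyncio

async def consume(worker_id: int, slot: int):
    # One consumer: pull a job, scrape it, then store the result or requeue the job.
    while True:
        # The redis client is synchronous, so keep it off the event loop.
        job = await asyncio.to_thread(dequeue_job)
        if not job:
            await asyncio.sleep(2)
            continue

        try:
            result = await scrape_with_retry(job["query"], job["country"])
            store_result(result)
            print(f"[{worker_id}.{slot}] Done: {job['query']} ({len(result['results'])} results)")
        except Exception as e:
            print(f"[{worker_id}.{slot}] Failed: {job['query']}: {e}")
            if job["attempts"] < 2:
                job["attempts"] += 1
                r.lpush("serp_queue", json.dumps(job))

async def worker(worker_id: int, concurrency: int = 3):
    print(f"Worker {worker_id} started")
    # Run `concurrency` consumers side by side. A semaphore around a loop that
    # awaits each job in turn would only ever process one job at a time.
    await asyncio.gather(*(consume(worker_id, slot) for slot in range(concurrency)))

if __name__ == "__main__":
    asyncio.run(worker(worker_id=1, concurrency=3))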

Start with concurrency=3 per worker. each Playwright browser uses ~150-200MB RAM. a Hetzner CX21 (4GB RAM) can handle 3-4 concurrent browsers comfortably.

if it breaks: if RAM spikes and workers die, drop concurrency to 2 and monitor with htop. Playwright leaks memory on some Linux configs if you don’t call browser.close() in every code path.


common pitfalls

using datacenter proxies. i see this constantly in forums. datacenter IPs are cheap ($0.50-2/GB) but Google has them flagged at the ASN level in many cases. residential proxies at $4-15/GB are the correct tool for SERP. if you want to understand the proxy landscape better, the residential proxy reviews on this site go into provider-by-provider comparisons.

skipping the human-ish delay. sending requests at exactly 0ms between actions is a dead giveaway. add jitter. random.randint(800, 3000) between page interactions is the minimum. vary your query timing too.
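A small helper keeps the jitter in one place. this sketch uses the 800-3000 ms range from the paragraph above; the function names are illustrative.

```python
import asyncio
import random

def jitter_ms(low: int = 800, high: int = 3000) -> int:
    """Pick a random delay in milliseconds, inclusive on both ends."""
    return random.randint(low, high)

async def human_pause(low: int = 800, high: int = 3000) -> float:
    """Sleep for a jittered interval and return the seconds actually waited."""
    seconds = jitter_ms(low, high) / 1000
    await asyncio.sleep(seconds)
    return seconds
```

Call await human_pause() between page interactions, and use a wider range between queries so the overall request rhythm varies too.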

scraping without geo-targeting. Google personalizes results by country. if your proxy exit node is in Germany but you’re scraping US rankings, your data is wrong. always pass gl=us&hl=en or the appropriate locale. check your proxy provider’s country filtering options.
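Building the URL from a parameter dict makes the locale explicit and handles URL-encoding at the same time. a minimal sketch using the same gl, hl, and num parameters as the scraper above; build_serp_url is an illustrative name.

```python
from urllib.parse import urlencode

def build_serp_url(query: str, country: str = "us", language: str = "en", num: int = 10) -> str:
    """Return a Google search URL with explicit country (gl) and language (hl)."""
    params = {"q": query, "gl": country, "hl": language, "num": num}
    # urlencode escapes spaces, "&", "+", and non-ASCII characters in the query.
    return "https://www.google.com/search?" + urlencode(params)

print(build_serp_url("best vpn"))
# prints "https://www.google.com/search?q=best+vpn&gl=us&hl=en&num=10"
```

The URL parameters only cover half of it: the proxy exit country still has to match, which is configured on the provider side.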

not validating CSS selectors after Google updates. Google’s SERP HTML changes every few months. build a simple validation step that checks whether your selectors return non-empty results on a test query after each deployment.
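The validation step can be a plain function run against the output of one test query. a sketch under assumptions: the thresholds (at least 5 results, snippets on at least half of them) are arbitrary starting points, and the result shape matches the dicts returned by the scraper above.

```python
def validate_results(results: list, min_results: int = 5,
                     min_snippet_ratio: float = 0.5) -> list:
    """Return a list of problems; an empty list means the selectors look healthy."""
    problems = []
    if len(results) < min_results:
        problems.append(f"only {len(results)} results, expected at least {min_results}")
    if results:
        with_snippet = sum(1 for r in results if r.get("snippet"))
        if with_snippet / len(results) < min_snippet_ratio:
            problems.append(
                f"snippet selector matched {with_snippet} of {len(results)} results"
            )
    return problems
```

A missing snippet on every result usually means the snippet class changed, while zero results points at the container selector or a block page.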

rotating sessions too aggressively. some operators rotate IP on every single request. this actually increases detection risk for SERP, because a consistent session looks more human. stick with 10-20 minute sticky sessions, then rotate.
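One way to enforce time-based rotation is a small object that hands out the current session ID and swaps it once it gets too old. a sketch only: StickySession is a hypothetical helper, the 15-minute default is simply the midpoint of the range above, and the clock is injectable so the logic can be tested without waiting.

```python
import random
import time

class StickySession:
    """Hands out one session ID until it is older than max_age_seconds."""

    def __init__(self, max_age_seconds: float = 15 * 60, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self.generation = 0  # how many session IDs have been issued so far
        self._rotate()

    def _rotate(self) -> None:
        self.session_id = str(random.randint(10000, 99999))
        self._started = self._clock()
        self.generation += 1

    def current(self) -> str:
        """Return the active session ID, rotating first if it has expired."""
        if self._clock() - self._started >= self._max_age:
            self._rotate()
        return self.session_id

    def force_rotate(self) -> str:
        """Rotate immediately, for example after a CAPTCHA."""
        self._rotate()
        return self.session_id
```

Each worker would hold one StickySession and pass session.current() into the scraper instead of generating a random ID per request.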


scaling this

10x (100k queries/month): add 2-3 worker processes on the same VPS, each with concurrency=3. upgrade your proxy plan. parse and store results in Postgres instead of Redis lists. this is still single-machine territory.

100x (1M queries/month): move to a distributed worker pool. i use Docker Compose with 5-10 worker containers behind a job queue, each pulling from Redis. add a results database with proper indexing. at this volume you’ll also want to route different query types through different proxy pools, for example different locales or query categories, to avoid session concentration triggering rate limits.

1000x (10M+ queries/month): at this scale you’re negotiating direct proxy contracts ($2-5/GB instead of retail), running Kubernetes, and probably building your own CAPTCHA handling with a third-party solver service like 2captcha or CapSolver. you’ll also want to track your success rate per proxy session and automatically retire sessions that start failing. this is also where it makes sense to evaluate whether a managed SERP API like DataForSEO is actually cheaper per result than running your own infrastructure. the crossover point is usually around 5-10M queries/month depending on your proxy costs.
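Tracking success rate per session doesn't need much machinery to start with. a minimal in-memory sketch: SessionHealth and its thresholds (10 requests minimum, 70% success) are assumptions for illustration, and at this scale you would probably keep the counters in Redis or a database rather than in process memory.

```python
class SessionHealth:
    """Counts successes per proxy session and flags sessions worth retiring."""

    def __init__(self, min_requests: int = 10, min_success_rate: float = 0.7):
        self.min_requests = min_requests
        self.min_success_rate = min_success_rate
        self._stats = {}  # session_id -> [successes, total]

    def record(self, session_id: str, ok: bool) -> None:
        stats = self._stats.setdefault(session_id, [0, 0])
        stats[0] += int(ok)
        stats[1] += 1

    def success_rate(self, session_id: str) -> float:
        ok, total = self._stats.get(session_id, (0, 0))
        return ok / total if total else 1.0  # unseen sessions count as healthy

    def should_retire(self, session_id: str) -> bool:
        """Retire only after enough requests to make the rate meaningful."""
        _, total = self._stats.get(session_id, (0, 0))
        return total >= self.min_requests and self.success_rate(session_id) < self.min_success_rate
```

Record True for a parsed SERP and False for a CAPTCHA or timeout, then check should_retire before reusing a session.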

If you’re operating multiple scraping setups across different targets, the workflow patterns at multiaccountops.com/blog/ cover infrastructure isolation strategies that apply directly to multi-target scraping pipelines.


where to go next

Google’s own Search Central documentation is worth reading, not because it tells you how to scrape, but because understanding how Google structures its crawling and indexing helps you reason about what data is actually meaningful to collect.


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
