
How to scrape Yelp at scale in 2026 with proxies that work

Yelp is one of the more frustrating targets in local business intelligence. the data is genuinely useful: business names, phone numbers, addresses, hours, categories, review counts, and star ratings for millions of US, Canadian, and European listings. but Yelp runs aggressive bot detection backed by Cloudflare, enforces geo-based access quirks, and returns a 403 or a CAPTCHA within seconds if your request pattern looks automated. i’ve watched people burn through proxy budgets in an hour on this site before they understood what was actually blocking them.

this tutorial is for operators who need Yelp data at scale, not hobbyists who want ten listings. specifically: lead gen agencies building local business databases, market research teams tracking competitor reviews, and SEO shops monitoring citation consistency across verticals. if you just need a few hundred records, use the Yelp Fusion API instead and skip all of this. the Fusion API is free up to 500 calls/day and is the legitimate, no-drama path for small volumes.

for everything else, here is what actually works in 2026: residential proxies with session stickiness, a headless browser to handle JS rendering, careful request pacing, and a data pipeline that handles partial failures gracefully. by the end of this you will have a working scraper that can pull thousands of business records per hour without burning through your proxy allocation.


what you need

  • Python 3.11+ with the following packages: playwright, httpx, beautifulsoup4, lxml, tenacity. the csv module used for output ships with the standard library, so there is nothing to install for it
  • Playwright browser binaries: install with playwright install chromium
  • Rotating residential proxy provider: i use ProxyScrape residential proxies for this stack. Bright Data and Oxylabs also work. budget $8-15/GB. for 10,000 listings you will use roughly 2-4 GB depending on how many review pages you pull
  • A Yelp search URL plan: know your target verticals and cities before you start. Yelp paginates at 10 results per page with a maximum of 24 pages (240 results) per search
  • A machine or VPS: any Ubuntu 22.04 VPS with 2 vCPUs and 4 GB RAM will do. Hetzner CX21 (~$5/mo) or a DigitalOcean Droplet works fine
  • Optional: a Supabase or PostgreSQL instance for storing output at scale instead of flat CSV files

total cost for a one-time 50,000 listing pull: roughly $30-60 in proxy bandwidth plus VPS time.


step by step

step 1: read yelp’s robots.txt and terms of service

before any scraping project, read Yelp’s Terms of Service and check yelp.com/robots.txt. Yelp’s ToS prohibits automated data collection without their permission. this tutorial is provided for educational and authorized research purposes. if you are scraping commercially, consult your own legal counsel first. this is not legal advice.

step 2: set up your python environment

python -m venv yelp_env
source yelp_env/bin/activate
pip install playwright httpx beautifulsoup4 lxml tenacity
playwright install chromium

create a file config.py to hold your proxy credentials and rate limit settings:

PROXY_HOST = "rp.proxyscrape.com"
PROXY_PORT = 6060
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
REQUESTS_PER_MINUTE = 20
SESSION_ROTATE_EVERY = 5  # pages before rotating proxy session

if it breaks: if playwright install chromium fails on a headless VPS, run apt-get install -y libnss3 libatk-bridge2.0-0 libdrm2 libxkbcommon0 libgbm1 first.

step 3: build the search URL generator

Yelp search URLs follow a predictable pattern. a search for “dentist” in “Austin, TX” starting at result 10 looks like:

https://www.yelp.com/search?find_desc=dentist&find_loc=Austin%2C+TX&start=10

build a generator that yields all pagination offsets for your target queries:

from itertools import product

VERTICALS = ["dentist", "plumber", "electrician"]
CITIES = ["Austin, TX", "Phoenix, AZ", "Denver, CO"]
MAX_START = 230  # 24 pages * 10 = 240 results, start at 0,10,...230

def generate_urls():
    for vertical, city in product(VERTICALS, CITIES):
        for start in range(0, MAX_START + 1, 10):
            yield (
                f"https://www.yelp.com/search"
                f"?find_desc={vertical.replace(' ', '+')}"
                f"&find_loc={city.replace(' ', '+').replace(',', '%2C')}"
                f"&start={start}"
            )

if it breaks: Yelp sometimes redirects city names to canonical slugs. if you get zero results, check the final URL in the browser and adjust the find_loc value.
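the hand-rolled replace() calls above only escape spaces and commas. a minimal sketch of the same generator built on urllib.parse.urlencode, so a vertical like "AC & heating" is also escaped correctly. build_search_url is a hypothetical helper name, and the parameter names are the ones from the example URL above:

```python
from itertools import product
from urllib.parse import urlencode

BASE = "https://www.yelp.com/search"


def build_search_url(vertical: str, city: str, start: int = 0) -> str:
    # urlencode applies quote_plus to each value: spaces become "+",
    # commas become "%2C", ampersands become "%26".
    query = urlencode({"find_desc": vertical, "find_loc": city, "start": start})
    return f"{BASE}?{query}"


def generate_urls(verticals, cities, max_start=230, page_size=10):
    # One URL per (vertical, city, pagination offset) combination.
    for vertical, city in product(verticals, cities):
        for start in range(0, max_start + 1, page_size):
            yield build_search_url(vertical, city, start)
```

build_search_url("dentist", "Austin, TX", 10) reproduces the example URL at the top of this step.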

step 4: configure playwright with residential proxies

Yelp’s Cloudflare layer checks TLS fingerprints and browser headers. a plain [requests](https://requests.readthedocs.io/) call will get blocked immediately. use Playwright with a residential proxy to get a real-browser fingerprint:

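from urllib.parse import urlsplit

from playwright.async_api import async_playwright
import asyncio

async def fetch_page(url: str, proxy_url: str) -> str:
    # Playwright takes proxy credentials as separate username/password
    # fields, so split them out of the user:pass@host:port URL.
    parts = urlsplit(proxy_url)
    proxy = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        proxy["username"] = parts.username
        proxy["password"] = parts.password or ""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        try:
            context = await browser.new_context(
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/124.0.0.0 Safari/537.36"
                ),
                locale="en-US",
                viewport={"width": 1280, "height": 800},
            )
            page = await context.new_page()
            await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(2000)  # let JS settle
            return await page.content()
        finally:
            # close the browser even when navigation times out
            await browser.close()

def proxy_url_for_session(session_id: int) -> str:
    from config import PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS
    # the "-session-N" username suffix is provider-specific; check your
    # provider's docs for its sticky-session syntax
    return (
        f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )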

see the Playwright Python docs for full context on browser configuration options.

if it breaks: if you get a CAPTCHA page, the proxy IP is flagged. increment your session ID to rotate to a fresh IP. if CAPTCHAs persist, switch to a provider with a larger residential pool.
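the rotate-on-CAPTCHA advice can be automated. a rough sketch, assuming a fetch callable that takes (url, session_id) and returns HTML. the strings in BLOCK_MARKERS are guesses for illustration, not Yelp's actual challenge text, so replace them with whatever your blocked pages really contain:

```python
# Hypothetical markers: inspect a real blocked response and adjust.
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied")


def looks_blocked(html: str) -> bool:
    """Heuristic: a challenge marker is present and the data payload is not."""
    lowered = html.lower()
    has_payload = 'id="__next_data__"' in lowered
    has_marker = any(marker in lowered for marker in BLOCK_MARKERS)
    return has_marker and not has_payload


async def fetch_rotating(fetch, url: str, session_id: int, max_rotations: int = 3):
    """Retry on a fresh session ID until the page no longer looks blocked.

    Returns (html, session_id_that_worked).
    """
    for offset in range(max_rotations + 1):
        current = session_id + offset
        html = await fetch(url, current)
        if not looks_blocked(html):
            return html, current
    raise RuntimeError(f"still blocked after {max_rotations} rotations: {url}")
```

returning the session ID that worked lets the caller keep using the clean session for the next few pages instead of going back to a flagged one.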

step 5: parse business listings from the search results page

Yelp renders listings server-side with embedded JSON. the cleanest extraction path is the __NEXT_DATA__ JSON blob in the page source:

import json
from bs4 import BeautifulSoup

def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    script = soup.find("script", {"id": "__NEXT_DATA__"})
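    if not script or not script.string:
        return []
    try:
        data = json.loads(script.string)
        results = (
            data["props"]["pageProps"]["searchPageProps"]
            ["mainContentComponentsListProps"]
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # malformed JSON or a changed payload layout: treat as "no listings"
        return []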

    listings = []
    for item in results:
        biz = item.get("bizId") or item.get("businessName")
        if not biz:
            continue
        listings.append({
            "name": item.get("businessName", ""),
            "rating": item.get("rating", ""),
            "review_count": item.get("reviewCount", ""),
            "address": item.get("formattedAddress", ""),
            "phone": item.get("phone", ""),
            "yelp_url": "https://www.yelp.com" + item.get("businessUrl", ""),
            "categories": ", ".join(
                c.get("title", "") for c in item.get("categories", [])
            ),
        })
    return listings

if it breaks: Yelp periodically restructures the __NEXT_DATA__ JSON. if parse_listings returns empty lists on valid pages, print data.keys() and trace the new path manually.
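tracing a new path by hand through a large JSON blob is slow. a small helper like this one (find_key_paths is a hypothetical name, not part of any library) walks the parsed payload and reports every path that ends in a given key, so you can search for a field you know should be there, such as businessName:

```python
def find_key_paths(node, target, path=()):
    """Yield the path (tuple of dict keys / list indexes) to every `target` key."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = path + (key,)
            if key == target:
                yield here
            # keep descending: the key can also appear deeper down
            yield from find_key_paths(value, target, here)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_key_paths(value, target, path + (index,))
```

call it as list(find_key_paths(data, "businessName")) on the parsed blob. the common prefix of the returned paths is the new location of the results list.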

step 6: add retry logic and rate limiting

wrap your fetch function with tenacity for automatic retries on transient failures:

from tenacity import retry, stop_after_attempt, wait_exponential
import asyncio

from config import REQUESTS_PER_MINUTE, SESSION_ROTATE_EVERY

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=10))
async def fetch_with_retry(url, session_id):
    return await fetch_page(url, proxy_url_for_session(session_id))

async def scrape_urls(url_list):
    results = []
    session_id = 1
    for i, url in enumerate(url_list):
        if i % SESSION_ROTATE_EVERY == 0 and i > 0:
            session_id += 1  # fresh proxy session every N pages
        try:
            html = await fetch_with_retry(url, session_id)
        except Exception as exc:
            # all retries failed: log the URL, rotate, and keep going
            print(f"giving up on {url}: {exc!r}")
            session_id += 1
        else:
            results.extend(parse_listings(html))
        await asyncio.sleep(60 / REQUESTS_PER_MINUTE)
    return results

if it breaks: if you are hitting Yelp’s per-IP rate limit (repeated 429s), reduce REQUESTS_PER_MINUTE to 10 and lower SESSION_ROTATE_EVERY from 5 to 3 so each proxy session serves fewer pages before rotating.
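a fixed sleep of exactly 60 / REQUESTS_PER_MINUTE seconds produces a very regular request rhythm. one optional variation, not part of the scraper above, is to add random jitter around the same average. the 30% spread is an arbitrary assumption, and session_for_index simply restates the rotation rule from the loop as a pure function:

```python
import random


def jittered_delay(requests_per_minute: int, jitter: float = 0.3) -> float:
    """Seconds to sleep before the next request, averaging 60 / rpm."""
    base = 60 / requests_per_minute
    return base * random.uniform(1 - jitter, 1 + jitter)


def session_for_index(i: int, rotate_every: int, first_session: int = 1) -> int:
    """Session ID for the i-th URL: a new session every `rotate_every` pages."""
    return first_session + i // rotate_every
```

at 20 requests per minute this sleeps somewhere between roughly 2.1 and 3.9 seconds per request instead of a flat 3.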

step 7: write output to CSV or database

import csv

def save_to_csv(records: list[dict], path: str):
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

for larger runs, insert directly into Postgres or Supabase. if you are building a multi-account local intelligence operation, the community at multiaccountops.com/blog/ has good patterns for structuring per-client databases.
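as a stand-in for Postgres, here is a sketch of the same idea using sqlite3 from the standard library. the table layout is an assumption based on the fields parse_listings returns, with yelp_url as the key so re-scraping a listing overwrites the old row instead of duplicating it:

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    yelp_url     TEXT PRIMARY KEY,
    name         TEXT,
    rating       REAL,
    review_count INTEGER,
    address      TEXT,
    phone        TEXT,
    categories   TEXT,
    scraped_at   TEXT NOT NULL
)
"""

# INSERT OR REPLACE overwrites any existing row with the same primary key.
INSERT = """
INSERT OR REPLACE INTO listings
    (yelp_url, name, rating, review_count, address, phone, categories, scraped_at)
VALUES
    (:yelp_url, :name, :rating, :review_count, :address, :phone, :categories, :scraped_at)
"""


def save_to_sqlite(conn: sqlite3.Connection, records: list[dict]) -> int:
    """Upsert records keyed on yelp_url and stamp them with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(SCHEMA)
    rows = [{**record, "scraped_at": now} for record in records]
    conn.executemany(INSERT, rows)
    conn.commit()
    return len(rows)
```

open the database with sqlite3.connect("yelp.db") and call save_to_sqlite after each batch. the same schema carries over to Postgres with an ON CONFLICT clause in place of INSERT OR REPLACE.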


common pitfalls

using datacenter proxies. Yelp blocks ASNs associated with AWS, GCP, and common datacenter proxy providers very aggressively. residential or mobile proxies only. see our residential proxy provider comparison for current rankings.

ignoring the 240-result cap. Yelp caps search results at 24 pages (240 listings) per query. if your vertical has more than 240 businesses in a city, break it down by neighborhood, zip code, or more specific subcategories. “restaurant” in “New York, NY” is useless. “sushi restaurant” in “Brooklyn, NY 11215” is workable.
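splitting a query by zip code means the same business can show up under several sub-queries, so dedupe afterwards. a minimal sketch, assuming you supply your own zip code list and that the part of yelp_url before any "?" identifies a business. both helper names are hypothetical:

```python
def expand_query(vertical: str, city: str, zip_codes: list[str]) -> list[tuple[str, str]]:
    """Turn one broad (vertical, city) query into one query per zip code."""
    return [(vertical, f"{city} {zip_code}") for zip_code in zip_codes]


def dedupe_by_url(records: list[dict]) -> list[dict]:
    """Keep the first record per business URL, ignoring query-string noise."""
    seen = set()
    unique = []
    for record in records:
        key = record.get("yelp_url", "").split("?")[0]
        if key in seen:
            continue
        seen.add(key)
        unique.append(record)
    return unique
```

expand_query("sushi restaurant", "Brooklyn, NY", ["11215", "11217"]) yields the "Brooklyn, NY 11215" style locations from the example above, ready to pass in as find_loc values.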

scraping too fast. running 50 concurrent Playwright instances on a single proxy session will get that session flagged within minutes. concurrency and session rotation need to be matched. at 20 requests/minute, use one session per 5 requests then rotate.

not handling partial page loads. Yelp’s search page sometimes loads skeleton content before the actual listing data. using wait_until="domcontentloaded" alone is not enough. the extra wait_for_timeout(2000) in step 4 exists for this reason. skip it and you will parse empty listing shells.

storing raw HTML forever. at scale, storing full HTML files eats disk fast. parse and discard immediately, or compress HTML to a content-addressed store if you need to reparse later.
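if you do want to keep pages for reparsing, a content-addressed store can be a few lines of standard library code. a sketch, assuming gzip compression and a SHA-256 digest as the file name. the two-character subdirectory is only there to keep any single directory from getting huge:

```python
import gzip
import hashlib
from pathlib import Path


def _path_for(digest: str, root: Path) -> Path:
    return root / digest[:2] / f"{digest}.html.gz"


def store_html(html: str, root: Path) -> str:
    """Write gzip-compressed HTML under its SHA-256 digest; return the digest."""
    raw = html.encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    path = _path_for(digest, root)
    if not path.exists():  # identical pages are stored only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(gzip.compress(raw))
    return digest


def load_html(digest: str, root: Path) -> str:
    """Read a stored page back by digest."""
    return gzip.decompress(_path_for(digest, root).read_bytes()).decode("utf-8")
```

store the returned digest next to the parsed record so you can find the source page again when the parser changes.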


scaling this

10x (tens of thousands of listings): run the async scraper with a concurrency limit of 5 using asyncio.Semaphore. a single VPS handles this comfortably overnight. output to CSV or SQLite.
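the concurrency limit can be sketched like this. fetch is any coroutine function that takes a URL and returns HTML (for example a wrapper around the retrying fetch from step 6), and scrape_concurrently is a hypothetical name:

```python
import asyncio


async def scrape_concurrently(urls, fetch, limit: int = 5):
    """Run fetch(url) for every URL with at most `limit` requests in flight.

    Returns a list of (url, result) pairs in input order, where result is
    either the fetched HTML or the exception that fetch raised.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # one bad URL should not cancel the rest of the batch
                return url, exc

    return await asyncio.gather(*(worker(url) for url in urls))
```

keep in mind the pitfall above: with five requests in flight, each worker needs its own proxy session rather than sharing one.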

100x (hundreds of thousands of listings): split your URL list into batches and distribute across 3-5 VPS instances. each instance uses its own proxy sub-account or username prefix so sessions don’t collide. use Postgres with a scraped_at timestamp column so you can deduplicate and resume after failures. expect 15-30 GB of proxy bandwidth.
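one way to split the URL list so that reruns are stable is to assign each URL to a worker by hashing it. a sketch under that assumption. the helper names are hypothetical, and done_urls stands for whatever your database reports as already scraped:

```python
import hashlib


def shard_for(url: str, num_workers: int) -> int:
    """Deterministic worker index for a URL, independent of list order."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def split_into_shards(urls, num_workers: int) -> list[list[str]]:
    shards = [[] for _ in range(num_workers)]
    for url in urls:
        shards[shard_for(url, num_workers)].append(url)
    return shards


def pending(urls, done_urls) -> list[str]:
    """URLs still to fetch after a crash or restart."""
    done = set(done_urls)
    return [url for url in urls if url not in done]
```

because the assignment depends only on the URL, a restarted worker recomputes the same shard and can skip whatever pending() filters out.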

1000x (millions of listings, ongoing monitoring): at this volume you need a task queue, not a script. move to Celery or RQ with Redis as the broker. each worker is a separate process on a separate machine. proxy costs become your dominant expense, so negotiate volume pricing with your provider. you will also want to monitor your proxy hit rate and block rate per provider pool and route around flagged ranges automatically. if you are running operations at this scale across multiple verticals and geographies, the anti-detect and session management patterns at antidetectreview.org/blog/ are worth reading for browser fingerprint hardening.
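block-rate monitoring does not need much machinery to start with. a minimal in-memory sketch: the 20% threshold and 50-sample minimum are placeholder assumptions to tune against your own traffic, and a production setup would likely keep these counters in Redis alongside the task queue:

```python
from collections import Counter


class PoolStats:
    """Track block rate per proxy pool and list the pools still worth using."""

    def __init__(self, min_samples: int = 50, max_block_rate: float = 0.2):
        self.min_samples = min_samples
        self.max_block_rate = max_block_rate
        self.total = Counter()
        self.blocked = Counter()

    def record(self, pool: str, blocked: bool) -> None:
        self.total[pool] += 1
        if blocked:
            self.blocked[pool] += 1

    def block_rate(self, pool: str) -> float:
        total = self.total[pool]
        return self.blocked[pool] / total if total else 0.0

    def healthy_pools(self) -> list[str]:
        # a pool stays eligible until it has enough samples to be judged
        return [
            pool for pool in self.total
            if self.total[pool] < self.min_samples
            or self.block_rate(pool) <= self.max_block_rate
        ]
```

call record() after every response and pick the next proxy pool from healthy_pools().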


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
