How to scrape Yelp at scale in 2026 with proxies that work
Yelp is one of the more frustrating targets in local business intelligence. the data is genuinely useful: business names, phone numbers, addresses, hours, categories, review counts, and star ratings for millions of US, Canadian, and European listings. but Yelp runs aggressive bot detection backed by Cloudflare, enforces geo-based access quirks, and returns a 403 or a CAPTCHA within seconds if your request pattern looks automated. i’ve watched people burn through proxy budgets in an hour on this site before they understood what was actually blocking them.
this tutorial is for operators who need Yelp data at scale, not hobbyists who want ten listings. specifically: lead gen agencies building local business databases, market research teams tracking competitor reviews, and SEO shops monitoring citation consistency across verticals. if you just need a few hundred records, use the Yelp Fusion API instead and skip all of this. the Fusion API is free up to 500 calls/day and is the legitimate, no-drama path for small volumes.
for everything else, here is what actually works in 2026: residential proxies with session stickiness, a headless browser to handle JS rendering, careful request pacing, and a data pipeline that handles partial failures gracefully. by the end of this you will have a working scraper that can pull thousands of business records per hour without burning through your proxy allocation.
what you need
- Python 3.11+ with the following packages: playwright, httpx, beautifulsoup4, lxml, tenacity, csv (stdlib)
- Playwright browser binaries: install with playwright install chromium
- Rotating residential proxy provider: i use ProxyScrape residential proxies for this stack. Bright Data and Oxylabs also work. budget $8-15/GB. for 10,000 listings you will use roughly 2-4 GB depending on how many review pages you pull
- A Yelp search URL plan: know your target verticals and cities before you start. Yelp paginates at 10 results per page with a maximum of 24 pages (240 results) per search
- A machine or VPS: any Ubuntu 22.04 VPS with 2 vCPUs and 4 GB RAM will do. Hetzner CX21 (~$5/mo) or a DigitalOcean Droplet works fine
- Optional: a Supabase or PostgreSQL instance for storing output at scale instead of flat CSV files
total cost for a one-time 50,000 listing pull: roughly $30-60 in proxy bandwidth plus VPS time.
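to sanity-check a budget before buying bandwidth, the arithmetic can be sketched as below. this is a rough linear estimate, not a quote: the function name and its defaults are illustrative, and the defaults simply reuse the 2-4 GB per 10,000 listings and $8-15/GB figures from the list above.

```python
def proxy_cost_range(listings, gb_per_10k=(2.0, 4.0), usd_per_gb=(8.0, 15.0)):
    """Return (low, high) proxy spend in USD for a given number of listings.

    Assumes bandwidth scales linearly with listing count, which is only
    an approximation; review pages cost far more than search pages.
    """
    scale = listings / 10_000
    low = scale * gb_per_10k[0] * usd_per_gb[0]
    high = scale * gb_per_10k[1] * usd_per_gb[1]
    return low, high


low, high = proxy_cost_range(10_000)
print(f"10,000 listings: ${low:.0f}-${high:.0f}")
```

scaled linearly, the same per-listing figures put a 50,000 listing pull well above $30-60, so that total presumably assumes search result pages only, with few or no review pages. measure real bandwidth per listing on a small pilot run and plug that number in before committing to a budget.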
step by step
step 1: read yelp’s robots.txt and terms of service
before any scraping project, read Yelp’s Terms of Service and check yelp.com/robots.txt. Yelp’s ToS prohibits automated data collection without their permission. this tutorial is provided for educational and authorized research purposes. if you are scraping commercially, consult your own legal counsel first. this is not legal advice.
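the robots.txt check can be automated with the standard library. a minimal sketch follows, assuming you have already downloaded the file as text. SAMPLE_ROBOTS is made-up content for illustration only, not Yelp's actual rules, and the is_allowed helper and the "my-bot" user agent are hypothetical names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; fetch the real file
# from yelp.com/robots.txt before relying on any answer.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /biz/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://www.yelp.com/search?find_desc=dentist"))
print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://www.yelp.com/biz/example"))
```

a passing robots.txt check says nothing about the Terms of Service, so treat it as one input to the decision rather than a green light.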
step 2: set up your python environment
python -m venv yelp_env
source yelp_env/bin/activate
pip install playwright httpx beautifulsoup4 lxml tenacity
playwright install chromium
create a file config.py to hold your proxy credentials and rate limit settings:
PROXY_HOST = "rp.proxyscrape.com"
PROXY_PORT = 6060
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
REQUESTS_PER_MINUTE = 20
SESSION_ROTATE_EVERY = 5 # pages before rotating proxy session
if it breaks: if playwright install chromium fails on a headless VPS, run apt-get install -y libnss3 libatk-bridge2.0-0 libdrm2 libxkbcommon0 libgbm1 first.
step 3: build the search URL generator
Yelp search URLs follow a predictable pattern. a search for “dentist” in “Austin, TX” starting at result 10 looks like:
https://www.yelp.com/search?find_desc=dentist&find_loc=Austin%2C+TX&start=10
build a generator that yields all pagination offsets for your target queries:
from itertools import product
from urllib.parse import urlencode

VERTICALS = ["dentist", "plumber", "electrician"]
CITIES = ["Austin, TX", "Phoenix, AZ", "Denver, CO"]
MAX_START = 230  # 24 pages * 10 results: offsets 0, 10, ..., 230

def generate_urls():
    for vertical, city in product(VERTICALS, CITIES):
        for start in range(0, MAX_START + 1, 10):
            # urlencode turns spaces into "+" and commas into "%2C",
            # and also escapes characters like "&" in a vertical name
            query = urlencode(
                {"find_desc": vertical, "find_loc": city, "start": start}
            )
            yield f"https://www.yelp.com/search?{query}"
if it breaks: Yelp sometimes redirects city names to canonical slugs. if you get zero results, check the final URL in the browser and adjust the find_loc value.
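the redirect check described above can be done in code rather than by eye. a small sketch, assuming you capture the final URL after navigation (in Playwright that is page.url once goto returns). both helper names are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def find_loc_of(url: str) -> Optional[str]:
    """Return the decoded find_loc query value of a search URL, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("find_loc")
    return values[0] if values else None


def location_was_rewritten(requested_url: str, final_url: str) -> bool:
    """True when the final URL carries a different find_loc than the one requested."""
    return find_loc_of(requested_url) != find_loc_of(final_url)


requested = "https://www.yelp.com/search?find_desc=dentist&find_loc=Austin%2C+TX&start=0"
print(find_loc_of(requested))
```

logging every rewritten location during a pilot run gives you the list of find_loc values to correct before the full crawl.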
step 4: configure playwright with residential proxies
Yelp’s Cloudflare layer checks TLS fingerprints and browser headers. a plain requests call will get blocked immediately. use Playwright with a residential proxy to get a real-browser fingerprint:
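import asyncio
from urllib.parse import urlsplit

from playwright.async_api import async_playwright

async def fetch_page(url: str, proxy_url: str) -> str:
    # Playwright documents proxy credentials as separate fields,
    # so split them out of the URL instead of embedding them in "server"
    parts = urlsplit(proxy_url)
    proxy = {"server": f"{parts.scheme}://{parts.hostname}:{parts.port}"}
    if parts.username:
        proxy["username"] = parts.username
        proxy["password"] = parts.password or ""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True, proxy=proxy)
        try:
            context = await browser.new_context(
                user_agent=(
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/124.0.0.0 Safari/537.36"
                ),
                locale="en-US",
                viewport={"width": 1280, "height": 800},
            )
            page = await context.new_page()
            await page.goto(url, wait_until="domcontentloaded", timeout=30000)
            await page.wait_for_timeout(2000)  # let JS settle
            return await page.content()
        finally:
            # always release the browser, even if navigation times out
            await browser.close()

def proxy_url_for_session(session_id: int) -> str:
    from config import PROXY_HOST, PROXY_PORT, PROXY_USER, PROXY_PASS
    # the "-session-<id>" username suffix pins requests to one exit IP
    return (
        f"http://{PROXY_USER}-session-{session_id}:{PROXY_PASS}"
        f"@{PROXY_HOST}:{PROXY_PORT}"
    )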
see the Playwright Python docs for full context on browser configuration options.
if it breaks: if you get a CAPTCHA page, the proxy IP is flagged. increment your session ID to rotate to a fresh IP. if CAPTCHAs persist, switch to a provider with a larger residential pool.
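rotating on a CAPTCHA only works if the scraper can tell it received one. a heuristic sketch follows. the marker strings are assumptions, not known Yelp or Cloudflare page text, so replace them with strings from block pages you actually receive. looks_blocked and next_session are hypothetical helper names.

```python
# Hypothetical marker strings; collect real ones from block pages you receive.
BLOCK_MARKERS = ("captcha", "unusual activity", "access denied")


def looks_blocked(html: str, status: int = 200) -> bool:
    """Heuristic: True if the response looks like a block or CAPTCHA page."""
    if status in (403, 429):
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def next_session(session_id: int, html: str, status: int = 200) -> int:
    """Advance to a fresh proxy session only when the page looks blocked."""
    return session_id + 1 if looks_blocked(html, status) else session_id
```

substring matching can misfire on a legitimate page that happens to mention one of the markers, so it is worth checking that a page flagged as blocked also yields zero parsed listings before discarding it.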
step 5: parse business listings from the search results page
Yelp renders listings server-side with embedded JSON. the cleanest extraction path is the __NEXT_DATA__ JSON blob in the page source:
import json
from bs4 import BeautifulSoup

def parse_listings(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    script = soup.find("script", {"id": "__NEXT_DATA__"})
    if not script or not script.string:
        return []
    try:
        data = json.loads(script.string)
        results = (
            data["props"]["pageProps"]["searchPageProps"]
            ["mainContentComponentsListProps"]
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        # malformed JSON or a restructured payload: treat as no results
        return []
    listings = []
    for item in results:
        if not isinstance(item, dict):
            continue
        # skip components that are not business entries
        if not (item.get("bizId") or item.get("businessName")):
            continue
        listings.append({
            "name": item.get("businessName", ""),
            "rating": item.get("rating", ""),
            "review_count": item.get("reviewCount", ""),
            "address": item.get("formattedAddress", ""),
            "phone": item.get("phone", ""),
            "yelp_url": "https://www.yelp.com" + item.get("businessUrl", ""),
            "categories": ", ".join(
                c.get("title", "") for c in item.get("categories", [])
            ),
        })
    return listings
if it breaks: Yelp periodically restructures the __NEXT_DATA__ JSON. if parse_listings returns empty lists on valid pages, print data.keys() and trace the new path manually.
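tracing a new path by hand through a large JSON blob is slow. a small recursive search can list every path at which a known field name appears. this is a generic sketch, and the sample payload below is invented to stand in for a restructured page, not a real Yelp structure.

```python
import json


def find_key_paths(node, target, path=()):
    """Yield the path (tuple of keys and list indexes) to every `target` key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == target:
                yield path + (key,)
            yield from find_key_paths(value, target, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_key_paths(value, target, path + (index,))


# Made-up structure standing in for a restructured payload.
sample = json.loads('{"props": {"page": {"results": [{"businessName": "Acme Dental"}]}}}')
for found in find_key_paths(sample, "businessName"):
    print(found)
```

run it against the parsed data with a field you expect to survive a restructure, such as businessName, and the common prefix of the printed paths is the new location of the results list.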
step 6: add retry logic and rate limiting
wrap your fetch function with tenacity for automatic retries on transient failures:
import asyncio
from tenacity import RetryError, retry, stop_after_attempt, wait_exponential
from config import REQUESTS_PER_MINUTE, SESSION_ROTATE_EVERY

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=10))
async def fetch_with_retry(url, session_id):
    return await fetch_page(url, proxy_url_for_session(session_id))

async def scrape_urls(url_list):
    results = []
    session_id = 1
    for i, url in enumerate(url_list):
        # move to a fresh proxy session every SESSION_ROTATE_EVERY pages
        if i % SESSION_ROTATE_EVERY == 0 and i > 0:
            session_id += 1
        try:
            html = await fetch_with_retry(url, session_id)
        except RetryError:
            # all three attempts failed: skip this page and keep the run alive
            continue
        results.extend(parse_listings(html))
        await asyncio.sleep(60 / REQUESTS_PER_MINUTE)
    return results
if it breaks: if you are hitting Yelp’s per-IP rate limit (repeated 429s), reduce REQUESTS_PER_MINUTE to 10 and lower SESSION_ROTATE_EVERY to 3 so each session serves fewer pages before rotating.
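a fixed sleep between requests produces perfectly regular timing, which may itself look automated. one option is to randomize each delay around the same average. a sketch, assuming a 30% spread is acceptable: the jittered_delay name and the jitter default are illustrative, not tuned values.

```python
import random


def jittered_delay(requests_per_minute: float, jitter: float = 0.3, rng=random) -> float:
    """Seconds to sleep before the next request, spread evenly around the base interval."""
    base = 60 / requests_per_minute
    return base * rng.uniform(1 - jitter, 1 + jitter)


print(round(jittered_delay(20), 2))
```

to use it, pass jittered_delay(REQUESTS_PER_MINUTE) to asyncio.sleep in place of the fixed 60 / REQUESTS_PER_MINUTE. the spread is symmetric, so the long-run request rate stays the same.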
step 7: write output to CSV or database
import csv

def save_to_csv(records: list[dict], path: str):
    if not records:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
for larger runs, insert directly into Postgres or Supabase. if you are building a multi-account local intelligence operation, the community at multiaccountops.com/blog/ has good patterns for structuring per-client databases.
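a database insert might look like the sketch below. it uses SQLite from the standard library as a stand-in for Postgres or Supabase so it runs without a server. the table name, the column layout, and the choice of yelp_url as the primary key are assumptions built on the record fields from step 5.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    yelp_url TEXT PRIMARY KEY,
    name TEXT, rating TEXT, review_count TEXT,
    address TEXT, phone TEXT, categories TEXT,
    scraped_at TEXT NOT NULL
)
"""
FIELDS = ("yelp_url", "name", "rating", "review_count", "address", "phone", "categories")


def save_to_sqlite(records: list[dict], conn: sqlite3.Connection) -> None:
    """Upsert records keyed on yelp_url, stamping each row with the scrape time."""
    conn.execute(SCHEMA)
    now = datetime.now(timezone.utc).isoformat()
    rows = [tuple(str(r.get(f, "")) for f in FIELDS) + (now,) for r in records]
    placeholders = ", ".join("?" for _ in range(len(FIELDS) + 1))
    conn.executemany(
        f"INSERT OR REPLACE INTO listings ({', '.join(FIELDS)}, scraped_at) "
        f"VALUES ({placeholders})",
        rows,
    )
    conn.commit()
```

keying on yelp_url means a business that shows up in several overlapping searches is stored once, with the most recent scrape winning. on Postgres the equivalent is an INSERT with an ON CONFLICT clause.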
common pitfalls
using datacenter proxies. Yelp blocks ASNs associated with AWS, GCP, and common datacenter proxy providers very aggressively. residential or mobile proxies only. see our residential proxy provider comparison for current rankings.
ignoring the 240-result cap. Yelp caps search results at 24 pages (240 listings) per query. if your vertical has more than 240 businesses in a city, break it down by neighborhood, zip code, or more specific subcategories. “restaurant” in “New York, NY” is useless. “sushi restaurant” in “Brooklyn, NY 11215” is workable.
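splitting a city into zip-level searches can be sketched as follows. the zip codes are placeholders you would supply yourself, both helper names are hypothetical, and the dedupe step assumes adjacent zip searches are likely to return some of the same businesses.

```python
from urllib.parse import urlencode


def split_by_zip(vertical: str, city: str, zip_codes: list[str]) -> list[str]:
    """Build one first-page search URL per zip code instead of one per city."""
    urls = []
    for zip_code in zip_codes:
        query = urlencode(
            {"find_desc": vertical, "find_loc": f"{city} {zip_code}", "start": 0}
        )
        urls.append(f"https://www.yelp.com/search?{query}")
    return urls


def dedupe_by_url(listings: list[dict]) -> list[dict]:
    """Keep the first record seen for each yelp_url."""
    seen = {}
    for listing in listings:
        seen.setdefault(listing["yelp_url"], listing)
    return list(seen.values())


print(split_by_zip("sushi restaurant", "Brooklyn, NY", ["11215"])[0])
```

if a single zip still returns the full 24 pages, that query has probably hit the cap as well and needs a narrower subcategory.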
scraping too fast. running 50 concurrent Playwright instances on a single proxy session will get that session flagged within minutes. concurrency and session rotation need to be matched. at 20 requests/minute, use one session per 5 requests then rotate.
not handling partial page loads. Yelp’s search page sometimes loads skeleton content before the actual listing data. using wait_until="domcontentloaded" alone is not enough. the extra wait_for_timeout(2000) in step 4 exists for this reason. skip it and you will parse empty listing shells.
storing raw HTML forever. at scale, storing full HTML files eats disk fast. parse and discard immediately, or compress HTML to a content-addressed store if you need to reparse later.
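a content-addressed store can be as simple as naming each compressed file after the hash of its contents. a minimal sketch using only the standard library: the two-character directory fan-out and the .html.gz suffix are arbitrary choices, and store_html and load_html are hypothetical names.

```python
import gzip
import hashlib
from pathlib import Path


def store_html(html: str, root: Path) -> Path:
    """Write gzip-compressed HTML under its SHA-256 digest; identical pages share one file."""
    payload = html.encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()
    path = root / digest[:2] / f"{digest}.html.gz"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(gzip.compress(payload))
    return path


def load_html(path: Path) -> str:
    """Read a stored page back as text for reparsing."""
    return gzip.decompress(path.read_bytes()).decode("utf-8")
```

store the returned path (or just the digest) alongside each parsed record so a later parser fix can be replayed against the saved pages without spending proxy bandwidth again.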
scaling this
10x (tens of thousands of listings): run the async scraper with a concurrency limit of 5 using asyncio.Semaphore. a single VPS handles this comfortably overnight. output to CSV or SQLite.
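the semaphore pattern mentioned above looks roughly like this. fake_fetch is a stand-in so the sketch runs on its own. in the real scraper you would pass a coroutine that wraps fetch_with_retry, and gather_limited is a hypothetical helper name.

```python
import asyncio


async def gather_limited(urls, fetch, limit: int = 5):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        async with semaphore:
            return await fetch(url)

    # return_exceptions=True keeps one failed page from aborting the batch;
    # failures come back as exception objects in the result list
    return await asyncio.gather(*(worker(u) for u in urls), return_exceptions=True)


async def fake_fetch(url):
    """Stand-in for the real fetch coroutine, for illustration only."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(gather_limited([f"u{i}" for i in range(20)], fake_fetch))
print(len(pages))
```

results come back in the same order as the input URLs. with several requests in flight at once, give each worker its own proxy session ID rather than sharing one, for the reason given under "scraping too fast".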
100x (hundreds of thousands of listings): split your URL list into batches and distribute across 3-5 VPS instances. each instance uses its own proxy sub-account or username prefix so sessions don’t collide. use Postgres with a scraped_at timestamp column so you can deduplicate and resume after failures. expect 15-30 GB of proxy bandwidth.
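one way to split the URL list so that instances never overlap, and can resume after a crash, is to assign each URL to a shard by hashing it. a sketch under that assumption: the function names are hypothetical, and done stands for whatever set of already-scraped URLs you load from the database.

```python
import hashlib


def shard_for(url: str, shard_count: int) -> int:
    """Stable shard index for a URL; the same URL always maps to the same instance."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def urls_for_instance(urls, instance_index: int, shard_count: int, done=frozenset()):
    """URLs this instance owns, minus any already recorded as scraped."""
    return [
        u for u in urls
        if shard_for(u, shard_count) == instance_index and u not in done
    ]
```

hashing avoids shipping a batch file to each machine: every instance can generate the full URL list locally and keep only its own share. the trade-off is that changing the number of instances reshuffles ownership, so the done set matters when you scale up mid-run.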
1000x (millions of listings, ongoing monitoring): at this volume you need a task queue, not a script. move to Celery or RQ with Redis as the broker. each worker is a separate process on a separate machine. proxy costs become your dominant expense, so negotiate volume pricing with your provider. you will also want to monitor your proxy hit rate and block rate per provider pool and route around flagged ranges automatically. if you are running operations at this scale across multiple verticals and geographies, the anti-detect and session management patterns at antidetectreview.org/blog/ are worth reading for browser fingerprint hardening.
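per-pool block-rate monitoring can start as a pair of counters. the sketch below is an in-memory illustration only: the PoolStats class, the 20% threshold, and the 50-request minimum are assumptions, and a production version would persist counts somewhere shared such as Redis.

```python
from collections import defaultdict


class PoolStats:
    """Track requests and blocks per proxy pool and flag pools with a high block rate."""

    def __init__(self, max_block_rate: float = 0.2, min_requests: int = 50):
        self.max_block_rate = max_block_rate
        self.min_requests = min_requests
        self.requests = defaultdict(int)
        self.blocks = defaultdict(int)

    def record(self, pool: str, blocked: bool) -> None:
        """Count one request against a pool, and one block if it was blocked."""
        self.requests[pool] += 1
        if blocked:
            self.blocks[pool] += 1

    def block_rate(self, pool: str) -> float:
        total = self.requests.get(pool, 0)
        return self.blocks.get(pool, 0) / total if total else 0.0

    def flagged_pools(self) -> list[str]:
        """Pools with enough traffic to judge whose block rate exceeds the threshold."""
        return sorted(
            pool for pool, total in self.requests.items()
            if total >= self.min_requests and self.block_rate(pool) > self.max_block_rate
        )
```

workers call record after every response and the scheduler skips anything in flagged_pools when handing out sessions. the minimum request count keeps a pool from being dropped on the strength of a few unlucky responses.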
where to go next
- How to scrape Google Maps for local business data in 2026: Google Maps and Yelp have significant overlap in local business coverage but different structures, protections, and data freshness patterns. this tutorial covers the Maps-specific scraping stack.
- Residential proxy provider comparison for scraping in 2026: a current side-by-side of Bright Data, Oxylabs, ProxyScrape, and Smartproxy on success rate, bandwidth cost, and pool size for targets like Yelp, Google, and Amazon.
- Building a local business lead database: end to end: how to combine Yelp, Google Maps, and LinkedIn data into a structured lead database with deduplication, enrichment, and CRM export pipelines.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.