How to scrape Yellow Pages at scale in 2026 with proxies that work
Yellow Pages is still one of the most complete directories of US small businesses, with name, phone, address, category, and reviews all on one page. if you are building a lead list, enriching a CRM, or doing competitive research on local service providers, it is a reasonable first stop. the problem is that the site blocks datacenter IPs aggressively, rotates its HTML structure without notice, and will silently return incomplete pages if it thinks you are a bot.
i have been running scrapers against local business directories since 2019, and Yellow Pages is one of the more annoying ones. the fingerprinting is not sophisticated, but the combination of Cloudflare, IP-level blocks, and subtle anti-bot tweaks means a naive scraper works fine for 200 requests and then falls apart at 2,000. this guide is for operators who have already run basic scrapers and hit that wall, or who want to build it right the first time.
by the end you will have a working Scrapy spider that rotates residential proxies, handles retries gracefully, and exports clean JSON or CSV. i will also show you what breaks at 10x and 100x and what you need to change.
what you need
- Python 3.11+ and a virtual environment
- Scrapy 2.11 (pip install scrapy): the Scrapy docs are thorough and worth reading
- scrapy-rotating-proxies or a custom downloader middleware
- Residential proxy pool: datacenter IPs are blocked on most Yellow Pages category pages. ProxyScraping residential proxies start at around $4/GB as of May 2026. you need US-located exits specifically
- BeautifulSoup4 or parsel (already bundled with Scrapy) for parsing
- A target URL list: category + city combinations, e.g. https://www.yellowpages.com/search?search_terms=plumber&geo_location_terms=Houston%2C+TX
- Budget estimate: for 100k listings expect 15-25 GB of proxy traffic depending on retry rate. factor $60-100 in proxy spend for a one-shot pull
- Optional: Playwright or Splash if you hit JavaScript-rendered sections (less common on Yellow Pages listing pages, more common on individual business profiles)
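the budget line in the list above is plain arithmetic, and it is worth re-running with your own numbers. a minimal sketch, assuming the 15-25 GB per 100k listings range and the roughly $4/GB price quoted above; the function name and its parameters are illustrative, not part of any library:

```python
def estimate_proxy_cost(listings, gb_per_100k=(15, 25), usd_per_gb=4.0):
    """Return (low, high) proxy spend in USD for a one-shot pull.

    Assumes traffic scales linearly with listing count, which is only
    a rough approximation once retry rates change.
    """
    scale = listings / 100_000
    low_gb, high_gb = (gb * scale for gb in gb_per_100k)
    return low_gb * usd_per_gb, high_gb * usd_per_gb


low, high = estimate_proxy_cost(100_000)
print(f"${low:.0f}-${high:.0f}")  # $60-$100
```

swap in your provider's actual per-GB rate and your observed traffic per listing once you have a few test runs in the log.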
before you write a single line of code, check yellowpages.com/robots.txt and review their Terms of Service. scraping publicly available business contact data sits in a gray area; the legal picture varies by jurisdiction and use case, and this is not legal advice. if you are operating commercially, talk to a lawyer familiar with the CFAA and relevant state laws.
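the robots.txt check can be scripted with the standard library's urllib.robotparser. a minimal sketch, assuming you have already fetched the file's lines; the rules below are made up for illustration and are not Yellow Pages' actual robots.txt:

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Check a URL against already-fetched robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only. For a real crawl, point the
# parser at the live file with set_url(...) followed by read().
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]
print(is_allowed(sample_rules, "https://example.com/search?q=plumber"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

running a check like this over your seed list before each crawl gives you a record of what the file said at the time.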
step by step
step 1: set up the project and environment
python -m venv yp_env
source yp_env/bin/activate
pip install scrapy scrapy-rotating-proxies beautifulsoup4 lxml
scrapy startproject yellowpages_scraper
cd yellowpages_scraper
expected output: standard Scrapy project scaffold with spiders/, settings.py, middlewares.py, pipelines.py.
if it breaks: if scrapy is not found after install, check that your virtualenv is activated and that which python points to the venv.
step 2: build your URL seed list
Yellow Pages search URLs follow a predictable pattern. build a CSV of category + city combinations:
import csv
import urllib.parse

categories = ["plumber", "electrician", "dentist", "hvac", "roofing contractor"]
cities = ["Houston TX", "Phoenix AZ", "Chicago IL", "Los Angeles CA", "Philadelphia PA"]

with open("seeds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url"])
    for cat in categories:
        for city in cities:
            params = urllib.parse.urlencode({
                "search_terms": cat,
                "geo_location_terms": city
            })
            writer.writerow([f"https://www.yellowpages.com/search?{params}"])
expected output: seeds.csv with 25 rows (5 categories x 5 cities). scale this to your actual target list.
if it breaks: URL encoding issues with special characters in city names. test each URL manually in a browser before running at volume.
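one cheap way to catch encoding problems before a browser test is to round-trip each URL through the parser and confirm the terms come back unchanged. a minimal sketch using only the standard library; the helper names are illustrative:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "https://www.yellowpages.com/search"


def build_search_url(term, location):
    """Build a search URL with both parameters percent-encoded."""
    query = urlencode({"search_terms": term, "geo_location_terms": location})
    return f"{BASE}?{query}"


def round_trips(url, term, location):
    """True if decoding the URL yields exactly the original inputs."""
    query = parse_qs(urlsplit(url).query)
    return (query.get("search_terms") == [term]
            and query.get("geo_location_terms") == [location])


url = build_search_url("plumber", "Houston, TX")
print(url)
print(round_trips(url, "plumber", "Houston, TX"))
```

this only proves the encoding is lossless. it does not prove the site accepts the location string, so the manual browser check still applies.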
step 3: write the spider
# yellowpages_scraper/spiders/yp_spider.py
import csv

import scrapy


class YPSpider(scrapy.Spider):
    name = "yp"
    custom_settings = {
        "DOWNLOAD_DELAY": 2,
        "RANDOMIZE_DOWNLOAD_DELAY": True,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2,
        "AUTOTHROTTLE_ENABLED": True,
    }

    def start_requests(self):
        with open("seeds.csv") as f:
            reader = csv.DictReader(f)
            for row in reader:
                yield scrapy.Request(row["url"], callback=self.parse_listing_page)

    def parse_listing_page(self, response):
        cards = response.css("div.result")
        for card in cards:
            yield {
                "name": card.css("a.business-name span::text").get("").strip(),
                "phone": card.css("div.phones.phone.primary::text").get("").strip(),
                "address": card.css("span.street-address::text").get("").strip(),
                "city": card.css("span.locality::text").get("").strip(),
                "category": card.css("div.categories a::text").get("").strip(),
                "url": card.css("a.business-name::attr(href)").get(""),
            }
        # pagination
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_listing_page)
expected output: items flowing into your Scrapy pipeline or a JSONL file.
if it breaks: CSS selectors are the first thing that breaks when Yellow Pages updates its markup. if you get empty results, open a saved HTML file and inspect the actual class names. Yellow Pages has changed from v-card to result and back more than once.
step 4: configure proxy rotation
in settings.py:
ROTATING_PROXY_LIST_PATH = "proxies.txt" # one proxy per line: http://user:pass@host:port
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
ROTATING_PROXY_BAN_POLICY = "rotating_proxies.policy.BanDetectionPolicy"
populate proxies.txt with your residential proxy endpoints. ProxyScraping’s residential proxy format is typically http://username:password@HOST:PORT. see the ProxyScraping dashboard for the exact endpoint string.
if it breaks: if all proxies get marked as banned within minutes, your ban detection policy is too aggressive. add a custom policy that only marks a proxy banned on HTTP 403 or 429, not on empty results.
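a custom policy along those lines can be quite small. this is a sketch, assuming scrapy-rotating-proxies only needs a class exposing response_is_ban and exception_is_ban (which is how its README describes custom policies); the class name and the module path in the comment are hypothetical:

```python
class StatusOnlyBanPolicy:
    """Ban policy sketch: only explicit block statuses count as bans."""

    BAN_STATUSES = {403, 429}

    def response_is_ban(self, request, response):
        # Empty pages and selector misses are deliberately NOT bans here.
        return response.status in self.BAN_STATUSES

    def exception_is_ban(self, request, exception):
        # None means "unknown": timeouts and connection errors neither
        # ban nor clear the proxy.
        return None


# settings.py (module path is an assumption about your project layout):
# ROTATING_PROXY_BAN_POLICY = "yellowpages_scraper.policy.StatusOnlyBanPolicy"
```

the trade-off is that soft blocks returned with a 200 status will no longer burn a proxy, so they need to be caught separately.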
step 5: set realistic request headers
Yellow Pages checks User-Agent and sometimes Referer. a bare Scrapy default gets flagged fast.
# settings.py
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "https://www.yellowpages.com/",
}
rotate User-Agents using scrapy-fake-useragent if you are running more than a few thousand requests per day.
if it breaks: if you see a Cloudflare challenge page in the response HTML (look for cf-browser-verification in response.text), you need either a Cloudflare-bypass service or a headless browser layer. for most Yellow Pages listing pages this is not necessary, but individual business profile pages sometimes trigger it.
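the marker check described above fits in a few lines. a minimal sketch, assuming cf-browser-verification is the only marker you care about; the tuple is there so you can add other strings you observe in your own saved responses:

```python
# Marker taken from the text above; extend with strings you actually see.
CHALLENGE_MARKERS = ("cf-browser-verification",)


def is_cloudflare_challenge(html):
    """True if the response body contains a known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)


# Inside a spider callback this could be used as:
#     if is_cloudflare_challenge(response.text):
#         self.logger.warning("challenge page at %s", response.url)
```

counting these warnings per run tells you whether a headless browser layer is worth the added cost.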
step 6: run and export
scrapy crawl yp -o output.jsonl --logfile=yp.log
monitor the log for [rotating_proxies] ban messages and [scrapy] item counts. a healthy run on 25 seed URLs should produce 500-1500 items depending on category density.
if it breaks: if item count is zero but you see 200 responses, your CSS selectors are wrong. save a raw response with response.body to a file and inspect it: open("/tmp/debug.html", "wb").write(response.body).
step 7: clean and deduplicate output
Yellow Pages can return the same business in multiple category searches. deduplicate on phone number or the business URL slug:
import json

seen = set()
clean = []
with open("output.jsonl") as f:
    for line in f:
        item = json.loads(line)
        key = item.get("phone") or item.get("url")
        if key and key not in seen:
            seen.add(key)
            clean.append(item)

with open("output_clean.jsonl", "w") as f:
    for item in clean:
        f.write(json.dumps(item) + "\n")

print(f"{len(clean)} unique records")
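deduplicating on phone works better if the number is normalized first, since the same number can appear with different punctuation. a minimal sketch, assuming US numbers only; the helper name is illustrative:

```python
import re


def normalize_phone(raw):
    """Reduce a US phone string to 10 digits, or '' if it does not fit."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the US country code
    return digits if len(digits) == 10 else ""


print(normalize_phone("(713) 555-0100"))   # 7135550100
print(normalize_phone("+1 713.555.0100"))  # 7135550100
```

in the dedup loop the key would then become normalize_phone(item.get("phone")) or item.get("url"), so records with a missing or malformed phone still fall back to the URL.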
common pitfalls
using datacenter proxies. Yellow Pages detects ASN ranges associated with hosting providers. even premium datacenter proxies from major vendors get blocked on category search pages. residential is the minimum; mobile proxies are better for high-volume scraping but cost more ($10-15/GB in most markets as of early 2026).
ignoring pagination depth. some category+city combos have 50+ pages of results. if your spider stops at page 1 because it did not find a next link (maybe the selector changed), you silently miss 90% of the data. always log the item count per seed URL and spot-check it against what you see in a browser.
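the per-seed count can be computed from the output file after the run. this sketch assumes each item carries a seed_url field, which the spider in step 3 does not emit as written; you would need to pass the seed along (for example through cb_kwargs) and add it to the yielded dict. the function names and the threshold are illustrative:

```python
import json
from collections import Counter


def items_per_seed(jsonl_lines, seed_field="seed_url"):
    """Count scraped items per seed URL from JSONL lines."""
    counts = Counter()
    for line in jsonl_lines:
        if line.strip():
            counts[json.loads(line).get(seed_field, "<unknown>")] += 1
    return counts


def thin_seeds(counts, all_seeds, min_items):
    """Seeds that produced fewer than min_items, including zero."""
    return sorted(s for s in all_seeds if counts.get(s, 0) < min_items)


sample = [
    '{"seed_url": "a", "name": "x"}',
    '{"seed_url": "a", "name": "y"}',
    '{"seed_url": "b", "name": "z"}',
]
print(thin_seeds(items_per_seed(sample), ["a", "b", "c"], min_items=2))
```

seeds that show up in the thin list are the ones to spot-check in a browser first.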
scraping too fast. DOWNLOAD_DELAY = 0 with 16 concurrent requests will get your proxy range banned within minutes. Yellow Pages rate-limits at the IP level, not the session level, so burning through proxies fast is expensive. 2-second delays with auto-throttle enabled are a reasonable starting point.
not handling soft blocks. Yellow Pages sometimes returns a valid 200 response with a CAPTCHA or “please verify you are human” page instead of results. your spider will happily scrape zero items from these and mark the request successful. add a middleware that checks response body length or a known element presence and re-queues the request with a fresh proxy.
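the detection half of that middleware can be a pure function, which makes it easy to test against saved pages. a minimal sketch; the length threshold, the result marker, and the phrases are all assumptions to tune against your own saved responses:

```python
def looks_like_soft_block(status, body, min_length=5000,
                          required_marker='class="result"',
                          block_phrases=("verify you are human", "captcha")):
    """Heuristic check for a 200 response that is not a real results page."""
    if status != 200:
        return False  # hard failures are the ban policy's job
    lowered = body.lower()
    if any(phrase in lowered for phrase in block_phrases):
        return True
    if len(body) < min_length:
        return True
    # A full-size page with no result cards is also suspicious.
    return required_marker not in body


print(looks_like_soft_block(200, "<html>Please verify you are human</html>"))
```

in a downloader middleware's process_response you could return request.replace(dont_filter=True) when this fires so Scrapy reschedules the request, ideally with a retry counter in request.meta so a genuinely empty results page does not loop forever.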
ignoring the HTML version vs. API version. Yellow Pages has an internal API used by its own frontend. if you can reverse-engineer the XHR calls in DevTools, you can often get cleaner JSON responses. the endpoints are not officially documented and change without notice, but for short-term projects they are faster and easier to parse than the HTML. this is not covered in this guide but it is worth knowing about.
scaling this
10x (250k records, 5 states). the main change is proxy volume. at 250k records expect 40-60 GB of residential traffic. move your seed list to a database and drive crawls from a script with Scrapy’s CrawlerProcess rather than one-shot CLI runs; queue every crawl before calling start() once, because the Twisted reactor cannot be restarted within a process. add a pipeline that writes to PostgreSQL instead of flat files so you can query incrementally.
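the shape of such a pipeline is sketched below. to keep the example self-contained it uses the standard library's sqlite3 as a stand-in for PostgreSQL, so the SQL dialect and connection handling would need adapting; the class name and table layout are illustrative:

```python
import sqlite3


class SqliteUpsertPipeline:
    """Item pipeline sketch: one row per business URL, replaced on re-scrape."""

    def __init__(self, db_path="listings.db"):
        self.db_path = db_path

    def open_spider(self, spider):
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS listings ("
            "url TEXT PRIMARY KEY, name TEXT, phone TEXT, "
            "address TEXT, city TEXT, category TEXT)"
        )

    def process_item(self, item, spider):
        # Named placeholders expect every field the spider yields.
        self.conn.execute(
            "INSERT OR REPLACE INTO listings "
            "(url, name, phone, address, city, category) "
            "VALUES (:url, :name, :phone, :address, :city, :category)",
            dict(item),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

it would be enabled through ITEM_PIPELINES in settings.py, for example {"yellowpages_scraper.pipelines.SqliteUpsertPipeline": 300}, assuming the class lives in the project's pipelines.py.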
100x (2.5M records, full US directory). at this point you need distributed crawling. Scrapy-Redis is the standard approach: multiple Scrapy workers share a Redis queue. run workers on 4-8 cheap VMs (a $6/month Hetzner CX21 per worker is fine). proxy costs become the dominant expense at roughly $400-800 for a full pull depending on retry rate. also build in a resume mechanism because a full-US crawl will take days and something will fail mid-run.
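with Scrapy-Redis, the shared queue and much of the resume behavior come from a handful of settings. a sketch of the settings.py additions, using the setting names from the scrapy-redis README; the Redis URL is a placeholder for your own host:

```python
# settings.py additions for scrapy-redis (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and the seen-URL set in Redis between runs,
# so a restarted worker continues instead of starting over.
SCHEDULER_PERSIST = True

# Assumption: replace with the address of your own Redis instance.
REDIS_URL = "redis://localhost:6379"
```

every worker runs the same spider with these settings and pulls from the same queue, so adding capacity is a matter of starting another VM.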
1000x (ongoing, refreshed monthly). you are no longer doing one-shot scraping, you are running a data pipeline. this means scheduling, change detection, incremental updates, and monitoring. consider whether Scrapy is still the right tool or whether a managed browser platform (Browserless, Apify) makes more sense at this scale. your proxy contract should be negotiated, not pay-as-you-go. if you are building a product on top of this data, also look at how anti-detect browser profiles can help maintain session persistence at scale, which is covered in more depth over at antidetectreview.org/blog/.
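change detection between monthly runs can start as a fingerprint comparison keyed on the business URL. a minimal sketch, assuming both runs are loaded as dicts mapping URL to item; the tracked fields mirror the spider's output and the function names are illustrative:

```python
import hashlib
import json

TRACKED_FIELDS = ("name", "phone", "address", "city", "category")


def fingerprint(item):
    """Stable hash over the fields whose changes we care about."""
    payload = json.dumps([item.get(f, "") for f in TRACKED_FIELDS],
                         ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_runs(previous, current):
    """Return (added, removed, changed) lists of business URLs."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        url for url in current.keys() & previous.keys()
        if fingerprint(current[url]) != fingerprint(previous[url])
    )
    return added, removed, changed
```

storing the fingerprint alongside each row means the next run only needs to compare hashes rather than reload full records.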
for infrastructure patterns at all three tiers, also see the general guide on scaling scrapers with residential proxies.
where to go next
- How to scrape Google Maps in 2026 with Python and rotating proxies: Google Maps has overlapping coverage with Yellow Pages for local businesses and is worth running in parallel for enrichment.
- Best residential proxies for scraping in 2026: tested and ranked. if you are still picking a proxy provider, this is the comparison that covers ProxyScraping, Oxylabs, Bright Data, and Smartproxy side by side with actual block rates on common targets.
- Scrapy middleware patterns: handling bans, retries, and session rotation. goes deeper on the middleware stack than this tutorial does.
The HTTP semantics RFC (RFC 9110) is dry reading but understanding status codes at the spec level helps when you are debugging edge cases in retry logic. the Python urllib.robotparser module is the standard library way to check robots.txt programmatically before each crawl run.
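as one illustration of status codes feeding retry logic, the sketch below maps a response status to an action. the status classes are standard HTTP, but the mapping to actions is my own assumption about what is sensible for this kind of crawl, not something the RFC prescribes:

```python
def retry_action(status):
    """Map an HTTP status to a crawl action (illustrative policy only)."""
    if 200 <= status < 300:
        return "ok"
    if status in (403, 429):
        return "rotate_proxy"  # likely an IP-level block or rate limit
    if status in (408, 500, 502, 503, 504):
        return "retry_with_backoff"  # plausibly transient
    if 400 <= status < 500:
        return "drop"  # other client errors will not fix themselves
    return "inspect"  # anything unexpected, e.g. unhandled 3xx


for code in (200, 429, 503, 404):
    print(code, retry_action(code))
```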
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.