How to scrape Zillow at scale in 2026 with proxies that work
Zillow is one of the most aggressively anti-scraping real estate platforms on the web. their bot detection has tightened considerably since 2024, and if you’re still running naive requests-based scrapers with datacenter IPs, you’re going to spend most of your time debugging 403s and CAPTCHA walls instead of collecting data. i’ve been through that cycle more than once, so this guide skips the dead ends.
this tutorial is for operators, data engineers, and developers who need structured Zillow listing data, whether that’s for a rental analytics product, a real estate aggregator, or market research. you don’t need to be a scraping expert to follow this, but you should be comfortable with Python and running scripts from the command line. by the end, you’ll have a working scraper that rotates residential proxies through ProxyScraping, handles Zillow’s bot checks, and outputs clean JSON that you can pipe into any downstream system.
one note before we start: Zillow’s Terms of Use prohibit automated data collection without written consent. this guide is written for legitimate use cases, research, and understanding how anti-bot systems work. check your legal obligations in your jurisdiction before running this in production. this is not legal advice.
what you need
tools and libraries
- Python 3.11+
- Playwright for Python (playwright>=1.44)
- parsel for HTML parsing
- httpx for async HTTP if you want headless-lite mode
- jq for inspecting JSON output at the terminal
proxy infrastructure
- ProxyScraping residential proxy plan, minimum the Starter tier ($75/month for 8GB as of Q1 2026). datacenter IPs will not work reliably on Zillow. residential or mobile only.
- optionally: a sticky session endpoint if you need multi-page pagination on the same IP
accounts and access
- ProxyScraping dashboard account for credentials and usage tracking
- a Supabase or Postgres instance if you’re persisting results (optional but recommended at scale)
costs to budget
- proxy bandwidth: Zillow pages are heavy, 800KB to 2MB per page including XHR. at 1000 pages/day expect 1-2GB of bandwidth a day
- a residential proxy plan at $75/month covers moderate-scale collection. for larger volumes, budget $400-800/month depending on provider
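to sanity-check a bandwidth budget before you buy a plan, the arithmetic above can be sketched in a few lines. this is a rough estimate only: the function name and the decimal GB conversion are my own assumptions, and the per-page weights are the 800KB to 2MB range quoted above.

```python
def estimate_bandwidth_gb(pages_per_day, kb_low=800, kb_high=2000, days=1):
    """Return a (low, high) bandwidth estimate in GB for a given page volume."""
    kb_per_gb = 1_000_000  # assumption: decimal units, 1 GB = 1,000,000 KB
    total_pages = pages_per_day * days
    return total_pages * kb_low / kb_per_gb, total_pages * kb_high / kb_per_gb


daily = estimate_bandwidth_gb(1000)
monthly = estimate_bandwidth_gb(1000, days=30)
print(f"daily:   {daily[0]:.1f}-{daily[1]:.1f} GB")
print(f"monthly: {monthly[0]:.0f}-{monthly[1]:.0f} GB")
```

run it with your own page counts and compare the monthly figure against your plan’s allowance, since plans are usually sold by the month rather than by the day.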
background reading
- Zillow’s robots.txt documents which paths are disallowed. read it before you decide what to collect.
step by step
step 1: install dependencies
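pip install playwright parsel httpx
playwright install chromium
expected output: playwright downloads a ~130MB Chromium binary. this only runs once. note that jq is a standalone command-line tool: install it through your system package manager (for example, apt install jq or brew install jq), since pip’s jq package only provides Python bindings.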
if it breaks: if you’re on a headless Linux server, run playwright install-deps chromium first to install system dependencies.
step 2: configure your ProxyScraping residential proxy endpoint
go to your ProxyScraping dashboard, copy your residential proxy credentials. the endpoint format looks like:
http://user-YOURUSER:YOURPASS@residential.proxyscraping.com:8080
set this as an environment variable so it never appears in your source code:
export PROXY_URL="http://user-YOURUSER:YOURPASS@residential.proxyscraping.com:8080"
for sticky sessions (useful when paginating through search results), use the session ID parameter documented in your dashboard.
if it breaks: if you’re getting 407 [Proxy Authentication](https://proxyscraping.org/blog/proxy-authentication-user-pass-vs-ip-whitelist-trade-offs) Required, double-check the username prefix format. ProxyScraping uses user- prefixed usernames in some plan tiers.
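when you’re debugging a 407, it helps to see exactly what your PROXY_URL parses into without printing the password. here is a minimal sketch using only the standard library. the describe_proxy name and the proxy.example.com host are placeholders of mine, not anything ProxyScraping provides.

```python
from urllib.parse import urlparse


def describe_proxy(proxy_url):
    """Split a proxy URL into its parts and mask the password for safe logging."""
    parts = urlparse(proxy_url)
    if not (parts.scheme and parts.hostname and parts.port):
        raise ValueError("proxy URL needs a scheme, host and port")
    if not (parts.username and parts.password):
        raise ValueError("proxy URL is missing a username or password")
    return {
        "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
        "username": parts.username,
        "password": "*" * len(parts.password),  # never log the real value
        "has_user_prefix": parts.username.startswith("user-"),
    }


# placeholder credentials and host, for illustration only
info = describe_proxy("http://user-demo:secret@proxy.example.com:8080")
print(info)
```

if has_user_prefix comes back False and your plan expects the user- prefix, that is the first thing to fix.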
step 3: set up the Playwright browser context with proxy
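import asyncio
import os
from urllib.parse import urlparse
from playwright.async_api import async_playwright
PROXY_URL = os.environ["PROXY_URL"]
def proxy_settings(proxy_url):
    # Playwright's proxy option takes username and password as separate fields;
    # credentials embedded in the server URL may not be picked up by Chromium
    parts = urlparse(proxy_url)
    return {
        "server": f"{parts.scheme}://{parts.hostname}:{parts.port}",
        "username": parts.username,
        "password": parts.password,
    }
async def get_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(
        headless=True,
        proxy=proxy_settings(PROXY_URL),
    )
    context = await browser.new_context(
        viewport={"width": 1280, "height": 900},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        locale="en-US",
    )
    return browser, context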
expected output: no output on success. the browser launches silently.
if it breaks: ERR_PROXY_CONNECTION_FAILED means the proxy server is unreachable. check that you’re not behind a firewall blocking the proxy port. ProxyScraping uses port 8080 on residential plans.
step 4: intercept Zillow’s JSON API instead of parsing HTML
this is where most tutorials go wrong. Zillow loads listing data via an internal GraphQL/REST API, not directly in the HTML. the rendered HTML is sparse. if you intercept the XHR responses, you get clean structured JSON with no parsing headaches.
async def scrape_search_page(context, search_url):
    captured = []
    async def handle_response(response):
        # keep only responses from the search-state endpoint
        if "GetSearchPageState" in response.url:
            try:
                captured.append(await response.json())
            except Exception:
                pass  # body was not JSON or was no longer available; skip it
    page = await context.new_page()
    page.on("response", handle_response)
    await page.goto(search_url, wait_until="networkidle", timeout=60000)
    await page.wait_for_timeout(3000)  # let lazy XHR settle
    await page.close()
    return captured
expected output: captured is a list containing one or more dicts with listing arrays nested under cat1.searchResults.listResults or a similar path, depending on the search type.
if it breaks: if captured is empty, open Zillow in your browser’s DevTools, filter Network by XHR, and confirm the endpoint name hasn’t changed. Zillow occasionally renames internal API paths. the GetSearchPageState key has been stable since mid-2024 but check.
step 5: parse listings from the captured JSON
def extract_listings(raw_data):
    listings = []
    for blob in raw_data:
        try:
            results = blob["cat1"]["searchResults"]["listResults"]
            for r in results:
                listings.append({
                    "zpid": r.get("zpid"),
                    "address": r.get("address"),
                    "price": r.get("price"),
                    "beds": r.get("beds"),
                    "baths": r.get("baths"),
                    "sqft": r.get("area"),
                    "listing_url": r.get("detailUrl"),
                    "status": r.get("statusType"),
                })
        except (KeyError, TypeError):
            continue  # blob does not have the expected shape; skip it
    return listings
expected output: a list of dicts, one per listing, with the fields above. typical Zillow search returns 40 listings per page.
if it breaks: print blob.keys() to see the top-level structure. if Zillow A/B tested you into a different page variant, the path might be cat2 instead of cat1.
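if you’d rather not edit the path by hand every time the variant changes, one option is to try each candidate key in turn. this is a sketch under the assumption that cat2 mirrors the cat1 layout described above. find_list_results is a hypothetical helper and the sample blob is made up for illustration.

```python
def find_list_results(blob, categories=("cat1", "cat2")):
    """Return (category, listings) for the first category holding a non-empty list."""
    if not isinstance(blob, dict):
        return None, []
    for cat in categories:
        node = blob.get(cat)
        search = node.get("searchResults") if isinstance(node, dict) else None
        results = search.get("listResults") if isinstance(search, dict) else None
        if isinstance(results, list) and results:
            return cat, results
    return None, []


# made-up blob in the cat2 variant, for illustration only
sample = {"cat2": {"searchResults": {"listResults": [{"zpid": "123"}]}}}
category, results = find_list_results(sample)
print(category, len(results))
```

returning the category name alongside the results makes it easy to log which variant you were served.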
step 6: handle pagination
Zillow search URLs accept a ?searchQueryState= parameter containing a JSON blob. pagination is controlled by the pagination key inside that blob.
import json
from urllib.parse import urlencode, urlparse, parse_qs
def next_page_url(base_url, current_page):
    parsed = urlparse(base_url)
    qs = parse_qs(parsed.query)
    # searchQueryState holds a JSON blob; fall back to an empty object if absent
    state = json.loads(qs.get("searchQueryState", ["{}"])[0])
    state["pagination"] = {"currentPage": current_page + 1}
    new_qs = {"searchQueryState": json.dumps(state)}
    return f"{parsed.scheme}://{parsed.netloc}{parsed.path}?{urlencode(new_qs)}"
loop through pages until listResults comes back empty or you hit your page cap.
if it breaks: Zillow caps search results at page 20 for unauthenticated users regardless of total result count. if you need more than 800 listings for a geography, split by zip code or bounding box.
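the loop itself can be sketched like this. to keep the example runnable without a browser, fetch_page is a hypothetical callable that takes a page number and returns that page’s listings; in the real scraper it would wrap next_page_url, scrape_search_page and extract_listings, and you would await it. the duplicate check is my addition, and it stops the loop as soon as a page brings back only zpid values already seen.

```python
def collect_pages(fetch_page, max_pages=20):
    """Walk pages until one is empty, repeats known zpids, or the cap is hit."""
    seen = set()
    collected = []
    for page in range(1, max_pages + 1):
        listings = fetch_page(page)
        if not listings:
            break  # empty page: no more results
        new_count = 0
        for item in listings:
            zpid = item.get("zpid")
            if zpid in seen:
                continue  # already collected on an earlier page
            seen.add(zpid)
            collected.append(item)
            new_count += 1
        if new_count == 0:
            break  # nothing new: the site is repeating earlier results
    return collected


def fake_fetch(page):
    """Stand-in fetcher: three real pages, then page 1 repeated."""
    real_page = page if page <= 3 else 1
    return [{"zpid": f"{real_page}-{i}"} for i in range(2)]


print(len(collect_pages(fake_fetch)))
```

with the stand-in fetcher, the loop stops at page 4 instead of running to the cap.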
step 7: write results to JSON and verify
import json
def save_results(listings, output_path="zillow_output.json"):
    with open(output_path, "w") as f:
        json.dump(listings, f, indent=2)
    print(f"saved {len(listings)} listings to {output_path}")
verify with jq at the terminal:
jq '.[0]' zillow_output.json
expected output: a single listing object with all fields populated.
if it breaks: if price is null for rental listings, check for unformattedPrice or hdpData.homeInfo.price in the raw blob. Zillow’s schema differs slightly between for-sale and for-rent pages.
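a small fallback chain covers both cases. this is a sketch that assumes the three field locations named above; resolve_price is a hypothetical helper and the sample records are invented.

```python
def resolve_price(record):
    """Return the first non-null price among the known candidate fields."""
    hdp = record.get("hdpData")
    home_info = hdp.get("homeInfo") if isinstance(hdp, dict) else None
    nested = home_info.get("price") if isinstance(home_info, dict) else None
    for value in (record.get("price"), record.get("unformattedPrice"), nested):
        if value is not None:
            return value
    return None


# invented records, for illustration only
rental = {"price": None, "hdpData": {"homeInfo": {"price": 2400}}}
for_sale = {"price": "$450,000"}
print(resolve_price(rental), resolve_price(for_sale))
```

note that the candidates may come back as different types (a formatted string versus a number), so normalize before you store them.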
step 8: add delays and session rotation
running Playwright with no delays at full speed will get your IPs flagged within minutes. add jitter between requests and rotate proxy sessions after every few pages.
import asyncio
import random
async def polite_delay():
    # random jitter so requests don't arrive on a fixed cadence
    await asyncio.sleep(random.uniform(4.0, 9.0))
for session rotation, append a random session ID to your proxy username:
import os
import uuid
def get_rotating_proxy():
    # expects PROXY_USER and PROXY_PASS as separate environment variables
    base_user = os.environ["PROXY_USER"]
    base_pass = os.environ["PROXY_PASS"]
    session_id = uuid.uuid4().hex[:8]
    return f"http://user-{base_user}-session-{session_id}:{base_pass}@residential.proxyscraping.com:8080"
if it breaks: if you’re still hitting blocks after rotation, extend delays to 10-20 seconds. Zillow’s detection correlates request velocity more than IP reputation alone.
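to rotate “after every few pages” rather than on every request, you need something that counts pages per session. here is a minimal sketch. SessionRotator is a hypothetical helper of mine, and the default of 3 pages per session is an arbitrary starting point, not a tested threshold.

```python
import uuid


class SessionRotator:
    """Hand out a session ID and replace it after a fixed number of pages."""

    def __init__(self, pages_per_session=3):
        if pages_per_session < 1:
            raise ValueError("pages_per_session must be at least 1")
        self.pages_per_session = pages_per_session
        self._used = 0
        self._session_id = uuid.uuid4().hex[:8]

    def next_session_id(self):
        """Return the session ID for the next page, rotating once the budget is spent."""
        if self._used >= self.pages_per_session:
            self._session_id = uuid.uuid4().hex[:8]
            self._used = 0
        self._used += 1
        return self._session_id


rotator = SessionRotator(pages_per_session=3)
ids = [rotator.next_session_id() for _ in range(7)]
print(ids)
```

the ID it returns can replace the random one built inside get_rotating_proxy, so consecutive pages share an IP until the budget runs out.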
common pitfalls
using datacenter proxies. datacenter ranges are blocklisted by Zillow’s bot detection, which fingerprints ASN prefixes. i’ve tested this: even premium datacenter providers get blocked within 50-100 requests on Zillow. residential or mobile proxies are not optional here. see our guide to the best residential proxies for real estate data for a comparison of alternatives.
parsing the HTML instead of intercepting XHR. the rendered Zillow page has minimal listing data in the DOM. operators who try to [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) their way through it end up with half-complete records and brittle selectors that break on every Zillow frontend deploy.
ignoring TLS fingerprinting. Playwright’s default Chromium passes TLS fingerprint checks fine. if you switch to raw httpx or [requests](https://requests.readthedocs.io/), Zillow’s Akamai layer will fingerprint your TLS hello and block you. stick with a real browser context unless you’re using a library that spoofs TLS fingerprints.
not handling the 20-page cap. Zillow silently returns page 1 results when you request page 21+. if your pagination loop doesn’t check for duplicate zpid values, you’ll collect the same 40 listings over and over without noticing.
skipping the robots.txt check. this isn’t just ethical, it’s strategic. if Zillow’s legal team ever contacts you, being able to show you respected [robots.txt](https://www.rfc-editor.org/rfc/rfc9309.html) boundaries matters. check it before targeting any new path.
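the robots.txt check is easy to automate with Python’s standard library. the sketch below parses a made-up robots.txt so it runs offline; the rules and the example.com URLs are placeholders and do not reflect what Zillow’s file actually says.

```python
from urllib.robotparser import RobotFileParser

# made-up rules, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_checker(robots_text):
    """Parse robots.txt text and return a parser you can query with can_fetch."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


checker = build_checker(SAMPLE_ROBOTS)
print(checker.can_fetch("my-bot", "https://example.com/private/page"))
print(checker.can_fetch("my-bot", "https://example.com/homes/"))
```

against a live site you would call set_url with the robots.txt address and then read() instead of parse(), and log each decision so you have a record of what you checked.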
scaling this
from 10 to 100 pages/day: run the script sequentially with delays. one Playwright browser, one context, one session. 100 pages at a 7-second average delay is about 12 minutes of waiting, plus page-load time. bandwidth cost: well under 500MB/day.
from 100 to 1000 pages/day: move to async concurrency with a semaphore to cap parallel browser contexts. 3-5 concurrent contexts is the practical ceiling before delay requirements eat your time advantage. split work across multiple proxy sessions. expect 2-4GB bandwidth/day. at that rate the 8GB Starter tier lasts only a few days, so size your plan against monthly usage.
sem = asyncio.Semaphore(3)  # cap on concurrent browser contexts
async def bounded_scrape(context, url):
    async with sem:
        await polite_delay()
        return await scrape_search_page(context, url)
from 1000 to 10,000+ pages/day: you need a distributed queue. push URLs into a Redis or SQS queue, run 5-10 worker VMs each with their own Playwright pool and proxy credentials. at this scale, bandwidth costs are $300-600/month and you need to monitor proxy hit rates daily. some operators at this scale also use antidetect browsers with profile rotation, particularly for accounts that need to stay warm across sessions. the antidetect browser comparison at antidetectreview.org covers what’s available and where the tradeoffs are.
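the queue-and-workers shape is easier to reason about in miniature. the sketch below uses an in-process asyncio.Queue as a stand-in for Redis or SQS, and handle_url is a hypothetical coroutine standing in for the real scrape. it shows the pattern only and is not a distributed setup.

```python
import asyncio


async def worker(queue, handle_url, results):
    """Pull URLs off the queue until cancelled, recording each outcome."""
    while True:
        url = await queue.get()
        try:
            results.append((url, await handle_url(url)))
        except Exception as exc:  # keep the worker alive if one URL fails
            results.append((url, exc))
        finally:
            queue.task_done()


async def run_workers(urls, handle_url, worker_count=3):
    """Feed URLs to a fixed pool of workers and wait for the queue to drain."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    tasks = [
        asyncio.create_task(worker(queue, handle_url, results))
        for _ in range(worker_count)
    ]
    await queue.join()  # returns once every queued URL is processed
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


async def fake_handle(url):
    """Stand-in for the real scrape: returns the URL length."""
    await asyncio.sleep(0)
    return len(url)


results = asyncio.run(run_workers(["a", "bb", "ccc"], fake_handle))
print(sorted(results))
```

catching exceptions inside the worker matters here: if a failed URL killed the worker, the queue would never drain and the join would hang.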
at 10,000+ pages/day you should also be caching aggressively. Zillow listing data doesn’t change by the minute. cache zpid results for 6-24 hours and skip re-fetching unchanged listings.
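a time-based cache keyed on zpid is enough to get started. this is an in-memory sketch: ListingCache is a hypothetical class and the injectable clock is there only to make it testable. at real scale you would back it with Redis or your database.

```python
import time


class ListingCache:
    """In-memory cache that forgets a listing once its TTL has elapsed."""

    def __init__(self, ttl_seconds=6 * 3600, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # zpid -> (stored_at, listing)

    def put(self, zpid, listing):
        self._entries[zpid] = (self._clock(), listing)

    def get(self, zpid):
        """Return the cached listing, or None if it is missing or expired."""
        entry = self._entries.get(zpid)
        if entry is None:
            return None
        stored_at, listing = entry
        if self._clock() - stored_at >= self.ttl_seconds:
            del self._entries[zpid]  # expired: drop it so it gets re-fetched
            return None
        return listing


cache = ListingCache(ttl_seconds=6 * 3600)
cache.put("123", {"zpid": "123", "price": 450000})
print(cache.get("123"))
```

check the cache before queuing a URL, and only fetch when get returns None.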
where to go next
- How to rotate proxies with Scrapy for real estate sites covers a Scrapy-based approach if you prefer a more structured framework over raw Playwright scripts.
- Best residential proxies for real estate data scraping in 2026 benchmarks ProxyScraping against five competitors on Zillow, Redfin, and Realtor.com specifically, with actual block rates measured over 30 days.
- Back to the proxy guides index for the full list of tutorials on residential proxies, rotating setups, and target-specific scraping.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.