
How to scrape Realtor.com at scale in 2026 with proxies that work

Realtor.com is one of the densest public real estate datasets on the internet. it has listing prices, days-on-market, agent contact details, historical price changes, and neighborhood-level data that takes years to accumulate. if you’re building a real estate analytics tool, a price-tracking dashboard, or a leads pipeline for agents, scraping it is often the fastest path to data that would otherwise cost tens of thousands of dollars from licensed data vendors.

the problem is that Realtor.com has layered bot defenses. they run Cloudflare at the edge, use TLS fingerprinting to flag headless browsers, and will silently return stale or empty JSON to IPs they’ve flagged rather than serving a hard block. that last one is particularly nasty because your scraper looks healthy but your data is garbage. i’ve seen operators waste weeks on pipelines that were storing block pages and empty result sets served with a 200 status code.

this guide is for operators who already know some Python and want a working, scalable pipeline for Realtor.com listings. by the end you’ll have a rotating-proxy setup, a working HTTP client with proper headers, a simple parser for the internal JSON API, and a plan for going from 100 to 10,000 requests per day without burning your proxy pool.


what you need

  • Python 3.11+ with httpx, curl_cffi, and parsel installed
  • Residential rotating proxies from a provider with US coverage. i use ProxyScraping’s residential plan ($3.50/GB as of Q1 2026) for most real estate work. datacenter proxies will not work here reliably.
  • A proxy gateway endpoint with username/password auth. most providers give you a single sticky-session endpoint plus a rotating endpoint. you want the rotating one.
  • A Supabase free-tier project (or any Postgres instance) to store raw JSON before you parse it
  • ~$10-30/month budget for proxy bandwidth at moderate scale (1,000-10,000 listings/day)
  • optional: a fingerprint-safe browser driver if you need to load pages with JavaScript rendering. Playwright with the playwright-stealth plugin works, but adds latency and cost.

step by step

step 1: read the robots.txt and understand what you’re working with

before writing a single line, read Realtor.com’s robots.txt. it disallows a number of paths including /contact-agent/ and some internal API routes. the listing search pages (/realestateandhomes-search/) and individual property pages (/realestateandhomes-detail/) are not disallowed as of this writing.

understanding the Robots Exclusion Protocol is worth the ten minutes. it tells you which paths are explicitly off-limits and gives you a clear paper trail if you ever need to explain your data practices to a client or partner.

curl -s https://www.realtor.com/robots.txt | head -60

expected output: a list of Disallow: directives. note any paths that overlap with your target data.

if it breaks: if the curl itself is blocked, your residential proxy rotation isn’t set up yet. skip to step 2 first.
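if you’d rather check paths programmatically than by eye, the standard library’s urllib.robotparser can do it. this is a minimal sketch: SAMPLE_ROBOTS is a hypothetical excerpt written for illustration (it only mirrors the /contact-agent/ rule mentioned above), and allowed_paths is a helper name i made up. in practice you would feed it the text of the live robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt for illustration only; always use the live robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /contact-agent/
"""


def allowed_paths(robots_text: str, paths: list[str], agent: str = "*") -> dict[str, bool]:
    """Map each path to True if robots_text allows `agent` to fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {p: parser.can_fetch(agent, "https://www.realtor.com" + p) for p in paths}


report = allowed_paths(
    SAMPLE_ROBOTS,
    ["/contact-agent/123", "/realestateandhomes-search/San-Francisco_CA"],
)
for path, ok in report.items():
    print("allowed " if ok else "DISALLOWED", path)
```

running this against the real file before each crawl is cheap and gives you a dated record of what was allowed when you scraped.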


step 2: set up your proxy client with curl_cffi

standard [requests](https://requests.readthedocs.io/) or even httpx will get fingerprinted on Realtor.com within a few hundred requests. curl_cffi impersonates real browser TLS fingerprints (Chrome 120, Safari 17, etc.) and is the current best tool for this without spinning up a full browser. install it with:

pip install curl_cffi httpx parsel

set up your client:

from curl_cffi import requests as cf_requests

PROXY_USER = "your_proxy_user"
PROXY_PASS = "your_proxy_pass"
PROXY_HOST = "gate.proxyscraping.com"
PROXY_PORT = "31112"

proxies = {
    "http": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
    "https": f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}",
}

session = cf_requests.Session(impersonate="chrome120", proxies=proxies)

if it breaks: double-check your proxy credentials. most providers have a dashboard that shows live connection stats. if you see 0 connections, the auth is wrong.


step 3: find the internal JSON API endpoint

Realtor.com loads listing data via an internal GraphQL-style API, not by server-rendering HTML. open Chrome DevTools, go to the Network tab, filter for XHR, and load a search results page. you’ll see a request to a path like /api/v1/hulk or similar (the exact path changes, check DevTools for the current one). this endpoint returns structured JSON that’s far easier to parse than HTML.

copy the full request URL and headers from DevTools:

import json

url = "https://www.realtor.com/api/v1/hulk_preliminary/propertylist"
headers = {
    "accept": "application/json",
    "accept-language": "en-US,en;q=0.9",
    "referer": "https://www.realtor.com/realestateandhomes-search/San-Francisco_CA",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "rdc-client-name": "RDC_WEB_SEARCH_PAGE",
    "rdc-client-version": "2.0.1986",
}

params = {
    "city": "San Francisco",
    "state_code": "CA",
    "offset": 0,
    "limit": 42,
    "sort": "recently_changed",
}

resp = session.get(url, headers=headers, params=params, timeout=30)
data = resp.json()
print(json.dumps(data["data"]["home_search"]["results"][0], indent=2))

if it breaks: the API path or required headers may have changed. repeat the DevTools inspection to get fresh values. the rdc-client-name and rdc-client-version headers are often required and version-locked.
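because a flagged IP can get a 200 with nothing useful in it, it’s worth validating every response before trusting it. the sketch below assumes the data → home_search → results shape used in the snippet above; check_search_response is a hypothetical helper, and the reason strings are just labels for your logs.

```python
import json


def check_search_response(status_code: int, body: str) -> tuple[bool, str]:
    """Return (healthy, reason) for one search API response."""
    if status_code != 200:
        return False, f"http {status_code}"
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return False, "body is not JSON (possibly a challenge or block page)"
    # Walk the assumed path defensively; any level may be missing or null.
    inner = data.get("data") if isinstance(data, dict) else None
    search = inner.get("home_search") if isinstance(inner, dict) else None
    results = search.get("results") if isinstance(search, dict) else None
    if not results:
        return False, "200 with empty results (possible silent block)"
    return True, "ok"


print(check_search_response(200, '{"data": {"home_search": {"results": []}}}'))
```

with the session from step 2 you would call it as check_search_response(resp.status_code, resp.text) and rotate or back off whenever it returns False.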


step 4: parse listing fields you actually need

once the JSON is flowing, extract what matters. here’s a minimal parser:

def parse_listing(raw: dict) -> dict:
    # "or {}" also covers keys that are present but null, which a default argument does not
    prop = raw.get("property") or {}
    location = raw.get("location") or {}
    address = location.get("address") or {}

    return {
        "property_id": raw.get("property_id"),
        "list_price": raw.get("list_price"),
        "beds": prop.get("beds"),
        "baths_full": prop.get("baths_full"),
        "sqft": prop.get("sqft"),
        "lot_sqft": prop.get("lot_sqft"),
        "address": address.get("line"),
        "city": address.get("city"),
        "state": address.get("state_code"),
        "zip": address.get("postal_code"),
        "list_date": raw.get("list_date"),  # a listing date, not a day count; derive days on market from it
        "status": raw.get("status"),
        "href": raw.get("href"),
    }

listings = [parse_listing(r) for r in data["data"]["home_search"]["results"]]

if it breaks: the JSON schema changes occasionally after site deployments. log the raw response to a file and re-map the fields. never hardcode field paths without a fallback get() call.


step 5: paginate through results

Realtor.com caps results at 42 per page. paginate with offset:

import time

all_listings = []
offset = 0
total = None

while True:
    params["offset"] = offset
    resp = session.get(url, headers=headers, params=params, timeout=30)
    data = resp.json()

    results = data["data"]["home_search"]["results"]
    if total is None:
        total = data["data"]["home_search"]["total"]

    all_listings.extend([parse_listing(r) for r in results])

    if offset + 42 >= total or offset >= 2016:  # realtor.com hard-caps at ~2000 results per search
        break

    offset += 42
    time.sleep(1.2)  # stay under rate limits

print(f"collected {len(all_listings)} listings")

note the ~2,000-result cap per search query. to get more than that from a single metro, split by price range, neighborhood, or property type. this is a common ceiling operators hit without realizing it.
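one way to split by price range is to bisect a band until each piece fits under the cap. the sketch below is generic: count_fn is a hypothetical callable you would write yourself to run a search restricted to a price range and return the reported total (the exact filter parameters are not shown here because they need to come from your own DevTools inspection). bands are half-open, [low, high), so neighbours don’t overlap.

```python
def split_price_bands(count_fn, low: int, high: int, cap: int = 2000, min_width: int = 1000):
    """Return (low, high) bands whose counts fit under cap, bisecting as needed."""
    total = count_fn(low, high)
    # Stop when the band fits, or when it is too narrow to split further.
    if total <= cap or high - low <= min_width:
        return [(low, high)]
    mid = (low + high) // 2
    return (
        split_price_bands(count_fn, low, mid, cap, min_width)
        + split_price_bands(count_fn, mid, high, cap, min_width)
    )


def fake_count(low: int, high: int) -> int:
    """Stand-in for a real search: pretends there is one listing per $100."""
    return (high - low) // 100


bands = split_price_bands(fake_count, 0, 1_000_000)
print(len(bands), bands[0], bands[-1])
```

each probe costs a request, so cache the bands per metro and only re-split when a band’s total creeps back over the cap.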

if it breaks: if results comes back empty before you expect, your IP may be silently rate-limited. add exponential backoff and check your proxy provider’s bandwidth dashboard.
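a minimal sketch of exponential backoff with jitter, written around a generic fetch callable so it doesn’t depend on any particular HTTP client. fetch_with_backoff and its default delays are assumptions, not tuned values. note that an empty page can also be the legitimate end of a result set, so only use this where you expected more results.

```python
import random
import time


def fetch_with_backoff(fetch, max_attempts=5, base=2.0, cap=60.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch() until it returns a non-empty list, backing off between tries."""
    for attempt in range(max_attempts):
        results = fetch()
        if results:
            return results
        if attempt < max_attempts - 1:
            delay = min(cap, base * 2 ** attempt)   # 2s, 4s, 8s, ... capped at `cap`
            sleep(delay * (0.5 + rng() / 2))        # jitter: 50-100% of the delay
    return []


# Demo with a fake fetch that fails twice, recording sleeps instead of waiting.
pages = iter([[], [], [{"property_id": "1"}]])
slept = []
result = fetch_with_backoff(lambda: next(pages), sleep=slept.append, rng=lambda: 1.0)
print(result, slept)
```

in the pagination loop, fetch would be a small closure that issues the request for the current offset and returns the results list.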


step 6: store raw JSON before you parse

parse failures are inevitable as the site evolves. always write the raw API response to Postgres (or even flat files) before you transform it:

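import os

from supabase import create_client  # pip install supabase

# credentials come from the environment; both variable names are placeholders
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
sb = create_client(SUPABASE_URL, SUPABASE_KEY)

# store the untouched API items, not the parse_listing() output.
# `data` is the decoded response from step 3; in the step 5 loop, run this for every page.
for raw in data["data"]["home_search"]["results"]:
    sb.table("realtor_raw").upsert(
        {"property_id": raw.get("property_id"), "raw": raw},
        on_conflict="property_id",
    ).execute()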

this lets you re-parse historical data when the schema changes without re-scraping. storage is cheap. bandwidth is not.

if it breaks: if upsert fails on conflict, make sure property_id has a unique index on your table.
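if you go the flat-file route instead, gzip-compressed JSON Lines is a reasonable format: one raw item per line, appendable, and easy to replay later. this is a sketch with made-up names (append_raw, read_raw, and the realtor_raw_<date> file naming are all assumptions):

```python
import gzip
import json
import tempfile
from pathlib import Path


def append_raw(results: list[dict], out_dir: Path, run_date: str) -> Path:
    """Append raw API items to a per-day gzip JSONL file and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"realtor_raw_{run_date}.jsonl.gz"
    with gzip.open(path, "at", encoding="utf-8") as fh:
        for item in results:
            fh.write(json.dumps(item, ensure_ascii=False) + "\n")
    return path


def read_raw(path: Path) -> list[dict]:
    """Read every item back, for re-parsing after a schema change."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Demo in a temporary directory: two pages appended to the same daily file.
with tempfile.TemporaryDirectory() as tmp:
    path = append_raw([{"property_id": "a"}], Path(tmp), "2026-01-15")
    append_raw([{"property_id": "b"}], Path(tmp), "2026-01-15")
    restored = read_raw(path)
print(restored)
```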


step 7: schedule and monitor

run your scraper on a daily cron via a simple script. for monitoring, log each run’s result count and flag any run that returns fewer than 80% of yesterday’s count. a sudden drop is the first sign your proxies are being blocked or the API changed.

# crontab entry: 18:00 UTC, which is 2am Singapore time (UTC+8); assumes the server clock is set to UTC
0 18 * * * /usr/bin/python3 /home/user/realtor_scraper/run.py >> /var/log/realtor_scraper.log 2>&1
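the 80% rule is a one-line comparison, but it helps to pin down the edge cases. a sketch (run_is_suspicious is a hypothetical name, and treating a missing baseline as "only suspicious if today is also empty" is my assumption):

```python
def run_is_suspicious(today_count: int, yesterday_count: int, threshold: float = 0.8) -> bool:
    """Flag a run that collected fewer than `threshold` of yesterday's listings."""
    if yesterday_count <= 0:
        # No usable baseline: only flag the run if it is empty as well.
        return today_count <= 0
    return today_count < threshold * yesterday_count


for today, yesterday in [(700, 1000), (800, 1000), (0, 0)]:
    print(today, yesterday, run_is_suspicious(today, yesterday))
```

call it at the end of run.py and exit non-zero (or send an alert) when it returns True, so the cron log shows the failure.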

common pitfalls

using datacenter proxies. Realtor.com’s fraud scoring flags datacenter ASNs (AWS, DigitalOcean, Hetzner) fast. you’ll get 200 responses with empty result sets. residential proxies are not optional here.

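not rotating user agents alongside IPs. TLS fingerprint + IP + user-agent form a composite identity. rotating only IPs while keeping the same user agent string makes your requests easier to link together. curl_cffi handles TLS, but the user-agent header has to agree with the browser you impersonate: a Safari user agent on a chrome120 handshake is an obvious mismatch. rotate the impersonation target and the user agent as a pair, drawn from real Chrome and Safari versions.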
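one way to rotate user agents without contradicting the TLS fingerprint is to keep each impersonation target paired with a user agent from the same browser. this sketch only picks a pair; it makes no requests. the target names follow curl_cffi’s naming as i understand it (check which targets your installed version supports), and the Safari string is an example of the format rather than something captured from this site.

```python
import random

# Each entry pairs an impersonation target with a User-Agent from the same browser.
PROFILES = [
    {
        "impersonate": "chrome120",
        "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    },
    {
        "impersonate": "safari17_0",
        "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                      "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    },
]


def pick_profile(rng=random):
    """Pick one consistent (impersonate, user_agent) pair."""
    return rng.choice(PROFILES)


profile = pick_profile(random.Random(0))
print(profile["impersonate"], "->", profile["user_agent"][:40])
```

you would then pass profile["impersonate"] when creating the session and profile["user_agent"] in the headers dict, so the two never drift apart.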

ignoring the 2,000-result cap. many operators assume pagination alone will get them all listings in a city. it won’t. new york city or los angeles will have tens of thousands of active listings. split queries by price band or zip code.

parsing HTML instead of the API. scraping the rendered HTML is slower, harder to maintain, and more brittle than hitting the JSON API. always check DevTools first.

skipping deduplication. if you run multiple search queries to get past the 2,000-result cap, you will collect the same listing multiple times. deduplicate on property_id before storing.
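a minimal dedup sketch. it keeps the last copy seen for each property_id and drops items with no id; both choices are assumptions you may want to change.

```python
def dedupe_by_property_id(listings: list[dict]) -> list[dict]:
    """Keep one listing per property_id; later duplicates replace earlier ones."""
    seen: dict = {}
    for listing in listings:
        pid = listing.get("property_id")
        if pid is None:
            continue  # nothing to key on, so skip rather than guess
        seen[pid] = listing
    return list(seen.values())


merged = dedupe_by_property_id([
    {"property_id": "a", "list_price": 1},
    {"property_id": "b"},
    {"property_id": "a", "list_price": 2},
    {"list_price": 3},
])
print(merged)
```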


scaling this

10x (1,000 requests/day): a single residential proxy gateway handles this easily. one script, one cron job, ~1-2 GB/month bandwidth. total cost is under $10/month.

100x (10,000 requests/day): start distributing across multiple proxy gateway sessions with different sticky-session IDs. add a request queue (even a simple Postgres-backed one works) so you can parallelize with 5-10 workers without hammering a single exit IP. at this level you also want structured logging, not just print statements.
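a rough sketch of the worker side, using a thread pool from the standard library. run_jobs is a hypothetical helper, and fetch(job, session_id) is a function you would supply; how a session ID maps into the proxy username is provider-specific and not shown here.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(jobs, fetch, session_ids, max_workers=5):
    """Run fetch(job, session_id) in parallel, assigning session IDs round-robin."""
    def task(indexed):
        i, job = indexed
        return fetch(job, session_ids[i % len(session_ids)])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input jobs.
        return list(pool.map(task, enumerate(jobs)))


def fake_fetch(job, session_id):
    """Stand-in for a real request; just echoes what it was given."""
    return (job, session_id)


out = run_jobs(["94102", "94103", "94104"], fake_fetch, ["s1", "s2"])
print(out)
```

threads are enough here because the work is network-bound; swap the in-memory job list for your Postgres-backed queue once you need retries and persistence.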

1000x (100,000+ requests/day): you’re now spending meaningfully on bandwidth. extrapolating from the ~1-2 GB/month at 1,000 requests/day above, expect roughly 100-200 GB/month, which is about $350-700/month at $3.50/GB before any caching. consider caching responses at the Postgres level and only re-fetching listings that have changed recently. use the list_date field to prioritize fresh listings. you’ll also want to monitor proxy pool health actively, checking for block rates by provider region. some operators at this scale also run antidetect browser profiles for the subset of pages that require full JavaScript rendering, keeping the lightweight HTTP client for the majority of requests.
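the "prioritize fresh listings" idea can be sketched as a simple refresh policy: re-fetch recently listed properties often and older ones rarely. everything here is an assumption for illustration, including the due_for_refresh name, the last_fetched column, and the 14/1/7-day thresholds.

```python
from datetime import date


def due_for_refresh(rows, today, fresh_days=14, fresh_interval=1, stale_interval=7):
    """Return property_ids whose last fetch is older than their refresh interval."""
    due = []
    for row in rows:
        age = (today - row["list_date"]).days
        # Fresh listings change more often, so they get the shorter interval.
        interval = fresh_interval if age <= fresh_days else stale_interval
        if (today - row["last_fetched"]).days >= interval:
            due.append(row["property_id"])
    return due


today = date(2026, 1, 20)
rows = [
    {"property_id": "new", "list_date": date(2026, 1, 15), "last_fetched": date(2026, 1, 19)},
    {"property_id": "old", "list_date": date(2025, 10, 1), "last_fetched": date(2026, 1, 18)},
    {"property_id": "old-due", "list_date": date(2025, 10, 1), "last_fetched": date(2026, 1, 10)},
]
print(due_for_refresh(rows, today))
```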


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
