How to scrape eBay at scale in 2026 with proxies that work
eBay is one of the messiest scraping targets on the internet. it’s not a clean API, the HTML structure changes without notice, and their bot detection has gotten more aggressive since mid-2024. if you’ve tried to pull product prices, seller feedback counts, or completed auction data recently and ended up with a wall of 429s and CAPTCHA challenges, you’re not alone.
this guide is for operators who need eBay data in volume, whether that’s for price intelligence, competitor monitoring, reseller tooling, or marketplace analytics. i’ll walk through the full stack: proxy selection, request architecture, parsing, and what actually breaks at scale. i run pipelines hitting eBay US, UK, and AU; specifics here come from production, not theory.
by the end you’ll have a working Python scraper that rotates residential proxies, handles eBay’s session quirks, parses listing data reliably, and can scale from a few hundred requests per day to tens of thousands.
what you need
- Python 3.11+ with httpx, beautifulsoup4, lxml, and playwright installed
- rotating residential proxies, minimum a pool of 10k IPs. datacenter proxies will get blocked within a few hundred requests on eBay in 2026. i use ProxyScraping residential plans, which start around $4/GB, or Brightdata’s residential network (around $8.40/GB) as a fallback
- a scraping-friendly server in the US, preferably AWS us-east-1 or a US VPS, because eBay geo-gates some listing data and prices differ by region
- storage: Postgres or S3-compatible object storage for raw HTML and parsed output
- Python packages:
  pip install "httpx[http2]" beautifulsoup4 lxml playwright
- budget: plan for roughly $15-40/month at moderate scale (10k-50k requests/day), mostly proxy egress costs
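to sanity-check a budget figure like that, proxy spend is roughly bytes transferred times price per GB. the sketch below is back-of-envelope arithmetic, not a quote: estimate_monthly_proxy_cost is a hypothetical helper, and the 25 KB per request is an assumption about compressed size on the wire that you should replace with your own measurement.

```python
def estimate_monthly_proxy_cost(
    requests_per_day: int,
    avg_kb_per_request: float,
    price_per_gb: float,
    days: int = 30,
) -> float:
    """Back-of-envelope proxy spend: bytes transferred times price per GB."""
    gb_per_day = requests_per_day * avg_kb_per_request / 1_000_000  # KB -> GB (decimal)
    return round(gb_per_day * days * price_per_gb, 2)


# ASSUMPTION: ~25 KB per request after compression. Measure your own traffic.
monthly = estimate_monthly_proxy_cost(10_000, 25, 4.00)
print(monthly)  # -> 30.0
```

the result scales linearly with request volume and with page weight, so uncompressed responses or heavier listing pages move the number quickly.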
step by step
step 1: set up your proxy rotation layer
don’t put proxy credentials directly in your scraping code. build a thin rotation wrapper so you can swap providers without touching scraper logic.
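import random

import httpx

PROXY_LIST = [
    # placeholder credentials and host; substitute your provider's endpoint
    "http://user:PASSWORD@PROXY_HOST:7777",
    # add more endpoints or load from env
]


def get_proxy():
    # pick a random endpoint from the pool
    return random.choice(PROXY_LIST)


def make_client():
    # a client keeps one proxy for its lifetime; build a new client to rotate
    return httpx.Client(
        proxy=get_proxy(),
        timeout=20,
        headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            # "br" is left out: httpx only decodes brotli when the brotli package is installed
            "Accept-Encoding": "gzip, deflate",
        },
    )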
expected output: you can call make_client() and it returns an httpx client pre-configured with a rotating proxy.
if it breaks: if httpx throws a ProxyError, the proxy endpoint is misconfigured. double-check that your username/password are URL-encoded (special characters like @ or # in passwords will break the URL).
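the URL-encoding advice above can be handled once, when the proxy URL is built. a minimal sketch using only the standard library; build_proxy_url is a hypothetical helper and proxy.example.com is a placeholder host, not a real endpoint.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int) -> str:
    """Percent-encode credentials so characters like @ or # cannot break the URL."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# proxy.example.com is a placeholder host.
print(build_proxy_url("user", "p@ss#1", "proxy.example.com", 7777))
# -> http://user:p%40ss%231@proxy.example.com:7777
```

passing safe='' matters here: by default quote leaves "/" unescaped, which would also break a password containing a slash.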
step 2: understand eBay’s page types
eBay has three main scraping targets, each with different structure and bot sensitivity:
- search results (/sch/i.html?_nkw=...): medium difficulty, changes layout occasionally
- listing pages (/itm/...): high bot sensitivity, contains structured JSON-LD
- completed/sold listings (/sch/i.html?_nkw=...&LH_Complete=1&LH_Sold=1): highest value for price research, same structure as search
start with search pages. they’re more forgiving and give you listing IDs you can then fetch individually.
expected output: knowing which URLs to target before writing a single request.
if it breaks: if you see different page structures than documented here, eBay likely ran an A/B test. check whether your User-Agent is triggering a mobile layout by temporarily removing it.
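the three URL shapes above can be generated from one place so that query strings are always encoded the same way. a small sketch with hypothetical helper names (search_url, listing_url); the parameters are the ones listed in this step plus _pgn for the page number.

```python
from urllib.parse import urlencode

BASE = "https://www.ebay.com"


def search_url(query: str, page: int = 1, sold_only: bool = False) -> str:
    """Build a search URL; sold_only adds the completed/sold filters."""
    params = {"_nkw": query, "_pgn": page}
    if sold_only:
        params["LH_Complete"] = 1
        params["LH_Sold"] = 1
    return f"{BASE}/sch/i.html?{urlencode(params)}"


def listing_url(item_id: str) -> str:
    """Build the listing page URL for a numeric item ID."""
    return f"{BASE}/itm/{item_id}"


print(search_url("vintage seiko watch", sold_only=True))
# -> https://www.ebay.com/sch/i.html?_nkw=vintage+seiko+watch&_pgn=1&LH_Complete=1&LH_Sold=1
```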
step 3: fetch a search results page
def fetch_search(query: str, page: int = 1) -> str:
    url = "https://www.ebay.com/sch/i.html"
    params = {
        "_nkw": query,
        "_pgn": page,
        "_ipg": 60,  # 60 results per page, max
    }
    # the context manager closes the connection pool once the request is done
    with make_client() as client:
        resp = client.get(url, params=params)
    resp.raise_for_status()
    return resp.text


html = fetch_search("vintage seiko watch")
print(len(html))  # should be 80k-120k characters
expected output: raw HTML string. if you’re getting ~2k characters you’ve hit a CAPTCHA or redirect page.
if it breaks: a 403 with a small response body means eBay flagged the IP. add a time.sleep(random.uniform(1.5, 4)) between requests and rotate more aggressively. for Cloudflare challenge pages, use residential proxies with a higher trust score or [playwright](https://playwright.dev/) for JS rendering to establish a session cookie.
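since a very short response usually means a CAPTCHA or redirect page, it is worth checking for that before handing HTML to the parser. the heuristic below is a sketch: looks_blocked is a hypothetical helper, and both the 10,000 character floor and the "captcha" marker are assumptions to tune against pages you have actually saved.

```python
def looks_blocked(html: str, min_length: int = 10_000) -> bool:
    """Heuristic: very short pages or CAPTCHA wording suggest a block page."""
    if len(html) < min_length:
        return True
    # ASSUMPTION: block pages mention "captcha" somewhere in the markup.
    return "captcha" in html.lower()


print(looks_blocked("<html>" + "x" * 2_000 + "</html>"))  # -> True
print(looks_blocked("x" * 90_000))  # -> False
```

a marker check like this can produce false positives if a normal page happens to contain the word, so treat a hit as a reason to log the HTML and inspect it rather than as proof.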
step 4: parse listing data from search results
eBay embeds structured data in JSON-LD blocks, but it’s incomplete. the most reliable parse is a hybrid: JSON-LD where it’s present, direct DOM for price and condition. the parser below covers the DOM half and takes the item ID from the listing URL.
import re

from bs4 import BeautifulSoup


def parse_search_results(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    items = []
    for card in soup.select("li.s-item"):
        try:
            title_el = card.select_one(".s-item__title")
            price_el = card.select_one(".s-item__price")
            link_el = card.select_one("a.s-item__link")
            condition_el = card.select_one(".SECONDARY_INFO")
            if not title_el or not link_el:
                continue
            title = title_el.get_text(strip=True)
            if title == "Shop on eBay":  # skip the first phantom card
                continue
            href = link_el["href"]
            item_id_match = re.search(r"/itm/(\d+)", href)
            items.append({
                "item_id": item_id_match.group(1) if item_id_match else None,
                "title": title,
                "price": price_el.get_text(strip=True) if price_el else None,
                "condition": condition_el.get_text(strip=True) if condition_el else None,
                "url": href.split("?")[0],
            })
        except Exception:
            continue
    return items


results = parse_search_results(html)
print(results[:2])
expected output: a list of dicts with item ID, title, price string, condition, and clean URL.
if it breaks: eBay changes CSS class names a few times per year. if s-item stops working, open the page in a browser, inspect a listing card, and update the selectors. this is the most maintenance-heavy part of any eBay scraper.
step 5: fetch individual listing pages for full data
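for price history, seller info, item specifics, and description, you need the listing page itself. eBay embeds a window.__PRELOADED_STATE__ JSON blob that contains most of what you need. the function below starts with the simpler target, the JSON-LD Product block, and leaves the preloaded-state blob for when you need more fields.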
import json

from bs4 import BeautifulSoup


def fetch_listing(item_id: str) -> dict:
    url = f"https://www.ebay.com/itm/{item_id}"
    # the context manager closes the connection pool once the request is done
    with make_client() as client:
        resp = client.get(url)
    resp.raise_for_status()
    html = resp.text
    soup = BeautifulSoup(html, "lxml")
    # extract JSON-LD structured data
    json_ld = None
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            # script.string is None for an empty tag; "" fails cleanly as JSONDecodeError
            data = json.loads(script.string or "")
            if isinstance(data, dict) and data.get("@type") == "Product":
                json_ld = data
                break
        except json.JSONDecodeError:
            continue
    return {
        "item_id": item_id,
        "json_ld": json_ld,
        "raw_html_length": len(html),
    }
expected output: a dict containing the JSON-LD product data and the raw HTML length as a sanity check.
if it breaks: if json_ld is None, eBay may have changed the script tag structure. fall back to parsing the <h1> for title and the [itemprop="price"] element for price.
step 6: handle pagination and rate limiting
eBay search results go up to page 100 (6000 results with 60 per page). for large queries you’ll need to either paginate or split by subcategory.
import random
import time

import httpx


def scrape_query(query: str, max_pages: int = 10) -> list[dict]:
    all_items = []
    page = 1
    retries = 0  # 429 retries spent on the current page
    while page <= max_pages:
        try:
            html = fetch_search(query, page=page)
        except httpx.HTTPStatusError as e:
            # retry the same page instead of silently skipping it (3 is an arbitrary cap)
            if e.response.status_code == 429 and retries < 3:
                retries += 1
                print(f"rate limited on page {page}, sleeping 30s")
                time.sleep(30)
                continue
            raise
        items = parse_search_results(html)
        if not items:
            print(f"no results on page {page}, stopping")
            break
        all_items.extend(items)
        print(f"page {page}: got {len(items)} items, total {len(all_items)}")
        # randomized delay between requests, critical for staying under radar
        time.sleep(random.uniform(2, 5))
        page += 1
        retries = 0
    return all_items
expected output: a flat list of all items across pages with progress logging.
if it breaks: consistent 429s even after sleeping suggest your proxy pool is too small or too many IPs have been flagged. switch to a fresh proxy pool or use a provider with sticky sessions to spread load.
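a fixed 30 second sleep is the simplest response to a 429. if rate limiting persists, a common alternative is exponential backoff with jitter, so that retries spread out instead of arriving in lockstep. this is a generic sketch, not something eBay documents: backoff_delay is a hypothetical helper, the 30 second base mirrors the sleep used in this step, and the 600 second cap is an assumption.

```python
import random


def backoff_delay(attempt: int, base: float = 30.0, cap: float = 600.0) -> float:
    """Equal-jitter backoff: half the ceiling is fixed, the other half is random."""
    ceiling = min(cap, base * (2 ** attempt))
    half = ceiling / 2
    return half + random.uniform(0, half)


for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 1))
```

attempt 0 waits between 15 and 30 seconds, attempt 1 between 30 and 60, and so on until the cap holds the ceiling at 600.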
step 7: store and deduplicate results
import sqlite3


def init_db(db_path: str = "ebay_items.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            item_id TEXT PRIMARY KEY,
            title TEXT,
            price TEXT,
            condition TEXT,
            url TEXT,
            scraped_at TEXT DEFAULT (datetime('now'))
        )
    """)
    conn.commit()
    return conn


def upsert_items(conn, items: list[dict]):
    conn.executemany("""
        INSERT OR REPLACE INTO items (item_id, title, price, condition, url)
        VALUES (:item_id, :title, :price, :condition, :url)
    """, items)
    conn.commit()
expected output: a local SQLite file with deduplicated item records.
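if it breaks: SQLite accepts NULL in a TEXT PRIMARY KEY column, so rows where item_id is None insert without an error and never deduplicate. filter them out before upserting with [i for i in items if i["item_id"]]. if executemany complains about a missing binding, one of your dicts lacks a key that the INSERT names.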
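the filtering step can be combined with in-batch deduplication before anything touches the database. a minimal sketch against an in-memory SQLite database with a cut-down two-column table; clean_items is a hypothetical helper and the item IDs are made up.

```python
import sqlite3


def clean_items(items: list[dict]) -> list[dict]:
    """Drop rows without an item_id and keep the last occurrence of each ID."""
    by_id = {i["item_id"]: i for i in items if i.get("item_id")}
    return list(by_id.values())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id TEXT PRIMARY KEY, title TEXT)")
raw = [
    {"item_id": "111", "title": "first"},
    {"item_id": None, "title": "no id"},
    {"item_id": "111", "title": "second"},
]
conn.executemany(
    "INSERT OR REPLACE INTO items (item_id, title) VALUES (:item_id, :title)",
    clean_items(raw),
)
print(conn.execute("SELECT item_id, title FROM items").fetchall())
# -> [('111', 'second')]
```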
step 8: validate your output
before scaling up, sanity-check a sample of your data. the most common silent failure is scraping eBay’s “ghost” first card (which always says “Shop on eBay”), or capturing CAPTCHA page HTML that parses as empty results.
conn = init_db()
cursor = conn.execute("SELECT COUNT(*), COUNT(DISTINCT item_id) FROM items")
total, unique = cursor.fetchone()
print(f"total rows: {total}, unique items: {unique}")
# spot check
cursor = conn.execute("SELECT * FROM items LIMIT 5")
for row in cursor:
    print(row)
expected output: counts match and sample rows contain real product data, not placeholder text.
if it breaks: if titles are all “Shop on eBay” or prices are None across the board, your HTML selector is targeting the wrong element. run parse_search_results on a saved HTML file and inspect what the soup is returning.
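the same checks can run on every parsed batch before it is stored, so a broken selector shows up as a number instead of a surprise. a sketch under the assumption that items are the dicts produced in step 4; batch_health is a hypothetical helper and any alert thresholds are yours to choose.

```python
def batch_health(items: list[dict]) -> dict:
    """Share of placeholder titles and missing prices in one parsed batch."""
    total = len(items)
    if total == 0:
        return {"total": 0, "placeholder_rate": 0.0, "missing_price_rate": 0.0}
    placeholders = sum(1 for i in items if i.get("title") == "Shop on eBay")
    missing_price = sum(1 for i in items if not i.get("price"))
    return {
        "total": total,
        "placeholder_rate": placeholders / total,
        "missing_price_rate": missing_price / total,
    }


batch = [
    {"title": "Shop on eBay", "price": None},
    {"title": "Seiko 5 automatic", "price": "$89.00"},
    {"title": "Seiko diver", "price": None},
    {"title": "Seiko chronograph", "price": "$140.00"},
]
print(batch_health(batch))
# -> {'total': 4, 'placeholder_rate': 0.25, 'missing_price_rate': 0.5}
```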
common pitfalls
using datacenter proxies. datacenter IPs (AWS, DigitalOcean, Hetzner ranges) get flagged by eBay within minutes of sustained use in 2026; residential proxies are not optional. some operators also try ISP proxies (static residential), which work better than datacenter but worse than rotating residential for high-volume work.
not rotating User-Agent strings. sending the same UA on every request is a fingerprinting signal. maintain a pool of realistic Chrome and Firefox UAs from recent versions and rotate them per-session. WhatIsMyBrowser publishes current UA strings.
ignoring eBay’s region-specific endpoints. ebay.com, ebay.co.uk, ebay.com.au, and ebay.de return different prices and listings. multi-region scraping needs proxies in each target country; eBay does server-side geo-detection beyond the domain.
scraping too fast. eBay’s bot detection is session-based, not just IP-based. human-looking patterns (2-5 second gaps, occasional pauses, varying start pages) survive much longer than machine-paced crawls.
not handling eBay’s A/B tests. eBay runs constant layout experiments; a parser that worked yesterday can break silently today for 30% of requests. always log raw HTML for failed parses and check selectors periodically.
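logging raw HTML for failed parses is easy to postpone and painful to lack. a minimal sketch that writes each failed page to disk under a content hash, so identical block pages collapse into one file; save_failed_html is a hypothetical helper and the directory layout is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def save_failed_html(html: str, directory: str) -> Path:
    """Write raw HTML to <directory>/<sha256 prefix>.html and return the path."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()[:16]
    path = out_dir / f"{digest}.html"
    path.write_text(html, encoding="utf-8")
    return path


# Demo in a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_failed_html("<html>captcha page</html>", tmp)
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")
```

in a pipeline you would call this wherever parse_search_results returns an empty list for a page that should have had results.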
scaling this
10x (100k requests/day): the main change is moving from synchronous httpx to async. use httpx.AsyncClient with asyncio and asyncio.Semaphore to cap concurrency at 10-20 simultaneous requests. at this scale SQLite becomes a bottleneck, so switch to Postgres.
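the semaphore pattern looks roughly like this. to keep the sketch runnable offline it uses a stand-in coroutine (fake_fetch) where a real pipeline would await an httpx.AsyncClient request; fetch_all and fake_fetch are hypothetical names.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency: int = 10):
    """Run fetch(url) for every URL with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request so the sketch runs without a network.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


results = asyncio.run(
    fetch_all([f"page-{n}" for n in range(5)], fake_fetch, max_concurrency=2)
)
print(results[0])  # -> fetched page-0
```

the semaphore caps requests in flight, not requests per second, so the randomized delays from step 6 still belong inside the fetch coroutine.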
100x (1M requests/day): you need a job queue (Celery with Redis, or a managed queue like AWS SQS) to distribute work across multiple scraping workers. proxy costs become your dominant expense at this scale, roughly $40-120/day depending on provider and cache hit rate. start caching listing pages that haven’t changed in 24h. if you’re running multi-account operations or managing scraper identities at this scale, multiaccountops.com/blog/ has operational guides on session management that apply directly here.
1000x (10M+ requests/day): at this volume you’re beyond what a single provider’s residential pool handles cleanly. you need geographic distribution (scraping workers in US, EU, APAC each with local proxies), a CDN-layer cache for repeat listing fetches, and real-time monitoring on your 429 rate, parse success rate, and proxy health. eBay also starts recognizing behavioral patterns at this scale even across clean IPs, so you need session warm-up logic that mimics organic browsing before hitting high-value pages. browser automation with Playwright for session cookie generation becomes necessary, not optional.
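for the monitoring piece, a rolling window over recent requests is usually enough to spot a rising 429 rate or a falling parse success rate. a small in-process sketch; RollingRate is a hypothetical class, and a production setup would more likely export these counters to a metrics system.

```python
from collections import deque


class RollingRate:
    """Fraction of True events among the most recent `window` observations."""

    def __init__(self, window: int = 500):
        self.events = deque(maxlen=window)

    def record(self, hit: bool) -> None:
        self.events.append(hit)

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0


rate_429 = RollingRate(window=4)
for status in (429, 200, 200, 200):
    rate_429.record(status == 429)
print(rate_429.rate())  # -> 0.25
```

one instance per signal (429s, parse failures, per-proxy errors) keeps each rate independent, and the deque drops old observations on its own once the window is full.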
according to eBay’s developer program documentation, they do offer official APIs for some use cases, notably the Browse API (the older Finding API has been deprecated). call limits depend on your application’s approval tier, so check the current quotas in the documentation rather than assuming a number. for structured product data at scale, evaluate whether the API covers your use case before building a full scraper; it’ll be more stable.
where to go next
- how to scrape Amazon product pages with rotating proxies, the same architecture applies with different selectors and tighter rate limits
- residential vs datacenter proxies: what actually works in 2026, if you’re still deciding on a proxy provider or want to understand the tradeoff in depth
- how to parse structured data with JSON-LD and Schema.org, eBay is one of many sites using JSON-LD and the parsing patterns generalize widely
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.