How to scrape Amazon at scale in 2026 with proxies that work
Amazon is one of the hardest targets to scrape consistently. the anti-bot systems have gotten more aggressive every year, the page structure changes without notice, and a naive datacenter IP will get blocked within minutes. i’ve run price monitoring pipelines against Amazon for several years and the core problem is always the same: Amazon treats every non-human request as a threat, and you need to look convincingly human without actually being human.
this guide is for operators who need reliable, repeatable Amazon data, whether that’s product prices, rankings, reviews, or search results. it assumes you know basic Python and have worked with HTTP requests before. if you’re brand new to proxies, read the proxy fundamentals overview on this site first. by the end of this guide you’ll have a working scraper that can handle a few thousand daily requests and a clear path to scaling further.
the approach i use is built around residential proxies with session rotation, a headless browser for JS-heavy pages, and a retry layer that handles the inevitable blocks gracefully. it’s not the cheapest setup, but it’s the one that actually works reliably in production.
what you need
tools and libraries
- Python 3.11+
- Scrapy 2.11+ for the crawl engine
- Playwright (via scrapy-playwright) for pages that require JS execution
- cloudscraper or curl-cffi for lightweight requests on simpler pages
- Redis for request queuing and seen-URL deduplication
- PostgreSQL or a cloud warehouse (BigQuery, Supabase) for storage
proxies
- residential proxy pool with at least 1M IPs recommended. i use ProxyScraping’s residential pool for mid-scale runs, and Bright Data or Smartproxy for high-volume production. expect to pay $5-10 per GB for residential traffic depending on provider and volume tier.
- sticky sessions (same IP for a session duration of 10-30 minutes) are important for cart and detail page workflows
accounts and infrastructure
- a VPS or cloud instance (2 vCPU, 4GB RAM minimum) to run the scraper
- Docker for containerizing the scraper and making deploys reproducible
- a CAPTCHA solving service if you plan to go above ~5,000 requests per day. 2captcha and CapMonster both work, at roughly $1-3 per 1,000 solves.
budget estimate for a small production run (10,000 requests/day)
- proxy traffic: ~$20-50/month depending on page weight and cache hit rate
- CAPTCHA solving: ~$15-30/month
- VPS: ~$10-20/month
- total: roughly $50-100/month to start
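the proxy line is the one most sensitive to page weight. at $5-10 per GB, $20-50/month buys roughly 2-10 GB, which over 300,000 monthly requests is only about 7-33 KB per request, so that figure only holds with lightweight, compressed responses and a high cache hit rate. here is a minimal sketch for plugging in your own measurements. the function names and the 1 GB = 1,000,000 KB convention are my assumptions, not part of any provider's billing API.

```python
def monthly_proxy_cost(requests_per_day, avg_kb_per_request, price_per_gb, days=30):
    """Estimate monthly proxy spend in dollars (assumes 1 GB = 1,000,000 KB)."""
    total_gb = requests_per_day * days * avg_kb_per_request / 1_000_000
    return total_gb * price_per_gb


def monthly_captcha_cost(requests_per_day, captcha_rate, price_per_1000, days=30):
    """Estimate monthly solver spend; captcha_rate is the fraction of requests that hit a CAPTCHA."""
    solves = requests_per_day * days * captcha_rate
    return solves / 1000 * price_per_1000


# 10,000 requests/day at an assumed 100 KB per request and $5 per GB
print(monthly_proxy_cost(10_000, 100, 5))  # 150.0
```

measure real transfer per request on a test run before committing to a plan. Playwright pages that pull images and scripts through the proxy will weigh far more than a plain HTML fetch.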
step by step
step 1: set up your Scrapy project
pip install scrapy scrapy-playwright cloudscraper redis psycopg2-binary
playwright install chromium
scrapy startproject amazon_scraper
cd amazon_scraper
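scrapy startproject already generates settings.py, so edit it and add the basics:
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 1.5
RANDOMIZE_DOWNLOAD_DELAY = True
COOKIES_ENABLED = False
RETRY_TIMES = 3
RETRY_HTTP_CODES = [403, 503, 429, 407]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    # run before the built-in HttpProxyMiddleware (750) so it sees our proxy
    'amazon_scraper.middlewares.ProxyMiddleware': 740,
}
# scrapy-playwright needs its download handler and the asyncio reactor
DOWNLOAD_HANDLERS = {
    'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
    'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
}
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'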
if it breaks: if Playwright fails to install chromium, run playwright install --with-deps chromium to pull system dependencies. on headless servers you may need xvfb.
step 2: configure proxy rotation middleware
write a simple middleware in middlewares.py that rotates your proxy per request:
import random


class ProxyMiddleware:
    def __init__(self, proxy_list):
        self.proxies = proxy_list

    @classmethod
    def from_crawler(cls, crawler):
        # read the proxy URLs from the PROXY_LIST setting
        proxies = crawler.settings.get('PROXY_LIST', [])
        return cls(proxies)

    def process_request(self, request, spider):
        # pick a random proxy per request; skip if no proxies are configured
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)
for ProxyScraping’s residential rotating endpoint, you don’t need a list, you point to a single gateway URL and the provider handles rotation:
# replace <gateway-host> with the hostname shown in your provider dashboard
PROXY_LIST = ['http://user:pass@<gateway-host>:9000']
if it breaks: if you see 407 [Proxy Authentication](https://proxyscraping.org/blog/proxy-authentication-user-pass-vs-ip-whitelist-trade-offs) Required, your credentials are wrong or the plan has expired. check the provider dashboard.
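one caveat for the Playwright requests used later in this guide: as far as i know, scrapy-playwright does not route browser traffic through request.meta['proxy'], and expects the proxy in the browser launch options instead. here is a minimal sketch that converts a gateway URL into that shape. the helper name and the gateway.example host are hypothetical, so check the scrapy-playwright docs for your version before relying on it.

```python
from urllib.parse import urlsplit


def playwright_proxy_from_url(proxy_url):
    """Split a proxy URL into the dict shape Playwright's launch options expect."""
    parts = urlsplit(proxy_url)
    proxy = {'server': f'{parts.scheme}://{parts.hostname}:{parts.port}'}
    if parts.username:
        proxy['username'] = parts.username
    if parts.password:
        proxy['password'] = parts.password
    return proxy


# hypothetical gateway URL; substitute your provider's host and credentials
PLAYWRIGHT_LAUNCH_OPTIONS = {
    'proxy': playwright_proxy_from_url('http://user:pass@gateway.example:9000'),
}
```

the PLAYWRIGHT_LAUNCH_OPTIONS assignment belongs in settings.py. with a rotating gateway the provider still picks the exit IP, so one browser-level proxy entry is usually enough.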
step 3: craft realistic request headers
Amazon checks User-Agent, Accept-Language, and a set of browser-consistency headers. bare requests without these headers get flagged immediately.
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    # Chrome sends q=0.9 here; q=0.5 is the Firefox default
    'Accept-Language': 'en-US,en;q=0.9',
    # only advertise br if a brotli package is installed, or responses won't decode
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
}
rotate User-Agent strings from a realistic pool. i keep a list of 20-30 real Chrome versions and pick randomly per session.
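a minimal sketch of per-session selection, assuming you identify sessions with some string key. the three-entry pool and the user_agent_for name are illustrative only; a real pool would carry the 20-30 versions mentioned above.

```python
import random

# illustrative pool; extend it with the Chrome versions you actually observe
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    f'(KHTML, like Gecko) Chrome/{major}.0.0.0 Safari/537.36'
    for major in (122, 123, 124)
]

# remembers which User-Agent each session was given
_session_agents = {}


def user_agent_for(session_id):
    """Return the same User-Agent for every request in a session."""
    if session_id not in _session_agents:
        _session_agents[session_id] = random.choice(USER_AGENTS)
    return _session_agents[session_id]
```

pinning the choice to the session matters because a User-Agent that changes mid-session on the same IP is itself a signal.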
if it breaks: if you’re getting 503 with a CAPTCHA page rather than a clean block, your headers are inconsistent. make sure Sec-Fetch-* headers match the User-Agent browser family.
step 4: handle the product detail page
Amazon ASIN-level product pages (amazon.com/dp/ASINCODE) are the most useful and the most protected. for these, use Playwright when the standard request returns a bot-check page:
import scrapy
from scrapy_playwright.page import PageMethod


class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        asins = ['B08N5WRWNW', 'B09B8YWXDF']  # your ASIN list
        for asin in asins:
            url = f'https://www.amazon.com/dp/{asin}'
            yield scrapy.Request(
                url,
                meta={
                    'playwright': True,
                    'playwright_page_methods': [
                        # wait until the title is rendered before returning the page
                        PageMethod('wait_for_selector', '#productTitle'),
                    ],
                },
                callback=self.parse_product,
            )

    def parse_product(self, response):
        # take the path segment after /dp/ and drop any query string
        asin = response.url.split('/dp/')[-1].split('/')[0].split('?')[0]
        yield {
            'asin': asin,
            'title': response.css('#productTitle::text').get('').strip(),
            'price': response.css('.a-price .a-offscreen::text').get(''),
            'rating': response.css('span[data-hook="rating-out-of-text"]::text').get(''),
            'review_count': response.css('#acrCustomerReviewText::text').get(''),
        }
if it breaks: if #productTitle never appears, Amazon may be serving a different page variant. add a screenshot step in Playwright to see what’s actually loading: PageMethod('screenshot', path='debug.png').
step 5: handle CAPTCHAs programmatically
when you hit the CAPTCHA page (/errors/validateCaptcha), you need to solve it or rotate to a fresh IP. for small volumes, rotating IP is cheaper. for high volumes, integrate a solver:
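import base64
import time

import requests as req


def get_image_base64(image_url):
    # download the CAPTCHA image and base64-encode it for the solver
    resp = req.get(image_url, timeout=30)
    resp.raise_for_status()
    return base64.b64encode(resp.content).decode('ascii')


def solve_captcha(image_url, api_key, max_wait=120):
    # send to 2captcha
    task = req.post('https://2captcha.com/in.php', data={
        'key': api_key,
        'method': 'base64',
        'body': get_image_base64(image_url),
    }, timeout=30)
    if not task.text.startswith('OK|'):
        raise RuntimeError(f'2captcha submit failed: {task.text}')
    task_id = task.text.split('|')[1]
    # poll for result until it is ready or max_wait seconds have passed
    deadline = time.time() + max_wait
    while time.time() < deadline:
        time.sleep(5)
        result = req.get('https://2captcha.com/res.php', timeout=30,
                         params={'key': api_key, 'action': 'get', 'id': task_id})
        if result.text.startswith('OK|'):
            return result.text.split('|')[1]
        if result.text != 'CAPCHA_NOT_READY':
            raise RuntimeError(f'2captcha error: {result.text}')
    raise TimeoutError('2captcha did not return a solution in time')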
if it breaks: 2captcha has a success rate of roughly 90-95% on Amazon’s image CAPTCHAs. if you’re seeing below 80%, the image is being passed incorrectly. log the raw base64 and test it in the 2captcha dashboard directly.
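before solving anything you need to notice that you landed on the CAPTCHA page in the first place. here is a minimal sketch of that check and of the rotate-or-solve decision described at the start of this step. it assumes the interstitial always references /errors/validateCaptcha in its URL or HTML, which is a heuristic, and both function names are hypothetical.

```python
def is_captcha_page(url, body):
    """Heuristic: the CAPTCHA interstitial references /errors/validateCaptcha."""
    return 'validateCaptcha' in url or 'validateCaptcha' in body


def next_action(url, body, solver_enabled):
    """Return 'parse', 'solve', or 'rotate' for a downloaded page."""
    if not is_captcha_page(url, body):
        return 'parse'
    # small runs rotate to a fresh IP; high-volume runs hand off to the solver
    return 'solve' if solver_enabled else 'rotate'
```

in a Scrapy project this check would typically sit in a downloader middleware's process_response, so blocked pages never reach the parse callback.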
step 6: parse and store results
write parsed items to PostgreSQL using a Scrapy pipeline:
import os

import psycopg2


class PostgresPipeline:
    def open_spider(self, spider):
        # DATABASE_URL holds the PostgreSQL connection string
        self.conn = psycopg2.connect(os.environ['DATABASE_URL'])
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        # release the cursor and connection when the crawl ends
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        # upsert on asin so re-scrapes refresh price and timestamp
        self.cur.execute(
            """INSERT INTO products (asin, title, price, rating, scraped_at)
            VALUES (%s, %s, %s, %s, NOW())
            ON CONFLICT (asin) DO UPDATE SET
            price = EXCLUDED.price,
            scraped_at = EXCLUDED.scraped_at""",
            (item['asin'], item['title'], item['price'], item['rating'])
        )
        self.conn.commit()
        return item
if it breaks: if you see duplicate key errors, ensure the asin column has a UNIQUE constraint and the ON CONFLICT clause matches your schema.
step 7: run and monitor
scrapy crawl product -s LOG_LEVEL=INFO 2>&1 | tee scrape_$(date +%Y%m%d).log
watch for three metrics: request success rate (should be above 85%), CAPTCHA rate (below 5% is healthy), and proxy error rate (407/503 splits). if success rate drops below 70%, your proxy pool is likely flagged and you need to rotate provider or endpoint.
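a minimal sketch of turning those thresholds into a single status, assuming you already count total requests, successes, and CAPTCHA pages per run. the crawl_health name and the 'degraded' middle band are my own labels.

```python
def crawl_health(total, succeeded, captchas):
    """Summarize a run against the 85% / 70% success and 5% CAPTCHA thresholds."""
    if total == 0:
        return {'success_rate': 0.0, 'captcha_rate': 0.0, 'status': 'no data'}
    success_rate = succeeded / total
    captcha_rate = captchas / total
    if success_rate < 0.70:
        status = 'pool likely flagged'
    elif success_rate <= 0.85 or captcha_rate >= 0.05:
        status = 'degraded'
    else:
        status = 'healthy'
    return {
        'success_rate': success_rate,
        'captcha_rate': captcha_rate,
        'status': status,
    }
```

run it on a rolling window rather than the whole crawl, since a pool that gets flagged halfway through is hidden by a good first hour.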
common pitfalls
using datacenter proxies on product pages. datacenter IPs are blacklisted at the ASN level on Amazon for most residential markets. you might get lucky on some marketplace domains, but for amazon.com you need residential or mobile IPs. i’ve wasted money on datacenter pools that worked on test runs and failed within an hour of production volume.
static sessions. using the same IP for hundreds of requests across different products looks like a bot. keep session length to 10-20 requests per IP, then rotate. most residential proxy providers support session TTL via URL parameters.
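a minimal sketch of capping requests per session, assuming your provider lets you pin an IP by passing a session ID. how that ID is passed (username suffix, URL parameter) is provider-specific and not shown here; the SessionRotator class is hypothetical.

```python
import uuid


class SessionRotator:
    """Hand out a session ID and replace it after max_requests uses."""

    def __init__(self, max_requests=15):
        self.max_requests = max_requests
        self._session_id = uuid.uuid4().hex[:8]
        self._used = 0

    def next_session(self):
        # start a fresh session once the current one has been used up
        if self._used >= self.max_requests:
            self._session_id = uuid.uuid4().hex[:8]
            self._used = 0
        self._used += 1
        return self._session_id
```

call next_session() once per request and build the proxy credentials from the returned ID in whatever format your provider documents.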
ignoring robots.txt signals without understanding the legal context. Amazon’s Conditions of Use prohibit systematic scraping. this is not legal advice, and how companies enforce ToS varies by jurisdiction. understand what you’re doing and why before you build production pipelines. the hiQ v. LinkedIn case and subsequent rulings affect the US context, but rules differ elsewhere.
not handling geographic variation. Amazon serves different prices, availability, and content based on geolocation. if your proxies are routing through unexpected countries, your data will be wrong. lock proxy exit geography to your target market.
parsing price strings naively. Amazon displays prices as $24.99, $24.99 - $34.99, from $19.99, or nothing at all for out-of-stock items. build a price parser that handles all cases rather than assuming a single format.
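a minimal sketch of a parser for the formats listed above, assuming US dollar prices only. it returns a (low, high) pair so single prices and ranges share one shape, and None when no price is shown. parse_price is a hypothetical helper, and other marketplaces need their own currency and decimal-separator handling.

```python
import re
from decimal import Decimal

# a dollar sign followed by digits, optional thousands commas, optional cents
_PRICE_RE = re.compile(r'\$\s*(\d[\d,]*(?:\.\d+)?)')


def parse_price(text):
    """Return (low, high) as Decimals, or None when no price is present."""
    amounts = [Decimal(m.replace(',', '')) for m in _PRICE_RE.findall(text or '')]
    if not amounts:
        return None
    return min(amounts), max(amounts)


print(parse_price('$24.99 - $34.99'))  # (Decimal('24.99'), Decimal('34.99'))
print(parse_price(''))                 # None
```

Decimal avoids the rounding drift you get from storing prices as floats.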
scaling this
10x (100,000 requests/day): the bottleneck shifts to concurrent Scrapy workers. run multiple spiders in parallel using Docker Compose, each hitting different ASIN segments. use Redis as a shared request queue with scrapy-redis. proxy costs become significant at this level, so negotiate volume pricing.
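a minimal sketch of splitting an ASIN list into stable segments, one per worker container. hashing is just one way to segment and is my assumption here; the function names are hypothetical.

```python
import hashlib


def shard_for(asin, num_shards):
    """Map an ASIN to a stable shard index in [0, num_shards)."""
    digest = hashlib.sha256(asin.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % num_shards


def asins_for_shard(asins, shard, num_shards):
    """Return the ASINs that belong to one worker's shard."""
    return [a for a in asins if shard_for(a, num_shards) == shard]
```

each container would read its shard index from an environment variable and crawl only asins_for_shard(all_asins, shard, num_shards). a hash-based split stays stable across restarts, unlike slicing a list whose order can change.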
100x (1M requests/day): you need a distributed crawl architecture. Scrapy-Redis with a cluster of 5-10 crawler nodes, each pulling from a shared Redis queue. consider moving to a managed proxy API that handles rotation and session management internally rather than doing it yourself. CAPTCHA solving costs can run $50-150/day at this volume. some operators use antidetect browsers like Multilogin or GoLogin to manage browser fingerprint consistency at scale. for a review of those tools, antidetectreview.org covers the options well.
1000x (10M+ requests/day): at this level, you’re running a professional data pipeline. you need dedicated infrastructure, likely a cloud-native job scheduler (Airflow or Prefect), a streaming data layer (Kafka or Pub/Sub), and active relationship management with proxy providers. cost per GB drops significantly with custom contracts. monitoring becomes critical, so instrument everything with Prometheus and Grafana. consider whether buying Amazon data from a licensed third-party provider (e.g., data brokers with Amazon seller relationships) is cheaper than crawling at this volume.
where to go next
- How to rotate residential proxies in Python without getting blocked - deeper dive into session management and IP rotation strategies
- Best residential proxy providers for e-commerce scraping in 2026 - tested comparison of ProxyScraping, Bright Data, Smartproxy, and Oxylabs with real benchmark data
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.