
How to scrape Walmart at scale in 2026 with proxies that work

Walmart runs one of the most aggressive bot detection stacks in US e-commerce. if you’ve tried scraping it without preparation, you already know the pattern: CAPTCHAs within minutes, IP bans within hours, and a pile of 403s that are hard to debug. I’ve been running scraping infrastructure for price monitoring and catalog aggregation since 2021, and Walmart has steadily gotten harder. in 2026 it combines Akamai Bot Manager, TLS fingerprinting, and behavioral analysis, which means a basic rotating proxy setup is no longer enough.

this tutorial is for operators who are already comfortable with Python and want a production-grade pipeline: pricing analysts, e-commerce resellers, market research teams, or data engineers building retail intelligence products. if you’re looking to scrape a handful of product pages by hand, this is probably more than you need.

by the end you’ll have a working scraper that rotates residential proxies, spoofs TLS and browser fingerprints, handles Walmart’s dynamic rendering, and can be scaled to tens of thousands of requests per day without burning through proxy budget.

review Walmart’s Terms of Use before you start, and respect their robots.txt. this tutorial does not cover anything that involves bypassing authentication, accessing private account data, or violating the CFAA. scraping publicly visible product pages sits in a contested legal space, so consult your own legal counsel for your specific use case.


what you need

  • python 3.11+ with httpx, playwright, and parsel installed
  • residential proxy pool, minimum 10,000 IPs recommended. i use ProxyScraping’s residential plan for lower-volume jobs and Oxylabs for anything above 50k requests/day. budget $50-$150/month for moderate scale
  • a linux VPS with at least 4 vCPUs and 8GB RAM for running concurrent Playwright sessions. Hetzner CX31 at around €12/month works fine to start
  • scrapy if you want async pipeline management at scale (see the Scrapy docs)
  • rotating User-Agent list sourced from a real browser UA database, not generic strings
  • Playwright for Python, which handles JS-rendered pages (see the official Playwright docs)
  • optional: a Redis queue for job management if you’re crawling at 1000+ req/hour

step by step

step 1: set up your proxy rotation with httpx

start with a basic rotating proxy client before touching Playwright. this confirms your proxy pool is working and lets you benchmark raw block rates.

import httpx
import random

PROXY_LIST = [
    "http://user:PASS@PROXY_HOST_1:8000",
    "http://user:PASS@PROXY_HOST_2:8000",
    # add more from your dashboard
]

def get_client():
    proxy = random.choice(PROXY_LIST)
    # httpx 0.26+ takes a single proxy= URL; the older proxies= dict was removed in 0.28
    return httpx.Client(proxy=proxy, timeout=15)

def fetch(url: str) -> httpx.Response:
    with get_client() as client:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        }
        return client.get(url, headers=headers, follow_redirects=True)

expected output: HTTP 200 with HTML content for product pages. if you’re seeing 403 or 429 immediately, your proxy type is probably wrong: you likely need residential, not datacenter.

if it breaks: 503 responses from Akamai usually mean TLS fingerprinting has flagged your client. move to Playwright (step 3) or switch to a client that impersonates a browser’s TLS handshake, like curl_cffi, which wraps curl-impersonate behind a requests-style API.


step 2: get the right proxy type

datacenter proxies will not work reliably against Walmart’s current stack. you need residential or mobile proxies. the distinction matters because Walmart’s bot detection cross-checks ASN data against residential ISP ranges.

i run two tiers:

  • residential rotating (ProxyScraping, Brightdata, Smartproxy): for product page fetches and search result pages
  • sticky residential sessions (same IP for 10-30 min): for anything that requires a simulated browse session, like adding to cart flows or checking seller information

for bulk product catalog scrapes the rotating residential tier handles it fine. expect to pay $4-$8 per GB depending on provider and volume.
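
to turn the per-GB price into a monthly figure, a back-of-envelope estimate helps. this is a minimal sketch, not a billing model: monthly_proxy_cost is a hypothetical helper, and the 1,000 requests/day and 0.5 MB average response size are assumptions for illustration. full Playwright page loads usually pull more data than a bare HTML fetch, so measure your own traffic before budgeting.

```python
def monthly_proxy_cost(requests_per_day: int, avg_response_mb: float,
                       price_per_gb: float, days: int = 30) -> float:
    """Estimate monthly proxy spend in dollars for a given request volume."""
    total_gb = requests_per_day * days * avg_response_mb / 1024
    return round(total_gb * price_per_gb, 2)

# assumed workload: 1,000 fetches/day at roughly 0.5 MB each
low = monthly_proxy_cost(1_000, 0.5, price_per_gb=4.0)
high = monthly_proxy_cost(1_000, 0.5, price_per_gb=8.0)
print(f"estimated spend: ${low} to ${high} per month")
```

doubling the average response size doubles the bill, which is why trimming what the browser downloads matters as much as the per-GB rate.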


step 3: set up playwright with proxy injection

Walmart’s product pages are heavily JS-rendered. prices, availability, and seller data are injected client-side. httpx alone won’t get you these fields on most listings.

from playwright.async_api import async_playwright
import asyncio

PROXY_CONFIG = {
    "server": "http://proxy.proxyscraping.com:8000",
    "username": "your_user",
    "password": "your_pass",
}

async def fetch_product(url: str) -> str:
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
            proxy=PROXY_CONFIG,
            args=["--disable-blink-features=AutomationControlled"]
        )
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            locale="en-US",
            viewport={"width": 1280, "height": 800},
        )
        page = await context.new_page()
        try:
            await page.goto(url, wait_until="networkidle", timeout=30000)
            return await page.content()
        finally:
            # close the browser even when navigation times out or is blocked
            await browser.close()

html = asyncio.run(fetch_product("https://www.walmart.com/ip/some-product/123456789"))

expected output: full rendered HTML including the __NEXT_DATA__ JSON blob embedded in the page, which contains structured product data.

if it breaks: if you get a blank page or a CAPTCHA, add a random sleep between 2 and 5 seconds before page.goto, and check that --disable-blink-features=AutomationControlled is in your launch args. if the CAPTCHA persists across multiple proxies, your fingerprint is being flagged. check antidetectreview.org/blog/ for current anti-detect browser options that handle Walmart specifically.


step 4: extract __NEXT_DATA__ for structured data

Walmart embeds a JSON blob in every product page under a <script id="__NEXT_DATA__"> tag. this is the cleanest extraction path and avoids brittle CSS selectors.

from parsel import Selector
import json

def parse_product(html: str) -> dict:
    sel = Selector(text=html)
    raw = sel.css("script#__NEXT_DATA__::text").get()
    if not raw:
        return {}
    data = json.loads(raw)
    # path varies by page type, inspect your target page
    props = data.get("props", {}).get("pageProps", {}).get("initialData", {})
    product = props.get("data", {}).get("product", {})
    return {
        "name": product.get("name"),
        # "or {}" guards against keys that are present but null
        "price": ((product.get("priceInfo") or {}).get("currentPrice") or {}).get("price"),
        "availability": product.get("availabilityStatus"),
        "seller": product.get("sellerName"),
        "item_id": product.get("usItemId"),
    }

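expected output: a dict with name, price, availability, seller, and item_id. the schema changes periodically, and because the parser uses .get() it returns None instead of raising KeyError, so log any field that comes back as None and alert yourself when the extraction path breaks.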

if it breaks: Walmart periodically restructures the __NEXT_DATA__ schema, especially after major frontend deploys. add json.dumps(data, indent=2) to a debug log and manually inspect the new path.
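
one way to catch a silent schema change is to validate every parsed record before storing it. a minimal sketch, assuming name, price, and item_id are the fields you treat as mandatory; REQUIRED_FIELDS, missing_fields, and check_record are hypothetical names, not part of any library.

```python
import logging

logger = logging.getLogger("walmart.schema")

# assumption: these are the fields a healthy product record must carry
REQUIRED_FIELDS = ("name", "price", "item_id")

def missing_fields(record: dict, required=REQUIRED_FIELDS) -> list[str]:
    """Return the required keys that are absent or None in a parsed record."""
    return [key for key in required if record.get(key) is None]

def check_record(record: dict, url: str) -> bool:
    """Log a warning and return False when extraction looks broken."""
    missing = missing_fields(record)
    if missing:
        logger.warning("possible schema drift at %s: missing %s", url, missing)
        return False
    return True
```

calling check_record(parse_product(html), url) after each fetch turns an empty extraction into a log line you can alert on instead of a row of nulls.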


step 5: handle search and category pages

product search pages (walmart.com/search?q=...) and category browse pages use a different __NEXT_DATA__ structure. the product list is usually under props.pageProps.initialData.searchResult.itemStacks.

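def parse_search(html: str) -> list[dict]:
    # reuses Selector and json imported in step 4
    sel = Selector(text=html)
    raw = sel.css("script#__NEXT_DATA__::text").get()
    if not raw:
        return []  # blocked or non-search page: no JSON blob to parse
    data = json.loads(raw)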
    stacks = (
        data.get("props", {})
        .get("pageProps", {})
        .get("initialData", {})
        .get("searchResult", {})
        .get("itemStacks", [])
    )
    items = []
    for stack in stacks:
        for item in stack.get("items", []):
            items.append({
                "name": item.get("name"),
                "price": ((item.get("priceInfo") or {}).get("currentPrice") or {}).get("price"),
                "item_id": item.get("usItemId"),
                "url": f"https://www.walmart.com/ip/{item.get('usItemId')}",
            })
    return items

if it breaks: pagination uses cursor-based parameters in the URL. if page 2+ returns no results, check the pageInfo.nextCursor field in the JSON and append it as a query parameter.


step 6: build the queue and respect rate limits

at even moderate scale you need a job queue, not a simple loop. use Redis with rq or Scrapy’s built-in scheduler.

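the practical safe rate for Walmart without triggering adaptive blocking is roughly 1 request per 3-6 seconds per IP, which works out to 600-1,200 requests per hour per IP. a sequential loop with that delay stays in the same 600-1,200 range no matter how many proxies it rotates through; 50 proxies working concurrently raise the theoretical ceiling to 30,000-60,000 per hour. don’t push any single IP faster than this without monitoring your block rate.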

import asyncio
import random
import time

def scrape_batch(urls: list[str], delay_min=3, delay_max=6):
    results = []
    for url in urls:
        html = asyncio.run(fetch_product(url))
        results.append(parse_product(html))
        time.sleep(random.uniform(delay_min, delay_max))
    return results

if it breaks: if you see block rates above 15%, extend the delay range and shrink concurrent sessions. Walmart’s adaptive system tightens thresholds based on recent traffic patterns from an IP.
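
the "extend the delay range" advice can be automated. a minimal sketch, assuming a multiplicative backoff: adjust_delay is a hypothetical helper, and the 1.5x factor and 60-second cap are illustrative values, not tuned numbers.

```python
def adjust_delay(delay_min: float, delay_max: float, block_rate: float,
                 threshold: float = 0.15, factor: float = 1.5,
                 cap: float = 60.0) -> tuple[float, float]:
    """Widen the delay range when the observed block rate exceeds threshold."""
    if block_rate > threshold:
        return min(delay_min * factor, cap), min(delay_max * factor, cap)
    return delay_min, delay_max

# 20% of recent requests were blocked, so the 3-6s range is stretched
print(adjust_delay(3, 6, block_rate=0.20))
```

feed the returned pair back into scrape_batch as delay_min and delay_max between batches, and shrink concurrency separately if the rate stays high.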


step 7: store and monitor

write results to postgres or sqlite with a scraped_at timestamp. price data without timestamps is useless. add a simple block-rate counter: if more than 10% of requests in a rolling 100-request window return non-200, pause the job and alert.
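
both pieces fit in the standard library. a minimal sketch, assuming sqlite and a single prices table: BlockRateMonitor, save_result, and the table layout are hypothetical names chosen for illustration. the monitor waits for a full 100-request window before it can trip, which is a design choice so that two early failures don’t pause a fresh job.

```python
import sqlite3
from collections import deque
from datetime import datetime, timezone

class BlockRateMonitor:
    """Track the share of non-200 responses over a rolling window."""

    def __init__(self, window: int = 100, threshold: float = 0.10):
        self.statuses = deque(maxlen=window)  # oldest entries fall off
        self.threshold = threshold

    def record(self, status_code: int) -> None:
        self.statuses.append(status_code)

    @property
    def block_rate(self) -> float:
        if not self.statuses:
            return 0.0
        blocked = sum(1 for status in self.statuses if status != 200)
        return blocked / len(self.statuses)

    def should_pause(self) -> bool:
        # only judge once the window is full
        full = len(self.statuses) == self.statuses.maxlen
        return full and self.block_rate > self.threshold

def save_result(conn: sqlite3.Connection, record: dict) -> None:
    """Insert one parsed product with a UTC scraped_at timestamp."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices "
        "(item_id TEXT, name TEXT, price REAL, scraped_at TEXT)"
    )
    conn.execute(
        "INSERT INTO prices VALUES (?, ?, ?, ?)",
        (record.get("item_id"), record.get("name"), record.get("price"),
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
```

call monitor.record(status) after every response and check monitor.should_pause() before pulling the next job off the queue.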


common pitfalls

using datacenter proxies. this is the single biggest mistake. Walmart’s ASN filtering will catch datacenter ranges from AWS, GCP, and most bulk proxy providers. residential only. separately, on the legal side, the hiQ Labs v. LinkedIn case is worth reading for context on how courts have treated public web scraping, though it’s not a green light.

not rotating User-Agents. a single UA string sent 5,000 times is a pattern. maintain a weighted pool of real browser UA strings, skewed toward Chrome on Windows since that matches the majority of real Walmart traffic.
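
a weighted pool is a few lines with random.choices. a minimal sketch: the three strings and the 70/20/10 weights are illustrative assumptions, not measured Walmart traffic shares, and pick_user_agent is a hypothetical helper. in practice you would load the pool from your UA database and keep versions current.

```python
import random

# (user agent, relative weight); weights are assumed, skewed to Chrome on Windows
UA_POOL = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36", 70),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
     "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36", 20),
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
     "Gecko/20100101 Firefox/125.0", 10),
]

def pick_user_agent(rng=random) -> str:
    """Draw one UA string, honoring the weights in UA_POOL."""
    agents, weights = zip(*UA_POOL)
    return rng.choices(agents, weights=weights, k=1)[0]

headers = {"User-Agent": pick_user_agent()}
```

keep the UA consistent with the rest of the fingerprint: a Firefox string on a Chromium TLS handshake is its own red flag.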

ignoring TLS fingerprinting. httpx with default settings presents a TLS fingerprint distinct from a real browser. use curl_cffi with impersonate="chrome124" as a lightweight fix if you don’t want to run full Playwright.

scraping too fast from too few IPs. 10 requests per second from 5 IPs will get you banned. 1 request per 5 seconds from 200 IPs is much stealthier and often cheaper in terms of retry cost.
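
the arithmetic behind this is worth making explicit. a minimal sketch, assuming requests are spread evenly across the pool (round-robin; random rotation will be lumpier): per_ip_interval and ips_needed are hypothetical helpers.

```python
import math

def per_ip_interval(total_rps: float, ip_count: int) -> float:
    """Seconds between consecutive requests from any single IP."""
    return ip_count / total_rps

def ips_needed(total_rps: float, min_interval_s: float) -> int:
    """Smallest pool that keeps every IP at or above min_interval_s."""
    return math.ceil(total_rps * min_interval_s)

print(per_ip_interval(10, 5))    # 0.5 -> each of 5 IPs fires twice a second
print(per_ip_interval(40, 200))  # 5.0 -> 200 IPs at 1 req per 5s is 40 req/s total
print(ips_needed(10, 5))         # 50 -> pool size for 10 req/s at a 5s interval
```

the second line shows why the larger pool wins: it moves four times the traffic while each IP stays ten times quieter.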

not monitoring __NEXT_DATA__ schema drift. Walmart deploys frontend changes frequently. a schema change will silently produce empty extractions if you’re not alerting on null fields. set up a canary that checks a known product’s price daily and alerts if extraction fails.


scaling this

10x (1,000-10,000 req/day): a single VPS, 50-100 residential IPs, and a cron job gets you here. Playwright is fine at this level. spend around $50-100/month on proxies.

100x (10,000-100,000 req/day): you need async concurrency properly implemented: Scrapy with scrapy-playwright or a custom asyncio pool. split your proxy pool into geographic clusters; US-geolocated IPs perform better against Walmart’s CDN. budget $200-500/month on infrastructure and proxies combined. a Redis queue becomes mandatory for job management and retry logic.
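
a custom asyncio pool can be as small as a semaphore around your fetch coroutine. a minimal sketch, assuming fetch is any async function that takes a URL (for example fetch_product from step 3); run_pool and the concurrency limit of 5 are hypothetical, and a production version would add retries and per-proxy pacing.

```python
import asyncio

async def run_pool(urls, fetch, max_concurrency: int = 5):
    """Run fetch(url) over all urls with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # return the error so one failure doesn't cancel the batch
                return url, exc

    # gather preserves input order in its result list
    return await asyncio.gather(*(worker(url) for url in urls))
```

usage would look like asyncio.run(run_pool(urls, fetch_product, max_concurrency=5)); results come back as (url, html_or_exception) pairs, so failed URLs can be pushed back onto the queue.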

1,000x (100,000+ req/day): at this level you’re looking at distributed workers, probably on Kubernetes or a task queue like Celery with multiple workers. proxy cost dominates your budget, $1,500/month and up depending on provider and bandwidth. consider negotiating a dedicated proxy pool directly with a provider rather than using shared residential. you also need a proper observability stack, Prometheus + Grafana at minimum, to catch block rate spikes before they cascade. if you’re operating multi-account or managing multiple scraping identities at this level, the workflows covered at multiaccountops.com/blog/ become relevant for keeping your infrastructure segmented.


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
