How to scrape Indeed at scale in 2026 with proxies that work

Indeed is one of the most valuable sources of job market data on the internet. Recruiters pull it for talent intelligence, researchers use it to track hiring trends, and operators build salary benchmarking tools off it. the problem is that Indeed blocks scrapers aggressively, and most tutorials written before 2024 are useless now because the site has moved almost entirely to client-side rendering and added multiple fingerprinting layers.

I’ve been running data pipelines off job boards for three years out of Singapore. this guide is for people who want reliable, repeatable job data at volume, not a one-off CSV. by the end you’ll have a working Playwright-based scraper with rotating residential proxies, a parsing pipeline, and a clear picture of where things break at scale.

one honest note before we start: Indeed’s Terms of Service prohibit automated access without their written permission, and their robots.txt disallows crawling on most paths. this is a technical tutorial, not legal advice. if you’re building a commercial product on this data, have a lawyer look at your use case first.

what you need

  • Python 3.11+ installed locally or on a VPS
  • Playwright for Python (pip install playwright) and its Chromium browser (playwright install chromium)
  • Rotating residential proxies, minimum 5GB pool. datacenter proxies get blocked within minutes on Indeed. i use ProxyScraping residential proxies, which run about $3/GB at time of writing
  • A VPS or cloud instance (optional for development, required at scale). a $6/month Hetzner CX22 is enough for up to ~500 requests/day
  • PostgreSQL or SQLite for storing results. i’ll use SQLite for this guide to keep setup minimal
  • selectolax for HTML parsing, tenacity for retries, and httpx for direct HTTP calls if you later move some requests out of the browser

estimated monthly cost for 10,000 job listings: around $8-15 in proxy bandwidth, $6-12 in compute.
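to sanity-check the proxy figure, here is a back-of-envelope calculator. the $3/GB rate comes from the list above. listings_per_page and mb_per_page are assumptions for illustration, not measured values, and proxy_cost_usd is a hypothetical helper, so measure your own page weight before trusting the output.

```python
def proxy_cost_usd(listings: int, listings_per_page: int,
                   mb_per_page: float, usd_per_gb: float) -> float:
    """Estimate proxy bandwidth cost for scraping `listings` job listings."""
    pages = -(-listings // listings_per_page)  # ceiling division
    gigabytes = pages * mb_per_page / 1024
    return round(gigabytes * usd_per_gb, 2)


# Assumed inputs: 15 listings per results page, 5 MB transferred per page load.
estimate = proxy_cost_usd(10_000, listings_per_page=15, mb_per_page=5.0, usd_per_gb=3.0)
print(f"~${estimate} in proxy bandwidth")
```

blocking images and fonts in the browser would lower mb_per_page, which is the input that dominates the result.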

step by step

step 1: set up your Python environment

create a fresh virtual environment and install dependencies:

python3 -m venv venv
source venv/bin/activate
pip install playwright httpx selectolax tenacity python-dotenv
playwright install chromium

put your proxy credentials in a .env file, never hardcoded:

PROXY_HOST=gate.proxyscraping.com
PROXY_PORT=9999
PROXY_USER=your_username
PROXY_PASS=your_password

if it breaks: if playwright install chromium fails on a headless VPS, run playwright install-deps chromium first to pull system dependencies.
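a missing or misspelled key in .env tends to surface much later as a confusing connection error, so it can help to validate the settings up front. this is a minimal sketch: load_proxy_settings is a hypothetical helper, and the dict it returns is shaped for Playwright's proxy option, which accepts server, username, and password fields.

```python
import os

REQUIRED_KEYS = ("PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS")


def load_proxy_settings(env=None) -> dict:
    """Build a Playwright-style proxy dict, failing loudly on missing keys."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing proxy settings: {', '.join(missing)}")
    return {
        "server": f"http://{env['PROXY_HOST']}:{env['PROXY_PORT']}",
        "username": env["PROXY_USER"],
        "password": env["PROXY_PASS"],
    }
```

call it once after load_dotenv() and pass the result wherever you launch the browser. the env parameter exists only so the function can be exercised with a plain dict.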

step 2: build the base browser context

Indeed’s anti-bot layer checks for headless browser signals. the key is to launch with realistic viewport, timezone, and user agent settings:

import asyncio
import os
from playwright.async_api import async_playwright
from dotenv import load_dotenv

load_dotenv()

# Pass credentials as separate fields. Chromium does not reliably honor a
# user:pass@ prefix embedded in the proxy server URL.
PROXY = {
    "server": f"http://{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}",
    "username": os.getenv("PROXY_USER"),
    "password": os.getenv("PROXY_PASS"),
}

async def make_context(playwright):
    browser = await playwright.chromium.launch(
        headless=True,
        proxy=PROXY,
        args=["--disable-blink-features=AutomationControlled"]
    )
    context = await browser.new_context(
        viewport={"width": 1280, "height": 800},
        locale="en-US",
        timezone_id="America/New_York",
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        )
    )
    return browser, context

if it breaks: if you see ERR_TUNNEL_CONNECTION_FAILED, your proxy credentials are wrong or the gateway hostname has changed. check the ProxyScraping dashboard for the current endpoint.

step 3: construct the search URL

Indeed’s search URL structure is stable. the main parameters are q (query), l (location), start (pagination offset in multiples of 10), and fromage (maximum posting age in days):

import urllib.parse

def build_url(query: str, location: str, page: int = 0) -> str:
    params = {
        "q": query,
        "l": location,
        "start": page * 10,
        "fromage": 14,  # jobs posted in last 14 days
    }
    return "https://www.indeed.com/jobs?" + urllib.parse.urlencode(params)

if it breaks: if you’re getting results for the wrong region, check whether Indeed is redirecting you to a country-specific domain based on the proxy’s exit IP. separately, to filter by job type, append &sc=0kf%3Ajt(fulltime)%3B to the URL.

step 4: handle the page load and block detection

Indeed renders job cards with React. you need to wait for the card container to appear, and check for the CAPTCHA gate before parsing:

async def fetch_page(page, url: str) -> str | None:
    try:
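        # Wait for the DOM only; "networkidle" can stall on pages that keep
        # background requests open. The selector wait below covers rendering.
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)

        # Check for CAPTCHA or block page
        if await page.query_selector('[id="challenge-error-title"]'):
            print(f"Blocked on {url}")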
            return None

        # Wait for job cards to render
        await page.wait_for_selector('[data-testid="job-title"]', timeout=10000)
        return await page.content()

    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return None

if it breaks: if wait_for_selector times out consistently, Indeed may have changed the selector. open the page in a headed browser (headless=False) and inspect the DOM. as of May 2026, [data-testid="job-title"] is still the correct anchor.

step 5: parse the job listings

once you have the HTML, use selectolax for fast parsing. it’s significantly faster than BeautifulSoup on large volumes:

from selectolax.parser import HTMLParser

def parse_jobs(html: str) -> list[dict]:
    tree = HTMLParser(html)
    jobs = []

    for card in tree.css('[data-testid="slider_container"]'):
        title_node = card.css_first('[data-testid="job-title"]')
        company_node = card.css_first('[data-testid="company-name"]')
        location_node = card.css_first('[data-testid="text-location"]')
        salary_node = card.css_first('[data-testid="attribute_snippet_testid"]')

        jobs.append({
            "title": title_node.text(strip=True) if title_node else None,
            "company": company_node.text(strip=True) if company_node else None,
            "location": location_node.text(strip=True) if location_node else None,
            "salary": salary_node.text(strip=True) if salary_node else None,
        })

    return jobs

if it breaks: Indeed A/B tests its UI constantly. if you’re getting empty results, log the raw HTML and check what container classes are actually present on that session. running the same URL through two different proxy exits sometimes gives different UI versions.
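when the parser comes back empty, a quick inventory of the data-testid values in the saved HTML shows which UI variant you were served. this sketch uses only the standard library so it can run against a logged file without the scraper's dependencies. inventory_testids and DataTestIdCollector are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser as StdHTMLParser


class DataTestIdCollector(StdHTMLParser):
    """Count every data-testid attribute value seen in a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-testid" and value:
                self.counts[value] += 1


def inventory_testids(html: str) -> Counter:
    collector = DataTestIdCollector()
    collector.feed(html)
    return collector.counts


sample = '<div data-testid="slider_container"><h2 data-testid="job-title">A</h2></div>'
for testid, count in inventory_testids(sample).most_common():
    print(f"{count:4d}  {testid}")
```

comparing the inventory from a good session with one from a failing session usually points at the selector that changed.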

step 6: add retry logic and polite delays

Playwright’s async API doesn’t have built-in retry. use tenacity to wrap your fetches, and add a randomized sleep between requests to avoid rate-limit patterns:

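import random
from tenacity import retry, stop_after_attempt, wait_exponential

# After the final failed attempt, return None instead of raising RetryError,
# so the caller's `if html:` check can skip the page and keep going.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry_error_callback=lambda retry_state: None,
)
async def fetch_with_retry(page, url: str) -> str | None:
    html = await fetch_page(page, url)
    if html is None:
        raise RuntimeError("Blocked or empty response")
    return html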

async def polite_delay():
    await asyncio.sleep(random.uniform(2.0, 5.0))

if it breaks: if you’re hitting the retry limit consistently, the cause is usually proxy quality rather than your code. switch to a different proxy pool or reduce concurrency.
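if you would rather not depend on tenacity, the same idea fits in a few lines of plain asyncio. to be clear, this is a swapped-in alternative and not tenacity's implementation: retry_async is a hypothetical helper, and the delay values are assumptions chosen to mirror the 2-10 second range used above.

```python
import asyncio
import random


async def retry_async(func, attempts: int = 3,
                      base_delay: float = 2.0, max_delay: float = 10.0):
    """Call `func()` until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = await func()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff capped at max_delay, plus up to 25% jitter.
            delay = min(base_delay * (2 ** attempt), max_delay)
            await asyncio.sleep(delay * random.uniform(1.0, 1.25))
    return None  # exhausted: let the caller decide whether to skip the page
```

usage would look like html = await retry_async(lambda: fetch_page(page, url)), which returns None on exhaustion so the scrape loop can move on.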

step 7: wire it together and store results

import sqlite3

def init_db(db_path: str = "indeed_jobs.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT, company TEXT, location TEXT,
            salary TEXT, scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

async def scrape(query: str, location: str, max_pages: int = 10):
    conn = init_db()
    async with async_playwright() as p:
        browser, context = await make_context(p)
        page = await context.new_page()

        for page_num in range(max_pages):
            url = build_url(query, location, page_num)
            html = await fetch_with_retry(page, url)
            if html:
                jobs = parse_jobs(html)
                conn.executemany(
                    "INSERT INTO jobs (title, company, location, salary) VALUES (?,?,?,?)",
                    [(j["title"], j["company"], j["location"], j["salary"]) for j in jobs]
                )
                conn.commit()
                print(f"Page {page_num + 1}: saved {len(jobs)} jobs")
            await polite_delay()

        await browser.close()
    conn.close()

if __name__ == "__main__":
    asyncio.run(scrape("data engineer", "Remote"))

if it breaks: if SQLite throws database is locked, you have multiple processes writing to the same file. switch to PostgreSQL or use WAL mode: conn.execute("PRAGMA journal_mode=WAL").
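a minimal sketch of opening the database with WAL enabled and a busy timeout. open_db is a hypothetical helper and the 30 second timeout is an assumption. note that WAL still allows only one writer at a time. it mainly stops readers and the writer from blocking each other, and the timeout makes a second writer wait instead of failing immediately.

```python
import sqlite3


def open_db(db_path: str) -> sqlite3.Connection:
    """Open a SQLite file with WAL journaling and a generous busy timeout."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait up to 30 s on a locked database
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        # In-memory and some network filesystems cannot use WAL.
        print(f"warning: journal mode is {mode}, not WAL")
    return conn
```

the journal mode is stored in the database file, so it only needs to be set once, but setting it on every open is harmless.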

common pitfalls

using datacenter proxies. this is the most common mistake. Indeed identifies datacenter ASNs immediately and either blocks the request or serves a CAPTCHA. residential or mobile proxies are required. ISP proxies sometimes work but are inconsistent.

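not rotating the proxy per session. if you reuse the same proxy connection across hundreds of pages, Indeed’s session tracking will flag and block it. relaunch the browser and context per batch of 20-30 pages. in the code above the proxy is attached at browser launch, so a new context alone keeps the same proxy settings, and whether a new connection gets a new exit IP depends on how your provider’s gateway rotates.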
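one way to structure the batching is to split the page range up front and open a new browser session per batch. page_batches is a hypothetical helper, and the batch size of 25 is just a midpoint of the 20-30 range above. the commented loop shows where it would slot into scrape().

```python
def page_batches(max_pages: int, batch_size: int = 25):
    """Yield lists of page numbers, one list per browser session."""
    for start in range(0, max_pages, batch_size):
        yield list(range(start, min(start + batch_size, max_pages)))


# Intended use inside scrape() (make_context is the helper from step 2):
#
#   for batch in page_batches(max_pages):
#       browser, context = await make_context(p)   # new connection per batch
#       page = await context.new_page()
#       for page_num in batch:
#           ...                                     # fetch, parse, store
#       await browser.close()

print(list(page_batches(60)))
```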

ignoring the CAPTCHA signal. if your scraper doesn’t check for block pages and just tries to parse them, you’ll get hundreds of empty rows in your database with no error. always validate that the expected selectors are present before calling the parser.
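a small gate in front of the parser makes that validation explicit. this sketch reuses the two markers from step 4 as plain substring checks, which assumes the HTML comes from page.content() (attributes serialized with double quotes). classify_page is a hypothetical helper.

```python
from typing import Optional


def classify_page(html: Optional[str]) -> str:
    """Return 'empty', 'blocked', 'ok', or 'unknown' for a fetched page."""
    if not html:
        return "empty"
    if 'id="challenge-error-title"' in html:
        return "blocked"   # challenge page marker from step 4
    if 'data-testid="job-title"' in html:
        return "ok"        # at least one job card rendered
    return "unknown"       # neither marker: log the HTML and investigate


print(classify_page('<h2 data-testid="job-title">Data Engineer</h2>'))
```

only pages classified as ok should reach parse_jobs. counting the other three outcomes separately tells you whether you have a proxy problem or a selector problem.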

scraping too fast. hitting Indeed at 1 request per second will get your proxy pool flagged within hours. 2-5 seconds between requests per browser instance, with a maximum of 3-4 parallel instances, is a sustainable rate for residential proxies.
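those limits can be enforced in code with a semaphore plus a randomized pause. a minimal sketch, assuming each task is a zero-argument coroutine function that fetches one page. run_limited is a hypothetical helper and the defaults mirror the numbers above.

```python
import asyncio
import random


async def run_limited(tasks, max_parallel: int = 3,
                      min_delay: float = 2.0, max_delay: float = 5.0):
    """Run zero-argument coroutine functions with a cap on parallelism."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def worker(task):
        async with semaphore:
            result = await task()
            # Hold the slot during the pause so each slot issues at most
            # one request per delay window.
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            return result

    return await asyncio.gather(*(worker(task) for task in tasks))
```

results come back in the same order as the input tasks, so they can be zipped back onto the URL list.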

not deduplicating. if you run the same query twice, or queries with overlapping results, you’ll accumulate duplicates fast. add a unique constraint on (title, company, location) or hash the job ID from the URL.
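a sketch of the hashing variant. it keys on a normalized (title, company, location) hash rather than a multi-column unique constraint, because SQLite treats NULLs as distinct in UNIQUE constraints, so rows with a missing location would slip through. job_key, save_jobs, and the jobs_dedup table are hypothetical names, and if you can extract Indeed's own job ID from the card link, that is a better key than this hash.

```python
import hashlib
import sqlite3


def job_key(title, company, location) -> str:
    """Stable key from the normalized fields; None is treated as empty."""
    normalized = "|".join((part or "").strip().lower()
                          for part in (title, company, location))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def save_jobs(conn: sqlite3.Connection, jobs: list[dict]) -> int:
    """Insert jobs, skipping ones already stored. Returns rows actually added."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS jobs_dedup (
            job_key TEXT PRIMARY KEY,
            title TEXT, company TEXT, location TEXT, salary TEXT
        )
    """)
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO jobs_dedup VALUES (?,?,?,?,?)",
        [(job_key(j["title"], j["company"], j["location"]),
          j["title"], j["company"], j["location"], j["salary"]) for j in jobs],
    )
    conn.commit()
    return conn.total_changes - before
```

the trade-off is that two genuinely different postings with the same title, company, and location collapse into one row.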

scaling this

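10x (1,000-10,000 listings/day): a single VPS with 3-4 async browser contexts is enough. proxy bandwidth cost stays under $20/month. the bottleneck is usually browser memory, not network, so limit the number of open pages to avoid OOM kills. if Chromium refuses to start as root or inside a container, pass --no-sandbox in the launch args. it is a Chromium flag rather than a Playwright setting, and it does not reduce memory use.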

100x (10,000-100,000 listings/day): split by region or job category across multiple VPS instances. add a job queue (Redis + RQ, or Celery) so instances don’t duplicate work. proxy costs become meaningful here, around $60-150/month depending on pool size and hit rate. consider caching pages that are likely to repeat across queries.
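before standing up a queue, a static split is sometimes enough: hash each (query, location) pair to a worker index so every instance can decide locally which searches it owns. this is a simpler stand-in, not a replacement for Redis or Celery, since it does not rebalance when a worker dies. shard_for is a hypothetical helper.

```python
import hashlib


def shard_for(query: str, location: str, num_workers: int) -> int:
    """Map a search to a worker index in [0, num_workers), deterministically."""
    digest = hashlib.sha256(f"{query}|{location}".lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


WORKER_INDEX = 0   # assumed: set per instance, e.g. from an environment variable
NUM_WORKERS = 4    # assumed fleet size

searches = [("data engineer", "Remote"), ("nurse", "Austin, TX"), ("welder", "Ohio")]
mine = [s for s in searches if shard_for(*s, NUM_WORKERS) == WORKER_INDEX]
print(mine)
```

hashlib is used instead of Python's built-in hash() because the built-in is randomized per process and would give each instance a different answer.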

1000x (100,000+ listings/day): at this volume you need a dedicated proxy provider relationship, not pay-as-you-go. browser-based scraping becomes expensive on compute. i’d evaluate whether certain endpoints return structured data via XHR calls that you can hit with httpx directly, which cuts compute cost significantly. you’ll also want monitoring, automatic proxy rotation on block signals, and probably a dedicated parsing service. operators managing multi-account pipelines at this scale often reference infrastructure patterns from multiaccountops.com/blog/ for ideas on session and fingerprint management.

logging and observability matter more as you scale. track per-proxy block rates, success rates per query type, and database write latency. you want to catch a degraded proxy pool before it corrupts a day’s worth of data.
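a minimal in-memory sketch of per-proxy block-rate tracking. ProxyStats is a hypothetical class, and the thresholds (at least 20 requests, more than 30% blocked) are assumptions to tune against your own baseline. proxy_id is whatever label you have, for example a session ID or gateway name, since a rotating gateway may not expose the exit IP.

```python
from collections import defaultdict


class ProxyStats:
    """Track per-proxy outcomes and flag proxies whose block rate is too high."""

    def __init__(self, min_requests: int = 20, max_block_rate: float = 0.3):
        self.min_requests = min_requests
        self.max_block_rate = max_block_rate
        self.total = defaultdict(int)
        self.blocked = defaultdict(int)

    def record(self, proxy_id: str, blocked: bool) -> None:
        self.total[proxy_id] += 1
        if blocked:
            self.blocked[proxy_id] += 1

    def block_rate(self, proxy_id: str) -> float:
        total = self.total.get(proxy_id, 0)
        return self.blocked.get(proxy_id, 0) / total if total else 0.0

    def degraded(self) -> list[str]:
        """Proxies with enough traffic to judge and a block rate above the limit."""
        return sorted(
            proxy_id for proxy_id, total in self.total.items()
            if total >= self.min_requests
            and self.block_rate(proxy_id) > self.max_block_rate
        )
```

call record() after every fetch and check degraded() between batches. pausing the run when the list is non-empty is cheaper than cleaning up a day of bad rows.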

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
