How to scrape Glassdoor at scale in 2026 with proxies that work
Glassdoor is one of the most valuable sources of real employer data on the internet: salary ranges, review sentiment, interview question patterns, headcount signals. HR analytics firms, recruiters, and competitive intelligence teams all want it. The problem is that Glassdoor has invested heavily in bot detection over the past two years, and naive scrapers, whether simple requests loops or cheap datacenter proxies, get blocked within minutes. Most tutorials you find online are from 2022 or 2023 and assume a threat model that no longer exists.
This guide is for operators who already know Python and have some scraping background but have hit a wall with Glassdoor specifically. I’ll walk through what actually works in 2026: browser automation with Playwright, residential proxy rotation, session management, and a storage pipeline that scales. You won’t get past Glassdoor’s defenses with datacenter IPs, so budget accordingly from the start.
The outcome is a working scraper that can pull company reviews, salary data, and job listings at a sustained rate without burning through proxy budget on blocks. I run a version of this for competitive HR data projects, and the approach here reflects what’s running in production.
what you need
- Python 3.11+ with playwright, httpx, parsel, and tqdm installed.
- Residential proxy subscription. Glassdoor requires residential or ISP proxies. Datacenter proxies will fail at the Cloudflare layer. Expect to spend $8-15 per GB. ProxyScrape’s residential plan, Bright Data, or Oxylabs all work. Budget at least $50/month for moderate scale.
- A Glassdoor account. Some data (full review text, salary specifics) is gated behind login. Create a real account with a real email address.
- A server or VPS. Scraping locally is fine for testing but not for scale. A $6/month DigitalOcean or Hetzner instance works.
- Storage. PostgreSQL or a flat JSONL pipeline. I’ll use JSONL for simplicity here.
- Optional: anti-detect browser. For the most aggressive fingerprinting scenarios, a tool like GoLogin or AdsPower helps. See antidetectreview.org for independent reviews of current options.
- Estimated time to first data: 2-4 hours to configure, a few days to tune at scale.
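The per-GB pricing above can be turned into a rough page budget. This is a back-of-the-envelope sketch only: `pages_per_budget` is a hypothetical helper, and the 2.5 MB average page weight is an assumed figure that you should replace with a measurement from your own traffic.

```python
def pages_per_budget(budget_usd: float, price_per_gb: float, avg_page_mb: float) -> int:
    """Estimate how many page loads a monthly proxy budget covers."""
    gb_available = budget_usd / price_per_gb
    mb_available = gb_available * 1024
    return int(mb_available // avg_page_mb)


# Assumption: 2.5 MB transferred per rendered page (hypothetical, measure your own).
ASSUMED_PAGE_MB = 2.5

low = pages_per_budget(50, price_per_gb=15, avg_page_mb=ASSUMED_PAGE_MB)
high = pages_per_budget(50, price_per_gb=8, avg_page_mb=ASSUMED_PAGE_MB)
print(f"$50/month covers roughly {low} to {high} page loads")
```

Blocking images and fonts in the browser would lower the page weight and stretch the same budget further, at the cost of a less realistic traffic profile.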
step by step
step 1: audit what you actually need
Before writing a line of code, download Glassdoor’s robots.txt and read it. Note which paths are disallowed. This shapes your architecture: you want to hit allowed paths where possible and be explicit in your internal compliance documentation if you’re building a commercial product. None of this is legal advice; have your own counsel review your use case before commercializing scraped data.
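The robots.txt audit can be scripted with the standard library. A minimal sketch: the `SAMPLE_ROBOTS` text below is a made-up excerpt for illustration, not Glassdoor’s real file, and `audit_paths` is a hypothetical helper name. In practice you would feed it the file you downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt for illustration only; not Glassdoor's real file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /profile/
Disallow: /member/
Allow: /Reviews/
"""


def audit_paths(robots_txt: str, paths: list[str], agent: str = "*") -> dict[str, bool]:
    """Return {path: allowed?} for each path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


report = audit_paths(
    SAMPLE_ROBOTS,
    ["/Reviews/Acme-Reviews-E1.htm", "/profile/login_input.htm", "/Jobs/acme"],
)
for path, allowed in report.items():
    print(f"{'allowed ' if allowed else 'DISALLOW'}  {path}")
```

Saving this report alongside your compliance notes gives you a dated record of what the file said when you built the crawler.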
Decide your data targets: company overview pages (/Overview/), reviews (/Reviews/), salaries (/Salary/), or jobs (/Jobs/). Each has a different page structure and different anti-bot sensitivity. Reviews are the hardest. Jobs are the easiest.
step 2: set up your environment
pip install playwright httpx parsel tqdm python-dotenv jsonlines
playwright install chromium
Create a .env file:
PROXY_HOST=your.proxy.host
PROXY_PORT=10001
PROXY_USER=your_username
PROXY_PASS=your_password
GLASSDOOR_EMAIL=you@example.com
GLASSDOOR_PASS=yourpassword
Never hardcode credentials. Use python-dotenv to load them.
step 3: configure residential proxy rotation
Glassdoor’s Cloudflare integration checks IP reputation, ASN, and session consistency. Residential proxies route through real ISP addresses, which pass the ASN check. The key additional requirement is sticky sessions: you need the same IP across an entire scraping session for a given company page sequence, otherwise Glassdoor’s session validation fails mid-crawl.
Most residential proxy providers support sticky sessions via a session_id parameter appended to the proxy username. With ProxyScrape:
import os
import random
from dotenv import load_dotenv
load_dotenv()

def get_proxy(session_id: str) -> dict:
    # The session ID rides along in the proxy username to pin a sticky IP.
    user = f"{os.getenv('PROXY_USER')}-session-{session_id}"
    return {
        "server": f"http://{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}",
        "username": user,
        "password": os.getenv("PROXY_PASS"),
    }
Generate a new session_id per company you’re scraping, not per request. This keeps your IP sticky within one company’s page sequence.
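One way to keep the per-company rule explicit is to assign session IDs up front. A minimal sketch: `new_session_id` and `assign_sessions` are hypothetical helpers, and the ID format (employer ID plus a short random suffix) is an assumption. Check which characters your provider accepts in session IDs.

```python
import secrets


def new_session_id(employer_id: str) -> str:
    """Build one sticky-session ID for a single company crawl."""
    # 4 random bytes -> 8 hex characters, enough to avoid accidental reuse.
    return f"{employer_id}-{secrets.token_hex(4)}"


def assign_sessions(employer_ids: list[str]) -> dict[str, str]:
    """Map each employer ID to its own session ID (one sticky IP per company)."""
    return {employer_id: new_session_id(employer_id) for employer_id in employer_ids}


# Hypothetical employer IDs, for illustration only.
sessions = assign_sessions(["1138", "9079"])
```

Each value can then be passed to `get_proxy` when that company’s crawl starts, so every request in the sequence shares one IP.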
step 4: build the Playwright browser session
Playwright’s Chromium handles JavaScript rendering, which is required for Glassdoor’s dynamic review loading. The critical configuration is matching a realistic browser fingerprint. Playwright’s documentation covers the launch flags in detail.
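import asyncio
from playwright.async_api import async_playwright

async def get_browser_context(proxy: dict):
    # Start Playwright explicitly. Wrapping this in `async with` would shut
    # the browser down as soon as the function returned.
    playwright = await async_playwright().start()
    browser = await playwright.chromium.launch(
        headless=True,
        # Turns off the Blink feature that flags the browser as automated.
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        proxy=proxy,
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        viewport={"width": 1280, "height": 800},
        locale="en-US",
    )
    # The caller owns both objects and must call browser.close() when done.
    return browser, context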
If it breaks: if you see a Cloudflare challenge page (status 403 or a JS challenge loop), your proxy IP has been flagged. Rotate to a new session ID and retry. If the challenge persists across multiple IPs, your browser fingerprint is the issue: update the user agent string to a current Chrome version and check that navigator.webdriver is not exposed.
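The troubleshooting rule above can be encoded as a small decision helper. This is a sketch under assumptions: the "just a moment" title text is an assumed marker for the challenge page (verify it against what you actually receive), the threshold of three rotations is arbitrary, and both function names are hypothetical.

```python
# Assumed marker: title text seen on the challenge interstitial. Verify it yourself.
CHALLENGE_TITLE = "just a moment"


def looks_like_challenge(status: int, html: str) -> bool:
    """Treat a 403 or a page containing the challenge marker as a block."""
    if status == 403:
        return True
    return CHALLENGE_TITLE in html.lower()


def next_action(challenged_sessions: int, max_rotations: int = 3) -> str:
    """Decide what to do after N consecutive sessions hit a challenge."""
    if challenged_sessions == 0:
        return "continue"
    if challenged_sessions < max_rotations:
        return "rotate_session"  # likely a flagged IP: new session ID, retry
    return "check_fingerprint"  # persists across IPs: inspect UA and webdriver flag
```

With Playwright, the status can be read from the response returned by `page.goto(url)` and the HTML from `page.content()`.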
step 5: log in and maintain session state
Save authenticated cookies to disk after logging in once per proxy session. This avoids triggering repeated login events, which are themselves a bot signal.
import json

async def login_and_save_cookies(context, email: str, password: str, cookie_path: str):
    page = await context.new_page()
    await page.goto("https://www.glassdoor.com/profile/login_input.htm")
    await page.wait_for_timeout(2000)
    await page.fill('input[name="username"]', email)
    await page.fill('input[name="password"]', password)
    await page.click('button[type="submit"]')
    await page.wait_for_load_state("networkidle")
    # Persist the authenticated cookies so later runs can skip the login.
    cookies = await context.cookies()
    with open(cookie_path, "w") as f:
        json.dump(cookies, f)
    await page.close()

async def load_cookies(context, cookie_path: str):
    with open(cookie_path) as f:
        cookies = json.load(f)
    await context.add_cookies(cookies)
If it breaks: Glassdoor may present a CAPTCHA on login. If this happens consistently, your proxy IP pool is too narrow. Add a human-like delay (page.wait_for_timeout(random.randint(1500, 3500))) before clicking submit, and use a different session ID for login than for scraping.
step 6: navigate and extract review data
Reviews are paginated and loaded dynamically. The URL pattern for company reviews is:
https://www.glassdoor.com/Reviews/{Company-Name}-Reviews-E{employer_id}.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=REGULAR&filter.employmentStatus=PART_TIME&page={page_num}
You need the employer ID, which appears in the URL when you browse manually. Build a list of (company_name, employer_id) tuples as your input.
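Building the URL from a parameter list keeps the repeated `filter.employmentStatus` key intact and avoids hand-concatenated query strings. A small sketch using the pattern above; `review_url` is a hypothetical helper, and the slug and employer ID in the example are made up.

```python
from urllib.parse import urlencode

BASE = "https://www.glassdoor.com/Reviews"


def review_url(company_slug: str, employer_id: str, page_num: int) -> str:
    """Build one paginated review URL following the pattern shown above."""
    # A list of pairs (not a dict) so the repeated filter key survives.
    params = [
        ("sort.sortType", "RD"),
        ("sort.ascending", "false"),
        ("filter.iso3Language", "eng"),
        ("filter.employmentStatus", "REGULAR"),
        ("filter.employmentStatus", "PART_TIME"),
        ("page", page_num),
    ]
    return f"{BASE}/{company_slug}-Reviews-E{employer_id}.htm?{urlencode(params)}"


# Hypothetical input tuple, for illustration only.
example = review_url("Acme-Corp", "12345", 2)
```

The scraping loop itself follows.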
async def scrape_reviews(context, employer_id: str, company_slug: str, max_pages: int = 10):
    results = []
    page = await context.new_page()
    for page_num in range(1, max_pages + 1):
        url = (
            f"https://www.glassdoor.com/Reviews/{company_slug}-Reviews-E{employer_id}.htm"
            f"?sort.sortType=RD&sort.ascending=false&page={page_num}"
        )
        await page.goto(url)
        # Randomized pause so page loads are not evenly spaced.
        await page.wait_for_timeout(random.randint(2000, 4000))
        content = await page.content()
        reviews = parse_reviews(content)
        if not reviews:
            # An empty page means the end of pagination or a parser miss.
            break
        results.extend(reviews)
    await page.close()
    return results
Parse with parsel:
from parsel import Selector

def parse_reviews(html: str) -> list[dict]:
    sel = Selector(text=html)
    reviews = []
    for r in sel.css('[data-test="review-details"]'):
        reviews.append({
            "title": r.css('[data-test="review-title"] span::text').get("").strip(),
            "rating": r.css('[data-test="overall-rating"]::attr(aria-label)').get(""),
            "pros": r.css('[data-test="review-pros"] span::text').get("").strip(),
            "cons": r.css('[data-test="review-cons"] span::text').get("").strip(),
            "date": r.css("time::attr(datetime)").get(""),
        })
    return reviews
CSS selectors on Glassdoor change periodically. Keep a dated snapshot of the selector map in your codebase and audit it monthly.
If it breaks: if parse_reviews returns empty lists but the page loads, Glassdoor has updated their HTML structure. Open the page manually with the same proxy and inspect the DOM.
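The dated selector snapshot can live in code next to the parser. A minimal sketch: the selectors are the ones used in `parse_reviews`, while the `verified_on` date, the 30-day interval, and the `audit_due` helper are illustrative assumptions.

```python
from datetime import date

SELECTOR_SNAPSHOT = {
    # Hypothetical date: set this to the day you last checked the live DOM.
    "verified_on": date(2026, 1, 15),
    "selectors": {
        "review": '[data-test="review-details"]',
        "title": '[data-test="review-title"] span::text',
        "rating": '[data-test="overall-rating"]::attr(aria-label)',
        "pros": '[data-test="review-pros"] span::text',
        "cons": '[data-test="review-cons"] span::text',
        "date": "time::attr(datetime)",
    },
}


def audit_due(snapshot: dict, today: date, max_age_days: int = 30) -> bool:
    """Return True when the snapshot is older than the audit interval."""
    return (today - snapshot["verified_on"]).days > max_age_days
```

Running `audit_due(SELECTOR_SNAPSHOT, date.today())` at scraper startup turns the monthly audit into a warning you cannot forget.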
step 7: write to storage and handle retries
import jsonlines
import asyncio

async def scrape_company(employer_id, company_slug, output_path):
    # Fresh sticky session (and therefore a fresh IP) for each company.
    session_id = f"{employer_id}-{random.randint(1000,9999)}"
    proxy = get_proxy(session_id)
    browser, context = await get_browser_context(proxy)
    try:
        await load_cookies(context, "cookies.json")
        reviews = await scrape_reviews(context, employer_id, company_slug)
        with jsonlines.open(output_path, mode="a") as writer:
            for r in reviews:
                r["employer_id"] = employer_id
                writer.write(r)
    finally:
        # Always release the browser, even if scraping raised.
        await browser.close()
Run with asyncio.run(scrape_company(...)). Add exponential backoff on failures: wait 5s, then 15s, then 45s before giving up on a company.
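The 5s, 15s, 45s schedule can be wrapped in a small retry helper. A minimal sketch: `with_backoff` is a hypothetical name, and it retries on any exception, which you may want to narrow to network and timeout errors. The `sleep` parameter exists so tests can substitute a fake.

```python
import asyncio

BACKOFF_SCHEDULE = (5, 15, 45)  # seconds to wait after each failed attempt


async def with_backoff(make_attempt, delays=BACKOFF_SCHEDULE, sleep=asyncio.sleep):
    """Await make_attempt(); after each failure wait the next delay, then retry.

    Once every delay has been used, one final attempt runs and its
    exception (if any) propagates to the caller.
    """
    for delay in delays:
        try:
            return await make_attempt()
        except Exception as exc:
            print(f"attempt failed ({exc!r}); retrying in {delay}s")
            await sleep(delay)
    return await make_attempt()
```

Usage would look like `asyncio.run(with_backoff(lambda: scrape_company(employer_id, slug, "reviews.jsonl")))`, where the lambda creates a fresh coroutine for every attempt.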
step 8: validate output quality
After your first run, spot-check 5-10 reviews manually against what appears on Glassdoor. Check that dates parse correctly, ratings are numeric, and review text is not truncated. Truncation usually means the “read more” button wasn’t clicked. Add a click step before extracting:
try:
    await page.click('[data-test="review-see-more"]', timeout=2000)
except Exception:
    # No "read more" button on this review, or it did not appear in time.
    pass
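Part of the spot check in this step can be automated so that every run reports obviously bad records. A sketch under assumptions: `validate_review` is a hypothetical helper, the field names match `parse_reviews`, and treating a trailing ellipsis as a sign of truncation is a heuristic, not a documented Glassdoor behavior.

```python
from datetime import datetime


def validate_review(review: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    problems = []
    try:
        datetime.fromisoformat(review.get("date", ""))
    except (TypeError, ValueError):
        problems.append("date is not ISO 8601")
    try:
        float(review.get("rating", ""))
    except (TypeError, ValueError):
        problems.append("rating is not numeric")
    for field in ("pros", "cons"):
        text = review.get(field) or ""
        if not text:
            problems.append(f"{field} is empty")
        elif text.endswith(("...", "…")):
            # Heuristic: a trailing ellipsis suggests "read more" was not expanded.
            problems.append(f"{field} looks truncated")
    return problems
```

Because the parser stores the raw aria-label text as the rating, this check will flag records until you extract the number from that label, which is exactly the kind of issue the manual spot check is meant to surface.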
common pitfalls
Using datacenter proxies. This is the most common mistake. Glassdoor’s Cloudflare integration blocks datacenter ASNs almost immediately in 2026. If your proxy cost is under $1/GB, it’s almost certainly datacenter. Pay for residential.
Sharing one session across many companies. A single IP requesting 40 different employer pages in a row is a clear bot signal. Generate a fresh session ID per company and add a 10-30 second gap between companies.
Scraping without a logged-in session. Unauthenticated sessions see truncated reviews and no salary data. Log in once per proxy session and reuse cookies.
Ignoring selector drift. Glassdoor deploys frontend changes frequently. Scrapers break silently when selectors stop matching. Add an assertion that your parser returns at least one result per page, and alert when it returns zero.
Scraping at uniform intervals. Fixed delays (e.g., always 2 seconds between requests) are easier to detect than randomized ones. Use random.uniform(1.5, 4.5) at minimum. More variation is better.
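The two timing rules from this list, randomized request delays and a 10-30 second gap between companies, fit in a pair of helpers. A minimal sketch with hypothetical function names; the `rng` parameter is only there so a seeded generator can be injected for testing.

```python
import random


def request_delay(rng=random) -> float:
    """Seconds to wait between page requests within one company."""
    return rng.uniform(1.5, 4.5)


def company_gap(rng=random) -> float:
    """Seconds to wait before starting the next company."""
    return rng.uniform(10, 30)
```

In the async scraper these would be used as `await asyncio.sleep(request_delay())` between pages and `await asyncio.sleep(company_gap())` between companies.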
scaling this
10x (a few hundred companies): the single-process async approach above handles this comfortably overnight. One residential proxy subscription is enough. Output to JSONL files partitioned by date.
100x (thousands of companies): move to a task queue. Celery with Redis, or a simpler approach like a SQLite-backed job queue with concurrent.futures. You’ll need to manage proxy bandwidth actively: at this scale, budget $200-400/month in residential proxy spend. Consider ProxyScrape’s bulk residential plan or a dedicated ISP proxy pool, which gives better consistency than rotating residential. See the guide to residential proxy tiers for a cost comparison across providers.
1000x (tens of thousands of companies, continuous refresh): you need distributed scraping across multiple VPS instances, a proper proxy management layer that tracks block rates per IP pool, and a monitoring stack. At this point, Glassdoor’s rate limits become a resource planning constraint. You’ll also want to deduplicate reviews at ingestion since the same review appears across multiple page loads. A PostgreSQL pipeline with a unique constraint on (employer_id, review_date, reviewer_hash) handles this. For the browser fingerprinting side at this scale, look at whether a managed headless browser service like Browserless makes more economic sense than self-hosted Playwright. The Cloudflare bypass deep-dive covers the specific headers and TLS fingerprinting techniques relevant here.
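The unique-constraint deduplication can be prototyped locally before moving to PostgreSQL. A sketch under assumptions: it uses SQLite’s `INSERT OR IGNORE` as a stand-in (the PostgreSQL equivalent is `INSERT ... ON CONFLICT DO NOTHING`), and since the article does not say how `reviewer_hash` is derived, hashing the title, pros, and cons is purely an illustrative choice.

```python
import hashlib
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS reviews (
    employer_id   TEXT NOT NULL,
    review_date   TEXT NOT NULL,
    reviewer_hash TEXT NOT NULL,
    title         TEXT,
    UNIQUE (employer_id, review_date, reviewer_hash)
)"""


def reviewer_hash(review: dict) -> str:
    """Assumed derivation: a short SHA-256 over the review's text fields."""
    basis = "|".join(review.get(key, "") for key in ("title", "pros", "cons"))
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()[:16]


def ingest(conn: sqlite3.Connection, employer_id: str, reviews: list[dict]) -> int:
    """Insert reviews, silently skipping duplicates; return how many rows were new."""
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO reviews (employer_id, review_date, reviewer_hash, title) "
        "VALUES (?, ?, ?, ?)",
        [
            (employer_id, r.get("date", ""), reviewer_hash(r), r.get("title", ""))
            for r in reviews
        ],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
    return after - before
```

Tracking the ratio of new rows to submitted rows per run also gives a cheap signal for how much of your crawl is re-fetching reviews you already hold.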
If you’re running multi-account operations alongside scraping, the infrastructure overlaps significantly with what’s covered at multiaccountops.com/blog/, particularly around session isolation and proxy assignment strategies.
where to go next
- Best residential proxies for scraping in 2026: a cost and performance comparison of the major residential proxy providers, including how they perform against Cloudflare-protected targets.
- How to bypass Cloudflare with proxies: deeper technical coverage of TLS fingerprinting, browser challenges, and what actually changes between Cloudflare’s bot management tiers.
- Playwright vs. Puppeteer for scraping in 2026: if you’re not yet committed to Playwright, this covers the practical tradeoffs for JS-heavy targets like Glassdoor.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.