
How to scrape LinkedIn at scale in 2026 with proxies that work

LinkedIn is one of the hardest targets on the web. it blocks datacenter IPs almost immediately, fingerprints browsers aggressively, and has a legal team that has been fighting scraping cases in US courts since 2017. and yet people scrape it every day, at scale, for lead generation, recruiting, market research, and competitive intelligence. the data is too valuable to ignore.

this tutorial is for operators who already understand what proxies are and why you need them. if you’re building a lead gen pipeline, a recruiting tool, or pulling job postings for aggregation, this is the practical path through LinkedIn’s defenses in 2026. by the end you’ll have a working scrape loop for profiles and search results, running through residential proxies with session management that doesn’t collapse after 50 requests.

i run scraping infrastructure out of Singapore and have burned through a lot of proxy budget finding what actually works on LinkedIn versus what the forums say works. this is what i actually use.


what you need

  • Playwright (Python or Node.js) for browser automation. Selenium is too fingerprint-heavy at this point.
  • Residential proxies: at minimum a pool from a provider with LinkedIn-compatible IPs. i use ProxyScraping’s residential plan ($3.99/GB at the time of writing). datacenter proxies fail within minutes on LinkedIn.
  • LinkedIn accounts: a minimum of 3-5 warm accounts (60+ days old, with profile photos, connections, and some activity). fresh accounts get restricted fast.
  • An antidetect browser or browser profile manager for managing those accounts separately. if you’re running multiple accounts, antidetectreview.org/blog/ has solid coverage of the current options.
  • Python 3.11+ and basic async knowledge
  • MongoDB or Postgres for storing results. SQLite allows only one writer at a time, so it becomes a bottleneck once several workers write concurrently.
  • Monthly budget: roughly $50-200 depending on scale. most of that is proxies.

a note before we continue: LinkedIn’s User Agreement explicitly prohibits scraping. the legal landscape is complicated. the 9th Circuit’s 2022 ruling in hiQ Labs v. LinkedIn Corp. held that scraping publicly available data likely doesn’t violate the CFAA, but that ruling covers public pages, not logged-in sessions, and it doesn’t stop LinkedIn from enforcing its ToS. in practice that enforcement usually means an account ban and an IP block rather than a lawsuit. scrape at your own risk and understand what you’re doing. this is not legal advice.


step by step

step 1: set up your environment

pip install playwright python-dotenv motor pymongo
playwright install chromium

create a .env file:

PROXY_HOST=gate.proxyscraping.com
PROXY_PORT=7777
PROXY_USER=your_username
PROXY_PASS=your_password
LI_SESSION_COOKIE=your_li_at_cookie_value

get your li_at cookie by logging into LinkedIn in a browser and copying it from DevTools. this is your session token. one per account.

if it breaks: if playwright install fails behind a corporate firewall, run PLAYWRIGHT_BROWSERS_PATH=/tmp playwright install chromium.
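a minimal sketch of turning the .env values above into the proxy dict that Playwright's launch() takes, so credentials stay out of the scripts. build_proxy_config is a name i made up for illustration, and the try/except around python-dotenv is only there so the snippet still runs if the package isn't installed.

```python
import os

try:
    # python-dotenv was installed in step 1; load_dotenv() copies .env into os.environ
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # fall back to whatever is already in the environment


def build_proxy_config(env=os.environ):
    """Build the proxy dict for Playwright's launch(proxy=...) from env vars.

    build_proxy_config is a hypothetical helper name, not part of any library.
    """
    required = ("PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"missing settings: {', '.join(missing)}")
    return {
        "server": f"http://{env['PROXY_HOST']}:{env['PROXY_PORT']}",
        "username": env["PROXY_USER"],
        "password": env["PROXY_PASS"],
    }
```

pass the result straight to p.chromium.launch(proxy=build_proxy_config()). failing loudly on a missing variable beats debugging a proxy auth error ten minutes later.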

step 2: validate your proxy on LinkedIn before building anything

don’t spend 4 hours building a scraper only to find your proxy provider is blocked.

import asyncio
from playwright.async_api import async_playwright

async def test_proxy():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                "server": "http://gate.proxyscraping.com:7777",
                "username": "your_username",
                "password": "your_password"
            }
        )
        page = await browser.new_page()
        await page.goto("https://www.linkedin.com/feed/")
        print(await page.title())
        await browser.close()

asyncio.run(test_proxy())

expected output: a page title containing “LinkedIn”. no cookie is set at this point, so landing on the normal sign-in page is fine. if you get a CAPTCHA, a security check, or a page that never loads, that IP range is flagged. try a different proxy session.

if it breaks: LinkedIn blocks most residential providers’ shared exit nodes over time. check if your provider supports sticky sessions, and use a different geo (US or EU residential IPs work better than APAC for LinkedIn in my experience).

step 3: authenticate with the li_at cookie

logging in via form on every run triggers MFA and login anomaly flags. inject the cookie directly instead.

async def get_authenticated_page(browser, li_at_cookie):
    context = await browser.new_context()
    await context.add_cookies([{
        "name": "li_at",
        "value": li_at_cookie,
        "domain": ".linkedin.com",
        "path": "/",
        "httpOnly": True,
        "secure": True
    }])
    page = await context.new_page()
    return page, context

navigate to https://www.linkedin.com/feed/ after this and confirm you’re logged in before proceeding. the page should show the feed, not a login wall.

if it breaks: li_at cookies expire. rotate to a fresh cookie. keep 3-5 backup accounts with cookies extracted.
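the "confirm you're logged in" check can be automated by looking at where the browser ended up after the redirect chain. this is a rough sketch: the path prefixes below are my assumptions from redirects i've commonly seen, not a documented list, and classify_landing is a made-up helper name.

```python
from urllib.parse import urlparse

# Path prefixes that usually mean "not logged in" or "challenged".
# These are assumptions based on commonly observed redirects and may change.
LOGIN_PREFIXES = ("/login", "/uas/login", "/authwall", "/signup")
CHALLENGE_PREFIXES = ("/checkpoint",)


def classify_landing(final_url):
    """Return 'ok', 'login_wall', or 'challenge' for the URL the browser ended on."""
    path = urlparse(final_url).path or "/"
    if path.startswith(CHALLENGE_PREFIXES):
        return "challenge"
    if path.startswith(LOGIN_PREFIXES):
        return "login_wall"
    return "ok"
```

after page.goto(...), call classify_landing(page.url). a "login_wall" result with a cookie set means the cookie has expired, and "challenge" means the account or IP needs a rest.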

step 4: build your profile scraper

LinkedIn’s profile structure is relatively stable. target the page’s JSON-LD or the visible DOM. here’s a minimal profile extractor:

import random

async def scrape_profile(page, profile_url):
    await page.goto(profile_url, wait_until="domcontentloaded")
    # randomized 2-3 second settle time before reading the DOM
    await page.wait_for_timeout(random.uniform(2000, 3000))

    # selectors reflect LinkedIn's markup at the time of writing and will drift
    name = await page.locator("h1").first.inner_text()
    headline = await page.locator(".text-body-medium").first.inner_text()

    return {
        "url": profile_url,
        "name": name.strip(),
        "headline": headline.strip()
    }

add delays between requests: random.uniform(3.5, 8.0) seconds. consistent timing is a bot signal. LinkedIn’s detection watches for sub-2-second page transitions.

if it breaks: if the h1 or .text-body-medium locator comes back empty or matches the wrong element, LinkedIn has changed its markup. open the page in a real browser, inspect the DOM, and update your selectors. this happens every 2-3 months.
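here's one way to wire in the randomized delays. the 3.5-8.0 second range is the one from this step, and the occasional 15-45 second pause is the "simulate reading" pause from the pitfalls section. triggering it on every 25th request is my simplification, and both function names are hypothetical.

```python
import asyncio
import random


def next_delay(request_count, rng=random):
    """Return seconds to wait after the given number of completed requests.

    Short pause: uniform 3.5-8.0 s.
    Long pause: uniform 15-45 s on every 25th request (an assumed trigger).
    """
    if request_count > 0 and request_count % 25 == 0:
        return rng.uniform(15.0, 45.0)
    return rng.uniform(3.5, 8.0)


async def polite_pause(request_count):
    """Sleep for a randomized interval without blocking the event loop."""
    await asyncio.sleep(next_delay(request_count))
```

call await polite_pause(n) after each scrape_profile call, where n is the running request count. passing a seeded random.Random as rng makes the delays reproducible in tests.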

step 5: rotate proxies and sessions properly

the single biggest mistake operators make is using one proxy IP for too many requests. LinkedIn starts throwing soft blocks (slow responses, empty feeds) before it hard-blocks.

rule of thumb: rotate your proxy session every 10-15 profile views. with sticky sessions on ProxyScraping, append _session_XXXX to your username where XXXX is a random ID.

import random
import string

def get_session_proxy():
    session_id = ''.join(random.choices(string.ascii_lowercase + string.digits, k=8))
    return {
        "server": "http://gate.proxyscraping.com:7777",
        "username": f"your_username_session_{session_id}",
        "password": "your_password"
    }

also rotate your LinkedIn account (cookie) every 50-100 requests. one account doing 500 profile views in a day will get restricted.

if it breaks: if all sessions are failing, your IP pool may have a subnet ban. contact your proxy provider and request a different exit pool. some providers have LinkedIn-specific pools.
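the two rotation rules in this step (new proxy session every 10-15 views, new account every 50-100 requests) can live in one small counter object. this is a sketch, not a drop-in: Rotator and its parameters are names i invented, the defaults are just midpoints of the ranges above, and the _session_ username suffix follows the ProxyScraping convention described in this step.

```python
import random
import string


class Rotator:
    """Hand out a proxy session and account cookie, rotating on request counts.

    Rotator and its parameter names are hypothetical; thresholds default to
    midpoints of the rule-of-thumb ranges in this step.
    """

    def __init__(self, cookies, username, password,
                 server="http://gate.proxyscraping.com:7777",
                 views_per_session=12, requests_per_account=75, rng=None):
        if not cookies:
            raise ValueError("need at least one account cookie")
        self.cookies = list(cookies)
        self.username, self.password, self.server = username, password, server
        self.views_per_session = views_per_session
        self.requests_per_account = requests_per_account
        self.rng = rng or random.Random()
        self.count = 0
        self.account_index = 0
        self.session_id = self._new_session_id()

    def _new_session_id(self):
        alphabet = string.ascii_lowercase + string.digits
        return "".join(self.rng.choices(alphabet, k=8))

    def next(self):
        """Return (proxy_dict, cookie) for the next request."""
        if self.count and self.count % self.views_per_session == 0:
            self.session_id = self._new_session_id()
        if self.count and self.count % self.requests_per_account == 0:
            self.account_index = (self.account_index + 1) % len(self.cookies)
        self.count += 1
        proxy = {
            "server": self.server,
            "username": f"{self.username}_session_{self.session_id}",
            "password": self.password,
        }
        return proxy, self.cookies[self.account_index]
```

in Playwright the proxy is set when you launch a browser or create a context, so whenever the returned username changes you need to open a fresh context (or browser) with the new proxy dict and re-inject the cookie.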

step 6: scrape search results for lead sourcing

profiles are useful but search result pages are where most operators start. LinkedIn search URLs follow a consistent pattern. for people search:

https://www.linkedin.com/search/results/people/?keywords=SOFTWARE+ENGINEER&location=Singapore&page=2

paginate by incrementing page=. LinkedIn limits guest/organic searches to roughly 100 pages before it pushes you to Sales Navigator. with a logged-in session you get further.

import random
from urllib.parse import urlencode, urljoin

async def scrape_search_page(page, keywords, location, page_num):
    # urlencode escapes spaces and special characters in the search terms
    query = urlencode({"keywords": keywords, "location": location, "page": page_num})
    url = f"https://www.linkedin.com/search/results/people/?{query}"
    await page.goto(url, wait_until="domcontentloaded")
    await page.wait_for_timeout(random.uniform(3000, 6000))

    results = await page.locator(".reusable-search__result-container").all()
    profiles = []
    for result in results:
        link = await result.locator("a").first.get_attribute("href")
        if link and "/in/" in link:
            # urljoin handles both relative and absolute hrefs
            profiles.append(urljoin("https://www.linkedin.com", link.split("?")[0]))
    return profiles

if it breaks: if results come back empty but the page loaded, LinkedIn may be serving a bot-detection variant of the page. try navigating to the feed first before each search to warm up the session.
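to paginate, something has to decide when to stop. a sketch of that loop, assuming an empty page means either the end of results or a soft block. collect_profiles, fetch_page, and pause are illustrative names, and fetch_page would be a thin wrapper around scrape_search_page with the keywords and location already filled in.

```python
import asyncio


async def collect_profiles(fetch_page, max_pages=100, pause=None):
    """Walk search result pages until one comes back empty or max_pages is hit.

    fetch_page: async callable taking a page number and returning a list of profile URLs.
    pause:      optional async callable awaited between pages (randomized delay).
    """
    seen = set()
    ordered = []
    for page_num in range(1, max_pages + 1):
        urls = await fetch_page(page_num)
        if not urls:
            break  # empty page: end of results or a soft block
        for url in urls:
            if url not in seen:  # the same profile can appear on several pages
                seen.add(url)
                ordered.append(url)
        if pause is not None:
            await pause()
    return ordered
```

the max_pages default of 100 mirrors the rough page limit mentioned above. lower it if your searches are narrow.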

step 7: store and deduplicate results

from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient("mongodb://localhost:27017")
db = client["linkedin_scrape"]

async def save_profile(data):
    await db.profiles.update_one(
        {"url": data["url"]},
        {"$set": data},
        upsert=True
    )

upsert on URL prevents duplicates without a separate dedup pass. run your scraper with a queue (asyncio.Queue or Redis) so you can resume after crashes without re-scraping.
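a minimal sketch of the resumable queue idea using asyncio.Queue. it assumes you load the set of already-stored URLs from the database at startup and pass it in. run_workers and its parameters are names i picked for illustration, and scrape and save would wrap scrape_profile and save_profile.

```python
import asyncio


async def run_workers(urls, scrape, save, already_done, workers=3):
    """Scrape urls with a fixed number of workers, skipping ones already stored.

    scrape:       async callable url -> dict
    save:         async callable dict -> None
    already_done: set of URLs already in the database, loaded once at startup
    Returns the list of URLs that failed, for retry on the next run.
    """
    queue = asyncio.Queue()
    for url in urls:
        if url not in already_done:
            queue.put_nowait(url)

    failed = []

    async def worker():
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            try:
                await save(await scrape(url))
            except Exception:
                failed.append(url)  # keep going; retry these later

    await asyncio.gather(*(worker() for _ in range(workers)))
    return failed
```

because saved profiles are skipped on startup, a crash costs you only the requests that were in flight. swap the in-memory queue for Redis if the workers run on more than one machine.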


common pitfalls

1. using datacenter proxies. i see this constantly. datacenter IPs from AWS, GCP, or cheap proxy providers are range-blocked at LinkedIn’s edge. residential is the minimum viable option. mobile proxies work better but cost more ($15-30/GB typically).

2. not warming up accounts. a 3-day-old account scraping 200 profiles will be restricted before lunch. minimum: 60 days old, 50+ connections, some post interactions. buy aged accounts if you have to, but understand the risk.

3. fixed request timing. time.sleep(3) on every request is a pattern. real users vary. use random.uniform(3.5, 9.0) and occasionally add a longer pause (random.uniform(15, 45)) every 20-30 requests to simulate reading.

4. ignoring LinkedIn’s robots.txt. it won’t stop you technically, but its Disallow entries show which paths LinkedIn cares most about, and in my experience that is where detection is heaviest.

5. scaling on a single account. one account, even with proxy rotation, caps out fast. the right architecture is multiple accounts, each doing limited daily volume, coordinated through a queue. see our proxy rotation guide for the queue pattern.
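for pitfall 4, a small parser that pulls the Disallow paths for one user-agent group out of robots.txt text. this is a simplified sketch that handles only the basic grammar, and disallowed_paths is a made-up helper name. fetch the file yourself and pass the text in.

```python
def disallowed_paths(robots_txt, agent="*"):
    """Return the Disallow paths listed for one user-agent group in robots.txt text."""
    paths = []
    group_agents = []
    in_rules = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a new group starts after the previous group's rules
                group_agents, in_rules = [], False
            group_agents.append(value.lower())
        else:
            in_rules = True
            if field == "disallow" and value and agent.lower() in group_agents:
                paths.append(value)
    return paths
```

compare the returned paths against the URL patterns your scraper hits, and treat any overlap as a place to go slower.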


scaling this

10x (500-1000 profiles/day): the setup above handles this with 3 accounts and $20-40/month of residential proxy bandwidth. run asyncio with 3 concurrent workers, one per account.

100x (5000-10000 profiles/day): you need 15-20 accounts, proxy costs jump to $100-200/month, and you’ll want a task queue (Celery + Redis or RQ). account management becomes the bottleneck. each account needs its own browser context and ideally its own residential session pool. consider Playwright’s documentation on browser contexts for isolating state properly.

1000x (50000+ profiles/day): at this scale you’re looking at Sales Navigator API access (if you can get it), or a significant antidetect browser setup running dozens of persistent profiles across multiple machines. the proxy bill alone is $500-2000/month. account sourcing and replacement becomes a full-time concern. you’ll also want monitoring for block rates per account, per proxy pool, and per geo. see our residential proxy benchmark for provider comparisons at volume.
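the per-account, per-pool, and per-geo block-rate monitoring can start as a counter in memory before you reach for a metrics stack. a sketch under my own assumptions: BlockRateMonitor is a hypothetical class, and the 20% threshold and 20-request minimum are placeholder values to tune, not recommendations from testing.

```python
from collections import defaultdict


class BlockRateMonitor:
    """Track block rates per (dimension, key), e.g. ('account', 'acct-03')."""

    def __init__(self):
        # each entry holds [total_requests, blocked_requests]
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, blocked, **labels):
        """Record one request outcome under every label (account=..., pool=..., geo=...)."""
        for dimension, key in labels.items():
            entry = self.counts[(dimension, key)]
            entry[0] += 1
            entry[1] += int(blocked)

    def rate(self, dimension, key):
        """Return the fraction of blocked requests for one label, 0.0 if unseen."""
        total, blocked = self.counts.get((dimension, key), (0, 0))
        return blocked / total if total else 0.0

    def over_threshold(self, threshold=0.2, min_requests=20):
        """Return labels whose block rate exceeds threshold once enough data exists."""
        return sorted(
            label for label, (total, blocked) in self.counts.items()
            if total >= min_requests and blocked / total > threshold
        )
```

call monitor.record(blocked, account="acct-03", pool="us-resi", geo="us") after every request, then pull anything in over_threshold() out of rotation. those label values are made up for the example.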


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
