
How to scrape X (Twitter) at scale in 2026 with proxies that work


X is one of the hardest platforms to scrape reliably, and also one of the most requested. the combination of aggressive rate limiting, login walls on most content, and an API that now charges $100/month at the entry tier means that most people end up either overpaying for API credits or watching their scrapers break every two weeks when X rotates its internal endpoints. i’ve been running data pipelines off X since the Twitter v1 API days, and the playbook has changed a lot.

this tutorial is for operators who need structured, repeatable data from X at scale: social listening companies, research teams, quant funds tracking sentiment, and people building training datasets. it assumes you’re comfortable with Python and can set up a Linux VPS. i’m not going to sugarcoat the rate-limit situation or pretend there’s a magic tool that makes this painless. i’ll tell you what actually works in 2026 and what it costs.

by the end you’ll have a working scraper using Playwright and rotating residential proxies, a solid understanding of where the failure points are, and a sensible path to scaling from hundreds of requests per day to tens of thousands.


what you need

  • python 3.11+ with playwright, httpx, tenacity, and pydantic installed
  • a residential proxy pool with rotation. i use proxyscraping residential proxies. datacenter IPs get blocked almost immediately on X in 2026, so don’t bother with those for browser-level scraping. budget $3-8/GB depending on the provider and volume
  • a VPS or cloud instance with at least 2 vCPU and 4GB RAM for running headless browsers. a Hetzner CPX21 at ~€5.90/month works fine for low to medium volume
  • X accounts (optional but helpful). logged-in sessions expose significantly more content than guest scraping. if you’re managing multiple X accounts for scraping purposes, the multi-account operations guide at multiaccountops.com/blog/ covers account hygiene in depth
  • time: realistic setup time is 4-6 hours including proxy configuration and debugging

cost baseline for a modest scraping operation pulling ~50,000 tweets/day:

  • residential proxies: ~$40-80/month depending on bandwidth
  • VPS: ~€6-12/month
  • optional X API Basic for supplemental data: $100/month


step by step

step 1: decide your data source: API or browser scraping

X offers a tiered API at developer.x.com. the Free tier is effectively write-only as of 2025. the Basic tier ($100/month) gives you 10,000 tweet reads per month, which sounds like a lot until you realise that’s 333 tweets per day. the Pro tier is $5,000/month.
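to make the gap concrete, here is a quick back-of-envelope sketch. the quota and price are the Basic tier figures quoted above, the 50,000 tweets/day target is the cost-baseline figure from earlier in this guide, and the 30-day month is a simplifying assumption.

```python
# Back-of-envelope: how far does the Basic tier's read quota go?
BASIC_READS_PER_MONTH = 10_000   # figure quoted in this guide
BASIC_PRICE_USD = 100            # figure quoted in this guide
TARGET_TWEETS_PER_DAY = 50_000   # cost-baseline target used in this guide
DAYS_PER_MONTH = 30              # simplifying assumption

# Whole reads available per day if the quota is spread evenly.
reads_per_day = BASIC_READS_PER_MONTH // DAYS_PER_MONTH

# Share of the monthly target that the quota would cover.
coverage = BASIC_READS_PER_MONTH / (TARGET_TWEETS_PER_DAY * DAYS_PER_MONTH)

# Effective price per 1,000 tweet reads.
usd_per_1k_reads = BASIC_PRICE_USD / (BASIC_READS_PER_MONTH / 1000)

print(f"reads per day: {reads_per_day}")
print(f"share of target covered: {coverage:.2%}")
print(f"cost per 1,000 reads: ${usd_per_1k_reads:.2f}")
```

under those assumptions the Basic tier covers well under 1% of the target volume, which is why the rest of this guide focuses on browser automation.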

for anything beyond light usage, browser automation or the unofficial internal endpoints are the practical reality. i use browser automation via Playwright because it mirrors real user behavior and works without an API key.

note: scraping X beyond what the API allows is against X’s Terms of Service. this guide is written for legitimate use cases like research and journalism. you are responsible for complying with applicable laws and platform terms in your jurisdiction. this is not legal advice.

step 2: install and configure Playwright

pip install playwright httpx tenacity pydantic
playwright install chromium

test that the browser launches cleanly:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://x.com")
    print(page.title())
    browser.close()

expected output: something like X. It's what's happening / X. if you see a blank title or a timeout, your VPS might be blocking outbound connections on port 443. check your firewall rules.

if it breaks: headless Chromium sometimes gets fingerprinted. switch to headless=False temporarily to debug (on a VPS with no display you'll need to run it under a virtual display such as xvfb), or try channel="chrome" if you have Chrome installed on the VPS.

step 3: set up residential proxy rotation

residential proxies route your requests through real ISP-assigned IPs, which is what makes them work on X while datacenter IPs get blocked. configure Playwright to use a rotating proxy endpoint:

from playwright.sync_api import sync_playwright

PROXY_HOST = "rp.proxyscraping.com"
PROXY_PORT = 31112
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

def get_browser(playwright):
    return playwright.chromium.launch(
        headless=True,
        proxy={
            "server": f"http://{PROXY_HOST}:{PROXY_PORT}",
            "username": PROXY_USER,
            "password": PROXY_PASS,
        }
    )

verify you’re getting a different IP each session before going further. you can check this by visiting a URL that returns your IP in JSON and printing the response.
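one way to run that check without launching a browser is sketched below, using only the standard library. the echo URL (api.ipify.org) is my assumption, and any service that returns your IP as JSON would do. the helper names fetch_exit_ip and rotation_ratio are made up for this example.

```python
import json
import urllib.request

# Assumption: an IP-echo service that answers with {"ip": "..."}.
IP_ECHO_URL = "https://api.ipify.org?format=json"


def fetch_exit_ip(proxy_url: str, timeout: float = 15.0) -> str:
    """Return the public IP the echo service sees when going through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(IP_ECHO_URL, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["ip"]


def rotation_ratio(ips: list[str]) -> float:
    """Fraction of checks that produced a distinct IP (1.0 means all unique)."""
    return len(set(ips)) / len(ips) if ips else 0.0


# Example usage (needs real credentials and network access, so left commented):
# proxy = "http://your_username:your_password@rp.proxyscraping.com:31112"
# ips = [fetch_exit_ip(proxy) for _ in range(5)]
# print(ips, rotation_ratio(ips))
```

a ratio close to 1.0 across a handful of checks suggests the endpoint is rotating. a ratio near 0 usually means you are on a sticky session or the rotation setting is off.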

if it breaks: if X returns a 429 or a “something went wrong” page consistently from the proxy, test the same request with curl through the proxy first to isolate whether it’s a proxy issue or a browser fingerprint issue.

step 4: build a basic tweet scraper

X renders tweets via its internal GraphQL API. intercepting these requests is more efficient than parsing the DOM directly. here’s a pattern that captures the underlying JSON as the page loads:

import json
from playwright.sync_api import sync_playwright

def scrape_user_timeline(username: str, max_tweets: int = 100):
    tweets = []

    def handle_response(response):
        if "UserTweets" in response.url and response.status == 200:
            try:
                data = response.json()
                entries = (
                    data.get("data", {})
                    .get("user", {})
                    .get("result", {})
                    .get("timeline_v2", {})
                    .get("timeline", {})
                    .get("instructions", [])
                )
                for instruction in entries:
                    for entry in instruction.get("entries", []):
                        tweet = entry.get("content", {}).get("itemContent", {})
                        if tweet.get("tweet_results"):
                            tweets.append(tweet["tweet_results"])
            except Exception:
                # non-JSON body or an unexpected shape: skip this response
                pass

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("response", handle_response)
        page.goto(f"https://x.com/{username}", wait_until="networkidle")
        browser.close()

    return tweets[:max_tweets]

expected output: a list of raw tweet result dicts. parse them with pydantic models to extract full_text, created_at, public_metrics, etc.
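a minimal parsing sketch follows. it uses a stdlib dataclass as a stand-in for the pydantic model so it runs without dependencies. the key names (result, rest_id, legacy, favorite_count, retweet_count) and the created_at date format are assumptions about the response shape, not something X documents, so check them against a raw payload before relying on this.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Optional

# Assumed format of legacy.created_at, e.g. "Wed Oct 10 20:19:24 +0000 2018".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


@dataclass
class Tweet:
    tweet_id: str
    full_text: str
    created_at: datetime
    like_count: int
    retweet_count: int


def parse_tweet(tweet_results: dict[str, Any]) -> Optional[Tweet]:
    """Map one raw tweet_results dict to a Tweet, or None if fields are missing."""
    result = tweet_results.get("result") or {}
    legacy = result.get("legacy") or {}
    tweet_id = result.get("rest_id")
    if not tweet_id or "full_text" not in legacy:
        return None
    try:
        created_at = datetime.strptime(legacy.get("created_at", ""), CREATED_AT_FORMAT)
    except ValueError:
        return None  # unparseable timestamp: treat the row as invalid
    return Tweet(
        tweet_id=str(tweet_id),
        full_text=legacy["full_text"],
        created_at=created_at,
        like_count=int(legacy.get("favorite_count", 0)),
        retweet_count=int(legacy.get("retweet_count", 0)),
    )
```

returning None instead of raising keeps one malformed entry from killing a whole run, but count the None results: a sudden spike is usually the first sign that the response shape changed.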

if it breaks: X occasionally renames or restructures its GraphQL response keys. if tweets comes back empty, add a debug print inside handle_response to log all matching response URLs and inspect the raw JSON structure.

step 5: add retry logic and rate limit handling

X will return 429s at volume. wrap your scraper with tenacity to handle transient failures:

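import httpx
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

# Playwright raises its own TimeoutError, which is not a subclass of the
# builtin one, so it has to be named explicitly here.
# httpx.HTTPStatusError only fires for direct HTTP calls that use
# raise_for_status(); a 429 on an intercepted GraphQL response is skipped by
# handle_response and shows up as an empty result, not as an exception.
@retry(
    stop=stop_after_attempt(5),
    wait=wait_exponential(multiplier=2, min=4, max=60),
    retry=retry_if_exception_type((PlaywrightTimeoutError, httpx.HTTPStatusError)),
)
def scrape_with_retry(username):
    return scrape_user_timeline(username)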

set a random delay of 3-8 seconds between requests to each distinct account timeline. this alone dramatically reduces the rate at which your proxy IPs get flagged.
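the delay can be sketched as a small generator that paces any iterable of usernames. the name paced and its parameters are hypothetical, and the sleep function is injectable only so the pacing can be tested without actually waiting.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def paced(
    items: Iterable[str],
    min_delay: float = 3.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield items, sleeping a random interval between consecutive ones."""
    first = True
    for item in items:
        if not first:
            # Uniform jitter in [min_delay, max_delay] seconds.
            sleep(random.uniform(min_delay, max_delay))
        first = False
        yield item


# Example usage (scrape_with_retry is the wrapper defined in this step):
# for username in paced(["account_one", "account_two"]):
#     tweets = scrape_with_retry(username)
```

there is no delay before the first item, so a single-account run is not slowed down.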

if it breaks: if you’re hitting 429s even with delays, your proxy pool is too small or your sessions aren’t rotating. each session should use a fresh browser context with a new IP.

step 6: store results and deduplicate

pipe results to a Postgres table (or even SQLite for smaller volumes) and use tweet IDs as the primary key to handle duplicates:

CREATE TABLE tweets (
    tweet_id TEXT PRIMARY KEY,
    username TEXT,
    full_text TEXT,
    created_at TIMESTAMPTZ,
    like_count INTEGER,
    retweet_count INTEGER,
    scraped_at TIMESTAMPTZ DEFAULT NOW()
);

use INSERT ... ON CONFLICT (tweet_id) DO NOTHING to handle re-runs cleanly.
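here is a minimal sketch of that pattern against SQLite, which accepts the same ON CONFLICT clause (SQLite 3.24 or newer). the timestamp columns are stored as TEXT because SQLite has no TIMESTAMPTZ type, and the save_tweets helper is a name made up for this example. on Postgres the same statement works with your driver's placeholder style.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id TEXT PRIMARY KEY,
    username TEXT,
    full_text TEXT,
    created_at TEXT,
    like_count INTEGER,
    retweet_count INTEGER,
    scraped_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

INSERT_SQL = """
INSERT INTO tweets (tweet_id, username, full_text, created_at, like_count, retweet_count)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (tweet_id) DO NOTHING
"""


def save_tweets(conn: sqlite3.Connection, rows: list[tuple]) -> int:
    """Insert rows, skipping tweet_ids already stored. Returns the count of new rows."""
    def count() -> int:
        return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]

    before = count()
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, rows)
    return count() - before


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
batch = [
    ("1", "alice", "first", "2026-01-01T00:00:00+00:00", 5, 1),
    ("2", "alice", "second", "2026-01-02T00:00:00+00:00", 2, 0),
]
print(save_tweets(conn, batch))  # first run: both rows are new
print(save_tweets(conn, batch))  # re-run: nothing is inserted
```

comparing the returned count with the batch size is a cheap way to see how much of each run is overlap with data you already have.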

if it breaks: if your scrape job crashes mid-run and you’re not sure what was saved, add a job_id column and query for the most recent complete job before re-running.

step 7: validate your output before scaling

before you push volume, manually verify 20-30 rows. check that timestamps are correct, that full_text isn’t truncated, and that metrics look plausible. a silent parsing bug at 1,000 tweets/day becomes a large corrupt dataset at 100,000 tweets/day.

read the Python requests documentation if you need to supplement browser scraping with direct HTTP calls for specific endpoints.

if it breaks: discrepancies between scraped and actual tweet counts usually mean you’re missing paginated responses. add scroll-and-intercept logic to capture subsequent GraphQL calls as the page loads more content.
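a scroll-and-intercept loop might look like the sketch below. it assumes the response handler from step 4 is already attached and appends to the same tweets list. it calls two Playwright page methods, page.mouse.wheel and page.wait_for_timeout. the function name, scroll distance, pause, and idle limit are all placeholder values to tune.

```python
def scroll_until_done(
    page,
    tweets: list,
    max_tweets: int = 100,
    max_idle_scrolls: int = 3,
    pause_ms: int = 1500,
) -> int:
    """Scroll until enough tweets are captured or several scrolls add nothing.

    `page` is a Playwright Page whose "response" handler appends to `tweets`.
    Returns the number of tweets collected.
    """
    idle = 0
    while len(tweets) < max_tweets and idle < max_idle_scrolls:
        before = len(tweets)
        page.mouse.wheel(0, 4000)        # scroll down to trigger the next page of results
        page.wait_for_timeout(pause_ms)  # give the intercepted responses time to arrive
        # Reset the idle counter whenever a scroll produced new tweets.
        idle = idle + 1 if len(tweets) == before else 0
    return len(tweets)


# Example usage inside scrape_user_timeline, after page.goto(...):
# scroll_until_done(page, tweets, max_tweets=max_tweets)
```

the idle counter is what stops the loop on short timelines or when X stops serving more results, so the scraper does not scroll forever.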


common pitfalls

using datacenter proxies. i see this constantly. X blocks datacenter IP ranges aggressively. you’ll get a few hundred requests through before sessions start failing. spend the money on residential or, for search-only use cases, ISP proxies. see our rotating residential proxies guide for provider comparisons.

scraping logged-out. a lot of X content now requires a logged-in session to view. if you’re only seeing a subset of tweets or hitting login walls, you need to inject a valid session cookie into your browser context. manage session rotation carefully.

not respecting GraphQL endpoint changes. X’s internal API structure changes without notice. build your parser to be resilient: log raw responses to disk periodically so you can debug regressions without having to reproduce them live.
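one low-effort way to do that logging is to sample a small fraction of raw payloads to disk from inside the response handler. the sketch below is an assumption-heavy example: the function name, directory name, filename pattern, and 5% sample rate are arbitrary choices.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable, Optional


def maybe_dump_raw(
    payload: dict,
    out_dir: str = "raw_responses",
    sample_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> Optional[Path]:
    """Write payload to out_dir with probability sample_rate.

    Returns the written path, or None when this payload was not sampled.
    """
    if rng() >= sample_rate:
        return None
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # Nanosecond timestamp keeps filenames unique enough for a sampled log.
    path = directory / f"usertweets_{time.time_ns()}.json"
    path.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
    return path


# Example usage inside handle_response, right after data = response.json():
# maybe_dump_raw(data)
```

when a parser regression shows up, the most recent files in that directory give you the new response shape without having to reproduce the request live.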

running all workers from the same IP block. even with a proxy pool, if you’re buying proxies from a single provider and running 50 concurrent workers, the IP diversity may still be limited. diversify across two providers for large operations.

ignoring [robots.txt](https://www.rfc-editor.org/rfc/rfc9309.html). X’s robots.txt explicitly disallows most crawling. this doesn’t make scraping illegal, but it is a signal about platform intent, and it’s worth knowing what you’re agreeing to when you proceed.


scaling this

10x (up to ~500,000 tweets/day): move to async Playwright using playwright.async_api. run 10-20 concurrent browser contexts per machine. you’ll need 8GB+ RAM and should move to a dedicated server rather than shared VPS.
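the concurrency cap itself can be sketched with plain asyncio, independent of the browser code. run_bounded and its worker argument are hypothetical names: in practice the worker would be an async function that opens a browser context via playwright.async_api and scrapes one timeline.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def run_bounded(
    usernames: Iterable[str],
    worker: Callable[[str], Awaitable[Any]],
    limit: int = 10,
) -> dict[str, Any]:
    """Run worker(username) for every username, at most `limit` at a time.

    Returns {username: result}. A failed worker stores its exception as the
    result instead of cancelling the rest of the batch.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(name: str) -> tuple[str, Any]:
        async with semaphore:  # blocks while `limit` workers are already running
            try:
                return name, await worker(name)
            except Exception as exc:
                return name, exc

    pairs = await asyncio.gather(*(guarded(name) for name in usernames))
    return dict(pairs)


# Example usage, with scrape_async standing in for your async scraper:
# results = asyncio.run(run_bounded(usernames, scrape_async, limit=15))
```

start with a low limit and raise it while watching memory, since each browser context carries its own overhead.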

100x (multi-million tweets/day): browser automation becomes expensive in CPU and bandwidth at this scale. augment it with direct HTTP requests to the GraphQL endpoints, using session tokens extracted from logged-in browser sessions and httpx.AsyncClient with connection pooling. this uses far fewer resources than full browser rendering. you’ll also want a proper job queue like Redis + rq or Celery to distribute work across multiple machines.

1000x: at this point you’re running a data company. you need dedicated infrastructure, multiple proxy providers, session management systems to maintain dozens of logged-in accounts, and monitoring to detect when scrape quality degrades. consider whether a commercial data vendor (like Brandwatch or Sprinklr) is actually cheaper than the engineering cost of maintaining this infrastructure in-house. sometimes it is.

for multi-account management at scale, antidetect browsers become relevant. antidetectreview.org/blog/ covers the current generation of tools.




Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.
