How to scrape Twitter Ads Library at scale in 2026 with proxies that work
X’s Ads Transparency Center is one of the most underused competitive intelligence sources available to operators right now. It shows you every ad a brand has run on the platform, including creative, copy, targeting region, and impression windows. If you’re running paid campaigns, doing competitive research, or building ad-tracking tools, this data is genuinely useful, and unlike Meta’s Ad Library, most people haven’t automated it yet.
The problem is that the platform fights you. X uses aggressive bot detection, Cloudflare challenges, and session-bound pagination that breaks the moment your IP changes mid-crawl. Datacenter proxies get blocked within minutes. Without a proper residential proxy rotation strategy, you’ll hit a wall at around 50-100 requests, when your session gets silently throttled or served empty results. This tutorial covers how to get around that.
This is written for Python developers who want to collect ad data at scale, whether that’s 500 advertisers or 50,000. By the end you’ll have a working Playwright-based scraper, a proxy rotation setup using ProxyScraping’s residential pool, and a clear path to scaling without getting banned.
what you need
Tools and libraries
- Python 3.11 or higher
- Playwright for Python (pip install playwright)
- The Chromium browser binary for Playwright (playwright install chromium)
- pandas for output, asyncio for concurrency
- A JSON/CSV store or Postgres for results
Proxy infrastructure
- ProxyScraping residential proxy plan. the entry-level plan starts at around $3/GB for residential, which is enough for moderate volumes. for 10,000 pages you’re looking at roughly 8-15GB depending on page weight. budget $30-50 for a pilot run.
- Sticky session support matters here because X’s ad pages use multi-step pagination. you want the same IP to persist for at least 10 minutes per advertiser crawl.
Accounts and access
- No X account is strictly required for the Transparency Center, it’s publicly accessible at ads.x.com/transparency. however, logged-in sessions see more ad metadata in practice.
- If you’re scraping at volume, don’t use your personal account. use a warmed throwaway or no session at all.
Estimated cost for a 10k advertiser run
- Proxies: $40-80
- Compute (a $6/month VPS is fine): minimal
- Time: one weekend to build, a few hours to run
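The bandwidth figures above reduce to simple arithmetic. A minimal sketch, assuming 1 GB = 1000 MB and a page weight of 0.8-1.5 MB (back-derived from the 8-15GB per 10,000 pages estimate); the function name is made up for illustration:

```python
def estimate_proxy_cost(pages, mb_per_page, usd_per_gb=3.0):
    """Return (gigabytes, dollars) for a crawl of `pages` page loads."""
    gigabytes = pages * mb_per_page / 1000  # assumes 1 GB = 1000 MB
    return gigabytes, gigabytes * usd_per_gb


# low and high estimates for a 10,000-page pilot at $3/GB
low_gb, low_usd = estimate_proxy_cost(10_000, 0.8)
high_gb, high_usd = estimate_proxy_cost(10_000, 1.5)
print(f"{low_gb:.0f}-{high_gb:.0f} GB, ${low_usd:.0f}-${high_usd:.0f}")
```

Measure the real page weight on a handful of advertisers before trusting any estimate; image-heavy ad cards can move it a lot.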
step by step
step 1: understand the target structure
Before writing a line of code, spend 20 minutes in the Transparency Center manually. Search for a well-known advertiser, scroll through their ads, and watch the network tab in DevTools.
You’ll see XHR requests going to internal API endpoints. X periodically changes these endpoint paths, so don’t hardcode them. Instead, treat the browser behavior as your scraping layer. Playwright renders the full page and lets you intercept these requests if you want, but the simpler approach is just scraping the rendered DOM.
Key observations:
- Search is at /transparency with a ?query= param
- Each advertiser has a profile page with paginated ad cards
- Pagination is scroll-triggered, not a next-page button
- Ad cards contain: creative thumbnail, ad text, start/end date, advertiser name, and sometimes targeting country
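If you do want the interception route mentioned above instead of DOM scraping, it could look roughly like this. This is a sketch, not X’s actual API contract: the url_hint filter is an assumption, and the function name is hypothetical. It relies only on Playwright’s page.on("response", ...) event and the response object’s url, headers, and json() members.

```python
def attach_json_capture(page, sink, url_hint="ads"):
    """Register a response listener that appends decoded JSON bodies to `sink`."""

    async def on_response(response):
        content_type = response.headers.get("content-type", "")
        # keep only JSON responses whose URL contains the (assumed) hint
        if "application/json" not in content_type or url_hint not in response.url:
            return
        try:
            sink.append({"url": response.url, "body": await response.json()})
        except Exception:
            pass  # body was empty or not valid JSON; ignore it

    page.on("response", on_response)
    return on_response
```

Call attach_json_capture(page, captured) before page.goto() so the first batch of requests is not missed, then inspect the captured list to see which payloads carry the ad data.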
step 2: set up Playwright with proxy rotation
import asyncio
from playwright.async_api import async_playwright

PROXY_HOST = "residential.proxyscraping.com"
PROXY_PORT = 8080
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

async def get_browser(playwright, sticky_id=None):
    # append the sticky session ID to the username when one is given;
    # compare against None so that a session ID of 0 still counts
    username = f"{PROXY_USER}-session-{sticky_id}" if sticky_id is not None else PROXY_USER
    proxy = {
        "server": f"http://{PROXY_HOST}:{PROXY_PORT}",
        "username": username,
        "password": PROXY_PASS,
    }
    # route all browser traffic through the residential proxy
    browser = await playwright.chromium.launch(headless=True, proxy=proxy)
    return browser
ProxyScraping’s sticky session format appends a session ID to the username. check their docs for the exact format for your plan tier; it varies slightly.
if it breaks: if the browser hangs at launch, confirm your proxy credentials are correct with a simple curl test through the proxy before debugging Playwright.
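To keep the username format in one place, a small helper can build it and generate fresh session IDs. A sketch only: the "-session-" infix mirrors the snippet above and is an assumed format, and both function names are hypothetical.

```python
import secrets


def new_session_id(nbytes=4):
    """Return a fresh random hex ID, e.g. for one sticky session per advertiser."""
    return secrets.token_hex(nbytes)


def sticky_username(base_user, session_id=None):
    """Build the proxy username; the '-session-' infix is an assumed format."""
    if session_id is None:
        return base_user  # plain username, no sticky session requested
    return f"{base_user}-session-{session_id}"
```

For example, sticky_username("your_username", new_session_id()) yields a username with a random 8-character hex suffix, so two crawls never share a session by accident.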
step 3: load the transparency center and handle Cloudflare
X serves a Cloudflare challenge page on first load from cold IPs. Playwright with a real Chromium binary usually passes this, but you need to look human.
from urllib.parse import quote_plus

async def load_transparency(page, query):
    # URL-encode the query so names with spaces or "&" don't break the URL
    await page.goto(
        f"https://ads.x.com/transparency?query={quote_plus(query)}",
        wait_until="networkidle",
        timeout=30000,
    )
    # wait for the ad cards container to be present
    await page.wait_for_selector('[data-testid="ad-card"]', timeout=15000)
Add a 2-5 second random delay after page load. it sounds obvious but skipping this is how most scrapers get caught.
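The random delay can be a one-liner wrapped in a helper. A minimal sketch, with a hypothetical function name; the 2-5 second defaults come from the advice above:

```python
import asyncio
import random


async def human_pause(min_s=2.0, max_s=5.0):
    """Sleep for a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    await asyncio.sleep(delay)
    return delay
```

Call await human_pause() right after load_transparency() returns, and again between scrolls if you want less regular timing.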
if it breaks: if you see a Cloudflare interstitial that doesn’t resolve, the IP is likely flagged. rotate to a new sticky session ID and retry. residential IPs from major carriers (Comcast, Telstra, Singtel) tend to pass reliably for this target.
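The rotate-and-retry advice can be sketched as a generic wrapper. It assumes that a new session ID in the proxy username maps to a new exit IP, which depends on your provider; the function name is hypothetical and task is any coroutine function that accepts a session ID.

```python
import asyncio
import secrets


async def with_fresh_sessions(task, max_attempts=3, pause_s=1.0):
    """Call `await task(session_id)` with a new random session ID per attempt."""
    last_error = None
    for _ in range(max_attempts):
        session_id = secrets.token_hex(4)  # new ID -> new sticky session (assumed)
        try:
            return await task(session_id)
        except Exception as exc:  # e.g. a Playwright timeout on the challenge page
            last_error = exc
            await asyncio.sleep(pause_s)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

Catching every Exception is deliberately broad for a sketch; in a real run you would probably narrow it to timeouts and navigation errors so that genuine bugs still surface.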
step 4: extract ad card data from the DOM
async def extract_ads(page):
    # every ad card currently rendered in the DOM
    cards = await page.query_selector_all('[data-testid="ad-card"]')
    results = []
    for card in cards:
        ad = {}
        # each field falls back to None when its element is missing
        text_el = await card.query_selector('.ad-copy-text')
        ad['copy'] = await text_el.inner_text() if text_el else None
        date_el = await card.query_selector('.ad-date-range')
        ad['dates'] = await date_el.inner_text() if date_el else None
        img_el = await card.query_selector('img.ad-creative')
        ad['image_url'] = await img_el.get_attribute('src') if img_el else None
        results.append(ad)
    return results
Note: the selector names above are illustrative. X’s class names are obfuscated and change with deploys. where possible, target data-testid attributes or accessibility roles (for example page.get_by_role()) instead of class names; they tend to survive site updates. page.accessibility.snapshot() can help you inspect the accessibility tree, though Playwright’s docs mark it as deprecated.
if it breaks: if selectors return empty, run await page.screenshot(path="debug.png") to see what the page actually rendered. silent failures are common when Cloudflare serves a bot page with a 200 status.
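Since a bot page can arrive with a 200 status, a cheap text check on the HTML catches it earlier than a screenshot review. A heuristic sketch: the marker strings are ones commonly seen on Cloudflare challenge pages, not an official or complete list, and the function name is hypothetical.

```python
CHALLENGE_MARKERS = (
    "just a moment",              # title text commonly seen on Cloudflare interstitials
    "challenges.cloudflare.com",  # host that serves the challenge scripts
    "cf-chl",                     # prefix used in challenge-related markup
)


def looks_like_challenge(html):
    """Heuristic: True if the HTML contains any known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

In the scraper this would be called as looks_like_challenge(await page.content()) before extract_ads(), so an empty result can be told apart from a blocked one.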
step 5: handle infinite scroll pagination
X loads more ads as you scroll. Playwright can simulate this.
async def scroll_and_collect(page, max_scrolls=10):
    card_selector = '[data-testid="ad-card"]'
    for _ in range(max_scrolls):
        # count the cards in the DOM before scrolling
        prev_count = len(await page.query_selector_all(card_selector))
        await page.evaluate("window.scrollBy(0, document.body.scrollHeight)")
        await asyncio.sleep(2.5)
        # give any triggered requests time to finish
        await page.wait_for_load_state("networkidle")
        if len(await page.query_selector_all(card_selector)) == prev_count:
            break  # no new ads loaded, we're at the end
    # extract once at the end: re-extracting on every scroll would duplicate
    # earlier cards (assumes cards stay in the DOM as you scroll)
    return await extract_ads(page)
if it breaks: if scroll doesn’t trigger new loads, the site may require focus events. try page.mouse.move() to a random coordinate before scrolling.
step 6: build the advertiser queue and run concurrently
For scale you need to run multiple advertisers in parallel. Keep concurrency low per IP, high across IPs.
async def scrape_advertiser(advertiser_name, session_id):
    async with async_playwright() as p:
        browser = await get_browser(p, sticky_id=session_id)
        try:
            context = await browser.new_context()
            page = await context.new_page()
            await load_transparency(page, advertiser_name)
            ads = await scroll_and_collect(page)
        finally:
            # always close the browser, even if the crawl raised
            await browser.close()
        return {"advertiser": advertiser_name, "ads": ads}

async def main(advertiser_list):
    semaphore = asyncio.Semaphore(5)  # max 5 concurrent browsers

    async def bounded_scrape(name, idx):
        async with semaphore:
            # session IDs repeat every 50 advertisers
            return await scrape_advertiser(name, session_id=idx % 50)

    tasks = [bounded_scrape(name, i) for i, name in enumerate(advertiser_list)]
    # with return_exceptions=True, failures come back as exception objects
    return await asyncio.gather(*tasks, return_exceptions=True)
A semaphore of 5 concurrent browsers is conservative but safe for a first run. push it to 15-20 once you’ve validated your proxy pool is clean.
if it breaks: if you get connection resets at high concurrency, your proxy plan’s concurrent connection limit is probably the ceiling. ProxyScraping’s business tier allows higher concurrent connections than entry plans.
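Because main() passes return_exceptions=True, its result list mixes result dicts with exception objects, in the same order as the input list. A small sketch for separating the two so failed advertisers can be logged or re-queued; the function name is hypothetical.

```python
def split_results(names, results):
    """Pair each advertiser name with its outcome from gather(return_exceptions=True)."""
    succeeded, failed = [], []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            failed.append((name, result))  # keep the error for logging or retry
        else:
            succeeded.append(result)
    return succeeded, failed
```

The names in failed can be fed straight back into main() for a second pass, ideally with different session IDs.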
step 7: store results and deduplicate
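import hashlib
import json
import pandas as pd  # not used below; handy later, e.g. pd.read_json(path, lines=True)

def save_results(results, output_path="ads_output.jsonl"):
    seen = set()  # hashes of ads already written during this call
    with open(output_path, "a", encoding="utf-8") as f:
        for r in results:
            # gather(return_exceptions=True) can hand back exception objects; skip them
            if isinstance(r, BaseException):
                continue
            for ad in r.get("ads", []):
                key = hashlib.md5(json.dumps(ad, sort_keys=True).encode()).hexdigest()
                if key not in seen:
                    seen.add(key)
                    f.write(json.dumps({**ad, "advertiser": r["advertiser"]}) + "\n")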
JSONL is better than CSV here because ad copy frequently contains commas, newlines, and quotes.
if it breaks: if you’re hitting file write errors at scale, buffer writes and flush every 100 records. for anything above 100k ads, pipe directly into Postgres with psycopg2.
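The buffer-and-flush suggestion might look like this. A minimal sketch with a hypothetical class name; the default of 100 records per flush comes from the advice above.

```python
import json


class BufferedJsonlWriter:
    """Collect records in memory and append them to a JSONL file in batches."""

    def __init__(self, path, flush_every=100):
        self.path = path
        self.flush_every = flush_every
        self._buffer = []

    def write(self, record):
        self._buffer.append(json.dumps(record))
        if len(self._buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self._buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            f.write("\n".join(self._buffer) + "\n")
        self._buffer.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # write whatever is left, even on error
```

Used as a context manager (with BufferedJsonlWriter("ads_output.jsonl") as w: w.write(ad)), the final partial batch is written when the block exits.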
common pitfalls
1. using datacenter proxies: Datacenter IPs are flagged at the CDN layer before your request hits X’s application. residential or mobile proxies only. this is non-negotiable for this target.
2. reusing sessions across different advertisers without a delay: X’s backend correlates session behavior. jumping between 50 advertiser searches on the same sticky session looks like a bot. either rotate the session ID between advertisers or add a 30-60 second cooldown between searches on the same session.
3. scraping without a baseline: Run your scraper on 10 advertisers manually first and verify the output looks correct before launching a 10,000 advertiser batch. errors that only show up on a fresh IP or specific advertiser type will silently corrupt your dataset at scale.
4. ignoring rate limit signals: X doesn’t always 429 you. sometimes it just returns fewer ads per page, or returns the same first page repeatedly. build a check: if page 3 of scrolling returns the same ad IDs as page 1, you’re being throttled. rotate and retry.
5. not checking the EU DSA compliance context: The EU Digital Services Act requires very large online platforms to maintain public ad repositories. X is covered as a VLOP. this means the Transparency Center data is explicitly made public for research purposes in the EU context. understand this nuance before deciding how you use the data. this is not legal advice.
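The repeated-page check from pitfall 4 can be sketched as follows. The extracted cards in this tutorial carry no ad ID, so this version hashes the whole ad dict as a stand-in; that substitution and both function names are assumptions, and a real ad ID is preferable if you can find one in the DOM or the network responses.

```python
import hashlib
import json


def ad_fingerprint(ad):
    """Stable hash of an ad dict, used as a stand-in when no ad ID is available."""
    return hashlib.md5(json.dumps(ad, sort_keys=True).encode()).hexdigest()


def looks_throttled(first_batch, later_batch):
    """True if a later scroll batch repeats exactly the ads of the first batch."""
    first = {ad_fingerprint(ad) for ad in first_batch}
    later = {ad_fingerprint(ad) for ad in later_batch}
    return bool(first) and later == first
```

This only catches the "same first page again" symptom. The "fewer ads per page" symptom needs a baseline count per advertiser to compare against.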
scaling this
10x (hundreds of advertisers)
The setup above handles this without changes. the main constraint is proxy bandwidth. 500 advertisers at 20 ads each is maybe 3GB. straightforward.
100x (thousands of advertisers)
At this level, move off a single VPS. distribute across 3-5 machines, each with its own proxy session pool. use a simple Redis queue to distribute advertiser names and store results centrally. rotate sticky session IDs more aggressively: one session per advertiser, no reuse.
You’ll also want to start fingerprint-hardening your Playwright setup. install the playwright-stealth package and apply it to every page you open. X’s bot scoring looks at canvas fingerprint, navigator properties, and timing patterns. the import below matches the package’s 1.x releases; newer releases may expose a different API, so check the README for your installed version.
pip install playwright-stealth
from playwright_stealth import stealth_async
await stealth_async(page)
1000x (tens of thousands of advertisers)
At this point the bottleneck is usually session freshness and IP reputation. you want mobile residential proxies from ProxyScraping’s mobile pool, not standard residential. mobile IPs have significantly better pass rates on X because the platform expects high-volume behavior from mobile carrier NATs.
You’ll also need to think about X’s API as a complement. the API has stricter rate limits and doesn’t cover all transparency data, but it’s lower friction for advertiser lookups and can significantly reduce your browser automation load when the two are combined.
For data storage at this volume, columnar formats (Parquet) or a proper analytical database (DuckDB or BigQuery) will save you hours compared to JSONL.
If you’re running multi-account operations alongside this, the antidetect browser stack at antidetectreview.org/blog/ covers how to isolate browser profiles properly. relevant if your scraping setup shares infrastructure with account management.
where to go next
- How to scrape Meta Ad Library with rotating proxies: same operator use case, different target. Meta’s library has a real API which changes the approach significantly.
- ProxyScraping residential proxy review: a full breakdown of ProxyScraping’s plans, speeds, and pool quality tested across common scraping targets including social platforms.
- How to build an ad intelligence tracker with Python and Postgres: takes the output from this tutorial and turns it into a queryable database with trend tracking over time.
For operators running ad arbitrage or affiliate tracking who also farm airdrop campaigns on the side, the intersection of browser automation and ad data is covered well at airdropfarming.org/blog/. different context, overlapping infrastructure.
All tutorial links are from the proxyscraping.org blog index.
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.