How to scrape TikTok at scale in 2026 with proxies that work

TikTok is one of the hardest platforms to scrape at scale. the bot detection is aggressive, the API endpoints rotate signing parameters, and a naive approach will get you blocked within minutes. i’ve wasted money on datacenter proxies, burned through residential pools, and debugged more X-Bogus signing errors than i care to count. this guide reflects what actually works as of mid-2026.

this tutorial is for operators who need TikTok data at scale: social listening tools, market research pipelines, influencer databases, trend trackers. if you’re pulling a few hundred profiles manually, you don’t need this. if you need tens of thousands of records per day, consistently, you do.

by the end you’ll have a working Python scraper using Playwright with stealth plugins, a rotating residential or mobile proxy setup, and a rate-limit-aware request loop that keeps sessions alive long enough to be useful.


what you need

  • Python 3.11+ with playwright, playwright-stealth, httpx, and python-dotenv installed
  • Rotating residential proxies, minimum 500 IPs in pool. mobile proxies (4G/5G) work better for TikTok but cost more. budget $50-150/month for a mid-size pool from providers like Bright Data, Smartproxy, or ProxyScrape residential
  • A proxy manager or rotator endpoint, so you get a fresh IP per request or per session without managing rotation yourself
  • A scrape target list: usernames, hashtags, or video IDs depending on your use case
  • MongoDB or PostgreSQL for storing results (i use Postgres with JSONB columns for TikTok’s unpredictable response shapes)
  • 2-4 hours for initial setup and testing
  • Familiarity with async Python helps but isn’t required

TikTok has a Research API for academic use, but it requires approval and has strict rate limits. most commercial operators won’t qualify. this guide covers scraping the public web interface, which is what everyone else does too.

be aware: TikTok’s terms of service prohibit automated scraping. this is not legal advice. how you assess and handle that risk is your call.


step by step

step 1: install dependencies and configure your proxy

pip install playwright playwright-stealth httpx python-dotenv
playwright install chromium

create a .env file:

PROXY_HOST=your.proxy.host
PROXY_PORT=10000
PROXY_USER=youruser
PROXY_PASS=yourpass

expected output: chromium installs cleanly, no errors. if playwright complains about missing system deps on Ubuntu, run playwright install-deps chromium.

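if it breaks: on headless servers, add --with-deps to the playwright install command (playwright install --with-deps chromium) so the system libraries are installed along with the browser.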
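
before launching a browser, it can help to fail fast on a half-filled .env. the sketch below is one way to do that, assuming the four variable names from the .env file above. the function name build_proxy_config is hypothetical, and the returned dict just mirrors the server/username/password shape used in step 2.

```python
import os

# the four variables from the .env file shown above
REQUIRED_VARS = ("PROXY_HOST", "PROXY_PORT", "PROXY_USER", "PROXY_PASS")


def build_proxy_config(env=None) -> dict:
    """Build a proxy dict from environment values, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"missing proxy settings: {', '.join(missing)}")
    port = int(env["PROXY_PORT"])  # raises ValueError if the port is not numeric
    return {
        "server": f"http://{env['PROXY_HOST']}:{port}",
        "username": env["PROXY_USER"],
        "password": env["PROXY_PASS"],
    }
```

call it after load_dotenv() with no arguments to read from the process environment, or pass a plain dict when testing.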


step 2: launch a stealth browser session through the proxy

TikTok fingerprints browser properties hard. using raw Playwright without stealth patches gets you a 403 or a blank page within a few requests. the playwright-stealth library patches the most common detection vectors.

import asyncio
import os
from dotenv import load_dotenv
from playwright.async_api import async_playwright
from playwright_stealth import stealth_async

load_dotenv()

PROXY = {
    "server": f"http://{os.getenv('PROXY_HOST')}:{os.getenv('PROXY_PORT')}",
    "username": os.getenv("PROXY_USER"),
    "password": os.getenv("PROXY_PASS"),
}

async def get_browser():
    p = await async_playwright().start()
    browser = await p.chromium.launch(
        headless=True,
        proxy=PROXY,
        args=["--disable-blink-features=AutomationControlled"]
    )
    context = await browser.new_context(
        user_agent="Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1",
        viewport={"width": 390, "height": 844},
        locale="en-US",
        timezone_id="America/New_York",
    )
    page = await context.new_page()
    await stealth_async(page)
    return browser, context, page

using a mobile user-agent matters here. TikTok’s mobile web endpoints return cleaner JSON than the desktop ones and are slightly less fingerprinted.

expected output: browser launches, no immediate errors.

if it breaks: if you get ERR_TUNNEL_CONNECTION_FAILED, your proxy credentials are wrong or the proxy doesn’t support HTTPS tunneling. confirm with your provider.


step 3: load TikTok and capture the API calls

TikTok’s public pages load video data via internal XHR calls. intercept these instead of scraping the DOM, which is fragile and slow.

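async def capture_user_feed(username: str, page):
    captured = []

    async def handle_response(response):
        # keep only the internal API responses that carry profile and video data
        if "api/post/item_list" in response.url or "api/user/detail" in response.url:
            try:
                data = await response.json()
                captured.append(data)
            except Exception:
                pass  # body was not JSON or was already discarded; skip it

    page.on("response", handle_response)
    try:
        await page.goto(f"https://www.tiktok.com/@{username}", wait_until="networkidle", timeout=30000)
        await page.wait_for_timeout(4000)  # let lazy-loaded XHR fire
    finally:
        # the page is reused across usernames, so detach the handler
        # instead of stacking a new one on every call
        page.remove_listener("response", handle_response)
    return captured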

expected output: captured list contains one or more dicts with user info and a list of video objects. each video object has video.id, stats, desc, author, createTime, and more.

if it breaks: if captured is empty, TikTok may have served a challenge page. check await page.content() for a CAPTCHA or bot challenge. this means your proxy IP is flagged. rotate to a fresh IP and retry.
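
the challenge check can be automated with a plain string scan over the page HTML. this is a rough sketch: the marker strings are assumptions, not a list TikTok publishes, so adjust them to whatever you actually see in a blocked response. looks_like_challenge is a hypothetical helper name.

```python
# assumed markers; replace with strings observed in your own blocked responses
CHALLENGE_MARKERS = ("captcha", "verify to continue", "security check")


def looks_like_challenge(html: str, markers=CHALLENGE_MARKERS) -> bool:
    """Return True if the page HTML contains any known challenge marker."""
    lowered = html.lower()
    return any(marker in lowered for marker in markers)
```

in the scraper this would be called as looks_like_challenge(await page.content()) when captured comes back empty, and a True result is the signal to rotate to a fresh IP.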


step 4: parse video metadata

def parse_videos(raw_responses: list) -> list:
    videos = []
    for resp in raw_responses:
        items = resp.get("itemList") or []
        for item in items:
            videos.append({
                "video_id": item.get("id"),
                "author": item.get("author", {}).get("uniqueId"),
                "description": item.get("desc"),
                "created_at": item.get("createTime"),
                "play_count": item.get("stats", {}).get("playCount"),
                "like_count": item.get("stats", {}).get("diggCount"),
                "comment_count": item.get("stats", {}).get("commentCount"),
                "share_count": item.get("stats", {}).get("shareCount"),
            })
    return videos

expected output: clean list of dicts ready to insert into your database.

if it breaks: TikTok occasionally restructures field names. if itemList is missing, print raw_responses[0].keys() to find the new wrapper key.
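
when the wrapper key changes, the new one is usually the only top-level key holding a list of objects. a small helper (hypothetical name find_list_keys) can surface candidates instead of eyeballing the key dump. it assumes the response is a dict, as in the parser above.

```python
def find_list_keys(payload: dict) -> list[str]:
    """Return top-level keys whose value is a non-empty list of dicts."""
    return [
        key
        for key, value in payload.items()
        if isinstance(value, list)
        and value
        and all(isinstance(entry, dict) for entry in value)
    ]
```

run it on raw_responses[0]. if it returns a single key, that is the likely replacement for itemList.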


step 5: implement rate-limited looping

this is where most scrapers fail. they hammer requests, burn through IPs, and wonder why their pool dries up.

import asyncio
import random

async def scrape_users(usernames: list, output: list):
    browser, context, page = await get_browser()
    # pick the rotation threshold once per session; re-rolling it on every
    # iteration would make a modulo check fire at unpredictable points
    rotate_after = random.randint(15, 20)
    since_rotation = 0
    try:
        for i, username in enumerate(usernames):
            raw = await capture_user_feed(username, page)
            videos = parse_videos(raw)
            output.extend(videos)
            print(f"[{i+1}/{len(usernames)}] {username}: {len(videos)} videos")

            # rotate proxy and refresh session every 15-20 requests
            since_rotation += 1
            if since_rotation >= rotate_after:
                await context.close()
                await browser.close()
                browser, context, page = await get_browser()
                rotate_after = random.randint(15, 20)
                since_rotation = 0

            # randomized delay: 3-8 seconds between requests
            await asyncio.sleep(random.uniform(3, 8))
    finally:
        await browser.close()

the session rotation every 15-20 requests forces a new proxy IP (assuming your proxy provider rotates on new connections, which most do). this keeps individual IPs from accumulating too many requests.

expected output: terminal prints each username and video count as it processes.

if it breaks: if you’re seeing consistent empty results after session rotation, your entire proxy pool may be flagged. switch to mobile proxies for TikTok specifically; they have significantly lower block rates.


step 6: store results

import json
from datetime import datetime

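def save_results(videos: list, filename: str | None = None):
    if not filename:
        filename = f"tiktok_{datetime.now().strftime('%Y%m%d_%H%M%S')}.jsonl"
    # explicit utf-8 so emojis and CJK text in descriptions survive on Windows
    with open(filename, "w", encoding="utf-8") as f:
        for v in videos:
            # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes
            f.write(json.dumps(v, ensure_ascii=False) + "\n")
    print(f"saved {len(videos)} videos to {filename}")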

JSONL (newline-delimited JSON) is better than a single JSON array here because you can append incrementally without loading the whole file, and it imports cleanly into Postgres with COPY or BigQuery with its JSON ingestion.

expected output: a .jsonl file in your working directory with one video record per line.

if it breaks: encoding errors on TikTok descriptions (emojis, CJK characters) are common. ensure you’re opening files with encoding="utf-8" explicitly on Windows.
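
to take advantage of JSONL’s append-friendly shape, batches can be written as they arrive instead of once at the end. a minimal sketch, assuming each record is a JSON-serializable dict; append_jsonl is a hypothetical helper, not part of the scraper above.

```python
import json


def append_jsonl(records, path: str) -> int:
    """Append records to a JSONL file, one JSON object per line. Returns the count written."""
    written = 0
    # "a" creates the file if it is missing and never truncates existing lines
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

calling this once per username means a crash halfway through a run loses at most the batch in flight.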


step 7: run and validate

python -c "
import asyncio
from scraper import scrape_users

usernames = ['charlidamelio', 'khaby.lame', 'bellapoarch']
results = []
asyncio.run(scrape_users(usernames, results))
print(f'total: {len(results)} videos')
"

the command assumes the code from steps 2 to 6 is saved as scraper.py in your working directory. spot-check 5-10 records against what you see on TikTok manually. verify play counts are in the right order of magnitude.


common pitfalls

using datacenter proxies. TikTok’s IP scoring is sophisticated. datacenter IP ranges from AWS, GCP, and DigitalOcean are in known ASNs and get challenged immediately. residential or mobile proxies are non-negotiable for TikTok at any scale above a handful of requests.

not rotating the session, just the IP. if you keep the same browser fingerprint while rotating IPs, TikTok can correlate sessions by browser fingerprint instead of by IP. rotate the full context including viewport, user-agent, and timezone properties across sessions. see the Playwright docs on browser contexts for how context isolation works.
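
one way to vary the fingerprint is to keep a small set of internally consistent profiles and pick one per session. the profiles below are illustrative assumptions, not tested values. the keys match keyword arguments that browser.new_context() accepts, so a profile can be passed as browser.new_context(**profile).

```python
import copy
import random

# illustrative profiles; keep user agent and viewport consistent with each other,
# and keep the timezone consistent with your proxy's exit location
PROFILES = [
    {
        "user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Mobile/15E148 Safari/604.1",
        "viewport": {"width": 390, "height": 844},
        "locale": "en-US",
        "timezone_id": "America/New_York",
    },
    {
        "user_agent": "Mozilla/5.0 (Linux; Android 14; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Mobile Safari/537.36",
        "viewport": {"width": 412, "height": 915},
        "locale": "en-US",
        "timezone_id": "America/Chicago",
    },
]


def pick_profile(rng=random) -> dict:
    """Return a copy of one profile so callers can modify it without touching the pool."""
    return copy.deepcopy(rng.choice(PROFILES))
```

picking whole profiles rather than mixing fields at random avoids combinations that no real device would report, such as an iPhone user agent with an Android-sized viewport.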

ignoring cursor pagination. TikTok’s video list endpoint returns a cursor and hasMore field. if you don’t use these to paginate, you get at most 30 videos per user. loop on hasMore == true and pass the cursor back as a query param to get the full feed.
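
the pagination loop can be sketched independently of how each page is fetched. here fetch_page is a hypothetical callable that takes a cursor string and returns one parsed response dict. the itemList, hasMore, and cursor field names follow the text above, and the starting cursor of "0" is an assumption.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[str], dict], max_pages: int = 50) -> Iterator[dict]:
    """Yield items across pages until hasMore is false or max_pages is reached."""
    cursor = "0"  # assumed starting cursor
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield from page.get("itemList") or []
        if not page.get("hasMore"):
            break
        next_cursor = str(page.get("cursor", ""))
        if not next_cursor or next_cursor == cursor:
            break  # cursor did not advance; stop rather than loop forever
        cursor = next_cursor
```

the max_pages cap and the stuck-cursor check are there so a malformed response cannot turn into an endless request loop against one profile.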

scraping at fixed intervals. fixed sleep(5) patterns are easier to detect than randomized delays. use random.uniform(3, 8) or exponential backoff. human-like variance matters.
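
randomized delay and exponential backoff can be combined in one function: the 3-8 second window from step 5 doubles with each consecutive failure, up to a cap. this is a sketch with assumed defaults; human_delay is a hypothetical name and the 120 second cap is a choice, not a TikTok limit.

```python
import random


def human_delay(failures: int = 0, low: float = 3.0, high: float = 8.0,
                cap: float = 120.0, rng=random) -> float:
    """Return a randomized delay in seconds; the window doubles per consecutive failure."""
    scale = 2 ** max(failures, 0)
    return min(cap, rng.uniform(low * scale, high * scale))
```

with failures=0 this behaves like random.uniform(3, 8). reset the failure count to zero after the next successful request.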

not handling rate limit responses. a 429 or a JSON response with statusCode: 10101 (TikTok’s internal “too many requests” code) means slow down now, not eventually. catch these explicitly and add a 60-120 second back-off before retrying.
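
detecting the rate-limit signal explicitly might look like the sketch below. it assumes the status code 10101 named above arrives under a top-level statusCode key, either as a number or a string. both function names are hypothetical.

```python
import random

RATE_LIMIT_CODE = "10101"  # the internal code named in the text above


def is_rate_limited(http_status: int, body) -> bool:
    """True for an HTTP 429 or a JSON body carrying the rate-limit status code."""
    if http_status == 429:
        return True
    if isinstance(body, dict):
        return str(body.get("statusCode")) == RATE_LIMIT_CODE
    return False


def cooldown_seconds(rng=random) -> float:
    """Pick a back-off between 60 and 120 seconds."""
    return rng.uniform(60, 120)
```

in the response handler, a True result would be followed by await asyncio.sleep(cooldown_seconds()) before the same username is retried.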


scaling this

10x (thousands of profiles/day): the single-process async loop works here. increase your proxy pool to 1000+ IPs, run 3-5 parallel browser contexts with asyncio.gather(), and store results directly to Postgres rather than flat files.
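
a bounded version of the asyncio.gather() pattern can be sketched with a semaphore. worker is a hypothetical coroutine that scrapes one username in its own browser context, and the limit of 4 is just a value inside the 3-5 range mentioned above.

```python
import asyncio


async def run_bounded(items, worker, limit: int = 4) -> list:
    """Run worker(item) for every item with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # return_exceptions=True keeps one failed profile from cancelling the rest;
    # failures come back as exception objects in the result list
    return await asyncio.gather(*(guarded(item) for item in items), return_exceptions=True)
```

results come back in the same order as the input, so they can be zipped with the username list to find which ones failed.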

100x (tens of thousands/day): move to a job queue. Celery with Redis works well. each worker manages its own browser instance and proxy allocation. at this volume, mobile proxies become cost-justified because their block rate is significantly lower than residential and reruns are expensive. if you’re running influencer discovery operations at this scale, the multi-account management patterns at multiaccountops.com/blog/ apply here too, particularly around session isolation.

1000x (hundreds of thousands/day): you’re running a fleet. containerize each scraper in Docker, deploy to Kubernetes or Nomad, and use a dedicated proxy management layer rather than direct provider endpoints. at this scale, proxy cost is your biggest line item. negotiate volume pricing directly with your provider; most offer 20-40% discounts above $500/month. you also need monitoring: track success rate per IP subnet, alert when it drops below 80%, and auto-blacklist flagged ranges.
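
per-subnet tracking can start as an in-memory counter before it becomes a metrics pipeline. the sketch below groups IPv4 exit addresses into /24 ranges and flags any range under the 80% threshold. the class name, the /24 grouping, and the 20-sample minimum are all assumptions.

```python
import ipaddress
from collections import defaultdict


class SubnetMonitor:
    """Track request success rate per subnet and report ranges below a threshold."""

    def __init__(self, threshold: float = 0.8, min_samples: int = 20, prefix: int = 24):
        self.threshold = threshold
        self.min_samples = min_samples
        self.prefix = prefix
        self._stats = defaultdict(lambda: [0, 0])  # subnet -> [successes, total]

    def record(self, ip: str, ok: bool) -> str:
        """Record one request outcome and return the subnet it was counted under."""
        subnet = str(ipaddress.ip_network(f"{ip}/{self.prefix}", strict=False))
        counts = self._stats[subnet]
        counts[0] += int(ok)
        counts[1] += 1
        return subnet

    def flagged(self) -> list[str]:
        """Subnets with enough samples whose success rate is below the threshold."""
        return sorted(
            subnet
            for subnet, (successes, total) in self._stats.items()
            if total >= self.min_samples and successes / total < self.threshold
        )
```

the minimum sample count keeps a single failed request from blacklisting a range that has barely been used.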

at all scales, build in idempotency. TikTok profile data changes frequently, so your pipeline should handle re-scraping the same username without creating duplicate records. a unique index on video_id in your database handles this cleanly.
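
the upsert pattern behind that unique index is sketched below with sqlite3 from the standard library so it runs without a server. that is a swap for illustration; the guide itself uses Postgres, which accepts the same ON CONFLICT (video_id) DO UPDATE form. the table layout is a trimmed, hypothetical subset of the parsed fields.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    # PRIMARY KEY gives video_id the unique index that ON CONFLICT relies on
    conn.execute(
        "CREATE TABLE IF NOT EXISTS videos ("
        "video_id TEXT PRIMARY KEY, author TEXT, play_count INTEGER)"
    )


def upsert_videos(conn: sqlite3.Connection, videos: list) -> None:
    """Insert new videos; refresh author and play_count for ones already stored."""
    conn.executemany(
        "INSERT INTO videos (video_id, author, play_count) "
        "VALUES (:video_id, :author, :play_count) "
        "ON CONFLICT(video_id) DO UPDATE SET "
        "author = excluded.author, play_count = excluded.play_count",
        videos,
    )
    conn.commit()
```

re-scraping a username then updates the stats in place instead of raising a constraint error or piling up duplicates.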


Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.