
How to scrape YouTube at scale in 2026 with proxies that work

YouTube is the second-largest search engine on the planet and the largest video platform, with over 800 hours of video uploaded every minute according to Google’s own published figures. that makes it one of the richest data sources for SEO research, competitor analysis, ad intelligence, trend detection, and content gap analysis. scraping it at scale, though, is genuinely hard. Google has been tightening bot detection on YouTube since late 2023, and by 2025 it had rolled out more aggressive fingerprinting that breaks naive scrapers within minutes.

this guide is for operators: people running data pipelines, research tools, or content intelligence products who need consistent YouTube data, not one-off queries. if you want a few video titles you can use the YouTube Data API v3 with a free API key. if you need volume, fresh data outside API quotas, comment sentiment at scale, or thumbnail metadata that the API doesn’t expose cleanly, you need a scraper backed by rotating residential proxies.

by the end of this tutorial you will have a working Python scraper that rotates proxies, handles rate limits, and pulls structured data from YouTube search results, channel pages, and video metadata. it won’t be undetectable forever, but it will be production-stable and maintainable. i run a version of this in my own SEO monitoring stack.

what you need

  • Python 3.11+ with pip
  • yt-dlp (free, open source, actively maintained, handles most YouTube extraction)
  • Scrapy 2.11+ for crawling channel and search pages at scale
  • Rotating residential proxies, minimum 10GB/month to start. datacenter proxies will get blocked within hours on YouTube. i use ProxyScraping’s residential pool at proxyscraping.com, check current pricing there as it changes. budget roughly $50-150/month depending on volume.
  • A proxy rotation middleware or a sticky session endpoint
  • PostgreSQL or a parquet pipeline for storage. SQLite is fine for prototyping under 1M rows.
  • A server or VPS with Python installed. a $6/month Hetzner CX22 or a $12 DigitalOcean Droplet is enough for a single-node pipeline.
  • Basic familiarity with proxies: if this is your first proxy setup, read our residential proxy rotation guide first.

estimated monthly infrastructure cost to start: $60-200 depending on data volume and proxy provider tier.

step by step

step 1: install your dependencies

pip install yt-dlp scrapy httpx playwright
playwright install chromium

yt-dlp handles video metadata extraction. scrapy handles search and channel crawling. playwright is for JavaScript-rendered pages where scrapy falls short.

if it breaks: if playwright install fails, run playwright install-deps first on Ubuntu/Debian.
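
a quick sanity check before moving on: the sketch below reports which of the step 1 packages fail to import. the helper name missing_modules is my own, and note that yt-dlp imports as yt_dlp.

```python
import importlib.util

# import names for the packages installed in step 1 (yt-dlp imports as yt_dlp)
REQUIRED = ["yt_dlp", "scrapy", "httpx", "playwright"]

def missing_modules(names):
    """Return the subset of module names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("all good" if not missing else f"missing: {', '.join(missing)}")
```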

step 2: get your proxy endpoint

sign up for a residential proxy plan at proxyscraping.com. you want a rotating endpoint, not static IPs, for YouTube. grab your proxy URL from the dashboard. it will look like:

http://user:[email protected]:9800

test it immediately:

curl -x "http://user:[email protected]:9800" https://httpbin.org/ip

you should get a residential IP in return, not your server IP. if you get your own IP, the proxy isn’t working, check your credentials.

if it breaks: YouTube will reject proxies from obvious datacenter ranges (AWS, GCP, Azure). if your curl returns a clean residential IP but YouTube still blocks you, try switching to a different country pool in your dashboard. US and UK residential IPs tend to have higher trust scores on Google properties.
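
the same check can live in Python so it runs at the start of every job. this is a minimal sketch using only the standard library; proxy.example.com and the helper names are placeholders of mine, not your provider's real endpoint. credentials are percent-encoded because a raw @ or : in a password breaks the proxy URL.

```python
import json
import urllib.request
from urllib.parse import quote

def build_proxy_url(user, password, host, port):
    """Build an http proxy URL, percent-encoding the credentials."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"

def fetch_exit_ip(proxy_url, timeout=15):
    """Ask httpbin which IP it sees when the request goes through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open("https://httpbin.org/ip", timeout=timeout) as resp:
        return json.load(resp)["origin"]

def proxy_is_masking(server_ip, exit_ip):
    """True when the exit IP is present and differs from the server's own IP."""
    return bool(exit_ip) and exit_ip.strip() != server_ip.strip()
```

a job could call fetch_exit_ip(build_proxy_url("user", "pass", "proxy.example.com", 9800)) once at startup and abort if proxy_is_masking returns False for your server's known IP.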

step 3: extract video metadata with yt-dlp

yt-dlp is the most reliable way to pull structured metadata from individual videos and playlists. it handles YouTube’s internal API calls without you having to reverse-engineer them yourself.

import yt_dlp

proxy = "http://user:[email protected]:9800"

ydl_opts = {
    'quiet': True,
    'skip_download': True,
    'proxy': proxy,
    'extractor_args': {'youtube': {'skip': ['dash', 'hls']}},
}

def get_video_metadata(url):
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            'id': info.get('id'),
            'title': info.get('title'),
            'view_count': info.get('view_count'),
            'like_count': info.get('like_count'),
            'comment_count': info.get('comment_count'),
            'upload_date': info.get('upload_date'),
            'channel': info.get('channel'),
            'description': info.get('description'),
            'tags': info.get('tags'),
            'duration': info.get('duration'),
        }

if __name__ == '__main__':
    data = get_video_metadata('https://www.youtube.com/watch?v=dQw4w9WgXcQ')
    print(data)

expected output: a dict with clean structured metadata. duration in seconds, upload_date as YYYYMMDD string.

if it breaks: yt-dlp updates frequently. if you get extraction errors, upgrade it with pip install -U yt-dlp (yt-dlp -U only self-updates the standalone binary, not a pip install). Google changes their internal API format with no warning.
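
since upload_date arrives as a YYYYMMDD string and duration as seconds, it helps to normalize both before storage. a small sketch; normalize_metadata and the duration_hms key are my own choices, not part of yt-dlp.

```python
from datetime import datetime, timedelta

def normalize_metadata(info):
    """Return a copy with upload_date as a date and a readable duration added."""
    out = dict(info)
    raw_date = info.get("upload_date")
    # upload_date is expected as 'YYYYMMDD'; keep None when it is missing
    out["upload_date"] = (
        datetime.strptime(raw_date, "%Y%m%d").date() if raw_date else None
    )
    seconds = info.get("duration")
    # str(timedelta) renders as H:MM:SS
    out["duration_hms"] = (
        str(timedelta(seconds=int(seconds))) if seconds is not None else None
    )
    return out
```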

step 4: scrape YouTube search results

yt-dlp can also iterate search results. this is useful for tracking keywords, competitor channels, or trend monitoring.

import yt_dlp

proxy = "http://user:[email protected]:9800"

def search_youtube(query, max_results=50):
    ydl_opts = {
        'quiet': True,
        'skip_download': True,
        'proxy': proxy,
        'extract_flat': True,
        'playlist_items': f'1:{max_results}',
    }
    search_url = f'ytsearch{max_results}:{query}'
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        result = ydl.extract_info(search_url, download=False)
        return result.get('entries', [])

results = search_youtube('python tutorial 2026', max_results=20)
for r in results:
    print(r.get('id'), r.get('title'), r.get('view_count'))

if it breaks: if you get HTTP 429 errors, you’re hitting rate limits. add a sleep between requests and rotate your proxy session. more on this in step 6.
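
for keyword tracking you usually want each result's position, not just the list. a sketch that turns the flat entries into rank rows; rank_rows and the row keys are hypothetical names, and it assumes entries arrive in the order YouTube returned them.

```python
from datetime import datetime, timezone

def rank_rows(query, entries, checked_at=None):
    """Turn flat search entries into one row per result with a 1-based rank."""
    checked_at = checked_at or datetime.now(timezone.utc).isoformat()
    rows = []
    for position, entry in enumerate(entries, start=1):
        if not entry or not entry.get("id"):
            continue  # skip empty or malformed entries but keep positions intact
        rows.append({
            "query": query,
            "rank": position,
            "video_id": entry["id"],
            "title": entry.get("title"),
            "checked_at": checked_at,
        })
    return rows
```

storing one row per (query, video_id, checked_at) lets you chart how a video moves up or down for a keyword between runs.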

step 5: crawl channel pages with Scrapy

for bulk channel data (all uploads, subscriber estimates, about page metadata) use Scrapy with a proxy middleware.

create a Scrapy spider:

import scrapy

# Scrapy has no HTTP_PROXY setting. the built-in HttpProxyMiddleware
# (enabled by default) reads the proxy from request.meta['proxy'].
PROXY = 'http://user:[email protected]:9800'

class YouTubeChannelSpider(scrapy.Spider):
    name = 'youtube_channel'

    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'RANDOMIZE_DOWNLOAD_DELAY': True,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'AUTOTHROTTLE_ENABLED': True,
    }

    def start_requests(self):
        channels = [
            'https://www.youtube.com/@mkbhd/videos',
            'https://www.youtube.com/@LinusTechTips/videos',
        ]
        for url in channels:
            # route every request through the rotating proxy
            yield scrapy.Request(url, callback=self.parse, meta={'proxy': PROXY})

    def parse(self, response):
        # YouTube renders video grids in JSON inside script tags
        # use yt-dlp for the actual data, Scrapy for discovery
        yield {'url': response.url, 'status': response.status}

for JavaScript-rendered content (the actual video grid), fall back to playwright or parse the initial page state JSON embedded in the HTML.

if it breaks: YouTube’s channel pages are heavily JS-rendered. if Scrapy returns empty content, switch to playwright for that spider and use scrapy-playwright. install it with pip install scrapy-playwright.
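
parsing the embedded page state can be sketched as follows. this assumes the page assigns its state to a variable named ytInitialData and that video entries carry a videoId key; both are observations about YouTube's markup that can change without notice, and the function names are mine.

```python
import json
import re

# assumption: the page embeds its state as `var ytInitialData = {...};`
_MARKER = re.compile(r"ytInitialData\s*=\s*")

def extract_initial_data(html):
    """Return the embedded page-state dict, or None if it cannot be parsed."""
    match = _MARKER.search(html)
    if not match:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        data, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return data

def collect_video_ids(node, found=None):
    """Walk nested dicts/lists and collect each 'videoId' value once, in order."""
    found = [] if found is None else found
    if isinstance(node, dict):
        vid = node.get("videoId")
        if isinstance(vid, str) and vid not in found:
            found.append(vid)
        for value in node.values():
            collect_video_ids(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_video_ids(item, found)
    return found
```

inside the spider's parse method you could call collect_video_ids(extract_initial_data(response.text)) and hand the resulting IDs to yt-dlp for the actual metadata.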

step 6: handle rate limiting and fingerprinting

YouTube uses several signals to detect bots: request frequency, missing headers, consistent timing patterns, and IP reputation. here is a minimal set of mitigations.

import time
import random

def scrape_with_backoff(func, *args, retries=3):
    for attempt in range(retries):
        try:
            result = func(*args)
            time.sleep(random.uniform(1.5, 4.0))
            return result
        except Exception as e:
            if '429' in str(e) or 'rate' in str(e).lower():
                wait = (2 ** attempt) * 5 + random.uniform(0, 3)
                print(f"rate limited, waiting {wait:.1f}s")
                time.sleep(wait)
            else:
                raise
    raise Exception(f"failed after {retries} attempts")

rotate your user agent strings. a static UA is a fast path to a block. use a list of real browser UA strings from actual recent browser releases.

if it breaks: if you’re still getting blocked after rotating proxies and UAs, you may need a headless browser approach. playwright with residential proxies is slower but more reliable for detection-sensitive targets. see antidetectreview.org/blog/ for browser fingerprint hardening techniques if you go that route.
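
user agent rotation can be as simple as the sketch below. the three strings are examples in the format real browsers send and will age quickly, so treat the list as something you refresh yourself; pick_headers is a hypothetical helper name.

```python
import random

# example strings only; refresh these from current browser releases
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def pick_headers(rng=random):
    """Return request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
```

in Scrapy the result can be passed as scrapy.Request(url, headers=pick_headers()); in yt-dlp it can go under the 'http_headers' key of ydl_opts.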

step 7: store and pipeline your data

write results to PostgreSQL with psycopg2 or to parquet files for analytics workloads:

import psycopg2
import json

def save_to_postgres(conn, video_data):
    cur = conn.cursor()
    cur.execute("""
        INSERT INTO videos (video_id, title, view_count, upload_date, channel, tags, scraped_at)
        VALUES (%s, %s, %s, %s, %s, %s, NOW())
        ON CONFLICT (video_id) DO UPDATE
        SET view_count = EXCLUDED.view_count,
            scraped_at = EXCLUDED.scraped_at
    """, (
        video_data['id'],
        video_data['title'],
        video_data['view_count'],
        video_data['upload_date'],
        video_data['channel'],
        json.dumps(video_data.get('tags', [])),
    ))
    conn.commit()

if it breaks: add ON CONFLICT upsert logic from the start. you will re-scrape the same video IDs over time for fresh view count data, and duplicates will corrupt your analytics.
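
the upsert only works if video_id carries a unique constraint. the article does not show the table definition, so the schema below is an assumption. it uses the standard-library sqlite3 module (the prototyping option from the requirements list) so it runs without a database server, and it needs SQLite 3.24 or newer for ON CONFLICT support.

```python
import json
import sqlite3

# assumed schema; column names mirror the INSERT used in this step
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id    TEXT PRIMARY KEY,   -- the ON CONFLICT target must be unique
    title       TEXT,
    view_count  INTEGER,
    upload_date TEXT,
    channel     TEXT,
    tags        TEXT,               -- JSON-encoded list
    scraped_at  TEXT
)
"""

UPSERT = """
INSERT INTO videos (video_id, title, view_count, upload_date, channel, tags, scraped_at)
VALUES (?, ?, ?, ?, ?, ?, datetime('now'))
ON CONFLICT (video_id) DO UPDATE
SET view_count = excluded.view_count,
    scraped_at = excluded.scraped_at
"""

def save_to_sqlite(conn, video):
    """Insert a video row, or refresh view_count and scraped_at if it exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (
            video["id"], video.get("title"), video.get("view_count"),
            video.get("upload_date"), video.get("channel"),
            json.dumps(video.get("tags") or []),
        ))
```

for PostgreSQL the equivalent table would declare video_id as the primary key too, with the placeholders written as %s and NOW() in place of datetime('now').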

step 8: schedule and monitor

use cron or a task queue (Celery, APScheduler) to run your scraper on a schedule. log failures to a file or monitoring service.

# scrape top 100 results for 20 keywords every 6 hours
0 */6 * * * /home/user/venv/bin/python /home/user/youtube_scraper/run.py >> /var/log/yt_scraper.log 2>&1

check your proxy dashboard for bandwidth usage daily in the first week. a runaway spider can burn through your proxy quota in hours.

if it breaks: add a hard cap on requests per run in your code. never run an untested spider without a CLOSESPIDER_ITEMCOUNT or equivalent limit.
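
for code paths outside Scrapy, a hard cap can be a small counter that every request has to pass through. a sketch with hypothetical names (RequestBudget, BudgetExceeded); the cap value is whatever your proxy plan can absorb in one run.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run tries to exceed its request cap."""

class RequestBudget:
    """Count requests in a run and refuse to go past a hard cap."""

    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.used = 0

    def spend(self, n=1):
        # check before counting so a refused request is never recorded
        if self.used + n > self.max_requests:
            raise BudgetExceeded(f"cap of {self.max_requests} requests reached")
        self.used += n

    @property
    def remaining(self):
        return self.max_requests - self.used
```

call budget.spend() immediately before each yt-dlp or HTTP call and let BudgetExceeded end the run; the cron job then fails loudly in the log instead of draining the quota.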

common pitfalls

using datacenter proxies. this is the number one mistake. datacenter IPs (AWS, OVH, Hetzner, Vultr) are heavily flagged on YouTube. you will get CAPTCHA or soft-blocks within 15-30 minutes. residential proxies are non-negotiable for sustained scraping.

ignoring YouTube’s Terms of Service. scraping is in a legal gray zone and YouTube’s ToS explicitly restricts automated access. this is not legal advice, consult a lawyer for your specific use case, but understand what you’re building and for whom.

not respecting robots.txt. YouTube’s robots.txt disallows large sections of the site. know what you’re crawling and make a considered decision, don’t be naive about it.

scraping at uniform intervals. fixed 2-second delays are just as detectable as 0-second delays. randomize your timing. human browsing is irregular.

skipping deduplication. if you’re running periodic scrapes, the same video IDs will appear in multiple search result pages. deduplicate on video ID before writing to your database or you’ll waste storage and skew your analytics.
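
deduplicating on video ID before the write can be sketched in a few lines. this assumes each entry is a dict with an 'id' key, as in the yt-dlp results from step 4; entries without an ID are dropped.

```python
def dedupe_by_video_id(entries):
    """Keep the first occurrence of each video ID, preserving order."""
    seen = set()
    unique = []
    for entry in entries:
        vid = entry.get("id")
        if vid is None or vid in seen:
            continue  # no ID or already seen in this batch
        seen.add(vid)
        unique.append(entry)
    return unique
```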

scaling this

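10x (hundreds of requests/hour): increase your proxy plan bandwidth, tune concurrency in Scrapy, and use a proper task queue like Celery with Redis. note that CONCURRENT_REQUESTS = 16 is already Scrapy’s default; since every request targets youtube.com, the limit that actually binds is CONCURRENT_REQUESTS_PER_DOMAIN (default 8). one server handles this.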

100x (thousands of requests/hour): split your keyword/channel list across multiple worker processes or servers. use a central job queue. add proxy pool diversity, ideally proxies from multiple providers to reduce single-provider failure risk. monitor per-IP block rates and retire flagged IPs automatically.
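
per-IP block-rate monitoring can start as an in-memory counter like the sketch below. the class name, the 30% threshold, and the 20-request minimum are illustrative assumptions, and the proxy key can be an exit IP or a sticky session ID depending on what your provider exposes.

```python
from collections import defaultdict

class BlockRateTracker:
    """Track blocked vs total requests per proxy key and flag the bad ones."""

    def __init__(self, max_block_rate=0.3, min_requests=20):
        self.max_block_rate = max_block_rate
        self.min_requests = min_requests
        self.stats = defaultdict(lambda: {"total": 0, "blocked": 0})

    def record(self, proxy_key, blocked):
        s = self.stats[proxy_key]
        s["total"] += 1
        s["blocked"] += int(bool(blocked))

    def block_rate(self, proxy_key):
        s = self.stats[proxy_key]
        return s["blocked"] / s["total"] if s["total"] else 0.0

    def should_retire(self, proxy_key):
        # wait for enough samples so one unlucky request does not retire an IP
        s = self.stats[proxy_key]
        return (s["total"] >= self.min_requests
                and self.block_rate(proxy_key) > self.max_block_rate)
```

at this scale the counters would more likely live in Redis or your metrics system so all workers share them, but the retire rule stays the same.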

1000x (tens of thousands of requests/hour): at this scale you need a distributed crawler, proper observability (request success rates, proxy block rates by country, latency), and a data warehouse rather than a single Postgres instance. budget $500-2000+/month on proxies alone at this volume. consider whether the YouTube Data API can cover any portion of your use case to reduce scraping load; its free quota is 10,000 units/day. for multi-account YouTube operations at scale, multiaccountops.com/blog/ covers the account management layer you’ll need.
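
to judge how far the 10,000 units/day go, here is some rough arithmetic. the per-call costs are assumptions based on Google's published quota table as i understand it (100 units per search.list call, 1 unit per videos.list call, up to 50 video IDs per videos.list call); check the current quota documentation before relying on them.

```python
DAILY_QUOTA = 10_000       # free quota, units/day
SEARCH_COST = 100          # assumption: cost of one search.list call
VIDEOS_COST = 1            # assumption: cost of one videos.list call
IDS_PER_VIDEOS_CALL = 50   # assumption: max video IDs per videos.list call

def max_searches_per_day(quota=DAILY_QUOTA):
    """How many keyword searches the quota covers if spent only on search."""
    return quota // SEARCH_COST

def videos_refreshable_per_day(quota=DAILY_QUOTA):
    """How many video stat refreshes the quota covers if spent only on videos.list."""
    return (quota // VIDEOS_COST) * IDS_PER_VIDEOS_CALL

if __name__ == "__main__":
    print(max_searches_per_day(), videos_refreshable_per_day())
```

under those assumptions the quota covers about 100 searches or up to 500,000 video refreshes a day, which suggests the API suits refreshing stats on known video IDs far better than keyword discovery.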

Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.