How to scrape Instagram at scale in 2026 with proxies that work
Instagram is one of the hardest platforms to scrape at any serious volume. meta aggressively fingerprints requests, rotates challenges, and will ban IPs within minutes if you’re not careful. i’ve spent time building and breaking Instagram scrapers, and what worked in 2024 often stopped working six months later, once meta quietly tightened the rate limits or added a new device fingerprinting layer.
this guide is written for operators who need to pull public data at scale: profile bios, follower counts, post metadata, hashtag feeds, that kind of thing. if you’re doing market research, building a social listening tool, or monitoring competitor accounts, this is the workflow i’d hand to someone on my team. it’s not for scraping private accounts, bypassing login walls you don’t have permission to access, or anything that involves extracting content you don’t have a right to collect.
by the end you’ll have a working Python setup using rotating residential proxies, rate-limiting logic, and a session management pattern that survives at least a few thousand requests before you need to rotate credentials.
what you need
- python 3.11+ with requests, playwright, and httpx installed
- rotating residential proxies: datacenter IPs get blocked almost instantly on Instagram. i use proxyscraping.com’s residential pool; at time of writing their rotating residential plan starts around $3/GB. you need a pool with at least a few thousand IPs across multiple countries
- 2-3 Instagram accounts for session-based scraping (more on this below). create these manually or use aged accounts. do not buy bulk-created accounts, they get flagged immediately
- a target list: profile usernames, hashtags, or post URLs you’re collecting from
- a postgres or sqlite database to store results and track which targets you’ve already hit
- roughly $20-50/month to get started depending on volume; proxies are the main cost
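a rough way to sanity-check that budget before you buy bandwidth. this is only a sketch: the function name is made up for illustration, the average response size is an assumption you should measure yourself, and it ignores retries, TLS overhead, and any browser traffic from session setup.

```python
def estimate_monthly_proxy_cost(requests_per_day, avg_response_kb, price_per_gb, days=30):
    """Rough bandwidth cost for a per-GB proxy plan.

    avg_response_kb is an assumption: measure it on your own traffic.
    """
    total_kb = requests_per_day * avg_response_kb * days
    total_gb = total_kb / (1024 * 1024)
    return total_gb * price_per_gb


if __name__ == "__main__":
    # 5,000 requests/day at an assumed 50 KB each, on a $3/GB plan
    print(round(estimate_monthly_proxy_cost(5000, 50, 3.0), 2))  # 21.46
```

if your measured responses are several times larger than the assumed 50 KB, the bill scales linearly with them, so it is worth logging response sizes during your first test runs.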
optional but useful: an anti-detect browser for initial session setup. the team at antidetectreview.org/blog/ covers which browsers are holding up in 2026 if you want a current comparison before you commit to one.
step by step
step 1: understand what instagram actually allows
before writing a single line of code, read meta’s graph api documentation for instagram. meta provides an official API for business accounts. if your use case fits within those endpoints, use the API. the rate limits are tight (200 calls per hour per token for many endpoints) but it’s stable and won’t get your IPs banned.
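to see how quickly that ceiling bites, here is a small back-of-envelope sketch. the 200 calls per hour default comes from the figure above; the function name and the one-call-per-target assumption are mine, and real endpoints may need more than one call per profile.

```python
import math


def hours_to_cover(targets, calls_per_target=1, calls_per_hour=200, tokens=1):
    """Whole hours needed to work through a target list under an hourly call cap."""
    total_calls = targets * calls_per_target
    return math.ceil(total_calls / (calls_per_hour * tokens))


if __name__ == "__main__":
    print(hours_to_cover(5000))            # 25
    print(hours_to_cover(5000, tokens=5))  # 5
```

a 5,000-profile list takes more than a day on a single token under these assumptions, which is usually the point where people start looking at alternatives.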
most operators hit the API ceiling fast and fall back to scraping the public web interface. if you go that route, be clear-eyed: instagram’s terms of service prohibit automated data collection without permission. this is not legal advice. consult a lawyer if you’re building a commercial product on scraped instagram data.
if it breaks: check whether your use case qualifies for the official API first. the headache of scraping often isn’t worth it if the API covers what you need.
step 2: set up your proxy connection
get your proxy credentials from proxyscraping.com and test the connection before wiring anything else together.
import requests

# placeholders: substitute the username, password, and gateway host from your provider dashboard
proxies = {
    "http": "http://USER:PASSWORD@PROXY_HOST:10001",
    "https": "http://USER:PASSWORD@PROXY_HOST:10001",
}
r = requests.get("https://api.ipify.org", proxies=proxies, timeout=10)
print(r.text)  # should print a residential IP, not your own
run this a few times. you should see different IPs each time if you’re on a rotating endpoint. if you keep seeing the same IP, check whether you’ve enabled rotation in your dashboard settings.
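if you’d rather not eyeball it, the repeated check can be scripted. a minimal sketch, assuming you pass in your own `fetch_ip` callable (for example a lambda wrapping the `requests.get` call above); both function names are hypothetical.

```python
from collections import Counter


def summarize_exit_ips(fetch_ip, attempts=5):
    """Call fetch_ip() `attempts` times and count how often each exit IP appears."""
    seen = Counter()
    for _ in range(attempts):
        seen[fetch_ip()] += 1
    return seen


def looks_rotating(seen):
    """True if more than one distinct exit IP was observed."""
    return len(seen) > 1


if __name__ == "__main__":
    # stand-in for a real proxy call, so the example runs offline
    fake_ips = iter(["203.0.113.7", "198.51.100.4", "203.0.113.7"])
    seen = summarize_exit_ips(lambda: next(fake_ips), attempts=3)
    print(dict(seen), looks_rotating(seen))
```

injecting the fetcher keeps the check testable without a network; in real use you would pass something like `lambda: requests.get("https://api.ipify.org", proxies=proxies, timeout=10).text`.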
if it breaks: if the connection times out, check that your firewall isn’t blocking the proxy port. try port 10002 or 10003 as alternatives.
step 3: create and warm up instagram sessions
instagram trusts sessions that look like they came from a real browser on a real device. logging in through a raw [requests](https://requests.readthedocs.io/) session from a fresh IP will trigger a checkpoint challenge almost every time.
the cleanest approach i’ve found: use playwright to do the initial login manually or semi-manually, then export the cookies, and use those cookies in your [requests](https://requests.readthedocs.io/) scraping session.
pip install playwright
playwright install chromium
from playwright.sync_api import sync_playwright
import json

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.instagram.com/accounts/login/")
    # log in manually here, solve any 2FA
    input("press enter once logged in...")
    cookies = context.cookies()
    with open("session_cookies.json", "w") as f:
        json.dump(cookies, f)
    browser.close()
do this once per account. then load these cookies into your scraping requests. warm up each session by visiting a few profiles manually before running automation.
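loading the exported file back in might look like the sketch below. it assumes the JSON is the list of cookie dicts that playwright’s `context.cookies()` returns, where `expires` is a unix timestamp and `-1` marks a session cookie; the helper names are hypothetical.

```python
import json
import time


def cookies_to_dict(cookies, now):
    """Flatten playwright-style cookie dicts to {name: value}, dropping expired ones."""
    live = {}
    for c in cookies:
        expires = c.get("expires", -1)
        # -1 means a session cookie with no fixed expiry; keep those
        if expires != -1 and expires <= now:
            continue
        live[c["name"]] = c["value"]
    return live


def load_cookie_dict(path, now=None):
    """Read a cookie file written by the export step and return the live cookies."""
    now = time.time() if now is None else now
    with open(path) as f:
        return cookies_to_dict(json.load(f), now)


if __name__ == "__main__":
    sample = [
        {"name": "sessionid", "value": "abc", "expires": -1},
        {"name": "old", "value": "x", "expires": 100},
    ]
    print(cookies_to_dict(sample, now=200))  # {'sessionid': 'abc'}
```

the resulting dict can be passed to `session.cookies.update(...)` on a requests session. if the dict comes back empty, the export is stale and it’s time to log in again.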
if it breaks: instagram often prompts for phone verification on new sessions. have a real SIM you control for each account. SMS verification services work short-term but get flagged over time.
step 4: build your scraper with rate limiting
instagram’s public endpoints return JSON if you pass the right headers. the main ones most scrapers hit are the profile endpoint and the hashtag feed.
import random
import time

import requests

def get_profile(username, cookies, proxies):
    headers = {
        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15",
        "Accept": "*/*",
        "X-IG-App-ID": "936619743392459",
        "X-Requested-With": "XMLHttpRequest",
    }
    session = requests.Session()
    # cookies is the list exported by playwright in step 3
    session.cookies.update({c["name"]: c["value"] for c in cookies})
    url = f"https://www.instagram.com/api/v1/users/web_profile_info/?username={username}"
    r = session.get(url, headers=headers, proxies=proxies, timeout=15)
    if r.status_code == 429:
        print("rate limited, sleeping 60s")
        time.sleep(60)
        return None
    if r.status_code == 200:
        return r.json()
    # any other status (401, 403, 404, ...) is treated as a miss
    return None

def scrape_profiles(usernames, cookies, proxies):
    results = []
    for username in usernames:
        data = get_profile(username, cookies, proxies)
        if data:
            results.append(data)
        # random delay between 3 and 8 seconds
        time.sleep(random.uniform(3, 8))
    return results
the X-IG-App-ID header is required for the v1 API. the value above is instagram’s own web app ID, which is publicly visible in instagram’s web client source code.
if it breaks: if you get 401 responses, your session cookies have expired. regenerate them using the playwright flow in step 3.
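the fixed 60-second sleep on a 429 is the simplest thing that works. a common refinement is exponential backoff with jitter, sketched here under assumptions: `fetch` is any callable you supply that returns `(status_code, payload)`, and the base, cap, and retry count are illustrative defaults, not values instagram publishes.

```python
import random
import time


def backoff_delays(base=60.0, factor=2.0, max_delay=900.0, retries=4, jitter=0.25, rng=random):
    """Yield one wait per retry: base, base*factor, ... capped at max_delay, +/- jitter."""
    delay = base
    for _ in range(retries):
        spread = delay * jitter
        yield min(max_delay, delay + rng.uniform(-spread, spread))
        delay = min(max_delay, delay * factor)


def fetch_with_backoff(fetch, sleep=time.sleep, **backoff_kwargs):
    """Call fetch(); on a 429, wait and retry until the delays run out."""
    status, payload = fetch()
    for wait in backoff_delays(**backoff_kwargs):
        if status != 429:
            break
        sleep(wait)
        status, payload = fetch()
    return payload if status == 200 else None


if __name__ == "__main__":
    # offline demo: two rate-limit responses, then success; sleeps are recorded, not performed
    responses = iter([(429, None), (429, None), (200, {"ok": True})])
    waits = []
    print(fetch_with_backoff(lambda: next(responses), sleep=waits.append, jitter=0.0), waits)
```

passing `sleep` in as a parameter is only there so the logic can be exercised without actually waiting; in production you would leave it at the default.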
step 5: implement proxy rotation and session rotation together
one mistake i see constantly: people rotate proxies but keep the same session cookie. instagram ties sessions to device fingerprints and behavioral patterns, not just IPs. you need to rotate both on a schedule.
import itertools
import json

# placeholders: substitute the credentials and gateway host from your provider dashboard
proxy_list = [
    "http://USER:PASSWORD@PROXY_HOST:10001",
    "http://USER:PASSWORD@PROXY_HOST:10002",
    "http://USER:PASSWORD@PROXY_HOST:10003",
]

def load_cookies(path):
    # close each file after reading instead of leaking the handle
    with open(path) as f:
        return json.load(f)

session_cookies_list = [load_cookies(f"session_cookies_{i}.json") for i in (1, 2, 3)]
proxy_cycle = itertools.cycle(proxy_list)
session_cycle = itertools.cycle(session_cookies_list)

def get_next_config():
    # advance both cycles together so the proxy and the session change at the same time
    proxy = next(proxy_cycle)
    cookies = next(session_cycle)
    return {"http": proxy, "https": proxy}, cookies
rotate every 50-100 requests per session. if you’re running at higher volume, rotate every 20-30.
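one way to sketch that request-count rotation. `RotatingConfig` is a hypothetical helper, not part of any library; it picks a random budget in the configured range each time it rotates, so the switch doesn’t always land on the same request number.

```python
import itertools
import random


class RotatingConfig:
    """Hand out a (proxies, cookies) pair and swap both after a random request budget."""

    def __init__(self, proxy_list, cookie_sets, min_requests=50, max_requests=100, rng=random):
        self._proxies = itertools.cycle(proxy_list)
        self._cookies = itertools.cycle(cookie_sets)
        self._min = min_requests
        self._max = max_requests
        self._rng = rng
        self._remaining = 0
        self._current = None

    def get(self):
        # rotate proxy and session together once the current budget is spent
        if self._remaining <= 0:
            proxy = next(self._proxies)
            self._current = ({"http": proxy, "https": proxy}, next(self._cookies))
            self._remaining = self._rng.randint(self._min, self._max)
        self._remaining -= 1
        return self._current


if __name__ == "__main__":
    rotator = RotatingConfig(["p1", "p2"], ["cookies1", "cookies2"], min_requests=2, max_requests=2)
    print([rotator.get()[1] for _ in range(5)])
```

call `get()` once per request and pass the returned pair to your fetch function. for the higher-volume case, construct it with `min_requests=20, max_requests=30`.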
if it breaks: if all sessions hit checkpoints simultaneously, you likely triggered a pattern-based block. pause all scraping for 2-4 hours before resuming.
step 6: store and deduplicate results
import sqlite3

conn = sqlite3.connect("instagram_data.db")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE IF NOT EXISTS profiles (
        username TEXT PRIMARY KEY,
        follower_count INTEGER,
        bio TEXT,
        scraped_at TEXT
    )
""")

def save_profile(username, data):
    user_data = data.get("data", {}).get("user", {})
    cursor.execute("""
        INSERT OR REPLACE INTO profiles (username, follower_count, bio, scraped_at)
        VALUES (?, ?, ?, datetime('now'))
    """, (
        username,
        user_data.get("edge_followed_by", {}).get("count"),
        user_data.get("biography"),
    ))
    conn.commit()
track which usernames you’ve already scraped. re-scraping the same profiles wastes proxy bandwidth and accelerates session burnout.
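a minimal sketch of that tracking, assuming the `profiles` table from this step with `scraped_at` stored via `datetime('now')`. the function name and the 7-day freshness window are my assumptions; pick whatever re-scrape interval fits your data.

```python
import sqlite3


def pending_usernames(conn, usernames, max_age_days=7):
    """Return the usernames with no profiles row newer than max_age_days, in input order."""
    fresh = {
        row[0]
        for row in conn.execute(
            "SELECT username FROM profiles WHERE scraped_at >= datetime('now', ?)",
            (f"-{int(max_age_days)} days",),
        )
    }
    return [u for u in usernames if u not in fresh]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profiles (username TEXT PRIMARY KEY, follower_count INTEGER, "
        "bio TEXT, scraped_at TEXT)"
    )
    conn.execute("INSERT INTO profiles VALUES ('alice', 10, '', datetime('now'))")
    conn.execute("INSERT INTO profiles VALUES ('bob', 20, '', datetime('now', '-30 days'))")
    print(pending_usernames(conn, ["alice", "bob", "carol"]))  # ['bob', 'carol']
```

run your target list through this before each batch so only stale or never-seen usernames reach the scraper.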
if it breaks: if your database file grows unexpectedly large, check for duplicate rows. the INSERT OR REPLACE pattern above handles this, but make sure your schema has the primary key set correctly.
step 7: monitor and handle blocks
build a simple block detection layer that watches response codes and pauses when things go wrong.
import time

class BlockDetector:
    def __init__(self, threshold=5):
        self.consecutive_failures = 0
        self.threshold = threshold

    def record(self, success: bool):
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            # pause once failures pile up back to back, then start counting again
            if self.consecutive_failures >= self.threshold:
                print("block detected, sleeping 10 minutes")
                time.sleep(600)
                self.consecutive_failures = 0
log every response code to a file or your database. you want to be able to spot patterns: if failures spike at a specific time of day, that’s often meta running a detection job.
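here is one way that logging could look with sqlite, including a query that groups failures by hour of day. the `response_log` table and all three function names are hypothetical, and "failure" is assumed to mean any status other than 200.

```python
import sqlite3


def init_log(conn):
    """Create the response log table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS response_log ("
        "logged_at TEXT NOT NULL, username TEXT, status_code INTEGER NOT NULL)"
    )


def log_response(conn, username, status_code, logged_at=None):
    """Record one response; logged_at defaults to the current UTC time."""
    if logged_at is None:
        conn.execute(
            "INSERT INTO response_log VALUES (datetime('now'), ?, ?)",
            (username, status_code),
        )
    else:
        conn.execute(
            "INSERT INTO response_log VALUES (?, ?, ?)",
            (logged_at, username, status_code),
        )
    conn.commit()


def failure_rate_by_hour(conn):
    """Map hour of day ('00'..'23') to the share of non-200 responses in that hour."""
    rows = conn.execute(
        "SELECT strftime('%H', logged_at) AS hour, "
        "AVG(CASE WHEN status_code = 200 THEN 0.0 ELSE 1.0 END) "
        "FROM response_log GROUP BY hour ORDER BY hour"
    )
    return {hour: rate for hour, rate in rows}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_log(conn)
    log_response(conn, "alice", 200, "2026-01-01 03:10:00")
    log_response(conn, "bob", 429, "2026-01-01 03:20:00")
    log_response(conn, "carol", 200, "2026-01-02 14:00:00")
    print(failure_rate_by_hour(conn))  # {'03': 0.5, '14': 0.0}
```

timestamps from `datetime('now')` are UTC, so convert before comparing the hourly pattern against your local schedule.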
if it breaks: if you’re seeing consistent 403s across all sessions and proxies, instagram may have updated their endpoint. check developer forums or GitHub issues on open-source instagram scraping libraries for recent changes.
common pitfalls
using datacenter proxies. i see this constantly from operators who are trying to save money. datacenter IPs are blocked almost instantly on Instagram, usually within the first 5-10 requests. residential proxies cost more but they’re the only thing that works at any real volume. the proxyscraping.com residential proxy guide covers the technical reasons in more detail.
scraping too fast. human users don’t request 10 profiles per second. if your timing looks robotic, instagram will rate-limit you within minutes. keep delays random and within human-plausible ranges (3-15 seconds between requests, longer between sessions).
ignoring endpoint changes. meta changes instagram’s internal API endpoints a few times a year. if your scraper breaks overnight, check the network tab in Chrome dev tools on a fresh Instagram session to find the new endpoint paths.
reusing the same session forever. Instagram sessions have a natural lifespan. trying to stretch one session across thousands of requests will accelerate detection. treat sessions as consumable and budget account creation accordingly.
not handling 2FA properly. if your accounts require 2FA and your automation doesn’t handle it, sessions will fail silently. test your session export and import flow end-to-end, including what happens after a forced logout.
scaling this
10x (a few hundred profiles/day): the setup above handles this fine. one proxy pool, two or three sessions, a single machine.
100x (a few thousand profiles/day): you need to distribute requests across multiple machines or processes. use a job queue like Redis with rq or celery to coordinate scrapers. add more sessions (10-20 accounts) and consider geographic targeting in your proxy config to match where your target accounts are located. this is also where having a proper proxy pool with country-level targeting from your provider starts to matter. if you’re managing multiple accounts for concurrent scraping operations, the multi-account workflows at multiaccountops.com/blog/ are worth reading before you build your own session management layer.
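before reaching for a broker, it can help to see the simplest form of distribution: deterministically splitting the target list across workers. note this is a plain hash-partitioning sketch, not rq or celery, and the function names are hypothetical. it has no retries or rebalancing, which is exactly what a real job queue adds.

```python
import hashlib


def shard_for(username, num_workers):
    """Stable worker index for a username (same answer on every machine and run)."""
    digest = hashlib.sha256(username.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(usernames, num_workers):
    """Split usernames into num_workers lists using shard_for."""
    shards = [[] for _ in range(num_workers)]
    for username in usernames:
        shards[shard_for(username, num_workers)].append(username)
    return shards


if __name__ == "__main__":
    targets = ["nasa", "natgeo", "nike", "spotify", "bbcnews", "unicef"]
    for worker_id, shard in enumerate(partition(targets, 3)):
        print(worker_id, shard)
```

hashing with sha256 rather than python’s built-in `hash()` matters here: `hash()` on strings is salted per process, so two machines would disagree about which worker owns a username.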
1000x (tens of thousands of profiles/day): at this scale, session management becomes the bottleneck, not proxies. you need automated account creation or sourcing pipelines, session health monitoring, and automatic rotation when accounts hit checkpoints. budget for a meaningful ongoing account acquisition cost. also revisit whether the official Meta Graph API with multiple app tokens might cover part of your volume, because reliable unofficial scraping at this scale requires serious operational overhead.
where to go next
- how to rotate proxies in Python for any target: a deeper look at rotation logic, backoff strategies, and handling different proxy types in Python
- residential vs datacenter proxies: which one do you actually need: breaks down the cost-performance tradeoffs for different scraping targets
- scraping TikTok at scale in 2026: similar fingerprinting challenges, different endpoint structure
Written by Xavier Fok
disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.