How to scrape Producthunt at scale in 2026 with proxies that work

Producthunt is one of the best public signal sources for early-stage product intelligence. Every day, hundreds of products launch with upvote counts, maker profiles, tags, and comment sentiment that you can use for competitive research, lead generation, or trend analysis. the problem is that Producthunt rate-limits aggressive crawlers hard and blocks datacenter IPs quickly, making naive scrapers fail within minutes.

this guide is for operators who want reliable, repeatable access to Producthunt data: founders doing competitive analysis, growth teams tracking product launches by category, and data vendors building SaaS products on top of launch data. by the end you will have a working Python scraper that pulls daily launches, filters by category, and rotates residential proxies to stay undetected at volume.

i run this pipeline for a client doing SaaS competitive intelligence. it runs nightly, pulls the previous day’s launches across ten categories, and feeds a Postgres table. here is exactly how to build it.


what you need

accounts and credentials
- a Producthunt developer account (free) and OAuth2 API credentials from https://api.producthunt.com/v2/docs
- a ProxyScraping residential proxy plan. the Starter plan at $4/GB is sufficient for under 10k requests/day. the Business plan at $2.50/GB makes sense above 50k requests/day.

infrastructure
- Python 3.11+
- httpx 0.27+ (preferred over requests for async support)
- gql 3.5+ (Python GraphQL client)
- a Postgres or SQLite database to store results
- a Linux VPS or cron-capable environment (a $6/month Hetzner CX11 works fine)

estimated monthly cost at 10k requests/day
- proxy bandwidth: roughly 1GB/day at ~100KB per request, which is about 30GB/month, so ~$120/month at the Starter rate or ~$75/month at the Business rate
- VPS: ~$6/month
- API calls via OAuth: free; the authenticated tier allows 500 requests/hour
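the arithmetic behind those figures can be sketched as a small helper. the function name and the 30-day month are assumptions for illustration; the per-GB prices are the ones quoted above.

```python
def monthly_proxy_cost(requests_per_day: int, kb_per_request: float,
                       price_per_gb: float, days: int = 30) -> float:
    """Estimate monthly proxy spend in dollars (decimal units: 1 GB = 1,000,000 KB)."""
    gb_per_day = requests_per_day * kb_per_request / 1_000_000
    return gb_per_day * days * price_per_gb


if __name__ == "__main__":
    # Prices per GB as quoted in this guide; check your own plan before budgeting.
    for plan, price in (("Starter", 4.00), ("Business", 2.50)):
        print(plan, monthly_proxy_cost(10_000, 100, price))
```

plug in your own average response size once you have a few days of logs, since the ~100KB figure is only a rough estimate.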


step by step

step 1: create your Producthunt developer app

go to https://api.producthunt.com/v2/docs and sign in. click “Add an Application”, give it a name and a redirect URI (use http://localhost:3000/callback for local dev). copy your client_id and client_secret.

for server-side scraping you want a developer token, not user OAuth. in the API dashboard there is a “Developer Token” section that issues tokens scoped to your account. this is the token you will use for authenticated requests.

expected output: a developer token that looks like a 64-character hex string.

if it breaks: if the token creation UI is missing, you may need to verify your email first. Producthunt sometimes gates API access behind email verification.

step 2: understand the GraphQL schema

Producthunt’s API is GraphQL at https://api.producthunt.com/v2/api/graphql. the primary query you will use for daily scraping is posts, filtered by postedAfter and postedBefore. here is the shape:

query GetPosts($after: String, $first: Int, $postedAfter: DateTime, $postedBefore: DateTime) {
  posts(first: $first, after: $after, postedAfter: $postedAfter, postedBefore: $postedBefore, order: VOTES) {
    pageInfo {
      hasNextPage
      endCursor
    }
    edges {
      node {
        id
        name
        tagline
        slug
        votesCount
        commentsCount
        createdAt
        website
        topics {
          edges {
            node {
              name
            }
          }
        }
        makers {
          id
          name
          username
          twitterUsername
        }
      }
    }
  }
}

expected output: familiarity with the schema before you write any code. if you are new to cursor-based pagination, read up on the Relay cursor connections convention first: you pass the previous page's pageInfo.endCursor as the after argument to get the next page.

if it breaks: introspect the live schema with { __schema { types { name } } } to verify field names haven’t changed.
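a minimal sketch of that introspection check, split into a payload builder and a parser so the parsing part can be tried offline. the function names and the sample response are made up for illustration; only the query string comes from the step above.

```python
import json

INTROSPECTION_QUERY = "{ __schema { types { name } } }"


def introspection_payload() -> str:
    """JSON body to POST to the GraphQL endpoint."""
    return json.dumps({"query": INTROSPECTION_QUERY})


def type_names(response: dict) -> list[str]:
    """Pull type names out of an introspection response, skipping built-in '__' types."""
    types = response["data"]["__schema"]["types"]
    return sorted(t["name"] for t in types if not t["name"].startswith("__"))


if __name__ == "__main__":
    # Hand-written sample response, not real API output.
    sample = {"data": {"__schema": {"types": [
        {"name": "Post"}, {"name": "__Type"}, {"name": "Topic"},
    ]}}}
    print(type_names(sample))
```

if a type you rely on is missing from the returned list, re-check the field names in your query before debugging anything else.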

step 3: configure your proxy rotation

ProxyScraping residential proxies use a gateway format. your proxy URL will look like:

http://user-<username>-session-<session_id>:<password>@gate.proxyscraping.com:7777

for scraping you want sticky sessions off (random rotation per request) so each request exits from a different IP. drop the -session-<id> part and the gateway will rotate automatically.

import os

PROXY_USER = os.environ["PROXY_USER"]
PROXY_PASS = os.environ["PROXY_PASS"]
PROXY_GATE = "gate.proxyscraping.com:7777"

def get_proxy() -> dict:
    url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_GATE}"
    return {"http://": url, "https://": url}

store credentials in environment variables, never hardcode them.

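expected output: a dict that maps the http:// and https:// schemes to the same gateway URL. httpx 0.27 still accepts this dict via the proxies parameter, but that parameter is deprecated and was removed in httpx 0.28, where you pass a single URL via proxy instead.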

if it breaks: if you get 407 Proxy Authentication Required, double-check the username format in your ProxyScraping dashboard. some plans use user-username prefix, others just the plain username.
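the sticky versus rotating distinction in this step can be sketched as one URL builder. this is an illustration under assumptions: build_proxy_url is a hypothetical helper, and the -session-<id> suffix follows the gateway format shown above, so confirm the exact username format in your dashboard.

```python
from typing import Optional
from urllib.parse import quote

PROXY_GATE = "gate.proxyscraping.com:7777"  # gateway host used in this guide


def build_proxy_url(user: str, password: str, session_id: Optional[str] = None) -> str:
    """Build a gateway URL; pass session_id for a sticky session, omit it to rotate."""
    login = f"{user}-session-{session_id}" if session_id else user
    # Percent-encode credentials so characters like '@' or ':' cannot break the URL.
    return f"http://{quote(login, safe='')}:{quote(password, safe='')}@{PROXY_GATE}"


if __name__ == "__main__":
    print(build_proxy_url("user-alice", "p@ss"))                       # rotating
    print(build_proxy_url("user-alice", "p@ss", session_id="abc123"))  # sticky
```

percent-encoding matters more than it looks: a password containing @ or : otherwise produces a URL the client parses incorrectly, which can surface as a confusing 407.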

step 4: write the authenticated GraphQL client

import os

import httpx
import json
from datetime import datetime, timedelta, timezone

PH_API = "https://api.producthunt.com/v2/api/graphql"
PH_TOKEN = os.environ["PH_DEVELOPER_TOKEN"]

QUERY = """
query GetPosts($after: String, $first: Int, $postedAfter: DateTime, $postedBefore: DateTime) {
  posts(first: $first, after: $after, postedAfter: $postedAfter, postedBefore: $postedBefore, order: VOTES) {
    pageInfo { hasNextPage endCursor }
    edges {
      node {
        id name tagline slug votesCount commentsCount createdAt website
        topics { edges { node { name } } }
        makers { id name username twitterUsername }
      }
    }
  }
}
"""

def fetch_posts_page(after: str | None, posted_after: str, posted_before: str) -> dict:
    variables = {
        "first": 20,
        "after": after,
        "postedAfter": posted_after,
        "postedBefore": posted_before,
    }
    headers = {
        "Authorization": f"Bearer {PH_TOKEN}",
        "Content-Type": "application/json",
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
    }
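    # httpx 0.28 removed the `proxies` argument; `proxy` takes a single URL
    # and is available from httpx 0.26 onward.
    proxy_url = get_proxy()["https://"]
    with httpx.Client(proxy=proxy_url, timeout=30) as client: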
        resp = client.post(
            PH_API,
            json={"query": QUERY, "variables": variables},
            headers=headers,
        )
        resp.raise_for_status()
        return resp.json()

expected output: a JSON response with data.posts.edges containing up to 20 posts per call.

if it breaks: a 429 response means you have hit the authenticated rate limit of 500 requests/hour. add a time.sleep(7.5) between requests to stay safely under that ceiling.
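where the 7.5 comes from: 3600 seconds divided by 500 requests is 7.2 seconds, and 7.5 leaves a little headroom. a minimal sketch of a throttle that enforces that gap follows. the Throttle class is a hypothetical helper, not part of httpx, and the clock and sleep parameters exist only so it can be tested without waiting.

```python
import time


def min_interval(requests_per_hour: int) -> float:
    """Smallest gap in seconds between requests that stays within an hourly limit."""
    return 3600 / requests_per_hour


class Throttle:
    """Block until at least `interval` seconds have passed since the previous call."""

    def __init__(self, interval: float, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

create one Throttle(7.5) at module level and call wait() before every request. unlike a bare time.sleep, it only sleeps for whatever part of the interval has not already elapsed.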

step 5: paginate through all results

import time

def fetch_all_posts(date: datetime) -> list[dict]:
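    # date is timezone-aware UTC, so isoformat() would already end in "+00:00"
    # and appending "Z" would yield an invalid timestamp. format explicitly instead.
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    posted_after = date.replace(hour=0, minute=0, second=0, microsecond=0).strftime(fmt)
    posted_before = date.replace(hour=23, minute=59, second=59, microsecond=0).strftime(fmt)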

    all_posts = []
    cursor = None

    while True:
        data = fetch_posts_page(cursor, posted_after, posted_before)
        page = data["data"]["posts"]

        for edge in page["edges"]:
            all_posts.append(edge["node"])

        if not page["pageInfo"]["hasNextPage"]:
            break

        cursor = page["pageInfo"]["endCursor"]
        time.sleep(7.5)  # 500 requests/hour means a gap of at least 7.2 s

    return all_posts

if __name__ == "__main__":
    yesterday = datetime.now(timezone.utc) - timedelta(days=1)
    posts = fetch_all_posts(yesterday)
    print(f"fetched {len(posts)} posts for {yesterday.date()}")

expected output: a list of all posts from the target date. a typical day on Producthunt has 200-400 posts.

if it breaks: if data is None or the key is missing, print the raw response. Producthunt sometimes returns errors alongside partial data in GraphQL, so add if "errors" in data: print(data["errors"]) before processing.
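the check described above can be sketched as a small unwrapping function. unwrap_posts and GraphQLError are hypothetical names for illustration; the response shape assumed here is the standard GraphQL one, with optional top-level data and errors keys.

```python
class GraphQLError(RuntimeError):
    """Raised when the response carries errors and no usable data."""


def unwrap_posts(response: dict) -> dict:
    """Return response['data']['posts'], raising with the server's messages if absent."""
    errors = response.get("errors") or []
    messages = [e.get("message", str(e)) for e in errors]
    posts = (response.get("data") or {}).get("posts")
    if posts is None:
        raise GraphQLError("; ".join(messages) or f"unexpected response: {response!r}")
    if messages:
        # Partial data: keep going, but surface what the server complained about.
        print(f"warning: partial data, errors: {messages}")
    return posts
```

in fetch_all_posts you would call page = unwrap_posts(data) instead of indexing data["data"]["posts"] directly, so a failed page stops the run with a readable message rather than a KeyError.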

step 6: persist to Postgres

import psycopg2
import json

def save_posts(posts: list[dict], conn_str: str):
    conn = psycopg2.connect(conn_str)
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS ph_posts (
            id TEXT PRIMARY KEY,
            name TEXT,
            tagline TEXT,
            slug TEXT,
            votes_count INT,
            comments_count INT,
            created_at TEXT,
            website TEXT,
            topics JSONB,
            makers JSONB,
            scraped_at TIMESTAMPTZ DEFAULT NOW()
        )
    """)
    for post in posts:
        topics = [e["node"]["name"] for e in post.get("topics", {}).get("edges", [])]
        cur.execute("""
            INSERT INTO ph_posts (id, name, tagline, slug, votes_count, comments_count,
                                  created_at, website, topics, makers)
            VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
            ON CONFLICT (id) DO UPDATE SET votes_count = EXCLUDED.votes_count,
                                           comments_count = EXCLUDED.comments_count
        """, (
            post["id"], post["name"], post["tagline"], post["slug"],
            post["votesCount"], post["commentsCount"], post["createdAt"],
            post["website"], json.dumps(topics), json.dumps(post.get("makers", []))
        ))
    conn.commit()
    cur.close()
    conn.close()

expected output: an upsert pattern so re-running the scraper updates vote counts without duplicating rows.

if it breaks: if you see psycopg2.OperationalError, check that Postgres is accepting connections on the right port and that your connection string includes the correct database name.
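if you picked SQLite instead of Postgres, the same upsert idea can be sketched with the standard library. this is an assumption-laden variant, not the pipeline described above: the column set is trimmed for brevity, JSON is stored as plain TEXT, and ON CONFLICT ... DO UPDATE needs SQLite 3.24 or newer.

```python
import json
import sqlite3


def save_posts_sqlite(posts: list[dict], conn: sqlite3.Connection) -> None:
    """Upsert posts into SQLite; topics and makers are stored as JSON text."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS ph_posts (
            id TEXT PRIMARY KEY, name TEXT, votes_count INTEGER,
            comments_count INTEGER, topics TEXT, makers TEXT
        )
    """)
    rows = [
        (
            p["id"], p["name"], p["votesCount"], p["commentsCount"],
            json.dumps([e["node"]["name"] for e in p.get("topics", {}).get("edges", [])]),
            json.dumps(p.get("makers", [])),
        )
        for p in posts
    ]
    # Re-running with the same id refreshes the counts instead of adding a row.
    conn.executemany("""
        INSERT INTO ph_posts (id, name, votes_count, comments_count, topics, makers)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (id) DO UPDATE SET votes_count = excluded.votes_count,
                                       comments_count = excluded.comments_count
    """, rows)
    conn.commit()
```

pass sqlite3.connect("ph.db") for a file-backed database, or sqlite3.connect(":memory:") while experimenting.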

step 7: schedule with cron

on your Hetzner VPS, add a crontab entry to run the scraper nightly at 02:00 UTC (after the Producthunt daily cycle closes):

crontab -e
0 2 * * * /usr/bin/python3 /home/ubuntu/ph_scraper/main.py >> /var/log/ph_scraper.log 2>&1

expected output: daily automated runs with logs you can inspect if something breaks.

if it breaks: if the cron job silently fails, add set -e at the top of a wrapper shell script and use absolute paths for all binaries. cron has a minimal $PATH.


common pitfalls

using datacenter proxies. Producthunt blocks ASNs from AWS, GCP, and most datacenter ranges. if you see consistent 403s or CAPTCHA challenges, your proxy type is wrong. switch to residential.

not setting a realistic User-Agent. the default httpx user-agent string is trivially fingerprinted. always send a browser-realistic user-agent header.

scraping without authentication. unauthenticated requests to the GraphQL endpoint are throttled much more aggressively than authenticated ones. always use your developer token even if the endpoint technically works without it.

ignoring the ON CONFLICT case. vote counts change throughout the day as posts age. if you only run your scraper once and do a plain INSERT, you will have stale vote data. either run the upsert pattern above or scrape at multiple intervals.

not monitoring proxy spend. at scale, a bug that removes the time.sleep between requests can burn through several GB of proxy bandwidth in an hour. set a spend alert in your ProxyScraping dashboard and cap daily bandwidth in the proxy settings.


scaling this

10x (2,000-4,000 posts/day). the single-threaded approach above handles this fine. add retry logic with exponential backoff using tenacity. monitor the cron log weekly.
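to show what exponential backoff does, here is a standard-library sketch rather than tenacity itself, so it runs without extra dependencies. with_backoff and its parameters are hypothetical names; tenacity provides the same behaviour as a decorator with more options.

```python
import random
import time


def with_backoff(func, attempts: int = 5, base_delay: float = 1.0,
                 max_delay: float = 60.0, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random amount up to the computed delay.
            sleep(random.uniform(0, delay))
```

usage would look like with_backoff(lambda: fetch_posts_page(cursor, posted_after, posted_before)). in a real run you would likely narrow the except clause to network and HTTP status errors so that programming mistakes fail fast.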

100x (filtering 30 days of history or multiple date ranges). switch from httpx.Client to httpx.AsyncClient and run concurrent coroutines with asyncio.gather. keep concurrency at 5-10, and keep pacing requests as well: a concurrency cap only limits how many requests are in flight at once, it does not by itself keep you within the 500 req/hour limit. at this point, consider running the scraper on a dedicated VPS rather than shared hosting, and use the ProxyScraping Business plan to cut proxy costs.
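a minimal sketch of bounded concurrency with asyncio.gather and a semaphore. fetch_day is a hypothetical stand-in that only sleeps; in a real pipeline it would be an async version of fetch_all_posts built on httpx.AsyncClient.

```python
import asyncio


async def fetch_day(day: str) -> str:
    """Hypothetical stand-in for an async version of fetch_all_posts."""
    await asyncio.sleep(0.01)  # pretend to do network I/O
    return day


async def fetch_days(days: list[str], limit: int = 5) -> list[str]:
    """Fetch every day concurrently, with at most `limit` fetches in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(day: str) -> str:
        async with semaphore:
            return await fetch_day(day)

    # gather returns results in the same order as the input.
    return await asyncio.gather(*(bounded(d) for d in days))


if __name__ == "__main__":
    days = [f"2026-05-{d:02d}" for d in range(1, 13)]
    print(asyncio.run(fetch_days(days, limit=5)))
```

the semaphore caps parallelism only, so the hourly budget still needs a delay or throttle inside the real fetch function.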

1000x (real-time monitoring, comment scraping, maker enrichment). at this level you are hitting the GraphQL API’s limits even with full authentication. options: cache aggressively in Redis so you don’t re-fetch posts you already have, use Producthunt’s webhook or RSS endpoints for near-real-time signals rather than polling, and enrich maker data from external sources rather than hammering the API per-maker. you will also want to track your scraper’s fingerprint across multiple proxy pools. operators running this kind of multi-account, multi-IP infrastructure sometimes cross-reference with resources at multiaccountops.com/blog/ for rotation strategy patterns.



Written by Xavier Fok

disclosure: this article may contain affiliate links. if you buy through them we may earn a commission at no extra cost to you. verdicts are independent of payouts. last reviewed by Xavier Fok on 2026-05-19.