How to Avoid IP Bans When Web Scraping
IP bans are the most common obstacle in web scraping. Whether you are collecting product prices, monitoring competitors, or gathering training data, getting blocked means lost time and incomplete datasets. This guide covers the practical techniques that professional scrapers use to stay under the radar.
Why Websites Ban IPs
Before diving into solutions, it helps to understand what triggers bans. Websites use several detection signals, and bans usually result from a combination of them rather than any single factor.
- Request volume. Sending hundreds of requests per minute from one IP is the most obvious signal. No human browses that fast.
- Request patterns. Accessing pages in a predictable, sequential order (page 1, page 2, page 3) at exact intervals looks automated.
- Missing headers. Real browsers send dozens of headers — User-Agent, Accept, Accept-Language, Referer. A bare HTTP client sends only a handful, often with a default User-Agent like python-requests/2.x that identifies it immediately.
- IP reputation. Datacenter IPs are flagged by anti-bot databases. Even before your first request, the IP is suspicious.
- TLS fingerprinting. The way your HTTP client negotiates TLS (cipher suites, extensions) differs from real browsers. Advanced anti-bot systems check this.
- Behavioral analysis. Some systems track mouse movements, scroll patterns, and JavaScript execution. A request with no JS execution at all is a red flag on sites that expect it.
Technique 1: Rotate Your IPs
The single most effective technique is IP rotation. Instead of sending all requests from one address, distribute them across many IPs so that no single address accumulates enough requests to trigger detection.
Residential proxies are ideal for this because each IP belongs to a real ISP and has a clean reputation. With a rotating gateway, every request automatically exits through a different IP — no pool management required on your end.
```python
import requests

proxy_url = "http://USER:PASS@p.proxyshare.io:8080"
proxies = {"http": proxy_url, "https": proxy_url}

# Each request exits through a different residential IP
for page in range(1, 101):
    resp = requests.get(
        f"https://example.com/products?page={page}",
        proxies=proxies,
        timeout=30,
    )
    print(f"Page {page}: {resp.status_code}")
```
Technique 2: Set Realistic Headers
A real Chrome browser sends a specific set of headers with every request. Your scraper should mimic this. At minimum, set User-Agent, Accept, Accept-Language, and Accept-Encoding.
```python
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/122.0.0.0 Safari/537.36"
    ),
    "Accept": (
        "text/html,application/xhtml+xml,application/xml;"
        "q=0.9,image/avif,image/webp,*/*;q=0.8"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

session = requests.Session()
session.headers.update(headers)
```
Rotate your User-Agent strings periodically. Maintain a list of 10-20 current browser User-Agents and pick one randomly per session or per request.
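That rotation takes only a few lines. A minimal sketch — the strings below are illustrative examples, so keep your own pool synced with current browser releases:

```python
import random

# Pool of browser User-Agents (illustrative values; refresh these
# as new browser versions ship)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 "
    "Firefox/123.0",
]

def random_user_agent():
    """Pick one User-Agent at random, e.g. once per session."""
    return random.choice(USER_AGENTS)
```

Pair a fresh User-Agent with a fresh session rather than swapping mid-session, so the two signals stay consistent.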
Technique 3: Randomize Request Timing
Humans do not browse at perfectly regular intervals. Adding random delays between requests makes your traffic pattern look more natural.
```python
import time
import random

for url in urls:
    response = session.get(url, proxies=proxies, timeout=30)
    process(response)
    # Random delay between 0.5 and 3 seconds
    time.sleep(random.uniform(0.5, 3.0))
```
The delay range depends on the target site. For aggressive anti-bot systems, use 2-5 second delays. For lighter protection, 0.5-1.5 seconds is usually enough. With rotating residential IPs, you can afford shorter delays because each request appears to come from a different user.
Technique 4: Respect robots.txt and Rate Limits
The robots.txt file specifies crawl guidelines. While not legally binding in most jurisdictions, respecting it demonstrates good faith and reduces your chance of getting actively blocked. Many sites specify a Crawl-delay directive — honor it.
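Both checks can be automated with Python's standard-library urllib.robotparser. A minimal sketch, using an inline robots.txt for illustration (real code would call rp.set_url() and rp.read() to fetch the target site's actual file; the "MyScraper" agent name is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content; fetch the real one in production
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

delay = rp.crawl_delay("MyScraper")   # 5 here; None if no directive
allowed = rp.can_fetch("MyScraper", "https://example.com/products")
blocked = rp.can_fetch("MyScraper", "https://example.com/admin/secret")
```

If crawl_delay() returns a value, use it as the floor for your randomized sleep between requests.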
When you receive a 429 (Too Many Requests) response, back off immediately. Implement exponential backoff: wait 2 seconds, then 4, then 8. Continuing to hammer a site after receiving 429s is the fastest way to get permanently banned.
```python
def fetch_with_backoff(session, url, max_retries=3):
    """Fetch URL with exponential backoff on rate limits."""
    for attempt in range(max_retries):
        response = session.get(url, proxies=proxies, timeout=30)
        if response.status_code == 429:
            wait = 2 ** (attempt + 1)
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
            continue
        return response
    raise Exception(f"Failed after {max_retries} retries: {url}")
```
Technique 5: Use a Headless Browser for JS-Heavy Sites
Some websites require JavaScript execution to render content or to pass anti-bot checks. For these targets, tools like Puppeteer or Playwright are necessary. They run a real browser engine, which produces authentic TLS fingerprints and can execute JavaScript challenges.
```javascript
import { chromium } from "playwright";

const browser = await chromium.launch({
  proxy: {
    server: "http://p.proxyshare.io:8080",
    username: "USER",
    password: "PASS",
  },
});

const page = await browser.newPage();
await page.goto("https://example.com", {
  waitUntil: "networkidle",
  timeout: 60000,
});

// Now the page has fully rendered, including JS content
const data = await page.evaluate(() => {
  return document.querySelector(".product-price")?.textContent;
});
console.log(data);

await browser.close();
```
Headless browsers consume more bandwidth and are slower than raw HTTP requests. Use them only for sites that genuinely require JavaScript rendering. For most targets, a well-configured requests session with residential proxies is sufficient.
Technique 6: Vary Your Scraping Patterns
Do not scrape pages in sequential order. Shuffle your URL list so that requests appear random rather than systematic. Vary the pages you visit — occasionally access the homepage, category pages, and other non-target pages to make your browsing pattern look organic.
```python
import random

# Shuffle URLs to avoid sequential patterns
urls = [f"https://example.com/product/{i}" for i in range(1, 500)]
random.shuffle(urls)

for url in urls:
    response = session.get(url, proxies=proxies, timeout=30)
    process(response)
    time.sleep(random.uniform(1.0, 3.0))
```
Putting It All Together
No single technique is a silver bullet. The most reliable scraping setups combine multiple approaches: rotating residential IPs provide the foundation, realistic headers and timing add authenticity, proper error handling ensures resilience, and headless browsers handle the toughest targets.
Start simple and escalate only when needed. Many websites can be scraped with basic HTTP requests and rotating proxies. Save the complexity of headless browsers for sites that genuinely require them. Focus on resilience over speed — a slower scraper that runs reliably collects more data than a fast one that gets banned after 10 minutes.
Stop fighting IP bans
ProxyShare rotates residential IPs automatically on every request. Focus on your data, not your infrastructure.