Web Scraping with Python & Residential Proxies
Python is the most popular language for web scraping, and for good reason. Its ecosystem of HTTP clients and HTML parsers makes it fast to build scrapers that work. But without proxies, even the best scraper will get blocked. This guide walks through building a production-quality scraper with rotating residential proxies.
Prerequisites
You will need Python 3.8 or later. We will use three libraries: requests for HTTP calls, beautifulsoup4 for parsing HTML, and lxml as a fast parser backend.
```shell
pip install requests beautifulsoup4 lxml
```
Step 1: Set Up the Proxy Connection
The first step is configuring your HTTP client to route requests through a residential proxy. With a rotating proxy gateway, you connect to a single endpoint and each request automatically exits through a different IP address.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_session(proxy_user, proxy_pass):
    """Create a requests session with proxy and retry logic."""
    session = requests.Session()

    # Configure retries for resilience
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retries))
    session.mount("https://", HTTPAdapter(max_retries=retries))

    # Set proxy
    proxy_url = f"http://{proxy_user}:{proxy_pass}@p.proxyshare.io:8080"
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }

    # Set a realistic user agent
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/122.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session
```
Notice the retry configuration. Residential proxies route through real consumer connections, so occasional timeouts happen. The retry adapter handles this automatically without any manual intervention.
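To get a feel for what `backoff_factor=1` means in practice, here is the exponential formula urllib3 applies between attempts. Treat the exact per-attempt values as approximate, since some urllib3 versions skip the sleep before the first retry and newer ones can add jitter:

```python
BACKOFF_FACTOR = 1  # matches the Retry config above

# urllib3's documented shape: backoff_factor * (2 ** number_of_previous_retries)
# With total=3 retries, the waits roughly double each time.
delays = [BACKOFF_FACTOR * (2 ** n) for n in range(3)]
print(delays)  # [1, 2, 4]
```

Doubling the delay on each failure gives a struggling server progressively more breathing room, which is why it is the default strategy for transient errors like 502 and 503.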
Step 2: Fetch and Parse HTML
With the session configured, fetching pages is straightforward. BeautifulSoup handles the HTML parsing, and we use lxml as the parser for speed.
```python
from bs4 import BeautifulSoup


def scrape_page(session, url):
    """Fetch a page and return parsed BeautifulSoup object."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")


# Example: scrape product listings
session = create_session("YOUR_USER", "YOUR_PASS")
soup = scrape_page(session, "https://example.com/products")

products = []
for item in soup.select(".product-card"):
    products.append({
        "name": item.select_one(".product-title").get_text(strip=True),
        "price": item.select_one(".product-price").get_text(strip=True),
        "url": item.select_one("a")["href"],
    })

print(f"Found {len(products)} products")
```
Step 3: Handle Pagination
Most scraping jobs involve iterating through multiple pages. With rotating proxies, each page request comes from a different IP, which distributes load naturally and avoids triggering rate limits.
```python
import time


def scrape_all_pages(session, base_url, max_pages=100):
    """Scrape through paginated results."""
    all_items = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        try:
            soup = scrape_page(session, url)
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page_num}: {e}")
            continue

        items = soup.select(".product-card")
        if not items:
            print(f"No items on page {page_num}, stopping.")
            break

        for item in items:
            all_items.append({
                "name": item.select_one(".product-title").get_text(strip=True),
                "price": item.select_one(".product-price").get_text(strip=True),
            })
        print(f"Page {page_num}: {len(items)} items (total: {len(all_items)})")

        # Be polite — add a small delay between pages
        time.sleep(1)
    return all_items


results = scrape_all_pages(session, "https://example.com/products")
print(f"Scraped {len(results)} total items")
```
Step 4: Save Results
Once you have collected data, save it in a structured format. CSV and JSON are the most common choices for scraped data.
```python
import csv
import json

# Save as CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(results)

# Save as JSON
with open("products.json", "w") as f:
    json.dump(results, f, indent=2)

print("Data saved to products.csv and products.json")
```
Step 5: Scale with Concurrency
For large-scale jobs, sequential requests are too slow. Python's concurrent.futures module lets you scrape multiple pages in parallel. With rotating proxies, each concurrent request exits through a different IP, so the extra traffic is spread across many addresses instead of concentrated on one.
```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_url(session, url):
    """Scrape a single URL and return extracted data."""
    try:
        soup = scrape_page(session, url)
        title = soup.select_one("title")
        return {"url": url, "title": title.get_text(strip=True) if title else ""}
    except Exception as e:
        return {"url": url, "error": str(e)}


urls = [f"https://example.com/product/{i}" for i in range(1, 101)]
session = create_session("YOUR_USER", "YOUR_PASS")

results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape_url, session, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()
        results.append(result)
        if "error" not in result:
            print(f"OK: {result['url']}")
        else:
            print(f"ERR: {result['url']} — {result['error']}")

print(f"Completed {len(results)} URLs")
```
Start with 10 concurrent workers and increase gradually. Monitor your success rate — if it drops, reduce concurrency or add longer delays between requests to the same domain.
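One way to act on that advice automatically is to track a rolling success rate and throttle when it dips. The sketch below is illustrative; the window size and threshold are arbitrary starting points, not library defaults, so tune them against your own traffic:

```python
from collections import deque


class SuccessMonitor:
    """Track a rolling success rate and suggest when to throttle."""

    def __init__(self, window=50, threshold=0.9):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok):
        self.results.append(ok)

    @property
    def success_rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_throttle(self):
        # Only judge once the window has enough samples
        return len(self.results) >= 10 and self.success_rate < self.threshold


monitor = SuccessMonitor()
for ok in [True] * 8 + [False] * 4:
    monitor.record(ok)
print(monitor.success_rate, monitor.should_throttle())
```

In the concurrent loop above, you would call `monitor.record("error" not in result)` after each future completes, and lower `max_workers` or lengthen delays whenever `should_throttle()` returns True.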
Common Pitfalls
- Missing User-Agent header. Many sites block requests without a User-Agent. Always set a realistic browser User-Agent string.
- No timeout. Without a timeout, a stalled connection can hang your entire script. Always set `timeout=30`.
- Ignoring HTTP errors. Check status codes. A 403 or 429 response means the site is pushing back — slow down or adjust your approach.
- Scraping too fast. Even with rotating IPs, hammering a single domain with hundreds of concurrent requests can trigger domain-level throttling. Be respectful.
- Not handling encoding. Use `response.encoding = response.apparent_encoding` for sites that do not declare their charset properly.
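The status-code advice above can be condensed into one small triage helper. The mapping below is a hypothetical policy sketch, not a standard; adjust the actions to fit your own pipeline:

```python
def triage_status(status_code):
    """Map an HTTP status code to a suggested scraper action."""
    if status_code in (403, 429):
        return "backoff"   # the site is pushing back; slow down or rotate
    if 500 <= status_code < 600:
        return "retry"     # transient server error; safe to retry
    if status_code == 200:
        return "parse"     # success; hand off to the parser
    return "skip"          # anything else (404s and so on); log and move on


print(triage_status(429))  # backoff
print(triage_status(503))  # retry
```

Centralizing this decision keeps the scraping loop clean and makes it easy to change policy (for example, treating 403 as a signal to rotate sessions) in one place.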
Putting It All Together
A production scraper combines all these pieces: a resilient HTTP session with proxy support, structured parsing with BeautifulSoup, pagination handling, error recovery, and concurrent execution. The proxy layer is the foundation — it determines whether your scraper can access the data at all. With rotating residential IPs, your requests blend in with ordinary consumer traffic, letting you focus on the parsing and data-extraction logic that matters.
Try it with real residential IPs
ProxyShare provides rotating residential proxies with straightforward per-GB pricing.
View Plans