Web Scraping with Python & Residential Proxies
Python is the most popular language for web scraping, and for good reason. Its ecosystem of HTTP clients and HTML parsers makes it fast to build scrapers that work. But without proxies, even the best scraper will get blocked. This guide walks through building a production-quality scraper with rotating residential proxies.
Prerequisites
You will need Python 3.8 or later. We will use three libraries: requests for HTTP calls, beautifulsoup4 for parsing HTML, and lxml as a fast parser backend.
```shell
pip install requests beautifulsoup4 lxml
```
Step 1: Set Up the Proxy Connection
The first step is configuring your HTTP client to route requests through a residential proxy. With a rotating proxy gateway, you connect to a single endpoint and each request automatically exits through a different IP address.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def create_session(proxy_user, proxy_pass):
    """Create a requests session with proxy and retry logic."""
    session = requests.Session()

    # Configure retries for resilience
    retries = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    session.mount("http://", HTTPAdapter(max_retries=retries))
    session.mount("https://", HTTPAdapter(max_retries=retries))

    # Set proxy
    proxy_url = f"http://{proxy_user}:{proxy_pass}@p.proxyshare.io:8080"
    session.proxies = {
        "http": proxy_url,
        "https": proxy_url,
    }

    # Set a realistic user agent
    session.headers.update({
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/122.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
    })
    return session
```
Notice the retry configuration. Residential proxies route through real consumer connections, so occasional timeouts happen. The retry adapter handles this automatically without any manual intervention.
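To get a feel for what `backoff_factor=1` means in practice, here is the exponential formula urllib3 applies between attempts. Treat the exact per-attempt values as approximate, since some urllib3 versions skip the sleep before the first retry and newer ones can add jitter:

```python
BACKOFF_FACTOR = 1  # matches the Retry config above

# urllib3's documented shape: backoff_factor * (2 ** number_of_previous_retries)
# With total=3 retries, the waits roughly double each time.
delays = [BACKOFF_FACTOR * (2 ** n) for n in range(3)]
print(delays)  # [1, 2, 4]
```

Doubling the delay on each failure gives a struggling server progressively more breathing room, which is why it is the default strategy for transient errors like 502 and 503.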
Step 2: Fetch and Parse HTML
With the session configured, fetching pages is straightforward. BeautifulSoup handles the HTML parsing, and we use lxml as the parser for speed.
```python
from bs4 import BeautifulSoup


def scrape_page(session, url):
    """Fetch a page and return parsed BeautifulSoup object."""
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "lxml")


# Example: scrape product listings
session = create_session("YOUR_USER", "YOUR_PASS")
soup = scrape_page(session, "https://example.com/products")

products = []
for item in soup.select(".product-card"):
    products.append({
        "name": item.select_one(".product-title").get_text(strip=True),
        "price": item.select_one(".product-price").get_text(strip=True),
        "url": item.select_one("a")["href"],
    })

print(f"Found {len(products)} products")
```
Step 3: Handle Pagination
Most scraping jobs involve iterating through multiple pages. With rotating proxies, each page request comes from a different IP, which distributes load naturally and avoids triggering rate limits.
```python
import time


def scrape_all_pages(session, base_url, max_pages=100):
    """Scrape through paginated results."""
    all_items = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        try:
            soup = scrape_page(session, url)
        except requests.exceptions.RequestException as e:
            print(f"Error on page {page_num}: {e}")
            continue

        items = soup.select(".product-card")
        if not items:
            print(f"No items on page {page_num}, stopping.")
            break

        for item in items:
            all_items.append({
                "name": item.select_one(".product-title").get_text(strip=True),
                "price": item.select_one(".product-price").get_text(strip=True),
            })
        print(f"Page {page_num}: {len(items)} items (total: {len(all_items)})")

        # Be polite — add a small delay between pages
        time.sleep(1)
    return all_items


results = scrape_all_pages(session, "https://example.com/products")
print(f"Scraped {len(results)} total items")
```
Step 4: Save Results
Once you have collected data, save it in a structured format. CSV and JSON are the most common choices for scraped data.
```python
import csv
import json

# Save as CSV
with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(results)

# Save as JSON
with open("products.json", "w") as f:
    json.dump(results, f, indent=2)

print("Data saved to products.csv and products.json")
```
Step 5: Scale with Concurrency
For large-scale jobs, sequential requests are too slow. Python's concurrent.futures module lets you scrape multiple pages in parallel. With rotating proxies, each concurrent request exits through a different IP, so the extra traffic is spread across many addresses instead of concentrated on one.
```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_url(session, url):
    """Scrape a single URL and return extracted data."""
    try:
        soup = scrape_page(session, url)
        title = soup.select_one("title")
        return {"url": url, "title": title.get_text(strip=True) if title else ""}
    except Exception as e:
        return {"url": url, "error": str(e)}


urls = [f"https://example.com/product/{i}" for i in range(1, 101)]
session = create_session("YOUR_USER", "YOUR_PASS")

results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(scrape_url, session, url): url for url in urls}
    for future in as_completed(futures):
        result = future.result()
        results.append(result)
        if "error" not in result:
            print(f"OK: {result['url']}")
        else:
            print(f"ERR: {result['url']} — {result['error']}")

print(f"Completed {len(results)} URLs")
```
Start with 10 concurrent workers and increase gradually. Monitor your success rate — if it drops, reduce concurrency or add longer delays between requests to the same domain.
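One way to act on that advice automatically is to track a rolling success rate and throttle when it dips. The sketch below is illustrative; the window size and threshold are arbitrary starting points, not library defaults, so tune them against your own traffic:

```python
from collections import deque


class SuccessMonitor:
    """Track a rolling success rate and suggest when to throttle."""

    def __init__(self, window=50, threshold=0.9):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def record(self, ok):
        self.results.append(ok)

    @property
    def success_rate(self):
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def should_throttle(self):
        # Only judge once the window has enough samples
        return len(self.results) >= 10 and self.success_rate < self.threshold


monitor = SuccessMonitor()
for ok in [True] * 8 + [False] * 4:
    monitor.record(ok)
print(monitor.success_rate, monitor.should_throttle())
```

In the concurrent loop above, you would call `monitor.record("error" not in result)` after each future completes, and lower `max_workers` or lengthen delays whenever `should_throttle()` returns True.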
Common Pitfalls
- Missing User-Agent header. Many sites block requests without a User-Agent. Always set a realistic browser User-Agent string.
- No timeout. Without a timeout, a stalled connection can hang your entire script. Always set `timeout=30`.
- Ignoring HTTP errors. Check status codes. A 403 or 429 response means the site is pushing back — slow down or adjust your approach.
- Scraping too fast. Even with rotating IPs, hammering a single domain with hundreds of concurrent requests can trigger domain-level throttling. Be respectful.
- Not handling encoding. Use `response.encoding = response.apparent_encoding` for sites that do not declare their charset properly.
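The status-code advice above can be condensed into one small triage helper. The mapping below is a hypothetical policy sketch, not a standard; adjust the actions to fit your own pipeline:

```python
def triage_status(status_code):
    """Map an HTTP status code to a suggested scraper action."""
    if status_code in (403, 429):
        return "backoff"   # the site is pushing back; slow down or rotate
    if 500 <= status_code < 600:
        return "retry"     # transient server error; safe to retry
    if status_code == 200:
        return "parse"     # success; hand off to the parser
    return "skip"          # anything else (404s and so on); log and move on


print(triage_status(429))  # backoff
print(triage_status(503))  # retry
```

Centralizing this decision keeps the scraping loop clean and makes it easy to change policy (for example, treating 403 as a signal to rotate sessions) in one place.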
Putting It All Together
A production scraper combines all these pieces: a resilient HTTP session with proxy support, structured parsing with BeautifulSoup, pagination handling, error recovery, and concurrent execution. The proxy layer is the foundation — it determines whether your scraper can access the data at all. With rotating residential IPs, your requests blend in with ordinary consumer traffic, letting you focus on the parsing and data-extraction logic that matters.
Try it with real residential IPs
ProxyShare provides rotating residential proxies with straightforward per-GB pricing.
View Plans