The WebScraping Project That Got Me Banned From 50 Sites

Last Updated on December 2, 2025 by Editorial Team

Author(s): DefineWorld

Originally published on Towards AI.

Within a week, my IP was blocked by 47 different sites. My scrapers were detected and shut down. Some sites threatened legal action. One even sent a cease and desist letter.


Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

I thought I was just collecting public data. It turns out web scraping at scale requires far more sophistication than I realized. Here is what I learned about scraping without getting banned.

The Naive Scraper That Failed Immediately

My first attempt was embarrassingly simple. Loop through URLs. Make requests. Parse HTML. Save data.

import requests
from bs4 import BeautifulSoup
import time

urls = [f'https://example.com/products?page={i}' for i in range(1, 100)]
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract product data
    products = soup.find_all('div', class_='product')
    for product in products:
        name = product.find('h2').text
        price = product.find('span', class_='price').text
        print(f"{name}: {price}")

    time.sleep(1)

This got me banned from the first site within an hour. The problems were obvious in hindsight.

No user agent. Predictable request patterns. Same IP for every request. No respect for robots.txt. Hammering the server with requests.

Every red flag a bot detection system looks for.

Respecting Robots.txt

I did not even check if the sites allowed scraping. Most have a robots.txt file that specifies what is and is not allowed.

Ignoring it is the fastest way to get banned.

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    # check if our user agent can fetch this URL
    return rp.can_fetch("MyBot/1.0", url)

# only scrape allowed URLs
for url in urls:
    if can_scrape(url):
        # scrape the URL
        pass
    else:
        print(f"Scraping not allowed: {url}")

This immediately cut my target list down. Some sites explicitly disallowed scraping their pricing pages. I had to respect that or face legal issues.

User Agents and Headers That Look Real

My requests had no user agent. They screamed, “I am a bot.”

Real browsers send a dozen or more headers. User-Agent. Accept. Accept-Language. Referer. Cookie.

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'br' left out: requests only decodes Brotli when the brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Referer': 'https://www.google.com/'
}

url = 'https://example.com/products?page=1'
response = requests.get(url, headers=headers)

I started rotating user agents. Made my requests look like they came from different browsers. Included realistic headers.

Ban rate dropped by 60%.

Rate Limiting and Randomized Delays

I was making requests as fast as possible. One per second. Every second. For hours.

No human browses like that. It was obvious bot behavior.

import random
import time
from datetime import datetime

def smart_delay():
    # random delay between 2-8 seconds
    delay = random.uniform(2, 8)
    time.sleep(delay)

def scrape_with_natural_timing(urls):
    for i, url in enumerate(urls):
        response = requests.get(url, headers=headers)

        # process response (process_data is your own parsing function)
        process_data(response)

        # vary delay based on time of day
        if 9 <= datetime.now().hour <= 17:
            # slower during business hours (more realistic)
            smart_delay()
        else:
            # can be slightly faster at night
            time.sleep(random.uniform(1, 4))

        # take a longer break after every 50th request (not after the very first)
        if (i + 1) % 50 == 0:
            print("Taking a break...")
            time.sleep(random.uniform(30, 90))

Random delays between requests. Longer pauses periodically. Slower scraping during business hours when real users are active.

This made traffic patterns look more human.
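To get a feel for what this pacing costs, the same rules can be written as a pure function that only computes the delay, which makes it easy to estimate a run's duration before starting it. This is a sketch: `next_delay` and its `rng` parameter are illustrative names, and the ranges are simply the ones used above.

```python
import random


def next_delay(index, hour, rng=random):
    """Seconds to wait after request number `index` (0-based) at local `hour`."""
    if 9 <= hour <= 17:
        delay = rng.uniform(2, 8)   # business hours: slower
    else:
        delay = rng.uniform(1, 4)   # off hours: slightly faster
    # After every 50th request, add a longer break on top.
    if (index + 1) % 50 == 0:
        delay += rng.uniform(30, 90)
    return delay


# Estimate the total waiting time for 100 requests made at 2 pm.
rng = random.Random(0)  # seeded so the estimate is repeatable
delays = [next_delay(i, hour=14, rng=rng) for i in range(100)]
print(f"100 requests need about {sum(delays) / 60:.1f} minutes of waiting")
```

On average a business-hours delay is 5 seconds and a long break is 60 seconds, so 100 requests cost roughly 100 × 5 + 2 × 60 = 620 seconds, a little over ten minutes. That is the price of not hammering the server.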

Rotating Proxies to Avoid IP Bans

Even with good behavior, scraping hundreds of pages from a single IP looks suspicious. Sites track request counts per IP.

I needed to rotate through different IPs.

import random
import requests

# list of proxy servers
proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    # use the same proxy for both schemes so one page load comes from one IP
    proxy = random.choice(proxies_list)
    return {
        'http': proxy,
        'https': proxy
    }

def scrape_with_proxy(url):
    proxy = get_random_proxy()

    try:
        response = requests.get(
            url,
            headers=headers,
            proxies=proxy,
            timeout=10
        )
        return response
    except requests.exceptions.RequestException as e:
        print(f"Proxy failed: {e}")
        return None

I used a proxy service that provided residential IPs. Requests came from different locations. Different ISPs. Much harder to detect as a scraper.

This eliminated most IP-based bans.
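Proxies die, and picking at random from a fixed list keeps retrying the dead ones. One possible refinement is a small pool that counts failures and retires a proxy after a few in a row. This is a sketch with illustrative names (`ProxyPool`, `max_failures`), not the setup the proxy service itself provided.

```python
import random


class ProxyPool:
    """Rotate proxies and retire ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self.failures = {p: 0 for p in proxies}  # consecutive failures per proxy
        self.max_failures = max_failures

    def pick(self):
        """Return a proxies dict in the shape requests expects."""
        alive = [p for p, n in self.failures.items() if n < self.max_failures]
        if not alive:
            raise RuntimeError("no working proxies left")
        proxy = random.choice(alive)
        return {"http": proxy, "https": proxy}

    def report_failure(self, proxy_url):
        self.failures[proxy_url] += 1

    def report_success(self, proxy_url):
        self.failures[proxy_url] = 0  # a success resets the counter
```

In the scraping function, call `pool.pick()` before the request, then `report_success` or `report_failure` with the proxy URL depending on whether the request raised.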

Handling JavaScript-Heavy Sites

Some sites loaded pricing data dynamically with JavaScript. My simple requests.get() returned empty pages.

I needed to execute JavaScript like a real browser.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_javascript_site(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument(f'user-agent={ua.random}')

    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # wait for content to load
        wait = WebDriverWait(driver, 10)
        products = wait.until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "product"))
        )

        data = []
        for product in products:
            name = product.find_element(By.TAG_NAME, "h2").text
            price = product.find_element(By.CLASS_NAME, "price").text
            data.append({'name': name, 'price': price})

        return data

    finally:
        driver.quit()

Selenium runs a real browser. Executes JavaScript. Renders pages fully. Much slower than requests, but it works on sites that require it.

I used Selenium only when necessary. Most sites worked fine with simple requests.
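One way to decide when a browser is "necessary" is to look at the static HTML first and check whether the elements you want are already there. The sketch below uses only the standard library; `needs_browser` and `ClassCounter` are illustrative names, and the `product` class is the same placeholder used in the examples above.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count elements whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def needs_browser(html, class_name="product"):
    """True if the static HTML has no element with the expected class."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count == 0
```

If `needs_browser(response.text)` comes back `True`, the content is probably rendered client-side and the Selenium path is worth its cost. It is a heuristic: a page that is genuinely empty, or a changed class name, will also trigger it.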

Respecting Aggressive Anti-Bot Systems

Some sites use Cloudflare or similar services that actively detect and block bots. CAPTCHA challenges. JavaScript challenges. Fingerprint detection.

For these sites, I had to decide if scraping was worth the effort.

import random
import time

from selenium_stealth import stealth
from seleniumwire import webdriver

def scrape_protected_site(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)

    # stealth mode to avoid detection
    stealth(driver,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )

    try:
        driver.get(url)

        # wait for any challenges to resolve
        time.sleep(random.uniform(3, 7))

        # check if we got blocked
        if "challenge" in driver.page_source.lower():
            print("Detected challenge, aborting")
            return None

        # extract data (parse_page is your own parsing function)
        return parse_page(driver.page_source)

    finally:
        driver.quit()

Even with stealth techniques, some sites were too protected. I accepted that and moved on.

Fighting aggressive anti-bot systems is often not worth the legal and technical hassle.

Caching to Minimize Requests

I was re-scraping the same pages multiple times during development. Completely unnecessary load on servers.

import hashlib
import os
import pickle
import time

import requests

def cache_key(url):
    return hashlib.md5(url.encode()).hexdigest()

def get_cached(url, max_age=3600):
    key = cache_key(url)
    cache_file = f"cache/{key}.pkl"

    if os.path.exists(cache_file):
        # check age
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, 'rb') as f:
                return pickle.load(f)

    return None

def cache_response(url, data):
    key = cache_key(url)
    cache_file = f"cache/{key}.pkl"

    os.makedirs('cache', exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(data, f)

def scrape_with_cache(url):
    # check cache first (an empty body is still a valid cached response)
    cached = get_cached(url, max_age=3600)
    if cached is not None:
        return cached

    # fetch fresh data
    response = requests.get(url, headers=headers)
    cache_response(url, response.content)

    return response.content

During development, I used cached responses. Only hit live sites when necessary. This reduced unnecessary traffic by 90%.
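Since the cached value is just the raw response body, pickle is not strictly needed, and loading pickle files is unsafe if anyone else can write to the cache directory. A possible variant that writes the bytes directly is sketched below. `PageCache` and `fetch_cached` are illustrative names, and the `fetch` argument stands in for whatever function actually downloads the page.

```python
import hashlib
import time
from pathlib import Path


class PageCache:
    """Store raw page bytes on disk, keyed by a hash of the URL."""

    def __init__(self, directory="cache", max_age=3600):
        self.directory = Path(directory)
        self.max_age = max_age
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        name = hashlib.sha256(url.encode()).hexdigest() + ".html"
        return self.directory / name

    def get(self, url):
        path = self._path(url)
        if path.exists() and time.time() - path.stat().st_mtime < self.max_age:
            return path.read_bytes()
        return None  # missing or too old

    def put(self, url, content):
        self._path(url).write_bytes(content)


def fetch_cached(url, fetch, cache):
    """Return cached bytes if fresh, otherwise call fetch(url) and store the result."""
    content = cache.get(url)
    if content is None:  # empty bytes still count as a cache hit
        content = fetch(url)
        cache.put(url, content)
    return content
```

A call might look like `fetch_cached(url, lambda u: requests.get(u, headers=headers).content, PageCache())`. Passing the download function in also makes the cache easy to exercise without touching the network.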

Error Handling for Unreliable Scraping

Scraping is inherently fragile. Sites change their HTML. Servers go down. Networks fail. Proxies die.

I needed robust error handling.

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
def scrape_with_retry(url):
    try:
        response = requests.get(
            url,
            headers=headers,
            proxies=get_random_proxy(),
            timeout=10
        )

        response.raise_for_status()
        return response.content

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        raise

def scrape_safely(urls):
    results = []
    failed = []

    for url in urls:
        try:
            content = scrape_with_retry(url)
            # parse_content is your own parsing function
            data = parse_content(content)
            results.append(data)

        except Exception as e:
            print(f"Failed to scrape {url}: {e}")
            failed.append(url)

    # log failures for later retry
    if failed:
        with open('failed_urls.txt', 'a') as f:
            f.write('\n'.join(failed) + '\n')

    return results

Automatic retries with exponential backoff. Logging failures. Continuing despite errors instead of crashing.

This made scraping reliable even when individual requests failed.
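For readers who want to see what exponential backoff does without a library, here is a standard-library sketch of the idea. It is not tenacity's implementation and its exact wait schedule may differ from the decorator above; `backoff_delays` and `retry` are illustrative names, and the `sleep` parameter exists only so the waiting can be swapped out.

```python
import random
import time


def backoff_delays(attempts, base=4.0, cap=60.0):
    """Waits between attempts: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


def retry(func, attempts=3, base=4.0, cap=60.0, sleep=time.sleep):
    """Call func() up to `attempts` times, waiting longer after each failure."""
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller log the failure
            # up to one second of jitter so retries do not line up
            sleep(delays[attempt] + random.uniform(0, 1))


print(backoff_delays(5))  # [4.0, 8.0, 16.0, 32.0]
```

The doubling matters because a struggling server gets progressively more breathing room, and the cap keeps a long outage from turning into waits of many minutes per URL.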

The Legal Considerations I Ignored

The biggest mistake was not thinking about legal implications. Just because data is public does not mean scraping it is legal.

I consulted a lawyer after getting that cease and desist letter. Learned which sites I could legally scrape and which I could not.

Some data I wanted was simply off limits. I found alternative sources or bought the data instead.

What Actually Works Long-Term

Scraping at scale requires respecting the sites you scrape. Following robots.txt. Rate limiting appropriately. Using realistic request patterns.

It also requires accepting limitations. Not every site can or should be scraped. Sometimes paying for an API is the right answer.

My scraping infrastructure now runs reliably. Collects data from 20+ sites without issues. No bans. No legal threats.

Because I finally learned to scrape responsibly.

What data are you trying to scrape right now?


Thanks for reading!


Published via Towards AI
