The WebScraping Project That Got Me Banned From 50 Sites

Last Updated on December 2, 2025 by Editorial Team

Author(s): DefineWorld

Originally published on Towards AI.

Within a week, my IP was blocked by 47 different sites. My scrapers were detected and shut down. Some sites threatened legal action. One even sent a cease and desist letter.

Disclaimer: This article is only for educational purposes. We do not encourage anyone to scrape websites, especially those web properties that may have terms and conditions against such actions.

I thought I was just collecting public data. Turns out web scraping at scale requires way more sophistication than I realized. Here is what I learned about scraping without getting banned.

The Naive Scraper That Failed Immediately

My first attempt was embarrassingly simple. Loop through URLs. Make requests. Parse HTML. Save data.

import requests
from bs4 import BeautifulSoup
import time

urls = [f'https://example.com/products?page={i}' for i in range(1, 100)]
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # extract product data
    products = soup.find_all('div', class_='product')
    for product in products:
        name = product.find('h2').text
        price = product.find('span', class_='price').text
        print(f"{name}: {price}")

    time.sleep(1)

This got me banned from the first site within an hour. The problems were obvious in hindsight.

No realistic user agent. Predictable request patterns. Same IP for every request. No respect for robots.txt. Hammering the server with requests.

Every red flag a bot detection system looks for.

Respecting Robots.txt

I did not even check if the sites allowed scraping. Most have a robots.txt file that specifies what is and is not allowed.

Ignoring it is the fastest way to get banned.

from urllib.robotparser import RobotFileParser
from urllib.parse import urlparse

def can_scrape(url):
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"

    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()

    # check if our user agent can fetch this URL
    return rp.can_fetch("MyBot/1.0", url)

# only scrape allowed URLs
for url in urls:
    if can_scrape(url):
        # scrape the URL
        pass
    else:
        print(f"Scraping not allowed: {url}")

This immediately cut my target list down. Some sites explicitly disallowed scraping their pricing pages. I had to respect that or face legal issues.

User Agents and Headers That Look Real

My requests went out with the default python-requests user agent. They screamed, “I am a bot.”

Real browsers send a whole set of headers. User agent. Accept headers. Accept-Language. Referer. Cookie headers.

import requests
from fake_useragent import UserAgent

ua = UserAgent()
headers = {
    'User-Agent': ua.random,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # add 'br' only if the brotli package is installed; requests cannot decode it otherwise
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Referer': 'https://www.google.com/'
}
response = requests.get(url, headers=headers)

I started rotating user agents. Made my requests look like they came from different browsers. Included realistic headers.

Ban rate dropped by 60%.
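
Rotation can also be done without an extra dependency. A minimal sketch, assuming a small hand-maintained pool of user agent strings: build_headers picks one per call and merges it into a shared base. The strings in USER_AGENTS are illustrative examples of the usual browser format, and the function name is a placeholder.

```python
import random

# Illustrative user agent strings; keep a pool like this current by hand.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

# Headers that stay the same regardless of which user agent is picked.
BASE_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_headers(rng=random):
    """Return a fresh header dict with a randomly chosen user agent."""
    headers = dict(BASE_HEADERS)  # copy so the shared base is never mutated
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    return headers
```

Passing the result as headers=build_headers() on each request rotates the user agent while the rest of the header set stays consistent.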

Rate Limiting and Randomized Delays

I was making requests at a fixed rate. One per second. Every second. For hours.

No human browses like that. It was obvious bot behavior.

import random
import time
from datetime import datetime

import requests

def smart_delay():
    # random delay between 2-8 seconds
    delay = random.uniform(2, 8)
    time.sleep(delay)

def scrape_with_natural_timing(urls):
    for i, url in enumerate(urls):
        response = requests.get(url, headers=headers)

        # process response (process_data stands for your own parsing function)
        process_data(response)

        # vary delay based on time of day
        if 9 <= datetime.now().hour <= 17:
            # slower during business hours (more realistic)
            smart_delay()
        else:
            # can be slightly faster at night
            time.sleep(random.uniform(1, 4))

        # take longer breaks periodically (skip i == 0 so the first request does not trigger one)
        if i > 0 and i % 50 == 0:
            print("Taking a break...")
            time.sleep(random.uniform(30, 90))

Random delays between requests. Longer pauses periodically. Slower scraping during business hours when real users are active.

This made traffic patterns look more human.
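
Random delays cover the pattern side. The rate limiting side can be sketched separately: a small per-host throttle that enforces a minimum gap between requests to the same site, with optional jitter. HostThrottle and its parameters are hypothetical names, and the injectable clock and sleep exist only so the logic can be checked without actually waiting.

```python
import random
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum gap (plus optional jitter) between requests per host."""

    def __init__(self, min_interval=2.0, jitter=0.0,
                 clock=time.monotonic, sleep=time.sleep, rng=random):
        self.min_interval = min_interval
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._rng = rng
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Sleep if this host was hit too recently; return the seconds slept."""
        host = urlparse(url).netloc
        now = self._clock()
        pause = max(0.0, self._next_allowed.get(host, now) - now)
        if pause > 0:
            self._sleep(pause)
        # schedule the next slot: fixed gap plus a random extra up to `jitter`
        gap = self.min_interval + self._rng.uniform(0, self.jitter)
        self._next_allowed[host] = now + pause + gap
        return pause
```

Calling throttle.wait(url) right before each request keeps the limit per site, so scraping several sites in one loop does not slow every site down to the strictest one.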

Rotating Proxies to Avoid IP Bans

Even with good behavior, scraping hundreds of pages from a single IP looks suspicious. Sites track request counts per IP.

I needed to rotate through different IPs.

import random

import requests

# list of proxy servers
proxies_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

def get_random_proxy():
    # pick one proxy and use it for both schemes so a single visit does not mix exit IPs
    proxy = random.choice(proxies_list)
    return {'http': proxy, 'https': proxy}

def scrape_with_proxy(url):
    proxy = get_random_proxy()

    try:
        response = requests.get(
            url,
            headers=headers,
            proxies=proxy,
            timeout=10
        )
        return response
    except requests.exceptions.RequestException as e:
        print(f"Proxy failed: {e}")
        return None

I used a proxy service that provided residential IPs. Requests came from different locations. Different ISPs. Much harder to detect as a scraper.

This eliminated most IP-based bans.
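
Proxies die, so a random pick from a fixed list keeps retrying dead ones. A minimal sketch of a pool that tracks consecutive failures and stops handing out proxies that keep failing. ProxyPool, max_failures, and the report methods are hypothetical names, not part of requests or of any proxy service.

```python
import random


class ProxyPool:
    """Hand out proxies at random, skipping ones that keep failing."""

    def __init__(self, proxies, max_failures=3, rng=random):
        self._failures = {p: 0 for p in proxies}  # consecutive failures per proxy
        self.max_failures = max_failures
        self._rng = rng

    def pick(self):
        """Return a requests-style proxies dict for one healthy proxy."""
        alive = [p for p, n in self._failures.items() if n < self.max_failures]
        if not alive:
            raise RuntimeError("no healthy proxies left")
        proxy = self._rng.choice(alive)
        return {"http": proxy, "https": proxy}

    def report_failure(self, proxies):
        """Count one more consecutive failure for the proxy that was used."""
        self._failures[proxies["http"]] += 1

    def report_success(self, proxies):
        """Reset the failure count after a request that worked."""
        self._failures[proxies["http"]] = 0
```

The dict returned by pick() can be passed as proxies= to requests.get, with report_failure called in the except branch and report_success after a good response.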

Handling JavaScript-Heavy Sites

Some sites loaded pricing data dynamically with JavaScript. My simple requests.get() returned empty pages.

I needed to execute JavaScript like a real browser.

from fake_useragent import UserAgent
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ua = UserAgent()

def scrape_javascript_site(url):
    options = Options()
    options.add_argument('--headless')
    options.add_argument(f'user-agent={ua.random}')

    driver = webdriver.Chrome(options=options)

    try:
        driver.get(url)

        # wait for content to load
        wait = WebDriverWait(driver, 10)
        products = wait.until(
            EC.presence_of_all_elements_located((By.CLASS_NAME, "product"))
        )

        data = []
        for product in products:
            name = product.find_element(By.TAG_NAME, "h2").text
            price = product.find_element(By.CLASS_NAME, "price").text
            data.append({'name': name, 'price': price})

        return data

    finally:
        # always close the browser, even if the wait times out
        driver.quit()

Selenium runs a real browser. Executes JavaScript. Renders pages fully. Much slower than requests, but it works on sites that require it.

I used Selenium only when necessary. Most sites worked fine with simple requests.
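
One way to decide when Selenium is necessary: fetch the page with plain requests first and check whether the elements you want are already in the static HTML. A minimal sketch using only the standard library, assuming the same "product" class as in the examples above. needs_browser is a hypothetical helper name, and the check is a heuristic, not a guarantee.

```python
from html.parser import HTMLParser


class _ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def needs_browser(html, class_name="product"):
    """Return True when the static HTML holds no element with that class."""
    counter = _ClassCounter(class_name)
    counter.feed(html)
    return counter.count == 0
```

If needs_browser(response.text) is False, the cheap requests path is enough for that site. If it is True, the content is probably rendered by JavaScript and the page is a candidate for the Selenium path.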

Respecting Aggressive Anti-Bot Systems

Some sites use Cloudflare or similar services that actively detect and block bots. CAPTCHA challenges. JavaScript challenges. Fingerprint detection.

For these sites, I had to decide if scraping was worth the effort.

import random
import time

from selenium_stealth import stealth
from seleniumwire import webdriver

def scrape_protected_site(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")

    driver = webdriver.Chrome(options=options)

    # stealth mode to avoid detection
    stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    try:
        driver.get(url)

        # wait for any challenges to resolve
        time.sleep(random.uniform(3, 7))

        # check if we got blocked
        if "challenge" in driver.page_source.lower():
            print("Detected challenge, aborting")
            return None

        # extract data (parse_page stands for your own parsing function)
        return parse_page(driver.page_source)

    finally:
        driver.quit()

Even with stealth techniques, some sites were too protected. I accepted that and moved on.

Fighting aggressive anti-bot systems is often not worth the legal and technical hassle.
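
Knowing when to back off starts with recognizing a block. A minimal sketch of a classifier for plain HTTP responses, assuming that status codes 403, 429, and 503 mean "blocked or slow down" and that a numeric Retry-After header, when present, gives the wait in seconds. The text markers are as crude as the "challenge" check above, and classify_response is a hypothetical name.

```python
BLOCK_STATUSES = {403, 429, 503}  # assumed "go away" or "slow down" responses
CHALLENGE_MARKERS = ("captcha", "challenge")  # crude text markers, easy to fool


def classify_response(status_code, headers, body_text):
    """Return (verdict, retry_after_seconds) for one response.

    verdict is "blocked", "challenge", or "ok". retry_after_seconds is an
    int only when the server sent a numeric Retry-After header.
    """
    if status_code in BLOCK_STATUSES:
        retry_after = headers.get("Retry-After", "")
        seconds = int(retry_after) if retry_after.isdigit() else None
        return "blocked", seconds
    lowered = body_text.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "challenge", None
    return "ok", None
```

With requests, this would be called as classify_response(r.status_code, r.headers, r.text). Anything other than "ok" is a signal to stop and reconsider that site, not to retry harder.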

Caching to Minimize Requests

I was re-scraping the same pages multiple times during development. Completely unnecessary load on servers.

import hashlib
import os
import pickle
import time

import requests

def cache_key(url):
    return hashlib.md5(url.encode()).hexdigest()

def get_cached(url, max_age=3600):
    key = cache_key(url)
    cache_file = f"cache/{key}.pkl"

    if os.path.exists(cache_file):
        # check age
        age = time.time() - os.path.getmtime(cache_file)
        if age < max_age:
            with open(cache_file, 'rb') as f:
                return pickle.load(f)

    return None

def cache_response(url, data):
    key = cache_key(url)
    cache_file = f"cache/{key}.pkl"

    os.makedirs('cache', exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(data, f)

def scrape_with_cache(url):
    # check cache first (compare to None so an empty cached body still counts as a hit)
    cached = get_cached(url, max_age=3600)
    if cached is not None:
        return cached

    # fetch fresh data
    response = requests.get(url, headers=headers)
    cache_response(url, response.content)

    return response.content

During development, I used cached responses. Only hit live sites when necessary. This reduced unnecessary traffic by 90%.

Error Handling for Unreliable Scraping

Scraping is inherently fragile. Sites change their HTML. Servers go down. Networks fail. Proxies die.

I needed robust error handling.

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=60)
)
def scrape_with_retry(url):
    try:
        response = requests.get(
            url,
            headers=headers,
            proxies=get_random_proxy(),
            timeout=10
        )

        response.raise_for_status()
        return response.content

    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        raise  # re-raise so tenacity can schedule the next attempt

def scrape_safely(urls):
    results = []
    failed = []

    for url in urls:
        try:
            content = scrape_with_retry(url)
            # parse_content stands for your own parsing function
            data = parse_content(content)
            results.append(data)

        except Exception as e:
            print(f"Failed to scrape {url}: {e}")
            failed.append(url)

    # log failures for later retry
    if failed:
        with open('failed_urls.txt', 'a') as f:
            f.write('\n'.join(failed) + '\n')

    return results

Automatic retries with exponential backoff. Logging failures. Continuing despite errors instead of crashing.

This made scraping reliable even when individual requests failed.
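
To make the backoff idea concrete, here is a hand-rolled sketch using only the standard library. It is not tenacity's exact schedule, just the general pattern: wait base seconds after the first failure, double the wait after each further failure, and never exceed cap. backoff_delays and call_with_retry are hypothetical names, and sleep is injectable only so the behavior can be checked without waiting.

```python
import time


def backoff_delays(attempts, base=4.0, cap=60.0):
    """Delays between attempts: base, 2*base, 4*base, ... each capped at cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts - 1)]


def call_with_retry(func, attempts=3, base=4.0, cap=60.0, sleep=time.sleep):
    """Call func() up to `attempts` times, sleeping with backoff in between.

    The exception from the final attempt is re-raised to the caller.
    """
    delays = backoff_delays(attempts, base, cap)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, let the caller log the failure
            sleep(delays[attempt])
```

With three attempts and the defaults, a request that keeps failing waits 4 seconds, then 8 seconds, then gives up, which is long enough for a brief outage to pass without hammering a struggling server.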

The Legal Considerations I Ignored

The biggest mistake was not thinking about legal implications. Just because data is public does not mean scraping it is legal.

Terms of service often explicitly prohibit scraping. Some sites have legal protections. Copyright issues can arise.

I consulted a lawyer after getting that cease and desist letter. Learned which sites I could legally scrape and which I could not.

Some data I wanted was simply off limits. I found alternative sources or bought the data instead.

What Actually Works Long-Term

Scraping at scale requires respecting the sites you scrape. Following robots.txt. Rate limiting appropriately. Using realistic request patterns.

It also requires accepting limitations. Not every site can or should be scraped. Sometimes paying for an API is the right answer.

My scraping infrastructure now runs reliably. Collects data from 20+ sites without issues. No bans. No legal threats.

Because I finally learned to scrape responsibly.

What data are you trying to scrape right now?

Thanks for reading!
