Modern Web Scraping: How to Actually Bypass Anti-Bot Systems

You built a web scraper. It worked for two days. Then Cloudflare caught on.
If this sounds familiar, you're not alone. Modern websites run increasingly sophisticated anti-bot systems, and the old playbook of rotating IPs no longer cuts it.
Here's what's actually happening behind the scenes and how to deal with it.
The Modern Anti-Bot Stack
Three major players dominate the anti-bot space:
Cloudflare powers protection for ~20% of all websites. They use TLS fingerprinting, JavaScript challenges, and behavioral analysis. Your scraper needs to look like a real browser at the network level, not just the application level.
DataDome focuses on device fingerprinting and tracks things like mouse movements, scroll patterns, and typing cadence. Even if your requests look legitimate, they're watching how you interact with the page.
PerimeterX injects client-side code that monitors request patterns and assigns bot scores in real-time. They're looking at everything from your font fingerprint to your WebGL renderer.
The tricky part? Most sites use multiple systems layered together. You're not beating one defense—you're beating three or four at once.
Why Your Current Setup Keeps Breaking
Let's walk through what actually happens when your scraper gets blocked.
Rotating residential IPs isn't enough.
You think: "I'll just rotate through residential proxies. Problem solved."
Reality: they're fingerprinting your TLS handshake. Your IP might look residential, but your TLS handshake screams "bot."
Selenium is detectable.
You think: "I'll use Selenium to look like a real browser."
Reality: Websites can detect WebDriver. There's literally a navigator.webdriver property they check. And even if you patch that, your browser fingerprint doesn't match real user behavior.
CAPTCHA solving services miss the bigger picture.
You think: "I'll pay for CAPTCHA solving. Done."
Reality: By the time you're seeing CAPTCHAs, you're already flagged. They're tracking your mouse movements before and after the CAPTCHA. Solving it doesn't unflag you.
What Actually Works
After building scrapers for three years and testing pretty much every approach, here's what actually works:
1. Real browser rendering
Not Selenium. Not Puppeteer with default settings. An actual browser that can't be fingerprinted as automation.
The difference: real browsers load fonts, render WebGL, handle canvas fingerprinting, and execute JavaScript exactly like a human user's browser would.
2. Residential proxy rotation at scale
You need a large pool of residential IPs that rotate intelligently. "Large" means millions of IPs, not thousands. "Intelligently" means the system knows when to rotate based on request patterns, not just time intervals.
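"Rotate intelligently" can be made concrete. Here's a minimal sketch of outcome-aware rotation: retire an IP when it starts returning block signals (403, 429, challenge pages) instead of swapping on a fixed timer. The class name, pool, and thresholds are illustrative, not any particular provider's API:

```python
import random

class ProxyRotator:
    """Outcome-aware proxy rotation sketch: rotate on block signals,
    not on a timer. Pool entries and thresholds are illustrative."""

    def __init__(self, proxies, max_failures=2):
        self.proxies = list(proxies)
        self.max_failures = max_failures
        self.failures = {p: 0 for p in self.proxies}
        self.current = random.choice(self.proxies)

    def report(self, status_code):
        # Treat block-ish responses as a signal to retire this IP for now
        if status_code in (403, 429, 503):
            self.failures[self.current] += 1
            if self.failures[self.current] >= self.max_failures:
                self.current = random.choice(
                    [p for p in self.proxies if p != self.current]
                )
        else:
            # A clean response resets the counter for this IP
            self.failures[self.current] = 0

    def get(self):
        return self.current
```

A real system would also track per-site block rates and cool-down periods per IP, but the core idea is the same: rotation driven by what the target is telling you.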
3. Proper TLS fingerprinting
Your TLS fingerprint needs to match the browser you're claiming to be. If you say you're Chrome 120 on Windows but your TLS handshake looks like Python's requests library, you're getting blocked.
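Anti-bot vendors commonly reduce the TLS handshake to a JA3-style hash: the MD5 of the ClientHello's version, cipher suites, extensions, curves, and point formats, compared against databases of known clients. A minimal sketch of the derivation (the field values below are illustrative, not real captures):

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """JA3-style fingerprint: MD5 over comma-joined ClientHello fields,
    each list dash-joined. Servers compare the hash against known
    browser and HTTP-library profiles."""
    fields = ",".join([
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return hashlib.md5(fields.encode()).hexdigest()

# Illustrative values only: a Python HTTP client and Chrome offer
# different cipher suites and extensions, so their hashes differ no
# matter what the User-Agent header claims.
library_hash = ja3_fingerprint(771, [4865, 4866], [0, 10, 11], [29, 23], [0])
chrome_hash = ja3_fingerprint(
    771, [4865, 4866, 4867, 49195], [0, 23, 65281, 10, 11], [29, 23, 24], [0]
)
```

This is why spoofing the User-Agent header alone achieves nothing: the mismatch is visible before your HTTP request is even sent.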
4. Behavioral mimicry
Random delays aren't enough. You need realistic mouse movements, scroll patterns, and interaction timing. This is the hardest part to get right.
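To make "realistic mouse movements" concrete, here's one common technique: interpolate along a quadratic Bézier curve with a random control point and per-point jitter, so the cursor arcs toward its target instead of teleporting in a straight line. This is a sketch of the idea, not a complete anti-detection solution:

```python
import random

def human_mouse_path(start, end, steps=25):
    """Generate a curved mouse path from start to end as (x, y) pixel
    points. A random control point bows the path like a hand movement;
    small jitter avoids perfectly smooth, machine-like curves."""
    (x0, y0), (x1, y1) = start, end
    cx = (x0 + x1) / 2 + random.uniform(-100, 100)
    cy = (y0 + y1) / 2 + random.uniform(-100, 100)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Quadratic Bezier interpolation plus per-point jitter
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        points.append((x + random.uniform(-2, 2), y + random.uniform(-2, 2)))
    return points
```

In practice you'd also vary the timing between points (faster mid-path, slower near the target) and occasionally overshoot and correct, since fingerprinting systems look at velocity profiles, not just coordinates.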
The Build vs Buy Decision
Here's the math I did when I was building my own scraping infrastructure:
Building it myself:
- 3 weeks to build initial version
- Residential proxy service: $89/month
- Custom spoofed Chromium build: $399/month
- CAPTCHA solving API: $30/month
- Monitoring and alerts: $15/month
- Maintenance time: ~10 hours/month
Total monthly cost: $533 + (10 hours × my hourly rate)
At my bill rate, that's $1,400 in time. Real cost: $1,933/month.
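Using the numbers above (with the $140/hour rate implied by $1,400 for 10 hours), the comparison works out like this:

```python
# Figures from the build-vs-buy comparison above
diy_services = 89 + 399 + 30 + 15       # monthly subscriptions: $533
maintenance_hours = 10
hourly_rate = 140                       # implied by $1,400 / 10 h
diy_total = diy_services + maintenance_hours * hourly_rate

managed_total = 65 + 0.5 * hourly_rate  # service fee + ~30 min upkeep

print(diy_total)      # 1933
print(managed_total)  # 135.0
```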
Using a managed service:
- Setup time: 2 hours
- Cost: $65/month for my volume
- Maintenance: maybe 30 minutes/month
The math was pretty clear. I was paying more to have more problems.
Practical Implementation
If you're building your own solution, here's what you need:
What most people do (and why it fails):

```python
# Naive approach: plain HTTP requests with Python's default
# TLS fingerprint and headers
import requests

url = "https://example.com"  # target page
response = requests.get(url)  # blocked almost immediately
```

What actually works:

```python
from playwright.sync_api import sync_playwright

url = "https://example.com"  # target page

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(
        viewport={'width': 1920, 'height': 1080},
        user_agent='Mozilla/5.0...'
    )
    page = context.new_page()
    page.goto(url)
    content = page.content()
    browser.close()
```
But even this isn't enough. You still need:
- Proxy rotation
- TLS fingerprint management
- Custom browser kernel
- CAPTCHA handling
- Request rate limiting
- Error handling and retries
Each of these is a project in itself.
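To give a sense of the work involved, here's what just one item from that list, error handling and retries, looks like as a minimal sketch: exponential backoff with jitter on rate-limit responses. The `fetch` callable is a placeholder for whatever client you use:

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=4, base_delay=1.0):
    """Retry transient failures with exponential backoff plus jitter.
    `fetch` is any callable(url) -> response-like object (placeholder)."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in (429, 503):
            return response
        # Back off 1s, 2s, 4s... with jitter so workers don't sync up
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
        time.sleep(delay)
    return response  # caller decides what to do with the final failure
```

And this sketch still ignores per-domain rate budgets, circuit breakers, and distinguishing a soft block from a genuine outage, which is the point: every bullet above hides this kind of depth.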
Integration with Workflow Tools
If you're using automation platforms like n8n, integrating scraping becomes even more powerful. Modern scraping APIs can plug directly into your workflow automation, letting you trigger scrapes based on events, process the data immediately, and route it to your database or analytics tools without writing custom code.
For teams already using n8n for workflow automation, a scraping API that integrates natively means you can build end-to-end data pipelines without managing separate scraping infrastructure. Schedule scrapes, process results, trigger notifications—all in one workflow.
When to Use Managed Infrastructure
Use managed scraping infrastructure when:
The sites you're scraping use modern anti-bot systems. If you're getting blocked by Cloudflare, DataDome, or PerimeterX, building your own solution will take months.
You need to scale beyond a few thousand requests per day. Managing proxy pools and browser instances at scale is hard. Really hard.
Your time is worth more than the service cost. If you're spending 10+ hours/month maintaining scrapers, the math probably doesn't work.
You need reliability. If your business depends on the data, you can't afford downtime while you debug why your scraper broke.
You're using workflow automation. If you're already building workflows in tools like n8n, native integrations eliminate the need for custom scraping code.
Real-World Performance
I tested four different approaches on 100 URLs with various anti-bot protections:
- DIY with Selenium + proxies: 43% success rate
- Bright Data: 89% success rate, $15/GB
- ScraperAPI: 76% success rate
- Evomi: 94% success rate, $0.14/1K requests
The difference in success rates matters more than it looks. If you need 1,000 data points and your success rate is 50%, you need to make 2,000 requests. At 94% success rate, you make 1,064 requests. That's half the cost and half the time.
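The arithmetic behind that comparison is just an expected-value estimate:

```python
import math

def requests_needed(data_points, success_rate):
    """Expected number of requests to collect `data_points` results
    at a given per-request success rate."""
    return math.ceil(data_points / success_rate)

print(requests_needed(1000, 0.50))  # 2000
print(requests_needed(1000, 0.94))  # 1064
```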
What About Legal Concerns?
Scraping itself isn't illegal, but you need to:
- Respect robots.txt (or have a good reason not to)
- Avoid overwhelming servers with requests
- Follow terms of service where applicable
- Avoid scraping personal data without a proper legal basis
- Use ethical proxy sources (EWDCI certification is a good indicator)
Most anti-bot systems exist because of bad actors who don't follow these rules. If you're scraping responsibly, managed services handle the technical compliance for you.
The Bottom Line
Modern anti-bot systems are sophisticated. They're checking your TLS fingerprint, tracking your mouse movements, analyzing your request patterns, and scoring your behavior in real-time.
You can build solutions for this yourself. But ask yourself: is proxy infrastructure your competitive advantage? Or is it what you do with the data?
For most teams, the answer is pretty clear. Build what makes your product unique. Buy the commodity infrastructure.
The time you save on infrastructure is time spent on features that actually differentiate your product. Try Evomi's Scraper API free with 250,000 requests.