The Problem
Websites deploy multi-layered anti-scraping defenses: IP blocking, rate limiting, CAPTCHAs, and increasingly, TLS/HTTP/2 fingerprinting that detects automated clients regardless of IP rotation. For teams maintaining legacy TypeScript codebases built on axios or Node's native http module, the challenge is hardening those scrapers against modern defenses without breaking functionality that already works.
What Works in Legacy Systems
Proxy rotation remains foundational. Residential proxies outperform datacenter IPs for evading blocks, as they mimic legitimate user traffic patterns. A basic implementation cycles through proxy pools:
// Pool of proxy endpoints to rotate through (residential IPs in practice)
const proxies = [
  { host: 'proxy1.example.com', port: 8080 },
  { host: 'proxy2.example.com', port: 8080 }
];

// Pick a proxy at random for each outgoing request
function getRandomProxy() {
  return proxies[Math.floor(Math.random() * proxies.length)];
}
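The throttling snippet below calls a fetchWithProxy helper. One minimal way to wire the pool into a legacy axios codebase (a sketch only, assuming axios's built-in proxy option and plain, unauthenticated HTTP proxies) might look like this:

import axios from 'axios';

// Sketch: route each request through a randomly chosen proxy from the pool above
async function fetchWithProxy(url: string) {
  const { host, port } = getRandomProxy();
  return axios.get(url, {
    proxy: { host, port },
    timeout: 10_000 // fail fast on dead or slow proxies
  });
}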
Request throttling with randomized delays (2-5 seconds) mimics human behavior:
// Minimal promise-based sleep helper
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function scrapeData(urls: string[]) {
  for (const url of urls) {
    // Random 2-5 second pause before each request
    await sleep(Math.random() * 3000 + 2000);
    const response = await fetchWithProxy(url);
    // ...parse and persist the response here
  }
}
Header rotation (User-Agent, Referer) adds another evasion layer, though sophisticated sites now fingerprint TLS handshakes and HTTP/2 prioritization, which headers alone can't mask.
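A rough sketch of header rotation follows; the User-Agent strings are illustrative placeholders, and a real project would keep the pool current:

// Illustrative User-Agent pool; keep these values updated in a real deployment
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
];

// Build a randomized header set for each request
function randomHeaders(referer = 'https://www.google.com/') {
  return {
    'User-Agent': userAgents[Math.floor(Math.random() * userAgents.length)],
    'Referer': referer,
    'Accept-Language': 'en-US,en;q=0.9'
  };
}

These headers can be merged into the config passed to fetchWithProxy via axios's headers option.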
The Trade-offs
These tactics suit small-scale projects but fail at enterprise scale. Sites deploying WAFs and behavioral analysis detect patterns across IP pools. Headless browsers (Puppeteer, Playwright) handle JavaScript-rendered content but introduce resource overhead in legacy systems not designed for browser automation.
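Where JavaScript rendering is unavoidable, a minimal Puppeteer sketch (headless Chromium, with the resource overhead noted above) looks roughly like this:

import puppeteer from 'puppeteer';

// Render a JavaScript-heavy page in headless Chromium and return its HTML
async function renderPage(url: string): Promise<string> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.content();
  } finally {
    await browser.close();
  }
}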
Commercial scraping APIs (Zyte, ScrapingBee) offload proxy management and CAPTCHA solving, trading cost for reliability. For organizations where data collection supports revenue-critical workflows, the build-versus-buy calculus increasingly favors external services.
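The integration surface for these services is usually a single HTTP call. The sketch below shows only the general pattern; the endpoint and parameter names are placeholders, not any specific provider's API:

import axios from 'axios';

// Generic pattern: the provider handles proxies, retries, and CAPTCHAs server-side.
// PROVIDER_ENDPOINT and the parameter names are placeholders; check your provider's docs.
const PROVIDER_ENDPOINT = 'https://api.scraping-provider.example/v1/';

async function fetchViaProvider(targetUrl: string, apiKey: string) {
  const response = await axios.get(PROVIDER_ENDPOINT, {
    params: { api_key: apiKey, url: targetUrl, render_js: true }
  });
  return response.data; // rendered HTML returned by the service
}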
Legal Context Matters
Ignoring robots.txt or terms of service invites legal risk alongside technical blocks. Enterprise teams should validate scraping compliance before deploying evasion techniques.
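One lightweight guard is checking each URL against the site's robots.txt before it is queued. This sketch assumes the robots-parser npm package as a dependency:

import axios from 'axios';
import robotsParser from 'robots-parser'; // assumed third-party dependency

// Check a target URL against the site's robots.txt before scraping it
async function isAllowedByRobots(targetUrl: string, userAgent: string): Promise<boolean> {
  const robotsUrl = new URL('/robots.txt', targetUrl).toString();
  const { data } = await axios.get(robotsUrl, { responseType: 'text' });
  const robots = robotsParser(robotsUrl, data);
  return robots.isAllowed(targetUrl, userAgent) ?? true; // treat "no matching rule" as allowed
}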
What to Watch
Defensive technology evolves faster than evasion methods. Sites adding Cloudflare Turnstile or similar CAPTCHA systems require solver integrations that legacy codebases struggle to accommodate without significant refactoring. The pattern is clear: tactical fixes buy time, but long-term data acquisition strategies need architectural review.
Bottom line: Proxy rotation and throttling still work for basic scraping. For production systems facing modern defenses, evaluate whether legacy code modifications justify the maintenance burden versus transitioning to managed scraping infrastructure.