Design a Web Crawler
Who Asks This Question?
Web crawler design is a classic at companies that process massive amounts of web data. Based on interview reports, it's frequently asked at:
- Google — They built the world's largest web crawler for search indexing; appears in both system design and infrastructure interviews
- Meta — Uses crawlers for link previews, content analysis, and social graph building
- Amazon — Alexa web crawler (now discontinued) and product price monitoring crawlers
- Microsoft — Bing search crawler and LinkedIn profile/content discovery
- Stripe — Website verification for merchant onboarding requires targeted crawling
- Cloudflare — Web security scanning and performance analysis across millions of sites
- Palantir — Data collection pipelines often include web crawling components
This question tests whether you understand the gap between "downloading a webpage" (easy) and "downloading the entire web without getting banned or crashing servers" (hard). Companies ask it to see if you've dealt with production-scale data collection challenges.
What the Interviewer Is Really Testing
Most candidates think this is about web scraping: "How do I parse HTML?" But parsing is only about 10% of the score. Here's the actual breakdown:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you ask about scale, politeness, and content types? |
| URL management | 20% | Can you design a frontier that handles billions of URLs efficiently? |
| Politeness & ethics | 20% | Do you respect robots.txt and implement crawl delays? |
| Distributed architecture | 25% | How do you coordinate crawlers across multiple machines? |
| Content handling | 10% | Parsing, deduplication, storage — the "easy" part |
| Production concerns | 10% | Failure recovery, monitoring, avoiding spider traps |
The #1 reason candidates fail: they focus on HTML parsing while ignoring politeness policies. A crawler that doesn't respect robots.txt or overwhelms servers isn't just bad design — it's unethical and potentially illegal. Interviewers want to see that you understand responsible crawling.
Step 1: Clarify Requirements
Questions You Must Ask
Don't jump into architecture. These questions fundamentally shape your design:
"What's the scale? How many pages do we need to crawl?" This determines everything. Crawling 1 million pages is different from crawling 50 billion. Large search engines crawl on the order of billions of pages per day. Your architecture must match the scale.
"What types of content — HTML only, or also images, PDFs, videos?" HTML parsing is straightforward. But if you need to extract text from PDFs or thumbnails from videos, that requires specialized workers and different storage patterns.
"Is this a one-time crawl or continuous monitoring?" One-time crawls can use simpler architectures. Continuous crawling needs freshness policies, change detection, and sophisticated scheduling.
"Do we need to respect robots.txt and implement crawl delays?" Always yes in production. This isn't optional — it's about being a good web citizen. Ignoring this shows you've never built a real crawler.
"What's our crawl politeness policy?" Different sites need different treatment. News sites might allow 10 requests/second, while personal blogs should get 1 request per 10 seconds.
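One lightweight way to encode such a policy is a per-host delay table with a conservative default for unknown hosts. A minimal sketch — the host names and delay values are illustrative, not recommendations:

```java
import java.util.Map;

public class CrawlPolicy {
    // Minimum milliseconds between requests to the same host.
    // These entries are illustrative examples only.
    private static final Map<String, Long> HOST_DELAYS_MS = Map.of(
        "news.example.com", 100L,    // large site: up to 10 req/s
        "blog.example.com", 10_000L  // small blog: 1 request per 10 s
    );
    private static final long DEFAULT_DELAY_MS = 5_000L; // conservative default

    public static long delayFor(String host) {
        return HOST_DELAYS_MS.getOrDefault(host, DEFAULT_DELAY_MS);
    }
}
```

In practice the delays would come from robots.txt `Crawl-delay` directives and observed server response times rather than a static table.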
Requirements You Should State
After questioning, explicitly state your assumptions:
Functional:
- Crawl 1 billion web pages per month (adjustable based on their answer)
- Extract text content, metadata, and outbound links
- Respect robots.txt and implement per-site crawl delays
- Support both fresh crawling and re-crawling for updates
Non-functional:
- Process 400 pages per second on average (1B pages / 30 days / 24h / 3600s)
- 99% uptime — crawler failures shouldn't require manual intervention
- Polite crawling — never overwhelm any single server
- Storage efficient — handle duplicate content detection
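These numbers are worth sanity-checking with quick arithmetic. A sketch assuming an average page size of about 100 KB (an illustrative figure; the 400 pages/second above is this calculation rounded up):

```java
public class CapacityEstimate {
    // 1B pages / 30 days / 86,400 s/day ≈ 385 pages/s
    public static long pagesPerSecond(long pagesPerMonth) {
        return pagesPerMonth / (30L * 24 * 3600);
    }

    // Raw HTML storage per month, before compression and deduplication
    public static long storageTerabytes(long pagesPerMonth, long avgPageBytes) {
        return pagesPerMonth * avgPageBytes / 1_000_000_000_000L;
    }
}
```

At 1 billion pages and 100 KB per page, that's roughly 100 TB of raw HTML per month — which is why duplicate detection and compression matter.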
Step 2: High-Level Architecture
Core Components
[Seed URLs] → [URL Frontier] → [Crawler Workers]
                    ↑                  ↓
              [URL Manager] ← [Content Processor]
                    ↑                  ↓
          [Robots.txt Cache]  [Duplicate Detector]
                                       ↓
                              [Content Storage]
URL Frontier: The brain of the crawler. Manages which URLs to crawl next, enforces politeness, and prioritizes important pages.
Crawler Workers: Fetch web pages, handle redirects, and deal with various HTTP response codes.
Content Processor: Extracts links, processes content, and feeds new URLs back to the frontier.
Duplicate Detector: Prevents crawling the same content multiple times using bloom filters and content hashing.
Robots.txt Cache: Stores and interprets robots.txt files to ensure compliant crawling.
Request Flow
- URL Selection: Frontier selects the next URL to crawl based on priority and politeness constraints
- Robots Check: Verify the URL is allowed per robots.txt
- HTTP Fetch: Download the page with proper headers and timeout handling
- Content Processing: Parse HTML, extract links and content
- Duplicate Check: Hash content to detect duplicates
- Storage: Store unique content and metadata
- Link Extraction: Feed new URLs back to the frontier
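Step 2 of the flow can be sketched as a minimal Disallow-prefix matcher. Real robots.txt handling (wildcards, Allow precedence, Crawl-delay, user-agent groups) is more involved; this shows only the basic prefix rule:

```java
import java.util.List;

public class RobotsRules {
    private final List<String> disallowedPrefixes;

    public RobotsRules(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = disallowedPrefixes;
    }

    // A path is allowed unless it starts with a Disallow prefix
    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The fetched robots.txt file and its parsed rules would live in the Robots.txt Cache, keyed by host, with a TTL so changes are picked up.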
Strong candidates emphasize the feedback loop: crawling generates new URLs to crawl. Managing this loop efficiently — without running out of memory or losing URLs — is the core technical challenge.
Step 3: Deep Dive — URL Frontier Design
The URL frontier is where weak and strong answers diverge. This component manages billions of URLs while enforcing complex politeness constraints.
Challenge 1: Politeness Implementation
You can't just keep URLs in a simple queue. Different websites need different crawl delays:
Host-based Queue Architecture:
URLs to Crawl:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ amazon.com │ │ github.com │ │ reddit.com │
│ Queue │ │ Queue │ │ Queue │
│ - url1 │ │ - url5 │ │ - url8 │
│ - url2 │ │ - url6 │ │ - url9 │
│ - url3 │ │ - url7 │ │ - url10 │
└─────────────┘ └─────────────┘ └─────────────┘
↓ 1/sec ↓ 5/sec ↓ 2/sec
Each host gets its own queue with its own crawl delay. This prevents overwhelming any single server while maintaining high overall throughput.
public class PoliteFrontier {
    private final Map<String, HostQueue> hostQueues = new HashMap<>();
    // Min-heap ordered by the earliest time each host may be crawled again
    private final PriorityQueue<ScheduledHost> readyHosts =
        new PriorityQueue<>(Comparator.comparingLong(h -> h.nextCrawlTime));
    private final RobotsCache robotsCache;

    private static class HostQueue {
        final Queue<URL> urls = new LinkedList<>();
        int crawlDelayMs; // from robots.txt Crawl-delay, or a default
    }

    private static class ScheduledHost {
        final HostQueue queue;
        final long nextCrawlTime;
        ScheduledHost(HostQueue queue, long nextCrawlTime) {
            this.queue = queue;
            this.nextCrawlTime = nextCrawlTime;
        }
    }

    public synchronized URL getNextUrl() {
        ScheduledHost host = readyHosts.peek();
        if (host == null || host.nextCrawlTime > System.currentTimeMillis()) {
            return null; // No host is ready yet — come back later
        }
        readyHosts.poll();
        URL url = host.queue.urls.poll();
        // Reschedule this host if it still has URLs pending
        if (!host.queue.urls.isEmpty()) {
            readyHosts.offer(new ScheduledHost(
                host.queue,
                System.currentTimeMillis() + host.queue.crawlDelayMs));
        }
        return url;
    }
}
Challenge 2: URL Prioritization
Not all URLs are equal. You want to crawl important pages first:
Priority Factors:
- PageRank/Authority: High-authority pages are crawled more frequently
- Freshness: News sites need more frequent updates than static documentation
- User Interest: Pages users actually visit get higher priority
- Depth: Pages closer to the root are often more important
Multi-level Priority Architecture:
public class PrioritizedFrontier {
    // Each host queue holds sub-queues per priority level: 0=highest, 3=lowest

    public void addUrl(URL url, int priority) {
        String host = url.getHost();
        HostQueue hostQueue = getOrCreateHostQueue(host);
        // Add to the appropriate priority level for this host
        hostQueue.addToPriority(url, priority);
    }

    public URL getNextUrl() {
        // Scan priority levels from highest to lowest across ready hosts
        for (int priority = 0; priority < NUM_PRIORITY_LEVELS; priority++) {
            URL url = getUrlFromPriorityLevel(priority);
            if (url != null) return url;
        }
        return null;
    }
}
Challenge 3: URL Deduplication
With billions of URLs, you'll encounter massive duplication: the same content served at different URLs, redirect chains, and trivial URL variations.
Bloom Filter for Fast Rejection:
public class URLDeduplicator {
    private final BloomFilter<String> seenUrls;  // e.g. Guava's BloomFilter
    private final Set<String> confirmedUrls;     // smaller, exact set

    public boolean isDuplicate(URL url) {
        String canonical = canonicalize(url);
        // Fast path: a bloom "no" is definitely new
        if (!seenUrls.mightContain(canonical)) {
            seenUrls.put(canonical);
            return false;
        }
        // A bloom "yes" may be a false positive — check the exact set.
        // Trade-off: a true repeat slips through once (crawled twice)
        // before landing in confirmedUrls; after that it is always caught.
        if (confirmedUrls.contains(canonical)) {
            return true;
        }
        confirmedUrls.add(canonical);
        return false;
    }

    private String canonicalize(URL url) {
        // Simplified normalization; real canonicalizers also sort query
        // params, resolve percent-encoding, and strip default ports
        return url.toString()
            .toLowerCase()
            .replaceAll("#.*", "")             // Remove fragments first
            .replaceAll("[?&]utm_[^&]*", "")   // Remove tracking params
            .replaceAll("/$", "");             // Remove trailing slash
    }
}
Content-based Deduplication: Even different URLs can serve identical content. Use content hashing after download:
public String contentHash(String htmlContent) {
    // Remove dynamic elements before hashing; (?s) lets . match newlines
    String cleaned = htmlContent
        .replaceAll("(?s)<!--.*?-->", "")           // Comments
        .replaceAll("(?is)<script.*?</script>", "") // JavaScript
        .replaceAll("\\s+", " ")                    // Normalize whitespace
        .trim();
    return DigestUtils.sha256Hex(cleaned); // Apache Commons Codec
}
Step 4: Deep Dive — Distributed Architecture
Horizontal Scaling Strategy
A single crawler can't handle billions of pages. You need to distribute the work across multiple machines while maintaining coordination:
Option 1: Centralized Frontier (Simple)
[Central URL Frontier/Database]
↓ ↓ ↓
[Crawler 1] [Crawler 2] [Crawler 3]
Workers pull URLs from a central frontier. Simple to implement but the frontier becomes a bottleneck at scale.
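A minimal version of this pull model is a shared queue that workers poll — a sketch that leaves out politeness and durability:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CentralFrontier {
    // In production this would be a durable store (database, log, etc.),
    // not an in-memory queue; this sketch shows only the pull model.
    private final BlockingQueue<String> urls = new LinkedBlockingQueue<>();

    public void add(String url) {
        urls.offer(url);
    }

    // Returns null when empty; a real worker would use take() to block
    public String poll() {
        return urls.poll();
    }
}
```

Every worker competing on this one queue is exactly the bottleneck the text describes — each poll is a round trip to the central store.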
Option 2: Distributed Hash-based Assignment (Scalable)
URLs distributed by hash(hostname) % num_workers
Worker 1: amazon.com, google.com
Worker 2: github.com, reddit.com
Worker 3: stackoverflow.com, wikipedia.org
Each worker is responsible for specific hosts. This maintains politeness (one worker per host) while distributing load:
public class DistributedCrawler {
    private final int workerId;
    private final int totalWorkers;

    public boolean shouldProcessHost(String hostname) {
        // floorMod avoids the negative result that Math.abs(hashCode())
        // produces when hashCode() is Integer.MIN_VALUE
        return Math.floorMod(hostname.hashCode(), totalWorkers) == workerId;
    }

    public void processNewUrl(URL url) {
        if (shouldProcessHost(url.getHost())) {
            localFrontier.add(url);
        } else {
            forwardToCorrectWorker(url);
        }
    }
}
Challenge: DNS Resolution Caching
At scale, DNS lookups become a major bottleneck. Don't look up amazon.com for every single Amazon page:
public class DNSCache {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final Map<String, Long> expiry = new ConcurrentHashMap<>();

    public InetAddress resolve(String hostname) {
        Long exp = expiry.get(hostname);
        if (exp != null && System.currentTimeMillis() < exp) {
            return cache.get(hostname); // null here means a cached failure
        }
        try {
            InetAddress addr = InetAddress.getByName(hostname);
            cache.put(hostname, addr);
            expiry.put(hostname, System.currentTimeMillis() + 3_600_000); // 1h TTL
            return addr;
        } catch (UnknownHostException e) {
            // Cache negative results too, with a shorter TTL; drop any
            // stale positive entry so it isn't returned by mistake
            cache.remove(hostname);
            expiry.put(hostname, System.currentTimeMillis() + 300_000); // 5min TTL
            return null;
        }
    }
}
Challenge: Checkpointing and Fault Tolerance
Crawlers run for weeks or months. They must survive machine failures and restarts:
Checkpoint Strategy:
public class CheckpointManager {
    public void saveCheckpoint() {
        CheckpointData data = new CheckpointData(
            frontier.getState(),
            seenUrls.serialize(),
            progress.getUrlsProcessed());
        // Write to a temp file, then rename — an atomic swap on most
        // filesystems, so a crash mid-write can't corrupt the checkpoint
        File temp = new File("checkpoint.tmp");
        writeToFile(data, temp);
        temp.renameTo(new File("checkpoint.dat"));
    }

    public void restore() {
        File checkpoint = new File("checkpoint.dat");
        if (checkpoint.exists()) {
            CheckpointData data = readFromFile(checkpoint);
            frontier.restore(data.frontierState);
            seenUrls.restore(data.bloomFilter);
            log.info("Restored from checkpoint: {} URLs processed",
                data.urlsProcessed);
        }
    }
}
Challenge: Handling Spider Traps
Some sites create infinite URL spaces to waste crawler resources:
Calendar trap: site.com/calendar/2024/01/01, site.com/calendar/2024/01/02, ... (generates URLs forever)
Session trap: site.com/login?session=abc123, site.com/login?session=def456, ... (each URL creates new session URLs)
Detection and Mitigation:
public class SpiderTrapDetector {
    private static final int MAX_DEPTH_PER_HOST = 15;
    private static final int MAX_URLS_PER_PATTERN = 1000;
    // Per host: how many URLs we've seen matching each path pattern
    private final Map<String, Map<String, Integer>> patternCounts = new HashMap<>();

    public boolean isSpiderTrap(URL url) {
        // Depth limit: very deep paths are usually generated, not authored
        if (calculateDepth(url) > MAX_DEPTH_PER_HOST) {
            return true;
        }
        // Pattern limit: thousands of URLs fitting one template is a trap signal
        String pattern = extractPattern(url.getPath());
        int count = patternCounts
            .computeIfAbsent(url.getHost(), h -> new HashMap<>())
            .merge(pattern, 1, Integer::sum);
        return count > MAX_URLS_PER_PATTERN;
    }

    private String extractPattern(String path) {
        // Replace numbers and hex IDs with placeholders so that
        // /calendar/2024/01/02 and /calendar/2024/01/03 share one pattern
        return path.replaceAll("\\d+", "*")
                   .replaceAll("/[a-f0-9]{8,}", "/*");
    }
}
Step 5: Content Processing and Storage
HTML Parsing and Link Extraction
Once you have the HTML content, extract useful information:
public class ContentProcessor {
    public ProcessedContent process(String html, URL sourceUrl) {
        Document doc = Jsoup.parse(html, sourceUrl.toString());
        // Extract metadata
        String title = doc.title();
        String description = doc.select("meta[name=description]").attr("content");
        // Extract text content (remove navigation, ads, etc.)
        String mainContent = extractMainContent(doc);
        // Extract outbound links, resolved against the page's base URL
        List<URL> links = doc.select("a[href]")
            .stream()
            .map(element -> element.attr("abs:href"))
            .filter(href -> isValidUrl(href))
            .map(this::parseUrl)
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
        return new ProcessedContent(title, description, mainContent, links);
    }

    private String extractMainContent(Document doc) {
        // Remove scripts, styles, navigation
        doc.select("script, style, nav, header, footer, aside").remove();
        // Prefer semantic main-content containers if present
        Elements mainElements = doc.select("main, article, .content, #content");
        if (!mainElements.isEmpty()) {
            return mainElements.text();
        }
        return doc.body().text();
    }
}
Freshness Policy
Different content has different update frequencies. News articles change frequently, while legal documents rarely change:
public class FreshnessManager {
    public long getNextCrawlTime(URL url, LocalDateTime lastCrawled) {
        ContentType type = classifyContent(url);
        Duration interval = getFreshnessInterval(type);
        // LocalDateTime needs an explicit offset to produce an epoch timestamp
        return lastCrawled.plus(interval).toEpochSecond(ZoneOffset.UTC);
    }

    private Duration getFreshnessInterval(ContentType type) {
        switch (type) {
            case NEWS: return Duration.ofHours(1);
            case BLOG: return Duration.ofDays(1);
            case PRODUCT_PAGE: return Duration.ofDays(7);
            case DOCUMENTATION: return Duration.ofDays(30);
            default: return Duration.ofDays(14);
        }
    }
}
Common Mistakes
These patterns appear in real interview feedback:
Mistake 1: Ignoring Robots.txt
Describing a crawler that ignores robots.txt shows you don't understand web etiquette. Always mention checking robots.txt and implementing crawl-delay directives.
Mistake 2: Single-threaded Design
A single-threaded crawler downloading one page at a time will never scale. You need concurrent workers, but with per-host rate limiting.
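One way to combine concurrency with per-host limits is a semaphore per host: many worker threads run in parallel, but at most one request per host is in flight at a time. A sketch (class and method names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class HostLimiter {
    // One permit per host: any number of worker threads, but at most
    // one in-flight request to any single host at a time
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    // Returns true if the caller may fetch from this host now
    public boolean tryAcquire(String host) {
        return perHost.computeIfAbsent(host, h -> new Semaphore(1)).tryAcquire();
    }

    // Must be called after the fetch completes (or fails)
    public void release(String host) {
        perHost.get(host).release();
    }
}
```

A worker that fails to acquire the permit skips that URL and moves on; the frontier's crawl-delay scheduling then decides when the host is eligible again.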
Mistake 3: No Duplicate Handling
Without deduplication, crawlers get trapped in redirect loops and waste resources on identical content. Bloom filters for URLs and content hashing are essential.
Mistake 4: Focusing Only on HTML
Real crawlers encounter PDFs, images, videos, and other content types. Discuss how you'd handle different MIME types and whether to extract text from PDFs.
Mistake 5: No Failure Recovery
Crawlers run for weeks. If a machine crashes and you lose all progress, that's unacceptable. Mention checkpointing and the ability to resume from where you left off.
Mistake 6: Overwhelming Servers
Sending 100 simultaneous requests to a small blog is irresponsible and will get your crawler banned. Emphasize politeness and per-host rate limiting.
Interviewer Follow-Up Questions
Prepare for these common follow-ups:
"How would you handle JavaScript-heavy sites?" Traditional crawlers only see the initial HTML, not content loaded by JavaScript. Options:
- Headless browser automation (Playwright, Puppeteer) — slower but handles modern sites
- API discovery — many sites have REST APIs that are more efficient than scraping
- Server-side rendering detection — some sites provide different content to crawlers
"What about content behind login walls?" This gets into ethics and legality. Generally, respect access controls:
- Only crawl publicly accessible content
- If you need authenticated content, use official APIs
- Consider rate limiting more aggressively for authenticated crawling
"How do you detect when a website structure changes?" Structure changes break extraction logic. Options:
- Content diff analysis — compare new crawls to previous crawls
- Schema extraction — detect common patterns in HTML structure
- Machine learning — train models to identify content areas
"What about handling different languages and encodings?" Character encoding detection and proper Unicode handling:
- Check HTTP Content-Type header
- Detect encoding from HTML meta tags
- Use libraries like ICU4J for proper text processing
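The first two checks can be sketched as a fallback chain — a simplified regex-based version; production parsers also handle BOMs and byte-level pre-scans:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingDetector {
    // Matches charset=... in both HTTP headers and HTML meta tags
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Prefer the HTTP header; fall back to the HTML meta tag; default to UTF-8
    public static Charset detect(String contentTypeHeader, String htmlPrefix) {
        Charset fromHeader = fromCharsetParam(contentTypeHeader);
        if (fromHeader != null) return fromHeader;
        Charset fromMeta = fromCharsetParam(htmlPrefix);
        if (fromMeta != null) return fromMeta;
        return StandardCharsets.UTF_8;
    }

    private static Charset fromCharsetParam(String text) {
        if (text == null) return null;
        Matcher m = CHARSET_PARAM.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }
}
```

Only the first few kilobytes of the document need to be scanned for the meta tag, since HTML requires the declaration to appear early in the head.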
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: scale, content types, politeness, continuous vs one-time |
| 5-10 min | High-level architecture: frontier, workers, processors, storage |
| 10-20 min | URL frontier deep dive: politeness, prioritization, deduplication |
| 20-28 min | Distributed challenges: worker coordination, DNS caching, spider traps |
| 28-33 min | Content processing: parsing, freshness, checkpointing |
| 33-35 min | Wrap up: monitoring, failure modes, ethical considerations |
The web crawler interview tests your ability to build respectful, scalable data collection systems. The technical challenges — politeness, deduplication, distributed coordination — are what separate experienced engineers from those who've only built toy scrapers. Emphasize being a good web citizen while handling internet-scale data.