Design a Web Crawler
Who Asks This Question?
Web crawler design is a classic at companies that process massive amounts of web data. Based on interview reports, it's frequently asked at:
- Google — They built the world's largest web crawler for search indexing; appears in both system design and infrastructure interviews
- Meta — Uses crawlers for link previews, content analysis, and social graph building
- Amazon — Alexa web crawler (now discontinued) and product price monitoring crawlers
- Microsoft — Bing search crawler and LinkedIn profile/content discovery
- Stripe — Website verification for merchant onboarding requires targeted crawling
- Cloudflare — Web security scanning and performance analysis across millions of sites
- Palantir — Data collection pipelines often include web crawling components
This question tests whether you understand the gap between "downloading a webpage" (easy) and "downloading the entire web without getting banned or crashing servers" (hard). Companies ask it to see if you've dealt with production-scale data collection challenges.
What the Interviewer Is Really Testing
Most candidates think this is about web scraping: "How do I parse HTML?" But parsing is only about 10% of the score. Here's the actual breakdown:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you ask about scale, politeness, and content types? |
| URL management | 20% | Can you design a frontier that handles billions of URLs efficiently? |
| Politeness & ethics | 20% | Do you respect robots.txt and implement crawl delays? |
| Distributed architecture | 25% | How do you coordinate crawlers across multiple machines? |
| Content handling | 10% | Parsing, deduplication, storage — the "easy" part |
| Production concerns | 10% | Failure recovery, monitoring, avoiding spider traps |
The #1 reason candidates fail: they focus on HTML parsing while ignoring politeness policies. A crawler that doesn't respect robots.txt or overwhelms servers isn't just bad design — it's unethical and potentially illegal. Interviewers want to see that you understand responsible crawling.
Step 1: Clarify Requirements
Questions You Must Ask
Don't jump into architecture. These questions fundamentally shape your design:
"What's the scale? How many pages do we need to crawl?" This determines everything. Crawling 1 million pages is different from crawling 50 billion. Large search engines crawl on the order of billions of pages per day. Your architecture must match the scale.
"What types of content — HTML only, or also images, PDFs, videos?" HTML parsing is straightforward. But if you need to extract text from PDFs or thumbnails from videos, that requires specialized workers and different storage patterns.
"Is this a one-time crawl or continuous monitoring?" One-time crawls can use simpler architectures. Continuous crawling needs freshness policies, change detection, and sophisticated scheduling.
"Do we need to respect robots.txt and implement crawl delays?" Always yes in production. This isn't optional — it's about being a good web citizen. Ignoring this shows you've never built a real crawler.
"What's our crawl politeness policy?" Different sites need different treatment. News sites might allow 10 requests/second, while personal blogs should get 1 request per 10 seconds.
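One lightweight way to encode such a policy is a per-host delay table with a conservative default for unknown hosts. A minimal sketch — the host names and delay values are illustrative, not recommendations:

```java
import java.util.Map;

public class CrawlPolicy {
    // Minimum milliseconds between requests to the same host.
    // These entries are illustrative examples only.
    private static final Map<String, Long> HOST_DELAYS_MS = Map.of(
        "news.example.com", 100L,    // large site: up to 10 req/s
        "blog.example.com", 10_000L  // small blog: 1 request per 10 s
    );
    private static final long DEFAULT_DELAY_MS = 5_000L; // conservative default

    public static long delayFor(String host) {
        return HOST_DELAYS_MS.getOrDefault(host, DEFAULT_DELAY_MS);
    }
}
```

In practice the delays would come from robots.txt `Crawl-delay` directives and observed server response times rather than a static table.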
Requirements You Should State
After questioning, explicitly state your assumptions:
Functional:
- Crawl 1 billion web pages per month (adjustable based on their answer)
- Extract text content, metadata, and outbound links
- Respect robots.txt and implement per-site crawl delays
- Support both fresh crawling and re-crawling for updates
Non-functional:
- Process 400 pages per second on average (1B pages / 30 days / 24h / 3600s)
- 99% uptime — crawler failures shouldn't require manual intervention
- Polite crawling — never overwhelm any single server
- Storage efficient — handle duplicate content detection
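These numbers are worth sanity-checking with quick arithmetic. A sketch assuming an average page size of about 100 KB (an illustrative figure; the 400 pages/second above is this calculation rounded up):

```java
public class CapacityEstimate {
    // 1B pages / 30 days / 86,400 s/day ≈ 385 pages/s
    public static long pagesPerSecond(long pagesPerMonth) {
        return pagesPerMonth / (30L * 24 * 3600);
    }

    // Raw HTML storage per month, before compression and deduplication
    public static long storageTerabytes(long pagesPerMonth, long avgPageBytes) {
        return pagesPerMonth * avgPageBytes / 1_000_000_000_000L;
    }
}
```

At 1 billion pages and 100 KB per page, that's roughly 100 TB of raw HTML per month — which is why duplicate detection and compression matter.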
Step 2: High-Level Architecture
Core Components
[Seed URLs] → [URL Frontier] → [Crawler Workers]
                    ↑                  ↓
              [URL Manager] ← [Content Processor]
                    ↑                  ↓
          [Robots.txt Cache]  [Duplicate Detector]
                                       ↓
                              [Content Storage]
URL Frontier: The brain of the crawler. Manages which URLs to crawl next, enforces politeness, and prioritizes important pages.
Crawler Workers: Fetch web pages, handle redirects, and deal with various HTTP response codes.
Content Processor: Extracts links, processes content, and feeds new URLs back to the frontier.
Duplicate Detector: Prevents crawling the same content multiple times using bloom filters and content hashing.
Robots.txt Cache: Stores and interprets robots.txt files to ensure compliant crawling.
Request Flow
- URL Selection: Frontier selects the next URL to crawl based on priority and politeness constraints
- Robots Check: Verify the URL is allowed per robots.txt
- HTTP Fetch: Download the page with proper headers and timeout handling
- Content Processing: Parse HTML, extract links and content
- Duplicate Check: Hash content to detect duplicates
- Storage: Store unique content and metadata
- Link Extraction: Feed new URLs back to the frontier
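Step 2 of the flow can be sketched as a minimal Disallow-prefix matcher. Real robots.txt handling (wildcards, Allow precedence, Crawl-delay, user-agent groups) is more involved; this shows only the basic prefix rule:

```java
import java.util.List;

public class RobotsRules {
    private final List<String> disallowedPrefixes;

    public RobotsRules(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = disallowedPrefixes;
    }

    // A path is allowed unless it starts with a Disallow prefix
    public boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

The fetched robots.txt file and its parsed rules would live in the Robots.txt Cache, keyed by host, with a TTL so changes are picked up.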
Strong candidates emphasize the feedback loop: crawling generates new URLs to crawl. Managing this loop efficiently — without running out of memory or losing URLs — is the core technical challenge.
Step 3: Deep Dive — URL Frontier Design
The URL frontier is where weak and strong answers diverge. This component manages billions of URLs while enforcing complex politeness constraints.
Challenge 1: Politeness Implementation
You can't just keep URLs in a simple queue. Different websites need different crawl delays:
Host-based Queue Architecture:
URLs to Crawl:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ amazon.com │ │ github.com │ │ reddit.com │
│ Queue │ │ Queue │ │ Queue │
│ - url1 │ │ - url5 │ │ - url8 │
│ - url2 │ │ - url6 │ │ - url9 │
│ - url3 │ │ - url7 │ │ - url10 │
└─────────────┘ └─────────────┘ └─────────────┘
↓ 1/sec ↓ 5/sec ↓ 2/sec
Each host gets its own queue with its own crawl delay. This prevents overwhelming any single server while maintaining high overall throughput.
public class PoliteFrontier {
    private final Map<String, HostQueue> hostQueues = new HashMap<>();
    // Min-heap ordered by the earliest time each host may be crawled again
    private final PriorityQueue<ScheduledHost> readyHosts =
        new PriorityQueue<>(Comparator.comparingLong(h -> h.nextCrawlTime));
    private final RobotsCache robotsCache;

    private static class HostQueue {
        final Queue<URL> urls = new LinkedList<>();
        int crawlDelayMs; // from robots.txt Crawl-delay, or a default
    }

    private static class ScheduledHost {
        final HostQueue queue;
        final long nextCrawlTime;
        ScheduledHost(HostQueue queue, long nextCrawlTime) {
            this.queue = queue;
            this.nextCrawlTime = nextCrawlTime;
        }
    }

    public synchronized URL getNextUrl() {
        ScheduledHost host = readyHosts.peek();
        if (host == null || host.nextCrawlTime > System.currentTimeMillis()) {
            return null; // No host is ready yet — come back later
        }
        readyHosts.poll();
        URL url = host.queue.urls.poll();
        // Reschedule this host if it still has URLs pending
        if (!host.queue.urls.isEmpty()) {
            readyHosts.offer(new ScheduledHost(
                host.queue,
                System.currentTimeMillis() + host.queue.crawlDelayMs));
        }
        return url;
    }
}
Challenge 2: URL Prioritization
Not all URLs are equal. You want to crawl important pages first:
Priority Factors:
- PageRank/Authority: High-authority pages are crawled more frequently
- Freshness: News sites need more frequent updates than static documentation
- User Interest: Pages users actually visit get higher priority
- Depth: Pages closer to the root are often more important
Multi-level Priority Architecture:
public class PrioritizedFrontier {
    // Each host queue holds sub-queues per priority level: 0=highest, 3=lowest

    public void addUrl(URL url, int priority) {
        String host = url.getHost();
        HostQueue hostQueue = getOrCreateHostQueue(host);
        // Add to the appropriate priority level for this host
        hostQueue.addToPriority(url, priority);
    }

    public URL getNextUrl() {
        // Scan priority levels from highest to lowest across ready hosts
        for (int priority = 0; priority < NUM_PRIORITY_LEVELS; priority++) {
            URL url = getUrlFromPriorityLevel(priority);
            if (url != null) return url;
        }
        return null;
    }
}
Challenge 3: URL Deduplication
With billions of URLs, you'll encounter massive duplication: the same content served at different URLs, redirect chains, and trivial URL variations.
Bloom Filter for Fast Rejection:
public class URLDeduplicator {
    private final BloomFilter<String> seenUrls;  // e.g. Guava's BloomFilter
    private final Set<String> confirmedUrls;     // smaller, exact set

    public boolean isDuplicate(URL url) {
        String canonical = canonicalize(url);
        // Fast path: a bloom "no" is definitely new
        if (!seenUrls.mightContain(canonical)) {
            seenUrls.put(canonical);
            return false;
        }
        // A bloom "yes" may be a false positive — check the exact set.
        // Trade-off: a true repeat slips through once (crawled twice)
        // before landing in confirmedUrls; after that it is always caught.
        if (confirmedUrls.contains(canonical)) {
            return true;
        }
        confirmedUrls.add(canonical);
        return false;
    }

    private String canonicalize(URL url) {
        // Simplified normalization; real canonicalizers also sort query
        // params, resolve percent-encoding, and strip default ports
        return url.toString()
            .toLowerCase()
            .replaceAll("#.*", "")             // Remove fragments first
            .replaceAll("[?&]utm_[^&]*", "")   // Remove tracking params
            .replaceAll("/$", "");             // Remove trailing slash
    }
}
Content-based Deduplication: Even different URLs can serve identical content. Use content hashing after download:
public String contentHash(String htmlContent) {
    // Remove dynamic elements before hashing; (?s) lets . match newlines
    String cleaned = htmlContent
        .replaceAll("(?s)<!--.*?-->", "")           // Comments
        .replaceAll("(?is)<script.*?</script>", "") // JavaScript
        .replaceAll("\\s+", " ")                    // Normalize whitespace
        .trim();
    return DigestUtils.sha256Hex(cleaned); // Apache Commons Codec
}
Step 4: Deep Dive — Distributed Architecture
Horizontal Scaling Strategy
A single crawler can't handle billions of pages. You need to distribute the work across multiple machines while maintaining coordination:
Option 1: Centralized Frontier (Simple)
[Central URL Frontier/Database]
↓ ↓ ↓
[Crawler 1] [Crawler 2] [Crawler 3]
Workers pull URLs from a central frontier. Simple to implement but the frontier becomes a bottleneck at scale.
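A minimal version of this pull model is a shared queue that workers poll — a sketch that leaves out politeness and durability:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CentralFrontier {
    // In production this would be a durable store (database, log, etc.),
    // not an in-memory queue; this sketch shows only the pull model.
    private final BlockingQueue<String> urls = new LinkedBlockingQueue<>();

    public void add(String url) {
        urls.offer(url);
    }

    // Returns null when empty; a real worker would use take() to block
    public String poll() {
        return urls.poll();
    }
}
```

Every worker competing on this one queue is exactly the bottleneck the text describes — each poll is a round trip to the central store.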
Option 2: Distributed Hash-based Assignment (Scalable)
URLs distributed by hash(hostname) % num_workers
Worker 1: amazon.com, google.com
Worker 2: github.com, reddit.com
Worker 3: stackoverflow.com, wikipedia.org
Each worker is responsible for specific hosts. This maintains politeness (one worker per host) while distributing load:
public class DistributedCrawler {
    private final int workerId;
    private final int totalWorkers;

    public boolean shouldProcessHost(String hostname) {
        // floorMod avoids the negative result that Math.abs(hashCode())
        // produces when hashCode() is Integer.MIN_VALUE
        return Math.floorMod(hostname.hashCode(), totalWorkers) == workerId;
    }

    public void processNewUrl(URL url) {
        if (shouldProcessHost(url.getHost())) {
            localFrontier.add(url);
        } else {
            forwardToCorrectWorker(url);
        }
    }
}
Challenge: DNS Resolution Caching
At scale, DNS lookups become a major bottleneck. Don't look up amazon.com for every single Amazon page:
public class DNSCache {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final Map<String, Long> expiry = new ConcurrentHashMap<>();

    public InetAddress resolve(String hostname) {
        Long exp = expiry.get(hostname);
        if (exp != null && System.currentTimeMillis() < exp) {
            return cache.get(hostname); // null here means a cached failure
        }
        try {
            InetAddress addr = InetAddress.getByName(hostname);
            cache.put(hostname, addr);
            expiry.put(hostname, System.currentTimeMillis() + 3_600_000); // 1h TTL
            return addr;
        } catch (UnknownHostException e) {
            // Cache negative results too, with a shorter TTL; drop any
            // stale positive entry so it isn't returned by mistake
            cache.remove(hostname);
            expiry.put(hostname, System.currentTimeMillis() + 300_000); // 5min TTL
            return null;
        }
    }
}
Challenge: Checkpointing and Fault Tolerance
Crawlers run for weeks or months. They must survive machine failures and restarts:
Checkpoint Strategy:
public class CheckpointManager {
    public void saveCheckpoint() {
        CheckpointData data = new CheckpointData(
            frontier.getState(),
            seenUrls.serialize(),
            progress.getUrlsProcessed());
        // Write to a temp file, then rename — an atomic swap on most
        // filesystems, so a crash mid-write can't corrupt the checkpoint
        File temp = new File("checkpoint.tmp");
        writeToFile(data, temp);
        temp.renameTo(new File("checkpoint.dat"));
    }

    public void restore() {
        File checkpoint = new File("checkpoint.dat");
        if (checkpoint.exists()) {
            CheckpointData data = readFromFile(checkpoint);
            frontier.restore(data.frontierState);
            seenUrls.restore(data.bloomFilter);
            log.info("Restored from checkpoint: {} URLs processed",
                data.urlsProcessed);
        }
    }
}
Challenge: Handling Spider Traps
Some sites create infinite URL spaces to waste crawler resources:
Calendar trap: site.com/calendar/2024/01/01, site.com/calendar/2024/01/02, ... (generates URLs forever)
Session trap: site.com/login?session=abc123, site.com/login?session=def456, ... (each URL creates new session URLs)
Detection and Mitigation:
public class SpiderTrapDetector {
    private static final int MAX_DEPTH_PER_HOST = 15;
    private static final int MAX_URLS_PER_PATTERN = 1000;
    // Per host: how many URLs we've seen matching each path pattern
    private final Map<String, Map<String, Integer>> patternCounts = new HashMap<>();

    public boolean isSpiderTrap(URL url) {
        // Depth limit: very deep paths are usually generated, not authored
        if (calculateDepth(url) > MAX_DEPTH_PER_HOST) {
            return true;
        }
        // Pattern limit: thousands of URLs fitting one template is a trap signal
        String pattern = extractPattern(url.getPath());
        int count = patternCounts
            .computeIfAbsent(url.getHost(), h -> new HashMap<>())
            .merge(pattern, 1, Integer::sum);
        return count > MAX_URLS_PER_PATTERN;
    }

    private String extractPattern(String path) {
        // Replace numbers and hex IDs with placeholders so that
        // /calendar/2024/01/02 and /calendar/2024/01/03 share one pattern
        return path.replaceAll("\\d+", "*")
                   .replaceAll("/[a-f0-9]{8,}", "/*");
    }
}
Step 5: Content Processing and Storage
HTML Parsing and Link Extraction
Once you have the HTML content, extract useful information:
public class ContentProcessor {
    public ProcessedContent process(String html, URL sourceUrl) {
        Document doc = Jsoup.parse(html, sourceUrl.toString());
        // Extract metadata
        String title = doc.title();
        String description = doc.select("meta[name=description]").attr("content");
        // Extract text content (remove navigation, ads, etc.)
        String mainContent = extractMainContent(doc);
        // Extract outbound links, resolved against the page's base URL
        List<URL> links = doc.select("a[href]")
            .stream()
            .map(element -> element.attr("abs:href"))
            .filter(href -> isValidUrl(href))
            .map(this::parseUrl)
            .filter(Objects::nonNull)
            .collect(Collectors.toList());
        return new ProcessedContent(title, description, mainContent, links);
    }

    private String extractMainContent(Document doc) {
        // Remove scripts, styles, navigation
        doc.select("script, style, nav, header, footer, aside").remove();
        // Prefer semantic main-content containers if present
        Elements mainElements = doc.select("main, article, .content, #content");
        if (!mainElements.isEmpty()) {
            return mainElements.text();
        }
        return doc.body().text();
    }
}
Freshness Policy
Different content has different update frequencies. News articles change frequently, while legal documents rarely change:
public class FreshnessManager {
    public long getNextCrawlTime(URL url, LocalDateTime lastCrawled) {
        ContentType type = classifyContent(url);
        Duration interval = getFreshnessInterval(type);
        // LocalDateTime needs an explicit offset to produce an epoch timestamp
        return lastCrawled.plus(interval).toEpochSecond(ZoneOffset.UTC);
    }

    private Duration getFreshnessInterval(ContentType type) {
        switch (type) {
            case NEWS: return Duration.ofHours(1);
            case BLOG: return Duration.ofDays(1);
            case PRODUCT_PAGE: return Duration.ofDays(7);
            case DOCUMENTATION: return Duration.ofDays(30);
            default: return Duration.ofDays(14);
        }
    }
}
Common Mistakes
These patterns appear in real interview feedback:
Mistake 1: Ignoring Robots.txt
Describing a crawler that ignores robots.txt shows you don't understand web etiquette. Always mention checking robots.txt and implementing crawl-delay directives.
Mistake 2: Single-threaded Design
A single-threaded crawler downloading one page at a time will never scale. You need concurrent workers, but with per-host rate limiting.
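One way to combine concurrency with per-host limits is a semaphore per host: many worker threads run in parallel, but at most one request per host is in flight at a time. A sketch (class and method names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class HostLimiter {
    // One permit per host: any number of worker threads, but at most
    // one in-flight request to any single host at a time
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    // Returns true if the caller may fetch from this host now
    public boolean tryAcquire(String host) {
        return perHost.computeIfAbsent(host, h -> new Semaphore(1)).tryAcquire();
    }

    // Must be called after the fetch completes (or fails)
    public void release(String host) {
        perHost.get(host).release();
    }
}
```

A worker that fails to acquire the permit skips that URL and moves on; the frontier's crawl-delay scheduling then decides when the host is eligible again.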
Mistake 3: No Duplicate Handling
Without deduplication, crawlers get trapped in redirect loops and waste resources on identical content. Bloom filters for URLs and content hashing are essential.
Mistake 4: Focusing Only on HTML
Real crawlers encounter PDFs, images, videos, and other content types. Discuss how you'd handle different MIME types and whether to extract text from PDFs.
Mistake 5: No Failure Recovery
Crawlers run for weeks. If a machine crashes and you lose all progress, that's unacceptable. Mention checkpointing and the ability to resume from where you left off.
Mistake 6: Overwhelming Servers
Sending 100 simultaneous requests to a small blog is irresponsible and will get your crawler banned. Emphasize politeness and per-host rate limiting.
Interviewer Follow-Up Questions
Prepare for these common follow-ups:
"How would you handle JavaScript-heavy sites?" Traditional crawlers only see the initial HTML, not content loaded by JavaScript. Options:
- Headless browser automation (Playwright, Puppeteer) — slower but handles modern sites
- API discovery — many sites have REST APIs that are more efficient than scraping
- Server-side rendering detection — some sites provide different content to crawlers
"What about content behind login walls?" This gets into ethics and legality. Generally, respect access controls:
- Only crawl publicly accessible content
- If you need authenticated content, use official APIs
- Consider rate limiting more aggressively for authenticated crawling
"How do you detect when a website structure changes?" Structure changes break extraction logic. Options:
- Content diff analysis — compare new crawls to previous crawls
- Schema extraction — detect common patterns in HTML structure
- Machine learning — train models to identify content areas
"What about handling different languages and encodings?" Character encoding detection and proper Unicode handling:
- Check HTTP Content-Type header
- Detect encoding from HTML meta tags
- Use libraries like ICU4J for proper text processing
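The first two checks can be sketched as a fallback chain — a simplified regex-based version; production parsers also handle BOMs and byte-level pre-scans:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingDetector {
    // Matches charset=... in both HTTP headers and HTML meta tags
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset=[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Prefer the HTTP header; fall back to the HTML meta tag; default to UTF-8
    public static Charset detect(String contentTypeHeader, String htmlPrefix) {
        Charset fromHeader = fromCharsetParam(contentTypeHeader);
        if (fromHeader != null) return fromHeader;
        Charset fromMeta = fromCharsetParam(htmlPrefix);
        if (fromMeta != null) return fromMeta;
        return StandardCharsets.UTF_8;
    }

    private static Charset fromCharsetParam(String text) {
        if (text == null) return null;
        Matcher m = CHARSET_PARAM.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }
}
```

Only the first few kilobytes of the document need to be scanned for the meta tag, since HTML requires the declaration to appear early in the head.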
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: scale, content types, politeness, continuous vs one-time |
| 5-10 min | High-level architecture: frontier, workers, processors, storage |
| 10-20 min | URL frontier deep dive: politeness, prioritization, deduplication |
| 20-28 min | Distributed challenges: worker coordination, DNS caching, spider traps |
| 28-33 min | Content processing: parsing, freshness, checkpointing |
| 33-35 min | Wrap up: monitoring, failure modes, ethical considerations |
The web crawler interview tests your ability to build respectful, scalable data collection systems. The technical challenges — politeness, deduplication, distributed coordination — are what separate experienced engineers from those who've only built toy scrapers. Emphasize being a good web citizen while handling internet-scale data.