Design a Search Autocomplete System
Who Asks This Question?
Search autocomplete is asked at companies where search is a core product feature. Based on interview reports, it's frequently asked at:
- Google — Search is their DNA; multiple reports confirm this question appears in both L4 and senior rounds
- Amazon — Product search autocomplete handles billions of queries daily
- Meta — Asked for search infrastructure roles covering Instagram and Facebook search
- Netflix — Content discovery and search recommendations are critical features
- Spotify — Music search with artist/track/album autocomplete across 100+ million songs
- Airbnb — Location and property search with geographic autocomplete
- Uber — Address autocomplete for pickup/destination, asked in multiple rounds
- LinkedIn — People/company search with professional context
This question tests your understanding of real-time user experience constraints. Companies ask it because autocomplete looks simple ("just match prefixes!") but building it at scale requires sophisticated data structures, caching strategies, and pipeline design.
What the Interviewer Is Really Testing
Most candidates focus on tries and get stuck on implementation details. That's only part of what interviewers evaluate. Here's the actual scoring breakdown:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you understand the UX constraints (sub-100ms) vs backend complexity? |
| Data structures | 25% | Trie knowledge, but more importantly: when tries aren't enough |
| System architecture | 20% | Where does data collection happen? Offline vs real-time pipelines |
| Caching strategy | 20% | Multi-layer caching from CDN to application layer |
| Scale considerations | 20% | Handling billions of queries, personalization, multi-language support |
The #1 reason candidates fail: they spend 20 minutes implementing trie insertion/search while the interviewer waits for them to discuss query popularity, data collection pipelines, or caching strategies. The trie is necessary but not sufficient.
Step 1: Clarify Requirements
Questions You Must Ask
These questions fundamentally change your architecture approach:
"What types of queries are we autocompleting?" This determines your data sources. Product search uses a catalog. People search needs profiles. Address autocomplete requires geographic data. Each has different update patterns and ranking signals.
"Are suggestions personalized or global?" Global suggestions are simpler — same results for everyone searching "java" (coffee vs programming). Personalized suggestions require user context, search history, and A/B testing infrastructure.
"What's the expected query volume?" 1,000 QPS vs 100,000 QPS changes everything. High-volume systems need read replicas, aggressive caching, and geographic distribution.
"Do we need real-time updates when new content is added?" Real-time updates (new product appears in autocomplete immediately) require streaming pipelines. Batch updates (hourly/daily refresh) allow simpler offline processing.
"What languages and regions do we support?" Multi-language support affects tokenization, ranking, and storage. "café" vs "cafe" should both match. Chinese/Japanese need different tokenization than English.
Requirements You Should State
After asking questions, explicitly state your scope:
Functional:
- Return top 5-10 autocomplete suggestions for any prefix (1+ characters)
- Support prefix matching with fuzzy tolerance (handle typos)
- Rank suggestions by popularity/relevance
- Filter inappropriate or sensitive suggestions
Non-functional:
- Sub-100ms response time (autocomplete feels instant or it's useless)
- Handle 50,000+ concurrent users typing simultaneously
- 99.9% availability — autocomplete failures hurt user experience significantly
- Support for 10+ languages with proper tokenization
Scale assumptions:
- 10 billion total searchable items (products, content, etc.)
- 1 billion unique search queries per month
- Average query length: 2-4 words
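A quick back-of-envelope sketch helps turn these assumptions into load numbers you can state out loud. The keystroke multiplier and peak factor below are illustrative assumptions, not measured values:

```python
# Back-of-envelope load estimate from the scale assumptions above.
# AVG_KEYSTROKES_PER_QUERY and the 3x peak factor are assumed, not measured.

MONTHLY_QUERIES = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600
AVG_KEYSTROKES_PER_QUERY = 5  # each keystroke can trigger an autocomplete call

avg_search_qps = MONTHLY_QUERIES / SECONDS_PER_MONTH
autocomplete_qps = avg_search_qps * AVG_KEYSTROKES_PER_QUERY
peak_qps = autocomplete_qps * 3  # assume peak traffic is ~3x average

print(f"average search QPS:    {avg_search_qps:,.0f}")
print(f"autocomplete QPS:      {autocomplete_qps:,.0f}")
print(f"peak autocomplete QPS: {peak_qps:,.0f}")
```

Stating a concrete number ("roughly 2,000 autocomplete QPS on average, a few times that at peak, before caching") anchors the rest of the design discussion.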
Step 2: High-Level Architecture
Core Components
Your autocomplete system has two distinct pipelines: data collection/processing (offline) and query serving (real-time).
```
[Data Sources] ──→ [Collection Pipeline] ──→ [Processing] ──→ [Suggestion Store]
                                                  ↑
[User Queries] ──→ [Query Analytics] ─────────────┘
      │
      ↓
[Load Balancer] ──→ [Autocomplete Service] ──→ [Suggestion Cache] ──→ [Response]
```
Data Collection Pipeline (Offline)
This runs continuously, processing search logs to build suggestion rankings:
Query Log Collection: Every search query gets logged with metadata: user ID, timestamp, query text, result clicks, session context. This feeds your popularity calculations.
Query Processing:
- Normalization: Convert to lowercase, trim whitespace, handle Unicode
- Tokenization: Split into searchable terms, handle different languages
- Filtering: Remove PII, inappropriate content, and spam queries
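The normalization and tokenization steps above can be sketched as follows. This is a minimal version; real pipelines add language-specific segmenters for CJK text:

```python
import unicodedata

def normalize_query(raw: str) -> str:
    """Lowercase, trim, collapse whitespace, and normalize Unicode.

    NFKC folds compatibility characters (e.g. non-breaking spaces) into
    their plain equivalents before lowercasing.
    """
    text = unicodedata.normalize("NFKC", raw).lower().strip()
    return " ".join(text.split())  # collapse runs of internal whitespace

def tokenize(query: str) -> list[str]:
    """Whitespace tokenization; Chinese/Japanese need a real segmenter."""
    return query.split()

print(normalize_query("  Java\u00A0Programming  "))  # "java programming"
```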
Popularity Calculation: Recent queries matter more than old ones. Use exponential decay or sliding window counters:
```python
# Simplified popularity scoring
def calculate_popularity(query_stats):
    recent_weight = 0.7   # Last 7 days
    medium_weight = 0.2   # 7-30 days ago
    old_weight = 0.1      # 30+ days ago
    score = (query_stats.recent_count * recent_weight +
             query_stats.medium_count * medium_weight +
             query_stats.old_count * old_weight)
    return score
```
Query Serving Pipeline (Real-time)
This handles user keystrokes with sub-100ms latency requirements:
Request Flow:
- User types "java" in search box
- Frontend debounces keystrokes (wait 150ms after typing stops)
- Request hits load balancer, routes to autocomplete service
- Service checks L1 cache (in-memory), then L2 cache (Redis)
- If cache miss: query suggestion store, populate caches
- Return top suggestions with ranking scores
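The cache-check cascade in the flow above can be sketched as a tiered reader. The `DictStore` class is a toy stand-in for the real suggestion store, and plain dicts stand in for the L1 and L2 caches:

```python
# Sketch of the read path: L1 local cache -> L2 shared cache -> suggestion store.
# The dicts and DictStore are illustrative stand-ins, not real clients.

class TieredAutocompleteReader:
    def __init__(self, l1, l2, store):
        self.l1 = l1        # per-process in-memory cache
        self.l2 = l2        # shared cache (Redis in production)
        self.store = store  # authoritative suggestion store

    def get(self, prefix):
        if (hit := self.l1.get(prefix)) is not None:
            return hit
        if (hit := self.l2.get(prefix)) is not None:
            self.l1[prefix] = hit  # backfill L1 on an L2 hit
            return hit
        suggestions = self.store.lookup(prefix)
        self.l2[prefix] = suggestions  # populate both cache tiers on a miss
        self.l1[prefix] = suggestions
        return suggestions

class DictStore:
    def __init__(self, queries):
        self.queries = queries

    def lookup(self, prefix):
        return sorted(q for q in self.queries if q.startswith(prefix))[:5]

reader = TieredAutocompleteReader({}, {}, DictStore(["java", "javascript", "json"]))
print(reader.get("ja"))  # miss -> store lookup, then both caches are populated
```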
Step 3: Deep Dive — Data Structure
Why a Trie Isn't Enough
Every candidate mentions tries. Good. But a basic trie falls short at scale:
Memory explosion: Storing "suggest chocolate cake recipes" in a trie creates nodes for every character. With 10 billion queries, memory usage becomes prohibitive.
No popularity ranking: A trie tells you if a prefix exists, not which suggestions are most popular. You need additional metadata at each node.
Poor cache locality: Following pointer chains through tree nodes causes cache misses. For sub-100ms responses, cache performance matters.
Optimized Trie Design
Production autocomplete systems use compressed tries with metadata:
```java
import java.util.*;
import java.util.stream.Collectors;

public class AutocompleteTrie {
    private static final int TOP_K = 10;

    private final TrieNode root = new TrieNode();

    private static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        List<ScoredSuggestion> topSuggestions = new ArrayList<>();
        boolean isEndOfWord = false;

        // Maintain a bounded, popularity-sorted top-K list at this node
        void addSuggestion(ScoredSuggestion s) {
            topSuggestions.removeIf(existing -> existing.suggestion.equals(s.suggestion));
            topSuggestions.add(s);
            topSuggestions.sort((a, b) -> Double.compare(b.popularity, a.popularity));
            if (topSuggestions.size() > TOP_K) {
                topSuggestions.remove(topSuggestions.size() - 1);
            }
        }
    }

    private static class ScoredSuggestion {
        String suggestion;
        double popularity;
        long lastUpdated;

        ScoredSuggestion(String suggestion, double popularity) {
            this.suggestion = suggestion;
            this.popularity = popularity;
            this.lastUpdated = System.currentTimeMillis();
        }
    }

    // Store top-K suggestions at each node so lookup needs no subtree traversal
    public void insertWithPopularity(String word, double popularity) {
        TrieNode current = root;
        for (char c : word.toCharArray()) {
            current.children.putIfAbsent(c, new TrieNode());
            current = current.children.get(c);
            // Update top suggestions at this prefix
            current.addSuggestion(new ScoredSuggestion(word, popularity));
        }
        current.isEndOfWord = true;
    }

    public List<String> getSuggestions(String prefix, int limit) {
        TrieNode current = root;
        // Navigate to the prefix node
        for (char c : prefix.toCharArray()) {
            if (!current.children.containsKey(c)) {
                return Collections.emptyList();
            }
            current = current.children.get(c);
        }
        // Return pre-computed, already-sorted top suggestions
        return current.topSuggestions.stream()
                .limit(limit)
                .map(s -> s.suggestion)
                .collect(Collectors.toList());
    }
}
```
Alternative: Inverted Index
For very large datasets, inverted indices outperform tries:
Structure: Map each prefix to a sorted list of suggestions:
```
"jav"  → ["java programming", "javascript tutorial", "java coffee beans"]
"java" → ["java programming", "java coffee beans", "java interview questions"]
```
Benefits:
- Memory efficient — only store actual prefixes, not character-by-character trees
- Faster serialization for distributed storage
- Easier to partition across multiple servers
Implementation:
```python
class InvertedIndexAutocomplete:
    def __init__(self):
        self.prefix_to_suggestions = {}  # prefix -> sorted list of suggestions

    def build_index(self, query_popularity_map):
        for query, popularity in query_popularity_map.items():
            # Generate all prefixes for this query
            for i in range(1, len(query) + 1):
                prefix = query[:i].lower()
                if prefix not in self.prefix_to_suggestions:
                    self.prefix_to_suggestions[prefix] = []
                self.prefix_to_suggestions[prefix].append({
                    'suggestion': query,
                    'popularity': popularity
                })
        # Sort each prefix's suggestions by popularity
        for prefix in self.prefix_to_suggestions:
            self.prefix_to_suggestions[prefix].sort(
                key=lambda x: x['popularity'], reverse=True
            )

    def get_suggestions(self, prefix, limit=10):
        prefix = prefix.lower()
        if prefix not in self.prefix_to_suggestions:
            return []
        return [item['suggestion'] for item in
                self.prefix_to_suggestions[prefix][:limit]]
```
Strong answer mentions both approaches: "For real-time lookups, I'd use a trie with pre-computed top-K suggestions at each node. For very large datasets where memory is constrained, I'd use an inverted index with prefix-to-suggestions mapping."
Step 4: Deep Dive — Caching Strategy
Sub-100ms response time requires aggressive caching at multiple layers. This is often the most important part of the interview.
Layer 1: CDN/Edge Cache
- What: Cache popular autocomplete responses at CDN edge locations
- TTL: 1-6 hours (depending on how fresh suggestions need to be)
- Cache key: Hash of (prefix, language, region)
Popular prefixes like "wea", "how", "what" are the same worldwide and get millions of requests. Serving these from CDN eliminates backend load entirely.
```javascript
// Edge cache logic
function getCachedSuggestions(prefix, language, region) {
  const cacheKey = `autocomplete:${prefix}:${language}:${region}`;
  const cached = edgeCache.get(cacheKey);
  if (cached && !isExpired(cached.timestamp, 6 * 3600)) {
    return cached.suggestions;
  }
  return null; // Cache miss, fetch from origin
}
```
Layer 2: Application Cache (Redis)
- What: In-memory cache shared across all application servers
- TTL: 5-30 minutes
- Cache key: Include personalization context for logged-in users
```python
import json

class AutocompleteCache:
    def __init__(self, redis_client):
        self.redis = redis_client

    def _cache_key(self, prefix, user_context=None):
        key = f"ac:{prefix.lower()}"
        if user_context and user_context.get('user_id'):
            # Personalized suggestions get a per-user key
            key += f":user:{user_context['user_id']}"
        return key

    def get_suggestions(self, prefix, user_context=None):
        cached = self.redis.get(self._cache_key(prefix, user_context))
        return json.loads(cached) if cached else None

    def set_suggestions(self, prefix, suggestions, user_context=None, ttl=300):
        self.redis.setex(self._cache_key(prefix, user_context), ttl,
                         json.dumps(suggestions))
```
Layer 3: Local In-Memory Cache
- What: Each application server caches recent requests in local memory
- TTL: 1-5 minutes
- Size: LRU cache with 10,000-100,000 entries
```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class LocalAutocompleteCache {
    private final Cache<String, List<String>> cache;

    public LocalAutocompleteCache() {
        this.cache = Caffeine.newBuilder()
                .maximumSize(50_000)
                .expireAfterWrite(2, TimeUnit.MINUTES)
                .build();
    }

    public List<String> getSuggestions(String prefix) {
        return cache.getIfPresent(prefix.toLowerCase());
    }

    public void putSuggestions(String prefix, List<String> suggestions) {
        cache.put(prefix.toLowerCase(), suggestions);
    }
}
```
Cache Warm-up Strategy
Cold caches hurt user experience. Warm them proactively:
Popular prefix pre-loading: During deployment, fetch suggestions for the top 10,000 prefixes and populate caches.
Predictive loading: When a user searches "java pr", predictively load suggestions for "java pro", "java pri", etc.
Regional optimization: Pre-load region-specific popular prefixes at each edge location.
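The popular-prefix pre-loading step above can be sketched like this. The `fetch_suggestions` callable is a hypothetical stand-in for the real suggestion-store lookup:

```python
# Warm-up sketch: count the most common short prefixes in a query log and
# pre-populate the cache before a server takes traffic.
from collections import Counter

def top_prefixes(query_log, max_prefix_len=3, top_n=10_000):
    """Count 1..max_prefix_len character prefixes across the query log."""
    counts = Counter()
    for query in query_log:
        for i in range(1, min(max_prefix_len, len(query)) + 1):
            counts[query[:i]] += 1
    return [prefix for prefix, _ in counts.most_common(top_n)]

def warm_cache(cache, prefixes, fetch_suggestions):
    # fetch_suggestions is a stand-in for the real suggestion-store lookup
    for prefix in prefixes:
        cache[prefix] = fetch_suggestions(prefix)

log = ["java", "java", "javascript", "weather"]
prefixes = top_prefixes(log, top_n=5)
print(prefixes[0])  # "j" — the most common prefix in this toy log
```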
Step 5: Deep Dive — Scale and Personalization
Handling Geographic Distribution
Challenge: Users in Tokyo expect different autocomplete suggestions than users in New York searching the same prefix.
Solution: Regional suggestion stores
```python
class GeoAutocompleteService:
    def __init__(self):
        self.regional_stores = {
            'us-east': AutocompleteStore('us-east'),
            'us-west': AutocompleteStore('us-west'),
            'eu-west': AutocompleteStore('eu-west'),
            'asia-pacific': AutocompleteStore('asia-pacific')
        }

    def get_suggestions(self, prefix, user_location, limit=10):
        # Route to regional store
        region = self.determine_region(user_location)
        store = self.regional_stores.get(region)
        if not store:
            # Fall back to a default regional store
            store = self.regional_stores['us-east']
        return store.get_suggestions(prefix, limit)

    def determine_region(self, user_location):
        # IP geolocation or user profile data
        if user_location.get('country') in ['US', 'CA']:
            return 'us-east'
        elif user_location.get('country') in ['GB', 'FR', 'DE']:
            return 'eu-west'
        else:
            return 'asia-pacific'
```
Personalization Pipeline
Challenge: "python" should suggest "python programming" for developers but "python pets" for reptile enthusiasts.
Solution: User context mixing
```python
class PersonalizedAutocomplete:
    def __init__(self, global_store, user_profile_service):
        self.global_store = global_store
        self.user_service = user_profile_service

    def get_suggestions(self, prefix, user_id, limit=10):
        # Over-fetch global popular suggestions, leaving room to re-rank
        global_suggestions = self.global_store.get_suggestions(prefix, limit * 2)
        if not user_id:
            return global_suggestions[:limit]

        # Get user context
        user_profile = self.user_service.get_profile(user_id)
        user_categories = user_profile.get('interest_categories', [])
        search_history = user_profile.get('recent_searches', [])

        # Re-rank suggestions based on user context
        scored_suggestions = []
        for suggestion in global_suggestions:
            base_score = suggestion.get('popularity', 0)
            # Boost score if the suggestion matches user interests or history
            category_boost = self.calculate_category_boost(suggestion, user_categories)
            history_boost = self.calculate_history_boost(suggestion, search_history)
            final_score = base_score * (1 + category_boost + history_boost)
            scored_suggestions.append({
                'suggestion': suggestion['text'],
                'score': final_score
            })

        # Return top suggestions
        sorted_suggestions = sorted(scored_suggestions,
                                    key=lambda x: x['score'], reverse=True)
        return [s['suggestion'] for s in sorted_suggestions[:limit]]
```
Multi-Language Support
Challenge: Autocomplete for "café" should work whether user types "cafe", "café", or even "caff" (common typo).
Solution: Unicode normalization + fuzzy matching
```python
import unicodedata
from difflib import SequenceMatcher

class MultiLanguageAutocomplete:
    def __init__(self):
        self.language_specific_stores = {}  # 'en', 'es', 'fr', etc.

    def normalize_query(self, query, language='en'):
        # Decompose accented characters, then drop combining marks
        # ("café" -> "cafe", "ñ" -> "n")
        normalized = unicodedata.normalize('NFD', query)
        ascii_query = ''.join(c for c in normalized
                              if unicodedata.category(c) != 'Mn')
        # Handle characters NFD does not decompose
        if language == 'de':
            ascii_query = ascii_query.replace('ß', 'ss')
        return ascii_query.lower().strip()

    def fuzzy_match_suggestions(self, prefix, suggestions, threshold=0.8):
        fuzzy_matches = []
        for suggestion in suggestions:
            # Compare the typed prefix against the same-length prefix of the
            # suggestion, not the whole string
            candidate = suggestion[:len(prefix)].lower()
            similarity = SequenceMatcher(None, prefix, candidate).ratio()
            if similarity >= threshold:
                fuzzy_matches.append({
                    'suggestion': suggestion,
                    'similarity': similarity
                })
        return sorted(fuzzy_matches, key=lambda x: x['similarity'], reverse=True)
```
Step 6: Deep Dive — Data Collection Pipeline
Real-time vs Batch Processing
Real-time pipeline: New search queries immediately influence autocomplete suggestions
- Use case: Breaking news, trending topics ("ukraine war 2024")
- Technology: Kafka + Storm/Flink for stream processing
- Trade-off: Complex infrastructure, but suggestions stay current
Batch pipeline: Process search logs in hourly/daily batches
- Use case: Product catalogs, stable content ("cooking recipes")
- Technology: Hadoop/Spark for large-scale processing
- Trade-off: Simpler infrastructure, but 1-24 hour suggestion lag
Most production systems use hybrid: batch for stable suggestions, real-time for trending topics.
```python
class HybridSuggestionPipeline:
    def __init__(self):
        self.batch_processor = BatchSuggestionProcessor()
        self.realtime_processor = RealtimeSuggestionProcessor()

    def process_query(self, query, user_context):
        # Log for batch processing
        self.batch_processor.log_query(query, user_context)
        # Check if this is a trending/breaking topic
        if self.is_trending_topic(query):
            # Process immediately for real-time updates
            self.realtime_processor.update_suggestions(query)

    def is_trending_topic(self, query):
        # Simple trending detection: compare the last hour to the 30-day average
        recent_count = self.get_recent_query_count(query, hours=1)
        historical_avg = self.get_historical_average(query, days=30)
        # If current volume is 10x normal, it's trending
        return recent_count > historical_avg * 10
```
Filtering Inappropriate Content
Challenge: Autocomplete must not suggest inappropriate, biased, or harmful queries.
Solution: Multi-layer filtering
```python
import re

class ContentFilter:
    def __init__(self):
        self.banned_words = self.load_banned_words()
        self.ml_classifier = self.load_toxicity_classifier()

    def filter_suggestions(self, suggestions):
        filtered = []
        for suggestion in suggestions:
            # Rule-based filtering
            if self.contains_banned_words(suggestion):
                continue
            # ML-based toxicity detection
            toxicity_score = self.ml_classifier.predict_toxicity(suggestion)
            if toxicity_score > 0.7:  # High toxicity threshold
                continue
            # PII detection
            if self.contains_pii(suggestion):
                continue
            filtered.append(suggestion)
        return filtered

    def contains_banned_words(self, suggestion):
        words = suggestion.lower().split()
        return any(word in self.banned_words for word in words)

    def contains_pii(self, suggestion):
        # Simple regex patterns for common PII
        # Phone numbers (US format)
        if re.search(r'\d{3}-\d{3}-\d{4}', suggestion):
            return True
        # Email addresses
        if re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', suggestion):
            return True
        return False
```
Common Mistakes
Mistake 1: Only Discussing Trie Implementation
Spending 20+ minutes on trie insertion/traversal without mentioning data collection, caching, or popularity ranking. The data structure is 25% of the problem.
Mistake 2: Ignoring Latency Requirements
Designing a system that works but takes 500ms to respond. Sub-100ms is non-negotiable for autocomplete — anything slower feels broken to users.
Mistake 3: No Data Pipeline Discussion
Jumping straight to serving queries without explaining where suggestions come from. The data collection and processing pipeline is half the system.
Mistake 4: Static Popularity Rankings
Hardcoding suggestion popularity instead of computing it from query logs. Real autocomplete systems adapt to changing user behavior.
Mistake 5: Single-Server Design for Global Scale
Describing a single trie in memory when the requirements call for billions of queries. Global systems need geographic distribution and multiple cache layers.
Mistake 6: No Content Filtering
Assuming all user queries are appropriate for suggestions. Production systems need extensive filtering for inappropriate content, PII, and spam.
Interviewer Follow-Up Questions
"How would you handle typos in autocomplete?" Two approaches: fuzzy string matching (edit distance) or phonetic matching (Soundex/Metaphone). Most production systems combine both with ML-based spell correction.
"What if a competitor's product name keeps appearing in autocomplete suggestions?" This is a business policy question disguised as a technical one. Technical solution: filtering rules. Business decision: do you want to show competitor names or not?
"How would you prevent autocomplete from suggesting yesterday's searches?" Implement query decay: weight recent searches higher, downweight searches that haven't appeared recently. Use exponential decay or sliding window counters.
"Should personalized suggestions be based on the current user's searches or all users?" Hybrid approach: global popularity provides baseline relevance, user personalization adds context. Pure personalization creates filter bubbles; pure global ignores user intent.
"How would you handle autocomplete for e-commerce product search?" Product catalogs change differently than search queries. New products need immediate availability, discontinued products should fade from suggestions. Requires real-time inventory integration.
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: query types, personalization, scale, latency |
| 5-10 min | High-level architecture: data collection vs query serving pipelines |
| 10-18 min | Data structure: trie with metadata OR inverted index, justify choice |
| 18-26 min | Caching strategy: CDN, Redis, local caches with specific TTLs |
| 26-32 min | Scale challenges: geographic distribution, personalization, multi-language |
| 32-35 min | Data pipeline: batch vs real-time processing, content filtering |
The autocomplete interview tests your understanding of real-time user experience at scale. Focus on the latency constraint (sub-100ms) and how it drives every architectural decision — from data structure choice to caching strategy to geographic distribution.