Design a Search Autocomplete System
Who Asks This Question?
Search autocomplete is asked at companies where search is a core product feature. Based on interview reports, it's frequently asked at:
- Google — Search is their DNA; multiple reports confirm this question appears in both L4 and senior rounds
- Amazon — Product search autocomplete handles billions of queries daily
- Meta — Asked for search infrastructure roles covering Instagram and Facebook search
- Netflix — Content discovery and search recommendations are critical features
- Spotify — Music search with artist/track/album autocomplete across 100+ million songs
- Airbnb — Location and property search with geographic autocomplete
- Uber — Address autocomplete for pickup/destination, asked in multiple rounds
- LinkedIn — People/company search with professional context
This question tests your understanding of real-time user experience constraints. Companies ask it because autocomplete looks simple ("just match prefixes!") but building it at scale requires sophisticated data structures, caching strategies, and pipeline design.
What the Interviewer Is Really Testing
Most candidates focus on tries and get stuck on implementation details. That's only part of what interviewers evaluate. Here's the actual scoring breakdown:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you understand the UX constraints (sub-100ms) vs backend complexity? |
| Data structures | 25% | Trie knowledge, but more importantly: when tries aren't enough |
| System architecture | 20% | Where does data collection happen? Offline vs real-time pipelines |
| Caching strategy | 20% | Multi-layer caching from CDN to application layer |
| Scale considerations | 20% | Handling billions of queries, personalization, multi-language support |
The #1 reason candidates fail: they spend 20 minutes implementing trie insertion/search while the interviewer waits for them to discuss query popularity, data collection pipelines, or caching strategies. The trie is necessary but not sufficient.
Step 1: Clarify Requirements
Questions You Must Ask
These questions fundamentally change your architecture approach:
"What types of queries are we autocompleting?" This determines your data sources. Product search uses a catalog. People search needs profiles. Address autocomplete requires geographic data. Each has different update patterns and ranking signals.
"Are suggestions personalized or global?" Global suggestions are simpler — same results for everyone searching "java" (coffee vs programming). Personalized suggestions require user context, search history, and A/B testing infrastructure.
"What's the expected query volume?" 1,000 QPS vs 100,000 QPS changes everything. High-volume systems need read replicas, aggressive caching, and geographic distribution.
"Do we need real-time updates when new content is added?" Real-time updates (new product appears in autocomplete immediately) require streaming pipelines. Batch updates (hourly/daily refresh) allow simpler offline processing.
"What languages and regions do we support?" Multi-language support affects tokenization, ranking, and storage. "café" vs "cafe" should both match. Chinese/Japanese need different tokenization than English.
Requirements You Should State
After asking questions, explicitly state your scope:
Functional:
- Return top 5-10 autocomplete suggestions for any prefix (1+ characters)
- Support prefix matching with fuzzy tolerance (handle typos)
- Rank suggestions by popularity/relevance
- Filter inappropriate or sensitive suggestions
Non-functional:
- Sub-100ms response time (autocomplete feels instant or it's useless)
- Handle 50,000+ concurrent users typing simultaneously
- 99.9% availability — autocomplete failures hurt user experience significantly
- Support for 10+ languages with proper tokenization
Scale assumptions:
- 10 billion total searchable items (products, content, etc.)
- 1 billion unique search queries per month
- Average query length: 2-4 words
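A quick back-of-envelope sketch helps turn these assumptions into load numbers you can state out loud. The keystroke multiplier and peak factor below are illustrative assumptions, not measured values:

```python
# Back-of-envelope load estimate from the scale assumptions above.
# AVG_KEYSTROKES_PER_QUERY and the 3x peak factor are assumed, not measured.

MONTHLY_QUERIES = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 3600
AVG_KEYSTROKES_PER_QUERY = 5  # each keystroke can trigger an autocomplete call

avg_search_qps = MONTHLY_QUERIES / SECONDS_PER_MONTH
autocomplete_qps = avg_search_qps * AVG_KEYSTROKES_PER_QUERY
peak_qps = autocomplete_qps * 3  # assume peak traffic is ~3x average

print(f"average search QPS:    {avg_search_qps:,.0f}")
print(f"autocomplete QPS:      {autocomplete_qps:,.0f}")
print(f"peak autocomplete QPS: {peak_qps:,.0f}")
```

Stating a concrete number ("roughly 2,000 autocomplete QPS on average, a few times that at peak, before caching") anchors the rest of the design discussion.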
Step 2: High-Level Architecture
Core Components
Your autocomplete system has two distinct pipelines: data collection/processing (offline) and query serving (real-time).
```
[Data Sources] ──→ [Collection Pipeline] ──→ [Processing] ──→ [Suggestion Store]
                                                  ↑
[User Queries] ──→ [Query Analytics] ─────────────┘
      │
      ↓
[Load Balancer] ──→ [Autocomplete Service] ──→ [Suggestion Cache] ──→ [Response]
```
Data Collection Pipeline (Offline)
This runs continuously, processing search logs to build suggestion rankings:
Query Log Collection: Every search query gets logged with metadata: user ID, timestamp, query text, result clicks, session context. This feeds your popularity calculations.
Query Processing:
- Normalization: Convert to lowercase, trim whitespace, handle Unicode
- Tokenization: Split into searchable terms, handle different languages
- Filtering: Remove PII, inappropriate content, and spam queries
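The normalization and tokenization steps above can be sketched as follows. This is a minimal version; real pipelines add language-specific segmenters for CJK text:

```python
import unicodedata

def normalize_query(raw: str) -> str:
    """Lowercase, trim, collapse whitespace, and normalize Unicode.

    NFKC folds compatibility characters (e.g. non-breaking spaces) into
    their plain equivalents before lowercasing.
    """
    text = unicodedata.normalize("NFKC", raw).lower().strip()
    return " ".join(text.split())  # collapse runs of internal whitespace

def tokenize(query: str) -> list[str]:
    """Whitespace tokenization; Chinese/Japanese need a real segmenter."""
    return query.split()

print(normalize_query("  Java\u00A0Programming  "))  # "java programming"
```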
Popularity Calculation: Recent queries matter more than old ones. Use exponential decay or sliding window counters:
```python
# Simplified popularity scoring
def calculate_popularity(query_stats):
    recent_weight = 0.7   # Last 7 days
    medium_weight = 0.2   # 7-30 days ago
    old_weight = 0.1      # 30+ days ago
    score = (query_stats.recent_count * recent_weight +
             query_stats.medium_count * medium_weight +
             query_stats.old_count * old_weight)
    return score
```
Query Serving Pipeline (Real-time)
This handles user keystrokes with sub-100ms latency requirements:
Request Flow:
- User types "java" in search box
- Frontend debounces keystrokes (wait 150ms after typing stops)
- Request hits load balancer, routes to autocomplete service
- Service checks L1 cache (in-memory), then L2 cache (Redis)
- If cache miss: query suggestion store, populate caches
- Return top suggestions with ranking scores
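The cache-check cascade in the flow above can be sketched as a tiered reader. The `DictStore` class is a toy stand-in for the real suggestion store, and plain dicts stand in for the L1 and L2 caches:

```python
# Sketch of the read path: L1 local cache -> L2 shared cache -> suggestion store.
# The dicts and DictStore are illustrative stand-ins, not real clients.

class TieredAutocompleteReader:
    def __init__(self, l1, l2, store):
        self.l1 = l1        # per-process in-memory cache
        self.l2 = l2        # shared cache (Redis in production)
        self.store = store  # authoritative suggestion store

    def get(self, prefix):
        if (hit := self.l1.get(prefix)) is not None:
            return hit
        if (hit := self.l2.get(prefix)) is not None:
            self.l1[prefix] = hit  # backfill L1 on an L2 hit
            return hit
        suggestions = self.store.lookup(prefix)
        self.l2[prefix] = suggestions  # populate both cache tiers on a miss
        self.l1[prefix] = suggestions
        return suggestions

class DictStore:
    def __init__(self, queries):
        self.queries = queries

    def lookup(self, prefix):
        return sorted(q for q in self.queries if q.startswith(prefix))[:5]

reader = TieredAutocompleteReader({}, {}, DictStore(["java", "javascript", "json"]))
print(reader.get("ja"))  # miss -> store lookup, then both caches are populated
```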
Step 3: Deep Dive — Data Structure
Why a Trie Isn't Enough
Every candidate mentions tries. Good. But a basic trie falls short at scale:
Memory explosion: Storing "suggest chocolate cake recipes" in a trie creates nodes for every character. With 10 billion queries, memory usage becomes prohibitive.
No popularity ranking: A trie tells you if a prefix exists, not which suggestions are most popular. You need additional metadata at each node.
Poor cache locality: Following pointer chains through tree nodes causes cache misses. For sub-100ms responses, cache performance matters.
Optimized Trie Design
Production autocomplete systems use compressed tries with metadata:
```java
import java.util.*;
import java.util.stream.Collectors;

public class AutocompleteTrie {
    private static final int TOP_K = 10;

    private final TrieNode root = new TrieNode();

    private static class TrieNode {
        Map<Character, TrieNode> children = new HashMap<>();
        List<ScoredSuggestion> topSuggestions = new ArrayList<>();
        boolean isEndOfWord = false;

        // Maintain a bounded, popularity-sorted top-K list at this node
        void addSuggestion(ScoredSuggestion s) {
            topSuggestions.removeIf(existing -> existing.suggestion.equals(s.suggestion));
            topSuggestions.add(s);
            topSuggestions.sort((a, b) -> Double.compare(b.popularity, a.popularity));
            if (topSuggestions.size() > TOP_K) {
                topSuggestions.remove(topSuggestions.size() - 1);
            }
        }
    }

    private static class ScoredSuggestion {
        String suggestion;
        double popularity;
        long lastUpdated;

        ScoredSuggestion(String suggestion, double popularity) {
            this.suggestion = suggestion;
            this.popularity = popularity;
            this.lastUpdated = System.currentTimeMillis();
        }
    }

    // Store top-K suggestions at each node so lookup needs no subtree traversal
    public void insertWithPopularity(String word, double popularity) {
        TrieNode current = root;
        for (char c : word.toCharArray()) {
            current.children.putIfAbsent(c, new TrieNode());
            current = current.children.get(c);
            // Update top suggestions at this prefix
            current.addSuggestion(new ScoredSuggestion(word, popularity));
        }
        current.isEndOfWord = true;
    }

    public List<String> getSuggestions(String prefix, int limit) {
        TrieNode current = root;
        // Navigate to the prefix node
        for (char c : prefix.toCharArray()) {
            if (!current.children.containsKey(c)) {
                return Collections.emptyList();
            }
            current = current.children.get(c);
        }
        // Return pre-computed, already-sorted top suggestions
        return current.topSuggestions.stream()
                .limit(limit)
                .map(s -> s.suggestion)
                .collect(Collectors.toList());
    }
}
```
Alternative: Inverted Index
For very large datasets, inverted indices outperform tries:
Structure: Map each prefix to a sorted list of suggestions:
```
"jav"  → ["java programming", "javascript tutorial", "java coffee beans"]
"java" → ["java programming", "java coffee beans", "java interview questions"]
```
Benefits:
- Memory efficient — only store actual prefixes, not character-by-character trees
- Faster serialization for distributed storage
- Easier to partition across multiple servers
Implementation:
```python
class InvertedIndexAutocomplete:
    def __init__(self):
        self.prefix_to_suggestions = {}  # prefix -> sorted list of suggestions

    def build_index(self, query_popularity_map):
        for query, popularity in query_popularity_map.items():
            # Generate all prefixes for this query
            for i in range(1, len(query) + 1):
                prefix = query[:i].lower()
                if prefix not in self.prefix_to_suggestions:
                    self.prefix_to_suggestions[prefix] = []
                self.prefix_to_suggestions[prefix].append({
                    'suggestion': query,
                    'popularity': popularity
                })
        # Sort each prefix's suggestions by popularity
        for prefix in self.prefix_to_suggestions:
            self.prefix_to_suggestions[prefix].sort(
                key=lambda x: x['popularity'], reverse=True
            )

    def get_suggestions(self, prefix, limit=10):
        prefix = prefix.lower()
        if prefix not in self.prefix_to_suggestions:
            return []
        return [item['suggestion'] for item in
                self.prefix_to_suggestions[prefix][:limit]]
```
Strong answer mentions both approaches: "For real-time lookups, I'd use a trie with pre-computed top-K suggestions at each node. For very large datasets where memory is constrained, I'd use an inverted index with prefix-to-suggestions mapping."
Step 4: Deep Dive — Caching Strategy
Sub-100ms response time requires aggressive caching at multiple layers. This is often the most important part of the interview.
Layer 1: CDN/Edge Cache
- What: Cache popular autocomplete responses at CDN edge locations
- TTL: 1-6 hours (depending on how fresh suggestions need to be)
- Cache key: Hash of (prefix, language, region)
Popular prefixes like "wea", "how", "what" are the same worldwide and get millions of requests. Serving these from CDN eliminates backend load entirely.
```javascript
// Edge cache logic
function getCachedSuggestions(prefix, language, region) {
  const cacheKey = `autocomplete:${prefix}:${language}:${region}`;
  const cached = edgeCache.get(cacheKey);
  if (cached && !isExpired(cached.timestamp, 6 * 3600)) {
    return cached.suggestions;
  }
  return null; // Cache miss, fetch from origin
}
```
Layer 2: Application Cache (Redis)
- What: In-memory cache shared across all application servers
- TTL: 5-30 minutes
- Cache key: Include personalization context for logged-in users
```python
import json

class AutocompleteCache:
    def __init__(self, redis_client):
        self.redis = redis_client

    def _cache_key(self, prefix, user_context=None):
        key = f"ac:{prefix.lower()}"
        if user_context and user_context.get('user_id'):
            # Personalized suggestions get a per-user key
            key += f":user:{user_context['user_id']}"
        return key

    def get_suggestions(self, prefix, user_context=None):
        cached = self.redis.get(self._cache_key(prefix, user_context))
        return json.loads(cached) if cached else None

    def set_suggestions(self, prefix, suggestions, user_context=None, ttl=300):
        self.redis.setex(self._cache_key(prefix, user_context), ttl,
                         json.dumps(suggestions))
```
Layer 3: Local In-Memory Cache
- What: Each application server caches recent requests in local memory
- TTL: 1-5 minutes
- Size: LRU cache with 10,000-100,000 entries
```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class LocalAutocompleteCache {
    private final Cache<String, List<String>> cache;

    public LocalAutocompleteCache() {
        this.cache = Caffeine.newBuilder()
                .maximumSize(50_000)
                .expireAfterWrite(2, TimeUnit.MINUTES)
                .build();
    }

    public List<String> getSuggestions(String prefix) {
        return cache.getIfPresent(prefix.toLowerCase());
    }

    public void putSuggestions(String prefix, List<String> suggestions) {
        cache.put(prefix.toLowerCase(), suggestions);
    }
}
```
Cache Warm-up Strategy
Cold caches hurt user experience. Warm them proactively:
Popular prefix pre-loading: During deployment, fetch suggestions for the top 10,000 prefixes and populate caches.
Predictive loading: When a user searches "java pr", predictively load suggestions for "java pro", "java pri", etc.
Regional optimization: Pre-load region-specific popular prefixes at each edge location.
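The popular-prefix pre-loading step above can be sketched like this. The `fetch_suggestions` callable is a hypothetical stand-in for the real suggestion-store lookup:

```python
# Warm-up sketch: count the most common short prefixes in a query log and
# pre-populate the cache before a server takes traffic.
from collections import Counter

def top_prefixes(query_log, max_prefix_len=3, top_n=10_000):
    """Count 1..max_prefix_len character prefixes across the query log."""
    counts = Counter()
    for query in query_log:
        for i in range(1, min(max_prefix_len, len(query)) + 1):
            counts[query[:i]] += 1
    return [prefix for prefix, _ in counts.most_common(top_n)]

def warm_cache(cache, prefixes, fetch_suggestions):
    # fetch_suggestions is a stand-in for the real suggestion-store lookup
    for prefix in prefixes:
        cache[prefix] = fetch_suggestions(prefix)

log = ["java", "java", "javascript", "weather"]
prefixes = top_prefixes(log, top_n=5)
print(prefixes[0])  # "j" — the most common prefix in this toy log
```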
Step 5: Deep Dive — Scale and Personalization
Handling Geographic Distribution
Challenge: Users in Tokyo expect different autocomplete suggestions than users in New York searching the same prefix.
Solution: Regional suggestion stores
```python
class GeoAutocompleteService:
    def __init__(self):
        self.regional_stores = {
            'us-east': AutocompleteStore('us-east'),
            'us-west': AutocompleteStore('us-west'),
            'eu-west': AutocompleteStore('eu-west'),
            'asia-pacific': AutocompleteStore('asia-pacific')
        }

    def get_suggestions(self, prefix, user_location, limit=10):
        # Route to regional store
        region = self.determine_region(user_location)
        store = self.regional_stores.get(region)
        if not store:
            # Fall back to a default regional store
            store = self.regional_stores['us-east']
        return store.get_suggestions(prefix, limit)

    def determine_region(self, user_location):
        # IP geolocation or user profile data
        if user_location.get('country') in ['US', 'CA']:
            return 'us-east'
        elif user_location.get('country') in ['GB', 'FR', 'DE']:
            return 'eu-west'
        else:
            return 'asia-pacific'
```
Personalization Pipeline
Challenge: "python" should suggest "python programming" for developers but "python pets" for reptile enthusiasts.
Solution: User context mixing
```python
class PersonalizedAutocomplete:
    def __init__(self, global_store, user_profile_service):
        self.global_store = global_store
        self.user_service = user_profile_service

    def get_suggestions(self, prefix, user_id, limit=10):
        # Over-fetch global popular suggestions, leaving room to re-rank
        global_suggestions = self.global_store.get_suggestions(prefix, limit * 2)
        if not user_id:
            return global_suggestions[:limit]

        # Get user context
        user_profile = self.user_service.get_profile(user_id)
        user_categories = user_profile.get('interest_categories', [])
        search_history = user_profile.get('recent_searches', [])

        # Re-rank suggestions based on user context
        scored_suggestions = []
        for suggestion in global_suggestions:
            base_score = suggestion.get('popularity', 0)
            # Boost score if the suggestion matches user interests or history
            category_boost = self.calculate_category_boost(suggestion, user_categories)
            history_boost = self.calculate_history_boost(suggestion, search_history)
            final_score = base_score * (1 + category_boost + history_boost)
            scored_suggestions.append({
                'suggestion': suggestion['text'],
                'score': final_score
            })

        # Return top suggestions
        sorted_suggestions = sorted(scored_suggestions,
                                    key=lambda x: x['score'], reverse=True)
        return [s['suggestion'] for s in sorted_suggestions[:limit]]
```
Multi-Language Support
Challenge: Autocomplete for "café" should work whether user types "cafe", "café", or even "caff" (common typo).
Solution: Unicode normalization + fuzzy matching
```python
import unicodedata
from difflib import SequenceMatcher

class MultiLanguageAutocomplete:
    def __init__(self):
        self.language_specific_stores = {}  # 'en', 'es', 'fr', etc.

    def normalize_query(self, query, language='en'):
        # Decompose accented characters, then drop combining marks
        # ("café" -> "cafe", "ñ" -> "n")
        normalized = unicodedata.normalize('NFD', query)
        ascii_query = ''.join(c for c in normalized
                              if unicodedata.category(c) != 'Mn')
        # Handle characters NFD does not decompose
        if language == 'de':
            ascii_query = ascii_query.replace('ß', 'ss')
        return ascii_query.lower().strip()

    def fuzzy_match_suggestions(self, prefix, suggestions, threshold=0.8):
        fuzzy_matches = []
        for suggestion in suggestions:
            # Compare the typed prefix against the same-length prefix of the
            # suggestion, not the whole string
            candidate = suggestion[:len(prefix)].lower()
            similarity = SequenceMatcher(None, prefix, candidate).ratio()
            if similarity >= threshold:
                fuzzy_matches.append({
                    'suggestion': suggestion,
                    'similarity': similarity
                })
        return sorted(fuzzy_matches, key=lambda x: x['similarity'], reverse=True)
```
Step 6: Deep Dive — Data Collection Pipeline
Real-time vs Batch Processing
Real-time pipeline: New search queries immediately influence autocomplete suggestions
- Use case: Breaking news, trending topics ("ukraine war 2024")
- Technology: Kafka + Storm/Flink for stream processing
- Trade-off: Complex infrastructure, but suggestions stay current
Batch pipeline: Process search logs in hourly/daily batches
- Use case: Product catalogs, stable content ("cooking recipes")
- Technology: Hadoop/Spark for large-scale processing
- Trade-off: Simpler infrastructure, but 1-24 hour suggestion lag
Most production systems use hybrid: batch for stable suggestions, real-time for trending topics.
```python
class HybridSuggestionPipeline:
    def __init__(self):
        self.batch_processor = BatchSuggestionProcessor()
        self.realtime_processor = RealtimeSuggestionProcessor()

    def process_query(self, query, user_context):
        # Log for batch processing
        self.batch_processor.log_query(query, user_context)
        # Check if this is a trending/breaking topic
        if self.is_trending_topic(query):
            # Process immediately for real-time updates
            self.realtime_processor.update_suggestions(query)

    def is_trending_topic(self, query):
        # Simple trending detection: compare the last hour to the 30-day average
        recent_count = self.get_recent_query_count(query, hours=1)
        historical_avg = self.get_historical_average(query, days=30)
        # If current volume is 10x normal, it's trending
        return recent_count > historical_avg * 10
```
Filtering Inappropriate Content
Challenge: Autocomplete must not suggest inappropriate, biased, or harmful queries.
Solution: Multi-layer filtering
```python
import re

class ContentFilter:
    def __init__(self):
        self.banned_words = self.load_banned_words()
        self.ml_classifier = self.load_toxicity_classifier()

    def filter_suggestions(self, suggestions):
        filtered = []
        for suggestion in suggestions:
            # Rule-based filtering
            if self.contains_banned_words(suggestion):
                continue
            # ML-based toxicity detection
            toxicity_score = self.ml_classifier.predict_toxicity(suggestion)
            if toxicity_score > 0.7:  # High toxicity threshold
                continue
            # PII detection
            if self.contains_pii(suggestion):
                continue
            filtered.append(suggestion)
        return filtered

    def contains_banned_words(self, suggestion):
        words = suggestion.lower().split()
        return any(word in self.banned_words for word in words)

    def contains_pii(self, suggestion):
        # Simple regex patterns for common PII
        # Phone numbers (US format)
        if re.search(r'\d{3}-\d{3}-\d{4}', suggestion):
            return True
        # Email addresses
        if re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', suggestion):
            return True
        return False
```
Common Mistakes
Mistake 1: Only Discussing Trie Implementation
Spending 20+ minutes on trie insertion/traversal without mentioning data collection, caching, or popularity ranking. The data structure is 25% of the problem.
Mistake 2: Ignoring Latency Requirements
Designing a system that works but takes 500ms to respond. Sub-100ms is non-negotiable for autocomplete — anything slower feels broken to users.
Mistake 3: No Data Pipeline Discussion
Jumping straight to serving queries without explaining where suggestions come from. The data collection and processing pipeline is half the system.
Mistake 4: Static Popularity Rankings
Hardcoding suggestion popularity instead of computing it from query logs. Real autocomplete systems adapt to changing user behavior.
Mistake 5: Single-Server Design for Global Scale
Describing a single trie in memory when the requirements call for billions of queries. Global systems need geographic distribution and multiple cache layers.
Mistake 6: No Content Filtering
Assuming all user queries are appropriate for suggestions. Production systems need extensive filtering for inappropriate content, PII, and spam.
Interviewer Follow-Up Questions
"How would you handle typos in autocomplete?" Two approaches: fuzzy string matching (edit distance) or phonetic matching (Soundex/Metaphone). Most production systems combine both with ML-based spell correction.
"What if a competitor's product name keeps appearing in autocomplete suggestions?" This is a business policy question disguised as a technical one. Technical solution: filtering rules. Business decision: do you want to show competitor names or not?
"How would you prevent autocomplete from suggesting yesterday's searches?" Implement query decay: weight recent searches higher, downweight searches that haven't appeared recently. Use exponential decay or sliding window counters.
"Should personalized suggestions be based on the current user's searches or all users?" Hybrid approach: global popularity provides baseline relevance, user personalization adds context. Pure personalization creates filter bubbles; pure global ignores user intent.
"How would you handle autocomplete for e-commerce product search?" Product catalogs change differently than search queries. New products need immediate availability, discontinued products should fade from suggestions. Requires real-time inventory integration.
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: query types, personalization, scale, latency |
| 5-10 min | High-level architecture: data collection vs query serving pipelines |
| 10-18 min | Data structure: trie with metadata OR inverted index, justify choice |
| 18-26 min | Caching strategy: CDN, Redis, local caches with specific TTLs |
| 26-32 min | Scale challenges: geographic distribution, personalization, multi-language |
| 32-35 min | Data pipeline: batch vs real-time processing, content filtering |
The autocomplete interview tests your understanding of real-time user experience at scale. Focus on the latency constraint (sub-100ms) and how it drives every architectural decision — from data structure choice to caching strategy to geographic distribution.