Design a Chat System
Who Asks This Question?
Chat system design is a staple at companies where real-time communication is either core to their business or a critical user experience component. Based on interview reports, it's frequently asked at:
- Meta — WhatsApp, Messenger, and Instagram DMs handle billions of messages daily
- Slack — Their entire platform is built around real-time messaging at enterprise scale
- Discord — Real-time chat for millions of concurrent users in voice and text channels
- Zoom — In-meeting chat and persistent team messaging features
- Microsoft — Teams integration with Office 365 and enterprise communication
- Snapchat — Ephemeral messaging with multimedia and real-time features
- Telegram — High-performance messaging with security focus and large group support
- LinkedIn — Professional messaging integrated with social networking features
- Twitch — Live chat during streams with massive concurrent viewers
This question tests whether you understand real-time systems, WebSocket management, and the complexity of reliable message delivery across different client states. Companies ask it because chat seems simple but involves deep distributed systems challenges around ordering, delivery guarantees, and connection management.
What the Interviewer Is Really Testing
Most candidates focus on the basic "send message from A to B" flow and miss the harder problems around scale, reliability, and user experience. Here's what interviewers actually evaluate:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 15% | Do you ask about 1:1 vs group chat, online presence, message history, and client types? |
| Real-time connection design | 30% | WebSocket management, connection pooling, heartbeats, and graceful degradation |
| Message ordering and delivery | 25% | Sequence numbers, delivery receipts, handling offline users, and duplicate prevention |
| Database design and scale | 20% | Sharding strategies, hot partitions, message storage, and efficient retrieval |
| Advanced features | 10% | Typing indicators, read receipts, file sharing, and end-to-end encryption considerations |
The #1 reason candidates struggle: they describe HTTP POST requests for sending messages instead of understanding that chat requires persistent connections and real-time bidirectional communication. The WebSocket design and connection management complexity is what separates strong answers from weak ones.
Step 1: Clarify Requirements
Questions That Define Your Architecture
"Is this 1:1 chat, group chat, or both?" This fundamentally changes your fanout strategy:
- 1:1 chat: Simple message routing between two users
- Group chat: Complex fanout to multiple recipients, permissions, admin controls
- Both: Need to handle different message delivery patterns and storage strategies
"What's the expected scale?" Changes everything from connection management to database partitioning:
- 10K concurrent users: Single server can handle WebSocket connections
- 1M concurrent users: Need connection pooling across multiple servers
- 100M users: Global distribution, sophisticated sharding, CDN integration
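A quick back-of-envelope makes the scale tiers above concrete. The numbers here are assumptions for illustration, not benchmarks: roughly 50k connections per server (bounded by memory and file descriptors) and ~20 KB of memory per idle connection.

```javascript
// Back-of-envelope: WebSocket fleet sizing for N concurrent users.
// connsPerServer and kbPerConn are assumptions to tune per benchmark.
function estimateServers(concurrentUsers, connsPerServer = 50000) {
  return Math.ceil(concurrentUsers / connsPerServer);
}

function estimateMemoryGB(concurrentUsers, kbPerConn = 20) {
  return (concurrentUsers * kbPerConn) / (1024 * 1024);
}

console.log(estimateServers(1_000_000));  // 20 servers
console.log(estimateMemoryGB(1_000_000)); // ~19 GB total across the fleet
```

At 1M concurrent users, no single box handles the connection count, which is exactly why the connection-pooling and routing discussion below matters.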
"Do we need message history and how much?" Determines storage strategy and retrieval patterns:
- Last 100 messages: Can store in memory/cache
- Complete history: Need efficient database pagination and archival strategy
- Search functionality: Requires full-text search indexing
"What types of clients do we support?" Affects connection strategy and fallback mechanisms:
- Web browsers: WebSocket with HTTP fallback
- Mobile apps: Need to handle connection drops, background states, push notifications
- Desktop apps: Persistent connections with better reliability
"Do we need online presence and typing indicators?" Real-time features that require additional event broadcasting:
- Online/offline status updates
- "User is typing..." indicators
- Last seen timestamps
Functional Requirements
After clarifying, state what you're building:
Core messaging:
- Send and receive messages in real-time via 1:1 and group chats
- Support text messages with basic formatting
- Message history storage and retrieval with pagination
- Delivery receipts and read status tracking
Real-time features:
- Online/offline presence indicators
- Typing indicators and real-time status updates
- Push notifications for offline users
- File and media sharing capabilities
Non-functional:
- Sub-second message delivery for online users
- 99.9% message delivery reliability
- Support for millions of concurrent connections
- Graceful degradation when servers are unavailable
Step 2: High-Level Design
API Design
WebSocket Endpoints:
wss://chat.example.com/ws?token=JWT_TOKEN
HTTP REST API:
GET /api/v1/chats # List user's conversations
GET /api/v1/chats/{chatId}/messages # Message history with pagination
POST /api/v1/chats # Create new chat/group
POST /api/v1/chats/{chatId}/members # Add members to group
PUT /api/v1/users/{userId}/presence # Update online status
POST /api/v1/media/upload # Upload files/images
WebSocket message format:
{
"type": "message",
"chatId": "chat_123",
"messageId": "msg_abc",
"senderId": "user_456",
"content": "Hello world!",
"timestamp": 1703875200000,
"messageType": "text"
}
{
"type": "typing_indicator",
"chatId": "chat_123",
"userId": "user_456",
"isTyping": true
}
{
"type": "presence_update",
"userId": "user_456",
"status": "online",
"lastSeen": 1703875200000
}
{
"type": "delivery_receipt",
"messageId": "msg_abc",
"status": "delivered", // delivered, read
"userId": "user_789",
"timestamp": 1703875205000
}
System Architecture
[Client Apps] ←→ [Load Balancer] ←→ [WebSocket Servers] ←→ [Message Queue]
Supporting services behind the WebSocket tier:
- Connection Manager + Redis Sessions — track which server holds each user's connections
- Presence Service — online/offline status and broadcasts
- Chat Service + Message DB — persistence and message sequencing
- Notification Service — push notifications for offline users
- File Storage + User DB — media blobs and account data
Request flow for sending a message:
- Client sends message via WebSocket to WebSocket server
- WebSocket server authenticates and validates the message
- Message queued for processing and persistence
- Chat service saves message to database and generates sequence number
- Message fanout to all chat participants via message queue
- WebSocket servers deliver to online recipients
- Notification service handles push notifications for offline users
- Delivery receipts sent back to sender
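The flow above (steps 2–6) can be sketched in-process. Names like `store` and `onlineUsers` are illustrative stand-ins for the database and the connection registry, not a real API.

```javascript
// Minimal in-process sketch of the send pipeline: validate, persist
// with a sequence number, then fan out to online participants.
let nextSequence = 0;
const store = [];               // stands in for the message DB
const onlineUsers = new Map();  // userId -> deliver callback (WebSocket stand-in)

function sendMessage(chatId, senderId, content, participants) {
  if (!content || !participants.includes(senderId)) {
    throw new Error('invalid message');        // step 2: validate
  }
  const message = {
    messageId: `msg_${store.length + 1}`,
    chatId, senderId, content,
    chatSequence: ++nextSequence,              // step 4: sequence number
  };
  store.push(message);                         // step 4: persist
  for (const userId of participants) {         // step 5: fanout
    if (userId === senderId) continue;
    const deliver = onlineUsers.get(userId);
    if (deliver) deliver(message);             // step 6: deliver to online user
    // offline users would instead be queued for push notification (step 7)
  }
  return message;
}
```

In the real system each arrow in the sketch is a network hop through the queue and routing layer; the ordering of the steps is what carries over.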
Step 3: Deep Dive — WebSocket Connection Management
Connection Pooling and Load Balancing
Managing millions of persistent connections requires careful architecture:
@Service
public class WebSocketConnectionManager {
    private final Map<String, Set<WebSocketSession>> userConnections = new ConcurrentHashMap<>();
    private final RedisTemplate<String, String> redis;
    // Resolve once at startup: InetAddress.getLocalHost() throws a checked
    // UnknownHostException, so don't call it inline on every connection
    private final String serverInstance = resolveHostName();

    public void addConnection(String userId, WebSocketSession session) {
        // Track local connections
        userConnections.computeIfAbsent(userId, k -> ConcurrentHashMap.newKeySet())
                       .add(session);
        // Register this server instance for the user in Redis
        redis.opsForSet().add("user_connections:" + userId, serverInstance);
        redis.expire("user_connections:" + userId, Duration.ofHours(1));
    }

    public void removeConnection(String userId, WebSocketSession session) {
        Set<WebSocketSession> sessions = userConnections.get(userId);
        if (sessions != null) {
            sessions.remove(session);
            if (sessions.isEmpty()) {
                userConnections.remove(userId);
                // Remove the server registration once the user has no local connections
                redis.opsForSet().remove("user_connections:" + userId, serverInstance);
            }
        }
    }

    public Set<String> getServerInstancesForUser(String userId) {
        return redis.opsForSet().members("user_connections:" + userId);
    }

    private static String resolveHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            throw new IllegalStateException("Cannot resolve host name", e);
        }
    }
}
Connection distribution strategy:
- Use consistent hashing based on user ID to ensure users consistently connect to the same server when possible
- This reduces Redis lookups and improves connection locality
- When a server goes down, affected users reconnect and get distributed to healthy servers
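The consistent-hashing placement described above can be sketched with a minimal hash ring. The FNV-1a hash and the two-method ring are illustrative; production rings typically add virtual nodes so load stays even when servers join or leave.

```javascript
// FNV-1a: a simple, fast string hash producing an unsigned 32-bit value.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

class HashRing {
  constructor(servers) {
    // Place each server at a position on a 2^32 ring, sorted clockwise.
    this.ring = servers
      .map((s) => ({ server: s, pos: fnv1a(s) }))
      .sort((a, b) => a.pos - b.pos);
  }

  serverFor(userId) {
    const pos = fnv1a(userId);
    // First server clockwise from the user's position; wrap to the start.
    const entry = this.ring.find((e) => e.pos >= pos) || this.ring[0];
    return entry.server;
  }
}
```

The property that matters for chat: a given user ID always maps to the same server, and removing one server only remaps the users that were on it, not the whole fleet.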
Heartbeat and Connection Health
WebSocket connections can silently fail, so active health monitoring is essential:
class ChatWebSocket {
  constructor(url) {
    this.url = url; // kept so reconnect() can reopen the same endpoint
    this.pingInterval = null;
    this.missedPings = 0;
    this.maxMissedPings = 3;
    this.reconnectAttempts = 0;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onmessage = (event) => this.onMessage(event);
    this.ws.onopen = () => { this.reconnectAttempts = 0; }; // healthy again
    this.setupHeartbeat();
  }

  setupHeartbeat() {
    this.pingInterval = setInterval(() => {
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'ping' }));
        this.missedPings++;
        if (this.missedPings > this.maxMissedPings) {
          console.log('Connection seems dead, reconnecting...');
          this.reconnect();
        }
      }
    }, 30000); // send a ping every 30 seconds
  }

  onMessage(event) {
    const message = JSON.parse(event.data);
    if (message.type === 'pong') {
      this.missedPings = 0; // server answered, connection is healthy
      return;
    }
    // Handle other message types...
  }

  reconnect() {
    clearInterval(this.pingInterval);
    this.ws.close();
    this.missedPings = 0;
    // Exponential backoff: 1s, 2s, 4s, ... capped at 30s
    const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
    this.reconnectAttempts++;
    setTimeout(() => this.connect(), delay);
  }
}
Message Routing Across Servers
When users are connected to different WebSocket servers, messages need intelligent routing:
@Component
public class MessageRoutingService {
private final MessageQueue messageQueue;
private final WebSocketConnectionManager connectionManager;
public void routeMessage(ChatMessage message) {
List<String> recipients = getChatParticipants(message.getChatId());
for (String recipient : recipients) {
if (recipient.equals(message.getSenderId())) {
continue; // Don't send back to sender
}
Set<String> serverInstances = connectionManager.getServerInstancesForUser(recipient);
if (serverInstances.isEmpty()) {
// User is offline, queue for push notification
enqueueNotification(recipient, message);
} else {
// Route to all server instances where user has connections
for (String serverInstance : serverInstances) {
RouteMessage routeMsg = RouteMessage.builder()
.serverInstance(serverInstance)
.userId(recipient)
.message(message)
.build();
messageQueue.send("websocket_routing", routeMsg);
}
}
}
}
}
Strong candidates explain the multi-server challenge explicitly: "When user A on server 1 sends a message to user B on server 2, we need a routing mechanism. I'd use Redis to track which server each user is connected to, then route messages through a message queue to the appropriate WebSocket servers."
Step 4: Deep Dive — Message Ordering and Delivery Guarantees
Sequence Numbers and Ordering
Ensuring messages appear in the correct order across all clients is non-trivial in distributed systems:
@Entity
public class ChatMessage {
@Id
private String messageId;
private String chatId;
private String senderId;
private String content;
private Long globalSequence; // Global ordering across all chats
private Long chatSequence; // Per-chat ordering
private Instant timestamp;
private MessageStatus status;
// getters/setters...
}
@Service
public class MessageSequenceService {
private final RedisTemplate<String, String> redis;
public Long generateChatSequence(String chatId) {
// Atomic increment per chat
return redis.opsForValue().increment("chat_sequence:" + chatId);
}
public Long generateGlobalSequence() {
// Global sequence for message ordering across entire system
return redis.opsForValue().increment("global_message_sequence");
}
public void saveMessage(ChatMessage message) {
// Assign sequence numbers before saving
message.setChatSequence(generateChatSequence(message.getChatId()));
message.setGlobalSequence(generateGlobalSequence());
message.setTimestamp(Instant.now());
messageRepository.save(message);
}
}
Client-side ordering logic:
class MessageBuffer {
constructor() {
this.messages = new Map(); // messageId -> message
this.nextExpectedSequence = 1;
this.deliveredMessages = [];
}
addMessage(message) {
this.messages.set(message.messageId, message);
this.processBuffer();
}
processBuffer() {
while (true) {
// Find message with next expected sequence number
const nextMessage = Array.from(this.messages.values())
.find(msg => msg.chatSequence === this.nextExpectedSequence);
if (!nextMessage) {
break; // Wait for missing message
}
// Deliver message in order
this.deliveredMessages.push(nextMessage);
this.messages.delete(nextMessage.messageId);
this.nextExpectedSequence++;
this.renderMessage(nextMessage);
}
}
handleMissingMessages() {
// Request missing messages from server if gap detected
if (this.messages.size > 0) {
const minSequence = Math.min(...Array.from(this.messages.values())
.map(msg => msg.chatSequence));
if (minSequence > this.nextExpectedSequence) {
this.requestMissingMessages(this.nextExpectedSequence, minSequence - 1);
}
}
}
}
Delivery Receipts and Read Status
Implementing WhatsApp-style delivery and read receipts:
@Entity
public class MessageDeliveryReceipt {
@Id
private String id;
private String messageId;
private String userId;
private DeliveryStatus status; // SENT, DELIVERED, READ
private Instant timestamp;
// getters/setters...
}
@Service
public class DeliveryReceiptService {
public void markAsDelivered(String messageId, String userId) {
MessageDeliveryReceipt receipt = MessageDeliveryReceipt.builder()
.messageId(messageId)
.userId(userId)
.status(DeliveryStatus.DELIVERED)
.timestamp(Instant.now())
.build();
deliveryReceiptRepository.save(receipt);
// Send receipt back to message sender
broadcastDeliveryReceipt(messageId, userId, DeliveryStatus.DELIVERED);
}
public void markAsRead(String messageId, String userId) {
// Update existing receipt or create new one
MessageDeliveryReceipt receipt = deliveryReceiptRepository
.findByMessageIdAndUserId(messageId, userId)
.orElse(new MessageDeliveryReceipt());
receipt.setMessageId(messageId);
receipt.setUserId(userId);
receipt.setStatus(DeliveryStatus.READ);
receipt.setTimestamp(Instant.now());
deliveryReceiptRepository.save(receipt);
broadcastDeliveryReceipt(messageId, userId, DeliveryStatus.READ);
}
private void broadcastDeliveryReceipt(String messageId, String userId,
DeliveryStatus status) {
// Find original message sender
ChatMessage originalMessage = messageRepository.findById(messageId).orElse(null); // Spring Data returns Optional
if (originalMessage != null) {
DeliveryReceiptEvent event = DeliveryReceiptEvent.builder()
.messageId(messageId)
.userId(userId)
.status(status)
.timestamp(Instant.now())
.build();
// Send receipt to original sender
webSocketService.sendToUser(originalMessage.getSenderId(), event);
}
}
}
Handling Offline Users
Messages sent to offline users need special handling for delivery guarantees:
@Service
public class OfflineMessageService {
private final NotificationService notificationService;
private final MessageRepository messageRepository;
public void handleOfflineDelivery(String userId, ChatMessage message) {
// Store message for offline user
OfflineMessage offlineMsg = OfflineMessage.builder()
.userId(userId)
.messageId(message.getMessageId())
.chatId(message.getChatId())
.timestamp(Instant.now())
.delivered(false)
.build();
offlineMessageRepository.save(offlineMsg);
// Send push notification
PushNotification notification = PushNotification.builder()
.userId(userId)
.title(getSenderName(message.getSenderId()))
.body(truncateMessage(message.getContent()))
.badge(getUnreadCount(userId))
.data(Map.of(
"chatId", message.getChatId(),
"messageId", message.getMessageId()
))
.build();
notificationService.sendPushNotification(notification);
}
public List<ChatMessage> deliverOfflineMessages(String userId) {
List<OfflineMessage> offlineMessages = offlineMessageRepository
.findByUserIdAndDeliveredFalse(userId);
List<String> messageIds = offlineMessages.stream()
.map(OfflineMessage::getMessageId)
.collect(toList());
List<ChatMessage> messages = messageRepository.findByIdIn(messageIds);
// Mark as delivered
offlineMessages.forEach(msg -> {
msg.setDelivered(true);
msg.setDeliveredAt(Instant.now());
});
offlineMessageRepository.saveAll(offlineMessages);
return messages;
}
}
Step 5: Deep Dive — Database Design and Sharding
Message Storage Schema
CREATE TABLE chat_messages (
message_id VARCHAR(36) PRIMARY KEY,
chat_id VARCHAR(36) NOT NULL,
sender_id VARCHAR(36) NOT NULL,
content TEXT NOT NULL,
message_type ENUM('text', 'image', 'file', 'system') DEFAULT 'text',
chat_sequence BIGINT NOT NULL,
global_sequence BIGINT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_chat_sequence (chat_id, chat_sequence),
INDEX idx_global_sequence (global_sequence),
INDEX idx_sender_time (sender_id, created_at)
);
CREATE TABLE chats (
chat_id VARCHAR(36) PRIMARY KEY,
chat_type ENUM('direct', 'group') NOT NULL,
name VARCHAR(255),
description TEXT,
created_by VARCHAR(36) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
last_message_at TIMESTAMP,
last_message_id VARCHAR(36),
INDEX idx_created_by (created_by),
INDEX idx_last_message (last_message_at)
);
CREATE TABLE chat_participants (
chat_id VARCHAR(36),
user_id VARCHAR(36),
joined_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
role ENUM('admin', 'member') DEFAULT 'member',
last_read_sequence BIGINT DEFAULT 0,
PRIMARY KEY (chat_id, user_id),
INDEX idx_user_chats (user_id, joined_at)
);
Sharding Strategy
Large-scale chat systems require database sharding to handle billions of messages:
@Service
public class ChatShardingService {
private static final int SHARD_COUNT = 16;
public int getShardForChat(String chatId) {
    // floorMod avoids the Math.abs(Integer.MIN_VALUE) edge case,
    // which would otherwise produce a negative shard index
    return Math.floorMod(chatId.hashCode(), SHARD_COUNT);
}
public int getShardForUser(String userId) {
    // User data can be sharded separately from messages
    return Math.floorMod(userId.hashCode(), SHARD_COUNT);
}
public String getMessageTableName(String chatId) {
int shard = getShardForChat(chatId);
return "chat_messages_shard_" + shard;
}
// For cross-shard queries (user's chat list), use a mapping table
public List<ChatSummary> getUserChats(String userId) {
// Query user's chat participation across all shards
List<ChatParticipant> participations = chatParticipantRepository
.findByUserId(userId);
// Group by shard to minimize database connections
Map<Integer, List<String>> chatsByShard = participations.stream()
.collect(groupingBy(
p -> getShardForChat(p.getChatId()),
mapping(ChatParticipant::getChatId, toList())
));
List<ChatSummary> allChats = new ArrayList<>();
for (Map.Entry<Integer, List<String>> entry : chatsByShard.entrySet()) {
DataSource shard = getShardDataSource(entry.getKey());
List<ChatSummary> shardChats = queryChatSummaries(shard, entry.getValue());
allChats.addAll(shardChats);
}
return allChats.stream()
.sorted((a, b) -> b.getLastMessageAt().compareTo(a.getLastMessageAt()))
.collect(toList());
}
}
Hot partition mitigation: Large group chats create hot partitions where one shard handles disproportionate traffic. Solutions:
- Move very active chats to dedicated high-performance shards
- Use read replicas for message history queries
- Cache recent messages in Redis for high-traffic chats
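The last mitigation, caching recent messages for hot chats, can be sketched as a bounded per-chat buffer standing in for Redis. The capacity of 100 is an arbitrary assumption.

```javascript
// Bounded per-chat buffer so history reads for hot group chats
// are served from memory instead of hitting the sharded database.
class RecentMessageCache {
  constructor(capacity = 100) {
    this.capacity = capacity;
    this.byChat = new Map(); // chatId -> newest-last array of messages
  }

  add(chatId, message) {
    const buf = this.byChat.get(chatId) || [];
    buf.push(message);
    if (buf.length > this.capacity) buf.shift(); // evict the oldest
    this.byChat.set(chatId, buf);
  }

  recent(chatId, limit = 50) {
    const buf = this.byChat.get(chatId) || [];
    return buf.slice(-limit); // cache hit: no DB query needed
  }
}
```

In Redis the same shape is typically a capped list or sorted set per chat; the point is that the hottest read path never touches the hot partition.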
Message History and Pagination
Efficient retrieval of chat history with proper pagination:
@RestController
public class ChatHistoryController {
@GetMapping("/api/v1/chats/{chatId}/messages")
public MessageHistoryResponse getMessages(
@PathVariable String chatId,
@RequestParam(required = false) String cursor,
@RequestParam(defaultValue = "50") int limit) {
// Validate user access to chat
validateUserAccess(chatId, getCurrentUserId());
Long sequenceStart = null;
if (cursor != null) {
sequenceStart = decodeCursor(cursor);
}
List<ChatMessage> messages = messageService.getChatHistory(
chatId, sequenceStart, limit + 1); // Fetch one extra for hasMore
boolean hasMore = messages.size() > limit;
if (hasMore) {
messages = messages.subList(0, limit);
}
String nextCursor = null;
if (hasMore && !messages.isEmpty()) {
ChatMessage lastMessage = messages.get(messages.size() - 1);
nextCursor = encodeCursor(lastMessage.getChatSequence());
}
return MessageHistoryResponse.builder()
.messages(messages)
.hasMore(hasMore)
.nextCursor(nextCursor)
.build();
}
private String encodeCursor(Long sequence) {
return Base64.getEncoder().encodeToString(sequence.toString().getBytes());
}
private Long decodeCursor(String cursor) {
byte[] decoded = Base64.getDecoder().decode(cursor);
return Long.parseLong(new String(decoded));
}
}
Step 6: Deep Dive — Advanced Features
Typing Indicators
Real-time typing indicators require careful state management to avoid spam:
@Service
public class TypingIndicatorService {
private final RedisTemplate<String, String> redis;
private static final Duration TYPING_TIMEOUT = Duration.ofSeconds(10);
public void startTyping(String chatId, String userId) {
String key = "typing:" + chatId;
// Add user to typing set with expiration
redis.opsForSet().add(key, userId);
redis.expire(key, TYPING_TIMEOUT);
// Broadcast typing indicator to chat participants
TypingIndicatorEvent event = TypingIndicatorEvent.builder()
.chatId(chatId)
.userId(userId)
.isTyping(true)
.build();
chatEventService.broadcastToChatParticipants(chatId, event, userId);
}
public void stopTyping(String chatId, String userId) {
String key = "typing:" + chatId;
redis.opsForSet().remove(key, userId);
TypingIndicatorEvent event = TypingIndicatorEvent.builder()
.chatId(chatId)
.userId(userId)
.isTyping(false)
.build();
chatEventService.broadcastToChatParticipants(chatId, event, userId);
}
public Set<String> getTypingUsers(String chatId) {
String key = "typing:" + chatId;
return redis.opsForSet().members(key);
}
}
Client-side typing management:
class TypingManager {
constructor(chatId, webSocket) {
this.chatId = chatId;
this.ws = webSocket;
this.typingTimer = null;
this.isCurrentlyTyping = false;
}
handleKeypress() {
if (!this.isCurrentlyTyping) {
this.startTyping();
}
// Reset the timer on each keypress
clearTimeout(this.typingTimer);
this.typingTimer = setTimeout(() => {
this.stopTyping();
}, 3000); // Stop typing after 3 seconds of inactivity
}
startTyping() {
this.isCurrentlyTyping = true;
this.ws.send(JSON.stringify({
type: 'typing_start',
chatId: this.chatId
}));
}
stopTyping() {
if (this.isCurrentlyTyping) {
this.isCurrentlyTyping = false;
this.ws.send(JSON.stringify({
type: 'typing_stop',
chatId: this.chatId
}));
}
}
}
File and Media Sharing
@RestController
public class MediaController {
private final MediaStorageService storageService;
private final VirusScanService virusScanService;
@PostMapping("/api/v1/media/upload")
public MediaUploadResponse uploadMedia(@RequestParam("file") MultipartFile file,
@RequestParam("chatId") String chatId) {
// Validate file size and type
validateFile(file);
// Scan for viruses (async for large files)
virusScanService.scanFile(file);
// Generate unique file ID and upload to storage
String fileId = UUID.randomUUID().toString();
String storageUrl = storageService.uploadFile(file, fileId);
// Create media message
ChatMessage mediaMessage = ChatMessage.builder()
.messageId(UUID.randomUUID().toString())
.chatId(chatId)
.senderId(getCurrentUserId())
.messageType(MessageType.MEDIA)
.content(createMediaContent(file, storageUrl))
.build();
messageService.sendMessage(mediaMessage);
return MediaUploadResponse.builder()
.fileId(fileId)
.url(storageUrl)
.messageId(mediaMessage.getMessageId())
.build();
}
private String createMediaContent(MultipartFile file, String url) {
    MediaContent content = MediaContent.builder()
        .filename(file.getOriginalFilename())
        .mimeType(file.getContentType())
        .size(file.getSize())
        .url(url)
        .build();
    try {
        return objectMapper.writeValueAsString(content);
    } catch (JsonProcessingException e) { // checked exception from Jackson
        throw new UncheckedIOException(e);
    }
}
}
End-to-End Encryption Considerations
While full E2E encryption implementation is beyond most interviews, showing awareness of the challenges demonstrates security thinking:
Key Exchange Flow:
1. Client A generates ephemeral key pair
2. Client A requests Client B's public key from server
3. Clients perform key agreement (ECDH)
4. Messages encrypted with derived symmetric key
5. Server only stores encrypted message blobs
Key challenges for E2E encryption:
- Key management: How do you handle lost devices, key rotation, and multi-device sync?
- Group chats: Complex key distribution when members join/leave groups
- Search: Can't search encrypted message content on server
- Compliance: Some enterprises require message retention and e-discovery
Most interviewers don't expect detailed cryptographic implementation, but mentioning these trade-offs shows you understand that E2E encryption significantly complicates the system architecture while providing important privacy benefits.
Step 7: Common Mistakes
Mistake 1: Using HTTP Instead of WebSocket
Suggesting REST API calls for sending messages instead of persistent connections. Chat requires real-time bidirectional communication that HTTP can't efficiently provide.
Mistake 2: Ignoring Connection Management Across Multiple Servers
Designing for single server without considering how to route messages when users are connected to different WebSocket servers. This is the core distributed systems challenge.
Mistake 3: No Message Ordering Strategy
Assuming messages will arrive in order without sequence numbers or ordering logic. In distributed systems, network latency and server processing can cause messages to arrive out of order.
Mistake 4: Forgetting About Offline Users
Only designing for online users while ignoring message delivery to offline users, push notifications, and message history retrieval when users come back online.
Mistake 5: Poor Database Sharding Strategy
Either not mentioning sharding at scale, or suggesting naive approaches like round-robin that don't consider query patterns and hot partitions.
Mistake 6: Missing Failure Scenarios
Not discussing WebSocket connection failures, server crashes, database outages, and how the system gracefully degrades while maintaining core functionality.
Interviewer Follow-Up Questions
"How would you handle a message sent to a group with 100,000 members?" This tests fan-out strategy. Explain the write amplification problem and solutions: async processing, batching, rate limiting the fanout, and potentially using a pull model for very large groups where online members fetch messages rather than receiving pushes.
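The batching part of that answer can be made concrete: split the recipient list into fixed-size batches, each becoming one queue task for an async fanout worker, so no single task carries the full write amplification. The batch size of 1000 is an arbitrary assumption.

```javascript
// Split a huge group's recipients into queue-sized fanout tasks.
function createFanoutTasks(messageId, recipientIds, batchSize = 1000) {
  const tasks = [];
  for (let i = 0; i < recipientIds.length; i += batchSize) {
    tasks.push({
      messageId,
      recipients: recipientIds.slice(i, i + batchSize),
    });
  }
  return tasks; // each task becomes one message-queue entry for a worker
}
```

For a 100,000-member group this yields 100 independent tasks that workers can process in parallel and rate-limit, rather than one monolithic fanout.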
"What if a user's message appears different on different devices?" This tests understanding of consistency. Explain the importance of sequence numbers, how clients handle out-of-order delivery, and potential split-brain scenarios when connection state differs across user's devices.
"How do you ensure a message is delivered exactly once?" Explain that exactly-once is extremely difficult in distributed systems and usually unnecessary for chat. At-least-once with idempotency is more practical — duplicate detection on client side using message IDs.
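Client-side duplicate detection is small enough to sketch: at-least-once delivery plus an idempotent receiver gives the user-visible equivalent of exactly-once.

```javascript
// Idempotent receiver: redeliveries of the same messageId are dropped.
class DedupingReceiver {
  constructor() {
    this.seen = new Set(); // messageIds already rendered
    this.rendered = [];
  }

  receive(message) {
    if (this.seen.has(message.messageId)) {
      return false; // duplicate redelivery, silently ignore
    }
    this.seen.add(message.messageId);
    this.rendered.push(message);
    return true;
  }
}
```

A real client bounds the `seen` set (for example, only IDs above the last acknowledged sequence number) so memory doesn't grow forever.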
"What happens when the database is down?" Discuss graceful degradation: message queue accumulates messages, WebSocket connections stay alive, recent messages served from cache, and recovery procedures when database comes back online.
"How would you implement message reactions (like emoji responses)?" Design a separate events system: reactions as lightweight events with messageId reference, aggregated counts cached for popular messages, and real-time broadcasting to chat participants.
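The reaction-events idea can be sketched as lightweight events keyed by messageId, with aggregated counts as the read model. The toggle semantics (react again to un-react) are an assumption, matching common chat UX.

```javascript
// Reactions as per-message events; counts are the aggregate clients render.
class ReactionStore {
  constructor() {
    this.byMessage = new Map(); // messageId -> Map(emoji -> Set(userId))
  }

  toggle(messageId, userId, emoji) {
    const emojis = this.byMessage.get(messageId) || new Map();
    const users = emojis.get(emoji) || new Set();
    // Toggle: a second reaction from the same user removes the first.
    users.has(userId) ? users.delete(userId) : users.add(userId);
    emojis.set(emoji, users);
    this.byMessage.set(messageId, emojis);
  }

  counts(messageId) {
    const emojis = this.byMessage.get(messageId) || new Map();
    // Aggregated counts are what gets cached for popular messages.
    return Object.fromEntries(
      [...emojis].map(([emoji, users]) => [emoji, users.size]));
  }
}
```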
"What about message editing and deletion?" Model edits and deletes as new messages with special types that reference the original. Clients apply these operations to update their local view, while the server keeps an append-only log as an audit trail.
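That edit/delete model folds naturally into a small event reducer, sketched here: the log stays append-only and clients rebuild the displayed view from it.

```javascript
// Fold an append-only event log into the view clients render.
function buildView(events) {
  const view = new Map(); // messageId -> currently displayed message
  for (const e of events) {
    if (e.type === 'message') {
      view.set(e.messageId, { ...e });
    } else if (e.type === 'edit' && view.has(e.targetId)) {
      view.get(e.targetId).content = e.content; // replace displayed text
      view.get(e.targetId).edited = true;       // show an "(edited)" marker
    } else if (e.type === 'delete') {
      view.delete(e.targetId); // hide from view; the log still has it
    }
  }
  return view;
}
```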
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: 1:1 vs group, scale, message history, client types, real-time features |
| 5-12 min | High-level design: WebSocket architecture, API design, component overview, message flow |
| 12-22 min | WebSocket deep dive: connection management, routing across servers, heartbeat, failover |
| 22-28 min | Message ordering and delivery: sequence numbers, delivery receipts, offline user handling |
| 28-32 min | Database design: sharding strategy, message storage schema, pagination |
| 32-35 min | Advanced features: typing indicators, file sharing, presence, E2E encryption considerations |
The chat system interview is fundamentally about real-time distributed systems. The core challenges are WebSocket connection management across multiple servers, reliable message ordering and delivery, and efficient data storage at scale. Strong candidates demonstrate understanding of both the real-time communication complexity and the distributed systems engineering required to make it work reliably.