Design a Chat System

Who Asks This Question?

Chat system design is a staple at companies where real-time communication is either core to their business or a critical user experience component. Based on interview reports, it's frequently asked at:

  • Meta — WhatsApp, Messenger, and Instagram DMs handle billions of messages daily
  • Slack — Their entire platform is built around real-time messaging at enterprise scale
  • Discord — Real-time chat for millions of concurrent users in voice and text channels
  • Zoom — In-meeting chat and persistent team messaging features
  • Microsoft — Teams integration with Office 365 and enterprise communication
  • Snapchat — Ephemeral messaging with multimedia and real-time features
  • Telegram — High-performance messaging with security focus and large group support
  • LinkedIn — Professional messaging integrated with social networking features
  • Twitch — Live chat during streams with massive concurrent viewers

This question tests whether you understand real-time systems, WebSocket management, and the complexity of reliable message delivery across different client states. Companies ask it because chat seems simple but involves deep distributed systems challenges around ordering, delivery guarantees, and connection management.

What the Interviewer Is Really Testing

Most candidates focus on the basic "send message from A to B" flow and miss the harder problems around scale, reliability, and user experience. Here's what interviewers actually evaluate:

| Evaluation Area | Weight | What They're Looking For |
| --- | --- | --- |
| Requirements gathering | 15% | Do you ask about 1:1 vs group chat, online presence, message history, and client types? |
| Real-time connection design | 30% | WebSocket management, connection pooling, heartbeats, and graceful degradation |
| Message ordering and delivery | 25% | Sequence numbers, delivery receipts, handling offline users, and duplicate prevention |
| Database design and scale | 20% | Sharding strategies, hot partitions, message storage, and efficient retrieval |
| Advanced features | 10% | Typing indicators, read receipts, file sharing, and end-to-end encryption considerations |

The #1 reason candidates struggle: they describe HTTP POST requests for sending messages instead of understanding that chat requires persistent connections and real-time bidirectional communication. The WebSocket design and connection management complexity is what separates strong answers from weak ones.

Step 1: Clarify Requirements

Questions That Define Your Architecture

"Is this 1:1 chat, group chat, or both?" This fundamentally changes your fanout strategy:

  • 1:1 chat: Simple message routing between two users
  • Group chat: Complex fanout to multiple recipients, permissions, admin controls
  • Both: Need to handle different message delivery patterns and storage strategies

"What's the expected scale?" Changes everything from connection management to database partitioning:

  • 10K concurrent users: Single server can handle WebSocket connections
  • 1M concurrent users: Connections must be spread across a fleet of WebSocket servers, with a shared registry of which server holds each user's connection
  • 100M users: Global distribution, sophisticated sharding, CDN integration

"Do we need message history and how much?" Determines storage strategy and retrieval patterns:

  • Last 100 messages: Can store in memory/cache
  • Complete history: Need efficient database pagination and archival strategy
  • Search functionality: Requires full-text search indexing

"What types of clients do we support?" Affects connection strategy and fallback mechanisms:

  • Web browsers: WebSocket with HTTP fallback
  • Mobile apps: Need to handle connection drops, background states, push notifications
  • Desktop apps: Persistent connections with better reliability

"Do we need online presence and typing indicators?" Real-time features that require additional event broadcasting:

  • Online/offline status updates
  • "User is typing..." indicators
  • Last seen timestamps

Functional Requirements

After clarifying, state what you're building:

Core messaging:

  • Send and receive messages in real-time via 1:1 and group chats
  • Support text messages with basic formatting
  • Message history storage and retrieval with pagination
  • Delivery receipts and read status tracking

Real-time features:

  • Online/offline presence indicators
  • Typing indicators and real-time status updates
  • Push notifications for offline users
  • File and media sharing capabilities

Non-functional:

  • Sub-second message delivery for online users
  • 99.9% message delivery reliability
  • Support for millions of concurrent connections
  • Graceful degradation when servers are unavailable

Step 2: High-Level Design

API Design

WebSocket Endpoints:
wss://chat.example.com/ws?token=JWT_TOKEN

HTTP REST API:
GET /api/v1/chats                    # List user's conversations  
GET /api/v1/chats/{chatId}/messages  # Message history with pagination
POST /api/v1/chats                   # Create new chat/group
POST /api/v1/chats/{chatId}/members  # Add members to group
PUT /api/v1/users/{userId}/presence  # Update online status
POST /api/v1/media/upload            # Upload files/images

WebSocket message format:

{
  "type": "message",
  "chatId": "chat_123",
  "messageId": "msg_abc",
  "senderId": "user_456",
  "content": "Hello world!",
  "timestamp": 1703875200000,
  "messageType": "text"
}

{
  "type": "typing_indicator", 
  "chatId": "chat_123",
  "userId": "user_456",
  "isTyping": true
}

{
  "type": "presence_update",
  "userId": "user_456", 
  "status": "online",
  "lastSeen": 1703875200000
}

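{
  "type": "delivery_receipt",
  "messageId": "msg_abc",
  "status": "delivered",
  "userId": "user_789",
  "timestamp": 1703875205000
}

The status field is either "delivered" or "read". JSON does not allow inline comments, so the allowed values are listed here rather than in the payload.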
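
A server should not trust these envelopes as they arrive. The following is a minimal sketch of frame validation, assuming the four envelope types and field names shown above; parseEnvelope and REQUIRED_FIELDS are hypothetical names introduced for illustration, not part of any library.

```javascript
// Required fields per envelope type, taken from the example payloads above.
const REQUIRED_FIELDS = new Map([
  ['message', ['chatId', 'messageId', 'senderId', 'content', 'timestamp', 'messageType']],
  ['typing_indicator', ['chatId', 'userId', 'isTyping']],
  ['presence_update', ['userId', 'status', 'lastSeen']],
  ['delivery_receipt', ['messageId', 'status', 'userId', 'timestamp']],
]);

// Hypothetical helper: parse a raw WebSocket frame and reject malformed input.
function parseEnvelope(raw) {
  let envelope;
  try {
    envelope = JSON.parse(raw);
  } catch (err) {
    return { ok: false, error: 'invalid JSON' };
  }
  if (envelope === null || typeof envelope !== 'object') {
    return { ok: false, error: 'envelope must be an object' };
  }
  // A Map lookup avoids matching inherited keys such as "constructor".
  const required = REQUIRED_FIELDS.get(envelope.type);
  if (!required) {
    return { ok: false, error: `unknown type: ${envelope.type}` };
  }
  const missing = required.filter((field) => !(field in envelope));
  if (missing.length > 0) {
    return { ok: false, error: `missing fields: ${missing.join(', ')}` };
  }
  return { ok: true, envelope };
}
```

Rejecting unknown types and missing fields at the edge keeps malformed frames out of the message queue, where they are much harder to diagnose.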

System Architecture

[Client Apps] ←→ [Load Balancer] ←→ [WebSocket Servers] ←→ [Message Queue]
                                           |                      |
                                           ↓                      ↓
[Connection Manager] ←→ [Presence Service] ←→ [Chat Service] ←→ [Message DB]
           |                                      |                |
           ↓                                      ↓                ↓
[Redis Sessions] ←→ [Notification Service] ←→ [File Storage] ←→ [User DB]

Request flow for sending a message:

  1. Client sends message via WebSocket to WebSocket server
  2. WebSocket server authenticates and validates the message
  3. Message queued for processing and persistence
  4. Chat service saves message to database and generates sequence number
  5. Message fanout to all chat participants via message queue
  6. WebSocket servers deliver to online recipients
  7. Notification service handles push notifications for offline users
  8. Delivery receipts sent back to sender

Step 3: Deep Dive — WebSocket Connection Management

Connection Pooling and Load Balancing

Managing millions of persistent connections requires careful architecture:

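import java.net.InetAddress;
import java.net.UnknownHostException;
import java.time.Duration;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;
import org.springframework.web.socket.WebSocketSession;

@Service
public class WebSocketConnectionManager {
    private final Map<String, Set<WebSocketSession>> userConnections = new ConcurrentHashMap<>();
    private final RedisTemplate<String, String> redis;
    private final String serverInstance;

    public WebSocketConnectionManager(RedisTemplate<String, String> redis) throws UnknownHostException {
        this.redis = redis;
        // Resolve the hostname once; getLocalHost() throws a checked exception
        this.serverInstance = InetAddress.getLocalHost().getHostName();
    }

    public void addConnection(String userId, WebSocketSession session) {
        // Track local connections; compute() makes create-and-add atomic per user
        userConnections.compute(userId, (id, sessions) -> {
            if (sessions == null) {
                sessions = ConcurrentHashMap.newKeySet();
            }
            sessions.add(session);
            return sessions;
        });

        // Register this server instance for the user in Redis. The TTL only guards
        // against crashed servers, so refresh it on heartbeat for long-lived connections.
        String key = "user_connections:" + userId;
        redis.opsForSet().add(key, serverInstance);
        redis.expire(key, Duration.ofHours(1));
    }

    public void removeConnection(String userId, WebSocketSession session) {
        // Remove the session and drop the map entry atomically when it was the last one
        Set<WebSocketSession> remaining = userConnections.computeIfPresent(userId, (id, sessions) -> {
            sessions.remove(session);
            return sessions.isEmpty() ? null : sessions;
        });
        if (remaining == null) {
            // No local connections left: remove this server's registration
            redis.opsForSet().remove("user_connections:" + userId, serverInstance);
        }
    }

    public Set<String> getServerInstancesForUser(String userId) {
        Set<String> servers = redis.opsForSet().members("user_connections:" + userId);
        return servers != null ? servers : Set.of();
    }
}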

Connection distribution strategy:

  • Use consistent hashing based on user ID to ensure users consistently connect to the same server when possible
  • This reduces Redis lookups and improves connection locality
  • When a server goes down, affected users reconnect and get distributed to healthy servers

Heartbeat and Connection Health

WebSocket connections can silently fail, so active health monitoring is essential:

class ChatWebSocket {
  constructor(url) {
    this.url = url;
    this.maxMissedPings = 3;
    this.reconnectAttempts = 0;
    this.pingInterval = null;
    this.connect();
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.missedPings = 0;
    this.ws.onopen = () => {
      this.reconnectAttempts = 0; // Connection is healthy again
      this.setupHeartbeat();
    };
    this.ws.onmessage = (event) => this.onMessage(event);
    this.ws.onclose = () => this.reconnect();
  }

  setupHeartbeat() {
    clearInterval(this.pingInterval);
    this.pingInterval = setInterval(() => {
      // Check for unanswered pings before sending the next one
      if (this.missedPings >= this.maxMissedPings) {
        console.log('Connection seems dead, reconnecting...');
        this.ws.onclose = null; // Avoid a second reconnect when close completes
        this.ws.close();
        this.reconnect();
        return;
      }
      if (this.ws.readyState === WebSocket.OPEN) {
        this.ws.send(JSON.stringify({ type: 'ping' }));
        this.missedPings++;
      }
    }, 30000); // Send ping every 30 seconds
  }

  onMessage(event) {
    const message = JSON.parse(event.data);

    if (message.type === 'pong') {
      this.missedPings = 0; // Reset missed ping counter
      return;
    }

    // Handle other message types...
  }

  reconnect() {
    clearInterval(this.pingInterval);

    // Exponential backoff for reconnection: 1s, 2s, 4s, ... capped at 30s
    const delay = Math.min(1000 * Math.pow(2, this.reconnectAttempts), 30000);
    this.reconnectAttempts++;
    setTimeout(() => this.connect(), delay);
  }
}
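
One refinement worth mentioning in an interview: when a WebSocket server dies, every client it held starts the same backoff schedule at the same moment, so they all retry in lockstep. A common mitigation is to add random jitter to the delay. The sketch below assumes the 1-second base and 30-second cap used above; reconnectDelay and its option names are hypothetical.

```javascript
// Delay before reconnect attempt number `attempt` (0-based).
// `random` is injectable so the function can be tested deterministically.
function reconnectDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  // "Full jitter": pick uniformly in [0, exponential) so clients that
  // dropped together do not all reconnect at the same instant.
  return Math.floor(random() * exponential);
}

// Example: the delay a client would wait before its fourth attempt.
const delayMs = reconnectDelay(3);
```

The trade-off is that an individual client may wait slightly longer or shorter than the pure exponential schedule, in exchange for a smoother reconnect load on the surviving servers.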

Message Routing Across Servers

When users are connected to different WebSocket servers, messages need intelligent routing:

@Component
public class MessageRoutingService {
    private final MessageQueue messageQueue;
    private final WebSocketConnectionManager connectionManager;
    
    public void routeMessage(ChatMessage message) {
        List<String> recipients = getChatParticipants(message.getChatId());
        
        for (String recipient : recipients) {
            if (recipient.equals(message.getSenderId())) {
                continue; // Don't send back to sender
            }
            
            Set<String> serverInstances = connectionManager.getServerInstancesForUser(recipient);
            
            if (serverInstances == null || serverInstances.isEmpty()) {
                // User is offline, queue for push notification
                enqueueNotification(recipient, message);
            } else {
                // Route to all server instances where user has connections
                for (String serverInstance : serverInstances) {
                    RouteMessage routeMsg = RouteMessage.builder()
                        .serverInstance(serverInstance)
                        .userId(recipient)
                        .message(message)
                        .build();
                    
                    messageQueue.send("websocket_routing", routeMsg);
                }
            }
        }
    }
}

Strong candidates explain the multi-server challenge explicitly: "When user A on server 1 sends a message to user B on server 2, we need a routing mechanism. I'd use Redis to track which server each user is connected to, then route messages through a message queue to the appropriate WebSocket servers."
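
The routing loop above publishes one queue message per recipient and server pair. For group chats, a possible refinement is to group recipients by server first, so each WebSocket server receives a single message listing its local recipients. This is a sketch under that assumption; planFanout is a hypothetical name, and the registry Map stands in for the per-user Redis sets.

```javascript
// registry: Map of userId -> Set of server instance names
// (the same information the Redis "user_connections:*" sets hold).
function planFanout(recipients, registry, senderId) {
  const byServer = new Map(); // serverInstance -> array of local userIds
  const offline = [];         // users with no registered connection

  for (const userId of recipients) {
    if (userId === senderId) continue; // same rule as the loop above
    const servers = registry.get(userId);
    if (!servers || servers.size === 0) {
      offline.push(userId);
      continue;
    }
    for (const server of servers) {
      if (!byServer.has(server)) byServer.set(server, []);
      byServer.get(server).push(userId);
    }
  }
  return { byServer, offline };
}

const registry = new Map([
  ['alice', new Set(['ws-1'])],
  ['bob', new Set(['ws-1', 'ws-2'])],
  ['carol', new Set()],
]);
const plan = planFanout(['alice', 'bob', 'carol', 'dave'], registry, 'alice');
```

Here bob is reachable on two servers (two devices), while carol and dave fall through to the push-notification path. The number of queue messages then scales with the number of servers involved rather than with group size.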

Step 4: Deep Dive — Message Ordering and Delivery Guarantees

Sequence Numbers and Ordering

Ensuring messages appear in the correct order across all clients is non-trivial in distributed systems:

@Entity
public class ChatMessage {
    @Id
    private String messageId;
    private String chatId;
    private String senderId;
    private String content;
    private Long globalSequence;     // Global ordering across all chats
    private Long chatSequence;       // Per-chat ordering
    private Instant timestamp;
    private MessageStatus status;
    
    // getters/setters...
}

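@Service
public class MessageSequenceService {
    private final RedisTemplate<String, String> redis;
    private final MessageRepository messageRepository;

    public MessageSequenceService(RedisTemplate<String, String> redis,
                                  MessageRepository messageRepository) {
        this.redis = redis;
        this.messageRepository = messageRepository;
    }

    public Long generateChatSequence(String chatId) {
        // Atomic increment per chat (Redis INCR); counters for different chats are independent
        return redis.opsForValue().increment("chat_sequence:" + chatId);
    }

    public Long generateGlobalSequence() {
        // A single global counter funnels every message in the system through one key.
        // Clients only need per-chat ordering, so treat this as optional at large scale.
        return redis.opsForValue().increment("global_message_sequence");
    }

    public void saveMessage(ChatMessage message) {
        // Assign sequence numbers before saving
        message.setChatSequence(generateChatSequence(message.getChatId()));
        message.setGlobalSequence(generateGlobalSequence());
        message.setTimestamp(Instant.now());

        messageRepository.save(message);
    }
}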

Client-side ordering logic:

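class MessageBuffer {
  // lastSeenSequence: chatSequence of the newest message already rendered (0 for a new chat).
  // renderMessage and requestMissingMessages are callbacks supplied by the app.
  constructor(lastSeenSequence, renderMessage, requestMissingMessages) {
    this.pending = new Map(); // chatSequence -> message
    this.nextExpectedSequence = lastSeenSequence + 1;
    this.renderMessage = renderMessage;
    this.requestMissingMessages = requestMissingMessages;
    this.gapTimer = null;
  }

  addMessage(message) {
    if (message.chatSequence < this.nextExpectedSequence) {
      return; // Duplicate of a message that was already rendered
    }
    this.pending.set(message.chatSequence, message);
    this.processBuffer();
  }

  processBuffer() {
    // Deliver consecutive messages starting at the next expected sequence
    while (this.pending.has(this.nextExpectedSequence)) {
      const nextMessage = this.pending.get(this.nextExpectedSequence);
      this.pending.delete(this.nextExpectedSequence);
      this.nextExpectedSequence++;
      this.renderMessage(nextMessage);
    }

    clearTimeout(this.gapTimer);
    if (this.pending.size > 0) {
      // Later messages are waiting on a gap; give the missing ones a moment to arrive
      this.gapTimer = setTimeout(() => this.handleMissingMessages(), 1000);
    }
  }

  handleMissingMessages() {
    // Request missing messages from server if a gap is still open
    if (this.pending.size === 0) return;
    const minSequence = Math.min(...this.pending.keys());

    if (minSequence > this.nextExpectedSequence) {
      this.requestMissingMessages(this.nextExpectedSequence, minSequence - 1);
    }
  }
}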

Delivery Receipts and Read Status

Implementing WhatsApp-style delivery and read receipts:

@Entity
public class MessageDeliveryReceipt {
    @Id
    private String id;
    private String messageId;
    private String userId;
    private DeliveryStatus status; // SENT, DELIVERED, READ
    private Instant timestamp;
    
    // getters/setters...
}

@Service
public class DeliveryReceiptService {
    
    public void markAsDelivered(String messageId, String userId) {
        MessageDeliveryReceipt receipt = MessageDeliveryReceipt.builder()
            .messageId(messageId)
            .userId(userId)
            .status(DeliveryStatus.DELIVERED)
            .timestamp(Instant.now())
            .build();
        
        deliveryReceiptRepository.save(receipt);
        
        // Send receipt back to message sender
        broadcastDeliveryReceipt(messageId, userId, DeliveryStatus.DELIVERED);
    }
    
    public void markAsRead(String messageId, String userId) {
        // Update existing receipt or create new one
        MessageDeliveryReceipt receipt = deliveryReceiptRepository
            .findByMessageIdAndUserId(messageId, userId)
            .orElse(new MessageDeliveryReceipt());
        
        receipt.setMessageId(messageId);
        receipt.setUserId(userId);
        receipt.setStatus(DeliveryStatus.READ);
        receipt.setTimestamp(Instant.now());
        
        deliveryReceiptRepository.save(receipt);
        broadcastDeliveryReceipt(messageId, userId, DeliveryStatus.READ);
    }
    
    private void broadcastDeliveryReceipt(String messageId, String userId, 
                                        DeliveryStatus status) {
        // Find original message sender; findById returns an Optional in Spring Data
        messageRepository.findById(messageId).ifPresent(originalMessage -> {
            DeliveryReceiptEvent event = DeliveryReceiptEvent.builder()
                .messageId(messageId)
                .userId(userId)
                .status(status)
                .timestamp(Instant.now())
                .build();
            
            // Send receipt to original sender
            webSocketService.sendToUser(originalMessage.getSenderId(), event);
        });
    }
}
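
Receipts can arrive out of order or more than once, so a status should only ever move forward (SENT, then DELIVERED, then READ). The sketch below shows that rule plus one common convention for group chats, where the sender sees the lowest status across recipients. advanceStatus and groupStatus are hypothetical helpers, not part of the service above.

```javascript
const STATUS_RANK = new Map([['SENT', 0], ['DELIVERED', 1], ['READ', 2]]);

// Returns the status to store after receiving `incoming`.
// A late DELIVERED receipt must not overwrite an earlier READ.
function advanceStatus(current, incoming) {
  if (!STATUS_RANK.has(incoming)) {
    throw new Error(`unknown status: ${incoming}`);
  }
  if (current === undefined) return incoming;
  return STATUS_RANK.get(incoming) > STATUS_RANK.get(current) ? incoming : current;
}

// Group chats: show the sender the lowest status across all recipients,
// so "READ" means every recipient has read the message.
function groupStatus(perRecipientStatuses) {
  if (perRecipientStatuses.length === 0) return 'SENT';
  let lowest = 'READ';
  for (const status of perRecipientStatuses) {
    if (STATUS_RANK.get(status) < STATUS_RANK.get(lowest)) lowest = status;
  }
  return lowest;
}
```

Applying the same forward-only rule on both server and client makes receipt handling idempotent, which fits the at-least-once delivery model used elsewhere in the design.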

Handling Offline Users

Messages sent to offline users need special handling for delivery guarantees:

@Service
public class OfflineMessageService {
    private final NotificationService notificationService;
    private final MessageRepository messageRepository;
    
    public void handleOfflineDelivery(String userId, ChatMessage message) {
        // Store message for offline user
        OfflineMessage offlineMsg = OfflineMessage.builder()
            .userId(userId)
            .messageId(message.getMessageId())
            .chatId(message.getChatId())
            .timestamp(Instant.now())
            .delivered(false)
            .build();
        
        offlineMessageRepository.save(offlineMsg);
        
        // Send push notification
        PushNotification notification = PushNotification.builder()
            .userId(userId)
            .title(getSenderName(message.getSenderId()))
            .body(truncateMessage(message.getContent()))
            .badge(getUnreadCount(userId))
            .data(Map.of(
                "chatId", message.getChatId(),
                "messageId", message.getMessageId()
            ))
            .build();
        
        notificationService.sendPushNotification(notification);
    }
    
    public List<ChatMessage> deliverOfflineMessages(String userId) {
        List<OfflineMessage> offlineMessages = offlineMessageRepository
            .findByUserIdAndDeliveredFalse(userId);
        
        List<String> messageIds = offlineMessages.stream()
            .map(OfflineMessage::getMessageId)
            .collect(toList());
        
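        // findAllById is the standard Spring Data lookup by primary key
        // (assumes MessageRepository extends JpaRepository, which returns a List)
        List<ChatMessage> messages = new ArrayList<>(messageRepository.findAllById(messageIds));
        // Replay in a stable order: by chat, then by per-chat sequence
        messages.sort(Comparator.comparing(ChatMessage::getChatId)
            .thenComparing(ChatMessage::getChatSequence));
        
        // Mark as delivered. For an at-least-once guarantee, do this only after the
        // client acknowledges receipt; otherwise a dropped connection loses the messages.
        offlineMessages.forEach(msg -> {
            msg.setDelivered(true);
            msg.setDeliveredAt(Instant.now());
        });
        offlineMessageRepository.saveAll(offlineMessages);
        
        return messages;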
    }
}

Step 5: Deep Dive — Database Design and Sharding

Message Storage Schema

CREATE TABLE chat_messages (
    message_id VARCHAR(36) PRIMARY KEY,
    chat_id VARCHAR(36) NOT NULL,
    sender_id VARCHAR(36) NOT NULL,
    content TEXT NOT NULL,
    message_type ENUM('text', 'image', 'file', 'system') DEFAULT 'text',
    chat_sequence BIGINT NOT NULL,
    global_sequence BIGINT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    
    UNIQUE INDEX idx_chat_sequence (chat_id, chat_sequence),
    INDEX idx_global_sequence (global_sequence),
    INDEX idx_sender_time (sender_id, created_at)
);

CREATE TABLE chats (
    chat_id VARCHAR(36) PRIMARY KEY,
    chat_type ENUM('direct', 'group') NOT NULL,
    name VARCHAR(255),
    description TEXT,
    created_by VARCHAR(36) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    last_message_at TIMESTAMP,
    last_message_id VARCHAR(36),
    
    INDEX idx_created_by (created_by),
    INDEX idx_last_message (last_message_at)
);

CREATE TABLE chat_participants (
    chat_id VARCHAR(36),
    user_id VARCHAR(36),
    joined_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    role ENUM('admin', 'member') DEFAULT 'member',
    last_read_sequence BIGINT DEFAULT 0,
    
    PRIMARY KEY (chat_id, user_id),
    INDEX idx_user_chats (user_id, joined_at)
);

Sharding Strategy

Large-scale chat systems require database sharding to handle billions of messages:

@Service
public class ChatShardingService {
    private static final int SHARD_COUNT = 16;
    
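    public int getShardForChat(String chatId) {
        // Hash-mod placement on chat ID. floorMod keeps the result non-negative
        // (Math.abs(Integer.MIN_VALUE) is still negative). Note that this is not
        // consistent hashing: changing SHARD_COUNT remaps most chats.
        return Math.floorMod(chatId.hashCode(), SHARD_COUNT);
    }
    
    public int getShardForUser(String userId) {
        // User data can be sharded separately from messages
        return Math.floorMod(userId.hashCode(), SHARD_COUNT);
    }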
    
    public String getMessageTableName(String chatId) {
        int shard = getShardForChat(chatId);
        return "chat_messages_shard_" + shard;
    }
    
    // For cross-shard queries (user's chat list), use a mapping table
    public List<ChatSummary> getUserChats(String userId) {
        // Query user's chat participation across all shards
        List<ChatParticipant> participations = chatParticipantRepository
            .findByUserId(userId);
        
        // Group by shard to minimize database connections
        Map<Integer, List<String>> chatsByShard = participations.stream()
            .collect(groupingBy(
                p -> getShardForChat(p.getChatId()),
                mapping(ChatParticipant::getChatId, toList())
            ));
        
        List<ChatSummary> allChats = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : chatsByShard.entrySet()) {
            DataSource shard = getShardDataSource(entry.getKey());
            List<ChatSummary> shardChats = queryChatSummaries(shard, entry.getValue());
            allChats.addAll(shardChats);
        }
        
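        // Newest activity first; chats with no messages yet (null timestamp) go last
        return allChats.stream()
            .sorted(Comparator.comparing(ChatSummary::getLastMessageAt,
                Comparator.nullsLast(Comparator.reverseOrder())))
            .collect(toList());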
    }
}

Hot partition mitigation: Large group chats create hot partitions where one shard handles disproportionate traffic. Solutions:

  • Move very active chats to dedicated high-performance shards
  • Use read replicas for message history queries
  • Cache recent messages in Redis for high-traffic chats
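
As an illustration of the last bullet, here is a minimal in-memory sketch of a capped per-chat cache of recent messages. In Redis the same shape is commonly built from a capped list (LPUSH followed by LTRIM). RecentMessageCache and its method names are assumptions for illustration.

```javascript
class RecentMessageCache {
  constructor(maxPerChat = 100) {
    this.maxPerChat = maxPerChat;
    this.byChat = new Map(); // chatId -> messages in ascending chatSequence order
  }

  // Called on the write path after a message has been persisted.
  append(message) {
    const list = this.byChat.get(message.chatId) ?? [];
    list.push(message);
    if (list.length > this.maxPerChat) list.shift(); // evict the oldest
    this.byChat.set(message.chatId, list);
  }

  // Returns up to `limit` newest messages, or null on a cache miss so the
  // caller knows to fall back to the database.
  latest(chatId, limit) {
    const list = this.byChat.get(chatId);
    if (!list) return null;
    if (limit <= 0) return [];
    return list.slice(-limit);
  }
}
```

Because most reads in a busy chat are for the newest page, serving that page from memory takes the bulk of read traffic off the hot shard; older pages still go to the database or a read replica.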

Message History and Pagination

Efficient retrieval of chat history with proper pagination:

@RestController
public class ChatHistoryController {
    
    @GetMapping("/api/v1/chats/{chatId}/messages")
    public MessageHistoryResponse getMessages(
            @PathVariable String chatId,
            @RequestParam(required = false) String cursor,
            @RequestParam(defaultValue = "50") int limit) {
        
        // Validate user access to chat
        validateUserAccess(chatId, getCurrentUserId());
        
        Long sequenceStart = null;
        if (cursor != null) {
            sequenceStart = decodeCursor(cursor);
        }
        
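        // Clamp the page size so a client cannot request an unbounded page
        // (100 is an arbitrary cap for this example)
        limit = Math.min(Math.max(limit, 1), 100);
        
        List<ChatMessage> messages = messageService.getChatHistory(
            chatId, sequenceStart, limit + 1); // Fetch one extra for hasMore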
        
        boolean hasMore = messages.size() > limit;
        if (hasMore) {
            messages = messages.subList(0, limit);
        }
        
        String nextCursor = null;
        if (hasMore && !messages.isEmpty()) {
            ChatMessage lastMessage = messages.get(messages.size() - 1);
            nextCursor = encodeCursor(lastMessage.getChatSequence());
        }
        
        return MessageHistoryResponse.builder()
            .messages(messages)
            .hasMore(hasMore)
            .nextCursor(nextCursor)
            .build();
    }
    
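    private String encodeCursor(Long sequence) {
        return Base64.getEncoder().encodeToString(
            sequence.toString().getBytes(StandardCharsets.UTF_8));
    }
    
    private Long decodeCursor(String cursor) {
        try {
            byte[] decoded = Base64.getDecoder().decode(cursor);
            return Long.parseLong(new String(decoded, StandardCharsets.UTF_8));
        } catch (IllegalArgumentException e) {
            // Covers malformed Base64 and NumberFormatException (a subclass)
            throw new ResponseStatusException(HttpStatus.BAD_REQUEST, "Invalid cursor");
        }
    }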
}
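
The controller leaves the query itself to messageService. The sketch below shows the same cursor logic end to end over an in-memory array, assuming pages are returned newest first and the cursor means "messages with a chatSequence lower than this". pageNewestFirst is a hypothetical name, and the filter stands in for a range query on the (chat_id, chat_sequence) index.

```javascript
function encodeCursor(sequence) {
  return Buffer.from(String(sequence), 'utf8').toString('base64');
}

function decodeCursor(cursor) {
  const text = Buffer.from(cursor, 'base64').toString('utf8');
  if (!/^\d+$/.test(text)) throw new Error('invalid cursor');
  return Number(text);
}

// messages: all messages of one chat, each with a numeric chatSequence.
function pageNewestFirst(messages, cursor, limit) {
  const size = Math.min(Math.max(limit, 1), 100); // clamp the page size
  const before = cursor ? decodeCursor(cursor) : Infinity;
  const candidates = messages
    .filter((m) => m.chatSequence < before)          // WHERE chat_sequence < ?
    .sort((a, b) => b.chatSequence - a.chatSequence) // ORDER BY chat_sequence DESC
    .slice(0, size + 1);                             // LIMIT size + 1
  const hasMore = candidates.length > size;
  const page = candidates.slice(0, size);
  const nextCursor = hasMore ? encodeCursor(page[page.length - 1].chatSequence) : null;
  return { messages: page, hasMore, nextCursor };
}
```

Unlike OFFSET-based paging, the cost of each page stays constant no matter how deep the user scrolls, and new messages arriving at the head do not shift the pages already loaded.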

Step 6: Deep Dive — Advanced Features

Typing Indicators

Real-time typing indicators require careful state management to avoid spam:

@Service
public class TypingIndicatorService {
    private final RedisTemplate<String, String> redis;
    private static final Duration TYPING_TIMEOUT = Duration.ofSeconds(10);
    
    public void startTyping(String chatId, String userId) {
        String key = "typing:" + chatId;
        long now = System.currentTimeMillis();
        
        // Sorted set scored by each user's own expiry time. A single TTL on a
        // shared set would be extended by every other typer, leaving stale
        // entries for users who disconnected mid-sentence.
        redis.opsForZSet().add(key, userId, now + TYPING_TIMEOUT.toMillis());
        redis.expire(key, TYPING_TIMEOUT); // Clean up the key once the chat goes quiet
        
        // Broadcast typing indicator to chat participants
        TypingIndicatorEvent event = TypingIndicatorEvent.builder()
            .chatId(chatId)
            .userId(userId)
            .isTyping(true)
            .build();
        
        chatEventService.broadcastToChatParticipants(chatId, event, userId);
    }
    
    public void stopTyping(String chatId, String userId) {
        String key = "typing:" + chatId;
        redis.opsForZSet().remove(key, userId);
        
        TypingIndicatorEvent event = TypingIndicatorEvent.builder()
            .chatId(chatId)
            .userId(userId)
            .isTyping(false)
            .build();
        
        chatEventService.broadcastToChatParticipants(chatId, event, userId);
    }
    
    public Set<String> getTypingUsers(String chatId) {
        String key = "typing:" + chatId;
        // Drop entries whose expiry has passed, then return whoever is left
        redis.opsForZSet().removeRangeByScore(key, 0, System.currentTimeMillis());
        return redis.opsForZSet().range(key, 0, -1);
    }
}

Client-side typing management:

class TypingManager {
  constructor(chatId, webSocket) {
    this.chatId = chatId;
    this.ws = webSocket;
    this.typingTimer = null;
    this.isCurrentlyTyping = false;
    this.lastTypingSentAt = 0;
  }
  
  handleKeypress() {
    // Send typing_start on the first keypress, then refresh it every 5 seconds
    // so the 10-second server-side timeout does not expire during a long message
    if (!this.isCurrentlyTyping || Date.now() - this.lastTypingSentAt > 5000) {
      this.startTyping();
    }
    
    // Reset the timer on each keypress
    clearTimeout(this.typingTimer);
    this.typingTimer = setTimeout(() => {
      this.stopTyping();
    }, 3000); // Stop typing after 3 seconds of inactivity
  }
  
  startTyping() {
    this.isCurrentlyTyping = true;
    this.lastTypingSentAt = Date.now();
    this.ws.send(JSON.stringify({
      type: 'typing_start',
      chatId: this.chatId
    }));
  }
  
  stopTyping() {
    if (this.isCurrentlyTyping) {
      this.isCurrentlyTyping = false;
      this.ws.send(JSON.stringify({
        type: 'typing_stop',
        chatId: this.chatId
      }));
    }
  }
}

File and Media Sharing

@RestController
public class MediaController {
    private final MediaStorageService storageService;
    private final VirusScanService virusScanService;
    
    @PostMapping("/api/v1/media/upload")
    public MediaUploadResponse uploadMedia(@RequestParam("file") MultipartFile file,
                                         @RequestParam("chatId") String chatId) {
        
        // Validate file size and type
        validateFile(file);
        
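        // Scan for viruses before storing; scanFile is assumed to throw on an infected file.
        // Large files would be scanned asynchronously, with the message held until the scan passes.
        virusScanService.scanFile(file);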
        
        // Generate unique file ID and upload to storage
        String fileId = UUID.randomUUID().toString();
        String storageUrl = storageService.uploadFile(file, fileId);
        
        // Create media message
        ChatMessage mediaMessage = ChatMessage.builder()
            .messageId(UUID.randomUUID().toString())
            .chatId(chatId)
            .senderId(getCurrentUserId())
            .messageType(MessageType.MEDIA)
            .content(createMediaContent(file, storageUrl))
            .build();
        
        messageService.sendMessage(mediaMessage);
        
        return MediaUploadResponse.builder()
            .fileId(fileId)
            .url(storageUrl)
            .messageId(mediaMessage.getMessageId())
            .build();
    }
    
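    private String createMediaContent(MultipartFile file, String url) {
        MediaContent content = MediaContent.builder()
            .filename(file.getOriginalFilename())
            .mimeType(file.getContentType())
            .size(file.getSize())
            .url(url)
            .build();
        
        try {
            return objectMapper.writeValueAsString(content);
        } catch (JsonProcessingException e) {
            // Jackson's writeValueAsString declares a checked exception
            throw new IllegalStateException("Failed to serialize media content", e);
        }
    }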
}

End-to-End Encryption Considerations

While full E2E encryption implementation is beyond most interviews, showing awareness of the challenges demonstrates security thinking:

Key Exchange Flow:
1. Client A generates ephemeral key pair
2. Client A requests Client B's public key from server
3. Clients perform key agreement (ECDH)
4. Messages encrypted with derived symmetric key
5. Server only stores encrypted message blobs
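
The flow above can be sketched with Node's built-in crypto module. This is an illustration only: it assumes the P-256 curve, HKDF-SHA256, and AES-256-GCM, and it omits authentication of public keys, so on its own it would not protect against a server that substitutes keys. deriveKey, encrypt, decrypt, and the "chat-e2e-demo" label are hypothetical names.

```javascript
const nodeCrypto = require('node:crypto');

// Steps 3-4: derive a symmetric key from our private key and the peer's public key.
function deriveKey(ownEcdh, peerPublicKey) {
  const sharedSecret = ownEcdh.computeSecret(peerPublicKey);
  // HKDF turns the raw ECDH output into a uniformly random 32-byte AES key.
  const okm = nodeCrypto.hkdfSync('sha256', sharedSecret, Buffer.alloc(0), 'chat-e2e-demo', 32);
  return Buffer.from(okm);
}

// Step 4: authenticated encryption; a fresh random IV is required per message.
function encrypt(key, plaintext) {
  const iv = nodeCrypto.randomBytes(12);
  const cipher = nodeCrypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

function decrypt(key, { iv, ciphertext, tag }) {
  const decipher = nodeCrypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the ciphertext or tag was modified in transit.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}

// Steps 1-2: each client generates a key pair and publishes the public half.
const clientA = nodeCrypto.createECDH('prime256v1');
const clientB = nodeCrypto.createECDH('prime256v1');
const publicA = clientA.generateKeys();
const publicB = clientB.generateKeys();

// Step 5: the server would store only this blob (iv, ciphertext, tag).
const blob = encrypt(deriveKey(clientA, publicB), 'Hello world!');
const received = decrypt(deriveKey(clientB, publicA), blob);
```

Both sides arrive at the same key without it ever crossing the network, which is why the server can relay and store messages without being able to read them.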

Key challenges for E2E encryption:

  • Key management: How do you handle lost devices, key rotation, and multi-device sync?
  • Group chats: Complex key distribution when members join/leave groups
  • Search: Can't search encrypted message content on server
  • Compliance: Some enterprises require message retention and e-discovery

Most interviewers don't expect detailed cryptographic implementation, but mentioning these trade-offs shows you understand that E2E encryption significantly complicates the system architecture while providing important privacy benefits.

Step 7: Common Mistakes

Mistake 1: Using HTTP Instead of WebSocket

Suggesting REST API calls with polling instead of persistent connections. Chat requires the server to push messages to clients in real time, which request/response HTTP cannot do efficiently.

Mistake 2: Ignoring Connection Management Across Multiple Servers

Designing for single server without considering how to route messages when users are connected to different WebSocket servers. This is the core distributed systems challenge.

Mistake 3: No Message Ordering Strategy

Assuming messages will arrive in order without sequence numbers or ordering logic. In distributed systems, network latency and server processing can cause messages to arrive out of order.

Mistake 4: Forgetting About Offline Users

Only designing for online users while ignoring message delivery to offline users, push notifications, and message history retrieval when users come back online.

Mistake 5: Poor Database Sharding Strategy

Either not mentioning sharding at scale, or suggesting naive approaches like round-robin that don't consider query patterns and hot partitions.

Mistake 6: Missing Failure Scenarios

Not discussing WebSocket connection failures, server crashes, database outages, and how the system gracefully degrades while maintaining core functionality.

Interviewer Follow-Up Questions

"How would you handle a message sent to a group with 100,000 members?" This tests fan-out strategy. Explain the write amplification problem and solutions: async processing, batching, rate limiting the fanout, and potentially using a pull model for very large groups where online members fetch messages rather than receiving pushes.

"What if a user's message appears different on different devices?" This tests understanding of consistency. Explain the importance of sequence numbers, how clients handle out-of-order delivery, and potential split-brain scenarios when connection state differs across user's devices.

"How do you ensure a message is delivered exactly once?" Explain that exactly-once is extremely difficult in distributed systems and usually unnecessary for chat. At-least-once with idempotency is more practical — duplicate detection on client side using message IDs.

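The client-side half of the at-least-once answer can be sketched as a bounded duplicate filter keyed on message ID. DuplicateFilter and its capacity are assumptions for illustration; a real client would likely persist this state alongside its message store.

```javascript
class DuplicateFilter {
  constructor(capacity = 1000) {
    this.capacity = capacity;
    this.seen = new Set(); // a Set iterates in insertion order, so the first entry is the oldest
  }

  // Returns true the first time a messageId is seen, false for a redelivery.
  accept(messageId) {
    if (this.seen.has(messageId)) return false;
    this.seen.add(messageId);
    if (this.seen.size > this.capacity) {
      const oldest = this.seen.values().next().value;
      this.seen.delete(oldest); // bound memory by forgetting the oldest ID
    }
    return true;
  }
}
```

The bounded window is a trade-off: a redelivery that arrives after its ID has been evicted would be rendered twice, so the capacity should comfortably exceed the number of messages that can arrive during the server's retry period.
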
"What happens when the database is down?" Discuss graceful degradation: message queue accumulates messages, WebSocket connections stay alive, recent messages served from cache, and recovery procedures when database comes back online.

"How would you implement message reactions (like emoji responses)?" Design a separate events system: reactions as lightweight events with messageId reference, aggregated counts cached for popular messages, and real-time broadcasting to chat participants.

"What about message editing and deletion?" Edit/delete are actually new messages with special types that reference the original message. Clients apply these operations to update their local view while maintaining audit trail.

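The edit and delete answer can be sketched as a small reducer over the client's local view. The event type names (message_edit, message_delete) and the targetMessageId field are assumptions for illustration; the sketch also assumes events reach this function in chatSequence order per chat, which the ordering buffer described earlier provides.

```javascript
// view: Map of messageId -> { content, deleted, version }
function applyEvent(view, event) {
  if (event.type === 'message') {
    view.set(event.messageId, {
      content: event.content,
      deleted: false,
      version: event.chatSequence,
    });
    return view;
  }

  const target = view.get(event.targetMessageId);
  // Ignore events for unknown messages and redelivered (stale) events.
  if (!target || event.chatSequence <= target.version) return view;

  if (event.type === 'message_delete') {
    target.content = null;
    target.deleted = true;
    target.version = event.chatSequence;
  } else if (event.type === 'message_edit' && !target.deleted) {
    target.content = event.content;
    target.version = event.chatSequence;
  }
  return view;
}
```

Because the original message and every later edit remain in the log as separate events, the server keeps a full audit trail while clients only hold the latest state.
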
Summary: Your 35-Minute Interview Plan

| Time | What to Do |
| --- | --- |
| 0-5 min | Clarify requirements: 1:1 vs group, scale, message history, client types, real-time features |
| 5-12 min | High-level design: WebSocket architecture, API design, component overview, message flow |
| 12-22 min | WebSocket deep dive: connection management, routing across servers, heartbeat, failover |
| 22-28 min | Message ordering and delivery: sequence numbers, delivery receipts, offline user handling |
| 28-32 min | Database design: sharding strategy, message storage schema, pagination |
| 32-35 min | Advanced features: typing indicators, file sharing, presence, E2E encryption considerations |

The chat system interview is fundamentally about real-time distributed systems. The core challenges are WebSocket connection management across multiple servers, reliable message ordering and delivery, and efficient data storage at scale. Strong candidates demonstrate understanding of both the real-time communication complexity and the distributed systems engineering required to make it work reliably.