Design a Notification System
Who Asks This Question?
The notification system is a deceptively complex question that appears at companies where real-time communication matters. Based on interview reports, it's frequently asked at:
- Meta — Their entire ecosystem (Facebook, Instagram, WhatsApp) runs on notifications
- Slack — Real-time messaging is their core business; they test notification fanout extensively
- LinkedIn — Job alerts, connection requests, and activity notifications at massive scale
- Airbnb — Booking confirmations, host communications, and travel reminders across multiple channels
- Uber — Trip updates, driver matching, and payment notifications with tight delivery SLAs
- Discord — Real-time chat notifications and presence updates for millions of concurrent users
- Stripe — Payment confirmations, webhook deliveries, and fraud alerts via multiple channels
- Zoom — Meeting reminders, recording notifications, and real-time collaboration alerts
This question tests whether you've dealt with real-time systems and understand the complexity of reliable delivery across different channels. Companies that ask it want to see that you can design for both scale and reliability — notifications must be fast but never lost.
What the Interviewer Is Really Testing
Most candidates focus on the basic "send message to user" flow and miss the hard distributed systems problems. Here's what interviewers actually evaluate:
| Evaluation Area | Weight | What They're Looking For |
|---|---|---|
| Requirements gathering | 20% | Do you ask about notification types, delivery guarantees, and failure scenarios? |
| Multi-channel design | 25% | Push, email, SMS, in-app — each has different constraints and delivery mechanisms |
| Fan-out strategies | 25% | How do you handle one message going to millions of users efficiently? |
| Delivery guarantees | 20% | At-least-once vs exactly-once, handling failures, retry logic |
| Production concerns | 10% | Rate limiting, priority queues, monitoring, user preferences |
The #1 reason candidates struggle: they describe a single notification channel (usually push) and ignore the multi-channel complexity. Real notification systems must coordinate delivery across push notifications, email, SMS, and in-app messages — each with different latency, cost, and reliability characteristics.
Step 1: Clarify Requirements
Questions That Define Your Architecture
"What types of notifications do we need to support?" This determines your entire system design. Different channels have vastly different requirements:
- Push notifications: Fast (seconds), mobile-focused, limited payload size
- Email: Slower (minutes), rich content, high deliverability standards
- SMS: Expensive, ultra-reliable, character limits, regulatory constraints
- In-app notifications: Real-time, supports rich media, only works when user is active
"What's the expected scale?" Changes everything from database choice to fanout strategy:
- 10,000 users: Single server with simple message queues
- 10 million users: Distributed queues, database sharding, rate limiting
- 1 billion users: Global distribution, sophisticated fan-out, priority systems
"What are the delivery guarantee requirements?" This is the most technical question:
- At-most-once: Fast, simple, but notifications might be lost
- At-least-once: Reliable, but users might receive duplicates
- Exactly-once: Complex distributed consensus, usually not worth it
"Do we need analytics and delivery tracking?" If yes, you need to track delivery status, open rates, and click-through rates. This adds event logging, analytics pipelines, and potentially webhook callbacks to third parties.
Functional Requirements
After clarifying, state what you're building:
Core functionality:
- Send notifications via multiple channels (push, email, SMS, in-app)
- Support different notification types: transactional, promotional, system alerts
- Handle both single-user and broadcast notifications
- Track delivery status and user engagement
- User preference management (opt-out, channel selection, frequency limits)
Non-functional:
- High availability — critical notifications (security alerts, payment confirmations) must be delivered
- Low latency — real-time notifications within seconds for active users
- Scale — handle millions of users and billions of notifications per day
- Cost optimization — SMS is expensive (~$0.01 per message), optimize channel selection
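A quick back-of-envelope calculation helps turn these targets into numbers you can reason about. The sketch below is illustrative only: the class name, the 3x peak factor, and the 10 million SMS per day figure are assumptions, while the one-cent SMS price comes from the requirement above.

```java
// Back-of-envelope sizing helper. All names are hypothetical; the peak factor
// and SMS volume used with it are illustrative assumptions, not measured values.
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average notifications per second for a given daily volume (integer division). */
    static long averagePerSecond(long notificationsPerDay) {
        return notificationsPerDay / SECONDS_PER_DAY;
    }

    /** Peak rate, assuming traffic peaks at a fixed multiple of the average. */
    static long peakPerSecond(long notificationsPerDay, int peakFactor) {
        return averagePerSecond(notificationsPerDay) * peakFactor;
    }

    /** Daily SMS spend in whole dollars, given a per-message price in cents. */
    static long dailySmsCostDollars(long smsPerDay, long centsPerSms) {
        return smsPerDay * centsPerSms / 100;
    }
}
```

With 1 billion notifications per day this gives about 11,574 per second on average and 34,722 per second at an assumed 3x peak. Sending 10 million of those as SMS at $0.01 each would cost $100,000 per day, which is why channel selection matters.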
Step 2: High-Level Design
API Design
POST /api/v1/notifications/send
Body: {
  "userId": "user123",
  "type": "payment_confirmation",
  "channels": ["push", "email"],
  "priority": "high",
  "payload": {
    "title": "Payment Successful",
    "message": "Your payment of $25.99 has been processed",
    "deepLink": "/transactions/tx_abc123"
  }
}
POST /api/v1/notifications/broadcast
Body: {
  "audience": {
    "segment": "premium_users",
    "filters": {"country": "US", "active_last_30_days": true}
  },
  "type": "product_announcement",
  "channels": ["push", "email"],
  "payload": {...}
}
GET /api/v1/notifications/{userId}
Response: [...] // user's notification history
PUT /api/v1/users/{userId}/preferences
Body: {
  "email_marketing": false,
  "push_transactional": true,
  "sms_critical_only": true
}
System Architecture
Notification API
|
v
[Message Queue] → [Channel Workers] → [Third-party Services]
| | |
| |→ Push Worker → FCM/APNs
| |→ Email Worker → SendGrid/SES
| |→ SMS Worker → Twilio/AWS SNS
| |→ In-App Worker → WebSocket/SSE
|
v
[Database] ← [Analytics Service]
Request flow:
- API server receives notification request
- Validate user preferences and notification type
- Enqueue messages to channel-specific queues
- Workers consume from queues and call third-party APIs
- Track delivery status and update analytics
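The preference check and the enqueue step of this flow can be sketched as follows. This is a minimal illustration, not a production design: the in-memory queues stand in for a real message broker, and `RouteChannel` and `NotificationRouter` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical names throughout; in-memory queues stand in for a real broker.
enum RouteChannel { PUSH, EMAIL, SMS, IN_APP }

final class NotificationRouter {
    // One queue per channel, so each channel's workers can scale independently.
    private final Map<RouteChannel, Queue<String>> queues = new EnumMap<>(RouteChannel.class);

    NotificationRouter() {
        for (RouteChannel channel : RouteChannel.values()) {
            queues.put(channel, new ArrayDeque<>());
        }
    }

    /**
     * Enqueues the notification on each requested channel the user has enabled.
     * Returns the number of queues the notification was written to.
     */
    int route(String notificationId, List<RouteChannel> requested, Set<RouteChannel> enabledByUser) {
        int enqueued = 0;
        for (RouteChannel channel : requested) {
            if (enabledByUser.contains(channel)) { // preference check happens before enqueue
                queues.get(channel).add(notificationId);
                enqueued++;
            }
        }
        return enqueued;
    }

    /** Number of notifications waiting on the given channel's queue. */
    int depth(RouteChannel channel) {
        return queues.get(channel).size();
    }
}
```

Filtering before the enqueue keeps opted-out messages from consuming worker capacity or third-party quota.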
Channel-Specific Considerations
| Channel | Latency | Cost | Payload Size | Reliability | Use Case |
|---|---|---|---|---|---|
| Push | ~2 seconds | Free | 4KB | 95% delivery | Real-time alerts |
| Email | ~30 seconds | $0.0001 | Unlimited | 99% delivery | Rich content, receipts |
| SMS | ~5 seconds | $0.01 | 160 chars | 99.9% delivery | Critical alerts |
| In-app | ~1 second | Free | Unlimited | Only if user active | Activity feeds |
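One way to use these figures is to pick the cheapest channel that still meets a delivery deadline. The sketch below assumes the rough latency and cost values from the table; `DeliveryChannel` and `ChannelSelector` are hypothetical names, and a real selector would also weigh reliability and whether the user is currently active.

```java
import java.util.Comparator;
import java.util.Optional;
import java.util.Set;

// Latency and cost are the approximate values from the table above.
enum DeliveryChannel {
    IN_APP(1, 0.0), PUSH(2, 0.0), SMS(5, 0.01), EMAIL(30, 0.0001);

    final int typicalLatencySeconds;
    final double costPerMessageUsd;

    DeliveryChannel(int typicalLatencySeconds, double costPerMessageUsd) {
        this.typicalLatencySeconds = typicalLatencySeconds;
        this.costPerMessageUsd = costPerMessageUsd;
    }
}

final class ChannelSelector {
    /**
     * Picks the cheapest allowed channel whose typical latency meets the deadline.
     * Ties on cost are broken in favor of the faster channel.
     */
    static Optional<DeliveryChannel> cheapestWithin(int deadlineSeconds, Set<DeliveryChannel> allowed) {
        return allowed.stream()
            .filter(c -> c.typicalLatencySeconds <= deadlineSeconds)
            .min(Comparator.comparingDouble((DeliveryChannel c) -> c.costPerMessageUsd)
                .thenComparingInt(c -> c.typicalLatencySeconds));
    }
}
```

For a user who allows only SMS and email, a 10-second deadline selects SMS, while a 60-second deadline selects the much cheaper email channel.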
Step 3: Deep Dive — Fan-out Strategies
This is the core technical challenge that separates strong candidates from average ones. How do you efficiently deliver one message to millions of users?
Push Model vs Pull Model
Push Model (Write Fanout): When a notification is triggered, immediately write it to all recipients' individual queues.
Example: A popular user posts an update that needs to notify 10 million followers. Push model immediately writes 10 million entries.
User A posts → Fan out to 10M queues → Each follower's queue gets the message
- Pros: Fast delivery, since users get notifications immediately
- Cons: High write volume, expensive storage, potential hot-spotting
Pull Model (Read Fanout): Store the notification once, and users pull/fetch relevant notifications when they become active.
User A posts → Store once in global feed → Users pull relevant messages when active
- Pros: Efficient storage, handles inactive users well
- Cons: Higher latency, complex ranking/filtering logic
Hybrid Approach (Production Reality)
Most production systems use a hybrid based on user activity and notification priority:
public class FanoutStrategy {
    private static final int ACTIVE_USER_THRESHOLD = 7; // days
    private static final int VIP_FOLLOWER_LIMIT = 1_000_000;
    public FanoutDecision decideFanout(User sender, NotificationType type) {
        if (type == NotificationType.CRITICAL) {
            return FanoutDecision.PUSH_ALL; // Security alerts, payments
        }
        List<User> followers = getFollowers(sender);
        List<User> activeFollowers = followers.stream()
            .filter(u -> u.lastActive().isAfter(now().minusDays(ACTIVE_USER_THRESHOLD)))
            .collect(toList());
        if (activeFollowers.size() < VIP_FOLLOWER_LIMIT) {
            return FanoutDecision.PUSH_TO_ACTIVE_USERS;
        } else {
            return FanoutDecision.PULL_MODEL; // Celebrity users
        }
    }
}
Strategy rules:
- Active users (last 7 days): Push fanout for immediate delivery
- Inactive users: Pull model — they'll see notifications when they return
- High-follower accounts: Pull model to avoid write amplification
- Critical notifications: Always push regardless of fanout cost
Strong candidates explain the trade-off explicitly: "For a user with 50k active followers, I'd use push fanout because the write cost is manageable and delivery is immediate. For a celebrity with 10M followers, I'd use pull model because writing 10M entries per post would overwhelm our write capacity."
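To make the storage trade-off concrete, here is a small in-memory sketch of the hybrid read path: ordinary senders are fanned out on write, celebrity posts are stored once and merged in at read time. `HybridFeed` and its methods are hypothetical names, and the maps stand in for real inbox and feed stores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the hybrid model; all names are hypothetical.
final class HybridFeed {
    private final Map<String, List<String>> inboxes = new HashMap<>();        // push: one copy per recipient
    private final Map<String, List<String>> celebrityPosts = new HashMap<>(); // pull: one copy per sender

    /** Write fanout: copy the message into each follower's inbox. */
    void pushFanout(String message, List<String> followerIds) {
        for (String followerId : followerIds) {
            inboxes.computeIfAbsent(followerId, k -> new ArrayList<>()).add(message);
        }
    }

    /** Pull model: store the message once, regardless of follower count. */
    void storeOnce(String celebrityId, String message) {
        celebrityPosts.computeIfAbsent(celebrityId, k -> new ArrayList<>()).add(message);
    }

    /** Read path: the user's own inbox plus posts pulled from followed celebrities. */
    List<String> read(String userId, List<String> followedCelebrities) {
        List<String> result = new ArrayList<>(inboxes.getOrDefault(userId, List.of()));
        for (String celebrityId : followedCelebrities) {
            result.addAll(celebrityPosts.getOrDefault(celebrityId, List.of()));
        }
        return result;
    }

    /** Total stored entries, which makes the write amplification visible. */
    int storedEntries() {
        int total = 0;
        for (List<String> inbox : inboxes.values()) total += inbox.size();
        for (List<String> posts : celebrityPosts.values()) total += posts.size();
        return total;
    }
}
```

A push to three followers stores three entries, while a celebrity post stores one entry no matter how many users later read it. The cost moves to the read path, which now has to merge two sources.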
Message Queue Architecture
[High Priority Queue] → [Critical Worker] (security, payments)
[Medium Priority Queue] → [Standard Worker] (social, updates)
[Low Priority Queue] → [Batch Worker] (marketing, newsletters)
Queue configuration by channel:
queues:
  sms_critical:
    priority: high
    workers: 20
    rate_limit: 100/second  # Regulatory limits
  push_realtime:
    priority: medium
    workers: 50
    rate_limit: 10000/second
  email_marketing:
    priority: low
    workers: 10
    rate_limit: 1000/second
    batch_size: 100  # Send in batches for efficiency
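The priority tiers can be illustrated with a single in-process priority queue. Note that this is a simplification of the separate-queue layout described above, chosen to keep the example short; `Priority`, `QueuedNotification`, and `PriorityDispatcher` are hypothetical names.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

enum Priority { HIGH, MEDIUM, LOW } // declaration order doubles as dequeue order

final class QueuedNotification implements Comparable<QueuedNotification> {
    private static final AtomicLong SEQUENCE = new AtomicLong();

    final String id;
    final Priority priority;
    private final long sequence = SEQUENCE.getAndIncrement(); // FIFO tiebreak within a tier

    QueuedNotification(String id, Priority priority) {
        this.id = id;
        this.priority = priority;
    }

    @Override
    public int compareTo(QueuedNotification other) {
        int byPriority = priority.compareTo(other.priority);
        return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
    }
}

final class PriorityDispatcher {
    private final PriorityBlockingQueue<QueuedNotification> queue = new PriorityBlockingQueue<>();

    void submit(String id, Priority priority) {
        queue.add(new QueuedNotification(id, priority));
    }

    /** Returns the id of the next notification to send, or null if the queue is empty. */
    String next() {
        QueuedNotification head = queue.poll();
        return head == null ? null : head.id;
    }
}
```

A strict priority order like this can starve low-priority traffic under sustained load, which is one reason to give each tier its own queue and worker pool as in the configuration above.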
Step 4: Deep Dive — Delivery Guarantees and Failure Handling
The At-Least-Once Challenge
Most notification systems provide at-least-once delivery — notifications are guaranteed to be delivered but might arrive multiple times.
Implementation pattern:
- Store notification in database with status "pending"
- Send to third-party service (FCM, SendGrid, Twilio)
- On success response, mark as "delivered"
- On failure/timeout, retry with exponential backoff
- After max retries, mark as "failed" and alert
@Service
public class NotificationDeliveryService {
    private static final int MAX_RETRIES = 3;
    private static final Duration BASE_DELAY = Duration.ofSeconds(5);
    public DeliveryResult deliver(Notification notification) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                DeliveryResult result = sendToProvider(notification);
                if (result.isSuccess()) {
                    updateStatus(notification.getId(), NotificationStatus.DELIVERED);
                    return result;
                }
                log.warn("Delivery attempt {} was rejected for {}",
                    attempt, notification.getId());
            } catch (Exception e) {
                log.warn("Delivery attempt {} failed for {}: {}",
                    attempt, notification.getId(), e.getMessage());
            }
            // Back off after any failed attempt, whether it threw or returned
            // an unsuccessful result: 5s, then 10s (exponential backoff)
            if (attempt < MAX_RETRIES) {
                sleep(BASE_DELAY.multipliedBy(1L << (attempt - 1)));
            }
        }
        updateStatus(notification.getId(), NotificationStatus.FAILED);
        alertOnFailure(notification);
        return DeliveryResult.failed();
    }
}
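Because retries can resend a notification the provider already accepted, at-least-once delivery is usually paired with deduplication on an idempotency key. The following is a minimal sketch under stated assumptions: the in-memory set stands in for a shared store (for example a unique database index), and `IdempotentSender` and its methods are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; a real system would keep handled keys in shared storage with a TTL.
final class IdempotentSender {
    private final Set<String> handledKeys = ConcurrentHashMap.newKeySet();
    private final AtomicInteger providerCalls = new AtomicInteger();

    /** Sends at most once per idempotency key; returns false for a duplicate. */
    boolean sendOnce(String idempotencyKey, String payload) {
        if (!handledKeys.add(idempotencyKey)) { // add is atomic; false means the key was already claimed
            return false;
        }
        try {
            callProvider(payload);
            return true;
        } catch (RuntimeException e) {
            handledKeys.remove(idempotencyKey); // release the key so a later retry can go through
            throw e;
        }
    }

    // Stand-in for the real provider call; it only counts invocations.
    private void callProvider(String payload) {
        providerCalls.incrementAndGet();
    }

    int providerCallCount() {
        return providerCalls.get();
    }
}
```

Calling `sendOnce("payment-tx_abc123", ...)` twice reaches the provider once; the second call is recognized as a duplicate and dropped.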
Handling Third-Party Service Failures
Each notification channel depends on external services that can fail:
| Service | Failure Mode | Mitigation |
|---|---|---|
| FCM/APNs | Rate limiting, invalid tokens | Retry with backoff, token cleanup |
| SendGrid/SES | Temporary outages | Multiple provider fallback |
| Twilio | Account suspension, regional blocks | Pre-approved backup providers |
Circuit breaker pattern for external dependencies:
@Component
public class EmailServiceCircuitBreaker {
    private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("email-service");
    // Assumes both providers return CompletableFuture<EmailResult>.
    // exceptionallyCompose requires Java 12 or later.
    public CompletableFuture<EmailResult> sendEmail(EmailNotification email) {
        return circuitBreaker
            .executeCompletionStage(() -> primaryEmailProvider.send(email))
            .exceptionallyCompose(throwable -> {
                log.error("Primary email provider failed, using fallback", throwable);
                return fallbackEmailProvider.send(email);
            })
            .toCompletableFuture();
    }
}
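For intuition about what the pattern does internally, here is a hand-rolled state machine with the three classic states. It is written for illustration only and does not reflect how any particular library implements it; the class name, the consecutive-failure threshold, and passing the clock in as a parameter are all assumptions made to keep the sketch small and testable.

```java
import java.util.function.Supplier;

// Illustrative only; production code would normally use a maintained library.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Runs primary unless the breaker is open; on failure or open breaker, runs fallback. */
    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) {
                return fallback.get();   // fail fast without touching the primary
            }
            state = State.HALF_OPEN;     // cool-down elapsed: allow one trial call
        }
        try {
            T result = primary.get();
            state = State.CLOSED;        // success closes the breaker and resets the count
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;      // trip (or re-trip) the breaker
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }

    synchronized State state() {
        return state;
    }
}
```

While the breaker is open, the failing primary provider is not called at all, which gives it time to recover and keeps worker threads from piling up on timeouts.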
Dead Letter Queues and Manual Intervention
After all retries fail, notifications go to a dead letter queue for manual investigation:
Failed Notification → Dead Letter Queue → Admin Dashboard → Manual Retry/Investigation
Common failure scenarios requiring manual intervention:
- Invalid user data (malformed email addresses, deregistered push tokens)
- Third-party service account issues (billing, API limits exceeded)
- Compliance violations (user in Do Not Call registry for SMS)
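Not every failure deserves a retry: a timeout may succeed next time, but a deregistered push token will not. A sketch of that routing decision follows. The failure categories and the `FailureRouter` class are hypothetical, and real providers report errors in their own formats that would need to be mapped onto categories like these.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative failure categories; not taken from any provider's API.
enum FailureKind { TIMEOUT, RATE_LIMITED, INVALID_TOKEN, MALFORMED_ADDRESS, COMPLIANCE_BLOCK }

final class FailureRouter {
    private final Queue<String> retryQueue = new ArrayDeque<>();
    private final Queue<String> deadLetterQueue = new ArrayDeque<>();
    private final int maxAttempts;

    FailureRouter(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Transient failures are retried until attempts run out; permanent ones go straight to the DLQ. */
    void onFailure(String notificationId, FailureKind kind, int attemptsSoFar) {
        boolean transientFailure = kind == FailureKind.TIMEOUT || kind == FailureKind.RATE_LIMITED;
        if (transientFailure && attemptsSoFar < maxAttempts) {
            retryQueue.add(notificationId);
        } else {
            // Keep the reason alongside the id so the admin dashboard can show it.
            deadLetterQueue.add(notificationId + ":" + kind);
        }
    }

    int retryDepth() {
        return retryQueue.size();
    }

    int deadLetterDepth() {
        return deadLetterQueue.size();
    }

    String peekDeadLetter() {
        return deadLetterQueue.peek();
    }
}
```

Skipping retries for permanent failures saves provider quota and gets the problem in front of a human sooner.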
Step 5: Deep Dive — User Preferences and Rate Limiting
Preference Management
Users must control their notification experience to prevent spam and maintain engagement:
CREATE TABLE user_notification_preferences (
    user_id BIGINT PRIMARY KEY,
    email_marketing BOOLEAN DEFAULT false,
    email_transactional BOOLEAN DEFAULT true,
    push_social BOOLEAN DEFAULT true,
    push_marketing BOOLEAN DEFAULT false,
    sms_critical_only BOOLEAN DEFAULT true,
    frequency_limit_per_hour INTEGER DEFAULT 10,
    quiet_hours_start TIME DEFAULT '22:00',
    quiet_hours_end TIME DEFAULT '07:00',
    timezone VARCHAR(50) DEFAULT 'UTC'
);
Preference enforcement logic:
public boolean shouldSendNotification(User user, NotificationType type,
                                      NotificationChannel channel) {
    UserPreferences prefs = getPreferences(user.getId());
    // Check channel opt-in
    if (!prefs.isChannelEnabled(channel, type)) {
        return false;
    }
    // Check frequency limits
    int recentCount = countRecentNotifications(user.getId(),
        Duration.ofHours(1));
    if (recentCount >= prefs.getHourlyLimit()) {
        return false;
    }
    // Check quiet hours (convert to user's timezone)
    LocalTime now = LocalTime.now(ZoneId.of(prefs.getTimezone()));
    if (isInQuietHours(now, prefs) && !type.isCritical()) {
        return false;
    }
    return true;
}
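The `isInQuietHours` helper is not shown, and it hides a subtle case: the default window of 22:00 to 07:00 crosses midnight, so a plain "start <= now < end" check would never match. One possible implementation, sketched here with a hypothetical `QuietHours` class that takes the window bounds directly rather than a preferences object:

```java
import java.time.LocalTime;

final class QuietHours {
    /** True when now falls in [start, end); handles windows that cross midnight. */
    static boolean contains(LocalTime start, LocalTime end, LocalTime now) {
        if (start.equals(end)) {
            return false; // assumption: an empty window means quiet hours are disabled
        }
        if (start.isBefore(end)) {
            // Same-day window, e.g. 13:00 to 15:00
            return !now.isBefore(start) && now.isBefore(end);
        }
        // Overnight window, e.g. 22:00 to 07:00: late evening OR early morning
        return !now.isBefore(start) || now.isBefore(end);
    }
}
```

With the default 22:00 to 07:00 window, both 23:30 and 03:00 count as quiet, while 12:00 and 07:00 do not.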
Rate Limiting Per User and Per Channel
Different rate limits prevent notification fatigue while ensuring critical messages get through:
public class NotificationRateLimiter {
    private final RedisTemplate<String, String> redis;
    public boolean allowNotification(String userId, NotificationChannel channel,
                                     NotificationType type) {
        if (type.isCritical()) {
            return true; // Never rate limit critical notifications
        }
        String key = String.format("rate_limit:%s:%s", userId, channel);
        // Increment first: INCR is atomic, so concurrent workers cannot both
        // read the same count and slip under the limit
        Long count = redis.opsForValue().increment(key);
        if (count != null && count == 1L) {
            // Start the one-hour window on the first hit only; resetting the TTL
            // on every call would keep a busy user's window from ever expiring
            redis.expire(key, Duration.ofHours(1));
        }
        return count != null && count <= getLimitForChannel(channel);
    }
    private int getLimitForChannel(NotificationChannel channel) {
        return switch (channel) {
            case PUSH -> 20;   // 20 push notifications per hour max
            case EMAIL -> 5;   // 5 emails per hour max
            case SMS -> 2;     // SMS is expensive, very limited
            case IN_APP -> 50; // In-app can be higher volume
        };
    }
}
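The same fixed-window idea can be tried without Redis. The sketch below is an in-memory illustration under stated assumptions: `FixedWindowLimiter` is a hypothetical class, the clock is passed in so the behavior is easy to test, and old window keys are never evicted, which a real implementation would need to handle (Redis does it with a TTL).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory illustration only; counters for past windows are never cleaned up here.
final class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    private final Map<String, Integer> counts = new ConcurrentHashMap<>();

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Counts the attempt atomically, then allows it if the window's count is within the limit. */
    boolean allow(String userId, String channel, long nowMillis) {
        long window = nowMillis / windowMillis;          // index of the current window
        String key = userId + ":" + channel + ":" + window;
        int count = counts.merge(key, 1, Integer::sum);  // atomic increment-and-get
        return count <= limit;
    }
}
```

A fixed window is simple but allows short bursts of up to twice the limit across a window boundary; whether that matters depends on how strict the per-user limits need to be.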
Step 6: Deep Dive — Template System and Personalization
Template Management
Production notification systems use templates to separate content from delivery logic:
{
  "template_id": "payment_confirmation",
  "channels": {
    "push": {
      "title": "Payment successful",
      "body": "Your payment of {{amount}} has been processed"
    },
    "email": {
      "subject": "Payment confirmation - {{merchant_name}}",
      "template_url": "s3://templates/payment_confirmation.html",
      "variables": ["amount", "merchant_name", "transaction_id", "date"]
    },
    "sms": {
      "body": "{{merchant_name}}: Your {{amount}} payment was successful. Ref: {{transaction_id}}"
    }
  },
  "localization": {
    "es": {
      "push": {
        "title": "Pago exitoso",
        "body": "Tu pago de {{amount}} ha sido procesado"
      }
    }
  }
}
Template rendering service:
@Service
public class NotificationTemplateService {
    public RenderedNotification renderTemplate(String templateId,
                                               NotificationChannel channel,
                                               Map<String, Object> variables,
                                               String locale) {
        Template template = templateRepository.findByIdAndChannel(templateId, channel);
        if (template == null) {
            throw new TemplateNotFoundException(templateId, channel);
        }
        // Apply localization if available
        Template localizedTemplate = getLocalizedTemplate(template, locale);
        // Render with variables using Mustache/Handlebars
        String renderedTitle = templateEngine.render(localizedTemplate.getTitle(), variables);
        String renderedBody = templateEngine.render(localizedTemplate.getBody(), variables);
        return RenderedNotification.builder()
            .title(renderedTitle)
            .body(renderedBody)
            .deepLink(renderDeepLink(localizedTemplate.getDeepLink(), variables))
            .build();
    }
}
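To show what the rendering step boils down to, here is a minimal `{{variable}}` substitution written with the standard library. It is a stand-in for a real Mustache or Handlebars engine, not a replacement: `MiniTemplate` is a hypothetical class, it supports only simple word-character placeholders, and leaving unknown placeholders untouched is a design assumption.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MiniTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{\\{(\\w+)\\}\\}");

    /** Replaces each {{name}} with its value; unknown names are left untouched. */
    static String render(String template, Map<String, String> variables) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (matcher.find()) {
            String value = variables.getOrDefault(matcher.group(1), matcher.group(0));
            // quoteReplacement keeps "$" in values such as "$25.99" from being
            // interpreted as a regex group reference
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Rendering the push body from the template above with `amount` set to `$25.99` yields "Your payment of $25.99 has been processed".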
Personalization and Segmentation
Advanced notification systems support audience segmentation and personalized content:
public class NotificationAudienceService {
    public List<User> resolveAudience(AudienceDefinition definition) {
        QueryBuilder query = new QueryBuilder();
        // Apply demographic filters
        if (definition.getCountries() != null) {
            query.whereIn("country", definition.getCountries());
        }
        // Apply behavioral filters
        if (definition.getLastActiveWithin() != null) {
            query.where("last_active_at", ">",
                now().minus(definition.getLastActiveWithin()));
        }
        // Apply engagement filters
        if (definition.getEngagementLevel() != null) {
            query.where("engagement_score", ">=",
                definition.getEngagementLevel().getMinScore());
        }
        return userRepository.findByQuery(query.build());
    }
    public Map<String, Object> getPersonalizationVariables(User user,
                                                           NotificationType type) {
        Map<String, Object> variables = new HashMap<>();
        variables.put("first_name", user.getFirstName());
        variables.put("timezone", user.getTimezone());
        if (type == NotificationType.RECOMMENDATION) {
            variables.put("recommendations",
                recommendationService.getForUser(user.getId()));
        }
        return variables;
    }
}
Step 7: Common Mistakes and Follow-up Questions
Mistake 1: Ignoring Multi-Channel Complexity
Describing only push notifications while ignoring email, SMS, and in-app channels. Real systems must coordinate across all channels with different delivery guarantees and cost models.
Mistake 2: Missing User Preferences
Designing a system that bombards users without respecting opt-outs, frequency limits, and quiet hours. This leads to poor user experience and legal compliance issues.
Mistake 3: No Failure Handling Strategy
Assuming third-party services (FCM, SendGrid, Twilio) never fail. Production systems need circuit breakers, fallback providers, and dead letter queues.
Mistake 4: Inefficient Fan-out for High-Volume Users
Using push fanout for celebrity users with millions of followers, which overwhelms write capacity. Use a hybrid push/pull strategy based on follower count and user activity.
Mistake 5: Forgetting About Cost Optimization
Treating all notification channels equally when SMS costs ~$0.01 per message. Channel selection should consider cost, urgency, and user preferences.
Follow-up Questions to Prepare For
"How would you handle a notification going to 100 million users simultaneously?" This tests your understanding of write amplification. Use pull model for such large broadcasts, with push notifications only for recently active users. Consider staged rollout (10% → 50% → 100%) to detect issues early.
"What if a user is offline for weeks and has thousands of pending notifications?" Implement notification expiration and consolidation. Expire old notifications, consolidate similar ones ("You have 50 new messages" instead of 50 individual notifications), and prioritize by importance when they return.
"How do you ensure exactly-once delivery?" Explain that true exactly-once delivery is expensive and often unnecessary. Most systems use at-least-once delivery and attach an idempotency key to each notification so that workers and clients can drop duplicates; the occasional duplicate that slips through is acceptable for most use cases. For critical cases, use distributed consensus (costly).
"How would you implement real-time in-app notifications?" WebSockets or Server-Sent Events for active connections. When a user comes online, establish a connection and deliver pending notifications immediately. Use connection pooling and heartbeat mechanisms to handle connection failures.
"What about international compliance (GDPR, CAN-SPAM)?" Implement explicit consent tracking, easy unsubscribe mechanisms, and data retention policies. Rules differ by region: the EU generally requires explicit opt-in for marketing messages, while US CAN-SPAM permits opt-out for commercial email. Store consent timestamps and audit trails.
Summary: Your 35-Minute Interview Plan
| Time | What to Do |
|---|---|
| 0-5 min | Clarify requirements: notification types, scale, delivery guarantees, analytics |
| 5-12 min | High-level design: API, multi-channel architecture, message queue design |
| 12-22 min | Deep dive: Fan-out strategies (push vs pull vs hybrid), handling high-volume users |
| 22-28 min | Delivery guarantees: at-least-once implementation, failure handling, circuit breakers |
| 28-32 min | User preferences, rate limiting, template system |
| 32-35 min | Wrap up: trade-offs, cost optimization, compliance considerations |
The notification system interview tests your ability to design for both scale and reliability across multiple delivery channels. Companies want to see that you understand the complexity of coordinating push, email, SMS, and in-app notifications — each with different constraints, costs, and user expectations.