From a single TCP socket to a planet-scale messaging fabric — how WhatsApp, Slack, and Messenger keep billions of conversations live, ordered, and durable.
Before drawing boxes, pin down the user-visible features. The chat domain is small in surface area but rich in failure modes — every requirement quietly imposes a different storage or networking constraint.
Two users exchange messages in near real time. Latency target around 200 ms end-to-end. Messages survive the recipient being offline.
Up to a few hundred participants per room. Each member sees the same messages in the same order. Fan-out happens server-side.
Show whether a contact is online, away, or last-seen-at. Tolerant of brief disconnects; not a security guarantee.
Durable, paginated, sorted by time. Searchable on the client. Retention spans years for compliance and user nostalgia.
If the recipient has no live socket, hand the message to APNs / FCM so a notification fires on the lock screen.
The same account on phone, desktop, and web sees identical state. Each device tracks its own delivered / read cursors.
A chat system stands or falls on its push channel. HTTP request-response is client-pull; real-time chat needs the server to wake up the client. Four options — only one is right.
| Approach | How it works | Latency | Server load | Verdict |
|---|---|---|---|---|
| Short polling | Client hits /messages every few seconds asking "anything new?". Most calls return empty. |
Poor | Wasteful | Reject |
| Long polling | Client opens a request that the server holds until a message arrives or a timeout (30 s) fires. | OK | Held threads | Fallback |
| SSE | One-way stream from server to client over plain HTTP. Client still POSTs to send. | Good | Light | One-way only |
| WebSocket | Full-duplex TCP-like channel over a single upgraded HTTP connection. Bytes flow both ways. | Excellent | Persistent | Pick this |
A chat client is bidirectional by definition — it both sends and receives. WebSocket gives one persistent connection per device, framed messages, low overhead per packet, and predictable wake-up semantics on mobile radios. Long polling becomes the graceful fallback on networks that block upgrades.
Each concern gets its own service so it can scale on its own axis. The chat service holds connections; the rest stay stateless or behind storage.
Each instance terminates thousands of live WebSockets. Routes inbound messages to other chat servers, writes to storage, hands off to push when needed.
Tracks the heartbeat of each connected device. Reads are cheap, writes are massive — Redis with TTL keys is the typical home.
A thin abstraction over Apple / Google / Web push. Receives async jobs when a recipient has no live socket on any device.
Hot path: append-only KV store partitioned by chat_id. Cold path: relational store for users, group rosters, profile data.
A user's socket lives on exactly one chat server at a time. To route a message from Alice to Bob, you have to know which box holds Bob's connection. A shared registry — usually Redis or ZooKeeper — answers that question.
When a client opens a WebSocket, the receiving chat server writes user_id → self into the registry with a TTL refreshed by heartbeats.
The originating server looks up the recipient. Local? Push the frame directly. Remote? RPC to the holding server, which forwards to its socket.
The TTL lets you survive crashes — a dead server's mappings expire and the next reconnect rewrites the entry on a healthy node.
A message must be durable before it is delivered; it must be ordered before it is acknowledged. Two paths diverge based on whether the recipient is currently holding a live socket.
The chat server writes to storage first. Only then does it ack the sender. If delivery fails, the message is still recoverable on reconnect.
A central ID generator (Snowflake, ULID, or per-chat sequence) gives every message a monotonic, sortable identifier shared by all parties.
No socket for Bob? The message still sits in his inbox. Push service fires APNs / FCM. On next reconnect, Bob's device pulls everything newer than its cursor.
Reads and writes for a chat system are skewed by conversation, not by user. A single chat is the unit of locality — partition there and life gets easier.
chat_id (clusters all messages of one conversation onto one node).message_id (Snowflake: time + worker + seq).Per-conversation, IDs must increase. A centralised Snowflake gives global monotonicity with no coordination on the hot path. The first 41 bits encode time; ordering by ID equals ordering by time.
In a group, the sender publishes once. The chat service writes the message once to the conversation log, then enqueues it into a per-recipient inbox so each member can fetch and ack independently.
Each member acks independently and at their own pace. Bob may read instantly; Liu reads tomorrow. Per-recipient state keeps "unread" counts trivial and supports clean retries.
This fan-out works up to a few hundred members. Above that — channels, broadcast rooms — flip the model: members pull from a shared log rather than each receiving a copy.
The conversation log assigns the canonical message_id. Every inbox copy references it, so even if devices receive out of order, the client can sort and dedupe by ID.
Presence looks trivial: "is this user online?" But every connected device sends a heartbeat every few seconds, and every contact wants the answer in real time. At a billion users, that is the highest QPS surface in the system.
Client sends a ping every N seconds. Server writes presence:userId with a TTL of 2–3× the interval. If the key expires, the user is implicitly offline.
A user with 500 contacts means 500 reads per second just to render online dots. Solve with caching and by only computing presence for currently-visible contacts.
Don't push presence continuously — push only on transitions (online → offline). Subscribers maintain their own cached view and rely on diffs.
These ephemeral signals make a chat feel alive. They are cheap individually and brutal in aggregate — design them as best-effort and avoid storing them as first-class messages.
Three checkmark states:
Each ack travels over the same WebSocket and updates a small key, not the message row.
Pure transient signal. Client sends typing-start, server fan-outs to the chat's connected members, then auto-clears after 5 s if no typing-stop arrives. Never persisted. Drop on overload.
Read state is per-user, per-chat — a single cursor "highest read message_id". Storing one row per (user, chat) is far cheaper than per (user, message). Group chats display the count of members above the cursor.
Network retries are the rule, not the exception. Each ack carries the message_id; the server treats duplicate acks as no-ops. The client's outbound queue uses a client-generated client_msg_id to dedupe sends across reconnects.
If losing the signal would be merely annoying — typing, presence flicker — make it best-effort. If losing it would be wrong — delivery status, ordering — make it durable.
When no socket is alive, the app is asleep — possibly killed by the OS. The only way to reach it is through the platform push gateway. The push service is a queue, a worker pool, and a careful authority on what to send.
Each device registers a push token with the chat service. Tokens expire — workers must handle 410 / unregistered responses and purge dead tokens.
Don't ring the phone for every message in a busy group. Workers consult per-user preferences and may collapse rapid bursts into a single "5 new messages" notification.
If messages are E2E encrypted, push carries only a wake-up signal. The actual content is decrypted on-device after the app fetches the ciphertext over the chat socket.
A user's account isn't tied to one device. Phone, laptop, web — each connects independently, sees the same conversations, and must converge on the same view of read state and message order without coordination from the user.
Each device persists "last message_id I've seen" per chat. On reconnect, it asks the server for everything newer. The server doesn't need to remember anything per session.
If you read on your phone, the badge disappears on your laptop. Reading on any device pushes the new read cursor to the other devices over their sockets.
An incoming message is pushed to every live socket the user has. "Delivered" status flips as soon as one device acks; "Read" requires an explicit read event.
If you only carry away five ideas from this chapter, let them be these.
One persistent, full-duplex connection per device. Everything — messages, presence, typing, receipts — flows through it. Long polling is only a fallback.
The chat service writes durably first, acks the sender, then routes. Storage is the source of truth; sockets are just a fast path to the same data.
chat_id is the partition key, message_id (Snowflake) is the sort key. Every conversation is a self-contained log — local reads, local writes, easy to shard.
For small groups, copy to per-recipient inboxes. For huge rooms, flip to pull. The right model depends on the ratio of members to active readers.
Heartbeats + TTL are simple in concept and brutal in QPS. Cache aggressively, send only state transitions, and never compute presence the user can't see.
Push gateways, per-device cursors, idempotent acks, dedupe on client_msg_id. Networks fail constantly; the chat experience must not.