Insights from Alvin Henrick, AVP of Engineering at b.well Connected Health
Why do you need FHIR SSE? Once or twice in your life, you’ve hit this wall. You move from Dallas, Texas, to San Francisco. New city, new provider, new health system, and the first question at your new doctor’s office is, “Can you get your records transferred?” You call your old clinic, and they fax something, maybe. Three weeks later, half your history is missing, and you’re repeating labs you did six months ago.
Now flip the script. What if your health data followed you in real time? What if the moment your old provider updated a record, your new provider’s system knew about it? What if an AI health assistant on your phone could tell you, “your lab results just came in, here’s what changed since last time,” seconds after your doctor signed off?
That’s what FHIR Subscriptions with Server-Sent Events enable. But building it for real patients, at scale, in a multi-tenant healthcare platform is where the engineering gets interesting.
What We’re Solving at b.well
At b.well Connected Health, this is the kind of engineering challenge we wake up to every day. Our platform aggregates health data from hundreds of sources (payers, providers, labs, and pharmacies) into a unified, FHIR-native data layer. The patient doesn’t care which system their data lives in. They care that it’s there when they need it, and that they know when something changes.
We’re solving these hard distributed systems problems (cross-pod event delivery, multi-tenant security isolation, real-time credential lifecycle) not as academic exercises, but because a patient in San Francisco shouldn’t have to wait three weeks for their Dallas records, and because when a lab result is ready, the patient’s AI health assistant should know before the patient remembers to check the portal.
Healthcare engineering is uniquely unforgiving. You can’t “move fast and break things” when the things are patient records. You can’t have eventual consistency in tenant isolation, since a single cross-tenant leak is a HIPAA incident. You can’t have a notification gap where a critical lab result silently disappears because a pod restarted during a deploy.
The engineering challenges below are how we improve healthcare and patients’ lives. Every architectural decision here traces back to a patient experience: faster notifications, safer data delivery, zero dropped events, and a foundation that AI agents can build on to give patients a real-time understanding of their own health.
The Gap Between Spec and Reality
The FHIR R5 Subscription Backport IG defines channel types for delivering notifications: rest-hook, websocket, email, sms, message. In the age of AI agents and LLM-powered patient experiences, polling a FHIR server every N minutes for changes is not going to cut it. Patients expect to know immediately when their lab result drops, when an appointment changes, and when a care plan is updated, not on the next poll cycle.
FHIR SSE is not a first-class channel in the spec. But §2.4.0.0.5 of the Backport IG explicitly leaves room for custom channel types. That’s the conformance hook we use.
The real challenge isn’t the protocol; SSE is well-understood. The challenge is everything the spec doesn’t address:
- How do you deliver events to the right patient without sticky sessions?
- How do you revoke credentials on an already-open stream?
- What happens when a JWT expires mid-stream on a 24-hour connection?
- How do you ensure no PHI leaks through the notification channel itself?
- How do you scale horizontally without duplicating every event to every pod?
The FHIR SSE Architecture
Here’s what we built.
The key insight: one pod processes each Kafka message (shared consumer group), then Redis Pub/Sub fans it out to every pod that has a connected client for that subscription. No sticky sessions. No duplicate consumption. Any pod can accept any client connection.
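The consume-once, fan-out-everywhere flow can be sketched in a few lines. This is a minimal in-memory illustration, not the production code: `PubSub` stands in for Redis (both Pub/Sub and the INCR counter), `route` stands in for the shared Kafka consumer group, and the channel and key names (`sse:…`, `seq:…`) are assumptions made for the example.

```python
import zlib
from collections import defaultdict


class PubSub:
    """In-memory stand-in for Redis: Pub/Sub channels plus an INCR-style counter."""

    def __init__(self):
        self.channels = defaultdict(list)  # channel name -> subscriber callbacks
        self.counters = defaultdict(int)

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, event):
        for callback in self.channels[channel]:
            callback(event)

    def incr(self, key):
        self.counters[key] += 1
        return self.counters[key]


class Pod:
    def __init__(self, name, bus):
        self.name, self.bus = name, bus
        self.clients = defaultdict(list)  # subscription_id -> client inboxes on this pod

    def connect(self, subscription_id):
        # A pod listens on a subscription's channel only once it has a client for it.
        if subscription_id not in self.clients:
            self.bus.subscribe(f"sse:{subscription_id}", self._deliver)
        inbox = []
        self.clients[subscription_id].append(inbox)
        return inbox

    def _deliver(self, event):
        for inbox in self.clients[event["subscription_id"]]:
            inbox.append(event)

    def consume(self, message):
        # Runs on exactly one pod per message, so each event gets one sequence number.
        sub = message["subscription_id"]
        event = dict(message, sequence_number=self.bus.incr(f"seq:{sub}"))
        self.bus.publish(f"sse:{sub}", event)
        return event


def route(message, pods):
    """Stand-in for a shared consumer group: one pod handles each message."""
    index = zlib.crc32(message["subscription_id"].encode()) % len(pods)
    return pods[index].consume(message)


bus = PubSub()
pods = [Pod(f"pod-{i}", bus) for i in range(3)]
inbox_a = pods[1].connect("sub-1")  # two clients on two different pods
inbox_b = pods[2].connect("sub-1")
route({"subscription_id": "sub-1", "focus": "Observation/lab-123"}, pods)
route({"subscription_id": "sub-1", "focus": "Observation/lab-124"}, pods)
```

Whichever pod the router picks, both inboxes end up with the same two events in the same order, and the pod with no connected client never subscribes to the channel.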
Why Not Sticky Sessions?
Sticky sessions break during rolling deployments. They break during autoscaling. They make your load balancer a single point of failure for specific connections. In healthcare, “Sorry, your notification stream died because we deployed a patch” is not acceptable.
Why Not One Consumer Group Per Pod?
With one consumer group per pod, 10 pods would process every Kafka message 10 times. That’s wasteful, and it breaks exactly-once semantics for sequence number assignment, because every pod would try to assign a sequence number to the same event.
The Custom FHIR SSE Channel
Our FHIR SSE channel is a superset of the Backport IG. The notification Bundle shape is IG-compatible.
The payload is id-only (claim-checker pattern). No PHI on the wire. The client gets a reference (Observation/lab-123), then fetches the actual resource from the FHIR server with its own JWT. The FHIR server re-authorizes that fetch, so even if something goes wrong in the notification layer, the patient only sees what their token allows.
This is deliberate. In an LLM-agent world where AI assistants subscribe to health events on behalf of patients, the claim-checker pattern means the notification channel is a signal layer, not a data layer. The AI agent gets “something changed,” fetches the resource through proper authorization, then summarizes it for the patient. Two authorization checkpoints instead of one.
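To make the two checkpoints concrete, here is a minimal sketch of the claim-checker flow. It deliberately uses a simplified notification envelope rather than the IG's Bundle shape, and everything named here (`FHIR_STORE`, `make_notification`, `fetch_resource`, the `patient` claim on the token) is a hypothetical stand-in, not the platform's API.

```python
# Hypothetical in-memory FHIR store; a real client would call the FHIR server over HTTP.
FHIR_STORE = {
    "Observation/lab-123": {
        "resourceType": "Observation",
        "id": "lab-123",
        "status": "final",
        "subject": {"reference": "Patient/p1"},
    }
}


class Forbidden(Exception):
    """Raised when the caller's token does not cover the requested resource."""


def make_notification(sequence_number, reference):
    """Signal only: a reference and a sequence number, never the resource body."""
    return {"sequence_number": sequence_number, "focus": {"reference": reference}}


def fetch_resource(reference, token):
    """Stand-in FHIR read that re-authorizes every request against the caller's token."""
    resource = FHIR_STORE[reference]
    if resource["subject"]["reference"] != token["patient"]:
        raise Forbidden(reference)
    return resource


notification = make_notification(1, "Observation/lab-123")

# The subscriber (a patient app or an AI agent) redeems the reference with its own token.
patient_token = {"patient": "Patient/p1"}
resource = fetch_resource(notification["focus"]["reference"], patient_token)
```

Even if a notification reached the wrong subscriber, redeeming the reference with a token for a different patient would raise `Forbidden`, so the resource body stays behind the FHIR server's own authorization check.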
The Hard Parts
1. Cross-Pod Event Delivery
Pod 1 processes the Kafka message. Pod 2 has the patient’s connection. Redis Pub/Sub bridges the gap. If Redis is down, clients recover missed events from ClickHouse on reconnect using Last-Event-Id.
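Recovery on reconnect depends on every SSE frame carrying an `id:` field that the client can send back as Last-Event-Id. The sketch below shows the wire format and a simplified client-side parser. It assumes one `data:` line per frame and a hypothetical event name `notification`; a full parser would also join multi-line data fields.

```python
import json


def format_sse(event_id, event_type, payload):
    """Serialize one event as an SSE frame; the blank line terminates the frame."""
    data = json.dumps(payload, separators=(",", ":"))
    return f"id: {event_id}\nevent: {event_type}\ndata: {data}\n\n"


def parse_sse(stream_text):
    """Parse complete frames and return (events, last_event_id)."""
    events, last_id = [], None
    for frame in stream_text.split("\n\n"):
        fields = {}
        for line in frame.splitlines():
            if not line or line.startswith(":"):
                continue  # lines starting with ":" are comments, often used as heartbeats
            name, _, value = line.partition(":")
            fields[name] = value[1:] if value.startswith(" ") else value
        if "id" in fields:
            last_id = fields["id"]
        if "data" in fields:
            events.append(fields)
    return events, last_id


stream = (
    format_sse(41, "notification", {"focus": "Observation/lab-123"})
    + ": heartbeat\n\n"
    + format_sse(42, "notification", {"focus": "Observation/lab-124"})
)
events, last_id = parse_sse(stream)

# On reconnect the client sends the last id it saw so the server can replay the gap.
reconnect_headers = {"Last-Event-Id": last_id}
```

Browsers' built-in `EventSource` tracks the last id and resends it automatically; a custom client, such as an agent running server-side, has to do this bookkeeping itself.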
2. Dual-Dimension Security Filtering
Every event carries security codes extracted from the FHIR resource’s meta.security:
- Access codes — which organization can see this resource
- Owner codes — which partner owns this resource
The patient’s JWT carries matching scopes. At delivery time, we intersect the event’s security codes with the scopes on the token.
This is multi-tenancy at the event level, not the connection level. Two subscriptions on the same pod, even with the same subscription ID, will see different events when their tenant security tags differ. Cross-tenant leakage is structurally impossible because filtering happens per-event, per-delivery.
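A per-event filter along these lines might look like the sketch below. It rests on two assumptions that the text does not spell out: the code system URLs are placeholders, and an event is delivered only when both dimensions (access and owner) overlap with the token's scopes.

```python
# Placeholder code systems; the real systems used in meta.security are not given here.
ACCESS_SYSTEM = "https://example.org/access"
OWNER_SYSTEM = "https://example.org/owner"


def extract_codes(resource, system):
    """Collect the meta.security codes that belong to one code system."""
    tags = resource.get("meta", {}).get("security", [])
    return {tag["code"] for tag in tags if tag.get("system") == system}


def may_deliver(resource, token_scopes):
    """Assumed rule: both the access and the owner dimension must intersect."""
    access_ok = extract_codes(resource, ACCESS_SYSTEM) & token_scopes["access"]
    owner_ok = extract_codes(resource, OWNER_SYSTEM) & token_scopes["owner"]
    return bool(access_ok) and bool(owner_ok)


lab_result = {
    "resourceType": "Observation",
    "id": "lab-123",
    "meta": {
        "security": [
            {"system": ACCESS_SYSTEM, "code": "tenant-a"},
            {"system": OWNER_SYSTEM, "code": "partner-1"},
        ]
    },
}

tenant_a = {"access": {"tenant-a"}, "owner": {"partner-1"}}
tenant_b = {"access": {"tenant-b"}, "owner": {"partner-1"}}

delivered_to_a = may_deliver(lab_result, tenant_a)
delivered_to_b = may_deliver(lab_result, tenant_b)
```

Because the check runs for every event and every connection, a resource with no security tags at all is delivered to nobody under this rule, which is the safe default.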
3. Token Expiry Mid-Stream
SSE connections live up to 24 hours. JWTs live ~1 hour. The spec says nothing about this, so we implemented our own token lifecycle.
The token-expiring warning gives clients 5 minutes to prepare. On reconnect, Last-Event-Id ensures zero data loss. The gap between disconnect and reconnect is covered by ClickHouse persistence.
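The server-side decision can be reduced to a small pure function that is evaluated on a timer for each open stream. This is a sketch under assumptions: the state names other than `token-expiring` are invented for the example, and times are plain epoch seconds.

```python
WARNING_SECONDS = 5 * 60               # warning window described above
MAX_CONNECTION_SECONDS = 24 * 60 * 60  # upper bound on connection lifetime


def stream_state(now, connected_at, token_exp):
    """Decide what the server should do with one open stream at time `now`."""
    if now >= token_exp:
        return "close"           # token expired: end the stream
    if now - connected_at >= MAX_CONNECTION_SECONDS:
        return "close"           # connection reached its maximum age
    if token_exp - now <= WARNING_SECONDS:
        return "token-expiring"  # tell the client to fetch a fresh token
    return "streaming"


# A connection opened at t=0 with a one-hour token.
timeline = [(t, stream_state(t, connected_at=0, token_exp=3600))
            for t in (0, 3299, 3300, 3599, 3600)]
```

A client that receives the warning would typically obtain a new token, then reconnect with Last-Event-Id so that anything emitted during the switch is replayed.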
4. Real-Time Credential Revocation
When an admin revokes a client credential, active SSE connections for that client must terminate immediately, not on the next token refresh, not on the next heartbeat, now.
Revocation propagates across all pods via a dedicated Redis Pub/Sub control channel. The revocation is also persisted with a TTL equal to max JWT lifetime, so if a pod restarts, it loads active revocations before accepting connections. No zombie streams.
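The two halves of this design, immediate propagation and restart safety, can be sketched as follows. `RevocationStore` stands in for the persisted revocation list, `revoke_everywhere` stands in for the Redis control channel, and all names are hypothetical. The clock is injectable so the TTL behavior can be exercised without waiting.

```python
import time


class RevocationStore:
    """Stand-in for a persisted revocation list whose entries expire after a TTL."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl, self.clock = ttl_seconds, clock
        self.entries = {}  # client_id -> expiry timestamp

    def revoke(self, client_id):
        self.entries[client_id] = self.clock() + self.ttl

    def active(self):
        now = self.clock()
        return {client for client, expiry in self.entries.items() if expiry > now}


class StreamPod:
    def __init__(self, store):
        # Load active revocations before accepting any connection.
        self.revoked = set(store.active())
        self.streams = {}  # stream_id -> client_id

    def accept(self, stream_id, client_id):
        if client_id in self.revoked:
            return False
        self.streams[stream_id] = client_id
        return True

    def on_revocation(self, client_id):
        """Close every open stream for the client and refuse new ones."""
        self.revoked.add(client_id)
        closed = [s for s, c in self.streams.items() if c == client_id]
        for stream_id in closed:
            del self.streams[stream_id]
        return closed


def revoke_everywhere(client_id, store, pods):
    """Persist the revocation, then broadcast it to every pod."""
    store.revoke(client_id)
    return [s for pod in pods for s in pod.on_revocation(client_id)]


now = [0.0]  # mutable fake clock
store = RevocationStore(ttl_seconds=3600, clock=lambda: now[0])
pods = [StreamPod(store), StreamPod(store)]
pods[0].accept("stream-1", "client-a")
pods[1].accept("stream-2", "client-a")
pods[1].accept("stream-3", "client-b")

closed = revoke_everywhere("client-a", store, pods)
restarted = StreamPod(store)  # a pod that starts after the revocation
```

The TTL only needs to match the maximum JWT lifetime because any token issued before the revocation is expired by then and would be rejected on its own.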
5. The Replay Safety Net
ClickHouse stores every event for 7 days, partitioned by date, ordered by (subscription_id, sequence_number). This is the safety net that makes the entire fire-and-forget architecture safe:
- Redis Pub/Sub down? → Client reconnects, replays from ClickHouse.
- Pod crashed? → Client reconnects, replays from ClickHouse.
- Network blip? → Client reconnects with Last-Event-Id, picks up where it left off.
The key design decision: ClickHouse is the durability layer, not Redis. Redis Pub/Sub is ephemeral by design, and that’s fine; it’s the fast path. ClickHouse is the reliable path.
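The replay path boils down to "give me everything for this subscription after sequence N that is still within retention." The sketch below models that query with an in-memory list standing in for the ClickHouse table. It assumes that the SSE event id is the sequence number itself, which the text implies but does not state.

```python
RETENTION_SECONDS = 7 * 24 * 60 * 60  # 7-day retention described above


class ReplayStore:
    """In-memory stand-in for the event table, keyed by subscription."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention = retention_seconds
        self.rows = {}  # subscription_id -> [(sequence_number, stored_at, payload)]

    def append(self, subscription_id, sequence_number, payload, now):
        self.rows.setdefault(subscription_id, []).append(
            (sequence_number, now, payload)
        )

    def replay(self, subscription_id, last_event_id, now):
        """Return (sequence_number, payload) pairs after last_event_id, in order."""
        rows = sorted(self.rows.get(subscription_id, []), key=lambda row: row[0])
        return [
            (seq, payload)
            for seq, stored_at, payload in rows
            if seq > last_event_id and now - stored_at < self.retention
        ]


store = ReplayStore()
for seq in range(1, 6):
    store.append("sub-1", seq, {"focus": f"Observation/lab-{seq}"}, now=1000 + seq)

# The client saw events up to 3 before its connection dropped.
missed = store.replay("sub-1", last_event_id=3, now=2000)
# A client that reconnects after the retention window gets nothing back.
too_late = store.replay("sub-1", last_event_id=3, now=1000 + 8 * 24 * 60 * 60)
```

Ordering by `(subscription_id, sequence_number)` in the real table makes this a contiguous range scan, which is why replay stays cheap even when the table holds a week of events.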
The Patient Perspective in the AI Era
Here’s where it gets interesting. As engineers building healthcare infrastructure, we’re also patients. When you build a real-time notification system, you’re building the substrate that AI patient agents will consume.
Consider the flow for an LLM-powered health assistant:
- Patient grants access to their health data (SMART on FHIR scopes)
- AI agent creates a FHIR Subscription for relevant resource types
- AI agent opens SSE connection, receives id-only notifications
- On each notification: fetches resource, interprets it, notifies patient in plain language
- Patient asks follow-up questions; agent has full context
The id-only pattern is perfect for this. The AI agent doesn’t need the raw FHIR Bundle streaming over SSE; it needs a signal that something changed, then it fetches, interprets, and acts. Two authorization checkpoints (SSE access filter + FHIR server re-auth) mean the patient’s data stays protected even when an AI intermediary is involved.
The FHIR SSE channel becomes the nervous system. The LLM becomes the interpreter. The patient gets real-time, understandable health updates instead of a portal they check once a week.
What the FHIR Spec Gives You vs. What You Need
| Concern | FHIR Spec Provides | What You Actually Need |
|---|---|---|
| Event format | ✓︎ Notification Bundle shape | Same |
| Payload content levels | ✓︎ empty / id-only / full-resource | Same |
| Last-Event-Id replay | ✓︎ Protocol-level | ClickHouse backing store + sequence numbers |
| Multi-tenancy | ✗︎ Not addressed | Per-event security filtering |
| Cross-pod delivery | ✗︎ Not addressed | Redis Pub/Sub fan-out |
| Token lifecycle | ✗︎ Not addressed | Proactive expiry warnings + graceful reconnect |
| Credential revocation | ✗︎ Not addressed | Cross-pod revocation propagation |
| Horizontal scaling | ✗︎ Not addressed | Shared consumer group + fan-out pattern |
| PHI protection | ✗︎ Not addressed | id-only + re-authorization on fetch |
The spec gives you the protocol. The hard part — and the interesting engineering — is everything around it.
Key Technology Choices
| Component | Technology | Why |
|---|---|---|
| Event ingestion | Kafka (shared consumer group) | Already in CDC pipeline, ordering guarantees via partition key |
| Cross-pod delivery | Redis Pub/Sub | Lightweight, no persistence needed (ClickHouse is safety net) |
| Sequence numbers | Redis INCR | Distributed atomic counter, fast |
| Event persistence | ClickHouse | Time-series optimized, columnar, built-in TTL, cheap storage |
| SSE streaming | Spring WebFlux (Reactor) | Non-blocking, handles thousands of long-lived connections |
| Auth | Multi-issuer JWT (JWKS) | Enterprise reality: multiple IdPs |
| Resilience | Circuit breakers (Resilience4j) | ClickHouse down shouldn’t kill live delivery |
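The resilience row is worth a small illustration. The production system uses Resilience4j; the sketch below is a hand-rolled Python breaker that only shows the idea of keeping live delivery running while persistence is failing. The thresholds and function names are assumptions, and the sketch does not address how events skipped while the breaker is open get persisted later.

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold, self.reset_after, self.clock = failure_threshold, reset_after, clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        """Return True if fn ran and succeeded, False if it failed or was skipped."""
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after:
            return False  # open: skip the call entirely
        try:
            fn(*args)  # closed, or half-open trial after the cooldown
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            return False
        self.failures, self.opened_at = 0, None
        return True


def handle_event(event, persist, deliver, breaker):
    """Persistence is best-effort behind the breaker; live delivery always runs."""
    persisted = breaker.call(persist, event)
    deliver(event)
    return persisted


now = [0.0]  # mutable fake clock
breaker = CircuitBreaker(clock=lambda: now[0])
persist_attempts, delivered = [], []


def failing_persist(event):
    persist_attempts.append(event)
    raise ConnectionError("store unavailable")


results = [handle_event(n, failing_persist, delivered.append, breaker) for n in range(4)]

now[0] = 31.0  # cooldown elapsed: the next call is allowed through as a trial
recovered = handle_event(4, persist_attempts.append, delivered.append, breaker)
```

After three failures the fourth event is delivered without even attempting the write, and once the cooldown passes a successful trial closes the breaker again.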
Wrapping Up
The FHIR Subscription Backport IG gives you a solid foundation — the custom channel extension point is well-designed, and the notification Bundle shape is interoperable. But deploying real-time notifications in a multi-tenant healthcare platform requires solving problems the spec intentionally leaves to implementers.
If you’re building something similar — especially if you’re thinking about how LLM agents will consume health events on behalf of patients — the claim-checker pattern (id-only) with per-event security filtering is the architecture I’d recommend. Keep the notification channel thin (signals, not data), enforce authorization at every boundary, and design for the reality that connections will break, tokens will expire, and credentials will be revoked.
The patients — and their AI agents — shouldn’t have to worry about any of that. They should just get the update.
Source: Built and deployed in production at b.well Connected Health. Architecture patterns described here are generalizable to any FHIR-native platform implementing real-time subscriptions.