End-to-End Encryption (E2EE)

What is End-to-End Encryption?

End-to-end encryption (E2EE) is a security system where only the communicating parties can read the messages. No one else—not the service provider, not the server infrastructure, not even the platform operator—can decrypt the communication. Think of it like sending a letter in a locked box where only the recipient has the key, and the postal service can't open it.

In video calling, E2EE means your audio and video streams are encrypted on your device and remain encrypted until they reach the recipient's device. Even if someone intercepts the data in transit or compromises the servers routing your call, they see only encrypted gibberish.

E2EE is the gold standard for privacy. It's used by Signal, WhatsApp (for messaging and calls), FaceTime, and increasingly by enterprise video conferencing platforms. As of 2025, E2EE in WebRTC has matured significantly with new standards and browser APIs making implementation practical for production applications.

WebRTC's Built-in Encryption (DTLS-SRTP)

WebRTC mandates encryption—all WebRTC connections are encrypted by default using DTLS-SRTP:

DTLS (Datagram Transport Layer Security): Establishes encrypted channels and exchanges encryption keys
SRTP (Secure Real-time Transport Protocol): Encrypts the actual media packets (audio and video)

This encryption is mandatory and automatic—you can't disable it. Every WebRTC connection is encrypted, protecting against passive eavesdropping on the network.

However, DTLS-SRTP alone is NOT end-to-end encryption. Here's why.

The SFU Problem: Why Standard WebRTC Isn't E2EE

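In peer-to-peer WebRTC (two people connecting directly), DTLS-SRTP provides true E2EE: only the two participants hold the decryption keys. This holds as long as the DTLS certificate fingerprints exchanged over the signaling channel are authentic; a signaling server that swapped them could place itself in the middle.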

But most video calling uses SFU (Selective Forwarding Unit) servers for group calls. Here's what happens:

1. Your device encrypts media with DTLS-SRTP and sends it to the SFU server
2. The SFU server decrypts the media to inspect and route it
3. The SFU re-encrypts the media with new DTLS-SRTP keys
4. The re-encrypted media is sent to recipients

In step 2, the SFU has access to unencrypted media—even if only briefly in memory. This breaks end-to-end encryption. The media is encrypted in transit, but the server can potentially access it.

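Why do SFUs decrypt? DTLS-SRTP keys are negotiated hop by hop, so the SFU is one end of every encrypted connection. SRTP already leaves the RTP header readable, but the SFU also needs codec-specific payload headers to detect keyframes, handle simulcast layers, and perform quality-based switching, and it rewrites and re-authenticates packets before forwarding them. Traditional SFU architecture therefore has plaintext access to media packets.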

True E2EE with Insertable Streams

The solution: add a second layer of encryption that the SFU never decrypts. Only the endpoints (participants) have the keys for this layer.

How It Works

1. Your device captures and encodes video/audio
2. Before sending, your device adds application-layer encryption (E2EE layer)
3. This encrypted payload is then wrapped in DTLS-SRTP (transport encryption)
4. The SFU receives the packet, decrypts DTLS-SRTP, but sees only the application-layer ciphertext
5. The SFU forwards the (still encrypted) media to recipients
6. Recipients decrypt the application-layer encryption to recover the original media

The SFU never sees plaintext media. It only sees encrypted frames that it blindly forwards.
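The layering described above can be modelled in a few lines. This is a toy sketch, not how WebRTC is wired internally: both layers use AES-GCM through the WebCrypto API purely for illustration (the real outer layer is DTLS-SRTP), and every function name here is hypothetical.

```typescript
// Toy model only: both layers use AES-GCM via WebCrypto. In real WebRTC the
// outer (hop-by-hop) layer is DTLS-SRTP; only the inner layer is added by the app.
async function newLayerKey(): Promise<CryptoKey> {
  return crypto.subtle.generateKey({ name: "AES-GCM", length: 128 }, false, ["encrypt", "decrypt"]);
}

// Encrypts `plaintext` and prepends the random 12-byte IV to the result.
async function sealLayer(key: CryptoKey, plaintext: ArrayBuffer): Promise<ArrayBuffer> {
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const ciphertext = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, key, plaintext);
  const out = new Uint8Array(12 + ciphertext.byteLength);
  out.set(iv, 0);
  out.set(new Uint8Array(ciphertext), 12);
  return out.buffer;
}

// Reverses sealLayer; rejects if the key is wrong or the data was modified.
async function openLayer(key: CryptoKey, sealed: ArrayBuffer): Promise<ArrayBuffer> {
  const iv = new Uint8Array(sealed, 0, 12);
  const body = new Uint8Array(sealed, 12);
  return crypto.subtle.decrypt({ name: "AES-GCM", iv }, key, body);
}

// Sender: inner E2EE layer first, then the outer transport layer.
async function sendViaSfu(e2eeKey: CryptoKey, hopKey: CryptoKey, frame: ArrayBuffer): Promise<ArrayBuffer> {
  return sealLayer(hopKey, await sealLayer(e2eeKey, frame));
}

// SFU: holds only the hop key, so removing the outer layer yields ciphertext.
async function sfuUnwrap(hopKey: CryptoKey, packet: ArrayBuffer): Promise<ArrayBuffer> {
  return openLayer(hopKey, packet);
}
```

A recipient that holds the E2EE key calls `openLayer(e2eeKey, ...)` on what the SFU forwards; the SFU, holding only the hop key, cannot.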

Insertable Streams API

Introduced in Chrome in 2020 and later specified as the WebRTC Encoded Transform API (RTCRtpScriptTransform), this browser API allows JavaScript code to intercept encoded media frames after encoding but before they're packetized and sent, or after they're received but before decoding.

Your code can:

Read each encoded frame
Apply additional encryption using cryptographic libraries (like WebCrypto API)
Return the encrypted frame for sending
On reception, decrypt frames before passing them to the decoder

Browser support: Chromium-based browsers (Chrome, Edge, Opera) fully support it. Safari has partial support. Firefox support is in progress as of 2025.
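As a rough sketch of what such a transform looks like, the helper below builds the `transform` callback that a `TransformStream` expects. `EncodedFrameLike`, `makeFrameTransformer`, and the `cipher` callback are hypothetical names introduced for illustration; the only property assumed of a real encoded frame is its writable `data` buffer. The worker wiring is shown as comments because it only runs in a browser.

```typescript
// Minimal shape of an encoded frame: real RTCEncodedVideoFrame and
// RTCEncodedAudioFrame objects expose their bytes through a writable `data`.
interface EncodedFrameLike {
  data: ArrayBuffer;
}

// Builds a transformer that replaces each frame's bytes with cipher(bytes).
// `cipher` is app-supplied (for example an AES-GCM encrypt or decrypt wrapper).
function makeFrameTransformer(cipher: (payload: ArrayBuffer) => Promise<ArrayBuffer>) {
  return {
    async transform(
      frame: EncodedFrameLike,
      controller: { enqueue(frame: EncodedFrameLike): void },
    ): Promise<void> {
      frame.data = await cipher(frame.data);
      controller.enqueue(frame); // hand the modified frame to the next stage
    },
  };
}

// Inside the worker passed to `new RTCRtpScriptTransform(worker, options)`:
//
//   self.onrtctransform = (event) => {
//     const { readable, writable } = event.transformer;
//     readable
//       .pipeThrough(new TransformStream(makeFrameTransformer(encryptPayload)))
//       .pipeTo(writable);
//   };
//
// `encryptPayload` is a placeholder for the app's own encryption function.
```

The same helper serves the receive side by passing a decrypting `cipher` instead.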

SFrame: The Emerging Standard

SFrame (Secure Frame) is an IETF standard protocol (RFC 9605) specifically designed for encrypting media frames in WebRTC group calls. It's optimized for the constraints of real-time media:

Key Features

Partial encryption: Only encrypts the media payload, leaving RTP headers and some metadata unencrypted so SFUs can still route packets
Per-frame encryption: Each frame is independently encrypted, allowing out-of-order delivery and packet loss without affecting other frames
Minimal overhead: ~10-40 bytes per frame for encryption metadata
Fast symmetric encryption: Typically uses AES-GCM for speed (critical for real-time encoding); the RFC also defines AES-CTR with HMAC cipher suites
Group key management: Supports multiple participants with shared group keys

How SFrame Works

Each participant has an encryption key. Before sending a frame:

Frame is encoded (VP8, H.264, Opus, etc.)
SFrame adds a small header carrying a key ID (which key was used) and a frame counter
The encoded payload is encrypted with AES-GCM
The encrypted frame is sent via WebRTC

Recipients identify the sender from the SFrame header, use the corresponding key to decrypt, and decode the media.
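The flow above can be sketched with WebCrypto. The format below is SFrame-like but deliberately simplified and not wire-compatible with RFC 9605: the header is a fixed 5 bytes (1-byte key ID plus 4-byte counter), and the nonce is built directly from the counter. All names are hypothetical.

```typescript
// Simplified header: 1-byte key ID + 4-byte big-endian counter (an assumption
// for readability; the real SFrame header is variable-length).
const SKETCH_HEADER_LEN = 5;

// 96-bit AES-GCM nonce: 8 zero bytes followed by the 32-bit counter. It is
// unique per frame as long as a key never reuses a counter value.
function sketchNonce(counter: number) {
  const nonce = new Uint8Array(12);
  new DataView(nonce.buffer).setUint32(8, counter);
  return nonce;
}

async function protectFrame(key: CryptoKey, keyId: number, counter: number, payload: ArrayBuffer) {
  const header = new Uint8Array(SKETCH_HEADER_LEN);
  header[0] = keyId & 0xff;
  new DataView(header.buffer).setUint32(1, counter);
  // The header travels in the clear but is authenticated as additional data.
  const ciphertext = await crypto.subtle.encrypt(
    { name: "AES-GCM", iv: sketchNonce(counter), additionalData: header },
    key,
    payload,
  );
  const out = new Uint8Array(SKETCH_HEADER_LEN + ciphertext.byteLength);
  out.set(header, 0);
  out.set(new Uint8Array(ciphertext), SKETCH_HEADER_LEN);
  return out.buffer;
}

async function unprotectFrame(keys: Map<number, CryptoKey>, frame: ArrayBuffer) {
  const header = new Uint8Array(frame, 0, SKETCH_HEADER_LEN);
  const counter = new DataView(frame).getUint32(1);
  const key = keys.get(header[0]); // look up the sender's key by key ID
  if (!key) throw new Error(`unknown key id ${header[0]}`);
  return crypto.subtle.decrypt(
    { name: "AES-GCM", iv: sketchNonce(counter), additionalData: header },
    key,
    new Uint8Array(frame, SKETCH_HEADER_LEN),
  );
}
```

Because the header is bound to the ciphertext as additional authenticated data, a frame whose key ID or counter has been altered in transit fails to decrypt rather than decoding to garbage.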

MLS: Group Key Management

The hardest part of E2EE isn't encrypting frames—it's managing encryption keys across multiple participants, especially as people join and leave calls.

MLS (Messaging Layer Security), standardized by the IETF in 2023 as RFC 9420, is a group key exchange protocol designed for exactly this problem.

MLS Features

Efficient key updates: Adding a participant doesn't require re-exchanging keys with everyone
Forward secrecy: Compromising today's keys doesn't decrypt past communications
Post-compromise security: Once members update their keys, an attacker who compromised earlier keys loses access to future communications
Scalability: Works efficiently with thousands of participants
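To give a feel for why a tree-based design scales, the sketch below compares two idealized costs of delivering fresh key material to a group of n members: one encryption per other member (pairwise) versus roughly one per level of a balanced binary tree. This is back-of-the-envelope arithmetic under the assumption of a full, balanced tree, not a model of real MLS behaviour; both function names are hypothetical.

```typescript
// Pairwise distribution: the updater encrypts separately to every other member.
function pairwiseCost(n: number): number {
  return Math.max(n - 1, 0);
}

// Tree-based distribution: about one encryption per level on the updater's
// path to the root, i.e. ceil(log2(n)) for a balanced binary tree.
function treePathCost(n: number): number {
  let depth = 0;
  let capacity = 1;
  while (capacity < n) {
    capacity *= 2;
    depth++;
  }
  return depth;
}
```

For a 1,000-member group, `pairwiseCost(1000)` is 999 while `treePathCost(1000)` is 10, which is the gap the "Efficient key updates" and "Scalability" points refer to.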

Key Rotation

When someone leaves a call, keys must be rotated so departed participants can't decrypt future conversation. MLS handles this efficiently:

Generate new group key
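Distribute it to the remaining participants, encrypted so that the departed member cannot read it (MLS encrypts to the remaining members' public keys rather than under the old group key)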
All participants switch to the new key

For joins, hash ratcheting can derive new keys without full key exchange, reducing overhead.
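A minimal sketch of hash ratcheting, assuming HMAC-SHA-256 as the one-way function: every existing participant applies the same step to the current secret, and a newcomer who is handed only the new secret cannot work backwards to earlier ones. The label strings and function names are arbitrary choices for this example, not values taken from MLS or SFrame.

```typescript
// HMAC-SHA-256(secret, label): a simple labelled derivation.
async function deriveLabelled(secret: ArrayBuffer, label: string): Promise<ArrayBuffer> {
  const hmacKey = await crypto.subtle.importKey(
    "raw", secret, { name: "HMAC", hash: "SHA-256" }, false, ["sign"],
  );
  return crypto.subtle.sign("HMAC", hmacKey, new TextEncoder().encode(label));
}

// One-way step: knowing the new secret does not reveal the previous one.
function ratchetSecret(secret: ArrayBuffer): Promise<ArrayBuffer> {
  return deriveLabelled(secret, "ratchet");
}

// AES-128-GCM frame key for the current secret, derived under a different
// label so it is independent of the next ratchet output.
async function frameKeyFrom(secret: ArrayBuffer): Promise<CryptoKey> {
  const bytes = await deriveLabelled(secret, "frame-key");
  return crypto.subtle.importKey("raw", bytes.slice(0, 16), "AES-GCM", false, ["encrypt", "decrypt"]);
}
```

Ratcheting only helps for joins. When someone leaves, they already know the current secret and could ratchet it forward themselves, which is why departures need fresh key material.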

Implementation Challenges

1. Performance Overhead

Additional encryption/decryption consumes CPU. On mobile devices or low-end hardware, this can reduce video quality or drain batteries faster. Hardware acceleration for AES helps, but the overhead is real (typically 10-30% CPU increase).

2. Partial Encryption Complexity

SFUs need some unencrypted metadata to function:

Keyframe indicators (to prioritize important frames)
Simulcast layer information (for quality switching)
Codec-specific metadata

Determining exactly what must remain unencrypted varies by codec (VP8, VP9, H.264, Opus) and frame type. Getting this wrong breaks SFU functionality or leaks information.
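One common approach is to leave a fixed number of leading bytes of each frame unencrypted. The sketch below assumes offsets of 10 bytes for VP8 keyframes, 3 for VP8 delta frames, and 1 for Opus audio, which mirror the values used in the public WebRTC end-to-end encryption sample; treat them as assumptions to verify for your own codecs. The type and function names are hypothetical.

```typescript
type FrameKind = "key" | "delta" | "audio";

// Bytes left in the clear at the start of each frame (assumed values for
// VP8 video and Opus audio; other codecs need different offsets).
const CLEAR_PREFIX_BYTES: Record<FrameKind, number> = { key: 10, delta: 3, audio: 1 };

// Splits a frame into the part the SFU may read and the part to encrypt.
// Both results are views onto the original buffer, not copies.
function splitForEncryption(frame: Uint8Array, kind: FrameKind) {
  const cut = Math.min(CLEAR_PREFIX_BYTES[kind], frame.length);
  return { clear: frame.subarray(0, cut), secret: frame.subarray(cut) };
}
```

The sender encrypts only `secret` and reassembles `clear` plus ciphertext; the receiver applies the same split before decrypting.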

3. Key Management Complexity

Securely distributing, rotating, and managing keys across dynamic groups is hard. You need:

Secure initial key exchange (often using public key cryptography)
Key derivation functions
Synchronization mechanisms (all participants must use the same key version)
Handling race conditions (simultaneous joins/leaves)
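The synchronization requirement can be illustrated with a small key ring that tracks key versions. This is a sketch under simple assumptions (monotonically increasing version numbers, a short grace window for older keys); `KeyRing` and its methods are hypothetical names, and the key type is generic so the example does not depend on a particular crypto API.

```typescript
// Keeps the current key plus a few previous versions so frames encrypted just
// before a rotation can still be decrypted.
class KeyRing<K> {
  private readonly keys = new Map<number, K>();
  private readonly keep: number;
  private current = -1;

  constructor(keep = 2) {
    this.keep = keep; // number of most recent versions to retain
  }

  // Installs a key version. Stale or duplicate versions are ignored so that
  // out-of-order updates (for example racing joins and leaves) cannot roll
  // the sending key back.
  install(version: number, key: K): boolean {
    if (version <= this.current) return false;
    this.keys.set(version, key);
    this.current = version;
    for (const v of Array.from(this.keys.keys())) {
      if (v <= version - this.keep) this.keys.delete(v);
    }
    return true;
  }

  // Key to encrypt outgoing frames with, tagged with its version.
  forSending(): { version: number; key: K } | undefined {
    const key = this.keys.get(this.current);
    return key === undefined ? undefined : { version: this.current, key };
  }

  // Key for an incoming frame, looked up by the version in its header.
  forReceiving(version: number): K | undefined {
    return this.keys.get(version);
  }
}
```

Retaining old versions briefly smooths over rotation, but the window should stay short: every retained key is one a departed participant may still hold.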

4. Browser Support

As of 2025, Insertable Streams support is incomplete. Chromium browsers are fully compatible, but Safari and Firefox support is partial or in development. Cross-browser E2EE requires fallback strategies or limits supported platforms.
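A fallback strategy usually starts with feature detection. The sketch below checks for the standardized `RTCRtpScriptTransform` constructor first and then for `createEncodedStreams()` on `RTCRtpSender`, the earlier Chromium-specific form of the API. The function name, the return values, and the choice to pass the global scope as a parameter (so the check can run outside a browser) are assumptions of this example.

```typescript
type E2eeMode = "script-transform" | "encoded-streams" | "unsupported";

// `scope` is normally `globalThis` in a browser page.
function detectE2eeMode(scope: Record<string, any>): E2eeMode {
  // Standardized API: transforms run in a worker.
  if (typeof scope.RTCRtpScriptTransform === "function") return "script-transform";
  // Older Chromium form: encoded streams exposed directly on the sender.
  if (typeof scope.RTCRtpSender?.prototype?.createEncodedStreams === "function") {
    return "encoded-streams";
  }
  return "unsupported";
}
```

An app might call `detectE2eeMode(globalThis)` before joining and refuse to enter an E2EE room, or warn the user, when the result is `"unsupported"`.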

5. Debugging and Monitoring

With E2EE, you can't inspect media server-side. Debugging call quality issues becomes harder—you can't look at the video frames the server sees because it sees only encrypted data. Telemetry and diagnostics must happen client-side.

P2P vs. SFU E2EE

Peer-to-Peer

E2EE is trivial with P2P. WebRTC's built-in DTLS-SRTP already provides E2EE—only the two peers have the keys. No additional encryption layer needed.

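One-to-one calls in FaceTime, WhatsApp, and Signal rest on the same principle: only the two endpoints hold the keys, although each app may use its own key exchange rather than plain WebRTC DTLS. In WebRTC terms the simplicity is beautiful: connect directly, exchange keys via DTLS, encrypt with SRTP. Done.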

SFU (Group Calls)

E2EE with an SFU requires the additional application-layer encryption (SFrame or similar) plus group key management (MLS or similar). This is much more complex, but necessary once a call outgrows a peer-to-peer mesh, typically beyond 4-5 participants.

Real-World Implementations (2025)

Apps With E2EE

WhatsApp: E2EE for all calls (1-on-1 and group), uses Signal Protocol for key exchange
Signal: E2EE for everything (messaging and calls), pioneer of modern E2EE
FaceTime: E2EE for all calls, Apple's proprietary implementation
Zoom: Optional E2EE for meetings (disabled by default, requires host enablement)
Jitsi Meet: Offers E2EE using Insertable Streams for Chromium browsers
Cloudflare Calls: Demonstrated E2EE implementation with MLS (Orange Meets project)

Apps Without E2EE

Google Meet: Encrypted in transit (DTLS-SRTP) but not E2EE by default, so Google can decrypt
Microsoft Teams: Encrypted in transit but not E2EE by default, so Microsoft can decrypt
Most enterprise video platforms: Not E2EE by default to enable features like recording, transcription, compliance monitoring

Why Not Always Use E2EE?

If E2EE is more secure, why don't all platforms use it?

Feature limitations: Server-side features like cloud recording, live transcription, content moderation, and AI features require access to media—incompatible with E2EE
Compliance: Some industries require the ability to audit communications, which E2EE prevents
Performance: Additional encryption overhead reduces quality on low-end devices
Complexity: E2EE is harder to implement and maintain, increasing development cost
Browser support: Cross-browser E2EE still has compatibility challenges in 2025

For many business applications, the trade-offs favor transport encryption (DTLS-SRTP) without E2EE. For privacy-focused consumer apps, E2EE is increasingly the standard.

Verifying E2EE

How can you tell if a video call truly uses E2EE?

Security codes/fingerprints: Apps like Signal and WhatsApp show security codes you can verify with the other party out-of-band (in person, separate channel)
Platform documentation: Check if the platform explicitly states E2EE and explains how it works
Open source: Open source implementations (like Jitsi) can be audited
Third-party audits: Security audits by reputable firms provide confidence

If an app offers cloud recording or AI transcription, it's probably not E2EE (those features require server access to unencrypted media).
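To show what a security code is doing under the hood, the sketch below hashes both parties' public keys into a short numeric string that each side can read aloud and compare. It is an illustrative scheme only, not the algorithm Signal or WhatsApp actually use, and it assumes both keys have the same fixed length; the function names are hypothetical.

```typescript
// Lexicographic byte comparison, used to put the two keys in a stable order.
function compareBytes(a: Uint8Array, b: Uint8Array): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Derives a 30-digit code from both public keys. Sorting first means both
// participants compute the same code regardless of who is "A" or "B".
async function safetyCode(keyA: Uint8Array, keyB: Uint8Array): Promise<string> {
  const [first, second] = compareBytes(keyA, keyB) <= 0 ? [keyA, keyB] : [keyB, keyA];
  const joined = new Uint8Array(first.length + second.length);
  joined.set(first, 0);
  joined.set(second, first.length);
  const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", joined));
  // Six groups of five decimal digits, each taken from two digest bytes.
  const groups: string[] = [];
  for (let i = 0; i < 6; i++) {
    const value = (digest[2 * i] << 8) | digest[2 * i + 1];
    groups.push(value.toString().padStart(5, "0"));
  }
  return groups.join(" ");
}
```

If an intermediary had substituted either key, the two participants would compute different codes, which is what the out-of-band comparison is meant to catch.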

The Future of E2EE in WebRTC

As of 2025, E2EE in WebRTC is transitioning from "difficult and niche" to "practical and increasingly standard":

SFrame is standardized (RFC 9605) and being implemented in production systems
MLS provides robust group key management
Browser support for Insertable Streams is improving
Major platforms (Zoom, Jitsi) now offer E2EE options
Libraries like libsframe make implementation easier

Expect E2EE to become the default for consumer video calling by 2026-2027, similar to how HTTPS became the default for websites.

The Bottom Line

End-to-end encryption is the gold standard for communication privacy. While WebRTC has always been encrypted in transit (DTLS-SRTP), true E2EE—where even the servers can't decrypt your calls—requires additional application-layer encryption using technologies like SFrame and key management protocols like MLS.

The trade-off is complexity and reduced functionality (no server-side recording, transcription, or AI features). But for privacy-critical applications—personal calls, sensitive business discussions, healthcare consultations—E2EE is essential.

As browser support matures and standards solidify, implementing E2EE in WebRTC applications is becoming practical for mainstream developers. Understanding E2EE helps you make informed decisions about when to use it and how to implement it effectively.

References

True End-to-End Encryption with WebRTC Insertable Streams - webrtcHacks
Having fun with Insertable Streams and E2EE (and SFrame!) - Meetecho
Does your video call have End-to-End Encryption? Probably not.. - webrtcHacks
WebRTC Security - Is it secure and safe? - Stream
End-to-End Encryption - Dyte
Exploring End-to-End Encryption (E2EE) in WebRTC - DigitalSamba
Orange Me2eets: We made an end-to-end encrypted video calling app - Cloudflare
End-to-End Encryption - Cloudflare Realtime Documentation