SFU (Selective Forwarding Unit)

Architecture

Server architecture that forwards streams between participants efficiently

What is an SFU?

A Selective Forwarding Unit (SFU) is a media server architecture that acts as a smart router for video and audio streams. Unlike peer-to-peer connections where participants send streams directly to each other, or MCU servers that decode and mix all streams, an SFU receives streams from each participant and selectively forwards them to other participants without decoding or re-encoding the media.

Think of it as an intelligent traffic controller: each participant sends their video once to the SFU, which then distributes it to everyone who needs it. The SFU makes smart decisions about which streams to send where, based on bandwidth, active speakers, and layout requirements.

As of 2025, the SFU has become the de facto standard architecture for multi-party WebRTC applications, and SFU-style media routing is widely reported to underpin major video conferencing platforms including Zoom, Google Meet, Microsoft Teams, Discord, and WhatsApp.

How SFU Works: Step by Step

1. Connection Establishment

Each participant establishes a WebRTC connection with the SFU server. This involves exchanging session descriptions through a signaling server, negotiating codecs, and using ICE to establish the optimal connection path.
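
The codec negotiation part of this step can be sketched as a simple intersection of capabilities. This is a simplified model rather than a real SDP implementation: the function name negotiate_codecs and both codec lists are hypothetical, and it assumes the answering side keeps the offerer's preference order.

```python
def negotiate_codecs(offered, supported):
    """Return the codecs both sides can use, in the offerer's preference order."""
    supported_set = {codec.lower() for codec in supported}
    return [codec for codec in offered if codec.lower() in supported_set]


# Hypothetical capabilities for a browser and an SFU.
browser_offer = ["VP9", "VP8", "H264"]
sfu_supported = ["VP8", "H264"]

print(negotiate_codecs(browser_offer, sfu_supported))  # ['VP8', 'H264']
```

If the result is empty, the two sides share no codec and the media section cannot be established.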

2. Media Capture and Upload

The participant's device captures audio from the microphone and video from the camera. The browser encodes these streams using agreed-upon codecs (like VP8, VP9, or H.264 for video, and Opus for audio) and sends them to the SFU as RTP (Real-time Transport Protocol) packets protected by SRTP (Secure RTP). WebRTC requires this encryption, with keys negotiated through DTLS, so media is never sent as plain RTP.

Crucially, each participant only uploads their stream once—not separately to each viewer. This dramatically reduces upload bandwidth requirements compared to P2P.

3. Stream Reception and Management

The SFU receives and maintains separate inbound streams from each connected device. It maintains a routing table of all active participants, tracks which streams are available, and manages subscriptions based on client requests and network conditions.
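
One way to picture the routing table is as a map from each publisher to its set of subscribers. The sketch below is an illustrative data structure only; the class name RoutingTable and its methods are hypothetical and do not come from any particular SFU.

```python
from collections import defaultdict


class RoutingTable:
    """Tracks which participants publish streams and who subscribes to whom."""

    def __init__(self):
        self.publishers = set()
        self.subscribers = defaultdict(set)  # publisher id -> subscriber ids

    def join(self, participant_id):
        self.publishers.add(participant_id)

    def subscribe(self, subscriber_id, publisher_id):
        if publisher_id not in self.publishers:
            raise KeyError(f"unknown publisher: {publisher_id}")
        if subscriber_id != publisher_id:  # never echo a stream to its sender
            self.subscribers[publisher_id].add(subscriber_id)

    def leave(self, participant_id):
        # Drop the participant both as a publisher and as a subscriber.
        self.publishers.discard(participant_id)
        self.subscribers.pop(participant_id, None)
        for subs in self.subscribers.values():
            subs.discard(participant_id)

    def targets(self, publisher_id):
        """Participants that should receive packets from this publisher."""
        return sorted(self.subscribers.get(publisher_id, ()))


table = RoutingTable()
for name in ("alice", "bob", "carol"):
    table.join(name)
table.subscribe("bob", "alice")
table.subscribe("carol", "alice")
print(table.targets("alice"))  # ['bob', 'carol']
table.leave("bob")
print(table.targets("alice"))  # ['carol']
```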

4. Selective Forwarding

Here's where the "selective" part comes in. The SFU doesn't blindly forward all streams to all participants. Instead, it makes intelligent decisions:

  • Forward only streams that are visible in a participant's layout
  • Choose appropriate quality levels based on available bandwidth
  • Prioritize active speaker streams over silent participants
  • Adapt stream selection when participants change their view or network conditions fluctuate
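
The decisions listed above can be combined into a single selection function. This is a minimal sketch under stated assumptions: the function select_streams, its parameters, and the greedy "active speaker first, then fill the budget" policy are all hypothetical, and real SFUs use more elaborate strategies.

```python
def select_streams(visible, active_speaker, bitrate_kbps, budget_kbps):
    """Pick which publishers to forward to one subscriber.

    visible        -- publisher ids shown in the subscriber's layout, in order
    active_speaker -- publisher id of the current speaker, or None
    bitrate_kbps   -- dict mapping publisher id to that stream's bitrate
    budget_kbps    -- estimated downlink capacity for this subscriber
    """
    # Stable sort: the active speaker moves to the front, others keep order.
    ordered = sorted(visible, key=lambda pid: pid != active_speaker)
    chosen, used = [], 0
    for pid in ordered:
        cost = bitrate_kbps[pid]
        if used + cost <= budget_kbps:
            chosen.append(pid)
            used += cost
    return chosen


bitrates = {"alice": 500, "bob": 500, "carol": 500, "dave": 500}
# dave is not in the layout, so he is never considered.
print(select_streams(["alice", "bob", "carol"], "carol", bitrates, 1200))
# ['carol', 'alice']
```

Re-running the function whenever the layout, the active speaker, or the bandwidth estimate changes gives the adaptive behavior described in the last bullet.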

5. Client-Side Decoding

Each participant's device receives multiple streams from the SFU and is responsible for decoding them and composing the final video call interface. The client handles layout, determines which videos to display prominently, and renders the user interface.
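
As an illustration of client-side composition, the sketch below computes tile positions for a simple gallery view. The function name grid_layout and the near-square grid policy are assumptions made for the example; real clients apply their own layout rules.

```python
import math


def grid_layout(n_streams, width, height):
    """Return one (x, y, w, h) tile per stream, arranged in a near-square grid."""
    if n_streams <= 0:
        return []
    cols = math.ceil(math.sqrt(n_streams))
    rows = math.ceil(n_streams / cols)
    tile_w, tile_h = width // cols, height // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(n_streams)
    ]


print(grid_layout(4, 1280, 720))
# [(0, 0, 640, 360), (640, 0, 640, 360), (0, 360, 640, 360), (640, 360, 640, 360)]
```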

Key Technical Features

No Transcoding

The SFU's defining characteristic is that it doesn't decode or re-encode video streams—it simply forwards the encoded packets. This keeps CPU usage remarkably low on the server side, allowing a single SFU server to handle hundreds of concurrent participants.
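
To show how little of each packet a forwarding decision needs, here is a sketch that reads only the 12-byte fixed RTP header defined in RFC 3550 and leaves the encoded payload untouched. The function name parse_rtp_header and the sample packet are hypothetical, header extensions and CSRC entries are ignored, and in a real deployment the packet would first pass through SRTP processing, which is omitted here.

```python
import struct


def parse_rtp_header(packet: bytes):
    """Read the 12-byte fixed RTP header (RFC 3550); the payload is not decoded."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,           # top two bits of the first byte
        "marker": bool(b1 & 0x80),    # top bit of the second byte
        "payload_type": b1 & 0x7F,    # identifies the negotiated codec
        "sequence_number": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,                 # identifies the sending stream
    }


# Hypothetical packet: version 2, payload type 96, followed by opaque media bytes.
packet = struct.pack("!BBHII", 0x80, 96, 1234, 90000, 0xDEADBEEF) + b"encoded-video"
header = parse_rtp_header(packet)
print(header["payload_type"], header["sequence_number"], hex(header["ssrc"]))
# 96 1234 0xdeadbeef
```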

Simulcast Support

Modern SFUs leverage simulcast, where each participant sends multiple versions of their video at different quality levels (e.g., 1080p, 720p, 360p). The SFU then intelligently selects which quality to forward to each recipient based on their bandwidth and device capabilities.

For example, if you're in a meeting with 10 people but your screen only shows 4 at a time, the SFU might send you high-quality streams for the 4 visible participants and lower-quality or no streams for the others. When you change your view, the SFU adapts quickly, usually switching a stream to its new quality layer at that layer's next keyframe.
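
Layer selection for a single forwarded stream can be sketched as "pick the best layer that fits". The layer names follow the example above, but the bitrates are illustrative assumptions, and the function name pick_layer is hypothetical.

```python
# Hypothetical simulcast layers as (name, bitrate in kbps), best quality first.
LAYERS = [("1080p", 2500), ("720p", 1200), ("360p", 400)]


def pick_layer(available_kbps, visible=True, layers=LAYERS):
    """Return the best layer name that fits, or None to forward nothing."""
    if not visible:
        return None  # off-screen participants do not need video at all
    for name, kbps in layers:
        if kbps <= available_kbps:
            return name
    return None  # even the lowest layer does not fit


print(pick_layer(3000))                 # 1080p
print(pick_layer(1500))                 # 720p
print(pick_layer(3000, visible=False))  # None
```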

Adaptive Bitrate

SFUs monitor each participant's network conditions in real-time. When someone's bandwidth drops, the SFU automatically switches them to lower-quality simulcast layers. When conditions improve, it upgrades back to higher quality—all without interrupting the call.
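
A minimal sketch of this switching logic is shown below. It assumes the SFU already has a bandwidth estimate per participant; the layer bitrates, the function name next_layer, and the 20% headroom required before upgrading are all hypothetical choices made to keep the stream from oscillating between layers.

```python
# Hypothetical layer bitrates in kbps, lowest quality first.
LAYER_KBPS = [400, 1200, 2500]


def next_layer(current, estimate_kbps, headroom=1.2):
    """Return the layer index to use after a new bandwidth estimate.

    Downgrade as soon as the current layer no longer fits. Upgrade one step
    only when the next layer fits with some headroom.
    """
    while current > 0 and LAYER_KBPS[current] > estimate_kbps:
        current -= 1
    if (current + 1 < len(LAYER_KBPS)
            and LAYER_KBPS[current + 1] * headroom <= estimate_kbps):
        current += 1
    return current


layer = 2
history = []
for estimate in [3000, 1000, 1300, 1500, 3100]:
    layer = next_layer(layer, estimate)
    history.append(layer)
print(history)  # [2, 0, 0, 1, 2]
```

Note that the estimate of 1300 kbps would technically fit the 1200 kbps layer, but the sketch waits until there is headroom before moving up.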

Advantages of SFU

Excellent Scalability

SFU scales beautifully from small meetings to large conferences. A well-configured SFU can handle hundreds of participants in a single call. Because it doesn't decode or re-encode video, adding participants mainly increases server bandwidth usage, while CPU load grows far more slowly than it would on a server that transcodes.
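
The bandwidth side of this trade-off is easy to quantify. The sketch below counts the streams an SFU handles in the worst case where every participant receives every other participant; the function name sfu_stream_counts and the 1000 kbps per-stream figure are assumptions for illustration.

```python
def sfu_stream_counts(n):
    """Streams handled by the SFU when everyone receives everyone else."""
    inbound = n               # one upload per participant
    outbound = n * (n - 1)    # each upload is forwarded to the other n - 1
    return inbound, outbound


per_stream_kbps = 1000  # hypothetical bitrate of one video stream
for n in (5, 10, 50):
    inbound, outbound = sfu_stream_counts(n)
    print(n, inbound, outbound, outbound * per_stream_kbps // 1000, "Mbps out")
# 5 5 20 20 Mbps out
# 10 10 90 90 Mbps out
# 50 50 2450 2450 Mbps out
```

Selective forwarding exists precisely to keep the outbound figure well below this worst case in large calls.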

Lower Upload Bandwidth

Participants upload their stream once to the SFU, not separately to each viewer. In a 10-person meeting, you upload one stream instead of nine, cutting your required upload bandwidth to roughly one ninth of what a P2P mesh needs.
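
The arithmetic generalizes to any call size. In this sketch the function name upload_streams is hypothetical, and simulcast is ignored: with simulcast, the single SFU upload also carries the additional lower-quality encodings.

```python
def upload_streams(n, architecture):
    """Number of video streams one participant uploads in an n-person call."""
    if architecture == "mesh":
        return n - 1   # a separate copy for every other participant
    if architecture == "sfu":
        return 1       # a single copy, sent to the server
    raise ValueError(f"unknown architecture: {architecture}")


print(upload_streams(10, "mesh"), upload_streams(10, "sfu"))  # 9 1
```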

Low Latency

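Since the SFU doesn't decode or re-encode, forwarding itself adds very little processing delay. The additional latency compared to direct P2P, typically cited as 50-100ms, comes mostly from the extra network hop through the server and depends on how close that server is to the participants. This keeps conversations feeling natural and real-time.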

Cost-Effective

Compared to MCU servers that require expensive CPU resources for transcoding, SFU servers are relatively cheap to operate. The primary cost is bandwidth, which is significantly less expensive than computational power.

Quality Flexibility

Each participant can receive different quality streams based on their device and network. Someone on a phone with 4G gets lower quality to save bandwidth, while someone on fiber with a 4K monitor gets the highest quality available.

Limitations

Client-Side Decoding Load

Because the SFU doesn't mix streams, each participant's device must decode multiple video streams simultaneously. In a large meeting, this can strain older devices or mobile phones, especially when viewing many participants at once.

Bandwidth Requirements

While upload bandwidth is greatly reduced, download bandwidth can still be significant. If you're viewing 9 people in gallery view, you're downloading 9 separate video streams. On limited connections, this can cause quality issues.
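
A quick estimate shows why gallery view is demanding. The function names and all bitrate figures below are hypothetical; the point is only that download cost grows with the number of visible streams.

```python
def download_kbps(visible_streams, per_stream_kbps):
    """Total downlink needed to receive every visible stream."""
    return visible_streams * per_stream_kbps


def fits_downlink(visible_streams, per_stream_kbps, downlink_kbps):
    """True when the visible streams fit within the available downlink."""
    return download_kbps(visible_streams, per_stream_kbps) <= downlink_kbps


print(download_kbps(9, 500))        # 4500
print(fits_downlink(9, 500, 3000))  # False
print(fits_downlink(9, 300, 3000))  # True
```

In the last case, dropping each tile to a lower-bitrate layer is what brings the 9-person gallery back within a 3000 kbps connection.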

No Unified Layout

Unlike an MCU, where the server creates a single composite video, an SFU requires each client to compose its own layout. This means every participant might see a slightly different arrangement, and producing a single composite recording requires a separate server-side composition step similar to what an MCU does.

When to Use SFU

SFU is the right choice for:

  • Group video calls: Anywhere from 5 to 100+ participants
  • Professional conferencing: Business meetings, webinars, remote collaboration
  • Social video chat: Friend groups, family calls, gaming communities
  • Cost-sensitive applications: When server costs matter but quality can't be compromised
  • Mobile-inclusive meetings: When some participants join from phones or tablets

SFU vs P2P vs MCU

Understanding when to use each architecture:

  • P2P: Best for 1-on-1 or very small groups (2-4 people), minimal cost, maximum privacy
  • SFU: Best for most use cases (5-100+ people), balanced cost/performance, industry standard
  • MCU: Best for very large broadcasts (100+ viewers), legacy systems, or guaranteed bandwidth savings for viewers
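
The rules of thumb above can be written down as a small helper. This is only a restatement of the list, not a sizing tool: the function name suggest_architecture and the broadcast flag are hypothetical, and the thresholds are the approximate ones given in the bullets.

```python
def suggest_architecture(participants, broadcast=False):
    """Map a call size to the architecture suggested by the list above."""
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    if participants <= 4:
        return "P2P"   # 1-on-1 or very small groups
    if broadcast and participants >= 100:
        return "MCU"   # very large one-to-many broadcasts
    return "SFU"       # the default for most group calls


print(suggest_architecture(2))                     # P2P
print(suggest_architecture(25))                    # SFU
print(suggest_architecture(500, broadcast=True))   # MCU
```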

Popular SFU Implementations

Several open-source and commercial SFU servers are widely used:

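  • mediasoup: Powerful SFU with a Node.js API and a C++ media worker, highly flexible and performant
  • Janus: Feature-rich WebRTC server written in C, with SFU behavior provided by its VideoRoom plugin, extremely efficient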
  • LiveKit: Modern Go-based SFU with excellent developer experience
  • Cloudflare Calls: Managed SFU service with global edge deployment

Why SFU Dominates in 2025

SFU has become the overwhelmingly preferred architecture because it hits the sweet spot between cost, quality, and scalability. It avoids the quadratic growth in connections that a P2P mesh suffers from (every participant connects to every other) while being far more economical than MCU.

Many major platforms, including Zoom's web client, Google Meet, Microsoft Teams, Discord, and Telegram video calls, are widely reported to use SFU-style architectures under the hood. The combination of simulcast, selective forwarding, and adaptive bitrate gives users a high-quality experience without breaking the bank on server infrastructure.

For developers building video calling features in 2025, SFU is almost always the right answer unless you have very specific requirements that demand P2P (maximum privacy, 1-on-1 only) or MCU (extreme bandwidth constraints, legacy compatibility).
