Virtual Background

AI-powered feature that replaces or blurs the background behind a user during video calls

What is Virtual Background?

Virtual background is a feature that uses machine learning to identify and separate a person from their background in real-time video, then either replaces the background with a custom image/video or applies blur effects. This technology enables users to maintain privacy, hide messy environments, or add professional or creative backgrounds to their video calls.

In WebRTC applications, virtual backgrounds are implemented using computer vision algorithms that perform semantic segmentation on each video frame, classifying pixels as either "person" or "background" and then processing them differently.

How Virtual Background Works

Semantic Segmentation Process

The core technology behind virtual backgrounds involves these steps:

  1. Frame Capture: Capture video frames from the user's camera using getUserMedia()
  2. ML Inference: Run each frame through a machine learning model that classifies every pixel as person or background
  3. Mask Generation: The model outputs a segmentation mask, a grayscale image in which white marks the person and black marks the background
  4. Background Processing: Apply blur effect or replace background pixels based on the mask
  5. Frame Composition: Combine the segmented person with the new background
  6. Stream Output: Send the processed frames to WebRTC peer connection

Implementation with Insertable Streams

Modern WebRTC implementations use the Insertable Streams API for raw media (MediaStreamTrackProcessor and MediaStreamTrackGenerator, currently supported in Chromium-based browsers) for efficient frame processing. This is distinct from the Encoded Transform API, which operates on encoded chunks rather than raw frames:

// Get the camera stream
const stream = await navigator.mediaDevices.getUserMedia({ video: true });

// Break the video track out into a stream of raw VideoFrames and
// create a generator track that will carry the processed output
const videoTrack = stream.getVideoTracks()[0];
const processor = new MediaStreamTrackProcessor({ track: videoTrack });
const generator = new MediaStreamTrackGenerator({ kind: 'video' });

// Process frames
const transformer = new TransformStream({
  async transform(frame, controller) {
    // Run ML segmentation (segmentationModel is app-provided, e.g. MediaPipe)
    const mask = await segmentationModel.predict(frame);
    // Composite the person over the new background (see the sketch below)
    const processedFrame = applyVirtualBackground(frame, mask, backgroundImage);
    controller.enqueue(processedFrame);
    frame.close(); // release the original frame's memory promptly
  }
});

processor.readable.pipeThrough(transformer).pipeTo(generator.writable);

// The generator is itself a MediaStreamTrack; wrap it for the peer connection
const processedStream = new MediaStream([generator]);
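
The applyVirtualBackground() helper above is app-defined. A minimal compositing sketch using OffscreenCanvas, assuming the mask is a CanvasImageSource whose alpha channel marks the person (as MediaPipe's masks do):

const canvas = new OffscreenCanvas(1280, 720);
const ctx = canvas.getContext('2d');

function applyVirtualBackground(frame, mask, backgroundImage) {
  canvas.width = frame.displayWidth;
  canvas.height = frame.displayHeight;
  // 1. Draw the mask: person pixels opaque, background transparent
  ctx.globalCompositeOperation = 'copy';
  ctx.drawImage(mask, 0, 0, canvas.width, canvas.height);
  // 2. Keep camera-frame pixels only where the mask is opaque
  ctx.globalCompositeOperation = 'source-in';
  ctx.drawImage(frame, 0, 0, canvas.width, canvas.height);
  // 3. Fill the remaining transparent pixels with the new background
  ctx.globalCompositeOperation = 'destination-over';
  ctx.drawImage(backgroundImage, 0, 0, canvas.width, canvas.height);
  // Wrap the composited result in a new VideoFrame for the generator
  return new VideoFrame(canvas, { timestamp: frame.timestamp });
}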

Machine Learning Technologies

MediaPipe Selfie Segmentation

MediaPipe is Google's open-source framework for building ML pipelines. The Selfie Segmentation model is specifically optimized for person segmentation in video conferencing scenarios:

  • Performance: Fastest option, typically achieving 30-60 FPS on modern hardware
  • Accuracy: High-quality segmentation for people within 2 meters of the camera
  • Technology: Uses WebAssembly (WASM) for near-native performance in browsers
  • Model Size: Compact models (~1-3 MB) optimized for real-time use
  • Acceleration: Uses the XNNPACK library for SIMD-optimized CPU inference

MediaPipe is the same technology used in Google Meet and is the recommended solution for production WebRTC applications as of 2025.
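
A minimal usage sketch with the @mediapipe/selfie_segmentation package; the CDN URL, option values, and drawVirtualBackground() callback are illustrative:

import { SelfieSegmentation } from '@mediapipe/selfie_segmentation';

const selfieSegmentation = new SelfieSegmentation({
  // Resolve the model and WASM assets from a CDN
  locateFile: (file) =>
    `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`
});
// modelSelection: 0 = general model, 1 = landscape model (tuned for video calls)
selfieSegmentation.setOptions({ modelSelection: 1 });

selfieSegmentation.onResults((results) => {
  // results.segmentationMask is the person mask for results.image
  drawVirtualBackground(results.image, results.segmentationMask);
});

// Feed camera frames (e.g. from a <video> element), once per rendered frame
await selfieSegmentation.send({ image: videoElement });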

TensorFlow.js BodyPix

BodyPix is a TensorFlow.js model for person and body part segmentation:

  • Performance: Moderate, 15-40 FPS depending on browser and hardware (Chrome significantly faster than Firefox)
  • Flexibility: Can segment individual body parts, not just person vs background
  • License: Apache License, suitable for commercial use
  • Browser Support: Good cross-browser compatibility
  • Model Variants: Multiple model architectures available (MobileNet, ResNet) with quality/performance tradeoffs

BodyPix is easier to integrate but generally slower than MediaPipe for simple background replacement use cases.
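
A minimal BodyPix sketch using the @tensorflow-models/body-pix package and its built-in bokeh helper; the option values are illustrative starting points:

import '@tensorflow/tfjs';
import * as bodyPix from '@tensorflow-models/body-pix';

// Load a MobileNet variant: smaller and faster, at some accuracy cost
const net = await bodyPix.load({
  architecture: 'MobileNetV1',
  outputStride: 16,
  multiplier: 0.75,
  quantBytes: 2
});

// Classify each pixel of the current frame as person or background
const segmentation = await net.segmentPerson(videoElement, {
  internalResolution: 'medium',
  segmentationThreshold: 0.7
});

// Built-in helper: redraw the frame with a blurred background
bodyPix.drawBokehEffect(
  canvas, videoElement, segmentation,
  9,    // backgroundBlurAmount
  3,    // edgeBlurAmount
  false // flipHorizontal
);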

TensorFlow DeepLab v3+

DeepLab v3+ is a high-quality semantic segmentation model:

  • Accuracy: Excellent segmentation quality, better edge detection
  • Performance: Slower than MediaPipe/BodyPix, typically 5-15 FPS
  • Use Case: Better suited for pre-recorded video processing than real-time conferencing
  • Resource Usage: Higher CPU/GPU requirements

Virtual Background Effects

Background Replacement

Replace the background with a custom image or video:

  • Static Images: Office settings, scenic locations, branded backgrounds
  • Animated Backgrounds: Looping videos for dynamic effects
  • Green Screen Alternative: Achieve green screen effects without physical setup

Background Blur

Apply Gaussian blur to background pixels while keeping the person sharp:

  • Light Blur: Slight defocus (5-10px radius) for subtle background reduction
  • Medium Blur: Moderate blur (15-25px radius) for privacy without complete replacement
  • Heavy Blur: Strong blur (30-50px radius) making background unrecognizable

Background blur is computationally lighter than full replacement and often preferred for professional settings.
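
A minimal canvas blur sketch, reusing the alpha-encoded mask convention from the earlier compositing example; note that canvas filter support (ctx.filter) varies across browsers:

function applyBackgroundBlur(ctx, frame, mask, blurRadius) {
  const { width, height } = ctx.canvas;
  // Sharp person: draw the mask, then keep frame pixels inside it
  ctx.globalCompositeOperation = 'copy';
  ctx.drawImage(mask, 0, 0, width, height);
  ctx.globalCompositeOperation = 'source-in';
  ctx.drawImage(frame, 0, 0, width, height);
  // Blurred background: draw a blurred copy of the frame behind the person
  ctx.globalCompositeOperation = 'destination-over';
  ctx.filter = `blur(${blurRadius}px)`;
  ctx.drawImage(frame, 0, 0, width, height);
  ctx.filter = 'none';
}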

Performance Considerations

CPU Usage

Virtual background processing is CPU-intensive:

  • MediaPipe: ~10-30% CPU usage on modern processors at 720p@30fps
  • BodyPix: ~20-50% CPU usage depending on model architecture
  • Optimization: Run segmentation at a lower resolution (480p or 360p) and upscale the resulting mask for better performance
  • Frame Skipping: Run inference only every 2-3 frames and reuse the most recent mask to reduce CPU load (see the sketch after this list)
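
A frame-skipping sketch for the transform() callback shown earlier; segmentationModel and applyVirtualBackground() remain app-provided placeholders:

let lastMask = null;
let frameCount = 0;
const INFERENCE_INTERVAL = 3; // run the model on every 3rd frame

async function transform(frame, controller) {
  if (frameCount++ % INFERENCE_INTERVAL === 0) {
    lastMask = await segmentationModel.predict(frame);
  }
  // Reuse the most recent mask on skipped frames
  const output = lastMask
    ? applyVirtualBackground(frame, lastMask, backgroundImage)
    : frame.clone(); // pass through until the first mask is ready
  controller.enqueue(output);
  frame.close();
}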

Resolution and Frame Rate

Virtual background performance scales with video resolution:

  • 360p (640x360): Very fast, suitable for low-end devices
  • 480p (854x480): Good balance of quality and performance
  • 720p (1280x720): Standard for modern devices, acceptable performance on mid-range hardware
  • 1080p (1920x1080): Requires high-end CPU/GPU, may need frame rate reduction

Most implementations target 720p at 30 FPS or 480p at 30 FPS depending on device capabilities.

Browser Differences

Performance varies significantly across browsers:

  • Chrome/Edge: Best performance, 40+ FPS with BodyPix, 60 FPS with MediaPipe
  • Firefox: Moderate performance, 15-30 FPS with same models
  • Safari: Good performance on Apple Silicon Macs, moderate on Intel Macs

Quality Enhancement Techniques

Edge Refinement

Improve segmentation quality around hair and edges:

  • Edge Feathering: Apply subtle blur to mask edges to reduce harsh cutouts
  • Temporal Smoothing: Average masks across multiple frames to reduce jitter (see the sketch after this list)
  • Color Spill Correction: Adjust edge pixels to prevent background color bleeding
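
A temporal-smoothing sketch using an exponential moving average over per-pixel mask confidences (assumed here to be a Float32Array of 0..1 values):

const SMOOTHING = 0.7; // weight given to the previous frames' mask
let smoothedMask = null;

function smoothMask(rawMask) {
  if (!smoothedMask) {
    smoothedMask = Float32Array.from(rawMask);
    return smoothedMask;
  }
  for (let i = 0; i < rawMask.length; i++) {
    smoothedMask[i] = SMOOTHING * smoothedMask[i] + (1 - SMOOTHING) * rawMask[i];
  }
  return smoothedMask;
}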

Lighting Considerations

Virtual backgrounds work best with proper lighting:

  • Good Lighting: Well-lit subjects produce cleaner segmentation masks
  • Backlighting Issues: Strong backlighting (window behind user) reduces segmentation accuracy
  • Consistent Lighting: Uniform lighting helps ML models maintain stable masks

Privacy and Security

Client-Side Processing

Virtual background processing should happen client-side:

  • Privacy: Video frames are processed locally, not sent to servers
  • Bandwidth: No additional server resources required
  • Latency: No round-trip delay from server processing

Model Loading

ML models are typically loaded from CDNs:

  • Model Size: 1-10 MB depending on architecture
  • Caching: Models should be cached for subsequent sessions
  • Lazy Loading: Only load models when users enable virtual background (see the sketch after this list)
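
A lazy-loading sketch: create the model only on first use and reuse the same promise afterwards. loadSegmentationModel() is a hypothetical loader (for example, the MediaPipe setup shown earlier); the browser HTTP cache or a service worker handles repeat downloads of the model files:

let modelPromise = null;

function getSegmentationModel() {
  // First call kicks off the download; later calls reuse the same promise
  modelPromise ??= loadSegmentationModel();
  return modelPromise;
}

// Called only when the user toggles virtual background on:
// const model = await getSegmentationModel();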

Common Use Cases

  • Privacy: Hide home environment, family members, or personal items during professional calls
  • Professionalism: Maintain consistent, professional appearance regardless of physical location
  • Branding: Display company logos, branded backgrounds for corporate meetings
  • Education: Teachers use themed backgrounds to create engaging virtual classrooms
  • Entertainment: Creators use fun, creative backgrounds for streaming and content creation
  • Remote Work: Enable working from anywhere without exposing location

Best Practices

  1. Provide Multiple Options: Offer both blur and replacement, with quality presets (Low/Medium/High)
  2. Device Detection: Automatically select an appropriate model and settings based on device capabilities (see the heuristic sketch after this list)
  3. Preview Before Call: Allow users to test and adjust virtual background before joining
  4. Graceful Degradation: Disable on low-end devices or offer reduced quality options
  5. User Control: Make it easy to toggle on/off during calls
  6. Resource Monitoring: Monitor CPU usage and reduce quality if system is struggling
  7. Lighting Guidance: Provide tips for optimal lighting conditions
  8. Custom Backgrounds: Allow users to upload their own background images
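
A rough device-detection heuristic; the thresholds and preset values are illustrative, not benchmarks:

function pickQualityPreset() {
  const cores = navigator.hardwareConcurrency || 4;
  if (cores >= 8) return { resolution: 720, fps: 30, effect: 'replacement' };
  if (cores >= 4) return { resolution: 480, fps: 30, effect: 'blur' };
  return { resolution: 360, fps: 24, effect: 'none' }; // disable on low-end devices
}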

Platform Examples

  • Zoom: Uses proprietary ML models, supports both blur and replacement
  • Google Meet: Uses MediaPipe Selfie Segmentation for background effects
  • Microsoft Teams: Offers background blur and custom backgrounds with edge refinement
  • Slack Huddles: Integrated background blur using modern segmentation models
