Voice Activity Detection (VAD)

Technology

Technology that detects the presence or absence of human speech.

What is Voice Activity Detection?

Voice Activity Detection (VAD) is a speech-processing technique that determines whether a person is speaking or whether the signal contains only silence or background noise. It acts as a smart switch that tells the rest of the audio pipeline when to process audio and when to stop.

Why is VAD Important in Video Calls?

  • Bandwidth Efficiency: In a VoIP call (for example, over WebRTC), there is no need to transmit full audio packets for silence. VAD lets the encoder stop sending audio frames, or send them far less often, when no one is talking (a mode often called discontinuous transmission, or DTX), which reduces bandwidth usage.
  • Noise Reduction: By accurately identifying non-speech segments, VAD lets noise suppression algorithms learn what the background noise sounds like during pauses and focus on cleaning up actual speech, rather than treating constant background hum as if it were speech.
  • Echo Suppression: VAD helps echo cancellers determine when the local user is speaking, so that they do not mistakenly cancel the user's own speech.
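
The bandwidth point can be illustrated with a minimal sketch in Python. It assumes per-frame VAD decisions are already available, and it uses a hypothetical rule: speech frames are always sent, and during silence only the first silent frame and then every `keepalive_every`-th one is sent as a background update. The function name and the interval are illustrative assumptions, not the behavior of any particular codec.

```python
def frames_to_send(vad_flags, keepalive_every=20):
    """Return the indices of frames a sender would transmit.

    Hypothetical policy: every speech frame is sent. During a run of
    silence, only the 1st silent frame and then every
    `keepalive_every`-th one after it is sent.
    """
    sent = []
    silent_run = 0  # number of consecutive silent frames seen so far
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            silent_run = 0
            sent.append(i)
        else:
            silent_run += 1
            if (silent_run - 1) % keepalive_every == 0:
                sent.append(i)
    return sent


# 50 frames of speech followed by 50 frames of silence
flags = [True] * 50 + [False] * 50
sent = frames_to_send(flags)
print(f"sent {len(sent)} of {len(flags)} frames")
```

In this toy input, roughly half of the frames are never transmitted. The actual savings in a real call depend on how much of the conversation is silence and on how the codec handles silent periods.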

How It Works

VAD algorithms analyze the input signal's energy levels and frequency spectrum. Simple VADs look for energy thresholds (is the sound loud enough?), while advanced VADs use machine learning to distinguish the specific spectral characteristics of human voice from other sounds like typing or traffic.
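
The simple energy-threshold approach can be sketched as follows. This is a minimal illustration, not a production detector: the frame length (160 samples, i.e. 20 ms at an assumed 8 kHz sample rate), the threshold value, and the function names are all assumptions chosen for the example.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples in [-1, 1]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(samples, frame_len=160, threshold=0.02):
    """Return one boolean per full frame: True if the frame is loud enough.

    Trailing samples that do not fill a whole frame are ignored.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        decisions.append(frame_rms(frame) >= threshold)
    return decisions


# Two near-silent frames followed by two frames of a 440 Hz tone at 8 kHz
quiet = [0.001] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(320)]
print(energy_vad(quiet + tone))  # [False, False, True, True]
```

A detector like this cannot tell speech from any other loud sound, such as typing or traffic, which is the gap that spectral and machine-learning approaches aim to close.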

VAD vs. Push-to-Talk

Push-to-talk is effectively a manual VAD: the user decides when audio is sent. Algorithmic VAD makes that decision automatically. However, an overly aggressive VAD can cut off the very beginning or end of a sentence (clipping), which is why tuning the "attack" time (how quickly the detector opens when speech starts) and the "release" time (how long it stays open after speech stops) is crucial for a natural conversation experience.
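
One way to picture attack and release is as a smoothing step applied to raw per-frame decisions. The sketch below is a hedged illustration: it assumes attack and release are expressed as counts of consecutive frames, and the function name and default values are hypothetical.

```python
def smooth_vad(raw, attack_frames=2, release_frames=3):
    """Smooth raw per-frame VAD decisions.

    The gate opens only after `attack_frames` consecutive speech frames
    and closes only after `release_frames` consecutive non-speech frames.
    """
    active = False
    run = 0  # consecutive frames that disagree with the current state
    out = []
    for is_speech in raw:
        if is_speech != active:
            run += 1
            needed = attack_frames if is_speech else release_frames
            if run >= needed:
                active = is_speech
                run = 0
        else:
            run = 0
        out.append(active)
    return out


T, F = True, False
raw = [F, T, F, T, T, T, F, T, F, F, F, F]
print(smooth_vad(raw))
# [False, False, False, False, True, True, True, True, True, True, False, False]
```

In this example the isolated blip at frame 1 is ignored and the one-frame dropout at frame 6 does not close the gate, but frame 3 (real speech) is lost to the attack delay. That is the clipping trade-off in miniature: a longer attack rejects more false triggers but clips more of the onset, and a longer release protects word endings at the cost of passing more trailing noise.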