
Live Captions

Feature

Real-time text display of spoken audio during video calls for accessibility and comprehension

What are Live Captions?

Live captions (also called real-time captions or live transcription) are text displays of spoken words that appear on-screen during a video call as people speak. Unlike pre-recorded captions added in post-production, live captions are generated in real-time using automatic speech recognition (ASR) technology, making conversations accessible to deaf and hard-of-hearing participants, as well as anyone who benefits from reading along.

In 2025, live captions have evolved from a basic accessibility feature to a core component of inclusive video communication. Modern implementations support multiple languages, real-time translation, speaker identification, and customizable display options—transforming how global teams collaborate across language barriers.

How Live Captions Work

Live captioning involves several technologies working together in real-time:

Automatic Speech Recognition (ASR)

ASR is the foundation of live captions. It uses machine learning models to convert spoken audio into text. Modern ASR engines process audio in small chunks, typically analyzing speech in windows of a few seconds, then outputting recognized words with minimal delay. The technology has advanced dramatically—contemporary systems can handle accents, technical jargon, and overlapping speech with increasing accuracy.
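The chunked processing described above can be sketched as follows. This is a minimal illustration, not a real recognizer: the `recognize` function is a placeholder for an actual speech model, and the window and stride sizes are illustrative assumptions.

```python
# Sketch of the chunked ("streaming") front end of a live-caption ASR
# pipeline: audio arrives as a sample buffer and is split into short,
# overlapping analysis windows, each handed to the recognizer.

from typing import Iterator, List

SAMPLE_RATE = 16_000   # samples per second (a common rate for ASR)
WINDOW_S = 2.0         # analyze ~2 seconds of audio at a time
STRIDE_S = 1.0         # advance 1 second per step, so windows overlap

def audio_windows(samples: List[float]) -> Iterator[List[float]]:
    """Yield overlapping analysis windows from a raw sample buffer."""
    win = int(WINDOW_S * SAMPLE_RATE)
    hop = int(STRIDE_S * SAMPLE_RATE)
    for start in range(0, max(len(samples) - win + 1, 1), hop):
        yield samples[start:start + win]

def recognize(window: List[float]) -> str:
    """Placeholder for a call to a real ASR model."""
    return f"<{len(window)} samples recognized>"

# Five seconds of audio yields four overlapping 2-second windows.
audio = [0.0] * (5 * SAMPLE_RATE)
captions = [recognize(w) for w in audio_windows(audio)]
```

The overlap between windows is one reason live captions sometimes revise themselves on screen: later windows give the model more context to correct earlier guesses.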

Speaker Diarization

In multi-party calls, the system identifies who is speaking and labels the captions accordingly. This "speaker diarization" helps participants follow conversations by showing "John: Let's discuss the timeline" rather than just "Let's discuss the timeline."
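One common way to produce those labels is to merge the diarizer's time-stamped speaker segments with the recognizer's time-stamped words. The sketch below assumes simplified tuple formats for both; real diarization APIs differ.

```python
# Minimal sketch of merging diarization output with ASR output to
# produce speaker-labeled captions. Data formats are assumptions,
# not any particular platform's API.

def label_captions(words, segments):
    """Attach a speaker label to each recognized word.

    words:    list of (start_time_s, text) from the recognizer
    segments: list of (start_s, end_s, speaker) from the diarizer
    """
    labeled = []
    for t, text in words:
        speaker = next(
            (s for a, b, s in segments if a <= t < b), "Unknown")
        labeled.append((speaker, text))
    return labeled

segments = [(0.0, 3.0, "John"), (3.0, 6.0, "Maria")]
words = [(0.5, "Let's"), (1.0, "discuss"), (1.5, "the"),
         (2.0, "timeline"), (3.5, "Sounds"), (4.0, "good")]
labeled = label_captions(words, segments)
# labeled[0] → ("John", "Let's"); labeled[4] → ("Maria", "Sounds")
```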

Language Processing

After initial speech recognition, language models refine the output for grammar, punctuation, and context. This post-processing improves readability by adding proper capitalization and punctuation that raw ASR output would miss.
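A toy version of this restoration step is shown below. Production systems use trained punctuation and truecasing models; the hand-written rules here are deliberately simplistic stand-ins that only illustrate the idea.

```python
# Sketch of ASR post-processing: raw recognizer output is typically
# lowercase and unpunctuated, and a restoration pass makes it readable.

def restore(raw: str) -> str:
    """Capitalize the sentence start and 'i', add a final period."""
    out = []
    capitalize_next = True
    for w in raw.split():
        if w == "i":
            w = "I"
        if capitalize_next:
            w = w[0].upper() + w[1:]
            capitalize_next = False
        out.append(w)
    text = " ".join(out)
    return text if text.endswith(".") else text + "."

print(restore("i think we should review the timeline"))
# → I think we should review the timeline.
```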

Real-Time Translation

Many platforms now offer translated captions, where speech in one language is transcribed and translated into another language in real-time. Some systems, like Google Meet's 2025 feature, even provide voice dubbing—a synthetic voice delivering the translation alongside the original speaker.
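Conceptually, translated captions chain two stages: transcribe, then translate. In the sketch below both stages are stand-ins, with a toy dictionary in place of a real machine-translation service; no actual platform works this simply.

```python
# Sketch of the translated-caption pipeline: each audio chunk is
# transcribed, then the text is translated before display. Both stage
# functions are placeholders for real ASR and translation services.

TOY_EN_ES = {"hello": "hola", "team": "equipo"}  # illustrative only

def transcribe(audio_chunk: bytes) -> str:
    return "hello team"          # placeholder ASR result

def translate(text: str, target: str = "es") -> str:
    return " ".join(TOY_EN_ES.get(w, w) for w in text.split())

def translated_caption(audio_chunk: bytes) -> str:
    return translate(transcribe(audio_chunk))

caption = translated_caption(b"\x00" * 320)
# → "hola equipo"
```

Because the two stages run back to back, translated captions add a little extra latency on top of plain transcription, which is why they usually lag the speaker by a beat.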

Who Benefits from Live Captions?

Deaf and Hard-of-Hearing Participants

The primary accessibility use case. Live captions provide a visual representation of the conversation, allowing deaf and hard-of-hearing individuals to fully participate in video calls. This is not just a convenience—it's often a legal requirement for accessible communication.

Non-Native Speakers

People communicating in their second (or third) language often find reading along helpful. Captions reinforce comprehension, especially for unfamiliar words, fast speakers, or strong accents. With translated captions, participants can follow meetings conducted in languages they don't speak at all.

Participants in Challenging Audio Environments

Someone in a noisy café, an open-plan office, or with unreliable audio equipment can follow the meeting through captions. This makes video calls viable in situations where audio-only would fail.

Anyone Who Retains Information Better by Reading

Studies show many people comprehend and retain information better when they both hear and see it. Live captions improve meeting engagement and reduce the cognitive load of following complex discussions.

Platform Support

Google Meet

Offers AI-powered live captions with customizable font size, text color, and background. In May 2025, Google introduced AI speech translation (beta) with real-time voice dubbing—starting with English-Spanish translation. Transcription and advanced features like "take notes for me" require higher-tier Workspace plans.

Microsoft Teams

Meeting organizers can enable live captions for scheduled meetings, webinars, and channel meetings. Captions are generated automatically using AI. Note: interpreted audio cannot be recorded, and transcripts are saved only in the original spoken language.

Zoom

Provides AI-powered live transcription and captioning, with support for multiple languages. Users can adjust caption size and position. Zoom also offers simultaneous interpretation channels for real-time translation.

Accuracy Considerations

While ASR technology has improved dramatically, it's not perfect. Understanding its limitations is important:

Current ASR Accuracy

Modern ASR systems achieve 90-95% accuracy under ideal conditions (clear audio, common vocabulary, major languages). However, accuracy drops with:

  • Background noise and poor microphone quality
  • Strong accents or uncommon dialects
  • Technical jargon, proper nouns, and industry-specific terminology
  • Fast speech or overlapping speakers
  • Less common languages with limited training data
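Accuracy figures like the 90-95% above are usually derived from word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system's output into the reference transcript, divided by the reference length. Accuracy is roughly 1 minus WER. A standard edit-distance implementation:

```python
# Word error rate (WER), the metric behind ASR accuracy figures,
# computed as Levenshtein distance over word sequences divided by
# the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

score = wer("let us discuss the project timeline",
            "let's discuss the project time line")
# ≈ 0.67 (edit distance 4 over 6 reference words)
```

Note that WER treats every error equally, so a mangled proper noun counts the same as a harmless contraction, even though the impact on comprehension is very different.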

Professional Alternatives: CART

For situations requiring 99%+ accuracy—legal proceedings, medical consultations, or formal accessibility accommodations—Communication Access Real-time Translation (CART) relies on trained human transcribers. Hybrid solutions combine ASR for a rough draft with human editors for correction.

Legal Requirements and Compliance

Organizations have legal obligations to provide accessible communication:

Americans with Disabilities Act (ADA)

The ADA requires "effective communication" with individuals with hearing disabilities, and recognized auxiliary aids include real-time captioning, closed captioning, and related technologies. The DOJ's April 2024 final rule under Title II establishes WCAG 2.1 Level AA as the standard for accessible digital content, with compliance deadlines in 2026-2027.

WCAG Guidelines

The Web Content Accessibility Guidelines specify requirements for captions:

  • WCAG 1.2.2 (Level A): Pre-recorded video must include accurate, synchronized captions
  • WCAG 1.2.4 (Level AA): Live audio in synchronized media must include captions

Important: Auto-generated captions that haven't been reviewed and edited typically don't meet WCAG standards. Organizations should verify accuracy for compliance.

Quality Standards

Best practice for accessible captions requires:

  • Accurate: 99%+ accuracy for effective communication
  • Synchronized: Captions appear at approximately the same time as audio
  • Equivalent: Content includes speaker identification and relevant sound effects
  • Accessible: Readily available to those who need them

Customization Options

Modern platforms offer extensive customization to meet individual needs:

  • Font size: Adjustable text size for visibility
  • Text and background colors: High contrast options for readability
  • Position: Choose where captions appear on screen
  • Caption style: Open (always visible) vs. closed (toggleable) captions
  • Language: Select caption language independent of spoken language
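These options map naturally onto a per-user settings object, sketched below. The field names and defaults are assumptions for illustration, not any platform's actual API.

```python
# Sketch of a caption-display settings object covering the common
# customization options: size, colors, position, style, and language.

from dataclasses import dataclass

@dataclass
class CaptionSettings:
    font_size_pt: int = 16
    text_color: str = "#FFFFFF"
    background_color: str = "#000000CC"  # high-contrast, semi-opaque
    position: str = "bottom"             # "bottom", "top", or "floating"
    closed: bool = True                  # toggleable (closed) vs. always-on
    language: str = "en"                 # independent of spoken language

# A user who wants large Spanish captions overrides only two fields.
settings = CaptionSettings(font_size_pt=24, language="es")
```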

The Future of Live Captions

Live caption technology continues to advance:

  • AI-powered accuracy improvement: Language models trained on larger datasets with better context understanding
  • Universal real-time translation: Seamless multi-language meetings where everyone hears or reads in their preferred language
  • Personalized vocabulary: Systems that learn your company's jargon and frequently-used names
  • Emotional context: Captions that convey tone and emphasis, not just words
  • AR integration: Captions overlaid in augmented reality for in-person meetings

As remote and hybrid work becomes permanent, live captions are evolving from an accessibility accommodation to a universal feature that improves communication for everyone.
