Product overview

Real-Time Speech-To-Text

Agora Real-Time Speech-To-Text (STT) enables you to transcribe the voice stream of each host to provide live closed captions (CC) and transcription for improved accessibility. Using its advanced features, you can also remove silent audio segments to optimize transcription performance and reduce costs. The output text can be further processed as input for large language models, such as GPT. In this way, Real-Time STT connects real-time engagement applications to AI workflows.

Product Features

Live transcription for RTC

Integrated with Agora’s voice and video service, live transcription and captions improve accessibility for your audience. Perfect for meetings, live streaming, lectures, interviews, live shopping, and more.

Cloud-based STT

A cloud-based service converts the voice of the active hosts, or of specific hosts you select, to text, then distributes the text to all participants in the channel for further processing. Transcription does not depend on the performance of the client device or network.

Speaker labeling

Label each transcribed text with the speaker's UID. Separate transcription of each host ensures accuracy even when multiple hosts are talking simultaneously.

Caption recording

Upload the transcriptions as .vtt files to cloud storage, then play back audio or video recordings with closed captions (CC). The timestamps in the .vtt file keep the text synchronized with the audio or video, so each caption appears at the point in the recording where it was spoken.
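
To illustrate how timestamps in a .vtt file tie captions to a recording, here is a minimal sketch that writes transcript segments in the standard WebVTT format, using a voice span to carry the speaker's UID. The `Segment` type, its field names, and the sample values are assumptions made for illustration; this is not Agora's API or the exact layout of the files the service uploads.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment; field names are illustrative only."""
    uid: int        # speaker UID
    start_ms: int   # cue start, milliseconds from the start of the recording
    end_ms: int     # cue end, milliseconds from the start of the recording
    text: str


def format_timestamp(ms: int) -> str:
    """Render milliseconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


def escape_cue_text(text: str) -> str:
    """Escape the characters that have special meaning in WebVTT cue text."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def to_webvtt(segments: list[Segment]) -> str:
    """Build a WebVTT document with one cue per segment, ordered by start time."""
    lines = ["WEBVTT", ""]
    for seg in sorted(segments, key=lambda s: s.start_ms):
        lines.append(f"{format_timestamp(seg.start_ms)} --> {format_timestamp(seg.end_ms)}")
        # A voice span (<v ...>) labels the cue with its speaker.
        lines.append(f"<v {seg.uid}>{escape_cue_text(seg.text)}")
        lines.append("")  # a blank line terminates the cue
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Segment(uid=1002, start_ms=2600, end_ms=4000, text="Thanks for joining."),
        Segment(uid=1001, start_ms=1000, end_ms=2500, text="Hello & welcome"),
    ]
    print(to_webvtt(demo))
```

A player that loads the resulting file alongside the recording shows each cue only between its start and end timestamps, which is what keeps the captions aligned during playback.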

Multi-language support

Real-time transcription supports all major languages and dialects. Each channel can transcribe audio to text in up to two languages simultaneously.

Enterprise-grade security and compliance

Agora is ISO and SOC 2 certified and meets compliance standards for regional privacy laws and industry regulations, including GDPR, CCPA, and HIPAA. Live captions and transcription can be encrypted the same way as the RTC audio or video.
