Quickstart
Real-Time Transcription takes the audio content of a host's media stream and transcribes it into written words in real time. This page shows you how to start and stop Real-Time Transcription in your app through a business server, and how to display the transcribed text in your app.
Understand the tech
To start transcribing the audio in a channel in real time, you send an HTTP request to the Agora SD-RTN™ through your business server. Real-Time Transcription provides the following modes:
- Transcribe speech in real time, then stream the text data to the channel.
- Transcribe speech in real time, store the text in the WebVTT format, and upload the file to third-party cloud storage.
Real-Time Transcription transcribes at most three speakers in a channel. When there are more than three speakers, the top three are selected based on volume, and their audio is transcribed.
The following figure shows the workflow to start, query, and stop a Real-Time Transcription task:
To use the RESTful API to transcribe speech, make the following calls:
- `acquire`: Request a `builderToken` that authenticates the user and grants permission to start Real-Time Transcription. You must call `start` using this `builderToken` within five minutes.
- `start`: Begin the transcription task. Once you start a task, the `builderToken` remains valid for the entire session. Use the same `builderToken` to query and stop the task.
- `query`: Check the task status.
- `stop`: Stop the transcription task.
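On a business server, these calls are plain HTTPS requests. The following sketch shows how the `acquire` and `start` requests could be assembled in Python. The endpoint paths, body fields, and credential names are illustrative assumptions; consult the REST API reference for the authoritative schema. The requests are assembled here but not sent.

```python
import base64
import json

# Illustrative placeholders -- substitute your real values from Agora Console.
APP_ID = "your_app_id"
API_KEY = "your_api_key"        # customer ID
API_SECRET = "your_api_secret"  # customer secret
BASE_URL = f"https://api.agora.io/v1/projects/{APP_ID}/rtsc/speech-to-text"

def auth_header() -> dict:
    """HTTP Basic authentication header built from the key/secret pair."""
    token = base64.b64encode(f"{API_KEY}:{API_SECRET}".encode()).decode()
    return {"Authorization": f"Basic {token}", "Content-Type": "application/json"}

def acquire_request(instance_id: str) -> tuple[str, bytes]:
    """URL and JSON body for the acquire call; the response carries a builderToken."""
    return f"{BASE_URL}/builderTokens", json.dumps({"instanceId": instance_id}).encode()

def start_request(builder_token: str, channel: str, uid: str) -> tuple[str, bytes]:
    """URL and a minimal JSON body for the start call; the builderToken
    is passed in the query string, as are those of query and stop."""
    body = {
        "audio": {
            "subscribeSource": "AGORARTC",
            "agoraRtcConfig": {"channelName": channel, "uid": uid},
        },
        "config": {"recognizeConfig": {"language": "en-US"}},
    }
    return f"{BASE_URL}/tasks?builderToken={builder_token}", json.dumps(body).encode()
```

The same `builder_token` value is reused for the `query` and `stop` calls for the lifetime of the session.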
Prerequisites
To set up Real-Time Transcription in your app, you must have:
- Enabled Real-Time Transcription for your project: contact sales@agora.io.
- Activated a supported cloud storage service to record and store Real-Time Transcription videos and texts.
- Installed the Protobuf package to generate code classes for displaying transcription text.
- Installed the following to run the post-processing script:
  - Python 3.0
  - `ffmpeg` and `ffplay`
Implement a business server
You create a business server as a bridge between your app and Agora Real-Time Transcription. Implementing a business server to manage Real-Time Transcription provides the following benefits:
- Improved security, as your `apiKey`, `apiSecret`, `builderToken`, and `taskId` are not exposed to the client.
- Token processing is handled securely on the business server.
- Avoid assembling complex request body strings on the client side, reducing the probability of errors.
- Implement additional functionality on the business server. For example, billing for Real-Time Transcription use, or checking a user's privileges and payment status.
- If the REST API is updated, you do not need to update the client.
Agora Real-Time Transcription supports only integer `uid`s:
- When you join a channel in your app, use an integer value for your UID. For example, `7`.
- When you start a Real-Time Transcription session, set `uid` to the same integer UID enclosed in quotation marks. For example, `"7"`.
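The integer-versus-string distinction can be enforced mechanically on the business server. A minimal sketch, where the helper name and the validation logic are illustrative assumptions:

```python
import json

def stt_uid(rtc_uid: int) -> str:
    """Convert the integer RTC UID into the quoted string form
    that the Real-Time Transcription start request expects."""
    if not isinstance(rtc_uid, int) or isinstance(rtc_uid, bool):
        raise TypeError("Real-Time Transcription supports only integer UIDs")
    return str(rtc_uid)

rtc_uid = 7                                 # the integer UID used to join the channel
body_fragment = {"uid": stt_uid(rtc_uid)}   # the quoted form used in the start body
print(json.dumps(body_fragment))            # {"uid": "7"}
```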
To obtain sample code for your business server, see the following:
- Real-Time Transcription business server demo: an example business server that follows the workflow described in this document.
- Postman Collection: API reference and code examples for the language you want to program in.
Use Google Protobuf Generator to parse text data
Google Protocol buffers are an extensible and language-neutral mechanism for serializing transcription data. Protobuf enables you to generate source code in multiple languages, based on a specified structure. For more information about Google protocol buffers, see protobuf.dev.
Agora provides the following Protobuf template for parsing Real-Time Transcription data:
To read and display the Real-Time Transcription text in your client:
- Copy the Protobuf template to a local `.proto` file.
- In your file, edit the following properties to match your project:
  - `package`: The source code package namespace.
  - `option`: The desired language options.
- Generate a Protobuf class. Run the `protoc` protocol compiler on your `.proto` file to generate the code that you need to work with the defined message types. The `protoc` compiler is invoked as follows:

  Agora also provides Protobuf sample code to parse and display transcription text. To obtain the sample code, contact sales@agora.io.
- Use the Protobuf class to read transcription text. When transcription text is available, your app receives the `onStreamMessage` callback. Use the generated Protobuf class in your app to read the byte data returned by the callback. Refer to the API reference for callback details.
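The `protoc` invocation in the steps above could look like the following, assuming the template is saved as `SttMessage.proto` and a Java target; the file name, output directory, and target language are illustrative assumptions:

```shell
# Generate a Java class from the edited template. Swap --java_out for
# --python_out, --kotlin_out, and so on, to match the `option` set in the file.
protoc --proto_path=. --java_out=./src/main/java SttMessage.proto
```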
Synchronize transcription files with the cloud recording
The `m3u8+vtt` file generated by Real-Time Transcription and the `m3u8+ts` file generated by Cloud Recording are two independent files. The timestamp references in these media files are different and not synchronized: the cloud recording timestamp starts at `0`, while the `m3u8+vtt` file uses the system timestamp. If either process starts abnormally, the media files generated by the two services may be out of sync during playback.
Post-processing ensures synchronization of subtitles and recorded audio. It enables you to associate the `m3u8+ts` file generated by cloud recording with the `m3u8+vtt` file generated by Real-Time Transcription.
Agora provides a post-processing script that enables you to synchronize the two files.
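Conceptually, the post-processing re-bases the subtitle timestamps onto the recording's zero-based timeline. As a simplified illustration of that idea (not the actual script Agora ships), the following sketch shifts every WebVTT cue time by a fixed offset:

```python
import re

def shift_vtt(vtt_text: str, offset_seconds: float) -> str:
    """Shift every HH:MM:SS.mmm timestamp in a WebVTT document by offset_seconds,
    clamping at zero, so system-clock cue times line up with a 0-based recording."""
    def to_seconds(ts: str) -> float:
        h, m, s = ts.split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    def to_ts(seconds: float) -> str:
        seconds = max(seconds, 0.0)
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    return re.sub(
        r"\d{2}:\d{2}:\d{2}\.\d{3}",
        lambda m: to_ts(to_seconds(m.group(0)) + offset_seconds),
        vtt_text,
    )

cue = "00:00:05.000 --> 00:00:07.500\nHello world"
print(shift_vtt(cue, -2.0))  # 00:00:03.000 --> 00:00:05.500
```

The real script also handles the `m3u8` playlists and abnormal start cases; this sketch only shows the timestamp re-basing.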
Run the post-processing script
To synchronize files generated by Real-Time Transcription, take the following steps:
- Unzip the post-processing script to a local folder.
- Run the script on your Real-Time Transcription files. If `ffmpeg`/`ffprobe` are not in your `PATH`, use `--ffmpeg_path` to specify the path.
- Play the synchronized files:
  - Start the HTTP server by running the following command:
  - In your browser, enter the following URL:
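Put together, a run of the steps above might look like the following shell session. The archive, script, and file names here are illustrative assumptions, not the actual names shipped in the archive:

```shell
# 1. Unzip the post-processing script (hypothetical archive name).
unzip rtt_postprocess.zip -d rtt_postprocess && cd rtt_postprocess

# 2. Synchronize the cloud-recording m3u8+ts with the transcription m3u8+vtt.
#    Pass --ffmpeg_path only if ffmpeg/ffprobe are not in your PATH.
python3 sync.py recording.m3u8 captions.m3u8 --ffmpeg_path /usr/local/bin

# 3. Serve the output directory over HTTP and open the player in a browser,
#    for example at http://localhost:8080/
python3 -m http.server 8080
```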
Reference
This section contains supplementary information that completes the content on this page, or points you to documentation that explains other aspects of this product.
REST API
Refer to the Real-Time Transcription REST API documentation for parameter details.
SDK API
- Android: `onStreamMessage`
- Electron: `onStreamMessage`
- Flutter: `onStreamMessage`
- iOS: `receiveStreamMessageFromUid`
- macOS: `receiveStreamMessageFromUid`
- Unity: `onStreamMessage`
- Windows: `onStreamMessage`
List of supported languages
Use the following language codes in the `recognizeConfig.language` parameter of the `start` request. The current version supports at most two languages, separated by commas. Languages marked with * do not support LID (language identification).
| Language | Code |
|---|---|
| Chinese (Cantonese, Traditional) | zh-HK |
| Chinese (Mandarin, Simplified) | zh-CN |
| Chinese (Taiwanese Putonghua) | zh-TW |
| English (India) | en-IN |
| English (US) | en-US |
| French (France) | fr-FR |
| German (Germany) | de-DE |
| Thai (Thailand) | th-TH |
| Hindi (India) | hi-IN |
| Indonesian (Indonesia) | id-ID |
| Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP |
| Korean (South Korea) | ko-KR |
| Malay (Malaysia) | ms-MY* |
| Persian (Iran) | fa-IR* |
| Portuguese (Portugal) | pt-PT |
| Russian (Russia) | ru-RU |
| Spanish (Spain) | es-ES |
| Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN |
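Since `recognizeConfig.language` accepts at most two comma-separated codes from the table above, a business server can validate the value before calling `start`. A minimal sketch; the helper and its validation logic are an assumption for illustration, not an Agora API, though the code set is copied from the table:

```python
# Supported language codes, taken from the table above.
SUPPORTED_CODES = {
    "zh-HK", "zh-CN", "zh-TW", "en-IN", "en-US", "fr-FR", "de-DE",
    "th-TH", "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "ms-MY",
    "fa-IR", "pt-PT", "ru-RU", "es-ES", "tr-TR", "vi-VN",
}

def validate_language(value: str) -> list[str]:
    """Check a recognizeConfig.language value: one or two supported,
    comma-separated codes. Returns the parsed code list."""
    codes = [c.strip() for c in value.split(",") if c.strip()]
    if not 1 <= len(codes) <= 2:
        raise ValueError("language must contain one or two codes")
    for code in codes:
        if code not in SUPPORTED_CODES:
            raise ValueError(f"unsupported language code: {code}")
    return codes

print(validate_language("en-US,es-ES"))  # ['en-US', 'es-ES']
```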
Supported third-party cloud storage services
The following third-party cloud storage service providers are supported: