Quickstart

Real-Time Transcription takes the audio content of a host's media stream and transcribes it into written words in real time. This page shows you how to start and stop Real-Time Transcription through a business server, then display the transcribed text in your app.

Understand the tech

To start transcribing the audio in a channel in real time, you send an HTTP request to the Agora SD-RTN™ through your business server. Real-Time Transcription provides the following modes:

  • Transcribe speech in real time, then stream the text data to the channel.
  • Transcribe speech in real time, store the text in the WebVTT format, and upload the file to third-party cloud storage.

Real-Time Transcription transcribes at most three speakers in a channel. When there are more than three speakers, the top three are selected based on volume, and their audio is transcribed.
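The selection rule above can be pictured with a minimal sketch. It assumes each speaker has a single numeric volume value; how the service actually measures volume is not described here, and the function name and sample values are hypothetical.

```python
from typing import Dict, List

MAX_SPEAKERS = 3  # limit stated in the paragraph above


def select_speakers(volumes: Dict[int, float], limit: int = MAX_SPEAKERS) -> List[int]:
    """Return the UIDs of the `limit` loudest speakers, loudest first."""
    # Sort by volume, descending; break ties by UID so the result is deterministic.
    ranked = sorted(volumes.items(), key=lambda item: (-item[1], item[0]))
    return [uid for uid, _ in ranked[:limit]]


# Hypothetical volume readings keyed by UID.
example = {7: 0.42, 12: 0.91, 15: 0.10, 23: 0.65}
print(select_speakers(example))  # [12, 23, 7]
```

With four speakers in the channel, the quietest one (UID 15 in this example) is not transcribed.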

The following figure shows the workflow to start, query, and stop a Real-Time Transcription task:

[Figure: Real-Time Transcription business server]

To transcribe speech using the RESTful API, make the following calls in order:

  1. acquire: Request a builderToken that authenticates the user and gives permission to start Real-Time Transcription. You must call start using this builderToken within five minutes.
  2. start: Begin the transcription task. Once you start a task, builderToken remains valid for the entire session. Use the same builderToken to query and stop the task.
  3. query: Check the task status.
  4. stop: Stop the transcription task.
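The ordering and timing rules in the steps above can be tracked on the business server with a small piece of bookkeeping. The following is a minimal sketch, not part of any Agora SDK: the class and method names are hypothetical, and it sends no HTTP requests. It only records when a builderToken was acquired and checks whether start, query, or stop is currently allowed.

```python
import time
from typing import Callable, Optional

START_WINDOW_SECONDS = 5 * 60  # start must follow acquire within five minutes


class TranscriptionTaskState:
    """Local bookkeeping for the acquire -> start -> query -> stop order."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self.builder_token: Optional[str] = None
        self._acquired_at: Optional[float] = None
        self.status = "idle"

    def on_acquired(self, builder_token: str) -> None:
        # Remember the token returned by acquire and when it was received.
        self.builder_token = builder_token
        self._acquired_at = self._clock()
        self.status = "acquired"

    def can_start(self) -> bool:
        # start is allowed only with a token acquired less than five minutes ago.
        if self.status != "acquired" or self._acquired_at is None:
            return False
        return self._clock() - self._acquired_at < START_WINDOW_SECONDS

    def on_started(self) -> None:
        if not self.can_start():
            raise RuntimeError("acquire a fresh builderToken before calling start")
        self.status = "started"

    def can_query_or_stop(self) -> bool:
        # Once the task has started, the same token is used to query and stop it.
        return self.status == "started"

    def on_stopped(self) -> None:
        if not self.can_query_or_stop():
            raise RuntimeError("no running task to stop")
        self.status = "stopped"
```

Injecting the clock keeps the five-minute rule easy to test without waiting; in production the default monotonic clock is used.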

Prerequisites

To set up Real-Time Transcription in your app, you must have:

  • Enabled Real-Time Transcription for your project:

    Contact sales@agora.io.

  • Activated a supported cloud storage service to record and store Real-Time Transcription videos and texts

  • Installed the Protobuf package to generate code classes for displaying transcription text.

  • To run the post-processing script, install:

    • Python 3.0
    • ffmpeg and ffplay

Implement a business server

You create a business server as a bridge between your app and Agora Real-Time Transcription. Implementing a business server to manage Real-Time Transcription provides the following benefits:

  • Security improves, because your apiKey, apiSecret, builderToken, and taskId are not exposed to the client.
  • Token processing is handled securely on the business server.
  • The client does not have to splice complex request body strings, which reduces the probability of errors.
  • You can implement additional functionality on the business server, such as billing for Real-Time Transcription use or checking a user's privileges and payment status.
  • If the REST API is updated, you do not need to update the client.

Agora Real-Time Transcription supports only integer uids.

  • When you join a channel in your app, use an integer value for your UID. For example, 7.
  • When you start a Real-Time Transcription session, set uid to the same integer UID enclosed in quotation marks. For example, "7".
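The UID rule above amounts to a type conversion on the business server. The following minimal sketch uses a hypothetical helper name, and the one-field JSON object is only an illustration; the full start request body is described in the REST API documentation and is not reproduced here.

```python
import json


def uid_for_start_request(channel_uid: int) -> str:
    """Return the integer channel UID in the string form used by start."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(channel_uid, bool) or not isinstance(channel_uid, int):
        raise TypeError("Real-Time Transcription supports only integer UIDs")
    return str(channel_uid)


# The app joins the channel with the integer 7; the request carries "7".
print(json.dumps({"uid": uid_for_start_request(7)}))  # {"uid": "7"}
```

Validating the type in one place catches string user IDs before they reach the transcription request.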

To obtain sample code for your business server, see the:

Use Google Protobuf Generator to parse text data

Protocol Buffers (Protobuf) are Google's extensible, language-neutral mechanism for serializing structured data; Real-Time Transcription uses them to serialize transcription data. Protobuf enables you to generate source code in multiple languages from a specified message structure. For more information about Protocol Buffers, see protobuf.dev.

Agora provides the following Protobuf template for parsing Real-Time Transcription data:


syntax = "proto3";

package agora.audio2text;
option java_package = "io.agora.rtc.audio2text";
option java_outer_classname = "Audio2TextProtobuffer";

message Text {
  int32 vendor = 1;
  int32 version = 2;
  int32 seqnum = 3;
  int32 uid = 4;
  int32 flag = 5;
  int64 time = 6;
  int32 lang = 7;
  int32 starttime = 8;
  int32 offtime = 9;
  repeated Word words = 10;
}
message Word {
  string text = 1;
  int32 start_ms = 2;
  int32 duration_ms = 3;
  bool is_final = 4;
  double confidence = 5;
}

To read and display the Real-Time Transcription text in your client:

  1. Copy the Protobuf template to a local .proto file.

  2. In your file, edit the following properties to match your project:

    • package: The source code package namespace.
    • option: The desired language options.
  3. Generate a Protobuf class.

    You run the protoc protocol compiler on your .proto file to generate the code that you need to work with the defined message types. The protoc compiler is invoked as follows:


    protoc --proto_path=IMPORT_PATH --cpp_out=DST_DIR --java_out=DST_DIR --python_out=DST_DIR --go_out=DST_DIR --ruby_out=DST_DIR --objc_out=DST_DIR --csharp_out=DST_DIR path/to/file.proto

    Agora also provides Protobuf sample code to parse and display transcription text. To obtain the sample code, contact sales@agora.io.

  4. Use the Protobuf class to read transcription text.

    When transcription text is available, your app receives the onStreamMessage callback. You use the generated Protobuf class in your app to read the byte data returned by the callback. Refer to the API reference for callback details.
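Once a message has been parsed, turning it into a caption is mostly a matter of joining the words. The following is a minimal sketch under stated assumptions: the two dataclasses are stand-ins that mirror a few fields of the template so that the example runs without generated code, and joining words with a space and marking non-final text with "..." are illustrative choices, not documented behavior. In a real client the object would come from the class generated by protoc.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    # Stand-in for the generated Word class; field names follow the template.
    text: str = ""
    start_ms: int = 0
    duration_ms: int = 0
    is_final: bool = False
    confidence: float = 0.0


@dataclass
class Text:
    # Stand-in for the generated Text class; only the fields used here are kept.
    uid: int = 0
    words: List[Word] = field(default_factory=list)


def render_caption(message: Text) -> str:
    """Format one message as 'uid: text', marking text that is not final yet."""
    caption = " ".join(word.text for word in message.words if word.text)
    # Treat the caption as final only when every word in it is final.
    is_final = bool(message.words) and all(w.is_final for w in message.words)
    suffix = "" if is_final else " ..."
    return f"{message.uid}: {caption}{suffix}"


final = Text(uid=7, words=[Word("hello", is_final=True), Word("world", is_final=True)])
interim = Text(uid=7, words=[Word("hello")])
print(render_caption(final))    # 7: hello world
print(render_caption(interim))  # 7: hello ...
```

Keeping the rendering separate from the parsing makes it easy to swap the stand-in classes for the generated ones.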

Synchronize transcription files with the cloud recording

The m3u8+vtt file generated by Real-Time Transcription, and the m3u8+ts file generated by Cloud Recording are two independent files. The time stamp references in these media files are different, and not synchronized. The cloud recording time stamp starts at 0, while the m3u8+vtt uses the system time stamp. If either process starts abnormally, the media files generated by the two services may be out of sync during playback.

Post-processing ensures synchronization of subtitles and recorded audio. It enables you to associate the m3u8+ts file generated by cloud recording with the m3u8+vtt file generated by Real-Time Transcription.

Agora provides a post-processing script that enables you to synchronize the two files.
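To illustrate why the two time bases need reconciling, the following minimal sketch shifts system-time cue boundaries so that the start of the recording becomes 0. It is not the algorithm used by the Agora post-processing script; the function names are hypothetical, and it assumes that cue times and the recording start time are both available as milliseconds on the same system clock.

```python
from typing import Tuple


def format_vtt_timestamp(milliseconds: int) -> str:
    """Format a non-negative offset in milliseconds as HH:MM:SS.mmm."""
    if milliseconds < 0:
        raise ValueError("timestamp must not be negative")
    seconds, ms = divmod(milliseconds, 1000)
    minutes, s = divmod(seconds, 60)
    hours, m = divmod(minutes, 60)
    return f"{hours:02d}:{m:02d}:{s:02d}.{ms:03d}"


def rebase_cue(cue_start_ms: int, cue_end_ms: int, recording_start_ms: int) -> Tuple[str, str]:
    """Shift system-time cue boundaries so that the recording start becomes 0."""
    # Clamp to 0 so that cues beginning before the recording do not go negative.
    start = max(cue_start_ms - recording_start_ms, 0)
    end = max(cue_end_ms - recording_start_ms, 0)
    return format_vtt_timestamp(start), format_vtt_timestamp(end)


# Hypothetical values: the recording starts at system time 1_700_000_000_000 ms.
print(rebase_cue(1_700_000_003_500, 1_700_000_005_250, 1_700_000_000_000))
# ('00:00:03.500', '00:00:05.250')
```

A cue spoken 3.5 seconds into the recording then lines up with the 3.5 second mark of the recorded audio.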

Run the post-processing script

To synchronize files generated by Real-Time Transcription, take the following steps:

  1. Unzip the post-processing script to a local folder.

  2. Run the script on your Real-Time Transcription files:


    python3 insert_subtitle.py --av audio_dir/audio_ts.m3u8 --subtitle subtitle_dir/subtitle.m3u8 --output output_dir/ --overwrite

    If ffmpeg and ffprobe are not in your PATH, use --ffmpeg_path to specify the path.

  3. Play the synchronized files:

    1. Start the HTTP server by running the following command:


      python3 -m http.server --bind 127.0.0.1 -d output_dir

    2. In your browser, enter the following URL:


      http://127.0.0.1:8000/player_demo.html

Reference

This section provides additional information that complements this page and points you to documentation that explains other aspects of this product.

REST API

Refer to the Real-Time Transcription REST API documentation for parameter details.

SDK API

List of supported languages

Use the following language codes in the recognizeConfig.language parameter of the start request. The current version supports at most two languages, separated by commas.

Language                            Code
Chinese (Cantonese, Traditional)    zh-HK
Chinese (Mandarin, Simplified)      zh-CN
Chinese (Taiwanese Putonghua)       zh-TW
English (India)                     en-IN
English (US)                        en-US
French (France)                     fr-FR
German (Germany)                    de-DE
Hindi (India)                       hi-IN
Indonesian (Indonesia)              id-ID
Italian (Italy)                     it-IT
Japanese (Japan)                    ja-JP
Korean (South Korea)                ko-KR
Portuguese (Portugal)               pt-PT
Spanish (Spain)                     es-ES
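A business server can check the language selection before it sends the start request. The following minimal sketch uses a hypothetical helper name; the set of codes is copied from the table above, and the limit of two comma-separated languages is the one stated for the current version.

```python
from typing import Iterable

# Language codes copied from the table above.
SUPPORTED_LANGUAGES = {
    "zh-HK", "zh-CN", "zh-TW", "en-IN", "en-US", "fr-FR", "de-DE",
    "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "pt-PT", "es-ES",
}
MAX_LANGUAGES = 2  # the current version supports at most two languages


def build_language_param(codes: Iterable[str]) -> str:
    """Validate the codes and join them with commas for recognizeConfig.language."""
    selected = list(codes)
    if not 1 <= len(selected) <= MAX_LANGUAGES:
        raise ValueError(f"specify between 1 and {MAX_LANGUAGES} languages")
    unsupported = [code for code in selected if code not in SUPPORTED_LANGUAGES]
    if unsupported:
        raise ValueError(f"unsupported language codes: {unsupported}")
    return ",".join(selected)


print(build_language_param(["en-US", "ja-JP"]))  # en-US,ja-JP
```

Rejecting bad input on the server gives the client a clear error instead of a failed transcription task.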

Supported third-party cloud storage services

The following third-party cloud storage service providers are supported: