The right parameter configuration can make a significant difference in transcription quality and latency for realtime use cases. This guide covers recommended starting points for common scenarios and highlights pitfalls that frequently trip up new integrations.
These recommendations apply to the Realtime API and are passed during session initialization. They are starting points — tune them to match your specific needs.
## Language Configuration
One of the most common configuration mistakes is misunderstanding how `language_config` works. Choosing the right setup avoids unnecessary detection overhead and improves accuracy.
When to set an explicit language:
- You know the language of the audio ahead of time.
- The audio is monolingual (single language throughout).
- You want the fastest, most accurate results.
```json
{
  "language_config": {
    "languages": ["en"],
    "code_switching": false
  }
}
```
When to use auto-detection:
- You process audio in many different languages and don’t know which one beforehand.
- You want Gladia to pick the language automatically.
```json
{
  "language_config": {
    "languages": [],
    "code_switching": false
  }
}
```
When `code_switching` is false and no language is set, the language is detected on the first utterance and reused for the rest of the session or file. If the beginning of your audio contains silence, music, or a different language than the main content, this can lead to incorrect detection for the whole transcription.
Even when using auto-detection, pass a small list of likely languages in `languages` to constrain the search. This improves both accuracy and processing time.
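For example, a constrained auto-detection setup for audio that could be in any of a few known languages might look like the following sketch (the specific language list is illustrative — use the languages you actually expect):

```json
{
  "language_config": {
    "languages": ["en", "fr", "de"],
    "code_switching": false
  }
}
```

With this configuration, detection runs once on the first utterance but only chooses among the three listed languages.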
## Code Switching
Code switching (`language_config.code_switching: true`) lets Gladia detect and transcribe multiple languages within the same audio, re-evaluating the language on each utterance.
When to enable it:
- Speakers switch languages mid-conversation (e.g. bilingual meetings, multilingual customer support).
- You need the detected language returned per utterance.
When NOT to enable it:
- The audio is in a single language — code switching adds unnecessary processing and can introduce misdetections.
- You’ve set exactly one language in `languages` — in that case `code_switching` is ignored anyway.
```json
{
  "language_config": {
    "languages": ["en", "fr", "es"],
    "code_switching": true
  }
}
```
Do not enable `code_switching` with an empty `languages` list. When no languages are specified, the language detector evaluates every utterance against 100+ supported languages, which leads to frequent misdetections — especially between similar-sounding languages. Always provide a short list of languages you actually expect in the audio.
## Custom Vocabulary
Custom vocabulary is a post-transcription replacement based on phoneme similarity. It’s essential for domain-specific terms that speech models frequently mis-transcribe.
Best practices:
- Always provide both the `custom_vocabulary` flag and a `custom_vocabulary_config`.
- Add pronunciations for words that can be said in different ways (accents, foreign speakers). This is more reliable than raising `intensity`.
- Keep `intensity` moderate (0.4-0.6). High values increase false positives where unrelated words get replaced.
- Set `language` on individual vocabulary entries when your audio is multilingual and a term is pronounced differently depending on the language.
For simple terms that are already close to their phonetic spelling (e.g. brand names), you can pass them as plain strings instead of objects — Gladia will use the default intensity.
```json
{
  "audio_url": "YOUR_AUDIO_URL",
  "custom_vocabulary": true,
  "custom_vocabulary_config": {
    "vocabulary": [
      "Kubernetes",
      {
        "value": "Gladia",
        "pronunciations": ["Gladya", "Gladiah"],
        "intensity": 0.5
      },
      {
        "value": "PostgreSQL",
        "pronunciations": ["Postgres Q L", "Post gress"],
        "intensity": 0.4
      }
    ],
    "default_intensity": 0.5
  }
}
```
## Voice Agents
For callbots, customer-service assistants, or voice-driven chatbots, the top priority is low latency. The agent must react quickly to user speech, even if sentence boundaries are not perfectly formed.
| Parameter | Recommended value | Why |
|---|---|---|
| `endpointing` | 0.05 - 0.1 | Closes utterances fast, keeping turn-taking snappy. See Endpointing. |
| `maximum_duration_without_endpointing` | 15 | Prevents very long utterances from staying open without cutting off the conversation. |
| `messages_config.receive_partial_transcripts` | true | Enables interim results so the agent can start processing early. Use the speech_stop event to know when the user has finished speaking. See Partial transcripts. |
| `realtime_processing.custom_vocabulary` | true | Add product names and action keywords so the agent can react accurately. |
This setup is optimized for fast turn-taking. If utterances get cut off mid-sentence, raise `endpointing` slightly.
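Putting the table together, a session initialization payload for a voice agent might look like the following sketch. The vocabulary entries are placeholders for your own product names and action keywords, and the exact nesting of `custom_vocabulary_config` under `realtime_processing` is an assumption — check the API reference for your account's session schema:

```json
{
  "endpointing": 0.05,
  "maximum_duration_without_endpointing": 15,
  "messages_config": {
    "receive_partial_transcripts": true
  },
  "realtime_processing": {
    "custom_vocabulary": true,
    "custom_vocabulary_config": {
      "vocabulary": ["YourProduct", "cancel order", "transfer to an agent"]
    }
  }
}
```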
## Meeting Recorders
For apps that record and transcribe meetings in real time — team stand-ups, board sessions, 1-on-1s — the goal is to produce a structured, speaker-attributed live transcript that can feed downstream features like summarization or live note-taking.
| Parameter | Recommended value | Why |
|---|---|---|
| `endpointing` | 0.3 - 0.5 | Lets speakers finish their sentences before closing an utterance. See Endpointing. |
| `maximum_duration_without_endpointing` | 15 | Prevents very long utterances in case a speaker doesn’t pause. |
| `messages_config.receive_partial_transcripts` | true | Feeds live captions to the UI while waiting for final results. See Partial transcripts. |
| `language_config.languages` | Set explicitly | Meeting language is almost always known in advance — setting it avoids detection overhead. |
| `realtime_processing.custom_vocabulary` | true | Add company-specific terms, project names, and participant names for better accuracy. |
Diarization vs. multi-channel: if each speaker is on a separate audio channel, use the `channel` field on each utterance to identify who is speaking — diarization is not needed. See Multiple channels. If all speakers share a single audio channel, enable diarization to separate the speakers. See Speaker diarization.
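A meeting-recorder session combining the recommendations above might be initialized like this sketch (the language and vocabulary values are placeholders for your own meeting context):

```json
{
  "endpointing": 0.4,
  "maximum_duration_without_endpointing": 15,
  "messages_config": {
    "receive_partial_transcripts": true
  },
  "language_config": {
    "languages": ["en"],
    "code_switching": false
  },
  "realtime_processing": {
    "custom_vocabulary": true,
    "custom_vocabulary_config": {
      "vocabulary": ["Project Apollo", "Acme Corp", "Q3 roadmap"]
    }
  }
}
```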
## Call Centers
For live phone calls the priorities are speaker identification and fast, accurate transcription despite variable audio quality (telephony codecs, background noise, cross-talk).
| Parameter | Recommended value | Why |
|---|---|---|
| `endpointing` | 0.2 - 0.4 | Keeps turn-taking responsive without cutting off mid-sentence. See Endpointing. |
| `maximum_duration_without_endpointing` | 15 | Prevents very long utterances in monologue-style segments. |
| `language_config.languages` | Set explicitly (e.g. `["en"]`) | Call center audio typically has a known language. Setting it avoids detection errors on noisy recordings. |
| `realtime_processing.custom_vocabulary` | true | Add product names, plan names, and internal terminology. |
Diarization vs. multi-channel: if each speaker is on a separate audio channel, use the `channel` field on each utterance to identify who is speaking — diarization is not needed. See Multiple channels. If all speakers share a single audio channel, enable diarization to separate the speakers. See Speaker diarization.
For calls with more than two participants (e.g. conference bridges), use `diarization_config.min_speakers` / `max_speakers` instead of `number_of_speakers` to give the model a flexible range.
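For a single-channel conference-bridge call, the pieces above might come together as in this sketch. The speaker bounds are illustrative, and enabling diarization via a top-level `diarization` flag alongside `diarization_config` is an assumption — verify against the session schema:

```json
{
  "endpointing": 0.3,
  "maximum_duration_without_endpointing": 15,
  "language_config": {
    "languages": ["en"],
    "code_switching": false
  },
  "diarization": true,
  "diarization_config": {
    "min_speakers": 2,
    "max_speakers": 5
  }
}
```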
## Subtitles & Captioning
When providing live subtitles, the goal is to sync text with the speaker in real time. The right balance between speed and segment quality depends on whether captions are displayed live or post-produced.
| Parameter | Recommended value | Why |
|---|---|---|
| `endpointing` | 0.3 (live) / 0.8 (post-production) | Lower values keep captions close to the speaker; higher values produce cleaner subtitle blocks. |
| `maximum_duration_without_endpointing` | 5 | Prevents excessively long subtitle segments that are hard to read on screen. |
| `messages_config.receive_partial_transcripts` | true | Shows words as they are spoken, then refines them when the final result arrives. |
| `language_config.languages` | Set explicitly | Avoids detection lag when the broadcast language is known. |
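A live-captioning session following the table above might be initialized like this sketch (the broadcast language is a placeholder — set the one you know in advance):

```json
{
  "endpointing": 0.3,
  "maximum_duration_without_endpointing": 5,
  "messages_config": {
    "receive_partial_transcripts": true
  },
  "language_config": {
    "languages": ["en"],
    "code_switching": false
  }
}
```

For post-produced subtitles, the same shape applies with `endpointing` raised to around 0.8 to yield cleaner subtitle blocks.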
For post-production subtitles generated from a recording, consider using the Pre-recorded API with the dedicated `subtitles` feature instead — it produces SRT/VTT files with fine-grained timing controls.