
Voice element

Type: object

Creates a voiceover element by converting the provided text into synthesized speech. The text to be spoken is specified using the text property. The voice property determines the voice to use, and the model property allows specifying which speech synthesis model to employ. Optionally, a connection ID can be provided to utilize your own API key for voice generation; otherwise, the JSON2Video API keys will be used.

Working with the Voice element

The Voice element uses AI models to generate a voiceover for your video.

You can choose from the following models:

  • Microsoft Azure (azure): The Azure model is a powerful and flexible option that supports a wide range of voices and languages. It offers high-quality speech synthesis with customizable parameters like support for SSML (Speech Synthesis Markup Language) for advanced text-to-speech control.
  • ElevenLabs (elevenlabs): The ElevenLabs model is a popular choice for the high quality and variety of its voices.
  • ElevenLabs Flash V2.5 (elevenlabs-flash-v2-5): The ElevenLabs Flash V2.5 model is a fast and efficient option that provides high-quality speech synthesis with extended language support.

Note The azure model is the default model and will be used if no model is specified.

Example

This example creates a voiceover for a video using the Azure model.

{
  "resolution": "full-hd",
  "scenes": [
    {
      "elements": [
        {
          "type": "voice",
          "text": "Hello, world!",
          "voice": "en-US-EmmaMultilingualNeural",
          "model": "azure"
        }
      ]
    }
  ]
}

Voice generation costs

Generating a voiceover with AI can be expensive: depending on the model you choose, it may consume a significant amount of credits.

Model                    Credits per minute
Azure                    0
ElevenLabs               60
ElevenLabs Flash V2.5    60

The Azure model is included in all JSON2Video plans and does not consume any credits.

Voiceovers generated with AI models are cached on JSON2Video servers to avoid calling the AI models multiple times for the same voiceover. If you call the JSON2Video API again with the same parameters, the cached voiceover is reused in the video, avoiding unnecessary costs. If for any reason you need to regenerate a voiceover, set the cache property to false.
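For example, adding "cache": false to the earlier Azure example forces a fresh synthesis of the voiceover instead of reusing a cached one (a minimal sketch):

```json
{
  "resolution": "full-hd",
  "scenes": [
    {
      "elements": [
        {
          "type": "voice",
          "text": "Hello, world!",
          "voice": "en-US-EmmaMultilingualNeural",
          "model": "azure",
          "cache": false
        }
      ]
    }
  ]
}
```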

Using your own API key

If you already have an ElevenLabs or Azure account, you can use your own API key to generate your voiceovers. This is especially useful for ElevenLabs custom voices.

To use your own API key:

  1. Create a connection in the Connections page.
  2. Provide the connection ID in the connection property of the Voice element.

Example

This example creates a voiceover for a video using the ElevenLabs model and your own API key.

{
  "resolution": "full-hd",
  "scenes": [
    {
      "elements": [
        {
          "type": "voice",
          "text": "Hello, world!",
          "model": "elevenlabs",
          "voice": "Daniel",
          "connection": "my-connection-id"
        }
      ]
    }
  ]
}

Choosing the right voice

Finding the right voice for your project can be a challenge.

Azure voices

Azure voices have this format: en-US-EmmaMultilingualNeural.

The first part is the language code (2 letters), the second part is the country code (2 letters), and the third part is the name of the voice.

For the Azure model, you can check the full list of voices by language here: https://json2video.com/ai-voices/azure/languages/
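As a sketch of the naming convention, the element below requests a Spanish (Spain) voice; es-ES-ElviraNeural is assumed to be one of the voices listed on the page linked above:

```json
{
  "type": "voice",
  "text": "¡Hola, mundo!",
  "voice": "es-ES-ElviraNeural",
  "model": "azure"
}
```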

ElevenLabs voices

ElevenLabs voices have natural names like Daniel, Serena, Antoni, Bella, Nova, Shimmer, and more. You can also use an ElevenLabs voice ID to specify the voice you want to use.

You can find a list of voices in the ElevenLabs Voices Library page (you need to be logged in).

Properties

The following properties are required:

  • text
  • type

cache

If true, the system will attempt to retrieve and use a previously rendered (cached) version of this element, if an identical version is available. This can significantly reduce processing time. If false, a new render of the element will always be performed, regardless of whether a cached version exists. The default value is true.

Type boolean
Required No
Default Value true
Format boolean

comment

A field for adding descriptive notes or internal memos related to the element, such as describing its content or purpose. This comment is for your reference only and does not affect the rendering process.

Type string
Required No

condition

A string containing an expression that determines whether the element will be rendered. The element is rendered only if the condition evaluates to true. If the condition is false or an empty string, the element will be skipped and not included in the scene or movie.

Type string
Required No

connection

The ID of your pre-configured connection to use for voice generation. Connections are defined within the application's dashboard. By specifying a connection ID, you can leverage the API key associated with that connection, enabling you to use your own account with the AI model provider for voice generation. If a connection ID is not provided, the default JSON2Video API keys will be used, potentially deducting credits for the API calls.

Type string
Required No

duration

Defines the duration of the element in seconds. Use a positive value to specify the element's length. A value of -1 instructs the system to automatically set the duration based on the intrinsic length of the asset or file used by the element. A value of -2 sets the element's duration to match that of its parent scene (if it's inside a scene) or the movie (if it's in the movie elements array).

Type number
Required No
Default Value -1
Format float
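For example, a sketch of a voiceover element stretched to match its parent scene's duration via the -2 value described above:

```json
{
  "type": "voice",
  "text": "Hello, world!",
  "voice": "en-US-EmmaMultilingualNeural",
  "duration": -2
}
```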

extra-time

The amount of time, in seconds, to extend the element's duration beyond its natural length. This allows the element to linger on screen after its content has finished playing or displaying. For example, setting extra-time to 0.5 will keep the element visible for an additional half-second.

Type number
Required No
Default Value 0
Format float
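The half-second example above would look like this (sketch): the element stays active for 0.5 seconds after the speech ends.

```json
{
  "type": "voice",
  "text": "Thanks for watching!",
  "voice": "en-US-EmmaMultilingualNeural",
  "extra-time": 0.5
}
```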

fade-in

The duration, in seconds, of the fade-in effect applied to the element's appearance. A value of 0 means no fade-in effect. Larger values result in a longer fade-in duration. The value must be a non-negative number.

Type number
Required No
Format float
Minimum Value 0

fade-out

The duration, in seconds, of the fade-out effect applied to the element's disappearance. A value of 0 means no fade-out effect. Larger values result in a longer fade-out duration. The value must be a non-negative number.

Type number
Required No
Format float
Minimum Value 0

id

A unique identifier for the element within the movie. This string allows you to reference and manage individual elements. If not provided, the system will automatically generate a random string.

Type string
Required No
Default Value "@randomString"

model

The generative AI model to use for synthesizing the voice. Be aware that some models may consume credits for each request.

Type string
Required No
Enum Values azure, elevenlabs, elevenlabs-flash-v2-5

muted

If true, the audio track of the element (e.g., a video or audio file) will be muted, effectively silencing it. If false or omitted, the audio will play according to its original volume or the volume setting.

Type boolean
Required No
Default Value false

start

The element's start time, in seconds, determines when it begins playing within its container's timeline. This time is relative to the beginning of the scene it's in or, if the element is part of the movie's elements array, relative to the beginning of the movie itself. The default value is 0, meaning the element starts at the beginning of its container's timeline.

Type number
Required No
Default Value 0
Format float
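For instance, this sketch delays the voiceover so it begins 2 seconds into its scene:

```json
{
  "type": "voice",
  "text": "Hello, world!",
  "voice": "en-US-EmmaMultilingualNeural",
  "start": 2
}
```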

text

The text content to be synthesized into speech.

Type string
Required Yes

type

This field specifies the element's type and must be set to voice for voiceover elements.

Type string
Required Yes
Enum Values voice

variables

Defines local variables specific to this element. These variables can be used to dynamically alter the element's properties or content during the rendering process. Variable names must consist of only letters, numbers, and underscores.

Type object
Required No
Default Value {}
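As a sketch, a local variable can parameterize the spoken text; this assumes the {{variable}} placeholder syntax used elsewhere in the JSON2Video documentation:

```json
{
  "type": "voice",
  "text": "Hello, {{name}}!",
  "voice": "en-US-EmmaMultilingualNeural",
  "variables": {
    "name": "Emma"
  }
}
```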

voice

The name of the voice to be used for text-to-speech synthesis. This value determines which AI voice will be used to generate the audio. Refer to the available voices documentation to explore the supported options.

Type string
Required No

volume

Controls the volume gain of the audio track (e.g., a video or audio file). This is a multiplier applied to the original audio level. A value of 1 represents the original volume (no gain), values greater than 1 increase the volume, and values less than 1 decrease the volume. The acceptable range is from 0 to 10. For background music with voiceovers, a usual value is 0.2. Increasing the volume of the audio track can reduce the quality of the audio.

Type number
Required No
Default Value 1
Minimum Value 0
Maximum Value 10
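The 0.2 background-music figure mentioned above could be applied as in this sketch, which assumes an audio element with a src property and uses a placeholder music URL:

```json
{
  "scenes": [
    {
      "elements": [
        {
          "type": "voice",
          "text": "Hello, world!",
          "voice": "en-US-EmmaMultilingualNeural"
        },
        {
          "type": "audio",
          "src": "https://example.com/music.mp3",
          "volume": 0.2
        }
      ]
    }
  ]
}
```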

z-index

Element's z-index, determining its stacking order within the video. Higher values bring the element to the front, obscuring elements with lower values; lower values send the element to the back, potentially behind other elements. The value must be an integer between -99 and 99; the default is 0. Elements are naturally layered by their order in the elements array; if for any reason that does not work in your case, use the z-index property to control the stacking order manually.

Type number
Required No
Default Value 0
Format integer
Minimum Value -99
Maximum Value 99