Voice elements let you add a voice-over to your videos by specifying the text to be spoken and the voice (and language) to use.

JSON2Video uses Microsoft Azure's Text-To-Speech service to achieve the most natural voices and the widest variety of languages and accents.

Check the full list of available voices and languages.

The following examples show how to include voice elements in your videos.

Simple voice over

This example uses the default voice to add a short voice-over to a still-image video:

{
    "resolution": "full-hd",
    "quality": "high",
    "scenes": [
        {
            "comment": "Scene #1",
            "elements": [
                {
                    "type": "image",
                    "src": "https://assets.json2video.com/assets/images/space-apollo11-01.jpg",
                    "scale": {
                        "width": 1920,
                        "height": 1280
                    },
                    "zoom": 5
                },
                {
                    "type": "voice",
                    "text": "That's one small step for a man, one giant leap for mankind. Upon taking a \"small step\" onto the surface of the moon in 1969, Neil Armstrong uttered what would become one of history's most famous one-liners.",
                    "start": 1.5
                }
            ]
        }
    ]
}

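The video uses one scene with two elements: an image element (the still picture, with a zoom effect) and a voice element (the narration, with "start" set to 1.5).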

This example does not specify a voice, so the element uses the default value of the voice field: en-GB-LibbyNeural.
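
To choose a voice explicitly, set the voice field on the element, as the later examples on this page do. As a minimal sketch, the element below sets the field to the default value named above, so it should behave the same as omitting the field. The text is shortened for illustration.

```json
{
    "type": "voice",
    "text": "That's one small step for a man, one giant leap for mankind.",
    "voice": "en-GB-LibbyNeural",
    "start": 1.5
}
```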

Using multiple voices

In this example, we will use two voices in two different languages to showcase the Voice element features.

{
    "resolution": "full-hd",
    "quality": "high",
    "scenes": [
        {
            "comment": "Scene #1",
            "elements": [
                {
                    "type": "image",
                    "src": "https://assets.json2video.com/assets/images/woman-01.jpg",
                    "y": -100
                },
                {
                    "type": "voice",
                    "text": "Hello Diego! Could you please introduce yourself in Italian?",
                    "voice": "en-US-AriaNeural",
                    "start": 1
                }
            ]
        },
        {
            "comment": "Scene #2",
            "elements": [
                {
                    "type": "image",
                    "src": "https://assets.json2video.com/assets/images/man-01.jpg",
                    "y": -100
                },
                {
                    "type": "voice",
                    "text": "S\u00ec, certo, Aria. Mi chiamo Diego Rossi e sono di Firenze.",
                    "voice": "it-IT-DiegoNeural"
                }
            ]
        }
    ]
}

The video simulates a short conversation between an English-speaking woman and an Italian-speaking man.

Changing the pace of the voice

You can use a few tags to change the pace of the voice, including <super-slow>, <slow>, and <fast>.

Wrap the text in the opening and closing tag to apply the change. Example:

{
    "resolution": "full-hd",
    "quality": "high",
    "scenes": [
        {
            "comment": "Scene #1",
            "elements": [
                {
                    "type": "voice",
                    "text": "That's one small step for a man, <super-slow>one giant leap for mankind</super-slow>. <fast>Upon taking a \"small step\" onto the surface of the moon in 1969</fast>, Neil Armstrong uttered what would become <slow>one of history's most famous one-liners</slow>.",
                    "start": 1.5
                }
            ]
        }
    ]
}

Expressing emotion

You can also add an emotion to the voice-over by wrapping the text in an emotion tag such as <cheerful>.

These are the supported emotions:

Example:

{
    "resolution": "full-hd",
    "quality": "high",
    "scenes": [
        {
            "comment": "Scene #1",
            "elements": [
                {
                    "type": "voice",
                    "voice": "en-US-AriaNeural",
                    "text": "<cheerful>\"That's remarkable! You're a genius!\"</cheerful> Mom said to her son.",
                    "start": 1.5
                }
            ]
        }
    ]
}

Using SSML

Finally, you can use SSML (Speech Synthesis Markup Language) tags directly in the text to express more complex nuances.

Example:

{
    "resolution": "full-hd",
    "quality": "high",
    "scenes": [
        {
            "comment": "Scene #1",
            "elements": [
                {
                    "type": "voice",
                    "voice": "en-US-AriaNeural",
                    "text": "<mstts:express-as style=\"cheerful\">\"That's remarkable! You're a genius!\"</mstts:express-as><break time=\"600ms\" />Mom said to her son.",
                    "start": 1.5
                }
            ]
        }
    ]
}
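
As a further sketch, standard SSML also defines a prosody element for controlling the speaking rate, and Azure's Text-To-Speech service documents support for it. Whether JSON2Video passes prosody through unchanged is an assumption here, not something this page confirms, so test it before relying on it. The element below inserts a pause like the one used above and slows down one phrase:

```json
{
    "type": "voice",
    "voice": "en-US-AriaNeural",
    "text": "That's one small step for a man,<break time=\"400ms\" /><prosody rate=\"slow\">one giant leap for mankind.</prosody>",
    "start": 1.5
}
```

Because the text is a JSON string, every double quote inside an SSML attribute has to be escaped with a backslash.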

Balancing music and voice volume

When you want to add music and narration to a video, you typically need to adjust the volume so that the voice can be heard clearly. The best option is to keep the voice at its original volume and reduce the volume of the music.

For details, see the corresponding section in the audio elements documentation.
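
As a rough illustration of that advice, the scene below pairs a voice element with a background music track played at reduced volume. The audio element's volume field and its value of 0.2 are assumptions to be checked against the audio elements documentation, and the music URL is a placeholder rather than a real asset.

```json
{
    "comment": "Scene #1",
    "elements": [
        {
            "type": "audio",
            "src": "https://example.com/background-music.mp3",
            "volume": 0.2
        },
        {
            "type": "voice",
            "text": "That's one small step for a man, one giant leap for mankind.",
            "start": 1.5
        }
    ]
}
```

The voice element carries no volume setting here, which keeps the narration at its original level while only the music is turned down.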