7. AI voiceover
A silent listing only goes so far. This chapter adds a synthesised voice-over narrating the property. JSON2Video ships with Microsoft Azure voices included in every plan (no per-character cost), with optional premium ElevenLabs voices when you connect your own key.
Prerequisites: chapter 6. Background music must be quieter than the voice; the volume: 0.4 we already set in chapter 2 handles that.
Step 1: The voice element
A voice element converts text to audio at render time. The required fields are type: "voice" and text. voice and model are optional; the default model is Azure.
{
  "type": "voice",
  "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
  "voice": "en-US-EmmaMultilingualNeural"
}
voice is the speaker ID. Azure exposes hundreds, including en-US-EmmaMultilingualNeural, en-US-AndrewMultilingualNeural, and es-ES-ElviraNeural. See the AI models catalog for the live list.
Note: voice elements do not need a duration. The renderer reads the synthesised audio length and applies it automatically. You can still cap it manually if needed.
Step 2: Insert the voice at movie level
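When payloads are generated by a script, a small helper can enforce the two required fields before anything is sent. The sketch below is illustrative only: make_voice_element is a hypothetical helper name, not part of any JSON2Video SDK, and it encodes only the rules stated above (type and text are required, voice and model are optional, duration is left out).

```python
import json


def make_voice_element(text, voice=None, model=None):
    """Build a voice element dict (hypothetical helper, not an official SDK call)."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("a voice element needs non-empty text")
    element = {"type": "voice", "text": text}
    # voice and model are optional; omit them to fall back to the defaults.
    if voice is not None:
        element["voice"] = voice
    if model is not None:
        element["model"] = model
    # No "duration" key: the renderer derives it from the synthesised audio.
    return element


element = make_voice_element(
    "Welcome to 123 Oak Street.",
    voice="en-US-EmmaMultilingualNeural",
)
print(json.dumps(element, indent=2))
```

Rejecting empty text locally avoids spending a render request on a payload that has nothing to narrate.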
Put the voice in the top-level elements array so it spans across the title card and the room scenes. Use start to delay it until after the title card animation finishes.
{
  "elements": [
    {
      "type": "audio",
      "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
      "volume": 0.4
    },
    {
      "type": "voice",
      "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
      "voice": "en-US-EmmaMultilingualNeural",
      "start": 1.5
    }
  ]
}
A 1.5 s delay lets the title card's animation breathe before the voice starts.
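If the movie JSON is assembled in code rather than by hand, the same change can be sketched as a function that appends the voice to the top-level elements array. add_narration is a hypothetical helper written for this example; it only manipulates a plain Python dict and assumes nothing about the API beyond the fields shown above.

```python
import copy


def add_narration(movie, text, voice, start=0.0):
    """Return a copy of movie with a voice element appended at movie level.

    Hypothetical helper for illustration; it only edits a plain dict.
    """
    if start < 0:
        raise ValueError("start must not be negative")
    updated = copy.deepcopy(movie)  # leave the caller's dict untouched
    voice_element = {"type": "voice", "text": text, "voice": voice}
    if start:
        voice_element["start"] = start  # delay the narration, in seconds
    # Top-level "elements" span the whole movie rather than a single scene.
    updated.setdefault("elements", []).append(voice_element)
    return updated


movie = {
    "elements": [
        {
            "type": "audio",
            "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
            "volume": 0.4,
        }
    ]
}
narrated = add_narration(
    movie,
    "Welcome to 123 Oak Street.",
    "en-US-EmmaMultilingualNeural",
    start=1.5,
)
print([e["type"] for e in narrated["elements"]])
```

Working on a deep copy keeps the silent version of the movie available, which is handy when rendering both variants for comparison.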
Step 3: Use a premium ElevenLabs voice (optional)
For higher-fidelity speech, switch the model to elevenlabs and add a connection ID pointing at your ElevenLabs API key (configure it under Dashboard > Connections). Without a connection, the default Azure voice is used.
{
  "type": "voice",
  "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
  "model": "elevenlabs",
  "voice": "21m00Tcm4TlvDq8ikWAM",
  "connection": "my-elevenlabs"
}
ElevenLabs voices consume extra credits (~60 per minute). Azure is free under every plan. See Credit consumption.
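To get a feel for what the approximate rate means in practice, the arithmetic can be sketched as below. This is a rough estimate under stated assumptions: it takes the "~60 per minute" figure at face value, prorates it linearly by the second, and ignores any rounding or minimum charge the real billing may apply. The function name and the rate table are made up for this example, and the narration length has to be supplied by you since it is only known after synthesis.

```python
# Approximate extra credits per minute of narration, as quoted above.
# Only the two providers named in this chapter are covered.
EXTRA_CREDITS_PER_MINUTE = {"azure": 0, "elevenlabs": 60}


def estimate_voice_credits(audio_seconds, model="azure"):
    """Rough extra-credit estimate for one narration track (illustrative only)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must not be negative")
    if model not in EXTRA_CREDITS_PER_MINUTE:
        raise ValueError(f"no rate known for model {model!r}")
    # Assumption: linear proration, no rounding and no minimum charge.
    return EXTRA_CREDITS_PER_MINUTE[model] * audio_seconds / 60


for seconds in (6, 30, 90):
    print(seconds, "s ->", estimate_voice_credits(seconds, "elevenlabs"), "credits")
```

Treat the result as a planning figure and check Credit consumption for the actual rules.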
Step 4: Submit the SDK call
The JSON payload is the same regardless of language. Here is the full POST in four flavours.
curl:
curl -X POST https://api.json2video.com/v2/movies \
  -H "x-api-key: $JSON2VIDEO_API_KEY" \
  -H "Content-Type: application/json" \
  -d @movie.json

Node.js (18+, ES module):
import { readFileSync } from "node:fs";

// Load the payload from disk.
const movie = JSON.parse(readFileSync("movie.json", "utf8"));
const res = await fetch("https://api.json2video.com/v2/movies", {
  method: "POST",
  headers: {
    "x-api-key": process.env.JSON2VIDEO_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(movie),
});
console.log(await res.json());

Python (requests):
import json
import os

import requests

# Load the payload from disk; the with block closes the file afterwards.
with open("movie.json", encoding="utf-8") as f:
    movie = json.load(f)

r = requests.post(
    "https://api.json2video.com/v2/movies",
    headers={"x-api-key": os.environ["JSON2VIDEO_API_KEY"]},
    json=movie,  # requests serialises the dict and sets Content-Type
)
print(r.json())

PHP (cURL extension):
<?php
// Decode and re-encode so invalid JSON is caught before the request is sent.
$movie = json_decode(file_get_contents("movie.json"), true);
$ch = curl_init("https://api.json2video.com/v2/movies");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => ["x-api-key: " . getenv("JSON2VIDEO_API_KEY"), "Content-Type: application/json"],
    CURLOPT_POSTFIELDS => json_encode($movie),
]);
echo curl_exec($ch);
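For environments where installing requests is not an option, the same POST can be prepared with Python's standard library alone. The sketch below builds the request without sending it, which also makes it easy to inspect in a test. build_render_request is a hypothetical helper name, and "demo-key" is a placeholder used only when the environment variable is unset.

```python
import json
import os
import urllib.request

API_URL = "https://api.json2video.com/v2/movies"


def build_render_request(movie, api_key):
    """Prepare (but do not send) the POST request shown in the snippets above."""
    if not api_key:
        raise ValueError("missing API key")
    body = json.dumps(movie).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


request = build_render_request(
    {"scenes": [{"duration": 4, "elements": []}]},
    os.environ.get("JSON2VIDEO_API_KEY") or "demo-key",  # placeholder fallback
)
print(request.get_method(), request.full_url)

# To submit for real (needs network access and a valid key):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(json.load(response))
```

Separating request construction from sending keeps the network call in one place, so timeouts and error handling can be added there without touching the payload code.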
The complete final JSON
{
  "resolution": "full-hd",
  "elements": [
    {
      "type": "audio",
      "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
      "volume": 0.4
    },
    {
      "type": "voice",
      "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
      "voice": "en-US-EmmaMultilingualNeural",
      "start": 1.5
    },
    {
      "type": "html",
      "tailwind": true,
      "wait": 0.5,
      "html": "<div class='inline-flex items-center gap-2 px-6 py-4 rounded-xl bg-emerald-700 text-white text-5xl font-bold shadow-lg'>💰 $849,000</div>",
      "position": "bottom-right",
      "x": -60,
      "y": -60,
      "start": 4,
      "duration": 12
    }
  ],
  "scenes": [
    {
      "duration": 4,
      "elements": [
        {
          "type": "component",
          "component": "basic/000",
          "settings": { "headline": "FOR SALE", "subline": "123 Oak Street" }
        }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-front.jpg" },
        { "type": "text", "text": "Exterior", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-kitchen.jpg" },
        { "type": "text", "text": "Chef's Kitchen", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-bedroom.jpg" },
        { "type": "text", "text": "Master Bedroom", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    }
  ]
}
Expected output
The same 16-second listing as chapter 6, now with a clean Emma voice narrating "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000." starting 1.5 s in. Sample render: tutorial-07.mp4 (placeholder).
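The 16-second figure follows from the four 4-second scenes, and the price badge (start 4, duration 12) ends on the same mark. A quick way to check such timings before rendering is sketched below. It assumes every scene sets an explicit duration and that transitions do not shorten the timeline, which matches the stated total here but is not verified against the renderer. Both function names are made up for this example.

```python
def total_duration(movie):
    """Sum the explicit scene durations (assumes every scene sets one)."""
    return sum(scene["duration"] for scene in movie["scenes"])


def element_end(element):
    """End time of a movie-level element with an explicit duration."""
    return element.get("start", 0) + element["duration"]


# Timing fields only, copied from the complete JSON above.
movie = {
    "elements": [{"type": "html", "start": 4, "duration": 12}],
    "scenes": [{"duration": 4}, {"duration": 4}, {"duration": 4}, {"duration": 4}],
}

total = total_duration(movie)
badge_end = element_end(movie["elements"][0])
print("movie:", total, "s; badge ends at", badge_end, "s")
if badge_end > total:
    print("warning: badge runs past the end of the movie")
```

The voice element is left out of the check on purpose: its length is only known once the audio has been synthesised.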
What you learned
- type: "voice" synthesises an audio track from text. voice picks a speaker, model picks the provider (default: azure).
- A movie-level voice element runs across all scenes; use start to delay it.
- Azure is included in every plan; ElevenLabs requires a connection and consumes extra credits.