7. AI voiceover

A silent listing only goes so far. This chapter adds a synthesised voice-over narrating the property. JSON2Video ships with Microsoft Azure voices included in every plan (no per-character cost), with optional premium ElevenLabs voices when you connect your own key.

Prerequisites: chapter 6. Background music must be quieter than the voice; the volume: 0.4 we already set in chapter 2 handles that.

Step 1: The voice element

A voice element converts text to audio at render time. The required fields are type: "voice" and text. voice and model are optional; the default model is Azure.

{
  "type": "voice",
  "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
  "voice": "en-US-EmmaMultilingualNeural"
}

voice is the speaker ID. Azure exposes hundreds, including en-US-EmmaMultilingualNeural, en-US-AndrewMultilingualNeural, and es-ES-ElviraNeural. See the AI models catalog for the live list.

Note: voice elements do not need a duration. The renderer reads the synthesised audio length and applies it automatically. You can still cap it manually with an explicit duration if needed.
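The field rules above can be captured in a small builder. The following is a minimal Python sketch, not part of any JSON2Video SDK: the function name make_voice_element is hypothetical, and it encodes only what this chapter states (type and text are required; voice, model, and start are optional).

```python
import json


def make_voice_element(text, voice=None, model=None, start=None):
    """Build a voice element dict (hypothetical helper, not an official API)."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("a voice element needs non-empty text")
    element = {"type": "voice", "text": text}
    # Optional fields are omitted entirely so the renderer applies its defaults.
    if voice is not None:
        element["voice"] = voice
    if model is not None:
        element["model"] = model
    if start is not None:
        element["start"] = start
    return element


narration = make_voice_element(
    "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
    voice="en-US-EmmaMultilingualNeural",
)
print(json.dumps(narration, indent=2))
```

Leaving model out of the dict, rather than writing a default into it, keeps the payload identical to the hand-written JSON in this step.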

Step 2: Insert the voice at movie level

Put the voice in the top-level elements array so it spans across the title card and the room scenes. Use start to delay it until after the title card animation finishes.

{
  "elements": [
    {
      "type": "audio",
      "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
      "volume": 0.4
    },
    {
      "type": "voice",
      "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
      "voice": "en-US-EmmaMultilingualNeural",
      "start": 1.5
    }
  ]
}

A 1.5 s delay lets the title card's animation breathe before the voice starts.
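If the movie payload is assembled in code rather than by hand, the same placement can be sketched as below. This is an illustration under assumptions: add_movie_level_voice is a hypothetical helper name, and the movie dict is a trimmed stand-in for the real payload.

```python
import copy


def add_movie_level_voice(movie, text, voice, start=0.0):
    """Return a copy of movie with a voice element in the top-level elements array."""
    updated = copy.deepcopy(movie)  # leave the caller's dict untouched
    updated.setdefault("elements", []).append(
        {"type": "voice", "text": text, "voice": voice, "start": start}
    )
    return updated


movie = {
    "elements": [
        {
            "type": "audio",
            "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
            "volume": 0.4,
        }
    ],
    "scenes": [],
}
with_voice = add_movie_level_voice(
    movie, "Welcome to 123 Oak Street.", "en-US-EmmaMultilingualNeural", start=1.5
)
print([element["type"] for element in with_voice["elements"]])  # ['audio', 'voice']
```

Appending to the top-level elements array, not to a scene's elements, is what lets the narration continue across scene boundaries.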

Step 3: Use a premium ElevenLabs voice (optional)

For higher-fidelity speech, switch the model to elevenlabs and add a connection ID pointing at your ElevenLabs API key (configure it in Dashboard → Connections). Without a connection, the default Azure voice is used.

{
  "type": "voice",
  "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
  "model": "elevenlabs",
  "voice": "21m00Tcm4TlvDq8ikWAM",
  "connection": "my-elevenlabs"
}

ElevenLabs voices consume extra credits (~60 per minute). Azure is free under every plan. See Credit consumption.
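To get a rough feel for that cost before rendering, the per-minute figure can be turned into a back-of-the-envelope estimate. The sketch below rests on two loudly stated assumptions that do not come from JSON2Video: a narration pace of 150 words per minute, and rounding up to whole credits. The function names are hypothetical, and the actual billing rules may differ.

```python
import math

CREDITS_PER_MINUTE = 60  # approximate figure quoted in this chapter
WORDS_PER_MINUTE = 150   # ASSUMPTION: typical narration pace, not a JSON2Video figure


def estimate_speech_seconds(text, words_per_minute=WORDS_PER_MINUTE):
    """Estimate spoken length from a whitespace word count (a crude heuristic)."""
    return len(text.split()) / words_per_minute * 60


def estimate_elevenlabs_credits(seconds, credits_per_minute=CREDITS_PER_MINUTE):
    """Scale the per-minute rate to a duration; rounding up is an ASSUMPTION."""
    return math.ceil(seconds / 60 * credits_per_minute)


text = "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000."
seconds = estimate_speech_seconds(text)
print(f"about {seconds:.1f} s of speech, roughly {estimate_elevenlabs_credits(seconds)} credits")
```

Treat the result as an order-of-magnitude guide only; the rendered audio length is what ultimately determines consumption.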

Step 4: Submit the API call

The JSON payload is the same regardless of language. Here is the full POST in four flavours.

cURL

curl -X POST https://api.json2video.com/v2/movies \
  -H "x-api-key: $JSON2VIDEO_API_KEY" \
  -H "Content-Type: application/json" \
  -d @movie.json

Node.js (ES module, Node 18+ for the built-in fetch)

import { readFileSync } from "node:fs";

// Load the payload assembled in the previous steps.
const movie = JSON.parse(readFileSync("movie.json", "utf8"));
const res = await fetch("https://api.json2video.com/v2/movies", {
  method: "POST",
  headers: {
    "x-api-key": process.env.JSON2VIDEO_API_KEY,
    "Content-Type": "application/json",
  },
  body: JSON.stringify(movie),
});
console.log(await res.json());

Python

import json
import os

import requests

# Load the payload assembled in the previous steps.
with open("movie.json", encoding="utf-8") as fh:
    movie = json.load(fh)

# json= serialises the payload and sets the Content-Type header.
r = requests.post(
    "https://api.json2video.com/v2/movies",
    headers={"x-api-key": os.environ["JSON2VIDEO_API_KEY"]},
    json=movie,
)
print(r.json())

PHP

<?php
// Load the payload assembled in the previous steps.
$movie = json_decode(file_get_contents("movie.json"), true);
$ch = curl_init("https://api.json2video.com/v2/movies");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HTTPHEADER => ["x-api-key: " . getenv("JSON2VIDEO_API_KEY"), "Content-Type: application/json"],
    CURLOPT_POSTFIELDS => json_encode($movie),
]);
echo curl_exec($ch);
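
Where installing a third-party HTTP client is not an option, the same POST can be sketched with Python's standard library alone. This is an illustrative variant, not an official client: build_request and submit are hypothetical names, and because this chapter does not describe the response body, the sketch simply returns whatever JSON the API sends back.

```python
import json
import urllib.request

API_URL = "https://api.json2video.com/v2/movies"


def build_request(movie, api_key):
    """Prepare (but do not send) the POST request for a movie payload."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(movie).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def submit(movie, api_key, timeout=30):
    """Send the request; urlopen raises urllib.error.HTTPError on 4xx/5xx."""
    with urllib.request.urlopen(build_request(movie, api_key), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the request has no network side effects, so it is safe to inspect.
req = build_request({"elements": [], "scenes": []}, "dummy-key")
print(req.get_method(), req.full_url)  # POST https://api.json2video.com/v2/movies
```

Separating build_request from submit makes it possible to check the headers and body in a unit test without spending a render.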

The complete final JSON

{
  "resolution": "full-hd",
  "elements": [
    {
      "type": "audio",
      "src": "https://cdn.json2video.com/assets/audios/uplifting-corporate.mp3",
      "volume": 0.4
    },
    {
      "type": "voice",
      "text": "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000.",
      "voice": "en-US-EmmaMultilingualNeural",
      "start": 1.5
    },
    {
      "type": "html",
      "tailwind": true,
      "wait": 0.5,
      "html": "<div class='inline-flex items-center gap-2 px-6 py-4 rounded-xl bg-emerald-700 text-white text-5xl font-bold shadow-lg'>💰 $849,000</div>",
      "position": "bottom-right",
      "x": -60,
      "y": -60,
      "start": 4,
      "duration": 12
    }
  ],
  "scenes": [
    {
      "duration": 4,
      "elements": [
        {
          "type": "component",
          "component": "basic/000",
          "settings": { "headline": "FOR SALE", "subline": "123 Oak Street" }
        }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-front.jpg" },
        { "type": "text", "text": "Exterior", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-kitchen.jpg" },
        { "type": "text", "text": "Chef's Kitchen", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    },
    {
      "duration": 4,
      "transition": { "style": "fade", "duration": 0.5 },
      "elements": [
        { "type": "image", "src": "https://cdn.json2video.com/assets/images/sample-house-bedroom.jpg" },
        { "type": "text", "text": "Master Bedroom", "position": "bottom-left", "x": 60, "y": -60 }
      ]
    }
  ]
}

Expected output

The same 16-second listing as chapter 6, now with a clean Emma voice narrating "Welcome to 123 Oak Street, a four-bedroom craftsman home, listed at $849,000." starting 1.5 s in. Sample render: tutorial-07.mp4 (placeholder).
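
The 16-second figure is the sum of the four 4-second scenes, and it is worth checking that timed movie-level elements fit inside it. The following is a rough sketch with hypothetical helper names; it assumes transitions do not change the total length, which this chapter does not confirm, and it skips elements without an explicit duration (such as the voice, whose length is only known at render time).

```python
def total_duration(movie):
    """Sum scene durations. ASSUMPTION: transitions do not alter the total."""
    return sum(scene["duration"] for scene in movie["scenes"])


def overruns(movie):
    """Return movie-level elements whose start + duration exceeds the movie length."""
    total = total_duration(movie)
    return [
        element
        for element in movie.get("elements", [])
        if "duration" in element
        and element.get("start", 0) + element["duration"] > total
    ]


# Trimmed stand-in for the final JSON: only the timing fields matter here.
movie = {
    "elements": [
        {"type": "voice", "text": "Welcome to 123 Oak Street.", "start": 1.5},
        {"type": "html", "start": 4, "duration": 12},
    ],
    "scenes": [{"duration": 4}, {"duration": 4}, {"duration": 4}, {"duration": 4}],
}
print(total_duration(movie))  # 16
print(overruns(movie))        # []
```

The price badge starts at 4 s and lasts 12 s, so it ends exactly with the last scene and is not reported.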

What you learned

  • type: voice synthesises an audio track from text.
  • voice picks a speaker, model picks the provider (default: azure).
  • A movie-level voice element runs across all scenes; use start to delay it.
  • Azure is included in every plan; ElevenLabs requires a connection and consumes extra credits.
