Generating Video

This tutorial will show you how to create mp4 video using the Character API. To get the most out of this tutorial, you will want to first install and run the 'videogen' Reference Implementation from

To run videogen, you will need an AWS account with permissions for the Character API and AWS Polly. After downloading the videogen.js file, you will modify it to replace the blank keys with the ones for your account. Each time you run the tool you will incur API charges on your account. The tool is used from the command line:

$ node videogen SusanHead 3.0 250 200 NeuralJoanna "<headnod/> Hello world!" hello.mp4


The tool can be used in an offline manner to produce video for standalone use, or as a source for video-editing software. The tool is written in Node.js, and is easy to customize for your particular needs. It can also serve as a Reference Implementation for building an online video-generation solution, such as a service for viral video creation that might be used as part of a social media campaign.

How it Works

The Reference Implementation is based on the ffmpeg compressor, which is available for download. The ffmpeg command line tool takes as input a series of images and an audio track, and produces a compressed video in a variety of formats.

Previous tutorials have discussed how longer animations can be produced by playing several segments back-to-back, with each segment being approximately one sentence in length and representing a single Character API call. For simplicity, we consider only the case where the character begins and ends in the default position. The Reference Implementation produces an mp4 file for a single segment only, but multiple segments can easily be joined by concatenating the resulting mp4 files.
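One way to join segment files is ffmpeg's concat demuxer, which reads a text file that lists the inputs. The sketch below only builds the list text and the ffmpeg argument array. The helper names buildConcatList and buildConcatArgs and the file names are hypothetical and are not part of videogen.js. It assumes that every segment was encoded with the same settings, so that '-c copy' can join them without re-encoding, and that the paths contain no single quotes.

```javascript
// Hypothetical helpers, not part of videogen.js.

// Build the text of a concat-demuxer list file: one "file '<path>'" line per segment.
// Assumes the paths contain no single quotes.
function buildConcatList(segmentFiles) {
    return segmentFiles.map((f) => `file '${f}'`).join('\n') + '\n';
}

// Build ffmpeg arguments that join the listed segments without re-encoding.
// '-safe 0' lets the list refer to paths ffmpeg would otherwise consider unsafe.
function buildConcatArgs(listFile, outputFile) {
    return ['-f', 'concat', '-safe', '0', '-i', listFile, '-c', 'copy', outputFile];
}
```

To use it, you would write the list text to a file (for example with fs.writeFileSync) and pass the argument array to spawn('ffmpeg', args), in the same way the Reference Implementation starts ffmpeg.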

The key thing to understand about video is that the compression ratios are extremely high. Consider a video of 300 frames at 24 frames per second, or 12.5 seconds in duration. On disk, 300 images might take 30 MB at 100 KB per image. However, the resulting video could be as small as 30 KB. You could render out 300 images to disk and then compress them all with one call to ffmpeg. But you can do much better than this by setting up your renderer to pipe its images to the ffmpeg process, so that the images are compressed as they are being generated and are never stored to disk.
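The figures in this example can be checked with a few lines of arithmetic. The sketch below uses the numbers from the text and treats 1 KB as 1,000 bytes, an assumption made only to keep the numbers round.

```javascript
const frames = 300;
const fps = 24;
const bytesPerImage = 100 * 1000;   // 100 KB per image (assuming 1 KB = 1,000 bytes)
const videoBytes = 30 * 1000;       // 30 KB, the compressed size quoted in the text

const durationSeconds = frames / fps;       // length of the clip in seconds
const rawBytes = frames * bytesPerImage;    // total size of the uncompressed image sequence
const ratio = rawBytes / videoBytes;        // how many times smaller the video is

console.log(durationSeconds, rawBytes, ratio); // prints: 12.5 30000000 1000
```

With these figures the image sequence is about a thousand times larger than the finished video, which is why avoiding the intermediate files is worthwhile.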

The Character API already provides us with a simple means of generating the raw frames: we simply follow the Character API-supplied JSON instructions, which describe how to build each frame by copying portions of the texture image over to an off-screen rendering buffer. Our implementation is in JavaScript, but we use the 'sharp' package, which uses native code that can scale to multiple CPU cores. Since ffmpeg can also take advantage of multiple CPU cores, the resulting video generation engine is surprisingly fast, especially as you scale up to multicore instances. For example, an AWS "t2.medium" instance has 2 CPU cores and performs roughly twice as fast as a "t2.small", which has only one CPU core.

Fast and efficient video generation leads to the opportunity to create specialized higher-resolution models and higher frame rates. Please contact us for more information on high-res models of stock and custom characters.

The ffmpeg Pipeline

Node.js has the ability to invoke ffmpeg as a separate process, all while acting as a source for the raw frame data to 'pipe' into that process. At the heart of the videogen.js you will see something like this:

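// requires: const { spawn } = require('child_process');
renderCore(callback) {
    // See the ffmpeg documentation for parameter details
    let args = [
            '-framerate', '24',
            '-f', 'image2pipe',
            '-i', '-',
    ];
    args = args.concat(['-i', this.audioFile]);
    args = args.concat([this.params.outputFile]);

    this.child = spawn('ffmpeg', args);
    this.child.on('close', (code)=>{
        // assumed completion: report any rendering error, otherwise success
        if (this.err) return callback(this.err);
        callback(null);
    });

    // Run doFrame() for all frames in the animData via recursion
    this.doFrame(0, (err)=>{
        if (err) this.err = err;
        // Closing stdin tells ffmpeg that no more frames are coming
        if (this.child && this.child.exitCode === null) this.child.stdin.end();
    });
}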

The input file parameters are the image pipe, represented by '-i -', and the audio file, represented by '-i audiofile'. After the ffmpeg process is spawned, we call doFrame():

doFrame(i, callback) {
    // Stop once every frame has been rendered
    if (i == this.animData.frames.length) 
        return callback(null);
    //console.log("Rendering frame "+i);
    this.animate(i, (err)=>{
        if (err) return callback(err);
        this.doFrame(i+1, callback);
    });
}

animate(frame, callback) {
    // create a new image from the texture file and JSON instructions
    // and write it to this.child.stdin (shown in more detail below)
}

In doFrame() we call animate() to render the next frame and write it out to the child process's 'stdin', which is piped to ffmpeg. We then call doFrame() recursively until the final frame is reached. At that point the callback closes 'stdin', ffmpeg finishes writing the video and exits, and the child process's 'close' handler is called.
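The same frame-by-frame hand-off can also be written as a loop. The sketch below is an alternative and is not the Reference Implementation's code: writeFrames is a hypothetical helper that takes frames that have already been rendered to buffers, and 'stream' stands in for the ffmpeg child's stdin. It waits for the 'drain' event whenever write() reports that the pipe's buffer is full, so a fast renderer does not queue up frames in memory.

```javascript
const { once } = require('events');

// Hypothetical helper: write each frame buffer to a writable stream in order,
// then close the stream. `stream` stands in for the ffmpeg child's stdin.
async function writeFrames(stream, frames) {
    for (const frame of frames) {
        // write() returns false when the stream's internal buffer is full
        if (!stream.write(frame)) {
            // wait until the consumer has caught up before sending more
            await once(stream, 'drain');
        }
    }
    // closing the stream signals that no more frames are coming
    stream.end();
}
```

The recursive form in the Reference Implementation renders one frame at a time, so it never holds more than the current frame. A loop like this one is mainly a matter of style, with the added benefit that stream errors surface as a rejected promise.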

The animate() function is much like the animate() we find in an HTML client, except that rather than using the Canvas API to create a character frame from the animation data, we use the Node.js sharp library to do the same.

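animate(frame, callback) {
    // Look up this frame's recipe (assumes animData.recipes is indexed by the frame's first element)
    let recipe = this.animData.recipes[this.animData.frames[frame][0]];
    // not shown: populate pieceDatas, pieceInfos from the recipe, as needed
    let a = [];
    for (let i = 0; i < recipe.length; i++) {
        let extractKey = this.extractKeyFromRecipeItem(recipe, i);
        a.push({input:this.pieceDatas[extractKey], raw:this.pieceInfos[extractKey], left:recipe[i][0], top:recipe[i][1]});
    }
    sharp({create: { width: this.params.width, height: this.params.height, channels: 4, background: { r: 0, g: 0, b: 0, alpha: 0 } }})
            .composite(a)
            .png() // assumed: an encoded format that ffmpeg's image2pipe input can detect
            .toBuffer((err, data) => {
                if (err) return callback(err);
                // write the finished frame to ffmpeg's stdin, then continue
                this.child.stdin.write(data, callback);
            });
}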

We create a new sharp offscreen buffer and use the sharp 'composite' method. This method takes an array of "blits", or image copy operations, from the texture image to the offscreen buffer. Because of the way sharp works, the source for each blit is actually a separate sharp buffer. These buffers are called "pieces" in the reference code, and are derived from the texture using the sharp 'extract' method.
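To make the idea of a "blit" concrete, here is a minimal sketch of the operation on raw RGBA pixel buffers, written in plain Node.js without sharp. The blit function and its parameters are hypothetical and only illustrate the copy. Note that this version overwrites the destination pixels, whereas sharp's composite blends the source over the destination by default.

```javascript
// Hypothetical helper: copy a w x h rectangle of RGBA pixels from (sx, sy) in
// the source buffer to (dx, dy) in the destination buffer.
// Both buffers hold rows of 4-byte pixels; srcWidth and dstWidth are in pixels.
function blit(src, srcWidth, sx, sy, w, h, dst, dstWidth, dx, dy) {
    const BYTES_PER_PIXEL = 4;
    for (let row = 0; row < h; row++) {
        const srcStart = ((sy + row) * srcWidth + sx) * BYTES_PER_PIXEL;
        const dstStart = ((dy + row) * dstWidth + dx) * BYTES_PER_PIXEL;
        // copy one row of the rectangle
        src.copy(dst, dstStart, srcStart, srcStart + w * BYTES_PER_PIXEL);
    }
}

// Usage: copy one opaque white pixel from a 4x4 texture into a 2x2 transparent frame
const texture = Buffer.alloc(4 * 4 * 4, 255);
const frame = Buffer.alloc(2 * 2 * 4);
blit(texture, 4, 1, 1, 1, 1, frame, 2, 1, 0);
```

In the Reference Implementation the same role is played by the entries of the composite list, where 'input' is the piece and 'left' and 'top' give the destination position.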

There are a few other wrinkles that have not been mentioned so far. One is that the audio file and the video must have compatible lengths. Without this, lipsync problems often become noticeable when video segments are concatenated. The Reference Implementation contains code to measure the audio file's length and to truncate or pad the audio file if necessary, once the length of the video is known. Another wrinkle is that the Reference Implementation supports the use of secondary textures, as this can result in faster operation on larger characters and higher frame rates. Finally, some characters require that the underlying bits be cleared before a new image is composited in. With sharp, this is done by creating an empty "punchout" buffer of the same size as the image being composited, and adding this to the composition list.
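The audio adjustment can be sketched as follows. This is not the Reference Implementation's code, which is not shown here. fitAudioToVideo is a hypothetical helper, and it assumes the audio has already been decoded to raw 16-bit mono PCM in a Buffer. It pads with silence or truncates so that the audio lasts exactly as long as the video's frames.

```javascript
// Hypothetical helper: make 16-bit mono PCM audio last exactly
// frameCount / fps seconds, by truncating it or padding it with silence.
function fitAudioToVideo(pcm, sampleRate, frameCount, fps) {
    const BYTES_PER_SAMPLE = 2; // 16-bit samples
    const targetSamples = Math.round((frameCount / fps) * sampleRate);
    const targetBytes = targetSamples * BYTES_PER_SAMPLE;
    if (pcm.length >= targetBytes) {
        // audio is too long (or exact): keep only the part that fits
        return pcm.subarray(0, targetBytes);
    }
    // audio is too short: append zero-valued samples, which are silent
    return Buffer.concat([pcm, Buffer.alloc(targetBytes - pcm.length)]);
}
```

For example, 24 frames at 24 frames per second is one second of video, so audio sampled at 8,000 Hz would be fitted to 8,000 samples, or 16,000 bytes.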

Wrapping Up

This tutorial has introduced an efficient technique for generating mp4 talking character videos using the Character API.

The Character API and Reference Implementations provide you with a modern framework for building interactive and video-based character applications.

We look forward to seeing what you build with the Character API. As you embark on this journey, please keep our artistic and solution development services in mind. With over a decade of experience building solutions in this area, we look forward to being your technical and infrastructure partners.

Please send questions and comments to

Copyright © 2021 Media Semantics, Inc. All rights reserved.