Build a Video Generation SaaS in Minutes — Not Months

A step-by-step guide to shipping a video generation product using the Skytells SDK, TrueFusion Video, and third-party models like Sora 2. From npm install to production-ready SaaS.

Eighteen months ago, if you wanted to build a product that generates video, you had two options: hire a machine learning team and spend six figures training models on rented GPU clusters, or glue together three different open-source projects that each broke in different ways every time you updated a dependency.

Neither option shipped fast.

Things have changed. The Skytells API gives you access to production-grade video and image generation models through a single REST endpoint — including our own TrueFusion Video family and third-party models like Sora 2. The JavaScript SDK wraps it in a createClient call and a predict method. And if you want a head start on the frontend, there's a Next.js image generation starter on GitHub that you can fork and extend.

This post walks through building a video generation SaaS from scratch — the kind of product you could charge money for — using these tools.

What You're Building

By the end of this guide, you'll have a working application that:

  • Generates videos from text prompts using TrueFusion Video and TrueFusion Video Pro
  • Supports third-party video models like Sora 2 through the same unified API
  • Adds lip-sync capabilities with LipFusion for dubbing and translation
  • Generates promotional thumbnails with TrueFusion image models
  • Handles async processing with webhooks for longer jobs
  • Serves everything through a clean API your frontend can consume

TrueFusion Video generates high-quality video from text descriptions. TrueFusion Video Pro pushes the output further for professional workflows. And because Skytells also supports Sora 2, your users can pick whichever model fits their use case — all through the same SDK and the same endpoint.

Step 1: Set Up the SDK

Install the Skytells package:

npm install skytells

Initialize the client:

import { createClient } from 'skytells';

const skytells = createClient(process.env.SKYTELLS_API_KEY);

That's your entire infrastructure layer. No model downloads, no CUDA configuration, no Docker containers running inference servers. The client talks to api.skytells.ai and handles authentication, retries, and response parsing.

You can grab your API key from the Dashboard — it takes about 30 seconds.

Step 2: Generate Your First Video

TrueFusion Video takes a text prompt and generates video. No source footage required — describe what you want, and the model creates it.

Here's the simplest possible call:

const result = await skytells.predict({
  model: "truefusion-video",
  input: {
    prompt: "A drone shot flying over a coastal city at golden hour, waves crashing against a modern waterfront, cinematic 4K"
  },
  await: true
});

console.log(result.output); // URL to the generated video

That's it. You describe the scene, set await: true to wait for the result, and get back a URL to the finished video.

TrueFusion Video vs. TrueFusion Video Pro

TrueFusion Video covers the majority of use cases — marketing clips, product demos, social content, explainer visuals. It's fast and the output quality is strong enough to ship directly.

TrueFusion Video Pro is for professional workflows that need higher fidelity, longer duration, and finer control over motion, composition, and style. If you're building a product where the video output is the product — ad creative platforms, content generation tools, video marketing SaaS — Pro is worth the step up.

Swapping between the two is a one-line change:

// Standard quality — fast and cost-effective
const standard = await skytells.predict({
  model: "truefusion-video",
  input: { prompt: "Product showcase on a clean white background, rotating slowly" },
  await: true
});

// Pro quality — higher fidelity for professional output
const pro = await skytells.predict({
  model: "truefusion-video-pro",
  input: { prompt: "Product showcase on a clean white background, rotating slowly" },
  await: true
});

Sora 2 — Also Available Through the Same API

Skytells also supports Sora 2 as a third-party model in the catalog. If you or your users prefer OpenAI's video generation model, you access it through the exact same SDK — same predict call, same webhook pattern, same response format.

const result = await skytells.predict({
  model: "sora-2",
  input: {
    prompt: "A timelapse of a flower blooming in a sunlit greenhouse, macro lens, shallow depth of field"
  },
  await: true
});

This is a significant advantage for a SaaS product. You can offer your users a choice of video generation models — TrueFusion Video, TrueFusion Video Pro, Sora 2 — without maintaining separate integrations. One SDK, one billing system, one webhook pattern.

Step 3: Handle Longer Jobs with Webhooks

Video generation can take anywhere from a few seconds to a couple of minutes depending on duration and model. For production workloads, use async processing. Instead of blocking until the result is ready, fire off the request and let Skytells call you back when it's done.

const prediction = await skytells.predict({
  model: "truefusion-video-pro",
  input: {
    prompt: "A cinematic product reveal — camera slowly panning around a luxury watch on a marble surface, dramatic lighting, 4K"
  },
  await: false,
  webhook: {
    url: "https://your-app.com/api/webhook",
    events: ["predict.completed"]
  }
});

// prediction.id gives you a tracking ID
// Your webhook endpoint receives the result when processing finishes

Your webhook handler receives the full prediction result — output URL, processing time, status, and any metadata. You can update your database, notify the user, trigger downstream processing, or whatever your product needs.

This pattern is essential for a real SaaS. Your users describe a video, see a "processing" state in the UI, and get notified when their result is ready. No long-polling, no timeout headaches.
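On your side, the webhook endpoint can be small. Here's a minimal sketch of the handling logic, kept separate from the web framework so it's easy to test. The payload field names (`id`, `status`, `output`) are assumptions based on the prediction objects shown above; confirm the exact shape against the webhook documentation.

```javascript
// Stand-in for your database (illustrative only).
const jobs = new Map();

// Core webhook logic: record the finished prediction so the UI can
// flip from "processing" to "ready". Payload fields are assumptions.
function handlePredictionEvent(payload, store = jobs) {
  const { id, status, output } = payload;
  store.set(id, {
    status,                     // e.g. "completed"
    videoUrl: output ?? null,   // URL to the finished video
    finishedAt: new Date().toISOString(),
  });
  return store.get(id);
}

// Inside an Express route you would call it like this, acknowledging
// quickly and doing any heavy follow-up work asynchronously:
// app.post('/api/webhook', (req, res) => {
//   handlePredictionEvent(req.body);
//   res.sendStatus(200);
// });
```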

Step 4: Add Image Generation for Thumbnails

Every video product needs thumbnails. Instead of asking users to upload one or auto-extracting a frame (which usually looks terrible), generate a custom thumbnail with TrueFusion.

const thumbnail = await skytells.predict({
  model: "truefusion-pro",
  input: {
    prompt: "Professional video thumbnail, speaker presenting to camera, modern studio background, cinematic lighting, 16:9 aspect ratio",
    steps: 18,
    guidanceScale: 7.5
  },
  await: true
});

console.log(thumbnail.output); // URL to the generated image

TrueFusion Pro delivers a FID score of 1.8 (lower is better), closer to real photographs than any publicly benchmarked model. The images look natural, not "AI-generated." Your users' thumbnails will actually get clicked.

The Next.js Starter on GitHub

If you want to get the image generation side running even faster, check out the image generation starter project on the Skytells AI GitHub. It's a Next.js application with the SDK pre-configured, a clean UI for prompt input and image display, and API routes already wired up. Fork it, swap in your API key, and you have a working image generation frontend in under five minutes.

The starter handles the UI patterns that take time to build from scratch: loading states, error handling, image galleries, download buttons, and responsive layouts. You can extend it with video generation routes using the same patterns shown here.

Step 5: Build the API Layer

Here's how you might structure the backend for a video generation SaaS using Node.js:

import { createClient } from 'skytells';
import express from 'express';

const app = express();
const skytells = createClient(process.env.SKYTELLS_API_KEY);

// Text-to-video generation
app.post('/api/generate-video', async (req, res) => {
  const { prompt, model = "truefusion-video", callbackUrl } = req.body;

  const prediction = await skytells.predict({
    model, // "truefusion-video", "truefusion-video-pro", or "sora-2"
    input: { prompt },
    await: false,
    webhook: callbackUrl ? {
      url: callbackUrl,
      events: ["predict.completed"]
    } : undefined
  });

  res.json({
    predictionId: prediction.id,
    status: "processing"
  });
});

// Lip-sync for dubbing and translation
app.post('/api/lip-sync', async (req, res) => {
  const { videoUrl, audioUrl, callbackUrl } = req.body;

  const prediction = await skytells.predict({
    model: "lipfusion",
    input: {
      video: videoUrl,
      audio: audioUrl
    },
    await: false,
    webhook: callbackUrl ? {
      url: callbackUrl,
      events: ["predict.completed"]
    } : undefined
  });

  res.json({
    predictionId: prediction.id,
    status: "processing"
  });
});

// Thumbnail generation
app.post('/api/generate-thumbnail', async (req, res) => {
  const { prompt } = req.body;

  const result = await skytells.predict({
    model: "truefusion-pro", // same model used for thumbnails in Step 4
    input: { prompt },
    await: true
  });

  res.json({ imageUrl: result.output });
});

Three endpoints covering the full video product workflow: generate video from text, lip-sync existing footage, and create thumbnails. Your entire inference backend is the Skytells SDK doing the heavy lifting. You focus on your product logic — user accounts, billing, storage, the frontend experience — not on managing GPU instances or monitoring VRAM utilization.

Notice the video generation endpoint accepts a model parameter. Your users can choose between TrueFusion Video, TrueFusion Video Pro, and Sora 2 — without you maintaining separate integrations for each.
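If you expose the model parameter, it's worth validating it against an allow-list so clients can't pass arbitrary model ids to your endpoint. A minimal sketch (the helper name and the allow-list are ours, not part of the SDK):

```javascript
// Models this product supports; reject anything else up front.
const SUPPORTED_MODELS = new Set([
  'truefusion-video',
  'truefusion-video-pro',
  'sora-2',
]);

// Resolve the client-supplied model, falling back to the standard
// model when nothing is sent and rejecting unknown ids.
function resolveModel(requested) {
  if (!requested) return 'truefusion-video';
  if (!SUPPORTED_MODELS.has(requested)) {
    throw new Error(`Unsupported model: ${requested}`);
  }
  return requested;
}
```

In the endpoint above, you'd call `resolveModel(req.body.model)` before passing the result to `skytells.predict`, returning a 400 when it throws.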

Step 6: Add Lip-Sync for Dubbing and Translation

Here's where LipFusion comes in — and it solves a different problem from video generation. LipFusion is a lip-syncing model: it takes an existing video and a new audio track, and re-renders the speaker's face so their lip movements perfectly match the new audio.

This is the foundation of a video translation product. Take a video in English, provide translated audio in Japanese, and get back a video where the speaker appears to be naturally speaking Japanese.

// Translate a video to multiple languages in parallel
const languages = ['japanese', 'spanish', 'german', 'portuguese'];

const translations = await Promise.all(
  languages.map(lang =>
    skytells.predict({
      model: "lipfusion",
      input: {
        video: "https://storage.example.com/original-english.mp4",
        audio: `https://storage.example.com/audio-${lang}.mp3`
      },
      await: true
    })
  )
);

LipFusion achieves 98.2% sync accuracy across 40+ languages and 120+ dialects. It processes video at 1.5x realtime speed using a multimodal transformer architecture that detects 68 distinct facial landmarks. The output doesn't look dubbed — it looks naturally spoken.

Content localization at scale is a real business. Media companies, e-learning platforms, and marketing teams pay serious money for this. And because LipFusion and TrueFusion Video both run through the same Skytells API, you can build a product that generates videos and localizes them — all from a single integration.
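Chaining the two is straightforward because both models share the same `predict` call: generate the video, then feed its output URL into LipFusion. A sketch of that pipeline, assuming you already have the translated audio track hosted somewhere:

```javascript
// Generate a video from text, then lip-sync it to a translated
// audio track. Both steps use the same client and the same call.
async function generateAndLocalize(skytells, prompt, audioUrl) {
  // 1. Generate the source video from the text prompt.
  const video = await skytells.predict({
    model: 'truefusion-video',
    input: { prompt },
    await: true,
  });

  // 2. Re-render the speaker's lips against the translated audio.
  const localized = await skytells.predict({
    model: 'lipfusion',
    input: { video: video.output, audio: audioUrl },
    await: true,
  });

  return { original: video.output, localized: localized.output };
}
```

In production you'd run both steps async with webhooks rather than `await: true`, but the data flow is the same.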

Step 7: The Python Path

If your backend is Python, the integration is just as straightforward:

import requests

# Generate video from text
response = requests.post(
    'https://api.skytells.ai/v1/predict',
    headers={
        'Content-Type': 'application/json',
        'x-api-key': 'YOUR_API_KEY'
    },
    json={
        'model': 'truefusion-video',
        'input': {
            'prompt': 'A professional product demo, modern office setting, clean aesthetic'
        },
        'await': True
    }
)

result = response.json()
print(result['output'])

SDKs are also available for Ruby and Go. The API behaves identically regardless of language — same endpoint, same request shape, same response format.

Pricing That Makes a SaaS Viable

One question that comes up immediately when building on third-party APIs: can you charge enough to cover your costs and still make margin?

With Skytells, video generation is priced per second of output. Image generation through TrueFusion starts at $0.06 per image. These are unit costs that leave room for a SaaS margin — especially at volume, where bulk pricing kicks in.
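As a back-of-envelope check using the $0.06-per-image figure above (the $0.25 sale price is purely illustrative, not a recommendation):

```javascript
// Gross margin for a given unit cost and sale price.
function grossMarginPct(unitCost, unitPrice) {
  return ((unitPrice - unitCost) / unitPrice) * 100;
}

// $0.06 cost per image (from the pricing above), $0.25 hypothetical price.
console.log(grossMarginPct(0.06, 0.25).toFixed(0) + '%'); // 76%
```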

You can track spend in real-time through the Skytells Console, set budget alerts, and configure hard spending caps. No surprise bills.

Edge-Compatible SDK

Worth mentioning: the Skytells SDK runs at the edge. It's compatible with Cloudflare Pages Functions, Vercel Edge Functions, and any runtime that supports standard fetch. If your SaaS runs on edge infrastructure, the SDK works without modification — no Node.js-specific APIs, no filesystem dependencies.
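Because everything goes over standard fetch, you can also skip the SDK entirely at the edge and call the REST endpoint directly. A sketch, with the request shape mirroring the Python REST example below (the helper name is ours):

```javascript
// Build the fetch arguments for the REST endpoint. Uses only
// Web-standard APIs, so it runs in any edge runtime.
function buildPredictRequest(prompt, apiKey, model = 'truefusion-video') {
  return {
    url: 'https://api.skytells.ai/v1/predict',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-api-key': apiKey },
      body: JSON.stringify({ model, input: { prompt }, await: true }),
    },
  };
}

// In a Vercel Edge or Cloudflare Pages Function you'd wire it up as:
// export default async (req) => {
//   const { prompt } = await req.json();
//   const { url, init } = buildPredictRequest(prompt, env.SKYTELLS_API_KEY);
//   return fetch(url, init);
// };
```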

Where to Go from Here

The full API reference is at docs.skytells.ai. The SDK documentation, including TypeScript types and all available options, lives at docs.skytells.ai/sdks/ts/.

If you want to experiment before writing code, the Skytells Console lets you run any model interactively in the browser — fill in parameters, click Run, see the output. No setup required. It even generates code snippets in Node.js, Python, and cURL that you can copy directly into your project.

For a running frontend you can fork and customize, the skytells-ai GitHub has open-source starter projects — including the Next.js image generation app mentioned earlier.

The models are production-ready. The SDK is production-ready. The pricing works at scale. The only thing left is the product you build on top of it.

Sarah Burton

Product and strategy lead at Skytells, with a background in AI product management and development.
