Build a Video Generation SaaS in Minutes — Not Months

A step-by-step guide to shipping a video generation product using the Skytells SDK, TrueFusion Video, and third-party models like Sora 2. From npm install to production-ready SaaS.

Eighteen months ago, if you wanted to build a product that generates video, you had two options: hire a machine learning team and spend six figures training models on rented GPU clusters, or glue together three different open-source projects that each broke in different ways every time you updated a dependency.

Neither option shipped fast.

Things have changed. The Skytells API gives you access to production-grade video and image generation models through a single REST endpoint — including our own TrueFusion Video family and third-party models like Sora 2. The JavaScript SDK wraps it in a createClient call and a predict method. And if you want a head start on the frontend, there's a Next.js image generation starter on GitHub that you can fork and extend.

This post walks through building a video generation SaaS from scratch — the kind of product you could charge money for — using these tools.

What You're Building

By the end of this guide, you'll have a working application that:

  • Generates videos from text prompts using TrueFusion Video and TrueFusion Video Pro
  • Supports third-party video models like Sora 2 through the same unified API
  • Adds lip-sync capabilities with LipFusion for dubbing and translation
  • Generates promotional thumbnails with TrueFusion image models
  • Handles async processing with webhooks for longer jobs
  • Serves everything through a clean API your frontend can consume

TrueFusion Video generates high-quality video from text descriptions. TrueFusion Video Pro pushes the output further for professional workflows. And because Skytells also supports Sora 2, your users can pick whichever model fits their use case — all through the same SDK and the same endpoint.

Step 1: Set Up the SDK

Install the Skytells package:

npm install skytells

Initialize the client:

import { createClient } from 'skytells';

const skytells = createClient(process.env.SKYTELLS_API_KEY);

That's your entire infrastructure layer. No model downloads, no CUDA configuration, no Docker containers running inference servers. The client talks to api.skytells.ai and handles authentication, retries, and response parsing.

You can grab your API key from the Dashboard — it takes about 30 seconds.

Step 2: Generate Your First Video

TrueFusion Video takes a text prompt and generates video. No source footage required — describe what you want, and the model creates it.

Here's the simplest possible call:

const result = await skytells.predict({
  model: "truefusion-video",
  input: {
    prompt: "A drone shot flying over a coastal city at golden hour, waves crashing against a modern waterfront, cinematic 4K"
  },
  await: true
});

console.log(result.output); // URL to the generated video

That's it. You describe the scene, set await: true to wait for the result, and get back a URL to the finished video.

TrueFusion Video vs. TrueFusion Video Pro

TrueFusion Video covers the majority of use cases — marketing clips, product demos, social content, explainer visuals. It's fast and the output quality is strong enough to ship directly.

TrueFusion Video Pro is for professional workflows that need higher fidelity, longer duration, and finer control over motion, composition, and style. If you're building a product where the video output is the product — ad creative platforms, content generation tools, video marketing SaaS — Pro is worth the step up.

Swapping between the two is a one-line change:

// Standard quality — fast and cost-effective
const standard = await skytells.predict({
  model: "truefusion-video",
  input: { prompt: "Product showcase on a clean white background, rotating slowly" },
  await: true
});

// Pro quality — higher fidelity for professional output
const pro = await skytells.predict({
  model: "truefusion-video-pro",
  input: { prompt: "Product showcase on a clean white background, rotating slowly" },
  await: true
});

Sora 2 — Also Available Through the Same API

Skytells also supports Sora 2 as a third-party model in the catalog. If you or your users prefer OpenAI's video generation model, you access it through the exact same SDK — same predict call, same webhook pattern, same response format.

const result = await skytells.predict({
  model: "sora-2",
  input: {
    prompt: "A timelapse of a flower blooming in a sunlit greenhouse, macro lens, shallow depth of field"
  },
  await: true
});

This is a significant advantage for a SaaS product. You can offer your users a choice of video generation models — TrueFusion Video, TrueFusion Video Pro, Sora 2 — without maintaining separate integrations. One SDK, one billing system, one webhook pattern.

Step 3: Handle Longer Jobs with Webhooks

Video generation can take anywhere from a few seconds to a couple of minutes depending on duration and model. For production workloads, use async processing. Instead of blocking until the result is ready, fire off the request and let Skytells call you back when it's done.

const prediction = await skytells.predict({
  model: "truefusion-video-pro",
  input: {
    prompt: "A cinematic product reveal — camera slowly panning around a luxury watch on a marble surface, dramatic lighting, 4K"
  },
  await: false,
  webhook: {
    url: "https://your-app.com/api/webhook",
    events: ["predict.completed"]
  }
});

// prediction.id gives you a tracking ID
// Your webhook endpoint receives the result when processing finishes

Your webhook handler receives the full prediction result — output URL, processing time, status, and any metadata. You can update your database, notify the user, trigger downstream processing, or whatever your product needs.

This pattern is essential for a real SaaS. Your users describe a video, see a "processing" state in the UI, and get notified when their result is ready. No long-polling, no timeout headaches.
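On your side, the webhook endpoint can be small. Here's a minimal sketch of the handling logic, kept separate from the web framework so it's easy to test. The payload field names (`id`, `status`, `output`) are assumptions based on the prediction objects shown above; confirm the exact shape against the webhook documentation.

```javascript
// Stand-in for your database (illustrative only).
const jobs = new Map();

// Core webhook logic: record the finished prediction so the UI can
// flip from "processing" to "ready". Payload fields are assumptions.
function handlePredictionEvent(payload, store = jobs) {
  const { id, status, output } = payload;
  store.set(id, {
    status,                     // e.g. "completed"
    videoUrl: output ?? null,   // URL to the finished video
    finishedAt: new Date().toISOString(),
  });
  return store.get(id);
}

// Inside an Express route you would call it like this, acknowledging
// quickly and doing any heavy follow-up work asynchronously:
// app.post('/api/webhook', (req, res) => {
//   handlePredictionEvent(req.body);
//   res.sendStatus(200);
// });
```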

Step 4: Add Image Generation for Thumbnails

Every video product needs thumbnails. Instead of asking users to upload one or auto-extracting a frame (which usually looks terrible), generate a custom thumbnail with TrueFusion.

const thumbnail = await skytells.predict({
  model: "truefusion-pro",
  input: {
    prompt: "Professional video thumbnail, speaker presenting to camera, modern studio background, cinematic lighting, 16:9 aspect ratio",
    steps: 18,
    guidanceScale: 7.5
  },
  await: true
});

console.log(thumbnail.output); // URL to the generated image

TrueFusion Pro delivers a FID score of 1.8 (lower is better), closer to real photographs than any publicly benchmarked model. The images look natural, not "AI-generated." Your users' thumbnails will actually get clicked.

The Next.js Starter on GitHub

If you want to get the image generation side running even faster, check out the image generation starter project on the Skytells AI GitHub. It's a Next.js application with the SDK pre-configured, a clean UI for prompt input and image display, and API routes already wired up. Fork it, swap in your API key, and you have a working image generation frontend in under five minutes.

The starter handles the UI patterns that take time to build from scratch: loading states, error handling, image galleries, download buttons, and responsive layouts. You can extend it with video generation routes using the same patterns shown here.

Step 5: Build the API Layer

Here's how you might structure the backend for a video generation SaaS using Node.js:

import { createClient } from 'skytells';
import express from 'express';

const app = express();
const skytells = createClient(process.env.SKYTELLS_API_KEY);

// Text-to-video generation
app.post('/api/generate-video', async (req, res) => {
  const { prompt, model = "truefusion-video", callbackUrl } = req.body;

  const prediction = await skytells.predict({
    model, // "truefusion-video", "truefusion-video-pro", or "sora-2"
    input: { prompt },
    await: false,
    webhook: callbackUrl ? {
      url: callbackUrl,
      events: ["predict.completed"]
    } : undefined
  });

  res.json({
    predictionId: prediction.id,
    status: "processing"
  });
});

// Lip-sync for dubbing and translation
app.post('/api/lip-sync', async (req, res) => {
  const { videoUrl, audioUrl, callbackUrl } = req.body;

  const prediction = await skytells.predict({
    model: "lipfusion",
    input: {
      video: videoUrl,
      audio: audioUrl
    },
    await: false,
    webhook: callbackUrl ? {
      url: callbackUrl,
      events: ["predict.completed"]
    } : undefined
  });

  res.json({
    predictionId: prediction.id,
    status: "processing"
  });
});

// Thumbnail generation
app.post('/api/generate-thumbnail', async (req, res) => {
  const { prompt } = req.body;

  const result = await skytells.predict({
    model: "truefusion-pro", // same model used for thumbnails in Step 4
    input: { prompt },
    await: true
  });

  res.json({ imageUrl: result.output });
});

Three endpoints covering the full video product workflow: generate video from text, lip-sync existing footage, and create thumbnails. Your entire inference backend is the Skytells SDK doing the heavy lifting. You focus on your product logic — user accounts, billing, storage, the frontend experience — not on managing GPU instances or monitoring VRAM utilization.

Notice the video generation endpoint accepts a model parameter. Your users can choose between TrueFusion Video, TrueFusion Video Pro, and Sora 2 — without you maintaining separate integrations for each.
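If you expose the model parameter, it's worth validating it against an allow-list so clients can't pass arbitrary model ids to your endpoint. A minimal sketch (the helper name and the allow-list are ours, not part of the SDK):

```javascript
// Models this product supports; reject anything else up front.
const SUPPORTED_MODELS = new Set([
  'truefusion-video',
  'truefusion-video-pro',
  'sora-2',
]);

// Resolve the client-supplied model, falling back to the standard
// model when nothing is sent and rejecting unknown ids.
function resolveModel(requested) {
  if (!requested) return 'truefusion-video';
  if (!SUPPORTED_MODELS.has(requested)) {
    throw new Error(`Unsupported model: ${requested}`);
  }
  return requested;
}
```

In the endpoint above, you'd call `resolveModel(req.body.model)` before passing the result to `skytells.predict`, returning a 400 when it throws.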

Step 6: Add Lip-Sync for Dubbing and Translation

Here's where LipFusion comes in — and it solves a different problem from video generation. LipFusion is a lip-syncing model: it takes an existing video and a new audio track, and re-renders the speaker's face so their lip movements perfectly match the new audio.

This is the foundation of a video translation product. Take a video in English, provide translated audio in Japanese, and get back a video where the speaker appears to be naturally speaking Japanese.

// Translate a video to multiple languages in parallel
const languages = ['japanese', 'spanish', 'german', 'portuguese'];

const translations = await Promise.all(
  languages.map(lang =>
    skytells.predict({
      model: "lipfusion",
      input: {
        video: "https://storage.example.com/original-english.mp4",
        audio: `https://storage.example.com/audio-${lang}.mp3`
      },
      await: true
    })
  )
);

LipFusion achieves 98.2% sync accuracy across 40+ languages and 120+ dialects. It processes video at 1.5x realtime speed using a multimodal transformer architecture that detects 68 distinct facial landmarks. The output doesn't look dubbed — it looks naturally spoken.

Content localization at scale is a real business. Media companies, e-learning platforms, and marketing teams pay serious money for this. And because LipFusion and TrueFusion Video both run through the same Skytells API, you can build a product that generates videos and localizes them — all from a single integration.
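Chaining the two is straightforward because both models share the same `predict` call: generate the video, then feed its output URL into LipFusion. A sketch of that pipeline, assuming you already have the translated audio track hosted somewhere:

```javascript
// Generate a video from text, then lip-sync it to a translated
// audio track. Both steps use the same client and the same call.
async function generateAndLocalize(skytells, prompt, audioUrl) {
  // 1. Generate the source video from the text prompt.
  const video = await skytells.predict({
    model: 'truefusion-video',
    input: { prompt },
    await: true,
  });

  // 2. Re-render the speaker's lips against the translated audio.
  const localized = await skytells.predict({
    model: 'lipfusion',
    input: { video: video.output, audio: audioUrl },
    await: true,
  });

  return { original: video.output, localized: localized.output };
}
```

In production you'd run both steps async with webhooks rather than `await: true`, but the data flow is the same.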

Step 7: The Python Path

If your backend is Python, the integration is just as straightforward:

import requests

# Generate video from text
response = requests.post(
    'https://api.skytells.ai/v1/predict',
    headers={
        'Content-Type': 'application/json',
        'x-api-key': 'YOUR_API_KEY'
    },
    json={
        'model': 'truefusion-video',
        'input': {
            'prompt': 'A professional product demo, modern office setting, clean aesthetic'
        },
        'await': True
    }
)

result = response.json()
print(result['output'])

SDKs are also available for Ruby and Go. The API behaves identically regardless of language — same endpoint, same request shape, same response format.

Pricing That Makes a SaaS Viable

One question that comes up immediately when building on third-party APIs: can you charge enough to cover your costs and still make margin?

With Skytells, video generation is priced per second of output. Image generation through TrueFusion starts at $0.06 per image. These are unit costs that leave room for a SaaS margin — especially at volume, where bulk pricing kicks in.
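As a back-of-envelope check using the $0.06-per-image figure above (the $0.25 sale price is purely illustrative, not a recommendation):

```javascript
// Gross margin for a given unit cost and sale price.
function grossMarginPct(unitCost, unitPrice) {
  return ((unitPrice - unitCost) / unitPrice) * 100;
}

// $0.06 cost per image (from the pricing above), $0.25 hypothetical price.
console.log(grossMarginPct(0.06, 0.25).toFixed(0) + '%'); // 76%
```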

You can track spend in real-time through the Skytells Console, set budget alerts, and configure hard spending caps. No surprise bills.

Edge-Compatible SDK

Worth mentioning: the Skytells SDK runs at the edge. It's compatible with Cloudflare Pages Functions, Vercel Edge Functions, and any runtime that supports standard fetch. If your SaaS runs on edge infrastructure, the SDK works without modification — no Node.js-specific APIs, no filesystem dependencies.
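Because everything goes over standard fetch, you can also skip the SDK entirely at the edge and call the REST endpoint directly. A sketch, with the request shape mirroring the Python REST example below (the helper name is ours):

```javascript
// Build the fetch arguments for the REST endpoint. Uses only
// Web-standard APIs, so it runs in any edge runtime.
function buildPredictRequest(prompt, apiKey, model = 'truefusion-video') {
  return {
    url: 'https://api.skytells.ai/v1/predict',
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-api-key': apiKey },
      body: JSON.stringify({ model, input: { prompt }, await: true }),
    },
  };
}

// In a Vercel Edge or Cloudflare Pages Function you'd wire it up as:
// export default async (req) => {
//   const { prompt } = await req.json();
//   const { url, init } = buildPredictRequest(prompt, env.SKYTELLS_API_KEY);
//   return fetch(url, init);
// };
```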

Where to Go from Here

The full API reference is at docs.skytells.ai. The SDK documentation, including TypeScript types and all available options, lives at docs.skytells.ai/sdks/ts/.

If you want to experiment before writing code, the Skytells Console lets you run any model interactively in the browser — fill in parameters, click Run, see the output. No setup required. It even generates code snippets in Node.js, Python, and cURL that you can copy directly into your project.

For a running frontend you can fork and customize, the skytells-ai GitHub has open-source starter projects — including the Next.js image generation app mentioned earlier.

The models are production-ready. The SDK is production-ready. The pricing works at scale. The only thing left is the product you build on top of it.

Sarah Burton

Product and strategy lead at Skytells, with a background in AI product management and development.
