BeatFusion 2.0: How We Built an AI That Composes Full Songs from Lyrics

BeatFusion 2.0 generates complete, broadcast-ready songs — vocals, instrumentation, mixing — from a text prompt. Here's what changed, what it means for creators, and how the architecture works.

Figure: BeatFusion 2.0 generates broadcast-ready songs from lyrics and a style prompt

Most AI music tools generate loops. Short clips. Background textures that feel like stock audio wearing a slightly better outfit.

We wanted something different: give the model lyrics and a style direction, and get back a complete song — with natural-sounding vocals, structured arrangement, and broadcast-quality audio. That's what BeatFusion 2.0 does.

This post walks through what's new, what makes it work, and why it matters for anyone who creates with audio.

What BeatFusion 2.0 Actually Does

At its core, BeatFusion is a hybrid multimodal generative audio architecture. It combines transformer-based music conditioning, section-aware composition planning, neural vocal synthesis, and latent audio rendering into a single pipeline.

You provide two things: lyrics and a style description. BeatFusion handles everything else — vocal delivery, instrumentation, arrangement, mixing — and returns a full-length song in 44.1kHz stereo.

That's not a simplified explanation. That's the product.

The Numbers Behind It

Before we get into the architecture, here's where BeatFusion 2.0 lands on the benchmarks that matter:

| Metric                             | BeatFusion 2.0 | MusicGen Large | Stable Audio 2.0 |
|------------------------------------|----------------|----------------|------------------|
| FAD Score                          | 2.89           | 5.48           | 3.65             |
| Generation Speed (30s audio, H100) | 3.1s           | n/a            | n/a              |
| Max Song Length                    | 5 min          | n/a            | n/a              |
| Output Quality                     | 44.1kHz stereo | n/a            | n/a              |

FAD (Fréchet Audio Distance) measures how closely the statistical distribution of generated audio matches that of real music; lower is better. At 2.89, BeatFusion 2.0 scores lower than any publicly benchmarked model we've tested against, and the gap is significant.
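To build intuition for what FAD actually computes: both real and generated audio are embedded, a Gaussian is fit to each set of embeddings, and the Fréchet distance between the two Gaussians is the score. The sketch below is our own simplified illustration of that metric (not BeatFusion's evaluation code), assuming diagonal covariances so the closed form stays readable:

```javascript
// Fréchet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).
// Real FAD fits full covariance matrices over audio embeddings; the
// diagonal case reduces to:
//   d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))
function frechetAudioDistance(mu1, var1, mu2, var2) {
  let meanTerm = 0;
  let covTerm = 0;
  for (let i = 0; i < mu1.length; i++) {
    meanTerm += (mu1[i] - mu2[i]) ** 2;
    covTerm += var1[i] + var2[i] - 2 * Math.sqrt(var1[i] * var2[i]);
  }
  return meanTerm + covTerm;
}
```

Identical distributions score exactly 0; the further the generated embeddings drift from real music, statistically, the higher the score climbs.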

Why Full Songs Are Hard

Generating a 10-second loop is a fundamentally different problem from generating a 4-minute song. Loops are stateless. Songs are not.

A real song has temporal structure: verse, chorus, bridge, outro. It has narrative arc: the energy builds, the chorus hits harder the second time, the bridge introduces tension. And it has consistency requirements: the vocal timbre can't drift, the key shouldn't wander, the instrumentation needs to feel like one band playing together.

Most generative audio models break down here because they treat music as a flat sequence. BeatFusion treats it as a structured composition. Internally, it uses over 14 section tags — [verse], [chorus], [bridge], [intro], [outro], and more — that let the model (and the user) control the arrangement precisely.
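To make the tag mechanism concrete, here is a small illustrative parser (our own sketch, not BeatFusion internals) that splits tagged lyrics into the ordered sections the model plans around. Only the tag names listed above are assumed:

```javascript
// Split tagged lyrics (e.g. "[Verse]\n...\n[Chorus]\n...") into an
// ordered list of { section, lines } objects.
function parseSections(lyrics) {
  const sections = [];
  let current = null;
  for (const line of lyrics.split('\n')) {
    const tag = line.match(/^\[([a-z]+)\]$/i); // e.g. "[Chorus]"
    if (tag) {
      current = { section: tag[1].toLowerCase(), lines: [] };
      sections.push(current);
    } else if (current && line.trim() !== '') {
      current.lines.push(line);
    }
  }
  return sections;
}
```

Given `'[Verse]\nline one\n\n[Chorus]\nhook'`, this returns a verse section followed by a chorus section, which is exactly the structural skeleton a composition planner can condition on.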

"If your AI can generate a great 15-second riff but can't sustain coherence across a full song, you've built a synthesizer, not a composer."

Hazem Ali, CEO & Founder of Skytells, Inc.

The Three Tiers

BeatFusion ships in three variants, each targeting a different use case:

BeatFusion Standard — A 1.5B parameter model producing 32kHz stereo output. Covers 100+ genres, generates up to 2 minutes of audio. This is the entry point: fast, capable, and available via API and console.

BeatFusion Pro — 3.8B parameters, 44.1kHz broadcast-quality output. Adds melody and MIDI conditioning, stem separation (vocals, drums, bass, other), and supports songs up to 5 minutes. This is where professional workflows start.

BeatFusion Ultra — The flagship. Lossless WAV output, style transfer from reference tracks, multi-track generation with up to 8 stems, audio inpainting and outpainting. Commercial usage rights included. Built for production studios, game audio teams, and content platforms that need full creative control.
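Choosing between tiers usually comes down to two constraints: how long the song needs to be and how many stems you need back. The helper below is our own illustration built from the specs above; Ultra's parameter count isn't disclosed in this post, and its max length is assumed to be at least Pro's:

```javascript
// Tier capabilities as described above. `params: null` = not disclosed;
// Ultra's maxMinutes is an assumption (>= Pro), not a published spec.
const TIERS = [
  { name: 'Standard', params: '1.5B', sampleRate: 32000, maxMinutes: 2, stems: 0 },
  { name: 'Pro',      params: '3.8B', sampleRate: 44100, maxMinutes: 5, stems: 4 },
  { name: 'Ultra',    params: null,   sampleRate: 44100, maxMinutes: 5, stems: 8 },
];

// Pick the lowest tier that satisfies the requested length and stem count.
function pickTier({ minutes = 2, stems = 0 } = {}) {
  return TIERS.find(t => t.maxMinutes >= minutes && t.stems >= stems) || null;
}
```

A 4-minute song pushes you to Pro; needing all 8 stems pushes you to Ultra.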

What "Natural Vocals" Actually Means

Vocal synthesis is where most music AI falls flat — literally. The vocals sound robotic, the pitch transitions feel mechanical, and the breathing patterns are either absent or uncanny.

BeatFusion 2.0's vocal pipeline models realistic timbre, breathing patterns, and smooth pitch transitions. The result is vocal delivery that sits in a mix the way a recorded vocal does — not layered on top like a text-to-speech output pasted over a beat.

This matters not just aesthetically but commercially. If the vocals sound synthetic, the track sounds synthetic. And synthetic-sounding tracks don't get licensed, don't get streamed, and don't connect with listeners.

Style-Aware Mixing

One of the less obvious but most impactful features is automatic style-aware mixing. BeatFusion adjusts its mixing decisions based on the genre and style context:

  • Rock — distortion saturation, wider stereo imaging, punchy drums
  • Jazz — warm mid-range, room ambience, restrained compression
  • Electronic — tight transients, sidechain-style pumping, sub-bass emphasis
  • Orchestral — spatial depth, dynamic range preservation, section balancing

This means a jazz ballad and a drum-and-bass track don't just have different instruments — they have different sonic signatures. The mixing adapts to the musical context, the way a human engineer would.
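The genre-to-treatment mapping above can be pictured as a lookup the mixing stage consults. The field names and values below are our shorthand for the traits listed, not BeatFusion's internal representation:

```javascript
// Per-genre mixing traits from the list above, as a simple lookup.
const MIX_PRESETS = {
  rock:       { saturation: 'high', stereoWidth: 'wide',    compression: 'punchy' },
  jazz:       { saturation: 'low',  stereoWidth: 'natural', compression: 'restrained', ambience: 'room' },
  electronic: { saturation: 'mid',  stereoWidth: 'wide',    compression: 'sidechain',  bass: 'sub-emphasis' },
  orchestral: { saturation: 'none', stereoWidth: 'spatial', compression: 'minimal',    dynamics: 'preserved' },
};

// Match a free-text style prompt to the first known genre it mentions.
function mixPresetFor(style) {
  const genre = Object.keys(MIX_PRESETS).find(g => style.toLowerCase().includes(g));
  return genre ? MIX_PRESETS[genre] : null;
}
```

A prompt like "smoky jazz ballad" resolves to the restrained-compression, room-ambience treatment; an unrecognized style falls through to `null`, where a real system would infer a treatment from the audio context instead.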

For Developers: The Integration Story

BeatFusion is API-first. You can start generating songs directly from the Skytells Console or integrate via the API. Full documentation is available at docs.skytells.ai.

Here's what a complete API call looks like — lyrics in, full song out:

const response = await fetch('https://api.skytells.ai/v1/predict', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'API_KEY_HERE' // your Skytells API key
  },
  body: JSON.stringify({
    model: 'beatfusion-2.0',
    input: {
      lyrics: "[Verse]\nIn the hush of night, we find our space,\nWrapped in moonlight's gentle embrace.\nYour whisper's soft, like a velvet song,\nIn this tender moment, where we both belong.\n\n[Chorus]\nJust you and me, in this lazy jazz,\nOur souls entwined, nothing else we ask.\nIn this serenade, we sway and sigh,\nLost in this love, beneath the starry sky.\n\n[Bridge]\nYour voice, a lullaby, soothes my soul,\nIn this night, together, we feel whole.\nEach moment shared, a timeless flight,\nIn this gentle jazz, we find our light.\n\n[Outro]\nAs dawn approaches, and stars fade away,\nIn your arms, I wish to forever stay.",
      prompt: "Acoustic folk-blues, raw and intimate, front porch recording feel, fingerpicked acoustic guitar, harmonica, upright bass, warm lo-fi production, slow tempo, male vocals with gravelly texture"
    },
    await: true // block until the finished song is returned
  })
});

const result = await response.json();

That's it — lyrics with section tags, a style prompt describing the sound you want, and the model handles vocal delivery, instrumentation, arrangement, and mixing. You can explore the full API reference and SDKs for JavaScript, Python, Ruby, and Go at docs.skytells.ai.
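Since an untagged lyric sheet wastes a generation, it's worth validating input before the call leaves your server. This helper is our own (not part of the Skytells SDK); it builds the same request body as the example above and rejects lyrics that carry no section tags:

```javascript
// Build the JSON body for the predict call shown above, failing fast
// on lyrics with no section tags. The validation logic is ours.
function buildSongRequest(lyrics, prompt) {
  if (!/\[(verse|chorus|bridge|intro|outro)\]/i.test(lyrics)) {
    throw new Error('lyrics need at least one section tag, e.g. [Verse]');
  }
  return JSON.stringify({
    model: 'beatfusion-2.0',
    input: { lyrics, prompt },
    await: true,
  });
}
```

The returned string drops straight into the `body` field of the fetch call.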

Where This Gets Used

The use cases we're seeing in early access fall into clear categories:

  • Film & TV scoring — composers using BeatFusion to prototype cues, then refining with stem separation
  • Game audio — adaptive, loopable soundtracks generated from scene descriptions
  • Advertising — royalty-free jingles and brand soundscapes created in minutes instead of weeks
  • Podcast production — custom intros, outros, and background music that match the show's tone
  • Music production — producers generating stem packs, loops, and sample material as creative starting points

On Ethics and Licensing

BeatFusion is trained exclusively on licensed and royalty-free music catalogs. Generated outputs are cleared for commercial use — there's no ambiguity about rights.

The model also includes content safety layers: profanity detection and harmful content classifiers run on both input and output. We think this is table stakes for any generative model shipping to production.

Try It Now

BeatFusion 2.0 is available now. You can start generating songs immediately from the Skytells Console, explore the full model family at skytells.ai/models/beatfusion, or dive into the API docs at docs.skytells.ai.

If you're building something with audio and want to see what's possible, we'd genuinely love to hear what you create.

Hazem Ali

Hazem Ali is the CEO and founder of Skytells, Inc. He is a software engineer with over 20 years of experience in the industry, and a strong believer in the power of AI to transform industries and society.
