BeatFusion 2.0: How We Built an AI That Composes Full Songs from Lyrics
Most AI music tools generate loops. Short clips. Background textures that feel like stock audio wearing a slightly better outfit.
We wanted something different: give the model lyrics and a style direction, and get back a complete song — with natural-sounding vocals, structured arrangement, and broadcast-quality audio. That's what BeatFusion 2.0 does.
This post walks through what's new, what makes it work, and why it matters for anyone who creates with audio.
What BeatFusion 2.0 Actually Does
At its core, BeatFusion is a hybrid multimodal generative audio architecture. It combines transformer-based music conditioning, section-aware composition planning, neural vocal synthesis, and latent audio rendering into a single pipeline.
You provide two things: lyrics and a style description. BeatFusion handles everything else — vocal delivery, instrumentation, arrangement, mixing — and returns a full-length song in 44.1kHz stereo.
That's not a simplified explanation. That's the product.
The Numbers Behind It
Before we get into the architecture, here's where BeatFusion 2.0 lands on the benchmarks that matter:
| Metric | BeatFusion 2.0 | MusicGen Large | Stable Audio 2.0 |
|---|---|---|---|
| FAD Score | 2.89 | 5.48 | 3.65 |
| Generation Speed (30s audio, H100) | 3.1s | — | — |
| Max Song Length | 5 min | — | — |
| Output Quality | 44.1kHz stereo | — | — |
A FAD (Fréchet Audio Distance) score of 2.89 means the distribution of generated audio sits closer to the distribution of real music than that of any publicly benchmarked model we've tested against. Lower is better — and the gap is significant.
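For intuition about what that number measures: FAD is the Fréchet distance between two Gaussians fitted to embeddings of real and generated audio. The real metric uses full covariance matrices over embeddings from a pretrained audio model; the sketch below assumes diagonal covariance for clarity and is an illustration, not our evaluation code.

```javascript
// Fit a per-dimension mean and variance to a set of embedding vectors.
function meanAndVar(vectors) {
  const dim = vectors[0].length;
  const mean = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
  }
  const variance = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) variance[i] += (v[i] - mean[i]) ** 2 / vectors.length;
  }
  return { mean, variance };
}

// Simplified FAD, assuming diagonal covariance:
// ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * sqrt(S1 * S2)), computed per dimension.
function frechetDistance(realEmbeddings, genEmbeddings) {
  const a = meanAndVar(realEmbeddings);
  const b = meanAndVar(genEmbeddings);
  let dist = 0;
  for (let i = 0; i < a.mean.length; i++) {
    dist += (a.mean[i] - b.mean[i]) ** 2
          + a.variance[i] + b.variance[i]
          - 2 * Math.sqrt(a.variance[i] * b.variance[i]);
  }
  return dist;
}
```

Identical distributions score 0; the further the generated distribution drifts from real music, the higher the score climbs.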
Why Full Songs Are Hard
Generating a 10-second loop is a fundamentally different problem from generating a 4-minute song. Loops are stateless. Songs are not.
A real song has temporal structure: verse, chorus, bridge, outro. It has narrative arc: the energy builds, the chorus hits harder the second time, the bridge introduces tension. And it has consistency requirements: the vocal timbre can't drift, the key shouldn't wander, the instrumentation needs to feel like one band playing together.
Most generative audio models break down here because they treat music as a flat sequence. BeatFusion treats it as a structured composition. Internally, it uses over 14 section tags — [verse], [chorus], [bridge], [intro], [outro], and more — that let the model (and the user) control the arrangement precisely.
If your AI can generate a great 15-second riff but can't sustain coherence across a full song, you've built a synthesizer, not a composer.
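If you're building lyric inputs programmatically, the section-tag format is easy to assemble. The helper below is an illustrative sketch of our own making, not part of any SDK — it just joins structured sections into the `[tag]`-prefixed text the examples in this post use.

```javascript
// Assemble section-tagged lyrics from structured input.
// Each section becomes "[Tag]\nline\nline", separated by blank lines.
function buildLyrics(sections) {
  return sections
    .map(({ tag, lines }) => `[${tag}]\n${lines.join("\n")}`)
    .join("\n\n");
}

const lyrics = buildLyrics([
  { tag: "Verse", lines: ["In the hush of night, we find our space"] },
  { tag: "Chorus", lines: ["Just you and me, in this lazy jazz"] },
]);
```

The same structure makes it straightforward to reorder sections, repeat a chorus, or swap a bridge in and out without string surgery.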
The Three Tiers
BeatFusion ships in three variants, each targeting a different use case:
BeatFusion Standard — A 1.5B parameter model producing 32kHz stereo output. Covers 100+ genres, generates up to 2 minutes of audio. This is the entry point: fast, capable, and available via API and console.
BeatFusion Pro — 3.8B parameters, 44.1kHz broadcast-quality output. Adds melody and MIDI conditioning, stem separation (vocals, drums, bass, other), and supports songs up to 5 minutes. This is where professional workflows start.
BeatFusion Ultra — The flagship. Lossless WAV output, style transfer from reference tracks, multi-track generation with up to 8 stems, audio inpainting and outpainting. Commercial usage rights included. Built for production studios, game audio teams, and content platforms that need full creative control.
What "Natural Vocals" Actually Means
Vocal synthesis is where most music AI falls flat — literally. The vocals sound robotic, the pitch transitions feel mechanical, and the breathing patterns are either absent or uncanny.
BeatFusion 2.0's vocal pipeline models realistic timbre, breathing patterns, and smooth pitch transitions. The result is vocal delivery that sits in a mix the way a recorded vocal does — not layered on top like a text-to-speech output pasted over a beat.
This matters not just aesthetically but commercially. If the vocals sound synthetic, the track sounds synthetic. And synthetic-sounding tracks don't get licensed, don't get streamed, and don't connect with listeners.
Style-Aware Mixing
One of the less obvious but most impactful features is automatic style-aware mixing. BeatFusion adjusts its mixing decisions based on the genre and style context:
- Rock — distortion saturation, wider stereo imaging, punchy drums
- Jazz — warm mid-range, room ambience, restrained compression
- Electronic — tight transients, sidechain-style pumping, sub-bass emphasis
- Orchestral — spatial depth, dynamic range preservation, section balancing
This means a jazz ballad and a drum-and-bass track don't just have different instruments — they have different sonic signatures. The mixing adapts to the musical context, the way a human engineer would.
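One way to picture this is the model resolving a style prompt to a mixing profile. The sketch below is purely hypothetical — the parameter names and values are invented for illustration and are not BeatFusion's internal representation.

```javascript
// Hypothetical style-to-mixing-profile lookup. Values are illustrative
// only; BeatFusion's actual mixing decisions are learned, not tabulated.
const MIX_PROFILES = {
  rock:       { saturation: 0.7, stereoWidth: 1.3, drumPunch: 0.8 },
  jazz:       { saturation: 0.2, stereoWidth: 1.0, roomAmbience: 0.6 },
  electronic: { saturation: 0.4, stereoWidth: 1.2, subBass: 0.8 },
  orchestral: { saturation: 0.1, stereoWidth: 1.4, dynamicRange: 0.9 },
};

// Pick a profile by scanning the style prompt for a known genre keyword.
function mixProfileFor(stylePrompt) {
  const text = stylePrompt.toLowerCase();
  const genre = Object.keys(MIX_PROFILES).find((g) => text.includes(g));
  return genre ? MIX_PROFILES[genre] : null;
}
```

The real system conditions on far richer context than a keyword match, but the mental model holds: the style prompt shapes the mix, not just the instrumentation.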
For Developers: The Integration Story
BeatFusion is API-first. You can start generating songs directly from the Skytells Console or integrate via the API. Full documentation is available at docs.skytells.ai.
Here's what a complete API call looks like — lyrics in, full song out:
```javascript
const response = await fetch('https://api.skytells.ai/v1/predict', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'x-api-key': 'API_KEY_HERE'
  },
  body: JSON.stringify({
    model: "beatfusion-2.0",
    input: {
      "lyrics": "[Verse]\nIn the hush of night, we find our space,\nWrapped in moonlight's gentle embrace.\nYour whisper's soft, like a velvet song,\nIn this tender moment, where we both belong.\n\n[Chorus]\nJust you and me, in this lazy jazz,\nOur souls entwined, nothing else we ask.\nIn this serenade, we sway and sigh,\nLost in this love, beneath the starry sky.\n\n[Bridge]\nYour voice, a lullaby, soothes my soul,\nIn this night, together, we feel whole.\nEach moment shared, a timeless flight,\nIn this gentle jazz, we find our light.\n\n[Outro]\nAs dawn approaches, and stars fade away,\nIn your arms, I wish to forever stay.",
      "prompt": "Acoustic folk-blues, raw and intimate, front porch recording feel, fingerpicked acoustic guitar, harmonica, upright bass, warm lo-fi production, slow tempo, male vocals with gravelly texture"
    },
    await: true
  })
});

const result = await response.json();
```

That's it — lyrics with section tags, a style prompt describing the sound you want, and the model handles vocal delivery, instrumentation, arrangement, and mixing. You can explore the full API reference and SDKs for JavaScript, Python, Ruby, and Go at docs.skytells.ai.
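If you call the endpoint from more than one place, a small wrapper that validates inputs and builds the request body keeps things tidy. This is our own convenience helper, not an SDK function — the field names simply mirror the request shown above.

```javascript
// Build the JSON body for a BeatFusion generation request, failing fast
// on missing inputs instead of burning an API call.
function buildSongRequest({ lyrics, prompt, waitForResult = true }) {
  if (!lyrics || !lyrics.trim()) throw new Error("lyrics are required");
  if (!prompt || !prompt.trim()) throw new Error("a style prompt is required");
  return JSON.stringify({
    model: "beatfusion-2.0",
    input: { lyrics, prompt },
    await: waitForResult,
  });
}
```

Pass the returned string as the `body` of the fetch call, and the rest of the request is unchanged.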
Where This Gets Used
The use cases we're seeing in early access fall into clear categories:
- Film & TV scoring — composers using BeatFusion to prototype cues, then refining with stem separation
- Game audio — adaptive, loopable soundtracks generated from scene descriptions
- Advertising — royalty-free jingles and brand soundscapes created in minutes instead of weeks
- Podcast production — custom intros, outros, and background music that match the show's tone
- Music production — producers generating stem packs, loops, and sample material as creative starting points
On Ethics and Licensing
BeatFusion is trained exclusively on licensed and royalty-free music catalogs. Generated outputs are cleared for commercial use — there's no ambiguity about rights.
The model also includes content safety layers: profanity detection and harmful content classifiers run on both input and output. We think this is table stakes for any generative model shipping to production.
Try It Now
BeatFusion 2.0 is available now. You can start generating songs immediately from the Skytells Console, explore the full model family at skytells.ai/models/beatfusion, or dive into the API docs at docs.skytells.ai.
If you're building something with audio and want to see what's possible, we'd genuinely love to hear what you create.


