
Fudan and Tencent Propose Baton: Pioneering Semantic Blueprint Guidance for Precise Audio-Visual Synchronization

Source: 机器之心Pro · IP location: Beijing, China · Published: 2026-06-16 14:19:16



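Given a simple prompt, current audio-video generation models can usually produce audiovisual content of decent quality. Once the prompt becomes complex, however, problems start to surface.

For example, a user might ask for a scene in which a boy finishes a dribbling drill and then begins to speak; or two people say different lines one after the other as they interact; or a sound appears and gradually intensifies only after a particular action has occurred.

Instructions like these, with multi-stage actions, complex character interactions, and explicit temporal relations, require the model not only to understand what happens but also to reason correctly about when it happens, who does it, and how the sound should correspond to the picture.

Unfortunately, most open-source audio-video generation models still perform poorly on complex scenes that demand long-range semantic understanding. Their outputs often show actions misaligned with sounds, lines of dialogue attributed to the wrong character, and audio and video rhythms that drift out of sync.

The root cause is that most existing methods encode the text prompt into a single global semantic vector and use it as the conditioning signal for both video and audio generation. This provides scene-level semantic guidance, but it cannot decompose the temporal relations among complex events, nor can it specify how characters, actions, and sounds should correspond to one another.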

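The research community has explored this problem extensively. Ovi pioneered a native joint video-audio generation framework, using a dual-branch DiT architecture to model visual and auditory signals together; LTX-2.3 further scaled up the model and improved training data quality; MOVA and related work approach the problem by strengthening training strategies and cross-modal coordination mechanisms. Meanwhile, with the development of multimodal large language models, some studies have brought in models such as Qwen3, Qwen3-VL, and Qwen3-Omni, hoping to use their stronger semantic understanding and reasoning to expand, rewrite, or enrich user prompts and thereby give the generation model richer conditioning information.

However, most of these methods still follow the same paradigm: compress the complex prompt into a single global semantic representation, then inject it as a condition into the video and audio diffusion processes.

Such a design can tell the model what the scene contains, but it cannot describe which events should happen first, which should happen later, and how the corresponding visual and audio signals should stay consistent along the time axis. Without an explicit cross-modal semantic planning mechanism, video and audio are each generated independently from the same vague condition, and in complex scenes their generation trajectories gradually diverge or even conflict.

As a result, when faced with multi-stage action chains, complex character interactions, long-range causal relations, or multi-speaker dialogue, existing methods still struggle to generate stable, high-fidelity, and tightly synchronized audiovisual content.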



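To address this problem, Fudan University and the Tencent Hunyuan team propose Baton. Unlike existing methods, which drive diffusion generation directly with global text features, Baton's core idea is to decouple semantic reasoning from content generation: the model first builds a semantic blueprint (Semantic Blueprint) shared across modalities, and then generates video and audio in sync according to that blueprint.

Under this framework, video and audio no longer interpret the user prompt independently. Instead, they share one intermediate plan that contains events, characters, temporal relations, and cross-modal correspondences. Through this explicit planning step, the model can reason about complex semantic relations before generation begins and so provide stable, consistent guidance to the subsequent diffusion process.

To this end, Baton introduces two key techniques, VA-Planner and Relative Semantic RoPE. VA-Planner generates the cross-modal semantic blueprint, while Relative Semantic RoPE maps the planning information in the blueprint accurately into the generation space of the diffusion model, enabling fine-grained coordinated generation of video and audio.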



Paper title: Baton: Explicit Semantic Blueprints for Joint Video-Audio Generation
Paper: https://arxiv.org/pdf/2605.25195
Project page: https://francis-rings.github.io/Baton/

Method Overview

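As shown in the figure below, Baton explicitly decouples semantic reasoning from content generation and builds a modality-aware semantic blueprint mechanism that coordinates the diffusion denoising of video and audio.

Specifically, the user's text prompt is first fed to a multimodal large language model (MLLM) for semantic reasoning, which predicts two sets of Planned Tokens, one for the video modality and one for the audio modality. These Planned Tokens serve as a semantic blueprint shared across modalities: they specify what content should appear and in what temporal order events unfold, giving the subsequent generation process explicit content planning and temporal guidance.

To bring the blueprint into generation, the Planned Tokens are injected into the diffusion Transformer (DiT) through cross-attention, so that the model can consult the blueprint at every denoising step. The DiT follows the dual-branch architecture of Ovi: video and audio each have their own generation and denoising path, and the blueprint keeps the two in sync.

Notably, the Planned Tokens and the diffusion latents lie on different spatiotemporal grids, so their positions do not naturally correspond.

To resolve this, Baton proposes Relative Semantic RoPE (RS-RoPE), which builds a unified relative positional encoding space for the Planned Tokens and the diffusion latents. This lets the generation model map the semantic guidance accurately onto each generated unit, so that the blueprint can effectively guide the joint generation of tightly synchronized video and audio.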
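The cross-attention injection described above can be illustrated with a minimal sketch. It assumes single-head attention in which latents act as queries and Planned Tokens act as both keys and values; learned projections, multiple heads, and the real DiT block are omitted, and the function names are hypothetical rather than taken from the Baton code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(latents, planned_tokens):
    """Let every latent (query) read from the planned tokens (keys/values).

    Assumption: no learned projections; the attended plan information is
    added back to the latent as a residual.
    """
    dim = len(planned_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in latents:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in planned_tokens]
        weights = softmax(scores)
        mixed = [sum(w * v[i] for w, v in zip(weights, planned_tokens))
                 for i in range(dim)]
        outputs.append([x + m for x, m in zip(q, mixed)])
    return outputs

# Dual-branch use: each branch attends to its own half of the blueprint.
video_out = cross_attention([[1.0, 0.0]], [[0.5, 0.5]])
audio_out = cross_attention([[0.0, 1.0]], [[0.2, 0.4], [0.4, 0.2]])
```

In this toy form the two branches never exchange information directly; any coordination between them has to come from the Planned Tokens themselves, which is the role the article assigns to the shared blueprint.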



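1. VA-Planner: Cross-Modal Semantic Planning

Instead of using a global text embedding directly, VA-Planner first uses a multimodal large language model to reason explicitly about the user prompt and generates a set of Planned Tokens for each of the video and audio modalities. These tokens no longer represent only the overall scene; they also encode the semantics of local events, including what happens, where it happens, and when it happens.

Specifically, Baton arranges the video planning region and the audio planning region in a single autoregressive reasoning sequence and has the MLLM predict the corresponding semantic representations step by step.

Because the video and audio plans share the same context and sit within one reasoning process, the model can establish cross-modal associations before the generation stage begins. The resulting Planned Tokens can be viewed as a semantic blueprint shared across modalities, providing unified, fine-grained planning information for the subsequent video and audio generation.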
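To make the shared-sequence idea concrete, the following sketch lays out prompt tokens, video plan slots, and audio plan slots in one sequence and inspects an ordinary causal mask over it. The slot counts and label names are hypothetical, and placing the video region before the audio region is an assumption about the layout.

```python
def build_plan_sequence(num_prompt, num_video_slots, num_audio_slots):
    """Label each position of the joint sequence by the region it belongs to."""
    return (["prompt"] * num_prompt
            + ["video_plan"] * num_video_slots
            + ["audio_plan"] * num_audio_slots)

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def visible_regions(sequence, position):
    """Regions that the token at `position` can see under the causal mask."""
    row = causal_mask(len(sequence))[position]
    return sorted({label for label, allowed in zip(sequence, row) if allowed})

sequence = build_plan_sequence(num_prompt=2, num_video_slots=2, num_audio_slots=2)
last_video_slot = visible_regions(sequence, 3)   # sees prompt + video plan only
last_audio_slot = visible_regions(sequence, 5)   # sees prompt + video + audio plan
```

Under this layout the audio plan is conditioned on the video plan, but the reverse is not true, which is a property of any purely causal sequence.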

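2. Dual Semantic Alignment Towers: Building a Semantic Blueprint Shared by Video and Audio

Although VA-Planner can already produce semantic plans for video and audio, these representations still live in the language space of the MLLM, which differs markedly from the visual and audio feature spaces that the diffusion model actually uses.

Baton therefore adds Dual Semantic Alignment Towers, which convert the planning results into perceptual semantic representations that the generation model can use more readily.

Specifically, Baton builds a video tower and an audio tower and uses SigLip2 and WavTokenizer as the perceptual supervision targets for the respective modalities. Each tower contains a set of learnable queries (Learnable Queries) that extract the most important semantic information from the video planning representation and the audio planning representation.

More importantly, the two towers use bidirectional cross-modal attention. Because the autoregressive structure of the MLLM is inherently one-directional, the video plan cannot directly perceive the audio plan. To compensate, the video tower absorbs audio information while extracting visual semantics, and the audio tower likewise brings in visual information, yielding two-way semantic interaction. The resulting video and audio Planned Tokens are no longer two independent plans but a single blueprint that shares one time axis and one semantic structure.

To further establish cross-modal temporal correspondence, Baton also introduces Timestamp-based RoPE, which maps video keyframes and audio segments into a unified time coordinate system so that the model can correctly relate events across modalities in time. See the original paper for implementation details and the full derivation.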
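The idea of a unified time coordinate can be sketched with a standard rotary embedding whose position is a timestamp in seconds rather than a token index. The keyframe interval, the audio segment duration, and the function names below are illustrative assumptions; the article does not give the actual formulation used in Baton.

```python
import math

def video_timestamps(num_keyframes, keyframe_interval_s):
    """Start time in seconds of each video keyframe (assumed uniform spacing)."""
    return [i * keyframe_interval_s for i in range(num_keyframes)]

def audio_timestamps(num_segments, segment_duration_s):
    """Start time in seconds of each audio segment (assumed uniform spacing)."""
    return [i * segment_duration_s for i in range(num_segments)]

def rotate(vec, t, base=10000.0):
    """Standard RoPE rotation of an even-length vector at position t."""
    dim = len(vec)
    out = []
    for i in range(dim // 2):
        theta = t / (base ** (2 * i / dim))
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score(q, t_q, k, t_k):
    """Attention logit between a query at time t_q and a key at time t_k."""
    return dot(rotate(q, t_q), rotate(k, t_k))

v_times = video_timestamps(4, 0.5)    # keyframe index 2 starts at 1.0 s
a_times = audio_timestamps(8, 0.25)   # segment index 4 also starts at 1.0 s
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]
aligned = score(q, v_times[2], k, a_times[4])
```

Because rotary scores depend only on the difference between the two positions, a keyframe and an audio segment with the same timestamp interact as if no positional offset were present, even though their token indices (2 and 4 here) differ.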

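3. RS-RoPE: Grounding the Semantic Blueprint in the Generation Process

During actual generation, the Planned Tokens and the latents (Latents) of the diffusion model lie on different spatiotemporal grids. The former describe key events and semantic structure, while the latter are the concrete representations of video and audio inside the diffusion process, so there is no natural one-to-one correspondence between them. If the two interact through plain cross-attention, the model can hardly tell which part of the plan a given latent should attend to.

To solve this, Baton proposes Relative Semantic RoPE (RS-RoPE). Whereas conventional positional encoding describes only the absolute position of a token, RS-RoPE builds a unified relative semantic coordinate system and maps both the Planned Tokens and the diffusion latents into the same reference space.

With this mechanism, the diffusion model can locate the planning information most relevant to the current spatiotemporal position during denoising, so that the blueprint genuinely takes part in every generation step.

In other words, RS-RoPE acts as a precise navigation system between the blueprint and diffusion generation, ensuring that video and audio evolve together along the pre-planned semantic path.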
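The article does not give the RS-RoPE formula, so the following is only a toy illustration of the general idea of a shared relative coordinate system. It assumes each token is placed at the normalized center of the span it covers, in [0, 1], and that rotary phases are computed from that coordinate; the frequencies and function names are hypothetical.

```python
import cmath

def relative_coords(n):
    """Normalized center of each of n equal spans covering the whole clip."""
    return [(i + 0.5) / n for i in range(n)]

def rotary_phase(coord, freqs):
    """Unit complex numbers encoding a coordinate at several frequencies."""
    return [cmath.exp(1j * f * coord) for f in freqs]

def positional_score(coord_q, coord_k, freqs):
    """Purely positional logit: sum of cos(f * (coord_q - coord_k))."""
    q = rotary_phase(coord_q, freqs)
    k = rotary_phase(coord_k, freqs)
    return sum((a * b.conjugate()).real for a, b in zip(q, k))

def best_plan_index(latent_idx, num_latents, num_planned, freqs=(1.0, 2.0, 3.0)):
    """Planned token whose shared coordinate is closest to the given latent."""
    latent_coord = relative_coords(num_latents)[latent_idx]
    plan_coords = relative_coords(num_planned)
    scores = [positional_score(latent_coord, p, freqs) for p in plan_coords]
    return max(range(num_planned), key=scores.__getitem__)

# 8 latent steps guided by 4 planned tokens spanning the same clip.
mapping = [best_plan_index(i, num_latents=8, num_planned=4) for i in range(8)]
```

With raw indices, latent 7 would be far from every planned token (indices 0 to 3); in the shared coordinate system it lands next to the last planned token, which is the kind of correspondence the text describes.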

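Training Strategy

Baton is trained in three stages:

1. VA-Planner pretraining

In the first stage, the model learns to turn user prompts into cross-modal semantic plans (Planned Tokens). Supervised by real video and audio data, VA-Planner learns to generate continuous features that reflect visual and auditory perceptual structure rather than relying on natural-language embeddings alone, and thus captures richer semantic information.

2. DiT adaptation training

The second stage lets the diffusion model (DiT) learn the distribution of these semantic features. Here the DiT is trained with ground-truth features as the condition, so it becomes familiar with the regularities of video and audio generation without being disturbed by VA-Planner's prediction errors.

3. Joint fine-tuning

Finally, VA-Planner and the DiT are combined into the full system. VA-Planner's parameters are frozen, and the DiT is trained with the Planned Tokens predicted by the planner as input. This step closes the gap between ideal features and actual predictions, mitigates exposure bias, and makes generation more stable and robust.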
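The three stages above can be summarized as a small schedule table. This is a reading aid rather than the authors' training configuration: the stage and module names are hypothetical, and marking a module as "unused" in a stage is an assumption wherever the article does not mention it.

```python
# Status of each module per stage: "train", "frozen", or "unused".
STAGES = {
    "1_planner_pretraining": {"va_planner": "train", "dit": "unused"},
    # Assumption: the planner is bypassed here because the DiT is
    # conditioned on ground-truth features instead of predictions.
    "2_dit_adaptation": {"va_planner": "unused", "dit": "train"},
    "3_joint_finetuning": {"va_planner": "frozen", "dit": "train"},
}

# What the DiT is conditioned on in each stage (None when the DiT is unused).
DIT_CONDITION = {
    "1_planner_pretraining": None,
    "2_dit_adaptation": "ground_truth_features",
    "3_joint_finetuning": "predicted_planned_tokens",
}

def trainable_modules(stage):
    """Modules whose parameters receive gradient updates in this stage."""
    return sorted(name for name, status in STAGES[stage].items() if status == "train")

def describe(stage):
    """One-line human-readable summary of a stage."""
    condition = DIT_CONDITION[stage] or "n/a"
    return f"{stage}: train={trainable_modules(stage)}, dit_condition={condition}"

for stage in STAGES:
    print(describe(stage))
```

The switch of the DiT condition from ground-truth features in stage 2 to predicted Planned Tokens in stage 3 is the step the text credits with reducing exposure bias.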

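Experiments

For quantitative comparison, Baton is evaluated against previous open-source models on Verse-Bench and Sem100. Verse-Bench is an open-source test set for audio-visually consistent generation; Sem100 is an internally collected set of 100 test video samples. Compared with earlier open-source test sets, the text prompts in Sem100 are more complex: they describe characters interacting with their surroundings several times in succession, complex interactions among multiple people, and complex compound actions made up of several consecutive, specifically prescribed movements. The comparison results are shown in the table below:



The evaluation metrics cover video quality (AQ, IQ, DD, ID), audio quality (PQ, CU), audio-video synchronization (Sync-C, Sync-D, DeSync), and prompt adherence (P-Acc).

The results show that on Verse-Bench, which consists mainly of simple scenes, Baton performs close to the current leading open-source model LTX-2 overall, while on the more challenging Sem100, Baton shows a clear advantage.

Relative to LTX-2, Baton improves prompt-following accuracy (P-Acc) by 32%, multi-speaker word error rate (M-WER) by 76%, and the audio-visual desynchronization metric (DeSync) by 30%. M-WER and DeSync measure errors, so an improvement there means a lower value.

The M-WER gain is especially notable. Multi-speaker scenes require the model not only to understand what is said but also to determine correctly who says what and when. This ability depends on exactly the kind of fine-grained temporal semantic planning that Baton builds, whereas a conventional global text embedding can hardly supply such time-aligned information. The result further supports the importance of explicit semantic planning for generation from complex instructions.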
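To show why a multi-speaker word error rate is sensitive to who says what, here is a minimal sketch. It assumes M-WER pools word-level edit errors over per-speaker transcripts; the article does not define the metric, so this definition, the function names, and the toy transcripts are illustrative only.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)

def multi_speaker_wer(references, hypotheses):
    """Pooled WER over speakers; a missing speaker counts as an empty transcript."""
    errors = total = 0
    for speaker, ref in references.items():
        e, n = word_errors(ref, hypotheses.get(speaker, ""))
        errors += e
        total += n
    return errors / total

references = {"A": "you okay", "B": "yeah"}
swapped = {"A": "yeah", "B": "you okay"}   # right words, wrong speakers

correct_wer = multi_speaker_wer(references, references)
swapped_wer = multi_speaker_wer(references, swapped)
```

In the swapped case every word is spoken, yet the per-speaker score is worse than for silence, because each line is attributed to the wrong speaker. A metric of this shape therefore rewards correct speaker assignment as well as correct content.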

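The team also compared Baton with several closed-source commercial models. Although Baton still trails top commercial systems in visual quality and audio aesthetics, it is already competitive in following complex instructions.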



Generation Results



Video link: https://mp.weixin.qq.com/s/UobG7nWamiWMt45L62tCWA

Video prompt:On a vast barren beach under a pale overcast sky with haze obscuring the flat horizon, a young man with dark messy hair lies face down on the sand, wearing a thick brown hooded wool coat, sand clinging to his clothes and skin. He props himself on his elbows, looking forward. In the far background, a column of sand explodes upward among silhouetted soldiers. The young man flinches in terror and clutches his head while successive blasts draw closer, sending towering columns of sand into the air. A harsh, aggressive soundscape with deep rumbles and piercing screeches builds as the bombardment rapidly approaches his position. Close-up, low angle, rule-of-thirds composition. Natural diffused overcast daylight with soft shadows. Desaturated, monochromatic palette of beige, tan, and muted olive green with a cool color temperature. Gritty realism, tense tone.

Audio prompt:On a windswept open beach, continuous artillery explosions rumble and crash, growing progressively louder and closer. The blasts intensify from distant muffled thuds into deafening concussive roars, each one shaking the ground harder than the last. Sand and debris scatter and rain down with increasing violence. Beneath the relentless bombardment, rapid shallow breathing and a choked gasp of terror from a soldier [Speaker A] are barely audible.




Video prompt:In a indoor martial arts gym with yellow padded bars along the wall, and cool fluorescent overhead lighting, two bald men of Middle Eastern descent stand facing each other. The first, with a short beard, wears a black chest protector over a white t-shirt and remains stationary. The second, with a darker skin tone in a dark polo shirt, stands to his right as the instructor. The instructor delivers a quick punch to the first man's upper body. The camera shifts focus to the first man, who absorbs the hit. The instructor asks him a brief question; the first man nods and responds. The instructor then resumes speaking, using hand gestures to continue his explanation as the instructor looks toward the camera. Medium shot, eye-level, rule-of-thirds composition. Cool-toned overhead lighting with high contrast between brightly lit faces and deep surrounding shadows. Neutral palette of black, grey, and white accented by yellow padded bars and cool blue ambient light. Observational documentary style, focused tone.

Audio prompt:In a gym with faint ambient echo and a low-level room tone, a mature man [Speaker A] speaks in a steady, instructional tone: "Think about the idea of short distance power. If someone like this suddenly tries to headbutt me, one, I can hit him." A sharp, percussive thud of a fist striking a padded chest protector rings out immediately after. A brief pause, then [Speaker A] asks in a calm, slightly concerned tone: "You okay?" Another man [Speaker B] replies affirmatively: "Yeah." [Speaker A] continues in the same steady, instructional tone: "And he's not headbutting."




Video prompt:At dusk in a desolate clearing beside a rustic log cabin with a thatched roof, shuttered windows, and barrels leaning against its side, under bare trees and a dim overcast sky, a bearded white man with short dark hair, wearing a loose brown shirt and matching pants, squats before a small crackling campfire within a stone ring on dry grass. To his right stands a slender white teenage boy with curly dark hair in an off-white Henley shirt and light trousers, holding a crumpled grey cloth. The man rises from his squat, reaches out with both hands, and takes the cloth from the boy. He turns toward the fire, steps forward, and bends down to drape the cloth over the burning logs. Smoke rises. The boy stands still, watching intently. The man then picks up a long wooden poker from the ground and pokes the smoldering bundle beneath the cloth.

Audio prompt:A quiet outdoor dusk atmosphere with faint wind rustling dry grass. A small campfire crackles and pops within a stone ring. The soft rustle of cloth being handled. The crackling intensifies briefly as the cloth is draped over the fire, then dulls into a low, muffled smolder. A wooden poker scrapes against stone and prods the embers, stirring hissing smoke.




Video prompt:In a dimly lit interior, a close-up shows hands using a knife and fork to slice through a medium-rare steak on a white square plate. After cutting off a piece, the camera tilts upward to reveal a Caucasian woman with long wavy reddish-blonde hair, blue eyes, and fair skin, wearing dark clothing over a collared shirt. She lifts the piece of steak with the fork, brings it to her mouth, and chews slowly. After swallowing, her gaze lowers briefly, then she raises her head and stares intensely forward. Cinematic realism.

Audio prompt:A knife sawing through steak with a soft, wet slicing sound against the plate. A fork scrapes briefly. Quiet, slow chewing follows. After a pause, a single melancholic piano note rings out.




Video prompt:Inside an old car, a girl wearing a grey-white t-shirt first looks down, then smiles slightly while steering along a rural road. A small figurine sits on the dashboard. The camera then pans left to reveal a passenger wearing a colorful wrestling mask.

Audio prompt:A dramatic orchestral score with sweeping strings. The music is layered with the sounds of a vehicle engine starting and revving. A dog barks repeatedly in the background, its voice echoing slightly as if in an open space. A boy [Speaker A] shouts: "Ah"




Video prompt:On a sunny suburban backyard with green lawn and tall hedges, a woman in a ribbed sweater and black skirt rallies a shuttlecock with a boy across a badminton net. He jogs off-screen to fetch it; she turns toward camera, striding forward with arms raised in playful triumph. A second boy charges in from behind, tackling her into wrestling on the grass — she lifts him, spins him around, both laughing joyously. Handheld tracking shot follows the action, shifting from eye-level to low-angle during the lift.

Audio prompt:A fast-paced electronic dance music track with a driving beat and synthesized melodies plays throughout the clip. A boy [Speaker A] shouts excitedly in a energetic voice: "Oh no! Ten points! I'm scared! She's the winner!" A girl [Speaker B] shouts back with equal excitement: "We're the winners!"




Video prompt:On a residential street corner, a young Asian boy in bright blue shorts stands holding a brown Spalding basketball in one hand and a yellow-orange ball in the other. The camera slowly orbits from behind him onto a concrete patio beside a dense green hedge. He drops into a low stance and begins simultaneously dribbling both balls side by side, bouncing them in rhythm as he moves forward step by step along the road. Medium close-up widening to a tracking shot, eye-level shifting to slightly high-angle.

Audio prompt:Set in an outdoor environment, a young boy [Speaker A] speaks in a clear, instructional tone: "This is two ball basketball drill.". Immediately after he finishes speaking, the rhythmic, percussive sound of a basketball being dribbled on a hard surface begins and continues for the rest of the clip.




Video prompt:A young Caucasian man stands at an outdoor shooting range, holding a scoped AR-15 rifle, he fires several shots at a nearby pine tree, then reloads.

Audio prompt:In a quiet, open outdoor environment, a sharp gunshot rings out, followed by a male voice [Speaker A] saying "Ah" in a neutral tone. Immediately after, another gunshot is fired. After a brief pause, a mechanical click is heard, as if a weapon is being reloaded.




Video prompt:On a sunlit outdoor asphalt basketball court, bordered by dense green trees under a clear blue sky, a young man with short brown hair wearing dark sunglasses, a grey baseball cap, a black t-shirt and black Nike athletic shorts. He picks up a red basketball. He stands, turns away from the camera, and walks along the baseline, dribbling the ball between his legs. As he nears the free-throw line he takes a jump shot; the ball arcs over the rim and drops through the net. Medium-to-tracking shot, eye-level, with leading room guiding the viewer toward the basket. Natural high-key golden-hour daylight casts soft shadows across the grey court. Observational tone capturing the action in real time.

Audio prompt:Set in an outdoor environment with birdsong and faint rustling sounds, a young man [Speaker A] speaks in a calm, encouraging tone: "Easy peasy, baby." The sound of a ball being dribbled on a hard surface is heard, followed by a sharp impact as it hits a backboard or wall. The dribbling resumes, accompanied by the soft thud of the ball bouncing on the ground.
