4W-07
Improving Quality via Mask Guidance in Diffusion-Based Video Tuning
○卓 越, 顧 淳祉, 栗山 繁 (Toyohashi University of Technology)
Recent advances in diffusion-based text-to-image models have been successfully extended to text-to-video (T2V) generation. However, existing T2V methods still face significant challenges in maintaining temporal consistency. To address this limitation, we propose incorporating segmentation guidance into the diffusion pipeline to promote temporal stability. Our method extracts the positions of user-specified objects in each frame using object segmentation models and incorporates features of the resulting mask sequence into the diffusion process. We further extend the method to multi-object scenarios to improve stability. Experimental results demonstrate that our method improves temporal smoothness and better preserves object boundaries.
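A minimal sketch of one way such mask guidance could be applied during denoising is given below. It assumes per-frame object masks obtained from a segmentation model and blends the masked region of each frame's latent toward a temporally averaged target; the function name, the blending rule, and the weighting are illustrative assumptions, not the authors' exact formulation.

```python
# Sketch only: soft mask guidance on video latents (assumed formulation).
import torch
import torch.nn.functional as F

def mask_guided_step(latents: torch.Tensor,
                     masks: torch.Tensor,
                     guidance_weight: float = 0.3) -> torch.Tensor:
    """Blend masked latent regions toward their temporal average.

    latents: (T, C, h, w) per-frame latents at the current diffusion step
    masks:   (T, 1, H, W) binary object masks from a segmentation model
    """
    # Resize masks to the latent resolution.
    m = F.interpolate(masks.float(), size=latents.shape[-2:], mode="nearest")
    # Temporal average of masked features serves as a stability target.
    target = (latents * m).sum(dim=0, keepdim=True) / \
             m.sum(dim=0, keepdim=True).clamp_min(1e-6)
    # Pull each frame's masked region toward the target; background untouched.
    return latents * (1 - guidance_weight * m) + target * (guidance_weight * m)

# Hypothetical placement inside a denoising loop (names are illustrative):
# for t in scheduler.timesteps:
#     latents = denoise(latents, t, prompt_embeds)
#     latents = mask_guided_step(latents, object_masks)
```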