Punchline-Driven Hierarchical Facial Animation via Multimodal Large Language Models

Na Wang

Abstract


Speech-driven 3D facial animation has achieved high phonetic realism, but current models often fail to convey the expressive peaks, such as punchlines, that are critical for engaging communication. This paper introduces a novel framework that addresses this gap by leveraging a Multimodal Large Language Model (MLLM) for a deep semantic understanding of speech. Our core innovation is a system that explicitly models and animates the climax of an utterance. The framework first employs a multimodal punchline detection module to identify moments of high expressive intent from both acoustic and textual cues. This signal guides our Punchline-Driven Hierarchical Animator (PDHA), which functionally decomposes the face into distinct regions and generates motion in a coordinated cascade, allowing the punchline to dynamically amplify expression in the upper face while preserving articulatory precision in the mouth. A final cross-modal fusion decoder refines the output for precise temporal alignment. Comprehensive experiments on the VOCASET dataset show that our model not only sets a new state of the art in geometric fidelity, reducing Vertex Error by 7.8% relative to the FaceFormer baseline, but is also rated significantly more expressive and natural in user studies (p < 0.01), confirming its ability to capture the emotional impact of a punchline.
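
To make the hierarchical cascade concrete, the following is a minimal PyTorch-style sketch of how a per-frame punchline score could gate upper-face amplification while leaving mouth articulation purely audio-driven. All module names, feature dimensions, and the (1 + score) gating form are illustrative assumptions, not the paper's implementation; the vertex count 5023 follows the FLAME topology used by VOCASET.

# Hypothetical sketch of a punchline-gated hierarchical animator.
# Module names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class PunchlineGatedAnimator(nn.Module):
    def __init__(self, audio_dim=768, text_dim=768, hidden=256, n_verts=5023):
        super().__init__()
        # Fuse acoustic and textual cues into a per-frame punchline score in [0, 1].
        self.punchline_head = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # Separate decoders for functionally distinct facial regions.
        self.mouth_decoder = nn.GRU(audio_dim, hidden, batch_first=True)
        self.upper_decoder = nn.GRU(audio_dim + 1, hidden, batch_first=True)
        # In practice, per-region vertex masks would restrict each head's offsets.
        self.mouth_proj = nn.Linear(hidden, n_verts * 3)  # articulation offsets
        self.upper_proj = nn.Linear(hidden, n_verts * 3)  # brow/eye offsets

    def forward(self, audio_feat, text_feat):
        # audio_feat: (B, T, audio_dim); text_feat: (B, T, text_dim)
        score = self.punchline_head(torch.cat([audio_feat, text_feat], dim=-1))  # (B, T, 1)
        # Mouth motion is driven by audio alone, preserving articulatory precision.
        mouth_h, _ = self.mouth_decoder(audio_feat)
        mouth = self.mouth_proj(mouth_h)
        # Upper-face motion is conditioned on the punchline score, which
        # amplifies expressive displacement near the detected climax.
        upper_h, _ = self.upper_decoder(torch.cat([audio_feat, score], dim=-1))
        upper = (1.0 + score) * self.upper_proj(upper_h)
        return mouth + upper, score

Splitting the decoders per region is what would let the punchline signal boost brow and eye motion without perturbing lip-sync, matching the coordinated-cascade behavior described above.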




DOI: https://doi.org/10.31449/inf.v49i25.11394

This work is licensed under a Creative Commons Attribution 3.0 License.