Punchline-Driven Hierarchical Facial Animation via Multimodal Large Language Models

Na Wang

doi:10.31449/inf.v49i25.11394

Abstract

Speech-driven 3D facial animation has achieved high phonetic realism, but current models often fail to convey the expressive peaks, such as punchlines, that are critical for engaging communication. This paper introduces a novel framework that addresses this gap by leveraging a Multimodal Large Language Model (MLLM) for a deep, semantic understanding of speech. Our core innovation is a system that explicitly models and animates the climax of an utterance. The framework first employs a multimodal punchline detection module to identify moments of high expressive intent from both acoustic and textual cues. This signal guides our Punchline-Driven Hierarchical Animator (PDHA), which functionally decomposes the face into distinct regions and generates motion in a coordinated cascade, allowing the punchline to dynamically amplify expression in the upper face while preserving articulatory precision in the mouth. A final cross-modal fusion decoder refines the output for precise temporal alignment. Comprehensive experiments on the VOCASET dataset show that our model advances the state of the art in geometric fidelity, reducing Vertex Error by 7.8% compared to the FaceFormer baseline, and that it is rated as significantly more expressive and natural in user studies (p < 0.01), supporting its ability to capture the emotional impact of a punchline.
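
To make the region-wise amplification and the reported metric more concrete, the following is a minimal Python sketch, not the PDHA described in the paper. It assumes per-vertex displacements from a neutral face, a boolean upper-face mask, and a punchline score in [0, 1]; the linear gain `1 + max_gain * score`, the mean Euclidean form of the vertex error, and all function and parameter names are illustrative assumptions, since the abstract does not specify them.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def amplify_upper_face(
    displacements: Sequence[Vec3],
    is_upper_face: Sequence[bool],
    punchline_score: float,
    max_gain: float = 0.5,  # hypothetical ceiling on extra expressiveness
) -> List[Vec3]:
    """Scale upper-face displacements by a punchline-dependent gain.

    Vertices outside the upper-face mask (for example the mouth) are
    returned unchanged, so articulation is not altered.
    """
    if len(displacements) != len(is_upper_face):
        raise ValueError("displacements and mask must have the same length")
    # Clamp the score so that out-of-range detector outputs stay bounded.
    score = min(1.0, max(0.0, punchline_score))
    gain = 1.0 + max_gain * score  # assumed linear gain, not from the paper
    result: List[Vec3] = []
    for d, upper in zip(displacements, is_upper_face):
        g = gain if upper else 1.0
        result.append((d[0] * g, d[1] * g, d[2] * g))
    return result


def mean_vertex_error(pred: Sequence[Vec3], target: Sequence[Vec3]) -> float:
    """Mean Euclidean distance between predicted and reference vertices."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fractional error reduction relative to a baseline (0.078 means 7.8%)."""
    return (baseline_error - model_error) / baseline_error
```

With a punchline score of 0 the displacements pass through unchanged, and at a score of 1 only the masked upper-face vertices are scaled by `1 + max_gain`. A learned model would replace this fixed gain with a predicted, time-varying modulation.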

Authors

Na Wang, Hubei University of Technology

DOI:

https://doi.org/10.31449/inf.v49i25.11394

Published

12/18/2025

How to Cite

Wang, N. (2025). Punchline-Driven Hierarchical Facial Animation via Multimodal Large Language Models. Informatica, 49(25). https://doi.org/10.31449/inf.v49i25.11394

Issue

Vol. 49 No. 25 (2025): Online-only issue

Section

Online-only

License

Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.

All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.

Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.
