Multi-level Constraint-Based Two-Stage Few-Shot Knowledge Distillation for Vision-Language Models

Yantao Liu

doi:10.31449/inf.v50i13.13666

Abstract

Vision-Language Models (VLMs) exhibit strong zero-shot and few-shot capabilities on downstream tasks because they are trained to maximize the similarity between matched image-text pairs. However, their dual-encoder structure carries a large number of parameters, which limits practical deployment. Knowledge distillation addresses this by transferring the knowledge of a VLM to a lightweight student model, but it normally requires a large amount of high-quality labeled data, which is difficult to obtain in real-world scenarios. This paper proposes a Multi-level Two-stage few-shot Knowledge Distillation (MTKD) method. In the first stage, a fine-tuned VLM serves as the teacher model. Under the few-shot setting, knowledge extracted from unlabeled data is transferred to the student model through multi-level constraints (instance-level, batch-level, and class-level) to strengthen the student's few-shot knowledge representation. In the second stage, the student model from the first stage is frozen, an adapter with a residual structure is inserted at the end of the image encoder, and the adapter is trained on a small amount of labeled data for supervised refinement. Ablation experiments on six commonly used public datasets verify the effectiveness of MTKD, and comparisons with other methods demonstrate its competitiveness. MTKD achieves an average performance improvement of 3.2% across the six public datasets, with gains of up to 8.6% on individual datasets. In addition, experiments on three medical datasets show that MTKD is also applicable to medical image recognition, indicating that it transfers readily to domains whose data distribution differs significantly from the pre-training data.
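To make the second stage more concrete, the following is a minimal sketch of a residual adapter applied to the output features of a frozen image encoder. It is an illustration under stated assumptions, not the paper's implementation: the class name `ResidualAdapter`, the bottleneck layout (down-projection, ReLU, up-projection), the layer sizes, and the zero initialization of the up-projection are all hypothetical choices that the abstract does not specify.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for a single feature vector given as a list of floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


class ResidualAdapter:
    """Hypothetical bottleneck adapter: out = feat + up(relu(down(feat))).

    Only these adapter weights would be trained in the second stage;
    the student encoder that produces `feat` is assumed to stay frozen.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Down-projection: small random weights (assumed initialization).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
        self.b_down = [0.0] * hidden
        # Up-projection: zero-initialized (an assumption) so that the adapter
        # starts as the identity and leaves the distilled features unchanged.
        self.w_up = [[0.0] * hidden for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, feat):
        hidden = relu(linear(feat, self.w_down, self.b_down))
        delta = linear(hidden, self.w_up, self.b_up)
        # Residual connection: the frozen feature plus a learned correction.
        return [f + d for f, d in zip(feat, delta)]


# Stand-in for one feature vector produced by the frozen student image encoder.
frozen_feature = [0.5, -1.0, 2.0, 0.0]
adapter = ResidualAdapter(dim=4, hidden=2)
adapted_feature = adapter(frozen_feature)
```

With this assumed initialization the adapter initially returns the frozen feature unchanged, so supervised training on the few labeled samples only has to learn a correction on top of the distilled representation.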

Authors

Yantao Liu, South China University of Technology

DOI:

https://doi.org/10.31449/inf.v50i13.13666

Published

06/29/2026

Issue

Vol. 50 No. 13 (2026): Online-only issue

Section

Online-only

License

Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.

All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.

Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.

How to Cite

Multi-level Constraint-Based Two-Stage Few-Shot Knowledge Distillation for Vision-Language Models. (2026). Informatica, 50(13). https://doi.org/10.31449/inf.v50i13.13666
