Efficient Test-Time Adaptation of Vision-Language Models

1Mohamed bin Zayed University of Artificial Intelligence     2Nanyang Technological University     3University of Ottawa     4Carnegie Mellon University

CVPR 2024

(a) Test-time Prompt Tuning

(b) Training-free Dynamic Adapter (Ours)

Comparison of our proposed Training-free Dynamic Adapter (TDA) with Test-time Prompt Tuning (TPT) and its enhancement DiffTPT: both TPT and DiffTPT require significant computation to optimize learnable prompts via backpropagation, whereas TDA maintains a lightweight dynamic cache and requires no backpropagation at all, making it efficient for test-time adaptation in various real-world scenarios.

Abstract

Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts at test time. Though prior studies have achieved very promising performance, they involve intensive computation that is severely misaligned with the efficiency demands of test-time adaptation.

We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue, with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging this cache, TDA gradually adapts to test data via progressive pseudo-label refinement, which is highly efficient and incurs no backpropagation. In addition, we introduce negative pseudo labeling, which alleviates the adverse impact of pseudo-label noise by assigning pseudo labels to certain negative classes when the model is uncertain about its predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency compared with the state-of-the-art.
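To make the pseudo-labeling idea more concrete, the snippet below is a minimal PyTorch-style sketch of how positive and negative pseudo labels could be derived from CLIP's zero-shot probabilities. It is an illustration under stated assumptions, not the paper's implementation: the function name make_pseudo_labels and the thresholds tau_entropy, p_low and p_high are placeholders rather than the values used in TDA.

```python
import torch
import torch.nn.functional as F

def entropy(probs, eps=1e-8):
    """Shannon entropy of a class-probability vector (uncertainty measure)."""
    return -(probs * (probs + eps).log()).sum(dim=-1)

def make_pseudo_labels(clip_logits, tau_entropy=0.5, p_low=0.03, p_high=0.2):
    """Derive positive and negative pseudo labels from CLIP zero-shot logits.

    tau_entropy, p_low and p_high are illustrative thresholds only.
    """
    probs = clip_logits.softmax(dim=-1)
    uncertain = (entropy(probs) > tau_entropy).float()

    # Positive pseudo label: one-hot top-1 class from the CLIP prediction.
    positive = F.one_hot(probs.argmax(dim=-1), probs.shape[-1]).float()

    # Negative pseudo labels: classes with low (but non-negligible) probability
    # that the sample most likely does NOT belong to; applied only when the
    # model is uncertain about its prediction.
    negative = ((probs > p_low) & (probs < p_high)).float() * uncertain.unsqueeze(-1)
    return positive, negative
```

The positive labels would feed the positive cache, while the negative labels are only produced for uncertain predictions, which is the intuition behind negative pseudo labeling described above.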

Overview

Overview of Training-free Dynamic Adapter (TDA)

TDA constructs and updates two key-value caches to store the knowledge of a stream of test samples, and uses the two caches to generate positive and negative predictions which are combined with CLIP predictions to produce the final prediction. Specifically, the CLIP predictions are generated by performing the dot product between the image features generated by CLIP's image encoder Ev and the text embeddings generated by CLIP's text encoder Et, using the hand-crafted prompt and class names. The two key-value caches are updated by gradually incorporating the test features and their corresponding pseudo labels calculated from CLIP's predictions, based on prediction entropy and cache capacity.
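For a more concrete picture, below is a minimal PyTorch-style sketch of such a cache. It is not the authors' implementation: the class name DynamicCache, the per-class capacity, and the weights beta, alpha_p and alpha_n are illustrative assumptions. The positive and negative caches would be two instances of the same structure, whose logits are respectively added to and subtracted from the CLIP logits to form the final prediction.

```python
import torch

class DynamicCache:
    """Sketch of a TDA-style training-free key-value cache.

    keys:   L2-normalized CLIP image features of past test samples
    values: their pseudo labels (one-hot positive or multi-hot negative)
    Each pseudo class keeps at most `capacity` entries, preferring the
    lowest-entropy (most confident) ones.
    """

    def __init__(self, num_classes, capacity=3):
        self.capacity = capacity
        self.entries = {c: [] for c in range(num_classes)}  # class -> [(entropy, feature, label)]

    def update(self, feature, label, ent):
        # Bucket by the top pseudo class (a simplification of the paper's rule).
        c = int(label.argmax())
        self.entries[c].append((float(ent), feature, label))
        # Keep only the most confident entries for this class.
        self.entries[c].sort(key=lambda x: x[0])
        self.entries[c] = self.entries[c][: self.capacity]

    def logits(self, feature, beta=5.0):
        """Cache prediction: affinity-weighted combination of stored pseudo labels."""
        items = [e for bucket in self.entries.values() for e in bucket]
        if not items:
            return 0.0
        keys = torch.stack([f for _, f, _ in items])       # (N, D)
        values = torch.stack([l for _, _, l in items])     # (N, C)
        affinity = feature @ keys.t()                      # cosine similarity (features are normalized)
        weights = torch.exp(-beta * (1.0 - affinity))      # sharpened affinities
        return weights @ values                            # (C,) cache logits


# Final prediction (sketch), with illustrative weights alpha_p and alpha_n:
#   final_logits = clip_logits + alpha_p * pos_cache.logits(f) - alpha_n * neg_cache.logits(f)
```

Because querying the cache is only a matrix product over a small number of stored features, adaptation proceeds without any backpropagation, which is the source of TDA's efficiency.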


TDA Efficiency and Adaptability

We evaluated TDA's efficiency alongside other methods on the ImageNet dataset, comparing both performance and computational requirements during test-time adaptation. The table below summarizes our findings.

Comparison of efficiency and effectiveness on ImageNet

| Method         | Testing Time | Accuracy (%) | Gain (%) |
|----------------|--------------|--------------|----------|
| CLIP-ResNet-50 | 12 min       | 59.81        | 0        |
| TPT            | 12 h 50 min  | 60.74        | +0.93    |
| DiffTPT        | 34 h 45 min  | 60.80        | +0.99    |
| TDA (Ours)     | 16 min       | 61.35        | +1.54    |

Our study further assesses TDA on two primary benchmarks. First, we evaluate the model's robustness on the out-of-distribution (OOD) benchmark across four ImageNet-derived datasets. We then examine its adaptability and generalization through the cross-domain benchmark over ten diverse image classification datasets. The results of this evaluation are detailed in the tables below:

OOD Benchmark (top-1 accuracy, %)

| Method         | ImageNet (IN) | IN-A  | IN-V2 | IN-R  | IN-S  | Average | OOD Average |
|----------------|---------------|-------|-------|-------|-------|---------|-------------|
| CLIP-ResNet-50 | 59.81         | 23.24 | 52.91 | 60.72 | 35.48 | 46.43   | 43.09       |
| CoOp           | 63.33         | 23.06 | 55.40 | 56.60 | 34.67 | 46.61   | 42.43       |
| CoCoOp         | 62.81         | 23.32 | 55.72 | 57.74 | 34.48 | 46.81   | 42.82       |
| Tip-Adapter    | 62.03         | 23.13 | 53.97 | 60.35 | 35.74 | 47.04   | 43.30       |
| TPT            | 60.74         | 26.67 | 54.70 | 59.11 | 35.09 | 47.26   | 43.89       |
| DiffTPT        | 60.80         | 31.06 | 55.80 | 58.80 | 37.10 | 48.71   | 45.69       |
| TDA (Ours)     | 61.35         | 30.29 | 55.54 | 62.58 | 38.12 | 49.58   | 46.63       |

Cross-Domain Benchmark (top-1 accuracy, %)

| Method         | Aircraft | Caltech | Cars  | DTD   | EuroSAT | Flowers | Food101 | Pets  | SUN397 | UCF101 | Average |
|----------------|----------|---------|-------|-------|---------|---------|---------|-------|--------|--------|---------|
| CLIP-ResNet-50 | 16.11    | 87.26   | 55.89 | 40.37 | 25.79   | 62.77   | 74.82   | 82.97 | 60.85  | 59.48  | 56.63   |
| CoOp           | 15.12    | 86.53   | 55.32 | 37.29 | 26.20   | 61.55   | 75.59   | 87.00 | 58.15  | 59.05  | 56.18   |
| CoCoOp         | 14.61    | 87.38   | 56.22 | 38.53 | 28.73   | 65.57   | 76.20   | 88.39 | 59.61  | 57.10  | 57.23   |
| TPT            | 17.58    | 87.02   | 58.46 | 40.84 | 28.33   | 62.69   | 74.88   | 84.49 | 61.46  | 60.82  | 57.66   |
| DiffTPT        | 17.60    | 86.89   | 60.71 | 40.72 | 41.04   | 63.53   | 79.21   | 83.40 | 62.72  | 62.67  | 59.85   |
| TDA (Ours)     | 17.61    | 89.70   | 57.78 | 43.74 | 42.11   | 68.74   | 77.75   | 86.18 | 62.53  | 64.18  | 61.03   |

BibTeX

@inproceedings{karmanov2024efficient,
      title     = {Efficient Test-Time Adaptation of Vision-Language Models},
      author    = {Karmanov, Adilbek and Guan, Dayan and Lu, Shijian and El Saddik, Abdulmotaleb and Xing, Eric},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      year      = {2024}
  }