Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts at test time. Although prior studies have achieved very promising performance, they involve intensive computation, which is at odds with the efficiency requirements of test-time adaptation.
We design TDA, a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache, TDA adapts to test data gradually via progressive pseudo-label refinement, which is highly efficient and incurs no backpropagation. In addition, we introduce negative pseudo labeling, which alleviates the adverse impact of pseudo-label noise by assigning pseudo labels to certain negative classes when the model is uncertain about its predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency compared with the state of the art.
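To make the cache mechanism concrete, the snippet below is a minimal, illustrative sketch of a per-class key-value cache of the kind described above. It is not the authors' implementation: the class name `KVCache`, the per-class capacity, and the sharpness parameter `beta` are assumptions chosen for clarity.

```python
# A minimal, illustrative key-value cache of the kind described above.
# This is not the authors' implementation; the class name, capacity, and
# sharpness parameter `beta` are assumptions chosen for clarity.
import torch


class KVCache:
    """Per-class queue of low-entropy test-sample features (keys) stored
    under their pseudo labels (values)."""

    def __init__(self, capacity_per_class: int = 3):
        self.capacity = capacity_per_class
        self.items = {}  # pseudo_label -> list of (feature, entropy)

    def update(self, feature: torch.Tensor, pseudo_label: int, entropy: float):
        """Insert a new test feature, evicting the least confident entry
        once the per-class queue is full."""
        queue = self.items.setdefault(pseudo_label, [])
        if len(queue) < self.capacity:
            queue.append((feature, entropy))
        else:
            worst = max(range(len(queue)), key=lambda i: queue[i][1])
            if entropy < queue[worst][1]:
                queue[worst] = (feature, entropy)

    def predict(self, feature: torch.Tensor, num_classes: int,
                beta: float = 5.0) -> torch.Tensor:
        """Attention-style cache prediction: affinities between the query
        feature and cached keys, aggregated into per-class logits."""
        logits = torch.zeros(num_classes)
        for label, queue in self.items.items():
            keys = torch.stack([f for f, _ in queue])        # (n, d)
            affinity = (feature @ keys.t()).clamp(min=0)     # (n,)
            logits[label] = torch.exp(-beta * (1.0 - affinity)).sum()
        return logits
```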
TDA constructs and updates two key-value caches to store the knowledge of a stream of test samples, and uses them to generate positive and negative predictions that are combined with CLIP predictions to produce the final prediction. Specifically, the CLIP predictions are computed as the dot product between the image features generated by CLIP's image encoder E_v and the text embeddings generated by CLIP's text encoder E_t from hand-crafted prompts and class names. The two key-value caches are updated by gradually incorporating the test features and their corresponding pseudo labels, derived from CLIP's predictions, based on prediction entropy and cache capacity.
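Building on the `KVCache` sketch above, the following hedged sketch shows how a single test step could combine CLIP logits with positive- and negative-cache predictions and gate cache updates by prediction entropy. The function name, the mixing weights `alpha_pos` and `alpha_neg`, and the entropy threshold are illustrative assumptions, and the negative-cache handling is a simplified stand-in for the paper's negative pseudo labeling.

```python
# A hedged sketch of a single TDA-style test step, building on the KVCache
# sketch above. Mixing weights, the entropy threshold, and the simplified
# negative-cache handling are illustrative assumptions, not the paper's
# exact formulation.
import torch


def tta_step(image_feat: torch.Tensor, text_embeds: torch.Tensor,
             pos_cache: "KVCache", neg_cache: "KVCache",
             alpha_pos: float = 2.0, alpha_neg: float = 0.3,
             entropy_thresh: float = 0.5) -> torch.Tensor:
    num_classes = text_embeds.shape[0]

    # CLIP prediction: dot product between the (normalized) image feature
    # and the class text embeddings built from hand-crafted prompts.
    clip_logits = 100.0 * image_feat @ text_embeds.t()          # (C,)
    probs = clip_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    pseudo_label = int(probs.argmax())

    # Gate cache updates by prediction entropy: confident samples feed the
    # positive cache; uncertain ones feed the negative cache.
    if entropy < entropy_thresh:
        pos_cache.update(image_feat, pseudo_label, entropy)
    else:
        neg_cache.update(image_feat, pseudo_label, entropy)

    # Final prediction: CLIP logits plus positive-cache evidence minus
    # negative-cache evidence.
    return (clip_logits
            + alpha_pos * pos_cache.predict(image_feat, num_classes)
            - alpha_neg * neg_cache.predict(image_feat, num_classes))
```

In use, this step would be applied to each incoming test feature in turn, with the two caches persisting across the whole test stream; no parameters are updated, so the per-sample cost stays close to a single CLIP forward pass.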
We evaluated TDA's efficiency against other methods on the ImageNet dataset, measuring both accuracy and computational cost during test-time adaptation. The table below summarizes our findings.
Method | Testing Time | Accuracy (%) | Gain (%) |
---|---|---|---|
CLIP-ResNet-50 | 12min | 59.81 | 0 |
TPT | 12h 50min | 60.74 | +0.93 |
DiffTPT | 34h 45min | 60.80 | +0.99 |
TDA (Ours) | 16min | 61.35 | +1.54 |
Our study further assesses TDA on two benchmarks. First, we evaluate the model's robustness on the out-of-distribution (OOD) benchmark across four ImageNet-derived datasets. Second, we examine its adaptability and generalization on the cross-domain benchmark spanning ten diverse image classification datasets. The results (top-1 accuracy, %) are detailed in the two tables below, with the OOD benchmark first and the cross-domain benchmark second:
Method | ImageNet (IN) | IN-A | IN-V2 | IN-R | IN-S | Average | OOD Average |
---|---|---|---|---|---|---|---|
CLIP-ResNet-50 | 59.81 | 23.24 | 52.91 | 60.72 | 35.48 | 46.43 | 43.09 |
CoOp | 63.33 | 23.06 | 55.40 | 56.60 | 34.67 | 46.61 | 42.43 |
CoCoOp | 62.81 | 23.32 | 55.72 | 57.74 | 34.48 | 46.81 | 42.82 |
Tip-Adapter | 62.03 | 23.13 | 53.97 | 60.35 | 35.74 | 47.04 | 43.30 |
TPT | 60.74 | 26.67 | 54.70 | 59.11 | 35.09 | 47.26 | 43.89 |
DiffTPT | 60.80 | 31.06 | 55.80 | 58.80 | 37.10 | 48.71 | 45.69 |
TDA (Ours) | 61.35 | 30.29 | 55.54 | 62.58 | 38.12 | 49.58 | 46.63 |
Method | Aircraft | Caltech | Cars | DTD | EuroSAT | Flowers | Food101 | Pets | SUN397 | UCF101 | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
CLIP-ResNet-50 | 16.11 | 87.26 | 55.89 | 40.37 | 25.79 | 62.77 | 74.82 | 82.97 | 60.85 | 59.48 | 56.63 |
CoOp | 15.12 | 86.53 | 55.32 | 37.29 | 26.20 | 61.55 | 75.59 | 87.00 | 58.15 | 59.05 | 56.18 |
CoCoOp | 14.61 | 87.38 | 56.22 | 38.53 | 28.73 | 65.57 | 76.20 | 88.39 | 59.61 | 57.10 | 57.23 |
TPT | 17.58 | 87.02 | 58.46 | 40.84 | 28.33 | 62.69 | 74.88 | 84.49 | 61.46 | 60.82 | 57.66 |
DiffTPT | 17.60 | 86.89 | 60.71 | 40.72 | 41.04 | 63.53 | 79.21 | 83.40 | 62.72 | 62.67 | 59.85 |
TDA (Ours) | 17.61 | 89.70 | 57.78 | 43.74 | 42.11 | 68.74 | 77.75 | 86.18 | 62.53 | 64.18 | 61.03 |
@inproceedings{karmanov2024efficient,
  title     = {Efficient Test-Time Adaptation of Vision-Language Models},
  author    = {Karmanov, Adilbek and Guan, Dayan and Lu, Shijian and El Saddik, Abdulmotaleb and Xing, Eric},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024}
}