Compositional reasoning is usually considered a fundamental skill characterizing human perception. Recent studies show that current Vision Language Models (VLMs) surprisingly lack such capabilities. Motivated by this, we propose to thoroughly diagnose the compositional representations encoded by VLMs, systematically revealing the potential causes of this weakness. Specifically, we propose evaluation methods from a novel game-theoretic view to assess the vulnerability of VLMs to different aspects of compositional understanding, e.g., relations and attributes. Extensive experimental results demonstrate and validate several insights into why VLMs struggle with compositional reasoning, providing useful and reliable guidance for future studies.
Evaluating the sensitivity of the text encoders of VLMs to changes in textual patterns. Specifically, given captions $\mathcal{T}_1$ and $\mathcal{T}_2$ whose object words are swapped, we design $Q_O$, $Q_R$ and $Q_{R\&O}$ to assess whether the text encoder reacts correctly to such fine-grained changes of compositionality. In this case, $Q_{R\&O}$, which measures the change of interaction between object words and relation words, should take a greater value than $Q_O$ and $Q_R$. Here $\text{T}_1 \cdot \text{I}_1$ denotes the cosine similarity between the normalized text embedding $\text{T}_1$ and the normalized image embedding $\text{I}_1$.
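As a concrete illustration, here is a minimal sketch of how the image-text cosine similarity $\text{T}_1 \cdot \text{I}_1$ underlying these scores can be computed with a CLIP-style VLM from Hugging Face `transformers`. The captions and image path are hypothetical examples of ours; the game-theoretic scores $Q_O$, $Q_R$ and $Q_{R\&O}$ are built on top of such similarities and their exact definitions are given in the paper.

```python
# Minimal sketch (assumptions): a CLIP-style VLM loaded from Hugging Face
# `transformers`; T1/T2 are illustrative captions with object words swapped and
# "example.jpg" is a hypothetical image path. Only the cosine similarity
# T . I between normalized embeddings is shown here.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a dog chasing a cat", "a cat chasing a dog"]  # T1, T2: object words swapped
image = Image.open("example.jpg")                          # I1: hypothetical image

with torch.no_grad():
    text_emb = model.get_text_features(
        **processor(text=captions, return_tensors="pt", padding=True))
    image_emb = model.get_image_features(
        **processor(images=image, return_tensors="pt"))

# L2-normalize so the dot product equals the cosine similarity T . I
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)

sims = text_emb @ image_emb.T  # [T1·I1, T2·I1]
print(sims.squeeze().tolist())
```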
Insight 1: To our surprise, the text encoders of VLMs show excellent compositional reasoning capabilities and are able to recognize the dominant compositional differences between input texts, much like human understanding.
Evaluating the sensitivity of the image encoders of VLMs to changes in visual patterns. Specifically, given images $\mathcal{I}_1$ and $\mathcal{I}_2$ whose object relations are altered, we design $D_{O_1}$, $D_{O_2}$ and $D_{O_1\&O_2}$ to assess whether the image encoder reacts correctly to such fine-grained changes of visual compositionality. In this case, $D_{O_1\&O_2}$, which measures the relation change between objects, should take a greater value than $D_{O_1}$ and $D_{O_2}$.
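The following sketch illustrates one way such image-side sensitivities could be approximated, assuming hypothetical bounding boxes for the two objects and using cosine distances between crop embeddings as stand-ins for $D_{O_1}$, $D_{O_2}$ and $D_{O_1\&O_2}$; the paper's game-theoretic definitions may differ.

```python
# Minimal sketch (assumptions): a CLIP-style image encoder, hypothetical image
# files and object bounding boxes; D_{O1}, D_{O2} and D_{O1&O2} are
# approximated as cosine distances between embeddings of the corresponding
# crops of I1 and I2, which is a simplified proxy for the paper's metrics.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(img: Image.Image) -> torch.Tensor:
    """Normalized image embedding from the VLM's image encoder."""
    with torch.no_grad():
        z = model.get_image_features(**processor(images=img, return_tensors="pt"))
    return z / z.norm(dim=-1, keepdim=True)

def cosine_distance(a: Image.Image, b: Image.Image) -> float:
    return 1.0 - float((embed(a) * embed(b)).sum())

# I1, I2: images whose object relation is altered (hypothetical file names).
i1, i2 = Image.open("relation_a.jpg"), Image.open("relation_b.jpg")
box_o1, box_o2 = (0, 0, 200, 200), (200, 0, 400, 200)  # hypothetical object boxes
box_both = (0, 0, 400, 200)                             # region covering both objects

d_o1 = cosine_distance(i1.crop(box_o1), i2.crop(box_o1))
d_o2 = cosine_distance(i1.crop(box_o2), i2.crop(box_o2))
d_o1_o2 = cosine_distance(i1.crop(box_both), i2.crop(box_both))
print(d_o1, d_o2, d_o1_o2)  # expectation: D_{O1&O2} should dominate
```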
Insight 2: The image encoders of VLMs demonstrate compositional reasoning capabilities to some extent, but they are relatively weaker than the corresponding text encoders, which partially accounts for the poor compositional performance of VLMs.
Evaluating whether the image encoders and text encoders of VLMs possess mutually matching compositional knowledge, using modified sensitivity metrics. Specifically, given image-text pairs $\{\mathcal{I}_1,\mathcal{T}_1\}$ and $\{\mathcal{I}_2, \mathcal{T}_2\}$ that differ minimally in object relations, we design $Q_{\mathcal{T}:R\&O \rightarrow \mathcal{I}:O_1}$, $Q_{\mathcal{T}:R\&O \rightarrow \mathcal{I}:O_2}$ and $Q_{\mathcal{T}:R\&O \rightarrow \mathcal{I}:O_1\&O_2}$ to assess whether the image encoder captures the compositional knowledge corresponding to the text encoder. Likewise, we design $D_{\mathcal{I}:O_1\&O_2 \rightarrow \mathcal{T}:O}$, $D_{\mathcal{I}:O_1\&O_2 \rightarrow \mathcal{T}:R}$ and $D_{\mathcal{I}:O_1\&O_2 \rightarrow \mathcal{T}:R\&O}$ to assess whether the text encoder captures the compositional knowledge corresponding to the image encoder.
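The sketch below illustrates one hypothetical, simplified way to probe such cross-modal agreement: checking whether the text side and the image side rank the same perturbation as the most sensitive one. The function name, the toy scores, and this agreement measure are our own illustrative assumptions, not the paper's metrics $Q_{\mathcal{T}:R\&O \rightarrow \mathcal{I}:\cdot}$ and $D_{\mathcal{I}:O_1\&O_2 \rightarrow \mathcal{T}:\cdot}$, which couple the two encoders directly.

```python
# Minimal sketch (assumptions): per-sample sensitivity scores from the text
# side (Q_O, Q_R, Q_{R&O}) and the image side (D_{O1}, D_{O2}, D_{O1&O2}),
# e.g., computed as in the sketches above. The agreement check below is a
# hypothetical, simplified proxy for the paper's cross-modal metrics.
from typing import Sequence

def dominant_term_agreement(q_scores: Sequence[Sequence[float]],
                            d_scores: Sequence[Sequence[float]]) -> float:
    """Fraction of samples where both encoders rank the same perturbation
    (e.g., the joint object-relation change) as the most sensitive one."""
    assert len(q_scores) == len(d_scores)
    agree = sum(
        max(range(len(q)), key=q.__getitem__) == max(range(len(d)), key=d.__getitem__)
        for q, d in zip(q_scores, d_scores)
    )
    return agree / len(q_scores)

# Hypothetical toy scores for two image-text pairs, ordered as
# [objects-only, relation-only, objects-and-relation]:
print(dominant_term_agreement([[0.02, 0.03, 0.11], [0.01, 0.02, 0.08]],
                              [[0.06, 0.02, 0.05], [0.03, 0.01, 0.09]]))
# Low agreement would indicate that the two encoders do not share
# mutually matching compositional knowledge (Insight 3).
```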
Insight 3: Although the text encoders and image encoders show certain compositional reasoning capabilities individually, they do not share mutually matching compositional knowledge, which also partially accounts for the poor compositional abilities of VLMs.
@inproceedings{wang2024diagnosing,
title={Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View},
author={Wang, Jin and Dong, Shichao and Zhu, Yapeng and Yao, Kelu and Zhao, Weidong and Li, Chao and Luo, Ping},
booktitle={International Conference on Machine Learning (ICML)},
year={2024},
}