About this Article
- Authors: Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, Sungroh Yoon
- Venue: arXiv (preprint)
- Year: 2025
- Official Citation: Park, Junsung, et al. "Know 'No' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP." arXiv preprint arXiv:2501.10913 (2025).
Accomplishments
- Constructed a pipeline for generating negation caption datasets and a benchmark for testing CLIP's understanding of negation.
Key Points
1. Negation Dataset Generation Pipeline
CLIP handles negation expressions poorly because negation captions are scarce in its training data. The authors therefore build negation caption datasets with the steps below. The basic idea is to transform (augment) existing captions using an LLM.
- P1: Negation about objects (Figure 3, left)
Steps:
(1) Identify contextually plausible objects from existing captions using an LLM.
(2) Verify the absence of each object from step 1 in the image using an MLLM (used to check whether the objects actually appear in the image).
(3) Augment negation captions using the objects whose absence is confirmed in step 2 (a minimal sketch of this pipeline follows the list).
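Below is a minimal sketch of the P1 pipeline. The `ask_llm` and `ask_mllm` helpers are hypothetical stand-ins for an LLM and an MLLM; the prompts and interfaces are assumptions, not the paper's exact implementation.

```python
# Hypothetical helpers: any LLM / multimodal LLM with a text interface would work here.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a call to a hosted LLM API

def ask_mllm(image, question: str) -> str:
    raise NotImplementedError  # e.g., a call to a multimodal LLM that sees the image


def generate_p1_caption(image, caption: str):
    # (1) Identify contextually plausible objects that the caption does not mention.
    candidates = ask_llm(
        f"Caption: '{caption}'. List objects that could plausibly appear in such a "
        "scene but are not mentioned, as a comma-separated list."
    ).split(",")

    # (2) Keep only objects whose absence in the image the MLLM confirms.
    absent = [
        obj.strip()
        for obj in candidates
        if ask_mllm(image, f"Is there a {obj.strip()} in this image?").strip().lower().startswith("no")
    ]
    if not absent:
        return None

    # (3) Augment the caption with an explicit negation about a confirmed-absent object.
    return ask_llm(
        f"Rewrite the caption '{caption}' so that it also states that there is no "
        f"{absent[0]} in the image."
    )
```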
- P2: Negation beyond objects (e.g., actions) (Figure 3, right)
Steps:
(1) Extract image-question-answer pairs whose answer is "No" from a VQA dataset.
(2) Augment negation captions with an LLM (see the sketch below).
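A similar sketch for P2, reusing the hypothetical `ask_llm` helper from above; the VQA entry fields are assumed.

```python
def generate_p2_captions(vqa_dataset):
    """vqa_dataset: iterable of dicts with assumed fields 'image', 'question', 'answer'."""
    negation_captions = []
    for entry in vqa_dataset:
        # (1) Keep only image-question-answer pairs whose answer is "No".
        if entry["answer"].strip().lower() != "no":
            continue
        # (2) Turn the negative QA pair into a declarative negation caption.
        caption = ask_llm(
            f"Question: '{entry['question']}' Answer: 'No'. Write one caption for the "
            "image that expresses this fact as a negation."
        )
        negation_captions.append((entry["image"], caption))
    return negation_captions
```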
2. NegRefCOCOg, the first text-to-image retrieval benchmark for negation
The authors developed the first text-to-image retrieval benchmark that tests a model's comprehension of negation. They built it on top of a referring image segmentation dataset (RefCOCOg) for three reasons.
- Because of the nature of referring expressions in segmentation data, negations appear frequently.
- They contain various types of negation (e.g., no, not, without).
- Negations appear at various positions in a sentence and cover a range of targets (e.g., actions, attributes).
NegRefCOCOg is constructed with the steps below.
Steps:
(1) For each negation-inclusive prompt $T$, identify the corresponding image patch $P^{+}$ (the positive answer).
(2) For other instances of the same category as $P^{+}$, designate their patches as $P^{-}$ (negative answers).
(3) Filter the $T$-$P^{+}$-$P^{-}$ pairs based on constraints (e.g., $P^{+}$ is sufficiently large, and there is more than one $P^{-}$).
(4) Expand (augment) the area of the patches (a simplified sketch follows the list).
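The following is a simplified sketch of these construction steps, assuming a flattened annotation format; the field names, size threshold, and expansion ratio are illustrative assumptions rather than the paper's exact values.

```python
NEGATION_WORDS = {"no", "not", "without", "never", "none"}
MIN_POS_AREA = 32 * 32  # assumed size constraint for P+


def contains_negation(text: str) -> bool:
    return any(word in text.lower().split() for word in NEGATION_WORDS)


def area(bbox):  # bbox = (x, y, w, h)
    return bbox[2] * bbox[3]


def expand(bbox, ratio=1.2):  # (4) enlarge the patch around its center
    x, y, w, h = bbox
    dw, dh = w * (ratio - 1) / 2, h * (ratio - 1) / 2
    return (x - dw, y - dh, w * ratio, h * ratio)


def build_pairs(annotations):
    """annotations: assumed dicts with 'text', 'bbox' (referred instance),
    and 'same_category_bboxes' (other instances of the same category)."""
    pairs = []
    for ann in annotations:
        if not contains_negation(ann["text"]):
            continue
        pos, negs = ann["bbox"], ann["same_category_bboxes"]  # (1), (2)
        # (3) size and P- count constraints; at least one P- here, the paper's
        #     exact count threshold may differ.
        if area(pos) < MIN_POS_AREA or len(negs) == 0:
            continue
        pairs.extend((ann["text"], expand(pos), expand(neg)) for neg in negs)
    return pairs
```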
During evaluation, image-text similarity is used: a model scores a point for a pair if $Sim(T,P^{+})>Sim(T,P^{-})$.
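A minimal sketch of this scoring rule, using the public CLIP checkpoint on Hugging Face as a placeholder model (the benchmark evaluates NegationCLIP and other CLIP variants the same way); patches are assumed to be cropped PIL images:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def similarity(text: str, patch) -> float:
    """Cosine similarity between a prompt and an image patch (PIL image)."""
    inputs = processor(text=[text], images=[patch], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()


def negrefcocog_accuracy(pairs) -> float:
    """pairs: list of (T, P+, P-) triples built as described above."""
    correct = sum(1 for text, pos, neg in pairs if similarity(text, pos) > similarity(text, neg))
    return correct / len(pairs)
```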
Figure & Table Explanations
1. Figure 1: Examples from the original CLIP and NegationCLIP.
NegationCLIP demonstrates a better understanding of negation concepts across various tasks.
2. Table 1, Figure 2: Problems with existing datasets.
Existing datasets contain few negations (Table 1), and the negations they do contain are often unrelated to the image content (Figure 2).
3. Table 2: Performance comparison results.
NegationCLIP is CLIP fine-tuned on the authors' generated datasets (CLIP-bnl and CoN-CLIP are existing public negation-aware models). On ImageNet and COCO, NegationCLIP performed similarly to the other models, showing that the fine-tuning approach does not degrade the original performance. On VALSE and NegRefCOCOg, NegationCLIP outperformed the other models.
4. Table 3: Ablation study results.
Rand-P1 is similar to P1, but it randomly selects objects regardless of the image context.
- Rand-P1 underperformed P1, which shows that considering the image context is important when generating negation captions.
- P1+P2 showed the best performance.
5. Table 4, Figure 5: Text-to-image generation results of NegationCLIP.
To apply NegationCLIP to other tasks, the authors performed text-to-image generation by replacing the original CLIP text encoder with NegationCLIP's text encoder. The TIFA score shows that the original generation quality was not degraded, and the Neg score shows that negation handling improved.
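A hedged sketch of this encoder swap using diffusers; the NegationCLIP checkpoint path is a placeholder (the release format of the fine-tuned weights is assumed here), and the Stable Diffusion model id is just an example.

```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Placeholder path: assumes the fine-tuned text encoder is saved in Hugging Face format.
neg_text_encoder = CLIPTextModel.from_pretrained("path/to/negationclip-text-encoder")

# Override only the text encoder; the VAE, U-Net, and scheduler stay unchanged.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=neg_text_encoder,
)

image = pipe("a city street with no cars").images[0]
image.save("no_cars.png")
```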
6. Table 5, Figure 6: Referring image segmentation results of NegationCLIPSeg.
To apply NegationCLIP to another task, the authors performed referring image segmentation by replacing the original CLIPSeg text encoder with NegationCLIP's (NegationCLIPSeg). On a dataset without negation (PhraseCut), NegationCLIPSeg performed similarly to other models. On a dataset with negation (RefCOCOg), NegationCLIPSeg performed best.