About this Article
- Authors: Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, Sungroh Yoon
- Venue: arXiv (preprint)
- Year: 2025
- Official Citation: Park, Junsung, et al. "Know 'No' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP." arXiv preprint arXiv:2501.10913 (2025).
Accomplishments
- Constructed a pipeline for generating negation caption datasets and a benchmark for testing CLIP's understanding of negation.
Key Points
1. Negation Dataset Generation Pipeline
CLIP handles negation expressions poorly because negation captions are scarce in its training data. The authors therefore build negation caption datasets with the steps below. The basic idea is to transform (augment) existing captions using an LLM.
- P1: Negation about objects (Figure 3, left)
Steps:
(1) Identify contextually plausible objects from existing captions using an LLM.
(2) Verify the absence of each object from step 1 in the image using an MLLM (used to check whether the objects actually appear in the image).
(3) Augment negation captions using the objects whose absence is confirmed in step 2 (a minimal sketch of this pipeline follows the list).
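Below is a minimal sketch of the P1 pipeline. The `ask_llm` and `ask_mllm` helpers are hypothetical stand-ins for an LLM and an MLLM; the prompts and interfaces are assumptions, not the paper's exact implementation.

```python
# Hypothetical helpers: any LLM / multimodal LLM with a text interface would work here.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError  # e.g., a call to a hosted LLM API

def ask_mllm(image, question: str) -> str:
    raise NotImplementedError  # e.g., a call to a multimodal LLM that sees the image


def generate_p1_caption(image, caption: str):
    # (1) Identify contextually plausible objects that the caption does not mention.
    candidates = ask_llm(
        f"Caption: '{caption}'. List objects that could plausibly appear in such a "
        "scene but are not mentioned, as a comma-separated list."
    ).split(",")

    # (2) Keep only objects whose absence in the image the MLLM confirms.
    absent = [
        obj.strip()
        for obj in candidates
        if ask_mllm(image, f"Is there a {obj.strip()} in this image?").strip().lower().startswith("no")
    ]
    if not absent:
        return None

    # (3) Augment the caption with an explicit negation about a confirmed-absent object.
    return ask_llm(
        f"Rewrite the caption '{caption}' so that it also states that there is no "
        f"{absent[0]} in the image."
    )
```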
- P2: Negation beyond objects (e.g., actions) (Figure 3, right)
Steps:
(1) Extract image-question-answer pairs whose answer is "No" from a VQA dataset.
(2) Augment negation captions with an LLM (see the sketch below).
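A similar sketch for P2, reusing the hypothetical `ask_llm` helper from above; the VQA entry fields are assumed.

```python
def generate_p2_captions(vqa_dataset):
    """vqa_dataset: iterable of dicts with assumed fields 'image', 'question', 'answer'."""
    negation_captions = []
    for entry in vqa_dataset:
        # (1) Keep only image-question-answer pairs whose answer is "No".
        if entry["answer"].strip().lower() != "no":
            continue
        # (2) Turn the negative QA pair into a declarative negation caption.
        caption = ask_llm(
            f"Question: '{entry['question']}' Answer: 'No'. Write one caption for the "
            "image that expresses this fact as a negation."
        )
        negation_captions.append((entry["image"], caption))
    return negation_captions
```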
2. NegRefCOCOg, the first text-to-image retrieval benchmark for negation
The authors developed the first text-to-image retrieval benchmark that tests a model's comprehension of negation. They built it on top of a referring image segmentation dataset (RefCOCOg) for three reasons.
- Because of the nature of referring expressions in segmentation data, negations appear frequently.
- They contain various types of negation (e.g., no, not, without).
- Negations appear at various positions in a sentence and cover a range of targets (e.g., actions, attributes).
NegRefCOCOg is constructed with the steps below.
Steps:
(1) For each negation-inclusive prompt $T$, identify the corresponding image patch $P^{+}$ (the positive answer).
(2) For other instances of the same category as $P^{+}$, designate their patches as $P^{-}$ (negative answers).
(3) Filter the $T$-$P^{+}$-$P^{-}$ pairs based on constraints (e.g., $P^{+}$ is sufficiently large, and there is more than one $P^{-}$).
(4) Expand (augment) the area of the patches (a simplified sketch follows the list).
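The following is a simplified sketch of these construction steps, assuming a flattened annotation format; the field names, size threshold, and expansion ratio are illustrative assumptions rather than the paper's exact values.

```python
NEGATION_WORDS = {"no", "not", "without", "never", "none"}
MIN_POS_AREA = 32 * 32  # assumed size constraint for P+


def contains_negation(text: str) -> bool:
    return any(word in text.lower().split() for word in NEGATION_WORDS)


def area(bbox):  # bbox = (x, y, w, h)
    return bbox[2] * bbox[3]


def expand(bbox, ratio=1.2):  # (4) enlarge the patch around its center
    x, y, w, h = bbox
    dw, dh = w * (ratio - 1) / 2, h * (ratio - 1) / 2
    return (x - dw, y - dh, w * ratio, h * ratio)


def build_pairs(annotations):
    """annotations: assumed dicts with 'text', 'bbox' (referred instance),
    and 'same_category_bboxes' (other instances of the same category)."""
    pairs = []
    for ann in annotations:
        if not contains_negation(ann["text"]):
            continue
        pos, negs = ann["bbox"], ann["same_category_bboxes"]  # (1), (2)
        # (3) size and P- count constraints; at least one P- here, the paper's
        #     exact count threshold may differ.
        if area(pos) < MIN_POS_AREA or len(negs) == 0:
            continue
        pairs.extend((ann["text"], expand(pos), expand(neg)) for neg in negs)
    return pairs
```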
During evaluation, image-text similarity is used: a model scores a point for a pair if $Sim(T,P^{+})>Sim(T,P^{-})$.
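A minimal sketch of this scoring rule, using the public CLIP checkpoint on Hugging Face as a placeholder model (the benchmark evaluates NegationCLIP and other CLIP variants the same way); patches are assumed to be cropped PIL images:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def similarity(text: str, patch) -> float:
    """Cosine similarity between a prompt and an image patch (PIL image)."""
    inputs = processor(text=[text], images=[patch], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    return torch.nn.functional.cosine_similarity(text_emb, image_emb).item()


def negrefcocog_accuracy(pairs) -> float:
    """pairs: list of (T, P+, P-) triples built as described above."""
    correct = sum(1 for text, pos, neg in pairs if similarity(text, pos) > similarity(text, neg))
    return correct / len(pairs)
```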
Figure & Table Explanations
1. Figure 1: Examples from the original CLIP and NegationCLIP.
NegationCLIP demonstrates a better understanding of negation concepts across various tasks.
2. Table 1, Figure 2: Problems with existing datasets.
Existing datasets contain few negations (Table 1), and the negations they do contain are often unrelated to the image content (Figure 2).
3. Table 2: Performance comparison results.
NegationCLIP is CLIP fine-tuned on the authors' generated datasets (CLIP-bnl and CoN-CLIP are existing public negation-aware models). On ImageNet and COCO, NegationCLIP performed similarly to the other models, showing that the fine-tuning approach does not degrade the original performance. On VALSE and NegRefCOCOg, NegationCLIP outperformed the other models.
4. Table 3: Ablation study results.
Rand-P1 is similar to P1, but it randomly selects objects regardless of the image context.
- Rand-P1 underperformed P1, which shows that considering the image context is important when generating negation captions.
- P1+P2 showed the best performance.
5. Table 4, Figure 5: Text-to-image generation results of NegationCLIP.
To apply NegationCLIP to other tasks, the authors performed text-to-image generation by replacing the original CLIP text encoder with NegationCLIP's text encoder. The TIFA score shows that the original generation quality was not degraded, and the Neg score shows that negation handling improved.
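A hedged sketch of this encoder swap using diffusers; the NegationCLIP checkpoint path is a placeholder (the release format of the fine-tuned weights is assumed here), and the Stable Diffusion model id is just an example.

```python
from diffusers import StableDiffusionPipeline
from transformers import CLIPTextModel

# Placeholder path: assumes the fine-tuned text encoder is saved in Hugging Face format.
neg_text_encoder = CLIPTextModel.from_pretrained("path/to/negationclip-text-encoder")

# Override only the text encoder; the VAE, U-Net, and scheduler stay unchanged.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=neg_text_encoder,
)

image = pipe("a city street with no cars").images[0]
image.save("no_cars.png")
```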
6. Table 5, Figure 6: Referring image segmentation results of NegationCLIPSeg.
To apply NegationCLIP to another task, the authors performed referring image segmentation by replacing the original CLIPSeg text encoder with NegationCLIP's (NegationCLIPSeg). On a dataset without negation (PhraseCut), NegationCLIPSeg performed similarly to other models. On a dataset with negation (RefCOCOg), NegationCLIPSeg performed best.