[ACL'26] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

1School of Software, Shandong University,
2Department of Computing, Hong Kong Polytechnic University,
3School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)

*Corresponding author.

Abstract

Traditional CIR and Baseline Performance in Multi-Modification Scenarios

(a) An example of traditional CIR. (b) Performance comparison of representative baselines on CIR datasets in the original and multi-modification scenarios (all models are trained on the original FashionIQ).



Data Generation Pipeline

Construction pipeline for our proposed multi-modification CIR datasets.



Framework: Text-oriented Entity Mapping Architecture (TEMA)

Overall architecture of our proposed TEMA.


Experiment

Performance comparison on M-FashionIQ and M-CIRR in terms of R@K (%). The overall best results are in bold, and the best results among the baselines are underlined. The Avg metric on M-CIRR denotes (R@5 + Rsubset@1) / 2.
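For readers unfamiliar with the metrics in this caption, the following is a minimal sketch of how R@K and the M-CIRR Avg could be computed. It assumes each query has exactly one ground-truth target and that a ranked candidate list is already available; the function names and toy data are hypothetical and are not taken from the TEMA codebase.

```python
def recall_at_k(ranked_ids, target_id, k):
    """Return 1.0 if the target appears among the top-k candidates, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(rankings, targets, k):
    """R@K as a percentage, averaged over all queries."""
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return 100.0 * sum(hits) / len(hits)


def cirr_avg(r_at_5, r_subset_at_1):
    """Avg as defined in the caption: (R@5 + Rsubset@1) / 2."""
    return (r_at_5 + r_subset_at_1) / 2.0


# Toy example (hypothetical data): two queries, three candidates each.
rankings = [["a", "b", "c"], ["c", "a", "b"]]
targets = ["b", "b"]

r_at_1 = mean_recall_at_k(rankings, targets, 1)
r_at_2 = mean_recall_at_k(rankings, targets, 2)
r_at_3 = mean_recall_at_k(rankings, targets, 3)
```

Rsubset@1 would be computed the same way with k = 1, but with each ranking restricted to the query's candidate subset rather than the full gallery.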


Qualitative results for the PA module, showing the MMT and its corresponding summary. The to-be-modified entities are colored.


Attention visualization results for the reference image on M-FashionIQ, using the PA-generated summary.


Attention visualization results for the reference image on M-CIRR, using the PA-generated summary.


BibTeX


@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}