[ACL'26] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Zixu Li¹, Yupeng Hu¹*, Zhiheng Fu¹, Zhiwei Chen¹, Yongqi Li², Liqiang Nie³

¹School of Software, Shandong University
²Department of Computing, Hong Kong Polytechnic University
³School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)
*Corresponding author.


Traditional CIR and Baseline Performance in Multi-Modification Scenarios

(a) An example of traditional CIR. (b) Performance comparison of representative baselines on CIR datasets in the original and multi-modification scenarios (all models are trained on the original FashionIQ).

Data Generation Pipeline

Construction pipeline for our proposed multi-modification CIR datasets.

Framework: Text-oriented Entity Mapping Architecture (TEMA)

Overall architecture of our proposed TEMA.

Experiment

Performance comparison on M-FashionIQ and M-CIRR in terms of R@K (%). The overall best results are in bold, and the best results among the baselines are underlined. The Avg metric on M-CIRR denotes (R@5 + Rsubset@1) / 2.
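The metrics named in this caption can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names, the toy data, and the reading of Rsubset@K as recall computed after restricting each ranking to a per-query candidate subset (the usual CIRR convention) are assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(rankings: Sequence[Sequence[str]],
                targets: Sequence[str], k: int) -> float:
    """Percentage of queries whose target is among the top-k ranked candidates."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    hits = sum(target in ranking[:k] for ranking, target in zip(rankings, targets))
    return 100.0 * hits / len(targets)


def subset_recall_at_k(rankings: Sequence[Sequence[str]],
                       targets: Sequence[str],
                       subsets: Sequence[Sequence[str]], k: int) -> float:
    """Recall@k after keeping only each query's subset candidates (assumed definition)."""
    restricted = []
    for ranking, subset in zip(rankings, subsets):
        allowed = set(subset)
        # Preserve the original ranking order while dropping out-of-subset items.
        restricted.append([c for c in ranking if c in allowed])
    return recall_at_k(restricted, targets, k)


def cirr_avg(r_at_5: float, r_subset_at_1: float) -> float:
    """Avg metric from the caption: (R@5 + Rsubset@1) / 2."""
    return (r_at_5 + r_subset_at_1) / 2


# Hypothetical toy data: two queries, six candidate images each.
rankings = [["a", "b", "c", "d", "e", "f"],
            ["f", "e", "d", "c", "b", "a"]]
targets = ["b", "e"]
subsets = [["b", "c", "d"], ["f", "e", "d"]]

r5 = recall_at_k(rankings, targets, k=5)                     # 100.0
rs1 = subset_recall_at_k(rankings, targets, subsets, k=1)    # 50.0
avg = cirr_avg(r5, rs1)                                      # 75.0
```

In this toy case both targets fall within the top five, but only the first query ranks its target first inside its subset, so Avg is the mean of 100.0 and 50.0.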

Qualitative results for the PA module, showing each MMT and its corresponding summary. The entities to be modified are highlighted in color.

Attention visualization for the reference image on M-FashionIQ, produced with the PA-generated summary.

Attention visualization for the reference image on M-CIRR, produced with the PA-generated summary.

BibTeX


@inproceedings{TEMA,
  title={TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval},
  author={Li, Zixu and Hu, Yupeng and Fu, Zhiheng and Chen, Zhiwei and Li, Yongqi and Nie, Liqiang},
  booktitle={Proceedings of the Association for Computational Linguistics (ACL)},
  year={2026}
}