Abstract: Cross-modal image-text retrieval enables efficient interaction between heterogeneous modalities through vision-language semantic alignment, advancing multimodal intelligence applications. However, ...