Abstract: To address the challenge of inter-modal alignment and the inability of any single method to fully exploit cross-modal semantic information in multimodal representation learning, this paper ...