Publications

* Equal Contribution | † Corresponding Author

Cross-modal Retrieval

  1. SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval
    Ruixiang Zhao*, Zhihao Xu*, Bangxiang Lan, Zijie Xin, Jingyu Liu, and Xirong Li†
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026
  2. Music Grounding by Short Video
    Zijie Xin, Minquan Wang, Jingyu Liu, Ye Ma, Quan Chen, Peng Jiang, and Xirong Li†
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025
  3. Learning Partially-Decorrelated Common Spaces for Ad-hoc Video Search
    Fan Hu, Zijie Xin, and Xirong Li†
    In Proceedings of the 33rd ACM International Conference on Multimedia (ACMMM), 2025
  4. DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search
    Yuchuan Deng, Zhanpeng Hu, Zijie Xin, Chuang Deng, and Qijun Zhao†
    In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2025
  5. Holistic Features are almost Sufficient for Text-to-Video Retrieval
    Kaibin Tian*, Ruixiang Zhao*, Zijie Xin, Bangxiang Lan, and Xirong Li†
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

Medical AI

  1. Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data
    Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, and Xirong Li†
    arXiv preprint arXiv:2604.08322, 2026

Generative Model

  1. Multi-Object Sketch Animation by Scene Decomposition and Motion Planning
    Jingyu Liu, Zijie Xin, Yuhan Fu, Ruixiang Zhao, Bangxiang Lan, and Xirong Li†
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

Video Large Language Model

Coming soon...