My research focuses on Embodied AI and Generative Models, with a particular emphasis on training robot foundation models capable of performing a wide range of tasks in our daily lives. I believe that co-training with internet data and leveraging pre-trained generative models (e.g., VLMs and video diffusion models) are crucial for developing effective generalist robot policies.
"Set your course by the stars, not by the lights of passing ships."
Honors and Awards
[2024.07] RSS 2024 Best Paper Award Finalist.
[2022.06] Outstanding Graduate of Tsinghua University (Top 10%).
[2017.11] 34th Chinese Physics Olympiad (CPhO), Silver Medal.
Selected Publications
Papers related to Robotic Foundation Model training are highlighted.
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang*, Yanjiang Guo*, Yucheng Hu*, Xiaoyu Chen, Jianyu Chen
arXiv, 2025 · arXiv · VLA Model.
We incorporate both multi-modal understanding (MMU) and future prediction into the VLA model, enhancing both high-level semantic knowledge and low-level visual dynamics.
Improving Vision-Language-Action Model with Online Reinforcement Learning
Yanjiang Guo*, Jianke Zhang*, Xiaoyu Chen*, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
ICRA, 2025 · arXiv / twitter1 / twitter2 · VLA Model.
We present an initial exploration of using online RL to improve VLA models. Since online RL for VLA can be extremely unstable, we adopt an iterative approach.
Prediction with Action: Visual Policy Learning via Joint Denoising Process
Yanjiang Guo*, Yucheng Hu*, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu#, Jianyu Chen#
NeurIPS, 2024 · project page / code / arXiv · Video Generation (or World Model) based Robotic Foundation Model.
We jointly predict future images and robot actions in a unified DiT network, transferring physical knowledge from internet video data to robots.
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang*, Yanjiang Guo*, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
CoRL, 2024 · arXiv / twitter / 机器之心 · VLA Model.
We fine-tune pretrained VLMs into VLA models with hierarchical transformers, preserving their generalization ability while achieving a much higher control frequency.