My research focuses on Embodied AI and Generative Models, with a particular emphasis on training robot foundation models capable of performing a wide range of tasks in the physical world. I prefer simple and scalable methods :)
"Set your course by the stars, not by the lights of passing ships."
Honors and Awards:
[2024.07] RSS 2024 Best Paper Award Finalist.
[2022.06] Outstanding Graduates of Tsinghua University (Top 10%).
[2017.11] 34th Chinese Physics Olympiad (CPhO), Silver Medal.
Papers related to Robotic Foundation Model training are highlighted.
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang*, Yanjiang Guo*, Yucheng Hu*, Xiaoyu Chen, Jianyu Chen
ICML, 2025   arXiv / code   VLA Model
We incorporate both multi-modal understanding (MMU) and future prediction into the VLA model, enhancing both high-level semantic knowledge and low-level visual dynamics.
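A minimal sketch of this unified design, assuming illustrative module names and sizes (this is not the released UP-VLA code): one shared transformer backbone feeds an understanding head, a future-prediction head, and an action head, so all three losses shape the same representation.

```python
# Hypothetical sketch, not the paper's implementation.
import torch
import torch.nn as nn

class UnifiedVLA(nn.Module):
    def __init__(self, dim=512, vocab=32000, img_tokens=256, act_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.mmu_head = nn.Linear(dim, vocab)        # multi-modal understanding (text)
        self.pred_head = nn.Linear(dim, img_tokens)  # future-frame token prediction
        self.act_head = nn.Linear(dim, act_dim)      # low-level action readout

    def forward(self, tokens):
        h = self.backbone(tokens)            # (B, T, dim) shared features
        pooled = h.mean(dim=1)               # simple pooling for the action head
        return self.mmu_head(h), self.pred_head(h), self.act_head(pooled)

model = UnifiedVLA()
x = torch.randn(2, 64, 512)                  # fused vision-language tokens (placeholder)
text_logits, future_logits, action = model(x)
print(action.shape)                          # torch.Size([2, 7])
```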
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Yucheng Hu*, Yanjiang Guo*, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, Jianyu Chen
ICML, 2025 (Spotlight, top 2.6%)   project page / code / arXiv / twitter / 机器之心 / 量子位   Video Generation (or World Model) based Robotic Foundation Model
We finetune a general-purpose video diffusion model into a manipulation-focused video prediction model to guide policy learning.
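As a rough sketch of this recipe (the `VideoPredictorStub`, sizes, and pooling below are placeholders, not the VPP release): a frozen video-prediction backbone supplies predictive features, and only a lightweight policy head is trained on top.

```python
# Hypothetical sketch under the stated assumptions.
import torch
import torch.nn as nn

class VideoPredictorStub(nn.Module):
    """Stand-in for a finetuned video diffusion model's feature extractor."""
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.Conv3d(3, dim, kernel_size=(2, 8, 8), stride=(2, 8, 8))

    @torch.no_grad()                          # frozen: features only, no gradients
    def forward(self, video):                 # video: (B, 3, T, H, W)
        feat = self.encoder(video)            # predictive spatio-temporal features
        return feat.flatten(2).mean(-1)       # (B, dim) pooled representation

class PolicyHead(nn.Module):
    def __init__(self, dim=256, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, act_dim))

    def forward(self, feat):
        return self.net(feat)

predictor, policy = VideoPredictorStub(), PolicyHead()
video = torch.randn(2, 3, 4, 64, 64)          # short clip (placeholder data)
action = policy(predictor(video))
print(action.shape)                           # torch.Size([2, 7])
```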
Improving Vision-Language-Action Model with Online Reinforcement Learning
Yanjiang Guo*, Jianke Zhang*, Xiaoyu Chen*, Xiang Ji, Yen-Jen Wang, Yucheng Hu, Jianyu Chen
ICRA, 2025   arXiv / twitter1 / twitter2   VLA Model
We make an initial exploration of leveraging online RL to improve the VLA model! We find that online RL for VLA can be extremely unstable, so we adopt an iterative approach.
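The iterative pattern might look like the sketch below, which uses reward-filtered behavior cloning as a simple stand-in for the paper's actual RL stage; the environment and policy here are toy placeholders.

```python
# Hypothetical sketch: alternating rollout collection and supervised
# updates on high-return trajectories; not the paper's algorithm.
import torch
import torch.nn as nn

class TinyVLA(nn.Module):
    """Toy stand-in policy: maps an observation vector to an action."""
    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

def rollout(policy, steps=5, obs_dim=8):
    """Toy environment: random observations, reward = -|action|."""
    traj, ret = [], 0.0
    for _ in range(steps):
        obs = torch.randn(obs_dim)
        with torch.no_grad():
            action = policy(obs)
        traj.append((obs, action))
        ret += -action.abs().sum().item()
    return traj, ret

policy = TinyVLA()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(3):                            # outer iterations
    rollouts = [rollout(policy) for _ in range(8)]
    rollouts.sort(key=lambda x: x[1], reverse=True)
    best = rollouts[:4]                       # keep the highest-return half
    for traj, _ in best:                      # supervised step on filtered data
        for obs, action in traj:
            loss = (policy(obs) - action).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```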
Prediction with Action: Visual Policy Learning via Joint Denoising Process
Yanjiang Guo*, Yucheng Hu*, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu#, Jianyu Chen#
NeurIPS, 2024   project page / code / arXiv   Video Generation (or World Model) based Robotic Foundation Model
We jointly predict future images and robot actions in a unified DiT network, transferring physical knowledge from internet video data to robots.
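A minimal sketch of joint denoising, with illustrative token counts and a plain transformer standing in for the DiT: image tokens and action tokens are concatenated into one sequence, so a single network predicts the noise for both modalities.

```python
# Hypothetical sketch, not the PAD release.
import torch
import torch.nn as nn

class JointDenoiser(nn.Module):
    def __init__(self, dim=256, n_img=64, n_act=8):
        super().__init__()
        self.n_img = n_img
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.dit = nn.TransformerEncoder(layer, num_layers=4)
        self.out = nn.Linear(dim, dim)        # predicts per-token noise

    def forward(self, img_tok, act_tok):
        x = torch.cat([img_tok, act_tok], dim=1)   # one joint sequence
        eps = self.out(self.dit(x))
        return eps[:, :self.n_img], eps[:, self.n_img:]

model = JointDenoiser()
img_noisy = torch.randn(2, 64, 256)           # noised future-frame tokens
act_noisy = torch.randn(2, 8, 256)            # noised action tokens
eps_img, eps_act = model(img_noisy, act_noisy)
loss = eps_img.pow(2).mean() + eps_act.pow(2).mean()  # placeholder targets
```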
HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
Jianke Zhang*, Yanjiang Guo*, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
CoRL, 2024   arXiv / twitter / 机器之心   VLA Model
We finetune a pretrained VLM into a VLA model with hierarchical transformers, preserving its generalization ability while achieving a much higher control frequency.
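The hierarchy can be sketched as follows (module names and sizes are hypothetical): a slow VLM refreshes a latent every K steps, while a small fast policy consumes the cached latent at every control step, which is what buys the higher frequency.

```python
# Hypothetical sketch of the slow/fast split, not the HiRT code.
import torch
import torch.nn as nn

slow_vlm = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
fast_policy = nn.Sequential(nn.Linear(128 + 32, 64), nn.ReLU(), nn.Linear(64, 7))

K = 10                                        # slow model runs once per K steps
latent = torch.zeros(128)
for step in range(30):
    if step % K == 0:
        vl_obs = torch.randn(512)             # fused vision-language features
        with torch.no_grad():
            latent = slow_vlm(vl_obs)         # expensive, infrequent update
    proprio = torch.randn(32)                 # cheap per-step robot state
    action = fast_policy(torch.cat([latent, proprio]))  # runs every step
```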