The Chinese University of Hong Kong


CUHK develops VLM with spatial intelligence to improve AI robotic manipulation in complex tasks

30 Apr 2026

A research team from CUHK's Faculty of Engineering has developed a VLM integrated with spatial intelligence, allowing robots to autonomously perform complex, long-horizon manipulation tasks with various objects and further enhancing AI's analytical capabilities.
Professor Liu Yun-hui (3rd left), Professor Dou Qi (2nd right) and research team members.

An illustration of the VLM technology with spatial intelligence.

The proposed framework demonstrates scalability across diverse tasks and platforms. It enables the precise execution of spatial instructions and manipulation (1st and 3rd left), facilitates dexterous manipulation on humanoid robot platforms (2nd left), and leverages tactile feedback to achieve adaptive visuo-tactile grasping capabilities (1st right).

A research team from The Chinese University of Hong Kong (CUHK)'s Faculty of Engineering has developed a Vision-Language Model (VLM) integrated with spatial intelligence. This breakthrough enables robots to comprehend 3D spatial information much as humans do and scales to visuo-tactile fusion[1], allowing them to autonomously perform complex, long-horizon manipulation tasks with various objects and further enhancing AI's analytical capabilities. The findings have been published in the renowned international journal Science Robotics.

Although current VLMs allow robots to accurately understand human language instructions, they still lack a deep understanding of the 3D spatial relationships among objects, making it difficult to generate accurate plans for long-horizon manipulation tasks. To enhance the spatial understanding of VLMs, the CUHK team proposed a novel method called Retrieval-augmented Manipulation (RAM). This approach allows robots to simultaneously answer two critical questions during the planning process: what action to take at each step and how such actions can be executed feasibly in 3D space.
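
The release describes this "what + how" planning cycle only in prose. Purely as an illustrative sketch of such a loop, it might look like the following Python, where every name (propose_next_action, retrieve, is_feasible, predict_next_state) is invented here and is not the paper's API:

```python
# Purely illustrative: none of these names come from the RAM paper.

def plan_long_horizon(instruction, scene, vlm, kb, max_steps=20):
    """Alternate between the VLM (WHAT to do next) and a 3D knowledge
    base (HOW to do it feasibly) until the task is fully planned."""
    plan = []
    for _ in range(max_steps):
        # 1. WHAT: the VLM proposes the next symbolic action.
        step = vlm.propose_next_action(instruction, scene, plan)
        if step is None:                       # VLM judges the task complete
            return plan
        # 2. Retrieve candidate 3D groundings for the target object.
        candidates = kb.retrieve(step.target_object)
        # 3. HOW: keep a grounding that is physically feasible in this scene.
        grounding = next((g for g in candidates
                          if kb.is_feasible(g, scene)), None)
        if grounding is None:
            vlm.report_infeasible(step)        # ask the VLM to replan the step
            continue
        plan.append((step, grounding))
        # Predict the intermediate state so later steps plan from it.
        scene = kb.predict_next_state(scene, step, grounding)
    return plan
```

The key design idea the release describes is this interleaving: the language model alone decides the sequence of actions, while the retrieved 3D records veto any step that cannot actually be executed.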

The team constructed a structured 3D object knowledge base for the robot, cataloguing the 3D geometries, stable placement configurations and graspable affordances of a variety of everyday objects. When generating a manipulation plan, the VLM retrieves relevant geometric and manipulation records from the knowledge base in real time. It evaluates physical feasibility to determine action sequences and intermediate states, while grounding abstract instructions in explicit spatial constraints. This equips the AI robots with the capability to handle long-horizon task manipulation.
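
Again purely as illustration of the kind of structure described above, a minimal sketch of such a knowledge base might look like this; the record fields and the retrieve lookup are assumptions about one possible shape for the data, not the paper's actual schema:

```python
from __future__ import annotations
from dataclasses import dataclass

# Illustrative sketch only; field and method names are not the paper's.

@dataclass
class ObjectRecord:
    name: str                      # e.g. "mug"
    mesh_path: str                 # 3D geometry of the object
    stable_poses: list[tuple]      # placement configurations it can rest in
    grasp_affordances: list[dict]  # graspable regions and approach directions

class KnowledgeBase:
    """Structured store the planner queries at run time."""
    def __init__(self, records: list[ObjectRecord]):
        self._by_name = {r.name: r for r in records}

    def retrieve(self, object_name: str) -> ObjectRecord | None:
        # Look up the geometric and manipulation records for one object so
        # the planner can ground abstract instructions in spatial constraints.
        return self._by_name.get(object_name)

# Usage: populate the base once, then let the planner query it per step.
kb = KnowledgeBase([
    ObjectRecord(
        name="mug",
        mesh_path="meshes/mug.obj",
        stable_poses=[(0, 0, 0), (90, 0, 0)],  # upright / lying on its side
        grasp_affordances=[{"part": "handle", "approach": "side"}],
    ),
])
record = kb.retrieve("mug")
```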

The research deeply integrates vision-driven spatial intelligence with the long-horizon task planning capabilities of VLMs. With the structured 3D object knowledge base in place, the VLM can dynamically retrieve an object's geometric and manipulation records when planning long-horizon operations. This approach effectively extends the VLM's language-level understanding and reasoning capability to complex 3D physical manipulation scenarios.

Professor Dou Qi, Associate Professor from CUHK's Department of Computer Science and Engineering, who led the study, said: "Spatial intelligence is key to unlocking long-horizon manipulation, and visual perception is a crucial pathway to achieving it. Our method marks a breakthrough in bringing spatial understanding together with VLM reasoning."

Professor Dou added that the proposed robot spatial intelligence technology scales effectively across tasks and platforms. In 14 manipulation tasks requiring spatial perception and covering 31 different objects, RAM enabled robots to accurately follow spatial language instructions, reason about 3D spatial relationships and perform adaptive manipulation conditioned on the scene's physical context. RAM works seamlessly with leading VLMs and can be readily deployed on general-purpose humanoid robot platforms for fine-grained long-horizon manipulation.

Furthermore, CUHK's newly developed system features scalability for visuo-tactile fusion, leveraging tactile feedback for more adaptive manipulation. Professor Liu Yun-hui, Choh-Ming Li Professor of Mechanical and Automation Engineering at CUHK, and Director of the Hong Kong Centre for Logistics Robotics (HKCLR), said: "This research demonstrates the potential of AI to advance robot manipulation, with promising applications across scenarios from industrial to household settings, which will ultimately help to improve human life."
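
As a hedged illustration of how tactile feedback could make grasping adaptive in the way the release describes, the loop below tightens the grip until slipping stops; the sensor interface, thresholds and function names are all invented for this sketch and are not taken from the paper:

```python
# Illustrative only: sensor interface, thresholds and names are invented.

SLIP_THRESHOLD = 0.2   # slip signal above which the grasp is unstable
MAX_FORCE = 5.0        # newtons; never squeeze harder than this

def visuo_tactile_grasp(camera, tactile_sensor, gripper):
    """Vision proposes where to grasp; tactile feedback adapts how hard."""
    grasp_pose = camera.propose_grasp()        # visual grasp proposal
    gripper.move_to(grasp_pose)
    force = 1.0
    while force <= MAX_FORCE:
        gripper.close(force=force)
        reading = tactile_sensor.read()        # contact and slip signals
        if reading.contact and reading.slip < SLIP_THRESHOLD:
            return True                        # stable grasp achieved
        force += 0.5                           # tighten gradually and retry
    return False                               # fall back to a visual regrasp
```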

This research was supported by the HKCLR. Founded by CUHK, the centre is driven by a research team comprising professors from CUHK and the University of California, Berkeley. It is funded by the Innovation and Technology Commission of the HKSAR Government under the InnoHK Research and Development Platform. Its mission is to advance robot intelligence across perception, interaction, manipulation and mobility. Working closely with academic and industry partners in Hong Kong, the Greater Bay Area and the Chinese Mainland, the centre helps to translate cutting-edge AI and robotics research into real-world applications.

For the full research, please visit: https://www.science.org/doi/10.1126/scirobotics.aea2092

[1] Visuo-tactile fusion is the process of combining visual data and tactile data to create a comprehensive understanding of an environment or object, enabling robots to perform complex, contact-rich manipulation tasks with human-like dexterity.

