In China’s artificial intelligence sector, SenseTime has long stood as one of the most enduring players, well over a decade old and well-acquainted with the cyclical tides of technological change.

During the rise of visual AI, the company emerged from a research lab at The Chinese University of Hong Kong to become one of the first to commercialize computer vision at scale. Yet B2B operations have never been easy. Like many peers, SenseTime often faced clients with highly customized needs and long development cycles.

Then came OpenAI’s ChatGPT, which reshaped the industry around large language models. Leveraging its early lead in computing infrastructure, SenseTime found new momentum. According to its 2024 annual report, the company’s generative AI business brought in RMB 2.4 billion (USD 336 million) in revenue, rising from 34.8% of total income in 2023 to 63.7%, making it SenseTime’s most critical business line.

But after three years of rapid progress in large models, a more pragmatic question looms: beyond narrow applications, how can AI enter the physical world and become a practical force that truly changes how we work and live?

That question lies at the center of SenseTime’s next chapter.

As embodied intelligence emerges as the next frontier, a new company has joined the race. ACE Robotics, led by Wang Xiaogang, SenseTime’s co-founder and executive director, has officially stepped into the field. Wang now serves as ACE Robotics’ chairman.

In an interview with 36Kr, Wang said ACE Robotics was not born from hype but from necessity. It aims to address real-world pain points through a new human-centric research paradigm, focusing on building a “brain” that understands the laws of the physical world and on delivering integrated hardware-software products for real-world use.

This direction reflects a broader shift across the industry. A year ago, embodied intelligence firms were still experimenting with mobility and stability. Today, some have secured contracts worth hundreds of millions of RMB, bringing robots into factories in Shenzhen, Shanghai, and Suzhou.

AI’s shift toward physical intelligence carries major significance, especially as the industry faces growing pressure to deliver real returns.

In the first half of 2025, SenseTime reported a net loss of RMB 1.162 billion (USD 163 million), narrowing 50% year on year even as its R&D spending continued to rise. The company is now pursuing more grounded, sustainable paths to growth.

The breakthrough, Wang said, will not come from a leap toward artificial general intelligence (AGI), but from robots that can learn reusable skills through real-world interaction and solve tangible physical problems.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: Why did SenseTime decide to establish ACE Robotics this year and move into embodied intelligence?

Wang Xiaogang (WX): The decision stems from two considerations: industrialization and technological paradigm.

From an industrial perspective, embodied intelligence represents a market worth tens of trillions of RMB. As Nvidia founder Jensen Huang has said, one day everyone may own one or more robots. Their numbers could exceed smartphones, and their unit value could rival that of automobiles.

For SenseTime, which has historically focused on B2B software, expanding into integrated hardware-software operations is a natural step toward scale. Years of working with vertical industries have given us a deep understanding of user pain points. Compared with many embodied AI startups that lack this context, our ability to deploy in real-world scenarios gives us an edge in commercialization.

From a technical perspective, traditional embodied intelligence has a key weakness. Hardware has advanced quickly, but the “brain” has lagged behind because most approaches are machine-centric. They start with a robot’s form, train a general model on data from that specific body, and assume it can generalize. But it can’t. Just as humans and animals can’t share one brain, robots with different morphologies—whether with dexterous hands, claws, or multiple arms—cannot share a universal model.

36Kr: What distinguishes ACE Robotics’ technical approach?

WX: We’re proposing a new, human-centric paradigm. We begin by studying how humans interact with the physical world, essentially how we move, grasp, and manipulate. Using wearable devices and third-person cameras, we collect multimodal data, including vision, touch, and force, to record complex, commonsense human behaviors.

By feeding this data into a world model, we enable it to understand both physics and human behavioral logic. A mature world model can even guide hardware design, ensuring that a robot’s form naturally fits its intended environment.
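
As a rough illustration of the pipeline Wang describes, the sketch below shows what a single multimodal sample and its hand-off to a world model could look like. The field names, shapes, and sensor set are illustrative assumptions, not ACE Robotics’ actual schema.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class HumanInteractionSample:
    """One multimodal recording of a human manipulation episode (hypothetical schema)."""
    egocentric_rgb: np.ndarray    # (T, H, W, 3) frames from a wearable camera
    third_person_rgb: np.ndarray  # (T, H, W, 3) frames from an external camera
    hand_pose: np.ndarray         # (T, 21, 3) hand keypoints tracked by the wearable
    tactile: np.ndarray           # (T, S) fingertip pressure readings
    force_torque: np.ndarray      # (T, 6) wrist force/torque measurements
    annotation: str               # e.g., "unscrew the bottle cap"


def to_world_model_inputs(sample: HumanInteractionSample) -> dict:
    """Bundle the modalities the way a world model's encoder might expect them."""
    return {
        "vision": np.stack([sample.egocentric_rgb, sample.third_person_rgb]),
        "proprio": sample.hand_pose,
        "contact": np.concatenate([sample.tactile, sample.force_torque], axis=-1),
        "text": sample.annotation,
    }
```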

In recent months, companies such as Tesla and Figure AI have pivoted toward first-person camera-based learning. But these approaches capture only visual information, without integrating critical signals like force, touch, and friction—the keys to genuine multidimensional interaction.

Vision alone may let a robot dance or shadowbox, but it still struggles with real contact tasks like moving a bottle or tightening a screw.

Our human-centric approach has already been validated. A team led by Professor Liu Ziwei developed the EgoLife dataset, containing over 300 hours of first- and third-person human activity data. Models trained on it address a long-standing industry pain point: most existing datasets capture only trivial actions, which is insufficient for learning complex motion.

36Kr: Public data shows China’s embodied intelligence market surpassed RMB 800 billion (USD 112 billion) in 2024, with hundreds of startups emerging. How does ACE Robotics define its position?

WX: Our goal is not merely to build models but to deliver integrated hardware-software products that solve real problems in defined scenarios.

We’ve found that much existing hardware doesn’t match real-world needs. So we work closely with partners on customized designs.

Take quadruped robots: traditional models mount cameras too low and with too narrow a field of view, making it difficult to detect traffic lights or navigate intersections. In partnership with Insta360, we developed a panoramic camera module with 360-degree coverage, solving that limitation.

We’re also tackling issues like waterproofing, high computing costs, and limited battery life, which are key obstacles to outdoor and industrial deployment.

36Kr: How do these collaborations function in practice?

WX: Our strength lies in the “brain,” meaning the models and the navigation and operation capabilities.

Previously, SenseTime specialized in large-scale software systems but had no standardized edge products. Through prior investments in hardware and component makers, ACE Robotics now follows an ecosystem model. We define design standards, co-develop hardware with partners, and keep our model layer open, offering base models and training resources.

36Kr: SenseTime has deep roots in security and autonomous driving. Which of those capabilities translate to robotics?

WX: R&D systems and safety standards are two key areas. Both autonomous driving and robotics rely on massive datasets for continuous improvement. The validated “data flywheels” we’ve built significantly boost iteration speed. Meanwhile, the rigorous safety and data-quality frameworks from autonomous driving can directly enhance robotics reliability.

On the functional side, our SenseFoundry platform already includes hundreds of modules originally built for fixed-camera city management. When linked to mobile robots, these capabilities transfer seamlessly, extending from static monitoring to dynamic mobility.

36Kr: How do you view SenseTime’s evolution from visual AI to embodied intelligence?

WX: SenseTime’s path traces AI’s own progression from version 1.0 to 3.0.

In 2014, we were in the AI 1.0 era, defined by visual recognition. Machines began to outperform the human eye, but intelligence came from manual labeling: tagging images to simulate cognition. Because labeled data was limited and task-specific, each application required its own dataset. Intelligence was only as strong as the amount of human labor behind it. Models were small and lacked generalization across scenarios.

Then came the 2.0 era of large models, which transformed everything. The key difference was data richness. The internet’s texts, poems, and code embody centuries of human knowledge, far more diverse than labeled images. Large models learned from this collective intelligence, allowing them to generalize across industries and domains.

But as online data becomes saturated, the marginal gains from this approach are slowing.

We are now entering the AI 3.0 era of embodied intelligence, defined by direct interaction with the physical world. To truly understand physics and human behavior, reading text and images is no longer enough. AI must engage with the world. Tasks such as cleaning a room or delivering a package demand real-time, adaptive intelligence. Through direct interaction, AI can overcome the limits of existing data and open new pathways for growth.

36Kr: How does ACE Robotics’ Kairos 3.0 model differ from systems like OpenAI’s Sora or World Labs’ Marble?

WX: Kairos 3.0 consists of three components: multimodal understanding and fusion, a synthetic network, and behavioral prediction.

The first fuses diverse inputs, including not just images, videos, and text, but also camera poses, 3D object trajectories, and tactile or force data. This enables the model to grasp the real-world physics behind movement and interaction.

In work with Nanyang Technological University, for example, we enabled the model to infer camera poses from a single image. When a robotic arm captures a frame, the model can deduce the arm’s position and predict its motion from visual changes, deepening its understanding of physical interaction.

The second component, the synthetic network, can generate videos of robots performing various manipulation tasks, swapping robot types, or altering environmental elements such as objects, tools, or room layouts.

The third, behavioral prediction, enables the model to anticipate a robot’s next move after receiving an instruction, bridging cognition and execution into a complete loop from understanding to action.
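
To make the three-component structure concrete, here is a minimal Python sketch of how a fusion module, a synthetic network, and a behavioral predictor could be wired into a single understanding-to-action loop. The interfaces and names are hypothetical stand-ins, not Kairos 3.0’s actual APIs.

```python
from dataclasses import dataclass
from typing import Protocol

import numpy as np


@dataclass
class Observation:
    frames: np.ndarray       # recent camera frames, (T, H, W, 3)
    camera_pose: np.ndarray  # estimated 6-DoF pose of the viewpoint
    contact: np.ndarray      # tactile / force readings, (T, S)


class FusionModule(Protocol):
    """Component 1: fuse vision, pose, contact, and text into one state."""
    def encode(self, obs: Observation, instruction: str) -> np.ndarray: ...


class SyntheticNetwork(Protocol):
    """Component 2: generate alternative rollouts (swap robots, objects, layouts)."""
    def imagine(self, state: np.ndarray, edits: dict) -> np.ndarray: ...


class BehaviorPredictor(Protocol):
    """Component 3: predict the robot's next actions from the fused state."""
    def next_actions(self, state: np.ndarray) -> np.ndarray: ...


def understand_to_act(fusion: FusionModule,
                      generator: SyntheticNetwork,
                      policy: BehaviorPredictor,
                      obs: Observation,
                      instruction: str) -> np.ndarray:
    """One pass through the understanding-to-action loop described above."""
    state = fusion.encode(obs, instruction)                # multimodal understanding
    _rollout = generator.imagine(state, {"robot": "arm"})  # synthesized variation (not consumed here)
    return policy.next_actions(state)                      # behavioral prediction
```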

36Kr: How does the human-centric approach improve data efficiency, generalization, and multimodal integration?

WX: It combines environmental data collection with world modeling.

By “environment,” we mean real human living and working spaces. Unlike autonomous driving, which focuses narrowly on roads, or niche domains such as underwater robotics, we model how humans interact with their surroundings. This yields higher data efficiency and more authentic inputs.

We also integrate human ergonomics, touch, and force, all of which are essential for rapid learning and all of which are missing from machine-centric approaches.

36Kr: When do you expect human-centric systems to achieve large-scale adoption?

WX: The first large-scale applications will emerge in quadruped robots, or robotic dogs.

Most current quadruped bots still rely on remote control or preset routes. Our system gives them autonomous navigation and spatial intelligence. Equipped with ACE Robotics’ navigation technology, they can coordinate through a control platform, follow Baidu Maps commands, and respond to multimodal or voice inputs. They can identify people in need, record license plates, and detect anomalies.

Linked with our SenseFoundry vision platform, these robots can recognize fights, garbage accumulation, unleashed pets, or unauthorized drones, sending real-time data back to control centers.
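
As an illustration of how such robot-to-control-center reporting could be structured, the sketch below defines a hypothetical patrol-event payload. The fields, event types, and transport are assumptions for the sake of example, not SenseFoundry’s actual interface.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class PatrolEvent:
    """Hypothetical payload a patrol robot pushes back to a control center."""
    robot_id: str
    event_type: str                # e.g., "garbage_accumulation", "unauthorized_drone"
    confidence: float              # detector confidence in [0, 1]
    location: tuple[float, float]  # (latitude, longitude) from the navigation stack
    timestamp: float               # Unix time of the detection
    snapshot_uri: str              # pointer to the frame that triggered the alert


def report(event: PatrolEvent) -> str:
    """Serialize the event for whatever transport the control platform uses."""
    return json.dumps(asdict(event))


if __name__ == "__main__":
    print(report(PatrolEvent(
        robot_id="dog-07",
        event_type="garbage_accumulation",
        confidence=0.91,
        location=(31.2304, 121.4737),
        timestamp=time.time(),
        snapshot_uri="s3://patrol-frames/dog-07/000123.jpg",
    )))
```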

This combination, supported by cloud-based management, will soon scale in inspection and monitoring. Within one to two years, we expect widespread deployment in industrial environments.

36Kr: Beyond the near term, which other applications are worth watching?

WX: In the medium term, warehouse logistics will likely be the next major commercialization frontier.

Unlike factories, warehouses share consistent operational patterns. As online shopping expands, front-end logistics hubs require standardized automation for sorting and packaging. Traditional robot data collection cannot handle the enormous variety of SKUs, but large-scale environmental data allows our models to generalize and scale efficiently.

In the long term, home environments will be the next key direction, though safety remains a major challenge. Household robots must manage collisions and ensure object safety, much like autonomous driving must evolve from Level 2 to Level 4 autonomy.

Progress is being made. Figure AI, for instance, is partnering with real estate funds managing millions of apartment layouts to gather environmental data, gradually moving embodied intelligence closer to the home.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Huang Nan for 36Kr.