“Physical AI” is emerging as a popular term in the smart vehicle industry as algorithm providers explore how AI systems can understand and act in the physical world. One such company is Zhuoyu Technology, an independent spinoff from DJI, which has unveiled what it describes as a native multimodal foundation model for mobile physical AI.
The model made its debut at the latest Beijing Auto Show. Yu Beibei, vice president of Zhuoyu, said the shift is less about crafting a capital markets story than about survival.
“If you don’t get on this technology path, you might not be able to break through in the future,” Yu said.
In this new competitive field, algorithm providers are no longer competing only with their former peers. They also face larger companies with more resources that are moving in from other AI segments, including embodied intelligence, which refers to AI systems designed to interact with the physical world through machines such as vehicles and robots. That is pushing algorithm providers into another round of elimination, while those that break through could unlock broader commercial opportunities.
Using its mobile foundation model, Zhuoyu has begun moving beyond the traditional model of pairing hardware sales with development fees. To reignite growth, it is extending passenger vehicle technologies into Level 4 autonomous driving applications, including robotaxis, while exploring a token-based business model built around subscriptions and profit sharing.
36Kr spoke with Yu about the premise of physical AI, its commercialization potential, and how Zhuoyu is building its moat.
The following transcript has been edited and consolidated for brevity and clarity.
36Kr: Could you explain the native multimodal foundation model in detail?
Yu Beibei (YB): The concept of native multimodality can be traced back to last year, when we began working on VLA 1.0. At the time, the approach was closer to a vision-action alignment model, with a large language model (LLM) added at the backend. That created many problems, including limitations in language and semantic understanding, as well as response latency.
We believe it goes against common sense to translate all information into language, then try to understand the physical world through the results of that translation.
A more reasonable path is this: vision, audio, and action are all modalities. Rules and reasoning are also modalities. All of them should be incorporated during pretraining, allowing the model to understand the physical world within a shared space across multiple modalities. That is the more appropriate approach.
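To make the idea of a shared multimodal space more concrete, the following is a minimal, purely illustrative sketch rather than Zhuoyu’s actual architecture: each modality gets its own encoder, but every encoder projects into the same embedding width so a single backbone can be pretrained over vision, audio, and action tokens jointly. All dimensions, module names, and the toy action head are hypothetical.

```python
# Illustrative sketch only: a toy "shared space" for multiple modalities.
# Not Zhuoyu's architecture; sizes and heads are assumptions for demonstration.
import torch
import torch.nn as nn

D = 256  # shared embedding width (hypothetical)

class SharedSpaceModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-modality encoders projecting into the shared space.
        self.vision_proj = nn.Linear(768, D)   # e.g. image patch features
        self.audio_proj = nn.Linear(128, D)    # e.g. spectrogram frames
        self.action_proj = nn.Linear(6, D)     # e.g. past trajectory states
        # One backbone pretrained jointly over all modalities.
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(D, 6)     # decode the next action token

    def forward(self, vision, audio, actions):
        tokens = torch.cat([
            self.vision_proj(vision),
            self.audio_proj(audio),
            self.action_proj(actions),
        ], dim=1)                               # all modalities share one sequence
        h = self.backbone(tokens)
        return self.action_head(h[:, -1])       # act from the joint context

# Usage: batch of 2, with 16 vision tokens, 8 audio tokens, 4 action tokens.
model = SharedSpaceModel()
out = model(torch.randn(2, 16, 768), torch.randn(2, 8, 128), torch.randn(2, 4, 6))
print(out.shape)  # torch.Size([2, 6])
```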
36Kr: Have you removed the language modality?
YB: At present, our in-vehicle model has not yet enabled language input. This is similar to Xpeng’s VLA 2.0. We are working in a similar direction, shifting toward this paradigm, and the underlying backbone network has already changed.
36Kr: Has Zhuoyu also entered the VLA 2.0 stage?
YB: Yes. The industry is at an inflection point in its paradigm shift. The choice before us is whether to continue with the previous paradigm of small models, such as expert models, or decisively switch to the large model paradigm.
We are more optimistic about the large-model paradigm. In mobile physical AI, the goal is to apply mobility capabilities across a wide range of carriers. That essentially means achieving scale.
The history of LLMs offers a useful reference. Previously, when people worked on vision-language models, some built expert models, while others built general-purpose models, also known as foundation models.
Looking back, the group that ultimately broke through was the one building foundation models. Expert models focused on areas such as medical diagnosis did not truly break through. In physical AI, we believe the same pattern will apply, so we will firmly follow the foundation model paradigm.
36Kr: No one seems to have trained a model that can be uniformly accessed by all kinds of carriers yet. In essence, everyone is still solving vehicle-related problems.
YB: This will move forward in stages. In 2025, nearly everyone switched to a data-driven approach, which means the model’s basic capabilities had already reached roughly 70 points. At that point, lifting it to 90 still requires post-training, data collection, and generalization to close the remaining 20 points. But the climb has narrowed: it used to run from 40 to 80 points, and today it runs from 70 to 90.
Going forward, as the model’s basic capabilities improve further, our goal is to achieve zero-shot generalization.
If the model can reach 95 out of the box, then later work around post-training, generalization, and city launches can almost be ignored. We have not yet reached 95 out of the box, but we have already reached 70.
36Kr: At this stage, has Zhuoyu unified various scenarios into the same model and run it in practice, with the view that it can already be mass produced and generalized across different fields? Or is it still at a relatively early stage?
YB: At this point, we are still far from being able to say it is turnkey. The industry has not yet reached a conclusion on the ultimate paradigm for physical AI, or what kind of architecture can truly understand the physical world.
36Kr: How do you view the fact that more solution providers are now gravitating toward physical AI? Is this about giving capital markets a story with more room for imagination?
YB: We believe this is no longer just a commercial or strategic choice. Ultimately, it should rise to the level of a survival rule. If you don’t get on this technology path, you may not be able to break through in the future.
This is similar to the eve of the LLM boom. Many expert models for medical diagnosis had emerged, but once general-purpose large models appeared, they displaced those expert models entirely. The earlier models ultimately failed to break through.
36Kr: Under this paradigm, building a general-purpose model requires data from other scenarios, or other conditions needed for early-stage training. Are those still insufficient?
YB: When training our own foundation model, 30% of the data currently comes from real-world data collected by vehicles, 30% comes from robots, and the remaining 40% comes from the internet.
For data related to this kind of mobility capability, what you need from the internet is essentially first-person-view video captured while moving. It does not necessarily have to come from passenger vehicles or commercial vehicles. It can also be video filmed while a person is walking. This type of data exists at a huge scale and is relatively easy to obtain.
Many companies claim they want to build mobile physical AI. Model capability is one part of it, but more importantly, embodied intelligence has to be deployed onto specific hardware, and that distribution process is difficult. It is unlike digital AI, which can spread through smartphones from one person to ten, from ten to a hundred, and quickly on to hundreds of millions of users; that kind of distribution is extremely fast.
So establishing a distribution platform and distribution network is also crucial. It determines how this capability is deployed onto mobile carriers and physical entities.
36Kr: How is Zhuoyu approaching distribution?
YB: We have our own methods. For example, we work with partners to define hardware standards. Once these standards are defined, we carry out hardware licensing and distribution through partners. That is the hardware distribution piece.
On the software side, our mobility capability SDK can package model capabilities and provide them to partners that do not have the ability to post-train models. We can also package the model itself as “mobile AI”: once it is good enough, we can open source it, allowing other parties to conduct post-training on top of it. That is another distribution method.
It can also be made directly into a “mobile agent.” In the future, for some low-safety, low-real-time applications, such as robot vacuums or lawn mowers, the device may only need to transmit a video stream to the cloud. The cloud computes the trajectory and sends it directly back to the machine. That could be another distribution method.
36Kr: Do these distribution methods correspond to Zhuoyu’s commercial charging models?
YB: Yes, and the commercial scenarios they address are also different.
The traditional approach, such as working on passenger vehicles or commercial vehicles, is to sell hardware, sell software licenses, and charge development fees and nonrecurring engineering fees. Internally, we call this the first growth curve business.
The second growth curve is about extending technologies that have already been validated in passenger vehicles into areas such as robotaxis and robovans. Although we also sell hardware and may charge development fees, we generally do not charge software licensing fees.
For the software portion, revenue comes through profit sharing. In Level 4 businesses, for example, as a service provider, we need to continuously participate in software iterations and may even need to take part in operations. That requires ongoing revenue, which evolves into a subscription and revenue-sharing model.
36Kr: It sounds like this will be more profitable.
YB: The profit structure is comparatively better.
We may have different ways to distribute algorithms. Take the “mobile agent” as an example. This distribution method is somewhat like distributing so-called “action tokens.”
It is equivalent to a consumer electronics device transmitting a video stream to a cloud-based inference model, after which the model sends back a trajectory. The charging model may be based on the number of times the consumer device is used, or the distance it travels, collecting fees similar to “action tokens.” This is another form of subscription.
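As a rough illustration of what “action token” billing could look like, here is a hypothetical sketch: the meter counts cloud trajectory requests and the distance driven on them, and the fee is the sum of both. The rates, field names, and numbers are assumptions for demonstration, not Zhuoyu’s actual pricing.

```python
# Illustrative sketch only: usage-based "action token" billing for a cloud-inference
# mobile agent. All rates and fields are hypothetical.
from dataclasses import dataclass

@dataclass
class UsageRecord:
    inference_calls: int      # video-stream-to-trajectory requests served
    distance_km: float        # distance traveled under cloud-computed trajectories

def monthly_fee(usage: UsageRecord,
                per_call: float = 0.002,   # hypothetical price per inference call
                per_km: float = 0.01) -> float:
    """Charge per cloud trajectory request plus per kilometer driven on it."""
    return usage.inference_calls * per_call + usage.distance_km * per_km

# Example: a robot lawn mower that made 40,000 trajectory requests over 120 km.
print(round(monthly_fee(UsageRecord(inference_calls=40_000, distance_km=120.0)), 2))
# 81.2
```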
36Kr: Will Zhuoyu handle subsequent operations and maintenance?
YB: For Level 2 systems, operations and maintenance are not involved in the first place. Operations and maintenance only come into play at Level 4. There needs to be a remote monitoring system that continuously monitors vehicle operations and, when necessary, enables remote takeover.
This is somewhat like the OnStar service in the past: using it requires payment. Once a vehicle activates Level 4 functions, whether for trunk logistics or passenger use, an additional fee applies.
In the future, even when the sensor and computing configurations of passenger vehicles can support Level 4, car owners may still use a Level 2-class system in everyday driving. When they need to activate Level 4 functions, they will pay a little extra for every kilometer driven in Level 4 mode, because there will always be a system monitoring it.
36Kr: Do you think Level 2 and Level 4 will result in completely different business models?
YB: That’s right. Level 2 and Level 4 capabilities target completely different business models. From our perspective, we believe Level 4 should first be deployed in urban areas and then expanded to highway scenarios.
From the standpoint of engineering safety, an accident of the same nature causes far greater harm on a highway than in an urban area.
36Kr: As industry attention on physical AI grows, should we expect more attrition?
YB: A new round of reshuffling may be about to begin. All companies working on autonomous driving should, in the near future, transform into mobile physical AI companies.
If competition takes place on the mobile physical AI track, it becomes cross-sector competition; it may not even be among the existing players in this industry. We will also need to compete with players that originally worked on digital AI and now want to move into embodied intelligence and physical AI.
36Kr: Then what exactly is Zhuoyu’s moat?
YB: We believe there are two points.
The first is model capability. At present, there is still no conclusion on the industry’s iteration paradigm, or even on what model architecture will ultimately be adopted. Perhaps advanced new architectures such as 3D DiT or V-JEPA will break through in the future, but these remain unknowns.
The second is distribution capability. Establishing a distribution platform and network, creating an ecosystem, and uniting different partners to carry out distribution together is an extremely high barrier.
KrASIA features translated and adapted content that was originally published by 36Kr. This article was written by Xiao Man for 36Kr.