The race to lead in assisted driving has become one of the most competitive and consequential arenas in the automotive sector. In March, alongside the release of the new-generation SU7, Xiaomi positioned itself among the top contenders with the launch of an assisted driving system built on its XLA large model, framing it as a step toward an experience-first approach.

Once viewed as a late entrant, Xiaomi is now pursuing a more pragmatic path centered on user experience, supported by a technical roadmap that is both clear and agile. That trajectory began in March 2024, when it brought its first assisted driving system to mass production on the first-generation SU7. At the time, industry debate focused on map-free systems and urban deployment, and Xiaomi’s initial solution aligned with that trend. As a newcomer, it followed the prevailing direction.

As rule-based, map-free approaches began to reach their limits, however, attention shifted to end-to-end models. In February 2025, Xiaomi introduced its second-generation assisted driving system, followed by additional iterations by July the same year to keep pace with competitors.

The momentum behind data-driven systems proved short-lived. End-to-end performance depends heavily on data, and the wide range of long-tail scenarios emerged as a shared constraint. With no clear roadmap, multiple technical paths began to diverge.

Rather than remain on the same track, Xiaomi revisited first principles and asked a more fundamental question: can a car learn to drive the way a person does?

In March, it introduced its third-generation solution, the XLA large model, designed to enable artificial intelligence-driven cognition. Unlike earlier iterations, XLA does not rely solely on rules or statistical patterns. It aims to enable the system to interpret its environment and apply commonsense and causal reasoning.

Photo source: Xiaomi.

The shift from rules to data to cognition, compressed into two years, underscores the pressure Xiaomi faced. In the rule-based phase, it lacked engineering experience. In the data-driven phase, it needed to scale closed-loop systems. As a late entrant, it had to accelerate development, only to encounter another industry shift toward cognition.

How Xiaomi maintained direction was the focus of a conversation with Chen Guang and Chen Long, who lead the company’s end-to-end assisted driving and foundation model development, respectively.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: What is Xiaomi doing now in assisted driving?

Chen Long (CL): What we are doing is introducing the paradigm of cognitive large models into assisted driving. Through large models, we want assisted driving systems to have cognitive abilities when it comes to the environment so they can acquire some commonsense knowledge about the human world, traffic rules, and causal relationships involving objects on the road. That, in turn, can help solve the long-tail problems that end-to-end systems struggle to handle.

The XLA cognitive large model we released recently is the first version of our cognition-driven assisted driving system.

36Kr: How does a cognitive large model compare with end-to-end systems?

CL: Let me use a scenario as an example. Suppose the road ahead is closed, and signs and barriers on-site are directing vehicles to detour. In the process, the vehicle may need to borrow an oncoming lane temporarily or even cross a double yellow line for a short stretch.

An end-to-end system is more likely to keep moving forward based on the current road layout. In that kind of temporary rerouting scenario, it may not actively understand that it should detour.

But the XLA model can combine on-site signs with environmental information, understand that this is a road closure detour scenario, infer a feasible path, and initiate a reasonable reroute.

36Kr: How does this differ from vision-language-action (VLA) models?

CL: Our XLA cognitive large model does not just use visual information. It also includes audio, radar, and other modalities such as navigation. So the first meaning of the “X” in “XLA” is that we use more than vision as an information input.

Another difference is that we have incorporated data related to embodied intelligence into XLA’s foundation model.

That is also a very important distinction. Other automakers’ cognitive large models are built on open-source models. Xiaomi uses its own MiMo-Embodied foundation model. Because it is self-developed, we were able to add a large amount of embodied intelligence data during the pretraining stage of the foundation model. So the second meaning of the “X” in “XLA” is that we have richer data.

There is also another core difference.

Some VLA systems output long passages of textual reasoning before they output an action. That creates a problem: it is too slow, and the latency is hard to control. Then there is another line of thinking, which is to simply remove language altogether. But then it is no longer really VLA, because it is not using the reasoning capability of the “L” at all.

In XLA, we use latent space reasoning. In practice, that means the system uses machine language during inference, which keeps both the process and the inference latency under control. Of course, that machine language can also be decoded into text, so it remains interpretable. We preserve reasoning ability while significantly improving efficiency.

36Kr: Why introduce embodied intelligence data?

CL: We brought in data related to embodied intelligence mainly to train cars in spatial perception and spatial reasoning.

First of all, there is a gap in precision. A car’s perception of surrounding objects is generally at the decimeter level. But a humanoid robot may be trained on tasks such as grasping a cup, and the precision in that data can reach the centimeter level or even finer. If you use humanoid robot data to train a car, wouldn’t the car’s capabilities become stronger?

Second, today’s assisted driving systems do not really interact with surrounding objects. Our goal is to avoid collisions, but the assisted driving system does not actually understand what a collision is. Spatial reasoning is really about helping the car understand what consequences a certain driving choice could produce. Robots happen to have a lot of data on exactly that kind of interaction.

We believe MiMo-Embodied is the world’s first embodied foundation model to bridge assisted driving and robotics. We have also found that assisted driving data and robotics data reinforce each other. So in the future, we hope that assisted driving, robots, and even other Xiaomi smart devices can evolve toward a single brain, creating a more seamless experience.

36Kr: Integrating that data must be complex.

CL: That is right. Embodied intelligence data includes many different kinds of robot bodies. On those bodies, the sensors are positioned differently, and even the resolution of the camera images varies. Assisted driving outputs are mostly two-dimensional, while robots produce multi-joint outputs in three-dimensional space.

The hard part is figuring out how to design a sophisticated model architecture that can unify all of this different data. For now, though, the training objectives are mainly spatial perception and spatial reasoning. We are not yet dealing too much with action-level tasks. The differences between the execution spaces of the two kinds of tasks may be something to consider in the future.

36Kr: Has XLA improved parking capabilities?

Chen Guang (CG): Our parking capabilities have been enhanced as well. We launched a new feature this time. Say your final navigated destination is a merchant inside a mall. Our parking system will look through that complex’s garage for a parking space closest to the elevator entrance for that merchant. So far, that feature has received praise and recommendations from some users.

36Kr: That sounds difficult to implement.

CG: I think there are a lot of challenges, but at the core it is about how to let the car, in a relatively unfamiliar environment, find the parking spot that suits it best, the way a person would.

Once the car enters a garage, it needs to read the environment there, including text signs and elevator information. If the closest spots are all occupied, the car begins to roam and look for a more suitable space. At its core, this is about using the guidance information already available to get to the final destination in the navigation.

36Kr: Does this require significant computing power?

CG: It does, with very high demands. We only got XLA deployed after a great deal of algorithmic optimization. This kind of algorithm adaptation is itself a major challenge. We went through a lot of development work and engineering optimization, and we hit some pitfalls along the way. It has been hard work, and we have built up real know-how as a result.

36Kr: How would you assess Xiaomi’s engineering capabilities?

CG: I personally think they are quite advanced. Even now, there are very few companies that can deploy such a complex model on an actual vehicle and then push it out to all users.

36Kr: What comes next for Xiaomi?

CL: The first issue is definitely compute. The larger a model is, the stronger its capabilities tend to be. Of course, we would like to put the strongest possible model in the car, but in-car compute is limited. That is what latent space reasoning is for. And we will do more on that in the future.

CG: Yes. The first challenge is to further increase the parameter count of the in-car model, enable it to ingest more data during training, and help it understand more scenarios.

The second challenge is how to develop more driving and parking functions for users and further improve the product experience, especially whether new features can bring users more pleasant surprises.

36Kr: Does improving cognitive models depend on data?

CL: Data is certainly one part of it. We continuously need high-quality data. The other part is the model’s own capabilities, especially the foundation model.

As I mentioned, some companies use open-source foundation models. The problem is that you do not know how those open-source models were pretrained. They may not have very meticulous data cleaning methods or standards, and they may even draw on unvetted or risky information from the internet. Once that shows up in actual driving behavior, it could trigger a butterfly effect and create major risks.

But building a sizable foundation model from scratch is extremely difficult. First, you need a very strong team. Then you need to do data selection and cleaning, build and debug your own infrastructure systems, and establish a set of evaluation metrics. And once a version of the model is released, it may no longer be based on a leading architecture just a few months later, so this whole process has to be repeated again and again.

So how far a cognitive large model can be optimized depends not only on the talent and resources a company invests in its foundation model. It also depends on the company’s judgment about the direction large models are taking.

36Kr: Is Xiaomi fully committed to cognitive models?

CG: There are different routes being explored for assisted driving. One is the XLA route we are pursuing now, which directly introduces a cognitive large model. Another is to use the generative and reconstructive capabilities of a world model to solve cognition-related problems.

That said, we do not see cognitive large models and world models as opposing paths. Even multimodal language models need a strong simulation environment.

In fact, we have combined the two technical routes. Just because the in-car system uses XLA does not mean we would completely abandon world models in the cloud.

36Kr: Do world models still have advantages that are hard to replace?

CG: At least in closed-loop simulation, when the physical world is projected into a digital space, world models are indispensable.

Right now, the technical focus is on long-tail scenarios. Say an oddly shaped rock or a tire rolls onto the road. In the real world, it is hard to encounter that with test vehicles, and it is hard to collect such scenarios at scale. So whether you are using world models or XLA, you still have to explore them in a simulator.

36Kr: Is that the new technical consensus?

CG: Xiaomi moved relatively far ahead during the stage of single-stack end-to-end systems, so even before cognitive large models appeared, it already felt that closed-loop simulation capability was very important. The industry’s leading players, including Tesla, have probably already done solid work on reconstruction and generation with world models.

36Kr: How was closed-loop simulation done before world models emerged?

CG: It was very difficult. You could almost only do static scenes. Dynamic scenes depended on real data, which is why people used to say that data was very scarce.

36Kr: But if XLA can already understand these road obstacles, do you still need to train repeatedly on this kind of data?

CG: Before any feature is rolled out to real users, we want it to be fully tested inside a simulator.

36Kr: Does testing it there guarantee safety in the real world?

CG: The digital space and the world model really act like a funnel. They can intercept most problems. For the ones that remain, the multimodal large model itself has generalization ability, and we hope to rely on its own cognitive and reasoning capabilities to produce better solutions. The two work in combination.

36Kr: Will Xiaomi keep investing in world models? What direction will future iteration take?

CG: This year at Nvidia GTC, we introduced Xiaomi’s latest progress in world models. We have also published close to ten papers related to world models at leading conferences including CVPR 2026, ICLR 2026, NeurIPS 2025, and ICCV 2025. That seems enough to show how seriously Xiaomi takes world models.

As for direction, I would point to three things.

  1. First is realism. That may be a little different from what we usually mean by perfect realism. Let me give an example. When we simulate rain, we want the image to look like a real camera view with water droplets on the lens, not some perfectly clean rainy-day environment. We want the simulated scenario to match the information the vehicle actually receives, because only then is the test effective.
  2. Second is richness. Today I may want to handle driving in a scenario with direct sunlight. Two days later, I may want to handle driving in heavy rain, heavy fog, or heavy snow. So can we change only the weather and lighting conditions without changing the traffic information?
  3. Third is scene editing capability. Your digital assets need to be rich enough that you can use them to simulate all kinds of scenarios. Only if they are rich enough are they truly useful.

36Kr: That sounds complicated. How long has Xiaomi been investing in this?

CG: Two years. Thinking back, some of the technical preparation started in the first half of 2024. By the end of 2024, Xiaomi’s technology had already gained a certain degree of recognition in both industry and academia. In the second half of 2025, we started to enter a harvest period on the technical side, with results such as wins in major competitions and published papers.

36Kr: In other words, Xiaomi already has a clear advantage in this area?

CG: Of course, we hope our first-mover advantage can continue. We did start fairly early, and we hope that can have a positive impact on the industry so everyone can build this out on a more solid footing. In the end, it all serves the product experience of the industry as a whole.

36Kr: What makes for a good assisted driving experience?

CG: I think the most important thing in a good experience is safety. We cannot deliver a product that makes users feel unsafe or uneasy. That is the most important issue for us right now.

36Kr: Why do you separate safety from peace of mind?

CG: From a technical standpoint, as long as there is no collision, that counts as safety. But what users feel is safe is not just avoiding collisions. Say the car brakes sharply. The user may not understand why the system made what feels like such an extreme move. That can create physical discomfort and also a feeling of being unsafe.

So it is not enough for us to guarantee safety in the technical sense. We also need to make sure the product gives users a real sense of peace of mind. Only when it is both safe and something people dare to use and want to use is the product experience complete.

36Kr: Has Xiaomi learned anything new about creating that sense of peace of mind?

CG: I think we have made some new progress there.

For example, in blind spots at intersections, we apply preventive deceleration. That is very similar to the way a person drives. A user’s first reaction is that the car made that move because it understood the scenario.

Or take an approaching traffic jam. Our car will not wait until the very last moment and then slam on the brakes at the limit. Instead, it will reduce speed early and defensively. That reflects some of our thinking on both safety and peace of mind.

36Kr: Can you sum up the personality, or the values, of Xiaomi’s assisted driving R&D?

CG: Xiaomi’s values have deeply shaped the character of the assisted driving team. I think the most important thing is to be friends with users, to think from the user’s point of view about what kind of product experience they need, and then use that to push the technology forward.

For example, when we moved from end-to-end systems to XLA, some colleagues initially favored world models and others favored XLA. But after in-depth discussions, everyone ultimately felt that if XLA could be made to work, it would definitely bring users a very cool product experience. So even though it was difficult, we went for it.

CL: Yes. Even though Xiaomi started relatively late in assisted driving, and even though its release cadence may not be as fast, it will definitely put the safest product with the best experience into users’ hands.

From my perspective, we have also been practicing first principles all along. Because we firmly believe large models can help assisted driving solve some key problems, we have done a great deal of exploratory work on large models. Ultimately, we hope to bring out their capabilities and push assisted driving toward a higher level.

KrASIA features translated and adapted content that was originally published by 36Kr. This article was written by 36Kr Brand.