Robotics will not get its “GPT moment” by following LLMs, Agibot chief scientist says

China’s embodied intelligence sector has undergone a quiet shift over the past six months.

The fixation on degrees of freedom in robot bodies has begun to ease. In its place, attention is moving toward the factors that may determine the ceiling of robotic intelligence: data, models, infrastructure, and, more importantly, whether they can reinforce one another in real-world deployment.

As debate continues over whether robots can replicate the scaling law of large language models (LLMs) through brute-force data accumulation, Luo Jianlan, associate professor at the Shanghai Innovation Institute and chief scientist at Agibot, has offered a view that cuts against the prevailing current: embodied intelligence cannot simply copy the development path of LLMs.

Luo has a recognizable way of speaking. He switches quickly between Chinese and English technical terms, builds arguments densely, and rarely gives ambiguous answers.

Rather than staying within the narrower debate over whether data, models, or infrastructure matters most, he points directly to the system-level problem. The central tension in embodied intelligence today is not whether any single component can break through alone, but whether these components can form a loop in real-world deployment.

That view comes from his experience across academic research and industrial implementation. Luo earned his doctorate at UC Berkeley, where he studied under Sergey Levine, a foundational figure in embodied intelligence. After graduating, he worked as a research scientist at Google X and DeepMind. A little more than a year ago, he returned to China and joined the Shanghai Innovation Institute and Agibot.

In his view, many “embodied foundation models” are not being trained through pretraining in the true sense. They are closer to mid-training or fine-tuning.

The reason is that high-quality real robot interaction data remains scarce, especially data that covers multiple scenarios, tasks, and robot bodies, and includes failures, corrections, and long-tail interactions. There is still nowhere near enough of it to support large-scale pretraining comparable to that of LLMs.

That has led to a common pattern. At a stage when real robot interaction data is insufficient, many teams choose to stack high-quality teleoperation data on top of existing open-source model bases, then align or fine-tune the model for specific tasks.

This can improve laboratory task performance quickly, but it is not the same as genuine embodied foundation model pretraining. An improving loss curve on offline data mainly shows that the model is fitting existing data better. Whether it can transfer to new physical environments, handle long-tail disturbances, and recover from failure still has to be validated through real-world deployment.

Loss is the “score” for how much a model gets wrong each time. A loss curve plots that score over time. When the loss curve declines, it usually means the model is fitting the training data better. In robotics, however, that does not necessarily translate into a higher deployment success rate in real-world scenarios.
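To make the distinction concrete, here is a minimal Python sketch of comparing model checkpoints by offline loss versus real-world success rate. It is an illustration only, not Agibot's evaluation method: the `Checkpoint` class, the checkpoint names, and every number are hypothetical and invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One saved model version. All values used below are hypothetical."""
    name: str
    offline_loss: float  # loss on held-out demonstration data
    successes: int       # successful real-robot trials
    trials: int          # total real-robot trials

    @property
    def success_rate(self) -> float:
        return self.successes / self.trials


def best_by_loss(ckpts):
    # Lowest offline loss: the checkpoint that fits the recorded data best.
    return min(ckpts, key=lambda c: c.offline_loss)


def best_by_deployment(ckpts):
    # Highest measured success rate on real-robot trials.
    return max(ckpts, key=lambda c: c.success_rate)


# Invented numbers: loss keeps falling, but success peaks at the middle checkpoint.
checkpoints = [
    Checkpoint("step_10k", offline_loss=0.42, successes=31, trials=50),
    Checkpoint("step_20k", offline_loss=0.31, successes=36, trials=50),
    Checkpoint("step_30k", offline_loss=0.24, successes=29, trials=50),
]

print("best by offline loss:", best_by_loss(checkpoints).name)
print("best by deployment:  ", best_by_deployment(checkpoints).name)
```

In this made-up log the two criteria pick different checkpoints, which is the situation the article describes: a lower loss does not by itself indicate which model will perform best once deployed.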

That is why Luo believes embodied intelligence should not blindly follow the scaling law that has shaped the development of GPT and later large models.

With LLMs, there is a relatively stable and predictable statistical relationship between pretraining loss and model capability. In robotics, a decline in offline loss does not necessarily correspond to a higher real-world deployment success rate. Robots face an open physical world involving contact, disturbances, long-tail scenarios, hardware differences, and task feedback. Training a model on contextual data does not mean it can truly grasp reality or interface with it.

The real breakthrough in embodied intelligence, therefore, is not just about adding parameters or accumulating data. It is about building a deployment loop. Only when robot deployment scales up, the cost of adapting to new scenarios continues to fall, and data feedback steadily improves model capability will embodied robots be better equipped to operate in the physical world.

Within this framework, Luo’s goal after returning to China has been to build a scalable loop for embodied intelligence.

He has distilled his work over the past year into three technical pillars.

The first is SOP, or scalable online post-training. SOP addresses the infrastructure needed for large-scale online post-training of robots, including low-latency data feedback, cloud computing, training scheduling, and model updates. Its value is not only as an algorithmic module, but also as a way to test whether robot data can move efficiently from deployment sites into the training loop.

The second is LWD, or learning while deploying. It attempts to break the old separation between training and deployment, so that robots are no longer products fixed at shipment, but systems that keep evolving in real scenarios such as convenience stores and supermarkets. When robots encounter unfamiliar shelf configurations, product placements, or operational disturbances, the system can continuously accumulate data through real interaction and turn those experiences into later model improvements.

The third is the Tau-0-WM world model, released jointly by the Shanghai Innovation Institute and Agibot.

Tau-0-WM does not treat video generation as the end goal. Instead, it uses video prediction as a way to learn physical dynamics and evaluate the consequences of actions. More specifically, it aims to become an action-conditioned physical simulator. Before a robot performs an action, it compares the future outcomes that different candidate actions may produce inside the model, helping the system choose the more reliable action.

For example, when facing an egg near the edge of a table, a regular vision-language-action (VLA) model may directly output a grasping action. An action-conditioned world model, by contrast, can first compare the future consequences of several candidate trajectories and avoid choosing an action that would sweep the egg off the table.
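The egg example can be sketched in a few lines of Python. This is a toy illustration of action-conditioned selection under stated assumptions, not the Tau-0-WM implementation: the one-dimensional table, the `toy_dynamics` function standing in for a learned model, the candidate names, and the scoring rule are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

TABLE_EDGE = 1.0  # hypothetical: the table spans x in [0.0, 1.0] metres


def toy_dynamics(egg_x: float, push: float) -> float:
    """Stand-in for a learned world model: predict the egg's next position
    given its current position and the displacement an action imparts."""
    return egg_x + push


def rollout(egg_x: float, trajectory: Sequence[float],
            dynamics: Callable[[float, float], float]) -> List[float]:
    """Predict the sequence of egg positions without moving the real robot."""
    states = [egg_x]
    for push in trajectory:
        states.append(dynamics(states[-1], push))
    return states


def score(states: Sequence[float]) -> float:
    """Reject any rollout where the egg leaves the table; otherwise prefer
    the trajectory that disturbs the egg least."""
    if any(x < 0.0 or x > TABLE_EDGE for x in states):
        return float("-inf")
    return -abs(states[-1] - states[0])


def choose_trajectory(egg_x: float, candidates: Dict[str, Sequence[float]],
                      dynamics=toy_dynamics) -> Tuple[str, Dict[str, float]]:
    scored = {name: score(rollout(egg_x, traj, dynamics))
              for name, traj in candidates.items()}
    return max(scored, key=scored.get), scored


# Hypothetical candidates: per-step displacement each approach imparts to the egg.
candidates = {
    "approach_from_inside": [0.03, 0.04],   # nudges the egg toward the edge
    "approach_from_edge": [-0.02, -0.01],   # nudges the egg back onto the table
    "approach_from_above": [0.0, 0.0],      # does not disturb the egg
}

best, scores = choose_trajectory(0.95, candidates)
print(best, scores)
```

A real system would replace `toy_dynamics` with learned predictions and a far richer state, but the structure is the same: roll out each candidate inside the model, score the predicted outcome, and only then act.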

In Luo’s view, the decisive factor in embodied intelligence from here will not be hardware, nor the isolated strength of data, models, or infrastructure. It will be whether they can form a loop with one another. It is like the planks of a barrel: if any key plank is too short, the system’s capability cannot be fully released.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: Why do you think few teams in China’s embodied intelligence sector are truly training foundation models?

Jianlan Luo (JL): By analogy with the development stage of LLMs, I think very few teams in robotics are capable of doing embodied foundation model pretraining. Most are doing fine-tuning or mid-training instead.

Even a lot of mid-training is not very solid. Many so-called “robot foundation models” in the industry today are closer to task adaptation or mid-training on top of existing open-source bases. They have not yet entered a true pretraining stage driven by large-scale, heterogeneous, real interaction data.

There is even a half-joking saying in the industry: in research papers, Physical Intelligence (PI) has never won, but in reality, PI has never lost.

What that reflects is a problem: robot models cannot be judged only by paper metrics. Ultimately, you still have to look at real-world deployment performance.

Looking back at the LLM path, the outputs of pretrained models themselves are full of noise. They need high-quality alignment through mid-training, and then specific capabilities are further activated through post-training.

True robotic foundation model pretraining should also, like LLM pretraining, absorb extremely broad data, even noisy data. The difference is that robotics data is not static text. It is interaction, failure, correction, recovery, and long-tail scenarios in the real world.

36Kr: What are the differences among pretraining, mid-training, and post-training in terms of data and architecture?

JL: These are three stages of training. The main differences are data and training algorithms.

Pretraining trains a model on extremely broad data. It covers a little bit of every data type.
Mid-training uses high-quality teleoperated robot demonstration data to align the model with task requirements.
Post-training optimizes for specific capabilities. For example, reasoning capability in LLMs often has to be further activated and aligned through post-training, reinforcement learning, or high-quality task data.

36Kr: What challenges might Chinese companies face next as they fill the gaps in pretraining and post-training?

JL: The core is data, as well as real-world scenario deployment. The whole system, from data to infrastructure to models, is connected link by link. None of these is absolutely more important than the others. This is the barrel effect.

I believe real-world data must serve as the foundation. It is like reading the same book at different ages. At three, you cannot understand it. At 20, you understand the plot. At 40, you begin to understand the human nature inside.

The stronger the foundation model is, the more efficiently it can absorb heterogeneous data and transfer to new tasks. But without real data as the foundation, if you rely only on simulation or video data, the model’s ceiling will be limited.

36Kr: Companies are now talking about a “GPT moment” for robotics. Roughly what scale of data do you think is needed to truly achieve generalization?

JL: I oppose blindly benchmarking against a GPT-style scaling law.

If we limit the discussion to high-quality robotic data that involves real interaction and can be used for closed-loop deployment, the current data scale is still far from enough. Many claims of data points in the millions do not use consistent definitions. Some refer to video, some to trajectories, some to simulation, some to teleoperation, and some to repeated collection on a single task. There is no consensus yet on how robotic data should be measured.

The scaling law of LLMs is built on a relatively stable and predictable statistical relationship between pretraining loss and model capability. But this law does not automatically hold in embodied intelligence.

A decline in a robot’s training loss only means the model is fitting static data better. It does not amount to a higher deployment success rate in the physical world. Given the complexity of physical interaction, a model that has absorbed contextual data cannot necessarily grasp reality.

Therefore, the gold standard for embodied intelligence is definitely not data scale or loss value. It is deployment effectiveness in real-world scenarios. The true breakthrough point comes when we observe that, as the number of deployed robots increases, the cost of adapting to new scenarios continues to fall and model iteration efficiency keeps improving. That is the inflection point at which the data flywheel starts turning.

Unfortunately, academia and industry still cannot precisely calculate the data magnitude corresponding to this inflection point.

36Kr: You returned to China more than a year ago. From your observations, what is the biggest difference between the embodied intelligence ecosystems in China and overseas?

JL: A robot is a full-stack system. It needs hardware, models, and intelligence, and it also needs to form a data loop through real-world deployment. You cannot wait for one technology to fully converge before working on another.

China’s advantages are its industrial chain, supply chain, engineering capability, and talent density. The part no one has truly cracked globally is the robot’s “brain.” We should combine these advantages, quickly run the loop, and make good use of China’s existing strengths in hardware, scenarios, and deployment, rather than competing only on the robot body.

36Kr: Since returning to China, you have done a lot of work, including LWD, SOP, and the world model released recently. What role does each of these play? What are the main parts of the complete loop?

JL: Moving from the bottom up, the lowest layer is a large number of robot hardware units deployed in real scenarios, which is fleet learning. First, you need a robot “fleet” of sufficient scale.

Above that is the infrastructure layer, including real-time cloud computing, data feedback, communications, training acceleration, and inference acceleration, all integrated across software, hardware, and cloud infrastructure. The SOP we released earlier is essentially a proof of concept for this infrastructure. It proves that this link can be run end to end.

Above that is the algorithm layer, which includes two parts: pretraining and post-training. The LWD we released a few months ago addresses robot post-training and self-evolution. Later, we will also continue advancing our own pretrained foundation model.

The premise of our overall loop is that real-world deployment is not the end of training, but the starting point for intelligence to continue evolving. It can form a positive flywheel: deploy more robots, generate more data, train better models, and then deploy more robots.

36Kr: What does the ideal data flywheel look like?

JL: It is a positive cycle in which more deployment makes the model stronger: the model becomes stronger, so more robots are deployed; more robots are deployed, so more data flows back; more data flows back, so a stronger model is trained.

For example, in semi-structured scenarios such as convenience stores and supermarkets, deploying in the first 20 stores may require collecting a large amount of interaction data. But as the number of deployments increases, the cost of adapting to new scenarios will fall significantly. Ideally, by the time deployment reaches the 100th store, the amount of data needed for adaptation to a new scenario will be very small, even close to out of the box.
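One way to check whether adaptation cost is actually falling is to log how much data each new site needed and look at the trend. The Python sketch below fits a slope in log-log space. It is an assumption-laden illustration: the function names, the `-0.1` threshold, and the sample figures are all hypothetical, and no one in the interview proposes this specific metric.

```python
import math
from typing import Sequence


def loglog_slope(store_index: Sequence[int], samples_needed: Sequence[float]) -> float:
    """Least-squares slope of log(samples needed) against log(store index).
    A clearly negative slope means later stores needed less adaptation data."""
    xs = [math.log(i) for i in store_index]
    ys = [math.log(n) for n in samples_needed]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def flywheel_is_turning(store_index: Sequence[int], samples_needed: Sequence[float],
                        threshold: float = -0.1) -> bool:
    """Hypothetical rule of thumb: the slope must fall below a chosen threshold."""
    return loglog_slope(store_index, samples_needed) < threshold


# Invented log: the nth deployed store and the interaction samples it needed
# before reaching a target success rate.
stores = [1, 5, 20, 50, 100]
falling_cost = [10000, 6000, 2500, 900, 300]
flat_cost = [5000, 5000, 5000, 5000, 5000]

print(round(loglog_slope(stores, falling_cost), 2), flywheel_is_turning(stores, falling_cost))
print(flywheel_is_turning(stores, flat_cost))
```

With the invented falling series the slope is negative and the check passes. With the flat series it does not, which corresponds to a fleet that grows without making each new deployment any cheaper.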

36Kr: What is the significance of opening up this loop?

JL: The hardware is still not perfect, but for building a loop around specific tasks, the hardware is basically sufficient. It is not the main bottleneck. The real shortcoming is the data loop, meaning the ability to continuously iterate across the whole link from models to data.

CEOs and investors around the world are watching embodied intelligence. Everyone is waiting for the initial signal to appear. Once someone runs a commercial loop in a semi-open scenario and proves that the data flywheel can turn, capital and industrial resources will quickly concentrate in that direction.

This is precisely the opportunity for startups. Big companies are constrained by OKRs (objectives and key results) and existing moats, so they turn relatively slowly. A startup’s advantage is speed. We do not need to overturn every scenario.

In the next 12–18 months, if a team can be the first to run the positive cycle among deployment, data, and iteration in semi-structured scenarios such as convenience stores, supermarkets, and warehouses, it will build a very strong first-mover advantage.

36Kr: World models are a hot topic now. How do you understand them?

JL: This topic gets brought up every two years. It started around 2017 or 2018. In the past, it was mainly discussed within technical circles. Now that artificial intelligence has a high level of public attention, world models have also moved into broader public view.

For world models, I care more about the action-conditioned predictive model, which can also be understood as a forward dynamics model. Given the current state and an action, it predicts the future state, reward, or changes in other utility after that action is executed. Its core is being able to evaluate an action’s impact on the future state of the world without actually executing the action.

For example, if I am boiling eggs in the morning, I can predict in my mind that simmering on low heat will take a long time, so high heat may be better. I do not need to actually execute every action first. I judge which plan is better in my head.

36Kr: Why has the technical route for world models not converged?

JL: The biggest problem with world models now is that the definition is too broad. What many people call a world model is closer to a video prediction model, meaning it predicts how the picture will change. But what robots really need is not just a future image. They need to know how an action will change the subsequent state of the world. With that, they can do planning and action evaluation.

If a model only generates future frames but cannot be used to evaluate an action’s impact on the world state, its value for robotic decision-making is limited. To me, what matters more is an action-conditioned predictive model: given the current state and candidate actions, predict what state each action will bring the world into.

Many companies focused purely on world models treat the world model as the ultimate goal. For me, the world model is a tool for achieving the goal of pretraining. The relationship is reversed.

36Kr: What goal do you hope to achieve using world models?

JL: The goal is predictive dynamics. We want to evaluate whether an action is good or bad without executing it, improve planning accuracy, and make the overall system perform better.

36Kr: What’s your view on VLA models? After the value of world models has attracted attention, what does coordination between the two look like?

JL: The necessity of vision and action has already become a consensus. The remaining dispute is whether language is necessary.

I believe language is indispensable. It is the most natural interface for complex task decomposition, long-horizon reasoning, and contextual continuity. A vision-language model (VLM) is currently the best vehicle for handling this kind of high-level planning.

The current VLA approach aligns everything into language space and uses a “discrete pretraining plus continuous action head” model. This may not be the final form, but I think it is too extreme to directly declare that VLA is “dead.” As a complex decision-making system, a robot needs both low-level action precision and high-level planning capability.

At this stage, the amount of data is still far from enough to negate the value of VLA technology. Although world models have advantages in temporal dynamics modeling and action prediction, they still have shortcomings in language grounding and complex logic processing. For example, for a long-horizon task such as boiling an egg, the world model itself still has difficulty completing the full multi-step decomposition and execution.

The real breakthrough in the future will come from integrating VLA with world models: use the former to handle language-driven macro-level planning, and use the latter to ensure the precision of physical execution.

36Kr: So you believe generalization can be achieved without that much data?

JL: Whether data is important and how much data is needed are two different questions.

There is an assumption now that things do not work because there is not enough data, so more data is needed. But there is another possibility.

For example, there may be 100 million households globally. Perhaps we do not need to collect data from 80 million households to generalize to the remaining 20 million. Maybe data from just 10,000 households, combined with other methods, would be enough to generalize to the other households.

No one can prove which assumption is correct right now. We can only keep doing it and validating it along the way. Scientific research is about constantly proposing hypotheses, testing them at the lowest possible cost, and finding the direction of gradient descent, rather than imagining conclusions out of thin air.

36Kr: On the data side, egocentric data is also very popular now. Is this a transitional solution, or will it remain an important component over the long term?

JL: It depends on what the base model is.

If the foundation model is not trained from scratch, but is based on an existing VLM or video model, then those models have already absorbed the features of this type of data, so egocentric data is useful. But if you are training an embodied foundation model from scratch, the core is still real robot deployment data.

Because robotics is currently in a data desert, any data is better than no data. But conclusions drawn under the premise of small data scale may not hold once the field reaches the big data stage.

This is similar to the early days of autonomous driving, when people discussed alternative data sources such as simulation data, Google Street View, and dashcam data. At the time, no one could obtain enough real vehicle data, so all of those data sources had value. But when real vehicle data became abundant enough that large-scale storage and processing infrastructure had to be built specifically for it, the relative importance of other alternative data sources had to be reassessed.

The robotics field today is similar to autonomous driving in its early days. Everyone is proposing different alternative data solutions. Fundamentally, this is because real robot data is still insufficient. Once there is enough data, the value of these solutions will also be reassessed.

KrASIA features translated and adapted content that was originally published by 36Kr. This article was written by Qiu Xiaofen for 36Kr.