Scaling law remains valid, but priorities are shifting, says Moonshot AI founder

China’s landscape for artificial intelligence startups has grown more contentious in 2024. The “AI tigers” that roared into the market in 2023, buoyed by substantial funding and inflated valuations, now face mounting skepticism. Critics point to the sameness in their AI applications and a collective struggle to prove viable business models.

Meanwhile, global AI leaders like OpenAI appear to have eased their pace. GPT-5 remains unreleased, fueling industry-wide speculation about whether the scaling law, long the backbone of large model innovation, has hit its ceiling.

However, Yang Zhilin, founder of Moonshot AI, said that the scaling law is still valid. What has changed, he added, is the nature of what’s being scaled.

Yang’s assertion was the backdrop to Moonshot AI’s unveiling of K0-math, a new computational model focused on mathematical reasoning. Unlike general-purpose large language models (LLMs), K0-math has been engineered to tackle complex mathematical challenges. Its standout feature? Distributed thinking. In live demonstrations, the model systematically broke down problems, retraced faulty logic when errors occurred, and reattempted solutions—behaviors akin to human problem-solving.

According to Moonshot AI’s internal tests, K0-math competes neck-and-neck with OpenAI’s publicly available models, o1-mini and o1-preview. To ensure fairness, the team benchmarked K0-math across a spectrum of datasets. On Chinese educational benchmarks, from junior high to postgraduate exams, K0-math outperformed its counterparts. Even on high-stakes datasets like OMNI-MATH and AIME, K0-math reached 90% and 83% of o1-mini’s top scores, respectively.

The K0-math launch follows closely on the heels of Kimi Explorer, a model enriched with chain-of-thought (CoT) reasoning. The release marks another step in Moonshot AI’s effort to narrow the gap between itself and industry titans. But Yang is candid about K0-math’s growing pains. While capable of tackling complex problems, the model occasionally falters with simpler ones, taking unnecessarily convoluted steps or failing to articulate its reasoning clearly.

Yang, once a steadfast advocate for scaling through computational power and parameter growth, now acknowledges a shifting paradigm: the focus has pivoted from expanding compute and parameter counts toward refining intelligence and reasoning through reinforcement learning.

The launch date of November 16 carried symbolic weight: it marked the first anniversary of Kimi Chat, Moonshot AI’s debut product.

In the past two years, Moonshot AI has been one of the most closely watched AI startups in China. From Kimi Assistant’s popularity in 2023 to rapid ad-based growth in 2024 and the recent arbitration dispute, this team has been constantly in the spotlight, seemingly navigating through fog.

Yet now, Moonshot AI seems uninterested in addressing all the rumors. During the media exchange, Yang only talked about the new model and related technologies, and simply disclosed a figure: by October 2024, Kimi had reached 36 million monthly active users.

Yang said he remains optimistic. The paradigm shift does not make large-scale pretraining obsolete, he said, and leading models could still extract considerable gains from pretraining for another half to a full model generation.

Moreover, practical applications in specialized domains, from academia to industry, may emerge as large models evolve in reasoning ability to meet niche demands.

The following excerpts are from the media exchange with Yang and have been edited and consolidated for brevity and clarity.

Q: After transitioning to a reinforcement learning route, will data become a major challenge for model iteration?

Yang Zhilin (YZ): Indeed, this is the core issue of the reinforcement learning route. Previously, when we trained models to predict the next token, we used static data, and our techniques for data filtering, scoring, and selection were quite mature.

However, in the reinforcement learning route, all data is generated by the model itself (such as certain thought processes). When the model thinks, it needs to know whether the thought is right or wrong, which imposes higher requirements on the reward model. We also have a lot of alignment work to do, which to some extent can mitigate these issues.
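The loop Yang describes, in which the model generates its own data and a reward signal judges whether each attempt is right or wrong, can be illustrated with a toy sketch. This is not Moonshot AI’s method: `toy_policy`, `reward`, and `collect_rewarded_samples` are hypothetical names, the "model" is a random stand-in, and the reward is a simple exact-answer check rather than a learned reward model.

```python
import random


def toy_policy(a, b, rng):
    """Hypothetical stand-in for a model: proposes an answer to a + b,
    and is sometimes off by one."""
    noise = rng.choice([0, 0, 0, 1, -1])
    return a + b + noise


def reward(a, b, answer):
    """Verifiable reward: 1.0 if the proposed answer is correct, else 0.0."""
    return 1.0 if answer == a + b else 0.0


def collect_rewarded_samples(problems, samples_per_problem=8, seed=0):
    """Sample several attempts per problem and keep only those the
    reward function marks as correct. The kept samples play the role
    of self-generated training data."""
    rng = random.Random(seed)
    kept = []
    for a, b in problems:
        for _ in range(samples_per_problem):
            answer = toy_policy(a, b, rng)
            score = reward(a, b, answer)
            if score == 1.0:
                kept.append(((a, b), answer, score))
    return kept


if __name__ == "__main__":
    samples = collect_rewarded_samples([(2, 3), (10, 7)])
    print(f"kept {len(samples)} correct attempts out of 16")
```

In mathematics the final answer can often be checked mechanically, which is one reason such a filter is easy to write here. For open-ended reasoning, the check would have to be replaced by a reward model, which is where the higher requirements Yang mentions come in.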

Q: How do you balance the need to scale computation versus pursuing reinforcement learning in the iteration process of models?

YZ: I think AI development is like swinging between two states. If your algorithms and data are well-prepared but computational power is insufficient, then you need to do more engineering to improve the infrastructure so it can continue advancing.

From the birth of transformers to GPT-4, the main issue has been how to scale. Algorithmically and data-wise, there were no fundamental problems.

But now that scaling has plateaued, adding more computational power may no longer directly solve the problem. The issue is that high-quality data is scarce. Tens of trillions of tokens represent the upper limit of over 20 years of human internet accumulation.

Therefore, we need to change the algorithms to avoid this bottleneck. All good algorithms work well with scaling and allow for greater potential.

We’ve been working on reinforcement learning from early on, and I think it’s going to be a major trend, changing the objective function and the learning process so that models can continue to scale.

Q: Will non-transformer architectures solve this issue?

YZ: No, because it’s not fundamentally an architecture problem. It’s an issue with the learning algorithm or lack of a learning goal. Architecturally, I don’t see a fundamental problem.

Q: Regarding inference costs, now that the mathematical version is online along with Kimi Explorer, will users be able to choose different models, or will you allocate them based on the query?

YZ: In the next version, we’ll likely let users choose. Early on, this allows for better allocation or to better meet user expectations. We also don’t want it to ponder simple equations for too long, so early on, we might go with this solution.

Ultimately, it’s still a technical issue: the system should allocate computing power dynamically. If the model is intelligent enough, it knows roughly how much time each type of problem needs, just as humans do not overthink simple equations.
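One way to picture dynamic allocation of compute is a budget function that scales the number of "thinking" tokens with an estimate of problem difficulty. The sketch below is illustrative only: `estimate_difficulty` is a hypothetical heuristic based on word and operator counts, and the token limits are made-up defaults. A real system would presumably learn the difficulty estimate rather than hard-code it.

```python
def estimate_difficulty(problem: str) -> float:
    """Hypothetical heuristic: longer problems with more math operators
    get a higher score, capped to the range [0, 1]."""
    operators = sum(problem.count(op) for op in "+-*/^=")
    score = 0.02 * len(problem.split()) + 0.1 * operators
    return min(1.0, score)


def thinking_budget(problem: str, min_tokens: int = 64, max_tokens: int = 4096) -> int:
    """Map estimated difficulty linearly onto a reasoning-token budget."""
    difficulty = estimate_difficulty(problem)
    return int(min_tokens + difficulty * (max_tokens - min_tokens))


if __name__ == "__main__":
    simple = "2 + 2 = ?"
    hard = (
        "Find all real x such that x^4 - 5*x^2 + 4 = 0 and x^2 + y^2 = 10 "
        "for integer y, then prove uniqueness of each pair"
    )
    print(thinking_budget(simple), thinking_budget(hard))
```

With such a function, a simple equation receives a small budget while a multi-step problem receives the maximum, which matches the behavior Yang says the model should eventually decide for itself.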

Q: Is the AI industry being held back by scaling laws right now?

YZ: I’m a bit more optimistic. The core issue is that using a static dataset is a rather crude approach, whereas reinforcement learning often involves a human in the loop.

For instance, if you label 100 pieces of data, it can have a big impact, and the rest is handled by the model itself. I think more issues will be resolved this way in the future.

From a methodological standpoint, reinforcement learning offers a high degree of certainty. The main issue lies in figuring out how to tune the model properly, and the potential ceiling is very high.

Q: Last year, you said that long-form text was the first step toward a moonshot. Where do you think mathematical models and deep reasoning fit in?

YZ: It’s the second step.

Q: Now that pretraining scaling seems to be at a bottleneck, what do you think is the impact on the landscape of large models between China and the US? Do you think the gap is widening or narrowing?

YZ: I’ve always thought the gap is relatively constant. It might even be a good thing for us.

If pretraining involves a budget of USD 1 billion this year, USD 10 billion next year, or USD 100 billion the following year, it may not be sustainable.

Certainly, posttraining also needs scaling—just starting from a lower base. For a long time, computational power will not be the bottleneck, which means that innovation capability becomes even more important. In this context, I think it’s an advantage for us.

Q: Are deep reasoning and the mathematical model far removed from everyday users? How do you see the connection between these features and users?

YZ: Not really. There’s value on two fronts.

First, today’s mathematical models have substantial value in educational products and play an important role in our overall traffic.

Second, it serves as a vehicle for technical iteration and verification. We can apply this technology in more scenarios, such as deep searches in the Explorer version. So it has two layers of meaning.

Q: There’s a lot of discussion about AI applications now. Superapps haven’t really emerged, and many AI apps are very similar. What’s your perspective on this?

YZ: I think a superapp already exists: ChatGPT has more than 500 million monthly active users. Isn’t that a superapp? At least it’s halfway there, which largely answers the question.

Even products like CharacterAI initially had a large user base, but it was hard to expand beyond that. In this process, we also assess the US market to determine which business has the greatest potential to succeed.

We will focus on what we believe has the highest ceiling, which also aligns best with our mission.

Q: The industry is seeing AI startups being acquired, and talent flowing back to big tech companies. How do you see this trend?

YZ: We haven’t faced this issue, but some other companies might have. I think it’s normal, as the industry is entering a new phase. Initially, there were many companies, but now fewer are working on this.

Everyone’s focus is diverging, which is a natural progression. Some companies can’t keep up, which leads to issues like these—it’s just part of the industry’s evolution.

Q: You rarely mention the state of model training. What is the current situation with pretraining?

YZ: I believe there is still room for pretraining—perhaps for half a generation to a full generation of models—and this potential will be realized next year. I expect that, by next year, leading models will bring pretraining to an extreme level.

However, we believe the focus will shift to reinforcement learning next. There will be changes in the paradigm. Fundamentally, it’s still about scaling—it doesn’t mean scaling stops, but rather that we’ll scale through different approaches. That’s our view.

Q: Sora is about to release its product. When will you release a multimodal product? What’s your take on multimodality?

YZ: We are also working on it, and several of our multimodal capabilities are currently in internal testing.

Regarding multimodality, I think AI’s most important capabilities in the future are thinking and interaction. Thinking is more crucial than interaction.

This isn’t to say that interaction isn’t important, but thinking determines the ceiling. Interaction is a necessary condition—without capabilities like vision, interaction isn’t possible.

However, for thinking, you need to consider the difficulty of the task: Does it require a PhD to label, or can anyone do it? Which one is harder to find? That defines the ceiling for AI.

Q: What’s your perspective on competition with AI applications like Doubao?

YZ: We prefer to focus on creating real value for users rather than focusing too much on the competition itself, as competition doesn’t generate value.

Our core issue is how to improve the model’s reasoning capabilities. Providing greater value to users through this focus is about doing the right thing, rather than intentionally doing things differently. If someone achieves artificial general intelligence (AGI), it’s a great outcome.

Q: When did you decide to focus solely on Kimi as the only product?

YZ: Around March or April this year, maybe between February and April. It was based on assessing the US market and our own observations. It was clear that we needed to cut back rather than add more, focusing on fewer things.

Q: Why?

YZ: Over the past two years, we actively chose to reduce the number of projects. This was important and a major lesson over the past year.

Initially, we tried to develop multiple products simultaneously. This approach may work for a while, but ultimately, it’s crucial to focus and perfect one product.

Cutting down on projects is essentially about controlling team size. Among the major startups developing large models, we’ve always kept our team small, maintaining the highest ratio of GPUs to people. This is crucial.

We don’t want the team to grow too large—it has a deadly impact on innovation. Developing three projects simultaneously turns us into a big company, which means we lose all our advantages.

Q: What is your core mission right now?

YZ: The core mission is to improve user retention, or at least use retention as an important metric.

I think user retention is directly related to the model’s maturity and technical capability.

The reasoning capabilities aren’t yet strong enough, and the interaction isn’t rich enough, so what it can do today is still limited. There’s still much room to improve interaction—whether between users and AI or with the objective world.

If we measure our distance from the goal of AGI, we’re still at an early stage. Of course, every year we make significant progress. If you used last year’s product today, it would feel almost intolerable.

Q: What are your thoughts on going global?

YZ: Focus first, then globalize—we need to be more patient.

Q: Recently, everyone is talking about growth in large models and how to achieve commercial success. How do you maintain positive commercialization?

YZ: I think there’s definitely potential, but for us, the key is still retention. We need to think more long-term—at least until ROI becomes positive. This is highly correlated with technological advancement.

For us, the main focus is improving retention and organic growth. Proper marketing is needed, but balancing the relationships among these factors is crucial.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Yong Yi for 36Kr.