From July 4–7, the World Artificial Intelligence Conference (WAIC) took place at the Shanghai World Expo Exhibition and Convention Center, with chip companies emerging as the focal point due to their critical role in artificial intelligence development. While Nvidia was the highlight last year, this year’s WAIC showcased significant shifts in the industry.
The conference was organized into three halls: H1, H2, and H3. Notably, about half of the exhibitors in Hall H2 were companies involved in intelligent computing centers and chips, including major players like Huawei, Baidu, Iluvatar Corex, Moore Threads, Biren Technology, Sugon, and Enflame.
However, given the sensitive nature of chip technology, these companies were cautious in their presentations. Instead of showcasing their latest chip products directly, they focused on displaying servers equipped with the chips or demonstrating applications through partners.
WAIC provided a broad overview of the AI industry and its future trends, particularly in computing power. Just as large AI models have split into two camps, one focused on commercial applications and the other on pushing performance and parameter counts, a similar polarization is emerging in computing power. Chinese chip companies are diverging strategically: some are pursuing high-performance GPUs, while others are concentrating on the practical deployment of large models across various industries, heralding a boom in inference chips.
The boom in inference chips
Inference chips were a highlight at WAIC, with several companies reporting significant sales increases. Iluvatar Corex and Enflame Technology representatives mentioned shipping tens of thousands of inference chips since last year, primarily through intelligent computing centers. Baidu also noted that most of its Kunlun chip sales were inference cards, with tens of thousands of units shipped via official and external channels. These inference cards are designed for large-parameter models and cloud computing, while other companies are focusing on edge use cases, running large model inference on smaller, more cost-effective chips suited to local operation.
An employee from Axera said that, although its small chip can only run models with two billion parameters or less, it offers the advantage of local edge operation, ensuring privacy and significant cost-effectiveness. This chip is widely used in surveillance cameras and other internet-of-things devices and is currently in high demand.
The rise of inference chips has mainly been driven by market dynamics and competition.
With mainstream large models gradually moving toward open source and model makers slashing prices to capture the market, large models are being adopted more widely across industries. Inference is, in essence, the process of putting large models to use, and it requires substantial support from inference chips. Unlike training scenarios, inference chips cater to diverse industries, presenting vast opportunities.
More importantly, this market has not yet been locked up by Nvidia. Previously, the industry had been using Nvidia’s RTX 4090 and L20 products for large model inference, but these products have notable drawbacks.
For instance, the RTX 4090 is a consumer-grade graphics card that Nvidia does not officially sanction for large model inference, and it is currently subject to export restrictions. Moreover, Nvidia’s products have limitations such as high power consumption and insufficient memory, which competitors have sought to exploit.
For example, a representative from Iluvatar Corex said that its large model inference server, with 16 cards and 512 gigabytes of memory, matches the cost of Nvidia’s RTX 4090 but consumes only one-third of the power. This product has already been integrated into the supply chain of a major model company. New players like ZTE are also entering the inference chip market with innovative business models.
From cards to clusters
While inference chips are a booming business, some chip manufacturers are relentlessly pursuing higher computing power. Notably, several Chinese computing power clusters are transitioning from 1,000-card to 10,000-card scales. At this year’s WAIC, Moore Threads introduced Kuae, touted as China’s first 10,000-card intelligent computing cluster solution.
Clusters of 10,000 cards have become the standard for pre-training among leading large model manufacturers. OpenAI, Google, and Meta have all achieved clusters of this scale, with OpenAI notably boasting over 50,000 cards.
In China, only a handful of companies, such as ByteDance, are believed to have clusters of such scale, and all are using Nvidia’s products. Additionally, Huawei and Moore Threads are advancing 10,000-card clusters built with their own chips.
It’s worth noting that these clusters are primarily tailored for trillion-parameter AI models. The industry remains divided on their practicality—some are optimistic, while others approach with caution. At WAIC, several chip company representatives voiced concerns over the immense investment required for such large intelligent computing centers. After all, the commercial viability of trillion-parameter models is still largely unproven. Currently, companies are more focused on optimizing clusters with hundreds or, at most, thousands of cards.
Scaling to these levels involves more than just increasing the number of GPUs. It imposes far greater software demands, including building a massive system to interconnect the components, maximizing computing efficiency, and ensuring long-term stability and reliability during model training.
Despite the challenges, there is optimism. Moore Threads founder Zhang Jianzhong said that, since the introduction of scaling laws for large models in 2020, the integration of computing power, algorithms, and data has driven significant performance improvements. This trend is expected to continue.
Moreover, while the Transformer architecture is currently mainstream, emerging architectures like Mamba, RWKV, and RetNet have demonstrated their ability to enhance computational efficiency. In time, these innovations will likely drive demand for higher-performance computing resources.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Qiu Xiaofen for 36Kr.