Zhipu AI is stepping on the gas in the race for multimodal artificial intelligence supremacy. On July 26, 2024, it launched Zhipu Qingying, a video generation model akin to OpenAI’s Sora. While Sora remains inaccessible months after its announcement, Qingying was made available to the public for free from day one.

A month later, on August 29, Zhipu made a splash at the International Conference on Knowledge Discovery and Data Mining (KDD), debuting “Her,” a GPT-4o-like model. This was featured in the consumer-facing product Zhipu Qingyan, which introduced a new “video call” function—bringing AI one step closer to human-like communication.

Qingyan keeps up with trends, too. After taking a look at the viral game Black Myth: Wukong, it quickly understood the content and could chat with users about it.

Alongside these updates, Zhipu rolled out a new multimodal model suite, featuring the visual model GLM-4V-Plus, which can understand both videos and web pages, and the text-to-image model CogView-3-Plus.

The base language model GLM has also been upgraded to GLM-4-Plus, a model capable of handling long texts and solving complex math problems with ease.

Zhipu’s answer to GPT-4o: Homework helper, tutor, and kitchen assistant

Previously, GPT-4o wowed users with its ability to read and express emotions. Qingyan, by contrast, takes a more straightforward approach, reminding users that, as an AI, it can’t express emotions.

That said, Qingyan’s video call feature opens up practical applications tailored to China’s focus on lifelong learning.

For example, it can serve as a personal English tutor. With the camera on, users can learn on demand, anytime, anywhere. Qingyan also doubles as a math teacher—its explanations rival those of real-life tutors. Parents can finally take a breather from homework stress.

At home, Qingyan acts as a personal assistant, too. It can recognize a Luckin Coffee bag and provide a brief history of the brand. Though sometimes, it veers off course—like when it explained how to store the bag instead of the coffee inside.

Though video call histories can’t be saved yet, using Qingyan feels like having a tutor, homework helper, and kitchen assistant rolled into one.

New visual model: From video understanding to code interpretation

At KDD, Zhipu AI unveiled its updated model suite, including a new generation of its base language model and an enhanced multimodal family: GLM-4V-Plus and CogView-3-Plus.

What’s notable about GLM-4-Plus is that it was trained on high-quality synthetic data, demonstrating that AI-generated data can be highly effective for model training while reducing costs. According to Zhipu AI, GLM-4-Plus’s language understanding rivals that of GPT-4o and Llama 3.1 405B.

Comprehensive Benchmarks

| Model | AlignBench | MMLU | MATH | GPQA | LCB | NCB | IFEval |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 80.7 | 88.3 | 71.1 | 56.4 | 49.8 | 53.1 | 80.6 |
| Llama 3.1 405B | 60.7 | 88.6 | 73.8 | 50.1 | 39.4 | 50 | 83.9 |
| Gemini 1.5 Pro | 74.7 | 85.9 | 67.7 | 46.2 | 33.6 | 42.3 | 74.4 |
| GPT-4o | 83.8 | 88.7 | 76.6 | 51.0 | 45.5 | 52.3 | 81.9 |
| GLM-4-Plus | 83.2 | 86.8 | 74.2 | 50.7 | 45.8 | 50.4 | 79.5 |
| GLM-4-Plus / GPT-4o | 99% | 98% | 97% | 99% | 101% | 96% | 97% |
| GLM-4-Plus / Claude 3.5 Sonnet | 103% | 98% | 104% | 85% | 92% | 95% | 99% |

In terms of long-text capabilities, GLM-4-Plus performs on par with GPT-4o and Claude 3.5 Sonnet. On the InfiniteBench test suite, created by Liu Zhiyuan’s team at Tsinghua University, GLM-4-Plus even slightly outperformed these leading models.

Long-text Modeling Benchmarks

| Model | LongBench-Chat | InfiniteBench/EN.MC | Ruler |
| --- | --- | --- | --- |
| Mistral-123B | 8.2 | 38.9 | 80.5 |
| Llama 3.1 405B | 8.6 | 83.4 | 91.5 |
| Claude 3.5 Sonnet | 8.6 | 79.5 | – |
| Gemini 1.5 Pro | 8.6 | 80.9 | 95.8 |
| GPT-4o | 9.0 | 82.5 | – |
| GLM-4-Plus | 8.8 | 85.1 | 93.0 |
| GLM-4-Plus / GPT-4o | 98% | 103% | – |
| GLM-4-Plus / Claude 3.5 Sonnet | 102% | 107% | – |

Moreover, by adopting proximal policy optimization (PPO), a reinforcement learning algorithm that improves a model’s outputs against a reward signal while keeping each update stable, GLM-4-Plus has significantly boosted its reasoning over data and code, and it better aligns with human preferences.
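
For reference, PPO optimizes a clipped surrogate objective; this is the standard formulation from the reinforcement learning literature, not a detail Zhipu AI has disclosed about its own training recipe:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

Here $\hat{A}_t$ estimates how much better a sampled action (for a language model, a generated token) is than average, and the clipping prevents any single update from moving the policy too far, which is what keeps training stable.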

The processing cost for 1 million tokens with GLM-4-Plus is RMB 50 (USD 7), comparable to Baidu’s latest large model, Ernie 4.0 Turbo, which costs RMB 30 (USD 4.2) for input and RMB 60 (USD 8.4) for output per million tokens.
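
To make the two pricing schemes comparable, here is a quick back-of-the-envelope calculation; the 70/30 input-output split is an illustrative assumption, and GLM-4-Plus is assumed to charge one flat rate in both directions, as reported:

```python
# Rough cost comparison in RMB (list prices per 1M tokens).
GLM4_PLUS_FLAT = 50                  # GLM-4-Plus: flat rate, input and output
ERNIE_INPUT, ERNIE_OUTPUT = 30, 60   # Ernie 4.0 Turbo: split pricing

# Illustrative workload: 700k input tokens, 300k output tokens.
input_tokens, output_tokens = 700_000, 300_000

glm_cost = (input_tokens + output_tokens) / 1_000_000 * GLM4_PLUS_FLAT
ernie_cost = (input_tokens * ERNIE_INPUT + output_tokens * ERNIE_OUTPUT) / 1_000_000

print(f"GLM-4-Plus:      RMB {glm_cost:.2f}")    # RMB 50.00
print(f"Ernie 4.0 Turbo: RMB {ernie_cost:.2f}")  # RMB 39.00
```

Which model is cheaper therefore depends on the mix: Ernie 4.0 Turbo wins on input-heavy workloads, while GLM-4-Plus’s flat rate favors output-heavy ones.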

But what’s truly groundbreaking is its multimodal capability.

Vision Capability Benchmarks

| Model | OCRBench | MME | MMStar | MMVet | MMMU-Val | AI2D | SEEDBench-IMG |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude 3.5 Sonnet | 788 | 1920 | 78.5 | 62.2 | 66.0 | 80.2 | 72.2 |
| Gemini 1.5 Pro | 754 | 2110.6 | 73.9 | 59.1 | 64.0 | 79.1 | – |
| GPT-4V-1106 | 516 | 1771.5 | 73.8 | 49.7 | 63.6 | 75.9 | 72.3 |
| GPT-4V-0409 | 656 | 2070.2 | 79.8 | 56.0 | 67.5 | 78.6 | 73.0 |
| GPT-4o | 736 | 2310.3 | 80.5 | 69.1 | 69.2 | 84.6 | 77.1 |
| GLM-4V-Plus | 833 | 2274.7 | 82.4 | 69.9 | 53.3 | 83.6 | 77.4 |
| GLM-4V-Plus / GPT-4o | 113% | 99% | 102% | 101% | 99% | 99% | 100% |
| GLM-4V-Plus / Claude 3.5 Sonnet | 106% | 118% | 105% | 106% | 81% | 104% | 107% |

GLM-4V-Plus, the new visual model, now understands videos and web pages—significant improvements over its predecessor.

For instance, uploading a screenshot of Zhipu AI’s homepage allows GLM-4V-Plus to instantly convert it into HTML code, helping users quickly recreate a website.
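
To give a concrete sense of how this could be wired up, here is a minimal sketch against Zhipu AI’s open platform using the zhipuai Python SDK; the model id glm-4v-plus and the exact request shape are assumptions to verify against the platform’s current documentation:

```python
import base64

from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key="your-api-key")

# Encode the homepage screenshot as base64 (the SDK also accepts image URLs).
with open("homepage_screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="glm-4v-plus",  # assumed model id; confirm on the open platform
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_b64}},
                {
                    "type": "text",
                    "text": "Convert this web page screenshot into HTML and "
                            "CSS that reproduce its layout.",
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)  # the generated HTML
```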

Unlike typical video comprehension models, GLM-4V-Plus not only understands complex videos but also has a sense of time. You can ask it about specific moments in a video, and it can identify the exact content. However, as of this writing, Zhipu AI’s open platform doesn’t yet support video uploads for this feature.

Despite its impressive visual capabilities, GLM-4V-Plus still lags behind GPT-4o in multi-turn dialogue and text understanding.

Video Understanding Capability Benchmarks

| Model | MVBench | LVBench | Temporal QA | Multi-turn Dialogue | Chinese-English Support |
| --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT-Video | 50.6 | 32.2 | – | – | – |
| PLLaVA | 58.1 | 26.1 | – | – | – |
| LLaVA-OneVision | 59.4 | 27.0 | – | ✅ | ✅ |
| GPT-4o | 47.8 | 34.7 | – | ✅ | – |
| Gemini 1.5 Pro | 52.6 | 33.1 | ✅ | ✅ | ✅ |
| GLM-4V-Plus | 71.2 | 38.3 | ✅ | ✅ | ✅ |
| GLM-4V-Plus / GPT-4o | 149% | 110% | | | |
| GLM-4V-Plus / Gemini 1.5 Pro | 135% | 116% | | | |

At KDD, Zhipu AI also introduced CogView-3-Plus, the next generation of its text-to-image model. Compared with FLUX, the current frontrunner in the field, CogView-3-Plus holds its own while generating images within 20 seconds.

Text-to-image Generation Capability Benchmarks

| Model | CLIP Score | AES Score | HPSv2 | ImageReward | PickScore | MPS |
| --- | --- | --- | --- | --- | --- | --- |
| SD3-Medium | 0.2655 | 5.52 | 0.2774 | 0.2144 | 21.31 | 10.57 |
| Kolors | 0.2726 | 6.14 | 0.2833 | 0.5482 | 22.14 | 11.86 |
| DALL-E 3 | 0.3237 | 5.95 | 0.2904 | 0.9734 | 22.51 | 11.95 |
| Midjourney V5.2 | 0.3144 | 6.12 | 0.2813 | 0.8169 | 22.74 | 12.40 |
| Midjourney V6 | 0.3276 | 5.95 | 0.2798 | 0.8351 | 22.78 | 12.34 |
| FLUX-dev | 0.3155 | 6.04 | 0.2881 | 1.0333 | 22.96 | 10.12 |
| CogView-3-Plus Full (20 s) | 0.3177 | 5.90 | 0.2963 | 0.9797 | 22.53 | 12.55 |
| CogView-3-Plus Lite (5 s) | 0.3119 | 5.91 | 0.2843 | 0.9384 | 22.52 | 12.48 |
Image source: 36Kr.
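
Image generation on Zhipu’s open platform goes through an images endpoint; the sketch below uses the zhipuai Python SDK, with the model id cogview-3-plus as an assumption to check against the current documentation:

```python
from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key="your-api-key")

# Request one image; the full "Plus" model reportedly takes about 20 seconds,
# while the Lite variant targets roughly 5 seconds at some cost in quality.
response = client.images.generations(
    model="cogview-3-plus",  # assumed model id; confirm on the open platform
    prompt="A minimalist poster of a paper crane gliding over a neon-lit city",
)

print(response.data[0].url)  # URL of the generated image
```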

CogView-3-Plus also supports image editing, such as changing object colors or replacing items in an image.

Image source: Zhipu AI.

It took Zhipu AI more than seven months to add the “Plus” suffix to the models it launched in January 2024, its longest iteration cycle since 2023.

What’s clear is that GPT-4o represents a pivotal moment for AI companies. As multimodal capabilities merge, the “black box” of language understanding is beginning to open—only to be quickly resealed by GPT-4o.

Most Chinese AI companies are adopting a divide-and-conquer strategy, first enhancing single-modal capabilities before tackling integration challenges. Zhipu AI is still in this phase, but the launch of its video call feature hints at the early stages of multimodal fusion.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Zhou Xinyu for 36Kr.