Ever run into situations like these?

You realize just before a trip that you need to add a stop, but the only way to do it is to tell the driver after you get in. A family of six needs a ride, so you have to compare vehicle types and guess whether there is enough trunk space. An older adult needs to pick up a child but does not know how to use a ride-hailing app, so a family member books the ride and then calls again when the car is close.

These are small frustrations, but they point to a larger problem: digital services still ask people to adapt to software, rather than the reverse.

That raises a simple question. If artificial intelligence is already good at many digital tasks, when will it be able to handle something as ordinary as booking a ride?

Over the past year, companies have pitched AI as a way to complete work for people. Large models can write reports, build slide decks, and generate marketing copy. They can take on a substantial amount of cognitive work. But ask one to hail a car, and the limitations become obvious.

The issue is not whether the technology can generate an answer. It is whether anyone trusts AI to take responsibility in a real-world service.

In digital settings, the cost of an AI mistake is usually limited. A factual error can be corrected. A weak draft can be rewritten. In the physical world, even a small error can cost time, money, or safety.

That is why most AI systems still operate mainly as tools for suggestion and assistance. Real-world services require something harder: reliable execution.

This is the gap. AI is improving at task completion, but it still lacks dependable judgment about consequences and accountability.

Why ride-hailing is a useful test

On March 23, Alibaba’s Qwen rolled out an AI ride-hailing feature. Users can describe what they want in natural language, including the destination, preferred price range, whether to share the ride, and any vehicle preferences. The AI then completes the booking without forcing users to switch apps or work through multiple menus.

At first glance, this may look like a simpler interface. In practice, it shifts AI from interpreting requests to carrying them out.

Ride-hailing is a useful test of whether AI can handle real-world services. It is frequent, time-sensitive, execution-heavy, and sensitive to friction. Users monitor the process closely. Has a driver accepted the order? Is the route reasonable? Will the driver arrive late? A problem at any stage quickly becomes a bad user experience.

More importantly, success does not depend on one accurate model. It depends on several links working together reliably.

From an engineering perspective, that makes deployment hard. In a purely digital product, adding steps to a flow is often manageable. In real-world fulfillment, every added dependency raises the chance of failure.

Suppose a single ride-hailing request involves five critical steps: speech recognition, intent understanding, spatial reasoning, route planning, and supply dispatch. Even if each step works 95% of the time, the overall success rate falls to about 77%, because every step has to work in sequence.

Add traffic, shifting driver supply, and other real-world variables, and the process may involve more than ten tightly connected steps. At that point, the overall success rate can fall below 60%. Errors that happen early matter most. If the system misunderstands the request at the start, the rest of the process cannot recover cleanly, no matter how strong the dispatch system is.
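The compounding effect described above is easy to verify: in a sequential pipeline, per-step success rates multiply. A minimal sketch (the step counts come from the article; the function name is ours):

```python
from functools import reduce

def chain_success_rate(step_rates):
    """Overall success probability of a sequential pipeline:
    every step must succeed, so the per-step rates multiply."""
    return reduce(lambda acc, r: acc * r, step_rates, 1.0)

# Five critical steps, each succeeding 95% of the time.
print(f"{chain_success_rate([0.95] * 5):.1%}")   # ~77.4%

# A ten-step chain at the same per-step reliability.
print(f"{chain_success_rate([0.95] * 10):.1%}")  # ~59.9%
```

Note how the failure probability grows faster than intuition suggests: doubling the chain length at 95% reliability cuts the end-to-end success rate from roughly three in four to under three in five.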

For the rider left waiting on the curb, those probabilities do not matter. What matters is whether the car arrives.

Qwen app users can now access a feature (top left) that allows them to delegate ride-hailing to the AI. Image source: 36Kr.

Qwen’s exploration of actionable AI did not begin with ride-hailing.

During this year’s Lunar New Year holiday, Qwen tested an initiative that let users treat family and friends to real-world services with a single prompt. Users could order takeout, book hotels, buy movie tickets, and trigger other actions on someone else’s behalf. The experiment pushed the model beyond the chat interface and used language itself as a service interface.

The ride-hailing feature launched at the end of March extends that idea. Unlike placing a static order, ride-hailing requires AI to respond in real time to changing conditions, including vehicle matching, price limits, route changes, and supply fluctuations. None of these variables can be fully preset, and each decision affects the user experience in the moment.

That points to a broader shift in what AI systems are being asked to do. They are not just generating text on a screen. They are starting to operate across real services that involve streets, restaurants, and theaters.

Qwen is also not relying on a single feature. It is tying together several modules built for the ride-hailing use case. The system has to understand requests such as “six people need a business van” or “need to pick someone up on the way, so add a stop.” It supports saved locations and scheduled departures, and it is expected to add more proactive services over time, such as suggesting itineraries based on weather or road conditions.

This changes the user experience. Instead of clicking through layers of menus to choose a vehicle type, enter addresses, and add stops manually, users can describe what they need in natural language. That matters especially for older adults, visually impaired users, and others who struggle with conventional app flows.

The combination of AI assistance and specialized skills can also expand access, making ride-hailing usable for people who were poorly served by existing interfaces.

Qwen’s skills may also work across domains. Its ride-hailing function can coordinate with hotel booking, food delivery, ticketing, and other services. A request such as “help me plan a weekend trip to Hangzhou” could trigger a chain of actions: book a hotel, arrange a ride, recommend local food, and reserve a boat tour. The broader idea is that natural language becomes the interface for multiple connected services.

Why AI ride-hailing is hard to build

On the surface, AI ride-hailing may seem like a matter of connecting voice commands to a mobility platform’s API. But the real barrier is not the interface. It is accountability and the ability to complete the task reliably.

Companies such as OpenAI, Anthropic, and Google DeepMind have advanced AI systems. Some have already released agent prototypes with function calling and memory. But moving into a physical service such as ride-hailing presents three major obstacles.

First, the fulfillment chain is long, and the tolerance for error is low.

Ride-hailing is not like sending a message or generating an image. It involves a full chain: parsing user intent, understanding geography, matching vehicle type, estimating price, dispatching drivers, tracking the trip, and handling exceptions. A mistake at any stage can cause the service to fail.

Most AI products still rely on probabilistic output plus human fallback. A chatbot can admit an error and ask the user to clarify. A ride-hailing service cannot send the wrong car and treat that as a minor correction.

Second, there is a trust gap between platforms and AI providers.

If OpenAI wanted to partner with Uber, deep integration would be difficult. Uber’s core assets are its driver network and dispatch algorithms. Any outside AI system that touches dispatch logic would need unusually high levels of access.

For Uber, that creates a practical question: if an AI mistake leads to invalid orders, wasted driver mileage, or user complaints, who is responsible? Does the AI company bear the cost, or does the platform absorb it? There is still no clear precedent for dividing responsibility in a case like this.

By contrast, in a traditional app interface, responsibility is easier to assign. If a user selects the wrong vehicle type or enters the wrong address, the error is clearly the user’s. Once an AI agent makes that decision on the user’s behalf, the boundary becomes less clear, and platforms are wary of that ambiguity.

Third, end-to-end infrastructure remains hard to control.

Even AI companies that excel at building general-purpose models usually do not control offline service networks. Google aggregates third-party mobility services through Maps and has also operated robotaxi services through Waymo. Apple has a strong ecosystem but has never built a significant local-services entry point. Meta remains focused mainly on social platforms and online commerce.

That means even if these companies can produce a convincing demo, they still may not be able to guarantee consistent service across cities, during rush hour, or in bad weather. AI ride-hailing is an infrastructure problem as much as an interface problem. It requires real-time awareness of supply, dynamic strategy adjustments, and a fast response to exceptions. That kind of system cannot be assembled simply by connecting a few APIs.

Qwen’s move into ride-hailing matters not because the problem is easy, but because it makes AI’s current limits harder to ignore.

Silicon Valley’s hesitation points to a broader reality. Once AI moves from the information layer to the action layer, intelligence alone is not enough. Systems also need reliability, control, and a clear allocation of responsibility.

Why responsibility matters more than intelligence

For the past few years, the standard for judging whether an AI system is good has been fairly narrow. Can it write smooth copy? Can it generate high-quality images? Can it outperform humans on benchmark tests?

Those capabilities matter, but they mostly play out in a digital environment where mistakes are easier to reverse. If something goes wrong, it can be redone or deleted. But once AI starts operating in services such as ride-hailing, restaurant booking, and food delivery, the stakes change.

A single mistake is no longer just a bad output. It can create a real loss. A user may miss a flight, a child may be left without a pickup, or an older adult may end up waiting in the rain. At that point, users do not want an assistant that sounds smart. They want a system that is dependable and finishes the job.

That is the gap between current AI tools and usable agents. One can generate. The other has to fulfill.

In practice, fulfillment means understanding the concrete need behind a fuzzy instruction. A request for “a clean, fresh-smelling car” carries expectations about odor, cleanliness, and vehicle condition. It means asking follow-up questions, or inferring reasonable constraints when information is incomplete. If a family of six needs a ride, a five-seat car should be ruled out by default. It also means stepping in quickly when something goes wrong. If a driver cancels, the system should re-dispatch within 30 seconds and notify the user.
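The two behaviors described above, inferring hard constraints from a request and recovering from a cancellation, can be sketched in a few lines. Everything here is a toy illustration: the fleet, function names, and seat counts are invented, not Qwen's implementation.

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    name: str
    seats: int  # passenger capacity, excluding the driver

def filter_by_party_size(vehicles, passengers):
    """Infer a hard constraint from the request: a party of six
    rules out any vehicle with fewer than six passenger seats."""
    return [v for v in vehicles if v.seats >= passengers]

def on_driver_cancel(request, dispatch, notify_user):
    """If a driver cancels, re-dispatch immediately and tell the
    user, rather than silently failing (the 30-second target)."""
    new_driver = dispatch(request)
    notify_user(f"Driver cancelled; re-matched with {new_driver}")
    return new_driver

FLEET = [Vehicle("sedan", 4), Vehicle("five-seat SUV", 4), Vehicle("business van", 6)]
print([v.name for v in filter_by_party_size(FLEET, 6)])  # ['business van']
```

The point of the sketch is that these are deterministic rules layered on top of a probabilistic model: the model interprets “six people,” but a hard filter, not a language model, decides which vehicles are eligible.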

Above all, it means taking responsibility for the final result, even when multiple systems are involved.

That kind of responsibility will not come from fine-tuning alone or from longer context windows. It requires a different product architecture: an intent engine that models real-life situations and latent constraints, an execution loop that keeps the process traceable and open to intervention from the moment a request is issued to the moment the service is completed, and a trust mechanism that creates a clear path for attribution and repair when AI makes a mistake.
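One way to read the “traceable execution loop” idea above is as a pipeline that records the outcome of every step, so a failure can be attributed to a specific stage and repaired rather than silently swallowed. A minimal sketch, with all names invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]

@dataclass
class Trace:
    events: list = field(default_factory=list)
    def log(self, name, ok):
        self.events.append((name, ok))

def execute(steps, state):
    """Run steps in order, logging each outcome. On failure,
    stop: the trace shows exactly where attribution lies."""
    trace = Trace()
    for step in steps:
        try:
            state = step.run(state)
            trace.log(step.name, True)
        except Exception:
            trace.log(step.name, False)
            break
    return state, trace

def failing_dispatch(state):
    raise RuntimeError("no drivers available")

steps = [Step("parse_intent", lambda s: {**s, "intent": "ride"}),
         Step("dispatch", failing_dispatch)]
state, trace = execute(steps, {})
print(trace.events)  # [('parse_intent', True), ('dispatch', False)]
```

A real system would add retries, timeouts, and user-facing intervention points, but the core property is the same: every step leaves an auditable record, which is what makes a clear path for attribution possible.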

Behind that is a broader shift in how AI should be judged. The question is no longer only how human it appears, but whether it can be held accountable.

That is difficult, because responsibility imposes constraints. It means not overpromising for the sake of a demo. It means not trading certainty for speed. It means not treating users as test subjects in an experiment.

Historically, technologies become part of everyday life not just because they are advanced, but because people find them reliable. Electric lighting replaced oil lamps not only because it was brighter, but because it was safer. Smartphones became universal not only because they offered more features, but because their interfaces felt direct and intuitive.

If AI is to enter ordinary households in a meaningful way, it will have to cross that last gap between being impressive and being dependable. Ride-hailing is a useful way to see whether it can.

KrASIA features translated and adapted content that was originally published by 36Kr. This article was written by Xiao Xi for 36Kr.