Xpeng reframes autonomous driving as AI deployment in the physical world

At a media briefing following the launch of Xpeng’s second-generation vision-language-action (VLA) model, He Xiaopeng, chairman and CEO of the Chinese automaker, expressed confidence that the company is “nearly five times ahead of top-tier industry players.”

The evolution of smart driving is shifting from vehicles defined primarily by software to what Xpeng describes, in paraphrased terms, as “AI-defined super agents.” The company has outlined an aggressive roadmap: skip Level 3 autonomous driving, a stage it views as constrained by hardware, software, and regulatory compromises, and instead focus development on Level 2 and Level 4 systems.

In He’s view, the second-generation VLA system already provides the technical foundation for Xpeng to move directly from Level 2 to Level 4.

Like Tesla, Xpeng is no longer focusing on incremental improvements to its existing autonomous driving framework. Instead, it is reframing autonomous driving as the physical deployment of artificial general intelligence. Before announcing this shift, Xpeng merged its cockpit and autonomous driving centers, consolidating artificial intelligence resources into a unified platform intended to improve development efficiency.

The company is now introducing the concept of a “world model” to deepen integration between smart cockpit and smart driving systems. Rather than operating as separate domains, the two systems are being combined with the aim of evolving from passive tools into active service providers within the next one to three years.

This ambition is said to depend on building a robust foundation model and addressing the data challenge. “Building a strong foundation model is a mandatory requirement,” said Liu Xianming, head of Xpeng’s general AI center. “Without it, companies risk falling behind in this transition.”

Xpeng’s technological shift has been decisive. Software upgrades centered on intelligence are now central to its product strategy. However, the company remains an automaker whose primary revenue comes from vehicle sales. In China’s increasingly competitive automotive market, every manufacturer, including Xpeng, must navigate both market pressure and rapid technological change.

The following transcript has been edited and consolidated for brevity and clarity.

36Kr: Why does Xpeng advocate skipping Level 3 and even propose this at the Two Sessions?

He Xiaopeng (HX): I believe that starting from Level 4, there will be a new responsible party. Given the current trajectory of global technological development, the next step after Level 2 is essentially Level 4. Inserting Level 3 in between creates challenges across hardware, software, and legal frameworks. From my perspective, China should focus on Level 2 and Level 4.

36Kr: How many vehicles will carry the second-generation VLA system? Can you provide an estimate?

HX: All our Ultra and Ultra SE models will be equipped with the second-generation VLA model. In the global market, Xpeng vehicles will offer two tiers of assisted driving.

36Kr: What level has the second-generation VLA model actually achieved? Has it fully reached Level 4?

Liu Xianming (LX): We’re not claiming it’s 100% at Level 4 yet. But VLA 2.0 is built on a highly general and efficient architecture. We release new versions almost daily, continuously iterating and solving new problems. The pace of improvement has exceeded our expectations. We are confident that in the coming period, we can establish a relatively complete system at Level 4 capability.

As for timing, it’s difficult to give a precise date. Our estimate is within one to three years. From our perspective, if our daily iteration speed continues to accelerate, reflected in both training velocity and data scale, then we believe it will arrive quickly.

36Kr: Why merge the cockpit and autonomous driving units? How is Xpeng’s adjustment different from other automakers?

HX: The automotive sector is entering a new phase of cross-domain integration. Autonomous driving governs vehicle motion. The cockpit is the vehicle’s brain. When combined with the powertrain and chassis, we see four domains undergoing integration.

In the future, especially for Level 4 or robotaxi vehicles, many companies will move away from isolated domain integration toward cross-domain integration. That shift will make vehicles faster, safer, and more responsive, multiplying capability and moving from passive use to active service. The general AI center led by Liu is part of that integration process.

This is also why I believe full autonomous driving will arrive within one to three years, and that within three to five years, all vehicles will become powerful intelligent agents.

36Kr: The second-generation VLA system will be rolled out in full later this month. How will this shape Xpeng’s high-end strategy over the next three years?

HX: In the next one to three years, cars will move from the software era to the AI era, shifting from independent hardware and software development toward cross-domain integration. Vehicles will evolve from simple new energy cars into advanced intelligent agents capable of proactive service.

Because Xpeng is conducting R&D across multiple domains simultaneously, you will see significant cross-domain integration outcomes over the next one to three years.

Even traditional gasoline vehicles are facing mounting challenges. Automakers can no longer rely on old methods. Cars will shift from passive production tools to products that actively generate productivity. I believe this will be a transformative shift within three to five years.

36Kr: Robotaxi companies rarely mention foundation models. Will foundation models become the standard for Level 4 players?

LX: The technological paradigm for Level 4 and autonomous driving has already shifted. Companies like Waymo have relatively low ceilings under the old approach. They can only continue optimizing incrementally.

This leads to the operational design domain problem: where vehicles can operate depends on how many cars are deployed, how much data is collected, and how many maps are built. To truly generalize, the paradigm must change. It’s inevitable.

Building a strong foundation model is a mandatory requirement. Without it, companies risk falling behind in this transition.

36Kr: Tesla faced localization challenges with FSD (Full Self-Driving) in China. How will Xpeng avoid similar issues abroad?

LX: Even without overseas data adaptation training, the second-generation VLA system has already demonstrated strong capabilities, as seen in the video released by He.

Second, Xpeng is a global company. Under compliant conditions, it can lawfully collect and use local data wherever it operates.

Third, through world model-based simulation and generation, we can quickly reach a baseline capability for generalized scenarios.

Our global autonomous driving strategy combines strong model generalization, global deployment, and technological breakthroughs. It cannot rely solely on data from China.

36Kr: If the foundation model empowers diverse agents, will there be bottlenecks in multimodal interaction or spatial perception?

LX: The underlying reuse capability should be strong. The VLA and foundation model are natively multimodal and not designed solely for autonomous driving. However, we are still exploring the extent of reuse. Our immediate focus is to fully deploy it in vehicles before advancing cockpit-driving integration further.

36Kr: Some argue human data is losing value. What’s your view?

HX: The amount of data in the physical and human world is effectively infinite.

I used to think 100,000 or one million vehicles generating miles of data would be enough. Now I believe it’s far from sufficient. Many assume that selling more cars automatically yields valuable data. That’s incorrect. Collecting high-quality, valuable, large-scale data is extremely difficult, and that applies to cars and robots alike. We are far from solving that challenge.

36Kr: Is reinforcement learning a cure-all?

LX: Reinforcement learning is not a panacea. It requires a strong foundation model that can at least sample feasible solutions. Without that, reinforcement learning cannot improve performance.

However, it is highly efficient for targeted problem-solving and long-tail exploration. It is a powerful learning method, but not a universal solution.

36Kr: Why don’t users perceive proportional improvements despite massive increases in computing power?

LX: It’s not about headline numbers. The key is using computing power effectively. That’s why we are transitioning from general-purpose processors to ASICs.

Look at Nvidia’s GPU and CUDA era. The issue is utilization, not simply scale. Large computing power also requires inputs with higher information density and larger models. Otherwise, the compute sits idle. Simply increasing the numbers will not produce perceptible improvements for users.

36Kr: Does the VLA approach directly output a trajectory, or generate multiple options for selection?

LX: The core question is whether you are building autonomous driving, or AI. We are building AI.

If you commit to that shift, you cannot carry over legacy heuristic rules. The key is scaling data, model size, and compute, and iterating rapidly. Over the past few years, AI development has shown that scaling and fast iteration can solve many problems.

36Kr: Is there a tradeoff among safety, scenario coverage, and efficiency?

LX: Anyone familiar with machine learning knows the PR (precision-recall) curve. When performance plateaus, tradeoffs emerge among safety, efficiency, and coverage.

The main objective of autonomous driving is safety. But no one wants a system that is safe yet unbearably slow and inefficient. The solution is to improve fundamental capability. Only then can safety improve without sacrificing other dimensions.

What we mean by a “generational gap” is not just a single metric lead. It refers to shifting the entire approach and accelerating iteration itself. The goal is not only to move faster, but to continuously increase the pace of improvement by building a general underlying capability system. That is the real generational difference, not isolated metric gains.

KrASIA features translated and adapted content that was originally published by 36Kr. This article was written by Xiao Man for 36Kr.