Unitree’s founder disputes VLA consensus, backs video-trained models for robotics

Written by 36Kr English

Describing VLA models as “relatively dumb,” he outlined an alternative approach to embodied intelligence.

When most in the industry think of Unitree Robotics, they see a company focused on building robot hardware. But at the World Robot Conference (WRC), founder Wang Xingxing offered a different narrative.

During his keynote at WRC, Wang dedicated a large portion of his talk to large models, algorithms, and data. It was a shift that didn’t go unnoticed. His comments sparked debate, particularly his critique of the vision-language-action (VLA) framework driving many of today’s embodied robots.

Wang didn’t mince words: he called the VLA architecture “relatively dumb.”

The main problem, in his view, is data, or the lack thereof. VLA models require vast, high-quality datasets to function effectively in the real world. While the scarcity of such data is widely acknowledged, many companies have pursued brute-force methods: gathering real-world robot data, generating simulation data, or building specialized data collection infrastructure.

Wang believes that emphasis is misplaced.

“People are paying too much attention to foundational data,” he said. “What we really need to improve is the architecture of embodied models. Right now, they are neither good enough nor unified enough.”

Historically, Wang has emphasized Unitree’s strength in hardware, not in building a robot “brain.” That messaging led many to assume the company wasn’t seriously investing in artificial intelligence.

But speaking to 36Kr during WRC, Wang clarified: “Our model team is actually quite large; it’s just smaller than those at the big AI firms.”

Still, Wang argued that size isn’t everything.

“It doesn’t take the most money or the biggest team to build world-class technology,” Wang said. “Innovation doesn’t only happen inside tech giants. Smaller teams can develop better models, though the pressure is much higher.”

Unitree’s model strategy remains flexible. While Wang is skeptical of current VLA approaches, he noted that Unitree is also experimenting with training AI on top of VLA frameworks. His core argument is that simply feeding more data into immature models is inefficient. A well-designed embodied model, he suggested, might perform well with only a limited amount of high-quality data.

The company’s bigger bet appears to be on video-driven models. Last year, Google introduced a world model trained on video, and Wang said Unitree was exploring similar ideas around the same time.

In this approach, a video generation model creates footage of a robot completing a task, such as tidying a room. That video is then used to guide a real robot in performing the same actions.
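The article does not describe Unitree’s implementation, but the video-to-action idea can be sketched at a high level. The following is a hypothetical toy pipeline, with all names and structures invented for illustration: a stand-in “video model” produces frames encoding target poses, frame-to-frame differences are read off as actions, and the actions are replayed on a simulated joint.

```python
# Hypothetical sketch of a video-driven control pipeline.
# All names (Frame, generate_task_video, etc.) are illustrative, not Unitree's API.
from dataclasses import dataclass

@dataclass
class Frame:
    # Simplified: a frame holds the joint pose the robot should reach.
    # A real video model would emit pixels, and pose would be estimated from them.
    joint_angles: list

def generate_task_video(task: str, num_frames: int = 3) -> list:
    """Stand-in for a video generation model: produce frames that
    'show' the task being completed."""
    # Toy trajectory: sweep a single joint from 0 to 90 degrees.
    return [Frame(joint_angles=[90.0 * i / (num_frames - 1)])
            for i in range(num_frames)]

def extract_actions(frames: list) -> list:
    """Convert consecutive frames into joint-angle deltas (actions)."""
    return [
        [b - a for a, b in zip(f0.joint_angles, f1.joint_angles)]
        for f0, f1 in zip(frames, frames[1:])
    ]

def execute(actions: list, start: float = 0.0) -> float:
    """Apply each delta to the (simulated) robot joint state."""
    state = start
    for delta in actions:
        state += delta[0]
    return state

frames = generate_task_video("tidy the room", num_frames=3)
actions = extract_actions(frames)
final_pose = execute(actions)  # ends at the pose shown in the last frame
```

The sketch makes the GPU tradeoff Wang mentions concrete: the expensive step is the frame generation, while converting frames into robot commands is comparatively cheap.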

Wang believes this method may surpass VLA-based strategies in speed and convergence. Still, it comes with tradeoffs: generating high-resolution video is GPU-intensive.

Despite the technical demands, Wang sees a path forward. He envisions low-cost, large-scale, distributed computing clusters built specifically for robotics. “If you’ve got 100 robots in a factory,” he said, “it makes sense to deploy a distributed server cluster inside the plant. The robots need low-latency communication anyway.”

Unitree’s robots, known for dancing at Lunar New Year galas and performing in high-octane battle scenes at this year’s World Artificial Intelligence Conference and WRC, have gained a reputation for showmanship rather than practical utility.

Meanwhile, newer entrants are focused on putting robots to work: tightening screws, folding laundry, making beds. That contrast has led some observers to dismiss Unitree’s robots as more spectacle than substance.

Wang disagrees:

“Right now, putting robots to work in homes or factories isn’t very realistic. Performance and demonstrations are just more achievable at this stage.”

He added that the team working on practical robotics is the largest within Unitree.

So why aren’t those use cases more visible?

“Developing practical robots is a massive challenge for AI models,” Wang said. “And honestly, we’re still not where we want to be.”

Wang’s vision for useful robots extends beyond narrow tasks like folding clothes or cooking. He sees them as general-purpose and multifunctional: capable of pouring tea in a factory one moment and performing onstage the next.

Asked when robotics might see a breakthrough equivalent to AI’s “ChatGPT moment,” Wang offered a timeline: “The fastest would be two to three years. Slowest, maybe three to five. But this wave of embodied intelligence won’t take more than ten.”

What might that moment look like?

Picture a venue full of humanoid robots, walking freely. You stop one, give it a command, and it gets the job done. That, Wang said, would mark the tipping point.

KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Qiu Xiaofen for 36Kr.

