From the recently concluded World Artificial Intelligence Conference (WAIC) to the World Robot Conference (WRC), one question looms over the showroom floor: how do you evaluate what a robot can really do?
Gao Yang, co-founder of the embodied intelligence company Spirit AI, offers some advice:
- Take a robot that claims to fold clothes. Crumple a shirt into a ball and toss it onto a table, then see if the robot can still complete the task. Or hand it a pair of pants or a jacket and watch whether it can generalize across different types of clothing.
- When a robot is in motion, observe whether its movements are smooth and fluid or jerky and uneven. That distinction often reveals the quality of coordination between its perception, cognition, and execution systems.
Gao would know. He’s one of the most closely watched figures in embodied artificial intelligence today. After completing his doctoral program at the University of California, Berkeley, he returned to China to become an assistant professor at Tsinghua University’s Institute for Interdisciplinary Information Sciences.
In 2023, Gao co-founded Spirit AI with Han Fengtao, former CTO of Rokae. Han brought deep manufacturing experience, having overseen mass production of tens of thousands of robots. Gao, by contrast, came with an academic grounding in artificial intelligence. The pairing of practical hardware know-how and advanced research credentials helped Spirit AI quickly rise to prominence in China’s embodied intelligence wave.
In just 19 months, the company raised over RMB 1 billion (USD 140 million) from a lineup of heavyweight backers including Huawei’s Habo investment arm, JD.com, Contemporary Amperex Technology (CATL), and Shunwei Capital.
Gao’s move from academia into entrepreneurship wasn’t without skepticism. The transition from scientist to founder often invites doubts, especially in China’s fast-moving tech ecosystem.
“Scientist-led startups, to some extent, are inherently unreliable,” he said. “Science is driven by curiosity and a search for truth. Startups, meanwhile, are about commercial success. I constantly remind myself of what I’m not good at, and work to fill those gaps.”
He likens entrepreneurship to playing a game. Meetings with investors and customers? “Boss battles,” he quipped. Gao has pitched to hundreds of investors. In the early days, his technical explanations were so dense they lulled people to sleep. “Now, I’m much more fluent in how I communicate,” he said. “That growth is what I enjoy most.”
In his office, a small capybara plush sits atop his monitor, a quiet counterpoint to the intensity of startup life. In an interview with 36Kr, Gao reflected on his shift from academia to business, and shared how his thinking on embodied AI continues to evolve.
The following transcript has been edited and consolidated for brevity and clarity.
“Build an Apple, not an Android”
36Kr: You and Han seem like a complementary duo. He focuses on hardware while your background is in software. What were you looking for in a co-founder?
Gao Yang (GY): I spent a lot of time thinking about how to sell embodied intelligence to customers. The clearest conclusion I reached was to go full-stack. You need to build both the software and the hardware. And for embodied AI, you need to build an Apple, not an Android.
Early-stage tech tends to struggle with cross-platform compatibility. In almost every emerging industry, the most effective solutions come from tight integration between hardware and software. Just look at early personal computers. IBM made both. It took decades for the ecosystem to mature enough for those layers to separate.
My own background is heavily software-focused. I’d barely touched hardware before this. So for a company that’s aiming to survive and lead for the next 30 years, building strength on both fronts is critical.
That’s also why Han and I aligned. Many hardware engineers are slow to adapt to change, but Han recognized early on that the world was shifting. He was ready to rethink how things should be built.
36Kr: What made you want to start a robotics company in 2023?
GY: ChatGPT. It completely reshaped how I thought about machine learning. Before GPT-3.5 launched, I didn’t believe in what OpenAI was doing, and I wasn’t alone. A lot of senior professors at UCB were skeptical too.
But after GPT-3.5 came out, it was a wake-up call. We had to admit we were wrong. From there, it was easy to see that embodied intelligence would be the next logical step. It was just a matter of time.
36Kr: You committed to a full-stack approach early on, but some leading robotics companies still downplay the importance of the “brain.” How do you interpret that?
GY: I think it’s a strategic choice. If they are already strong in hardware and doing well in the education market, that might be enough to carry them to an IPO. In that case, their best move is to defend their position in that segment, where competition is already heating up, and then expand gradually after going public. No company can do everything at once.
36Kr: Do you think there’s room for a company that focuses purely on building new robot form factors?
GY: The form factor is inseparable from the AI. If your robot fails to reach something because its arm can’t extend far enough, that’s not just a mechanical problem, it’s a design problem. And it’s one that has to be solved jointly by the AI and the hardware teams.
36Kr: So the market can’t really support two dominant full-stack players?
GY: It’s very hard to imagine.
From scientist to CEO
36Kr: When Professor Wu Yi invited you back from UCB, you planned to return to academia. What changed?
GY: At the time, there wasn’t a wave of breakthrough AI like we’re seeing now. My other option was to join a large US tech company as a research engineer, but that’s a well-trodden path, one that someone else has already mapped out.
Becoming a professor felt more open-ended. You start with nothing: no lab, no team. You build everything from scratch. That zero-to-one process was exciting to me. I started Spirit AI around late 2023, during my third year back in China.
36Kr: It’s clear you were thinking about this from a business perspective early on.
GY: I’ve always been interested in making technology usable for everyday people. That interest led me to think seriously about how to build a commercially viable robot, one that integrates hardware and software tightly. That’s what ultimately shaped how I chose my co-founder.
36Kr: You’ve said that management is a kind of technology. What do you mean by that?
GY: It’s not a hard science, but it’s systematic. There are patterns you can learn and trace. But it still requires intuition and flexibility. It’s part science, part craft.
36Kr: You’ve also said scientist-led startups can be unreliable. How are you addressing that?
GY: Scientists pursue truth. Startups pursue product-market fit. Customers don’t care about theoretical elegance. They care about whether something works for them.
So I try to build a team that treats the company like a living system. I don’t assume I’ll succeed. I just try to spot my blind spots and keep learning.
36Kr: How did you make the identity shift from researcher to CEO?
GY: By acknowledging what I didn’t know and deciding to learn it. Success in this role means building a product people want, not just proving a theory.
36Kr: Are you enjoying the process?
GY: I think so. It’s like a complex game that’s challenging and full of lessons. Early on, I gave investors dense, fact-heavy pitches. People would tune out. Then I learned to explain things more vividly. There have been a lot of lessons like that.
36Kr: How many investors have you pitched?
GY: I lost count. Maybe 100, maybe 200. Each time, you have to walk them through it from the start.
36Kr: What helped you improve?
GY: Feedback. Without it, you don’t know what’s missing. Now, I’m much more fluent in how I communicate. That growth is what I enjoy most.
How to judge a good VLA model? Try it yourself
36Kr: At this stage, using transformers for pretraining seems to be the norm. But is there still a big performance gap between companies?
GY: Come to WRC and see for yourself. No matter what the theories say, you need hands-on experience. Try interacting with the robots. Toss a crumpled shirt and see if it can fold it again. That’s how you really tell.
36Kr: That’s a pretty practical approach to evaluating demos.
GY: Robots are massive, complex systems. You can’t assess them with a checklist. The only way to understand what they are capable of is through direct interaction, seeing how the models perform under real conditions.
36Kr: VLA models are a hot topic this year. What distinguishes the good ones?
GY: Two things: algorithms and data.
Some VLA models can’t decompose tasks at all. Spirit AI’s version includes a temporal modulation system that ensures smooth motion. Without that, robots often move stiffly or freeze mid-task.
Data is equally important. Training large models requires massive datasets. We pretrain on human motion videos from the internet. Some VLA models can’t effectively leverage this kind of data, so they struggle more.
From a technical perspective, it comes down to model architecture and dataset quality—how it’s cleaned, balanced, and used.
From a user’s perspective, it’s about task complexity. Some models can only handle basic pick-and-place operations. Our model can fold clothes, and even when the setup changes unexpectedly, it still performs well. That’s real robustness.
36Kr: Is Spirit v1’s VLA model based on your earlier ViLa and CoPa research?
GY: Not just those. We’ve built on multiple prior projects, including OneTwoVLA, which we’ve engineered into our current production models.
36Kr: What makes OneTwoVLA different from typical VLA models?
GY: It can reason through multi-step tasks intelligently. Take the instruction: “Put the phone in the drawer.” That involves picking up the phone, opening the drawer, placing it inside, and closing the drawer. Most VLA models can’t break that down.
OneTwoVLA decides for itself whether and how to decompose tasks, but if the command is simple, it won’t overcomplicate things either.
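To make that “decide whether to decompose” behavior concrete, here is a minimal, self-contained Python sketch. It is illustrative only: the canned plan and function names are hypothetical stand-ins, not OneTwoVLA’s actual interface, and a real VLA model would generate the sub-steps itself from vision and language rather than from a lookup table.

```python
# Illustrative sketch of the "decide whether to decompose" idea Gao describes.
# Nothing here is Spirit AI or OneTwoVLA code; the plan table and function names
# are hypothetical stand-ins for what the model itself would reason out.

from dataclasses import dataclass


@dataclass
class Step:
    instruction: str


def decompose(instruction: str) -> list[Step]:
    """Split a multi-step command into sub-steps, or pass it through if it is
    already atomic. A real VLA model would produce this plan from vision and
    language, not from a canned dictionary."""
    canned_plans = {
        "put the phone in the drawer": [
            "pick up the phone",
            "open the drawer",
            "place the phone inside",
            "close the drawer",
        ],
    }
    plan = canned_plans.get(instruction.lower(), [instruction])
    return [Step(s) for s in plan]


def execute(step: Step) -> None:
    # Placeholder for the low-level policy that would turn a sub-instruction
    # into joint-space actions on the robot.
    print(f"executing: {step.instruction}")


if __name__ == "__main__":
    # A complex command gets broken down; a simple one runs as-is.
    for step in decompose("Put the phone in the drawer"):
        execute(step)
    for step in decompose("Pick up the cup"):
        execute(step)
```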
36Kr: You’ve predicted that robotics will reach a GPT-3.5-like phase in four years. What defines that level?
GY: You could ask a robot to do almost anything, and it would succeed 70–80% of the time, even with things like fetching a water bottle from outside the house. It won’t be perfect, but it’ll be close.
36Kr: The VLA paradigm is evolving rapidly. Where is there still room for improvement?
GY: I agree with Robot Era’s Chen Jianyu that there’s too much focus on the “L” (language) in VLA right now. These models don’t need to understand language in its full generality.
There are also many technical areas that need work.
36Kr: Such as?
GY: For starters, making better use of the human motion video that’s available online. Most current models rely on image-text pairs, but Spirit AI already uses full-body human motion data because it’s much more relevant for training embodied agents.
There’s also teleoperation data: figuring out how to fine-tune VLA models with it, and how to implement reinforcement learning in real-world settings. Supervised fine-tuning is human-led, while reinforcement learning is self-driven. Both have roles to play.
Architecturally, as Chen pointed out, we need to reduce the language component’s footprint. And motion tokenizers—how we represent and sequence movement—are still underdeveloped. That space is wide open for innovation.
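For intuition about what a motion tokenizer does, the toy example below discretizes continuous joint-space deltas into a fixed vocabulary of integer tokens with uniform binning, then maps them back. This is a deliberately simple sketch under assumed ranges and bin counts; it is not Spirit AI’s representation, and production tokenizers are usually learned rather than hand-binned.

```python
# Toy motion tokenizer: map continuous per-step joint deltas to discrete tokens
# a transformer could model, and back again. Uniform binning is the simplest
# possible scheme and is used here purely for illustration.

import numpy as np


def tokenize(joint_deltas: np.ndarray, n_bins: int = 256,
             lo: float = -0.1, hi: float = 0.1) -> np.ndarray:
    """Map joint deltas of shape (T, n_joints) to integer tokens in [0, n_bins)."""
    clipped = np.clip(joint_deltas, lo, hi)
    return np.round((clipped - lo) / (hi - lo) * (n_bins - 1)).astype(np.int64)


def detokenize(tokens: np.ndarray, n_bins: int = 256,
               lo: float = -0.1, hi: float = 0.1) -> np.ndarray:
    """Invert the binning back to approximate continuous joint deltas."""
    return tokens.astype(np.float64) / (n_bins - 1) * (hi - lo) + lo


if __name__ == "__main__":
    traj = np.random.uniform(-0.05, 0.05, size=(20, 7))  # 20 steps, 7-DoF arm
    tokens = tokenize(traj)
    recovered = detokenize(tokens)
    print("max reconstruction error:", np.abs(traj - recovered).max())
```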
36Kr: Is the temporal modulation system proprietary?
GY: Yes. We developed it about four months ago.
36Kr: What does it improve in terms of motion?
GY: Without it, robot movement is jerky and rigid. It can’t vary speed within a task.
Our system enables fluid speed modulation. For example, when folding a shirt, there’s a moment when the robot needs to flick the fabric in midair. That motion needs to be fast. Without dynamic speed control, the shirt won’t behave as expected.
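To give a rough sense of what fluid speed modulation means mechanically, the sketch below retimes a fixed path against a per-segment speed profile so one stretch of the motion (the mid-air flick, say) runs several times faster than the rest. It is a plain interpolation demo under assumed shapes and units, not Spirit AI’s temporal modulation system.

```python
# Toy retiming demo: resample a trajectory so that designated segments execute
# faster than others. Purely illustrative of varying speed within a single task;
# this is not Spirit AI's temporal modulation system.

import numpy as np


def retime(waypoints: np.ndarray, speed: np.ndarray, dt: float = 0.01) -> np.ndarray:
    """waypoints: (T, dof) path; speed: (T,) relative speed per segment (> 0)."""
    seg_durations = 1.0 / speed[:-1]                 # faster segments take less time
    t_way = np.concatenate([[0.0], np.cumsum(seg_durations)])
    t_new = np.arange(0.0, t_way[-1], dt)
    # Interpolate each joint onto the warped timeline.
    return np.stack([np.interp(t_new, t_way, waypoints[:, j])
                     for j in range(waypoints.shape[1])], axis=1)


if __name__ == "__main__":
    path = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 7))  # straight-line path, 7 DoF
    speed = np.ones(50)
    speed[20:30] = 5.0  # run the "flick" portion of the motion five times faster
    retimed = retime(path, speed)
    print("resampled trajectory shape:", retimed.shape)
```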
36Kr: Are world models part of your roadmap?
GY: World models are expensive to train, and for now, they are not critical for embodied AI. That said, they will become indispensable, especially for reinforcement learning. We’re running small-scale experiments, but full deployment is still a ways off.
36Kr: What’s your take on hierarchical architectures?
GY: I think they will be phased out.
They are useful in the short term, especially when hand-tuning is manageable. But they don’t scale. Every new task requires manual adjustment. End-to-end models just need more data.
36Kr: In your view, what still lacks consensus in robotics?
GY: Plenty. The significance of actuators, for instance. Or which use cases will land first. Even the VLA framework is still evolving rapidly. Its core architecture may be settling, but implementation details are changing all the time, and fast.
Are large-scale data collection factories worth it?
36Kr: What’s your take on robotics companies building large-scale data collection factories? Isn’t there a problem if the data can’t transfer across different robot platforms?
GY: I don’t think large-scale data collection makes much sense at this stage. Robot hardware is still evolving quickly, and changes in form factor significantly reduce the reusability of collected data. That undercuts its value.
Also, at least with our algorithm, massive data collection isn’t necessary. What matters more is high-quality pretraining. Right now, some companies have it backward. They are overinvesting in data collection and underinvesting in foundational model quality.
36Kr: Some companies use data factories as part of their business model.
GY: In the short term, sure, it can bring in revenue. In the US, labor is expensive, so AI companies often buy data instead of collecting it. But long term, it’s a shaky model, especially when cross-platform generalization hasn’t been solved yet.
36Kr: If someone buys data that doesn’t match their robot’s hardware, is it still useful?
GY: It’s still useful, but you’re getting diminishing returns.
36Kr: Many demos now focus on robots folding clothes or opening appliances. Why those tasks?
GY: Folding clothes is one of the most difficult challenges in robotics. Clothing is deformable and behaves unpredictably, so it can’t be programmed in advance. That makes it a strong benchmark as it shows how robust the model really is.
Tasks like opening fridges or washing machines are familiar household interactions. They help people envision practical use cases for robots.
36Kr: What proportion of your training data comes from the internet, and what roles do different data types play?
GY: Over 95% of our training data is sourced from the internet. This kind of data provides wide-ranging context and variability, which is ideal for pretraining. The goal is generalization: can the robot handle unfamiliar situations?
Teleoperation data connects that generalization with precise physical control. Watching isn’t enough; robots also need to practice. Teleoperation enables the kind of fine motor training they require.
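A schematic way to picture that split is as a two-stage recipe: pretraining dominated by internet video for breadth, then fine-tuning on teleoperation data for precise control on the target platform. The source names and weights below are placeholders drawn only from the figures mentioned in this interview, not Spirit AI’s actual pipeline.

```python
# Schematic two-stage data recipe. The mixes are illustrative placeholders; only
# the rough "over 95% internet data" figure comes from the interview itself.

PRETRAIN_MIX = {
    "internet_human_motion_video": 0.95,  # breadth and variability for generalization
    "robot_teleoperation": 0.05,          # a small slice of embodiment-specific data
}

FINETUNE_MIX = {
    "robot_teleoperation": 1.0,           # fine motor practice on the target robot
}


def batch_allocation(mix: dict[str, float], batch_size: int) -> dict[str, int]:
    """Split a batch across data sources in proportion to the mix weights."""
    return {source: round(batch_size * weight) for source, weight in mix.items()}


if __name__ == "__main__":
    print("pretrain batch:", batch_allocation(PRETRAIN_MIX, 256))
    print("finetune batch:", batch_allocation(FINETUNE_MIX, 256))
```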
36Kr: How does generalization show up in practice?
GY: Let’s say a robot trained only on iPhones encounters a foldable phone for the first time. If it can still identify the object’s shape, understand its physical properties, and manipulate it correctly, that’s generalization.
36Kr: Is the industry good at that yet?
GY: We’re still in the early days. But we’ve seen significant gains since incorporating internet video into training. Generalization rates on unfamiliar objects can improve by 60–80%. And when you layer in teleoperation data, those gains compound even further.
KrASIA Connection features translated and adapted content that was originally published by 36Kr. This article was written by Qiu Xiaofen for 36Kr.