Industry Analysis · May 8, 2026 · 5 min read

The Real Data Problem in Physical AI

Humanoid robots get billions in funding, yet the bottleneck is still real-world training data that no synthetic pipeline can replace.

Physical AI · May 2026


Humanoid robots are getting billions in investment. Foundation models are getting smarter by the week. And yet the single biggest bottleneck to physical AI is exactly where it has always been: real-world training data that no synthetic pipeline can replace.

By Oleg Shlyakhter, Co-Founder & CEO, Field Motion

$6.1B

VC into humanoids in 2025

MIT Technology Review, Apr 2026

20K hrs

Egocentric data to double robot dexterity

NVIDIA GR00T N1.7, Feb 2025

564

Real-world scenes in DROID dataset

DROID Dataset, arXiv 2403.12945

From intractable to inevitable - fast

A few years ago, general-purpose robots were a research curiosity. Today they are a strategic priority for some of the largest companies on earth. Tesla's Optimus program, Figure AI's Helix, Google DeepMind's Gemini Robotics, and NVIDIA's GR00T N1 - all of them launched or reached significant milestones inside a 24-month window. The common denominator isn't hardware. It's data.

The analogy to large language models is useful but incomplete. LLMs were trained on the internet. Physical AI has no equivalent corpus. You cannot scrape YouTube for force feedback during dexterous manipulation. You cannot download a Parquet file of how a human hand adapts grip force when picking up a wet glass versus a dry one. That data has to be created, in the real world, by humans demonstrating the behaviors robots need to learn.

"The performance of robotic systems is fundamentally tied to the scale, diversity, and quality of the underlying training data. As robotics moves from simple pick-and-place tasks to more delicate, contact-heavy tasks, the demand for high-quality data that captures both precise motion and physical interaction has become increasingly critical."

- RoboPaint: From Human Demonstration to Any Robot and Any View, arXiv 2602.05325

The data scarcity nobody talks about

The Open X-Embodiment dataset - the largest collaborative effort to pool robot manipulation data across 21 institutions and 22 robot platforms - is a landmark achievement. It contains demonstrations for 527 skills across 160,000+ tasks. But when you look at its composition, more than 85% of real trajectories come from just four robot platforms in fixed lab environments. Scene diversity is thin. Real-world generalization is limited.
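One way to make a composition claim like this checkable is to compute how much of a dataset its top few platforms account for. The sketch below is a minimal illustration: the function name `top_k_share` and the platform counts are hypothetical placeholders, not Open X-Embodiment's actual per-platform numbers.

```python
def top_k_share(counts: dict[str, int], k: int) -> float:
    """Fraction of all trajectories contributed by the k largest platforms."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("dataset has no trajectories")
    largest = sorted(counts.values(), reverse=True)[:k]
    return sum(largest) / total


# Toy counts for illustration only (NOT real dataset statistics).
toy_counts = {
    "arm_a": 400,
    "arm_b": 250,
    "arm_c": 150,
    "arm_d": 100,
    "arm_e": 60,
    "arm_f": 40,
}

share = top_k_share(toy_counts, k=4)
print(f"top-4 platforms hold {share:.0%} of trajectories")
```

The same calculation applies to scenes, buildings, or operators instead of platforms, which is often the more revealing view when the concern is real-world generalization.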

The DROID dataset pushed scene diversity further - 564 scenes across 52 real-world buildings, collected specifically to improve zero-shot generalization. It was a deliberate attempt to fix the in-the-wild gap. And it's still not enough. The robots that will actually deploy in warehouses, hospitals, and homes need far more scene diversity, far more task diversity, and critically, far more variation in the humans doing the demonstrating.

Why synthetic data doesn't solve this

Simulation still lacks the physical contact fidelity needed for dexterous manipulation - the "reality gap" in tactile and force feedback remains unsolved.

World models like NVIDIA Cosmos can synthesize video but not proprioception, joint torque, or grip force - the signals manipulation policies actually depend on.

Retargeting synthetic demonstrations to real robots introduces embodiment gaps - kinematic differences that degrade policy performance on the actual hardware.

Models trained on simulated data consistently underperform models fine-tuned with even modest amounts of real-world in-distribution demonstrations.

The scaling law nobody expected

In February 2025, NVIDIA released GR00T N1, their open foundation model for generalist humanoid robots. The N1.7 update that followed introduced something significant: the first identified scaling law for robot dexterity.

Going from 1,000 to 20,000 hours of human egocentric video more than doubled average task completion rates. This is the same kind of data-scaling relationship that defined the LLM era - but for physical manipulation. The data source: EgoScale, a pretraining corpus of 20,854 hours of egocentric human video spanning 20+ task categories.
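To get a feel for what that relationship implies, the two reported endpoints can be turned into a rough exponent. This is a back-of-the-envelope sketch that assumes a power-law form (performance proportional to hours raised to some alpha), which the source does not state, and treats "more than doubled" as exactly 2x, so the result is a lower bound. The function names are illustrative.

```python
import math


def implied_exponent(data_ratio: float, performance_ratio: float) -> float:
    """Alpha such that performance_ratio == data_ratio ** alpha."""
    if data_ratio <= 1 or performance_ratio <= 0:
        raise ValueError("need data_ratio > 1 and performance_ratio > 0")
    return math.log(performance_ratio) / math.log(data_ratio)


def gain_per_doubling(alpha: float) -> float:
    """Multiplicative performance gain from doubling the data."""
    return 2.0 ** alpha


# Endpoints from the text: 1,000 -> 20,000 hours, completion rate at least 2x.
alpha = implied_exponent(data_ratio=20_000 / 1_000, performance_ratio=2.0)
print(f"implied exponent (lower bound): {alpha:.2f}")
print(f"gain per doubling of data: {gain_per_doubling(alpha):.2f}x")
```

Under these assumptions the exponent comes out around 0.23, or roughly a 17% improvement per doubling of data. Because a task completion rate cannot exceed 100%, any such curve has to flatten eventually, so this should be read as a description of the reported range, not a forecast.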

Figure AI drew a similar conclusion independently. Their Helix model - trained on approximately 500 hours of multi-robot, multi-operator teleoperated demonstrations - achieved faster-than-demonstrator dexterous manipulation in a real-world logistics scenario. The data was curated carefully: corrective behaviors were retained, slow or failed demonstrations were filtered out. Quality and diversity mattered as much as volume.
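A curation rule of that shape can be sketched in a few lines. The version below is an illustration, not Figure AI's actual pipeline: the `Demo` fields, the median-based definition of "slow", and the choice to keep corrective episodes even when they run long are all assumptions.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Demo:
    demo_id: str
    succeeded: bool
    duration_s: float
    has_correction: bool  # operator recovered from a mistake mid-episode


def curate(demos: list[Demo], slow_factor: float = 1.5) -> list[Demo]:
    """Drop failed demos and slow ones; always keep successful corrective demos.

    "Slow" means longer than slow_factor times the median successful duration.
    """
    successful = [d for d in demos if d.succeeded]
    if not successful:
        return []
    cutoff = slow_factor * median(d.duration_s for d in successful)
    return [d for d in successful if d.has_correction or d.duration_s <= cutoff]


demos = [
    Demo("a", True, 10.0, False),
    Demo("b", True, 12.0, False),
    Demo("c", False, 9.0, False),   # failed: dropped
    Demo("d", True, 30.0, False),   # slow, no correction: dropped
    Demo("e", True, 40.0, True),    # slow but corrective: kept
    Demo("f", True, 11.0, False),
]
kept = curate(demos)
print([d.demo_id for d in kept])  # ['a', 'b', 'e', 'f']
```

Exempting corrective episodes from the speed cutoff is a deliberate choice here: recoveries take extra time by nature, and a pure duration filter would remove exactly the behavior the curation is meant to preserve.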

"NVIDIA identified what it describes as the first-ever scaling law for robot dexterity - going from 1,000 to 20,000 hours of human egocentric data more than doubles average task completion."

- MarkTechPost, Top 10 Physical AI Models 2026

Teleoperation is the industrial pipeline - with a quality problem

MIT Technology Review recently described how strange the race to collect physical AI data has become: training centers in China where people wearing exoskeletons repeat the same task hundreds of times; gig workers in Nigeria, Argentina, and India filming household chores; delivery company employees outfitted with sensors - partly for injury tracking, partly to train the robots that could eventually replace them.

The demand is real. The supply chains being built to meet it are improvised. And the quality gap between a curated teleoperation pipeline and a crowdsourced one is significant.

The DEXOP paper (MIT, 2025) illustrates exactly what's at stake. Their passive hand exoskeleton system - which mechanically links human fingers to robot fingers and provides direct contact feedback - produced demonstration data that trained policies with significantly better task performance per unit collection time compared to standard teleoperation. The mechanism matters. The quality of the physical coupling between demonstrator and robot shapes the data that comes out.

This is why who collects data and how they collect it determines whether that data is actually useful. A pipeline with trained operators, proper equipment, real-world environment diversity, and rigorous QA produces fundamentally different data than crowdsourced video. That difference shows up directly in model performance.

Cross-embodiment is the multiplier

The robotics industry is not converging on a single hardware platform. It is fragmenting into a growing ecosystem of humanoids, arms, bimanual systems, and mobile manipulators - each with different kinematics, sensors, and deployment contexts. This makes retargetability the central value-creation question in training data.

The OXE-AugE paper put it directly: given the high cost of recollecting demonstrations for every new platform, cross-embodiment generalization - the ability to transfer policies across different robot hardware - is an important goal for scalable and practical robot learning. The Open X-Embodiment dataset itself was built on this premise: collect once, generalize broadly.

Figure AI's Project Go-Big took this further - achieving zero-shot human-to-robot transfer for locomotion using only human video, with no robot-specific training data required. That's the direction the field is heading: data that generalizes across embodiments rather than being locked to a single hardware configuration.

For data collection infrastructure, this means the format, annotation schema, and capture methodology need to be designed with retargetability in mind from the start. Data collected for one robot and reusable on three others serves four platforms for a single collection cost, so each demonstration hour goes roughly four times as far.
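As a rough illustration of what designing for retargetability can mean at the schema level, the sketch below stores motion in world-frame, embodiment-neutral terms (wrist pose, fingertip positions, optional contact force) and keeps the source embodiment as metadata. Every class, field name, and unit here is hypothetical; this is not a published standard or Field Motion's actual format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Frame:
    t: float                                   # seconds since episode start
    wrist_pose_world: tuple[float, ...]        # x, y, z, qx, qy, qz, qw
    fingertips_world: list[tuple[float, float, float]]
    contact_force_n: Optional[list[float]] = None  # per fingertip; None if not captured


@dataclass
class Episode:
    episode_id: str
    task_label: str
    source_embodiment: str                     # e.g. "human_hand" or a robot model
    frames: list[Frame] = field(default_factory=list)


def validate(ep: Episode) -> list[str]:
    """Return a list of human-readable problems; an empty list means the episode passes."""
    problems = []
    if not ep.frames:
        problems.append("episode has no frames")
    for i, f in enumerate(ep.frames):
        if len(f.wrist_pose_world) != 7:
            problems.append(f"frame {i}: wrist pose must have 7 values")
        if i > 0 and f.t <= ep.frames[i - 1].t:
            problems.append(f"frame {i}: timestamp not strictly increasing")
        if f.contact_force_n is not None and len(f.contact_force_n) != len(f.fingertips_world):
            problems.append(f"frame {i}: force count does not match fingertip count")
    return problems


pose = (0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0)
good = Episode("ep-001", "pick_up_glass", "human_hand", [
    Frame(0.00, pose, [(0.10, 0.0, 0.00)], [0.5]),
    Frame(0.02, pose, [(0.10, 0.0, 0.01)], [0.7]),
])
print(validate(good))  # []
```

Keeping joint angles out of the core record is the design choice that matters: joint-space data is tied to one kinematic chain, while task-space poses can be mapped onto different hardware by a separate retargeting step.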

The tactile gap

There is one data modality that almost no one in the industry has adequately solved: haptic and force feedback. Every robot manipulation paper acknowledges it. Very few pipelines actually capture it at scale.

DEXOP captures tactile data as a core feature. The RoboPaint pipeline explicitly cites the lack of tactile data in passive video as a primary challenge. The PHUMA dataset (Unitree G1 and H1-2 humanoids) focuses on physical plausibility in motion data - recognizing that physically infeasible demonstrations create artifacts that hurt policy training. Force and contact data is not optional for contact-rich tasks. It is the signal.

Any data collection operation that is serious about dexterous manipulation needs a plan for force and contact capture. This is an area where the gap between what the models need and what most pipelines deliver is still very wide.

What this means for the industry

The physical AI data problem is not going away. It is scaling. Every new humanoid program - whether it's Figure AI's Helix Lab, Apptronik's Apollo, or 1X's NEO - needs training data that cannot be generated synthetically at the quality levels required. The companies that build robust real-world data pipelines will have a durable advantage. The companies that rely on improvised crowdsourcing will be fighting quality problems every time they try to scale.

The infrastructure required is not just about hardware or operator networks. It's about understanding what good physical AI data looks like, building QA processes that can validate it, and maintaining the real-world environment access needed to collect it across the task diversity that foundation models require.

That infrastructure is what Field Motion builds. Teleoperation demonstrations, egocentric video with hand and finger tracking, dexterous manipulation sequences, force and contact data, sensor fusion datasets - designed for retargetability and delivered model-ready. We operate as a managed data collection and annotation service, with trained operators, QA specialists, and access to real-world environments across multiple geographies.

The scaling law is real. The data gap is real. The question is who closes it.

Work with Field Motion

Physical AI training data - teleoperation, egocentric video, manipulation sequences, force/contact capture. Managed end-to-end.

Get in touch