Building a General-Purpose Physical AI System for Food Manipulation (Part 2)

Some of our recent engineering blogs introduced Chef’s Food Foundation Model (FFM). First, we showed the FFM assembling a burger in under a minute. Then, we explained why we’re building the FFM—why we’re developing a single model that works across many meals and hardware platforms (cross-embodiment).

This blog explores how we taught the FFM to make a burrito bowl instead of a burger, with the same model architecture assembling a completely different dish. Like a burger, a burrito bowl consists of several ingredients: rice, corn, beans, chicken, lettuce, and cheese. But the two meals are less alike than they might seem: a burger is stacked from discrete parts, while a bowl is scooped, meaning the robot lifts loose food from a pile with a utensil and drops a measured amount into a dish. To teach our FFM this new meal, we keep the same model architecture and train a new model to scoop ingredients using human demonstration data.

Why is scooping a different task?

Compared to grabbing and stacking burger ingredients, scooping ingredients for a burrito bowl introduces three new challenges:

Granularity: Rice clumps, corn rolls, beans sit in liquid, chicken comes in uneven chunks, lettuce is loose, and cheese sticks together. There is no fixed object to grab, and the surface shifts with every scoop.

Portioning: Each ingredient must be portioned precisely, and everything hinges on how the robot loads the utensil. If the robot underfills the utensil, the portion becomes too small. If it overfills, the ingredient will spill.

Clean transport: Liquid drips, and kernels scatter. Any sudden movement flings food past the bowl. What’s intuitive to humans—every scoop reaching the dish without spilling or landing in the wrong place—is a nontrivial challenge for our robot.

Same architecture, new data

We keep our model architecture unchanged and use the same two-stage setup as before: the FFM decides what to do next, and an action policy network carries it out. What’s new is the data: we use demonstrations of a person assembling a burrito bowl. Scooping needs its own motions: finding the correct grip point, entering the ingredient pile at the right angle, scooping up the right amount, and emptying the spoon without scattering the ingredient.
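
A minimal sketch of this two-stage split might look like the following. Everything in it is hypothetical: the function names, the fixed ingredient order, and the list of scooping steps are stand-ins for illustration, and both learned models are replaced by simple stubs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical step names, loosely based on the scooping motions listed above.
SCOOP_STEPS = ["grip_handle", "enter_pile", "scoop", "empty_spoon"]


@dataclass
class BowlState:
    remaining: List[str]                      # ingredients still to add
    log: List[str] = field(default_factory=list)


def choose_next_ingredient(state: BowlState) -> Optional[str]:
    """Stub for the high-level stage: decide what to do next."""
    return state.remaining[0] if state.remaining else None


def execute(ingredient: str, state: BowlState) -> None:
    """Stub for the action policy stage: carry out the chosen sub-task."""
    for step in SCOOP_STEPS:
        state.log.append(f"{ingredient}:{step}")
    state.remaining.remove(ingredient)


def assemble_bowl(ingredients: List[str]) -> List[str]:
    """Alternate between deciding and acting until nothing remains."""
    state = BowlState(remaining=list(ingredients))
    while (ingredient := choose_next_ingredient(state)) is not None:
        execute(ingredient, state)
    return state.log


if __name__ == "__main__":
    for entry in assemble_bowl(["rice", "corn", "beans"]):
        print(entry)
```

In this sketch, switching meals would change only what the two stubs have learned to output; the loop that alternates between deciding and acting stays the same.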

Teaching our robot a new meal is a data problem, not an engineering rebuild. We don’t redesign the system; we train it on a new task, which is much faster than hand-building a separate robotics setup for each new meal.

An adaptable handle for ordinary utensils

A two-finger gripper can’t scoop food directly, so we give our robot a way to hold and use ordinary serving utensils. To the best of our knowledge, this is the world’s first spoon made for robots! The handle is simple: a custom-designed clamp that fits a standard serving utensil and gives the robot’s gripper the same grip point every time. The robot picks up the utensil by this robot-friendly handle, uses it, and sets it down again.

Our adaptable handle does more than let our robot hold a spoon. The hardest part of handling food is the grasp (see our previous blog), because food varies a lot: a piece of lettuce or a clump of rice behaves differently every time. A handle is different: its shape is fixed, and only its position and orientation shift a little, depending on where it was set down. When our robot grips a rigid shape instead of the food, there is far less variation to deal with, which is a big reason scooping is so consistent.
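
One way to see why a rigid handle helps: on a flat surface, its placement can be summarized by three numbers (x, y, and a rotation), and the grip point follows from them by a fixed offset. The sketch below assumes a planar pose and a made-up 5 cm offset along the handle. It illustrates the geometry only and is not meant to describe how the learned policy locates the handle.

```python
import math
from typing import Tuple


def grasp_point(
    handle_x: float,
    handle_y: float,
    handle_theta: float,
    offset: Tuple[float, float] = (0.05, 0.0),  # assumed offset in the handle frame, meters
) -> Tuple[float, float]:
    """Map a fixed grip offset from the handle frame into the table frame."""
    ox, oy = offset
    c, s = math.cos(handle_theta), math.sin(handle_theta)
    # Standard 2D rigid transform: rotate the offset, then translate.
    return (handle_x + c * ox - s * oy, handle_y + s * ox + c * oy)


if __name__ == "__main__":
    # Same handle, set down in two slightly different places.
    print(grasp_point(0.40, 0.10, 0.0))
    print(grasp_point(0.42, 0.09, math.radians(10)))
```

A clump of rice has no equivalent fixed offset, because its shape changes with every scoop.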

To scoop correctly, two parts of our design need to align:

The handle’s shape: Our robot scoops by watching its wrist camera, so the handle needs to stay clear of that view. Our first version blocked it because the camera’s viewpoint was too low. The second version fixed this issue by adding a stand-off support that raises the camera.
The utensil’s shape: A deep, cylindrical scoop is excellent for holding food but difficult to fill. A flatter, shovel-shaped scoop fills up easily but spills at the slightest shake. We chose the shovel and keep it from spilling with smoother motion, including a latency-aware technique that reduces shaking.

Different utensils and handles we tried. Top: wrist-camera views with the first- and second-version handles. Bottom: utensils fitted with the first- and second-version handles.

The utensils we choose are standard kitchen tools, so they are cost-efficient, easy to clean, and easy to replace.

Spillage and cross-contamination

In a kitchen, a scoop that spills or mixes ingredients is not just a presentation problem; it’s a safety problem. Two priorities follow:

Spillage: Loose, granular food spills if the robot arm jerks or moves too fast. To prevent this, our action policy network moves the arm slowly and smoothly when carrying a loaded spoon. This slow, safe transport is the main reason why burrito bowl assembly takes as long as it does (more on that below).
Cross-contamination: Mixing ingredients between compartments spreads allergens, so the scoop’s path needs to stay over the right container and keep stray pieces out of its neighbors. We’re currently verifying this process against a formal food-safety standard.
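
To give a feel for what "slow and smooth" transport means in practice, here is a sketch of a minimum-jerk profile, a standard textbook way to move between two points with zero velocity and acceleration at both ends. This is an illustrative stand-in, not the motion our action policy network produces: the policy learns its motion from demonstrations, and the function names and coordinates below are made up.

```python
from typing import List, Sequence, Tuple


def min_jerk_profile(tau: float) -> float:
    """Fraction of the path covered at normalized time tau in [0, 1]."""
    tau = min(max(tau, 0.0), 1.0)
    return 10 * tau**3 - 15 * tau**4 + 6 * tau**5


def transport_waypoints(
    start: Sequence[float], goal: Sequence[float], steps: int
) -> List[Tuple[float, ...]]:
    """Interpolate from start to goal, easing in and out of the motion."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    points = []
    for i in range(steps):
        s = min_jerk_profile(i / (steps - 1))
        points.append(tuple(a + s * (b - a) for a, b in zip(start, goal)))
    return points


if __name__ == "__main__":
    # Hypothetical tray and bowl positions in meters.
    for point in transport_waypoints((0.0, 0.0, 0.2), (0.5, 0.0, 0.2), 6):
        print(point)
```

The first and last waypoints sit close to the endpoints, so a loaded spoon would start and stop gently instead of being jerked into motion.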

Training by demonstration

As with burger assembly, we teach our robot this new skill through imitation learning, that is, by demonstration rather than by writing rules. Operators guide the robot arms through the entire burrito bowl assembly process many times, and the model learns from the recordings. The details that matter (e.g., how to scoop up sticky rice and how much to load on the spoon) are nearly impossible to write down by hand.

After about 25 hours of demonstrations on top of a pre-trained model, the system completed 39 of 41 bowls, a 95% success rate and higher than the 75% we reported for the burger. This result is consistent with a point we made earlier: gripping a fixed handle instead of the food itself introduces less variation, which leads to a higher success rate.
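
The arithmetic behind the success rate, together with a rough sense of how much 41 trials can tell us, can be sketched as follows. The Wilson score interval is a standard statistical tool that we add here for illustration; it is an assumption of this sketch and was not part of the evaluation described above.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a success proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


rate = 39 / 41                      # completed bowls / attempted bowls
low, high = wilson_interval(39, 41)

if __name__ == "__main__":
    print(f"success rate: {rate:.1%}")
    print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the interval spans roughly 84% to 99%, so larger test runs would help narrow the estimate.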

What works

Within the setup we tested, our robot performs consistently. It successfully scoops all six ingredients despite normal variation in tray fill levels, keeps portions balanced from bowl to bowl by compensating when one runs short, and requires no hand-tuning between runs (aside from removing the finished bowl). A full bowl takes just under two minutes to assemble.

What’s next

Next up, we’re working on bringing the cycle time down, making the model more robust, measuring portion accuracy, and running larger tests in real kitchens.

Beyond that, our FFM will extend to additional ingredients, new meals, and, eventually, new robotic hardware different from our bi-manual system. Each of these steps requires more data and testing, but so far, a new meal has meant new demonstrations, not a new machine.

If you’re thinking about physical AI in a commercial kitchen or want to follow our work, get in touch with our team!