Building Outdoor Autonomy
Robotics in landscaping has the potential to change an often overlooked industry. All around us, the natural world is being shaped to thrive in populated areas. In the US, 50 million hours are spent every week caring for our parks, yards, and community spaces, yet we still fall far short of the number of people needed to provide sustainable, attentive care. Instead, current solutions amplify worker output with large gas machines and deadly herbicides. Robotic agents are a promising alternative to these 20th-century techniques of mechanization: a robot is a tool that augments human labor, allowing workers to perform high-quality, emission-free land care.
Landscaping spans a wide range of tasks requiring varying levels of dexterity. At ESR, we frame this as a technical roadmap: we start by deploying 3D surface-coverage agents and scale up to complex dexterous manipulation.
At Electric Sheep, we believe that to build these robots we need to create a large-scale sandbox for them to learn in. To accomplish this, we are acquiring landscaping companies across the country and transforming them into RL factories. Landscaping consists of a wide range of mobile manipulation tasks, from unstructured dexterous manipulation to mechanized surface coverage, and this spectrum of tasks provides a perfect training ground for embodied AI to learn in a profitable way. This post describes how we are approaching building that AI, with mowing as our running example. We believe the key to a generalized agent is learning a model of the world that can be heavily pre-trained in simulation.
World Models for Robust Behaviors
Consider how a human performs the task of mowing. First, they must identify the mowable region, which requires a strong semantic understanding of the world. They then need to formulate a coverage plan and track where they have been, which requires the ability to map and localize on the fly in large-scale environments. Finally, there are the kinematics of the machine itself: knowing which obstacles it can drive over, like leaves, versus things it will get stuck on, like manhole covers. As humans we do all of this seamlessly in real time; for robots to work in the wild, they need to match this ability.
To embed an agent with generic concepts of traversability, semantic understanding, mapping, and localization, we are training a world model. World models have gained a lot of attention lately because of their potential to enable reasoning and planning, which is currently lacking in modern AI agents. Fundamentally, a model gives the agent the ability to ask what happens if a certain action is taken, which allows it to evaluate different possible outcomes. The ability to plan naturally falls out because the agent can reason ahead and select the best outcome.
Left: A robot using a mental model to reason about the world. Right: A robot that is purely reactive with no mental model. (Image from Berkeley CS 188 Slides)
For example, in a safety-critical system like a lawn-mowing robot, it is important to know the consequences of an action before taking it. If the agent encounters an unseen environment, it can forward-simulate possible actions and score them, discarding those that are risky or dangerous. And if the agent gets stuck in a tricky scenario, like a gopher hole, it can exhaustively evaluate its options to find a way out; this is in contrast to models like LLMs, which spend a fixed amount of compute no matter the problem.
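To make that concrete, here is a minimal sketch of model-based action selection, assuming a hypothetical world model that exposes a one-step `predict` rollout and a `risk` score; the names are illustrative, not ES-1's actual interface.

```python
# Minimal sketch of model-based action selection. `world_model` is a
# hypothetical object with predict(state, action) and risk(state); the
# names are illustrative, not ES-1's actual API.
def plan_one_step(world_model, state, candidate_actions, horizon=5):
    """Score each candidate action by forward-simulating it and pick the safest."""
    best_action, best_score = None, float("inf")
    for action in candidate_actions:
        sim_state, total_risk = state, 0.0
        for _ in range(horizon):
            # Ask the model: "what happens if I take this action?"
            sim_state = world_model.predict(sim_state, action)
            total_risk += world_model.risk(sim_state)
        if total_risk < best_score:
            best_action, best_score = action, total_risk
    return best_action
```

In practice the planners that consume ES-1's outputs are richer than this loop, but the core pattern is the same: simulate, score, and only then act.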
ES-1: A World Model for Terrain Coverage
The idea of a robot using a model is not new; in fact, classic techniques such as motion planning explicitly require one. In general, though, building such a model is hard because it requires multiple modular systems, like SLAM, semantic perception, and collision detection, each of which demands precise sensing and an entire team to support it. At ESR, we are instead building a single dense model that operates on low-cost sensor data and automatically improves from data over time.
Building on recent advances in generative AI, we trained a foundational world model to predict the variety of embodied-intelligence outputs needed for tasks like mowing, edging, and trimming. To accomplish these tasks, a model needs to understand the semantics of the world, create a map that can be used for coverage planning, and highlight the edges of the workable area (in this case, the grass). ES-1 achieves this by first consuming a time series of stereo images and uncorrected GPS. The model then outputs a BEV (bird's-eye-view) semantic map of the world, the robot's pose, 3D semantics, low-lying obstacle detections, and a traversability map.
Illustration of ES-1. A dense model that takes in a time series of images and uncorrected GPS to output a world state used by our autonomous agent.
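As a rough sketch, the model's interface looks something like the following; the field names and shapes here are assumptions for illustration, not the production schema.

```python
# Hedged sketch of ES-1's inputs and outputs as described above; field names
# and shapes are illustrative assumptions, not the real interface.
from dataclasses import dataclass
import numpy as np

@dataclass
class ES1Input:
    stereo_frames: np.ndarray   # (T, 2, H, W, 3) time series of stereo RGB pairs
    raw_gps: np.ndarray         # (T, 2) uncorrected latitude/longitude fixes

@dataclass
class ES1WorldState:
    bev_semantics: np.ndarray   # (C, H_bev, W_bev) bird's-eye-view semantic map
    robot_pose: np.ndarray      # (x, y, heading) in the local map frame
    semantics_3d: np.ndarray    # per-point semantic labels in 3D
    low_obstacles: np.ndarray   # mask of low-lying obstacle detections
    traversability: np.ndarray  # (H_bev, W_bev) cost map for coverage planning
```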
Architecturally, the model processes visual information with efficient CNN feature extractors and then aggregates the time-series information in a vanilla decoder-style transformer to produce the output representation. Because the architecture is expected to run in real time on embedded Jetson platforms, great care goes into model distillation and reducing parameter count. We found the most effective way to do this was to use very high quality data sources, which is only possible with synthetic data.
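A hedged sketch of that layout in PyTorch: a small per-frame CNN encoder, causally masked (decoder-style) transformer layers over the time series, and separate output heads. The layer sizes and head shapes are placeholders chosen for readability, not the deployed model.

```python
# Toy stand-in for the architecture described above, not the deployed model.
import torch
import torch.nn as nn

class ES1Sketch(nn.Module):
    def __init__(self, d_model=256, n_layers=4, n_heads=8, bev_classes=8, bev_size=32):
        super().__init__()
        self.bev_classes, self.bev_size = bev_classes, bev_size
        # Lightweight CNN feature extractor, cheap enough for an embedded Jetson.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Causally masked transformer layers aggregate the time series.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Separate heads read out pieces of the world state.
        self.pose_head = nn.Linear(d_model, 3)                           # x, y, heading
        self.bev_head = nn.Linear(d_model, bev_classes * bev_size ** 2)  # coarse BEV map

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.temporal(feats, mask=causal)
        pose = self.pose_head(h[:, -1])
        bev = self.bev_head(h[:, -1]).view(B, self.bev_classes, self.bev_size, self.bev_size)
        return pose, bev
```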
Large Scale Pre-training with ISAAC
Training a world model requires dense label annotation of the model's state, which is generally cost-prohibitive with real-world data alone. NVIDIA's ISAAC simulator makes it possible to exercise the entire autonomy stack end-to-end in photo-realistic procedural worlds. Using assets from the Omniverse, we built the ability to generate procedural parks and lawns in ISAAC. We can then leverage its real-time simulation to run our robot and collect training data.
Our autonomous agent collected training data for ES-1 in simulation.
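An individual data-collection episode looks roughly like the sketch below. The `sim` and `agent` objects are hypothetical wrappers standing in for the actual ISAAC/Omniverse APIs; the point is the structure: step the agent, render its sensors, and save the dense ground truth alongside them.

```python
# Illustrative data-collection loop. `sim` and `agent` are hypothetical
# wrappers around the simulator and our coverage agent, not real ISAAC APIs.
import json
import pathlib

def collect_episode(sim, agent, out_dir, steps=2000):
    """Run one procedural world and save model inputs alongside dense labels."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    obs = sim.reset()                               # fresh procedural park or lawn
    for t in range(steps):
        action = agent.act(obs)                     # the autonomous mowing behavior
        obs = sim.step(action)
        sample = {
            "stereo": sim.render_stereo(),          # model inputs
            "gps": sim.noisy_gps(),                 # uncorrected GPS fix
            "labels": sim.dense_ground_truth(),     # BEV semantics, pose, traversability
        }
        (out / f"{t:06d}.json").write_text(json.dumps(sample, default=str))
```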
Using NVIDIA's Docker containers, we can containerize the ISAAC environment and launch large-scale simulations on cloud instances. To date we have trained ES-1 on over a hundred thousand unique procedural worlds. This level of diversity enables robust sim2real transfer, allowing our agents to be safely deployed in the physical world. To keep advancing performance, though, the agent now needs to learn from the long tail of corner cases found in the real world.
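For scale-out, a launcher along these lines fans the episode script across many containers, each with its own world seed. This sketch uses the Docker SDK for Python; the image name, command, and environment variable are assumptions for illustration, not our actual pipeline.

```python
# Hedged sketch of fanning out containerized simulation jobs with the Docker
# SDK for Python; image name, command, and env vars are placeholders.
import docker

def launch_sim_batch(image="isaac-sim-lawn:latest", n_worlds=64):
    client = docker.from_env()
    gpu = docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    containers = []
    for seed in range(n_worlds):
        containers.append(client.containers.run(
            image,
            command=f"python collect_episode.py --world-seed {seed}",
            environment={"ACCEPT_EULA": "Y"},   # assumed container convention
            device_requests=[gpu],              # expose GPUs to the sim container
            detach=True,
        ))
    return containers
```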
Scaling with the Fleet
Yards across the US are enormously diverse: in the type of grass, the surrounding terrain, the dogs, and the gopher holes. To truly capture this distribution in a world model, we need to sample from it and improve the model with what we find. With a world model pre-trained in simulation, we can deploy the robot with our crews and have it learn from real examples.
When ES-1 is uncertain in a scenario, we log that moment in time. Those data points are then passed to an LMM that describes the scenario at hand and either pulls the proper assets from the Omniverse to update our simulator or passes the data off to a human annotator for real-world fine-tuning. We will describe this process in detail in a subsequent post.
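A simplified version of that triage step is sketched below; the uncertainty threshold, the LMM call, and the routing rule are placeholder assumptions, not the production pipeline.

```python
# Rough sketch of uncertainty-triggered logging and triage. The threshold,
# describe_with_lmm callable, and routing rule are illustrative placeholders.
import json
import pathlib
import time

def triage_uncertain_moment(world_state, uncertainty, describe_with_lmm,
                            threshold=0.7, log_dir="triage"):
    """Log a moment the model is unsure about and decide where it should go next."""
    if uncertainty < threshold:
        return None                                   # confident: nothing to log
    description = describe_with_lmm(world_state)      # natural-language scene summary
    record = {"timestamp": time.time(),
              "uncertainty": float(uncertainty),
              "description": description}
    out = pathlib.Path(log_dir)
    out.mkdir(exist_ok=True)
    (out / f"{int(record['timestamp'])}.json").write_text(json.dumps(record))
    # Placeholder routing: scenarios we can rebuild from known assets go back
    # into the simulator; everything else is queued for human annotation.
    known_assets = {"gopher hole", "garden hose", "sprinkler head"}
    if any(name in description.lower() for name in known_assets):
        return "update_simulator"
    return "human_annotation"
```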
We are currently deployed in hundreds of yards, running robots daily and improving the performance of our world model. Next year, we plan to launch products that go beyond mowing, using the same ES-1 model to perform a variety of tasks in the outdoor world. Follow us here for more updates, and maybe next time you're in a park you will see ES-1 in operation.