ASU Learning Sparks

How Visual Inertial Odometry & SLAM Work

Visual inertial odometry (VIO) and simultaneous localization and mapping (SLAM) are used to estimate a user's position, movement and surroundings so they can be mirrored within a virtual world. Cameras and other sensors interpret where and how a device is moving through the physical environment. This allows us to create seamless virtual experiences that emulate all kinds of physical movements, from driving to jumping to flying.

The principal illusion behind virtual reality or augmented reality is that as you move your device around, the game engine renders the virtual world in perfect correspondence with your device moving around in the physical world. 

But how does your VR/AR device know where it is and in what orientation? It turns out that it combines camera sensors and motion sensors, using algorithms to fuse this information into an estimate of where it is and how it's moving.

Essentially, the visual tracking system forms a map of the environment at the same time as it estimates its own location. This is called "Simultaneous Localization and Mapping," since it's both localizing (figuring out where it is) and mapping (creating a visual reference map of the environment).
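
To make that idea concrete, here is a minimal, purely illustrative Python sketch of a SLAM-style state: a single pose estimate and a growing map of landmarks, updated together. The class and variable names are hypothetical and do not come from any real SLAM library.

```python
import numpy as np


class TinySlamState:
    """Illustrative only: one pose estimate plus one landmark map, updated together."""

    def __init__(self):
        self.pose = np.eye(4)   # device pose as a 4x4 transform (rotation + translation)
        self.landmarks = {}     # map: landmark id -> estimated 3D position

    def update(self, new_pose, observed_landmarks):
        """Localize (refine the pose) and map (add or refine landmarks) in one step."""
        self.pose = new_pose
        for lm_id, position in observed_landmarks.items():
            if lm_id in self.landmarks:
                # Refine an existing landmark by averaging old and new estimates.
                self.landmarks[lm_id] = 0.5 * (self.landmarks[lm_id] + position)
            else:
                # First sighting: add the landmark to the map.
                self.landmarks[lm_id] = position


state = TinySlamState()
state.update(np.eye(4), {0: np.array([1.0, 0.0, 2.0]), 1: np.array([0.5, 0.2, 3.0])})
print(len(state.landmarks), "landmarks mapped")
```

A real SLAM system keeps uncertainty estimates for the pose and every landmark and refines them jointly, but the localize-and-map-at-once structure is the same.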

The camera system – sometimes one camera, sometimes multiple – takes pictures of the environment, and the on-board computer determines where the corners and edges are in the images, what we call visual features. As the device moves around, the locations of these visual features move around in the 2D image frame, and the computer can match a detected visual feature with the same feature detected in a previous frame. When a visual feature is seen enough times, amid many other features, the computer can infer its geometry: where features form flat surfaces in 3D space, and, with high precision, where the camera must be with respect to those surfaces.
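
As a hedged illustration of the detect-and-match step, the sketch below uses OpenCV's ORB detector to find corner-like features in two consecutive frames and match them between frames. The image filenames are placeholders, and a real tracker would add outlier rejection and geometric checks on top of this.

```python
import cv2

# Placeholder filenames for two consecutive camera frames.
frame_prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
frame_curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)

# Detect corner-like features and compute a descriptor for each one.
kp_prev, des_prev = orb.detectAndCompute(frame_prev, None)
kp_curr, des_curr = orb.detectAndCompute(frame_curr, None)

# Match each feature in the previous frame to its best candidate in the new frame.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_prev, des_curr), key=lambda m: m.distance)

# Each match ties a 2D point in the old frame to a 2D point in the new frame;
# enough of these correspondences let the tracker estimate how the camera moved.
for m in matches[:10]:
    print(kp_prev[m.queryIdx].pt, "->", kp_curr[m.trainIdx].pt)
```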

At the same time, the motion-sensing hardware on the headset or smartphone also helps estimate how the device is moving. Because the visual tracking system has to process a lot of image data, it is too computationally expensive to run very frequently. Instead, the immediately available readings from the accelerometer, gyroscope, and compass can provide a rapid estimate of the device's motion. These rapid estimates from the motion-sensing hardware are fused with the precise estimates from visual tracking, which allows the VR/AR device to work out where it is with both speed and accuracy.
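
The sketch below shows, in a deliberately simplified way, how fast inertial readings can fill the gaps between slower visual updates: accelerometer samples are integrated to predict motion, and each incoming visual estimate pulls the prediction back toward the more precise value. Real systems use more sophisticated filters (typically a Kalman-style filter); the blend factor and names here are assumptions for illustration only.

```python
import numpy as np

position = np.zeros(3)   # current position estimate (meters)
velocity = np.zeros(3)   # current velocity estimate (m/s)
dt = 0.005               # IMU sample period: 200 Hz, far faster than the camera


def propagate_with_imu(accel_world):
    """Dead-reckon position and velocity from one accelerometer sample."""
    global position, velocity
    velocity = velocity + accel_world * dt
    position = position + velocity * dt


def fuse_visual_update(visual_position, blend=0.8):
    """When a slower but more precise visual estimate arrives,
    pull the IMU-propagated estimate toward it."""
    global position
    position = blend * visual_position + (1.0 - blend) * position


# Many fast IMU steps between camera frames...
for _ in range(40):
    propagate_with_imu(np.array([0.1, 0.0, 0.0]))

# ...then one visual correction when the camera pipeline finishes.
fuse_visual_update(np.array([0.02, 0.0, 0.0]))
print(position)
```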

Visual-inertial odometry allows devices to properly place virtual objects and environments in real time, giving users the ability to enter AR and VR experiences that blend the virtual and physical worlds together.