The making of the HoloLens 2: How advanced AI built Microsoft’s vision for ubiquitous computing

The first time people put on the new HoloLens 2, the device automatically gets to know them: It measures everything from the precise shape of their hands to the exact distance between their eyes.

The artificial intelligence research and development that enabled those capabilities “was astonishingly complicated” but essential to making the experience of using the device “instinctual,” said Jamie Shotton, a partner scientist who leads the HoloLens science team in Cambridge, United Kingdom.

“We want you to know how to use HoloLens without having to be taught how to use it,” he said. “We know how to interact with things in the real, physical world: We pick things up, we press buttons, we point to things. We aim, as far as possible, to translate that directly into mixed reality.”

Microsoft today announced the HoloLens 2 is now shipping to customers. The sensor-packed holographic computing headset uses AI to displace space and time, creating a mixed reality of people, places and things in order to facilitate one of our most basic human impulses: exchanging knowledge.

Microsoft Technical Fellow Alex Kipman said the headset sets the high-water mark for intelligent edge devices: AI-capable technologies that can collect and process data even without a reliable internet connection, and then share some or all of that data with the intelligent cloud when connected.

On a recent day, Kipman sketched a diagram of this ubiquitous computing fabric on a digital whiteboard in his office.

“HoloLens,” he said, “is the first native device to be invented from the ground up with this worldview in mind.”

The marriage of the AI in HoloLens 2 with the AI capabilities of Azure, Microsoft’s cloud computing platform, allows heads-up, hands-on workers to learn skills that advance their careers, and makes it possible for people on opposite ends of the Earth who speak different languages to collaborate with a shared sense of physical presence.

“You can do really interesting things with HoloLens, and you can do really interesting things with the cloud,” said Julia White, Microsoft corporate vice president of Azure marketing. “But when you see these two things come together, it changes the game in terms of what people can actually do.”

Delivering mixed reality with AI

To enable instinctual interactions with HoloLens 2, Shotton and his colleagues developed, trained and deployed AI models onto the device that track people’s hand motions and eye gaze so that, for example, they can perceive a hologram floating in front of them and reach out to resize it or reposition it.

To build the hand tracking system, the team constructed a rig with a dome of inward-pointing cameras and used it to record a diverse range of people’s hands. Then the team used offline cloud processing to build a 3D model capable of representing all human hand shapes and motions.

From this 3D model, the team was able to use computer graphics to render realistic, synthetic images of hands along with synthetic labels to make the model robust across a variety of hand shapes, poses and movements.

“You can generate effectively unlimited quantities of training data,” Shotton said.
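The idea behind synthetic training data can be sketched in a few lines of Python. This is a toy illustration, not the team's pipeline: a two-segment planar finger stands in for the full 3D hand model, a forward-kinematics function stands in for the graphics renderer, and all names, segment lengths and noise levels are assumptions made for the example. The point it shows is that labels come for free, because every sample is generated from known pose parameters.

```python
import math
import random


def fingertip_position(base_angle, bend_angle, seg1=40.0, seg2=30.0):
    """Toy stand-in for a renderer: forward kinematics of a two-segment
    planar finger. Angles are in radians, lengths in millimeters."""
    x = seg1 * math.cos(base_angle) + seg2 * math.cos(base_angle + bend_angle)
    y = seg1 * math.sin(base_angle) + seg2 * math.sin(base_angle + bend_angle)
    return (x, y)


def make_dataset(n_samples, seed=0, noise_mm=0.5):
    """Generate (observation, label) pairs from randomly sampled poses.

    The label is the pose that produced the observation, so no manual
    annotation is needed and n_samples can be as large as desired.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n_samples):
        # Sample a pose; this tuple is the ground-truth label.
        label = (rng.uniform(0.0, math.pi / 2), rng.uniform(0.0, math.pi / 2))
        x, y = fingertip_position(*label)
        # Add sensor-like noise to the "rendered" observation.
        observation = (x + rng.gauss(0.0, noise_mm), y + rng.gauss(0.0, noise_mm))
        dataset.append((observation, label))
    return dataset


training_data = make_dataset(1000)
print(len(training_data), "synthetic samples, each with an exact label")
```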

The team used this data to train a compact deep neural network, a type of AI algorithm, that fits on the HoloLens’s onboard processor and runs efficiently on every frame coming from the device’s depth sensor.

When a new customer puts on a HoloLens 2, the system uses this neural network to help fit a personalized 3D model to the customer’s hands, enabling the precise tracking required to allow instinctual interaction with holograms.
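As a rough illustration of what fitting a model to one person's hands can mean, the sketch below fits a single scale factor that stretches a template hand to match measured bone lengths. It is a deliberately simplified stand-in: the article does not describe the actual fitting procedure, and the function name, the template values and the one-parameter model are all assumptions made for this example.

```python
def fit_uniform_scale(template_lengths, observed_lengths):
    """Least-squares scale s that minimizes sum((s * t - o) ** 2).

    Setting the derivative to zero gives s = sum(t * o) / sum(t * t).
    """
    numerator = sum(t * o for t, o in zip(template_lengths, observed_lengths))
    denominator = sum(t * t for t in template_lengths)
    return numerator / denominator


# Hypothetical template bone lengths (mm) and one user's measurements.
template = [40.0, 25.0, 20.0]
observed = [44.0, 27.5, 22.0]

scale = fit_uniform_scale(template, observed)
personalized = [scale * t for t in template]
print(f"scale = {scale:.3f}, personalized lengths = {personalized}")
```

A real hand model would have many more shape and pose parameters, but the principle is the same: adjust the parameters until the model agrees with what the sensors observe.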

Shotton’s team took a similar approach to build and train the eye tracking model, paying close attention to what’s called interpupillary distance, or the distance between the centers of the pupils of the eyes, which varies across people and affects how a person sees near or distant objects.
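Interpupillary distance itself is simple to compute once the pupil centers are known. The short Python sketch below uses illustrative coordinates in millimeters; the function name and the sample values are assumptions for the example, not measurements from the device.

```python
import math


def interpupillary_distance_mm(left_pupil, right_pupil):
    """Straight-line distance between two 3D pupil centers, in millimeters."""
    return math.dist(left_pupil, right_pupil)


# Illustrative pupil centers in a head-centered coordinate frame (mm).
left_pupil = (-31.5, 0.0, 0.0)
right_pupil = (31.5, 0.0, 0.0)

print(interpupillary_distance_mm(left_pupil, right_pupil), "mm")
```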

The result is an eye tracking model that allows the HoloLens 2 to precisely display holograms in front of customers for interaction and manipulation with their hands and eyes.

“Without eye tracking, aligning holograms to the real world – especially the person’s physical hand – would just be impossible to the level of precision needed to allow instinctual interaction,” Shotton said.

AI on the edge to the cloud

The hand and eye tracking capabilities are embedded on the HoloLens 2 in the second-generation custom chip called a holographic processing unit, or HPU 2.0. So are other intelligent features such as simultaneous localization and mapping, which is necessary to make holograms appear pinned to the world as a person moves around.

Kipman calls this class of on-device AI capability perception AI.

“Perception is like reptile brain,” he said. “It is performing those operations that your brain performs that are instinctual, that you don’t think about.”

For people, this type of intelligence keeps our hearts beating, our lungs breathing and our eyes performing microsaccades to gauge depth of field, for example. When we’re thirsty and want a sip of water, our eyes instinctively gauge the distance to the water glass that our hands lift to our lips.

Perception AI on the HoloLens 2 enables people to manipulate and interact with holograms without worrying about what’s called latency – typically the hundreds of milliseconds it takes for data to travel to the cloud, be processed and returned to the edge.

“Even tens of milliseconds make a significant perceptual difference” when pressing a button on a hologram, for example, or scrolling through text on a hologram with your eyes, noted Shotton. “That turnaround time is critical.”
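A back-of-the-envelope calculation shows why the turnaround time matters. The Python sketch below converts a delay into the number of display frames that pass while waiting for a result. The 60 frames-per-second rate and the two sample latencies are assumptions chosen for illustration, not HoloLens 2 specifications.

```python
def frames_elapsed(latency_ms, frames_per_second=60):
    """Number of display frames that go by during a given delay."""
    return latency_ms * frames_per_second / 1000.0


# Assumed example latencies: a fast local computation versus a cloud
# round trip in the "hundreds of milliseconds" range.
examples = [("on-device (assumed)", 10), ("cloud round trip (assumed)", 200)]

for label, latency_ms in examples:
    frames = frames_elapsed(latency_ms)
    print(f"{label}: {latency_ms} ms is about {frames:.1f} frames at 60 fps")
```

Under these assumptions, a cloud round trip would leave a pressed holographic button unresponsive for a dozen frames, while a local computation finishes within a single frame.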

Privacy concerns are another reason to do AI calculations locally on a device; the iris scans that HoloLens 2 performs to authenticate customers are the type of personal data people may not want sent to the cloud.

For many other types of data, however, there’s a benefit to sending it to the cloud: Once there, the customer can take advantage of Azure AI and mixed reality services and combine the data from their device with data from throughout the ubiquitous computing fabric. That allows for more advanced computation or cognition, Kipman said.

Cloud collaboration

A key advantage of intelligent cloud-powered holographic computing is the ability to share information with others who have a HoloLens or another device with similar capabilities, said Marc Pollefeys, the director of Microsoft’s Mixed Reality and AI Zurich Lab in Switzerland.

Pollefeys is leading a team that develops core computer vision algorithms for a mixed reality cloud service called Azure Spatial Anchors that allows holograms to persist, locked in the real world, for anyone with the appropriate level of access to view.

For example, spatial anchor technology allows a manager in a factory to place holograms next to equipment on an assembly line that contain vital, real-time operating and maintenance information that any credentialed worker with a mixed reality capable device can access.

“If I can only place information that I will see back on my device, it’s probably never worth placing holograms in the world, but if I can annotate the world and afterward anyone else in the company that has the right access can see all of the information, it is suddenly much more valuable,” Pollefeys said.
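The sharing model Pollefeys describes can be pictured as a set of anchored notes, each visible only to people with the right access. The Python sketch below is purely hypothetical: it is not the Azure Spatial Anchors API, and the class, field names, roles and sample data are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class HologramAnchor:
    """Hypothetical record for a hologram pinned to a real-world location."""
    anchor_id: str
    position_m: tuple  # (x, y, z) in a shared map frame, in meters
    note: str
    allowed_roles: frozenset = field(default_factory=frozenset)


def visible_anchors(anchors, user_roles):
    """Return the anchors that at least one of the user's roles may view."""
    roles = set(user_roles)
    return [anchor for anchor in anchors if roles & anchor.allowed_roles]


factory_anchors = [
    HologramAnchor("press-7", (2.0, 1.5, 0.3), "Oil change due",
                   frozenset({"maintenance", "manager"})),
    HologramAnchor("line-speed", (5.0, 1.8, 0.0), "Line speed: nominal",
                   frozenset({"manager"})),
]

for anchor in visible_anchors(factory_anchors, ["maintenance"]):
    print(anchor.anchor_id, "->", anchor.note)
```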

To create this capability, Pollefeys and his team developed AI computer vision algorithms that process data from sensors to extract 3D geometric information about the environment and piece it together in the cloud to create a digital twin, or map, of the area of interest.

HoloLens has always built up a 3D or spatial understanding of its environment to function. Azure Spatial Anchors creates, refines and shares these maps across devices, Pollefeys noted. That’s why the maps from individual devices are pieced together and stored in the cloud.

“It doesn’t make sense to have that data only on an individual device,” he said. “It is one of those things where I have a little piece of the puzzle, and somebody else has a little piece of the puzzle, and all of the devices together have covered the whole space of interest.”

These maps get denser, more precise and robust over time as different mixed reality capable devices – HoloLenses as well as properly equipped phones, tablets and laptops – map their environment and share the data with the cloud.
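The puzzle-piece idea can be sketched as merging partial maps into one. In the Python example below, each device contributes a handful of 3D points that are assumed to be already expressed in a shared coordinate frame; the points are bucketed into small cubes (voxels) and the merged map counts how many devices observed each one. The voxel size, function names and sample points are assumptions for illustration, and the alignment of maps from different devices, which is the hard part in practice, is left out.

```python
import math
from collections import Counter


def voxelize(points, voxel_size_m=0.1):
    """Map 3D points (meters) to the set of voxel cells they fall into."""
    return {
        tuple(math.floor(coord / voxel_size_m) for coord in point)
        for point in points
    }


def merge_maps(device_point_clouds, voxel_size_m=0.1):
    """Combine per-device maps; the count per voxel is the number of
    devices that observed it, a crude proxy for confidence."""
    observations = Counter()
    for points in device_point_clouds:
        observations.update(voxelize(points, voxel_size_m))
    return observations


# Two devices that each saw part of the space, with some overlap.
device_a = [(0.02, 0.0, 0.0), (0.12, 0.0, 0.0)]
device_b = [(0.15, 0.0, 0.0), (0.27, 0.0, 0.0)]

merged = merge_maps([device_a, device_b])
print(len(merged), "voxels covered;", merged[(1, 0, 0)], "devices saw voxel (1, 0, 0)")
```

As more devices contribute, coverage grows and voxels seen by several devices accumulate higher counts, which mirrors how the shared map becomes denser and more robust over time.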

For example, the map of the factory floor where the manager left holograms floating over pieces of equipment on the assembly line is steadily refined as more and more credentialed workers view the holograms with their devices.

This capability also enables scenarios such as a meeting in which architects and clients view and interact with a holographic 3D blueprint of a building, each using a mixed reality capable device to look at the blueprint from their own point of view as they sit around a table.

Azure contains pre-built services to write applications for these types of experiences on HoloLens and any other mixed reality device, including smartphones and tablets running the iOS and Android operating systems, noted White.

“That collaboration experience isn’t just locked to HoloLens,” she said. “And, the cost and complexity and skillset required to make an application that does something amazing is far down.”

The cross-device and cross-platform capability, for example, enables experiences such as Minecraft Earth, which merges the popular video game with mixed reality so that players can build and place virtual structures in the real world that persist for other players to interact with on their own devices.

“We all get to participate because it is based on using cloud technology that can be understood and interpreted by all different devices,” said White.

Technology that is designed for people

For HoloLens to work as envisioned, the technology that underpins the experience needs to understand the world in ways that are similar to the way people do, Kipman noted.

That’s why he and his collaborators across Microsoft have developed, deployed and leveraged AI solutions throughout the ubiquitous computing fabric, from the silicon in the headset of HoloLens 2 to Azure AI and mixed reality services.

Back at his digital whiteboard, Kipman has now sketched out a vision for ubiquitous computing that is rife with words, boxes, arrows – and a stick-figure picture of two people locked in conversation next to an intelligent device.

That, he says, is the ultimate goal of ubiquitous computing – to get people to interact with other people in natural ways.

To drive home the point, he establishes a moment of intense, deliberate eye contact and says, “Hopefully, you are getting more out of this conversation because you are physically present with me.”

“We could have done this over the phone,” he continues. “We could have done it over Skype. I could have recorded it and sent you a tape. You didn’t choose to do that. You chose to be physically present with me. Why? Because that’s how we do human things.”

“The con is you have to be here at the same time I am here, and we have to be in the same location. The power of this technology is it gives us the ability to displace space and time.”