This post was written by Matt Miesnieks on the Super Ventures Blog. Matt Miesnieks is currently a partner at Super Ventures and has previously worked at Samsung, Dekko, Layar and others.

From this article, we can learn:

  • The basic technical principles of ARKit: the visual-inertial odometry (VIO) system and plane detection

  • ARKit’s two main “mysteries”: 3D from a single camera, and metric scale estimation

  • How ARKit compares technically with Tango and HoloLens

  • How developers can use ARKit

The full article follows below.

This year, Apple’s announcement of ARKit at WWDC made a big splash in the AR ecosystem. For the first time, developers have a powerful, widely available AR SDK they can drop into their apps, with no need for markers, initializers, depth cameras, or specialized authoring tools. Not surprisingly, developers have responded with a flood of AR demos.

However, many developers don’t understand how ARKit works or why it is better than other SDKs. Looking at the underlying technology helps you understand what the current version of ARKit can do, where it needs to improve and why, and also lets us predict when Android devices and head-mounted displays (VR or AR) will support similar capabilities.

I’ve been working on AR for nine years and have built technology similar to ARKit in the past, back when there was no hardware able to support it properly. As an insider, I understand how these systems are built and why they are built the way they are.

This post is an attempt to explain technical issues to non-technical people, not computer vision engineers. In this article, I explain ARKit in simple terms, which may not be 100% scientific, but I hope it will at least help people understand ARKit better.

What technology is ARKit based on?

Technically, ARKit is a visual-inertial odometry (VIO) system with simple 2D plane detection. The VIO system tracks your position in space in real time, that is, your 6-degrees-of-freedom (6-DoF) pose. Your pose is recalculated between every frame refresh of the screen, at 30fps or more, and the calculation is carried out by two systems working in parallel.

First, the visual (camera) system tracks your movement by matching points in the real world to pixels in each frame from the camera sensor. Second, an inertial system tracks your movement using the inertial measurement unit (IMU), i.e. the accelerometer and the gyroscope.

A Kalman filter then fuses the output of the visual and inertial systems to predict your best “real” position (the ground truth), and the ARKit SDK publishes that latest pose. Just as a car’s odometer shows how far the vehicle has driven, the VIO system records how far the iPhone has traveled through 6-DoF space: translation along the x, y and z axes, plus pitch, yaw and roll around those axes.
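
Apple hasn’t published the internals of ARKit’s filter, but the basic idea of fusing two pose estimates can be sketched with a heavily simplified complementary-filter-style blend (a stand-in for a real Kalman filter; the `Pose` type and gain value below are purely illustrative):

```swift
import simd

// A 6-DoF pose: translation in meters plus orientation as a quaternion.
struct Pose {
    var position: simd_double3
    var orientation: simd_quatd
}

/// Heavily simplified fusion: trust the inertial estimate for high-frequency
/// motion, but pull it toward the visual estimate (which drifts far less over
/// time) by a small gain each frame. A production VIO system uses an extended
/// Kalman filter with full covariance bookkeeping; this only shows the idea.
func fuse(visual: Pose, inertial: Pose, visualGain alpha: Double = 0.1) -> Pose {
    let position = inertial.position * (1 - alpha) + visual.position * alpha
    let orientation = simd_slerp(inertial.orientation, visual.orientation, alpha)
    return Pose(position: position, orientation: simd_normalize(orientation))
}
```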

The biggest advantage of the VIO approach is that the IMU can be read about 1,000 times per second, based on acceleration (the user’s motion). Dead reckoning is used to measure the device’s movement between IMU readings. Dead reckoning is really a guess: if I asked you to take a step and estimate how big the step was, you would be using dead reckoning to predict the distance. (I’ll explain later how this prediction is made highly accurate.) The errors produced by the inertial system accumulate over time, so the longer the gap between IMU readings, or the longer the inertial system runs without help from the visual system, the further the tracked motion drifts from the actual motion.
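
To see why that drift matters so much, here is a toy dead-reckoning sketch: integrating acceleration twice to get position means even a tiny constant accelerometer bias grows quadratically with time (the numbers are illustrative, not measured from any real IMU):

```swift
import Foundation

/// Dead reckoning integrates acceleration twice to get position, so a small
/// constant accelerometer bias produces a position error that grows with the
/// square of elapsed time -- which is why IMU-only tracking drifts so quickly.
func deadReckonedDrift(bias: Double, duration: Double, sampleRate: Double = 1000) -> Double {
    let dt = 1.0 / sampleRate
    var velocityError = 0.0
    var positionError = 0.0
    for _ in 0 ..< Int(duration * sampleRate) {
        velocityError += bias * dt          // first integration: velocity
        positionError += velocityError * dt // second integration: position
    }
    return positionError
}

// An (illustrative) 0.01 m/s^2 bias drifts about 5 mm in one second...
print(deadReckonedDrift(bias: 0.01, duration: 1))
// ...but about 0.5 m after ten seconds, hence the need for the visual system.
print(deadReckonedDrift(bias: 0.01, duration: 10))
```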

Visual/optical measurements are made at the camera frame rate, usually 30fps, and are based on how far the scene changes from frame to frame. Optical systems usually accumulate error over distance, and time also affects accuracy to some extent: the farther you travel and the longer it takes, the bigger the error. The good news is that the strengths of one tracking system offset the weaknesses of the other.

The visual and inertial tracking systems rely on completely different measurement methods and do not depend on each other. This means the camera can be covered, or can be looking at a scene with few optical features (such as a white wall), while the inertial system “carries the load” for a few frames. Conversely, when the device is nearly stationary, the motion information from the visual system is more stable than that from the inertial system. The Kalman filter continuously selects the best motion information from each, so the tracking is more stable overall.

VIO systems have been around for years, are well understood in the industry, and already ship on a number of devices, so the mere use of VIO in Apple’s ARKit is not the innovation. What, then, makes ARKit so powerful?

The second key part of ARKit is simple plane detection. A plane is necessary because you need the ground as a reference for positioning content; otherwise objects would appear to float in the air. Any three points define a plane, so the reference plane is computed from the feature points detected by the optical system (the dots you see in the demos), by averaging those points algorithmically.

If the optical system picks up enough feature points, you can estimate the real plane. These feature points are commonly referred to as a “point cloud”; all the feature points together form a sparse point cloud used for optical tracking. A sparse point cloud needs only a small amount of memory and little CPU time, and with the inertial system backing it up, the optical system can keep working even when it detects only a few feature points. A sparse point cloud is different from a dense point cloud, which looks much closer to the real scene (dense point clouds for tracking are currently being researched, and are far more complex).
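
As a rough illustration of deriving a horizontal reference plane by averaging feature points, here is a toy sketch; it is not ARKit’s actual plane-fitting algorithm, which also estimates the plane’s extent and keeps refining planes over time:

```swift
import simd

/// Toy horizontal-plane estimate: take feature points the tracker has placed
/// in world space, average their heights (y), and accept the result only if
/// the points cluster tightly around that height. Real systems fit planes
/// robustly (e.g. with RANSAC) and merge/extend them as more points arrive.
func estimateHorizontalPlaneHeight(from points: [simd_float3],
                                   maxSpread: Float = 0.02) -> Float? {
    guard points.count >= 3 else { return nil }      // a plane needs 3+ points
    let meanY = points.map { $0.y }.reduce(0, +) / Float(points.count)
    let spread = points.map { abs($0.y - meanY) }.max() ?? 0
    return spread <= maxSpread ? meanY : nil         // plane height in meters
}
```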

Two mysteries of ARKit

Some people call ARKit SLAM, or use the term SLAM to refer to position tracking in general. To be clear, SLAM is a pretty broad term, much like “multimedia.” “Tracking” itself is also generic; “odometry” is more specific, but in the AR world “tracking” is good enough. There are many ways to do SLAM, and tracking is only one component of a SLAM system. I think of ARKit as a lightweight or simple SLAM system; the SLAM systems in Tango or HoloLens have many other features beyond odometry.

ARKit has two “mysteries”: first, how it achieves 3D from a single (monocular) lens; second, how it obtains metric scale (as in the tape-measure demos). The answer lies in removing IMU errors “extremely well,” so that dead reckoning can predict with high precision. When that is achieved, the following becomes possible:

To perceive 3D, you need two views of the scene from different angles, and then stereo calculations give you your position in space. That’s how our two eyes see in 3D, and why some trackers rely on stereo cameras. With two cameras it’s easy: the distance between them is known and they capture their frames at the same time. How does ARKit get 3D with just one camera? One camera captures a frame, then the camera moves and captures a second frame.

Using IMU dead reckoning, the distance moved between the two frames can be calculated, and then the stereo calculation proceeds as normal. In practice, more frames may be captured to gain even more accuracy. If the IMU is accurate enough, the “movement” between the two frames can come from nothing more than the tiny motions of your hand and arm muscles, which looks like magic.
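
The underlying geometry is just the textbook rectified-stereo relation, with the baseline supplied by IMU dead reckoning instead of a second physical camera. This is a generic formula, not ARKit’s internal code, and the numbers in the example are illustrative:

```swift
/// Rectified-stereo depth: once the distance the camera moved between two
/// frames (the baseline) is known from dead reckoning, a feature's depth
/// follows from how far it shifted between the frames (its disparity):
///     depth = focalLength * baseline / disparity
/// focalLength and disparity in pixels; baseline and depth in meters.
func stereoDepth(focalLengthPixels f: Double,
                 baselineMeters b: Double,
                 disparityPixels d: Double) -> Double? {
    guard d > 0 else { return nil }   // zero disparity => point at infinity
    return f * b / d
}

// Roughly 1500 px focal length (a phone camera at video resolution), a 2 cm
// hand movement, and a 15 px disparity put the feature about 2 m away.
print(stereoDepth(focalLengthPixels: 1500, baselineMeters: 0.02, disparityPixels: 15) ?? 0)
```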

The system also relies on IMU dead reckoning to obtain metric scale. From the acceleration and time measurements given by the IMU, velocity can be computed, and from that the distance traveled between IMU readings. The math is not hard; the hard part is eliminating IMU error to get a near-perfect measurement of acceleration. A tiny error, accumulated over a few seconds at 1,000 IMU readings per second, can produce a measurement error of 30% or more. Impressively, Apple’s ARKit has reduced the error to less than 10%.

What about Tango, HoloLens, Vuforia and other SDKs?

Tango is a brand name, not a single product. It includes a hardware reference design (RGB camera, fisheye lens, depth camera, and CPU/GPU specs) as well as a software stack covering VIO (motion tracking), sparse mapping (area learning), and dense 3D reconstruction (depth perception).

HoloLens has essentially the same software stack, plus some ASICs (which Microsoft calls the Holographic Processing Unit) to offload processing from the CPU/GPU and reduce power consumption.

Vuforia is almost identical to ARKit, except that it is hardware-independent rather than tied to specific devices.

All of these SDKs use the same type of VIO system; in fact, Tango and ARKit even grew out of the same codebase that FlyBy originally developed! And note that neither HoloLens nor Tango uses its depth camera for tracking. So what makes ARKit stand out?

The answer is that ARKit isn’t really better than HoloLens; I’d even argue that HoloLens has the best tracking system on the market. But HoloLens hardware is not widely available. Microsoft could have shipped HoloLens tracking in Windows smartphones, but I’m sure it won’t, for business reasons:

Calibrating sensors for a phone that might sell in small volumes would add cost and production time, and a Microsoft version of ARKit wouldn’t convince developers to switch away from iOS or Android.

Twelve months ago, Google could easily have shipped Tango phones running Android, but it didn’t. Had Google shipped Tango earlier, ARKit would have looked like a catch-up move rather than a breakthrough.

I believe Google didn’t want to go through a device-specific sensor calibration process for every OEM, and since every OEM would end up with a different version of Tango, Google didn’t want to favor one of the larger OEMs (Samsung, Huawei, etc.) over another. So Google gives OEMs the reference hardware design, and they can “take it or leave it.” (Of course it’s not that simple, but this is a key point the OEMs have made to me.)

With Android smartphone hardware becoming commoditized, the camera and sensor stack is the last area of differentiation left for Android phones, so the OEMs can’t go along with Google’s requirements. Google also insists that the depth camera be part of the phone, which adds to its cost, another reason for OEMs to push back!

The market has changed since ARKit was released. OEMs will now either look for an alternative to Tango or accept Google’s hardware reference design and cede some control of the platform. This makes for an interesting shift.

Overall, ARKit is better because:

Apple can afford to couple the VIO algorithms tightly to the sensors and spend a lot of time calibrating them to reduce the error in computing your position in space.

It’s worth noting that the big OEMs do have some alternatives. There are other tracking solutions out there, such as ORB-SLAM and OpenCV-based approaches, but almost all of them are purely optical trackers (monocular RGB, stereo, or depth camera; some using sparse point clouds, some dense). A number of startups are working on tracking systems, and improving what can be extracted from each pixel is a good direction, but ultimately any VIO system’s competitiveness comes down to hardware modeling and calibration.

How do developers use ARKit

You probably already have a phone that supports ARKit. The first thing to understand is that developing content for ARKit is very different from developing a traditional mobile app: you don’t control the scene, and you are working with every pixel of every camera frame. A minimal ARKit setup is sketched below.
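
Here is what that minimal setup looks like with ARKit’s actual iOS 11 API (ARWorldTrackingConfiguration, ARSCNView); the view-controller scaffolding is just one simple way to wire it up:

```swift
import UIKit
import SceneKit
import ARKit

final class ARViewController: UIViewController, ARSCNViewDelegate {
    let sceneView = ARSCNView()

    override func viewDidLoad() {
        super.viewDidLoad()
        sceneView.frame = view.bounds
        sceneView.delegate = self
        view.addSubview(sceneView)
    }

    override func viewWillAppear(_ animated: Bool) {
        super.viewWillAppear(animated)
        // VIO-based 6-DoF world tracking with horizontal plane detection.
        let configuration = ARWorldTrackingConfiguration()
        configuration.planeDetection = .horizontal
        sceneView.session.run(configuration)
    }

    // Called when ARKit detects a new plane in the scene.
    func renderer(_ renderer: SCNSceneRenderer, didAdd node: SCNNode, for anchor: ARAnchor) {
        guard let plane = anchor as? ARPlaneAnchor else { return }
        print("Detected a plane of extent \(plane.extent) at \(plane.center)")
    }
}
```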

Then think about Tango or HoloLens, and what happens when your content can interact with the 3D structure of a scene you can’t control.

It’s a harder transition than going from web to mobile or from mobile to VR. You need to completely rethink how applications work and what UX means. I’m seeing lots of ARKit demos now that I saw built on Vuforia four years ago, and on Layar (the mobile AR browser built by SPRXmobile) four years before that. Over the years I’ve seen examples of almost every type of AR app, and I’m happy to provide support and feedback.

I often encourage developers to build novel apps; even goofy apps can launch to great success. The hard part is satisfying users within the constraints of today’s AR hardware.

Not many people can build good tracking systems

The reality is that only a handful of people can build good tracking systems. Engineers with the right cross-disciplinary background are the ones able to build monocular VIO systems good enough for phone-based tracking.

VIO was first developed in the mid-2000s at Intersense, a Boston-based military/industrial supplier. Leonid Naimark, one of the inventors of the technology, joined Dekko, a company I founded, as chief scientist in 2011. Due to sensor limitations, Dekko confirmed that VIO could not work on the iPad 2. Leonid returned to the defense industry, while Dekko’s CTO Pierre Georgel is now a senior engineer on Google’s Daydream team.

Ori Inbar, my partner at Super Ventures, founded Ogmento. Ogmento later became FlyBy, and the FlyBy team successfully built a VIO system on iOS with an add-on fisheye camera. That codebase was licensed to Google and became Tango’s VIO system. Apple later acquired FlyBy, and the same codebase became the core of ARKit’s VIO.

Chris Broaddus, FlyBy’s CTO, went on to build tracking systems for Daqri and has since joined Zoox, the secretive Silicon Valley self-driving startup. The first mobile SLAM system (PTAM) was developed in 2007 by Georg Klein at Oxford’s Active Vision Lab; Georg later built the VIO system for HoloLens together with David Nister, who left to build the Autopilot system at Tesla.

Gerhard Reitmayr, a doctoral student supervised by Georg, led the development of Vuforia’s VIO system. Eitan Pilipski, previously a VP at Vuforia, now leads AR software engineering at Snap.

Key members of the teams at Oxford, Cambridge and Imperial College London developed the Kinect tracking system and are now leading the development of tracking systems for Oculus and Magic Leap.

Interestingly, I can’t say which discipline currently leads among the startups working on AR tracking systems; a founder’s background in robotics or another branch of computer vision doesn’t, by itself, translate into a tracking system that works across a wide range of applications.

Later, I’ll talk about the work that contemporary scientists are doing.

It comes down to statistics

There is no absolute “works” or “doesn’t work” for AR tracking systems. In most cases they do the job well enough; what they really strive for is to work “better,” and “better” ultimately comes down to statistics.

So don’t trust every AR app demo, especially the ones posted on YouTube showing amazing results. There is often a big gap between what can be achieved in a carefully staged environment and what an average user gets in real life. Demos of regular smartphone or VR apps don’t suffer from this gap in the same way, so audiences are often fooled.

Here is a concrete technical example of why a system’s performance ultimately comes down to statistics.

In the image above, a grid represents the camera’s digital image sensor; each cell is a pixel. For stable tracking, each pixel should correspond to the same point in the real world, assuming the device is completely stationary. But as the image on the right shows, photons aren’t that obedient: each pixel simply records the total number of photons that happen to land on it. Changes in the lighting of the scene (sunlight coming through clouds, fluorescent flicker, etc.) change which photons hit the sensor, and now the pixels correspond to different points in the real world. As far as the visual tracking system is concerned, the user has moved!

So when you see the dots flickering in the various ARKit demos, the system is deciding which of those points are “reliable.” It triangulates the reliable points to compute the user’s position, and averages over many of them to get the best estimate of the actual pose. Removing the statistical outliers as thoroughly as possible is therefore what makes one system more accurate than another, and that requires tight integration and calibration across the camera hardware stack (lens elements and coatings, shutter and image sensor, etc.), the IMU hardware, and the software algorithms.
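
A toy version of that “keep the reliable points, average the rest” step might look like the following (a simple reject-and-re-average, standing in for the proper robust estimators a real tracker would use):

```swift
import simd

/// Toy robust estimate of device position from many per-feature estimates:
/// average once, discard estimates far from that average (the "unreliable"
/// points produced by sensor noise or lighting changes), then average again.
/// Production trackers use proper robust estimators such as RANSAC.
func robustPosition(from estimates: [simd_double3],
                    rejectBeyondMeters threshold: Double = 0.05) -> simd_double3? {
    guard !estimates.isEmpty else { return nil }
    let mean = estimates.reduce(simd_double3(), +) / Double(estimates.count)
    let inliers = estimates.filter { simd_distance($0, mean) <= threshold }
    guard !inliers.isEmpty else { return nil }
    return inliers.reduce(simd_double3(), +) / Double(inliers.count)
}
```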

Integration of hardware and software

VIO systems aren’t that hard to get running; the algorithms are published and there are plenty of example implementations. What is very hard is getting them to work well: fully integrating the inertial and optical systems so that the system converges on a stereoscopic map of the world and determines metric scale with very low error.

For example, when I was running Dekko, our app required the user to make specific initial motions and then move the phone back and forth for about 30 seconds before the inertial and optical systems converged on a usable map. Building a good inertial tracking system requires rare, experienced engineers; there are perhaps only about 20 engineers in the world with the necessary skills and experience, and most of them work on cruise-missile tracking systems or Mars-rover navigation systems.

Even if you can hire one of those engineers, you still need tight hardware/software integration to minimize error. That means accurately modeling the IMU in software, knowing the full camera pipeline and the detailed specifications of every component, and, most importantly, synchronizing the IMU and the camera very precisely.

The system needs to know exactly which IMU readings correspond to the start of a camera frame and which correspond to the end. This is essential for fusing the two systems, and it has only recently become possible because hardware OEMs had never seen a reason to invest in it. That’s why Dekko took so long to integrate hardware and software on the iPad 2. The first Tango phone was the first device with accurate sensor time synchronization, and thus the first consumer phone with a good tracking system.

Today, chips from companies such as Qualcomm include a synchronized sensor hub shared by all the components, which means VIO can work on most current devices given appropriate sensor calibration.

Because of this tight dependence on hardware, it is almost impossible for software developers to build a good system without deep support from the OEM. Google invested heavily in getting a few OEMs to support Tango’s hardware specification, while companies like Microsoft and Magic Leap are building their own hardware. Apple was able to release ARKit so successfully precisely because it controls both the hardware and the software and integrates them so well.

Optical calibration

For the software to precisely match the camera’s pixels to points in the real world, the camera system needs to be accurately calibrated. There are two types of optical calibration:

The first is geometric calibration: using a pinhole model of the camera to correct for the field of view and the distortion of the lens (almost every image is distorted to some degree by its lens). Most software developers can do this themselves with a standard checkerboard procedure and the camera’s published parameters, without help from the OEM. (A sketch of the distortion model this calibration estimates appears at the end of this section.)

The second is photometric calibration: this goes further and usually needs the OEM’s involvement, because it gets into the details of the image sensor itself and the coatings on the internal lenses. It deals with color and intensity mapping. For example, a camera attached to a telescope photographing the sky needs to know whether a slight change in light intensity on the sensor is really a star or just an artifact of the sensor or lens. For an AR tracker, this calibration gives higher certainty that each pixel on the sensor corresponds to a specific real-world point, so the optical tracking is more accurate and produces less error.

In the image above, various RGB rays of light fall into “pixel buckets” on the image sensor, which illustrates the problem nicely. The light from a real-world point usually lands across the boundary of several pixels, each of which averages the intensity it receives. A small change, whether user movement, a shadow in the scene, or a flickering fluorescent light, changes which real-world point a given pixel corresponds to. Optical calibration exists to eliminate as much of the resulting error as possible.
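
To make the geometric half concrete, here is the standard Brown-Conrady radial distortion term that a checkerboard calibration estimates (the coefficients below are illustrative, not those of any real lens):

```swift
import simd

/// Standard radial distortion model estimated by checkerboard calibration:
/// a point at normalized image coordinates (x, y) is displaced radially by an
/// amount controlled by the coefficients k1 and k2. Calibration recovers these
/// (plus focal length and principal point) so the distortion can be undone
/// before pixels are matched to real-world points.
func applyRadialDistortion(_ p: simd_double2, k1: Double, k2: Double) -> simd_double2 {
    let r2 = p.x * p.x + p.y * p.y          // squared distance from image center
    let scale = 1 + k1 * r2 + k2 * r2 * r2  // Brown-Conrady radial term
    return p * scale
}

// With illustrative coefficients, a point near the corner of the frame shifts
// by a few percent of its radius -- the familiar "bent lines" of raw photos.
let distorted = applyRadialDistortion(simd_double2(0.8, 0.6), k1: -0.05, k2: 0.01)
```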

Inertial alignment

Remember that an IMU measures acceleration; distance and speed have to be derived from it. IMU readout errors accumulate over time, and they accumulate very quickly! The goal of calibration and modeling is to make the distance estimate accurate enough over some window of time, ideally long enough that losing the camera for a few frames (because it is covered, or the scene offers nothing to track) doesn’t break the system.

Measuring distance with the IMU is called dead reckoning. It is basically a guess, but one you can improve by modeling the data the IMU produces, determining how the errors accumulate, and writing filters to reduce them. Imagine being asked to take one step and guess how far you moved: a single step gives a guess with high error. But if you take a thousand steps, guessing each one, the error per step becomes very small, because you learn which foot is moving, what the floor is like, what shoes you’re wearing, how fast you move, how fit you are, and so on, so your final estimate ends up very accurate. Basic IMU calibration and modeling works on the same principle.
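
The simplest piece of that modeling can be sketched directly: while the device is known to be stationary, the average accelerometer reading is (to first order) the bias, and subtracting it removes the dominant source of the drift shown earlier. Real calibration also models temperature, scale factor, and axis misalignment; the numbers here are made up:

```swift
import Foundation

/// Simplest possible IMU calibration step: with the device known to be
/// stationary, the mean accelerometer reading approximates the bias, and
/// subtracting it leaves only the (much smaller) noise to integrate.
func estimateBias(fromStationaryReadings readings: [Double]) -> Double {
    readings.reduce(0, +) / Double(readings.count)
}

// Simulated "stationary" readings: a true bias of 0.012 m/s^2 plus noise.
let readings = (0 ..< 1000).map { _ in 0.012 + Double.random(in: -0.02 ... 0.02) }
let bias = estimateBias(fromStationaryReadings: readings)

// Corrected readings now average near zero, so the quadratic drift from the
// earlier dead-reckoning sketch grows from the residual error, not the bias.
print("estimated bias: \(bias)")
```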

There are many sources of error in the data. Typically a robot arm moves the device repeatedly in exactly the same way, the IMU output is captured, and filters are tuned until the output from the IMU matches the known motion of the arm. To squeeze out even more error, Google and Microsoft have even run calibrations in micro-gravity, on the International Space Station (ISS) and on “zero-g” aircraft.

In practice, achieving real accuracy is much harder than paying lip service to it. OEMs must calibrate every device, even though the same model may ship with different IMUs (a Galaxy S7 might have an IMU from InvenSense or from Bosch, and of course a Bosch part doesn’t behave exactly like an InvenSense one). This is yet another advantage Apple has over the Android OEMs.

Tracking the future of technology

If VIO is what we can achieve today, how will tracking evolve, and will that make ARKit obsolete? Surprisingly, VIO will remain the best way to track over distances of up to a few hundred meters (for longer distances, VIO needs to be combined with GPS or landmark recognition for relocalization). The reason is this: even if other optical systems became as accurate as VIO, they would still need more GPU/camera power, which matters enormously for head-mounted displays. So a monocular-camera VIO system is the most accurate, lowest-power, lowest-cost solution.

Deep learning is having a real impact on tracking research. Deep-learning-based trackers currently show roughly 10% error, while the best VIO systems are down in the single digits, but the gap is shrinking, and deep learning has particular room to improve relocalization.

Depth cameras can improve a VIO system in several ways, including accurate ground-truth ranging and metric scale, and edge tracking for scenes where feature points are scarce. But they are power-hungry, so they are run at a low frame rate with VIO filling in between frames. They also don’t work well outdoors, because the background infrared from sunlight washes out the infrared the camera projects. Their range depends on how much power they draw, which on a phone means only a few meters. And they carry a high BOM cost, so OEMs will avoid putting them in high-volume phones.

Dual cameras or fisheye lenses help by seeing more of the scene and therefore capturing more optical features; a normal lens might see only a white wall where a fisheye also sees the patterned ceiling and the carpet. Tango and HoloLens both use this approach. Dual-camera or fisheye setups can also compute depth with less computation than VIO, but VIO delivers that depth information with a cheaper bill of materials (BOM) and lower power consumption.

Because the two cameras on a dual-camera phone (or even an HMD) are so close together, their useful range for depth calculation is very limited: cameras spaced a few centimeters apart are only accurate out to a depth of a few meters.
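
The reason follows directly from the stereo relation sketched earlier: differentiating depth = f * b / d with respect to disparity shows that, for a fixed baseline, depth error grows with the square of the distance. The numbers below are illustrative:

```swift
/// Depth uncertainty for rectified stereo. From depth = f * b / d, a
/// disparity error of e pixels produces a depth error of roughly
///     depthError = depth^2 / (f * b) * e
/// so for a fixed baseline the error grows with the square of the distance.
func stereoDepthError(depthMeters z: Double,
                      focalLengthPixels f: Double,
                      baselineMeters b: Double,
                      disparityErrorPixels e: Double = 1.0) -> Double {
    (z * z / (f * b)) * e
}

// A phone-sized 3 cm baseline with a ~1500 px focal length:
print(stereoDepthError(depthMeters: 1, focalLengthPixels: 1500, baselineMeters: 0.03)) // ~2 cm
print(stereoDepthError(depthMeters: 5, focalLengthPixels: 1500, baselineMeters: 0.03)) // ~0.56 m
```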

The bigger prize is supporting tracking over much larger areas, ideally several miles outdoors. At that point there is little difference between AR tracking and the tracking used in self-driving cars, except that AR systems use fewer sensors and less power. Eventually any device will track at large scale, and that will require cloud services, which is why Google recently announced its Tango-based Visual Positioning Service. We’ll see more moves like this in the coming months, and it’s why everyone cares so much about 3D maps right now.

The future of AR computer vision

6-DoF position tracking will be fully commoditized across devices within the next 12-18 months. So what problems still need to be solved?

The next one is 3D reconstruction (HoloLens calls it Spatial Mapping, Tango calls it Depth Perception): working out the shape and structure of the real objects in a scene. It is what lets virtual content hide behind the real world.

That occlusion is what leads people to talk about “mixed” reality as something distinct from augmented reality. Most AR demos today have no 3D reconstruction support, so their AR content is simply overlaid in front of everything in the real world.

3D reconstruction works by capturing a dense point cloud of the scene (today with a depth camera), converting it into a mesh, importing that “invisible” mesh into Unity along with its real-world coordinates, and then placing the mesh over the real scene as seen through the camera.

This means virtual content can interact with the real world. Note that ARKit’s 2D plane detection is the most basic version of this: without the ground plane as a reference, objects placed from Unity would simply “float” in space.
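
With SceneKit, the “invisible mesh” trick can be sketched like this: geometry that writes depth but no color hides virtual content behind it while letting the camera image show through. Using a detected plane as the occluder, as below, is just the simplest illustration; a full system would use the reconstructed mesh:

```swift
import ARKit
import SceneKit

/// Build an "occluder" node for a detected plane: it draws nothing visible,
/// but it still writes to the depth buffer, so any virtual object behind it
/// is hidden and the real-world camera image shows through instead.
func makeOccluderNode(for planeAnchor: ARPlaneAnchor) -> SCNNode {
    let plane = SCNPlane(width: CGFloat(planeAnchor.extent.x),
                         height: CGFloat(planeAnchor.extent.z))
    let material = SCNMaterial()
    material.colorBufferWriteMask = []   // draw no color...
    material.writesToDepthBuffer = true  // ...but still occlude
    plane.materials = [material]

    let node = SCNNode(geometry: plane)
    node.simdPosition = planeAnchor.center // relative to the anchor's node
    node.eulerAngles.x = -.pi / 2          // SCNPlane is vertical by default
    node.renderingOrder = -1               // render before regular virtual content
    return node
}
```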

In the photo, Magic Leap shows a robot hiding behind a table leg. We don’t know whether the table leg was reconstructed in real time or whether the scene was pre-modeled and the virtual robot simply placed behind the real leg.

The problems with depth cameras described earlier also apply to 3D reconstruction, which is why it isn’t widely available yet. Researchers are working on real-time 3D reconstruction from a single ordinary RGB camera, but that is at least 12-18 months away from production, which is why I think “true” consumer AR head-mounted displays are still a long way off.

Back in 2012, Dekko’s 3D reconstruction system ran on an iPad 2. We had to display the reconstruction mesh, otherwise users wouldn’t believe what they were seeing (that the tracking system really understood the real world). In the image, the toy SUV has just landed a jump and is partially hidden behind a tissue box.

Beyond 3D reconstruction, there is a lot of interesting research on semantically labeling 3D scenes. Almost all of the deep-learning computer vision you see today works on 2D images, but for AR (and cars, drones, etc.) we need to understand the world semantically in 3D.

The picture shows an example of semantic understanding of a 3D scene: at the bottom is the original image, in the middle a 3D model (probably built with a stereo camera or LIDAR), and at the top a deep-learning segmentation that lets us distinguish, say, the sidewalk from the road. That kind of understanding would be useful even for something like Pokemon Go.

Then, we need to figure out how to scale all the amazing technology to support multiple users in real time. That’s the ultimate goal.

And as 3D reconstructions cover larger and larger areas, we need to figure out how to host them in the cloud so that multiple users can share and extend the same model.

The future of AR other technologies

The rest of AR’s future is too broad a topic to cover fully, so here is a quick list of the technologies that still need work:

  • Optics: field of view, lens size, resolution, brightness, depth of focus, focus, etc.

Before we get to a final consumer product, we will see “transitional” HMD designs that are constrained by a few key metrics and try to solve just one problem well, whether social use, tracking technology, or enterprise use cases.

  • Rendering: blending virtual content convincingly with the real world.

That means identifying the real light sources and matching them in the virtual content so that shadows and textures look right. Hollywood SFX teams have been doing this for years, but for AR it has to happen in real time, on a phone, with no control over the lighting or the background, so even the seemingly trivial parts matter.

  • Input: There’s still a long way to go.

Research suggests that multimodal input systems work best (Apple is rumoured to be working on exactly that). Multimodal means combining several input “modes”: gestures, speech, computer vision, touch, eye tracking, and so on, with AI helping to interpret them and best understand the user’s intent.

  • Graphical user interfaces (GUIs) and applications: AR “apps” won’t look like the apps we imagine today.

We just want the Sonos controls to appear on the device itself, not to tap a Sonos button. Interaction will live in our field of view and in the real world around us, and nobody yet knows exactly how to render that, but it certainly isn’t a 4 x 6 grid of icons.

  • Social issues: only Apple and Snap know how to sell fashion, and AR HMDs may only sell if people actually want to be seen wearing them. This problem may be harder to solve than any of the technical ones.

Via Super Ventures Blog

