Video introduction: Announcing the Objectron dataset

State-of-the-art machine learning (ML) models have achieved remarkable accuracy on many computer vision tasks simply by training on photographs. Building on these successes, advancing 3D object understanding has great potential to power a wider range of applications, such as augmented reality, robotics, autonomy, and image retrieval. For example, earlier this year we released MediaPipe Objectron, a set of real-time 3D object detection models designed for mobile devices that were trained on a fully annotated, real-world 3D dataset to predict the 3D bounding boxes of objects.

However, compared to 2D tasks driven by datasets such as ImageNet, COCO, and Open Images, understanding objects in 3D remains challenging due to the lack of large real-world datasets. To empower the research community to keep advancing 3D object understanding, there is an urgent need for object-centric video datasets that capture the 3D structure of many more objects, while matching the data formats (i.e., video or camera streams) used in many vision tasks, to aid in training and benchmarking machine learning models.

Today, we’re excited to release the Objectron dataset, a collection of short, object-centric video clips capturing common objects from different angles. Each video clip is accompanied by AR session metadata, which includes camera poses and sparse point clouds. The data also contains manually annotated 3D bounding boxes for each object, which describe the object’s position, orientation, and dimensions. The dataset consists of 15K annotated video clips, supplemented by over 4 million annotated images collected from a geo-diverse sample covering 10 countries across five continents.

3D object detection solution

In addition to the dataset, we are sharing 3D object detection solutions for four categories of objects: shoes, chairs, cups, and cameras. The models are released in MediaPipe, Google’s open-source framework for cross-platform, customizable ML solutions for live and streaming media, which also powers on-device ML solutions such as real-time hand, iris, and body pose tracking.

These new releases use a two-stage architecture, in contrast to the single-stage Objectron model released earlier. The first stage uses a TensorFlow object detection model to find the 2D crop of the object. The second stage then uses the image crop to estimate the 3D bounding box, while simultaneously computing the 2D crop of the object for the next frame, so that the object detector does not need to run on every frame. The second-stage 3D bounding box predictor runs at 83 FPS on an Adreno 650 mobile GPU.
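The interleaving of the two stages can be sketched as a simple frame loop. The function names below (`detect_2d`, `estimate_3d_and_track`) and their return values are illustrative stubs, not the actual MediaPipe API; the point is that the expensive 2D detector runs only sparsely, while the lighter 3D lifting model runs on every frame and supplies the crop for the next one.

```python
def detect_2d(frame):
    # Stage-1 stand-in: a 2D object detector returning a crop (x, y, w, h).
    # In the real pipeline this is a TensorFlow object detection model.
    return (40, 30, 100, 120)

def estimate_3d_and_track(frame, crop):
    # Stage-2 stand-in: lift the 2D crop to a 3D bounding box and predict
    # the object's crop in the next frame, so the detector can be skipped.
    box_3d = {"crop": crop, "dims": (0.2, 0.1, 0.3)}  # placeholder values
    next_crop = crop  # in practice, shifted/rescaled from the model output
    return box_3d, next_crop

def run(frames, detector_interval=5):
    results, crop = [], None
    for i, frame in enumerate(frames):
        if crop is None or i % detector_interval == 0:
            crop = detect_2d(frame)  # heavy detector, run only sparsely
        box, crop = estimate_3d_and_track(frame, crop)  # runs every frame
        results.append(box)
    return results
```

With this scheduling, a 10-frame clip triggers the detector only twice (frames 0 and 5) while still producing a 3D box for every frame.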

Evaluation metrics for 3D object detection

To evaluate the performance of 3D object detection models against ground truth annotations, we use the 3D intersection over union (IoU) similarity metric, a common measure in computer vision tasks of how closely predicted bounding boxes match the ground truth.

We propose an algorithm for computing accurate 3D IoU values for general oriented 3D boxes. First, we compute the intersection points between the faces of the two boxes using the Sutherland-Hodgman polygon clipping algorithm. This is similar to frustum clipping, a technique used in computer graphics. The volume of the intersection is then computed from the convex hull of all the clipped polygons. Finally, the IoU is computed from the volumes of the intersection and the union of the two boxes. We will release the source code for the evaluation metrics along with the dataset.
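The clipping step at the heart of this algorithm is easiest to see in 2D. Below is a minimal, pure-Python Sutherland-Hodgman sketch (not the released metric code); the 3D version applies the same clipping face by face and then takes the volume of the convex hull of the clipped points instead of a polygon area.

```python
def clip_polygon(subject, clip):
    """Sutherland-Hodgman: clip polygon `subject` against convex polygon
    `clip`. Vertices are (x, y) tuples listed counter-clockwise."""
    def inside(p, a, b):
        # True if p lies on the interior (left) side of clip edge a->b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a-b.
        denom = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / denom
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        polygon, output = output, []
        for j in range(len(polygon)):
            p, q = polygon[j], polygon[(j + 1) % len(polygon)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
    return output

def polygon_area(poly):
    # Shoelace formula.
    return abs(sum(poly[i][0] * poly[(i + 1) % len(poly)][1]
                   - poly[(i + 1) % len(poly)][0] * poly[i][1]
                   for i in range(len(poly)))) / 2.0

def iou_2d(poly_a, poly_b):
    inter = polygon_area(clip_polygon(poly_a, poly_b))
    union = polygon_area(poly_a) + polygon_area(poly_b) - inter
    return inter / union if union else 0.0
```

For example, clipping a unit square against a copy of itself shifted by (0.5, 0.5) leaves an intersection of area 0.25 and a union of 1.75, giving an IoU of 1/7.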

Dataset format

Technical details of the Objectron dataset, including usage instructions and tutorials, are available on the dataset website. The dataset includes bicycles, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes, and is stored in an Objectron bucket on Google Cloud Storage, which contains the following assets:

  • The video sequences
  • Annotation labels (3D bounding boxes of the objects)
  • AR metadata (such as camera poses, point clouds, and planar surfaces)
  • Processed dataset: shuffled versions of the annotated frames, in tf.Example format for images and SequenceExample format for videos
  • Supporting scripts to run evaluation based on the metric described above
  • Supporting scripts to load the data into TensorFlow, PyTorch, and Jax and to visualize the dataset, including “Hello World” examples

Alongside the dataset, we are also open-sourcing a data pipeline to parse the dataset in popular frameworks: TensorFlow, PyTorch, and Jax. Example Colab notebooks are also provided.

By releasing this Objectron dataset, we hope to enable the research community to push the limits of 3D object geometry understanding. We also hope to foster new research and applications, such as view synthesis, improved 3D representations, and unsupervised learning.
