MediaPipe Holistic – Simultaneous face, hand and pose prediction, on device

Real-time, simultaneous perception of human pose, face landmarks and hand tracking on mobile devices can enable a variety of impactful applications, such as fitness and sport analysis, gesture control and sign language recognition, and augmented reality effects. MediaPipe, an open-source framework designed for complex perception pipelines that leverage accelerated inference (e.g., GPU or CPU), already offers fast, accurate, yet separate solutions for these tasks. Combining them all in real time into a semantically consistent end-to-end solution is a unique challenge that requires simultaneous inference of multiple, dependent neural networks.

Today, we are pleased to announce MediaPipe Holistic, a solution to this challenge that provides a novel, state-of-the-art human body topology and unlocks new use cases. MediaPipe Holistic consists of a new pipeline with optimized pose, face and hand components that each run in real time, with minimal memory transfer between their inference backends, plus support for interchangeability of the three components depending on the quality/speed tradeoff. When all three components are included, MediaPipe Holistic provides a unified topology of more than 540 keypoints (33 pose, 21 per-hand and 468 facial landmarks, 543 in total) and achieves near-real-time performance on mobile devices. MediaPipe Holistic is being released as part of MediaPipe and is available on-device for mobile (Android, iOS) and desktop. We are also introducing MediaPipe's new ready-to-use APIs for research (Python) and the web (JavaScript) to ease access to the technology.

Pipeline and quality

The MediaPipe Holistic pipeline integrates separate models for the pose, face and hand components, each of which is optimized for its particular domain. However, because of their different specializations, the input to one component is not suitable for the others. The pose estimation model, for example, takes a low, fixed-resolution video frame (256×256) as input. But if the hand and face regions were cropped from that image to be passed to their respective models, the image resolution would be too low for accurate articulation. Therefore, we designed MediaPipe Holistic as a multi-stage pipeline that treats the different regions using a region-appropriate image resolution.

First, MediaPipe Holistic estimates the human pose with BlazePose's pose detector and subsequent keypoint model. Then, using the inferred pose keypoints, it derives three regions of interest (ROI) crops for each hand (2x) and the face, and employs a re-crop model to improve the ROIs (details below). The pipeline then crops the full-resolution input frame to these ROIs and applies task-specific face and hand models to estimate their corresponding keypoints. Finally, all keypoints are merged with those of the pose model to yield the full set of more than 540 keypoints.
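
To make the data flow concrete, here is a rough Python-style sketch of these stages. All of the helper functions (run_pose_model, derive_rois, refine_roi, crop, run_hand_model, run_face_model, merge_keypoints) are hypothetical placeholders introduced purely for illustration; the real pipeline is built from MediaPipe calculators rather than these calls.

```python
import numpy as np

def holistic_inference(frame: np.ndarray):
    """Structural sketch of the MediaPipe Holistic pipeline (hypothetical helpers)."""
    # 1. Pose runs on a small, fixed-resolution copy of the frame (256x256).
    pose_keypoints = run_pose_model(resize(frame, (256, 256)))

    # 2. Derive coarse ROIs for both hands and the face from the pose keypoints,
    #    then refine each one with a lightweight re-crop model (see below).
    rois = {name: refine_roi(frame, roi)
            for name, roi in derive_rois(pose_keypoints).items()}

    # 3. Crop the *full-resolution* frame to each refined ROI so the hand and
    #    face models receive enough detail, then run the task-specific models.
    left_hand = run_hand_model(crop(frame, rois["left_hand"]))
    right_hand = run_hand_model(crop(frame, rois["right_hand"]))
    face = run_face_model(crop(frame, rois["face"]))

    # 4. Merge everything into the unified 540+ keypoint topology.
    return merge_keypoints(pose_keypoints, left_hand, right_hand, face)
```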

To identify the ROIs, the pipeline uses a tracking approach similar to the one employed by the standalone face and hand pipelines. It assumes that the object does not move significantly between frames and uses the estimate from the previous frame as a guide to the object's region in the current frame. During fast movements, however, the tracker can lose the target, requiring the detector to re-localize it in the image. MediaPipe Holistic uses pose prediction (on every frame) as an additional ROI prior, which reduces the pipeline's response time when reacting to fast movements. It also allows the model to stay semantically consistent across the body and its parts, preventing a mix-up between the left and right hands, or between body parts of one person in the frame and those of another.
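
As a simplified illustration of how this prior can be combined with tracking (not the actual calculator logic, which operates on rotated ROIs and confidence scores), consider:

```python
def select_hand_roi(tracked_roi, pose_roi, tracking_ok: bool):
    """Prefer the ROI tracked from the previous frame while tracking holds;
    otherwise fall back to the ROI derived from the current pose prediction,
    which is available on every frame. Purely illustrative."""
    if tracking_ok and tracked_roi is not None:
        return tracked_roi
    return pose_roi
```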

In addition, the resolution of the pose model's input frame is low enough that the resulting ROIs for the face and hands are still too inaccurate to guide the re-cropping of those regions, which must be precise in order to keep the subsequent models lightweight. To close this accuracy gap, we use lightweight face and hand re-crop models that act as spatial transformers and cost only about 10% of the corresponding model's inference time.

Hand and face prediction quality: mean error per hand (MEH), normalized by hand size, and face landmark error (FLE), normalized by the inter-pupillary distance.

Pipeline                        MEH      FLE
Tracking pipeline (baseline)    9.8%     3.1%
Pipeline without re-crops       11.8%    3.5%
Pipeline with re-crops          9.7%     3.1%

Performance

MediaPipe Holistic requires coordinating up to eight models per frame: one pose detector, one pose landmark model, three re-crop models and three keypoint models for the hands and face. In building this solution, we optimized not only the machine learning models but also the pre- and post-processing algorithms (such as affine transformations), which, due to the complexity of the pipeline, take significant time on most devices. Here, moving all pre-processing computations to the GPU yielded a total pipeline speedup of roughly 1.5x, depending on the device. As a result, MediaPipe Holistic runs at near-real-time performance even on mid-tier devices and in the browser.
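
As an illustration of this kind of pre-processing, the snippet below extracts a rotated, scaled ROI from a full-resolution frame with a single affine warp. It uses OpenCV on the CPU purely for readability; in MediaPipe these transforms are implemented as calculators that can run on the GPU, and the function and parameter names here are illustrative.

```python
import cv2
import numpy as np

def crop_roi(frame: np.ndarray, center_xy, size_px, angle_deg, out_size=256):
    """Extract a rotated, square ROI from a full-resolution frame in one affine warp.
    center_xy: ROI center in pixels; size_px: ROI edge length in pixels;
    angle_deg: ROI rotation in degrees. Names and defaults are illustrative."""
    cx, cy = float(center_xy[0]), float(center_xy[1])
    # Rotate and scale about the ROI center so that size_px maps to out_size.
    matrix = cv2.getRotationMatrix2D((cx, cy), angle_deg, out_size / float(size_px))
    # Then translate so the ROI center lands at the center of the output image.
    matrix[0, 2] += out_size / 2.0 - cx
    matrix[1, 2] += out_size / 2.0 - cy
    return cv2.warpAffine(frame, matrix, (out_size, out_size))
```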

Performance on mid-tier devices, measured in frames per second (FPS).

Device                          FPS
Google Pixel 2 XL               18
Samsung S9+                     20
15-inch MacBook Pro 2017        15

The multi-stage nature of the pipeline provides two more performance benefits. Because the models are mostly independent, they can be replaced with lighter or heavier versions (or turned off entirely) depending on the performance and accuracy requirements. Also, once the pose is inferred, it is known precisely whether the hands and face are within the frame bounds, allowing the pipeline to skip inference on those body parts.
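
A minimal sketch of that second point, assuming the pose-derived hand ROI is an axis-aligned (x_min, y_min, x_max, y_max) box in pixel coordinates (the real ROIs are rotated rectangles, so this is a simplification):

```python
def should_run_hand_model(hand_roi, frame_width: int, frame_height: int) -> bool:
    """Run the comparatively expensive hand model only if the pose-derived
    hand ROI overlaps the frame at all."""
    x_min, y_min, x_max, y_max = hand_roi
    return x_max > 0 and y_max > 0 and x_min < frame_width and y_min < frame_height
```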

Applications

With its more than 540 keypoints, MediaPipe Holistic aims to enable holistic, simultaneous perception of body language, gestures and facial expressions. Its blended approach enables remote gesture interfaces, as well as full-body AR, sports analytics and sign language recognition. To demonstrate the quality and performance of MediaPipe Holistic, we built a simple remote-control interface that runs locally in the browser and enables a compelling user interaction without a mouse or keyboard. The user can manipulate objects on the screen, type on a virtual keyboard while sitting on the sofa, and point to or touch specific face regions (for example, to mute or turn off the camera). Under the hood, it relies on accurate hand detection, with subsequent gesture recognition mapped to a "trackpad" space anchored to the user's shoulder, enabling remote control from up to 4 meters away.
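
A minimal sketch of that mapping, assuming normalized [0, 1] image coordinates for the landmarks (as MediaPipe returns them) and an illustrative trackpad size; the demo's actual parameters and gesture logic are more involved:

```python
def hand_to_trackpad(hand_xy, shoulder_xy, pad_size=0.4):
    """Map a hand landmark into a virtual trackpad square anchored at the shoulder.
    hand_xy / shoulder_xy: (x, y) in normalized image coordinates;
    pad_size: trackpad edge length in the same units (illustrative value).
    Returns (u, v) in [0, 1], e.g. to be scaled to the screen resolution."""
    u = min(max((hand_xy[0] - shoulder_xy[0]) / pad_size, 0.0), 1.0)
    v = min(max((hand_xy[1] - shoulder_xy[1]) / pad_size, 0.0), 1.0)
    return u, v
```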

This gesture control technology can unlock a variety of novel use cases when other forms of human-computer interaction are inconvenient. Give it a try in our web demo and use it to prototype your own ideas.

MediaPipe for research and the Web

To accelerate ML research as well as its adoption in the web developer community, MediaPipe now offers ready-to-use yet customizable ML solutions in Python and JavaScript. We are starting with those from previous publications: Face Mesh, Hands and Pose, including MediaPipe Holistic, with many more to come. Try them directly in the web browser: for Python, in notebooks using MediaPipe on Google Colab, and for JavaScript, with your own webcam in MediaPipe on CodePen!
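
As a quick start, the sketch below runs Holistic on webcam frames with the mediapipe Python package (pip install mediapipe opencv-python). It follows the released Python solutions API (mp.solutions.holistic.Holistic, results.pose_landmarks, etc.); the drawing constants for the face mesh differ between package versions, so face drawing is only noted in a comment.

```python
import cv2
import mediapipe as mp

mp_holistic = mp.solutions.holistic
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR frames.
        results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # Pose and per-hand landmarks are exposed as separate fields.
        mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                  mp_holistic.POSE_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.left_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        mp_drawing.draw_landmarks(frame, results.right_hand_landmarks,
                                  mp_holistic.HAND_CONNECTIONS)
        # results.face_landmarks holds the 468 face points; the connection
        # constant used to draw them varies across mediapipe versions.
        cv2.imshow('MediaPipe Holistic', frame)
        if cv2.waitKey(5) & 0xFF == 27:  # press Esc to quit
            break
cap.release()
```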

Conclusion

We hope that the release of MediaPipe Holistic will inspire members of the R&D community to build new and unique applications. We anticipate that these pipelines will open the way for future research into challenging areas such as sign language recognition, contactless control interfaces, or other complex use cases. We look forward to seeing what you can build with it!
