This is a translation of an article whose original content has since been deleted. Original address: medium.com/tensorflow/…

In partnership with the Google Creative Lab, I’m pleased to announce the release of the TensorFlow.js version of PoseNet, a machine learning model that allows for real-time human pose estimation in the browser. You can try the online demo at storage.googleapis.com/tfjs-models… .

PoseNet can detect people in images and videos using either a single-pose or multi-pose algorithm – all from within the browser.

So, what is pose estimation? Pose estimation refers to computer vision techniques that detect human figures in images and videos, so that one could determine, for example, where someone’s elbow shows up in an image. To be clear, this technology does not recognize who is in an image – there is no personally identifiable information associated with the detection. The algorithm simply estimates where key body joints are.

Okay, and why is this exciting to begin with? Pose estimation has many uses, from interactive installations that react to the body, to augmented reality, animation, fitness uses, and more. We hope the accessibility of this model inspires more developers and makers to experiment with applying pose detection to their own projects. Although many alternative pose detection systems have been open-sourced, they all require specialized hardware and/or cameras, as well as quite a bit of system setup. With PoseNet running on TensorFlow.js, anyone with a camera-equipped PC or phone can experience this technology right from within a web browser. And since we’ve open-sourced the model, JavaScript developers can use this technology with just a few lines of code. Perhaps more importantly, it can actually help protect user privacy: since PoseNet on TensorFlow.js runs in the browser, no pose data ever leaves the user’s computer.


Introduction to PoseNet

PoseNet can be used to estimate either a single pose or multiple poses, meaning there is one version of the algorithm that can detect only one person in an image/video, and another version that can detect multiple persons in an image/video. Why are there two versions? The single-person pose detector is faster and simpler, but requires that only one subject be present in the image (more on that later). Let’s start with the single-pose algorithm, as it is easier to use.

At a high level, pose estimation happens in two stages:

  1. An input RGB image is fed through a convolutional neural network.
  2. Either a single-pose or multi-pose decoding algorithm is used to decode poses, pose confidence scores, keypoint positions, and keypoint confidence scores from the model outputs.

Wait a second, what do all these keywords mean? Let’s review the most important ones:

  • Pose – at the highest level, PoseNet returns a pose object that contains a list of keypoints and an instance-level confidence score for each detected person. (A minimal sketch of this object’s structure follows this list.)

PoseNet returns confidence values for each person detected and for each pose keypoint detected. Image source: “Microsoft COCO: Common Objects in Context Dataset”, cocodataset.org.

  • Pose confidence score – this determines the overall confidence in the estimation of a pose. It ranges between 0.0 and 1.0, and can be used to hide poses that are not deemed strong enough.
  • Keypoint – a part of a person’s pose that is estimated, such as the nose, right ear, left knee, right foot, etc. It contains both a position and a keypoint confidence score. PoseNet currently detects the 17 keypoints illustrated in the following diagram:

The 17 pose keypoints detected by PoseNet.

  • Keypoint confidence score – this determines the confidence that an estimated keypoint position is accurate. It ranges between 0.0 and 1.0, and can be used to hide keypoints that are not deemed strong enough.
  • Keypoint position – the 2D x and y coordinates in the original input image where a keypoint has been detected.
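Putting these terms together, a pose can be pictured as a plain JavaScript object. The literal below is an illustrative sketch only – the numbers are made up, and the part labels assume the keypoint naming used by the released model:

const examplePose = {
  score: 0.98,                           // pose confidence score (0.0 to 1.0)
  keypoints: [
    {
      part: 'nose',                      // one of the 17 keypoint names
      position: { x: 301.4, y: 177.7 },  // x, y in input image space
      score: 0.99                        // keypoint confidence score (0.0 to 1.0)
    }
    // ... 16 more keypoints, one per detected body part
  ]
};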

Step 1: Import the TensorFlow.js and PoseNet libraries

A lot of work went into abstracting away the complexity of the model and encapsulating its functionality into easy-to-use methods. Let’s review the basics of how to set up a PoseNet project.

The library can be installed via NPM:

npm install @tensorflow-models/posenet

It can then be imported using ES6 modules:

import * as posenet from '@tensorflow-models/posenet';

const net = await posenet.load();
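Note that posenet.load() returns a promise, so the await above must run in an async context. If your toolchain does not support top-level await, a minimal wrapper like the following (illustrative only) does the job:

async function loadPoseNet() {
  // posenet.load() resolves once the model weights have been fetched
  const net = await posenet.load();
  return net;
}

loadPoseNet().then(function (net) {
  // the posenet model is ready to use here
});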

Or via a bundle included in the page:

<html>
  <body>
    <!-- Load TensorFlow.js -->
    <script src="https://unpkg.com/@tensorflow/tfjs"></script>
    <!-- Load PoseNet -->
    <script src="https://unpkg.com/@tensorflow-models/posenet"></script>
    <script type="text/javascript">
      posenet.load().then(function(net) {
        // the posenet model has loaded
      });
    </script>
  </body>
</html>
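With this script-tag approach there is no build step: once both scripts have loaded, the bundle exposes a global posenet object, which is why posenet.load() can be called directly in the inline script.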

Step 2A: Single-person pose estimation

Example of the single-person pose estimation algorithm applied to an image. Image source: “Microsoft COCO: Common Objects in Context Dataset”, cocodataset.org.

As mentioned earlier, the single-pose estimation algorithm is the simpler and faster of the two. Its ideal use case is when there is only one person centered in the input image or video. The disadvantage is that if there are multiple persons in an image, keypoints from two people may be estimated as being part of the same single pose – for example, person #1’s left arm and person #2’s right knee might be determined by the algorithm to belong to the same pose. If the input image is likely to contain multiple persons, the multi-pose estimation algorithm should be used instead.

Let’s look at the input of the single pose estimation algorithm:

  • Input image element – an HTML element containing the image to predict on, such as a video or img tag. Importantly, the image or video element fed in should be square.
  • Image scale factor – a number between 0.2 and 1, defaulting to 0.50. What to scale the image by before feeding it through the network. Set this number lower to scale the image down and increase speed at the cost of accuracy.
  • Flip horizontal – defaults to false. Whether the poses should be flipped/mirrored horizontally. This should be set to true for videos whose source is flipped horizontally by default (e.g. a webcam), so that the poses are returned in the proper orientation.
  • Output stride – must be 32, 16, or 8, defaulting to 16. Internally, this parameter affects the height and width of the layers in the neural network. At a high level, it affects the accuracy and speed of pose estimation: a lower output stride gives higher accuracy but slower speed, while a higher value gives faster speed but lower accuracy. The best way to see how the output stride affects output quality is to play with the single-pose estimation demo: storage.googleapis.com/tfjs-models… .

Now let’s look at the output of the single pose estimation algorithm:

  • A pose, containing both a pose confidence score and an array of 17 keypoints.
  • Each keypoint, containing a keypoint position and a keypoint confidence score. Again, all keypoint positions have x and y coordinates in the input image space and can be mapped directly onto the image.

The following short code block shows how to use the single-pose estimation algorithm:

const imageScaleFactor = 0.50;
const flipHorizontal = false;
const outputStride = 16;

const imageElement = document.getElementById('cat');

// load the posenet model
const net = await posenet.load();

const pose = await net.estimateSinglePose(imageElement, imageScaleFactor, flipHorizontal, outputStride);

An example output pose looks like this:

{"score": 0.32371445304906, "keypoints": [{// nose "position": {"x": 301.42237830162, "y": 177.69162777066}, "score": 0.99799561500549}, {// left eye "position": {"x": 326.05302262306, "y": 122.9596464932}, "score": 0.99766051769257}, {"x": 258.72196650505, "y": 127.51624706388}, "score": 0.99926537275314},... }Copy the code

Step 2B: Multi-person pose estimation

An example of the multi-person pose estimation algorithm applied to an image. Image source: “Microsoft COCO: Common Objects in Context Dataset”, cocodataset.org.

The multi-person pose estimation algorithm can estimate many poses/persons in an image. It is more complex and slightly slower than the single-pose algorithm, but it has the advantage that if multiple people appear in a picture, their detected keypoints are less likely to be associated with the wrong pose. For that reason, even if the use case is to detect a single person’s pose, this algorithm may be the more desirable one.

Moreover, an attractive property of this algorithm is that its performance is not affected by the number of persons in the input image. Whether there are 15 persons to detect or 5, the computation time is the same.

Let’s look at the input:

  • Input image element – same as single pose estimation
  • Image scale factor – same as single pose estimation
  • Horizontal flip – same as single pose estimation
  • Output stride – same as single pose estimation
  • Maximum pose detections – an integer, defaulting to 5, for the maximum number of poses to detect.
  • Pose confidence score threshold – 0.0 to 1.0, defaulting to 0.5. At a high level, this controls the minimum confidence score of poses that are returned.
  • Non-maximum suppression (NMS) radius – a number in pixels. At a high level, this controls the minimum distance between poses that are returned. The default is 20, which is probably fine for almost all cases. It should be increased/decreased as a way to filter out less accurate poses, but only when tweaking the pose confidence score is not good enough.

The best way to see the effect of these parameters is to play with the multi-pose estimation demo: storage.googleapis.com/tfjs-models… .

Let’s look at the output:

  • An array of poses.
  • Each pose contains the same information as described for the single-person estimation algorithm.

The following simple code block shows how to use the multi-pose estimation algorithm:

const imageScaleFactor = 0.50;
const flipHorizontal = false;
const outputStride = 16;

// get up to 5 poses
const maxPoseDetections = 5;

// minimum confidence of the root part of a pose
const scoreThreshold = 0.5;

// minimum distance in pixels between the root parts of poses
const nmsRadius = 20;

const imageElement = document.getElementById('cat');

// load posenet
const net = await posenet.load();

const poses = await net.estimateMultiplePoses(
  imageElement, imageScaleFactor, flipHorizontal, outputStride,
  maxPoseDetections, scoreThreshold, nmsRadius);

An example of the pose array output is shown below:

// array of poses/persons
[
  { // pose #1
    "score": 0.42985695206067,
    "keypoints": [
      { // nose
        "position": {"x": ..., "y": ...},
        "score": ...
      },
      ...
    ]
  },
  { // pose #2
    "score": 0.13461434583673,
    "keypoints": [
      { // nose
        "position": {"x": 116.58444058895, "y": 99.772533416748},
        "score": ...
      },
      ...
    ]
  },
  ...
]
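To make the output concrete, here is a hedged sketch of rendering every detected keypoint onto a canvas overlay. It assumes a canvas element with id "output" positioned over the input image, and that poses is the array returned by estimateMultiplePoses above; the element id and threshold value are illustrative choices, not part of the library:

const canvas = document.getElementById('output');
const ctx = canvas.getContext('2d');

function drawPoses(poses, minPartConfidence) {
  ctx.clearRect(0, 0, canvas.width, canvas.height);
  poses.forEach(function (pose) {
    pose.keypoints.forEach(function (keypoint) {
      if (keypoint.score < minPartConfidence) return; // skip weak keypoints
      ctx.beginPath();
      ctx.arc(keypoint.position.x, keypoint.position.y, 3, 0, 2 * Math.PI);
      ctx.fillStyle = 'aqua';
      ctx.fill();
    });
  });
}

// e.g. drawPoses(poses, 0.5) after the estimateMultiplePoses call above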

At this point, you have enough knowledge to understand the PoseNet examples. If you want to learn more about the technical details of the model and its implementation, please read the appendix of the original article: medium.com/tensorflow/… .