This article is compiled from an online sharing session hosted by LiveVideoStack and given by Zhang Yuan, director of the Machine Vision Standards and Application Research Department, Institute of New Technology, China Telecom Research Institute. She introduced in detail the latest progress and technical innovations in machine vision coding standardization at VCM, DCM and other standards organizations.

LiveVideoStack

Hi LVS friends, thank you for spending your evening listening to me share the latest developments in machine vision coding standards and technologies. I spoke on a similar topic at the LVS Shanghai event in April this year. Based on the feedback collected there, I will go over some of the main content again today, and I will also share the latest progress from the VCM meeting held in April. We welcome more exchanges and participation in our work in the future.

Let me give you a little background. First of all, communication between things has now gradually surpassed and is replacing communication between people. The goal of machine vision coding is to give machines a human-like ability to perceive visual signals, serving the whole machine system as a machine brain in place of the human brain. According to statistics, vision accounts for 87% of all sensory data intake. Many branches of neural network research, including brain-machine research, started from vision. At the machine level, vision is also the most important source of information, so we regard it as the first target to conquer.

With the development of 5G, artificial intelligence, deep learning and machine learning, all kinds of data sources, including text, images, sound, animation and video, now have neural network processing pipelines: CV, NLP, speech processing, and processing based on data mining, big data analysis, planning and decision-making. A whole series of intelligent tasks, namely recognition, detection, classification, tracking and so on, can be realized by neural networks. Traditionally, video is simply captured and compressed. Now, with neural networks added, it is very natural to combine the previously independent video technology with vertical industry applications, including industry, roads and autonomous driving. To support multimedia intelligent-analysis tasks under this new situation, new technologies and applications oriented to machine vision are needed.

Here are some simple statistics. First, video accounts for about 80 percent of all web traffic, and half or more of that is consumed by machine or algorithmic analysis. Video sites remain a big source of viewing, but video now flows in two directions: human entertainment and intelligent analysis algorithms for machines. The machine vision market is also growing rapidly and is now a key area of investment. The main applications in the well-known 5G triangle and 5.5G hexagon are closely related to machine vision, including smart city, smart home, smart building, autonomous driving, industrial automation and so on. Machine vision is now widely used across fields and will be an important source of 5.5G and 6G traffic in the future. In terms of maturity, share of investment and financing, and share of practical applications, CV algorithms are now the largest part of AI.

Against this background, it is clearly a general trend that machine vision algorithms are replacing manual processing. In the past, monitoring scenarios such as road violations, driving, and inspection were mainly handled by humans; machines are increasingly taking over. Electronic traffic cameras are widely deployed, automotive vision supports L2/L3 autonomous driving, and machine-based quality inspection and fault detection algorithms are very common.

Today's pipeline is still built around the human eye: an image is captured with a CMOS or CCD sensor, reconstructed into pixels, and then a machine algorithm analyzes the pixels that were prepared for people to view. Humans need to see the individual pixels, but what the machine needs are the feature values inside. For example, to decide a state such as faulty or not, the machine only needs the relevant features; the pixel-level picture we prepare for people is redundant for the machine's storage and transmission. Our hypothesis, at the logical level, is that a feature map alone can be shown to the machine, omitting the color and saturation information that people need. This meets the visual requirements of machine-to-machine communication while reducing bandwidth and latency.

What makes this idea possible is that machine vision is fundamentally different from human vision. Here are some examples. On most basic dimensions, machine vision is better than human vision; this is easy to understand for gray-level resolution, spatial resolution, color resolution, speed, accuracy, and so on. Speed is the main one: a human can only perceive 25 or 30 frames per second, whereas a machine can capture 1,000 or 2,000 frames per second. But the human brain is very good at intelligence: in V1, V2, V3 and other brain structures it automatically compresses, analyzes and synthesizes the received photoelectric signals at different processing stages. Machines still have a long way to go in robustness: a change in viewing angle, say, or in weather conditions can cause misidentification. To summarize, human eyes and machine algorithms look at different things. People look at edges and details and demand fidelity from large-scale video. The machine cares about completing the task and needs semantic information for detection and recognition; there are also latency requirements. The machine needs semantic information, not what human eyes see. From these differences we can conclude that traditional video coding, which aims at distortion-free or near-perfect reconstruction of the video, is actually not suitable for machines.

Here is a brief introduction to video coding. The main reason we do video coding is bandwidth. Compression basically removes temporal and spatial redundancy, or redundancy with respect to the characteristics of the human eye. In the machine domain, however, the dimensions of redundancy increase: if machine tasks such as detection, segmentation, recognition and tracking can be completed without degrading performance, the compression is lossless in a task sense, which differs from the traditional definition of lossless video compression. Beyond the extra dimensions, we also have more means of compression.

The figure shows the current curve of data volume on the network, together with a comparison. The compression rate of video codecs increases linearly, while the data volume increases exponentially: there is a huge gap between the two, and with today's pace of intelligence and digitization the gap will widen. Our demand for video compression algorithms and performance therefore remains high and will not decrease with the build-out of fiber optics or 5G.

The main standards body for audio and video compression is MPEG, which was established 32 years ago and has undergone a series of reforms. The two working groups under ISO/IEC JTC1 SC29 were JPEG and MPEG, founded by a pair of fellow students from Waseda University, and were reorganized last year. JPEG remains a single group, while MPEG was split into seven groups covering requirements, systems, video, JVET, audio, 3D graphics, and genomic coding. The work continues. The standards group has great industrial influence and industrialization capacity, so it has attracted many participants and produced a series of high-value standards such as AVC and HEVC. There are now also some competing standards organizations, such as AOM and MPAI, which itself shows how large the market value is.

As for machine vision coding (VCM), it has been planned and promoted since 2014, underwent a series of changes along the way, and finally succeeded in 2019. The logic at the time was that we had many surveillance and industrial scenarios requiring large amounts of video-based analysis, which differ greatly from broadcast scenarios in how information is processed, coded and used; the coding efficiency of H.264 and H.265 was in fact insufficient. Driven by industry needs, academic research was also done at the time, and it concluded that there was a lot of room for machine-specific codecs. Supporting evidence included coding that combines object attributes with intent-aware coding, and approaches that perform tasks directly in the compressed domain while still allowing reconstruction; based on the academic research and industry needs, the direction was considered feasible. We went through a series of promotions, starting with a proposal for intelligent security scenarios. Because the background in a security scene is basically fixed with few foreground changes, compression coding specialized for security scenes might raise efficiency by more than 90%. MPEG accepted it at the time, but the matter was not pushed forward. In hindsight, AI technology was still limited and the 5G network environment was not yet in place, so although the proposal was accepted, the work could not be completed. We raised it again in 2017, proposing coding based on video content and semantic features, but it was not accepted because it was considered an application-layer matter. Finally in 2019, in collaboration with the US company GTI (a chip unicorn), we proposed the concept and framework of compression coding for machine vision and human-machine hybrid vision, which won wide recognition, and an expert group on machine vision coding was established.

At the MPEG meeting in July 2019, we proposed studying the next-generation video codec standard. The logic was: H.266 was almost done, so which direction next? Our judgment at the time was that one path, toward 4K, 8K, immersive video and the continued pursuit of compression and higher resolution, serves human vision. The other path is machine-oriented: not necessarily restoring the video, but keeping the task lossless, or accepting a tolerable reduction in task performance. It also includes human-machine hybrid vision, which refers to scenarios that need verification, forensics or reconstruction, but with different quality requirements, different priorities, or extra bitstreams. Combined with AI deep learning for feature extraction and coding, this is expected to become the main traffic source of 5G and 6G in the future. There was a lot of support from industry, as well as from Leonardo, then chairman of MPEG, and the chairs of the requirements group and the video group.

The definition of machine vision coding is to study compression coding technology aimed at intelligent applications: ensure the compression rate, to save bandwidth, and the intelligent task, to keep the task lossless. The goal is to define a bitstream of compressed video, or of features extracted from video, that can serve a variety of machine tasks while ensuring high compression efficiency and strong task performance. Although the name is machine vision coding, it actually serves two kinds of scenarios: machine vision and human-machine hybrid vision. Specifically, the global industry has identified six scenarios, among which video surveillance and smart city are closely related, along with smart transportation, smart industry, intelligent production and broadcasting, and consumer electronics. Intelligent production and broadcasting is special because people still watch the video, but content review in China and video classification abroad may require performing the task without fully decoding. The concrete approach targets these six scenarios; beyond them, more than ten further scenarios remain to be refined and are not mainstream for the time being.

All tasks are analyzed and decomposed according to the six scenarios, and the figure lists them all. The most important are detection and segmentation, plus the human-machine hybrid tasks, including video reconstruction and enhancement. Examples include event identification and anomaly detection for driving scenes, crowd density analysis, measurement in industrial scenes, masking for privacy, and multi-view video. What we do is list all the tasks, compress and encode data on the dataset given for each task, and then compare against the same algorithm running on the existing best baseline, VVC, to see whether machine vision coding beats the existing practice of doing everything with video. Beyond the six scenarios there are still more to be refined, including smart glasses, unmanned shops, robots, smart agriculture, smart fisheries, smart animal husbandry, AR/VR, games, and so on.

The VCM group has held seven official meetings since its establishment, with high participation from industry and academia; companies in various fields, including chips, algorithms and applications, are involved. Early on, the work focused on defining the requirements and architecture of the application scenarios, then gradually turned to anchors and which metrics to use. On the technical side, the first proposal was an extension of CDVA. CDVA is the first compression standard for features, but a limited form of intelligence that can only handle certain tasks, not all machine tasks. Experts then gradually put forward methods such as feature extraction, object coding, and end-to-end neural network video compression, and multiple technical routes are now taking shape.

The diagram shows the overall VCM architecture. It is very generic: video or features come in, enter the VCM encoder, get compressed and encoded, are transmitted or stored, and reach the decoder; the decoded video or features serve the machine intelligence task and, optionally, reconstruct the video for human viewing. Below is an example incorporating existing technical solutions. Again the input is video or features: one path goes through feature coding, the other through video coding (object coding, for example, falls under video coding). On the feature coding route, features are extracted first, and then there are two options: one compresses the features by their statistics, the other transforms the features into images and then compresses the images; the two can be combined for optimal performance. The decoder side is the reverse process, serving machine vision and human-machine hybrid vision.

Overall, VCM is expected to improve performance in several ways. The first is lightweight compressed features that reduce bandwidth, with different solutions for single-task, multi-task and unknown-task settings. The second, not the main one, is higher coding efficiency than the existing VVC/H.266. The third is saving computational load: because part of the end-to-end AI algorithm moves forward to the acquisition side and is completed at the encoder, we can allocate front-end, cloud or terminal computing resources to fit the actual application. For example, when one back end serves tens of thousands of acquisition terminals, moving computation forward saves load. The fourth, discovered gradually along the way, is privacy protection. Images and video involving faces or personal identity are very sensitive, both domestically and internationally. If data is processed as features, and the face or video cannot be reconstructed, or the feature extraction method is not disclosed, the original cannot be recovered, yet the task can still be performed. Privacy concerns can also be avoided when the final results are statistics not targeted at individuals.

Concretely, we use a number of datasets, some public and some collected and contributed by ourselves. Based on these datasets, anchors are first built to establish the best available performance, and comparisons are then made against them. The principle is that datasets should preferably be uncompressed, downloadable, and clearly licensed.
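To make the two-path architecture concrete, here is a minimal runnable sketch. The codecs are deliberate stand-ins (zlib and block mean-pooling); a real system would use VVC/HEVC or a learned codec, and all names here are my own illustrations, not APIs from the standard.

```python
# Minimal runnable sketch of the two-path VCM pipeline described above.
# zlib and mean-pooling stand in for the real codecs and feature extractor.
import zlib
import numpy as np

def extract_features(frame, block=8):
    """Toy 'feature': per-block means, a low-resolution semantic surrogate."""
    h, w = frame.shape
    return frame[:h - h % block, :w - w % block] \
        .reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def encode(frame, mode="feature"):
    data = frame if mode == "video" else extract_features(frame)
    payload = zlib.compress(data.astype(np.float32).tobytes())
    return {"payload": payload, "shape": data.shape, "kind": mode}

def decode(stream, reconstruct_for_human=False):
    arr = np.frombuffer(zlib.decompress(stream["payload"]), np.float32) \
            .reshape(stream["shape"])
    task_input = extract_features(arr) if stream["kind"] == "video" else arr
    result = float(task_input.mean())  # placeholder for detection/segmentation
    video = arr if (reconstruct_for_human and stream["kind"] == "video") else None
    return result, video

frame = np.random.default_rng(0).random((64, 64))
fs, vs = encode(frame, "feature"), encode(frame, "video")
print(len(fs["payload"]), "bytes (feature) vs", len(vs["payload"]), "bytes (video)")
```

The point of the sketch is only the branching: the feature path ships far fewer bytes but can no longer reconstruct the picture, which is exactly the bandwidth and privacy trade-off described above.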

We also contributed a lot in the early stage, namely anchor generation: compress and encode with the best available method, H.266/VVC, then run the machine task, and evaluate performance per task in terms of compression ratio and task accuracy. The following examples include a possible VCM technical architecture. The one above is pure feature compression; another method is video plus features. Note that we split the machine analysis into two parts, with some functions moved to the front end.

Anchor generation targets the three most common intelligent tasks: detection, segmentation and tracking. Using the given datasets and neural networks, we preprocess images with FFmpeg and compress them with VTM 8.2, the VVC reference software (kept up to date as VTM evolves). Four resolutions and six QP values were chosen to generate curves of compression ratio versus task performance. Below is a comparison of the two performance evaluation frameworks. One is pure machine vision: the lower route is the existing method, where the bitrate is measured on the transmission side and the performance on the task side, producing an RD curve; the upper route is the VCM codec, i.e. each proponent's algorithm, compared bitstream-to-performance. The other framework, for human-machine hybrid vision, adds PSNR to the comparison.
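A hedged sketch of what one anchor point in this pipeline might look like: scale with FFmpeg, encode with the VTM reference encoder, then score the reconstruction with the task network. The resolutions, QP list, file paths, cfg name and `evaluate_detection` hook are placeholders for your own setup, and the option spellings follow the usual VTM/HM convention, so verify them against your VTM build.

```python
# Hedged sketch of the anchor pipeline: scale with FFmpeg, encode with VTM,
# run the task network on the reconstruction, and log (bits, accuracy).
import subprocess, os

RESOLUTIONS = [(1920, 1080), (1280, 720), (960, 540), (640, 360)]  # example set
QPS = [22, 27, 32, 37, 42, 47]                                     # six QP points

def run(cmd):
    subprocess.run(cmd, check=True)

def evaluate_detection(recon_yuv, w, h):
    return 0.0  # stub: decode the YUV, run e.g. Faster R-CNN, score mAP vs ground truth

def anchor_point(src_png, w, h, qp, workdir="anchor"):
    os.makedirs(workdir, exist_ok=True)
    yuv = f"{workdir}/{w}x{h}.yuv"
    bit = f"{workdir}/{w}x{h}_qp{qp}.bin"
    rec = f"{workdir}/{w}x{h}_qp{qp}_rec.yuv"
    # 1) preprocess: scale and convert to YUV 4:2:0 with FFmpeg
    run(["ffmpeg", "-y", "-i", src_png,
         "-vf", f"scale={w}:{h}", "-pix_fmt", "yuv420p", yuv])
    # 2) compress one frame with the VTM reference encoder (all-intra cfg)
    run(["EncoderApp", "-c", "encoder_intra_vtm.cfg", "-i", yuv,
         "-wdt", str(w), "-hgt", str(h), "-f", "1", "-fr", "1",
         "-q", str(qp), "-b", bit, "-o", rec])
    bits = os.path.getsize(bit) * 8
    return bits, evaluate_detection(rec, w, h)

# rd_curve = [anchor_point("img.png", w, h, qp) for (w, h) in RESOLUTIONS for qp in QPS]
```

Sweeping the resolutions and QPs as in the commented line yields exactly the (bitrate, accuracy) points from which the anchor RD curves are drawn.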

The figure shows an anchor example built with the COCO dataset for object detection using Faster R-CNN. The horizontal line is the performance on uncompressed data, and the other curves are anchors generated at different resolutions and QP values. VCM proposals are compared against this to judge whether the performance gain can justify a new standard and generate corresponding commercial value. The picture on the right is the anchor-generation pipeline; for the sake of standardization, the whole group follows the same procedure for evaluation.
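Comparing a proposal's curve against the anchor curve is usually summarized with a Bjøntegaard-style delta. Here is a sketch of that computation; the input numbers are illustrative, not published results.

```python
# Hedged sketch of Bjontegaard-style comparison between a VCM codec and the
# anchor: fit cubics through (task score, log bitrate) points and integrate
# the gap, giving the average bitrate change at equal task score.
import numpy as np

def bd_rate(rate_anchor, score_anchor, rate_test, score_test):
    """Average bitrate difference (%) at equal task score, BD-rate style."""
    la, lt = np.log10(rate_anchor), np.log10(rate_test)
    pa = np.polyfit(score_anchor, la, 3)   # invert the curves: log-rate as f(score)
    pt = np.polyfit(score_test, lt, 3)
    lo = max(min(score_anchor), min(score_test))
    hi = min(max(score_anchor), max(score_test))
    ia = np.polyval(np.polyint(pa), [lo, hi])
    it = np.polyval(np.polyint(pt), [lo, hi])
    avg_diff = ((it[1] - it[0]) - (ia[1] - ia[0])) / (hi - lo)
    return (10 ** avg_diff - 1) * 100      # negative means bitrate savings

# toy numbers: four rate points (kbps), mAP as the task score
anchor = ([800, 1500, 3000, 6000], [30.1, 34.2, 36.8, 38.0])
test   = ([600, 1200, 2500, 5200], [30.5, 34.6, 37.0, 38.2])
print(f"BD-rate vs anchor: {bd_rate(*anchor, *test):+.1f}%")
```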

The VCM standards group is still in the exploratory stage and has not yet begun formal standardization work. Completed work includes determining the requirements, the major tasks and the public datasets, as well as eight anchors, and the Call for Evidence, officially released in January this year. An MPEG meeting was held in April to evaluate all the evidence gathered. The call has been relatively successful, and several distinct technical routes have formed, which I will introduce in detail below.

The figure shows the technical routes of the whole VCM group. My personal summary divides them into five parts: feature coding, feature extraction, human-machine hybrid coding, the standards group's evaluation work, and the anchor work.

Feature coding can be broken down further. After more than 30 years, video coding has developed to a very fine-grained level, where every small change spawns different methods and attempts; feature coding is only at the beginning, and there will be a great deal to do. On one hand, existing codecs such as HEVC and VVC can be reused, and different operations on the feature map, such as how to crop it, also affect the result. On the other hand, there are new technical schemes, including feature transformation, i.e. transforming the statistical distribution of features in the video domain. The main one is feature quantization: features are distinctly less sensitive to quantization than video, and the group has tried many quantization methods, such as linear quantization, vector quantization and feature dimension reduction. The third is feature coding proper, including macroblock partitioning; features can be treated analogously to images, with prediction across channels and inter-frame prediction of features.

For feature extraction, a single task raises the questions of which model to choose and which layer to extract, so as to minimize the data volume while keeping the impact on the task acceptable (see the sketch after this overview). This part also includes dataset selection: features count as common only if they support multiple tasks, so datasets must be labeled accordingly and loss functions jointly optimized, with joint training or transfer training. There are also multi-source features, covering different data sources such as IR, weak imaging data, 3D, and multi-source fusion. Weak imaging deserves a mention: ISPs today mainly serve the human eye, but some papers show that machine intelligence tasks can run on this weak data directly, skipping ISP processing and achieving better performance on a given dataset. This could be a future research direction as well as a real industrial application.

There are two main approaches to human-machine hybrid coding. One is end-to-end neural network coding; MPEG also has a DNNVC group that uses neural networks for human-oriented compression. End-to-end networks are the main solution and improve performance the most, but may leave little room for adjustment. The other approach is feature inversion, reconstructing video from features. There is also a distinctive object coding method that codes according to semantic structure, either enumerating all objects or separating foreground from background; video can also be represented as key points plus key frames. As for testing and evaluating methods, a series of metrics must be defined, and some are working on joint optimization. At present we basically draw RD curves per task and compute BD-rate style summary values, using those values to judge whether a proposal beats the current baseline for the scenario. Finally, there is the anchor work.
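The "which model, which layer" question above is mechanically simple: tap an intermediate layer with a forward hook. A minimal sketch with torchvision's ResNet-50, using `layer3` as a stand-in for the C4 features mentioned later (random weights here; a real pipeline would load pretrained ones):

```python
# Hedged sketch of single-task feature extraction: capture an intermediate
# layer of a torchvision ResNet-50 with a forward hook.
import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # use pretrained weights in practice
captured = {}

def hook(module, inputs, output):
    captured["feat"] = output.detach()

handle = model.layer3.register_forward_hook(hook)
with torch.no_grad():
    model(torch.randn(1, 3, 800, 800))  # dummy image tensor
handle.remove()

feat = captured["feat"]                 # (1, 1024, 50, 50) for an 800x800 input
print(feat.shape, feat.numel() * 4, "bytes as fp32 -- larger than the image itself")
```

Note the printed size: the raw fp32 tensor is several times larger than the input image, which is why the quantization and coding steps discussed above carry the whole scheme.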

The figure summarizes the four main technical solutions. Feature coding includes PCA quantization, normal-distribution transform quantization, HEVC coding after feature splicing, and semantic segmentation map coding. The semantic segmentation map is a very special feature: it is strongly correlated with the image and takes a constant value within each region, so there are some special compression methods for it.
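"Feature splicing" before HEVC coding usually means tiling the channels of a feature tensor into one big 2-D mosaic so an ordinary image codec can handle it. A minimal numpy sketch of that packing and its inverse:

```python
# Hedged sketch of feature splicing: tile C feature channels into a single
# 2-D mosaic suitable for an image codec (HEVC/VVC/JPEG), and untile it back.
import numpy as np

def tile_channels(feat):                 # feat: (C, H, W) float array
    c, h, w = feat.shape
    cols = int(np.ceil(np.sqrt(c)))
    rows = int(np.ceil(c / cols))
    mosaic = np.zeros((rows * h, cols * w), feat.dtype)
    for i in range(c):
        r, q = divmod(i, cols)
        mosaic[r * h:(r + 1) * h, q * w:(q + 1) * w] = feat[i]
    return mosaic

def untile_channels(mosaic, c, h, w):
    cols = int(np.ceil(np.sqrt(c)))
    chans = [mosaic[(i // cols) * h:(i // cols + 1) * h,
                    (i % cols) * w:(i % cols + 1) * w] for i in range(c)]
    return np.stack(chans)

feat = np.random.rand(256, 32, 32).astype(np.float32)
m = tile_channels(feat)                  # 16x16 grid -> one 512x512 "image"
assert np.allclose(untile_channels(m, *feat.shape), feat)
print(m.shape)                           # quantize to 8/10 bits, then image-code
```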

The second is end-to-end video coding, which performs better. Models based on CompressAI have produced corresponding results, with some optimization adjustments; NTU, for example, proposed a network framework and also improved performance somewhat. There is also end-to-end compression of RAW-domain data, mainly to preserve task performance while saving ISP latency, and object coding based on foreground and background. The third is feature extraction: finding common features by opening the neural network in the middle, taking out the feature map, and encoding the tensor. What matters most is which backbone to use and which layer to choose. The stem layer is chosen most often; some concatenate stage1 through stage5, but that data volume is relatively large. A feature's data volume is much larger than the image's, yet features are a new field with many compression levers; although the performance is not the best today, many optimization methods may emerge. The fourth is human-machine hybrid video coding, which includes combining a per-frame keypoint sequence stream with a keyframe video stream, a feature coding framework based on semantic information, and a rate-distortion optimization function that incorporates intelligent task performance.

Some details of the feature coding techniques are listed in the figure. Based on a ResNet model, the output features of the C4 layer are extracted; on the COCO dataset, bit-depth quantization with BD >= 4 has no effect on object detection and segmentation performance, while BD = 3 degrades performance somewhat. At Nanyang Technological University, features of VGGNet and ResNet are spliced and tiled, and the compression ratio at the C5 layer is 5 to 10 times better than JPEG. There is also PCA-based quantization, normal-distribution transformation of features, and per-channel normalization; some results show no loss of task performance, while others show some degradation.
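A sketch of what the bit-depth (BD) quantization step looks like: uniformly quantize the feature tensor to b bits and dequantize. The random tensor is a stand-in for real C4 features, and the printed numeric error is only a proxy; the criterion in the text is task accuracy (mAP/mIoU), not reconstruction error.

```python
# Hedged sketch of bit-depth feature quantization; b >= 4 bits is reported
# above to leave detection/segmentation accuracy intact, b = 3 to degrade it.
import numpy as np

def quantize(feat, bits):
    lo, hi = feat.min(), feat.max()
    levels = 2 ** bits - 1
    q = np.round((feat - lo) / (hi - lo) * levels)        # integer codes
    return q.astype(np.uint8 if bits <= 8 else np.uint16), (lo, hi)

def dequantize(q, bits, lo, hi):
    return q.astype(np.float32) / (2 ** bits - 1) * (hi - lo) + lo

feat = np.random.randn(256, 32, 32).astype(np.float32)   # stand-in C4 features
for bits in (8, 4, 3, 2):
    q, (lo, hi) = quantize(feat, bits)
    err = np.abs(dequantize(q, bits, lo, hi) - feat).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
# In the real pipeline the integer codes then go to entropy coding.
```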

There are special compression codes for the semantic segmentation map, which is widely used in driving scenes. Peking University uses a quadtree method, Ericsson uses content coding, and GTI partitions different macroblocks; they have made a series of attempts. There are also end-to-end neural network coding methods with very good performance, 20% to 30% better than existing methods.
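To show why a quadtree suits a label map that is constant within regions, here is a toy encoder in that spirit (my own illustration, not Peking University's actual scheme): uniform blocks collapse to one symbol, mixed blocks split into four.

```python
# Hedged sketch of quadtree coding for a semantic segmentation map: uniform
# regions become a single (leaf, label) symbol, mixed regions recurse.
import numpy as np

def quadtree_encode(seg, out):
    """seg: 2-D integer label map with power-of-two sides; out: symbol list."""
    if (seg == seg.flat[0]).all():
        out.append(("leaf", int(seg.flat[0])))
        return
    out.append(("split",))
    h, w = seg.shape
    for block in (seg[:h//2, :w//2], seg[:h//2, w//2:],
                  seg[h//2:, :w//2], seg[h//2:, w//2:]):
        quadtree_encode(block, out)

seg = np.zeros((64, 64), np.int32)       # toy map: background 0, one object
seg[8:24, 8:24] = 7
symbols = []
quadtree_encode(seg, symbols)
print(len(symbols), "symbols for", seg.size, "pixels")
# The symbol stream would then be entropy coded, e.g. with binary arithmetic coding.
```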

Peking University proposed combining a keypoint sequence stream of video frames with a keyframe video stream, reconstructing the continuous video with its own network; this improves compression ratio and video quality while significantly improving object detection performance. USTC proposed a feature coding framework based on semantic information, built on the idea of object coding: all objects are represented and the background is separated. With some optimization presented at the April meeting, task performance can improve by 20%. Shandong University tried adding an image segmentation performance index to the rate-distortion optimization function, optimizing SSE and mIoU at the same time, and finally achieved better performance at the same rate. As for feature extraction technology, one approach extracts the bottom layer of the end-to-end neural network, and the other extracts the stem layer of the backbone network, which can achieve the best compression ratio.
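A hedged sketch of what a task-aware rate-distortion objective of the Shandong University kind might look like: rate plus pixel distortion (SSE) plus a task term, here a differentiable mIoU proxy. The weights `lam1`/`lam2` and all tensors are free placeholders of this sketch, not values from the proposal.

```python
# Hedged sketch of a joint rate + SSE + task objective for training a codec.
import torch

def soft_miou(pred, target, eps=1e-6):
    """Differentiable IoU proxy for soft binary masks in [0, 1]."""
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return (inter + eps) / (union + eps)

def task_aware_rd_loss(rate_bits, recon, orig, pred_mask, gt_mask,
                       lam1=0.01, lam2=1.0):
    sse = ((recon - orig) ** 2).sum()            # classic pixel distortion
    task = 1.0 - soft_miou(pred_mask, gt_mask)   # segmentation quality term
    return rate_bits + lam1 * sse + lam2 * task

# toy tensors standing in for codec outputs and a segmentation head
orig = torch.rand(1, 3, 64, 64)
recon = orig + 0.05 * torch.randn_like(orig)
gt = (torch.rand(1, 1, 64, 64) > 0.7).float()
pred = torch.sigmoid(torch.randn(1, 1, 64, 64))
print(task_aware_rd_loss(torch.tensor(1.5e4), recon, orig, pred, gt))
```

Because every term is differentiable, the same expression can drive either end-to-end training or per-block mode decisions in a conventional encoder's RDO loop.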

Here are the latest developments from the meeting that just ended in April. In January this year we released the Call for Evidence (CfE), which lays out several expected benefits for the codec as a whole, such as higher compression efficiency, privacy protection, and savings in computational complexity. Four technical routes for improving compression efficiency were summarized. One is feature coding, the idea our group proposed at the very beginning: replacing video with features. It has large headroom for the future, though its performance is not yet outstanding. Several methods came from within the group, such as coding the intermediate layer of a ResNet detection network with K-means quantization, the intermediate layer of a YOLO tracking network, feature channel rearrangement, de-redundancy via feature channel correlation, and coding the intermediate layer of the task network.

End-to-end coding currently has the best performance. The two pieces of evidence accepted this time retrained the cheng2020 and mbt2018 networks, made end-to-end optimizations and improvements, and optimized a joint loss for image reconstruction and detection, improving performance by 20% to 30%. Zhejiang University, based on the cheng2020 end-to-end network, achieves a 22.8% BD-rate saving over VTM; Alibaba and City University of Hong Kong, based on the mbt2018 end-to-end network, are 20% to 30% better than VVC.
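Both model families are available in the open-source CompressAI library, so the baseline is easy to reproduce. The sketch below runs the stock pretrained mbt2018 model (weights download on first use); the CfE responses went further and retrained such models with task losses, which this snippet does not do.

```python
# Hedged sketch using CompressAI's pretrained mbt2018 model: measure bits per
# pixel from the entropy model's likelihoods and MSE of the reconstruction.
import math
import torch
from compressai.zoo import mbt2018

model = mbt2018(quality=3, pretrained=True).eval()
x = torch.rand(1, 3, 256, 256)            # stand-in image, dims multiple of 64

with torch.no_grad():
    out = model(x)                        # returns x_hat and likelihoods

num_pixels = x.shape[2] * x.shape[3]
bpp = sum(torch.log(l).sum() for l in out["likelihoods"].values()) \
      / (-math.log(2) * num_pixels)
mse = torch.mean((out["x_hat"] - x) ** 2)
print(f"{bpp.item():.3f} bpp, MSE {mse.item():.5f}")
# Retraining idea from the CfE responses: add a detection loss on x_hat (or on
# intermediate features) to the usual rate + distortion objective, then fine-tune.
```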

Object coding was proposed in both Korea and China: object detection is performed at the encoder, and the target region is separated from the background before coding. Korea's ETRI, working on the FLIR dataset, proposed foreground-background separation, downscaling foreground and background at different ratios before coding, improving the compression ratio by 27%. The University of Science and Technology of China (USTC) proposed feature coding for low-level and high-level semantics, with roughly 20% performance improvement.

For object detection, China Telecom proposed extracting an intermediate layer of the ResNet backbone, applying K-means clustering and BAC coding, and trying different bit-depth quantizations based on cluster centroids of different dimensions, which has an accuracy advantage at high bitrates. Canon and ETRI have made attempts, but their compression performance is not yet strong. Finally, there are improvements to existing coding tools. VVC is now very mature and is designed in modules; many experts have experimented with these modules, reporting in detail which ones can be switched off for the best trade-off in computing performance. It is even possible that VVC will issue a profile for VCM.
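A hedged sketch of the codebook half of such a scheme (my illustration, not China Telecom's actual proposal): cluster feature values with K-means, transmit the centroids plus per-element indices, and entropy code the indices, with zlib standing in for the BAC step.

```python
# Hedged sketch of K-means codebook quantization of features plus entropy
# coding of the index stream (zlib stands in for binary arithmetic coding).
import zlib
import numpy as np
from sklearn.cluster import KMeans

feat = np.random.randn(256, 32, 32).astype(np.float32)   # stand-in features
K = 16                                                    # codebook size

km = KMeans(n_clusters=K, n_init=4, random_state=0).fit(feat.reshape(-1, 1))
indices = km.labels_.astype(np.uint8)                     # one symbol per value
codebook = km.cluster_centers_.ravel()

payload = zlib.compress(indices.tobytes())                # BAC stand-in
recon = codebook[indices].reshape(feat.shape).astype(np.float32)

print(f"{len(payload)} bytes vs {feat.nbytes} raw, "
      f"mean abs err {np.abs(recon - feat).mean():.4f}")
```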

Tencent found that turning off the SAO, ISP and MIP tools reduced complexity and cut encoding time by 26%. Ericsson did similar work: using rate-distortion optimization in VTM, it introduced joint optimization of coding performance and object detection performance, adopting different rate-distortion parameters for different image scaling factors and QP values, which achieved a performance improvement of 56%. This may not be practical yet, but it is useful for research.
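A hedged sketch of how such a tool-off experiment could be scripted: re-run the VTM encoder with the tools disabled via command-line overrides and compare wall-clock time and bitstream size. The option names follow the VTM configuration files and the override syntax follows the VTM/HM convention; verify both against your build, and re-check task accuracy afterwards, since that is the metric that matters here.

```python
# Hedged sketch of the tool-off experiment: encode with and without SAO/ISP/MIP
# and compare encoding time and bits; paths and cfg are placeholders.
import subprocess, time, os

def encode(yuv, w, h, qp, bitstream, extra=()):
    cmd = ["EncoderApp", "-c", "encoder_intra_vtm.cfg", "-i", yuv,
           "-wdt", str(w), "-hgt", str(h), "-f", "1", "-fr", "1",
           "-q", str(qp), "-b", bitstream, *extra]
    t0 = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - t0, os.path.getsize(bitstream) * 8

# base_t, base_bits = encode("in.yuv", 1920, 1080, 32, "base.bin")
# fast_t, fast_bits = encode("in.yuv", 1920, 1080, 32, "fast.bin",
#                            extra=["--SAO=0", "--ISP=0", "--MIP=0"])
# print(f"time {fast_t/base_t:.2f}x, bits {fast_bits/base_bits:.2f}x")
```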

China is also making rapid progress in this area. DCM, a standards group for data compression and coding for machine intelligence, was established in January last year. The difference from the international VCM is that, in addition to video, it adds audio, point clouds and other data types; the fastest progress, however, is probably in video and images. DCM's functions are basically the same as the international VCM's, and it can support the work of the international standards group. The participants differ, and the application scenarios in focus vary with the industries in our country. Domestic progress is faster, partly because international policies and regulations are more restrictive. So far the DCM group has held four meetings, with smaller discussions in between; it has sorted out technical documents and technical routes, set up the corresponding national standard projects, including two international standard items related to machine vision and human-machine hybrid vision, and issued corresponding white papers. It has established seven special task groups, including external liaison, intellectual property, requirements, dataset, and test groups. At present, anchor and test work is underway, and technical proposals are being collected.

Because the goal of the codec is still to serve industry and achieve industrialization, listed here, in addition to the codec bodies, are the relevant international application standards organizations and industrial organizations we need to contact and cooperate with. The goal is that throughout the process we obtain real market requirements and datasets from actual applications, and that when the standard is ready it can be adopted directly by the application organizations. Thank you!