This article is based on a talk given by Professor Fan Yibo of the School of Microelectronics, Fudan University, at LiveVideoStackCon 2021. He explained the differences between hardware and software encoders and described the microarchitecture of a hardware encoder in detail, covering the X1 encoder for chip implementation, the K1 encoder for FPGA implementation, and an open source version of the video encoder.

By Fan Yibo

Compiled by LiveVideoStack

Hello everyone, I am Fan Yibo, from the School of Microelectronics, Fudan University. The theme of this talk is open source hardware IP for video codecs. First, I will introduce the differences between hardware and software encoders. Next, I will focus on the hardware microarchitecture of our codec, including the open source version and the high-performance versions, all of which are based on a unified architecture. After that, I will present the high-performance X1 encoder for chip implementation and the low-hardware-cost K1 encoder for FPGA implementation, as well as the open source version. The main optimization goals of X1 are a small logic area and outstanding RD performance. According to measured results, it is about 10% better than x265 medium, and its compression rate is similar to that of mainstream IP vendors, while its chip area is only about one third of theirs. K1 mainly targets FPGAs, and its target devices are the most common FPGAs, such as the K7. At present, K1 is the only soft-core codec for FPGAs that supports 4K.

01

Introduction

First, a brief introduction.

1.1 About me and my lab

I work in the School of Microelectronics of Fudan University. My lab is called VIP Lab, and it focuses on research into dedicated video and image processor IP and chip design. My main research area is hardware chip design, including video compression, from H.264 to H.265, the next-generation H.266, AVS3, and SVAC2. We also do research on image processing, including ISP processing and image compression such as JPEG, Lepton, WebP, and the display-oriented DSC and VDC-M. In addition, we study learning-based ISP and video codecs, which use deep learning to turn traditional ISP processing and video coding into end-to-end neural network processing. Our laboratory also works on heterogeneous computing system modeling for data centers and cloud computing.

1.2 About XKSilicon

Our open source work is organized mainly through XKSilicon, which serves as the open source foundation. We have an open source website, and the code is available at http://www.openasic.org, so you can go to the website and download it. In addition to H.265, there is also an open source version of H.264, with two updates in 2018 and 2019 and a 3.0 update for H.264/H.265 this year. In the chip field, there is very little open source hardware IP at home or abroad, and there are far fewer open source hardware projects than open source software projects. There are also fewer developers in this area, so more hardware developers are needed to contribute to the open source community. To support this open source work, we also build customized closed source versions for partners, as well as domain-specific commercial versions of the codec IP core for their professional application fields. This work also reflects our team’s engineering capability and the agile development advantage of our hardware IP core architecture.

02

Hardware and software encoders

This section explains some differences between hardware encoders and software encoders.

2.1 Software encoders develop rapidly and open source encoders are abundant

Our country is doing very well in software encoders. If you look at the MSU video coding contest, it is basically a Chinese battlefield: the top 10 are all domestic companies and major Internet giants. Software encoders are developing very well, and domestic companies have strong core competitiveness in this field and have launched world-leading encoders. Open source software is also abundant. A startup that needs an encoder does not have to develop one; it can use x264 and x265 directly to support the mainstream standards.

2.2 Hardware encoder chips are limited, and there is no open source encoder IP

Compared with the lively software scene, the hardware side is rather quiet. Currently, cloud encoding relies on Intel or NVIDIA platforms, which include dedicated transcoding hardware for video codecs. Mobile terminals can rely on Qualcomm, HiSilicon, and others, which have mature hardware encoding solutions. If a company wants to develop a chip that supports video applications, it needs to integrate a video codec IP core into the chip. Generally speaking, startups will not develop a video codec IP core themselves and usually turn to a third-party IP vendor. Few IP vendors can provide encoders, and most of them are foreign, although some domestic vendors can also provide such IP. Of course, an FPGA can also be used; for example, Xilinx has FPGA products with a built-in encoder core. For small-batch, highly customized hardware products, this kind of FPGA with a built-in codec core is an option. Its main disadvantages are high price and strong platform dependence. Once the design is finalized, the product depends on one specific FPGA model, and it is difficult to switch platforms. A soft-core video encoder IP does not have this platform dependence. At present, there is very little open source soft-core hardware IP, and as far as we know our team is the only one in the world providing an open source video codec IP core.

2.3 Differences between software and hardware

Next, I want to explain the difference between software and hardware encoders. Both H.264 and H.265 are hybrid coding frameworks whose processing can be abstracted into modules. In software, the CPU is divided into time slices that execute different algorithms at different times, and execution within an algorithm can be out of order. For example, the CPU may execute the IME module in one time slice and the FME module in the next; it simply loads the corresponding instructions and runs whatever comes next. For software, the key is to keep raising the CPU frequency. Whether the target is 8K or 16K, if a 10 GHz CPU could be built, software could basically conquer the world. The problem is that CPU frequency can no longer be raised, embedded CPUs face an even tighter frequency bottleneck, and a single CPU cannot provide very strong computing power. Developing an encoder is relatively easy for software developers: the CPU takes care of all external memory access, and developers do not have to solve the underlying data scheduling problems.

When it comes to hardware design, it’s a different story.

On the one hand, the same algorithm modules are arranged as a pipeline in hardware. Once the pipeline is laid out, adjacent stages have data dependences: a later stage can only consume the results of the previous stage, so execution cannot be out of order and must follow the direction of the pipeline. In addition, all modules run in parallel in hardware. For example, in the clock cycle shown in the figure above, all modules are working at the same time, whereas on a CPU only one module executes at any moment. In a hardware pipeline, all modules execute simultaneously, each on a different data block.

The second aspect is the strong coupling between earlier and later pipeline stages. A later stage can only use results that an earlier stage has already produced, so there is a strong data dependence and the stages cannot be executed in arbitrary order.
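The contrast between CPU time slicing and a hardware pipeline can be sketched in a few lines of Python. The stage names are borrowed from modules mentioned in this talk, but the stage list, the one-cycle-per-stage timing, and the function names are illustrative assumptions rather than the actual XK265 design.

```python
# Illustrative model only: each stage takes one "cycle" per block, and the
# stage list is an assumption, not the real XK265 pipeline.
STAGES = ["IME", "FME", "RDO", "EC"]


def pipeline_schedule(num_blocks, stages=STAGES):
    """Return, for every cycle, which block each stage is working on.

    Block b enters stage s at cycle b + s, so a stage can only start on a
    block after the previous stage has finished it (in-order execution).
    """
    total_cycles = num_blocks + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        busy = {}
        for s, name in enumerate(stages):
            block = cycle - s
            if 0 <= block < num_blocks:
                busy[name] = block
        schedule.append(busy)
    return schedule


def sequential_cycles(num_blocks, stages=STAGES):
    """Cycles needed when only one module runs at a time, as on one CPU core."""
    return num_blocks * len(stages)


if __name__ == "__main__":
    for cycle, busy in enumerate(pipeline_schedule(6)):
        print(cycle, busy)
```

Once the pipeline is full, every stage is busy in every cycle, each on a different block, which is the behavior described above. The sequential count is only a rough stand-in for a CPU, since real software does not spend one uniform cycle per module.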

The third aspect is that memory access by a hardware IP core has to be controlled manually. Software developers can ignore the CPU’s memory reads and writes, because the CPU automatically moves instructions and data through the I-cache and D-cache. Developers only need to focus on the core algorithmic logic. Hardware IP core development is completely different. Developers must manually handle the data exchange with the bus and external memory: when to write data, when to read data, whether the length of each read or write matches the DDR burst length, and whether the DDR bandwidth is fully used. All of this has to be implemented by hand.
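As a rough illustration of the bookkeeping that hardware developers do by hand, the sketch below splits an arbitrary read request into burst-aligned transfers and reports how much of the transferred data was actually needed. The 64-byte burst size and the helper names are assumptions for illustration; the real value depends on the DDR type and bus width.

```python
BURST_BYTES = 64  # assumed burst size; depends on the DDR type and bus width


def split_into_bursts(addr, length, burst=BURST_BYTES):
    """Return the burst-aligned start addresses covering [addr, addr + length)."""
    if length <= 0:
        return []
    first = (addr // burst) * burst                 # round the start down
    last = ((addr + length - 1) // burst) * burst   # burst holding the last byte
    return list(range(first, last + burst, burst))


def bus_efficiency(addr, length, burst=BURST_BYTES):
    """Fraction of the transferred bytes that the requester actually needed."""
    bursts = split_into_bursts(addr, length, burst)
    return length / (len(bursts) * burst) if bursts else 0.0


if __name__ == "__main__":
    print(split_into_bursts(32, 64), bus_efficiency(32, 64))
```

A request that starts in the middle of a burst costs an extra transfer, so a misaligned 64-byte read in this model uses only half of the bandwidth it consumes. This is the kind of effect a hardware designer has to avoid through data layout and access scheduling.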

Therefore, developing a hardware encoder takes several times the work of developing a software encoder. Generally speaking, before writing a hardware encoder IP core, we first write a software encoder, which is called the C model. The C model simulates the pipeline partitioning of the hardware, and its modules correspond one to one to the hardware modules. The various data dependences are taken into account in the model. The C model itself is also an executable, full-featured video encoder. Once the architecture has been confirmed in the C model and the compression performance meets the design requirements, the hardware can be written and the RTL-level encoder design completed. So developing a hardware encoder usually means writing two encoders, one in software and one in hardware, and every algorithm has to be reworked to remove data dependences and to replace steps that are not suitable for hardware execution. The development workload is much larger than for software, and the demands on algorithm optimization are higher.

2.4 Implementation of hardware encoder

Generally speaking, there are many ways to implement a hardware encoder. The most common is a main processor plus a coprocessor. For example, Intel CPUs do this by adding multimedia instruction set extensions (CPU + co-processor). A GPU can integrate dedicated algorithm module hardware, and high-performance programmable encoding can also be achieved by using the multi-core computing power of the GPU. For a dedicated ASIC or SoC, on the other hand, the encoder IP core has to be integrated onto the bus. There are two ways to integrate it: as a hard core or as a soft core. A hard core is implemented in silicon and is fixed as part of the chip’s internal circuitry. A soft core is implemented on an FPGA: the encoder circuit is formed when the design is downloaded to the FPGA, which then acts as an accelerator. The difference between a hard core and a soft core is that one is oriented toward chip implementation and the other toward FPGA implementation.

2.5 Reasons for Using a Hardware Encoder

The first reason for using a hardware encoder is performance and cost. For example, an 8K encoder built in software needs many servers, whereas a hardware encoder needs only one chip, and the throughput of a hardware encoder is several orders of magnitude higher than that of a software encoder. Compression ratio is a different matter: here the software encoder does better. Because a software encoder can execute serially, it can use very complex algorithms as long as the CPU has enough capacity, it does not need to worry about data dependences, and its RDO algorithm can pursue perfection. A hardware encoder has to eliminate pipeline data dependences, and the resulting algorithm optimizations cause some loss in RD performance.

The second reason is real-time operation and low latency. A CPU encodes by executing a large number of instructions, which is nowhere near as fast as a dedicated chip, and software generally introduces frame-level buffering. A hardware encoder can eliminate most of this buffering and can also process line by line: encoding can start once a few lines of external data have arrived, so very low latency can be achieved.

The third reason is UHD and high IO bandwidth. UHD brings high IO bandwidth, that is, bandwidth between the DDR and the CPU. Hardware encoders can apply bandwidth compression, so bandwidth can be reduced and DDR access can be handled better. 8K or 16K encoding is difficult for software encoders but relatively easy with hardware.

03

XK265 encoder structure

Next, I will share the architecture of the XK265 encoder.

3.1 Design objectives and characteristics of XK265

First of all, the design goal is an architecture that is extensible across algorithms and standards. Our architecture therefore only reflects the modules of the algorithm flow and is independent of any specific coding standard. An extensible hardware architecture can host the algorithms of different standards and support different coding standards within one unified architecture.

Second, the hardware architecture is flexible, with customizable performance, area, and features. Different application areas have different requirements for video coding, for example mobile phones versus security. Mobile phone encoders require low power consumption, and the compression ratio does not need to be very high. For security, the compression ratio must be very high, HD video should be compressed to below 1 Mbps, and a very complex RDO strategy can be used to preserve clarity after compression. Because the requirements differ from one scenario to another, some of the basic core PE units are configurable, some important pipeline units are configurable, the search and prediction modes are configurable, and some internal block sizes are configurable, so that performance, cost, quality, and compression ratio can be traded off dynamically.

Third, the architecture supports multiple streams, low latency, and hardware reuse between encoding and decoding. About 70% of the logic can be shared between the encoder and the decoder, so only a small amount of extra logic is needed to support both. In addition, we want zero-delay switching between streams, so that different streams can be fed in, switched very quickly, and encoded in parallel with high real-time performance.

Fourth, there is a rich user configuration parameter space, including coding control and parameter configuration, RDO policy configuration, search mode configuration, rate control, ROI configuration, and so on. We need a very large user-configurable space that allows users to adapt to different scenarios.

3.2 XK265:X001 CDC architecture

This is our five-stage pipeline CDC (Encoder + Decoder) architecture, which targets chip implementation. The target frequency is up to 800 MHz, to support performance from 4K30 to 4K60. As shown in the figure, FTH fetches data, followed by RMD and the ME (motion estimation) engine, which combines integer and fractional motion estimation. RDO, REC, DBF, and SAO are all in the same pipeline stage. DMP is the output channel for the final reconstructed image, and E_C/E_D support entropy encoding and entropy decoding. The encoding route goes in through EXT_RD and out through BS_O, and the decoding route goes in through BS_I and out through EXT_WR. Based on this agile chip architecture, it is easy to customize a standalone encoder, a standalone decoder, or a codec in combined mode, as well as IP cores that support an I-frame-only version or an I+P version. It is also very easy to customize and generate dedicated encoder IP cores with different performance, area, and compression rate targets.

3.3 XK265:X001 ENC architecture

As an example of IP customization with the agile architecture, this is our K series IP architecture for FPGA soft cores. Because it is generally difficult to run an FPGA at a very high clock frequency, the target frequency of the K series is about 150 MHz, which is very different from the 800 MHz clock of the X series for chips. The K series IP has the same throughput target as the X series and also needs to support 4K30. We therefore adopt a seven-stage pipeline architecture for the K series. As shown in the figure, IME and FME are split into separate pipeline stages, and RDO, REC, DBF, and SAO are also split into separate pipeline stages, which finally achieves high-performance encoding at a low clock frequency.
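To see why a lower clock calls for a deeper pipeline, it helps to estimate how many clock cycles are available per block. The sketch below is back-of-the-envelope arithmetic and not a description of the actual design: it assumes 4K means 3840×2160, ignores blanking, memory stalls, and multi-stream sharing, and uses block sizes of 32 and 16, following the LCU and CTU sizes mentioned elsewhere in this talk for X1 and K1. The function names are hypothetical.

```python
import math


def blocks_per_frame(width, height, block):
    """Number of block-size units per frame, rounding partial blocks up."""
    return math.ceil(width / block) * math.ceil(height / block)


def cycles_per_block(width, height, fps, block, clock_hz):
    """Average clock cycles available per block for real-time encoding.

    In a pipeline, every stage must on average finish one block within
    this budget. Blanking, DDR stalls and stream sharing are ignored.
    """
    return clock_hz / (blocks_per_frame(width, height, block) * fps)


if __name__ == "__main__":
    # 4K is assumed to mean 3840x2160 here.
    print(round(cycles_per_block(3840, 2160, 30, 32, 800e6)))  # X-series-like case
    print(round(cycles_per_block(3840, 2160, 30, 16, 150e6)))  # K-series-like case
```

Under these assumptions the FPGA case leaves far fewer cycles per block than the chip case, so each pipeline stage has to do less work, which is consistent with splitting the stages more finely.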

3.4 XK265:X001 DEC architecture

As another example of IP customization with the agile architecture, we completed a K series video decoder IP core for FPGA that only performs decoding. The decoder also uses the seven-stage pipeline architecture, but the encoding data path is removed. With only decoding left, a very streamlined pipelined decoder can be achieved.

3.5 XK265: RMD can be configured

Let’s take a look at the underlying module design in our hardware architecture. We provide a wealth of configurable options, which are important for agile architecture design. With hardware, a circuit largely loses its ability to be modified once it is solidified. For an IP core, however, we can tune the area before the hardware is solidified. We provide two kinds of configurable parameters at the module level of the IP core: static configurable parameters and dynamic configurable parameters.

Static configurability refers to parameters that can be set before the IP is turned into a real chip circuit. These parameters affect the final circuit size and logic area. For RMD, for example, we can configure the number of engines to balance area and throughput. We can also configure whether the DC and Planar modes are supported, because these two modes occupy a large area. Removing them saves a lot of area, although RD performance is also affected.

Dynamic configurability means that the circuit has already been solidified, and some functions of a module can be turned on or off through configuration registers to adjust coding throughput and compression performance. For example, all PU sizes can be configured, from the smallest 4×4 PU to 8×8, 16×16, 32×32, and 64×64. The prediction for a given PU size can be turned on or off dynamically to balance compression quality against throughput. An important dynamic setting is the RDO level, which guides the dynamic configuration parameters of each module. A higher RDO level achieves better compression, at the cost of searching all PUs. The compression rate is then better but the frame rate is lower, for example 4K30. If you want the same circuit to reach 4K60, you may need to skip some PUs and lower the RDO level.
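The split between static and dynamic parameters can be modeled in software as follows. This is only a sketch of the idea: the field names, the default values, and in particular the mapping from RDO level to enabled PU sizes are hypothetical and are not taken from the XK265 register map.

```python
from dataclasses import dataclass, field

PU_SIZES = (4, 8, 16, 32, 64)

# Hypothetical mapping from RDO level to the PU sizes that are searched.
RDO_LEVEL_PU_SIZES = {
    "high": (4, 8, 16, 32, 64),
    "medium": (8, 16, 32),
}


@dataclass(frozen=True)
class StaticConfig:
    """Fixed before synthesis; these choices change the circuit area."""
    num_engines: int = 2
    support_dc_planar: bool = True


@dataclass
class DynamicRegisters:
    """Writable at run time through configuration registers."""
    rdo_level: str = "high"
    pu_enabled: dict = field(default_factory=dict)

    def apply_rdo_level(self, level):
        """Enable or disable each PU size according to the chosen RDO level."""
        sizes = RDO_LEVEL_PU_SIZES[level]
        self.rdo_level = level
        self.pu_enabled = {s: (s in sizes) for s in PU_SIZES}


if __name__ == "__main__":
    regs = DynamicRegisters()
    regs.apply_rdo_level("medium")
    print(regs.pu_enabled)
```

The frozen dataclass mirrors the fact that static parameters cannot change after the circuit is built, while the register object can be rewritten between frames to trade compression for throughput.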

3.6 XK265: The IME can be configured

This is the integer motion estimation module. Its statically configurable options include the array size, the update direction, the PU sizes to be searched, and so on. Its dynamically configurable options are mainly what to search: both the search pattern and the search window are configurable.

This is the programmable motion estimation engine we built. The search window can be set to different shapes and sizes, and the search pattern can be a full search or a fast search. Different search methods can be implemented by updating the microcode and algorithms.

3.7 XK265: FME can be configured

The same is true for fractional-pixel motion estimation. In terms of static configuration, you can configure the number of engines. Reducing the number of engines lowers performance and saves area, so all engines can be balanced in terms of area and throughput. The size of the engine, whether 16×2 or 8×1, only changes the degree of parallelism in multi-pixel processing. The engine’s register stages can also be configured to balance area against timing, for a higher or lower frequency as desired. The same is true for dynamic configuration: all PU sizes can be configured, and the searched neighborhood can be configured.
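The idea of separating the search engine from the search pattern can be sketched in software. In the example below the pattern is simply a list of candidate offsets, so a full search and a cheaper search use the same engine. The cost function (SAD), the cross-shaped pattern, and all function names are assumptions chosen for illustration; they are not the microcode format or the fast search actually used in XK265.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and a displaced reference block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search_pattern(window):
    """All integer offsets inside a square +/-window search area."""
    return [(dx, dy) for dy in range(-window, window + 1)
            for dx in range(-window, window + 1)]


def cross_pattern(window):
    """A cheaper, hypothetical pattern: horizontal and vertical offsets only."""
    pts = {(0, 0)}
    for d in range(1, window + 1):
        pts.update({(d, 0), (-d, 0), (0, d), (0, -d)})
    return sorted(pts)


def motion_search(cur, ref, bx, by, size, pattern):
    """Return (best_cost, (dx, dy)) over the candidate offsets in `pattern`."""
    h, w = len(ref), len(ref[0])
    best = None
    for dx, dy in pattern:
        x0, y0 = bx + dx, by + dy
        if x0 < 0 or y0 < 0 or x0 + size > w or y0 + size > h:
            continue  # candidate falls outside the reference frame
        cost = sad(cur, ref, bx, by, dx, dy, size)
        if best is None or cost < best[0]:
            best = (cost, (dx, dy))
    return best
```

A full search over a ±3 window evaluates 49 candidates, while the cross pattern evaluates 13, so the smaller pattern is faster but can miss the best match. That is the same quality-versus-throughput trade-off the configurable pattern and window are meant to expose.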

3.8 XK265: RDO can be configured

This is the RDO part, which is the key engine for balancing compression rate and throughput in the encoder. RDO is the most important factor affecting the video quality of the encoder. As shown in the figure, in our RDO engine design the sizes of the various engines in the loop and the PU sizes to be searched are reconfigurable. Static configuration only affects area, while dynamic configuration affects image quality and throughput. The dynamic options include the number of intra prediction modes, the DC and Planar modes, whether chroma prediction is independent or tied to luma, and the Merge and Skip switches. There are also scene-adaptive decision coefficients and detection of some simple scenes, which makes it convenient to apply different RDO configuration strategies to different scenes.

3.9 XK265: High parallelism CABAC

In the final entropy coding part, we made a highly parallel CABAC pipeline design. We adopted a four-bin parallel mode, and there are also some parallel optimizations for the bypass bins at the front. In the end we achieve a parallelism of 4.37 bins per clock cycle (BPCC), which can support high-bit-rate, high-quality 8K entropy coding.

04

X1 for Silicon

Let’s take a look at the performance evaluation of two commercial custom encoders.

4.1 X1: IP core for chip implementation

The first one is the IP for chip implementation, which we call X1. It uses the five-stage architecture for chip implementation and supports most features. The important features include:

1. CDC architecture, supporting both encoding and decoding. It supports two streams, a main stream and a secondary stream, and can be extended to more parallel streams through shadow registers. It supports several image input channels. Input can come from external DDR, where the original image is placed in DDR and the encoder reads it from there. It can also come from a line buffer for ultra-low-latency applications, where encoding can start as soon as 32 lines of the image are available.

2. Supports LCU-level multi-stream switching. It can encode a CTU of one stream and then switch to another stream to encode its CTU in real time, achieving highly real-time multi-stream encoding.

3. A low-cost 32×32 hardware design is adopted. We did not implement 64×64 here in order to keep the area balanced. Apart from that, we support all CU, TU, and PU sizes.

4. Supports CBR, VBR, and other rate control methods. Frame-level control is done in software, and CTU-level control is done by built-in hardware. We also support lossless compression of reference frames, so the external memory bandwidth can be kept relatively low when encoding 4K30 and 4K60.

5. Supports all IME modes. The search window is fully adjustable. The current configuration is ±32 and ±64, but a larger search window can also be supported if needed.

6. Supports 1/2 and 1/4 pixel FME, with features similar to the IME above.

7. Supports ROI. ROI encoding can be applied to selected areas.
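The LCU-level stream switching in item 2 can be pictured as a round-robin schedule over the pending CTUs of each stream. The sketch below is a simplified model under stated assumptions: it switches stream after every single CTU and ignores the per-stream state that real hardware would have to keep, for example in shadow registers. The function name and the stream names are hypothetical.

```python
from itertools import zip_longest


def interleave_ctus(streams):
    """Round-robin CTU schedule across streams.

    `streams` maps a stream name to its number of pending CTUs. The result
    is the order in which (stream, ctu_index) pairs would be encoded if the
    encoder switched stream after every CTU.
    """
    queues = [[(name, i) for i in range(n)] for name, n in streams.items()]
    order = []
    for group in zip_longest(*queues):
        order.extend(item for item in group if item is not None)
    return order


if __name__ == "__main__":
    print(interleave_ctus({"main": 3, "sub": 2}))
```

Because no stream has to wait for a whole frame of another stream to finish, each stream sees output at CTU granularity, which is what keeps multi-stream encoding real-time in this model.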

At present, we can reach 800 MHz in a 22 nm process. With RDO Level High we achieve 4K30 encoding, and with RDO Level Medium we achieve 4K60 encoding, with a smaller area than other IP vendors.

Here we look at the bit rate performance of X1. The anchor is x265-medium. The test results with rate control are on the top, and the test results with fixed QP are on the bottom. The metric we use is BD-rate: the smaller the BD-rate (the more negative it is), the better the compression efficiency and the more bit rate is saved. As shown in the figure, in the test with rate control, vendor C is 9.6% lower than x265, vendor V is 9.7% lower than x265, and our F265 C model is 10.1% lower than x265. So with rate control enabled, our bit rate is at about the same level as that of the other vendors. The rate mismatch under rate control is also handled well. Here our numbers are better, and we can keep it within 0.2% to 0.3%.
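For readers who have not worked with BD-rate, the sketch below shows the idea behind the metric: average the difference in log bit rate between two rate-distortion curves over their common quality range. It is a deliberately simplified variant that uses piecewise-linear interpolation instead of the cubic fit normally used for BD-rate, so its numbers will differ slightly from those of standard tools. The function names are hypothetical, and no data from the talk is reproduced here.

```python
import math


def _interp_log_rate(points, psnr):
    """Linearly interpolate log(bitrate) at a given PSNR.

    `points` is a list of (bitrate, psnr) pairs sorted by strictly
    increasing PSNR.
    """
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        if q0 <= psnr <= q1:
            t = (psnr - q0) / (q1 - q0)
            return math.log(r0) + t * (math.log(r1) - math.log(r0))
    raise ValueError("PSNR outside the curve")


def bd_rate_linear(anchor, test, steps=1000):
    """Approximate BD-rate (%) of `test` against `anchor`.

    Simplification: piecewise-linear interpolation instead of the usual
    cubic fit. Negative results mean `test` needs fewer bits.
    """
    anchor = sorted(anchor, key=lambda p: p[1])
    test = sorted(test, key=lambda p: p[1])
    lo = max(anchor[0][1], test[0][1])
    hi = min(anchor[-1][1], test[-1][1])
    diff_sum = 0.0
    for i in range(steps):
        q = lo + (hi - lo) * (i + 0.5) / steps   # midpoint rule
        diff_sum += _interp_log_rate(test, q) - _interp_log_rate(anchor, q)
    return (math.exp(diff_sum / steps) - 1.0) * 100.0
```

If a test encoder reaches every quality point with exactly 10% fewer bits than the anchor, this function returns about -10, which matches the reading of the figures above: a more negative BD-rate means more bit rate saved at the same quality.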

Table 2 shows the fixed-QP test, again with x265-medium as the anchor. For I frames, our results are the best and far better than those of the other two commercial encoders. For P frames, vendor V did very well, 13% better than x265, while we are only 6.4% better, so the gap to vendor V on P frames is about 7%. Vendor C is similar to us. However, I need to emphasize that we use a 32×32 LCU, drop 64×64 PU support, and drop the Nx2N partition. Removing these features generally costs about 8% to 10% in compression performance, and they are enabled in vendor V’s encoder. If these features were included, our P-frame compression rate would be almost the same as vendor V’s. In addition, thanks to the 32×32 LCU and our internal agile architecture design, our hardware area is only one third of vendor V’s.

05

K1 for FPGA

K1 targets FPGAs. It is a version with a low clock frequency and high coding throughput, so its compression efficiency is necessarily lower. Since FPGA resources are limited in every respect, our goal is to achieve a 4K codec on the K7-325T, a very small device in the K7 series. We made very large cuts to achieve this, and the encoder is still mainly intended for embedded use, not for accelerator cards. It can meet application requirements in scenarios where compression demands are modest but throughput demands are high.

5.1 K1: IP core for FPGA implementation

In the pipeline design of K1, we set the CTU size directly to 16×16, which is the same as the MB size of H.264. We therefore compare the compression rate with x264-medium instead of x265. K1 aims to provide a high-performance 4K codec on small FPGAs with a better compression rate than x264-medium. As shown in the figure, K1 also supports both encoding and decoding, with external memory and line buffer input, but it does not support multi-stream parallel encoding.

In other respects it is similar to X1. We made some simplifications in RDO to support 4K encoding at up to 140 MHz. For the FPGA implementation, with the K7-325T as the final target device, the overall performance reaches a maximum of 4K30. We also provide two RDO level configurations: RDO Level Medium supports 4K30, and RDO Level High supports 1080P60 with better compression efficiency. The total area of K1 is about 170K LUTs.

This is a comparison between K1 and x264-medium. Because there is no other soft core for FPGA that supports 4K, K1 can only be compared with x264 software. As shown in the figure, we also list the compression performance of x264-veryslow and x264-veryfast. x264-medium is used as the anchor for all comparisons. Baseline is the most simplified version: we remove DB and SAO and use a very small search window, which keeps it within 150K LUTs. In terms of compression, the Baseline is about 13% better than x264-medium on I frames and about 16% worse on P frames.

However, if the area of the Baseline is slightly enlarged and DB is added, I frames become 15.8% better, and the P-frame loss shrinks from 15.9% to 3.8%. P frames perform even better if SAO is added. Because FPGAs are flexible, we can choose which modules to add according to the device size to achieve the best balance of bit rate and area. This is also an advantage of our agile architecture design, which makes it possible to add tools to the Baseline flexibly and so configure throughput, compression, and area.

5.2 K1: Demo

K1:5.2 Demo

We have a demo outside the lecture hall, and you are welcome to see the demo.

The schematic diagram of the demo system is shown here. The image is fed from the HDMI output of the PC on the right into the FPGA encoding board. The FPGA encoding card encodes it and outputs the bitstream over Ethernet. Another FPGA serves as the decoding card: it receives the video bitstream, decodes it, and outputs the decoded image to a display at the same time.

The above is our physical encoding and decoding demonstration.

06

Open Source Roadmap

K1 and X1 are both high-performance versions for industry and are not open source at this stage. In addition, we have an open source version, which has the same hardware architecture as the closed source versions but differs in area, algorithm complexity, and supported resolution and performance.

We open-sourced XK265 in 2017 and have released versions 1.0 and 2.0, which you can download from the official website. V1.0 focuses on reference hardware design for each algorithm module of video encoding and decoding. V2.0 is an architectural and testing upgrade over V1.0. For V2.0 we did more testing, so it is more stable than V1.0, and we also provide a C model for comparison testing. In both V1.0 and V2.0 we use only a very simplified RDO algorithm, and we encourage other teams to explore and implement their own RDO algorithms on top of it. In the current open source version, we focus more on reducing hardware area and streamlining the algorithm pipeline. The goal is to achieve a high-performance video codec at very low hardware cost.

This year we plan to release V3.0, which will be open-sourced after we finish the K1 project.

V3.0 features: it is mainly for embedded FPGAs, as a soft-core encoder for very small FPGAs such as the K7 and Pynq. For example, the Pynq version will be I-frame only, because the Pynq has very little logic. Its price is around 700 to 800 RMB, whereas a K1 development board costs more than 10,000 RMB, which probably only university laboratories can afford. In addition, we will optimize for low latency to achieve an end-to-end codec with ultra-low latency. For extremely low area cost, we strive to make the hardware very small, even supporting only the Y channel, and to remove redundant logic wherever possible. We will also provide a complete bus interface and open source FPGA demo code. Among other features, we will add a cost-effective RDO algorithm and a bit rate control algorithm.

Finally, let me share our open source roadmap for the future. The topic of today’s talk is the XK265 encoder. We also have XK264, and we will release a V3.0 update for it this year. Its hardware cost is lower than that of XK265, and its application range is wider. XK264 is common in board-level and embedded applications because these applications have low complexity. We are still working on an AI codec in the lab, and the algorithm research is almost complete. Next, we will do the hardware-oriented implementation work, so next year we plan to release an AI-based, image-oriented open source encoder. In addition, we are developing an AI-ISP, and we will release an open source version of it in the future.

In terms of applications, we mainly target terminal chips and embedded devices. Recently we have also been considering cloud-oriented IP core design and IP core applications on cloud FPGA accelerator cards (such as the Xilinx U250). In particular, for cloud coding applications that require high compression efficiency, such as 8K, 16K, and VR, there is no chip solution at present, yet encoding acceleration is needed in the cloud, so heterogeneous codec designs can be deployed in the data center.

That’s all I have to share, thank you!
