You are probably familiar with compression; in fact, you deal with it almost every day.

Common compression software: 7-Zip, WinRAR, 2345 HaoYa

Common compression formats: .zip, .rar, .tar.gz

There are many compression programs, but they all serve the same purpose: reducing the storage space that files occupy.

Beyond the formats above, files such as .jpg, .mp3, and .avi are also compressed, but unlike .zip they use lossy compression.

1 Lossy compression

Lossy compression exploits the fact that humans are insensitive to certain frequency components of images and sound, and allows some information to be lost during compression. The original data cannot be fully restored, but the loss has little effect on how we perceive the original, while the compression ratio becomes much larger. Lossy compression is therefore widely used for speech, image, and video data.

In other words, when a file is compressed lossily, some of its data is lost forever; there is no way to restore the file to its original state. So why do we need lossy compression if it discards data permanently?

Because lossy compression offers a much better trade-off: we can readily accept losing part of the data, since the lost part does not affect how we use the file.

The sticker images and memes we use in chat almost every day, for example, are the result of lossy compression. Once pixelation appears it can never be undone, but the images are easier to use and to share.
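
As a small illustration (a sketch assuming the Pillow imaging library and a hypothetical input file photo.png), re-saving an image as a low-quality JPEG throws away detail that can never be recovered:

```python
# Minimal sketch of lossy compression with Pillow (assumes `pip install Pillow`
# and an input file named photo.png -- both are illustrative assumptions).
from PIL import Image

img = Image.open("photo.png").convert("RGB")

# quality ranges roughly 1-95; lower quality = smaller file, more visible artifacts
img.save("photo_q30.jpg", format="JPEG", quality=30)

# Decoding the JPEG gives pixels close to, but not identical to, the original;
# the discarded detail is gone for good.
lossy = Image.open("photo_q30.jpg")
```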

2 Lossless compression

Lossless compression exploits the statistical redundancy of data: after compression, the original data can be fully recovered without any distortion. The compression ratio, however, is limited by the amount of redundancy in the data, typically between 2:1 and 5:1. This kind of method is widely used to compress text, programs, and images in special applications (such as fingerprint images and medical images).

Lossless compression is easy to get started with but hard to do well, and different algorithms differ considerably in both compression ratio and compression speed.

Today's lossless compression algorithms can typically shrink files to 30-40% of their original size, and the higher the compression ratio, the harder it is to decompress.
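
A quick way to see the "no distortion" property is a round trip through Python's built-in zlib module (a DEFLATE implementation); this is only an illustration of losslessness, not of any particular ratio:

```python
import zlib

text = ("lossless compression exploits statistical redundancy " * 50).encode("utf-8")

compressed = zlib.compress(text, 9)   # level 9 = best compression, slowest
restored = zlib.decompress(compressed)

assert restored == text               # the original is recovered bit for bit
print(len(text), "->", len(compressed), "bytes")
```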

ZSTD (Zstandard) is a free, open-source, fast real-time data compression program. It is a lossless compression algorithm written in C, developed by Facebook, and offers a good balance of compression ratio and speed.
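
A minimal sketch of using ZSTD from Python, assuming the third-party python-zstandard binding (`pip install zstandard`):

```python
# Sketch only -- assumes the python-zstandard package is installed.
import zstandard as zstd

data = b"zstd trades a little compression ratio for a lot of speed. " * 100

cctx = zstd.ZstdCompressor(level=3)          # level 3 is the default trade-off
compressed = cctx.compress(data)

dctx = zstd.ZstdDecompressor()
assert dctx.decompress(compressed) == data   # lossless: exact round trip

print(len(data), "->", len(compressed), "bytes")
```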

On a computer, files are made up of encoded characters, and the basic principle of compression is to find patterns in those characters and represent them more compactly. Hence, various compression algorithms have emerged.

For example: run-length encoding, dictionary algorithms, Huffman coding…

Here’s an example:

Take the following piece of data:

bbbbbbyyytttttedddaaannccccccceeee

If you wanted to compress it, can you see at a glance how you would do it?

=> 6b3y5t1e3d3a2n7c4e

Writing each run as its repeat count followed by the character compresses the 34-character data into 18 characters, saving 16 characters.

This is the simplest compression method: Run-Length Encoding (RLE). It has a major disadvantage, though: if the data contains no long runs of repeated characters, the compressed file can end up twice the size of the original in the worst case.
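
Here is a minimal run-length encoder and decoder matching the example above; the last line also shows the worst case just mentioned, where data without runs doubles in size:

```python
from itertools import groupby

def rle_encode(s: str) -> str:
    # "bbbbbbyyy..." -> "6b3y..." (run length followed by the character)
    return "".join(f"{len(list(group))}{ch}" for ch, group in groupby(s))

def rle_decode(encoded: str) -> str:
    out, count = [], ""
    for ch in encoded:
        if ch.isdigit():
            count += ch                 # run lengths may have several digits
        else:
            out.append(ch * int(count))
            count = ""
    return "".join(out)

data = "bbbbbbyyytttttedddaaannccccccceeee"
packed = rle_encode(data)               # "6b3y5t1e3d3a2n7c4e"
assert rle_decode(packed) == data
print(len(data), "->", len(packed))     # 34 -> 18

print(rle_encode("abc"))                # "1a1b1c" -- the worst case: output doubles
```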

The dictionary algorithm takes the words that appear most frequently in a file, builds a dictionary (similar to key-value pairs), and uses short codes to represent those words.

For example, if you have a friend named “Waddeen Woweshenme Ramoszewski,”

wouldn't it be tiresome to say his full name every time you mentioned him?

You could give him a nickname, say "00". The next time you mention him, you just say "00", which is much shorter.
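
A toy sketch of the dictionary idea (not a real LZ77/LZ78 implementation): build a table of frequent phrases and replace each occurrence with a short code. The names and phrases here are just the running example:

```python
# Toy dictionary substitution -- a sketch of the idea, not a production codec.
text = ("Waddeen Woweshenme Ramoszewski said hi. "
        "Waddeen Woweshenme Ramoszewski said bye. ") * 3

# 1. Pick the phrase that repeats and give it a short code ("00" = the nickname).
dictionary = {"00": "Waddeen Woweshenme Ramoszewski"}

# 2. Encode: replace each long phrase with its code.
encoded = text
for code, phrase in dictionary.items():
    encoded = encoded.replace(phrase, code)

# 3. Decode: substitute back -- the dictionary must travel along with the data.
decoded = encoded
for code, phrase in dictionary.items():
    decoded = decoded.replace(code, phrase)

assert decoded == text
print(len(text), "->", len(encoded), "characters (plus the dictionary itself)")
```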

Of course, this is still not the best solution we know of.

In 1952, David Huffman, then still a doctoral student, completed his final assignment for an information theory course and improved on the dictionary approach. His insight was that the more frequently something occurs, the shorter the code used to represent it should be. Thus the classic algorithm in the field of compression was born: Huffman coding.

Put simply, the more frequently a piece of content appears, the shorter its description and the less space it takes; rarer content gets a relatively longer description and takes more space.

Take, for example, the following set of data

1,50,20,50,50,18,50,25,32,18

To compress this data with Huffman coding, first arrange the numbers by how often they occur:

50,18,1,20,25,32

Treat each number as a node, with its count of occurrences noted below it (shown in blue in the original figure).

First, find the two nodes with the lowest counts and join them: label the left branch 0 and the right branch 1.

Treat the pair as a single new node whose count is the sum of the two counts.

Then find the two nodes with the lowest counts again and repeat the steps above.

The result is a Huffman tree.

Reading the branch labels from the root down to each leaf gives the Huffman code for each number:

  • 50: 00
  • 18: 01
  • 1: 100
  • 20: 101
  • 25: 110
  • 32: 111

1,50,20,50,50,18,50,25,32,18

Encoded with these Huffman codes, the data above becomes:

100,00,101,00,00,01,00,110,111,01

For comparison, here is the raw data written in plain binary:

1, 110010, 10100, 110010, 110010, 10010, 110010, 11001, 100000, 10010

It obviously saves a lot of space: 24 bits instead of 51.
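
A compact sketch of Huffman coding using Python's heapq and collections.Counter. Because ties between equally frequent nodes can be broken differently, the exact codes it prints may differ from the table above, but the total encoded length is the same 24 bits:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table {symbol: bitstring} from a list of symbols."""
    freq = Counter(symbols)
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}    # left branch gets 0
        merged.update({s: "1" + c for s, c in right.items()})  # right branch gets 1
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

data = [1, 50, 20, 50, 50, 18, 50, 25, 32, 18]
codes = huffman_codes(data)
encoded = "".join(codes[x] for x in data)

print(codes)                  # a valid prefix code; may differ from the table above
print(encoded, len(encoded))  # 24 bits in total, matching the hand-built tree
```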

In practice, each compression algorithm has its own strengths, and most of the compressed files we encounter day to day are produced by several compression algorithms working together.

3 Compression bomb

A compression bomb is a tiny archive, only tens of KB in size, that behaves like a bomb when decompressed: it nests layer upon layer like Russian dolls and explodes into millions of GB of source files.

A file called 42.zip is only 42 KB to begin with; fully decompressed (the password is 42), it expands to a full 4.5 PB.

When unpacked, 42.zip produces 16 archives, each containing 16 identical archives. After five such layers you end up with 16 to the power of 5, or 1,048,576 files, each 4.3 GB in size. The final 1,048,576 × 4.3 GB is about 4.5 PB, which no ordinary computer can hold.
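
The nesting arithmetic, written out as a quick sanity check (file counts and sizes as quoted above):

```python
# The doll-nesting arithmetic behind 42.zip.
layers = 5                        # 16 archives nested inside 16 archives, 5 levels deep
bottom_files = 16 ** layers       # 1,048,576 files at the innermost level
size_per_file_gb = 4.3

total_gb = bottom_files * size_per_file_gb
total_pb = total_gb / 1_000_000   # decimal GB -> PB

print(bottom_files)               # 1048576
print(round(total_pb, 2), "PB")   # about 4.51 PB -- the "roughly 4.5 PB" quoted above
```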

Attached is a download link for 42.zip (use with caution): bufflet.dk/

There is something even more powerful:

droste.zip

It is even smaller, only 28 KB, but once decompressed it nests endlessly: decompressing it yields an identical archive, which decompresses again into another identical archive, and so on… until the hard drive simply blows up.

The idea behind this file is that it outputs itself as its own result.

According to the information entropy formula developed by Claude Elwood Shannon, the founder of information theory, however, there is a limit to how much a file can be compressed.

Although the two files above have huge compression ratios, their information entropy is actually very low: they contain a great deal of deliberately repeated data, and since compression is a process of removing redundancy, the archives can be made extremely small.
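
A rough illustration of the point: the Shannon entropy H = -Σ p(x)·log2 p(x) of deliberately repetitive data is close to zero bits per byte, so almost nothing remains once redundancy is removed, while data with no redundancy sits at the 8-bits-per-byte ceiling. A minimal sketch in Python:

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy H = -sum(p * log2(p)) over the byte distribution."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

repetitive = b"A" * 1_000_000            # the kind of content a zip bomb is built from
mixed = bytes(range(256)) * 4096         # every byte value occurs equally often

print(entropy_bits_per_byte(repetitive)) # ~0 bits/byte -> compresses to almost nothing
print(entropy_bits_per_byte(mixed))      # 8.0 bits/byte -> no redundancy left to remove
```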

What about highly compressed files that do not rely on deliberately duplicated data?

A 3D film, "Comet Strikes Earth", made around 2000, demonstrates an astonishing compression ratio.

Baidu Netdisk link for "Comet Strikes Earth" (help yourself): pan.baidu.com/s/1sj8Z7p7

Rendered out, the film is about 15 GB; compressed, it is only 64 KB, roughly 250,000 times smaller.

Warez, the group that produced the film, achieved this with a single 64 KB .exe file: when run, it calls on the graphics card, CPU, and memory to render the film in real time, frame by frame, on the spot.

Compression like this is also used in some games, such as Cyberpunk 2077, which is why you need a well-equipped computer to run them.

Going back to files like 42.zip and droste.zip: what are they actually for?

Around the year 2000, these compressed files were actually used to attack other people's computers.

Some virus writers took advantage of the fact that anti-virus software scans inside compressed files, and sent compression bombs to the target's computer along with the virus.

The bomb itself is small and easy to deliver, but scanning it takes a long time, and the virus attacks the computer while the anti-virus software is busy working through 4.5 PB of files one by one.

Today, of course, anti-virus software is smart enough to see through this kind of compression bomb.


❤️ Thank you

That is all for this share; I hope it helps you.

Don't forget to share, like, and bookmark what you find useful.

Welcome to follow the ELab team's official account for more good articles~