Arithmetic encoding does not simply map each source symbol to a code word, but instead assigns a code word to the entire input sequence, so on average each source symbol can be assigned a code word of length less than 1.

Arithmetic coding is simple to operate. Here is an example to explain the principle of arithmetic coding:

Suppose the source has a, B, C, D four symbols, probability is 0.2, 0.2, 0.4, 0.2, input sequence “ABCCD”.

symbol a b c d
The probability of 0.2 0.2 0.4 0.2
The initial interval [0,0.2) [0.2, 0.4) [0.4, 0.8) [0.8, 1)
  1. The first compressed symbol is “a”, whose initial interval is [0, 0.2];
  2. The second compressed symbol is “B”. Since the value range of the preceding symbol “A” is limited within the range of [0, 0.2), the value range of “B” should be within the sub-range of [0.2, 0.4) of the previous symbol interval [0, 0.2), so the actual coding range of “B” is between [0.04, 0.08).
  3. The third compressed symbol is “C”, whose coding value range should be within the sub-range of [0.4, 0.8) of [0.04, 0.08), so the actual coding range of “C” is between [0.056, 0.072).
  4. The fourth compressed symbol is “C”, whose coding value range should be within the sub-range of [0.4, 0.8) of [0.056, 0.072), so the actual coding range of “C” is between [0.0624, 0.0688).
  5. The fifth compressed symbol is “D”, whose coding value range should be within the sub-range of [0.8, 1) of [0.0624, 0.0688), so the actual coding range of “C” is between [0.06752, 0.0688).
  6. So far, the data sequence “ABCCD” has been described as a real number interval [0.06752, 0.0688], or any real value in this interval is unique to the data sequence. In this way, any real number in this interval can be used to represent the whole symbol sequence, such as 0.0688.

The whole process is shown below:

The decoding process is as follows:

Decoding decision Decoding interval Decoding symbols
0.0688 in [0,0.2) [0,0.2) a
0.0688 is in the fourth 1/10 interval of [0,0.2] [0.04, 0.08) b
0.0688 is in the seventh 1/10 range of [0.04,0.08) [0.056, 0.072) c
0.0688 is in the seventh 1/10 range of [0.056,0.072] [0.0624, 0.0688) c
0.0688 is in the 10th 1/10 range of [0.0624,0.0688] [0.06752, 0.0688) d

The above example assumes that both the encoder and the decoder know the length of the message, so the decoding process does not go on indefinitely.

Arithmetic coding has the following problems:

  • Actual computer accuracy cannot be infinite, so there will be overflow. This problem can be achieved by scaling.
  • The code word generated by encoding is a real number between [0,1). The decoder must receive all bits of the real number before decoding.
  • Arithmetic encoding is sensitive to errors, and if one bit is wrong the entire message is wrong.

If you are interested, please pay attention to wechat public account Video Coding