This article was first published on: Walker AI

With the development of digital audio technology, music copyright has attracted more and more attention, and audio copyright protection has become an active research area; inaudible (silent) audio watermarking is one such technology. At the same time, online conferencing over the Internet has become increasingly popular, and inaudible audio watermarking can also help keep conference content confidential and trace the source of leaks.

Because the human auditory system (HAS) is extremely sensitive and audio carries little perceptual redundancy, it is very difficult for a watermark to be both imperceptible and robust. Meanwhile, with the arrival of the Internet era, lossy codecs such as MP3 have become the mainstream way to compress digital audio thanks to their excellent compression ratio and sound quality. Since MP3 encoding is lossy, the watermark information can be destroyed by compression, which makes audio watermarking more challenging than image watermarking.

The goal of this algorithm is to provide an adaptive mixed-domain audio watermark embedding method that can embed more watermark information into the same audio signal while preserving imperceptibility, thereby improving the watermark's resistance to cropping attacks to a certain extent.

1. Basic knowledge

1.1 Quantization

Values are mapped onto a number line, which is then divided into intervals according to a quantization factor (also called the step size), and each interval is assigned a bit value. Watermarking algorithms usually convert the information to be embedded into a binary code; the original information can be an image, text, and so on. Assuming the quantization factor is δ, the interval from 0 to δ represents 0 and the interval from δ to 2δ represents 1, and so on, alternating: -δ to 0 represents 1, 0 to δ represents 0, δ to 2δ represents 1, 2δ to 3δ represents 0, and 3δ to 4δ represents 1.
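To make this interval-to-bit mapping concrete, here is a minimal Python sketch; the function name `interval_bit` and the example values are illustrative, not taken from the original article:

```python
import math

def interval_bit(value: float, delta: float) -> int:
    # Bit represented by the quantization interval that contains `value`;
    # intervals of width delta alternate between 0 and 1.
    return math.floor(value / delta) % 2

# With delta = 1.0:
#   interval_bit(0.3, 1.0)  -> 0   (interval 0 .. delta)
#   interval_bit(1.7, 1.0)  -> 1   (interval delta .. 2*delta)
#   interval_bit(-0.4, 1.0) -> 1   (interval -delta .. 0)
```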

1.2 Masking effect

Masking is an effect that occurs in the human auditory system (HAS): for a short time, a high-energy sound drowns out a low-energy one, so the listener hears only the high-energy part. Depending on where the high- and low-energy parts occur in time, masking can be divided into pre-masking, simultaneous masking, and post-masking. Pre-masking means the later part has higher energy than the earlier part, so only the later part is heard; the opposite case is post-masking. Simultaneous masking means the energy before and after is higher than that of the current part, so the current part is not heard and is masked by the surrounding sound.

1.3 MP3 compression

MP3 compression causes a time-domain offset and changes in frequency-domain amplitude.

After MP3 lossy compression, the time-domain signal shows not only amplitude changes but also a shift along the time axis. Because the overlapped transform (MDCT) used in MP3 compression pads the first and last frames with zeros, an edge effect appears, and this padding is carried into the decoded audio; in other words, compression produces an offset in the time sequence.

In the frequency domain, MP3 exploits the masking effect of sound: quantization noise is kept below the masking threshold and the perceptually redundant parts of the audio are removed. As a result, high frequencies change greatly after compression while low frequencies change relatively little.

1.4 DWT transform

(1) A wavelet transform yields a low-frequency approximation coefficient array and a high-frequency detail coefficient array.
(2) The wavelet transform can be applied multiple times (orders); the maximum order is level = log2(n), where n is the number of time-domain sampling points.
(3) After a multi-order transform, level + 1 coefficient arrays are obtained.
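A minimal sketch of this multi-order DWT using PyWavelets is shown below; the `db4` wavelet family is an assumption, since the article does not name the wavelet used:

```python
import numpy as np
import pywt

signal = np.random.randn(1024)                 # hypothetical time-domain segment
coeffs = pywt.wavedec(signal, 'db4', level=4)  # multi-order (level = 4) DWT
approx, details = coeffs[0], coeffs[1:]        # 1 approximation + 4 detail arrays
print(len(coeffs))                             # level + 1 = 5 coefficient arrays
```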

1.5 Preprocessing

Since different audio files have different format parameters, we first uniformly convert every file into a 44.1 kHz, 16-bit stereo WAV file and then read it into memory as two int16 arrays. The two int16 arrays represent the left and right channels, and the bit depth of 16 bits means that each sampling point occupies 16 bits.
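A minimal sketch of this reading step is shown below, assuming the file has already been converted to 44.1 kHz, 16-bit stereo WAV (the filename is hypothetical):

```python
import numpy as np
from scipy.io import wavfile

rate, data = wavfile.read('input.wav')     # data: shape (n_samples, 2), dtype int16
assert rate == 44100 and data.dtype == np.int16
left, right = data[:, 0], data[:, 1]       # the two int16 arrays (left/right channels)
```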

At the same time, the watermark information to be embedded is constructed as a grayscale image to increase the robustness of the system; the watermark information itself is represented in binary.
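As a sketch of that construction, the watermark image could be turned into a flat bit array as follows; the filename and the binarization threshold are assumptions:

```python
import numpy as np
from PIL import Image

img = np.array(Image.open('watermark.png').convert('L'))   # grayscale, shape (h, w)
bits = (img > 127).astype(np.uint8).flatten()               # h * w watermark bits
h, w = img.shape
```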

2. Implementation steps

2.1 Audio framing

A. Embedding unit

As mentioned above, to cope with the time offset introduced by MP3 compression, the energy of each embedded segment is calculated and low-energy segments are filtered out. Here we call each embedded segment an embedding unit, and each embedding unit is split lengthwise into two small regions: the embedding region and the positioning region.

For the DWT, each order of the transform yields approximation and detail coefficients. For a signal of length x, the maximum number of transforms (the order) level can be written as:


$$ level = \log_2 x $$

Because robustness is better at low frequencies, we embed in the region below 3 kHz and take level = 4.

To ensure robustness, the more sampling points that participate in embedding a bit, the better. We therefore introduce a constant α as the expansion factor of the embedding region; α can be 8, 16, 32, and so on.

We define a constant N as the number of sampling points (the length) of each small region, and call it the embedding length. The length of an embedding unit is then 2N, where:


$$ N = 2^4 \times \alpha $$

Here the DWT is of order 4, so the length FL of an embedding unit is:


$$ FL = 32 \times \alpha $$

If α is 8, embedding one information bit requires 256 sampling points, and 172 information bits can be embedded in one second of audio:


$$ \lfloor 44100 / 256 \rfloor = 172 $$
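The parameter values above can be reproduced with a few lines of arithmetic:

```python
alpha = 8
N = (2 ** 4) * alpha        # embedding length of each small region: 128 samples
FL = 2 * N                  # length of one embedding unit: 256 samples
fs = 44100
print(N, FL, fs // FL)      # 128 256 172  -> 172 information bits per second
```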

The embedding region carries the watermark bits. The positioning region is used to select the real embedding units and their order by computing the energy of each unit's positioning region, and it also provides a reference value for the embedding strength.

B. Embedding frame

Assuming the height and width of the watermark image are h and w respectively, the length of the watermark data is h × w.

Given that an embedding unit has length 2N, the minimum number of sampling points L needed to embed one complete watermark is:


$$ L = h \times w \times N \times 2 $$

Since our audio sampling rate is $f_s$ = 44100 Hz, the audio duration t required to embed one complete watermark is:


$$ t = L / f_s $$

The frame length is obtained by rounding t up to the nearest multiple of a constant n (n = 10 in this paper); in other words, embedding one complete watermark takes a whole number of n-second blocks. This guarantees that an embedding frame contains more units than are required for embedding and can store the complete watermark information.


$$ F_l = \lceil t / n \rceil \times n \times f_s $$
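A small sketch of this frame-length computation, using the same constants as the text (the 32 × 32 watermark in the example is hypothetical):

```python
import math

def frame_length(h, w, N=128, n=10, fs=44100):
    L = h * w * N * 2                   # minimum samples for one complete watermark
    t = L / fs                          # minimum duration in seconds
    return math.ceil(t / n) * n * fs    # round up to a whole number of n-second blocks

print(frame_length(32, 32))             # 262144 samples needed, t ~ 5.94 s -> 441000 samples (10 s)
```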

Next, the energy of the positioning region of every embedding unit in a frame is calculated and sorted from high to low. Since the sum of squared amplitudes would be inconveniently large, the sum of absolute amplitudes is used directly as the energy:


$$ E_s = \sum_{i=1}^{l} |S_i| $$

An energy threshold is set to filter out embedding units below the threshold. If the number of remaining embedding units is less than h × w, the current frame cannot hold a complete watermark and is skipped. The embedding units whose energy exceeds the threshold are sorted in descending order of energy, and h × w of them are selected as the real embedding units of the watermark; the high-energy parts are chosen because they are more robust.
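A minimal sketch of this energy-based selection; the data structure holding the units is an assumption:

```python
import numpy as np

def select_units(units, hw, threshold):
    # `units` is a list of (embedding_region, positioning_region) sample arrays;
    # the structure and names are hypothetical.
    energies = [np.sum(np.abs(loc)) for _, loc in units]          # absolute-sum energy
    idx = [i for i, e in enumerate(energies) if e > threshold]    # drop low-energy units
    if len(idx) < hw:
        return None                                               # frame skipped
    idx.sort(key=lambda i: energies[i], reverse=True)             # descending energy
    return idx[:hw]                                               # real embedding units
```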

C. Cepstrum coefficient

The sampling points of the embedding region and the positioning region are each transformed with a 4th-order wavelet transform to obtain the approximation coefficients, and then each is transformed with the complex cepstrum (CCEPS). This maps the widely varying values produced by the wavelet transform into a small range. Because the CCEPS coefficients fluctuate strongly at both ends, the stationary middle part is selected for embedding.
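Since the article does not give an implementation of the complex cepstrum, here is a simplified NumPy sketch of CCEPS and its inverse; unlike full implementations (e.g. MATLAB's `cceps`), it does not compensate the linear phase term:

```python
import numpy as np

def cceps(x):
    # Simplified complex cepstrum: ifft of log-magnitude plus unwrapped phase.
    X = np.fft.fft(x)
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    return np.real(np.fft.ifft(log_X))

def icceps(c):
    # Simplified inverse: exponentiate the log spectrum and transform back.
    return np.real(np.fft.ifft(np.exp(np.fft.fft(c))))
```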

2.2 Watermark Embedding

Through the steps above we obtain the complex cepstra of the DWT approximation coefficients of the embedding region and the positioning region, denoted $CCEPS_e$ and $CCEPS_l$ respectively.

Take the stationary middle part: let the length cut from each end be $l_0$ and the total CCEPS coefficient length be $l$, and compute the mean CCEPS coefficient of the two regions:


$$ mean_e = avg(CCEPS_e[l_0 : l - l_0]) $$

$$ mean_l = avg(CCEPS_l[l_0 : l - l_0]) $$

Set a global constant: the embedding strength β (between 0 and 1). The quantization step q of each embedding frame is then:


$$ q = mean_l \times \beta $$

High-energy regions are more robust, so the quantization step can be appropriately enlarged there to increase the embedding strength, while lower-energy regions should use a smaller quantization step to improve imperceptibility.

The mean CCEPS value of the embedding region is then quantized.

Let the quantized mean be $mean_e'$ and the watermark bit to be embedded be $w_i$:


$$ IQ(mean_e) = \lfloor mean_e / q \rfloor \times q + q / 2 $$

$$ mean_e' = \begin{cases} IQ(mean_e) + q, & \text{if } IQ(mean_e) \text{ does not represent } w_i \\ IQ(mean_e), & \text{if } IQ(mean_e) \text{ represents } w_i \end{cases} $$
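A minimal sketch of this quantization rule, assuming the bit represented by an interval is given by its parity as in Section 1.1:

```python
import math

def quantize_mean(mean_e: float, w_i: int, q: float) -> float:
    # QIM-style quantization of the embedding-region CCEPS mean (sketch).
    k = math.floor(mean_e / q)
    iq = k * q + q / 2                  # IQ(mean_e): centre of the current interval
    # If the interval's bit disagrees with the watermark bit, move to the next interval.
    return iq if k % 2 == w_i else iq + q

# e.g. with q = 0.5: quantize_mean(0.8, 1, 0.5) -> 0.75, quantize_mean(0.8, 0, 0.5) -> 1.25
```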

The ratio of the mean after embedding to the mean before embedding gives the scaling of the values, so let the scaling factor f be:


$$ f = mean_e' / mean_e $$

The CCEPS coefficients of the embedding region after embedding are then:


$$ CCEPS_e'[i] = \begin{cases} CCEPS_e[i] \times f, & \text{if } i \in [l_0, l - l_0] \\ CCEPS_e[i], & \text{otherwise} \end{cases} $$

Applying the inverse complex cepstrum transform ICCEPS to the embedded coefficients gives the DWT approximation coefficients after embedding:


$$ A_c' = ICCEPS(CCEPS_e') $$

The inverse wavelet transform IDWT then yields the audio signal containing the watermark; the positioning region does not need to be transformed. The embedded signal $S'$ of an embedding frame is therefore:


$$ S' = \begin{cases} IDWT(A_c'), & \text{if } l \text{ lies in an embedding region} \\ S_l, & \text{otherwise} \end{cases} $$

The resulting $S'$ is the signal after watermark embedding. After each embedding frame is processed, the embedded signal is written back to the file, which yields the audio file containing the watermark.
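Putting the embedding steps together, here is a sketch of embedding one bit into one embedding unit; the wavelet family, the default parameter values, the ordering of the two regions within a unit, and the `abs()` safeguard on the step size are assumptions rather than details from the article:

```python
import numpy as np
import pywt

def embed_bit(unit, w_i, beta=0.1, level=4, l0=5, wavelet='db4'):
    # `unit` is a NumPy array of samples; first half = embedding region,
    # second half = positioning region (assumed ordering). l0 must stay small
    # relative to the length of the approximation coefficients.
    n = len(unit) // 2
    emb, loc = unit[:n].astype(float), unit[n:].astype(float)

    def cceps(x):       # simplified complex cepstrum, as sketched earlier
        X = np.fft.fft(x)
        return np.real(np.fft.ifft(np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))))

    def icceps(c):      # simplified inverse complex cepstrum
        return np.real(np.fft.ifft(np.exp(np.fft.fft(c))))

    coeffs = pywt.wavedec(emb, wavelet, level=level)
    ce = cceps(coeffs[0])                                    # CCEPS_e of the approximation
    cl = cceps(pywt.wavedec(loc, wavelet, level=level)[0])   # CCEPS_l of the positioning region

    mean_e = np.mean(ce[l0:len(ce) - l0])
    mean_l = np.mean(cl[l0:len(cl) - l0])
    q = abs(mean_l) * beta                                   # adaptive quantization step

    k = int(np.floor(mean_e / q))
    mean_e_new = k * q + q / 2                               # IQ(mean_e)
    if k % 2 != w_i:                                         # interval parity encodes the bit
        mean_e_new += q
    f = mean_e_new / mean_e                                  # scaling factor

    ce[l0:len(ce) - l0] *= f                                 # scale the stationary part only
    coeffs[0] = icceps(ce)                                   # embedded approximation coefficients
    emb_new = pywt.waverec(coeffs, wavelet)[:n]              # inverse DWT

    return np.concatenate([emb_new, loc])                    # positioning region unchanged
```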

Several constants appear in the process above: the embedding strength factor β, the DWT order level, and the expansion factor α. Their values affect the robustness and imperceptibility of the algorithm.

The larger the quantization step, the better the robustness, but also the lower the imperceptibility; the quantization step can therefore also be called the quality coefficient. In this paper the step size changes dynamically with the energy of the positioning region, so the embedding strength factor β directly controls the sound quality after embedding.

The expansion factor α affects the embedding capacity, imperceptibility, and robustness. The larger α is, the better the robustness and imperceptibility, but the more sampling points are needed to embed a single bit.

2.3 Watermark Extraction

At the extraction end, the file is resampled as described above to obtain the same sampling rate and bit depth, frames are filtered by energy, and the embedding units and their order are identified.

For each embedding unit, the DWT approximation coefficients and their CCEPS are computed, the mean of the stationary part of the positioning region's CCEPS is obtained, and the quantization step q is recovered using the embedding strength factor β.

The mean of the stationary part of the embedding region's CCEPS is quantized to obtain the embedded information bit, and all the bits in a frame are extracted as the binary watermark data. Finally, the data is reshaped into an h × w image, recovering the grayscale image of the original watermark.
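A matching sketch of the per-unit extraction step, mirroring the embedding sketch above (parameter values must match those used at embedding time and are assumptions):

```python
import numpy as np
import pywt

def extract_bit(unit, beta=0.1, level=4, l0=5, wavelet='db4'):
    # Recover one watermark bit from an embedding unit (same layout as in embed_bit).
    n = len(unit) // 2
    emb, loc = unit[:n].astype(float), unit[n:].astype(float)

    def cceps(x):       # simplified complex cepstrum, as sketched earlier
        X = np.fft.fft(x)
        return np.real(np.fft.ifft(np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))))

    ce = cceps(pywt.wavedec(emb, wavelet, level=level)[0])
    cl = cceps(pywt.wavedec(loc, wavelet, level=level)[0])

    mean_e = np.mean(ce[l0:len(ce) - l0])
    q = abs(np.mean(cl[l0:len(cl) - l0])) * beta    # same adaptive step as at embedding
    return int(np.floor(mean_e / q)) % 2            # interval parity is the embedded bit

# Collecting h * w extracted bits and reshaping them to (h, w) yields the watermark image.
```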

Conclusion

Compared with the original audio, embedding has almost no effect on sound quality. The watermark image can still be recovered after MP3 transcoding, resampling, cropping, and time shifting, so the algorithm is considered highly robust. It is a blind watermarking scheme: the original file is not needed when extracting the watermark.


PS: For more technical content, follow the official account [xingzhe_ai] and join the discussion with Walker AI!