Artificial intelligence and machine learning may seem lofty, but in fact they rest on the support of mathematics. Among the relevant mathematics, the two most important parts are undoubtedly algebra and probability theory. I can't cover algebra (especially matrix theory) and probability theory in their entirety in this blog, but I can present some of their interesting and important parts.

In this article, we will talk about the SVD (singular value decomposition) of matrices.

Some matrix knowledge

So let's start with some basic facts about matrices.

Transpose and conjugate transpose

The transpose is the simplest matrix transformation. In short, if $M$ is an $m \times n$ matrix, then its transpose $M^T$ is an $n \times m$ matrix, and $(M^T)_{ij} = M_{ji}$.

So taking the transpose of a matrix amounts to flipping the matrix along its main diagonal; at the same time, it is not hard to see that $(M^T)^T = M$.

The conjugate transpose is probably the second-simplest matrix transformation; it simply superposes complex conjugation on top of the transpose. Therefore, if we write $M^H$ for the conjugate transpose of the matrix $M$, then $(M^H)_{ij} = \overline{M_{ji}}$.
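As a quick sanity check, here is a minimal numpy sketch (the matrix values are arbitrary, chosen only for illustration) showing the transpose, the conjugate transpose, and the fact that transposing twice returns the original matrix.

```python
import numpy as np

# An arbitrary complex matrix, just for illustration.
M = np.array([[1 + 2j, 3 + 0j],
              [0 + 0j, 4 - 1j],
              [5j,     6 + 0j]])

print(M.T)                     # transpose: flip along the main diagonal, shape (2, 3)
print(M.conj().T)              # conjugate transpose M^H: transpose plus complex conjugation
print(np.allclose(M.T.T, M))   # True: (M^T)^T == M
```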

The unitary matrix

A unitary matrix is a special square matrix $U$ which satisfies

$$U U^H = U^H U = I_n.$$

It is easy to see that unitary matrices are in fact generalized orthogonal matrices: when all the elements of a unitary matrix are real numbers, it is an orthogonal matrix. On the other hand, because $U U^H = I_n$, a unitary matrix satisfies $U^H = U^{-1}$; in fact, this is a necessary and sufficient condition for a matrix to be unitary.

Normal matrix

Like the unitary matrix, the normal matrix is a special square matrix; it is required to commute with its conjugate transpose under matrix multiplication. That is to say, a matrix $M$ is called a normal matrix if it satisfies:

$$M M^H = M^H M.$$

Obviously, unitary matrices (with complex coefficients) and orthogonal matrices (with real coefficients) are normal matrices. Just as obviously, normal matrices are not limited to unitary or orthogonal matrices. For example, the matrix $M = \begin{bmatrix} 1 & 1 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \end{bmatrix}$ is a normal matrix, but it is certainly not unitary or orthogonal, because

$$M M^H = \begin{bmatrix} 2 & 1 & 1 \\ 1 & 2 & 1 \\ 1 & 1 & 2 \end{bmatrix} = M^H M.$$
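A quick numpy check of this claim, using the example matrix as reconstructed above (a minimal sketch; any matrix that commutes with its conjugate transpose would do):

```python
import numpy as np

# The example matrix above: normal, but neither unitary nor orthogonal.
M = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

print(M @ M.conj().T)                               # [[2. 1. 1.] [1. 2. 1.] [1. 1. 2.]]
print(np.allclose(M @ M.conj().T, M.conj().T @ M))  # True: M is normal
print(np.allclose(M @ M.conj().T, np.eye(3)))       # False: M is not orthogonal/unitary
```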

Spectral theorem and spectral decomposition

Diagonalization of matrices is an important topic in linear algebra. The spectral theorem gives a conclusion about the diagonalization of square matrices: if the matrix $M$ is a normal matrix, then there exist a unitary matrix $U$ and a diagonal matrix $\Lambda$ such that

$$M = U \Lambda U^H.$$

That is to say, a normal matrix can be reduced to a diagonal matrix by a unitary transformation; this method of matrix decomposition is called spectral decomposition.
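As a small illustration (a sketch using a real symmetric matrix, which is one special case of a normal matrix), numpy's `eigh` returns exactly the unitary $U$ and diagonal $\Lambda$ of the spectral decomposition:

```python
import numpy as np

# Spectral decomposition of a real symmetric (hence normal) matrix.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, U = np.linalg.eigh(M)                      # eigenvalues and an orthonormal eigenbasis
Lam = np.diag(lam)
print(np.allclose(U @ Lam @ U.conj().T, M))     # True: M = U Λ U^H
print(np.allclose(U @ U.conj().T, np.eye(2)))   # True: U is unitary (orthogonal here)
```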

SVD decomposition

The spectral theorem gives the possibility and the form of the decomposition of a normal matrix. However, normality is a very demanding requirement on a matrix, so the spectral theorem is a fairly weak theorem with a limited scope of application. In actual practice, many of the matrices we encounter are not normal, and for these the spectral theorem fails. As a generalization of the spectral theorem, the SVD decomposition places much weaker requirements on the original matrix.

The SVD decomposition says: suppose $M$ is an $m \times n$ matrix whose elements all belong to the field $K$ (the real field $\mathbb{R}$ or the complex field $\mathbb{C}$). Then there exist an $m \times m$ unitary matrix $U$ and an $n \times n$ unitary matrix $V$ such that

$$M = U \Sigma V^H,$$

where $\Sigma$ is an $m \times n$ nonnegative real diagonal matrix, and the elements on its diagonal are the singular values of $M$. In general, we prefer to order these singular values from largest to smallest, so that $\Sigma$ is uniquely determined by $M$.

On the other hand, because $U$ and $V$ are unitary matrices, their column vectors form orthonormal bases of $K^m$ and $K^n$ respectively. We call the column vectors $\vec{u}_i$ of $U$ the left singular vectors, and the column vectors $\vec{v}_i$ of $V$ the right singular vectors; at the same time, we call the $i$-th diagonal element $\sigma_i$ of $\Sigma$ a singular value. With this notation, the SVD decomposition actually writes the matrix $M$ as the sum

$$M = \sum_{i=1}^{\min(m, n)} \sigma_i \, \vec{u}_i \vec{v}_i^T.$$
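This sum-of-rank-one-terms form is easy to verify numerically; below is a minimal sketch (the matrix is arbitrary) that rebuilds $M$ from the terms $\sigma_i \vec{u}_i \vec{v}_i^T$ returned by `np.linalg.svd`:

```python
import numpy as np

# An arbitrary 2x3 matrix, just for illustration.
M = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vh = np.linalg.svd(M)   # s holds the singular values in descending order
terms = [s[i] * np.outer(U[:, i], Vh[i, :]) for i in range(len(s))]
print(np.allclose(sum(terms), M))   # True: the sum of the rank-1 terms recovers M
```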

Calculation method of SVD

With the introduction to SVD in hand, the most natural next question is how to compute the SVD of a matrix. Let's explore the problem in several steps.

SVD and eigenvalues

Now, suppose the SVD decomposition of the matrix $M$ is

$$M = U \Sigma V^H;$$

So, we have

$$M M^H = U \Sigma V^H V \Sigma^H U^H = U (\Sigma \Sigma^H) U^H, \qquad M^H M = V \Sigma^H U^H U \Sigma V^H = V (\Sigma^H \Sigma) V^H.$$

That is to say, the column vectors of $U$ (the left singular vectors) are eigenvectors of $M M^H$; at the same time, the column vectors of $V$ (the right singular vectors) are eigenvectors of $M^H M$. On the other hand, the singular values of $M$ (the non-zero diagonal elements of $\Sigma$) are the square roots of the non-zero eigenvalues of $M M^H$ (or, equivalently, $M^H M$).

How is SVD calculated

With this knowledge, we can calculate the SVD decomposition of any matrix by hand. Specifically, the algorithm is as follows (a small Python sketch of the procedure follows the list):

  1. Calculate $M M^H$ and $M^H M$;
  2. Calculate the eigenvalues and eigenvectors of $M M^H$ and of $M^H M$ separately;
  3. Form $U$ from the eigenvectors of $M M^H$, and $V$ from the eigenvectors of $M^H M$;
  4. Take the square roots of the non-zero eigenvalues of $M M^H$ (or $M^H M$) and fill them, in positions matching the corresponding eigenvectors, into the diagonal elements of $\Sigma$.
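Here is a small Python sketch of this hand-calculation recipe (for illustration only; in practice one would simply call `np.linalg.svd`, which is far more robust numerically). One practical detail: computing $U$ and $V$ from two independent eigendecompositions leaves the signs of the eigenvectors inconsistent, so the sketch recovers $U$ from $M \vec{v}_i = \sigma_i \vec{u}_i$ instead.

```python
import numpy as np

def svd_via_eig(M):
    """Naive SVD following the recipe above; for illustration only."""
    m, n = M.shape
    # Steps 1-2: eigendecomposition of the Hermitian matrix M^H M.
    eigvals, V = np.linalg.eigh(M.conj().T @ M)
    # Order the eigenpairs from largest to smallest eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, V = eigvals[order], V[:, order]
    # Step 4: singular values are the square roots of the (non-negative) eigenvalues.
    sigma = np.sqrt(np.clip(eigvals, 0.0, None))
    # Step 3: recover the left singular vectors from M v_i = sigma_i u_i.
    U = np.zeros((m, m), dtype=M.dtype)
    for i in range(min(m, n)):
        if sigma[i] > 1e-12:
            U[:, i] = (M @ V[:, i]) / sigma[i]
    # (Columns of U belonging to zero singular values are left as zero here;
    # a full implementation would complete them to an orthonormal basis.)
    Sigma = np.zeros((m, n))
    Sigma[:min(m, n), :min(m, n)] = np.diag(sigma[:min(m, n)])
    return U, Sigma, V.conj().T

# Quick check against numpy's built-in SVD on an arbitrary matrix.
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
U, S, Vh = svd_via_eig(M)
print(np.allclose(U @ S @ Vh, M))                                   # True
print(np.allclose(np.diag(S), np.linalg.svd(M, compute_uv=False)))  # True
```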

Let’s actually do it

Now, let's try to compute a singular value decomposition by hand. To do so, we need to multiply the matrix by its conjugate transpose on the left and on the right; here we mainly take $W = M M^H$ as the example, with

$$M = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

First, we need to calculate $W = M M^H$:

$$W = M M^H = \begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 1 & 0 & 0 \\ 4 & 3 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 20 & 14 & 0 & 0 \\ 14 & 10 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}.$$

Now, let's find the eigenvalues and eigenvectors of $W$. By definition, $W \vec{x} = \lambda \vec{x}$, so $(W - \lambda I) \vec{x} = \vec{0}$. That is to say,

$$\begin{bmatrix} 20 - \lambda & 14 & 0 & 0 \\ 14 & 10 - \lambda & 0 & 0 \\ 0 & 0 & -\lambda & 0 \\ 0 & 0 & 0 & -\lambda \end{bmatrix} \vec{x} = \vec{0}.$$

According to the theory of systems of linear equations, for this equation to have a non-zero solution, the determinant of the coefficient matrix must be 0. That is,

$$\begin{vmatrix} 20 - \lambda & 14 & 0 & 0 \\ 14 & 10 - \lambda & 0 & 0 \\ 0 & 0 & -\lambda & 0 \\ 0 & 0 & 0 & -\lambda \end{vmatrix} = \begin{vmatrix} 20 - \lambda & 14 \\ 14 & 10 - \lambda \end{vmatrix} \cdot \begin{vmatrix} -\lambda & 0 \\ 0 & -\lambda \end{vmatrix} = 0,$$

which is $\lambda^2 \left[ (20 - \lambda)(10 - \lambda) - 196 \right] = \lambda^2 (\lambda^2 - 30\lambda + 4) = 0$; solving it gives $\lambda_1 \approx 29.87$, $\lambda_2 \approx 0.13$, $\lambda_3 = \lambda_4 = 0$. Substituting the eigenvalues back into the original equation, we obtain the corresponding eigenvectors. These eigenvectors, taken as column vectors, form the matrix

$$U = \begin{bmatrix} -0.82 & -0.58 & 0 & 0 \\ -0.58 & 0.82 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$

In the same way, we obtain (note that $M M^H$ and $M^H M$ have the same non-zero eigenvalues)

$$V = \begin{bmatrix} -0.40 & -0.91 \\ -0.91 & 0.40 \end{bmatrix}.$$

The diagonal elements of $\Sigma$ are the arithmetic square roots of the eigenvalues of $W$, so we have

$$\Sigma = \begin{bmatrix} 5.46 & 0 \\ 0 & 0.37 \\ 0 & 0 \\ 0 & 0 \end{bmatrix}.$$

So we have the (numerically approximate) SVD decomposition of the matrix $M$:

$$\begin{bmatrix} 2 & 4 \\ 1 & 3 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \approx \begin{bmatrix} -0.82 & -0.58 & 0 & 0 \\ -0.58 & 0.82 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 5.46 & 0 \\ 0 & 0.37 \\ 0 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} -0.40 & -0.91 \\ -0.91 & 0.40 \end{bmatrix}.$$
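As a quick numerical cross-check (a sketch; numpy may flip the sign of individual singular vectors, which does not change the product), `np.linalg.svd` agrees with the hand calculation:

```python
import numpy as np

M = np.array([[2.0, 4.0], [1.0, 3.0], [0.0, 0.0], [0.0, 0.0]])
U, s, Vh = np.linalg.svd(M)

print(np.round(s, 2))          # [5.46 0.37]
print(np.round(U[:, :2], 2))   # matches the first two columns of U above, up to sign
print(np.round(Vh, 2))         # matches V^H above, up to sign
print(np.allclose((U[:, :2] * s) @ Vh, M))  # True
```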

Geometric intuition

Let's start with an example. Suppose $M$ is an $m \times n$ matrix, and $\vec{x}$ is a vector in the linear space $K^n$; then

$$\vec{y} = M \cdot \vec{x}$$

is a vector in the linear space $K^m$. So the matrix $M$ corresponds to a transformation from $K^n$ to $K^m$; specifically, the map $\vec{x} \mapsto M \vec{x}$.

That is to say, in linear algebra, any matrix can be viewed as a transformation; in this way, matrices and transformations are unified.

Rotation transformation and reflection transformation (mirror image)

A rotation in a linear space changes the direction of a vector, but it does not change the length or the chirality of the vector. Now suppose the matrix $M$ corresponds to a rotation transformation in a linear space, and let's see what it must look like.

First of all, let's think about the dot product. Rotation does not change the length of a vector, and the angle between two vectors remains the same after both are rotated. So if $M$ corresponds to a rotation transformation, there must be

$$\vec{a} \cdot \vec{b} = (M \vec{a}) \cdot (M \vec{b}),$$

That is

$$\vec{a}^T \vec{b} = (M \vec{a})^T (M \vec{b}) = \vec{a}^T M^T M \vec{b},$$

That is to say, $M^T M = I$, i.e., $M$ is an orthogonal matrix.

So, in two dimensions, $M$ can be written either as $\begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix}$ or as $\begin{bmatrix} \cos\varphi & \sin\varphi \\ \sin\varphi & -\cos\varphi \end{bmatrix}$. The determinant of the former is $+1$, and the determinant of the latter is $-1$; since $M$ is an orthogonal matrix, its determinant must be $\pm 1$. Now the question is: which determinant, $+1$ or $-1$, gives the rotation? Or are they both rotations?

Going back, we need to notice two things. First, in the definition of rotation, we require that rotation preserves chirality; second, we did not use this chirality invariance when we deduced that a rotation matrix must be orthogonal, because the dot product only captures lengths and angles.

In fact, if the determinant of $M$ is $-1$, then the matrix corresponds to an improper rotation: first a rotation, and then a mirror reflection across a straight line. Taken together, this improper rotation is essentially just a mirror reflection across a straight line.

So we say that the rotation matrix is the orthogonal matrix whose determinant is $+1$, and its form is

$$\begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix},$$

which represents a rotation by $\varphi$ in the positive (usually counterclockwise) direction. For the orthonormal basis $\vec{e}_1 = (1, 0)^T$, $\vec{e}_2 = (0, 1)^T$ of $\mathbb{R}^2$, the two basis vectors are transformed to $(\cos\varphi, \sin\varphi)^T$ and $(-\sin\varphi, \cos\varphi)^T$ respectively.

The orthogonal matrix whose determinant is $-1$ has the form

$$\begin{bmatrix} \cos\varphi & \sin\varphi \\ \sin\varphi & -\cos\varphi \end{bmatrix},$$

which is a mirror reflection across the line $y = x \tan(\varphi / 2)$.

Scaling transformation

To scale in a linear space is, in essence, to let each dimension of the space stretch independently of the other dimensions. So, obviously, the corresponding matrix is a diagonal matrix.

Geometric interpretation of SVD decomposition

Now go back to SVD decomposition

$$M = U \Sigma V^H,$$

In the real-number case, this essentially decomposes a complicated transformation $M$ into three transformations: a rotation/mirror, a scaling, and another rotation/mirror.

Without loss of generality, let's assume that $U$ and $V^H$ are both rotation matrices; then the process can be pictured as follows.

It's not hard to see that $V^H$ first rotates the (possibly degenerate) unit sphere (it rotates the orthonormal basis); then $\Sigma$ scales and stretches the unit sphere into an ellipse (a hyperellipsoid in higher dimensions); and finally $U$ rotates this hyperellipsoid in the target space. The length of each semi-axis of the hyperellipsoid is a singular value of the matrix $M$, that is, a diagonal element of the matrix $\Sigma$.
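A tiny numerical illustration of this picture (the 2×2 matrix is arbitrary): the SVD factors of a real matrix have determinant $\pm 1$, i.e. they are rotations or mirrors, and the singular values are the semi-axis lengths of the ellipse that $M$ maps the unit circle to.

```python
import numpy as np

# An arbitrary 2x2 matrix, just for illustration.
M = np.array([[1.0, 2.0],
              [0.0, 3.0]])

U, s, Vh = np.linalg.svd(M)
print(round(np.linalg.det(U), 3), round(np.linalg.det(Vh), 3))  # each is +1 (rotation) or -1 (mirror)
print(np.round(s, 3))                        # semi-axis lengths of the image of the unit circle
print(np.allclose(U @ np.diag(s) @ Vh, M))   # True: rotation/mirror · scaling · rotation/mirror
```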

Application of SVD decomposition

In chemistry, there is a saying that "structure determines properties, and properties determine applications". It reflects how the characteristics of a thing work from the inside out, and the general pattern by which people put things to use. The same holds true in mathematics.

SVD decomposes the matrix into a sum, where the coefficient of each term is a singular value of the original matrix. According to the earlier geometric interpretation, these singular values are the lengths of the semi-axes of a hyperellipsoid in space. Now imagine a very flat ellipse in a two-dimensional plane (one with very high eccentricity), whose major axis is so much longer than its minor axis that the whole ellipse looks indistinguishable from a line segment. If the minor axis of this ellipse is forcibly set to zero, it is intuitively clear that the degeneration of the ellipse into a line segment is not abrupt. In the SVD, the larger singular values reflect the main features and information of the matrix itself, while the small singular values, like the very short minor axis of the ellipse, carry almost none of the matrix's characteristics and information. Therefore, forcing the small singular values in the SVD to zero amounts to discarding some unimportant information in the matrix.

Therefore, SVD decomposition has at least two functions:

  • Analysis: understand the main characteristics of the original matrix and the information it carries (keep a number of the largest singular values), which leads to principal component analysis (PCA);
  • Discarding: ignore the secondary characteristics of the original matrix and the secondary information it carries (drop several of the smaller singular values), which leads to topics such as lossy compression and low-rank matrix approximation.

These two applications are actually dual: by ordering the singular values by importance, on the one hand we know which information is important, and on the other hand we can naturally discard the unimportant parts. Here we take lossy compression as an example.

In real life and work, a lot of information can be represented in matrix form, for example image data (see PIL concise tutorial – Pixel manipulation and Image Filters) and the huge feature matrices in machine learning tasks. Here, following the earlier thread, we use image data to visually show how SVD decomposition applies to image compression.

First let’s look back at some cats we’ve seen. They look like this:

After SVD decomposition, the singular values of the three RGB channels look like this (code):
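The original post links to the script it used; below is a hedged sketch of what such a computation might look like with PIL and numpy (the filename 'cat.jpg' is a placeholder for whichever image you use), followed by the original output.

```python
import numpy as np
from PIL import Image

# Placeholder filename; use your own image here.
img = np.asarray(Image.open('cat.jpg').convert('RGB'), dtype=float)

for c, name in enumerate('RGB'):
    channel = img[:, :, c]                        # one height x width matrix per channel
    s = np.linalg.svd(channel, compute_uv=False)  # singular values only, in descending order
    print(name, s[:4], '...', s[-1])
```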

[8.29754663e+04 1.43568761e+04 8.28098602e+03 7.94075752e+03
 6.87204550e+03 4.64118946e+03 3.07326587e+03 2.64043230e+03
 2.34251575e+03 2.08293043e+03 1.81457650e+03 1.73772694e+03
 1.55535238e+03 1.44987605e+03 1.28556279e+03 1.18657598e+03
 1.15156737e+03 1.10588319e+03 1.04069060e+03 9.63555279e+02
 ...
 2.07308001e+00 2.03810704e+00 2.01670137e+00 1.89766075e+00
 1.78169821e+00]

[7.52035286e+04 1.45096769e+04 1.02416708e+04 7.99187399e+03
 5.55763091e+03 4.82795595e+03 3.22590281e+03 2.81678573e+03
 2.47269533e+03 2.05484885e+03 1.87922653e+03 1.67558281e+03
 1.55022246e+03 1.48494502e+03 1.30714569e+03 1.19338672e+03
 1.17078655e+03 1.07687752e+03 1.04558020e+03 9.93807772e+02
 ...
 2.08166328e+00 2.03020090e+00 1.95633445e+00 1.88738236e+00
 1.80539295e+00]

[7.15164941e+04 1.60372342e+04 1.20401757e+04 8.69602152e+03
 5.69604800e+03 3.76913288e+03 3.48390702e+03 3.17683272e+03
 2.73730517e+03 2.32005514e+03 2.08571764e+03 1.76733763e+03
 1.55393096e+03 1.47436741e+03 1.39202168e+03 1.21607022e+03
 1.17991116e+03 1.16377337e+03 1.01255317e+03 9.97811473e+02
 ...
 2.17604369e+00 2.13041080e+00 1.99837012e+00 1.88718778e+00
 1.80040166e+00]

When we truncate from large to small, discarding the smaller singular values, and then reconstruct the image, we get a "compressed" image.
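Here is a hedged sketch of such a truncated reconstruction (again with a placeholder filename and an arbitrary choice of k; only the k largest singular values of each channel are kept):

```python
import numpy as np
from PIL import Image

def truncate_channel(channel, k):
    """Low-rank approximation: keep only the k largest singular values of one channel."""
    U, s, Vh = np.linalg.svd(channel, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vh[:k, :]

# Placeholder filename and rank; reconstruct each RGB channel from its top-k terms.
img = np.asarray(Image.open('cat.jpg').convert('RGB'), dtype=float)
k = 20
compressed = np.stack([truncate_channel(img[:, :, c], k) for c in range(3)], axis=2)
Image.fromarray(np.clip(compressed, 0, 255).astype(np.uint8)).save('cat_rank20.jpg')
```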

When only 1 singular value is kept, the reconstructed image is as follows; we can barely see anything.

When only 5 singular values are taken, the reconstructed image is as follows. Just enough to make out the shape of a cat.

It is observed that around the 20th singular value, the magnitude of the singular values drops by an order of magnitude (from e+03 to e+02). When 20 singular values are kept, the reconstructed image is as follows; at this point, the image of the cat is already very clear.

When 50 singular values are taken, the reconstructed image is pretty close to the original image.

Similarly, we can observe the results of the reconstructed image with 100/200/300/400 singular values.


Your encouragement is my biggest motivation to write

As the saying goes, investing in efficiency is the best investment. If you feel that my articles are of good quality and that reading them pays off, perhaps improving your work efficiency by 10%, you are welcome to make a small donation so that I have the motivation to keep writing more good articles.