Photo: Christopher Burns on Unsplash

Getting started guide

Understand transformations in computer vision.

An overview of affine transformations and homographies

Contents

  • The affine transformation
  • Homography

– Homogeneous coordinates

– Pinhole camera model

– The homography equation

  • References

Transformations are an important part of computer vision, and understanding how they work lays the foundation for more advanced techniques.

Here, I will mainly introduce affine transformations and homographies.

The affine transformation.

The affine transformation is the simplest form of transformation. These transformations preserve the structure of space, in the sense that they satisfy the following properties.

  • Lines map to lines
  • Points map to points
  • Parallel lines remain parallel

Some familiar examples of affine transformations are translation, scaling, rotation, shearing, and reflection. In addition, any composition of these transformations (such as a rotation followed by a scaling) is itself another affine transformation.

Examples of basic affine transformations (image courtesy of author)

Instead of treating each kind of affine transformation as a separate case, it is more elegant to have a unified definition. So we turn to matrices, viewed as linear transformations, to define the affine transformation. If you’re not familiar with interpreting matrices as linear transformations of space, 3Blue1Brown has an excellent video on this topic.


In essence, we can think of an affine transformation as a linear transformation given by a matrix, followed by a translation.

In two dimensions, the equation for the affine transformation is as follows.

Equation of the affine transformation in two dimensions (image courtesy of author)

Here, the matrix represents some linear transformation on the vector with entries (_x1_, _x2_), such as a reflection, shear, rotation, scaling, or a combination of the four. It is worth noting that, for this to be an affine transformation, the matrix must be invertible, so its determinant is non-zero. The last step is to translate by the vector [_t1_, _t2_] to obtain the transformed vector [_y1_, _y2_].
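Written out explicitly (the entries _a11_ through _a22_ are generic labels for the matrix), the two-dimensional affine equation takes the following form:

$$
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
=
\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} t_1 \\ t_2 \end{bmatrix}
$$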

Examples of affine transformations in two dimensions (image courtesy of author)

The affine transformation can also be generalized to _n_ dimensions using the following equation.

The _n_-dimensional affine transformation equation

This transformation maps the vector _x_ to the vector _y_ by applying the linear transformation _A_ (where _A_ is an invertible _n_×_n_ matrix) and then translating by the vector _b_ (_b_ has dimension _n_×1).
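In matrix notation, the description above reads:

$$
y = A x + b, \qquad A \in \mathbb{R}^{n \times n} \ \text{invertible}, \quad b \in \mathbb{R}^{n \times 1}
$$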

In short, an affine transformation can be represented as a linear transformation followed by a translation, and affine transformations are very useful for modifying images in computer vision. In fact, image preprocessing relies heavily on affine transformations to scale, rotate, shift, and so on.
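As a minimal sketch of this idea in plain NumPy (not any particular library’s preprocessing API, and with arbitrary example values), here is the equation _y_ = _Ax_ + _b_ applied to a few 2D points, with a rotation as the linear part and a shift as the translation:

```python
import numpy as np

# Linear part A: rotation by 30 degrees; translation part b: a shift.
theta = np.deg2rad(30)
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # invertible, det(A) = 1
b = np.array([2.0, 1.0])

# A few 2D points, one per row.
points = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0]])

# Apply y = A x + b to every point (row vectors go through A.T).
transformed = points @ A.T + b
print(transformed)
```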

Homography.

A homography is a type of projective transformation in which a projection is used to relate two images. Homographies were originally introduced to study changes in perspective, and they allow us to better understand how images change when we view the same scene from different viewpoints.

Changing perspective with a homography (source: University of Maryland).

Homographies arise from the fact that cameras capture different images depending on their position and orientation. Essentially, a homography is a mapping between two images of the same scene taken from different perspectives. There are only two cases in which a homography applies (both of which assume that the part of the world being viewed can be modeled as a plane).

  1. The images are taken by the same camera but from a different angle (the world is effectively treated as flat).
  2. Two cameras view the same plane from different positions.

The two cases of homography, shown from above (image courtesy of author).

However, in order to use homographies, we must first develop the tools to describe how an image forms on a camera. More specifically, given a camera position/orientation and an image plane (the plane where the image forms, which is a property of the camera), we must find where a point in the world (called a world point) appears on the image plane.

Finding the coordinates of world points on the image plane is key to the homography (image courtesy of author).

Because light travels in straight lines, we know that a world point with Cartesian coordinates (_X_, _Y_, _Z_) appears on the image plane at the intersection of the image plane with the line that passes through the camera center and the world point (this line is the path the light takes to the camera).

However, in order to define this formula properly, we must first take a detour through a concept from projective geometry known as homogeneous coordinates.

Homogeneous coordinates.

Homogeneous (or projective) coordinates are another coordinate system, with the advantage that formulas in homogeneous coordinates are often much simpler than in Cartesian coordinates (points on the _x_-_y_ plane). Homogeneous coordinates use three coordinates (_x’_, _y’_, _z’_) to represent a point in the plane (in general, they use one more coordinate than Cartesian coordinates). As a bonus, converting homogeneous coordinates to Cartesian coordinates is relatively simple.

Converting homogeneous coordinates to Cartesian coordinates

Note that when _z’_ = 1, the homogeneous coordinates map directly onto the Cartesian coordinates. In addition, when _z’_ = 0, we interpret the corresponding Cartesian point as a point at infinity. With that in mind, let’s look at some useful properties of homogeneous coordinates.
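Concretely, the conversion is just a division by the last coordinate:

$$
(x', y', z') \;\longrightarrow\; \left(\frac{x'}{z'}, \frac{y'}{z'}\right), \qquad z' \neq 0
$$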

First, a line in homogeneous coordinates can be represented by just three numbers. In particular, we can represent a line _l_ by (_l1_, _l2_, _l3_) in homogeneous coordinates. The line is then the set of all points _p_ = (_x_, _y_, _z_) such that the dot product of _l_ with _p_ is 0.

Lines in homogeneous coordinates (image courtesy of author)

The second notable property is that, given two points _p1_ = (_a_, _b_, _c_) and _p2_ = (_d_, _e_, _f_), the line _l_ that passes through _p1_ and _p2_ is given by the cross product of the two points (an operation on two vectors defined in elementary linear algebra).

The line through two points in homogeneous coordinates (image courtesy of author)

A final property of note is that the intersection (in homogeneous coordinates) of two lines _l1_ and _l2_ is the point _p_ given by the cross product of the two lines.

Intersection of two lines in homogeneous coordinates (image courtesy of author)
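These properties are easy to check numerically. The sketch below, in plain NumPy with illustrative values, builds the line through two points with a cross product and intersects two lines the same way:

```python
import numpy as np

def to_cartesian(p):
    """Convert a homogeneous point (x', y', z') to Cartesian (x'/z', y'/z')."""
    return p[:2] / p[2]

# Two points in homogeneous coordinates (z' = 1, so they are ordinary 2D points).
p1 = np.array([0.0, 0.0, 1.0])
p2 = np.array([1.0, 2.0, 1.0])

# The line through p1 and p2 is their cross product.
line = np.cross(p1, p2)
print(np.dot(line, p1), np.dot(line, p2))  # both 0: p1 and p2 lie on the line

# The intersection of two lines is the cross product of the lines.
l1 = np.cross(np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0]))  # line y = x
l2 = np.cross(np.array([0.0, 1.0, 1.0]), np.array([1.0, 0.0, 1.0]))  # line y = 1 - x
intersection = np.cross(l1, l2)
print(to_cartesian(intersection))  # [0.5, 0.5]
```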

With these properties, we are now in a good position to understand the formula behind the homography.

Pinhole camera model.

We now return to our problem of finding where a world point (_X_, _Y_, _Z_), given in Cartesian coordinates, appears on the image plane of a camera with a given position and orientation.

In particular, we find that the world point (_X_, _Y_, _Z_, 1) maps to the image plane point (_x’_, _y’_, _z’_), where both are given in homogeneous coordinates and are related by the following equation (_f_ is a constant describing the focal length of the camera).

The equation relating the homogeneous coordinates of a world point to its homogeneous coordinates on the image plane

From here, we can convert (_x’_, _y’_, _z’_) to the Cartesian coordinates (_x_, _y_) on the image plane as described above.
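One standard way to write this relation (the layout of the original figure may differ slightly) is:

$$
\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix}
=
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix},
\qquad
(x, y) = \left(\frac{fX}{Z}, \frac{fY}{Z}\right)
$$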

It turns out that the matrix that transforms the world points into points on the image plane can be decomposed well, giving us an intuitive way to understand what the matrix does.

The matrix can be understood as two actions: (1) a scaling, and (2) a projection down to a lower dimension
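In this reading, the projection matrix above factors into a scaling (by the focal length _f_) applied after a matrix that drops the fourth coordinate, compressing to a lower dimension:

$$
\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
=
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
$$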

However, since the coordinate system used on the image differs from the Cartesian coordinate system (in pixel coordinates, the upper-left corner is (0, 0)), we must make one last transformation to convert the homogeneous image plane coordinates (_x’_, _y’_, _z’_) into homogeneous pixel coordinates (_u’_, _v’_, _w’_).

Representation of pixel coordinates (image courtesy of author)

So we first scale our image plane coordinates (to convert them to pixels) by dividing by the pixel size, namely _ρu_ in the _u_ direction and _ρv_ in the _v_ direction. You can think of this as converting from units such as meters into pixels. Next, we translate the coordinates by some _u0_ and _v0_ so that the origin of the pixel coordinates lies in the correct place. Putting these two transformations together gives us the following formula.

Converting image plane coordinates to pixel coordinates
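In matrix form, with the pixel sizes _ρu_, _ρv_ and offsets _u0_, _v0_ described above, this conversion can be written as:

$$
\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix}
=
\begin{bmatrix} 1/\rho_u & 0 & u_0 \\ 0 & 1/\rho_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix}
$$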

Finally, putting everything together, we can complete our camera model by converting the world point (_X_, _Y_, _Z_, 1) to the pixel coordinates (_u’_, _v’_, _w’_), where both are given in homogeneous coordinates.

The equation for the camera model

The extrinsic parameters can be thought of as the terms that give us information about the camera’s orientation (stored in the rotation matrix _R_) and position (stored in the vector _t_).
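Chaining the projection and pixel-conversion steps together (the exact grouping of terms in the original figure may differ), the full camera model can be written as:

$$
\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix}
=
\underbrace{\begin{bmatrix} f/\rho_u & 0 & u_0 \\ 0 & f/\rho_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic parameters}}
\underbrace{\begin{bmatrix} R & t \end{bmatrix}}_{\text{extrinsic parameters}}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}
$$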

In practice, we tend not to know all of these parameters, so instead we estimate the camera matrix directly; by the properties of matrix multiplication, it is just some 3×4 matrix. This estimation is done by calibrating the camera.

In addition, an important property of the camera matrix is that scaling the entries of _C_ results in a new camera matrix describing the same camera. This is because if we multiply _C_ by some constant _λ_, then all of the homogeneous pixel coordinates are also scaled by _λ_. However, when we convert from homogeneous to Cartesian coordinates (as shown earlier), the _λ_ cancels, leaving the same Cartesian coordinates as before the scaling. For this reason, it has become convention to set the lower-right entry of the camera matrix to 1, since the overall scale factor is arbitrary.

The standard form of the camera matrix
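With this convention, the standard form of the camera matrix is a 3×4 matrix whose lower-right entry is fixed at 1 (the remaining _cij_ are generic placeholders for the unknown entries):

$$
C = \begin{bmatrix} c_{11} & c_{12} & c_{13} & c_{14} \\ c_{21} & c_{22} & c_{23} & c_{24} \\ c_{31} & c_{32} & c_{33} & 1 \end{bmatrix}
$$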

With these equations defined, we are very close to obtaining a formula for the homography.

The homography equation.

Again, let’s recall the setup of the pinhole camera. In this setting, our camera photographs a point _P_ = (_X_, _Y_, _Z_) located on a plane, and the homogeneous pixel coordinates of _P_ in the image are (_u’_, _v’_, _w’_). Now, we can use a neat trick in our choice of coordinates: let’s keep the _X_ and _Y_ axes in the plane and point the _Z_ axis out of the plane. The _Z_ coordinate of every point in the plane is then 0, which means that _P_ has the form _P_ = (_X_, _Y_, 0).

A convenient choice of coordinate system (the coordinates shown are homogeneous) (image courtesy of author)

Now, plugging this into our camera equation and using _Z_=0, we get the following simplified result.

Taking advantage of _Z_ = 0 in the camera equation

This new matrix is called the homography matrix _H_, and it has 8 unknown entries (because the 1 in the lower-right corner is fixed). Traditionally, the entries of the planar homography matrix are denoted by _H_ instead of _C_, as shown below.

The equation for the planar homography matrix
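Explicitly, setting _Z_ = 0 removes the third column of the camera matrix, leaving a 3×3 matrix conventionally written with entries _hij_ and a 1 fixed in the lower-right corner:

$$
\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix}
=
\begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}
$$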

Since there are 8 unknowns in the homography matrix, and each world point in the plane contributes 2 coordinates (and therefore 2 equations), we need to calibrate the camera with 4 world points to estimate all of the entries of _H_.

As a review, a homography is a transformation that can be used to convert one image into another when both images view the same plane from different perspectives. Mathematically, this transformation is carried out by the homography matrix, a 3×3 matrix with 8 unknowns that can be estimated by calibrating with 4 corresponding points (better estimates can be obtained using more points, but 4 is the minimum).
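As a minimal sketch of that calibration step, the snippet below uses OpenCV’s cv2.getPerspectiveTransform with four illustrative correspondences (the point values are made up, not taken from the article):

```python
import numpy as np
import cv2

# Four corresponding points: corners of a planar object in the first image...
src_pts = np.float32([[56, 65], [368, 52], [28, 387], [389, 390]])
# ...and where those same corners should land in the second image.
dst_pts = np.float32([[0, 0], [300, 0], [0, 300], [300, 300]])

# Estimate the 3x3 homography matrix from exactly 4 correspondences.
H = cv2.getPerspectiveTransform(src_pts, dst_pts)
print(H)  # 3x3 matrix with the lower-right entry normalized to 1

# With more than 4 (possibly noisy) correspondences, cv2.findHomography
# gives a robust estimate instead:
# H, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC)

# The estimated H can then warp one image into the other's frame:
# warped = cv2.warpPerspective(img, H, (300, 300))
```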

Example of a homography with yellow calibration points (image courtesy of author)

All in all, homography is a powerful tool for areas such as augmented reality (where certain images are projected into the environment) and image stitching (where multiple images are combined to create a larger panorama).