For most programmers, Git is probably one of the tools they use most in their daily work. We work with Git every day: we submit code through Git, find great open source projects on GitHub, and take part in technical discussions there. But have you really understood Git? Why did it appear, what exactly is it, and what makes it the most popular version control system today?

In this article, I will analyze Git in detail from five angles: the history of version control systems, how Git implements version control, how Git branches work, how the Git staging area works, and how Git guarantees data integrity. The goal is to uncover the principles behind Git. Hopefully, after reading it, you will see Git differently.

History of version control systems

To understand a technology systematically, we should first understand its history: the background it emerged from and the problems it set out to solve.

Git is a version control system, and to fully understand it, you need to understand what a version control system is and how it has evolved.

Version control is a system that records changes to the contents of one or more files so that specific versions can be reviewed later. -Baidu.com

Simply put, the beauty of a version control system is that even if you arbitrarily change and delete an entire project, you can restore the project to its original state in very little time.

What other version control systems existed in history before Git? And what are the characteristics of each?

The evolution of version control systems can be roughly divided into three stages. The first stage used the local file system for version control. The second stage used a centralized server, as in CVS and SVN, to manage versions. The third stage is distributed version control, as in Git and Mercurial.

1.1 Local version control system

In the first stage there was no dedicated version control system, and people usually saved different versions of their files by copying the entire project directory.

The good thing about this approach is that it is easy to get started; the bad thing is that it is easy to make mistakes. Sometimes you lose track of which directory you are working in and accidentally write to the wrong file or overwrite something you did not intend to.

Local version control systems (such as RCS, a version management tool for individual files) were later developed to keep track of file changes, but this approach still has its problems, particularly with team collaboration.

1.2 Centralized version control system

To support collaboration among team members, centralized version control systems (CVCS) emerged. Systems of this kind, such as CVS and Subversion, have a single centrally managed server that holds the revisions of all files.

Centralized version control systems have version management and branch management capabilities. With this version management approach, everyone can see to some extent what everyone else in the project is doing. Administrators can easily control the rights of each developer, and it is much easier to manage a centralized version management system than to maintain a local database on each client.

Centralized version management suits corporate projects: it not only lets developers collaborate around a project, but also allows fine-grained control over the access and operation rights of everyone involved. Even if code leaks because of an employee's negligence, the leaked version is only part of the code repository (as long as the leak does not come from the central server itself).

So if centralized version management is so good, why is distributed version control all the rage? Let’s look at the characteristics of a distributed version control system:

1.3 Distributed version control system

In a distributed version control system, such as Git or Mercurial, the client does not just check out a snapshot of the latest version of the files; it mirrors the entire repository.

The biggest benefit of being distributed is “decentralization”. Because each client holds a complete code repository, anyone can build more powerful or more targeted features on top of someone else's work. You can fork any project you like (strictly speaking, forking is a feature of hosting platforms such as GitHub, but it rests on Git's decentralized design), modify it to suit your own needs, or open a pull request to merge your changes back into the original project and contribute to it.

The aim is not merely to develop a single project (the aim of a centralized version control system) but to let the community contribute: there is no central server, no single point that carries the whole load, no authority, and every participant is equal. The final project is the result of the whole community's collective effort, which also keeps the project itself very much alive.

Distributed version control systems make code management for open source projects more convenient, and the open source movement of recent years has in turn driven the development of distributed version control, gradually making it the mainstream approach.

1.4 Git

With the introduction of these three version control systems, let’s move on to why Git was born.

Back in 2005, the well-known Linux project faced the following problems in managing its code repository:

  1. BitKeeper (a distributed version control system of the time) had withdrawn its free license, leaving alternatives such as CVS, RCS, and SVN.
  2. The Linux project was already a giant repository with millions of lines of code.
  3. There was heavy demand for nonlinear development, with many features developed in parallel.
  4. Many people from all over the world were involved in maintaining Linux.

To solve these problems, the Linux community needed a version control system that was fully distributed, strongly supported nonlinear development (allowing hundreds of parallel branches), and stayed fast and storage-efficient on a very large project.

At the time, no available version control system met all of these requirements, so the Linux community (in particular Linus, the founder of Linux) decided to develop its own, now known as Git.

Linus designed Git from the perspective of a file system expert and kernel designer, based on his previous experience with BitKeeper (also a distributed version control system), to provide Git with the following features:

  • Strong multi-branch management
  • Fast operation and good performance
  • A “staging area”
  • Fully distributed
  • Guaranteed data integrity

In recent years, because of these advantages, Git has become more and more popular among developers, and it has gradually grown into the most popular version control system today.

Next, I will take you inside Git to explain how Git is designed to control its version, how it implements fast and convenient branch management, what its staging area is, and how it guarantees the integrity of stored data.

To make the explanation easier, here is a demo project. Its directory structure is as follows:

$ ls
css/
 - index.css 
index.md
index.html

So far, the only command run in this demo is git init. Next, we will use this demo project to analyze the principles of Git in more detail.

Git version control principle

Git is a version control system. How does it achieve version control? To answer this question, we need to go back to what Git essentially is.

2.1 What is Git

Git is actually a content-addressable file system, and the version control system is built on top of it.

Notice that there are two key terms here: content addressing and version control. Let's start with content addressing.

In a content-addressing system, the system records a content-address for each piece of data, which serves as a unique and persistent identifier of that data. The content-address is a value calculated from the data by a cryptographic hash algorithm (such as SHA-1 or MD5). When we need the data, we provide the content-address; the system uses it to find the physical location of the data and returns the data. Any change to the data also changes the content-address. -Pro Git

A content-addressing system is essentially a key-value database. You can insert any type of content into a Git repository, and it returns a unique key with which the content can be retrieved at any time.

You may wonder why we never see this database in our daily use of Git.
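To make the key-value idea concrete, here is a minimal Python sketch of a content-addressed store. It is an illustration only, not Git's implementation: the class name ContentStore is made up for this example, the data lives in an in-memory dictionary, and the key is the SHA-1 of the raw content, whereas Git hashes a small header together with the content.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-1 of the value."""

    def __init__(self):
        self._objects = {}  # hex digest -> raw bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data  # storing identical content twice changes nothing
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
k1 = store.put(b"<h1>hello</h1>\n")
k2 = store.put(b"<h1>hello</h1>\n")   # same content, so the same key
k3 = store.put(b"<h1>hello!</h1>\n")  # one character differs, so a new key

print(k1 == k2, k1 == k3, len(store))  # True False 2
```

Identical content always maps to the same key, so it is stored only once, and any change to the content produces a different key.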

2.2 Git’s content addressing database

That is because Git wraps its key-value database, so you do not normally touch it directly in everyday use.

The database is stored in the .git directory, which holds the data for all versions of almost all files in the project. When you clone a project from a Git repository, you are actually copying the project's .git directory.

Let's enter the .git directory and look at its structure:

$ ls
config
description
HEAD
hooks/
info/
logs/
objects/
refs/
index
COMMIT_EDITMSG

This is the structure of a typical Git repository. config stores the configuration of the project. description is used by the GitWeb program. hooks stores client-side and server-side hook scripts (such as a script that validates the commit message format at commit time). info holds the exclude file, which lists ignore patterns that you do not want to record in .gitignore. COMMIT_EDITMSG stores the message of the most recent commit. logs records the history of changes to references. index stores the staging area information. The refs directory stores pointers to commit objects (branches, remote repositories, tags, and so on). HEAD records the branch that is currently checked out.

The objects directory contains Git’s key-value database, which stores all files in Git and their historical versions. Does this directory store files in key-value pairs like we think? We can verify this with a demo project.

Back in the demo project, we look at the .git/objects directory and find that it holds no data apart from the info and pack directories, which Git creates automatically (and which are empty as well).

Next, we run git add index.html and look at the directory again. The .git/objects directory now contains one additional record, whose name is a hash. In other words, git add inserts a file into the Git database and yields a unique hash for it. (More on these hashes in a later section.)

Does this hash correspond to the index.html file we added? We can check with git cat-file (the -p argument prints the content of the object).

We execute the following commands:

You can see that the generated hash code actually fetched the index.html file stored in the Git database.
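The round trip above can be imitated in a few lines of Python. This is a sketch under stated assumptions, not Git's own code: write_object and read_object are hypothetical helper names, a temporary directory stands in for .git/objects, and only the loose-object layout is modeled (a zlib-compressed file whose path is the first two hash characters as a directory and the remaining 38 as the file name).

```python
import hashlib
import os
import tempfile
import zlib


def write_object(objects_dir: str, obj_type: str, data: bytes) -> str:
    """Store data in the loose-object layout and return its hash."""
    payload = f"{obj_type} {len(data)}".encode() + b"\x00" + data
    sha = hashlib.sha1(payload).hexdigest()
    folder = os.path.join(objects_dir, sha[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, sha[2:]), "wb") as f:
        f.write(zlib.compress(payload))
    return sha


def read_object(objects_dir: str, sha: str):
    """Return (type, content), roughly what cat-file -t and -p report."""
    with open(os.path.join(objects_dir, sha[:2], sha[2:]), "rb") as f:
        payload = zlib.decompress(f.read())
    header, _, content = payload.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    assert int(size) == len(content)  # the header records the content length
    return obj_type, content


objects_dir = tempfile.mkdtemp()  # stand-in for .git/objects
sha = write_object(objects_dir, "blob", b"<html></html>\n")
obj_type, content = read_object(objects_dir, sha)
print(sha[:2] + "/" + sha[2:], obj_type, content)
```

The hash alone is enough to locate the file on disk and to get the original content back, which is exactly the key-value behavior described above.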

At this point, we know that Git stores data that does conform to the characteristics of a key-value database. How does Git use it for version control?

2.3 Git version control

To understand how Git does version control, you must first understand how Git handles data.

2.3.1 The way Git processes data

The main difference between Git and other version control systems (SVN and similar tools) is the way Git treats data. Most other systems (CVS, SVN) are delta-based: they store data as a set of files plus the changes made to each file over time.

Git, on the other hand, treats its data more like a series of snapshots of a miniature file system. Every time you commit an update, Git takes a snapshot of all the files at that moment and stores a reference to that snapshot. For efficiency, Git does not store a file again if it has not changed; it only keeps a link to the previously stored file. Git thinks of its data as a stream of snapshots.

What are the snapshots in this snapshot stream? How does Git represent these snapshots?

2.3.2 Git object

Git has three object types that correspond to three kinds of snapshot in the snapshot stream. (Git also has a fourth type, the annotated tag object, which this article does not cover.)

  • Blob object: represents the content of a single file, corresponding to the snapshot of a single file in the snapshot stream.
  • Tree object: represents how multiple files are organized, corresponding to the snapshot of a set of files in the snapshot stream.
  • Commit object: represents the information about one commit, corresponding to the snapshot of one version in the snapshot stream.

(Note: a tree object does not only describe the organization of multiple files. It can also describe a single file, including its file name. For ease of understanding, it is summarized here as describing the organization of multiple files.)

Let’s take a look at these Git objects from the demo project

We have already used git cat-file -p to view the content of an object. With git cat-file -t, Git reports the type of the object identified by the given hash.

Let’s look at the type of the index.html file object in the previous section

You can see that it is a blob object, which confirms what we saw in the previous section: it holds the content of a file.

Proceed with the project, execute git commit, and commit the files from the staging area (which will be dissected later) to the Git repository.

Then we look at the .git/objects directory again and see that it contains two more objects.

We inspect the first of them with git cat-file -p:

Its content lists some basic information about the file we committed earlier. Let's check the type of this object.

It is a tree object. Tree objects solve the problem of organizing multiple files, and they also solve the problem that blob objects hold only file contents, not file names.

Next we inspect the other new object with git cat-file -p:

Its content includes the tree object above and its hash, along with the committer's information, the commit time, and the commit message.

This is a commit object. A commit object mainly holds information about the commit itself, plus a reference to the tree object that belongs to the commit.

The relationship between the above three Git objects is shown below:

2.3.3 Achieving version control

Take a look at this diagram of how Git processes data from section 2.3.1:

Imagine that blobs were the only object type available. How would you represent a snapshot stream? A snapshot is a collection of many files, while a blob can represent only a single file. To represent multiple files, Git therefore introduces tree objects, which organize multiple files together.

Beyond organizing multiple files, we also want to know who saved a snapshot, when and why it was saved, and which snapshot it was based on. This is the basic information that a commit object records for you.

The following diagram shows the relationship between Git’s three object types when multiple commits are involved.

This chart gives you an idea of how Git does version control.

The commit objects on the right correspond to the versions on the left. Each commit object is associated with a tree object, which is a snapshot of the file system (FileA, FileB, and FileC on the left), and the tree object in turn references the individual blob objects.

Through the combination of these three types and the content addressing system mentioned above, Git is able to implement version control on top of them.
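The following Python sketch shows how the three object types can be chained on top of a content-addressed store. It is a simplified model rather than Git itself: the function names store, make_tree, and make_commit are invented for this example, the author line is a placeholder, and only flat directories with regular files (mode 100644) are handled.

```python
import hashlib

objects = {}  # hash -> (type, raw body); stand-in for .git/objects


def store(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    sha = hashlib.sha1(header + body).hexdigest()
    objects[sha] = (obj_type, body)
    return sha


def make_tree(files: dict) -> str:
    """files maps a file name to its bytes; returns the hash of the tree."""
    body = b""
    for name in sorted(files):
        blob_sha = store("blob", files[name])
        # Entry layout: "<mode> <name>\0" followed by the 20-byte binary hash.
        body += b"100644 " + name.encode() + b"\x00" + bytes.fromhex(blob_sha)
    return store("tree", body)


def make_commit(tree_sha, parent_sha, message,
                who="Demo <demo@example.com> 0 +0000"):  # placeholder identity
    lines = [f"tree {tree_sha}"]
    if parent_sha:
        lines.append(f"parent {parent_sha}")
    lines += [f"author {who}", f"committer {who}", "", message, ""]
    return store("commit", "\n".join(lines).encode())


# Two versions: a.txt changes, b.txt stays the same.
v1 = make_commit(make_tree({"a.txt": b"A1\n", "b.txt": b"B1\n"}), None, "first")
v2 = make_commit(make_tree({"a.txt": b"A2\n", "b.txt": b"B1\n"}), v1, "second")

kinds = sorted(kind for kind, _ in objects.values())
print(len(objects), kinds)
```

Two commits over two files produce seven objects, not eight: the unchanged b.txt is stored as a single blob that both trees reference, which is the "link to the previously stored file" behavior described in section 2.3.1.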

Now that we know how Git implements version control, let's look at how Git branches work.

Git branch principle

The conclusion first: a Git branch is essentially a pointer to a commit object.

Next, use the demo project to help you understand:

As mentioned in section 2.2, the .git/refs directory stores pointers to commit objects (branches, remote repositories, tags, and so on). The branch pointers are stored in .git/refs/heads.

Because the project has only the master branch at this point (no other branches have been created), the only file in this directory is master.

We looked at the file and found that its contents were a string of hashes for Git objects.

Run git cat-file to check the hash value and find that the hash value is a commit object.

Next we create a new branch named test. The refs/heads directory now contains a new file named test.

Check the contents of the test file:

Just like the previous master content, this pointer file points to the same commit object.

Let’s review the above process with the following diagram (with several dummy commits in order to better describe the process) :

From all of the above, we can confirm that Git branches are pointers to commit objects.

Because of this, all branch operations are very fast. Creating a new branch, for example, essentially means writing the 40-character hash of a commit object to a file.
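As a rough illustration of why this is cheap, the sketch below creates a branch by writing one small file. The assumptions: a temporary directory stands in for .git, the commit hash is a placeholder, and create_branch and read_branch are hypothetical helpers. Real Git may also keep references in a packed-refs file, which is not modeled here.

```python
import os
import tempfile

git_dir = tempfile.mkdtemp()  # stand-in for a .git directory
heads = os.path.join(git_dir, "refs", "heads")
os.makedirs(heads)

commit_sha = "24b9da6552252987aa493b52f8696cd6d3b00373"  # placeholder hash


def create_branch(name: str, sha: str) -> None:
    """Creating a branch: write 40 hex characters plus a newline to a file."""
    with open(os.path.join(heads, name), "w") as f:
        f.write(sha + "\n")


def read_branch(name: str) -> str:
    with open(os.path.join(heads, name)) as f:
        return f.read().strip()


create_branch("master", commit_sha)
create_branch("test", read_branch("master"))  # new branch at the same commit

print(sorted(os.listdir(heads)), read_branch("test") == read_branch("master"))
```

No file content is copied when the branch is created; the cost is independent of the size of the project.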

Now let’s look at branch switching, which is a little bit different from branch creation.

What makes it different is the special pointer HEAD. In Git, HEAD is a pointer to the currently checked-out branch, so switching branches is essentially moving the HEAD pointer.

For example, we now switch to the newly created test branch:

(1) Before we switch to the test branch, HEAD still points to the master branch.

(2) After we switch to the test branch, HEAD points to the test branch.
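The two states can be reproduced with a small Python sketch. It models only the pointer move: a temporary directory stands in for .git, the commit hash is a placeholder, and switch and resolve_head are hypothetical helpers. A real branch switch also updates the staging area and the workspace.

```python
import os
import tempfile

git_dir = tempfile.mkdtemp()  # stand-in for a .git directory
os.makedirs(os.path.join(git_dir, "refs", "heads"))
sha = "24b9da6552252987aa493b52f8696cd6d3b00373"  # placeholder commit hash
for branch in ("master", "test"):
    with open(os.path.join(git_dir, "refs", "heads", branch), "w") as f:
        f.write(sha + "\n")


def switch(branch: str) -> None:
    """Point HEAD at another branch (the pointer move only)."""
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write(f"ref: refs/heads/{branch}\n")


def resolve_head():
    """Follow HEAD -> branch file -> commit hash."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        content = f.read().strip()
    assert content.startswith("ref: ")
    ref = content[len("ref: "):]
    with open(os.path.join(git_dir, *ref.split("/"))) as f:
        return ref, f.read().strip()


switch("master")
before = resolve_head()
switch("test")
after = resolve_head()
print(before[0], "->", after[0])  # refs/heads/master -> refs/heads/test
```

HEAD does not store a commit hash here; it stores the name of a branch, and the branch file stores the hash.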

Creating, deleting, and switching Git branches are all essentially pointer operations. This is where Git outperforms other version control systems in branch management. SVN, for example, does not make a complete copy of the source directory when it creates a branch (the mechanism is similar to a link on Linux), yet its branch operations still lag well behind Git's in speed and convenience.

Git staging area

Before getting into Git staging, let’s review the classic Git workflow.

  1. Modify files in the workspace.
  2. Selectively stage the changes you want to include in the next commit.
  3. Commit the update, find the files in the staging area, and permanently store the snapshot in your Git repository.

Three concepts emerge from the above process – repository, staging area, and workspace. What do they represent?

  • Repository: The place where Git holds the metadata and object database for your project, which is the most important part of Git.

  • Workspace: Extracted content for a version of the project. These files are extracted from Git databases and placed on disk for you to use or modify.

  • Staging area: is an “intermediate area” that provides a buffer area before actually committing to the Git repository.

The staging area is essentially a file that records the list of files to be committed next; its content is stored in the .git/index file.

.git/index is a binary file, so opening it as text shows garbage. We need a tool such as hexdump to display it in hexadecimal.

Here is what the above staging area file (staging area only has a single file) roughly corresponds to:

The meanings of each item are as follows:

  1. DIRC: the index header, which contains the signature DIRC identifying a valid index file, the index file version, and the number of entries in the index.
  2. Ctime: the time the file's metadata last changed.
  3. Mtime: the time the file's content was last modified.
  4. Device: the ID of the device that stores the file.
  5. Inode: the inode number.
  6. Mode: the file mode.
  7. UID: the ID of the owning user.
  8. GID: the ID of the owning group.
  9. File size: the size of the file.
  10. Entry SHA-1: the SHA-1 hash of the Git object for the file.
  11. Flags: flag bits, including the assume-valid flag, the extended flag, and the length of the file name.
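The field list above can be turned into a small parser. The sketch below is an assumption-laden illustration: it builds a synthetic version-2 index in memory and parses it back, omitting the extensions and the trailing checksum that a real .git/index carries and ignoring very long paths. build_index and parse_index are names chosen for this example.

```python
import struct

ENTRY_FMT = ">10I20sH"  # ten 32-bit fields, 20-byte SHA-1, 16-bit flags
ENTRY_FIXED = struct.calcsize(ENTRY_FMT)  # 62 bytes before the path


def build_index(entries):
    """Build a minimal version-2 index (no extensions, no checksum)."""
    out = struct.pack(">4sII", b"DIRC", 2, len(entries))
    for e in entries:
        name = e["path"].encode()
        flags = min(len(name), 0xFFF)  # low 12 bits hold the path length
        rec = struct.pack(ENTRY_FMT, e["ctime"], 0, e["mtime"], 0, e["dev"],
                          e["ino"], e["mode"], e["uid"], e["gid"], e["size"],
                          bytes.fromhex(e["sha1"]), flags) + name
        rec += b"\x00" * (8 - len(rec) % 8)  # 1 to 8 NULs, pad to multiple of 8
        out += rec
    return out


def parse_index(data):
    signature, version, count = struct.unpack_from(">4sII", data, 0)
    assert signature == b"DIRC"
    offset, entries = 12, []
    for _ in range(count):
        fields = struct.unpack_from(ENTRY_FMT, data, offset)
        flags = fields[11]
        name_len = flags & 0xFFF
        start = offset + ENTRY_FIXED
        entries.append({
            "path": data[start:start + name_len].decode(),
            "mode": oct(fields[6]),
            "size": fields[9],
            "sha1": fields[10].hex(),
            "assume_valid": bool(flags & 0x8000),
        })
        rec_len = ENTRY_FIXED + name_len
        offset += rec_len + (8 - rec_len % 8)  # skip the NUL padding
    return version, entries


demo = build_index([{"path": "index.html", "ctime": 1, "mtime": 1, "dev": 0,
                     "ino": 0, "mode": 0o100644, "uid": 1000, "gid": 1000,
                     "size": 0,
                     "sha1": "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"}])
version, entries = parse_index(demo)
print(version, entries[0]["path"], entries[0]["mode"])
```

All integers are stored big-endian, and each entry is padded with NUL bytes so that its total length is a multiple of eight.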

Git's index file is actually more complex than this. Besides the file entries described above, it can carry extensions: a cached tree (for a repository with directories, pre-computed tree information is usually stored so that tree objects can be built faster), UNTR (a cache of files that are present in the workspace but untracked), FSMN (which uses change information supplied by the file system to decide whether a file has changed), and others.

(Check out this article for more details on this section.)

Now that we have learned about the contents of the staging area, we will demonstrate how the staging area works by combining the demo project and the classic Git workflow described above.

Following the above demo, the workspace, staging area, and Git repository state are shown below:

Now add a file index.md to your workspace and run git status. Git reports that the workspace differs from the staging area: a brand-new file is listed under “Untracked files”, while an edit to a file that is already tracked would appear under “Changes not staged for commit”. At this point, the three states are as follows:

Next we run git add index.md to add the changes to the staging area. At this point, the three states are as follows:

At this point, the files in the workspace and the staging area are the same, but because the staging area and the repository differ, git status shows “Changes to be committed”. That is, the next commit is now expected to differ from the last one.

Finally we run git commit to commit. At this point, the three states are as follows:

Running git status now reports that there is nothing to commit, because the files in all three areas are the same again. This process reveals the secret of the Git staging area, which turns out to be less complicated than we might have thought.
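The whole walk-through can be modeled as three dictionaries compared by content hash. This is a toy model, not how Git computes status: the function names are invented, deletions and renames are ignored, and each area is reduced to a mapping from path to hash. Note that real Git lists a brand-new file as untracked, which the model reproduces.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def status(workspace: dict, index: dict, head: dict) -> dict:
    """Compare the three areas by content hash, like a simplified git status."""
    staged = sorted(p for p in index if index[p] != head.get(p))
    unstaged = sorted(p for p in index
                      if p in workspace and digest(workspace[p]) != index[p])
    untracked = sorted(p for p in workspace if p not in index)
    return {"to be committed": staged, "not staged": unstaged,
            "untracked": untracked}


workspace = {"index.html": b"<html></html>\n"}
index = {"index.html": digest(workspace["index.html"])}
head = dict(index)                                   # right after the first commit

workspace["index.md"] = b"# demo\n"                  # 1. new file in the workspace
s1 = status(workspace, index, head)

index["index.md"] = digest(workspace["index.md"])    # 2. git add index.md
s2 = status(workspace, index, head)

head = dict(index)                                   # 3. git commit
s3 = status(workspace, index, head)

for step in (s1, s2, s3):
    print(step)
```

After step 1 only the workspace has changed, after step 2 the staging area is ahead of the repository, and after step 3 all three areas agree again.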

With that said, why did Git introduce a staging area?

The advantage of this design is that you have more control over what you commit. Suppose that during one piece of work you modify several files that are logically unrelated. Without a staging area, you would have to commit all of these unrelated files together. With a staging area the problem disappears, because you can group your changes into commits along whatever lines you choose.

Through the staging area, Git gives users a “buffer” between the workspace and the repository, which they can use flexibly according to their own needs.

Git data integrity

Finally, let’s take a look at how Git ensures data integrity.

All data in Git is computed as checksums before storage to ensure data integrity and prevent content from being tampered with.

The checksum is computed with the SHA-1 hash algorithm, which produces a string of 40 hexadecimal characters (0-9 and a-f). Git first constructs a header: the object type, then a space, then the number of bytes in the content, and finally a null byte. It then hashes the header followed by the original content.

Git hash = SHA1(object type + space + number of bytes of data content + null bytes + original data)

(Because the file header stores data length and other information, it can increase the security of SHA1 hash to a certain extent)
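The formula translates almost directly into Python. The function name git_hash is chosen for this sketch; the two printed values are the widely known hashes of the empty blob and the empty tree, which makes the result easy to check against a real Git installation.

```python
import hashlib


def git_hash(obj_type: str, data: bytes) -> str:
    """SHA1(type + space + byte length + NUL + data)."""
    header = f"{obj_type} {len(data)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + data).hexdigest()


print(git_hash("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_hash("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

The same empty content yields two different hashes because the object type is part of the hashed header.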

The resulting SHA-1 hash looks like this:

24b9da6552252987aa493b52f8696cd6d3b00373

If you want to fake an identical SHA-1 value, it is very difficult to do so, as shown in the following example.

Each hexadecimal digit represents 4 binary bits, so a 40-character SHA-1 hash is really 160 bits of output. To use a two-color ball lottery analogy, producing an identical SHA-1 hash is like picking 32 “red balls”, each numbered from 1 to 32 (a 5-bit binary number), with repeats allowed. Given that a two-color ball ticket picks only seven balls, “winning” at SHA-1 is like buying five tickets in a row and taking first prize with every one. Of course, because of weaknesses in the algorithm, the chance of constructing a collision (an identical digest) is not quite that small, but it is small enough for Git to distinguish and identify different objects. -Git Authoritative Guide
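The arithmetic behind the analogy is easy to check. The short sketch below only verifies the bit counts from the quote; the last line adds the generic birthday estimate for an ideal 160-bit hash, which is a standard textbook bound and not a statement about SHA-1's actual strength.

```python
hex_chars = 40
bits = hex_chars * 4              # each hexadecimal digit carries 4 bits
assert bits == 160 == 32 * 5      # 32 "balls" with 32 choices (5 bits) each

space = 2 ** bits                 # number of possible hash values
assert space == 16 ** 40 == 32 ** 32

print(bits, f"{space:.3e}")

# For an ideal hash, a collision between any two inputs is expected after
# roughly the square root of the space, that is about 2**80 attempts.
birthday = 2 ** (bits // 2)
```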

Let's take a commit object as an example of how Git generates a hash.

Back to the demo project:

  • View the content of the commit object with git cat-file -p.

  • Count the number of bytes in the commit content (for plain ASCII text this equals the number of characters); here it is 181.

  • Prepend the header “commit 181” followed by a null byte to the commit content, then run the SHA-1 algorithm over the result.

  • The resulting hash is identical to the one shown by git log.

This is how a Git hash is generated. The hashes of all Git objects are generated this way; only the header differs by object type. The header of a tree object starts with tree, and the header of a file object starts with blob.
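The steps above can be sketched in Python as follows. The commit content here is hypothetical (made-up author, timestamp, and message), so the resulting hash will not match any real repository. The example deliberately includes a non-ASCII character to show that the number in the header is a byte count rather than a character count.

```python
import hashlib

# Hypothetical commit content, in the layout that cat-file -p prints
# for a commit object.
commit_text = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "author Demo User <demo@example.com> 1600000000 +0800\n"
    "committer Demo User <demo@example.com> 1600000000 +0800\n"
    "\n"
    "Add café page\n"
)

body = commit_text.encode("utf-8")
header = f"commit {len(body)}".encode() + b"\x00"  # byte count, not characters
commit_hash = hashlib.sha1(header + body).hexdigest()

print(len(commit_text), len(body), commit_hash)
```

Because “é” occupies two bytes in UTF-8, the byte count is one larger than the character count; using the character count in the header would produce a different hash.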

Summary

That is more or less the end of the article. To finish, let's review what it covered.

The article began with the evolution of version control systems, from local to centralized to distributed. Distributed version control systems are where they are today because they and the open source movement reinforced each other: their “decentralized” nature is a natural fit for code management in open source projects.

Then we introduced the birth of Git, the star of distributed version control. It was designed by the creator of Linux and has many advantages that other version control systems lack, such as the way it treats data, its branching model, its speed, the design of its staging area, and its guarantee of data integrity. All of this makes it one of the most popular version control systems today.

After that, we looked at Git in detail from four angles: version control, the branching model, the staging area design, and data integrity. At its core Git is a content-addressing system. On top of it, Git combines three object types (blob, tree, and commit objects) to implement version control. A Git branch is essentially a pointer to a commit object. Creating a branch means creating a pointer to a commit, and switching branches means moving the HEAD pointer, which is why branch operations such as creating and switching are so fast.

The Git staging area sits between your workspace and your Git repository and gives you more control over what goes into each commit. It is essentially an index file that records information about each staged file together with the file's hash.

Git uses SHA-1 hashes as checksums. A value produced by this algorithm is hard to forge, and any change to a file changes the resulting hash, so to a large extent SHA-1 ensures data integrity. We also computed the hash of a commit object by hand.

Finally, a diagram is used to describe the whole Git workflow described above.

When you clone a project or switch to another branch, Git first updates the HEAD reference to point to the new branch. It then fills the staging area with the snapshot of that branch's latest commit, and finally copies the contents of the staging area into the workspace.

After you modify files in the workspace, git add synchronizes the changes to the staging area and writes new blob objects. git commit then records the staged files in the repository: it writes new tree objects and a new commit object, and links the new commit object to the previous one through its parent field.

That is all for this article. Thank you for reading, and I hope you learned something from it. If you spot problems in the article, please point them out; you are welcome to discuss them in the comments section.

References

  • Set up the SVN server and configure Cornerstone on a Mac
  • Why is branching and merging easier in Mercurial than in Subversion?
  • Git Authoritative Guide
  • my git
  • why git
  • What are the differences between Subversion and Git?
  • Why is Git better than Subversion?
  • Pro Git
  • What are the core advantages of Git over SVN and other version management tools?
  • From Git to blockchain
  • How to understand Git’s distribution?
  • Git index (1) : Index file structure
  • Git: Understanding the Index File