Git storage principles and their implementation

This article was written by Zhou Kai, head of Gitee, and first published on the WeChat official account “Zoker Essay”.

Abstract: Git is by far the most popular version control system. From local development to production deployment, we use Git every day for version control. Beyond routine command usage, studying Git's underlying storage principles is very helpful for understanding Git and how to use it. Even if you are not a Git developer, it is worth taking a look at these underlying principles: doing so will give you a new appreciation of how powerful Git is and make you more comfortable in your daily use of it.

This article is aimed mainly at readers who already have some understanding of Git. It does not introduce Git's specific features and usage, nor the differences from other version control systems such as Subversion. It focuses on the essence of Git and the principles behind its storage implementation, with the aim of giving Git users a clearer picture of Git's internals when they use it for version control.

What is the nature of Git

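At its core, Git is a content-addressable key-value (KV) database. We can insert content into a Git repository, and Git returns a unique key that we can later use to retrieve that content. We can try this out with the low-level command git hash-object: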

➜ Zoker git:(master) ✗ cat testfile
Hello git
➜ Zoker git:(master) ✗ git hash-object testfile -w
9f4d96d5b00d98959ea9960f069585ce42b1349a

Here we have a file, testfile, whose content is Hello git. The -w option tells Git to write the content into its object database under the .git/objects directory, and Git returns a SHA value. That SHA value is the key we use to retrieve the content later:

➜ Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a
Hello git

Here we use the git cat-file command to retrieve the content we just stored. The -p option asks Git to detect the object type and print its content in a readable form.

The data we just inserted is a basic blob object. Git also has other object types, such as Tree and Commit. These different object types have specific associations, and they logically link different objects together to help us control and check out different versions. We’ll explore the different object types later, but let’s take a look at Git’s directory structure and see how data is stored in Git.

Git directory structure

In the example above, the content we inserted was stored as an object under the .git/objects directory. But how exactly does Git store this data? In this section we focus on the layout of Git's storage directory to understand how Git stores different types of data.

For more information, see github.com/git/git/blo…

Running git init creates an empty Git repository in the current directory. Git automatically generates a .git directory, which holds all of the repository's Git metadata:

➜ Zoker git init
Initialized empty Git repository in /Users/Zoker/tmp/Zoker/.git/
➜ Zoker git:(master) ✗ tree .git
.git
├── config              // Configuration of the current repository
├── description         // Repository description
├── HEAD
├── hooks               // Contains some sample templates by default
│   ├── update.sample
│   ├── pre-receive.sample
│   └── ...
├── info
│   └── exclude
├── objects
│   ├── info
│   └── pack
└── refs                // Stores reference information, i.e. branches and tags
    ├── heads
    └── tags

These are the files and directories that Git creates by default on initialization. There are others as well, such as packed-refs, modules, and logs, which serve specific purposes and appear only after particular operations or configuration. Here we focus on the implementation of core storage and cover only the core files; you can read up on the purpose and usage of the additional files and directories yourself.

Hooks directory

The hooks directory contains Git hooks, scripts that can be triggered before or after many events. By default every hook carries the .sample suffix; to enable one, remove the suffix and make the file executable. Here are some common hooks and their typical uses:

Client-side hooks

pre-commit: triggered before a commit is created; it can check, for example, whether tests pass and whether the code format meets requirements (checking the commit message itself is the job of the separate commit-msg hook)
post-commit: by contrast, triggered after the entire commit has completed; it can be used to send notifications

Server-side hooks

pre-receive: the first script invoked when the server receives a push; it can check whether the pushed references meet requirements
update: similar to pre-receive, but whereas pre-receive runs only once per push, update runs once for each branch pushed
post-receive: triggered after the entire push has completed; it can be used to send notifications or trigger builds
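
As a rough illustration of a server-side hook, the sketch below shows how a pre-receive script might inspect the pushed references. Git feeds pre-receive one line per updated reference on standard input, in the form "<old-sha> <new-sha> <ref-name>". The function name rejected_refs and the policy of protecting refs/heads/master from deletion are assumptions made for this example, not something Git prescribes, and the all-zero id assumes a SHA-1 repository.

```python
import sys

# Git uses an all-zero id to mean "this ref does not exist"
# (40 hex digits, assuming a SHA-1 repository).
ZERO_SHA = "0" * 40

# Hypothetical policy for this sketch: these refs may not be deleted.
PROTECTED_REFS = {"refs/heads/master"}


def rejected_refs(lines):
    """Return the protected refs that the push tries to delete.

    `lines` is an iterable of "<old-sha> <new-sha> <ref-name>" strings,
    the format Git writes to a pre-receive hook's standard input.
    """
    rejected = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        old_sha, new_sha, ref_name = parts
        if new_sha == ZERO_SHA and ref_name in PROTECTED_REFS:
            rejected.append(ref_name)
    return rejected


def main(stdin=sys.stdin):
    """Entry point for a real hook: a non-zero exit status rejects the push."""
    bad = rejected_refs(stdin)
    for ref_name in bad:
        print(f"deleting {ref_name} is not allowed", file=sys.stderr)
    return 1 if bad else 0
```

To try it as an actual hook, the file would be saved as hooks/pre-receive in the server-side repository, made executable, and end with a call such as sys.exit(main()).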

Objects directory

As mentioned in the previous section, Git turns all the content it receives into object files and stores them in this directory. Earlier we generated an object with git hash-object and wrote it to the repository; its key is 9f4d96d5b00d98959ea9960f069585ce42b1349a. Now let us look at the structure of the objects directory:

➜ Zoker git:(master) ✗ git hash-object testfile -w
9f4d96d5b00d98959ea9960f069585ce42b1349a
➜ Zoker git:(master) ✗ tree .git/objects
.git/objects
├── 9f
│   └── 4d96d5b00d98959ea9960f069585ce42b1349a
├── info
└── pack

Git uses the first two characters of the key as the directory name and the remaining characters as the file name of the object file. Objects stored this way, under objects/[0-9a-f][0-9a-f], are called loose objects, or unpacked objects.
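
The mapping from key to file path is simple enough to sketch in a few lines of Python. The helper name loose_object_path is made up for this illustration, and the sketch assumes a 40-character SHA-1 key like the ones shown in this article.

```python
import os


def loose_object_path(git_dir, sha):
    """Return the path where a loose object with this key would be stored."""
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError("expected a 40-character lowercase hex SHA-1")
    # The first two hex characters name the directory,
    # the remaining 38 name the file inside it.
    return os.path.join(git_dir, "objects", sha[:2], sha[2:])


print(loose_object_path(".git", "9f4d96d5b00d98959ea9960f069585ce42b1349a"))
```

Splitting on the first two characters spreads loose objects over at most 256 subdirectories instead of placing them all in a single directory.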

Besides the object folders, attentive readers will have noticed the objects/pack folder, which holds packed files. To save space and improve efficiency, Git packs loose objects into pack files when the repository contains too many loose object files, when you run git gc manually, or during push and pull transfers. Here is what these packed files look like:

➜ objects git:(master) git gc
...
Compressing objects: 100% (75/75), done.
...
➜ objects git:(master) tree
.
├── pack
│   ├── pack-fe24a22b0313342a6732cff4759bedb25c2ea55d.idx
│   └── pack-fe24a22b0313342a6732cff4759bedb25c2ea55d.pack
└── ...

You can see that the objects directory no longer contains any loose objects. Instead there are two files in the pack directory: one is the pack file itself, and the other is the .idx file, which indexes the contents of the pack so that Git can check quickly whether an object is in the corresponding pack.

Note that if you run git gc in the repository where we just created a blob object by hand, it will have no effect on that blob, because nothing in the repository references it yet.

Refs directory

The refs directory stores our references. A reference can be thought of as an alias for a version: the file it corresponds to stores the SHA value of a commit. The repository we tested above has no commits yet, so it contains only an empty directory structure:

└── refs
    ├── heads
    └── tags

Let us pick a repository that does contain commits and look at its default branch, master:

➜ .git git:(master) cat refs/heads/master
87e917616712189ecac8c4890fe7d2dc2d554ac6

As you can see, the master reference simply stores the SHA value of a commit. The advantage is that we do not need to remember a long SHA value; the alias master is enough to reach that version. Likewise, the tags directory stores our tags. Unlike branches, the values recorded by tags generally do not change, whereas branches move as new versions are created. You may also see directories such as refs/remotes and refs/fetch, which store references in specific namespaces.

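Git can also pack references. After git gc, for example, the reference files under refs may be gone, because Git has collected them into a single .git/packed-refs file: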

➜ .git git:(master) cat packed-refs
# pack-refs with: peeled fully-peeled sorted
87e917616712189ecac8c4890fe7d2dc2d554ac6 refs/heads/master

When you need to access the master branch, Git first looks under refs/heads; if the reference is not found there, it looks in .git/packed-refs. When the branch is updated later, Git does not edit packed-refs directly. Instead it creates a new master reference under refs/heads/ containing the SHA value of the latest commit. Because Git always checks refs/heads/ first, the newer value takes precedence over the packed one.
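
The lookup order described above can be sketched as follows. This is a simplified illustration, not Git's actual implementation: the function name resolve_ref is invented here, and symbolic references (files containing "ref: ...") are deliberately not handled.

```python
import os


def resolve_ref(git_dir, ref_name):
    """Look up a ref such as "refs/heads/master"; return its SHA or None."""
    # 1. A loose ref file under refs/, if present, always wins.
    loose_path = os.path.join(git_dir, *ref_name.split("/"))
    if os.path.isfile(loose_path):
        with open(loose_path, encoding="utf-8") as f:
            return f.read().strip()

    # 2. Otherwise fall back to the packed-refs file.
    packed_path = os.path.join(git_dir, "packed-refs")
    if os.path.isfile(packed_path):
        with open(packed_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip the header comment, blank lines, and "^<sha>" lines,
                # which record the object an annotated tag points to.
                if not line or line.startswith(("#", "^")):
                    continue
                sha, _, name = line.partition(" ")
                if name == ref_name:
                    return sha
    return None
```

For example, resolve_ref(".git", "refs/heads/master") would return the commit id the branch currently points to, whether or not the reference has been packed.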

So what does this SHA value refer to? We can use the cat-file command to inspect the object's content:

➜ .git git:(master) git cat-file -p 87e917616712189ecac8c4890fe7d2dc2d554ac6
tree 7d000309cb780fa27898b4d103afcfa95a8c04db      // The tree it points to
parent aab1a9217aa6896ef46d3e1a90bc64e8178e1662    // Parent commit
author Zoker <[email protected]> 1607958804 +0800       // Author information
committer Zoker <[email protected]> 1607958804 +0800    // Committer information

test SSH                                           // Commit message

It is a commit object. Its main attributes are the tree object it points to, its parent commit (a first commit has no parent), the author and committer information, and the commit message.
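
To make that layout concrete, here is a small Python sketch that splits the text printed by cat-file -p for a commit into its fields. The function name parse_commit is made up for this example, and the sketch ignores multi-line headers such as the gpgsig header of signed commits.

```python
def parse_commit(text):
    """Split the text of a commit object into its headers and message."""
    # Headers and message are separated by the first blank line.
    header_part, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in header_part.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merge commits have several
        else:
            commit[key] = value  # tree, author, committer
    return commit
```

A first commit simply yields an empty parents list, since its text contains no parent line at all.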

So what exactly is a Commit object? What is the Tree object it points to? And how do they differ from the Blob object we created by hand earlier? Let us now look at Git's storage objects.

Git storage objects

In the Git world there are four types of storage objects: Blob, Tree, Commit, and Tag. The first three are the most basic. A Tag object is used for annotated tags, which carry additional attributes, and we will not go into it here.

Description of lightweight and annotated tags: git-scm.com/book/zh/v2/…

A Blob object

To demonstrate that Git is a content-addressable KV database, we insert the contents of a file into the Git repository:

➜ Zoker git:(master) ✗ cat testfile
Hello git
➜ Zoker git:(master) ✗ git hash-object testfile -w
9f4d96d5b00d98959ea9960f069585ce42b1349a

The Git object with the key 9f4d96d5b00d98959ea9960f069585ce42b1349a is in fact a Blob object, which stores the content of testfile. We can view it with the cat-file command:

➜ Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a
Hello git
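
What cat-file does for a loose object can be approximated in a few lines of Python. This is a minimal sketch, assuming the object is still stored loose (not in a pack file) and zlib-compressed in the form "<type> <size>", a null byte, then the content. The function name read_loose_object is invented for the example.

```python
import os
import zlib


def read_loose_object(git_dir, sha):
    """Return (type, content) for a loose object, roughly like git cat-file."""
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    # The decompressed data is "<type> <size>\0<content>".
    header, _, content = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content
```

In the repository above, read_loose_object(".git", "9f4d96d5b00d98959ea9960f069585ce42b1349a") would be the call that corresponds to the cat-file command just shown.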

Every time you modify a file, Git stores a complete snapshot of it rather than recording the difference. So if you modify the content of testfile and save it to the repository again, Git generates a new key from the new content. Git is a content-addressable KV database, after all.
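
The snapshot behaviour can be illustrated with a toy in-memory store. This is only a sketch: the dictionary stands in for .git/objects, and blob_id is a name chosen for the example, computing the key as the SHA-1 of a "blob <size>" header, a null byte, and the content.

```python
import hashlib


def blob_id(content):
    """Compute the key Git would assign to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


store = {}  # toy in-memory stand-in for .git/objects

for version in (b"version 1\n", b"version 2\n", b"version 1\n"):
    store[blob_id(version)] = version  # each version is stored whole

# Two distinct contents give two keys; the repeated content adds nothing new.
print(len(store))  # prints 2
```

Because the key depends only on the content, saving identical content twice costs nothing extra, while every modification, however small, produces a whole new object.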

In addition, the Blob object here stores text content, but it could just as well be binary. However, using Git to version binary files is not recommended. The biggest problem we encounter in the daily operation of the Gitee platform is oversized user repositories, generally caused by users committing large binary files. Because every change to a file is recorded as a snapshot, the space a frequently changed binary file occupies grows with each change. Although Git stores only the differences between versions of a file when it packs objects during GC, binary Blobs cannot be handled as effectively as text Blobs. So try not to store frequently changing binary content in a Git repository; use LFS for it instead. If you already have a lot of binaries, use filter-branch to slim the repository down, and your new colleagues will thank you when they clone it for the first time.

Use of LFS: gitee.com/help/articl…
Slimming down large repositories: gitee.com/help/articl…
filter-branch: github.com/git/git/blo…

Do you think something’s wrong here? Yes, the Blob only stores the contents of the file, but does not record the file name. How do we know which file the contents belong to? The answer is Git’s other important object: the Tree object.

The Tree object

In Git, Tree objects are used to group Blobs and subtrees together. All content is stored as Tree and Blob objects. A Tree object contains one or more tree entries, each of which holds the SHA value of a Blob or subtree along with information such as the corresponding file name. If you are familiar with the Linux file system, the model is a simplified version of it: a Tree corresponds to a directory, and a Blob roughly to an inode or the blocks holding a file's content.

The directory structure of the Tree object looks like this:

.
├── LICENSE
├── readme.md
└── src
    ├── libssl.so
    └── logo.png

In this way we can organize the contents of a repository just as we organize directories on Linux, treating Trees as the directory structure and Blobs as the actual file contents.

So how do you create a Tree object? In Git, Tree objects are created from the state of the staging area (the index). Normally we use git add to place files in the staging area, ready to be committed. In an empty repository without any commits, the state of the staging area is simply the set of files you added with git add, for example:

➜ Zoker git:(master) ✗ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   LICENSE
    new file:   readme.md
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    src/

The state of the staging area is recorded in the .git/index file, which we can inspect with the file command:

➜ Zoker git:(master) ✗ file .git/index
.git/index: Git index, version 2, 2 entries

You can see two entries in the index file, namely the two files LICENSE and readme.md in the root directory. In a repository that already has commits, if nothing is staged, the index reflects the directory tree of the current version. When a file is modified, added, or deleted and the change is staged, the index is updated to point to the SHA value of the new Blob object for that file.
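
The version and entry count that the file command reports come from the first 12 bytes of the index file: a 4-byte signature "DIRC" followed by two big-endian 32-bit integers. The following sketch reads just that header; the function name parse_index_header is an assumption of this example, and the entries themselves are not parsed.

```python
import struct


def parse_index_header(data):
    """Return (version, entry_count) from the first 12 bytes of an index."""
    if len(data) < 12:
        raise ValueError("an index header is 12 bytes long")
    # 4-byte signature "DIRC", then version and entry count as
    # big-endian unsigned 32-bit integers.
    signature, version, entry_count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    return version, entry_count
```

Reading the first 12 bytes of .git/index in binary mode and passing them to parse_index_header should give the same two numbers that file printed above.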

So to create a Tree object, we first need to put something in the staging area. Besides git add, we can also use the low-level command update-index to do this. Let us create a Tree object from the testfile file created above, starting by adding testfile to the staging area:

➜ Zoker git:(master) ✗ git update-index --add testfile
➜ Zoker git:(master) ✗ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   testfile

What Git does here is insert the content of testfile into the repository as a Blob, and then record that Blob's SHA value in the index, telling the staging area what the file currently contains.

➜ Zoker git:(master) ✗ tree .git/objects
.git/objects
├── 9f
│   └── 4d96d5b00d98959ea9960f069585ce42b1349a
├── info
└── pack

3 directories, 1 file
➜ Zoker git:(master) ✗ git cat-file -p 9f4d96d5b00d98959ea9960f069585ce42b1349a
Hello git

When Git executes update-index, it stores the content of the specified file as a Blob object and records it in the index. Because we already inserted this content with git hash-object earlier, no new object appears: the same content yields the same key. For the same reason, the following command is equivalent and reuses the existing Blob directly:

git update-index --add --cacheinfo 100644 9f4d96d5b00d98959ea9960f069585ce42b1349a testfile

This command simply places the previously generated Blob object in the staging area under the file name testfile. Now that the staging area has content, we can use the git write-tree command to create a Tree object from its current state:

➜ Zoker git:(master) ✗ git write-tree
aa406ee8804971cf8edfd8c89ff431b0462e250c
➜ Zoker git:(master) ✗ tree .git/objects
.git/objects
├── 9f
│   └── 4d96d5b00d98959ea9960f069585ce42b1349a
├── aa
│   └── 406ee8804971cf8edfd8c89ff431b0462e250c
├── info
└── pack

After the command runs, Git generates a Tree object with the SHA value aa406ee8804971cf8edfd8c89ff431b0462e250c from the current state of the staging area and, just as with Blob objects, stores it in the .git/objects directory.

➜ Zoker git:(master) ✗ git cat-file -p aa406ee8804971cf8edfd8c89ff431b0462e250c
100644 blob 9f4d96d5b00d98959ea9960f069585ce42b1349a    testfile

Viewing the Tree object with the cat-file command, you can see that it contains only one file, named testfile.
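
Under the hood, a Tree object is stored in a compact binary form rather than the text that cat-file -p prints. The sketch below shows, under simplifying assumptions, how its key could be computed from a list of entries: each entry is written as "<mode> <name>", a null byte, and the 20 raw bytes of the SHA value. The function name tree_id is invented for this example. Note that directories are stored with mode 40000, even though cat-file displays 040000.

```python
import hashlib


def tree_id(entries):
    """Compute the key of a tree from (mode, name, sha_hex) tuples."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, comparing directories as if their
        # name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, sha_hex in sorted(entries, key=sort_key):
        # Each entry is "<mode> <name>\0" followed by the 20 raw SHA bytes.
        body += mode.encode() + b" " + name.encode() + b"\x00"
        body += bytes.fromhex(sha_hex)
    header = b"tree %d\x00" % len(body)
    return hashlib.sha1(header + body).hexdigest()


# The single-entry tree from the example above.
print(tree_id([("100644", "testfile",
                "9f4d96d5b00d98959ea9960f069585ce42b1349a")]))
```

The printed value can be compared with the SHA value that git write-tree returned for the same staging area.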

Let us go on to create a second Tree object. We want it to contain the modified testfile, a new file testfile2, and the first Tree object as a subdirectory named duplicate. First, add the modified testfile and the new testfile2 to the staging area:

➜ Zoker git:(master) ✗ git update-index testfile
➜ Zoker git:(master) ✗ git update-index --add testfile2
➜ Zoker git:(master) ✗ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   testfile
    new file:   testfile2

Then we need to attach the first Tree object to the duplicate directory. We can use the read-tree command to do this:

➜ Zoker git:(master) ✗ git read-tree --prefix=duplicate aa406ee8804971cf8edfd8c89ff431b0462e250c
➜ Zoker git:(master) ✗ git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   duplicate/testfile
    new file:   testfile
    new file:   testfile2

Then we execute write-tree and view the second tree object through cat-file:

➜ Zoker git:(master) ✗ git write-tree
64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1
➜ Zoker git:(master) ✗ git cat-file -p 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1
040000 tree aa406ee8804971cf8edfd8c89ff431b0462e250c    duplicate
100644 blob 106287c47fd25ad9a0874670a0d5c6eacf1bfe4e    testfile
100644 blob 098ffe6f84559f4899edf119c25d276dc70607cf    testfile2

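We modified the content of testfile, created a new file testfile2, and attached the first Tree object as the duplicate directory of the second Tree object, so the structure now looks like this:

second Tree 64d62ce
├── duplicate -> first Tree aa406ee
│   └── testfile -> Blob 9f4d96d
├── testfile -> Blob 106287c
└── testfile2 -> Blob 098ffe6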

Now we know how to create a Tree object by hand. But what if we need these two snapshots later on? Do we have to remember the SHA value of every Tree object? That would be hard, and more importantly we would not know who created a snapshot, when, or why. A Commit object solves this problem.

Commit object

A Commit object records this additional information about a snapshot and maintains the parent-child relationships between snapshots. We can use the git commit-tree command to create a Commit object from a Tree object. Its usage is as follows:

➜ Zoker git:(master) ✗ git commit-tree -h
usage: git commit-tree [(-p <parent>)...] [-S[<keyid>]] [(-m <message>)...] [(-F <file>)...] <tree>

    -p <parent>           id of a parent commit object
    -m <message>          commit message
    -F <file>             read commit log message from file
    -S, --gpg-sign[=<key-id>]
                          GPG sign commit

The two key options are -p and -m. -p specifies the parent commit of this commit and can be omitted for the initial commit. -m specifies the commit message, used mainly to describe the reason for the commit. Let us use the first Tree object as our initial commit:

➜ Zoker git:(master) ✗ git commit-tree -m "init commit" aa406ee8804971cf8edfd8c89ff431b0462e250c
17ae181bd6c3e703df7851c0f7ea01d9e33a675b

Use cat-file to view the commit:

tree aa406ee8804971cf8edfd8c89ff431b0462e250c
author Zoker <[email protected]> 1613225370 +0800
committer Zoker <[email protected]> 1613225370 +0800

init commit

As you can see, the Commit object records the Tree object it points to, the author and committer, the commit time, and the commit message. Next, on top of this commit, we create a second commit that points to the second Tree object:
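
The way these fields end up in a Commit object can be sketched as follows. The helper names commit_body and object_id are invented for this example, and the identity string used in the usage note is a placeholder (the real address is obscured in the output above), so the resulting key is not expected to match the one Git printed.

```python
import hashlib


def commit_body(tree, parents, author, committer, message):
    """Build the uncompressed content of a commit object."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]  # none for a first commit
    lines.append("author " + author)
    lines.append("committer " + committer)
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message).encode()


def object_id(obj_type, body):
    """SHA-1 over "<type> <size>\\0" followed by the object content."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()
```

For instance, object_id("commit", commit_body("aa406ee8804971cf8edfd8c89ff431b0462e250c", [], who, who, "init commit\n")) gives the key of a first commit, where who is a placeholder string such as "Zoker <zoker@example.com> 1613225370 +0800". Since the timestamp is part of the hashed content, creating the same commit a second later yields a different key.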

➜ Zoker git:(master) ✗ git commit-tree -p 17ae181bd -m "add dir" 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1
de96a74725dd72c10693c4896cb74e8967859e58
➜ Zoker git:(master) ✗ git cat-file -p de96a74725dd72c10693c4896cb74e8967859e58
tree 64d62cef754e6cc995ed8d34f0d0e233e1dfd5d1
parent 17ae181bd6c3e703df7851c0f7ea01d9e33a675b
author Zoker <[email protected]> 1613225850 +0800
committer Zoker <[email protected]> 1613225850 +0800

add dir

We can use git log to view both commits, adding the --stat option to see which files changed:

commit de96a74725dd72c10693c4896cb74e8967859e58
Author: Zoker <[email protected]>
Date:   Sun Feb 13 22:17:30 2021 +0800

    add dir

 duplicate/testfile | 1 +
 testfile           | 2 +-
 testfile2          | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

commit 17ae181bd6c3e703df7851c0f7ea01d9e33a675b
Author: Zoker <[email protected]>
Date:   Sun Feb 13 22:09:30 2021 +0800

    init commit

 testfile | 1 +
 1 file changed, 1 insertion(+)

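The structure of all the objects we have created can be summarized as follows: the second commit de96a74 points to the second Tree 64d62ce and has the first commit 17ae181 as its parent, while the first commit points to the first Tree aa406ee; the Trees in turn point to the Blobs that hold the file contents.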

Exercise: create a commit using low-level commands

Create a commit using only the low-level commands mentioned above, such as hash-object, write-tree, read-tree, and commit-tree, and think about which of these steps correspond to git add and git commit.

How objects are stored

Git classifies data into different object types and computes a SHA value for each object, which is used for addressing. For a Blob object, Git does the following:

Identify the type of the object and build the header as type + space + content length in bytes + a null byte, for example blob 151\u0000
Concatenate the header with the content and compute the SHA-1 checksum of the result
Compress the concatenated header and content with zlib
Write the compressed data to the corresponding location under the objects directory, derived from the SHA value
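
The four steps above can be sketched in Python as follows. This is a simplified illustration rather than Git's own code: the function name write_loose_object is made up for the example, and real Git additionally writes to a temporary file first and marks object files read-only.

```python
import hashlib
import os
import zlib


def write_loose_object(git_dir, obj_type, content):
    """Store `content` as a loose object and return its key."""
    # 1. Header: "<type> <size in bytes>" followed by a null byte.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    # 2. The key is the SHA-1 of the header plus the content.
    store = header + content
    sha = hashlib.sha1(store).hexdigest()
    # 3. Compress header and content together with zlib.
    compressed = zlib.compress(store)
    # 4. Write to objects/<first 2 hex chars>/<remaining 38 chars>.
    directory = os.path.join(git_dir, "objects", sha[:2])
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, sha[2:])
    if not os.path.exists(path):  # same content, same key: nothing to do
        with open(path, "wb") as f:
            f.write(compressed)
    return sha
```

Calling write_loose_object(".git", "blob", data) with the bytes of a file plays roughly the role of git hash-object -w for that file.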

Tree objects and Commit objects are handled in much the same way; only the type in the header differs. The Git Internals chapter of Pro Git (2nd edition) shows how to implement the same logic in Ruby.

Git Internals: git-scm.com/book/zh/v2/…

Git references

We can now use git log --stat 17ae181b to view the first snapshot, and more generally use a SHA value to retrieve any snapshot. This is a hassle, though, because we have to remember meaningless strings of characters, and this is where Git references come in. In the section on Git's directory structure we saw that the refs directory stores references as Commit SHA values, so let us give our current version a meaningful name. We will use master as the default branch reference:

➜ Zoker git:(master) ✗ echo "17ae181bd6c3e703df7851c0f7ea01d9e33a675b" >> .git/refs/heads/master
➜ Zoker git:(master) ✗ tree .git/refs
.git/refs
├── heads
│   └── master
└── tags

At this point, master holds the SHA value of our first commit, so we can use master in place of 17ae181b:

➜ Zoker git:(master) ✗ git cat-file -p master
tree aa406ee8804971cf8edfd8c89ff431b0462e250c
author Zoker <[email protected]> 1613916447 +0800
committer Zoker <[email protected]> 1613916447 +0800

init commit

However, this is not our latest version. The latest version is the second commit, de96a74725dd72c10693c4896cb74e8967859e58. We could likewise overwrite the content of refs/heads/master with that commit's SHA value, but here we use a low-level command to do it:

➜ Zoker git:(master) ✗ git update-ref refs/heads/master de96a74725dd72c10693c4896cb74e8967859e58
➜ Zoker git:(master) ✗ cat .git/refs/heads/master
de96a74725dd72c10693c4896cb74e8967859e58

At this point, the branch master points to our latest version
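
A reference update of this kind can be approximated in Python as shown below. The sketch is loosely modeled on the lock-file approach (writing a sibling file ending in .lock and renaming it into place); the function name update_ref is chosen for the example, and features of the real command such as reflog entries and old-value checks are left out.

```python
import os


def update_ref(git_dir, ref_name, new_sha):
    """Point a ref such as "refs/heads/master" at a new commit id."""
    ref_path = os.path.join(git_dir, *ref_name.split("/"))
    os.makedirs(os.path.dirname(ref_path), exist_ok=True)
    lock_path = ref_path + ".lock"
    # Mode "x" fails if the lock file already exists, so two concurrent
    # writers cannot both believe they own the ref.
    with open(lock_path, "x", encoding="ascii") as f:
        f.write(new_sha + "\n")
    # Renaming replaces the old value in one step, so a reader never
    # sees a half-written ref file.
    os.replace(lock_path, ref_path)
```

Compared with redirecting echo into the file, this avoids ever leaving the reference in a partially written state.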

Conclusion

This article has covered the basic storage principles of Git and parts of their implementation. Other topics, such as pack files, the transfer negotiation mechanism, and storage formats, are left out for reasons of space and will be discussed later in the context of specific scenarios.