Original address: www.kenneth-truyers.net/2016/10/13/…

Original author:

Published: October 13, 2016

Git’s own manual describes it as a stupid content tracker. Yet it is probably the most widely used version control system in the world. The odd thing is that it doesn’t describe itself as a source control system at all. In fact, you can use Git to track any kind of content; for example, you can use it as a NoSQL database.

The man page describes it that way because Git makes no assumptions about what you store in it. Git’s underlying model is fairly basic. In this article, I want to explore the possibility of using Git as a NoSQL database (a key-value store). You could use the file system as the datastore and then use git add and git commit to save the files:

# saving a document 
echo '{"id": 1, "name": "kenneth"}' > 1.json 
git add 1.json 
git commit -m "added a file" 
# reading a document 
git show master:1.json
=> {"id": 1, "name": "kenneth"}

This works, but now you’re using the file system as the database: paths are the keys and the file contents are the values. That has several disadvantages.

  • We have to write all the data to disk before we can save it to Git.
  • We end up storing the data twice.
  • If the file system doesn’t deduplicate, we lose the automatic deduplication that Git offers us.
  • If we want to work on multiple branches at the same time, we need multiple checked-out working directories.

What we want is a bare repository: a repository where the files don’t exist on the file system, only in Git’s database. Let’s take a look at Git’s data model and the plumbing commands that let us do this.

Git as a NoSQL database

Git is a content-addressable filesystem, which means it is a simple key-value store: every time you insert content, it gives you back a key you can use to retrieve that content later. Let’s create something.

# Initialize a repository
mkdir MyRepo
cd MyRepo
git init

# Save some content
echo '{"id": 1, "name": "kenneth"}' | git hash-object -w --stdin
da95f8264a0ffe3df10e94eed6371ea83aee9a4d

hash-object is a Git plumbing command that takes content, stores it in the object database, and returns the key.

The -w switch tells it to store the content; without it, Git only computes the hash. The --stdin switch tells Git to read the content from standard input rather than from a file.

The key it returns is the SHA-1 hash of the content. If you run the command above on your machine, you’ll get exactly the same SHA-1. Now that we have something in the database, we can read it back.

git cat-file -p da95f8264a0ffe3df10e94eed6371ea83aee9a4d 
{"id": 1, "name": "kenneth"}

Git blobs

We now have a key-value store, and the object we stored is called a blob.

There’s just one problem: we can’t update it, because if we change the content, the key changes too. That means every version of our document would have a different key, and we’d have to remember all of them. What we need is a way to track versions under a key we choose ourselves.
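
To illustrate, using the two document versions that appear later in this article: changing even a single word produces a completely different key.

# The key is derived purely from the content, so any change yields a new key
echo '{"id": 1, "name": "kenneth"}' | git hash-object --stdin
da95f8264a0ffe3df10e94eed6371ea83aee9a4d
echo '{"id": 1, "name": "kenneth truyers"}' | git hash-object --stdin
42d0d209ecf70a96666f5a4c8ed97f3fd2b75dda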

Git trees

Trees solve two problems:

  • They let us remember the hash of our object along with its version.
  • They make it possible to store groups of files.

The best way to think about a tree is as a folder in a file system. Creating a tree takes two steps.

# Create and populate a staging area 
git update-index --add --cacheinfo 100644 da95f8264a0ffe3df10e94eed6371ea83aee9a4d 1.json 
# write the tree 
git write-tree
d6916d3e27baa9ef2742c2ba09696f22e41011a1

This, too, returns a SHA-1. Now we can read the tree back.

git cat-file -p d6916d3e27baa9ef2742c2ba09696f22e41011a1
100644 blob da95f8264a0ffe3df10e94eed6371ea83aee9a4d    1.json

At this point, the object database contains a single blob and a single tree that points to it.

To modify the file, we follow the same steps.

# Add a blob 
echo {"id": 1, "name": "kenneth truyers"} | git hash-object -w --stdin 42d0d209ecf70a96666f5a4c8ed97f3fd2b75dda 

# Create and populate a staging area 
git update-index --add --cacheinfo 100644 42d0d209ecf70a96666f5a4c8ed97f3fd2b75dda 1.json 

# Write the tree 
git write-tree
2c59068b29c38db26eda42def74b7142de392212

We now have two trees, each representing a different state of our document. That doesn’t help much yet, because we still need to remember the SHA-1 of a tree in order to get at its content.

Git commits

One level above trees sit commits. A commit holds five key pieces of information.

  1. The author of the commit
  2. The date it was created
  3. Why it was created (the commit message)
  4. The single tree object it points to
  5. One or more parent commits (for now we only consider commits with a single parent; commits with multiple parents are merge commits).

Let’s commit the two trees from above.

# Commit the first tree (without a parent) 
echo "commit 1st version" | git commit-tree d6916d3 05c1cec5685bbb84e806886dba0de5e2f120ab2a 

# Commit the second tree with the first commit as a parent 
echo "Commit 2nd version" | git commit-tree 2c59068 -p 05c1cec5 
9918e46dfc4241f0782265285970a7c16bf499e4

The object database now also contains two commits, with the second pointing to the first as its parent.

We have now built up a complete history for the file. You can open the repository with any Git client and see that 1.json is tracked properly. To prove the point, this is the output of git log.

git log --stat 9918e46

9918e46dfc4241f0782265285970a7c16bf499e4 "Commit 2nd version"
 1.json | 1 +
 1 file changed, 1 insertion(+)

05c1cec5685bbb84e806886dba0de5e2f120ab2a "Commit 1st version"
 1.json | 1 +
 1 file changed, 1 insertion(+)

And here’s how to get the contents of the file as of the last commit.

git show 9918e46:1.json 
{"id": 1, "name": "kenneth truyers"}

But we’re still not quite there, because we have to remember the hash of the last commit we made. All the objects we’ve created so far live in Git’s *object database*. A key property of this database is that it only stores immutable objects: once you’ve written a blob, a tree or a commit, you can never modify it without changing its key. You can’t delete objects either (at least not directly; git gc does remove dangling objects).

Git references

One level up, however, are Git references. References are not part of the object database; they live in the reference database and can be changed. There are different kinds of references, such as branches, tags and remotes. They are similar in nature, with some minor differences. For now, we’ll only consider branches. A branch is simply a pointer to a commit. To create a branch, we write a commit hash to the file system.

echo 05c1cec5685bbb84e806886dba0de5e2f120ab2a > .git/refs/heads/master

Now we have a branch, master, that points to our first commit. To move the branch, we issue the following command.

git update-ref refs/heads/master 9918e46

Now master points to the second commit.

Finally, we can read back the current state of our file.

git show master:1.json 
{"id": 1, "name": "kenneth truyers"}

Even when we add a new version of the file, with its tree and commit, the command above keeps working, as long as we move the branch pointer to the latest commit.

All of this may seem rather involved for a simple key-value store, but we can abstract it away so that a client application only needs to specify a branch and a key. I’ll come back to that in another article. For now, I’d like to discuss the potential advantages and disadvantages of using Git as a NoSQL database.
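
To give an idea of what such an abstraction could look like, here is a minimal, illustrative shell sketch built only from the plumbing commands shown above (the names gitdb_put and gitdb_get are made up for this example):

# Write a value: store a blob, stage it on top of the branch's tree,
# write the tree, create a commit and move the branch pointer.
gitdb_put() {
  branch=$1; key=$2; value=$3
  blob=$(printf '%s' "$value" | git hash-object -w --stdin)
  git read-tree "refs/heads/$branch" 2>/dev/null || git read-tree --empty
  git update-index --add --cacheinfo 100644 "$blob" "$key"
  tree=$(git write-tree)
  if parent=$(git rev-parse -q --verify "refs/heads/$branch"); then
    commit=$(echo "put $key" | git commit-tree "$tree" -p "$parent")
  else
    commit=$(echo "put $key" | git commit-tree "$tree")
  fi
  git update-ref "refs/heads/$branch" "$commit"
}

# Read a value: just git show with branch and key.
gitdb_get() {
  git show "$1:$2"
}

# Usage:
# gitdb_put master 1.json '{"id": 1, "name": "kenneth"}'
# gitdb_get master 1.json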

Data efficiency

Git is very efficient at storing data. As mentioned earlier, because of the way the hashes are computed, a blob with the same content is only stored once. You can verify this by adding a bunch of identical files to an empty Git repository and comparing the size of the .git folder to the size of the working directory: you’ll notice that the .git folder is much smaller.
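
A rough way to see this for yourself (the file names and the 1 MB size here are arbitrary):

mkdir dedupe-test && cd dedupe-test && git init
head -c 1048576 /dev/urandom > original.bin                    # 1 MB of random data
for i in $(seq 2 100); do cp original.bin "copy-$i.bin"; done
git add . && git commit -m "100 identical files"
du -ch *.bin | tail -1    # roughly 100 MB in the working directory
du -sh .git               # only a little over 1 MB: one blob, stored once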

But that’s not all: the same applies to trees. If you modify a file in a subtree, Git only creates new trees along the path to that file and keeps referencing the unaffected ones. For example, take a commit to a hierarchy with two subfolders.

Now, if I replace the blob 4658ea84 in that hierarchy, Git only creates new objects for the items that were modified and keeps referencing the items that weren’t. After replacing the blob with a different file and committing the change, only the new objects are added to the database.

As you can see, Git creates only the objects it needs and references the ones that already exist.

While Git is very efficient in the way it references existing data, if every small change resulted in a full copy, we would still end up with a huge repository after a while. To mitigate this, there is an automatic garbage-collection process. When git gc runs, it looks at your objects and, where possible, packs them: it stores one full copy of the underlying data together with deltas for the other versions. That way Git can still reconstruct every unique version of a blob without storing the data multiple times.
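
You can watch this happen with the plumbing; for example (the exact numbers will vary per repository):

git count-objects -v     # loose objects before packing
git gc
git count-objects -v     # the objects now live in a pack ('size-pack' shows the total)
git verify-pack -v .git/objects/pack/pack-*.idx | head    # delta'd objects list their base object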

Versioning

You get a full versioning system for free, and with versioning comes the advantage of never having to delete data. I’ve seen constructions like this in SQL databases:

id | name    | deleted
 1 | kenneth | 1

That’s fine for simple records like this one, but it’s usually not the whole story. Data may depend on other data (whether that’s enforced with a foreign key is an implementation detail), and when you want to restore a record, you may not be able to do so in isolation. With Git, you can restore the correct state at the level of the whole database, not just a single record, simply by pointing a branch at a different commit.

Another way I’ve seen it done is this.

id | street  | lastUpdate
 1 | town rd | 20161012

This is even less useful: you know the record was updated, but you have no information about what actually changed or what the previous value was. Every time you update the data, you are effectively deleting the old data and inserting new data; the old data is lost forever. With Git, you can run git log on any file and see what changed, who changed it, when, and why.
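
For example, with the repository we built earlier:

git log -p -- 1.json                          # every change to 1.json, with full diffs
git log --format='%h %an %ad %s' -- 1.json    # who changed it, when, and why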

Git tools

Git has a rich set of tools that you can use to explore and manipulate your data. Most of them are focused on code, but that doesn’t mean you can’t use them to process other data. Here’s a non-exhaustive overview of some of the tools I can think of.

With the basic Git commands, you can:

  • Use git diff to find the exact changes between two commits, branches, tags, … (see the example after this list)
  • Use git bisect to find out when something stopped working because of a data change
  • Use Git hooks to get automatic change notifications, build full-text indexes, update caches, publish data, …
  • Revert, branch, merge, …
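
As an example of the first point (the staging branch name is made up here):

git diff --stat 05c1cec5 9918e46       # what changed between the two commits created earlier
git diff master staging -- 1.json      # compare one document across two branches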

There are also external tools:

  • You can use any Git client to visualize and explore your data.
  • You can use pull requests, for example on GitHub, to review changes to the data before merging them.
  • Gitinspector: statistical analysis of Git repositories

Any tool that works with Git will work with your database.

NoSQL

Because it is a key-value store, you get the usual advantages of NoSQL storage, such as a schemaless database: you can store anything you want, and it doesn’t even have to be JSON.

Connectivity

Git can work in a partitioned network. You can put everything on a USB stick, save data while you’re disconnected, and push and merge when you’re back online. This is an advantage we use all the time when writing code, but for some use cases it can be a lifesaver.

Transactions

In the example above, we committed every change to the file. You don’t have to do that: you can also batch several changes into a single commit, which makes it easy to roll them back atomically later.

Long-running transactions are also possible: you can create a branch, commit a few changes on it, and then merge it (or discard it), as sketched below.
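
A sketch of such a long-running transaction, assuming a normal (non-bare) checkout and a made-up branch name tx-1:

git branch tx-1 master          # open the "transaction" at the current state
git checkout tx-1
# ... commit any number of changes on tx-1 ...
git checkout master
git merge tx-1                  # commit the transaction ...
# git branch -D tx-1            # ... or discard it instead (without merging)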

Backup and replication

With traditional databases, it’s often a bit of a hassle to set up a plan for full and incremental backups. Because Git already stores the entire history, you never need a full backup; a backup is simply a git push. And you can push anywhere: GitHub, Bitbucket, or your own Git server.

Replication is just as easy: using a Git hook, you can set up a trigger that runs git push after each commit. For example,

git remote add replica [email protected]:app.git 
cat .git/hooks/post-commit 

#!/bin/sh
git push replica

This is great! From now on, we should all use Git as a database…

Hang on, though: there are some downsides.

Queries

You can look up a value by its key… and that’s about it. The only saving grace is that you can structure your data in folders, which lets you fetch content by prefix, but that’s as far as it goes. Any other query means a full recursive scan, unless you build a dedicated index for it. If a slightly stale index doesn’t worry you, you can rebuild it on a schedule; otherwise you can use Git hooks to update the index the moment a commit lands.
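
For example, with keys laid out as paths (the users/ and orders/ prefixes are made up):

git show master:users/1.json                               # exact key lookup
git ls-tree --name-only master users/                      # every key under the users/ prefix
git ls-tree -r --name-only master | grep '^orders/2016/'   # a deeper prefix scan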

Concurrency

As long as we’re only writing blobs, concurrency is fine. The problems start when we write commits and update branches. Consider two processes that concurrently try to create a commit.

When the second process modifies its copy of the tree, it is actually working on an outdated tree; when it commits that tree, the changes made by the first process are lost.

The same story applies to moving the branch head: between creating your commit and updating the branch head, another commit may come in, and you end up pointing the branch at the wrong commit.

The only way around this is to lock all writes between reading the current tree and updating the branch head.
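
A minimal sketch of such a lock, assuming all writers run on the same machine and can use flock; the extra old-value argument to git update-ref acts as a safety net:

(
  flock -x 9                                                  # serialize all writers
  git read-tree refs/heads/master
  git update-index --add --cacheinfo 100644 "$blob" 1.json    # $blob from an earlier hash-object -w
  tree=$(git write-tree)
  parent=$(git rev-parse refs/heads/master)
  commit=$(echo "update" | git commit-tree "$tree" -p "$parent")
  git update-ref refs/heads/master "$commit" "$parent"        # fails if the branch moved anyway
) 9> .git/db.lock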

Speed

We all know Git is fast, but that’s for things like creating branches. When it comes to commits per second, it isn’t actually that fast, because every commit is written to disk. We usually don’t notice, because when writing code we don’t commit many times per second (at least I don’t). After running some tests on my local machine, I hit a limit of about 110 commits per second.

Brandon Keepers showed some figures in a talk a few years ago and got about 90 commits per second, which seems in line with what hardware improvements since then would give.

110 commits per second is enough for many applications, but not for all of them. It’s also a theoretical maximum, measured on my local development machine with plenty of resources. Several factors influence the speed.

The size of the tree

In general, prefer many subdirectories over putting all documents in the same directory; this keeps the write speed as close to the maximum as possible. The reason is that every time you create a new commit, you copy the tree, make changes to it, and then save the modified tree. You might think this would affect the repository size as well, but it doesn’t: running git gc ensures the trees are stored as deltas rather than as two full copies. Let’s look at an example.

In the first case, we have 10,000 blobs in the root directory. When we add a file, we copy a tree with 10,000 entries, add one, and save it. Because of the size of the tree, this can be a lengthy operation.

In the second case, we have four levels of trees, each with 10 subtrees, and the last level holding 10 blobs (10 × 10 × 10 × 10 = 10,000 files).

In this case, adding a blob does not require copying the entire hierarchy; we only need to copy the trees along the path that leads to that blob.

So by using subfolders, we copy 5 trees with 10 entries each instead of 1 tree with 10,000 entries, which is much faster. The more your data grows, the more you benefit from subfolders.
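
One simple way to get such a layout is to shard keys by a prefix of their own hash (illustrative; $blob comes from an earlier git hash-object -w):

key="1.json"
shard=$(printf '%s' "$key" | git hash-object --stdin)
dir1=$(echo "$shard" | cut -c1-2); dir2=$(echo "$shard" | cut -c3-4)
git update-index --add --cacheinfo 100644 "$blob" "$dir1/$dir2/$key"
# the document is then addressable as, say, master:ab/cd/1.json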

Combining changes into transactions

If you need more than 100 commits per second, you probably don’t need to be able to roll back every individual change. In that case, instead of committing every change separately, you can batch several changes into a single commit. Blobs can be written concurrently, so it’s possible to write 1000 files concurrently and then create one commit that saves them all to the repository. This has drawbacks, but if you want raw speed it’s the best approach.
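
A sketch of such a batched write (the batch/ directory of incoming JSON files is made up for this example):

for f in batch/*.json; do
  blob=$(git hash-object -w "$f")                                  # write each blob
  git update-index --add --cacheinfo 100644 "$blob" "$(basename "$f")"
done
tree=$(git write-tree)
parent=$(git rev-parse refs/heads/master)
commit=$(echo "import batch" | git commit-tree "$tree" -p "$parent")
git update-ref refs/heads/master "$commit"                         # one commit for the whole batch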

A way around this is to give Git a different backend that doesn’t flush to disk immediately but writes to an in-memory database and flushes to disk asynchronously. That isn’t easy to do, though. I tested this by connecting to the repository with libgit2sharp and using the Voron backend (which is open source and also has a variant that uses Elasticsearch). It was a lot faster, but you lose the benefit of being able to inspect your data with any standard Git tool.

Merging

Another potential pain point is merging data from different branches. As long as there are no merge conflicts, it’s actually quite a pleasant experience, because it enables a lot of nice scenarios:

  • Changes to data require approval before they go live.
  • Run tests on live data and restore it afterwards.
  • Isolate work before merging the data.

Essentially, you get all the branching fun you have in development, but applied to data. The problem is merge conflicts: merging data is quite hard, because there is no general way to decide how a conflict should be resolved.

One potential strategy is to store the conflict as-is when the data is written and, when the data is read, show users both versions so they can choose the correct one. Even then, getting this right is a difficult task.

Conclusion

Git can work well as a NoSQL database in certain situations. It has its time and place; I think it’s particularly useful when:

  • You have hierarchical data (Git is inherently hierarchical).
  • You need to be able to work in disconnected environments.
  • You need an approval mechanism for your data (i.e. you need branching and merging).

In other cases, it’s not a good fit:

  • You need extremely fast write performance.
  • You need complex queries (although you can work around this by building indexes from commit hooks).
  • You have a huge data set (write speeds slow down even further).

So, there you go, that’s how you use Git as a NoSQL database. Let me know what you think!


Translation via www.DeepL.com/Translator (free version)
