Hello, everyone! Welcome to the amazing popular science channel of Bobo and Germ!

Today, we are going to show you how programmers can archive and manage versions of files.

Be prepared: today's “popular science” is a little hardcore. Germ wants to cover the topic from every angle, including requirements analysis, product design and code implementation, so there is a lot of material and it may not be easy to follow...

But the content should still be a little bit fun; after all, Germ always writes rambling stuff. If it feels hard to digest right now, consider bookmarking it first ~

Before we begin, let’s introduce a concept called version management.

Let’s start with everyday life.

Germ, the rascal, sometimes wants to go back to his own past: back to himself in primary school, in middle school, in high school, or back to before university to do university all over again...

Each stage of life can be seen as a version of ourselves: the primary school version of Germ, the middle school version of Germ, the high school version of Germ, the university version of Germ...

If God had saved every stage for us, we could go back to an old version and start over!

Starting a new life from an old version, our lives might have several branching routes...

This doesn't seem realistic at present, and even if he really had the choice, Germ would never return to a past version, because he can't guarantee that he would still meet Bobo on another branch of life...

Ahem, stop!

Life cannot be archived, but files on a computer can!

Consider the following scenario: suppose the college asks us to prepare the publicity for a graduation party, which requires a PPT, a manuscript and a poster.

So we happily compiled all of the college's undergraduate information into the PPT, manuscript and poster.

Then the college said: why is there only undergraduate data? The graduate students' data has to be added too!

So we added the graduate students' information to the existing PPT, manuscript and poster.

Then the college said: change of plans at short notice, the graduate students' party will be held separately!

You may be dismayed: we have already added the graduate students' information to the PPT, manuscript and poster, and it is mixed in with the undergraduate content. Removing it is far too much trouble!!

It would have been nice if we had saved the undergraduate planning materials as a version in advance, so that we could send the undergraduate version directly to the college to finish the work.

Now we have the scenario and the pain point, so let's set about designing a software product to solve this problem. (Some students will say: nonsense, I can just copy each version myself, why build software for that? Well, bear with us. It's not just about archiving, and manual copies are prone to all kinds of problems, such as forgetting to save, forgetting where you saved, or mixing up the order of the copies... Try it and you'll see.)

Now we have a folder containing the PPT, documents and posters. Let's create another folder inside it for archiving, and just name it after ourselves: “.jun”!

Now that we have created a “.jun” folder to hold the version information for the current directory, we need to think about how to save the version of each file.

In computing terms, we have reached the stage of designing the underlying data structure. We can think of the “.jun” folder as a database that holds the version data of the files in the current folder.

Well, you know what?

Let's call these .doc, .psd and .ppt files objects. We create an “objects” folder inside the “.jun” folder to store information about each object, and then we have an object database!

Well, as impressive as that sounds, Germ has only created two folders...

Now let's think about what should be stored in an object. What exactly can accurately pin down one version of a file?
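To make the "two folders" concrete, here is a minimal sketch in Python. The function name init_repo is made up for illustration, and the lowercase folder name "objects" is an assumption; the only thing taken from the article is that a “.jun” folder holds an objects folder.

```python
import tempfile
from pathlib import Path


def init_repo(work_dir: Path) -> Path:
    """Create work_dir/.jun/objects and return the objects folder."""
    objects_dir = work_dir / ".jun" / "objects"
    # parents=True also creates ".jun"; exist_ok=True makes a re-run harmless.
    objects_dir.mkdir(parents=True, exist_ok=True)
    return objects_dir


# Try it in a throwaway directory instead of a real project folder.
with tempfile.TemporaryDirectory() as tmp:
    created = init_repo(Path(tmp))
    print(created.relative_to(tmp))
```

Running it twice on the same folder is safe, which is handy if the tool is ever started again in a directory that already has an archive.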

An object must contain at least three pieces of information:

  1. The original content of the file, such as what is in our PPT and Word documents. Saving it directly may take a lot of space (at least as much as the files themselves), so we can compress it first and then save it.

  2. The type of the object. Besides the PPT, documents and posters, we may add new files or folders in the future. Files and folders are both objects, but they can be told apart by their type. (Keep an eye on this: it is the cleverest part of version control software, as you will see later.)

  3. A string of letters and digits, the hash value, that identifies the object. Each object has its own hash value (in practice just a jumble of characters, so two objects are very unlikely to share one).

Now, you might wonder: why do you only talk about how to design it? What we want to know is why you design it this way.
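The three pieces of information above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the type name "blob" for a plain file, the SHA-1 hash and the "type, zero byte, content" layout are choices made here, not something the article specifies.

```python
import hashlib
import zlib
from typing import Tuple


def make_object(obj_type: str, content: bytes) -> Tuple[str, bytes]:
    """Return (hash value, compressed payload) for one object."""
    # Put the type in front of the content, so a file and a folder
    # with identical bytes still end up with different hash values.
    payload = obj_type.encode() + b"\0" + content
    object_hash = hashlib.sha1(payload).hexdigest()
    return object_hash, zlib.compress(payload)


def read_object(compressed: bytes) -> Tuple[str, bytes]:
    """Undo make_object: return (type, original content)."""
    obj_type, _, content = zlib.decompress(compressed).partition(b"\0")
    return obj_type.decode(), content


object_hash, stored = make_object("blob", b"slides for the graduation party")
print(object_hash)
print(read_object(stored))
```

Because the hash is computed from the content, saving the same unchanged file twice yields the same hash value, so it only needs to be stored once.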

Here’s the answer:

Let's take a look at the first picture. The PPT, manuscript and poster in the current folder can each be represented by an object:

Since they are in the same directory, we can group them under one larger object. We define the type of this larger object as tree. A tree object corresponds not to a file but to a folder:

Careful readers will notice: huh, the tree object already contains information about every file in the current folder.

Through this tree object we can find the objects corresponding to the PPT, document and poster. These objects hold the original content of the PPT, document and poster as it was at a certain point in time. As long as we decompress them, we can restore the folder to what it was when we saved it.
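Saving and restoring a folder through a tree object might look like the sketch below. It is deliberately simplified: a dictionary named store stands in for the objects folder, the tree is written as JSON, only a flat folder is handled, and all function names are invented for this example.

```python
import hashlib
import json
import zlib
from typing import Dict, Tuple

store: Dict[str, bytes] = {}  # hash value -> compressed object


def put(obj_type: str, content: bytes) -> str:
    """Compress and store one object, returning its hash value."""
    payload = obj_type.encode() + b"\0" + content
    key = hashlib.sha1(payload).hexdigest()
    store[key] = zlib.compress(payload)
    return key


def get(key: str) -> Tuple[str, bytes]:
    """Load one object and return (type, original content)."""
    obj_type, _, content = zlib.decompress(store[key]).partition(b"\0")
    return obj_type.decode(), content


def save_folder(files: Dict[str, bytes]) -> str:
    """Store every file, then a tree that maps file names to hashes."""
    entries = {name: put("blob", data) for name, data in files.items()}
    return put("tree", json.dumps(entries, sort_keys=True).encode())


def restore_folder(tree_hash: str) -> Dict[str, bytes]:
    """Rebuild the folder contents from a single tree hash."""
    obj_type, content = get(tree_hash)
    if obj_type != "tree":
        raise ValueError("not a tree object")
    return {name: get(h)[1] for name, h in json.loads(content).items()}


folder = {"party.ppt": b"slides", "speech.doc": b"text", "poster.psd": b"art"}
tree_hash = save_folder(folder)
print(restore_folder(tree_hash) == folder)
```

One hash value is enough to get the whole folder back, which is exactly why the tree object is so useful.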

The next question is: We don’t keep just one archive, we keep many. How do we organize a series of archives?

Next we'll introduce a new object type called commit, meaning one submission. To chain the archives together, we add a field that records which commit object came before this one, and perhaps other information such as the time, so that each commit is a version:

The diagram above is not very intuitive because space is limited, so let's draw another one to make it clear that each commit object points to a tree object. In other words, a commit object is just a thin wrapper around a tree object. Although the tree object alone is enough to restore a folder's past archive, the “parent” attribute lets us link a series of archives together. A commit can also be time-stamped, so that the whole archive history is clear.

At this point you may come up with some strange ideas. Suppose the college sets two tasks: the undergraduates are to hold one joint graduation party with the graduate students and another joint graduation party with the doctoral students. Can our version management system still cope? (Only Germ could come up with such a bizarre activity.)
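The chain of commits can be sketched as a linked list. The field names, the "message" field and the placeholder tree hashes below are assumptions made for the example; the article only asks for a tree, a parent and optionally a time.

```python
import hashlib
import json
import time
from typing import Dict, List, Optional

commits: Dict[str, dict] = {}  # commit hash -> commit record


def commit(tree_hash: str, parent: Optional[str], message: str) -> str:
    """Store a commit that wraps one tree and remembers its parent."""
    record = {
        "tree": tree_hash,
        "parent": parent,  # None for the very first archive
        "time": time.time(),
        "message": message,
    }
    key = hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()
    commits[key] = record
    return key


def history(head: Optional[str]) -> List[str]:
    """Follow parent pointers from the newest commit back to the first."""
    messages = []
    while head is not None:
        messages.append(commits[head]["message"])
        head = commits[head]["parent"]
    return messages


# "tree-hash-1" and "tree-hash-2" are placeholders for real tree hashes.
first = commit("tree-hash-1", None, "undergraduate materials")
second = commit("tree-hash-2", first, "add graduate students")
print(history(second))
```

Walking the parent pointers lists the archives from newest to oldest, and the tree stored in any of them can be used to restore that version.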

Sure, please look at the picture below:

We can create two commit objects, each pointing back to the archive where the undergraduate data was originally stored, and then work on the two new archives without them affecting each other. We can keep creating new archives after each of the two, like this:

We can give this feature a nice name: “branch”.

Branching can also be used in another way. Consider the following scenario:

The college stops doing anything fancy and just wants a well-planned graduation event for undergraduates. However, the deadline set by the college is very tight, so Germ cannot finish it alone. He asks his classmates A-Fork and A-Go to work with him. The three of them take on the PPT, the manuscript and the poster, one each.

So as not to affect the original version, each of the three created a branch to work on, and each stage of their work was saved as a version too:

Once everyone has finished, the latest versions are merged:
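One simple way to picture a branch is as a name that points at the newest commit of one line of work, which is also how Git treats branches. The sketch below assumes exactly that; the branch names and the commit_on helper are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"] = None


def history(commit: Optional[Commit]) -> List[str]:
    """List commit messages from newest to oldest."""
    messages = []
    while commit is not None:
        messages.append(commit.message)
        commit = commit.parent
    return messages


base = Commit("undergraduate materials")
# Both branches start from the same archive.
branches: Dict[str, Commit] = {"with-graduates": base, "with-doctoral": base}


def commit_on(branch: str, message: str) -> None:
    """Add a commit to one branch by moving only that branch's pointer."""
    branches[branch] = Commit(message, parent=branches[branch])


commit_on("with-graduates", "add graduate students")
commit_on("with-doctoral", "add doctoral students")
print(history(branches["with-graduates"]))
print(history(branches["with-doctoral"]))
```

The two histories differ at the top but share the same first archive, so work on one branch never touches the other.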

This makes it easier for everyone to work together.

Some readers may ask: you designed the commit object with a parent pointer, and each parent pointer points to the previous commit object. In the figure above, which commit does the final one's parent pointer point to? It looks like it has three parents, right?

Uh, this is indeed an oversight; the diagram was not drawn properly...

Yes. Sometimes, once a data structure is designed, it is best not to change it. When we run into a scenario that the existing data structure cannot handle directly, we have to design a procedure for it instead, and that is the so-called “algorithm”... (After all, even create, read, update and delete are algorithms to some extent...)

We always say: software = data structure + algorithm. Next, Germ will walk you through how this procedure was designed.

Because our commit object can only point to one parent node, we design the merge process like this: the final merge is handled by one person, standing on one branch (let's call it the main branch).

As the figure above shows, we treat A-Fork's branch as the main branch. To bring the content of A-Go's branch into A-Fork's branch, we add one more node, a merge node, to A-Fork's branch, and this merge node points to the previous node on A-Fork's branch. As a result, both the PPT and the manuscript content end up on the main branch.

Now let's merge in the poster made by Germ.

In this way we broke neither the original data structure nor the design of the software: it is still an archive and version management tool in which we can find every version.

Of course, this design is not ideal and we could do better. For example, keep a separate main branch and have A-Fork, A-Go and Germ each produce their content on a branch of their own (four branches in total); when everyone is done, merge everything back into the main branch. That way the main branch stays clean, unlike the current one, where you can still see all of A-Fork's intermediate versions...
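The merge step can be sketched as a file-by-file comparison against the version both branches started from. This is a simplified illustration, not Git's actual merge code: trees are plain dictionaries, whole files are compared rather than lines, and all names are made up. It follows the single-parent design described here; Git itself lets a merge commit record more than one parent.

```python
from typing import Dict

Tree = Dict[str, str]  # file name -> content, standing in for a tree object


def merge_trees(base: Tree, ours: Tree, theirs: Tree) -> Tree:
    """Combine two branches that both started from the same base version."""
    merged: Tree = {}
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:
            result = o  # both sides agree
        elif o == b:
            result = t  # only the other branch changed this file
        elif t == b:
            result = o  # only our branch changed this file
        else:
            raise ValueError(f"conflict in {name}")
        if result is not None:  # None means the file was deleted
            merged[name] = result
    return merged


def merge_commit(parent_id: str, tree: Tree) -> dict:
    """The merge node: an ordinary commit with exactly one parent."""
    return {"parent": parent_id, "tree": tree}


base = {"ppt": "draft", "manuscript": "draft", "poster": "draft"}
main = {**base, "ppt": "final slides"}       # work done on the main branch
other = {**base, "manuscript": "final text"}  # work done on another branch

merged = merge_trees(base, main, other)
# "last-commit-on-main" is a placeholder for a real commit hash.
node = merge_commit("last-commit-on-main", merged)
print(node)
```

If both branches had changed the same file in different ways, merge_trees would raise an error instead of guessing, and the people involved would have to agree on one version before merging.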

Well, that’s what software development is all about, learning best practices by trial and error!

You are talking about Git, aren’t you?

Yes. Today we have been introducing the basic design behind the version control part of Git, a distributed version control system. Most programmers collaborate on development using Git. Unlike in the example above, programmers write code files instead of slides and documents. A single feature is often developed by several programmers together, which you can think of as group work. A common mainstream collaboration flow looks like this:

A brief introduction:

  1. The master branch is the fully functional branch, ready to be released and deployed at any time.

  2. The hotfix branch is used to fix production bugs and ship patches quickly.

  3. The release branch is used for publishing.

  4. The develop branch is the integration branch for features.

  5. A feature branch is where one feature is developed.

As for the specifics of how each branch is used, you can search online, but in Germ's experience reading about it doesn't help much. Only when you take part in development at a company will you truly understand what each branch is for.

What we covered today is the most basic principle of Git. Once this foundation is clear, using Git becomes a lot easier. Another important feature of Git is that it is distributed; that is, it is used for archiving and version control by many people (a company or a team) working together.

An expert who read our article thought it would be better if we also talked about the distributed part, so let's do that.

We modern programmers who write code with Git take it for granted that version control systems are distributed, as they are today. However, earlier version control systems were centralized.

Let’s start with a brief introduction to centralization, again using the example in this article:

As shown in the figure above, there is one central place that manages all files, and everyone can only pull the specific files they need to work on. This is called centralized development.

There are two drawbacks to centralized development:

  1. Efficiency. If something goes wrong with the central repository, nobody can work properly, because everyone depends on it to pull and push files.

  2. Stability. If the central repository goes down, the archives are gone.

The most direct fix we can think of is to give everyone a copy of the repository, as shown below:

Some students may ask, “Do I still need to learn centralized version management software like SVN?”

Of course not. With the arrival of Git, the centralized version control systems of the past have been thoroughly displaced. SVN's version management strategy is quite different from Git's. This is not to say that SVN is bad. The shift came from changing times and a changing environment; neither is simply better or worse.

Let's try looking at it from another angle (this is not necessarily accurate): in the past, memory, disk and computing resources were all precious. Machines back then were not well suited to keeping a complete copy of the project archive on every machine, nor to Git's strategy of compressing and saving the full original content of each file (saving only increments can save a lot of space, at the cost of performance). Under those constraints, a centralized version control system like SVN may well have been a good choice.

Now machines keep getting better, with big disks and fast networks, so keeping a full copy on everyone's machine is feasible. What's more, Git itself is well designed: it grew up on the shoulders of the Linux operating system (it was originally created to develop Linux in a distributed way), and open source communities such as GitHub later grew up around it. Gradually, everyone became willing to move to Git.

Another way to think about it: if a project is really, really big, and a single machine can’t pull the entire project, centralized version control is definitely the better choice.

However, microservices are popular in the industry now and decoupling systems is the general trend, which means big projects are destined to be split into small ones. Small projects are ideal for distributed version control.

Whatever exists has its reasons: when we evaluate any technology, we cannot separate it from the background of its time and the practical needs it served.

This article does not teach Git operations. There are plenty of articles about Git operations online, covering every clever trick you can think of.

If you want to come up with clever tricks of your own, follow Germ and dig into the data structures to explore the principles. Blogs that only teach clever tricks usually don't explain the principles, and understanding the principles is how you discover better tricks!

Some students may ask: “Germ, I understand the principle, but I can't quite map it onto the operations. When we use Git it is usually just pull and push, two commands that take you everywhere. Branching I get, but I don't quite understand these distributed operations and can't connect them to the principle...”

Germ's mentor taught him a very important way of learning: when we meet something unfamiliar, we should first ask ourselves, if you were given this to build, how would you implement it?

I've read a little of the Git source code and looked through the official documentation, but it is impossible to lay all of that out here, so we will just try to point you in the right direction. (For deep and accurate learning in the end, we still suggest reading the source code and documentation. If you want to learn something, this is a hardship you have to go through.)

Let's use this diagram to illustrate. Before looking at it, make sure you understand the underlying data structure of objects and how objects are linked together to form branches. Let's start with a quick recap:

We mentioned before that using A-Fork's branch as the main branch is not a good design, but that doesn't affect our explanation. We assume that A-Fork has finished his work, and that the remote repository is A-Fork's repository.

Now that A-Fork has finished his work, he pushes his local repository to the remote repository.
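If we ask ourselves how we would implement a push, one possible answer is: copy the objects the other side is missing, then move its branch pointer. The sketch below assumes a repository is just an object table plus a branch table; every name in it is invented, and it leaves out the network as well as the check real Git performs to refuse a push that would discard remote history.

```python
import copy
from typing import Dict


def new_repo() -> Dict[str, dict]:
    """An empty repository: stored objects plus branch pointers."""
    return {"objects": {}, "branches": {}}


def clone(remote: Dict[str, dict]) -> Dict[str, dict]:
    """Everyone gets a full, independent copy of the repository."""
    return copy.deepcopy(remote)


def push(local: Dict[str, dict], remote: Dict[str, dict], branch: str) -> int:
    """Send objects the remote lacks and update its branch pointer."""
    missing = {
        key: obj
        for key, obj in local["objects"].items()
        if key not in remote["objects"]
    }
    remote["objects"].update(missing)
    remote["branches"][branch] = local["branches"][branch]
    return len(missing)


remote = new_repo()
remote["objects"]["c1"] = "first archive"
remote["branches"]["main"] = "c1"

local = clone(remote)                      # a full copy on our own machine
local["objects"]["c2"] = "second archive"  # new work done locally
local["branches"]["main"] = "c2"

print(push(local, remote, "main"))
print(remote["branches"]["main"])
```

Because the local copy is complete, work can continue even when the remote repository is unreachable, and the remote can be rebuilt from any clone.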

Our version management software is a linked list plus trees. Each circle here is a commit object, and the circles in the two repositories correspond to one another.

With this in mind, let's take a look at the merge process. First, the local repository is pushed to the remote repository (main branch):

Now, can you see why merging branches produces an extra commit?

Some other common commands, such as rebase, are also operations on this linked list. If they puzzle you, practice some algorithm problems to build up the fundamentals.

However, working together brings some problems, such as a conflict when several people change the same file at the same time. But once you understand the principle it is easy to resolve: talk it over, decide whose changes to keep, agree on a version without conflicts, and then merge. The principle behind merging has already been covered.

It is also because many people work together that our software must be able to use the network, with many network interactions, which is essential to distributed software. But we should trust that once the underlying principle is understood, the specifics of how to use it and how to cooperate with others over the network are simple. The key to learning a piece of software is to understand it through its underlying data structures; once you do, the operations on top fall into place.

A shortcut to learning how Git works: implement a simple Git yourself

Talk is cheap. If you want to learn something, you still have to do it.

If you are interested, write it in whatever language you are most familiar with. After all, writing code is just a way to realize ideas. Roll up your sleeves and get to work!

After some thought, the next technical popular science piece may introduce web crawlers, because crawlers extend easily from a single machine to clusters and distributed systems, all those lofty-sounding things. We will do our best to get you through the door and to make learning fun by combining theory with practice!