UNDERSTANDING GIT: By building it from scratch

Posted on July 7, 2018 / July 9, 2018 by Daniel Fat, in BACKEND

This article assumes some knowledge of Git. It is for you if you have ever wondered how Git does what it does and, more importantly, why it all seems so complicated.

To understand Git better, I propose a simple thought experiment. Let’s imagine that Alice, Bob and Charlie are dedicated and smart developers, but on one particularly nasty day their centralized version control system dies. Since all three are hardworking and still want to deliver changes to their client, they carry on working. Later in the day, once each of them has made quite a few changes, they need some way to merge all those changes together.

If their centralized version control server were up, they could just let it give each change a sequential number.

However, since each of them is working on their own, they need a common way to address files and the changes made to them, one that does not depend on time or on who worked where. Alice proposes a hash function that takes the file content as input. Since the hash depends only on the file’s content, you can tell files apart by their SHA-1, regardless of who made the changes and when. As a bonus, if two developers make exactly the same change, the results will have the same SHA-1.

In a Linux console, if one were to execute the following:

$ echo 'test content' | git hash-object -w --stdin

and compare it to

$ echo 'test content' > test.txt
$ git hash-object -w test.txt

They would receive the same hash value:

d670460b4b4aece5915caf5c68d12f560a9fe3e4

So for Git, the content of test.txt and the output of our echo command are indistinguishable: only the content is hashed, not the file name. In Git parlance, such objects are called blobs.
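To make the blob idea concrete, here is a minimal Python sketch of what `git hash-object` computes. The function name `git_blob_hash` is made up for this illustration. One detail the story glosses over: Git does not hash the bare content, it first prepends a short header of the form `blob <size>` followed by a NUL byte.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# echo appends a trailing newline, so the hashed content is 13 bytes long.
print(git_blob_hash(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
# An empty file always hashes to the same well-known value.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The file name never enters the calculation, which is why the echo pipeline and test.txt end up with the same identifier.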

OK, that’s great for comparing files, but Bob notes that in any decent project more than one file will change. To solve this, Alice proposes that they track all the files in some kind of list, one that can handle a folder hierarchy. Charlie notes that this would be best solved with some kind of composite pattern. Alice notes that the leaves obviously correspond to blobs, while the composite elements map naturally to folders.

Bob remembers that in the Unix philosophy a folder is a file that holds a list of all the files and folders inside it, as well as links to its parent and to itself. Charlie notes that Unix uses inodes as identifiers, but those would be unwieldy in a system like theirs. Bob adds that they already have unique identifiers: hashes.

So they end up with something like the following:

blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 README.md
tree 9c7e41c855b3b3a976a72dd268db17a379c0ab0e lib
blob d670460b4b4aece5915caf5c68d12f560a9fe3e4 test.txt

This isn’t that different from the output of git cat-file, shown below:

$ git write-tree
0406b210c2c38cd63eddbad2ea261251e039fbe4
$ git cat-file -p 0406
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 README.md
040000 tree 9c7e41c855b3b3a976a72dd268db17a379c0ab0e lib
100644 blob d670460b4b4aece5915caf5c68d12f560a9fe3e4 test.txt

In the example above, lib is a folder containing a single Ruby file, which we can list to see its contents:

$ git cat-file -p 9c7e4
100644 blob 5635b63d64f0bf8cba141d49918e8f21b9940aed HelloWorld.rb
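The listings above can be rebuilt by hand. The following Python sketch serializes a tree the way Git stores it, as far as the on-disk format goes: each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the hash, with entries sorted by name. The helper names `git_object_hash` and `tree_body` are invented for this example, and the hashes are the ones from the listing.

```python
import hashlib


def git_object_hash(kind: bytes, body: bytes) -> str:
    """Hash an object body with Git's "<kind> <size>\\0" header."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_sha) entries into a tree object body."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git compares directory names as if they ended with a slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_sha in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_sha)  # 20 raw bytes, not 40 hex characters
    return body


entries = [
    ("100644", "test.txt", "d670460b4b4aece5915caf5c68d12f560a9fe3e4"),
    ("40000", "lib", "9c7e41c855b3b3a976a72dd268db17a379c0ab0e"),
    ("100644", "README.md", "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"),
]
body = tree_body(entries)
print(git_object_hash(b"tree", body))
```

Note that the stored mode for a folder is `40000`; the leading zero in `040000` is added by `git cat-file -p` for display. Because a tree's hash covers the hashes of everything beneath it, changing any file changes the hash of every folder above it.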

Or visualized:

[Figure: Git 2]

So far, we have a way to uniquely identify files and a way to capture the state of the file system at a single moment. This is already shaping up into a rudimentary version control system. Now Alice, Bob and Charlie need a way to capture their changes and attach some metadata.

Alice suggests that if they attach some text and a link to a tree, they will have something resembling a commit in their old version control system. Bob suggests they also add some identifying information and the date of the change. So a commit becomes:

Tree: 0406b…
Author: Alice@example.org
Time: Tue May 02 2018 12:00:12 +0100
First commit

Or to represent it more visually:
[Figure: Git 3]

Charlie adds that once they add another commit, it will need to reference the previous one, so they know which commit follows which. This only slightly complicates the commit, for example:

Tree: ae996
Parent: 5ba2b (First commit)
Author: Alice@example.org
Time: Tue May 02 2018 12:00:12 +0100
Second commit
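The commit format above is a simplification. As a rough Python sketch, here is how a real Git commit object is laid out and hashed: plain text lines for the tree, an optional parent, the author and a committer (a field the story leaves out), then a blank line and the message. The function names are made up for this example, the author name is an assumption, and the second tree hash is a stand-in (the hash of an empty tree) because the article only gives an abbreviation.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def git_object_hash(kind: bytes, body: bytes) -> str:
    """Hash an object body with Git's "<kind> <size>\\0" header."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, message, ident, when, parent=None) -> bytes:
    """Build the text of a commit object; parent is omitted for a first commit."""
    # Git stores the time as seconds since the epoch plus a UTC offset.
    stamp = f"{ident} {int(when.timestamp())} {when.strftime('%z')}"
    lines = ["tree " + tree]
    if parent is not None:
        lines.append("parent " + parent)
    lines.append("author " + stamp)
    lines.append("committer " + stamp)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


ident = "Alice <Alice@example.org>"  # the name part is assumed
when = datetime(2018, 5, 2, 12, 0, 12, tzinfo=timezone(timedelta(hours=1)))

first_tree = "0406b210c2c38cd63eddbad2ea261251e039fbe4"  # from git write-tree above
first_body = commit_body(first_tree, "First commit", ident, when)
first_hash = git_object_hash(b"commit", first_body)

second_tree = git_object_hash(b"tree", b"")  # stand-in for the abbreviated tree
second_body = commit_body(second_tree, "Second commit", ident, when, parent=first_hash)
second_hash = git_object_hash(b"commit", second_body)

print(first_hash)
print(second_hash)
```

Since the parent's hash is part of the text being hashed, each commit's identifier depends on its entire history, which is what lets three people working offline agree on which commit follows which.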

Bob notes that since only one file, README.md, has changed, the second commit can still point to the old files that weren’t changed. In practice this lets Git save space, since any two identical files have identical hashes and need to be stored only once. With that, our second commit looks like this:
[Figure: Git 4]
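Bob's space-saving observation can be sketched with a tiny content-addressed store in Python. The `ObjectStore` class and its `put_blob` method are hypothetical names for this illustration, and the new README text is invented. Real Git also compresses objects and writes them under .git/objects, whereas this sketch keeps them in a dictionary.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: the key is the hash of the content."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        key = hashlib.sha1(header + content).hexdigest()
        # Storing the same content twice is a no-op: the key already exists.
        self.objects.setdefault(key, content)
        return key


store = ObjectStore()

# First snapshot: an empty README.md and test.txt.
v1 = {
    "README.md": store.put_blob(b""),
    "test.txt": store.put_blob(b"test content\n"),
}

# Second snapshot: only README.md changed.
v2 = {
    "README.md": store.put_blob(b"# Hello\n"),
    "test.txt": store.put_blob(b"test content\n"),
}

print(v1["test.txt"] == v2["test.txt"])  # True: both snapshots share one blob
print(len(store.objects))  # 3 blobs stored for 4 file versions
```

The same reasoning applies to folders: an unchanged subtree keeps its hash, so the new tree simply points at the existing object.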

With this rudimentary version control system, Alice, Bob and Charlie managed to solve their problems. Luckily, they didn’t have to merge anything.

For more detail on this topic, read about Git Internals on the official Git site and Git from the Bottom Up.
