How does git work? Demystifying the .git directory and its distributed nature.
What do software engineers around the globe have in common?
— They are all humans?
— Yeah but that’s too obvious…
— Maybe they all simply code?
— Come onnn, that’s even more obvious 😁
— Hmm, maybe it’s something about git?
— YES 🔥 my friend you are right. Engineers around the world use git to create complex, distributed and highly scalable systems.
But do they really know what happens after each commit? Or how git stores millions of lines of code in this tiny .git directory?
Hmmm 🤔 I don’t think so. Just follow me on this adventure inside the .git directory and discover all the nitty-gritty details of this amazing technology.
You can find millions of blogs and videos with titles like “what is git?”, “git merge vs rebase”, “beginner git crash course”, “advanced git part 2” and so on and so forth. Looks familiar, right? This list never ends… Unfortunately, most of those resources follow the “How to Use” line and not “How it Works”. There is no doubt that it’s extremely important to know how to use git, but in this blog we are only concerned with the “How it Works” part.
This blog is divided into 2 parts:
- Distributed Stuff
- .git Internals
Feel free to head directly to your favourite one.
So without further ado, let’s get our hands dirty with how git actually works.
Git is the distributed database at the core of your engineering system.
Here are some very basic concepts that Git shares with application databases:
1. Data is persisted to disk.
All the files which are created in a git repository are stored in .git directory.
2. Queries allow users to request information based on that data.
Besides basic git commands, git provides different commands to read and write data directly to stored files in .git
3. Distributed nodes need to synchronise and agree on some common state.
Yes my friend, every time you push/pull some changes you are synchronising those nodes and using an extremely cool distributed database.
In practice, the remote repository acts as our main node (single source of truth) and all other (cloned) local git repositories are in some sense replicated nodes of the main one. These replicated nodes (local repos) may contain new local updates or be outdated compared to the remote repo. Therefore we need to keep the main/remote repo in sync with our local/replicated repos. This is why we push commits, pull the latest changes and sometimes even resolve conflicts. I really recommend syncing with the remote repository as often as you can, as dealing with conflicts when your code is very outdated is simply hell on earth 😂.
Let me touch on one more point in the distributed part, and then let’s jump directly into .git internals. In distributed systems there is a very common concept called “Consistency”, which simply means that every node/replica has the same view of the data at a given point in time, irrespective of which client has updated the data. Based on this definition we can point out 2 types of consistency: Strong and Eventual.
Strong Consistency is when we really need all replicas to have the same view of the data. Consider this simple example from payments/billing systems. Let’s say Alex and Bob decide to withdraw $100 each from ATMs using exactly the same card. It means they are getting money from the same bank account simultaneously. Let’s also assume there is only $150 in the account, so only 1 person should get the desired $100. Also assume the database is replicated (the $150 balance is stored on both Node A and Node B), and for some reason Alex’s ATM decides to talk to Node A and Bob’s to Node B.
Here we have 2 possibilities:
1. Alex took money from Node A, and for some random reason it happened that synchronisation is completed among nodes before Bob’s request reaches Node B. Therefore both replicas have the same view of the account balance, which is $50. In this case Bob is not able to get $100 as there is not enough money in the account.
2. Bad Weather situation is when Alex took money from Node A, and for some random reason synchronisation is not completed YET among nodes at the time Bob’s request reaches Node B. Therefore replicas are out of sync. Node B still thinks there is $150 in the account and Bob successfully takes another $100 from the same account.
This is BAD. Somehow Alex and Bob managed to get $200 out of a $150 bank account. Here we definitely need to ensure the sync process is completed before Bob’s request is served. This is what we call STRONG consistency: at any given time all replicated nodes have the same view of the bank account. Of course strong consistency comes at a cost, for example in the availability of the system. You can think of a distributed system as a huge game of trade-offs between consistency and availability. Check out the CAP theorem for more details.
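Here is a minimal sketch of the two ATM scenarios, assuming a toy in-memory model. The `Replica` class and the `withdraw` and `run` helpers are made-up names for illustration, not a real banking or database API. The only thing that differs between the two runs is whether Node B is synchronised before Bob's request is served.

```python
class Replica:
    """A toy database node holding a copy of one account balance."""

    def __init__(self, balance):
        self.balance = balance


def withdraw(replica, amount):
    """Pay out only if the balance this replica currently sees is enough."""
    if replica.balance < amount:
        return False
    replica.balance -= amount
    return True


def run(sync_before_second_request):
    node_a, node_b = Replica(150), Replica(150)
    paid_out = 0
    if withdraw(node_a, 100):        # Alex talks to Node A
        paid_out += 100
    if sync_before_second_request:   # replicate Alex's withdrawal first
        node_b.balance = node_a.balance
    if withdraw(node_b, 100):        # Bob talks to Node B
        paid_out += 100
    return paid_out


print(run(sync_before_second_request=True))   # 100: Bob is rejected
print(run(sync_before_second_request=False))  # 200: the bad weather case
```

Real systems do not get the "sync first" behaviour for free: they pay for it with coordination between nodes, which is exactly the availability cost mentioned above.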
Whereas Eventual Consistency is when the system tolerates delayed synchronisation between nodes. This way we don’t have to sacrifice much of the system’s availability. Consider a simple example of Alex and Bob now putting a 👍🏼 on a tweet which has 100 likes. Let’s say Alex is in Tbilisi, Georgia and Bob is in his motherland Houston, Texas. Due to these locations Alex talks to the Node A replica and Bob to Node B.
This business logic lets us not care about being Strongly Consistent. At the moment of pressing 👍, both Alex and Bob see that the tweet has 100 likes, even though one of them may already have sent his like to one of the nodes. Eventually synchronisation will happen and both likes will be applied to both replicas. As you can see, in this use case nothing breaks due to eventual consistency.
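The like example can be sketched the same way. This is a toy model under my own assumptions: `LikeReplica` is a hypothetical class, and "synchronisation" is modelled as a simple set union of the users who liked the tweet, which is one of many possible merge strategies.

```python
class LikeReplica:
    """A toy replica that counts likes on one tweet."""

    def __init__(self, base_likes):
        self.base_likes = base_likes
        self.new_likes = set()  # users whose likes this replica knows about

    def like(self, user):
        self.new_likes.add(user)

    def count(self):
        return self.base_likes + len(self.new_likes)

    def sync_with(self, other):
        # Merge both views; a union never loses a like and never counts one twice.
        merged = self.new_likes | other.new_likes
        self.new_likes = set(merged)
        other.new_likes = set(merged)


node_a, node_b = LikeReplica(100), LikeReplica(100)
node_a.like("alex")   # Alex talks to Node A
node_b.like("bob")    # Bob talks to Node B

before = (node_a.count(), node_b.count())
node_a.sync_with(node_b)
after = (node_a.count(), node_b.count())

print(before)  # (101, 101): replicas temporarily disagree with reality
print(after)   # (102, 102): eventually both converge
```

Nobody was blocked while the replicas were out of sync, and no like was lost once they caught up.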
And yes, my curious reader, you get where we are heading. If we look at git from a distributed database perspective and consider our local/remote repos as replicas/nodes, we can say that in some sense we are dealing with Eventual Consistency here. In practice git has this remote repo concept as a main node with which everyone synchronises from time to time. We all write code locally in our cloned repositories, and eventually we all push our changes and pull others’ changes from the remote repo/node. What’s most important, just like in the previous Like example, we are totally allowed to do it. As in practice reading our repository content for production happens from the remote repo, we are free to do writes locally and eventually sync everything with the remote repository.
Get ready for lots of screenshots, as we are going to go all the way from creating an empty git repository to making multiple commits and diving deep into the .git directory. Let’s start fresh with an empty git repository, test-git. What happens when we run git init in an empty directory? Git simply creates the .git directory and an initial folder structure inside it. You can see the .git internals on the right part of the screen. Remember, all your code and any other files which are included in the git repository are saved inside this .git directory. Nowhere else, there is no other magic here. To get into git’s internal flow, all we need is to dive deep into the objects directory. We will not cover the other directories in this blog, but feel free to play around with them. Hooks especially are a really cool piece of git functionality.
Let’s create a simple readme.md file with “Hello Git” content in it. If we run git status we can easily see that this file is not yet tracked by git. Therefore nothing has changed in .git, as you can see on the right.
What if we run git add readme.md? Let’s try 🔥 WOW, what happened to the objects directory on the right? Some strange thing was created: 9f4d….. 😂
Basically git created directory 9f inside objects and a compressed file named 4d96d5b00d98959ea9960f069585ce42b1349a inside 9f.
But wait, what is 9f4d96d5b00d98959ea9960f069585ce42b1349a ????? This weird string, my friend, is the name of our first git object, which is simply a hash of the readme.md file’s content (git uses SHA-1 by default; the details of the algorithm are definitely out of our scope now, you can search for hashing in git on your own). Git could directly hash the “Hello Git” string, but for reasons which we will discover later git wraps file contents in a simple structure. Instead of hashing “Hello Git”, git hashes “blob 10\0Hello Git”.
blob indicates the type of this particular git object. Objects in git which store file contents have the blob type. 10 simply indicates the length of our content, which in the case of the “Hello Git” string is 10 bytes (the end-of-line byte, which is not visible, is included as well, therefore we get 9+1=10). \0 is just a null byte that separates this small header from the content; you can otherwise ignore it.
Should you believe me? Of course not, let’s hash the “blob 10\0Hello Git” string ourselves and compare the results. I just passed that string to the shasum command (a SHA-1 hashing tool) and we can verify that our result is the same as the hash computed by git.
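If you prefer code over words, the recipe above fits in a few lines of Python. This is a sketch, not git's actual implementation: `git_blob_hash` is a name I made up, and I am assuming the file holds exactly “Hello Git” plus one newline byte.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Hash file content the way described above: header, null byte, content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


content = b"Hello Git\n"       # 9 visible characters + 1 end-of-line byte
print(len(content))            # 10, the number that ends up in the header
print(git_blob_hash(content))  # a 40-character hex string: the object name
```

The first 2 characters of the result become the directory name inside objects, and the remaining 38 become the file name.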
This is still not enough, now we have to verify that content of 9f/4d96… compressed file is exactly “blob 10\0Hello Git”. I’ll use some shell magic to decompress and display its content 😁 And yea, as you can see after applying some openssl zlib … command to our compressed file we see actual string “blob 10Hello Git” which we hashed before. With small difference that \0 is not displayed by my terminal as it’s simply a null character.
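The same check can be done without any shell magic, since the file is plain zlib-compressed data. Below is a small sketch under a few assumptions: `write_loose_object` and `read_loose_object` are hypothetical helper names, and the writer exists only to make the example self-contained (it uses a temporary directory as a stand-in for a real .git directory). In a real repository you would only need the reader.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(git_dir, obj_type, content):
    """Store '<type> <size>\\0<content>' zlib-compressed under objects/xx/yyyy..."""
    store = obj_type.encode() + b" " + str(len(content)).encode() + b"\0" + content
    sha = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return sha


def read_loose_object(git_dir, sha):
    """Decompress an object file and split it into (type, content)."""
    path = os.path.join(git_dir, "objects", sha[:2], sha[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), content


git_dir = tempfile.mkdtemp()  # stand-in for a real .git directory
sha = write_loose_object(git_dir, "blob", b"Hello Git\n")
obj_type, content = read_loose_object(git_dir, sha)
print(obj_type, content)
```

Pointing `git_dir` at the .git directory of a small, freshly created repository should let `read_loose_object` read the objects git wrote there. Objects that git has already packed into packfiles are not covered by this sketch.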
Cool, now that we have verified how files are hashed and read back their decompressed versions, I can show you an easier way to read git objects. Besides the basic pull, push, add and commit commands, git provides a more comprehensive API to deal with objects directly. As we can now see, after git add readme.md a new blob object was created in the .git objects directory. How did git compute the hash for this file?
With the git hash-object command you can calculate the hash of a file’s content the same way git would. Actually this is what git uses internally to get an object’s name. You can see the full list of git commands here.
But wait, if there exists a simple git hash-object command that gives us the hash of a file’s content, shouldn’t there be a prettier way to read an object’s content without running openssl zlib -d < object …? Of course, git already has a command for reading objects directly from the objects directory.
The git cat-file -p and git cat-file -t commands can be used to get an object’s content and type respectively. As you can see, cat-file -p returned the file’s content directly and cat-file -t returned the object type. It is a more human-readable way than dealing with “blob 10\0Hello Git”. And yes, I forgot to mention, we can address a git object by the first few symbols of its hash, as long as they are unique. There is no other object starting with 9f4d in the objects directory, therefore it can be considered a unique alias for our object.
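How could such a short alias be resolved? Here is one possible sketch, assuming objects are stored as plain files in the objects directory (packed objects are ignored). `resolve_prefix` and `MIN_PREFIX` are names I invented, and the second object name in the demo is made up purely to show that the lookup still finds a unique match.

```python
import os
import tempfile

MIN_PREFIX = 4  # assumption: the shortest abbreviation this sketch accepts


def resolve_prefix(git_dir, prefix):
    """Return the full object name for an abbreviated hash, if it is unique."""
    if len(prefix) < MIN_PREFIX:
        raise ValueError("prefix too short: " + prefix)
    fanout_dir = os.path.join(git_dir, "objects", prefix[:2])
    names = os.listdir(fanout_dir) if os.path.isdir(fanout_dir) else []
    matches = [prefix[:2] + n for n in names if n.startswith(prefix[2:])]
    if len(matches) != 1:
        raise LookupError(f"{prefix}: {len(matches)} matching objects")
    return matches[0]


# Demo with empty placeholder files in a temporary directory.
git_dir = tempfile.mkdtemp()
full = "9f4d96d5b00d98959ea9960f069585ce42b1349a"
other = "9f" + "0" * 38  # made-up object name sharing the same 9f directory
for name in (full, other):
    path = os.path.join(git_dir, "objects", name[:2], name[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "wb").close()

print(resolve_prefix(git_dir, "9f4d"))
```

The two-character directory name does the first part of the search for free, so only the files inside 9f have to be compared.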
What have we covered so far?
- Git creates a file content object as soon as the file is moved to the staging area with the git add command.
- The type of this object is blob.
- Blob objects store a file’s content and NOT its name.
- The hash of a file’s content can be calculated with the git hash-object command.
- We can read a git object’s content/type using the git cat-file -p/-t commands.
Cool 😎 now we can continue and move on to next step.
Let’s see what happens after we commit this change.
To visualise file structure better, I’ll delete hooks folder because it takes too much space on the screen and objects can’t fit in screen anymore 😂
So what happened after our first commit? 2 new objects were created. From now on I’ll address those objects by the first 4 symbols of their hashes, as they are unique. Objects 69a7 and b53d were added by git after the commit. Let’s examine both objects one at a time and see how they are connected to the global warming on our planet 🌎 🔥.
Let’s start with object 69a7 and see its type and content. First of all let’s mention we discovered a new git object type TREE. Tree objects in git are in some sense abstraction of a folder/directory in our project. Tree object content looks like this:
“100644 blob 9f4d96d5b00d98959ea9960f069585ce42b1349a readme.md”
It simply gives us a view of the folder with its files. Our root folder contains only 1 file, readme.md, which is already known to git as it exists in the objects directory. Every folder in git is represented as a tree object. Therefore a tree object is basically a collection of pointers to blob objects (just like here) and also pointers to other tree objects (we will see this later when we add a nested folder in the root). If you remember, we mentioned above that blob objects store only file content, not file names. You can see from the screen that the tree object stores file names as well as “references” to file contents. 9f4d corresponds to a blob object which is simply the content of the readme.md file. Imagine how easy it is to rename a file in git: the blob object stays the same and only the file name in the tree object is changed.
Hash names of tree objects are computed the same way as for blobs: the content of the object is hashed and we get its name, easy. The raw tree object looks a little bit different from the git cat-file -p output (it stores each hash in binary form). But all we should care about for now is that git cat-file -p provides all the content we need and git cat-file -t just verifies this is a tree object.
What have we covered so far?
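For the curious, here is a sketch of what the raw tree content looks like and how its name is computed. Each entry is stored as the mode, a space, the file name, a null byte and then the 20 raw bytes of the hash. The helper names (`tree_body`, `parse_tree`, `git_hash`) are mine, and the plain sort by name is a simplification of git's real ordering rules.

```python
import hashlib


def tree_body(entries):
    """Build raw tree content from (mode, name, hex_hash) tuples."""
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_hash)
    return body


def parse_tree(body):
    """Turn raw tree content back into (mode, name, hex_hash) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_hash = body[nul + 1:nul + 21].hex()  # 20 binary bytes -> 40 hex chars
        entries.append((mode, name, hex_hash))
        pos = nul + 21
    return entries


def git_hash(obj_type, body):
    """Same recipe as for blobs: '<type> <size>\\0' followed by the content."""
    header = obj_type.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


entries = [("100644", "readme.md", "9f4d96d5b00d98959ea9960f069585ce42b1349a")]
body = tree_body(entries)
print(parse_tree(body))
print(git_hash("tree", body))
```

The binary hashes are the reason the decompressed tree file looks garbled in a terminal, while git cat-file -p prints a tidy line per entry.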
- Introduced new type of git objects: TREE
- Tree object is an abstraction of a folder in git
- Tree contains pointers to blob objects (files) and other trees (nested folders)
- Tree contains names of blob objects (files) and other trees (nested folders)
Cool 😎 now we can examine second object created after commit, b53d
Simply applying the same git cat-file command combination to the b53d object, we can proudly unlock the 3rd type of git object: COMMIT. A commit object contains some commit metadata such as author/committer and the commit message. Most importantly, it contains a pointer to a tree object. What does it mean? Every commit is simply a pointer to the root tree of our project. Every commit is a new SNAPSHOT of our whole repository. WHY? Because the commit has a pointer to the root tree, and the root tree has pointers to the whole nested structure of root folder blobs/trees.
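A commit object is plain text, so it is easy to sketch. Everything concrete in the demo below is a placeholder I made up: the author identity, the timestamp, the message, and the tree hash (I used the well-known hash of an empty tree rather than guessing the full name of 69a7). `commit_body` and `parse_commit` are hypothetical helper names.

```python
import hashlib


def commit_body(tree, parents, author, committer, message):
    """Build commit content: header lines, a blank line, then the message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def parse_commit(body):
    """Split commit content back into its fields."""
    head, _, message = body.decode().partition("\n\n")
    fields = {"parent": []}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)
        else:
            fields[key] = value
    fields["message"] = message
    return fields


who = "Alex <alex@example.com> 1700000000 +0000"    # placeholder identity
tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # placeholder tree hash
body = commit_body(tree, [], who, who, "first commit")

header = b"commit " + str(len(body)).encode() + b"\0"
commit_hash = hashlib.sha1(header + body).hexdigest()

print(parse_commit(body)["tree"])
print(commit_hash)
```

Because the author, committer and timestamps are part of the hashed content, two commits pointing to the very same tree still get different names.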
Let me put some magic and present a better visualisation of our current state in the test-git repository. The commit points to the root tree, and the tree represents the root folder. By representing the root folder we mean containing pointers to nested files and other folders (trees). In this case we only have 1 file there, as a blob object.
I’m extremely happy if you made it this far 😁🔥 Let me know somehow (commenting on the post, reaching out in DM) if you reached this point.
Let’s create 1 more commit, and analyse it as well. Things do change slightly with multiple commits and nested folders. In our root folder I just created code directory and inside it an app.code file with some random “Code” text.
Before I commit this change you can see that moving to staging area (git add) created only a blob object for app.code file content.
As you can see, after we committed these changes there are 4 new objects in the objects directory (the blob was already created by git add, the other 3 by the commit):
- RED 781f object is blob storing content of app.code file.
- GREEN 5223 object is a tree object containing structure of /code folder having a single blob object (app.code file).
- BLUE 96df object is a root tree object (root folder) containing pointer to previously created readme.md blob object as well as newly created /code folder which is also represented as a tree from GREEN object.
- ORANGE adbf object is LATEST COMMIT which has pointer to root tree of our repository (LATEST SNAPSHOT) and a reference to parent commit. Parent commit in our case is first commit. This is why it is so easy to jump backwards in git commit history.
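Jumping backwards through history is just following parent pointers. Here is a toy sketch that uses the abbreviated object names from this post as dictionary keys. The dictionary layout and the `history` function are my own simplification: real commits are stored as object files, and merge commits can have more than one parent.

```python
# Toy commit store keyed by the abbreviated names used in this post.
commits = {
    "b53d": {"tree": "69a7", "parent": None},    # first commit
    "adbf": {"tree": "96df", "parent": "b53d"},  # latest commit
}


def history(commits, start):
    """Yield commit names from `start` back to the very first commit."""
    current = start
    while current is not None:
        yield current
        current = commits[current]["parent"]


log = list(history(commits, "adbf"))
print(log)  # ['adbf', 'b53d']
```

This is essentially a linked list where every node also carries a full snapshot of the project.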
There is a common saying:
Commits are not diffs, they are snapshots.
As you can see, by storing root tree pointer in the commit, we are actually creating a SNAPSHOT of the whole repository at the time of that commit. We can recreate whole repository using single commit object by following down the root tree.
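To make the snapshot idea concrete, here is a toy sketch that rebuilds the whole file tree from a single commit name. It uses the abbreviated object names from this post in a plain in-memory dictionary. The store layout, the `checkout_tree` and `snapshot` helpers and the exact bytes of app.code are my assumptions for illustration.

```python
# Toy object store: name -> (type, payload).
objects = {
    "b53d": ("commit", {"tree": "69a7", "parent": None}),
    "adbf": ("commit", {"tree": "96df", "parent": "b53d"}),
    "69a7": ("tree", [("readme.md", "9f4d")]),
    "96df": ("tree", [("readme.md", "9f4d"), ("code", "5223")]),
    "5223": ("tree", [("app.code", "781f")]),
    "9f4d": ("blob", b"Hello Git\n"),
    "781f": ("blob", b"Code\n"),  # placeholder content
}


def checkout_tree(objects, tree_id, prefix=""):
    """Recursively follow a tree and collect {path: content}."""
    files = {}
    _, entries = objects[tree_id]
    for name, child_id in entries:
        kind, payload = objects[child_id]
        if kind == "tree":
            files.update(checkout_tree(objects, child_id, prefix + name + "/"))
        else:
            files[prefix + name] = payload
    return files


def snapshot(objects, commit_id):
    """Rebuild the full repository content as of one commit."""
    _, commit = objects[commit_id]
    return checkout_tree(objects, commit["tree"])


print(snapshot(objects, "b53d"))  # only readme.md
print(snapshot(objects, "adbf"))  # readme.md and code/app.code
```

Notice that both root trees point to the same 9f4d blob: an unchanged file is never stored twice, even though every commit is a full snapshot.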
Branches are simply pointers to commits.
As easy as it sounds. Every time you use a branch name such as main, git simply goes to the refs directory and grabs the corresponding commit hash. HEAD works the same way with one extra hop: the .git/HEAD file usually just names the current branch. Branches are simply human-readable aliases for commit hashes. This is why you can use the git checkout command with commit hashes as well as branch names. Refs are basically a key-value store with branch names (simple aliases) as keys and commit hashes as values.
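Here is a rough sketch of that lookup, assuming refs are stored as small text files (git can also keep them in a packed-refs file, which this sketch ignores). `resolve_ref` is a hypothetical helper, the temporary directory stands in for a real .git directory, and the commit hash is a made-up placeholder.

```python
import os
import tempfile


def resolve_ref(git_dir, name="HEAD"):
    """Follow 'ref: ...' indirections until a commit hash is found."""
    with open(os.path.join(git_dir, name)) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        return resolve_ref(git_dir, value[len("ref: "):])
    return value


# Demo: build a tiny fake .git layout.
git_dir = tempfile.mkdtemp()
fake_commit = "adbf" + "0" * 36  # made-up 40-character placeholder hash
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "refs", "heads", "main"), "w") as f:
    f.write(fake_commit + "\n")
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/main\n")

print(resolve_ref(git_dir))                     # via HEAD -> refs/heads/main
print(resolve_ref(git_dir, "refs/heads/main"))  # directly via the branch
```

Creating a branch is therefore about as cheap as it gets: it amounts to writing one 40-character hash into a new file.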
Finally we have come to an end 🔥🔥
Hope this was NOT another git blog. I tried to go beyond this definition we all have seen millions of times:
Git: A distributed version control system
At the same time I tried to demystify the .git directory, in the sense that it’s simply a key-value store of objects and their hashes. There is no rocket science behind it.
What you can do now:
- Have a look at this guy presenting how they created an ordinary NoSql database on top of git.
- Reconstruct our current git repository state without git commands.
- Try to reverse-engineer git. Implement your own git client for the basic commands.
- Play around with multiple copies of a git repository simultaneously, with the distributed mental model in mind.