Git - A Distributed Version Control System

November 13, 2021

Introduction

This is an introduction to the Git version control tool. This post only covers the core concepts of Git, and does so without the use of developer jargon. It’s meant for beginners with little to no knowledge of version control tools in general. Later posts will dive deeper into the everyday uses and mechanics of Git.

What is Git and why is it a thing?

Git is a tool, written mostly in the C programming language, that is used to track changes to content over time. Every tracked change to your content is known as a version. Every version that Git tracks also includes metadata that helps identify who wrote the content and when it was written… among other things.

With Git, you have full control over what content gets tracked, and once Git is tracking versions of your content, you can go back to any version that Git knows about. Hence the term version control.

This is super useful, right? Now, we can safely experiment with different approaches to solving a problem, and if one approach doesn’t work out… we simply dump it and go back to the previous version. No harm done! Without a version control system, you would have to manually create copies of your content, remember where you put those copies, and remember what state each copy was in. With a version control system, you delegate that responsibility to a tool, which makes your life easier!

How does Git track changes?

Git doesn’t track changes automatically, though. You have to tell Git when to save a version of your content. This is done by “committing” a change to Git. Once committed, a change becomes a version in your commit history, and Git can take you back to any version in that history.

Notice that I keep saying “content” and not “files”. That’s because at its core, Git stores content in its object store and identifies each piece of content by a hash of that content. Git doesn’t track directories; it tracks content and paths to content. In other words, if you have an empty directory and ask Git what changes are ready to be tracked, Git will simply reply with “nothing”. However, if you add an empty file to that directory, then there is a path to potential content. Git will then respond to that same question with the path to the empty file.
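For the curious, here is a minimal sketch in Python of how an ID for a piece of content can be calculated. It assumes a repository that uses SHA-1 hashing (the long-standing default; Git can also be set up to use SHA-256), and the function name git_blob_id is my own invention for illustration, not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the ID a SHA-1 based Git repository gives to file content.

    Git hashes a small header ("blob", the content length, a zero byte)
    followed by the content itself. The file name is not part of the hash.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# An empty file still has content (zero bytes), so it still gets an ID.
empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# The same content always produces the same ID, whatever the file is called.
first_copy = git_blob_id(b"Dear diary\n")
second_copy = git_blob_id(b"Dear diary\n")
print(first_copy == second_copy)  # True
```

Because the ID depends only on the content, two files with identical content share one entry in the object store.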

Once you tell Git to track a change by committing it, Git creates a commit: a record of your content and the paths to it, the current date and time, who wrote the content (you in this case), a message describing the change, and a reference to the previous commit. Git then calculates a cryptographic hash of that record, and the hash becomes the commit’s ID. Git stores the commit in its object store, along with all of your other commits. This makes up the core of what is known as a repository. A Git repository can be thought of as a collection of content changes (versions) over time.
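To make the idea of commits pointing at earlier commits more concrete, here is a simplified model in Python. It is only a sketch: the record layout, the JSON encoding, and the names make_commit and history are assumptions made for illustration, and Git’s real storage format is different.

```python
import hashlib
import json

def make_commit(store, snapshot, author, timestamp, message, parent=None):
    """Save a simplified commit record and return its ID (a hash of the record)."""
    record = {
        "snapshot": snapshot,    # path -> content at this point in time
        "author": author,
        "timestamp": timestamp,
        "message": message,
        "parent": parent,        # ID of the previous commit, or None for the first
    }
    data = json.dumps(record, sort_keys=True).encode("utf-8")
    commit_id = hashlib.sha1(data).hexdigest()
    store[commit_id] = record
    return commit_id

def history(store, commit_id):
    """Follow the parent references back to the first commit."""
    messages = []
    while commit_id is not None:
        messages.append(store[commit_id]["message"])
        commit_id = store[commit_id]["parent"]
    return messages

store = {}
first = make_commit(store, {"notes.txt": "draft"}, "you",
                    "2021-11-13T09:00:00", "Start notes")
second = make_commit(store, {"notes.txt": "draft, revised"}, "you",
                     "2021-11-13T10:00:00", "Revise notes", parent=first)

print(history(store, second))  # ['Revise notes', 'Start notes']
```

Each commit only needs to know its immediate predecessor; the full history falls out of following those references one step at a time.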

One note about how Git hashes content: hashing is what lets Git guarantee content integrity for all versions of content within the commit history. Because every ID is calculated from the content itself, content that has been altered or damaged no longer matches its ID, and Git can detect that. In other words, the content that Git tracks will be exactly the same content that Git outputs when you go back to a previous version. This is important! If you went back to a previous version and the content wasn’t exactly what you committed, you’d have serious issues when you went to run or compile your code. I would not recommend using a version control system that does not have this guarantee.
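The integrity check itself can be sketched in a few lines of Python. This is an illustration of the general idea rather than Git’s actual code: the names content_id and is_intact are hypothetical, and the hash here is taken over the raw content without the header Git adds.

```python
import hashlib

def content_id(content: bytes) -> str:
    """Calculate an ID from the content itself."""
    return hashlib.sha1(content).hexdigest()

def is_intact(expected_id: str, content: bytes) -> bool:
    """Content is intact only if it still hashes to the ID it was stored under."""
    return content_id(content) == expected_id

committed = b"print('hello')\n"
stored_id = content_id(committed)

# Simulate a single character being changed after the content was stored.
damaged = b"print('hellO')\n"

print(is_intact(stored_id, committed))  # True
print(is_intact(stored_id, damaged))    # False
```

Even a one-character difference produces a completely different hash, which is why a mismatch is a reliable sign that something changed.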

How does Git revert a change?

Now that you know that a Git repository is a collection of hashed content, it’s a good time to talk about references. I mentioned earlier that when you make a commit, Git stores a record of your tracked content in it, along with a reference to the previous commit. So, every commit hash is actually a reference to all of the tracked content at that point in time! This means that going back to a previous version is dead simple. Tell Git which commit hash you want to go back to, and Git will overwrite your current tracked content with the content recorded in that commit.
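Here is a toy version of that idea in Python, under some loud assumptions: the repository is just a dictionary, the short commit IDs are made up for readability (real IDs are long hashes), and restore is a hypothetical helper rather than a Git command.

```python
# A toy repository: each commit ID maps to the full set of files at that point.
commits = {
    "a1": {"essay.txt": "First draft."},
    "b2": {"essay.txt": "Second draft.", "notes.txt": "Ideas to try."},
}

def restore(commits, commit_id):
    """Return a fresh copy of the files exactly as they were in that commit."""
    return dict(commits[commit_id])

# Start from the latest version and try an experiment.
working_files = restore(commits, "b2")
working_files["essay.txt"] = "An experiment that did not work out."

# Dump the experiment and go back to the first version.
working_files = restore(commits, "a1")
print(working_files)  # {'essay.txt': 'First draft.'}
```

Note that the experiment never touched the stored commits; only the working copy was overwritten, so every saved version is still available.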

Within Git, references aren’t limited to commit hashes; a commit hash can also have aliases. You can create named aliases to make versioning more human-friendly (Git calls an alias like this a tag). As an example, you can make an alias named “v1.0.1” for a commit hash. Then you can tell Git to go back to “v1.0.1”, and Git will work out which commit hash that is and make it so.
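Conceptually, an alias is just a lookup from a friendly name to a commit hash. The Python sketch below shows that lookup; the commit IDs, the aliases dictionary, and the resolve function are all made up for illustration and are not how Git stores or resolves names internally.

```python
# Made-up short commit IDs mapped to the content they recorded.
commits = {
    "3f2a9c1": {"app.txt": "version one"},
    "9d41e7b": {"app.txt": "version two"},
}

# Human-friendly names that point at commit IDs.
aliases = {"v1.0.1": "3f2a9c1"}

def resolve(name):
    """Turn an alias or a commit ID into a commit ID."""
    commit_id = aliases.get(name, name)   # not an alias? treat it as an ID
    if commit_id not in commits:
        raise KeyError(f"unknown version: {name}")
    return commit_id

print(resolve("v1.0.1"))            # 3f2a9c1
print(commits[resolve("v1.0.1")])   # {'app.txt': 'version one'}
```

The alias adds no new content of its own; it only saves you from remembering the hash.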

What makes Git distributed?

What makes Git distributed is that every contributor has a complete copy of the repository, including its history, on their own computer. This means that you don’t need the internet to use Git, and all of your commits happen locally on your computer. It also means that you don’t need a centralized repository to regulate changes made by each contributor.

If two or more contributors want to collaborate on a project, they typically host a copy of the repository on the internet, and then sync data between their local Git repositories and that copy. This “copy on the internet” is known as a remote repository.

If something should happen to your computer, and you lose your local copy of the repository, you could simply make a copy of the remote repository on another computer: thus the distributed nature of Git repositories.
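Syncing works well because every object has an ID derived from its content, so two copies of a repository can compare IDs and exchange only what is missing. The Python sketch below models that with plain dictionaries; the sync function and the short IDs are assumptions for illustration, and Git’s real transfer process is far more involved.

```python
def sync(source, target):
    """Copy every object that target is missing from source.

    Returns the number of objects copied.
    """
    missing = {oid: obj for oid, obj in source.items() if oid not in target}
    target.update(missing)
    return len(missing)

# Your laptop has two versions; the remote only has the first one so far.
laptop = {"a1": "first version", "b2": "second version"}
remote = {"a1": "first version"}

# Share your new work with the remote repository.
copied = sync(laptop, remote)
print(copied)  # 1

# Your laptop is lost: rebuild the repository on a new computer from the remote.
new_computer = {}
sync(remote, new_computer)
print(new_computer == remote)  # True
```

Only the one missing object travels in the first sync, and the new computer ends up with the full history in the second.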

There are companies that provide hosting for remote repositories, along with other collaboration tools and features. The most notable Git hosts are GitHub, Bitbucket, GitLab, Azure DevOps, and AWS CodeCommit.

Conclusion

There are multiple version control tools on the market. Each has its own benefits and drawbacks. For my line of work and everyday use, Git works perfectly well for me. However, if I needed to store large files, such as videos, Git might not be my version control tool of choice.

Just to reiterate, this post served as an introduction to Git’s core concepts. Later posts will dive deeper into the everyday uses and mechanics of Git.