Git's Storage Principles

Git Design Principles
- Git vs SVN
- Git Storage Model
  - .git directory structure
  - Git Basic Data Objects
  - Git Package Files
  - Git References

Git Design Principles

In a nutshell, Git is a snapshot-based, content-addressable filesystem. The sections below explain what that means.

Git vs SVN

Before Git, the mainstream version control systems (SVN and others) were generally delta-based: they store a base version of each file plus the incremental changes made to it over time, as shown below:

Git is based on snapshots: on each commit it stores a full snapshot of every modified file, while unmodified files are not stored again and simply reuse the snapshot that already exists, as shown in the following figure:

Intuitively, a delta-based approach might seem better.

After all, Git stores a full snapshot of every modified file, whereas delta-based systems store only the changes. That is true, but when you roll back to a version or compare several versions, Git only needs to read the snapshots of the versions involved and compare them. A delta-based system has to start from the base version and apply the deltas one by one to reconstruct the version it needs, and this is where Git's speed advantage becomes clear.
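The trade-off can be illustrated with a toy model. This is only a sketch and not how Git or SVN is implemented: the names `store_snapshot`, `checkout_snapshot`, and `checkout_delta` are made up for illustration, the "delta" here is simply the set of changed files rather than a line-level diff, and file deletions are ignored.

```python
import hashlib

def store_snapshot(objects, content):
    """Store content under the SHA-1 of its text and return that key."""
    key = hashlib.sha1(content.encode("utf-8")).hexdigest()
    objects[key] = content  # identical content always maps to the same key
    return key

# Three versions of a hypothetical two-file project.
versions = [
    {"a.txt": "A1", "b.txt": "B1"},
    {"a.txt": "A2", "b.txt": "B1"},  # only a.txt changed
    {"a.txt": "A2", "b.txt": "B2"},  # only b.txt changed
]

# Snapshot model: every version maps each file name to a content key.
objects = {}
snapshots = [
    {name: store_snapshot(objects, text) for name, text in version.items()}
    for version in versions
]

def checkout_snapshot(index):
    """Read one version directly; no earlier version is consulted."""
    return {name: objects[key] for name, key in snapshots[index].items()}

# Delta model: the first version in full, then only the changed files.
base = dict(versions[0])
deltas = [
    {name: text for name, text in current.items() if previous.get(name) != text}
    for previous, current in zip(versions, versions[1:])
]

def checkout_delta(index):
    """Rebuild a version by replaying every delta from the base onward."""
    state = dict(base)
    for delta in deltas[:index]:
        state.update(delta)
    return state

print(checkout_snapshot(2) == checkout_delta(2))  # both models agree
print(len(objects))  # unchanged files did not create new objects
```

In the snapshot model the cost of reading a version does not depend on how many versions came before it, while the delta model replays one step per earlier version.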

Git Storage Model

.git directory structure

After you create a repository with git init or fetch one with git clone, you will find a hidden directory named .git under the project directory. Its basic structure is similar to the following:

├── COMMIT_EDITMSG  The message of the most recent commit in the repository.
├── FETCH_HEAD      The SHA-1 of the tip of each branch retrieved by the last git fetch.
├── HEAD            Records where the HEAD pointer currently points.
├── ORIG_HEAD       Records the previous position of HEAD before certain dangerous operations, making it easy to roll back.
├── config          Repository-level Git configuration.
├── index           The staging area (index file).
├── packed-refs     Packed references: a single file recording the SHA-1 of the latest commit of each packed branch or tag.
├── logs/           Operation logs, for both local and remote-tracking references.
├── objects/        Object storage directory.
│   ├── ...         Directories named after the first 2 characters of each object's SHA-1.
│   ├── ...
│   ├── info/
│   └── pack/       Packed (compressed) objects.
└── refs/           References, each recording the SHA-1 of a commit, both local and remote.
    ├── heads/      Local branch references.
    ├── remotes/    Remote-tracking branch references.
    └── tags/       Tag references.

This directory contains all of Git's data, and everything in it is stored as ordinary files, which is why Git can be described as a filesystem.

Git Basic Data Objects

blob (binary large object): the contents of a single file, i.e., the per-file snapshot described earlier
tree: a directory; it lists blobs and other trees together with their names and file modes
commit: a commit; it points to the top-level tree of the snapshot and records the parent commit(s), author, committer, and commit message
tag: a tag object, used only for annotated tags; a lightweight tag records no extra information such as who created it, so it needs no separate tag object

All four types of objects are stored under the .git/objects/ directory. Git computes a hash for each object, whatever its type, and uses it to determine the storage path, as in the following example.

> echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
// 40-character hexadecimal SHA-1 hash: the first 2 characters are the directory name and the remaining 38 are the filename, so the storage path is .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
> git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content
> git cat-file -p master^{tree} // Print the tree object of the latest commit on master
100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README
100644 blob 8f94139338f9404f26296befa88755fc2598c289 Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 lib
// The tree contains 2 files (blobs) and 1 subdirectory (tree).
> git cat-file -p 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0
100644 blob 47c6340d6459e05787f644c2447d2595f5d3a54b
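The listing printed by git cat-file -p is a human-readable view of a tree object. As far as the on-disk layout goes, each entry is stored as the mode, a space, the name, a NUL byte, and the 20 raw bytes of the SHA-1. The following is a minimal sketch of that layout; the helper names `serialize_tree`, `parse_tree`, and `git_object_hash` are made up here, the entries are assumed to be already in Git's sort order, and a SHA-1 repository is assumed. Note that the directory mode is stored as 40000, which cat-file pads to 040000.

```python
import hashlib

def serialize_tree(entries):
    """Build a tree body from (mode, name, hex_object_id) tuples.

    Assumes the entries are already sorted the way Git expects.
    """
    body = b""
    for mode, name, object_id in entries:
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(object_id)  # 20 raw bytes, not 40 hex characters
    return body

def parse_tree(body):
    """Inverse of serialize_tree: recover the (mode, name, hex_object_id) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        object_id = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries

def git_object_hash(kind, body):
    """Hash an object as '<kind> <size>' + NUL + body."""
    header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# The entries from the listing above.
entries = [
    ("100644", "README", "a906cb2a4a904a152e80877d4088654daad0c859"),
    ("100644", "Rakefile", "8f94139338f9404f26296befa88755fc2598c289"),
    ("40000", "lib", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
]
body = serialize_tree(entries)
print(parse_tree(body) == entries)   # the round trip preserves every entry
print(git_object_hash("tree", b""))  # hash of a tree with no entries
```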

The git hash-object command computes the hash of a file's contents:

-w also writes the object to the Git object database

--stdin reads the content from standard input

The git cat-file command retrieves an object stored by Git, given its hash:

-p detects the object's type and pretty-prints its content accordingly
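The hash printed by git hash-object in the example above can be reproduced without Git. The sketch below assumes a repository using SHA-1 object names: a blob is hashed as the header "blob <size>" followed by a NUL byte and the content. The function names `git_blob_hash` and `loose_object_path` are hypothetical helpers, not part of any Git API.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Hash content as a blob: 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """Split the hash into a 2-character directory and a 38-character filename."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"

# echo appends a newline, so the content that was hashed is 'test content\n'.
object_id = git_blob_hash(b"test content\n")
print(object_id)                     # d670460b4b4aece5915caf5c68d12f560a9fe3e4
print(loose_object_path(object_id))  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

Because the hash depends only on the object type and the content, two files with identical content share one blob no matter what they are called.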

The data structure of a single commit can be summarized in the following diagram:

Git Package Files

If you try the steps above in your own project, you may not find the corresponding file under .git/objects/. Why is that?

Most likely you did nothing wrong; Git has simply compressed the objects.

Git initially stores each object in the "loose" object format, i.e., as an individual file under .git/objects/.

However, from time to time (or when you manually run the git gc command), Git packs these objects into a binary file called a packfile (stored under .git/objects/pack) to save space and improve efficiency.
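To make the loose format concrete, here is a minimal sketch that writes and reads a loose object in a temporary directory standing in for .git/objects/. It assumes SHA-1 object names and the usual loose layout (a zlib-compressed header plus content); `write_loose_object` and `read_loose_object` are hypothetical helper names. Packfiles use a different, more involved format that this sketch does not attempt to read, which is why a lookup like this comes up empty once an object has been packed.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(objects_dir, kind, body):
    """Store '<kind> <size>' + NUL + body, zlib-compressed, under its SHA-1."""
    store = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0" + body
    object_id = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(zlib.compress(store))
    return object_id

def read_loose_object(objects_dir, object_id):
    """Return (kind, body), or None when no loose file exists for this hash."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    if not os.path.isfile(path):
        return None  # the object may live in a packfile instead
    with open(path, "rb") as handle:
        store = zlib.decompress(handle.read())
    header, _, body = store.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return kind.decode("ascii"), body

with tempfile.TemporaryDirectory() as objects_dir:
    object_id = write_loose_object(objects_dir, "blob", b"test content\n")
    kind, body = read_loose_object(objects_dir, object_id)
    missing = read_loose_object(objects_dir, "0" * 40)

print(object_id, kind, body)
print(missing)  # no loose file exists for an all-zero hash here
```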

Git References

References are similar to pointers. HEAD is stored in the file .git/HEAD, and the other references are stored under the .git/refs directory.

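Branches: each branch is a pointer to its latest commit, stored under .git/refs/heads/.
HEAD: a special pointer to the commit you're currently on. .git/HEAD normally stores a symbolic reference to the current branch (for example, ref: refs/heads/master); in a detached HEAD state it stores the commit hash directly.
Tags: a lightweight tag is a pointer to a commit, stored under .git/refs/tags/.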

As you can see, a reference is just a small file that records a commit hash, which is why creating and deleting branches and tags is so fast.
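Resolving a reference therefore amounts to reading a small file. The sketch below builds a fake .git directory in a temporary location and follows HEAD to a commit hash. `resolve_ref` is a hypothetical helper, the hash is a made-up placeholder, and references that exist only in packed-refs are deliberately not handled.

```python
import os
import tempfile

def resolve_ref(git_dir, name="HEAD"):
    """Follow 'ref: ...' indirections until a hash is found.

    Returns None when the reference file does not exist (for example,
    when the reference is stored only in packed-refs).
    """
    path = os.path.join(git_dir, name)
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="ascii") as handle:
        value = handle.read().strip()
    if value.startswith("ref: "):
        return resolve_ref(git_dir, value[len("ref: "):])
    return value

# Placeholder commit hash, used only for this illustration.
commit_id = "1234567890abcdef1234567890abcdef12345678"

with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w", encoding="ascii") as handle:
        handle.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w", encoding="ascii") as handle:
        handle.write(commit_id + "\n")
    resolved = resolve_ref(git_dir)
    absent = resolve_ref(git_dir, "refs/heads/missing")

print(resolved == commit_id)  # HEAD -> refs/heads/master -> commit hash
print(absent)                 # nothing is stored for a branch that does not exist
```

Creating a branch in this model is a single small file write, which matches the observation above about how cheap branches and tags are.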