Serving a Website From a Git Repo Without Cloning It

It's fairly common to use git repositories as a vehicle for serving websites. The webdev pushes their changes to some branch of a publicly available git repository, and some web server somewhere serves the current tip of that branch as the website. GitHub Pages would be the most famous example of this.

The domani reverse proxy also supports serving a website from a git repository. It previously did so by automatically cloning the repository locally and periodically pulling changes down. This worked well enough, but I figured it could be simplified further such that no local state is required except the current hash of the desired branch. This post is going to explain how this can be done by first guiding you through git's internals a bit.

Git Branches: What Are They?

The first thing to understand is what git branches actually are; they are a kind of "ref" (reference). Git tags are another kind of ref. A ref is nothing more than a name which points to an object hash, most likely a commit hash. You can easily inspect the refs of a git project, even without the git tool itself. For example, the current tip of the `main` branch can be found in the `.git/refs/heads/main` file:
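Here is a minimal sketch of reading that file in Python instead of with shell tools. The `read_ref` name is made up for illustration, and the sketch assumes the ref exists as a loose file; git can also pack refs into `.git/packed-refs`, which is ignored here.

```python
from pathlib import Path


def read_ref(repo_dir: str, ref: str = "refs/heads/main") -> str:
    """Return the object hash stored in a loose ref file (hypothetical helper)."""
    # A ref file holds a single hex object hash followed by a newline.
    return (Path(repo_dir) / ".git" / ref).read_text().strip()
```

Calling `read_ref(".")` from the root of a checkout should give the same hash that `git rev-parse main` prints, assuming the branch is called `main` and has not been packed.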

Ref files, and therefore branch files, are just plaintext files containing a single object hash. When the current tip of a branch is changed the only real change which takes place on the filesystem is to change the content of its ref file. At its core, git is actually quite a simple tool.

Git Objects: Friend or Foe?

We've established that a git branch points to a git object via the hash of that object, but what does that really mean?

There are four kinds of git object: blob, tree, commit, and tag. Regardless of the object's kind, it is stored in a file named after the SHA1 of the object within the `.git/objects` directory. Objects are always stored compressed using zlib, but it is their uncompressed form which is used for hashing.

This example computes the hash of the object pointed to from our previous example. You can see the output SHA1 is the same as the object's file name (with the first two characters used for a directory name, otherwise the `objects` directory would get too big.) Note the usage of `pigz -d`, which does the zlib decompression.
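The same check can be sketched in Python, with the standard `zlib` module standing in for `pigz -d`. The function names are invented for this example, and it only covers loose objects (one file per object).

```python
import hashlib
import zlib
from pathlib import Path


def loose_object_path(repo_dir: str, object_hash: str) -> Path:
    # The first two hex characters name the directory, the rest name the file.
    return Path(repo_dir) / ".git" / "objects" / object_hash[:2] / object_hash[2:]


def hash_loose_object(path: Path) -> str:
    # Decompress first: the SHA1 is taken over the uncompressed bytes.
    raw = zlib.decompress(path.read_bytes())
    return hashlib.sha1(raw).hexdigest()
```

If the object file is intact, `hash_loose_object(loose_object_path(repo, h))` should return `h` again.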

The body of a git commit object is more or less plaintext, save for a single null byte which separates a header string from the object's contents, so we can just look at it directly:

As we can see from the header, the object in question is a commit (but we knew that). The `264` denotes the size of the rest of the file after the header. Following the `264` would be a null byte, except that the null byte was replaced with a newline by the `tr '\0' '\n'`.
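Splitting an uncompressed object into its header and contents takes only a few lines. This is a sketch with a made-up function name; it takes the already-decompressed bytes and checks the size field against the actual length.

```python
def split_object(raw: bytes) -> tuple[str, bytes]:
    """Split uncompressed object bytes into (kind, body)."""
    # The header looks like b"commit 264" and ends at the first null byte.
    header, _, body = raw.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body
```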

Most of the rest of the file should be familiar. The author is listed, along with the commit's description, which makes up the tail of the file. The parent refers to the previous commit in the chain by its object hash. The tree is perhaps unfamiliar, unless you've dived into git's internals before, but it is the most interesting bit for our purposes.
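Pulling those fields out of a commit body (the part after the null byte) could look something like this. It's a rough sketch: `parse_commit` is a made-up name, and it returns every field as a list because a merge commit has more than one `parent` line.

```python
def parse_commit(body: bytes) -> tuple[dict[str, list[str]], str]:
    """Return (fields, message) for the body of a commit object."""
    # Header fields and the commit message are separated by a blank line.
    head, _, message = body.decode("utf-8", errors="replace").partition("\n\n")
    fields: dict[str, list[str]] = {}
    last_key = None
    for line in head.split("\n"):
        if line.startswith(" ") and last_key is not None:
            # Continuation of a multi-line value, such as a signature.
            fields[last_key][-1] += "\n" + line[1:]
            continue
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)
        last_key = key
    return fields, message
```

The root tree hash is then `fields["tree"][0]`.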

Contrary to how git usually presents commits to its users, git commits do not actually contain a diff from the previous to the current state of the repository. Each commit instead contains a reference to a tree, where each tree fully describes the state of the repository's files at that commit. What you see when you do something like `git show` is actually a diff generated in real-time between the previous and current trees.

As mentioned, trees are themselves another kind of git object, and so are referenced and queried just like commits. Let's look at the tree for our commit:

Welp, it looks like trees aren't so simple to look at as commits. Luckily git ships with a handy utility, `cat-file`, for directly viewing objects and pretty-printing their contents:

Much better. We can see that the tree contains a list of entries, where each entry denotes a file (blob) or sub-directory (tree), along with the permissions and name of the file/sub-directory. Looking at the contents of the `static` sub-directory's object we find yet another tree:
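For reference, the raw tree format that `cat-file` decodes is not hard to parse by hand. Each entry is the mode in ASCII, a space, the name, a null byte, and then the 20 raw bytes of the entry's SHA1. The following is a sketch with a made-up function name, taking the tree body with the object header already stripped.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Return (mode, name, hex hash) for each entry in a tree body."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        # The hash is stored as 20 raw bytes, not as 40 hex characters.
        object_hash = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, object_hash))
        pos = nul + 21
    return entries
```

A mode of `40000` marks a sub-directory (another tree); regular files show up with modes like `100644`.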

We can see from this how git trees are used to describe the full contents of a repository from just a single hash. If any file in the repository were to change then the hash of its associated blob object would change, which would change the associated entry in the tree the file falls in, which changes the hash of the tree's object, which changes the entry for the tree in the tree's parent directory, and so on. The change propagates all the way upward to the root tree object and its hash. This hash is then stored in a commit, which allows each commit to easily denote the entire state of the repository.

(This pattern of using a recursively hashed tree to uniquely identify an arbitrarily large amount of hierarchical data is called a Merkle tree.)
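To see the propagation in action, here is a small sketch which builds the hashes for a tiny repository layout entirely in memory. The layout (a `static` directory holding one `foo.html`) mirrors the example above, while the helper names are invented for illustration.

```python
import hashlib


def object_hash(kind: str, body: bytes) -> str:
    # Objects are hashed as "<kind> <size>", a null byte, then the body.
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    # Each entry: mode, space, name, null byte, 20 raw hash bytes.
    return b"".join(
        mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(h)
        for mode, name, h in entries
    )


def root_hash(foo_html: bytes) -> str:
    """Hash of a root tree containing only static/foo.html."""
    blob = object_hash("blob", foo_html)
    static = object_hash("tree", tree_body([("100644", "foo.html", blob)]))
    return object_hash("tree", tree_body([("40000", "static", static)]))
```

Changing a single byte of the file's content gives a different blob hash, and therefore a different `static` tree hash and a different root hash.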

Anyway, let's get to the goods: how do we view the files themselves? Each file is stored as a blob object. Let's check out the `foo.html` blob we found in our last example:

Blob objects are like commit objects, nice and easy to parse; just a header, a null byte, and then the content of the file as-is.

That's all there really is to git objects (ignoring tags, which we don't need today). Armed with this knowledge we can continue on towards our ultimate goal.

Remote Repositories

When cloning a git repository you've probably run `git clone` with the repository's HTTPS URL.

Git then went and did a bunch of magical stuff, and afterwards the repo was fully cloned locally. But how did your git tool do that, given just a URL?

The answer is: it's complicated. There are actually two different protocols with which git might clone a repo over HTTPS: the smart protocol and the dumb protocol. The smart protocol is fast but requires a special-purpose HTTP client in order to work. The dumb protocol is slower, but it works with any plain HTTP client, because the server just serves the repository's files as they are without any special logic.

We're going to use the dumb protocol.

Let's return to the beginning and remember our actual goal here: we want to serve a website using the contents of a git repository, and specifically the contents of the tip of a specific branch of a git repository. To do this we need to know which commit is currently being pointed to by that branch. We can discover this by making a simple GET request to the repository's `/info/refs` path:
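In the dumb protocol the response body is plain text, with one ref per line in the form of an object hash, a tab, and the ref name. Parsing it might look like this sketch (the function name is made up):

```python
def parse_info_refs(text: str) -> dict[str, str]:
    """Map ref names to object hashes from an info/refs response body."""
    refs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<hash>\t<ref name>".
        object_hash, _, name = line.partition("\t")
        refs[name] = object_hash
    return refs
```

The tip of the `main` branch would then be `refs["refs/heads/main"]`.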

This repo only has a single branch (and no tags), so only a single line is returned. And would you look at that, it's the same commit hash as we saw in our local copy! Let's now query that commit object, and see how objects are queried in general:
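A sketch of such a query using only Python's standard library is below. The function names are invented, and the sketch assumes the server exposes the object as a loose file under `objects/`, laid out just like the local `.git/objects` directory.

```python
import hashlib
import urllib.request
import zlib


def object_url(repo_url: str, object_hash: str) -> str:
    # Same layout as on disk: objects/<first 2 chars>/<remaining 38 chars>.
    return f"{repo_url.rstrip('/')}/objects/{object_hash[:2]}/{object_hash[2:]}"


def decode_object(compressed: bytes, expected_hash: str) -> tuple[str, bytes]:
    """Decompress an object, verify its hash, and return (kind, body)."""
    raw = zlib.decompress(compressed)
    if hashlib.sha1(raw).hexdigest() != expected_hash:
        raise ValueError("object does not match its hash")
    header, _, body = raw.partition(b"\0")
    kind = header.split(b" ", 1)[0].decode("ascii")
    return kind, body


def fetch_object(repo_url: str, object_hash: str) -> tuple[str, bytes]:
    # A plain GET request is all the dumb protocol needs.
    with urllib.request.urlopen(object_url(repo_url, object_hash)) as resp:
        return decode_object(resp.read(), object_hash)
```

Verifying the hash after decompression is cheap, and it means a corrupted or tampered response gets rejected rather than served.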

As you can see, objects are returned in the exact same way they are stored locally. No surprises. While trees are difficult to look at without the `git cat-file` tool (which won't work for remote objects), blobs are still easy:

It works exactly like the local git repo.

Putting It All Together

Given all this, how would my special reverse proxy handle a request for `/static/foo.html`? Well, first it would need to query the repository server for the commit (using `/info/refs`), and then fetch the commit object in order to pull out the root tree hash, and then fetch the root tree object.

From there the server would need to look in the root tree object and check that it sees a tree entry called `static`, and fetch its tree object. The server would then check `static`'s tree object for a blob entry called `foo.html`, and finally it would fetch that blob, passing its contents back as the response to the original request (after stripping off the git object header).
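The tree walk described above can be sketched as follows. This is not how Domani does it; it's an illustrative Python version which starts from an already-known root tree hash and takes a hypothetical `fetch` callable that returns `(kind, body)` for a given object hash, with the object header already stripped.

```python
from typing import Callable

Fetch = Callable[[str], tuple[str, bytes]]


def tree_entries(body: bytes) -> dict[str, str]:
    """Map entry names to hex hashes for a raw tree body."""
    entries = {}
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        name = body[space + 1:nul].decode("utf-8")
        entries[name] = body[nul + 1:nul + 21].hex()
        pos = nul + 21
    return entries


def resolve_path(fetch: Fetch, root_tree: str, path: str) -> bytes:
    """Walk from the root tree down to the blob at path and return its content."""
    current = root_tree
    for part in (p for p in path.split("/") if p):
        kind, body = fetch(current)
        if kind != "tree":
            raise FileNotFoundError(path)
        entries = tree_entries(body)
        if part not in entries:
            raise FileNotFoundError(path)
        current = entries[part]
    kind, body = fetch(current)
    if kind != "blob":
        raise IsADirectoryError(path)
    return body
```

Each path segment costs one `fetch` call for its parent tree, plus one final call for the blob itself.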

This sounds like a lot of steps to serve a single file, but there are two key optimizations which can be made. The first is to cache the root tree's hash in memory, which skips two lookups right at the beginning. The root tree's hash will only change when the latest commit of the branch changes, so it's enough to cache it in memory and have a separate background process periodically re-check the latest commit.

The second optimization is to cache tree objects in-memory using their hash as a key. The object identified by a hash never changes, so this cache is easy to manage, and by caching the tree objects in memory (perhaps with an LRU cache if memory usage is a concern) all round-trips to the remote server can be eliminated, save for the final round-trip for the file itself.
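Because an object never changes once it has a hash, the cache needs no invalidation logic at all, only eviction. A minimal LRU sketch in Python is below; the class name and the `fetch` callable are assumptions made for this example.

```python
from collections import OrderedDict
from typing import Callable

GitObject = tuple[str, bytes]  # (kind, body)


class ObjectCache:
    """Small LRU cache in front of a fetch function (hypothetical helper)."""

    def __init__(self, fetch: Callable[[str], GitObject], max_entries: int = 1024):
        self._fetch = fetch
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, object_hash: str) -> GitObject:
        if object_hash in self._entries:
            # Mark as most recently used.
            self._entries.move_to_end(object_hash)
            return self._entries[object_hash]
        value = self._fetch(object_hash)
        self._entries[object_hash] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value
```

Wrapping only tree lookups in `get` keeps memory usage modest, since blobs (the actual file contents) are usually much larger than trees.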

If you'd like to see an example implementation of this idea you can check out my Rust implementation for Domani:

git.rs

Note that I was able to take advantage of the excellent gix crate to help me with decoding git objects. If your language of choice doesn't have a git object parsing library available you'll have to parse the objects manually, but honestly it shouldn't be too difficult anyway.

That's it! Even if this is a pretty niche use-case and doesn't change the world, I hope you still found it useful as an introduction to git's internals, and perhaps as a jumping-off point for your own ideas of how git can be abused to do interesting things.

-----

Published 2024-02-17