Announcing Blobcache

Blobcache is a content-addressed storage service. It runs as a systemd service on Linux, or a launchd service on macOS. It exposes a transactional API for persisting hash-linked data structures in content addressable stores called Volumes. The API is summarized in this cheat-sheet.

I built Blobcache after realizing that many of the systems I work on end up creating hash-linked data structures and then syncing them around between machines. This pattern is obvious and pervasive in distributed systems: Git, BitTorrent, and cryptocurrencies all take the same approach.

What I Did Before Blobcache

Previously, I would create content-addressed stores ad hoc as tables in SQLite, or as directories of many small files in the local filesystem. The SQLite approach is easy and correct, but doesn’t scale writes well. The small-file approach scales better, but doesn’t allow for transactional consistency. Using a POSIX filesystem requires extra care to ensure that only fully written, correct blobs make it into the store.
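To give a sense of that extra care, here is a minimal Go sketch of the small-file approach using only the standard library. The names putBlob, getBlob, and roundTrip are made up for this illustration, and SHA-256 is an assumed choice of hash. The blob is written to a temporary file, synced, and then renamed to its content hash, so a reader never sees a half-written blob under its final name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"
)

// putBlob (hypothetical helper) stores data in dir under the hex SHA-256 of its content.
// It writes to a temporary file first and renames it into place, so the final
// name only ever refers to a fully written blob.
func putBlob(dir string, data []byte) (string, error) {
	sum := sha256.Sum256(data)
	name := hex.EncodeToString(sum[:])
	final := filepath.Join(dir, name)
	if _, err := os.Stat(final); err == nil {
		return name, nil // already stored; writes are idempotent
	}
	tmp, err := os.CreateTemp(dir, "tmp-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return "", err
	}
	if err := tmp.Sync(); err != nil { // flush file contents before publishing the name
		tmp.Close()
		return "", err
	}
	if err := tmp.Close(); err != nil {
		return "", err
	}
	if err := os.Rename(tmp.Name(), final); err != nil {
		return "", err
	}
	return name, nil
}

// getBlob (hypothetical helper) reads a blob and checks it against its name.
func getBlob(dir, name string) ([]byte, error) {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != name {
		return nil, fmt.Errorf("blob %s does not match its hash", name)
	}
	return data, nil
}

// roundTrip stores data in a fresh temporary directory and reads it back.
func roundTrip(data []byte) (string, []byte, error) {
	dir, err := os.MkdirTemp("", "blobs-")
	if err != nil {
		return "", nil, err
	}
	defer os.RemoveAll(dir)
	name, err := putBlob(dir, data)
	if err != nil {
		return "", nil, err
	}
	got, err := getBlob(dir, name)
	return name, got, err
}

func main() {
	name, got, err := roundTrip([]byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Println(name, string(got))
}
```

Even this is incomplete: a durable version would also sync the parent directory after the rename, and nothing here makes a group of blobs appear atomically.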

Then I would have to build a protocol for moving the data between machines. The easy way of doing this is with something like Protocol Buffers+gRPC or HTTP+JSON. A system can request data by hash or upload new data and get a hash. That’s the meat of it, but there are also details concerning encrypting and authenticating the data in transit, and who has access to which data. Can this user write to or just read from this version of the data? The details would have to be worked out for each application.
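A bare-bones version of that protocol might look like the following Go sketch over HTTP. Everything in it is an assumption for illustration: the /blobs/ path, the blobServer type, the roundTrip helper, and SHA-256 as the hash. It deliberately leaves out the hard parts mentioned above (encryption, authentication, and access control).

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// blobServer (hypothetical) keeps blobs in memory, keyed by hex SHA-256.
type blobServer struct {
	mu    sync.Mutex
	blobs map[string][]byte
}

func (s *blobServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost: // upload new data, get a hash back
		data, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		sum := sha256.Sum256(data)
		key := hex.EncodeToString(sum[:])
		s.mu.Lock()
		s.blobs[key] = data
		s.mu.Unlock()
		fmt.Fprint(w, key)
	case http.MethodGet: // request data by hash
		key := strings.TrimPrefix(r.URL.Path, "/blobs/")
		s.mu.Lock()
		data, ok := s.blobs[key]
		s.mu.Unlock()
		if !ok {
			http.NotFound(w, r)
			return
		}
		w.Write(data)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// roundTrip uploads payload to a throwaway local server and fetches it back by hash.
func roundTrip(payload []byte) (string, []byte, error) {
	srv := httptest.NewServer(&blobServer{blobs: map[string][]byte{}})
	defer srv.Close()

	resp, err := http.Post(srv.URL+"/blobs/", "application/octet-stream", bytes.NewReader(payload))
	if err != nil {
		return "", nil, err
	}
	keyBytes, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		return "", nil, err
	}
	key := string(keyBytes)

	resp, err = http.Get(srv.URL + "/blobs/" + key)
	if err != nil {
		return "", nil, err
	}
	data, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	return key, data, err
}

func main() {
	key, data, err := roundTrip([]byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Println(key, string(data))
}
```

Because the key is the hash of the content, a client can always re-hash what it receives and detect a corrupted or substituted blob without trusting the server.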

What I Do with Blobcache

Now with Blobcache as the storage layer, I create a Volume for the application, and then open transactions as needed. No more renaming files into place, or worrying about which kind of POSIX locks are the real ones. Blobcache transactions are scoped to a single volume, commit all-or-nothing, and have a well-defined total order.

When I open a transaction to insert data, I often fan out the work across multiple cores. Blobcache transactions are safe to use from multiple threads, and scaling workloads horizontally is encouraged. You can even pass a transaction handle to another process or machine and let it work in parallel too.

In Blobcache, most things look like Volumes. Syncing data between the local machine and a remote machine looks the same as interacting with two Volumes on the local machine. Why should it look any different?

Storage access has traditionally been split into two separate problems: the local filesystem brokers access to local drives, and the network brokers access to remote ones. But it is really the same problem with a single variable, latency, set slightly differently. Communication with the storage device can be interrupted in both scenarios, and anything that doesn’t result in the preferred end state is a failure. I have a consistent copy in one place, and I want to move it to another place. With hash-linked data structures, this can be done efficiently by maintaining and assuming referential integrity, and never copying data twice. The result is one algorithm, built on a single interface (the Blobcache API), which can copy any directed acyclic data structure from anywhere to anywhere.
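The shape of that algorithm can be sketched in Go. This is not the Blobcache API: the Store interface, the memStore type, the encode and parseLinks helpers, and the toy blob layout (a one-byte child count, then 32-byte child hashes, then the payload) are all assumptions made up for this illustration. The key idea is that children are copied before their parent, so the destination always satisfies "if a blob is present, everything it references is present too", which lets the copy skip any subtree whose root already exists.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
)

type Hash [32]byte

// Store is a hypothetical minimal content-addressed store interface.
type Store interface {
	Exists(h Hash) bool
	Get(h Hash) ([]byte, error)
	Post(data []byte) Hash
}

// memStore is an in-memory Store that counts how many blobs were posted.
type memStore struct {
	blobs map[Hash][]byte
	posts int
}

func newMemStore() *memStore { return &memStore{blobs: map[Hash][]byte{}} }

func (m *memStore) Exists(h Hash) bool { _, ok := m.blobs[h]; return ok }

func (m *memStore) Get(h Hash) ([]byte, error) {
	data, ok := m.blobs[h]
	if !ok {
		return nil, fmt.Errorf("blob %x not found", h[:])
	}
	return data, nil
}

func (m *memStore) Post(data []byte) Hash {
	h := Hash(sha256.Sum256(data))
	m.blobs[h] = append([]byte(nil), data...)
	m.posts++
	return h
}

// encode builds a toy node: child count, child hashes, then payload.
func encode(children []Hash, payload []byte) []byte {
	out := []byte{byte(len(children))}
	for _, c := range children {
		out = append(out, c[:]...)
	}
	return append(out, payload...)
}

// parseLinks extracts the child hashes from a toy node.
func parseLinks(data []byte) ([]Hash, error) {
	if len(data) < 1 || len(data) < 1+32*int(data[0]) {
		return nil, errors.New("malformed node")
	}
	links := make([]Hash, int(data[0]))
	for i := range links {
		copy(links[i][:], data[1+32*i:])
	}
	return links, nil
}

// Copy copies the DAG rooted at root from src to dst, children first.
func Copy(dst, src Store, root Hash) error {
	if dst.Exists(root) {
		return nil // assumed referential integrity: the whole subtree is already there
	}
	data, err := src.Get(root)
	if err != nil {
		return err
	}
	links, err := parseLinks(data)
	if err != nil {
		return err
	}
	for _, child := range links {
		if err := Copy(dst, src, child); err != nil {
			return err
		}
	}
	dst.Post(data) // parent goes last, preserving integrity in dst
	return nil
}

// demo copies two overlapping DAGs and reports how many blobs each copy posted.
func demo() (first, second int, err error) {
	src, dst := newMemStore(), newMemStore()
	a := src.Post(encode(nil, []byte("a")))
	b := src.Post(encode(nil, []byte("b")))
	c := src.Post(encode(nil, []byte("c")))
	root1 := src.Post(encode([]Hash{a, b}, []byte("v1")))
	root2 := src.Post(encode([]Hash{a, c}, []byte("v2")))
	if err = Copy(dst, src, root1); err != nil {
		return
	}
	first = dst.posts
	if err = Copy(dst, src, root2); err != nil {
		return
	}
	second = dst.posts - first
	return
}

func main() {
	first, second, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(first, second)
}
```

In the demo, the second copy shares a leaf with the first, so only the new leaf and the new root are transferred. Nothing in Copy cares whether src and dst are local or remote, only that they satisfy the same interface.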

The only data model in Blobcache is my program’s own data structures. The same way I might import a library implementing a Tree or a HashMap to build a data structure in memory, I import a data structure library to manage content-addressed blobs in Blobcache. Blobcache calls these data structures Schemas, and there is a small “standard library” for Go. There is also a filesystem implementation: the Git-Like Filesystem.

These libraries cover many of the common use cases.

Blobcache is secure by default. Each Node has a cryptographic identity and does not share Volumes over the network unless explicitly configured to. Configuring access between nodes is as easy as configuring SSH: add the NodeID to a config file, list the Volumes the new Node should have access to, and you’re done.

What I’m Building

My main use for Blobcache right now is building Got Version Control. Got is version control, like Git, but for all your non-source-code data. It handles large files and directories well, and E2E encrypts all the data you give it. Got stores all of its data in Blobcache and has no network protocol of its own, only dealing with Blobcache Volumes. Blobcache has significantly simplified Got.