How It Works
Content identity
Every file is a sequence of bytes. C4 computes a SHA-512 hash of those bytes and encodes it in base58 format. The result is a C4 ID — a permanent, universal name for that content.
c45xZeXwKjQ2nMrBz7L1pYvWqRtN8sHfJd3g6CmAeU9kXoP4bG5hT0iVlDaSwFuO7yE
Same bytes, same ID. Always. On any machine, in any language, forever.
This is not a new idea — git uses content hashing for commits, IPFS uses it for blocks, package managers use it for checksums. C4 applies it to files and directories in a way that’s human-readable and Unix-composable.
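The hash-then-encode step described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the reference implementation: the text only states "SHA-512" and "base58", so the Bitcoin-style alphabet, the padding to 88 digits, and the literal `c4` prefix are assumptions, and the name `c4_id_sketch` is hypothetical.

```python
import hashlib

# Assumption: Bitcoin-style base58 alphabet (omits 0, O, I and l).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def c4_id_sketch(data: bytes) -> str:
    """Hash the bytes with SHA-512 and render the digest in base58."""
    digest = hashlib.sha512(data).digest()
    n = int.from_bytes(digest, "big")  # treat the 64-byte digest as one big integer

    digits = []
    while n > 0:
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])

    # Assumption: left-pad to a fixed width. 88 base58 digits are enough
    # to hold any 512-bit value, so every ID has the same length.
    body = "".join(reversed(digits)).rjust(88, ALPHABET[0])
    return "c4" + body  # assumption: IDs carry a literal "c4" prefix


print(c4_id_sketch(b"hello world"))
```

The function depends only on its input bytes, which is the property the section relies on: filenames, timestamps, and the machine it runs on play no part in the result.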
The c4m format
A c4m file is one line per entry. Each line contains:
permissions timestamp size name c4id
For example:
-rw-r--r-- 2026-03-20T12:00:00Z 8192 README.md c45xZeXwKjQ2...
-rwxr-xr-x 2026-03-20T12:00:00Z 41023 build.sh c43zYcLnRtP8...
drwxr-xr-x 2026-03-20T12:00:00Z ... src/ c47mNqPvWxY1...
That’s it. Plain text, one entry per line. You can:
- Read it — it’s a text file
- Diff it — `diff old.c4m new.c4m` works, `c4 diff` is smarter
- Grep it — `grep '\.exr' delivery.c4m` to find all EXR files
- Pipe it — `c4 id ./footage/ | sort -k4` to sort by filename
- Email it — it’s small, even for millions of files
- Version it — commit it to git, it’s just text
A c4m file fully describes a directory tree — a complete filesystem description without the file contents. And the text format costs almost nothing: compressed c4m files are within 2% of a purpose-built binary format, because the SHA-512 IDs are genuinely high-entropy and dominate the file size regardless of encoding. The text format is not a compromise — it’s effectively free.
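Because each entry is a single whitespace-separated line, a reader needs very little code. The sketch below parses the five fields shown above. It assumes the ID is always the last token and that the name may contain spaces. The real format may have quoting, nesting, or escaping rules that this ignores, and `Entry` and `parse_line` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    permissions: str
    timestamp: str
    size: Optional[int]  # None when the size field is not a plain integer
    name: str
    c4id: str


def parse_line(line: str) -> Entry:
    """Split one c4m line into: permissions timestamp size name c4id."""
    # The first three fields never contain spaces; the remainder is "name c4id".
    permissions, timestamp, size, rest = line.strip().split(None, 3)
    # Assumption: the ID is the final token, so split it off from the right.
    name, c4id = rest.rsplit(None, 1)
    return Entry(permissions, timestamp, int(size) if size.isdigit() else None, name, c4id)


# IDs are truncated here exactly as in the example above.
entry = parse_line("-rw-r--r-- 2026-03-20T12:00:00Z 8192 README.md c45xZeXwKjQ2...")
print(entry.name, entry.size)
```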
Directory entries
Directories get their own C4 IDs, computed from their contents. If any file inside a directory changes, the directory’s ID changes too. This gives you Merkle tree verification for free — verify the root ID and you’ve verified everything beneath it.
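The way a change propagates upward can be shown with a toy Merkle tree. This sketch does not reproduce C4's actual rule for deriving a directory ID, which this page does not spell out. It assumes a directory ID is the hash of its sorted (name, child ID) pairs and uses hex SHA-512 digests as stand-ins for C4 IDs. `file_id` and `dir_id` are hypothetical helpers.

```python
import hashlib


def file_id(data: bytes) -> str:
    # Stand-in for a C4 ID: a hex SHA-512 digest of the file's bytes.
    return hashlib.sha512(data).hexdigest()


def dir_id(children: dict) -> str:
    """Derive a directory ID from its children (illustrative rule only).

    `children` maps each entry name to that entry's ID. Names are
    sorted so the result does not depend on listing order.
    """
    h = hashlib.sha512()
    for name in sorted(children):
        h.update(name.encode("utf-8") + b"\0" + children[name].encode("ascii") + b"\n")
    return h.hexdigest()


def root_id(main_c: bytes) -> str:
    # A two-level tree: README.md and src/main.c
    src = dir_id({"main.c": file_id(main_c)})
    return dir_id({"README.md": file_id(b"docs"), "src/": src})


before = root_id(b"int main(void) { return 0; }")
after = root_id(b"int main(void) { return 1; }")
print(before != after)  # editing a nested file changes the root ID
```

Checking the root ID against a trusted value is therefore enough to detect a change anywhere in the tree.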
The content store
When you use c4, file contents are stored in a local content-addressed store by default — a directory where files are named by their C4 IDs. This gives you:
- Automatic deduplication — identical files are stored once, regardless of how many directories contain them
- Instant verification — if the file exists in the store under a given ID, it’s correct by definition
- Shared storage — multiple projects can share the same store
The store is just a directory of files. No database, no daemon, no lock files. Any tool that understands C4 IDs can read from it.
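A store of this kind is small enough to sketch directly. The version below assumes a flat directory layout and uses hex SHA-512 digests in place of C4 IDs. The real store's layout and naming are not described here, and `put`, `get`, and `content_id` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def content_id(data: bytes) -> str:
    # Stand-in for a C4 ID: a hex SHA-512 digest.
    return hashlib.sha512(data).hexdigest()


def put(store: Path, data: bytes) -> str:
    """Write data under its own ID; identical content is written only once."""
    store.mkdir(parents=True, exist_ok=True)
    cid = content_id(data)
    target = store / cid
    if not target.exists():
        # Write to a temporary name first so a partial write never
        # appears under a valid ID.
        tmp = store / (cid + ".tmp")
        tmp.write_bytes(data)
        tmp.replace(target)
    return cid


def get(store: Path, cid: str) -> bytes:
    """Read content by ID, re-hashing as a defensive check against disk corruption."""
    data = (store / cid).read_bytes()
    if content_id(data) != cid:
        raise ValueError("stored content does not match its ID: " + cid)
    return data


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "store"
    first = put(store, b"same bytes")
    second = put(store, b"same bytes")  # deduplicated: no second file is written
    print(first == second, len(list(store.iterdir())))
```

Since the filename is the hash, no index or database is needed: looking up an ID is a path lookup, and two projects pointing at the same directory share content automatically.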
How the pieces fit together
The C4 ecosystem is a set of tools that all speak the same language: c4m files and C4 IDs.
| Tool | Role |
|---|---|
| c4 | Creates c4m files, diffs them, produces patches |
| c4sh | Mounts c4m files as virtual directories in your shell |
| c4py | Pure Python library for C4 identification, manifests, and diffing |
| c4ts | TypeScript / JavaScript library for browsers and Node.js |
| c4-swift | Native Swift library for Apple platforms |
| c4git | Tracks large files in git by C4 ID |
| libc4 | C library for embedding C4 in any application |
Because the format is plain text, you don’t need any of these tools to read a c4m file. They exist to make creation, verification, and manipulation fast and convenient.
Unix composability
c4 is designed in the Unix tradition. Nearly every c4 command produces c4m output, which means commands compose naturally:
- `c4 path/` — scan a directory, output c4m, store content
- `c4 path/ | c4` — scan, then identify and store the c4m itself, returning its C4 ID
- `c4 diff a.c4m b.c4m | c4 paths` — diff two states, extract just the paths that changed
- `c4 intersect id a.c4m b.c4m | c4 paths` — find content shared between two projects
Because c4 stores content by default, a single command such as c4 workspace/ both captures the c4m description and stores every file’s content, leaving the result ready for distribution. The ergonomic display (c4 cat -e) reformats timestamps, adds comma-separated sizes, and pushes C4 IDs to the right margin for readability, while the canonical form stays machine-parseable.
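Conceptually, the diff and intersect pipelines reduce to comparing name-to-ID mappings. The Python sketch below illustrates that idea only. It does not reproduce the output format or algorithm of `c4 diff` or `c4 intersect`, the function names are hypothetical, and the IDs are obvious placeholders rather than real C4 IDs.

```python
def load_manifest(text: str) -> dict:
    """Map each entry name to its ID, assuming 'permissions timestamp size name c4id' lines."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _perms, _timestamp, _size, rest = line.split(None, 3)
        name, c4id = rest.rsplit(None, 1)
        entries[name] = c4id
    return entries


def changed_paths(old: dict, new: dict) -> list:
    """Names that were added, removed, or now point at different content."""
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))


def shared_ids(a: dict, b: dict) -> set:
    """IDs present in both manifests, regardless of the names they appear under."""
    return set(a.values()) & set(b.values())


old = load_manifest("""\
-rw-r--r-- 2026-03-20T12:00:00Z 8192 README.md c4-placeholder-A
-rwxr-xr-x 2026-03-20T12:00:00Z 41023 build.sh c4-placeholder-B
""")
new = load_manifest("""\
-rw-r--r-- 2026-03-20T12:00:00Z 8192 README.md c4-placeholder-A
-rwxr-xr-x 2026-03-21T09:30:00Z 41100 build.sh c4-placeholder-C
-rw-r--r-- 2026-03-21T09:30:00Z 120 notes.txt c4-placeholder-D
""")

print(changed_paths(old, new))  # ['build.sh', 'notes.txt']
print(shared_ids(old, new))     # {'c4-placeholder-A'}
```

Comparing IDs rather than bytes is what keeps these operations cheap: neither function ever opens a file.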
Why SHA-512?
SHA-512 was chosen for permanence. It’s well-studied, widely implemented, and has a 512-bit output that provides collision resistance far beyond practical concern. A C4 ID computed today will still be valid and verifiable decades from now.
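To put a number on "far beyond practical concern", the birthday bound can be evaluated directly. The sketch assumes SHA-512 behaves like an ideal random function, which is a modelling assumption rather than a proven property, and `collision_bound` is a hypothetical helper. Among n distinct inputs there are n(n-1)/2 pairs, and each pair collides with probability 2^-512.

```python
from fractions import Fraction


def collision_bound(n: int, bits: int = 512) -> Fraction:
    """Upper bound on the chance that any two of n distinct inputs share a digest.

    Union bound over all n(n-1)/2 pairs, each colliding with probability 2**-bits.
    """
    return Fraction(n * (n - 1), 2 * 2**bits)


# A deliberately generous estimate: a billion billion distinct files.
p = collision_bound(10**18)
print(f"{float(p):.1e}")
```

Even at that scale the bound is many orders of magnitude below any hardware error rate, so an accidental collision is not the failure mode to plan for.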
See the FAQ for more on the cryptographic choices.