FAQ

Why SHA-512?

Permanence. A C4 ID is meant to identify content forever — not just until the next algorithm upgrade. SHA-512 is:

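Well-studied: two decades of public cryptanalysis with no practical attack on its collision resistance
Widely implemented: available in every major cryptographic library and fast on 64-bit CPUs, with dedicated instructions on some recent processors
Large output: a 512-bit digest gives about 256 bits of collision resistance (roughly 2^256 work to find a collision), far beyond any foreseeable computational capability
Stable: unlike MD5 and SHA-1, both of which have practical collision attacks, SHA-512 has no known structural weakness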

We deliberately chose one algorithm and committed to it. C4 IDs don’t have a version prefix or algorithm identifier because they don’t need one. SHA-512 is the algorithm. If you have a C4 ID, you know how it was computed.
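As a rough illustration of what "you know how it was computed" means, the sketch below derives an ID from a SHA-512 digest in Python. It relies on details this FAQ does not state and which should be treated as assumptions: the digest is read as a big-endian integer, encoded in base58 with the Bitcoin-style alphabet, left-padded with "1" to 88 characters, and prefixed with "c4". The function name c4_id is hypothetical; the specification in the c4 repository is the authority.

```python
import hashlib

# Assumed alphabet: the Bitcoin-style base58 character set.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def c4_id(data: bytes) -> str:
    """Return a C4-style ID for data (a sketch, not the reference implementation)."""
    # Interpret the 64-byte SHA-512 digest as one large unsigned integer.
    n = int.from_bytes(hashlib.sha512(data).digest(), "big")
    digits = []
    # 88 base58 digits are enough to hold any 512-bit value (58**88 > 2**512).
    # Emitting exactly 88 digits also left-pads small values with "1".
    for _ in range(88):
        n, rem = divmod(n, 58)
        digits.append(ALPHABET[rem])
    return "c4" + "".join(reversed(digits))
```

Because there is no version prefix or algorithm identifier, the output is always the same fixed-width string for the same bytes, regardless of file name, location, or timestamp.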

What about SHA-3 or BLAKE3?

They’re excellent algorithms. BLAKE3 is faster. SHA-3 has a different internal structure that provides diversity.

But C4 is not optimized for hashing speed; it is optimized for permanence and universality. SHA-512 is fast enough for most workloads (hashing typically keeps pace with storage I/O), and it has a long record of deployment and scrutiny. Changing algorithms would mean maintaining multiple ID spaces, which defeats the purpose of universal identification.

Won’t there be collisions?

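For any two different files, the chance that they share a SHA-512 digest is about 1 in 2^512. Across a large collection the birthday bound applies: a collision only becomes likely after roughly 2^256 files have been hashed. To put that in perspective:

The number of atoms in the observable universe is approximately 2^266
You would need to hash roughly 2^256 files (about 10^77) before a collision becomes probable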
Every hard drive ever manufactured could not store enough unique files to make a collision likely

This is not a practical concern. It’s barely a theoretical one.
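The arithmetic behind this can be checked with the standard birthday approximation: for n items and a b-bit hash, the collision probability is about n^2 / 2^(b+1) while that value is small, and even odds are reached near n = sqrt(2 ln 2 * 2^b). The sketch below works in base-2 logarithms because the raw probabilities are far too small for floating point. The function names are illustrative only.

```python
import math


def log2_collision_probability(log2_items: float, hash_bits: int = 512) -> float:
    """log2 of the approximate collision probability for 2**log2_items inputs.

    Uses p ~= n**2 / 2**(hash_bits + 1), which is only valid while p is small.
    """
    return 2 * log2_items - (hash_bits + 1)


def log2_items_for_even_odds(hash_bits: int = 512) -> float:
    """log2 of the number of inputs at which a collision has ~50% probability."""
    # n ~= sqrt(2 * ln(2) * 2**hash_bits), computed in log space.
    return 0.5 * (hash_bits + math.log2(2 * math.log(2)))
```

For example, log2_collision_probability(64) describes a store holding 2^64 files (far more than any real system) and returns -385, meaning a collision probability of about 2^-385.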

How does the content store scale?

The content store is a directory of files named by their C4 IDs. It uses an adaptive trie structure — files start in a flat directory, and when a directory exceeds a threshold (default 4096 files), it automatically splits into 2-character subdirectories:

store/                          store/
  c45xZeXw...  (flat start)      c4/
  c43zYcLn...       →              5x/
  c47mNqPv...                        c45xZeXwKjQ2...
                                   3z/
                                      c43zYcLnRtP8...

This scales to billions of files. Splitting is automatic and transparent — the store adapts as it grows.

Deduplication is automatic — if two projects contain an identical file, it’s stored once. This can dramatically reduce storage for environments with many similar datasets or repeated deliverables.
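The splitting and deduplication behaviour described above can be sketched in Python as follows. This is not the project's store implementation: the function names, the choice to descend only into subdirectories that already exist, and the re-split loop are all assumptions made for illustration. It also assumes every file name is a full 90-character ID.

```python
import os

SPLIT_THRESHOLD = 4096  # default threshold quoted in the text
PREFIX_LEN = 2          # characters consumed per trie level


def _prefix(c4id, depth):
    """The 2-character slice of the ID used at a given trie depth."""
    return c4id[depth * PREFIX_LEN:(depth + 1) * PREFIX_LEN]


def _file_names(directory):
    """Names of regular files (not subdirectories) directly in directory."""
    return [n for n in os.listdir(directory)
            if os.path.isfile(os.path.join(directory, n))]


def locate_dir(root, c4id):
    """Descend while a subdirectory named after the next ID slice exists."""
    path, depth = root, 0
    while True:
        prefix = _prefix(c4id, depth)
        child = os.path.join(path, prefix)
        if len(prefix) == PREFIX_LEN and os.path.isdir(child):
            path, depth = child, depth + 1
        else:
            return path, depth


def split_dir(directory, depth):
    """Move every file in directory into a subdirectory keyed by its ID slice."""
    for name in _file_names(directory):
        sub = os.path.join(directory, _prefix(name, depth))
        os.makedirs(sub, exist_ok=True)
        os.rename(os.path.join(directory, name), os.path.join(sub, name))


def put(root, c4id, data, threshold=SPLIT_THRESHOLD):
    """Store data under its ID and return the file's current path."""
    os.makedirs(root, exist_ok=True)
    directory, depth = locate_dir(root, c4id)
    target = os.path.join(directory, c4id)
    if not os.path.exists(target):  # identical content is stored only once
        with open(target, "wb") as f:
            f.write(data)
    # Split as often as needed; each split pushes the file one level deeper.
    while len(_file_names(directory)) > threshold:
        split_dir(directory, depth)
        directory, depth = locate_dir(root, c4id)
    return os.path.join(directory, c4id)
```

In this sketch a path returned by an earlier put can become stale after a split, so callers would resolve the location again with locate_dir rather than caching paths.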

How big are c4m files?

Small. Each entry is roughly 150-200 bytes of text. A c4m file for 10,000 files is about 1.5-2 MB. For 1,000,000 files, roughly 150-200 MB. They compress well (70-80% reduction) because of the repetitive structure around the IDs.

For most real-world use, a c4m file is small enough to email, commit to git, or store indefinitely.
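To estimate the size of a manifest for a different file count, the figures above reduce to simple multiplication. The helper below is a hypothetical convenience, not part of any C4 tool; it takes the 150-200 bytes per entry and the 70-80% compression reduction quoted above as given.

```python
def manifest_size_bytes(entries, bytes_per_entry=(150, 200), reduction_pct=(70, 80)):
    """Return ((raw_low, raw_high), (compressed_low, compressed_high)) in bytes."""
    low, high = bytes_per_entry
    min_reduction, max_reduction = reduction_pct
    raw = (entries * low, entries * high)
    # Best case: smallest raw size with the largest reduction, and vice versa.
    compressed = (raw[0] * (100 - max_reduction) // 100,
                  raw[1] * (100 - min_reduction) // 100)
    return raw, compressed
```

For 1,000,000 entries this gives a raw size of 150-200 MB and a compressed size of 30-60 MB.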

Why text instead of a binary format?

Because text costs almost nothing extra — and gains everything.

We tested compressed c4m files against a purpose-built low-entropy binary format. The result: c4m has only 2% additional overhead after compression. The reason is fundamental: SHA-512 digests are genuinely high-entropy data. They’re designed to be indistinguishable from random bytes. No format — binary or text — can compress them significantly. The IDs dominate the file size, and the base58 text encoding of those IDs compresses to nearly the same size as raw binary.
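The entropy argument can be explored with a small experiment. The sketch below is not a reproduction of the test described above: it compares only a list of digests (no names, sizes, or other manifest fields), uses zlib as a stand-in compressor, and assumes the Bitcoin-style base58 alphabet with fixed 88-character encoding. All function names are illustrative.

```python
import hashlib
import zlib

# Assumed alphabet: the Bitcoin-style base58 character set.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_fixed(digest, width=88):
    """Encode a digest as a fixed-width base58 string."""
    n = int.from_bytes(digest, "big")
    out = []
    for _ in range(width):
        n, rem = divmod(n, 58)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))


def compare_encodings(count=2000):
    """Compress the same digests as raw bytes and as base58 text lines."""
    digests = [hashlib.sha512(str(i).encode()).digest() for i in range(count)]
    binary = b"".join(digests)
    text = "\n".join(base58_fixed(d) for d in digests).encode("ascii")
    return {
        "binary_raw": len(binary),
        "text_raw": len(text),
        "binary_compressed": len(zlib.compress(binary, 9)),
        "text_compressed": len(zlib.compress(text, 9)),
    }
```

Before compression the text form is noticeably larger than the raw bytes. After compression the two sizes should land close together, because the compressor removes the redundancy of the 58-character alphabet but cannot shrink the digests themselves.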

The 2% you “pay” for text buys you:

Read it in any editor. No hex dump, no special viewer.
Grep, awk, sort, diff. The entire Unix toolkit works on c4m files.
Human review. You can eyeball a manifest and spot problems.
Version control. Text diffs are meaningful. Binary diffs are not.
Email it. Plain text goes everywhere without encoding issues.

A binary format would save almost nothing and sacrifice all of this. The c4m format is not a compromise between efficiency and usability — it is the efficient choice.

Is the c4m format standardized?

The format is defined and governed by the open source project. It’s intentionally simple — one line per entry, fixed fields, plain text — so that any tool can read and write it without a parser library.

The full specification is in the c4 repository.

How does C4 relate to git?

C4 is complementary to git, not a replacement.

Git tracks source code changes with commit history, branches, and merges
C4 identifies file content by what it contains, regardless of history

Use git for code. Use C4 for large files, binary assets, deliverables, and anything where content identity matters more than edit history. c4git bridges the two — track large files by C4 ID in git without bloating the repository.

How does C4 relate to IPFS?

Both use content addressing, but for different purposes:

IPFS is a distributed file system — it moves and stores content across a network
C4 is an identification system — it names content and verifies it

C4 doesn’t move files or run a network. It creates identifiers and manifests that other tools (including IPFS) can use. A c4m file could reference content stored in IPFS, on local drives, on S3, or on magnetic tape. The identity is independent of storage.

Can I use C4 for sensitive data?

A C4 ID is a hash — it doesn’t reveal the content of a file, but it does confirm whether two files are identical. If you have a known file and a C4 ID, you can verify whether the ID matches that file.

For most use cases this is fine. If your threat model requires hiding even the identity of files (not just their content), you’ll need additional measures beyond content addressing.

What license is C4?

C4 tools are open source under MIT and Apache 2.0 licenses. The core c4 library, c4sh, and c4py are MIT-licensed. c4git and libc4 are licensed under Apache 2.0.