ML & Data Science
Machine learning depends on data integrity in ways that are easy to overlook. A single corrupted file in a training set can waste days of GPU time. A subtle change to a dataset can invalidate months of experiments. C4 makes these problems visible.
Snapshot your training data
Before starting a training run, lock down exactly what data you’re using:
c4 id ./training_data/ > training-data-v1.c4m
Weeks later, diff the manifest against the directory to see if anything changed:
c4 diff training-data-v1.c4m ./training_data/
If the output is empty, nothing changed: the directory still matches the manifest exactly. If files were modified, the diff output shows which entries differ.
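The snapshot-then-diff workflow can be sketched with nothing but the Python standard library. This is an illustration only: the `snapshot` and `diff` helpers below are hypothetical names, and plain SHA-512 hex digests stand in for real C4 IDs (which use their own encoding), but the shape of the check is the same.

```python
import hashlib
from pathlib import Path

def snapshot(root):
    """Map each file's path (relative to root) to a content digest.
    Illustration only: real C4 IDs are encoded differently."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha512(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def diff(manifest, root):
    """Return paths whose content no longer matches the saved manifest,
    including files that were added or removed since the snapshot."""
    current = snapshot(root)
    return sorted(
        path
        for path in manifest.keys() | current.keys()
        if manifest.get(path) != current.get(path)
    )
```

An empty list from `diff` plays the role of the empty `c4 diff` output: the directory still matches the snapshot byte for byte.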
Detect data leakage
Data leakage — when training data appears in your test set — is one of the most common ML mistakes. C4 IDs make it trivial to detect. Create manifests for both sets and look for matching IDs:
c4 id ./train/ > train.c4m
c4 id ./test/ > test.c4m
Any file with the same C4 ID in both manifests has identical content, regardless of filename. You can find overlaps with standard text tools:
# Extract C4 IDs from both manifests and find duplicates
awk '{print $NF}' train.c4m | sort > train_ids.txt
awk '{print $NF}' test.c4m | sort > test_ids.txt
comm -12 train_ids.txt test_ids.txt
Or do it directly in Python with c4py:
import c4py
train = c4py.scan("./train/")
test = c4py.scan("./test/")
train_ids = {entry.c4id for _, entry in train.flat_entries() if not entry.is_dir()}
test_ids = {entry.c4id for _, entry in test.flat_entries() if not entry.is_dir()}
leaked = train_ids & test_ids
if leaked:
    print(f"{len(leaked)} files appear in both train and test sets")
This catches leakage even when files have different names. Same bytes means same C4 ID.
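The rename-proof property can be demonstrated with a small stdlib sketch. The `content_id` helper below is hypothetical, and a plain SHA-512 digest stands in for a real C4 ID, but the behavior it shows is exactly the one the detection relies on: a copied file keeps the same ID no matter what it is renamed to.

```python
import hashlib
import tempfile
from pathlib import Path

def content_id(path):
    # Stand-in for a C4 ID: identical bytes always yield the identical digest.
    return hashlib.sha512(Path(path).read_bytes()).hexdigest()

root = Path(tempfile.mkdtemp())
train_dir = root / "train"
test_dir = root / "test"
train_dir.mkdir()
test_dir.mkdir()

# The same bytes land in both splits under completely different names.
train_file = train_dir / "img_0001.png"
leaked_copy = test_dir / "holdout_7.png"
train_file.write_bytes(b"\x89PNG fake pixel data")
leaked_copy.write_bytes(b"\x89PNG fake pixel data")

train_ids = {content_id(p) for p in train_dir.iterdir()}
test_ids = {content_id(p) for p in test_dir.iterdir()}
assert train_ids & test_ids  # leak detected despite the different filenames
```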
Track dataset versions
Datasets evolve. New annotations, additional images, cleaned labels. Keep a c4m file for each version:
c4 id ./dataset/ > dataset-v1.c4m
# ... time passes, dataset changes ...
c4 id ./dataset/ > dataset-v2.c4m
$ c4 diff dataset-v1.c4m dataset-v2.c4m
c41oRmNxBzJ4kLpQvW8sYdHf3gTe6UaCb5jX7nZwPq9iMrDuEvFy2hGtSxAl0KcOmBJN
-rw-r--r-- 2026-03-20T14:00:00Z 524288 labels/annotations.json c43nYt...
drwxr-xr-x 2026-03-20T14:00:00Z ... images/batch_042/ c48kPq...
drwxr-xr-x 2026-03-20T14:00:00Z ... images/batch_043/ c47mNq...
c45xZeXwKjQ2nMrBz7L1pYvWqRtN8sHfJd3g6CmAeU9kXoP4bG5hT0iVlDaSwFuO7yE
Now your experiment logs can reference dataset-v2.c4m instead of a vague “the March dataset.” Anyone can verify they’re using the same data.
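The classification a version diff performs (added, removed, modified) is easy to sketch once each manifest is parsed into a `{path: content_id}` dict. The `diff_versions` helper and the short IDs below are illustrative; parsing the actual c4m format is omitted.

```python
def diff_versions(old, new):
    """Classify changes between two {path: content_id} manifests."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return added, removed, modified

# Toy manifests with truncated, made-up IDs:
v1 = {"labels/annotations.json": "c43nYt", "images/batch_042/a.png": "c41aaa"}
v2 = {"labels/annotations.json": "c45xZw", "images/batch_042/a.png": "c41aaa",
      "images/batch_043/b.png": "c49bbb"}

added, removed, modified = diff_versions(v1, v2)
# added    -> ["images/batch_043/b.png"]
# removed  -> []
# modified -> ["labels/annotations.json"]
```

Unchanged files (same path, same ID) never appear in any of the three lists, which is why an unchanged dataset produces an empty diff.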
Verify data transfers
Moving datasets between machines, cloud storage, and clusters is error-prone. C4 makes verification straightforward:
# On the source machine
c4 id ./large_dataset/ > dataset.c4m
# After transfer to the GPU cluster, diff to verify
c4 diff dataset.c4m /data/large_dataset/
Empty output means a perfect match. This catches silent corruption, incomplete transfers, and filesystem encoding issues.
In Python:
import c4py
report = c4py.verify_tree("dataset.c4m", "/data/large_dataset/")
if report.is_ok:
    print("Transfer verified — all files match.")
else:
    print(f"{len(report.missing)} missing, {len(report.corrupt)} corrupt")
Reproducibility receipts
For every experiment, record:
c4 id ./code/ > experiment-042-code.c4m
c4 id ./data/ > experiment-042-data.c4m
c4 id ./config/ > experiment-042-config.c4m
These three c4m files are a cryptographic proof of exactly what code, data, and configuration produced your results. Small enough to commit to git. Verifiable forever.
Or generate them from your training script with c4py:
import c4py
for name in ["code", "data", "config"]:
    manifest = c4py.scan(f"./{name}/")
    c4py.dump(manifest, f"experiment-042-{name}.c4m")
    print(f"{name}: {manifest.summary()}")