c4py — Python Library

c4py is a pure Python implementation of C4 content identification. It scans directories, creates and diffs c4m manifests, verifies trees, and manages a content-addressed store, all without requiring a running daemon.

GitHub: Avalanche-io/c4py

Install

pip install c4py

Quick start

Identify a file

import c4py

c4id = c4py.identify_file("path/to/file.mov")
print(c4id)  # c45xZeXwKjQ2...

Scan a directory

import c4py

manifest = c4py.scan("/projects/experiment-042")
print(manifest.summary())
# 1,247 files, 38 directories, 14.2 GB total, 1,201 unique C4 IDs

Load and save c4m files

import c4py

manifest = c4py.scan("./footage/")
c4py.dump(manifest, "footage.c4m")

# Later, load it back
manifest = c4py.load("footage.c4m")
print(manifest.summary())

Diff two manifests

import c4py

old = c4py.load("dataset-v1.c4m")
new = c4py.load("dataset-v2.c4m")
result = c4py.diff(old, new)

print(f"Added: {len(result.added)}")
print(f"Removed: {len(result.removed)}")
print(f"Modified: {len(result.modified)}")
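
To make the three categories concrete, the sketch below diffs two plain dictionaries that map paths to IDs. It illustrates the idea only and is not how c4py implements diff; diff_sketch and the sample IDs are hypothetical.

```python
def diff_sketch(old: dict[str, str], new: dict[str, str]):
    """Compare two {path: id} mappings and return (added, removed, modified)."""
    # Paths present only in the new mapping.
    added = sorted(p for p in new if p not in old)
    # Paths present only in the old mapping.
    removed = sorted(p for p in old if p not in new)
    # Paths present in both whose IDs differ, meaning the content changed.
    modified = sorted(p for p in old if p in new and old[p] != new[p])
    return added, removed, modified


old = {"a.txt": "id1", "b.txt": "id2", "c.txt": "id3"}
new = {"a.txt": "id1", "b.txt": "id9", "d.txt": "id4"}

added, removed, modified = diff_sketch(old, new)
print(added, removed, modified)  # ['d.txt'] ['c.txt'] ['b.txt']
```

Comparing IDs rather than bytes means two large trees can be diffed without reading either of them again.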

Verify a directory against a manifest

import c4py

report = c4py.verify_tree("delivery.c4m", "/mnt/incoming/vendor_A/")

print(f"{len(report.ok)} files OK")
print(f"{len(report.missing)} missing")
print(f"{len(report.corrupt)} corrupt")

if report.is_ok:
    print("Delivery verified.")
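
The check behind a report like this can be sketched with the standard library alone. The version below assumes the expected state is a dictionary of relative paths to SHA-512 hex digests rather than a c4m manifest. verify_sketch is a hypothetical helper, not c4py's verify_tree.

```python
import hashlib
from pathlib import Path


def verify_sketch(expected: dict[str, str], root: str):
    """Sort each expected path into an ok, missing, or corrupt list."""
    ok, missing, corrupt = [], [], []
    for rel_path, want in sorted(expected.items()):
        target = Path(root) / rel_path
        if not target.is_file():
            missing.append(rel_path)
            continue
        # Re-hash the file on disk and compare with the recorded digest.
        got = hashlib.sha512(target.read_bytes()).hexdigest()
        (ok if got == want else corrupt).append(rel_path)
    return ok, missing, corrupt
```

This sketch reads each file into memory in one call, which is fine for small files. A production version would hash in chunks.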

Use cases

ML experiment tracking

import c4py

# Snapshot training data before each run
manifest = c4py.scan("./training_data/")
c4py.dump(manifest, "training-data-v1.c4m")

# Later, verify nothing has changed
report = c4py.verify_tree("training-data-v1.c4m", "./training_data/")
if not report.is_ok:
    print(f"{len(report.corrupt)} files changed, {len(report.missing)} missing")

Data pipeline verification

import c4py

# Record input state
input_manifest = c4py.scan("./input/")
c4py.dump(input_manifest, "pipeline-input.c4m")

# Run the pipeline (run_pipeline stands in for your own entry point)
run_pipeline()

# Record output state and diff against expected
output_manifest = c4py.scan("./output/")
expected = c4py.load("expected-output.c4m")
result = c4py.diff(expected, output_manifest)

if result.is_empty:
    print("Pipeline output matches expected.")
else:
    print(f"{len(result.added)} added, {len(result.removed)} removed, {len(result.modified)} modified")

Find duplicates

import c4py

manifest = c4py.scan("/projects/deliverables/")
dupes = manifest.duplicates()

for c4id, paths in dupes.items():
    print(f"{c4id}: {len(paths)} copies")
    for p in paths:
        print(f"  {p}")
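
Duplicate detection falls out of content addressing: any ID that appears under more than one path is a duplicate. A minimal sketch of that grouping, using a plain dictionary of paths to IDs (duplicates_sketch and the sample data are hypothetical, not c4py internals):

```python
from collections import defaultdict


def duplicates_sketch(ids_by_path: dict[str, str]) -> dict[str, list[str]]:
    """Group paths by ID and keep only IDs shared by two or more paths."""
    paths_by_id = defaultdict(list)
    for path, content_id in sorted(ids_by_path.items()):
        paths_by_id[content_id].append(path)
    return {cid: paths for cid, paths in paths_by_id.items() if len(paths) > 1}


sample = {
    "final/shot_010.exr": "idA",
    "review/shot_010_copy.exr": "idA",
    "final/shot_020.exr": "idB",
}
print(duplicates_sketch(sample))
```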

Content store

import c4py

store = c4py.open_store()

# Store content
with open("render.exr", "rb") as f:
    c4id = store.put(f)

# Retrieve content
content = store.get(c4id)
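
The idea behind a content-addressed store is that the key is derived from the content itself. The sketch below is a toy, directory-backed version that makes two assumptions: it keys on a SHA-512 hex digest instead of a C4 ID, and it takes bytes instead of a file object. StoreSketch is hypothetical and says nothing about how c4py lays out its store on disk.

```python
import hashlib
import tempfile
from pathlib import Path


class StoreSketch:
    """Toy content-addressed store: one file per unique content digest."""

    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> str:
        key = hashlib.sha512(data).hexdigest()
        path = self.root / key
        if not path.exists():  # identical content is written only once
            path.write_bytes(data)
        return key

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    store = StoreSketch(tmp)
    key = store.put(b"pixel data")
    same_key = store.put(b"pixel data")  # second put reuses the same entry
    stored_files = len(list(Path(tmp).iterdir()))
    round_trip = store.get(key)
```

Storing the same bytes twice yields the same key and a single file, which is what makes deduplication automatic.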

Interoperability

c4py produces the same C4 IDs and c4m files as the c4 CLI, c4sh, and every other tool in the ecosystem. A manifest created by the CLI can be loaded by c4py, and vice versa. The c4m format is the common language: all tools read and write it.