chunkrs

Deterministic, streaming Content-Defined Chunking (CDC) for Rust

chunkrs is a high-performance, portable infrastructure library for FastCDC chunking and cryptographic hashing.

Bytes in → Chunks & hashes out.

Zero-copy streaming. Async-agnostic. Excellent for any chunking and hashing use case.

Features

Streaming API: push()/finish() pattern for processing data in any batch size
Deterministic-by-design: Identical bytes produce identical chunk boundaries and hashes, regardless of batching or execution timing
Zero-copy: Efficient Bytes slicing from input with minimal allocations
FastCDC algorithm: Byte-by-byte Gear rolling hash with configurable min/avg/max chunk sizes
BLAKE3 identity: Cryptographic chunk hashing (optional, feature-gated)
Strictly safe: #![forbid(unsafe_code)] - zero unsafe code throughout
Minimal API: Only 6 public types accessible from crate root - Chunker, Chunk, ChunkHash, ChunkConfig, HashConfig, ChunkError
Well-tested: Comprehensive unit tests, integration tests, and fuzzing

API Changes from v0.8 to v0.9

Breaking Change: v0.9 simplifies the API by removing I/O-specific functionality and focusing on pure streaming CDC.

What Changed

v0.8 API	v0.9 API
`Chunker::chunk_file()`	Removed - use `Chunker::push()` with your file reader
`Chunker::chunk_bytes()`	Removed - use `Chunker::push()` directly
`Chunker::chunk_async()`	Removed - async support is an application-layer concern
`chunker.push(bytes)`	✅ Kept - core streaming API
`chunker.finish()`	✅ Kept - finalize stream

Benefits of the New Design

Simpler: One API (push()) for all data sources
Flexible: Works with any byte source (files, network, memory)
Composable: Easily integrates with existing I/O code
Explicit: I/O strategy is controlled by your application
Smaller: Reduced dependency footprint (no tokio requirement)

Features Removed

The following features were intentionally removed to simplify the crate:

❌ File I/O helpers (read files yourself)
❌ Async streaming adapters (use your async runtime)
❌ Thread-local buffer pools (caller manages memory)
❌ Iterator-based APIs (use push()/finish() loop)

Architecture

chunkrs processes one logical byte stream at a time with byte-by-byte serial CDC:

┌───────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Input Bytes   │     │ Push-based       │     │ Serial CDC State │
│ (any source)  │────▶│ Streaming API    │────▶│ (FastCDC rolling │
│               │     │ push()/finish()  │     │  hash, byte-by-  │
└───────────────┘     └──────────────────┘     │  byte)           │
                                               └──────────────────┘
    ┌─────────────┐      ┌─────────────────────────────┐
    │             │      │ Chunk {                     │
──▶ │ Chunk       │────▶ │   data: Bytes,              │
    │ Stream      │      │   offset: Option<u64>,      │
    │             │      │   hash: Option<ChunkHash>   │
    └─────────────┘      │ }                           │
                         └─────────────────────────────┘

When to Use chunkrs

Scenario	Recommendation
Delta sync (rsync-style)	✅ Perfect fit
Backup tools	✅ Ideal for single-stream chunking
Deduplication (CAS)	✅ Use with your own index
NVMe Gen4/5 saturation	✅ 3–5 GB/s per core (target); run one Chunker per stream to scale further
Distributed dedup	✅ Stateless, easy to distribute
Any other CDC use case	✅ Likely fits
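
For the deduplication row above, "your own index" can start out as a hash map keyed by chunk hash. The following is a minimal sketch, assuming the 32-byte hash is available to the caller as a `[u8; 32]` (how `ChunkHash` exposes its bytes is not covered here, so that conversion is an assumption). `ChunkStore`, `insert`, and `dedup_ratio` are hypothetical names for illustration and are not part of chunkrs.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Minimal in-memory content-addressed store keyed by a 32-byte chunk hash.
/// Hypothetical helper for illustration; not part of chunkrs.
#[derive(Default)]
struct ChunkStore {
    chunks: HashMap<[u8; 32], Vec<u8>>,
    logical_bytes: u64, // bytes seen, duplicates included
    stored_bytes: u64,  // bytes actually kept
}

impl ChunkStore {
    /// Record a chunk. Returns true if the chunk was new and had to be stored.
    fn insert(&mut self, hash: [u8; 32], data: &[u8]) -> bool {
        self.logical_bytes += data.len() as u64;
        match self.chunks.entry(hash) {
            Entry::Occupied(_) => false,
            Entry::Vacant(slot) => {
                self.stored_bytes += data.len() as u64;
                slot.insert(data.to_vec());
                true
            }
        }
    }

    /// Logical bytes divided by stored bytes (1.0 means no duplicates yet).
    fn dedup_ratio(&self) -> f64 {
        if self.stored_bytes == 0 {
            1.0
        } else {
            self.logical_bytes as f64 / self.stored_bytes as f64
        }
    }
}
```

A real index would usually persist chunks and keep only hashes in memory; the map of owned buffers here just keeps the sketch short.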

Quick Start

[dependencies]
chunkrs = "0.9"

use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};

fn main() {
    let mut chunker = Chunker::new(ChunkConfig::default());
    let mut pending = Bytes::new();

    // Feed data in any batch size (streaming).
    // Iterate by value so each `Bytes` is passed to push() directly.
    for input in [Bytes::from(&b"first part"[..]),
                  Bytes::from(&b"second part"[..])] {
        let (chunks, leftover) = chunker.push(input);
        // Process complete chunks...
        for chunk in chunks {
            println!("offset: {:?}, len: {}, hash: {:?}",
                chunk.offset, chunk.len(), chunk.hash);
        }
        pending = leftover;
    }

    // Finalize stream
    if let Some(final_chunk) = chunker.finish() {
        println!("Final chunk: offset: {:?}, len: {}, hash: {:?}",
            final_chunk.offset, final_chunk.len(), final_chunk.hash);
    }
}

What's in a Chunk:

Each Chunk contains:

data: Bytes — the actual chunk payload (zero-copy reference when possible)
offset: Option<u64> — byte position in the original stream
hash: Option<ChunkHash> — BLAKE3 hash for content identity (if enabled)

API Overview

Flat API Design

chunkrs uses a flat API design for simplicity and clarity. All types are accessible directly from the crate root:

use chunkrs::{Chunker, Chunk, ChunkHash, ChunkConfig, HashConfig, ChunkError};

No duplicate paths like chunkrs::chunk::Chunk - only chunkrs::Chunk.

Core Types

Type	Description
`Chunker`	Stateful CDC engine with streaming push()/finish() API
`Chunk`	Content-addressed block with `Bytes` payload and optional BLAKE3 hash
`ChunkHash`	32-byte BLAKE3 hash identifying chunk content
`ChunkConfig`	Min/avg/max chunk sizes and hash configuration
`HashConfig`	Hash algorithm configuration (BLAKE3)
`ChunkError`	Error enum for chunking operations (InvalidConfig)

Streaming API

The Chunker provides a streaming API:

use chunkrs::{Chunker, ChunkConfig};
use bytes::Bytes;

let mut chunker = Chunker::new(ChunkConfig::default());
let mut pending = Bytes::new();

// Feed data in any size (1 byte to megabytes)
let (chunks, leftover) = chunker.push(Bytes::from(&b"data"[..]));

// Process complete chunks immediately
for chunk in chunks {
    // chunk.data: Bytes - the chunk payload
    // chunk.offset: Option<u64> - position in original stream
    // chunk.hash: Option<ChunkHash> - BLAKE3 hash (if enabled)
}

// Feed leftover back in next push
pending = leftover;

// When stream ends, get final chunk
if let Some(final_chunk) = chunker.finish() {
    // Process final chunk
}

Determinism

The same input produces identical chunks regardless of how data is fed:

use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};

let data: Vec<u8> = vec![0u8; 10000];

// All at once
let mut chunker1 = Chunker::new(ChunkConfig::default());
let (chunks1, _) = chunker1.push(Bytes::from(data.clone()));
let final1 = chunker1.finish();

// In 100-byte batches
let mut chunker2 = Chunker::new(ChunkConfig::default());
let mut all_chunks2 = Vec::new();
for batch in data.chunks(100) {
    // Bytes::from only accepts 'static slices, so copy each borrowed batch
    let (chunks, _) = chunker2.push(Bytes::copy_from_slice(batch));
    all_chunks2.extend(chunks);
}
let final2 = chunker2.finish();

// Same number of chunks either way
assert_eq!(chunks1.len() + final1.is_some() as usize,
           all_chunks2.len() + final2.is_some() as usize);

Configuration

Chunk Sizes

Choose based on your deduplication granularity needs:

use chunkrs::ChunkConfig;

// Small files / high dedup (8 KiB average)
let small = ChunkConfig::new(2 * 1024, 8 * 1024, 32 * 1024)?;

// Default (16 KiB average) - good general purpose
let default = ChunkConfig::default();

// Large files / high throughput (256 KiB average)  
let large = ChunkConfig::new(64 * 1024, 256 * 1024, 1024 * 1024)?;
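
To make the granularity trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes chunks land close to the configured average (real distributions vary with the data) and counts only the 32 bytes per chunk hash an index would need. Both function names are hypothetical helpers, not chunkrs APIs.

```rust
/// Rough expected number of chunks for `total_bytes` of input,
/// assuming chunks average `avg_chunk` bytes (rounded up).
fn estimated_chunks(total_bytes: u64, avg_chunk: u64) -> u64 {
    (total_bytes + avg_chunk - 1) / avg_chunk
}

/// Bytes needed to hold one 32-byte hash per estimated chunk.
fn estimated_hash_bytes(total_bytes: u64, avg_chunk: u64) -> u64 {
    estimated_chunks(total_bytes, avg_chunk) * 32
}
```

Under these assumptions, 1 GiB of input yields about 131,072 chunks (4 MiB of hashes) at an 8 KiB average, but only about 4,096 chunks (128 KiB of hashes) at a 256 KiB average. Smaller chunks find more duplicates at the cost of a larger index.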

Hash Configuration

use chunkrs::{ChunkConfig, HashConfig};

// With BLAKE3 (default)
let with_hash = ChunkConfig::default();

// Boundary detection only (faster, no content identity)
let no_hash = ChunkConfig::default().with_hash_config(HashConfig::disabled());

Performance

Throughput targets on modern hardware:

Source	Single-core throughput target	Bottleneck
NVMe Gen4	~3–5 GB/s	CPU (hashing)
NVMe Gen5	~3–5 GB/s	CDC algorithm
SATA SSD	~500 MB/s	Storage
10 Gbps LAN	~1.2 GB/s	Network
HDD	~200 MB/s	Seek latency

Memory usage:

Per stream: O(pending_bytes) - typically minimal as pending is flushed on boundaries
Zero-copy: Chunk data references input Bytes without copying
Caller controls memory management (buffer pools, reuse, etc.)

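To saturate NVMe Gen5: Process multiple files concurrently by running one Chunker instance per file. Do not parallelize within a single file: cutting a file at arbitrary offsets produces boundaries that depend on the split points rather than on content, which degrades deduplication ratios.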
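
One way to run one chunking pass per stream is scoped threads from the standard library. This is a sketch only: `chunk_lengths` is a deliberately trivial stand-in (it cuts after each newline or at `max` bytes) that marks where a real per-stream chunker would go, and both function names are hypothetical.

```rust
use std::thread;

/// Stand-in for a per-stream chunking pass (hypothetical): cuts after every
/// b'\n' or once `max` bytes have accumulated. Replace with a real chunker.
fn chunk_lengths(stream: &[u8], max: usize) -> Vec<usize> {
    let mut lengths = Vec::new();
    let mut len = 0;
    for &byte in stream {
        len += 1;
        if byte == b'\n' || len == max {
            lengths.push(len);
            len = 0;
        }
    }
    if len > 0 {
        lengths.push(len);
    }
    lengths
}

/// Chunk several independent streams concurrently, one worker per stream.
/// Each individual stream is still processed serially from start to end.
fn chunk_streams_parallel(streams: &[Vec<u8>], max: usize) -> Vec<Vec<usize>> {
    thread::scope(|scope| {
        let handles: Vec<_> = streams
            .iter()
            .map(|stream| scope.spawn(move || chunk_lengths(stream, max)))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .collect()
    })
}
```

Spawning one thread per stream is fine for a handful of files; for many files a bounded worker pool would be the more usual choice.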

Determinism Guarantees

chunkrs guarantees exact determinism:

Boundary determinism: Identical byte streams produce identical chunk boundaries at identical byte positions
Hash determinism: Identical byte streams produce identical ChunkHash (BLAKE3) values
Batch independence: Results are identical regardless of input batch sizes (1 byte vs 1 MB vs streaming)
Serial consistency: Rolling hash state is strictly maintained across all push() calls

What this means: You can re-chunk a file on Tuesday with different batch sizes and get bit-identical chunks to Monday's run. This is essential for delta sync correctness.
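
The batch-independence property follows from keeping the rolling-hash state inside the chunker between calls. The toy below illustrates that idea with the standard library only. `ToyChunker` is a hypothetical stand-in, not chunkrs's implementation: it copies bytes instead of slicing `Bytes`, uses a single mask, and its hash mix is an arbitrary choice.

```rust
/// Toy streaming chunker used only to illustrate batch independence.
struct ToyChunker {
    hash: u32,        // rolling hash state, carried across push() calls
    current: Vec<u8>, // bytes of the chunk currently being built
    min: usize,
    max: usize,
    mask: u32,
}

impl ToyChunker {
    fn new(min: usize, max: usize, mask: u32) -> Self {
        Self { hash: 0, current: Vec::new(), min, max, mask }
    }

    /// Feed a batch of any size; returns the chunks completed by this batch.
    fn push(&mut self, batch: &[u8]) -> Vec<Vec<u8>> {
        let mut done = Vec::new();
        for &byte in batch {
            self.current.push(byte);
            // Shift-and-add update; the multiplier is an arbitrary byte mix.
            self.hash = (self.hash << 1)
                .wrapping_add((byte as u32 + 1).wrapping_mul(0x9E37_79B1));
            let len = self.current.len();
            if (len >= self.min && self.hash & self.mask == 0) || len >= self.max {
                done.push(std::mem::take(&mut self.current));
                self.hash = 0;
            }
        }
        done
    }

    /// Flush whatever is left as the final chunk.
    fn finish(&mut self) -> Option<Vec<u8>> {
        self.hash = 0;
        if self.current.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.current))
        }
    }
}

/// Chunk `data` by feeding it in batches of `batch_size` bytes (must be > 0).
fn chunk_in_batches(data: &[u8], batch_size: usize) -> Vec<Vec<u8>> {
    let mut chunker = ToyChunker::new(64, 1024, 0x00FF_0000);
    let mut out = Vec::new();
    for batch in data.chunks(batch_size) {
        out.extend(chunker.push(batch));
    }
    out.extend(chunker.finish());
    out
}
```

Because every decision depends only on the bytes seen so far and the state held in the struct, feeding the same data one byte at a time, 100 bytes at a time, or all at once yields the same chunk list.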

Safety & Correctness

No unsafe code: #![forbid(unsafe_code)]
Comprehensive testing: Unit tests, doc tests, and property-based tests ensure:
- Determinism invariants
- Batch equivalence (chunking the whole input at once vs. feeding it in batches yields the same results)
- No panics on edge cases (empty files, single byte, max-size boundaries)

Algorithm

Boundary Detection: FastCDC (Gear rolling hash)

Byte-by-byte Gear rolling hash (shift-and-add with a 256-entry lookup table)
Normalized chunking with two masks (a stricter mask before the average size, a looser one after)
Configurable min/avg/max constraints
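
The boundary-detection steps above can be sketched in a few lines. This is an illustrative reading of FastCDC-style cut-point search, not chunkrs's actual code: the Gear table seed, the mask layout (contiguous bits shifted up by 16), and the function names are all assumptions, and `avg` is assumed to be a power of two of at least 4 with `min < avg < max`.

```rust
/// Build a 256-entry Gear table from a fixed seed using SplitMix64.
/// The seed and generator are illustrative, not the table chunkrs uses.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *entry = z ^ (z >> 31);
    }
    table
}

/// Length of the first chunk in `data`.
fn next_cut(data: &[u8], gear: &[u64; 256], min: usize, avg: usize, max: usize) -> usize {
    if data.len() <= min {
        return data.len(); // short tail: emit as-is
    }
    let bits = avg.trailing_zeros();
    // More mask bits before `avg` (cuts are rarer), fewer after (cuts are likelier).
    let mask_small: u64 = ((1u64 << (bits + 1)) - 1) << 16;
    let mask_large: u64 = ((1u64 << bits.saturating_sub(1)) - 1) << 16;
    let end = data.len().min(max);
    let mut hash: u64 = 0;
    // The first `min` bytes are skipped entirely, so no chunk is shorter than `min`.
    for (i, &byte) in data[..end].iter().enumerate().skip(min) {
        hash = (hash << 1).wrapping_add(gear[byte as usize]);
        let mask = if i < avg { mask_small } else { mask_large };
        if hash & mask == 0 {
            return i + 1;
        }
    }
    end // no boundary found: cut at `max` or at the end of the data
}

/// Lengths of all chunks in `data`, in order.
fn chunk_lengths(data: &[u8], min: usize, avg: usize, max: usize) -> Vec<usize> {
    let gear = gear_table();
    let mut lengths = Vec::new();
    let mut rest = data;
    while !rest.is_empty() {
        let cut = next_cut(rest, &gear, min, avg, max);
        lengths.push(cut);
        rest = &rest[cut..];
    }
    lengths
}
```

Switching masks at the average size pulls the chunk-size distribution toward `avg`, which is the point of the normalization step.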

Chunk Identity: BLAKE3 (when enabled)

Incremental hashing for streaming
32-byte cryptographic digests
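
Incremental hashing means the digest can be built up piece by piece and still match a one-shot hash of the same bytes. The sketch below shows that property with a hand-written FNV-1a hasher. FNV-1a is a non-cryptographic stand-in chosen only because it fits in a few lines of standard-library Rust; it is not BLAKE3 and must not be used for content identity.

```rust
/// 64-bit FNV-1a with an update/finalize shape.
/// Stand-in for illustration only; not cryptographic and not BLAKE3.
struct Fnv1a(u64);

impl Fnv1a {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;

    fn new() -> Self {
        Fnv1a(Self::OFFSET_BASIS)
    }

    /// Absorb more bytes; may be called any number of times.
    fn update(&mut self, bytes: &[u8]) {
        for &byte in bytes {
            self.0 ^= byte as u64;
            self.0 = self.0.wrapping_mul(Self::PRIME);
        }
    }

    /// Digest of everything absorbed so far.
    fn finalize(&self) -> u64 {
        self.0
    }
}
```

Feeding `b"content-"`, `b"defined "`, and `b"chunking"` in three `update` calls gives the same digest as a single call with the concatenated bytes, which is what lets a streaming chunker hash a chunk as its bytes arrive.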

Cargo Features

Feature	Description	Default
`hash-blake3`	BLAKE3 chunk hashing	✅

# Default: sync + hashing
[dependencies]
chunkrs = "0.9"

# Minimal: sync only, no hashing
[dependencies]
chunkrs = { version = "0.9", default-features = false }

Roadmap

Current: 0.9.0 — Core API stable, comprehensive feature set, seeking production feedback.

Implemented ✅

Core Functionality:

FastCDC rolling hash, push/finish streaming API, zero-copy, BLAKE3 hashing, deterministic chunking

Quality & Safety:

Comprehensive unit tests + doctests, fuzzing, no unsafe
Documentation and examples
Benchmarks

Planned Enhancements

1.0.0 — Stable Release:

Alternative hash algorithms (xxHash for speed, SHA-256 for compatibility)
Formal SemVer commitment with MSRV policy
Comprehensive integration guide and production deployment patterns

Post-1.0 — Additive Features Only:

SIMD optimizations (AVX2/AVX-512) for rolling hash
Hardware-accelerated hashing (BLAKE3 SIMD, SHA-NI)
Advanced CDC algorithm variants (e.g., pattern-aware chunking)
no_std support for embedded environments

Migration Guide

v0.8 - File chunking:

// v0.8 - no longer available
let chunks = chunker.chunk_file("path/to/file")?;

v0.9 - File chunking:

// v0.9 - read file yourself, feed to chunker
use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};
use std::fs::File;
use std::io::Read;

let mut file = File::open("path/to/file")?;
let mut buffer = vec![0u8; 8192];
let mut chunker = Chunker::new(ChunkConfig::default());

loop {
    let n = file.read(&mut buffer)?;
    if n == 0 { break; }
    let (chunks, leftover) = chunker.push(Bytes::copy_from_slice(&buffer[..n]));
    // process chunks...
}
if let Some(final_chunk) = chunker.finish() {
    // process final chunk...
}
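
If the read loop shows up in several places, it can be factored into a small helper. The sketch below is a hypothetical utility, not something chunkrs provides: `feed_reader` reads any `std::io::Read` source in fixed-size batches and hands each batch to a caller-supplied closure, which is where a call to `chunker.push(...)` would go.

```rust
use std::io::{self, Read};

/// Read `reader` to the end in batches of up to `buf_size` bytes, passing
/// each batch to `sink`. Returns the total number of bytes read.
/// Hypothetical helper for illustration; not part of chunkrs.
fn feed_reader<R: Read>(
    mut reader: R,
    buf_size: usize,
    mut sink: impl FnMut(&[u8]),
) -> io::Result<u64> {
    let mut buffer = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        let n = match reader.read(&mut buffer) {
            Ok(0) => break, // end of stream
            Ok(n) => n,
            // Retry reads interrupted by a signal instead of failing.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        sink(&buffer[..n]);
        total += n as u64;
    }
    Ok(total)
}
```

The closure keeps the helper independent of any particular chunker type; the caller still decides what to do with completed chunks and when to call `finish()`.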

v0.8 - Async file chunking:

// v0.8 - no longer available
let chunks = chunker.chunk_async(reader).await?;

v0.9 - Async file chunking:

// v0.9 - use your async runtime with standard Chunker
use bytes::Bytes;
use chunkrs::{ChunkConfig, Chunker};
use tokio::io::AsyncReadExt;

let mut reader = reader;
let mut chunker = Chunker::new(ChunkConfig::default());
let mut buffer = vec![0u8; 8192];

loop {
    let n = reader.read(&mut buffer).await?;
    if n == 0 { break; }
    let (chunks, leftover) = chunker.push(Bytes::copy_from_slice(&buffer[..n]));
    // process chunks...
}
if let Some(final_chunk) = chunker.finish() {
    // process final chunk...
}

Non-Goals

These features are intentionally out of scope:

Networking: Handle in application layer
Encryption: Pre-encrypt or post-encrypt at application layer
Compression: Apply compression before or after chunking
Deduplication indexing: Use companion crates (CAS index implementations)
Distributed coordination: Manage at application level

Feedback & Contributions

We're actively seeking feedback on:

Real-world deployment patterns and performance characteristics
Edge cases and failure modes in production
Integration patterns with storage systems and databases
Feature requests that align with CDC use cases

Open an issue or start a discussion on GitHub. Pull requests are welcome.

Refer to ARCHITECTURE.md for design and implementation details.
See CHANGELOG.md for version history.

Acknowledgments

This crate implements the FastCDC algorithm described in:

Wen Xia, Yukun Zhou, Hong Jiang, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang, Qing Liu,
"FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data Deduplication",
in Proceedings of USENIX Annual Technical Conference (USENIX ATC'16), Denver, CO, USA, June 22–24, 2016, pp. 101–114.

Wen Xia, Xiangyu Zou, Yukun Zhou, Hong Jiang, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang,
"The Design of Fast Content-Defined Chunking for Data Deduplication based Storage Systems",
IEEE Transactions on Parallel and Distributed Systems (TPDS), 2020.

This crate is inspired by the original fastcdc crate but focuses on a modernized API with streaming-first design, strict determinism, and allocation-conscious internals.

License

MIT License — see LICENSE
