This example demonstrates how to use Rust for high-performance text processing within a ZenML pipeline, using PyO3 to expose Rust functions to Python.
The pipeline processes financial documents (earnings call transcripts) to prepare them for a RAG system:
- Load documents — Read `.txt` files from a directory
- Process with Rust — Clean text, chunk with sentence-boundary awareness, extract metadata
- Save results — Output processed chunks as JSON
The text processing is implemented in Rust for:
- Unicode normalization and text cleanup
- Sentence-aware chunking with configurable overlap
- Metadata extraction (dates, monetary amounts, percentages, ticker symbols)
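To make "sentence-aware chunking with configurable overlap" concrete, here is a rough pure-Python sketch of the idea. It is only an illustration of the technique, not the Rust implementation in `src/lib.rs`, which differs in detail:

```python
import re


def chunk_sentences(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Greedy sentence-aware chunking with character overlap (illustrative only)."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Close the current chunk before it would exceed the target size.
        if current and len(current) + len(sentence) + 1 > size:
            chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else ""
        current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks
```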
Prerequisites:

- Rust — Install via rustup.rs
- Python 3.9+
- uv — Install via docs.astral.sh/uv
```bash
# Install and build
make install

# Run the pipeline
make run
```

Or step by step:

```bash
uv sync --extra dev       # Install Python + dev dependencies (includes maturin)
uv run maturin develop    # Build Rust → Python module
uv run python run.py      # Run pipeline
```

Project layout:

```
├── src/lib.rs                  # Rust text processing functions
├── Cargo.toml                  # Rust dependencies
├── pyproject.toml              # Python/maturin config
├── rag_preprocessing/
│   ├── steps.py                # ZenML steps (thin wrappers)
│   └── pipeline.py             # Pipeline definition
├── data/sample_transcripts/    # Sample financial documents
└── run.py                      # Entry point
```
PyO3 lets you expose Rust functions to Python with a pair of attribute macros:

```rust
use pyo3::prelude::*;

#[pyfunction]
fn clean_text(text: &str) -> String {
    // Rust text processing logic goes here (trivial placeholder shown for brevity)
    text.trim().to_string()
}

#[pymodule]
fn rag_rust_core(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(clean_text, m)?)?;
    Ok(())
}
```

Running `maturin develop` compiles this into a Python module you can import normally.
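After the build, you can sanity-check the module from a regular Python session. The exact output depends on the Rust implementation, so treat this as a quick import check:

```python
import rag_rust_core

# The compiled extension imports and calls like any other Python module.
print(rag_rust_core.clean_text("  “Smart quotes”   and   stray   whitespace  "))
```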
ZenML steps just import and call the Rust module:
```python
from zenml import step

import rag_rust_core  # This is our compiled Rust extension


@step
def process_documents(documents: list[dict]) -> list[dict]:
    all_chunks = []
    for doc in documents:
        # Call the Rust function like any Python function
        chunks = rag_rust_core.process_document(doc["content"])
        all_chunks.extend(chunks)
    return all_chunks
```

ZenML handles orchestration, caching, artifact tracking, and observability — the Rust code is invisible to it.
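The pipeline definition in `rag_preprocessing/pipeline.py` then just wires such steps together. A minimal sketch, assuming hypothetical step names `load_documents` and `save_chunks` alongside the `process_documents` step shown above (check `steps.py` for the real names):

```python
from zenml import pipeline

# Hypothetical imports: the actual step names live in rag_preprocessing/steps.py.
from rag_preprocessing.steps import load_documents, process_documents, save_chunks


@pipeline
def rag_preprocessing_pipeline(data_dir: str = "data/sample_transcripts"):
    documents = load_documents(data_dir=data_dir)
    chunks = process_documents(documents=documents)
    save_chunks(chunks=chunks)
```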
| Function | Description |
|---|---|
| `clean_text(text)` | Normalize unicode, collapse whitespace, standardize quotes/dashes |
| `chunk_text(text, size, overlap)` | Split into chunks respecting sentence boundaries |
| `extract_metadata(text)` | Extract dates, amounts, percentages, tickers |
| `process_document(text, size, overlap)` | All-in-one: clean → chunk → extract |
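A usage sketch of these functions from Python. The argument values are arbitrary, and the exact return types (plain strings vs. richer dicts) are assumptions, so treat `src/lib.rs` as authoritative:

```python
import rag_rust_core

raw = "Q3 revenue was $4.2 billion, up 12% year-over-year. AAPL guided higher for Q4 2024."

cleaned = rag_rust_core.clean_text(raw)
chunks = rag_rust_core.chunk_text(cleaned, 500, 50)   # size, overlap
metadata = rag_rust_core.extract_metadata(cleaned)    # dates, amounts, percentages, tickers

# Or do everything in one call, as the ZenML step does:
processed = rag_rust_core.process_document(raw, 500, 50)
```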
```bash
uv run python run.py --help

# Custom chunk size
uv run python run.py --chunk-size 1000 --chunk-overlap 100

# Different data directory
uv run python run.py --data-dir /path/to/documents
```

Run the Rust tests with:

```bash
cargo test
```

This demo is designed for the default local ZenML stack. If you have a remote stack configured (e.g., with an S3 artifact store) and encounter errors, switch back to the local stack with `zenml stack set default`.
To run this pipeline on Kubernetes, Vertex AI, or other remote orchestrators, you need the compiled Rust extension available in your step's Docker image. We provide an example multi-stage Dockerfile that handles this:
```bash
# Build the example cloud image
docker build -f Dockerfile.cloud -t rust-rag-preprocessing:latest .

# Test it locally
docker run --rm rust-rag-preprocessing:latest
```

The `Dockerfile.cloud` uses uv for fast, reproducible builds. It installs Rust in a builder stage, compiles the maturin wheel, then creates a slim runtime image with ZenML and the sample data baked in.
To use this image with ZenML's remote orchestrators:
```python
from zenml import pipeline
from zenml.config import DockerSettings

docker_settings = DockerSettings(
    parent_image="your-registry/rust-rag-preprocessing:latest",
    skip_build=True,  # Image already has everything needed
)


@pipeline(settings={"docker": docker_settings})
def my_pipeline():
    ...
```

Push the image to your container registry (ECR, GCR, GHCR, etc.) and reference it in your pipeline. See ZenML's containerization docs for more details on `DockerSettings` options.
For Rust developers who want MLOps tooling:
- Write idiomatic Rust with normal tooling (`cargo`, tests, lints)
- PyO3 handles Python↔Rust type conversion
- ZenML provides orchestration, caching, lineage tracking, dashboard
- Your Rust code runs unmodified — no special adaptation needed