Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR. Designed for RAG pipelines, batch workloads, and production deployments.

kreuzberg-dev/.github

🌉 Kreuzberg

The fastest document intelligence engine for RAG developers: open source and cloud

Kreuzberg is a polyglot document intelligence framework built around a high-performance Rust core. It helps developers extract text, structure, metadata, and embeddings from 56+ document formats at native speed, without requiring GPUs.

Kreuzberg is and will remain MIT-licensed and open-source. We're currently building a hosted cloud service around it to make document processing reliable, scalable, and easy to integrate into modern pipelines.

What is Kreuzberg

1. Kreuzberg (Open Source, MIT Licensed)

A high-performance, extensible document intelligence engine.

  • Rust core with streaming parsers and full parallelism
  • Native bindings for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, and TypeScript (Node/Bun/Wasm/Deno)
  • 56+ supported formats including PDF, Office, images, HTML, XML, email, archives, and scientific formats
  • OCR with table extraction (Tesseract, EasyOCR, PaddleOCR, extensible via plugins)
  • Built-in semantic chunking and optional embeddings for RAG pipelines
  • CLI, REST API, Docker images, and MCP server

Read more: https://kreuzberg.dev/
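
To give a feel for what chunking for a RAG pipeline involves, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is not Kreuzberg's API and not its semantic chunker; `chunk_text`, `max_chars`, and `overlap` are hypothetical names chosen for illustration only.

```python
def chunk_text(text: str, max_chars: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears whole in at least one chunk.
    Illustrative only: a semantic chunker would split on sentence or
    section boundaries instead of raw character offsets.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")

    chunks: list[str] = []
    step = max_chars - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(500))
    for i, chunk in enumerate(chunk_text(sample)):
        print(i, len(chunk))
```

The overlap trades a little redundancy for better retrieval recall; real pipelines usually tune both values against the embedding model's context size.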

2. Kreuzberg Cloud (Coming Soon)

A fully managed document intelligence API powered by the same engine.

Planned features include:

  • Hosted REST API
  • Async jobs and webhooks
  • Built-in chunking for RAG pipelines
  • Premium OCR backends
  • Usage dashboards and analytics
  • Simple pay-as-you-go pricing

3. html-to-markdown

A high-performance HTML → Markdown converter powered by Rust. Available as a Rust crate, Python package, PHP extension, Ruby gem, Elixir Rustler NIF, Node.js bindings, WebAssembly, and a standalone CLI, with identical rendering behavior across platforms.

Why Choose Kreuzberg

  • Truly polyglot: same engine across languages
  • High throughput: optimized for batch workloads and multi-GB documents
  • Memory efficient: streaming architecture keeps memory usage predictable
  • Flexible deployment: use via CLI, REST API, MCP server and more
  • MIT licensed: safe for enterprise, commercial, and closed-source use
  • Built for RAG: native chunking, embeddings, and customization
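
The memory-efficiency point rests on a general idea: a streaming reader holds only one block of the input at a time, so peak memory depends on the block size rather than the document size. The sketch below shows that idea in plain Python; it is not Kreuzberg's Rust implementation, and `iter_blocks` and `count_lines` are hypothetical helpers.

```python
import io
from typing import BinaryIO, Iterator


def iter_blocks(stream: BinaryIO, block_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive blocks of at most block_size bytes from stream."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    while True:
        block = stream.read(block_size)
        if not block:  # empty bytes means end of stream
            return
        yield block


def count_lines(stream: BinaryIO, block_size: int = 64 * 1024) -> int:
    """Count newline bytes without loading the whole stream into memory."""
    return sum(block.count(b"\n") for block in iter_blocks(stream, block_size))


if __name__ == "__main__":
    # An in-memory stream stands in for a large file opened with open(path, "rb").
    data = io.BytesIO(b"line\n" * 1000)
    print(count_lines(data, block_size=4096))
```

The same pattern applies to any per-block operation (hashing, parsing, scanning) as long as the operation can carry its state from one block to the next.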

Community

Join our dev community to ask questions, share feedback, and show what you're building.

Discord: https://discord.gg/xzx4KkAPED
Subreddit: https://www.reddit.com/r/kreuzberg_dev/
LinkedIn: https://www.linkedin.com/company/kreuzberg-dev/
X/Twitter: https://x.com/kreuzberg_dev

Contributing

Contributions are welcome.

  1. Open an issue to propose a change
  2. Submit a pull request
  3. Maintainers review and merge

See CONTRIBUTING.md in the relevant repository for details.
Kreuzberg repository: https://github.com/kreuzberg-dev/kreuzberg

License

All open-source code is MIT licensed. It's permissive, enterprise-safe, and commercial-friendly. That means you can use Kreuzberg freely in both commercial and closed-source products with no copyleft ("viral") effects; the only condition is that the copyright and license notice is kept with copies of the software.

Maintainers

Built with love in Kreuzberg, Berlin.
