🏗️ Doxstrux

Document structure extraction tool for Markdown, designed to be extended to PDF and HTML.

Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.

✨ Features

Zero-regex parsing: Token-based extraction using markdown-it-py
Security-first design: Three security profiles (strict/moderate/permissive)
Document IR: Clean intermediate representation for RAG chunking
Structure extraction: Headings, lists, tables, code blocks, links, images
Content integrity: Parse without mutation, fail-closed security
Extensible architecture: Designed so PDF and HTML support can be added (see Roadmap)

📦 Installation

pip install doxstrux

🚀 Quick Start

from doxstrux.markdown_parser_core import MarkdownParserCore

# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()

# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])

# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()

# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()

🏗️ Architecture

Core Principles

Extract everything, analyze nothing: Focus on structural extraction, not semantics
No file I/O in core: Parser accepts content strings, not paths
Plain dict outputs: Lightweight, no heavy dependencies
Security layered throughout: Size limits, plugin validation, content sanitization
Modular extractors (Phase 7): 11 specialized modules with dependency injection
Single responsibility: Each extractor handles one markdown element type
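The last two principles can be illustrated with a minimal sketch. It is not doxstrux's actual code: `StructureParser`, `extract_headings`, `extract_code_blocks`, and the dict-shaped tokens are hypothetical names chosen for illustration. It only shows the general pattern of injecting single-purpose extractors into a core that returns plain dicts.

```python
from typing import Callable

# A token here is a plain dict; a real parser would use richer token objects.
Token = dict
Extractor = Callable[[list[Token]], list[dict]]


def extract_headings(tokens: list[Token]) -> list[dict]:
    """Collect heading tokens only; knows nothing about other element types."""
    return [
        {"level": t["level"], "text": t["text"]}
        for t in tokens
        if t["type"] == "heading"
    ]


def extract_code_blocks(tokens: list[Token]) -> list[dict]:
    """Collect code block tokens only."""
    return [
        {"language": t.get("language", ""), "content": t["text"]}
        for t in tokens
        if t["type"] == "code_block"
    ]


class StructureParser:
    """Hypothetical core that receives its extractors instead of importing them."""

    def __init__(self, extractors: dict[str, Extractor]) -> None:
        self._extractors = extractors

    def parse(self, tokens: list[Token]) -> dict:
        # Plain dict output: one key per injected extractor.
        return {"structure": {name: fn(tokens) for name, fn in self._extractors.items()}}


tokens = [
    {"type": "heading", "level": 1, "text": "Hello"},
    {"type": "paragraph", "text": "This is markdown."},
    {"type": "code_block", "language": "python", "text": "print('hi')"},
]
parser = StructureParser({"headings": extract_headings, "code_blocks": extract_code_blocks})
result = parser.parse(tokens)
print(result["structure"]["headings"])
```

Because the core only sees a mapping of names to callables, an extractor can be tested or replaced on its own without touching the others.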

Security Profiles

Profile	Max Size	Max Lines	Recursion Depth	Use Case
strict	100KB	2K	50	Untrusted input
moderate	1MB	10K	100	Standard use (default)
permissive	10MB	50K	150	Trusted documents
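As a rough sketch of how limits like these could be enforced in a fail-closed way, the snippet below rejects content before any parsing happens. `ProfileLimits`, `ContentRejected`, and `check_limits` are hypothetical names, not part of the doxstrux API, and the byte values assume KB and MB mean powers of 1024. The recursion depth is carried in the data but not checked here, since that would have to happen during parsing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileLimits:
    max_bytes: int
    max_lines: int
    max_recursion_depth: int


# Values copied from the table above; KB/MB are assumed to be powers of 1024.
PROFILES = {
    "strict": ProfileLimits(100 * 1024, 2_000, 50),
    "moderate": ProfileLimits(1024 * 1024, 10_000, 100),
    "permissive": ProfileLimits(10 * 1024 * 1024, 50_000, 150),
}


class ContentRejected(ValueError):
    """Raised instead of parsing when a limit is exceeded (fail-closed)."""


def check_limits(content: str, profile: str = "moderate") -> ProfileLimits:
    # An unknown profile name is an error, never a silent fallback.
    if profile not in PROFILES:
        raise ContentRejected(f"unknown security profile: {profile!r}")
    limits = PROFILES[profile]
    size = len(content.encode("utf-8"))
    if size > limits.max_bytes:
        raise ContentRejected(f"{size} bytes exceeds {limits.max_bytes}")
    line_count = content.count("\n") + 1
    if line_count > limits.max_lines:
        raise ContentRejected(f"{line_count} lines exceeds {limits.max_lines}")
    return limits
```

Raising on an unknown profile rather than falling back to a default keeps a typo in configuration from quietly loosening the limits.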

Document IR

Clean intermediate representation for RAG pipelines and chunking:

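from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy

content = "# Hello\n\nThis is **markdown**."  # same sample as in Quick Start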

# Parse to IR
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)

# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)
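To make the policy parameters concrete, here is a minimal standalone sketch of boundary-respecting chunking with overlap. It does not reproduce how `DocumentIR.chunk` works internally: `chunk_section` and `chunk_document` are hypothetical helpers, sections are assumed to be `(heading, body)` pairs, and whitespace-separated words stand in for tokens.

```python
def chunk_section(words: list[str], max_tokens: int, overlap: int) -> list[list[str]]:
    """Split one section into windows of at most max_tokens words."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    if not words:
        return []
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + max_tokens])
        if start + max_tokens >= len(words):
            break  # the last window already reached the end of the section
    return chunks


def chunk_document(sections: list[tuple[str, str]], max_tokens: int, overlap: int) -> list[dict]:
    """Chunk each (heading, body) pair separately so no chunk crosses a heading."""
    out = []
    for heading, body in sections:
        for window in chunk_section(body.split(), max_tokens, overlap):
            out.append({"heading": heading, "text": " ".join(window)})
    return out


sections = [
    ("Intro", "one two three four five six seven"),
    ("Usage", "alpha beta"),
]
chunks = chunk_document(sections, max_tokens=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With these inputs the "Intro" section yields two windows that share the word "four", and "Usage" stays in its own chunk even though it would fit alongside the previous one.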

🧪 Testing

# Run all tests
pytest

# With coverage
pytest --cov=src/doxstrux

# Type checking
mypy src/doxstrux

# Linting
ruff check src/ tests/

📊 Project Status

Version: 0.2.1 ✅ Published on PyPI
Python: 3.12+
Test Coverage: 69% (working toward 80% target)
Tests: 95/95 pytest passing + 542/542 baseline tests passing
Regex Count: 0 (zero-regex architecture)
Core Parser: 1944 lines (reduced from 2900, -33%)
PyPI: https://pypi.org/project/doxstrux/

Phase 7: Modular Architecture ✅ COMPLETE

Completed: Full modularization of parser into 11 specialized extractors

✅ 7.0.5: Rename from docpipe to doxstrux
✅ 7.1: Create namespace structure
✅ 7.2: Move existing modules to new namespace
✅ 7.3: Extract line & text utilities
✅ 7.4: Extract configuration & budgets
✅ 7.5: Extract simple extractors (media, footnotes, blockquotes, html)
✅ 7.6: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)

Achievements:

Core parser reduced by 33% (2900 → 1944 lines)
11 specialized extractor modules created
100% baseline test parity maintained
Clean dependency injection pattern throughout
Zero behavioral changes (byte-for-byte output identical)

🗺️ Roadmap

Phase 7: Modular architecture ✅ COMPLETE
Phase 8: Enhanced testing & documentation
PDF support: Extract structure from PDF documents
HTML support: Parse HTML with same IR
Enhanced chunking: Semantic-aware chunking strategies
Performance: Cython optimization for hot paths

📚 Documentation

Architecture: See CLAUDE.md for detailed architecture notes
Phase 7 Plan: See regex_refactor_docs/DETAILED_TASK_LIST.md
Testing: See regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md

🤝 Contributing

This project follows a phased refactoring methodology with comprehensive test gates.

All changes must pass the full pytest suite (95 tests at the time of writing)
All changes must maintain byte-for-byte output parity (542 baseline tests)
Security-first: No untrusted regex, validated links, sanitized HTML
Type-safe: Full mypy strict mode compliance
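A byte-for-byte parity gate of the kind described above could look roughly like the following. This is an illustrative sketch only: the JSON serialization, the `hello.baseline.json` file name, and the helper functions are assumptions, and the project's real baseline format and tooling may differ.

```python
import json
import tempfile
from pathlib import Path


def canonical_bytes(result: dict) -> bytes:
    """Serialize a plain-dict parse result deterministically."""
    return json.dumps(result, sort_keys=True, ensure_ascii=False, indent=2).encode("utf-8")


def matches_baseline(result: dict, baseline_path: Path) -> bool:
    """True only if the serialized result equals the stored bytes exactly."""
    return canonical_bytes(result) == baseline_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    baseline = Path(tmp) / "hello.baseline.json"
    old_output = {"structure": {"headings": [{"level": 1, "text": "Hello"}]}}
    baseline.write_bytes(canonical_bytes(old_output))

    # Same content with a different key order still matches after canonicalization.
    same = {"structure": {"headings": [{"text": "Hello", "level": 1}]}}
    # Any change in a value, however small, breaks parity.
    drifted = {"structure": {"headings": [{"level": 1, "text": "Hello "}]}}

    parity_ok = matches_baseline(same, baseline)
    parity_drift = matches_baseline(drifted, baseline)

print(parity_ok, parity_drift)
```

Sorting keys before comparison means only real output differences fail the gate, not incidental dict ordering.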

📜 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments

Built on:

markdown-it-py - CommonMark compliant parser
mdit-py-plugins - Extended markdown features

Previous name: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)

