Document structure extraction tool for markdown, with extensibility to PDF and HTML.
Extract hierarchical structure, metadata, and content from documents without semantic analysis. Built for RAG pipelines, documentation analysis, and AI preprocessing.
- Zero-regex parsing: Token-based extraction using markdown-it-py
- Security-first design: Three security profiles (strict/moderate/permissive)
- Document IR: Clean intermediate representation for RAG chunking
- Structure extraction: Headings, lists, tables, code blocks, links, images
- Content integrity: Parse without mutation, fail-closed security
- Extensible architecture: Ready for PDF and HTML support
pip install doxstrux

from doxstrux.markdown_parser_core import MarkdownParserCore
# Basic usage
content = "# Hello\n\nThis is **markdown**."
parser = MarkdownParserCore(content)
result = parser.parse()
# Access structure
print(result['structure']['headings'])
print(result['metadata']['security']['statistics'])
# With security profile
parser = MarkdownParserCore(content, security_profile='strict')
result = parser.parse()
# With custom config
parser = MarkdownParserCore(
    content,
    config={
        'preset': 'gfm',
        'plugins': ['table', 'strikethrough'],
        'allows_html': False
    },
    security_profile='moderate'
)
result = parser.parse()

- Extract everything, analyze nothing: Focus on structural extraction, not semantics
- No file I/O in core: Parser accepts content strings, not paths
- Plain dict outputs: Lightweight, no heavy dependencies
- Security layered throughout: Size limits, plugin validation, content sanitization
- Modular extractors (Phase 7): 11 specialized modules with dependency injection
- Single responsibility: Each extractor handles one markdown element type
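
The principles above (plain dict outputs, one extractor per element type, dependency injection) can be sketched in a few lines. This is only an illustration of the pattern, not the doxstrux implementation: `MiniParser`, the extractor functions, and the dict-shaped tokens are all hypothetical names invented for this example.

```python
from typing import Callable

# Hypothetical token shape for illustration: plain dicts with a "type" key.
# The real parser works on markdown-it-py tokens instead.
Token = dict
Extractor = Callable[[list[Token]], list[dict]]


def extract_headings(tokens: list[Token]) -> list[dict]:
    """Single responsibility: collect heading tokens only."""
    return [
        {"level": t["level"], "text": t["text"]}
        for t in tokens
        if t["type"] == "heading"
    ]


def extract_code_blocks(tokens: list[Token]) -> list[dict]:
    """Single responsibility: collect code block tokens only."""
    return [
        {"language": t.get("language", ""), "content": t["content"]}
        for t in tokens
        if t["type"] == "code_block"
    ]


class MiniParser:
    """Receives its extractors from the caller (dependency injection)."""

    def __init__(self, extractors: dict[str, Extractor]) -> None:
        self._extractors = extractors

    def parse(self, tokens: list[Token]) -> dict:
        # Output is a plain nested dict, with no custom result classes.
        return {
            "structure": {
                name: extract(tokens)
                for name, extract in self._extractors.items()
            }
        }


tokens = [
    {"type": "heading", "level": 1, "text": "Hello"},
    {"type": "code_block", "language": "python", "content": "print('hi')"},
    {"type": "paragraph", "text": "This is markdown."},
]
parser = MiniParser(
    {"headings": extract_headings, "code_blocks": extract_code_blocks}
)
result = parser.parse(tokens)
```

Because the extractors are passed in rather than hard-coded, each one can be tested in isolation or swapped out without touching the parser class.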
| Profile | Max Size | Max Lines | Recursion Depth | Use Case |
|---|---|---|---|---|
| strict | 100KB | 2K | 50 | Untrusted input |
| moderate | 1MB | 10K | 100 | Standard use (default) |
| permissive | 10MB | 50K | 150 | Trusted documents |
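
As a rough sketch of how the size and line limits in this table could be enforced in a fail-closed way, the snippet below rejects content before any parsing happens. It is not the library's code: `ProfileLimits`, `ContentLimitError`, and `check_limits` are hypothetical names, and it assumes 1 KB = 1024 bytes and that "2K" means 2,000 lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileLimits:
    max_bytes: int
    max_lines: int
    max_recursion_depth: int


# Values taken from the table above.
# Assumptions: 1 KB = 1024 bytes, "2K" lines = 2,000 lines.
PROFILE_LIMITS = {
    "strict": ProfileLimits(100 * 1024, 2_000, 50),
    "moderate": ProfileLimits(1024 * 1024, 10_000, 100),
    "permissive": ProfileLimits(10 * 1024 * 1024, 50_000, 150),
}


class ContentLimitError(ValueError):
    """Raised when content exceeds the selected profile's limits."""


def check_limits(content: str, profile: str = "moderate") -> None:
    """Fail closed: raise instead of parsing oversized or unknown input."""
    limits = PROFILE_LIMITS.get(profile)
    if limits is None:
        # An unknown profile is rejected rather than silently defaulted.
        raise ContentLimitError(f"unknown security profile: {profile!r}")

    size = len(content.encode("utf-8"))
    if size > limits.max_bytes:
        raise ContentLimitError(
            f"content is {size} bytes, limit is {limits.max_bytes}"
        )

    line_count = len(content.splitlines())
    if line_count > limits.max_lines:
        raise ContentLimitError(
            f"content has {line_count} lines, limit is {limits.max_lines}"
        )


# A small document passes every profile.
check_limits("# Hello\n\nThis is **markdown**.", profile="strict")

# 3,000 lines is over the strict limit but within the moderate one.
long_doc = "x\n" * 3000
check_limits(long_doc, profile="moderate")
try:
    check_limits(long_doc, profile="strict")
    strict_rejected = False
except ContentLimitError:
    strict_rejected = True
```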
Clean intermediate representation for RAG pipelines and chunking:
from doxstrux.markdown_parser_core import MarkdownParserCore
from doxstrux.markdown.ir import DocumentIR, ChunkPolicy
# Parse to IR
parser = MarkdownParserCore(content)
result = parser.parse()
doc_ir = DocumentIR.from_parse_result(result)
# Apply chunking policy
policy = ChunkPolicy(
    max_chunk_tokens=512,
    overlap_tokens=50,
    respect_boundaries=['heading', 'section']
)
chunks = doc_ir.chunk(policy)

# Run all tests
pytest
# With coverage
pytest --cov=src/doxstrux
# Type checking
mypy src/doxstrux
# Linting
ruff check src/ tests/

- Version: 0.2.1 ✅ Published on PyPI
- Python: 3.12+
- Test Coverage: 69% (working toward 80% target)
- Tests: 95/95 pytest passing + 542/542 baseline tests passing
- Regex Count: 0 (zero-regex architecture)
- Core Parser: 1944 lines (reduced from 2900, -33%)
- PyPI: https://pypi.org/project/doxstrux/
Completed: Full modularization of parser into 11 specialized extractors
- ✅ 7.0.5: Rename from docpipe to doxstrux
- ✅ 7.1: Create namespace structure
- ✅ 7.2: Move existing modules to new namespace
- ✅ 7.3: Extract line & text utilities
- ✅ 7.4: Extract configuration & budgets
- ✅ 7.5: Extract simple extractors (media, footnotes, blockquotes, html)
- ✅ 7.6: Extract complex extractors (lists, codeblocks, tables, links, sections, paragraphs)
Achievements:
- Core parser reduced by 33% (2900 → 1944 lines)
- 11 specialized extractor modules created
- 100% baseline test parity maintained
- Clean dependency injection pattern throughout
- Zero behavioral changes (byte-for-byte output identical)
- Phase 7: Modular architecture ✅ COMPLETE
- Phase 8: Enhanced testing & documentation
- PDF support: Extract structure from PDF documents
- HTML support: Parse HTML with same IR
- Enhanced chunking: Semantic-aware chunking strategies
- Performance: Cython optimization for hot paths
- Architecture: See `CLAUDE.md` for detailed architecture notes
- Phase 7 Plan: See `regex_refactor_docs/DETAILED_TASK_LIST.md`
- Testing: See `regex_refactor_docs/REGEX_REFACTOR_POLICY_GATES.md`
This project follows a phased refactoring methodology with comprehensive test gates.
- All changes must pass the full pytest suite (95 tests at present)
- All changes must maintain byte-for-byte output parity (542 baseline tests)
- Security-first: No untrusted regex, validated links, sanitized HTML
- Type-safe: Full mypy strict mode compliance
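
To make the byte-for-byte parity gate concrete, here is one way such a check could look. The project's actual baseline format and test harness are not described here, so this is an assumption-laden sketch: `canonical_bytes` and `matches_baseline` are hypothetical helpers, and JSON with sorted keys is simply one reasonable canonical form for plain dict output.

```python
import json


def canonical_bytes(result: dict) -> bytes:
    """Serialize a parse result deterministically (hypothetical helper).

    Sorted keys and fixed separators mean the same dict always yields
    the same bytes, regardless of key insertion order.
    """
    text = json.dumps(
        result,
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return text.encode("utf-8")


def matches_baseline(result: dict, baseline: bytes) -> bool:
    """Return True only if the output is byte-identical to the baseline."""
    return canonical_bytes(result) == baseline


# Baseline recorded from a known-good run (illustrative data).
baseline = canonical_bytes(
    {"structure": {"headings": [{"level": 1, "text": "Hello"}]}}
)

# Same content with a different key order still matches.
same = {"structure": {"headings": [{"text": "Hello", "level": 1}]}}

# Any behavioral change, even one character, breaks parity.
changed = {"structure": {"headings": [{"level": 1, "text": "Hello!"}]}}

parity_ok = matches_baseline(same, baseline)
parity_broken = not matches_baseline(changed, baseline)
```

Comparing serialized bytes rather than Python objects catches subtle drift, such as a change in how a value is rendered, that an equality check on the dicts might miss.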
MIT License - see LICENSE file for details.
Built on:
- markdown-it-py - CommonMark compliant parser
- mdit-py-plugins - Extended markdown features
Previous name: docpipe (renamed to doxstrux in v0.2.0 for extensibility to PDF/HTML)