⚠️ WORK IN PROGRESS ⚠️ This project is under active development and is NOT production-ready. Breaking changes are likely to occur without prior notice. Use at your own risk in non-production environments only.
A Python tool that generates comprehensive project documentation using AI models. It analyzes your codebase, generates documentation, validates links, checks readability, and ensures high-quality output with flexible AI provider options.
For detailed technical documentation and architecture information, see:
- Architecture Guide - System design and component interactions
- Architecture Generator - Documentation for the architecture generation system
- Configuration Guide - Detailed configuration options and usage
- GitHub Integration - GitHub utilities and integration features
- Development Guide - Development setup and workflows
- Contributing Guide - Guidelines for contributing to the project
- Contributing Generator - Documentation for the contributing guide generation system
- Badges Guide - Documentation for the badge generation system
- README Generator - Documentation for the README generation system
- Mermaid Generator - Documentation for the Mermaid diagram generation system
- CodebaseAnalyzer - Documentation for the codebase analysis system
- LLM Clients - Documentation for the LLM client system
- Message Manager - Documentation for the message formatting system
- 🔍 Intelligent Codebase Analysis
  - AST parsing for code structure
  - Dependency graph generation
  - Import/export detection
  - Binary file detection
- 🤖 Flexible AI Processing
  - Local Ollama integration
  - Secure local processing
  - Interactive model selection
  - AWS Bedrock integration
  - Claude 3.7 Sonnet support
  - Enterprise-grade AI capabilities
  - Customizable prompt templates
  - Context-aware generation
  - Parallel processing support
- 📝 Documentation Generation
  - README.md generation
  - Architecture documentation
  - API documentation
  - Developer guides
- 🔄 Smart Caching
  - Multi-level cache (memory + SQLite)
  - Intelligent invalidation
  - TTL support
  - Size-based limits
- ✅ Validation
  - Link checking (internal + external)
  - Markdown validation
  - Badge verification
  - Reference checking
- 📊 Quality Metrics
  - Readability scoring
  - Complexity analysis
  - Documentation coverage
  - Improvement suggestions
- 🔄 Repository Integration
  - Local repository analysis
  - GitHub repository cloning
  - Automatic pull request creation (see the sketch after this list)
  - Branch management
  - Custom PR titles and descriptions
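The automated pull request step boils down to pushing a docs branch and opening a PR against the default branch. A minimal sketch using the PyGithub library (an assumption for illustration; the tool's actual GitHub integration may use a different client):

```python
from github import Github  # PyGithub (assumed client for this sketch)

def create_docs_pr(token: str, repo_name: str, branch: str, title: str, body: str) -> str:
    """Open a pull request from an already-pushed docs branch."""
    gh = Github(token)
    repo = gh.get_repo(repo_name)  # e.g. "TrueBurn/ai-codebase-scribe"
    pr = repo.create_pull(
        title=title,
        body=body,
        head=branch,               # branch containing the generated docs
        base=repo.default_branch,  # usually "main"
    )
    return pr.html_url
```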
Windows:

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
venv\Scripts\activate

# Verify activation (should show virtual environment path)
where python
```

Linux/macOS:

```bash
# Create virtual environment
python3 -m venv venv

# Activate virtual environment
source venv/bin/activate

# Verify activation (should show virtual environment path)
which python
```

```bash
# Clone the repository
git clone https://github.com/TrueBurn/ai-codebase-scribe.git
cd ai-codebase-scribe

# Install dependencies
pip install -r requirements.txt

# Ensure Ollama is running locally
# Visit https://ollama.ai for installation instructions
```

When you're done, you can deactivate the virtual environment:

```bash
deactivate
```

When you first run the tool, it will:
- Connect to your Ollama instance
- List all available models
- Prompt you to select one interactively
Example selection dialog:

```text
Available Ollama models:
1. llama3:latest
2. codellama:7b
3. mistral:instruct

Enter the number of the model to use: 2
Selected model: codellama:7b
```

Requirements:

- Python 3.8+
- Ollama running locally
- Git repository
- Required Python packages:

```text
ollama>=0.4.7
gitignore-parser>=0.1.11
networkx>=3.2.1
python-magic>=0.4.27
pyyaml>=6.0.1
tqdm>=4.66.1
textstat>=0.7.3
psutil>=5.9.0
```
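The interactive picker only needs Ollama's REST API. A minimal sketch of listing installed models via Ollama's documented `/api/tags` endpoint, using the `requests` library for brevity (not necessarily one of this project's dependencies; the selection loop is illustrative, not the tool's exact code):

```python
from typing import List

import requests

def list_ollama_models(base_url: str = "http://localhost:11434") -> List[str]:
    """Return the names of models installed on a local Ollama server."""
    # /api/tags is Ollama's endpoint for listing locally available models
    response = requests.get(f"{base_url}/api/tags", timeout=30)
    response.raise_for_status()
    return [model["name"] for model in response.json().get("models", [])]

if __name__ == "__main__":
    models = list_ollama_models()
    for i, name in enumerate(models, start=1):
        print(f"{i}. {name}")
    choice = int(input("Enter the number of the model to use: "))
    print(f"Selected model: {models[choice - 1]}")
```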
```bash
# Generate documentation for a local repository
python codebase_scribe.py --repo ./my-project

# Use a specific model
python codebase_scribe.py --repo ./my-project --model llama3

# Enable debug mode for verbose output
python codebase_scribe.py --repo ./my-project --debug
```

```bash
# Clone and analyze a GitHub repository
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe

# Use a GitHub token for private repositories
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --github-token YOUR_TOKEN

# Alternative: set token as environment variable
export GITHUB_TOKEN=your_github_token
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe
```

```bash
# Create a PR with documentation changes
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe \
  --create-pr \
  --github-token YOUR_TOKEN \
  --branch-name docs/readme-update \
  --pr-title "Documentation: Add README and architecture docs" \
  --pr-body "This PR adds auto-generated documentation using the README generator tool."

# Keep the cloned repo after PR creation (for debugging)
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe \
  --create-pr --keep-clone
```

```bash
# Disable caching (process all files)
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --no-cache

# Clear cache before processing
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --clear-cache

# Only clear cache (don't generate documentation)
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --clear-cache --keep-clone
```

```bash
# Use Ollama (default)
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --llm-provider ollama

# Use AWS Bedrock
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --llm-provider bedrock
```

```bash
# Generate additional API documentation
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe --api-docs

# Custom output files
python codebase_scribe.py --github https://github.com/TrueBurn/ai-codebase-scribe \
  --output-readme custom_readme.md \
  --output-arch custom_architecture.md
```

```bash
# 1. Set your GitHub token as an environment variable
export GITHUB_TOKEN=ghp_your_personal_access_token

# 2. Generate documentation and create a PR
python codebase_scribe.py \
  --github https://github.com/TrueBurn/ai-codebase-scribe \
  --create-pr \
  --branch-name docs/update-documentation \
  --pr-title "Documentation: Update README and architecture docs" \
  --pr-body "This PR updates the project documentation with auto-generated content that reflects the current state of the codebase."
```

Command-line options:

- `--repo`: Path to repository to analyze (required if not using `--github`)
- `--github`: GitHub repository URL to clone and analyze
- `--github-token`: GitHub Personal Access Token for private repositories
- `--keep-clone`: Keep cloned repository after processing (GitHub only)
- `--create-pr`: Create a pull request with generated documentation (GitHub only)
- `--branch-name`: Branch name for PR creation (default: `docs/auto-generated-readme`)
- `--pr-title`: Title for the pull request
- `--pr-body`: Body text for the pull request
- `--output`, `-o`: Output file name (default: `README.md`)
- `--config`, `-c`: Path to config file (default: `config.yaml`)
- `--debug`: Enable debug logging
- `--test-mode`: Enable test mode (process only first 5 files)
- `--no-cache`: Disable caching of file summaries
- `--clear-cache`: Clear the cache for this repository before processing
- `--optimize-order`: Use LLM to determine optimal file processing order
- `--llm-provider`: LLM provider to use (`ollama` or `bedrock`; overrides config file)
The generated documentation files will be created in the target repository directory (the `--repo` path) by default. You can specify different output locations using the `--readme` and `--architecture` arguments.
Example with custom output paths:

```bash
python codebase_scribe.py \
  --repo /path/to/your/repo \
  --readme /path/to/output/README.md \
  --architecture /path/to/output/ARCHITECTURE.md
```

The `config.yaml` no longer specifies models. Instead, you'll choose from available models at runtime. The configuration only needs:
```yaml
ollama:
  base_url: "http://localhost:11434"  # Ollama API endpoint
  max_tokens: 4096                    # Maximum tokens per request
  retries: 3                          # Number of retry attempts
  retry_delay: 1.0                    # Delay between retries
  timeout: 30                         # Request timeout in seconds

cache:
  enabled: true        # Set to false to disable caching
  directory: ".cache"  # Directory to store cache files
  location: "repo"     # "repo" or "home" for cache location

templates:
  prompts:
    file_summary: |
      # Custom prompt for file summaries
      Analyze the following code file and provide a clear, concise summary:
      File: {file_path}
      Type: {file_type}
      Context: {context}
      Code:
      {code}
    project_overview: |
      # Custom prompt for project overview
      Generate a comprehensive overview for:
      Project: {project_name}
      Files: {file_count}
      Components: {key_components}
  docs:
    readme: |
      # {project_name}
      {project_overview}
      ## Usage
      {usage}
```
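The prompt templates are plain Python format strings, so rendering is a `str.format` call with the documented placeholders. A minimal illustration (the values filled in here are made up, not real tool output):

```python
# Same placeholders as the file_summary template above
file_summary_template = (
    "Analyze the following code file and provide a clear, concise summary:\n"
    "File: {file_path}\n"
    "Type: {file_type}\n"
    "Context: {context}\n"
    "Code:\n"
    "{code}"
)

prompt = file_summary_template.format(
    file_path="src/utils/cache.py",  # example values only
    file_type="python",
    context="caching layer used by the analyzer",
    code="def get(key):\n    ...",
)
print(prompt)
```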
The `config.yaml` supports file filtering through a blacklist system:

```yaml
blacklist:
  extensions: [".md", ".txt", ".log"]  # File extensions to exclude
  path_patterns:
    - "/temp/"  # Path patterns to exclude
    - "/cache/"
    - "/node_modules/"
    - "/__pycache__/"
```

- `extensions`: List of file extensions to exclude from analysis
- `path_patterns`: List of regex patterns for paths to exclude
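A minimal sketch of how this filtering could be applied (illustrative only; the function and constant names are assumptions, not the tool's actual internals):

```python
import re
from pathlib import Path

# Hypothetical values mirroring the blacklist config above
EXTENSIONS = {".md", ".txt", ".log"}
PATH_PATTERNS = [re.compile(p) for p in ("/temp/", "/cache/", "/node_modules/", "/__pycache__/")]

def is_blacklisted(path: Path) -> bool:
    """Return True if a file should be excluded from analysis."""
    if path.suffix in EXTENSIONS:
        return True
    # path_patterns are regexes matched against a POSIX-style path
    posix = "/" + path.as_posix()
    return any(pattern.search(posix) for pattern in PATH_PATTERNS)

print(is_blacklisted(Path("src/main.py")))            # False
print(is_blacklisted(Path("node_modules/pkg/x.js")))  # True
```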
You can run Ollama on a different machine in your network:

- Local Machine (default):

  ```yaml
  ollama:
    base_url: "http://localhost:11434"
  ```

- Network Machine:

  ```yaml
  ollama:
    base_url: "http://192.168.1.100:11434"  # Replace with your machine's IP
  ```

- Custom Port:

  ```yaml
  ollama:
    base_url: "http://ollama.local:8000"  # Custom domain and port
  ```

Note: Ensure the Ollama server is accessible from your machine and that any necessary firewall rules are configured.
```text
src/
├── analyzers/               # Code analysis tools
│   └── codebase.py          # Repository analysis
├── clients/                 # External service clients
│   ├── ollama.py            # Ollama API integration
│   ├── bedrock.py           # AWS Bedrock integration
│   └── llm_utils.py         # Shared LLM utilities
├── generators/              # Content generation
│   ├── contributing.py      # Contributing guide generation
│   └── readme.py            # README generation
├── models/                  # Data models
│   └── file_info.py         # File information
└── utils/                   # Utility functions
    ├── cache.py             # Caching system
    ├── config.py            # Configuration
    ├── link_validator.py    # Link validation
    ├── markdown_validator.py # Markdown checks
    ├── progress.py          # Progress tracking
    ├── prompt_manager.py    # Prompt handling
    ├── readability.py       # Readability scoring
    └── tree_formatter.py    # Project structure visualization
```
We use pytest for testing. To run tests:

```bash
# Install development dependencies
pip install -r requirements-dev.txt

# Run tests
pytest  # Note: On Windows, run terminal as Administrator

# Run with coverage report
pytest --cov=src tests/
```

- Windows: Run terminal as Administrator for tests
- Unix/Linux/macOS: Regular user permissions are sufficient
See our Development Guide for detailed testing instructions.
Contributions are welcome! Please read our Contributing Guide for details on:
- Code of Conduct
- Development process
- Pull request process
- Coding standards
- Documentation requirements
Additional documentation:

- API Documentation
- Architecture Guide
- Architecture Generator
- Development Guide
- Contributing Guide
- Contributing Generator
- README Generator
- Mermaid Generator
- CodebaseAnalyzer
- LLM Clients
- Message Manager
This project is licensed under the MIT License - see the LICENSE file for details.
The project uses a class-based configuration system (ScribeConfig) that provides type safety and better organization of configuration options. The configuration is loaded from a YAML file and can be overridden by command-line arguments and environment variables.
The system provides:
- Type safety and better IDE support
- Structured organization of configuration options
- Validation of configuration values
- Environment variable overrides
- Command-line argument overrides
See the Configuration Guide for detailed information on the configuration system.
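For a sense of the shape, here is a minimal sketch of a typed, dataclass-based configuration like the one described (illustrative only: the real `ScribeConfig` lives in `src/utils/config.py` and may differ, and the `OLLAMA_BASE_URL` environment variable is an assumption):

```python
import os
from dataclasses import dataclass, field

import yaml  # provided by the pyyaml dependency

@dataclass
class OllamaConfig:
    base_url: str = "http://localhost:11434"
    max_tokens: int = 4096
    retries: int = 3
    retry_delay: float = 1.0
    timeout: int = 30

@dataclass
class CacheConfig:
    enabled: bool = True
    directory: str = ".cache"
    location: str = "repo"  # "repo" or "home"

@dataclass
class ScribeConfig:
    ollama: OllamaConfig = field(default_factory=OllamaConfig)
    cache: CacheConfig = field(default_factory=CacheConfig)

    @classmethod
    def load(cls, path: str = "config.yaml") -> "ScribeConfig":
        with open(path, encoding="utf-8") as f:
            raw = yaml.safe_load(f) or {}
        config = cls(
            ollama=OllamaConfig(**raw.get("ollama", {})),
            cache=CacheConfig(**raw.get("cache", {})),
        )
        # Environment variables take precedence over the YAML file
        # (variable name is hypothetical)
        config.ollama.base_url = os.getenv("OLLAMA_BASE_URL", config.ollama.base_url)
        return config
```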
The project includes a path compression utility that reduces token usage when sending file paths to LLMs. This is particularly useful for Java projects with deep package structures, where file paths can consume a significant portion of the token budget.
The path compression system:
- Identifies common prefixes in file paths
- Replaces them with shorter keys (e.g., `@1`, `@2`)
- Adds an explanation of the compression scheme to the LLM prompt
- Significantly reduces token usage for large projects
For example, paths like `src/main/java/com/example/project/Controller.java` are compressed to `@1/Controller.java`, saving tokens while maintaining readability.
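A minimal sketch of the idea (illustrative, not the project's actual implementation): collect directory prefixes shared by several files, assign each a short key, and rewrite the paths.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, List, Tuple

def compress_paths(paths: List[str], min_count: int = 2) -> Tuple[Dict[str, str], List[str]]:
    """Map directory prefixes shared by several files to short keys like @1, @2."""
    # Count each path's parent directory; frequent parents become keys
    parents = Counter(str(PurePosixPath(p).parent) for p in paths)
    mapping: Dict[str, str] = {}
    next_id = 1
    for prefix, count in parents.most_common():
        if count >= min_count and prefix != ".":
            mapping[prefix] = f"@{next_id}"
            next_id += 1
    compressed = []
    for p in paths:
        key = mapping.get(str(PurePosixPath(p).parent))
        compressed.append(f"{key}/{PurePosixPath(p).name}" if key else p)
    return mapping, compressed

mapping, out = compress_paths([
    "src/main/java/com/example/project/Controller.java",
    "src/main/java/com/example/project/Service.java",
])
print(mapping)  # {'src/main/java/com/example/project': '@1'}
print(out)      # ['@1/Controller.java', '@1/Service.java']
```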
- Improved Project Structure Visualization: Enhanced the project structure representation in architecture documentation with a new tree formatter that uses box-drawing characters to clearly show folder hierarchy and relationships.
- Path Compression: Added a path compression utility that reduces token usage when sending file paths to LLMs. This is particularly useful for Java projects with deep package structures.
- Dependencies Error Fix: Fixed an issue with the `dependencies` field in the project overview generation. The problem was that the original file manifest was being passed to the dependency analysis function instead of the converted manifest, causing errors when processing FileInfo objects.
- Template Parameter Fix: Fixed an issue in the `BedrockClient` class where the `dependencies` parameter was missing in the `format` method call for the project overview template, causing a KeyError.
Python bytecode caching is currently disabled for development purposes. To re-enable it:
- Remove `sys.dont_write_bytecode = True` from `codebase_scribe.py`
- Or unset the `PYTHONDONTWRITEBYTECODE` environment variable
This should be re-enabled before deploying to production for better performance.
The tool provides several ways to manage caching:
The cache can be stored in two locations, configurable via the `cache.location` setting in `config.yaml`:
- Repository-Based Cache (`location: "repo"`, default)
  - Stored directly in the target repository's `.cache` directory
  - Shared Cache: Anyone running the script on the same repository benefits from previous analysis
  - CI/CD Integration: Makes it easier to integrate with GitHub Actions as the cache is part of the repository
  - Portable: Cache travels with the repository when cloned or forked
- Home Directory Cache (`location: "home"`)
  - Stored in the user's home directory under `.readme_generator_cache`
  - Privacy: Cache files remain on the local machine
  - Cleaner Repository: Doesn't add cache files to the repository
  - Legacy Behavior: Matches the behavior of earlier versions

You can configure this in `config.yaml`:
```yaml
cache:
  enabled: true
  directory: ".cache"
  location: "repo"  # Change to "home" to use home directory
```

```bash
# Disable caching for current run
python codebase_scribe.py --repo /path/to/repo --no-cache

# Clear cache for specific repository
python codebase_scribe.py --repo /path/to/repo --clear-cache

# Note: --clear-cache will clear the cache and exit without processing
```

The cache system includes several optimizations:
- SQLite Vacuum: Periodically compacts the database to minimize repository size
- Content-Based Invalidation: Only regenerates summaries when file content changes
- Structured Storage: Uses a database format that minimizes merge conflicts
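A minimal sketch of content-based invalidation (illustrative; the table layout and function names here are assumptions, not the project's actual schema):

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Optional

# Hypothetical schema: one row per file, keyed by path, storing a content hash
Path(".cache").mkdir(exist_ok=True)
conn = sqlite3.connect(".cache/summaries.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS summaries (path TEXT PRIMARY KEY, content_hash TEXT, summary TEXT)"
)

def file_digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def cached_summary(path: Path) -> Optional[str]:
    """Return the cached summary only if the file content is unchanged."""
    row = conn.execute(
        "SELECT content_hash, summary FROM summaries WHERE path = ?", (str(path),)
    ).fetchone()
    if row and row[0] == file_digest(path):
        return row[1]  # content unchanged: cache hit
    return None  # missing or stale: caller regenerates and re-stores

def store_summary(path: Path, summary: str) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO summaries VALUES (?, ?, ?)",
        (str(path), file_digest(path), summary),
    )
    conn.commit()

# Periodic maintenance corresponding to the "SQLite Vacuum" point above:
# conn.execute("VACUUM")
```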
While the cache is generally committed to the repository, certain directories are excluded:
- Test Folders: Cache files in `tests/` and its subdirectories are excluded from git
- This prevents test-generated cache files from being committed while still allowing the main project cache to be shared
- ✅ Dependencies Error Fix: Fixed an issue with the `dependencies` parameter missing in the `generate_project_overview` method in the `BedrockClient` class.
- ✅ Split Contributing Guide: Moved the Contributing guide to a separate file with its own generator in the root directory, following GitHub conventions.
- 📝 Split Usage Guide: Move the usage guide to a separate file with its own generator.
- 🔗 Improve Documentation Links: Ensure generated README properly links to all other documentation files.
- 🔄 GitHub Workflows: Add GitHub Actions workflows for CI/CD, automated testing, and code quality checks.
- 🧪 Improve Test Coverage: Add more unit tests for the core functionality.
- 🚀 Performance Optimization: Optimize the file processing pipeline for better performance.
- 🗑️ Fix Lingering Folders: Clean up lingering temporary folders in home directory when pulling from GitHub.