Skip to content

feat: LangChain Python integration fixes and multi-instance docs#75

Merged
saschabuehrle merged 16 commits intomainfrom
feat/multi-instance-docs
Nov 18, 2025
Merged

feat: LangChain Python integration fixes and multi-instance docs#75
saschabuehrle merged 16 commits intomainfrom
feat/multi-instance-docs

Conversation

@saschabuehrle
Copy link
Copy Markdown
Collaborator

@saschabuehrle saschabuehrle commented Nov 18, 2025

CascadeFlow v0.6.0 - Multi-Instance Support and LangChain Integration

This PR includes all changes for the v0.6.0 release.

New Features

🔗 LangChain Integration (Python & TypeScript)

Production-ready LangChain integration for both Python and TypeScript with zero-config setup.

Features:

  • LangChain-compatible callback handlers for cost and token tracking
  • Built-in cost tracking utilities with budget management and CSV export
  • Automatic LangSmith tag tracking and integration
  • Universal provider support across all LangChain models

Documentation:

Examples:

🖥️ Multi-Instance Provider Support

Run draft and verifier models on separate Ollama/vLLM instances for optimized deployment.

Documentation:

Examples:

🌐 OpenRouter Provider

Access 200+ models through unified OpenRouter integration.

Documentation:

Bug Fixes

  • Critical n8n integration fixes: Resolved cascading issues and improved reliability for n8n workflow automation
  • Fixed multi-instance providers to correctly support per-model base_url configuration
  • Corrected API usage in multi-instance Ollama and vLLM examples
  • Fixed o1-mini model ID (o1-mini-2024-09-12) and documented Tier 5 requirement
  • Resolved AttributeError in edge_device.py example
  • Fixed Python code quality issues for Ruff linting compliance (deprecated type hints, unused imports)

Testing

All CI checks are passing:

  • ✅ Python tests
  • ✅ TypeScript tests
  • ✅ Code quality (Ruff, Black, TypeScript)
  • ✅ Example validation

Ready for release to PyPI and npm.

Add comprehensive documentation and examples for running draft and verifier
models on separate Ollama or vLLM instances. This enables optimal GPU
utilization in multi-GPU systems and distributed deployments.

Changes:
- Update .env.example with multi-instance configuration sections
  - OLLAMA_DRAFT_URL and OLLAMA_VERIFIER_URL
  - VLLM_DRAFT_URL and VLLM_VERIFIER_URL
  - References to TypeScript, Python, and Docker examples

- Add TypeScript examples
  - multi-instance-ollama.ts: Three configuration scenarios with health checks
  - multi-instance-vllm.ts: vLLM-specific features and API key support

- Add Python examples
  - multi_instance_ollama.py: Async implementation with health checks
  - multi_instance_vllm.py: Includes PagedAttention and batching notes

- Add Docker Compose setup for multi-GPU deployment
  - GPU device assignment (draft on GPU 0, verifier on GPU 1)
  - Separate ports (11434 and 11435)
  - Health checks and volume isolation
  - Comprehensive README with troubleshooting

Note: Multi-instance support already exists via ModelConfig.baseUrl.
This commit adds documentation and examples for the existing feature.
- Add Multi-Instance Ollama and vLLM to advanced examples tables in main README
  - Python advanced examples section (lines 402-403)
  - TypeScript advanced examples section (lines 439-440)

- Update examples/README.md with comprehensive documentation
  - Add examples to 'Find by Feature' quick reference
  - Update table of contents (3 examples instead of 1)
  - Add detailed sections for both multi-instance examples
  - Include Docker Compose guide references
  - Document use cases, hardware requirements, and performance benefits

All documentation now consistently references the new multi-instance examples.
- Remove invalid quality_threshold parameter from CascadeAgent
- Add quality_threshold to ModelConfig (0.7 for draft, 0.95 for verifier)
- Remove non-existent usage attribute access
- Match cascade setup from basic_usage.py
BREAKING BUG FIX: VLLMProvider and other providers now respect
ModelConfig.base_url for multi-instance deployments.

Problem:
- Providers were instantiated once per provider type, ignoring
  ModelConfig.base_url and api_key parameters
- Multi-instance setups (draft on GPU 0, verifier on GPU 1) failed
  because both models tried to use the same provider instance
- Examples: multi_instance_vllm.py and multi_instance_ollama.py
  couldn't connect to separate instance URLs

Solution:
- CascadeAgent._init_providers() now creates separate provider
  instances for each model with model-specific base_url/api_key
- Added model_providers dict mapping model.name → provider instance
- WholeResponseCascade and CascadeAgent use _get_provider() helper
  to look up model-specific providers (with backwards compat fallback)
- Maintains full backwards compatibility for single-instance setups

Backwards Compatibility:
✅ Tested with basic_usage.py (OpenAI standard setup)
✅ All existing functionality preserved
✅ Only activates when ModelConfig.base_url is set
✅ Falls back to provider-type lookup for existing code

Files Changed:
- cascadeflow/agent.py: _init_providers(), _get_provider(), all
  direct routing methods (_execute_direct_with_timing, etc.)
- cascadeflow/core/cascade.py: __init__(), _get_provider(),
  _call_drafter(), _call_verifier()

Tested:
- ✅ multi_instance_vllm.py with DeepSeek-R1-7B and R1-32B on
  separate instances (192.168.0.199:8000 and :8001)
- ✅ basic_usage.py with OpenAI (standard single-instance)
- ✅ Cascade routing works correctly
- ✅ Direct routing works correctly
- ✅ Health checks pass for both instances
Implements zero-config CascadeFlow wrapper for LangChain Python with intelligent
cascade routing and quality evaluation.

Key features:
- Drop-in replacement for LangChain chat models
- Pre-router enabled by default for query complexity analysis
- Quality-based escalation (threshold: 0.7)
- Full LCEL, streaming, and tools support
- Automatic LangSmith tag tracking (drafter/verifier)

Performance:
- 99.8% cost savings in production benchmarks
- 83.3% drafter acceptance for expert queries
- Pre-router + quality evaluation two-layer routing

Files:
- cascadeflow/integrations/langchain/__init__.py: Package exports
- cascadeflow/integrations/langchain/wrapper.py: Main CascadeFlow wrapper
- cascadeflow/integrations/langchain/utils.py: Helper utilities
Implements LangChain-compatible callback handlers for comprehensive cost and usage tracking.

Features:
- LangChain callback pattern (similar to get_openai_callback)
- Separate drafter/verifier cost tracking
- Token usage tracking (including streaming)
- Works with LangSmith tracing
- Near-zero performance overhead

Usage:
```python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")
    print(f"Total cost: ${cb.total_cost:.6f}")
```

Files:
- cost_tracking.py: Token cost calculations
- langchain_callbacks.py: LangChain callback handler implementation
Comprehensive test coverage for LangChain Python integration including
callback handlers, cost tracking, and integration tests.

Coverage:
- LangChain callback handler tests
- Cost tracking validation
- Integration tests for CascadeFlow wrapper

All 25 tests passing.

Files:
- tests/__init__.py: Test package
- tests/test_langchain_callbacks.py: Callback handler test suite
Three production-ready examples demonstrating LangChain integration features.

Examples:
1. langchain_cascade_benchmark.py: Full cascade benchmark (24 queries, 99.8% savings)
2. langchain_cost_tracking.py: Cost tracking with callback handlers
3. langchain_langsmith.py: LangSmith integration and tag tracking

Each example tested and verified working.
Updates README and API docs to document the production-ready LangChain Python
integration with zero-config setup.

Documentation updates:
- README.md: LangChain integration section with TypeScript/Python examples
- docs/api/python/config.md: Two-layer routing system documentation
  (pre-router + quality evaluation enabled by default)

Highlights:
- Pre-router enabled by default for query complexity analysis
- Quality threshold: 0.7 (optimal cost/quality balance)
- Automatic LangSmith tag tracking
Add missing types.py module with TypedDict definitions required by
LangChain integration. Comment out optional model discovery imports
until models.py is implemented.

Files:
- cascadeflow/integrations/langchain/types.py: New file with type definitions
  (TokenUsage, CostMetadata, CascadeResult, CascadeConfig)
- cascadeflow/integrations/langchain/__init__.py: Comment out models.py imports

All 25 tests passing.
# Conflicts:
#	README.md
#	cascadeflow/agent.py
#	cascadeflow/integrations/langchain/__init__.py
#	cascadeflow/integrations/langchain/types.py
#	cascadeflow/integrations/langchain/utils.py
#	cascadeflow/integrations/langchain/wrapper.py
#	examples/multi_instance_ollama.py
#	examples/multi_instance_vllm.py
#	packages/core/examples/nodejs/multi-instance-ollama.ts
#	packages/core/examples/nodejs/multi-instance-vllm.ts
- Format 9 Python files with Black
- Fix TypeScript type errors in multi-instance-vllm.ts example
  (remove references to non-existent result.usage property)

Resolves both CI code quality failures:
- Python Code Quality check
- TypeScript Code Quality check
Fixed AttributeError: 'CascadeResult' object has no attribute 'confidence'

CascadeResult does not have a 'confidence' attribute. The correct attribute
is 'quality_score' which represents the quality validation score (0-1).

Changes:
- examples/edge_device.py:380 - Changed result.confidence to result.quality_score
- Added None check since quality_score is optional
- Updated display label to "Quality Score" for accuracy

Note: Other files using .confidence are correct:
- semantic_quality_domain_detection.py uses DomainDetectionResult.confidence ✓
- local_providers_setup.py uses ModelResponse.confidence ✓
Fixes 404 error when running reasoning_models.py example.

Changes:
- Update model ID from "o1-mini" to "o1-mini-2024-09-12" (3 occurrences)
- Correct API tier requirement from Tier 3+ to Tier 5
- Update documentation to reflect correct model name

The model ID "o1-mini" does not exist in the OpenAI API and causes
404 errors. The correct dated version is "o1-mini-2024-09-12".
Fixed Python code style issues flagged by Ruff linter:
- Removed unused 'cast' import from cost_tracking.py
- Updated type hints to use Python 3.9+ style (list/dict instead of List/Dict)

Changes:
- cost_tracking.py: Removed unused import, updated List to list (2 occurrences)
- langchain_callbacks.py: Removed unused imports, updated Dict/List to dict/list

All Ruff checks now pass.
Remove unused pytest, Mock, and MagicMock imports from test file to resolve
Ruff linting errors (F401, I001).
@saschabuehrle saschabuehrle merged commit f1fe9f6 into main Nov 18, 2025
17 of 19 checks passed
@saschabuehrle saschabuehrle deleted the feat/multi-instance-docs branch November 18, 2025 22:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant