feat: LangChain Python integration fixes and multi-instance docs #75
Merged
saschabuehrle merged 16 commits into main on Nov 18, 2025
Conversation
Add comprehensive documentation and examples for running draft and verifier models on separate Ollama or vLLM instances. This enables optimal GPU utilization in multi-GPU systems and distributed deployments.

Changes:
- Update .env.example with multi-instance configuration sections
  - OLLAMA_DRAFT_URL and OLLAMA_VERIFIER_URL
  - VLLM_DRAFT_URL and VLLM_VERIFIER_URL
  - References to TypeScript, Python, and Docker examples
- Add TypeScript examples
  - multi-instance-ollama.ts: Three configuration scenarios with health checks
  - multi-instance-vllm.ts: vLLM-specific features and API key support
- Add Python examples
  - multi_instance_ollama.py: Async implementation with health checks
  - multi_instance_vllm.py: Includes PagedAttention and batching notes
- Add Docker Compose setup for multi-GPU deployment
  - GPU device assignment (draft on GPU 0, verifier on GPU 1)
  - Separate ports (11434 and 11435)
  - Health checks and volume isolation
  - Comprehensive README with troubleshooting

Note: Multi-instance support already exists via ModelConfig.baseUrl. This commit adds documentation and examples for the existing feature.
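As a rough illustration of the .env-driven setup above, the draft and verifier URLs can be read from the environment and attached to per-model settings. Plain dicts stand in for ModelConfig here; only the base_url field name comes from this PR, and the model names and defaults are illustrative assumptions.

```python
import os

# Hedged sketch: wire the environment variables described above into
# per-model settings. Dicts stand in for ModelConfig (illustrative).
draft_url = os.getenv("OLLAMA_DRAFT_URL", "http://localhost:11434")
verifier_url = os.getenv("OLLAMA_VERIFIER_URL", "http://localhost:11435")

# Each model points at its own instance (e.g. draft on GPU 0, verifier on GPU 1).
draft_model = {"name": "llama3.2:1b", "provider": "ollama", "base_url": draft_url}
verifier_model = {"name": "llama3.1:8b", "provider": "ollama", "base_url": verifier_url}

print(draft_model["base_url"], verifier_model["base_url"])
```

With the defaults above, the two models resolve to the two separate ports (11434 and 11435) used by the Docker Compose setup.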
- Add Multi-Instance Ollama and vLLM to advanced examples tables in main README
  - Python advanced examples section (lines 402-403)
  - TypeScript advanced examples section (lines 439-440)
- Update examples/README.md with comprehensive documentation
  - Add examples to 'Find by Feature' quick reference
  - Update table of contents (3 examples instead of 1)
  - Add detailed sections for both multi-instance examples
  - Include Docker Compose guide references
  - Document use cases, hardware requirements, and performance benefits

All documentation now consistently references the new multi-instance examples.
- Remove invalid quality_threshold parameter from CascadeAgent
- Add quality_threshold to ModelConfig (0.7 for draft, 0.95 for verifier)
- Remove non-existent usage attribute access
- Match cascade setup from basic_usage.py
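A small sketch of the corrected configuration described above: quality_threshold belongs on each model's config, not on the agent. Dicts stand in for ModelConfig; only the two threshold values (0.7 and 0.95) come from this commit, the rest is illustrative.

```python
# Sketch of the fix: per-model quality thresholds instead of an agent-level one.
draft_config = {"name": "drafter", "quality_threshold": 0.7}
verifier_config = {"name": "verifier", "quality_threshold": 0.95}

# Invalid (the bug this commit removes):
#   CascadeAgent(models=[...], quality_threshold=0.7)
# Valid: the agent reads each model's own threshold instead.
print(draft_config["quality_threshold"], verifier_config["quality_threshold"])
```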
BREAKING BUG FIX: VLLMProvider and other providers now respect ModelConfig.base_url for multi-instance deployments.

Problem:
- Providers were instantiated once per provider type, ignoring the ModelConfig.base_url and api_key parameters
- Multi-instance setups (draft on GPU 0, verifier on GPU 1) failed because both models tried to use the same provider instance
- Examples: multi_instance_vllm.py and multi_instance_ollama.py couldn't connect to separate instance URLs

Solution:
- CascadeAgent._init_providers() now creates separate provider instances for each model with model-specific base_url/api_key
- Added model_providers dict mapping model.name → provider instance
- WholeResponseCascade and CascadeAgent use a _get_provider() helper to look up model-specific providers (with a backwards-compatible fallback)
- Maintains full backwards compatibility for single-instance setups

Backwards compatibility:
- ✅ Tested with basic_usage.py (OpenAI standard setup)
- ✅ All existing functionality preserved
- ✅ Only activates when ModelConfig.base_url is set
- ✅ Falls back to provider-type lookup for existing code

Files changed:
- cascadeflow/agent.py: _init_providers(), _get_provider(), all direct routing methods (_execute_direct_with_timing, etc.)
- cascadeflow/core/cascade.py: __init__(), _get_provider(), _call_drafter(), _call_verifier()

Tested:
- ✅ multi_instance_vllm.py with DeepSeek-R1-7B and R1-32B on separate instances (192.168.0.199:8000 and :8001)
- ✅ basic_usage.py with OpenAI (standard single-instance)
- ✅ Cascade routing works correctly
- ✅ Direct routing works correctly
- ✅ Health checks pass for both instances
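The fix above boils down to a mapping pattern: one provider instance per model, keyed by model name, instead of one shared instance per provider type. The sketch below illustrates that pattern under stated assumptions; DummyProvider and the function names are stand-ins, not cascadeflow's real classes.

```python
# Illustrative sketch: per-model provider instances so each model can use
# its own base_url. DummyProvider is a stand-in, not the real provider.
class DummyProvider:
    def __init__(self, base_url=None, api_key=None):
        self.base_url = base_url
        self.api_key = api_key

def init_providers(models):
    """Mirror of the model_providers dict: model name -> its own provider."""
    return {
        m["name"]: DummyProvider(base_url=m.get("base_url"), api_key=m.get("api_key"))
        for m in models
    }

def get_provider(model_providers, fallback, model):
    # Model-specific lookup with a backwards-compatible fallback,
    # as in the _get_provider() helper described above.
    return model_providers.get(model["name"], fallback)

models = [
    {"name": "deepseek-r1:7b", "base_url": "http://192.168.0.199:8000"},
    {"name": "deepseek-r1:32b", "base_url": "http://192.168.0.199:8001"},
]
providers = init_providers(models)
print(get_provider(providers, DummyProvider(), models[0]).base_url)
```

The fallback argument preserves the old provider-type behavior for models without a base_url, which is why single-instance setups keep working unchanged.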
Implements a zero-config CascadeFlow wrapper for LangChain Python with intelligent cascade routing and quality evaluation.

Key features:
- Drop-in replacement for LangChain chat models
- Pre-router enabled by default for query complexity analysis
- Quality-based escalation (threshold: 0.7)
- Full LCEL, streaming, and tools support
- Automatic LangSmith tag tracking (drafter/verifier)

Performance:
- 99.8% cost savings in production benchmarks
- 83.3% drafter acceptance for expert queries
- Pre-router + quality evaluation two-layer routing

Files:
- cascadeflow/integrations/langchain/__init__.py: Package exports
- cascadeflow/integrations/langchain/wrapper.py: Main CascadeFlow wrapper
- cascadeflow/integrations/langchain/utils.py: Helper utilities
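The two-layer routing described above can be sketched as follows. Layer 1 is a pre-router on query complexity; layer 2 accepts the draft only when its quality score clears the 0.7 threshold. The complexity heuristic here is an invented illustration, not CascadeFlow's actual pre-router.

```python
# Minimal sketch of two-layer routing (illustrative heuristics only).
QUALITY_THRESHOLD = 0.7  # threshold named in this commit

def pre_route(query: str) -> str:
    # Assumption for illustration: very long queries skip the drafter.
    return "verifier" if len(query.split()) > 50 else "drafter"

def route(query: str, draft_quality: float) -> str:
    if pre_route(query) == "verifier":
        return "verifier"   # layer 1: complexity pre-router escalates
    if draft_quality >= QUALITY_THRESHOLD:
        return "drafter"    # layer 2: draft accepted
    return "verifier"       # layer 2: quality-based escalation

print(route("What is Python?", draft_quality=0.85))  # drafter
```

Escalating only when either layer flags the query is what lets cheap drafter responses handle the bulk of traffic.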
Implements LangChain-compatible callback handlers for comprehensive cost and usage tracking.
Features:
- LangChain callback pattern (similar to get_openai_callback)
- Separate drafter/verifier cost tracking
- Token usage tracking (including streaming)
- Works with LangSmith tracing
- Near-zero performance overhead
Usage:
```python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")
    print(f"Total cost: ${cb.total_cost:.6f}")
```
Files:
- cost_tracking.py: Token cost calculations
- langchain_callbacks.py: LangChain callback handler implementation
Comprehensive test coverage for the LangChain Python integration, including callback handlers, cost tracking, and integration tests.

Coverage:
- LangChain callback handler tests
- Cost tracking validation
- Integration tests for the CascadeFlow wrapper

All 25 tests passing.

Files:
- tests/__init__.py: Test package
- tests/test_langchain_callbacks.py: Callback handler test suite
Three production-ready examples demonstrating LangChain integration features.

Examples:
1. langchain_cascade_benchmark.py: Full cascade benchmark (24 queries, 99.8% savings)
2. langchain_cost_tracking.py: Cost tracking with callback handlers
3. langchain_langsmith.py: LangSmith integration and tag tracking

Each example tested and verified working.
Updates the README and API docs to document the production-ready LangChain Python integration with zero-config setup.

Documentation updates:
- README.md: LangChain integration section with TypeScript/Python examples
- docs/api/python/config.md: Two-layer routing system documentation (pre-router + quality evaluation enabled by default)

Highlights:
- Pre-router enabled by default for query complexity analysis
- Quality threshold: 0.7 (optimal cost/quality balance)
- Automatic LangSmith tag tracking
Add the missing types.py module with TypedDict definitions required by the LangChain integration. Comment out the optional model discovery imports until models.py is implemented.

Files:
- cascadeflow/integrations/langchain/types.py: New file with type definitions (TokenUsage, CostMetadata, CascadeResult, CascadeConfig)
- cascadeflow/integrations/langchain/__init__.py: Comment out models.py imports

All 25 tests passing.
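For a sense of what such a types module looks like, here is a sketch of two of the TypedDicts. Only the class names (TokenUsage, CostMetadata) come from the commit; every field below is an assumption for illustration.

```python
from typing import TypedDict

# Hedged sketch of types.py definitions; field names are assumptions.
class TokenUsage(TypedDict):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class CostMetadata(TypedDict):
    drafter_cost: float
    verifier_cost: float
    total_cost: float

usage: TokenUsage = {"prompt_tokens": 12, "completion_tokens": 30, "total_tokens": 42}
print(usage["total_tokens"])
```

TypedDicts keep the callback metadata as plain dicts at runtime while still giving static type checkers full field information.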
# Conflicts:
#   README.md
#   cascadeflow/agent.py
#   cascadeflow/integrations/langchain/__init__.py
#   cascadeflow/integrations/langchain/types.py
#   cascadeflow/integrations/langchain/utils.py
#   cascadeflow/integrations/langchain/wrapper.py
#   examples/multi_instance_ollama.py
#   examples/multi_instance_vllm.py
#   packages/core/examples/nodejs/multi-instance-ollama.ts
#   packages/core/examples/nodejs/multi-instance-vllm.ts
- Format 9 Python files with Black
- Fix TypeScript type errors in the multi-instance-vllm.ts example (remove references to the non-existent result.usage property)

Resolves both CI code quality failures:
- Python Code Quality check
- TypeScript Code Quality check
Fixed AttributeError: 'CascadeResult' object has no attribute 'confidence'

CascadeResult does not have a 'confidence' attribute. The correct attribute is 'quality_score', which represents the quality validation score (0-1).

Changes:
- examples/edge_device.py:380 - Changed result.confidence to result.quality_score
- Added a None check, since quality_score is optional
- Updated the display label to "Quality Score" for accuracy

Note: Other files using .confidence are correct:
- semantic_quality_domain_detection.py uses DomainDetectionResult.confidence ✓
- local_providers_setup.py uses ModelResponse.confidence ✓
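The corrected access pattern looks roughly like this. The CascadeResult dataclass below is a stand-in with only the relevant field; the real class has more attributes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CascadeResult:  # stand-in with just the field this fix touches
    quality_score: Optional[float] = None

def format_quality(result: CascadeResult) -> str:
    # Guard first: quality_score is optional and may be None.
    if result.quality_score is None:
        return "Quality Score: n/a"
    return f"Quality Score: {result.quality_score:.2f}"

print(format_quality(CascadeResult(quality_score=0.93)))  # Quality Score: 0.93
```

The None check is what the commit adds on top of the attribute rename; without it, formatting a result that skipped quality validation would raise a TypeError.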
Fixes a 404 error when running the reasoning_models.py example.

Changes:
- Update the model ID from "o1-mini" to "o1-mini-2024-09-12" (3 occurrences)
- Correct the API tier requirement from Tier 3+ to Tier 5
- Update documentation to reflect the correct model name

The model ID "o1-mini" does not exist in the OpenAI API and causes 404 errors. The correct dated version is "o1-mini-2024-09-12".
Fixed Python code style issues flagged by the Ruff linter:
- Removed the unused 'cast' import from cost_tracking.py
- Updated type hints to Python 3.9+ style (list/dict instead of List/Dict)

Changes:
- cost_tracking.py: Removed unused import, updated List to list (2 occurrences)
- langchain_callbacks.py: Removed unused imports, updated Dict/List to dict/list

All Ruff checks now pass.
Remove unused pytest, Mock, and MagicMock imports from test file to resolve Ruff linting errors (F401, I001).
CascadeFlow v0.6.0 - Multi-Instance Support and LangChain Integration
This PR includes all changes for the v0.6.0 release.
New Features
🔗 LangChain Integration (Python & TypeScript)
Production-ready LangChain integration for both Python and TypeScript with zero-config setup.
Features:
Documentation:
Examples:
🖥️ Multi-Instance Provider Support
Run draft and verifier models on separate Ollama/vLLM instances for optimized deployment.
Documentation:
Examples:
🌐 OpenRouter Provider
Access 200+ models through unified OpenRouter integration.
Documentation:
Bug Fixes
- Providers now respect the ModelConfig base_url configuration for multi-instance deployments
- Fixed the OpenAI reasoning model ID (o1-mini-2024-09-12) and documented the Tier 5 requirement
- Fixed an AttributeError in the edge_device.py example
Testing
All CI checks are passing.
Ready for release to PyPI and npm.