Tenant repository bootstrapped by k8s-infrastructure that contains the manifests for AI related applications
This repository manages AI workloads on Kubernetes using GitOps with Flux CD. It includes deployments for LLM inference services, web UIs, and model serving infrastructure.
- Path:
./apps/open-webui - Type: HelmRelease (ollama-webui chart)
- Description: Web interface for interacting with LLMs
- Features:
- Persistent storage via PVC
- Integration with Ollama backend
- Model access control bypass enabled
- Path:
./apps/comfyui - Type: Native Kubernetes resources
- Description: Graph-based interface for Stable Diffusion
- Features:
- Persistent volume for model caching
- Replication source/destination for data synchronization
- RESTic backup support
- Path:
./apps/opencode - Type: Native Kubernetes resources
- Description: Coding agent and AI workspace for interactive development
- Features:
- NVIDIA GPU support for accelerated model training and inference
- Persistent storage (100Gi PVC)
- Pre-configured development environment with tools
- RESTic backup support
- Replication source/destination for data synchronization
- Integration with GitHub, HuggingFace, and n8n via tokens
- Path:
./infrastructure/ollama - Description: Lightweight LLM inference server
- Features:
- Native GPU support
- Simple HTTP API
- Model caching
- Path:
./infrastructure/llamacpp - Description: High-performance C/C++ inference engine optimized for CPU and GPU
- Features:
- Qwen3.5-35B model support with full precision
- 256k context window for agentic AI workflows
- StatefulSet deployment with persistent storage
- Prometheus ServiceMonitor integration
- Ingress routing via HTTPRoute
- Path:
./infrastructure/vllm - Description: High-throughput LLM serving with PagedAttention
- Use Case: Production workloads requiring high concurrency
- Path:
./infrastructure/kserve - Example:
./examples/llminferenceservice.yaml - Description: Kubernetes-native ML serving platform
- Features:
- LLMInferenceService CRD
- Custom model templates
- MCP Kubernetes: Kubernetes model context protocol server
- MCP Grafana: Grafana monitoring integration
- MCP GitHub: GitHub API integration
- MCP Photoprism: Photo management (mmontes & xiaowen)
βββ apps/ # Application deployments
β βββ comfyui/ # ComfyUI deployment
β βββ n8n/ # n8n workflow automation
β βββ opencode/ # opencode AI development workspace
β βββ open-webui/ # Open WebUI deployment
βββ clusters/ # Cluster-specific configurations
β βββ homelab/
β βββ apps.yaml # Application Kustomizations
β βββ infrastructure.yaml
β βββ namespaces.yaml
βββ examples/ # Example configurations
β βββ llminferenceservice.yaml
βββ infrastructure/ # Shared infrastructure
βββ kserve/ # KServe ML serving
βββ vllm/ # vLLM serving engine
βββ llamacpp/ # llama.cpp inference engine
βββ lws/ # LeaderWorkerSet
βββ ollama/ # Ollama LLM backend
βββ mcp-*/ # MCP server integrations
LLM benchmarks using llama.cpp on Kubernetes: mmontes11/llm-bench
Key benchmarks (NVIDIA RTX PRO 4000 Blackwell, 23.5 GiB VRAM):
- Qwen3.5-35B-A3B (Q4_K_Medium): 2995 t/s @ 2048 prompts, 84.87 t/s token generation
- GPT-OSS 20B (MXFP4 MoE): 2704 t/s @ 2048 prompts, 109.68 t/s token generation