This document provides a detailed reference architecture for the RAG QuickStart, including component identification, workflow descriptions, and deployment patterns.
- Architecture Overview
- System Components
- RAG Pipeline Components
- Ingestion Pipeline Components
- Deployment Architecture
- Implementation Technologies
## Architecture Overview

The architecture diagram illustrates both the ingestion pipeline for document processing and the RAG pipeline for query handling.
The architecture consists of two main workflow pipelines:
- RAG Pipeline - Handles user queries and generates responses
- Ingestion Pipeline - Processes documents and updates the knowledge base
## System Components

### Frontend

- Provides the user interface for submitting queries and viewing responses
- Communicates with the backend services via REST APIs
- Can be deployed as a separate pod from the main application logic
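A minimal sketch of the frontend-to-backend call. The `/v1/query` endpoint, payload shape, and in-cluster service name are all hypothetical; the actual REST contract is defined by the application service:

```python
import requests

# Hypothetical endpoint and payload shape; the real API contract is defined
# by the application (llama-stack) service, not by this sketch.
API_BASE = "http://application-service:8080"  # assumed in-cluster service name

def submit_query(question: str) -> dict:
    """POST a user query to the backend and return the JSON response."""
    resp = requests.post(f"{API_BASE}/v1/query", json={"query": question}, timeout=60)
    resp.raise_for_status()
    return resp.json()
```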
## RAG Pipeline Components

### Query Input

- Captures user queries from the frontend
- Formats queries for downstream processing
### Input Safety Shield

- Screens incoming queries for harmful content, manipulative prompts, or injection attacks
- Implements content moderation to detect inappropriate requests
- May use a combination of rule-based filters and ML models
- Rejects or sanitizes potentially harmful queries
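A sketch of how such screening might be wired up with the llama-stack client, assuming a Llama Guard shield registered under the ID `llama-guard` and an assumed service URL; the regex prefilter is purely illustrative:

```python
import re
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://application-service:8321")  # assumed URL

# Illustrative rule-based prefilter; a real deployment would use a curated rule set.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"reveal.*system prompt"]

def screen_query(query: str) -> bool:
    """Return True if the query passes both rule-based and model-based checks."""
    if any(re.search(p, query, re.IGNORECASE) for p in INJECTION_PATTERNS):
        return False
    result = client.safety.run_shield(
        shield_id="llama-guard",  # assumed shield registration ID
        messages=[{"role": "user", "content": query}],
        params={},
    )
    return result.violation is None  # no violation reported -> safe to process
```

The same `run_shield` pattern, applied to the assistant message instead of the user message, underlies the Output Safety Shield described later.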
### Retriever

- Routes queries to appropriate retrieval systems
- Converts queries into vector embeddings using the configured embedding model
- Interfaces with the vector database for similarity search
- May include cross-encoders for more accurate retrieval
- Can be implemented using frameworks like LangChain
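A sketch of the retrieval step using LangChain with the stack's all-MiniLM-L6-v2 embedding model and a PGVector store; the connection string and collection name are assumptions:

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_postgres import PGVector

# Embedding model matches the stack's all-MiniLM-L6-v2; connection details are assumed.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = PGVector(
    embeddings=embeddings,
    collection_name="rag_documents",  # assumed collection name
    connection="postgresql+psycopg://user:pass@vector-db:5432/rag",  # assumed DSN
)

# Embed the query and return the top-k most similar chunks.
docs = store.similarity_search("How do I deploy the LLM service?", k=5)
for doc in docs:
    print(doc.page_content[:80])
```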
### Vector Database

- Stores document embeddings and metadata
- Performs efficient similarity searches
- Deployed as a separate container/pod
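At the database level, similarity search reduces to a pgvector distance query. The table layout below is illustrative and is not the schema the RAG framework actually creates:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode("How do I deploy the LLM service?")  # 384-dim vector

# Illustrative DSN and table name; the real schema is managed by the framework.
with psycopg.connect("postgresql://user:pass@vector-db:5432/rag") as conn:
    register_vector(conn)
    rows = conn.execute(
        """
        SELECT content, metadata, embedding <=> %s AS distance
        FROM document_chunks   -- hypothetical table name
        ORDER BY distance      -- <=> is pgvector's cosine distance operator
        LIMIT 5
        """,
        (query_embedding,),
    ).fetchall()
```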
### Reranker

- Takes initial retrieval results and improves them
- Reranks documents based on relevance to the query
- May filter out irrelevant or redundant information
- Optimizes context for the LLM
- Not yet implemented by Llama Stack
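Since reranking is not yet provided by Llama Stack, a standalone cross-encoder sketch illustrates the idea; the model choice is an assumption:

```python
from sentence_transformers import CrossEncoder

# Model choice is an assumption; any cross-encoder reranker works the same way.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair and keep the top_k most relevant chunks."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```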
### Response Generator

- Processes the query and retrieved context to generate a response
- Formats prompts with appropriate instructions and context
- Interfaces with the LLM service (e.g., vLLM running Llama models)
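vLLM exposes an OpenAI-compatible API, so prompt assembly and generation can be sketched as follows; the service URL and prompt wording are assumptions:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the URL is an assumed in-cluster address.
llm = OpenAI(base_url="http://llm-service:8000/v1", api_key="EMPTY")

def generate_answer(query: str, context_chunks: list[str]) -> str:
    """Format retrieved context into the prompt and generate a grounded answer."""
    context = "\n\n".join(context_chunks)
    response = llm.chat.completions.create(
        model="meta-llama/Llama-3.2-3B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context. "
                                          "If the context is insufficient, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        temperature=0.1,
    )
    return response.choices[0].message.content
```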
### Output Safety Shield

- Screens generated responses for harmful content
- Verifies factual accuracy and alignment with retrieved information
- Checks for hallucinations or unsupported claims
- Ensures responses meet safety and compliance requirements
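The content screen reuses the `run_shield` call shown for input screening, applied to the assistant message. The grounding check below is a deliberately crude word-overlap heuristic, not a production hallucination detector:

```python
def grounding_score(answer: str, context_chunks: list[str]) -> float:
    """Fraction of answer words that also appear in the retrieved context.

    A crude proxy for 'supported by the sources'; low scores flag a response
    for review. Real systems use NLI models or LLM-based groundedness checks.
    """
    context_words = set(" ".join(context_chunks).lower().split())
    answer_words = [w for w in answer.lower().split() if w.isalpha()]
    if not answer_words:
        return 0.0
    return sum(w in context_words for w in answer_words) / len(answer_words)
```

A simple policy would block or flag any response whose score falls below a tuned threshold before it reaches the user.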
### Response

- The final, validated response delivered to the user
- Formatted appropriately for presentation in the UI
- May include citations or references to source material
- Could incorporate confidence scores or alternative answers
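One possible shape for the response payload returned to the frontend; the field names are purely illustrative, not an API defined by the QuickStart:

```python
from dataclasses import dataclass, field

@dataclass
class RAGResponse:
    """Illustrative response payload; field names are assumptions."""
    answer: str
    citations: list[str] = field(default_factory=list)  # references to source chunks
    confidence: float | None = None                     # optional grounding/model score
```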
## Ingestion Pipeline Components

### Document Sources

- S3 Bucket: Cloud storage for document files
- URL: Documents fetched from remote URLs for ingestion
- Uploads: Direct file uploads from users via the frontend
### Ingestion Methods

- OpenShift AI Pipelines: Orchestrated workflows for complex document processing
- Python Script: Custom scripts for specialized document handling
- Frontend UI or Retriever Listener: User-triggered document processing
### Document Processing

- Chunks documents into appropriate segments using Docling
- Generates embeddings for each chunk
- Handles document metadata extraction
- Prepares data for insertion into the vector database
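A sketch of the processing step using Docling's converter and `HybridChunker` together with the stack's embedding model; the input path is a placeholder:

```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from sentence_transformers import SentenceTransformer

converter = DocumentConverter()
chunker = HybridChunker()  # token-aware chunking from docling
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Convert a source document, split it into chunks, and embed each chunk.
doc = converter.convert("manual.pdf").document  # placeholder path
records = []
for chunk in chunker.chunk(doc):
    records.append({
        "text": chunk.text,
        "embedding": embedder.encode(chunk.text),
        "metadata": chunk.meta.export_json_dict(),  # headings, provenance, etc.
    })
```

Each record is then inserted into the PGVector store used by the retriever.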
## Deployment Architecture

This reference architecture can be deployed on OpenShift with the following pod structure:
| Pod Type | Purpose | Key Characteristics |
|---|---|---|
| Frontend | User interface | Contains the UI, communicates with Application Pod via APIs |
| Input Safety Shield | Input content moderation | Screens incoming queries for harmful content, implements query validation and sanitization, can be independently scaled |
| Application (llama-stack) | RAG orchestration | Houses core RAG logic, implements query processing and response generation, contains LangChain implementation for retrieval and reranking |
| LLM Service | Language model inference | Runs vLLM with Llama models, optimized for GPU utilization, deployed via KServe InferenceService |
| Vector Database | Embedding storage and search | Manages PGVector store for document embeddings, handles similarity search requests, requires persistent storage, deployed as StatefulSet |
| Output Safety Shield | Output content validation | Screens generated responses for harmful content, verifies factual accuracy and alignment, can be independently scaled |
| Embedding Service | Vector embeddings | Generates embeddings for documents and queries, may be combined with document processing components, scales based on workload |
| Ingestion Pipeline | Document processing | Handles workflows via Kubeflow Pipelines, uses batch processing for large document sets, connected to S3-compatible storage (MinIO) |
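A minimal Kubeflow Pipelines (kfp v2) skeleton for the ingestion workflow; the component body and the S3 URI default are placeholders:

```python
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def ingest_documents(source_uri: str) -> int:
    """Placeholder component: fetch documents from S3-compatible storage,
    chunk and embed them, and write the results to the vector database."""
    # Real logic would use a MinIO/S3 client, Docling, and PGVector as sketched above.
    print(f"Ingesting from {source_uri}")
    return 0

@dsl.pipeline(name="rag-ingestion")
def ingestion_pipeline(source_uri: str = "s3://rag-documents/"):  # assumed bucket
    ingest_documents(source_uri=source_uri)

# Compile to YAML that OpenShift AI's pipeline runtime can execute.
compiler.Compiler().compile(ingestion_pipeline, "rag_ingestion_pipeline.yaml")
```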
## Implementation Technologies

The RAG QuickStart uses the following technology stack:
| Component | Technology |
|---|---|
| Application Framework | Llama Stack |
| LLM Service | vLLM with meta-llama/Llama-3.2-3B-Instruct |
| Vector Database | PostgreSQL + PGVector |
| Container Orchestration | OpenShift + OpenShift AI |
| RAG Framework | LangChain, LlamaIndex |
| Safety Models | meta-llama/Llama-Guard-3-8B |
| Embedding Model | all-MiniLM-L6-v2 |
| Document Processing | Docling |
| Pipeline Orchestration | Kubeflow Pipelines |
| Object Storage | MinIO (S3-compatible) |
