AWS Community ID Demo

Description

This repository contains the Infrastructure as Code (IaC) for building a Cost-Optimized, Multi-Tenant Machine Learning Platform on Amazon EKS, leveraging Karpenter for intelligent autoscaling and Ray for distributed workload orchestration.

The core goal is to demonstrate Resource Isolation (ensuring tenants do not overuse shared GPU/CPU resources) and GPU-Aware Scheduling at a minimal operational cost.

🔴 Disclaimer 🔴

This demo focuses primarily on infrastructure and architectural design, not on model development or fine-tuning of AI/LLM workloads.

Model training and optimization examples may be added in a future update 😀

Requirements

Before starting the deployment, ensure you have:

AWS Account: With AdministratorAccess IAM permissions (for demo only).
AWS Service Quota: Approved for at least 8 vCPUs for G and VT instances (critical for g4dn.xlarge).
AWS CLI, kubectl, helm, and Terraform (>= 1.0) installed and configured.
AWS Credentials: Configured locally via aws configure.
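
As a quick preflight, the tool requirements above can be checked with a small script. This is only a sketch: the function name `missing_tools` is hypothetical and not part of this repository, and the check only confirms that each executable is on PATH. It does not verify the Terraform version, the service quota, or that credentials are configured.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = ["aws", "kubectl", "helm", "terraform"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found on PATH.")
```

An empty result means all four executables were found; anything listed should be installed before continuing.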

Architecture

Component	Description
JupyterHub	Multi-tenant web platform for launching isolated notebook servers per user, with Native Auth for built-in user authentication and session management
Ray	Distributed computing engine for scaling ML workloads across nodes (CPU/GPU)
Karpenter	Dynamic node autoscaler for AWS EKS
NVIDIA Plugin	Enables GPU workloads on GPU-based EC2 nodes
Terraform	Infrastructure as Code to provision EKS, networking, and Helm releases

Deployment Guide

Create an S3 bucket for Terraform state. You can use the make target from the repository root:

make create-bucket

Navigate to infra/terraform and update providers.tf first. Make sure the S3 backend is configured correctly for storing Terraform state:

# Store ".tfstate"
backend "s3" {
  bucket = "<BUCKET_NAME>"
  key    = "terraform.tfstate"
  region = "<YOUR_BEST_REGION>"
}
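
It is easy to leave a template value such as `<YOUR_BEST_REGION>` in place by accident. The sketch below scans a configuration string for angle-bracket placeholders in upper case. The helper name `find_placeholders` is hypothetical, and the pattern assumes placeholders are written as `<UPPER_CASE>` with both brackets present.

```python
import re

# Matches placeholders written like <YOUR_BEST_REGION>: an opening bracket,
# an upper-case identifier, and a closing bracket.
PLACEHOLDER = re.compile(r"<[A-Z][A-Z0-9_]*>")


def find_placeholders(text):
    """Return every unreplaced <UPPER_CASE> placeholder in text, in order."""
    return PLACEHOLDER.findall(text)
```

To use it, read the contents of providers.tf into a string and pass that string to `find_placeholders`. An empty list suggests every placeholder has been filled in.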

Before running Terraform commands, update the following variable with your own IAM User ARN in infra/terraform/terraform.tfvars:

console_admin_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:user/<YOUR_USERNAME>"
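
A malformed ARN here tends to surface only late in the apply, so a shape check beforehand can save time. The following is a minimal sketch, not part of this repository: `is_iam_user_arn` is a hypothetical helper, and it assumes the standard `aws` partition and a 12-digit account ID. It checks the format only and cannot tell whether the user actually exists.

```python
import re

# Assumptions: the standard "aws" partition, a 12-digit account ID, and the
# characters IAM allows in user names (letters, digits, and + = , . @ _ -).
# "/" is also accepted so that users created under a path still match.
IAM_USER_ARN = re.compile(r"arn:aws:iam::\d{12}:user/[A-Za-z0-9+=,.@_/-]+")


def is_iam_user_arn(value):
    """Return True if value has the shape of an IAM user ARN."""
    return IAM_USER_ARN.fullmatch(value) is not None
```

The untouched template value fails this check because `<YOUR_ACCOUNT_ID>` is not a 12-digit number.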

Then run the Terraform commands and make sure each one completes without errors:

$ terraform init
$ terraform plan
$ terraform apply

Node Group	Instance Type	Purpose
head-group	`m5.large`	Control plane + JupyterHub + Ray Head
gpu-workers-group	`g4dn.xlarge`	GPU-based workloads for Ray Workers

NB: You can adjust these instance types in main.tf based on your AWS quota and cost preference.
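
When picking a GPU instance type, it helps to work out how many nodes the G and VT quota allows, since that quota is counted in vCPUs rather than instances. The sketch below does that arithmetic for a few g4dn sizes. The function name `max_nodes` is hypothetical, and the calculation assumes the whole quota is available to this cluster.

```python
# vCPU counts per instance size as published by AWS for the g4dn family.
G4DN_VCPUS = {"g4dn.xlarge": 4, "g4dn.2xlarge": 8, "g4dn.4xlarge": 16}


def max_nodes(instance_type, vcpu_quota):
    """Return how many instances of instance_type fit within vcpu_quota."""
    return vcpu_quota // G4DN_VCPUS[instance_type]
```

With the 8 vCPU quota listed under Requirements, `max_nodes("g4dn.xlarge", 8)` gives 2 GPU workers, while a g4dn.4xlarge would not fit at all.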

To create the Karpenter NodePool, run make create-nodepool from the repository root:

$ make create-nodepool
