This repository collects papers and code on AIOps, Ops4AI, LLMs, software engineering, observability, and reliability.
- 24_holmesgpt [code]
- 23_k8sgpt [code]
- 24_OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? [paper] [code] [data]
- 24_ASE_LasRCA: The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier [paper] [code]
- 24_SIGOPSReview_LLexus: an AI agent system for incident management [paper]
- 23_arXiv_RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models [paper]
- 24_EuroSys_RCACOPILOT: Automatic Root Cause Analysis via Large Language Models for Cloud Incidents [paper]
- 24_FSE_MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models [paper]
- 24_arXiv_LLMAD: Large Language Models can Deliver Accurate and Interpretable Time Series Anomaly Detection [paper]
AIOps Challenge

| Year | Modalities | Domain |
| --- | --- | --- |
| 2020 | M, T | Telecom |
| 2021 | M, T, L | Bank |
| 2022 | M, T, L | Market |
- 24_arXiv_A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends [paper]
- 24_Root Cause Analysis for Distributed Systems [paper]
- 23_FSE_Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-Modal Observability Data [paper] [code] ✅
- 23_ICSE_Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data [paper] [code] ✅
- 24_KDD_Microservice root cause analysis with limited observability through intervention recognition in the latent space [paper] [code&data]
- 24_FSE_Chain-of-event: Interpretable root cause analysis for microservices through automatically learning weighted event causal graph [paper] [code]
- 24_ASE_ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems [paper] [code&data] ✅ AIOps Challenge 2021, data for GNN
- 24_ASE_Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization [paper]
- 24_ASE_RCAEval: Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? [paper] [code]
- 24_ICDE_ADecimo: Model Selection for Time Series Anomaly Detection [paper] [code]
- 23_ISSRE_AutoKAD: Empowering KPI Anomaly Detection with Label-Free Deployment [paper] [code] [data]
- 24_VLDB_AutoTSAD: Unsupervised Holistic Anomaly Detection for Time Series Data [paper] [code]
- 25_ICSE_ADAMAS: Adaptive Domain-Aware Performance Anomaly Detection in Cloud Service Systems [paper] [code]
- MicroServo: A Scenario-Oriented Benchmark for Assessing AIOps Algorithms in Microservice Management [paper] [code]
- 24_KDD_Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer [paper] [code]
- 24_SOSP(Best paper)_FBDetect: Catching Tiny Performance Regressions at Hyperscale through In-Production Monitoring [paper]
- 24_ASE_End-to-End AutoML for Unsupervised Log Anomaly Detection [paper]
- cloud-incident: https://github.com/yinfangchen/cloud-incident-lit