How SRE and GenAI Work Together to Reduce Downtime at eBay: Architect Insights from KubeCon EU

Published on April 5

KubeCon EU Keynote by Vijay Samuel: He shared the eBay SRE team's experience of enhancing incident response with ML and LLMs.
Infrastructure Growth: eBay's platform has grown to more than 4,000 microservices, generating 15 petabytes of logs, 10 billion active time series daily, and 10 million spans per second.
Challenges of Manual Triage: At this scale, manual incident triage that depends on human capacity was cumbersome and error-prone.
Groot System: The team moved away from static threshold-based alerts, attached root causes to alerts, and added auto-remediation for minor issues. Anomaly detection cut incident detection time to under 4 minutes, but it failed on new incident types.
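
To make the contrast between a static threshold and anomaly detection concrete, here is a minimal sketch using a rolling z-score. The source does not describe Groot's actual detector, so the function name, the window size, the z-score cutoff, and the sample data are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, z_threshold=3.0):
    """Return indices whose value deviates strongly from the trailing window.

    Illustrative only: this is a generic rolling z-score, not Groot's detector.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Guard against a flat window, where sigma is zero.
            if sigma > 0 and abs(value - mu) / sigma > z_threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies


# Hypothetical error-rate samples: the spike at index 7 stays below a
# static limit of 50, so a fixed threshold never fires.
error_rates = [10, 11, 9, 10, 12, 11, 10, 30, 11, 10]
static_alerts = [i for i, v in enumerate(error_rates) if v > 50]
adaptive_alerts = rolling_zscore_anomalies(error_rates)

print(static_alerts)    # []
print(adaptive_alerts)  # [7]
```

The detector only knows what "normal" looked like in the recent window, which is in line with the limitation noted above: a failure mode with a shape the detector was not built for can still go unnoticed.
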
LLM Experiments: The team embraced the promise of LLMs but learned about hallucinations along the way. Output became more accurate when the model was given very "crisp" context.
Using LLM Capabilities: LLMs were applied to small amounts of information within an investigation. "Explainers" provide context during incident investigation.
Dividing and Conquering: The team extracted the critical path, inspired by Uber's CRISP whitepaper, and used "few-shot prompting" to teach the algorithm.
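
Critical path extraction can be sketched as follows. This is a simplified illustration of the general idea (walk backwards from the end of a span and keep only the children the parent actually waited on), not eBay's implementation and not a full reproduction of CRISP. The `Span` type, its fields, and the example trace are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical minimal span: a name, start/end timestamps, child spans.
    name: str
    start: int
    end: int
    children: list["Span"] = field(default_factory=list)


def critical_path(span):
    """Return the span names on the critical path, in execution order."""
    path = []
    cursor = span.end
    # Visit children from latest-finishing to earliest. A child is kept only
    # if it finished before the cursor, which then moves to that child's start.
    for child in sorted(span.children, key=lambda s: s.end, reverse=True):
        if child.end <= cursor:
            path = critical_path(child) + path
            cursor = child.start
    return [span.name] + path


# Hypothetical trace: "auth" overlaps "inventory" and finishes after
# "inventory" has started, so it is not on the critical path.
trace = Span("checkout", 0, 100, [
    Span("auth", 5, 20),
    Span("inventory", 10, 40),
    Span("payment", 45, 95, [Span("fraud-check", 50, 90)]),
])

print(critical_path(trace))
# ['checkout', 'inventory', 'payment', 'fraud-check']
```

Reducing a trace to this path shrinks what has to be shown to the model, which fits the "crisp" context idea from the talk.
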
Working with the Context Window: The team "dictionary encoded" everything and split the critical path into pieces, then generated partial explanations and combined them.
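
The encode, split, and combine steps can be illustrated with a small sketch. It assumes that "dictionary encoding" means replacing long repeated names with short tokens, which is the usual meaning of the term. The helper names, the chunk size, and the `explain_chunk` stand-in for an LLM call are hypothetical.

```python
def dictionary_encode(names):
    """Map each distinct name to a short token; return (codes, encoded list)."""
    codes = {}
    encoded = []
    for name in names:
        if name not in codes:
            codes[name] = f"s{len(codes)}"
        encoded.append(codes[name])
    return codes, encoded


def split_into_chunks(items, chunk_size):
    """Split a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def explain_chunk(chunk):
    # Hypothetical stand-in for an LLM call; returns a canned string so the
    # sketch runs without any external service.
    return f"[{' -> '.join(chunk)}]"


def explain_path(names, chunk_size=2):
    """Encode the path, explain each chunk separately, then combine."""
    codes, encoded = dictionary_encode(names)
    partials = [explain_chunk(c) for c in split_into_chunks(encoded, chunk_size)]
    legend = {token: name for name, token in codes.items()}
    return " ".join(partials), legend


path = [
    "checkout-service/POST /order",
    "inventory-service/GET /stock",
    "checkout-service/POST /order",
    "payment-service/POST /charge",
    "payment-service/POST /charge",
]
summary, legend = explain_path(path)
print(summary)  # [s0 -> s1] [s0 -> s2] [s2]
print(legend["s2"])  # payment-service/POST /charge
```

Short tokens mean each repeated name costs little in the prompt, and the legend lets the combined explanation be decoded back into real service names afterwards.
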
Building Complex Mechanisms: Explainers were composed into more complex evaluation mechanisms. Simple explainers were aggregated for dashboard analysis and used to build triage workflows, and faulty traces were incorporated into alerts.
Conclusion: The system could evolve further, for example by adding metric metadata. LLMs are not a silver bullet, but they are useful for certain tasks. Broad OpenTelemetry adoption and query language standardisation would help.
