KubeCon EU keynote by Vijay Samuel: the eBay SRE team's experience enhancing incident response with ML and LLMs.
- Infrastructure Growth: eBay's platform grew to over 4,000 microservices, generating 15 petabytes of logs, 10 billion active time series daily, and 10 million spans per second.
- Challenges of Manual Triage: Manual incident triage, limited by human capacity, was cumbersome and error-prone at this scale.
- Groot System: The team moved away from static threshold-based alerts, attached root causes to alerts, and auto-remediated minor issues. Anomaly detection cut incident detection time to under 4 minutes but failed on new incident types.
- LLM Experiments: The team embraced the promise of LLMs but ran into hallucinations. Output became more accurate when the model was given very "crisp" context.
- Using LLM Capabilities: Applied LLMs to small amounts of information at a time during investigations. "Explainers" provide context during an incident investigation.
- Dividing and Conquering: Extracted the critical path, inspired by Uber's CRISP whitepaper. Used "few-shot prompting" to teach the algorithm.
- Working with Context Window: "Dictionary-encoded" everything and split the critical path to fit the context window. Generated partial explanations and combined them.
- Building Complex Mechanisms: Composed explainers into more complex evaluation mechanisms. Aggregated simple explainers for dashboard analysis and built triage workflows. Incorporated faulty traces in alerts.
- Conclusion: Possible evolutions of the system include adding metric metadata. LLMs are not a silver bullet but are useful for certain tasks. Broad OpenTelemetry adoption and query language standardisation would help.
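
The "Working with Context Window" step can be illustrated with a minimal sketch. This is not eBay's implementation: the function names (`dictionary_encode`, `split_path`, `explain_path`), the service names, and the `explain_chunk` callback standing in for an LLM call are all hypothetical. It assumes a critical path is simply an ordered list of service names, replaces each name with a small integer ID, cuts the path into fixed-size chunks, and joins the per-chunk explanations.

```python
from typing import Callable


def dictionary_encode(names: list[str]) -> tuple[dict[str, int], list[int]]:
    """Replace each distinct name with a small integer ID (first-seen order)."""
    table: dict[str, int] = {}
    encoded: list[int] = []
    for name in names:
        if name not in table:
            table[name] = len(table)
        encoded.append(table[name])
    return table, encoded


def split_path(encoded: list[int], chunk_size: int) -> list[list[int]]:
    """Cut the encoded path into consecutive chunks of at most chunk_size hops."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]


def explain_path(
    names: list[str],
    chunk_size: int,
    explain_chunk: Callable[[dict[int, str], list[int]], str],
) -> str:
    """Encode the path, explain each chunk separately, and combine the results.

    explain_chunk is a placeholder for whatever produces a partial
    explanation (an LLM call in the talk); here it is any callable.
    """
    table, encoded = dictionary_encode(names)
    legend = {idx: name for name, idx in table.items()}
    partials = []
    for chunk in split_path(encoded, chunk_size):
        # Send only the legend entries this chunk needs, to keep the prompt small.
        chunk_legend = {idx: legend[idx] for idx in chunk}
        partials.append(explain_chunk(chunk_legend, chunk))
    return "\n".join(partials)


if __name__ == "__main__":
    # Hypothetical critical path; the stub explainer just decodes the chunk.
    path = ["checkout", "cart", "inventory", "cart", "payment"]
    stub = lambda legend, chunk: "->".join(legend[i] for i in chunk)
    print(explain_path(path, 2, stub))
```

Passing the explainer in as a callable keeps the encoding and chunking logic testable without a model. A real system would also have to budget tokens rather than hop counts, which this sketch does not attempt.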