How SRE and GenAI Work Together to Reduce Downtime at eBay: Architect Insights from KubeCon EU

  • KubeCon EU Keynote by Vijay Samuel: Shared the eBay SRE team's experience of enhancing incident response with ML and LLMs.

  • Infrastructure Growth: eBay's platform grew to over 4,000 microservices, generating 15 petabytes of logs, 10 billion active time series daily, and 10 million spans per second.
  • Challenges of Manual Triage: Manual incident triage that relied on human effort alone was cumbersome and error-prone at this scale.
  • Groot System: Moved away from static threshold-based alerts. Alerts had a root cause attached, and minor issues were remediated automatically. Anomaly detection cut incident detection time to under 4 minutes but failed on new incident types.
  • LLM Experiments: The team embraced the promise of LLMs but ran into hallucinations. Output became more accurate when the model was given very "crisp" context.
  • Using LLM Capabilities: Applied LLMs to small, focused amounts of information within an investigation. "Explainers" provide context during incident investigation.
  • Dividing and Conquering: Extracted the critical path, an approach inspired by Uber's CRISP whitepaper. Used "few-shot prompting" to teach the algorithm.
  • Working within the Context Window: "Dictionary-encoded" everything and split the critical path into parts. Generated partial explanations and then combined them.
  • Building Complex Mechanisms: Composed explainers into more complex evaluation mechanisms. Aggregated simple explainers for dashboard analysis and built triage workflows. Included faulty traces in alerts.
  • Conclusion: Possible evolutions of the system include adding metric metadata. LLMs are not a silver bullet but are useful for certain tasks. Broad OpenTelemetry adoption and query language standardisation would help.
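
The context-window point above (dictionary-encode, split, explain each part, combine) can be illustrated with a minimal Python sketch. This is not eBay's implementation: the function names, the token format ("s0", "s1", ...), the chunk size, and the stub explainer that stands in for an LLM call are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def dictionary_encode(names: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Replace each distinct name with a short token.

    Returns the encoded sequence and a token -> original name table,
    so repeated long names cost only a few characters in the prompt.
    """
    name_to_token: Dict[str, str] = {}
    table: Dict[str, str] = {}
    encoded: List[str] = []
    for name in names:
        if name not in name_to_token:
            token = f"s{len(name_to_token)}"  # hypothetical token format
            name_to_token[name] = token
            table[token] = name
        encoded.append(name_to_token[name])
    return encoded, table


def split_into_chunks(items: List[str], max_items: int) -> List[List[str]]:
    """Split a sequence into consecutive chunks of at most max_items."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return [items[i:i + max_items] for i in range(0, len(items), max_items)]


def explain_critical_path(
    path: List[str],
    explain_chunk: Callable[[List[str], Dict[str, str]], str],
    max_items: int = 3,
) -> str:
    """Encode the path, explain each chunk separately, and join the results."""
    encoded, table = dictionary_encode(path)
    partials = [
        explain_chunk(chunk, table)
        for chunk in split_into_chunks(encoded, max_items)
    ]
    return "\n".join(partials)


def stub_explainer(chunk: List[str], table: Dict[str, str]) -> str:
    """Stand-in for an LLM call (assumption): decodes the tokens back to names."""
    return "segment: " + " -> ".join(table[token] for token in chunk)


if __name__ == "__main__":
    # Hypothetical service names on a critical path.
    critical_path = ["checkout", "payments", "checkout", "inventory"]
    print(explain_critical_path(critical_path, stub_explainer, max_items=2))
```

In a real system the stub would be replaced by a prompt to a model, and the chunk size would be derived from the model's token limit rather than from a fixed item count.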