Datadog Employs Large Language Models to Assist in Writing Incident Postmortems

  • Combining Metadata and Slack with an LLM: Datadog combined structured metadata from its incident management app with Slack messages to build an LLM-driven feature that helps engineers compose incident postmortems. The team faced challenges in using LLMs outside interactive dialogue systems and in ensuring consistently high-quality content (a prompt-assembly sketch follows this list).
  • Enhancing Postmortem Creation: Datadog used LLMs to compile the sections of the postmortem report, spending over 100 hours refining the section structure and prompt instructions. The team evaluated models such as GPT-3.5 and GPT-4 for cost, speed, and quality, and chose different models for different sections based on their complexity. Generating the sections in parallel cut total generation time from 12 minutes to under 1 minute (see the parallel-generation sketch below).
  • Trust and Privacy: Because the reports mix AI-generated and human-written content, trust and privacy were central concerns. AI-generated content was clearly marked, and sensitive information was stripped and replaced with placeholders, with secret scanning and filtering mechanisms implemented in the ingestion API (a redaction sketch follows below).
  • Customizing Templates: Postmortem authors gained the ability to customize templates with per-section LLM instructions, promoting transparency and trust (see the template sketch below).
  • Conclusion: The Datadog team believes LLMs can support operations engineers but cannot yet fully replace them. GenAI-enhanced products improve productivity and give engineers a head start on writing. The team plans to expand the data sources used and to experiment with generating alternative versions of postmortems.
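
The article does not show Datadog's code; the sketches below are illustrative only. First, a minimal sketch of how structured incident metadata and Slack messages might be combined into a single one-shot prompt. The helper names, metadata fields, and the use of the OpenAI chat API are assumptions, not Datadog's actual implementation.

```python
# Hypothetical sketch: combining structured incident metadata with Slack
# messages into a single one-shot prompt. Function names, fields, and the
# model choice are assumptions, not Datadog's actual implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_prompt(incident: dict, slack_messages: list[dict]) -> str:
    """Flatten incident metadata and chat history into one prompt for a single LLM call."""
    metadata = "\n".join(f"- {key}: {value}" for key, value in incident.items())
    timeline = "\n".join(
        f"[{m['timestamp']}] {m['author']}: {m['text']}" for m in slack_messages
    )
    return (
        "You are drafting the 'Summary' section of an incident postmortem.\n\n"
        f"Incident metadata:\n{metadata}\n\n"
        f"Slack timeline:\n{timeline}\n\n"
        "Write a concise, factual summary of what happened."
    )

def draft_summary(incident: dict, slack_messages: list[dict]) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": build_prompt(incident, slack_messages)}],
    )
    return response.choices[0].message.content
```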
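
Next, a hedged sketch of generating sections in parallel while routing simple sections to a cheaper, faster model and complex ones to a stronger model. The section names, model mapping, and asyncio-based concurrency are assumptions.

```python
# Hypothetical sketch: generating postmortem sections concurrently, with a
# cheaper model for simple sections and a stronger model for complex ones.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Illustrative mapping: simple sections use the faster model, complex ones the stronger model.
SECTIONS = {
    "Summary": "gpt-4",
    "Timeline": "gpt-3.5-turbo",
    "Impact": "gpt-3.5-turbo",
    "Root Cause": "gpt-4",
}

async def generate_section(name: str, model: str, context: str) -> tuple[str, str]:
    response = await client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Using the incident context below, write the '{name}' "
                       f"section of a postmortem.\n\n{context}",
        }],
    )
    return name, response.choices[0].message.content

async def generate_postmortem(context: str) -> dict[str, str]:
    # Run all section generations concurrently instead of sequentially,
    # which is what brings total latency down from minutes to well under one.
    tasks = [generate_section(name, model, context) for name, model in SECTIONS.items()]
    return dict(await asyncio.gather(*tasks))

# sections = asyncio.run(generate_postmortem(incident_context))
```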
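
The following sketch illustrates the kind of secret scanning and placeholder substitution described above, applied to messages before they reach the ingestion API. The regex patterns and placeholder format are illustrative assumptions.

```python
# Hypothetical sketch: stripping likely secrets from incoming text and
# replacing them with typed placeholders before ingestion.
import re

SECRET_PATTERNS = {
    "API_KEY": re.compile(r"\b(?:sk|key)-[A-Za-z0-9]{16,}\b"),
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "BEARER_TOKEN": re.compile(r"Bearer\s+[A-Za-z0-9._-]{20,}"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace anything that matches a secret pattern with a placeholder."""
    for label, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label}]", text)
    return text

print(redact("deploy failed, token Bearer abc123def456ghi789jkl0 from ops@example.com"))
# -> deploy failed, token [REDACTED_BEARER_TOKEN] from [REDACTED_EMAIL]
```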
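
Finally, a sketch of how a customizable template with per-section LLM instructions might be modeled, so authors can see and override exactly what the model is asked to do. The schema and field names are assumptions.

```python
# Hypothetical sketch: a postmortem template where each section carries its
# own LLM instructions, which authors can inspect and override.
from dataclasses import dataclass

@dataclass
class TemplateSection:
    title: str
    llm_instructions: str | None  # None means the section is human-written only

DEFAULT_TEMPLATE = [
    TemplateSection(
        title="Summary",
        llm_instructions="Summarize the incident in 3-5 sentences for a general audience.",
    ),
    TemplateSection(
        title="Customer Impact",
        llm_instructions="List affected services and the duration of impact, based on the timeline.",
    ),
    TemplateSection(
        title="Lessons Learned",
        llm_instructions=None,  # left for the human author to fill in
    ),
]

def customize(template: list[TemplateSection], overrides: dict[str, str]) -> list[TemplateSection]:
    """Return a copy of the template with author-provided instructions applied."""
    return [
        TemplateSection(section.title, overrides.get(section.title, section.llm_instructions))
        for section in template
    ]
```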