The Evolution of Google SRE

  • 2021 Google incident: A quota system sets and enforces resource quotas for internal software. A quota rightsizer reduces a service's quota when the service uses less than its allocation. If the quota is reduced below what the service needs, that reduction is an unsafe control action (UCA). STPA (System-Theoretic Process Analysis) analyzes system interactions for safety.

    • Types of UCA:

      • Required control action not provided.
      • Incorrect or inadequate control action.
      • Control action at wrong time/sequence.
      • Control action stopped too soon/applied too long.
    • Four archetypal scenarios for rightsizer:

      • Incorrect rightsizer behavior.
      • Incorrect feedback or no feedback.
      • Quota system does not receive the control action.
      • Incorrect quota system behavior.
    • Specific scenario: The calculation of current resource usage goes wrong, so the rightsizer receives incorrect feedback and reduces quota incorrectly.
    • Highlighted advantage of STPA: It finds issues in both the control path and the feedback path.
    • Incident details: Incorrect feedback was sent to the rightsizer, which produced a quota reduction that was not applied immediately. No feedback about the pending change was sent, and the result was a significant outage.
    • Key principle: Shift from preventing failures to designing and implementing controls that enforce constraints.
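
To make the control loop in these notes concrete, here is a minimal Python sketch of a rightsizer that enforces a constraint instead of trusting its inputs. It is not Google's implementation. Every name (`UsageFeedback`, `Decision`, `rightsize`, `notify`) and every threshold (`MAX_FEEDBACK_AGE`, `HEADROOM`, `MAX_STEP`) is a hypothetical chosen for illustration. The sketch covers the two paths the notes call out: it holds quota when feedback is missing, stale, or implausible (feedback path), and it bounds any single reduction and reports the pending change before it is applied (control path).

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical thresholds, chosen only for illustration.
MAX_FEEDBACK_AGE = 300.0  # seconds before a usage reading counts as stale
HEADROOM = 1.2            # keep quota at least 20% above reported usage
MAX_STEP = 0.5            # never cut more than 50% of quota in one step


@dataclass
class UsageFeedback:
    usage: float        # resource units the service is reported to use
    age_seconds: float  # age of the measurement


@dataclass
class Decision:
    new_quota: float
    reason: str


def rightsize(
    current_quota: float,
    feedback: Optional[UsageFeedback],
    notify: Callable[[Decision], None],
) -> Decision:
    """Return a quota decision; reduce only when the feedback is trustworthy."""
    # Feedback path: missing, stale, or implausible feedback means hold.
    if feedback is None:
        return Decision(current_quota, "hold: no feedback")
    if feedback.age_seconds > MAX_FEEDBACK_AGE:
        return Decision(current_quota, "hold: stale feedback")
    if feedback.usage < 0:
        return Decision(current_quota, "hold: implausible feedback")

    target = feedback.usage * HEADROOM
    if target >= current_quota:
        return Decision(current_quota, "hold: no reduction warranted")

    # Control path: bound the size of a single reduction, so that wrong
    # feedback cannot remove most of a service's quota in one action.
    floor = current_quota * (1 - MAX_STEP)
    decision = Decision(max(target, floor), "reduce")

    # Report the pending change before anything applies it.
    notify(decision)
    return decision


if __name__ == "__main__":
    pending = []
    print(rightsize(100.0, UsageFeedback(50.0, 10.0), pending.append))
    print(rightsize(100.0, UsageFeedback(0.0, 10.0), pending.append))
    print(rightsize(100.0, None, pending.append))
    print(len(pending), "pending changes reported")
```

The second call in the usage example models the scenario above: a usage calculation that wrongly reports zero. The step limit keeps the resulting reduction bounded, and the `notify` callback gives the service owner a chance to see the change while it is still pending.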