
Over the past few weeks, GitHub has suffered multiple outages caused by database issues, which degraded the platform's services and affected many users.

GitHub takes downtime incidents seriously. While working to resolve the problem, it also published details of these incidents on the 23rd of this month.

Timeline

  • March 16, 14:09 UTC (duration 5 hours 36 minutes)
  • March 17, 13:46 UTC (duration 2 hours 28 minutes)
  • March 22, 15:53 UTC (duration 2 hours 53 minutes)
  • March 23, 14:49 UTC (duration 2 hours 51 minutes)
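To put the four incidents in perspective, the durations can be added up and the approximate end time of each incident derived from its start time. The following is a minimal sketch using only the figures listed in the timeline; the variable names and the `fmt` helper are illustrative, and the start times are treated as offsets from midnight UTC because the timeline gives no year.

```python
from datetime import timedelta

# (day, start time as offset from midnight UTC, duration), copied from the timeline
incidents = [
    ("March 16", timedelta(hours=14, minutes=9), timedelta(hours=5, minutes=36)),
    ("March 17", timedelta(hours=13, minutes=46), timedelta(hours=2, minutes=28)),
    ("March 22", timedelta(hours=15, minutes=53), timedelta(hours=2, minutes=53)),
    ("March 23", timedelta(hours=14, minutes=49), timedelta(hours=2, minutes=51)),
]


def fmt(td: timedelta) -> str:
    """Format a timedelta as H:MM (hypothetical helper for this sketch)."""
    total_minutes = int(td.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


# Sum of all four durations
total = sum((duration for _, _, duration in incidents), timedelta())

# End time of day for each incident (all four end on the day they started)
end_times = {day: fmt(start + duration) for day, start, duration in incidents}

print("Total degraded time:", fmt(total))  # 13:48
for day, end in end_times.items():
    print(f"{day}: ended around {end} UTC")
```

By this arithmetic, the four incidents add up to 13 hours 48 minutes of degraded service across eight days.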

The main cause of GitHub's repeated downtime over the past few weeks is reportedly resource contention in its mysql1 cluster, which degraded the performance of a large number of GitHub services and features during peak load.

Over the past few years, GitHub has made many optimizations, such as adding clusters to support the platform's growth and partitioning the main database. These improvements have not solved the problem once and for all, however, and GitHub is still actively working on it.

To prevent such incidents in the future, GitHub has started auditing load patterns on this database during peak hours and is making a series of performance fixes based on those audits. As part of this work, it is shifting traffic to other databases to reduce load and speed up failover, and it is reviewing its change management procedures, particularly around monitoring and changes made during periods of high load in production.

As the platform continues to grow, GitHub will keep working to scale its infrastructure, including sharding the database and scaling hardware.


snakesss