Is your Rails application special?

  • TL;DR: Legacy Rails apps are unpredictable, so improve observability first. In an unstable B2B e-commerce system with 3 components (public website, legacy ERP, middleware), adding NewRelic exposed issues such as an infinite loop and middleware load problems. The fix was to adjust Puma concurrency settings (middleware: 3 workers, 20 threads; public app: 3 workers, 9 threads), reduce database connections for the middleware, and implement fast rollback. System stability improved significantly.
  • Cry for help: A request came in to help with an unstable Rails app that suffered frequent outages and slowdowns. It had no observability or centralized logging, and it was deployed on AWS EC2 through automated Ansible deployment with no rollback feature.
  • Preliminary steps: Added NewRelic instrumentation and extended the Ansible roles to install the NewRelic infrastructure agents.
  • Investigation - first round: Found an infinite loop in the public website, caused by an ERP system API change that had not been reflected in the code. APM traces from the outages showed the public app's response time rising because it was waiting on external web calls, while the middleware's own response time rose by a smaller amount. Increased the middleware to 2 Puma workers and added the X-Request-Start header, which lets the APM report request queue time.
  • Investigation - second round: Learned more about the system during a Ruby and Rails upgrade. Discovered that the middleware's throughput was higher than the public app's, because it also handled traffic from internal apps. Deliberately increasing the middleware load triggered public app outages, confirming that the middleware's concurrency configuration was suboptimal.
  • Best practices for configuring Puma concurrency: Puma uses a multi-process, multi-threaded model. The Rails default was 5 threads per worker, lowered to 3 in Rails 7.2. The rule of thumb is 1 Puma worker process per CPU core, with the thread count then tuned against CPU utilization. The middleware is I/O-bound, so it benefits from a higher thread count.
  • Solution - tuning Puma concurrency settings: Adjusted the Ansible deployment roles so that each component can have its own Puma configuration, tuned the haproxy configuration, and found a way to perform a fast Ansible-based rollback.
  • Number of database connections: Checked database connection usage before deployment. Reducing the middleware's database connections was safe because it uses them only for authentication and authorization and releases them immediately; each process now uses only 2 connections.
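
The concurrency and connection figures in the summary can be checked with simple arithmetic. The following is a minimal sketch, not the author's tooling: the method names are hypothetical, and it assumes that every Puma worker process runs the configured number of threads and holds at most its own pool of database connections.

```ruby
# Illustrative arithmetic only; method names are hypothetical.

# Upper bound on requests a component can process at the same time:
# each Puma worker process runs its own pool of threads.
def max_concurrent_requests(workers:, threads:)
  workers * threads
end

# Upper bound on database connections held by the worker processes,
# assuming each process keeps at most `pool_per_process` connections.
def max_db_connections(workers:, pool_per_process:)
  workers * pool_per_process
end

# Figures taken from the summary above.
middleware_capacity = max_concurrent_requests(workers: 3, threads: 20)
public_app_capacity = max_concurrent_requests(workers: 3, threads: 9)
middleware_db_conns = max_db_connections(workers: 3, pool_per_process: 2)

puts "middleware: up to #{middleware_capacity} concurrent requests"
puts "public app: up to #{public_app_capacity} concurrent requests"
puts "middleware: up to #{middleware_db_conns} database connections"
```

Under these assumptions the middleware can serve up to 60 requests at once while holding at most 6 database connections. A pool that is much smaller than the thread count only works because the middleware releases each connection immediately after the auth check, as the summary notes.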