你曾经有没有读得太认真以至于不小心写了下来?

  • Owning a production Postgres database is never boring. One day, noticed a weird curve in production database metrics with spikes in WALWrite (number of processes waiting to write to the write-ahead log).
  • The write-ahead log is written serially, causing contention for the mutex. The spikes occur every 6 minutes, and after zooming out, it started about 3 hours ago.
  • There are other related graphs like "Max transaction duration" with a 6-minute period and sawtooth pattern starting 3 hours ago. It's related to an ongoing backfill into BigQuery via Datastream.
  • Datastream backfill starts a transaction and uses a cursor to page through rows. The other new graph is "pages dirtied" by a particular query.
  • "Pages dirtied" means when Postgres needs to write to a page, it writes the change to the WAL and marks the buffer as dirty. Eventually, the dirty buffer is written to disk.
  • A SELECT query in Postgres is not read-only as it can modify tuples. Tuples are versions of rows, and a SELECT fetches tuples and constructs rows.
  • Hint bits are used to optimize SELECT queries by caching the result of pg_clog lookup. Setting a hint bit dirties the page.
  • Hypothesis: Datastream starts a transaction to SELECT * FROM foo, concurrent transactions SELECT and UPDATE foo rows, and subsequent SELECTs update hint bits. But there are cracks in the hypothesis.
  • Updating hint bits causes contention for WAL writes because of the default setting of full_page_writes. When the interval between checkpoints is raised, the WALWrite spikes disappear.
  • Don't know why the Datastream backfill creates tuples. Ops often has such situations where you get a working fix and move on.
阅读 6
0 条评论