- Owning a production Postgres database is never boring. One day, noticed a weird curve in production database metrics with spikes in WALWrite (number of processes waiting to write to the write-ahead log).
- The write-ahead log is written serially, causing contention for the mutex. The spikes occur every 6 minutes, and after zooming out, it started about 3 hours ago.
- There are other related graphs like "Max transaction duration" with a 6-minute period and sawtooth pattern starting 3 hours ago. It's related to an ongoing backfill into BigQuery via Datastream.
- Datastream backfill starts a transaction and uses a cursor to page through rows. The other new graph is "pages dirtied" by a particular query.
- "Pages dirtied" means when Postgres needs to write to a page, it writes the change to the WAL and marks the buffer as dirty. Eventually, the dirty buffer is written to disk.
- A
SELECT
query in Postgres is not read-only as it can modify tuples. Tuples are versions of rows, and aSELECT
fetches tuples and constructs rows. - Hint bits are used to optimize
SELECT
queries by caching the result ofpg_clog
lookup. Setting a hint bit dirties the page. - Hypothesis: Datastream starts a transaction to
SELECT * FROM foo
, concurrent transactionsSELECT
andUPDATE
foo
rows, and subsequentSELECT
s update hint bits. But there are cracks in the hypothesis. - Updating hint bits causes contention for WAL writes because of the default setting of full_page_writes. When the interval between checkpoints is raised, the WALWrite spikes disappear.
- Don't know why the Datastream backfill creates tuples. Ops often has such situations where you get a working fix and move on.
**粗体** _斜体_ [链接](http://example.com) `代码` - 列表 > 引用
。你还可以使用@
来通知其他用户。