PostgreSQL数据库随机取数据，数据抽样

对于数据量比较大的表，不要使用下面的这种方法，会比较慢


select * from review order by random() limit 10;

会全表扫描，很慢

SYSTEM抽样方法


SELECT ...

FROM table_name

TABLESAMPLE sampling_method (argument [, ...]) [REPEATABLE (seed)]

sampling_method 抽样方法，主要有两种，system和bernoulli，argument是指抽样百分比，

举个栗子


select id from review tablesample system(0.0001);

censor@10.0.80.13:5432=#explain analyze select id from review tablesample system(0.0001);

QUERY PLAN

----------------------------------------------------------------------------------------------------

Sample Scan on review (cost=0.00..4.06 rows=6 width=4) (actual time=3.328..3.899 rows=10 loops=1)

Sampling: system ('0.0001'::real)

Planning time: 0.110 ms

Execution time: 3.920 ms

(4 rows)

需要指出一条的是，system抽样方式是随机抽取表上的数据块的数据，理论上每个数据块被检索的概率是一样的，system抽样方式基于数据块级别，被选中的快上面的所以数据将被检索, 所以可能会出现每次随机抽取的数据条数是不一样的，甚至可能返回0条，可能是抽取到了一个没有数据的数据块，

关于数据块检索我们可以通过下面的例子看到，ctid是数据块的值，前面的那个是数据库编号，后面那个是该条数据在数据块上面的编号


censor@10.0.80.13:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------+----

(0 rows)

censor@10.0.80.13:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------------+---------

(289034,1) | 4131829

(289034,2) | 4131830

(289034,3) | 4131853

(289034,4) | 4131854

(289034,5) | 4924212

(289034,6) | 5142532

(6 rows)

censor@10.0.80.13:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------+----

(0 rows)

censor@10.0.80.13:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

-------------+---------

(368648,1) | 5006567

(368648,2) | 5164392

(368648,3) | 4844708

(368648,4) | 4844524

(368648,5) | 4844526

(368648,6) | 4844621

(368648,7) | 4844748

(368648,8) | 4844614

(368648,9) | 4844678

(368648,10) | 4844530

(368648,11) | 4844532

(11 rows)

BERNOULLI抽样方式

BERNOULLI抽样方式，是基于数据行的，所以相对来说，每次返回的数量相对比较一致，相差不多，这种抽取方式比SYSTEM抽样方式更加随机一些，但是相对性能要低很多，


censor@10.0.80.13:5432=#explain analyze select ctid, id from review tablesample bernoulli(0.0001);

QUERY PLAN

--------------------------------------------------------------------------------------------------------------

Sample Scan on review (cost=0.00..429583.06 rows=6 width=10) (actual time=136.399..1524.570 rows=6 loops=1)

Sampling: bernoulli ('0.0001'::real)

Planning time: 0.113 ms

Execution time: 1524.615 ms

(4 rows)

然后由于BERNOULLI抽样方式是基于数据行的，所以一般而言，抽样得到的数据应该位于不同的数据块


censor@10.0.80.13:5432=#select ctid, id from review tablesample bernoulli(0.0001);

ctid | id

-------------+---------

(82429,15) | 1419928

(131678,12) | 2159220

(273168,2) | 3965564

(344546,3) | 4675681

(4 rows)

总结

SYSTEM抽样方式适合抽样效率优先，针对大数据量啥的；

BERNOULLI适合随机性优先的场景

PostgreSQL数据库随机取数据，数据抽样

Hwan

引用和评论

被 Manus 带火的 MCP 是什么｜一文看懂

在 Kubernetes 上用 KubeBlocks + Dify 快速构建生产级 AIGC 应用

Devin 发布 DeepWiki，2 星的项目直接装出万星的气场

PostgreSQL@K8s 性能优化记

袋鼠数据库工具 6.2 AI 版已上线

袋鼠数据库工具 6.3 AI 版已上线

必读｜Postgres 如何在磁盘上存储和索引数据？