PostgreSQL/Greenplum Kernel Parameter Configuration Guide

memory overcommit

vm.overcommit_memory = 2
vm.overcommit_ratio = 95 # See Note 2: https://gpdb.docs.pivotal.io/6-0/install_guide/prep_os.html#topic4__sysctl_conf

Greenplum notes

When vm.overcommit_memory is 2, you specify a value for vm.overcommit_ratio. For information about calculating the value for vm.overcommit_ratio when using resource queue-based resource management, see the Greenplum Database server configuration parameter gp_vmem_protect_limit in the Greenplum Database Reference Guide. If you are using resource group-based resource management, tune the operating system vm.overcommit_ratio as necessary. If your memory utilization is too low, increase the vm.overcommit_ratio value; if your memory or swap usage is too high, decrease the value.

Linux kernel explanation
http://linuxperf.com/?p=102
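Memory overcommit means that the operating system promises processes more memory than is actually available. A conservative operating system does not allow memory overcommit: it hands out only what it has, and further requests fail. That wastes memory, because a process usually uses less memory than it requests. For example, a process may malloc() 200 MB but touch only 100 MB. On UNIX/Linux, physical pages are allocated at the moment they are first used, not at the moment they are requested, so the unused 100 MB is never allocated; without overcommit, that 100 MB would sit reserved but idle. The following distinction is the key to understanding memory overcommit: commit (or overcommit) applies to memory requests, a memory request is not a memory allocation, and memory is allocated only when it is actually used.

Linux allows memory overcommit: any request for memory is granted, in the hope that processes will not actually use that much. But what if they do? The result resembles a bank run: there is not enough cash (memory) to go around. Linux handles this crisis with the OOM killer (OOM = out-of-memory), which picks a process and kills it to free some memory, and keeps killing if that is still not enough. Alternatively, the kernel parameter vm.panic_on_oom can make the kernel panic on OOM, which reboots the system when kernel.panic is also set. Both mechanisms are risky: a reboot can interrupt service, and so can a killed process. My own small website ran into this problem (see the earlier post). For this reason, Linux 2.6 and later allow memory overcommit to be disabled through the kernel parameter vm.overcommit_memory.

The kernel parameter vm.overcommit_memory accepts three values:

0 – Heuristic overcommit handling. This is the default. Overcommit is allowed, but blatant overcommit is refused, for example a single malloc that asks for more than the total memory of the system. "Heuristic" means the kernel uses an algorithm (explained in detail at the end of the original article) to guess whether a memory request is reasonable, and refuses the request if it judges it unreasonable.
1 – Always overcommit. Overcommit is allowed and no memory request is refused.
2 – Don't overcommit. Overcommit is prohibited.

With overcommit disabled (vm.overcommit_memory=2), the question is what counts as overcommit. The kernel keeps a threshold, and a total of requested memory above that threshold counts as overcommit. The threshold is visible in /proc/meminfo: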

# grep -i commit /proc/meminfo
CommitLimit:     5967744 kB
Committed_AS:    5363236 kB

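CommitLimit is the overcommit threshold: if the total amount of requested memory exceeds CommitLimit, that counts as overcommit.
How is this threshold calculated? It is neither the size of physical memory nor the amount of free memory. It is set indirectly through the kernel parameter vm.overcommit_ratio or vm.overcommit_kbytes, using this formula:
CommitLimit = (Physical RAM * vm.overcommit_ratio / 100) + Swap

Notes:
vm.overcommit_ratio is a kernel parameter with a default of 50, meaning 50% of physical memory. If you prefer not to use a ratio, you can specify an absolute size in kilobytes through another kernel parameter, vm.overcommit_kbytes.
If huge pages are in use, they must be subtracted from physical memory, and the formula becomes:
CommitLimit = ([total RAM] - [total huge TLB RAM]) * vm.overcommit_ratio / 100 + swap
See https://access.redhat.com/solutions/665023

Committed_AS in /proc/meminfo is the total amount of memory that all processes have requested (requested, not allocated). If Committed_AS exceeds CommitLimit, overcommit has occurred, and the larger the excess, the more severe the overcommit. Put another way, Committed_AS is the amount of physical memory that would be needed to absolutely guarantee that OOM (out of memory) never happens.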
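
The CommitLimit formula can be checked by hand with a small shell function. This is only a sketch: the function name commit_limit_kb and the sample host (8 GiB RAM, 2 GiB swap, no huge pages) are assumptions for illustration, and because the kernel does this arithmetic in pages, the value in /proc/meminfo may differ slightly from the result here.

```shell
# commit_limit_kb RAM_KB RATIO SWAP_KB [HUGETLB_KB]   (hypothetical helper)
# Applies: CommitLimit = (RAM - huge pages) * ratio / 100 + swap, all in kB.
commit_limit_kb() {
    ram_kb=$1
    ratio=$2
    swap_kb=$3
    huge_kb=${4:-0}   # huge page memory defaults to 0 when not given
    echo $(( ($ram_kb - $huge_kb) * $ratio / 100 + $swap_kb ))
}

# Assumed sample host: 8 GiB RAM, 2 GiB swap, no huge pages, ratio 95.
commit_limit_kb 8388608 95 2097152
```

On a real host, the inputs would come from the MemTotal, SwapTotal and Hugetlb lines of /proc/meminfo and from `sysctl -n vm.overcommit_ratio`.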

ip port

net.ipv4.ip_local_port_range = 10000 65535

Greenplum notes

To avoid port conflicts between Greenplum Database and other applications when initializing Greenplum Database, do not specify Greenplum Database ports in the range specified by the operating system parameter net.ipv4.ip_local_port_range. For example, if net.ipv4.ip_local_port_range = 10000 65535, you could set the Greenplum Database base port numbers to these values.
PORT_BASE = 6000
MIRROR_PORT_BASE = 7000
For information about the port ranges that are used by Greenplum Database, see gpinitsystem.

Linux kernel explanation
On Linux, there is a sysctl parameter called ip_local_port_range that defines the minimum and maximum port a networking connection can use as its source (local) port. This applies to both TCP and UDP connections.

cat /proc/sys/net/ipv4/ip_local_port_range
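
To confirm that the Greenplum base ports stay outside the ephemeral range, a check along these lines may help. It is a sketch only: the function name port_in_local_range is an assumption, and the range and ports are the values quoted in this section.

```shell
# port_in_local_range PORT LOW HIGH   (hypothetical helper)
# Succeeds (exit status 0) when PORT lies inside the range LOW..HIGH.
port_in_local_range() {
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Values from this section: range 10000-65535, PORT_BASE 6000, MIRROR_PORT_BASE 7000.
for port in 6000 7000; do
    if port_in_local_range "$port" 10000 65535; then
        echo "port $port conflicts with the ephemeral range"
    else
        echo "port $port is outside the ephemeral range"
    fi
done
```

On a live host, the two bounds can be read with `read low high < /proc/sys/net/ipv4/ip_local_port_range` and passed in place of the literals.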

shared memory

# kernel.shmall = _PHYS_PAGES / 2 # See Note 1
kernel.shmall = 4000000000
# kernel.shmmax = kernel.shmall * PAGE_SIZE # See Note 1
kernel.shmmax = 500000000
kernel.shmmni = 4096

View limits and current usage:
ipcs -lm
ipcs -u

shmall: This parameter sets the total amount of shared memory pages that can be used system wide. Hence, SHMALL should always be at least ceil(shmmax/PAGE_SIZE).
Total number of pages that shared memory may use
echo $(expr $(getconf _PHYS_PAGES) / 2)
shmmax: This parameter defines the maximum size in bytes of a single shared memory segment that a Linux process can allocate in its virtual address space.
Maximum size of a single shared memory segment, in bytes
echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
shmmni: This parameter sets the system wide maximum number of shared memory segments.
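
The two getconf one-liners above can be folded into a single helper that prints both settings. This is a sketch under assumptions: the function name shm_settings and the sample host (4194304 pages of 4096 bytes, i.e. 16 GiB) are illustrative only.

```shell
# shm_settings PHYS_PAGES PAGE_SIZE   (hypothetical helper)
# Prints kernel.shmall (pages) and kernel.shmmax (bytes), each sized to
# half of physical memory, as in the commented formulas above.
shm_settings() {
    shmall=$(( $1 / 2 ))
    shmmax=$(( $shmall * $2 ))
    echo "kernel.shmall = $shmall"
    echo "kernel.shmmax = $shmmax"
}

# Assumed sample host: 4194304 pages of 4096 bytes (16 GiB of RAM).
shm_settings 4194304 4096
```

On a real host the call would be `shm_settings "$(getconf _PHYS_PAGES)" "$(getconf PAGE_SIZE)"`. With both values derived this way, shmall equals shmmax / PAGE_SIZE, so the "at least ceil(shmmax/PAGE_SIZE)" condition holds.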

semaphores

cat /proc/sys/kernel/sem
500 2048000 200 40960

SEMMSL, SEMMNS, SEMOPM, SEMMNI

kernel.sem = 500 2048000 200 40960

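SEMMSL
Meaning: maximum number of semaphores per semaphore set. Setting: at least 250; on systems with a large processes parameter, processes + 10 is recommended.

SEMMNS
Meaning: maximum number of semaphores system-wide. Setting: at least 32000; SEMMSL * SEMMNI.

SEMOPM
Meaning: maximum number of semaphore operations allowed per semop system call. Setting: at least 100, or equal to SEMMSL.

SEMMNI
Meaning: maximum number of semaphore sets system-wide. Setting: at least 128.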
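
The minimums listed above can be checked mechanically. The following is a sketch: the function name check_sem is an assumption, and it tests only the four lower bounds stated in this section, not the SEMMSL * SEMMNI or processes + 10 guidelines.

```shell
# check_sem SEMMSL SEMMNS SEMOPM SEMMNI   (hypothetical helper)
# Prints one line per value that is below the minimum given in this section;
# prints nothing when all four minimums are met.
check_sem() {
    [ "$1" -ge 250 ]   || echo "SEMMSL $1 is below 250"
    [ "$2" -ge 32000 ] || echo "SEMMNS $2 is below 32000"
    [ "$3" -ge 100 ]   || echo "SEMOPM $3 is below 100"
    [ "$4" -ge 128 ]   || echo "SEMMNI $4 is below 128"
    return 0
}

# The values configured in this section.
check_sem 500 2048000 200 40960
```

On a live host the four fields can be passed straight from the kernel: `check_sem $(cat /proc/sys/kernel/sem)`.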

message queue

kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.msgmni = 2048

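A message queue lets one process send a block of data to another process, and message queues have kernel persistence;
each data block is considered to have a type, and the receiving process can receive data blocks with different type values;
like pipes, message queues have limits: the maximum length of each message (MSGMAX), the total number of bytes in each message queue (MSGMNB), and the total number of message queues on the system (MSGMNI).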

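cat /proc/sys/kernel/msgmax    maximum length of a single message (sample value: 8192 = 8K)
cat /proc/sys/kernel/msgmnb    maximum total bytes in one message queue (sample value: 16384 = 16K)
cat /proc/sys/kernel/msgmni    maximum number of message queues (sample value: 169)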

file cache

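File caching is an important way to improve performance. Read caching is beneficial in the vast majority of cases (programs read data directly from RAM), while write caching is more complicated. The Linux kernel splits a disk write into two steps: write to the cache first, then flush the cache to disk asynchronously at intervals. This speeds up I/O, but because data is not written to disk immediately, there is a risk of data loss.

The cache can also fill up. Too much data may then be written to disk at once, stalling the system. The stall occurs because the kernel decides the cache is too large to be flushed asynchronously in time and switches to synchronous writes. (Asynchronous means processes keep running while the data is written; synchronous means processes that are writing are blocked until the data is on disk.)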

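# With these settings, background flushing starts asynchronously once dirty data reaches 10%, and synchronous writes are not forced until 20%.
# Both *_centisecs values are in hundredths of a second: as written, the flusher wakes every 0.03 s and dirty data older than 0.1 s is flushed. A 3 s wakeup and a 10 s expiry would need 300 and 1000.
vm.dirty_expire_centisecs = 10
vm.dirty_writeback_centisecs = 3
vm.dirty_background_ratio = 10
vm.dirty_ratio = 20
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0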

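vm.dirty_background_ratio is the percentage of memory that may fill with dirty data before background writeback starts. This dirty data will be written to disk later, and the pdflush/flush/kdmflush background processes clean it up. For example, with 32 GB of memory and a value of 10, up to 3.2 GB of dirty data can stay in memory; beyond 3.2 GB, the background processes start cleaning it up.

vm.dirty_ratio is the absolute limit on dirty data: the percentage of dirty data in memory cannot exceed this value. Once dirty data exceeds it, new I/O requests are blocked until dirty data has been written to disk. This is a major cause of I/O stalls, but it is also the safeguard that keeps memory from holding an excessive amount of dirty data.

vm.dirty_expire_centisecs specifies how long dirty data may live, in hundredths of a second; the kernel default of 3000 means 30 seconds. When pdflush/flush/kdmflush wakes up, it checks for data older than this limit and writes it to disk asynchronously, since data that stays in memory too long is at risk of being lost.

vm.dirty_writeback_centisecs specifies how often the pdflush/flush/kdmflush processes wake up, also in hundredths of a second.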
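
Because the units are easy to misread, two small conversions may be useful when reviewing these settings. This is a sketch: the function names are assumptions, and the percentage is applied to total memory for simplicity, as in the 32 GB example above.

```shell
# centisecs_to_secs N   (hypothetical helper)
# Prints N hundredths of a second as decimal seconds.
centisecs_to_secs() {
    printf '%d.%02d\n' $(( $1 / 100 )) $(( $1 % 100 ))
}

# dirty_threshold_kb MEM_KB RATIO   (hypothetical helper)
# Prints RATIO percent of MEM_KB, in kB.
dirty_threshold_kb() {
    echo $(( $1 * $2 / 100 ))
}

centisecs_to_secs 10               # the vm.dirty_expire_centisecs value set above
centisecs_to_secs 3000             # 30 seconds, the value quoted in the text
dirty_threshold_kb 33554432 10     # 32 GiB of memory, background ratio 10
```

The first two calls show that a setting of 10 is a tenth of a second, while 30 seconds corresponds to 3000.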

# 7 pages of dirty data are waiting to be flushed to disk
# cat /proc/vmstat | egrep "dirty|writeback"
nr_dirty 7
nr_writeback 0
nr_writeback_temp 0

swap

For performance reasons, a database server should not rely on swap, so configuring swap space is unnecessary.

vm.swappiness = 0

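If there is plenty of memory, tell Linux not to use the swap partition much by changing the swappiness value. swappiness=0 means physical memory is used as fully as possible before swap is touched; swappiness=100 means swap is used aggressively, and data in memory is moved to swap early.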

net

net.ipv4.tcp_syncookies = 1
net.ipv4.conf.default.accept_source_route = 0
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.conf.all.arp_filter = 1
net.core.netdev_max_backlog = 10000
net.core.rmem_max = 2097152
net.core.wmem_max = 2097152

min free memory

Reserve 3% of memory as an emergency reserve for the network and file systems (vm.min_free_kbytes); do not exceed 5%.

awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print $2 * .03;}' /proc/meminfo

ipc resource management

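Query each type of IPC resource:

$ ipcs -m    show the IPC shared memory resources in use
$ ipcs -q    show the IPC message queue resources in use
$ ipcs -s    show the IPC semaphore resources in use

Find out who holds an IPC resource

Example: given an IPC key (51036), check whether it is in use.

First convert it to hexadecimal with a calculator:

51036 -> c75c
If you know it is held by shared memory:

$ ipcs -m | grep c75c
0x0000c75c 40403197   tdea3    666        536870912  2
If you are not sure, search everything:

$ ipcs | grep c75c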
0x0000c75c 40403197   tdea3    666        536870912  2
0x0000c75c 5079070    tdea3    666        4
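
The decimal-to-hex step does not need a calculator; printf can produce the key in the same 0x-prefixed, zero-padded form that ipcs prints. The function name ipc_key_hex is an assumption for illustration.

```shell
# ipc_key_hex KEY   (hypothetical helper)
# Prints a decimal IPC key as 0x followed by eight hex digits, as ipcs shows it.
ipc_key_hex() {
    printf '0x%08x\n' "$1"
}

ipc_key_hex 51036
```

The result can be fed directly into the search, for example `ipcs | grep "$(ipc_key_hex 51036)"`.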

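Query system IPC limits

ipcs -l

Remove IPC resources

ipcrm -M shmkey    remove the shared memory segment created with shmkey
ipcrm -m shmid     remove the shared memory segment identified by shmid
ipcrm -Q msgkey    remove the message queue created with msgkey
ipcrm -q msqid     remove the message queue identified by msqid
ipcrm -S semkey    remove the semaphore set created with semkey
ipcrm -s semid     remove the semaphore set identified by semid

Remove all IPC resources that the current user is permitted to remove (when run as root, this removes every IPC resource on the system):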

ipcs -q | awk '{ print "ipcrm -q "$2}' | sh > /dev/null 2>&1;
ipcs -m | awk '{ print "ipcrm -m "$2}' | sh > /dev/null 2>&1;
ipcs -s | awk '{ print "ipcrm -s "$2}' | sh > /dev/null 2>&1;
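
A more cautious variant filters by owner and prints the ipcrm commands for review instead of running them. This is a sketch: the function name ipcrm_commands is an assumption, as is the column layout (key, id, owner), which follows the ipcs output shown earlier; the owner "other" in the sample input is made up.

```shell
# ipcrm_commands FLAG OWNER   (hypothetical helper)
# Reads ipcs output on stdin and prints one ipcrm command for each resource
# whose key column starts with 0x and whose owner column equals OWNER.
ipcrm_commands() {
    awk -v flag="$1" -v owner="$2" \
        '$1 ~ /^0x/ && $3 == owner { print "ipcrm -" flag " " $2 }'
}

# Sample input modelled on the ipcs output shown earlier.
ipcrm_commands m tdea3 <<'EOF'
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x0000c75c 40403197   tdea3      666        536870912  2
0x00000000 98304      other      600        524288     2          dest
EOF
```

On a live host, `ipcs -m | ipcrm_commands m "$(id -un)"` lists the commands for the current user's shared memory segments; pipe the output to sh only after checking it.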

link

https://gpdb.docs.pivotal.io/6-0/install_guide/prep_os.html#topic_sqj_lt1_nfb

ansible

---
- hosts: gp
  vars:
    version: "6.0.0"
    admin_user: "gp12345678"
    admin_password: "333"
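    port_pre: "3001"
    # package_path (used by the copy task at the end) is not defined here; supply it at run time with --extra-vars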
  remote_user: root
  tasks:
  ##
  #! auth
  ##
  - name: add ssh authorized keys for root
    authorized_key:
      user: root
      state: present
      key: "{{ lookup('file', lookup('env','HOME') + '/.ssh/id_rsa.pub') }}"
  ##
  #! user
  ##
  - name: create admin user
    user:
      name: "{{ admin_user }}"
      password: "{{ admin_password | password_hash('sha512', 'iamsalt') }}"

  - name: add ssh authorized keys for admin
    authorized_key:
      user: "{{ admin_user }}"
      state: present
      key: "{{ lookup('file', lookup('env','HOME') + '/.ssh/id_rsa.pub') }}"
  ##    
  #! sysctl
  ##
  - name: backing up sysctl
    copy:
      src: /etc/sysctl.conf
      remote_src: yes
      dest: /tmp/sysctl.conf.bak
      backup: yes
  - name: get shmall 
    shell: echo $(expr $(getconf _PHYS_PAGES) / 2) 
    register: shmall
  - name: get shmmax
    shell: echo $(expr $(getconf _PHYS_PAGES) / 2 \* $(getconf PAGE_SIZE))
    register: shmmax
  - name: get min_free_kbytes
    shell: awk 'BEGIN {OFMT = "%.0f";} /MemTotal/ {print $2 * .03;}' /proc/meminfo
    register: min_free_kbytes
  - name: set shmall
    sysctl:
      name: kernel.shmall
      value: "{{ shmall.stdout }}"
      reload: yes
  - name: set shmmax
    sysctl:
      name: kernel.shmmax
      value: "{{ shmmax.stdout }}"
      reload: yes
  - name: set min_free_kbytes
    sysctl:
      name: vm.min_free_kbytes
      value: "{{ min_free_kbytes.stdout }}"
      reload: yes
  - name: set other sysctl
    sysctl:
       name: "{{ item.key }}"
       value: "{{ item.value }}"
       sysctl_set: yes
       state: present
       reload: yes
       ignoreerrors: yes
    with_dict:
      kernel.shmmni: 4096
      vm.overcommit_memory: 2
      vm.overcommit_ratio: 95
      net.ipv4.ip_local_port_range: 10000 65535
      kernel.sem: 500 2048000 200 40960
      kernel.sysrq: 1
      kernel.core_uses_pid: 1
      kernel.msgmnb: 65536
      kernel.msgmax: 65536
      kernel.msgmni: 2048
      net.ipv4.tcp_syncookies: 1
      net.ipv4.conf.default.accept_source_route: 0
      net.ipv4.tcp_max_syn_backlog: 4096
      net.ipv4.conf.all.arp_filter: 1
      net.core.netdev_max_backlog: 10000
      net.core.rmem_max: 2097152
      net.core.wmem_max: 2097152
      vm.swappiness: 0
      vm.zone_reclaim_mode: 0
      vm.dirty_expire_centisecs: 10    # unit is 1/100 s, so this is 0.1 s
      vm.dirty_writeback_centisecs: 3  # unit is 1/100 s, so this is 0.03 s
      vm.dirty_background_ratio: 10
      vm.dirty_ratio: 20
      vm.dirty_background_bytes: 0
      vm.dirty_bytes: 0 
  ##    
  #! pam limit
  ##
  - name: state PAM limits
    pam_limits:
      domain: '*'
      limit_type: '-'
      limit_item: "{{ item.key }}"
      value: "{{ item.value }}"
    with_dict:
      nofile: 655360
      nproc: 655360
      memlock: unlimited
      core: unlimited

  ##    
  #! install src gp
  ##
  - name: copy package to host
    copy:
      src: "{{ package_path }}"
      dest: /tmp

Jackgo
