Environment:
Ubuntu 14.04
Cluster node IPs: 10.11.8.192 and 10.11.8.193
NFS server IP: 10.11.8.43
Prerequisites:
1. Time synchronization between the two nodes
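A minimal sketch of a one-off sync (assumes the ntpdate package is installed and a public NTP server is reachable; run on both nodes, and consider the ntp daemon or a cron job to keep the clocks aligned afterwards):
root@node1:~# ntpdate pool.ntp.org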
2. Hostname and hosts file configuration (needed on both nodes)
vim /etc/sysctl.d/10-kernel-hardening.conf   # on node1, set the hostname by adding:
kernel.hostname = node1
vim /etc/sysctl.d/10-kernel-hardening.conf   # on node2, set the hostname by adding:
kernel.hostname = node2
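The sysctl.d entry is applied at boot; to make it take effect right away as well, the value can be written at runtime (same idea on node2 with its own name):
root@node1:~# sysctl -w kernel.hostname=node1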
To avoid problems with DNS resolution, use the hosts file and make sure each hostname matches the output of 'uname -n'.
vim /etc/hosts   # add these entries on both nodes
10.11.8.192 node1
10.11.8.193 node2
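A quick check that the names are in place (run on each node):
root@node1:~# uname -n          # should print node1
root@node1:~# ping -c 1 node2   # the hosts entry should resolve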
3. Set up key-based SSH trust between the two nodes
Node1:
root@node1:~# ssh-keygen -t rsa
root@node1:~# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2   # answer yes to accept the key, then enter node2's password
Node2:
root@node2:~# ssh-keygen -t rsa
root@node2:~# ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1
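To confirm the trust works in both directions, an ssh command should now run without a password prompt, for example:
root@node1:~# ssh node2 uname -n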
Install corosync and pacemaker; the packages involved:
cluster-glue cluster-glue-dev heartbeat resource-agents corosync
heartbeat-dev pacemaker corosync-lib libesmtp pacemaker-dev
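A hedged install example for both nodes, assuming the packages above are available from the standard Ubuntu 14.04 repositories (corosync and pacemaker pull in most of the others as dependencies):
root@node1:~# apt-get update
root@node1:~# apt-get install -y corosync pacemaker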
Configure corosync (the following commands are run on node1):
1. Edit /etc/corosync/corosync.conf
# Please read the openais.conf.5 manual page
totem {
    version: 2
    # How long before declaring a token lost (ms)
    token: 3000
    # How many token retransmits before forming a new configuration
    token_retransmits_before_loss_const: 10
    # How long to wait for join messages in the membership protocol (ms)
    join: 60
    # How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
    consensus: 3600
    # Turn off the virtual synchrony filter
    vsftype: none
    # Number of messages that may be sent by one processor on receipt of the token
    max_messages: 20
    # Limit generated nodeids to 31-bits (positive signed integers)
    clear_node_high_bit: yes
    # Disable encryption
    secauth: off                     # authentication/encryption disabled; set to on to enable it
    # How many threads to use for encryption/decryption
    threads: 0
    # Optionally assign a fixed node id (integer)
    # nodeid: 1234
    # This specifies the mode of redundant ring, which may be none, active, or passive.
    rrp_mode: none
    interface {
        # The following values need to be set based on your environment
        ringnumber: 0
        bindnetaddr: 10.11.8.0       # network address the nodes are on
        mcastaddr: 226.93.2.1        # multicast address; any unused one works (224.0.2.0-238.255.255.255 is the user-assignable, network-wide transient range)
        mcastport: 5405              # multicast port
    }
}
amf {
    mode: disabled
}
quorum {
    # Quorum for the Pacemaker Cluster Resource Manager
    provider: corosync_votequorum
    expected_votes: 1
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: no                    # do not log to standard error
    to_logfile: yes                  # log to a file
    logfile: /var/log/corosync.log   # log file location
    to_syslog: no                    # do not log to syslog
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
# pacemaker service definition
service {
    ver: 1
    name: pacemaker
}
Note: the upstream documentation has changed; pacemaker must now be started separately.
In the past the Corosync process would launch pacemaker, this
is no longer the case. Pacemaker must be launched after Corosync has
successfully started.
Source: http://clusterlabs.org/wiki/Initial_Configuration#Corosync
/etc/init.d/corosync start
/etc/init.d/pacemaker start
2. Generate the authentication key used for communication between the nodes:
root@node1:~# corosync-keygen -l
The -l option reads random data from /dev/urandom.
Without -l, corosync-keygen reads from /dev/random and will hang if the system does not have enough entropy.
3. Copy corosync.conf and authkey to node2:
root@node1:~# scp -p /etc/corosync/corosync.conf /etc/corosync/authkey node2:/etc/corosync/
4. Edit /etc/default/corosync on both nodes
# vim /etc/default/corosync
START=yes
If this is not changed, the start command runs without error or output, but the process never actually starts.
Start corosync and pacemaker:
root@node1:~# /etc/init.d/corosync start
root@node1:~# /etc/init.d/pacemaker start
root@node1:~# tail -f /var/log/corosync.log   # watch the log file
root@node1:~# netstat -tunlp                  # check the listening ports
udp 0 0 10.11.8.192:5404 0.0.0.0:* 1431/corosync
udp 0 0 10.11.8.192:5405 0.0.0.0:* 1431/corosync
udp 0 0 226.93.2.1:5405 0.0.0.0:* 1431/corosync
Once node1 has started cleanly, node2 can be started from it:
root@node1:~# ssh node2 -- /etc/init.d/corosync start
root@node1:~# ssh node2 -- /etc/init.d/pacemaker start
Check the cluster node status:
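The output below is a snapshot taken with crm's status command (crm_mon -1 gives a similar view):
root@node1:~# crm status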
Last updated: Wed May 18 08:49:46 2016
Last change: Mon May 16 06:12:56 2016 via crm_attribute on node1
Stack: corosync
Current DC: node1 (168495296) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
0 Resources configured
Online: [ node1 node2 ]
# ps auxf   # view the cluster processes
root 1472 0.0 1.1 107512 9040 pts/0 S 08:32 0:00 pacemakerd
haclust+ 1474 0.0 2.0 110260 15636 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/cib
root 1475 0.0 1.2 107264 9668 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/stonithd
root 1476 0.0 0.9 81824 6992 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/lrmd
haclust+ 1477 0.0 0.8 97688 6800 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/attrd
haclust+ 1478 0.0 2.9 110264 22136 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/pengine
haclust+ 1479 0.0 1.8 166560 14000 ? Ss 08:32 0:00 \_ /usr/lib/pacemaker/crmd
Configure cluster properties:
1. Disable stonith
stonith is enabled by default, but this cluster has no stonith device, so disable it:
# crm configure property stonith-enabled=false
Note: crm can be used either in one-shot command mode or in interactive mode.
View the current configuration with:
# crm configure show
node node1.magedu.com
node node2.magedu.com
property $id="cib-bootstrap-options" \
dc-version="1.0.11-1554a83db0d3c3e546cfd3aaff6af1184f79ee87" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false"
This shows that stonith has been disabled.
2. Ignore the cluster state check when quorum cannot be met
When one node goes offline, the cluster state becomes "WITHOUT quorum": quorum is lost and the cluster no longer considers itself fit to run services. For a cluster with only two nodes this behaviour is unreasonable, so tell it to ignore the loss of quorum:
# crm configure property no-quorum-policy=ignore
Add resources to the cluster:
1. Inspect resource agents
The cluster supports several resource agent classes, such as heartbeat, LSB and OCF; LSB and OCF are the most commonly used, while the stonith class exists solely for configuring stonith devices.
The classes supported by the current cluster can be listed with:
# crm ra classes
heartbeat
lsb
ocf / heartbeat pacemaker
stonith
To list all the resource agents in a class, or to see the details of a particular agent, use commands like:
# crm ra list lsb
# crm ra list ocf heartbeat
# crm ra info ocf:heartbeat:IPaddr
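The Filesystem resource defined in the next step mounts an NFS export from 10.11.8.43. A minimal sketch of that export on the NFS server (the nfs hostname/prompt is hypothetical; it assumes /www/html already exists there and the NFS server packages are installed):
root@nfs:~# vim /etc/exports    # add the export line below
/www/html    10.11.8.0/23(rw,sync,no_subtree_check)
root@nfs:~# exportfs -ra        # reload the export table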
2. Add resources
crm(live)configure# primitive webip ocf:heartbeat:IPaddr params ip=10.11.8.200 nic=eth1 cidr_netmask=23
crm(live)configure# primitive filesystem ocf:heartbeat:Filesystem params device=10.11.8.43:/www/html directory=/var/www/html fstype=nfs
crm(live)configure# primitive httpd lsb:apache2
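In crm's interactive configure mode the new primitives only take effect after a commit; verify is optional but catches mistakes first:
crm(live)configure# verify
crm(live)configure# commit
crm(live)configure# cd          # back to the top level, as used below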
crm(live)# status
Last updated: Wed May 18 09:12:56 2016
Last change: Wed May 18 09:12:52 2016 via cibadmin on node1
Stack: corosync
Current DC: node1 (168495296) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured
Online: [ node1 node2 ]
webip (ocf::heartbeat:IPaddr): Started node1
filesystem (ocf::heartbeat:Filesystem): Started node2
httpd (lsb:apache2): Started node1
3. Resource constraints
The three resources are not running on the same node, which is not what we want, so put them into a group:
crm(live)configure# group webservice webip filesystem httpd
crm(live)# status
Last updated: Wed May 18 09:22:48 2016
Last change: Wed May 18 09:20:48 2016 via crm_attribute on node2
Stack: corosync
Current DC: node1 (168495296) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured
Online: [ node1 node2 ]
Resource Group: webservice
webip (ocf::heartbeat:IPaddr): Started node1
filesystem (ocf::heartbeat:Filesystem): Started node1
httpd (lsb:apache2): Started node1
There are three kinds of resource constraints:
1) Resource Location: defines on which nodes a resource may, may not, or preferably should run;
2) Resource Colocation: defines which resources may or may not run together on the same node;
3) Resource Order: defines the order in which resources are started on a node.
When defining constraints you also assign scores. Scores are a central part of how the cluster works: everything from migrating a resource to deciding which resources to stop in a degraded cluster is done by manipulating scores. Scores are calculated per resource, and a resource cannot run on any node where its score is negative; once the scores are calculated, the cluster places the resource on the node with the highest score. INFINITY is currently defined as 1,000,000, and adding or subtracting infinity follows three basic rules:
1) any value + INFINITY = INFINITY
2) any value - INFINITY = -INFINITY
3) INFINITY - INFINITY = -INFINITY
When defining a constraint you can also give the constraint itself a score: constraints with higher scores are applied before those with lower scores. By creating additional location constraints with different scores for a given resource, you can control the order of nodes the resource will fail over to.
So, when resources such as WebIP and WebSite (the generic names used in this example; substitute the names defined above as appropriate) might end up on different nodes, a colocation constraint solves it:
# crm configure colocation website-with-ip INFINITY: WebSite WebIP
Next, we also need to make sure WebIP is started before WebSite on a node, which an order constraint handles:
# crm configure order httpd-after-ip mandatory: WebIP WebSite
In addition, since an HA cluster does not require every node to have equal or similar performance, we may want the service to normally run on a more powerful node; a location constraint does this:
# crm configure location prefer-node1 WebSite rule 200: node1
This constrains WebSite to node1 with a score of 200.
Test failover by putting node1 into standby:
crm(live)node# standby
crm(live)node# cd
crm(live)# status
Last updated: Wed May 18 09:25:24 2016
Last change: Wed May 18 09:25:20 2016 via crm_attribute on node1
Stack: corosync
Current DC: node1 (168495296) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
3 Resources configured
Node node1 (168495296): standby
Online: [ node2 ]
Resource Group: webservice
webip (ocf::heartbeat:IPaddr): Started node2
filesystem (ocf::heartbeat:Filesystem): Started node2
httpd (lsb:apache2): Started node2
The service is still reachable after the failover.
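After the test, node1 can be taken out of standby again (whether the resources move back depends on stickiness and any location constraints):
root@node1:~# crm node online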