
Nomad Service Orchestration

Nomad is a tool for managing a cluster of machines and running applications on them.

Quick Start

Environment Setup

Prepare three virtual machines by following the earlier post "Building a Consul Cluster".

Node  IP
n1    172.20.20.10
n2    172.20.20.11
n3    172.20.20.12

Single-Machine Installation

Log in to VM n1 and switch to the root user:

» vagrant ssh n1
[vagrant@n1 ~]$ su
Password:
[root@n1 vagrant]#

Install a few dependency tools:

[root@n1 vagrant]# yum install -y epel-release
[root@n1 vagrant]# yum install -y jq
[root@n1 vagrant]# yum install -y unzip

Download version 0.8.1 to the /tmp directory.

Note: the latest 0.8.3 release has a bug that causes services to be registered repeatedly when integrated with Consul, so 0.8.1 is used here.
[root@n1 vagrant]# cd /tmp/
[root@n1 vagrant]# curl -s https://releases.hashicorp.com/nomad/0.8.1/nomad_0.8.1_linux_amd64.zip -o nomad.zip

Unzip the archive, make the nomad binary executable, and move it to /usr/bin/:

[root@n1 vagrant]# unzip nomad.zip
[root@n1 vagrant]# chmod +x nomad
[root@n1 vagrant]# mv nomad /usr/bin/nomad

Verify that Nomad is installed:

[root@n1 vagrant]# nomad
Usage: nomad [-version] [-help] [-autocomplete-(un)install] <command> [args]

Common commands:
    run         Run a new job or update an existing job
    stop        Stop a running job
    status      Display the status output for a resource
    alloc       Interact with allocations
    job         Interact with jobs
    node        Interact with nodes
    agent       Runs a Nomad agent

Other commands:
    acl             Interact with ACL policies and tokens
    agent-info      Display status information about the local agent
    deployment      Interact with deployments
    eval            Interact with evaluations
    namespace       Interact with namespaces
    operator        Provides cluster-level tools for Nomad operators
    quota           Interact with quotas
    sentinel        Interact with Sentinel policies
    server          Interact with servers
    ui              Open the Nomad Web UI
    version         Prints the Nomad version

If you see the output above, the installation succeeded.

Batch Installation

Refer to the "Batch Installation" section of the earlier post "Building a Consul Cluster".

The following script installs Nomad on all VMs in one pass and installs Docker on each of them at the same time.

$script = <<SCRIPT

echo "Installing dependencies ..."
yum install -y epel-release
yum install -y net-tools
yum install -y wget
yum install -y jq
yum install -y unzip
yum install -y bind-utils

echo "Determining Consul version to install ..."
CHECKPOINT_URL="https://checkpoint-api.hashicorp.com/v1/check"
if [ -z "$CONSUL_DEMO_VERSION" ]; then
    CONSUL_DEMO_VERSION=$(curl -s "${CHECKPOINT_URL}"/consul | jq .current_version | tr -d '"')
fi

echo "Fetching Consul version ${CONSUL_DEMO_VERSION} ..."
cd /tmp/
curl -s https://releases.hashicorp.com/consul/${CONSUL_DEMO_VERSION}/consul_${CONSUL_DEMO_VERSION}_linux_amd64.zip -o consul.zip

echo "Installing Consul version ${CONSUL_DEMO_VERSION} ..."
unzip consul.zip
sudo chmod +x consul
sudo mv consul /usr/bin/consul

sudo mkdir /etc/consul.d
sudo chmod a+w /etc/consul.d

echo "Determining Nomad 0.8.1 to install ..."
#CHECKPOINT_URL="https://checkpoint-api.hashicorp.com/v1/check"
#if [ -z "$NOMAD_DEMO_VERSION" ]; then
#    NOMAD_DEMO_VERSION=$(curl -s "${CHECKPOINT_URL}"/nomad | jq .current_version | tr -d '"')
#fi

echo "Fetching Nomad version ${NOMAD_DEMO_VERSION} ..."
cd /tmp/
curl -s https://releases.hashicorp.com/nomad/0.8.1/nomad_0.8.1_linux_amd64.zip -o nomad.zip

echo "Installing Nomad version 0.8.1 ..."
unzip nomad.zip
sudo chmod +x nomad
sudo mv nomad /usr/bin/nomad

echo "Installing nginx ..."
#yum install -y nginx

echo "Installing docker ..."
yum install -y docker

SCRIPT
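
For context, here is a minimal sketch of how this $script heredoc plugs into the Vagrantfile from the Consul post. The box name and network setup are assumptions; adapt them to your own Vagrantfile:

# Vagrantfile (sketch) - provisions all three nodes with the script above
Vagrant.configure("2") do |config|
  config.vm.box = "centos/7"                    # assumed base box
  config.vm.provision "shell", inline: $script  # run the install script on each node

  { "n1" => "172.20.20.10",
    "n2" => "172.20.20.11",
    "n3" => "172.20.20.12" }.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = name
      node.vm.network "private_network", ip: ip
    end
  end
end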

Starting the Agents

First start Consul to form a cluster; see "Building a Consul Cluster" for details. With the default configuration, Nomad detects the local Consul agent at startup and automatically registers the nomad service with it.

n1

[root@n1 vagrant]# consul agent -server -bootstrap-expect 3 -data-dir /etc/consul.d -node=node1 -bind=172.20.20.10 -ui -client 0.0.0.0

n2

[root@n2 vagrant]# consul agent -server -bootstrap-expect 3 -data-dir /etc/consul.d -node=node2 -bind=172.20.20.11 -ui -client 0.0.0.0 -join 172.20.20.10

n3

[root@n3 vagrant]# consul agent -server -bootstrap-expect 3 -data-dir /etc/consul.d -node=node3 -bind=172.20.20.12 -ui -client 0.0.0.0 -join 172.20.20.10

Verify the Consul cluster from n1:

[root@n1 vagrant]# consul members
Node   Address            Status  Type    Build  Protocol  DC   Segment
node1  172.20.20.10:8301  alive   server  1.1.0  2         dc1  <all>
node2  172.20.20.11:8301  alive   server  1.1.0  2         dc1  <all>
node3  172.20.20.12:8301  alive   server  1.1.0  2         dc1  <all>

Basic Concepts

  • server: schedules and assigns submitted jobs
  • client: executes job tasks
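
As an aside, for quick local experiments a single agent can play both roles at once using Nomad's built-in dev mode (not used in this cluster setup):

[root@n1 vagrant]# nomad agent -dev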
Starting the Servers

Define the server configuration file server.hcl:

log_level = "DEBUG"

bind_addr = "0.0.0.0"

data_dir = "/home/vagrant/data_server"

name = "server1"

advertise {
  http = "172.20.20.10:4646"
  rpc = "172.20.20.10:4647"
  serf = "172.20.20.10:4648"
}

server {
  enabled = true
  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 3
}

Run it on the command line:

[root@n1 vagrant]# nomad agent -config=server.hcl

Then on n2 and n3, run the same command, after adjusting name and the advertise addresses to each node's own IP (see the sketch below):

nomad agent -config=server.hcl
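
For example, server.hcl on n2 differs only in the node name and advertise addresses (a sketch; n3 is analogous with server3 and 172.20.20.12):

log_level = "DEBUG"

bind_addr = "0.0.0.0"

data_dir = "/home/vagrant/data_server"

name = "server2"

advertise {
  http = "172.20.20.11:4646"
  rpc = "172.20.20.11:4647"
  serf = "172.20.20.11:4648"
}

server {
  enabled = true
  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 3
}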

Open http://172.20.20.10:8500/ui/#/dc1/services in a browser; Consul shows that the Nomad servers are all registered and running.

Then open Nomad's built-in UI at http://172.20.20.10:4646/ui/servers to confirm that all the servers are running.

Starting the Clients

Before starting the clients, start Docker; the clients need it to execute jobs.

[root@n1 vagrant]# systemctl start docker

Docker must be started on n2 and n3 as well.
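
Optionally, enable the service so Docker comes back after a VM reboot:

[root@n1 vagrant]# systemctl enable docker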

Define the client configuration file client.hcl:

log_level = "DEBUG"
data_dir = "/home/vagrant/data_client"
name = "client1"
advertise {
  http = "172.20.20.10:4646"
  rpc = "172.20.20.10:4647"
  serf = "172.20.20.10:4648"
}
client {
  enabled = true
  servers = ["172.20.20.10:4647"]
}

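# move the client agent's HTTP API to 5656 so it does not clash with the
# server agent already listening on 4646 on the same VM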
ports {
  http = 5656
}

On n1, run:

[root@n1 vagrant]# nomad agent -config=client.hcl

Open http://172.20.20.10:8500/ui/#/dc1/services/nomad-client in a browser.

You can see that nomad-client has started successfully. Start clients on n2 and n3 in the same way, again giving each node its own name and advertise addresses.

Once n2 and n3 are done, all three clients show up as registered.

Running a Job

Go to n2, create a job directory, and run nomad init:

[root@n2 vagrant]# mkdir job
[root@n2 vagrant]# cd job/
[root@n2 job]# nomad init
Example job file written to example.nomad

The command above generates an example job file.
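
The generated example.nomad defines a Docker-based Redis service. Abridged, it looks roughly like this (a sketch; the exact defaults vary slightly between Nomad versions):

job "example" {
  datacenters = ["dc1"]
  type = "service"

  group "cache" {
    count = 1

    task "redis" {
      # run Redis in a container via the docker driver
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500  # MHz
        memory = 256  # MB
        network {
          mbits = 10
          port "db" {}
        }
      }
    }
  }
}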

Run the job:

[root@n2 job]# nomad run example.nomad
==> Monitoring evaluation "97f8a1fe"
    Evaluation triggered by job "example"
    Evaluation within deployment: "3c89e74a"
    Allocation "47bf1f20" created: node "9df69026", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "97f8a1fe" finished with status "complete"

You can see that the client on node 9df69026 executed the job.
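
The allocation can also be inspected from the CLI (the IDs are the ones printed by nomad run above):

[root@n2 job]# nomad alloc status 47bf1f20
[root@n2 job]# nomad alloc logs 47bf1f20 redis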

Advanced Operations

Cluster Members

[root@n1 vagrant]# nomad server members
Name            Address       Port  Status  Leader  Protocol  Build  Datacenter  Region
server1.global  172.20.20.10  4648  alive   false   2         0.8.1  dc1         global
server2.global  172.20.20.11  4648  alive   false   2         0.8.1  dc1         global
server3.global  172.20.20.12  4648  alive   true    2         0.8.1  dc1         global
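
The client nodes can be listed the same way:

[root@n1 vagrant]# nomad node status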

Querying Job Status

[root@n1 vagrant]# nomad status example
ID            = example
Name          = example
Submit Date   = 2018-06-13T08:42:57Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         1        0       0         0

Latest Deployment
ID          = 3c89e74a
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       1        1       1        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
47bf1f20  9df69026  cache       0        run      running  8m44s ago  8m26s ago

Modifying a Job

Edit example.nomad, find count = 1, and change it to count = 3.

Preview the change plan on the command line:

[root@n2 job]# nomad plan example.nomad
+/- Job: "example"
+/- Task Group: "cache" (2 create, 1 in-place update)
  +/- Count: "1" => "3" (forces create)
      Task: "redis"

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 70
To submit the job with version verification run:

nomad job run -check-index 70 example.nomad

When running the job with the check-index flag, the job will only be run if the
server side version matches the job modify index returned. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Apply the change using the index from the plan output:

[root@n2 job]# nomad job run -check-index 70 example.nomad
==> Monitoring evaluation "3a0ff5e0"
    Evaluation triggered by job "example"
    Evaluation within deployment: "2b5b803f"
    Allocation "34086acb" created: node "6166e031", group "cache"
    Allocation "4d01cd92" created: node "f97b5095", group "cache"
    Allocation "47bf1f20" modified: node "9df69026", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "3a0ff5e0" finished with status "complete"

Two more client nodes have now picked up the job.

In the browser you can see three instances in total, as well as the job's version history.
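
The same version history is available from the CLI (a bad change could then be rolled back with nomad job revert):

[root@n2 job]# nomad job history example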

[root@n2 job]# nomad status example
ID            = example
Name          = example
Submit Date   = 2018-06-13T08:56:03Z
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         3        0       0         0

Latest Deployment
ID          = 2b5b803f
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy
cache       3        3       3        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
34086acb  6166e031  cache       1        run      running  3m38s ago   3m25s ago
4d01cd92  f97b5095  cache       1        run      running  3m38s ago   3m26s ago
47bf1f20  9df69026  cache       1        run      running  16m43s ago  3m27s ago

Leaving the Cluster

First stop the Nomad server on n1 with Ctrl-C, then query the members from n2:

[root@n2 job]# nomad server members
Name            Address       Port  Status  Leader  Protocol  Build  Datacenter  Region
server1.global  172.20.20.10  4648  failed  false   2         0.8.1  dc1         global
server2.global  172.20.20.11  4648  alive   true    2         0.8.1  dc1         global
server3.global  172.20.20.12  4648  alive   false   2         0.8.1  dc1         global

server1's status is now failed. Force it out of the cluster:

[root@n2 job]# nomad server force-leave server1.global
[root@n2 job]# nomad server members
Name            Address       Port  Status  Leader  Protocol  Build  Datacenter  Region
server1.global  172.20.20.10  4648  left    false   2         0.8.1  dc1         global
server2.global  172.20.20.11  4648  alive   true    2         0.8.1  dc1         global
server3.global  172.20.20.12  4648  alive   false   2         0.8.1  dc1         global

server1's status is now left; it has been removed from the cluster.
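
Finally, the example job can be cleaned up the same way it was started:

[root@n2 job]# nomad stop example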

