Handling the failure of one shard in a MongoDB sharded cluster
[toc]

Scenario overview

| IP | Port | Role | Port | Role | Port | Role | Port | Role |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 192.168.59.140 | 27000 | mongos | 27100 | config | 27101 | shard1-primary | 27102 | shard2-secondary |
| 192.168.59.141 | 27000 | mongos | 27100 | config | 27101 | shard1-secondary | 27102 | shard2-primary |
| 192.168.59.142 | 27000 | mongos | 27100 | config | 27101 | shard1-arbiter | 27102 | shard2-arbiter |

| Scenario | Failure | Impact |
| --- | --- | --- |
| Scenario 1 | shard2 secondary node fails | No business impact |
| Scenario 2 | shard2 primary node fails | No business impact |
| Scenario 3 | shard2 arbiter node fails | No business impact |
| Scenario 4 | Two shard1 nodes fail, one of them the arbiter | Business impact: the whole cluster cannot serve reads or writes |
| Scenario 5 | Two shard1 nodes fail: primary + secondary | Business impact: the whole cluster cannot serve reads or writes |

Recovery procedures

Scenario 1: shard2 secondary node failure

No business impact. Recovery steps:

1. Deploy a new mongod instance

(1) For the new instance's first startup, comment out the following sections in its configuration file:
#security:
#  keyFile: /data/mongodb/auth/keyfile.key
#  authorization: enabled
 
#replication:
#  oplogSizeMB: 512
#  replSetName: shard2
#sharding:
#  clusterRole: shardsvr

(2) Create the management accounts
After the instance starts, create the administrative account chjroot and the monitoring account monitor, for example as sketched below.
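
A minimal sketch of the account creation, run against the new instance while authentication is still disabled (the role assignments and placeholder passwords are assumptions; adapt them to your own standards):

use admin
// administrative account (the root role here is an assumption)
db.createUser({ user: "chjroot", pwd: "xxxxxx", roles: [ { role: "root", db: "admin" } ] })
// monitoring account (the clusterMonitor role here is an assumption)
db.createUser({ user: "monitor", pwd: "xxxxxx", roles: [ { role: "clusterMonitor", db: "admin" } ] })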

(3) Restore (uncomment) the configuration and restart the new instance.
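
A minimal sketch of the restart, assuming the instance is started directly from its config file (the file path is an assumption):

# clean shutdown of the temporary no-auth instance
mongo --port 27103 admin --eval "db.shutdownServer()"
# start it again with the security/replication/sharding sections restored
mongod -f /data/mongodb/27103/mongod.conf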

2. Remove the failed node from the replica set

shard2:PRIMARY> rs.remove("192.168.59.140:27102")
{
    "ok" : 1,
    "operationTime" : Timestamp(1588216157, 1),
    "$gleStats" : {
        "lastOpTime" : {
            "ts" : Timestamp(1588216157, 1),
            "t" : NumberLong(2)
        },
        "electionId" : ObjectId("7fffffff0000000000000002")
    },
    "lastCommittedOpTime" : Timestamp(1588216156, 1),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588216153, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588216157, 1),
        "signature" : {
            "hash" : BinData(0,"Z+IrFefwfA638bKEBEqp6mJVEnc="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}

3. Add a new data node to shard2

shard2:PRIMARY> rs.add("192.168.59.142:27103")
{
    "ok" : 1,
    "operationTime" : Timestamp(1588216039, 1),
    "$gleStats" : {
        "lastOpTime" : {
            "ts" : Timestamp(1588216039, 1),
            "t" : NumberLong(2)
        },
        "electionId" : ObjectId("7fffffff0000000000000002")
    },
    "lastCommittedOpTime" : Timestamp(1588215749, 4),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588216037, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588216039, 1),
        "signature" : {
            "hash" : BinData(0,"OlhhP3xyV9/Ye8n6nX9hWmu3RU8="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}

4. Check the replica set status

shard2:PRIMARY> rs.status()
# confirm that the newly added node has reached SECONDARY state
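
To see the member states at a glance, something like the following also works (a convenience snippet, not part of the original procedure):

shard2:PRIMARY> rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr) })
// the new member should end up in SECONDARY state; it may show STARTUP2 while the initial sync is still running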

5. Check the mongos status

mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
      "_id" : 1,
      "minCompatibleVersion" : 5,
      "currentVersion" : 6,
      "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.141:27102,192.168.59.142:27103",  "state" : 1 }

# shard2's host entry has automatically changed to "shard2/192.168.59.141:27102,192.168.59.142:27103"

Scenario 2: shard2 primary node failure

No business impact; the secondary node is automatically promoted to primary.
The recovery procedure is the same as in scenario 1.

Scenario 3: shard2 arbiter node failure

No business impact.
The recovery procedure is the same as in scenario 1, with one difference: instead of adding a data node, add an arbiter node:

shard2:PRIMARY> rs.addArb("192.168.59.142:27103")

Scenario 4: two shard1 nodes fail, one of them the arbiter

Business impact. Once the shard1 replica set is reduced to a single data node, that node automatically steps down to SECONDARY. The surviving node itself is running normally, but new reads and writes through mongos return errors, and operations on existing connections time out and are recorded in the system log, as shown below:

mongos> show dbs
2020-04-30T14:55:04.110+0800 E QUERY    [js] Error: listDatabases failed:{
    "ok" : 0,
    "errmsg" : "Could not find host matching read preference { mode: \"primary\" } for set shard1",
    "code" : 133,
    "codeName" : "FailedToSatisfyReadPreference",
    "operationTime" : Timestamp(1588229692, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588229692, 2),
        "signature" : {
            "hash" : BinData(0,"2BYNFCHN8dZgE8E1J6AluDOVNZM="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
Mongo.prototype.getDBs@src/mongo/shell/mongo.js:124:1
shellHelper.show@src/mongo/shell/utils.js:876:19
shellHelper@src/mongo/shell/utils.js:766:15
@(shellhelp2):1:1

System log entry:
2020-04-30T14:56:23.257+0800 I COMMAND  [conn914] command lcl_szy.mycol01 appName: "MongoDB Shell" command: insert { insert: "mycol01", ordered: true, lsid: { id: UUID("b9e81b56-3cd9-4f7e-9713-2bc2705e6181") }, $clusterTime: { clusterTime: Timestamp(1588229755, 2), signature: { hash: BinData(0, 199FBD05D97079A444636107558992C60AB4D77D), keyId: 6819186238046601246 } }, $db: "lcl_szy" } nShards:1 ninserted:0 numYields:0 reslen:353 protocol:op_msg 19684ms

At this point the cluster's balancer starts trying to migrate chunks from shard2 to shard1, but because shard1 is left with only a single secondary (no primary), the migrations never succeed and the chunk distribution does not change:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }
  active mongoses:
        "4.0.4-62-g7e345a7" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard1
        Time of Reported error:  Thu Apr 30 2020 15:06:39 GMT+0800 (CST)
        Migration Results for the last 24 hours:
                196 : Success
                7622 : Failed with error 'aborted', from shard2 to shard1
  databases:
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                shard1  1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard1 Timestamp(1, 0)
        {  "_id" : "iot_test",  "primary" : "shard2",  "partitioned" : true,  "version" : {  "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"),  "lastMod" : 1 } }
                iot_test.vehicle_signal
                        shard key: { "deviceId" : "hashed" }
                        unique: false
                        balancing: true
                        chunks:
                                shard1  196
                                shard2  197
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "lcl_szy",  "primary" : "shard1",  "partitioned" : false,  "version" : {  "uuid" : UUID("635e1d10-b035-41ea-9a78-85ff4fdbadc0"),  "lastMod" : 1 } }
 
mongos> sh.isBalancerRunning()
true

Recovery steps

1. Force-reconfigure shard1's sole remaining secondary into a single-member replica set to restore service

shard1:SECONDARY> config=rs.conf()
shard1:SECONDARY> config.members=[config.members[0]]
shard1:SECONDARY>  rs.reconfig(config,{force:true})
{
    "ok" : 1,
    "operationTime" : Timestamp(1588230485, 58),
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff0000000000000004")
    },
    "lastCommittedOpTime" : Timestamp(0, 0),
    "$configServerState" : {
        "opTime" : {
            "ts" : Timestamp(1588232958, 2),
            "t" : NumberLong(1)
        }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588232958, 2),
        "signature" : {
            "hash" : BinData(0,"yeqvk5rGUesVN67DY5+0RojTM7I="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
 
shard1:PRIMARY>
shard1:PRIMARY> show dbs
admin     0.000GB
config    0.000GB
iot_test  1.660GB
lcl_szy   2.965GB
local     0.714GB

2. Check the mongos status

shard1 now appears as a single node:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101",  "state" : 1,  "draining" : true }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }

3. Add a new arbiter node to shard1

Same as scenario 3.

4. Add a new data node to shard1

Same as scenario 1.

5. Check the status of the rebuilt replica set

Scenario 5: two shard1 nodes fail, primary + secondary

The whole cluster is now unavailable; the priority is to reduce the business impact as quickly as possible.

1. Remove shard1 from the sharded cluster

Delete all metadata that references shard1 from the config server replica set:

repl_config:PRIMARY> use config
repl_config:PRIMARY> db.shards.find()
{ "_id" : "shard1", "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101", "state" : 1, "draining" : true }
{ "_id" : "shard2", "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102", "state" : 1 }
repl_config:PRIMARY> db.shards.remove({'_id':"shard1"})
 repl_config:PRIMARY> db.shards.find()
{ "_id" : "shard2", "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102", "state" : 1 }
repl_config:PRIMARY> db.collections.find()
Delete the record for the collection that had sharding enabled and data on shard1:
repl_config:PRIMARY> db.collections.remove({"_id":"iot_test.vehicle_signal"})
WriteResult({ "nRemoved" : 1 })
repl_config:PRIMARY> db.databases.remove({"_id":"lcl_szy"})
WriteResult({ "nRemoved" : 1 })
repl_config:PRIMARY> db.databases.find()
{ "_id" : "iot_test", "primary" : "shard2", "partitioned" : true, "version" : { "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"), "lastMod" : 1 } }
repl_config:PRIMARY>

At this point mongos can accept writes normally again; reads simply no longer return the data that was on shard1.

2. Redeploy the shard1 replica set (see the sketch below).
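
A minimal sketch of re-initializing shard1 with the same three hosts as in the original topology (each mongod is assumed to be configured with replSetName: shard1 and clusterRole: shardsvr, as in the config template from scenario 1):

rs.initiate({
    _id: "shard1",
    members: [
        { _id: 0, host: "192.168.59.140:27101" },
        { _id: 1, host: "192.168.59.141:27101" },
        { _id: 2, host: "192.168.59.142:27101", arbiterOnly: true }   // arbiter, per the topology table
    ]
})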

3. Add shard1 back into the MongoDB sharded cluster

mongos> sh.addShard('shard1/192.168.59.140:27101,192.168.59.141:27101,192.168.59.142:27101')
{
    "shardAdded" : "shard1",
    "ok" : 1,
    "operationTime" : Timestamp(1588236561, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588236561, 2),
        "signature" : {
            "hash" : BinData(0,"3It1W8WjI0mMPuYN6wlCfS8M8fo="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}
mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5ea29dc1123de331d06a015d")
  }
  shards:
        {  "_id" : "shard1",  "host" : "shard1/192.168.59.140:27101,192.168.59.141:27101",  "state" : 1 }
        {  "_id" : "shard2",  "host" : "shard2/192.168.59.140:27102,192.168.59.141:27102",  "state" : 1 }
  active mongoses:
        "4.0.4-62-g7e345a7" : 3
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  5
        Last reported error:  Could not find host matching read preference { mode: "primary" } for set shard1
        Time of Reported error:  Thu Apr 30 2020 16:14:06 GMT+0800 (CST)
        Migration Results for the last 24 hours:
                259 : Success
                7267 : Failed with error 'aborted', from shard2 to shard1
  databases:
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                shard2  1
                        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard2 Timestamp(2, 0)
        {  "_id" : "iot_test",  "primary" : "shard2",  "partitioned" : true,  "version" : {  "uuid" : UUID("d628fb8e-c88e-4548-9421-45862f6ade21"),  "lastMod" : 1 } }
 
mongos>

4. Re-enable sharding on iot_test.vehicle_signal

On the config server, delete the old chunk metadata:

repl_config:PRIMARY> use config
repl_config:PRIMARY> db.chunks.remove({})
WriteResult({ "nRemoved" : 394 })

On mongos, enable sharding for the collection:

mongos> db.runCommand({"shardCollection":"iot_test.vehicle_signal","key":{"deviceId":"hashed"}})
{
    "collectionsharded" : "iot_test.vehicle_signal",
    "collectionUUID" : UUID("ecffe19f-1cd9-48ab-b92b-fa676d5b9e0a"),
    "ok" : 1,
    "operationTime" : Timestamp(1588236901, 131),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1588236901, 131),
        "signature" : {
            "hash" : BinData(0,"WqQbQEzdGESF+A+J7qZaDHJYQXw="),
            "keyId" : NumberLong("6819186238046601246")
        }
    }
}

5. Restore shard1's backup data into the MongoDB sharded cluster (see the sketch below).
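
A minimal sketch of the restore, assuming the backup was taken with mongodump and is replayed through mongos so writes go through the shard router (the dump path and credentials are assumptions):

# restore one database from the shard1 backup via mongos
mongorestore --host 192.168.59.140 --port 27000 \
    -u chjroot -p 'xxxxxx' --authenticationDatabase admin \
    --db lcl_szy /backup/shard1/lcl_szy

Repeat for every database or collection that had data on shard1; documents restored into iot_test.vehicle_signal will be distributed across both shards as they are written, since the collection is sharded again.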

