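MegaCli is LSI's official tool for managing its RAID controllers. LSI was acquired by Avago, which now operates as Broadcom, so to download MegaCli today you need to go to the Broadcom website, look under Legacy product support, and search for MegaRAID.
The current official tool is storcli, which covers all LSI and 3ware products. Personally I still find MegaCli more comfortable to use, and it has managed RAID without trouble on the servers from several domestic (Chinese) vendors that we run in production, so it makes little difference whether you switch.
View adapter information: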
./MegaCli64 -AdpAllInfo -aALL
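The output is long and much of it is hard to understand, but that is fine. A beginner should first note the first line, which shows that this machine has adapter number 0. Many MegaCli64 commands need the adapter specified at the end with -a. I only have Adapter #0, so from here on I just write -a0; you can also write -a0,1,2 or -aALL.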
Adapter #0
==============================================================================
Versions
================
Product Name : PERC H710 Adapter
Serial No : 31P003R
FW Package Build: 21.1.0-0007
Mfg. Data
================
Mfg. Date : 01/26/13
Rework Date : 01/26/13
Revision No : A00
Battery FRU : N/A
...
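As a rough illustration of reading the adapter number off that first line, the sketch below pulls adapter IDs and product names out of -AdpAllInfo style text and builds the trailing -a argument. The function names list_adapters and adapter_flag are made up for this example, and the sample text is just the excerpt shown above, so treat it as a starting point rather than a complete parser.

```python
import re

# Excerpt of the -AdpAllInfo output shown above.
SAMPLE = """\
Adapter #0
==============================================================================
Versions
================
Product Name : PERC H710 Adapter
Serial No : 31P003R
FW Package Build: 21.1.0-0007
"""

def list_adapters(text):
    """Return {adapter_id: product_name} from -AdpAllInfo style text (hypothetical helper)."""
    adapters = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*Adapter #(\d+)", line)
        if m:
            current = int(m.group(1))
            adapters[current] = None
            continue
        m = re.match(r"\s*Product Name\s*:\s*(.+)", line)
        # Keep only the first Product Name seen for each adapter.
        if m and current is not None and adapters[current] is None:
            adapters[current] = m.group(1).strip()
    return adapters

def adapter_flag(ids):
    """Build the trailing -a argument, e.g. [0, 1, 2] -> '-a0,1,2' (hypothetical helper)."""
    return "-a" + ",".join(str(i) for i in ids)

adapters = list_adapters(SAMPLE)
print(adapters)
print(adapter_flag(sorted(adapters)))
```

On a machine with a single controller this simply yields -a0, which is the form used in the rest of this post.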
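View the adapter's detailed configuration. This machine has 12 drives: one is configured as RAID0 for the operating system, and the remaining 11 are configured as RAID5: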
./MegaCli64 -CfgDsply -aALL
==============================================================================
Adapter: 0
Product Name: PERC H710 Adapter
Memory: 512MB
BBU: Present
Serial No: 31P003R
==============================================================================
Number of DISK GROUPS: 2 # there are two disk groups
DISK GROUPS: 0 # disk group 0
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 1
Number of VDs: 1
Number of dedicated Hotspares: 0
Virtual Disk Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0 # configured as RAID0
Size:2.728 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:1
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Physical Disk Information:
Physical Disk: 0
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abee
Connected Port Number: 0(path0)
Inquiry Data: (manually redacted) # the drive's serial number appears here
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: 3.0Gb/s
Link Speed: 3.0Gb/s
Media Type: Hard Disk Device
DISK GROUPS: 1 # disk group 1
Number of Spans: 1
SPAN: 0
Span Reference: 0x01
Number of PDs: 11 # 11 physical drives
Number of VDs: 1 # combined into 1 virtual disk
Number of dedicated Hotspares: 0
Virtual Disk Information:
Virtual Disk: 0 (Target Id: 1)
Name:
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3 # configured as RAID5
Size:27.285 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:11
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Physical Disk Information:
Physical Disk: 0 # the first physical drive
Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abec
Connected Port Number: 0(path0)
Inquiry Data: (manually redacted) # the drive's serial number, matching the label on the drive
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: 3.0Gb/s
Link Speed: 3.0Gb/s
Media Type: Hard Disk Device
...
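The sizes in this output can be cross-checked with a little arithmetic. The sketch below converts the hex sector counts to capacity and applies the usual RAID5 rule that one drive's worth of space goes to parity. It assumes 512-byte sectors and that MegaCli's "TB" means 2^40 bytes; neither is stated in the output. The helper names are made up, and primary_raid_level only reads the Primary field, as the annotations above do, so spanned levels are not handled.

```python
import re

SECTOR_BYTES = 512  # assumption: the output does not state the sector size

def primary_raid_level(line):
    """Read only the Primary field, e.g. 'Primary-5, ...' -> 'RAID5' (hypothetical helper)."""
    m = re.search(r"Primary-(\d+)", line)
    return "RAID" + m.group(1) if m else None

def sectors_to_tib(hex_sectors):
    """Convert a hex sector count such as '0x15d400000' to units of 2**40 bytes."""
    return int(hex_sectors, 16) * SECTOR_BYTES / 2**40

# Coerced Size of a single drive, taken from the output above.
coerced = sectors_to_tib("0x15d400000")

# RAID5 over 11 drives: usable space is (n - 1) drives.
raid5_usable = (11 - 1) * coerced

print(primary_raid_level("RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3"))
print(coerced, raid5_usable)
```

Under those assumptions the per-drive figure works out to 2.728515625 and the RAID5 figure to 27.28515625, which lines up with the 2.728 TB and 27.285 TB that MegaCli reports.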
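View the information and state of every physical drive. The per-drive fields are the same as above, just without the adapter information.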
./MegaCli64 -PDList -a0
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abee
Connected Port Number: 0(path0)
Inquiry Data: (manually redacted) # the drive's serial number, matching the label on the drive
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: 3.0Gb/s
Link Speed: 3.0Gb/s
Media Type: Hard Disk Device
Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abec
Connected Port Number: 0(path0)
Inquiry Data: (manually redacted) # the drive's serial number, matching the label on the drive
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None
Device Speed: 3.0Gb/s
Link Speed: 3.0Gb/s
Media Type: Hard Disk Device
...
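Because every drive in -PDList output is a flat run of "Key: value" lines, it is easy to turn into one record per drive. The sketch below is one way to do that, assuming each drive's block starts with an "Enclosure Device ID" line as in the output above. parse_pdlist is a made-up name, and the sample is a shortened copy of that output.

```python
# Shortened copy of the -PDList output shown above.
SAMPLE = """\
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online
Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Firmware state: Online
"""

def parse_pdlist(text):
    """Split -PDList style text into one dict per drive (hypothetical helper)."""
    drives = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines without a colon, such as 'Adapter #0', carry no field
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            # Assumption: this field opens each drive's block.
            current = {}
            drives.append(current)
        if current is not None:
            current[key] = value
    return drives

for drive in parse_pdlist(SAMPLE):
    print(drive["Slot Number"], drive["Firmware state"])
```

Values are kept as strings so that fields such as Raw Size or SAS Address survive untouched; convert the counters to integers only where you need to compare them.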
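This output gives you a lot of useful information:
1. Slot Number: the slot number, which should match the labels on the chassis. If the machine has many drives, you can simply tell the on-site engineer that the drive in slot X is faulty and they will replace it.
2. Inquiry Data: contains the drive's serial number, which matches the label on the drive. The label is only visible once the drive is pulled; the serial number on a drive pulled by slot should match Inquiry Data.
3. Firmware state: the drive's state. Online is the best state and the one we hope to see. Other states include Unconfigured, Offline, Failed and so on, and most of them convey a sad fact: you will be working overtime to report or repair those drives.
4. Pay special attention to these fields: Media Error Count, Other Error Count, Predictive Failure Count and Last Predictive Failure Event Seq Number. Any of them may be non-zero. That means the drive still works but is no longer reliable, most likely because of bad clusters, bad sectors or similar problems, and it must be replaced as soon as possible. If you keep using it, the drive is not far from failing completely. A claim that circulates online is that the larger the first three counts, the worse the drive's condition. That is not actually the case, as the following two screenshots show.
A colleague raised this question specifically with the server, RAID card and drive vendors, and received the following feedback:
According to earlier documentation, the absolute values of Medium error and Other error do not directly reflect the drive's condition.
Based on discussions with the RAID card and drive vendors, the recommended practice is to monitor the Predictive Failure value; a non-zero value means the drive has a problem. In addition, if a drive is in the Failed state, it can be reported for repair directly.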
Predictive Failure Count
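Command: storcli /c0/eall/sall show all
Monitor the keyword Predictive Failure Count. The criterion is that it must not be greater than 0; if it has any count, replace the corresponding drive.
Predictive Failure already covers media errors, and its scope is broader and more comprehensive than media errors alone.
The drive's SMART subsystem already has a complete set of algorithms for evaluating the drive's health.
The SMART algorithms take into account parameters from every aspect of the running drive; media error is one of them.
SMART evaluates media errors by their growth per unit of time.
When any evaluated item in the SMART subsystem reaches its threshold, the drive reports Sense Code: 01 5D 00 (FAILURE PREDICTION THRESHOLD EXCEEDED).
Hosts that follow the SCSI protocol standard (the OS SCSI subsystem, SAS controllers, RAID cards and so on) can parse this Sense Code correctly.
In summary, media errors are already covered by the drive's SMART subsystem, which reports a predictive failure according to the SCSI protocol standard. So for drives, you only need to monitor Predictive Failure under the RAID card, and the criterion is that it must not be greater than 0.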
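The rule above, replace any drive whose Predictive Failure Count is greater than 0, can be sketched as a small check. The example below is keyed to the MegaCli -PDList layout shown earlier in this post; storcli prints a different layout, which this sketch does not attempt to parse. The function name drives_to_replace is made up, and the counter values in the sample are invented purely to exercise the check.

```python
import re

SLOT = re.compile(r"^\s*Slot Number\s*:\s*(\d+)")
# Anchored at line start so 'Last Predictive Failure Event Seq Number' is not matched.
PREDICTIVE = re.compile(r"^\s*Predictive Failure Count\s*:\s*(\d+)")

# Invented sample in -PDList layout: slot 1 has media errors only,
# slot 2 has a predictive failure count.
SAMPLE = """\
Slot Number: 0
Media Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Slot Number: 1
Media Error Count: 12
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Slot Number: 2
Media Error Count: 0
Predictive Failure Count: 3
Last Predictive Failure Event Seq Number: 4711
"""

def drives_to_replace(text):
    """Return slot numbers whose Predictive Failure Count is greater than 0 (hypothetical helper)."""
    flagged = []
    slot = None
    for line in text.splitlines():
        m = SLOT.match(line)
        if m:
            slot = int(m.group(1))
            continue
        m = PREDICTIVE.match(line)
        if m and int(m.group(1)) > 0:
            flagged.append(slot)
    return flagged

print(drives_to_replace(SAMPLE))
```

Slot 1 is deliberately left unflagged despite its media errors, which mirrors the vendors' advice that the absolute media error count alone is not the criterion.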