MDADM RAID 0の障害

MDADM RAID 0の障害

昨夜、Ubuntuサーバーのシステムクロックが5分進んでいることを発見し、 "ntpdate pool.ntp.org"コマンドを実行して寝ました。

今朝私はSAMBA株が働かないことを発見しました。サーバーを見ると、権限が?に設定されていることがわかります。 ? ?当該株式が所在する所の取引量です。

サーバーを再起動しましたが、mdadmが失敗したことがわかります。

[   13.920349] sd 3:0:0:0: [sdb]
[   13.920388] Sense Key : Medium Error [current] [descriptor]
[   13.920499] Descriptor sense data with sense descriptors (in hex):
[   13.920559]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[   13.922059]         00 00 00 00
[   13.922223] sd 3:0:0:0: [sdb]
[   13.922255] Add. Sense: Unrecovered read error - auto reallocate failed
[   13.922316] sd 3:0:0:0: [sdb] CDB:
[   13.922347] Read(16): 88 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00
[   13.922855] end_request: I/O error, dev sdb, sector 0
[   13.922888] Buffer I/O error on device sdb, logical block 0
[   13.922927] ata4: EH complete
[   14.859145] ldm_validate_partition_table(): Disk read failed.
[   14.859203] Dev sdb: unable to read RDB block 0
[   14.870317]  sdb: unable to read partition table
[   14.870646] sdb: detected capacity change from 0 to 4000787030016
[   14.870869] sd 3:0:0:0: [sdb] Attached SCSI disk
[   14.886265] random: nonblocking pool is initialized
[   15.510741] md: bind<sdc1>

だからここでこの問題を解決したいのはmdadm.confです。

cat /etc/mdadm/mdadm.conf
# mdadm.conf
#
# Please refer to mdadm.conf(5) for information about this file.
#

# by default (built-in), scan all partitions (/proc/partitions) and all
# containers for MD superblocks. alternatively, specify devices to scan, using
# wildcards if desired.
#DEVICE partitions containers

# auto-create devices with Debian standard permissions
CREATE owner=root group=disk mode=0660 auto=yes

# automatically tag new arrays as belonging to the local system
HOMEHOST <system>

# instruct the monitoring daemon where to send mail alerts
MAILADDR root

# definitions of existing MD arrays

# This file was auto-generated on Mon, 16 Feb 2015 18:24:04 -0500
# by mkconf $Id$
DEVICE /dev/sdb1 /dev/sdc1
ARRAY /dev/md0 level=raid0 devices=/dev/sdb1,/dev/sdc1

次に、RAIDの両方のディスクでsmartctlコマンドを実行しましたが、どちらも正常に見えました。

smartctl -a -s on /dev/sdb
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EZRX-00SPEB0
Serial Number:    WD-WCC4E0NLZ6ED
LU WWN Device Id: 5 0014ee 20b74560f
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Oct  2 11:45:31 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (55740) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 557) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   184   184   021    Pre-fail  Always       -       7775
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5434
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   132   132   000    Old_age   Always       -       204074
194 Temperature_Celsius     0x0022   119   109   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       96
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       49
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   199   000    Old_age   Offline      -       758

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


smartctl -a -s on /dev/sdc
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-45-generic] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD40EZRX-00SPEB0
Serial Number:    WD-WCC4E0CZDE98
LU WWN Device Id: 5 0014ee 2b6205b40
Firmware Version: 80.00A80
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Oct  2 11:47:29 2015 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

=== START OF ENABLE/DISABLE COMMANDS SECTION ===
SMART Enabled.

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 520) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   174   174   021    Pre-fail  Always       -       8258
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5434
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   138   138   000    Old_age   Always       -       188394
194 Temperature_Celsius     0x0022   120   112   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

これまでは、すべてが大丈夫に見えますが、mdadmを実行すると、次のような結果が表示されます。

mdadm -E /dev/sdc1
/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 933d1825:56122a49:779fbad0:926ab5c9
           Name : BAILEYFS01:0  (local to host BAILEYFS01)
  Creation Time : Tue Feb 17 17:22:13 2015
     Raid Level : raid0
   Raid Devices : 2

 Avail Dev Size : 7814033392 (3726.02 GiB 4000.79 GB)
    Data Offset : 16 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e78061dc:86e60bc0:f4f81839:3816d74a

    Update Time : Tue Feb 17 17:22:13 2015
       Checksum : 1d8e1dfc - correct
         Events : 0

     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AA ('A' == active, '.' == missing)


mdadm -E /dev/sdb1
mdadm: cannot open /dev/sdb1: No such file or directory

これは、アレイの2つのディスクに対するfdiskの出力です。

fdisk -l /dev/sdb

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.


fdisk -l /dev/sdc

WARNING: GPT (GUID Partition Table) detected on '/dev/sdc'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdc: 4000.8 GB, 4000787030016 bytes
255 heads, 63 sectors/track, 486401 cylinders, total 7814037168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

以下は parted の出力です。 (これは私のすべてのドライブをリストしているようです。私が関心のある唯一のドライブは、私のRAIDアレイ、sdb、およびsdcを構成するドライブです。)

parted -l /dev/sdb
Model: ATA ST3250318AS (scsi)
Disk /dev/sda: 250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End    Size    Type      File system     Flags
 1      1049kB  246GB  246GB   primary   ext4            boot
 2      246GB   250GB  3754MB  extended
 5      246GB   250GB  3754MB  logical   linux-swap(v1)


Model: ATA WDC WD40EZRX-00S (scsi)
Disk /dev/sdb: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4001GB  4001GB  ext3         primary


Model: ATA WDC WD40EZRX-00S (scsi)
Disk /dev/sdc: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  4001GB  4001GB  ext3         primary


Model: ATA ST3000DM001-9YN1 (scsi)
Disk /dev/sdd: 3001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name     Flags
 1      1049kB  3001GB  3001GB  ext4         primary  msftdata


Model: Seagate Desktop (scsi)
Disk /dev/sde: 3001GB
Sector size (logical/physical): 4096B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name                  Flags
 1      1049kB  3001GB  3001GB               Basic data partition  msftdata

これは両方のディスクからのgdisk -lの出力です。

gdisk -l /dev/sdb
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdb: 7814037168 sectors, 3.6 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): DA484D62-BB5D-461B-9F96-EAC8A5815C7B
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3693 sectors (1.8 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7814035455   3.6 TiB     8300  primary


gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.8

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 7814037168 sectors, 3.6 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): BDE33471-BF86-4B5F-9DAF-5D3E67AE7E40
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 3693 sectors (1.8 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7814035455   3.6 TiB     8300  primary

次は何をすべきかわかりません...この配列を修正する方法はありますか?

答え1

いくつかの一般的なコメント。

  1. 両方のディスクでSMARTを無効にしました(または少なくとも有効にしていません)。実行されたテストはなく、以前に実行されたテストもありません。これが私に知らせるのは、ディスクに欠陥があるかどうかを知る方法がないことです。

  2. カーネルエラーメッセージは、Add. Sense: Unrecovered read error - auto reallocate failed現在エラーが発生したセクタを交換するために使用できるスペアセクタがなくなったため、ディスクがすぐに致命的なエラーを引き起こすことを示します。これはRAID 0アレイのディスクにとって本当に悪いニュースです。

電源を完全に切って再起動することもできますが、何が起こってもSMARTツールをインストールし、定期的にディスクをテストするように設定することをお勧めします。

答え2

コメントを送ってくれた@caseyに感謝!

partedおよびgdiskコマンドを実行した後、ディスクが正常であることが明らかであるため、起動時にアレイが失敗する理由はわかりません。

同僚が次のように走ることを提案しました。

partprobe /dev/sdb

これを行い、mdadmコマンドを再実行すると、mdadmがsdbを表示できるようになりました。

mdadm -E /dev/sdb
/dev/sdb:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

mdadm -E /dev/sdc
/dev/sdc:
   MBR Magic : aa55
Partition[0] :   4294967295 sectors at            1 (type ee)

再び再起動しましたが、今回はエラーもなくレイドが正しく構築されました。

起動プロセスが実行されているコマンドが何であるかを把握する必要があるため、これが再度発生した場合は、手動でコマンドを実行できます。

うわー…本当にクレイジーな朝ですね!

関連情報