I installed an OCZ-ARC100 in a Dell PowerEdge T105. When I boot the system (CentOS 7), the kernel reports BMDMA errors for the OCZ drive:
jun 25 15:40:21 myhost kernel: ata4.00: ATA-8: OCZ-ARC100, 1.01, max UDMA/133
jun 25 15:40:21 myhost kernel: ata4.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 0/32)
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: scsi 3:0:0:0: Direct-Access ATA OCZ-ARC100 1.01 PQ: 0 ANSI: 5
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB)
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write Protect is off
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
jun 25 15:40:21 myhost kernel: sda: sda1 sda2 sda3
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Attached SCSI disk
jun 25 15:40:21 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:21 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:21 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:21 myhost kernel: ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:21 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:21 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: ata4: EH complete
...
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:d0:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:d0:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: limiting speed to UDMA/100:PIO4
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:f8:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:f8:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4: hard resetting link
jun 25 15:40:22 myhost kernel: ata4: nv: skipping hardreset on occupied port
jun 25 15:40:22 myhost kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/100
jun 25 15:40:22 myhost kernel: ata4: EH complete
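To make the `cmd`/`res` lines above easier to read, here is a minimal Python sketch that decodes them. It assumes the usual libata layout (`command/feature:count:lbal:lbam:lbah/<hob bytes>/device` for `cmd`, `status/error:...` for `res`) and LBA28 addressing, which fits READ DMA (0xc8). The function names and the bit tables are my own illustration, not part of any tool, and only the bits relevant to this log are listed.

```python
import re

# Bit names follow the ATA status/error registers and the BMDMA status
# register; this is a partial table, enough for the log above.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
BMDMA_BITS = {0x01: "ACTIVE", 0x02: "ERROR", 0x04: "INTR"}

# Matches e.g. "cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed"
TASKFILE = re.compile(
    r"(cmd|res) ([0-9a-f]{2})/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/"
    r"((?:[0-9a-f]{2}:){4}[0-9a-f]{2})/([0-9a-f]{2})"
)


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


def decode_taskfile(line):
    """Decode one libata 'cmd' or 'res' line (LBA28 assumed)."""
    match = TASKFILE.search(line)
    if match is None:
        raise ValueError("not a cmd/res taskfile line")
    kind, first, low, _hob, device = match.groups()
    second, count, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    dev = int(device, 16)
    # LBA28: bits 27..24 live in the low nibble of the device register.
    lba = ((dev & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal
    return {
        "kind": kind,
        "first": int(first, 16),  # command (cmd) or status (res)
        "second": second,         # feature (cmd) or error (res)
        "count": count,
        "lba28": lba,
    }


if __name__ == "__main__":
    cmd = decode_taskfile("cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in")
    res = decode_taskfile("res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)")
    print(hex(cmd["first"]), cmd["count"], cmd["lba28"])
    print(flags(res["first"], STATUS_BITS), flags(res["second"], ERROR_BITS))
    print(flags(0x5, BMDMA_BITS))
```

For the first failure this gives command 0xc8, a count of 8 sectors (matching `dma 4096 in` at 512 bytes per sector) and LBA 234441472, which is just below the 234441648 sectors reported on the `[sda]` line. The status byte 0x51 decodes to DRDY and ERR, and the error byte 0x04 to ABRT, as the kernel prints.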
I connected the OCZ to a SATA-to-USB2 adapter and ran smartctl:
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-gentoo-nvidia] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: OCZ-ARC100
Serial Number: A22L0061518000567
LU WWN Device Id: 5 e83a97 100061d69
Firmware Version: 1.01
User Capacity: 120.034.123.776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Jun 25 15:28:55 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 84
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 221
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316691
SMART Error Log Version: 1
No Errors Logged
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
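There is clearly no sign of an error here. Not knowing much about BMDMA errors, I first thought the drive was dying, but now I wonder whether that is the right diagnosis. I was also misled by the fact that when I replace the drive with a brand-new one (Western Digital Blue 500GB), it works without errors. The difference, though, is that the OCZ is really fast by comparison.
How can the above errors (apparently DMA errors) be explained, and how can I solve this problem? For example, should I flash the OCZ firmware? Use specific kernel parameters?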
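By the way, the BIOS forces the ATA bus option for SATA disks; I cannot change it to AHCI, for example. I suppose this is because of the CD/DVD drive attached to the SATA bus, or because of the Fusion MPT hardware RAID adapter. Either way I (literally) have no choice here, but at least it does not seem to be a problem for the WD drive.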
Edit: I ran a drive self-test on the server itself, and here is the result:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.21.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 253
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 85
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 222
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316768
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 253 -
Moreover, following a hint that smartctl tests the drive's internals, I think I can safely assume that the drive is not defective. I'll investigate a bit further...
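As part of that, the two attribute tables above can be compared mechanically rather than by eye. This is a rough Python sketch that assumes the ten-column `smartctl -A` row layout shown above with a plain integer in the RAW_VALUE column. `parse_attributes` and `diff_raw` are hypothetical helper names of my own.

```python
def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for rows of a smartctl -A table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row has at least 10 columns and starts with a numeric ID;
        # header lines and self-test log rows do not.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def diff_raw(before, after):
    """Return {id: (name, old_raw, new_raw)} for attributes whose raw value changed."""
    changed = {}
    for attr_id, (name, new_raw) in after.items():
        if attr_id in before and before[attr_id][1] != new_raw:
            changed[attr_id] = (name, before[attr_id][1], new_raw)
    return changed


if __name__ == "__main__":
    # A few rows copied from the two runs above; paste the full tables in practice.
    first_run = """\
  5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
  9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 221
"""
    second_run = """\
  5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
  9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 253
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 222
"""
    for attr_id, (name, old, new) in diff_raw(
        parse_attributes(first_run), parse_attributes(second_run)
    ).items():
        print(attr_id, name, old, "->", new)
```

Across the two full tables in this post, only Power_On_Hours, Power_Cycle_Count, Total_LBAs_Read and attribute 249 differ; the reallocation, pending-sector and ECC counters stay at 0.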