Hi all,
Firstly, please CC me as I'm not subscribed to this list.
I'm seeing intermittent filesystem corruption on an IBM server that I
use as a Xen Dom0.
*** Specs ***
Vendor: IBM
Version: -[GGE149AUS-1.19]-
Product Name: IBM System x3650 -[7979CBM]-
AAC0: kernel 5.2-0[17003] Jul 25 2011
AAC0: monitor 5.2-0[17003]
AAC0: bios 5.2-0[17003]
AAC0: serial 5AB49E0
scsi0 : ServeRAID
scsi 0:0:0:0: Direct-Access ServeRA Dom0_RAID6 V1.0 PQ: 0 ANSI: 2
scsi 0:1:0:0: Direct-Access IBM-ESXS MAY2073RC T107 PQ: 0 ANSI: 5
scsi 0:1:1:0: Direct-Access IBM-ESXS MAY2073RC T107 PQ: 0 ANSI: 5
scsi 0:1:2:0: Direct-Access IBM-ESXS MBC2073RC SC06 PQ: 0 ANSI: 5
scsi 0:1:3:0: Direct-Access IBM-ESXS ST973402SS B52B PQ: 0 ANSI: 5
scsi 0:1:4:0: Direct-Access IBM-ESXS ST973402SS B52B PQ: 0 ANSI: 5
scsi 0:1:5:0: Direct-Access IBM-ESXS ST973402SS B52B PQ: 0 ANSI: 5
scsi 0:1:6:0: Direct-Access IBM-ESXS ST973402SS B52B PQ: 0 ANSI: 5
scsi 0:1:7:0: Direct-Access IBM-ESXS ST973402SS B52B PQ: 0 ANSI: 5
scsi 0:3:0:0: Enclosure IBM-ESXS VSC7160 1.07 PQ: 0 ANSI: 3
I'm currently running kernel 3.11.4. Before the filesystem corruption
shows up, I get a lot of these messages:
aac_write: aac_fib_send failed with status: -12
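
To make the message above easier to reason about, here is a minimal sketch that counts these failures in a chunk of log text and decodes the status. It assumes the status follows the usual kernel convention of a negated errno, in which case -12 would correspond to ENOMEM; that is an assumption about the driver, not something confirmed here. The names `decode_status` and `summarize` are hypothetical helpers.

```python
import errno
import os
import re

# Matches the driver message quoted above; the capture group is the status.
AAC_FAIL = re.compile(r"aac_write: aac_fib_send failed with status: (-?\d+)")


def decode_status(status):
    """Map a negated errno (assumed kernel convention) to its symbolic name."""
    return errno.errorcode.get(abs(status), "UNKNOWN")


def summarize(log_text):
    """Count aac_fib_send failures per status code in a block of log text."""
    counts = {}
    for match in AAC_FAIL.finditer(log_text):
        status = int(match.group(1))
        counts[status] = counts.get(status, 0) + 1
    return counts


if __name__ == "__main__":
    # Sample text only; in practice this would be the saved kernel log.
    sample = "aac_write: aac_fib_send failed with status: -12\n" * 3
    for status, count in summarize(sample).items():
        print(status, decode_status(status), os.strerror(abs(status)), count)
```

If the decoding holds, the failures would point at memory allocation rather than at the disks themselves, which may be worth checking against Dom0 memory settings.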
While this is going on, random things seem to fail. When I eventually
reboot the system, lots of tools segfault, and tracing the crashes
leads back to libraries that appear to have been corrupted.
I can boot the system from rescue media, reinstall all the corrupted
libraries / binaries and the system runs fine again for another few
months before it happens again.
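
Since the damage only becomes visible once tools start to segfault, one way to catch it earlier is to record checksums of the library directories while the system is known to be good and compare them later. The following is a rough sketch of that idea, not a replacement for the package manager's own verification; `build_manifest` and `changed_files` are hypothetical names, and the directory to scan is left to the caller.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each regular file under root (relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and not p.is_symlink()
    }


def changed_files(root, manifest):
    """List files from the manifest that are now missing or differ."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )
```

A manifest built right after a clean reinstall and stored off the array could then be compared periodically, for example after each burst of aac_write errors, to see which files changed without a package update.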
arcconf shows:
# arcconf getconfig 1
Controllers found: 1
----------------------------------------------------------------------
Controller information
----------------------------------------------------------------------
Controller Status : Okay
Channel description : SAS/SATA
Controller Model : IBM ServeRAID 8k
Controller Serial Number : 5AB49E0
Physical Slot : 0
Installed memory : 256 MB
Copyback : Disabled
Data scrubbing : Enabled
Defunct disk drive count : 0
Logical drives/Offline/Critical : 1/0/0
--------------------------------------------------------
Controller Version Information
--------------------------------------------------------
BIOS : 5.2-0 (17003)
Firmware : 5.2-0 (17003)
Driver : 1.2-0 (30200)
Boot Flash : 5.1-0 (17002)
--------------------------------------------------------
Controller Battery Information
--------------------------------------------------------
Status : Okay
Over temperature : No
Capacity remaining : 100 percent
Time remaining (at current draw) : 3 days, 20 hours, 56 minutes
--------------------------------------------------------
Controller Vital Product Data
--------------------------------------------------------
VPD Assigned# : 39R8875
EC Version# : J85096
Controller FRU# : 25R8076
Battery FRU# : 25R8088
----------------------------------------------------------------------
Logical drive information
----------------------------------------------------------------------
Logical drive number 1
Logical drive name : Dom0_RAID6
RAID level : 6
Status of logical drive : Okay
Size : 419400 MB
Read-cache mode : Enabled
Write-cache mode : Enabled (write-back)
Write-cache setting : Enabled (write-back)
Partitioned : Yes
Number of segments : 8
Stripe-unit size : 256 KB
Stripe order (Channel,Device) : 0,0 0,1 0,2 0,3 0,4 0,5 0,6 0,7
Defunct segments : No
Defunct stripes : No
Does anyone have any thoughts on this?
--
Steven Haigh
Email: [email protected]
Web: https://www.crc.id.au
Phone: (03) 9001 6090 - 0412 935 897
Fax: (03) 8338 0299