2006-09-30 15:06:37

by Chris Lee

Subject: Problem with legacy megaraid

I am not subscribed to this list. Please CC me on replies.

I have a machine I'm trying to use as a file server. I have a RAID10 and a
RAID5 on a single Dell PERC2/DC (AMI Megaraid 467) controller. Both arrays
are also on the same SCSI channel. The system runs fine for days on end
until I put some heavy I/O load on either array and sustain it for a few
seconds.

Distro: Gentoo Linux
Kernel: 2.6.17-gentoo-r7

Hardware:
Motherboard: Tyan Thunder i7501 Pro (S2721-533)
CPUs: Dual 2.8GHz P4 HT Xeons
RAM: 4GB registered (3/1 split, flat model)
RAID: Dell PERC2/DC (AMI Megaraid 467)
SCSI: Adaptec AHA-2940U2/U2W PCI
NICs: onboard e100 and dual onboard e1000

There are 14 hard drives on channel 0 of the PERC2/DC. IDs 1-6 are RAID10
mounted in an external enclosure; IDs 8-15 are RAID5 mounted internally.
The I/O problems occur on either array, which suggests to me that it is not
a cabling issue.

To reproduce the problem for the purpose of this email I ran `bonnie++ -d
/home/nobody -u nobody:nogroup` where /home is the mountpoint of the RAID5
and bonnie++ created an 8GB file. I have reproduced this problem numerous
times and the sustained I/O time required to create the fault varies. Once
it happened running a ./configure script for some rather small package.
However, this time it took approximately 20-30 minutes before it freaked
out.

Once the problem occurs I have to reboot the machine to regain use of the
affected array(s). I should note that the controller BIOS finds nothing
wrong with the arrays, and when e2fsck is forced on boot it replays the
journal and reports no other problems aside from a "Superblock last write
time is in the future. FIXED." which I attribute to a misconfiguration on
my part in saving the system time to the hardware clock. If I attempt to
unmount the array and mount it again without rebooting, I get an error
message that sdXX is not a valid block device.

Logs/info:

active vt says (this is copied via eyeball):
[4513644.094000] ext3_abort called.
[4513644.101000] EXT3-fs error (device sdb1): ext3_journal_start_sb: <3>sd
0:0:1:0: rejecting I/O to offline device
[4513644.109000] Remounting filesystem read-only

dmesg stuff about megaraid:
[4294675.180000] megaraid: found 0x8086:0x1960:bus 3:slot 3:func 1
[4294675.191000] scsi0:Found MegaRAID controller at 0xf8806000, IRQ:177
[4294675.254000] megaraid: [1.06:1p00] detected 2 logical drives.
[4294675.316000] megaraid: channel[0] is raid.
[4294675.326000] megaraid: channel[1] is raid.
[4294675.362000] scsi0 : LSI Logic MegaRAID 1.06 254 commands 16 targs 5
chans 7 luns
[4294675.372000] scsi0: scanning scsi channel 0 for logical drives.
[4294675.383000] Vendor: MegaRAID Model: LD0 RAID1 09634R Rev: 1.06
[4294675.393000] Type: Direct-Access ANSI SCSI
revision: 02
[4294675.404000] Vendor: MegaRAID Model: LD1 RAID5 38288R Rev: 1.06
[4294675.415000] Type: Direct-Access ANSI SCSI
revision: 02
[4294675.428000] scsi0: scanning scsi channel 4 [P0] for physical devices.
[4294675.726000] scsi0: scanning scsi channel 5 [P1] for physical devices.
[4294680.126000] megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03
EST 2005)
[4294680.136000] megaraid: 2.20.4.8 (Release Date: Mon Apr 11 12:27:22 EST
2006)
[4294680.157000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
MB)
[4294680.167000] sda: Write Protect is off
[4294680.198000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
MB)
[4294680.208000] sda: Write Protect is off
[4294680.237000] sda: sda1 sda2
[4294680.247000] sd 0:0:0:0: Attached scsi disk sda
[4294680.257000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
MB)
[4294680.267000] sdb: Write Protect is off
[4294680.296000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
MB)
[4294680.305000] sdb: Write Protect is off
[4294680.334000] sdb: sdb1
[4294680.345000] sd 0:0:1:0: Attached scsi disk sdb
[4294680.355000] sd 0:0:0:0: Attached scsi generic sg0 type 0
[4294680.365000] sd 0:0:1:0: Attached scsi generic sg1 type 0

/var/log/messages:
Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
dev sdb, sector 1744348567
Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
sdb1, logical block 218043563
Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
dev sdb, sector 1744348695
Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
sdb1, logical block 218043579
Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348823
Sep 29 07:46:16 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043595
Sep 29 07:46:16 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348951
Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043611
Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:44 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348959
Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043612
Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:44 hostname kernel: [4513643.303000] sd 0:0:1:0: rejecting I/O
to offline device
Sep 29 07:46:44 hostname last message repeated 92 times

The last two lines repeat as long as something continues trying to access
that logical drive.
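
The kernel packs the outcome of a failed SCSI command into one 32-bit word, so the `return code = 0x40001` lines above can be unpacked by hand. The sketch below assumes the usual Linux SCSI midlayer layout (driver, host, message, and status bytes, from high to low) and the `DID_*` host-byte names from the kernel's SCSI headers. It also checks that the logged sector and logical block numbers agree, on the assumption (not stated in the logs) that sdb1 starts at sector 63 and that the ext3 block size is 4 KiB. The function names are made up for illustration.

```python
# Host-byte names as defined by the Linux SCSI midlayer.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four byte fields."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def host_byte_name(result):
    """Return the symbolic name of the host byte, or 'unknown'."""
    return HOST_BYTE_NAMES.get(decode_scsi_result(result)["host"], "unknown")


def fs_block_to_sector(block, part_start=63, sectors_per_block=8):
    """Map a filesystem block on a partition to a 512-byte disk sector.

    part_start=63 and sectors_per_block=8 (4 KiB blocks) are assumptions
    chosen because they are consistent with the log, not values it reports.
    """
    return part_start + block * sectors_per_block


if __name__ == "__main__":
    fields = decode_scsi_result(0x40001)
    print({name: hex(value) for name, value in fields.items()})
    print(host_byte_name(0x40001))

    # (logical block on sdb1, sector on sdb) pairs copied from the log.
    logged = [
        (218043563, 1744348567),
        (218043579, 1744348695),
        (218043595, 1744348823),
        (218043611, 1744348951),
        (218043612, 1744348959),
    ]
    for block, sector in logged:
        print(block, sector, fs_block_to_sector(block) == sector)
```

Under these assumptions the host byte is 0x04 (`DID_BAD_TARGET`) and the low byte is 0x01. That low byte is not a standard SCSI status value, so it may be a driver- or firmware-specific code. All five sector/block pairs in the log are consistent with a partition offset of 63 sectors and 8 sectors per block.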

If any other logs/info would be useful please let me know and I will be
happy to include them. TIA for any help. Also I apologise if this is not
relevant for the list.

Thanks,
Chris


2006-09-30 22:54:19

by Andrew Morton

Subject: Re: Problem with legacy megaraid

On Sat, 30 Sep 2006 10:06:36 -0500
"Chris Lee" <[email protected]> wrote:

> I am not subscribed to this list. Please CC me on replies.

(more cc's added)

> I have a machine I'm trying to use as a file server. I have a RAID10 and a
> RAID5 on a single Dell PERC2/DC (AMI Megaraid 467) controller. Both arrays
> are also on the same SCSI channel. The system runs fine for days on end
> until I put some heavy I/O load on either array and sustain it for a few
> seconds.

We recently discovered that "The old megaraid driver is apparently borken
for firmware newer than 6.61.". So please check that and see if a
downgrade is needed.

Is there some reason why you cannot use the new megaraid driver?


> Distro: Gentoo Linux
> Kernel: 2.6.17-gentoo-r7
>
> Hardware:
> Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> CPUs: Dual 2.8GHz P4 HT Xeons
> RAM: 4GB registered (3/1 split, flat model)
> RAID: Dell PERC2/DC (AMI Megaraid 467)
> SCSI: Adaptec AHA-2940U2/U2W PCI
> NICs: onboard e100 and dual onboard e1000
>
> There are 14 hard drives on channel 0 of the PERC2/DC. IDs 1-6 are RAID10
> mounted in an external enclosure; IDs 8-15 are RAID5 mounted internally.
> The I/O problems occur on either array, which suggests to me that it is not
> a cabling issue.
>
> To reproduce the problem for the purpose of this email I ran `bonnie++ -d
> /home/nobody -u nobody:nogroup` where /home is the mountpoint of the RAID5
> and bonnie++ created an 8GB file. I have reproduced this problem numerous
> times and the sustained I/O time required to create the fault varies. Once
> it happened running a ./configure script for some rather small package.
> However, this time it took approximately 20-30 minutes before it freaked
> out.
>
> Once the problem occurs I have to reboot the machine to regain use of the
> affected array(s). I should note that the controller BIOS finds nothing
> wrong with the arrays, and when e2fsck is forced on boot it replays the
> journal and reports no other problems aside from a "Superblock last write
> time is in the future. FIXED." which I attribute to a misconfiguration on
> my part in saving the system time to the hardware clock. If I attempt to
> unmount the array and mount it again without rebooting, I get an error
> message that sdXX is not a valid block device.
>
> Logs/info:
>
> active vt says (this is copied via eyeball):
> [4513644.094000] ext3_abort called.
> [4513644.101000] EXT3-fs error (device sdb1): ext3_journal_start_sb: <3>sd
> 0:0:1:0: rejecting I/O to offline device
> [4513644.109000] Remounting filesystem read-only
>
> dmesg stuff about megaraid:
> [4294675.180000] megaraid: found 0x8086:0x1960:bus 3:slot 3:func 1
> [4294675.191000] scsi0:Found MegaRAID controller at 0xf8806000, IRQ:177
> [4294675.254000] megaraid: [1.06:1p00] detected 2 logical drives.
> [4294675.316000] megaraid: channel[0] is raid.
> [4294675.326000] megaraid: channel[1] is raid.
> [4294675.362000] scsi0 : LSI Logic MegaRAID 1.06 254 commands 16 targs 5
> chans 7 luns
> [4294675.372000] scsi0: scanning scsi channel 0 for logical drives.
> [4294675.383000] Vendor: MegaRAID Model: LD0 RAID1 09634R Rev: 1.06
> [4294675.393000] Type: Direct-Access ANSI SCSI
> revision: 02
> [4294675.404000] Vendor: MegaRAID Model: LD1 RAID5 38288R Rev: 1.06
> [4294675.415000] Type: Direct-Access ANSI SCSI
> revision: 02
> [4294675.428000] scsi0: scanning scsi channel 4 [P0] for physical devices.
> [4294675.726000] scsi0: scanning scsi channel 5 [P1] for physical devices.
> [4294680.126000] megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03
> EST 2005)
> [4294680.136000] megaraid: 2.20.4.8 (Release Date: Mon Apr 11 12:27:22 EST
> 2006)
> [4294680.157000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
> MB)
> [4294680.167000] sda: Write Protect is off
> [4294680.198000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
> MB)
> [4294680.208000] sda: Write Protect is off
> [4294680.237000] sda: sda1 sda2
> [4294680.247000] sd 0:0:0:0: Attached scsi disk sda
> [4294680.257000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
> MB)
> [4294680.267000] sdb: Write Protect is off
> [4294680.296000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
> MB)
> [4294680.305000] sdb: Write Protect is off
> [4294680.334000] sdb: sdb1
> [4294680.345000] sd 0:0:1:0: Attached scsi disk sdb
> [4294680.355000] sd 0:0:0:0: Attached scsi generic sg0 type 0
> [4294680.365000] sd 0:0:1:0: Attached scsi generic sg1 type 0
>
> /var/log/messages:
> Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
> return code = 0x40001
> Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
> dev sdb, sector 1744348567
> Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
> sdb1, logical block 218043563
> Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
> error on sdb1
> Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
> return code = 0x40001
> Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
> dev sdb, sector 1744348695
> Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
> sdb1, logical block 218043579
> Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
> error on sdb1
> Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
> return code = 0x40001
> Sep 29 07:46:16 hostname kernel: [4513615.778000] end_request: I/O error,
> dev sdb, sector 1744348823
> Sep 29 07:46:16 hostname kernel: [4513615.778000] Buffer I/O error on device
> sdb1, logical block 218043595
> Sep 29 07:46:16 hostname kernel: [4513615.778000] lost page write due to I/O
> error on sdb1
> Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
> return code = 0x40001
> Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
> dev sdb, sector 1744348951
> Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
> sdb1, logical block 218043611
> Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
> error on sdb1
> Sep 29 07:46:44 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
> return code = 0x40001
> Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
> dev sdb, sector 1744348959
> Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
> sdb1, logical block 218043612
> Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
> error on sdb1
> Sep 29 07:46:44 hostname kernel: [4513643.303000] sd 0:0:1:0: rejecting I/O
> to offline device
> Sep 29 07:46:44 hostname last message repeated 92 times
>
> The last two lines repeat as long as something continues trying to access
> that logical drive.
>
> If any other logs/info would be useful please let me know and I will be
> happy to include them. TIA for any help. Also I apologise if this is not
> relevant for the list.
>
> Thanks,
> Chris

2006-10-01 06:03:09

by Andrew Morton

Subject: Re: Problem with legacy megaraid

On Sun, 1 Oct 2006 00:39:37 -0500
"Chris Lee" <[email protected]> wrote:

> Thanks for your response Andrew. Comment responses in-line:
>
> >
> > > I am not subscribed to this list. Please CC me on replies.
> >
> > (more cc's added)
> >
>
> Outstanding; thank you.
>
> > > I have a machine I'm trying to use as a file server. I
> > have a RAID10 and a
> > > RAID5 on a single Dell PERC2/DC (AMI Megaraid 467)
> > controller. Both arrays
> > > are also on the same SCSI channel. The system runs fine
> > for days on end
> > > until I put some heavy I/O load on either array and sustain
> > it for a few
> > > seconds.
> >
> > We recently discovered that "The old megaraid driver is
> > apparently borken
> > for firmware newer than 6.61.". So please check that and see if a
> > downgrade is needed.
> >
>
> The Dell firmware version on the card currently is 1.06. I have not found a
> newer firmware version than that one.
>
> > Is there some reason why you cannot use the new megaraid driver?
> >
>
> The config help for the megaraid drivers suggested that the new megaraid
> driver would not support a PERC2. I had enabled both drivers in the kernel
> that is having this problem:
>
> CONFIG_MEGARAID_NEWGEN=y
> CONFIG_MEGARAID_MM=y
> CONFIG_MEGARAID_MAILBOX=y
> CONFIG_MEGARAID_LEGACY=y
>
> After your suggestion I rebuilt the kernel with legacy disabled:
>
> CONFIG_MEGARAID_NEWGEN=y
> CONFIG_MEGARAID_MM=y
> CONFIG_MEGARAID_MAILBOX=y
> # CONFIG_MEGARAID_LEGACY is not set
>
> The new megaraid driver does not detect the PERC2/DC, just as I feared.
> Unless I'm missing some kernel command-line arguments necessary to make
> the new driver find the card, I'm stuck with legacy.

Oh well. I was just guessing - I've never even seen a megaraid controller,
sorry.

> >
> > > Distro: Gentoo Linux
> > > Kernel: 2.6.17-gentoo-r7
> > >
> > > Hardware:
> > > Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> > > CPUs: Dual 2.8GHz P4 HT Xeons
> > > RAM: 4GB registered (3/1 split, flat model)
> > > RAID: Dell PERC2/DC (AMI Megaraid 467)
> > > SCSI: Adaptec AHA-2940U2/U2W PCI
> > > NICs: onboard e100 and dual onboard e1000
> > >

Did it work correctly under any earlier kernel version? If so, which?

2006-10-01 05:39:37

by Chris Lee

Subject: RE: Problem with legacy megaraid

Thanks for your response Andrew. Comment responses in-line:

>
> > I am not subscribed to this list. Please CC me on replies.
>
> (more cc's added)
>

Outstanding; thank you.

> > I have a machine I'm trying to use as a file server. I
> have a RAID10 and a
> > RAID5 on a single Dell PERC2/DC (AMI Megaraid 467)
> controller. Both arrays
> > are also on the same SCSI channel. The system runs fine
> for days on end
> > until I put some heavy I/O load on either array and sustain
> it for a few
> > seconds.
>
> We recently discovered that "The old megaraid driver is
> apparently borken
> for firmware newer than 6.61.". So please check that and see if a
> downgrade is needed.
>

The Dell firmware version on the card currently is 1.06. I have not found a
newer firmware version than that one.

> Is there some reason why you cannot use the new megaraid driver?
>

The config help for the megaraid drivers suggested that the new megaraid
driver would not support a PERC2. I had enabled both drivers in the kernel
that is having this problem:

CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
CONFIG_MEGARAID_LEGACY=y

After your suggestion I rebuilt the kernel with legacy disabled:

CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
# CONFIG_MEGARAID_LEGACY is not set

The new megaraid driver does not detect the PERC2/DC, just as I feared.
Unless I'm missing some kernel command-line arguments necessary to make
the new driver find the card, I'm stuck with legacy.
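
One way to confirm which megaraid options a given kernel was built with is to scan its config text. This is a minimal sketch, assuming the standard `.config` line formats (`CONFIG_X=y` and `# CONFIG_X is not set`). The function name `megaraid_options` is hypothetical, and the sample text is the second config listing above.

```python
def megaraid_options(config_text, prefix="CONFIG_MEGARAID_"):
    """Return {option: value} for options starting with prefix.

    Options written as '# CONFIG_X is not set' are reported as 'n'.
    """
    not_set_suffix = " is not set"
    options = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("# " + prefix) and line.endswith(not_set_suffix):
            # Strip the leading '# ' and the trailing ' is not set'.
            name = line[2:-len(not_set_suffix)]
            options[name] = "n"
        elif line.startswith(prefix) and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


SAMPLE = """\
CONFIG_MEGARAID_NEWGEN=y
CONFIG_MEGARAID_MM=y
CONFIG_MEGARAID_MAILBOX=y
# CONFIG_MEGARAID_LEGACY is not set
"""

if __name__ == "__main__":
    for name, value in sorted(megaraid_options(SAMPLE).items()):
        print(name, value)
```

On a running system the same function could be fed the contents of the build's `.config` file, assuming that file is available.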

>
> > Distro: Gentoo Linux
> > Kernel: 2.6.17-gentoo-r7
> >
> > Hardware:
> > Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> > CPUs: Dual 2.8GHz P4 HT Xeons
> > RAM: 4GB registered (3/1 split, flat model)
> > RAID: Dell PERC2/DC (AMI Megaraid 467)
> > SCSI: Adaptec AHA-2940U2/U2W PCI
> > NICs: onboard e100 and dual onboard e1000
> >

<snip>

Thanks,
Chris

2006-10-01 06:44:06

by Chris Lee

Subject: RE: Problem with legacy megaraid

> > >
> > > > Distro: Gentoo Linux
> > > > Kernel: 2.6.17-gentoo-r7
> > > >
> > > > Hardware:
> > > > Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> > > > CPUs: Dual 2.8GHz P4 HT Xeons
> > > > RAM: 4GB registered (3/1 split, flat model)
> > > > RAID: Dell PERC2/DC (AMI Megaraid 467)
> > > > SCSI: Adaptec AHA-2940U2/U2W PCI
> > > > NICs: onboard e100 and dual onboard e1000
> > > >
>
> Did it work correctly under any earlier kernel version? If
> so, which?

I've recently built the system and the problem was present with both
2.6.16-gentoo-r4 and now 2.6.17-gentoo-r7. I've not used any earlier kernel
versions in this system.

Thanks,
Chris

2006-10-04 10:21:57

by Chris Lee

Subject: RE: Problem with legacy megaraid

> > > >
> > > > > Distro: Gentoo Linux
> > > > > Kernel: 2.6.17-gentoo-r7
> > > > >
> > > > > Hardware:
> > > > > Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> > > > > CPUs: Dual 2.8GHz P4 HT Xeons
> > > > > RAM: 4GB registered (3/1 split, flat model)
> > > > > RAID: Dell PERC2/DC (AMI Megaraid 467)
> > > > > SCSI: Adaptec AHA-2940U2/U2W PCI
> > > > > NICs: onboard e100 and dual onboard e1000
> > > > >
> >
> > Did it work correctly under any earlier kernel version? If
> > so, which?
>
> I've recently built the system and the problem was present
> with both 2.6.16-gentoo-r4 and now 2.6.17-gentoo-r7. I've
> not used any earlier kernel versions in this system.

To update... I've rolled back to 2.6.{12,11,9} and can still reproduce the
problem on all of them. I'm out of ideas as to where I can look for the
cause. If anyone (LSI, Dell people maybe?) has any ideas please let me
know.

Thanks,
Chris

2006-10-04 13:02:04

by Kolli, Neela

Subject: RE: Problem with legacy megaraid

Hi Chris,
This being a "Dell controller", Dell customer service would be the
starting point to handle this.

Thanks,
Neela Syam Kolli.


-----Original Message-----
From: Chris Lee [mailto:[email protected]]
Sent: Wednesday, October 04, 2006 6:22 AM
To: [email protected]
Cc: 'Andrew Morton'; Ju, Seokmann; [email protected]; Kolli, Neela
Subject: RE: Problem with legacy megaraid

> > > >
> > > > > Distro: Gentoo Linux
> > > > > Kernel: 2.6.17-gentoo-r7
> > > > >
> > > > > Hardware:
> > > > > Motherboard: Tyan Thunder i7501 Pro (S2721-533)
> > > > > CPUs: Dual 2.8GHz P4 HT Xeons
> > > > > RAM: 4GB registered (3/1 split, flat model)
> > > > > RAID: Dell PERC2/DC (AMI Megaraid 467)
> > > > > SCSI: Adaptec AHA-2940U2/U2W PCI
> > > > > NICs: onboard e100 and dual onboard e1000
> > > > >
> >
> > Did it work correctly under any earlier kernel version? If
> > so, which?
>
> I've recently built the system and the problem was present
> with both 2.6.16-gentoo-r4 and now 2.6.17-gentoo-r7. I've
> not used any earlier kernel versions in this system.

To update... I've rolled back to 2.6.{12,11,9} and can still reproduce the
problem on all of them. I'm out of ideas as to where I can look for the
cause. If anyone (LSI, Dell people maybe?) has any ideas please let me
know.

Thanks,
Chris