2008-03-06 21:26:21

by Frantisek Rysanek

Subject: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

Dear everyone,

I've got another silly question, rather vaguely formulated...

I have a homebrew Fedora 5 - based live CD with some basic system
utilities and FS support, that I'm using to test various sorts of
hardware and subsystems. Call it hardware debugging, in a PC hardware
assembly shop...
Today I have "external storage" under the magnifying glass.

In the past, the biggest block devices I'd met were some
14TB RAID volumes (16x 1TB disks in RAID 6). These are connected to
the host PC via a SCSI/SAS/FC HBA and essentially appear as a single
huge SCSI disk. Such RAID units work pretty well with Linux,
preferably using LBA64/CDB16. Non-standard sector sizes are less well
supported, but also seem to work with some filesystems.

A few days ago I had my first opportunity to put my hands on a
24-bay RAID unit - configured for RAID 60, that's 20 TB of space in a
single chunk. I know that RAID units capable of this sort of capacity
have been on the market for some time now, so I was somewhat
surprised to discover that there are open issues with Linux...

The block device is detected/reported just fine.
I didn't even try Ext3, I know it's not appropriate for this sort of
capacity. I tried Reiser3, and mkfs.reiserfs (the user-space util)
already refused to create such a big FS. Then I tried XFS. The
user-space mkfs.xfs had no objections - so far so good. But when I
tried to mount the volume thus created, the kernel-space XFS driver
(including the one in 2.6.24.2) refused to mount the FS, complaining
that the FS was too big to be mounted on this platform.

Okay, those are FS quirks, some of them even implied by the spec
(=RTFM) or at least located in user space. Hang on a second,
there's more.

If I try
dd if=/dev/zero of=/dev/sda bs=4096
it never runs to the very end. It always seems to hang somewhere
partway through, sometimes at 7 TB, sometimes at 4 TB... it's weird.
The dd process stays alive, but data transfer stops (as seen in
iostat and on the RAID unit's LEDs), and the write() or whatever
syscall inside dd just stays blocked forever. The RAID seems
perfectly happy, and there are no timeout messages from the SCSI layer.
If I terminate the dd process (Ctrl+C) and start it again, everything
looks perfectly all right again.

I also have a simple test app that runs in a loop, reading a whole
raw block device (/dev/sda) from start to end. It open()s the device
node with O_LARGEFILE and just read()s 64kB chunks until EOF.
This one always terminates halfway through, and the cause seems to be
that the read() call returns EINVAL.
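
For the record, the test app is roughly along these lines - just a
minimal sketch from memory, not the exact code (device path, chunk
size and error reporting here are illustrative):

#define _GNU_SOURCE             /* for O_LARGEFILE */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (64 * 1024)

int main(int argc, char **argv)
{
        static char buf[CHUNK];
        const char *dev = (argc > 1) ? argv[1] : "/dev/sda";
        long long total = 0;
        ssize_t n;
        int fd;

        fd = open(dev, O_RDONLY | O_LARGEFILE);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* read 64kB chunks until EOF (n == 0) or an error (n < 0) */
        while ((n = read(fd, buf, CHUNK)) > 0)
                total += n;

        if (n < 0)
                fprintf(stderr, "read failed at offset %lld: %s\n",
                        total, strerror(errno));   /* EINVAL in my case */
        else
                printf("reached EOF after %lld bytes\n", total);

        close(fd);
        return (n < 0) ? 1 : 0;
}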


So far I've been using kernels compiled for 32-bit x86.
Obviously I have LBD support enabled, and it's always worked
flawlessly. Would it be any help if I switched to 64-bit mode?
My machines have been capable of that for a few years now, but so far
I've had no reason to switch, as the memory capacities installed
hardly ever reached 4 GB...

Any ideas would be welcome :-)

Frank Rysanek


2008-03-07 06:10:55

by Frantisek Rysanek

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On 7 Mar 2008 at 4:05, Lee Revell wrote:
> > I didn't even try Ext3, I know it's not appropriate for this sort of
> > capacity.
>
> Where did you get that idea?
>
Hmm... Google can find sources on the 'net claiming that Ext3 has a
maximum of 2 or 4 TB. Nice to know that I'm wrong, I'll test this
right away :-)

> Are you sure the hardware is not faulty?
>
I'm pretty sure.
I know what it looks like when one of those 1TB drives has a problem
it would prefer not to talk about in SMART (hint: only one disk
activity LED out of 24 keeps blinking). Nowadays I have to handle
several such drives in every new RAID unit delivered.
When such a drive times out past some margin, say half a minute, the
Linux kernel reports a SCSI command timeout - or rather, the RAID
controller runs out of patience sooner than the kernel does - the
array gets degraded and keeps going in degraded mode. So if the RAID
controller itself goes out to lunch, Linux definitely complains.
I've also seen a number of sly SCSI parity errors / general bus
impedance problems, all of which yielded a proper error within half a
minute or so. I am well equipped to debug such problems. Besides,
this is FC.
Once upon a time I've seen some ugly low-level incompatibility in FC
too - that resulted in some really nice messages from the Linux
kernel.
I also know a brand of controllers which claim support for TCQ depth
of 255, but actually hang with anything over 192 or so. This also
yields a proper error in Linux, and the RAID controller goes toes up.

None of this happens in my case... all is happy and calm.
Hmm... maybe I should try some modern FreeBSD for comparison :-)))

Apologies for not mentioning specific hardware brands - I don't want
to get in trouble...

> > Would it be any help if I switched to 64bit mode?
>
> Yes, it would be worth a try.
>
:-/ Okay, time to install 64bit Fedora 8 or something :-)

Anyway, thanks very much for your response :-)

Frank Rysanek

2008-03-07 09:32:29

by Andi Kleen

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

"Frantisek Rysanek" <[email protected]> writes:

> On 7 Mar 2008 at 4:05, Lee Revell wrote:
> > > I didn't even try Ext3, I know it's not appropriate for this sort of
> > > capacity.
> >
> > Where did you get that idea?
> >
> Hmm... Google can find sources on the 'net claiming that Ext3 has a
> maximum of 2 or 4 TB. Nice to know that I'm wrong,

You're not wrong (for 4K block sizes). Only ext4 lifted that limit, but it is
still experimental.

BTW your problems mostly sound like driver issues. Some drivers
(and some controller firmwares) have problems with large block numbers.

-Andi

2008-03-07 10:41:01

by Frantisek Rysanek

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On 7 Mar 2008 at 10:30, Andi Kleen wrote:
[...snip...]
> BTW your problems mostly sound like driver issues. Some drivers
> (and some controller firmwares) have problems with large block numbers
>
thanks for that hint, I'll investigate that too.

The HBA is a Qlogic QLA-2460. Based on some past experience with
other brands of FC HBAs, I tend to swear by Qlogic as the reference
implementation of FC hardware.

The driver I've tried so far is the vanilla version in 2.6.22.6 and
2.6.24.2. The firmware that I load at runtime is something I
downloaded from the Qlogic web site maybe four months ago... time to
update my firmware :-)

Frank Rysanek

2008-03-09 04:26:55

by Lee Revell

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On Fri, Mar 7, 2008 at 1:10 AM, Frantisek Rysanek
<[email protected]> wrote:
> On 7 Mar 2008 at 4:05, Lee Revell wrote:
> > > Would it be any help if I switched to 64bit mode?
> >
> > Yes, it would be worth a try.
> >
> :-/ Okay, time to install 64bit Fedora 8 or something :-)
>
> Anyway, thanks very much for your response :-)

You could try a Live CD, easier than installing a new distro...

Lee

2008-03-09 22:06:11

by David Chinner

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On Thu, Mar 06, 2008 at 10:25:59PM +0100, Frantisek Rysanek wrote:
> A few days ago, I've had my first opportunity to put my hands on a
> 24bay RAID unit - configured for RAID 60, that's 20 TB of space in a
> single chunk. I know that RAID units capable of this sort of capacity
> have been on the market for some time now, so I was somewhat
> surprised to discover that there are pending issues against Linux...
>
> The block device is detected/reported just fine.
> I didn't even try Ext3, I know it's not appropriate for this sort of
> capacity. I've tried Reiser3, and already mkfs.reiserfs (user-space
> util) refused to create such a big FS. Then I tried XFS. The user-
> space mkfs.xfs had no objections - so far so good. But when I tried
> to mount the volume thus created, the kernel-space XFS driver
> (including the one in 2.6.24.2) refused to mount the FS, complaining
> about the FS being too big to be mounted on this platform.

Sure. The largest address space that can be used on a 32-bit platform
with 4k pages is 16TB (2^32 * 2^12 = 2^44 = 16TB). For XFS, that
means metadata can't be placed higher in the filesystem than 16TB,
and seeing as we only have a single address space for metadata, the
filesystem is limited to 16TB. It could be fixed with software
changes, but really there's no excuse for using x86 given how
cheap x86_64 is now.....
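
To spell the arithmetic out, here's a trivial back-of-the-envelope
snippet (purely illustrative, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        /* On 32-bit x86 the page cache index is an unsigned long,
         * i.e. 32 bits wide, and a page is 4 KiB (2^12 bytes), so the
         * largest byte offset the page cache can address is 2^44. */
        uint64_t pages     = 1ULL << 32;        /* 2^32 page indices */
        uint64_t page_size = 1ULL << 12;        /* 4 KiB per page */
        uint64_t limit     = pages * page_size;

        printf("%llu bytes = %llu TiB\n",
               (unsigned long long)limit,
               (unsigned long long)(limit >> 40));      /* 16 TiB */
        return 0;
}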

> So far I've been using kernels compiled for 32bit mode x86.
> Obviously I have LBD support enabled, and it's always worked
> flawlessly. Would it be any help if I switched to 64bit mode?
> My machines have been capable of that for a few years now, but so far
> I had no reason to switch, as the memory capacities installed hardly
> ever reached 4 GB...

Yes, switching to 64 bit machines will fix this problem as the
address space will now hold 2^64*2^12 bytes.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2008-03-10 04:33:01

by Frantisek Rysanek

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On 9 Mar 2008 at 23:05, David Chinner wrote:
>
> Sure. the largest address space that can be used on a 32bit platform
> with 4k pages is 16TB (2^32 * 2^12 = 2^44 = 16TB). For XFS, that means
> metadata can't be placed higher in the filesystem than 16TB, and seeing
> as we only have a single address space for metadata, the filesystem is
> limited to 16TB. It could be fixed with software changes, but really
> there's no excuse for using x86 given how cheap x86_64 is now.....
>
[...]
>
> Yes, switching to 64 bit machines will fix this problem as the
> address space will now hold 2^64*2^12 bytes.....
>
wow, thanks for such a precise answer, from such an authoritative
source :-)

Frank Rysanek

2008-03-10 09:06:32

by Frantisek Rysanek

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

> On 7 Mar 2008 at 10:30, Andi Kleen wrote:
> [...snip...]
> > BTW your problems mostly sound like driver issues. Some drivers
> > (and some controller firmwares) have problems with large block numbers
> >
> thanks for that hint, I'll investigate that too.
>
> The HBA is a Qlogic QLA-2460. Based on some past experience with
> other brands of FC HBA's, I tend to swear on Qlogic as the reference
> implementation of FC hardware.
>
And indeed it was the firmware.
ftp://ftp.qlogic.com/outgoing/linux/firmware/

Until now, I've been using version v4.00.27, which is the last
numbered version in that FTP directory.

Upon closer inspection, I picked the file called
ql2400_fw.bin_mid
with a timestamp from 12th February 2008.
This one turns out to be v4.03.01 and it SOLVES THE PROBLEM :-)

It seems to work with qla2xxx v8.01.07-k7 (2.6.22.6) and
v8.02.00-k5 (2.6.24.2).

My testbed server has passed some 5 loops of dd over the weekend.

Thanks for your help :-)

I'm installing a 64bit Fedora to give XFS another try...

Frank Rysanek

2008-03-13 10:07:39

by Frantisek Rysanek

Subject: Re: block layer / FS question: x86_32bit with LBD, 20 TB RAID volume => funny issues

On 10 Mar 2008 at 10:13, [email protected] wrote:
>
> My testbed server has passed some 5 loops of dd over the weekend.
>
> Thanks for your help :-)
>
> I'm installing a 64bit Fedora to give XFS another try...
>
So I installed 64-bit Fedora 8, compiled a 64-bit 2.6.24.2,
and XFS mounts without a word of objection :-)
even with my 32-bit user space on that old CD :-)

Interestingly, XFS even survives looped + parallel
Bonnie++ - except that when I let it run overnight,
in the morning the box was stuck with a
Machine Check Exception.
This is a dual Xeon Irwindale at 3 GHz; so far it's always
been rock-solid under 32-bit operating systems.
Difficult for me to say if the CPUs indeed have a problem
or if this is some sort of compatibility bug...
Anyway it's unlikely to be an XFS-related issue.
I've checked the thermal paste on my heatsinks and I'll try
again tonight with "nomce"...

Frank Rysanek