2001-11-13 14:29:45

by Roy Sigurd Karlsbakk

Subject: Tuning Linux for high-speed disk subsystems

Hi all

After some testing at Compaq's lab in Oslo, I've come to the conclusion
that Linux cannot scale higher than about 30-40MB/sec in or out of a
hardware or software RAID-0 set with several stripe/chunk sizes tried out.
The set is based on 5 18GB 10k disks running SCSI-3 (160MBps) alone on a
32bit/33MHz PCI bus.

After speaking to the storage guys here, I was told the problem generally
was that the OS should send the data requests at 256kB block sizes, as the
drives (10k) could handle 100 I/O operations per second, and thereby could
give a total of (256*100)kB/sec per spindle. When using smaller block
sizes, the speed would decrease in a linear fashion.
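
Taking those numbers at face value, the arithmetic works out to:

    256 kB/request * 100 requests/s = 25.6 MB/s per spindle
    25.6 MB/s * 5 spindles         ~= 128 MB/s for the whole set

well above the 30-40 MB/s actually measured.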

Does anyone know this stuff well enough to help me tune the system?
PS: The CPUs were almost idle during the test. Tested file system was
ext2.

Regards

roy

--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.



2001-11-13 15:59:49

by Jesse Pollard

Subject: Re: Tuning Linux for high-speed disk subsystems

Roy Sigurd Karlsbakk <[email protected]>:
>
> Hi all
>
> After some testing at Compaq's lab in Oslo, I've come to the conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk sizes tried out.
> The set is based on 5 18GB 10k disks running SCSI-3 (160MBps) alone on a
> 32bit/33MHz PCI bus.
>
> After speaking to the storage guys here, I was told the problem generally
> was that the OS should send the data requests at 256kB block sizes, as the
> drives (10k) could handle 100 I/O operations per second, and thereby could
> give a total of (256*100)kB/sec per spindle. When using smaller block
> sizes, the speed would decrease in a linear fashion.
>
> Does anyone know this stuff well enough to help me tune the system?
> PS: The CPUs were almost idle during the test. Tested file system was
> ext2.

I shouldn't be the authoritative answer on this, but to start with:

1. You don't provide enough info on the hardware configuration:

a. are all of the drives on one SCSI controller?
b. is there only one PCI bus?
c. since you mention "CPUs": how many, and which ones?
d. which chipset?
e. what was used for the benchmark?
f. which hardware RAIDs were tested?

2. Your mentioned limit (40MB/sec) sounds like it is really a
memory<->bridge<->PCI<->controller bandwidth limit - 32bit/33MHz PCI is
nominally 133MB/sec, but the whole path usually delivers much less, and
40MB/sec is about what I get from a SCSI-3 controller alone on a 33MHz
bus (I use SCSI 3 for the system disk, SCSI 2 for the audio/CDRW/tape
drive).

3. Based on the statement that the "CPUs were almost idle", it sounds like
the limit is outside the OS. If you are trying to set up a disk server,
you should check into multiple PCI busses @ 66MHz and multiple disk
controllers.

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2001-11-13 16:44:03

by Alan

Subject: Re: Tuning Linux for high-speed disk subsystems

> After some testing at Compaq's lab in Oslo, I've come to the conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk sizes tried out.
> The set is based on 5 18GB 10k disks running SCSI-3 (160MBps) alone on a
> 32bit/33MHz PCI bus.

I'm beating that with IDE 8)

> After speaking to the storage guys here, I was told the problem generally
> was that the OS should send the data requests at 256kB block sizes, as the
> drives (10k) could handle 100 I/O operations per second, and thereby could

Right now we tend to queue 128 blocks per write. That can be tuned if you
want to play with it.

> Does anyone know this stuff well enough to help me tune the system?
> PS: The CPUs were almost idle during the test. Tested file system was
> ext2.

I'm not sure of the best way to get big linear blocks in the ext2 layout,
or whether XFS would do that job better, but the physical layer comes
down to the block limit, the SCSI max sectors per I/O set by the
controller, and to an extent the VM readahead (tunable in -ac kernels -
the patch to md.c should tell you how to hack md for that).

2001-11-13 16:43:53

by Ragnar Kjørstad

Subject: Re: Tuning Linux for high-speed disk subsystems

On Tue, Nov 13, 2001 at 03:29:13PM +0100, Roy Sigurd Karlsbakk wrote:
> After some testing at Compaq's lab in Oslo, I've come to the conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk sizes tried out.

Eh, we do 60-70 MB/s reads and 110-120 MB/s writes on our RAIDs... from
Linux.


> Does anyone know this stuff well enough to help me tune the system?
> PS: The CPUs were almost idle during the test. Tested file system was
> ext2.

I'd say you should get rid of your Compaq RAID controller and use a
regular SCSI controller - 64-bit/66MHz (e.g. an Adaptec).



--
Ragnar Kjørstad
Big Storage

2001-11-13 17:00:43

by Craig I. Hagan

Subject: Re: Tuning Linux for high-speed disk subsystems

> After some testing at Compaq's lab in Oslo, I've come to the conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk sizes tried out.
> The set is based on 5 18GB 10k disks running SCSI-3 (160MBps) alone on a
> 32bit/33MHz PCI bus.

This isn't quite true. Use either the RH kernel, the -ac series, or the
attached patch (for 2.4.15-pre4). Then set /proc/sys/vm/max-readahead to
511 or 1023 (a power of 2 minus 1).

This should allow you to generate large enough I/Os for streaming reads to
do what you are looking for.
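
For reference, a minimal sketch of setting that tunable from C - the value
is illustrative, and a plain "echo 511 > /proc/sys/vm/max-readahead" as
root does the same thing:

    /* Sketch: raise the VM readahead limit; run as root.
     * 511 = (a power of 2) - 1, as suggested above. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/max-readahead", "w");

        if (f == NULL) {
            perror("/proc/sys/vm/max-readahead");
            return 1;
        }
        fprintf(f, "511\n");
        return fclose(f) != 0;
    }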

-- craig


-------------------------------------------------------------------------------
Craig I. Hagan "It's a small world, but I wouldn't want to back it up"
hagan(at)cih.com "True hackers don't die, their ttl expires"
"It takes a village to raise an idiot, but an idiot can raze a village"

Stop the spread of spam, use a sendmail condom!
http://www.cih.com/~hagan/smtpd-hacks

In Bandwidth we trust


Attachments:
dynreadahead-2.4.15-pre4 (4.63 kB)

2001-11-13 17:37:13

by Marcelo Tosatti

Subject: Re: Tuning Linux for high-speed disk subsystems



On Tue, 13 Nov 2001, Craig I. Hagan wrote:

> > After some testing at Compaq's lab in Oslo, I've come to the conclusion
> > that Linux cannot scale higher than about 30-40MB/sec in or out of a
> > hardware or software RAID-0 set with several stripe/chunk sizes tried out.
> > The set is based on 5 18GB 10k disks running SCSI-3 (160MBps) alone on a
> > 32bit/33MHz PCI bus.
>
> This isn't quite true. Use either the RH kernel, the -ac series, or the
> attached patch (for 2.4.15-pre4). Then set /proc/sys/vm/max-readahead to
> 511 or 1023 (a power of 2 minus 1).
>
> This should allow you to generate large enough I/Os for streaming reads
> to do what you are looking for.

Craig,

This patch is already on my pending list.

So if Linus does not apply it, I will.

2001-11-13 20:01:06

by Dan Hollis

Subject: Re: Tuning Linux for high-speed disk subsystems

On Tue, 13 Nov 2001, Roy Sigurd Karlsbakk wrote:
> After some testing at Compaq's lab in Oslo, I've come to the conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk sizes tried out.

We managed >100MB/sec from a RAID5 IDE setup, SMP Athlon on Tyan S2460
with Promise controllers.

-Dan
--
[-] Omae no subete no kichi wa ore no mono da. [-]

2001-11-13 20:39:25

by Torrey Hoffman

Subject: RE: Tuning Linux for high-speed disk subsystems

Roy Sigurd Karlsbakk wrote:

> After some testing at Compaq's lab in Oslo, I've come to the
> conclusion
> that Linux cannot scale higher than about 30-40MB/sec in or out of a
> hardware or software RAID-0 set with several stripe/chunk
> sizes tried out.

Hmmm. I saw "dbench 32" results of 73 MB/s using Linux
software RAID-0 and IDE. However, I suppose some of that
was due to caching, and not hardware throughput.

Details: 2.4.9-ac17, 4 x Maxtor 5400 RPM, 60 GB hard drives,
2 x Promise TX-2 controllers, using UDMA-100, one drive / cable,
dual PIII-800, reiserfs, RAID-0 with chunk-size = 1024

Torrey



2001-11-14 10:34:24

by Roy Sigurd Karlsbakk

Subject: Re: Tuning Linux for high-speed disk subsystems

> This isn't quite true. Use either the RH kernel, the -ac series, or the
> attached patch (for 2.4.15-pre4). Then set /proc/sys/vm/max-readahead to
> 511 or 1023 (a power of 2 minus 1).
>
> This should allow you to generate large enough I/Os for streaming reads
> to do what you are looking for.

What does the setting mean? The number of pages?
--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-11-14 10:36:24

by Roy Sigurd Karlsbakk

Subject: Re: Tuning Linux for high-speed disk subsystems

> This isn't quite true. Use either the RH kernel, the -ac series, or the
> attached patch (for 2.4.15-pre4). Then set /proc/sys/vm/max-readahead to
> 511 or 1023 (a power of 2 minus 1).
>
> This should allow you to generate large enough I/Os for streaming reads
> to do what you are looking for.

How does this work when using software RAID-0 or 5?
--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-11-16 02:56:56

by Dieter Nützel

Subject: RE: Tuning Linux for high-speed disk subsystems

The heroinewarrior.com (Broadcast 2000) guys reported the following for
the Tyan Thunder K7 (2 x 1.0 GHz Athlon MP), dual-channel U160 (Adaptec),
and RAID 0. http://heroinewarrior.com/athlon.php3

[-]
As for performance our experiences are biased because this system is almost
exclusively used for video software development not games like most. It needs
a reliable operating system like Linux and very fast media storage drives.

The inverse telecine, a grueling memory exercise which takes 3 hours on a
dual PIII 933 and 2 hours on a dual Alpha, takes about 2 hours on the dual
Athlon.

Our 100 Gig SCSI raid, consisting of 6 15,000 rpm drives on the motherboard's
two SCSI 160 channels gives a full 110MB/sec read and write with RAID 0. With
RAID chunks set to 1MB the write accesses go to 160MB/sec and read accesses
go to 90MB/sec sustained. This system would make a good motion capture tool.
Previous Intel attempts at onboard disk I/O would give 50MB/sec.
[-]

-Dieter

2001-11-16 11:51:39

by Roy Sigurd Karlsbakk

Subject: RE: Tuning Linux for high-speed disk subsystems

> Our 100 Gig SCSI raid, consisting of 6 15,000 rpm drives on the motherboard's
> two SCSI 160 channels gives a full 110MB/sec read and write with RAID 0. With
> RAID chunks set to 1MB the write accesses go to 160MB/sec and read accesses
> go to 90MB/sec sustained. This system would make a good motion capture tool.
> Previous Intel attempts at onboard disk I/O would give 50MB/sec.

How much do you think I can get out of 2x6 15k disks - each set of 6
disks on its own SCSI-3/160 bus?
--
Roy Sigurd Karlsbakk, MCSE, MCNE, CLS, LCA

Computers are like air conditioners.
They stop working when you open Windows.

2001-11-16 15:24:57

by Dieter Nützel

Subject: Re: Tuning Linux for high-speed disk subsystems

On Friday, 16 November 2001 12:51, Roy Sigurd Karlsbakk wrote:
> > Our 100 Gig SCSI raid, consisting of 6 15,000 rpm drives on the
> > motherboard's two SCSI 160 channels gives a full 110MB/sec read and write
> > with RAID 0. With RAID chunks set to 1MB the write accesses go to
> > 160MB/sec and read accesses go to 90MB/sec sustained. This system would
> > make a good motion capture tool. Previous Intel attempts at onboard disk
> > I/O would give 50MB/sec.
>
> How much do you think I can get out of 2x6 15k disks - each set of 6
> disks on its own SCSI-3/160 bus?

Counting your disks, you might get double that in the best case. I read a
post here on LKML where someone claims that W2K delivers 250 MB/s with
such a configuration. Linux 2.4 should do the same. Ask the SCSI gurus.

Regards,
Dieter

--
Dieter Nützel
Graduate Student, Computer Science
@home: [email protected]

2001-11-16 16:53:34

by Marvin Justice

Subject: Re: Tuning Linux for high-speed disk subsystems


> Counting your disks, you might get double that in the best case. I read a
> post here on LKML where someone claims that W2K delivers 250 MB/s with
> such a configuration. Linux 2.4 should do the same. Ask the SCSI gurus.
>

That may have been my post you refer to. With 2x5 disks, each capable of
50 MB/s by itself, we can stream 255 MB/s very smoothly in either direction
with W2K --- as long as FILE_FLAG_NO_BUFFERING is used. With standard
reads the number is more like 100 MB/s if I recall correctly, so the buffer
cache can definitely get in the way.

With Linux + XFS I was getting 250 MB/s read and 220 MB/s write (with a
bit less smoothness than W2K) using O_DIRECT and no high mem to avoid
bounce buffer copies. Using standard reads the numbers drop to around
120 MB/s. That was a couple of weeks ago and I want to try tweaking some
more but a co-worker has "borrowed" pieces of the hardware for the moment.
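
For anyone wanting to try the same thing, a minimal sketch of that kind of
O_DIRECT streaming read - the 1MB request size and the 512-byte alignment
are assumptions, not the actual test code; O_DIRECT requires block-aligned
buffers, offsets and lengths:

    /* Sketch: stream a file through O_DIRECT, bypassing the buffer
     * cache. 512-byte alignment is assumed; check your device. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const size_t bufsize = 1024 * 1024;  /* 1MB requests */
        char *buf = memalign(512, bufsize);  /* block-aligned buffer */
        ssize_t n;
        int fd;

        if (argc < 2 || buf == NULL)
            return 1;
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) {
            perror(argv[1]);
            return 1;
        }
        while ((n = read(fd, buf, bufsize)) > 0)
            ;  /* consume the data here */
        close(fd);
        free(buf);
        return n < 0;
    }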

--
Marvin Justice
Software Developer
BOXX Technologies
http://www.boxxtech.com
[email protected]
512-235-6318 (V)
512-835-0434 (F)

2001-11-22 10:15:38

by Martin.Knoblauch

Subject: Re: Tuning Linux for high-speed disk subsystems

> Re: Tuning Linux for high-speed disk subsystems
>
>
> > Counting your disks, you might get double that in the best case. I read
> > a post here on LKML where someone claims that W2K delivers 250 MB/s with
> > such a configuration. Linux 2.4 should do the same. Ask the SCSI gurus.
> >
>
> That may have been my post you refer to. With 2x5 disks, each capable of
> 50 MB/s by itself, we can stream 255 MB/s very smoothly in either direction
> with W2K --- as long as FILE_FLAG_NO_BUFFERING is used. With standard
> reads the number is more like 100 MB/s if I recall correctly, so the buffer
> cache can definitely get in the way.
>
> With Linux + XFS I was getting 250 MB/s read and 220 MB/s write (with a
> bit less smoothness than W2K) using O_DIRECT and no high mem to avoid
> bounce buffer copies. Using standard reads the numbers drop to around
> 120 MB/s. That was a couple of weeks ago and I want to try tweaking some
> more but a co-worker has "borrowed" pieces of the hardware for the moment.
>
Marvin,

could you elaborate a bit more :-), or point me/us to your post
(couldn't find it). We are currently evaluating solutions for doing HDTV
playback for one of our customers. This will need about 300-320 MB/sec
read. We know (at least someone claims so) that you can do it with SGI
equipment at a price. The goal for the customer is to definitely beat
that price :-))

Martin
--
------------------------------------------------------------------
Martin Knoblauch | email: [email protected]
TeraPort GmbH | Phone: +49-89-510857-309
C+ITS | Fax: +49-89-510857-111
http://www.teraport.de | Mobile: +49-170-4904759

2001-11-22 16:31:00

by Andreas Dilger

Subject: Re: Tuning Linux for high-speed disk subsystems

> Re: Tuning Linux for high-speed disk subsystems
> > Counting your disks, you might get double that in the best case. I read
> > a post here on LKML where someone claims that W2K delivers 250 MB/s with
> > such a configuration. Linux 2.4 should do the same. Ask the SCSI gurus.
>
> That may have been my post you refer to. With 2x5 disks, each capable of
> 50 MB/s by itself, we can stream 255 MB/s very smoothly in either direction
> with W2K --- as long as FILE_FLAG_NO_BUFFERING is used. With standard
> reads the number is more like 100 MB/s if I recall correctly, so the buffer
> cache can definitely get in the way.
>
> With Linux + XFS I was getting 250 MB/s read and 220 MB/s write (with a
> bit less smoothness than W2K) using O_DIRECT and no high mem to avoid
> bounce buffer copies. Using standard reads the numbers drop to around
> 120 MB/s. That was a couple of weeks ago and I want to try tweaking some
> more but a co-worker has "borrowed" pieces of the hardware for the moment.

Just FYI, Linus announced that he had restored Andrea's O_DIRECT support
in the most recent 2.4.15-pre kernel, so you are no longer restricted to
using XFS for no-cache I/O. Whether you will be able to beat XFS for
speed using any other filesystem is another question.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/