2006-01-16 07:35:45

by Max Waterman

[permalink] [raw]
Subject: io performance...

Hi,

I've been referred to this list from the linux-raid list.

I've been playing with a RAID system, trying to obtain best bandwidth
from it.

I've noticed that I consistently get better (read) numbers from kernel 2.6.8
than from later kernels.

For example, I get 135MB/s on 2.6.8, but I typically get ~90MB/s on later
kernels.

I'm using this :

<http://www.sharcnet.ca/~hahn/iorate.c>

to measure the iorate. I'm using the Debian distribution. The h/w is a MegaRAID
320-2. The array I'm measuring is a RAID0 of 4 Fujitsu Max3073NC 15Krpm drives.

The later kernels I've been using are :

2.6.12-1-686-smp
2.6.14-2-686-smp
2.6.15-1-686-smp

The kernel which gives us the best results is :

2.6.8-2-386

(note that it's not an smp kernel)

I'm testing on an otherwise idle system.

Any ideas as to why this might be? Any other advice/help?

Thanks!

Max.


2006-01-16 08:08:15

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: io performance...

Max Waterman wrote:

> Hi,
>
> I've been referred to this list from the linux-raid list.
>
> I've been playing with a RAID system, trying to obtain best bandwidth
> from it.
>
> I've noticed that I consistently get better (read) numbers from kernel
> 2.6.8
> than from later kernels.


To open the bottlenecks, the following works well. Jens will shoot me
for recommending this, but it works well. 2.6.9 so far has the highest
numbers with this fix. You can manually putz around with these numbers,
but they are an artificial constraint if you are using RAID technology
that caches and elevators requests and consolidates them.


Jeff



Attachments:
blkdev.patch (540.00 B)

2006-01-16 08:35:41

by Pekka Enberg

[permalink] [raw]
Subject: Re: io performance...

Hi,

On 1/16/06, Max Waterman <[email protected]> wrote:
> I've noticed that I consistently get better (read) numbers from kernel 2.6.8
> than from later kernels.

[snip]

> The later kernels I've been using are :
>
> 2.6.12-1-686-smp
> 2.6.14-2-686-smp
> 2.6.15-1-686-smp
>
> The kernel which gives us the best results is :
>
> 2.6.8-2-386
>
> Any ideas to why this might be? Any other advice/help?

It would be helpful if you could isolate the exact changeset that
introduces the regression. You can use git bisect for that. Please
refer to the following URL for details:
http://www.kernel.org/pub/software/scm/git/docs/howto/isolate-bugs-with-bisect.txt
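
The basic flow is something like this (the tags here are only examples,
and both endpoints need to exist in the tree you are bisecting - see the
note below about pre-2.6.11-rc2 history):

  git bisect start
  git bisect bad v2.6.12        # a kernel that shows the slowdown
  git bisect good v2.6.8        # the last kernel known to be fast
  # build and test the revision git checks out, then mark it with
  # "git bisect good" or "git bisect bad" and repeat until it converges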

Also note that changesets for pre-2.6.11-rc2 kernels are in the
old-2.6-bkcvs git tree. If you are new to git, you can find a good
introduction here: http://linux.yyz.us/git-howto.html. Thanks.

Pekka

2006-01-17 13:55:29

by Jens Axboe

[permalink] [raw]
Subject: Re: io performance...

On Mon, Jan 16 2006, Jeff V. Merkey wrote:
> Max Waterman wrote:
>
> >Hi,
> >
> >I've been referred to this list from the linux-raid list.
> >
> >I've been playing with a RAID system, trying to obtain best bandwidth
> >from it.
> >
> >I've noticed that I consistently get better (read) numbers from kernel
> >2.6.8
> >than from later kernels.
>
>
> To open the bottlenecks, the following works well. Jens will shoot me
> for recommending this,
> but it works well. 2.6.9 so far has the highest numbers with this fix.
> You can manually putz
> around with these numbers, but they are an artificial constraint if you
> are using RAID technology
> that caches and elevators requests and consolidates them.
>
>
> Jeff
>
>

>
> diff -Naur ./include/linux/blkdev.h ../linux-2.6.9/./include/linux/blkdev.h
> --- ./include/linux/blkdev.h 2004-10-18 15:53:43.000000000 -0600
> +++ ../linux-2.6.9/./include/linux/blkdev.h 2005-12-06 09:54:46.000000000 -0700
> @@ -23,8 +23,10 @@
> typedef struct elevator_s elevator_t;
> struct request_pm_state;
>
> -#define BLKDEV_MIN_RQ 4
> -#define BLKDEV_MAX_RQ 128 /* Default maximum */
> +//#define BLKDEV_MIN_RQ 4
> +//#define BLKDEV_MAX_RQ 128 /* Default maximum */
> +#define BLKDEV_MIN_RQ 4096
> +#define BLKDEV_MAX_RQ 8192 /* Default maximum */

Yeah I could shoot you. However I'm more interested in why this is
necessary, eg I'd like to see some numbers from you comparing:

- The stock settings
- Doing
# echo 8192 > /sys/block/<dev>/queue/nr_requests
for each drive you are accessing.
- The kernel with your patch.

If #2 and #3 don't provide very similar profiles/scores, then we have
something to look at.
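
If you want to try #2 across several drives, the stock nr_requests value
is 128 and a small shell loop covers them (the device names below are
just examples):

  cat /sys/block/sda/queue/nr_requests    # shows the stock 128
  for d in sda sdb sdc sdd; do
      echo 8192 > /sys/block/$d/queue/nr_requests
  done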

The BLKDEV_MIN_RQ increase is just silly and wastes a huge amount of
memory for no good reason.

--
Jens Axboe

2006-01-17 17:06:44

by Phillip Susi

[permalink] [raw]
Subject: Re: io performance...

Did you direct the program to use O_DIRECT? If not, then I believe the
problem you are seeing is that the generic block layer is not performing
large enough readahead to keep all the disks in the array reading at
once, because the stripe width is rather large. What stripe factor did
you format the array using?


I have a sata fakeraid at home of two drives using a stripe factor of 64
KB. If I don't issue O_DIRECT IO requests of at least 128 KB ( the
stripe width ), then throughput drops significantly. If I issue
multiple async requests of smaller size that totals at least 128 KB,
throughput also remains high. If you only issue a single 32 KB request
at a time, then two requests must go to one drive and be completed
before the other drive gets any requests, so it remains idle a lot of
the time.
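
A quick way to see the effect, assuming a GNU dd that supports
iflag=direct (the device name here is just an example), is to compare
direct reads at different request sizes:

  dd if=/dev/sda of=/dev/null bs=32k count=4096 iflag=direct   # sub-stripe reads
  dd if=/dev/sda of=/dev/null bs=128k count=1024 iflag=direct  # full-stripe reads

The 32 KB case keeps only one drive busy at a time; the 128 KB case
spans the whole stripe, so both drives read concurrently.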

Max Waterman wrote:
> Hi,
>
> I've been referred to this list from the linux-raid list.
>
> I've been playing with a RAID system, trying to obtain best bandwidth
> from it.
>
> I've noticed that I consistently get better (read) numbers from kernel
> 2.6.8
> than from later kernels.
>
> For example, I get 135MB/s on 2.6.8, but I typically get ~90MB/s on later
> kernels.
>
> I'm using this :
>
> <http://www.sharcnet.ca/~hahn/iorate.c>
>
> to measure the iorate. I'm using the debian distribution. The h/w is a
> MegaRAID
> 320-2. The array I'm measuring is a RAID0 of 4 Fujitsu Max3073NC
> 15Krpm drives.
>
> The later kernels I've been using are :
>
> 2.6.12-1-686-smp
> 2.6.14-2-686-smp
> 2.6.15-1-686-smp
>
> The kernel which gives us the best results is :
>
> 2.6.8-2-386
>
> (note that it's not an smp kernel)
>
> I'm testing on an otherwise idle system.
>
> Any ideas to why this might be? Any other advice/help?
>
> Thanks!
>
> Max.

2006-01-17 21:04:14

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: io performance...

Jens Axboe wrote:

>On Mon, Jan 16 2006, Jeff V. Merkey wrote:
>
>
>>Max Waterman wrote:
>>
>>
>>
>>>Hi,
>>>
>>>I've been referred to this list from the linux-raid list.
>>>
>>>I've been playing with a RAID system, trying to obtain best bandwidth
>>>
>>>
>>>from it.
>>
>>
>>>I've noticed that I consistently get better (read) numbers from kernel
>>>2.6.8
>>>than from later kernels.
>>>
>>>
>>To open the bottlenecks, the following works well. Jens will shoot me
>>for recommending this,
>>but it works well. 2.6.9 so far has the highest numbers with this fix.
>>You can manually putz
>>around with these numbers, but they are an artificial constraint if you
>>are using RAID technology
>>that caches and elevators requests and consolidates them.
>>
>>
>>Jeff
>>
>>
>>
>>
>
>
>
>>diff -Naur ./include/linux/blkdev.h ../linux-2.6.9/./include/linux/blkdev.h
>>--- ./include/linux/blkdev.h 2004-10-18 15:53:43.000000000 -0600
>>+++ ../linux-2.6.9/./include/linux/blkdev.h 2005-12-06 09:54:46.000000000 -0700
>>@@ -23,8 +23,10 @@
>> typedef struct elevator_s elevator_t;
>> struct request_pm_state;
>>
>>-#define BLKDEV_MIN_RQ 4
>>-#define BLKDEV_MAX_RQ 128 /* Default maximum */
>>+//#define BLKDEV_MIN_RQ 4
>>+//#define BLKDEV_MAX_RQ 128 /* Default maximum */
>>+#define BLKDEV_MIN_RQ 4096
>>+#define BLKDEV_MAX_RQ 8192 /* Default maximum */
>>
>>
>
>Yeah I could shoot you. However I'm more interested in why this is
>necessary, eg I'd like to see some numbers from you comparing:
>
>- The stock settings
>- Doing
> # echo 8192 > /sys/block/<dev>/queue/nr_requests
> for each drive you are accessing.
>- The kernel with your patch.
>
>If #2 and #3 don't provide very similar profiles/scores, then we have
>something to look at.
>
>The BLKDEV_MIN_RQ increase is just silly and wastes a huge amount of
>memory for no good reason.
>
>
>
Yep. I build it into the kernel to save the trouble of sending it to
proc. Jens' recommendation will work just fine. It has the same effect
of increasing the max outstanding requests.

Jeff

2006-01-18 03:03:09

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

One further question. I get these messages 'in' dmesg :

sda: asking for cache data failed
sda: assuming drive cache: write through

How can I force it to be 'write back'?

Max.

2006-01-18 05:05:59

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: io performance...

Max Waterman wrote:

> One further question. I get these messages 'in' dmesg :
>
> sda: asking for cache data failed
> sda: assuming drive cache: write through
>
> How can I force it to be 'write back'?



Forcing write back is a very bad idea unless you have a battery backed
up RAID controller.

Jeff

>
> Max.

2006-01-18 05:10:04

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Jeff V. Merkey wrote:
> Max Waterman wrote:
>
>> One further question. I get these messages 'in' dmesg :
>>
>> sda: asking for cache data failed
>> sda: assuming drive cache: write through
>>
>> How can I force it to be 'write back'?
>
>
>
> Forcing write back is a very bad idea unless you have a battery backed
> up RAID controller.

We do.

In any case, I wonder what the consequences are of assuming 'write
through' when the array is configured as 'write back'. Is it just
different settings for different caches?

Max.

> Jeff
>
>>
>> Max.
>

2006-01-18 05:13:19

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: io performance...

Max Waterman wrote:

> Jeff V. Merkey wrote:
>
>> Max Waterman wrote:
>>
>>> One further question. I get these messages 'in' dmesg :
>>>
>>> sda: asking for cache data failed
>>> sda: assuming drive cache: write through
>>>
>>> How can I force it to be 'write back'?
>>
>>
>>
>>
>> Forcing write back is a very bad idea unless you have a battery
>> backed up RAID controller.
>
>
> We do.
>
> In any case, I wonder what the consequences of assuming 'write
> through' when the array is configured as 'write back'? Is it just
> different settings for different caches?


It is. This is something that should be configured in the RAID
controller. The OS should always be write through.

Jeff

>
> Max.
>
>> Jeff
>>
>>>
>>> Max.
>>
>
>

2006-01-18 07:07:01

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Jeff V. Merkey wrote:
> Max Waterman wrote:
>
>> Jeff V. Merkey wrote:
>>
>>> Max Waterman wrote:
>>>
>>>> One further question. I get these messages 'in' dmesg :
>>>>
>>>> sda: asking for cache data failed
>>>> sda: assuming drive cache: write through
>>>>
>>>> How can I force it to be 'write back'?
>>>
>>>
>>>
>>>
>>> Forcing write back is a very bad idea unless you have a battery
>>> backed up RAID controller.
>>
>>
>> We do.
>>
>> In any case, I wonder what the consequences of assuming 'write
>> through' when the array is configured as 'write back'? Is it just
>> different settings for different caches?
>
>
> It is. This is something that should be configured in a RAID
> controller. OS should always be write through.

Ok, thanks for clearing that up, though I now wonder why the message is
there.

<shrug>

Max.

>
> Jeff
>
>>
>> Max.
>>
>>> Jeff
>>>
>>>>
>>>> Max.
>>>
>>
>>
>

2006-01-18 07:24:52

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Phillip Susi wrote:
> Did you direct the program to use O_DIRECT?

I'm just using the s/w (iorate/bonnie++) with default options - I'm no
expert. I could try though.

> If not then I believe the
> problem you are seeing is that the generic block layer is not performing
> large enough readahead to keep all the disks in the array reading at
> once, because the stripe width is rather large. What stripe factor did
> you format the array using?

I left the stripe size at the default, which, I believe, is 64K bytes;
same as your fakeraid below.

I did play with 'blockdev --setra' too.

I noticed it was 256 with a single disk, and, with s/w raid, it
increased by 256 for each extra disk in the array. I.e. for the RAID0
array with 4 drives, it was 1024.

With h/w raid, however, it did not increase when I added disks. Should I
use 'blockdev --setra 320' (i.e. 64 x 5 = 320, since we're now running
RAID5 on 5 drives)?
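
For reference, the commands I've been playing with look roughly like
this (the device name is just an example; --setra takes a count of
512-byte sectors):

  blockdev --getra /dev/sda        # show the current readahead
  blockdev --setra 320 /dev/sda    # set a new value
  blockdev --getra /dev/sda        # confirm it took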

> I have a sata fakeraid at home of two drives using a stripe factor of 64
> KB. If I don't issue O_DIRECT IO requests of at least 128 KB ( the
> stripe width ), then throughput drops significantly. If I issue
> multiple async requests of smaller size that totals at least 128 KB,
> throughput also remains high. If you only issue a single 32 KB request
> at a time, then two requests must go to one drive and be completed
> before the other drive gets any requests, so it remains idle a lot of
> the time.

I think that makes sense (which is a change in this RAID performance
business :( ).

Thanks.

Max.

> Max Waterman wrote:
>> Hi,
>>
>> I've been referred to this list from the linux-raid list.
>>
>> I've been playing with a RAID system, trying to obtain best bandwidth
>> from it.
>>
>> I've noticed that I consistently get better (read) numbers from kernel
>> 2.6.8
>> than from later kernels.
>>
>> For example, I get 135MB/s on 2.6.8, but I typically get ~90MB/s on later
>> kernels.
>>
>> I'm using this :
>>
>> <http://www.sharcnet.ca/~hahn/iorate.c>
>>
>> to measure the iorate. I'm using the debian distribution. The h/w is a
>> MegaRAID
>> 320-2. The array I'm measuring is a RAID0 of 4 Fujitsu Max3073NC
>> 15Krpm drives.
>>
>> The later kernels I've been using are :
>>
>> 2.6.12-1-686-smp
>> 2.6.14-2-686-smp
>> 2.6.15-1-686-smp
>>
>> The kernel which gives us the best results is :
>>
>> 2.6.8-2-386
>>
>> (note that it's not an smp kernel)
>>
>> I'm testing on an otherwise idle system.
>>
>> Any ideas to why this might be? Any other advice/help?
>>
>> Thanks!
>>
>> Max.
>

2006-01-18 09:21:30

by Alan

[permalink] [raw]
Subject: Re: io performance...

On Maw, 2006-01-17 at 21:30 -0700, Jeff V. Merkey wrote:
> > How can I force it to be 'write back'?
> Forcing write back is a very bad idea unless you have a battery backed
> up RAID controller.

Not always. If you have a cache flush command and the OS knows about
using it, or if you don't care if the data gets lost over a power
failure (e.g. /tmp and swap) it makes sense to force it.

The raid controller drivers that fake scsi don't always fake enough of
scsi to report that they support cache flushes and the like. That
doesn't mean the controller itself is necessarily doing one thing or
the other.

2006-01-18 15:19:54

by Phillip Susi

[permalink] [raw]
Subject: Re: io performance...

Right, the kernel does not know how many disks are in the array, so it
can't automatically increase the readahead. I'd say increasing the
readahead manually should solve your throughput issues.

Max Waterman wrote:
>
> I left the stripe size at the default, which, I believe, is 64K bytes;
> same as your fakeraid below.
>
> I did play with 'blockdev --setra' too.
>
> I noticed it was 256 with a single disk, and, with s/w raid, it
> increased by 256 for each extra disk in the array. IE for the raid 0
> array with 4 drives, it was 1024.
>
> With h/w raid, however, it did not increase when I added disks. Should I
> use 'blockdev --setra 320' (ie 64 x 5 = 320, since we're now running
> RAID5 on 5 drives)?
>

2006-01-18 15:49:17

by Phillip Susi

[permalink] [raw]
Subject: Re: io performance...

I was going to say, doesn't the kernel set the FUA bit on the write
request to push important flushes through the disk's write-back cache?
Like for filesystem journal flushes?


Alan Cox wrote:
> Not always. If you have a cache flush command and the OS knows about
> using it, or if you don't care if the data gets lost over a power
> failure (eg /tmp and swap) it makes sense to force it.
>
> The raid controller drivers that fake scsi don't always fake enough of
> scsi to report that they support cache flushes and the like. That
> doesn't mean the controller itself is necessarily doing one thing or
> the other.
>

Subject: Re: io performance...

On 1/18/06, Phillip Susi <[email protected]> wrote:
> I was going to say, doesn't the kernel set the FUA bit on the write
> request to push important flushes through the disk's write-back cache?
> Like for filesystem journal flushes?

Yes if:
* you have a disk supporting FUA
* you have a kernel >= 2.6.16-rc1
* you are using a SCSI (this includes libata) driver [ support for the IDE
driver will be merged later when races in changing IDE settings are fixed ]

Bartlomiej

> Alan Cox wrote:
> > Not always. If you have a cache flush command and the OS knows about
> > using it, or if you don't care if the data gets lost over a power
> > failure (eg /tmp and swap) it makes sense to force it.
> >
> > The raid controller drivers that fake scsi don't always fake enough of
> > scsi to report that they support cache flushes and the like. That
> > doesn't mean the controller itself is necessarily doing one thing or
> > the other.

2006-01-19 00:48:57

by Adrian Bunk

[permalink] [raw]
Subject: Re: io performance...

On Mon, Jan 16, 2006 at 03:35:31PM +0800, Max Waterman wrote:
> Hi,
>
> I've been referred to this list from the linux-raid list.
>
> I've been playing with a RAID system, trying to obtain best bandwidth
> from it.
>
> I've noticed that I consistently get better (read) numbers from kernel 2.6.8
> than from later kernels.
>
> For example, I get 135MB/s on 2.6.8, but I typically get ~90MB/s on later
> kernels.
>
> I'm using this :
>
> <http://www.sharcnet.ca/~hahn/iorate.c>
>
> to measure the iorate. I'm using the debian distribution. The h/w is a
> MegaRAID
> 320-2. The array I'm measuring is a RAID0 of 4 Fujitsu Max3073NC 15Krpm
> drives.
>
> The later kernels I've been using are :
>
> 2.6.12-1-686-smp
> 2.6.14-2-686-smp
> 2.6.15-1-686-smp
>
> The kernel which gives us the best results is :
>
> 2.6.8-2-386
>
> (note that it's not an smp kernel)
>
> I'm testing on an otherwise idle system.
>
> Any ideas to why this might be? Any other advice/help?

You should try to narrow the problem down a bit.

Possible causes are:
- kernel regression between 2.6.8 and 2.6.12
- SMP <-> !SMP support
- patches and/or configuration changes in the Debian kernels

You should try self-compiled, unmodified 2.6.8 and 2.6.12 ftp.kernel.org
kernels with the same .config (modulo differences from "make oldconfig").
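
Roughly like this (paths and config names are only examples; adjust to
the kernels you actually want to compare):

  cd linux-2.6.8
  cp /boot/config-2.6.8-2-386 .config    # start from the Debian config
  make oldconfig && make && make modules_install install

  cd ../linux-2.6.12
  cp ../linux-2.6.8/.config .config      # identical starting config
  make oldconfig && make && make modules_install install

Then rerun the same iorate test on both.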

After this test, you know whether you are in the first case.
If yes, you could do a bisect search to find the point where the
regression started.

> Thanks!
>
> Max.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-01-19 01:57:05

by Robert Hancock

[permalink] [raw]
Subject: Re: io performance...

Jeff V. Merkey wrote:
> Max Waterman wrote:
>
>> One further question. I get these messages 'in' dmesg :
>>
>> sda: asking for cache data failed
>> sda: assuming drive cache: write through
>>
>> How can I force it to be 'write back'?
>
> Forcing write back is a very bad idea unless you have a battery backed
> up RAID controller.

This is not what these messages are referring to. Those write through
vs. write back messages are referring to detecting the drive write cache
mode, not setting it. Whether or not the write cache is enabled is used
to determine whether the sd driver uses SYNCHRONIZE CACHE commands to
flush the write cache on the device. If the drive says its write cache
is off or doesn't support determining the cache status, the kernel will
not issue SYNCHRONIZE CACHE commands. This may be a bad thing if the
device is really using write caching.
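
If you want to see what the device itself claims, and assuming the
sdparm utility is installed and the MegaRAID passes the caching mode
page through (the "asking for cache data failed" message suggests it
may not), you can query the WCE bit directly:

  sdparm --get=WCE /dev/sda      # WCE=1 means write-back caching is on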

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/

2006-01-19 11:40:21

by Al Boldi

[permalink] [raw]
Subject: Re: io performance...

Jeff V. Merkey wrote:
> Jens Axboe wrote:
> >On Mon, Jan 16 2006, Jeff V. Merkey wrote:
> >>Max Waterman wrote:
> >>>I've noticed that I consistently get better (read) numbers from kernel
> >>>2.6.8 than from later kernels.
> >>
> >>To open the bottlenecks, the following works well. Jens will shoot me
> >>-#define BLKDEV_MIN_RQ 4
> >>-#define BLKDEV_MAX_RQ 128 /* Default maximum */
> >>+#define BLKDEV_MIN_RQ 4096
> >>+#define BLKDEV_MAX_RQ 8192 /* Default maximum */
> >
> >Yeah I could shoot you. However I'm more interested in why this is
> >necessary, eg I'd like to see some numbers from you comparing:
> >
> >- Doing
> > # echo 8192 > /sys/block/<dev>/queue/nr_requests
> > for each drive you are accessing.
> >
> >The BLKDEV_MIN_RQ increase is just silly and wastes a huge amount of
> >memory for no good reason.
>
> Yep. I build it into the kernel to save the trouble of sending it to proc.
> Jens' recommendation will work just fine. It has the same effect of
> increasing the max requests outstanding.

Your suggestion doesn't do anything here on 2.6.15, but
echo 192 > /sys/block/<dev>/queue/max_sectors_kb
echo 192 > /sys/block/<dev>/queue/read_ahead_kb
works wonders!

I don't know why, but anything less than 64 or more than 256 makes the queue
collapse miserably, causing some strange __copy_to_user calls?!?!?

Also, it seems that changing the kernel HZ has some drastic effects on the
queues. A simple lilo gets delayed 400% and 200% using 100HZ and 250HZ
respectively.

--
Al

2006-01-19 13:14:55

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Robert Hancock wrote:
> Jeff V. Merkey wrote:
>> Max Waterman wrote:
>>
>>> One further question. I get these messages 'in' dmesg :
>>>
>>> sda: asking for cache data failed
>>> sda: assuming drive cache: write through
>>>
>>> How can I force it to be 'write back'?
>>
>> Forcing write back is a very bad idea unless you have a battery backed
>> up RAID controller.
>
> This is not what these messages are referring to. Those write through
> vs. write back messages are referring to detecting the drive write cache
> mode, not setting it. Whether or not the write cache is enabled is used
> to determine whether the sd driver uses SYNCHRONIZE CACHE commands to
> flush the write cache on the device. If the drive says its write cache
> is off or doesn't support determining the cache status, the kernel will
> not issue SYNCHRONIZE CACHE commands. This may be a bad thing if the
> device is really using write caching..
>

So, if I have my raid controller set to use write-back, it *is* caching
the writes, and so this *is* a bad thing, right?

If so, how to fix?

Max.

2006-01-19 13:18:26

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Unfortunately, they don't want me to spend time doing this sort of
thing, so I'm out of luck.

They're going to stick with 2.6.8-smp, which seems to give the best
performance (which rules out your second case below, I suppose).

:|

Max.

Adrian Bunk wrote:
> On Mon, Jan 16, 2006 at 03:35:31PM +0800, Max Waterman wrote:
>> Hi,
>>
>> I've been referred to this list from the linux-raid list.
>>
>> I've been playing with a RAID system, trying to obtain best bandwidth
>> from it.
>>
>> I've noticed that I consistently get better (read) numbers from kernel 2.6.8
>> than from later kernels.
>>
>> For example, I get 135MB/s on 2.6.8, but I typically get ~90MB/s on later
>> kernels.
>>
>> I'm using this :
>>
>> <http://www.sharcnet.ca/~hahn/iorate.c>
>>
>> to measure the iorate. I'm using the debian distribution. The h/w is a
>> MegaRAID
>> 320-2. The array I'm measuring is a RAID0 of 4 Fujitsu Max3073NC 15Krpm
>> drives.
>>
>> The later kernels I've been using are :
>>
>> 2.6.12-1-686-smp
>> 2.6.14-2-686-smp
>> 2.6.15-1-686-smp
>>
>> The kernel which gives us the best results is :
>>
>> 2.6.8-2-386
>>
>> (note that it's not an smp kernel)
>>
>> I'm testing on an otherwise idle system.
>>
>> Any ideas to why this might be? Any other advice/help?
>
> You should try to narrow the problem a bit down.
>
> Possible causes are:
> - kernel regression between 2.6.8 and 2.6.12
> - SMP <-> !SMP support
> - patches and/or configuration changes in the Debian kernels
>
> You should try self-compiled unmodified 2.6.8 and 2.6.12 ftp.kernel.org
> kernels with the same .config (modulo differences by "make oldconfig").
>
> After this test, you know whether you are in the first case.
> If yes, you could do a bisect search for finding the point where the
> regression started.
>
>> Thanks!
>>
>> Max.
>
> cu
> Adrian
>

2006-01-19 14:08:49

by Alan

[permalink] [raw]
Subject: Re: io performance...

On Iau, 2006-01-19 at 21:14 +0800, Max Waterman wrote:
> So, if I have my raid controller set to use write-back, it *is* caching
> the writes, and so this *is* a bad thing, right?

Depends on your raid controller. If it is battery backed it may well all
be fine.

Alan

2006-01-20 04:09:25

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Alan Cox wrote:
> On Iau, 2006-01-19 at 21:14 +0800, Max Waterman wrote:
>> So, if I have my raid controller set to use write-back, it *is* caching
>> the writes, and so this *is* a bad thing, right?
>
> Depends on your raid controller. If it is battery backed it may well all
> be fine.

Eh? Why?

I'm not sure what difference it makes if the controller is battery
backed or not; if the drives are gone, then the card has nothing to
write to...will it make the writes when the power comes back on?

Max.

2006-01-20 04:27:49

by Alexander Samad

[permalink] [raw]
Subject: Re: io performance...

On Fri, Jan 20, 2006 at 12:09:14PM +0800, Max Waterman wrote:
> Alan Cox wrote:
> >On Iau, 2006-01-19 at 21:14 +0800, Max Waterman wrote:
> >>So, if I have my raid controller set to use write-back, it *is* caching
> >>the writes, and so this *is* a bad thing, right?
> >
> >Depends on your raid controller. If it is battery backed it may well all
> >be fine.
>
> Eh? Why?
>
> I'm not sure what difference it makes if the controller is battery
> backed or not; if the drives are gone, then the card has nothing to
> write to...will it make the writes when the power comes back on?
Some do.
>
> Max.



2006-01-20 05:58:41

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Phillip Susi wrote:
> Right, the kernel does not know how many disks are in the array, so it
> can't automatically increase the readahead. I'd say increasing the
> readahead manually should solve your throughput issues.

Any guesses for a good number?

We're in RAID10 (2+2) at the moment on 2.6.8-smp. These are the block
numbers I'm getting using bonnie++ :

readahead   write    read
      256     68K     46K
      512     67K     59K
      640     67K     64K
     1024     66K     73K
     2048     67K     88K
     3072     67K     91K
     8192     71K     96K
     9216     67K     92K
    16384     67K     90K

I think we might end up going for 8192.
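
In case it's useful, a sweep like the following covers the readahead
values above (the device name, mount point and bonnie++ arguments here
are just examples):

  for ra in 256 512 640 1024 2048 3072 8192 9216 16384; do
      blockdev --setra $ra /dev/sda
      bonnie++ -d /mnt/test -s 2048 -u nobody
  done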

We're still wondering why read performance is so low - it seems to be
the same as a single drive. RAID10 should give the same performance as
RAID0 over two drives, shouldn't it?

Max.

>
> Max Waterman wrote:
>>
>> I left the stripe size at the default, which, I believe, is 64K bytes;
>> same as your fakeraid below.
>>
>> I did play with 'blockdev --setra' too.
>>
>> I noticed it was 256 with a single disk, and, with s/w raid, it
>> increased by 256 for each extra disk in the array. IE for the raid 0
>> array with 4 drives, it was 1024.
>>
>> With h/w raid, however, it did not increase when I added disks. Should
>> I use 'blockdev --setra 320' (ie 64 x 5 = 320, since we're now running
>> RAID5 on 5 drives)?
>>
>

2006-01-20 12:52:25

by Alan

[permalink] [raw]
Subject: Re: io performance...

On Gwe, 2006-01-20 at 12:09 +0800, Max Waterman wrote:
> I'm not sure what difference it makes if the controller is battery
> backed or not; if the drives are gone, then the card has nothing to
> write to...will it make the writes when the power comes back on?

Yes it will, hopefully having checked first before writing. With the
higher end ones you can even pull the battery-backed RAM module out,
change the raid card, and it will still do it.

Alan

2006-01-20 13:43:12

by Ian Soboroff

[permalink] [raw]
Subject: Re: io performance...

Max Waterman <[email protected]> writes:

> Phillip Susi wrote:
>> Right, the kernel does not know how many disks are in the array, so
>> it can't automatically increase the readahead. I'd say increasing
>> the readahead manually should solve your throughput issues.
>
> Any guesses for a good number?
>
> We're in RAID10 (2+2) at the moment on 2.6.8-smp. These are the block
> numbers I'm getting using bonnie++ :
>
>[...]
> We're still wondering why rd performance is so low - seems to be the
> same as a single drive. RAID10 should be the same performance as RAID0
> over two drives, shouldn't it?

I think bonnie++ measures accesses to many small files (INN-like
simulation) and database accesses. These are random accesses, which
are the worst access pattern for RAID. Seek time in a RAID equals the
longest of all the drives in the RAID, rather than the average. So
bonnie++ is dominated by your seek time.

Ian


2006-01-25 06:36:44

by Max Waterman

[permalink] [raw]
Subject: Re: io performance...

Ian Soboroff wrote:
> Max Waterman <[email protected]> writes:
>
>> Phillip Susi wrote:
>>> Right, the kernel does not know how many disks are in the array, so
>>> it can't automatically increase the readahead. I'd say increasing
>>> the readahead manually should solve your throughput issues.
>> Any guesses for a good number?
>>
>> We're in RAID10 (2+2) at the moment on 2.6.8-smp. These are the block
>> numbers I'm getting using bonnie++ :
>>
>> [...]
>> We're still wondering why rd performance is so low - seems to be the
>> same as a single drive. RAID10 should be the same performance as RAID0
>> over two drives, shouldn't it?
>
> I think bonnie++ measures accesses to many small files (INN-like
> simulation) and database accesses. These are random accesses, which
> is the worst access pattern for RAID. Seek time in a RAID equals the
> longest of all the drives in the RAID, rather than the average. So
> bonnie++ is dominated by your seek time.

You think so? I had assumed when bonnie++'s output said 'sequential
access' that it was the opposite of random, for example (raid0 on 5
drives) :

>                     ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Char- --Block-- -Rewrite- -Per Char- --Block-- -Seeks-
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP  K/sec %CP /sec %CP
> hostname         2G 48024  96 121412 13 59714  10 47844  95 200264 21 942.8  1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16  4146  99 +++++ +++ +++++ +++  4167  99 +++++ +++ 14292  99

Am I wrong? If so, what exactly does 'Sequential' mean in this context?

Max.

2006-01-25 13:10:00

by be-news06

[permalink] [raw]
Subject: Re: io performance...

Ian Soboroff <[email protected]> wrote:
> simulation) and database accesses. These are random accesses, which
> is the worst access pattern for RAID. Seek time in a RAID equals the
> longest of all the drives in the RAID, rather than the average.

Well, actually it equals the shortest seek time, and it distributes the
seeks across multiple spindles (at least for RAID1).

Gruss
Bernd

2006-01-25 14:19:51

by Ian Soboroff

[permalink] [raw]
Subject: Re: io performance...

Max Waterman <[email protected]> writes:

>>> We're still wondering why rd performance is so low - seems to be the
>>> same as a single drive. RAID10 should be the same performance as RAID0
>>> over two drives, shouldn't it?
>>>
>> I think bonnie++ measures accesses to many small files (INN-like
>> simulation) and database accesses. These are random accesses, which
>> is the worst access pattern for RAID. Seek time in a RAID equals the
>> longest of all the drives in the RAID, rather than the average. So
>> bonnie++ is dominated by your seek time.
>
> You think so? I had assumed when bonnie++'s output said 'sequential
> access' that it was the opposite of random, for example (raid0 on 5
> drives) :
>

I could be wrong, I was just reading the information from the bonnie++
website...

Ian