2005-04-20 17:39:53

by Andreas Hirstius

Subject: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later

Hi,


We have an rx4640 with 3x 3Ware 9500 SATA controllers and 24x WD740GD HDDs
in a software RAID0 configuration (using md).
With kernel 2.6.11 the read performance on the md is reduced by a factor
of 20 (!!) compared to previous kernels.
The write rate to the md doesn't change (it actually improves a bit).

The configs for the kernels are basically identical.

Here is some vmstat output:

kernel 2.6.9: ~1GB/s read
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
1 1 0 12672 6592 15914112 0 0 1081344 56 15719 1583 0 11 14 74
1 0 0 12672 6592 15915200 0 0 1130496 0 15996 1626 0 11 14 74
0 1 0 12672 6592 15914112 0 0 1081344 0 15891 1570 0 11 14 74
0 1 0 12480 6592 15914112 0 0 1081344 0 15855 1537 0 11 14 74
1 0 0 12416 6592 15914112 0 0 1130496 0 16006 1586 0 12 14 74


kernel 2.6.11: ~55MB/s read
procs memory swap io system cpu
r b swpd free buff cache si so bi bo in cs us sy wa id
1 1 0 24448 37568 15905984 0 0 56934 0 5166 1862 0 1 24 75
0 1 0 20672 37568 15909248 0 0 57280 0 5168 1871 0 1 24 75
0 1 0 22848 37568 15907072 0 0 57306 0 5173 1874 0 1 24 75
0 1 0 25664 37568 15903808 0 0 57190 0 5171 1870 0 1 24 75
0 1 0 21952 37568 15908160 0 0 57267 0 5168 1871 0 1 24 75


Because the filesystem might have an impact on the measurement, "dd" reading
directly from /dev/md0 was used to measure the performance.
This also makes it possible to test with block sizes larger than the page size.
It turns out that the read performance with kernel 2.6.11 is closely
related to the block size.
For example, if the block size is an exact multiple (>= 2) of the page
size, the performance is back to ~1.1GB/s.
The general behaviour is a bit more complicated (bs = block size, ps = page size):

1. bs <= 1.5 * ps : ~27-57MB/s (differs with ps)
2. bs > 1.5 * ps && bs < 2 * ps : rate increases to max. rate
3. bs = n * ps ; (n >= 2) : ~1.1GB/s (== max. rate)
4. bs > n * ps && bs < ~(n+0.5) * ps ; (n > 2) : ~27-70MB/s (differs with ps)
5. bs > ~(n+0.5) * ps && bs < (n+1) * ps ; (n > 2) : rate increases in several,
   more or less distinct, steps (e.g. 1/3 of max. rate and then 2/3 of max.
   rate for 64k pages)

I've tested all four possible page sizes on Itanium (4k, 8k, 16k and 64k) and
the pattern is always the same!!

With kernel 2.6.9 (and any kernel up to 2.6.10-bk6) the read rate is always at
~1.1GB/s, independent of the block size.


This simple patch solves the problem, but I have no idea of possible side-effects ...

--- linux-2.6.12-rc2_orig/mm/filemap.c 2005-04-04 18:40:05.000000000 +0200
+++ linux-2.6.12-rc2/mm/filemap.c 2005-04-20 10:27:42.000000000 +0200
@@ -719,7 +719,7 @@
index = *ppos >> PAGE_CACHE_SHIFT;
next_index = index;
prev_index = ra.prev_page;
- last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
+ last_index = (*ppos + desc->count + PAGE_CACHE_SIZE) >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;

isize = i_size_read(inode);
--- linux-2.6.12-rc2_orig/mm/readahead.c 2005-04-04 18:40:05.000000000 +0200
+++ linux-2.6.12-rc2/mm/readahead.c 2005-04-20 18:37:04.000000000 +0200
@@ -70,7 +70,7 @@
*/
static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
- unsigned long newsize = roundup_pow_of_two(size);
+ unsigned long newsize = size;

if (newsize <= max / 64)
newsize = newsize * newsize;
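
For reference, here is roughly what the window-sizing function being patched
looks like in 2.6.12-rc2. This is a simplified sketch for illustration only;
apart from the two lines visible in the diff context above it is an
illustrative reconstruction and may not match mm/readahead.c exactly:

static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
{
	/* 'size' = pages covered by the first read on the file,
	 * 'max'  = maximum readahead window in pages.
	 * The line below is the one removed by the patch. */
	unsigned long newsize = roundup_pow_of_two(size);

	if (newsize <= max / 64)
		newsize = newsize * newsize;	/* very small reads: grow fast */
	else if (newsize <= max / 4)
		newsize = max / 4;		/* medium reads (sketch only) */
	else
		newsize = max;			/* large reads: full window */

	return newsize;
}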



In order to keep this mail short, I've created a webpage that contains
all the detailed information and some plots:
http://www.cern.ch/openlab-debugging/raid


Regards,

Andreas Hirstius



2005-04-20 17:51:41

by jmerkey

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later



For 3Ware, you need to change the queue depths, and you will see
dramatically improved performance. 3Ware can take requests
a lot faster than Linux pushes them out. Try changing this instead; you
won't be going to sleep all the time waiting on the read/write
request queues to get "unstarved".


/linux/include/linux/blkdev.h

//#define BLKDEV_MIN_RQ 4
//#define BLKDEV_MAX_RQ 128 /* Default maximum */
#define BLKDEV_MIN_RQ 4096
#define BLKDEV_MAX_RQ 8192 /* Default maximum */


Jeff


2005-04-20 18:04:52

by Andreas Hirstius

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later


Just tried it, but the performance problem remains :-(
(actually, why should it change? This part of the code didn't change much
between 2.6.10-bk6 and -bk7...)

Andreas




2005-04-20 18:25:25

by Andreas Hirstius

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later


I was curious whether your patch would change the write rate, because I see
only ~550MB/s (continuous), which is about a factor of two below the
capabilities of the disks.
... and got this behaviour (with and without my other patch):

(with a single "dd if=/dev/zero of=testxx bs=65536 count=150000 &" or
several of them in parallel on an XFS filesystem)

"vmstat 1" output
0 0 0 28416 37888 15778368 0 0 0 0 8485 3043 0
0 0 100
6 0 0 22144 37952 15785920 0 0 0 12356 7695 2029 0
61 0 39
7 0 0 20864 38016 15785856 0 0 324 1722240 8046 4159
0 100 0 0
7 0 0 20864 38016 15784768 0 0 0 1261440 8391 5222
0 100 0 0
7 0 0 25984 38016 15781504 0 0 0 2003456 8372 5038
0 100 0 0
0 6 0 22784 38016 15781504 0 0 0 2826624 8397 8423
0 93 7 0
0 0 0 21632 38016 15783680 0 0 0 147840 8572 12114
0 9 17 74
0 0 0 21632 38016 15783680 0 0 0 52 8586 5185 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 8588 5412 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 8580 5372 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 7840 5590 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 8587 5321 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 8569 5575 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 8550 5157 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 0 7963 5640 0
0 0 100
0 0 0 21632 38016 15783680 0 0 0 32 8583 4434 0
0 0 100
7 0 0 20800 38016 15784768 0 0 0 7424 8404 3638 0
15 0 85
8 0 0 20864 38016 15786944 0 0 0 688768 7357 3221 0
100 0 0
8 0 0 20736 28544 15794240 0 0 0 1978560 8376 4897
0 100 0 0
7 0 0 22208 20736 15798784 0 0 0 1385088 8367 4984
0 100 0 0
6 0 0 22144 6848 15812672 0 0 56 1291904 8377 4815
0 100 0 0
0 0 0 50240 6848 15809408 0 0 304 3136 8556 5088 1
26 0 74
0 0 0 50304 6848 15809408 0 0 0 0 8572 5181 0
0 0 100

The average rate here is again pretty close to 550MB/s; it just writes
the blocks in "bursts"...


Andreas



2005-04-20 20:13:53

by jmerkey

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later


Burst is good. There's another window in the SCSI layer that limits transfers
to bursts of 128-sector runs (this seems to be the behavior on 3Ware). I've
never changed this, but increasing the max number of SCSI requests at this
layer may help. The bursty behavior is good, BTW.

Jeff


2005-04-21 01:16:08

by Nick Piggin

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later

On Wed, 2005-04-20 at 10:55 -0600, jmerkey wrote:
>
> For 3Ware, you need to chage the queue depths, and you will see
> dramatically improved performance. 3Ware can take requests
> a lot faster than Linux pushes them out. Try changing this instead, you
> won't be going to sleep all the time waiting on the read/write
> request queues to get "unstarved".
>
>
> /linux/include/linux/blkdev.h
>
> //#define BLKDEV_MIN_RQ 4
> //#define BLKDEV_MAX_RQ 128 /* Default maximum */
> #define BLKDEV_MIN_RQ 4096
> #define BLKDEV_MAX_RQ 8192 /* Default maximum */
>

BTW, don't do this. BLKDEV_MIN_RQ sets the size of the mempool of
reserved requests, which only gets used in low-memory conditions, so
most of that memory will probably be wasted.

Just change /sys/block/xxx/queue/nr_requests

Nick



2005-04-21 08:42:16

by Andreas Hirstius

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later

A small update.

Patching mm/filemap.c is not necessary in order to get the improved
performance!
It's sufficient to remove roundup_pow_of_two from get_init_ra_size ...

So a simple one-liner changes the picture dramatically.
But why ?!?!?


Andreas


by Bartlomiej Zolnierkiewicz

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later


Hi!

> A small update.
>
> Patching mm/filemap.c is not necessary in order to get the improved
> performance!
> It's sufficient to remove roundup_pow_of_two from get_init_ra_size ...
>
> So a simple one-liner changes the picture dramatically.
> But why ?!?!?

roundup_pow_of_two() uses fls(), and ia64 has a buggy fls() implementation
[it seems that David fixed it, but the patch is not in mainline yet]:

http://www.mail-archive.com/[email protected]/msg01196.html

That would also explain why you couldn't reproduce the problem on ia32
Xeon machines.
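
To illustrate the dependency (a hypothetical stand-alone sketch in plain C,
not the actual kernel or ia64 code): roundup_pow_of_two() is essentially a
thin wrapper around fls(), so a wrong fls() result translates directly into
a wrong initial readahead window:

#include <stdio.h>

/* Reference fls(): 1-based index of the highest set bit, 0 for x == 0. */
static int fls_ok(unsigned long x)
{
	int r = 0;
	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Hypothetical broken fls() that is off by one, standing in for the ia64
 * bug (the real failure mode is described in the URL above). */
static int fls_bad(unsigned long x)
{
	return x ? fls_ok(x) + 1 : 0;
}

/* roundup_pow_of_two() in the kernel boils down to 1UL << fls(x - 1). */
static unsigned long roundup_p2(int (*fls_fn)(unsigned long), unsigned long x)
{
	return 1UL << fls_fn(x - 1);
}

int main(void)
{
	/* A 3-page initial read should round up to a 4-page window ... */
	printf("correct fls: %lu pages\n", roundup_p2(fls_ok, 3));	/* 4 */
	/* ... but with a broken fls() the rounding is wrong, and so is every
	 * readahead window size that get_init_ra_size() derives from it. */
	printf("broken fls:  %lu pages\n", roundup_p2(fls_bad, 3));	/* 8 */
	return 0;
}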

Bartlomiej

2005-04-21 11:31:36

by Andreas Hirstius

Subject: Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later

Hi,

The fls() patch from David solves the problem :-))

Do you have an idea when it will be in the mainline kernel?

Andreas




2005-04-21 15:06:12

by David Mosberger

Subject: Re: [Gelato-technical] Re: Serious performance degradation on a RAID with kernel 2.6.10-bk7 and later

Tony and Andrew,

I just checked 2.6.12-rc3 and the fls() fix is indeed missing. Do you
know what happened?

--david
