2008-01-23 14:47:00

by Martin Knoblauch

Subject: Performance problems when writing large files on CCISS hardware

Please CC me on replies, as I am not subscribed.

Hi,

For a while now I have been having problems writing large files sequentially to EXT2 filesystems on CCISS-based boxes. The problem is that writing multiple files in parallel is extremely slow compared to a single file in non-DIO mode. When using DIO, the scaling is almost "perfect". The problem manifests itself in RHEL4 kernels (2.6.9-X) and any mainline kernel up to 2.6.24-rc8.
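For reference, the single-file non-DIO vs. DIO comparison can be driven with plain dd; a minimal sketch (the output path is a placeholder, sizes are scaled down so it runs anywhere, and oflag=direct assumes a GNU dd and a filesystem/driver that supports O_DIRECT):

```shell
# Sketch: buffered vs. O_DIRECT sequential write. OUT is a placeholder --
# the real runs wrote 3 GB to the EXT2 filesystem on the CCISS volume.
OUT=$(mktemp)
dd if=/dev/zero of="$OUT" bs=1M count=8 conv=fsync      # buffered (non-DIO)
# dd if=/dev/zero of="$OUT" bs=1M count=8 oflag=direct  # DIO variant
ls -l "$OUT"
rm -f "$OUT"
```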

The systems in question are HP/DL380G4 with 2 cpus, 8 GB memory, SmartArray6i (CCISS) with BBWC and 4x72GB@10krpm disks in RAID5 configuration. Environment is 64-bit RHEL4.3.

The problem can be reproduced by running 1, 2 or 3 parallel "dd" processes, or "iozone" with 1, 2 or 3 threads. Curiously, there was a period from 2.6.24-rc1 until 2.6.24-rc5 where the problem went away. It turned out that this was due to a "regression" that was "fixed" by the commit below. Unfortunately this is not good for my systems, but it might shed some light on the underlying problem:

> #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> #Author: Mel Gorman <[email protected]>
> #Date: Mon Dec 17 16:20:05 2007 -0800
> #
> # mm: fix page allocation for larger I/O segments
> #
> # In some cases the IO subsystem is able to merge requests if the pages are
> # adjacent in physical memory. This was achieved in the allocator by having
> # expand() return pages in physically contiguous order in situations were a
> # large buddy was split. However, list-based anti-fragmentation changed the
> # order pages were returned in to avoid searching in buffered_rmqueue() for a
> # page of the appropriate migrate type.
> #
> # This patch restores behaviour of rmqueue_bulk() preserving the physical
> # order of pages returned by the allocator without incurring increased search
> # costs for anti-fragmentation.
> #
> # Signed-off-by: Mel Gorman <[email protected]>
> # Cc: James Bottomley <[email protected]>
> # Cc: Jens Axboe <[email protected]>
> # Cc: Mark Lord <[email protected]>
> # Signed-off-by: Andrew Morton <[email protected]>
> # Signed-off-by: Linus Torvalds <[email protected]>
> diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
> --- linux-2.6.24-rc5/mm/page_alloc.c	2007-12-21 04:14:11.305633890 +0000
> +++ linux-2.6.24-rc6/mm/page_alloc.c	2007-12-21 04:14:17.746985697 +0000
> @@ -847,8 +847,19 @@
>  		struct page *page = __rmqueue(zone, order, migratetype);
>  		if (unlikely(page == NULL))
>  			break;
> +
> +		/*
> +		 * Split buddy pages returned by expand() are received here
> +		 * in physical page order. The page is added to the callers and
> +		 * list and the list head then moves forward. From the callers
> +		 * perspective, the linked list is ordered by page number in
> +		 * some conditions. This is useful for IO devices that can
> +		 * merge IO requests if the physical pages are ordered
> +		 * properly.
> +		 */
>  		list_add(&page->lru, list);
>  		set_page_private(page, migratetype);
> +		list = &page->lru;
>  	}
>  	spin_unlock(&zone->lock);
>  	return i;
>

Reverting this patch from 2.6.24-rc8 gives the good performance reported below (rc8*). So, apparently CCISS is very sensitive to the page ordering.
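For what it's worth, the reproduction itself is nothing fancy; a minimal sketch of the parallel-writer case (directory and sizes are placeholders, scaled down so the sketch runs anywhere):

```shell
# Sketch: N parallel sequential writers. DIR/SIZE are placeholders --
# the real runs wrote 1-3 GB per writer to EXT2 on the CCISS volume.
DIR=$(mktemp -d)
N=3
SIZE=8                       # MiB per writer; bump to 1024+ for a real run

for i in $(seq 1 "$N"); do
  dd if=/dev/zero of="$DIR/file$i" bs=1M count="$SIZE" 2>/dev/null &
done
wait                         # combined wall-clock time is what matters
sync
ls -l "$DIR"
rm -rf "$DIR"
```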

Here are the numbers (MB/sec) including sync time. I compare 2.6.24-rc8 (rc8) and 2.6.24-rc8 with the above commit reverted (rc8*). Reported is the combined throughput for 1, 2 and 3 iozone threads; for reference the DIO numbers are included as well. Raw numbers are attached.

Test          rc8    rc8*
----------------------------------------
1x3GB          56     90
1x3GB-DIO      86     86
2x1.5GB       9.5     87
2x1.5GB-DIO    80     85
3x1GB        16.5     85
3x1GB-DIO      85     85

One can see that in mainline/rc8 all non-DIO numbers are lower than both the corresponding DIO numbers and the non-DIO numbers from rc8*. The performance for 2 and 3 threads in mainline/rc8 is just bad.

Of course I have the option to revert commit ....54b6d for my systems, but I think a more general solution would be better. If I can help track down the real problem, I am open to suggestions.

Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de


Attachments:
cciss-rc8-bad.log (9.97 kB)
cciss-rc8-good.log (9.97 kB)
config-2.6.24-rc8 (43.29 kB)

2008-02-05 10:54:10

by noah

Subject: Re: Performance problems when writing large files on CCISS hardware

2008/1/23, Martin Knoblauch <[email protected]>:
> Please CC me on replies, as I am not subscribed.
>
> Hi,
>
> for a while now I am having problems writing large files sequentially to EXT2 filesystems on CCISS based boxes. The problem is that writing multiple files in parallel is extremely slow compared to a single file in non-DIO mode. When using DIO, the scaling is almost "perfect". The problem manifests itself in RHEL4 kernels (2.6.9-X) and any mainline kernel up to 2.6.24-rc8.
>
> The systems in question are HP/DL380G4 with 2 cpus, 8 GB memory, SmartArray6i (CCISS) with BBWC and 4x72GB@10krpm disks in RAID5 configuration. Environment is 64-bit RHEL4.3.

I've seen similar problems on HP DL380 G4 (SmartArray 6i) and HP DL385
G5 (SmartArray P400).


RHEL 4, kernel 2.6.9 and reiserfs had significantly worse I/O
performance than a Gentoo box running 2.6.18 (iirc) when I did some
I/O tests with different distributions on an HP DL380 G4 before
deploying the machine in production. Switching I/O schedulers on RHEL4
didn't help either.
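(For the record, the elevator can be inspected and switched at runtime via sysfs; a sketch, where the device name in the comment is only an example -- CCISS volumes show up as e.g. /sys/block/cciss!c0d0:)

```shell
# Sketch: list the active elevator for each block device.
# The active scheduler is shown in [brackets]; switching requires root.
for q in /sys/block/*/queue/scheduler; do
  if [ -e "$q" ]; then
    echo "$q: $(cat "$q")"
  fi
done
# e.g. (as root): echo deadline > /sys/block/cciss!c0d0/queue/scheduler
```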


Also, I'm having awful I/O performance with a DL385 G2/2x2.6GHz/4GB
running MySQL 5 on Ubuntu 7.10 (kernel 2.6.22). It previously ran
Ubuntu 7.04 (kernel 2.6.20), which had the same issue. On this server
MySQL stalls for a long time waiting for I/O after SQL updates that
cause lots of writes.

I've been trying to mitigate the problems by adding 512MB of
battery-backed write cache and lately also switching from RAID 1 to
RAID 1+0 (4x72GB 15k SAS disks). It's better, but there are still issues.

I think after switching to RAID 1+0 I'm now getting around 50-70 MB/s,
which is 1.5-2.0 times the performance I had before.


Running sync on any of the servers while there are dirty pages to be
written (according to /proc/meminfo) virtually kills all I/O until the
sync completes.
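That backlog is easy to watch while a sync is in flight; a small sketch (the Dirty/Writeback fields are as reported by /proc/meminfo, in kB):

```shell
# Sketch: show outstanding dirty and writeback data (kB) from /proc/meminfo.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```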


I don't have much experience with RAID controllers other than the
SmartArray, so I don't really know what to expect, but I sure think
it should be better.

I'm getting much better performance out of an ordinary "home computer"
that has 4 standard disks in RAID 1+0 configuration (software; Linux
md) and AES encryption with dm-crypt.



Below are some numbers.

HP DL385 G5, 2x2.6GHz/2GB/4x72GB in RAID1+0, kernel 2.6.22 (Ubuntu 7.10 x64)
====
# sync; time sh -c "dd if=/dev/zero of=/data/test bs=1024k count=8192;sync"
dd: writing `/data/test': No space left on device
6187+0 records in
6186+0 records out
6487523328 bytes (6.5 GB) copied, 113.756 seconds, 57.0 MB/s

real 1m56.916s
user 0m0.050s
sys 0m31.040s


HP DL385 G5, 2x2.6GHz/4GB/4x72GB in RAID1+0 512MB BBWC, kernel 2.6.22
(Ubuntu 7.10 x64)
===
# sync; time sh -c "dd if=/dev/zero of=/data/test bs=1024k count=8192;sync"
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 120.797 seconds, 71.1 MB/s

real 2m1.883s
user 0m0.020s
sys 0m26.530s


-- noah