Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
My machine has 8 processor cores and 8GB memory.
By bisect, I located patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
the following runs get about 4x that result.
What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
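For reference, these tunables are adjusted through /proc or sysctl, roughly as in the sketch below; the percentage values shown are only illustrative, not the settings that were actually tested.

# Illustrative only: inspect and change the writeback tunables mentioned above.
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
echo 20 > /proc/sys/vm/dirty_background_ratio   # example value
echo 40 > /proc/sys/vm/dirty_ratio              # example value
# equivalently:
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_ratio=40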
-yanmin
On Fri, 2007-11-09 at 17:47 +0800, Zhang, Yanmin wrote:
> Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
>
> My machine has 8 processor cores and 8GB memory.
>
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
>
> Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> the following runs get about 4x that result.

So the second run is 4x as fast as the first run?

> What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.

So the 2nd and 3rd run are stable at 50% slower than .23?

> I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
Could you try:
---
Subject: mm: speed up writeback ramp-up on clean systems
We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Fengguang Wu <[email protected]>
---
mm/page-writeback.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c 2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long bdi_nr_reclaimable;
-	long bdi_nr_writeback;
+	long nr_reclaimable, bdi_nr_reclaimable;
+	long nr_writeback, bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
 	long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
+
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK);
+
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
 		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 			break;
+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
----- Original Message ----
> From: "Zhang, Yanmin" <[email protected]>
> To: [email protected]
> Cc: LKML <[email protected]>
> Sent: Friday, November 9, 2007 10:47:52 AM
> Subject: iozone write 50% regression in kernel 2.6.24-rc1
>
> Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
>
> My machine has 8 processor cores and 8GB memory.
>
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
>
> Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> the following runs get about 4x that result.
>
> What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
>
> I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
>
> -yanmin
Hi Yanmin,
could you tell us the exact iozone command you are using? I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do not see the difference between 1st and subsequent runs. But then, I do my tests with 5GB file sizes like:
iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
Kind regards
Martin
On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> ----- Original Message ----
> > From: "Zhang, Yanmin" <[email protected]>
> > To: [email protected]
> > Cc: LKML <[email protected]>
> > Sent: Friday, November 9, 2007 10:47:52 AM
> > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> >
> > Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> >
> > My machine has 8 processor cores and 8GB memory.
> >
> > By bisect, I located patch
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> >
> > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> > the following runs get about 4x that result.
> >
> > What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
> >
> > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
> could you tell us the exact iozone command you are using?
iozone -i 0 -r 4k -s 512m
> I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better than in 2.6.22 and before (I skipped 2.6.23, because I was waiting for the per-bdi changes). I definitely do not see the difference between 1st and subsequent runs. But then, I do my tests with 5GB file sizes like:
>
> iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
My machine uses SATA (AHCI) disk.
-yanmin
On Fri, 2007-11-09 at 10:54 +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-09 at 17:47 +0800, Zhang, Yanmin wrote:
> > Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> >
> > My machine has 8 processor cores and 8GB memory.
> >
> > By bisect, I located patch
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> >
> >
> > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> > the following runs get about 4x that result.
>
> So the second run is 4x as fast as the first run?
Please see my comments below.
>
> > What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
>
> So the 2nd and 3rd run are stable at 50% slower than .23?
Almost. I did more testing today; please see the result list below.
>
> > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
>
> Could you try:
>
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).

1) Comparison among the first/second/following runs:
2.6.23: the second run of iozone is about 28% better than the first run.
The following runs are very stable, like the 2nd run.
2.6.24-rc2: the second run of iozone is about 170% better than the first run. The 3rd run
is about 80% better than the 2nd. The following runs are very stable, like the 3rd run.
2.6.24-rc2_peter: the second run of iozone is about 14% better than the first run. The following
runs are mostly stable, like the 2nd run.
So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
has a 330% improvement on the first run.

2) Comparison among different kernels (based on the stable highest result):
2.6.24-rc2 has about a 50% regression versus 2.6.23.
2.6.24-rc2_peter has the same result as 2.6.24-rc2.

From this point of view, the above patch gives no improvement there. :)
-yanmin
On Mon, 2007-11-12 at 10:14 +0800, Zhang, Yanmin wrote:
> > Subject: mm: speed up writeback ramp-up on clean systems
>
> I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).
>
> 1) Comparison among the first/second/following runs:
> 2.6.23: the second run of iozone is about 28% better than the first run.
> The following runs are very stable, like the 2nd run.
> 2.6.24-rc2: the second run of iozone is about 170% better than the first run. The 3rd run
> is about 80% better than the 2nd. The following runs are very stable, like the 3rd run.
> 2.6.24-rc2_peter: the second run of iozone is about 14% better than the first run. The following
> runs are mostly stable, like the 2nd run.
> So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
> has a 330% improvement on the first run.
>
> 2) Comparison among different kernels (based on the stable highest result):
> 2.6.24-rc2 has about a 50% regression versus 2.6.23.
> 2.6.24-rc2_peter has the same result as 2.6.24-rc2.
>
> From this point of view, the above patch gives no improvement there. :)
Drat, still good test results though.
Could you describe your system in detail? That is, you have 8GB of memory
and 8 CPUs (2x quad-core?). How many disks does it have, and are those
aggregated using md or dm? What filesystem do you use?
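One quick way to gather that information is something like the sketch below (standard commands; the exact output varies per system, and dmsetup needs the device-mapper tools installed):

# number of logical CPUs and total memory
grep -c '^processor' /proc/cpuinfo
grep MemTotal /proc/meminfo
# software RAID (md) and device-mapper (dm) setups, if any
cat /proc/mdstat
dmsetup ls
# filesystem(s) mounted for the test
mount | grep ext3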
On Mon, 2007-11-12 at 10:45 +0100, Peter Zijlstra wrote:
> On Mon, 2007-11-12 at 10:14 +0800, Zhang, Yanmin wrote:
>
> > > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).
> >
> > 1) Comparison among the first/second/following runs:
> > 2.6.23: the second run of iozone is about 28% better than the first run.
> > The following runs are very stable, like the 2nd run.
> > 2.6.24-rc2: the second run of iozone is about 170% better than the first run. The 3rd run
> > is about 80% better than the 2nd. The following runs are very stable, like the 3rd run.
> > 2.6.24-rc2_peter: the second run of iozone is about 14% better than the first run. The following
> > runs are mostly stable, like the 2nd run.
> > So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
> > has a 330% improvement on the first run.
> >
> > 2) Comparison among different kernels (based on the stable highest result):
> > 2.6.24-rc2 has about a 50% regression versus 2.6.23.
> > 2.6.24-rc2_peter has the same result as 2.6.24-rc2.
> >
> > From this point of view, the above patch gives no improvement there. :)
>
> Drat, still good test results though.
>
> Could you describe your system in detail? That is, you have 8GB of memory
> and 8 CPUs (2x quad-core?).
Yes.
> How many disks does it have
One machine uses a single AHCI SATA disk. The other machines use hardware RAID0.
> and are those
> aggregated using md or dm?
No.
> What filesystem do you use?
Ext3.
I see the regression on a couple of my machines. Please try the command:
#iozone -i 0 -r 4k -s 512m
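As a rough sketch of the repeated-run protocol discussed in this thread (the mount point and run count below are only placeholders):

# Sketch only: repeat the same sequential write/rewrite test a few times
# after a fresh boot/mount and compare the first run against the later,
# stable runs. /mnt/test is a placeholder for the ext3 test mount point.
cd /mnt/test
for i in 1 2 3 4; do
    echo "=== run $i ==="
    iozone -i 0 -r 4k -s 512m
done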
-yanmin
----- Original Message ----
> From: "Zhang, Yanmin" <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: [email protected]; LKML <[email protected]>
> Sent: Monday, November 12, 2007 1:45:57 AM
> Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
>
> On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > ----- Original Message ----
> > > From: "Zhang, Yanmin"
> > > To: [email protected]
> > > Cc: LKML
> > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > >
> > > Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > >
> > > My machine has 8 processor cores and 8GB memory.
> > >
> > > By bisect, I located patch
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > >
> > > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > > the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> > > the following runs get about 4x that result.
> > >
> > > What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
> > >
> > > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
> > could you tell us the exact iozone command you are using?
> iozone -i 0 -r 4k -s 512m
>
OK, I definitely do not see the reported effect. On a HP Proliant with a RAID5 on CCISS I get:
2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite
The first run is always slowest, all subsequent runs are faster and the same speed.
>
> > I would like to repeat it on my setup, because I definitely see the
> > opposite behaviour in 2.6.24-rc1/rc2. The speed there is much better
> > than in 2.6.22 and before (I skipped 2.6.23, because I was waiting
> > for the per-bdi changes). I definitely do not see the difference
> > between 1st and subsequent runs. But then, I do my tests with 5GB
> > file sizes like:
> >
> > iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
> My machine uses SATA (AHCI) disk.
>
4x 72GB SCSI disks in a RAID5 on a CCISS controller with battery-backed write cache. The systems have 2 CPUs (64-bit) and 8 GB memory. I could test on some IBM boxes (2x dual-core, 8 GB) with RAID5 on "aacraid", but I need some time to free up one of the boxes.
Cheers
Martin
Single socket, dual core opteron, 2GB memory
Single SATA disk, ext3
x86_64 kernel and userland
(dirty_background_ratio, dirty_ratio) tunables
(iozone columns below: file size in KB, record size in KB, write KB/s, rewrite KB/s)
---- (5,10) - default
2.6.23.1-42.fc8 #1 SMP
524288 4 59580 60356
524288 4 59247 61101
524288 4 61030 62831
2.6.24-rc2 #28 SMP PREEMPT
524288 4 49277 56582
524288 4 50728 61056
524288 4 52027 59758
524288 4 51520 62426
---- (20,40) - similar to your 8GB
2.6.23.1-42.fc8 #1 SMP
524288 4 225977 447461
524288 4 232595 496848
524288 4 220608 478076
524288 4 203080 445230
2.6.24-rc2 #28 SMP PREEMPT
524288 4 54043 83585
524288 4 69949 516253
524288 4 72343 491416
524288 4 71775 492653
---- (60,80) - overkill
2.6.23.1-42.fc8 #1 SMP
524288 4 208450 491892
524288 4 216262 481135
524288 4 221892 543608
524288 4 202209 574725
524288 4 231730 452482
2.6.24-rc2 #28 SMP PREEMPT
524288 4 49091 86471
524288 4 65071 217566
524288 4 72238 492172
524288 4 71818 492433
524288 4 71327 493954
While I see that the write speed as reported under .24 ~70MB/s is much
lower than the one reported under .23 ~200MB/s, I find it very hard to
believe my poor single SATA disk could actually do the 200MB/s for
longer than its cache 8/16 MB (not sure).
vmstat shows that actual IO is done, even though the whole 512MB could
fit in cache, hence my suspicion that the ~70MB/s is the most realistic
of the two.
I'll have to look into what iozone actually does though and why this
patch makes the output different.
FWIW - because it's a single backing dev it does get to 100% of the dirty
limit after a few runs, so I'm not sure what makes the difference.
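As a rough cross-check of those limits (a sketch only; the kernel bases the real thresholds on dirtyable memory rather than MemTotal, so these numbers are only approximate):

# Approximate the global dirty thresholds for the (20,40) setting from
# MemTotal. Sketch only: the kernel uses dirtyable memory, so real values
# are somewhat lower.
awk '/MemTotal/ {
        kb = $2
        printf "background ~ %d MB, dirty ~ %d MB\n", kb * 20 / 100 / 1024, kb * 40 / 100 / 1024
}' /proc/meminfo
# On a ~2GB box this gives roughly 400MB/800MB, i.e. the whole 512MB test
# file fits under the global dirty limit and can sit dirty in the page cache.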
On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <[email protected]> wrote:
> > Single socket, dual core opteron, 2GB memory
> > Single SATA disk, ext3
> >
> > x86_64 kernel and userland
> >
> > (dirty_background_ratio, dirty_ratio) tunables
> >
> > ---- (5,10) - default
> >
> > 2.6.23.1-42.fc8 #1 SMP
> >
> > 524288 4 59580 60356
> > 524288 4 59247 61101
> > 524288 4 61030 62831
> >
> > 2.6.24-rc2 #28 SMP PREEMPT
> >
> > 524288 4 49277 56582
> > 524288 4 50728 61056
> > 524288 4 52027 59758
> > 524288 4 51520 62426
> >
> >
> > ---- (20,40) - similar to your 8GB
> >
> > 2.6.23.1-42.fc8 #1 SMP
> >
> > 524288 4 225977 447461
> > 524288 4 232595 496848
> > 524288 4 220608 478076
> > 524288 4 203080 445230
> >
> > 2.6.24-rc2 #28 SMP PREEMPT
> >
> > 524288 4 54043 83585
> > 524288 4 69949 516253
> > 524288 4 72343 491416
> > 524288 4 71775 492653
> >
> > ---- (60,80) - overkill
> >
> > 2.6.23.1-42.fc8 #1 SMP
> >
> > 524288 4 208450 491892
> > 524288 4 216262 481135
> > 524288 4 221892 543608
> > 524288 4 202209 574725
> > 524288 4 231730 452482
> >
> > 2.6.24-rc2 #28 SMP PREEMPT
> >
> > 524288 4 49091 86471
> > 524288 4 65071 217566
> > 524288 4 72238 492172
> > 524288 4 71818 492433
> > 524288 4 71327 493954
> >
> >
> > While I see that the write speed as reported under .24 ~70MB/s is much
> > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > believe my poor single SATA disk could actually do the 200MB/s for
> > longer than its cache 8/16 MB (not sure).
> >
> > vmstat shows that actual IO is done, even though the whole 512MB could
> > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > of the two.
>
> Even 70 MB/s seems too high. What throughput do you see for the
> raw disk partition?
>
> Also, are the numbers above for successive runs?
> It seems like you're seeing some caching effects so
> I'd recommend using a file larger than your cache size and
> the -e and -c options (to include fsync and close in timings)
> to try to eliminate them.
------ iozone -i 0 -r 4k -s 512m -e -c
.23 (20,40)
524288 4 31750 33560
524288 4 29786 32114
524288 4 29115 31476
.24 (20,40)
524288 4 25022 32411
524288 4 25375 31662
524288 4 26407 33871
------ iozone -i 0 -r 4k -s 4g -e -c
.23 (20,40)
4194304 4 39699 35550
4194304 4 40225 36099
.24 (20,40)
4194304 4 39961 41656
4194304 4 39244 39673
Yanmin, for that benchmark you ran, what was it meant to measure?
From what I can make of it, it's just write-cache benching.
One thing I don't understand is how the write numbers are so much lower
than the rewrite numbers. The iozone code (which gives me headaches,
damn what a mess) seems to suggest that the only thing that is different
is the lack of block allocation.
Linus posted a patch yesterday fixing up a regression in the ext3 bitmap
block allocator, /me goes apply that patch and rerun the tests.
> > ---- (20,40) - similar to your 8GB
> >
> > 2.6.23.1-42.fc8 #1 SMP
> >
> > 524288 4 225977 447461
> > 524288 4 232595 496848
> > 524288 4 220608 478076
> > 524288 4 203080 445230
> >
> > 2.6.24-rc2 #28 SMP PREEMPT
> >
> > 524288 4 54043 83585
> > 524288 4 69949 516253
> > 524288 4 72343 491416
> > 524288 4 71775 492653
2.6.24-rc2 +
patches/wu-reiser.patch
patches/writeback-early.patch
patches/bdi-task-dirty.patch
patches/bdi-sysfs.patch
patches/sched-hrtick.patch
patches/sched-rt-entity.patch
patches/sched-watchdog.patch
patches/linus-ext3-blockalloc.patch
524288 4 179657 487676
524288 4 173989 465682
524288 4 175842 489800
Linus' patch is the one that makes the difference here. So I'm unsure
how you bisected it down to:
04fbfdc14e5f48463820d6b9807daa5e9c92c51f
These results seem to point to
7c9e69faa28027913ee059c285a5ea8382e24b5d
as being the offending patch.
Peter Zijlstra wrote:
..
> While I see that the write speed as reported under .24 ~70MB/s is much
> lower than the one reported under .23 ~200MB/s, I find it very hard to
> believe my poor single SATA disk could actually do the 200MB/s for
> longer than its cache 8/16 MB (not sure).
>
> vmstat shows that actual IO is done, even though the whole 512MB could
> fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> of the two.
..
Yeah, sequential 70MB/sec is quite realistic for a modern SATA drive.
But significantly faster than that (say, 100MB/sec +) is unlikely at present.
On Mon, 2007-11-12 at 12:25 -0500, Mark Lord wrote:
> Peter Zijlstra wrote:
> ..
> > While I see that the write speed as reported under .24 ~70MB/s is much
> > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > believe my poor single SATA disk could actually do the 200MB/s for
> > longer than its cache 8/16 MB (not sure).
> >
> > vmstat shows that actual IO is done, even though the whole 512MB could
> > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > of the two.
> ..
>
> Yeah, sequential 70MB/sec is quite realistic for a modern SATA drive.
>
> But significantly faster than that (say, 100MB/sec +) is unlikely at present.
I just use the command '#iozone -i 0 -r 4k -s 512m', without '-e -c'. So if
we consider caching, the speed is very high. On my machine with 2.6.23, the write speed is
631MB/s, quite fast. :)
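A quick back-of-the-envelope check of that number (sketch only):

# At a reported 631 MB/s, the whole 512 MB file is "written" in well under
# a second, i.e. it is still sitting in the page cache rather than on disk.
echo "scale=2; 512 / 631" | bc    # ~0.81 seconds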
On Mon, 2007-11-12 at 04:58 -0800, Martin Knoblauch wrote:
> ----- Original Message ----
> > From: "Zhang, Yanmin" <[email protected]>
> > To: Martin Knoblauch <[email protected]>
> > Cc: [email protected]; LKML <[email protected]>
> > Sent: Monday, November 12, 2007 1:45:57 AM
> > Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
> >
> > On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > > ----- Original Message ----
> > > > From: "Zhang, Yanmin"
> > > > To: [email protected]
> > > > Cc: LKML
> > > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > > >
> > > > Compared with 2.6.23, iozone sequential write/rewrite (512M) has a 50% regression
> > > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > > >
> > > > My machine has 8 processor cores and 8GB memory.
> > > >
> > > > By bisect, I located patch
> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > > >
> > > > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > > > the results look stable. But with 2.6.24-rc1, the first run of iozone gets a very small result and
> > > > the following runs get about 4x that result.
> > > >
> > > > What I reported is the regression of the 2nd/3rd run, because the first run has an even bigger regression.
> > > >
> > > > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get any improvement.
> > > could you tell us the exact iozone command you are using?
> > iozone -i 0 -r 4k -s 512m
> >
>
> OK, I definitely do not see the reported effect. On a HP Proliant with a RAID5 on CCISS I get:
>
> 2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
> 2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite
>
> The first run is always slowest, all subsequent runs are faster and the same speed.
Although the first run is always the slowest, if we compare 2.6.23 and 2.6.24-rc,
the first-run result of 2.6.23 is about 7 times that of 2.6.24-rc.
Originally, my test suite just picked up the result of the first run. I might
change my test suite to run iozone many times.
For now I run the test manually many times after the machine reboots. Comparing 2.6.24-rc
with 2.6.23, the 3rd and following runs of 2.6.24-rc have about a 50% regression.
On Mon, 2007-11-12 at 17:48 +0100, Peter Zijlstra wrote:
> On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> > On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <[email protected]> wrote:
> > > Single socket, dual core opteron, 2GB memory
> > > Single SATA disk, ext3
> > >
> > > x86_64 kernel and userland
> > >
> > > (dirty_background_ratio, dirty_ratio) tunables
> > >
> > > ---- (5,10) - default
> > >
> > > 2.6.23.1-42.fc8 #1 SMP
> > >
> > > 524288 4 59580 60356
> > > 524288 4 59247 61101
> > > 524288 4 61030 62831
> > >
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > >
> > > 524288 4 49277 56582
> > > 524288 4 50728 61056
> > > 524288 4 52027 59758
> > > 524288 4 51520 62426
> > >
> > >
> > > ---- (20,40) - similar to your 8GB
> > >
> > > 2.6.23.1-42.fc8 #1 SMP
> > >
> > > 524288 4 225977 447461
> > > 524288 4 232595 496848
> > > 524288 4 220608 478076
> > > 524288 4 203080 445230
> > >
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > >
> > > 524288 4 54043 83585
> > > 524288 4 69949 516253
> > > 524288 4 72343 491416
> > > 524288 4 71775 492653
> > >
> > > ---- (60,80) - overkill
> > >
> > > 2.6.23.1-42.fc8 #1 SMP
> > >
> > > 524288 4 208450 491892
> > > 524288 4 216262 481135
> > > 524288 4 221892 543608
> > > 524288 4 202209 574725
> > > 524288 4 231730 452482
> > >
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > >
> > > 524288 4 49091 86471
> > > 524288 4 65071 217566
> > > 524288 4 72238 492172
> > > 524288 4 71818 492433
> > > 524288 4 71327 493954
> > >
> > >
> > > While I see that the write speed as reported under .24 ~70MB/s is much
> > > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > > believe my poor single SATA disk could actually do the 200MB/s for
> > > longer than its cache 8/16 MB (not sure).
> > >
> > > vmstat shows that actual IO is done, even though the whole 512MB could
> > > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > > of the two.
> >
> > Even 70 MB/s seems too high. What throughput do you see for the
> > raw disk partition?
> >
> > Also, are the numbers above for successive runs?
> > It seems like you're seeing some caching effects so
> > I'd recommend using a file larger than your cache size and
> > the -e and -c options (to include fsync and close in timings)
> > to try to eliminate them.
>
> ------ iozone -i 0 -r 4k -s 512m -e -c
>
> .23 (20,40)
>
> 524288 4 31750 33560
> 524288 4 29786 32114
> 524288 4 29115 31476
>
> .24 (20,40)
>
> 524288 4 25022 32411
> 524288 4 25375 31662
> 524288 4 26407 33871
>
>
> ------ iozone -i 0 -r 4k -s 4g -e -c
>
> .23 (20,40)
>
> 4194304 4 39699 35550
> 4194304 4 40225 36099
>
>
> .24 (20,40)
>
> 4194304 4 39961 41656
> 4194304 4 39244 39673
>
>
> Yanmin, for that benchmark you ran, what was it meant to measure?
> From what I can make of it, it's just write-cache benching.
Yeah, it's quite related to caching. I did more testing on my Stoakley machine (8 cores,
8GB memory). If I reduce the memory to 4GB, the speed is far lower.
>
> One thing I don't understand is how the write numbers are so much lower
> than the rewrite numbers. The iozone code (which gives me headaches,
> damn what a mess) seems to suggest that the only thing that is different
> is the lack of block allocation.
It might be a good direction.
>
> Linus posted a patch yesterday fixing up a regression in the ext3 bitmap
> block allocator, /me goes apply that patch and rerun the tests.
>
> > > ---- (20,40) - similar to your 8GB
> > >
> > > 2.6.23.1-42.fc8 #1 SMP
> > >
> > > 524288 4 225977 447461
> > > 524288 4 232595 496848
> > > 524288 4 220608 478076
> > > 524288 4 203080 445230
> > >
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > >
> > > 524288 4 54043 83585
> > > 524288 4 69949 516253
> > > 524288 4 72343 491416
> > > 524288 4 71775 492653
>
> 2.6.24-rc2 +
> patches/wu-reiser.patch
> patches/writeback-early.patch
> patches/bdi-task-dirty.patch
> patches/bdi-sysfs.patch
> patches/sched-hrtick.patch
> patches/sched-rt-entity.patch
> patches/sched-watchdog.patch
> patches/linus-ext3-blockalloc.patch
>
> 524288 4 179657 487676
> 524288 4 173989 465682
> 524288 4 175842 489800
>
>
> Linus' patch is the one that makes the difference here. So I'm unsure
> how you bisected it down to:
>
> 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
Originally, my test suite just picked up the result of the first run. Your prior
patch (speed up writeback ramp-up on clean systems) fixed an issue with the
first-run regression, so my bisect captured it.
However, later on I found the following runs have different results. A moment ago,
I retested 04fbfdc14e5f48463820d6b9807daa5e9c92c51f by:
#git checkout 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
#make
and then reverted your patch. It looks like 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
is not the root cause of the following-run regression. I will change my test suite to
run iozone many times and do a new bisect.
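For reference, the rerun bisect looks roughly like the following sketch (the endpoints are the kernels discussed in this thread; how each step is judged is summarized in the comments):

# Sketch of the rebisect between the known-good and known-bad kernels.
git bisect start
git bisect bad v2.6.24-rc1     # stable (3rd and later) runs show the ~50% regression
git bisect good v2.6.23        # stable runs look fine
# at each step: build and boot the kernel, run iozone several times, then
# mark the commit "git bisect good" or "git bisect bad" based on the
# 3rd and later runs rather than the first run.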
> These results seem to point to
>
> 7c9e69faa28027913ee059c285a5ea8382e24b5d
I tested 2.6.24-rc2, which already includes the above patch. 2.6.24-rc2 has the same
regression as 2.6.24-rc1.
-yanmin
On Tue, 2007-11-13 at 10:19 +0800, Zhang, Yanmin wrote:
> On Mon, 2007-11-12 at 17:48 +0100, Peter Zijlstra wrote:
> > On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> > > On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <[email protected]> wrote:
> > > > Single socket, dual core opteron, 2GB memory
> > > > Single SATA disk, ext3
> > > >
> > > > 2.6.23.1-42.fc8 #1 SMP
> > > >
> > > > 524288 4 225977 447461
> > > > 524288 4 232595 496848
> > > > 524288 4 220608 478076
> > > > 524288 4 203080 445230
> > > >
> > > > 2.6.24-rc2 #28 SMP PREEMPT
> > > >
> > > > 524288 4 54043 83585
> > > > 524288 4 69949 516253
> > > > 524288 4 72343 491416
> > > > 524288 4 71775 492653
> >
> > 2.6.24-rc2 +
> > patches/wu-reiser.patch
> > patches/writeback-early.patch
> > patches/bdi-task-dirty.patch
> > patches/bdi-sysfs.patch
> > patches/sched-hrtick.patch
> > patches/sched-rt-entity.patch
> > patches/sched-watchdog.patch
> > patches/linus-ext3-blockalloc.patch
> >
> > 524288 4 179657 487676
> > 524288 4 173989 465682
> > 524288 4 175842 489800
> >
> >
> > Linus' patch is the one that makes the difference here. So I'm unsure
> > how you bisected it down to:
> >
> > 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> Originally, my test suite just picked up the result of the first run. Your prior
> patch (speed up writeback ramp-up on clean systems) fixed an issue with the
> first-run regression, so my bisect captured it.
>
> However, later on I found the following runs have different results. A moment ago,
> I retested 04fbfdc14e5f48463820d6b9807daa5e9c92c51f by:
> #git checkout 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> #make
>
> and then reverted your patch. It looks like 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> is not the root cause of the following-run regression. I will change my test suite to
> run iozone many times and do a new bisect.
>
> > These results seem to point to
> >
> > 7c9e69faa28027913ee059c285a5ea8382e24b5d
My new bisect captured 7c9e69faa28027913ee059c285a5ea8382e24b5d,
which causes the regression in the following iozone runs (the 3rd/4th/... runs after mounting
the ext3 partition).
Peter,
Where could I download Linus' new patches, especially patches/linus-ext3-blockalloc.patch?
I couldn't find it in my archive of LKML mails.
yanmin
On Tue, 2007-11-13 at 16:34 +0800, Zhang, Yanmin wrote:
> My new bisect captured 7c9e69faa28027913ee059c285a5ea8382e24b5d,
> which causes the regression in the following iozone runs (the 3rd/4th/... runs after mounting
> the ext3 partition).
Linus just reverted that commit with commit:
commit 0b832a4b93932103d73c0c3f35ef1153e288327b
Author: Linus Torvalds <[email protected]>
Date: Tue Nov 13 08:07:31 2007 -0800
Revert "ext2/ext3/ext4: add block bitmap validation"
This reverts commit 7c9e69faa28027913ee059c285a5ea8382e24b5d, fixing up
conflicts in fs/ext4/balloc.c manually.
The cost of doing the bitmap validation on each lookup - even when the
bitmap is cached - is absolutely prohibitive. We could, and probably
should, do it only when adding the bitmap to the buffer cache. However,
right now we are better off just reverting it.
Peter Zijlstra measured the cost of this extra validation as a 85%
decrease in cached iozone, and while I had a patch that took it down to
just 17% by not being _quite_ so stupid in the validation, it was still
a big slowdown that could have been avoided by just doing it right.