2010-03-01 12:34:42

by Michal Schmidt

Subject: [PATCH] ext4: reduce scheduling latency with delayed allocation

mpage_da_submit_io() may process tens of thousands of pages at a time.
Unless full preemption is enabled, it causes scheduling latencies on the order
of tens of milliseconds.

It can be reproduced simply by writing a big file on ext4 repeatedly with
dd if=/dev/zero of=/tmp/dummy bs=10M count=50

The patch fixes this by allowing rescheduling inside the loop.
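
For context, the loop in question looks roughly like this (a simplified
sketch of mpage_da_submit_io(), not the verbatim source; only the placement
of the new cond_resched() matters):

	while (index <= end) {
		/* look up and write back up to PAGEVEC_SIZE pages per pass */
		nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
		if (nr_pages == 0)
			break;
		for (i = 0; i < nr_pages; i++) {
			/* ... lock the page and call ->writepage() on it ... */
		}
		pagevec_release(&pvec);
		cond_resched();		/* the added scheduling point */
	}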

cyclictest can be used to measure the latency. I tested with:
$ cyclictest -t1 -p 80 -n -i 5000 -m -l 20000

The results from a UP (uniprocessor) AMD Turion at 2 GHz with voluntary preemption:

Without the patch:
T: 0 ( 2535) P:80 I:5000 C: 20000 Min: 12 Act: 23 Avg: 3166 Max: 70524
(i.e. Average latency was more than 3 ms. Max observed latency was 71 ms.)

With the patch:
T: 0 ( 2588) P:80 I:5000 C: 20000 Min: 13 Act: 33 Avg: 49 Max: 11009
(i.e. Average latency was only 49 us. Max observed latency was 11 ms.)

Signed-off-by: Michal Schmidt <[email protected]>
---
fs/ext4/inode.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e119524..687a993 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2007,6 +2007,7 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd)
ret = err;
}
pagevec_release(&pvec);
+ cond_resched();
}
return ret;
}
--
1.7.0



2010-03-02 03:06:24

by Theodore Ts'o

Subject: Re: [PATCH] ext4: reduce scheduling latency with delayed allocation

On Mon, Mar 01, 2010 at 01:34:35PM +0100, Michal Schmidt wrote:
> mpage_da_submit_io() may process tens of thousands of pages at a time.
> Unless full preemption is enabled, it causes scheduling latencies on the order
> of tens of milliseconds.
>
> It can be reproduced simply by writing a big file on ext4 repeatedly with
> dd if=/dev/zero of=/tmp/dummy bs=10M count=50
>
> The patch fixes this by allowing rescheduling inside the loop.
>
> cyclictest can be used to measure the latency. I tested with:
> $ cyclictest -t1 -p 80 -n -i 5000 -m -l 20000
>
> The results from a UP (uniprocessor) AMD Turion at 2 GHz with voluntary preemption:
>
> Without the patch:
> T: 0 ( 2535) P:80 I:5000 C: 20000 Min: 12 Act: 23 Avg: 3166 Max: 70524
> (i.e. Average latency was more than 3 ms. Max observed latency was 71 ms.)
>
> With the patch:
> T: 0 ( 2588) P:80 I:5000 C: 20000 Min: 13 Act: 33 Avg: 49 Max: 11009
> (i.e. Average latency was only 49 us. Max observed latency was 11 ms.)

Have you tested for any performance regressions as a result of this
patch, using some file system benchmarks?

I don't think this is the best way to fix this problem, though. The
real right answer is to change how the code is structured. All of the
call sites that call mpage_da_submit_io() are immediately preceded by
mpage_da_map_blocks(). These two functions should be combined, and
instead of calling ext4_writepage() for each page,
mpage_da_map_and_write_blocks() should make a single call to
submit_bio() for each extent. That should be far more CPU efficient,
solving both your scheduling latency issue as well as helping out for
benchmarks that strive to stress both the disk and CPU simultaneously
(such as, for example, the TPC benchmarks).
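
In rough terms the restructuring would look something like this (purely
illustrative, not existing ext4 code; mpage_da_map_and_write_blocks() is
only a proposed name, and error handling plus the bi_end_io completion
callback are omitted):

	/*
	 * After mapping one delayed extent of nr_pages pages starting at
	 * physical block pblk, submit a single bio covering all of them
	 * instead of making one ->writepage() call per page.
	 */
	bio = bio_alloc(GFP_NOFS, nr_pages);
	bio->bi_bdev = inode->i_sb->s_bdev;
	bio->bi_sector = (sector_t)pblk << (inode->i_blkbits - 9);
	for (i = 0; i < nr_pages; i++)
		bio_add_page(bio, pages[i], PAGE_CACHE_SIZE, 0);
	submit_bio(WRITE, bio);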

This will also make our blktrace results much more compact, and Chris
Mason will be very happy about that!

- Ted


2010-03-10 13:09:40

by Michal Schmidt

Subject: Re: [PATCH] ext4: reduce scheduling latency with delayed allocation

On Mon, 1 Mar 2010 22:06:19 -0500 [email protected] wrote:
> On Mon, Mar 01, 2010 at 01:34:35PM +0100, Michal Schmidt wrote:
> > mpage_da_submit_io() may process tens of thousands of pages at a
> > time. Unless full preemption is enabled, it causes scheduling
> > latencies on the order of tens of milliseconds.
> >
> > It can be reproduced simply by writing a big file on ext4
> > repeatedly with dd if=/dev/zero of=/tmp/dummy bs=10M count=50
> >
> > The patch fixes this by allowing rescheduling inside the loop.
>
> Have you tested for any performance regressions as a result of this
> patch, using some file system benchmarks?

I used the 'fio' benchmark to test sequential write speed. Here are the
results:

Test            kernel             aggregate bandwidth
-------------------------------------------------------
hdd-multi       2.6.33.nopreempt    32.7 ±  3.5 MB/s
hdd-multi       2.6.33.reduce       33.8 ±  3.7 MB/s
hdd-multi       2.6.33.preempt      33.4 ±  3.1 MB/s

hdd-single      2.6.33.nopreempt    35.9 ±  2.1 MB/s
hdd-single      2.6.33.reduce       36.6 ±  2.3 MB/s
hdd-single      2.6.33.preempt      35.9 ±  2.0 MB/s

ramdisk-multi   2.6.33.nopreempt   189.7 ±  9.2 MB/s
ramdisk-multi   2.6.33.reduce      191.4 ±  9.5 MB/s
ramdisk-multi   2.6.33.preempt     163.5 ±  9.4 MB/s

ramdisk-single  2.6.33.nopreempt   152.3 ± 10.9 MB/s
ramdisk-single  2.6.33.reduce      171.3 ± 17.0 MB/s
ramdisk-single  2.6.33.preempt     144.2 ± 15.2 MB/s

The tests were run on a laptop with a dual-core AMD Turion at 2 GHz and 2 GB RAM.
A newly created filesystem was used for every fio run.
In the 'hdd' tests the filesystem was on a 24 GB LV on a hard disk. These
tests were repeated 12 times.
- In the '-single' variant a single process wrote a 5 GB file.
- In the '-multi' variant 5 processes wrote a 1 GB file each.
In the 'ramdisk' tests the filesystem was on a 1.5 GB ramdisk. These
tests were repeated >40 times.
- In the '-single' variant a single process wrote a 1400 MB file.
- In the '-multi' variant 5 processes wrote a 280 MB file each.
The kernels were:
'2.6.33.nopreempt' - vanilla 2.6.33 with CONFIG_PREEMPT_NONE
'2.6.33.reduce' - the same plus the patch adding the cond_resched()
'2.6.33.preempt' - 2.6.33 with CONFIG_PREEMPT (for curiosity)
The data for 'aggregate bandwidth' were taken from fio's 'aggrb' result.
The margin of error reported in the table is 2 × the standard deviation.
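
For reference, the '-multi' hdd variant corresponds roughly to a fio run
like the one below (reconstructed from the description above; the mount
point and exact option values are placeholders, not the original job file):

$ fio --name=seqwrite --directory=/mnt/test --rw=write --bs=1M \
      --size=1g --numjobs=5 --group_reporting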

Conclusion: Adding the cond_resched() did not result in any measurable
decrease in sequential write performance. (The results show a slight
increase, but it is within the margin of error.)

> I don't think this is the best way to fix this problem, though. The
> real right answer is to change how the code is structured. All of the
> call sites that call mpage_da_submit_io() are immediately preceded by
> mpage_da_map_blocks(). These two functions should be combined, and
> instead of calling ext4_writepage() for each page,
> mpage_da_map_and_write_blocks() should make a single call to
> submit_bio() for each extent. That should be far more CPU efficient,
> solving both your scheduling latency issue as well as helping out for
> benchmarks that strive to stress both the disk and CPU simultaneously
> (such as, for example, the TPC benchmarks).
>
> This will also make our blktrace results much more compact, and Chris
> Mason will be very happy about that!

You're almost certainly right, but I'm not likely to make such a change
in the near future.

Michal