2009-10-09 11:20:17

by Christian Ehrhardt

Subject: [PATCH] mm: make VM_MAX_READAHEAD configurable

From: Christian Ehrhardt <[email protected]>

On the one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
default and can be overridden per block device queue.
On the other hand, a lot of admins do not use that knob, so it is reasonable
to set a sensible default.
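(For reference, the existing per-queue knob can already be set from
userspace, e.g. — sda is just an example device; both commands set 512
kbytes:

  echo 512 > /sys/block/sda/queue/read_ahead_kb
  blockdev --setra 1024 /dev/sda    # --setra counts 512-byte sectors
)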

This patch allows configuring the value via the Kconfig mechanism and
therefore allows assigning different defaults depending on other Kconfig
symbols.

Using this, the patch increases the default max readahead for s390, improving
sequential throughput in a lot of scenarios with almost no drawbacks (only
theoretical workloads with a lot of concurrent sequential read patterns on a
very low-memory system suffer from page cache thrashing, as expected).

Signed-off-by: Christian Ehrhardt <[email protected]>
---

[diffstat]
include/linux/mm.h | 2 +-
mm/Kconfig | 19 +++++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)

[diff]
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -1169,7 +1169,7 @@ int write_one_page(struct page *page, in
void task_dirty_inc(struct task_struct *tsk);

/* readahead.c */
-#define VM_MAX_READAHEAD 128 /* kbytes */
+#define VM_MAX_READAHEAD CONFIG_VM_MAX_READAHEAD /* kbytes */
#define VM_MIN_READAHEAD 16 /* kbytes (includes current page) */

int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig
+++ linux-2.6/mm/Kconfig
@@ -288,3 +288,22 @@ config NOMMU_INITIAL_TRIM_EXCESS
of 1 says that all excess pages should be trimmed.

See Documentation/nommu-mmap.txt for more information.
+
+config VM_MAX_READAHEAD
+        int "Default max vm readahead size (16-4096 kbytes)"
+        default "512" if S390
+        default "128"
+        range 16 4096
+        help
+          This entry specifies the default max size, in kilobytes, used to
+          read ahead on sequential access patterns.
+
+          The value can still be configured per device queue at runtime;
+          this setting just defines the default.
+
+          The default is 128, which has been the value for years and
+          should suit all kinds of Linux targets.
+
+          Smaller values might be useful for very memory-constrained systems
+          like some embedded systems, to avoid page cache thrashing, while
+          larger values can be beneficial for server installations.
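For context, the define being made configurable here feeds the initial
per-device readahead window. In kernels of this vintage it is consumed
roughly like this (a paraphrase of mm/backing-dev.c, not part of the
patch):

  struct backing_dev_info default_backing_dev_info = {
          .ra_pages       = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
          .state          = 0,
          .capabilities   = BDI_CAP_MAP_COPY,
  };

Block request queues set the same default when they are created, which
is what the per-queue read_ahead_kb knob later overrides.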


2009-10-09 12:21:30

by Peter Zijlstra

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> From: Christian Ehrhardt <[email protected]>
>
> On the one hand, the define VM_MAX_READAHEAD in include/linux/mm.h is just a
> default and can be overridden per block device queue.
> On the other hand, a lot of admins do not use that knob, so it is reasonable
> to set a sensible default.
>
> This patch allows configuring the value via the Kconfig mechanism and
> therefore allows assigning different defaults depending on other Kconfig
> symbols.
>
> Using this, the patch increases the default max readahead for s390, improving
> sequential throughput in a lot of scenarios with almost no drawbacks (only
> theoretical workloads with a lot of concurrent sequential read patterns on a
> very low-memory system suffer from page cache thrashing, as expected).

Why can't this be solved in userspace?

Also, can't we simply raise this number if appropriate? Wu did some
read-ahead thrashing detection bits a long while back which should scale
the read-ahead window back when we're low on memory; not sure that ever
made it in, but that sounds like a better option than having different
magic numbers for each platform.

2009-10-09 12:30:31

by Jens Axboe

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, Oct 09 2009, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip patch description]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some
> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory; not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

Agree, making this a config option (and even defaulting to a different
number because of an arch setting) is crazy.

--
Jens Axboe

2009-10-09 13:15:27

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, Oct 09, 2009 at 02:20:30PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > [snip patch description]
>
> Why can't this be solved in userspace?
>
> Also, can't we simply raise this number if appropriate? Wu did some

Agreed, and Ehrhardt's 512KB readahead size looks like a good default :)

> read-ahead thrashing detection bits a long while back which should scale
> the read-ahead window back when we're low on memory; not sure that ever
> made it in, but that sounds like a better option than having different
> magic numbers for each platform.

The current kernel can roughly estimate the thrashing-safe size (the
context readahead). However, that's not enough. Context readahead is
normally active only for interleaved reads. The normal behavior is to
scale up the readahead size aggressively. For better support of embedded
systems, we may need a flag/mode which says: "we recently experienced
thrashing, so estimate and stick to the thrashing-safe size instead of
scaling up the readahead size and risking thrashing again".

Thanks,
Fengguang

2009-10-09 13:50:33

by Martin Schwidefsky

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, 9 Oct 2009 14:29:52 +0200
Jens Axboe <[email protected]> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > On Fri, 2009-10-09 at 13:19 +0200, Ehrhardt Christian wrote:
> > > [snip patch description]
> >
> > Why can't this be solved in userspace?
> > [snip]
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

The patch from Christian fixes a performance regression in the latest
distributions for s390, so we would opt for a larger value; 512KB seems
to be a good one. I have no idea what that will do to the embedded
space, which is why Christian chose to make it configurable. Clearly
the better solution would be some sort of system control that can be
modified at runtime.

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2009-10-09 13:59:26

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> On Fri, 9 Oct 2009 14:29:52 +0200
> Jens Axboe <[email protected]> wrote:
> > [snip]
>
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512KB seems
> to be a good one. I have no idea what that will do to the embedded
> space, which is why Christian chose to make it configurable. Clearly
> the better solution would be some sort of system control that can be
> modified at runtime.

So how about doing two patches together?

- lift default readahead size to around 512KB
- add some readahead logic to better support the thrashing case

Thanks,
Fengguang

2009-10-09 21:32:18

by Andrew Morton

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, 9 Oct 2009 14:29:52 +0200
Jens Axboe <[email protected]> wrote:

> On Fri, Oct 09 2009, Peter Zijlstra wrote:
> > [snip]
>
> Agree, making this a config option (and even defaulting to a different
> number because of an arch setting) is crazy.

Given the (increasing) level of disparity between different kinds of
storage devices, having _any_ default is crazy.

Would be better to make some sort of vaguely informed guess at
runtime, based upon the characteristics of the device.

2009-10-10 10:54:12

by Jens Axboe

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Fri, Oct 09 2009, Andrew Morton wrote:
> On Fri, 9 Oct 2009 14:29:52 +0200
> Jens Axboe <[email protected]> wrote:
> > [snip]
>
> Given the (increasing) level of disparity between different kinds of
> storage devices, having _any_ default is crazy.

You have to start somewhere :-). 0 is a default, too.

> Would be better to make some sort of vaguely informed guess at
> runtime, based upon the characteristics of the device.

I'm pretty sure the readahead logic already responds to e.g. memory
pressure, not sure if it attempts to do anything based on how quickly
the device is doing IO. Wu?

--
Jens Axboe

2009-10-10 12:41:35

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Sat, Oct 10, 2009 at 06:53:33PM +0800, Jens Axboe wrote:
> On Fri, Oct 09 2009, Andrew Morton wrote:
> > [snip]
> >
> > Given the (increasing) level of disparity between different kinds of
> > storage devices, having _any_ default is crazy.
>
> You have to start somewhere :-). 0 is a default, too.

Yes, an obvious and viable way is to start with a default size, and to
back off at runtime when thrashing is experienced.

Ideally we would use a 4MB readahead size per disk; however, there are
several constraints:
- readahead thrashing
  can be detected and handled very well if necessary :)
- mmap readaround size
  currently one single size is used for both sequential readahead and
  mmap readaround, and a larger readaround size risks more prefetch
  misses (compared to the pretty accurate readahead). I guess that
  despite the increased readaround misses, a large readaround size
  would still help application startup time on a 4GB desktop. However,
  it does risk working-set thrashing on memory-tight desktops. Maybe
  we can try to detect working-set thrashing too.
- IO latency
  Some workloads may be sensitive to IO latencies. The max_sectors_kb
  limit may help keep IO latency under control with a large readahead
  size, but there may be some tradeoffs in the IO scheduler (see the
  note below).
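(The request-size cap mentioned in the last point can be inspected per
device, e.g. — example path; the value varies by device and driver:

  cat /sys/block/sda/queue/max_sectors_kb
)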

In summary, towards a runtime-dynamic prefetching size, we
- can reliably adapt the readahead size to readahead thrashing
- may reliably adapt the readaround size to working-set thrashing
- don't know, in general, whether a workload is IO-latency sensitive

> > Would be better to make some sort of vaguely informed guess at
> > runtime, based upon the characteristics of the device.
>
> I'm pretty sure the readahead logic already responds to e.g. memory
> pressure,

Yes, it's much better than before. Once thrashed, old kernels are
basically reduced to doing 1-page (random) IOs, which is disastrous.

The current kernel does this. Given

    default_readahead_size > thrashing_readahead_size

the readahead sequence would be

    read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
    read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
    read_size, 2*read_size, 4*read_size, ... (until > thrashing_readahead_size)
    ...

So if read_size=1, it roughly holds that

    average_readahead_size = thrashing_readahead_size / log2(thrashing_readahead_size)
    thrashed_pages         = total_read_pages / 2

And if read_size=LONG_MAX (e.g. sendfile(large_file))

    average_readahead_size = default_readahead_size
    thrashed_pages         = default_readahead_size - thrashing_readahead_size

In summary, readahead for sendfile() is not adaptive at all. Normal
reads are somewhat adaptive, but not optimal.

But anyway, optimal thrashing readahead is approachable if it's a
desirable goal :).
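To illustrate the cycle above, here is a throwaway user-space simulation,
assuming read_size=1 and a thrashing-safe size of 128 pages (both values
are arbitrary examples, not measurements):

/* Windows double from read_size until they exceed the thrashing-safe
 * size, get thrashed, and the ramp-up restarts from read_size. */
#include <stdio.h>

int main(void)
{
        unsigned long read_size = 1, thrash_safe = 128;
        unsigned long win, total = 0, windows = 0;
        int cycle;

        for (cycle = 0; cycle < 3; cycle++) {
                for (win = read_size; ; win *= 2) {
                        printf("%lu ", win);
                        total += win;
                        windows++;
                        if (win > thrash_safe)
                                break;  /* window thrashed, start over */
                }
                printf("(thrashed, restart)\n");
        }
        printf("average window: %lu pages\n", total / windows);
        return 0;
}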

> not sure if it attempts to do anything based on how quickly
> the device is doing IO. Wu?

Not in the current kernel. But in fact it's possible to estimate the
read speed of each individual sequential stream, and possibly pass a
hint to the IO scheduler: someone will block on this IO after 3
seconds. But it may not be worth the complexity.

Thanks,
Fengguang

2009-10-10 17:42:50

by Andrew Morton

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Sat, 10 Oct 2009 20:40:42 +0800 Wu Fengguang <[email protected]> wrote:

> > not sure if it attempts to do anything based on how quickly
> > the device is doing IO. Wu?
>
> Not in the current kernel. But in fact it's possible to estimate the
> read speed of each individual sequential stream, and possibly pass a
> hint to the IO scheduler: someone will block on this IO after 3
> seconds. But it may not be worth the complexity.

Well, we have a test case. Would any of your design proposals address
the performance problem which motivated the s390 guys to propose this
patch?

2009-10-11 01:10:53

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

Hi Martin,

On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
> [snip]
>
> The patch from Christian fixes a performance regression in the latest
> distributions for s390, so we would opt for a larger value; 512KB seems
> to be a good one. I have no idea what that will do to the embedded
> space, which is why Christian chose to make it configurable. Clearly
> the better solution would be some sort of system control that can be
> modified at runtime.

May I ask for more details about your performance regression and why
it is related to readahead size? (we didn't change VM_MAX_READAHEAD..)

Thanks,
Fengguang

2009-10-12 05:53:44

by Christian Ehrhardt

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

Wu Fengguang wrote:
> Hi Martin,
>
> On Fri, Oct 09, 2009 at 09:49:50PM +0800, Martin Schwidefsky wrote:
>
>> [snip]
>
>> The patch from Christian fixes a performance regression in the latest
>> distributions for s390, so we would opt for a larger value; 512KB seems
>> to be a good one. I have no idea what that will do to the embedded
>> space, which is why Christian chose to make it configurable. Clearly
>> the better solution would be some sort of system control that can be
>> modified at runtime.
>>
>
> May I ask for more details about your performance regression and why
> it is related to readahead size? (we didn't change VM_MAX_READAHEAD..)
>
Sure, the performance regression appeared when comparing Novell SLES10
vs. SLES11. While you are right, Wu, that the upstream default never
changed so far, SLES10 had a patch applied that set 512.

As mentioned before, I didn't expect to get a generic 128->512 patch
accepted, therefore the configurable solution. But after Peter and Jens
replied so quickly stating that changing the default in the kernel would
be the wrong way to go, I already looked out for userspace alternatives.
At least for my issues I could fix it with device-specific udev rules
too.

And as Andrew mentioned, the diversity of devices causes any default to
be wrong for one or another installation. To solve that, the udev
approach can also differentiate between device types (might be easier on
s390 than on other architectures because I need to take care of two disk
types atm - and both should get 512); see the sketch below.
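Something along these lines should do it as a device-specific rule (the
file name and match patterns are assumptions; dasd*/sd* stand in for the
two s390 disk types):

  # /etc/udev/rules.d/60-readahead.rules (hypothetical)
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="dasd*", ATTR{queue/read_ahead_kb}="512"
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/read_ahead_kb}="512"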

The testcase for anyone who wants to experiment with it is almost too
easy; the biggest impact can be seen with single-threaded iozone - I get
~40% better throughput when increasing the readahead size to 512 (even
bigger RA sizes don't help much in my environment, probably due to fast
devices). A minimal reproduction sketch follows below.
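Something like the following should reproduce it (device name, mount
point and sizes are assumptions, not the exact setup used here):

  blockdev --setra 1024 /dev/dasda          # 1024 sectors = 512 kbytes
  # -i 0 creates the file, -i 1 runs the sequential read/re-read test
  iozone -i 0 -i 1 -s 2g -r 64k -f /mnt/test/iozone.tmp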

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

2009-10-12 06:24:03

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Mon, Oct 12, 2009 at 01:53:01PM +0800, Christian Ehrhardt wrote:
> Wu Fengguang wrote:
> > [snip]
> >
> > May I ask for more details about your performance regression and why
> > it is related to readahead size? (we didn't change VM_MAX_READAHEAD..)
> >
> Sure, the performance regression appeared when comparing Novell SLES10
> vs. SLES11. While you are right, Wu, that the upstream default never
> changed so far, SLES10 had a patch applied that set 512.

I see. I'm curious why SLES11 removed that patch. Did it experience
some regressions with the larger readahead size?

> As mentioned before, I didn't expect to get a generic 128->512 patch
> accepted, therefore the configurable solution. But after Peter and Jens
> replied so quickly stating that changing the default in the kernel would
> be the wrong way to go, I already looked out for userspace alternatives.
> At least for my issues I could fix it with device-specific udev rules
> too.

OK.

> And as Andrew mentioned, the diversity of devices causes any default to
> be wrong for one or another installation. To solve that, the udev
> approach can also differentiate between device types (might be easier on
> s390 than on other architectures because I need to take care of two disk
> types atm - and both should get 512).

I guess it's not a general solution for all. There are so many
devices in the world, and we have not yet considered the
memory/workload combinations.

> The testcase for anyone who wants to experiment with it is almost too
> easy; the biggest impact can be seen with single-threaded iozone - I get
> ~40% better throughput when increasing the readahead size to 512 (even
> bigger RA sizes don't help much in my environment, probably due to fast
> devices).

That's an impressive number - I guess we need a larger default RA size.
But before that, let's learn something from SLES10's experiences :)

Thanks,
Fengguang

2009-10-12 09:30:36

by Christian Ehrhardt

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

Wu Fengguang wrote:
> [SNIP]
>>> May I ask for more details about your performance regression and why
>>> it is related to readahead size? (we didn't change VM_MAX_READAHEAD..)
>>>
>>>
>> Sure, the performance regression appeared when comparing Novell SLES10
>> vs. SLES11. While you are right, Wu, that the upstream default never
>> changed so far, SLES10 had a patch applied that set 512.
>>
>
> I see. I'm curious why SLES11 removed that patch. Did it experience
> some regressions with the larger readahead size?
>
>

Only the obvious, expected one: with very low free/cacheable memory and
a lot of parallel processes doing sequential I/O, the RA size scales up
for all of them, but 64 x max RA then doesn't fit.

For example, iozone with 64 threads (each one on its own disk) doing a
sequential read access pattern with, I guess, 10 MB free for cache
suffered by ~15% due to thrashing.

But that is an acceptable regression, because it is not a relevant
customer scenario, while the benefits apply to real customer scenarios.

[...]
>> And as Andrew mentioned, the diversity of devices causes any default to
>> be wrong for one or another installation. To solve that, the udev
>> approach can also differentiate between device types (might be easier on
>> s390 than on other architectures because I need to take care of two disk
>> types atm - and both should get 512).
>>
>
> I guess it's not a general solution for all. There are so many
> devices in the world, and we have not yet considered the
> memory/workload combinations.
>
I completely agree; let me fix "my" issue via udev for now. And if some
day the readahead mechanism evolves and doesn't need any max RA at all,
we can all be happy.

[...]

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization

2009-10-12 09:40:00

by Fengguang Wu

Subject: Re: [PATCH] mm: make VM_MAX_READAHEAD configurable

On Mon, Oct 12, 2009 at 05:29:48PM +0800, Christian Ehrhardt wrote:
> Wu Fengguang wrote:
> > [snip]
> > I see. I'm curious why SLES11 removed that patch. Did it experience
> > some regressions with the larger readahead size?
>
> Only the obvious, expected one: with very low free/cacheable memory and
> a lot of parallel processes doing sequential I/O, the RA size scales up
> for all of them, but 64 x max RA then doesn't fit.
>
> For example, iozone with 64 threads (each one on its own disk) doing a
> sequential read access pattern with, I guess, 10 MB free for cache
> suffered by ~15% due to thrashing.

FYI, I just finished a patch for dealing with readahead thrashing.
Will do some tests and post the results :)

Thanks,
Fengguang