2007-09-27 01:50:31

by Wu Fengguang

Subject: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

We don't want to introduce pointless delays in throttle_vm_writeout()
when the writeback limits are not yet exceeded, do we?

Cc: Nick Piggin <[email protected]>
Cc: OGAWA Hirofumi <[email protected]>
Cc: Kumar Gala <[email protected]>
Cc: Pete Zaitcev <[email protected]>
Cc: Greg KH <[email protected]>
Signed-off-by: Fengguang Wu <[email protected]>
---
mm/page-writeback.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)

--- linux-2.6.23-rc8-mm1.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm1/mm/page-writeback.c
@@ -507,16 +507,6 @@ void throttle_vm_writeout(gfp_t gfp_mask
long background_thresh;
long dirty_thresh;

- if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
- /*
- * The caller might hold locks which can prevent IO completion
- * or progress in the filesystem. So we cannot just sit here
- * waiting for IO to complete.
- */
- congestion_wait(WRITE, HZ/10);
- return;
- }
-
for ( ; ; ) {
get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

@@ -530,6 +520,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
congestion_wait(WRITE, HZ/10);
+
+ /*
+ * The caller might hold locks which can prevent IO completion
+ * or progress in the filesystem. So we cannot just sit here
+ * waiting for IO to complete.
+ */
+ if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
+ break;
}
}
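
For reference, here is a sketch of how the function reads with this patch
applied. The lines between the two hunks are not shown in the diff above;
they are filled in from the 2.6.23 mm/page-writeback.c and may differ
slightly in the -mm tree:

void throttle_vm_writeout(gfp_t gfp_mask)
{
	long background_thresh;
	long dirty_thresh;

	for ( ; ; ) {
		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

		/*
		 * Boost the allowable dirty threshold a bit for page
		 * allocators so they don't get DoS'ed by heavy writers
		 */
		dirty_thresh += dirty_thresh / 10;      /* wheeee... */

		if (global_page_state(NR_UNSTABLE_NFS) +
			global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;
		congestion_wait(WRITE, HZ/10);

		/*
		 * The caller might hold locks which can prevent IO completion
		 * or progress in the filesystem. So we cannot just sit here
		 * waiting for IO to complete.
		 */
		if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
			break;
	}
}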



2007-09-27 15:16:53

by Rik van Riel

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Thu, 27 Sep 2007 09:50:16 +0800
Fengguang Wu <[email protected]> wrote:

> We don't want to introduce pointless delays in throttle_vm_writeout()
> when the writeback limits are not yet exceeded, do we?

Good catch.

> Signed-off-by: Fengguang Wu <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-09-27 20:47:58

by Andrew Morton

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Thu, 27 Sep 2007 09:50:16 +0800
Fengguang Wu <[email protected]> wrote:

> We don't want to introduce pointless delays in throttle_vm_writeout()
> when the writeback limits are not yet exceeded, do we?
>
> [...]

This is a pretty major bugfix.

GFP_NOIO and GFP_NOFS callers should have been spending really large
amounts of time stuck in that sleep.
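
With the gfp bits of that era, GFP_NOIO is __GFP_WAIT alone and GFP_NOFS is
__GFP_WAIT|__GFP_IO, so both fail the (__GFP_FS|__GFP_IO) test and - before
this patch - always took the unconditional congestion_wait(WRITE, HZ/10),
regardless of the writeback level. A minimal user-space sketch (bit values
copied from 2.6.23 include/linux/gfp.h; the helper name is made up for
illustration):

#include <stdio.h>

/* gfp bit values as in 2.6.23 include/linux/gfp.h */
#define __GFP_WAIT	0x10u
#define __GFP_IO	0x40u
#define __GFP_FS	0x80u
#define GFP_NOIO	(__GFP_WAIT)
#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)

/* the check that this patch removes from the top of the function */
static int hits_unconditional_sleep(unsigned int gfp_mask)
{
	return (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO);
}

int main(void)
{
	printf("GFP_NOIO   -> %d\n", hits_unconditional_sleep(GFP_NOIO));   /* 1 */
	printf("GFP_NOFS   -> %d\n", hits_unconditional_sleep(GFP_NOFS));   /* 1 */
	printf("GFP_KERNEL -> %d\n", hits_unconditional_sleep(GFP_KERNEL)); /* 0 */
	return 0;
}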

I wonder why nobody noticed this happening. Either a) it turns out that
kswapd is doing a good job and such callers don't do direct reclaim much, or
b) nobody is doing any in-depth kernel instrumentation.

Now, how _would_ one notice this problem? We don't have very good tools,
really. Booting with "profile=sleep" and looking at the profile data would
be one way. Repeatedly doing sysrq-T is another. Perhaps the new
lockstat-via-lockdep code would allow this to be observed in some fashion,
dunno.

Anyway, this patch has the potential to significantly alter the dynamics of
the VM behaviour under particular workloads. It might turn up other
stuff...

2007-09-28 01:19:00

by Wu Fengguang

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Thu, Sep 27, 2007 at 11:16:10AM -0400, Rik van Riel wrote:
> On Thu, 27 Sep 2007 09:50:16 +0800
> Fengguang Wu <[email protected]> wrote:
>
> > We don't want to introduce pointless delays in throttle_vm_writeout()
> > when the writeback limits are not yet exceeded, do we?
>
> Good catch.

Thank you.

> > Signed-off-by: Fengguang Wu <[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>

It could be a good fix for 2.6.22/23. But for -mm, I'm not
sure whether throttle_vm_writeout() will eventually be removed ;-)

2007-09-28 08:08:17

by Ingo Molnar

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()


* Andrew Morton <[email protected]> wrote:

> This is a pretty major bugfix.
>
> GFP_NOIO and GFP_NOFS callers should have been spending really large
> amounts of time stuck in that sleep.
>
> I wonder why nobody noticed this happening. Either a) it turns out
> that kswapd is doing a good job and such callers don't do direct
> reclaim much or b) nobody is doing any in-depth kernel
> instrumentation.

[ Oh, it's Friday already, so soapbox time i guess. The easily offended
please skip this mail ;-) ]

People _have_ noticed, and we often ignored them. I can see four
fundamental, structural problems:

1) A certain lack of competitive pressure. An MM is too complex and
there is no "better Linux MM" to compare against objectively. The
BSDs are way too different and it's easy to dismiss even objective
comparisons due to the real complexity of the differences. Heck,
2.6.9 is "way too different" and we routinely reject bugreports from
such old kernels and lose vital feedback.

2) There is a wide-spread mentality of "you prove that there is a
problem" in the MM and elsewhere in the Linux kernel too. While of
course objective proof is paramount, we often "hide" behind our
self-created complexity of the system (without malice and without
realising it!). We've seen that happen in the updatedb discussions
and the swap-prefetch discussions. The correct approach would be for
the MM folks to be able to tell for just about any workload "this is
not our problem", and to have the benefit of the doubt _on the
tester's side_. We must not ignore people who tell us that "there is
something wrong going on here", just because they are unable to
analyze it themselves. Very often where we end up saying "we dont
know what's going on here" it's likely _our_ fault. We also must not
hide behind "please do these 10 easy steps and 2 kernel recompiles
and 10 reboots, only takes half a day, and come back to us once you
have the detailed debug data" requests. Instrumentation must be _on
by default_ (like SCHED_DEBUG is on by default), which brings us to:

3) Instrumentation and tools. Instrumentation (for example MM delay
statistics - like the scheduler delay statistics) give an objective
measure to compare kernels against each other. _Smart_ and _easy to
use_ and _default enabled_ instrumentation is a must. Not "turn on
these 3 zillion kernel options" which no distro enables. Debug
tools/scripts that use the instrumentation, that just have to be run
and produce meaningful output based on which 90% of the workloads can
be analyzed _without having to ask the user to do more_. (See
PowerTop as an example, the right kind of instrumentation can do
wonders that enables users to help us. We worked hard to lower the
cost of /proc/timer_stats so that distros can enable it by default -
and now they do enable it by default.)

4) The use of heuristics and the resulting inevitable nondeterminism in
the MM. I guess i'm biased about this, doing -rt and CFS, but we've
seen that happen with the scheduler: users _love_ determinism. (Users
dont typically care whether a click on the desktop takes 0.5 seconds
or 1.0 second - as long as it's always 0.5 or always 1.0. What they
do notice is when a click takes 0.5 seconds most of the time but
occasionally it takes 1.5 seconds - _that_ they report as a
regression. They would actually prefer it to take 1.0 seconds all the
time. The reason is human psychology: 99% of our daily routine is
driven by unconscious brain automatisms. We auto-pilot through most
of the day - and that very much covers routine computer/desktop usage
too. Unpredictable/noisy behavior of the computer forces the human
brain back into more conscious activity, which is perceived as a
negative thing: it's a distraction that takes capacity away from
_important_ conscious activities ... such as getting real work done
on the computer.)

Heuristics are also an objective problem for the code itself: they
introduce artificial coupling of workloads and raise complexity
artificially, making it very hard to prove the impact of changes
(even with good instrumentation) - thus increasing the barrier of
entry significantly (both to external contributors and to existing
maintainers).

all in one: the barrier of entry to _providing meaningful feedback_ is
often very high, and thus the barrier of entry of experimental patches
is too high too. These two factors are a lethal combination that lures us
into the false perception that everything is fine and that the yelling
out there is just from clueless whiners who are not willing to help us
:-/

Yes, MM testing is hard (in fact, good MM instrumentation and tooling is
_very_ hard), and the MM is in a pretty good shape (otherwise an
alternative would have shown up already), and today's MM is clearly the
best ever Linux MM - but still we have to solve these structural
problems if we want to advance to the next level of quality.

The solution? I think it's not that hard: we should lower the acceptance
barrier of instrumentation patches massively. (maybe even merge them
outside the normal merge window, like we merge cleanups) Then we should
only allow high-rate changes in risky kernel subsystems that improve
their own instrumentation and tools sufficiently for ordinary users to
be able to tell whether the changes are an improvement or not. Every
time there's a major regression that was hard to debug via the existing
instrumentation, mandate the extension of instrumentation to cover that
case too.

This all couples the desire of developers to add new code with the
desire of testers to provide feedback and with the desire of actual
users to have a proven good system.

Ingo

2007-09-28 14:20:37

by Nick Piggin

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Thursday 27 September 2007 11:50, Fengguang Wu wrote:
> We don't want to introduce pointless delays in throttle_vm_writeout()
> when the writeback limits are not yet exceeded, do we?

I don't think so (i.e. I agree with you).
IIRC, Marcelo initially did throttle_vm_writeout()?

> [...]

2007-09-28 15:31:22

by Nick Piggin

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Friday 28 September 2007 18:02, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
> > This is a pretty major bugfix.
> >
> > GFP_NOIO and GFP_NOFS callers should have been spending really large
> > amounts of time stuck in that sleep.
> >
> > I wonder why nobody noticed this happening. Either a) it turns out
> > that kswapd is doing a good job and such callers don't do direct
> > reclaim much or b) nobody is doing any in-depth kernel
> > instrumentation.
>
> [ Oh, it's Friday already, so soapbox time i guess. The easily offended
> please skip this mail ;-) ]

I'll join you on your soapbox ;)


> People _have_ noticed, and we often ignored them. I can see four

This is a problem with the testers. The barrier for *reporting* problems
is really low. And if they're regressions, then there is almost a 100%
chance of their being solved. If we're simply missing the bug reports, I'll
again ask for a linux-bugs list which is clear of other crap.

Or are people scared of reporting regressions from previous experience?
I haven't seen any reason for this and linux-mm is probably one of the
most polite lists I have read (when it comes to interacting with
non-developers, anyway ;)).


> fundamental, structural problems:
>
> 1) A certain lack of competitive pressure. An MM is too complex and
> there is no "better Linux MM" to compare against objectively. The
> BSDs are way too different and it's easy to dismiss even objective
> comparisons due to the real complexity of the differences. Heck,
> 2.6.9 is "way too different" and we routinely reject bugreports from
> such old kernels and lose vital feedback.

I don't see this as a big problem. The "better Linux MM" is implemented
by the patch one is proposing -- whether it be a one-liner bugfix, or a
complete rewrite. I think it _would_ be really nice to run all interesting
workloads on various Linux 2.6 and 2.4 kernels, various patchsets, and
other open source OSes for the regression testing and cross-pollination
aspects... however this brings us to the main problem:

Devising useful tests and metrics, and running them.


> 2) There is a wide-spread mentality of "you prove that there is a
> problem" in the MM and elsewhere in the Linux kernel too. While of

The alternative is completely unscientific chaos. OK, for performance
heuristics, it is actually a much grayer "you show that your patch
improves something / doesn't harm others" -- this is pretty tough for
mm patches at the moment, and maybe it could be better, but at the end
of the day, if something is worth merging then we should be able to
actually prove (or have a pretty high confidence) that it is good. If not,
then we don't want to merge it, by definition.

In the case of the vm, this comes right back to the difficulty of getting a
range of meaningful tests. We can definitely improve this situation a
great deal, but like the scheduler, the *real* "tests" simply have to be
done by users. And unfortunately, a lot of them don't actually test VM
changes for a long time after they're merged. This would actually be one
area where it might make a lot of sense to coordinate more significant
VM changes with distro release cycles (maybe?)


> course objective proof is paramount, we often "hide" behind our
> self-created complexity of the system (without malice and without
> realising it!). We've seen that happen in the updatedb discussions
> and the swap-prefetch discussions. The correct approach would be for
> the MM folks to be able to tell for just about any workload "this is
> not our problem", and to have the benefit of the doubt _on the
> tester's side_. We must not ignore people who tell us that "there is
> something wrong going on here", just because they are unable to
> analyze it themselves. Very often where we end up saying "we dont
> know what's going on here" it's likely _our_ fault. We also must not
> hide behind "please do these 10 easy steps and 2 kernel recompiles
> and 10 reboots, only takes half a day, and come back to us once you
> have the detailed debug data" requests. Instrumentation must be _on
> by default_ (like SCHED_DEBUG is on by default), which brings us to:

The recent swap prefetch debate showed exactly how the process _should_
work. We discovered a serious issue with metadata caching behaviour, and
also the active-inactive list balancing change that makes use-once much
less effective.

It was discovered by the metrics that are already there and compiled in,
and basically required a cat /proc/vmstat ; run-workload ; cat /proc/vmstat
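
A minimal sketch of that before/after comparison as a small C tool (the
workload is whatever command you pass as argv[1]; the names here are made
up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_KEYS 512

struct snap_entry {
	char name[64];
	unsigned long long val;
};

/* read all "name value" pairs from /proc/vmstat */
static int snap(struct snap_entry *s, int max)
{
	FILE *f = fopen("/proc/vmstat", "r");
	int n = 0;

	if (!f)
		return -1;
	while (n < max && fscanf(f, "%63s %llu", s[n].name, &s[n].val) == 2)
		n++;
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	static struct snap_entry before[MAX_KEYS], after[MAX_KEYS];
	int nb, na, i, j;

	nb = snap(before, MAX_KEYS);
	if (nb < 0)
		return 1;
	if (argc > 1)
		system(argv[1]);	/* run the workload */
	na = snap(after, MAX_KEYS);
	for (i = 0; i < na; i++)
		for (j = 0; j < nb; j++)
			if (!strcmp(after[i].name, before[j].name) &&
			    after[i].val != before[j].val)
				/* nr_* entries are gauges and can go down */
				printf("%-28s %+lld\n", after[i].name,
				       (long long)(after[i].val - before[j].val));
	return 0;
}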

[ You understand why I wanted to explore any possible underlying problems
independent of swap prefetch, right? If you'd just happened to be running on
a laptop, or your pagecache / buffercache / slab cache just happened to be
full of useless crap already (as in: something that updatedb might cause), or
your morning workload happened to fault in a lot of mmap()ed data, then
swap prefetch suddenly doesn't help you. ]



> 3) Instrumentation and tools. Instrumentation (for example MM delay
> statistics - like the scheduler delay statistics) give an objective
> measure to compare kernels against each other. _Smart_ and _easy to
> use_ and _default enabled_ instrumentation is a must. Not "turn on
> these 3 zillion kernel options" which no distro enables. Debug
> tools/scripts that use the instrumentation, that just have to be run
> and produce meaningful output based on which 90% of the workloads can
> be analyzed _without having to ask the user to do more_. (See
> PowerTop as an example, the right kind of instrumentation can do
> wonders that enables users to help us. We worked hard to lower the
> cost of /proc/timer_stats so that distros can enable it by default -
> and now they do enable it by default.)

There are a lot of useful stats compiled in already. There are some
significant useful metrics which are not there, but could help. I don't know
if it is anything fundamental that we're doing wrong, though.

As you say, the VM state machine is a lot more complex than the scheduler's.
I don't think it is reasonable to expect to be able to solve all problems
by looking at exported stats. But so far in 2.6 development, they have often
been enough to narrow things down pretty well. And where they haven't been,
I (and others) have added useful metrics where possible (witness the
evolution of oom killer output!)

But your soapbox is helpful to emphasise that we need to be thorough
with adding and using instrumentation when debugging a problem or
introducing some new behaviour / changing old behaviour.

2007-09-28 16:24:01

by Mathieu Desnoyers

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

* Ingo Molnar ([email protected]) wrote:
>
> People _have_ noticed, and we often ignored them. I can see four
> fundamental, structural problems:
>
> [...]
>
> The solution? I think it's not that hard: we should lower the acceptance
> barrier of instrumentation patches massively. (maybe even merge them
> outside the normal merge window, like we merge cleanups) Then we should
> only allow high-rate changes in risky kernel subsystems that improve
> their own instrumentation and tools sufficiently for ordinary users to
> be able to tell whether the changes are an improvement or not. Every
> time there's a major regression that was hard to debug via the existing
> instrumentation, mandate the extension of instrumentation to cover that
> case too.
>
> This all couples the desire of developers to add new code with the
> desire of testers to provide feedback and with the desire of actual
> users to have a proven good system.
>

I totally agree with Ingo here. Having basic instrumentation that is
enabled by default will help identify code paths causing unexpected
delays in the kernel. It will not only identify kernel bugs, but also
unexpected behaviors that would be qualified as "quiet bugs" (e.g. long
delays).

The key aspect that seems to be inherent to this proposal is the need
for an extensible instrumentation mechanism that would allow developers
to add new instrumentation when it is needed (such as the Linux Kernel
Markers on which I have been working for the last year). It will enable
them, and testers, to test and benchmark kernel subsystems to detect
regressions as well as erratic behaviors.
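
For illustration, an instrumentation point with the markers interface looks
roughly like this - the marker name and format string below are
hypothetical, and the exact API varied across revisions of the patchset:

#include <linux/marker.h>

void throttle_vm_writeout(gfp_t gfp_mask)
{
	...
	/* hypothetical marker: fires each time we decide to wait */
	trace_mark(mm_throttle_writeout_wait, "gfp_mask %x",
		   (unsigned int)gfp_mask);
	congestion_wait(WRITE, HZ/10);
	...
}

A probe registered on "mm_throttle_writeout_wait" then receives the
formatted arguments on every hit, with near-zero overhead while no probe is
attached.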

When a problem such as this VM delay is found, we should really be able
to ask ourselves: "what instrumentation (and tests) can we add to
detect this class of problem automatically in the future?". Otherwise,
we will always be back to the starting point after future changes. When
I talk about tests, I mostly refer to tests performed by an
out-of-kernel tool analyzing exported traces, since the proliferation
of various live tests for each and every subsystem in the default
kernel configuration is not something one would sanely consider.

Besides automatic testing, extracting traces taken from an instrumented
kernel would help users provide better feedback to kernel developers by
telling us what led to a particular problematic situation. Since the
burden of proof that a particular subsystem is acting in funny ways is
left to the users/testers, we cannot expect useful feedback from them if
we don't provide a mechanism that will let them _prove_ that something is
really going wrong.

Mathieu


> Ingo

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-28 16:41:56

by Nick Piggin

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Saturday 29 September 2007 02:23, Mathieu Desnoyers wrote:
> * Ingo Molnar ([email protected]) wrote:
> > [...]
>
> I totally agree with Ingo here. Having basic instrumentation that is
> enabled by default will help identify code paths causing unexpected
> delays in the kernel. It will not only identify kernel bugs, but also
> unexpected behaviors that would be qualified as "quiet bugs" (e.g. long
> delays).

It is. See: CONFIG_VM_EVENT_COUNTERS and all the other vm specific
crap littered in /proc/ (buddyinfo, zoneinfo, meminfo, etc).

There is always an issue of sometimes not instrumenting enough basic
things... but we have fundamentally always tried to improve this.

vm is one of the most instrumented subsystems in the kernel. By default.


> The key aspect that seems to be inherent to this proposal is the need
> for an extensible instrumentation mechanism that would allow developers
> to add new instrumentation when it is needed (such as the Linux Kernel
> Markers on which I have been working for the last year). It will enable
> them, and testers, to test and benchmark kernel subsystems to detect
> regressions as well as erratic behaviors.

We have several for the VM.


> When a problem such as this VM delay is found, we should really be able
> to ask ourselves : "what instrumentation (and tests) can we add to
> detect this class of problem automatically in the future ?".

We can and do (Andrew did, in this case).

2007-09-28 17:58:25

by Mathieu Desnoyers

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

* Nick Piggin ([email protected]) wrote:
> On Saturday 29 September 2007 02:23, Mathieu Desnoyers wrote:
> > * Ingo Molnar ([email protected]) wrote:
> > > [...]
> >
> > I totally agree with Ingo here. Having basic instrumentation that is
> > enabled by default will help identify code paths causing unexpected
> > delays in the kernel. It will not only identify kernel bugs, but also
> > unexpected behaviors that would be qualified as "quiet bugs" (e.g. long
> > delays).
>
> It is. See: CONFIG_VM_EVENT_COUNTERS and all the other vm specific
> crap littered in /proc/ (buddyinfo, zoneinfo, meminfo, etc).
>
> There is always an issue of sometimes not instrumenting enough basic
> things... but we have fundamentally always tried to improve this.
>
> vm is one of the most instrumented subsystems in the kernel. By default.
>
>
> > The key aspect that seems to be inherent to this proposal is the need
> > for an extensible instrumentation mechanism that would allow developers
> > to add new instrumentation when it is needed (such as the Linux Kernel
> > Markers on which I have been working for the last year). It will enable
> > them, and testers, to test and benchmark kernel subsystems to detect
> > regressions as well as erratic behaviors.
>
> We have several for the VM.
>

Doesn't the actual instrumentation present in the VM subsystem consist
mostly of event counters? This kind of profiling provides limited help in
following specific delays in the kernel. Martin Bligh's paper "Linux
Kernel Debugging on Google-sized clusters", presented at OLS2007,
discusses instrumentation that had to be added to the VM subsystem in
order to find a race condition in the OOM killer by gathering a trace of
the problematic behavior. Gathering a full trace (timestamps and events)
seems to be better suited to this kind of timing-related problem.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-09-28 19:51:20

by Frank Ch. Eigler

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()


Mathieu Desnoyers <[email protected]> writes:

>> [...]
>> > I totally agree with Ingo here. Having basic instrumentation that is
>> > enabled by default will help identify code paths causing unexpected
>> > delays in the kernel. [...]
>
>> It is. See: CONFIG_VM_EVENT_COUNTERS and all the other vm specific
>> crap littered in /proc/ (buddyinfo, zoneinfo, meminfo, etc). [...]

> Doesn't the actual instrumentation present in the VM subsystem
> consist mostly of event counters? This kind of profiling
> provides limited help in following specific delays in the
> kernel. [...]

These existing VM events/counters could be a good target for a second
worked-out example (after blktrace) of reimplementation using generic
markers.

- FChE

2007-09-29 12:56:50

by Nick Piggin

Subject: Re: [PATCH] writeback: remove unnecessary wait in throttle_vm_writeout()

On Saturday 29 September 2007 03:57, Mathieu Desnoyers wrote:

> Doesn't the actual instrumentation present in the VM subsystem consist
> mostly of event counters? This kind of profiling provides limited help in
> following specific delays in the kernel. Martin Bligh's paper "Linux
> Kernel Debugging on Google-sized clusters", presented at OLS2007,
> discusses instrumentation that had to be added to the VM subsystem in
> order to find a race condition in the OOM killer by gathering a trace of
> the problematic behavior. Gathering a full trace (timestamps and events)
> seems to be better suited to this kind of timing-related problem.

AFAIK, nobody is nacking patches that add more generally useful
instrumentation to the vm.

You'll obviously always get problems that you need specialised
information to solve. Those kinds of corner cases are probably a
good fit for dynamic instrumentation, I guess.
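
As an aside, a minimal sketch of what such dynamic instrumentation could
look like as a kprobes module, counting entries into throttle_vm_writeout().
This is illustrative only, and the counter is deliberately racy to keep it
short:

#include <linux/module.h>
#include <linux/kprobes.h>

static unsigned long hits;	/* racy, but fine for a rough count */

static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
	hits++;
	return 0;
}

static struct kprobe kp = {
	.symbol_name	= "throttle_vm_writeout",
	.pre_handler	= handler_pre,
};

static int __init probe_init(void)
{
	return register_kprobe(&kp);
}

static void __exit probe_exit(void)
{
	unregister_kprobe(&kp);
	printk(KERN_INFO "throttle_vm_writeout hit %lu times\n", hits);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");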