2007-05-03 15:54:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans


* Andrew Morton <[email protected]> wrote:

> - If replying, please be sure to cc the appropriate individuals.
> Please also consider rewriting the Subject: to something
> appropriate.

i'm wondering about swap-prefetch:

mm-implement-swap-prefetching.patch
swap-prefetch-avoid-repeating-entry.patch
add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch

The swap-prefetch feature is relatively compact:

10 files changed, 745 insertions(+), 1 deletion(-)

it is contained mostly to itself:

mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++

i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
i've seen so far was positive. Time to have this upstream and time for a
desktop-oriented distro to pick it up.

I think this has been held back way too long. It's .config selectable
and it is as ready for integration as it ever is going to be. So it's a
win/win scenario.

Acked-by: Ingo Molnar <[email protected]>

Ingo


2007-05-03 16:15:12

by Michal Piotrowski

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Hi,

On 03/05/07, Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > - If replying, please be sure to cc the appropriate individuals.
> > Please also consider rewriting the Subject: to something
> > appropriate.
>
> i'm wondering about swap-prefetch:
>
> mm-implement-swap-prefetching.patch
> swap-prefetch-avoid-repeating-entry.patch
> add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
>
> The swap-prefetch feature is relatively compact:
>
> 10 files changed, 745 insertions(+), 1 deletion(-)
>
> it is contained mostly to itself:
>
> mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++
>
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.

I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs)
http://www.stardust.webpages.pl/files/tbf/euridica/2.6.17-mm1/mm-config
and I don't recall _any_ problems. It's very stable for me.

>
> Acked-by: Ingo Molnar <[email protected]>
>
> Ingo

Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

2007-05-03 16:23:11

by Michal Piotrowski

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On 03/05/07, Michal Piotrowski <[email protected]> wrote:
> Hi,
>
> On 03/05/07, Ingo Molnar <[email protected]> wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> >
> > > - If replying, please be sure to cc the appropriate individuals.
> > > Please also consider rewriting the Subject: to something
> > > appropriate.
> >
> > i'm wondering about swap-prefetch:
> >
> > mm-implement-swap-prefetching.patch
> > swap-prefetch-avoid-repeating-entry.patch
> > add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
> >
> > The swap-prefetch feature is relatively compact:
> >
> > 10 files changed, 745 insertions(+), 1 deletion(-)
> >
> > it is contained mostly to itself:
> >
> > mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++
> >
> > i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> > clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> > i've seen so far was positive. Time to have this upstream and time for a
> > desktop-oriented distro to pick it up.
> >
> > I think this has been held back way too long. It's .config selectable
> > and it is as ready for integration as it ever is going to be. So it's a
> > win/win scenario.
>
> I'm using SWAP_PREFETCH since 2.6.17-mm1 (I don't have earlier configs)

since 2.6.14-mm2 :)
http://lkml.org/lkml/2005/11/11/260

Regards,
Michal

--
Michal K. K. Piotrowski
Kernel Monkeys
(http://kernel.wikidot.com/start)

2007-05-03 22:15:45

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Friday 04 May 2007 01:54, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
> > - If replying, please be sure to cc the appropriate individuals.
> > Please also consider rewriting the Subject: to something
> > appropriate.

> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.
>
> Acked-by: Ingo Molnar <[email protected]>

Thank you very much for code review, ack and support!

--
-ck

2007-05-04 07:34:22

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
>
>>- If replying, please be sure to cc the appropriate individuals.
>> Please also consider rewriting the Subject: to something
>> appropriate.
>
>
> i'm wondering about swap-prefetch:

Well I had some issues with it that I don't think were fully discussed,
and Andrew prompted me to say something, but it went off list for a
couple of posts (my fault, sorry). Posting it below with Andrew's
permission...


> mm-implement-swap-prefetching.patch
> swap-prefetch-avoid-repeating-entry.patch
> add-__gfp_movable-for-callers-to-flag-allocations-from-high-memory-that-may-be-migrated-swap-prefetch.patch
>
> The swap-prefetch feature is relatively compact:
>
> 10 files changed, 745 insertions(+), 1 deletion(-)
>
> it is contained mostly to itself:
>
> mm/swap_prefetch.c | 581 ++++++++++++++++++++++++++++++++
>
> i've reviewed it once again and in the !CONFIG_SWAP_PREFETCH case it's a
> clear NOP, while in the CONFIG_SWAP_PREFETCH=y case all the feedback
> i've seen so far was positive. Time to have this upstream and time for a
> desktop-oriented distro to pick it up.
>
> I think this has been held back way too long. It's .config selectable
> and it is as ready for integration as it ever is going to be. So it's a
> win/win scenario.

Being able to config all these core heuristics changes is really not that
much of a positive. The fact that we might _need_ to config something out,
and double the configuration range isn't too pleasing.

Here were some of my concerns, and where our discussion got up to.

Andrew Morton wrote:
> On Fri, 04 May 2007 14:34:45 +1000 Nick Piggin <[email protected]> wrote:
>
>
>>Andrew Morton wrote:
>>
>>>istr you had issues with swap-prefetch?
>>>
>>>If so, now's a good time to reiterate them ;)
>>
>>1) It is said to help with the updatedb overnight problem, however it
>> actually _doesn't_ prefetch swap when there are low free pages, which
>> is how updatedb will leave the system. So this really puzzles me how
>> it would work. However if updatedb is causing excessive swapout, I
>> think we should try improving use-once algorithms first, for example.
>
>
> Yes. Perhaps it just doesn't help with the updatedb thing. Or maybe with
> normal system activity we get enough free pages to kick the thing off and
> running. Perhaps updatedb itself has a lot of rss, for example.

Could be, but I don't know. I'd think it unlikely to allow _much_ swapin,
if huge amounts of the desktop have been swapped out. But maybe... as I
said, nobody seems to have a recipe for these things.


> Would be useful to see this claim substantiated with a real testcase,
> description of results and an explanation of how and why it worked.

Yes... and then try to first improve regular page reclaim and use-once
handling.


>>2) It is a _highly_ speculative operation, and in workloads where periods
>> of low and high page usage with genuinely unused anonymous / tmpfs
>> pages, it could waste power, memory bandwidth, bus bandwidth, disk
>> bandwidth...
>
>
> Yes. I suspect that's a matter of waiting for the corner-case reporters to
> complain, then add more heuristics.

Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch is
happy to do a _lot_ of work for these things which we have already decided
are least likely to be used again.


>>3) I haven't seen a single set of numbers out of it. Feedback seems to
>> have mostly come from people who
>
>
> Yup. But can we come up with a testcase? It's hard.

I guess it is hard firstly because swapping is quite random to start with.
But I haven't even seen basic things like "make -jhuge swapstorm has no
regressions".


>>4) If this is helpful, wouldn't it be equally important for things like
>> mapped file pages? Seems like half a solution.
>
>
> True.
>
> Without thinking about it, I almost wonder if one could do a userspace
> implementation with something which pokes around in /proc/pid/pagemap and
> /proc/pid/kpagemap, perhaps with some additional interfaces added to
> do a swapcache read. (Give userspace the ability to get at swapcache
> via a regular fd?)
>
> (otoh the akpm usersapce implementation is swapoff -a;swapon -a)

Perhaps. You may need a few indicators to see whether the system is idle...
but OTOH, we've already got a lot of indicators for memory, disk usage,
etc. So, maybe :)


>>5) New one: it is possibly going to interact badly with MADV_FREE lazy
>> freeing. The more complex we make page reclaim, the worse IMO.
>
>
> That's a bit vague. What sort of problems do you envisage?

Well MADV_FREE pages aren't technically free, are they? So it might be
possible for a significant number of them to build up and prevent
swap prefetch from working. Maybe.


>>...) I had a few issues with implementation, like interaction with
>> cpusets. Don't know if these are all fixed or not. I sort of gave
>> up looking at it.
>
>
> Ah yes, I remember some mention of cpusets. I forget what it was though.

I could be wrong, but IIRC there is no good way to know which cpuset to
bring the page back into, (and I guess similarly it would be hard to know
what container to account it to, if doing account-on-allocate).

We could hope that users of these features would be mostly disjoint sets,
but that's an evil road to start heading down, where we have various core
bits of mm that don't play nice together.

--
SUSE Labs, Novell Inc.

2007-05-04 08:52:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans


* Nick Piggin <[email protected]> wrote:

> > i'm wondering about swap-prefetch:

> Being able to config all these core heuristics changes is really not
> that much of a positive. The fact that we might _need_ to config
> something out, and double the configuration range isn't too pleasing.

Well, to the desktop user this is a speculative performance feature that
he is willing to potentially waste CPU and IO capacity, in expectation
of better performance.

On the conceptual level it is _precisely the same thing as regular file
readahead_. (with the difference that to me swapahead seems to be quite
a bit more intelligent than our current file readahead logic.)

This feature has no API or ABI impact at all, it's a pure performance
feature. (besides the trivial sysctl to turn it runtime on/off).

> Here were some of my concerns, and where our discussion got up to.

> > Yes. Perhaps it just doesn't help with the updatedb thing. Or
> > maybe with normal system activity we get enough free pages to kick
> > the thing off and running. Perhaps updatedb itself has a lot of
> > rss, for example.
>
> Could be, but I don't know. I'd think it unlikely to allow _much_
> swapin, if huge amounts of the desktop have been swapped out. But
> maybe... as I said, nobody seems to have a recipe for these things.

can i take this one as a "no fundamental objection"? There are really
only 2 maintainance options left:

1) either you can do it better or at least have a _very_ clearly
described idea outlined about how to do it differently

2) or you should let others try it

#1 you've not done for 2-3 years since swap-prefetch was waiting for
integration so it's not an option at this stage anymore. Then you are
pretty much obliged to do #2. ;-)

And let me be really blunt about this, there is no option #3 to say: "I
have no real better idea, I have no code, I have no time, but hey, lets
not merge this because it 'could in theory' be possible to do it better"
=B-)

really, we are likely be better off by risking the merge of _bad_ code
(which in the swap-prefetch case is the exact opposite of the truth),
than to let code stagnate. People are clearly unhappy about certain
desktop aspects of swapping, and the only way out of that is to let more
people hack that code. Merging code involves more people. It will cause
'noise' and could cause regressions, but at least in this case the only
impact is 'performance' and the feature is trivial to disable.

The maintainance drag outside of swap_prefetch.c is essentially _zero_.
If the feature doesnt work it ends up on Con's desk. If it turns out to
not work at all (despite years of testing and happy users) it still only
ends up on Con's desk. A clear win/win scenario for you i think :-)

> > Would be useful to see this claim substantiated with a real
> > testcase, description of results and an explanation of how and why
> > it worked.
>
> Yes... and then try to first improve regular page reclaim and use-once
> handling.

agreed. Con, IIRC you wrote a testcase for this, right? Could you please
send us the results of that testing?

> >>2) It is a _highly_ speculative operation, and in workloads where periods
> >> of low and high page usage with genuinely unused anonymous / tmpfs
> >> pages, it could waste power, memory bandwidth, bus bandwidth, disk
> >> bandwidth...
> >
> > Yes. I suspect that's a matter of waiting for the corner-case
> > reporters to complain, then add more heuristics.
>
> Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch
> is happy to do a _lot_ of work for these things which we have already
> decided are least likely to be used again.

i see no real problem here. We've had heuristics for a _long_ time in
various areas of the code. Sometimes they work, sometimes they suck.

the flow of this is really easy: distro looking for a feature edge turns
it on and announces it, if the feature does not work out for users then
user turns it off and complains to distro, if enough users complain then
distro turns it off for next release, upstream forgets about this
performance feature and eventually removes it once someone notices that
it wouldnt even compile in the past 2 main releases. I see no problem
here, we did that in the past too with performance features. The
networking stack has literally dozens of such small tunable things which
get experimented with, and whose defaults do get tuned carefully. Some
of the knobs help bandwidth, some help latency.

I do not even see any risk of "splitup of mindshare" - swap-prefetch is
so clearly speculative that it's not really a different view about how
to do swapping (which would split the tester base, etc.), it's simply a
"do you want your system to speculate about the future or not" add-on
decision. Every system has a pretty clear idea about that: desktops
generally want to do it, clusters generally dont want to do it.

> >>3) I haven't seen a single set of numbers out of it. Feedback seems to
> >> have mostly come from people who
> >
> > Yup. But can we come up with a testcase? It's hard.

i think Con has a testcase.

> >>4) If this is helpful, wouldn't it be equally important for things like
> >> mapped file pages? Seems like half a solution.
[...]
> > (otoh the akpm usersapce implementation is swapoff -a;swapon -a)
>
> Perhaps. You may need a few indicators to see whether the system is
> idle... but OTOH, we've already got a lot of indicators for memory,
> disk usage, etc. So, maybe :)

The time has passed for this. Let others play too. Please :-)

> I could be wrong, but IIRC there is no good way to know which cpuset
> to bring the page back into, (and I guess similarly it would be hard
> to know what container to account it to, if doing
> account-on-allocate).

(i think cpusets are totally uninteresting in this context: nobody in
their right mind is going to use swap-prefetch on a big NUMA box. Nor
can i see any fundamental impediment to making this more cpuset-aware,
just like other subsystems were made cpuset-aware, once the requests
from actual users came in and people started getting interested in it.)

I think the "lack of testcase and numbers" is the only valid technical
objection i've seen so far. Con might be able to help us with that?

Ingo

2007-05-04 09:09:27

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:

>>Here were some of my concerns, and where our discussion got up to.
>
>
>>>Yes. Perhaps it just doesn't help with the updatedb thing. Or
>>>maybe with normal system activity we get enough free pages to kick
>>>the thing off and running. Perhaps updatedb itself has a lot of
>>>rss, for example.
>>
>>Could be, but I don't know. I'd think it unlikely to allow _much_
>>swapin, if huge amounts of the desktop have been swapped out. But
>>maybe... as I said, nobody seems to have a recipe for these things.
>
>
> can i take this one as a "no fundamental objection"? There are really
> only 2 maintainance options left:
>
> 1) either you can do it better or at least have a _very_ clearly
> described idea outlined about how to do it differently
>
> 2) or you should let others try it
>
> #1 you've not done for 2-3 years since swap-prefetch was waiting for
> integration so it's not an option at this stage anymore. Then you are
> pretty much obliged to do #2. ;-)

The burden is not on me to get someone else's feature merged. If it
can be shown to work well and people's concerns addressed, then anything
will get merged. The reason Linux is so good is because of what we don't
merge, figuratively speaking.

I wanted to see some basic regression tests to show that it hasn't caused
obvious problems, and some basic scenarios where it helps, so that we can
analyse them. It is really simple, but I haven't got any since first
asking.

And note that I don't think I ever explicitly "nacked" anything, just
voiced my concerns. If my concerns had been addressed, then I couldn't
have stopped anybody from merging anything.


>>>>2) It is a _highly_ speculative operation, and in workloads where periods
>>>> of low and high page usage with genuinely unused anonymous / tmpfs
>>>> pages, it could waste power, memory bandwidth, bus bandwidth, disk
>>>> bandwidth...
>>>
>>>Yes. I suspect that's a matter of waiting for the corner-case
>>>reporters to complain, then add more heuristics.
>>
>>Ugh. Well it is a pretty fundamental problem. Basically swap-prefetch
>>is happy to do a _lot_ of work for these things which we have already
>>decided are least likely to be used again.
>
>
> i see no real problem here. We've had heuristics for a _long_ time in
> various areas of the code. Sometimes they work, sometimes they suck.

So that's one of my issues with the code. If all you have to support a
merge is anecodal evidence, then I find it interesting that you would
easily discount something like this.


>>>>4) If this is helpful, wouldn't it be equally important for things like
>>>> mapped file pages? Seems like half a solution.
>
> [...]
>
>>>(otoh the akpm usersapce implementation is swapoff -a;swapon -a)
>>
>>Perhaps. You may need a few indicators to see whether the system is
>>idle... but OTOH, we've already got a lot of indicators for memory,
>>disk usage, etc. So, maybe :)
>
>
> The time has passed for this. Let others play too. Please :-)

Play with what? Prefetching mmaped file pages as well? Sure.


>>I could be wrong, but IIRC there is no good way to know which cpuset
>>to bring the page back into, (and I guess similarly it would be hard
>>to know what container to account it to, if doing
>>account-on-allocate).
>
>
> (i think cpusets are totally uninteresting in this context: nobody in
> their right mind is going to use swap-prefetch on a big NUMA box. Nor
> can i see any fundamental impediment to making this more cpuset-aware,
> just like other subsystems were made cpuset-aware, once the requests
> from actual users came in and people started getting interested in it.)

OK, so make it more cpuset aware. This isn't a new issue, I raised it
a long time ago. And trust me, it is a nightmare to just assume that
nobody will use cpusets on a small box for example (AFAIK the resource
control guys are looking at doing just that).

All core VM features should play nicely with each other without *really*
good reason.


> I think the "lack of testcase and numbers" is the only valid technical
> objection i've seen so far.

Well you're entitled to your opinion too.

--
SUSE Labs, Novell Inc.

2007-05-04 12:10:59

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> agreed. Con, IIRC you wrote a testcase for this, right? Could you please
> send us the results of that testing?

Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
disabled and then enabled swap prefetch saves ~5 seconds on average hardware
on this one test case. I had many users try this and the results were between
2 and 10 seconds, but always showed a saving on this testcase. This effect
easily occurs on printing a big picture, editing a large file, compressing an
iso image or whatever in real world workloads. Smaller, but much more
frequent effects of this over the course of a day obviously also occur and do
add up.

--
-ck


Attachments:
(No filename) (717.00 B)
swap_prefetch_tester.c (2.02 kB)
Download all attachments

2007-05-05 08:42:54

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Friday 04 May 2007 22:10, Con Kolivas wrote:
> On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > agreed. Con, IIRC you wrote a testcase for this, right? Could you please
> > send us the results of that testing?
>
> Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
> disabled and then enabled swap prefetch saves ~5 seconds on average
> hardware on this one test case. I had many users try this and the results
> were between 2 and 10 seconds, but always showed a saving on this testcase.
> This effect easily occurs on printing a big picture, editing a large file,
> compressing an iso image or whatever in real world workloads. Smaller, but
> much more frequent effects of this over the course of a day obviously also
> occur and do add up.

Here's a better swap prefetch tester. Instructions in file.

Machine with 2GB ram and 2GB swapfile

Prefetch disabled:
./sp_tester
Ram 2060352000 Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 809 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 53397 milliseconds

Enabled:
./sp_tester
Ram 2060352000 Swap 1973420000
Total ram to be malloced: 3047062000 bytes
Starting first malloc of 1523531000 bytes
Starting 1st read of first malloc
Touching this much ram takes 676 milliseconds
Starting second malloc of 1523531000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 26351 milliseconds

Note huge time difference.

--
-ck


Attachments:
(No filename) (1.74 kB)
sp_tester.c (2.82 kB)
Download all attachments

2007-05-06 10:13:35

by Antonino Ingargiola

[permalink] [raw]
Subject: Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans

2007/5/5, Con Kolivas <[email protected]>:
[cut]
> Here's a better swap prefetch tester. Instructions in file.

The system should be leaved totally inactive for the test duration (10m) right?


Regards,

~ Antonio

2007-05-06 18:23:45

by Jory A. Pratt

[permalink] [raw]
Subject: Re: [ck] Re: swap-prefetch: 2.6.22 -mm merge plans

Con Kolivas wrote:
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester
> Ram 2060352000 Swap 1973420000
> Total ram to be malloced: 3047062000 bytes
> Starting first malloc of 1523531000 bytes
> Starting 1st read of first malloc
> Touching this much ram takes 809 milliseconds
> Starting second malloc of 1523531000 bytes
> Completed second malloc and free
> Sleeping for 600 seconds
> Important part - starting reread of first malloc
> Completed read of first malloc
> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester
> Ram 2060352000 Swap 1973420000
> Total ram to be malloced: 3047062000 bytes
> Starting first malloc of 1523531000 bytes
> Starting 1st read of first malloc
> Touching this much ram takes 676 milliseconds
> Starting second malloc of 1523531000 bytes
> Completed second malloc and free
> Sleeping for 600 seconds
> Important part - starting reread of first malloc
> Completed read of first malloc
> Timed portion 26351 milliseconds
>
echo 1 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

Ram 1153644000 Swap 1004052000
Total ram to be malloced: 1655670000 bytes
Starting first malloc of 827835000 bytes
Starting 1st read of first malloc
Touching this much ram takes 937 milliseconds
Starting second malloc of 827835000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 15011 milliseconds

echo 0 > /proc/sys/vm/overcommit_memory
swapoff -a
swapon -a
./sp_tester

Ram 1153644000 Swap 1004052000
Total ram to be malloced: 1655670000 bytes
Starting first malloc of 827835000 bytes
Starting 1st read of first malloc
Touching this much ram takes 1125 milliseconds
Starting second malloc of 827835000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 14611 milliseconds

2007-05-07 14:18:39

by Bill Davidsen

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>>> i'm wondering about swap-prefetch:
>
>> Being able to config all these core heuristics changes is really not
>> that much of a positive. The fact that we might _need_ to config
>> something out, and double the configuration range isn't too pleasing.
>
> Well, to the desktop user this is a speculative performance feature that
> he is willing to potentially waste CPU and IO capacity, in expectation
> of better performance.
>
> On the conceptual level it is _precisely the same thing as regular file
> readahead_. (with the difference that to me swapahead seems to be quite
> a bit more intelligent than our current file readahead logic.)
>
> This feature has no API or ABI impact at all, it's a pure performance
> feature. (besides the trivial sysctl to turn it runtime on/off).
>
>> Here were some of my concerns, and where our discussion got up to.

[...snip...]

> i see no real problem here. We've had heuristics for a _long_ time in
> various areas of the code. Sometimes they work, sometimes they suck.
>
> the flow of this is really easy: distro looking for a feature edge turns
> it on and announces it, if the feature does not work out for users then
> user turns it off and complains to distro, if enough users complain then
> distro turns it off for next release, upstream forgets about this
> performance feature and eventually removes it once someone notices that
> it wouldnt even compile in the past 2 main releases. I see no problem
> here, we did that in the past too with performance features. The
> networking stack has literally dozens of such small tunable things which
> get experimented with, and whose defaults do get tuned carefully. Some
> of the knobs help bandwidth, some help latency.
>
I haven't looked at this code since it first came out and didn't impress
me, but I think it would be good to get the current version in. However,
when you say "user turns it off" I hope you mean "in /proc/sys with a
switch or knob" and not by expecting people to recompile and install a
kernel. Then it might take a little memory but wouldn't do something
undesirable.

Note: I had no bad effect from the code, it just didn't feel faster. On
a low memory machine it might help. Of course I have wanted to have a
hard limit on memory used for i/o buffers, just to avoid swapping
programs to make room for i/o, so to some extent I feel as if this is a
fix for a problem we shouldn't have.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-05-07 14:29:06

by Bill Davidsen

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Con Kolivas wrote:
> On Friday 04 May 2007 18:52, Ingo Molnar wrote:
>> agreed. Con, IIRC you wrote a testcase for this, right? Could you please
>> send us the results of that testing?
>
> Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
> disabled and then enabled swap prefetch saves ~5 seconds on average hardware
> on this one test case. I had many users try this and the results were between
> 2 and 10 seconds, but always showed a saving on this testcase. This effect
> easily occurs on printing a big picture, editing a large file, compressing an
> iso image or whatever in real world workloads. Smaller, but much more
> frequent effects of this over the course of a day obviously also occur and do
> add up.
>
I'll try this when I get the scheduler stuff done, and also dig out the
"resp1" stuff for "back when." I see the most recent datasets were
comparing 2.5.43-mm2 responsiveness with 2.4.19-ck7, you know I always
test your stuff ;-)

Guess it might need a bit of polish for current hardware, I was testing
on *small* machines, deliberately.

--
Bill Davidsen <[email protected]>
"We have more to fear from the bungling of the incompetent than from
the machinations of the wicked." - from Slashdot

2007-05-09 23:32:14

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Saturday 05 May 2007 18:42, Con Kolivas wrote:
> On Friday 04 May 2007 22:10, Con Kolivas wrote:
> > On Friday 04 May 2007 18:52, Ingo Molnar wrote:
> > > agreed. Con, IIRC you wrote a testcase for this, right? Could you
> > > please send us the results of that testing?
> >
> > Yes, sorry it's a crappy test app but works on 32bit. Timed with prefetch
> > disabled and then enabled swap prefetch saves ~5 seconds on average
> > hardware on this one test case. I had many users try this and the results
> > were between 2 and 10 seconds, but always showed a saving on this
> > testcase. This effect easily occurs on printing a big picture, editing a
> > large file, compressing an iso image or whatever in real world workloads.
> > Smaller, but much more frequent effects of this over the course of a day
> > obviously also occur and do add up.
>
> Here's a better swap prefetch tester. Instructions in file.
>
> Machine with 2GB ram and 2GB swapfile
>
> Prefetch disabled:
> ./sp_tester

> Timed portion 53397 milliseconds
>
> Enabled:
> ./sp_tester

> Timed portion 26351 milliseconds
>
> Note huge time difference.

Well how about that? That was the difference with a swap _file_ as I said, but
I went ahead and checked with a swap partition as I used to have. I didn't
notice, but somewhere in the last few months, swap prefetch code itself being
unchanged for a year, seems to have been broken by other changes in the vm
and it doesn't even start up prefetching often and has stale swap entries in
its list. Once it breaks like that it does nothing from then on. So that
leaves me with a quandry now.


Do I:

1. Go ahead and find whatever breakage was introduced and fix it with
hopefully a trivial change

2. Do option 1. and then implement support for yet another kernel feature
(cpusets) that will be used perhaps never with swap prefetch [No Nick I don't
believe you that cpusets have anything to do with normal users on a desktop
ever; if it's used on a desktop it will only be by a kernel developer testing
the cpusets code].

or

3. Dump swap prefetch forever and ignore that it ever worked and was helpful
and was a lot of work to implement and so on.


Given that even if I do 1 and/or 2 it'll still be blocked from ever going to
mainline I think the choice is clear.

Nick since you're personally the gatekeeper for this code, would you like to
make a call? Just say 3 and put me out of my misery please.

--
-ck

P.S. Ingo, thanks (and sorry) for your involvement here.

2007-05-10 00:05:32

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Con Kolivas wrote:

> Well how about that? That was the difference with a swap _file_ as I said, but
> I went ahead and checked with a swap partition as I used to have. I didn't
> notice, but somewhere in the last few months, swap prefetch code itself being
> unchanged for a year, seems to have been broken by other changes in the vm
> and it doesn't even start up prefetching often and has stale swap entries in
> its list. Once it breaks like that it does nothing from then on. So that
> leaves me with a quandry now.
>
>
> Do I:
>
> 1. Go ahead and find whatever breakage was introduced and fix it with
> hopefully a trivial change
>
> 2. Do option 1. and then implement support for yet another kernel feature
> (cpusets) that will be used perhaps never with swap prefetch [No Nick I don't
> believe you that cpusets have anything to do with normal users on a desktop
> ever; if it's used on a desktop it will only be by a kernel developer testing
> the cpusets code].
>
> or
>
> 3. Dump swap prefetch forever and ignore that it ever worked and was helpful
> and was a lot of work to implement and so on.
>
>
> Given that even if I do 1 and/or 2 it'll still be blocked from ever going to
> mainline I think the choice is clear.
>
> Nick since you're personally the gatekeeper for this code, would you like to
> make a call? Just say 3 and put me out of my misery please.


I'm not the gatekeeper and it is completely up to you whether you want
to work on something or not... but I'm sure you understand where I was
coming from when I suggested it doesn't get merged yet.

You may not believe this, but I agree that swap prefetching (and
prefetching in general) has some potential to help desktop workloads :).
But it still should go through the normal process of being tested and
questioned and having a look at options for first improving existing
code in those problematic cases.

Once that process happens and it is shown to work nicely, etc., then I
would not be able to (or want to) keep it from getting merged.

As far as cpusets goes... if your code goes in last, then you have to
make it work with what is there, as a rule. People are using cpusets
for memory resource control, which would have uses on a desktop system.
It is just a really bad precedent to set, having different parts of the
VM not work correctly together. Even if you made them mutually
exclusive CONFIG_ options, that is still not a very nice solution.

--
SUSE Labs, Novell Inc.

2007-05-10 01:38:13

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Thursday 10 May 2007 10:05, Nick Piggin wrote:
> Con Kolivas wrote:
> > Well how about that? That was the difference with a swap _file_ as I
> > said, but I went ahead and checked with a swap partition as I used to
> > have. I didn't notice, but somewhere in the last few months, swap
> > prefetch code itself being unchanged for a year, seems to have been
> > broken by other changes in the vm and it doesn't even start up
> > prefetching often and has stale swap entries in its list. Once it breaks
> > like that it does nothing from then on. So that leaves me with a quandry
> > now.
> >
> >
> > Do I:
> >
> > 1. Go ahead and find whatever breakage was introduced and fix it with
> > hopefully a trivial change
> >
> > 2. Do option 1. and then implement support for yet another kernel feature
> > (cpusets) that will be used perhaps never with swap prefetch [No Nick I
> > don't believe you that cpusets have anything to do with normal users on a
> > desktop ever; if it's used on a desktop it will only be by a kernel
> > developer testing the cpusets code].
> >
> > or
> >
> > 3. Dump swap prefetch forever and ignore that it ever worked and was
> > helpful and was a lot of work to implement and so on.
> >
> >
> > Given that even if I do 1 and/or 2 it'll still be blocked from ever going
> > to mainline I think the choice is clear.
> >
> > Nick since you're personally the gatekeeper for this code, would you like
> > to make a call? Just say 3 and put me out of my misery please.
>
> I'm not the gatekeeper and it is completely up to you whether you want
> to work on something or not... but I'm sure you understand where I was
> coming from when I suggested it doesn't get merged yet.

No matter how you spin it, you're the gatekeeper.

> You may not believe this, but I agree that swap prefetching (and
> prefetching in general) has some potential to help desktop workloads :).
> But it still should go through the normal process of being tested and
> questioned and having a look at options for first improving existing
> code in those problematic cases.

Not this again? Proof was there ages ago that it helped and no proof that it
harmed could be found yet you cunningly pretend it never existed. It's been
done to death and I'm sick of this.

> Once that process happens and it is shown to work nicely, etc., then I
> would not be able to (or want to) keep it from getting merged.
>
> As far as cpusets goes... if your code goes in last, then you have to
> make it work with what is there, as a rule. People are using cpusets
> for memory resource control, which would have uses on a desktop system.
> It is just a really bad precedent to set, having different parts of the
> VM not work correctly together. Even if you made them mutually
> exclusive CONFIG_ options, that is still not a very nice solution.

That's as close to a 3 as I'm likely to get out of you.

Andrew you'll be relieved to know I would like you to throw swap prefetch and
related patches into the bin. Thanks.

--
-ck

2007-05-10 01:56:59

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Con Kolivas wrote:
> On Thursday 10 May 2007 10:05, Nick Piggin wrote:

>>I'm not the gatekeeper and it is completely up to you whether you want
>>to work on something or not... but I'm sure you understand where I was
>>coming from when I suggested it doesn't get merged yet.
>
>
> No matter how you spin it, you're the gatekeeper.

If raising unaddressed issues means closing a gate, then OK. You can
equally open it by answering them.


>>You may not believe this, but I agree that swap prefetching (and
>>prefetching in general) has some potential to help desktop workloads :).
>>But it still should go through the normal process of being tested and
>>questioned and having a look at options for first improving existing
>>code in those problematic cases.
>
>
> Not this again? Proof was there ages ago that it helped and no proof that it
> harmed could be found yet you cunningly pretend it never existed. It's been
> done to death and I'm sick of this.

I said I know it can help. Do you know how many patches I have that help
some workloads but are not merged? That's just the way it works.

What I have seen is it helps the case where you force out a huge amount
of swap. OK, that's nice -- the base case obviously works.

You said it helped with the updatedb problem. That says we should look at
why it is going bad first, and for example improve use-once algorithms.
After we do that, then swap prefetching might still help, which is fine.


>>Once that process happens and it is shown to work nicely, etc., then I
>>would not be able to (or want to) keep it from getting merged.
>>
>>As far as cpusets goes... if your code goes in last, then you have to
>>make it work with what is there, as a rule. People are using cpusets
>>for memory resource control, which would have uses on a desktop system.
>>It is just a really bad precedent to set, having different parts of the
>>VM not work correctly together. Even if you made them mutually
>>exclusive CONFIG_ options, that is still not a very nice solution.
>
>
> That's as close to a 3 as I'm likely to get out of you.

If you're not willing to try making it work with existing code, among other
things, then yes it will be difficult to get it merged. That's not going to
change.

--
SUSE Labs, Novell Inc.

2007-05-10 03:48:22

by Ray Lee

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On 5/9/07, Nick Piggin <[email protected]> wrote:
> You said it helped with the updatedb problem. That says we should look at
> why it is going bad first, and for example improve use-once algorithms.
> After we do that, then swap prefetching might still help, which is fine.

Nick, if you're volunteering to do that analysis, then great. If not,
then you're just providing a airy hope with nothing to back up when or
if that work would ever occur.

Further, if you or someone else *does* do that work, then guess what,
we still have the option to rip out the swap prefetching code after
the hypothetical use-once improvements have been proven and merged.
Which, by the way, I've watched people talk about since 2.4. That was,
y'know, a *while* ago.

So enough with the stop energy, okay? You're better than that.

Con? He is right about the last feature to go in needs to work
gracefully with what's there now. However, it's not unheard of for
authors of other sections of code to help out with incompatibilities
by answering politely phrased questions for guidance. Though the
intersection of users between cpusets and desktop systems seems small
indeed.

2007-05-10 03:57:17

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ray Lee wrote:
> On 5/9/07, Nick Piggin <[email protected]> wrote:
>
>> You said it helped with the updatedb problem. That says we should look at
>> why it is going bad first, and for example improve use-once algorithms.
>> After we do that, then swap prefetching might still help, which is fine.
>
>
> Nick, if you're volunteering to do that analysis, then great. If not,
> then you're just providing a airy hope with nothing to back up when or
> if that work would ever occur.

I'd like to try helping. Tell me your problem.


> Further, if you or someone else *does* do that work, then guess what,
> we still have the option to rip out the swap prefetching code after
> the hypothetical use-once improvements have been proven and merged.
> Which, by the way, I've watched people talk about since 2.4. That was,
> y'know, a *while* ago.

What's wrong with the use-once we have? What improvements are you talking
about?


> So enough with the stop energy, okay? You're better than that.

I don't think it is about energy or being mean, I'm just stating the
issues I have with it.

--
SUSE Labs, Novell Inc.

2007-05-10 04:02:18

by Con Kolivas

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On Thursday 10 May 2007 13:48, Ray Lee wrote:
> On 5/9/07, Nick Piggin <[email protected]> wrote:
> > You said it helped with the updatedb problem. That says we should look at
> > why it is going bad first, and for example improve use-once algorithms.
> > After we do that, then swap prefetching might still help, which is fine.
>
> Nick, if you're volunteering to do that analysis, then great. If not,
> then you're just providing a airy hope with nothing to back up when or
> if that work would ever occur.
>
> Further, if you or someone else *does* do that work, then guess what,
> we still have the option to rip out the swap prefetching code after
> the hypothetical use-once improvements have been proven and merged.
> Which, by the way, I've watched people talk about since 2.4. That was,
> y'know, a *while* ago.
>
> So enough with the stop energy, okay? You're better than that.
>
> Con? He is right about the last feature to go in needs to work
> gracefully with what's there now. However, it's not unheard of for
> authors of other sections of code to help out with incompatibilities
> by answering politely phrased questions for guidance. Though the
> intersection of users between cpusets and desktop systems seems small
> indeed.

Let's just set the record straight. I actually discussed cpusets over a year
ago in this nonsense and was told by sgi folk there was no need to get my
head around cpusets and honouring node placement should be enough which, by
the way, swap prefetch does. So I by no means ignored this; we just hit an
impasse on just how much more featured it should be for the sake of a goddamn
home desktop pc feature.

Anyway why the hell am I resurrecting this thread? The code is declared dead
already. Leave it be.

--
-ck

2007-05-10 05:52:16

by Ray Lee

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On 5/9/07, Nick Piggin <[email protected]> wrote:
> Ray Lee wrote:
> > On 5/9/07, Nick Piggin <[email protected]> wrote:
> >
> >> You said it helped with the updatedb problem. That says we should look at
> >> why it is going bad first, and for example improve use-once algorithms.
> >> After we do that, then swap prefetching might still help, which is fine.
> >
> > Nick, if you're volunteering to do that analysis, then great. If not,
> > then you're just providing a airy hope with nothing to back up when or
> > if that work would ever occur.
>
> I'd like to try helping. Tell me your problem.

Huh? You already stated one version of it above, namely updatedb. But
let's put this another way, shall we? A gedankenexperiment, if you
will.

Say we have a perfect swap-out algorithm that can choose exactly what
needs to be evicted to disk. ('Perfect', of course, is dependent upon
one's metric, but let's go with "maximizes overall system utilization
and minimizes IO wait time." Arbitrary, but hey.)

So, great, the right things got swapped out. Anything else that could
have been chosen would have caused more overall IO Wait. Yay us.

So what happens when those processes that triggered the swap-outs go
away? (Firefox is closed, I stop hitting my local copy of a database,
whatever.) Well, currently, nothing. What happens when I switch
workspaces and try to use my email program? Swap-ins.

Okay, so why didn't the system swap that stuff in preemptively? Why am
I sitting there waiting for something that it could have already done
in the background?

A new swap-out algorithm, be it use-once, Clock-Pro, or perfect
foreknowledge isn't going to change that issue. Swap prefetch does.

> > Further, if you or someone else *does* do that work, then guess what,
> > we still have the option to rip out the swap prefetching code after
> > the hypothetical use-once improvements have been proven and merged.
> > Which, by the way, I've watched people talk about since 2.4. That was,
> > y'know, a *while* ago.
>
> What's wrong with the use-once we have? What improvements are you talking
> about?

You said, effectively: "Use-once could be improved to deal with
updatedb". I said I've been reading emails from Rik and others talking
about that for four years now, and we're still talking about it. Were
it merely updatedb, I'd say us userspace folk should step up and
rewrite the damn thing to amortize its work. However, I and others
feel it's only an example -- glaring, obviously -- of a more pervasive
issue. A small issue, to be sure!, but an issue nevertheless.

In general, I/others are talking about improving the desktop
experience of running too much on a RAM limited machine. (Which, in my
case, is with a gig and a 2.2GHz processor.)

Or restated: the desktop experience occasionally sucks for me, and I
don't think I'm alone. There may be a heuristic, completely isolated
from userspace (and so isn't an API the kernel has to support! -- if
it doesn't work, we can rip it out again), that may mitigate the
suckiness. Let's try it.

> > So enough with the stop energy, okay? You're better than that.
>
> I don't think it is about energy or being mean, I'm just stating the
> issues I have with it.

Nick, I in no way think you're being mean, and I'm sorry if I've given
you that impression. However, if you're just stating the issues you
have with it, then can I assume that you won't lobby against having
this experiment merged?

2007-05-10 07:05:19

by Nick Piggin

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ray Lee wrote:
> On 5/9/07, Nick Piggin <[email protected]> wrote:
>
>> Ray Lee wrote:
>> > On 5/9/07, Nick Piggin <[email protected]> wrote:
>> >
>> >> You said it helped with the updatedb problem. That says we should
>> look at
>> >> why it is going bad first, and for example improve use-once
>> algorithms.
>> >> After we do that, then swap prefetching might still help, which is
>> fine.
>> >
>> > Nick, if you're volunteering to do that analysis, then great. If not,
>> > then you're just providing a airy hope with nothing to back up when or
>> > if that work would ever occur.
>>
>> I'd like to try helping. Tell me your problem.
>
>
> Huh? You already stated one version of it above, namely updatedb. But

So a swapping problem with updatedb should be unusual and we'd like to see
if we can fix it without resorting to prefetching.

I know the theory behind swap prefetching, and I'm not saying it doesn't
work, so I'll snip the rest of that.


>> What's wrong with the use-once we have? What improvements are you talking
>> about?
>
>
> You said, effectively: "Use-once could be improved to deal with
> updatedb". I said I've been reading emails from Rik and others talking
> about that for four years now, and we're still talking about it. Were
> it merely updatedb, I'd say us userspace folk should step up and
> rewrite the damn thing to amortize its work. However, I and others
> feel it's only an example -- glaring, obviously -- of a more pervasive
> issue. A small issue, to be sure!, but an issue nevertheless.

It isn't going to get fixed unless people complain about it. If you
cover the use-once problem with swap prefetching, then it will never
get fixed.


>> I don't think it is about energy or being mean, I'm just stating the
>> issues I have with it.
>
>
> Nick, I in no way think you're being mean, and I'm sorry if I've given
> you that impression. However, if you're just stating the issues you
> have with it, then can I assume that you won't lobby against having
> this experiment merged?

Anybody is free to merge anything into their kernel. And if somebody
asks for my issues with the swap prefetching patch, then I'll give
them :)

--
SUSE Labs, Novell Inc.

2007-05-10 07:20:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

Ray Lee wrote:
>> Huh? You already stated one version of it above, namely updatedb. But

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> So a swapping problem with updatedb should be unusual and we'd like to see
> if we can fix it without resorting to prefetching.
> I know the theory behind swap prefetching, and I'm not saying it doesn't
> work, so I'll snip the rest of that.

I've not run updatedb in years, so I have no idea what it does to a
modern kernel. It used to be an unholy terror of slab fragmentation
and displacing user memory. The case of streaming kernel metadata IO
is probably not quite as easy as streaming file IO.


Ray Lee wrote:
>> You said, effectively: "Use-once could be improved to deal with
>> updatedb". I said I've been reading emails from Rik and others talking
>> about that for four years now, and we're still talking about it. Were
>> it merely updatedb, I'd say us userspace folk should step up and
>> rewrite the damn thing to amortize its work. However, I and others
>> feel it's only an example -- glaring, obviously -- of a more pervasive
>> issue. A small issue, to be sure!, but an issue nevertheless.

On Thu, May 10, 2007 at 05:04:54PM +1000, Nick Piggin wrote:
> It isn't going to get fixed unless people complain about it. If you
> cover the use-once problem with swap prefetching, then it will never
> get fixed.

The policy people need to clean this up once and for all at some point.
clameter's targeted reclaim bits for slub look like a plausible tactic,
but are by no means comprehensive. Things need to attempt to eat their
own tails before eating everyone else alive. Maybe we need to take hits
on things such as badari's dd's to resolve the pathologies.


-- wli

2007-05-10 12:34:24

by Ray Lee

[permalink] [raw]
Subject: Re: swap-prefetch: 2.6.22 -mm merge plans

On 5/10/07, Nick Piggin <[email protected]> wrote:
> > Huh? You already stated one version of it above, namely updatedb. But
>
> So a swapping problem with updatedb should be unusual and we'd like to see
> if we can fix it without resorting to prefetching.
>
> I know the theory behind swap prefetching, and I'm not saying it doesn't
> work, so I'll snip the rest of that.

updatedb is only part of the problem. The other part is that the
kernel has an opportunity to preemptively return some of the evicted
working set to RAM before I ask for it. No fancy use-once algorithm is
going to address that, so your solution is provably incomplete for my
problem.

What's so hard to understand about that?

2007-05-12 04:45:00

by Con Kolivas

[permalink] [raw]
Subject: [PATCH] mm: swap prefetch improvements

It turns out that fixing swap prefetch was not that hard to fix and improve
upon, and since Andrew hasn't dropped swap prefetch, instead here are a swag
of fixes and improvements, including making it depend on !CPUSETS as Nick
requested.

These changes lead to dramatic improvements.

Eg on a machine with 2GB ram and only 500MB swap:

Prefetch disabled:
./sp_tester
Ram 2060352000 Swap 522072000
Total ram to be malloced: 2321388000 bytes
Starting first malloc of 1160694000 bytes
Starting 1st read of first malloc
Touching this much ram takes 529 milliseconds
Starting second malloc of 1160694000 bytes
Completed second malloc and free
Sleeping for 300 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 6030 milliseconds


Prefetch enabled:
/sp_tester
Ram 2060352000 Swap 522072000
Total ram to be malloced: 2321388000 bytes
Starting first malloc of 1160694000 bytes
Starting 1st read of first malloc
Touching this much ram takes 528 milliseconds
Starting second malloc of 1160694000 bytes
Completed second malloc and free
Sleeping for 300 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 665 milliseconds

Note that simply touching the ram took 528 ms so the time taken for the 230MB
converted from major faults to minor faults took only 137ms instead of 5.5s.

---
Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when laptop_mode
is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <[email protected]>

---
Documentation/sysctl/vm.txt | 4 -
init/Kconfig | 2
mm/page_io.c | 2
mm/swap_prefetch.c | 158 +++++++++++++++++++-------------------------
mm/swap_state.c | 2
mm/vmscan.c | 1
6 files changed, 75 insertions(+), 94 deletions(-)

Index: linux-2.6.21-mm1/mm/page_io.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_io.c 2007-02-05 22:52:04.000000000 +1100
+++ linux-2.6.21-mm1/mm/page_io.c 2007-05-12 14:30:52.000000000 +1000
@@ -17,6 +17,7 @@
#include <linux/bio.h>
#include <linux/swapops.h>
#include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
#include <asm/pgtable.h>

static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
ret = -ENOMEM;
goto out;
}
+ add_to_swapped_list(page);
if (wbc->sync_mode == WB_SYNC_ALL)
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
Index: linux-2.6.21-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_state.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_state.c 2007-05-12 14:30:52.000000000 +1000
@@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
- remove_from_swapped_list(entry.val);
page_cache_get(page);
SetPageLocked(page);
SetPageSwapCache(page);
@@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page,
int error;

if (!swap_duplicate(entry)) {
- remove_from_swapped_list(entry.val);
INC_CACHE_INFO(noent_race);
return -ENOENT;
}
Index: linux-2.6.21-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/vmscan.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/vmscan.c 2007-05-12 14:30:52.000000000 +1000
@@ -410,7 +410,6 @@ int remove_mapping(struct address_space

if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
- add_to_swapped_list(page);
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
Index: linux-2.6.21-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_prefetch.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_prefetch.c 2007-05-12 14:30:52.000000000 +1000
@@ -27,7 +27,8 @@
* needs to be at least this duration of idle time meaning in practice it can
* be much longer
*/
-#define PREFETCH_DELAY (HZ * 5)
+#define PREFETCH_DELAY (HZ * 5)
+#define DISABLED_PREFETCH_DELAY (HZ * 60)

/* sysctl - enable/disable swap prefetching */
int swap_prefetch __read_mostly = 1;
@@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void)
}

/*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+ if (swap_prefetch <= laptop_mode)
+ return 0;
+ return 1;
+}
+
+static int wakeup_kprefetchd;
+
+/*
* Drop behind accounting which keeps a list of the most recently used swap
- * entries.
+ * entries. Entries are removed lazily by kprefetchd.
*/
void add_to_swapped_list(struct page *page)
{
struct swapped_entry *entry;
unsigned long index, flags;
- int wakeup;
-
- if (!swap_prefetch)
- return;

- wakeup = 0;
+ if (!prefetch_enabled())
+ goto out;

spin_lock_irqsave(&swapped.lock, flags);
if (swapped.count >= swapped.maxcount) {
@@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa
store_swap_entry_node(entry, page);

if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
- /*
- * If this is the first entry, kprefetchd needs to be
- * (re)started.
- */
- if (!swapped.count)
- wakeup = 1;
list_add(&entry->swapped_list, &swapped.list);
swapped.count++;
}

out_locked:
spin_unlock_irqrestore(&swapped.lock, flags);
-
- /* Do the wakeup outside the lock to shorten lock hold time. */
- if (wakeup)
+out:
+ if (wakeup_kprefetchd)
wake_up_process(kprefetchd_task);
-
return;
}

@@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi
spin_lock_irqsave(&swapped.lock, flags);
entry = radix_tree_delete(&swapped.swap_tree, index);
if (likely(entry)) {
- list_del_init(&entry->swapped_list);
+ list_del(&entry->swapped_list);
swapped.count--;
kmem_cache_free(swapped.cache, entry);
}
@@ -153,18 +157,18 @@ enum trickle_return {
};

struct node_stats {
- unsigned long last_free;
/* Free ram after a cycle of prefetching */
- unsigned long current_free;
+ unsigned long last_free;
/* Free ram on this cycle of checking prefetch_suitable */
- unsigned long prefetch_watermark;
+ unsigned long current_free;
/* Maximum amount we will prefetch to */
- unsigned long highfree[MAX_NR_ZONES];
+ unsigned long prefetch_watermark;
/* The amount of free ram before we start prefetching */
- unsigned long lowfree[MAX_NR_ZONES];
+ unsigned long highfree[MAX_NR_ZONES];
/* The amount of free ram where we will stop prefetching */
- unsigned long *pointfree[MAX_NR_ZONES];
+ unsigned long lowfree[MAX_NR_ZONES];
/* highfree or lowfree depending on whether we've hit a watermark */
+ unsigned long *pointfree[MAX_NR_ZONES];
};

/*
@@ -172,10 +176,10 @@ struct node_stats {
* determine if a node is suitable for prefetching into.
*/
struct prefetch_stats {
- nodemask_t prefetch_nodes;
/* Which nodes are currently suited to prefetching */
- unsigned long prefetched_pages;
+ nodemask_t prefetch_nodes;
/* Total pages we've prefetched on this wakeup of kprefetchd */
+ unsigned long prefetched_pages;
struct node_stats node[MAX_NUMNODES];
};

@@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_
const int node)
{
enum trickle_return ret = TRICKLE_FAILED;
+ unsigned long flags;
struct page *page;

- read_lock_irq(&swapper_space.tree_lock);
+ read_lock_irqsave(&swapper_space.tree_lock, flags);
/* Entry may already exist */
page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
- read_unlock_irq(&swapper_space.tree_lock);
- if (page) {
- remove_from_swapped_list(entry.val);
+ read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+ if (page)
goto out;
- }

/*
* Get a new page to read from swap. We have already checked the
@@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_

/* Add them to the tail of the inactive list to preserve LRU order */
lru_cache_add_tail(page);
- if (unlikely(swap_readpage(NULL, page))) {
- ret = TRICKLE_DELAY;
+ if (unlikely(swap_readpage(NULL, page)))
goto out_release;
- }

sp_stat.prefetched_pages++;
sp_stat.node[node].last_free--;
@@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_
out_release:
page_cache_release(page);
out:
+ /*
+ * All entries are removed here lazily. This avoids the cost of
+ * remove_from_swapped_list during normal swapin. Thus there are
+ * usually many stale entries.
+ */
+ remove_from_swapped_list(entry.val);
return ret;
}

@@ -414,17 +421,6 @@ out:
}

/*
- * Get previous swapped entry when iterating over all entries. swapped.lock
- * should be held and we should already ensure that entry exists.
- */
-static inline struct swapped_entry *prev_swapped_entry
- (struct swapped_entry *entry)
-{
- return list_entry(entry->swapped_list.prev->prev,
- struct swapped_entry, swapped_list);
-}
-
-/*
* trickle_swap is the main function that initiates the swap prefetching. It
* first checks to see if the busy flag is set, and does not prefetch if it
* is, as the flag implied we are low on memory or swapping in currently.
@@ -435,70 +431,49 @@ static inline struct swapped_entry *prev
static enum trickle_return trickle_swap(void)
{
enum trickle_return ret = TRICKLE_DELAY;
- struct swapped_entry *entry;
+ struct list_head *p, *next;
unsigned long flags;

- /*
- * If laptop_mode is enabled don't prefetch to avoid hard drives
- * doing unnecessary spin-ups
- */
- if (!swap_prefetch || laptop_mode)
+ if (!prefetch_enabled())
return ret;

examine_free_limits();
- entry = NULL;
+ if (!prefetch_suitable())
+ return ret;
+ if (list_empty(&swapped.list))
+ return TRICKLE_FAILED;

- for ( ; ; ) {
+ spin_lock_irqsave(&swapped.lock, flags);
+ list_for_each_safe(p, next, &swapped.list) {
+ struct swapped_entry *entry;
swp_entry_t swp_entry;
int node;

+ spin_unlock_irqrestore(&swapped.lock, flags);
+ might_sleep();
if (!prefetch_suitable())
- break;
+ goto out_unlocked;

spin_lock_irqsave(&swapped.lock, flags);
- if (list_empty(&swapped.list)) {
- ret = TRICKLE_FAILED;
- spin_unlock_irqrestore(&swapped.lock, flags);
- break;
- }
-
- if (!entry) {
- /*
- * This sets the entry for the first iteration. It
- * also is a safeguard against the entry disappearing
- * while the lock is not held.
- */
- entry = list_entry(swapped.list.prev,
- struct swapped_entry, swapped_list);
- } else if (entry->swapped_list.prev == swapped.list.next) {
- /*
- * If we have iterated over all entries and there are
- * still entries that weren't swapped out there may
- * be a reason we could not swap them back in so
- * delay attempting further prefetching.
- */
- spin_unlock_irqrestore(&swapped.lock, flags);
- break;
- }
-
+ entry = list_entry(p, struct swapped_entry, swapped_list);
node = get_swap_entry_node(entry);
if (!node_isset(node, sp_stat.prefetch_nodes)) {
/*
* We found an entry that belongs to a node that is
* not suitable for prefetching so skip it.
*/
- entry = prev_swapped_entry(entry);
- spin_unlock_irqrestore(&swapped.lock, flags);
continue;
}
swp_entry = entry->swp_entry;
- entry = prev_swapped_entry(entry);
spin_unlock_irqrestore(&swapped.lock, flags);

if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
- break;
+ goto out_unlocked;
+ spin_lock_irqsave(&swapped.lock, flags);
}
+ spin_unlock_irqrestore(&swapped.lock, flags);

+out_unlocked:
if (sp_stat.prefetched_pages) {
lru_add_drain();
sp_stat.prefetched_pages = 0;
@@ -513,13 +488,14 @@ static int kprefetchd(void *__unused)
sched_setscheduler(current, SCHED_BATCH, &param);
set_user_nice(current, 19);
/* Set ioprio to lowest if supported by i/o scheduler */
- sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+ sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);

/* kprefetchd has nothing to do until it is woken up the first time */
+ wakeup_kprefetchd = 1;
set_current_state(TASK_INTERRUPTIBLE);
schedule();

- do {
+ while (!kthread_should_stop()) {
try_to_freeze();

/*
@@ -527,13 +503,17 @@ static int kprefetchd(void *__unused)
* a wakeup, and further delay the next one.
*/
if (trickle_swap() == TRICKLE_FAILED) {
+ wakeup_kprefetchd = 1;
set_current_state(TASK_INTERRUPTIBLE);
schedule();
- }
+ } else
+ wakeup_kprefetchd = 0;
clear_last_prefetch_free();
- schedule_timeout_interruptible(PREFETCH_DELAY);
- } while (!kthread_should_stop());
-
+ if (!prefetch_enabled())
+ schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY);
+ else
+ schedule_timeout_interruptible(PREFETCH_DELAY);
+ }
return 0;
}

Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt 2007-05-07 21:53:00.000000000 +1000
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt 2007-05-12 14:31:26.000000000 +1000
@@ -229,7 +229,9 @@ swap_prefetch
This enables or disables the swap prefetching feature. When the virtual
memory subsystem has been extremely idle for at least 5 seconds it will start
copying back pages from swap into the swapcache and keep a copy in swap. In
-practice it can take many minutes before the vm is idle enough.
+practice it can take many minutes before the vm is idle enough. A value of 0
+disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2
+enables it even in the presence of laptop_mode.

The default value is 1.

Index: linux-2.6.21-mm1/init/Kconfig
===================================================================
--- linux-2.6.21-mm1.orig/init/Kconfig 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/init/Kconfig 2007-05-12 14:30:52.000000000 +1000
@@ -107,7 +107,7 @@ config SWAP

config SWAP_PREFETCH
bool "Support for prefetching swapped memory"
- depends on SWAP
+ depends on SWAP && !CPUSETS
default y
---help---
This option will allow the kernel to prefetch swapped memory pages

--
-ck

2007-05-12 05:03:17

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

> Swap prefetch is not cpuset aware so make the config option depend on !CPUSETS.

Ok.

Could you explain what it means to say "swap prefetch is not cpuset aware",
or could you give a rough idea of what it would take to make it cpuset aware?

I wouldn't go so far as to say that no one would ever want to prefetch and
use cpusets at the same time, but I will grant that it's not a sufficiently
important need that it should block a useful prefetch implementation on
non-cpuset systems.

One case that would be useful, however, is to handle prefetch in the case
that cpusets are configured into ones kernel, but one is not making any
real use of them ('number_of_cpusets' <= 1). That will actually be the
most common case for the major distribution(s) that enable cpusets by
default in their builds, for most arch's including the arch's popular
on desktops.

So what would it take to allow CONFIG'ing both prefetch and cpusets on,
but having prefetch dynamically adapt to the presence of active cpuset
usage, perhaps by basically shutting down if it can't easily do any
better? I could certainly entertain requests to callout to some
prefetch routine from the cpuset code, at the critical points that
cpusets transitioned in or out of active use.

Semi-separate issue -- is it just cpusets that aren't prefetch friendly,
or is it also mm/mempolicy (mbind, set_mempolicy) as well?

For that matter, even if neither mm/mempolicy nor cpusets are used, on
systems with multiple memory nodes (not all memory equally distant from
all CPUs, aka NUMA), could prefetch cause some sort of shuffling of
memory placement, which might harm the performance of an HPC (High
Performance Computing) application with carefully tuned memory
placement. Granted, this -is- getting to be a corner case. Most HPC
apps running on NUMA hardware are making at least some use of
mm/mempolicy or cpusets.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-12 05:14:17

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Saturday 12 May 2007 15:03, Paul Jackson wrote:
> > Swap prefetch is not cpuset aware so make the config option depend on
> > !CPUSETS.
>
> Ok.
>
> Could you explain what it means to say "swap prefetch is not cpuset aware",
> or could you give a rough idea of what it would take to make it cpuset
> aware?

Hmm I'm not really sure what it takes to make it cpuset aware; it was Nick
that pointed out that it was not, so I'm not sure and still going off your
original recommendation that there was no need to make it cpuset aware but at
least honour node placement (see below).

> I wouldn't go so far as to say that no one would ever want to prefetch and
> use cpusets at the same time, but I will grant that it's not a sufficiently
> important need that it should block a useful prefetch implementation on
> non-cpuset systems.

Thank you for agreeing on me there :)

> One case that would be useful, however, is to handle prefetch in the case
> that cpusets are configured into ones kernel, but one is not making any
> real use of them ('number_of_cpusets' <= 1). That will actually be the
> most common case for the major distribution(s) that enable cpusets by
> default in their builds, for most arch's including the arch's popular
> on desktops.
>
> So what would it take to allow CONFIG'ing both prefetch and cpusets on,
> but having prefetch dynamically adapt to the presence of active cpuset
> usage, perhaps by basically shutting down if it can't easily do any
> better? I could certainly entertain requests to callout to some
> prefetch routine from the cpuset code, at the critical points that
> cpusets transitioned in or out of active use.

It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1 in
the prefetch_enabled() function. Would you like that?

> Semi-separate issue -- is it just cpusets that aren't prefetch friendly,
> or is it also mm/mempolicy (mbind, set_mempolicy) as well?
>
> For that matter, even if neither mm/mempolicy nor cpusets are used, on
> systems with multiple memory nodes (not all memory equally distant from
> all CPUs, aka NUMA), could prefetch cause some sort of shuffling of
> memory placement, which might harm the performance of an HPC (High
> Performance Computing) application with carefully tuned memory
> placement. Granted, this -is- getting to be a corner case. Most HPC
> apps running on NUMA hardware are making at least some use of
> mm/mempolicy or cpusets.

It is numa aware to some degree. It stores the node id and when it starts
prefetching it only prefetches to nodes that are suitable for prefetching to
(based on a number of arbitrary freeness arguments I invented). It uses the
original node id it came from by allocating a page via:
alloc_pages_node(node, GFP_HIGHUSER & ~__GFP_WAIT, 0);
where "node" is the original node the swapped page came from.

Thanks for comments.

--
-ck

2007-05-12 05:51:16

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

Con wrote:
> Hmm I'm not really sure what it takes to make it cpuset aware;
> ...
> It is numa aware to some degree. It stores the node id and when it starts
> prefetching it only prefetches to nodes that are suitable for prefetching to
> ...
> It would be absolutely trivial to add a check for 'number_of_cpusets' <= 1
> in the prefetch_enabled() function. Would you like that?

Hmmm ... it seems that we shadow boxing here ... trying to pick a solution
to solve a problem when we aren't even sure we have a problem, much less
what the problem is.

That does not usually lead to the right path.

Could you put some more effort into characterizing what problems
can arise if one has prefetch and cpusets active at the same time?

My first wild guess is that the only incompatibility would have been that
prefetch might mess up NUMA placement (get pages on wrong nodes), which
it seems you have tried to address in your current patches. So it would
not surprise me if there was no problem here.

We may just have to lean on Nick some more, if he is the only one who
understands what the problem is, to try again to explain it to us.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-12 07:23:15

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Saturday 12 May 2007 15:51, Paul Jackson wrote:
> Con wrote:
> > Hmm I'm not really sure what it takes to make it cpuset aware;
> > ...
> > It is numa aware to some degree. It stores the node id and when it starts
> > prefetching it only prefetches to nodes that are suitable for prefetching
> > to ...
> > It would be absolutely trivial to add a check for 'number_of_cpusets' <=
> > 1 in the prefetch_enabled() function. Would you like that?
>
> Hmmm ... it seems that we shadow boxing here ... trying to pick a solution
> to solve a problem when we aren't even sure we have a problem, much less
> what the problem is.
>
> That does not usually lead to the right path.
>
> Could you put some more effort into characterizing what problems
> can arise if one has prefetch and cpusets active at the same time?
>
> My first wild guess is that the only incompatibility would have been that
> prefetch might mess up NUMA placement (get pages on wrong nodes), which
> it seems you have tried to address in your current patches. So it would
> not surprise me if there was no problem here.

Ummm this is what I've been saying for over a year now but noone has been
listening.

> We may just have to lean on Nick some more, if he is the only one who
> understands what the problem is, to try again to explain it to us.

--
-ck

2007-05-12 08:14:58

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

> Ummm this is what I've been saying for over a year now but noone has been
> listening.

Well ... if there is a problem using prefetch and cpusets together,
it doesn't look like the two of us are going to find it.

I should probably look at your patch to answer this next question,
but being a lazy retard, I'll just ask. Is there a way, on a running
system that has your prefetch patch configured in, to disable prefetch
-- perhaps writing to some magic /proc file or something?

If so, then how about you just remove the lines in the patch that
disable prefetch on kernels configured with CPUSETS, and we charge
ahead allowing both at the same time?

If some day in the future I find something about prefetch that harms
the HPC NUMA loads I care about, then I can just dynamically disable
prefetch.

If someone ever uncovers a real problem with prefetch and cpusets,
then we will deal with it then.

As to whether your patch is otherwise (other than cpusets) worthy
of further acceptance, that I will have to leave up to those who are
competent to make such judgements.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-12 08:22:31

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Saturday 12 May 2007 18:14, Paul Jackson wrote:
> > Ummm this is what I've been saying for over a year now but noone has been
> > listening.
>
> Well ... if there is a problem using prefetch and cpusets together,
> it doesn't look like the two of us are going to find it.
>
> I should probably look at your patch to answer this next question,
> but being a lazy retard, I'll just ask. Is there a way, on a running
> system that has your prefetch patch configured in, to disable prefetch
> -- perhaps writing to some magic /proc file or something?

Indeed:

/proc/sys/vm/swap_prefetch

> If so, then how about you just remove the lines in the patch that
> disable prefetch on kernels configured with CPUSETS, and we charge
> ahead allowing both at the same time?

Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled?
Sure, I can do that.

> If some day in the future I find something about prefetch that harms
> the HPC NUMA loads I care about, then I can just dynamically disable
> prefetch.
>
> If someone ever uncovers a real problem with prefetch and cpusets,
> then we will deal with it then.
>
> As to whether your patch is otherwise (other than cpusets) worthy
> of further acceptance, that I will have to leave up to those who are
> competent to make such judgements.

Thank you very much for your comments!

--
-ck

2007-05-12 08:37:55

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

Con wrote:
> Ok so change the default value for swap_prefetch to 0 when CPUSETS is enabled?

I don't see why that special case for cpusets is needed.

I'm suggesting making no special cases for CPUSETS at all, until and
unless we find reason to.

In other words, I'm suggesting simply removing the patch lines:

- depends on SWAP
+ depends on SWAP && !CPUSETS

I see no other mention of cpusets in your patch. That's fine by me.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-12 08:58:34

by Con Kolivas

[permalink] [raw]
Subject: [PATCH respin] mm: swap prefetch improvements

On Saturday 12 May 2007 18:37, Paul Jackson wrote:
> Con wrote:
> > Ok so change the default value for swap_prefetch to 0 when CPUSETS is
> > enabled?
>
> I don't see why that special case for cpusets is needed.
>
> I'm suggesting making no special cases for CPUSETS at all, until and
> unless we find reason to.
>
> In other words, I'm suggesting simply removing the patch lines:
>
> - depends on SWAP
> + depends on SWAP && !CPUSETS
>
> I see no other mention of cpusets in your patch. That's fine by me.

Excellent, I prefer that as well. Thanks very much for your comments!

Here's a respin without that hunk.

---
Numerous improvements to swap prefetch.

It was possible for kprefetchd to go to sleep indefinitely before/after
changing the /proc value of swap prefetch. Fix that.

The cost of remove_from_swapped_list() can be removed from every page swapin
by moving it to be done entirely by kprefetchd lazily.

The call site for add_to_swapped_list need only be at one place.

Wakeups can occur much less frequently if swap prefetch is disabled.

Make it possible to enable swap prefetch explicitly via /proc when laptop_mode
is enabled by changing the value of the sysctl to 2.

The complicated iteration over every entry can be consolidated by using
list_for_each_safe.

Fix potential irq problem by converting read_lock_irq to irqsave etc.

Code style fixes.

Change the ioprio from IOPRIO_CLASS_IDLE to normal lower priority to ensure
that bio requests are not starved if other I/O begins during prefetching.

Signed-off-by: Con Kolivas <[email protected]>

---
Documentation/sysctl/vm.txt | 4 -
mm/page_io.c | 2
mm/swap_prefetch.c | 158 +++++++++++++++++++-------------------------
mm/swap_state.c | 2
mm/vmscan.c | 1
5 files changed, 74 insertions(+), 93 deletions(-)

Index: linux-2.6.21-mm1/mm/page_io.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/page_io.c 2007-02-05 22:52:04.000000000 +1100
+++ linux-2.6.21-mm1/mm/page_io.c 2007-05-12 14:30:52.000000000 +1000
@@ -17,6 +17,7 @@
#include <linux/bio.h>
#include <linux/swapops.h>
#include <linux/writeback.h>
+#include <linux/swap-prefetch.h>
#include <asm/pgtable.h>

static struct bio *get_swap_bio(gfp_t gfp_flags, pgoff_t index,
@@ -118,6 +119,7 @@ int swap_writepage(struct page *page, st
ret = -ENOMEM;
goto out;
}
+ add_to_swapped_list(page);
if (wbc->sync_mode == WB_SYNC_ALL)
rw |= (1 << BIO_RW_SYNC);
count_vm_event(PSWPOUT);
Index: linux-2.6.21-mm1/mm/swap_state.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_state.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_state.c 2007-05-12 14:30:52.000000000 +1000
@@ -83,7 +83,6 @@ static int __add_to_swap_cache(struct pa
error = radix_tree_insert(&swapper_space.page_tree,
entry.val, page);
if (!error) {
- remove_from_swapped_list(entry.val);
page_cache_get(page);
SetPageLocked(page);
SetPageSwapCache(page);
@@ -102,7 +101,6 @@ int add_to_swap_cache(struct page *page,
int error;

if (!swap_duplicate(entry)) {
- remove_from_swapped_list(entry.val);
INC_CACHE_INFO(noent_race);
return -ENOENT;
}
Index: linux-2.6.21-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/vmscan.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/vmscan.c 2007-05-12 14:30:52.000000000 +1000
@@ -410,7 +410,6 @@ int remove_mapping(struct address_space

if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
- add_to_swapped_list(page);
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
Index: linux-2.6.21-mm1/mm/swap_prefetch.c
===================================================================
--- linux-2.6.21-mm1.orig/mm/swap_prefetch.c 2007-05-07 21:53:51.000000000 +1000
+++ linux-2.6.21-mm1/mm/swap_prefetch.c 2007-05-12 14:30:52.000000000 +1000
@@ -27,7 +27,8 @@
* needs to be at least this duration of idle time meaning in practice it can
* be much longer
*/
-#define PREFETCH_DELAY (HZ * 5)
+#define PREFETCH_DELAY (HZ * 5)
+#define DISABLED_PREFETCH_DELAY (HZ * 60)

/* sysctl - enable/disable swap prefetching */
int swap_prefetch __read_mostly = 1;
@@ -61,19 +62,30 @@ inline void delay_swap_prefetch(void)
}

/*
+ * If laptop_mode is enabled don't prefetch to avoid hard drives
+ * doing unnecessary spin-ups unless swap_prefetch is explicitly
+ * set to a higher value.
+ */
+static inline int prefetch_enabled(void)
+{
+ if (swap_prefetch <= laptop_mode)
+ return 0;
+ return 1;
+}
+
+static int wakeup_kprefetchd;
+
+/*
* Drop behind accounting which keeps a list of the most recently used swap
- * entries.
+ * entries. Entries are removed lazily by kprefetchd.
*/
void add_to_swapped_list(struct page *page)
{
struct swapped_entry *entry;
unsigned long index, flags;
- int wakeup;
-
- if (!swap_prefetch)
- return;

- wakeup = 0;
+ if (!prefetch_enabled())
+ goto out;

spin_lock_irqsave(&swapped.lock, flags);
if (swapped.count >= swapped.maxcount) {
@@ -103,23 +115,15 @@ void add_to_swapped_list(struct page *pa
store_swap_entry_node(entry, page);

if (likely(!radix_tree_insert(&swapped.swap_tree, index, entry))) {
- /*
- * If this is the first entry, kprefetchd needs to be
- * (re)started.
- */
- if (!swapped.count)
- wakeup = 1;
list_add(&entry->swapped_list, &swapped.list);
swapped.count++;
}

out_locked:
spin_unlock_irqrestore(&swapped.lock, flags);
-
- /* Do the wakeup outside the lock to shorten lock hold time. */
- if (wakeup)
+out:
+ if (wakeup_kprefetchd)
wake_up_process(kprefetchd_task);
-
return;
}

@@ -139,7 +143,7 @@ void remove_from_swapped_list(const unsi
spin_lock_irqsave(&swapped.lock, flags);
entry = radix_tree_delete(&swapped.swap_tree, index);
if (likely(entry)) {
- list_del_init(&entry->swapped_list);
+ list_del(&entry->swapped_list);
swapped.count--;
kmem_cache_free(swapped.cache, entry);
}
@@ -153,18 +157,18 @@ enum trickle_return {
};

struct node_stats {
- unsigned long last_free;
/* Free ram after a cycle of prefetching */
- unsigned long current_free;
+ unsigned long last_free;
/* Free ram on this cycle of checking prefetch_suitable */
- unsigned long prefetch_watermark;
+ unsigned long current_free;
/* Maximum amount we will prefetch to */
- unsigned long highfree[MAX_NR_ZONES];
+ unsigned long prefetch_watermark;
/* The amount of free ram before we start prefetching */
- unsigned long lowfree[MAX_NR_ZONES];
+ unsigned long highfree[MAX_NR_ZONES];
/* The amount of free ram where we will stop prefetching */
- unsigned long *pointfree[MAX_NR_ZONES];
+ unsigned long lowfree[MAX_NR_ZONES];
/* highfree or lowfree depending on whether we've hit a watermark */
+ unsigned long *pointfree[MAX_NR_ZONES];
};

/*
@@ -172,10 +176,10 @@ struct node_stats {
* determine if a node is suitable for prefetching into.
*/
struct prefetch_stats {
- nodemask_t prefetch_nodes;
/* Which nodes are currently suited to prefetching */
- unsigned long prefetched_pages;
+ nodemask_t prefetch_nodes;
/* Total pages we've prefetched on this wakeup of kprefetchd */
+ unsigned long prefetched_pages;
struct node_stats node[MAX_NUMNODES];
};

@@ -189,16 +193,15 @@ static enum trickle_return trickle_swap_
const int node)
{
enum trickle_return ret = TRICKLE_FAILED;
+ unsigned long flags;
struct page *page;

- read_lock_irq(&swapper_space.tree_lock);
+ read_lock_irqsave(&swapper_space.tree_lock, flags);
/* Entry may already exist */
page = radix_tree_lookup(&swapper_space.page_tree, entry.val);
- read_unlock_irq(&swapper_space.tree_lock);
- if (page) {
- remove_from_swapped_list(entry.val);
+ read_unlock_irqrestore(&swapper_space.tree_lock, flags);
+ if (page)
goto out;
- }

/*
* Get a new page to read from swap. We have already checked the
@@ -217,10 +220,8 @@ static enum trickle_return trickle_swap_

/* Add them to the tail of the inactive list to preserve LRU order */
lru_cache_add_tail(page);
- if (unlikely(swap_readpage(NULL, page))) {
- ret = TRICKLE_DELAY;
+ if (unlikely(swap_readpage(NULL, page)))
goto out_release;
- }

sp_stat.prefetched_pages++;
sp_stat.node[node].last_free--;
@@ -229,6 +230,12 @@ static enum trickle_return trickle_swap_
out_release:
page_cache_release(page);
out:
+ /*
+ * All entries are removed here lazily. This avoids the cost of
+ * remove_from_swapped_list during normal swapin. Thus there are
+ * usually many stale entries.
+ */
+ remove_from_swapped_list(entry.val);
return ret;
}

@@ -414,17 +421,6 @@ out:
}

/*
- * Get previous swapped entry when iterating over all entries. swapped.lock
- * should be held and we should already ensure that entry exists.
- */
-static inline struct swapped_entry *prev_swapped_entry
- (struct swapped_entry *entry)
-{
- return list_entry(entry->swapped_list.prev->prev,
- struct swapped_entry, swapped_list);
-}
-
-/*
* trickle_swap is the main function that initiates the swap prefetching. It
* first checks to see if the busy flag is set, and does not prefetch if it
* is, as the flag implied we are low on memory or swapping in currently.
@@ -435,70 +431,49 @@ static inline struct swapped_entry *prev
static enum trickle_return trickle_swap(void)
{
enum trickle_return ret = TRICKLE_DELAY;
- struct swapped_entry *entry;
+ struct list_head *p, *next;
unsigned long flags;

- /*
- * If laptop_mode is enabled don't prefetch to avoid hard drives
- * doing unnecessary spin-ups
- */
- if (!swap_prefetch || laptop_mode)
+ if (!prefetch_enabled())
return ret;

examine_free_limits();
- entry = NULL;
+ if (!prefetch_suitable())
+ return ret;
+ if (list_empty(&swapped.list))
+ return TRICKLE_FAILED;

- for ( ; ; ) {
+ spin_lock_irqsave(&swapped.lock, flags);
+ list_for_each_safe(p, next, &swapped.list) {
+ struct swapped_entry *entry;
swp_entry_t swp_entry;
int node;

+ spin_unlock_irqrestore(&swapped.lock, flags);
+ might_sleep();
if (!prefetch_suitable())
- break;
+ goto out_unlocked;

spin_lock_irqsave(&swapped.lock, flags);
- if (list_empty(&swapped.list)) {
- ret = TRICKLE_FAILED;
- spin_unlock_irqrestore(&swapped.lock, flags);
- break;
- }
-
- if (!entry) {
- /*
- * This sets the entry for the first iteration. It
- * also is a safeguard against the entry disappearing
- * while the lock is not held.
- */
- entry = list_entry(swapped.list.prev,
- struct swapped_entry, swapped_list);
- } else if (entry->swapped_list.prev == swapped.list.next) {
- /*
- * If we have iterated over all entries and there are
- * still entries that weren't swapped out there may
- * be a reason we could not swap them back in so
- * delay attempting further prefetching.
- */
- spin_unlock_irqrestore(&swapped.lock, flags);
- break;
- }
-
+ entry = list_entry(p, struct swapped_entry, swapped_list);
node = get_swap_entry_node(entry);
if (!node_isset(node, sp_stat.prefetch_nodes)) {
/*
* We found an entry that belongs to a node that is
* not suitable for prefetching so skip it.
*/
- entry = prev_swapped_entry(entry);
- spin_unlock_irqrestore(&swapped.lock, flags);
continue;
}
swp_entry = entry->swp_entry;
- entry = prev_swapped_entry(entry);
spin_unlock_irqrestore(&swapped.lock, flags);

if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
- break;
+ goto out_unlocked;
+ spin_lock_irqsave(&swapped.lock, flags);
}
+ spin_unlock_irqrestore(&swapped.lock, flags);

+out_unlocked:
if (sp_stat.prefetched_pages) {
lru_add_drain();
sp_stat.prefetched_pages = 0;
@@ -513,13 +488,14 @@ static int kprefetchd(void *__unused)
sched_setscheduler(current, SCHED_BATCH, &param);
set_user_nice(current, 19);
/* Set ioprio to lowest if supported by i/o scheduler */
- sys_ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_CLASS_IDLE);
+ sys_ioprio_set(IOPRIO_WHO_PROCESS, IOPRIO_BE_NR - 1, IOPRIO_CLASS_BE);

/* kprefetchd has nothing to do until it is woken up the first time */
+ wakeup_kprefetchd = 1;
set_current_state(TASK_INTERRUPTIBLE);
schedule();

- do {
+ while (!kthread_should_stop()) {
try_to_freeze();

/*
@@ -527,13 +503,17 @@ static int kprefetchd(void *__unused)
* a wakeup, and further delay the next one.
*/
if (trickle_swap() == TRICKLE_FAILED) {
+ wakeup_kprefetchd = 1;
set_current_state(TASK_INTERRUPTIBLE);
schedule();
- }
+ } else
+ wakeup_kprefetchd = 0;
clear_last_prefetch_free();
- schedule_timeout_interruptible(PREFETCH_DELAY);
- } while (!kthread_should_stop());
-
+ if (!prefetch_enabled())
+ schedule_timeout_interruptible(DISABLED_PREFETCH_DELAY);
+ else
+ schedule_timeout_interruptible(PREFETCH_DELAY);
+ }
return 0;
}

Index: linux-2.6.21-mm1/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.21-mm1.orig/Documentation/sysctl/vm.txt 2007-05-07 21:53:00.000000000 +1000
+++ linux-2.6.21-mm1/Documentation/sysctl/vm.txt 2007-05-12 14:31:26.000000000 +1000
@@ -229,7 +229,9 @@ swap_prefetch
This enables or disables the swap prefetching feature. When the virtual
memory subsystem has been extremely idle for at least 5 seconds it will start
copying back pages from swap into the swapcache and keep a copy in swap. In
-practice it can take many minutes before the vm is idle enough.
+practice it can take many minutes before the vm is idle enough. A value of 0
+disables swap prefetching, 1 enables it unless laptop_mode is enabled, and 2
+enables it even in the presence of laptop_mode.

The default value is 1.


--
-ck

2007-05-21 10:04:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> It turns out that fixing swap prefetch was not that hard to fix and
> improve upon, and since Andrew hasn't dropped swap prefetch, instead
> here are a swag of fixes and improvements, [...]

it's a reliable win on my testbox too:

# echo 1 > /proc/sys/vm/swap_prefetch
# ./sp_tester
Ram 1019540000 Swap 4096564000
Total ram to be malloced: 1529310000 bytes
Starting first malloc of 764655000 bytes
Starting 1st read of first malloc
Touching this much ram takes 4393 milliseconds
Starting second malloc of 764655000 bytes
Completed second malloc and free
Sleeping for 600 seconds
Important part - starting reread of first malloc
Completed read of first malloc
Timed portion 30279 milliseconds

versus:

# echo 0 > /proc/sys/vm/swap_prefetch
# ./sp_tester
[...]

Timed portion 36605 milliseconds

i've repeated these tests to make sure it's a stable win and indeed it
is:

# swap-prefetch-on:

Timed portion 29704 milliseconds

# swap-prefetch-off:

Timed portion 34863 milliseconds

Nice work Con!

A suggestion for improvement: right now swap-prefetch does a small bit
of swapin every 5 seconds and stays idle inbetween. Could this perhaps
be made more agressive (optionally perhaps), if the system is not
swapping otherwise? If block-IO level instrumentation is needed to
determine idleness of block IO then that is justified too i think.

Another suggestion: swap-prefetch seems to be doing all the right
decisions in the sp_test.c case - so would it be possible to add
statistics so that it could be verified how much of the swapped-in pages
were indeed a 'hit' - and how many were recycled without them being
reused? That could give a reliable, objective metric about how efficient
swap-prefetch is in any workload.

Ingo

2007-05-21 13:45:43

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Monday 21 May 2007 20:03, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > It turns out that fixing swap prefetch was not that hard to fix and
> > improve upon, and since Andrew hasn't dropped swap prefetch, instead
> > here are a swag of fixes and improvements, [...]
>
> it's a reliable win on my testbox too:
>
> # echo 1 > /proc/sys/vm/swap_prefetch

> Timed portion 30279 milliseconds
>
> versus:
>
> # echo 0 > /proc/sys/vm/swap_prefetch
> # ./sp_tester
> [...]
>
> Timed portion 36605 milliseconds
>
> i've repeated these tests to make sure it's a stable win and indeed it
> is:
>
> # swap-prefetch-on:
>
> Timed portion 29704 milliseconds
>
> # swap-prefetch-off:
>
> Timed portion 34863 milliseconds
>
> Nice work Con!

Thanks!

>
> A suggestion for improvement: right now swap-prefetch does a small bit
> of swapin every 5 seconds and stays idle inbetween. Could this perhaps
> be made more agressive (optionally perhaps), if the system is not
> swapping otherwise? If block-IO level instrumentation is needed to
> determine idleness of block IO then that is justified too i think.

Hmm.. The timer waits 5 seconds before trying to prefetch, but then only stops
if it detects any activity elsewhere. It doesn't actually try to go idle in
between but it doesn't take much activity to put it back to sleep, hence
detecting yet another "not quite idle" period and then it goes to sleep
again. I guess the sleep interval can actually be changed as another tunable
from 5 seconds to whatever the user wanted.

> Another suggestion: swap-prefetch seems to be doing all the right
> decisions in the sp_test.c case - so would it be possible to add
> statistics so that it could be verified how much of the swapped-in pages
> were indeed a 'hit' - and how many were recycled without them being
> reused? That could give a reliable, objective metric about how efficient
> swap-prefetch is in any workload.

Well the advantage is twofold potentially; 1. the pages that have been
prefecthed and become minor faults when they would have been major faults,
and 2. those that become minor faults (via 1) and then become major faults
again (since a copy is kept on backing store with swap prefetch). The
sp_tester only tests for 1, although it would be easy enough to simply do
another big malloc at the end and see how fast it swapped out again as a
marker of 2. As for an in-kernel option, it could get kind of expensive
tracking pages that have done one or both of these. I'll think about an
affordable way to do this, perhaps it could be just done as a
debugging/testing patch, but if would be nice to make it cheap enough to have
there permanently as well. The pages end up in swap cache (in the reverse
direction pages normally get to swap cache) so the accounting could be done
somewhere around there.

> Ingo

Thanks for comments!

--
-ck

2007-05-21 16:01:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> > A suggestion for improvement: right now swap-prefetch does a small
> > bit of swapin every 5 seconds and stays idle inbetween. Could this
> > perhaps be made more agressive (optionally perhaps), if the system
> > is not swapping otherwise? If block-IO level instrumentation is
> > needed to determine idleness of block IO then that is justified too
> > i think.
>
> Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> only stops if it detects any activity elsewhere. It doesn't actually
> try to go idle in between but it doesn't take much activity to put it
> back to sleep, hence detecting yet another "not quite idle" period and
> then it goes to sleep again. I guess the sleep interval can actually
> be changed as another tunable from 5 seconds to whatever the user
> wanted.

there was nothing else running on the system - so i suspect the swapin
activity flagged 'itself' as some 'other' activity and stopped? The
swapins happened in 4 bursts, separated by 5 seconds total idleness.

Ingo

2007-05-22 10:15:40

by Antonino Ingargiola

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

2007/5/21, Ingo Molnar <[email protected]>:
>
> * Con Kolivas <[email protected]> wrote:
>
> > > A suggestion for improvement: right now swap-prefetch does a small
> > > bit of swapin every 5 seconds and stays idle inbetween. Could this
> > > perhaps be made more agressive (optionally perhaps), if the system
> > > is not swapping otherwise? If block-IO level instrumentation is
> > > needed to determine idleness of block IO then that is justified too
> > > i think.
> >
> > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > only stops if it detects any activity elsewhere. It doesn't actually
> > try to go idle in between but it doesn't take much activity to put it
> > back to sleep, hence detecting yet another "not quite idle" period and
> > then it goes to sleep again. I guess the sleep interval can actually
> > be changed as another tunable from 5 seconds to whatever the user
> > wanted.
>
> there was nothing else running on the system - so i suspect the swapin
> activity flagged 'itself' as some 'other' activity and stopped? The
> swapins happened in 4 bursts, separated by 5 seconds total idleness.

I've noted burst swapins separated by some seconds of pause in my
desktop system too (with sp_tester and an idle gnome).


Regards,

~ Antonio

2007-05-22 10:22:54

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Tuesday 22 May 2007 20:15, Antonino Ingargiola wrote:
> 2007/5/21, Ingo Molnar <[email protected]>:
> > * Con Kolivas <[email protected]> wrote:
> > > > A suggestion for improvement: right now swap-prefetch does a small
> > > > bit of swapin every 5 seconds and stays idle inbetween. Could this
> > > > perhaps be made more agressive (optionally perhaps), if the system
> > > > is not swapping otherwise? If block-IO level instrumentation is
> > > > needed to determine idleness of block IO then that is justified too
> > > > i think.
> > >
> > > Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> > > only stops if it detects any activity elsewhere. It doesn't actually
> > > try to go idle in between but it doesn't take much activity to put it
> > > back to sleep, hence detecting yet another "not quite idle" period and
> > > then it goes to sleep again. I guess the sleep interval can actually
> > > be changed as another tunable from 5 seconds to whatever the user
> > > wanted.
> >
> > there was nothing else running on the system - so i suspect the swapin
> > activity flagged 'itself' as some 'other' activity and stopped? The
> > swapins happened in 4 bursts, separated by 5 seconds total idleness.
>
> I've noted burst swapins separated by some seconds of pause in my
> desktop system too (with sp_tester and an idle gnome).

That really is expected, as just about anything, including journal writeout,
would be enough to put it back to sleep for 5 more seconds.
--
-ck

2007-05-22 10:26:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> > > there was nothing else running on the system - so i suspect the
> > > swapin activity flagged 'itself' as some 'other' activity and
> > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > total idleness.
> >
> > I've noted burst swapins separated by some seconds of pause in my
> > desktop system too (with sp_tester and an idle gnome).
>
> That really is expected, as just about anything, including journal
> writeout, would be enough to put it back to sleep for 5 more seconds.

note that nothing like that happened on my system - in the
swap-prefetch-off case there was _zero_ IO activity during the sleep
period.

Ingo

2007-05-22 10:39:14

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > > > there was nothing else running on the system - so i suspect the
> > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > total idleness.
> > >
> > > I've noted burst swapins separated by some seconds of pause in my
> > > desktop system too (with sp_tester and an idle gnome).
> >
> > That really is expected, as just about anything, including journal
> > writeout, would be enough to put it back to sleep for 5 more seconds.
>
> note that nothing like that happened on my system - in the
> swap-prefetch-off case there was _zero_ IO activity during the sleep
> period.

Ok, granted it's _very_ conservative. I really don't want to risk its presence
being a burden on anything, and the iowait it induces probably makes it turn
itself off for another PREFETCH_DELAY (5s). I really don't want to cross the
line to where it is detrimental in any way. Not dropping out on a
cond_resched and perhaps making the delay tunable should be enough to make it
a little less "sleepy".

--
-ck

2007-05-22 10:47:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > * Con Kolivas <[email protected]> wrote:
> > > > > there was nothing else running on the system - so i suspect the
> > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > total idleness.
> > > >
> > > > I've noted burst swapins separated by some seconds of pause in my
> > > > desktop system too (with sp_tester and an idle gnome).
> > >
> > > That really is expected, as just about anything, including journal
> > > writeout, would be enough to put it back to sleep for 5 more seconds.
> >
> > note that nothing like that happened on my system - in the
> > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > period.
>
> Ok, granted it's _very_ conservative. [...]

but your first reaction was "it should not have slept for 5 seconds":

| Hmm.. The timer waits 5 seconds before trying to prefetch, but then
| only stops if it detects any activity elsewhere. It doesn't actually
| try to go idle in between

It clearly should not consider 'itself' as IO activity. This suggests
some bug in the 'detect activity' mechanism, agreed? I'm wondering
whether you are seeing the same problem, or is all swap-prefetch IO on
your system continuous until it's done [or some other IO comes
inbetween]?

Ingo

2007-05-22 10:55:58

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> It clearly should not consider 'itself' as IO activity. This suggests
> some bug in the 'detect activity' mechanism, agreed? I'm wondering
> whether you are seeing the same problem, or is all swap-prefetch IO on
> your system continuous until it's done [or some other IO comes
> inbetween]?

When nothing else is happening anywhere on the system it reads in bursts and
goes to sleep during journal writeout.

--
-ck

2007-05-22 10:58:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > It clearly should not consider 'itself' as IO activity. This
> > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > wondering whether you are seeing the same problem, or is all
> > swap-prefetch IO on your system continuous until it's done [or some
> > other IO comes inbetween]?
>
> When nothing else is happening anywhere on the system it reads in
> bursts and goes to sleep during journal writeout.

hm, what do you call 'journal writeout' here that would be happening on
my system?

Ingo

2007-05-22 11:06:16

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Tuesday 22 May 2007 20:57, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > On Tuesday 22 May 2007 20:46, Ingo Molnar wrote:
> > > It clearly should not consider 'itself' as IO activity. This
> > > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > > wondering whether you are seeing the same problem, or is all
> > > swap-prefetch IO on your system continuous until it's done [or some
> > > other IO comes inbetween]?
> >
> > When nothing else is happening anywhere on the system it reads in
> > bursts and goes to sleep during journal writeout.
>
> hm, what do you call 'journal writeout' here that would be happening on
> my system?

Not really sure what you have in terms of fs, but here even with nothing going
on, ext3 writes to disk every 5 seconds with kjournald.

--
-ck

2007-05-22 11:14:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements


* Con Kolivas <[email protected]> wrote:

> > hm, what do you call 'journal writeout' here that would be happening
> > on my system?
>
> Not really sure what you have in terms of fs, but here even with
> nothing going on, ext3 writes to disk every 5 seconds with kjournald.

i have ext3, but it doesnt do that on my box. Also, i would have noticed
any IO activity in the 'swap prefetch off' case. When i said completely
idle, i really meant it ;-)

so swap-prefetch stops for 5 seconds for no apparent reason.

Ingo

2007-05-22 20:19:13

by Michael Chang

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: swap prefetch improvements

On 5/22/07, Ingo Molnar <[email protected]> wrote:
>
> * Con Kolivas <[email protected]> wrote:
>
> > On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > > * Con Kolivas <[email protected]> wrote:
> > > > > > there was nothing else running on the system - so i suspect the
> > > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > > total idleness.
> > > > >
> > > > > I've noted burst swapins separated by some seconds of pause in my
> > > > > desktop system too (with sp_tester and an idle gnome).
> > > >
> > > > That really is expected, as just about anything, including journal
> > > > writeout, would be enough to put it back to sleep for 5 more seconds.
> > >
> > > note that nothing like that happened on my system - in the
> > > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > > period.
> >
> > Ok, granted it's _very_ conservative. [...]
>
> but your first reaction was "it should not have slept for 5 seconds":
>
> | Hmm.. The timer waits 5 seconds before trying to prefetch, but then
> | only stops if it detects any activity elsewhere. It doesn't actually
> | try to go idle in between
>
> It clearly should not consider 'itself' as IO activity. This suggests
> some bug in the 'detect activity' mechanism, agreed? I'm wondering
> whether you are seeing the same problem, or is all swap-prefetch IO on
> your system continuous until it's done [or some other IO comes
> inbetween]?

The only "problem" I can see with this idea is in the potential case
that it takes up all the IO activity, and so there is never enough IO
activity from other progams to trigger the wait mechanism because they
don't get a chance to run.

That could probably be "fixed" by capping the IO, though... (with one
of those oh-so-lovable "magic numbers" or a tunable)

That said, I don't think there are any issues with the code
compensating for its own activity in the "detect activity" mechanism
-- assuming there wasn't a major impact in e.g. maintainability or
something.

As for the burstyness... considering the "no negative impact" stance,
I can understand that. But it seems inefficient, at best...

--
Michael Chang

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
Thank you.

2007-05-22 20:32:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: swap prefetch improvements


* Michael Chang <[email protected]> wrote:

> > It clearly should not consider 'itself' as IO activity. This
> > suggests some bug in the 'detect activity' mechanism, agreed? I'm
> > wondering whether you are seeing the same problem, or is all
> > swap-prefetch IO on your system continuous until it's done [or some
> > other IO comes inbetween]?
>
> The only "problem" I can see with this idea is in the potential case
> that it takes up all the IO activity, and so there is never enough IO
> activity from other progams to trigger the wait mechanism because they
> don't get a chance to run.

i dont understand what you mean. Any 'use only idle IO capacity'
mechanism should immediately cease to be active the moment any other app
tries to do IO - whether the IO subsystem is saturated or not.

> That said, I don't think there are any issues with the code
> compensating for its own activity in the "detect activity" mechanism
> -- assuming there wasn't a major impact in e.g. maintainability or
> something.
>
> As for the burstyness... considering the "no negative impact" stance,
> I can understand that. But it seems inefficient, at best...

well, it's a plain old bug (a not too serious one) in my book, i'm
surprised that we are now at mail #7 about it :-) I reported it, and i
guess Con will fix it eventually. There's really no need to deny that it
exists or to try to talk it out of existence. Sheesh! :-)

Ingo

2007-05-22 20:57:58

by Ash Milsted

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: swap prefetch improvements

On Tue, 22 May 2007 20:37:54 +1000
Con Kolivas <[email protected]> wrote:

> On Tuesday 22 May 2007 20:25, Ingo Molnar wrote:
> > * Con Kolivas <[email protected]> wrote:
> > > > > there was nothing else running on the system - so i suspect the
> > > > > swapin activity flagged 'itself' as some 'other' activity and
> > > > > stopped? The swapins happened in 4 bursts, separated by 5 seconds
> > > > > total idleness.
> > > >
> > > > I've noted burst swapins separated by some seconds of pause in my
> > > > desktop system too (with sp_tester and an idle gnome).
> > >
> > > That really is expected, as just about anything, including journal
> > > writeout, would be enough to put it back to sleep for 5 more seconds.
> >
> > note that nothing like that happened on my system - in the
> > swap-prefetch-off case there was _zero_ IO activity during the sleep
> > period.
>
> Ok, granted it's _very_ conservative. I really don't want to risk its presence
> being a burden on anything, and the iowait it induces probably makes it turn
> itself off for another PREFETCH_DELAY (5s). I really don't want to cross the
> line to where it is detrimental in any way. Not dropping out on a
> cond_resched and perhaps making the delay tunable should be enough to make it
> a little less "sleepy".
>
> --
> -ck

Hi. I just did some video encoding on my desktop and I was noticing
(for the first time in a while) that running apps had to hit swap quite
a lot when I switched to them (the encoding was going at full blast for
most of the day, and most of the time other running apps were
idle). Now, a good half of my RAM appeared to be free during all this,
so I was thinking at the time that it would be nice if swap prefetch
could be tunably more aggressive. I guess it would be ideal in this
case if it could kick in during tunably low disk-IO periods, even if
the CPU is rather busy. I'm sure you've considered this, so I only butt
in here to cast a vote for it. :)

Of course, I could be completely wrong about the possibility.. and I
seem to remember that the disk cache can take up about half the ram by
default without this showing up in 'gnome-system-monitor'... which I
guess might happen during heavy encoding.. but even if it did, I could
have set the limit lower, and would then have still appreciated
prefetching.

Ash

2007-05-22 22:50:25

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Wednesday 23 May 2007 06:42, Ash Milsted wrote:
> Hi. I just did some video encoding on my desktop and I was noticing
> (for the first time in a while) that running apps had to hit swap quite
> a lot when I switched to them (the encoding was going at full blast for
> most of the day, and most of the time other running apps were
> idle). Now, a good half of my RAM appeared to be free during all this,
> so I was thinking at the time that it would be nice if swap prefetch
> could be tunably more aggressive. I guess it would be ideal in this
> case if it could kick in during tunably low disk-IO periods, even if
> the CPU is rather busy. I'm sure you've considered this, so I only butt
> in here to cast a vote for it. :)

In this case nicing the video encode should be enough to make it prefetch even
during heavy cpu usage. It detects the total nice level rather than the cpu
usage.

> Of course, I could be completely wrong about the possibility.. and I
> seem to remember that the disk cache can take up about half the ram by
> default without this showing up in 'gnome-system-monitor'... which I
> guess might happen during heavy encoding.. but even if it did, I could
> have set the limit lower, and would then have still appreciated
> prefetching.

I plan to make it prefetch more aggressively by default soon and make it more
tunable too.

--
-ck

2007-05-23 07:58:29

by Ash Milsted

[permalink] [raw]
Subject: Re: [PATCH] mm: swap prefetch improvements

On Wed, 23 May 2007 08:50:01 +1000
Con Kolivas <[email protected]> wrote:

> On Wednesday 23 May 2007 06:42, Ash Milsted wrote:
> > Hi. I just did some video encoding on my desktop and I was noticing
> > (for the first time in a while) that running apps had to hit swap quite
> > a lot when I switched to them (the encoding was going at full blast for
> > most of the day, and most of the time other running apps were
> > idle). Now, a good half of my RAM appeared to be free during all this,
> > so I was thinking at the time that it would be nice if swap prefetch
> > could be tunably more aggressive. I guess it would be ideal in this
> > case if it could kick in during tunably low disk-IO periods, even if
> > the CPU is rather busy. I'm sure you've considered this, so I only butt
> > in here to cast a vote for it. :)
>
> In this case nicing the video encode should be enough to make it prefetch even
> during heavy cpu usage. It detects the total nice level rather than the cpu
> usage.
>

Cunning, but I guess the regular (less than 5 seconds apart)
reads/writes during the encoding process would cause prefetching to
hold off, no? I had used nice and ionice to reduce the encoder
priority, which made desktop apps pretty responsive, except when they
had to hit swap. If swap prefetch is using the idle io-priority I
suppose it would hardly affect performance if it kicked in during such
use, since it would operate in between the encoder reads anyway
(assuming the encoder is at higher ioprio), right ?

>
> I plan to make it prefetch more aggressively by default soon and make it more
> tunable too.
>

'Sounds good!