LinuxLists.cc - swapping and the value of /proc/sys/vm/swappiness

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Andrew Morton wrote:

>
> That being said, your tests are interesting. There's a wide spread of
> results across different kernel versions and across different swappiness
> settings. But the question is: which behaviour is correct for your users,
> and why?
>

Andrew,

Behavior more like that of 2.6.5 and 2.6.6 is what we would like to see, I
think. We have had problems in the past with a single large HPC application
that runs for a long time then wants to push its data out quickly. What
happens to us in 2.4.21 is that the page cache pages swap out the user pages,
and that is somethine we would like to avoid, since it can reduce the data
rate significantly.

We were planning on suggesting that such users set swappiness=0 to give
user pages priority over the page cache pages. But it doesn't look like that
works very well in the more recent kernels.

One (perhaps) desirable feature would be for intermediate values of swappiness
to have behavior in between the two extremes (mapped pages have higher
priority vs page cache pages having priority over unreferenced mapped pages),
so that one would have finer grain control over the amount of swap used. I'm
not sure how to achieve such a goal, however. :-)

On a separate issue, the response to my proposal for a mempolicy to control
allocation of page cache pages has been <ahem> underwhelming.

(See: http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
and http://marc.theaimsgroup.com/?l=linux-mm&m=109416852416997&w=2 )

I wonder if this is because I just posted it to linux-mm or its not fleshed
out enough yet to be interesting?

Thanks,
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-06 21:39:01

by Andrew Morton

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Ray Bryant <[email protected]> wrote:
>
>
>
> Andrew Morton wrote:
>
> >
> > That being said, your tests are interesting. There's a wide spread of
> > results across different kernel versions and across different swappiness
> > settings. But the question is: which behaviour is correct for your users,
> > and why?
> >
>
> Andrew,
>
> Behavior more like that of 2.6.5 and 2.6.6 is what we would like to see, I
> think. We have had problems in the past with a single large HPC application
> that runs for a long time then wants to push its data out quickly. What
> happens to us in 2.4.21 is that the page cache pages swap out the user pages,
> and that is somethine we would like to avoid, since it can reduce the data
> rate significantly.

You probably need to decrease /proc/sys/vm/dirty_ratio and
dirty_background_ratio by a lot. That will reduce the amount of
unreclaimable pagecache and will take pressure off page reclaim.

Also, converting the application to explicitly tell the kernel that it
doesn't want certain data cached will help things a lot.
posix_fadvise(POSIX_FADV_DONTNEED), preferably preceded by fsync().

> We were planning on suggesting that such users set swappiness=0 to give
> user pages priority over the page cache pages. But it doesn't look like that
> works very well in the more recent kernels.

As I say above: avoiding putting all that pressure onto page reclaim in the
first case would be preferable to trying to fix stuff up after it has
happened.

> ...
> On a separate issue, the response to my proposal for a mempolicy to control
> allocation of page cache pages has been <ahem> underwhelming.
>
> (See: http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
> and http://marc.theaimsgroup.com/?l=linux-mm&m=109416852416997&w=2 )
>
> I wonder if this is because I just posted it to linux-mm or its not fleshed
> out enough yet to be interesting?
>

General brain-fry, I expect. There's a lot happening.

2004-09-06 22:38:08

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Mon, Sep 06, 2004 at 04:22:07PM -0500, Ray Bryant wrote:
> We were planning on suggesting that such users set swappiness=0 to give
> user pages priority over the page cache pages. But it doesn't look like
> that works very well in the more recent kernels.
> One (perhaps) desirable feature would be for intermediate values of
> swappiness to have behavior in between the two extremes (mapped pages have
> higher priority vs page cache pages having priority over unreferenced
> mapped pages),
> so that one would have finer grain control over the amount of swap used.
> I'm not sure how to achieve such a goal, however. :-)

Priority paging again? A perennial suggestion.

On Mon, Sep 06, 2004 at 04:22:07PM -0500, Ray Bryant wrote:
> On a separate issue, the response to my proposal for a mempolicy to control
> allocation of page cache pages has been <ahem> underwhelming.
> (See: http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
> and http://marc.theaimsgroup.com/?l=linux-mm&m=109416852416997&w=2 )
> I wonder if this is because I just posted it to linux-mm or its not fleshed
> out enough yet to be interesting?

It was very noncontroversial. Since it's apparently useful to someone
and generally low-impact it should probably be merged.

-- wli

2004-09-06 22:48:40

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Mon, Sep 06, 2004 at 02:11:29PM -0500, Ray Bryant wrote:
> What is unexpected is that the amount of swap space used at a particular
> swappiness setting varies dramatically with the kernel version being
> tested, in spite of the fact that the basic swap_tendency calculation in
> refile_ianctive_zone() is unchanged. (Other, subtle changes in the vm as a
> whole and this routine in particular clearly effect the impact of that
> computation.)
> For example, at a swappiness value of 0, Kernel 2.6.5 swapped out 0 bytes,
> whereas Kernel 2.6.9-rc1-mm3 swapped out 10 GB. Similarly, most kernels
> have a significant change in behavior for swappiness values near 100, but
> for SLES9 the change point occurs at swappness=60.
> A scan of the change logs for swappiness related changes shows nothing that
> might explain these changes. My question is: "Is this change in behavior
> deliberate, or just a side effect of other changes that were made in the
> vm?" and "What kind of swappiness behavior might I expect to find in future
> kernels?".

IIRC no deliberate /proc/sys/vm/swappiness semantic changes were merged.
The policy tweakers have something to answer for here unless some stats
they rely upon have since been flubbed. Logging periodic snapshots of
/proc/vmstat for these benchmarks may be helpful to implicate specific
statistics' bungling or rule out statistic miscalculation as causes.

-- wli

2004-09-06 23:10:11

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Ray Bryant writes:

> Andrew (et al),
>
> The attached results started as an exercise to try to understand what value of
> "swappiness" we should be recommending to our Altix customers when they start
> running Linux 2.6 kernels. The benchmark is very simple -- a task first
> mallocs around 90% of memory, touches all of the memory, then sleeps forever.
> After the task begins to sleep, we start up a bunch of "dd" copies. When the
> dd's all complete, we record the amount of swap used, the size of the page
> cache, and the data rates for the dd's. (Exact details are given in the
> attachment.) The benchmark was repeated for swappiness values of 0, 20, 40,
> 60, 80, 100, for a number of recent 2.6 kernels.
>
> What is unexpected is that the amount of swap space used at a particular
> swappiness setting varies dramatically with the kernel version being tested,
> in spite of the fact that the basic swap_tendency calculation in
> refile_ianctive_zone() is unchanged. (Other, subtle changes in the vm as a
> whole and this routine in particular clearly effect the impact of that
> computation.)
>
> For example, at a swappiness value of 0, Kernel 2.6.5 swapped out 0 bytes,
> whereas Kernel 2.6.9-rc1-mm3 swapped out 10 GB. Similarly, most kernels
> have a significant change in behavior for swappiness values near 100, but
> for SLES9 the change point occurs at swappness=60.
>
> A scan of the change logs for swappiness related changes shows nothing that
> might explain these changes. My question is: "Is this change in behavior
> deliberate, or just a side effect of other changes that were made in the vm?"
> and "What kind of swappiness behavior might I expect to find in future kernels?".

The change was not deliberate but there have been some other people report
significant changes in the swappiness behaviour as well (see archives). It
has usually been of the increased swapping variety lately. It has been
annoying enough to the bleeding edge desktop users for a swag of out-of-tree
hacks to start appearing (like mine).

Cheers,
Con

2004-09-06 23:29:36

by Andrew Morton

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Con Kolivas <[email protected]> wrote:
>
> > A scan of the change logs for swappiness related changes shows nothing that
> > might explain these changes. My question is: "Is this change in behavior
> > deliberate, or just a side effect of other changes that were made in the vm?"
> > and "What kind of swappiness behavior might I expect to find in future kernels?".
>
> The change was not deliberate but there have been some other people report
> significant changes in the swappiness behaviour as well (see archives). It
> has usually been of the increased swapping variety lately. It has been
> annoying enough to the bleeding edge desktop users for a swag of out-of-tree
> hacks to start appearing (like mine).

All of which is largely wasted effort. It would be much more useful to get
down and identify which patch actually caused the behavioural change.

2004-09-06 23:34:47

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Andrew Morton writes:

> Con Kolivas <[email protected]> wrote:
>>
>> > A scan of the change logs for swappiness related changes shows nothing that
>> > might explain these changes. My question is: "Is this change in behavior
>> > deliberate, or just a side effect of other changes that were made in the vm?"
>> > and "What kind of swappiness behavior might I expect to find in future kernels?".
>>
>> The change was not deliberate but there have been some other people report
>> significant changes in the swappiness behaviour as well (see archives). It
>> has usually been of the increased swapping variety lately. It has been
>> annoying enough to the bleeding edge desktop users for a swag of out-of-tree
>> hacks to start appearing (like mine).
>
> All of which is largely wasted effort. It would be much more useful to get
> down and identify which patch actually caused the behavioural change.

I don't disagree. Is there anyone who has the time and is willing to do the
regression testing? This is a general appeal to the mailing list.

Cheers,
Con

2004-09-06 23:51:52

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

William Lee Irwin III wrote:

>On Mon, Sep 06, 2004 at 04:22:07PM -0500, Ray Bryant wrote:
>
>>We were planning on suggesting that such users set swappiness=0 to give
>>user pages priority over the page cache pages. But it doesn't look like
>>that works very well in the more recent kernels.
>>One (perhaps) desirable feature would be for intermediate values of
>>swappiness to have behavior in between the two extremes (mapped pages have
>>higher priority vs page cache pages having priority over unreferenced
>>mapped pages),
>>so that one would have finer grain control over the amount of swap used.
>>I'm not sure how to achieve such a goal, however. :-)
>>
>
>Priority paging again? A perennial suggestion.
>
>

I guess reclaim_mapped is effectively priority paging. But is reasonably
fragile I guess.

My mapped_page_cost stuff is possibly (I hope) more robust in theory, but
the change from never scanning mapped pages until some point, to always
scanning mapped pages slowly necessitated non trivial changes to things
like handling of use-once pages.

>
>On Mon, Sep 06, 2004 at 04:22:07PM -0500, Ray Bryant wrote:
>
>>On a separate issue, the response to my proposal for a mempolicy to control
>>allocation of page cache pages has been <ahem> underwhelming.
>>(See: http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
>> and http://marc.theaimsgroup.com/?l=linux-mm&m=109416852416997&w=2 )
>>I wonder if this is because I just posted it to linux-mm or its not fleshed
>>out enough yet to be interesting?
>>
>
>It was very noncontroversial. Since it's apparently useful to someone
>and generally low-impact it should probably be merged.
>
>

Yeah, I couldn't see any reason to not go ahead with it, which is why I
didn't say anything :)

2004-09-07 00:27:42

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Nick Piggin wrote:

>
>>
>> On Mon, Sep 06, 2004 at 04:22:07PM -0500, Ray Bryant wrote:
>>
>>> On a separate issue, the response to my proposal for a mempolicy to
>>> control
>>> allocation of page cache pages has been <ahem> underwhelming.
>>> (See: http://marc.theaimsgroup.com/?l=linux-mm&m=109416852113561&w=2
>>> and http://marc.theaimsgroup.com/?l=linux-mm&m=109416852416997&w=2 )
>>> I wonder if this is because I just posted it to linux-mm or its not
>>> fleshed out enough yet to be interesting?
>>>
>>
>> It was very noncontroversial. Since it's apparently useful to someone
>> and generally low-impact it should probably be merged.
>>
>>
>
> Yeah, I couldn't see any reason to not go ahead with it, which is why I
> didn't say anything :)
>
>

Cool. I'll go ahead and finish it and make it something useful then.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-07 01:26:40

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Tue, Sep 07, 2004 at 09:34:20AM +1000, Con Kolivas wrote:
> Andrew Morton writes:
>
> >Con Kolivas <[email protected]> wrote:
> >>
> >>> A scan of the change logs for swappiness related changes shows nothing
> >>that > might explain these changes. My question is: "Is this change in
> >> behavior
> >> > deliberate, or just a side effect of other changes that were made in
> >> the vm?" > and "What kind of swappiness behavior might I expect to find
> >> in future kernels?".
> >>
> >> The change was not deliberate but there have been some other people
> >> report significant changes in the swappiness behaviour as well (see
> >> archives). It has usually been of the increased swapping variety lately.
> >> It has been annoying enough to the bleeding edge desktop users for a
> >> swag of out-of-tree hacks to start appearing (like mine).
> >
> >All of which is largely wasted effort. It would be much more useful to get
> >down and identify which patch actually caused the behavioural change.
>
> I don't disagree. Is there anyone who has the time and is willing to do the
> regression testing? This is a general appeal to the mailing list.

Hi kernel fellows,

I volunteer. I'll try something tomorrow to compare swappiness of older kernels like
2.6.5 and 2.6.6, which were fine on SGI's Altix tests, up to current newer kernels
(on small memory boxes of course).

Someone needs to write a vmstat-like tool to parse /proc/vmstat.
The statistics in there allows us to watch the behaviour of VM
page reclaim code.

Con, if you could compile a list of reports we would be very grateful.

2004-09-07 01:35:19

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti writes:

> On Tue, Sep 07, 2004 at 09:34:20AM +1000, Con Kolivas wrote:
>> Andrew Morton writes:
>>
>> >Con Kolivas <[email protected]> wrote:
>> >>
>> >>> A scan of the change logs for swappiness related changes shows nothing
>> >>that > might explain these changes. My question is: "Is this change in
>> >> behavior
>> >> > deliberate, or just a side effect of other changes that were made in
>> >> the vm?" > and "What kind of swappiness behavior might I expect to find
>> >> in future kernels?".
>> >>
>> >> The change was not deliberate but there have been some other people
>> >> report significant changes in the swappiness behaviour as well (see
>> >> archives). It has usually been of the increased swapping variety lately.
>> >> It has been annoying enough to the bleeding edge desktop users for a
>> >> swag of out-of-tree hacks to start appearing (like mine).
>> >
>> >All of which is largely wasted effort. It would be much more useful to get
>> >down and identify which patch actually caused the behavioural change.
>>
>> I don't disagree. Is there anyone who has the time and is willing to do the
>> regression testing? This is a general appeal to the mailing list.
>
> Hi kernel fellows,
>
> I volunteer. I'll try something tomorrow to compare swappiness of older kernels like
> 2.6.5 and 2.6.6, which were fine on SGI's Altix tests, up to current newer kernels
> (on small memory boxes of course).
>
> Someone needs to write a vmstat-like tool to parse /proc/vmstat.
> The statistics in there allows us to watch the behaviour of VM
> page reclaim code.
>
> Con, if you could compile a list of reports we would be very grateful.

Apart from lots of "soft" reports I've been getting, the most obvious one
recently on the mailing list is this:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2

and no, I'm not referring to this thread because he tried one of my patches;
that's an old patch that I'm not even pushing any more.

Cheers,
Con

2004-09-07 10:46:54

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:

>
>Hi kernel fellows,
>
>I volunteer. I'll try something tomorrow to compare swappiness of older kernels like
>2.6.5 and 2.6.6, which were fine on SGI's Altix tests, up to current newer kernels
>(on small memory boxes of course).
>

Hi Marcelo,

Just a suggestion - I'd look at the thrashing control patch first.
I bet that's the cause.

2004-09-07 10:57:56

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Nick Piggin wrote:
>
>
> Marcelo Tosatti wrote:
>
>>
>> Hi kernel fellows,
>>
>> I volunteer. I'll try something tomorrow to compare swappiness of
>> older kernels like 2.6.5 and 2.6.6, which were fine on SGI's Altix
>> tests, up to current newer kernels (on small memory boxes of course).
>>
>
> Hi Marcelo,
>
> Just a suggestion - I'd look at the thrashing control patch first.
> I bet that's the cause.

Good point!

I recall one of my users found his workload which often hit swap lightly
was swapping much heavier and his performance dropped dramatically until
I stopped including the swap thrash control patch. I informed Rik about
it some time back so I'm not sure if he addressed it in the meantime.

Cheers,
Con

Attachments:

signature.asc (256.00 B)
OpenPGP digital signature

2004-09-07 17:05:04

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Nick Piggin wrote:
>
>
> Just a suggestion - I'd look at the thrashing control patch first.
> I bet that's the cause.
>
>
The token based thrashing control patch is also in 2.6.8.1-mm4, and that
kernel doesn't behave nearly as badly as 2.5.9-rc1-mm3, so I don't think
that is the culprit in that case.
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-08 02:45:12

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Mon, Sep 06, 2004 at 09:03:04PM -0300, Marcelo Tosatti wrote:
> On Tue, Sep 07, 2004 at 09:34:20AM +1000, Con Kolivas wrote:
> > Andrew Morton writes:
> >
> > >Con Kolivas <[email protected]> wrote:
> > >>
> > >>> A scan of the change logs for swappiness related changes shows nothing
> > >>that > might explain these changes. My question is: "Is this change in
> > >> behavior
> > >> > deliberate, or just a side effect of other changes that were made in
> > >> the vm?" > and "What kind of swappiness behavior might I expect to find
> > >> in future kernels?".
> > >>
> > >> The change was not deliberate but there have been some other people
> > >> report significant changes in the swappiness behaviour as well (see
> > >> archives). It has usually been of the increased swapping variety lately.
> > >> It has been annoying enough to the bleeding edge desktop users for a
> > >> swag of out-of-tree hacks to start appearing (like mine).
> > >
> > >All of which is largely wasted effort. It would be much more useful to get
> > >down and identify which patch actually caused the behavioural change.
> >
> > I don't disagree. Is there anyone who has the time and is willing to do the
> > regression testing? This is a general appeal to the mailing list.
>
> Hi kernel fellows,
>
> I volunteer. I'll try something tomorrow to compare swappiness of older kernels like
> 2.6.5 and 2.6.6, which were fine on SGI's Altix tests, up to current newer kernels
> (on small memory boxes of course).
>
> Someone needs to write a vmstat-like tool to parse /proc/vmstat.
> The statistics in there allows us to watch the behaviour of VM
> page reclaim code.
>
> Con, if you could compile a list of reports we would be very grateful.

Spent some time doing a few tests.

A contained test (512MB box, allocate 450MB of memory, touch and sleep +
huge dd) show's that 2.6.6 and 2.6.7 are equivalent (swapout 120-150M when the "dd"
starts writing out data).

2.6.9-rc1 swaps out 350M on the same test. Doenst seem nice, I'll try to isolate
what causes this tomorrow (this is probably what is hitting desktop users which
is fixed by Con's patches). Thats problem #1. Something needs a little tweaking
I think.

Ray, I see the additional swapouts increase the dd performance for your particular testcase:

on 2.6.6:
Total I/O Avg Swap min max pg cache min max
----------- --------- ------- ------ --------- ------- -------
0 242.47 MB/s 0 MB ( 0, 0) 3195 MB ( 3138, 3266)
20 256.06 MB/s 0 MB ( 0, 0) 3170 MB ( 3074, 3234)
40 267.29 MB/s 0 MB ( 0, 0) 3189 MB ( 3137, 3234)
60 289.43 MB/s 666 MB ( 72, 1680) 3847 MB ( 3296, 4817) <----------

So for this one testcase it is being beneficial.

However, decreasing the "swappiness" value does not seem to make much of a difference:

Kernel Version 2.6.8.1-mm4:
Total I/O Avg Swap min max pg cache min max
----------- --------- ------- ------ --------- ------- -------
0 287.28 MB/s 710 MB ( 46, 3060) 4082 MB ( 3426, 6308)
20 288.05 MB/s 508 MB ( 94, 1417) 3848 MB ( 3442, 4739)
40 287.03 MB/s 588 MB ( 199, 1251) 3909 MB ( 3570, 4515)
60 290.08 MB/s 640 MB ( 210, 1190) 3976 MB ( 3538, 4531)
80 287.73 MB/s 693 MB ( 316, 1195) 4049 MB ( 3713, 4545)
100 166.17 MB/s 26001 MB ( 26001, 26002) 28798 MB ( 28740, 28852)

Kernel Version 2.6.9-rc1-mm3:
Total I/O Avg Swap min max pg cache min max
----------- --------- ------- ------ --------- ------- -------
0 274.80 MB/s 10511 MB ( 5644, 14492) 13293 MB ( 8596, 17156)
20 267.02 MB/s 12624 MB ( 5578, 16287) 15298 MB ( 8468, 18889)
40 267.66 MB/s 13541 MB ( 6619, 17461) 16199 MB ( 9393, 20044)
60 233.73 MB/s 18094 MB ( 16550, 19676) 20629 MB ( 19103, 22192)
80 213.64 MB/s 20950 MB ( 15844, 22977) 23450 MB ( 18496, 25440)
100 164.58 MB/s 26004 MB ( 26004, 26004) 28410 MB ( 28327, 28455)

And that is a problem #2 - swappinness not being honoured. Guys,
any ideas on the reason for that?

Andrew, dirty_ratio and dirty_background_ratio (as low as 5% each) did not significantly
affect the amount of swapped out data, only a small effect on _how soon_ anonymous
memory was swapped out.

And finally, Ray, the difference you see between 2.6.6 and 2.6.7 can be explained,
as noted by others in this thread, to vmscan.c changes (page replacement/scanning policy
changes were made).

Will continue with more tests tomorrow.

2004-09-08 03:42:16

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

> Spent some time doing a few tests.
>
> A contained test (512MB box, allocate 450MB of memory, touch and sleep +
> huge dd) show's that 2.6.6 and 2.6.7 are equivalent (swapout 120-150M when the "dd"
> starts writing out data).
>
> 2.6.9-rc1 swaps out 350M on the same test. Doenst seem nice, I'll try to isolate
> what causes this tomorrow (this is probably what is hitting desktop users which
> is fixed by Con's patches). Thats problem #1. Something needs a little tweaking
> I think.
>
> Ray, I see the additional swapouts increase the dd performance for your particular testcase:
>
> on 2.6.6:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 242.47 MB/s 0 MB ( 0, 0) 3195 MB ( 3138, 3266)
> 20 256.06 MB/s 0 MB ( 0, 0) 3170 MB ( 3074, 3234)
> 40 267.29 MB/s 0 MB ( 0, 0) 3189 MB ( 3137, 3234)
> 60 289.43 MB/s 666 MB ( 72, 1680) 3847 MB ( 3296, 4817) <----------
>
> So for this one testcase it is being beneficial.
>
> However, decreasing the "swappiness" value does not seem to make much of a difference:
>
> Kernel Version 2.6.8.1-mm4:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 287.28 MB/s 710 MB ( 46, 3060) 4082 MB ( 3426, 6308)
> 20 288.05 MB/s 508 MB ( 94, 1417) 3848 MB ( 3442, 4739)
> 40 287.03 MB/s 588 MB ( 199, 1251) 3909 MB ( 3570, 4515)
> 60 290.08 MB/s 640 MB ( 210, 1190) 3976 MB ( 3538, 4531)
> 80 287.73 MB/s 693 MB ( 316, 1195) 4049 MB ( 3713, 4545)
> 100 166.17 MB/s 26001 MB ( 26001, 26002) 28798 MB ( 28740, 28852)
>
> Kernel Version 2.6.9-rc1-mm3:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 274.80 MB/s 10511 MB ( 5644, 14492) 13293 MB ( 8596, 17156)
> 20 267.02 MB/s 12624 MB ( 5578, 16287) 15298 MB ( 8468, 18889)
> 40 267.66 MB/s 13541 MB ( 6619, 17461) 16199 MB ( 9393, 20044)
> 60 233.73 MB/s 18094 MB ( 16550, 19676) 20629 MB ( 19103, 22192)
> 80 213.64 MB/s 20950 MB ( 15844, 22977) 23450 MB ( 18496, 25440)
> 100 164.58 MB/s 26004 MB ( 26004, 26004) 28410 MB ( 28327, 28455)
>
> And that is a problem #2 - swappinness not being honoured. Guys,
> any ideas on the reason for that?

Hum I just tested it on my 512MB box and swappiness is working fine on 2.6.8.. (it
has expected effects), unlike Ray's tests.

>
> Andrew, dirty_ratio and dirty_background_ratio (as low as 5% each) did not significantly
> affect the amount of swapped out data, only a small effect on _how soon_ anonymous
> memory was swapped out.
>
> And finally, Ray, the difference you see between 2.6.6 and 2.6.7 can be explained,
> as noted by others in this thread, to vmscan.c changes (page replacement/scanning policy
> changes were made).
>
> Will continue with more tests tomorrow.

2004-09-08 14:18:46

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Index: linux-2.6.9-rc1-mm3-kdb/mm/page-writeback.c
===================================================================
--- linux-2.6.9-rc1-mm3-kdb.orig/mm/page-writeback.c 2004-09-03 10:18:57.000000000 -0700
+++ linux-2.6.9-rc1-mm3-kdb/mm/page-writeback.c 2004-09-07 14:46:24.000000000 -0700
@@ -135,32 +135,19 @@
static void
get_dirty_limits(struct writeback_state *wbs, long *pbackground, long *pdirty)
{
- int background_ratio; /* Percentages */
- int dirty_ratio;
- int unmapped_ratio;
+ int unmapped;
long background;
long dirty;
struct task_struct *tsk;

get_writeback_state(wbs);

- unmapped_ratio = 100 - (wbs->nr_mapped * 100) / total_pages;

- dirty_ratio = vm_dirty_ratio;
- if (dirty_ratio > unmapped_ratio / 2)
- dirty_ratio = unmapped_ratio / 2;
+ unmapped = total_pages - wbs->nr_mapped;

- if (dirty_ratio < 5)
- dirty_ratio = 5;
+ background = (dirty_background_ratio * unmapped) / 100;
+ dirty = (vm_dirty_ratio * unmapped) / 100;

- /*
- * Keep the ratio between dirty_ratio and background_ratio roughly
- * what the sysctls are after dirty_ratio has been scaled (above).
- */
- background_ratio = dirty_background_ratio * dirty_ratio/vm_dirty_ratio;
-
- background = (background_ratio * total_pages) / 100;
- dirty = (dirty_ratio * total_pages) / 100;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
background += background / 4;

Attachments:

dirty-limits-in-terms-of-unmapped.patch (1.37 kB)

2004-09-08 15:15:44

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:

> Ray, I see the additional swapouts increase the dd performance for your particular testcase:
>
> on 2.6.6:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 242.47 MB/s 0 MB ( 0, 0) 3195 MB ( 3138, 3266)
> 20 256.06 MB/s 0 MB ( 0, 0) 3170 MB ( 3074, 3234)
> 40 267.29 MB/s 0 MB ( 0, 0) 3189 MB ( 3137, 3234)
> 60 289.43 MB/s 666 MB ( 72, 1680) 3847 MB ( 3296, 4817) <----------
>
> So for this one testcase it is being beneficial.
>

True enough, but the general trend is that increasing swapping decreases data
rate. This is even more true for the real applications that we are modelling
with this simple benchmark. In thosec cases, the user has a lot of mapped
data that they then write out using buffered I/O. If the mapped data gets
swapped out, then it may have to be swapped back in to be written out to the
file system. It would be faster to keep the mapped data from being swapped
out at all provided that there is enough page cache space to keep the devices
running at full speed.

(And yes, we've suggested that they mmap() the data files -- but sometimes
this is an ISV's code that it causing the problem and we can't necessarily get
them to update their codes to use the API's we want.)

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-08 17:32:42

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

> It seems to me that the 5% number in there is more or less arbitrary.
> If we are on a big memory Altix (4 TB), 5% of memory would be 200 GB.
> That is a lot of page cache.

For HPC, maybe. For a fileserver, it might be far too little. That's the
trouble ... it's all dependant on the workload. Personally, I'd prefer
to get rid of manual tweakables (which are a pain in the ass in the field
anyway), and try to have the kernel react to what the customer is doing.
I guess we can leave them there for overrides, but a self-tunable default
would be most desirable.

For instance, would be nice if we started doing writeback to the spindles
that weren't busy much earlier than if the disks were thrashing.

M.

2004-09-08 18:04:55

by Rik van Riel

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, 8 Sep 2004, Martin J. Bligh wrote:

> For HPC, maybe. For a fileserver, it might be far too little. That's the
> trouble ... it's all dependant on the workload. Personally, I'd prefer
> to get rid of manual tweakables (which are a pain in the ass in the field
> anyway), and try to have the kernel react to what the customer is doing.

Agreed. Many of these things should be self-tunable pretty
easily, too...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-09-08 18:09:47

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Tue, Sep 07, 2004 at 08:56:47PM +1000, Con Kolivas wrote:
> Nick Piggin wrote:
> >
> >
> >Marcelo Tosatti wrote:
> >
> >>
> >>Hi kernel fellows,
> >>
> >>I volunteer. I'll try something tomorrow to compare swappiness of
> >>older kernels like 2.6.5 and 2.6.6, which were fine on SGI's Altix
> >>tests, up to current newer kernels (on small memory boxes of course).
> >>
> >
> >Hi Marcelo,
> >
> >Just a suggestion - I'd look at the thrashing control patch first.
> >I bet that's the cause.
>
> Good point!
>
> I recall one of my users found his workload which often hit swap lightly
> was swapping much heavier and his performance dropped dramatically until
> I stopped including the swap thrash control patch. I informed Rik about
> it some time back so I'm not sure if he addressed it in the meantime.

Swap thrashing code doesnt affect anything, at least on my simple contained test.
With the same test, the amount of swapped out memory with 2.6.6/2.6.7 is 100-150MB,
while 2.6.8/2.6.9-mm* swaps out around 250MB.

I tried 2.6.7's "vmscan.c" on 2.6.8 without noticeable difference, I wonder why.

What I've noticed before with the swap token code is total crap interactivity
when memory hog is running. Which doesnt happen without it.

Con, I've seen your hard swappiness patch, why do you remove the current
swap_tendency calculation? Can you give us some insight into it?

The thing is, if the user thinks the machine is swapping out too heavily
he can always decrease vm_swappinness. Whatever change that might happen
on VM swapout policy can be tuned with vm_swappinness.

It works - its not very smooth, changing from "53" to "50" causes the
amount of swapped data to be 4 times smaller (due to
if (swap_tendency >= 100) I believe). Apart from that its fine,
and behaves as expected.

Maybe the current value of "60" is too high for most desktop users,
if so it can be decreased a little bit.

But whats the point of your patch?

2004-09-08 18:18:14

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, Sep 08, 2004 at 09:20:08AM -0500, Ray Bryant wrote:
>
>
> Marcelo Tosatti wrote:
>
> >
> >Andrew, dirty_ratio and dirty_background_ratio (as low as 5% each) did not
> >significantly affect the amount of swapped out data, only a small effect
> >on _how soon_ anonymous memory was swapped out.
> >
>
> I looked at the get_dirty_limits() code and for the test cases I was
> running,
> we have mapped > 90% of memory. So what will happen is that dirty_ratio
> will be thresholded at 5%, and background_ratio will be 1%. Changing
> values in /proc won't modify this at all (well, you could force
> background_ratio to 0%.)
>
> It seems to me that the 5% number in there is more or less arbitrary. If
> we are on a big memory Altix (4 TB), 5% of memory would be 200 GB. That is
> a lot of page cache.

On such huge memory machines I guess you have no choice but scale down the
dirty limits for them to be "equivalent" with reference to IO device speed.

And as Martin says it depends on the workload also.

> It seems get_dirty_limits() would be a lot simpler (and automatically scale
> as memory is mapped) if the limits were interpreted as being in terms of
> the amount of unmapped memory. A patch that implements this idea is
> attached.
> (Andrew -- if it comes to that I can submit this patch inline -- this is
> just for talking at the moment).
>
> I'll run a few of the tests with this modified kernel and see if they are
> any different.

Huh, that changes the meaning of the dirty limits. Dont think its suitable
for mainline.

> >And finally, Ray, the difference you see between 2.6.6 and 2.6.7 can be
> >explained, as noted by others in this thread, to vmscan.c changes (page
> >replacement/scanning policy
> >changes were made).
>
> Yep. I can probably live with those minor differences though. I would be
> happier if the system didn't swap anything at all for low values of
> swappiness, though.

Now that must work - if its not we have a problem.

2004-09-08 19:42:05

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:

>
>
> Huh, that changes the meaning of the dirty limits. Dont think its suitable
> for mainline.
>
>

The change is, in fact, not much different from what is already actually
there. The code in get_dirty_limits() adjusts the value of the user supplied
parameters in /proc/sys/vm depending on how much mapped memory there is. If
you undo the convoluted arithmetic that is in there, one finds that if you are
using the default dirty_ratio of 40%, then if the unmapped_ratio is between
80% and 10%, then

dirty_ratio = unmapped_ratio / 2;

and, a little bit of algebra later:

dirty = (total_pages - wbs->nr_mapped)/2

and

background = dirty_background_ratio/vm_background_ratio * (total_pages
- wbs->nr_mapped)

That is, for a wide range of memory usage, you are really running with an
dirty ratio of 50% stated in terms of the number of unmapped pages, and there
is no direct way to override this.

Of course, at the edges, the code changes these calculations. It just seems
to me that rather than continue the convoluted calculation that is in
get_dirty_limits(), we just make the outcome more explicit and tell the user
what is really going on.

We'd still have to figure out how to encourage a minimum page cache size of
some kind, which is what I understand the 5% min value for dirty_ratio is in
there for.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-08 19:50:20

by Diego Calleja

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

El Wed, 08 Sep 2004 14:04:31 -0400 (EDT) Rik van Riel <[email protected]> escribi?:

> On Wed, 8 Sep 2004, Martin J. Bligh wrote:
>
> > For HPC, maybe. For a fileserver, it might be far too little. That's the
> > trouble ... it's all dependant on the workload. Personally, I'd prefer
> > to get rid of manual tweakables (which are a pain in the ass in the field
> > anyway), and try to have the kernel react to what the customer is doing.
>
> Agreed. Many of these things should be self-tunable pretty
> easily, too...

I know this has been discussed before, but could a userspace daemon which
autotunes the tweakables do a better job wrt. to adapting the kernel
behaviour depending on the workload? Just like these days we have
irqbalance instead of a in-kernel "irq balancer". It's a alternative
worth of look at?

2004-09-08 19:52:07

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Martin J. Bligh wrote:
>>It seems to me that the 5% number in there is more or less arbitrary.
>>If we are on a big memory Altix (4 TB), 5% of memory would be 200 GB.
>>That is a lot of page cache.
>
>
> For HPC, maybe. For a fileserver, it might be far too little. That's the
> trouble ... it's all dependant on the workload. Personally, I'd prefer
> to get rid of manual tweakables (which are a pain in the ass in the field
> anyway), and try to have the kernel react to what the customer is doing.
> I guess we can leave them there for overrides, but a self-tunable default
> would be most desirable.
>

I agree that tunables are a pain in the butt, but a quick fix would to be at
least to add that 5% to the set of stuff settable in /proc/sys/vm. Most
workloads/systems won't need to change it. Very large Altix systems could
change it if needed.

I don't think that is at the root of the swappiness problems with
2.6.9-rc1-mm3, though.

> For instance, would be nice if we started doing writeback to the spindles
> that weren't busy much earlier than if the disks were thrashing.
>
> M.
>
>

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-08 20:56:09

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, Sep 08, 2004 at 02:35:03PM -0500, Ray Bryant wrote:
>
>
> Marcelo Tosatti wrote:
>
> >
> >
> >Huh, that changes the meaning of the dirty limits. Dont think its suitable
> >for mainline.
> >
> >
>
> The change is, in fact, not much different from what is already actually
> there. The code in get_dirty_limits() adjusts the value of the user
> supplied parameters in /proc/sys/vm depending on how much mapped memory
> there is. If you undo the convoluted arithmetic that is in there, one
> finds that if you are using the default dirty_ratio of 40%, then if the
> unmapped_ratio is between 80% and 10%, then
>
> dirty_ratio = unmapped_ratio / 2;
>
> and, a little bit of algebra later:
>
> dirty = (total_pages - wbs->nr_mapped)/2
>
> and
>
> background = dirty_background_ratio/vm_background_ratio * (total_pages
> - wbs->nr_mapped)
>
> That is, for a wide range of memory usage, you are really running with an
> dirty ratio of 50% stated in terms of the number of unmapped pages, and
> there is no direct way to override this.

OK I see, yes.

> Of course, at the edges, the code changes these calculations. It just
> seems to me that rather than continue the convoluted calculation that is in
> get_dirty_limits(), we just make the outcome more explicit and tell the user
> what is really going on.
>
> We'd still have to figure out how to encourage a minimum page cache size of
> some kind, which is what I understand the 5% min value for dirty_ratio is
> in there for.

For the user "dirty_ratio" and "dirty_background_ratio" means "percentage
of total memory" (thats how it has been traditionally in Linux). And right now,
as you noted, we dont do that way.

There's probably a good reason for the "no more than half of unmapped memory".
Andrew ?

2004-09-08 21:11:42

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

>> > For HPC, maybe. For a fileserver, it might be far too little. That's the
>> > trouble ... it's all dependant on the workload. Personally, I'd prefer
>> > to get rid of manual tweakables (which are a pain in the ass in the field
>> > anyway), and try to have the kernel react to what the customer is doing.
>>
>> Agreed. Many of these things should be self-tunable pretty
>> easily, too...
>
> I know this has been discussed before, but could a userspace daemon which
> autotunes the tweakables do a better job wrt. to adapting the kernel
> behaviour depending on the workload? Just like these days we have
> irqbalance instead of a in-kernel "irq balancer". It's a alternative
> worth of look at?

I really don't see any point in pushing the self-tuning of the kernel out
into userspace. What are you hoping to achieve?

M.

2004-09-08 21:55:31

by Diego Calleja

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

El Wed, 08 Sep 2004 14:10:32 -0700 "Martin J. Bligh" <[email protected]> escribi?:

> I really don't see any point in pushing the self-tuning of the kernel out
> into userspace. What are you hoping to achieve?

Well your own words explain it, I think. "it's all dependant on the workload",
which means that only the user knows what he is going to do with the machine
and that the kernel doesn't knows that, so the algoritms built in the kernel
may be "not perfect" in their auto-tuning job. The point would be to
be able to take decisions the kernel can't take because userspace would
know better how the system should behave, say stupids things like "I want
to have this set of tunables which make compile jobs 0.01% faster at 12:00
because at that time a cron job autocompiles cvs snapshots of some project,
and at 6:00 those jobs have already finished so at that time I want a set
of tunables optimized for my everyday desktop work which make everthing 0.01%
slower but the system feels a 5% more reponsive". (well, for that a shell script
is enought) Kernel however could try to adapt itself to those changes, and do
it well...I don't really know. This came to my mind when I was thinking about
irqbalance case, which was somewhat similar, I also remember a discussion
about a "ktuned" in the mailing lists...I guess it's a matter of coding it
and get some numbers :-/

2004-09-08 22:22:07

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

>> I really don't see any point in pushing the self-tuning of the kernel out
>> into userspace. What are you hoping to achieve?
>
> Well your own words explain it, I think. "it's all dependant on the workload",
> which means that only the user knows what he is going to do with the machine
> and that the kernel doesn't knows that, so the algoritms built in the kernel
> may be "not perfect" in their auto-tuning job. The point would be to
> be able to take decisions the kernel can't take because userspace would
> know better how the system should behave, say stupids things like "I want
> to have this set of tunables which make compile jobs 0.01% faster at 12:00
> because at that time a cron job autocompiles cvs snapshots of some project,
> and at 6:00 those jobs have already finished so at that time I want a set
> of tunables optimized for my everyday desktop work which make everthing 0.01%
> slower but the system feels a 5% more reponsive". (well, for that a shell script
> is enought) Kernel however could try to adapt itself to those changes, and do
> it well...I don't really know. This came to my mind when I was thinking about
> irqbalance case, which was somewhat similar, I also remember a discussion
> about a "ktuned" in the mailing lists...I guess it's a matter of coding it
> and get some numbers :-/

Oh, I see what you mean. I think we're much better off sticking the mechanism
for autotuning stuff in the kernel - if we want to feed in policy from
userspace (which 99.9% of people will never do), it can be through such
things as /proc/sys/vm/swapiness ... that could just fix them statically
if people insisted on such things.

Having the kernel do something sensible by default is what we aim for ...
overrides still are possible if the sysadmin really thinks they're smarter ;-)

M.

2004-09-08 23:25:15

by Rik van Riel

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, 8 Sep 2004, Martin J. Bligh wrote:

> Oh, I see what you mean. I think we're much better off sticking the
> mechanism for autotuning stuff in the kernel -

Agreed. Autotuning like this appears to work best by having
a self adjusting algorithm, often negative feedback loops so
things get balanced out automagically.

Works way better than anything looking at indirect data and
then tweaking some magic knobs...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-09-08 23:36:39

by Alan

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Mer, 2004-09-08 at 22:10, Martin J. Bligh wrote:
> I really don't see any point in pushing the self-tuning of the kernel out
> into userspace. What are you hoping to achieve?

What if there is more than one right answer to "self-tune" policy. Also
what if you want an application to tweak the tuning in ways that are
different to general policy ?

2004-09-08 23:43:23

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

> On Mer, 2004-09-08 at 22:10, Martin J. Bligh wrote:
>> I really don't see any point in pushing the self-tuning of the kernel out
>> into userspace. What are you hoping to achieve?
>
> What if there is more than one right answer to "self-tune" policy. Also
> what if you want an application to tweak the tuning in ways that are
> different to general policy ?

It's still overridable from userspace, I'd think. But having a sensible
default in the kernel makes a crapload of sense to me. We have better
faster access to data from there - if there are really things that aren't
just parameters to the tuning algorithm it'd have to repeatedly poke
values into hard overrides. Do-able, but not what we want by default,
I'd think.

M.

2004-09-09 01:13:38

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:
> On Tue, Sep 07, 2004 at 08:56:47PM +1000, Con Kolivas wrote:
>
>>Nick Piggin wrote:
>>
>>>
>>>Marcelo Tosatti wrote:
>>>
>>>
>>>>Hi kernel fellows,
>>>>
>>>>I volunteer. I'll try something tomorrow to compare swappiness of
>>>>older kernels like 2.6.5 and 2.6.6, which were fine on SGI's Altix
>>>>tests, up to current newer kernels (on small memory boxes of course).
>>>>
>>>
>>>Hi Marcelo,
>>>
>>>Just a suggestion - I'd look at the thrashing control patch first.
>>>I bet that's the cause.
>>
>>Good point!
>>
>>I recall one of my users found his workload which often hit swap lightly
>>was swapping much heavier and his performance dropped dramatically until
>>I stopped including the swap thrash control patch. I informed Rik about
>>it some time back so I'm not sure if he addressed it in the meantime.
>
>
> Swap thrashing code doesnt affect anything, at least on my simple contained test.
> With the same test, the amount of swapped out memory with 2.6.6/2.6.7 is 100-150MB,
> while 2.6.8/2.6.9-mm* swaps out around 250MB.
>
> I tried 2.6.7's "vmscan.c" on 2.6.8 without noticeable difference, I wonder why.
>
> What I've noticed before with the swap token code is total crap interactivity
> when memory hog is running. Which doesnt happen without it.
>
> Con, I've seen your hard swappiness patch, why do you remove the current
> swap_tendency calculation? Can you give us some insight into it?

Sure. It was painfully simple. The swap tendency worked basically the
same but did not take into account distress. ie It made the "swappiness"
knob purely dependant on mapped ratio. For whatever reason, if the
swappiness value is the same in later kernels but swaps more, there is
more "distress" meaning we are priority scanning much more aggressively.

Cheers,
Con

Attachments:

signature.asc (256.00 B)
OpenPGP digital signature

2004-09-09 03:02:08

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo,

For what it is worth, here are the benchmark results for the kernel with the
patch I discussed before, along with the previous 2.6.9-rc1-mm3 results:

Kernel Version 2.6.9-rc1-mm3:
Total I/O Avg Swap min max pg cache min max
----------- --------- ------- ------ --------- ------- -------
0 274.80 MB/s 10511 MB ( 5644, 14492) 13293 MB ( 8596, 17156)
20 267.02 MB/s 12624 MB ( 5578, 16287) 15298 MB ( 8468, 18889)
40 267.66 MB/s 13541 MB ( 6619, 17461) 16199 MB ( 9393, 20044)
60 233.73 MB/s 18094 MB ( 16550, 19676) 20629 MB ( 19103, 22192)
80 213.64 MB/s 20950 MB ( 15844, 22977) 23450 MB ( 18496, 25440)
100 164.58 MB/s 26004 MB ( 26004, 26004) 28410 MB ( 28327, 28455)

Kernel Version 2.6.9-rc1-mm3-kdb-nrmap:
Total I/O Avg Swap min max pg cache min max
----------- --------- ------- ------ --------- ------- -------
0 286.93 MB/s 7288 MB ( 4847, 14536) 10122 MB ( 7771, 17138)
20 252.43 MB/s 13305 MB ( 3950, 18337) 15938 MB ( 6866, 20876)
40 268.52 MB/s 11538 MB ( 5333, 17298) 14238 MB ( 8247, 19836)
60 242.72 MB/s 16367 MB ( 8652, 21217) 18909 MB ( 11514, 23561)
80 212.94 MB/s 19424 MB ( 5632, 24047) 21937 MB ( 8567, 26469)
100 161.66 MB/s 26006 MB ( 26004, 26007) 28445 MB ( 28407, 28471)

Except for the swappiness = 20 case, things are a smallish bit better for
the modified kernel than 2.6.9-rc1-mm3. Clearly we haven't found the root of
this problem yet.

Have you still been unable to duplicate this problem on a small i386 platform?
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-09 03:09:36

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, Sep 08, 2004 at 10:06:20PM -0500, Ray Bryant wrote:
> For what it is worth, here are the benchmark results for the kernel with
> the patch I discussed before, along with the previous 2.6.9-rc1-mm3 results:
[...]
> Except for the swappiness = 20 case, things are a smallish bit better for
> the modified kernel than 2.6.9-rc1-mm3. Clearly we haven't found the root
> of this problem yet.
> Have you still been unable to duplicate this problem on a small i386
> platform?

Please log periodic snapshots of /proc/vmstat during runs on kernel
versions before and after major behavioral shifts.

-- wli

2004-09-09 13:39:25

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, Sep 08, 2004 at 10:06:20PM -0500, Ray Bryant wrote:
> Marcelo,
>
> For what it is worth, here are the benchmark results for the kernel with
> the patch I discussed before, along with the previous 2.6.9-rc1-mm3 results:
>
> Kernel Version 2.6.9-rc1-mm3:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 274.80 MB/s 10511 MB ( 5644, 14492) 13293 MB ( 8596, 17156)
> 20 267.02 MB/s 12624 MB ( 5578, 16287) 15298 MB ( 8468, 18889)
> 40 267.66 MB/s 13541 MB ( 6619, 17461) 16199 MB ( 9393, 20044)
> 60 233.73 MB/s 18094 MB ( 16550, 19676) 20629 MB ( 19103, 22192)
> 80 213.64 MB/s 20950 MB ( 15844, 22977) 23450 MB ( 18496, 25440)
> 100 164.58 MB/s 26004 MB ( 26004, 26004) 28410 MB ( 28327, 28455)
>
> Kernel Version 2.6.9-rc1-mm3-kdb-nrmap:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 286.93 MB/s 7288 MB ( 4847, 14536) 10122 MB ( 7771, 17138)
> 20 252.43 MB/s 13305 MB ( 3950, 18337) 15938 MB ( 6866, 20876)
> 40 268.52 MB/s 11538 MB ( 5333, 17298) 14238 MB ( 8247, 19836)
> 60 242.72 MB/s 16367 MB ( 8652, 21217) 18909 MB ( 11514, 23561)
> 80 212.94 MB/s 19424 MB ( 5632, 24047) 21937 MB ( 8567, 26469)
> 100 161.66 MB/s 26006 MB ( 26004, 26007) 28445 MB ( 28407, 28471)
>
> Except for the swappiness = 20 case, things are a smallish bit better for
> the modified kernel than 2.6.9-rc1-mm3. Clearly we haven't found the root
> of this problem yet.

Indeed.

> Have you still been unable to duplicate this problem on a small i386
> platform?

Yes right, I have been unable to duplicate the problem on small i386 box.
What about your tests?

2004-09-09 14:13:58

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

William Lee Irwin III wrote:

>
> Please log periodic snapshots of /proc/vmstat during runs on kernel
> versions before and after major behavioral shifts.
>
>
> -- wli
>

wli,

Attached is the output you requested for two kernel versions: 2.6.8.1-mm4 and
2.6.9-rc1-mm3 + the nrmap_patch (that patch didn't make much difference so
this should be good enough for comparison purposes, and it was the kernel I
had built.)

Because there are so many parameters in /proc/vmstat, I had to split the
output up (more or less arbitarily) into three files to get something you
could actually look at with an editor. Even then it requires 100 columns or so.

Data in the files was observed every 10 s for the duration of the runs.
The first line of each file is a summary line, the next lines are incremental,
except for the nr_* stats, which are assumed to be absolute numbers.
Corresponding lines of each of the three output files per run were printed.
at the same time.

Data file names in the attached archive are of the form:

vmstat.$s.$t.$f.$v

where s=swappiness, here 0, 60, or 100.
t=trial here always 5
f=file 1, 2, or 3 for the three output files per run
v=version kernel version

There is a script, compare.csh in the archive that can be run to edit
corresponding files for each version. It cycles through the list examining
corresponding files foreach swapiness level and foreach output file.

The benchmark output for kernel version v is in reduce.out.$v, for reference.

I can post the perl script that was used to create this output, if there is
interest.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

Attachments:

vmstat.tar.bz2 (25.22 kB)

2004-09-09 14:18:49

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:

>
>>Have you still been unable to duplicate this problem on a small i386
>>platform?
>
>
> Yes right, I have been unable to duplicate the problem on small i386 box.
> What about your tests?
>
>
>

I haven't had time to try on i386 yet. I guess I will have to.
Thanks for trying, anyway.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-09 16:26:28

by Bill Davidsen

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Martin J. Bligh wrote:
>>>I really don't see any point in pushing the self-tuning of the kernel out
>>>into userspace. What are you hoping to achieve?
>>
>>Well your own words explain it, I think. "it's all dependant on the workload",
>>which means that only the user knows what he is going to do with the machine
>>and that the kernel doesn't knows that, so the algoritms built in the kernel
>>may be "not perfect" in their auto-tuning job. The point would be to
>>be able to take decisions the kernel can't take because userspace would
>>know better how the system should behave, say stupids things like "I want
>>to have this set of tunables which make compile jobs 0.01% faster at 12:00
>>because at that time a cron job autocompiles cvs snapshots of some project,
>>and at 6:00 those jobs have already finished so at that time I want a set
>>of tunables optimized for my everyday desktop work which make everthing 0.01%
>>slower but the system feels a 5% more reponsive". (well, for that a shell script
>>is enought) Kernel however could try to adapt itself to those changes, and do
>>it well...I don't really know. This came to my mind when I was thinking about
>>irqbalance case, which was somewhat similar, I also remember a discussion
>>about a "ktuned" in the mailing lists...I guess it's a matter of coding it
>>and get some numbers :-/
>
>
> Oh, I see what you mean. I think we're much better off sticking the mechanism
> for autotuning stuff in the kernel - if we want to feed in policy from
> userspace (which 99.9% of people will never do), it can be through such
> things as /proc/sys/vm/swapiness ... that could just fix them statically
> if people insisted on such things.
>
> Having the kernel do something sensible by default is what we aim for ...
> overrides still are possible if the sysadmin really thinks they're smarter ;-)

Don't need to be smarter, just to know more about program behaviour and
desired system response than the o/s. Think of it as providing more
information to the o/s and letting the o/s autotune with the additional
information if that makes you feel better ;-)

The only tune I really want, and I patched it in 2.4.18 or so, is the
ability to reserve some memory which will never be used for anything but
program pages. Or some way to stop a program making a single pass
through a large file (80GB) from pushing programs out of memory. I'm
watching that now as I read the list, vmstat says si/so are about 800
each, on a system which runs at zero otherwise.

I'm going to look at capabilities when I get that mythical free time,
and see if I can let a few programs lock their ass in memory, having
60-80% of RAM dedicated to once-read pages is silly.

Sorry for the rant, but this is the single most common problem I see
with performance.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-09-09 17:24:20

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

William Lee Irwin III wrote:
>> Please log periodic snapshots of /proc/vmstat during runs on kernel
>> versions before and after major behavioral shifts.

On Thu, Sep 09, 2004 at 09:16:08AM -0500, Ray Bryant wrote:
> Attached is the output you requested for two kernel versions: 2.6.8.1-mm4
> and 2.6.9-rc1-mm3 + the nrmap_patch (that patch didn't make much difference
> so this should be good enough for comparison purposes, and it was the
> kernel I had built.)
> Because there are so many parameters in /proc/vmstat, I had to split the
> output up (more or less arbitarily) into three files to get something you
> could actually look at with an editor. Even then it requires 100 columns
> or so.

This will do fine. I'll examine these for anomalous maintenance of
statistics and/or operational variables used to drive page replacement.

Thanks.

-- wli

2004-09-14 18:33:09

by Florin Andrei

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Mon, 2004-09-06 at 16:27, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:

> > The change was not deliberate but there have been some other people report
> > significant changes in the swappiness behaviour as well (see archives). It
> > has usually been of the increased swapping variety lately. It has been
> > annoying enough to the bleeding edge desktop users for a swag of out-of-tree
> > hacks to start appearing (like mine).
>
> All of which is largely wasted effort.

>From a highly-theoretical, ivory-tower perspective, maybe; i am not the
one to pass judgement.
>From a realistic, "fix it 'cause it's performing worse than MSDOS
without a disk cache" perspective, definitely not true.

I've found a situation where the vanilla kernel has a behaviour that
makes no sense:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2

A patch by Con Kolivas fixed it:

http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2

I cannot offer more details, i have no time for experiments, i just need
a system that works. The vanilla kernel does not.

--
Florin Andrei

http://florin.myip.org/

2004-09-14 22:03:56

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Tue, Sep 14, 2004 at 11:31:53AM -0700, Florin Andrei wrote:
> On Mon, 2004-09-06 at 16:27, Andrew Morton wrote:
> > Con Kolivas <[email protected]> wrote:
>
> > > The change was not deliberate but there have been some other people report
> > > significant changes in the swappiness behaviour as well (see archives). It
> > > has usually been of the increased swapping variety lately. It has been
> > > annoying enough to the bleeding edge desktop users for a swag of out-of-tree
> > > hacks to start appearing (like mine).
> >
> > All of which is largely wasted effort.
>
> >From a highly-theoretical, ivory-tower perspective, maybe; i am not the
> one to pass judgement.
> >From a realistic, "fix it 'cause it's performing worse than MSDOS
> without a disk cache" perspective, definitely not true.
>
> I've found a situation where the vanilla kernel has a behaviour that
> makes no sense:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2
>
> A patch by Con Kolivas fixed it:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2
>
> I cannot offer more details, i have no time for experiments, i just need
> a system that works. The vanilla kernel does not.

Have you tried to decrease the value of /proc/sys/vm/swappiness
to say 30 and see what you get?

Andrew's point is that we should identify the problem - Con's patch
rewrites swapping policy.

2004-09-14 22:59:31

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:
> On Tue, Sep 14, 2004 at 11:31:53AM -0700, Florin Andrei wrote:
>
>>On Mon, 2004-09-06 at 16:27, Andrew Morton wrote:
>>
>>>Con Kolivas <[email protected]> wrote:
>>
>>>> The change was not deliberate but there have been some other people report
>>>> significant changes in the swappiness behaviour as well (see archives). It
>>>> has usually been of the increased swapping variety lately. It has been
>>>> annoying enough to the bleeding edge desktop users for a swag of out-of-tree
>>>> hacks to start appearing (like mine).
>>>
>>>All of which is largely wasted effort.
>>
>>>From a highly-theoretical, ivory-tower perspective, maybe; i am not the
>>one to pass judgement.
>>>From a realistic, "fix it 'cause it's performing worse than MSDOS
>>without a disk cache" perspective, definitely not true.
>>
>>I've found a situation where the vanilla kernel has a behaviour that
>>makes no sense:
>>
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2
>>
>>A patch by Con Kolivas fixed it:
>>
>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2
>>
>>I cannot offer more details, i have no time for experiments, i just need
>>a system that works. The vanilla kernel does not.
>
>
> Have you tried to decrease the value of /proc/sys/vm/swappiness
> to say 30 and see what you get?
>
> Andrew's point is that we should identify the problem - Con's patch
> rewrites swapping policy.

I already answered this. That hard swappiness patch does not really
rewrite swapping policy. It identifies exactly what has changed because
it does not count "distress in the swap tendency". Therefore if the
swappiness value is the same, the mapped ratio is the same (in the
workload) yet the vm is swappinig more, it is getting into more
"distress". The mapped ratio is the same but the "distress" is for some
reason much higher in later kernels, meaning the priority of our
scanning is getting more and more intense. This should help direct your
searches.

These are the relevant lines of code _from mainline_:

distress = 100 >> zone->prev_priority
mapped_ratio = (sc->nr_mapped * 100) / total_memory;
swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
if (swap_tendency >= 100)
- reclaim_mapped = 1;

That hard swappiness patch effectively made "distress == 0" always.

Con

2004-09-14 23:27:56

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Wed, Sep 15, 2004 at 08:53:21AM +1000, Con Kolivas wrote:
> Marcelo Tosatti wrote:
> >On Tue, Sep 14, 2004 at 11:31:53AM -0700, Florin Andrei wrote:
> >
> >>On Mon, 2004-09-06 at 16:27, Andrew Morton wrote:
> >>
> >>>Con Kolivas <[email protected]> wrote:
> >>
> >>>>The change was not deliberate but there have been some other people
> >>>>report significant changes in the swappiness behaviour as well (see
> >>>>archives). It has usually been of the increased swapping variety
> >>>>lately. It has been annoying enough to the bleeding edge desktop users
> >>>>for a swag of out-of-tree hacks to start appearing (like mine).
> >>>
> >>>All of which is largely wasted effort.
> >>
> >>>From a highly-theoretical, ivory-tower perspective, maybe; i am not the
> >>one to pass judgement.
> >>>From a realistic, "fix it 'cause it's performing worse than MSDOS
> >>without a disk cache" perspective, definitely not true.
> >>
> >>I've found a situation where the vanilla kernel has a behaviour that
> >>makes no sense:
> >>
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2
> >>
> >>A patch by Con Kolivas fixed it:
> >>
> >>http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2
> >>
> >>I cannot offer more details, i have no time for experiments, i just need
> >>a system that works. The vanilla kernel does not.
> >
> >
> >Have you tried to decrease the value of /proc/sys/vm/swappiness
> >to say 30 and see what you get?
> >
> >Andrew's point is that we should identify the problem - Con's patch
> >rewrites swapping policy.
>
> I already answered this. That hard swappiness patch does not really
> rewrite swapping policy. It identifies exactly what has changed because
> it does not count "distress in the swap tendency". Therefore if the
> swappiness value is the same, the mapped ratio is the same (in the
> workload) yet the vm is swappinig more, it is getting into more
> "distress". The mapped ratio is the same but the "distress" is for some
> reason much higher in later kernels, meaning the priority of our
> scanning is getting more and more intense. This should help direct your
> searches.

> These are the relevant lines of code _from mainline_:
>
> distress = 100 >> zone->prev_priority
> mapped_ratio = (sc->nr_mapped * 100) / total_memory;
> swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
> if (swap_tendency >= 100)
> - reclaim_mapped = 1;
>
>
> That hard swappiness patch effectively made "distress == 0" always.

OK.

So isnt it true that decreasing vm_swappiness should compensate
distress and have the same effect of your patch?

To be fair I'm just arguing, haven't really looked at the code.

2004-09-15 00:27:48

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:
> On Wed, Sep 15, 2004 at 08:53:21AM +1000, Con Kolivas wrote:
>
>>Marcelo Tosatti wrote:
>>
>>>On Tue, Sep 14, 2004 at 11:31:53AM -0700, Florin Andrei wrote:
>>>
>>>
>>>>On Mon, 2004-09-06 at 16:27, Andrew Morton wrote:
>>>>
>>>>
>>>>>Con Kolivas <[email protected]> wrote:
>>>>
>>>>>>The change was not deliberate but there have been some other people
>>>>>>report significant changes in the swappiness behaviour as well (see
>>>>>>archives). It has usually been of the increased swapping variety
>>>>>>lately. It has been annoying enough to the bleeding edge desktop users
>>>>>>for a swag of out-of-tree hacks to start appearing (like mine).
>>>>>
>>>>>All of which is largely wasted effort.
>>>>
>>>>>From a highly-theoretical, ivory-tower perspective, maybe; i am not the
>>>>one to pass judgement.
>>>>>From a realistic, "fix it 'cause it's performing worse than MSDOS
>>>>without a disk cache" perspective, definitely not true.
>>>>
>>>>I've found a situation where the vanilla kernel has a behaviour that
>>>>makes no sense:
>>>>
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2
>>>>
>>>>A patch by Con Kolivas fixed it:
>>>>
>>>>http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2
>>>>
>>>>I cannot offer more details, i have no time for experiments, i just need
>>>>a system that works. The vanilla kernel does not.
>>>
>>>
>>>Have you tried to decrease the value of /proc/sys/vm/swappiness
>>>to say 30 and see what you get?
>>>
>>>Andrew's point is that we should identify the problem - Con's patch
>>>rewrites swapping policy.
>>
>>I already answered this. That hard swappiness patch does not really
>>rewrite swapping policy. It identifies exactly what has changed because
>>it does not count "distress in the swap tendency". Therefore if the
>>swappiness value is the same, the mapped ratio is the same (in the
>>workload) yet the vm is swappinig more, it is getting into more
>>"distress". The mapped ratio is the same but the "distress" is for some
>>reason much higher in later kernels, meaning the priority of our
>>scanning is getting more and more intense. This should help direct your
>>searches.
>
>
>>These are the relevant lines of code _from mainline_:
>>
>>distress = 100 >> zone->prev_priority
>>mapped_ratio = (sc->nr_mapped * 100) / total_memory;
>>swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
>>if (swap_tendency >= 100)
>>- reclaim_mapped = 1;
>>
>>
>>That hard swappiness patch effectively made "distress == 0" always.
>
> So isnt it true that decreasing vm_swappiness should compensate
> distress and have the same effect of your patch?

Nope. We swap large amounts with the wrong workload at swappiness==0
where we wouldn't before at swappiness==60. ie there is no workaround
possible without changing the code in some way.

> To be fair I'm just arguing, haven't really looked at the code.

Thats cool ;)

Cheers,
Con

2004-09-15 16:55:19

by Florin Andrei

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

On Tue, 2004-09-14 at 13:15, Marcelo Tosatti wrote:
> On Tue, Sep 14, 2004 at 11:31:53AM -0700, Florin Andrei wrote:

> > I've found a situation where the vanilla kernel has a behaviour that
> > makes no sense:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=109237941331221&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=109237959719868&w=2
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=109238126314192&w=2
> >
> > A patch by Con Kolivas fixed it:
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=109410526607990&w=2
> >
> > I cannot offer more details, i have no time for experiments, i just need
> > a system that works. The vanilla kernel does not.
>
> Have you tried to decrease the value of /proc/sys/vm/swappiness
> to say 30 and see what you get?

I cannot repeat the experiment (due to lack of time) but i remember
having severe issues with low /proc/*/swappiness values on vanilla
kernels. I don't recall the nature of the problems, though, because it's
been a while. Sorry. But that was what prompted me to search further.
The search ended when i found and tested Con's patch.

The current swapping policy of the vanilla 2.6 kernels is broken. Badly.

At least my system works properly now. <shrug>

--
Florin Andrei

http://florin.myip.org/

2004-09-16 20:23:16

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Con!

Spent some time reading your patch...

On Wed, Sep 15, 2004 at 10:22:46AM +1000, Con Kolivas wrote:

> >>I already answered this. That hard swappiness patch does not really
> >>rewrite swapping policy. It identifies exactly what has changed because
> >>it does not count "distress in the swap tendency". Therefore if the
> >>swappiness value is the same, the mapped ratio is the same (in the
> >>workload) yet the vm is swappinig more, it is getting into more
> >>"distress". The mapped ratio is the same but the "distress" is for some
> >>reason much higher in later kernels, meaning the priority of our
> >>scanning is getting more and more intense. This should help direct your
> >>searches.

Well if "distress" is getting higher (with similar workload/pressure)
thats because VM is having a harder time freeing pages (priority increases,
distress increases).

You say "distress is getting higher in later kernels". Can you expand
more on that? How did you find this out, and can you be more especific
wrt "later kernels".

> >>These are the relevant lines of code _from mainline_:
> >>
> >>distress = 100 >> zone->prev_priority
> >>mapped_ratio = (sc->nr_mapped * 100) / total_memory;
> >>swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
> >>if (swap_tendency >= 100)
> >>- reclaim_mapped = 1;
> >>
> >>
> >>That hard swappiness patch effectively made "distress == 0" always.
> >
> >So isnt it true that decreasing vm_swappiness should compensate
> >distress and have the same effect of your patch?
>
> Nope. We swap large amounts with the wrong workload at swappiness==0
> where we wouldn't before at swappiness==60. ie there is no workaround
> possible without changing the code in some way.

"we wouldn't before" refering to older kernel versions?

I see you add a "z->nr_unmapped" watermark a bit above "z->pages_high",
and use that to set "pgdat->mapped_nrpages" to what needs to be freed
so z->free_pages reaches "z->nr_unmapped".

And then you use that per-pgdat "mapped_nrpages" count to avoid:

- moving mapped pages to inactive list (wasting the swappiness algorithm)
- swapping out pages at shrink_list

Those two only happen when pgdat->mapped_nrpages is zero, which
becomes true when we go below pages_low.

To resume, deactivation/swapout of mapped pages only happens when we
go any zone pages_low.

Correct?

Now with v2.6 stock kernel, kswapd will deactivate (using vm_swappiness algorithm)
and swapout pages between the low and high zone watermarks.

That avoids swapping out as hard as possible until we go below pages_low.

IMHO this might be OK for common desktop workloads where people complain
about swap, but might be harmful for other workloads where swapping out on
advance unused anonymous process memory is a _gain_.

I dont understand this check on balance_pgdat (kswapd worker function):

+ /*
+ * kswapd does a light balance_pgdat() when there is less than 1/3
+ * ram free provided there is less than vm_mapped % of that ram
+ * mapped.
+ */
+ if (maplimit && sc.nr_mapped * 100 / total_memory > vm_mapped)
+ return 0;
+

So "if not any zone is under pages_low, and more than vm_mapped % of ram
is mapped, bail out."

I dont get what you're trying to achieve with this.

>
> >To be fair I'm just arguing, haven't really looked at the code.
>
> Thats cool ;)

I still think swapout behaviour can be correctly tuned with vm_swappiness,
and agree with Andrew on that we should not change anything in the algorithm
if this can be tuned.

Andrew, maybe decrease vm_swappiness to 50 on the next -mm for a test?

2004-09-17 00:23:28

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Marcelo Tosatti wrote:
> Con!
>
> Spent some time reading your patch...

Great!

> Well if "distress" is getting higher (with similar workload/pressure)
> thats because VM is having a harder time freeing pages (priority increases,
> distress increases).
>
> You say "distress is getting higher in later kernels". Can you expand
> more on that? How did you find this out, and can you be more especific
> wrt "later kernels".

When I say earlier kernels I mean prior to 2.6.8.

I'm still referring to the "hard_swappiness patch" that Florin was using
to fix his problem which diagnosed that distress increased. I'm sorry if
my newer patch confuses the issue. hard_swappiness effectively changed this:

-distress = 100 >> zone->prev_priority
-mapped_ratio = (sc->nr_mapped * 100) / total_memory;
-swap_tendency = mapped_ratio / 2 + distress + vm_swappiness
-if (swap_tendency >= 100)
- reclaim_mapped = 1;

into this:

+mapped_ratio = (sc->nr_mapped * 100) / total_memory;
+swap_tendency = mapped_ratio / 2 + vm_swappiness
+if (swap_tendency >= 100)
+ reclaim_mapped = 1;

This made swap_tendency dependant _only_ on the mapped_ratio. Now if you
load up the same desktop and applications your mapped_ratio will be
virtually identical regardless of the kernel. If you then copy a large
file or convert a large video file etc, then the mapped ratio will be
unchanged. Therefore if the swapping increased with this workload in
2.6.8 and later kernels but did _not_ increase with hard_swappiness it
must be the "distress" value which is entirely dependant on
zone->prev_priority. Does that make my conclusion clearer?

Below here you're referring to my mapped_watermark patch so I'll address
that separately to avoid confusion.

> I see you add a "z->nr_unmapped" watermark a bit above "z->pages_high",
> and use that to set "pgdat->mapped_nrpages" to what needs to be freed
> so z->free_pages reaches "z->nr_unmapped".
>
> And then you use that per-pgdat "mapped_nrpages" count to avoid:
>
> - moving mapped pages to inactive list (wasting the swappiness algorithm)
> - swapping out pages at shrink_list
>
> Those two only happen when pgdat->mapped_nrpages is zero, which
> becomes true when we go below pages_low.
>
> To resume, deactivation/swapout of mapped pages only happens when we
> go any zone pages_low.
>
> Correct?

Yes apart from one big caveat. scanning is expensive, so it only scans
at lowest priority (DEF_PRIORITY). If it fails to release enough memory
it simply returns quietly. This means that if vm pressure is hard enough
and occurs frequently/fast enough it will still drop down below
pages_high even if the watermarks have not been re-achieved. Then the
normal algorithm will take over.

> Now with v2.6 stock kernel, kswapd will deactivate (using vm_swappiness algorithm)
> and swapout pages between the low and high zone watermarks.
>
> That avoids swapping out as hard as possible until we go below pages_low.
>
> IMHO this might be OK for common desktop workloads where people complain
> about swap, but might be harmful for other workloads where swapping out on
> advance unused anonymous process memory is a _gain_.

As I said, it only does it lightly, and it's tunable.

> I dont understand this check on balance_pgdat (kswapd worker function):

> + if (maplimit && sc.nr_mapped * 100 / total_memory > vm_mapped)
> + return 0;
> +
>
> So "if not any zone is under pages_low, and more than vm_mapped % of ram
> is mapped, bail out."

This will only be hit if "maplimit" is true. This means we have entered
balance_pgdat only due to the unmapped watermark (zone->pages_min * 4).
Here is where the real "tunable" comes into play. If greater than
vm_mapped % of ram is mapped (ie application) pages, it will not do
anything at this watermark. By default it is set to 66%. Setting it to 0
inactivates this patch entirely and makes the vm behave much like
setting swappiness to 100 in mainline.

> I still think swapout behaviour can be correctly tuned with vm_swappiness,
> and agree with Andrew on that we should not change anything in the algorithm
> if this can be tuned.

I agree it can be, but something in the logic has definitely changed,
and a different value is not giving users like Florin the desired result
any more.

Cheers,
Con

2004-09-28 01:50:59

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

---

linux-2.6-npiggin/mm/vmscan.c | 14 +++++++++++++-
1 files changed, 13 insertions(+), 1 deletion(-)

diff -puN mm/vmscan.c~vm-no-wild-kswapd mm/vmscan.c
--- linux-2.6/mm/vmscan.c~vm-no-wild-kswapd 2004-09-25 10:09:16.000000000 +1000
+++ linux-2.6-npiggin/mm/vmscan.c 2004-09-25 10:15:58.000000000 +1000
@@ -993,10 +993,13 @@ static int balance_pgdat(pg_data_t *pgda
int to_free = nr_pages;
int priority;
int i;
- int total_scanned = 0, total_reclaimed = 0;
+ int total_scanned, total_reclaimed;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;

+loop_again:
+ total_scanned = 0;
+ total_reclaimed = 0;
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.nr_mapped = read_page_state(nr_mapped);
@@ -1095,6 +1098,15 @@ scan:
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
blk_congestion_wait(WRITE, HZ/10);
+
+ /*
+ * We do this so kswapd doesn't build up large priorities for
+ * example when it is freeing in parallel with allocators. It
+ * matches the direct reclaim path behaviour in terms of impact
+ * on zone->*_priority.
+ */
+ if (total_reclaimed >= 32)
+ goto loop_again;
}
out:
for (i = 0; i < pgdat->nr_zones; i++) {

_

Attachments:

vm-no-wild-kswapd.patch (1.20 kB)

2004-09-28 03:59:20

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Ray Bryant wrote:

> Nick,
>
> As reported to you elsewhere (and duplicated here to get on this thread),
> application of the patch you sent (attached) dramatically changes the
> swappiness behavior of the 2.6.9-rc1 (and presumably the rc2) kernel.
>
> Here are the updated results:
>
> Previously:
>
> Kernel Version 2.6.9-rc1-mm3:
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 274.80 MB/s 10511 MB ( 5644, 14492) 13293 MB ( 8596, 17156)
> 20 267.02 MB/s 12624 MB ( 5578, 16287) 15298 MB ( 8468, 18889)
> 40 267.66 MB/s 13541 MB ( 6619, 17461) 16199 MB ( 9393, 20044)
> 60 233.73 MB/s 18094 MB ( 16550, 19676) 20629 MB ( 19103, 22192)
> 80 213.64 MB/s 20950 MB ( 15844, 22977) 23450 MB ( 18496, 25440)
> 100 164.58 MB/s 26004 MB ( 26004, 26004) 28410 MB ( 28327, 28455)
>
> With Nick Piggin et al fix:
>
> Kernel Version: linux-2.6.9-rc1-mm3-kswapdfix
>
> Total I/O Avg Swap min max pg cache min max
> ----------- --------- ------- ------ --------- ------- -------
> 0 279.97 MB/s 89 MB ( 12, 265) 3062 MB ( 2947, 3267)
> 20 283.55 MB/s 161 MB ( 15, 372) 3190 MB ( 3011, 3427)
> 40 282.32 MB/s 204 MB ( 6, 407) 3187 MB ( 2995, 3331)
> 60 279.42 MB/s 72 MB ( 15, 171) 3091 MB ( 3027, 3155)
> 80 283.34 MB/s 920 MB ( 144, 3028) 3904 MB ( 3106, 5957)
> 100 160.55 MB/s 26008 MB ( 26007, 26008) 28473 MB ( 28455, 28487)
>
> (The drop at swappiness of 60 may just be randomness, not sure it
> is significant, but these results are all based on 5 trials.)
>
> At any rate, this patch appears to fix the problems I was seeing before.
> (See
> http://marc.theaimsgroup.com/?l=linux-kernel&m=109449778320333&w=2
>
> for further details of the benchmark and the test environment).
>
>

Thanks Ray. From looking over your old results, it appears that -kswapdfix
probably has the nicest swappiness ramp, which is probably to be expected,
as the problem that is being fixed did exist in all other kernels you
tested,
but the later ones just had other aggrivating changes.

The swappiness=60 weirdness might just be some obscure interaction with the
workload. If that is the case, it is probably not too important, however it
could be due to a possible oversight in my patch....

I'm not in front of the code right now, so I can't give you a new patch to
try yet... if you're up for modifying it yourself though: we possibly should
be updating "zone->prev_priority" after each 32 (SWAP_CLUSTER_MAX) pages
freed.
So change the following:

==> if (total_reclaimed >= 32)
==> break;
}
out:
for (i = 0; i < pgdat->nr_zones; i++) {
... /* this updates zone->prev_priority */
}
=> if (!all_zones_ok)
=> goto loop_again;
return total_reclaimed;
}

so it looks something like that.

If you could run your tests again on that, it would be great.

Thanks
Nick

2004-09-29 00:36:23

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Nick Piggin wrote:

>
> Thanks Ray. From looking over your old results, it appears that
> -kswapdfix
> probably has the nicest swappiness ramp, which is probably to be
> expected,
> as the problem that is being fixed did exist in all other kernels you
> tested,
> but the later ones just had other aggrivating changes.
>
> The swappiness=60 weirdness might just be some obscure interaction
> with the
> workload. If that is the case, it is probably not too important,
> however it
> could be due to a possible oversight in my patch....
>

Here is a patch on top of the last one - if you can give it a test
some time, that would be great.

Thanks
Nick

Attachments:

vm-no-wild-kswapd2.patch (1.15 kB)

2004-09-29 04:19:27

[permalink] [raw]

Subject: Re: swapping and the value of /proc/sys/vm/swappiness

Nick Piggin wrote:
>
>
> Nick Piggin wrote:
>
>>
>> Thanks Ray. From looking over your old results, it appears that
>> -kswapdfix
>> probably has the nicest swappiness ramp, which is probably to be
>> expected,
>> as the problem that is being fixed did exist in all other kernels you
>> tested,
>> but the later ones just had other aggrivating changes.
>>
>> The swappiness=60 weirdness might just be some obscure interaction
>> with the
>> workload. If that is the case, it is probably not too important,
>> however it
>> could be due to a possible oversight in my patch....
>>
>
> Here is a patch on top of the last one - if you can give it a test
> some time, that would be great.
>
> Thanks
> Nick
>
>

Nick,

I'll put it on my list of TODO's. FYI, a full test like the ones I have been
running, at 6 swappiness levels, takes around 5 hours of machine time, so it
sometimes we have to wait for a slot that big to open up. :-)

Ray
> ------------------------------------------------------------------------
>
>
>
>
> ---
>
> linux-2.6-npiggin/mm/vmscan.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff -puN mm/vmscan.c~vm-no-wild-kswapd2 mm/vmscan.c
> --- linux-2.6/mm/vmscan.c~vm-no-wild-kswapd2 2004-09-29 10:30:49.000000000 +1000
> +++ linux-2.6-npiggin/mm/vmscan.c 2004-09-29 10:34:00.000000000 +1000
> @@ -991,6 +991,7 @@ out:
> static int balance_pgdat(pg_data_t *pgdat, int nr_pages)
> {
> int to_free = nr_pages;
> + int all_zones_ok;
> int priority;
> int i;
> int total_scanned, total_reclaimed;
> @@ -1013,10 +1014,11 @@ loop_again:
> }
>
> for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> - int all_zones_ok = 1;
> int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
> unsigned long lru_pages = 0;
>
> + all_zones_ok = 1;
> +
> if (nr_pages == 0) {
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -1106,7 +1108,7 @@ scan:
> * on zone->*_priority.
> */
> if (total_reclaimed >= SWAP_CLUSTER_MAX)
> - goto loop_again;
> + break;
> }
> out:
> for (i = 0; i < pgdat->nr_zones; i++) {
> @@ -1114,6 +1116,9 @@ out:
>
> zone->prev_priority = zone->temp_priority;
> }
> + if (!all_zones_ok)
> + goto loop_again;
> +
> return total_reclaimed;
> }
>
>
> _

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[email protected] [email protected]
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------

2004-09-30 17:12:38