2004-04-17 19:38:59

by William Lee Irwin III

Subject: vmscan.c heuristic adjustment for smaller systems

Marc Singer reported an issue where an embedded ARM system performed
poorly because page replacement prematurely evicted mapped memory,
even though very little mapped pagecache was in use to begin with.

The following patch attempts to address the issue by using the
_maximum_ of vm_swappiness and distress to add to the mapped ratio, so
that distress doesn't contribute to swap_tendency until it exceeds
vm_swappiness, and afterward the effect is not cumulative.

The intended effect is that swap_tendency should vary in a more jagged
way, and not be elevated by distress beyond vm_swappiness until distress
exceeds vm_swappiness. For instance, under the old additive formula,
with distress at 50 (distress is 100 >> zone->prev_priority), no
distinction is made between a vm_swappiness of 50 and a vm_swappiness
of 90 given the same mapped_ratio: both push swap_tendency to or past
the reclaim trigger.
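
To make the arithmetic concrete (illustrative numbers only; mapped page
reclaim triggers once swap_tendency reaches 100):

	/*
	 * With distress == 50 and mapped_ratio ignored:
	 *
	 *   old: 50 + 50 == 100 and 50 + 90 == 140   -- both trigger,
	 *        so swappiness 50 and 90 behave identically
	 *   new: max(50, 50) == 50 and max(50, 90) == 90 -- neither
	 *        triggers, and the two settings stay distinguishable
	 */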

Marc Singer has results showing this is an improvement, and can
hopefully clarify as needed. Help determining whether this policy change
is an improvement for a broader variety of systems would be appreciated.


-- wli


Index: singer-2.6.5-mm6/mm/vmscan.c
===================================================================
--- singer-2.6.5-mm6.orig/mm/vmscan.c 2004-04-14 23:21:19.000000000 -0700
+++ singer-2.6.5-mm6/mm/vmscan.c 2004-04-17 11:09:35.000000000 -0700
@@ -636,7 +636,7 @@
*
* A 100% value of vm_swappiness overrides this algorithm altogether.
*/
- swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
+ swap_tendency = mapped_ratio / 2 + max(distress, vm_swappiness);

/*
* Now use this metric to decide whether to start moving mapped memory


2004-04-17 21:33:35

by William Lee Irwin III

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
>> Marc Singer reported an issue where an embedded ARM system performed
>> poorly due to page replacement potentially prematurely replacing
>> mapped memory where there was very little mapped pagecache in use to
>> begin with.
>> Marc Singer has results where this is an improvement, and hopefully can
>> clarify as-needed. Help determining whether this policy change is an
>> improvement for a broader variety of systems would be appreciated.

On Sat, Apr 17, 2004 at 02:29:58PM -0700, Marc Singer wrote:
> I have some numbers to clarify the 'improvement'.
> Setup:
> ARM922 CPU, 200MHz, 32MiB RAM
> NFS mounted rootfs, tcp, hard, v3, 4K blocks
> Test application copies 41MiB file and prints the elapsed time
> The two scenarios differ only in the setting of /proc/sys/vm/swappiness.

This doesn't match your first response. Anyway, this one gets
scrapped. I guess if swappiness solves it, then so much the better.


-- wli

2004-04-17 21:30:05

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
> Marc Singer reported an issue where an embedded ARM system performed
> poorly due to page replacement potentially prematurely replacing
> mapped memory where there was very little mapped pagecache in use to
> begin with.
>
> Marc Singer has results where this is an improvement, and hopefully can
> clarify as-needed. Help determining whether this policy change is an
> improvement for a broader variety of systems would be appreciated.

I have some numbers to clarify the 'improvement'.

Setup:
ARM922 CPU, 200MHz, 32MiB RAM
NFS mounted rootfs, tcp, hard, v3, 4K blocks
Test application copies 41MiB file and prints the elapsed time

The two scenarios differ only in the setting of /proc/sys/vm/swappiness.

                     swappiness
                  60 (default)      0
                  ------------   --------
elapsed time(s)       52.48        52.9
                      53.13        52.91
                      53.13        52.87
                      52.53        53.03
                      52.35        53.02

mean                  52.72        52.94

I'd say that there is no statistically significant difference between
these sets of times. However, after I've run the test program, I run
the command "ls -l /proc"

                     swappiness
                  60 (default)      0
                  ------------   --------
elapsed time(s)         18          1
                        30          1
                        33          1

This is the problem. Once RAM fills with IO buffers, the kernel's
tendency to evict mapped pages ruins interactive performance.

2004-04-17 21:53:01

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 17, 2004 at 12:38:55PM -0700, William Lee Irwin III wrote:
> >> Marc Singer reported an issue where an embedded ARM system performed
> >> poorly due to page replacement potentially prematurely replacing
> >> mapped memory where there was very little mapped pagecache in use to
> >> begin with.
> >> Marc Singer has results where this is an improvement, and hopefully can
> >> clarify as-needed. Help determining whether this policy change is an
> >> improvement for a broader variety of systems would be appreciated.
>
> On Sat, Apr 17, 2004 at 02:29:58PM -0700, Marc Singer wrote:
> > I have some numbers to clarify the 'improvement'.
> > Setup:
> > ARM922 CPU, 200MHz, 32MiB RAM
> > NFS mounted rootfs, tcp, hard, v3, 4K blocks
> > Test application copies 41MiB file and prints the elapsed time
> > The two scenarios differ only in the setting of /proc/sys/vm/swappiness.
>
> This doesn't match your first response. Anyway, this one gets
> scrapped. I guess if swappiness solves it, then so much the better.

Huh? Where do you see a discrepancy? I don't think I claimed that
the test program performance changed. The noticeable difference is in
interactivity once the page cache fills. IMHO, 30 seconds to do a
file listing on /proc is extreme.


2004-04-17 23:21:48

by Andrew Morton

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer <[email protected]> wrote:
>
> I'd say that there is no statistically significant difference between
> these sets of times. However, after I've run the test program, I run
> the command "ls -l /proc"
>
>                      swappiness
>                   60 (default)      0
>                   ------------   --------
> elapsed time(s)         18          1
>                         30          1
>                         33          1

How on earth can it take half a minute to list /proc?

> This is the problem. Once RAM fills with IO buffers, the kernel's
> tendency to evict mapped pages ruins interactive performance.

Is everything here on NFS, or are local filesystems involved? (What does
"mount" say?)

2004-04-17 23:30:43

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> Marc Singer <[email protected]> wrote:
> >
> > I'd say that there is no statistically significant difference between
> > these sets of times. However, after I've run the test program, I run
> > the command "ls -l /proc"
> >
> >                      swappiness
> >                   60 (default)      0
> >                   ------------   --------
> > elapsed time(s)         18          1
> >                         30          1
> >                         33          1
>
> How on earth can it take half a minute to list /proc?

I've watched the vmscan code at work. The memory pressure is so high
that it reclaims mapped pages zealously. The program's code pages are
being evicted frequently.

I would like to show a video of the ls -l /proc command. It's
remarkable. The program pauses after displaying each line.

> > This is the problem. Once RAM fills with IO buffers, the kernel's
> > tendency to evict mapped pages ruins interactive performance.
>
> Is everything here on NFS, or are local filesystems involved? (What does
> "mount" say?)

# mount
rootfs on / type rootfs (rw)
/dev/root on / type nfs (rw,v2,rsize=4096,wsize=4096,hard,udp,nolock,addr=192.168.8.1)
proc on /proc type proc (rw)
devpts on /dev/pts type devpts (rw)

I've been wondering if the swappiness isn't a red herring. Is it
reasonable that the distress value (in refill_inactive_zones()) be
50?
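
(For reference, my reading of how distress is derived -- treat the
exact values as my interpretation of vmscan.c:

	distress = 100 >> zone->prev_priority

	prev_priority:  12   6   2   1   0
	distress:        0   1  25  50 100

so a distress of 50 means the last scan of the zone got within one
step of the most desperate priority.)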

2004-04-17 23:52:20

by Andrew Morton

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer <[email protected]> wrote:
>
> On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> > Marc Singer <[email protected]> wrote:
> > >
> > > I'd say that there is no statistically significant difference between
> > > these sets of times. However, after I've run the test program, I run
> > > the command "ls -l /proc"
> > >
> > >                      swappiness
> > >                   60 (default)      0
> > >                   ------------   --------
> > > elapsed time(s)         18          1
> > >                         30          1
> > >                         33          1
> >
> > How on earth can it take half a minute to list /proc?
>
> I've watched the vmscan code at work. The memory pressure is so high
> that it reclaims mapped pages zealously. The program's code pages are
> being evicted frequently.

Which tends to imply that the VM is not reclaiming any of that nfs-backed
pagecache.

> I've been wondering if the swappiness isn't a red herring. Is it
> reasonable that the distress value (in refill_inactive_zones()) be
> 50?

I'd assume that setting swappiness to zero simply means that you still have
all of your libc in pagecache when running ls.

What happens if you do the big file copy, then run `sync', then do the ls?

Have you experimented with the NFS mount options? v2? UDP?

2004-04-18 00:11:37

by Trond Myklebust

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, 2004-04-17 at 16:51, Andrew Morton wrote:

> What happens if you do the big file copy, then run `sync', then do the ls?

You shouldn't ever need to do "sync" with NFS unless you are using
mmap(). close() will suffice to flush out all dirty pages in the case of
ordinary file writes.

Cheers,
Trond

2004-04-18 00:23:47

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> Marc Singer <[email protected]> wrote:
> >
> > On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
> > > Marc Singer <[email protected]> wrote:
> > > >
> > > > I'd say that there is no statistically significant difference between
> > > > these sets of times. However, after I've run the test program, I run
> > > > the command "ls -l /proc"
> > > >
> > > >                      swappiness
> > > >                   60 (default)      0
> > > >                   ------------   --------
> > > > elapsed time(s)         18          1
> > > >                         30          1
> > > >                         33          1
> > >
> > > How on earth can it take half a minute to list /proc?
> >
> > I've watched the vmscan code at work. The memory pressure is so high
> > that it reclaims mapped pages zealously. The program's code pages are
> > being evicted frequently.
>
> Which tends to imply that the VM is not reclaiming any of that nfs-backed
> pagecache.

I don't think that's the whole story. The question is why.

> > I've been wondering if the swappiness isn't a red herring. Is it
> > reasonable that the distress value (in refill_inactive_zones()) be
> > 50?
>
> I'd assume that setting swappiness to zero simply means that you still have
> all of your libc in pagecache when running ls.

Perhaps. I think it is more important that it is still mapped.

>
> What happens if you do the big file copy, then run `sync', then do the ls?

It still takes a long time. I'm watching the network load as I
perform the ls. There's almost 20 seconds of no screen activity while
NFS reloads the code.

>
> Have you experimented with the NFS mount options? v2? UDP?

Doesn't seem to matter. I've used v2, v3, UDP and TCP.

I have more data.

All of these tests are performed at the console, one command at a
time. I have a telnet daemon available, so I open a second connection
to the target system. I run a continuous loop of file copies on the
console and I execute 'ls -l /proc' in the telnet window. It's a
little slow, but it isn't unreasonable. Hmm. I then run the copy
command in the telnet window followed by the 'ls -l /proc'. It works
fine. I logout of the console session and perform the telnet window
test again. The 'ls -l /proc' takes 30 seconds.

When there is more than one process running, everything is peachy.
When there is only one process (no context switching) I see the slow
performance. I had a hypothesis, but my test of that hypothesis
failed.

2004-04-18 01:06:19

by William Lee Irwin III

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
>> This doesn't match your first response. Anyway, this one gets
>> scrapped. I guess if swappiness solves it, then so much the better.

On Sat, Apr 17, 2004 at 02:52:57PM -0700, Marc Singer wrote:
> Huh? Where do you see a discrepancy? I don't think I claimed that
> the test program performance changed. The noticeable difference is in
> interactivity once the page cache fills. IMHO, 30 seconds to do a
> file listing on /proc is extreme.

Oh, sorry, it was unclear to me whether the test changed anything but
swappiness (i.e. I couldn't tell whether they included the patch etc.)


-- wli

2004-04-18 01:59:23

by William Lee Irwin III

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer <[email protected]> wrote:
>> I've watched the vmscan code at work. The memory pressure is so high
>> that it reclaims mapped pages zealously. The program's code pages are
>> being evicted frequently.

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> Which tends to imply that the VM is not reclaiming any of that nfs-backed
> pagecache.

The observation that prompted the switch from addition to max() was:

On Sat, Apr 17, 2004 at 10:57:24AM -0700, Marc Singer wrote:
> I don't think that's the whole story. I printed distress,
> mapped_ratio, and swappiness when vmscan starts trying to reclaim
> mapped pages.
> reclaim_mapped: distress 50 mapped_ratio 0 swappiness 60
> 50 + 60 > 100
> So, part of the problem is swappiness. I could set that value to 25,
> for example, to stop the machine from swapping.
> I'd be fine stopping here, except for your comment about what
> swappiness means. In my case, nearly none of memory is mapped. It is
> zone priority which has dropped to 1 that is precipitating the
> eviction. Is this what you expect and want?
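
Plugging those numbers into the two formulas (assuming the usual
trigger at swap_tendency >= 100):

	old: swap_tendency = 0/2 + 50 + 60     = 110  -> reclaim mapped
	new: swap_tendency = 0/2 + max(50, 60) =  60  -> leave mapped pages alone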


Marc Singer <[email protected]> wrote:
>> I've been wondering if the swappiness isn't a red herring. Is it
>> reasonable that the distress value (in refill_inactive_zones()) be
>> 50?

On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> I'd assume that setting swappiness to zero simply means that you still have
> all of your libc in pagecache when running ls.
> What happens if you do the big file copy, then run `sync', then do the ls?
> Have you experimented with the NFS mount options? v2? UDP?

I wonder if the ptep_test_and_clear_young() TLB flushing is related.


-- wli

2004-04-18 03:38:22

by Nick Piggin

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer wrote:
> On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
>
>>Marc Singer <[email protected]> wrote:
>>
>>>On Sat, Apr 17, 2004 at 04:21:25PM -0700, Andrew Morton wrote:
>>>
>>>>
>>>>How on earth can it take half a minute to list /proc?
>>>
>>>I've watched the vmscan code at work. The memory pressure is so high
>>>that it reclaims mapped pages zealously. The program's code pages are
>>>being evicted frequently.
>>
>>Which tends to imply that the VM is not reclaiming any of that nfs-backed
>>pagecache.
>
>
> I don't think that's the whole story. The question is why.
>

swappiness is pretty arbitrary and unfortunately it means
different things to machines with different sized memory.

Also, once you *have* gone past the reclaim_mapped threshold,
mapped pages aren't really given any preference above
unmapped pages.

I have a small patchset which splits the active list roughly
into mapped and unmapped pages. It might hopefully solve your
problem. Would you give it a try? It is pretty stable here.

Nick

2004-04-18 03:53:59

by Andrew Morton

Subject: Re: vmscan.c heuristic adjustment for smaller systems

William Lee Irwin III <[email protected]> wrote:
>
> On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> > I'd assume that setting swappiness to zero simply means that you still have
> > all of your libc in pagecache when running ls.
> > What happens if you do the big file copy, then run `sync', then do the ls?
> > Have you experimented with the NFS mount options? v2? UDP?
>
> I wonder if the ptep_test_and_clear_young() TLB flushing is related.

That, or page_referenced() always returns true on this ARM implementation
or some such silliness. Everything here points at the VM being unable to
reclaim that clean pagecache.

2004-04-18 04:17:52

by William Lee Irwin III

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
> swappiness is pretty arbitrary and unfortunately it means
> different things to machines with different sized memory.
> Also, once you *have* gone past the reclaim_mapped threshold,
> mapped pages aren't really given any preference above
> unmapped pages.
> I have a small patchset which splits the active list roughly
> into mapped and unmapped pages. It might hopefully solve your
> problem. Would you give it a try? It is pretty stable here.

It would be interesting to see the results of this on Marc's system.
It's a more comprehensive solution than tweaking numbers.


-- wli

2004-04-18 04:41:49

by Nick Piggin

Subject: Re: vmscan.c heuristic adjustment for smaller systems

William Lee Irwin III wrote:
> On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
>
>>swappiness is pretty arbitrary and unfortunately it means
>>different things to machines with different sized memory.
>>Also, once you *have* gone past the reclaim_mapped threshold,
>>mapped pages aren't really given any preference above
>>unmapped pages.
>>I have a small patchset which splits the active list roughly
>>into mapped and unmapped pages. It might hopefully solve your
>>problem. Would you give it a try? It is pretty stable here.
>
>
> It would be interesting to see the results of this on Marc's system.
> It's a more comprehensive solution than tweaking numbers.
>

Well, here is the current patch against 2.6.5-mm6. -mm is
different enough from -linus now that it is not 100% trivial
to patch (mainly the rmap and hugepages work).

Marc if you could test this it would be great. I've been doing
very swap-heavy tests for the last 24 hours on an SMP system
here, so it should be fairly stable.

It replaces /proc/sys/vm/swappiness with
/proc/sys/vm/mapped_page_cost, which is in units of unmapped
pages. I have found 8 to be pretty good, so that is the
default. Higher makes it less likely to evict mapped pages.
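
Roughly, the idea of the balancing (a sketch only, with made-up names;
see the attached patch for the real code):

	static void balance_scan(int nr_to_scan, int mapped_page_cost,
				 int *nr_scan_unmapped, int *nr_scan_mapped)
	{
		/* charge a mapped page mapped_page_cost times what an
		 * unmapped pagecache page costs, i.e. scan the unmapped
		 * list that much harder before touching mapped pages */
		*nr_scan_unmapped = nr_to_scan;
		*nr_scan_mapped   = nr_to_scan / mapped_page_cost;  /* default 8 */
	}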

Nick


Attachments:
split-active-list.patch (27.31 kB)

2004-04-18 05:05:37

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 06:06:16PM -0700, William Lee Irwin III wrote:
> On Sat, Apr 17, 2004 at 02:33:33PM -0700, William Lee Irwin III wrote:
> >> This doesn't match your first response. Anyway, this one gets
> >> scrapped. I guess if swappiness solves it, then so much the better.
>
> On Sat, Apr 17, 2004 at 02:52:57PM -0700, Marc Singer wrote:
> > Huh? Where do you see a discrepancy? I don't think I claimed that
> > the test program performance changed. The noticeable difference is in
> > interactivity once the page cache fills. IMHO, 30 seconds to do a
> > file listing on /proc is extreme.
>
> Oh, sorry, it was unclear to me whether the test changed anything but
> swappiness (i.e. I couldn't tell whether they included the patch etc.)

Ah, OK. Now I understand your confusion. Based on the numbers, it is
clear that your last patch does exactly the same thing as setting
swappiness. It is true that I didn't apply it. Still, I think that
your change is worth consideration since setting swappiness to zero is
such a blunt solution. I apologize for not making this clear before.


2004-04-18 05:10:32

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sun, Apr 18, 2004 at 02:41:12PM +1000, Nick Piggin wrote:
> William Lee Irwin III wrote:
> >On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
> >
> >>swappiness is pretty arbitrary and unfortunately it means
> >>different things to machines with different sized memory.
> >>Also, once you *have* gone past the reclaim_mapped threshold,
> >>mapped pages aren't really given any preference above
> >>unmapped pages.
> >>I have a small patchset which splits the active list roughly
> >>into mapped and unmapped pages. It might hopefully solve your
> >>problem. Would you give it a try? It is pretty stable here.
> >
> >
> >It would be interesting to see the results of this on Marc's system.
> >It's a more comprehensive solution than tweaking numbers.
> >
>
> Well, here is the current patch against 2.6.5-mm6. -mm is
> different enough from -linus now that it is not 100% trivial
> to patch (mainly the rmap and hugepages work).

Will this work against 2.6.5 without -mm6?

As an aside, I've been using SVN to manage my kernel sources. While
I'd be thrilled to make it work, it simply doesn't seem to have the
heavy lifting capability to handle the kernel work. I know the
rudiments of using BK. What I'd like is some sort of HOWTO with
examples of common tasks for kernel development. Know of any?

> Marc if you could test this it would be great. I've been doing
> very swap-heavy tests for the last 24 hours on an SMP system
> here, so it should be fairly stable.

I'm game.

> It replaces /proc/sys/vm/swappiness with
> /proc/sys/vm/mapped_page_cost, which is in units of unmapped
> pages. I have found 8 to be pretty good, so that is the
> default. Higher makes it less likely to evict mapped pages.

Sounds good.

Cheers.

2004-04-18 05:20:12

by Nick Piggin

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer wrote:
> On Sun, Apr 18, 2004 at 02:41:12PM +1000, Nick Piggin wrote:
>
>>William Lee Irwin III wrote:
>>
>>>On Sun, Apr 18, 2004 at 01:37:45PM +1000, Nick Piggin wrote:
>>>
>>>
>>>>swappiness is pretty arbitrary and unfortunately it means
>>>>different things to machines with different sized memory.
>>>>Also, once you *have* gone past the reclaim_mapped threshold,
>>>>mapped pages aren't really given any preference above
>>>>unmapped pages.
>>>>I have a small patchset which splits the active list roughly
>>>>into mapped and unmapped pages. It might hopefully solve your
>>>>problem. Would you give it a try? It is pretty stable here.
>>>
>>>
>>>It would be interesting to see the results of this on Marc's system.
>>>It's a more comprehensive solution than tweaking numbers.
>>>
>>
>>Well, here is the current patch against 2.6.5-mm6. -mm is
>>different enough from -linus now that it is not 100% trivial
>>to patch (mainly the rmap and hugepages work).
>
>
> Will this work against 2.6.5 without -mm6?
>

Unfortunately it won't patch easily. If this is a big
problem for you I could make you up a 2.6.5 version.

> As an aside, I've been using SVN to manage my kernel sources. While
> I'd be thrilled to make it work, it simply doesn't seem to have the
> heavy lifting capability to handle the kernel work. I know the
> rudiments of using BK. What I'd like is some sort of HOWTO with
> examples of common tasks for kernel development. Know of any?
>

Well I don't do a great deal of coding or merging, but I
use Andrew Morton's patch scripts which make things very
easy for me.

Regarding bitkeeper, I have never tried it but there is
some help in Documentation/BK-usage/ which might be of
use to you.

2004-04-18 05:36:01

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sun, Apr 18, 2004 at 03:19:59PM +1000, Nick Piggin wrote:
> >>Well, here is the current patch against 2.6.5-mm6. -mm is
> >>different enough from -linus now that it is not 100% trivial
> >>to patch (mainly the rmap and hugepages work).
> >
> >
> >Will this work against 2.6.5 without -mm6?
> >
>
> Unfortunately it won't patch easily. If this is a big
> problem for you I could make you up a 2.6.5 version.

Well, I'll try applying his patch and then yours. If it doesn't work
I'll let you know.

>
> >As an aside, I've been using SVN to manage my kernel sources. While
> >I'd be thrilled to make it work, it simply doesn't seem to have the
> >heavy lifting capability to handle the kernel work. I know the
> >rudiments of using BK. What I'd like is some sort of HOWTO with
> >examples of common tasks for kernel development. Know of any?
> >
>
> Well I don't do a great deal of coding or merging, but I
> use Andrew Morton's patch scripts which make things very
> easy for me.

Where does he keep 'em?

> Regarding bitkeeper, I have never tried it but there is
> some help in Documentation/BK-usage/ which might be of
> use to you.

I'll read it. Thanks.

2004-04-18 05:38:59

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 08:53:38PM -0700, Andrew Morton wrote:
> William Lee Irwin III <[email protected]> wrote:
> >
> > On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> > > I'd assume that setting swappiness to zero simply means that you still have
> > > all of your libc in pagecache when running ls.
> > > What happens if you do the big file copy, then run `sync', then do the ls?
> > > Have you experimented with the NFS mount options? v2? UDP?
> >
> > I wonder if the ptep_test_and_clear_young() TLB flushing is related.
>
> That, or page_referenced() always returns true on this ARM implementation
> or some such silliness. Everything here points at the VM being unable to
> reclaim that clean pagecache.

How can I tell? Is it something like this: because page_referenced()
always returns true (which I haven't investigated), the page eviction
code cannot distinguish mapped from cache pages and therefore selects
valuable, mapped pages?

2004-04-18 05:41:49

by Nick Piggin

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer wrote:
> On Sun, Apr 18, 2004 at 03:19:59PM +1000, Nick Piggin wrote:
>
>>>>Well, here is the current patch against 2.6.5-mm6. -mm is
>>>>different enough from -linus now that it is not 100% trivial
>>>>to patch (mainly the rmap and hugepages work).
>>>
>>>
>>>Will this work against 2.6.5 without -mm6?
>>>
>>
>>Unfortunately it won't patch easily. If this is a big
>>problem for you I could make you up a 2.6.5 version.
>
>
> Well, I'll try applying his patch and then yours. If it doesn't work
> I'll let you know.
>

OK thanks.

>
>>>As an aside, I've been using SVN to manage my kernel sources. While
>>>I'd be thrilled to make it work, it simply doesn't seem to have the
>>>heavy lifting capability to handle the kernel work. I know the
>>>rudiments of using BK. What I'd like is some sort of HOWTO with
>>>examples of common tasks for kernel development. Know of any?
>>>
>>
>>Well I don't do a great deal of coding or merging, but I
>>use Andrew Morton's patch scripts which make things very
>>easy for me.
>
>
> Where does he keep 'em?
>

http://www.zip.com.au/~akpm/linux/patches/

2004-04-18 05:53:10

by Andrew Morton

Subject: Re: vmscan.c heuristic adjustment for smaller systems

Marc Singer <[email protected]> wrote:
>
> On Sat, Apr 17, 2004 at 08:53:38PM -0700, Andrew Morton wrote:
> > William Lee Irwin III <[email protected]> wrote:
> > >
> > > On Sat, Apr 17, 2004 at 04:51:51PM -0700, Andrew Morton wrote:
> > > > I'd assume that setting swappiness to zero simply means that you still have
> > > > all of your libc in pagecache when running ls.
> > > > What happens if you do the big file copy, then run `sync', then do the ls?
> > > > Have you experimented with the NFS mount options? v2? UDP?
> > >
> > > I wonder if the ptep_test_and_clear_young() TLB flushing is related.
> >
> > That, or page_referenced() always returns true on this ARM implementation
> > or some such silliness. Everything here points at the VM being unable to
> > reclaim that clean pagecache.
>
> How can I tell?

Well, some more description of what the system does after that copy-to-nfs
would help. Does it _ever_ come good, or is a reboot needed, etc?

What does `vmstat 1' say during the copy, and during the ls?

/proc/vmstat before and after the ls.

Try doing the copy, then when it has finished do the old
memset(malloc(24M)) and monitor the `vmstat 1' output while it runs,
capture /proc/meminfo before and after.
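
Something like this little program is all I mean (hypothetical code,
just to fault in 24M of anonymous memory):

	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		size_t size = 24 * 1024 * 1024;	/* the 24M above */
		char *p = malloc(size);

		if (p)
			memset(p, 0, size);	/* touch every page */
		return 0;
	}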

None of the problems you report are present on x86 as far as I can tell,
so...

2004-04-18 06:15:37

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 10:52:43PM -0700, Andrew Morton wrote:
> Well, some more description of what the system does after that copy-to-nfs
> would help. Does it _ever_ come good, or is a reboot needed, etc?

As far as I can tell, it never gets good. I run the same ls command
over and over. Ten times. Always the same behavior. Sometimes it
gets even slower. I can set swappiness to zero and it responds
normally, immediately.

> What does `vmstat 1' say during the copy, and during the ls?

I thought I sent a message about this. I've found that the problem
*only* occurs when there is exactly one process running. If I open a
second console (via telnet) then the slow-down behavior disappears.
If I logout of the console session and run the tests from the telnet
session then I *do* see the problem again.

> /proc/vmstat before and after the ls.

This one I can do.

BEFORE

nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_page_table_pages 49
nr_mapped 163
nr_slab 255
pgpgin 4
pgpgout 0
pswpin 0
pswpout 0
pgalloc_high 0
pgalloc_normal 0
pgalloc_dma 78568
pgfree 79274
pgactivate 8428
pgdeactivate 8324
pgfault 17112
pgmajfault 1348
pgrefill_high 0
pgrefill_normal 0
pgrefill_dma 543296
pgsteal_high 0
pgsteal_normal 0
pgsteal_dma 60834
pgscan_kswapd_high 0
pgscan_kswapd_normal 0
pgscan_kswapd_dma 189700
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma 31746
pginodesteal 0
slabs_scanned 1586
kswapd_steal 30907
kswapd_inodesteal 0
pageoutrun 77994
allocstall 82
pgrotated 0

ls -l /proc runs slowly.

AFTER

nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_page_table_pages 49
nr_mapped 164
nr_slab 225
pgpgin 4
pgpgout 0
pswpin 0
pswpout 0
pgalloc_high 0
pgalloc_normal 0
pgalloc_dma 85378
pgfree 86106
pgactivate 11759
pgdeactivate 11650
pgfault 21293
pgmajfault 2316
pgrefill_high 0
pgrefill_normal 0
pgrefill_dma 616785
pgsteal_high 0
pgsteal_normal 0
pgsteal_dma 67511
pgscan_kswapd_high 0
pgscan_kswapd_normal 0
pgscan_kswapd_dma 200241
pgscan_direct_high 0
pgscan_direct_normal 0
pgscan_direct_dma 31746
pginodesteal 0
slabs_scanned 1586
kswapd_steal 37584
kswapd_inodesteal 0
pageoutrun 78405
allocstall 82
pgrotated 0

ls -l still slow.

Anything interesting?

> Try doing the copy, then when it has finished do the old
> memset(malloc(24M)) and monitor the `vmstat 1' output while it runs,
> capture /proc/meminfo before and after.

Here's the meminfo version of the same test above.

BEFORE

MemTotal: 30256 kB
MemFree: 3312 kB
Buffers: 0 kB
Cached: 24512 kB
SwapCached: 0 kB
Active: 732 kB
Inactive: 24084 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 30256 kB
LowFree: 3312 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 700 kB
Slab: 1024 kB
Committed_AS: 476 kB
PageTables: 196 kB
VmallocTotal: 434176 kB
VmallocUsed: 65924 kB
VmallocChunk: 368252 kB


ls -l runs slowly

AFTER

MemTotal: 30256 kB
MemFree: 3420 kB
Buffers: 0 kB
Cached: 24448 kB
SwapCached: 0 kB
Active: 772 kB
Inactive: 23984 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 30256 kB
LowFree: 3420 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 692 kB
Slab: 964 kB
Committed_AS: 476 kB
PageTables: 196 kB
VmallocTotal: 434176 kB
VmallocUsed: 65924 kB
VmallocChunk: 368252 kB

ls -l still runs slowly.

The copy w/vmstat involves two processes. It doesn't exhibit the
problems.

> None of the problems you report are present on x86 as far as I can tell,
> so...

I don't expect you would. Are you confident that this doesn't happen
on a 386 NFS root mounted system in single-user mode? I don't have an
IA32 system where I can test this scenario.

2004-04-18 09:29:52

by Russell King

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, Apr 17, 2004 at 05:23:43PM -0700, Marc Singer wrote:
> All of these tests are performed at the console, one command at a
> time. I have a telnet daemon available, so I open a second connection
> to the target system. I run a continuous loop of file copies on the
> console and I execute 'ls -l /proc' in the telnet window. It's a
> little slow, but it isn't unreasonable. Hmm. I then run the copy
> command in the telnet window followed by the 'ls -l /proc'. It works
> fine. I logout of the console session and perform the telnet window
> test again. The 'ls -l /proc' takes 30 seconds.
>
> When there is more than one process running, everything is peachy.
> When there is only one process (no context switching) I see the slow
> performance. I had a hypothesis, but my test of that hypothesis
> failed.

Guys, this tends to indicate that we _must_ have up-to-date aging
information from the PTE - if not, we're liable to miss out on the
pressure from user applications. The "lazy" method which 2.4 allows
is not possible with 2.6.

This means we must flush the TLB when we mark the PTE old.
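
In other words, the fix being discussed has roughly this shape (a
sketch with a made-up helper name, not actual code; the arch details
are in that thread):

	static inline int clear_young_and_flush(struct vm_area_struct *vma,
						unsigned long address, pte_t *ptep)
	{
		int young = ptep_test_and_clear_young(ptep);

		if (young)
			/* zap the stale TLB entry so the next user access
			   really faults and re-ages the page */
			flush_tlb_page(vma, address);
		return young;
	}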

Might be worth reading my thread on linux-mm about this and commenting?
(hint hint)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-04-18 23:44:08

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sun, Apr 18, 2004 at 03:41:24PM +1000, Nick Piggin wrote:
> >Well, I'll try applying his patch and then yours. If it doesn't work
> >I'll let you know.
> >
>
> OK thanks.

There appear to be a lot of conflicts between my development tree and
the -mm6 patch. Even your patch doesn't apply cleanly, though I think
it is only because a piece has already been applied.

I'm starting with 2.6.5, applying Russell King's 2.6.5 patch from the
8th, applying the -mm6 patch, and then yours. It looks like a good bit of
Russell's patch has been included in -mm6. But not enough of -mm6
is present in my tree for your patch to work.

I'm working on the scripts and BK docs. At this point, I may have to
wait for 2.6.6 before we can make another test.

2004-04-19 00:26:37

by Rik van Riel

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sat, 17 Apr 2004, Marc Singer wrote:

> I thought I sent a message about this. I've found that the problem
> *only* occurs when there is exactly one process running.

BINGO! ;)

Looks like this could be the referenced bits not being
flushed from the MMU and not found by the VM...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-04-19 00:39:50

by Marc Singer

Subject: Re: vmscan.c heuristic adjustment for smaller systems

On Sun, Apr 18, 2004 at 08:26:13PM -0400, Rik van Riel wrote:
> On Sat, 17 Apr 2004, Marc Singer wrote:
>
> > I thought I sent a message about this. I've found that the problem
> > *only* occurs when there is exactly one process running.
>
> BINGO! ;)
>
> Looks like this could be the referenced bits not being
> flushed from the MMU and not found by the VM...

Can you be a little more verbose for me? The ARM MMU doesn't keep
track of page references, AFAICT. How does a context switch change
this?

I have looked into the case where the TLB for an old page isn't being
flushed (by design), but I've been unable to fix the problem by
forcing a TLB flush whenever a PTE is zeroed.