2008-02-28 19:43:37

by Rik van Riel

[permalink] [raw]
Subject: [patch 00/21] VM pageout scalability improvements

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number

More info on the overall design can be found at:

http://linux-mm.org/PageReplacementDesign
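The split-LRU idea in point 2 can be sketched in a few lines of C. This is an illustration only; the list names and helper functions below are invented for the sketch, not taken from the patches themselves.

```c
/* Hypothetical illustration of the split-LRU idea: each page type
 * lives on its own list, so reclaim only walks lists it can act on. */
enum lru_list {
    LRU_INACTIVE_ANON,      /* swap backed, cold          */
    LRU_ACTIVE_ANON,        /* swap backed, hot           */
    LRU_INACTIVE_FILE,      /* filesystem backed, cold    */
    LRU_ACTIVE_FILE,        /* filesystem backed, hot     */
    LRU_NORECLAIM,          /* mlocked etc: never scanned */
    NR_LRU_LISTS
};

/* Reclaim can skip the non-reclaimable list entirely. */
static int is_evictable(enum lru_list lru)
{
    return lru != LRU_NORECLAIM;
}

/* When swap is full, the anon lists are not worth scanning either
 * (matching the "do not scan anon list if swap is full" item in the
 * changelog below). */
static int should_scan(enum lru_list lru, int swap_full)
{
    if (!is_evictable(lru))
        return 0;
    if (swap_full &&
        (lru == LRU_INACTIVE_ANON || lru == LRU_ACTIVE_ANON))
        return 0;
    return 1;
}
```

With one LRU per page type, the scan loop never touches pages it cannot evict, which is the whole point of the split.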


Changelog:
- pull the memcontrol lru arrayification earlier into the patch series
- use a pagevec array similar to the lru array
- clean up the code in various places
- improved pageout balancing and reduced pageout cpu use

- fix compilation on PPC and without memcontrol
- make page_is_pagecache more readable
- replace get_scan_ratio with correct version

- merge memcontroller split LRU code into the main split LRU patch,
since it is not functionally different (it was split up only to help
people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

--
All Rights Reversed


2008-02-28 19:55:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 00/21] VM pageout scalability improvements

On Thu, 28 Feb 2008 14:29:08 -0500
Rik van Riel <[email protected]> wrote:

> Against 2.6.24-rc6-mm1

I rediffed all the patches, but forgot to edit the intro email.

It really is against 2.6.25-rc2-mm1.

--
All rights reversed.

2008-02-28 20:15:21

by John Stoffel

[permalink] [raw]
Subject: Re: [patch 00/21] VM pageout scalability improvements


Rik> On large memory systems, the VM can spend way too much time
Rik> scanning through pages that it cannot (or should not) evict from
Rik> memory. Not only does it use up CPU time, but it also provokes
Rik> lock contention and can leave large systems under memory pressure
Rik> in a catatonic state.

Nitpicky, but what is a large memory system? I read your web page and
you talk about large memory being greater than several Gb, and about
huge systems (> 128gb). So which is this patch addressing?

I ask because I've got a new system with 4Gb of RAM and my motherboard
can go to 8Gb. Should this be a large memory system or not? I've also
only got a single dual core CPU, how does that affect things?

You talk about the Inactive list in the Anonymous memory section, and
about limiting it. You say 30% on a 1Gb system, but 1% on a 1Tb
system. Those are interesting numbers, but it's not clear where they
come from.

Should the IO limits (raised lower down in the document) be a more
core feature? I.e. if you only have 20MBytes/sec of bandwidth to disk
for swap, should you be limiting the inactive list to 5 seconds of
bandwidth in terms of size? Or 10s, or 60s?
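The arithmetic behind this suggestion is simple enough to write down; a minimal sketch using the numbers from the paragraph above (the function name is invented for illustration):

```c
/* Cap the inactive list at however many megabytes the swap device
 * can write back within a chosen drain time.  Illustration only. */
static unsigned long inactive_cap_mb(unsigned long mb_per_sec,
                                     unsigned long seconds)
{
    return mb_per_sec * seconds;
}

/* At 20 MB/s, a 5 second drain time caps the list at 100MB,
 * while a 60 second drain time allows 1200MB. */
```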

Should we be more aggressive in pre-swapping Anonymous memory to swap,
but keeping it cached in memory for use? If there's pressure, it
seems like it would be easy to just dump pre-swapped pages from the
inactive list, without having to spend time writing them out.

Also, how does having more CPUs/IO bandwidth change things? Do we
need an exponential backoff algorithm in terms of how much memory is
allocated to the various lists? As memory gets bigger and bigger, do
we allocate fewer and fewer pages since we can't swap them out fast
enough?

I dunno... I honestly don't have the time or the knowledge to do more
than poke sticks into things and see what happens. And to ask
annoying questions.

I do appreciate your work on this.

John




2008-02-28 20:24:34

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 00/21] VM pageout scalability improvements

On Thu, 28 Feb 2008 15:14:02 -0500
"John Stoffel" <[email protected]> wrote:

> Nitpicky, but what is a large memory system? I read your web page and
> you talk about large memory being greater than several Gb, and about
> huge systems (> 128gb). So which is this patch addressing?
>
> I ask because I've got a new system with 4Gb of RAM and my motherboard
> can go to 8Gb. Should this be a large memory system or not? I've also
> only got a single dual core CPU, how does that affect things?

It depends a lot on the workload.

On a few workloads, the current VM explodes with as little as
16GB of RAM, while on a few other workloads the current VM works
fine with 128GB of RAM.

This patch tries to address the behaviour of the kernel when
faced with workloads that trip up the current VM.

> You talk about the Inactive list in the Anonymous memory section, and
> about limiting it. You say 30% on a 1Gb system, but 1% on a 1Tb
> system, which is interesting numbers but it's not clear where they
> come from.

They seemed like a reasonable balance between limiting the maximum
amount of work the VM needs to do and giving pages enough time to get
referenced again. If benchmarks suggest that the ratio should be
tweaked, we can do so quite easily.
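For what it's worth, those two endpoints roughly fall out of a square-root style rule, where the active:inactive ratio grows with the square root of memory size. The sketch below is one such rule, invented here to illustrate the shape of the curve (it yields 25% at 1GB and about 1% at 1TB); it is not necessarily the exact formula the patches use.

```c
/* Integer square root by linear search; fine for small inputs. */
static unsigned long int_sqrt_ul(unsigned long x)
{
    unsigned long r = 0;
    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/* Target inactive share, in tenths of a percent, for 'gb' gigabytes
 * of memory.  The active:inactive ratio scales as sqrt(10 * gb), so
 * the inactive share shrinks as memory grows: roughly 25% at 1GB,
 * about 1% at 1TB.  Illustration only. */
static unsigned long inactive_permille(unsigned long gb)
{
    unsigned long ratio = int_sqrt_ul(10 * gb); /* active:inactive */
    return 1000 / (1 + ratio);
}
```

The attraction of a rule like this is that the amount of inactive memory the VM has to scan grows much more slowly than memory size itself.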

> I dunno... I honestly don't have the time or the knowledge to do more
> than poke sticks into things and see what happens. And to ask
> annoying questions.

Patch series like this can always use a good poking. Especially by
people who run all kinds of nasty programs to trip up the VM :)

--
All rights reversed.