2008-01-08 21:20:20

by Rik van Riel

[permalink] [raw]
Subject: [patch 00/19] VM pageout scalability improvements

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not
only does it use up CPU time, but it also provokes lock contention
and can leave large systems under memory pressure in a catatonic state.

Against 2.6.24-rc6-mm1

This patch series improves VM scalability by:

1) making the locking a little more scalable

2) putting filesystem backed, swap backed and non-reclaimable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory

3) switching to SEQ replacement for the anonymous LRUs, so the
number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number

More info on the overall design can be found at:

http://linux-mm.org/PageReplacementDesign
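
As a rough illustration of item 2 above (a sketch in plain C, not the
actual patch code; the names are made up for the example): pages are
classified onto separate LRU lists, and the scanner only walks the
lists it can actually reclaim from.

/* Illustrative sketch only -- a simplified stand-in for the real code. */
enum lru_kind {
	LRU_INACTIVE_ANON,	/* swap backed: anonymous, shmem, ... */
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,	/* filesystem backed page cache */
	LRU_ACTIVE_FILE,
	LRU_NORECLAIM,		/* mlocked or otherwise unevictable */
	NR_LRU_KINDS,
};

struct page_info {
	int evictable;		/* 0 for mlocked/ramfs style pages */
	int swap_backed;	/* would go to swap if evicted */
	int active;		/* recently referenced */
};

/* Pick the LRU list a page belongs on. */
static enum lru_kind page_lru(const struct page_info *p)
{
	if (!p->evictable)
		return LRU_NORECLAIM;
	if (p->swap_backed)
		return p->active ? LRU_ACTIVE_ANON : LRU_INACTIVE_ANON;
	return p->active ? LRU_ACTIVE_FILE : LRU_INACTIVE_FILE;
}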


Changelog:
- merge memcontroller split LRU code into the main split LRU patch,
since it is not functionally different (it was split up only to help
people who had seen the last version of the patch series review it)
- drop the page_file_cache debugging patch, since it never triggered
- reintroduce code to not scan anon list if swap is full
- add code to scan anon list if page cache is very small already
- use lumpy reclaim more aggressively for smaller order > 1 allocations

--
All Rights Reversed


2008-01-10 04:39:16

by Mike Snitzer

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

On Jan 8, 2008 3:59 PM, Rik van Riel <[email protected]> wrote:
> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1

Hi Rik,

How much trouble am I asking for if I were to try to get your patchset
to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
workable, is such an effort before its time relative to your TODO?

I see that you have an old port to a FC7-based 2.6.21 here:
http://people.redhat.com/riel/vmsplit/

Also, do you have a public git repo that you regularly publish to for
this patchset? If not a git repo do you put the raw patchset on some
http/ftp server?

thanks,
Mike

2008-01-10 15:42:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

On Wed, 9 Jan 2008 23:39:02 -0500
"Mike Snitzer" <[email protected]> wrote:

> How much trouble am I asking for if I were to try to get your patchset
> to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
> workable, is such an effort before its time relative to your TODO?

Quite a bit :)

The -mm kernel has the memory controller code, which means the
mm/ directory is fairly different. My patch set sits on top
of that.

Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
I can start building on top of that.

OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
minimal chainsaw effort.

> I see that you have an old port to a FC7-based 2.6.21 here:
> http://people.redhat.com/riel/vmsplit/
>
> Also, do you have a public git repo that you regularly publish to for
> this patchset? If not a git repo do you put the raw patchset on some
> http/ftp server?

Up to now I have only emailed out the patches. Since there is demand
for them to be downloadable from somewhere, I'll also start putting
them on http://people.redhat.com/riel/

--
All rights reversed.

2008-01-10 16:08:30

by Mike Snitzer

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

On Jan 10, 2008 10:41 AM, Rik van Riel <[email protected]> wrote:
>
> On Wed, 9 Jan 2008 23:39:02 -0500
> "Mike Snitzer" <[email protected]> wrote:
>
> > How much trouble am I asking for if I were to try to get your patchset
> > to fly on a fairly recent "stable" kernel (e.g. 2.6.22.15)? If
> > workable, is such an effort before its time relative to your TODO?
>
> Quite a bit :)
>
> The -mm kernel has the memory controller code, which means the
> mm/ directory is fairly different. My patch set sits on top
> of that.
>
> Chances are that once the -mm kernel goes upstream (in 2.6.25-rc1),
> I can start building on top of that.
>
> OTOH, maybe I could get my patch series onto a recent 2.6.23.X with
> minimal chainsaw effort.

That would be great! I can't speak for others, but -mm poses a problem
for testing your patchset because it is so bleeding edge. Let me know if
you take the plunge on a 2.6.23.x backport; I'd really appreciate it.

Is anyone else interested in consuming a 2.6.23.x backport of Rik's
patchset? If so please speak up.

> > I see that you have an old port to a FC7-based 2.6.21 here:
> > http://people.redhat.com/riel/vmsplit/
> >
> > Also, do you have a public git repo that you regularly publish to for
> > this patchset? If not a git repo do you put the raw patchset on some
> > http/ftp server?
>
> Up to now I have only emailed out the patches. Since there is demand
> for them to be downloadable from somewhere, I'll also start putting
> them on http://people.redhat.com/riel/

Great, thanks.

Mike

2008-01-11 10:42:20

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

* Rik van Riel <[email protected]> [2008-01-08 15:59:39]:

> On large memory systems, the VM can spend way too much time scanning
> through pages that it cannot (or should not) evict from memory. Not
> only does it use up CPU time, but it also provokes lock contention
> and can leave large systems under memory pressure in a catatonic state.
>
> Against 2.6.24-rc6-mm1
>
> This patch series improves VM scalability by:
>
> 1) making the locking a little more scalable
>
> 2) putting filesystem backed, swap backed and non-reclaimable pages
> onto their own LRUs, so the system only scans the pages that it
> can/should evict from memory
>
> 3) switching to SEQ replacement for the anonymous LRUs, so the
> number of pages that need to be scanned when the system
> starts swapping is bound to a reasonable number
>
> More info on the overall design can be found at:
>
> http://linux-mm.org/PageReplacementDesign
>
>
> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
> since it is not functionally different (it was split up only to help
> people who had seen the last version of the patch series review it)
> - drop the page_file_cache debugging patch, since it never triggered
> - reintroduce code to not scan anon list if swap is full
> - add code to scan anon list if page cache is very small already
> - use lumpy reclaim more aggressively for smaller order > 1 allocations
>

Hi, Rik,

I've just started on the patch series; the compile fails for me on a
powerpc box. global_lru_pages() is defined under CONFIG_PM, but it is
used elsewhere, in mm/page-writeback.c. Nothing that global_lru_pages()
uses depends on CONFIG_PM. Here's a simple patch to fix it.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b14e188..39e6aef 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1920,6 +1920,14 @@ void wakeup_kswapd(struct zone *zone, int order)
wake_up_interruptible(&pgdat->kswapd_wait);
}

+unsigned long global_lru_pages(void)
+{
+ return global_page_state(NR_ACTIVE_ANON)
+ + global_page_state(NR_ACTIVE_FILE)
+ + global_page_state(NR_INACTIVE_ANON)
+ + global_page_state(NR_INACTIVE_FILE);
+}
+
#ifdef CONFIG_PM
/*
* Helper function for shrink_all_memory(). Tries to reclaim 'nr_pages' pages
@@ -1968,14 +1976,6 @@ static unsigned long shrink_all_zones(unsigned long nr_pages, int prio,
return ret;
}

-unsigned long global_lru_pages(void)
-{
- return global_page_state(NR_ACTIVE_ANON)
- + global_page_state(NR_ACTIVE_FILE)
- + global_page_state(NR_INACTIVE_ANON)
- + global_page_state(NR_INACTIVE_FILE);
-}
-
/*
* Try to free `nr_pages' of memory, system-wide, and return the number of
* freed pages.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-01-11 11:48:32

by Balbir Singh

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

* Rik van Riel <[email protected]> [2008-01-08 15:59:39]:

> Changelog:
> - merge memcontroller split LRU code into the main split LRU patch,
> since it is not functionally different (it was split up only to help
> people who had seen the last version of the patch series review it)

Hi, Rik,

I see strange behaviour with this patchset. I have a program
(pagetest, from Vaidy) that does the following (a rough sketch of the
loop follows the list):

1. Can allocate different kinds of memory, mapped, malloc'ed or shared
2. Allocates and touches all the memory in a loop (2 times)
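
Roughly, the touch loop boils down to the sketch below (illustrative
only; the real pagetest is Vaidy's and also covers the file mapped and
shared cases):

/* Illustrative sketch of the test, not the real pagetest. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1000;
	size_t len = mb << 20;
	size_t i;
	int pass;
	char *buf;

	/* anonymous mapping, roughly the malloc'ed case */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* touch every page, in two passes (a 4K stride still hits
	 * every page on a 64K page box) */
	for (pass = 0; pass < 2; pass++)
		for (i = 0; i < len; i += 4096)
			buf[i] = pass;

	munmap(buf, len);
	return 0;
}

Run inside the 400M-limited cgroup with an argument of 1000, this
corresponds to the setup described below.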

I mount the memory controller and limit it to 400M and run pagetest
and ask it to touch 1000M. Without this patchset everything runs fine,
but with this patchset installed, I immediately see

pagetest invoked oom-killer: gfp_mask=0xd0, order=0, oomkilladj=0
Call Trace:
[c0000000e5aef400] [c00000000000eb24] .show_stack+0x70/0x1bc (unreliable)
[c0000000e5aef4b0] [c0000000000bbbbc] .oom_kill_process+0x80/0x260
[c0000000e5aef570] [c0000000000bc498] .mem_cgroup_out_of_memory+0x6c/0x98
[c0000000e5aef610] [c0000000000f2574] .mem_cgroup_charge_common+0x1e0/0x414
[c0000000e5aef6e0] [c0000000000b852c] .add_to_page_cache+0x48/0x164
[c0000000e5aef780] [c0000000000b8664] .add_to_page_cache_lru+0x1c/0x68
[c0000000e5aef810] [c00000000012db50] .mpage_readpages+0xbc/0x15c
[c0000000e5aef940] [c00000000018bdac] .ext3_readpages+0x28/0x40
[c0000000e5aef9c0] [c0000000000c3978] .__do_page_cache_readahead+0x158/0x260
[c0000000e5aefa90] [c0000000000bac44] .filemap_fault+0x18c/0x3d4
[c0000000e5aefb70] [c0000000000cd510] .__do_fault+0xb0/0x588
[c0000000e5aefc80] [c0000000005653cc] .do_page_fault+0x440/0x620
[c0000000e5aefe30] [c000000000005408] handle_page_fault+0x20/0x58
Mem-info:
Node 0 DMA per-cpu:
CPU 0: hi: 6, btch: 1 usd: 4
CPU 1: hi: 6, btch: 1 usd: 0
CPU 2: hi: 6, btch: 1 usd: 3
CPU 3: hi: 6, btch: 1 usd: 4
Active_anon:9099 active_file:1523 inactive_anon:0 inactive_file:2869
noreclaim:0 dirty:20 writeback:0 unstable:0
free:44210 slab:639 mapped:1724 pagetables:475 bounce:0
Node 0 DMA free:2829440kB min:7808kB low:9728kB high:11712kB
active_anon:582336kB inactive_anon:0kB active_file:97472kB
inactive_file:183616kB noreclaim:0kB present:3813760kB
pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
Node 0 DMA: 3*64kB 5*128kB 5*256kB 4*512kB 2*1024kB 4*2048kB 3*4096kB 2*8192kB 170*16384kB = 2828352kB
Swap cache: add 0, delete 0, find 0/0
Free swap = 3148608kB
Total swap = 3148608kB
Free swap: 3148608kB
59648 pages of RAM
677 reserved pages
28165 pages shared
0 pages swap cached
Memory cgroup out of memory: kill process 6593 (pagetest) score 1003 or a child
Killed process 6593 (pagetest)

I am using a powerpc box with 64K pages. I'll try to investigate further;
this is just a heads-up on the failure I am seeing.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

2008-01-11 15:39:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [patch 00/19] VM pageout scalability improvements

On Fri, 11 Jan 2008 16:11:15 +0530
Balbir Singh <[email protected]> wrote:

> I've just started on the patch series; the compile fails for me on a
> powerpc box. global_lru_pages() is defined under CONFIG_PM, but it is
> used elsewhere, in mm/page-writeback.c. Nothing that global_lru_pages()
> uses depends on CONFIG_PM. Here's a simple patch to fix it.

Thank you for the fix. I have applied it to my tree.

--
All rights reversed.

2008-01-16 06:18:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: rvr split LRU minor regression ?

Hi Rik

I tested the new hackbench on the rvr split-LRU patch.

The new hackbench URL is:
http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c


Test method:

(1) $ ./hackbench 150 process 1000
(2) # sync; echo 3 > /proc/sys/vm/drop_caches
$ dd if=tmp10G of=/dev/null
$ ./hackbench 150 process 1000

test machine:
CPU: x86_64 1.86GHz x2
memory: 6GB


result:

             2.6.24-rc6-mm1   +rvr-split-lru   ratio (smaller is faster)
------------------------------------------------------------------------
(1)                 364.981          359.386     98.47%
(2)                 364.461          387.471    106.31%


More detail:
1. /usr/bin/time command output

vanilla 2.6.24-rc6-mm1
33.74user 703.10system 6:09.56elapsed 199%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+372467minor)pagefaults 0swaps

2.6.24-rc6-mm1 + rvr-split-lru
36.22user 731.30system 6:35.16elapsed 194%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (804major+389524minor)pagefaults 0swaps

It seems the number of page faults increased (804 major faults vs. 0 with vanilla).


2. cat /proc/meminfo after test (2)

vanilla 2.6.24-rc6-mm1

MemTotal: 5931808 kB
MemFree: 1751632 kB
Buffers: 4360 kB
Cached: 3930020 kB
SwapCached: 0 kB
Active: 46396 kB
Inactive: 3924108 kB
SwapTotal: 20972848 kB
SwapFree: 20972720 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 36140 kB
Mapped: 10104 kB
Slab: 160020 kB
SReclaimable: 3460 kB
SUnreclaim: 156560 kB
PageTables: 3712 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 23938752 kB
Committed_AS: 78940 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 57220 kB
VmallocChunk: 34359680999 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB


2.6.24-rc6-mm1 + rvr-split-lru

MemTotal: 5931356 kB
MemFree: 1771800 kB
Buffers: 2776 kB
Cached: 3914800 kB
SwapCached: 7940 kB
Active(anon): 21868 kB
Inactive(anon): 6560 kB
Active(file): 1722888 kB
Inactive(file): 2192128 kB
Noreclaim: 3472 kB
Mlocked: 3724 kB
SwapTotal: 20972848 kB
SwapFree: 20935032 kB
Dirty: 8 kB
Writeback: 0 kB
AnonPages: 23912 kB
Mapped: 9500 kB
Slab: 162188 kB
SReclaimable: 5544 kB
SUnreclaim: 156644 kB
PageTables: 4444 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 23938524 kB
Committed_AS: 106816 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 57220 kB
VmallocChunk: 34359680999 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB


It seems that incorrect activation of used-once memory has increased.
What do you think about it?



- kosaki