2010-11-15 19:52:48

by Simon Kirby

Subject: Free memory never fully used, swapping

Hi!

We're seeing cases on a number of servers where cache never fully grows
to use all available memory. Sometimes we see servers with 4 GB of
memory that never seem to have less than 1.5 GB free, even with a
constantly-active VM. In some cases, these servers also swap out while
this happens, even though they are constantly reading the working set
into memory. We have been seeing this happening for a long time;
I don't think it's anything recent, and it still happens on 2.6.36.

I noticed that CONFIG_NUMA seems to enable some more complicated
reclaiming bits and figured it might help since most stock kernels seem
to ship with it now. This seems to have helped, but it may just be
wishful thinking. We still see this happening, though maybe to a lesser
degree. (The following observations are with CONFIG_NUMA enabled.)

I was eyeballing "vmstat 1" and "watch -n.2 -d cat /proc/vmstat" at the
same time, and I can see distinctly that the page cache is growing nicely
until a sudden event where 400 MB is freed within 1 second, leaving
this particular box with 700 MB free again. kswapd numbers increase in
/proc/vmstat, which leads me to believe that __alloc_pages_slowpath() has
been called, since it seems to be the thing that wakes up kswapd.

Previous patterns and watching of "vmstat 1" show that the swapping out
also seems to occur during the times that memory is quickly freed.

These are all x86_64, and so there is no highmem garbage going on.
The only zones would be for DMA, right? Is the combination of memory
fragmentation and large-order allocations the only thing that would be
causing this reclaim here? Is there some easy bake knob for finding what
is causing the free memory jumps each time this happens?

Kernel config and munin graph of free memory here:

http://0x.ca/sim/ref/2.6.36/

I notice CONFIG_COMPACTION is still "EXPERIMENTAL". Would it be worth
trying here? It seems to enable defrag before reclaim, but that sounds
kind of ...complicated...

Cheers,

Simon-

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 11496 401684 40364 2531844 0 0 4540 773 7890 16291 13 9 77 1
3 0 11492 400180 40372 2534204 0 0 5572 699 8544 14856 25 9 66 1
0 0 11492 394344 40372 2540796 0 0 5256 345 8239 16723 17 7 73 2
0 0 11492 388524 40372 2546236 0 0 5216 393 8687 17289 14 9 76 1
4 1 11684 716296 40244 2218612 0 220 6868 1837 11124 27368 28 20 51 0
1 0 11732 753992 40248 2181468 0 120 5240 647 9542 15609 38 11 50 1
1 0 11712 736864 40260 2197788 0 0 5872 9147 9838 16373 41 11 47 1
0 0 11712 738096 40260 2196984 0 0 4628 493 7980 15536 22 10 67 1
2 0 11712 733508 40260 2201756 0 0 4404 418 7265 16867 10 9 80 2


2010-11-22 23:44:24

by Andrew Morton

Subject: Re: Free memory never fully used, swapping


(cc linux-mm, where all the suckiness ends up)

On Mon, 15 Nov 2010 11:52:46 -0800
Simon Kirby <[email protected]> wrote:

> Hi!
>
> We're seeing cases on a number of servers where cache never fully grows
> to use all available memory. Sometimes we see servers with 4 GB of
> memory that never seem to have less than 1.5 GB free, even with a
> constantly-active VM. In some cases, these servers also swap out while
> this happens, even though they are constantly reading the working set
> into memory. We have been seeing this happening for a long time;
> I don't think it's anything recent, and it still happens on 2.6.36.
>
> I noticed that CONFIG_NUMA seems to enable some more complicated
> reclaiming bits and figured it might help since most stock kernels seem
> to ship with it now. This seems to have helped, but it may just be
> wishful thinking. We still see this happening, though maybe to a lesser
> degree. (The following observations are with CONFIG_NUMA enabled.)
>
> I was eyeballing "vmstat 1" and "watch -n.2 -d cat /proc/vmstat" at the
> same time, and I can see distinctly that the page cache is growing nicely
> until a sudden event where 400 MB is freed within 1 second, leaving
> this particular box with 700 MB free again. kswapd numbers increase in
> /proc/vmstat, which leads me to believe that __alloc_pages_slowpath() has
> been called, since it seems to be the thing that wakes up kswapd.
>
> Previous patterns and watching of "vmstat 1" show that the swapping out
> also seems to occur during the times that memory is quickly freed.
>
> These are all x86_64, and so there is no highmem garbage going on.
> The only zones would be for DMA, right? Is the combination of memory
> fragmentation and large-order allocations the only thing that would be
> causing this reclaim here? Is there some easy bake knob for finding what
> is causing the free memory jumps each time this happens?
>
> Kernel config and munin graph of free memory here:
>
> http://0x.ca/sim/ref/2.6.36/
>
> I notice CONFIG_COMPACTION is still "EXPERIMENTAL". Would it be worth
> trying here? It seems to enable defrag before reclaim, but that sounds
> kind of ...complicated...
>
> Cheers,
>
> Simon-
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> r b swpd free buff cache si so bi bo in cs us sy id wa
> 1 0 11496 401684 40364 2531844 0 0 4540 773 7890 16291 13 9 77 1
> 3 0 11492 400180 40372 2534204 0 0 5572 699 8544 14856 25 9 66 1
> 0 0 11492 394344 40372 2540796 0 0 5256 345 8239 16723 17 7 73 2
> 0 0 11492 388524 40372 2546236 0 0 5216 393 8687 17289 14 9 76 1
> 4 1 11684 716296 40244 2218612 0 220 6868 1837 11124 27368 28 20 51 0
> 1 0 11732 753992 40248 2181468 0 120 5240 647 9542 15609 38 11 50 1
> 1 0 11712 736864 40260 2197788 0 0 5872 9147 9838 16373 41 11 47 1
> 0 0 11712 738096 40260 2196984 0 0 4628 493 7980 15536 22 10 67 1
> 2 0 11712 733508 40260 2201756 0 0 4404 418 7265 16867 10 9 80 2

2010-11-23 01:34:10

by Simon Kirby

Subject: Re: Free memory never fully used, swapping

Uploaded a new graph with maybe more useful pixels.
( http://0x.ca/sim/ref/2.6.36/memory_long.png ) Units are day-of-month.

On this same server, dovecot is probably leaking a bit, but it seems to
be backing off on the page cache at the same time. Disk usage grows all
day and not much is unlink()ed until the middle of the night, so I don't
see any reason things shouldn't stay around in page cache.

We then restarted dovecot (with the old processes left around), and it seems
it started swapping to /dev/sda and throwing out pages of running apps
that came from /dev/sda (spinning disks) while churning through mail on
/dev/md0 (/dev/sd[bc]), which is SSD. Killing the old processes seemed to
let it keep enough in the page cache to make /dev/sda idle again.

I guess the VM has no idea about the relative cost of reads from
the SSD versus the conventional disks connected to the same box?

Anyway, we turned off swap to try to stop the load on /dev/sda. SysRq-M
showed this sort of output while kswapd was churning, if it helps:

SysRq : Show Memory
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
CPU 2: hi: 0, btch: 1 usd: 0
CPU 3: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 164
CPU 1: hi: 186, btch: 31 usd: 177
CPU 2: hi: 186, btch: 31 usd: 157
CPU 3: hi: 186, btch: 31 usd: 156
Node 0 Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 51
CPU 1: hi: 186, btch: 31 usd: 152
CPU 2: hi: 186, btch: 31 usd: 57
CPU 3: hi: 186, btch: 31 usd: 157
active_anon:102749 inactive_anon:164705 isolated_anon:0
active_file:117584 inactive_file:132410 isolated_file:0
unevictable:7155 dirty:64 writeback:18 unstable:0
free:409953 slab_reclaimable:41904 slab_unreclaimable:10369
mapped:3980 shmem:698 pagetables:3947 bounce:0
Node 0 DMA free:15904kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15772kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 3251 4009 4009
Node 0 DMA32 free:1477136kB min:6560kB low:8200kB high:9840kB active_anon:351928kB inactive_anon:512024kB active_file:405472kB inactive_file:361028kB unevictable:16kB isolated(anon):0kB isolated(file):0kB present:3329568kB mlocked:16kB dirty:128kB writeback:0kB mapped:9184kB shmem:1584kB slab_reclaimable:153108kB slab_unreclaimable:17180kB kernel_stack:808kB pagetables:5760kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 757 757
Node 0 Normal free:146772kB min:1528kB low:1908kB high:2292kB active_anon:59068kB inactive_anon:146796kB active_file:64864kB inactive_file:168464kB unevictable:28604kB isolated(anon):0kB isolated(file):92kB present:775680kB mlocked:28604kB dirty:128kB writeback:72kB mapped:6736kB shmem:1208kB slab_reclaimable:14508kB slab_unreclaimable:24280kB kernel_stack:1320kB pagetables:10028kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:236 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 2*4kB 1*8kB 1*16kB 2*32kB 1*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15904kB
Node 0 DMA32: 126783*4kB 85660*8kB 13612*16kB 2013*32kB 32*64kB 3*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1477308kB
Node 0 Normal: 19801*4kB 8446*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 146772kB
252045 total pagecache pages
0 pages in swap cache
Swap cache stats: add 10390951, delete 10390951, find 21038039/22538430
Free swap = 0kB
Total swap = 0kB
1048575 pages RAM
35780 pages reserved
263717 pages shared
455184 pages non-shared

Simon-

On Mon, Nov 22, 2010 at 03:44:19PM -0800, Andrew Morton wrote:

> (cc linux-mm, where all the suckiness ends up)
>
> On Mon, 15 Nov 2010 11:52:46 -0800
> Simon Kirby <[email protected]> wrote:
>
> > [...]

2010-11-23 08:35:36

by Dave Hansen

Subject: Re: Free memory never fully used, swapping

On Mon, 2010-11-22 at 15:44 -0800, Andrew Morton wrote:
> > These are all x86_64, and so there is no highmem garbage going on.
> > The only zones would be for DMA, right?

There shouldn't be any highmem-related action going on.

> Is the combination of memory fragmentation and large-order allocations
> the only thing that would be causing this reclaim here?

It does sound somewhat suspicious. Are you using hugetlbfs or
allocating large pages? What are your high-order allocations going to?

> Is there some easy bake knob for finding what
> is causing the free memory jumps each time this happens?

I wish. :) The best thing to do is to watch stuff like /proc/vmstat
along with its friends like /proc/{buddyinfo,meminfo,slabinfo}. Could you
post some samples of those with some indication of where the bad
behavior was seen?

I've definitely seen swapping in the face of lots of free memory, but
only in cases where I was being a bit unfair about the numbers of
hugetlbfs pages I was trying to reserve.

-- Dave

2010-11-23 10:04:20

by Mel Gorman

Subject: Re: Free memory never fully used, swapping

On Mon, Nov 22, 2010 at 03:44:19PM -0800, Andrew Morton wrote:
> On Mon, 15 Nov 2010 11:52:46 -0800
> Simon Kirby <[email protected]> wrote:
>
> > I noticed that CONFIG_NUMA seems to enable some more complicated
> > reclaiming bits and figured it might help since most stock kernels seem
> > to ship with it now. This seems to have helped, but it may just be
> > wishful thinking. We still see this happening, though maybe to a lesser
> > degree. (The following observations are with CONFIG_NUMA enabled.)
> >

Hi,

As this is a NUMA machine, what is the value of
/proc/sys/vm/zone_reclaim_mode ? When enabled, this reclaims memory
local to the node in preference to using remote nodes. For certain
workloads this performs better but for users that expect all of memory
to be used, it has surprising results.

If set to 1, try testing with it set to 0 and see if it makes a
difference. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-24 06:43:34

by Simon Kirby

Subject: Re: Free memory never fully used, swapping

On Tue, Nov 23, 2010 at 10:04:03AM +0000, Mel Gorman wrote:

> On Mon, Nov 22, 2010 at 03:44:19PM -0800, Andrew Morton wrote:
> > On Mon, 15 Nov 2010 11:52:46 -0800
> > Simon Kirby <[email protected]> wrote:
> >
> > > I noticed that CONFIG_NUMA seems to enable some more complicated
> > > reclaiming bits and figured it might help since most stock kernels seem
> > > to ship with it now. This seems to have helped, but it may just be
> > > wishful thinking. We still see this happening, though maybe to a lesser
> > > degree. (The following observations are with CONFIG_NUMA enabled.)
> > >
>
> Hi,
>
> As this is a NUMA machine, what is the value of
> /proc/sys/vm/zone_reclaim_mode ? When enabled, this reclaims memory
> local to the node in preference to using remote nodes. For certain
> workloads this performs better but for users that expect all of memory
> to be used, it has surprising results.
>
> If set to 1, try testing with it set to 0 and see if it makes a
> difference. Thanks

Hi Mel,

It is set to 0. It's an Intel EM64T...I only enabled CONFIG_NUMA since
it seemed to enable some more complicated handling, and I figured it
might help, but it didn't seem to. It's also required for
CONFIG_COMPACTION, but that is still marked experimental.

Simon-

2010-11-24 08:46:57

by Simon Kirby

Subject: Re: Free memory never fully used, swapping

On Tue, Nov 23, 2010 at 12:35:31AM -0800, Dave Hansen wrote:

> I wish. :) The best thing to do is to watch stuff like /proc/vmstat
> along with its friends like /proc/{buddyinfo,meminfo,slabinfo}. Could you
> post some samples of those with some indication of where the bad
> behavior was seen?
>
> I've definitely seen swapping in the face of lots of free memory, but
> only in cases where I was being a bit unfair about the numbers of
> hugetlbfs pages I was trying to reserve.

So, Dave and I spent quite some time today figuring out what was going on
here. Once load picked up during the day, kswapd actually never slept
until late in the afternoon. During the evening now, it's still waking
up in bursts, and still keeping way too much memory free:

http://0x.ca/sim/ref/2.6.36/memory_tonight.png

(NOTE: we did swapoff -a to keep /dev/sda from overloading)

We now have a much better idea of what is happening here, but also more questions.

This x86_64 box has 4 GB of RAM; zones are set up as follows:

[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0x00000001 -> 0x00001000
[ 0.000000] DMA32 0x00001000 -> 0x00100000
[ 0.000000] Normal 0x00100000 -> 0x00130000
...
[ 0.000000] On node 0 totalpages: 1047279
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 0 pages reserved
[ 0.000000] DMA zone: 3943 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 832392 pages, LIFO batch:31
[ 0.000000] Normal zone: 2688 pages used for memmap
[ 0.000000] Normal zone: 193920 pages, LIFO batch:31

So, "Normal" is relatively small, and DMA32 contains most of the RAM.
Watermarks from /proc/zoneinfo are:

Node 0, zone DMA
min 7
low 8
high 10
protection: (0, 3251, 4009, 4009)
Node 0, zone DMA32
min 1640
low 2050
high 2460
protection: (0, 0, 757, 757)
Node 0, zone Normal
min 382
low 477
high 573
protection: (0, 0, 0, 0)

This box has a couple of bnx2 NICs, which do about 60 Mbps each. Jumbo
frames were disabled for now (to try to stop big order allocations), but
this did not stop atomic allocations of order 3 coming in, as found with:

perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph -c 1 -a sleep 10
perf report

__alloc_pages_nodemask
alloc_pages_current
new_slab
__slab_alloc
__kmalloc_node_track_caller
__alloc_skb
__netdev_alloc_skb
bnx2_poll_work

From my reading of this, it seems like __alloc_skb uses kmalloc(), and
kmalloc uses the kmalloc slab unless (unlikely(size > SLUB_MAX_SIZE)),
where SLUB_MAX_SIZE is 2 * PAGE_SIZE, in which case kmalloc_large is
called which allocates pages directly. This means that reception of
jumbo frames probably actually results in (consistent) smaller order
allocations! Anyway, these GFP_ATOMIC allocations don't seem to be
failing, BUT...
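
As an aside, here is a tiny userspace model of that size check (assuming
4 KiB pages and SLUB_MAX_SIZE == 2 * PAGE_SIZE as in the 2.6.36 slub_def.h;
the +512 skb overhead and the local get_order() stand-in are just for
illustration):

/* which kmalloc() path an skb data buffer takes under SLUB -- model only */
#include <stdio.h>

#define PAGE_SIZE     4096UL
#define SLUB_MAX_SIZE (2 * PAGE_SIZE)

static int get_order(unsigned long size)        /* smallest order that fits */
{
        unsigned long pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
        int order = 0;

        while ((1UL << order) < pages)
                order++;
        return order;
}

int main(void)
{
        unsigned long sizes[] = { 1500 + 512, 9000 + 512 };  /* ~MTU + overhead */
        int i;

        for (i = 0; i < 2; i++) {
                if (sizes[i] > SLUB_MAX_SIZE)
                        printf("%5lu bytes -> kmalloc_large(), direct order-%d pages\n",
                               sizes[i], get_order(sizes[i]));
                else
                        printf("%5lu bytes -> kmalloc cache (slabs of which may be order-3)\n",
                               sizes[i]);
        }
        return 0;
}

It prints order-2 for the jumbo-sized buffer and the slab path for the
1500-byte one, which is what makes the jumbo case the smaller allocation.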

Right after kswapd goes to sleep, we're left with DMA32 with 421k or so
free pages, and Normal with 20k or so free pages (about 1.8 GB free).

Immediately, zone Normal starts being used until it reaches about 468
pages free in order 0, nothing else free. kswapd is not woken here,
but allocations just start coming from zone DMA32 instead. While this
happens, the occasional order=3 allocations coming in via the slab from
__alloc_skb seem to be picking away at the available order=3 chunks.
/proc/buddyinfo shows that there are 10k or so when it starts, so this
succeeds easily.

After a minute or so, the available order-3 count drops to a lower number,
like 20 or so. order-4 then starts dropping as it is split into order-3,
until it reaches 20 or so as well. Then, order-3 hits 0, and kswapd is
woken. When this occurs, there are still a few order-5, order-6, etc.,
available. I presume the GFP_ATOMIC allocation can still split buddies
here, still making order-3 available without sleeping, because there is
no allocation failure message that I can see.
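
Here is a toy model of that buddy splitting (loosely in the spirit of what
I understand __rmqueue_smallest()/expand() to do; no real page bookkeeping,
and the counts are made up but shaped like the DMA32 line), which is why I
think the order-3 request still succeeds as long as anything at order >= 3
remains:

/* satisfy order-3 requests from free-area counts by splitting larger blocks */
#include <stdio.h>

#define MAX_ORDER 11

static int alloc_order(long nr_free[MAX_ORDER], int order)
{
        int o;

        for (o = order; o < MAX_ORDER; o++) {
                if (nr_free[o] == 0)
                        continue;
                nr_free[o]--;              /* take one block of order o */
                while (o > order) {        /* split down to the size we need */
                        o--;
                        nr_free[o]++;      /* unused buddy half goes back on the list */
                }
                return 0;                  /* success, no sleeping required */
        }
        return -1;                         /* nothing at or above 'order' left */
}

int main(void)
{
        /* order-3 and order-4 already empty, a few order-5/6 blocks remain */
        long dma32[MAX_ORDER] = { 20343, 29369, 6020, 0, 0, 10, 2, 0, 0, 0, 0 };
        int i;

        for (i = 1; i <= 14; i++)
                if (alloc_order(dma32, 3) < 0) {
                        printf("order-3 allocation %d finds nothing at order >= 3 to split\n", i);
                        return 0;
                }
        printf("14 order-3 allocations satisfied by splitting order-5/6 blocks\n");
        return 0;
}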

Here is a "while true; do sleep 1; grep -v 'DMA ' /proc/buddyinfo; done"
("DMA" zone is totally untouched, always, so excluded; white space
crushed to avoid wrapping), while it happens:

Node 0, zone DMA 2 1 1 2 1 1 1 0 1 1 3
Node 0, zone DMA32 25770 29441 14512 10426 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
...
Node 0, zone DMA32 23343 29405 6062 6478 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 23187 29358 6047 5960 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 23000 29372 6047 5411 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 22714 29391 6076 4225 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 22354 29459 6059 3178 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 22202 29388 6035 2395 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21971 29411 6036 1032 1901 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21514 29388 6019 433 1796 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21334 29387 6019 240 1464 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21237 29421 6052 216 1336 123 4 0 0 0 0
Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20968 29378 6020 244 751 123 4 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20741 29383 6022 134 272 123 4 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20476 29370 6024 117 48 116 4 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20343 29369 6020 110 23 10 2 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21592 30477 4856 22 10 4 2 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 24388 33261 1985 6 10 4 2 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 25358 34080 1068 0 4 4 2 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 75985 68954 5345 87 1 4 2 0 0 0 0
Node 0, zone Normal 18249 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81117 71630 19261 429 3 4 2 0 0 0 0
Node 0, zone Normal 17908 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81226 71299 21038 569 19 4 2 0 0 0 0
Node 0, zone Normal 18559 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81347 71278 21068 640 19 4 2 0 0 0 0
Node 0, zone Normal 17928 21 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81370 71237 21241 1073 29 4 2 0 0 0 0
Node 0, zone Normal 18187 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81401 71237 21314 1139 29 4 2 0 0 0 0
Node 0, zone Normal 16978 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81410 71239 21314 1145 29 4 2 0 0 0 0
Node 0, zone Normal 18156 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81419 71232 21317 1160 30 4 2 0 0 0 0
Node 0, zone Normal 17536 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81347 71144 21443 1160 31 4 2 0 0 0 0
Node 0, zone Normal 18483 7 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81300 71059 21556 1178 38 4 2 0 0 0 0
Node 0, zone Normal 18528 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81315 71042 21577 1180 39 4 2 0 0 0 0
Node 0, zone Normal 18431 2 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81301 71002 21702 1202 39 4 2 0 0 0 0
Node 0, zone Normal 18487 5 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81301 70998 21702 1202 39 4 2 0 0 0 0
Node 0, zone Normal 18311 0 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81296 71025 21711 1208 45 4 2 0 0 0 0
Node 0, zone Normal 17092 5 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 81299 71023 21716 1226 45 4 2 0 0 0 0
Node 0, zone Normal 18225 12 0 0 0 0 0 0 0 0 0

Running a perf record on the kswapd wakeup right when it happens shows:
perf record --event vmscan:mm_vmscan_wakeup_kswapd -a --call-graph -c 1 -a sleep 10
perf trace
swapper-0 [002] 1323136.979119: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
swapper-0 [002] 1323136.979131: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
lmtp-20593 [003] 1323136.984066: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
lmtp-20593 [003] 1323136.984079: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
swapper-0 [001] 1323136.985511: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
swapper-0 [001] 1323136.985515: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
lmtp-20593 [003] 1323136.985673: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
lmtp-20593 [003] 1323136.985675: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3

This causes kswapd to throw out a bunch of stuff from Normal and from
DMA32, to try to get zone_watermark_ok() to be happy for order=3.
However, we have a heavy read load from all of the email stored on SSDs
on this box, and kswapd ends up fighting to try to keep reclaiming the
allocations (mostly order-0). During the whole day, it never wins -- the
allocations are faster. At night, it wins after a minute or two. The
fighting is happening in all of the lines after it awakes above.
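
To make that concrete, here is a userspace model of the order-aware
watermark check (simplified from what I understand __zone_watermark_ok()
to do in 2.6.36 -- lowmem_reserve and the ALLOC_HIGH/ALLOC_HARDER
adjustments are left out, and the free counts are made up in the shape of
the DMA32 line around the time order-3 ran out):

/* pages below the requested order are ignored and the requirement is halved
 * for each order stepped up, so the order-3 check fails as soon as order>=3
 * runs dry, no matter how much order-0..2 memory is free */
#include <stdio.h>

#define MAX_ORDER 11

static int watermark_ok(long nr_free[MAX_ORDER], long mark, int order)
{
        long free_pages = 0, min = mark;
        int o;

        for (o = 0; o < MAX_ORDER; o++)
                free_pages += nr_free[o] << o;

        if (free_pages <= min)
                return 0;
        for (o = 0; o < order; o++) {
                free_pages -= nr_free[o] << o;   /* too small to help this request */
                min >>= 1;
                if (free_pages <= min)
                        return 0;
        }
        return 1;
}

int main(void)
{
        /* ~400 MB free in total, but nothing left at order >= 3 */
        long dma32[MAX_ORDER] = { 21592, 30477, 4856, 0, 0, 0, 0, 0, 0, 0, 0 };
        long low = 2050;        /* DMA32 "low" watermark from /proc/zoneinfo */

        printf("order-0 check: %s\n", watermark_ok(dma32, low, 0) ? "ok" : "fail");
        printf("order-3 check: %s\n", watermark_ok(dma32, low, 3) ? "ok" : "fail");
        return 0;
}

It reports the order-0 check passing and the order-3 check failing, which
is exactly the combination that wakes kswapd while ~400 MB sits free.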

slabs_scanned, kswapd_steal, kswapd_inodesteal (slowly),
kswapd_skip_congestion_wait, and pageoutrun go up in vmstat while kswapd
is running. With the box up for 15 days, you can see it struggling on
pgscan_kswapd_normal (from /proc/vmstat):

pgfree 3329793080
pgactivate 643476431
pgdeactivate 155182710
pgfault 2649106647
pgmajfault 58157157
pgrefill_dma 0
pgrefill_dma32 19688032
pgrefill_normal 7600864
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 465191578
pgsteal_normal 651178518
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 768300403
pgscan_kswapd_normal 34614572907
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 2853983
pgscan_direct_normal 885799
pgscan_direct_movable 0
pginodesteal 191895
pgrotated 27290463

So, here are my questions.

Why do we care about order > 0 watermarks at all in the Normal zone?
Wouldn't it make a lot more sense to just make the DMA32 zone the only
one we care about for larger-order allocations? Or is this required for
the hugepage stuff?

The fact that so much stuff is evicted just because order-3 hits 0 is
crazy, especially when larger order pages are still free. It seems like
we're trying to keep large orders free here. Why? Maybe things would be
better if kswapd does not reclaim at all unless the requested order is
empty _and_ all orders above are empty. This would require hugepage
users to use CONFIG_COMPACTION, and have _compaction_ occur the way the
watermark checks work now, but people without CONFIG_HUGETLB_PAGE could
just actually use the memory. Would this work?

There is logic at the end of balance_pgdat() to give up balancing order>0
and just try another loop with order = 0 if sc.nr_reclaimed is <
SWAP_CLUSTER_MAX. However, when this order=0 pass returns, the caller of
balance_pgdat(), kswapd(), gets true from sleeping_prematurely() and just
calls right back to balance_pgdat() again. I think this is why this
logic doesn't seem to work here.
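
A toy model of that loop (placeholder logic only -- just the control flow
as I read it, not the real functions) shows the ping-pong:

/* balance_pgdat() falls back to an order-0 pass when it reclaims less than
 * SWAP_CLUSTER_MAX for an order>0 request, but the caller still checks the
 * original order and immediately re-enters */
#include <stdio.h>

#define SWAP_CLUSTER_MAX 32

static int order3_watermarks_ok;        /* stays 0 while order-3 is exhausted */

static void balance_pgdat(int order)
{
        int nr_reclaimed = 10;          /* pretend we only freed a little */

        if (order > 0 && nr_reclaimed < SWAP_CLUSTER_MAX)
                printf("  balance_pgdat: reclaimed < %d, retrying as order-0\n",
                       SWAP_CLUSTER_MAX);
}

static int sleeping_prematurely(int order)
{
        return order > 0 && !order3_watermarks_ok;
}

int main(void)
{
        int order = 3, pass;

        for (pass = 1; pass <= 3; pass++) {
                printf("kswapd: balance_pgdat(order=%d), pass %d\n", order, pass);
                balance_pgdat(order);
                if (sleeping_prematurely(order))
                        printf("  sleeping_prematurely -> no sleep, loop again\n");
        }
        return 0;
}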

Is my assumption correct that GFP_ATOMIC order=3 allocations still
succeed when order 3 is empty but order>3 is not? Regardless, shouldn't
kswapd be woken before order 3 hits 0, since it may have nothing above
order 3 to split from, thus actually causing an allocation failure? Does
something else do this?

Ok, that's enough for now.. :)

Simon-

2010-11-24 09:28:11

by Mel Gorman

Subject: Re: Free memory never fully used, swapping

On Tue, Nov 23, 2010 at 10:43:29PM -0800, Simon Kirby wrote:
> On Tue, Nov 23, 2010 at 10:04:03AM +0000, Mel Gorman wrote:
>
> > On Mon, Nov 22, 2010 at 03:44:19PM -0800, Andrew Morton wrote:
> > > On Mon, 15 Nov 2010 11:52:46 -0800
> > > Simon Kirby <[email protected]> wrote:
> > >
> > > > I noticed that CONFIG_NUMA seems to enable some more complicated
> > > > reclaiming bits and figured it might help since most stock kernels seem
> > > > to ship with it now. This seems to have helped, but it may just be
> > > > wishful thinking. We still see this happening, though maybe to a lesser
> > > > degree. (The following observations are with CONFIG_NUMA enabled.)
> > > >
> >
> > Hi,
> >
> > As this is a NUMA machine, what is the value of
> > /proc/sys/vm/zone_reclaim_mode ? When enabled, this reclaims memory
> > local to the node in preference to using remote nodes. For certain
> > workloads this performs better but for users that expect all of memory
> > to be used, it has surprising results.
> >
> > If set to 1, try testing with it set to 0 and see if it makes a
> > difference. Thanks
>
> Hi Mel,
>
> It is set to 0. It's an Intel EM64T...I only enabled CONFIG_NUMA since
> it seemed to enable some more complicated handling, and I figured it
> might help, but it didn't seem to. It's also required for
> CONFIG_COMPACTION, but that is still marked experimental.
>

I'm surprised a little that you are bringing compaction up because unless
there are high-order allocations involved, it wouldn't make a difference. Is there
a constant source of high-order allocations in the system e.g. a network
card configured to use jumbo frames? A possible consequence of that is that
reclaim is kicking in early to free order-[2-4] pages that would prevent 100%
of memory being used.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-24 19:17:54

by Simon Kirby

Subject: Re: Free memory never fully used, swapping

On Wed, Nov 24, 2010 at 09:27:53AM +0000, Mel Gorman wrote:

> On Tue, Nov 23, 2010 at 10:43:29PM -0800, Simon Kirby wrote:
> > On Tue, Nov 23, 2010 at 10:04:03AM +0000, Mel Gorman wrote:
> >
> > > On Mon, Nov 22, 2010 at 03:44:19PM -0800, Andrew Morton wrote:
> > > > On Mon, 15 Nov 2010 11:52:46 -0800
> > > > Simon Kirby <[email protected]> wrote:
> > > >
> > > > > I noticed that CONFIG_NUMA seems to enable some more complicated
> > > > > reclaiming bits and figured it might help since most stock kernels seem
> > > > > to ship with it now. This seems to have helped, but it may just be
> > > > > wishful thinking. We still see this happening, though maybe to a lesser
> > > > > degree. (The following observations are with CONFIG_NUMA enabled.)
> > > > >
> > >
> > > Hi,
> > >
> > > As this is a NUMA machine, what is the value of
> > > /proc/sys/vm/zone_reclaim_mode ? When enabled, this reclaims memory
> > > local to the node in preference to using remote nodes. For certain
> > > workloads this performs better but for users that expect all of memory
> > > to be used, it has surprising results.
> > >
> > > If set to 1, try testing with it set to 0 and see if it makes a
> > > difference. Thanks
> >
> > Hi Mel,
> >
> > It is set to 0. It's an Intel EM64T...I only enabled CONFIG_NUMA since
> > it seemed to enable some more complicated handling, and I figured it
> > might help, but it didn't seem to. It's also required for
> > CONFIG_COMPACTION, but that is still marked experimental.
> >
>
> I'm surprised a little that you are bringing compaction up because unless
> there are high-order allocations involved, it wouldn't make a difference. Is there
> a constant source of high-order allocations in the system e.g. a network
> card configured to use jumbo frames? A possible consequence of that is that
> reclaim is kicking in early to free order-[2-4] pages that would prevent 100%
> of memory being used.

We /were/ using jumbo frames, but only over a local cross-over connection
to another node (for DRBD), so I disabled jumbo frames on this interface
and reconnected DRBD. Even with MTUs set to 1500, we saw GFP_ATOMIC
order=3 allocations coming from __alloc_skb:

perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph sleep 10
perf trace

imap-20599 [002] 1287672.803567: mm_page_alloc: page=0xffffea00004536c0 pfn=4536000 order=3 migratetype=0 gfp_flags=GFP_ATOMIC|GFP_NOWARN|GFP_NORETRY|GFP_COMP

perf report shows:

__alloc_pages_nodemask
alloc_pages_current
new_slab
__slab_alloc
__kmalloc_node_track_caller
__alloc_skb
__netdev_alloc_skb
bnx2_poll_work

Dave was seeing these on his laptop with an Intel NIC as well. Ralf
noted that the slab cache grows in higher order blocks, so this is
normal. The GFP_ATOMIC bubbles up from *alloc_skb, I guess.

Simon-

2010-11-25 01:07:57

by Shaohua Li

Subject: Re: Free memory never fully used, swapping

On Wed, 2010-11-24 at 16:46 +0800, Simon Kirby wrote:
> [...]
> Is my assumption correct that GFP_ATOMIC order=3 allocations still
> succeed when order 3 is empty but order>3 is not? Regardless, shouldn't
> kswapd be woken before order 3 hits 0, since it may have nothing above
> order 3 to split from, thus actually causing an allocation failure? Does
> something else do this?
Even if kswapd is woken once order>3 is empty, the issue will still occur,
since the order>3 pages will soon be used up and kswapd will still need to
reclaim some pages. So the issue is that there are high-order page
allocations and lumpy reclaim wrongly reclaims some pages. Maybe you
should use SLAB instead of SLUB to avoid the high-order allocations.

2010-11-25 01:18:58

by KOSAKI Motohiro

Subject: Re: Free memory never fully used, swapping

> On Wed, Nov 24, 2010 at 09:27:53AM +0000, Mel Gorman wrote:
>
> > On Tue, Nov 23, 2010 at 10:43:29PM -0800, Simon Kirby wrote:
> > > [...]
> >
> > I'm surprised a little that you are bringing compaction up because unless
> > there are high-order allocations involved, it wouldn't make a difference. Is there
> > a constant source of high-order allocations in the system e.g. a network
> > card configured to use jumbo frames? A possible consequence of that is that
> > reclaim is kicking in early to free order-[2-4] pages that would prevent 100%
> > of memory being used.
>
> We /were/ using jumbo frames, but only over a local cross-over connection
> to another node (for DRBD), so I disabled jumbo frames on this interface
> and reconnected DRBD. Even with MTUs set to 1500, we saw GFP_ATOMIC
> order=3 allocations coming from __alloc_skb:
>
> perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph sleep 10
> perf trace
>
> imap-20599 [002] 1287672.803567: mm_page_alloc: page=0xffffea00004536c0 pfn=4536000 order=3 migratetype=0 gfp_flags=GFP_ATOMIC|GFP_NOWARN|GFP_NORETRY|GFP_COMP
>
> perf report shows:
>
> __alloc_pages_nodemask
> alloc_pages_current
> new_slab
> __slab_alloc
> __kmalloc_node_track_caller
> __alloc_skb
> __netdev_alloc_skb
> bnx2_poll_work
>
> Dave was seeing these on his laptop with an Intel NIC as well. Ralf
> noted that the slab cache grows in higher order blocks, so this is
> normal. The GFP_ATOMIC bubbles up from *alloc_skb, I guess.

Please try SLAB instead of SLUB (it can be switched by a kernel build option).
SLUB tries to use high-order allocations implicitly.

2010-11-25 09:03:42

by Simon Kirby

Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 09:07:54AM +0800, Shaohua Li wrote:

> On Wed, 2010-11-24 at 16:46 +0800, Simon Kirby wrote:
> > [...]
> > Is my assumption about GFP_ATOMIC order=3 working even when order 3 is
> > empty, but order>3 is not? Regardless, shouldn't kswapd be woken before
> > order 3 is 0 since it may have nothing above order 3 to split from, thus
> > actually causing an allocation failure? Does something else do this?
>
> even kswapd is woken after order>3 is empty, the issue will occur since
> the order > 3 pages will be used soon and kswapd still needs to reclaim
> some pages. So the issue is there is high order page allocation and
> lumpy reclaim wrongly reclaims some pages. maybe you should use slab
> instead of slub to avoid high order allocation.

There are actually a few problems here. I think they are worth looking
at separately, unless "don't use order 3 allocations" is a valid
statement, in which case we should fix slub.

The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
lower order allocation automatically if it fails. This is why there is
no allocation failure warning when this happens. However, it is too late
-- kswapd is woken and it tries to bring order 3 up to the watermark.
If we hacked __alloc_pages_slowpath() to not wake kswapd when
__GFP_NOWARN is set, we would never see this problem and the slub
optimization might still mostly work. Either way, we should "fix" slub
or "fix" order-3 allocations, so that other people who are using slub
don't hit the same problem.
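
To be concrete, the hack I have in mind is just a one-line guard around
the wakeup in the slowpath -- an untested sketch against 2.6.36, where
__GFP_NOWARN is only standing in for whatever "caller can cope with
failure" marker we would actually settle on:

    /* mm/page_alloc.c, __alloc_pages_slowpath() -- untested sketch */
    if (!(gfp_mask & __GFP_NOWARN))
            wake_all_kswapd(order, zonelist, high_zoneidx);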

kswapd is throwing out many times what is needed for the order 3
watermark to be met. It seems to be not as bad now, but look at these
pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
followed by nr_pages_free calculation and final order-3 watermark test,
kswapd woken after the second sample):

Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk

DMA32 20374 35116 975 1 2 5 1 0 0 0 0 94770 257 <= 256
DMA32 20480 35211 870 1 1 5 1 0 0 0 0 94630 241 <= 256
(kswapd wakes, gobble gobble)
DMA32 24387 37009 2910 297 100 5 1 0 0 0 0 114245 4193 <= 256
DMA32 36169 37787 4676 637 110 5 1 0 0 0 0 137527 7073 <= 256
DMA32 63443 40620 5716 982 144 5 1 0 0 0 0 177931 10377 <= 256
DMA32 65866 57006 6462 1180 158 5 1 0 0 0 0 217918 12185 <= 256
DMA32 67188 66779 9328 1893 208 5 1 0 0 0 0 256754 18689 <= 256
DMA32 67909 67356 18307 2268 235 5 1 0 0 0 0 297977 22121 <= 256
DMA32 68333 67419 20786 4192 298 7 1 0 0 0 0 324907 38585 <= 256
DMA32 69872 68096 21580 5141 326 7 1 0 0 0 0 339016 46625 <= 256
DMA32 69959 67970 22339 5657 371 10 1 0 0 0 0 346831 51569 <= 256
DMA32 70017 67946 22363 6078 417 11 1 0 0 0 0 351073 55705 <= 256
DMA32 70023 67949 22376 6204 439 12 1 0 0 0 0 352529 57097 <= 256
DMA32 70045 67937 22380 6262 451 12 1 0 0 0 0 353199 57753 <= 256
DMA32 70062 67939 22378 6298 456 12 1 0 0 0 0 353580 58121 <= 256
DMA32 70079 67959 22388 6370 458 12 1 0 0 0 0 354285 58729 <= 256
DMA32 70079 67959 22388 6387 460 12 1 0 0 0 0 354453 58897 <= 256
DMA32 70076 67954 22387 6393 460 12 1 0 0 0 0 354484 58945 <= 256
DMA32 70105 67975 22385 6466 468 12 1 0 0 0 0 355259 59657 <= 256
DMA32 70110 67972 22387 6466 470 12 1 0 0 0 0 355298 59689 <= 256
DMA32 70152 67989 22393 6476 470 12 1 0 0 0 0 355478 59769 <= 256
DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
DMA32 70192 67990 22401 6495 471 12 1 0 0 0 0 355720 59937 <= 256
DMA32 70192 67988 22401 6496 471 12 1 0 0 0 0 355724 59945 <= 256
DMA32 70099 68061 22467 6602 477 12 1 0 0 0 0 356985 60889 <= 256
DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
DMA32 70099 68062 22467 6603 477 12 1 0 0 0 0 356995 60897 <= 256
(kswapd sleeps)

Normal zone at the same time (shown separately for clarity):

Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
(kswapd wakes)
Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
(kswapd sleeps -- maybe too much turkey)

DMA32 gets so much reclaimed that the watermark test succeeded long ago.
Meanwhile, Normal is being reclaimed as well, but because it's fighting
with allocations, it tries for a while and eventually succeeds (I think),
but the 200ms samples didn't catch it.
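
For reference, the "or3-low-chk" column above is me redoing by hand what
zone_watermark_ok() does for order 3 against the low watermark. Here is a
minimal standalone sketch of that arithmetic, fed with the first two DMA32
samples and the DMA32 low watermark (2050) from /proc/zoneinfo; it ignores
lowmem_reserve, the ALLOC_HIGH/ALLOC_HARDER adjustments and the initial
full-watermark check:

    #include <stdio.h>

    /* condensed from mm/page_alloc.c:zone_watermark_ok(), order 3 only */
    static int order3_ok(const long nr_free_at[11], long min)
    {
            long free_pages = 0;
            int o;

            for (o = 0; o <= 10; o++)
                    free_pages += nr_free_at[o] << o;   /* total free pages */
            free_pages -= (1 << 3) - 1;                 /* the request itself */

            for (o = 0; o < 3; o++) {
                    free_pages -= nr_free_at[o] << o;   /* too small to help */
                    min >>= 1;                          /* relax per order */
                    if (free_pages <= min) {
                            printf("%ld <= %ld -> fails, kswapd woken\n",
                                   free_pages, min);
                            return 0;
                    }
            }
            printf("%ld > %ld -> ok\n", free_pages, min);
            return 1;
    }

    int main(void)
    {
            long sample1[11] = { 20374, 35116, 975, 1, 2, 5, 1, 0, 0, 0, 0 };
            long sample2[11] = { 20480, 35211, 870, 1, 1, 5, 1, 0, 0, 0, 0 };

            order3_ok(sample1, 2050);   /* prints 257 > 256  -> ok    */
            order3_ok(sample2, 2050);   /* prints 241 <= 256 -> woken */
            return 0;
    }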

KOSAKI Motohiro, I'm interested in your commit 73ce02e9. It seems
related to this problem, but your change is not working here.
We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
is increasing (so has_under_min_watermark_zone is true), and pageoutrun
increasing all the time. This means that balance_pgdat() keeps being
called, but sleeping_prematurely() is returning true, so kswapd() just
keeps re-calling balance_pgdat(). If your approach of stopping kswapd
here is correct, the problem seems to be that balance_pgdat's copy of order
and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
kswapd never really sleeps. How is this supposed to work?
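
For anyone following along, this is roughly the loop I mean -- paraphrased
from 2.6.36 mm/vmscan.c:kswapd() with the freezer and tracing details
dropped, so not the exact code:

    for ( ; ; ) {
            new_order = pgdat->kswapd_max_order;
            pgdat->kswapd_max_order = 0;
            if (order < new_order) {
                    /* don't sleep if someone wants a larger order */
                    order = new_order;
            } else {
                    /* bails out early as long as sleeping_prematurely()
                     * keeps returning true */
                    kswapd_try_to_sleep(pgdat, order);
                    order = pgdat->kswapd_max_order;
            }

            /* balance_pgdat() may internally give up and reclaim at
             * order 0, but neither order nor kswapd_max_order ever
             * reflect that, so we loop straight back in here */
            balance_pgdat(pgdat, order);
    }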

Our allocation load here is mostly file pages, some anon pages, and
relatively little slab or anything else.

Simon-

2010-11-25 10:18:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> There are actually a few problems here. I think they are worth looking
> at them separately, unless "don't use order 3 allocations" is a valid
> statement, in which case we should fix slub.
>
> The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> lower order allocation automatically if it fails. This is why there is
> no allocation failure warning when this happens. However, it is too late
> -- kswapd is woken and it ties to bring order 3 up to the watermark.
> If we hacked __alloc_pages_slowpath() to not wake kswapd when
> __GFP_NOWARN is set, we would never see this problem and the slub
> optimization might still mostly work. Either way, we should "fix" slub
> or "fix" order-3 allocations, so that other people who are using slub
> don't hit the same problem.

This?




Subject: [PATCH] slub: use no __GFP_WAIT instead of __GFP_NORETRY

---
mm/slub.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8c66aef..0c77399 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1134,7 +1134,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Let the initial higher-order allocation fail under memory pressure
* so we fall-back to the minimum order allocation.
*/
- alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+ alloc_gfp = (flags | __GFP_NOWARN) & ~(__GFP_NOFAIL | __GFP_WAIT);

page = alloc_slab_page(alloc_gfp, node, oo);
if (unlikely(!page)) {
--
1.6.5.2



2010-11-25 10:51:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> kswapd is throwing out many times what is needed for the order 3
> watermark to be met. It seems to be not as bad now, but look at these
> pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> followed by nr_pages_free calculation and final order-3 watermark test,
> kswapd woken after the second sample):
>
> Normal zone at the same time (shown separately for clarity):
>
> Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
>
> Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> (kswapd wakes)
> Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> (kswapd sleeps -- maybe too much turkey)
>
> DMA32 get so much reclaimed that the watermark test succeeded long ago.
> Meanwhile, Normal is being reclaimed as well, but because it's fighting
> with allocations, it tries for a while and eventually succeeds (I think),
> but the 200ms samples didn't catch it.
>
> KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> to be similar to this problem, but your change is not working here.
> We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> increasing all the time. This means that balance_pgdat() keeps being
> called, but sleeping_prematurely() is returning true, so kswapd() just
> keeps re-calling balance_pgdat(). If your approach is correct to stop
> kswapd here, the problem seems to be that balance_pgdat's copy of order
> and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> kswapd never really sleeps. How is this supposed to work?

Um, this seems to be a regression since commit f50de2d381 (vmscan: have kswapd sleep
for a short interval and double check it should be asleep).

Can you please try this?



Subject: [PATCH] vmscan: don't re-wake kswapd if zone memory was exhausted

---
mm/vmscan.c | 18 +++++++++++-------
1 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1fcadaf..2945c74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2148,7 +2148,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
*
- * Returns the number of pages which were actually freed.
+ * Return 1 if balancing succeeded, otherwise 0.
*
* There is special handling here for zones which are full of pinned pages.
* This can happen if the pages are all mlocked, or if they are all used by
@@ -2165,7 +2165,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static int balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
int priority;
@@ -2361,10 +2361,10 @@ out:
goto loop_again;
}

- return sc.nr_reclaimed;
+ return (sc.nr_reclaimed >= SWAP_CLUSTER_MAX);
}

-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int force)
{
long remaining = 0;
DEFINE_WAIT(wait);
@@ -2374,6 +2374,9 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)

prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

+ if (force)
+ goto sleep;
+
/* Try to sleep for a short interval */
if (!sleeping_prematurely(pgdat, order, remaining)) {
remaining = schedule_timeout(HZ/10);
@@ -2386,6 +2389,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
* go fully to sleep until explicitly woken up.
*/
if (!sleeping_prematurely(pgdat, order, remaining)) {
+ sleep:
trace_mm_vmscan_kswapd_sleep(pgdat->node_id);

/*
@@ -2426,11 +2430,11 @@ static int kswapd(void *p)
unsigned long order;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
-
struct reclaim_state reclaim_state = {
.reclaimed_slab = 0,
};
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
+ int balanced = 0;

lockdep_set_current_reclaim_state(GFP_KERNEL);

@@ -2467,7 +2471,7 @@ static int kswapd(void *p)
*/
order = new_order;
} else {
- kswapd_try_to_sleep(pgdat, order);
+ kswapd_try_to_sleep(pgdat, order, !balanced);
order = pgdat->kswapd_max_order;
}

@@ -2481,7 +2485,7 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- balance_pgdat(pgdat, order);
+ balanced = balance_pgdat(pgdat, order);
}
}
return 0;
--
1.6.5.2



2010-11-25 16:12:55

by Mel Gorman

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:
> > > <SNIP>
> > >
> > > This x86_64 box has 4 GB of RAM; zones are set up as follows:
> > >
> > > [ 0.000000] Zone PFN ranges:
> > > [ 0.000000] DMA 0x00000001 -> 0x00001000
> > > [ 0.000000] DMA32 0x00001000 -> 0x00100000
> > > [ 0.000000] Normal 0x00100000 -> 0x00130000
> > > ...
> > > [ 0.000000] On node 0 totalpages: 1047279
> > > [ 0.000000] DMA zone: 56 pages used for memmap
> > > [ 0.000000] DMA zone: 0 pages reserved
> > > [ 0.000000] DMA zone: 3943 pages, LIFO batch:0
> > > [ 0.000000] DMA32 zone: 14280 pages used for memmap
> > > [ 0.000000] DMA32 zone: 832392 pages, LIFO batch:31
> > > [ 0.000000] Normal zone: 2688 pages used for memmap
> > > [ 0.000000] Normal zone: 193920 pages, LIFO batch:31
> > >
> > > So, "Normal" is relatively small, and DMA32 contains most of the RAM.

Ok. A consequence of this is that kswapd balancing a node will still try
to balance Normal even if DMA32 has enough memory. This could account
for some of kswapd being mean.
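
Roughly what happens in balance_pgdat() (condensed sketch; the anon aging
and all_unreclaimable checks left out): it scans from the highest zone
down, and the first zone failing its high watermark sets the reclaim
target, so a small Normal zone drags the whole node into reclaim:

    for (i = pgdat->nr_zones - 1; i >= 0; i--) {
            struct zone *zone = pgdat->node_zones + i;

            if (!populated_zone(zone))
                    continue;
            if (!zone_watermark_ok(zone, order,
                                   high_wmark_pages(zone), 0, 0)) {
                    /* everything from DMA up to this zone gets scanned */
                    end_zone = i;
                    break;
            }
    }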

> > > Watermarks from /proc/zoneinfo are:
> > >
> > > Node 0, zone DMA
> > > min 7
> > > low 8
> > > high 10
> > > protection: (0, 3251, 4009, 4009)
> > > Node 0, zone DMA32
> > > min 1640
> > > low 2050
> > > high 2460
> > > protection: (0, 0, 757, 757)
> > > Node 0, zone Normal
> > > min 382
> > > low 477
> > > high 573
> > > protection: (0, 0, 0, 0)
> > >
> > > This box has a couple bnx2 NICs, which do about 60 Mbps each. Jumbo
> > > frames were disabled for now (to try to stop big order allocations), but
> > > this did not stop atomic allocations of order 3 coming in, as found with:
> > >
> > > perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph -c 1 -a sleep 10
> > > perf report
> > >
> > > __alloc_pages_nodemask
> > > alloc_pages_current
> > > new_slab
> > > __slab_alloc
> > > __kmalloc_node_track_caller
> > > __alloc_skb
> > > __netdev_alloc_skb
> > > bnx2_poll_work
> > >
> > > From my reading of this, it seems like __alloc_skb uses kmalloc(), and
> > > kmalloc uses the kmalloc slab unless (unlikely(size > SLUB_MAX_SIZE)),
> > > where SLUB_MAX_SIZE is 2 * PAGE_SIZE, in which case kmalloc_large is
> > > called which allocates pages directly. This means that reception of
> > > jumbo frames probably actually results in (consistent) smaller order
> > > allocations! Anyway, these GFP_ATOMIC allocations don't seem to be
> > > failing, BUT...
> > >

It's possible to reduce the maximum order that SLUB uses but let's not
resort to that as a workaround just yet. In case it needs to be
eliminated as a source of problems later, the relevant kernel parameter
is slub_max_order=.

> > > Right after kswapd goes to sleep, we're left with DMA32 with 421k or so
> > > free pages, and Normal with 20k or so free pages (about 1.8 GB free).
> > >
> > > Immediately, zone Normal starts being used until it reaches about 468
> > > pages free in order 0, nothing else free. kswapd is not woken here,
> > > but allocations just start coming from zone DMA32 instead.

kswapd is not woken up because we stay in the allocator fastpath once
that much memory has been freed.
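
Condensed from 2.6.36 __alloc_pages_nodemask(); only the slowpath wakes
kswapd, so as long as the low-watermark attempt keeps succeeding there is
no wakeup:

    page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
                    high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
                    preferred_zone, migratetype);
    if (unlikely(!page))
            /* wake_all_kswapd() is called in here, nowhere else */
            page = __alloc_pages_slowpath(gfp_mask, order, zonelist,
                            high_zoneidx, nodemask,
                            preferred_zone, migratetype);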

> > > While this
> > > happens, the occasional order=3 allocations coming in via the slab from
> > > __alloc_skb seem to be picking away at the available order=3 chunks.
> > > /proc/buddyinfo shows that there are 10k or so when it starts, so this
> > > succeeds easily.
> > >
> > > After a minute or so, available order-3 start reaching a lower number,
> > > like 20 or so. order-4 then starts dropping as it is split into order-3,
> > > until it reaches 20 or so as well. Then, order-3 hits 0, and kswapd is
> > > woken.

Allocator slowpath.

> > > When this occurs, there are still a few order-5, order-6, etc.,
> > > available.

Watermarks are probably not met though.

> > > I presume the GFP_ATOMIC allocation can still split buddies
> > > here, still making order-3 available without sleeping, because there is
> > > no allocation failure message that I can see.
> > >

Technically it could, but watermark maintenance is important.

> > > Here is a "while true; do sleep 1; grep -v 'DMA ' /proc/buddyinfo; done"
> > > ("DMA" zone is totally untouched, always, so excluded; white space
> > > crushed to avoid wrapping), while it happens:
> > >
> > > Node 0, zone DMA 2 1 1 2 1 1 1 0 1 1 3
> > > Node 0, zone DMA32 25770 29441 14512 10426 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > ...
> > > Node 0, zone DMA32 23343 29405 6062 6478 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 23187 29358 6047 5960 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 23000 29372 6047 5411 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 22714 29391 6076 4225 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 22354 29459 6059 3178 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 22202 29388 6035 2395 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 21971 29411 6036 1032 1901 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 21514 29388 6019 433 1796 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 21334 29387 6019 240 1464 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 21237 29421 6052 216 1336 123 4 0 0 0 0
> > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 20968 29378 6020 244 751 123 4 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 20741 29383 6022 134 272 123 4 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 20476 29370 6024 117 48 116 4 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 20343 29369 6020 110 23 10 2 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 21592 30477 4856 22 10 4 2 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 24388 33261 1985 6 10 4 2 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 25358 34080 1068 0 4 4 2 0 0 0 0
> > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 75985 68954 5345 87 1 4 2 0 0 0 0
> > > Node 0, zone Normal 18249 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81117 71630 19261 429 3 4 2 0 0 0 0
> > > Node 0, zone Normal 17908 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81226 71299 21038 569 19 4 2 0 0 0 0
> > > Node 0, zone Normal 18559 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81347 71278 21068 640 19 4 2 0 0 0 0
> > > Node 0, zone Normal 17928 21 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81370 71237 21241 1073 29 4 2 0 0 0 0
> > > Node 0, zone Normal 18187 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81401 71237 21314 1139 29 4 2 0 0 0 0
> > > Node 0, zone Normal 16978 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81410 71239 21314 1145 29 4 2 0 0 0 0
> > > Node 0, zone Normal 18156 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81419 71232 21317 1160 30 4 2 0 0 0 0
> > > Node 0, zone Normal 17536 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81347 71144 21443 1160 31 4 2 0 0 0 0
> > > Node 0, zone Normal 18483 7 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81300 71059 21556 1178 38 4 2 0 0 0 0
> > > Node 0, zone Normal 18528 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81315 71042 21577 1180 39 4 2 0 0 0 0
> > > Node 0, zone Normal 18431 2 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81301 71002 21702 1202 39 4 2 0 0 0 0
> > > Node 0, zone Normal 18487 5 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81301 70998 21702 1202 39 4 2 0 0 0 0
> > > Node 0, zone Normal 18311 0 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81296 71025 21711 1208 45 4 2 0 0 0 0
> > > Node 0, zone Normal 17092 5 0 0 0 0 0 0 0 0 0
> > > Node 0, zone DMA32 81299 71023 21716 1226 45 4 2 0 0 0 0
> > > Node 0, zone Normal 18225 12 0 0 0 0 0 0 0 0 0
> > >
> > > Running a perf record on the kswapd wakeup right when it happens shows:
> > > perf record --event vmscan:mm_vmscan_wakeup_kswapd -a --call-graph -c 1 -a sleep 10
> > > perf trace
> > > swapper-0 [002] 1323136.979119: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > swapper-0 [002] 1323136.979131: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > lmtp-20593 [003] 1323136.984066: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > lmtp-20593 [003] 1323136.984079: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > swapper-0 [001] 1323136.985511: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > swapper-0 [001] 1323136.985515: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > lmtp-20593 [003] 1323136.985673: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > lmtp-20593 [003] 1323136.985675: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > >
> > > This causes kswapd to throw out a bunch of stuff from Normal and from
> > > DMA32, to try to get zone_watermark_ok() to be happy for order=3.

Yep.

> > > However, we have a heavy read load from all of the email stored on SSDs
> > > on this box, and kswapd ends up fighting to try to keep reclaiming the
> > > allocations (mostly order-0). During the whole day, it never wins -- the
> > > allocations are faster. At night, it wins after a minute or two. The
> > > fighting is happening in all of the lines after it awakes above.
> > >

It's probably fighting to keep *all* zones happy even though it's not strictly
necessary. I suspect it's fighting the most for Normal.

> > > slabs_scanned, kswapd_steal, kswapd_inodesteal (slowly),
> > > kswapd_skip_congestion_wait, and pageoutrun go up in vmstat while kswapd
> > > is running. With the box up for 15 days, you can see it struggling on
> > > pgscan_kswapd_normal (from /proc/vmstat):
> > >
> > > pgfree 3329793080
> > > pgactivate 643476431
> > > pgdeactivate 155182710
> > > pgfault 2649106647
> > > pgmajfault 58157157
> > > pgrefill_dma 0
> > > pgrefill_dma32 19688032
> > > pgrefill_normal 7600864
> > > pgrefill_movable 0
> > > pgsteal_dma 0
> > > pgsteal_dma32 465191578
> > > pgsteal_normal 651178518
> > > pgsteal_movable 0
> > > pgscan_kswapd_dma 0
> > > pgscan_kswapd_dma32 768300403
> > > pgscan_kswapd_normal 34614572907
> > > pgscan_kswapd_movable 0
> > > pgscan_direct_dma 0
> > > pgscan_direct_dma32 2853983
> > > pgscan_direct_normal 885799
> > > pgscan_direct_movable 0
> > > pginodesteal 191895
> > > pgrotated 27290463
> > >
> > > So, here are my questions.
> > >
> > > Why do we care about order > 0 watermarks at all in the Normal zone?
> > > Wouldn't it make a lot more sense to just make the DMA32 zone the only
> > > one we care about for larger-order allocations? Or is this required for
> > > the hugepage stuff?
> > >

It's not required. The logic for kswapd is "balance all zones" and
Normal is one of the zones. Even though you know that DMA32 is just
fine, kswapd doesn't.

> > > The fact that so much stuff is evicted just because order-3 hits 0 is
> > > crazy, especially when larger order pages are still free. It seems like
> > > we're trying to keep large orders free here. Why?

Watermarks. The steady stream of order-3 allocations is telling the
allocator and kswapd that pages of this size must be available. It doesn't
know that slub can happily fall back to smaller pages because that
information is lost. Even removing __GFP_WAIT won't help because kswapd
still gets woken up for atomic allocation requests.
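
For reference, condensed from 2.6.36 __alloc_pages_slowpath(): the wakeup
comes before the atomic bail-out, which is why clearing __GFP_WAIT on the
slub side doesn't avoid it:

    wake_all_kswapd(order, zonelist, high_zoneidx);

    alloc_flags = gfp_to_alloc_flags(gfp_mask);

    /* ... one more get_page_from_freelist() attempt ... */

    /* Atomic allocations - we can't balance anything */
    if (!wait)
            goto nopage;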

> > > Maybe things would be
> > > better if kswapd does not reclaim at all unless the requested order is
> > > empty _and_ all orders above are empty. This would require hugepage
> > > users to use CONFIG_COMPACT, and have _compaction_ occur the way the
> > > watermark checks work now, but people without CONFIG_HUGETLB_PAGE could
> > > just actually use the memory. Would this work?
> > >
> > > There is logic at the end of balance_pgdat() to give up balancing order>0
> > > and just try another loop with order = 0 if sc.nr_reclaimed is <
> > > SWAP_CLUSTER_MAX. However, when this order=0 pass returns, the caller of
> > > balance_pgdat(), kswapd(), gets true from sleeping_prematurely() and just
> > > calls right back to balance_pgdat() again. I think this is why this
> > > logic doesn't seem to work here.
> > >

Ok, this is true. kswapd in balance_pgdat() has given up on the order
but that information is lost when sleeping_prematurely() is called so it
constantly loops. That is a mistake. balance_pgdat() could return the order
so sleeping_prematurely() doesn't do the wrong thing.

> > > Is my assumption about GFP_ATOMIC order=3 working even when order 3 is
> > > empty, but order>3 is not? Regardless, shouldn't kswapd be woken before
> > > order 3 is 0 since it may have nothing above order 3 to split from, thus
> > > actually causing an allocation failure? Does something else do this?
> >
> > even kswapd is woken after order>3 is empty, the issue will occur since
> > the order > 3 pages will be used soon and kswapd still needs to reclaim
> > some pages. So the issue is there is high order page allocation and
> > lumpy reclaim wrongly reclaims some pages. maybe you should use slab
> > instead of slub to avoid high order allocation.
>
> There are actually a few problems here. I think they are worth looking
> at them separately, unless "don't use order 3 allocations" is a valid
> statement, in which case we should fix slub.
>

SLUB can be forced to use smaller orders but I don't think that's the
right fix here.

> The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> lower order allocation automatically if it fails. This is why there is
> no allocation failure warning when this happens. However, it is too late
> -- kswapd is woken and it ties to bring order 3 up to the watermark.
> If we hacked __alloc_pages_slowpath() to not wake kswapd when
> __GFP_NOWARN is set, we would never see this problem and the slub
> optimization might still mostly work.

Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
failures as a result so that fix would cause other regressions.

> Either way, we should "fix" slub
> or "fix" order-3 allocations, so that other people who are using slub
> don't hit the same problem.
>
> kswapd is throwing out many times what is needed for the order 3
> watermark to be met. It seems to be not as bad now, but look at these
> pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> followed by nr_pages_free calculation and final order-3 watermark test,
> kswapd woken after the second sample):
>
> Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
>
> DMA32 20374 35116 975 1 2 5 1 0 0 0 0 94770 257 <= 256
> DMA32 20480 35211 870 1 1 5 1 0 0 0 0 94630 241 <= 256
> (kswapd wakes, gobble gobble)
> DMA32 24387 37009 2910 297 100 5 1 0 0 0 0 114245 4193 <= 256
> DMA32 36169 37787 4676 637 110 5 1 0 0 0 0 137527 7073 <= 256
> DMA32 63443 40620 5716 982 144 5 1 0 0 0 0 177931 10377 <= 256
> DMA32 65866 57006 6462 1180 158 5 1 0 0 0 0 217918 12185 <= 256
> DMA32 67188 66779 9328 1893 208 5 1 0 0 0 0 256754 18689 <= 256
> DMA32 67909 67356 18307 2268 235 5 1 0 0 0 0 297977 22121 <= 256
> DMA32 68333 67419 20786 4192 298 7 1 0 0 0 0 324907 38585 <= 256
> DMA32 69872 68096 21580 5141 326 7 1 0 0 0 0 339016 46625 <= 256
> DMA32 69959 67970 22339 5657 371 10 1 0 0 0 0 346831 51569 <= 256
> DMA32 70017 67946 22363 6078 417 11 1 0 0 0 0 351073 55705 <= 256
> DMA32 70023 67949 22376 6204 439 12 1 0 0 0 0 352529 57097 <= 256
> DMA32 70045 67937 22380 6262 451 12 1 0 0 0 0 353199 57753 <= 256
> DMA32 70062 67939 22378 6298 456 12 1 0 0 0 0 353580 58121 <= 256
> DMA32 70079 67959 22388 6370 458 12 1 0 0 0 0 354285 58729 <= 256
> DMA32 70079 67959 22388 6387 460 12 1 0 0 0 0 354453 58897 <= 256
> DMA32 70076 67954 22387 6393 460 12 1 0 0 0 0 354484 58945 <= 256
> DMA32 70105 67975 22385 6466 468 12 1 0 0 0 0 355259 59657 <= 256
> DMA32 70110 67972 22387 6466 470 12 1 0 0 0 0 355298 59689 <= 256
> DMA32 70152 67989 22393 6476 470 12 1 0 0 0 0 355478 59769 <= 256
> DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> DMA32 70192 67990 22401 6495 471 12 1 0 0 0 0 355720 59937 <= 256
> DMA32 70192 67988 22401 6496 471 12 1 0 0 0 0 355724 59945 <= 256
> DMA32 70099 68061 22467 6602 477 12 1 0 0 0 0 356985 60889 <= 256
> DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> DMA32 70099 68062 22467 6603 477 12 1 0 0 0 0 356995 60897 <= 256
> (kswapd sleeps)
>
> Normal zone at the same time (shown separately for clarity):
>
> Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> (kswapd wakes)
> Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> (kswapd sleeps -- maybe too much turkey)
>
> DMA32 get so much reclaimed that the watermark test succeeded long ago.
> Meanwhile, Normal is being reclaimed as well, but because it's fighting
> with allocations, it tries for a while and eventually succeeds (I think),
> but the 200ms samples didn't catch it.
>

So, the key here is that kswapd didn't need to balance all zones; any one
of them would have been fine.

> KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> to be similar to this problem, but your change is not working here.

It's not working here because sleeping_prematurely() interferes with it.

> We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> increasing all the time. This means that balance_pgdat() keeps being
> called, but sleeping_prematurely() is returning true, so kswapd() just
> keeps re-calling balance_pgdat(). If your approach is correct to stop
> kswapd here, the problem seems to be that balance_pgdat's copy of order
> and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> kswapd never really sleeps. How is this supposed to work?
>

It doesn't.

> Our allocation load here is mostly file pages, some anon pages, and
> relatively little slab and anything else.
>

I think there are at least two fixes required here.

1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
the order.
2. kswapd is trying to balance all zones for higher orders even though
it doesn't really have to.

This patch has potential fixes for both of these problems. I have a split-out
series but I'm posting it as a single patch; see if it allows kswapd to
go to sleep as expected for you and whether it stops hammering the Normal
zone unnecessarily. I tested it locally here (albeit with compaction
enabled) and it did reduce the amount of time kswapd spent awake.

==== CUT HERE ====
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 39c24eb..25fe08d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -645,6 +645,7 @@ typedef struct pglist_data {
wait_queue_head_t kswapd_wait;
struct task_struct *kswapd;
int kswapd_max_order;
+ enum zone_type high_zoneidx;
} pg_data_t;

#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
@@ -660,7 +661,7 @@ typedef struct pglist_data {

extern struct mutex zonelists_mutex;
void build_all_zonelists(void *data);
-void wakeup_kswapd(struct zone *zone, int order);
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
int classzone_idx, int alloc_flags);
enum memmap_context {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..344b597 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
struct zone *zone;

for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ wakeup_kswapd(zone, order, high_zoneidx);
}

static inline int
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..00529a0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
#endif

/* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
+static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
int i;
+ bool all_zones_ok = true;
+ bool any_zone_ok = false;

/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return 1;

- /* If after HZ/10, a zone is below the high mark, it's premature */
+ /* Check the watermark levels */
for (i = 0; i < pgdat->nr_zones; i++) {
struct zone *zone = pgdat->node_zones + i;

@@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)

if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
0, 0))
- return 1;
+ all_zones_ok = false;
+ else
+ any_zone_ok = true;
}

- return 0;
+ /*
+ * For high-order requests, any zone meeting the watermark is enough
+ * to allow kswapd to go back to sleep
+ * For order-0, all zones must be balanced
+ */
+ if (order)
+ return !any_zone_ok;
+ else
+ return !all_zones_ok;
}

/*
@@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
+ int any_zone_ok;
int priority;
int i;
unsigned long total_scanned;
@@ -2201,6 +2214,7 @@ loop_again:
disable_swap_token();

all_zones_ok = 1;
+ any_zone_ok = 0;

/*
* Scan in the highmem->dma direction for the highest
@@ -2310,10 +2324,12 @@ loop_again:
* spectulatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
+ if (i <= pgdat->high_zoneidx)
+ any_zone_ok = 1;
}

}
- if (all_zones_ok)
+ if (all_zones_ok || (order && any_zone_ok))
break; /* kswapd: all done */
/*
* OK, kswapd is getting into trouble. Take a nap, then take
@@ -2336,7 +2352,7 @@ loop_again:
break;
}
out:
- if (!all_zones_ok) {
+ if (!(all_zones_ok || (order && any_zone_ok))) {
cond_resched();

try_to_freeze();
@@ -2361,7 +2377,13 @@ out:
goto loop_again;
}

- return sc.nr_reclaimed;
+ /*
+ * Return the order we were reclaiming at so sleeping_prematurely()
+ * makes a decision on the order we were last reclaiming at. However,
+ * if another caller entered the allocator slow path while kswapd
+ * was awake, order will remain at the higher level
+ */
+ return order;
}

/*
@@ -2417,6 +2439,7 @@ static int kswapd(void *p)
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
new_order = pgdat->kswapd_max_order;
pgdat->kswapd_max_order = 0;
+ pgdat->high_zoneidx = MAX_ORDER;
if (order < new_order) {
/*
* Don't sleep if someone wants a larger 'order'
@@ -2464,7 +2487,7 @@ static int kswapd(void *p)
*/
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
- balance_pgdat(pgdat, order);
+ order = balance_pgdat(pgdat, order);
}
}
return 0;
@@ -2473,7 +2496,7 @@ static int kswapd(void *p)
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
-void wakeup_kswapd(struct zone *zone, int order)
+void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
{
pg_data_t *pgdat;

@@ -2483,8 +2506,10 @@ void wakeup_kswapd(struct zone *zone, int order)
pgdat = zone->zone_pgdat;
if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
return;
- if (pgdat->kswapd_max_order < order)
+ if (pgdat->kswapd_max_order < order) {
pgdat->kswapd_max_order = order;
+ pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
+ }
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;

2010-11-25 16:15:42

by Mel Gorman

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 07:51:44PM +0900, KOSAKI Motohiro wrote:
> > kswapd is throwing out many times what is needed for the order 3
> > watermark to be met. It seems to be not as bad now, but look at these
> > pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> > followed by nr_pages_free calculation and final order-3 watermark test,
> > kswapd woken after the second sample):
> >
> > Normal zone at the same time (shown separately for clarity):
> >
> > Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
> >
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > (kswapd wakes)
> > Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> > Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> > Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> > Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> > Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> > Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> > Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> > Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> > Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> > Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> > Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> > Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> > Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> > Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> > Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> > Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > (kswapd sleeps -- maybe too much turkey)
> >
> > DMA32 get so much reclaimed that the watermark test succeeded long ago.
> > Meanwhile, Normal is being reclaimed as well, but because it's fighting
> > with allocations, it tries for a while and eventually succeeds (I think),
> > but the 200ms samples didn't catch it.
> >
> > KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> > to be similar to this problem, but your change is not working here.
> > We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> > is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> > increasing all the time. This means that balance_pgdat() keeps being
> > called, but sleeping_prematurely() is returning true, so kswapd() just
> > keeps re-calling balance_pgdat(). If your approach is correct to stop
> > kswapd here, the problem seems to be that balance_pgdat's copy of order
> > and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> > kswapd never really sleeps. How is this supposed to work?
>
> Um. this seems regression since commit f50de2d381 (vmscan: have kswapd sleep
> for a short interval and double check it should be asleep)
>

I wrote my own patch before I saw this, but for one of the issues we are doing
something similar. You are checking whether enough pages got reclaimed, whereas
my patch considers any zone being balanced for high orders sufficient
for kswapd to go to sleep. I think mine is a little stronger because
it's checking what state the zones are in for a given order regardless
of what has been reclaimed. Let's see what testing has to say.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-25 17:13:21

by Simon Kirby

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 07:18:49PM +0900, KOSAKI Motohiro wrote:

> This?

> - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
> + alloc_gfp = (flags | __GFP_NOWARN) & ~(__GFP_NOFAIL | __GFP_WAIT);

kswapd still gets woken in the !__GFP_WAIT case, which is what I was
seeing anyway, because the order-3 allocations were starting from
__alloc_skb().

Simon-

2010-11-26 00:07:10

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> Subject: [PATCH] vmscan: don't rewakeup kswapd if zone memory was exhaust
>
> ---
> mm/vmscan.c | 18 +++++++++++-------
> 1 files changed, 11 insertions(+), 7 deletions(-)

Ugh, this is buggy. please ignore.

sorry.


2010-11-26 00:33:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> On Thu, Nov 25, 2010 at 07:18:49PM +0900, KOSAKI Motohiro wrote:
>
> > This?
>
> > - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
> > + alloc_gfp = (flags | __GFP_NOWARN) & ~(__GFP_NOFAIL | __GFP_WAIT);
>
> kswapd still gets woken in the !__GFP_WAIT case, which is what I was
> seeing anyway, because the order-3 allocatons were starting from
> __alloc_skb().

Oops, my fault.
We also need __GFP_NOKSWAPD. (Proposed by Andrea)
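
Something like this (sketch only -- __GFP_NOKSWAPD does not exist yet, so
the flag would first need a free gfp bit; both hunks are illustrative):

    /* mm/slub.c:allocate_slab(): mark the optimistic high-order attempt */
    alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NOKSWAPD)
                    & ~__GFP_NOFAIL;

    /* mm/page_alloc.c:__alloc_pages_slowpath(): honour it */
    if (!(gfp_mask & __GFP_NOKSWAPD))
            wake_all_kswapd(order, zonelist, high_zoneidx);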

2010-11-26 01:06:00

by Shaohua Li

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, 2010-11-26 at 00:12 +0800, Mel Gorman wrote:
> On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:
> > > > <SNIP>
> > > >
> > > > This x86_64 box has 4 GB of RAM; zones are set up as follows:
> > > >
> > > > [ 0.000000] Zone PFN ranges:
> > > > [ 0.000000] DMA 0x00000001 -> 0x00001000
> > > > [ 0.000000] DMA32 0x00001000 -> 0x00100000
> > > > [ 0.000000] Normal 0x00100000 -> 0x00130000
> > > > ...
> > > > [ 0.000000] On node 0 totalpages: 1047279
> > > > [ 0.000000] DMA zone: 56 pages used for memmap
> > > > [ 0.000000] DMA zone: 0 pages reserved
> > > > [ 0.000000] DMA zone: 3943 pages, LIFO batch:0
> > > > [ 0.000000] DMA32 zone: 14280 pages used for memmap
> > > > [ 0.000000] DMA32 zone: 832392 pages, LIFO batch:31
> > > > [ 0.000000] Normal zone: 2688 pages used for memmap
> > > > [ 0.000000] Normal zone: 193920 pages, LIFO batch:31
> > > >
> > > > So, "Normal" is relatively small, and DMA32 contains most of the RAM.
>
> Ok. A consequence of this is that kswapd balancing a node will still try
> to balance Normal even if DMA32 has enough memory. This could account
> for some of kswapd being mean.
>
> > > > Watermarks from /proc/zoneinfo are:
> > > >
> > > > Node 0, zone DMA
> > > > min 7
> > > > low 8
> > > > high 10
> > > > protection: (0, 3251, 4009, 4009)
> > > > Node 0, zone DMA32
> > > > min 1640
> > > > low 2050
> > > > high 2460
> > > > protection: (0, 0, 757, 757)
> > > > Node 0, zone Normal
> > > > min 382
> > > > low 477
> > > > high 573
> > > > protection: (0, 0, 0, 0)
> > > >
> > > > This box has a couple bnx2 NICs, which do about 60 Mbps each. Jumbo
> > > > frames were disabled for now (to try to stop big order allocations), but
> > > > this did not stop atomic allocations of order 3 coming in, as found with:
> > > >
> > > > perf record --event kmem:mm_page_alloc --filter 'order>=3' -a --call-graph -c 1 -a sleep 10
> > > > perf report
> > > >
> > > > __alloc_pages_nodemask
> > > > alloc_pages_current
> > > > new_slab
> > > > __slab_alloc
> > > > __kmalloc_node_track_caller
> > > > __alloc_skb
> > > > __netdev_alloc_skb
> > > > bnx2_poll_work
> > > >
> > > > From my reading of this, it seems like __alloc_skb uses kmalloc(), and
> > > > kmalloc uses the kmalloc slab unless (unlikely(size > SLUB_MAX_SIZE)),
> > > > where SLUB_MAX_SIZE is 2 * PAGE_SIZE, in which case kmalloc_large is
> > > > called which allocates pages directly. This means that reception of
> > > > jumbo frames probably actually results in (consistent) smaller order
> > > > allocations! Anyway, these GFP_ATOMIC allocations don't seem to be
> > > > failing, BUT...
> > > >
>
> It's possible to reduce the maximum order that SLUB uses but lets not
> resort to that as a workaround just yet. In case it needs to be
> elminiated as a source of problems later, the relevant kernel parameter
> is slub_max_order=.
>
> > > > Right after kswapd goes to sleep, we're left with DMA32 with 421k or so
> > > > free pages, and Normal with 20k or so free pages (about 1.8 GB free).
> > > >
> > > > Immediately, zone Normal starts being used until it reaches about 468
> > > > pages free in order 0, nothing else free. kswapd is not woken here,
> > > > but allocations just start coming from zone DMA32 instead.
>
> kswapd is not woken up because we stay in the allocator fastpath once
> that much memory hs been freed.
>
> > > > While this
> > > > happens, the occasional order=3 allocations coming in via the slab from
> > > > __alloc_skb seem to be picking away at the available order=3 chunks.
> > > > /proc/buddyinfo shows that there are 10k or so when it starts, so this
> > > > succeeds easily.
> > > >
> > > > After a minute or so, available order-3 start reaching a lower number,
> > > > like 20 or so. order-4 then starts dropping as it is split into order-3,
> > > > until it reaches 20 or so as well. Then, order-3 hits 0, and kswapd is
> > > > woken.
>
> Allocator slowpath.
>
> > > > When this occurs, there are still a few order-5, order-6, etc.,
> > > > available.
>
> Watermarks are probably not met though.
>
> > > > I presume the GFP_ATOMIC allocation can still split buddies
> > > > here, still making order-3 available without sleeping, because there is
> > > > no allocation failure message that I can see.
> > > >
>
> Technically it could, but watermark maintenance is important.
>
> > > > Here is a "while true; do sleep 1; grep -v 'DMA ' /proc/buddyinfo; done"
> > > > ("DMA" zone is totally untouched, always, so excluded; white space
> > > > crushed to avoid wrapping), while it happens:
> > > >
> > > > Node 0, zone DMA 2 1 1 2 1 1 1 0 1 1 3
> > > > Node 0, zone DMA32 25770 29441 14512 10426 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > ...
> > > > Node 0, zone DMA32 23343 29405 6062 6478 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 23187 29358 6047 5960 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 23000 29372 6047 5411 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22714 29391 6076 4225 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22354 29459 6059 3178 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 22202 29388 6035 2395 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21971 29411 6036 1032 1901 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21514 29388 6019 433 1796 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21334 29387 6019 240 1464 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21237 29421 6052 216 1336 123 4 0 0 0 0
> > > > Node 0, zone Normal 455 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20968 29378 6020 244 751 123 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20741 29383 6022 134 272 123 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20476 29370 6024 117 48 116 4 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 20343 29369 6020 110 23 10 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 21592 30477 4856 22 10 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 24388 33261 1985 6 10 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 25358 34080 1068 0 4 4 2 0 0 0 0
> > > > Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 75985 68954 5345 87 1 4 2 0 0 0 0
> > > > Node 0, zone Normal 18249 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81117 71630 19261 429 3 4 2 0 0 0 0
> > > > Node 0, zone Normal 17908 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81226 71299 21038 569 19 4 2 0 0 0 0
> > > > Node 0, zone Normal 18559 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81347 71278 21068 640 19 4 2 0 0 0 0
> > > > Node 0, zone Normal 17928 21 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81370 71237 21241 1073 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 18187 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81401 71237 21314 1139 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 16978 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81410 71239 21314 1145 29 4 2 0 0 0 0
> > > > Node 0, zone Normal 18156 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81419 71232 21317 1160 30 4 2 0 0 0 0
> > > > Node 0, zone Normal 17536 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81347 71144 21443 1160 31 4 2 0 0 0 0
> > > > Node 0, zone Normal 18483 7 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81300 71059 21556 1178 38 4 2 0 0 0 0
> > > > Node 0, zone Normal 18528 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81315 71042 21577 1180 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18431 2 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81301 71002 21702 1202 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18487 5 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81301 70998 21702 1202 39 4 2 0 0 0 0
> > > > Node 0, zone Normal 18311 0 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81296 71025 21711 1208 45 4 2 0 0 0 0
> > > > Node 0, zone Normal 17092 5 0 0 0 0 0 0 0 0 0
> > > > Node 0, zone DMA32 81299 71023 21716 1226 45 4 2 0 0 0 0
> > > > Node 0, zone Normal 18225 12 0 0 0 0 0 0 0 0 0
> > > >
> > > > Running a perf record on the kswapd wakeup right when it happens shows:
> > > > perf record --event vmscan:mm_vmscan_wakeup_kswapd -a --call-graph -c 1 -a sleep 10
> > > > perf trace
> > > > swapper-0 [002] 1323136.979119: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > swapper-0 [002] 1323136.979131: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > lmtp-20593 [003] 1323136.984066: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > lmtp-20593 [003] 1323136.984079: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > swapper-0 [001] 1323136.985511: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > swapper-0 [001] 1323136.985515: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > > lmtp-20593 [003] 1323136.985673: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
> > > > lmtp-20593 [003] 1323136.985675: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3
> > > >
> > > > This causes kswapd to throw out a bunch of stuff from Normal and from
> > > > DMA32, to try to get zone_watermark_ok() to be happy for order=3.
>
> Yep.
>
> > > > However, we have a heavy read load from all of the email stored on SSDs
> > > > on this box, and kswapd ends up fighting to try to keep reclaiming the
> > > > allocations (mostly order-0). During the whole day, it never wins -- the
> > > > allocations are faster. At night, it wins after a minute or two. The
> > > > fighting is happening in all of the lines after it awakes above.
> > > >
>
> It's probably fighting to keep *all* zones happy even though it's not strictly
> necessary. I suspect it's fighting the most for Normal.
>
> > > > slabs_scanned, kswapd_steal, kswapd_inodesteal (slowly),
> > > > kswapd_skip_congestion_wait, and pageoutrun go up in vmstat while kswapd
> > > > is running. With the box up for 15 days, you can see it struggling on
> > > > pgscan_kswapd_normal (from /proc/vmstat):
> > > >
> > > > pgfree 3329793080
> > > > pgactivate 643476431
> > > > pgdeactivate 155182710
> > > > pgfault 2649106647
> > > > pgmajfault 58157157
> > > > pgrefill_dma 0
> > > > pgrefill_dma32 19688032
> > > > pgrefill_normal 7600864
> > > > pgrefill_movable 0
> > > > pgsteal_dma 0
> > > > pgsteal_dma32 465191578
> > > > pgsteal_normal 651178518
> > > > pgsteal_movable 0
> > > > pgscan_kswapd_dma 0
> > > > pgscan_kswapd_dma32 768300403
> > > > pgscan_kswapd_normal 34614572907
> > > > pgscan_kswapd_movable 0
> > > > pgscan_direct_dma 0
> > > > pgscan_direct_dma32 2853983
> > > > pgscan_direct_normal 885799
> > > > pgscan_direct_movable 0
> > > > pginodesteal 191895
> > > > pgrotated 27290463
> > > >
> > > > So, here are my questions.
> > > >
> > > > Why do we care about order > 0 watermarks at all in the Normal zone?
> > > > Wouldn't it make a lot more sense to just make the DMA32 zone the only
> > > > one we care about for larger-order allocations? Or is this required for
> > > > the hugepage stuff?
> > > >
>
> It's not required. The logic for kswapd is "balance all zones" and
> Normal is one of the zones. Even though you know that DMA32 is just
> fine, kswapd doesn't.
>
> > > > The fact that so much stuff is evicted just because order-3 hits 0 is
> > > > crazy, especially when larger order pages are still free. It seems like
> > > > we're trying to keep large orders free here. Why?
>
> Watermarks. The steady stream of order-3 allocations is telling the
> allocator and kswapd that these size pages must be available. It doesn't
> know that slub can happily fall back to smaller pages because that
> information is lost. Even removing __GFP_WAIT won't help because kswapd
> still gets woken up for atomic allocation requests.
>
> > > > Maybe things would be
> > > > better if kswapd does not reclaim at all unless the requested order is
> > > > empty _and_ all orders above are empty. This would require hugepage
> > > > users to use CONFIG_COMPACTION, and have _compaction_ occur the way the
> > > > watermark checks work now, but people without CONFIG_HUGETLB_PAGE could
> > > > just actually use the memory. Would this work?
> > > >
> > > > There is logic at the end of balance_pgdat() to give up balancing order>0
> > > > and just try another loop with order = 0 if sc.nr_reclaimed is <
> > > > SWAP_CLUSTER_MAX. However, when this order=0 pass returns, the caller of
> > > > balance_pgdat(), kswapd(), gets true from sleeping_prematurely() and just
> > > > calls right back to balance_pgdat() again. I think this is why this
> > > > logic doesn't seem to work here.
> > > >
>
> Ok, this is true. kswapd in balance_pgdat() has given up on the order
> but that information is lost when sleeping_prematurely() is called so it
> constantly loops. That is a mistake. balance_pgdat() could return the order
> so sleeping_prematurely() doesn't do the wrong thing.
>
> > > > Is my assumption about GFP_ATOMIC order=3 working even when order 3 is
> > > > empty, but order>3 is not? Regardless, shouldn't kswapd be woken before
> > > > order 3 is 0 since it may have nothing above order 3 to split from, thus
> > > > actually causing an allocation failure? Does something else do this?
> > >
> > > Even if kswapd is woken after order>3 is empty, the issue will still occur,
> > > since the order>3 pages will be used soon and kswapd still needs to reclaim
> > > some pages. So the issue is that there is a high-order page allocation and
> > > lumpy reclaim wrongly reclaims some pages. Maybe you should use slab
> > > instead of slub to avoid high-order allocations.
> >
> > There are actually a few problems here. I think they are worth looking
> > at separately, unless "don't use order 3 allocations" is a valid
> > statement, in which case we should fix slub.
> >
>
> SLUB can be forced to use smaller orders but I don't think that's the
> right fix here.
>
> > The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> > with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> > lower order allocation automatically if it fails. This is why there is
> > no allocation failure warning when this happens. However, it is too late
> > -- kswapd is woken and it ties to bring order 3 up to the watermark.
> > If we hacked __alloc_pages_slowpath() to not wake kswapd when
> > __GFP_NOWARN is set, we would never see this problem and the slub
> > optimization might still mostly work.
>
> Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
> failures as a result so that fix would cause other regressions.
>
> > Either way, we should "fix" slub
> > or "fix" order-3 allocations, so that other people who are using slub
> > don't hit the same problem.
> >
> > kswapd is throwing out many times what is needed for the order 3
> > watermark to be met. It seems to be not as bad now, but look at these
> > pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> > followed by nr_pages_free calculation and final order-3 watermark test,
> > kswapd woken after the second sample):
> >
> > Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
> >
> > DMA32 20374 35116 975 1 2 5 1 0 0 0 0 94770 257 <= 256
> > DMA32 20480 35211 870 1 1 5 1 0 0 0 0 94630 241 <= 256
> > (kswapd wakes, gobble gobble)
> > DMA32 24387 37009 2910 297 100 5 1 0 0 0 0 114245 4193 <= 256
> > DMA32 36169 37787 4676 637 110 5 1 0 0 0 0 137527 7073 <= 256
> > DMA32 63443 40620 5716 982 144 5 1 0 0 0 0 177931 10377 <= 256
> > DMA32 65866 57006 6462 1180 158 5 1 0 0 0 0 217918 12185 <= 256
> > DMA32 67188 66779 9328 1893 208 5 1 0 0 0 0 256754 18689 <= 256
> > DMA32 67909 67356 18307 2268 235 5 1 0 0 0 0 297977 22121 <= 256
> > DMA32 68333 67419 20786 4192 298 7 1 0 0 0 0 324907 38585 <= 256
> > DMA32 69872 68096 21580 5141 326 7 1 0 0 0 0 339016 46625 <= 256
> > DMA32 69959 67970 22339 5657 371 10 1 0 0 0 0 346831 51569 <= 256
> > DMA32 70017 67946 22363 6078 417 11 1 0 0 0 0 351073 55705 <= 256
> > DMA32 70023 67949 22376 6204 439 12 1 0 0 0 0 352529 57097 <= 256
> > DMA32 70045 67937 22380 6262 451 12 1 0 0 0 0 353199 57753 <= 256
> > DMA32 70062 67939 22378 6298 456 12 1 0 0 0 0 353580 58121 <= 256
> > DMA32 70079 67959 22388 6370 458 12 1 0 0 0 0 354285 58729 <= 256
> > DMA32 70079 67959 22388 6387 460 12 1 0 0 0 0 354453 58897 <= 256
> > DMA32 70076 67954 22387 6393 460 12 1 0 0 0 0 354484 58945 <= 256
> > DMA32 70105 67975 22385 6466 468 12 1 0 0 0 0 355259 59657 <= 256
> > DMA32 70110 67972 22387 6466 470 12 1 0 0 0 0 355298 59689 <= 256
> > DMA32 70152 67989 22393 6476 470 12 1 0 0 0 0 355478 59769 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70175 67991 22401 6493 471 12 1 0 0 0 0 355689 59921 <= 256
> > DMA32 70192 67990 22401 6495 471 12 1 0 0 0 0 355720 59937 <= 256
> > DMA32 70192 67988 22401 6496 471 12 1 0 0 0 0 355724 59945 <= 256
> > DMA32 70099 68061 22467 6602 477 12 1 0 0 0 0 356985 60889 <= 256
> > DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> > DMA32 70099 68062 22467 6602 477 12 1 0 0 0 0 356987 60889 <= 256
> > DMA32 70099 68062 22467 6603 477 12 1 0 0 0 0 356995 60897 <= 256
> > (kswapd sleeps)
> >
> > Normal zone at the same time (shown separately for clarity):
> >
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > (kswapd wakes)
> > Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> > Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> > Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> > Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> > Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> > Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> > Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> > Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> > Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> > Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> > Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> > Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> > Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> > Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> > Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> > Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > (kswapd sleeps -- maybe too much turkey)
> >
> > DMA32 got so much reclaimed that the watermark test succeeded long ago.
> > Meanwhile, Normal is being reclaimed as well, but because it's fighting
> > with allocations, it tries for a while and eventually succeeds (I think),
> > but the 200ms samples didn't catch it.
> >
>
> So, the key here is kswapd didn't need to balance all zones, any one of
> them would have been fine.
>
> > KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> > to be similar to this problem, but your change is not working here.
>
> It's not working, because sleeping_prematurely() interferes with it.
>
> > We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> > is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> > increasing all the time. This means that balance_pgdat() keeps being
> > called, but sleeping_prematurely() is returning true, so kswapd() just
> > keeps re-calling balance_pgdat(). If your approach is correct to stop
> > kswapd here, the problem seems to be that balance_pgdat's copy of order
> > and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> > kswapd never really sleeps. How is this supposed to work?
> >
>
> It doesn't.
>
> > Our allocation load here is mostly file pages, some anon pages, and
> > relatively little slab and anything else.
> >
>
> I think there are at least two fixes required here.
>
> 1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
> the order.
> 2. kswapd is trying to balance all zones for higher orders even though
> it doesn't really have to.
>
> This patch has potential fixes for both of these problems. I have a split-out
> series but I'm posting it as a single patch to see if it allows kswapd to
> go to sleep as expected for you and whether it stops hammering the Normal
> zone unnecessarily. I tested it locally here (albeit with compaction
> enabled) and it did reduce the amount of time kswapd spent awake.
>
> ==== CUT HERE ====
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..25fe08d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -645,6 +645,7 @@ typedef struct pglist_data {
> wait_queue_head_t kswapd_wait;
> struct task_struct *kswapd;
> int kswapd_max_order;
> + enum zone_type high_zoneidx;
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> @@ -660,7 +661,7 @@ typedef struct pglist_data {
>
> extern struct mutex zonelists_mutex;
> void build_all_zonelists(void *data);
> -void wakeup_kswapd(struct zone *zone, int order);
> +void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
> int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> int classzone_idx, int alloc_flags);
> enum memmap_context {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07a6544..344b597 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> struct zone *zone;
>
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + wakeup_kswapd(zone, order, high_zoneidx);
> }
>
> static inline int
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..00529a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> #endif
>
> /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> {
> int i;
> + bool all_zones_ok = true;
> + bool any_zone_ok = false;
>
> /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> if (remaining)
> return 1;
>
> - /* If after HZ/10, a zone is below the high mark, it's premature */
> + /* Check the watermark levels */
> for (i = 0; i < pgdat->nr_zones; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> @@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>
> if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> 0, 0))
> - return 1;
> + all_zones_ok = false;
> + else
> + any_zone_ok = true;
> }
>
> - return 0;
> + /*
> + * For high-order requests, any zone meeting the watermark is enough
> + * to allow kswapd go back to sleep
> + * For order-0, all zones must be balanced
> + */
> + if (order)
> + return !any_zone_ok;
> + else
> + return !all_zones_ok;
> }
>
> /*
> @@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> {
> int all_zones_ok;
> + int any_zone_ok;
> int priority;
> int i;
> unsigned long total_scanned;
> @@ -2201,6 +2214,7 @@ loop_again:
> disable_swap_token();
>
> all_zones_ok = 1;
> + any_zone_ok = 0;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2310,10 +2324,12 @@ loop_again:
> * spectulatively avoid congestion waits
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + if (i <= pgdat->high_zoneidx)
> + any_zone_ok = 1;
> }
>
> }
> - if (all_zones_ok)
> + if (all_zones_ok || (order && any_zone_ok))
> break; /* kswapd: all done */
> /*
> * OK, kswapd is getting into trouble. Take a nap, then take
> @@ -2336,7 +2352,7 @@ loop_again:
> break;
> }
> out:
> - if (!all_zones_ok) {
> + if (!(all_zones_ok || (order && any_zone_ok))) {
> cond_resched();
>
> try_to_freeze();
> @@ -2361,7 +2377,13 @@ out:
> goto loop_again;
> }
>
> - return sc.nr_reclaimed;
> + /*
> + * Return the order we were reclaiming at so sleeping_prematurely()
> + * makes a decision on the order we were last reclaiming at. However,
> + * if another caller entered the allocator slow path while kswapd
> + * was awake, order will remain at the higher level
> + */
> + return order;
> }
This seems to always fail, because you have the protection on the kswapd side
but not on the page allocation side, so every time a high-order
allocation occurs, the protection breaks and kswapd keeps running.

2010-11-26 01:25:44

by Mel Gorman

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, Nov 26, 2010 at 09:05:56AM +0800, Shaohua Li wrote:
> > @@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > {
> > int all_zones_ok;
> > + int any_zone_ok;
> > int priority;
> > int i;
> > unsigned long total_scanned;
> > @@ -2201,6 +2214,7 @@ loop_again:
> > disable_swap_token();
> >
> > all_zones_ok = 1;
> > + any_zone_ok = 0;
> >
> > /*
> > * Scan in the highmem->dma direction for the highest
> > @@ -2310,10 +2324,12 @@ loop_again:
> > * spectulatively avoid congestion waits
> > */
> > zone_clear_flag(zone, ZONE_CONGESTED);
> > + if (i <= pgdat->high_zoneidx)
> > + any_zone_ok = 1;
> > }
> >
> > }
> > - if (all_zones_ok)
> > + if (all_zones_ok || (order && any_zone_ok))
> > break; /* kswapd: all done */
> > /*
> > * OK, kswapd is getting into trouble. Take a nap, then take
> > @@ -2336,7 +2352,7 @@ loop_again:
> > break;
> > }
> > out:
> > - if (!all_zones_ok) {
> > + if (!(all_zones_ok || (order && any_zone_ok))) {
> > cond_resched();
> >
> > try_to_freeze();
> > @@ -2361,7 +2377,13 @@ out:
> > goto loop_again;
> > }
> >
> > - return sc.nr_reclaimed;
> > + /*
> > + * Return the order we were reclaiming at so sleeping_prematurely()
> > + * makes a decision on the order we were last reclaiming at. However,
> > + * if another caller entered the allocator slow path while kswapd
> > + * was awake, order will remain at the higher level
> > + */
> > + return order;
> > }
> This seems always fail. because you have the protect in the kswapd side,
> but no in the page allocation side. so every time a high order
> allocation occurs, the protect breaks and kswapd keeps running.
>

I don't understand your question. The sc.nr_reclaimed return value was unused.
The point of returning order was to tell kswapd that "the order you were
reclaiming at may or may not still be valid, make your decisions on the
order I am currently reclaiming at". The key here is that if multiple
allocation requests come in for higher orders, kswapd will get rewoken
multiple times. Unless it gets rewoken multiple times, kswapd is willing
to go back to sleep to avoid reclaiming an excessive number of pages.
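To make that flow concrete, here is a condensed sketch of the kswapd() main
loop with the patch applied (heavily simplified, not the literal 2.6.36
source; the sleep path itself is elided):

        for ( ; ; ) {
                int new_order = pgdat->kswapd_max_order;

                pgdat->kswapd_max_order = 0;
                if (order < new_order) {
                        /* rewoken for a larger order while reclaiming: don't sleep */
                        order = new_order;
                } else {
                        /*
                         * Sleep; sleeping_prematurely(pgdat, order, ...) is now
                         * called with the order balance_pgdat() last finished at,
                         * so kswapd can actually go back to sleep.  On wakeup,
                         * pick up whatever order a waker asked for.
                         */
                        schedule();
                        order = pgdat->kswapd_max_order;
                }

                /* the fix: feed the finished order back into the next iteration */
                order = balance_pgdat(pgdat, order);
        }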

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-26 02:00:51

by Shaohua Li

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, 2010-11-26 at 00:15 +0800, Mel Gorman wrote:
> On Thu, Nov 25, 2010 at 07:51:44PM +0900, KOSAKI Motohiro wrote:
> > > kswapd is throwing out many times what is needed for the order 3
> > > watermark to be met. It seems to be not as bad now, but look at these
> > > pages being reclaimed (200ms intervals, whitespace-packed buddyinfo
> > > followed by nr_pages_free calculation and final order-3 watermark test,
> > > kswapd woken after the second sample):
> > >
> > > Normal zone at the same time (shown separately for clarity):
> > >
> > > Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free or3-low-chk
> > >
> > > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > > Normal 452 1 0 0 0 0 0 0 0 0 0 454 -5 <= 238
> > > (kswapd wakes)
> > > Normal 7618 76 0 0 0 0 0 0 0 0 0 7770 145 <= 238
> > > Normal 8860 73 1 0 0 0 0 0 0 0 0 9010 143 <= 238
> > > Normal 8929 25 0 0 0 0 0 0 0 0 0 8979 43 <= 238
> > > Normal 8917 0 0 0 0 0 0 0 0 0 0 8917 -7 <= 238
> > > Normal 8978 16 0 0 0 0 0 0 0 0 0 9010 25 <= 238
> > > Normal 9064 4 0 0 0 0 0 0 0 0 0 9072 1 <= 238
> > > Normal 9068 2 0 0 0 0 0 0 0 0 0 9072 -3 <= 238
> > > Normal 8992 9 0 0 0 0 0 0 0 0 0 9010 11 <= 238
> > > Normal 9060 6 0 0 0 0 0 0 0 0 0 9072 5 <= 238
> > > Normal 9010 0 0 0 0 0 0 0 0 0 0 9010 -7 <= 238
> > > Normal 8907 5 0 0 0 0 0 0 0 0 0 8917 3 <= 238
> > > Normal 8576 0 0 0 0 0 0 0 0 0 0 8576 -7 <= 238
> > > Normal 8018 0 0 0 0 0 0 0 0 0 0 8018 -7 <= 238
> > > Normal 6778 0 0 0 0 0 0 0 0 0 0 6778 -7 <= 238
> > > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > > Normal 6220 0 0 0 0 0 0 0 0 0 0 6220 -7 <= 238
> > > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > > Normal 6251 0 0 0 0 0 0 0 0 0 0 6251 -7 <= 238
> > > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > > Normal 6218 1 0 0 0 0 0 0 0 0 0 6220 -5 <= 238
> > > Normal 6034 0 0 0 0 0 0 0 0 0 0 6034 -7 <= 238
> > > Normal 6065 0 0 0 0 0 0 0 0 0 0 6065 -7 <= 238
> > > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > > Normal 6189 0 0 0 0 0 0 0 0 0 0 6189 -7 <= 238
> > > Normal 6096 0 0 0 0 0 0 0 0 0 0 6096 -7 <= 238
> > > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > > Normal 6158 0 0 0 0 0 0 0 0 0 0 6158 -7 <= 238
> > > Normal 6127 0 0 0 0 0 0 0 0 0 0 6127 -7 <= 238
> > > (kswapd sleeps -- maybe too much turkey)
> > >
> > > DMA32 got so much reclaimed that the watermark test succeeded long ago.
> > > Meanwhile, Normal is being reclaimed as well, but because it's fighting
> > > with allocations, it tries for a while and eventually succeeds (I think),
> > > but the 200ms samples didn't catch it.
> > >
> > > KOSAKI Motohiro, I'm interested in your commit 73ce02e9. This seems
> > > to be similar to this problem, but your change is not working here.
> > > We're seeing kswapd run without sleeping, KSWAPD_SKIP_CONGESTION_WAIT
> > > is increasing (so has_under_min_watermark_zone is true), and pageoutrun
> > > increasing all the time. This means that balance_pgdat() keeps being
> > > called, but sleeping_prematurely() is returning true, so kswapd() just
> > > keeps re-calling balance_pgdat(). If your approach is correct to stop
> > > kswapd here, the problem seems to be that balance_pgdat's copy of order
> > > and sc.order is being set to 0, but not pgdat->kswapd_max_order, so
> > > kswapd never really sleeps. How is this supposed to work?
> >
> > Um. this seems regression since commit f50de2d381 (vmscan: have kswapd sleep
> > for a short interval and double check it should be asleep)
> >
>
> I wrote my own patch before I saw this but for one of the issues we are doing
> something similar. You are checking if enough pages got reclaimed where as
> my patch considers any zone being balanced for high-orders being sufficient
> for kswapd to go to sleep. I think mine is a little stronger because
> it's checking what state the zones are in for a given order regardless
> of what has been reclaimed. Lets see what testing has to say.
Recording the order seems not sufficient. In balance_pgdat(), the for loop
exits only when:
priority < 0 or sc.nr_reclaimed >= SWAP_CLUSTER_MAX.
But we do: if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
order = sc.order = 0;
This means that before we set order to 0, we have already reclaimed a lot of
pages, so I thought we need to set order to 0 earlier, before there are
enough free pages. Below is a debug patch.


diff --git a/mm/vmscan.c b/mm/vmscan.c
index d31d7ce..ee5d2ed 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2117,6 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
}
#endif

+static int all_zone_enough_free_pages(pg_data_t *pgdat)
+{
+ int i;
+
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) * 8,
+ 0, 0))
+ return 0;
+ }
+ return 1;
+}
+
/* is kswapd sleeping prematurely? */
static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
@@ -2355,7 +2375,8 @@ out:
* back to sleep. High-order users can still perform direct
* reclaim if they wish.
*/
- if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
+ if (sc.nr_reclaimed < SWAP_CLUSTER_MAX ||
+ (order > 0 && all_zone_enough_free_pages(pgdat)))
order = sc.order = 0;

goto loop_again;

2010-11-26 02:05:43

by Shaohua Li

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, 2010-11-26 at 09:25 +0800, Mel Gorman wrote:
> On Fri, Nov 26, 2010 at 09:05:56AM +0800, Shaohua Li wrote:
> > > @@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > > static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > > {
> > > int all_zones_ok;
> > > + int any_zone_ok;
> > > int priority;
> > > int i;
> > > unsigned long total_scanned;
> > > @@ -2201,6 +2214,7 @@ loop_again:
> > > disable_swap_token();
> > >
> > > all_zones_ok = 1;
> > > + any_zone_ok = 0;
> > >
> > > /*
> > > * Scan in the highmem->dma direction for the highest
> > > @@ -2310,10 +2324,12 @@ loop_again:
> > > * spectulatively avoid congestion waits
> > > */
> > > zone_clear_flag(zone, ZONE_CONGESTED);
> > > + if (i <= pgdat->high_zoneidx)
> > > + any_zone_ok = 1;
> > > }
> > >
> > > }
> > > - if (all_zones_ok)
> > > + if (all_zones_ok || (order && any_zone_ok))
> > > break; /* kswapd: all done */
> > > /*
> > > * OK, kswapd is getting into trouble. Take a nap, then take
> > > @@ -2336,7 +2352,7 @@ loop_again:
> > > break;
> > > }
> > > out:
> > > - if (!all_zones_ok) {
> > > + if (!(all_zones_ok || (order && any_zone_ok))) {
> > > cond_resched();
> > >
> > > try_to_freeze();
> > > @@ -2361,7 +2377,13 @@ out:
> > > goto loop_again;
> > > }
> > >
> > > - return sc.nr_reclaimed;
> > > + /*
> > > + * Return the order we were reclaiming at so sleeping_prematurely()
> > > + * makes a decision on the order we were last reclaiming at. However,
> > > + * if another caller entered the allocator slow path while kswapd
> > > + * was awake, order will remain at the higher level
> > > + */
> > > + return order;
> > > }
> > This seems always fail. because you have the protect in the kswapd side,
> > but no in the page allocation side. so every time a high order
> > allocation occurs, the protect breaks and kswapd keeps running.
> >
>
> I don't understand your question. sc.nr_reclaimed was being unused. The
> point of returning order was to tell kswapd that "the order you were
> reclaiming at may or may not be still valid, make your decisions on the
> order I am currently reclaiming at". The key here is if that multiple
> allocation requests come in for higher orders, kswapd will get reworken
> multiple times. Unless it gets rewoken multiple times, kswapd is willing
> to go back to sleep to avoid reclaiming an excessive number of pages.
Yes, I thought it is rewoken multiple times; in the workload Simon reported:
Node 0, zone DMA32 20741 29383 6022 134 272 123 4 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20476 29370 6024 117 48 116 4 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 20343 29369 6020 110 23 10 2 0 0 0 0
Node 0, zone Normal 453 1 0 0 0 0 0 0 0 0 0
Node 0, zone DMA32 21592 30477 4856 22 10 4 2 0 0 0 0
the order >= 3 pages are reduced a lot within a second

2010-11-26 02:31:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> record the order seems not sufficient. in balance_pgdat(), the for look
> exit only when:
> priority <0 or sc.nr_reclaimed >= SWAP_CLUSTER_MAX.
> but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> order = sc.order = 0;
> this means before we set order to 0, we already reclaimed a lot of
> pages, so I thought we need set order to 0 earlier before there are
> enough free pages. below is a debug patch.
>
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..ee5d2ed 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2117,6 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> }
> #endif
>
> +static int all_zone_enough_free_pages(pg_data_t *pgdat)
> +{
> + int i;
> +
> + for (i = 0; i < pgdat->nr_zones; i++) {
> + struct zone *zone = pgdat->node_zones + i;
> +
> + if (!populated_zone(zone))
> + continue;
> +
> + if (zone->all_unreclaimable)
> + continue;
> +
> + if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) * 8,
> + 0, 0))
> + return 0;
> + }
> + return 1;
> +}
> +
> /* is kswapd sleeping prematurely? */
> static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> {
> @@ -2355,7 +2375,8 @@ out:
> * back to sleep. High-order users can still perform direct
> * reclaim if they wish.
> */
> - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX ||
> + (order > 0 && all_zone_enough_free_pages(pgdat)))
> order = sc.order = 0;

Ummm, this doesn't work. This place is processed for every 32 pages reclaimed
(see the code and comment below). Therefore your patch breaks the high-order
reclaim logic.


/*
* We do this so kswapd doesn't build up large priorities for
* example when it is freeing in parallel with allocators. It
* matches the direct reclaim path behaviour in terms of impact
* on zone->*_priority.
*/
if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX)
break;

2010-11-26 02:40:46

by Shaohua Li

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, 2010-11-26 at 10:31 +0800, KOSAKI Motohiro wrote:
> > record the order seems not sufficient. in balance_pgdat(), the for look
> > exit only when:
> > priority <0 or sc.nr_reclaimed >= SWAP_CLUSTER_MAX.
> > but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > order = sc.order = 0;
> > this means before we set order to 0, we already reclaimed a lot of
> > pages, so I thought we need set order to 0 earlier before there are
> > enough free pages. below is a debug patch.
> >
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d31d7ce..ee5d2ed 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2117,6 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > }
> > #endif
> >
> > +static int all_zone_enough_free_pages(pg_data_t *pgdat)
> > +{
> > + int i;
> > +
> > + for (i = 0; i < pgdat->nr_zones; i++) {
> > + struct zone *zone = pgdat->node_zones + i;
> > +
> > + if (!populated_zone(zone))
> > + continue;
> > +
> > + if (zone->all_unreclaimable)
> > + continue;
> > +
> > + if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) * 8,
> > + 0, 0))
> > + return 0;
> > + }
> > + return 1;
> > +}
> > +
> > /* is kswapd sleeping prematurely? */
> > static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > {
> > @@ -2355,7 +2375,8 @@ out:
> > * back to sleep. High-order users can still perform direct
> > * reclaim if they wish.
> > */
> > - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX ||
> > + (order > 0 && all_zone_enough_free_pages(pgdat)))
> > order = sc.order = 0;
>
> Ummm. this doesn't work. this place is processed every 32 pages reclaimed.
> (see below code and comment). Theresore your patch break high order reclaim
> logic.
Yes, this will break high-order reclaim, but we need a compromise;
wrongly reclaiming pages is even worse. Would increasing the watermark in
all_zone_enough_free_pages() be better?

2010-11-26 09:18:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> On Fri, 2010-11-26 at 10:31 +0800, KOSAKI Motohiro wrote:
> > > record the order seems not sufficient. in balance_pgdat(), the for look
> > > exit only when:
> > > priority <0 or sc.nr_reclaimed >= SWAP_CLUSTER_MAX.
> > > but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > order = sc.order = 0;
> > > this means before we set order to 0, we already reclaimed a lot of
> > > pages, so I thought we need set order to 0 earlier before there are
> > > enough free pages. below is a debug patch.
> > >
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d31d7ce..ee5d2ed 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2117,6 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > > }
> > > #endif
> > >
> > > +static int all_zone_enough_free_pages(pg_data_t *pgdat)
> > > +{
> > > + int i;
> > > +
> > > + for (i = 0; i < pgdat->nr_zones; i++) {
> > > + struct zone *zone = pgdat->node_zones + i;
> > > +
> > > + if (!populated_zone(zone))
> > > + continue;
> > > +
> > > + if (zone->all_unreclaimable)
> > > + continue;
> > > +
> > > + if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) * 8,
> > > + 0, 0))
> > > + return 0;
> > > + }
> > > + return 1;
> > > +}
> > > +
> > > /* is kswapd sleeping prematurely? */
> > > static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > > {
> > > @@ -2355,7 +2375,8 @@ out:
> > > * back to sleep. High-order users can still perform direct
> > > * reclaim if they wish.
> > > */
> > > - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX ||
> > > + (order > 0 && all_zone_enough_free_pages(pgdat)))
> > > order = sc.order = 0;
> >
> > Ummm. this doesn't work. this place is processed every 32 pages reclaimed.
> > (see below code and comment). Theresore your patch break high order reclaim
> > logic.
> Yes, this will break high order reclaim, but we need a compromise.
> wrongly reclaim pages is more worse. could increase the watermark in
> all_zone_enough_free_pages() better?
>

Hmm..
I guess I haven't caught your meaning. You wrote:

> > > but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > order = sc.order = 0;
> > > this means before we set order to 0, we already reclaimed a lot of
> > > pages

and I wrote that it's not a lot. So I don't understand why you are talking
about increasing the watermark now; you seem to be talking about an unrelated
topic. Can you please elaborate your point more?





2010-11-26 11:03:10

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

Two points.

> @@ -2310,10 +2324,12 @@ loop_again:
> * spectulatively avoid congestion waits
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + if (i <= pgdat->high_zoneidx)
> + any_zone_ok = 1;
> }
>
> }
> - if (all_zones_ok)
> + if (all_zones_ok || (order && any_zone_ok))
> break; /* kswapd: all done */
> /*
> * OK, kswapd is getting into trouble. Take a nap, then take
> @@ -2336,7 +2352,7 @@ loop_again:
> break;
> }
> out:
> - if (!all_zones_ok) {
> + if (!(all_zones_ok || (order && any_zone_ok))) {

This doesn't work ;)
kswapd has to clear the ZONE_CONGESTED flag before it goes to sleep,
otherwise nobody can clear it.

Say, we have to satisfy both of the conditions below (see the sketch that follows):
- All zones pass zone_watermark_ok(order-0)
- At least one zone passes zone_watermark_ok(high-order)

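As a sketch only (the helper name and structure are illustrative, not from any
posted patch), the two conditions could be combined into one check along these
lines:

static bool pgdat_balanced_for_order(pg_data_t *pgdat, int order)
{
        bool any_high_order_ok = false;
        int i;

        for (i = 0; i < pgdat->nr_zones; i++) {
                struct zone *zone = pgdat->node_zones + i;

                if (!populated_zone(zone) || zone->all_unreclaimable)
                        continue;

                /* every zone must meet its order-0 high watermark */
                if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone), 0, 0))
                        return false;

                /* at least one zone must also meet the high-order mark */
                if (order && zone_watermark_ok(zone, order,
                                               high_wmark_pages(zone), 0, 0))
                        any_high_order_ok = true;
        }

        return order ? any_high_order_ok : true;
}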


> @@ -2417,6 +2439,7 @@ static int kswapd(void *p)
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> new_order = pgdat->kswapd_max_order;
> pgdat->kswapd_max_order = 0;
> + pgdat->high_zoneidx = MAX_ORDER;

I don't think MAX_ORDER is correct ;)

high_zoneidx = pgdat->high_zoneidx;
pgdat->high_zoneidx = pgdat->nr_zones - 1;

?


And we have another place that reads kswapd_max_order (after
kswapd_try_to_sleep). We need to handle that one too.



2010-11-26 11:11:38

by Mel Gorman

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, Nov 26, 2010 at 08:03:04PM +0900, KOSAKI Motohiro wrote:
> Two points.
>
> > @@ -2310,10 +2324,12 @@ loop_again:
> > * spectulatively avoid congestion waits
> > */
> > zone_clear_flag(zone, ZONE_CONGESTED);
> > + if (i <= pgdat->high_zoneidx)
> > + any_zone_ok = 1;
> > }
> >
> > }
> > - if (all_zones_ok)
> > + if (all_zones_ok || (order && any_zone_ok))
> > break; /* kswapd: all done */
> > /*
> > * OK, kswapd is getting into trouble. Take a nap, then take
> > @@ -2336,7 +2352,7 @@ loop_again:
> > break;
> > }
> > out:
> > - if (!all_zones_ok) {
> > + if (!(all_zones_ok || (order && any_zone_ok))) {
>
> This doesn't work ;)
> kswapd have to clear ZONE_CONGESTED flag before enter sleeping.
> otherwise nobody can clear it.
>

Does it not do it earlier in balance_pgdat() here

/*
* If a zone reaches its high watermark,
* consider it to be no longer congested. It's
* possible there are dirty pages backed by
* congested BDIs but as pressure is
* relieved, spectulatively avoid congestion waits
*/
zone_clear_flag(zone, ZONE_CONGESTED);
if (i <= pgdat->high_zoneidx)
any_zone_ok = 1;

> Say, we have to fill below condition.
> - All zone are successing zone_watermark_ok(order-0)

We should loop around at least once with order == 0 where all_zones_ok
is checked.

> - At least one zone are successing zone_watermark_ok(high-order)
>

This is preferable but it's possible for kswapd to go to sleep without
this condition being satisfied.

>
>
> > @@ -2417,6 +2439,7 @@ static int kswapd(void *p)
> > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > new_order = pgdat->kswapd_max_order;
> > pgdat->kswapd_max_order = 0;
> > + pgdat->high_zoneidx = MAX_ORDER;
>
> I don't think MAX_ORDER is correct ;)
>
> high_zoneidx = pgdat->high_zoneidx;
> pgdat->high_zoneidx = pgdat->nr_zones - 1;
>
> ?
>

Bah. It should have been MAX_NR_ZONES. This happens to still work because
MAX_ORDER will always be higher than MAX_NR_ZONES but it's wrong.

> And, we have another kswapd_max_order reading place. (after kswapd_try_to_sleep)
> We need it too.
>

I'm not quite sure what you mean here. kswapd_max_order is read again
after kswapd tries to sleep (or wakes for that matter) but it'll be in
response to another caller having tried to wake kswapd indicating that
those high orders really are needed.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

Subject: Re: Free memory never fully used, swapping

On Thu, 25 Nov 2010, KOSAKI Motohiro wrote:
> Please try SLAB instead SLUB (it can be switched by kernel build option).
> SLUB try to use high order allocation implicitly.

SLAB uses orders 0-1. Order is fixed per slab cache and determined based
on object size at slab creation.

SLUB uses orders 0-3. Falls back to smallest order if alloc order cannot
be met by the page allocator.

One can reduce SLUB to SLAB orders by specifying the following kernel
commandline parameter:

slub_max_order=1
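
For reference, a rough sketch of the SLUB fallback being discussed, simplified
from 2.6.36's allocate_slab() (the function name here is illustrative). Even
though the high-order attempt is opportunistic and silent, the alloc_pages()
call still wakes kswapd for that order on the way:

static struct page *allocate_slab_sketch(struct kmem_cache *s, gfp_t flags,
                                         int node)
{
        struct page *page;

        /* opportunistic: try the preferred (possibly order-3) size first,
         * without retrying or warning on failure */
        page = alloc_pages_node(node, flags | __GFP_NOWARN | __GFP_NORETRY,
                                oo_order(s->oo));
        if (!page)
                /* fall back to the minimum order (order-0 here) */
                page = alloc_pages_node(node, flags, oo_order(s->min));

        return page;
}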

2010-11-29 01:04:03

by Shaohua Li

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, 2010-11-26 at 17:18 +0800, KOSAKI Motohiro wrote:
> > On Fri, 2010-11-26 at 10:31 +0800, KOSAKI Motohiro wrote:
> > > > record the order seems not sufficient. in balance_pgdat(), the for look
> > > > exit only when:
> > > > priority <0 or sc.nr_reclaimed >= SWAP_CLUSTER_MAX.
> > > > but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > > order = sc.order = 0;
> > > > this means before we set order to 0, we already reclaimed a lot of
> > > > pages, so I thought we need set order to 0 earlier before there are
> > > > enough free pages. below is a debug patch.
> > > >
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index d31d7ce..ee5d2ed 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2117,6 +2117,26 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > > > }
> > > > #endif
> > > >
> > > > +static int all_zone_enough_free_pages(pg_data_t *pgdat)
> > > > +{
> > > > + int i;
> > > > +
> > > > + for (i = 0; i < pgdat->nr_zones; i++) {
> > > > + struct zone *zone = pgdat->node_zones + i;
> > > > +
> > > > + if (!populated_zone(zone))
> > > > + continue;
> > > > +
> > > > + if (zone->all_unreclaimable)
> > > > + continue;
> > > > +
> > > > + if (!zone_watermark_ok(zone, 0, high_wmark_pages(zone) * 8,
> > > > + 0, 0))
> > > > + return 0;
> > > > + }
> > > > + return 1;
> > > > +}
> > > > +
> > > > /* is kswapd sleeping prematurely? */
> > > > static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > > > {
> > > > @@ -2355,7 +2375,8 @@ out:
> > > > * back to sleep. High-order users can still perform direct
> > > > * reclaim if they wish.
> > > > */
> > > > - if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > > + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX ||
> > > > + (order > 0 && all_zone_enough_free_pages(pgdat)))
> > > > order = sc.order = 0;
> > >
> > > Ummm. this doesn't work. this place is processed every 32 pages reclaimed.
> > > (see below code and comment). Theresore your patch break high order reclaim
> > > logic.
> > Yes, this will break high order reclaim, but we need a compromise.
> > wrongly reclaim pages is more worse. could increase the watermark in
> > all_zone_enough_free_pages() better?
> >
>
> Hmm..
> I guess I haven't catch your mention. you wrote
>
> > > > but we do if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> > > > order = sc.order = 0;
> > > > this means before we set order to 0, we already reclaimed a lot of
> > > > pages
>
> and I wrote it's not a lot. So, I don't understand why you are talking
> about watermark increasing now. Personally you seems to talk unrelated
> topic. Can you please elablate your point more
OK, let me clarify: in the for loop of balance_pgdat() we reclaim 32
pages at a time, but we have
if (!all_zones_ok) {
...
if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
order = sc.order = 0;

goto loop_again;
}
Only when sc.nr_reclaimed < SWAP_CLUSTER_MAX or priority < 0 do we set
order to 0. Before that, we still use the high order for zone_watermark_ok(),
it keeps failing, and we keep doing page reclaim. So in the patches proposed
by you or Mel, checking the freed pages or the order happens too late, in
kswapd(). I suggest we check whether there are enough free pages in
balance_pgdat() and drop the high-order reclaim if so.

2010-11-29 01:13:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> ok let me clarify, in the for-loop of balance_pgdat() we reclaim 32
> pages one time. but we have
> if (!all_zones_ok) {
> ...
> if (sc.nr_reclaimed < SWAP_CLUSTER_MAX)
> order = sc.order = 0;
>
> goto loop_again;
> }
> only when sc.nr_reclaimed < SWAP_CLUSTER_MAX or priority < 0, we set
> order to 0. before this, we still use high order for zone_watermark_ok()
> and it will fail and we keep doing page reclaim. So in the proposed
> patch by you or Mel, checking the freed pages or order in kswapd() is
> later. so I suggest we check if there is enough free pages in
> balance_pgdat() and break high order allocation if yes.

OK, got it. Thanks. But I think Mel's approach is more conservative. I don't
think a lot of order-0 pages is a good sign for ignoring a high-order shortage.

I think we should prevent overkill reclaim, but the number of order-0 pages
is not related to high-order overkill.

My point is that many cheap devices require high-order GFP_ATOMIC allocations,
and ignoring high orders may destabilize systems in the laptop world.

At least, your patch needs a more conservative guard.

2010-11-29 09:31:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

Hi

> On Tue, Nov 23, 2010 at 12:35:31AM -0800, Dave Hansen wrote:
>
> > I wish. :) The best thing to do is to watch stuff like /proc/vmstat
> > along with its friends like /proc/{buddy,meminfo,slabinfo}. Could you
> > post some samples of those with some indication of where the bad
> > behavior was seen?
> >
> > I've definitely seen swapping in the face of lots of free memory, but
> > only in cases where I was being a bit unfair about the numbers of
> > hugetlbfs pages I was trying to reserve.
>
> So, Dave and I spent quite some time today figuring out what was going on
> here. Once load picked up during the day, kswapd actually never slept
> until late in the afternoon. During the evening now, it's still waking
> up in bursts, and still keeping way too much memory free:
>
> http://0x.ca/sim/ref/2.6.36/memory_tonight.png
>
> (NOTE: we did swapoff -a to keep /dev/sda from overloading)
>
> We have a much better idea on what is happening here, but more questions.
>
> This x86_64 box has 4 GB of RAM; zones are set up as follows:
>
> [ 0.000000] Zone PFN ranges:
> [ 0.000000] DMA 0x00000001 -> 0x00001000
> [ 0.000000] DMA32 0x00001000 -> 0x00100000
> [ 0.000000] Normal 0x00100000 -> 0x00130000
> ...
> [ 0.000000] On node 0 totalpages: 1047279
> [ 0.000000] DMA zone: 56 pages used for memmap
> [ 0.000000] DMA zone: 0 pages reserved
> [ 0.000000] DMA zone: 3943 pages, LIFO batch:0
> [ 0.000000] DMA32 zone: 14280 pages used for memmap
> [ 0.000000] DMA32 zone: 832392 pages, LIFO batch:31
> [ 0.000000] Normal zone: 2688 pages used for memmap
> [ 0.000000] Normal zone: 193920 pages, LIFO batch:31

This machine's zone sizes are:

DMA32: 3250MB
NORMAL: 750MB

This imbalance in zone sizes is one root cause of the strange swapping
issue. I'm sure we certainly need to fix our VM heuristics. However,
there are no perfect heuristics in the real world and we can't build one.
Also, I guess the bug reporter needs a practical workaround.

Then, I wrote the following patch.

If you pass the following boot parameter, the zone division changes to
dma32=1G + normal=3G.

In grub.conf:

kernel /boot/vmlinuz ro root=foobar .... zone_dma32_size=1G


I bet this reduces your headache a lot. Can you please try it?
Of course, this is only a workaround, not a real fix.


>From 1446c915fd59a5f123c2619d1f1f3b4e1bd0c648 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Thu, 23 Dec 2010 08:57:27 +0900
Subject: [PATCH] x86: implement zone_dma32_size boot parameter

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
Documentation/kernel-parameters.txt | 5 +++++
arch/x86/mm/init_64.c | 17 ++++++++++++++++-
2 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index a5966c0..25b4a53 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2686,6 +2686,11 @@ and is between 256 and 4096 characters. It is defined in the file
Format:
<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]

+ zone_dma32_size=nn[KMG] [KNL,BOOT,X86-64]
+ forces the dma32 zone to have an exact size of <nn>.
+ This works to reduce dma32 zone (In other word, to
+ increase normal zone) size.
+
______________________________________________________________________

TODO:
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 71a5929..12d813d 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -95,6 +95,21 @@ static int __init nonx32_setup(char *str)
}
__setup("noexec32=", nonx32_setup);

+static unsigned long max_dma32_pfn = MAX_DMA32_PFN;
+static int __init parse_zone_dma32_size(char *arg)
+{
+ unsigned long dma32_pages;
+
+ if (!arg)
+ return -EINVAL;
+
+ dma32_pages = memparse(arg, &arg) >> PAGE_SHIFT;
+ max_dma32_pfn = min(MAX_DMA_PFN + dma32_pages, MAX_DMA32_PFN);
+
+ return 0;
+}
+early_param("zone_dma32_size", parse_zone_dma32_size);
+
/*
* When memory was added/removed make sure all the processes MM have
* suitable PGD entries in the local PGD level page.
@@ -625,7 +640,7 @@ void __init paging_init(void)

memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
- max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
+ max_zone_pfns[ZONE_DMA32] = max_dma32_pfn;
max_zone_pfns[ZONE_NORMAL] = max_pfn;

sparse_memory_present_with_active_regions(MAX_NUMNODES);
--
1.6.5.2



2010-11-30 00:25:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> On Thu, 25 Nov 2010, KOSAKI Motohiro wrote:
> > Please try SLAB instead SLUB (it can be switched by kernel build option).
> > SLUB try to use high order allocation implicitly.
>
> SLAB uses orders 0-1. Order is fixed per slab cache and determined based
> on object size at slab creation.
>
> SLUB uses orders 0-3. Falls back to smallest order if alloc order cannot
> be met by the page allocator.
>
> One can reduce SLUB to SLAB orders by specifying the following kernel
> commandline parameter:
>
> slub_max_order=1

This?


>From 3edd305fc58ac89364806cd60140793d37422acc Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Fri, 24 Dec 2010 18:04:10 +0900
Subject: [PATCH] slub: reduce slub_max_order by default

slab is already using order-1.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/slub.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8c66aef..babf359 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1964,7 +1964,8 @@ static struct page *get_object_page(const void *x)
* take the list_lock.
*/
static int slub_min_order;
-static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
+/* order-1 is maximum size which we can assume to exist always. */
+static int slub_max_order = 1;
static int slub_min_objects;

/*
--
1.6.5.2


2010-11-30 06:31:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

Hi
> On Fri, Nov 26, 2010 at 08:03:04PM +0900, KOSAKI Motohiro wrote:
> > Two points.
> >
> > > @@ -2310,10 +2324,12 @@ loop_again:
> > > * spectulatively avoid congestion waits
> > > */
> > > zone_clear_flag(zone, ZONE_CONGESTED);
> > > + if (i <= pgdat->high_zoneidx)
> > > + any_zone_ok = 1;
> > > }
> > >
> > > }
> > > - if (all_zones_ok)
> > > + if (all_zones_ok || (order && any_zone_ok))
> > > break; /* kswapd: all done */
> > > /*
> > > * OK, kswapd is getting into trouble. Take a nap, then take
> > > @@ -2336,7 +2352,7 @@ loop_again:
> > > break;
> > > }
> > > out:
> > > - if (!all_zones_ok) {
> > > + if (!(all_zones_ok || (order && any_zone_ok))) {
> >
> > This doesn't work ;)
> > kswapd have to clear ZONE_CONGESTED flag before enter sleeping.
> > otherwise nobody can clear it.
> >
>
> Does it not do it earlier in balance_pgdat() here
>
> /*
> * If a zone reaches its high watermark,
> * consider it to be no longer congested. It's
> * possible there are dirty pages backed by
> * congested BDIs but as pressure is
> * relieved, spectulatively avoid congestion waits
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> if (i <= pgdat->high_zoneidx)
> any_zone_ok = 1;

zone_clear_flag(zone, ZONE_CONGESTED) only clears one zone's status; the other
zones keep their old status.



> > Say, we have to fill below condition.
> > - All zone are successing zone_watermark_ok(order-0)
>
> We should loop around at least once with order == 0 where all_zones_ok
> is checked.

But there's no guarantee. IOW, stopping kswapd early increases the risk of
GFP_ATOMIC allocation failures, I think.


> > - At least one zone are successing zone_watermark_ok(high-order)
>
> This is preferable but it's possible for kswapd to go to sleep without
> this condition being satisfied.

Yes.

>
> >
> >
> > > @@ -2417,6 +2439,7 @@ static int kswapd(void *p)
> > > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > > new_order = pgdat->kswapd_max_order;
> > > pgdat->kswapd_max_order = 0;
> > > + pgdat->high_zoneidx = MAX_ORDER;
> >
> > I don't think MAX_ORDER is correct ;)
> >
> > high_zoneidx = pgdat->high_zoneidx;
> > pgdat->high_zoneidx = pgdat->nr_zones - 1;
> >
> > ?
> >
>
> Bah. It should have been MAX_NR_ZONES. This happens to still work because
> MAX_ORDER will always be higher than MAX_NR_ZONES but it's wrong.

Well, no. balance_pgdat() shouldn't read pgdat->high_zoneidx. Please remember
why balance_pgdat() doesn't read pgdat->kswapd_max_order directly:
wakeup_kswapd() changes pgdat->kswapd_max_order and pgdat->high_zoneidx without
any lock, so we have to fear the following bad scenario.


T1: wakeup_kswapd(order=0, HIGHMEM)
T2: enter balance_kswapd()
T1: wakeup_kswapd(order=1, DMA32)
T2: exit balance_kswapd()
kswapd() erases pgdat->high_zoneidx and decides not to sleep (because
old order = 0, new order = 1). So now we start unnecessary HIGHMEM
reclaim.
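
An illustrative sketch of the unlocked updates this scenario relies on; the
field names follow Mel's patch, but the body is a guess since the full
wakeup_kswapd() hunk is not shown in the thread:

void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
{
        pg_data_t *pgdat = zone->zone_pgdat;

        if (pgdat->kswapd_max_order < order)
                pgdat->kswapd_max_order = order;        /* unlocked store */
        if (pgdat->high_zoneidx < high_zoneidx)
                pgdat->high_zoneidx = high_zoneidx;     /* unlocked store */

        /* both stores race with kswapd reading and resetting the same
         * fields around its sleep, which is the T1/T2 scenario above */
        if (waitqueue_active(&pgdat->kswapd_wait))
                wake_up_interruptible(&pgdat->kswapd_wait);
}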



>
> > And, we have another kswapd_max_order reading place. (after kswapd_try_to_sleep)
> > We need it too.
> >
>
> I'm not quite sure what you mean here. kswapd_max_order is read again
> after kswapd tries to sleep (or wakes for that matter) but it'll be in
> response to another caller having tried to wake kswapd indicating that
> those high orders really are needed.

My expected bad scenario was written above.

Thanks.


2010-11-30 08:22:57

by Simon Kirby

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Thu, Nov 25, 2010 at 04:12:38PM +0000, Mel Gorman wrote:

> On Thu, Nov 25, 2010 at 01:03:28AM -0800, Simon Kirby wrote:
> > > > <SNIP>
> > > >
> > > > This x86_64 box has 4 GB of RAM; zones are set up as follows:
> > > >
> > > > [ 0.000000] Zone PFN ranges:
> > > > [ 0.000000] DMA 0x00000001 -> 0x00001000
> > > > [ 0.000000] DMA32 0x00001000 -> 0x00100000
> > > > [ 0.000000] Normal 0x00100000 -> 0x00130000
>
> Ok. A consequence of this is that kswapd balancing a node will still try
> to balance Normal even if DMA32 has enough memory. This could account
> for some of kswapd being mean.

This seemed to be the case. DMA32 would be freed until watermarks were
met, and then it would fight order-0 allocations in Normal to try to meet
the watermark there, even though DMA32 had more than enough free.

> > > > The fact that so much stuff is evicted just because order-3 hits 0 is
> > > > crazy, especially when larger order pages are still free. It seems like
> > > > we're trying to keep large orders free here. Why?
>
> Watermarks. The steady stream of order-3 allocations is telling the
> allocator and kswapd that these size pages must be available. It doesn't
> know that slub can happily fall back to smaller pages because that
> information is lost. Even removing __GFP_WAIT won't help because kswapd
> still gets woken up for atomic allocation requests.

Since slub has a fallback to s->min (order 0 in this case), I think it
makes sense to add a GFP_NOBALANCE / GFP_NOKSWAPD, or GFP_NORECLAIM
(still allowing compaction if it is compiled in, since slub's purpose is to
reduce object container waste).
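
A sketch of what that could look like in __alloc_pages_slowpath(); the flag
name is hypothetical, since no such flag existed at this point:

        /* hypothetical: opportunistic high-order callers opt out of waking kswapd */
        if (!(gfp_mask & __GFP_NO_KSWAPD))
                wake_all_kswapd(order, zonelist, high_zoneidx);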

> > The funny thing here is that slub.c's allocate_slab() calls alloc_pages()
> > with flags | __GFP_NOWARN | __GFP_NORETRY, and intentionally tries a
> > lower order allocation automatically if it fails. This is why there is
> > no allocation failure warning when this happens. However, it is too late
> > -- kswapd is woken and it tries to bring order 3 up to the watermark.
> > If we hacked __alloc_pages_slowpath() to not wake kswapd when
> > __GFP_NOWARN is set, we would never see this problem and the slub
> > optimization might still mostly work.
>
> Yes, but we'd see more high-order atomic allocation (e.g. jumbo frames)
> failures as a result so that fix would cause other regressions.

The part of your patch that fixes the SWAP_CLUSTER_MAX also increases
this chance.

> I think there are at least two fixes required here.
>
> 1. sleeping_prematurely() must be aware that balance_pgdat() has dropped
> the order.
> 2. kswapd is trying to balance all zones for higher orders even though
> it doesn't really have to.
>
> This patch has potential fixes for both of these problems. I have a split-out
> series but I'm posting it as a single patch to see if it allows kswapd to
> go to sleep as expected for you and whether it stops hammering the Normal
> zone unnecessarily. I tested it locally here (albeit with compaction
> enabled) and it did reduce the amount of time kswapd spent awake.

Ok, so this patch has been running in production since Friday, and there
is a definite improvement in our case. The reading of daemons and core
system stuff back into memory all the time from /dev/sda has stopped,
since kswapd now actually sleeps. Since this allows caching to work, we
can keep swap enabled. However, my userspace reimplementation of
zone_watermark_ok(order=3) shows that order-3 is almost never reached
now. It's not freeing The Right Stuff, or not enough of it before
SWAP_CLUSTER_MAX.

Also, as the days go by, it is still keeping more and more free memory.
My whitespace-compressed buddyinfo+extras right now shows:

DMA32 234402 493 2 3 2 2 2 0 0 0 0 235644 249 <= 512
Normal 369 5 3 7 1 0 0 0 0 0 0 463 87 <= 238
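
For reference, a minimal sketch of such a userspace reimplementation (an
approximation only: the low watermark, roughly 2048 pages for this DMA32 zone,
has to be read from /proc/zoneinfo by hand, and lowmem_reserve is ignored):

#include <stdio.h>

/* nr_free[o] = column o of a /proc/buddyinfo line (blocks free at order o) */
static int watermark_ok(const unsigned long nr_free[11], int order,
                        unsigned long mark)
{
        unsigned long free_pages = 0;
        int o;

        for (o = 0; o < 11; o++)
                free_pages += nr_free[o] << o;

        free_pages -= (1UL << order) - 1;       /* as in zone_watermark_ok() */

        for (o = 0; o < order; o++) {
                free_pages -= nr_free[o] << o;  /* pages below 'order' don't count */
                mark >>= 1;                     /* the mark is halved per order */
                if (free_pages <= mark)
                        return 0;
        }
        return 1;
}

int main(void)
{
        /* one of the DMA32 samples from earlier in the thread: 241 <= 256, so it fails */
        unsigned long dma32[11] = { 20480, 35211, 870, 1, 1, 5, 1, 0, 0, 0, 0 };

        printf("order-3 watermark ok: %d\n", watermark_ok(dma32, 3, 2048));
        return 0;
}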

So, the order-3 watermarks are still not met, and the number of free order-0
pages seems to keep rising. Here is the munin memory graph of the week:

http://0x.ca/sim/ref/2.6.36/memory_mel_patch_4days.png

Things seem to be drifting to kswapd reclaiming order-0 in a way that
doesn't add up to more order-3 blocks available, so fragmentation is
increasing.

You mentioned that there was some mechanism to have it actually target
allocations that could be made into higher order allocations?

Simon-

> ==== CUT HERE ====
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 39c24eb..25fe08d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -645,6 +645,7 @@ typedef struct pglist_data {
> wait_queue_head_t kswapd_wait;
> struct task_struct *kswapd;
> int kswapd_max_order;
> + enum zone_type high_zoneidx;
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> @@ -660,7 +661,7 @@ typedef struct pglist_data {
>
> extern struct mutex zonelists_mutex;
> void build_all_zonelists(void *data);
> -void wakeup_kswapd(struct zone *zone, int order);
> +void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx);
> int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> int classzone_idx, int alloc_flags);
> enum memmap_context {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07a6544..344b597 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1921,7 +1921,7 @@ void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> struct zone *zone;
>
> for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + wakeup_kswapd(zone, order, high_zoneidx);
> }
>
> static inline int
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d31d7ce..00529a0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2118,15 +2118,17 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> #endif
>
> /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> +static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> {
> int i;
> + bool all_zones_ok = true;
> + bool any_zone_ok = false;
>
> /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> if (remaining)
> return 1;
>
> - /* If after HZ/10, a zone is below the high mark, it's premature */
> + /* Check the watermark levels */
> for (i = 0; i < pgdat->nr_zones; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> @@ -2138,10 +2140,20 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>
> if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> 0, 0))
> - return 1;
> + all_zones_ok = false;
> + else
> + any_zone_ok = true;
> }
>
> - return 0;
> + /*
> + * For high-order requests, any zone meeting the watermark is enough
> + * to allow kswapd go back to sleep
> + * For order-0, all zones must be balanced
> + */
> + if (order)
> + return !any_zone_ok;
> + else
> + return !all_zones_ok;
> }
>
> /*
> @@ -2168,6 +2180,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> {
> int all_zones_ok;
> + int any_zone_ok;
> int priority;
> int i;
> unsigned long total_scanned;
> @@ -2201,6 +2214,7 @@ loop_again:
> disable_swap_token();
>
> all_zones_ok = 1;
> + any_zone_ok = 0;
>
> /*
> * Scan in the highmem->dma direction for the highest
> @@ -2310,10 +2324,12 @@ loop_again:
> * spectulatively avoid congestion waits
> */
> zone_clear_flag(zone, ZONE_CONGESTED);
> + if (i <= pgdat->high_zoneidx)
> + any_zone_ok = 1;
> }
>
> }
> - if (all_zones_ok)
> + if (all_zones_ok || (order && any_zone_ok))
> break; /* kswapd: all done */
> /*
> * OK, kswapd is getting into trouble. Take a nap, then take
> @@ -2336,7 +2352,7 @@ loop_again:
> break;
> }
> out:
> - if (!all_zones_ok) {
> + if (!(all_zones_ok || (order && any_zone_ok))) {
> cond_resched();
>
> try_to_freeze();
> @@ -2361,7 +2377,13 @@ out:
> goto loop_again;
> }
>
> - return sc.nr_reclaimed;
> + /*
> + * Return the order we were reclaiming at so sleeping_prematurely()
> + * makes a decision on the order we were last reclaiming at. However,
> + * if another caller entered the allocator slow path while kswapd
> + * was awake, order will remain at the higher level
> + */
> + return order;
> }
>
> /*
> @@ -2417,6 +2439,7 @@ static int kswapd(void *p)
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> new_order = pgdat->kswapd_max_order;
> pgdat->kswapd_max_order = 0;
> + pgdat->high_zoneidx = MAX_ORDER;
> if (order < new_order) {
> /*
> * Don't sleep if someone wants a larger 'order'
> @@ -2464,7 +2487,7 @@ static int kswapd(void *p)
> */
> if (!ret) {
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> - balance_pgdat(pgdat, order);
> + order = balance_pgdat(pgdat, order);
> }
> }
> return 0;
> @@ -2473,7 +2496,7 @@ static int kswapd(void *p)
> /*
> * A zone is low on free memory, so wake its kswapd task to service it.
> */
> -void wakeup_kswapd(struct zone *zone, int order)
> +void wakeup_kswapd(struct zone *zone, int order, enum zone_type high_zoneidx)
> {
> pg_data_t *pgdat;
>
> @@ -2483,8 +2506,10 @@ void wakeup_kswapd(struct zone *zone, int order)
> pgdat = zone->zone_pgdat;
> if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
> return;
> - if (pgdat->kswapd_max_order < order)
> + if (pgdat->kswapd_max_order < order) {
> pgdat->kswapd_max_order = order;
> + pgdat->high_zoneidx = min(pgdat->high_zoneidx, high_zoneidx);
> + }
> trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> return;

2010-11-30 09:13:30

by Simon Kirby

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Fri, Nov 26, 2010 at 09:48:14AM -0600, Christoph Lameter wrote:

> On Thu, 25 Nov 2010, KOSAKI Motohiro wrote:
> > Please try SLAB instead of SLUB (it can be switched by a kernel build option).
> > SLUB tries to use high-order allocations implicitly.
>
> SLAB uses orders 0-1. Order is fixed per slab cache and determined based
> on object size at slab creation.
>
> SLUB uses orders 0-3. Falls back to smallest order if alloc order cannot
> be met by the page allocator.
>
> One can reduce SLUB to SLAB orders by specifying the following kernel
> commandline parameter:
>
> slub_max_order=1

Can we also mess with these /sys files on the fly?

[/sys/kernel/slab]# grep . kmalloc-*/order | sort -n -k2 -t-
kmalloc-8/order:0
kmalloc-16/order:0
kmalloc-32/order:0
kmalloc-64/order:0
kmalloc-96/order:0
kmalloc-128/order:0
kmalloc-192/order:0
kmalloc-256/order:1
kmalloc-512/order:2
kmalloc-1024/order:3
kmalloc-2048/order:3
kmalloc-4096/order:3
kmalloc-8192/order:3

I'm not familiar with how slub works, but I assume there's some overhead
or some reason not to just use order 0 for <= kmalloc-4096? Or is it
purely just trying to reduce cpu by calling alloc_pages less often?

Simon-

2010-11-30 10:41:54

by Mel Gorman

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

On Tue, Nov 30, 2010 at 03:31:03PM +0900, KOSAKI Motohiro wrote:
> Hi
> > On Fri, Nov 26, 2010 at 08:03:04PM +0900, KOSAKI Motohiro wrote:
> > > Two points.
> > >
> > > > @@ -2310,10 +2324,12 @@ loop_again:
> > > > * spectulatively avoid congestion waits
> > > > */
> > > > zone_clear_flag(zone, ZONE_CONGESTED);
> > > > + if (i <= pgdat->high_zoneidx)
> > > > + any_zone_ok = 1;
> > > > }
> > > >
> > > > }
> > > > - if (all_zones_ok)
> > > > + if (all_zones_ok || (order && any_zone_ok))
> > > > break; /* kswapd: all done */
> > > > /*
> > > > * OK, kswapd is getting into trouble. Take a nap, then take
> > > > @@ -2336,7 +2352,7 @@ loop_again:
> > > > break;
> > > > }
> > > > out:
> > > > - if (!all_zones_ok) {
> > > > + if (!(all_zones_ok || (order && any_zone_ok))) {
> > >
> > > This doesn't work ;)
> > > kswapd has to clear the ZONE_CONGESTED flag before entering sleep,
> > > otherwise nobody can clear it.
> > >
> >
> > Does it not do it earlier in balance_pgdat() here
> >
> > /*
> > * If a zone reaches its high watermark,
> > * consider it to be no longer congested. It's
> > * possible there are dirty pages backed by
> > * congested BDIs but as pressure is
> > * relieved, spectulatively avoid congestion waits
> > */
> > zone_clear_flag(zone, ZONE_CONGESTED);
> > if (i <= pgdat->high_zoneidx)
> > any_zone_ok = 1;
>
> zone_clear_flag(zone, ZONE_CONGESTED) only clears one zone's status. The other
> zones keep their old status.
>

Ah, now I get you. kswapd does not necessarily balance all zones, so it needs
to unconditionally clear them all before it goes to sleep, just in case. At
some point in the future, the tagging of ZONE_CONGESTED needs more
thinking about.

> > > Say, we have to satisfy the conditions below.
> > > - All zones pass zone_watermark_ok(order-0)
> >
> > We should loop around at least once with order == 0 where all_zones_ok
> > is checked.
>
> But there is no guarantee. IOW, kswapd stopping early increases the risk of
> GFP_ATOMIC allocation failures, I think.
>

Force all zones to be balanced for order-0?

>
> > > - At least one zone passes zone_watermark_ok(high-order)
> >
> > This is preferable but it's possible for kswapd to go to sleep without
> > this condition being satisfied.
>
> Yes.
>
> >
> > >
> > >
> > > > @@ -2417,6 +2439,7 @@ static int kswapd(void *p)
> > > > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > > > new_order = pgdat->kswapd_max_order;
> > > > pgdat->kswapd_max_order = 0;
> > > > + pgdat->high_zoneidx = MAX_ORDER;
> > >
> > > I don't think MAX_ORDER is correct ;)
> > >
> > > high_zoneidx = pgdat->high_zoneidx;
> > > pgdat->high_zoneidx = pgdat->nr_zones - 1;
> > >
> > > ?
> > >
> >
> > Bah. It should have been MAX_NR_ZONES. This happens to still work because
> > MAX_ORDER will always be higher than MAX_NR_ZONES but it's wrong.
>
> Well, no. balance_pgdat() shouldn't read pgdat->high_zoneidx. Please remember why
> balance_pgdat() doesn't read pgdat->kswapd_max_order directly: wakeup_kswapd()
> changes pgdat->kswapd_max_order and pgdat->high_zoneidx without any lock, so
> we need to be afraid of the following bad scenario.
>
>
> T1: wakeup_kswapd(order=0, HIGHMEM)
> T2: enter balance_kswapd()
> T1: wakeup_kswapd(order=1, DMA32)
> T2: exit balance_kswapd()
> kswapd() erases pgdat->high_zoneidx and decides not to sleep (because
> old-order=0, new-order=1). So now we will start unnecessary HIGHMEM
> reclaim.
>

Correct. I'll fix this up.

> > > And we have another place that reads kswapd_max_order (after kswapd_try_to_sleep).
> > > We need it there too.
> > >
> >
> > I'm not quite sure what you mean here. kswapd_max_order is read again
> > after kswapd tries to sleep (or wakes for that matter) but it'll be in
> > response to another caller having tried to wake kswapd indicating that
> > those high orders really are needed.
>
> My expected bad scenario was written above.
>
> Thanks.
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-30 11:19:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

Hi

> > > > > out:
> > > > > - if (!all_zones_ok) {
> > > > > + if (!(all_zones_ok || (order && any_zone_ok))) {
> > > >
> > > > This doesn't work ;)
> > > > kswapd has to clear the ZONE_CONGESTED flag before entering sleep,
> > > > otherwise nobody can clear it.
> > > >
> > >
> > > Does it not do it earlier in balance_pgdat() here
> > >
> > > /*
> > > * If a zone reaches its high watermark,
> > > * consider it to be no longer congested. It's
> > > * possible there are dirty pages backed by
> > > * congested BDIs but as pressure is
> > > * relieved, spectulatively avoid congestion waits
> > > */
> > > zone_clear_flag(zone, ZONE_CONGESTED);
> > > if (i <= pgdat->high_zoneidx)
> > > any_zone_ok = 1;
> >
> > zone_clear_flag(zone, ZONE_CONGESTED) only clears one zone's status. The other
> > zones keep their old status.
> >
>
> Ah, now I get you. kswapd does not necessarily balance all zones, so it needs
> to unconditionally clear them all before it goes to sleep, just in case. At
> some point in the future, the tagging of ZONE_CONGESTED needs more
> thinking about.
>

This is an option.


> > > > Say, we have to satisfy the conditions below.
> > > > - All zones pass zone_watermark_ok(order-0)
> > >
> > > We should loop around at least once with order == 0 where all_zones_ok
> > > is checked.
> >
> > But there is no guarantee. IOW, kswapd stopping early increases the risk of
> > GFP_ATOMIC allocation failures, I think.
> >
>
> Force all zones to be balanced for order-0?

Yes.

I think the following change does that.

if (i <= pgdat->high_zoneidx)
- any_zone_ok = 1;
+ order = sc.order = 0;


This is more conservative.


Subject: Re: Free memory never fully used, swapping

On Tue, 30 Nov 2010, KOSAKI Motohiro wrote:

> This?

Specifying a parameter to temporarily override it, to see if that has the
desired effect, is ok. But this has worked for years now. There must be
something else going on with reclaim that causes these issues now.

Subject: Re: Free memory never fully used, swapping

On Tue, 30 Nov 2010, Simon Kirby wrote:

> Can we also mess with these /sys files on the fly?

Sure. Go ahead. These are runtime configurable.

> I'm not familiar with how slub works, but I assume there's some overhead
> or some reason not to just use order 0 for <= kmalloc-4096? Or is it
> purely just trying to reduce cpu by calling alloc_pages less often?

Using higher-order pages reduces the memory overhead for objects (which
after all need to be packed into an order-N page), decreases the amount
of metadata that needs to be managed, and decreases the use of the
slowpaths. That also implies a reduction in locking overhead.
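
As a back-of-the-envelope illustration of the packing argument (my numbers,
assuming 4 KiB pages): for kmalloc-1024 an order-3 slab holds 32 objects
where an order-0 slab holds only 4, so one page-allocator call covers eight
times as many objects:

	#include <stdio.h>

	/* Objects per slab for kmalloc-1024 at different page orders. */
	int main(void)
	{
		const unsigned long page_size = 4096, object_size = 1024;
		int order;

		for (order = 0; order <= 3; order++)
			printf("order %d: %lu objects per slab\n",
			       order, (page_size << order) / object_size);
		return 0;
	}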

2010-12-01 10:17:13

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> On Tue, 30 Nov 2010, KOSAKI Motohiro wrote:
>
> > This?
>
> Specifying a parameter to temporarily override it, to see if that has the
> desired effect, is ok. But this has worked for years now. There must be
> something else going on with reclaim that causes these issues now.

I don't think this has worked. Simon found the corner case recently,
but it is not new.

So I hope you realize that high-order allocation is no free lunch. __GFP_NORETRY
really makes no sense. Even though we have compaction, high-order reclaim is still
a costly operation.

I don't think SLUB's high-order allocation attempts are a bad idea, but right
now the attempts are too costly, and that's bad. I also worry that SLUB assumes
too high-end a machine. Both SLES and RHEL have now decided not to use SLUB and
to use SLAB instead, so the Linux community is fragmented. If you are still
interested in SL*B unification, can you please consider joining the corner-case
smashing activity?


Subject: Re: Free memory never fully used, swapping

On Wed, 1 Dec 2010, KOSAKI Motohiro wrote:

> > Specifying a parameter to temporarily override it, to see if that has the
> > desired effect, is ok. But this has worked for years now. There must be
> > something else going on with reclaim that causes these issues now.
>
> I don't think this has worked. Simon found the corner case recently,
> but it is not new.

What has worked? If the reduction of the maximum allocation order did not
have the expected effect of fixing things here then the issue is not
related to the higher allocations from slub.

Higher order allocations are not only a slub issue but a general issue for
various subsystem that require higher order pages. This ranges from jumbo
frames, to particular needs for certain device drivers, to huge pages.

> > So I hope you realize that high-order allocation is no free lunch. __GFP_NORETRY
> > really makes no sense. Even though we have compaction, high-order reclaim is still
> > a costly operation.

Sure. There is a tradeoff between reclaim effort and the benefit of higher
allocations. The costliness of reclaim may have increased with the recent
changes to the reclaim logic. In fact reclaim gets more and more complex
over time and there may be subtle bugs in there given the recent flurry of
changes.

> > I don't think SLUB's high-order allocation attempts are a bad idea, but right
> > now the attempts are too costly, and that's bad. I also worry that SLUB assumes
> > too high-end a machine. Both SLES and RHEL have now decided not to use SLUB and
> > to use SLAB instead, so the Linux community is fragmented. If you are still
> > interested in SL*B unification, can you please consider joining the corner-case
> > smashing activity?

The problems with higher-order reclaim get more difficult with small
memory sizes, yes. We could reduce the maximum order automatically if memory
is too tight. There is nothing hindering us from tuning the max order
behavior of slub in a similar way that we now tune the thresholds of the
vm statistics.
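
A rough sketch of what such auto-tuning might look like at SLUB init time --
the 1 GiB cutoff and the cap of 1 are arbitrary assumptions of mine; only
slub_max_order and totalram_pages are existing symbols:

	/* Hypothetical: cap SLUB's maximum order on small-memory machines
	 * so its speculative high-order allocations stay cheap. */
	if (totalram_pages < (1UL << (30 - PAGE_SHIFT)))	/* < 1 GiB */
		slub_max_order = min(slub_max_order, 1);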

But for that to be done we first need to have some feedback if the changes
to max order have indeed the desired effect in this corner case.



2010-12-02 02:44:58

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Free memory never fully used, swapping

> On Wed, 1 Dec 2010, KOSAKI Motohiro wrote:
>
> > > Specifying a parameter to temporarily override it, to see if that has the
> > > desired effect, is ok. But this has worked for years now. There must be
> > > something else going on with reclaim that causes these issues now.
> >
> > I don't think this has worked. Simon found the corner case recently,
> > but it is not new.
>
> What has worked? If the reduction of the maximum allocation order did not
> have the expected effect of fixing things here then the issue is not
> related to the higher allocations from slub.
>
> Higher order allocations are not only a slub issue but a general issue for
> various subsystem that require higher order pages. This ranges from jumbo
> frames, to particular needs for certain device drivers, to huge pages.

Sure, yes. However, there is one big difference. Other users certainly need
such high orders, but slub uses high order only for performance, and that
strategy often shoots us in the foot. It is often worse than low order. IOW,
slub doesn't always win against slab.


> > So I hope you realize that high-order allocation is no free lunch. __GFP_NORETRY
> > really makes no sense. Even though we have compaction, high-order reclaim is still
> > a costly operation.
>
> Sure. There is a tradeoff between reclaim effort and the benefit of higher
> allocations. The costliness of reclaim may have increased with the recent
> changes to the reclaim logic. In fact reclaim gets more and more complex
> over time and there may be subtle bugs in there given the recent flurry of
> changes.

I can't deny that reclaim is really complex. So maybe one of the problems is
that reclaim can't tell whether a request is strictly necessary or an
optimistic try. Also, allocation failures often cause disasters, so we have
been working on fixing them. But increasing the high-order allocation success
ratio can sadly make slub unhappy. Umm...

So I think we have multiple options:

1) reduce slub_max_order so slub only uses a safe order
2) slub doesn't invoke reclaim for its high-order trial allocations
(i.e. turn off GFP_WAIT and turn on GFP_NOKSWAPD)
3) slub passes a new hint to reclaim, and reclaim doesn't work so aggressively
when such a hint is passed.


So I have one question. I thought (2) was the most natural (a rough sketch of
(2) is below), but slub doesn't do that now. Why don't you do that? As far as
I know, reclaim has never been a lightweight operation since Linux was born.
I'm curious what cost threshold you assume for slub's high-order allocations.
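
A rough sketch of what (2) could look like in SLUB's allocate_slab(), purely
to illustrate the idea -- GFP_NOKSWAPD is the hypothetical flag named above
(as noted below, it does not exist upstream); the other names are the
existing ones in that function:

	/* Try the high order opportunistically: don't block and don't
	 * wake kswapd, then fall back to the minimum order with the
	 * caller's original flags. */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | GFP_NOKSWAPD)
			& ~(__GFP_NOFAIL | __GFP_WAIT);
	page = alloc_slab_page(alloc_gfp, node, oo);
	if (unlikely(!page)) {
		oo = s->min;
		page = alloc_slab_page(flags, node, oo);
	}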



> > I don't think SLUB's high-order allocation attempts are a bad idea, but right
> > now the attempts are too costly, and that's bad. I also worry that SLUB assumes
> > too high-end a machine. Both SLES and RHEL have now decided not to use SLUB and
> > to use SLAB instead, so the Linux community is fragmented. If you are still
> > interested in SL*B unification, can you please consider joining the corner-case
> > smashing activity?
>
> The problems with higher-order reclaim get more difficult with small
> memory sizes, yes. We could reduce the maximum order automatically if memory
> is too tight. There is nothing hindering us from tuning the max order
> behavior of slub in a similar way that we now tune the thresholds of the
> vm statistics.

Sounds like a really good idea. :)



> But for that to be done we first need to have some feedback if the changes
> to max order have indeed the desired effect in this corner case.




Subject: Re: Free memory never fully used, swapping

On Thu, 2 Dec 2010, KOSAKI Motohiro wrote:

> So I think we have multiple options:
>
> 1) reduce slub_max_order so slub only uses a safe order
> 2) slub doesn't invoke reclaim for its high-order trial allocations
> (i.e. turn off GFP_WAIT and turn on GFP_NOKSWAPD)
> 3) slub passes a new hint to reclaim, and reclaim doesn't work so aggressively
> when such a hint is passed.
>
>
> So I have one question. I thought (2) was the most natural, but slub doesn't do that now.

2) so far has not been available. GFP_NOKSWAPD does not exist upstream.