2011-05-25 20:18:16

by Andrew Lutomirski

Subject: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
<[email protected]> wrote:
>
> Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
> I suspect it's a DM issue. Can you please try putting swap outside of DM?
>
>

I can do one better: I can tell you how to reproduce the OOM in the
comfort of your own VM without using dm_crypt or a Sandy Bridge
laptop. This is on Fedora 15, but it really ought to work on any
x86_64 distribution that has kvm. You'll probably want at least 6GB
on your host machine because the VM wants 4GB ram.

Here's how:

Step 1: Clone git://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug.git

(You can browse here:)
https://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug

Instructions to reproduce the mm bug:

Step 2: Build Linux v2.6.38.6 with config-2.6.38.6 and the patch
0001-Minchan-patch-for-testing-23-05-2011.patch (both files are in the
git repo)

Step 3: cd back to reproduce-annoying-mm-bug

Step 4: Type this.

$ make
$ qemu-kvm -m 4G -smp 2 -kernel <linux_dir>/arch/x86/boot/bzImage \
      -initrd initramfs.gz

Step 5: Wait for the VM to boot (it's really fast) and then run ./repro_bug.sh.

Step 6: Wait a bit and watch the fireworks. Note that it can take a
couple minutes to reproduce the bug.

Tested on my Sandy Bridge laptop and on a Xeon W3520.

For whatever reason, on my laptop without the VM I can hit the bug
almost instantaneously. Maybe it's because I'm using dm-crypt on my
laptop.

--Andy

P.S. I think that the mk_trivial_initramfs.sh script is cute, and
maybe I'll try to flesh it out and turn it into a real project some
day.


2011-05-26 08:18:57

by KOSAKI Motohiro

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

(2011/05/26 5:17), Andrew Lutomirski wrote:
> On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>
>> Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
>> I suspect it's a DM issue. Can you please try putting swap outside of DM?
>>
>>
>
> I can do one better: I can tell you how to reproduce the OOM in the
> comfort of your own VM without using dm_crypt or a Sandy Bridge
> laptop. This is on Fedora 15, but it really ought to work on any
> x86_64 distribution that has kvm. You'll probably want at least 6GB
> on your host machine because the VM wants 4GB ram.

Hmmm....

I don't have a machine with 6GB of memory. :-)
I'll try to borrow one from somewhere, but expect my response to be delayed
for some time.

2011-05-26 23:58:54

by Minchan Kim

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Thu, May 26, 2011 at 5:17 AM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>
>> Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
>> I suspect it's a DM issue. Can you please try putting swap outside of DM?
>>
>>
>
> I can do one better: I can tell you how to reproduce the OOM in the
> comfort of your own VM without using dm_crypt or a Sandy Bridge
> laptop.  This is on Fedora 15, but it really ought to work on any
> x86_64 distribution that has kvm.  You'll probably want at least 6GB
> on your host machine because the VM wants 4GB ram.
>
> Here's how:
>
> Step 1: Clone git://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug.git
>
> (You can browse here:)
> https://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug
>
> Instructions to reproduce the mm bug:
>
> Step 2: Build Linux v2.6.38.6 with config-2.6.38.6 and the patch
> 0001-Minchan-patch-for-testing-23-05-2011.patch (both files are in the
> git repo)
>
> Step 3: cd back to reproduce-annoying-mm-bug
>
> Step 4: Type this.
>
> $ make
> $ qemu-kvm -m 4G -smp 2 -kernel <linux_dir>/arch/x86/boot/bzImage
> -initrd initramfs.gz
>
> Step 5: Wait for the VM to boot (it's really fast) and then run ./repro_bug.sh.
>
> Step 6: Wait a bit and watch the fireworks.  Note that it can take a
> couple minutes to reproduce the bug.
>
> Tested on my Sandy Bridge laptop and on a Xeon W3520.
>
> For whatever reason, on my laptop without the VM I can hit the bug
> almost instantaneously.  Maybe it's because I'm using dm-crypt on my
> laptop.
>
> --Andy
>
> P.S.  I think that the mk_trivial_initramfs.sh script is cute, and

That's cool. :)

> maybe I'll try to flesh it out and turn it into a real project some
> day.
>

Thanks for the good test environment.
Yesterday I tried to reproduce your problem on my system (4G DRAM), but
unfortunately I failed. I tried various settings but couldn't hit it.
Maybe I need an 8G system or a Sandy Bridge. :(

Hi mm folks, it's time for the next round.
Andrew Lutomirski's first problem, the kswapd hang, was solved by
Mel's recent series (the !pgdat_balanced bug and the shrink_slab
cond_resched), which is the key to James's and Collins's problem.

Andrew's next problem is an early OOM kill.

[   60.627550] cryptsetup invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[   60.627553] cryptsetup cpuset=/ mems_allowed=0
[   60.627555] Pid: 1910, comm: cryptsetup Not tainted 2.6.38.6-no-fpu+ #47
[   60.627556] Call Trace:
[   60.627563]  [<ffffffff8107f9c5>] ? cpuset_print_task_mems_allowed+0x91/0x9c
[   60.627567]  [<ffffffff810b3ef1>] ? dump_header+0x7f/0x1ba
[   60.627570]  [<ffffffff8109e4d6>] ? trace_hardirqs_on+0x9/0x20
[   60.627572]  [<ffffffff810b42ba>] ? oom_kill_process+0x50/0x24e
[   60.627574]  [<ffffffff810b4961>] ? out_of_memory+0x2e4/0x359
[   60.627576]  [<ffffffff810b879e>] ? __alloc_pages_nodemask+0x5f3/0x775
[   60.627579]  [<ffffffff810e127e>] ? alloc_pages_current+0xbe/0xd8
[   60.627581]  [<ffffffff810b2126>] ? __page_cache_alloc+0x77/0x7e
[   60.627585]  [<ffffffff8135d009>] ? dm_table_unplug_all+0x52/0xed
[   60.627587]  [<ffffffff810b9f74>] ? __do_page_cache_readahead+0x98/0x1a4
[   60.627589]  [<ffffffff810ba321>] ? ra_submit+0x21/0x25
[   60.627590]  [<ffffffff810ba4ee>] ? ondemand_readahead+0x1c9/0x1d8
[   60.627592]  [<ffffffff810ba5dd>] ? page_cache_sync_readahead+0x3d/0x40
[   60.627594]  [<ffffffff810b342d>] ? filemap_fault+0x119/0x36c
[   60.627597]  [<ffffffff810caf5f>] ? __do_fault+0x56/0x342
[   60.627600]  [<ffffffff810f5630>] ? lookup_page_cgroup+0x32/0x48
[   60.627602]  [<ffffffff810cd437>] ? handle_pte_fault+0x29f/0x765
[   60.627604]  [<ffffffff810ba75e>] ? add_page_to_lru_list+0x6e/0x73
[   60.627606]  [<ffffffff810be487>] ? page_evictable+0x1b/0x8d
[   60.627607]  [<ffffffff810bae36>] ? put_page+0x24/0x35
[   60.627610]  [<ffffffff810cdbfc>] ? handle_mm_fault+0x18e/0x1a1
[   60.627612]  [<ffffffff810cded2>] ? __get_user_pages+0x2c3/0x3ed
[   60.627614]  [<ffffffff810cfb4b>] ? __mlock_vma_pages_range+0x67/0x6b
[   60.627616]  [<ffffffff810cfc01>] ? do_mlock_pages+0xb2/0x11a
[   60.627618]  [<ffffffff810d0448>] ? sys_mlockall+0x111/0x11c
[   60.627621]  [<ffffffff81002a3b>] ? system_call_fastpath+0x16/0x1b
[   60.627623] Mem-Info:
[   60.627624] Node 0 DMA per-cpu:
[   60.627626] CPU    0: hi:    0, btch:   1 usd:   0
[   60.627627] CPU    1: hi:    0, btch:   1 usd:   0
[   60.627628] CPU    2: hi:    0, btch:   1 usd:   0
[   60.627629] CPU    3: hi:    0, btch:   1 usd:   0
[   60.627630] Node 0 DMA32 per-cpu:
[   60.627631] CPU    0: hi:  186, btch:  31 usd:   0
[   60.627633] CPU    1: hi:  186, btch:  31 usd:   0
[   60.627634] CPU    2: hi:  186, btch:  31 usd:   0
[   60.627635] CPU    3: hi:  186, btch:  31 usd:   0
[   60.627638] active_anon:51586 inactive_anon:17384 isolated_anon:0
[   60.627639]  active_file:0 inactive_file:226 isolated_file:0
[   60.627639]  unevictable:395661 dirty:0 writeback:3 unstable:0
[   60.627640]  free:13258 slab_reclaimable:3979 slab_unreclaimable:9755
[   60.627640]  mapped:11910 shmem:24046 pagetables:5062 bounce:0
[   60.627642] Node 0 DMA free:8352kB min:340kB low:424kB high:508kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:952kB unevictable:6580kB isolated(anon):0kB isolated(file):0kB present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1645 all_unreclaimable? yes
[   60.627649] lowmem_reserve[]: 0 2004 2004 2004
[   60.627651] Node 0 DMA32 free:44680kB min:44712kB low:55888kB high:67068kB active_anon:206344kB inactive_anon:69536kB active_file:0kB inactive_file:0kB unevictable:1576064kB isolated(anon):0kB isolated(file):0kB present:2052320kB mlocked:47540kB dirty:0kB writeback:12kB mapped:47640kB shmem:96184kB slab_reclaimable:15900kB slab_unreclaimable:39020kB kernel_stack:2424kB pagetables:20248kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:499225 all_unreclaimable? yes
[   60.627658] lowmem_reserve[]: 0 0 0 0
[   60.627660] Node 0 DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 2*128kB 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8352kB
[   60.627665] Node 0 DMA32: 959*4kB 2071*8kB 682*16kB 165*32kB 27*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 44980kB
[   60.627670] 419957 total pagecache pages
[   60.627671] 0 pages in swap cache
[   60.627672] Swap cache stats: add 137, delete 137, find 0/0
[   60.627673] Free swap  = 6290904kB
[   60.627674] Total swap = 6291452kB
[   60.632560] 524272 pages RAM
[   60.632562] 9451 pages reserved
[   60.632563] 45558 pages shared
[   60.632564] 469944 pages non-shared


There is about 270M of anon memory and lots of free swap space in the system.
Nonetheless, he saw the OOM. I don't think that makes sense.
Looking at the log above, he used an encrypted device-mapper volume as
swap and a 1.4G ramfs.
Andy, right?

The first thing I suspected was the big ramfs.
I think that during reclaim, shrink_page_list starts culling mlocked pages.
If there are many ramfs pages and working-set pages on the LRU, the
reclaimer can't reclaim any page until it reaches evictable,
non-working-set pages (!PG_referenced and !pte_young). His workload
had lots of anon pages and ramfs pages. ramfs pages are unevictable,
so they get culled, and anon pages are promoted very easily, so
we can't reclaim them easily either.
That means zone->pages_scanned would get very high and, in the end,
zone->all_unreclaimable would be set.
Looking at the log above, the number of LRU pages in the DMA32 zone is
68970 (active_anon 51586 + inactive_anon 17384).
The number of unevictable pages is 394016 (1576064kB).

394016 + the working-set pages (I don't know how many) is almost equal
to 68970 * 6 = 413820.
So it's possible that zone->all_unreclaimable gets set.
The patch below was tested privately, but it doesn't solve his problem.
Still, I think we need it. The false alarm can happen whenever many
mlocked pages sit consecutively in LRU order.
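
To make the arithmetic above concrete, here is a small user-space sketch in C. It assumes that 2.6.38 treats a zone as unreclaimable once zone->pages_scanned reaches six times the pages left on the evictable LRU lists (the "* 6" used above) and that pages are 4kB. The struct and function names are made up for illustration and are not kernel API; the numbers are copied from the DMA32 line of the OOM log.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: these names are hypothetical, not kernel API. */
struct zone_model {
	unsigned long active_anon_kb;	/* values from the DMA32 log line */
	unsigned long inactive_anon_kb;
	unsigned long active_file_kb;
	unsigned long inactive_file_kb;
	unsigned long pages_scanned;	/* the log reports this in pages */
};

/* Convert a kB figure from the log into 4kB pages (assumed page size). */
unsigned long kb_to_pages(unsigned long kb)
{
	assert(kb % 4 == 0);	/* log values are whole pages */
	return kb / 4;
}

/* Pages on the evictable LRU lists; anon counts because swap is free. */
unsigned long reclaimable_pages(const struct zone_model *z)
{
	return kb_to_pages(z->active_anon_kb) +
	       kb_to_pages(z->inactive_anon_kb) +
	       kb_to_pages(z->active_file_kb) +
	       kb_to_pages(z->inactive_file_kb);
}

/* Assumed heuristic: reclaimable while scanned < 6 * evictable LRU pages. */
bool zone_looks_reclaimable(const struct zone_model *z)
{
	return z->pages_scanned < reclaimable_pages(z) * 6;
}

/* DMA32 values copied from the OOM log. */
const struct zone_model dma32 = {
	.active_anon_kb   = 206344,
	.inactive_anon_kb = 69536,
	.active_file_kb   = 0,
	.inactive_file_kb = 0,
	.pages_scanned    = 499225,
};
```

With these inputs reclaimable_pages(&dma32) is 68970, the threshold is 413820, and pages_scanned is 499225, so zone_looks_reclaimable(&dma32) is false. That is consistent with "all_unreclaimable? yes" on the DMA32 line.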

===

From e37f150328aedeea9a88b6190ab2b6e6c1067163 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Wed, 25 May 2011 07:09:17 +0900
Subject: [PATCH 3/3] vmscan: decrease pages_scanned on unevictable page

If there are many unevictable pages on the evictable LRU list (e.g. a big
ramfs), shrink_page_list will move them to the unevictable list without
reclaiming any pages.
But we have already increased zone->pages_scanned.
If this repeats, the number of evictable LRU pages decreases
while zone->pages_scanned increases without any pages being reclaimed.
That can turn on zone->all_unreclaimable, but it is a false alarm.

Signed-off-by: Minchan Kim <[email protected]>
---
 mm/vmscan.c |   22 +++++++++++++++++++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 08d3077..a7df813 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -700,7 +700,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 static unsigned long shrink_page_list(struct list_head *page_list,
                                      struct zone *zone,
                                      struct scan_control *sc,
-                                     unsigned long *dirty_pages)
+                                     unsigned long *dirty_pages,
+                                     unsigned long *unevictable_pages)
 {
        LIST_HEAD(ret_pages);
        LIST_HEAD(free_pages);
@@ -708,6 +709,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
        unsigned long nr_dirty = 0;
        unsigned long nr_congested = 0;
        unsigned long nr_reclaimed = 0;
+       unsigned long nr_unevictable = 0;

        cond_resched();

@@ -908,6 +910,7 @@ cull_mlocked:
                        try_to_free_swap(page);
                unlock_page(page);
                putback_lru_page(page);
+               nr_unevictable++;
                continue;

 activate_locked:
@@ -936,6 +939,7 @@ keep_lumpy:
                zone_set_flag(zone, ZONE_CONGESTED);

        *dirty_pages = nr_dirty;
+       *unevictable_pages = nr_unevictable;
        free_page_list(&free_pages);

        list_splice(&ret_pages, page_list);
@@ -1372,6 +1376,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
        unsigned long nr_scanned;
        unsigned long nr_reclaimed = 0;
        unsigned long nr_dirty;
+       unsigned long nr_unevictable;
        unsigned long nr_taken;
        unsigned long nr_anon;
        unsigned long nr_file;
@@ -1425,7 +1430,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
        spin_unlock_irq(&zone->lru_lock);

        reclaim_mode = sc->reclaim_mode;
-       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
+       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty, &nr_unevictable);

        /* Check if we should syncronously wait for writeback */
        if ((nr_dirty && !(reclaim_mode & RECLAIM_MODE_SINGLE) &&
@@ -1434,7 +1439,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
                unsigned long nr_active = clear_active_flags(&page_list, NULL);
                count_vm_events(PGDEACTIVATE, nr_active);
                set_reclaim_mode(priority, sc, true);
-               nr_reclaimed += shrink_page_list(&page_list, zone, sc, &nr_dirty);
+               nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+                                               &nr_dirty, &nr_unevictable);
        }

        local_irq_disable();
@@ -1442,6 +1448,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
                __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
        __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

+       /*
+        * Too many unevictable pages on evictable LRU list(ex, big ramfs)
+        * can make high zone->pages_scanned and reduce the number of lru page
+        * on evictable lru as reclaim is going on.
+        * It could turn on all_unreclaimable which is false alarm.
+        */
+       spin_lock(&zone->lru_lock);
+       if (zone->pages_scanned >= nr_unevictable)
+               zone->pages_scanned -= nr_unevictable;
+       else
+               zone->pages_scanned = 0;
+       spin_unlock(&zone->lru_lock);
+
        putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);

        trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
--
1.7.1

===
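
The last hunk of the patch boils down to a clamped subtraction, so that the unsigned scan counter cannot wrap around. A stand-alone sketch of just that step follows; the helper name is hypothetical, and it leaves out the zone->lru_lock locking that the real code needs.

```c
#include <assert.h>

/*
 * Hypothetical helper, not kernel API: take the culled (unevictable)
 * pages back out of the scan counter, clamping at zero so the unsigned
 * value cannot wrap around.
 */
unsigned long uncount_unevictable(unsigned long pages_scanned,
				  unsigned long nr_unevictable)
{
	unsigned long result = 0;

	if (pages_scanned >= nr_unevictable)
		result = pages_scanned - nr_unevictable;

	assert(result <= pages_scanned);	/* the counter never grows here */
	return result;
}
```

For example, uncount_unevictable(100, 32) gives 68, while uncount_unevictable(10, 32) gives 0 rather than a huge wrapped value.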

The second thing I suspect is zone_set_flag(zone, ZONE_CONGESTED).
He used an encrypted device-mapper volume as swap.
Device mapper could make I/O slow for his workload, which means we are
likely to hit ZONE_CONGESTED more often than with normal swap.

Let's think about it.
The swap device is very congested, so shrink_page_list marks the zone
as CONGESTED.
Who clears ZONE_CONGESTED? There are two places in kswapd.
One works only for order > 0, so it's probably a no-op for Andy's
workload (i.e. it's mostly order-0 allocations).
The remaining one is below.

                                /*
                                 * If a zone reaches its high watermark,
                                 * consider it to be no longer congested. It's
                                 * possible there are dirty pages backed by
                                 * congested BDIs but as pressure is relieved,
                                 * spectulatively avoid congestion waits
                                 */
                                zone_clear_flag(zone, ZONE_CONGESTED);
                                if (i <= *classzone_idx)
                                        balanced += zone->present_pages;

It runs only if the zone reaches its high watermark. If allocation is
faster than reclaim (which is the case with a slow swap device), the zone
stays congested.
That means swapout would block.
From the OOM log we can see that the DMA32 zone can't reach its high
watermark (free:44680kB vs. high:67068kB).
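
A rough user-space model of that last point, assuming (as described above) that for order-0 allocations the only place kswapd clears ZONE_CONGESTED is when the zone reaches its high watermark. The names are illustrative, not kernel API, and the real watermark check also accounts for lowmem reserves and allocation order. The kB values used in the note below come from the DMA32 line of the OOM log.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: hypothetical names, not kernel API. */
struct zone_wmark_model {
	unsigned long free_kb;
	unsigned long high_kb;	/* high watermark */
	bool congested;		/* stands in for the ZONE_CONGESTED flag */
};

/* Simplified: the real check also considers lowmem_reserve and order. */
bool above_high_watermark(const struct zone_wmark_model *z)
{
	return z->free_kb > z->high_kb;
}

/* One kswapd pass for order-0: clear the flag only above the watermark. */
void kswapd_pass(struct zone_wmark_model *z)
{
	assert(z);	/* the caller must pass a zone */
	if (above_high_watermark(z))
		z->congested = false;
}
```

With free_kb = 44680 and high_kb = 67068, a zone that starts out with congested = true is still congested after kswapd_pass(), which is the situation this theory relies on.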

Does my guess make sense?


--
Kind regards,
Minchan Kim

2011-05-29 18:28:54

by Minchan Kim

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Fri, May 27, 2011 at 8:58 AM, Minchan Kim <[email protected]> wrote:
> On Thu, May 26, 2011 at 5:17 AM, Andrew Lutomirski <[email protected]> wrote:
>> On Tue, May 24, 2011 at 8:43 PM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>>
>>> Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
>>> I suspect it's a DM issue. Can you please try putting swap outside of DM?
>>>
>>>
>>
>> I can do one better: I can tell you how to reproduce the OOM in the
>> comfort of your own VM without using dm_crypt or a Sandy Bridge
>> laptop.  This is on Fedora 15, but it really ought to work on any
>> x86_64 distribution that has kvm.  You'll probably want at least 6GB
>> on your host machine because the VM wants 4GB ram.
>>
>> Here's how:
>>
>> Step 1: Clone git://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug.git
>>
>> (You can browse here:)
>> https://gitorious.org/linux-test-utils/reproduce-annoying-mm-bug
>>
>> Instructions to reproduce the mm bug:
>>
>> Step 2: Build Linux v2.6.38.6 with config-2.6.38.6 and the patch
>> 0001-Minchan-patch-for-testing-23-05-2011.patch (both files are in the
>> git repo)
>>
>> Step 3: cd back to reproduce-annoying-mm-bug
>>
>> Step 4: Type this.
>>
>> $ make
>> $ qemu-kvm -m 4G -smp 2 -kernel <linux_dir>/arch/x86/boot/bzImage
>> -initrd initramfs.gz
>>
>> Step 5: Wait for the VM to boot (it's really fast) and then run ./repro_bug.sh.
>>
>> Step 6: Wait a bit and watch the fireworks.  Note that it can take a
>> couple minutes to reproduce the bug.
>>
>> Tested on my Sandy Bridge laptop and on a Xeon W3520.
>>
>> For whatever reason, on my laptop without the VM I can hit the bug
>> almost instantaneously.  Maybe it's because I'm using dm-crypt on my
>> laptop.
>>
>> --Andy
>>
>> P.S.  I think that the mk_trivial_initramfs.sh script is cute, and
>
> That's cool. :)
>
>> maybe I'll try to flesh it out and turn it into a real project some
>> day.
>>
>
> Thanks for good test environment.
> Yesterday, I tried to reproduce your problem in my system(4G DRAM) but
> unfortunately got failed. I tried various setting but I can't reach.
> Maybe I need 8G system or sandy-bridge.  :(
>
> Hi mm folks, It's next round.
> Andrew Lutomirski's first problem, kswapd hang problem was solved by
> recent Mel's series(!pgdat_balanced bug and shrink_slab cond_resched)
> which is key for James, Collins problem.
>
> Andrew's next problem is a early OOM kill.
>
> [   60.627550] cryptsetup invoked oom-killer: gfp_mask=0x201da,
> order=0, oom_adj=0, oom_score_adj=0
> [   60.627553] cryptsetup cpuset=/ mems_allowed=0
> [   60.627555] Pid: 1910, comm: cryptsetup Not tainted 2.6.38.6-no-fpu+ #47
> [   60.627556] Call Trace:
> [   60.627563]  [<ffffffff8107f9c5>] ? cpuset_print_task_mems_allowed+0x91/0x9c
> [   60.627567]  [<ffffffff810b3ef1>] ? dump_header+0x7f/0x1ba
> [   60.627570]  [<ffffffff8109e4d6>] ? trace_hardirqs_on+0x9/0x20
> [   60.627572]  [<ffffffff810b42ba>] ? oom_kill_process+0x50/0x24e
> [   60.627574]  [<ffffffff810b4961>] ? out_of_memory+0x2e4/0x359
> [   60.627576]  [<ffffffff810b879e>] ? __alloc_pages_nodemask+0x5f3/0x775
> [   60.627579]  [<ffffffff810e127e>] ? alloc_pages_current+0xbe/0xd8
> [   60.627581]  [<ffffffff810b2126>] ? __page_cache_alloc+0x77/0x7e
> [   60.627585]  [<ffffffff8135d009>] ? dm_table_unplug_all+0x52/0xed
> [   60.627587]  [<ffffffff810b9f74>] ? __do_page_cache_readahead+0x98/0x1a4
> [   60.627589]  [<ffffffff810ba321>] ? ra_submit+0x21/0x25
> [   60.627590]  [<ffffffff810ba4ee>] ? ondemand_readahead+0x1c9/0x1d8
> [   60.627592]  [<ffffffff810ba5dd>] ? page_cache_sync_readahead+0x3d/0x40
> [   60.627594]  [<ffffffff810b342d>] ? filemap_fault+0x119/0x36c
> [   60.627597]  [<ffffffff810caf5f>] ? __do_fault+0x56/0x342
> [   60.627600]  [<ffffffff810f5630>] ? lookup_page_cgroup+0x32/0x48
> [   60.627602]  [<ffffffff810cd437>] ? handle_pte_fault+0x29f/0x765
> [   60.627604]  [<ffffffff810ba75e>] ? add_page_to_lru_list+0x6e/0x73
> [   60.627606]  [<ffffffff810be487>] ? page_evictable+0x1b/0x8d
> [   60.627607]  [<ffffffff810bae36>] ? put_page+0x24/0x35
> [   60.627610]  [<ffffffff810cdbfc>] ? handle_mm_fault+0x18e/0x1a1
> [   60.627612]  [<ffffffff810cded2>] ? __get_user_pages+0x2c3/0x3ed
> [   60.627614]  [<ffffffff810cfb4b>] ? __mlock_vma_pages_range+0x67/0x6b
> [   60.627616]  [<ffffffff810cfc01>] ? do_mlock_pages+0xb2/0x11a
> [   60.627618]  [<ffffffff810d0448>] ? sys_mlockall+0x111/0x11c
> [   60.627621]  [<ffffffff81002a3b>] ? system_call_fastpath+0x16/0x1b
> [   60.627623] Mem-Info:
> [   60.627624] Node 0 DMA per-cpu:
> [   60.627626] CPU    0: hi:    0, btch:   1 usd:   0
> [   60.627627] CPU    1: hi:    0, btch:   1 usd:   0
> [   60.627628] CPU    2: hi:    0, btch:   1 usd:   0
> [   60.627629] CPU    3: hi:    0, btch:   1 usd:   0
> [   60.627630] Node 0 DMA32 per-cpu:
> [   60.627631] CPU    0: hi:  186, btch:  31 usd:   0
> [   60.627633] CPU    1: hi:  186, btch:  31 usd:   0
> [   60.627634] CPU    2: hi:  186, btch:  31 usd:   0
> [   60.627635] CPU    3: hi:  186, btch:  31 usd:   0
> [   60.627638] active_anon:51586 inactive_anon:17384 isolated_anon:0
> [   60.627639]  active_file:0 inactive_file:226 isolated_file:0
> [   60.627639]  unevictable:395661 dirty:0 writeback:3 unstable:0
> [   60.627640]  free:13258 slab_reclaimable:3979 slab_unreclaimable:9755
> [   60.627640]  mapped:11910 shmem:24046 pagetables:5062 bounce:0
> [   60.627642] Node 0 DMA free:8352kB min:340kB low:424kB high:508kB
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:952kB
> unevictable:6580kB isolated(anon):0kB isolated(file):0kB
> present:15676kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB
> shmem:0kB slab_reclaimable:16kB slab_unreclaimable:0kB
> kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:1645 all_unreclaimable? yes
> [   60.627649] lowmem_reserve[]: 0 2004 2004 2004
> [   60.627651] Node 0 DMA32 free:44680kB min:44712kB low:55888kB
> high:67068kB active_anon:206344kB inactive_anon:69536kB
> active_file:0kB inactive_file:0kB unevictable:1576064kB
> isolated(anon):0kB isolated(file):0kB present:2052320kB
> mlocked:47540kB dirty:0kB writeback:12kB mapped:47640kB shmem:96184kB
> slab_reclaimable:15900kB slab_unreclaimable:39020kB
> kernel_stack:2424kB pagetables:20248kB unstable:0kB bounce:0kB
> writeback_tmp:0kB pages_scanned:499225 all_unreclaimable? yes
> [   60.627658] lowmem_reserve[]: 0 0 0 0
> [   60.627660] Node 0 DMA: 0*4kB 0*8kB 2*16kB 2*32kB 1*64kB 2*128kB
> 1*256kB 1*512kB 1*1024kB 3*2048kB 0*4096kB = 8352kB
> [   60.627665] Node 0 DMA32: 959*4kB 2071*8kB 682*16kB 165*32kB
> 27*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 1*4096kB = 44980kB
> [   60.627670] 419957 total pagecache pages
> [   60.627671] 0 pages in swap cache
> [   60.627672] Swap cache stats: add 137, delete 137, find 0/0
> [   60.627673] Free swap  = 6290904kB
> [   60.627674] Total swap = 6291452kB
> [   60.632560] 524272 pages RAM
> [   60.632562] 9451 pages reserved
> [   60.632563] 45558 pages shared
> [   60.632564] 469944 pages non-shared
>
>
> There are about 270M anon  and lots of free swap space in system.
> Nonetheless, he saw the OOM. I think it doesn't make sense.
> As I look above log, he used swap as crypted device mapper and used 1.4G ramfs.
> Andy, Right?
>
> The thing I doubt firstly was a big ramfs.
> I think in reclaim, shrink_page_list will start to cull mlocked page.
> If there are so many ramfs pages and working set pages in LRU,
> reclaimer can't reclaim any page until It meet non-unevictable pages
> or non-working set page(!PG_referenced and !pte_young). His workload
> had lots of anon pages and ramfs pages. ramfs pages is unevictable
> page so that it would cull and anon pages are promoted very easily so
> that we can't reclaim it easily.
> It means zone->pages_scanned would be very high so after all,
> zone->all_unreclaimable would set.
> As I look above log, the number of lru in  DMA32 zone is 68970.
> The number of unevictable page is 394016.
>
> 394016 + working set page(I don't know) is almost equal to  (68970 * 6
> = 413820).
> So it's possible that zone->all_unreclaimable is set.
> I wanted to test below patch by private but it doesn't solve his problem.
> But I think we need below patch, still. It can happen if we had lots
> of LRU order successive mlocked page in LRU.
>
> ===
>
> From e37f150328aedeea9a88b6190ab2b6e6c1067163 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Wed, 25 May 2011 07:09:17 +0900
> Subject: [PATCH 3/3] vmscan: decrease pages_scanned on unevictable page
>
> If there are many unevictable pages on evictable LRU list(ex, big ramfs),
> shrink_page_list will move it into unevictable and can't reclaim pages.
> But we already increased zone->pages_scanned.
> If the situation is repeated, the number of evictable lru pages is decreased
> while zone->pages_scanned is increased without reclaim any pages.
> It could turn on zone->all_unreclaimable but it's totally false alram.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
>  mm/vmscan.c |   22 +++++++++++++++++++---
>  1 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 08d3077..a7df813 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -700,7 +700,8 @@ static noinline_for_stack void
> free_page_list(struct list_head *free_pages)
>  static unsigned long shrink_page_list(struct list_head *page_list,
>                                      struct zone *zone,
>                                      struct scan_control *sc,
> -                                     unsigned long *dirty_pages)
> +                                     unsigned long *dirty_pages,
> +                                     unsigned long *unevictable_pages)
>  {
>        LIST_HEAD(ret_pages);
>        LIST_HEAD(free_pages);
> @@ -708,6 +709,7 @@ static unsigned long shrink_page_list(struct
> list_head *page_list,
>        unsigned long nr_dirty = 0;
>        unsigned long nr_congested = 0;
>        unsigned long nr_reclaimed = 0;
> +       unsigned long nr_unevictable = 0;
>
>        cond_resched();
>
> @@ -908,6 +910,7 @@ cull_mlocked:
>                        try_to_free_swap(page);
>                unlock_page(page);
>                putback_lru_page(page);
> +               nr_unevictable++;
>                continue;
>
>  activate_locked:
> @@ -936,6 +939,7 @@ keep_lumpy:
>                zone_set_flag(zone, ZONE_CONGESTED);
>
>        *dirty_pages = nr_dirty;
> +       *unevictable_pages = nr_unevictable;
>        free_page_list(&free_pages);
>
>        list_splice(&ret_pages, page_list);
> @@ -1372,6 +1376,7 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
>        unsigned long nr_scanned;
>        unsigned long nr_reclaimed = 0;
>        unsigned long nr_dirty;
> +       unsigned long nr_unevictable;
>        unsigned long nr_taken;
>        unsigned long nr_anon;
>        unsigned long nr_file;
> @@ -1425,7 +1430,7 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
>        spin_unlock_irq(&zone->lru_lock);
>
>        reclaim_mode = sc->reclaim_mode;
> -       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +       nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty,
> &nr_unevictable);
>
>        /* Check if we should syncronously wait for writeback */
>        if ((nr_dirty && !(reclaim_mode & RECLAIM_MODE_SINGLE) &&
> @@ -1434,7 +1439,8 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
>                unsigned long nr_active = clear_active_flags(&page_list, NULL);
>                count_vm_events(PGDEACTIVATE, nr_active);
>                set_reclaim_mode(priority, sc, true);
> -               nr_reclaimed += shrink_page_list(&page_list, zone, sc, &nr_dirty);
> +               nr_reclaimed += shrink_page_list(&page_list, zone, sc,
> +                                               &nr_dirty, &nr_unevictable);
>        }
>
>        local_irq_disable();
> @@ -1442,6 +1448,16 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
>                __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
>        __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);
>
> +       /*
> +        * Too many unevictalbe pages on evictable LRU list(ex, big ramfs)
> +        * can make high zone->pages_scanned and reduce the number of lru page
> +        * on evictable lru as reclaim is going on.
> +        * It could turn on all_unreclaimable which is false alarm.
> +        */
> +       spin_lock(&zone->lru_lock);
> +       if (zone->pages_scanned >= nr_unevictable)
> +               zone->pages_scanned -= nr_unevictable;
> +       else
> +               zone->pages_scanned = 0;
> +       spin_unlock(&zone->lru_lock);
> +
>        putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
>
>        trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
> --
> 1.7.1
>
> ===
>
> Then, what I doubt secondly is zone_set_flag(zone, ZONE_CONGESTED).
> He used swap as crypted device mapper.
> Device mapper could make IO slow for his work. It means we are likely
> to meet ZONE_CONGESTED higher than normal swap.
>
> Let's think about it.
> Swap device is very congested so shrink_page_list would set the zone
> as CONGESTED.
> Who is clear ZONE_CONGESTED? There are two place in  kswapd.
> One work in only order > 0. So maybe, it's no-op in Andy's
> workload.(ie, it's mostly order-0 allocation)
> One remained is below.
>
>                                 * If a zone reaches its high watermark,
>                                 * consider it to be no longer congested. It's
>                                 * possible there are dirty pages backed by
>                                 * congested BDIs but as pressure is relieved,
>                                 * spectulatively avoid congestion waits
>                                 */
>                                zone_clear_flag(zone, ZONE_CONGESTED);
>                                if (i <= *classzone_idx)
>                                        balanced += zone->present_pages;
>
> It works only if the zone meets high watermark. If allocation is
> faster than reclaim(ie, it's true for slow swap device), the zone
> would remain congested.
> It means swapout would block.
> As we see the OOM log, we can know that DMA32 zone can't meet high watermark.
>
> Does my guessing make sense?

Hi Andrew.
I failed to reproduce your scenario on my machine, so would you be willing
to test this patch to check my scenario above?
The patch is just a revert of 0e093d99 [do not sleep on the
congestion queue...] for 2.6.38.6.
I would like to use it to test my zone congestion theory above.

I based it on 2.6.38.6 to make it easy for you to apply, so it should
apply cleanly on vanilla v2.6.38.6.
You also have to add the !pgdat_balanced and shrink_slab patches.

Thanks, Andrew.

--
Kind regards,
Minchan Kim


Attachments:
0001-Revert-writeback-do-not-sleep-on-the-congestion-queu.patch (10.09 kB)

2011-05-30 00:29:08

by Andrew Lutomirski

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Sun, May 29, 2011 at 2:28 PM, Minchan Kim <[email protected]> wrote:
>>
>> It works only if the zone meets high watermark. If allocation is
>> faster than reclaim(ie, it's true for slow swap device), the zone
>> would remain congested.
>> It means swapout would block.
>> As we see the OOM log, we can know that DMA32 zone can't meet high watermark.
>>
>> Does my guessing make sense?
>
> Hi Andrew.
> I got failed your scenario in my machine so could you be willing to
> test this patch for proving my above scenario?
> The patch is just revert patch of 0e093d99[do not sleep on the
> congestion queue...] for 2.6.38.6.
> I would like to test it for proving my above zone congestion scenario.
>
> I did it based on 2.6.38.6 for your easy apply so you must apply it
> cleanly on vanilla v2.6.38.6.
> And you have to add !pgdat_balanced and shrink_slab patch.

No, because my laptop just decided that it doesn't like to turn on. :(

I'll test it on my VM on Tuesday and (fingers crossed) on my repaired
laptop next weekend.

--Andy

2011-06-14 10:11:12

by Johannes Weiner

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Sun, May 29, 2011 at 08:28:46PM -0400, Andrew Lutomirski wrote:
> On Sun, May 29, 2011 at 2:28 PM, Minchan Kim <[email protected]> wrote:
> >>
> >> It works only if the zone meets high watermark. If allocation is
> >> faster than reclaim(ie, it's true for slow swap device), the zone
> >> would remain congested.
> >> It means swapout would block.
> >> As we see the OOM log, we can know that DMA32 zone can't meet high watermark.
> >>
> >> Does my guessing make sense?
> >
> > Hi Andrew.
> > I got failed your scenario in my machine so could you be willing to
> > test this patch for proving my above scenario?
> > The patch is just revert patch of 0e093d99[do not sleep on the
> > congestion queue...] for 2.6.38.6.
> > I would like to test it for proving my above zone congestion scenario.
> >
> > I did it based on 2.6.38.6 for your easy apply so you must apply it
> > cleanly on vanilla v2.6.38.6.
> > And you have to add !pgdat_balanced and shrink_slab patch.
>
> No, because my laptop just decided that it doesn't like to turn on. :(
>
> I'll test it on my VM on Tuesday and (fingers crossed) on my repaired
> laptop next weekend.

Any updates on this?

2011-06-14 12:33:10

by Andrew Lutomirski

Subject: Re: Easy portable testcase! (Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux))

On Tue, Jun 14, 2011 at 6:10 AM, Johannes Weiner <[email protected]> wrote:
> On Sun, May 29, 2011 at 08:28:46PM -0400, Andrew Lutomirski wrote:
>> On Sun, May 29, 2011 at 2:28 PM, Minchan Kim <[email protected]> wrote:
>> >>
>> >> It works only if the zone meets high watermark. If allocation is
>> >> faster than reclaim(ie, it's true for slow swap device), the zone
>> >> would remain congested.
>> >> It means swapout would block.
>> >> As we see the OOM log, we can know that DMA32 zone can't meet high watermark.
>> >>
>> >> Does my guessing make sense?
>> >
>> > Hi Andrew.
>> > I got failed your scenario in my machine so could you be willing to
>> > test this patch for proving my above scenario?
>> > The patch is just revert patch of 0e093d99[do not sleep on the
>> > congestion queue...] for 2.6.38.6.
>> > I would like to test it for proving my above zone congestion scenario.
>> >
>> > I did it based on 2.6.38.6 for your easy apply so you must apply it
>> > cleanly on vanilla v2.6.38.6.
>> > And you have to add !pgdat_balanced and shrink_slab patch.
>>
>> No, because my laptop just decided that it doesn't like to turn on. :(
>>
>> I'll test it on my VM on Tuesday and (fingers crossed) on my repaired
>> laptop next weekend.
>
> Any updates on this?
>

Sorry, got distracted by writing my thesis.

This patch (Revert "writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") does not fix the problem; if
anything, the bug triggers more easily with the patch (at least in KVM).

--Andy