2011-05-11 22:43:05

by Andrew Lutomirski

[permalink] [raw]
Subject: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

For the last few days (since moving my disk to a new laptop), my
system has been hanging, usually unrecoverably, under light memory
pressure. When this happens, I usually see soft lockups and no OOM
kill. Mouse and keyboard input stop working. Sometimes I can switch
VTs; sometimes I can't. If I just wait it out, sometimes the system
comes back after a couple of minutes but usually even ten minutes or
so isn't enough. If I force an OOM kill (Alt-SysRq-F), my system
sometimes recovers. I've attached the dmesg from when that happened
(in that case the freeze was triggered by linking a kernel and the OOM
killer killed ld.)

I can trigger it about half of the time by building a kernel (it
usually dies while linking or doing the .tmp_* stuff) and 100% of the
time by running the attached script with parameters "1500 1400 1".
The script creates a 1500M file on a ramfs, sets up dm-crypt over
loopback on that file, formats it as ext4, and mounts it, then starts
writing a 1400M file over and over on the ext4 partition.
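
For reference, the write half of the script boils down to something like
this (a minimal C sketch, not the attached test_mempressure.sh, which does
the ramfs/dm-crypt/ext4 setup in shell; the mount point and chunk size are
illustrative assumptions):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t chunk = 1 << 20;           /* 1 MiB per write() */
        const size_t total_mb = 1400;           /* the "1400" parameter */
        char *buf = malloc(chunk);

        if (!buf)
                return 1;
        memset(buf, 0xab, chunk);

        for (;;) {                              /* "over and over" */
                int fd = open("/mnt/test/bigfile",
                              O_WRONLY | O_CREAT | O_TRUNC, 0644);
                if (fd < 0)
                        return 1;
                for (size_t i = 0; i < total_mb; i++)
                        if (write(fd, buf, chunk) != (ssize_t)chunk)
                                break;
                close(fd);
        }
}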

I cannot trigger the problem by running the same script on a different
machine (with 8 GB RAM) with parameters 6000 5500 1. I can't trigger
it on this machine from initramfs (same kernel image) or from
systemd's emergency shell. I can trigger it some of the time from
systemd's rescue shell (which has a little bit more stuff running).
The problem seems about equally prevalent with AHCI or compatibility
mode and with aesni-intel enabled and disabled. (aesni-intel causes
cryptd to get pulled in, so I thought that might be the issue.)

I can sometimes (but not always) trigger this by enabling swap and
running dirty_ram 2048 (attached). (One time it took the system down
completely. I have ~8 GB of swap, all of which was empty when I ran
the program.)
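
For reference, dirty_ram does essentially this (a minimal C sketch in the
spirit of the attached dirty_ram.cc, not its actual source): allocate the
requested number of MiB and dirty every page.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
        size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 2048;
        size_t len = mb << 20;
        char *p = malloc(len);

        if (!p) {
                perror("malloc");
                return 1;
        }
        memset(p, 0x5a, len);           /* dirty every page */
        printf("dirtied %zu MiB\n", mb);
        return 0;
}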

I see this problem on 2.6.38.{5,6}, 2.6.39-<something from today>, and
Fedora 15's kernel, so I doubt it's an oddity of my kernel config.

I also had this problem while running Fedora 15's installer to upgrade
from Fedora 14 to 15, which rules out a lot of weird userspace issues.

This box is a Lenovo X220 Sandy Bridge laptop with 2G of RAM (the old
box had more) and runs ext4 on LVM on dm-crypt on an SSD. I see the
problem with and without a swap partition. I've also tried unloading
most drivers and the test still fails. Memtest passes.

If I had to guess, I'd say that the VM gets confused when it's forced
to write data out to my LVM-over-dm-crypt partition and either starts
OOM-killing things when it's not out of memory or deadlocks because it
runs out of available RAM and can't service new dm-crypt and block
requests.

Please help fix/debug this. It's making my shiny new laptop almost useless.

--Andy


Attachments:
successful-oom-kill.txt (86.14 kB)
test_mempressure.sh (1.95 kB)
OOM-with-lots-of-swap.txt (33.86 kB)
dirty_ram.cc (583.00 B)

2011-05-11 23:08:04

by Andi Kleen

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

Andrew Lutomirski <[email protected]> writes:
>
> I can sometimes (but not always) trigger this by enabling swap and
> running dirty_ram 2048 (attached). (One time it took the system down
> completely. I have ~8 GB of swap, all of which was empty when I ran

Never configure that much swap (> 1*RAM). It will just make any OOM more
painful because it'll thrash forever. If you're 4x overcommitted
no workload will be happy.

> This box is a Lenovo X220 Sandy Bridge laptop with 2G of RAM (the old
> box had more) and runs ext4 on LVM on dm-crypt on an SSD. I see the

FWIW I had problems with swapping over dmcrypt for a long time -- not
quite as severe as yours. I never really tracked it down.

But I suspect just not doing the swap over dmcrypt would make
it a lot more usable.

> If I had to guess, I'd say that the VM gets confused when it's forced
> to write data out to my LVM-over-dm-crypt partition and either starts
> OOM-killing things when it's not out of memory or deadlocks because it
> runs out of available RAM and can't service new dm-crypt and block
> requests.
>
> Please help fix/debug this. It's making my shiny new laptop almost useless.

I would add some tracing to the dmcrypt paths and then log
it over the network during the problem. Most likely some part
of it stalls or tries to allocate more memory.

-Andi

--
[email protected] -- Speaking for myself only

2011-05-11 23:29:08

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, May 11, 2011 at 7:07 PM, Andi Kleen <[email protected]> wrote:
> Andrew Lutomirski <[email protected]> writes:
>>
>> I can sometimes (but not always) trigger this by enabling swap and
>> running dirty_ram 2048 (attached). (One time it took the system down
>> completely. I have ~8 GB of swap, all of which was empty when I ran
>
> Never configure that much swap (> 1*RAM). It will just make any OOM more
> painful because it'll thrash forever. If you're 4x overcommited
> no workload will be happy.

Agreed. But I only need to overcommit by a little to get it to crash.

>
>> This box is a Lenovo X220 Sandy Bridge laptop with 2G of RAM (the old
>> box had more) and runs ext4 on LVM on dm-crypt on an SSD. I see the
>
> FWIW i had problems in swapping over dmcrypt for a long time -- not
> quite as severe as you. Never really tracked it down.
>
> But I suspect just not doing the swap over dmcrypt would make
> it a lot more usable.

Maybe. But I can get it to crash just fine without any swap at all,
which I think ought to be the most stable configuration.

>
>> If I had to guess, I'd say that the VM gets confused when it's forced
>> to write data out to my LVM-over-dm-crypt partition and either starts
>> OOM-killing things when it's not out of memory or deadlocks because it
>> runs out of available RAM and can't service new dm-crypt and block
>> requests.
>>
>> Please help fix/debug this. It's making my shiny new laptop almost useless.
>
> I would add some tracing to the dmcrypt paths and then log
> it over the network during the problem. Most likely some part
> of it stalls or tries to allocate more memory.

Yep, that's next. I just added some instrumentation in mempool_alloc
to warn if it can't satisfy an allocation for five seconds and it
didn't trigger. Most of the dm-crypt allocations I could find go
through mempool, so I think they're ruled out.
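
Roughly, the instrumentation had this shape (a sketch of the idea, not the
exact change; the helper name is made up):

#include <linux/bug.h>
#include <linux/jiffies.h>

/* 'start' is jiffies sampled on entry to mempool_alloc() in mm/mempool.c;
 * this check sits in the retry loop before it goes back to sleep. */
static inline void mempool_warn_if_stuck(unsigned long start)
{
        WARN_ONCE(time_after(jiffies, start + 5 * HZ),
                  "mempool_alloc stuck for more than 5 seconds\n");
}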

Do softlockups in kswapd0 mean anything? I think I can rule out a
traditional vm deadlock, because the machine is currently stuck with
tons of things hitting the softlockup warning but with 809M of DMA32
space free (as well as 8M DMA and 16kB normal).

Here's a nice picture of alt-sysrq-m with lots of memory free but the
system mostly hung. I can still switch VTs.

http://web.mit.edu/luto/www/meminfo.jpg

alt-sysrq-j to thaw filesystems caused the system to start printing
"Emergency Thaw on dm-2" in an infinite loop. Time to power off and
go home...

--Andy

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
>

2011-05-12 05:46:37

by Andi Kleen

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> Here's a nice picture of alt-sysrq-m with lots of memory free but the
> system mostly hung. I can still switch VTs.

Would rather need backtraces. Try setting up netconsole or crashdump
first.

-Andi

--
[email protected] -- Speaking for myself only.

2011-05-12 11:54:50

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 12, 2011 at 1:46 AM, Andi Kleen <[email protected]> wrote:
>> Here's a nice picture of alt-sysrq-m with lots of memory free but the
>> system mostly hung. I can still switch VTs.
>
> Would rather need backtraces. Try setting up netconsole or crashdump
> first.

Here are some logs for two different failure modes.

incorrect_oom_kill.txt is an OOM kill when there was lots of available
swap to use. AFAICT the kernel should not have OOM killed at all.

stuck_xyz is when the system is wedged with plenty (~300MB) free
memory but no swap. The sysrq files are self-explanatory.
stuck-sysrq-f.txt is after the others so that it won't have corrupted
the output. After taking all that data, I waited a while and started
getting soft lockup messages.

I'm having trouble reproducing the "stuck" failure mode on my
lockdep-enabled kernel right now (the OOM kill is easy), so no lock
state trace. But I got one yesterday and IIRC it showed a few tty
locks and either kworker or kcryptd holding (kqueue) and
((&io->work)).

I compressed the larger files.

--Andy

>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.
>


Attachments:
stuck-sysrq-m.txt (3.27 kB)
incorrect_oom_kill.txt.xz (24.52 kB)
stuck-sysrq-t.txt.xz (30.28 kB)
stuck-sysrq-w.txt (28.14 kB)
stuck-sysrq-f.txt (9.54 kB)
stuck-softlockup.txt (25.08 kB)

2011-05-14 15:46:21

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

[cc linux-mm]

On Thu, May 12, 2011 at 7:54 AM, Andrew Lutomirski <[email protected]> wrote:
> On Thu, May 12, 2011 at 1:46 AM, Andi Kleen <[email protected]> wrote:
>>> Here's a nice picture of alt-sysrq-m with lots of memory free but the
>>> system mostly hung. I can still switch VTs.
>>
>> Would rather need backtraces. Try setting up netconsole or crashdump
>> first.
>
> Here are some logs for two different failure modes.
>
> incorrect_oom_kill.txt is an OOM kill when there was lots of available
> swap to use. AFAICT the kernel should not have OOM killed at all.
>
> stuck_xyz is when the system is wedged with plenty (~300MB) free
> memory but no swap. The sysrq files are self-explanatory.
> stuck-sysrq-f.txt is after the others so that it won't have corrupted
> the output. After taking all that data, I waited a while and started
> getting soft lockup messages.
>
> I'm having trouble reproducing the "stuck" failure mode on my
> lockdep-enabled kernel right now (the OOM kill is easy), so no lock
> state trace. But I got one yesterday and IIRC it showed a few tty
> locks and either kworker or kcryptd holding (kqueue) and
> ((&io->work)).
>
> I compressed the larger files.
>
> --Andy

2011-05-14 16:53:49

by Andi Kleen

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> > Here are some logs for two different failure modes.
> >
> > incorrect_oom_kill.txt is an OOM kill when there was lots of available
> > swap to use. AFAICT the kernel should not have OOM killed at all.
> >
> > stuck_xyz is when the system is wedged with plenty (~300MB) free
> > memory but no swap. The sysrq files are self-explanatory.
> > stuck-sysrq-f.txt is after the others so that it won't have corrupted
> > the output. After taking all that data, I waited a while and started
> > getting soft lockup messages.
> >
> > I'm having trouble reproducing the "stuck" failure mode on my
> > lockdep-enabled kernel right now (the OOM kill is easy), so no lock
> > state trace. But I got one yesterday and IIRC it showed a few tty
> > locks and either kworker or kcryptd holding (kqueue) and
> > ((&io->work)).
> >
> > I compressed the larger files.

One quick observation is that pretty much all the OOMed allocations
in your log are in readahead (swap and VM). Perhaps we should throttle
readahead when the system is under high memory pressure?

(copying Fengguang)

One theory on why it could happen more often with dm_crypt is that
dm_crypt increases the latency, so more IO will be in flight.

Another thing is that the dmcrypt IOs will likely do their own
readahead, so you may end up with multiplied readahead
from several levels. Perhaps we should disable RA for the low level
encrypted dmcrypt IOs?

One thing I would try is to disable readahead like in this patch
and see if it solves the problem.

Subject: [PATCH] disable swap and VM readahead

diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..1f41b4f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1525,6 +1525,8 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 	unsigned long ra_pages;
 	struct address_space *mapping = file->f_mapping;
 
+	return;
+
 	/* If we don't want any read-ahead, don't bother */
 	if (VM_RandomReadHint(vma))
 		return;
diff --git a/mm/readahead.c b/mm/readahead.c
index 2c0cc48..85e5b8d 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -504,6 +504,8 @@ void page_cache_sync_readahead(struct address_space *mapping,
 			       struct file_ra_state *ra, struct file *filp,
 			       pgoff_t offset, unsigned long req_size)
 {
+	return;
+
 	/* no read-ahead */
 	if (!ra->ra_pages)
 		return;
@@ -540,6 +542,8 @@ page_cache_async_readahead(struct address_space *mapping,
 			   struct page *page, pgoff_t offset,
 			   unsigned long req_size)
 {
+	return;
+
 	/* no read-ahead */
 	if (!ra->ra_pages)
 		return;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4668046..37c2f2f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -386,6 +386,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	 * more likely that neighbouring swap pages came from the same node:
 	 * so use the same "addr" to choose the same node for each swap read.
 	 */
+#if 0
 	nr_pages = valid_swaphandles(entry, &offset);
 	for (end_offset = offset + nr_pages; offset < end_offset; offset++) {
 		/* Ok, do the async read-ahead now */
@@ -395,6 +396,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 			break;
 		page_cache_release(page);
 	}
+#endif
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 	return read_swap_cache_async(entry, gfp_mask, vma, addr);
 }



-Andi

example:

[ 524.814816] Out of memory: Kill process 867 (gpm) score 1 or sacrifice child
[ 524.815782] Killed process 867 (gpm) total-vm:6832kB, anon-rss:0kB, file-rss:0kB
[ 525.006050] systemd-cgroups invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[ 525.007089] systemd-cgroups cpuset=/ mems_allowed=0
[ 525.008119] Pid: 2167, comm: systemd-cgroups Not tainted 2.6.38.6-no-fpu+ #6
[ 525.009168] Call Trace:
[ 525.010210] [<ffffffff8147b722>] ? _raw_spin_unlock+0x28/0x2c
[ 525.011276] [<ffffffff810c75d5>] ? dump_header+0x84/0x256
[ 525.012346] [<ffffffff8107531b>] ? trace_hardirqs_on+0xd/0xf
[ 525.013423] [<ffffffff8121a8b0>] ? ___ratelimit+0xe0/0xf0
[ 525.014491] [<ffffffff810c7a20>] ? oom_kill_process+0x50/0x244
[ 525.015575] [<ffffffff810c80ef>] ? out_of_memory+0x2eb/0x367
[ 525.016657] [<ffffffff810cc08b>] ? __alloc_pages_nodemask+0x606/0x78b
[ 525.017748] [<ffffffff810f5979>] ? alloc_pages_current+0xbe/0xd6
[ 525.018844] [<ffffffff810c56fb>] ? __page_cache_alloc+0x7e/0x85
[ 525.019940] [<ffffffff810cda40>] ? __do_page_cache_readahead+0xb5/0x1cb
[ 525.021028] [<ffffffff810cddfa>] ? ra_submit+0x21/0x25

2011-05-15 15:27:50

by Fengguang Wu

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
> > Copying back linux-mm.
> >
> >> Recently, we added following patch.
> >> https://lkml.org/lkml/2011/4/26/129
> >> If it's a culprit, the patch should solve the problem.
> >
> > It would be probably better to not do the allocations at all under
> > memory pressure.  Even if the RA allocation doesn't go into reclaim
>
> Fair enough.
> I think we can do it easily now.
> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
> RA window size or turn off a while. The point is that we can use the
> fail of __do_page_cache_readahead as sign of memory pressure.
> Wu, What do you think?

No, disabling readahead can hardly help.

The sequential readahead memory consumption can be estimated by

2 * (number of concurrent read streams) * (readahead window size)

And you can double that when there are two levels of readahead.

Since there are hardly any concurrent read streams in Andy's case,
the readahead memory consumption will be ignorable.
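
For concreteness, a toy calculation with assumed numbers (the default
128 KiB readahead window, one read stream, two readahead levels; these are
illustrative, not measurements from Andy's machine):

#include <stdio.h>

int main(void)
{
        unsigned long window_kb = 128;  /* assumed default readahead window */
        unsigned long streams = 1;      /* roughly one sequential stream here */
        unsigned long levels = 2;       /* fs-level plus dm-crypt-level RA */

        printf("~%lu KiB\n", 2 * streams * window_kb * levels);
        return 0;
}

That is about half a megabyte, negligible against 2 GB of RAM.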

Typically readahead thrashing will happen long before excessive
GFP_NORETRY failures, so the reasonable solutions are to

- shrink readahead window on readahead thrashing
(current readahead heuristic can somehow do this, and I have patches
to further improve it)

- prevent abnormal GFP_NORETRY failures
(when there are many reclaimable pages)


Andy's OOM memory dump (incorrect_oom_kill.txt.xz) shows that there are

- 8MB active+inactive file pages
- 160MB active+inactive anon pages
- 1GB shmem pages
- 1.4GB unevictable pages

Hmm, why are there so many unevictable pages? How come the shmem
pages became unevictable when there is plenty of swap space?

Thanks,
Fengguang

2011-05-15 15:59:35

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
> On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
>> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
>> > Copying back linux-mm.
>> >
>> >> Recently, we added following patch.
>> >> https://lkml.org/lkml/2011/4/26/129
>> >> If it's a culprit, the patch should solve the problem.
>> >
>> > It would be probably better to not do the allocations at all under
> memory pressure. Even if the RA allocation doesn't go into reclaim
>>
>> Fair enough.
>> I think we can do it easily now.
>> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
>> RA window size or turn off a while. The point is that we can use the
>> fail of __do_page_cache_readahead as sign of memory pressure.
>> Wu, What do you think?
>
> No, disabling readahead can hardly help.
>
> The sequential readahead memory consumption can be estimated by
>
>                2 * (number of concurrent read streams) * (readahead window size)
>
> And you can double that when there are two level of readaheads.
>
> Since there are hardly any concurrent read streams in Andy's case,
> the readahead memory consumption will be ignorable.
>
> Typically readahead thrashing will happen long before excessive
> GFP_NORETRY failures, so the reasonable solutions are to
>
> - shrink readahead window on readahead thrashing
>  (current readahead heuristic can somehow do this, and I have patches
>  to further improve it)
>
> - prevent abnormal GFP_NORETRY failures
>  (when there are many reclaimable pages)
>
>
> Andy's OOM memory dump (incorrect_oom_kill.txt.xz) shows that there are
>
> - 8MB   active+inactive file pages
> - 160MB active+inactive anon pages
> - 1GB   shmem pages
> - 1.4GB unevictable pages
>
> Hmm, why are there so many unevictable pages?  How come the shmem
> pages become unevictable when there are plenty of swap space?

I have no clue, but this patch (from Minchan, whitespace-damaged) seems to help:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t
*pgdat, int order, long remaining,
unsigned long balanced = 0;
bool all_zones_ok = true;

+ /* If kswapd has been running too long, just sleep */
+ if (need_resched())
+ return false;
+
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return true;
@@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
*pgdat, int order, long remaining,
* must be balanced
*/
if (order)
- return pgdat_balanced(pgdat, balanced, classzone_idx);
+ return !pgdat_balanced(pgdat, balanced, classzone_idx);
else
return !all_zones_ok;
}

I haven't tested it very thoroughly, but it's survived much longer
than an unpatched kernel probably would have under moderate use.

I have no idea what the patch does :)

I'm happy to run any tests. I'm also planning to upgrade from 2GB to
8GB RAM soon, which might change something.

--Andy

2011-05-15 16:13:01

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
> On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
>> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
>> > Copying back linux-mm.
>> >
>> >> Recently, we added following patch.
>> >> https://lkml.org/lkml/2011/4/26/129
>> >> If it's a culprit, the patch should solve the problem.
>> >
>> > It would be probably better to not do the allocations at all under
>> > memory pressure. Even if the RA allocation doesn't go into reclaim
>>
>> Fair enough.
>> I think we can do it easily now.
>> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
>> RA window size or turn off a while. The point is that we can use the
>> fail of __do_page_cache_readahead as sign of memory pressure.
>> Wu, What do you think?
>
> No, disabling readahead can hardly help.
>
> The sequential readahead memory consumption can be estimated by
>
>                2 * (number of concurrent read streams) * (readahead window size)
>
> And you can double that when there are two level of readaheads.
>
> Since there are hardly any concurrent read streams in Andy's case,
> the readahead memory consumption will be ignorable.
>
> Typically readahead thrashing will happen long before excessive
> GFP_NORETRY failures, so the reasonable solutions are to
>
> - shrink readahead window on readahead thrashing
>  (current readahead heuristic can somehow do this, and I have patches
>  to further improve it)
>
> - prevent abnormal GFP_NORETRY failures
>  (when there are many reclaimable pages)
>
>
> Andy's OOM memory dump (incorrect_oom_kill.txt.xz) shows that there are
>
> - 8MB   active+inactive file pages
> - 160MB active+inactive anon pages
> - 1GB   shmem pages
> - 1.4GB unevictable pages
>
> Hmm, why are there so many unevictable pages?  How come the shmem
> pages become unevictable when there are plenty of swap space?

That was probably because one of my testcases creates a 1.4GB file on
ramfs. (I can provoke the problem without doing evil things like
that, but the test script is rather reliable at killing my system and
it works fine on my other machines.)

If you want, I can try to generate a trace that isn't polluted with
the evil ramfs file.

--Andy

2011-05-15 22:40:44

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 16, 2011 at 12:27 AM, Wu Fengguang <[email protected]> wrote:
> On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
>> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
>> > Copying back linux-mm.
>> >
>> >> Recently, we added following patch.
>> >> https://lkml.org/lkml/2011/4/26/129
>> >> If it's a culprit, the patch should solve the problem.
>> >
>> > It would be probably better to not do the allocations at all under
>> > memory pressure.  Even if the RA allocation doesn't go into reclaim
>>
>> Fair enough.
>> I think we can do it easily now.
>> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
>> RA window size or turn off a while. The point is that we can use the
>> fail of __do_page_cache_readahead as sign of memory pressure.
>> Wu, What do you think?
>
> No, disabling readahead can hardly help.

I don't mean we have to disable RA.
As I said, the point is that we can use a __GFP_NORETRY allocation failure
as a _sign_ of memory pressure.

>
> The sequential readahead memory consumption can be estimated by
>
>                2 * (number of concurrent read streams) * (readahead window size)
>
> And you can double that when there are two level of readaheads.
>
> Since there are hardly any concurrent read streams in Andy's case,
> the readahead memory consumption will be ignorable.
>
> Typically readahead thrashing will happen long before excessive
> GFP_NORETRY failures, so the reasonable solutions are to

If so, RA thrashing could be a better sign than a __GFP_NORETRY failure.
If we can do it easily, I don't object. :)

>
> - shrink readahead window on readahead thrashing
>  (current readahead heuristic can somehow do this, and I have patches
>  to further improve it)

Good to hear. :)
I don't want RA to steal high-order pages under memory pressure.
My patch and shrinking the RA window help in this case.

--
Kind regards,
Minchan Kim

2011-05-15 22:58:04

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 16, 2011 at 12:59 AM, Andrew Lutomirski <[email protected]> wrote:
> I have no clue, but this patch (from Minchan, whitespace-damaged) seems to help:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
>       unsigned long balanced = 0;
>       bool all_zones_ok = true;
>
> +       /* If kswapd has been running too long, just sleep */
> +       if (need_resched())
> +               return false;
> +
>       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>       if (remaining)
>               return true;
> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
>        * must be balanced
>        */
>       if (order)
> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>       else
>               return !all_zones_ok;
>  }
>
> I haven't tested it very thoroughly, but it's survived much longer
> than an unpatched kernel probably would have under moderate use.
>
> I have no idea what the patch does :)

The reason I sent this is that I think your problem is similar to
James's recent one.
https://lkml.org/lkml/2011/4/27/361

What the patch does is [1] fix the "wrong pgdat_balanced return value"
bug and [2] fix the "infinite kswapd bug on non-preemption kernels" for
high-order pages.

About [1], kswapd has to sleep once zone balancing is complete, but in
1741c877 [mm: kswapd: keep kswapd awake for high-order allocations
until a percentage of the node is balanced] we made a mistake and
return the wrong value.
Then, although zone balancing is complete, kswapd doesn't sleep and
calls balance_pgdat again. In this case, balance_pgdat returns without
doing any work and kswapd could repeat this loop infinitely.
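
To illustrate [1] with a toy userspace example (plain C, not the kernel
code; the bool parameter just stands in for the real zone checks):
sleeping_prematurely() answers "is it premature for kswapd to sleep?", so
once the node is balanced it must return false.

#include <stdbool.h>
#include <stdio.h>

static bool pgdat_balanced(bool node_is_balanced)
{
        return node_is_balanced;        /* stand-in for the real check */
}

/* buggy: "premature to sleep" == "node is balanced", which is backwards */
static bool sleeping_prematurely_buggy(bool balanced)
{
        return pgdat_balanced(balanced);
}

/* fixed: it is only premature to sleep while the node is NOT balanced */
static bool sleeping_prematurely_fixed(bool balanced)
{
        return !pgdat_balanced(balanced);
}

int main(void)
{
        bool balanced = true;           /* zone balancing has completed */

        printf("buggy: premature=%d -> kswapd keeps running\n",
               sleeping_prematurely_buggy(balanced));
        printf("fixed: premature=%d -> kswapd goes to sleep\n",
               sleeping_prematurely_fixed(balanced));
        return 0;
}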


>
> I'm happy to run any tests.  I'm also planning to upgrade from 2GB to
> 8GB RAM soon, which might change something.
>
> --Andy
>



--
Kind regards,
Minchan Kim

2011-05-16 08:51:27

by Mel Gorman

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 16, 2011 at 07:58:01AM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 12:59 AM, Andrew Lutomirski <[email protected]> wrote:
> > I have no clue, but this patch (from Minchan, whitespace-damaged) seems to help:
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f6b435c..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t
> > *pgdat, int order, long remaining,
> >       unsigned long balanced = 0;
> >       bool all_zones_ok = true;
> >
> > +       /* If kswapd has been running too long, just sleep */
> > +       if (need_resched())
> > +               return false;
> > +
> >       /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >       if (remaining)
> >               return true;
> > @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
> > *pgdat, int order, long remaining,
> >        * must be balanced
> >        */
> >       if (order)
> > -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> > +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
> >       else
> >               return !all_zones_ok;
> >  }
> >
> > I haven't tested it very thoroughly, but it's survived much longer
> > than an unpatched kernel probably would have under moderate use.
> >
> > I have no idea what the patch does :)
>
> The reason I sent this is that I think your problem is similar to
> recent Jame's one.
> https://lkml.org/lkml/2011/4/27/361
>
> What the patch does is [1] fix of "wrong pgdat_balanced return value"
> bug and [2] fix of "infinite kswapd bug of non-preemption kernel" on
> high-order page.
>

If it turns out the patch works (which is patches 1 and 4 from the
series related to James) for more than one tester, I'll push it
separately and drop the SLUB changes.

--
Mel Gorman
SUSE Labs

2011-05-17 05:52:10

by Fengguang Wu

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 16, 2011 at 07:40:42AM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 12:27 AM, Wu Fengguang <[email protected]> wrote:
> > On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
> >> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
> >> > Copying back linux-mm.
> >> >
> >> >> Recently, we added following patch.
> >> >> https://lkml.org/lkml/2011/4/26/129
> >> >> If it's a culprit, the patch should solve the problem.
> >> >
> >> > It would be probably better to not do the allocations at all under
> >> > memory pressure.  Even if the RA allocation doesn't go into reclaim
> >>
> >> Fair enough.
> >> I think we can do it easily now.
> >> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
> >> RA window size or turn off a while. The point is that we can use the
> >> fail of __do_page_cache_readahead as sign of memory pressure.
> >> Wu, What do you think?
> >
> > No, disabling readahead can hardly help.
>
> I don't mean we have to disable RA.
> As I said, the point is that we can use __GFP_NORETRY alloc fail as
> _sign_ of memory pressure.

I see.

> >
> > The sequential readahead memory consumption can be estimated by
> >
> >                2 * (number of concurrent read streams) * (readahead window size)
> >
> > And you can double that when there are two level of readaheads.
> >
> > Since there are hardly any concurrent read streams in Andy's case,
> > the readahead memory consumption will be ignorable.
> >
> > Typically readahead thrashing will happen long before excessive
> > GFP_NORETRY failures, so the reasonable solutions are to
>
> If it is, RA thrashing could be better sign than failure of __GFP_NORETRY.
> If we can do it easily, I don't object it. :)

Yeah, RA thrashing is a much better sign because it not only happens
long before normal __GFP_NORETRY failures, but also offers a hint of how
tight the memory pressure is. We can then shrink the readahead window
adaptively to the available page cache memory :)

> >
> > - shrink readahead window on readahead thrashing
> >  (current readahead heuristic can somehow do this, and I have patches
> >  to further improve it)
>
> Good to hear. :)
> I don't want RA steals high order page in memory pressure.

More often than not it won't be RA's fault :) When you see RA page
allocations stealing high-order pages, it may actually be reflecting
some more general order-0-steals-order-N problem.

> My patch and shrinking RA window helps this case.

Thanks,
Fengguang

2011-05-17 06:00:09

by Fengguang Wu

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 15, 2011 at 12:12:36PM -0400, Andrew Lutomirski wrote:
> On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
> > On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
> >> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
> >> > Copying back linux-mm.
> >> >
> >> >> Recently, we added following patch.
> >> >> https://lkml.org/lkml/2011/4/26/129
> >> >> If it's a culprit, the patch should solve the problem.
> >> >
> >> > It would be probably better to not do the allocations at all under
> >> > memory pressure.  Even if the RA allocation doesn't go into reclaim
> >>
> >> Fair enough.
> >> I think we can do it easily now.
> >> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
> >> RA window size or turn off a while. The point is that we can use the
> >> fail of __do_page_cache_readahead as sign of memory pressure.
> >> Wu, What do you think?
> >
> > No, disabling readahead can hardly help.
> >
> > The sequential readahead memory consumption can be estimated by
> >
> >                2 * (number of concurrent read streams) * (readahead window size)
> >
> > And you can double that when there are two level of readaheads.
> >
> > Since there are hardly any concurrent read streams in Andy's case,
> > the readahead memory consumption will be ignorable.
> >
> > Typically readahead thrashing will happen long before excessive
> > GFP_NORETRY failures, so the reasonable solutions are to
> >
> > - shrink readahead window on readahead thrashing
> >  (current readahead heuristic can somehow do this, and I have patches
> >  to further improve it)
> >
> > - prevent abnormal GFP_NORETRY failures
> >  (when there are many reclaimable pages)
> >
> >
> > Andy's OOM memory dump (incorrect_oom_kill.txt.xz) shows that there are
> >
> > - 8MB   active+inactive file pages
> > - 160MB active+inactive anon pages
> > - 1GB   shmem pages
> > - 1.4GB unevictable pages
> >
> > Hmm, why are there so many unevictable pages?  How come the shmem
> > pages become unevictable when there are plenty of swap space?
>
> That was probably because one of my testcases creates a 1.4GB file on
> ramfs. (I can provoke the problem without doing evil things like
> that, but the test script is rather reliable at killing my system and
> it works fine on my other machines.)

Ah, I didn't read your first email. I'm now running

./test_mempressure.sh 1500 1400 1

with mem=2G and no swap, but cannot reproduce OOM.

What's your kconfig?

> If you want, I can try to generate a trace that isn't polluted with
> the evil ramfs file.

No, thanks. However, it would be valuable if you could retry with this
patch _alone_ (without the "if (need_resched()) return false;" change,
as I don't see how it helps your case).

@@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
*pgdat, int order, long remaining,
* must be balanced
*/
if (order)
- return pgdat_balanced(pgdat, balanced, classzone_idx);
+ return !pgdat_balanced(pgdat, balanced, classzone_idx);
else
return !all_zones_ok;
}

Thanks,
Fengguang

2011-05-17 06:26:19

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Tue, May 17, 2011 at 2:52 PM, Wu Fengguang <[email protected]> wrote:
> On Mon, May 16, 2011 at 07:40:42AM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 12:27 AM, Wu Fengguang <[email protected]> wrote:
>> > On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
>> >> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
>> >> > Copying back linux-mm.
>> >> >
>> >> >> Recently, we added following patch.
>> >> >> https://lkml.org/lkml/2011/4/26/129
>> >> >> If it's a culprit, the patch should solve the problem.
>> >> >
>> >> > It would be probably better to not do the allocations at all under
>> >> > memory pressure.  Even if the RA allocation doesn't go into reclaim
>> >>
>> >> Fair enough.
>> >> I think we can do it easily now.
>> >> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
>> >> RA window size or turn off a while. The point is that we can use the
>> >> fail of __do_page_cache_readahead as sign of memory pressure.
>> >> Wu, What do you think?
>> >
>> > No, disabling readahead can hardly help.
>>
>> I don't mean we have to disable RA.
>> As I said, the point is that we can use __GFP_NORETRY alloc fail as
>> _sign_ of memory pressure.
>
> I see.
>
>> >
>> > The sequential readahead memory consumption can be estimated by
>> >
>> >                2 * (number of concurrent read streams) * (readahead window size)
>> >
>> > And you can double that when there are two level of readaheads.
>> >
>> > Since there are hardly any concurrent read streams in Andy's case,
>> > the readahead memory consumption will be ignorable.
>> >
>> > Typically readahead thrashing will happen long before excessive
>> > GFP_NORETRY failures, so the reasonable solutions are to
>>
>> If it is, RA thrashing could be better sign than failure of __GFP_NORETRY.
>> If we can do it easily, I don't object it. :)
>
> Yeah, the RA thrashing is much better sign because it not only happens
> long before normal __GFP_NORETRY failures, but also offers hint on how
> tight memory pressure it is. We can then shrink the readahead window
> adaptively to the available page cache memory :)
>
>> >
>> > - shrink readahead window on readahead thrashing
>> >  (current readahead heuristic can somehow do this, and I have patches
>> >  to further improve it)
>>
>> Good to hear. :)
>> I don't want RA steals high order page in memory pressure.
>
> More often than not it won't be RA's fault :)  When you see RA page
> allocations stealing high order pages, it may actually be reflecting
> some more general order-0 steal order-N problem..

Agreed.
As I said to Andy, it's a general problem, but RA has a chance of
reducing it while the others don't have any solution. :(

--
Kind regards,
Minchan Kim

2011-05-17 06:35:52

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Tue, May 17, 2011 at 3:00 PM, Wu Fengguang <[email protected]> wrote:
> On Sun, May 15, 2011 at 12:12:36PM -0400, Andrew Lutomirski wrote:
>> On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
>> > On Sun, May 15, 2011 at 09:37:58AM +0800, Minchan Kim wrote:
>> >> On Sun, May 15, 2011 at 2:43 AM, Andi Kleen <[email protected]> wrote:
>> >> > Copying back linux-mm.
>> >> >
>> >> >> Recently, we added following patch.
>> >> >> https://lkml.org/lkml/2011/4/26/129
>> >> >> If it's a culprit, the patch should solve the problem.
>> >> >
>> >> > It would be probably better to not do the allocations at all under
>> >> > memory pressure.  Even if the RA allocation doesn't go into reclaim
>> >>
>> >> Fair enough.
>> >> I think we can do it easily now.
>> >> If page_cache_alloc_readahead(ie, GFP_NORETRY) is fail, we can adjust
>> >> RA window size or turn off a while. The point is that we can use the
>> >> fail of __do_page_cache_readahead as sign of memory pressure.
>> >> Wu, What do you think?
>> >
>> > No, disabling readahead can hardly help.
>> >
>> > The sequential readahead memory consumption can be estimated by
>> >
>> >                2 * (number of concurrent read streams) * (readahead window size)
>> >
>> > And you can double that when there are two level of readaheads.
>> >
>> > Since there are hardly any concurrent read streams in Andy's case,
>> > the readahead memory consumption will be ignorable.
>> >
>> > Typically readahead thrashing will happen long before excessive
>> > GFP_NORETRY failures, so the reasonable solutions are to
>> >
>> > - shrink readahead window on readahead thrashing
>> >  (current readahead heuristic can somehow do this, and I have patches
>> >  to further improve it)
>> >
>> > - prevent abnormal GFP_NORETRY failures
>> >  (when there are many reclaimable pages)
>> >
>> >
>> > Andy's OOM memory dump (incorrect_oom_kill.txt.xz) shows that there are
>> >
>> > - 8MB   active+inactive file pages
>> > - 160MB active+inactive anon pages
>> > - 1GB   shmem pages
>> > - 1.4GB unevictable pages
>> >
>> > Hmm, why are there so many unevictable pages?  How come the shmem
>> > pages become unevictable when there are plenty of swap space?
>>
>> That was probably because one of my testcases creates a 1.4GB file on
>> ramfs.  (I can provoke the problem without doing evil things like
>> that, but the test script is rather reliable at killing my system and
>> it works fine on my other machines.)
>
> Ah I didn't read your first email.. I'm now running
>
> ./test_mempressure.sh 1500 1400 1
>
> with mem=2G and no swap, but cannot reproduce OOM.
>
> What's your kconfig?
>
>> If you want, I can try to generate a trace that isn't polluted with
>> the evil ramfs file.
>
> No, thanks. However it would be valuable if you can retry with this
> patch _alone_ (without the "if (need_resched()) return false;" change,
> as I don't see how it helps your case).

Yes. I was curious about that. The experiment would be very valuable.

In James's case, he hit the problem again without need_resched:
https://lkml.org/lkml/2011/5/12/547

But I am not sure exactly what he meant by 'livelock'.
I expect he hit a soft lockup again.

Still, I think the possibility of skipping all the cond_resched calls
sprinkled through vmscan.c is _very_ low. How can such a soft lockup happen?
So I am really curious about what's going on that I'm not seeing.

>
> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
>        * must be balanced
>        */
>       if (order)
> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>       else
>               return !all_zones_ok;
>  }
>
> Thanks,
> Fengguang
>



--
Kind regards,
Minchan Kim

2011-05-17 19:22:58

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Tue, May 17, 2011 at 2:00 AM, Wu Fengguang <[email protected]> wrote:
> On Sun, May 15, 2011 at 12:12:36PM -0400, Andrew Lutomirski wrote:
>> On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
>>
>> That was probably because one of my testcases creates a 1.4GB file on
>> ramfs. (I can provoke the problem without doing evil things like
>> that, but the test script is rather reliable at killing my system and
>> it works fine on my other machines.)
>
> Ah I didn't read your first email.. I'm now running
>
> ./test_mempressure.sh 1500 1400 1
>
> with mem=2G and no swap, but cannot reproduce OOM.

Do you have a Sandy Bridge laptop? There was a recent thread on lkml
suggesting that only Sandy Bridge laptops saw this problem. Something
else must be needed to trigger it, though, because I can't reproduce it
from an initramfs I made to demonstrate the problem.

>
> What's your kconfig?

Attached. This is 2.6.38.6.

>
>> If you want, I can try to generate a trace that isn't polluted with
>> the evil ramfs file.
>
> No, thanks. However it would be valuable if you can retry with this
> patch _alone_ (without the "if (need_resched()) return false;" change,
> as I don't see how it helps your case).
>
> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
>        * must be balanced
>        */
>       if (order)
> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>       else
>               return !all_zones_ok;
>  }

Done.

I logged in, added swap, and ran a program that allocated 1900MB of
RAM and memset it. The system lagged a bit but survived. kswapd
showed 10% CPU (which is odd, IMO, since I'm using aesni-intel and I
think that all the crypt happens in kworker when aesni-intel is in
use).

Then I started Firefox, loaded gmail, and ran test_mempressure.sh.
Kaboom! (I.e. system was hung) SysRq-F saved the system and produced
the attached dump. I had 6GB swap available, so there shouldn't have
been any OOM.

--Andy


Attachments:
messages.txt.xz (14.86 kB)
.config (86.42 kB)

2011-05-18 05:17:20

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, May 18, 2011 at 4:22 AM, Andrew Lutomirski <[email protected]> wrote:
> On Tue, May 17, 2011 at 2:00 AM, Wu Fengguang <[email protected]> wrote:
>> On Sun, May 15, 2011 at 12:12:36PM -0400, Andrew Lutomirski wrote:
>>> On Sun, May 15, 2011 at 11:27 AM, Wu Fengguang <[email protected]> wrote:
>>>
>>> That was probably because one of my testcases creates a 1.4GB file on
>>> ramfs.  (I can provoke the problem without doing evil things like
>>> that, but the test script is rather reliable at killing my system and
>>> it works fine on my other machines.)
>>
>> Ah I didn't read your first email.. I'm now running
>>
>> ./test_mempressure.sh 1500 1400 1
>>
>> with mem=2G and no swap, but cannot reproduce OOM.
>
> Do you have a Sandy Bridge laptop?  There was a recent thread on lkml
> suggesting that only Sandy Bridge laptops saw this problem.  Although
> there's something else needed to trigger it, because I can't do it
> from an initramfs I made that tried to show this problem.
>
>>
>> What's your kconfig?
>
> Attached.  This is 2.6.38.6.
>
>>
>>> If you want, I can try to generate a trace that isn't polluted with
>>> the evil ramfs file.
>>
>> No, thanks. However it would be valuable if you can retry with this
>> patch _alone_ (without the "if (need_resched()) return false;" change,
>> as I don't see how it helps your case).
>>
>> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
>> *pgdat, int order, long remaining,
>>        * must be balanced
>>        */
>>       if (order)
>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>       else
>>               return !all_zones_ok;
>>  }
>
> Done.
>
> I logged in, added swap, and ran a program that allocated 1900MB of
> RAM and memset it.  The system lagged a bit but survived.  kswapd
> showed 10% CPU (which is odd, IMO, since I'm using aesni-intel and I
> think that all the crypt happens in kworker when aesni-intel is in
> use).

I think kswapd could well use 10% CPU just doing reclaim.

>
> Then I started Firefox, loaded gmail, and ran test_mempressure.sh.
> Kaboom!  (I.e. system was hung)  SysRq-F saved the system and produced

Hang?
Does that mean you see a soft lockup of kswapd, or that the
mouse/keyboard doesn't respond?

> the attached dump.  I had 6GB swap available, so there shouldn't have
> been any OOM.

Yes. It's strange, but we have seen such cases several times, AFAIR.

Let's look at your first OOM message.
(Intentionally, I don't inline the OOM message, as web Gmail mangles it,
which is very annoying for whoever reads it.)

Considering the min/low/high watermarks of the zones, no zone can meet
your allocation request (order-0, GFP_WAIT|IO|FS|HIGHMEM), so the result
is natural.
But the thing I wonder about is that we have lots of free swap space, as
you said. Why doesn't the VM swap out anon pages of the DMA32 zone
instead of going OOM?

We are isolating anon pages of DMA32, as the log says (ie,
isolated(anon):408kB),
so I think the VM is behaving correctly.
The thing is, the tasks' allocation rate is faster than the swapout
rate, so the swap device is very congested and most of the swapped-out
pages remain in PG_writeback. In the end, shrink_page_list returns 0.

In high-order page reclaim, we can throttle tasks via should_reclaim_stall.
But for order-0 pages, should_reclaim_stall returns _false_, so in the end
we see the OOM message even though swap has lots of free space.
Does my guess make sense?
If so, does it make sense that OOM happens even though we have lots of
swap space, in the order-0 case?
How about this?

Andrew, could you test this patch together with the !pgdat_balanced patch?
I think we shouldn't see an OOM if we have lots of free swap space.

== CUT_HERE ==
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f73b865..cc23f04 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,10 +1341,6 @@ static inline bool
should_reclaim_stall(unsigned long nr_taken,
if (current_is_kswapd())
return false;

- /* Only stall on lumpy reclaim */
- if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
- return false;
-
/* If we have relaimed everything on the isolated list, no stall */
if (nr_freed == nr_taken)
return false;



Then, if you don't see any unnecessary OOM but still see the hang,
could you apply this patch on top of the previous one?

== CUT_HERE ==

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f73b865..703380f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2697,6 +2697,7 @@ static int kswapd(void *p)
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
order = balance_pgdat(pgdat, order, &classzone_idx);
+ cond_resched();
}
}
return 0;

--
Kind regards,
Minchan Kim

2011-05-19 02:16:15

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, May 18, 2011 at 1:17 AM, Minchan Kim <[email protected]> wrote:
> On Wed, May 18, 2011 at 4:22 AM, Andrew Lutomirski <[email protected]> wrote:
>>> No, thanks. However it would be valuable if you can retry with this
>>> patch _alone_ (without the "if (need_resched()) return false;" change,
>>> as I don't see how it helps your case).
>>>
>>> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
>>> *pgdat, int order, long remaining,
>>>        * must be balanced
>>>        */
>>>       if (order)
>>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>>       else
>>>               return !all_zones_ok;
>>>  }
>>
>> Done.
>>
>> I logged in, added swap, and ran a program that allocated 1900MB of
>> RAM and memset it. The system lagged a bit but survived. kswapd
>> showed 10% CPU (which is odd, IMO, since I'm using aesni-intel and I
>> think that all the crypt happens in kworker when aesni-intel is in
>> use).
>
> I think kswapd could use 10% enough for reclaim.
>
>>
>> Then I started Firefox, loaded gmail, and ran test_mempressure.sh.
>> Kaboom! (I.e. system was hung) SysRq-F saved the system and produced
>
> Hang?
> It means you see softhangup of kswapd? or mouse/keyboard doesn't move?

Mouse and keyboard dead.

> Andrew, Could you test this patch with !pgdat_balanced patch?
> I think we shouldn't see OOM message if we have lots of free swap space.
>
> == CUT_HERE ==
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f73b865..cc23f04 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1341,10 +1341,6 @@ static inline bool
> should_reclaim_stall(unsigned long nr_taken,
>        if (current_is_kswapd())
>                return false;
>
> -       /* Only stall on lumpy reclaim */
> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> -               return false;
> -
>        /* If we have relaimed everything on the isolated list, no stall */
>        if (nr_freed == nr_taken)
>                return false;
>
>
>
> Then, if you don't see any unnecessary OOM but still see the hangup,
> could you apply this patch based on previous?

With this patch, I started GNOME and Firefox, turned on swap, and ran
test_mempressure.sh 1500 1400 1. Instant panic (or OOPS and hang or
something -- didn't get the top part). Picture attached -- it looks
like memcg might be involved. I'm running F15, so it might even be
doing something.

I won't be able to get netconsole dumps until next week because I'm
out of town and only have this one computer here.

I haven't tried the other patch.

Also, the !pgdat_balanced fix plus the if (need_resched()) return
false patch just hung once on 2.6.37-rc9. I don't know what triggered
it. Maybe yum.

--Andy


Attachments:
IMG_20110518_184222.jpg (92.32 kB)

2011-05-19 02:37:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, 18 May 2011 22:15:53 -0400
Andrew Lutomirski <[email protected]> wrote:

> On Wed, May 18, 2011 at 1:17 AM, Minchan Kim <[email protected]> wrote:
> > On Wed, May 18, 2011 at 4:22 AM, Andrew Lutomirski <[email protected]> wrote:

> > Andrew, Could you test this patch with !pgdat_balanced patch?
> > I think we shouldn't see OOM message if we have lots of free swap space.
> >
> > == CUT_HERE ==
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f73b865..cc23f04 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1341,10 +1341,6 @@ static inline bool
> > should_reclaim_stall(unsigned long nr_taken,
> >        if (current_is_kswapd())
> >                return false;
> >
> > -       /* Only stall on lumpy reclaim */
> > -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> > -               return false;
> > -
> >        /* If we have relaimed everything on the isolated list, no stall */
> >        if (nr_freed == nr_taken)
> >                return false;
> >
> >
> >
> > Then, if you don't see any unnecessary OOM but still see the hangup,
> > could you apply this patch based on previous?
>
> With this patch, I started GNOME and Firefox, turned on swap, and ran
> test_mempressure.sh 1500 1400 1. Instant panic (or OOPS and hang or
> something -- didn't get the top part). Picture attached -- it looks
> like memcg might be involved. I'm running F15, so it might even be
> doing something.
>

Hmm, what kernel version do you use?
I think memcg is not guilty because the RIP is in shrink_page_list().
But OK, I'll dig into this. Could you give us your .config?

Thanks,
-Kame


> I won't be able to get netconsole dumps until next week because I'm
> out of town and only have this one computer here.
>
> I haven't tried the other patch.
>
> Also, the !pgdat_balanced fix plus the if (need_resched()) return
> false patch just hung once on 2.6.37-rc9. I don't know what triggered
> it. Maybe yum.
>
> --Andy

2011-05-19 02:41:23

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, May 18, 2011 at 10:30 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 18 May 2011 22:15:53 -0400
> Andrew Lutomirski <[email protected]> wrote:
>
>> On Wed, May 18, 2011 at 1:17 AM, Minchan Kim <[email protected]> wrote:
>> > On Wed, May 18, 2011 at 4:22 AM, Andrew Lutomirski <[email protected]> wrote:
>
>> > Andrew, Could you test this patch with !pgdat_balanced patch?
>> > I think we shouldn't see OOM message if we have lots of free swap space.
>> >
>> > == CUT_HERE ==
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index f73b865..cc23f04 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1341,10 +1341,6 @@ static inline bool
>> > should_reclaim_stall(unsigned long nr_taken,
>> >        if (current_is_kswapd())
>> >                return false;
>> >
>> > -       /* Only stall on lumpy reclaim */
>> > -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
>> > -               return false;
>> > -
>> >        /* If we have relaimed everything on the isolated list, no stall */
>> >        if (nr_freed == nr_taken)
>> >                return false;
>> >
>> >
>> >
>> > Then, if you don't see any unnecessary OOM but still see the hangup,
>> > could you apply this patch based on previous?
>>
>> With this patch, I started GNOME and Firefox, turned on swap, and ran
>> test_mempressure.sh 1500 1400 1. Instant panic (or OOPS and hang or
>> something -- didn't get the top part). Picture attached -- it looks
>> like memcg might be involved. I'm running F15, so it might even be
>> doing something.
>>
>
> Hmm, what kernel version do you use ?
> I think memcg is not guilty because RIP is shrink_page_list().
> But ok, I'll dig this. Could you give us your .config ?

Attached.

The address in shrink_page_list is the ud2, from (I think)
VM_BUG_ON(PageActive(page)). The sequence is:

0xffffffff810d24cc <+202>: callq 0xffffffff810cf930 <test_and_set_bit>
0xffffffff810d24d1 <+207>: test %eax,%eax
0xffffffff810d24d3 <+209>: jne 0xffffffff810d2aa5 <shrink_page_list+1699>
0xffffffff810d24d9 <+215>: mov -0x28(%rbx),%rax
0xffffffff810d24dd <+219>: test $0x40,%al
0xffffffff810d24df <+221>: je 0xffffffff810d24e3 <shrink_page_list+225>
0xffffffff810d24e1 <+223>: ud2


--Andy


Attachments:
.config (86.42 kB)

2011-05-19 02:54:08

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 11:15 AM, Andrew Lutomirski <[email protected]> wrote:
> On Wed, May 18, 2011 at 1:17 AM, Minchan Kim <[email protected]> wrote:
>> On Wed, May 18, 2011 at 4:22 AM, Andrew Lutomirski <[email protected]> wrote:
>>>> No, thanks. However it would be valuable if you can retry with this
>>>> patch _alone_ (without the "if (need_resched()) return false;" change,
>>>> as I don't see how it helps your case).
>>>>
>>>> @@ -2286,7 +2290,7 @@ static bool sleeping_prematurely(pg_data_t
>>>> *pgdat, int order, long remaining,
>>>>        * must be balanced
>>>>        */
>>>>       if (order)
>>>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>>>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>>>       else
>>>>               return !all_zones_ok;
>>>>  }
>>>
>>> Done.
>>>
>>> I logged in, added swap, and ran a program that allocated 1900MB of
>>> RAM and memset it.  The system lagged a bit but survived.  kswapd
>>> showed 10% CPU (which is odd, IMO, since I'm using aesni-intel and I
>>> think that all the crypt happens in kworker when aesni-intel is in
>>> use).
>>
>> I think 10% CPU is plausible for kswapd doing reclaim.
>>
>>>
>>> Then I started Firefox, loaded gmail, and ran test_mempressure.sh.
>>> Kaboom!  (I.e. system was hung)  SysRq-F saved the system and produced
>>
>> Hang?
>> Do you mean a soft lockup of kswapd, or that the mouse/keyboard stops responding?
>
> Mouse and keyboard dead.
>
>> Andrew, Could you test this patch with !pgdat_balanced patch?
>> I think we shouldn't see OOM message if we have lots of free swap space.
>>
>> == CUT_HERE ==
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f73b865..cc23f04 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1341,10 +1341,6 @@ static inline bool
>> should_reclaim_stall(unsigned long nr_taken,
>>        if (current_is_kswapd())
>>                return false;
>>
>> -       /* Only stall on lumpy reclaim */
>> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
>> -               return false;
>> -
>>        /* If we have relaimed everything on the isolated list, no stall */
>>        if (nr_freed == nr_taken)
>>                return false;
>>
>>
>>
>> Then, if you don't see any unnecessary OOM but still see the hangup,
>> could you apply this patch based on previous?
>
> With this patch, I started GNOME and Firefox, turned on swap, and ran
> test_mempressure.sh 1500 1400 1.  Instant panic (or OOPS and hang or
> something -- didn't get the top part).  Picture attached -- it looks
> like memcg might be involved.  I'm running F15, so it might even be
> doing something.

I can't figure out why the OOPS happens.
Let me know your kernel version and config.
Kame, is there anything related to memcg that you suspect?

In addition, the patch I gave you was flawed.
The goal is to wait for dirty page writeback in (order-0 | high
priority) reclaim.
(I don't think that's the ideal solution to this problem, but it should
be enough to prove where the problem is.)
However, although we pass sync as 1 into set_reclaim_mode, it is ignored.
So the fix is as follows. (NOTE: this is not related to your OOPS.)
But before any further experiments, let's fix your oops.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..69d317e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,7 +311,8 @@ static void set_reclaim_mode(int priority, struct scan_control *sc,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		sc->reclaim_mode |= syncmode;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if ((sc->order && priority < DEF_PRIORITY - 2) ||
+			priority <= DEF_PRIORITY / 3)
 		sc->reclaim_mode |= syncmode;
 	else
 		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
@@ -1349,10 +1350,6 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 	if (current_is_kswapd())
 		return false;
 
-	/* Only stall on lumpy reclaim */
-	if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
-		return false;
-
 	/* If we have relaimed everything on the isolated list, no stall */
 	if (nr_freed == nr_taken)
 		return false;



>
> I won't be able to get netconsole dumps until next week because I'm
> out of town and only have this one computer here.

No problem. :)
We need to get rid of the OOPS before we can go on with the experiment.


>
> I haven't tried the other patch.
>
> Also, the !pgdat_balanced fix plus the if (need_resched()) return
> false patch just hung once on 2.6.37-rc9.  I don't know what triggered

Thanks for the good information.
It seems the need_resched patch isn't a good candidate for fixing the
current problem. We have already weeded it out.

Thank you very much for the testing!

> it.  Maybe yum.
>
> --Andy
>



--
Kind regards,
Minchan Kim

2011-05-19 14:17:12

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

I just booted 2.6.38.6 with exactly two patches applied. Config was
the same as I emailed yesterday. Userspace is F15. First was
"aesni-intel: Merge with fpu.ko" because dracut fails to boot my
system without it. Second was this (sorry for whitespace damage):

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..3f44b81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -307,7 +307,7 @@ static void set_reclaim_mode(int priority, struct scan_control *sc,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		sc->reclaim_mode |= syncmode;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if ((sc->order && priority < DEF_PRIORITY - 2) || priority <= DEF_PRIORITY / 3)
 		sc->reclaim_mode |= syncmode;
 	else
 		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
@@ -1342,10 +1342,6 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 	if (current_is_kswapd())
 		return false;
 
-	/* Only stall on lumpy reclaim */
-	if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
-		return false;
-
 	/* If we have relaimed everything on the isolated list, no stall */
 	if (nr_freed == nr_taken)
 		return false;

I started GNOME and Firefox, enabled swap, and ran test_mempressure.sh
1500 1400 1. The system quickly gave the attached oops.

The oops was the ud2 here:

0xffffffff810d251b <+215>: mov -0x28(%rbx),%rax
0xffffffff810d251f <+219>: test $0x40,%al
0xffffffff810d2521 <+221>: je 0xffffffff810d2525 <shrink_page_list+225>
0xffffffff810d2523 <+223>: ud2

Please let me know what the next test to run is.

--Andy


Attachments:
IMG_20110519_094454.jpg (94.19 kB)

2011-05-19 14:51:51

by Fengguang Wu

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> > I had 6GB swap available, so there shouldn't have
> > been any OOM.
>
> Yes. It's strange but we have seen such case several times, AFAIR.

I noticed that the test script mounted a "ramfs" not "tmpfs", hence
the 1.4G pages won't be swapped?

Thanks,
Fengguang

2011-05-19 15:00:45

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 10:51 AM, Wu Fengguang <[email protected]> wrote:
>> > I had 6GB swap available, so there shouldn't have
>> > been any OOM.
>>
>> Yes. It's strange but we have seen such case several times, AFAIR.
>
> I noticed that the test script mounted a "ramfs" not "tmpfs", hence
> the 1.4G pages won't be swapped?

That's intentional.

I run LVM over dm-crypt on an SSD, and I thought that might be part of
the problem. I wanted a script that would see if I could reproduce
the problem without stressing that system too much, so I created a
second backing store on dm-crypt over ramfs so that no real I/O will
happen. The script is quite effective at bringing down my system, so
I haven't changed it.

(I have 6GB of "real" swap on the LVM, so pinning 1500MB into RAM
ought to cause some thrashing but not take the system down. And this
script with a larger ramfs does not take down my desktop, which is an
8GB Sandy Bridge box. But whatever the underlying bug is seems to
mainly affect Sandy Bridge *laptops*, so maybe that's expected.)

--Andy

>
> Thanks,
> Fengguang
>

2011-05-20 00:17:16

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 11:16 PM, Andrew Lutomirski <[email protected]> wrote:
> I just booted 2.6.38.6 with exactly two patches applied.  Config was
> the same as I emailed yesterday.  Userspace is F15.  First was
> "aesni-intel: Merge with fpu.ko" because dracut fails to boot my
> system without it.  Second was this (sorry for whitespace damage):
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0665520..3f44b81 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -307,7 +307,7 @@ static void set_reclaim_mode(int priority, struct
> scan_control *sc,
>         */
>        if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>                sc->reclaim_mode |= syncmode;
> -       else if (sc->order && priority < DEF_PRIORITY - 2)
> +       else if ((sc->order && priority < DEF_PRIORITY - 2) ||
> priority <= DEF_PRIORITY / 3)
>                sc->reclaim_mode |= syncmode;
>        else
>                sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
> @@ -1342,10 +1342,6 @@ static inline bool
> should_reclaim_stall(unsigned long nr_taken,
>        if (current_is_kswapd())
>                return false;
>
> -       /* Only stall on lumpy reclaim */
> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> -               return false;
> -
>        /* If we have relaimed everything on the isolated list, no stall */
>        if (nr_freed == nr_taken)
>                return false;
>
> I started GNOME and Firefox, enabled swap, and ran test_mempressure.sh
> 1500 1400 1.  The system quickly gave the attached oops.
>
> The oops was the ud2 here:
>
>   0xffffffff810d251b <+215>:   mov    -0x28(%rbx),%rax
>   0xffffffff810d251f <+219>:   test   $0x40,%al
>   0xffffffff810d2521 <+221>:   je     0xffffffff810d2525 <shrink_page_list+225>
>   0xffffffff810d2523 <+223>:   ud2
>
> Please let me know what the next test to run is.

Okay. My first patch (!pgdat_balanced plus cond_resched right after
balance_pgdat) that I sent you worked, but the version with
cond_resched removed hung.

Let's not make the problem more complex, so let's set my patch above
aside for now.

Would you be willing to run one more test with the patch below?
(Of course, it will be whitespace-damaged; I can't do anything about
that from my office. Sorry.)
If the patch below still fixes your problem like my first patch did,
we will push it into mainline.

Thanks, Andrew.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..1663d24 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		/* Assume we'll be able to shrink next time */
+		ret = 1;
+		goto out;
+	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }
 
@@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }



>
> --Andy
>



--
Kind regards,
Minchan Kim

2011-05-20 00:20:12

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 11:51 PM, Wu Fengguang <[email protected]> wrote:
>> > I had 6GB swap available, so there shouldn't have
>> > been any OOM.
>>
>> Yes. It's strange but we have seen such case several times, AFAIR.
>
> I noticed that the test script mounted a "ramfs" not "tmpfs", hence
> the 1.4G pages won't be swapped?

Right, ramfs pages cannot be swapped out.
But in the log, the 200M of anon in DMA32 doesn't include the
unevictable 1.4GB, so we can still swap out that 200M.

>
> Thanks,
> Fengguang
>



--
Kind regards,
Minchan Kim

2011-05-20 02:59:11

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 10:16 AM, Andrew Lutomirski <[email protected]> wrote:
> I just booted 2.6.38.6 with exactly two patches applied.  Config was
> the same as I emailed yesterday.  Userspace is F15.  First was
> "aesni-intel: Merge with fpu.ko" because dracut fails to boot my
> system without it.  Second was this (sorry for whitespace damage):
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0665520..3f44b81 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -307,7 +307,7 @@ static void set_reclaim_mode(int priority, struct
> scan_control *sc,
>         */
>        if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>                sc->reclaim_mode |= syncmode;
> -       else if (sc->order && priority < DEF_PRIORITY - 2)
> +       else if ((sc->order && priority < DEF_PRIORITY - 2) ||
> priority <= DEF_PRIORITY / 3)
>                sc->reclaim_mode |= syncmode;
>        else
>                sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
> @@ -1342,10 +1342,6 @@ static inline bool
> should_reclaim_stall(unsigned long nr_taken,
>        if (current_is_kswapd())
>                return false;
>
> -       /* Only stall on lumpy reclaim */
> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> -               return false;
> -
>        /* If we have relaimed everything on the isolated list, no stall */
>        if (nr_freed == nr_taken)
>                return false;
>
> I started GNOME and Firefox, enabled swap, and ran test_mempressure.sh
> 1500 1400 1.  The system quickly gave the attached oops.
>

I haven't applied Minchan's latest patch yet, but given the OOPS it
seems like the root cause might be something other than kswapd not
going to sleep. So I applied this additional patch:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f44b81..1beea0f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -729,7 +729,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (!trylock_page(page))
 			goto keep;
 
-		VM_BUG_ON(PageActive(page));
+		if (PageActive(page)) {
+			printk(KERN_ERR "shrink_page_list (nr_scanned=%lu nr_reclaimed=%lu nr_to_reclaim=%lu gfp_mask=%X) found inactive page %p with flags=%lX\n",
+			       sc->nr_scanned, sc->nr_reclaimed,
+			       sc->nr_to_reclaim, sc->gfp_mask, page,
+			       page->flags);
+			//VM_BUG_ON(PageActive(page));
+			msleep(1);
+			continue;
+		}
 		VM_BUG_ON(page_zone(page) != zone);
 
 		sc->nr_scanned++;

and saw:

[ 63.609661] Adding 6291452k swap on /dev/mapper/vg_antithesis-swap.
Priority:-1 extents:1 across:6291452k
[ 70.148767] shrink_page_list (nr_scanned=33620 nr_reclaimed=2122
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea00014220d0
with flags=100000000008005D
[ 70.148929] shrink_page_list (nr_scanned=23477 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001423f38
with flags=100000000008005D
[ 70.150036] shrink_page_list (nr_scanned=33620 nr_reclaimed=2122
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0001422060
with flags=100000000008005D
[ 70.150132] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea00014249f0
with flags=100000000008005D
[ 70.152032] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424a28
with flags=100000000008005D
[ 70.152123] shrink_page_list (nr_scanned=33632 nr_reclaimed=2122
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea00014224c0
with flags=100000000008005D
[ 70.154027] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424a60
with flags=100000000008005D
[ 70.154180] shrink_page_list (nr_scanned=33733 nr_reclaimed=2122
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0001424bb0
with flags=100000000008005D
[ 70.156022] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424a98
with flags=100000000008005D
[ 70.156247] shrink_page_list (nr_scanned=33930 nr_reclaimed=2168
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea000125e860
with flags=100000000002004D
[ 70.158035] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424ad0
with flags=100000000008005D
[ 70.158101] shrink_page_list (nr_scanned=33930 nr_reclaimed=2168
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea000125f238
with flags=100000000002004D
[ 70.160010] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424b08
with flags=100000000008005D
[ 70.160075] shrink_page_list (nr_scanned=33930 nr_reclaimed=2168
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea000125f200
with flags=100000000002004D
[ 70.162013] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424b40
with flags=100000000008005D
[ 70.162080] shrink_page_list (nr_scanned=33930 nr_reclaimed=2168
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea000125f1c8
with flags=100000000002004D
[ 70.164015] shrink_page_list (nr_scanned=23507 nr_reclaimed=2198
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0001424b78
with flags=100000000008005D
[ 70.168859] shrink_page_list (nr_scanned=24706 nr_reclaimed=2239
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea00012ae030
with flags=1000000000080049
[ 70.168959] shrink_page_list (nr_scanned=40170 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea000125b488
with flags=100000000008005D
[ 70.170004] shrink_page_list (nr_scanned=24706 nr_reclaimed=2239
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea00012adf88
with flags=1000000000080049
[ 70.175980] shrink_page_list (nr_scanned=566 nr_reclaimed=81
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000e00f18
with flags=100000000002004D
[ 70.176140] shrink_page_list (nr_scanned=846 nr_reclaimed=94
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000df2428
with flags=100000000002004D
[ 70.176160] shrink_page_list (nr_scanned=41061 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000df29d8
with flags=100000000002004D
[ 70.176364] shrink_page_list (nr_scanned=28440 nr_reclaimed=2350
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000de9240
with flags=100000000002004D
[ 70.178086] shrink_page_list (nr_scanned=41061 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000df2a10
with flags=100000000002004D
[ 70.178161] shrink_page_list (nr_scanned=846 nr_reclaimed=94
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000df2268
with flags=100000000002004D
[ 70.178189] shrink_page_list (nr_scanned=28493 nr_reclaimed=2350
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000de92b0
with flags=100000000002004D
[ 70.178215] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de98d0
with flags=100000000002004D
[ 70.180063] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9908
with flags=100000000002004D
[ 70.180081] shrink_page_list (nr_scanned=28493 nr_reclaimed=2350
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000de9320
with flags=100000000002004D
[ 70.180192] shrink_page_list (nr_scanned=897 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dea0e8
with flags=100000000002004D
[ 70.180197] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deac80
with flags=100000000002004D
[ 70.182031] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9940
with flags=100000000002004D
[ 70.182048] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deacb8
with flags=100000000002004D
[ 70.182063] shrink_page_list (nr_scanned=28493 nr_reclaimed=2350
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000de9358
with flags=100000000002004D
[ 70.182079] shrink_page_list (nr_scanned=897 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dea120
with flags=100000000002004D
[ 70.183986] shrink_page_list (nr_scanned=28493 nr_reclaimed=2350
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000de9828
with flags=100000000002004D
[ 70.183990] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deacf0
with flags=100000000002004D
[ 70.183993] shrink_page_list (nr_scanned=897 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dea158
with flags=100000000002004D
[ 70.185982] shrink_page_list (nr_scanned=897 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dea190
with flags=100000000002004D
[ 70.185986] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deadd0
with flags=100000000002004D
[ 70.186117] shrink_page_list (nr_scanned=28621 nr_reclaimed=2382
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da5118
with flags=100000000002004D
[ 70.187991] shrink_page_list (nr_scanned=897 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dea1c8
with flags=100000000002004D
[ 70.187994] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deae40
with flags=100000000002004D
[ 70.187998] shrink_page_list (nr_scanned=28621 nr_reclaimed=2382
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da5348
with flags=100000000002004D
[ 70.189977] shrink_page_list (nr_scanned=28621 nr_reclaimed=2382
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da5540
with flags=100000000002004D
[ 70.189980] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deae78
with flags=100000000002004D
[ 70.190026] shrink_page_list (nr_scanned=950 nr_reclaimed=136
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000da5b98
with flags=100000000002004D
[ 70.191975] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deaeb0
with flags=100000000002004D
[ 70.191982] shrink_page_list (nr_scanned=28621 nr_reclaimed=2382
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da5578
with flags=100000000002004D
[ 70.192096] shrink_page_list (nr_scanned=1149 nr_reclaimed=170
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000da5c78
with flags=100000000002004D
[ 70.193973] shrink_page_list (nr_scanned=41119 nr_reclaimed=2787
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000deaee8
with flags=100000000002004D
[ 70.194025] shrink_page_list (nr_scanned=1213 nr_reclaimed=170
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000da5ff8
with flags=100000000002004D
[ 70.194190] shrink_page_list (nr_scanned=28849 nr_reclaimed=2414
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da6a78
with flags=100000000002004D
[ 70.195970] shrink_page_list (nr_scanned=1213 nr_reclaimed=170
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000da6378
with flags=100000000002004D
[ 70.195981] shrink_page_list (nr_scanned=28849 nr_reclaimed=2414
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da6ab0
with flags=100000000002004D
[ 70.196022] shrink_page_list (nr_scanned=41176 nr_reclaimed=2821
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000da7178
with flags=100000000002004D
[ 70.197975] shrink_page_list (nr_scanned=1213 nr_reclaimed=170
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000da66c0
with flags=100000000002004D
[ 70.197982] shrink_page_list (nr_scanned=28849 nr_reclaimed=2414
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000da7140
with flags=100000000002004D
[ 70.198197] shrink_page_list (nr_scanned=41527 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000daa198
with flags=100000000002004D
[ 70.199965] shrink_page_list (nr_scanned=41527 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000daa4a8
with flags=100000000002004D
[ 70.200070] shrink_page_list (nr_scanned=1341 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000daaa58
with flags=100000000002004D
[ 70.200116] shrink_page_list (nr_scanned=28963 nr_reclaimed=2414
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d90188
with flags=100000000002004D
[ 70.201962] shrink_page_list (nr_scanned=1341 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000daaac8
with flags=100000000002004D
[ 70.201965] shrink_page_list (nr_scanned=41527 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000daa4e0
with flags=100000000002004D
[ 70.202069] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d907e0
with flags=100000000002004D
[ 70.203959] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d90818
with flags=100000000002004D
[ 70.203964] shrink_page_list (nr_scanned=41527 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000daa630
with flags=100000000002004D
[ 70.204009] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90b28
with flags=100000000002004D
[ 70.205955] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90b98
with flags=100000000002004D
[ 70.205959] shrink_page_list (nr_scanned=41527 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000daa8d0
with flags=100000000002004D
[ 70.205962] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d90850
with flags=100000000002004D
[ 70.207962] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90bd0
with flags=100000000002004D
[ 70.207968] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d90888
with flags=100000000002004D
[ 70.208015] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d90f88
with flags=100000000002004D
[ 70.209950] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90d20
with flags=100000000002004D
[ 70.209954] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d917a0
with flags=100000000002004D
[ 70.210095] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d908f8
with flags=100000000002004D
[ 70.211948] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90d58
with flags=100000000002004D
[ 70.211952] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d91ab0
with flags=100000000002004D
[ 70.211955] shrink_page_list (nr_scanned=29077 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d90af0
with flags=100000000002004D
[ 70.213946] shrink_page_list (nr_scanned=1399 nr_reclaimed=205
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d90f18
with flags=100000000002004D
[ 70.213949] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d91c70
with flags=100000000002004D
[ 70.214034] shrink_page_list (nr_scanned=29165 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d92648
with flags=100000000002004D
[ 70.215944] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d91ca8
with flags=100000000002004D
[ 70.215948] shrink_page_list (nr_scanned=29165 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d92680
with flags=100000000002004D
[ 70.216002] shrink_page_list (nr_scanned=1462 nr_reclaimed=247
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d92728
with flags=100000000002004D
[ 70.217949] shrink_page_list (nr_scanned=1462 nr_reclaimed=247
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d92760
with flags=100000000002004D
[ 70.217952] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d925d8
with flags=100000000002004D
[ 70.218017] shrink_page_list (nr_scanned=29202 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d93bf0
with flags=100000000002004D
[ 70.219939] shrink_page_list (nr_scanned=41591 nr_reclaimed=2920
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d92610
with flags=100000000002004D
[ 70.220036] shrink_page_list (nr_scanned=29266 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d94018
with flags=100000000002004D
[ 70.220054] shrink_page_list (nr_scanned=1562 nr_reclaimed=290
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000dcdbe0
with flags=100000000002004D
[ 70.221934] shrink_page_list (nr_scanned=29266 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d940c0
with flags=100000000002004D
[ 70.221938] shrink_page_list (nr_scanned=1562 nr_reclaimed=290
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d95470
with flags=100000000002004D
[ 70.222585] shrink_page_list (nr_scanned=42665 nr_reclaimed=3127
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d8d7f8
with flags=100000000002004D
[ 70.223931] shrink_page_list (nr_scanned=29266 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d94130
with flags=100000000002004D
[ 70.223935] shrink_page_list (nr_scanned=42665 nr_reclaimed=3127
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d8d830
with flags=100000000002004D
[ 70.223976] shrink_page_list (nr_scanned=1612 nr_reclaimed=290
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d8f238
with flags=100000000002004D
[ 70.225929] shrink_page_list (nr_scanned=42665 nr_reclaimed=3127
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d8f468
with flags=100000000002004D
[ 70.225932] shrink_page_list (nr_scanned=1612 nr_reclaimed=290
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d8f158
with flags=100000000002004D
[ 70.225935] shrink_page_list (nr_scanned=29266 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d941a0
with flags=100000000002004D
[ 70.227934] shrink_page_list (nr_scanned=29266 nr_reclaimed=2460
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d944b0
with flags=100000000002004D
[ 70.228134] shrink_page_list (nr_scanned=42824 nr_reclaimed=3199
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d76cd8
with flags=100000000002004D
[ 70.228427] shrink_page_list (nr_scanned=2225 nr_reclaimed=409
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d7ad28
with flags=100000000002004D
[ 70.230232] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d695d0
with flags=100000000002004D
[ 70.230251] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d69870
with flags=100000000002004D
[ 70.230446] shrink_page_list (nr_scanned=29609 nr_reclaimed=2544
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6b978
with flags=100000000002004D
[ 70.231920] shrink_page_list (nr_scanned=29609 nr_reclaimed=2544
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6b898
with flags=100000000002004D
[ 70.231924] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a018
with flags=100000000002004D
[ 70.231927] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69608
with flags=100000000002004D
[ 70.233918] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69640
with flags=100000000002004D
[ 70.233921] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a050
with flags=100000000002004D
[ 70.233925] shrink_page_list (nr_scanned=29609 nr_reclaimed=2544
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6b6a0
with flags=100000000002004D
[ 70.235916] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69720
with flags=100000000002004D
[ 70.235920] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a088
with flags=100000000002004D
[ 70.236115] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6cf58
with flags=100000000002004D
[ 70.237922] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6cf90
with flags=100000000002004D
[ 70.237926] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69758
with flags=100000000002004D
[ 70.237929] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a130
with flags=100000000002004D
[ 70.239910] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6cfc8
with flags=100000000002004D
[ 70.239914] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d697c8
with flags=100000000002004D
[ 70.239917] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a168
with flags=100000000002004D
[ 70.241908] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69800
with flags=100000000002004D
[ 70.241911] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a1a0
with flags=100000000002004D
[ 70.241917] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d000
with flags=100000000002004D
[ 70.243906] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d038
with flags=100000000002004D
[ 70.243909] shrink_page_list (nr_scanned=43013 nr_reclaimed=3247
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d69838
with flags=100000000002004D
[ 70.243913] shrink_page_list (nr_scanned=2405 nr_reclaimed=458
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6a408
with flags=100000000002004D
[ 70.245906] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d070
with flags=100000000002004D
[ 70.245977] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e1f0
with flags=100000000002004D
[ 70.245982] shrink_page_list (nr_scanned=2456 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6d428
with flags=100000000002004D
[ 70.247909] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e228
with flags=100000000002004D
[ 70.247912] shrink_page_list (nr_scanned=2456 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d6d508
with flags=100000000002004D
[ 70.247915] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d0a8
with flags=100000000002004D
[ 70.249897] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d230
with flags=100000000002004D
[ 70.249901] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e260
with flags=100000000002004D
[ 70.249941] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d70330
with flags=100000000002004D
[ 70.251895] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e298
with flags=100000000002004D
[ 70.251899] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d702f8
with flags=100000000002004D
[ 70.251911] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d2a0
with flags=100000000002004D
[ 70.253891] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d2d8
with flags=100000000002004D
[ 70.253895] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d70288
with flags=100000000002004D
[ 70.253898] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e2d0
with flags=100000000002004D
[ 70.255888] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d310
with flags=100000000002004D
[ 70.255893] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e308
with flags=100000000002004D
[ 70.255896] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d70250
with flags=100000000002004D
[ 70.257896] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e420
with flags=100000000002004D
[ 70.257900] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d70218
with flags=100000000002004D
[ 70.257903] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d348
with flags=100000000002004D
[ 70.259885] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e458
with flags=100000000002004D
[ 70.259889] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d701a8
with flags=100000000002004D
[ 70.259892] shrink_page_list (nr_scanned=29846 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d6d380
with flags=100000000002004D
[ 70.261883] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e490
with flags=100000000002004D
[ 70.261886] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d70138
with flags=100000000002004D
[ 70.261971] shrink_page_list (nr_scanned=29929 nr_reclaimed=2578
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d726a0
with flags=100000000002004D
[ 70.263882] shrink_page_list (nr_scanned=2510 nr_reclaimed=502
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d700c8
with flags=100000000002004D
[ 70.263976] shrink_page_list (nr_scanned=43067 nr_reclaimed=3282
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d6e650
with flags=100000000002004D
[ 70.264520] shrink_page_list (nr_scanned=30546 nr_reclaimed=2709
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d4dad8
with flags=100000000002004D
[ 70.266038] shrink_page_list (nr_scanned=30674 nr_reclaimed=2741
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d50bd8
with flags=100000000002004D
[ 70.266122] shrink_page_list (nr_scanned=43361 nr_reclaimed=3364
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d51818
with flags=100000000002004D
[ 70.266387] shrink_page_list (nr_scanned=2848 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d57890
with flags=100000000002004D
[ 70.268009] shrink_page_list (nr_scanned=30754 nr_reclaimed=2741
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d57d28
with flags=100000000002004D
[ 70.268014] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d42eb0
with flags=100000000002004D
[ 70.268070] shrink_page_list (nr_scanned=43559 nr_reclaimed=3419
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d433b8
with flags=100000000002004D
[ 70.269875] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d42ee8
with flags=100000000002004D
[ 70.270288] shrink_page_list (nr_scanned=44119 nr_reclaimed=3492
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d1f350
with flags=100000000002004D
[ 70.270814] shrink_page_list (nr_scanned=31538 nr_reclaimed=2904
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d08bb0
with flags=100000000002004D
[ 70.271870] shrink_page_list (nr_scanned=44119 nr_reclaimed=3492
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d1f318
with flags=100000000002004D
[ 70.271874] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d42f20
with flags=100000000002004D
[ 70.271963] shrink_page_list (nr_scanned=31617 nr_reclaimed=2904
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d08c58
with flags=100000000002004D
[ 70.273867] shrink_page_list (nr_scanned=44119 nr_reclaimed=3492
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d1f2e0
with flags=100000000002004D
[ 70.273870] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d42fc8
with flags=100000000002004D
[ 70.273874] shrink_page_list (nr_scanned=31617 nr_reclaimed=2904
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d08d38
with flags=100000000002004D
[ 70.275864] shrink_page_list (nr_scanned=44119 nr_reclaimed=3492
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000d1f2a8
with flags=100000000002004D
[ 70.275867] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d431c0
with flags=100000000002004D
[ 70.275870] shrink_page_list (nr_scanned=31617 nr_reclaimed=2904
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000d08ec0
with flags=100000000002004D
[ 70.277926] shrink_page_list (nr_scanned=2904 nr_reclaimed=627
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000d431f8
with flags=100000000002004D
[ 70.278125] shrink_page_list (nr_scanned=44344 nr_reclaimed=3492
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000cf79d0
with flags=100000000002004D
[ 70.278222] shrink_page_list (nr_scanned=31962 nr_reclaimed=2978
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000cf7e30
with flags=100000000002004D
[ 70.279858] shrink_page_list (nr_scanned=31962 nr_reclaimed=2978
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000cf7f80
with flags=100000000002004D
[ 70.279930] shrink_page_list (nr_scanned=2954 nr_reclaimed=664
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000cf8fb0
with flags=100000000002004D
[ 70.281855] shrink_page_list (nr_scanned=31962 nr_reclaimed=2978
nr_to_reclaim=32 gfp_mask=11212) found inactive page ffffea0000cf7fb8
with flags=100000000002004D
[ 70.286255] shrink_page_list (nr_scanned=6204 nr_reclaimed=1203
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000eca388
with flags=100000000002004D
[ 70.287863] shrink_page_list (nr_scanned=6204 nr_reclaimed=1203
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000eca350
with flags=100000000002004D
[ 70.289847] shrink_page_list (nr_scanned=6204 nr_reclaimed=1203
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000eca318
with flags=100000000002004D
[ 70.290123] shrink_page_list (nr_scanned=58419 nr_reclaimed=4751
nr_to_reclaim=32 gfp_mask=11210) found inactive page ffffea0000ed8200
with flags=1000000000000841
[ 70.291845] shrink_page_list (nr_scanned=6204 nr_reclaimed=1203
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea0000eca2e0
with flags=100000000002004D
[ 70.400259] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9eb8
with flags=100000000002004D
[ 70.403707] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9ef0
with flags=100000000002004D
[ 70.406705] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9f60
with flags=100000000002004D
[ 70.409706] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9f98
with flags=100000000002004D
[ 70.412711] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000de9fd0
with flags=100000000002004D
[ 70.415697] shrink_page_list (nr_scanned=618 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0000dea008
with flags=100000000002004D
[ 70.418828] shrink_page_list (nr_scanned=682 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea0001a4f650
with flags=1000000000020849
[ 70.421696] shrink_page_list (nr_scanned=682 nr_reclaimed=117
nr_to_reclaim=32 gfp_mask=2005A) found inactive page ffffea00000824b0
with flags=1000000000020849

Right after that happened, I hit ctrl-c to kill test_mempressure.sh.
The system was OK until I typed sync, and then everything hung.

I'm really confused. shrink_inactive_list in
RECLAIM_MODE_LUMPYRECLAIM will call one of the isolate_pages functions
with ISOLATE_BOTH. The resulting list goes into shrink_page_list,
which does VM_BUG_ON(PageActive(page)).

How is that supposed to work?

--Andy

2011-05-20 03:12:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> Right after that happened, I hit ctrl-c to kill test_mempressure.sh.
> The system was OK until I typed sync, and then everything hung.
>
> I'm really confused. shrink_inactive_list in
> RECLAIM_MODE_LUMPYRECLAIM will call one of the isolate_pages functions
> with ISOLATE_BOTH. The resulting list goes into shrink_page_list,
> which does VM_BUG_ON(PageActive(page)).
>
> How is that supposed to work?

Usually clear_active_flags() clears PG_active before calling shrink_page_list().

shrink_inactive_list()
    isolate_pages_global()
    update_isolated_counts()
        clear_active_flags()
    shrink_page_list()
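
For reference, that helper in the 2.6.38-era mm/vmscan.c looked roughly
like the sketch below (reconstructed from memory, so treat it as an
approximation rather than the exact source): it counts the isolated
pages per LRU type and clears PG_active so that shrink_page_list()
should never see an active page.

/*
 * Sketch of the 2.6.38-era clear_active_flags() (approximate, from memory):
 * count pages on the isolated list per LRU type and clear PG_active.
 */
static unsigned long clear_active_flags(struct list_head *page_list,
					unsigned int *count)
{
	int nr_active = 0;
	int lru;
	struct page *page;

	list_for_each_entry(page, page_list, lru) {
		int numpages = hpage_nr_pages(page);

		lru = page_lru_base_type(page);
		if (PageActive(page)) {
			lru += LRU_ACTIVE;
			ClearPageActive(page);
			nr_active += numpages;
		}
		if (count)
			count[lru] += numpages;
	}

	return nr_active;
}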


2011-05-20 03:38:30

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Thu, May 19, 2011 at 11:12 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Right after that happened, I hit ctrl-c to kill test_mempressure.sh.
>> The system was OK until I typed sync, and then everything hung.
>>
>> I'm really confused.  shrink_inactive_list in
>> RECLAIM_MODE_LUMPYRECLAIM will call one of the isolate_pages functions
>> with ISOLATE_BOTH.  The resulting list goes into shrink_page_list,
>> which does VM_BUG_ON(PageActive(page)).
>>
>> How is that supposed to work?
>
> Usually clear_active_flags() clears PG_active before calling
> shrink_page_list().
>
> shrink_inactive_list()
>     isolate_pages_global()
>     update_isolated_counts()
>         clear_active_flags()
>     shrink_page_list()
>
>

That makes sense. And I have CONFIG_COMPACTION=y, so the lumpy mode
doesn't get set anyway.

But the pages I'm seeing have flags=100000000008005D. If I'm reading
it right, that means locked,referenced,uptodate,dirty,active. How
does a page like that end up in shrink_page_list? I don't see how a
page that's !PageLRU can get marked Active. Nonetheless, I'm hitting
that VM_BUG_ON.
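
(For reference, the decoding above can be reproduced with a small
user-space helper like this hypothetical one; it assumes the 2.6.38-era
x86_64 ordering of the low page-flag bits, which is an assumption on my
part rather than something taken from the oops itself.)

/* decode_flags.c -- hypothetical helper, assuming the low bits are
 * ordered: locked, error, referenced, uptodate, dirty, lru, active, slab. */
#include <stdio.h>
#include <stdlib.h>

static const char *names[] = {
	"locked", "error", "referenced", "uptodate",
	"dirty", "lru", "active", "slab",
};

int main(int argc, char **argv)
{
	unsigned long flags;
	int i;

	if (argc < 2)
		return 1;
	flags = strtoul(argv[1], NULL, 16);
	for (i = 0; i < 8; i++)
		if (flags & (1UL << i))
			printf("%s ", names[i]);
	printf("\n");
	return 0;
}

/* $ ./decode_flags 100000000008005D
 * locked referenced uptodate dirty active
 */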

Is there a race somewhere?

--Andy

2011-05-20 04:20:17

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 12:38 PM, Andrew Lutomirski <[email protected]> wrote:
> On Thu, May 19, 2011 at 11:12 PM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> Right after that happened, I hit ctrl-c to kill test_mempressure.sh.
>>> The system was OK until I typed sync, and then everything hung.
>>>
>>> I'm really confused.  shrink_inactive_list in
>>> RECLAIM_MODE_LUMPYRECLAIM will call one of the isolate_pages functions
>>> with ISOLATE_BOTH.  The resulting list goes into shrink_page_list,
>>> which does VM_BUG_ON(PageActive(page)).
>>>
>>> How is that supposed to work?
>>
>> Usually clear_active_flags() clears PG_active before calling
>> shrink_page_list().
>>
>> shrink_inactive_list()
>>    isolate_pages_global()
>>    update_isolated_counts()
>>        clear_active_flags()
>>    shrink_page_list()
>>
>>
>
> That makes sense.  And I have CONFIG_COMPACTION=y, so the lumpy mode
> doesn't get set anyway.

Do you still see the problem with CONFIG_COMPACTION disabled?

>
> But the pages I'm seeing have flags=100000000008005D.  If I'm reading
> it right, that means locked,referenced,uptodate,dirty,active.  How
> does a page like that end up in shrink_page_list?  I don't see how a
> page that's !PageLRU can get marked Active.  Nonetheless, I'm hitting
> that VM_BUG_ON.

Thanks for confirming that it's not a problem with my latest patch.

>
> Is there a race somewhere?

First of all, let's finish off your original problem, the hang. :)
And let's start another thread to fix this new problem.

I think this is a severe problem because 2.6.39 includes my deactivate_pages work
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=315601809d124d046abd6c3ffa346d0dbd7aa29d),
which touches page state even more. (2.6.38.6 doesn't include it, so
this is not a problem with deactivate_pages itself.)
And the inorder-putback series, which I will push for 2.6.40, touches
it more still.

So I want to resolve your problem ASAP.
We haven't seen any other report of this. Could you do a git bisect?
FYI, the recent big changes in mm are compaction and transparent huge pages.
Kame, could you point out anything related to memcg, if anything comes to mind?

>
> --Andy
>



--
Kind regards,
Minchan Kim

2011-05-20 05:15:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, 20 May 2011 13:20:15 +0900
Minchan Kim <[email protected]> wrote:

> So I want to resolve your problem ASAP.
> We haven't seen any other report of this. Could you do a git bisect?
> FYI, the recent big changes in mm are compaction and transparent huge pages.
> Kame, could you point out anything related to memcg, if anything comes to mind?
>

I don't suspect memcg at this stage because it never modifies page->flags.
Considering the case, PageActive() is being set on off-LRU pages after
clear_active_flags() has cleared it.

Hmm, I don't think I fully understand the locking here, but... what do
you think of this?

==

When splitting a hugepage, the routine marks all pmds as "splitting".

But assume a racy case where two threads run into the split at the
same time: one thread wins compound_lock() and does the split; the
other thread should not touch the already-split pages.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Index: mmotm-May11/mm/huge_memory.c
===================================================================
--- mmotm-May11.orig/mm/huge_memory.c
+++ mmotm-May11/mm/huge_memory.c
@@ -1150,7 +1150,7 @@ static int __split_huge_page_splitting(s
 	return ret;
 }
 
-static void __split_huge_page_refcount(struct page *page)
+static bool __split_huge_page_refcount(struct page *page)
 {
 	int i;
 	unsigned long head_index = page->index;
@@ -1161,6 +1161,11 @@ static void __split_huge_page_refcount(s
 	spin_lock_irq(&zone->lru_lock);
 	compound_lock(page);
 
+	if (!PageCompound(page)) {
+		compound_unlock(page);
+		spin_unlock_irq(&zone->lru_lock);
+		return false;
+	}
 	for (i = 1; i < HPAGE_PMD_NR; i++) {
 		struct page *page_tail = page + i;
 
@@ -1258,6 +1263,7 @@ static void __split_huge_page_refcount(s
 	 * to be pinned by the caller.
 	 */
 	BUG_ON(page_count(page) <= 0);
+	return true;
 }
 
 static int __split_huge_page_map(struct page *page,
@@ -1367,7 +1373,8 @@ static void __split_huge_page(struct pag
 			mapcount, page_mapcount(page));
 	BUG_ON(mapcount != page_mapcount(page));
 
-	__split_huge_page_refcount(page);
+	if (!__split_huge_page_refcount(page))
+		return;
 
 	mapcount2 = 0;
 	list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {

2011-05-20 05:36:15

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 2:08 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 20 May 2011 13:20:15 +0900
> Minchan Kim <[email protected]> wrote:
>
>> So I want to resolve your problem ASAP.
>> We haven't seen any other report of this. Could you do a git bisect?
>> FYI, the recent big changes in mm are compaction and transparent huge pages.
>> Kame, could you point out anything related to memcg, if anything comes to mind?
>>
>
> I don't suspect memcg at this stage because it never modifies page->flags.
> Considering the case, PageActive() is being set on off-LRU pages after
> clear_active_flags() has cleared it.
>
> Hmm, I don't think I fully understand the locking here, but... what do
> you think of this?
>
> ==
>
> When splitting a hugepage, the routine marks all pmds as "splitting".
>
> But assume a racy case where two threads run into the split at the
> same time: one thread wins compound_lock() and does the split; the
> other thread should not touch the already-split pages.

Sorry, I don't have time to review it in detail right now.
At a rough look, page_lock_anon_vma should prevent it.
But Andrea needs to look at this problem, and he will catch anything we missed. :)


>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Index: mmotm-May11/mm/huge_memory.c
> ===================================================================
> --- mmotm-May11.orig/mm/huge_memory.c
> +++ mmotm-May11/mm/huge_memory.c
> @@ -1150,7 +1150,7 @@ static int __split_huge_page_splitting(s
>        return ret;
>  }
>
> -static void __split_huge_page_refcount(struct page *page)
> +static bool __split_huge_page_refcount(struct page *page)
>  {
>        int i;
>        unsigned long head_index = page->index;
> @@ -1161,6 +1161,11 @@ static void __split_huge_page_refcount(s
>        spin_lock_irq(&zone->lru_lock);
>        compound_lock(page);
>
> +       if (!PageCompound(page)) {
> +               compound_unlock(page);
> +               spin_unlock_irq(&zone->lru_lock);
> +               return false;
> +       }
>        for (i = 1; i < HPAGE_PMD_NR; i++) {
>                struct page *page_tail = page + i;
>
> @@ -1258,6 +1263,7 @@ static void __split_huge_page_refcount(s
>         * to be pinned by the caller.
>         */
>        BUG_ON(page_count(page) <= 0);
> +       return true;
>  }
>
>  static int __split_huge_page_map(struct page *page,
> @@ -1367,7 +1373,8 @@ static void __split_huge_page(struct pag
>                       mapcount, page_mapcount(page));
>        BUG_ON(mapcount != page_mapcount(page));
>
> -       __split_huge_page_refcount(page);
> +       if (!__split_huge_page_refcount(page))
> +               return;
>
>        mapcount2 = 0;
>        list_for_each_entry(avc, &anon_vma->head, same_anon_vma) {
>
>



--
Kind regards,
Minchan Kim

2011-05-20 07:50:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, 20 May 2011 14:36:13 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, May 20, 2011 at 2:08 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Fri, 20 May 2011 13:20:15 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> So I want to resolve your problem ASAP.
> >> We haven't seen any other report of this. Could you do a git bisect?
> >> FYI, the recent big changes in mm are compaction and transparent huge pages.
> >> Kame, could you point out anything related to memcg, if anything comes to mind?
> >>
> >
> > I don't suspect memcg at this stage because it never modifies page->flags.
> > Considering the case, PageActive() is being set on off-LRU pages after
> > clear_active_flags() has cleared it.
> >
> > Hmm, I don't think I fully understand the locking here, but... what do
> > you think of this?
> >
> > ==
> >
> > When splitting a hugepage, the routine marks all pmds as "splitting".
> >
> > But assume a racy case where two threads run into the split at the
> > same time: one thread wins compound_lock() and does the split; the
> > other thread should not touch the already-split pages.
>
> Sorry, I don't have time to review it in detail right now.
> At a rough look, page_lock_anon_vma should prevent it.
> But Andrea is following this problem and he will catch anything we missed. :)
>
Hmm, maybe I'm missing something... I need to build a test environment on my side.
But I'm not sure I can reproduce it.

Thanks,
-Kame

2011-05-20 10:11:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 02:08:56PM +0900, KAMEZAWA Hiroyuki wrote:
> +       if (!PageCompound(page)) {
> +               compound_unlock(page);
> +               spin_unlock_irq(&zone->lru_lock);
> +               return false;
> +       }

If you turn this into a BUG_ON(!PageCompound(page)) I'm ok with it. But it
was never supposed to happen, so the check above shouldn't be needed.

This very check is done in split_huge_page after taking the root
anon_vma lock. And every other thread or process sharing the page has
to take the anon_vma lock, and then check PageCompound too before it
can proceed into __split_huge_page. So I don't see a problem but
please add the BUG_ON if you are concerned. A BUG_ON definitely can't
hurt. Also note, __split_huge_page is static and is only called by
split_huge_page which does the check after proper locking.

        if (!PageCompound(page))
                goto out_unlock;
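
For context, a rough sketch of the ordering described above, simplified
from the 2.6.38-era split_huge_page() (error paths trimmed; treat this
as an illustration, not a verbatim quote):

/* Sketch: the PageCompound() check only happens after the root
 * anon_vma lock is taken, so two racing splitters serialize here
 * and the loser sees !PageCompound() and bails out. */
int split_huge_page(struct page *page)
{
        struct anon_vma *anon_vma;
        int ret = 1;

        anon_vma = page_lock_anon_vma(page);    /* every splitter takes this */
        if (!anon_vma)
                goto out;
        ret = 0;
        if (!PageCompound(page))                /* already split by the winner */
                goto out_unlock;

        __split_huge_page(page, anon_vma);      /* safe: the lock is held */
        BUG_ON(PageCompound(page));
out_unlock:
        page_unlock_anon_vma(anon_vma);
out:
        return ret;
}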

I figure it's not easily reproducible but you can easily rule out THP
issues by reproducing at least once after booting with
transparent_hugepage=never or by building the kernel with
CONFIG_TRANSPARENT_HUGEPAGE=n.

I'm afraid we might have some lru active/inactive/isolated accounting
issue around vmstat.c, so that's the part of the code I'd recommend
reviewing (I checked it and didn't see anything wrong yet, not even in
the THP context, but I'm still worried we have a statistics issue
somewhere). During -rc I had a bug report from two people (one was a UP
build and one was an SMP build), not easily reproducible either, that
hinted at a possible nr_isolated* or nr_inactive* counter (or both)
being wrong (not sure if _anon or _file; it could be just one lru type
or both). If the stats are off, that may also trigger the oom killer by
making the VM shrinking path (which also activates the swapping) bail
out early, thinking it can't shrink any more. It could be the same
statistics problem that sometimes makes the VM think it can't shrink any
more and leads to early oom killing, and at other times loops
indefinitely in too_many_isolated if nr_isolated_X > nr_inactive_X stays
true for __GFP_NO_KSWAPD allocations (kswapd is immune from that loop,
so if kswapd is allowed to run, it probably then increases nr_inactive
by deactivating enough pages to unblock it). Just a wild guess...
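
For reference, the too_many_isolated() throttle being alluded to looks
roughly like this (condensed from the 2.6.38-era mm/vmscan.c; details
such as the global-LRU check are omitted):

static int too_many_isolated(struct zone *zone, int file,
                             struct scan_control *sc)
{
        unsigned long inactive, isolated;

        /* kswapd is never throttled here, as noted above */
        if (current_is_kswapd())
                return 0;

        if (file) {
                inactive = zone_page_state(zone, NR_INACTIVE_FILE);
                isolated = zone_page_state(zone, NR_ISOLATED_FILE);
        } else {
                inactive = zone_page_state(zone, NR_INACTIVE_ANON);
                isolated = zone_page_state(zone, NR_ISOLATED_ANON);
        }

        /* if these counters are skewed, this can stay true forever */
        return isolated > inactive;
}

and its caller in shrink_inactive_list() spins on it:

        while (unlikely(too_many_isolated(zone, file, sc))) {
                congestion_wait(BLK_RW_ASYNC, HZ/10);
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }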

2011-05-20 10:40:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Wed, May 11, 2011 at 04:07:40PM -0700, Andi Kleen wrote:
> FWIW I had problems swapping over dm-crypt for a long time -- not
> quite as severe as yours. Never really tracked it down.

I use swap over dm-crypt (cryptsetup on a raw device) without apparent
problems; I'd like to consider that a safe setup.

2011-05-20 14:12:09

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 6:11 AM, Andrea Arcangeli <[email protected]> wrote:
> I figure it's not easily reproducible but you can easily rule out THP
> issues by reproducing at least once after booting with
> transparent_hugepage=never or by building the kernel with
> CONFIG_TRANSPARENT_HUGEPAGE=n.

Reproduced with CONFIG_TRANSPARENT_HUGEPAGE=n with and without
compaction and migration.

I applied the attached patch (which includes Minchan's !pgdat_balanced
and need_resched changes). I see:

[ 121.468339] firefox shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea00019217a8) w/ prev = 100000000002000D
[ 121.469236] firefox shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea00016596b8) w/ prev = 100000000002000D
[ 121.470207] firefox: shrink_page_list (nr_scanned=94
nr_reclaimed=19 nr_to_reclaim=32 gfp_mask=201DA) found inactive page
ffffea00019217a8 with flags=100000000002004D
[ 121.472451] firefox: shrink_page_list (nr_scanned=94
nr_reclaimed=19 nr_to_reclaim=32 gfp_mask=201DA) found inactive page
ffffea00016596b8 with flags=100000000002004D
[ 121.482782] dd shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea00013a8938) w/ prev = 100000000002000D
[ 121.489820] dd shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea00017a4e88) w/ prev = 1000000000000801
[ 121.490626] dd shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000005edb0) w/ prev = 1000000000000801
[ 121.491499] dd: shrink_page_list (nr_scanned=62 nr_reclaimed=0
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea00017a4e88
with flags=1000000000000841
[ 121.494337] dd: shrink_page_list (nr_scanned=62 nr_reclaimed=0
nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea000005edb0
with flags=1000000000000841
[ 121.499219] dd shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000129c788) w/ prev = 1000000000080009
[ 121.500363] dd shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000129c830) w/ prev = 1000000000080009
[ 121.502270] kswapd0 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0001146470) w/ prev = 100000000008001D
[ 121.661545] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0000058168) w/ prev = 1000000000000801
[ 121.662791] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000166f288) w/ prev = 1000000000000801
[ 121.665727] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0001681c40) w/ prev = 1000000000000801
[ 121.666857] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0001693130) w/ prev = 1000000000000801
[ 121.667988] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0000c790d8) w/ prev = 1000000000000801
[ 121.669105] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000113fe48) w/ prev = 1000000000000801
[ 121.670238] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea0000058168 with flags=1000000000000841
[ 121.674061] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea000166f288 with flags=1000000000000841
[ 121.678054] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea0001681c40 with flags=1000000000000841
[ 121.682069] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea0001693130 with flags=1000000000000841
[ 121.686074] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea0000c790d8 with flags=1000000000000841
[ 121.690045] kworker/1:1: shrink_page_list (nr_scanned=102
nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
ffffea000113fe48 with flags=1000000000000841
[ 121.866205] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000165d5b8) w/ prev = 100000000002000D
[ 121.868204] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0001661288) w/ prev = 100000000002000D
[ 121.870203] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0001661250) w/ prev = 100000000002000D
[ 121.872195] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea000100cee8) w/ prev = 100000000002000D
[ 121.873486] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0000eafab8) w/ prev = 100000000002000D
[ 121.874718] test_mempressur shrink_page_list+0x4f3/0x5ca:
SetPageActive(ffffea0000eafaf0) w/ prev = 100000000002000D

This is interesting: it looks like shrink_page_list is making its way
through the list more than once. It could be reentering itself
somehow or it could have something screwed up with the linked list.

I'll keep slowly debugging, but maybe this is enough for someone
familiar with this code to beat me to it.

Minchan, I think this means that your fixes are just hiding and not
fixing the underlying problem.
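
The exact instrumentation is in the attached vm_tests.patch; roughly, it
amounts to a check of the following shape near the top of the
shrink_page_list() loop (this sketch is illustrative only -- the message
format and placement here are assumptions, not the actual patch):

        /* illustrative only: complain if a page that a previous pass
         * already activated is handed to shrink_page_list() again */
        if (PageActive(page))
                printk(KERN_WARNING
                       "%s: shrink_page_list found active page %p flags=%lX\n",
                       current->comm, page, page->flags);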


Attachments:
vm_tests.patch (3.00 kB)

2011-05-20 15:33:57

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 10:11:47AM -0400, Andrew Lutomirski wrote:
> On Fri, May 20, 2011 at 6:11 AM, Andrea Arcangeli <[email protected]> wrote:
> > I figure it's not easily reproducible but you can easily rule out THP
> > issues by reproducing at least once after booting with
> > transparent_hugepage=never or by building the kernel with
> > CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> Reproduced with CONFIG_TRANSPARENT_HUGEPAGE=n with and without
> compaction and migration.
>
> I applied the attached patch (which includes Minchan's !pgdat_balanced
> and need_resched changes). I see:
>
> [ 121.468339] firefox shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea00019217a8) w/ prev = 100000000002000D
> [ 121.469236] firefox shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea00016596b8) w/ prev = 100000000002000D
> [ 121.470207] firefox: shrink_page_list (nr_scanned=94
> nr_reclaimed=19 nr_to_reclaim=32 gfp_mask=201DA) found inactive page
> ffffea00019217a8 with flags=100000000002004D
> [ 121.472451] firefox: shrink_page_list (nr_scanned=94
> nr_reclaimed=19 nr_to_reclaim=32 gfp_mask=201DA) found inactive page
> ffffea00016596b8 with flags=100000000002004D
> [ 121.482782] dd shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea00013a8938) w/ prev = 100000000002000D
> [ 121.489820] dd shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea00017a4e88) w/ prev = 1000000000000801
> [ 121.490626] dd shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000005edb0) w/ prev = 1000000000000801
> [ 121.491499] dd: shrink_page_list (nr_scanned=62 nr_reclaimed=0
> nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea00017a4e88
> with flags=1000000000000841
> [ 121.494337] dd: shrink_page_list (nr_scanned=62 nr_reclaimed=0
> nr_to_reclaim=32 gfp_mask=200D2) found inactive page ffffea000005edb0
> with flags=1000000000000841
> [ 121.499219] dd shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000129c788) w/ prev = 1000000000080009
> [ 121.500363] dd shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000129c830) w/ prev = 1000000000080009
> [ 121.502270] kswapd0 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0001146470) w/ prev = 100000000008001D
> [ 121.661545] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0000058168) w/ prev = 1000000000000801
> [ 121.662791] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000166f288) w/ prev = 1000000000000801
> [ 121.665727] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0001681c40) w/ prev = 1000000000000801
> [ 121.666857] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0001693130) w/ prev = 1000000000000801
> [ 121.667988] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0000c790d8) w/ prev = 1000000000000801
> [ 121.669105] kworker/1:1 shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000113fe48) w/ prev = 1000000000000801
> [ 121.670238] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea0000058168 with flags=1000000000000841
> [ 121.674061] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea000166f288 with flags=1000000000000841
> [ 121.678054] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea0001681c40 with flags=1000000000000841
> [ 121.682069] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea0001693130 with flags=1000000000000841
> [ 121.686074] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea0000c790d8 with flags=1000000000000841
> [ 121.690045] kworker/1:1: shrink_page_list (nr_scanned=102
> nr_reclaimed=20 nr_to_reclaim=32 gfp_mask=11212) found inactive page
> ffffea000113fe48 with flags=1000000000000841
> [ 121.866205] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000165d5b8) w/ prev = 100000000002000D
> [ 121.868204] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0001661288) w/ prev = 100000000002000D
> [ 121.870203] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0001661250) w/ prev = 100000000002000D
> [ 121.872195] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea000100cee8) w/ prev = 100000000002000D
> [ 121.873486] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0000eafab8) w/ prev = 100000000002000D
> [ 121.874718] test_mempressur shrink_page_list+0x4f3/0x5ca:
> SetPageActive(ffffea0000eafaf0) w/ prev = 100000000002000D
>
> This is interesting: it looks like shrink_page_list is making its way
> through the list more than once. It could be reentering itself
> somehow or it could have something screwed up with the linked list.
>
> I'll keep slowly debugging, but maybe this is enough for someone
> familiar with this code to beat me to it.
>
> Minchan, I think this means that your fixes are just hiding and not
> fixing the underlying problem.

Could you test with the patch below?

If this patch fixes it, I don't know why we are only seeing this problem now.
It should have been a problem for a long time.

>From b7d7ca54b3ed914723cc54d1c3bcd937e5f08e3a Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Sat, 21 May 2011 00:28:00 +0900
Subject: [BUG fix] vmscan: Clear PageActive before synchronous shrink_page_list

Normally, shrink_page_list doesn't reclaim working-set pages (i.e., PG_referenced),
so such pages should go back to the active lru list;
to that end, shrink_page_list does SetPageActive on them.
Sometimes we can ignore that and try to reclaim them anyway when we reclaim high-order pages
through a consecutive, second, synchronous call of shrink_page_list.
At that point, the pages which have PG_active set can be caught by VM_BUG_ON(PageActive(page))
in shrink_page_list.

This patch clears PG_active before entering the synchronous shrink_page_list.

Reported-by: Andrew Lutomirski <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bfd450..a5c01e9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
+ unsigned long nr_active;
set_reclaim_mode(priority, sc, true);
+ nr_active = clear_active_flags(&page_list, NULL);
+ count_vm_events(PGDEACTIVATE, nr_active);
nr_reclaimed += shrink_page_list(&page_list, zone, sc);
}

--
1.7.1

--
Kind regards,
Minchan Kim

2011-05-20 16:01:33

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 11:33 AM, Minchan Kim <[email protected]> wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8bfd450..a5c01e9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
>        /* Check if we should syncronously wait for writeback */
>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> +               unsigned long nr_active;
>                set_reclaim_mode(priority, sc, true);
> +               nr_active = clear_active_flags(&page_list, NULL);
> +               count_vm_events(PGDEACTIVATE, nr_active);
>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>        }
>
> --

I'm now running that patch *without* the pgdat_balanced fix or the
need_resched check. The VM_BUG_ON doesn't happen but I still get
incorrect OOM kills.

However, if I replace the check with:

        if (false && should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {

then my system lags under bad memory pressure but recovers without
OOMs or oopses.

Is that expected?

--Andy

> 1.7.1
>
> --
> Kind regards,
> Minchan Kim
>

2011-05-20 16:19:45

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 12:01:12PM -0400, Andrew Lutomirski wrote:
> On Fri, May 20, 2011 at 11:33 AM, Minchan Kim <[email protected]> wrote:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8bfd450..a5c01e9 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> >
> >        /* Check if we should syncronously wait for writeback */
> >        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> > +               unsigned long nr_active;
> >                set_reclaim_mode(priority, sc, true);
> > +               nr_active = clear_active_flags(&page_list, NULL);
> > +               count_vm_events(PGDEACTIVATE, nr_active);
> >                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> >        }
> >
> > --
>
> I'm now running that patch *without* the pgdat_balanced fix or the
> need_resched check. The VM_BUG_ON doesn't happen but I still get

Please forget about need_resched.
Instead, could you test the shrink_slab patch together with the !pgdat_balanced fix?

@@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		/* Assume we'll be able to shrink next time */
+		ret = 1;
+		goto out;
+	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }

> incorrect OOM kills.
>
> However, if I replace the check with:
>
>       if (false && should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>
> then my system lags under bad memory pressure but recovers without
> OOMs or oopses.

I can see how you'd get an OOM, but an oops? Did you see any oops?

>
> Is that expected?


No.. :(

It's totally the opposite.
That routine is there to reclaim memory even though we lose latency.
It's another issue. :(

>
> --Andy
>
> > 1.7.1
> >
> > --
> > Kind regards,
> > Minchan Kim
> >

--
Kind regards,
Minchan Kim

2011-05-20 18:10:05

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 12:19 PM, Minchan Kim <[email protected]> wrote:
> On Fri, May 20, 2011 at 12:01:12PM -0400, Andrew Lutomirski wrote:
>> On Fri, May 20, 2011 at 11:33 AM, Minchan Kim <[email protected]> wrote:
>>
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 8bfd450..a5c01e9 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1430,7 +1430,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>> >
>> >        /* Check if we should syncronously wait for writeback */
>> >        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>> > +               unsigned long nr_active;
>> >                set_reclaim_mode(priority, sc, true);
>> > +               nr_active = clear_active_flags(&page_list, NULL);
>> > +               count_vm_events(PGDEACTIVATE, nr_active);
>> >                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>> >        }
>> >
>> > --
>>
>> I'm now running that patch *without* the pgdat_balanced fix or the
>> need_resched check.  The VM_BUG_ON doesn't happen but I still get
>
> Please forget about need_resched.
> Instead, could you test the shrink_slab patch together with the !pgdat_balanced fix?
>
> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>       if (scanned == 0)
>               scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               /* Assume we'll be able to shrink next time */
> +               ret = 1;
> +               goto out;
> +       }
>
>       list_for_each_entry(shrinker, &shrinker_list, list) {
>               unsigned long long delta;
> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>               shrinker->nr += total_scan;
>       }
>       up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>       return ret;
>  }
>
>> incorrect OOM kills.
>>
>> However, if I replace the check with:
>>
>>       if (false && should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>>
>> then my system lags under bad memory pressure but recovers without
>> OOMs or oopses.
>
> I can see how you'd get an OOM, but an oops? Did you see any oops?

No oops. I've now reproduced the OOPS with both the if (false) change
and the clear_active_flags change.

Also, would this version be better? I think your version overcounts
nr_scanned, but I'm not sure what effect that would have.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f44b81..d1dabc9 100644
@@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
+		unsigned long nr_active, old_nr_scanned;
 		set_reclaim_mode(priority, sc, true);
+		nr_active = clear_active_flags(&page_list, NULL);
+		count_vm_events(PGDEACTIVATE, nr_active);
+		old_nr_scanned = sc->nr_scanned;
 		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+		sc->nr_scanned = old_nr_scanned;
 	}
 
 	local_irq_disable();

I just tested 2.6.38.6 with the attached patch. It survived dirty_ram
and test_mempressure without any problems other than slowness, but
when I hit ctrl-c to stop test_mempressure, I got the attached oom.

--Andy


Attachments:
test.patch (2.54 kB)
oom.txt.xz (19.23 kB)
Download all attachments

2011-05-20 18:40:51

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Fri, May 20, 2011 at 2:09 PM, Andrew Lutomirski <[email protected]> wrote:
> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
> and test_mempressure without any problems other than slowness, but
> when I hit ctrl-c to stop test_mempressure, I got the attached oom.

Reproduced with CONFIG_CGROUP_MEM_RES_CTLR=n.

--Andy

2011-05-21 12:04:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3f44b81..d1dabc9 100644
> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone,
>
>        /* Check if we should syncronously wait for writeback */
>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> +               unsigned long nr_active, old_nr_scanned;
>                set_reclaim_mode(priority, sc, true);
> +               nr_active = clear_active_flags(&page_list, NULL);
> +               count_vm_events(PGDEACTIVATE, nr_active);
> +               old_nr_scanned = sc->nr_scanned;
>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
> +               sc->nr_scanned = old_nr_scanned;
>        }
>
>        local_irq_disable();
>
> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
> and test_mempressure without any problems other than slowness, but
> when I hit ctrl-c to stop test_mempressure, I got the attached oom.

Minchan,

I'm confused now.
If pages got SetPageActive(), should_reclaim_stall() should never return true.
Can you please explain which bad scenario happened?

-----------------------------------------------------------------------------------------------------
static void reset_reclaim_mode(struct scan_control *sc)
{
	sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
}

shrink_page_list()
{
 (snip)
 activate_locked:
		SetPageActive(page);
		pgactivate++;
		unlock_page(page);
		reset_reclaim_mode(sc);                  /// here
		list_add(&page->lru, &ret_pages);
	}
-----------------------------------------------------------------------------------------------------


-----------------------------------------------------------------------------------------------------
bool should_reclaim_stall()
{
 (snip)

	/* Only stall on lumpy reclaim */
	if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)   /// and here
		return false;
-----------------------------------------------------------------------------------------------------
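
In other words, the invariant being pointed at is: any page sent to
activate_locked also resets the reclaim mode to RECLAIM_MODE_SINGLE,
which should make should_reclaim_stall() return false, so the activated
pages never reach a second, synchronous shrink_page_list() pass. A
condensed sketch of the intended flow in shrink_inactive_list() (not a
literal quote):

	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
	/* any page that hit activate_locked: now has PG_active set
	 * and sc->reclaim_mode forced back to RECLAIM_MODE_SINGLE */

	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
		/* expected to be unreachable once RECLAIM_MODE_SINGLE is set;
		 * if it is reached anyway, the second pass below walks the
		 * same page_list and trips VM_BUG_ON(PageActive(page)) */
		set_reclaim_mode(priority, sc, true);
		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
	}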

2011-05-21 13:35:10

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sat, May 21, 2011 at 8:04 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 3f44b81..d1dabc9 100644
>> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan,
>> struct zone *zone,
>>
>>        /* Check if we should syncronously wait for writeback */
>>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>> +               unsigned long nr_active, old_nr_scanned;
>>                set_reclaim_mode(priority, sc, true);
>> +               nr_active = clear_active_flags(&page_list, NULL);
>> +               count_vm_events(PGDEACTIVATE, nr_active);
>> +               old_nr_scanned = sc->nr_scanned;
>>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>> +               sc->nr_scanned = old_nr_scanned;
>>        }
>>
>>        local_irq_disable();
>>
>> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
>> and test_mempressure without any problems other than slowness, but
>> when I hit ctrl-c to stop test_mempressure, I got the attached oom.
>
> Minchan,
>
> I'm confused now.
> If pages got SetPageActive(), should_reclaim_stall() should never return true.
> Can you please explain which bad scenario happened?
>
> -----------------------------------------------------------------------------------------------------
> static void reset_reclaim_mode(struct scan_control *sc)
> {
>        sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
> }
>
> shrink_page_list()
> {
>  (snip)
>  activate_locked:
>                SetPageActive(page);
>                pgactivate++;
>                unlock_page(page);
>                reset_reclaim_mode(sc);                  /// here
>                list_add(&page->lru, &ret_pages);
>        }
> -----------------------------------------------------------------------------------------------------
>
>
> -----------------------------------------------------------------------------------------------------
> bool should_reclaim_stall()
> {
> ?(snip)
>
>        /* Only stall on lumpy reclaim */
>        if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)   /// and here
>                return false;
> -----------------------------------------------------------------------------------------------------
>

I did some tracing and the oops happens from the second call to
shrink_page_list after should_reclaim_stall returns true and it hits
the same pages in the same order that the earlier call just finished
calling SetPageActive on. I have *not* confirmed that the two calls
happened from the same call to shrink_inactive_list, but something's
certainly wrong in there.

This is very easy to reproduce on my laptop.

--Andy

2011-05-21 14:14:34

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

> I did some tracing and the oops happens from the second call to
> shrink_page_list after should_reclaim_stall returns true and it hits
> the same pages in the same order that the earlier call just finished
> calling SetPageActive on.

Can you please share your tracing patch and raw tracing result log?

Thanks.

> I have *not* confirmed that the two calls
> happened from the same call to shrink_inactive_list, but something's
> certainly wrong in there.
>
> This is very easy to reproduce on my laptop.

2011-05-21 14:31:48

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sat, May 21, 2011 at 9:04 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 3f44b81..d1dabc9 100644
>> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan,
>> struct zone *zone,
>>
>>        /* Check if we should syncronously wait for writeback */
>>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>> +               unsigned long nr_active, old_nr_scanned;
>>                set_reclaim_mode(priority, sc, true);
>> +               nr_active = clear_active_flags(&page_list, NULL);
>> +               count_vm_events(PGDEACTIVATE, nr_active);
>> +               old_nr_scanned = sc->nr_scanned;
>>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>> +               sc->nr_scanned = old_nr_scanned;
>>        }
>>
>>        local_irq_disable();
>>
>> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
>> and test_mempressure without any problems other than slowness, but
>> when I hit ctrl-c to stop test_mempressure, I got the attached oom.
>
> Minchan,
>
> I'm confused now.
> If pages got SetPageActive(), should_reclaim_stall() should never return true.

Hi KOSAKI,
You're absolutely right.
I missed that, so the problem should not happen. :(

--
Kind regards,
Minchan Kim

2011-05-21 14:44:06

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

Hi Andrew.

On Sat, May 21, 2011 at 10:34 PM, Andrew Lutomirski <[email protected]> wrote:
> On Sat, May 21, 2011 at 8:04 AM, KOSAKI Motohiro
> <[email protected]> wrote:
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 3f44b81..d1dabc9 100644
>>> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan,
>>> struct zone *zone,
>>>
>>>        /* Check if we should syncronously wait for writeback */
>>>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>>> +               unsigned long nr_active, old_nr_scanned;
>>>                set_reclaim_mode(priority, sc, true);
>>> +               nr_active = clear_active_flags(&page_list, NULL);
>>> +               count_vm_events(PGDEACTIVATE, nr_active);
>>> +               old_nr_scanned = sc->nr_scanned;
>>>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>>> +               sc->nr_scanned = old_nr_scanned;
>>>        }
>>>
>>>        local_irq_disable();
>>>
>>> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
>>> and test_mempressure without any problems other than slowness, but
>>> when I hit ctrl-c to stop test_mempressure, I got the attached oom.
>>
>> Minchan,
>>
>> I'm confused now.
>> If pages got SetPageActive(), should_reclaim_stall() should never return true.
>> Can you please explain which bad scenario happened?
>>
>> -----------------------------------------------------------------------------------------------------
>> static void reset_reclaim_mode(struct scan_control *sc)
>> {
>>        sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
>> }
>>
>> shrink_page_list()
>> {
>>  (snip)
>>  activate_locked:
>>                SetPageActive(page);
>>                pgactivate++;
>>                unlock_page(page);
>>                reset_reclaim_mode(sc);                  /// here
>>                list_add(&page->lru, &ret_pages);
>>        }
>> -----------------------------------------------------------------------------------------------------
>>
>>
>> -----------------------------------------------------------------------------------------------------
>> bool should_reclaim_stall()
>> {
>>  (snip)
>>
>>        /* Only stall on lumpy reclaim */
>>        if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)   /// and here
>>                return false;
>> -----------------------------------------------------------------------------------------------------
>>
>
> I did some tracing and the oops happens from the second call to
> shrink_page_list after should_reclaim_stall returns true and it hits
> the same pages in the same order that the earlier call just finished
> calling SetPageActive on.  I have *not* confirmed that the two calls
> happened from the same call to shrink_inactive_list, but something's
> certainly wrong in there.
>
> This is very easy to reproduce on my laptop.

I would like to confirm this problem.
Could you show the diff between vanilla 2.6.38.6 and your current tree?
(i.e., I would like to know which patches you have applied on top of vanilla
2.6.38.6 to reproduce this problem.)
I believe you added my crap patch below. Right?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..69d317e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -311,7 +311,8 @@ static void set_reclaim_mode(int priority, struct scan_control *sc,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		sc->reclaim_mode |= syncmode;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if ((sc->order && priority < DEF_PRIORITY - 2) ||
+				priority <= DEF_PRIORITY / 3)
 		sc->reclaim_mode |= syncmode;
 	else
 		sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
@@ -1349,10 +1350,6 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
 	if (current_is_kswapd())
 		return false;
 
-	/* Only stall on lumpy reclaim */
-	if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
-		return false;
-
 	/* If we have relaimed everything on the isolated list, no stall */
 	if (nr_freed == nr_taken)
 		return false;


--
Kind regards,
Minchan Kim

2011-05-22 12:22:49

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sat, May 21, 2011 at 10:44 AM, Minchan Kim <[email protected]> wrote:
> Hi Andrew.
>
> On Sat, May 21, 2011 at 10:34 PM, Andrew Lutomirski <[email protected]> wrote:
>> On Sat, May 21, 2011 at 8:04 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 3f44b81..d1dabc9 100644
>>>> @@ -1426,8 +1437,13 @@ shrink_inactive_list(unsigned long nr_to_scan,
>>>> struct zone *zone,
>>>>
>>>>        /* Check if we should syncronously wait for writeback */
>>>>        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
>>>> +               unsigned long nr_active, old_nr_scanned;
>>>>                set_reclaim_mode(priority, sc, true);
>>>> +               nr_active = clear_active_flags(&page_list, NULL);
>>>> +               count_vm_events(PGDEACTIVATE, nr_active);
>>>> +               old_nr_scanned = sc->nr_scanned;
>>>>                nr_reclaimed += shrink_page_list(&page_list, zone, sc);
>>>> +               sc->nr_scanned = old_nr_scanned;
>>>>        }
>>>>
>>>>        local_irq_disable();
>>>>
>>>> I just tested 2.6.38.6 with the attached patch.  It survived dirty_ram
>>>> and test_mempressure without any problems other than slowness, but
>>>> when I hit ctrl-c to stop test_mempressure, I got the attached oom.
>>>
>>> Minchan,
>>>
>>> I'm confused now.
>>> If pages got SetPageActive(), should_reclaim_stall() should never return true.
>>> Can you please explain which bad scenario happened?
>>>
>>> -----------------------------------------------------------------------------------------------------
>>> static void reset_reclaim_mode(struct scan_control *sc)
>>> {
>>>        sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
>>> }
>>>
>>> shrink_page_list()
>>> {
>>>  (snip)
>>>  activate_locked:
>>>                SetPageActive(page);
>>>                pgactivate++;
>>>                unlock_page(page);
>>>                reset_reclaim_mode(sc);                  /// here
>>>                list_add(&page->lru, &ret_pages);
>>>        }
>>> -----------------------------------------------------------------------------------------------------
>>>
>>>
>>> -----------------------------------------------------------------------------------------------------
>>> bool should_reclaim_stall()
>>> {
>>> ?(snip)
>>>
>>>        /* Only stall on lumpy reclaim */
>>>        if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)   /// and here
>>>                return false;
>>> -----------------------------------------------------------------------------------------------------
>>>
>>
>> I did some tracing and the oops happens from the second call to
>> shrink_page_list after should_reclaim_stall returns true and it hits
>> the same pages in the same order that the earlier call just finished
>> calling SetPageActive on.  I have *not* confirmed that the two calls
>> happened from the same call to shrink_inactive_list, but something's
>> certainly wrong in there.
>>
>> This is very easy to reproduce on my laptop.
>
> I would like to confirm this problem.
> Could you show the diff between vanilla 2.6.38.6 and your current tree?
> (i.e., I would like to know which patches you have applied on top of vanilla
> 2.6.38.6 to reproduce this problem.)
> I believe you added my crap patch below. Right?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..69d317e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -311,7 +311,8 @@ static void set_reclaim_mode(int priority, struct
> scan_control *sc,
>        */
>       if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>               sc->reclaim_mode |= syncmode;
> -       else if (sc->order && priority < DEF_PRIORITY - 2)
> +       else if ((sc->order && priority < DEF_PRIORITY - 2) ||
> +                               priority <= DEF_PRIORITY / 3)
>               sc->reclaim_mode |= syncmode;
>       else
>               sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
> @@ -1349,10 +1350,6 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
>       if (current_is_kswapd())
>               return false;
>
> -       /* Only stall on lumpy reclaim */
> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
> -               return false;
> -

Bah. It's this last hunk. Without this I can't reproduce the oops.
With this hunk, the reset_reclaim_mode doesn't work and
shrink_page_list is incorrectly called twice.

So we're back to the original problem...

--Andy

2011-05-22 23:12:53

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 22, 2011 at 9:22 PM, Andrew Lutomirski <[email protected]> wrote:
> On Sat, May 21, 2011 at 10:44 AM, Minchan Kim <[email protected]> wrote:
>> I would like to confirm this problem.
>> Could you show the diff between vanilla 2.6.38.6 and your current tree?
>> (i.e., I would like to know which patches you have applied on top of vanilla
>> 2.6.38.6 to reproduce this problem.)
>> I believe you added my crap patch below. Right?
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 292582c..69d317e 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -311,7 +311,8 @@ static void set_reclaim_mode(int priority, struct
>> scan_control *sc,
>>        */
>>       if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>>               sc->reclaim_mode |= syncmode;
>> -       else if (sc->order && priority < DEF_PRIORITY - 2)
>> +       else if ((sc->order && priority < DEF_PRIORITY - 2) ||
>> +                               priority <= DEF_PRIORITY / 3)
>>               sc->reclaim_mode |= syncmode;
>>       else
>>               sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;
>> @@ -1349,10 +1350,6 @@ static inline bool
>> should_reclaim_stall(unsigned long nr_taken,
>>       if (current_is_kswapd())
>>               return false;
>>
>> -       /* Only stall on lumpy reclaim */
>> -       if (sc->reclaim_mode & RECLAIM_MODE_SINGLE)
>> -               return false;
>> -
>
> Bah.  It's this last hunk.  Without this I can't reproduce the oops.
> With this hunk, the reset_reclaim_mode doesn't work and
> shrink_page_list is incorrectly called twice.

OMG! I should have said it more clearly. The patch above is totally _crap_.
I thought you had run the test without that crap patch. :(
Sorry for consuming the time of many mm guys.
My apologies.

I want to resolve your original problem (i.e., the hang) before digging into the
OOM problem.

>
> So we're back to the original problem...

Could you test the patch below, based on vanilla 2.6.38.6?
The expected result is that the system hang should never happen.
I hope this is the last test for the hang.

Thanks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..1663d24 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 	if (scanned == 0)
 		scanned = SWAP_CLUSTER_MAX;
 
-	if (!down_read_trylock(&shrinker_rwsem))
-		return 1;	/* Assume we'll be able to shrink next time */
+	if (!down_read_trylock(&shrinker_rwsem)) {
+		/* Assume we'll be able to shrink next time */
+		ret = 1;
+		goto out;
+	}
 
 	list_for_each_entry(shrinker, &shrinker_list, list) {
 		unsigned long long delta;
@@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
 		shrinker->nr += total_scan;
 	}
 	up_read(&shrinker_rwsem);
+out:
+	cond_resched();
 	return ret;
 }

@@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }
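
For clarity, sleeping_prematurely() answers the question "would it be
premature for kswapd to sleep now?", so for order > 0 it has to return
true when the node is *not* yet balanced -- hence the negation in the
hunk above. A hedged sketch of the intended semantics (the watermark
checks that compute all_zones_ok and balanced are elided):

/* Sketch only: true  -> keep kswapd running (sleep would be premature)
 *              false -> kswapd may go to sleep */
static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
				 int classzone_idx)
{
	/* ... per-zone watermark checks set all_zones_ok and balanced ... */

	if (order)
		/* high-order: premature unless enough of the node is balanced */
		return !pgdat_balanced(pgdat, balanced, classzone_idx);
	else
		return !all_zones_ok;
}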

--
Kind regards,
Minchan Kim

2011-05-23 16:42:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 23, 2011 at 08:12:50AM +0900, Minchan Kim wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..1663d24 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> if (scanned == 0)
> scanned = SWAP_CLUSTER_MAX;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 1; /* Assume we'll be able to shrink next time */
> + if (!down_read_trylock(&shrinker_rwsem)) {
> + /* Assume we'll be able to shrink next time */
> + ret = 1;
> + goto out;
> + }

It looks cleaner to return -1 here to differentiate the failure to take
the lock from the case where we take the lock and just 1 object is
freed. Callers seem to be ok with -1 already, and it is more intuitive
for the while (nr > 10) loops too (those loops could be changed to
"while (nr > 0)" if all shrinkers are accurate and not doing something
inaccurate like the above code did; I haven't checked the shrinkers'
return values yet).

> up_read(&shrinker_rwsem);
> +out:
> + cond_resched();
> return ret;
> }

If we enter the loop, some of the shrinkers will reschedule, but it
looks good for the last iteration, which may still have run for some
time before returning. The actual failure to take shrinker_rwsem seems
only theoretical though (it's ok to cover it too with the cond_resched,
but in practice this matters more for the case where shrinker_rwsem
doesn't fail).

> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t
> *pgdat, int order, long remaining,
> * must be balanced
> */
> if (order)
> - return pgdat_balanced(pgdat, balanced, classzone_idx);
> + return !pgdat_balanced(pgdat, balanced, classzone_idx);
> else
> return !all_zones_ok;
> }

I now wonder if this is why compaction in kswapd didn't work out well
and kswapd would spin at 100% load so much when compaction was added,
plus with kswapd-compaction patch I think this code should be changed
to:

	if (!COMPACTION_BUILD && order)
		return !pgdat_balanced();
	else
		return !all_zones_ok;

(but only with kswapd-compaction)

I should probably give kswapd-compaction another spin after fixing
this, because with compaction kswapd should be super successful at
satisfying zone_watermark_ok_safe(zone, _order_...) in the
sleeping_prematurely high watermark check, leading to pgdat_balanced
returning true most of the time (which would make kswapd spin like crazy
instead of stopping as it was supposed to). Mel, do you also think
it's worth another try with a fixed sleeping_prematurely like above?

Another thing: I'm not excited about the schedule_timeout(HZ/10) in
kswapd_try_to_sleep(); it seems to be all for the statistics.

Thanks,
Andrea

2011-05-23 17:35:43

by Mel Gorman

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 23, 2011 at 06:42:25PM +0200, Andrea Arcangeli wrote:
> On Mon, May 23, 2011 at 08:12:50AM +0900, Minchan Kim wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..1663d24 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> > if (scanned == 0)
> > scanned = SWAP_CLUSTER_MAX;
> >
> > - if (!down_read_trylock(&shrinker_rwsem))
> > - return 1; /* Assume we'll be able to shrink next time */
> > + if (!down_read_trylock(&shrinker_rwsem)) {
> > + /* Assume we'll be able to shrink next time */
> > + ret = 1;
> > + goto out;
> > + }
>
> It looks cleaner to return -1 here to differentiate the failure in
> taking the lock from when we take the lock and just 1 object is
> freed. Callers seems to be ok with -1 already and more intuitive for
> the while (nr > 10) loops too (those loops could be changed to "while
> (nr > 0)" if all shrinkers are accurate and not doing something
> inaccurate like the above code did, the shrinkers retvals I didn't
> check yet).
>

Only one caller reads the value of shrink_slab() and while it would
survive -1 being returned, it gains nothing. I don't see it as being
much clearer than the existing return value of 1.

> > up_read(&shrinker_rwsem);
> > +out:
> > + cond_resched();
> > return ret;
> > }
>
> If we enter the loop some of the shrinkers will reschedule but it
> looks good for the last iteration that may have still run for some
> time before returning.

Yes.

> The actual failure of shrinker_rwsem seems only
> theoretical though (but ok to cover it too with the cond_resched, but
> in practice this should be more for the case where shrinker_rwsem
> doesn't fail).
>

Profiles from some users imply that this condition is being hit. I
can't 100% prove it as I can't reproduce the problem locally
(it seems to require a Sandy Bridge laptop for some reason). Tests did
show that kswapd CPU usage was reduced, as well as the likelihood
of hanging, when shrink_slab used cond_resched() like this. See
https://lkml.org/lkml/2011/5/17/274 .

> > @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t
> > *pgdat, int order, long remaining,
> > * must be balanced
> > */
> > if (order)
> > - return pgdat_balanced(pgdat, balanced, classzone_idx);
> > + return !pgdat_balanced(pgdat, balanced, classzone_idx);
> > else
> > return !all_zones_ok;
> > }
>
> I now wonder if this is why compaction in kswapd didn't work out well
> and kswapd would spin at 100% load so much when compaction was added,

It's possible.

> plus with kswapd-compaction patch I think this code should be changed
> to:
>
> if (!COMPACTION_BUILD && order)
> return !pgdat_balanced();
> else
> return !all_zones_ok;
>
> (but only with kswapd-compaction)
>

Why? kswapd can enter lumpy reclaim when !COMPACTION_BUILD. While this
is hardly desirable, I don't see why kswapd should use different logic
for balancing depending on whether compaction is used or not.

> I should probably give kswapd-compaction another spin after fixing
> this, because with compaction kswapd should be super successful at
> satisfying zone_watermark_ok_safe(zone, _order_...) in the
> sleeping_prematurely high watermark check, leading to pgdat_balanced
> returning true most of the time (which would make kswapd go crazy spin
> instead of stopping as it was supposed to). Mel, do you also think
> it's worth another try with a fixed sleeping_prematurely like above?
>

It's worth a try anyway although I think it's more important to figure
out if all_unreclaimable is being improperly set or not.

> Another thing, I'm not excited of the schedule_timeout(HZ/10) in
> kswapd_try_to_sleep(), it seems all for the statistics.

It's to catch the case where kswapd balances a zone but continual allocations
put the zone back under the high watermark quickly. It keeps kswapd awake to
reduce the likelihood that processes hit the min watermark and
stall.
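
For reference, a rough sketch of the kswapd_try_to_sleep() behaviour
being described (simplified from the 2.6.38-era code; freezer and
kthread_should_stop handling omitted): a short HZ/10 nap first, then a
re-check before committing to a long sleep.

static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
{
	long remaining = 0;
	DEFINE_WAIT(wait);

	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

	/* Short nap first: if allocations push the zone back under the
	 * high watermark right away, kswapd wakes up again quickly. */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		remaining = schedule_timeout(HZ/10);

	/* Only sleep for real if the node still looks balanced. */
	if (!sleeping_prematurely(pgdat, order, remaining, classzone_idx))
		schedule();

	finish_wait(&pgdat->kswapd_wait, &wait);
}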

--
Mel Gorman
SUSE Labs

2011-05-24 01:20:05

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Sun, May 22, 2011 at 7:12 PM, Minchan Kim <[email protected]> wrote:
> Could you test the patch below, based on vanilla 2.6.38.6?
> The expected result is that the system hang should never happen.
> I hope this is the last test for the hang.
>
> Thanks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..1663d24 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>       if (scanned == 0)
>               scanned = SWAP_CLUSTER_MAX;
>
> -       if (!down_read_trylock(&shrinker_rwsem))
> -               return 1;       /* Assume we'll be able to shrink next time */
> +       if (!down_read_trylock(&shrinker_rwsem)) {
> +               /* Assume we'll be able to shrink next time */
> +               ret = 1;
> +               goto out;
> +       }
>
>       list_for_each_entry(shrinker, &shrinker_list, list) {
>               unsigned long long delta;
> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>               shrinker->nr += total_scan;
>       }
>       up_read(&shrinker_rwsem);
> +out:
> +       cond_resched();
>       return ret;
>  }
>
> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>        * must be balanced
>        */
>       if (order)
> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>       else
>               return !all_zones_ok;
>  }

So far with this patch I can't reproduce the hang or the bogus OOM.

To be completely clear, I have COMPACTION, MIGRATION, and THP off, I'm
running 2.6.38.6, and I have exactly two patches applied. One is the
attached patch and the other is the fpu.ko/aesni_intel.ko merger
which I need to get dracut to boot my box.

For fun, I also upgraded to 8GB of RAM and it still works.

--Andy

>
> --
> Kind regards,
> Minchan Kim
>


Attachments:
minchan-patch-v3.patch (1.19 kB)

2011-05-24 01:34:26

by Minchan Kim

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Tue, May 24, 2011 at 10:19 AM, Andrew Lutomirski <[email protected]> wrote:
> On Sun, May 22, 2011 at 7:12 PM, Minchan Kim <[email protected]> wrote:
>> Could you test the patch below, based on vanilla 2.6.38.6?
>> The expected result is that the system hang should never happen.
>> I hope this is the last test for the hang.
>>
>> Thanks.
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 292582c..1663d24 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>       if (scanned == 0)
>>               scanned = SWAP_CLUSTER_MAX;
>>
>> -       if (!down_read_trylock(&shrinker_rwsem))
>> -               return 1;       /* Assume we'll be able to shrink next time */
>> +       if (!down_read_trylock(&shrinker_rwsem)) {
>> +               /* Assume we'll be able to shrink next time */
>> +               ret = 1;
>> +               goto out;
>> +       }
>>
>>       list_for_each_entry(shrinker, &shrinker_list, list) {
>>               unsigned long long delta;
>> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>               shrinker->nr += total_scan;
>>       }
>>       up_read(&shrinker_rwsem);
>> +out:
>> +       cond_resched();
>>       return ret;
>>  }
>>
>> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t
>> *pgdat, int order, long remaining,
>>        * must be balanced
>>        */
>>       if (order)
>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>       else
>>               return !all_zones_ok;
>>  }
>
> So far with this patch I can't reproduce the hang or the bogus OOM.
>
> To be completely clear, I have COMPACTION, MIGRATION, and THP off, I'm
> running 2.6.38.6, and I have exactly two patches applied.  One is the
> attached patch and the other is the fpu.ko/aesni_intel.ko merger
> which I need to get dracut to boot my box.
>
> For fun, I also upgraded to 8GB of RAM and it still works.
>

Hmm. Could you test it with THP enabled and 2G of RAM?
Isn't that the original test environment?
Please don't change the test environment. :)

Thanks for your effort, Andrew.

--
Kind regards,
Minchan Kim

2011-05-24 11:24:36

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Mon, May 23, 2011 at 9:34 PM, Minchan Kim <[email protected]> wrote:
> On Tue, May 24, 2011 at 10:19 AM, Andrew Lutomirski <[email protected]> wrote:
>> On Sun, May 22, 2011 at 7:12 PM, Minchan Kim <[email protected]> wrote:
>>> Could you test the patch below, based on vanilla 2.6.38.6?
>>> The expected result is that the system hang should never happen.
>>> I hope this is the last test for the hang.
>>>
>>> Thanks.
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 292582c..1663d24 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>       if (scanned == 0)
>>>               scanned = SWAP_CLUSTER_MAX;
>>>
>>> -       if (!down_read_trylock(&shrinker_rwsem))
>>> -               return 1;       /* Assume we'll be able to shrink next time */
>>> +       if (!down_read_trylock(&shrinker_rwsem)) {
>>> +               /* Assume we'll be able to shrink next time */
>>> +               ret = 1;
>>> +               goto out;
>>> +       }
>>>
>>>       list_for_each_entry(shrinker, &shrinker_list, list) {
>>>               unsigned long long delta;
>>> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>               shrinker->nr += total_scan;
>>>       }
>>>       up_read(&shrinker_rwsem);
>>> +out:
>>> +       cond_resched();
>>>       return ret;
>>>  }
>>>
>>> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>>>        * must be balanced
>>>        */
>>>       if (order)
>>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>>       else
>>>               return !all_zones_ok;
>>>  }
>>
>> So far with this patch I can't reproduce the hang or the bogus OOM.
>>
>> To be completely clear, I have COMPACTION, MIGRATION, and THP off, I'm
>> running 2.6.38.6, and I have exactly two patches applied.  One is the
>> attached patch and the other is the fpu.ko/aesni_intel.ko merger
>> which I need to get dracut to boot my box.
>>
>> For fun, I also upgraded to 8GB of RAM and it still works.
>>
>
> Hmm. Could you test it with THP enabled and 2G of RAM?
> Isn't that the original test environment?
> Please don't change the test environment. :)

The test that passed last night was an environment (hardware and
config) that I had confirmed earlier as failing without the patch.

I just re-tested my original config (from a backup -- migration,
compaction, and thp "always" are enabled). I get bogus OOMs but not a
hang. (I'm running with mem=2G right now -- I'll swap the DIMMs back
out later on if you want.)

I attached the bogus OOM (actually several that happened in sequence).
They look readahead-related. There was plenty of free swap space.

--Andy

2011-05-24 11:55:44

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

On Tue, May 24, 2011 at 7:24 AM, Andrew Lutomirski <[email protected]> wrote:
> On Mon, May 23, 2011 at 9:34 PM, Minchan Kim <[email protected]> wrote:
>> On Tue, May 24, 2011 at 10:19 AM, Andrew Lutomirski <[email protected]> wrote:
>>> On Sun, May 22, 2011 at 7:12 PM, Minchan Kim <[email protected]> wrote:
>>>> Could you test the patch below, based on vanilla 2.6.38.6?
>>>> The expected result is that the system hang should never happen.
>>>> I hope this is the last test for the hang.
>>>>
>>>> Thanks.
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 292582c..1663d24 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>>        if (scanned == 0)
>>>>                scanned = SWAP_CLUSTER_MAX;
>>>>
>>>> -       if (!down_read_trylock(&shrinker_rwsem))
>>>> -               return 1;       /* Assume we'll be able to shrink next time */
>>>> +       if (!down_read_trylock(&shrinker_rwsem)) {
>>>> +               /* Assume we'll be able to shrink next time */
>>>> +               ret = 1;
>>>> +               goto out;
>>>> +       }
>>>>
>>>>        list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>                unsigned long long delta;
>>>> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>>                shrinker->nr += total_scan;
>>>>        }
>>>>        up_read(&shrinker_rwsem);
>>>> +out:
>>>> +       cond_resched();
>>>>        return ret;
>>>>  }
>>>>
>>>> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>>>>         * must be balanced
>>>>         */
>>>>        if (order)
>>>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>>>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>>>        else
>>>>                return !all_zones_ok;
>>>>  }
>>>
>>> So far with this patch I can't reproduce the hang or the bogus OOM.
>>>
>>> To be completely clear, I have COMPACTION, MIGRATION, and THP off, I'm
>>> running 2.6.38.6, and I have exactly two patches applied. One is the
>>> attached patch and the other is the fpu.ko/aesni_intel.ko merger
>>> which I need to get dracut to boot my box.
>>>
>>> For fun, I also upgraded to 8GB of RAM and it still works.
>>>
>>
>> Hmm. Could you test it with THP enabled and 2G RAM?
>> Isn't that the original test environment?
>> Please don't change the test environment. :)
>
> The test that passed last night was an environment (hardware and
> config) that I had confirmed earlier as failing without the patch.
>
> I just re-tested my original config (from a backup -- migration,
> compaction, and thp "always" are enabled). I get bogus OOMs but not a
> hang. (I'm running with mem=2G right now -- I'll swap the DIMMs back
> out later on if you want.)
>
> I attached the bogus OOM (actually several that happened in sequence).
> They look readahead-related. There was plenty of free swap space.

Now with log actually attached.

>
> --Andy
>


Attachments:
bogus_oom.txt.xz (20.56 kB)

2011-05-25 00:44:06

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: Kernel falls apart under light memory pressure (i.e. linking vmlinux)

(2011/05/24 20:55), Andrew Lutomirski wrote:
> On Tue, May 24, 2011 at 7:24 AM, Andrew Lutomirski <[email protected]> wrote:
>> On Mon, May 23, 2011 at 9:34 PM, Minchan Kim <[email protected]> wrote:
>>> On Tue, May 24, 2011 at 10:19 AM, Andrew Lutomirski <[email protected]> wrote:
>>>> On Sun, May 22, 2011 at 7:12 PM, Minchan Kim <[email protected]> wrote:
>>>>> Could you test the patch below, based on vanilla 2.6.38.6?
>>>>> The expected result is that the system hang should never happen.
>>>>> I hope this is the last test for the hang.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 292582c..1663d24 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -231,8 +231,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>>>        if (scanned == 0)
>>>>>                scanned = SWAP_CLUSTER_MAX;
>>>>>
>>>>> -       if (!down_read_trylock(&shrinker_rwsem))
>>>>> -               return 1;       /* Assume we'll be able to shrink next time */
>>>>> +       if (!down_read_trylock(&shrinker_rwsem)) {
>>>>> +               /* Assume we'll be able to shrink next time */
>>>>> +               ret = 1;
>>>>> +               goto out;
>>>>> +       }
>>>>>
>>>>>        list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>>                unsigned long long delta;
>>>>> @@ -286,6 +289,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>>>>>                shrinker->nr += total_scan;
>>>>>        }
>>>>>        up_read(&shrinker_rwsem);
>>>>> +out:
>>>>> +       cond_resched();
>>>>>        return ret;
>>>>>  }
>>>>>
>>>>> @@ -2331,7 +2336,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>>>>>         * must be balanced
>>>>>         */
>>>>>        if (order)
>>>>> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
>>>>> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>>>>>        else
>>>>>                return !all_zones_ok;
>>>>>  }
>>>>
>>>> So far with this patch I can't reproduce the hang or the bogus OOM.
>>>>
>>>> To be completely clear, I have COMPACTION, MIGRATION, and THP off, I'm
>>>> running 2.6.38.6, and I have exactly two patches applied. One is the
>>>> attached patch and the other is the fpu.ko/aesni_intel.ko merger
>>>> which I need to get dracut to boot my box.
>>>>
>>>> For fun, I also upgraded to 8GB of RAM and it still works.
>>>>
>>>
>>> Hmm. Could you test it with THP enabled and 2G RAM?
>>> Isn't that the original test environment?
>>> Please don't change the test environment. :)
>>
>> The test that passed last night was an environment (hardware and
>> config) that I had confirmed earlier as failing without the patch.
>>
>> I just re-tested my original config (from a backup -- migration,
>> compaction, and thp "always" are enabled). I get bogus OOMs but not a
>> hang. (I'm running with mem=2G right now -- I'll swap the DIMMs back
>> out later on if you want.)
>>
>> I attached the bogus OOM (actually several that happened in sequence).
>> They look readahead-related. There was plenty of free swap space.
>
> Now with log actually attached.

Unfortunately, this log doesn't tell us why DM doesn't issue any swap I/O. ;-)
I suspect it's a DM issue. Can you please try putting swap outside of DM?
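
In case it helps with that experiment, a trivial memory dirtier along the
lines of the one below is usually enough to force steady swap-out once swap
sits on a raw partition. This is only a minimal sketch (the 4096-byte page
stride and the default size are assumptions, and it is not one of the tools
already posted in this thread); run it with a size comfortably above RAM and
interrupt it with Ctrl-C.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* Size in megabytes, ideally well above physical RAM. */
	size_t mb = (argc > 1) ? (size_t)strtoul(argv[1], NULL, 10) : 4096;
	size_t len = mb << 20;
	size_t off;
	char *buf = malloc(len);

	if (!buf) {
		perror("malloc");
		return 1;
	}

	/* Keep every page dirty so reclaim has to write it to swap instead
	 * of simply dropping it, producing continuous swap I/O. */
	for (;;)
		for (off = 0; off < len; off += 4096)
			buf[off] = (char)(off >> 12);
}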