2011-03-05 11:45:04

by Andrei Vagin

Subject: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

Check zone->all_unreclaimable in all_unreclaimable(); otherwise the
kernel may hang, because shrink_zones() will do nothing while
all_unreclaimable() still says that the zone has reclaimable pages.

do_try_to_free_pages()
shrink_zones()
for_each_zone
if (zone->all_unreclaimable)
continue
if !all_unreclaimable(zonelist, sc)
return 1

__alloc_pages_slowpath()
retry:
did_some_progress = do_try_to_free_pages(page)
...
if (!page && did_some_progress)
retry;
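
Annotated, the same loop (pseudo-code sketch with shortened helper names,
not the exact kernel code):

	/* sketch: reclaim() stands for do_try_to_free_pages(),
	 * freelist() for get_page_from_freelist() */
	do {
		/* shrink_zones() skips zones with zone->all_unreclaimable
		 * set, so it may free nothing ... */
		did_some_progress = reclaim();
		page = freelist();
		/* ... but without the zone->all_unreclaimable check,
		 * all_unreclaimable() still reports reclaimable pages, so
		 * did_some_progress stays nonzero and the allocator loops
		 * here forever instead of falling through to the OOM killer. */
	} while (!page && did_some_progress);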

Signed-off-by: Andrey Vagin <[email protected]>
---
mm/vmscan.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6771ea7..1c056f7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,

for_each_zone_zonelist_nodemask(zone, z, zonelist,
gfp_zone(sc->gfp_mask), sc->nodemask) {
+ if (zone->all_unreclaimable)
+ continue;
if (!populated_zone(zone))
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
--
1.7.1


2011-03-05 15:21:11

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> kernel may hang up, because shrink_zones() will do nothing, but
> all_unreclaimable() will say, that zone has reclaimable pages.
>
> do_try_to_free_pages()
> shrink_zones()
> for_each_zone
> if (zone->all_unreclaimable)
> continue
> if !all_unreclaimable(zonelist, sc)
> return 1
>
> __alloc_pages_slowpath()
> retry:
> did_some_progress = do_try_to_free_pages(page)
> ...
> if (!page && did_some_progress)
> retry;
>
> Signed-off-by: Andrey Vagin <[email protected]>
> ---
> mm/vmscan.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6771ea7..1c056f7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> + if (zone->all_unreclaimable)
> + continue;
> if (!populated_zone(zone))
> continue;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))


zone_reclaimable() checks it. Isn't that enough?
Does the hang really happen, or did you see it by code review?
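
For reference, the check in question reads as follows in the mm/vmscan.c of
this period (the same line is visible in the patch hunks quoted later in the
thread):

static bool zone_reclaimable(struct zone *zone)
{
	return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}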

> --
> 1.7.1
>

--
Kind regards,
Minchan Kim

2011-03-05 16:42:15

by Andrei Vagin

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On 03/05/2011 06:53 PM, Minchan Kim wrote:
> On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
>> On 03/05/2011 06:20 PM, Minchan Kim wrote:
>>> On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>>>> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>>>> kernel may hang up, because shrink_zones() will do nothing, but
>>>> all_unreclaimable() will say, that zone has reclaimable pages.
>>>>
>>>> do_try_to_free_pages()
>>>> shrink_zones()
>>>> for_each_zone
>>>> if (zone->all_unreclaimable)
>>>> continue
>>>> if !all_unreclaimable(zonelist, sc)
>>>> return 1
>>>>
>>>> __alloc_pages_slowpath()
>>>> retry:
>>>> did_some_progress = do_try_to_free_pages(page)
>>>> ...
>>>> if (!page&& did_some_progress)
>>>> retry;
>>>>
>>>> Signed-off-by: Andrey Vagin<[email protected]>
>>>> ---
>>>> mm/vmscan.c | 2 ++
>>>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 6771ea7..1c056f7 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>>>
>>>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
>>>> gfp_zone(sc->gfp_mask), sc->nodemask) {
>>>> + if (zone->all_unreclaimable)
>>>> + continue;
>>>> if (!populated_zone(zone))
>>>> continue;
>>>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>> zone_reclaimable checks it. Isn't it enough?
>> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
>> This two patches are enough.
> Sorry if I confused you.
> I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> So too many page scanning in small lru pages makes the zone to unreclaimable zone.
>
> In all_unreclaimable, we calls zone_reclaimable to detect it.
> It's the same thing with your patch.
balance_pgdat() sets zone->all_unreclaimable, but the problem is that it is
cleared too late.

The problem is that zone->all_unreclaimable is true, but
zone_reclaimable() returns true too.

zone->all_unreclaimable will be cleared in free_*_pages, but this may be
too late. It is enough to allocate one page from the page cache for
zone_reclaimable() to return true and zone->all_unreclaimable to become true.
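
A sketch of the state transitions involved (reconstructed from the 2.6.38-era
sources and the hunks quoted later in this thread, so details may differ
slightly):

/*
 * balance_pgdat()                          -- kswapd, mm/vmscan.c
 *     if (... && !zone_reclaimable(zone))  // pages_scanned >= reclaimable * 6
 *             zone->all_unreclaimable = 1;
 *
 * free_pcppages_bulk() / free_one_page()   -- mm/page_alloc.c
 *     zone->all_unreclaimable = 0;         // any freed page clears the flag
 *     zone->pages_scanned = 0;             // ... and resets the scan counter
 */
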
>>> Does the hang up really happen or see it by code review?
>> Yes. You can reproduce it for help the attached python program. It's
>> not very clever:)
>> It make the following actions in loop:
>> 1. fork
>> 2. mmap
>> 3. touch memory
>> 4. read memory
>> 5. munmmap
> It seems the test program makes fork bombs and memory hogging.
> If you applied this patch, the problem is gone?
Yes.
>>>> --
>>>> 1.7.1
>>>>
>> import sys, time, mmap, os
>> from subprocess import Popen, PIPE
>> import random
>>
>> global mem_size
>>
>> def info(msg):
>>     pid = os.getpid()
>>     print >> sys.stderr, "%s: %s" % (pid, msg)
>>     sys.stderr.flush()
>>
>>
>>
>> def memory_loop(cmd = "a"):
>>     """
>>     cmd may be:
>>     c: check memory
>>     else: touch memory
>>     """
>>     c = 0
>>     for j in xrange(0, mem_size):
>>         if cmd == "c":
>>             if f[j<<12] != chr(j % 255):
>>                 info("Data corruption")
>>                 sys.exit(1)
>>         else:
>>             f[j<<12] = chr(j % 255)
>>
>> while True:
>>     pid = os.fork()
>>     if (pid != 0):
>>         mem_size = random.randint(0, 56 * 4096)
>>         f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
>>         memory_loop()
>>         memory_loop("c")
>>         f.close()
>

2011-03-05 15:35:50

by Andrei Vagin

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On 03/05/2011 06:20 PM, Minchan Kim wrote:
> On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>> kernel may hang up, because shrink_zones() will do nothing, but
>> all_unreclaimable() will say, that zone has reclaimable pages.
>>
>> do_try_to_free_pages()
>> shrink_zones()
>> for_each_zone
>> if (zone->all_unreclaimable)
>> continue
>> if !all_unreclaimable(zonelist, sc)
>> return 1
>>
>> __alloc_pages_slowpath()
>> retry:
>> did_some_progress = do_try_to_free_pages(page)
>> ...
>> if (!page&& did_some_progress)
>> retry;
>>
>> Signed-off-by: Andrey Vagin<[email protected]>
>> ---
>> mm/vmscan.c | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 6771ea7..1c056f7 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>>
>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> gfp_zone(sc->gfp_mask), sc->nodemask) {
>> + if (zone->all_unreclaimable)
>> + continue;
>> if (!populated_zone(zone))
>> continue;
>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>
> zone_reclaimable checks it. Isn't it enough?
I sent one more patch, [PATCH] mm: skip zombie in OOM-killer.
These two patches are enough.
> Does the hang up really happen or see it by code review?
Yes. You can reproduce it with the help of the attached python program. It's
not very clever :)
It performs the following actions in a loop:
1. fork
2. mmap
3. touch memory
4. read memory
5. munmap

>> --
>> 1.7.1
>>


Attachments:
memeater.py (659.00 B)

2011-03-05 15:53:30

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> On 03/05/2011 06:20 PM, Minchan Kim wrote:
> >On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >>kernel may hang up, because shrink_zones() will do nothing, but
> >>all_unreclaimable() will say, that zone has reclaimable pages.
> >>
> >>do_try_to_free_pages()
> >> shrink_zones()
> >> for_each_zone
> >> if (zone->all_unreclaimable)
> >> continue
> >> if !all_unreclaimable(zonelist, sc)
> >> return 1
> >>
> >>__alloc_pages_slowpath()
> >>retry:
> >> did_some_progress = do_try_to_free_pages(page)
> >> ...
> >> if (!page&& did_some_progress)
> >> retry;
> >>
> >>Signed-off-by: Andrey Vagin<[email protected]>
> >>---
> >> mm/vmscan.c | 2 ++
> >> 1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>index 6771ea7..1c056f7 100644
> >>--- a/mm/vmscan.c
> >>+++ b/mm/vmscan.c
> >>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >>
> >> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >> gfp_zone(sc->gfp_mask), sc->nodemask) {
> >>+ if (zone->all_unreclaimable)
> >>+ continue;
> >> if (!populated_zone(zone))
> >> continue;
> >> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >
> >zone_reclaimable checks it. Isn't it enough?
> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> This two patches are enough.

Sorry if I confused you.
I mean zone->all_unreclaimable becomes true if !zone_reclaimable() in balance_pgdat().
zone_reclaimable() compares recent pages_scanned with the number of zone lru pages.
So too much page scanning against a small lru makes the zone an unreclaimable zone.

In all_unreclaimable(), we call zone_reclaimable() to detect that.
It's the same thing as your patch.
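
For context, here is the unpatched function under discussion, reconstructed
from the diff hunks in this thread (the local variable declarations are
assumed):

static bool all_unreclaimable(struct zonelist *zonelist,
					struct scan_control *sc)
{
	struct zoneref *z;
	struct zone *zone;
	bool all_unreclaimable = true;

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
			gfp_zone(sc->gfp_mask), sc->nodemask) {
		if (!populated_zone(zone))
			continue;
		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
			continue;
		if (zone_reclaimable(zone)) {
			all_unreclaimable = false;
			break;
		}
	}

	return all_unreclaimable;
}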

> >Does the hang up really happen or see it by code review?
> Yes. You can reproduce it for help the attached python program. It's
> not very clever:)
> It make the following actions in loop:
> 1. fork
> 2. mmap
> 3. touch memory
> 4. read memory
> 5. munmmap

It seems the test program creates a fork bomb and memory hogging.
With this patch applied, is the problem gone?

>
> >>--
> >>1.7.1
> >>
>

> import sys, time, mmap, os
> from subprocess import Popen, PIPE
> import random
>
> global mem_size
>
> def info(msg):
>     pid = os.getpid()
>     print >> sys.stderr, "%s: %s" % (pid, msg)
>     sys.stderr.flush()
>
>
>
> def memory_loop(cmd = "a"):
>     """
>     cmd may be:
>     c: check memory
>     else: touch memory
>     """
>     c = 0
>     for j in xrange(0, mem_size):
>         if cmd == "c":
>             if f[j<<12] != chr(j % 255):
>                 info("Data corruption")
>                 sys.exit(1)
>         else:
>             f[j<<12] = chr(j % 255)
>
> while True:
>     pid = os.fork()
>     if (pid != 0):
>         mem_size = random.randint(0, 56 * 4096)
>         f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
>         memory_loop()
>         memory_loop("c")
>         f.close()


--
Kind regards,
Minchan Kim

2011-03-05 17:08:09

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> On 03/05/2011 06:53 PM, Minchan Kim wrote:
> >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >>>>kernel may hang up, because shrink_zones() will do nothing, but
> >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> >>>>
> >>>>do_try_to_free_pages()
> >>>> shrink_zones()
> >>>> for_each_zone
> >>>> if (zone->all_unreclaimable)
> >>>> continue
> >>>> if !all_unreclaimable(zonelist, sc)
> >>>> return 1
> >>>>
> >>>>__alloc_pages_slowpath()
> >>>>retry:
> >>>> did_some_progress = do_try_to_free_pages(page)
> >>>> ...
> >>>> if (!page&& did_some_progress)
> >>>> retry;
> >>>>
> >>>>Signed-off-by: Andrey Vagin<[email protected]>
> >>>>---
> >>>> mm/vmscan.c | 2 ++
> >>>> 1 files changed, 2 insertions(+), 0 deletions(-)
> >>>>
> >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>>index 6771ea7..1c056f7 100644
> >>>>--- a/mm/vmscan.c
> >>>>+++ b/mm/vmscan.c
> >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> >>>>
> >>>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >>>> gfp_zone(sc->gfp_mask), sc->nodemask) {
> >>>>+ if (zone->all_unreclaimable)
> >>>>+ continue;
> >>>> if (!populated_zone(zone))
> >>>> continue;
> >>>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >>>zone_reclaimable checks it. Isn't it enough?
> >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> >>This two patches are enough.
> >Sorry if I confused you.
> >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> >
> >In all_unreclaimable, we calls zone_reclaimable to detect it.
> >It's the same thing with your patch.
> balance_pgdat set zone->all_unreclaimable, but the problem is that
> it is cleaned late.

Yes. It can be delayed by the pcp lists, so (zone->all_unreclaimable == true) is
a false alarm: the zone has a free page, and it can be returned
to the free list by drain_all_pages() on the next turn.

>
> The problem is that zone->all_unreclaimable = True, but
> zone_reclaimable() returns True too.

Why is it a problem?
If zone->all_unreclaimable gives a false alarm, we do need to check
it again with a zone_reclaimable() call.

If we believe a false alarm and give up reclaim, we may end up making an
unnecessary oom kill.

>
> zone->all_unreclaimable will be cleaned in free_*_pages, but this
> may be late. It is enough allocate one page from page cache, that
> zone_reclaimable() returns True and zone->all_unreclaimable becomes
> True.
> >>>Does the hang up really happen or see it by code review?
> >>Yes. You can reproduce it for help the attached python program. It's
> >>not very clever:)
> >>It make the following actions in loop:
> >>1. fork
> >>2. mmap
> >>3. touch memory
> >>4. read memory
> >>5. munmmap
> >It seems the test program makes fork bombs and memory hogging.
> >If you applied this patch, the problem is gone?
> Yes.

Hmm.. Although it solves the problem, I don't think it's a good idea to
depend on a false alarm and give up the retry.


> >>>>--
> >>>>1.7.1
> >>>>
> >>import sys, time, mmap, os
> >>from subprocess import Popen, PIPE
> >>import random
> >>
> >>global mem_size
> >>
> >>def info(msg):
> >>    pid = os.getpid()
> >>    print >> sys.stderr, "%s: %s" % (pid, msg)
> >>    sys.stderr.flush()
> >>
> >>
> >>
> >>def memory_loop(cmd = "a"):
> >>    """
> >>    cmd may be:
> >>    c: check memory
> >>    else: touch memory
> >>    """
> >>    c = 0
> >>    for j in xrange(0, mem_size):
> >>        if cmd == "c":
> >>            if f[j<<12] != chr(j % 255):
> >>                info("Data corruption")
> >>                sys.exit(1)
> >>        else:
> >>            f[j<<12] = chr(j % 255)
> >>
> >>while True:
> >>    pid = os.fork()
> >>    if (pid != 0):
> >>        mem_size = random.randint(0, 56 * 4096)
> >>        f = mmap.mmap(-1, mem_size << 12, mmap.MAP_ANONYMOUS|mmap.MAP_PRIVATE)
> >>        memory_loop()
> >>        memory_loop("c")
> >>        f.close()
> >
>

--
Kind regards,
Minchan Kim

2011-03-07 21:59:16

by Andrew Morton

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Sun, 6 Mar 2011 02:07:59 +0900
Minchan Kim <[email protected]> wrote:

> On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> > On 03/05/2011 06:53 PM, Minchan Kim wrote:
> > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > >>>>kernel may hang up, because shrink_zones() will do nothing, but
> > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> > >>>>
> > >>>>do_try_to_free_pages()
> > >>>> shrink_zones()
> > >>>> for_each_zone
> > >>>> if (zone->all_unreclaimable)
> > >>>> continue
> > >>>> if !all_unreclaimable(zonelist, sc)
> > >>>> return 1
> > >>>>
> > >>>>__alloc_pages_slowpath()
> > >>>>retry:
> > >>>> did_some_progress = do_try_to_free_pages(page)
> > >>>> ...
> > >>>> if (!page&& did_some_progress)
> > >>>> retry;
> > >>>>
> > >>>>Signed-off-by: Andrey Vagin<[email protected]>
> > >>>>---
> > >>>> mm/vmscan.c | 2 ++
> > >>>> 1 files changed, 2 insertions(+), 0 deletions(-)
> > >>>>
> > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >>>>index 6771ea7..1c056f7 100644
> > >>>>--- a/mm/vmscan.c
> > >>>>+++ b/mm/vmscan.c
> > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > >>>>
> > >>>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >>>> gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >>>>+ if (zone->all_unreclaimable)
> > >>>>+ continue;
> > >>>> if (!populated_zone(zone))
> > >>>> continue;
> > >>>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >>>zone_reclaimable checks it. Isn't it enough?
> > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > >>This two patches are enough.
> > >Sorry if I confused you.
> > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> > >
> > >In all_unreclaimable, we calls zone_reclaimable to detect it.
> > >It's the same thing with your patch.
> > balance_pgdat set zone->all_unreclaimable, but the problem is that
> > it is cleaned late.
>
> Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
> a false alram since zone have a free page and it can be returned
> to free list by drain_all_pages in next turn.
>
> >
> > The problem is that zone->all_unreclaimable = True, but
> > zone_reclaimable() returns True too.
>
> Why is it a problem?
> If zone->all_unreclaimable gives a false alram, we does need to check
> it again by zone_reclaimable call.
>
> If we believe a false alarm and give up the reclaim, maybe we have to make
> unnecessary oom kill.
>
> >
> > zone->all_unreclaimable will be cleaned in free_*_pages, but this
> > may be late. It is enough allocate one page from page cache, that
> > zone_reclaimable() returns True and zone->all_unreclaimable becomes
> > True.
> > >>>Does the hang up really happen or see it by code review?
> > >>Yes. You can reproduce it for help the attached python program. It's
> > >>not very clever:)
> > >>It make the following actions in loop:
> > >>1. fork
> > >>2. mmap
> > >>3. touch memory
> > >>4. read memory
> > >>5. munmmap
> > >It seems the test program makes fork bombs and memory hogging.
> > >If you applied this patch, the problem is gone?
> > Yes.
>
> Hmm.. Although it solves the problem, I think it's not a good idea that
> depends on false alram and give up the retry.

Any alternative proposals? We should get the livelock fixed if possible..

2011-03-07 23:45:53

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <[email protected]> wrote:
> On Sun, 6 Mar 2011 02:07:59 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
>> > On 03/05/2011 06:53 PM, Minchan Kim wrote:
>> > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
>> > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
>> > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
>> > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
>> > >>>>kernel may hang up, because shrink_zones() will do nothing, but
>> > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
>> > >>>>
>> > >>>>do_try_to_free_pages()
>> > >>>>        shrink_zones()
>> > >>>>                 for_each_zone
>> > >>>>                        if (zone->all_unreclaimable)
>> > >>>>                                continue
>> > >>>>        if !all_unreclaimable(zonelist, sc)
>> > >>>>                return 1
>> > >>>>
>> > >>>>__alloc_pages_slowpath()
>> > >>>>retry:
>> > >>>>        did_some_progress = do_try_to_free_pages(page)
>> > >>>>        ...
>> > >>>>        if (!page&&   did_some_progress)
>> > >>>>                retry;
>> > >>>>
>> > >>>>Signed-off-by: Andrey Vagin<[email protected]>
>> > >>>>---
>> > >>>>  mm/vmscan.c |    2 ++
>> > >>>>  1 files changed, 2 insertions(+), 0 deletions(-)
>> > >>>>
>> > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > >>>>index 6771ea7..1c056f7 100644
>> > >>>>--- a/mm/vmscan.c
>> > >>>>+++ b/mm/vmscan.c
>> > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
>> > >>>>
>> > >>>>        for_each_zone_zonelist_nodemask(zone, z, zonelist,
>> > >>>>                        gfp_zone(sc->gfp_mask), sc->nodemask) {
>> > >>>>+               if (zone->all_unreclaimable)
>> > >>>>+                       continue;
>> > >>>>                if (!populated_zone(zone))
>> > >>>>                        continue;
>> > >>>>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>> > >>>zone_reclaimable checks it. Isn't it enough?
>> > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
>> > >>This two patches are enough.
>> > >Sorry if I confused you.
>> > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
>> > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
>> > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
>> > >
>> > >In all_unreclaimable, we calls zone_reclaimable to detect it.
>> > >It's the same thing with your patch.
>> > balance_pgdat set zone->all_unreclaimable, but the problem is that
>> > it is cleaned late.
>>
>> Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
>> a false alram since zone have a free page and it can be returned
>> to free list by drain_all_pages in next turn.
>>
>> >
>> > The problem is that zone->all_unreclaimable = True, but
>> > zone_reclaimable() returns True too.
>>
>> Why is it a problem?
>> If zone->all_unreclaimable gives a false alram, we does need to check
>> it again by zone_reclaimable call.
>>
>> If we believe a false alarm and give up the reclaim, maybe we have to make
>> unnecessary oom kill.
>>
>> >
>> > zone->all_unreclaimable will be cleaned in free_*_pages, but this
>> > may be late. It is enough allocate one page from page cache, that
>> > zone_reclaimable() returns True and zone->all_unreclaimable becomes
>> > True.
>> > >>>Does the hang up really happen or see it by code review?
>> > >>Yes. You can reproduce it for help the attached python program. It's
>> > >>not very clever:)
>> > >>It make the following actions in loop:
>> > >>1. fork
>> > >>2. mmap
>> > >>3. touch memory
>> > >>4. read memory
>> > >>5. munmmap
>> > >It seems the test program makes fork bombs and memory hogging.
>> > >If you applied this patch, the problem is gone?
>> > Yes.
>>
>> Hmm.. Although it solves the problem, I think it's not a good idea that
>> depends on false alram and give up the retry.
>
> Any alternative proposals?  We should get the livelock fixed if possible..
>

And we should avoid unnecessary OOM kill if possible.

I think the problem is caused by (zone->pages_scanned <
zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is the best value. It
would be rather big on recent machines with big DRAM.

I think it is a trade-off between latency and OOM kill.
If we decrease the magic value, maybe we prevent the near-livelock but
cause unnecessary OOM kills.

And I think zone_reclaimable() is not fair.
For example, it takes a lot of scanning, maybe a very long time, to move a
zone from the reclaimable state to the unreclaimable state, but freeing just
a few pages flips it back to reclaimable very easily. So we need much painful
reclaiming to move from the reclaimable state to the unreclaimable state, and
that would affect latency very much.

Maybe we need a smarter zone_reclaimable() which is adaptive to memory pressure.

--
Kind regards,
Minchan Kim

2011-03-08 00:53:28

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Mon, 7 Mar 2011 13:58:31 -0800
Andrew Morton <[email protected]> wrote:

> On Sun, 6 Mar 2011 02:07:59 +0900
> Minchan Kim <[email protected]> wrote:
>
> > On Sat, Mar 05, 2011 at 07:41:26PM +0300, Andrew Vagin wrote:
> > > On 03/05/2011 06:53 PM, Minchan Kim wrote:
> > > >On Sat, Mar 05, 2011 at 06:34:37PM +0300, Andrew Vagin wrote:
> > > >>On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > >>>On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > > >>>>Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > > >>>>kernel may hang up, because shrink_zones() will do nothing, but
> > > >>>>all_unreclaimable() will say, that zone has reclaimable pages.
> > > >>>>
> > > >>>>do_try_to_free_pages()
> > > >>>> shrink_zones()
> > > >>>> for_each_zone
> > > >>>> if (zone->all_unreclaimable)
> > > >>>> continue
> > > >>>> if !all_unreclaimable(zonelist, sc)
> > > >>>> return 1
> > > >>>>
> > > >>>>__alloc_pages_slowpath()
> > > >>>>retry:
> > > >>>> did_some_progress = do_try_to_free_pages(page)
> > > >>>> ...
> > > >>>> if (!page&& did_some_progress)
> > > >>>> retry;
> > > >>>>
> > > >>>>Signed-off-by: Andrey Vagin<[email protected]>
> > > >>>>---
> > > >>>> mm/vmscan.c | 2 ++
> > > >>>> 1 files changed, 2 insertions(+), 0 deletions(-)
> > > >>>>
> > > >>>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >>>>index 6771ea7..1c056f7 100644
> > > >>>>--- a/mm/vmscan.c
> > > >>>>+++ b/mm/vmscan.c
> > > >>>>@@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> > > >>>>
> > > >>>> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >>>> gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >>>>+ if (zone->all_unreclaimable)
> > > >>>>+ continue;
> > > >>>> if (!populated_zone(zone))
> > > >>>> continue;
> > > >>>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >>>zone_reclaimable checks it. Isn't it enough?
> > > >>I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > > >>This two patches are enough.
> > > >Sorry if I confused you.
> > > >I mean zone->all_unreclaimable become true if !zone_reclaimable in balance_pgdat.
> > > >zone_reclaimable compares recent pages_scanned with the number of zone lru pages.
> > > >So too many page scanning in small lru pages makes the zone to unreclaimable zone.
> > > >
> > > >In all_unreclaimable, we calls zone_reclaimable to detect it.
> > > >It's the same thing with your patch.
> > > balance_pgdat set zone->all_unreclaimable, but the problem is that
> > > it is cleaned late.
> >
> > Yes. It can be delayed by pcp so (zone->all_unreclaimable = true) is
> > a false alram since zone have a free page and it can be returned
> > to free list by drain_all_pages in next turn.
> >
> > >
> > > The problem is that zone->all_unreclaimable = True, but
> > > zone_reclaimable() returns True too.
> >
> > Why is it a problem?
> > If zone->all_unreclaimable gives a false alram, we does need to check
> > it again by zone_reclaimable call.
> >
> > If we believe a false alarm and give up the reclaim, maybe we have to make
> > unnecessary oom kill.
> >
> > >
> > > zone->all_unreclaimable will be cleaned in free_*_pages, but this
> > > may be late. It is enough allocate one page from page cache, that
> > > zone_reclaimable() returns True and zone->all_unreclaimable becomes
> > > True.
> > > >>>Does the hang up really happen or see it by code review?
> > > >>Yes. You can reproduce it for help the attached python program. It's
> > > >>not very clever:)
> > > >>It make the following actions in loop:
> > > >>1. fork
> > > >>2. mmap
> > > >>3. touch memory
> > > >>4. read memory
> > > >>5. munmmap
> > > >It seems the test program makes fork bombs and memory hogging.
> > > >If you applied this patch, the problem is gone?
> > > Yes.
> >
> > Hmm.. Although it solves the problem, I think it's not a good idea that
> > depends on false alram and give up the retry.
>
> Any alternative proposals? We should get the livelock fixed if possible..

I agree with Minchan and can't see this as a real fix....
Andrey, I'm now trying your fix, and it seems your oom-killer fix,
'skip-zombie-process', works well enough for my environment.

What is your environment? Number of cpus? Architecture? Size of memory?



Thanks,
-Kame

2011-03-08 03:06:45

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

> > > Hmm.. Although it solves the problem, I think it's not a good idea that
> > > depends on false alram and give up the retry.
> >
> > Any alternative proposals? We should get the livelock fixed if possible..
>
> I agree with Minchan and can't think this is a real fix....
> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> 'skip-zombie-process' works enough good for my environ.
>
> What is your enviroment ? number of cpus ? architecture ? size of memory ?

Me too. 'skip-zombie-process V1' works fine, and I didn't see this patch
improve the oom situation.

And the test program is purely a fork bomb. Our oom-killer has not been a
silver bullet for fork bombs for a very long time. That said, the oom-killer
sends SIGKILL and starts to kill the victim process, but it doesn't prevent
new memory-hogging tasks from being created. Therefore we have no guarantee
of winning the race between process exit and creation.

*IF* we really need to care about the fork bomb issue, we need to write a
completely new VM feature.

2011-03-08 08:11:01

by Andrei Vagin

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

Hi, All
> I agree with Minchan and can't think this is a real fix....
> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> 'skip-zombie-process' works enough good for my environ.
>
> What is your enviroment ? number of cpus ? architecture ? size of memory ?
Processor: AMD Phenom(tm) II X6 1055T (six-core)
RAM: 8GB
RHEL6, x86_64. This host doesn't have swap.

It hangs fast. Tomorrow I will send the process states, if that is
interesting to you. With my patch the kernel works fine. I added
debugging and found that it hangs in the described case.
I suppose that my patch may be incorrect, but the problem exists and we
should do something.

2011-03-08 19:02:38

by Andrei Vagin

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
>>>> Hmm.. Although it solves the problem, I think it's not a good idea that
>>>> depends on false alram and give up the retry.
>>>
>>> Any alternative proposals? We should get the livelock fixed if possible..
>>
>> I agree with Minchan and can't think this is a real fix....
>> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
>> 'skip-zombie-process' works enough good for my environ.
>>
>> What is your enviroment ? number of cpus ? architecture ? size of memory ?
>
> me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> improve oom situation.
>
> And, The test program is purely fork bomb. Our oom-killer is not silver
> bullet for fork bomb from very long time ago. That said, oom-killer send
> SIGKILL and start to kill the victim process. But, it doesn't prevent
> to be created new memory hogging tasks. Therefore we have no gurantee
> to win process exiting and creating race.

I think a live-lock is a bug, even if it's provoked by fork bombs.

And now I want to say a few words about zone->all_unreclaimable. I think
this flag is "conservative": it is set when the situation is bad and it's
unset when the situation gets better. If we have only a small number of
reclaimable pages, the situation is still bad. What do you mean when you
say that the kernel is alive? If we have one reclaimable page, is the kernel
alive? Yes, it can work, it will generate many page faults and do
something, but anyone would say that it is more dead than alive.

Try to look at it from my point of view: the patch will be correct and
the kernel will be more alive.

Excuse me if I'm mistaken...


>
> *IF* we really need to care fork bomb issue, we need to write completely
> new VM feature.
>
>

2011-03-09 05:43:36

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Tue, 8 Mar 2011 08:45:51 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <[email protected]> wrote:
> > On Sun, 6 Mar 2011 02:07:59 +0900
> > Minchan Kim <[email protected]> wrote:
> > Any alternative proposals?  We should get the livelock fixed if possible..
> >
>
> And we should avoid unnecessary OOM kill if possible.
>
> I think the problem is caused by (zone->pages_scanned <
> zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is a best. It
> would be rather big on recent big DRAM machines.
>

It means three full scans from the highest priority to the lowest
without freeing any pages. I think big-memory machines tend to have
more cpus, so I don't think it's too big.
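
Roughly (assuming each priority pass scans about lru >> priority pages): one
full sweep from priority 12 down to 0 scans lru/4096 + lru/2048 + ... + lru,
a bit under 2 * lru pages in total, so a pages_scanned threshold of 6 * lru
corresponds to about three such sweeps during which nothing could be freed.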

> I think it is a trade-off between latency and OOM kill.
> If we decrease the magic value, maybe we should prevent the almost
> livelock but happens unnecessary OOM kill.
>

Hmm, should I add a sacrifice feature to cgroup, where some signal (SIGINT?)
is sent by the kernel when it detects that system memory is short?
(For example, when a full LRU scan of a zone has been done, a notifier
fires and SIGINT is sent.)

> And I think zone_reclaimable not fair.
> For example, too many scanning makes reclaimable state to
> unreclaimable state. Maybe it takes a very long time. But just some
> page free makes unreclaimable state to reclaimabe with very easy. So
> we need much painful reclaiming for changing reclaimable state with
> unreclaimabe state. it would affect latency very much.
>
> Maybe we need more smart zone_reclaimabe which is adaptive with memory pressure.
>
I agree.

Thanks,
-Kame

2011-03-09 05:50:20

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Wed, 9 Mar 2011 14:37:04 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 8 Mar 2011 08:45:51 +0900
> Minchan Kim <[email protected]> wrote:
> Hmm, should I support a sacrifice feature 'some signal(SIGINT?) will be sent by
> the kernel when it detects system memory is in short' in cgroup ?
> (For example, if full LRU scan is done in a zone, notifier
> works and SIGINT will be sent.)
>

Sorry, this sounds like "mem_notify" ;), Kosaki-san's old work.

I think nobody would object to the "mem_notify" functionality, but the
implementation details are the problem.... Shouldn't we try it again?

Thanks,
-Kame

2011-03-09 05:59:05

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Tue, 08 Mar 2011 22:02:27 +0300
"[email protected]" <[email protected]> wrote:

> On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
> >>>> Hmm.. Although it solves the problem, I think it's not a good idea that
> >>>> depends on false alram and give up the retry.
> >>>
> >>> Any alternative proposals? We should get the livelock fixed if possible..
> >>
> >> I agree with Minchan and can't think this is a real fix....
> >> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> >> 'skip-zombie-process' works enough good for my environ.
> >>
> >> What is your enviroment ? number of cpus ? architecture ? size of memory ?
> >
> > me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> > improve oom situation.
> >
> > And, The test program is purely fork bomb. Our oom-killer is not silver
> > bullet for fork bomb from very long time ago. That said, oom-killer send
> > SIGKILL and start to kill the victim process. But, it doesn't prevent
> > to be created new memory hogging tasks. Therefore we have no gurantee
> > to win process exiting and creating race.
>
> I think a live-lock is a bug, even if it's provoked by fork bomds.
>

I tried to write a fork-bomb detector in the oom-kill layer, but I now think
it should cooperate with do_fork().
IOW, some fork() calls should return -ENOMEM under OOM conditions.

I'd like to try something, but if you have an idea, please do.
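
A purely hypothetical sketch of that idea, placed roughly in copy_process()
(oom_is_in_progress() is an invented name here, not an existing kernel
helper):

	/* Refuse new children from unprivileged tasks while the OOM
	 * killer is still trying to resolve the situation. */
	if (oom_is_in_progress() && !capable(CAP_SYS_RESOURCE))
		return ERR_PTR(-ENOMEM);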


> And now I want say some words about zone->all_unreclaimable. I think
> this flag is "conservative". It is set when situation is bad and it's
> unset when situation get better. If we have a small number of
> reclaimable pages, the situation is still bad. What do you mean, when
> say that kernel is alive? If we have one reclaimable page, is the kernel
> alive? Yes, it can work, it will generate many page faults and do
> something, but anyone say that it is more dead than alive.
>
> Try to look at it from my point of view. The patch will be correct and
> the kernel will be more alive.
>
> Excuse me, If I'm mistaken...
>

Maybe some more casual interface than oom-kill should be provided.
I wonder if I can add a memory-reclaim priority to the memory cgroup and
allow control of page-fault latency per application...
Maybe "soft_limit" for memcg, which is implemented now, works to some extent.

Thanks,
-Kame

2011-03-09 06:13:19

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Tue, 08 Mar 2011 11:12:22 +0300
Andrew Vagin <[email protected]> wrote:

> Hi, All
> > I agree with Minchan and can't think this is a real fix....
> > Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> > 'skip-zombie-process' works enough good for my environ.
> >
> > What is your enviroment ? number of cpus ? architecture ? size of memory ?
> Processort: AMD Phenom(tm) II X6 1055T Processor (six-core)
> Ram: 8Gb
> RHEL6, x86_64. This host doesn't have swap.
>
Ok, thanks. "NO SWAP" is an important piece of information ;)

> It hangs up fast. Tomorrow I will have to send a processes state, if it
> will be interesting for you. With my patch the kernel work fine. I added
> debug and found that it hangs up in the described case.
> I suppose that my patch may be incorrect, but the problem exists and we
> should do something.
>

Thanks,
-Kame

2011-03-09 06:17:07

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

> On 03/08/2011 06:06 AM, KOSAKI Motohiro wrote:
> >>>> Hmm.. Although it solves the problem, I think it's not a good idea that
> >>>> depends on false alram and give up the retry.
> >>>
> >>> Any alternative proposals? We should get the livelock fixed if possible..
> >>
> >> I agree with Minchan and can't think this is a real fix....
> >> Andrey, I'm now trying your fix and it seems your fix for oom-killer,
> >> 'skip-zombie-process' works enough good for my environ.
> >>
> >> What is your enviroment ? number of cpus ? architecture ? size of memory ?
> >
> > me too. 'skip-zombie-process V1' work fine. and I didn't seen this patch
> > improve oom situation.
> >
> > And, The test program is purely fork bomb. Our oom-killer is not silver
> > bullet for fork bomb from very long time ago. That said, oom-killer send
> > SIGKILL and start to kill the victim process. But, it doesn't prevent
> > to be created new memory hogging tasks. Therefore we have no gurantee
> > to win process exiting and creating race.
>
> I think a live-lock is a bug, even if it's provoked by fork bomds.
>
> And now I want say some words about zone->all_unreclaimable. I think
> this flag is "conservative". It is set when situation is bad and it's
> unset when situation get better. If we have a small number of
> reclaimable pages, the situation is still bad. What do you mean, when
> say that kernel is alive? If we have one reclaimable page, is the kernel
> alive? Yes, it can work, it will generate many page faults and do
> something, but anyone say that it is more dead than alive.
>
> Try to look at it from my point of view. The patch will be correct and
> the kernel will be more alive.
>
> Excuse me, If I'm mistaken...

Hi,

Hmmm...
If I had seen your patch help, I would support your opinion, but I didn't. So now I'm
curious why we reached different conclusions. Tomorrow I'll try to construct a test
environment to reproduce your system.

Unfortunately, zone->all_unreclaimable is an unreliable value while hibernation is
in progress, so I doubt your current patch is acceptable enough. But I'm not against
making an alternative if we can observe the same phenomenon.

At minimum, I also dislike the kernel hang-up issue.

Thanks.

2011-03-10 06:58:31

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

Hi Kame,

Sorry for the late response.
I only had time to test this issue briefly, because I am very busy these days.
This issue is interesting to me,
so I hope to take time for thorough testing when I can.
I should find out the root cause of the livelock.

I will answer your comment after that. :)
Thanks!

On Wed, Mar 9, 2011 at 2:37 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 8 Mar 2011 08:45:51 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Tue, Mar 8, 2011 at 6:58 AM, Andrew Morton <[email protected]> wrote:
>> > On Sun, 6 Mar 2011 02:07:59 +0900
>> > Minchan Kim <[email protected]> wrote:
>> > Any alternative proposals?  We should get the livelock fixed if possible..
>> >
>>
>> And we should avoid unnecessary OOM kill if possible.
>>
>> I think the problem is caused by (zone->pages_scanned <
>> zone_reclaimable_pages(zone) * 6). I am not sure (* 6) is a best. It
>> would be rather big on recent big DRAM machines.
>>
>
> It means 3 times full-scan from the highest priority to the lowest
> and cannot freed any pages. I think big memory machine tend to have
> more cpus, so don't think it's big.
>
>> I think it is a trade-off between latency and OOM kill.
>> If we decrease the magic value, maybe we should prevent the almost
>> livelock but happens unnecessary OOM kill.
>>
>
> Hmm, should I support a sacrifice feature 'some signal(SIGINT?) will be sent by
> the kernel when it detects system memory is in short' in cgroup ?
> (For example, if full LRU scan is done in a zone, notifier
>  works and SIGINT will be sent.)
>
>> And I think zone_reclaimable not fair.
>> For example, too many scanning makes reclaimable state to
>> unreclaimable state. Maybe it takes a very long time. But just some
>> page free makes unreclaimable state to reclaimabe with very easy. So
>> we need much painful reclaiming for changing reclaimable state with
>> unreclaimabe state. it would affect latency very much.
>>
>> Maybe we need more smart zone_reclaimabe which is adaptive with memory pressure.
>>
> I agree.
>
> Thanks,
> -Kame
>
>



--
Kind regards,
Minchan Kim

2011-03-10 14:08:46

by KOSAKI Motohiro

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

> Hi,
>
> Hmmm...
> If I could observed your patch, I did support your opinion. but I didn't. so, now I'm
> curious why we got the different conclusion. tommorow, I'll try to construct a test
> environment to reproduce your system.

Hm,

The following two patches seem to have a bad interaction: the former makes
the dying task SCHED_FIFO at OOM, the latter makes the CPU 100% occupied in a
busy loop if the LRU is really tight.
Of course, I need to run many more tests. I'll dig into it more this
weekend (maybe).


commit 93b43fa55088fe977503a156d1097cc2055449a2
Author: Luis Claudio R. Goncalves <[email protected]>
Date: Mon Aug 9 17:19:41 2010 -0700

oom: give the dying task a higher priority


commit 0e093d99763eb4cea09f8ca4f1d01f34e121d10b
Author: Mel Gorman <[email protected]>
Date: Tue Oct 26 14:21:45 2010 -0700

writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant conge


2011-03-11 00:05:00

by Kamezawa Hiroyuki

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Thu, 10 Mar 2011 15:58:29 +0900
Minchan Kim <[email protected]> wrote:

> Hi Kame,
>
> Sorry for late response.
> I had a time to test this issue shortly because these day I am very busy.
> This issue was interesting to me.
> So I hope taking a time for enough testing when I have a time.
> I should find out root cause of livelock.
>

Thanks. Kosaki-san and I reproduced the bug on a swapless system.
Now Kosaki-san is digging and has found some issues with the scheduler boost at OOM
and the lack of enough "wait" in vmscan.c.

I myself made a patch like the attached one. It works well for returning TRUE from
all_unreclaimable(), but the livelock (deadlock?) still happens.
I suspect vmscan itself isn't the key to fixing the issue.
So I'd like to wait for Kosaki-san's answer ;)

I'm now wondering how to catch a fork bomb and stop it (without using cgroup).
I think the problem is that the fork bomb is faster than killall...

Thanks,
-Kame
==

This is just a debug patch.

---
mm/vmscan.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 54 insertions(+), 4 deletions(-)

Index: mmotm-0303/mm/vmscan.c
===================================================================
--- mmotm-0303.orig/mm/vmscan.c
+++ mmotm-0303/mm/vmscan.c
@@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
}
}

-static bool zone_reclaimable(struct zone *zone)
+static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
{
- return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+ unsigned long nr, wmark, free, isolated, lru;
+
+ /*
+ * If scanned, zone->pages_scanned is incremented and this can
+ * trigger OOM.
+ */
+ if (sc->nr_scanned)
+ return false;
+
+ free = zone_page_state(zone, NR_FREE_PAGES);
+ isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+ if (nr_swap_pages)
+ isolated += zone_page_state(zone, NR_ISOLATED_ANON);
+
+ /* In we cannot do scan, don't count LRU pages. */
+ if (!zone->all_unreclaimable) {
+ lru = zone_page_state(zone, NR_ACTIVE_FILE);
+ lru += zone_page_state(zone, NR_INACTIVE_FILE);
+ if (nr_swap_pages) {
+ lru += zone_page_state(zone, NR_ACTIVE_ANON);
+ lru += zone_page_state(zone, NR_INACTIVE_ANON);
+ }
+ } else
+ lru = 0;
+ nr = free + isolated + lru;
+ wmark = min_wmark_pages(zone);
+ wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
+ wmark += 1 << sc->order;
+ printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
+ current->pid, sc->nr_scanned, zone->all_unreclaimable,
+ zone->pages_scanned,
+ nr,free,isolated,lru,
+ zone_reclaimable_pages(zone), wmark);
+ /*
+ * In some case (especially noswap), almost all page cache are paged out
+ * and we'll see the amount of reclaimable+free pages is smaller than
+ * zone->min. In this case, we canoot expect any recovery other
+ * than OOM-KILL. We can't reclaim memory enough for usual tasks.
+ */
+
+ return nr <= wmark;
+}
+
+static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
+{
+ /* zone_reclaimable_pages() can return 0, we need <= */
+ return zone->pages_scanned <= zone_reclaimable_pages(zone) * 6;
}

/*
@@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone_reclaimable(zone)) {
+ if (zone_seems_empty(zone, sc))
+ continue;
+ if (zone_reclaimable(zone, sc)) {
all_unreclaimable = false;
break;
}
}
+ if (all_unreclaimable)
+ printk("all_unreclaimable() returns TRUE\n");

return all_unreclaimable;
}
@@ -2456,7 +2506,7 @@ loop_again:
if (zone->all_unreclaimable)
continue;
if (!compaction && nr_slab == 0 &&
- !zone_reclaimable(zone))
+ !zone_reclaimable(zone, &sc))
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and

2011-03-11 00:18:24

by Minchan Kim

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 10 Mar 2011 15:58:29 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Hi Kame,
>>
>> Sorry for late response.
>> I had a time to test this issue shortly because these day I am very busy.
>> This issue was interesting to me.
>> So I hope taking a time for enough testing when I have a time.
>> I should find out root cause of livelock.
>>
>
> Thanks. I and Kosaki-san reproduced the bug with swapless system.
> Now, Kosaki-san is digging and found some issue with scheduler boost at OOM
> and lack of enough "wait" in vmscan.c.
>
> I myself made patch like attached one. This works well for returning TRUE at
> all_unreclaimable() but livelock(deadlock?) still happens.

I saw the deadlock.
From my quick debugging it seems to happen at the following code, but I am not sure. I
need to investigate further but don't have time now. :(


	/*
	 * Note: this may have a chance of deadlock if it gets
	 * blocked waiting for another task which itself is waiting
	 * for memory. Is there a better alternative?
	 */
	if (test_tsk_thread_flag(p, TIF_MEMDIE))
		return ERR_PTR(-1UL);
We would wait forever for that task to die without selecting another victim.
If that's right, it's a known BUG and we have had no choice until now. Hmm.

> I wonder vmscan itself isn't a key for fixing issue.

I agree.

> Then, I'd like to wait for Kosaki-san's answer ;)

Me, too. :)

>
> I'm now wondering how to catch fork-bomb and stop it (without using cgroup).

Yes. Fork throttling without cgroup is very important.
And, as an aside, the mem_notify without memcontrol that you mentioned is
important to embedded people, I guess.

> I think the problem is that fork-bomb is faster than killall...

And the deadlock problem I mentioned.

>
> Thanks,
> -Kame

Thanks for the investigation, Kame.

> ==
>
> This is just a debug patch.
>
> ---
>  mm/vmscan.c |   58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 54 insertions(+), 4 deletions(-)
>
> Index: mmotm-0303/mm/vmscan.c
> ===================================================================
> --- mmotm-0303.orig/mm/vmscan.c
> +++ mmotm-0303/mm/vmscan.c
> @@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
>        }
>  }
>
> -static bool zone_reclaimable(struct zone *zone)
> +static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
>  {
> -       return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +       unsigned long nr, wmark, free, isolated, lru;
> +
> +       /*
> +        * If scanned, zone->pages_scanned is incremented and this can
> +        * trigger OOM.
> +        */
> +       if (sc->nr_scanned)
> +               return false;
> +
> +       free = zone_page_state(zone, NR_FREE_PAGES);
> +       isolated = zone_page_state(zone, NR_ISOLATED_FILE);
> +       if (nr_swap_pages)
> +               isolated += zone_page_state(zone, NR_ISOLATED_ANON);
> +
> +       /* In we cannot do scan, don't count LRU pages. */
> +       if (!zone->all_unreclaimable) {
> +               lru = zone_page_state(zone, NR_ACTIVE_FILE);
> +               lru += zone_page_state(zone, NR_INACTIVE_FILE);
> +               if (nr_swap_pages) {
> +                       lru += zone_page_state(zone, NR_ACTIVE_ANON);
> +                       lru += zone_page_state(zone, NR_INACTIVE_ANON);
> +               }
> +       } else
> +               lru = 0;
> +       nr = free + isolated + lru;
> +       wmark = min_wmark_pages(zone);
> +       wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
> +       wmark += 1 << sc->order;
> +       printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
> +               current->pid, sc->nr_scanned, zone->all_unreclaimable,
> +               zone->pages_scanned,
> +               nr,free,isolated,lru,
> +               zone_reclaimable_pages(zone), wmark);
> +       /*
> +        * In some case (especially noswap), almost all page cache are paged out
> +        * and we'll see the amount of reclaimable+free pages is smaller than
> +        * zone->min. In this case, we canoot expect any recovery other
> +        * than OOM-KILL. We can't reclaim memory enough for usual tasks.
> +        */
> +
> +       return nr <= wmark;
> +}
> +
> +static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
> +{
> +       /* zone_reclaimable_pages() can return 0, we need <= */
> +       return zone->pages_scanned <= zone_reclaimable_pages(zone) * 6;
>  }
>
>  /*
> @@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
>                        continue;
>                if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>                        continue;
> -               if (zone_reclaimable(zone)) {
> +               if (zone_seems_empty(zone, sc))
> +                       continue;
> +               if (zone_reclaimable(zone, sc)) {
>                        all_unreclaimable = false;
>                        break;
>                }
>        }
> +       if (all_unreclaimable)
> +               printk("all_unreclaimable() returns TRUE\n");
>
>        return all_unreclaimable;
>  }
> @@ -2456,7 +2506,7 @@ loop_again:
>                        if (zone->all_unreclaimable)
>                                continue;
>                        if (!compaction && nr_slab == 0 &&
> -                           !zone_reclaimable(zone))
> +                           !zone_reclaimable(zone, &sc))
>                                zone->all_unreclaimable = 1;
>                        /*
>                         * If we've done a decent amount of scanning and
>
>



--
Kind regards,
Minchan Kim

2011-03-11 06:08:37

by Andrei Vagin

Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On 03/11/2011 03:18 AM, Minchan Kim wrote:
> On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
>> On Thu, 10 Mar 2011 15:58:29 +0900
>> Minchan Kim<[email protected]> wrote:
>>
>>> Hi Kame,
>>>
>>> Sorry for late response.
>>> I had a time to test this issue shortly because these day I am very busy.
>>> This issue was interesting to me.
>>> So I hope taking a time for enough testing when I have a time.
>>> I should find out root cause of livelock.
>>>
>>
>> Thanks. I and Kosaki-san reproduced the bug with swapless system.
>> Now, Kosaki-san is digging and found some issue with scheduler boost at OOM
>> and lack of enough "wait" in vmscan.c.
>>
>> I myself made patch like attached one. This works well for returning TRUE at
>> all_unreclaimable() but livelock(deadlock?) still happens.
>
> I saw the deadlock.
> It seems to happen by following code by my quick debug but not sure. I
> need to investigate further but don't have a time now. :(
>
>
> * Note: this may have a chance of deadlock if it gets
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> if (test_tsk_thread_flag(p, TIF_MEMDIE))
> return ERR_PTR(-1UL);
> It would be wait to die the task forever without another victim selection.
> If it's right, It's a known BUG and we have no choice until now. Hmm.


I fixed this bug too and sent patch "mm: skip zombie in OOM-killer".

http://groups.google.com/group/linux.kernel/browse_thread/thread/b9c6ddf34d1671ab/2941e1877ca4f626?lnk=raot&pli=1

- if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ if (test_tsk_thread_flag(p, TIF_MEMDIE) && p->mm)
return ERR_PTR(-1UL);

It is not committed yet, because David Rientjes and company are still deciding what
to do about "[patch] oom: prevent unnecessary oom kills or kernel panics.".
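
To spell out why the extra p->mm test matters (my reading of the code; a
simplified sketch, not the exact select_bad_process() body):

	/*
	 * TIF_MEMDIE means "this task was already chosen by the OOM killer
	 * and may dip into memory reserves; wait for it to exit".  But a
	 * task that has already passed exit_mm() (a zombie waiting to be
	 * reaped) has p->mm == NULL and can never free anything more, so
	 * waiting for it livelocks the OOM killer.  Requiring p->mm lets
	 * select_bad_process() go on and pick another victim instead.
	 */
	if (test_tsk_thread_flag(p, TIF_MEMDIE) && p->mm)
		return ERR_PTR(-1UL);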
>
>> I wonder vmscan itself isn't a key for fixing issue.
>
> I agree.
>
>> Then, I'd like to wait for Kosaki-san's answer ;)
>
> Me, too. :)
>
>>
>> I'm now wondering how to catch fork-bomb and stop it (without using cgroup).
>
> Yes. Fork throttling without cgroup is very important.
> And as off-topic, mem_notify without memcontrol you mentioned is
> important to embedded people, I guess.
>
>> I think the problem is that fork-bomb is faster than killall...
>
> And deadlock problem I mentioned.
>
>>
>> Thanks,
>> -Kame
>
> Thanks for the investigation, Kame.
>
>> ==
>>
>> This is just a debug patch.
>>
>> ---
>> mm/vmscan.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 54 insertions(+), 4 deletions(-)
>>
>> Index: mmotm-0303/mm/vmscan.c
>> ===================================================================
>> --- mmotm-0303.orig/mm/vmscan.c
>> +++ mmotm-0303/mm/vmscan.c
>> @@ -1983,9 +1983,55 @@ static void shrink_zones(int priority, s
>> }
>> }
>>
>> -static bool zone_reclaimable(struct zone *zone)
>> +static bool zone_seems_empty(struct zone *zone, struct scan_control *sc)
>> {
>> - return zone->pages_scanned< zone_reclaimable_pages(zone) * 6;
>> + unsigned long nr, wmark, free, isolated, lru;
>> +
>> + /*
>> + * If scanned, zone->pages_scanned is incremented and this can
>> + * trigger OOM.
>> + */
>> + if (sc->nr_scanned)
>> + return false;
>> +
>> + free = zone_page_state(zone, NR_FREE_PAGES);
>> + isolated = zone_page_state(zone, NR_ISOLATED_FILE);
>> + if (nr_swap_pages)
>> + isolated += zone_page_state(zone, NR_ISOLATED_ANON);
>> +
>> + /* In we cannot do scan, don't count LRU pages. */
>> + if (!zone->all_unreclaimable) {
>> + lru = zone_page_state(zone, NR_ACTIVE_FILE);
>> + lru += zone_page_state(zone, NR_INACTIVE_FILE);
>> + if (nr_swap_pages) {
>> + lru += zone_page_state(zone, NR_ACTIVE_ANON);
>> + lru += zone_page_state(zone, NR_INACTIVE_ANON);
>> + }
>> + } else
>> + lru = 0;
>> + nr = free + isolated + lru;
>> + wmark = min_wmark_pages(zone);
>> + wmark += zone->lowmem_reserve[gfp_zone(sc->gfp_mask)];
>> + wmark += 1<< sc->order;
>> + printk("thread %d/%ld all %d scanned %ld pages %ld/%ld/%ld/%ld/%ld/%ld\n",
>> + current->pid, sc->nr_scanned, zone->all_unreclaimable,
>> + zone->pages_scanned,
>> + nr,free,isolated,lru,
>> + zone_reclaimable_pages(zone), wmark);
>> + /*
>> + * In some case (especially noswap), almost all page cache are paged out
>> + * and we'll see the amount of reclaimable+free pages is smaller than
>> + * zone->min. In this case, we canoot expect any recovery other
>> + * than OOM-KILL. We can't reclaim memory enough for usual tasks.
>> + */
>> +
>> + return nr<= wmark;
>> +}
>> +
>> +static bool zone_reclaimable(struct zone *zone, struct scan_control *sc)
>> +{
>> + /* zone_reclaimable_pages() can return 0, we need<= */
>> + return zone->pages_scanned<= zone_reclaimable_pages(zone) * 6;
>> }
>>
>> /*
>> @@ -2006,11 +2052,15 @@ static bool all_unreclaimable(struct zon
>> continue;
>> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>> continue;
>> - if (zone_reclaimable(zone)) {
>> + if (zone_seems_empty(zone, sc))
>> + continue;
>> + if (zone_reclaimable(zone, sc)) {
>> all_unreclaimable = false;
>> break;
>> }
>> }
>> + if (all_unreclaimable)
>> + printk("all_unreclaimable() returns TRUE\n");
>>
>> return all_unreclaimable;
>> }
>> @@ -2456,7 +2506,7 @@ loop_again:
>> if (zone->all_unreclaimable)
>> continue;
>> if (!compaction&& nr_slab == 0&&
>> - !zone_reclaimable(zone))
>> + !zone_reclaimable(zone,&sc))
>> zone->all_unreclaimable = 1;
>> /*
>> * If we've done a decent amount of scanning and
>>
>>
>
>
>

2011-03-14 01:03:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

On Fri, Mar 11, 2011 at 3:08 PM, [email protected] <[email protected]> wrote:
> On 03/11/2011 03:18 AM, Minchan Kim wrote:
>>
>> On Fri, Mar 11, 2011 at 8:58 AM, KAMEZAWA Hiroyuki
>> <[email protected]>  wrote:
>>>
>>> On Thu, 10 Mar 2011 15:58:29 +0900
>>> Minchan Kim<[email protected]>  wrote:
>>>
>>>> Hi Kame,
>>>>
>>>> Sorry for late response.
>>>> I had a time to test this issue shortly because these day I am very
>>>> busy.
>>>> This issue was interesting to me.
>>>> So I hope taking a time for enough testing when I have a time.
>>>> I should find out root cause of livelock.
>>>>
>>>
>>> Thanks. I and Kosaki-san reproduced the bug with swapless system.
>>> Now, Kosaki-san is digging and found some issue with scheduler boost at
>>> OOM
>>> and lack of enough "wait" in vmscan.c.
>>>
>>> I myself made patch like attached one. This works well for returning TRUE
>>> at
>>> all_unreclaimable() but livelock(deadlock?) still happens.
>>
>> I saw the deadlock.
>> It seems to happen by following code by my quick debug but not sure. I
>> need to investigate further but don't have a time now. :(
>>
>>
>>                  * Note: this may have a chance of deadlock if it gets
>>                  * blocked waiting for another task which itself is
>> waiting
>>                  * for memory. Is there a better alternative?
>>                  */
>>                 if (test_tsk_thread_flag(p, TIF_MEMDIE))
>>                         return ERR_PTR(-1UL);
>> It would be wait to die the task forever without another victim selection.
>> If it's right, It's a known BUG and we have no choice until now. Hmm.
>
>
> I fixed this bug too and sent patch "mm: skip zombie in OOM-killer".
>
> http://groups.google.com/group/linux.kernel/browse_thread/thread/b9c6ddf34d1671ab/2941e1877ca4f626?lnk=raot&pli=1
>
> -               if (test_tsk_thread_flag(p, TIF_MEMDIE))
> +               if (test_tsk_thread_flag(p, TIF_MEMDIE) && p->mm)
>                        return ERR_PTR(-1UL);
>
> It is not committed yet, because Devid Rientjes and company think what to do
> with "[patch] oom: prevent unnecessary oom kills or kernel panics.".

Thanks, Andrey.
The patch "mm: skip zombie in OOM-killer" solves my livelock issue
but I didn't look effectiveness of "mm: check zone->all_unreclaimable
in all_unreclaimable". I have to look further.

But your patch "mm: skip zombie in OOM-killer" is very controversial
because it breaks the multi-thread case.
Since find_lock_task_mm() was introduced, we have handled multi-threaded cases,
but I don't think it covers all cases completely, as the TIF_MEMDIE
discussion going on now shows.

I will watch the discussion.
--
Kind regards,
Minchan Kim

2011-05-04 01:39:16

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()



----- Original Message -----
> On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> >> kernel may hang up, because shrink_zones() will do nothing, but
> >> all_unreclaimable() will say, that zone has reclaimable pages.
> >>
> >> do_try_to_free_pages()
> >> shrink_zones()
> >> for_each_zone
> >> if (zone->all_unreclaimable)
> >> continue
> >> if !all_unreclaimable(zonelist, sc)
> >> return 1
> >>
> >> __alloc_pages_slowpath()
> >> retry:
> >> did_some_progress = do_try_to_free_pages(page)
> >> ...
> >> if (!page&& did_some_progress)
> >> retry;
> >>
> >> Signed-off-by: Andrey Vagin<[email protected]>
> >> ---
> >> mm/vmscan.c | 2 ++
> >> 1 files changed, 2 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 6771ea7..1c056f7 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist
> >> *zonelist,
> >>
> >> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> >> gfp_zone(sc->gfp_mask), sc->nodemask) {
> >> + if (zone->all_unreclaimable)
> >> + continue;
> >> if (!populated_zone(zone))
> >> continue;
> >> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> >
> > zone_reclaimable checks it. Isn't it enough?
> I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> This two patches are enough.
> > Does the hang up really happen or see it by code review?
> Yes. You can reproduce it for help the attached python program. It's
> not
> very clever:)
> It make the following actions in loop:
> 1. fork
> 2. mmap
> 3. touch memory
> 4. read memory
> 5. munmmap
>
> >> --
> >> 1.7.1
I have tested this with the latest mainline kernel using the attached
reproducer; the system just hung or deadlocked after the OOM. The whole OOM
trace is here.
http://people.redhat.com/qcai/oom.log

Did I miss anything?


Attachments:
memeater.py (695.00 B)
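
For reference, the loop the reproducer performs (as described in the quoted
mail above) is roughly the following in C. This is only a sketch of the
described behaviour; the actual memeater.py attachment is a Python script and
may differ in detail. Do not run it outside a disposable VM, since it is
deliberately a fork/memory bomb.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK	(64UL << 20)	/* 64MB per child; an arbitrary choice */

int main(void)
{
	for (;;) {
		pid_t pid = fork();			/* 1. fork */

		if (pid == 0) {
			size_t i;
			volatile char sum = 0;
			char *p = mmap(NULL, CHUNK,	/* 2. mmap */
				       PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

			if (p == MAP_FAILED)
				_exit(1);
			memset(p, 0xaa, CHUNK);		/* 3. touch memory */
			for (i = 0; i < CHUNK; i += 4096)
				sum += p[i];		/* 4. read memory */
			munmap(p, CHUNK);		/* 5. munmap */
			_exit(0);
		}
		/* reap finished children without slowing down the fork loop */
		while (waitpid(-1, NULL, WNOHANG) > 0)
			;
	}
	return 0;
}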

2011-05-09 06:54:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

>
>
> ----- Original Message -----
> > On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise the
> > >> kernel may hang up, because shrink_zones() will do nothing, but
> > >> all_unreclaimable() will say, that zone has reclaimable pages.
> > >>
> > >> do_try_to_free_pages()
> > >> shrink_zones()
> > >> for_each_zone
> > >> if (zone->all_unreclaimable)
> > >> continue
> > >> if !all_unreclaimable(zonelist, sc)
> > >> return 1
> > >>
> > >> __alloc_pages_slowpath()
> > >> retry:
> > >> did_some_progress = do_try_to_free_pages(page)
> > >> ...
> > >> if (!page&& did_some_progress)
> > >> retry;
> > >>
> > >> Signed-off-by: Andrey Vagin<[email protected]>
> > >> ---
> > >> mm/vmscan.c | 2 ++
> > >> 1 files changed, 2 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > >> index 6771ea7..1c056f7 100644
> > >> --- a/mm/vmscan.c
> > >> +++ b/mm/vmscan.c
> > >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct zonelist
> > >> *zonelist,
> > >>
> > >> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > >> gfp_zone(sc->gfp_mask), sc->nodemask) {
> > >> + if (zone->all_unreclaimable)
> > >> + continue;
> > >> if (!populated_zone(zone))
> > >> continue;
> > >> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > >
> > > zone_reclaimable checks it. Isn't it enough?
> > I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > This two patches are enough.
> > > Does the hang up really happen or see it by code review?
> > Yes. You can reproduce it for help the attached python program. It's
> > not
> > very clever:)
> > It make the following actions in loop:
> > 1. fork
> > 2. mmap
> > 3. touch memory
> > 4. read memory
> > 5. munmmap
> >
> > >> --
> > >> 1.7.1
> I have tested this for the latest mainline kernel using the reproducer
> attached, the system just hung or deadlock after oom. The whole oom
> trace is here.
> http://people.redhat.com/qcai/oom.log
>
> Did I miss anything?

Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?



2011-05-09 08:48:17

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()



----- Original Message -----
> >
> >
> > ----- Original Message -----
> > > On 03/05/2011 06:20 PM, Minchan Kim wrote:
> > > > On Sat, Mar 05, 2011 at 02:44:16PM +0300, Andrey Vagin wrote:
> > > >> Check zone->all_unreclaimable in all_unreclaimable(), otherwise
> > > >> the
> > > >> kernel may hang up, because shrink_zones() will do nothing, but
> > > >> all_unreclaimable() will say, that zone has reclaimable pages.
> > > >>
> > > >> do_try_to_free_pages()
> > > >> shrink_zones()
> > > >> for_each_zone
> > > >> if (zone->all_unreclaimable)
> > > >> continue
> > > >> if !all_unreclaimable(zonelist, sc)
> > > >> return 1
> > > >>
> > > >> __alloc_pages_slowpath()
> > > >> retry:
> > > >> did_some_progress = do_try_to_free_pages(page)
> > > >> ...
> > > >> if (!page&& did_some_progress)
> > > >> retry;
> > > >>
> > > >> Signed-off-by: Andrey Vagin<[email protected]>
> > > >> ---
> > > >> mm/vmscan.c | 2 ++
> > > >> 1 files changed, 2 insertions(+), 0 deletions(-)
> > > >>
> > > >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > >> index 6771ea7..1c056f7 100644
> > > >> --- a/mm/vmscan.c
> > > >> +++ b/mm/vmscan.c
> > > >> @@ -2002,6 +2002,8 @@ static bool all_unreclaimable(struct
> > > >> zonelist
> > > >> *zonelist,
> > > >>
> > > >> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > >> gfp_zone(sc->gfp_mask), sc->nodemask) {
> > > >> + if (zone->all_unreclaimable)
> > > >> + continue;
> > > >> if (!populated_zone(zone))
> > > >> continue;
> > > >> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> > > >
> > > > zone_reclaimable checks it. Isn't it enough?
> > > I sent one more patch [PATCH] mm: skip zombie in OOM-killer.
> > > This two patches are enough.
> > > > Does the hang up really happen or see it by code review?
> > > Yes. You can reproduce it for help the attached python program.
> > > It's
> > > not
> > > very clever:)
> > > It make the following actions in loop:
> > > 1. fork
> > > 2. mmap
> > > 3. touch memory
> > > 4. read memory
> > > 5. munmmap
> > >
> > > >> --
> > > >> 1.7.1
> > I have tested this for the latest mainline kernel using the
> > reproducer
> > attached, the system just hung or deadlock after oom. The whole oom
> > trace is here.
> > http://people.redhat.com/qcai/oom.log
> >
> > Did I miss anything?
>
> Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
As I mentioned, I have tested the latest mainline, which already includes
that fix. Also, does this problem occur only on x86? The testing was done
on x86_64; not sure if that would be a problem.

2011-05-09 09:19:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable()

> > > I have tested this for the latest mainline kernel using the
> > > reproducer
> > > attached, the system just hung or deadlock after oom. The whole oom
> > > trace is here.
> > > http://people.redhat.com/qcai/oom.log
> > >
> > > Did I miss anything?
> >
> > Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
> As I have mentioned that I have tested the latest mainline which have
> already included that fix. Also, does this problem only for x86? The
> testing was done using x86_64. Not sure if that would be a problem.

No. I'm also using x86_64, and my machine works fine on the current
latest Linus tree. I confirmed it today.



2011-05-10 08:12:05

by KOSAKI Motohiro

[permalink] [raw]
Subject: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

(cc to oom interested people)

> > > > I have tested this for the latest mainline kernel using the
> > > > reproducer
> > > > attached, the system just hung or deadlock after oom. The whole oom
> > > > trace is here.
> > > > http://people.redhat.com/qcai/oom.log
> > > >
> > > > Did I miss anything?
> > >
> > > Can you please try commit 929bea7c714220fc76ce3f75bef9056477c28e74?
> > As I have mentioned that I have tested the latest mainline which have
> > already included that fix. Also, does this problem only for x86? The
> > testing was done using x86_64. Not sure if that would be a problem.
>
> No. I'm also using x86_64 and my machine completely works on current
> latest linus tree. I confirmed it today.

> 4194288 pages RAM

You have 16GB RAM.

> Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice child
> Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice child
> Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice child
> Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice child
> Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice child
> Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice child
> Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice child
> Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or sacrifice child
> Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or sacrifice child
> Out of memory: Kill process 1507 (automount) score 1 or sacrifice child

Oops.

OK. That's a known issue. The current OOM logic doesn't work if you have
gigabytes of RAM, because _all_ processes end up with exactly the same score (=1);
the OOM killer then just falls back to being a random process killer. This was
introduced by commit a63d83f427 (oom: badness heuristic rewrite). I have pointed
it out at least three times. You have to blame the Google folks. :-/


There are three problems.

1) If two processes have the same oom score, we should kill the younger process,
but the current logic kills the older one. The oldest processes are typically system daemons.
2) The current logic uses 'unsigned int' for the internal score calculation (more exactly,
it only uses the values 0-1000). This very low-precision calculation produces a lot of
identical oom scores and ends up killing ineligible processes.
3) The current logic gives root processes a bonus of 3% of system RAM, which is obviously
too big if you have plenty of memory. Here, your fork-bomb processes get a 500MB
OOM-immunity bonus, so your fork bomb will never, ever be killed.

CAI-san: I've made patches to fix these. Can you please try them?


2011-05-10 08:14:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 1/4] oom: improve dump_tasks() show items

Recently, the oom internal logic was dramatically changed. Thus
dump_tasks() is no longer useful: it has some meaningless
items and is missing some oom-score-related items.

This patch adapts the displayed fields to the new oom logic.

details
==========
removed: pid (we always kill a whole process; we don't need the thread id),
mm->total_vm (we no longer use the virtual memory size)
signal->oom_adj (we no longer use it internally)
added: ppid (we often kill a sacrifice child process)
modified: RSS (now accounts for mm->nr_ptes too)

Signed-off-by: KOSAKI Motohiro <[email protected]>
---

Strictly speaking, this is NOT part of the oom fixing patches, but it's
necessary for parsing CAI's test results.


mm/oom_kill.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f52e85c..118d958 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
struct task_struct *p;
struct task_struct *task;

- pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
+ pr_info("[ pid] ppid uid rss cpu score_adj name\n");
for_each_process(p) {
if (oom_unkillable_task(p, mem, nodemask))
continue;
@@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
continue;
}

- pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
- task->pid, task_uid(task), task->tgid,
- task->mm->total_vm, get_mm_rss(task->mm),
- task_cpu(task), task->signal->oom_adj,
- task->signal->oom_score_adj, task->comm);
+ pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
+ task_tgid_nr(task), task_tgid_nr(task->real_parent),
+ task_uid(task),
+ get_mm_rss(task->mm) + p->mm->nr_ptes,
+ task_cpu(task),
+ task->signal->oom_score_adj,
+ task->comm);
task_unlock(task);
}
}
--
1.7.3.1


2011-05-10 08:15:10

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 2/4] oom: kill younger process first

This patch introduces do_each_thread_reverse() and makes
select_bad_process() use it. The benefits are two:
1) the oom-killer can kill a younger process in preference to an older
one if they have the same oom score. Usually a younger process
is less important. 2) younger tasks often have PF_EXITING set,
because shell scripts create a lot of short-lived processes.
A reverse-order search can detect them faster.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/sched.h | 6 ++++++
mm/oom_kill.c | 2 +-
2 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 013314a..a0a8339 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

+#define prev_task(p) \
+ list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
+
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )

@@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
#define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

+#define do_each_thread_reverse(g, t) \
+ for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
+
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 118d958..0cf5091 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
struct task_struct *chosen = NULL;
*ppoints = 0;

- do_each_thread(g, p) {
+ do_each_thread_reverse(g, p) {
unsigned int points;

if (!p->mm)
--
1.7.3.1


2011-05-10 08:15:49

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally

CAI Qian reported that his kernel hung up when he ran a fork-intensive
workload and then invoked the oom-killer.

The problem is that the current oom calculation uses a 0-1000 normalized value
(the unit is a permillage of system RAM). Its low precision produces
a lot of identical oom scores. IOW, in his case, all processes have a <1
oom score and the internal integer calculation rounds it up to 1. Thus the
oom-killer kills ineligible processes. This regression was caused by
commit a63d83f427 (oom: badness heuristic rewrite).

The solution is for the internal calculation to just use the number of pages
instead of a permillage of system RAM, and to convert it to a permillage
value at display time.

This patch doesn't change any ABI (including /proc/<pid>/oom_score_adj),
even though the current logic has a lot of things I dislike.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 13 ++++++----
include/linux/oom.h | 7 +----
mm/oom_kill.c | 60 +++++++++++++++++++++++++++++++++-----------------
3 files changed, 49 insertions(+), 31 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..d6b0424 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -476,14 +476,17 @@ static const struct file_operations proc_lstats_operations = {

static int proc_oom_score(struct task_struct *task, char *buffer)
{
- unsigned long points = 0;
+ unsigned long points;
+ unsigned long ratio = 0;
+ unsigned long totalpages = totalram_pages + total_swap_pages + 1;

read_lock(&tasklist_lock);
- if (pid_alive(task))
- points = oom_badness(task, NULL, NULL,
- totalram_pages + total_swap_pages);
+ if (pid_alive(task)) {
+ points = oom_badness(task, NULL, NULL, totalpages);
+ ratio = points * 1000 / totalpages;
+ }
read_unlock(&tasklist_lock);
- return sprintf(buffer, "%lu\n", points);
+ return sprintf(buffer, "%lu\n", ratio);
}

struct limit_names {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 5e3aa83..0f5b588 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -40,7 +40,8 @@ enum oom_constraint {
CONSTRAINT_MEMCG,
};

-extern unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+/* The badness from the OOM killer */
+extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
@@ -62,10 +63,6 @@ static inline void oom_killer_enable(void)
oom_killer_disabled = false;
}

-/* The badness from the OOM killer */
-extern unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long uptime);
-
extern struct task_struct *find_lock_task_mm(struct task_struct *p);

/* sysctls */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0cf5091..ba95870 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -132,10 +132,12 @@ static bool oom_unkillable_task(struct task_struct *p,
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
+unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask, unsigned long totalpages)
{
- int points;
+ unsigned long points;
+ unsigned long score_adj = 0;
+

if (oom_unkillable_task(p, mem, nodemask))
return 0;
@@ -160,7 +162,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
if (p->flags & PF_OOM_ORIGIN) {
task_unlock(p);
- return 1000;
+ return ULONG_MAX;
}

/*
@@ -176,33 +178,49 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
points = get_mm_rss(p->mm) + p->mm->nr_ptes;
points += get_mm_counter(p->mm, MM_SWAPENTS);
-
- points *= 1000;
- points /= totalpages;
task_unlock(p);

/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
* implementation used by LSMs.
+ *
+ * XXX: Too large bonus. Example,if the system have tera-bytes memory...
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ if (points >= totalpages / 32)
+ points -= totalpages / 32;
+ else
+ points = 0;
+ }

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ if (p->signal->oom_score_adj >= 0) {
+ score_adj = p->signal->oom_score_adj * (totalpages / 1000);
+ if (ULONG_MAX - points >= score_adj)
+ points += score_adj;
+ else
+ points = ULONG_MAX;
+ } else {
+ score_adj = -p->signal->oom_score_adj * (totalpages / 1000);
+ if (points >= score_adj)
+ points -= score_adj;
+ else
+ points = 0;
+ }

/*
* Never return 0 for an eligible task that may be killed since it's
* possible that no single user task uses more than 0.1% of memory and
* no single admin tasks uses more than 3.0%.
*/
- if (points <= 0)
- return 1;
- return (points < 1000) ? points : 1000;
+ if (!points)
+ points = 1;
+
+ return points;
}

/*
@@ -274,7 +292,7 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
-static struct task_struct *select_bad_process(unsigned int *ppoints,
+static struct task_struct *select_bad_process(unsigned long *ppoints,
unsigned long totalpages, struct mem_cgroup *mem,
const nodemask_t *nodemask)
{
@@ -283,7 +301,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*ppoints = 0;

do_each_thread_reverse(g, p) {
- unsigned int points;
+ unsigned long points;

if (!p->mm)
continue;
@@ -314,7 +332,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ *ppoints = ULONG_MAX;
} else {
/*
* If this task is not being ptraced on exit,
@@ -444,14 +462,14 @@ static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
#undef K

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned int points, unsigned long totalpages,
+ unsigned long points, unsigned long totalpages,
struct mem_cgroup *mem, nodemask_t *nodemask,
const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
struct task_struct *t = p;
- unsigned int victim_points = 0;
+ unsigned long victim_points = 0;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -466,7 +484,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}

task_lock(p);
- pr_err("%s: Kill process %d (%s) score %d or sacrifice child\n",
+ pr_err("%s: Kill process %d (%s) points %d or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);
task_unlock(p);

@@ -478,7 +496,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
do {
list_for_each_entry(child, &t->children, sibling) {
- unsigned int child_points;
+ unsigned long child_points;

if (child->mm == p->mm)
continue;
@@ -525,7 +543,7 @@ static void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long limit;
- unsigned int points = 0;
+ unsigned long points = 0;
struct task_struct *p;

/*
@@ -674,7 +692,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
struct task_struct *p;
unsigned long totalpages;
unsigned long freed = 0;
- unsigned int points;
+ unsigned long points;
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

--
1.7.3.1


2011-05-10 08:16:28

by KOSAKI Motohiro

[permalink] [raw]
Subject: [PATCH 4/4] oom: don't kill random process

CAI Qian reported that the oom-killer first killed all the system daemons in
his system when he ran a fork bomb as root. The problem is that the
current logic gives them a bonus of 3% of system RAM. For example,
he has a 16GB machine, so root processes get ~500MB of oom
immunity. That brings a crazily bad result: _all_ processes have
oom-score=1, and the oom killer then ignores process memory usage
and kills a random process. This regression was caused by commit
a63d83f427 (oom: badness heuristic rewrite).

This patch changes select_bad_process() slightly. If oom points == 1,
it's a sign that the system has only root-privileged processes or
similar. In that case, select_bad_process() recalculates the oom badness
without the root bonus and selects an eligible process.

Reported-by: CAI Qian <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 2 +-
include/linux/oom.h | 3 ++-
mm/oom_kill.c | 19 +++++++++++++++----
3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index d6b0424..b608b69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -482,7 +482,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)

read_lock(&tasklist_lock);
if (pid_alive(task)) {
- points = oom_badness(task, NULL, NULL, totalpages);
+ points = oom_badness(task, NULL, NULL, totalpages, 1);
ratio = points * 1000 / totalpages;
}
read_unlock(&tasklist_lock);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 0f5b588..3dd3669 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -42,7 +42,8 @@ enum oom_constraint {

/* The badness from the OOM killer */
extern unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages);
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root);
extern int try_set_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ba95870..525e1d2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -133,7 +133,8 @@ static bool oom_unkillable_task(struct task_struct *p,
* task consuming the most memory to avoid subsequent oom failures.
*/
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
- const nodemask_t *nodemask, unsigned long totalpages)
+ const nodemask_t *nodemask, unsigned long totalpages,
+ int protect_root)
{
unsigned long points;
unsigned long score_adj = 0;
@@ -186,7 +187,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*
* XXX: Too large bonus. Example,if the system have tera-bytes memory...
*/
- if (has_capability_noaudit(p, CAP_SYS_ADMIN)) {
+ if (protect_root && has_capability_noaudit(p, CAP_SYS_ADMIN)) {
if (points >= totalpages / 32)
points -= totalpages / 32;
else
@@ -298,8 +299,10 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
{
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
+ int protect_root = 1;
*ppoints = 0;

+ retry:
do_each_thread_reverse(g, p) {
unsigned long points;

@@ -345,13 +348,18 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
}
}

- points = oom_badness(p, mem, nodemask, totalpages);
+ points = oom_badness(p, mem, nodemask, totalpages, protect_root);
if (points > *ppoints) {
chosen = p;
*ppoints = points;
}
} while_each_thread(g, p);

+ if (protect_root && (*ppoints == 1)) {
+ protect_root = 0;
+ goto retry;
+ }
+
return chosen;
}

@@ -470,6 +478,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct task_struct *child;
struct task_struct *t = p;
unsigned long victim_points = 0;
+ int admin;

if (printk_ratelimit())
dump_header(p, gfp_mask, order, mem, nodemask);
@@ -494,6 +503,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* parent. This attempts to lose the minimal amount of work done while
* still freeing memory.
*/
+ admin = has_capability_noaudit(victim, CAP_SYS_ADMIN);
+ victim_points = oom_badness(victim, mem, nodemask, totalpages, !admin);
do {
list_for_each_entry(child, &t->children, sibling) {
unsigned long child_points;
@@ -504,7 +515,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
* oom_badness() returns 0 if the thread is unkillable
*/
child_points = oom_badness(child, mem, nodemask,
- totalpages);
+ totalpages, !admin);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
--
1.7.3.1


2011-05-10 23:22:43

by David Rientjes

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> OK. That's known issue. Current OOM logic doesn't works if you have
> gigabytes RAM. because _all_ process have the exactly same score (=1).
> then oom killer just fallback to random process killer. It was made
> commit a63d83f427 (oom: badness heuristic rewrite). I pointed out
> it at least three times. You have to blame Google folks. :-/
>

If all threads have the same badness score, which by definition must be 1
since that is the lowest badness score possible for an eligible thread,
then each thread is using < 0.2% of RAM.

The granularity of the badness score doesn't differentiate between threads
using 0.1% of RAM in terms of priority for kill (in this case, 16MB). The
largest consumers of memory from CAI's log have an rss of 336MB, which is
~2% of system RAM. The problem is that these are forked by root and
therefore get a 3% bonus, making their badness score 1 instead of 2.

[ You also don't have to blame "Google folks," I rewrote the oom
killer. ]

>
> The problems are three.
>
> 1) if two processes have the same oom score, we should kill younger process.
> but current logic kill older. Oldest processes are typicall system daemons.

Agreed, that seems advantageous to prefer killing threads that have done
the least amount of work (defined as those with the least runtime compared
to others in the tasklist order) over others.

> 2) Current logic use 'unsigned int' for internal score calculation. (exactly says,
> it only use 0-1000 value). its very low precision calculation makes a lot of
> same oom score and kill an ineligible process.

The range of 0-1000 allows us to differentiate tasks up to 0.1% of system
RAM from each other when making oom kill decisions. If we really want to
increase this granularity, we could increase the value to 10000 and then
multiply oom_score_adj values by 10.

> 3) Current logic give 3% of SystemRAM to root processes. It obviously too big
> if you have plenty memory. Now, your fork-bomb processes have 500MB OOM immune
> bonus. then your fork-bomb never ever be killed.
>

I agree that a constant proportion for root processes is probably not
ideal, especially in situations where there are many small threads that
only use about 1% of system RAM, such as in CAI's case. I don't agree
that we need to guard against forkbombs created by root, however. The
worst case scenario is that the continuous killing of non-root threads
will allow the admin to fix his or her error.

2011-05-10 23:29:08

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 1/4] oom: improve dump_tasks() show items

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> Recently, oom internal logic was dramatically changed. Thus
> dump_tasks() is no longer useful. it has some meaningless
> items and don't have some oom socre related items.
>

This changelog is inaccurate.

dump_tasks() is actually useful as it currently stands; there are things
that you may add or remove but saying that it is "no longer useful" is an
exaggeration.

> This patch adapt displaying fields to new oom logic.
>
> details
> ==========
> removed: pid (we always kill process. don't need thread id),
> mm->total_vm (we no longer uses virtual memory size)

Showing mm->total_vm is still interesting to know what the old heuristic
would have used rather than the new heuristic; I'd prefer that we keep it.

> signal->oom_adj (we no longer uses it internally)
> added: ppid (we often kill sacrifice child process)
> modify: RSS (account mm->nr_ptes too)

I'd prefer if ptes were shown independently from rss instead of adding it
to the thread's true rss usage and representing it as such.

I think the cpu should also be removed.

For the next version, could you show the old output and a comparison to the
new output in the changelog?

>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
>
> Strictly speaking. this is NOT a part of oom fixing patches. but it's
> necessary when I parse QAI's test result.
>
>
> mm/oom_kill.c | 14 ++++++++------
> 1 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index f52e85c..118d958 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
> struct task_struct *p;
> struct task_struct *task;
>
> - pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
> + pr_info("[ pid] ppid uid rss cpu score_adj name\n");
> for_each_process(p) {
> if (oom_unkillable_task(p, mem, nodemask))
> continue;
> @@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
> continue;
> }
>
> - pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
> - task->pid, task_uid(task), task->tgid,
> - task->mm->total_vm, get_mm_rss(task->mm),
> - task_cpu(task), task->signal->oom_adj,
> - task->signal->oom_score_adj, task->comm);
> + pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
> + task_tgid_nr(task), task_tgid_nr(task->real_parent),
> + task_uid(task),
> + get_mm_rss(task->mm) + p->mm->nr_ptes,
> + task_cpu(task),
> + task->signal->oom_score_adj,
> + task->comm);
> task_unlock(task);
> }
> }

2011-05-10 23:31:14

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two,
> 1) oom-killer can kill younger process than older if
> they have a same oom score. Usually younger process
> is less important. 2) younger task often have PF_EXITING
> because shell script makes a lot of short lived processes.
> Reverse order search can detect it faster.
>

I like this change, thanks! I'm surprised we haven't needed a
do_each_thread_reverse() in the past somewhere else in the kernel.

Could you update the comment about do_each_thread() not being break-safe
in the second version, though?

After that:

Acked-by: David Rientjes <[email protected]>

> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> include/linux/sched.h | 6 ++++++
> mm/oom_kill.c | 2 +-
> 2 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 013314a..a0a8339 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
> #define next_task(p) \
> list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>
> +#define prev_task(p) \
> + list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
> +
> #define for_each_process(p) \
> for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>
> @@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
> #define do_each_thread(g, t) \
> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>
> +#define do_each_thread_reverse(g, t) \
> + for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 118d958..0cf5091 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> struct task_struct *chosen = NULL;
> *ppoints = 0;
>
> - do_each_thread(g, p) {
> + do_each_thread_reverse(g, p) {
> unsigned int points;
>
> if (!p->mm)

2011-05-10 23:40:58

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported his kernel did hang-up if he ran fork intensive
> workload and then invoke oom-killer.
>
> The problem is, Current oom calculation uses 0-1000 normalized value
> (The unit is a permillage of system-ram). Its low precision make
> a lot of same oom score. IOW, in his case, all processes have <1
> oom score and internal integral calculation round it to 1. Thus
> oom-killer kill ineligible process. This regression is caused by
> commit a63d83f427 (oom: badness heuristic rewrite).
>
> The solution is, the internal calculation just use number of pages
> instead of permillage of system-ram. And convert it to permillage
> value at displaying time.
>
> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
> even though current logic has a lot of my dislike thing.
>

s/permillage/proportion/

This is unacceptable: it does not allow users to tune oom_score_adj
appropriately based on the scores exported by /proc/pid/oom_score to
discount an amount of RAM from a thread's memory usage in systemwide,
memory controller, cpuset, or mempolicy contexts. This is only possible
because the oom score is normalized.

What would be acceptable would be to increase the granularity of the score
to 10000 or 100000 to differentiate between threads using 0.01% or 0.001%
of RAM from each other, respectively. The range of oom_score_adj would
remain the same, however, and be multiplied by 10 or 100, respectively,
when factored into the badness score baseline. I don't believe userspace
cares to differentiate between more than 0.1% of available memory.

The other issue that this patch addresses is the bonus given to root
processes. I agree that if a root process is using 4% of RAM that it
should not be equal to all other threads using 1%. I do believe that a
root process using 60% of RAM should be equal priority to a thread using
57%, however. Perhaps a compromise would be to give root processes a
bonus of 1% for every 30% of RAM they consume?
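
Roughly, on the current 0-1000 scale, that compromise would behave like this
(numbers only, shown as a throwaway userspace program rather than a patch;
mapping it onto the existing 0-1000 scale is just my illustration):

#include <stdio.h>

int main(void)
{
	int points;

	/* proposed: 1% bonus (10 points) for every full 30% of RAM
	 * (300 points) a root process already uses, so it tops out at 3% */
	for (points = 0; points <= 1000; points += 150) {
		int bonus = 10 * (points / 300);

		printf("root process at %3d%% of RAM -> bonus %2d points (%d%%)\n",
		       points / 10, bonus, bonus / 10);
	}
	return 0;
}

A root process using 60% of RAM would then compete as if it used 58%, instead
of getting a flat 3% discount no matter how little memory it uses.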

2011-05-10 23:41:50

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 4/4] oom: don't kill random process

On Tue, 10 May 2011, KOSAKI Motohiro wrote:

> CAI Qian reported oom-killer killed all system daemons in his
> system at first if he ran fork bomb as root. The problem is,
> current logic give them bonus of 3% of system ram. Example,
> he has 16GB machine, then root processes have ~500MB oom
> immune. It bring us crazy bad result. _all_ processes have
> oom-score=1 and then, oom killer ignore process memroy usage
> and kill random process. This regression is caused by commit
> a63d83f427 (oom: badness heuristic rewrite).
>
> This patch changes select_bad_process() slightly. If oom points == 1,
> it's a sign that the system have only root privileged processes or
> similar. Thus, select_bad_process() calculate oom badness without
> root bonus and select eligible process.
>

This second (and very costly) iteration is unnecessary if the range of
oom scores is increased from 1000 to 10000 or 100000 as suggested in the
previous patch.

2011-05-11 02:31:54

by Qian Cai

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())



----- Original Message -----
> (cc to oom interested people)
>
> > > > > I have tested this for the latest mainline kernel using the
> > > > > reproducer
> > > > > attached, the system just hung or deadlock after oom. The
> > > > > whole oom
> > > > > trace is here.
> > > > > http://people.redhat.com/qcai/oom.log
> > > > >
> > > > > Did I miss anything?
> > > >
> > > > Can you please try commit
> > > > 929bea7c714220fc76ce3f75bef9056477c28e74?
> > > As I have mentioned that I have tested the latest mainline which
> > > have
> > > already included that fix. Also, does this problem only for x86?
> > > The
> > > testing was done using x86_64. Not sure if that would be a
> > > problem.
> >
> > No. I'm also using x86_64 and my machine completely works on current
> > latest linus tree. I confirmed it today.
>
> > 4194288 pages RAM
>
> You have 16GB RAM.
>
> > Out of memory: Kill process 1175 (dhclient) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1247 (rsyslogd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1284 (irqbalance) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1303 (rpcbind) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1321 (rpc.statd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1333 (mdadm) score 1 or sacrifice child
> > Out of memory: Kill process 1365 (rpc.idmapd) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1403 (dbus-daemon) score 1 or sacrifice
> > child
> > Out of memory: Kill process 1438 (acpid) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1447 (hald) score 1 or sacrifice child
> > Out of memory: Kill process 1487 (hald-addon-inpu) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1488 (hald-addon-acpi) score 1 or
> > sacrifice child
> > Out of memory: Kill process 1507 (automount) score 1 or sacrifice
> > child
>
> Oops.
>
> OK. That's known issue. Current OOM logic doesn't works if you have
> gigabytes RAM. because _all_ process have the exactly same score (=1).
> then oom killer just fallback to random process killer. It was made
> commit a63d83f427 (oom: badness heuristic rewrite). I pointed out
> it at least three times. You have to blame Google folks. :-/
>
>
> The problems are three.
>
> 1) if two processes have the same oom score, we should kill younger
> process.
> but current logic kill older. Oldest processes are typicall system
> daemons.
> 2) Current logic use 'unsigned int' for internal score calculation.
> (exactly says,
> it only use 0-1000 value). its very low precision calculation makes a
> lot of
> same oom score and kill an ineligible process.
> 3) Current logic give 3% of SystemRAM to root processes. It obviously
> too big
> if you have plenty memory. Now, your fork-bomb processes have 500MB
> OOM immune
> bonus. then your fork-bomb never ever be killed.
>
>
> CAI-san: I've made fixing patches. Can you please try them?
Sure, I saw there was some discussion going on between you and David
about your patches. Does it make more sense for me to test those after
you have settled the technical arguments?

2011-05-11 20:34:11

by David Rientjes

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Tue, 10 May 2011, CAI Qian wrote:

> Sure, I saw there were some discussion going on between you and David
> about your patches. Does it make more sense for me to test those after
> you have settled down technical arguments?
>

Something like the following (untested) patch should fix the issue by
simply increasing the range of a task's badness from 0-1000 to 0-10000.

There are other things to fix like the tasklist dump output and
documentation, but this shows how easy it is to increase the resolution of
the scoring. (This patch also includes a change to only give root
processes a 1% bonus for every 30% of memory they use as proposed
earlier.)


diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
*/
if (p->flags & PF_OOM_ORIGIN) {
task_unlock(p);
- return 1000;
+ return 10000;
}

/*
@@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
points = get_mm_rss(p->mm) + p->mm->nr_ptes;
points += get_mm_counter(p->mm, MM_SWAPENTS);

- points *= 1000;
+ points *= 10000;
points /= totalpages;
task_unlock(p);

/*
- * Root processes get 3% bonus, just like the __vm_enough_memory()
- * implementation used by LSMs.
+ * Root processes get 1% bonus per 30% memory used for a total of 3%
+ * possible just like LSMs.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN))
- points -= 30;
+ points -= 100 * (points / 3000);

/*
* /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
* either completely disable oom killing or always prefer a certain
* task.
*/
- points += p->signal->oom_score_adj;
+ points += p->signal->oom_score_adj * 10;

/*
* Never return 0 for an eligible task that may be killed since it's
- * possible that no single user task uses more than 0.1% of memory and
+ * possible that no single user task uses more than 0.01% of memory and
* no single admin tasks uses more than 3.0%.
*/
if (points <= 0)
return 1;
- return (points < 1000) ? points : 1000;
+ return (points < 10000) ? points : 10000;
}

/*
@@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
*/
if (p == current) {
chosen = p;
- *ppoints = 1000;
+ *ppoints = 10000;
} else {
/*
* If this task is not being ptraced on exit,

2011-05-11 23:33:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Tue, May 10, 2011 at 5:15 PM, KOSAKI Motohiro
<[email protected]> wrote:
> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two,
> 1) oom-killer can kill younger process than older if
> they have a same oom score. Usually younger process
> is less important. 2) younger task often have PF_EXITING
> because shell script makes a lot of short lived processes.
> Reverse order search can detect it faster.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-05-12 00:13:57

by Minchan Kim

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Thu, May 12, 2011 at 5:34 AM, David Rientjes <[email protected]> wrote:
> On Tue, 10 May 2011, CAI Qian wrote:
>
>> Sure, I saw there were some discussion going on between you and David
>> about your patches. Does it make more sense for me to test those after
>> you have settled down technical arguments?
>>
>
> Something like the following (untested) patch should fix the issue by
> simply increasing the range of a task's badness from 0-1000 to 0-10000.
>
> There are other things to fix like the tasklist dump output and
> documentation, but this shows how easy it is to increase the resolution of
> the scoring.  (This patch also includes a change to only give root

It does make sense.
I think raising the resolution should be an easy way to fix the problem.

> processes a 1% bonus for every 30% of memory they use as proposed
> earlier.)

I didn't follow your earlier suggestion.
But this isn't a formal patch, so I expect that if you send a formal patch
for merging, you will write down the rationale.

>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>         */
>        if (p->flags & PF_OOM_ORIGIN) {
>                task_unlock(p);
> -               return 1000;
> +               return 10000;
>        }
>
>        /*
> @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>        points += get_mm_counter(p->mm, MM_SWAPENTS);
>
> -       points *= 1000;
> +       points *= 10000;
>        points /= totalpages;
>        task_unlock(p);
>
>        /*
> -        * Root processes get 3% bonus, just like the __vm_enough_memory()
> -        * implementation used by LSMs.
> +        * Root processes get 1% bonus per 30% memory used for a total of 3%
> +        * possible just like LSMs.
>         */
>        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> -               points -= 30;
> +               points -= 100 * (points / 3000);
>
>        /*
>         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>         * either completely disable oom killing or always prefer a certain
>         * task.
>         */
> -       points += p->signal->oom_score_adj;
> +       points += p->signal->oom_score_adj * 10;
>
>        /*
>         * Never return 0 for an eligible task that may be killed since it's
> -        * possible that no single user task uses more than 0.1% of memory and
> +        * possible that no single user task uses more than 0.01% of memory and
>         * no single admin tasks uses more than 3.0%.
>         */
>        if (points <= 0)
>                return 1;
> -       return (points < 1000) ? points : 1000;
> +       return (points < 10000) ? points : 10000;
>  }
>
>  /*
> @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>                         */
>                        if (p == current) {
>                                chosen = p;
> -                               *ppoints = 1000;
> +                               *ppoints = 10000;

Scattering constant values around isn't good; you are proving it now.
I think you did it because this is not a formal patch.
I expect you will define a new constant (e.g., OOM_INTERNAL_MAX_SCORE or whatever).


--
Kind regards,
Minchan Kim

2011-05-12 00:59:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Tue, 10 May 2011 17:15:01 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> This patch introduces do_each_thread_reverse() and
> select_bad_process() uses it. The benefits are two,
> 1) oom-killer can kill younger process than older if
> they have a same oom score. Usually younger process
> is less important. 2) younger task often have PF_EXITING
> because shell script makes a lot of short lived processes.
> Reverse order search can detect it faster.
>
> Reported-by: CAI Qian <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

IIUC, for_each_thread() can be called under rcu_read_lock() but
for_each_thread_reverse() must be under tasklist_lock.

Could you add a comment? And prev_task() should use list_entry(),
not list_entry_rcu().
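
i.e., something like this (untested):

/* must be called with tasklist_lock held; unlike the forward walk,
 * the reverse walk cannot rely on the RCU rules that make next_task()
 * safe under rcu_read_lock() */
#define prev_task(p) \
	list_entry((p)->tasks.prev, struct task_struct, tasks)

#define do_each_thread_reverse(g, t) \
	for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do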

Thanks,
-Kame

> ---
> include/linux/sched.h | 6 ++++++
> mm/oom_kill.c | 2 +-
> 2 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 013314a..a0a8339 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2194,6 +2194,9 @@ static inline unsigned long wait_task_inactive(struct task_struct *p,
> #define next_task(p) \
> list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
>
> +#define prev_task(p) \
> + list_entry_rcu((p)->tasks.prev, struct task_struct, tasks)
> +
> #define for_each_process(p) \
> for (p = &init_task ; (p = next_task(p)) != &init_task ; )
>
> @@ -2206,6 +2209,9 @@ extern bool current_is_single_threaded(void);
> #define do_each_thread(g, t) \
> for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do
>
> +#define do_each_thread_reverse(g, t) \
> + for (g = t = &init_task ; (g = t = prev_task(g)) != &init_task ; ) do
> +
> #define while_each_thread(g, t) \
> while ((t = next_thread(t)) != g)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 118d958..0cf5091 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -282,7 +282,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> struct task_struct *chosen = NULL;
> *ppoints = 0;
>
> - do_each_thread(g, p) {
> + do_each_thread_reverse(g, p) {
> unsigned int points;
>
> if (!p->mm)
> --
> 1.7.3.1
>
>
>

2011-05-12 01:30:47

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

Hi Kame,

On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 10 May 2011 17:15:01 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
>> This patch introduces do_each_thread_reverse() and
>> select_bad_process() uses it. The benefits are two,
>> 1) oom-killer can kill younger process than older if
>> they have a same oom score. Usually younger process
>> is less important. 2) younger task often have PF_EXITING
>> because shell script makes a lot of short lived processes.
>> Reverse order search can detect it faster.
>>
>> Reported-by: CAI Qian <[email protected]>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> IIUC, for_each_thread() can be called under rcu_read_lock() but
> for_each_thread_reverse() must be under tasklist_lock.

Just out of curiosity.
You mentioned it when I sent the forkbomb killer patch. :)
Since then, I haven't been able to understand why we need to hold
tasklist_lock rather than rcu_read_lock. Sorry for the dumb question.

At present, it seems that some places use tasklist_lock and others use
rcu_read_lock, but I can't find any rule for that.

Could you elaborate on it, please?
Shouldn't this be documented somewhere?

--
Kind regards,
Minchan Kim

2011-05-12 02:00:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Thu, 12 May 2011 10:30:45 +0900
Minchan Kim <[email protected]> wrote:

> Hi Kame,
>
> On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Tue, 10 May 2011 17:15:01 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> >> This patch introduces do_each_thread_reverse() and
> >> select_bad_process() uses it. The benefits are two,
> >> 1) oom-killer can kill younger process than older if
> >> they have a same oom score. Usually younger process
> >> is less important. 2) younger task often have PF_EXITING
> >> because shell script makes a lot of short lived processes.
> >> Reverse order search can detect it faster.
> >>
> >> Reported-by: CAI Qian <[email protected]>
> >> Signed-off-by: KOSAKI Motohiro <[email protected]>
> >
> > IIUC, for_each_thread() can be called under rcu_read_lock() but
> > for_each_thread_reverse() must be under tasklist_lock.
>
> Just out of curiosity.
> You mentioned it when I sent forkbomb killer patch. :)
> From at that time, I can't understand why we need holding
> tasklist_lock not rcu_read_lock. Sorry for the dumb question.
>
> At present, it seems that someone uses tasklist_lock and others uses
> rcu_read_lock. But I can't find any rule for that.
>

for_each_list_rcu() makes use of the RCU list's characteristics and allows
walking a list under rcu_read_lock() without taking any atomic locks.

list_del() on an RCU list works as follows.

==
 1) assume A, B, C are linked in the list.
        (head)<->(A) <-> (B) <-> (C)

 2) remove B.
        (head)<->(A) <-> (C)
                        /
                     (B)

Because (B)'s next still points to (C) even after (B) is removed, (B)->next
points to a live object. Even if (C) is removed at the same time,
(C) is not freed until an rcu grace period has passed, and (C)'s next points to (head).

So for_each_list_rcu() works fine under rcu_read_lock(): it will visit
only live objects (though they may no longer be valid list members).

==

please see include/linux/rculist.h and check list_add_rcu() ;)

As the above implies, the (B)->prev pointer is an invalid pointer after list_del().
So, there will be a race between list modification and for_each_list_reverse() under
rcu_read_lock().
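
A rough sketch of list_del_rcu() (cf. include/linux/rculist.h), which is where
that poisoning comes from:

static inline void list_del_rcu(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);	/* unlink; entry->next keeps pointing into the list */
	entry->prev = LIST_POISON2;		/* ->prev is poisoned, so a reverse walk cannot trust it */
}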

So, the cases where you need to take an atomic lock (as tasklist_lock is) are...

1) You can't check whether 'entry' is valid or not...
In the above for_each_list_rcu(), you may visit an object which is being removed.
You need some flag or check to see whether the object is valid or not.

2) You want to use list_for_each_safe().
You can't list_del() an object which is already being removed...

3) You want to walk the list in reverse.

4) Some other reasons. For example, you'll access an object pointed to by
'entry' and that object is not rcu-safe.

make sense ?

Thanks,
-Kame


> Could you elaborate it, please?
> Doesn't it need document about it?
>
> --
> Kind regards,
> Minchan Kim
>

2011-05-12 02:23:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 12 May 2011 10:30:45 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Hi Kame,
>>
>> On Thu, May 12, 2011 at 9:52 AM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Tue, 10 May 2011 17:15:01 +0900 (JST)
>> > KOSAKI Motohiro <[email protected]> wrote:
>> >
>> >> This patch introduces do_each_thread_reverse() and
>> >> select_bad_process() uses it. The benefits are two,
>> >> 1) oom-killer can kill younger process than older if
>> >> they have a same oom score. Usually younger process
>> >> is less important. 2) younger task often have PF_EXITING
>> >> because shell script makes a lot of short lived processes.
>> >> Reverse order search can detect it faster.
>> >>
>> >> Reported-by: CAI Qian <[email protected]>
>> >> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> >
>> > IIUC, for_each_thread() can be called under rcu_read_lock() but
>> > for_each_thread_reverse() must be under tasklist_lock.
>>
>> Just out of curiosity.
>> You mentioned it when I sent forkbomb killer patch. :)
>> From at that time, I can't understand why we need holding
>> tasklist_lock not rcu_read_lock. Sorry for the dumb question.
>>
>> At present, it seems that someone uses tasklist_lock and others uses
>> rcu_read_lock. But I can't find any rule for that.
>>
>
> for_each_list_rcu() makes use of RCU list's characteristics and allows
> walk a list under rcu_read_lock() without taking any atomic locks.
>
> list_del() of RCU list works as folllowing.
>
> ==
>  1) assume  A, B, C, are linked in the list.
>        (head)<->(A) <-> (B)  <-> (C)
>
>  2) remove B.
>        (head)<->(A) <-> (C)
>                        /
>                     (B)
>
>  Because (B)'s next points to (C) even after (B) is removed, (B)->next
>  points to the alive object. Even if (C) is removed at the same time,
>  (C) is not freed until rcu glace period and (C)'s next points to (head)
>
> Then, for_each_list_rcu() can work well under rcu_read_lock(), it will visit
> only alive objects (but may not be valid.)
>
> ==
>
> please see include/linux/rculist.h and check list_add_rcu() ;)
>
> As above implies, (B)->prev pointer is invalid pointer after list_del().
> So, there will be race with list modification and for_each_list_reverse under
> rcu_read__lock()
>
> So, when you need to take atomic lock (as tasklist lock is) is...
>
>  1) You can't check 'entry' is valid or not...
>    In above for_each_list_rcu(), you may visit an object which is under removing.
>    You need some flag or check to see the object is valid or not.
>
>  2) you want to use list_for_each_safe().
>    You can't do list_del() an object which is under removing...
>
>  3) You want to walk the list in reverse.
>
>  3) Some other reasons. For example, you'll access an object pointed by the
>    'entry' and the object is not rcu safe.
>
> make sense ?

Yes. Thanks, Kame.
It seems it is caused by the prev poisoning in list_del_rcu().
If we remove that, isn't it possible to traverse in reverse without an atomic lock?



>
> Thanks,
> -Kame
>
>
>> Could you elaborate it, please?
>> Doesn't it need document about it?
>>
>> --
>> Kind regards,
>> Minchan Kim
>>
>
>



--
Kind regards,
Minchan Kim

2011-05-12 03:46:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Thu, 12 May 2011 11:23:38 +0900
Minchan Kim <[email protected]> wrote:

> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Thu, 12 May 2011 10:30:45 +0900
> > Minchan Kim <[email protected]> wrote:

> > As above implies, (B)->prev pointer is invalid pointer after list_del().
> > So, there will be race with list modification and for_each_list_reverse under
> > rcu_read__lock()
> >
> > So, when you need to take atomic lock (as tasklist lock is) is...
> >
> >  1) You can't check 'entry' is valid or not...
> >    In above for_each_list_rcu(), you may visit an object which is under removing.
> >    You need some flag or check to see the object is valid or not.
> >
> >  2) you want to use list_for_each_safe().
> >    You can't do list_del() an object which is under removing...
> >
> >  3) You want to walk the list in reverse.
> >
> >  3) Some other reasons. For example, you'll access an object pointed by the
> >    'entry' and the object is not rcu safe.
> >
> > make sense ?
>
> Yes. Thanks, Kame.
> It seems It is caused by prev poisoning of list_del_rcu.
> If we remove it, isn't it possible to traverse reverse without atomic lock?
>

IIUC, it's possible (fix me if I'm wrong), but I don't like it for two reasons.

1. LIST_POISON is very important information for debugging.

2. If we don't clear the prev pointer, OK, we'll allow a 2-directional walk of the list
under RCU.
But consider the following case:
1. You are now at (C). You'll visit (C)->next...(D).
2. You are now at (D). You want to go back to (C) via (D)->prev.
3. But (D)->prev points to (B).

That's not a 2-directional list; it's something else, or a broken one.
So the rculist is a 1-directional list in nature, I think.

So, without a very, very big reason, we should keep the POISON.

Thanks,
-Kame









2011-05-12 04:17:15

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Thu, May 12, 2011 at 12:39 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Thu, 12 May 2011 11:23:38 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Thu, 12 May 2011 10:30:45 +0900
>> > Minchan Kim <[email protected]> wrote:
>
>> > As above implies, (B)->prev pointer is invalid pointer after list_del().
>> > So, there will be race with list modification and for_each_list_reverse under
>> > rcu_read__lock()
>> >
>> > So, when you need to take atomic lock (as tasklist lock is) is...
>> >
>> >  1) You can't check 'entry' is valid or not...
>> >    In above for_each_list_rcu(), you may visit an object which is under removing.
>> >    You need some flag or check to see the object is valid or not.
>> >
>> >  2) you want to use list_for_each_safe().
>> >    You can't do list_del() an object which is under removing...
>> >
>> >  3) You want to walk the list in reverse.
>> >
>> >  3) Some other reasons. For example, you'll access an object pointed by the
>> >    'entry' and the object is not rcu safe.
>> >
>> > make sense ?
>>
>> Yes. Thanks, Kame.
>> It seems It is caused by prev poisoning of list_del_rcu.
>> If we remove it, isn't it possible to traverse reverse without atomic lock?
>>
>
> IIUC, it's possible (Fix me if I'm wrong) but I don't like that because of 2 reasons.
>
> 1. LIST_POISON is very important information at debug.

Indeed.
But if we can get something better in exchange for losing the debug facility,
I think it would be okay.

>
> 2. If we don't clear prev pointer, ok, we'll allow 2 directional walk of list
>   under RCU.
>   But, in following case
>   1. you are now at (C). you'll visit (C)->next...(D)
>   2. you are now at (D). you want to go back to (C) via (D)->prev.
>   3. But (D)->prev points to (B)
>
>  It's not a 2 directional list, something other or broken one.

Yes, but it shouldn't be a problem under RCU semantics.
If you need such consistency, you should use a lock.

I recall an old thread about it.
In http://lwn.net/Articles/262464/, mmutz and Paul already discussed
it. :)

>  Then, the rculist is 1 directional list in nature, I think.

Yes. But the reason the RCU list became 1-directional is that we couldn't find a useful use case.

>
> So, without very very big reason, we should keep POISON.

Agreed.
I don't insist on it, as it's not a compelling enough use case to persuade Paul;
it's not a hot path, after all.

This started just out of curiosity.
Thanks very much for clarifying it, Kame!

--
Kind regards,
Minchan Kim

2011-05-12 14:38:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

On Thu, May 12, 2011 at 01:17:13PM +0900, Minchan Kim wrote:
> On Thu, May 12, 2011 at 12:39 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Thu, 12 May 2011 11:23:38 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> On Thu, May 12, 2011 at 10:53 AM, KAMEZAWA Hiroyuki
> >> <[email protected]> wrote:
> >> > On Thu, 12 May 2011 10:30:45 +0900
> >> > Minchan Kim <[email protected]> wrote:
> >
> >> > As above implies, (B)->prev pointer is invalid pointer after list_del().
> >> > So, there will be race with list modification and for_each_list_reverse under
> >> > rcu_read__lock()
> >> >
> >> > So, when you need to take atomic lock (as tasklist lock is) is...
> >> >
> >> >  1) You can't check 'entry' is valid or not...
> >> >     In above for_each_list_rcu(), you may visit an object which is under removing.
> >> >     You need some flag or check to see the object is valid or not.
> >> >
> >> >  2) you want to use list_for_each_safe().
> >> >     You can't do list_del() an object which is under removing...
> >> >
> >> >  3) You want to walk the list in reverse.
> >> >
> >> >  3) Some other reasons. For example, you'll access an object pointed by the
> >> >     'entry' and the object is not rcu safe.
> >> >
> >> > make sense ?
> >>
> >> Yes. Thanks, Kame.
> >> It seems It is caused by prev poisoning of list_del_rcu.
> >> If we remove it, isn't it possible to traverse reverse without atomic lock?
> >>
> >
> > IIUC, it's possible (Fix me if I'm wrong) but I don't like that because of 2 reasons.
> >
> > 1. LIST_POISON is very important information at debug.
>
> Indeed.
> But if we can get a better something although we lost debug facility,
> I think it would be okay.
>
> >
> > 2. If we don't clear prev pointer, ok, we'll allow 2 directional walk of list
> >   under RCU.
> >   But, in following case
> >   1. you are now at (C). you'll visit (C)->next...(D)
> >   2. you are now at (D). you want to go back to (C) via (D)->prev.
> >   3. But (D)->prev points to (B)
> >
> >  It's not a 2 directional list, something other or broken one.
>
> Yes. but it shouldn't be a problem in RCU semantics.
> If you need such consistency, you should use lock.
>
> I recall old thread about it.
> In http://lwn.net/Articles/262464/, mmutz and Paul already discussed
> about it. :)
>
> > ?Then, the rculist is 1 directional list in nature, I think.
>
> Yes. But Why RCU become 1 directional list is we can't find a useful usecases.
>
> >
> > So, without very very big reason, we should keep POISON.
>
> Agree.
> I don't insist on it as it's not a useful usecase for persuading Paul.
> That's because it's not a hot path.
>
> It's started from just out of curiosity.
> Thanks for very much clarifying that, Kame!

Indeed, we would need a large performance/scalability/simplicity advantage
to put up with such a loss of debugging information. If it turns out
that you really need this, please let me know, but please also provide
data supporting your need.

Thanx, Paul

2011-05-12 19:38:35

by David Rientjes

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Thu, 12 May 2011, Minchan Kim wrote:

> > processes a 1% bonus for every 30% of memory they use as proposed
> > earlier.)
>
> I didn't follow earlier your suggestion.
> But it's not formal patch so I expect if you send formal patch to
> merge, you would write down the rationale.
>

Yes, I'm sure we'll still have additional discussion when KOSAKI-san
replies to my review of his patchset, so this quick patch was written only
for CAI's testing at this point.

In reference to the above, I think that giving root processes a 3% bonus
at all times may be a bit aggressive. As mentioned before, I don't think
that all root processes using 4% of memory and the remainder of system
threads are using 1% should all be considered equal. At the same time, I
do not believe that two threads using 50% of memory should be considered
equal if one is root and one is not. So my idea was to discount 1% for
every 30% of memory that a root process uses rather than a strict 3%.
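
With the 0-10000 scale used in the quick patch below, that discount works out
roughly like this (integer arithmetic, so a root task below 30% of memory gets
no discount at all):

	/* 100 points == 1% of memory on the 0-10000 scale */
	points = 5000;				/* root task using 50% of memory */
	points -= 100 * (points / 3000);	/* 5000 / 3000 == 1, so a 1% discount */
	/* points == 4900, i.e. scored as if it used 49% */

	points = 2500;				/* root task using 25% of memory */
	points -= 100 * (points / 3000);	/* 2500 / 3000 == 0, no discount */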

That change can be debated and I think we'll probably settle on something
more aggressive like 1% for every 10% of memory used since oom scores are
only useful in comparison to other oom scores: in the above scenario where
there are two threads, one by root and one not by root, using 50% of
memory each, I think it would be legitimate to give the root task a 5%
bonus so that it would only be selected if no other threads used more than
44% of memory (even though the root thread is truly using 50%).

This is a heuristic within the oom killer badness scoring that can always
be debated back and forth, but I think a 1% bonus for root processes for
every 10% of memory used is plausible.

Comments?

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> >         */
> >        if (p->flags & PF_OOM_ORIGIN) {
> >                task_unlock(p);
> > -               return 1000;
> > +               return 10000;
> >        }
> >
> >        /*
> > @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
> >        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> >        points += get_mm_counter(p->mm, MM_SWAPENTS);
> >
> > -       points *= 1000;
> > +       points *= 10000;
> >        points /= totalpages;
> >        task_unlock(p);
> >
> >        /*
> > -        * Root processes get 3% bonus, just like the __vm_enough_memory()
> > -        * implementation used by LSMs.
> > +        * Root processes get 1% bonus per 30% memory used for a total of 3%
> > +        * possible just like LSMs.
> >         */
> >        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> > -               points -= 30;
> > +               points -= 100 * (points / 3000);
> >
> >        /*
> >         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> >         * either completely disable oom killing or always prefer a certain
> >         * task.
> >         */
> > -       points += p->signal->oom_score_adj;
> > +       points += p->signal->oom_score_adj * 10;
> >
> >        /*
> >         * Never return 0 for an eligible task that may be killed since it's
> > -        * possible that no single user task uses more than 0.1% of memory and
> > +        * possible that no single user task uses more than 0.01% of memory and
> >         * no single admin tasks uses more than 3.0%.
> >         */
> >        if (points <= 0)
> >                return 1;
> > -       return (points < 1000) ? points : 1000;
> > +       return (points < 10000) ? points : 10000;
> >  }
> >
> >  /*
> > @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
> >                         */
> >                        if (p == current) {
> >                                chosen = p;
> > -                               *ppoints = 1000;
> > +                               *ppoints = 10000;
>
> Scattering constant value isn't good.
> You are proving it now.
> I think you did it since this is not a formal patch.
> I expect you will define new value (ex, OOM_INTERNAL_MAX_SCORE or whatever)
>

Right, we could probably do something like

#define OOM_SCORE_MAX_FACTOR 10
#define OOM_SCORE_MAX (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)

in mm/oom_kill.c, which would then be used to replace all of the constants
above since OOM_SCORE_ADJ_MAX is already defined to be 1000 in
include/linux/oom.h.
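
For illustration, with those defines the constants scattered through the quick
patch above could then read something like (untested sketch):

	#define OOM_SCORE_MAX_FACTOR	10
	#define OOM_SCORE_MAX	(OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)	/* 10000 */

	if (p->flags & PF_OOM_ORIGIN) {
		task_unlock(p);
		return OOM_SCORE_MAX;
	}
	...
	points *= OOM_SCORE_MAX;
	points /= totalpages;
	...
	points += p->signal->oom_score_adj * OOM_SCORE_MAX_FACTOR;
	...
	return (points < OOM_SCORE_MAX) ? points : OOM_SCORE_MAX;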

2011-05-13 04:16:09

by Minchan Kim

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Fri, May 13, 2011 at 4:38 AM, David Rientjes <[email protected]> wrote:
> On Thu, 12 May 2011, Minchan Kim wrote:
>
>> > processes a 1% bonus for every 30% of memory they use as proposed
>> > earlier.)
>>
>> I didn't follow earlier your suggestion.
>> But it's not formal patch so I expect if you send formal patch to
>> merge, you would write down the rationale.
>>
>
> Yes, I'm sure we'll still have additional discussion when KOSAKI-san
> replies to my review of his patchset, so this quick patch was written only
> for CAI's testing at this point.
>
> In reference to the above, I think that giving root processes a 3% bonus
> at all times may be a bit aggressive.  As mentioned before, I don't think
> that all root processes using 4% of memory and the remainder of system
> threads are using 1% should all be considered equal.  At the same time, I
> do not believe that two threads using 50% of memory should be considered
> equal if one is root and one is not.  So my idea was to discount 1% for
> every 30% of memory that a root process uses rather than a strict 3%.
>
> That change can be debated and I think we'll probably settle on something
> more aggressive like 1% for every 10% of memory used since oom scores are
> only useful in comparison to other oom scores: in the above scenario where
> there are two threads, one by root and one not by root, using 50% of
> memory each, I think it would be legitimate to give the root task a 5%
> bonus so that it would only be selected if no other threads used more than
> 44% of memory (even though the root thread is truly using 50%).
>
> This is a heuristic within the oom killer badness scoring that can always
> be debated back and forth, but I think a 1% bonus for root processes for
> every 10% of memory used is plausible.
>
> Comments?

Yes, I tend to agree.
Apparently, the absolute 3% bonus is a problem in CAI's case.

Your approach, which makes the bonus a function of rss, is consistent
with the current OOM heuristic.
So from a consistency POV I like it, as it could help make the OOM policy deterministic.

About the 30% versus 10% question, I think it's hard to define an ideal magic
value that handles all workloads.
It would be very arguable, so we might need some standard way to
measure it, or input from the redhat/suse people. Anyway, I don't want to argue it
until we get a number.

>
>> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> > --- a/mm/oom_kill.c
>> > +++ b/mm/oom_kill.c
>> > @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>> >         */
>> >        if (p->flags & PF_OOM_ORIGIN) {
>> >                task_unlock(p);
>> > -               return 1000;
>> > +               return 10000;
>> >        }
>> >
>> >        /*
>> > @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
>> >        points = get_mm_rss(p->mm) + p->mm->nr_ptes;
>> >        points += get_mm_counter(p->mm, MM_SWAPENTS);
>> >
>> > -       points *= 1000;
>> > +       points *= 10000;
>> >        points /= totalpages;
>> >        task_unlock(p);
>> >
>> >        /*
>> > -        * Root processes get 3% bonus, just like the __vm_enough_memory()
>> > -        * implementation used by LSMs.
>> > +        * Root processes get 1% bonus per 30% memory used for a total of 3%
>> > +        * possible just like LSMs.
>> >         */
>> >        if (has_capability_noaudit(p, CAP_SYS_ADMIN))
>> > -               points -= 30;
>> > +               points -= 100 * (points / 3000);
>> >
>> >        /*
>> >         * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
>> >         * either completely disable oom killing or always prefer a certain
>> >         * task.
>> >         */
>> > -       points += p->signal->oom_score_adj;
>> > +       points += p->signal->oom_score_adj * 10;
>> >
>> >        /*
>> >         * Never return 0 for an eligible task that may be killed since it's
>> > -        * possible that no single user task uses more than 0.1% of memory and
>> > +        * possible that no single user task uses more than 0.01% of memory and
>> >         * no single admin tasks uses more than 3.0%.
>> >         */
>> >        if (points <= 0)
>> >                return 1;
>> > -       return (points < 1000) ? points : 1000;
>> > +       return (points < 10000) ? points : 10000;
>> >  }
>> >
>> >  /*
>> > @@ -314,7 +314,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
>> >                         */
>> >                        if (p == current) {
>> >                                chosen = p;
>> > -                               *ppoints = 1000;
>> > +                               *ppoints = 10000;
>>
>> Scattering constant value isn't good.
>> You are proving it now.
>> I think you did it since this is not a formal patch.
>> I expect you will define new value (ex, OOM_INTERNAL_MAX_SCORE or whatever)
>>
>
> Right, we could probably do something like
>
>        #define OOM_SCORE_MAX_FACTOR    10
>        #define OOM_SCORE_MAX           (OOM_SCORE_ADJ_MAX * OOM_SCORE_MAX_FACTOR)
>
> in mm/oom_kill.c, which would then be used to replace all of the constants
> above since OOM_SCORE_ADJ_MAX is already defined to be 1000 in
> include/linux/oom.h.

Looks good to me.
Let's wait for KOSAKI's opinion and CAI's test result.

--
Kind regards,
Minchan Kim

2011-05-13 06:53:49

by Qian Cai

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())



----- Original Message -----
> On Tue, 10 May 2011, CAI Qian wrote:
>
> > Sure, I saw there were some discussion going on between you and
> > David
> > about your patches. Does it make more sense for me to test those
> > after
> > you have settled down technical arguments?
> >
>
> Something like the following (untested) patch should fix the issue by
> simply increasing the range of a task's badness from 0-1000 to
> 0-10000.
>
> There are other things to fix like the tasklist dump output and
> documentation, but this shows how easy it is to increase the
> resolution of
> the scoring. (This patch also includes a change to only give root
> processes a 1% bonus for every 30% of memory they use as proposed
> earlier.)
I have had a chance to test this patch after applying it manually
(it did not apply cleanly) on top of the mainline kernel. The test is still
running because it is trying to kill tons of python processes instead
of the parent. Isn't there a way for oom to be smart enough to do
"killall python"?
>
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,7 +160,7 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> */
> if (p->flags & PF_OOM_ORIGIN) {
> task_unlock(p);
> - return 1000;
> + return 10000;
> }
>
> /*
> @@ -177,32 +177,32 @@ unsigned int oom_badness(struct task_struct *p,
> struct mem_cgroup *mem,
> points = get_mm_rss(p->mm) + p->mm->nr_ptes;
> points += get_mm_counter(p->mm, MM_SWAPENTS);
>
> - points *= 1000;
> + points *= 10000;
> points /= totalpages;
> task_unlock(p);
>
> /*
> - * Root processes get 3% bonus, just like the __vm_enough_memory()
> - * implementation used by LSMs.
> + * Root processes get 1% bonus per 30% memory used for a total of 3%
> + * possible just like LSMs.
> */
> if (has_capability_noaudit(p, CAP_SYS_ADMIN))
> - points -= 30;
> + points -= 100 * (points / 3000);
>
> /*
> * /proc/pid/oom_score_adj ranges from -1000 to +1000 such that it may
> * either completely disable oom killing or always prefer a certain
> * task.
> */
> - points += p->signal->oom_score_adj;
> + points += p->signal->oom_score_adj * 10;
>
> /*
> * Never return 0 for an eligible task that may be killed since it's
> - * possible that no single user task uses more than 0.1% of memory
> and
> + * possible that no single user task uses more than 0.01% of memory
> and
> * no single admin tasks uses more than 3.0%.
> */
> if (points <= 0)
> return 1;
> - return (points < 1000) ? points : 1000;
> + return (points < 10000) ? points : 10000;
> }
>
> /*
> @@ -314,7 +314,7 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
> */
> if (p == current) {
> chosen = p;
> - *ppoints = 1000;
> + *ppoints = 10000;
> } else {
> /*
> * If this task is not being ptraced on exit,
>

2011-05-13 10:13:19

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/4] oom: improve dump_tasks() show items

Hi

Sorry for the delay. I hit a machine crash this week and lost
a lot of e-mail.


> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> Recently, oom internal logic was dramatically changed. Thus
>> dump_tasks() is no longer useful. it has some meaningless
>> items and don't have some oom socre related items.
>>
>
> This changelog is inaccurate.
>
> dump_tasks() is actually useful as it currently stands; there are things
> that you may add or remove but saying that it is "no longer useful" is an
> exaggeration.

Hm. OK.

>
>> This patch adapt displaying fields to new oom logic.
>>
>> details
>> ==========
>> removed: pid (we always kill process. don't need thread id),
>> mm->total_vm (we no longer uses virtual memory size)
>
> Showing mm->total_vm is still interesting to know what the old heuristic
> would have used rather than the new heuristic, I'd prefer if we kept it.

OK, reasonable.



>
>> signal->oom_adj (we no longer uses it internally)
>> added: ppid (we often kill sacrifice child process)
>> modify: RSS (account mm->nr_ptes too)
>
> I'd prefer if ptes were shown independently from rss instead of adding it
> to the thread's true rss usage and representing it as such.

No. nr_ptes should always be accounted as rss. I plan to change RLIMIT_RSS too,
because when end users set RLIMIT_RSS they intend to limit the amount of physical
memory used, and pte pages are part of that; the current total_rss = anon-rss + file-rss is
only an implementation limit. On the other hand, if we make separate limits, RLIMIT_RSS
and RLIMIT_PTE, then RLIMIT_RSS doesn't prevent a zero-page DoS attack. That's not optimal.


> I think the cpu should also be removed.

ok.


> For the next version, could you show the old output and comparsion to new
> output in the changelog?

Will do.


>
>>
>> Signed-off-by: KOSAKI Motohiro<[email protected]>
>> ---
>>
>> Strictly speaking. this is NOT a part of oom fixing patches. but it's
>> necessary when I parse QAI's test result.
>>
>>
>> mm/oom_kill.c | 14 ++++++++------
>> 1 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index f52e85c..118d958 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -355,7 +355,7 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>> struct task_struct *p;
>> struct task_struct *task;
>>
>> - pr_info("[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name\n");
>> + pr_info("[ pid] ppid uid rss cpu score_adj name\n");
>> for_each_process(p) {
>> if (oom_unkillable_task(p, mem, nodemask))
>> continue;
>> @@ -370,11 +370,13 @@ static void dump_tasks(const struct mem_cgroup *mem, const nodemask_t *nodemask)
>> continue;
>> }
>>
>> - pr_info("[%5d] %5d %5d %8lu %8lu %3u %3d %5d %s\n",
>> - task->pid, task_uid(task), task->tgid,
>> - task->mm->total_vm, get_mm_rss(task->mm),
>> - task_cpu(task), task->signal->oom_adj,
>> - task->signal->oom_score_adj, task->comm);
>> + pr_info("[%6d] %6d %5d %8lu %4u %9d %s\n",
>> + task_tgid_nr(task), task_tgid_nr(task->real_parent),
>> + task_uid(task),
>> + get_mm_rss(task->mm) + p->mm->nr_ptes,
>> + task_cpu(task),
>> + task->signal->oom_score_adj,
>> + task->comm);
>> task_unlock(task);
>> }
>> }
>

2011-05-13 10:14:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

(2011/05/11 8:31), David Rientjes wrote:
> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> This patch introduces do_each_thread_reverse() and
>> select_bad_process() uses it. The benefits are two,
>> 1) oom-killer can kill younger process than older if
>> they have a same oom score. Usually younger process
>> is less important. 2) younger task often have PF_EXITING
>> because shell script makes a lot of short lived processes.
>> Reverse order search can detect it faster.
>>
>
> I like this change, thanks! I'm suprised we haven't needed a
> do_each_thread_reverse() in the past somewhere else in the kernel.
>
> Could you update the comment about do_each_thread() not being break-safe
> in the second version, though?

ok.

2011-05-13 10:17:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/4] oom: kill younger process first

(2011/05/12 9:52), KAMEZAWA Hiroyuki wrote:
> On Tue, 10 May 2011 17:15:01 +0900 (JST)
> KOSAKI Motohiro<[email protected]> wrote:
>
>> This patch introduces do_each_thread_reverse() and
>> select_bad_process() uses it. The benefits are two,
>> 1) oom-killer can kill younger process than older if
>> they have a same oom score. Usually younger process
>> is less important. 2) younger task often have PF_EXITING
>> because shell script makes a lot of short lived processes.
>> Reverse order search can detect it faster.
>>
>> Reported-by: CAI Qian<[email protected]>
>> Signed-off-by: KOSAKI Motohiro<[email protected]>
>
> IIUC, for_each_thread() can be called under rcu_read_lock() but
> for_each_thread_reverse() must be under tasklist_lock.
>
> Could you add some comment ? and prev_task() should use list_entry()
> not list_entry_rcu().

Will fix. thanks.

2011-05-13 10:28:57

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 3/4] oom: oom-killer don't use permillage of system-ram internally

(2011/05/11 8:40), David Rientjes wrote:
> On Tue, 10 May 2011, KOSAKI Motohiro wrote:
>
>> CAI Qian reported his kernel did hang-up if he ran fork intensive
>> workload and then invoke oom-killer.
>>
>> The problem is, Current oom calculation uses 0-1000 normalized value
>> (The unit is a permillage of system-ram). Its low precision make
>> a lot of same oom score. IOW, in his case, all processes have<1
>> oom score and internal integral calculation round it to 1. Thus
>> oom-killer kill ineligible process. This regression is caused by
>> commit a63d83f427 (oom: badness heuristic rewrite).
>>
>> The solution is, the internal calculation just use number of pages
>> instead of permillage of system-ram. And convert it to permillage
>> value at displaying time.
>>
>> This patch doesn't change any ABI (included /proc/<pid>/oom_score_adj)
>> even though current logic has a lot of my dislike thing.
>>
>
> s/permillage/proportion/
>
> This is unacceptable, it does not allow users to tune oom_score_adj
> appropriately based on the scores exported by /proc/pid/oom_score to
> discount an amount of RAM from a thread's memory usage in systemwide,
> memory controller, cpuset, or mempolicy contexts. This is only possible
> because the oom score is normalized.

You misunderstand the code. The patch doesn't change oom_score;
the patch changes fs/proc too.

>
> What would be acceptable would be to increase the granularity of the score
> to 10000 or 100000 to differentiate between threads using 0.01% or 0.001%
> of RAM from each other, respectively. The range of oom_score_adj would
> remain the same, however, and be multiplied by 10 or 100, respectively,
> when factored into the badness score baseline. I don't believe userspace
> cares to differentiate between more than 0.1% of available memory.

Currently, SGI sells machines with 16TB of memory. 16TB x 0.01% = 1.6GB. I don't think your
fork bomb processes use more than 1.6GB each, thus your patch is unacceptable.

So, please read the code again, or run it.

> The other issue that this patch addresses is the bonus given to root
> processes. I agree that if a root process is using 4% of RAM that it
> should not be equal to all other threads using 1%. I do believe that a
> root process using 60% of RAM should be equal priority to a thread using
> 57%, however. Perhaps a compromise would be to give root processes a
> bonus of 1% for every 30% of RAM they consume?

I think you are talking about patch [4/4], right? Patches [3/4] and [4/4]
are attacking different issues: the big-machine issue and the root-user issue.

2011-05-13 11:02:37

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

(2011/05/13 13:16), Minchan Kim wrote:
> On Fri, May 13, 2011 at 4:38 AM, David Rientjes<[email protected]> wrote:
>> On Thu, 12 May 2011, Minchan Kim wrote:
>>
>>>> processes a 1% bonus for every 30% of memory they use as proposed
>>>> earlier.)
>>>
>>> I didn't follow earlier your suggestion.
>>> But it's not formal patch so I expect if you send formal patch to
>>> merge, you would write down the rationale.
>>>
>>
>> Yes, I'm sure we'll still have additional discussion when KOSAKI-san
>> replies to my review of his patchset, so this quick patch was written only
>> for CAI's testing at this point.
>>
>> In reference to the above, I think that giving root processes a 3% bonus
>> at all times may be a bit aggressive. As mentioned before, I don't think
>> that all root processes using 4% of memory and the remainder of system
>> threads are using 1% should all be considered equal. At the same time, I
>> do not believe that two threads using 50% of memory should be considered
>> equal if one is root and one is not. So my idea was to discount 1% for
>> every 30% of memory that a root process uses rather than a strict 3%.
>>
>> That change can be debated and I think we'll probably settle on something
>> more aggressive like 1% for every 10% of memory used since oom scores are
>> only useful in comparison to other oom scores: in the above scenario where
>> there are two threads, one by root and one not by root, using 50% of
>> memory each, I think it would be legitimate to give the root task a 5%
>> bonus so that it would only be selected if no other threads used more than
>> 44% of memory (even though the root thread is truly using 50%).
>>
>> This is a heuristic within the oom killer badness scoring that can always
>> be debated back and forth, but I think a 1% bonus for root processes for
>> every 10% of memory used is plausible.
>>
>> Comments?
>
> Yes. Tend to agree.
> Apparently, absolute 3% bonus is a problem in CAI's case.
>
> Your approach which makes bonus with function of rss is consistent
> with current OOM heuristic.
> So In consistency POV, I like it as it could help deterministic OOM policy.
>
> About 30% or 10% things, I think it's hard to define a ideal magic
> value for handling for whole workloads.
> It would be very arguable. So we might need some standard method to
> measure it/or redhat/suse peoples. Anyway, I don't want to argue it
> until we get a number.

I have a few small comments. 1) A typical system has some small system daemons.
2) David's points -= 100 * (points / 3000); line doesn't give any bonus if
points is less than 3000. Is a zero root bonus really desired? It may lead to
killing a system daemon on the first issue. 3) If we change the minimum bonus from 0% to
1%, we will face exactly the same problem when all processes have less than
1% of memory, which is not rare if the system has plenty of memory.
So, my recalculation logic (patch [4/4]) is necessary anyway.

However, a proportional 1% - 10% bonus seems like a considerably good idea.

2011-05-16 20:42:53

by David Rientjes

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Fri, 13 May 2011, KOSAKI Motohiro wrote:

> > > Yes, I'm sure we'll still have additional discussion when KOSAKI-san
> > > replies to my review of his patchset, so this quick patch was written only
> > > for CAI's testing at this point.
> > >
> > > In reference to the above, I think that giving root processes a 3% bonus
> > > at all times may be a bit aggressive. As mentioned before, I don't think
> > > that all root processes using 4% of memory and the remainder of system
> > > threads are using 1% should all be considered equal. At the same time, I
> > > do not believe that two threads using 50% of memory should be considered
> > > equal if one is root and one is not. So my idea was to discount 1% for
> > > every 30% of memory that a root process uses rather than a strict 3%.
> > >
> > > That change can be debated and I think we'll probably settle on something
> > > more aggressive like 1% for every 10% of memory used since oom scores are
> > > only useful in comparison to other oom scores: in the above scenario where
> > > there are two threads, one by root and one not by root, using 50% of
> > > memory each, I think it would be legitimate to give the root task a 5%
> > > bonus so that it would only be selected if no other threads used more than
> > > 44% of memory (even though the root thread is truly using 50%).
> > >
> > > This is a heuristic within the oom killer badness scoring that can always
> > > be debated back and forth, but I think a 1% bonus for root processes for
> > > every 10% of memory used is plausible.
> > >
> > > Comments?
> >
> > Yes. Tend to agree.
> > Apparently, absolute 3% bonus is a problem in CAI's case.
> >
> > Your approach which makes bonus with function of rss is consistent
> > with current OOM heuristic.
> > So In consistency POV, I like it as it could help deterministic OOM policy.
> >
> > About 30% or 10% things, I think it's hard to define a ideal magic
> > value for handling for whole workloads.
> > It would be very arguable. So we might need some standard method to
> > measure it/or redhat/suse peoples. Anyway, I don't want to argue it
> > until we get a number.
>
> I have small comments. 1) typical system have some small size system daemon
> 2) David's points -= 100 * (points / 3000); line doesn't make any bonus if
> points is less than 3000.

With the 1% bonus per 10% memory consumption, it would be

points -= 100 * (points / 1000);

instead. So, yes, this wouldn't give any bonus for root tasks that use
10% of allowed memory or less.
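
For example, on the 0-10000 scale a root task using 50% of memory would score
5000 - 100 * (5000 / 1000) = 4500, i.e. it would be compared against other
tasks as if it used 45%, while a root task at 9% (900 points) keeps its full
score.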

> Zero root bonus is really desired? It may lead to
> kill system daemon at first issue.

I would think of it this way: if a root task is using 9% of available
memory and that happens to be the largest consumer of memory, then it
makes sense to kill it instead of killing other smaller non-root tasks.
The 3% bonus would have killed the task if all other threads were using 6%
or less; this just allows them to use 2% more memory now.

On the other hand, if a root task is using 50% of available memory, then a
45% non-root task would be sacrificed instead.

Perhaps we need to be more aggressive and give more of a bonus to root
tasks?

2011-05-16 20:46:14

by David Rientjes

[permalink] [raw]
Subject: Re: OOM Killer don't works at all if the system have >gigabytes memory (was Re: [PATCH] mm: check zone->all_unreclaimable in all_unreclaimable())

On Fri, 13 May 2011, CAI Qian wrote:

> I have had a chance to test this patch after applied this patch manually
> (dd not apply cleanly) on the top of mainline kernel. The test is still
> running because it is trying to kill tons of python processes instread
> of the parent. Isn't there a way for oom to be smart enough to do
> "killall python"?

Not without userspace doing that explicitly; the oom killer attempts to
kill only the most memory-hogging task to free memory. If your test is
constantly forking new processes which allocate memory, then the oom killer
will just keep killing those children anytime it is out of memory.