2009-06-24 15:07:18

by Arjan van de Ven

[permalink] [raw]
Subject: upcoming kerneloops.org item: get_page_from_freelist

Hi,

a new item is coming up fast in the kerneloops.org stats, and it's new
in 2.6.31-rc;

http://www.kerneloops.org/searchweek.php?search=get_page_from_freelist

it's this warning in mm/page_alloc.c:

* __GFP_NOFAIL is not to be used in new code.
*
* All __GFP_NOFAIL callers should be fixed so that they
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
* allocate greater than single-page units with
* __GFP_NOFAIL.
*/
WARN_ON_ONCE(order > 0);


typical backtraces look like

get_page_from_freelist
__alloc_pages_nodemask
alloc_pages_current
alloc_slab_page
new_slab
__slab_alloc
kmem_cache_alloc_notrace
start_this_handle
jbd2_journal_start

and

get_page_from_freelist
__alloc_pages_nodemask
alloc_pages_current
alloc_slab_page
new_slab
__slab_alloc
kmem_cache_alloc_notrace
start_this_handle
journal_start
ext3_journal_start_sb
ext3_journal_start
ext3_dirty_inode

but there are some other ones as well at the url above.


git blame shows that

commit dab48dab37d2770824420d1e01730a107fade1aa
Author: Andrew Morton <[email protected]>
Date: Tue Jun 16 15:32:37 2009 -0700

introduced this WARN_ON.....



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org


2009-06-24 16:47:25

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 08:07:53 -0700 Arjan van de Ven <[email protected]> wrote:

> Hi,
>
> a new item is coming up fast in the kerneloops.org stats, and it's new
> in 2.6.31-rc;
>
> http://www.kerneloops.org/searchweek.php?search=get_page_from_freelist
>
> it's this warning in mm/page_alloc.c:
>
> * __GFP_NOFAIL is not to be used in new code.
> *
> * All __GFP_NOFAIL callers should be fixed so that they
> * properly detect and handle allocation failures.
> *
> * We most definitely don't want callers attempting to
> * allocate greater than single-page units with
> * __GFP_NOFAIL.
> */
> WARN_ON_ONCE(order > 0);
>
>
> typical backtraces look like
>
> get_page_from_freelist
> __alloc_pages_nodemask
> alloc_pages_current
> alloc_slab_page
> new_slab
> __slab_alloc
> kmem_cache_alloc_notrace
> start_this_handle
> jbd2_journal_start
>
> and
>
> get_page_from_freelist
> __alloc_pages_nodemask
> alloc_pages_current
> alloc_slab_page
> new_slab
> __slab_alloc
> kmem_cache_alloc_notrace
> start_this_handle
> journal_start
> ext3_journal_start_sb
> ext3_journal_start
> ext3_dirty_inode
>
> but there are some other ones as well at the url above.
>
>
> git blame shows that
>
> commit dab48dab37d2770824420d1e01730a107fade1aa
> Author: Andrew Morton <[email protected]>
> Date: Tue Jun 16 15:32:37 2009 -0700
>
> introduced this WARN_ON.....

Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
patch is there to find, name, shame, blame and hopefully fix callers.

A fix for cxgb3 is in the works. slub's design is a big problem.

But we'll probably have to revert it for 2.6.31 :(

2009-06-24 16:52:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad.

Yes, but your definition of "higher order" is incorrect.

At the very least, we should change the "order > 0" to "order > 1".

As I already mentioned, SLAB uses order-1 allocations in order to not have
excessive fragmentation for small kmalloc/slab's that would otherwise
waste tons of memory.

Think network packets at 1500 bytes each. You can allocate two per page,
or five per 2-pages. That's a 25% memory usage difference!

And getting an order-1 allocation is simply not that much harder than an
order-0 one. It starts getting hard once you get to order-3 or more.

Linus
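
A minimal standalone sketch of the packing arithmetic described above, assuming 4096-byte pages and 1500-byte network buffers (the numbers in the example); it just reproduces the 25% figure:

#include <stdio.h>

int main(void)
{
        unsigned int obj = 1500, page = 4096;          /* assumed sizes */
        unsigned int per_page = page / obj;            /* 2 per order-0 page */
        unsigned int per_pair = (2 * page) / obj;      /* 5 per order-1 pair */

        /* 5 objects per two pages vs. 2 + 2 = 4: a 25% packing gain. */
        printf("order-0: %u/page, order-1: %u per 2 pages (%d%% better)\n",
               per_page, per_pair,
               (int)(per_pair * 100 / (2 * per_page)) - 100);
        return 0;
}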

2009-06-24 16:55:30

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Andrew,

On Wed, 24 Jun 2009 08:07:53 -0700 Arjan van de Ven <[email protected]> wrote:
>> a new item is coming up fast in the kerneloops.org stats, and it's new
>> in 2.6.31-rc;
>>
>> http://www.kerneloops.org/searchweek.php?search=get_page_from_freelist
>>
>> it's this warning in mm/page_alloc.c:
>>
>> * __GFP_NOFAIL is not to be used in new code.
>> *
>> * All __GFP_NOFAIL callers should be fixed so that they
>> * properly detect and handle allocation failures.
>> *
>> * We most definitely don't want callers attempting to
>> * allocate greater than single-page units with
>> * __GFP_NOFAIL.
>> */
>> WARN_ON_ONCE(order > 0);
>>
>>
>> typical backtraces look like
>>
>> get_page_from_freelist
>> __alloc_pages_nodemask
>> alloc_pages_current
>> alloc_slab_page
>> new_slab
>> __slab_alloc
>> kmem_cache_alloc_notrace
>> start_this_handle
>> jbd2_journal_start
>>
>> and
>>
>> get_page_from_freelist
>> __alloc_pages_nodemask
>> alloc_pages_current
>> alloc_slab_page
>> new_slab
>> __slab_alloc
>> kmem_cache_alloc_notrace
>> start_this_handle
>> journal_start
>> ext3_journal_start_sb
>> ext3_journal_start
>> ext3_dirty_inode
>>
>> but there are some other ones as well at the url above.
>>
>>
>> git blame shows that
>>
>> commit dab48dab37d2770824420d1e01730a107fade1aa
>> Author: Andrew Morton <[email protected]>
>> Date: Tue Jun 16 15:32:37 2009 -0700
>>
>> introduced this WARN_ON.....

On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
> patch is there to find, name, shame, blame and hopefully fix callers.
>
> A fix for cxgb3 is in the works. slub's design is a big problem.
>
> But we'll probably have to revert it for 2.6.31 :(

How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
from the higher order allocation and thus force GFP_NOFAIL allocations
to use the minimum required order?

Pekka

2009-06-24 16:56:26

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 7:55 PM, Pekka Enberg<[email protected]> wrote:
> Hi Andrew,
>
> On Wed, 24 Jun 2009 08:07:53 -0700 Arjan van de Ven <[email protected]> wrote:
>>> a new item is coming up fast in the kerneloops.org stats, and it's new
>>> in 2.6.31-rc;
>>>
>>> http://www.kerneloops.org/searchweek.php?search=get_page_from_freelist
>>>
>>> it's this warning in mm/page_alloc.c:
>>>
>>> * __GFP_NOFAIL is not to be used in new code.
>>> *
>>> * All __GFP_NOFAIL callers should be fixed so that they
>>> * properly detect and handle allocation failures.
>>> *
>>> * We most definitely don't want callers attempting to
>>> * allocate greater than single-page units with
>>> * __GFP_NOFAIL.
>>> */
>>> WARN_ON_ONCE(order > 0);
>>>
>>>
>>> typical backtraces look like
>>>
>>> get_page_from_freelist
>>> __alloc_pages_nodemask
>>> alloc_pages_current
>>> alloc_slab_page
>>> new_slab
>>> __slab_alloc
>>> kmem_cache_alloc_notrace
>>> start_this_handle
>>> jbd2_journal_start
>>>
>>> and
>>>
>>> get_page_from_freelist
>>> __alloc_pages_nodemask
>>> alloc_pages_current
>>> alloc_slab_page
>>> new_slab
>>> __slab_alloc
>>> kmem_cache_alloc_notrace
>>> start_this_handle
>>> journal_start
>>> ext3_journal_start_sb
>>> ext3_journal_start
>>> ext3_dirty_inode
>>>
>>> but there are some other ones as well at the url above.
>>>
>>>
>>> git blame shows that
>>>
>>> commit dab48dab37d2770824420d1e01730a107fade1aa
>>> Author: Andrew Morton <[email protected]>
>>> Date: Tue Jun 16 15:32:37 2009 -0700
>>>
>>> introduced this WARN_ON.....
>
> On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
>> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
>> patch is there to find, name, shame, blame and hopefully fix callers.
>>
>> A fix for cxgb3 is in the works. slub's design is a big problem.
>>
>> But we'll probably have to revert it for 2.6.31 :(
>
> How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
> from the higher order allocation and thus force GFP_NOFAIL allocations
> to use the minimum required order?

Small correction: force GFP_NOFAIL allocations to use minimum order
_if_ the higher order allocation fails.

2009-06-24 17:00:15

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 7:56 PM, Pekka Enberg<[email protected]> wrote:
> On Wed, Jun 24, 2009 at 7:55 PM, Pekka Enberg<[email protected]> wrote:
>> Hi Andrew,
>>
>> On Wed, 24 Jun 2009 08:07:53 -0700 Arjan van de Ven <[email protected]> wrote:
>>>> a new item is coming up fast in the kerneloops.org stats, and it's new
>>>> in 2.6.31-rc;
>>>>
>>>> http://www.kerneloops.org/searchweek.php?search=get_page_from_freelist
>>>>
>>>> it's this warning in mm/page_alloc.c:
>>>>
>>>> * __GFP_NOFAIL is not to be used in new code.
>>>> *
>>>> * All __GFP_NOFAIL callers should be fixed so that they
>>>> * properly detect and handle allocation failures.
>>>> *
>>>> * We most definitely don't want callers attempting to
>>>> * allocate greater than single-page units with
>>>> * __GFP_NOFAIL.
>>>> */
>>>> WARN_ON_ONCE(order > 0);
>>>>
>>>>
>>>> typical backtraces look like
>>>>
>>>> get_page_from_freelist
>>>> __alloc_pages_nodemask
>>>> alloc_pages_current
>>>> alloc_slab_page
>>>> new_slab
>>>> __slab_alloc
>>>> kmem_cache_alloc_notrace
>>>> start_this_handle
>>>> jbd2_journal_start
>>>>
>>>> and
>>>>
>>>> get_page_from_freelist
>>>> __alloc_pages_nodemask
>>>> alloc_pages_current
>>>> alloc_slab_page
>>>> new_slab
>>>> __slab_alloc
>>>> kmem_cache_alloc_notrace
>>>> start_this_handle
>>>> journal_start
>>>> ext3_journal_start_sb
>>>> ext3_journal_start
>>>> ext3_dirty_inode
>>>>
>>>> but there are some other ones as well at the url above.
>>>>
>>>>
>>>> git blame shows that
>>>>
>>>> commit dab48dab37d2770824420d1e01730a107fade1aa
>>>> Author: Andrew Morton <[email protected]>
>>>> Date: Tue Jun 16 15:32:37 2009 -0700
>>>>
>>>> introduced this WARN_ON.....
>>
>> On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
>>> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
>>> patch is there to find, name, shame, blame and hopefully fix callers.
>>>
>>> A fix for cxgb3 is in the works. slub's design is a big problem.
>>>
>>> But we'll probably have to revert it for 2.6.31 :(
>>
>> How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
>> from the higher order allocation and thus force GFP_NOFAIL allocations
>> to use the minimum required order?
>
> Small correction: force GFP_NOFAIL allocations to use minimum order
> _if_ the higher order allocation fails.

And here's a badly linewrapped, untested patch to do that (sorry I
don't have my laptop here). Christoph, does this look ok to you?

diff --git a/mm/slub.c b/mm/slub.c
index ce62b77..8aaf0fa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1088,8 +1088,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)

flags |= s->allocflags;

- page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
- oo);
+ page = alloc_slab_page(flags & ~__GFP_NOFAIL | __GFP_NOWARN | __GFP_NORETRY, node, oo);
if (unlikely(!page)) {
oo = s->min;
/*

2009-06-24 17:55:50

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 19:55:24 +0300 Pekka Enberg <[email protected]> wrote:

> On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
> > Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
> > patch is there to find, name, shame, blame and hopefully fix callers.
> >
> > A fix for cxgb3 is in the works. slub's design is a big problem.
> >
> > But we'll probably have to revert it for 2.6.31 :(
>
> How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
> from the higher order allocation and thus force GFP_NOFAIL allocations
> to use the minimum required order?

That could then lead to the __GFP_NOFAIL allocation attempt returning
NULL. But the callers cannot handle that and probably don't even test
for it - this is why they used __GFP_NOFAIL.

I dunno. Mabe we should just remove __GFP_NOFAIL and convert callers back
to open-coded infinite retry loops. Hardly an improvement, but it at
least would stop people naively using __GFP_NOFAIL.
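
For illustration, a minimal sketch of the kind of open-coded infinite retry loop being described; 'struct foo' and foo_alloc() are invented names, not any real caller, and the congestion_wait() back-off mirrors what the quoted page-allocator code does for __GFP_NOFAIL:

/* Illustrative only: what a caller might open-code if __GFP_NOFAIL
 * went away.  Retries forever, backing off briefly on each failure. */
static struct foo *foo_alloc(gfp_t gfp)
{
        struct foo *p;

        do {
                p = kmalloc(sizeof(*p), gfp);
                if (!p)
                        congestion_wait(WRITE, HZ/50);
        } while (!p);

        return p;
}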

2009-06-24 17:58:22

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Andrew,

Andrew Morton wrote:
> On Wed, 24 Jun 2009 19:55:24 +0300 Pekka Enberg <[email protected]> wrote:
>
>> On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
>>> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
>>> patch is there to find, name, shame, blame and hopefully fix callers.
>>>
>>> A fix for cxgb3 is in the works. slub's design is a big problem.
>>>
>>> But we'll probably have to revert it for 2.6.31 :(
>> How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
>> from the higher order allocation and thus force GFP_NOFAIL allocations
>> to use the minimum required order?
>
> That could then lead to the __GFP_NOFAIL allocation attempt returning
> NULL. But the callers cannot handle that and probably don't even test
> for it - this is why they used __GFP_NOFAIL.

No, the fallback allocation would still use __GFP_NOFAIL so the
semantics are preserved.

Pekka

2009-06-24 18:30:59

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 20:53:41 +0300
Pekka Enberg <[email protected]> wrote:

> Hi Andrew,
>
> Andrew Morton wrote:
> > On Wed, 24 Jun 2009 19:55:24 +0300 Pekka Enberg <[email protected]> wrote:
> >
> >> On Wed, Jun 24, 2009 at 7:46 PM, Andrew Morton<[email protected]> wrote:
> >>> Well yes. Using GFP_NOFAIL on a higher-order allocation is bad. This
> >>> patch is there to find, name, shame, blame and hopefully fix callers.
> >>>
> >>> A fix for cxgb3 is in the works. slub's design is a big problem.
> >>>
> >>> But we'll probably have to revert it for 2.6.31 :(
> >> How is SLUB's design a problem here? Can't we just clear GFP_NOFAIL
> >> from the higher order allocation and thus force GFP_NOFAIL allocations
> >> to use the minimum required order?
> >
> > That could then lead to the __GFP_NOFAIL allocation attempt returning
> > NULL. But the callers cannot handle that and probably don't even test
> > for it - this is why they used __GFP_NOFAIL.
>
> No, the fallback allocation would still use __GFP_NOFAIL so the
> semantics are preserved.
>

<looks>

hm, I didn't know that slub could fall back to lower-order allocations
like that. Neat.

Yes, it looks like that change would improve things. We have had
reports before of machines which oomed over an order-1 attempt when
there were order-0 pages available. If that were to happen in
allocate_slab(__GFP_NOFAIL), things would get ugly and the patch would
help.

What's the expected value of s->min in allocate_slab()? In what
situations would it be >0?


btw, gcc has in the past made a mess of handling small copy-by-value
structs like 'struct kmem_cache_order_objects'. Probably it's improved
in recent years, but it'd be worth checking to see if
s/struct kmem_cache_order_objects/unsigned long/ generates better code.

2009-06-24 18:43:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> hm, I didn't know that slub could fall back to lower-order allocations
> like that. Neat.

Slab doesn't do it, though. So we still need to get rid of the "order-1"
warning, at least (slab_break_gfp_order).

> What's the expected value of s->min in allocate_slab()? In what
> situations would it be >0?

For slub, s->min has an order of just "get_order(size)" (ie the minimum
order to fit an object).

For slab, the logic is different, but if I read the code correctly it
boils down to the minimum order, except that order-1 is accepted instead
of order-0 (strictly speaking, that only happens if you have more than
64MB of RAM). With no fallback.

And it's quite reasonable to expect to be able to do small kmalloc's
without failing.

So I'd suggest just doing this..

Linus
---
mm/page_alloc.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aecc9cd..5d714f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1153,10 +1153,10 @@ again:
* properly detect and handle allocation failures.
*
* We most definitely don't want callers attempting to
- * allocate greater than single-page units with
+ * allocate greater than order-1 page units with
* __GFP_NOFAIL.
*/
- WARN_ON_ONCE(order > 0);
+ WARN_ON_ONCE(order > 1);
}
spin_lock_irqsave(&zone->lock, flags);
page = __rmqueue(zone, order, migratetype);

2009-06-24 18:43:21

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 9:30 PM, Andrew Morton<[email protected]> wrote:
> What's the expected value of s->min in allocate_slab()? In what
> situations would it be >0?

When object size in the cache doesn't fit on a single page. See the
get_order() at the end of calculate_sizes() for details.
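
For reference, a trimmed-down sketch of how the 2.6.30-era calculate_sizes() ends up setting the two orders being discussed; s->oo, s->min, oo_make() and calculate_order() are the names used in mm/slub.c, but the real function does much more than this:

/* s->oo: preferred order/objects pair, possibly order > 0 for packing.
 * s->min: the smallest order that fits a single object, used as the
 * fallback when the s->oo allocation fails.  Simplified sketch. */
static int calculate_sizes(struct kmem_cache *s, int forced_order)
{
        unsigned long size = s->objsize;        /* padded object size */
        int order = calculate_order(size);      /* preferred order */

        if (order < 0)
                return 0;

        s->oo  = oo_make(order, size);
        s->min = oo_make(get_order(size), size);
        return 1;
}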

2009-06-24 18:45:01

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Linus,

On Wed, 24 Jun 2009, Andrew Morton wrote:
>>
>> hm, I didn't know that slub could fall back to lower-order allocations
>> like that. Neat.

On Wed, Jun 24, 2009 at 9:42 PM, Linus
Torvalds<[email protected]> wrote:
> Slab doesn't do it, though. So we still need to get rid of the "order-1"
> warning, at least (slab_break_gfp_order).
>
>> What's the expected value of s->min in allocate_slab()? In what
>> situations would it be >0?
>
> For slub, s->min has an order of just "get_order(size)" (ie the minimum
> order to fit an object).
>
> For slab, the logic is different, but if I read the code correctly it
> boils down to the minimum order, except that order-1 is accepted instead
> of order-0 (strictly speaking, that only happens if you have more than
> 64MB of RAM). With no fallback.
>
> And it's quite reasonable to expect to be able to do small kmalloc's
> without failing.
>
> So I'd suggest just doing this..

Acked-by: Pekka Enberg <[email protected]>

That said, I do think we ought to fix SLUB not to use __GFP_NOFAIL for
the higher order allocation regardless of this patch.

Pekka

2009-06-24 18:51:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Pekka Enberg wrote:
>
> That said, I do think we ought to fix SLUB not to use __GFP_NOFAIL for
> the higher order allocation regardless of this patch.

I agree. Within the context of SLUB, it makes absolute sense.

That's especially true since SLUB already uses __GFP_NORETRY or whatever.

The combination of __GFP_NOFAIL | __GFP_NORETRY is just insane. So your
patch to remove __GFP_NOFAIL when adding __GFP_NORETRY is absolutely the
ObviouslyCorrectThingToDo(tm).

Please integrate that, and take my patch too if you would (add my sign-off
or just do your own version)? Or should I just apply them?

Linus

2009-06-24 19:06:47

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 11:42:33 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> So I'd suggest just doing this..
>
> Linus
> ---
> mm/page_alloc.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aecc9cd..5d714f8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1153,10 +1153,10 @@ again:
> * properly detect and handle allocation failures.
> *
> * We most definitely don't want callers attempting to
> - * allocate greater than single-page units with
> + * allocate greater than order-1 page units with
> * __GFP_NOFAIL.
> */
> - WARN_ON_ONCE(order > 0);
> + WARN_ON_ONCE(order > 1);
> }
> spin_lock_irqsave(&zone->lock, flags);
> page = __rmqueue(zone, order, migratetype);

Well. What is our overall objective here?

My original patch was motivated by the horror at discovering that
we're using this thing (which was _never_ supposed to have new users)
for order>0 allocations. We've gone backwards.

Ideally, we'd fix all callers to handle allocation failures, then remove
__GFP_NOFAIL. But I don't know how to fix JBD.

So perhaps we should just revert that WARN_ON altogether, and I can go
on a little grep-empowered rampage, see if we can remove some of these
callsites.

It's not a huge problem, btw. I don't think I've ever seen a report of
a machine getting stuck in a __GFP_NOFAIL allocation attempt. But from
a design perspective it's Just Wrong.

2009-06-24 19:12:36

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Linus,

On Wed, 24 Jun 2009, Pekka Enberg wrote:
> >
> > That said, I do think we ought to fix SLUB not to use __GFP_NOFAIL for
> > the higher order allocation regardless of this patch.

On Wed, 24 Jun 2009, Linus Torvalds wrote:
> I agree. Within the context of SLUB, it makes absolute sense.
>
> That's especially true since SLUB already uses __GFP_NORETRY or whatever.
>
> The combination of __GFP_NOFAIL | __GFP_NORETRY is just insane. So your
> patch to remove __GFP_NOFAIL when adding __GFP_NORETRY is absolutely the
> ObviouslyCorrectThingToDo(tm).
>
> Please integrate that, and take my patch too if you would (add my sign-off
> or just do your own version)? Or should I just apply them?

Whatever suits you best. I don't have my laptop with me at home so I can't
test them or send a pull request. I'm including a compile-tested patch
here in case you want to go ahead and merge them. If you don't want to do
that, I'll integrate both of them tomorrow morning and send you a pull
request.

Pekka

From 7c251e5a5e7948626bfbcf406f7c90c60f182abd Mon Sep 17 00:00:00 2001
From: Pekka Enberg <[email protected]>
Date: Wed, 24 Jun 2009 21:59:51 +0300
Subject: [PATCH] SLUB: Don't pass __GFP_FAIL for the initial allocation

SLUB uses higher order allocations by default but falls back to small
orders under memory pressure. Make sure the GFP mask used in the initial
allocation doesn't include __GFP_NOFAIL.

Signed-off-by: Pekka Enberg <[email protected]>
---
mm/slub.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ce62b77..819f056 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1085,11 +1085,17 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
struct kmem_cache_order_objects oo = s->oo;
+ gfp_t alloc_gfp;

flags |= s->allocflags;

- page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node,
- oo);
+ /*
+ * Let the initial higher-order allocation fail under memory pressure
+ * so we fall-back to the minimum order allocation.
+ */
+ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+
+ page = alloc_slab_page(alloc_gfp, node, oo);
if (unlikely(!page)) {
oo = s->min;
/*
--
1.5.6.4

2009-06-24 19:16:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> My original patch was motivated by the horror at discovering that
> we're using this thing (which was _never_ supposed to have new users)
> for order>0 allocations. We've gone backwards.

Don't be silly.

We've always depended on order-1 allocations working.

You're just totally misguided in your arguments, and you are
FUNDAMENTALLY WRONG. Your argument is shit, BECAUSE IT IS NOT TRUE!

Lookie here. This is 2.6.0:mm/page_alloc.c:

do_retry = 0;
if (!(gfp_mask & __GFP_NORETRY)) {
if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
do_retry = 1;
if (gfp_mask & __GFP_NOFAIL)
do_retry = 1;
}
if (do_retry) {
blk_congestion_wait(WRITE, HZ/50);
goto rebalance;
}

Read that. Read if five times. Read it until you understand what it does.

Linus

PS. Here's a clue: Look at that 'order <= 3' thing. Think about what
happens EVEN IF __GFP_NOFAIL was not even set!

2009-06-24 19:21:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Pekka J Enberg wrote:
> >
> > Please integrate that, and take my patch too if you would (add my sign-off
> > or just do your own version)? Or should I just apply them?
>
> Whatever suits you best. I don't have my laptop with me at home so I can't
> test them or send a pull request. I'm including a compile-tested patch
> here in case you want to go ahead and merge them.

Ok, I committed both patches. Thanks,

Linus

2009-06-24 19:37:19

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 12:16:20 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> Lookie here. This is 2.6.0:mm/page_alloc.c:
>
> do_retry = 0;
> if (!(gfp_mask & __GFP_NORETRY)) {
> if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
> do_retry = 1;
> if (gfp_mask & __GFP_NOFAIL)
> do_retry = 1;
> }
> if (do_retry) {
> blk_congestion_wait(WRITE, HZ/50);
> goto rebalance;
> }

rebalance:
if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
/* go through the zonelist yet again, ignoring mins */
for (i = 0; zones[i] != NULL; i++) {
struct zone *z = zones[i];

page = buffered_rmqueue(z, order, cold);
if (page)
goto got_pg;
}
goto nopage;
}

2009-06-24 19:46:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Andrew Morton wrote:

> On Wed, 24 Jun 2009 12:16:20 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> > Lookie here. This is 2.6.0:mm/page_alloc.c:
> >
> > do_retry = 0;
> > if (!(gfp_mask & __GFP_NORETRY)) {
> > if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
> > do_retry = 1;
> > if (gfp_mask & __GFP_NOFAIL)
> > do_retry = 1;
> > }
> > if (do_retry) {
> > blk_congestion_wait(WRITE, HZ/50);
> > goto rebalance;
> > }
>
> rebalance:
> if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
> /* go through the zonelist yet again, ignoring mins */
> for (i = 0; zones[i] != NULL; i++) {
> struct zone *z = zones[i];
>
> page = buffered_rmqueue(z, order, cold);
> if (page)
> goto got_pg;
> }
> goto nopage;
> }

Your point?

That's the recursive allocation or oom case. Not the normal case at all.

The _normal_ case is to do the whole "try_to_free_pages()" case and try
and try again. Forever.

IOW, we have traditionally never failed small kernel allocations. It makes
perfect sense that people _depend_ on that.

Now, we have since relaxed that (a lot). And in answer to that, people
have added more __GFP_NOFAIL flags, I bet. It's all very natural. Claiming
that this is some "new error" and that we should warn about NOFAIL
allocations with big orders is just silly and simply not true.

Linus

2009-06-24 19:48:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Linus Torvalds wrote:
>
> Your point?
>
> That's the recursive allocation or oom case. Not the normal case at all.

In fact, as the whole comment explains, that case is the "give people
memory _now_, even if we normally wouldn't". The thing is, we can't
recurse into trying to free memory, because we're already in that path.

Linus

2009-06-24 20:02:19

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 12:46:02 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> > On Wed, 24 Jun 2009 12:16:20 -0700 (PDT)
> > Linus Torvalds <[email protected]> wrote:
> >
> > > Lookie here. This is 2.6.0:mm/page_alloc.c:
> > >
> > > do_retry = 0;
> > > if (!(gfp_mask & __GFP_NORETRY)) {
> > > if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
> > > do_retry = 1;
> > > if (gfp_mask & __GFP_NOFAIL)
> > > do_retry = 1;
> > > }
> > > if (do_retry) {
> > > blk_congestion_wait(WRITE, HZ/50);
> > > goto rebalance;
> > > }
> >
> > rebalance:
> > if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
> > /* go through the zonelist yet again, ignoring mins */
> > for (i = 0; zones[i] != NULL; i++) {
> > struct zone *z = zones[i];
> >
> > page = buffered_rmqueue(z, order, cold);
> > if (page)
> > goto got_pg;
> > }
> > goto nopage;
> > }
>
> Your point?

That allocation attempts of any order can fail.

> That's the recursive allocation or oom case. Not the normal case at all.
>
> The _normal_ case is to do the whole "try_to_free_pages()" case and try
> and try again. Forever.

If the caller gets oom-killed, the allocation attempt fails. Callers need
to handle that.

> IOW, we have traditionally never failed small kernel allocations. It makes
> perfect sense that people _depend_ on that.
>
> Now, we have since relaxed that (a lot). And in answer to that, people
> have added more __GFP_NOFAIL flags, I bet. It's all very natural. Claiming
> that this is some "new error" and that we should warn about NOFAIL
> allocations with big orders is just silly and simply not true.
>

There are situations in which the allocation attempt simply will not
succeed, so a __GFP_NOFAIL attempt will lock up. Hence callers should
stop using __GFP_NOFAIL and should handle the allocation error like
99.9999% of the rest of the kernel does.

The chances of the allocation attempt failing increase with
higher-order allocations, hence the combination of __GFP_NOFAIL with
order>0 should be avoided more strenuously than __GFP_NOFAIL &&
order==0.


<Note that the TIF_MEMDIE handling has changed post-2.6.30. I still
need to get my head around the end result of what we did there. Did we
break the __alloc_pages-fails-if-TIF_MEMDIE logic?>


2009-06-24 20:14:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> If the caller gets oom-killed, the allocation attempt fails. Callers need
> to handle that.

I actually disagree. I think we should just admit that we can always free
up enough space to get a few pages, in order to then oom-kill things.

This is not a new concept. oom has never been "immediately kill".

Linus

2009-06-24 20:41:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Wed, 24 Jun 2009, Linus Torvalds wrote:
> On Wed, 24 Jun 2009, Andrew Morton wrote:
> >
> > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > to handle that.
>
> I actually disagree. I think we should just admit that we can always free
> up enough space to get a few pages, in order to then oom-kill things.

Btw, if you want to change the WARN_ON() to warn when you're in the
"allocate in order to free memory" recursive case, then I'd have no issues
with that.

In fact, in that case it probably shouldn't even be conditional on the
order.

So a

WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfpmask & __GFP_NOFAIL));

actually makes tons of sense.

There are other cases where __GFP_NOFAIL doesn't make sense too, and that
could be warned about. The __GFP_NORETRY thing was already mentioned.
Similarly, !__GFP_WAIT doesn't work with __GFP_NOFAIL - because the nofail
obviously relies on being able to do something about the failure case.

We might want to also have rules like "in order to have NOFAIL, you need
to allow IO and FS accesses".

So I don't mind warnings with __GFP_NOFAIL. I just think they should be
relevant, and make sense. The "order > 0" one is neither.

Linus
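
Collected into code, the checks suggested above might look roughly like the following; this is an illustrative sketch, not a patch proposed in this thread:

if (gfp_mask & __GFP_NOFAIL) {
        /* NOFAIL while allocating in order to free memory is circular. */
        WARN_ON_ONCE(current->flags & PF_MEMALLOC);
        /* NOFAIL can't work if we refuse to retry... */
        WARN_ON_ONCE(gfp_mask & __GFP_NORETRY);
        /* ...or if we aren't allowed to sleep and reclaim at all. */
        WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
}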

2009-06-24 21:57:28

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
>
> On Wed, 24 Jun 2009, Andrew Morton wrote:
> >
> > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > to handle that.
>
> I actually disagree. I think we should just admit that we can always free
> up enough space to get a few pages, in order to then oom-kill things.

I'm unclear on precisely what you're proposing here?

> This is not a new concept. oom has never been "immediately kill".

Well, it has been immediate for a long time. A couple of reasons which
I can recall:

- A page-allocating process will oom-kill another process in the
expectation that the killing will free up some memory. If the
oom-killed process remains stuck in the page allocator, that doesn't
work.

- The oom-killed process might be holding locks (typically fs locks).
This can cause an arbitrary number of other processes to be blocked.
So to get the system unstuck we need the oom-killed process to
immediately exit the page allocator, to handle the NULL return and to
drop those locks.

There may be other reasons - it was all a long time ago, and I've never
personally hacked on the oom-killer much and I never get oom-killed.
But given the amount of development work which goes on in there, some
people must be getting massacred.


A long time ago, the Suse kernel shipped with a largely (or
completely?) disabled oom-killer. It removed the
retry-small-allocations-for-ever logic and simply returned NULL to the
caller. I never really understood what problem/thinking led Andrea to
do that.


But it's all a bit moot at present, as we seem to have removed the
return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges. I think
that was an accident:

- /* This allocation should allow future memory freeing. */
-
rebalance:
- if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
- && !in_interrupt()) {
- if (!(gfp_mask & __GFP_NOMEMALLOC)) {
-nofail_alloc:
- /* go through the zonelist yet again, ignoring mins */
- page = get_page_from_freelist(gfp_mask, nodemask, order,
- zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
- if (page)
- goto got_pg;
- if (gfp_mask & __GFP_NOFAIL) {
- congestion_wait(WRITE, HZ/50);
- goto nofail_alloc;
- }
- }
- goto nopage;
+ /* Allocate without watermarks if the context allows */
+ if (alloc_flags & ALLOC_NO_WATERMARKS) {
+ page = __alloc_pages_high_priority(gfp_mask, order,
+ zonelist, high_zoneidx, nodemask,
+ preferred_zone, migratetype);
+ if (page)
+ goto got_pg;
}

Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
the TIF_MEMDIE case.

Mel is having a bit of downtime at present.

2009-06-24 22:07:36

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009 13:40:11 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Wed, 24 Jun 2009, Linus Torvalds wrote:
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > >
> > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > to handle that.
> >
> > I actually disagree. I think we should just admit that we can always free
> > up enough space to get a few pages, in order to then oom-kill things.
>
> Btw, if you want to change the WARN_ON() to warn when you're in the
> "allocate in order to free memory" recursive case, then I'd have no issues
> with that.
>
> In fact, in that case it probably shouldn't even be conditional on the
> order.
>
> So a
>
> WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfpmask & __GFP_NOFAIL));
>
> actually makes tons of sense.

I suspect that warning will trigger.

alloc_pages
-> ...
-> pageout
-> ...
-> get_request
-> blk_alloc_request
-> elv_set_request
-> cfq_set_request
-> cfq_get_queue
-> cfq_find_alloc_queue
-> kmem_cache_alloc_node(__GFP_NOFAIL)
-> Jens

How much this can happen in practice I don't know, but it looks bad.

> There are other cases where __GFP_NOFAIL doesn't make sense too, and that
> could be warned about. The __GFP_NORETRY thing was already mentioned.
> Similarly, !__GFP_WAIT doesn't work with __GFP_NOFAIL - because the nofail
> obviously relies on being able to do something about the failure case.
>
> We might want to also have rules like "in order to have NOFAIL, you need
> to allow IO and FS accesses".

Sure, that's sane.

fs/jbd/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);

But that isn't :(

> So I don't mind warnings with __GFP_NOFAIL. I just think they should be
> relevant, and make sense. The "order > 0" one is neither.

2009-06-25 04:06:05

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 03:07:14PM -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:40:11 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
> > On Wed, 24 Jun 2009, Linus Torvalds wrote:
> > > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > actually makes tons of sense.
>
> I suspect that warning will trigger.
>
> alloc_pages
> -> ...
> -> pageout
> -> ...
> -> get_request
> -> blk_alloc_request
> -> elv_set_request
> -> cfq_set_request
> -> cfq_get_queue
> -> cfq_find_alloc_queue
> -> kmem_cache_alloc_node(__GFP_NOFAIL)
> -> Jens
>
> How much this can happen in practice I don't know, but it looks bad.
>
> > There are other cases where __GFP_NOFAIL doesn't make sense too, and that
> > could be warned about. The __GFP_NORETRY thing was already mentioned.
> > Similarly, !__GFP_WAIT doesn't work with __GFP_NOFAIL - because the nofail
> > obviously relies on being able to do something about the failure case.
> >
> > We might want to also have rules like "in order to have NOFAIL, you need
> > to allow IO and FS accesses".
>
> Sure, that's sane.
>
> fs/jbd/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
>
> But that isn't :(

Unfortunately there's a lot of "hidden" ones in fs/buffer.c.
alloc_page_buffers with arg3=1, which is used quite a lot in
create_empty_buffers.

Also getblk. Sad.

fsblock FTW :)

2009-06-25 04:14:27

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 02:56:15PM -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
> > This is not a new concept. oom has never been "immediately kill".
>
> Well, it has been immediate for a long time. A couple of reasons which
> I can recall:
>
> - A page-allocating process will oom-kill another process in the
> expectation that the killing will free up some memory. If the
> oom-killed process remains stuck in the page allocator, that doesn't
> work.
>
> - The oom-killed process might be holding locks (typically fs locks).
> This can cause an arbitrary number of other processes to be blocked.
> So to get the system unstuck we need the oom-killed process to
> immediately exit the page allocator, to handle the NULL return and to
> drop those locks.
>
> There may be other reasons - it was all a long time ago, and I've never
> personally hacked on the oom-killer much and I never get oom-killed.
> But given the amount of development work which goes on in there, some
> people must be getting massacred.

I think you are right Andrew. Any caller can fail. I suspect it is
*very* rare to ever actually see failures because of the size of
the reserves we keep, but the theoretical possibility is there. It
would become rather more common with order > 0 allocations too. But
seeing as those are fairly infrequent in the page reclaim path (sans
SLUB), then they are probably uncommon too.

But to be completely safe from oopses, we need to annotate with
GFP_NOFAIL or handle failures in the caller.


> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer. It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller. I never really understood what problem/thinking led Andrea to
> do that.

I think all 2.6 based ones are pretty close to upstream logic;
however, we have run into customers reporting OOM deadlocks that
the OOM killer cannot resolve. I forget the exact details but
they involve memory reserves being exhausted and killed tasks
trying to allocate memory holding locks etc.


> But it's all a bit moot at present, as we seem to have removed the
> return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges. I think
> that was an accident:
>
> - /* This allocation should allow future memory freeing. */
> -
> rebalance:
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> - /* go through the zonelist yet again, ignoring mins */
> - page = get_page_from_freelist(gfp_mask, nodemask, order,
> - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> - if (page)
> - goto got_pg;
> - if (gfp_mask & __GFP_NOFAIL) {
> - congestion_wait(WRITE, HZ/50);
> - goto nofail_alloc;
> - }
> - }
> - goto nopage;
> + /* Allocate without watermarks if the context allows */
> + if (alloc_flags & ALLOC_NO_WATERMARKS) {
> + page = __alloc_pages_high_priority(gfp_mask, order,
> + zonelist, high_zoneidx, nodemask,
> + preferred_zone, migratetype);
> + if (page)
> + goto got_pg;
> }
>
> Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
> the TIF_MEMDIE case.
>
> Mel is having a bit of downtime at present.

2009-06-25 08:22:03

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, 24 Jun 2009, Andrew Morton wrote:

> On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
> >
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > >
> > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > to handle that.
> >
> > I actually disagree. I think we should just admit that we can always free
> > up enough space to get a few pages, in order to then oom-kill things.
>
> I'm unclear on precisely what you're proposing here?
>

It's a justification for the looping for small-order allocations.

We shouldn't actually need to free any pages to oom kill tasks since
TIF_MEMDIE allows its allocations to ignore each zone's watermarks (and it
also allows allocations to happen anywhere regardless of cpuset
restrictions just like GFP_ATOMIC), which will provide more than a "few
pages" that are already available for the oom killed task to use if each
zone's min watermark is sanely defined by min_free_kbytes, and that will
always still be above SWAP_CLUSTER_MAX for highmem.

Those memory reserves are also available for reclaimers, but PF_MEMALLOC
tasks will never loop in the page allocator because __GFP_NOFAIL is
ignored in the reclaim path (and understandably so).

> > This is not a new concept. oom has never been "immediately kill".
>
> Well, it has been immediate for a long time. A couple of reasons which
> I can recall:
>
> - A page-allocating process will oom-kill another process in the
> expectation that the killing will free up some memory. If the
> oom-killed process remains stuck in the page allocator, that doesn't
> work.
>

This would cause the oom killed task to repeatedly call the oom killer to
free more memory, and it would refuse to act because a task (current) is
found to have TIF_MEMDIE set in the tasklist scan. So this would be a
livelock when there is no memory left (and nothing can be reclaimed), but
that shouldn't occur with sane watermarks. The oom killed task will
immediately enter buffered_rmqueue() for each zone in the zonelist to
satisfy the allocation; it would require this to return NULL every time for
it to loop endlessly. As you mentioned, this behavior has changed
recently so that TIF_MEMDIE threads loop endlessly instead of failing the
allocation attempt when get_page_from_freelist() fails with no watermarks.
The old behavior needs to be reinstated.

> - The oom-killed process might be holding locks (typically fs locks).
> This can cause an arbitrary number of other processes to be blocked.
> So to get the system unstuck we need the oom-killed process to
> immediately exit the page allocator, to handle the NULL return and to
> drop those locks.
>

There's no advantage to that over allowing the allocation to succeed since
current now has TIF_MEMDIE, and this is a better result than returning
NULL for ~__GFP_FS. When the old behavior of failing allocations when
ALLOC_NO_WATERMARKS has failed is restored, as you mentioned, this
livelock will be closed.

> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer. It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller. I never really understood what problem/thinking led Andrea to
> do that.
>

If you disable retries for allocations at or under
PAGE_ALLOC_COSTLY_ORDER, then the oom killer would have to be disabled
for ~__GFP_NOFAIL and ~__GFP_REPEAT as well, otherwise it would serve no
purpose for the benefit of the current context.

If the order is under PAGE_ALLOC_COSTLY_ORDER and try_to_free_pages()
returned something positive, there's a probability that we'll succeed in
the subsequent allocation attempts. If reclaim failed to free anything,
the oom killer was called (only if the order is at or under
PAGE_ALLOC_COSTLY_ORDER, the same condition we automatically loop in), and
there's again a probability that we'll succeed in subsequent allocation
attempts when that memory is freed as the result of the chosen task
exiting.

In other words, the only time we wouldn't want to loop in such a case is
when the oom killer fails to kill any task, which happens when:

- another task has already been oom killed and hasn't yet exited (if
this persists, it's an indication of the problem you mentioned earlier
which should be solved by the min watermark), or

- there are no eligible tasks left to kill, in which case the machine
panics (there are only kthreads or everything else is set to
OOM_DISABLE).

> But it's all a bit moot at present, as we seem to have removed the
> return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges. I think
> that was an accident:
>
> - /* This allocation should allow future memory freeing. */
> -
> rebalance:
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> - /* go through the zonelist yet again, ignoring mins */
> - page = get_page_from_freelist(gfp_mask, nodemask, order,
> - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> - if (page)
> - goto got_pg;
> - if (gfp_mask & __GFP_NOFAIL) {
> - congestion_wait(WRITE, HZ/50);
> - goto nofail_alloc;
> - }
> - }
> - goto nopage;
> + /* Allocate without watermarks if the context allows */
> + if (alloc_flags & ALLOC_NO_WATERMARKS) {
> + page = __alloc_pages_high_priority(gfp_mask, order,
> + zonelist, high_zoneidx, nodemask,
> + preferred_zone, migratetype);
> + if (page)
> + goto got_pg;
> }
>
> Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
> the TIF_MEMDIE case.
>

Right, the allocation should fail if test_thread_flag(TIF_MEMDIE) and
__alloc_pages_high_priority() has failed. There's no need to enter
reclaim since it has already failed to free memory (the only reason it was
oom killed to begin with) and memory reserves have been fully depleted.

2009-06-25 13:26:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 03:07:14PM -0700, Andrew Morton wrote:
>
> fs/jbd/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
>
> But that isn't :(

Well, we could recode it to do what journal_alloc_head() does, which
is call the allocator in a loop:

ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
if (ret == NULL) {
jbd_debug(1, "out of memory for journal_head\n");
if (time_after(jiffies, last_warning + 5*HZ)) {
printk(KERN_NOTICE "ENOMEM in %s, retrying.\n",
__func__);
last_warning = jiffies;
}
while (ret == NULL) {
yield();
ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
}
}

Like journal_write_metadata_buffer(), which you quoted, it's called
out of the commit code, where about the only choice we have other than
looping or using GFP_NOFAIL is to abort the filesystem and remount it
read-only or panic. It's not at all clear to me that looping
repeatedly is helpful; for example, the allocator doesn't know that it
should try really hard, and perhaps fall back to an order-0 allocation
if an order-1 allocation won't work.

Hmm.... it may be possible to do the memory allocation in advance,
before we get to the commit, and make it be easier to fail and return
ENOMEM to userspace --- which I bet most applications won't handle
gracefully, either (a) not checking error codes and losing data, or
(b) dying on the spot, so it would effectively be an OOM kill.
And in some cases, we're calling journal_get_write_access() out of a
kernel daemon like pdflush, where the error recovery paths may get
rather interesting.

The question then is what is the right strategy? Use GFP_NOFAIL, and
let the memory allocator loop; let the allocating kernel code loop;
remount filesystems read/only and/or panic; pass a "try _really_ hard"
flag to the allocator and fall back to a ro-remount/panic if the
allocator still wasn't successful? None of the alternatives seem
particularly appealing to me....

- Ted

2009-06-25 18:51:56

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, 25 Jun 2009, Theodore Tso wrote:

> On Wed, Jun 24, 2009 at 03:07:14PM -0700, Andrew Morton wrote:
> >
> > fs/jbd/journal.c: new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> >
> > But that isn't :(
>
> Well, we could recode it to do what journal_alloc_head() does, which
> is call the allocator in a loop:
>
> ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
> if (ret == NULL) {
> jbd_debug(1, "out of memory for journal_head\n");
> if (time_after(jiffies, last_warning + 5*HZ)) {
> printk(KERN_NOTICE "ENOMEM in %s, retrying.\n",
> __func__);
> last_warning = jiffies;
> }
> while (ret == NULL) {
> yield();
> ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
> }
> }
>
> Like journal_write_metadata_buffer(), which you quoted, it's called
> out of the commit code, where about the only choice we have other than
> looping or using GFP_NOFAIL is to abort the filesystem and remount it
> read-only or panic. It's not at all clear to me that looping
> repeatedly is helpful; for example, the allocator doesn't know that it
> should try really hard, and perhaps fall back to an order-0 allocation
> if an order-1 allocation won't work.
>

Since it's using kmem_cache_alloc(), the order fallback is the
responsibility of the slab allocator when a new slab allocation fails and
a single object could fit in an order 0 page, so it's not a concern for
this particular allocation.

There's no way to indicate that the page allocator should "try really
hard" because the VM implementation should already do that for every
allocation before failure. A subsequent attempt after the first failure
could try GFP_ATOMIC, though, which allows allocation beyond the minimum
watermark and is more likely to succeed than GFP_NOFS. Such an
allocation should be short-lived and not rely on additional memory to free
to avoid depleting most of the memory reserves available to atomic
allocations, direct reclaim, and oom killed tasks.

> Hmm.... it may be possible to do the memory allocation in advance,
> before we get to the commit, and make it be easier to fail and return
> ENOMEM to userspace --- which I bet most applications won't handle
> gracefully, either (a) not checking error codes and losing data, or
> (b) dying on the spot, so it would effectively be an OOM kill.

If this would still be a GFP_NOFS allocation, the oom killer will not be
triggered (it only gets called when __GFP_FS is set to avoid killing tasks
when reclaim was not possible).

2009-06-25 19:38:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 11:51:40AM -0700, David Rientjes wrote:
>
> There's no way to indicate that the page allocator should "try really
> hard" because the VM implementation should already do that for every
> allocation before failure. A subsequent attempt after the first failure
> could try GFP_ATOMIC, though, which allows allocation beyond the minimum
> watermark and is more likely to succeed than GFP_NOFS. Such an
> allocation should be short-lived and not rely on additional memory to free
> to avoid depleting most of the memory reserves available to atomic
> allocations, direct reclaim, and oom killed tasks.

Hmm, is there a reason to avoid using GFP_ATOMIC on the first
allocation, and only adding GFP_ATOMIC after the first failure?

In the case of ext4, after we finish the commit, we will release quite
a bit of memory to the system, so using GFP_ATOMIC to complete is a
good thing. Of course, preallocating some of these data structures
before the commit would be better, since we can return ENOMEM to
userspace applications when they are calling a system call.

> > Hmm.... it may be possible to do the memory allocation in advance,
> > before we get to the commit, and make it be easier to fail and return
> > ENOMEM to userspace --- which I bet most applications won't handle
> > gracefully, either (a) not checking error codes and losing data, or
> > (b) dying on the spot, so it would effectively be an OOM kill.
>
> If this would still be a GFP_NOFS allocation, the oom killer will not be
> triggered (it only gets called when __GFP_FS is set to avoid killing tasks
> when reclaim was not possible).

I didn't mean that it would really be an OOM kill --- just that many
applications don't have very sophisticated error checking themselves,
and will either not do error checking at all, or if they get an ENOMEM
from a system call, will probably just immediately do something like
'perror("Yikes!"); exit(1);' --- so it might as _well_ be an OOM kill.

On the other hand, by returning an ENOMEM to userspace, we at least
allow the competent application writers to try to do something
intelligent (cynical kernel programmers who don't believe there are
many such; let's leave that aside for the bar room discussion :-), and
if you're out of memory, you're out of memory, and whether programs
die from an OOM or an untested NULL dereference in an error path in
the application, or an explicit 'perror("Yikes!"); exit(1);', doesn't
much matter.

- Ted

2009-06-25 19:45:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 03:38:06PM -0400, Theodore Tso wrote:
> Hmm, is there a reason to avoid using GFP_ATOMIC on the first
> allocation, and only adding GFP_ATOMIC after the first failure?

Never mind, stupid question; I hit the send button before thinking
about this enough. Obviously we should try without GFP_ATOMIC so the
allocator can try to release some memory. So maybe the answer for
filesystem code where the alternative to allocator failure is
remounting the root filesystem read-only or panic(), should be:

1) Try to do the allocation GFP_NOFS.

2) Then try GFP_ATOMIC

3) Then retry the allocator with GFP_NOFS in a loop (possibly with a
timeout that then panics the system and allows it to reboot,
although arguably a watchdog timer should really perform that
function).

Obviously if we can rework the filesystem code to avoid this as much
as possible, this would be desirable, but if there are some cases left
over where we really have no choice, that's probably what we should
do.

- Ted
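
A rough sketch of that three-step fallback; jbd_alloc_hard() is an invented name, nothing like it exists in fs/jbd, and the timeout/panic option in step 3 is left out:

static void *jbd_alloc_hard(struct kmem_cache *cache)
{
        void *p;

        p = kmem_cache_alloc(cache, GFP_NOFS);          /* 1) normal attempt */
        if (p)
                return p;

        p = kmem_cache_alloc(cache, GFP_ATOMIC);        /* 2) dip into reserves */
        if (p)
                return p;

        do {                                            /* 3) retry GFP_NOFS */
                congestion_wait(WRITE, HZ/50);
                p = kmem_cache_alloc(cache, GFP_NOFS);
        } while (!p);

        return p;
}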

2009-06-25 19:55:43

by Jens Axboe

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24 2009, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:40:11 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> >
> >
> > On Wed, 24 Jun 2009, Linus Torvalds wrote:
> > > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > > >
> > > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > > to handle that.
> > >
> > > I actually disagree. I think we should just admit that we can always free
> > > up enough space to get a few pages, in order to then oom-kill things.
> >
> > Btw, if you want to change the WARN_ON() to warn when you're in the
> > "allocate in order to free memory" recursive case, then I'd have no issues
> > with that.
> >
> > In fact, in that case it probably shouldn't even be conditional on the
> > order.
> >
> > So a
> >
> > WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfpmask & __GFP_NOFAIL));
> >
> > actually makes tons of sense.
>
> I suspect that warning will trigger.
>
> alloc_pages
> -> ...
> -> pageout
> -> ...
> -> get_request
> -> blk_alloc_request
> -> elv_set_request
> -> cfq_set_request
> -> cfq_get_queue
> -> cfq_find_alloc_queue
> -> kmem_cache_alloc_node(__GFP_NOFAIL)
> -> Jens
>
> How much this can happen in practice I don't know, but it looks bad.

Yeah, it sucks, but I don't think it's that bad to fix up.

The request allocation can fail; if we just return NULL in
cfq_find_alloc_queue() and let that error propagate back up to
get_request_wait(), it would simply io_schedule() and wait for a request
to be freed. The only issue here is that if we have no requests going
for this queue already, we would be stuck, since there's no one to wake
us up eventually. So if we did this, we'd have to make the io_schedule()
dependent on whether there are allocations out already, and use a global
congestion wait in that case, or just io_schedule_timeout() for
retrying.

The other option is to retry in cfq_find_alloc_queue() without the
NOFAIL and do the congestion wait logic in there.

Yet another option would be to have a dummy queue that is allocated when
the queue io scheduler is initialized. If cfq_find_alloc_queue() fails,
just punt the IO to that dummy queue. That would allow progress even
under extreme failure conditions.

With all that said, the likelihood of ever hitting this path is about
0%. Those failures are the ones that really suck when they're hit in the
field eventually, though :-)

--
Jens Axboe

2009-06-25 19:56:16

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, 25 Jun 2009 15:44:23 -0400
Theodore Tso <[email protected]> wrote:

> On Thu, Jun 25, 2009 at 03:38:06PM -0400, Theodore Tso wrote:
> > Hmm, is there a reason to avoid using GFP_ATOMIC on the first
> > allocation, and only adding GFP_ATOMIC after the first failure?
>
> Never mind, stupid question; I hit the send button before thinking
> about this enough. Obviously we should try without GFP_ATOMIC so the
> allocator can try to release some memory.

One could use GFP_KERNEL|__GFP_HIGH, which bestows GFP_ATOMIC's special
reserve-dipping powers upon GFP_KERNEL.

Maybe one could set PF_MEMALLOC on kjournald too, as it is a thing
which can clean memory, as long as it is given a bit of memory itself.
This would be a risky step.
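
For concreteness, the two ideas would look roughly like this
(hypothetical snippets, not a patch):

	/* GFP_KERNEL semantics plus GFP_ATOMIC's reserve-dipping bit */
	new = kmalloc(size, GFP_KERNEL | __GFP_HIGH);

	/* let kjournald dip into reserves the way reclaimers do */
	current->flags |= PF_MEMALLOC;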

In fact they're both risky steps, because the block layer needs memory
too. If JBD uses __GFP_HIGH/PF_MEMALLOC tricks to get more memory,
that just deprives the block layer of some.

OTOH the block layer uses special immortal mempools to avoid starvation.

otoh2, ext3 can be backed by devices other than plain old disks.

2009-06-25 20:08:39

by Jens Axboe

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25 2009, Jens Axboe wrote:
> On Wed, Jun 24 2009, Andrew Morton wrote:
> > On Wed, 24 Jun 2009 13:40:11 -0700 (PDT)
> > Linus Torvalds <[email protected]> wrote:
> >
> > >
> > >
> > > On Wed, 24 Jun 2009, Linus Torvalds wrote:
> > > > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > > > >
> > > > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > > > to handle that.
> > > >
> > > > I actually disagree. I think we should just admit that we can always free
> > > > up enough space to get a few pages, in order to then oom-kill things.
> > >
> > > Btw, if you want to change the WARN_ON() to warn when you're in the
> > > "allocate in order to free memory" recursive case, then I'd have no issues
> > > with that.
> > >
> > > In fact, in that case it probably shouldn't even be conditional on the
> > > order.
> > >
> > > So a
> > >
> > > WARN_ON_ONCE((p->flags & PF_MEMALLOC) && (gfpmask & __GFP_NOFAIL));
> > >
> > > actually makes tons of sense.
> >
> > I suspect that warning will trigger.
> >
> > alloc_pages
> > -> ...
> > -> pageout
> > -> ...
> > -> get_request
> > -> blk_alloc_request
> > -> elv_set_request
> > -> cfq_set_request
> > -> cfq_get_queue
> > -> cfq_find_alloc_queue
> > -> kmem_cache_alloc_node(__GFP_NOFAIL)
> > -> Jens
> >
> > How much this can happen in practice I don't know, but it looks bad.
>
> Yeah, it sucks, but I don't think it's that bad to fix up.
>
> The request allocation can fail; if we just return NULL in
> cfq_find_alloc_queue() and let that error propagate back up to
> get_request_wait(), it would simply io_schedule() and wait for a request
> to be freed. The only issue here is that if we have no requests going
> for this queue already, we would be stuck, since there's no one to wake
> us up eventually. So if we did this, we'd have to make the io_schedule()
> dependent on whether there are allocations out already, and use a global
> congestion wait in that case, or just io_schedule_timeout() for
> retrying.
>
> The other option is to retry in cfq_find_alloc_queue() without the
> NOFAIL and do the congestion wait logic in there.
>
> Yet another option would be to have a dummy queue that is allocated when
> the queue io scheduler is initialized. If cfq_find_alloc_queue() fails,
> just punt the IO to that dummy queue. That would allow progress even
> under extreme failure conditions.
>
> With all that said, the likelihood of ever hitting this path is about
> 0%. Those failures are the ones that really suck when they're hit in the
> field eventually, though :-)

Something like the below implements option 3. Totally untested, I'll
throw it through some memfail injections and blktrace before I fully
trust it. And split it into 2 patches, the first doing the
cfq_init_cfqq() stuff and the second adding oom_cfqq and using it there
too.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 833ec18..3b075a8 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -71,6 +71,51 @@ struct cfq_rb_root {
#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, }

/*
+ * Per process-grouping structure
+ */
+struct cfq_queue {
+ /* reference count */
+ atomic_t ref;
+ /* various state flags, see below */
+ unsigned int flags;
+ /* parent cfq_data */
+ struct cfq_data *cfqd;
+ /* service_tree member */
+ struct rb_node rb_node;
+ /* service_tree key */
+ unsigned long rb_key;
+ /* prio tree member */
+ struct rb_node p_node;
+ /* prio tree root we belong to, if any */
+ struct rb_root *p_root;
+ /* sorted list of pending requests */
+ struct rb_root sort_list;
+ /* if fifo isn't expired, next request to serve */
+ struct request *next_rq;
+ /* requests queued in sort_list */
+ int queued[2];
+ /* currently allocated requests */
+ int allocated[2];
+ /* fifo list of requests in sort_list */
+ struct list_head fifo;
+
+ unsigned long slice_end;
+ long slice_resid;
+ unsigned int slice_dispatch;
+
+ /* pending metadata requests */
+ int meta_pending;
+ /* number of requests that are on the dispatch list or inside driver */
+ int dispatched;
+
+ /* io prio of this group */
+ unsigned short ioprio, org_ioprio;
+ unsigned short ioprio_class, org_ioprio_class;
+
+ pid_t pid;
+};
+
+/*
* Per block device queue structure
*/
struct cfq_data {
@@ -135,51 +180,11 @@ struct cfq_data {
unsigned int cfq_slice_idle;

struct list_head cic_list;
-};
-
-/*
- * Per process-grouping structure
- */
-struct cfq_queue {
- /* reference count */
- atomic_t ref;
- /* various state flags, see below */
- unsigned int flags;
- /* parent cfq_data */
- struct cfq_data *cfqd;
- /* service_tree member */
- struct rb_node rb_node;
- /* service_tree key */
- unsigned long rb_key;
- /* prio tree member */
- struct rb_node p_node;
- /* prio tree root we belong to, if any */
- struct rb_root *p_root;
- /* sorted list of pending requests */
- struct rb_root sort_list;
- /* if fifo isn't expired, next request to serve */
- struct request *next_rq;
- /* requests queued in sort_list */
- int queued[2];
- /* currently allocated requests */
- int allocated[2];
- /* fifo list of requests in sort_list */
- struct list_head fifo;
-
- unsigned long slice_end;
- long slice_resid;
- unsigned int slice_dispatch;

- /* pending metadata requests */
- int meta_pending;
- /* number of requests that are on the dispatch list or inside driver */
- int dispatched;
-
- /* io prio of this group */
- unsigned short ioprio, org_ioprio;
- unsigned short ioprio_class, org_ioprio_class;
-
- pid_t pid;
+ /*
+ * Fallback dummy cfqq for extreme OOM conditions
+ */
+ struct cfq_queue oom_cfqq;
};

enum cfqq_state_flags {
@@ -1641,6 +1646,26 @@ static void cfq_ioc_set_ioprio(struct io_context *ioc)
ioc->ioprio_changed = 0;
}

+static void cfq_init_cfqq(struct cfq_data *cfqd, struct cfq_queue *cfqq,
+ pid_t pid, int is_sync)
+{
+ RB_CLEAR_NODE(&cfqq->rb_node);
+ RB_CLEAR_NODE(&cfqq->p_node);
+ INIT_LIST_HEAD(&cfqq->fifo);
+
+ atomic_set(&cfqq->ref, 0);
+ cfqq->cfqd = cfqd;
+
+ cfq_mark_cfqq_prio_changed(cfqq);
+
+ if (is_sync) {
+ if (!cfq_class_idle(cfqq))
+ cfq_mark_cfqq_idle_window(cfqq);
+ cfq_mark_cfqq_sync(cfqq);
+ }
+ cfqq->pid = pid;
+}
+
static struct cfq_queue *
cfq_find_alloc_queue(struct cfq_data *cfqd, int is_sync,
struct io_context *ioc, gfp_t gfp_mask)
@@ -1666,10 +1691,11 @@ retry:
*/
spin_unlock_irq(cfqd->queue->queue_lock);
new_cfqq = kmem_cache_alloc_node(cfq_pool,
- gfp_mask | __GFP_NOFAIL | __GFP_ZERO,
+ gfp_mask | __GFP_ZERO,
cfqd->queue->node);
spin_lock_irq(cfqd->queue->queue_lock);
- goto retry;
+ if (new_cfqq)
+ goto retry;
} else {
cfqq = kmem_cache_alloc_node(cfq_pool,
gfp_mask | __GFP_ZERO,
@@ -1678,23 +1704,8 @@ retry:
goto out;
}

- RB_CLEAR_NODE(&cfqq->rb_node);
- RB_CLEAR_NODE(&cfqq->p_node);
- INIT_LIST_HEAD(&cfqq->fifo);
-
- atomic_set(&cfqq->ref, 0);
- cfqq->cfqd = cfqd;
-
- cfq_mark_cfqq_prio_changed(cfqq);
-
+ cfq_init_cfqq(cfqd, cfqq, current->pid, is_sync);
cfq_init_prio_data(cfqq, ioc);
-
- if (is_sync) {
- if (!cfq_class_idle(cfqq))
- cfq_mark_cfqq_idle_window(cfqq);
- cfq_mark_cfqq_sync(cfqq);
- }
- cfqq->pid = current->pid;
cfq_log_cfqq(cfqd, cfqq, "alloced");
}

@@ -1702,6 +1713,8 @@ retry:
kmem_cache_free(cfq_pool, new_cfqq);

out:
+ if (!cfqq)
+ cfqq = &cfqd->oom_cfqq;
WARN_ON((gfp_mask & __GFP_WAIT) && !cfqq);
return cfqq;
}
@@ -1735,11 +1748,8 @@ cfq_get_queue(struct cfq_data *cfqd, int is_sync, struct io_context *ioc,
cfqq = *async_cfqq;
}

- if (!cfqq) {
+ if (!cfqq)
cfqq = cfq_find_alloc_queue(cfqd, is_sync, ioc, gfp_mask);
- if (!cfqq)
- return NULL;
- }

/*
* pin the queue now that it's allocated, scheduler exit will prune it
@@ -2465,6 +2475,11 @@ static void *cfq_init_queue(struct request_queue *q)
for (i = 0; i < CFQ_PRIO_LISTS; i++)
cfqd->prio_trees[i] = RB_ROOT;

+ /*
+ * Our fallback cfqq if cfq_find_alloc_queue() runs into OOM issues
+ */
+ cfq_init_cfqq(cfqd, &cfqd->oom_cfqq, 1, 0);
+
INIT_LIST_HEAD(&cfqd->cic_list);

cfqd->queue = q;

--
Jens Axboe

2009-06-25 20:12:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Thu, 25 Jun 2009, Theodore Tso wrote:
>
> Never mind, stupid question; I hit the send button before thinking
> about this enough. Obviously we should try without GFP_ATOMIC so the
> allocator can try to release some memory. So maybe the answer for
> filesystem code where the alternative to allocator failure is
> remounting the root filesystem read-only or panic(), should be:
>
> 1) Try to do the allocation GFP_NOFS.

Well, even with NOFS, the kernel will still do data writeout that can be
done purely by swapping. NOFS is really about avoiding recursion from
filesystems.

So you might want to try GFP_NOIO, which will mean that the kernel will
try to free memory that needs no IO at all. This also protects from
recursion in the IO path (ie block layer request allocation etc).

That said, GFP_ATOMIC may be better than both in practice, if only because
it might be better at balancing memory usage (ie too much "NOIO" might
result in the kernel aggressively dropping clean page-cache pages, since
it cannot drop dirty ones).

Note the "might". It probably doesn't matter in practice, since the bulk
of all allocations should always hopefully be GFP_KERNEL or GFP_USER.

> 2) Then try GFP_ATOMIC

The main difference between NOIO and ATOMIC is

- ATOMIC never tries to free _any_ kind of memory, since it doesn't want
to take the locks, and cannot enable interrupts.

- ATOMIC has the magic "high priority" bit set that means that you get to
dip into critical memory resources in order to satisfy the memory
request.
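
For reference, the relevant flag definitions (as of the 2.6.30-era
<linux/gfp.h>, quoted from memory, so double-check the tree):

#define GFP_ATOMIC	(__GFP_HIGH)
#define GFP_NOIO	(__GFP_WAIT)
#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)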

Whether these are important to you or not, I dunno. I actually suspect
that we might want a combination of "high priority + allow memory
freeing", which would be

#define GFP_CRITICAL (__GFP_HIGH | __GFP_WAIT)

and might be useful outside of interrupt context for things that _really_
want memory at all costs.

Linus

2009-06-25 20:19:22

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, 25 Jun 2009, Theodore Tso wrote:

> On Thu, Jun 25, 2009 at 03:38:06PM -0400, Theodore Tso wrote:
> > Hmm, is there a reason to avoid using GFP_ATOMIC on the first
> > allocation, and only adding GFP_ATOMIC after the first failure?
>
> Never mind, stupid question; I hit the send button before thinking
> about this enough. Obviously we should try without GFP_ATOMIC so the
> allocator can try to release some memory.

The allocator can't actually release much memory itself; it must rely on
pdflush to do writeback, and the slab shrinkers are mostly all no-ops for
~__GFP_FS. The success of pdflush's freeing will depend on the caller's
context.

> So maybe the answer for
> filesystem code where the alternative to allocator failure is
> remounting the root filesystem read-only or panic(), should be:
>
> 1) Try to do the allocation GFP_NOFS.
>
> 2) Then try GFP_ATOMIC
>
> 3) Then retry the allocator with GFP_NOFS in a loop (possibly with a
> timeout than then panic's the system and allows the system to reboot,
> although arguably a watchdot timer should really perform that
> function).
>

This is similar to how __getblk() will repeatedly loop until it gets
sufficient memory to create buffers for the block page, which also relies
heavily on pdflush. If the GFP_ATOMIC allocation failed, then it's
unlikely that the subsequent GFP_NOFS allocation will succeed any time
soon without the oom killer, which we're not allowed to call, so it would
probably be better to loop in step #2 with congestion_wait().

> Obviously if we can rework the filesystem code to avoid this as much
> as possible, this would be desirable, but if there are some cases left
> over where we really have no choice, that's probably what we should
> do.
>

Isn't there also a problem in jbd2_journal_write_metadata_buffer(),
though?

char *tmp;

jbd_unlock_bh_state(bh_in);
tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in->b_frozen_data) {
jbd2_free(tmp, bh_in->b_size);
goto repeat;
}

jh_in->b_frozen_data = tmp;
mapped_data = kmap_atomic(new_page, KM_USER0);
memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);

jbd2_alloc() is just a wrapper to __get_free_pages() and if it fails, it
appears as though the memcpy() would cause a NULL pointer.

2009-06-25 20:23:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Thu, 25 Jun 2009, Linus Torvalds wrote:
>
> Whether these are important to you or not, I dunno. I actually suspect
> that we might want a combination of "high priority + allow memory
> freeing", which would be
>
> #define GFP_CRITICAL (__GFP_HIGH | __GFP_WAIT)

Actually, that doesn't work quite the way I intended.

The current page allocator screws up, and doesn't allow us to do this
(well, you _can_ combine the flags, but they don't mean what they mean on
their own). If you have the WAIT flag set, the page allocator will not set
the ALLOC_HARDER bit, so it turns out that GFP_ATOMIC (__GFP_HIGH on its
own) sometimes actually allows more allocations than the above
GFP_CRITICAL would.

It might make more sense to make a __GFP_WAIT allocation set the
ALLOC_HARDER bit _if_ it repeats. The problem with doing a loop of
allocations outside of the page allocator is that you then miss the
subtlety of "try increasingly harder" that the page allocator internally
does (well, right now, the "increasingly harder" only exists for the
try-to-free path, but we could certainly have it for the try-to-allocate
side too).

Linus

2009-06-25 20:36:45

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, 25 Jun 2009, Linus Torvalds wrote:

> It might make more sense to make a __GFP_WAIT allocation set the
> ALLOC_HARDER bit _if_ it repeats.

This would make sense, but only for !__GFP_FS, since otherwise the oom
killer will free some memory on an allowed node when reclaim fails and we
don't otherwise want to deplete memory reserves.

2009-06-25 20:38:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 01:18:59PM -0700, David Rientjes wrote:
> Isn't there also a problem in jbd2_journal_write_metadata_buffer(),
> though?
>
> tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
...
> memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
>
> jbd2_alloc() is just a wrapper to __get_free_pages() and if it fails, it
> appears as though the memcpy() would cause a NULL pointer.

Nicely spotted. Yeah, that's a bug; we need to do something about
that one, too. And what we're doing is a bit silly; it may make sense
to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
otherwise we should be using a sub-page allocator. Right now, we're
chewing up a 16k PPC page for every 4k filesystem metadata page
allocated in journal_write_metadata_buffer(), and on x86, for the
(admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
page for a 1k block buffer.
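
A sketch of what a sub-page scheme could look like (hypothetical names,
only for power-of-two block sizes below the page size; not a proposal
for the real jbd2 fix):

static struct kmem_cache *jbd2_buf_cache[3];	/* 1k, 2k, 4k objects */

/* caches would be set up once at init time, e.g.
 *   jbd2_buf_cache[0] = kmem_cache_create("jbd2_1k", 1024, 0, 0, NULL);
 * and likewise for 2k and 4k */

static void *jbd2_alloc_buf(size_t size, gfp_t flags)
{
	if (size >= PAGE_SIZE)
		return (void *)__get_free_pages(flags, get_order(size));
	return kmem_cache_alloc(jbd2_buf_cache[ilog2(size) - 10], flags);
}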


Both of these problems exist for both ext3 and ext4.

- Ted

2009-06-25 20:52:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Thu, 25 Jun 2009, David Rientjes wrote:
>
> On Thu, 25 Jun 2009, Linus Torvalds wrote:
>
> > It might make more sense to make a __GFP_WAIT allocation set the
> > ALLOC_HARDER bit _if_ it repeats.
>
> This would make sense, but only for !__GFP_FS, since otherwise the oom
> killer will free some memory on an allowed node when reclaim fails and we
> don't otherwise want to deplete memory reserves.

So the reason I tend to like the kind of "incrementally try harder"
approaches is two-fold:

- it works well for balancing different choices against each other (like
on the freeing path, trying to see which kind of memory is most easily
freed by trying them all first in a "don't try very hard" mode)

- it's great for forcing _everybody_ to do part of the work (ie when some
new thread comes in and tries to allocate, the new thread starts off
with the lower priority, and as such won't steal a page that an older
allocator just freed)

And I think that second case is true even for the oom killer case, and
even for __GFP_FS.

So if people worry about oom, I would suggest that we should not think so
hard about the GFP_NOFAIL cases (which are relatively small and rare), or
about things like the above "try harder when repeating" model, but instead
think about what actually happens during oom: the most common allocations
will remain the page allocations for user faults and/or page cache. In
fact, they get *more* common as you near an OOM situation, because you get
into the whole swap/filemap thrashing situation where you have to re-read
the same pages over and over again.

So don't worry about NOFS. Instead, look at what GFP_USER and GFP_HIGHUSER
do. They set the __GFP_HARDWALL bit, and they _always_ check the end
result and fail gracefully and quickly when the allocation fails.

End result? Realistically, I suspect the _best_ thing we can do is to just
couple that bit with "we're out of memory", and just do something like

if (!did_some_progress && (gfp_flags & __GFP_HARDWALL))
goto nopage;

rather than anything else. And I suspect that if we do this, we can then
afford to retry very aggressively for the allocation cases that aren't
GFP_USER - and that may well be needed in order to make progress.

Linus

2009-06-25 21:08:13

by Joel Becker

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 04:37:43PM -0400, Theodore Tso wrote:
> Nicely spotted. Yeah, that's a bug; we need to do something about
> that one, too. And what we're doing is a bit silly; it may make sense
> to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
> otherwise we should be using a sub-page allocator. Right now, we're
> chewing up a 16k PPC page for every 4k filesystem metadata page
> allocated in journal_write_metadata_buffer(), and on x86, for the
> (admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
> page for a 1k block buffer.
>
>
> Both of these problems exist for both ext3 and ext4.

And ocfs2.

Joel

--

Life's Little Instruction Book #337

"Reread your favorite book."

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2009-06-25 21:26:58

by Andreas Dilger

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Jun 25, 2009 16:37 -0400, Theodore Ts'o wrote:
> On Thu, Jun 25, 2009 at 01:18:59PM -0700, David Rientjes wrote:
> > Isn't there also a problem in jbd2_journal_write_metadata_buffer(),
> > though?
> >
> > tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
> ...
> > memcpy(tmp, mapped_data + new_offset, jh2bh(jh_in)->b_size);
> >
> > jbd2_alloc() is just a wrapper to __get_free_pages() and if it fails, it
> > appears as though the memcpy() would cause a NULL pointer.
>
> Nicely spotted. Yeah, that's a bug; we need to do something about
> that one, too.

IIRC, in the past, jbd_alloc() had a retry mechanism that would loop
indefinitely for some allocations, because they couldn't be aborted
easily. This was removed for some reason, I'm not sure why.

> And what we're doing is a bit silly; it may make sense
> to use __get_free_pages if filesystem blocksize == PAGE_SIZE, but
> otherwise we should be using a sub-page allocator. Right now, we're
> chewing up a 16k PPC page for every 4k filesystem metadata page
> allocated in journal_write_metadata_buffer(), and on x86, for the
> (admittedly uncommon) 1k block filesystem, we'd be chewing up a 4k
> page for a 1k block buffer.

IIRC there was also a good reason for this in the past, related to
the buffers being submitted to the block device layer, and if they
were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
similar enabled the buffer would be misaligned and cause grief.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2009-06-25 22:05:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 11:26:28PM +0200, Andreas Dilger wrote:
> IIRC there was also a good reason for this in the past, related to
> the buffers being submitted to the block device layer, and if they
> were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
> similar enabled the buffer would be misaligned and cause grief.

So what does SLAB/SLUB/SLOB do if we create a slab cache which is a
power of two? Can one of the allocators still return misaligned
blocks of memory in some circumstances?

- Ted

2009-06-25 22:12:30

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Theodore Tso wrote:
> On Thu, Jun 25, 2009 at 11:26:28PM +0200, Andreas Dilger wrote:
>> IIRC there was also a good reason for this in the past, related to
>> the buffers being submitted to the block device layer, and if they
>> were allocated from the slab cache with CONFIG_DEBUG_SLAB or something
>> similar enabled the buffer would be misaligned and cause grief.
>
> So what does SLAB/SLUB/SLOB do if we create a slab cache which is a
> power of two? Can one of the allocators still return misaligned
> blocks of memory in some circumstances?
>
> - Ted

ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
with SLUB + slub debug, that gave back non-aligned memory, causing
eventual corruption ...

-Eric

2009-06-25 22:25:56

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, 25 Jun 2009, Linus Torvalds wrote:

> So the reason I tend to like the kind of "incrementally try harder"
> approaches is two-fold:
>
> - it works well for balancing different choices against each other (like
> on the freeing path, trying to see which kind of memory is most easily
> freed by trying them all first in a "don't try very hard" mode)
>

If we passed alloc_flags into direct reclaim, we'd be able to know that
priority in increasing severity: ALLOC_HARDER, ALLOC_HIGH,
ALLOC_NO_WATERMARKS. We already know if reclaim should be targeted only
to the set of allowable nodes with the gfp_mask, so ALLOC_CPUSET is
already implicitly handled.

The problem with such an approach is that if we were to, as suggested, set
ALLOC_HARDER for __GFP_WAIT when looping, we'd soon deplete memory
reserves for potentially long-lived allocations and each zone's min
watermark would simply become min / 2.

Setting ALLOC_HARDER when __GFP_WAIT loops would imply that we'd also
defer the oom killer until we've tried with the lower watermark and that
would result in less memory being available for the TIF_MEMDIE thread when
subsequent allocations still fail.

> - it's great for forcing _everybody_ to do part of the work (ie when some
> new thread comes in and tries to allocate, the new thread starts off
> with the lower priority, and as such won't steal a page that an older
> allocator just freed)
>

So you're seeing the space between the zone's min watermark, min, and min
/ 4 (ALLOC_HARDER) as the pool of memory available to older allocators
that have already done reclaim? It wouldn't necessarily prevent rt tasks
from stealing the memory freed by the older allocators, but would stop all
others.

> And I think that second case is true even for the oom killer case, and
> even for __GFP_FS.
>

The oom killer is serialized only on the zonelist and right before it's
called, we try an allocation with the high watermark to catch parallel oom
killings. The oom killer's heuristic for killing preference is one that
should satisfy all concurrent page allocations on that zonelist, so we
probably don't need to worry about it.

> So if people worry about oom, I would suggest that we should not think so
> hard about the GFP_NOFAIL cases (which are relatively small and rare),

If __GFP_NOFAIL is used in the reclaim path, it will loop forever for
~__GFP_NOMEMALLOC allocations (and the same goes for oom killed tasks).

> or
> about things like the above "try harder" when repeating model, but instead
> think about what actually happens during oom: the most common allocations
> will remain to the page allocations for user faults and/or page cache. In
> fact, they get *more* common as you near OOM situation, because you get
> into the whole swap/filemap thrashing situation where you have to re-read
> the same pages over and over again.
>
> So don't worry about NOFS. Instead, look at what GFP_USER and GFP_HIGHUSER
> do. They set the __GFP_HARDWALL bit, and they _always_ check the end
> result and fail gracefully and quickly when the allocation fails.
>

Yeah, and the oom killer prefers to kill tasks that share a set of
allowable __GFP_HARDWALL nodes with current.

> End result? Realistically, I suspect the _best_ thing we can do is to just
> couple that bit with "we're out of memory", and just do something like
>
> if (!did_some_progress && (gfp_flags & __GFP_HARDWALL))
> goto nopage;
>
> rather than anything else. And I suspect that if we do this, we can then
> afford to retry very aggressively for the allocation cases that aren't
> GFP_USER - and that may well be needed in order to make progress.
>

Agreed.

2009-06-26 01:12:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
>
> ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> with SLUB + slub debug, that gave back non-aligned memory, causing
> eventual corruption ...

Grumble. Any chance we could add a kmem_cache option which requires
the memory to be aligned? Otherwise we could rewrite our own sub-page
allocator in ext4 that only handled aligned filesystem block sizes
(i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
code that really should be done once as core functionality.

- Ted

2009-06-26 05:16:25

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Hi Ted,

On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > with SLUB + slub debug, that gave back non-aligned memory, causing
> > eventual corruption ...

On Thu, 25 Jun 2009, Theodore Tso wrote:
> Grumble. Any chance we could add a kmem_cache option which requires
> the memory to be aligned? Otherwise we could rewrite our own sub-page
> allocator in ext4 that only handled aligned filesystem block sizes
> (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> code that really should be done once as core functionality.

We already have SLAB_HWCACHE_ALIGN but I wonder if this is a plain old bug in
SLUB. Christoph, Nick, don't we need something like this in the allocator?
Eric, does this fix your case?

Pekka

diff --git a/mm/slub.c b/mm/slub.c
index 819f056..7cd1e69 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2400,7 +2400,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
* user specified and the dynamic determination of cache line size
* on bootup.
*/
- align = calculate_alignment(flags, align, s->objsize);
+ align = calculate_alignment(flags, align, size);

/*
* SLUB stores one object immediately after another beginning from

2009-06-26 08:51:37

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Thu, Jun 25, 2009 at 01:22:25PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 25 Jun 2009, Linus Torvalds wrote:
> >
> > Whether these are important to you or not, I dunno. I actually suspect
> > that we might want a combination of "high priority + allow memory
> > freeing", which would be
> >
> > #define GFP_CRITICAL (__GFP_HIGH | __GFP_WAIT)
>
> Actually, that doesn't work quite the way I intended.
>
> The current page allocator screws up, and doesn't allow us to do this

No I think it works OK. As designed, at least (whether or not
you agree with the design :P)


> (well, you _can_ combine the flags, but they don't mean what they mean on
> their own). If you have the WAIT flag set, the page allocator will not set
> the ALLOC_HARDER bit,

But it does use ALLOC_HIGH.


> so it turns out that GFP_ATOMIC (__GFP_HIGH on its
> own) sometimes actually allows more allocations than the above
> GFP_CRITICAL would.

Yes. The intention was to allow allocations from sleepable/reclaimable
context but not allow them to use up all the reserves allowed for
atomic context (so you don't suddenly get bursts of atomic allocation
failure warnings on your network card when under heavy disk IO, for
example).


> It might make more sense to make a __GFP_WAIT allocation set the
> ALLOC_HARDER bit _if_ it repeats. The problem with doing a loop of
> allocations outside of the page allocator is that you then miss the
> subtlety of "try increasingly harder" that the page allocator internally
> does (well, right now, the "increasingly harder" only exists for the
> try-to-free path, but we could certainly have it for the try-to-allocate
> side too)

2009-06-26 08:56:36

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> Hi Ted,
>
> On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > eventual corruption ...
>
> On Thu, 25 Jun 2009, Theodore Tso wrote:
> > Grumble. Any chance we could add a kmem_cache option which requires
> > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > allocator in ext4 that only handled aligned filesystem block sizes
> > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > code that really should be done once as core functionality.
>
> We already have SLAB_HWCACHE_ALIGN but I wonder if this is a plain old bug in
> SLUB. Christoph, Nick, don't we need something like this in the allocator?
> Eric, does this fix your case?

Well, I don't understand Ted's complaint. kmem_cache_create takes an
alignment argument.
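
For instance (a hypothetical cache for 1k journal buffers, with the
third argument requesting 1k alignment):

	struct kmem_cache *jbd_1k_cache;

	jbd_1k_cache = kmem_cache_create("jbd_1k", 1024, 1024, 0, NULL);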

2009-06-26 08:58:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 2009-06-26 at 10:56 +0200, Nick Piggin wrote:
> On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> > Hi Ted,
> >
> > On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > > eventual corruption ...
> >
> > On Thu, 25 Jun 2009, Theodore Tso wrote:
> > > Grumble. Any chance we could add a kmem_cache option which requires
> > > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > > allocator in ext4 that only handled aligned filesystem block sizes
> > > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > > code that really should be done once as core functionality.
> >
> > We already have SLAB_HWCACHE_ALIGN but I wonder if this is a plain old bug in
> > SLUB. Christoph, Nick, don't we need something like this in the allocator?
> > Eric, does this fix your case?
>
> Well I don't understand Ted's complaint. kmem_cache_create takes an
> alignment argument.

Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
my patch.

Pekka

2009-06-26 09:07:49

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, Jun 26, 2009 at 11:58:01AM +0300, Pekka Enberg wrote:
> On Fri, 2009-06-26 at 10:56 +0200, Nick Piggin wrote:
> > On Fri, Jun 26, 2009 at 08:16:18AM +0300, Pekka Enberg wrote:
> > > Hi Ted,
> > >
> > > On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
> > > > > ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
> > > > > with SLUB + slub debug, that gave back non-aligned memory, causing
> > > > > eventual corruption ...
> > >
> > > On Thu, 25 Jun 2009, Theodore Tso wrote:
> > > > Grumble. Any chance we could add a kmem_cache option which requires
> > > > the memory to be aligned? Otherwise we could rewrite our own sub-page
> > > > allocator in ext4 that only handled aligned filesystem block sizes
> > > > (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
> > > > code that really should be done once as core functionality.
> > >
> > > We already have SLAB_HWCACHE_ALIGN but I wonder if this is a plain old bug in
> > > SLUB. Christoph, Nick, don't we need something like this in the allocator?
> > > Eric, does this fix your case?
> >
> > Well I don't understand Ted's complaint. kmem_cache_create takes an
> > alignment argument.
>
> Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> my patch.

Well, your patch doesn't hurt (it doesn't seem to make much sense to
use objsize rather than size when calculating alignment). But I
think alignment should still be calculated OK: calculate_alignment
will always return >= align, and so the next statement will
round up size to the correct alignment AFAICT.
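
That is, in calculate_sizes() the sequence is roughly (paraphrasing the
2.6.30-era slub.c from memory):

	align = calculate_alignment(flags, align, s->objsize);
	...
	size = ALIGN(size, align);	/* keeps back-to-back objects aligned */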

2009-06-26 14:44:06

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Pekka J Enberg wrote:
> Hi Ted,
>
> On Thu, Jun 25, 2009 at 05:11:25PM -0500, Eric Sandeen wrote:
>>> ecryptfs used to do kmalloc(PAGE_CACHE_SIZE) & virt_to_page on that, and
>>> with SLUB + slub debug, that gave back non-aligned memory, causing
>>> eventual corruption ...
>
> On Thu, 25 Jun 2009, Theodore Tso wrote:
>> Grumble. Any chance we could add a kmem_cache option which requires
>> the memory to be aligned? Otherwise we could rewrite our own sub-page
>> allocator in ext4 that only handled aligned filesystem block sizes
>> (i.e., 1k, 2k, 4k, etc.) but that would be really silly and be extra
>> code that really should be done once as core functionality.
>
> We already have SLAB_HWCACHE_ALIGN but I wonder if this is a plain old bug in
> SLUB. Christoph, Nick, don't we need something like this in the allocator?
> Eric, does this fix your case?

I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
fixed it ;)

Thanks!
-Eric

> Pekka
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 819f056..7cd1e69 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2400,7 +2400,7 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
> * user specified and the dynamic determination of cache line size
> * on bootup.
> */
> - align = calculate_alignment(flags, align, s->objsize);
> + align = calculate_alignment(flags, align, size);
>
> /*
> * SLUB stores one object immediately after another beginning from

2009-06-28 10:45:39

by Pavel Machek

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed 2009-06-24 12:46:02, Linus Torvalds wrote:
>
>
> On Wed, 24 Jun 2009, Andrew Morton wrote:
>
> > On Wed, 24 Jun 2009 12:16:20 -0700 (PDT)
> > Linus Torvalds <[email protected]> wrote:
> >
> > > Lookie here. This is 2.6.0:mm/page_alloc.c:
> > >
> > > do_retry = 0;
> > > if (!(gfp_mask & __GFP_NORETRY)) {
> > > if ((order <= 3) || (gfp_mask & __GFP_REPEAT))
> > > do_retry = 1;
> > > if (gfp_mask & __GFP_NOFAIL)
> > > do_retry = 1;
> > > }
> > > if (do_retry) {
> > > blk_congestion_wait(WRITE, HZ/50);
> > > goto rebalance;
> > > }
> >
> > rebalance:
> > if ((p->flags & (PF_MEMALLOC | PF_MEMDIE)) && !in_interrupt()) {
> > /* go through the zonelist yet again, ignoring mins */
> > for (i = 0; zones[i] != NULL; i++) {
> > struct zone *z = zones[i];
> >
> > page = buffered_rmqueue(z, order, cold);
> > if (page)
> > goto got_pg;
> > }
> > goto nopage;
> > }
>
> Your point?
>
> That's the recursive allocation or oom case. Not the normal case at all.
>
> The _normal_ case is to do the whole "try_to_free_pages()" case and try
> and try again. Forever.
>
> IOW, we have traditionally never failed small kernel allocations. It makes
> perfect sense that people _depend_ on that.

Ok, so we should re-add that 4MB buffer to suspend, so that
allocations work even during that, right?

...and... if you do enough small allocations, they *will* have to
fail at some point. Maybe Linux is now mature enough, and running on
small enough devices, that it makes sense to start handling that?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-28 18:02:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist


On Sun, 28 Jun 2009, Pavel Machek wrote:
>
> Ok, so we should re-add that 4MB buffer to suspend, so that
> allocations work even during that, right?

Pavel, you really are a one-trick pony, aren't you?

Give it up. Return to your pet worry when there are any actual reports. As
you have been told several times.

The _other_ part of memory management that you and Andrew seem to be
ignoring is that it's very robust, and keeps extra memory around, and just
generally does the right thing. We don't generally pre-allocate anything,
because we don't need to.

Almost the _only_ way to run out of memory is to have tons and tons of
dirty pages around. Yes, it can happen. But if it happens, you're almost
guaranteed to be screwed anyway. The whole VM is designed around the
notion that most of memory is just clean caches, and it's designed around
that simply because if it's not true, the VM freedom is so small that
there's not a lot a VM can reasonably do.

Linus

2009-06-28 18:26:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Sun, 28 Jun 2009 11:01:42 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

> The _other_ part of memory management that you and Andrew seem to be
> ignoring is that it's very robust, and keeps extra memory around, and
> just generally does the right thing. We don't generally pre-allocate
> anything, because we don't need to.

... and if there is a real concern, the real solution is to have a way
to request that the VM temporarily increase the amount of extra memory,
preferably before you go down a path of hardship, so that the VM can
do some work to satisfy the request.

There are a few places where having (small) local reserves makes sense,
of course, mostly in the write-out paths like the block layer. And we do
that already.

>
> Almost the _only_ way to run out of memory is to have tons and tons
> of dirty pages around. Yes, it can happen. But if it happens, you're
> almost guaranteed to be screwed anyway. The whole VM is designed

my impression is that when the strict dirty accounting code went in,
this problem largely went away.

The other problem is pinned memory, but this was mostly a
lowmem/highmem thing, and time (and 64 bit) seems to have mostly solved
the more common cases of those.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-06-28 18:37:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist



On Sun, 28 Jun 2009, Arjan van de Ven wrote:
> >
> > Almost the _only_ way to run out of memory is to have tons and tons
> > of dirty pages around. Yes, it can happen. But if it happens, you're
> > almost guaranteed to be screwed anyway. The whole VM is designed
>
> my impression is that when the strict dirty accounting code went in,
> this problem largely went away.

Yes and no.

It's still pretty easy to have lots of dirty anonymous memory, no
swap-space, and just run out of memory. If you do that, you're screwed.
The oom killer may or may not help, but even if it works, it's probably
going to work only after things have gotten pretty painful.

Also, we'll have to be pretty careful when/if we actually use that
"gfp_allowed_mask" for suspend/resume/hibernate. The SLAB_GFP_BOOT_MASK
has the __GFP_WAIT bit cleared, for example, which means that the VM won't
try to free even trivially freeable memory.

So for hibernate, if/when we use this, we should make sure to still allow
__GFP_WAIT (except, perhaps, during the stage when interrupts are actually
disabled), but clear __GFP_IO and __GFP_FS.

Or whatever. The devil is in the details.

Linus

2009-06-29 15:30:43

by Mel Gorman

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Wed, Jun 24, 2009 at 02:56:15PM -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
> >
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > >
> > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > to handle that.
> >
> > I actually disagree. I think we should just admit that we can always free
> > up enough space to get a few pages, in order to then oom-kill things.
>
> I'm unclear on precisely what you're proposing here?
>

As order <= PAGE_ALLOC_COSTLY_ORDER implies __GFP_NOFAIL, perhaps it
makes sense to change the check to

WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER)

? The temptation might be there to remove __GFP_NOFAIL for smaller orders, but
it makes sense to keep it available in case CONFIG_FAULT_INJECTION_DEBUG_FS
is set and is randomly failing allocations whose failure would have serious
consequences even if handled.

> > This is not a new concept. oom has never been "immediately kill".
>
> Well, it has been immediate for a long time. A couple of reasons which
> I can recall:
>
> - A page-allocating process will oom-kill another process in the
> expectation that the killing will free up some memory. If the
> oom-killed process remains stuck in the page allocator, that doesn't
> work.
>
> - The oom-killed process might be holding locks (typically fs locks).
> This can cause an arbitrary number of other processes to be blocked.
> So to get the system unstuck we need the oom-killed process to
> immediately exit the page allocator, to handle the NULL return and to
> drop those locks.
>
> There may be other reasons - it was all a long time ago, and I've never
> personally hacked on the oom-killer much and I never get oom-killed.
> But given the amount of development work which goes on in there, some
> people must be getting massacred.
>
>
> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer. It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller. I never really understood what problem/thinking led Andrea to
> do that.
>
>
> But it's all a bit moot at present, as we seem to have removed the
> return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges. I think
> that was an accident:
>
> - /* This allocation should allow future memory freeing. */
> -
> rebalance:
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> - /* go through the zonelist yet again, ignoring mins */
> - page = get_page_from_freelist(gfp_mask, nodemask, order,
> - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> - if (page)
> - goto got_pg;
> - if (gfp_mask & __GFP_NOFAIL) {
> - congestion_wait(WRITE, HZ/50);
> - goto nofail_alloc;
> - }
> - }
> - goto nopage;
> + /* Allocate without watermarks if the context allows */
> + if (alloc_flags & ALLOC_NO_WATERMARKS) {
> + page = __alloc_pages_high_priority(gfp_mask, order,
> + zonelist, high_zoneidx, nodemask,
> + preferred_zone, migratetype);
> + if (page)
> + goto got_pg;
> }
>
> Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
> the TIF_MEMDIE case.
>
> Mel is having a bit of downtime at present.

I'm getting back online now and playing catch-up. You're right in that
TIF_MEMDIE returning NULL has been broken and it's possible in theory for an
OOM-killed process to loop forever. But maybe TIF_MEMDIE looping potentially
forever is expected in the case where __GFP_NOFAIL is specified. Fixing this
to allow an OOM-killed process to exit does mean that callers using
__GFP_NOFAIL must still handle NULL being returned, which might be very
unexpected.

In 2.6.30, TIF_MEMDIE could cause a request to exit without ever entering
direct reclaim but chances are this didn't happen as it would have been
looping during an OOM-kill. To duplicate this, a check for TIF_MEMDIE would
happen after

/* Avoid recursion of direct reclaim */
if (p->flags & PF_MEMALLOC)
goto nopage;

But as failing __GFP_NOFAIL is potentially serious, even for processes that
have been OOM killed, I think it makes more sense to check for TIF_MEMDIE
after direct reclaim and OOM killing have already been considered as options,
with a patch such as the following:

==== CUT HERE ====
page-allocator: Ensure that processes that have been OOM killed exit the page allocator

Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
process such as this is expected to exit the page allocator but in the
event it happens to have set __GFP_NOFAIL, it potentially loops forever.

This patch checks TIF_MEMDIE when deciding whether to loop again in the
page allocator. Such a process will now return NULL after direct reclaim
and OOM killing have both been considered as options. The potential
problem is that a __GFP_NOFAIL allocation can still return failure so
callers must still handle getting returned NULL.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d714f8..8449cf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1539,6 +1539,10 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
return 0;

+ /* Do not loop if this process has been OOM-killed */
+ if (test_thread_flag(TIF_MEMDIE))
+ return 0;
+
/*
* In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
* means __GFP_NOFAIL, but that may not be true in other
@@ -1823,6 +1827,10 @@ rebalance:
!(gfp_mask & __GFP_NOFAIL))
goto nopage;

+ /* Do not loop if this process has been OOM-killed */
+ if (test_thread_flag(TIF_MEMDIE))
+ goto nopage;
+
goto restart;
}
}

2009-06-29 19:22:05

by Andrew Morton

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, 29 Jun 2009 16:30:07 +0100
Mel Gorman <[email protected]> wrote:

> Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
> process such as this is expected to exit the page allocator but in the
> event it happens to have set __GFP_NOFAIL, it potentially loops forever.
>
> This patch checks TIF_MEMDIE when deciding whether to loop again in the
> page allocator. Such a process will now return NULL after direct reclaim
> and OOM killing have both been considered as options. The potential
> problem is that a __GFP_NOFAIL allocation can still return failure so
> callers must still handle getting returned NULL.

I don't think we should do this :(

The __GFP_NOFAIL callers are using __GFP_NOFAIL for a reason - they
just cannot handle an allocation failure at all. They won't even test
for a NULL return because a) they believe that __GFP_NOFAIL is magic and
b) if the allocation failed, they're screwed anyway.

So how feasible would it be to arrange for __GFP_NOFAIL callers to
ignore the oom-killing? Presumably this means that they'll need to kill
someone else and keep on trying?

2009-06-29 21:07:14

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 26 Jun 2009, Pekka Enberg wrote:

> Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> my patch.

The size parameter to calculate_alignment has no relevance unless the size
is smaller than half of the alignment. Then for some reason we do not
properly align the stuff. I never liked that....


2009-06-29 21:16:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Fri, 26 Jun 2009, Eric Sandeen wrote:

> I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
> fixed it ;)

Isn't the problem that the slab allocator did not align something that
you did not request to be aligned?

2009-06-29 21:21:19

by Eric Sandeen

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

Christoph Lameter wrote:
> On Fri, 26 Jun 2009, Eric Sandeen wrote:
>
>> I'll test it; it'd be great if it did, I'm um, a bit ashamed at how I
>> fixed it ;)
>
> Isn't the problem that the slab allocator did not align something that
> you did not request to be aligned?

Well, maybe that's all I needed. At the risk of exposing myself to
ridicule, here's the commit:

7fcba054373d5dfc43d26e243a5c9b92069972ee

Was there a better way?

-Eric

2009-06-29 22:36:01

by Christoph Lameter

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, 29 Jun 2009, Eric Sandeen wrote:

> > Isn't the problem that the slab allocator did not align something that
> > you did not request to be aligned?
>
> Well, maybe that's all I needed. At the risk of exposing myself to
> ridicule, here's the commit:
>
> 7fcba054373d5dfc43d26e243a5c9b92069972ee
>
> Was there a better way?

No. Good patch. Indeed the right way. Page allocations come from the page
allocator, not from the slab allocator (yes, yes, there are still issues with
the performance of the page allocator that sometimes dictate the use of
the slabs, but in general I think this is what is needed).

2009-06-29 23:35:38

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, 29 Jun 2009, Mel Gorman wrote:

> page-allocator: Ensure that processes that have been OOM killed exit the page allocator
>
> Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
> process such as this is expected to exit the page allocator but in the
> event it happens to have set __GFP_NOFAIL, it potentially loops forever.
>

That's not the expected behavior for TIF_MEMDIE, although your patch
certainly changes that.

Your patch is simply doing

if (test_thread_flag(TIF_MEMDIE))
gfp_mask |= __GFP_NORETRY;

in the slowpath.

TIF_MEMDIE is supposed to allow allocations to succeed, not automatically
fail, so that it can quickly handle its SIGKILL without getting blocked in
the exit path seeking more memory.

> This patch checks TIF_MEMDIE when deciding whether to loop again in the
> page allocator. Such a process will now return NULL after direct reclaim
> and OOM killing have both been considered as options. The potential
> problem is that a __GFP_NOFAIL allocation can still return failure so
> callers must still handle getting returned NULL.
>

All __GFP_NOFAIL allocations should ensure that alloc_pages() never
returns NULL. Although it's unfortunate, that's the requirement that
callers have been guaranteed and until they are fixed, the page allocator
should respect it.

I disagree with this change because it unconditionally fails allocations
when a task has been oom killed, a scenario which should be the _highest_
priority for allocations to succeed since it leads to future memory
freeing.

Additionally, this will fail all GFP_ATOMIC allocations for oom killed
tasks if allocating without watermarks fails, even though pdflush may
concurrently be doing writeback or other allocation attempts may be invoking
direct reclaim.

2009-06-30 07:35:26

by Pavel Machek

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Sun 2009-06-28 11:01:42, Linus Torvalds wrote:
>
> On Sun, 28 Jun 2009, Pavel Machek wrote:
> >
> > Ok, so we should re-add that 4MB buffer to suspend, so that
> > allocations work even during that, right?
>
> Pavel, you really are a one-trick pony, aren't you?
>
> Give it up. Return to your pet worry when there are any actual reports. As
> you have been told several times.

How do you report something that results in a black screen during
suspend in 1 out of 100 attempts?

> The _other_ part of memory management that you and Andrew seem to be
> ignoring is that it's very robust, and keeps extra memory around, and just
> generally does the right thing. We don't generally pre-allocate anything,
> because we don't need to.
>
> Almost the _only_ way to run out of memory is to have tons and tons of
> dirty pages around. Yes, it can happen. But if it happens, you're almost
> guaranteed to be screwed anyway. The whole VM is designed around the
> notion that most of memory is just clean caches, and it's designed around
> that simply because if it's not true, the VM freedom is so small that
> there's not a lot a VM can reasonably do.

Well, or you can have a machine with not nearly enough memory (like a 16MB
system), or a huge 32-bit highmem system (32GB).

But it is true that I have not seen an OOM for quite a long time.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-30 07:47:25

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, Jun 29, 2009 at 04:35:26PM -0700, David Rientjes wrote:
> On Mon, 29 Jun 2009, Mel Gorman wrote:
>
> > page-allocator: Ensure that processes that have been OOM killed exit the page allocator
> >
> > Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
> > process such as this is expected to exit the page allocator but in the
> > event it happens to have set __GFP_NOFAIL, it potentially loops forever.
> >
>
> That's not the expected behavior for TIF_MEMDIE, although your patch
> certainly changes that.
>
> Your patch is simply doing
>
> if (test_thread_flag(TIF_MEMDIE))
> gfp_mask |= __GFP_NORETRY;
>
> in the slowpath.
>
> TIF_MEMDIE is supposed to allow allocations to succeed, not automatically
> fail, so that it can quickly handle its SIGKILL without getting blocked in
> the exit path seeking more memory.

Yes, it needs to just ignore all watermarks, skip reclaim (we've
already decided reclaim will not work at this point), and return a
page if we have one, otherwise NULL (unless __GFP_NOFAIL is set).
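
In code, the behaviour described above would look something like the sketch
below in the allocator slowpath (the names and call shape follow the 2.6.31
allocator, but this is only an illustration of the idea, not a tested patch):

	/* Sketch only: an oom-killed task skips reclaim and the oom killer,
	 * ignores watermarks, and takes whatever page is left. It only
	 * fails (returns NULL) if no page exists and __GFP_NOFAIL is clear. */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOMEMALLOC)) {
		page = get_page_from_freelist(gfp_mask, nodemask, order,
				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
				preferred_zone, migratetype);
		if (page || !(gfp_mask & __GFP_NOFAIL))
			goto got_pg;	/* page may legitimately be NULL */
	}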


> > This patch checks TIF_MEMDIE when deciding whether to loop again in the
> > page allocator. Such a process will now return NULL after direct reclaim
> > and OOM killing have both been considered as options. The potential
> > problem is that a __GFP_NOFAIL allocation can still return failure so
> > callers must still handle getting returned NULL.
> >
>
> All __GFP_NOFAIL allocations should ensure that alloc_pages() never
> returns NULL. Although it's unfortunate, that's the requirement that
> callers have been guaranteed and until they are fixed, the page allocator
> should respect it.

Yes.

The interesting thing is what to do when we have 0 pages left, we are
TIF_MEMDIE, and __GFP_NOFAIL is set. Looping will most likely just
deadlock the system. Returning NULL will probably oops the caller with
various locks held and then deadlock the system. It really needs to
punt back to the OOM killer so it can select another task. Until
then, maybe a simple panic would be reasonable? (It's *never* going
to hit anyone in practice, I'd say, but if it does then a panic
would be better than a lockup; at least we'd know what the problem was.)
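
A minimal sketch of that last-resort panic, assuming it sits where the
__GFP_NOFAIL retry loop would otherwise spin (again only an illustration,
not a proposed patch):

	/* We are the oom-killed task, the caller cannot cope with failure
	 * (__GFP_NOFAIL) and even the no-watermark attempt found nothing:
	 * looping or returning NULL both end in a silent deadlock, so at
	 * least say why we died. */
	if (!page && (gfp_mask & __GFP_NOFAIL) &&
	    test_thread_flag(TIF_MEMDIE))
		panic("Out of memory: %s tried a __GFP_NOFAIL allocation "
		      "with no pages left\n", current->comm);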


> I disagree with this change because it unconditionally fails allocations
> when a task has been oom killed, a scenario which should be the _highest_
> priority for allocations to succeed since it leads to future memory

That's another interesting point. I do agree with you, because that
would restore previous behaviour which got broken. But I wonder if
possibly it would be a better idea to fail all allocations? That
would a) protect reserves more, and b) probably be quite likely to
exit the syscall *sooner* than if we try to satisfy all allocations.


2009-06-30 07:59:19

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, Jun 29, 2009 at 05:06:37PM -0400, Christoph Lameter wrote:
> On Fri, 26 Jun 2009, Pekka Enberg wrote:
>
> > Yes, but AFAICT, SLUB_DEBUG doesn't respect the given alignment without
> > my patch.
>
> The size parameter to calculate_alignment has no relevance unless the size
> is smaller than half of the alignment. Then for some reason we do not
> properly align the stuff. I never liked that....

It is properly aligned; if it were not, that would be a bug. It
should always align to *at least* the requested alignment.

SLAB_HWCACHE_ALIGN is defined to align the object such that it occupies
the minimal number of cache lines. This is not a subjective thing to
like or dislike, it is simply a fact, because that is how it was implemented
from the start and that is what callers expect.

If you want to minimise false sharing, the normal way would be to
request internode cacheline alignment, right? This is very widely
used.
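
As an illustration of the distinction (the cache and struct here are made
up; only the kmem_cache_create() call shape is real): SLAB_HWCACHE_ALIGN
merely packs objects into as few cache lines as possible, whereas passing an
explicit alignment is what guarantees each object starts on its own cache
line and so avoids false sharing:

#include <linux/slab.h>
#include <linux/cache.h>

/* hypothetical object used only for this example */
struct foo_counter {
	unsigned long value;
};

static struct kmem_cache *foo_cachep;

static int __init foo_cache_init(void)
{
	/* explicit cache_line_size() alignment: one object per line */
	foo_cachep = kmem_cache_create("foo_counter",
				       sizeof(struct foo_counter),
				       cache_line_size(),
				       SLAB_HWCACHE_ALIGN, NULL);
	return foo_cachep ? 0 : -ENOMEM;
}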

2009-06-30 08:13:44

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Nick Piggin wrote:

> > That's not the expected behavior for TIF_MEMDIE, although your patch
> > certainly changes that.
> >
> > Your patch is simply doing
> >
> > if (test_thread_flag(TIF_MEMDIE))
> > gfp_mask |= __GFP_NORETRY;
> >
> > in the slowpath.
> >
> > TIF_MEMDIE is supposed to allow allocations to succeed, not automatically
> > fail, so that it can quickly handle its SIGKILL without getting blocked in
> > the exit path seeking more memory.
>
> Yes, it needs to just ignore all watermarks, skip reclaim (we've
> already decided reclaim will not work at this point), and return a
> page if we have one, otherwise NULL (unless __GFP_NOFAIL is set).
>

Right, there's no sense in looping endlessly for ~__GFP_NOFAIL if
allocations continue to fail for a thread with TIF_MEMDIE set.

TIF_MEMDIE doesn't check any watermarks, as opposed to GFP_ATOMIC, which
only reduces the min watermark by half, so we can access more memory
reserves with TIF_MEMDIE. Instead of immediately failing an oom killed
task's allocation as in Mel's patch, there is a higher likelihood that it
will succeed on the next attempt.
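
The watermark arithmetic being referred to is, roughly, the check in
zone_watermark_ok() (paraphrased from memory, so details may be off):
__GFP_HIGH/GFP_ATOMIC only halves the minimum, ALLOC_HARDER shaves off
another quarter, while ALLOC_NO_WATERMARKS (TIF_MEMDIE, PF_MEMALLOC)
bypasses the check entirely.

	long min = mark;

	if (alloc_flags & ALLOC_HIGH)		/* e.g. GFP_ATOMIC */
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;			/* watermark not met */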

I'd agree with Mel's added check for TIF_MEMDIE upon returning from the
oom killer, but only for __GFP_NOMEMALLOC.

> > All __GFP_NOFAIL allocations should ensure that alloc_pages() never
> > returns NULL. Although it's unfortunate, that's the requirement that
> > callers have been guaranteed and until they are fixed, the page allocator
> > should respect it.
>
> Yes.
>
> The interesting thing is what to do when we have 0 pages left, we are
> TIF_MEMDIE, and __GFP_NOFAIL is set. Looping will most likely just
> deadlock the system. Returning NULL will probably oops the caller with
> various locks held and then deadlock the system. It really needs to
> punt back to the OOM killer so it can select another task. Until
> then, maybe a simple panic would be reasonable? (It's *never* going
> to hit anyone in practice, I'd say, but if it does then a panic
> would be better than a lockup; at least we'd know what the problem was.)
>

The oom killer currently is a no-op if any eligible task has TIF_MEMDIE,
so this would require adding an oom killer timeout so that if a task fails
to exit after a predefined period, TIF_MEMDIE is cleared and the task is
marked to no longer be selected (which would require an addition to
task_struct) although it may have already completely depleted memory
reserves.

The best alternative is just to increase min_free_kbytes to ensure that
adequate memory reserves (and its partial exemptions allowed by
GFP_ATOMIC, ALLOC_HARDER, and PF_MEMALLOC) are sustained for an oom killed
task to exit and that we try hard to avoid getting stuck in
TASK_UNINTERRUPTIBLE.

> > I disagree with this change because it unconditionally fails allocations
> > when a task has been oom killed, a scenario which should be the _highest_
> > priority for allocations to succeed since it leads to future memory
>
> That's another interesting point. I do agree with you, because that
> would restore previous behaviour which got broken. But I wonder if
> possibly it would be a better idea to fail all allocations? That
> would a) protect reserves more, and b) probably be quite likely to
> exit the syscall *sooner* than if we try to satisfy all allocations.
>

You could only fail the single allocation where you triggered the oom
killer and you were the task chosen to die, which is what Mel's patch
implemented in the first half. I agree that would protect the memory
reserves more.

2009-06-30 08:24:32

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, Jun 30, 2009 at 01:13:08AM -0700, David Rientjes wrote:
> On Tue, 30 Jun 2009, Nick Piggin wrote:
>
> > > That's not the expected behavior for TIF_MEMDIE, although your patch
> > > certainly changes that.
> > >
> > > Your patch is simply doing
> > >
> > > if (test_thread_flag(TIF_MEMDIE))
> > > gfp_mask |= __GFP_NORETRY;
> > >
> > > in the slowpath.
> > >
> > > TIF_MEMDIE is supposed to allow allocations to succeed, not automatically
> > > fail, so that it can quickly handle its SIGKILL without getting blocked in
> > > the exit path seeking more memory.
> >
> > Yes, it needs to just ignore all watermarks, skip reclaim (we've
> > already decided reclaim will not work at this point), and return a
> > page if we have one, otherwise NULL (unless __GFP_NOFAIL is set).
> >
>
> Right, there's no sense in looping endlessly for ~__GFP_NOFAIL if
> allocations continue to fail for a thread with TIF_MEMDIE set.
>
> TIF_MEMDIE doesn't check any watermarks, as opposed to GFP_ATOMIC, which
> only reduces the min watermark by half, so we can access more memory
> reserves with TIF_MEMDIE. Instead of immediately failing an oom killed
> task's allocation as in Mel's patch, there is a higher likelihood that it
> will succeed on the next attempt.

Yes. This is how it should have worked prior to Mel's patches, so we
should aim to restore that.


> I'd agree with Mel's added check for TIF_MEMDIE upon returning from the
> oom killer, but only for __GFP_NOMEMALLOC.

NOMEMALLOC indeed should always be kept away from memalloc/memdie
reserves. That's how it should have worked when I added it (but
I may have forgotten TIF_MEMDIE, I can't remember).


> > > All __GFP_NOFAIL allocations should ensure that alloc_pages() never
> > > returns NULL. Although it's unfortunate, that's the requirement that
> > > callers have been guaranteed and until they are fixed, the page allocator
> > > should respect it.
> >
> > Yes.
> >
> > The interesting thing is what to do when we have 0 pages left, we are
> > TIF_MEMDIE, and __GFP_NOFAIL is set. Looping will most likely just
> > deadlock the system. Returning NULL will probably oops the caller with
> > various locks held and then deadlock the system. It really needs to
> > punt back to the OOM killer so it can select another task. Until
> > then, maybe a simple panic would be reasonable? (It's *never* going
> > to hit anyone in practice, I'd say, but if it does then a panic
> > would be better than a lockup; at least we'd know what the problem was.)
> >
>
> The oom killer currently is a no-op if any eligible task has TIF_MEMDIE,
> so this would require adding an oom killer timeout so that if a task fails
> to exit after a predefined period, TIF_MEMDIE is cleared and the task is
> marked to no longer be selected (which would require an addition to
> task_struct) although it may have already completely depleted memory
> reserves.

It wouldn't have to be a timeout, it could be a call back to the
oom killer.


> The best alternative is just to increase min_free_kbytes to ensure that
> adequate memory reserves (and its partial exemptions allowed by
> GFP_ATOMIC, ALLOC_HARDER, and PF_MEMALLOC) are sustained for an oom killed
> task to exit and that we try hard to avoid getting stuck in
> TASK_UNINTERRUPTIBLE.

Well we're discussing what to do when reserves run out and NOFAIL
is set. So increasing min_free_kbytes is not a valid alternative :)
My vote is a simple panic with a clear message.


> > > I disagree with this change because it unconditionally fails allocations
> > > when a task has been oom killed, a scenario which should be the _highest_
> > > priority for allocations to succeed since it leads to future memory
> >
> > That's another interesting point. I do agree with you, because that
> > would restore previous behaviour which got broken. But I wonder if
> > possibly it would be a better idea to fail all allocations? That
> > would a) protect reserves more, and b) probably be quite likely to
> > exit the syscall *sooner* than if we try to satisfy all allocations.
> >
>
> You could only fail the single allocation where you triggered the oom
> killer and you were the task chosen to die, which is what Mel's patch
> implemented in the first half. I agree that would protect the memory
> reserves more.

2009-06-30 08:41:28

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Nick Piggin wrote:

> > I'd agree with Mel's added check for TIF_MEMDIE upon returning from the
> > oom killer, but only for __GFP_NOMEMALLOC.
>
> NOMEMALLOC indeed should always be kept away from memalloc/memdie
> reserves. That's how it should have worked when I added it (but
> I may have forgotten TIF_MEMDIE, I can't remember).
>

Yeah, so if test_thread_flag(TIF_MEMDIE) and __GFP_NOMEMALLOC, then it
makes sense to return NULL immediately following the call to the oom
killer for !__GFP_NOFAIL, since retrying the allocation at that point is
pointless (reclaim has already failed and TIF_MEMDIE doesn't help us on
the next attempt).
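
Expressed as code, the check being suggested would be something along these
lines, placed after the oom killer has been called in the slowpath (a sketch
under the assumptions above, not an actual patch):

	/* Sketch: an oom-killed task that may not touch the reserves
	 * (__GFP_NOMEMALLOC) gains nothing by retrying, so fail now
	 * unless the caller insisted on __GFP_NOFAIL. */
	if (test_thread_flag(TIF_MEMDIE) &&
	    (gfp_mask & __GFP_NOMEMALLOC) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;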

> > The oom killer currently is a no-op if any eligible task has TIF_MEMDIE,
> > so this would require adding an oom killer timeout so that if a task fails
> > to exit after a predefined period, TIF_MEMDIE is cleared and the task is
> > marked to no longer be selected (which would require an addition to
> > task_struct) although it may have already completely depleted memory
> > reserves.
>
> It wouldn't have to be a timeout, it could be a call back to the
> oom killer.
>

Calling the oom killer won't do anything since it will not kill another
task while another has TIF_MEMDIE to protect those memory reserves and
give the oom killed task a chance to exit.

Panicking when a thread with TIF_MEMDIE set cannot find any memory and the
allocation is __GFP_NOFAIL makes sense, but only for order 0.

2009-06-30 09:09:44

by Nick Piggin

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, Jun 30, 2009 at 01:41:12AM -0700, David Rientjes wrote:
> On Tue, 30 Jun 2009, Nick Piggin wrote:
>
> > > I'd agree with Mel's added check for TIF_MEMDIE upon returning from the
> > > oom killer, but only for __GFP_NOMEMALLOC.
> >
> > NOMEMALLOC indeed should always be kept away from memalloc/memdie
> > reserves. That's how it should have worked when I added it (but
> > I may have forgotten TIF_MEMDIE, I can't remember).
> >
>
> Yeah, so if test_thread_flag(TIF_MEMDIE) and __GFP_NOMEMALLOC, then it
> makes sense to return NULL immediately following the call to the oom
> killer for !__GFP_NOFAIL, since retrying the allocation at that point is
> pointless (reclaim has already failed and TIF_MEMDIE doesn't help us on
> the next attempt).

I don't see the importance of calling the oom killer. If a thread
is TIF_MEMDIE, then we should not try to enter reclaim nor try to
call the oom killer. The oom killer has already been activated, and
it has already been determined that nothing can be reclaimed...


> > > The oom killer currently is a no-op if any eligible task has TIF_MEMDIE,
> > > so this would require adding an oom killer timeout so that if a task fails
> > > to exit after a predefined period, TIF_MEMDIE is cleared and the task is
> > > marked to no longer be selected (which would require an addition to
> > > task_struct) although it may have already completely depleted memory
> > > reserves.
> >
> > It wouldn't have to be a timeout, it could be a call back to the
> > oom killer.
> >
>
> Calling the oom killer won't do anything since it will not kill another
> task while another has TIF_MEMDIE to protect those memory reserves and
> give the oom killed task a chance to exit.

I don't mean the normal oom-killer path, but another call to say
"this thread got stuck, un-kill me and look for someone else to kill"
or somesuch.


> Panicking when a thread with TIF_MEMDIE set cannot find any memory and the
> allocation is __GFP_NOFAIL makes sense, but only for order 0.

Why only order-0? What would you do at order>0?

2009-06-30 11:00:18

by Mel Gorman

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Mon, Jun 29, 2009 at 12:20:29PM -0700, Andrew Morton wrote:
> On Mon, 29 Jun 2009 16:30:07 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
> > process such as this is expected to exit the page allocator but in the
> > event it happens to have set __GFP_NOFAIL, it potentially loops forever.
> >
> > This patch checks TIF_MEMDIE when deciding whether to loop again in the
> > page allocator. Such a process will now return NULL after direct reclaim
> > and OOM killing have both been considered as options. The potential
> > problem is that a __GFP_NOFAIL allocation can still return failure so
> > callers must still handle getting returned NULL.
>
> I don't think we should do this :(
>
> The __GFP_NOFAIL callers are using __GFP_NOFAIL for a reason - they
> just cannot handle an allocation failure at all. They won't even test
> for a NULL return because a) they believe that __GFP_NOFAIL is magic and
> b) if the allocation failed, they're screwed anyway.
>

Indeed.

> So how feasible would it be to arrange for __GFP_NOFAIL callers to
> ignore the oom-killing?

A potential patch to do that is below - it will continue looping even with
TIF_MEMDIE if __GFP_NOFAIL is specified, but otherwise a TIF_MEMDIE process
will ignore watermarks and exit if no page is available. In case __GFP_NOFAIL
is at work, it also slightly alters the OOM killer to select another process if
the process selected for OOM killing is the current one.

> Presumably this means that they'll need to kill
> someone else and keep on trying?
>

That's what I would expect to happen with the following patch. An
OOM-killed && __GFP_NOFAIL process should loop again and, in the event there
are 0 pages free in the system, select another process for OOM
killing. This is still potentially an infinite loop, so we should warn in
case that is happening.

Do people reckon this brings behaviour more in line with expectations or
is there something else missing?

==== CUT HERE ====
page-allocator: Ensure that processes that have been OOM killed exit the page allocator

Processes that have been OOM killed set the thread flag TIF_MEMDIE. A process
such as this is expected to exit the page allocator but, potentially, it
loops forever. This patch checks TIF_MEMDIE when deciding whether to loop
again in the page allocator. If set, and __GFP_NOFAIL is not specified,
then the loop will exit on the assumption it's no longer important for the
process to make forward progress. Note that a process that has just been
OOM-killed will still loop at least one more time retrying the allocation
before the thread flag is checked.

If there are zero pages free and the process has been OOM-killed with
__GFP_NOFAIL, it will loop again, but the OOM killer is disabled as long as
one process has TIF_MEMDIE set. Potentially, this is an infinite loop, so the
OOM killer will still trigger if the current process is the process with
TIF_MEMDIE set. This could still be an infinite loop in the very unlikely
event that all pages are in use by the current process, which cannot continue
due to __GFP_NOFAIL. This patch warns if that situation is potentially
occurring. While it could be a deadlock situation, there is little else to
do except keep retrying or panic.

Signed-off-by: Mel Gorman <[email protected]>

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 175a67a..5f4656e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -230,14 +230,21 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
/*
* This task already has access to memory reserves and is
* being killed. Don't allow any other task access to the
- * memory reserve.
+ * memory reserve unless the current process is the one
+ * selected for OOM-killing. If the current process has
+ * been OOM-killed and we are OOM again, another process
+ * needs to be considered for OOM-kill
*
* Note: this may have a chance of deadlock if it gets
* blocked waiting for another task which itself is waiting
* for memory. Is there a better alternative?
*/
- if (test_tsk_thread_flag(p, TIF_MEMDIE))
- return ERR_PTR(-1UL);
+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ if (p == current)
+ continue;
+ else
+ return ERR_PTR(-1UL);
+ }

/*
* This is in the process of releasing memory so wait for it
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d714f8..5896469 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1539,6 +1539,14 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
return 0;

+ /* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
+ if (test_thread_flag(TIF_MEMDIE)) {
+ if (gfp_mask & __GFP_NOFAIL)
+ WARN(1, "Potential infinite loop with __GFP_NOFAIL");
+ else
+ return 0;
+ }
+
/*
* In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
* means __GFP_NOFAIL, but that may not be true in other

2009-06-30 19:36:30

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Mel Gorman wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 175a67a..5f4656e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -230,14 +230,21 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> /*
> * This task already has access to memory reserves and is
> * being killed. Don't allow any other task access to the
> - * memory reserve.
> + * memory reserve unless the current process is the one
> + * selected for OOM-killing. If the current process has
> + * been OOM-killed and we are OOM again, another process
> + * needs to be considered for OOM-kill
> *
> * Note: this may have a chance of deadlock if it gets
> * blocked waiting for another task which itself is waiting
> * for memory. Is there a better alternative?
> */
> - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> - return ERR_PTR(-1UL);
> + if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> + if (p == current)
> + continue;
> + else
> + return ERR_PTR(-1UL);
> + }
>
> /*
> * This is in the process of releasing memory so wait for it

This will panic the machine if current is the only user thread running or
eligible for oom kill (all others could have mm's with OOM_DISABLE set).
Currently, such tasks can exit or kthreads can free memory so that the oom
is recoverable.

The problem with this approach is that it increases the likelihood that
memory reserves will be totally depleted when several threads are
competing for them.

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5d714f8..5896469 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1539,6 +1539,14 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> if (gfp_mask & __GFP_NORETRY)
> return 0;
>
> + /* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
> + if (test_thread_flag(TIF_MEMDIE)) {
> + if (gfp_mask & __GFP_NOFAIL)
> + WARN(1, "Potential infinite loop with __GFP_NOFAIL");
> + else
> + return 0;
> + }
> +

There's a second bug in the refactored page allocator: when the oom killer
is invoked and it happens to kill current, alloc_flags is never updated
because it loops back to `restart', which is past gfp_to_alloc_flags().

When that is fixed with the patch below, gfp_mask & __GFP_NOFAIL will
never be true here in your scenario, where the oom killer kills current,
because __alloc_pages_high_priority() will infinitely loop.

> /*
> * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> * means __GFP_NOFAIL, but that may not be true in other
>

This is needed for 2.6.31-rc2.


mm: update alloc_flags after oom killer has been called

It is possible for the oom killer to select current as the task to kill.
When this happens, alloc_flags needs to be updated accordingly to set
ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory
reserves as the result of its thread having TIF_MEMDIE set if the
allocation is not __GFP_NOMEMALLOC.

Signed-off-by: David Rientjes <[email protected]>
---
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

wake_all_kswapd(order, zonelist, high_zoneidx);

+restart:
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
@@ -1763,7 +1764,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
alloc_flags = gfp_to_alloc_flags(gfp_mask);

-restart:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,

2009-06-30 19:47:57

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Nick Piggin wrote:

> > Yeah, so if test_thread_flag(TIF_MEMDIE) and __GFP_NOMEMALLOC, then it
> > makes sense to return NULL immediately following the call to the oom
> > killer for !__GFP_NOFAIL, since retrying the allocation at that point is
> > pointless (reclaim has already failed and TIF_MEMDIE doesn't help us on
> > the next attempt).
>
> I don't see the importance of calling the oom killer. If a thread
> is TIF_MEMDIE, then we should not try to enter reclaim nor try to
> call the oom killer. The oom killer has already been activated, and
> it has already been determined that nothing can be reclaimed...
>

Right, there's no need to call it a second time. I was referring to the
initial call that set_tsk_thread_flag(current, TIF_MEMDIE). When we
return to the page allocator from the oom killer, there's no sense in
retrying for __GFP_NOMEMALLOC and !__GFP_NOFAIL since it can't use memory
reserves anyway.

That doesn't mean the oom killer shouldn't kill current when
__GFP_NOMEMALLOC, though, because it can use memory reserves along the
exit path.

> > Calling the oom killer won't do anything since it will not kill another
> > task while another has TIF_MEMDIE to protect those memory reserves and
> > give the oom killed task a chance to exit.
>
> I don't mean the normal oom-killer path, but another call to say
> "this thread got stuck, un-kill me and look for someone else to kill"
> or somesuch.
>

Right, and my suggestion for doing that was an oom killer timeout as the
threshold for determining when a thread is "stuck," because usually that
means it's blocked in TASK_UNINTERRUPTIBLE, not because memory reserves
are empty. I'd be interested in alternative approaches other than a
timeout that determine when another task should be killed.

It's always possible that a "stuck" task has fully depleted memory
reserves and no forward progress will be made by anybody, so this is a
very bad situation to begin with.

> > Panicking when a thread with TIF_MEMDIE set cannot find any memory and the
> > allocation is __GFP_NOFAIL makes sense, but only for order 0.
>
> Why only order-0? What would you do at order>0?
>

For order > 0, it'd need to loop forever like it currently does for
__GFP_NOFAIL in __alloc_pages_high_priority(). It's possible that the
allocation will eventually succeed if another task frees memory after its
own allocation succeeded (the failure being not the result of memory being
totally unavailable, but of it being fragmented enough to prevent the
TIF_MEMDIE task from succeeding for order > 0).
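
For reference, the loop being described has roughly this shape in 2.6.31-rc
(quoted from memory, so the congestion_wait() arguments in particular may
not be exact):

	/* __alloc_pages_high_priority(): with __GFP_NOFAIL the
	 * no-watermark attempt is simply repeated until it succeeds,
	 * waiting for writeback congestion to ease in between. */
	do {
		page = get_page_from_freelist(gfp_mask, nodemask, order,
				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
				preferred_zone, migratetype);
		if (!page && (gfp_mask & __GFP_NOFAIL))
			congestion_wait(WRITE, HZ/50);
	} while (!page && (gfp_mask & __GFP_NOFAIL));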

We'd also need to consider whether the allocation is constrained to
lowmem, in which case the panic would be premature.

2009-06-30 20:32:55

by Mel Gorman

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, Jun 30, 2009 at 12:35:59PM -0700, David Rientjes wrote:
> On Tue, 30 Jun 2009, Mel Gorman wrote:
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index 175a67a..5f4656e 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -230,14 +230,21 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > /*
> > * This task already has access to memory reserves and is
> > * being killed. Don't allow any other task access to the
> > - * memory reserve.
> > + * memory reserve unless the current process is the one
> > + * selected for OOM-killing. If the current process has
> > + * been OOM-killed and we are OOM again, another process
> > + * needs to be considered for OOM-kill
> > *
> > * Note: this may have a chance of deadlock if it gets
> > * blocked waiting for another task which itself is waiting
> > * for memory. Is there a better alternative?
> > */
> > - if (test_tsk_thread_flag(p, TIF_MEMDIE))
> > - return ERR_PTR(-1UL);
> > + if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> > + if (p == current)
> > + continue;
> > + else
> > + return ERR_PTR(-1UL);
> > + }
> >
> > /*
> > * This is in the process of releasing memory so wait for it
>
> This will panic the machine if current is the only user thread running or
> eligible for oom kill (all others could have mm's with OOM_DISABLE set).
> Currently, such tasks can exit or kthreads can free memory so that the oom
> is recoverable.
>

Good point, would the following be ok instead?

+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ if (p == current) {
+ chosen = ERR_PTR(-1UL);
+ continue;
+ } else
+ return ERR_PTR(-1UL);


> The problem with this approach is that it increases the likelihood that
> memory reserves will be totally depleted when several threads are
> competing for them.
>

How so?

> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 5d714f8..5896469 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1539,6 +1539,14 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > if (gfp_mask & __GFP_NORETRY)
> > return 0;
> >
> > + /* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
> > + if (test_thread_flag(TIF_MEMDIE)) {
> > + if (gfp_mask & __GFP_NOFAIL)
> > + WARN(1, "Potential infinite loop with __GFP_NOFAIL");
> > + else
> > + return 0;
> > + }
> > +
>
> There's a second bug in the refactored page allocator: when the oom killer
> is invoked and it happens to kill current, alloc_flags is never updated
> because it loops back to `restart', which is past gfp_to_alloc_flags().
>
> When that is fixed with the patch below, gfp_mask & __GFP_NOFAIL will
> never be true here in your scenario, where the oom killer kills current,
> because __alloc_pages_high_priority() will infinitely loop.
>
> > /*
> > * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > * means __GFP_NOFAIL, but that may not be true in other
> >
>
> This is needed for 2.6.31-rc2.
>
>
> mm: update alloc_flags after oom killer has been called
>
> It is possible for the oom killer to select current as the task to kill.
> When this happens, alloc_flags needs to be updated accordingly to set
> ALLOC_NO_WATERMARKS so the subsequent allocation attempt may use memory
> reserves as the result of its thread having TIF_MEMDIE set if the
> allocation is not __GFP_NOMEMALLOC.
>
> Signed-off-by: David Rientjes <[email protected]>

Acked-by: Mel Gorman <[email protected]>

> ---
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1756,6 +1756,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
> wake_all_kswapd(order, zonelist, high_zoneidx);
>
> +restart:
> /*
> * OK, we're below the kswapd watermark and have kicked background
> * reclaim. Now things get more complex, so set up alloc_flags according
> @@ -1763,7 +1764,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> */
> alloc_flags = gfp_to_alloc_flags(gfp_mask);
>
> -restart:
> /* This is the last chance, in general, before the goto nopage. */
> page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
> high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-06-30 20:51:22

by David Rientjes

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, 30 Jun 2009, Mel Gorman wrote:

> > This will panic the machine if current is the only user thread running or
> > eligible for oom kill (all others could have mm's with OOM_DISABLE set).
> > Currently, such tasks can exit or kthreads can free memory so that the oom
> > is recoverable.
> >
>
> Good point, would the following be ok instead?
>
> + if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> + if (p == current) {
> + chosen = ERR_PTR(-1UL);
> + continue;
> + } else
> + return ERR_PTR(-1UL);
>

Yes, if you wanted to allow multiple threads to have TIF_MEMDIE set.

> > The problem with this approach is that it increases the likelihood that
> > memory reserves will be totally depleted when several threads are
> > competing for them.
> >
>
> How so?
>

We automatically oom kill current if it's PF_EXITING to give it TIF_MEMDIE
because we know it's on the exit path; this avoids allowing it to
allocate below the min watermark for the allocation that triggered the
oom, which could be significant.
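
That shortcut is, roughly, this early exit in the oom killer (paraphrased
from memory from mm/oom_kill.c of that era):

	/*
	 * If the task is already exiting, don't alarm the sysadmin or
	 * kill its children or threads; just give it TIF_MEMDIE (and
	 * with it access to the reserves) so it can die quickly.
	 */
	if (p->flags & PF_EXITING) {
		__oom_kill_task(p, 0);
		return 0;
	}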

If several threads are ooming at the same time, which happens quite often
for non-cpuset and non-memcg constrained ooms (and those not restricted to
lowmem), we could now potentially have nr_cpus threads with TIF_MEMDIE set
simultaneously, which increases the likelihood that memory reserves will be
fully depleted after each allocation that triggered the oom killer now
succeeds because of ALLOC_NO_WATERMARKS. This is somewhat mitigated by
the oom killer serialization done on the zonelist, but nothing guarantees
that reserves aren't empty before one of the threads has detached its
->mm.

oom_kill_task() SIGKILLs threads sharing ->mm with different tgid's
instead of giving them access to memory reserves specifically to avoid
this.
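
The behaviour being referred to looks roughly like this in oom_kill_task()
(again paraphrased from memory): threads in other thread groups that share
the victim's ->mm get a plain SIGKILL, without the TIF_MEMDIE reserve access.

	/*
	 * Kill all processes sharing p->mm in other thread groups, but
	 * do NOT give them access to the memory reserves, otherwise we
	 * might deplete all memory before any of them manages to exit.
	 */
	do_each_thread(g, q) {
		if (q->mm == mm && !same_thread_group(q, p))
			force_sig(SIGKILL, q);
	} while_each_thread(g, q);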

2009-07-01 08:55:53

by Pavel Machek

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist


> > This is not a new concept. oom has never been "immediately kill".
>
> Well, it has been immediate for a long time. A couple of reasons which
> I can recall:
>
> - A page-allocating process will oom-kill another process in the
> expectation that the killing will free up some memory. If the
> oom-killed process remains stuck in the page allocator, that doesn't
> work.
>
> - The oom-killed process might be holding locks (typically fs locks).
> This can cause an arbitrary number of other processes to be blocked.
> So to get the system unstuck we need the oom-killed process to
> immediately exit the page allocator, to handle the NULL return and to
> drop those locks.
>
> There may be other reasons - it was all a long time ago, and I've never
> personally hacked on the oom-killer much and I never get oom-killed.
> But given the amount of development work which goes on in there, some
> people must be getting massacred.
>
>
> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer. It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller. I never really understood what problem/thinking led Andrea to
> do that.

I guess he was trying to get huge 32-bit highmem machines to work... On
such systems, kmalloc failures will eventually get you...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-07-01 10:22:25

by Mel Gorman

[permalink] [raw]
Subject: Re: upcoming kerneloops.org item: get_page_from_freelist

On Tue, Jun 30, 2009 at 01:51:07PM -0700, David Rientjes wrote:
> On Tue, 30 Jun 2009, Mel Gorman wrote:
>
> > > This will panic the machine if current is the only user thread running or
> > > eligible for oom kill (all others could have mm's with OOM_DISABLE set).
> > > Currently, such tasks can exit or kthreads can free memory so that the oom
> > > is recoverable.
> > >
> >
> > Good point, would the following be ok instead?
> >
> > + if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> > + if (p == current) {
> > + chosen = ERR_PTR(-1UL);
> > + continue;
> > + } else
> > + return ERR_PTR(-1UL);
> >
>
> Yes, if you wanted to allow multiple threads to have TIF_MEMDIE set.
>

While it would be possible, I'm thinking that it would be unlikely for
multiple TIF_MEMDIE processes to exist like this unless they were all
__GFP_NOFAIL. For a second thread to have TIF_MEMDIE set, one process
needs to OOM-kill, select itself, fail the next allocation and OOM-kill
again. For it to be looping, each thread would also need __GFP_NOFAIL, so
while it's possible for multiple threads to get TIF_MEMDIE, it's
exceedingly difficult and the system needs to be in a bad state.

Do you agree that the following hunk makes sense at least?

+ /* Do not loop if OOM-killed unless __GFP_NOFAIL is specified */
+ if (test_thread_flag(TIF_MEMDIE)) {
+ if (gfp_mask & __GFP_NOFAIL)
+ WARN(1, "Potential infinite loop with __GFP_NOFAIL");
+ else
+ return 0;
+ }

With that on its own, TIF_MEMDIE processes will exit the page allocator unless
__GFP_NOFAIL is set, and warn when the potential infinite loop is happening.
In combination with your patch to recalculate alloc_flags after the OOM
killer, I think TIF_MEMDIE would be much closer to behaving sensibly.

Then a patch that addresses potential-infinite-loop-TIF_MEMDIE-__GFP_NOFAIL
can be considered.

> > > The problem with this approach is that it increases the likelihood that
> > > memory reserves will be totally depleted when several threads are
> > > competing for them.
> > >
> >
> > How so?
> >
>
> We automatically oom kill current if it's PF_EXITING to give it TIF_MEMDIE
> because we know it's on the exit path; this avoids allowing it to
> allocate below the min watermark for the allocation that triggered the
> oom, which could be significant.
>

Ok, your patch is important for this situation.

> If several threads are ooming at the same time, which happens quite often
> for non-cpuset and non-memcg constrained ooms (and those not restricted to
> lowmem), we could now potentially have nr_cpus threads with TIF_MEMDIE set
> simultaneously, which increases the likelihood that memory reserves will be
> fully depleted after each allocation that triggered the oom killer now
> succeeds because of ALLOC_NO_WATERMARKS. This is somewhat mitigated by
> the oom killer serialization done on the zonelist, but nothing guarantees
> that reserves aren't empty before one of the threads has detached its
> ->mm.
>

While I think this situation is exceedingly unlikely to occur because of
the required combination of GFP flags and memory pressure, I see your point.

> oom_kill_task() SIGKILLs threads sharing ->mm with different tgid's
> instead of giving them access to memory reserves specifically to avoid
> this.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab