2010-07-07 23:11:43

by Michel Lespinasse

[permalink] [raw]
Subject: FYI: mmap_sem OOM patch

As I mentioned in my MM discussion proposal, I am sending the following
patch for discussion.

This helps some workloads we use, which normally run with little memory
headroom and occasionally get pushed over the edge to OOM. In some cases,
the OOM-killed thread can't exit because it needs to acquire the mmap_sem,
while the thread holding the mmap_sem can't make progress because it needs
to allocate memory.

IIRC we first encountered this with our 2.6.18 kernel; it has been
reproduced with 2.6.33 (sorry, we'll have to clean up that test case a bit
before I can send it); and the following patch is against 2.6.34 plus the
down_read_unfair changes (which have been discussed here before; I'm not
resending them at this point unless someone asks).

Note that while this patch modifies some performance critical MM functions,
it does make sure to stay out of the fast paths.


Use down_read_unfair() in OOM-killed tasks

When tasks are being targeted by OOM killer, we want to use
down_read_unfair to ensure a fast exit.

- The down_read_unfair() call in fast_get_user_pages() is to cover the
futex calls in mm_release() (including exit_robust_list(),
compat_exit_robust_list(), exit_pi_state_list() and sys_futex())

- The down_read_unfair() call in do_page_fault() is to cover the
put_user() call in mm_release() and exiting the robust futex lists.

This is a rework of a previous change in 2.6.18:

Change the down_read()s in the exit path to be unfair so that we do
not result in a potentially 3-to-4-way deadlock during the ooms.

What happens is we end up with a single thread in the oom loop (T1)
that ends up killing a sibling thread (T2). That sibling thread will
need to acquire the read side of the mmap_sem in the exit path. It's
possible, however, that yet another thread (T3) is in the middle of a
virtual address space operation (mmap, munmap) and is enqueued to grab
the write side of the mmap_sem behind yet another thread (T4) that is
stuck in the OOM loop (behind T1) with mmap_sem held for read (e.g.,
allocating a page for pagecache as part of a fault).

T1           T2               T3                  T4
 .            .                .                   .
oom:          .                .                   .
 oomkill      .                .                   .
 ^  \         .                .                   .
/|\  ----> do_exit:            .                   .
 |         sleep in            .                   .
 |         read(mmap_sem)      .                   .
 |            \                .                   .
 |             ----> mmap                          .
 |                   sleep in                      .
 |                   write(mmap_sem)               .
 |                      \                          .
 |                       ----> fault
 |                             holding read(mmap_sem)
 |                             oom
 |                              |
 |                             /
 \-----------------------------/

Signed-off-by: Michel Lespinasse <[email protected]>

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..4b3a1c7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
bad_area_nosemaphore(regs, error_code, address);
return;
}
- down_read(&mm->mmap_sem);
+ if (test_thread_flag(TIF_MEMDIE))
+ down_read_unfair(&mm->mmap_sem);
+ else
+ down_read(&mm->mmap_sem);
} else {
/*
* The above down_read_trylock() might have succeeded in
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 738e659..578d1a7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -357,7 +357,10 @@ slow_irqon:
start += nr << PAGE_SHIFT;
pages += nr;

- down_read(&mm->mmap_sem);
+ if (unlikely(test_thread_flag(TIF_MEMDIE)))
+ down_read_unfair(&mm->mmap_sem);
+ else
+ down_read(&mm->mmap_sem);
ret = get_user_pages(current, mm, start,
(end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
up_read(&mm->mmap_sem);
diff --git a/kernel/exit.c b/kernel/exit.c
index 7f2683a..2318d3a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
* will increment ->nr_threads for each thread in the
* group with ->mm != NULL.
*/
- down_read(&mm->mmap_sem);
+ down_read_unfair(&mm->mmap_sem);
core_state = mm->core_state;
if (core_state) {
struct core_thread self;


--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


2010-07-08 00:16:38

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

Hi

> As I mentioned this in my MM discussion proposal, I am sending the following
> patch for discussion.
>
> This helps some workloads we use, that are normally run with little memory
> headroom and get occasionally pushed over the edge to OOM. In some cases,
> the OOMing thread can't exit because it needs to acquire the mmap_sem,
> while the thread holding mmap_sem can't progress because it needs to
> allocate memory.
>
> IIRC we first encountered this against our 2.6.18 kernel; this has been
> reproduced with 2.6.33 (sorry, we'll have to clean up that test case a bit
> before I can send it); and following patch is against 2.6.34 plus
> down_read_unfair changes (which have been discussed here before, not
> resending them at this point unless someone asks).
>
> Note that while this patch modifies some performance critical MM functions,
> it does make sure to stay out of the fast paths.
>
>
> Use down_read_unfair() in OOM-killed tasks
>
> When tasks are being targeted by OOM killer, we want to use
> down_read_unfair to ensure a fast exit.

Yup.
If admins want to manually kill a memory-hogging process while the system
is under heavy swap thrashing, we will face the same problem and need the
same unfairness for a fast exit. So the unfair-exit design looks very good.

I'd be glad if you updated the description accordingly :)


>
> - The down_read_unfair() call in fast_get_user_pages() is to cover the
> futex calls in mm_release() (including exit_robust_list(),
> compat_exit_robust_list(), exit_pi_state_list() and sys_futex())
>
> - The down_read_unfair() call in do_page_fault() is to cover the
> put_user() call in mm_release() and exiting the robust futex lists.

I have the opposite question: in which cases should we avoid using
down_read_unfair()? You already changed the fast path, so I guess the
reason is not performance, but I'm not sure of the exact reason.


In other words, can we add the following helper (or something like it)
and convert all callers?


static inline void down_read_mmap_sem(struct mm_struct *mm)
{
	if (unlikely(test_thread_flag(TIF_MEMDIE)))
		down_read_unfair(&mm->mmap_sem);
	else
		down_read(&mm->mmap_sem);
}




>
> This is a rework of a previous change in 2.6.18:
>
> Change the down_read()s in the exit path to be unfair so that we do
> not result in a potentially 3-to-4-way deadlock during the ooms.
>
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2). That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path. It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
>
> T1           T2               T3                  T4
>  .            .                .                   .
> oom:          .                .                   .
>  oomkill      .                .                   .
>  ^  \         .                .                   .
> /|\  ----> do_exit:            .                   .
>  |         sleep in            .                   .
>  |         read(mmap_sem)      .                   .
>  |            \                .                   .
>  |             ----> mmap                          .
>  |                   sleep in                      .
>  |                   write(mmap_sem)               .
>  |                      \                          .
>  |                       ----> fault
>  |                             holding read(mmap_sem)
>  |                             oom
>  |                              |
>  |                             /
>  \-----------------------------/
>
> Signed-off-by: Michel Lespinasse <[email protected]>
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> bad_area_nosemaphore(regs, error_code, address);
> return;
> }
> - down_read(&mm->mmap_sem);
> + if (test_thread_flag(TIF_MEMDIE))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> } else {
> /*
> * The above down_read_trylock() might have succeeded in
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 738e659..578d1a7 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -357,7 +357,10 @@ slow_irqon:
> start += nr << PAGE_SHIFT;
> pages += nr;
>
> - down_read(&mm->mmap_sem);
> + if (unlikely(test_thread_flag(TIF_MEMDIE)))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> ret = get_user_pages(current, mm, start,
> (end - start) >> PAGE_SHIFT, write, 0, pages, NULL);
> up_read(&mm->mmap_sem);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 7f2683a..2318d3a 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -664,7 +664,7 @@ static void exit_mm(struct task_struct * tsk)
> * will increment ->nr_threads for each thread in the
> * group with ->mm != NULL.
> */
> - down_read(&mm->mmap_sem);
> + down_read_unfair(&mm->mmap_sem);
> core_state = mm->core_state;
> if (core_state) {
> struct core_thread self;

Agreed, exit_mm() should always use down_read_unfair().
But I think it needs an explicit comment, please.



btw, MM developers only want to review the mm part. Can you please get
an ack on down_read_unfair() itself from other developers (perhaps David
Howells or someone else)?

2010-07-08 01:11:15

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> Yup.
> If admins want to kill memory hogging process manually when the system
> is under heavy swap thrashing, we will face the same problem, need
> unfairness and fast exit. So, unfair exiting design looks very good.
>
> If you will updated the description, I'm glad :)

I have one more topic: can we check fatal_signal_pending() instead of
TIF_MEMDIE? As I said, if the process has received SIGKILL, are the
fork/exec/brk/mmap starvations actually a problem?

Thanks.

2010-07-08 09:02:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..4b3a1c7 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> bad_area_nosemaphore(regs, error_code, address);
> return;
> }
> - down_read(&mm->mmap_sem);
> + if (test_thread_flag(TIF_MEMDIE))
> + down_read_unfair(&mm->mmap_sem);
> + else
> + down_read(&mm->mmap_sem);
> } else {
> /*
> * The above down_read_trylock() might have succeeded in

I still think adding that _unfair interface is asking for trouble.

2010-07-08 09:24:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
>
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index f627779..4b3a1c7 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > bad_area_nosemaphore(regs, error_code, address);
> > return;
> > }
> > - down_read(&mm->mmap_sem);
> > + if (test_thread_flag(TIF_MEMDIE))
> > + down_read_unfair(&mm->mmap_sem);
> > + else
> > + down_read(&mm->mmap_sem);
> > } else {
> > /*
> > * The above down_read_trylock() might have succeeded in
>
> I still think adding that _unfair interface is asking for trouble.

Can you please explain the trouble you're worried about? Why do we need
to keep thread fairness in the OOM case?


btw, I also dislike the unfair + /proc combination.


2010-07-08 09:35:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> >
> > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > index f627779..4b3a1c7 100644
> > > --- a/arch/x86/mm/fault.c
> > > +++ b/arch/x86/mm/fault.c
> > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > bad_area_nosemaphore(regs, error_code, address);
> > > return;
> > > }
> > > - down_read(&mm->mmap_sem);
> > > + if (test_thread_flag(TIF_MEMDIE))
> > > + down_read_unfair(&mm->mmap_sem);
> > > + else
> > > + down_read(&mm->mmap_sem);
> > > } else {
> > > /*
> > > * The above down_read_trylock() might have succeeded in
> >
> > I still think adding that _unfair interface is asking for trouble.
>
> Can you please explain trouble that you worry? Why do we need to keep
> thread fairness when OOM case?

Just the whole concept of the unfair thing offends me ;-) I didn't
really look at the particular application in this case.

2010-07-08 09:51:25

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Thu, 2010-07-08 at 18:24 +0900, KOSAKI Motohiro wrote:
> > > On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> > >
> > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > > > index f627779..4b3a1c7 100644
> > > > --- a/arch/x86/mm/fault.c
> > > > +++ b/arch/x86/mm/fault.c
> > > > @@ -1062,7 +1062,10 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
> > > > bad_area_nosemaphore(regs, error_code, address);
> > > > return;
> > > > }
> > > > - down_read(&mm->mmap_sem);
> > > > + if (test_thread_flag(TIF_MEMDIE))
> > > > + down_read_unfair(&mm->mmap_sem);
> > > > + else
> > > > + down_read(&mm->mmap_sem);
> > > > } else {
> > > > /*
> > > > * The above down_read_trylock() might have succeeded in
> > >
> > > I still think adding that _unfair interface is asking for trouble.
> >
> > Can you please explain trouble that you worry? Why do we need to keep
> > thread fairness when OOM case?
>
> Just the whole concept of the unfair thing offends me ;-) I didn't
> really look at the particular application in this case.

I see.
Yup, I agree the unfair concept is a bit ugly. If anyone has an
alternative idea, I'm happy to go with that instead.


2010-07-08 10:30:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Wed, 2010-07-07 at 16:11 -0700, Michel Lespinasse wrote:
> What happens is we end up with a single thread in the oom loop (T1)
> that ends up killing a sibling thread (T2). That sibling thread will
> need to acquire the read side of the mmap_sem in the exit path. It's
> possible however that yet a different thread (T3) is in the middle of
> a virtual address space operation (mmap, munmap) and is enqueue to
> grab the write side of the mmap_sem behind yet another thread (T4)
> that is stuck in the OOM loop (behind T1) with mmap_sem held for read
> (like allocating a page for pagecache as part of a fault.
>
> T1           T2               T3                  T4
>  .            .                .                   .
> oom:          .                .                   .
>  oomkill      .                .                   .
>  ^  \         .                .                   .
> /|\  ----> do_exit:            .                   .
>  |         sleep in            .                   .
>  |         read(mmap_sem)      .                   .
>  |            \                .                   .
>  |             ----> mmap                          .
>  |                   sleep in                      .
>  |                   write(mmap_sem)               .
>  |                      \                          .
>  |                       ----> fault
>  |                             holding read(mmap_sem)
>  |                             oom
>  |                              |
>  |                             /
>  \-----------------------------/

So what you do is use recursive locking to side-step a deadlock.
Recursive locking is poor taste and leads to very ill defined locking
rules.

One way to fix this is to have T4 wake from the oom queue and return an
allocation failure instead of insisting on going oom itself when T1
decides to take down the task.

2010-07-08 10:49:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
>
>
> One way to fix this is to have T4 wake from the oom queue and return an
> allocation failure instead of insisting on going oom itself when T1
> decides to take down the task.
>
> How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...

If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
down the whole process by sending around signals of sorts (SIGKILL?), so
if T4 gets a fatal signal while it is waiting to enter the oom thingy,
have it abort and return an allocation failure.

That alloc failure (along with a pending fatal signal) will very likely
lead to the release of its mmap_sem (if not, there's more things to
cure).

At which point the cycle is broken and stuff continues as it was
intended.

2010-07-08 10:57:45

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> >
> >
> > One way to fix this is to have T4 wake from the oom queue and return an
> > allocation failure instead of insisting on going oom itself when T1
> > decides to take down the task.
> >
> > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...
>
> If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> down the whole process by sending around signals of sorts (SIGKILL?), so
> if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> have it abort and return an allocation failure.
>
> That alloc failure (along with a pending fatal signal) will very likely
> lead to the release of its mmap_sem (if not, there's more things to
> cure).
>
> At which point the cycle is broken an stuff continues as it was
> intended.

Now, I've reread the current code. I think mmotm already has this.


T4 calls out_of_memory() and gets TIF_MEMDIE
=========================================================
void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
		int order, nodemask_t *nodemask)
{
	(snip)
	/*
	 * If current has a pending SIGKILL, then automatically select it. The
	 * goal is to allow it to allocate so that it may quickly exit and free
	 * its memory.
	 */
	if (fatal_signal_pending(current)) {
		set_thread_flag(TIF_MEMDIE);
		boost_dying_task_prio(current, NULL);
		return;
	}
==================================================================


alloc_pages immediately returns if the task has TIF_MEMDIE
==================================================================
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
	struct zonelist *zonelist, enum zone_type high_zoneidx,
	nodemask_t *nodemask, struct zone *preferred_zone,
	int migratetype)
{
	(snip)
	/* Avoid allocations with no watermarks from looping endlessly */
	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
		goto nopage;
==========================================================================


Thought?


2010-07-08 11:02:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Thu, 2010-07-08 at 19:57 +0900, KOSAKI Motohiro wrote:
> > On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> > >
> > >
> > > One way to fix this is to have T4 wake from the oom queue and return an
> > > allocation failure instead of insisting on going oom itself when T1
> > > decides to take down the task.
> > >
> > > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...
> >
> > If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> > down the whole process by sending around signals of sorts (SIGKILL?), so
> > if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> > have it abort and return an allocation failure.
> >
> > That alloc failure (along with a pending fatal signal) will very likely
> > lead to the release of its mmap_sem (if not, there's more things to
> > cure).
> >
> > At which point the cycle is broken an stuff continues as it was
> > intended.
>
> Now, I've reread current code. I think mmotm already have this.

<snip code>

[ small note on that we really should kill __GFP_NOFAIL, its utter
deadlock potential ]

> Thought?

So either it's not working, or Google never tried that code?

2010-07-08 11:06:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Thu, 2010-07-08 at 19:57 +0900, KOSAKI Motohiro wrote:
> > > On Thu, 2010-07-08 at 03:39 -0700, Michel Lespinasse wrote:
> > > >
> > > >
> > > > One way to fix this is to have T4 wake from the oom queue and return an
> > > > allocation failure instead of insisting on going oom itself when T1
> > > > decides to take down the task.
> > > >
> > > > How would you have T4 figure out the deadlock situation ? T1 is taking down T2, not T4...
> > >
> > > If T2 and T4 share a mmap_sem they belong to the same process. OOM takes
> > > down the whole process by sending around signals of sorts (SIGKILL?), so
> > > if T4 gets a fatal signal while it is waiting to enter the oom thingy,
> > > have it abort and return an allocation failure.
> > >
> > > That alloc failure (along with a pending fatal signal) will very likely
> > > lead to the release of its mmap_sem (if not, there's more things to
> > > cure).
> > >
> > > At which point the cycle is broken an stuff continues as it was
> > > intended.
> >
> > Now, I've reread current code. I think mmotm already have this.
>
> <snip code>
>
> [ small note on that we really should kill __GFP_NOFAIL, its utter
> deadlock potential ]

I disagree. __GFP_NOFAIL means a failure of this allocation could have
really dangerous results. Instead, the OOM killer should try to kill the
next process, I think.

> > Thought?
>
> So either its not working or google never tried that code?

Michel?

2010-07-08 11:23:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Thu, 2010-07-08 at 20:06 +0900, KOSAKI Motohiro wrote:
> > [ small note on that we really should kill __GFP_NOFAIL, its utter
> > deadlock potential ]
>
> I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> dangerous result. Instead, OOM-Killer should try to kill next process.
> I think.

Say _what_?! You think NOFAIL is a sane thing? Pretty much everybody has
been agreeing for years that the thing should die.

2010-07-08 12:43:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

Could you please educate your mailer to not send HTML garbage?

2010-07-09 01:31:29

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Thu, 2010-07-08 at 20:06 +0900, KOSAKI Motohiro wrote:
> > > [ small note on that we really should kill __GFP_NOFAIL, its utter
> > > deadlock potential ]
> >
> > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > dangerous result. Instead, OOM-Killer should try to kill next process.
> > I think.
>
> Say _what_?! you think NOFAIL is a sane thing?

Insane, obviously ;)
But in my experience, some embedded systems prefer to use NOFAIL, so I
don't want a big-hammer removal to cause crashes. Killing NOFAIL will
take more years than you expect, I guess.


> Pretty much everybody has
> been agreeing for years that the thing should die.

I'm not against this at all. But until it dies, it should work correctly.

2010-07-12 21:48:15

by David Rientjes

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Thu, 8 Jul 2010, KOSAKI Motohiro wrote:

> I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> dangerous result. Instead, OOM-Killer should try to kill next process.
> I think.
>

That's not what happens, __alloc_pages_high_priority() will loop forever
for __GFP_NOFAIL, the oom killer is never recalled.

2010-07-13 00:14:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

> On Thu, 8 Jul 2010, KOSAKI Motohiro wrote:
>
> > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > dangerous result. Instead, OOM-Killer should try to kill next process.
> > I think.
> >
>
> That's not what happens, __alloc_pages_high_priority() will loop forever
> for __GFP_NOFAIL, the oom killer is never recalled.

Yup, please reread the discussion.

2010-07-13 21:08:36

by David Rientjes

[permalink] [raw]
Subject: Re: FYI: mmap_sem OOM patch

On Tue, 13 Jul 2010, KOSAKI Motohiro wrote:

> > > I disagree. __GFP_NOFAIL mean this allocation failure can makes really
> > > dangerous result. Instead, OOM-Killer should try to kill next process.
> > > I think.
> > >
> >
> > That's not what happens, __alloc_pages_high_priority() will loop forever
> > for __GFP_NOFAIL, the oom killer is never recalled.
>
> Yup, please reread the discusstion.
>

There's nothing in the discussion that addresses the fact that
__alloc_pages_high_priority() loops infinitely for __GFP_NOFAIL without
again calling the oom killer. You may be proposing a change to that when
you said "OOM-Killer should try to kill next process. I think," but I
can't speculate on that. If you are, please propose a patch.

The success of whether we can allocate without watermarks is dependent on
where we set them and what types of exclusions we allow. This isn't local
only to the page allocator but rather to the entire kernel since
GFP_ATOMIC allocations, for example, can deplete the min watermark by
~60%. The remaining memory is what we allow access to for
ALLOC_NO_WATERMARKS: those allocations in the direct reclaim path and
those that have been oom killed.

Thus, it's important that oom killed tasks do not completely deplete
memory reserves, otherwise __GFP_NOFAIL allocations will loop forever
without killing additional tasks, as you say. That's sane since oom
killing additional tasks wouldn't allow them access to any additional
memory, anyway, so the allocation would still fail (we only call the oom
killer for those allocations that are retried) and the victim could not
exit.

With that said, I'm not exactly sure what change you're proposing when you
say the oom killer should try to kill another process because it won't
lead to any future memory freeing unless the victim can exit without
allocating memory. If that's not possible, you've proliferated a ton of
innocent kills that were triggered because of one __GFP_NOFAIL attempt.

I agree with Peter that it would be ideal to remove __GFP_NOFAIL, but I
think we require changes in the retry logic first before that is possible.
Right now, we insist on retrying all blockable !__GFP_NORETRY allocations
under PAGE_ALLOC_COSTLY_ORDER indefinitely. That, in a sense, is already
__GFP_NOFAIL behavior that is implicit: that's why we typically see
__GFP_NOFAIL with GFP_NOFS instead. With GFP_NOFS, we never invoke the oom
killer in the first place, so memory allocation is only more successful on
a subsequent attempt by either direct reclaim or memory compaction.

There's nothing preventing users from doing

do {
	page = alloc_page(GFP_KERNEL);
} while (!page);

if the retry logic were reworked to start failing allocations or we
removed __GFP_NOFAIL. Thus, I think __GFP_NOFAIL should be substituted
with a different gfp flag that acts similarly but is failable: use
compaction, reclaim, and the oom killer where possible, in that order,
and if there is no success, retry to a high watermark one final time.
If the allocation is still unsuccessful, return NULL. This allows users
to do what getblk() does, for instance, by implementing their own memory
freeing functions. It also allows them to use all of the page allocator's
powers (compaction, reclaim, oom killer) without the infinite looping that
currently comes from either using an order under PAGE_ALLOC_COSTLY_ORDER
or insisting on __GFP_NOFAIL.