2015-02-06 16:23:09

by Konstantin Khlebnikov

[permalink] [raw]
Subject: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID

Currently kernel ignores put_user() errors when it writes tid for syscall
clone flags CLONE_CHILD_SETTID and CLONE_CHILD_CLEARTID.

Unfortunately this code always worked in this way. We cannot abort syscall
if client tid pointer is invalid.

This patch adds get_user() after failed put_user(): if this address is not
even readable then we leave it alone: user-space probably don't care about
pid or error will be noticed soon. If address is readable then application
might be mislead about it's own pid. In this case CLONE_CHILD_SETTID kills
child task. In same condition failed CLONE_CHILD_CLEARTID prints warning.

Nikita found script which sometimes hangs inside glibc in function fork()
in child task right after returning from syscall some glibc assert fails:

#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:93
#1 0x00007fabcc699087 in _L_lock_11477 at malloc.c:5249
#2 0x00007fabcc6973ed in __GI___libc_realloc at malloc.c:3054
#3 0x00007fabcc686dbd in _IO_vasprintf at vasprintf.c:86
#4 0x00007fabcc667747 in ___asprintf at asprintf.c:37
#5 0x00007fabcc642cfb in __assert_fail_base at assert.c:59
#6 0x00007fabcc642e42 in __GI___assert_fail at assert.c:103
#7 0x00007fabcc6d40b6 in __libc_fork at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:142
#8 0x00007fabccad23cd in Perl_pp_system at pp_sys.c:4143c
#9 0x00007fabcca7fcd6 in Perl_runops_standard at run.c:41
#10 0x00007fabcca2139a in S_run_body at perl.c:2350
#11 perl_run at perl.c:2268
#12 0x0000000000400db9 in main at perlmain.c:120

Assert checks (THREAD_GETMEM (self, tid) != ppid). In child task pid still
equal to the parent pid! Write for CLONE_CHILD_SETTID had failed silently.
Unfortunately this assert in glibc is flawed, some internal locks are held
at this point and task hangs when tries to print out error message.

Usually memory allocations at page-faults are either completely successful
or task is killed by out of memory killer: buddy allocator handles 0-order
allocations as GFP_NOFAIL and retries them endlessly. OOM-killer in memory
cgroup works only for faulting from user-space. If kernel hits memcg limit
inside page-fault that will be handled as ordinary "Bad address": -EFAULT.

Whole sequence looks like: task calls fork, glibc calls syscall clone with
CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
Child task gets read-only copy of VM including TLS. Child calls put_user()
to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
fault and it fails because do_wp_page() hits memcg limit without invoking
OOM-killer because this is page-fault from kernel-space. Put_user returns
-EFAULT, which is ignored. Child returns into user-space and catches here
assert (THREAD_GETMEM (self, tid) != ppid), glibc tries to print something
but hangs on deadlock on internal locks. Halt and catch fire.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reported-by: Nikita Vetoshkin <[email protected]>
---
kernel/fork.c | 15 ++++++++++++---
kernel/sched/core.c | 16 ++++++++++++++--
2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4dc2dda..f71305d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -822,11 +822,20 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
if (tsk->clear_child_tid) {
if (!(tsk->flags & PF_SIGNALED) &&
atomic_read(&mm->mm_users) > 1) {
+ int dummy;
+
/*
- * We don't check the error code - if userspace has
- * not set up a proper pointer then tough luck.
+ * We cannot report put_user fails - if userspace has
+ * not set up a proper pointer then tough luck. It's
+ * much worse if it's failed for some other reason:
+ * for example memory shortage in CoW area, somebody
+ * will wait us forever. Let's at least print warning.
*/
- put_user(0, tsk->clear_child_tid);
+ if (unlikely(put_user(0, tsk->clear_child_tid)) &&
+ !get_user(dummy, tsk->clear_child_tid) &&
+ !fatal_signal_pending(current))
+ WARN_ON_ONCE(dummy);
+
sys_futex(tsk->clear_child_tid, FUTEX_WAKE,
1, NULL, NULL, 0);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..74b33ff 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
post_schedule(rq);
preempt_enable();

- if (current->set_child_tid)
- put_user(task_pid_vnr(current), current->set_child_tid);
+ if (current->set_child_tid &&
+ unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
+ int dummy;
+
+ /*
+ * If this address is unreadable then userspace has not set
+ * proper pointer. Application either doesn't care or will
+ * notice this soon. If this address is readable then task
+ * will be mislead about its own tid. It's better to die.
+ */
+ if (!get_user(dummy, current->set_child_tid) &&
+ !fatal_signal_pending(current))
+ force_sig(SIGSEGV, current);
+ }
}

/*


2015-02-06 16:23:29

by Konstantin Khlebnikov

[permalink] [raw]
Subject: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
from put_user() is ignored. Glibc completely relies on that feature and uses
value returned from syscall only for error checking.

Kernels older than v2.6.24 handled that correctly but check has been removed
in commit 30e49c263e36 ("pid namespaces: allow cloning of new namespace").
Commit message tells nothing about reason. I guess that was fix for commit
425fb2b4bf5d ("pid namespaces: move alloc_pid() lower in copy_process()")
which breaks this logic: after it kernel writes parent pid as child pid.

This patch moves related put_user() from do_fork() back into copy_process()
where it was before and where error can be handled.

Another problem is that before v2.6.24 CLONE_PARENT_SETTID stored child pid
both in parent and child memory. Documentation in man clone(2) also tells so.
In v2.6.24 put_user() was placed after copy_mm(), now only parent sees it.
LTP test which should check that is buggy too: it clones child with CLONE_VM.
It seems nobody have noticed this for seven years, probably we should leave
it as is and document existing behavior.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
kernel/fork.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f71305d..1ea2eae 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1194,6 +1194,7 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
+ int __user *parent_tidptr,
int __user *child_tidptr,
struct pid *pid,
int trace)
@@ -1416,6 +1417,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
goto bad_fork_cleanup_io;
}

+ if (clone_flags & CLONE_PARENT_SETTID) {
+ retval = put_user(pid_vnr(pid), parent_tidptr);
+ if (retval)
+ goto bad_fork_free_pid;
+ }
+
p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
/*
* Clear TID on mm_release()?
@@ -1617,7 +1624,7 @@ static inline void init_idle_pids(struct pid_link *links)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, &init_struct_pid, 0);
+ task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0);
if (!IS_ERR(task)) {
init_idle_pids(task->pids);
init_idle(task, cpu);
@@ -1661,7 +1668,7 @@ long do_fork(unsigned long clone_flags,
}

p = copy_process(clone_flags, stack_start, stack_size,
- child_tidptr, NULL, trace);
+ parent_tidptr, child_tidptr, NULL, trace);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
@@ -1675,9 +1682,6 @@ long do_fork(unsigned long clone_flags,
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);

- if (clone_flags & CLONE_PARENT_SETTID)
- put_user(nr, parent_tidptr);
-
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);

2015-02-06 19:45:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID

On 02/06, Konstantin Khlebnikov wrote:
>
> Whole sequence looks like: task calls fork, glibc calls syscall clone with
> CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
> Child task gets read-only copy of VM including TLS. Child calls put_user()
> to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
> fault and it fails because do_wp_page() hits memcg limit without invoking
> OOM-killer because this is page-fault from kernel-space.

Because of !FAULT_FLAG_USER?

Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
I dunno.

> Put_user returns
> -EFAULT, which is ignored. Child returns into user-space and catches here
> assert (THREAD_GETMEM (self, tid) != ppid),

If only I understood why else we need CLONE_CHILD_SETTID ;)

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> post_schedule(rq);
> preempt_enable();
>
> - if (current->set_child_tid)
> - put_user(task_pid_vnr(current), current->set_child_tid);
> + if (current->set_child_tid &&
> + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
> + int dummy;
> +
> + /*
> + * If this address is unreadable then userspace has not set
> + * proper pointer. Application either doesn't care or will
> + * notice this soon. If this address is readable then task
> + * will be mislead about its own tid. It's better to die.
> + */
> + if (!get_user(dummy, current->set_child_tid) &&
> + !fatal_signal_pending(current))
> + force_sig(SIGSEGV, current);
> + }

Well, get_user() can fail the same way? The page we need to cow can be
swapped out.

At first glance, to me this problem should be solved somewhere else...
I'll try to reread this all tomorrow.

Oleg.

2015-02-06 19:56:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID

On 02/06, Oleg Nesterov wrote:
>
> On 02/06, Konstantin Khlebnikov wrote:
> >
> > Whole sequence looks like: task calls fork, glibc calls syscall clone with
> > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
> > Child task gets read-only copy of VM including TLS. Child calls put_user()
> > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
> > fault and it fails because do_wp_page() hits memcg limit without invoking
> > OOM-killer because this is page-fault from kernel-space.
>
> Because of !FAULT_FLAG_USER?
>
> Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
> I dunno.
>
> > Put_user returns
> > -EFAULT, which is ignored. Child returns into user-space and catches here
> > assert (THREAD_GETMEM (self, tid) != ppid),
>
> If only I understood why else we need CLONE_CHILD_SETTID ;)
>
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
> > post_schedule(rq);
> > preempt_enable();
> >
> > - if (current->set_child_tid)
> > - put_user(task_pid_vnr(current), current->set_child_tid);
> > + if (current->set_child_tid &&
> > + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
> > + int dummy;
> > +
> > + /*
> > + * If this address is unreadable then userspace has not set
> > + * proper pointer. Application either doesn't care or will
> > + * notice this soon. If this address is readable then task
> > + * will be mislead about its own tid. It's better to die.
> > + */
> > + if (!get_user(dummy, current->set_child_tid) &&
> > + !fatal_signal_pending(current))
> > + force_sig(SIGSEGV, current);
> > + }
>
> Well, get_user() can fail the same way? The page we need to cow can be
> swapped out.
>
> At first glance, to me this problem should be solved somewhere else...
> I'll try to reread this all tomorrow.

And in fact I think that this is not set_child_tid/etc-specific. Perhaps
I am totally confused, but I think that put_user() simply should not fail
this way. Say, why a syscall should return -EFAULT if memory allocation
"silently" fails? Confused.

Oleg.

2015-02-06 20:28:04

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID

On Fri, Feb 6, 2015 at 10:55 PM, Oleg Nesterov <[email protected]> wrote:
> On 02/06, Oleg Nesterov wrote:
>>
>> On 02/06, Konstantin Khlebnikov wrote:
>> >
>> > Whole sequence looks like: task calls fork, glibc calls syscall clone with
>> > CLONE_CHILD_SETTID and passes pointer to TLS THREAD_SELF->tid as argument.
>> > Child task gets read-only copy of VM including TLS. Child calls put_user()
>> > to handle CLONE_CHILD_SETTID from schedule_tail(). put_user() trigger page
>> > fault and it fails because do_wp_page() hits memcg limit without invoking
>> > OOM-killer because this is page-fault from kernel-space.
>>
>> Because of !FAULT_FLAG_USER?

Yep. As I see memcg triggers OOM only on page-faults and only from user-space.

>>
>> Perhaps we should fix this? Say mem_cgroup_oom_enable/disable around put_user(),
>> I dunno.
>>
>> > Put_user returns
>> > -EFAULT, which is ignored. Child returns into user-space and catches here
>> > assert (THREAD_GETMEM (self, tid) != ppid),
>>
>> If only I understood why else we need CLONE_CHILD_SETTID ;)

I dunno, CLONE_PARENT_SETTID should be enough for everybody but it's
broken too. Twice. See the next patch =)

>>
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -2312,8 +2312,20 @@ asmlinkage __visible void schedule_tail(struct task_struct *prev)
>> > post_schedule(rq);
>> > preempt_enable();
>> >
>> > - if (current->set_child_tid)
>> > - put_user(task_pid_vnr(current), current->set_child_tid);
>> > + if (current->set_child_tid &&
>> > + unlikely(put_user(task_pid_vnr(current), current->set_child_tid))) {
>> > + int dummy;
>> > +
>> > + /*
>> > + * If this address is unreadable then userspace has not set
>> > + * proper pointer. Application either doesn't care or will
>> > + * notice this soon. If this address is readable then task
>> > + * will be mislead about its own tid. It's better to die.
>> > + */
>> > + if (!get_user(dummy, current->set_child_tid) &&
>> > + !fatal_signal_pending(current))
>> > + force_sig(SIGSEGV, current);
>> > + }
>>
>> Well, get_user() can fail the same way? The page we need to cow can be
>> swapped out.
>>
>> At first glance, to me this problem should be solved somewhere else...
>> I'll try to reread this all tomorrow.
>
> And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> I am totally confused, but I think that put_user() simply should not fail
> this way. Say, why a syscall should return -EFAULT if memory allocation
> "silently" fails? Confused.

That's how memcg works. All other places are handled explicitly and
returns into user-space as -ENOMEM or -EFAULT.
Probably some strange numa policy / memoryaffinity might trigger this too.

Probably all page-faults must be forced to succeed or die mode,
in this case all errors in put_user/copy_to_user could be simply ignored.

--
Konstantin

2015-02-06 20:34:29

by Oleg Nesterov

[permalink] [raw]
Subject: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)

Add cc's.

On 02/06, Oleg Nesterov wrote:
>
> And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> I am totally confused, but I think that put_user() simply should not fail
> this way. Say, why a syscall should return -EFAULT if memory allocation
> "silently" fails? Confused.

Seriously. I must have missed something, but I can't understand 519e52473eb
"mm: memcg: enable memcg OOM killer only for user faults".

The changelog says:

System calls and kernel faults (uaccess, gup) can handle an out of
memory situation gracefully and just return -ENOMEM.

How can a system call know it should return -ENOMEM if put_user() can only
return -EFAULT ?

Oleg.

2015-02-06 20:49:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
<[email protected]> wrote:
> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
> from put_user() is ignored. Glibc completely relies on that feature and uses
> value returned from syscall only for error checking.

I'm not seeing the advantage of the error checking part of the pacth
patch. It generates extra code, possibly changing existing interfaces,
and it doesn't actually buy us anything.

What's the upside? If somebody passes in a bad pointer, it's their
problem. For all we know, people used to pass in NULL, even if they
had the SETTID bit set. This makes it now return EFAULT.

So I don't mind moving things into copy_process(), but I *do* mind the
new error return thing.

It's actually better in this patch than in 1/2, because 1/2 was just
insane with the whole "readable vs writable" thing. That I refuse to
even look at, for fear of going blind.

Linus

2015-02-06 21:08:58

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

I am not sure about these changes too, but

On 02/06, Linus Torvalds wrote:
>
> What's the upside? If somebody passes in a bad pointer, it's their
> problem.

Yes. But unless I am totally confused (quite possible) this put_user()
can fail even if the pointer is valid.

So at least I think Konstantin has found a real problem. Which (I think)
shoud be fixed, and this is not connected to fork.

> insane with the whole "readable vs writable" thing.

Agreed, this part doesn't look nice in any case.

Oleg.

2015-02-06 21:13:27

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
> <[email protected]> wrote:
>> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
>> from put_user() is ignored. Glibc completely relies on that feature and uses
>> value returned from syscall only for error checking.
>
> I'm not seeing the advantage of the error checking part of the pacth
> patch. It generates extra code, possibly changing existing interfaces,
> and it doesn't actually buy us anything.
>
> What's the upside? If somebody passes in a bad pointer, it's their
> problem. For all we know, people used to pass in NULL, even if they
> had the SETTID bit set. This makes it now return EFAULT.

Currently that works fine only because kernel retries 0-order allocations
endlessly. But pagefault_out_of_memory() is never called for non-user PF.
For kernel PF all oom-kills are triggered by buddy-allocator.
If buddy allocator gave up earlier then page-faults from kernel space
could fail without OOM. And in CoW area user-space will see stale data.
So, either we must handle all put_user/copy_to_user errors (which isn't
that bad idea) or kernel must force all PF to success-or-die policy.

First patch is that ugly because kernel has never checked errors
in that place. So, I've tried to find solution which could fix problem
without breaking backward compatibility.

>
> So I don't mind moving things into copy_process(), but I *do* mind the
> new error return thing.
>
> It's actually better in this patch than in 1/2, because 1/2 was just
> insane with the whole "readable vs writable" thing. That I refuse to
> even look at, for fear of going blind.
>
> Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2015-02-06 21:55:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <[email protected]> wrote:
> On Fri, Feb 6, 2015 at 11:49 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Fri, Feb 6, 2015 at 8:23 AM, Konstantin Khlebnikov
>> <[email protected]> wrote:
>>> Handling of flag CLONE_PARENT_SETTID has the same problem: error returned
>>> from put_user() is ignored. Glibc completely relies on that feature and uses
>>> value returned from syscall only for error checking.
>>
>> I'm not seeing the advantage of the error checking part of the pacth
>> patch. It generates extra code, possibly changing existing interfaces,
>> and it doesn't actually buy us anything.
>>
>> What's the upside? If somebody passes in a bad pointer, it's their
>> problem. For all we know, people used to pass in NULL, even if they
>> had the SETTID bit set. This makes it now return EFAULT.
>
> Currently that works fine only because kernel retries 0-order allocations
> endlessly. But pagefault_out_of_memory() is never called for non-user PF.
> For kernel PF all oom-kills are triggered by buddy-allocator.
> If buddy allocator gave up earlier then page-faults from kernel space
> could fail without OOM. And in CoW area user-space will see stale data.
> So, either we must handle all put_user/copy_to_user errors (which isn't
> that bad idea) or kernel must force all PF to success-or-die policy.
>
> First patch is that ugly because kernel has never checked errors
> in that place. So, I've tried to find solution which could fix problem
> without breaking backward compatibility.

If you're really worried about compatibility, it would be possible, if
really really ugly, to check whether there's a vma at all at the
requested address and to return -EFAULT only in the case where there
is a vma but put_user still failed.

A less awful approach might be to accept put_user failures if the
address is NULL.

--Andy

2015-02-06 22:11:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 2/2] kernel/fork: handle put_user errors for CLONE_PARENT_SETTID

On Fri, Feb 6, 2015 at 1:13 PM, Konstantin Khlebnikov <[email protected]> wrote:
>
> Currently that works fine only because kernel retries 0-order allocations
> endlessly. But pagefault_out_of_memory() is never called for non-user PF.
> For kernel PF all oom-kills are triggered by buddy-allocator.

This makes no sense.

You're trying to fix what you perceive as a problem in the page fault
handling in some totally different place.

If *that* is what you are worried about, then damnit, just fix the
page fault handler for the kernel case to send a signal or something
for VM_FAULT_ENOMEM. Or, better yet, make it just trigger oom at
return to user space - we could add a _TIF_OOM flag, for example, and
make it part of the user-return logic (do_notify_resume), kind of how
_TIF_SIGPENDING triggers a signal.

Don't try to make horrible code in insane places that have nothing to
do with the fundamental problem. Why did you pick this particular
get/put user anyway? There are tons others that we don't test, why did
you happen pick these and then make it have that horrible and
senseless error handling?

Because at *NO* point was it obvious that that patch had anything at
all to do with "out of memory". Not in the code, not in your commit
messages, *nowhere*.

There is no way in hell I will take that kind of changes when you
didn't even articulate why you wanted to do them in the commit
messages, and the patches didn't look like they had anything to do
with oom either.

Linus

2015-02-10 16:19:57

by Johannes Weiner

[permalink] [raw]
Subject: Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)

On Fri, Feb 06, 2015 at 09:32:46PM +0100, Oleg Nesterov wrote:
> Add cc's.
>
> On 02/06, Oleg Nesterov wrote:
> >
> > And in fact I think that this is not set_child_tid/etc-specific. Perhaps
> > I am totally confused, but I think that put_user() simply should not fail
> > this way. Say, why a syscall should return -EFAULT if memory allocation
> > "silently" fails? Confused.
>
> Seriously. I must have missed something, but I can't understand 519e52473eb
> "mm: memcg: enable memcg OOM killer only for user faults".
>
> The changelog says:
>
> System calls and kernel faults (uaccess, gup) can handle an out of
> memory situation gracefully and just return -ENOMEM.

The cover letter of the original patch series is hidden in the
changelog a few commits before this one:

commit 94bce453c78996cc4373d5da6cfabe07fcc6d9f9
Author: Johannes Weiner <[email protected]>
Date: Thu Sep 12 15:13:36 2013 -0700

arch: mm: remove obsolete init OOM protection

The memcg code can trap tasks in the context of the failing allocation
until an OOM situation is resolved. They can hold all kinds of locks
(fs, mm) at this point, which makes it prone to deadlocking.

This series converts memcg OOM handling into a two step process that is
started in the charge context, but any waiting is done after the fault
stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead, kernel
faults, because they can gracefully unwind the stack with -ENOMEM. OOM
handling is restricted to user triggered faults that have no other
option.

We had reports of systems deadlocking because the allocating task
would wait for the OOM killer to make progress, and the OOM-killed
task would wait for a lock held by the allocating task. It's
important to note here that the OOM killer currently does not kill a
second task until the current victim has exited, because that victim
has special rights to access to emergency reserves and we don't want
to risk overcommitting those.

To address that deadlock, I decided to no longer invoke memcg OOM
handling from the allocation context directly, but instead remember
the situation in the task_struct and handle it before restarting the
fault. However, in-kernel faults do not restart, maybe even succeed
despite allocation failures (i.e. readahead), so under the assumption
that they have to handle error returns anyway, I disabled invoking the
memcg OOM killer for in-kernel faults altogether.

> How can a system call know it should return -ENOMEM if put_user() can only
> return -EFAULT ?

I see the problem, but allocations can not be guaranteed to succeed,
not even the OOM killer can reliably make progress, depending on the
state the faulting process is in, the locks it is holding. So what
can we do if that allocation fails? Even if we go the route that
Linus proposes and make OOM situations more generic and check them on
*every* return to userspace, the OOM handler at that point might still
kill a task more suited to free memory than the faulting one, and so
we still have to communicate the proper error value to the syscall.

However, I think we could go back to invoking OOM from all allocation
contexts again as long as we change allocator and OOM killer to not
wait for individual OOM victims to exit indefinitely (unless it's a
__GFP_NOFAIL context). Maybe wait for some time on the first victim
before moving on to the next one. That would reduce the likelihood of
failing allocations without reinstating the original deadlock.
put_user() could still technically fail due to allocation failure, but
it wouldn't be a very likely event. Would that be okay?

This does not just apply to memcg allocations, btw. Physical page
allocations have recently been reported to deadlock in a similar
fashion because of low orders implying __GFP_NOFAIL. Making the OOM
killer more robust and more eager to make progress would allow us to
remove the implied __GFP_NOFAIL while maintaining reasonable quality
of service in the allocator.

What do you think?

2015-02-10 19:49:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)

On 02/10, Johannes Weiner wrote:
>
> We had reports of systems deadlocking because

Yes, yes, to some degree I understand why it was done this way. Not
that I understand the details of course. Thanks for your explanations.

> > How can a system call know it should return -ENOMEM if put_user() can only
> > return -EFAULT ?
>
> I see the problem, but allocations can not be guaranteed to succeed,
> not even the OOM killer can reliably make progress,

Yes sure,

> So what
> can we do if that allocation fails? Even if we go the route that
> Linus proposes and make OOM situations more generic and check them on
> *every* return to userspace, the OOM handler at that point might still
> kill a task more suited to free memory than the faulting one, and so
> we still have to communicate the proper error value to the syscall.

Yes. To me this means that if a page fault from kernel-space fails because
of VM_FAULT_OOM the task should be killed in any case. Except we should
obviously exclude gup/kthreads.

We can't retry in this case and (say) schedule_tail() simply can't report
or handle the failure. Imho it would be better to kill the task loudly,
perhaps with a warning.

To avoid the confusion. Of course, it is not that I am trying to simply
add send_sig(SIGKILL) into the failure paths. My only point is that,
whatever we do, the "silent" or misleading failure is worse than SIGKILL.

The application can't really "handle an out of memory situation gracefully"
as the changlelog says. Even if put_user() (and thus syscall) could return
-ENOMEM, this doesn't really matter I think.

> However, I think we could go back to invoking OOM from all allocation
> contexts again as long as we change allocator and OOM killer to not
> wait for individual OOM victims to exit indefinitely (unless it's a
> __GFP_NOFAIL context). Maybe wait for some time on the first victim
> before moving on to the next one.

perhaps... can't really comment, at least right now.

> What do you think?

So far I only think that this problem is not trivial ;)

Oleg.