2011-03-10 14:37:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

(add cc's)

> Subject: x86/mm: handle mm_fault_error() in kernel space
> From: Andrey Vagin <[email protected]>
>
> mm_fault_error() should not execute oom-killer, if page fault occurs in
> kernel space. E.g. in copy_from_user/copy_to_user.

Why? I don't understand this part.

> This would happen if we find ourselves in OOM on a copy_to_user(), or a
> copy_from_user() which faults.
>
> Without this patch, the kernels hangs up in copy_from_user, because OOM
> killer sends SIG_KILL to current process,

This depends. OOM can choose another victim, and if it does we shouldn't
return -EFAULT.

> but it can't handle a signal
> while in syscall, then the kernel returns to copy_from_user, reexcute
> current command and provokes page_fault again.

Yes. This is buggy.

> --- a/arch/x86/mm/fault.c~x86-mm-handle-mm_fault_error-in-kernel-space
> +++ a/arch/x86/mm/fault.c
> @@ -827,6 +827,13 @@ mm_fault_error(struct pt_regs *regs, uns
> unsigned long address, unsigned int fault)
> {
> if (fault & VM_FAULT_OOM) {
> + /* Kernel mode? Handle exceptions or die: */
> + if (!(error_code & PF_USER)) {
> + up_read(&current->mm->mmap_sem);
> + no_context(regs, error_code, address);
> + return;
> + }
> +

At first glance, this is not optimal...

Perhaps I missed something, but afaics it is better to call
out_of_memory() first, then check if current was killed. In this case
no_context() is fine, we are not going to return to the user-mode.

IOW, what do you think about the (untested/uncompiled) patch below?

Oleg.

--- x/arch/x86/mm/fault.c
+++ x/arch/x86/mm/fault.c
@@ -829,6 +829,11 @@ mm_fault_error(struct pt_regs *regs, uns
{
if (fault & VM_FAULT_OOM) {
out_of_memory(regs, error_code, address);
+
+ if (!(error_code & PF_USER) && fatal_signal_pending(current)) {
+ no_context(regs, error_code, address);
+ return;
+ }
} else {
if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
VM_FAULT_HWPOISON_LARGE))


2011-03-10 19:31:13

by Andrew Vagin

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

On 03/10/2011 05:28 PM, Oleg Nesterov wrote:
> (add cc's)
>
>> Subject: x86/mm: handle mm_fault_error() in kernel space
>> From: Andrey Vagin<[email protected]>
>>
>> mm_fault_error() should not execute oom-killer, if page fault occurs in
>> kernel space. E.g. in copy_from_user/copy_to_user.
> Why? I don't understand this part.
I thought for a bit more...

I think we should not execute out_of_memory() in this case at all,
because when we return from page fault, we execute the same command and
provoke the "same" page fault again. Now pls think what is the
difference between these page faults? It has been generated from one
place and the program do nothing between those. I think we try to handle
one error two times and it isn't good. If handle_mm_fault() returns
VM_FAULT_OOM and pagefault occurred from userspace, the current task
should be killed by SIGKILL, then we should return from page fault back
to userspace for the task to handle the signal. If handle_mm_fault()
returns VM_FAULT_OOM and pagefault occurred in kernel space, we should
execute no_context() to return from syscall.

Also note that out_of_memory is usually called from handle_mm_fault() ->
... -> alloc_page()->...->out_of_memory().

>> This would happen if we find ourselves in OOM on a copy_to_user(), or a
>> copy_from_user() which faults.
>>
>> Without this patch, the kernels hangs up in copy_from_user, because OOM
>> killer sends SIG_KILL to current process,
> This depends. OOM can choose another victim, and if it does we shouldn't
> return -EFAULT.
>
>> but it can't handle a signal
>> while in syscall, then the kernel returns to copy_from_user, reexcute
>> current command and provokes page_fault again.
> Yes. This is buggy.
>
>> --- a/arch/x86/mm/fault.c~x86-mm-handle-mm_fault_error-in-kernel-space
>> +++ a/arch/x86/mm/fault.c
>> @@ -827,6 +827,13 @@ mm_fault_error(struct pt_regs *regs, uns
>> unsigned long address, unsigned int fault)
>> {
>> if (fault& VM_FAULT_OOM) {
>> + /* Kernel mode? Handle exceptions or die: */
>> + if (!(error_code& PF_USER)) {
>> + up_read(&current->mm->mmap_sem);
>> + no_context(regs, error_code, address);
>> + return;
>> + }
>> +
> At first glance, this is not optimal...
>
> Perhaps I missed something, but afaics it is better to call
> out_of_memory() first, then check if current was killed. In this case
> no_context() is fine, we are not going to return to the user-mode.
It's not first oom, so I perfere don't ca
> IOW, what do you think about the (untested/uncompiled) patch below?
>
> Oleg.
>
> --- x/arch/x86/mm/fault.c
> +++ x/arch/x86/mm/fault.c
> @@ -829,6 +829,11 @@ mm_fault_error(struct pt_regs *regs, uns
> {
> if (fault& VM_FAULT_OOM) {
> out_of_memory(regs, error_code, address);
> +
> + if (!(error_code& PF_USER)&& fatal_signal_pending(current)) {
> + no_context(regs, error_code, address);
> + return;
> + }
> } else {
> if (fault& (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> VM_FAULT_HWPOISON_LARGE))
>
>

2011-03-11 11:28:28

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

On 03/10, Andrew Vagin wrote:
>
> On 03/10/2011 05:28 PM, Oleg Nesterov wrote:
>> (add cc's)
>>
>>> Subject: x86/mm: handle mm_fault_error() in kernel space
>>> From: Andrey Vagin<[email protected]>
>>>
>>> mm_fault_error() should not execute oom-killer, if page fault occurs in
>>> kernel space. E.g. in copy_from_user/copy_to_user.
>> Why? I don't understand this part.
> I thought for a bit more...
>
> I think we should not execute out_of_memory() in this case at all,

Why?

Btw, this may be true, but this is irrelevant. If we shouldn't call
out_of_memory() in this case, then we shouldn't call it at all, even
if PF_USER.

Andrew, I think you missed the point. Or I misunderstood. Or both ;)

> because when we return from page fault, we execute the same command and
> provoke the "same" page fault again

Sure. And the same happens if the fault occurs in user-space and
handle_mm_fault() returns VM_FAULT_OOM. This is correct.

> Now pls think what is the
> difference between these page faults?

The difference is that oom-killer should free the memory in between.
_OR_ it can decide to kill us, and _this_ case should be fixed.

> It has been generated from one
> place and the program do nothing between those.

The program does nothing, but the kernel does.

> If handle_mm_fault() returns
> VM_FAULT_OOM and pagefault occurred from userspace, the current task
> should be killed by SIGKILL,

Why do you think the current task should be killed? In this case we
do not need oom-killer at all, we could always kill the caller of
alloc_page/etc.

Suppose that the innocent task (which doesn't use a lot of memory) calls,
say, sys_read() into the unpopulated memory. Suppose that alloc_page()
fails because we have a memory hog which tries to eat all memory.

Do you think the innocent task should be punished in this case?

Assuming that mm/oom_kill.c:out_of_memory() is correct, it should find
the memory hog and kill it, after that we can retry the fault in a hope
we have more memory.

PF_USER is not relevant. If the application does mmap() and then
accesses this memory, memcpy() or copy_from_user() should follow the
same logic wrt OOM.

> If handle_mm_fault()
> returns VM_FAULT_OOM and pagefault occurred in kernel space, we should
> execute no_context() to return from syscall.

Only if current was killed by oom-killer, that is why my patch checks
fatal_signal_pending().

> Also note that out_of_memory is usually called from handle_mm_fault() ->
> ... -> alloc_page()->...->out_of_memory().

And note that pagefault_out_of_memory() checks TIF_MEMDIE and calls
schedule_timeout_uninterruptible(). This is exactly because if we are
_not_ killed by oom-killer, we are going to retry later once the killed
task frees the memory.

See?

Oleg.

2011-03-11 14:22:30

by Andrei Vagin

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

On 03/11/2011 02:19 PM, Oleg Nesterov wrote:
> On 03/10, Andrew Vagin wrote:
>> On 03/10/2011 05:28 PM, Oleg Nesterov wrote:
>>> (add cc's)
>>>
>>>> Subject: x86/mm: handle mm_fault_error() in kernel space
>>>> From: Andrey Vagin<[email protected]>
>>>>
>>>> mm_fault_error() should not execute oom-killer, if page fault occurs in
>>>> kernel space. E.g. in copy_from_user/copy_to_user.
>>> Why? I don't understand this part.
>> I thought for a bit more...
>>
>> I think we should not execute out_of_memory() in this case at all,
> Why?
>
> Btw, this may be true, but this is irrelevant. If we shouldn't call
> out_of_memory() in this case, then we shouldn't call it at all, even
> if PF_USER.
Yes. We shouldn't call it at all, even if PF_USER.
> Andrew, I think you missed the point. Or I misunderstood. Or both ;)
>
>> because when we return from page fault, we execute the same command and
>> provoke the "same" page fault again
> Sure. And the same happens if the fault occurs in user-space and
> handle_mm_fault() returns VM_FAULT_OOM. This is correct.
When does handle_mm_fault() return VM_FAULT_OOM? It may be if current
task is killed or a file system returns VM_FAULT_OOM. All allocations of
memory (alloc_pages(), kmalloc()) calls out_of_memory() themselves, wait
it and try allocate memory again.
>> Now pls think what is the
>> difference between these page faults?
> The difference is that oom-killer should free the memory in between.
> _OR_ it can decide to kill us, and _this_ case should be fixed.
We wait memory in __alloc_pages_may_oom(). I think now handle_mm_fault()
returns VM_FAULT_OOM only if OOM-killer killed current task.
>> It has been generated from one
>> place and the program do nothing between those.
> The program does nothing, but the kernel does.
The kernel code may do it in one page fault...

The kernel should return from page fault back to userspace for the task
to handle the signal or to continue execute userspace code.
>> If handle_mm_fault() returns
>> VM_FAULT_OOM and pagefault occurred from userspace, the current task
>> should be killed by SIGKILL,
> Why do you think the current task should be killed? In this case we
> do not need oom-killer at all, we could always kill the caller of
> alloc_page/etc.
You don't understand. alloc_page calls oom-killer himself, then try
allocate memory again. Pls look at __alloc_pages_slowpath().
__alloc_pages_slowpat may fail if order > 3 || gfp_mask & __GFP_NOFAIL
|| test_thread_flag(TIF_MEMDIE)

> Suppose that the innocent task (which doesn't use a lot of memory) calls,
> say, sys_read() into the unpopulated memory. Suppose that alloc_page()
> fails because we have a memory hog which tries to eat all memory.
alloc_page() calls oom-killer, which kill a memory hog.
> Do you think the innocent task should be punished in this case?
Probaly you think that oom-killer is called from mm_fault_error() only.
It's incorrect.
> Assuming that mm/oom_kill.c:out_of_memory() is correct, it should find
> the memory hog and kill it, after that we can retry the fault in a hope
> we have more memory.
>
> PF_USER is not relevant. If the application does mmap() and then
> accesses this memory, memcpy() or copy_from_user() should follow the
> same logic wrt OOM.
>
>> If handle_mm_fault()
>> returns VM_FAULT_OOM and pagefault occurred in kernel space, we should
>> execute no_context() to return from syscall.
> Only if current was killed by oom-killer, that is why my patch checks
> fatal_signal_pending().
>
>> Also note that out_of_memory is usually called from handle_mm_fault() ->
>> ... -> alloc_page()->...->out_of_memory().
> And note that pagefault_out_of_memory() checks TIF_MEMDIE and calls
> schedule_timeout_uninterruptible(). This is exactly because if we are
> _not_ killed by oom-killer, we are going to retry later once the killed
> task frees the memory.
>
> See?
>
> Oleg.
>

2011-03-11 17:06:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

On 03/11, Andrew Vagin wrote:
>
> On 03/11/2011 02:19 PM, Oleg Nesterov wrote:
>>
>> Btw, this may be true, but this is irrelevant. If we shouldn't call
>> out_of_memory() in this case, then we shouldn't call it at all, even
>> if PF_USER.
>
> Yes. We shouldn't call it at all, even if PF_USER.

Then why did you send this patch? If we should not call it, then we
should kill pagefault_out_of_memory() and update the callers instead
of adding the special 'if (PF_USER)' checks.

Yes, the current pagefault_out_of_memory() logic looks a bit suspicious,
but this needs another discussion. Once again, I am arguing against making
it depend on PF_USER, this was my point from the very beginning.

>>> Now pls think what is the
>>> difference between these page faults?
>> The difference is that oom-killer should free the memory in between.
>> _OR_ it can decide to kill us, and _this_ case should be fixed.
>
> We wait memory in __alloc_pages_may_oom(). I think now handle_mm_fault()
> returns VM_FAULT_OOM only if OOM-killer killed current task.

I don't think so, but this doesn't matter.

Once again, if OOM-killer killed current task we do not retry. That is
why my patch checks fatal_signal_pending() to fix the bug. That is all.

The point is, if current was _NOT_ killed we should follow the current
pagefault_out_of_memory() logic or remove pagefault_out_of_memory()
completely.

>> Why do you think the current task should be killed? In this case we
>> do not need oom-killer at all, we could always kill the caller of
>> alloc_page/etc.
>
> You don't understand. alloc_page calls oom-killer himself, then try
> allocate memory again. Pls look at __alloc_pages_slowpath().
> __alloc_pages_slowpat may fail if order > 3 || gfp_mask & __GFP_NOFAIL
> || test_thread_flag(TIF_MEMDIE)

Andrew, please, I know this.

> Probaly you think that oom-killer is called from mm_fault_error() only.
> It's incorrect.

And this too ;)

If nothing else. alloc_page doesn't call oom-killer if it is already
in progress. At least in this case we should retry after it completes.


Either way, I believe this patch should be dropped.

Oleg.

2011-03-12 21:20:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

On 03/11, Oleg Nesterov wrote:
>
> On 03/11, Andrew Vagin wrote:
> >
> >
> The point is, if current was _NOT_ killed we should follow the current
> pagefault_out_of_memory() logic or remove pagefault_out_of_memory()
> completely.

Yes, and I still think this is valid. And thus I still think the patch
should be changed (btw, this problem is not x86 specific).

However,

> >> Why do you think the current task should be killed? In this case we
> >> do not need oom-killer at all, we could always kill the caller of
> >> alloc_page/etc.
> >
> > You don't understand. alloc_page calls oom-killer himself, then try
> > allocate memory again. Pls look at __alloc_pages_slowpath().
> > __alloc_pages_slowpat may fail if order > 3 || gfp_mask & __GFP_NOFAIL
> > || test_thread_flag(TIF_MEMDIE)
>
> Andrew, please, I know this.

Hmm. It turns out I do not ;)

I thought I can find the case when handle_mm_fault() returns VM_FAULT_OOM
and the caller is not killed, but I can't. I do not really understand
mem_cgroup_handle_oom/etc, but it seems we always retry indefinitely even
with mem_cgroup's. mm/hugetlb.c looks fine too...

So, I have to apologize, I am starting to think you are right.



Maybe someone could explain why pagefault_out_of_memory() is still
needed?

Oleg.

2011-03-13 10:29:27

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: + x86-mm-handle-mm_fault_error-in-kernel-space.patch added to -mm tree

> On 03/11, Oleg Nesterov wrote:
> >
> > On 03/11, Andrew Vagin wrote:
> > >
> > >
> > The point is, if current was _NOT_ killed we should follow the current
> > pagefault_out_of_memory() logic or remove pagefault_out_of_memory()
> > completely.
>
> Yes, and I still think this is valid. And thus I still think the patch
> should be changed (btw, this problem is not x86 specific).
>
> However,
>
> > >> Why do you think the current task should be killed? In this case we
> > >> do not need oom-killer at all, we could always kill the caller of
> > >> alloc_page/etc.
> > >
> > > You don't understand. alloc_page calls oom-killer himself, then try
> > > allocate memory again. Pls look at __alloc_pages_slowpath().
> > > __alloc_pages_slowpat may fail if order > 3 || gfp_mask & __GFP_NOFAIL
> > > || test_thread_flag(TIF_MEMDIE)
> >
> > Andrew, please, I know this.
>
> Hmm. It turns out I do not ;)
>
> I thought I can find the case when handle_mm_fault() returns VM_FAULT_OOM
> and the caller is not killed, but I can't. I do not really understand
> mem_cgroup_handle_oom/etc, but it seems we always retry indefinitely even
> with mem_cgroup's. mm/hugetlb.c looks fine too...
>
> So, I have to apologize, I am starting to think you are right.
>
> Maybe someone could explain why pagefault_out_of_memory() is still
> needed?

Hi Oleg, Andrew,

Now you are seeing VM dark side. ;-)
Two independent commit were introduced this hard to understand code.

commit 1c0fe6e3bda0464728c23c8d84aa47567e8b716c
Author: Nick Piggin <[email protected]>
Date: Tue Jan 6 14:38:59 2009 -0800

mm: invoke oom-killer from page fault

commit 6583bb64fc370842b32a87c67750c26f6d559af0
Author: David Rientjes <[email protected]>
Date: Wed Jul 29 15:02:06 2009 -0700

mm: avoid endless looping for oom killed tasks

Most typical case is, as andew described, handle_mm_fault -> pte_alloc_one
-> alloc_pages_current(GFP_KERNEL, 0). and order 0 GFP_KERNEL allocation
never fail except the task received TIF_MEMDIE. therefore, in this case,
no need additional pageout_out_of_memory() call. Anyway pageout_out_of_memory()
is no-op if the task has already TIF_MEMDIE.

But, we don't have any gurantee pagefault path have no large allocation
nor no GFP_ATOMIC allocation. Therefore I think Oleg's patch pointed out
right thing. The protocol is, vma->vm_ops->fault() can return VM_FAULT_OOM
and if it is, page fault handler should invoke out-of-memory.

But I doubt practical workload can observe the difference.