Date: Tue, 10 Feb 2015 20:47:43 +0100
From: Oleg Nesterov
To: Johannes Weiner
Cc: Konstantin Khlebnikov, Michal Hocko, KAMEZAWA Hiroyuki,
    KOSAKI Motohiro, linux-api@vger.kernel.org, Andrew Morton,
    Linus Torvalds, linux-kernel@vger.kernel.org, Roman Gushchin,
    Nikita Vetoshkin, Pavel Emelyanov
Subject: Re: memcg && uaccess (Was: [PATCH 1/2] kernel/fork: handle put_user errors for CLONE_CHILD_SETTID/CLEARTID)
Message-ID: <20150210194743.GA17333@redhat.com>
References: <20150206162301.18031.32251.stgit@buzz>
    <20150206194405.GA13960@redhat.com>
    <20150206195529.GA15517@redhat.com>
    <20150206203246.GA16924@redhat.com>
    <20150210161941.GB11212@phnom.home.cmpxchg.org>
In-Reply-To: <20150210161941.GB11212@phnom.home.cmpxchg.org>

On 02/10, Johannes Weiner wrote:
>
> We had reports of systems deadlocking because

Yes, yes, to some degree I understand why it was done this way. Not that
I understand the details of course. Thanks for your explanations.

> > How can a system call know it should return -ENOMEM if put_user() can only
> > return -EFAULT ?
>
> I see the problem, but allocations can not be guaranteed to succeed,
> not even the OOM killer can reliably make progress,

Yes sure,

> So what
> can we do if that allocation fails?
> Even if we go the route that
> Linus proposes and make OOM situations more generic and check them on
> *every* return to userspace, the OOM handler at that point might still
> kill a task more suited to free memory than the faulting one, and so
> we still have to communicate the proper error value to the syscall.

Yes.

To me this means that if a page fault from kernel-space fails because
of VM_FAULT_OOM the task should be killed in any case. Except we should
obviously exclude gup/kthreads.

We can't retry in this case and (say) schedule_tail() simply can't report
or handle the failure. Imho it would be better to kill the task loudly,
perhaps with a warning. To avoid the confusion.

Of course, it is not that I am trying to simply add send_sig(SIGKILL) into
the failure paths. My only point is that, whatever we do, the "silent" or
misleading failure is worse than SIGKILL. The application can't really
"handle an out of memory situation gracefully" as the changelog says.

Even if put_user() (and thus syscall) could return -ENOMEM, this doesn't
really matter I think.

> However, I think we could go back to invoking OOM from all allocation
> contexts again as long as we change allocator and OOM killer to not
> wait for individual OOM victims to exit indefinitely (unless it's a
> __GFP_NOFAIL context). Maybe wait for some time on the first victim
> before moving on to the next one.

perhaps... can't really comment, at least right now.

> What do you think?

So far I only think that this problem is not trivial ;)

Oleg.