Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755284Ab0H3KHK (ORCPT ); Mon, 30 Aug 2010 06:07:10 -0400 Received: from mx1.redhat.com ([209.132.183.28]:63936 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754936Ab0H3KHG (ORCPT ); Mon, 30 Aug 2010 06:07:06 -0400 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit From: Roland McGrath To: Solar Designer X-Fcc: ~/Mail/linus Cc: Kees Cook , linux-kernel@vger.kernel.org, oss-security@lists.openwall.com, Al Viro , Andrew Morton , Oleg Nesterov , KOSAKI Motohiro , Neil Horman , linux-fsdevel@vger.kernel.org Subject: Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer In-Reply-To: Solar Designer's message of Monday, 30 August 2010 07:23:31 +0400 <20100830032331.GA22773@openwall.com> References: <20100827220258.GF4703@outflux.net> <20100830005648.431B7400D9@magilla.sf.frob.com> <20100830032331.GA22773@openwall.com> Emacs: the definitive fritterware. Message-Id: <20100830100616.78971400D9@magilla.sf.frob.com> Date: Mon, 30 Aug 2010 03:06:16 -0700 (PDT) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3862 Lines: 91 > IIRC, prior to that fix, I was able to cause the kernel to loop for tens > of minutes in a single execve() call on an Alpha with 128 MB RAM, by > using repeated mappings of the same pages (almost 200 GB total). And I say, if your userland process could really allocate another 200GB, then more power to you, you can do it with an exec too. If you could do the same with a userland stack allocation, and spend all that time in strlen calls and then memcpy, you can do it inside execve too. If it takes days, that's what you asked for, and it's your process. It just ought to be every bit (or near enough) as preemptible and interruptible as that normal userland activity ought to be. So, perhaps we want this (count already has a cond_resched in its loop): diff --git a/fs/exec.c b/fs/exec.c index 2d94552..0000000 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -369,6 +369,9 @@ static int count(const char __user * con for (;;) { const char __user * p; + if (signal_pending(current)) + return -ERESTARTNOINTR; + if (get_user(p, argv)) return -EFAULT; if (!p) @@ -400,6 +403,10 @@ static int copy_strings(int argc, const int len; unsigned long pos; + if (signal_pending(current)) + return -ERESTARTNOINTR; + cond_resched(); + if (get_user(str, argv+argc) || !(len = strnlen_user(str, MAX_ARG_STRLEN))) { ret = -EFAULT; > Now it appears that, besides the issue that started this thread, the > same problem I mentioned above got re-introduced. We still have > strnlen_user() and the "max" argument to count(), but we no longer have > hard limits for "max". Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and > this is just too much. MAX_ARG_STRLEN is set to 32 pages, and these two > combined allow a userspace program to make the kernel loop for days. I really don't think we need that stuff back. I think we can get rid of it and fix the real problems, and be happier overall. > > But it sounds like all you really need is to fix the OOM/allocation > > behavior for huge stack allocations. > > No, we need both. I don't agree. If all of the implicit allocation done inside execve is accounted and controlled as well as normal userland allocations so that the execve fails when userland allocation would fail, then there is no reason for special-case arbitrary limits. > Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the > stack tries to expand around the address space when shifted". Perhaps > limiting the stack size would deal with that, but maybe the "bug" needs > to be patched elsewhere as well. grsecurity has the following hunk: That change seems like it might be reasonable, but I haven't really looked at shift_arg_pages before. Has someone reported this BUG_ON failure mode with a reproducer? > Overall, there are multiple issues here (maybe up to four?) and multiple > things to review the code for. Agreed. But IMHO the missing arbitrary limits on arg/env size are not among them. I don't know about shift_arg_pages. The core fix I think makes sense is making the nascent mm get accounted to the user process normally. Rather than better enabling OOM killing, I think what really makes sense is for the nascent mm to be marked such that allocations in it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just fail with ENOMEM before it resorts to the OOM killer (or perhaps even to very aggressive pageout). That should percolate back to the execve just failing with ENOMEM, which is nicer than OOM kill even if the OOM killer actually does pick exactly and only the right target. Thanks, Roland -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/