MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
From: Roland McGrath <roland@redhat.com>
To: Solar Designer <solar@openwall.com>
Cc: Kees Cook <kees.cook@canonical.com>, linux-kernel@vger.kernel.org,
        oss-security@lists.openwall.com, Al Viro <viro@zeniv.linux.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>,
        Oleg Nesterov <oleg@redhat.com>,
        KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
        Neil Horman <nhorman@tuxdriver.com>, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH] exec argument expansion can inappropriately trigger OOM-killer
In-Reply-To: Solar Designer's message of  Monday, 30 August 2010 07:23:31 +0400 <20100830032331.GA22773@openwall.com>
References: <20100827220258.GF4703@outflux.net>
	<20100830005648.431B7400D9@magilla.sf.frob.com>
	<20100830032331.GA22773@openwall.com>
Emacs: the definitive fritterware.
Message-Id: <20100830100616.78971400D9@magilla.sf.frob.com>
Date: Mon, 30 Aug 2010 03:06:16 -0700 (PDT)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3862
Lines: 91

> IIRC, prior to that fix, I was able to cause the kernel to loop for tens
> of minutes in a single execve() call on an Alpha with 128 MB RAM, by
> using repeated mappings of the same pages (almost 200 GB total).

And I say, if your userland process could really allocate another 200GB,
then more power to you, you can do it with an exec too.  If you could do
the same with a userland stack allocation, and spend all that time in
strlen calls and then memcpy, you can do it inside execve too.  If it
takes days, that's what you asked for, and it's your process.  It just
ought to be every bit (or near enough) as preemptible and interruptible
as that normal userland activity ought to be.

So, perhaps we want this (count already has a cond_resched in its loop):

diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644  
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -369,6 +369,9 @@ static int count(const char __user * con
 		for (;;) {
 			const char __user * p;
 
+			if (signal_pending(current))
+				return -ERESTARTNOINTR;
+
 			if (get_user(p, argv))
 				return -EFAULT;
 			if (!p)
@@ -400,6 +403,10 @@ static int copy_strings(int argc, const 
 		int len;
 		unsigned long pos;
 
+		if (signal_pending(current))
+			return -ERESTARTNOINTR;
+		cond_resched();
+
 		if (get_user(str, argv+argc) ||
 				!(len = strnlen_user(str, MAX_ARG_STRLEN))) {
 			ret = -EFAULT;

> Now it appears that, besides the issue that started this thread, the
> same problem I mentioned above got re-introduced.  We still have
> strnlen_user() and the "max" argument to count(), but we no longer have
> hard limits for "max".  Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
> this is just too much.  MAX_ARG_STRLEN is set to 32 pages, and these two
> combined allow a userspace program to make the kernel loop for days.

I really don't think we need that stuff back.  I think we can get rid of
it and fix the real problems, and be happier overall.

> > But it sounds like all you really need is to fix the OOM/allocation
> > behavior for huge stack allocations.
> 
> No, we need both.

I don't agree.  If all of the implicit allocation done inside execve is
accounted and controlled as well as normal userland allocations so that
the execve fails when userland allocation would fail, then there is no
reason for special-case arbitrary limits.

> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted".  Perhaps
> limiting the stack size would deal with that, but maybe the "bug" needs
> to be patched elsewhere as well.  grsecurity has the following hunk:

That change seems like it might be reasonable, but I haven't really
looked at shift_arg_pages before.  Has someone reported this BUG_ON
failure mode with a reproducer?

> Overall, there are multiple issues here (maybe up to four?) and multiple
> things to review the code for.

Agreed.  But IMHO the missing arbitrary limits on arg/env size are not
among them.  I don't know about shift_arg_pages.  The core fix I think
makes sense is making the nascent mm get accounted to the user process
normally.  Rather than better enabling OOM killing, I think what really
makes sense is for the nascent mm to be marked such that allocations in
it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
very aggressive pageout).  That should percolate back to the execve just
failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
actually does pick exactly and only the right target.


Thanks,
Roland
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/