Brad Spengler published a local memory-allocation DoS that
evades the OOM-killer (though not the virtual memory RLIMIT):
http://www.grsecurity.net/~spender/64bit_dos.c
The recent changes to create a stack guard page help slightly to
discourage this attack, but they are not sufficient. Compiling it statically
moves the libraries out of the way, allowing the stack VMA to fill the
entire TASK_SIZE.
There are two issues:
1) the OOM killer doesn't notice this argv memory explosion
2) the argv expansion does not check if rlim[RLIMIT_STACK].rlim_cur is -1.
I figure a quick solution for #2 would be the following patch. However,
running multiple copies of this program could result in similar OOM
behavior, so issue #1 still needs a solution.
Reported-by: Brad Spengler <[email protected]>
Signed-off-by: Kees Cook <[email protected]>
---
fs/exec.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index dab85ec..be40063 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -194,7 +194,8 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
* to work from.
*/
rlim = current->signal->rlim;
- if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
+ if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4 ||
+ size > TASK_SIZE / 4) {
put_page(page);
return NULL;
}
--
1.7.1
--
Kees Cook
Ubuntu Security Team
> Brad Spengler published a local memory-allocation DoS that
> evades the OOM-killer (though not the virtual memory RLIMIT):
> http://www.grsecurity.net/~spender/64bit_dos.c
>
> The recent changes to create a stack guard page help slightly to
> discourage this attack, but they are not sufficient. Compiling it statically
> moves the libraries out of the way, allowing the stack VMA to fill the
> entire TASK_SIZE.
>
> There are two issues:
> 1) the OOM killer doesn't notice this argv memory explosion
> 2) the argv expansion does not check if rlim[RLIMIT_STACK].rlim_cur is -1.
>
> I figure a quick solution for #2 would be the following patch. However,
> running multiple copies of this program could result in similar OOM
> behavior, so issue #1 still needs a solution.
>
> Reported-by: Brad Spengler <[email protected]>
> Signed-off-by: Kees Cook <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Also, I have a patch for #1. Could you please review it? An alternative
idea would be to change the rss accounting itself.
From d4e114e5d31b14ebfc399d4b1fb142c7dfce0ca4 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Thu, 19 Aug 2010 20:40:20 +0900
Subject: [PATCH] oom: don't ignore temporary rss during execve
execve() creates a new mm struct and sets up the stack and argv vectors
in it. Unfortunately, this new mm is not yet pointed to by any task, so
the OOM killer cannot see this memory usage and may therefore kill the
wrong task.
This patch makes the OOM killer account for the in-exec rss.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/compat.c | 8 ++++++--
fs/exec.c | 19 +++++++++++++++++--
include/linux/binfmts.h | 1 +
include/linux/sched.h | 1 +
mm/oom_kill.c | 36 +++++++++++++++++++++++++++---------
5 files changed, 52 insertions(+), 13 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 718c706..643140c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1527,6 +1527,7 @@ int compat_do_execve(char * filename,
retval = bprm_mm_init(bprm);
if (retval)
goto out_file;
+ set_exec_mm(bprm->mm);
bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
@@ -1560,6 +1561,7 @@ int compat_do_execve(char * filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ set_exec_mm(NULL);
acct_update_integrals(current);
free_bprm(bprm);
if (displaced)
@@ -1567,8 +1569,10 @@ int compat_do_execve(char * filename,
return retval;
out:
- if (bprm->mm)
- mmput(bprm->mm);
+ if (current->in_exec_mm) {
+ struct mm_struct *in_exec_mm = set_exec_mm(NULL);
+ mmput (in_exec_mm);
+ }
out_file:
if (bprm->file) {
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..85192e1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1314,6 +1314,17 @@ int search_binary_handler(struct linux_binprm *bprm,struct pt_regs *regs)
EXPORT_SYMBOL(search_binary_handler);
+struct mm_struct* set_exec_mm(struct mm_struct *mm)
+{
+ struct mm_struct *old = current->in_exec_mm;
+
+ task_lock(current);
+ current->in_exec_mm = mm;
+ task_unlock(current);
+
+ return old;
+}
+
/*
* sys_execve() executes a new program.
*/
@@ -1361,6 +1372,7 @@ int do_execve(const char * filename,
retval = bprm_mm_init(bprm);
if (retval)
goto out_file;
+ set_exec_mm(bprm->mm);
bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
@@ -1395,6 +1407,7 @@ int do_execve(const char * filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ set_exec_mm(NULL);
acct_update_integrals(current);
free_bprm(bprm);
if (displaced)
@@ -1402,8 +1415,10 @@ int do_execve(const char * filename,
return retval;
out:
- if (bprm->mm)
- mmput (bprm->mm);
+ if (current->in_exec_mm) {
+ struct mm_struct *in_exec_mm = set_exec_mm(NULL);
+ mmput (in_exec_mm);
+ }
out_file:
if (bprm->file) {
diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
index a065612..8cf61eb 100644
--- a/include/linux/binfmts.h
+++ b/include/linux/binfmts.h
@@ -133,6 +133,7 @@ extern void install_exec_creds(struct linux_binprm *bprm);
extern void do_coredump(long signr, int exit_code, struct pt_regs *regs);
extern void set_binfmt(struct linux_binfmt *new);
extern void free_bprm(struct linux_binprm *);
+extern struct mm_struct* set_exec_mm(struct mm_struct *mm);
#endif /* __KERNEL__ */
#endif /* _LINUX_BINFMTS_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1e2a6db..d413757 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1217,6 +1217,7 @@ struct task_struct {
struct plist_node pushable_tasks;
struct mm_struct *mm, *active_mm;
+ struct mm_struct *in_exec_mm;
#if defined(SPLIT_RSS_COUNTING)
struct task_rss_stat rss_stat;
#endif
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 57c05f7..7fc6916 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -120,6 +120,30 @@ struct task_struct *find_lock_task_mm(struct task_struct *p)
return NULL;
}
+static unsigned long calculate_rss_swap(struct task_struct *p)
+{
+ struct task_struct *t = p;
+ int mm_accounted = 0;
+ unsigned long points = 0;
+
+ do {
+ task_lock(t);
+ if (!mm_accounted && t->mm) {
+ points += get_mm_rss(t->mm);
+ points += get_mm_counter(t->mm, MM_SWAPENTS);
+ mm_accounted = 1;
+ }
+ if (t->in_exec_mm) {
+ points += get_mm_rss(t->in_exec_mm);
+ points += get_mm_counter(t->in_exec_mm, MM_SWAPENTS);
+ }
+ task_unlock(t);
+ } while_each_thread(p, t);
+
+ return points;
+}
+
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
const nodemask_t *nodemask)
@@ -157,16 +181,11 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
if (oom_unkillable_task(p, mem, nodemask))
return 0;
- p = find_lock_task_mm(p);
- if (!p)
- return 0;
-
/*
* Shortcut check for OOM_SCORE_ADJ_MIN so the entire heuristic doesn't
* need to be executed for something that cannot be killed.
*/
if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
- task_unlock(p);
return 0;
}
@@ -175,7 +194,6 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* priority for oom killing.
*/
if (p->flags & PF_OOM_ORIGIN) {
- task_unlock(p);
return 1000;
}
@@ -190,9 +208,9 @@ unsigned int oom_badness(struct task_struct *p, struct mem_cgroup *mem,
* The baseline for the badness score is the proportion of RAM that each
* task's rss and swap space use.
*/
- points = (get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS)) * 1000 /
- totalpages;
- task_unlock(p);
+ points = calculate_rss_swap(p) * 1000 / totalpages;
+ if (!points)
+ return 0;
/*
* Root processes get 3% bonus, just like the __vm_enough_memory()
--
1.6.5.2
IMHO unlimited should mean unlimited. So, on that score, I'd leave this
constraint out and just say whatever deficiencies in the OOM killer (or in
whatever should make a manifestly too-large allocation get ENOMEM) should
just be fixed separately.
But that aside, I'll just consider the intent stated in the comment in
get_arg_page:
* Limit to 1/4-th the stack size for the argv+env strings.
* This ensures that:
* - the remaining binfmt code will not run out of stack space,
* - the program will have a reasonable amount of stack left
* to work from.
To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
and that is the true theoretical limit on stack size.
The trouble here, both for that stated intent, and for this "exploit",
is which TASK_SIZE that is on a biarch machine. In fact, it's the
TASK_SIZE of the process that called execve. (get_arg_page is called
from copy_strings, from do_execve before search_binary_handler--i.e.,
before anything has looked at the file to decide whether it's going to
be a 32-bit or 64-bit task on exec.) If it's a 32-bit process exec'ing
a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
So that's a limit of 0.75GB on a 64-bit program, which might actually do
just fine with 2 or 3GB. If it's a 64-bit process exec'ing a 32-bit
program, it's the 64-bit TASK_SIZE (128TB on x86-64). So that's a limit
of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
(RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.
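For concreteness, the arithmetic behind those caps, as an illustrative
userland sketch (the TASK_SIZE values below are assumed round numbers,
not taken from any particular kernel config):
#include <stdio.h>

int main(void)
{
	/* assumed values: compat TASK_SIZE ~3GB, x86-64 TASK_SIZE 2^47 */
	unsigned long long task_size_32 = 3ULL << 30;
	unsigned long long task_size_64 = 1ULL << 47;

	/* cap seen by a 32-bit process exec'ing a 64-bit program */
	printf("32 execs 64: %llu MB\n", (task_size_32 / 4) >> 20);  /* 768 MB */
	/* cap seen by a 64-bit process exec'ing a 32-bit program */
	printf("64 execs 32: %llu TB\n", (task_size_64 / 4) >> 40);  /* 32 TB */
	return 0;
}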
So IMHO your change does marginal harm in some cases (32 execs 64)
and makes no appreciable difference to anyone interested in malice
(who can just dodge by exploiting it via 64 execs 64 or 64 execs 32).
If you want to constrain it this way, it's probably simpler just to use
a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).
But it sounds like all you really need is to fix the OOM/allocation
behavior for huge stack allocations.
Thanks,
Roland
On Sun, Aug 29, 2010 at 05:56:48PM -0700, Roland McGrath wrote:
> IMHO unlimited should mean unlimited.
In general, I'd agree, however let's recall that back in 2.2 days we
introduced strnlen_user() and the "max" argument to count() to prevent a
userspace program from making the kernel loop busily for too long.
IIRC, prior to that fix, I was able to cause the kernel to loop for tens
of minutes in a single execve() call on an Alpha with 128 MB RAM, by
using repeated mappings of the same pages (almost 200 GB total).
Now it appears that, besides the issue that started this thread, the
same problem I mentioned above got re-introduced. We still have
strnlen_user() and the "max" argument to count(), but we no longer have
hard limits for "max". Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
this is just too much. MAX_ARG_STRLEN is set to 32 pages, and these two
combined allow a userspace program to make the kernel loop for days.
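For scale, a back-of-the-envelope sketch of how much argv/envp data those
two constants let a single execve() walk and copy (assuming 4 KB pages;
plain userland arithmetic, not kernel code):
#include <stdio.h>

int main(void)
{
	unsigned long long max_arg_strings = 0x7FFFFFFF;   /* MAX_ARG_STRINGS */
	unsigned long long max_arg_strlen = 32 * 4096ULL;  /* MAX_ARG_STRLEN  */

	/* roughly 256 TB of strings for strnlen_user() and copying to chew on */
	printf("%.0f TB\n",
	       (double)(max_arg_strings * max_arg_strlen) / (1ULL << 40));
	return 0;
}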
So I think that we should re-introduce some artificial limit(s), maybe
adjustable by root (by the host system's real root only when container
virtualization is involved). Maybe we should lower MAX_ARG_STRINGS
and/or maybe we should limit the portion of stack space usable for argv
to, say, 0.75 GB (or even less).
> So, on that score, I'd leave this
> constraint out and just say whatever deficiencies in the OOM killer (or in
> whatever should make a manifestly too-large allocation get ENOMEM) should
> just be fixed separately.
I think the "OOM killer problem" should be fixed too.
> But that aside, I'll just consider the intent stated in the comment in
> get_arg_page:
> * Limit to 1/4-th the stack size for the argv+env strings.
> * This ensures that:
> * - the remaining binfmt code will not run out of stack space,
> * - the program will have a reasonable amount of stack left
> * to work from.
> To effect "1/4th the stack size", a cap at TASK_SIZE/4 does make some sense,
> since TASK_SIZE is less than RLIM_INFINITY even in the pure 32-bit world,
> and that is the true theoretical limit on stack size.
>
> The trouble here, both for that stated intent, and for this "exploit",
> is which TASK_SIZE that is on a biarch machine. In fact, it's the
> TASK_SIZE of the process that called execve. (get_arg_page is called
> from copy_strings, from do_execve before search_binary_handler--i.e.,
> before anything has looked at the file to decide whether it's going to
> be a 32-bit or 64-bit task on exec.) If it's a 32-bit process exec'ing
> a 64-bit program, it's the 32-bit TASK_SIZE (perhaps as little as 3GB).
> So that's a limit of 0.75GB on a 64-bit program, which might actually do
> just fine with 2 or 3GB. If it's a 64-bit process exec'ing a 32-bit
> program, it's the 64-bit TASK_SIZE (128TB on x86-64). So that's a limit
> of 32TB, which is perhaps not that helpfully less than 2PB minus 1 byte
> (RLIM_INFINITY/4) as far as preventing any over-allocation DoS in practice.
That's a very good point!
> If you want to constrain it this way, it's probably simpler just to use
> a smaller hard limit for RLIM_STACK at boot time (and hence system-wide).
Right.
> But it sounds like all you really need is to fix the OOM/allocation
> behavior for huge stack allocations.
No, we need both.
Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
stack tries to expand around the address space when shifted". Perhaps
limiting the stack size would deal with that, but maybe the "bug" needs
to be patched elsewhere as well. grsecurity has the following hunk:
@@ -505,7 +520,8 @@ static int shift_arg_pages(struct vm_are
unsigned long new_end = old_end - shift;
struct mmu_gather *tlb;
- BUG_ON(new_start > new_end);
+ if (new_start >= new_end || new_start < mmap_min_addr)
+ return -EFAULT;
/*
* ensure there are no vmas between where we want to go
which is likely part of a fix (but not the entire fix) for what the
comment in 64bit_dos.c refers to. However, I was not able to trigger
the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
other with 4 GB. Of course, I set "ulimit -s unlimited" first.
On the 2 GB system, I saw the OOM killer problem (several other
processes got killed before 64bit_dos was killed as well). On the 4 GB
system, exec succeeded (after looping in the kernel for a few seconds,
and then the newly started program failed because of its address space
exhaustion). Maybe the BUG is only triggerable on certain other kernel
versions, or maybe I didn't try hard enough (I certainly did not try
very hard - I did not review the code closely).
Someone could want to look into this aspect as well.
grsecurity's added check against mmap_min_addr (and the reference to it
in the "exploit's" comment) is also very curious. It can be just a way
to avoid triggering another BUG() elsewhere, or it can be just an extra
hardening measure - but it could also be "for real". We could want to
double-check that there wasn't an mmap_min_addr bypass possible here.
Overall, there are multiple issues here (maybe up to four?) and multiple
things to review the code for.
Does anyone (with more time than me) want to look into these for real?
(I sure hope so.)
Thanks,
Alexander
> IIRC, prior to that fix, I was able to cause the kernel to loop for tens
> of minutes in a single execve() call on an Alpha with 128 MB RAM, by
> using repeated mappings of the same pages (almost 200 GB total).
And I say, if your userland process could really allocate another 200GB,
then more power to you, you can do it with an exec too. If you could do
the same with a userland stack allocation, and spend all that time in
strlen calls and then memcpy, you can do it inside execve too. If it
takes days, that's what you asked for, and it's your process. It just
ought to be every bit (or near enough) as preemptible and interruptible
as that normal userland activity ought to be.
So, perhaps we want this (count already has a cond_resched in its loop):
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -369,6 +369,9 @@ static int count(const char __user * con
for (;;) {
const char __user * p;
+ if (signal_pending(current))
+ return -ERESTARTNOINTR;
+
if (get_user(p, argv))
return -EFAULT;
if (!p)
@@ -400,6 +403,10 @@ static int copy_strings(int argc, const
int len;
unsigned long pos;
+ if (signal_pending(current))
+ return -ERESTARTNOINTR;
+ cond_resched();
+
if (get_user(str, argv+argc) ||
!(len = strnlen_user(str, MAX_ARG_STRLEN))) {
ret = -EFAULT;
> Now it appears that, besides the issue that started this thread, the
> same problem I mentioned above got re-introduced. We still have
> strnlen_user() and the "max" argument to count(), but we no longer have
> hard limits for "max". Someone set MAX_ARG_STRINGS to 0x7FFFFFFF, and
> this is just too much. MAX_ARG_STRLEN is set to 32 pages, and these two
> combined allow a userspace program to make the kernel loop for days.
I really don't think we need that stuff back. I think we can get rid of
it and fix the real problems, and be happier overall.
> > But it sounds like all you really need is to fix the OOM/allocation
> > behavior for huge stack allocations.
>
> No, we need both.
I don't agree. If all of the implicit allocation done inside execve is
accounted and controlled as well as normal userland allocations so that
the execve fails when userland allocation would fail, then there is no
reason for special-case arbitrary limits.
> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted". Perhaps
> limiting the stack size would deal with that, but maybe the "bug" needs
> to be patched elsewhere as well. grsecurity has the following hunk:
That change seems like it might be reasonable, but I haven't really
looked at shift_arg_pages before. Has someone reported this BUG_ON
failure mode with a reproducer?
> Overall, there are multiple issues here (maybe up to four?) and multiple
> things to review the code for.
Agreed. But IMHO the missing arbitrary limits on arg/env size are not
among them. I don't know about shift_arg_pages. The core fix I think
makes sense is making the nascent mm get accounted to the user process
normally. Rather than better enabling OOM killing, I think what really
makes sense is for the nascent mm to be marked such that allocations in
it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
very aggressive pageout). That should percolate back to the execve just
failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
actually does pick exactly and only the right target.
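To illustrate the user-visible difference, a trivial userland sketch (the
oversized argv is only imagined in the comment; this just shows where the
ENOMEM would surface):
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char *argv[] = { "/bin/true", NULL };	/* imagine gigabytes of args here */
	char *envp[] = { NULL };

	execve("/bin/true", argv, envp);
	/* reached only if execve fails; with the behavior described above,
	   a too-large argv would show up here as errno == ENOMEM rather
	   than the calling process being OOM-killed */
	fprintf(stderr, "execve: %s\n", strerror(errno));
	return 1;
}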
Thanks,
Roland
On Mon, Aug 30, 2010 at 07:23:31AM +0400, Solar Designer wrote:
> Additionally, 64bit_dos.c mentions that "it triggers a BUG() as the
> stack tries to expand around the address space when shifted".
[...]
> which is likely part of a fix (but not the entire fix) for what the
> comment in 64bit_dos.c refers to. However, I was not able to trigger
> the BUG_ON() on RHEL5'ish kernels by simply running the 64bit_dos
> program (64-bit to 32-bit exec) on two systems, one with 2 GB RAM, the
> other with 4 GB. Of course, I set "ulimit -s unlimited" first.
I am finally able to trigger the BUG by replacing "/bin/sh" with
"/bin/false" in 64bit_dos.c, relying on our library-free implementation
of /bin/false on Owl:
owl!solar:~$ objdump -d /bin/false
/bin/false: file format elf32-i386
Disassembly of section .text:
08048074 <.text>:
8048074: b8 01 00 00 00 mov $0x1,%eax
8048079: bb 01 00 00 00 mov $0x1,%ebx
804807e: cd 80 int $0x80
owl!solar:~$ file 64bit_dos
64bit_dos: ELF 64-bit LSB executable, AMD x86-64, version 1 (SYSV), for GNU/Linux 2.4.0, statically linked, for GNU/Linux 2.4.0, stripped
(the "exploit" is statically-linked since I brought it from another
machine, I don't think this matters).
After looping in the kernel for about 10 seconds, I got:
Kernel BUG at fs/exec.c:535
[...]
Call Trace:
[<ffffffff8007c44c>] load_elf32_binary+0x7f9/0x1702
[<ffffffff800c193f>] expand_stack+0x7f/0xad
[<ffffffff8003e39b>] search_binary_handler+0x94/0x1e2
[<ffffffff8003da2f>] do_execve+0x18e/0x1f2
[<ffffffff800542cc>] sys_execve+0x34/0x51
[<ffffffff8005e523>] stub_execve+0x67/0xb0
[...]
Code: 0f 0b 68 fe 93 3e 80 c2 17 02 48 8b 7c 24 08 4c 89 fe e8 da
RIP [<ffffffff8002dc90>] setup_arg_pages+0x151/0x2d3
The kernel is a revision and custom build of 2.6.18-128.2.1.el5
(whatever I happened to readily have installed on a test system).
I think the problem should be reproducible with current RHEL5 kernels
and likely with latest mainstream kernels as well.
The process is stuck:
solar 28754 2.8 77.8 3142276 3142276 pts/0 D+ 10:34 0:13 [false]
(uninterruptible)
3 GB memory is still taken.
Alexander
On Mon, Aug 30, 2010 at 03:06:16AM -0700, Roland McGrath wrote:
> And I say, if your userland process could really allocate another 200GB,
> then more power to you, you can do it with an exec too. If you could do
> the same with a userland stack allocation, and spend all that time in
> strlen calls and then memcpy, you can do it inside execve too. If it
> takes days, that's what you asked for, and it's your process. It just
> ought to be every bit (or near enough) as preemptible and interruptible
> as that normal userland activity ought to be.
This makes sense to me. However, introducing a new preemption point
may violate assumptions under which the code was written and reviewed
in the past. In the worst case, we'd introduce/expose race conditions
allowing for privilege escalation.
> So, perhaps we want this (count already has a cond_resched in its loop):
Good point re: count() already having this (I think it did not in 2.2).
> @@ -400,6 +403,10 @@ static int copy_strings(int argc, const
> int len;
> unsigned long pos;
>
> + if (signal_pending(current))
> + return -ERESTARTNOINTR;
> + cond_resched();
So, in current kernels, you're making it possible for more kinds of
things to change after prepare_binprm() but before
search_binary_handler(). We'd need to check for possible implications
of this.
I must admit I am not familiar with what additional kinds of things may
change when execution is preempted. This made a significant difference
in some much older kernels (many years ago), but now that the kernel
makes a lot less use of locking, most things may be changed by another
CPU even without preemption. So does anyone have a list of what
additional risks we're exposed to, if any, when we allow preemption in
current kernels?
> Has someone reported this BUG_ON failure mode with a reproducer?
64bit_dos.c was supposed to be the reproducer, and I managed to get it
to work (as I've documented in another message earlier today). The
prerequisites appeared to be (some of these might be specific to my
tests, though):
- 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
- 64-bit build of 64bit_dos.c;
- 32-bit build of the target program;
- no dynamic linking in the target program;
- "ulimit -s unlimited" before running the reproducer program;
- over 3 GB of RAM in the system.
> [...] Rather than better enabling OOM killing, I think what really
> makes sense is for the nascent mm to be marked such that allocations in
> it (they'll be from get_arg_page->get_user_pages->handle_mm_fault) just
> fail with ENOMEM before it resorts to the OOM killer (or perhaps even to
> very aggressive pageout). That should percolate back to the execve just
> failing with ENOMEM, which is nicer than OOM kill even if the OOM killer
> actually does pick exactly and only the right target.
I agree.
Thanks,
Alexander
Hi guys,
I see you're having fun with my code ;) Just wanted to remind you that
I do exist; I reported this in December 2009 to Ted Tso and again
recently (forwarded the same email from 2009) to Kees Cook and James
Morris. So even though nobody's actually emailed me about the issue(s),
I am available to answer any questions. Just CC me on the email as I'm
not subscribed to the list.
Anyway, I did actually research the bug(s) involved quite a bit around
the time I reported it, so hopefully some of the below will help.
The bug seems to have been introduced in 2.6.23, see:
http://thread.gmane.org/gmane.linux.ports.hppa/752
http://www.spinics.net/lists/linux-arch/msg01584.html
http://www.mail-archive.com/[email protected]/msg170491.html
though I'm guessing the functionality was also backported to major
distros.
The check that uses the current stack limit as a byte value (including
when it is RLIM_INFINITY), dividing it by 4, is completely broken for a
number of reasons:
- 1/4th of a 64-bit ~0 is several times larger than the size of the
  addressable userland address space on most 64-bit architectures
  (amd64 is 47 bits);
- 1/4th of a 64-bit ~0 is way bigger than any 32-bit value;
- no other place in the kernel treats a limit in this way;
- with a high rlimit and high usage, the code doesn't do what it intends
  to do -- be able to return a meaningful error message to execve,
  instead of having the process doing the execve terminated with a
  SIGKILL in later stages of ELF loading.
Combined with the fact that the max arg size * max number of args
(~256TB) is also again larger than the 32bit address space and the
amd64 userland address space (as well as the address space of several other
64bit architectures), lots of problems appear. This was exacerbated by
the behavior that, until recently fixed, allowed the stack to grow over
any other existing mappings. Combine this with ASLR and shifting that
whole stack range down a random amount via shift_arg_pages/adjust_vma
which skips many of the sanity checks that exist elsewhere, and even
more problems appear (I think this latter problem may make it possible
to still trigger the BUG_ON() on patched kernels if a static binary is
used and ASLR shifts the stack over the binary).
From my research, it's not possible to successfully execute a binary
such that mmap_min_addr with normal values can be bypassed by the stack
shifting trick. I was able to determine the exact number+sizes of
arguments to use for ASLR to have a chance to shift the stack down to 0
(any more and we would trigger that BUG_ON()). Though this was
successful, after this point in the ELF loader, additional data is set
up on the stack, proportional to the number of arguments passed. It's
impossible for this additional setup to consume less than a page, so it
triggers a stack expansion which then gets checked against the normal
mmap_min_addr checks. What actually ends up happening on this
second-stage setup is the binary gets killed with SIGKILL by the ELF
loader.
The fix to the OOM killer looks correct, but the other problem (causing
extreme interactivity hits with almost no effort in userland) needs some
more thought, especially since it appears no distro is shipping with
hard limits on the stack.
If the "/ 4" check is to be preserved, it needs to take into account
the personality of the target binary: this way, the exec'ing task should
always get a proper error back instead of being terminated by the
kernel.
There shouldn't be any additional risk from adding the extra rescheds,
as copy_*_user can already sleep and be raced against via a number of
methods.
-Brad
> This makes sense to me. However, introducing a new preemption point
> may violate assumptions under which the code was written and reviewed
> in the past. In the worst case, we'd introduce/expose race conditions
> allowing for privilege escalation.
This can't be so. There are already many possibilities for preemption
in the get_user_pages code paths (called from get_arg_page).
> > So, perhaps we want this (count already has a cond_resched in its loop):
>
> Good point re: count() already having this (I think it did not in 2.2).
Indeed, this too is a clear indication that preemption here is already safe.
> > + if (signal_pending(current))
> > + return -ERESTARTNOINTR;
> > + cond_resched();
>
> So, in current kernels, you're making it possible for more kinds of
> things to change after prepare_binprm() but before
> search_binary_handler(). We'd need to check for possible implications
> of this.
What "change"? Preemption is already possible, that's nothing new.
The only difference is that we might notice TIF_SIGPENDING having been
set, and bail out either before or after prepare_binprm, and so never
call search_binary_handler at all.
The user-visible effect of this could be that a process taking many signals
way too fast could get stuck permanently running its signal handlers and
never actually really attempting the execve (but perhaps doing some mm
setup and arg-copying work and then throwing it away, repeatedly). Before,
the speed of the signals might have been such that the process could get
into the execve syscall, so it might happen to complete when now it would
be more likely to handle another signal before doing the exec instead.
That, and debuggers might now see -ERESTARTNOINTR status in syscall
tracing for execve (as they already might for many other syscalls, so they
should be fine).
Note that get_user_pages() already fails for SIGKILL. So a more
conservative change would be just fatal_signal_pending(current) checks.
That lets only SIGKILL short-circuit the execve to kill the process,
without just letting normal signals interrupt execve like they would
interrupt any other system call. That should have exactly no
user-visible effects (aside from fairer scheduling in CONFIG_PREEMPT=n
kernels, and better responsiveness to SIGKILL in all kernels). But I
think it's fine and proper to let all signals interrupt execve normally.
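In patch terms, that conservative variant would just swap the test in the
hunks I posted earlier, e.g. for the copy_strings() one (sketch, untested):
	if (fatal_signal_pending(current))
		return -ERESTARTNOINTR;
	cond_resched();
The exact return value hardly matters in that case, since the task is
about to be killed anyway.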
> additional risks we're exposed to, if any, when we allow preemption in
> current kernels?
That's not so much the point, since we already have plenty of places in
this code path that have voluntary preemption points. We're just adding
more, which improves interactive scheduling performance for the rest of the
system when execve takes a long time copying the strings.
> - 64-bit kernel with 32-bit userland support (e.g., CONFIG_IA32_EMULATION);
> - 64-bit build of 64bit_dos.c;
> - 32-bit build of the target program;
> - no dynamic linking in the target program;
> - "ulimit -s unlimited" before running the reproducer program;
> - over 3 GB of RAM in the system.
I reproduced it too. The lack of dynamic linking is not really a
requirement to get it to hit, though whether it fails in the kernel can
vary with the stack randomization. (Sometimes it will instead be the
userland dynamic linker dying with a message that mmap failed.)
I think the right fix for that case is this:
diff --git a/fs/exec.c b/fs/exec.c
index 2d94552..0000000 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -594,6 +594,11 @@ int setup_arg_pages(struct linux_binprm
#else
stack_top = arch_align_stack(stack_top);
stack_top = PAGE_ALIGN(stack_top);
+
+ if (unlikely(stack_top < mmap_min_addr) ||
+ unlikely(vma->vm_end - vma->vm_start >= stack_top - mmap_min_addr))
+ return -ENOMEM;
+
stack_shift = vma->vm_end - stack_top;
bprm->p -= stack_shift;
That is consistent with the existing CONFIG_STACK_GROWSUP code's check
(though without the arbitrary 1GB limit it has). We won't ever get to
shift_arg_pages if there isn't space to do the shift.
Brad made several of the same points that I made earlier and in this
message, so I concur with those. :-)
Brad mentioned a possible BUG_ON if the executable's mappings clobber
the stack mapping. With the fix above, I'm not aware how that could
happen. I think it would just clobber part of the mapping with the
MAP_FIXED mapping, meaning it shortens the stack vma from the bottom, or
maybe splits it in the middle. Then at user-mode entry point time, the
SP will point into some mapped text or data of the executable, and read
whatever is there instead of argc and argv[], etc.--it won't be as
graceful as an abortive execve, but it will just "safely" crash in some
random but normal userland sort of way.
Thanks,
Roland
Brad, Roland -
Thank you for your comments and your work on this.
On Mon, Aug 30, 2010 at 06:08:47PM -0400, Brad Spengler wrote:
> There shouldn't be any additional risk from adding the extra rescheds,
> as copy_*_user can already sleep and be raced against via a number of
> methods.
Agreed.
Alexander
Brad Spengler wrote:
> The bug seems to have been introduced in 2.6.23, see:
> http://thread.gmane.org/gmane.linux.ports.hppa/752
> http://www.spinics.net/lists/linux-arch/msg01584.html
> http://www.mail-archive.com/[email protected]/msg170491.html
> though I'm guessing the functionality was also backported to major
> distros
As far as I know, RHEL >= 5.3 and Asianux >= 3.2 backported this functionality.