2017-07-17 22:22:09

by Andy Lutomirski

[permalink] [raw]
Subject: [PATCH] exec: Check stack space more strictly

We can currently blow past the stack rlimit and cause odd behavior
if there are accounting bugs, rounding issues, or races. It's not
clear that the odd behavior is actually a problem, but it's nicer to
fail the exec instead of getting out of sync with stack limits.

Improve the code to more carefully check for space and to abort if
our stack mm gets too large in setup_arg_pages().

Signed-off-by: Andy Lutomirski <[email protected]>
---
fs/exec.c | 44 ++++++++++++++++++++++++++++++++++----------
1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 62175cbcc801..0c60c0495269 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -764,23 +764,47 @@ int setup_arg_pages(struct linux_binprm *bprm,
/* mprotect_fixup is overkill to remove the temporary stack flags */
vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;

- stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
- stack_size = vma->vm_end - vma->vm_start;
/*
* Align this down to a page boundary as expand_stack
* will align it up.
*/
rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
+ stack_size = vma->vm_end - vma->vm_start;
+
+ if (stack_size > rlim_stack) {
+ /*
+ * If we've already used too much space (due to accounting
+ * bugs, alignment, races, or any other cause), bail.
+ */
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+
+ /*
+ * stack_expand is the amount of space beyond the space already used
+ * that we're going to pre-allocate in our stack. For historical
+ * reasons, it's 128kB, unless we have less space than that available
+ * in our rlimit.
+ *
+ * This particular historical wart is wrong-headed, though, since
+ * we haven't finished binfmt-specific setup, and the binfmt code
+ * is going to eat up some or all of this space.
+ */
+ stack_expand = min(rlim_stack - stack_size, 131072UL);
+
#ifdef CONFIG_STACK_GROWSUP
- if (stack_size + stack_expand > rlim_stack)
- stack_base = vma->vm_start + rlim_stack;
- else
- stack_base = vma->vm_end + stack_expand;
+ if (TASK_SIZE_MAX - vma->vm_end < stack_expand) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+ stack_base = vma->vm_end + stack_expand;
#else
- if (stack_size + stack_expand > rlim_stack)
- stack_base = vma->vm_end - rlim_stack;
- else
- stack_base = vma->vm_start - stack_expand;
+ if (vma->vm_start < mmap_min_addr ||
+ vma->vm_start - mmap_min_addr < stack_expand) {
+ ret = -ENOMEM;
+ goto out_unlock;
+ }
+ stack_base = vma->vm_start - stack_expand;
#endif
current->mm->start_stack = bprm->p;
ret = expand_stack(vma, stack_base);
--
2.9.4


2017-07-17 22:51:26

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] exec: Check stack space more strictly

On Mon, Jul 17, 2017 at 3:22 PM, Andy Lutomirski <[email protected]> wrote:
> We can currently blow past the stack rlimit and cause odd behavior
> if there are accounting bugs, rounding issues, or races. It's not
> clear that the odd behavior is actually a problem, but it's nicer to
> fail the exec instead of getting out of sync with stack limits.
>
> Improve the code to more carefully check for space and to abort if
> our stack mm gets too large in setup_arg_pages().
>
> Signed-off-by: Andy Lutomirski <[email protected]>

This is a nice check to have in case binfmts do insane things to the
vma before setup_arg_pages(). However, we've still got
create_elf_tables() which is a fixed size _except_ that it adds all
the argv/envp pointers. Those are being accounted for in the
get_arg_page() calls from copy_strings(), though.

Does anything use stack between copy_strings() and create_elf_tables()?

> ---
> fs/exec.c | 44 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 34 insertions(+), 10 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 62175cbcc801..0c60c0495269 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -764,23 +764,47 @@ int setup_arg_pages(struct linux_binprm *bprm,
> /* mprotect_fixup is overkill to remove the temporary stack flags */
> vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
>
> - stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
> - stack_size = vma->vm_end - vma->vm_start;
> /*
> * Align this down to a page boundary as expand_stack
> * will align it up.
> */
> rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
> + stack_size = vma->vm_end - vma->vm_start;
> +
> + if (stack_size > rlim_stack) {
> + /*
> + * If we've already used too much space (due to accounting
> + * bugs, alignment, races, or any other cause), bail.
> + */
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + /*
> + * stack_expand is the amount of space beyond the space already used
> + * that we're going to pre-allocate in our stack. For historical
> + * reasons, it's 128kB, unless we have less space than that available
> + * in our rlimit.
> + *
> + * This particular historical wart is wrong-headed, though, since
> + * we haven't finished binfmt-specific setup, and the binfmt code
> + * is going to eat up some or all of this space.
> + */
> + stack_expand = min(rlim_stack - stack_size, 131072UL);

Yeah, this needs to account for argv/envp pointers, IMO. But it's
still a magic number not tied to the actual results of
create_elf_tables() (which is effectively fixed in stack usage except
for argv/envp pointers).

I think create_elf_tables() needs checking added, or the magic number
should be tied to the size of the AUX strings (platform,
base_platform, rand_bytes), and elf_info words.

Hmm, does anything use the stack _after_ create_elf_tables()? Could we
move create_elf_tables() right after setup_arg_pages() and then
perform the check?

> +
> #ifdef CONFIG_STACK_GROWSUP
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_start + rlim_stack;
> - else
> - stack_base = vma->vm_end + stack_expand;
> + if (TASK_SIZE_MAX - vma->vm_end < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_end + stack_expand;
> #else
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_end - rlim_stack;
> - else
> - stack_base = vma->vm_start - stack_expand;
> + if (vma->vm_start < mmap_min_addr ||
> + vma->vm_start - mmap_min_addr < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_start - stack_expand;
> #endif
> current->mm->start_stack = bprm->p;
> ret = expand_stack(vma, stack_base);
> --
> 2.9.4
>

FWIW, here are my current notes on the whole existing flow (though
note my other series which is trying to move the secureexec callback
earlier):

do_execvat_common:

check RLIMIT_NPROC
unshare_files() (restored on failure)
allocate bprm
prepare_bprm_creds() (allocate bprm->cred)
mutex_lock_interruptible(&current->signal->cred_guard_mutex)
prepare_exec_creds()
prepare_creds()
memcpy(new, old, ...)
security_prepare_creds() (unimplemented by commoncap)
check_unsafe_exec()
do_open_execat()
sched_exec()
if closed fd, set BINPRM_FLAGS_PATH_INACCESSIBLE
bprm_mm_init() (allocate bprm->mm)
prepare_binprm()
bprm_fill_uid() (fill bprm->cred)
resets euid/egid to current euid/egid
sets euid/egid on bprm based on set*id
security_bprm_set_creds() (handles all euid etc logic)
kernel_read(bprm->file, ...)
copy_strings() (argv)
get_arg_page() (could fail with E2BIG for args size)
copy_strings() (envp)
get_arg_page() (could fail with E2BIG for args size)
would_dump() (examine inode perms for MAY_READ)
maybe set BINPRM_FLAGS_ENFORCE_NONDUMP
exec_binprm()
search_binary_handler() (may go recursive for script and misc)
security_bprm_check()
for each format
load_binary() (e.g. load_elf_binary)
load_elf_phdrs()
kernel_read(bprm->file, ..., elf_interpreter)
would_dump() (interpreter)
maybe set BINPRM_FLAGS_ENFORCE_NONDUMP
PT_GNU_STACK
arch_check_elf()
flush_old_exec()
de_thread()
exec_mmap()
current->mm = bprm->mm
bprm->mm = NULL; (failures after here will SEGV current)
flush_thread()
current->personality
setup_new_exec()
arch_pick_mmap_layout() (uses current RLIMIT_STACK)
dumpable (using current uid/euid?!)
arch_setup_new_exec()
perf_event_exec()
bprm->cred uid/euid
pdeath_signal = 0
dumpable (based on BINPRM_FLAGS_ENFORCE_NONDUMP)
current->self_exec_id++
flush_signal_handlers()
install_exec_creds()
security_bprm_committing_creds()
commit_creds()
bprm->cred = NULL
if !dumpable
perf_event_exit_task()
security_bprm_committed_creds()
mutex_unlock(&current->signal->cred_guard_mutex)
setup_arg_pages()
uses RLIMIT_STACK
load ELF image (ET_EXEC vs ET_DYN, load_bias, etc)
set_brk()
load_elf_interp() (if elf_interpreter)
create_elf_tables() (eats fixed stack except argv/envp ptrs)
arch_randomize_brk()
start_thread() (everything is up and running now)
if (retval < 0 && !bprm->mm)
SIGSEGV (can we send signals to my setuid parent?)
audit_bprm()
trace_sched_process_exec()
ptrace_event(PTRACE_EVENT_EXEC)
proc_exec_connector()
clean up


-Kees

--
Kees Cook
Pixel Security

2017-08-16 12:17:13

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH] exec: Check stack space more strictly

minor nit below

On 18.07.2017 01:22, Andy Lutomirski wrote:
> We can currently blow past the stack rlimit and cause odd behavior
> if there are accounting bugs, rounding issues, or races. It's not
> clear that the odd behavior is actually a problem, but it's nicer to
> fail the exec instead of getting out of sync with stack limits.
>
> Improve the code to more carefully check for space and to abort if
> our stack mm gets too large in setup_arg_pages().
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> fs/exec.c | 44 ++++++++++++++++++++++++++++++++++----------
> 1 file changed, 34 insertions(+), 10 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 62175cbcc801..0c60c0495269 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -764,23 +764,47 @@ int setup_arg_pages(struct linux_binprm *bprm,
> /* mprotect_fixup is overkill to remove the temporary stack flags */
> vma->vm_flags &= ~VM_STACK_INCOMPLETE_SETUP;
>
> - stack_expand = 131072UL; /* randomly 32*4k (or 2*64k) pages */
> - stack_size = vma->vm_end - vma->vm_start;
> /*
> * Align this down to a page boundary as expand_stack
> * will align it up.
> */
> rlim_stack = rlimit(RLIMIT_STACK) & PAGE_MASK;
> + stack_size = vma->vm_end - vma->vm_start;
> +
> + if (stack_size > rlim_stack) {
> + /*
> + * If we've already used too much space (due to accounting
> + * bugs, alignment, races, or any other cause), bail.
> + */
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> +
> + /*
> + * stack_expand is the amount of space beyond the space already used
> + * that we're going to pre-allocate in our stack. For historical
> + * reasons, it's 128kB, unless we have less space than that available
> + * in our rlimit.
> + *
> + * This particular historical wart is wrong-headed, though, since
> + * we haven't finished binfmt-specific setup, and the binfmt code
> + * is going to eat up some or all of this space.
> + */
> + stack_expand = min(rlim_stack - stack_size, 131072UL);

nit: Use the SZ_128K from sizes.h.

> +
> #ifdef CONFIG_STACK_GROWSUP
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_start + rlim_stack;
> - else
> - stack_base = vma->vm_end + stack_expand;
> + if (TASK_SIZE_MAX - vma->vm_end < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_end + stack_expand;
> #else
> - if (stack_size + stack_expand > rlim_stack)
> - stack_base = vma->vm_end - rlim_stack;
> - else
> - stack_base = vma->vm_start - stack_expand;
> + if (vma->vm_start < mmap_min_addr ||
> + vma->vm_start - mmap_min_addr < stack_expand) {
> + ret = -ENOMEM;
> + goto out_unlock;
> + }
> + stack_base = vma->vm_start - stack_expand;
> #endif
> current->mm->start_stack = bprm->p;
> ret = expand_stack(vma, stack_base);
>