2020-03-29 23:17:52

by Kees Cook

[permalink] [raw]
Subject: Re: Curiosity around 'exec_id' and some problems associated with it

Hi!

Sorry, I missed this originally because it got filed into my lkml
archive and not kernel-hardening, but no one actually reads lkml
directly, myself included -- it's mostly a thread archive. I'll update
my filters, and I've added a handful of people to CC that might be
interested in looking at this too. Here's the full email, I trimmed
heavily since it's very detailed:
https://lore.kernel.org/lkml/[email protected]/

On Tue, Mar 24, 2020 at 10:50:49PM +0100, Adam Zabrocki wrote:
> Some curiosities which are interesting to point out:
>
> 1) Linus Torvalds in 2012 suspected that such 'overflow' might be possible.
> You can read more about it here:
>
> https://www.openwall.com/lists/kernel-hardening/2012/03/11/4
>
> 2) Solar Designer in 1999(!) was aware about the problem that 'exit_signal' can
> be abused. The kernel didn't protect it at all at that time. So he came up
> with the idea to introduce those two counters to deal with that problem.
> Originally, these counters were defined as "long long" type. However, during
> the revising between September 14 and September 16, 1999 he switched from
> "long long" to "int" and introduced integer wraparound handling. His patches
> were merged to the kernel 2.0.39 and 2.0.40.
>
> 3) It is worth to read the Solar Designer's message during the discussion about
> the fix for the problem CVE-2012-0056 (I'm referencing this problem later in
> that write-up about "Problem II"):
>
> https://www.openwall.com/lists/kernel-hardening/2012/03/11/12

There was some effort made somewhat recently to get this area fixed:
https://lore.kernel.org/linux-fsdevel/[email protected]/

Nothing ultimately landed, but it's worth seeing if we could revitalize
interest. Part of Jann's series was also related to fixing issues with
cred_guard_mutex, which is getting some traction now too:
https://lore.kernel.org/lkml/AM6PR03MB5170938306F22C3CF61CC573E4CD0@AM6PR03MB5170.eurprd03.prod.outlook.com/

> In short, if you hold the file descriptor open over an execve() (e.g. share it
> with child) the old VM is preserved (refcounted) and might be never released.
> Essentially, mother process' VM will be still in memory (and pointer to it is
> valid) even if the mother process passed an execve().
> This is some kind of 'memory leak' scenario. I did a simple test where process
> open /proc/self/maps file and calls clone() with CLONE_FILES flag. Next mother
> 'overwrite' itself by executing SUID binary (doesn't need to be SUID), and child
> was still able to use the original file descriptor - it's valid.

It'd be worth exploring where the resource counting is happening for
this. I haven't looked to see how much of the VM stays in kernel memory
in this situation. It probably wouldn't be hard to count it against an
rlimit or something.

Thanks for the details! I hope someone will have time to look into this.
It's a bit of a "long timeframe attack", so it's not gotta a lot of
priority (obviously). :)

-Kees

--
Kees Cook


2020-03-30 08:35:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Curiosity around 'exec_id' and some problems associated with it

On 03/29, Kees Cook wrote:
>
> On Tue, Mar 24, 2020 at 10:50:49PM +0100, Adam Zabrocki wrote:
> >
> > In short, if you hold the file descriptor open over an execve() (e.g. share it
> > with child) the old VM is preserved (refcounted) and might be never released.
> > Essentially, mother process' VM will be still in memory (and pointer to it is
> > valid) even if the mother process passed an execve().

This was true after e268337dfe26dfc7efd422a804dbb27977a3cccc, but please see
6d08f2c7139790c ("proc: make sure mem_open() doesn't pin the target's memory"),
iir it was merged soon after the 1st commit.

Oleg.

2020-04-01 20:51:33

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] signal: Extend exec_id to 64bits


Replace the 32bit exec_id with a 64bit exec_id to make it impossible
to wrap the exec_id counter. With care an attacker can cause exec_id
wrap and send arbitrary signals to a newly exec'd parent. This
bypasses the signal sending checks if the parent changes their
credentials during exec.

The severity of this problem can been seen that in my limited testing
of a 32bit exec_id it can take as little as 19s to exec 65536 times.
Which means that it can take as little as 14 days to wrap a 32bit
exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
days. Even my slower timing is in the uptime of a typical server.
Which means self_exec_id is simply a speed bump today, and if exec
gets noticably faster self_exec_id won't even be a speed bump.

Extending self_exec_id to 64bits introduces a problem on 32bit
architectures where reading self_exec_id is no longer atomic and can
take two read instructions. Which means that is is possible to hit
a window where the read value of exec_id does not match the written
value. So with very lucky timing after this change this still
remains expoiltable.

I have updated the update of exec_id on exec to use WRITE_ONCE
and the read of exec_id in do_notify_parent to use READ_ONCE
to make it clear that there is no locking between these two
locations.

Link: https://lore.kernel.org/kernel-hardening/[email protected]
Fixes: 2.3.23pre2
Cc: [email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

Linus would you prefer to take this patch directly or I could put it in
a brach and send you a pull request.

fs/exec.c | 2 +-
include/linux/sched.h | 4 ++--
kernel/signal.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 0e46ec57fe0a..d55710a36056 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)

/* An exec changes our domain. We are no longer part of the thread
group */
- current->self_exec_id++;
+ WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
flush_signal_handlers(current, 0);
}
EXPORT_SYMBOL(setup_new_exec);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..0323e4f0982a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -939,8 +939,8 @@ struct task_struct {
struct seccomp seccomp;

/* Thread group tracking: */
- u32 parent_exec_id;
- u32 self_exec_id;
+ u64 parent_exec_id;
+ u64 self_exec_id;

/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
spinlock_t alloc_lock;
diff --git a/kernel/signal.c b/kernel/signal.c
index 9ad8dea93dbb..5383b562df85 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
* This is only possible if parent == real_parent.
* Check if it has changed security domain.
*/
- if (tsk->parent_exec_id != tsk->parent->self_exec_id)
+ if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
sig = SIGCHLD;
}

--
2.20.1

2020-04-01 20:56:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <[email protected]> wrote:
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.

Ack, makes sense to me.

Just put it in your branch, this doesn't look urgent, considering that
it's something that has been around forever.

Linus

2020-04-01 21:07:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

Linus Torvalds <[email protected]> writes:

> On Wed, Apr 1, 2020 at 1:50 PM Eric W. Biederman <[email protected]> wrote:
>>
>> I have updated the update of exec_id on exec to use WRITE_ONCE
>> and the read of exec_id in do_notify_parent to use READ_ONCE
>> to make it clear that there is no locking between these two
>> locations.
>
> Ack, makes sense to me.
>
> Just put it in your branch, this doesn't look urgent, considering that
> it's something that has been around forever.

Done

Eric

2020-04-01 23:39:18

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

+memory model folks because this seems like a pretty sharp corner

On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <[email protected]> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days. Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
>
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions. Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value. So with very lucky timing after this change this still
> remains expoiltable.
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
>
> Link: https://lore.kernel.org/kernel-hardening/[email protected]
> Fixes: 2.3.23pre2
> Cc: [email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>
> fs/exec.c | 2 +-
> include/linux/sched.h | 4 ++--
> kernel/signal.c | 2 +-
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>
> /* An exec changes our domain. We are no longer part of the thread
> group */
> - current->self_exec_id++;
> + WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);

GCC will generate code for this without complaining, but I think it'll
probably generate a tearing store on 32-bit platforms:

$ cat volatile-8.c
typedef unsigned long long u64;
static volatile u64 n;
void blah(void) {
n = n + 1;
}
$ gcc -O2 -m32 -c -o volatile-8.o volatile-8.c -Wall
$ objdump --disassemble=blah volatile-8.o
[...]
b: 8b 81 00 00 00 00 mov 0x0(%ecx),%eax
11: 8b 91 04 00 00 00 mov 0x4(%ecx),%edx
17: 83 c0 01 add $0x1,%eax
1a: 83 d2 00 adc $0x0,%edx
1d: 89 81 00 00 00 00 mov %eax,0x0(%ecx)
23: 89 91 04 00 00 00 mov %edx,0x4(%ecx)
[...]
$

You could maybe use atomic64_t to work around that? atomic64_read()
and atomic64_set() should be straightforward READ_ONCE() /
WRITE_ONCE() on 64-bit systems while compiling into something more
complicated on 32-bit.

> flush_signal_handlers(current, 0);
> }
> EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
> struct seccomp seccomp;
>
> /* Thread group tracking: */
> - u32 parent_exec_id;
> - u32 self_exec_id;
> + u64 parent_exec_id;
> + u64 self_exec_id;
>
> /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
> spinlock_t alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> * This is only possible if parent == real_parent.
> * Check if it has changed security domain.
> */
> - if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> + if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
> sig = SIGCHLD;
> }
>
> --
> 2.20.1
>

2020-04-01 23:55:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 1, 2020 at 4:37 PM Jann Horn <[email protected]> wrote:
>
> GCC will generate code for this without complaining, but I think it'll
> probably generate a tearing store on 32-bit platforms:

This is very much a "we don't care" case.

It's literally testing a sequence counter for equality. If you get
tearing in the high bits on the write (or the read), you'd still need
to have the low bits turn around 4G times to get a matching value.

So no. We're not doing atomics for the 32-bit case. That's insane.

Linus

2020-04-01 23:57:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
<[email protected]> wrote:
>
> It's literally testing a sequence counter for equality. If you get
> tearing in the high bits on the write (or the read), you'd still need
> to have the low bits turn around 4G times to get a matching value.

Put another way: first you'd have to work however many weeks to do 4
billion execve() calls, and then you need to hit basically a
single-instruction race to take advantage of it.

Good luck with that. If you have that kind of God-like capability,
whoever you're attacking stands no chance in the first place.

Linus

2020-04-02 01:38:09

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Thu, Apr 2, 2020 at 1:55 AM Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 1, 2020 at 4:51 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > It's literally testing a sequence counter for equality. If you get
> > tearing in the high bits on the write (or the read), you'd still need
> > to have the low bits turn around 4G times to get a matching value.
>
> Put another way: first you'd have to work however many weeks to do 4
> billion execve() calls, and then you need to hit basically a
> single-instruction race to take advantage of it.
>
> Good luck with that. If you have that kind of God-like capability,
> whoever you're attacking stands no chance in the first place.

I'm not really worried about someone being able to hit this in
practice, especially given that the only thing it lets you do is send
signals; I just think that the code is wrong in principle, and that we
should avoid having "technically wrong, but works in practice" code in
the kernel.

This kind of intentional race might also trip up testing tools like
the upcoming KCSAN instrumentation, unless it is specifically
annotated as an intentionally racing instruction. (For now, KCSAN is
64-bit only, so it won't actually be able to detect this though; and
the current KCSAN development branch actually incorrectly considers
WRITE_ONCE() to always be atomic; but still, in principle it's the
kind of issue KCSAN is supposed to detect, I think.)

But even without KCSAN, if we have tearing stores racing with loads, I
think that we ought to *at least* have a comment noting that we're
intentionally doing that so that people don't copy this kind of code
to a place where the high bits change more frequently and correctness
matters more.

Since the read is already protected by the tasklist_lock, an
alternative might be to let the execve path also take that lock to
protect the sequence number update, given that execve is not a
particularly hot path. Or we could hand-code the equality check and
increment operations to be properly race-free.

2020-04-02 02:07:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 1, 2020 at 6:36 PM Jann Horn <[email protected]> wrote:
>
> Since the read is already protected by the tasklist_lock, an
> alternative might be to let the execve path also take that lock to
> protect the sequence number update,

No.

tasklist_lock is aboue the hottest lock there is in all of the kernel.

We're not doing stupid things for theoretical issues.

Stop this crazy argument.

A comment - sure. 64-bit atomics or very expensive locks? Not a chance.

Linus

2020-04-02 04:48:20

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <[email protected]> wrote:
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days. Even my slower timing is in the uptime of a typical server.

FYI, if you actually optimize this, it's more like 12s to exec 1048576
times according to my test, which means ~14 hours for 2^32 executions
(on a single core). That's on an i7-4790 (a Haswell desktop processor
that was launched about six years ago, in 2014).

Here's my test code:

=============
$ grep 'model name' /proc/cpuinfo | head -n1
model name : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
$ cat build.sh
#!/bin/sh
set -e
nasm -felf32 -o fast_execve.o fast_execve.asm
ld -m elf_i386 -o fast_execve fast_execve.o
gcc -o launch launch.c -Wall
gcc -o finish finish.c -Wall
$ cat fast_execve.asm
bits 32

section .text
global _start
_start:
; eax = argv[0]
; expected to be 8 hex digits, with 'a' meaning 0x0 and 'p' meaning 0xf
mov eax, [esp+4]

mov ebx, 0 ; loop counter
hex_digit_loop:
inc byte [eax+ebx]
cmp byte [eax+ebx], 'a'+16
jne next_exec
mov byte [eax+ebx], 'a'
inc ebx
cmp ebx, 5 ;;;;;;;;;;;;;;;;;; this is N, where iteration_count=pow(16,N)
jne hex_digit_loop


; reached pow(256,N) execs, get out

; first make the stack big again
mov eax, 75 ; setrlimit (32-bit ABI)
mov ebx, 3 ; RLIMIT_STACK
mov ecx, stacklim
int 0x80

; execute end helper
mov ebx, 4 ; dirfd = 4
jmp common_exec

next_exec:
mov ebx, 3 ; dirfd = 3

common_exec: ; execveat() with file descriptor passed in as ebx
mov ecx, nullval ; pathname = empty string
lea edx, [esp+4] ; argv
mov esi, 0 ; envp
mov edi, 0x1000 ; flags = AT_EMPTY_PATH
mov eax, 358 ; execveat (32-bit ABI)
int 0x80
int3

nullval:
dd 0
stacklim:
dd 0x02000000
dd 0xffffffff
$ cat launch.c
#define _GNU_SOURCE
#include <fcntl.h>
#include <err.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/resource.h>
int main(void) {
close(3);
close(4);
if (open("fast_execve", O_PATH) != 3)
err(1, "open fast_execve");
if (open("finish", O_PATH) != 4)
err(1, "open finish");
char *argv[] = { "aaaaaaaa", NULL };

struct rlimit lim;
if (getrlimit(RLIMIT_STACK, &lim))
err(1, "getrlimit");
lim.rlim_cur = 0x4000;
if (setrlimit(RLIMIT_STACK, &lim))
err(1, "setrlimit");

syscall(__NR_execveat, 3, "", argv, NULL, AT_EMPTY_PATH);
}
$ cat finish.c
#include <stdlib.h>
int main(void) {
exit(0);
}
$ ./build.sh
$ time ./launch

real 0m12,075s
user 0m0,905s
sys 0m11,026s
$
=============

2020-04-02 07:20:53

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Wed, Apr 01, 2020 at 03:47:44PM -0500, Eric W. Biederman wrote:
>
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days. Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
>
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions. Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value. So with very lucky timing after this change this still
> remains expoiltable.
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
>
> Link: https://lore.kernel.org/kernel-hardening/[email protected]
> Fixes: 2.3.23pre2
> Cc: [email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Thanks for chasing this down. :)

Reviewed-by: Kees Cook <[email protected]>

-Kees

> ---
>
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>
> fs/exec.c | 2 +-
> include/linux/sched.h | 4 ++--
> kernel/signal.c | 2 +-
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>
> /* An exec changes our domain. We are no longer part of the thread
> group */
> - current->self_exec_id++;
> + WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
> flush_signal_handlers(current, 0);
> }
> EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
> struct seccomp seccomp;
>
> /* Thread group tracking: */
> - u32 parent_exec_id;
> - u32 self_exec_id;
> + u64 parent_exec_id;
> + u64 self_exec_id;
>
> /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
> spinlock_t alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> * This is only possible if parent == real_parent.
> * Check if it has changed security domain.
> */
> - if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> + if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
> sig = SIGCHLD;
> }
>
> --
> 2.20.1
>

--
Kees Cook

2020-04-02 07:23:25

by Bernd Edlinger

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits



On 4/1/20 10:47 PM, Eric W. Biederman wrote:
>
> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
> to wrap the exec_id counter. With care an attacker can cause exec_id
> wrap and send arbitrary signals to a newly exec'd parent. This
> bypasses the signal sending checks if the parent changes their
> credentials during exec.
>
> The severity of this problem can been seen that in my limited testing
> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
> Which means that it can take as little as 14 days to wrap a 32bit
> exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
> days. Even my slower timing is in the uptime of a typical server.
> Which means self_exec_id is simply a speed bump today, and if exec
> gets noticably faster self_exec_id won't even be a speed bump.
>
> Extending self_exec_id to 64bits introduces a problem on 32bit
> architectures where reading self_exec_id is no longer atomic and can
> take two read instructions. Which means that is is possible to hit
> a window where the read value of exec_id does not match the written
> value. So with very lucky timing after this change this still
> remains expoiltable.
>
> I have updated the update of exec_id on exec to use WRITE_ONCE
> and the read of exec_id in do_notify_parent to use READ_ONCE
> to make it clear that there is no locking between these two
> locations.
>
> Link: https://lore.kernel.org/kernel-hardening/[email protected]
> Fixes: 2.3.23pre2
> Cc: [email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>

Reviewed-by: Bernd Edlinger <[email protected]>


Thanks
Bernd.
> ---
>
> Linus would you prefer to take this patch directly or I could put it in
> a brach and send you a pull request.
>
> fs/exec.c | 2 +-
> include/linux/sched.h | 4 ++--
> kernel/signal.c | 2 +-
> 3 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 0e46ec57fe0a..d55710a36056 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1413,7 +1413,7 @@ void setup_new_exec(struct linux_binprm * bprm)
>
> /* An exec changes our domain. We are no longer part of the thread
> group */
> - current->self_exec_id++;
> + WRITE_ONCE(current->self_exec_id, current->self_exec_id + 1);
> flush_signal_handlers(current, 0);
> }
> EXPORT_SYMBOL(setup_new_exec);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 04278493bf15..0323e4f0982a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -939,8 +939,8 @@ struct task_struct {
> struct seccomp seccomp;
>
> /* Thread group tracking: */
> - u32 parent_exec_id;
> - u32 self_exec_id;
> + u64 parent_exec_id;
> + u64 self_exec_id;
>
> /* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */
> spinlock_t alloc_lock;
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5383b562df85 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1926,7 +1926,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> * This is only possible if parent == real_parent.
> * Check if it has changed security domain.
> */
> - if (tsk->parent_exec_id != tsk->parent->self_exec_id)
> + if (tsk->parent_exec_id != READ_ONCE(tsk->parent->self_exec_id))
> sig = SIGCHLD;
> }
>
>

2020-04-02 13:50:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

Linus Torvalds <[email protected]> writes:

> tasklist_lock is aboue the hottest lock there is in all of the kernel.

Do you know code paths you see tasklist_lock being hot?

I am looking at some of the exec/signal/ptrace code paths because they
get subtle corner case wrong like a threaded exec deadlocking when
straced.

If the performance problems are in the same neighbourhood I might be
able to fix those problems while I am in the code.

Eric



2020-04-02 14:21:44

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

Jann Horn <[email protected]> writes:

> On Wed, Apr 1, 2020 at 10:50 PM Eric W. Biederman <[email protected]> wrote:
>> Replace the 32bit exec_id with a 64bit exec_id to make it impossible
>> to wrap the exec_id counter. With care an attacker can cause exec_id
>> wrap and send arbitrary signals to a newly exec'd parent. This
>> bypasses the signal sending checks if the parent changes their
>> credentials during exec.
>>
>> The severity of this problem can been seen that in my limited testing
>> of a 32bit exec_id it can take as little as 19s to exec 65536 times.
>> Which means that it can take as little as 14 days to wrap a 32bit
>> exec_id. Adam Zabrocki has succeeded wrapping the self_exe_id in 7
>> days. Even my slower timing is in the uptime of a typical server.
>
> FYI, if you actually optimize this, it's more like 12s to exec 1048576
> times according to my test, which means ~14 hours for 2^32 executions
> (on a single core). That's on an i7-4790 (a Haswell desktop processor
> that was launched about six years ago, in 2014).

Half a day. I am not at all surprised, but it is good to know it can
take so little time.

Eric

2020-04-02 19:19:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] signal: Extend exec_id to 64bits

On Thu, Apr 2, 2020 at 6:14 AM Eric W. Biederman <[email protected]> wrote:
>
> Linus Torvalds <[email protected]> writes:
>
> > tasklist_lock is aboue the hottest lock there is in all of the kernel.
>
> Do you know code paths you see tasklist_lock being hot?

It's generally not bad enough to show up on single-socket machines.

But the problem with tasklist_lock is that it's one of our remaining
completely global locks. So it scales like sh*t in some circumstances.

On single-socket machines, most of the truly nasty hot paths aren't a
huge problem, because they tend to be mostly readers. So you get the
cacheline bounce, but you don't (usually) get much busy looping. The
cacheline bounce is "almost free" on a single socket.

But because it's one of those completely global locks, on big
multi-socket machines people have reported it as a problem forever.
Even just readers can cause problems (because of the cacheline
bouncing even when you just do the reader increment), but you also end
up having more issues with writers scaling badly.

Don't get me wrong - you can get bad scaling on other locks too, even
when they aren't really global - we had that with just the reference
counter increment for the user signal accounting, after all. Neither
of the reference counts were actually global, but they were just
effectively single counters under that particular load (ie the count
was per-user, but the load ran as a single user).

The reason tasklist_lock probably doesn't come up very much is that
it's _always_ been expensive. It has also caused some fundamental
issues (I think it's the main reason we have that rule that
reader-writer locks are unfair to readers, because we have readers
from interrupt context too, but can't afford to make normal readers
disable interrupts).

A lot of the tasklist lock readers end up looping quite a bit inside
the lock (looping over threads etc), which is why it can then be a big
deal when the rare reader shows up.

We've improved a _lot_ of those loops. That has definitely helped for
the common cases. But we've never been able to really fix the lock
itself.

Linus