2022-09-10 21:17:28

by Jorge Merlino

Subject: [PATCH] Fix race condition when exec'ing setuid files

This patch fixes a race condition in check_unsafe_exec when a heavily
threaded program tries to exec a setuid file. check_unsafe_exec counts the
number of threads sharing the same fs_struct and compares it to the total
number of users of the fs_struct by looking at its users counter. If there
are more users than process threads using it, the setuid exec fails with
LSM_UNSAFE_SHARE. The problem is that, during the kernel_clone code
execution, the fs_struct users counter is incremented before the new thread
is added to the thread_group list. So there is a window in which the counter
has been incremented but the thread is not yet visible to the
while_each_thread loop in check_unsafe_exec.

This patch sort of fixes this by setting a process flag on the parent
process during the time this race is possible. Thus, if a process is
forking, it counts an extra user of the fs_struct, as the counter might be
incremented before the thread is visible. But this is not great, as it
could create the opposite problem: an external process sharing the
fs_struct may be masked by a thread that is being counted twice. I submit
this patch mainly as an idea, to introduce the issue and see if someone
comes up with a better solution.

This is a simple code to reproduce this issue:

$ cat Makefile
ALL=a b
all: $(ALL)

a: LDFLAGS=-pthread

b: b.c
	$(CC) b.c -o b
	sudo chown root:root b
	sudo chmod u+s b

test:
	for I in $$(seq 1000); do echo $$I; ./a ; done

clean:
	rm -vf $(ALL)

$ cat a.c

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

void *nothing(void *p)
{
	return NULL;
}

void *target(void *p) {
	for (;;) {
		pthread_t t;
		if (pthread_create(&t, NULL, nothing, NULL) == 0)
			pthread_join(t, NULL);
	}
	return NULL;
}

int main(void)
{
	struct timespec tv;
	int i;

	for (i = 0; i < 10; i++) {
		pthread_t t;
		pthread_create(&t, NULL, target, NULL);
	}
	tv.tv_sec = 0;
	tv.tv_nsec = 100000;
	nanosleep(&tv, NULL);
	if (execl("./b", "./b", NULL) < 0)
		perror("execl");
	return 0;
}

$ cat b.c

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const uid_t euid = geteuid();
	if (euid != 0) {
		printf("Failed, got euid %d (expecting 0)\n", euid);
		return 1;
	}
	return 0;
}

$ make
cc -pthread a.c -o a
cc b.c -o b
sudo chown root:root b
sudo chmod u+s b
$ make test

Without this fix, one will see occasional 'Failed, got euid 1000 (expecting 0)' messages.
---
fs/exec.c | 2 ++
include/linux/sched.h | 1 +
kernel/fork.c | 3 +++
3 files changed, 6 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 9a5ca7b82bfc..a6f949a899d5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1581,6 +1581,8 @@ static void check_unsafe_exec(struct linux_binprm *bprm)
 	while_each_thread(p, t) {
 		if (t->fs == p->fs)
 			n_fs++;
+		if (t->flags & PF_IN_FORK)
+			n_fs++;
 	}
 	rcu_read_unlock();

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..f307165a434a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1722,6 +1722,7 @@ extern struct pid *cad_pid;
 						 * I am cleaning dirty pages from some other bdi. */
 #define PF_KTHREAD		0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE		0x00400000	/* Randomize virtual address space */
+#define PF_IN_FORK		0x02000000	/* Process is forking, prevents race condition on fs_struct users value */
 #define PF_NO_SETAFFINITY	0x04000000	/* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY		0x08000000	/* Early kill for mce process policy */
 #define PF_MEMALLOC_PIN		0x10000000	/* Allocation context constrained to zones which allow long term pinning. */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8a9e92068b15..54e1e1fbe0bd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2245,6 +2245,7 @@ static __latent_entropy struct task_struct *copy_process(
 	retval = copy_files(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
+	current->flags |= PF_IN_FORK;
 	retval = copy_fs(clone_flags, p);
 	if (retval)
 		goto bad_fork_cleanup_files;
@@ -2474,6 +2475,7 @@ static __latent_entropy struct task_struct *copy_process(
 		attach_pid(p, PIDTYPE_PID);
 		nr_threads++;
 	}
+	current->flags &= ~PF_IN_FORK;
 	total_forks++;
 	hlist_del_init(&delayed.node);
 	spin_unlock(&current->sighand->siglock);
@@ -2556,6 +2558,7 @@ static __latent_entropy struct task_struct *copy_process(
 	spin_lock_irq(&current->sighand->siglock);
 	hlist_del_init(&delayed.node);
 	spin_unlock_irq(&current->sighand->siglock);
+	current->flags &= ~PF_IN_FORK;
 	return ERR_PTR(retval);
 }

--
2.34.1


2022-09-13 22:16:02

by Kees Cook

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On Sat, Sep 10, 2022 at 06:12:14PM -0300, Jorge Merlino wrote:
> This patch fixes a race condition in check_unsafe_exec when a heavily
> threaded program tries to exec a setuid file. check_unsafe_exec counts the
> number of threads sharing the same fs_struct and compares it to the total
> number of users of the fs_struct by looking at its users counter. If there
> are more users than process threads using it, the setuid exec fails with
> LSM_UNSAFE_SHARE. The problem is that, during the kernel_clone code
> execution, the fs_struct users counter is incremented before the new thread
> is added to the thread_group list. So there is a race when the counter has
> been incremented but the thread is not visible to the while_each_thread loop
> in check_unsafe_exec.

Thanks for reporting this and for having a reproducer!

It looks like this is "failing safe", in the sense that the bug causes
an exec of a setuid binary to not actually change the euid. Is that an
accurate understanding here?

> This patch sort of fixes this by setting a process flag to the parent
> process during the time this race is possible. Thus, if a process is
> forking, it counts an extra user of the fs_struct as the counter might be
> incremented before the thread is visible. But this is not great as this
> could generate the opposite problem as there may be an external process
> sharing the fs_struct that is masked by some thread that is being counted
> twice. I submit this patch just as an idea but mainly I want to introduce
> this issue and see if someone comes up with a better solution.

I'll want to spend some more time studying this race, but yes, it looks
like it should get fixed. I'm curious, though, how did you find this
problem? It seems quite unusual to have a high-load heavily threaded
process decide to exec.

-Kees

--
Kees Cook

2022-09-18 21:37:47

by Jorge Merlino

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files


On 13/9/22 at 19:03, Kees Cook wrote:
> Thanks for reporting this and for having a reproducer!
>
> It looks like this is "failing safe", in the sense that the bug causes
> an exec of a setuid binary to not actually change the euid. Is that an
> accurate understanding here?

Yes, that is correct.

>> This patch sort of fixes this by setting a process flag to the parent
>> process during the time this race is possible. Thus, if a process is
>> forking, it counts an extra user of the fs_struct as the counter might be
>> incremented before the thread is visible. But this is not great as this
>> could generate the opposite problem as there may be an external process
>> sharing the fs_struct that is masked by some thread that is being counted
>> twice. I submit this patch just as an idea but mainly I want to introduce
>> this issue and see if someone comes up with a better solution.
>
> I'll want to spend some more time studying this race, but yes, it looks
> like it should get fixed. I'm curious, though, how did you find this
> problem? It seems quite unusual to have a high-load heavily threaded
> process decide to exec.

It was reported to Canonical by a customer. I don't know the exact
circumstances in which they hit this problem in production.

Thanks
Jorge

2022-10-05 16:18:15

by Jorge Merlino

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On 13/9/22 19:03, Kees Cook wrote:
> I'll want to spend some more time studying this race, but yes, it looks
> like it should get fixed. I'm curious, though, how did you find this
> problem? It seems quite unusual to have a high-load heavily threaded
> process decide to exec.

I just got a response from our customer regarding the situation where
this race condition occurs:


Our application is a Rust-based CLI tool that acts as a frontend to
cloud-based testing infrastructure. In one mode of operation it uploads
a large number of test artifacts to cloud storage, spawns compute
instances, and then starts a VPN connection to those instances. The
application creates the VPN connection by executing another setuid-root
tool as a subprocess. We see that this subprocess sometimes fails to
setuid. The "high-load heavily threaded" aspect comes from the fact that
we're using the Tokio runtime. Each upload to cloud storage is a
separate Tokio task (i.e. "green thread") and these are scheduled onto
"N" OS-level threads, where N = nproc. In a large run we may upload a
couple thousand artifacts but limit to 50 concurrent uploads. Once these
artifact uploads complete, we typically spawn the setuid subprocess
within 1-2 seconds.

Have you been able to look at this issue?

Thanks
Jorge

2022-10-06 03:14:55

by Kees Cook

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On Wed, Oct 05, 2022 at 01:09:36PM -0300, Jorge Merlino wrote:
> On 13/9/22 19:03, Kees Cook wrote:
> > I'll want to spend some more time studying this race, but yes, it looks
> > like it should get fixed. I'm curious, though, how did you find this
> > problem? It seems quite unusual to have a high-load heavily threaded
> > process decide to exec.
>
> I just got a response from our customer regarding the situation where this
> race condition occurs:

Thanks for getting back details!

> Our application is a Rust-based CLI tool that acts as a frontend to
> cloud-based testing infrastructure. In one mode of operation it uploads a
> large number of test artifacts to cloud storage, spawns compute instances,
> and then starts a VPN connection to those instances. The application creates
> the VPN connection by executing another setuid-root tool as a subprocess. We
> see that this subprocess sometimes fails to setuid. The "high-load heavily
> threaded" aspect comes from the fact that we're using the Tokio runtime.
> Each upload to cloud storage is a separate Tokio task (i.e. "green thread")
> and these are scheduled onto "N" OS-level threads, where N = nproc. In a
> large run we may upload a couple thousand artifacts but limit to 50
> concurrent uploads. Once these artifact uploads complete, we typically spawn
> the setuid subprocess within 1-2 seconds.

Interesting. Seems like the execve might be racing all the threads
exiting?

> Have you been able to look at this issue?

I'll continue looking at this.

Dave, this tracks back to commit a6f76f23d297 ("CRED: Make execve() take
advantage of copy-on-write credentials") ... any ideas what's happening
here?

--
Kees Cook

2022-10-06 07:37:08

by Kees Cook

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On Wed, Oct 05, 2022 at 08:06:15PM -0700, Kees Cook wrote:
> Dave, this tracks back to commit a6f76f23d297 ("CRED: Make execve() take
> advantage of copy-on-write credentials") ... any ideas what's happening
> here?

Er, rather, it originates before git history, but moved under lock in
commit 0bf2f3aec547 ("CRED: Fix SUID exec regression").

Eric, Al, Hugh, does this ring a bell?

It originates from 1da177e4c3f4 ("Linux-2.6.12-rc2") in git...

static inline int unsafe_exec(struct task_struct *p)
{
	int unsafe = 0;
	...
	if (atomic_read(&p->fs->count) > 1 ||
	    atomic_read(&p->files->count) > 1 ||
	    atomic_read(&p->sighand->count) > 1)
		unsafe |= LSM_UNSAFE_SHARE;

	return unsafe;
}

Current code is:

static void check_unsafe_exec(struct linux_binprm *bprm)
{
	struct task_struct *p = current, *t;
	unsigned n_fs;
	...
	t = p;
	n_fs = 1;
	spin_lock(&p->fs->lock);
	rcu_read_lock();
	while_each_thread(p, t) {
		if (t->fs == p->fs)
			n_fs++;
	}
	rcu_read_unlock();

	if (p->fs->users > n_fs)
		bprm->unsafe |= LSM_UNSAFE_SHARE;
	else
		p->fs->in_exec = 1;
	spin_unlock(&p->fs->lock);
}


Which seemed to take its form from:

0bf2f3aec547 ("CRED: Fix SUID exec regression")

Quoting the rationale for the checks:
...
moved the place in which the 'safeness' of a SUID/SGID exec was performed to
before de_thread() was called. This means that LSM_UNSAFE_SHARE is now
calculated incorrectly. This flag is set if any of the usage counts for
fs_struct, files_struct and sighand_struct are greater than 1 at the time the
determination is made. All of which are true for threads created by the
pthread library.

So, instead, we count up the number of threads (CLONE_THREAD) that are sharing
our fs_struct (CLONE_FS), files_struct (CLONE_FILES) and sighand_structs
(CLONE_SIGHAND/CLONE_THREAD) with us. These will be killed by de_thread() and
so can be discounted by check_unsafe_exec().

So, I think this is verifying that when attempting a suid exec, there is
no process out there with our fs_struct, file_struct, or sighand_struct
that would survive the de_thread() and be able to muck with the suid's
shared environment:

	if (atomic_read(&p->fs->count) > n_fs ||
	    atomic_read(&p->files->count) > n_files ||
	    atomic_read(&p->sighand->count) > n_sighand)

Current code has eliminated the n_files and n_sighand tests:

n_sighand was removed by commit
f1191b50ec11 ("check_unsafe_exec() doesn't care about signal handlers sharing")

n_files was removed by commit
e426b64c412a ("fix setuid sometimes doesn't")

The latter reads very much like the current bug report. :) So likely the n_fs
test is buggy too...

After de_thread(), I see the calls to unshare_sighand() and
unshare_files(), so those check out.

What's needed to make p->fs safe? Doing an unshare of it seems possible,
since it exists half as a helper, unshare_fs(), and half open-coded in
ksys_unshare (see "new_fs").

Should we wire this up after de_thread() like the other two?

-Kees

--
Kees Cook

2022-10-06 21:10:58

by Kees Cook

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On Thu, Sep 13, 2022 at 15:03:38 -0700, Kees Cook wrote:
> It seems quite unusual to have a high-load heavily threaded
> process decide to exec.

In looking at this a bunch more, I actually think everything is working
as intended. If a process is actively launching threads while also trying
to exec, they're going to create races for themselves.

So the question, then, is "why are they trying to exec while actively
spawning new threads?" That appears to be the core problem here, and as
far as I can tell, the kernel has behaved this way for a very long time.
I don't think the kernel should fix this, either, because it leads to a
very weird state for userspace, where the thread spawner may suddenly
die due to the exec happening in another thread. This really looks like
something userspace needs to handle correctly (i.e. don't try to exec
while actively spawning threads).

For example, here's a fix to the original PoC:

--- a.c.original	2022-10-06 13:07:13.279845619 -0700
+++ a.c	2022-10-06 13:10:27.702941645 -0700
@@ -8,8 +8,10 @@
 	return NULL;
 }
 
+int stop_spawning;
+
 void *target(void *p) {
-	for (;;) {
+	while (!stop_spawning) {
 		pthread_t t;
 		if (pthread_create(&t, NULL, nothing, NULL) == 0)
 			pthread_join(t, NULL);
@@ -17,18 +19,26 @@
 	return NULL;
 }
 
+#define MAX_THREADS 10
+
 int main(void)
 {
+	pthread_t threads[MAX_THREADS];
 	struct timespec tv;
 	int i;
 
-	for (i = 0; i < 10; i++) {
-		pthread_t t;
-		pthread_create(&t, NULL, target, NULL);
+	for (i = 0; i < MAX_THREADS; i++) {
+		pthread_create(&threads[i], NULL, target, NULL);
 	}
 	tv.tv_sec = 0;
 	tv.tv_nsec = 100000;
 	nanosleep(&tv, NULL);
+
+	/* Signal shut down, and collect spawners. */
+	stop_spawning = 1;
+	for (i = 0; i < MAX_THREADS; i++)
+		pthread_join(threads[i], NULL);
+
 	if (execl("./b", "./b", NULL) < 0)
 		perror("execl");
 	return 0;

--
Kees Cook

2022-10-07 02:54:47

by Theodore Ts'o

Subject: Re: [PATCH] Fix race condition when exec'ing setuid files

On Thu, Oct 06, 2022 at 01:20:35PM -0700, Kees Cook wrote:
>
> So the question, then, is "why are they trying to exec while actively
> spawning new threads?" That appears to be the core problem here, and as
> far as I can tell, the kernel has behaved this way for a very long time.
> I don't think the kernel should fix this, either, because it leads to a
> very weird state for userspace, where the thread spawner may suddenly
> die due to the exec happening in another thread. This really looks like
> something userspace needs to handle correctly (i.e. don't try to exec
> while actively spawning threads).

One of the classic failure modes is when a threaded program calls a
library, and that library might try to do a fork/exec (or call
system(3)) to run some command, e.g. "lvm create ...", or to spawn
some kind of helper daemon.

There are a number of stack overflow questions about this, and there
are some solutions to _some_ of the problems, such as using
pthread_atfork() when you know you are about to call fork/exec,
and using some out-of-band mechanism to make sure no threads get
spawned until the fork/exec is completed --- but if you don't know
that a library is going to do a fork/exec, well, life is tough.

One technique even advocated by a stack overflow article is "avoid
using threads whenever possible". :-/

- Ted

2022-10-07 12:16:52

by David Laight

Subject: RE: [PATCH] Fix race condition when exec'ing setuid files

From: Theodore Ts'o
> Sent: 07 October 2022 02:41
>
> On Thu, Oct 06, 2022 at 01:20:35PM -0700, Kees Cook wrote:
> >
> > So the question, then, is "why are they trying to exec while actively
> > spawning new threads?" That appears to be the core problem here, and as
> > far as I can tell, the kernel has behaved this way for a very long time.
> > I don't think the kernel should fix this, either, because it leads to a
> > very weird state for userspace, where the thread spawner may suddenly
> > die due to the exec happening in another thread. This really looks like
> > something userspace needs to handle correctly (i.e. don't try to exec
> > while actively spawning threads).
>
> One of the classic failure modes is when a threaded program calls a
> library, and that library might try to do a fork/exec (or call
> system(3) to run some command. e.g., such as running "lvm create ..."
> or to spawn some kind of helper daemon.
>
> There are a number of stack overflow questions about this, and there
> are some solutions to _some_ of the problems, such as using
> pthread_atfork(), and knowing that you are about to call fork/exec,
> and use some out of band mechanism to to make sure no threads get
> spawned until the fork/exec is completed --- but if you don't know
> that a library is going to do a fork/exec, well, life is tough.

Or that a library thread is about to create a new thread.

> One technique even advocated by a stack overflow article is "avoid
> using threads whenver possible". :-/

Doesn't fork() only create the current thread in the new process?
So by the time exec() runs there is a nice single-threaded process
with an fd table that isn't shared.

For helpers there is always (a properly implemented) posix_spawn() :-)

David
