2009-03-28 23:17:27

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 1/4] compat_do_execve should unshare_files

2.6.26's commit fd8328be874f4190a811c58cd4778ec2c74d2c05
"sanitize handling of shared descriptor tables in failing execve()"
moved the unshare_files() from flush_old_exec() and several binfmts
to the head of do_execve(); but forgot to make the same change to
compat_do_execve(), leaving a CLONE_FILES files_struct shared across
exec from a 32-bit process on a 64-bit kernel.

It's arguable whether the files_struct really ought to be unshared
across exec; but 2.6.1 made that so to stop the loading binary's fd
leaking into other threads, and a 32-bit process on a 64-bit kernel
ought to behave in the same way as 32 on 32 and 64 on 64.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

fs/compat.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

--- 2.6.29/fs/compat.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/fs/compat.c 2009-03-28 18:06:02.000000000 +0000
@@ -1392,12 +1392,17 @@ int compat_do_execve(char * filename,
{
struct linux_binprm *bprm;
struct file *file;
+ struct files_struct *displaced;
int retval;

+ retval = unshare_files(&displaced);
+ if (retval)
+ goto out_ret;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
- goto out_ret;
+ goto out_files;

retval = mutex_lock_interruptible(&current->cred_exec_mutex);
if (retval < 0)
@@ -1457,6 +1462,8 @@ int compat_do_execve(char * filename,
mutex_unlock(&current->cred_exec_mutex);
acct_update_integrals(current);
free_bprm(bprm);
+ if (displaced)
+ put_files_struct(displaced);
return retval;

out:
@@ -1475,6 +1482,9 @@ out_unlock:
out_free:
free_bprm(bprm);

+out_files:
+ if (displaced)
+ reset_files_struct(displaced);
out_ret:
return retval;
}


2009-03-28 23:21:59

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 2/4] fix setuid sometimes doesn't

Joe Malicki reports that setuid sometimes doesn't: very rarely,
a setuid root program does not get root euid; and, by the way,
they have a health check running lsof every few minutes.

Right, check_unsafe_exec() notes whether the files_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/fd and /proc/<pid>/fdinfo lookups make transient
use of get_files_struct(), which also raises that sharing count.

There's a rather simple fix for this: exec's check on files->count
has been redundant ever since 2.6.1 made it unshare_files() (except
while compat_do_execve() omitted to do so) - just remove that check.

[Note to -stable: this patch will not apply before 2.6.29: earlier
releases should just remove the files->count line from unsafe_exec().]

Reported-by: Joe Malicki <[email protected]>
Narrowed-down-by: Michael Itz <[email protected]>
Tested-by: Joe Malicki <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

fs/compat.c | 2 +-
fs/exec.c | 10 +++-------
fs/internal.h | 2 +-
3 files changed, 5 insertions(+), 9 deletions(-)

--- 2.6.29/fs/compat.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/fs/compat.c 2009-03-28 18:06:02.000000000 +0000
@@ -1412,7 +1412,7 @@ int compat_do_execve(char * filename,
bprm->cred = prepare_exec_creds();
if (!bprm->cred)
goto out_unlock;
- check_unsafe_exec(bprm, current->files);
+ check_unsafe_exec(bprm);

file = open_exec(filename);
retval = PTR_ERR(file);
--- 2.6.29/fs/exec.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/fs/exec.c 2009-03-28 18:06:02.000000000 +0000
@@ -1049,28 +1049,24 @@ EXPORT_SYMBOL(install_exec_creds);
* - the caller must hold current->cred_exec_mutex to protect against
* PTRACE_ATTACH
*/
-void check_unsafe_exec(struct linux_binprm *bprm, struct files_struct *files)
+void check_unsafe_exec(struct linux_binprm *bprm)
{
struct task_struct *p = current, *t;
unsigned long flags;
- unsigned n_fs, n_files, n_sighand;
+ unsigned n_fs, n_sighand;

bprm->unsafe = tracehook_unsafe_exec(p);

n_fs = 1;
- n_files = 1;
n_sighand = 1;
lock_task_sighand(p, &flags);
for (t = next_thread(p); t != p; t = next_thread(t)) {
if (t->fs == p->fs)
n_fs++;
- if (t->files == files)
- n_files++;
n_sighand++;
}

if (atomic_read(&p->fs->count) > n_fs ||
- atomic_read(&p->files->count) > n_files ||
atomic_read(&p->sighand->count) > n_sighand)
bprm->unsafe |= LSM_UNSAFE_SHARE;

@@ -1289,7 +1285,7 @@ int do_execve(char * filename,
bprm->cred = prepare_exec_creds();
if (!bprm->cred)
goto out_unlock;
- check_unsafe_exec(bprm, displaced);
+ check_unsafe_exec(bprm);

file = open_exec(filename);
retval = PTR_ERR(file);
--- 2.6.29/fs/internal.h 2009-03-23 23:12:14.000000000 +0000
+++ linux/fs/internal.h 2009-03-28 18:06:02.000000000 +0000
@@ -43,7 +43,7 @@ extern void __init chrdev_init(void);
/*
* exec.c
*/
-extern void check_unsafe_exec(struct linux_binprm *, struct files_struct *);
+extern void check_unsafe_exec(struct linux_binprm *);

/*
* namespace.c

2009-03-28 23:23:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 3/4] fix setuid sometimes wouldn't

check_unsafe_exec() also notes whether the fs_struct is being
shared by more threads than will get killed by the exec, and if so
sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
use of get_fs_struct(), which also raises that sharing count.

This might occasionally cause a setuid program not to change euid,
in the same way as happened with files->count (check_unsafe_exec
also looks at sighand->count, but /proc doesn't raise that one).

We'd prefer exec not to unshare fs_struct: so fix this in procfs,
replacing get_fs_struct() by get_fs_path(), which does path_get
while still holding task_lock, instead of raising fs->count.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
___

fs/proc/base.c | 50 +++++++++++++++--------------------------------
1 file changed, 16 insertions(+), 34 deletions(-)

--- 2.6.29/fs/proc/base.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/fs/proc/base.c 2009-03-28 18:06:02.000000000 +0000
@@ -146,15 +146,22 @@ static unsigned int pid_entry_count_dirs
return count;
}

-static struct fs_struct *get_fs_struct(struct task_struct *task)
+static int get_fs_path(struct task_struct *task, struct path *path, bool root)
{
struct fs_struct *fs;
+ int result = -ENOENT;
+
task_lock(task);
fs = task->fs;
- if(fs)
- atomic_inc(&fs->count);
+ if (fs) {
+ read_lock(&fs->lock);
+ *path = root ? fs->root : fs->pwd;
+ path_get(path);
+ read_unlock(&fs->lock);
+ result = 0;
+ }
task_unlock(task);
- return fs;
+ return result;
}

static int get_nr_threads(struct task_struct *tsk)
@@ -172,42 +179,24 @@ static int get_nr_threads(struct task_st
static int proc_cwd_link(struct inode *inode, struct path *path)
{
struct task_struct *task = get_proc_task(inode);
- struct fs_struct *fs = NULL;
int result = -ENOENT;

if (task) {
- fs = get_fs_struct(task);
+ result = get_fs_path(task, path, 0);
put_task_struct(task);
}
- if (fs) {
- read_lock(&fs->lock);
- *path = fs->pwd;
- path_get(&fs->pwd);
- read_unlock(&fs->lock);
- result = 0;
- put_fs_struct(fs);
- }
return result;
}

static int proc_root_link(struct inode *inode, struct path *path)
{
struct task_struct *task = get_proc_task(inode);
- struct fs_struct *fs = NULL;
int result = -ENOENT;

if (task) {
- fs = get_fs_struct(task);
+ result = get_fs_path(task, path, 1);
put_task_struct(task);
}
- if (fs) {
- read_lock(&fs->lock);
- *path = fs->root;
- path_get(&fs->root);
- read_unlock(&fs->lock);
- result = 0;
- put_fs_struct(fs);
- }
return result;
}

@@ -596,7 +585,6 @@ static int mounts_open_common(struct ino
struct task_struct *task = get_proc_task(inode);
struct nsproxy *nsp;
struct mnt_namespace *ns = NULL;
- struct fs_struct *fs = NULL;
struct path root;
struct proc_mounts *p;
int ret = -EINVAL;
@@ -610,22 +598,16 @@ static int mounts_open_common(struct ino
get_mnt_ns(ns);
}
rcu_read_unlock();
- if (ns)
- fs = get_fs_struct(task);
+ if (ns && get_fs_path(task, &root, 1) == 0)
+ ret = 0;
put_task_struct(task);
}

if (!ns)
goto err;
- if (!fs)
+ if (ret)
goto err_put_ns;

- read_lock(&fs->lock);
- root = fs->root;
- path_get(&root);
- read_unlock(&fs->lock);
- put_fs_struct(fs);
-
ret = -ENOMEM;
p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL);
if (!p)

2009-03-28 23:24:45

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 4/4] Annotate struct fs_struct's usage count restriction

From: David Howells <[email protected]>

Annotate struct fs_struct's usage count to indicate the restrictions upon it.
It may not be incremented, except by clone(CLONE_FS), as this affects the
check in check_unsafe_exec() in fs/exec.c.

Signed-off-by: David Howells <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

include/linux/fs_struct.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- 2.6.29/include/linux/fs_struct.h 2009-03-23 23:12:14.000000000 +0000
+++ linux/include/linux/fs_struct.h 2009-03-28 18:06:02.000000000 +0000
@@ -4,7 +4,10 @@
#include <linux/path.h>

struct fs_struct {
- atomic_t count;
+ atomic_t count; /* This usage count is used by check_unsafe_exec() for
+ * security checking purposes - therefore it may not be
+ * incremented, except by clone(CLONE_FS).
+ */
rwlock_t lock;
int umask;
struct path root, pwd;

2009-03-29 00:59:40

by Oleg Nesterov

[permalink] [raw]
Subject: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

> -void check_unsafe_exec(struct linux_binprm *bprm, struct files_struct *files)
> +void check_unsafe_exec(struct linux_binprm *bprm)
> {
> struct task_struct *p = current, *t;
> unsigned long flags;
> - unsigned n_fs, n_files, n_sighand;
> + unsigned n_fs, n_sighand;
>
> bprm->unsafe = tracehook_unsafe_exec(p);
>
> n_fs = 1;
> - n_files = 1;
> n_sighand = 1;
> lock_task_sighand(p, &flags);
> for (t = next_thread(p); t != p; t = next_thread(t)) {
> if (t->fs == p->fs)
> n_fs++;
> - if (t->files == files)
> - n_files++;
> n_sighand++;
> }
>
> if (atomic_read(&p->fs->count) > n_fs ||
> - atomic_read(&p->files->count) > n_files ||
> atomic_read(&p->sighand->count) > n_sighand)
> bprm->unsafe |= LSM_UNSAFE_SHARE;

Can't find the patch which introduced check_unsafe_exec(), so
I am asking here.

How it is supposed to work?

Let's suppose we have two threads T1 and T2. T1 exits, and calls
exit_fs().

exit_fs:

tsk->fs = NULL;
// WINDOW
put_fs_struct(fs);

Now, if T2 does exec() and check_unsafe_exec() happens in the WINDOW
above, we set LSM_UNSAFE_SHARE.

Or we can race with sub-thread doing clone(CLONE_FS|CLONE_THREAD),
the new thread is not visible in ->thread_group, buy copy_fs()
can already increment fs->count.


Another question. Why do we check sighand->count? We always unshare
->sighand on exec, see de_thread().


Minor, but why lock_task_sighand() ? This helper is "__must_check".
If it can't fail (yes, it can't fail here), spin_lock_irq(siglock)
is enough. (and given that ->siglock can't help anyway to calculate
n_fs, we could use rcu_read_lock() instead).


(as for these patches, I think they are correct).

Oleg.

2009-03-29 04:11:48

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Sun, Mar 29, 2009 at 01:53:43AM +0100, Oleg Nesterov wrote:

> Can't find the patch which introduced check_unsafe_exec(), so
> I am asking here.
>
> How it is supposed to work?
>
> Let's suppose we have two threads T1 and T2. T1 exits, and calls
> exit_fs().
>
> exit_fs:
>
> tsk->fs = NULL;
> // WINDOW
> put_fs_struct(fs);
>
> Now, if T2 does exec() and check_unsafe_exec() happens in the WINDOW
> above, we set LSM_UNSAFE_SHARE.
>
> Or we can race with sub-thread doing clone(CLONE_FS|CLONE_THREAD),
> the new thread is not visible in ->thread_group, buy copy_fs()
> can already increment fs->count.

Frankly, I don't think we really care. Note that having several sub-threads
and doing execve() in one of them will kill the rest, so you really want
to do some kind of synchronization to get something similar to reasonable
behaviour anyway.

> Another question. Why do we check sighand->count? We always unshare
> ->sighand on exec, see de_thread().

Correct. That check can and should go.

2009-03-29 04:15:31

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Sun, Mar 29, 2009 at 05:10:22AM +0100, Al Viro wrote:

> Frankly, I don't think we really care. Note that having several sub-threads
> and doing execve() in one of them will kill the rest, so you really want
> to do some kind of synchronization

in userland, that is

> to get something similar to reasonable
> behaviour anyway.

2009-03-29 04:57:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/29, Al Viro wrote:
>
> On Sun, Mar 29, 2009 at 01:53:43AM +0100, Oleg Nesterov wrote:
>
> > Let's suppose we have two threads T1 and T2. T1 exits, and calls
> > exit_fs().
> >
> > exit_fs:
> >
> > tsk->fs = NULL;
> > // WINDOW
> > put_fs_struct(fs);
> >
> > Now, if T2 does exec() and check_unsafe_exec() happens in the WINDOW
> > above, we set LSM_UNSAFE_SHARE.
> >
> > Or we can race with sub-thread doing clone(CLONE_FS|CLONE_THREAD),
> > the new thread is not visible in ->thread_group, buy copy_fs()
> > can already increment fs->count.
>
> Frankly, I don't think we really care. Note that having several sub-threads
> and doing execve() in one of them will kill the rest, so you really want
> to do some kind of synchronization to get something similar to reasonable
> behaviour anyway.

OK.

Let's suppose that check_unsafe_exec() does not set LSM_UNSAFE_SHARE and
drops ->siglock. After that, another sub-thread does clone(CLONE_FS) without
CLONE_THREAD.

Unless we killed other threads, I can't see how we can check ->fs is not
shared with another process, we can fool ->bprm_set_creds() anyway.

Confused.

Oleg.

2009-03-29 05:56:34

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Sun, Mar 29, 2009 at 06:52:06AM +0200, Oleg Nesterov wrote:

> Let's suppose that check_unsafe_exec() does not set LSM_UNSAFE_SHARE and
> drops ->siglock. After that, another sub-thread does clone(CLONE_FS) without
> CLONE_THREAD.

Lovely. And yes, AFAICS that's a hole.

> Unless we killed other threads, I can't see how we can check ->fs is not
> shared with another process, we can fool ->bprm_set_creds() anyway.

We can't do that, until we are past the point of no return. Charming...
In principle, we can mark these threads as "-EAGAIN on such clone()" and
clean that on exec failure.

2009-03-29 06:03:50

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Sun, Mar 29, 2009 at 06:55:13AM +0100, Al Viro wrote:
> On Sun, Mar 29, 2009 at 06:52:06AM +0200, Oleg Nesterov wrote:
>
> > Let's suppose that check_unsafe_exec() does not set LSM_UNSAFE_SHARE and
> > drops ->siglock. After that, another sub-thread does clone(CLONE_FS) without
> > CLONE_THREAD.
>
> Lovely. And yes, AFAICS that's a hole.
>
> > Unless we killed other threads, I can't see how we can check ->fs is not
> > shared with another process, we can fool ->bprm_set_creds() anyway.
>
> We can't do that, until we are past the point of no return. Charming...
> In principle, we can mark these threads as "-EAGAIN on such clone()" and
> clean that on exec failure.

... or just do that to fs_struct. After finding that there's no outside
users. Commenst?

2009-03-29 11:12:46

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 3/4] fix setuid sometimes wouldn't

On Sat, Mar 28, 2009 at 11:21:27PM +0000, Hugh Dickins wrote:
> check_unsafe_exec() also notes whether the fs_struct is being
> shared by more threads than will get killed by the exec, and if so
> sets LSM_UNSAFE_SHARE to make bprm_set_creds() careful about euid.
> But /proc/<pid>/cwd and /proc/<pid>/root lookups make transient
> use of get_fs_struct(), which also raises that sharing count.
>
> This might occasionally cause a setuid program not to change euid,
> in the same way as happened with files->count (check_unsafe_exec
> also looks at sighand->count, but /proc doesn't raise that one).
>
> We'd prefer exec not to unshare fs_struct: so fix this in procfs,
> replacing get_fs_struct() by get_fs_path(), which does path_get
> while still holding task_lock, instead of raising fs->count.

> --- 2.6.29/fs/proc/base.c
> +++ linux/fs/proc/base.c
> @@ -146,15 +146,22 @@ static unsigned int pid_entry_count_dirs
> return count;
> }
>
> -static struct fs_struct *get_fs_struct(struct task_struct *task)
> +static int get_fs_path(struct task_struct *task, struct path *path, bool root)
> {
> struct fs_struct *fs;
> + int result = -ENOENT;
> +
> task_lock(task);
> fs = task->fs;
> - if(fs)
> - atomic_inc(&fs->count);
> + if (fs) {
> + read_lock(&fs->lock);
> + *path = root ? fs->root : fs->pwd;
> + path_get(path);
> + read_unlock(&fs->lock);
> + result = 0;
> + }
> task_unlock(task);
> - return fs;
> + return result;
> }

I think it's better to open-code. "root" parameter is unnatural.

2009-03-29 21:42:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/29, Al Viro wrote:
>
> > In principle, we can mark these threads as "-EAGAIN on such clone()" and
> > clean that on exec failure.

We can't. We can miss the new subthread if we race with clone(CLONE_THREAD).
Unless we add the additional locking, of course.

We can set current->signal->flags |= SIGNAL_DO_NOT_CLONE_FS. But this is
really nasty. For examlpe, what if this flag is already set when
check_unsafe_exec() takes ->siglock ? We should return -ESOMETHING, not
good. Or schedule_timeout_uninterruptible(1) until it is cleared?

This also means copy_process()->copy_fs() path should take ->siglock too,
otherwise we we don't have a barrier.

> ... or just do that to fs_struct. After finding that there's no outside
> users. Commenst?

This is even worse. Not only we race with our sub-threads, we race
with CLONE_FS processes.

We can't mark fs_struct after finding that there's no outside users
lockless. Because we can't know whether this is "after" or not, we
can't trust "atomic_read(fs->count) <= n_fs".

Unless we re-use fs_struct->lock. In this case copy_fs() should take
it too. But again, ->fs can be already marked when we enter
check_unsafe_exec().


And btw check_unsafe_exec() seem to have another hole. Another thread
(which shares ->fs with us) can do exit_fs() right before we read
fs->count. Since this thread was already accounted in n_fs, we can
miss the fact we share ->fs with another process.


Perhaps I missed something...


Not that I like this idea (actually I hate), but perhaps we can change
the meaning of LSM_UNSAFE_SHARE,

selinux_bprm_set_creds:

if (new_tsec->sid != old_tsec->sid) {
...

if (avc_has_perm(...))
bprm->unsafe |= LSM_UNSAFE_SHARE;
}


Then we modify de_thread(). It sends SIGKILL to all subthreads, this
means that another thread can't clone() after we drop ->siglock. So we
can add this code to the ->siglock protected section

if (unlikely(bprm->unsafe & LSM_UNSAFE_SHARE)) {
if (fs_struct_is_shared())
return -EPERM;
}

...
zap_other_threads();

Oh, ugly.

I'd better hope I missed something ;)

Oleg.

2009-03-29 21:55:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/4] fix setuid sometimes wouldn't

On 03/29, Alexey Dobriyan wrote:
>
> On Sat, Mar 28, 2009 at 11:21:27PM +0000, Hugh Dickins wrote:
> >
> > -static struct fs_struct *get_fs_struct(struct task_struct *task)
> > +static int get_fs_path(struct task_struct *task, struct path *path, bool root)
> > {
> > struct fs_struct *fs;
> > + int result = -ENOENT;
> > +
> > task_lock(task);
> > fs = task->fs;
> > - if(fs)
> > - atomic_inc(&fs->count);
> > + if (fs) {
> > + read_lock(&fs->lock);
> > + *path = root ? fs->root : fs->pwd;
> > + path_get(path);
> > + read_unlock(&fs->lock);
> > + result = 0;
> > + }
> > task_unlock(task);
> > - return fs;
> > + return result;
> > }
>
> I think it's better to open-code. "root" parameter is unnatural.

IMHO, open-coding doesn't improve readability. I am not even talking
about code size (it doesn't matter of course, the kernel is so tiny ;).

Perhaps "enum what" is more natural compared to "bool root", but this
helper is simple enough.

Personally, I think this patch makes sense even if it didn't fix the
bug, it cleanups the code and makes it more readable.

Oleg.

2009-03-29 22:21:44

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Sun, Mar 29, 2009 at 11:36:35PM +0200, Oleg Nesterov wrote:
> > ... or just do that to fs_struct. After finding that there's no outside
> > users. Commenst?
>
> This is even worse. Not only we race with our sub-threads, we race
> with CLONE_FS processes.
>
> We can't mark fs_struct after finding that there's no outside users
> lockless. Because we can't know whether this is "after" or not, we
> can't trust "atomic_read(fs->count) <= n_fs".

We can lock fs_struct in question, go through the threads, then mark
or bail out. With cloning a reference to fs_struct protected by the
same lock.

FWIW, I'm not at all sure that we want atomic_t for refcount in that
case...

2009-03-29 22:39:04

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 3/4] fix setuid sometimes wouldn't

On Sun, Mar 29, 2009 at 11:48:55PM +0200, Oleg Nesterov wrote:
> On 03/29, Alexey Dobriyan wrote:
> >
> > On Sat, Mar 28, 2009 at 11:21:27PM +0000, Hugh Dickins wrote:
> > >
> > > -static struct fs_struct *get_fs_struct(struct task_struct *task)
> > > +static int get_fs_path(struct task_struct *task, struct path *path, bool root)
> > > {
> > > struct fs_struct *fs;
> > > + int result = -ENOENT;
> > > +
> > > task_lock(task);
> > > fs = task->fs;
> > > - if(fs)
> > > - atomic_inc(&fs->count);
> > > + if (fs) {
> > > + read_lock(&fs->lock);
> > > + *path = root ? fs->root : fs->pwd;
> > > + path_get(path);
> > > + read_unlock(&fs->lock);
> > > + result = 0;
> > > + }
> > > task_unlock(task);
> > > - return fs;
> > > + return result;
> > > }
> >
> > I think it's better to open-code. "root" parameter is unnatural.
>
> IMHO, open-coding doesn't improve readability. I am not even talking
> about code size (it doesn't matter of course, the kernel is so tiny ;).
>
> Perhaps "enum what" is more natural compared to "bool root", but this
> helper is simple enough.
>
> Personally, I think this patch makes sense even if it didn't fix the
> bug, it cleanups the code and makes it more readable.

Aye. Applied, with enum, but an anonymous one; no point breeding
tags without a reason.

Another note: chroot_fs_refs() shouldn't bump refcount at all; it should
do replacements, count them and then do path_release(old_root) that many
times - after having gone through all tasks.

I'm taking fs_struct handling into a new file (fs/fs_struct.c, unless
somebody has a better name), will do that in the same patch.

2009-03-30 00:07:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/29, Al Viro wrote:
>
> On Sun, Mar 29, 2009 at 11:36:35PM +0200, Oleg Nesterov wrote:
> > > ... or just do that to fs_struct. After finding that there's no outside
> > > users. Commenst?
> >
> > This is even worse. Not only we race with our sub-threads, we race
> > with CLONE_FS processes.
> >
> > We can't mark fs_struct after finding that there's no outside users
> > lockless. Because we can't know whether this is "after" or not, we
> > can't trust "atomic_read(fs->count) <= n_fs".
>
> We can lock fs_struct in question, go through the threads, then mark
> or bail out. With cloning a reference to fs_struct protected by the
> same lock.

Yes, this is what I meant, copy_fs() needs this lock too,

> FWIW, I'm not at all sure that we want atomic_t for refcount in that
> case...

I think you are right, because exit_fs() should take fs->lock as well.

But, again. What whould we do when check_unsafe_exec() takes fs->lock
and sees that this ->fs is already marked?

In that case -EWHATEVER is not very good, it could be another process
(not sub-thread) doing exec.

Oleg.

2009-03-30 00:09:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/30, Oleg Nesterov wrote:
>
> On 03/29, Al Viro wrote:
> >
> > On Sun, Mar 29, 2009 at 11:36:35PM +0200, Oleg Nesterov wrote:
> > > > ... or just do that to fs_struct. After finding that there's no outside
> > > > users. Commenst?
> > >
> > > This is even worse. Not only we race with our sub-threads, we race
> > > with CLONE_FS processes.
> > >
> > > We can't mark fs_struct after finding that there's no outside users
> > > lockless. Because we can't know whether this is "after" or not, we
> > > can't trust "atomic_read(fs->count) <= n_fs".
> >
> > We can lock fs_struct in question, go through the threads, then mark
> > or bail out. With cloning a reference to fs_struct protected by the
> > same lock.
>
> Yes, this is what I meant, copy_fs() needs this lock too,
>
> > FWIW, I'm not at all sure that we want atomic_t for refcount in that
> > case...
>
> I think you are right, because exit_fs() should take fs->lock as well.
>
> But, again. What whould we do when check_unsafe_exec() takes fs->lock
> and sees that this ->fs is already marked?

Ah, I am stupid. There is no another process if this flag is set.

Oleg.

2009-03-30 01:11:23

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, Mar 30, 2009 at 02:03:38AM +0200, Oleg Nesterov wrote:
> On 03/30, Oleg Nesterov wrote:
> >
> > On 03/29, Al Viro wrote:
> > >
> > > On Sun, Mar 29, 2009 at 11:36:35PM +0200, Oleg Nesterov wrote:
> > > > > ... or just do that to fs_struct. After finding that there's no outside
> > > > > users. Commenst?
> > > >
> > > > This is even worse. Not only we race with our sub-threads, we race
> > > > with CLONE_FS processes.
> > > >
> > > > We can't mark fs_struct after finding that there's no outside users
> > > > lockless. Because we can't know whether this is "after" or not, we
> > > > can't trust "atomic_read(fs->count) <= n_fs".
> > >
> > > We can lock fs_struct in question, go through the threads, then mark
> > > or bail out. With cloning a reference to fs_struct protected by the
> > > same lock.
> >
> > Yes, this is what I meant, copy_fs() needs this lock too,
> >
> > > FWIW, I'm not at all sure that we want atomic_t for refcount in that
> > > case...
> >
> > I think you are right, because exit_fs() should take fs->lock as well.
> >
> > But, again. What whould we do when check_unsafe_exec() takes fs->lock
> > and sees that this ->fs is already marked?
>
> Ah, I am stupid. There is no another process if this flag is set.

So...
* one can always dereference current->fs
* nobody can change task->fs for seen-by-scheduler task other than
current (we can, of course, do that for a task we'd just allocated in clone
before anyone else has seen it)
* all changes of current->fs happen under task_lock *and* excl ->lock
of old value of current->fs.
* anybody can dereference task->fs while holding task_lock(task)
* ->lock nests inside task_lock
* freeing happens only when the last reference is gone *and* all
tasks that used to hold such references has already gone through task_unlock
* all refcount changes happen under excl ->lock
* changes of ->root and ->pwd happen under excl ->lock
* read access to ->root and ->pwd should be done under shared ->lock;
to use the contents past the unlock you need to grab references to path in
question while holding lock
* cloning a reference is done while holding ->lock shared, fails with
-EAGAIN if fs_struct is marked "under exec"
* mark in question is set and cleared with ->lock excl.
* check_unsafe_exec() locks current->fs shared, goes through all
threads comparing their ->fs with our, if the number doesn't match - bail
out. Otherwise we mark it "under exec".
* in the end of execve() (failure or success) we clear the mark.
* all reassignments of task->fs are either to NULL or to new instance
(never seen by anybody) or to &init_fs.
* task with ->fs == &init_fs may not call execve(); those are kernel
threads and they must get ->fs unshared before they can do anything of that
kind (otherwise we'd have no end of trouble with chdir() done by exec'ed
binary affecting hell knows what else).

Does anybody see any holes in the above?

2009-03-30 01:14:23

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, Mar 30, 2009 at 02:08:43AM +0100, Al Viro wrote:

> So...
> * one can always dereference current->fs
> * nobody can change task->fs for seen-by-scheduler task other than
> current (we can, of course, do that for a task we'd just allocated in clone
> before anyone else has seen it)
> * all changes of current->fs happen under task_lock *and* excl ->lock
> of old value of current->fs.
> * anybody can dereference task->fs while holding task_lock(task)
> * ->lock nests inside task_lock
> * freeing happens only when the last reference is gone *and* all
> tasks that used to hold such references has already gone through task_unlock
> * all refcount changes happen under excl ->lock
> * changes of ->root and ->pwd happen under excl ->lock
> * read access to ->root and ->pwd should be done under shared ->lock;
> to use the contents past the unlock you need to grab references to path in
> question while holding lock
> * cloning a reference is done while holding ->lock shared, fails with
^^^^^^
excl, of course

> -EAGAIN if fs_struct is marked "under exec"
> * mark in question is set and cleared with ->lock excl.
> * check_unsafe_exec() locks current->fs shared, goes through all
> threads comparing their ->fs with our, if the number doesn't match - bail
> out. Otherwise we mark it "under exec".
> * in the end of execve() (failure or success) we clear the mark.
> * all reassignments of task->fs are either to NULL or to new instance
> (never seen by anybody) or to &init_fs.
> * task with ->fs == &init_fs may not call execve(); those are kernel
> threads and they must get ->fs unshared before they can do anything of that
> kind (otherwise we'd have no end of trouble with chdir() done by exec'ed
> binary affecting hell knows what else).
>
> Does anybody see any holes in the above?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-03-30 01:41:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/30, Al Viro wrote:
>
> On Mon, Mar 30, 2009 at 02:08:43AM +0100, Al Viro wrote:
>
> > So...
> > * check_unsafe_exec() locks current->fs shared, goes through all
> > threads comparing their ->fs with our, if the number doesn't match - bail
> > out. Otherwise we mark it "under exec".

Unless I missed something again, check_unsafe_exec() should check "under exec"
after it takes fs->lock. If set - we are racing with sub-thread, return -EAGAIN.

We can't proceed. If that another exec() fails, it will clear "under exec" at
the end of do_execve(), before we kill other threads.

Oleg.

2009-03-30 01:46:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 03/30, Oleg Nesterov wrote:
>
> On 03/30, Al Viro wrote:
> >
> > On Mon, Mar 30, 2009 at 02:08:43AM +0100, Al Viro wrote:
> >
> > > So...
> > > * check_unsafe_exec() locks current->fs shared, goes through all
> > > threads comparing their ->fs with our, if the number doesn't match - bail
> > > out. Otherwise we mark it "under exec".
>
> Unless I missed something again, check_unsafe_exec() should check "under exec"
> after it takes fs->lock. If set - we are racing with sub-thread, return -EAGAIN.
>
> We can't proceed. If that another exec() fails, it will clear "under exec" at
> the end of do_execve(), before we kill other threads.

Or we need a counter to mark/unmark.

Oleg.

2009-03-30 12:32:26

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, Mar 30, 2009 at 03:40:40AM +0200, Oleg Nesterov wrote:

> > We can't proceed. If that another exec() fails, it will clear "under exec" at
> > the end of do_execve(), before we kill other threads.
>
> Or we need a counter to mark/unmark.

Nah, easier to have check_unsafe_exec() return -EAGAIN in cases we care
about.

Anyway, completely untested patchset is in
git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ execve-mess
(the last 9 changesets of it).

WARNING: that's *NOT* for merge at the moment; this is not a pull request.

Review (and testing) would be welcome.

Shortlog of execve-related part:
Al Viro (6):
Take fs_struct handling to new file (fs/fs_struct.c), sanitize chroot_fs_refs()
New helper - current_umask()
Get rid of indirect include of fs_struct.h
Kill unsharing fs_struct in __set_personality()
New locking/refcounting for fs_struct
check_unsafe_exec() doesn't care about signal handlers sharing

Hugh Dickins (3):
Don't bump fs_struct refcount for procfs accesses
compat_do_execve should unshare_files
fix setuid sometimes doesn't - files_struct

Diffstat (again, of execve-related stuff)
arch/cris/kernel/process.c | 1 -
arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
fs/Makefile | 2 +-
fs/btrfs/acl.c | 2 +-
fs/btrfs/ioctl.c | 2 +-
fs/cifs/dir.c | 4 +-
fs/cifs/inode.c | 4 +-
fs/compat.c | 28 ++++-
fs/dcache.c | 1 +
fs/exec.c | 39 +++++--
fs/ext2/acl.c | 2 +-
fs/ext3/acl.c | 2 +-
fs/ext4/acl.c | 2 +-
fs/fat/inode.c | 2 +-
fs/fs_struct.c | 173 +++++++++++++++++++++++++++++
fs/generic_acl.c | 2 +-
fs/gfs2/acl.c | 2 +-
fs/hfsplus/options.c | 2 +-
fs/hpfs/super.c | 2 +-
fs/internal.h | 8 +-
fs/jffs2/acl.c | 2 +-
fs/jfs/acl.c | 2 +-
fs/namei.c | 14 +--
fs/namespace.c | 61 +----------
fs/nfs/nfs3proc.c | 6 +-
fs/nfs/nfs4proc.c | 2 +-
fs/nfsd/nfssvc.c | 7 +-
fs/ocfs2/acl.c | 2 +-
fs/omfs/inode.c | 2 +-
fs/open.c | 1 +
fs/proc/base.c | 53 +++------
fs/proc/task_nommu.c | 3 +-
fs/reiserfs/xattr_acl.c | 2 +-
fs/xfs/linux-2.6/xfs_iops.c | 4 +-
include/linux/fs.h | 2 +
include/linux/fs_struct.h | 7 +-
include/linux/mnt_namespace.h | 2 +
include/linux/nsproxy.h | 1 +
include/linux/sched.h | 3 +-
init/do_mounts.c | 1 +
ipc/mqueue.c | 2 +-
kernel/auditsc.c | 1 +
kernel/exec_domain.c | 22 ----
kernel/exit.c | 32 +-----
kernel/fork.c | 63 +++++------
kernel/sys.c | 1 +
net/unix/af_unix.c | 2 +-
security/tomoyo/realpath.c | 1 +
48 files changed, 337 insertions(+), 246 deletions(-)
create mode 100644 fs/fs_struct.c

2009-03-30 15:34:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, 30 Mar 2009, Al Viro wrote:
> On Mon, Mar 30, 2009 at 03:40:40AM +0200, Oleg Nesterov wrote:

Well done, Oleg, for finding this: I wish I'd Cc'ed you earlier.
Shortly before posting my patches, I got very worried by n_fs in
check_unsafe_exec; then fooled myself into thinking it was okay.

>
> > > We can't proceed. If that another exec() fails, it will clear "under exec" at
> > > the end of do_execve(), before we kill other threads.
> >
> > Or we need a counter to mark/unmark.
>
> Nah, easier to have check_unsafe_exec() return -EAGAIN in cases we care
> about.
>
> Anyway, completely untested patchset is in
> git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ execve-mess
> (the last 9 changesets of it).
>
> WARNING: that's *NOT* for merge at the moment; this is not a pull request.
>
> Review (and testing) would be welcome.
>
> Shortlog of execve-related part:
> Al Viro (6):
> Take fs_struct handling to new file (fs/fs_struct.c), sanitize chroot_fs_refs()
> New helper - current_umask()
> Get rid of indirect include of fs_struct.h
> Kill unsharing fs_struct in __set_personality()
> New locking/refcounting for fs_struct
> check_unsafe_exec() doesn't care about signal handlers sharing
>
> Hugh Dickins (3):
> Don't bump fs_struct refcount for procfs accesses
> compat_do_execve should unshare_files
> fix setuid sometimes doesn't - files_struct

I'm looking now. Your description of what you intended sounded good.

First impression is that there's lots of good cleanup in there, but
it's all too mixed up and misordered at present - though we have friends
who insist upon doing the cleanup first (and I usually like to work that
way too), it's going to be tiresome when backporting to 2.6.29.stable
(luckily not needed earlier, unless you've uncovered more on the way).

Note that Linus put the four patches I posted straight into his tree,
so I think you'll want to be working on top of those, whereas you've
currently got them mixed in and modified in your tree.

So, setting aside what's already in, and what's just cleanup, I think
it's just 8c6934e2603283d028ac2292fe8d2812f52f23d3 I should be looking
at, New locking/refcounting for fs_struct - it'd be good for stable if
you worked on just that one to go in on top of Linus's current tree.

One of the cleanups, the one which introduces fs/fs_struct.c,
includes a good-looking but significant change to chroot_fs_refs().
Better make that a separate patch, I couldn't quite tell if it was
something I'd seriously missed from the "setuid sometimes doesn't"
set or not (my belief: yes, I missed it, and it's a good addition;
but frankly, we don't much care what happens if setuid sometimes
gets turned off if a parallel thread about to be killed happens
to be chrooting at the time - that's a much less worrying issue
than what can happen via /proc).

I think your new chroot_fs_refs() should have a path_get(new_root)
just before each path_put(old_root)?

Hugh

2009-03-30 23:45:57

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

Quoting Al Viro ([email protected]):
> On Mon, Mar 30, 2009 at 03:40:40AM +0200, Oleg Nesterov wrote:
>
> > > We can't proceed. If that another exec() fails, it will clear "under exec" at
> > > the end of do_execve(), before we kill other threads.
> >
> > Or we need a counter to mark/unmark.
>
> Nah, easier to have check_unsafe_exec() return -EAGAIN in cases we care
> about.
>
> Anyway, completely untested patchset is in
> git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ execve-mess
> (the last 9 changesets of it).
>
> WARNING: that's *NOT* for merge at the moment; this is not a pull request.
>
> Review (and testing) would be welcome.

(note exactly *meaningful* review, but)

exit_fs() and daemonize_fs_struct() do:

if (--fs->users)
fs = NULL;
write_unlock(&fs->lock);

Moving the write_unlock up actually let's the kernel boot and
start running ltp.

-serge

2009-03-31 06:17:40

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, Mar 30, 2009 at 03:32:35PM +0100, Hugh Dickins wrote:
> > Anyway, completely untested patchset is in
> > git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git/ execve-mess
> > (the last 9 changesets of it).
> >
> > WARNING: that's *NOT* for merge at the moment; this is not a pull request.

> I'm looking now. Your description of what you intended sounded good.
>
> First impression is that there's lots of good cleanup in there, but
> it's all too mixed up and misordered at present - though we have friends
> who insist upon doing the cleanup first (and I usually like to work that
> way too), it's going to be tiresome when backporting to 2.6.29.stable
> (luckily not needed earlier, unless you've uncovered more on the way).

Yes, and worse than that; this was basically a code dump - the state at
the moment when I'd been going down.

> I think your new chroot_fs_refs() should have a path_get(new_root)
> just before each path_put(old_root)?

Not just before; at the time it replaces ->root or ->pwd. And yes,
it's one of the brainos there.

Anyway, I've got that stuff to something reasonably-looking. See the
same branch (rebased), just 5 changesets now. Have fun. Cleanups are
not part of that set.

2009-03-31 06:21:37

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, Mar 30, 2009 at 06:45:39PM -0500, Serge E. Hallyn wrote:

> (note exactly *meaningful* review, but)
>
> exit_fs() and daemonize_fs_struct() do:
>
> if (--fs->users)
> fs = NULL;
> write_unlock(&fs->lock);
>
> Moving the write_unlock up actually let's the kernel boot and
> start running ltp.

Correct fix is
kill = !--fs->users;
write_unlock(&fs->lock);
...
if (kill)
free_fs_struct(fs);
and similar in other places with the same idiocy (one of which forgets to
unlock, on top of everything else).

Anyway, hopefully much saner (== looked through after getting some sleep,
as opposed to "what I've got in that branch at ~26 hours of uptime")
variant is in the same repository, same branch.

2009-04-01 00:30:30

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Tue, 31 Mar 2009, Al Viro wrote:
>
> Anyway, I've got that stuff to something reasonably-looking. See the
> same branch (rebased), just 5 changesets now. Have fun. Cleanups are
> not part of that set.

Just a couple of things on that:

Minor bisectability issue: the third patch, which introduces
int unshare_fs_struct(void), needs to return 0 when it succeeds:
that gets corrected in the fourth patch.

Lockdep objects to how check_unsafe_exec nests write_lock(&p->fs_lock)
inside lock_task_sighand(p, &flags). It's right: we sometimes take
sighand->siglock in interrupt, so if such an interrupt occurred just
after you take fs_lock elsewhere, that could deadlock with this. It
seems happy with taking fs_lock just outside the lock_task_sighand.

Otherwise it looks good to me, except I keep worrying about those
EAGAINs. The more so once I noticed current->cred_exec_mutex is
already being used to handle a similar issue with ptrace. What
do you think of this rather smaller patch? which I'd much rather
send after having slept on it, since it may be embarrassingly and
obviously wrong, but tomorrow may be too late ...

fs/compat.c | 6 +++---
fs/exec.c | 6 +++---
fs/namei.c | 1 +
include/linux/fs_struct.h | 1 +
include/linux/init_task.h | 2 --
include/linux/sched.h | 1 -
kernel/cred.c | 4 +---
kernel/fork.c | 13 +++++++++++--
kernel/ptrace.c | 4 ++--
9 files changed, 22 insertions(+), 16 deletions(-)

--- 2.6.29-git8/fs/compat.c 2009-03-31 16:04:22.000000000 +0100
+++ linux/fs/compat.c 2009-04-01 00:52:26.000000000 +0100
@@ -1432,7 +1432,7 @@ int compat_do_execve(char * filename,
if (!bprm)
goto out_files;

- retval = mutex_lock_interruptible(&current->cred_exec_mutex);
+ retval = mutex_lock_interruptible(&current->fs->cred_exec_mutex);
if (retval < 0)
goto out_free;
current->in_execve = 1;
@@ -1489,7 +1489,7 @@ int compat_do_execve(char * filename,

/* execve succeeded */
current->in_execve = 0;
- mutex_unlock(&current->cred_exec_mutex);
+ mutex_unlock(&current->fs->cred_exec_mutex);
acct_update_integrals(current);
free_bprm(bprm);
if (displaced)
@@ -1508,7 +1508,7 @@ out_file:

out_unlock:
current->in_execve = 0;
- mutex_unlock(&current->cred_exec_mutex);
+ mutex_unlock(&current->fs->cred_exec_mutex);

out_free:
free_bprm(bprm);
--- 2.6.29-git8/fs/exec.c 2009-03-31 16:04:22.000000000 +0100
+++ linux/fs/exec.c 2009-04-01 00:52:26.000000000 +0100
@@ -1287,7 +1287,7 @@ int do_execve(char * filename,
if (!bprm)
goto out_files;

- retval = mutex_lock_interruptible(&current->cred_exec_mutex);
+ retval = mutex_lock_interruptible(&current->fs->cred_exec_mutex);
if (retval < 0)
goto out_free;
current->in_execve = 1;
@@ -1345,7 +1345,7 @@ int do_execve(char * filename,

/* execve succeeded */
current->in_execve = 0;
- mutex_unlock(&current->cred_exec_mutex);
+ mutex_unlock(&current->fs->cred_exec_mutex);
acct_update_integrals(current);
free_bprm(bprm);
if (displaced)
@@ -1364,7 +1364,7 @@ out_file:

out_unlock:
current->in_execve = 0;
- mutex_unlock(&current->cred_exec_mutex);
+ mutex_unlock(&current->fs->cred_exec_mutex);

out_free:
free_bprm(bprm);
--- 2.6.29-git8/fs/namei.c 2009-03-31 16:04:23.000000000 +0100
+++ linux/fs/namei.c 2009-04-01 00:52:26.000000000 +0100
@@ -2903,4 +2903,5 @@ struct fs_struct init_fs = {
.count = ATOMIC_INIT(1),
.lock = __RW_LOCK_UNLOCKED(init_fs.lock),
.umask = 0022,
+ .cred_exec_mutex = __MUTEX_INITIALIZER(init_fs.cred_exec_mutex),
};
--- 2.6.29-git8/include/linux/fs_struct.h 2009-03-31 16:04:24.000000000 +0100
+++ linux/include/linux/fs_struct.h 2009-04-01 00:52:26.000000000 +0100
@@ -11,6 +11,7 @@ struct fs_struct {
rwlock_t lock;
int umask;
struct path root, pwd;
+ struct mutex cred_exec_mutex; /* execve cred calculation mutex */
};

extern struct kmem_cache *fs_cachep;
--- 2.6.29-git8/include/linux/init_task.h 2009-03-31 16:04:24.000000000 +0100
+++ linux/include/linux/init_task.h 2009-04-01 00:52:26.000000000 +0100
@@ -157,8 +157,6 @@ extern struct cred init_cred;
.group_leader = &tsk, \
.real_cred = &init_cred, \
.cred = &init_cred, \
- .cred_exec_mutex = \
- __MUTEX_INITIALIZER(tsk.cred_exec_mutex), \
.comm = "swapper", \
.thread = INIT_THREAD, \
.fs = &init_fs, \
--- 2.6.29-git8/include/linux/sched.h 2009-03-31 16:04:25.000000000 +0100
+++ linux/include/linux/sched.h 2009-04-01 00:52:26.000000000 +0100
@@ -1250,7 +1250,6 @@ struct task_struct {
* credentials (COW) */
const struct cred *cred; /* effective (overridable) subjective task
* credentials (COW) */
- struct mutex cred_exec_mutex; /* execve vs ptrace cred calculation mutex */

char comm[TASK_COMM_LEN]; /* executable name excluding path
- access with [gs]et_task_comm (which lock
--- 2.6.29-git8/kernel/cred.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/kernel/cred.c 2009-04-01 00:52:26.000000000 +0100
@@ -167,7 +167,7 @@ EXPORT_SYMBOL(prepare_creds);

/*
* Prepare credentials for current to perform an execve()
- * - The caller must hold current->cred_exec_mutex
+ * - The caller must hold current->fs->cred_exec_mutex
*/
struct cred *prepare_exec_creds(void)
{
@@ -276,8 +276,6 @@ int copy_creds(struct task_struct *p, un
struct cred *new;
int ret;

- mutex_init(&p->cred_exec_mutex);
-
if (
#ifdef CONFIG_KEYS
!p->cred->thread_keyring &&
--- 2.6.29-git8/kernel/fork.c 2009-03-31 16:04:26.000000000 +0100
+++ linux/kernel/fork.c 2009-04-01 00:52:26.000000000 +0100
@@ -695,6 +695,7 @@ static struct fs_struct *__copy_fs_struc
fs->pwd = old->pwd;
path_get(&old->pwd);
read_unlock(&old->lock);
+ mutex_init(&fs->cred_exec_mutex);
}
return fs;
}
@@ -708,11 +709,19 @@ EXPORT_SYMBOL_GPL(copy_fs_struct);

static int copy_fs(unsigned long clone_flags, struct task_struct *tsk)
{
+ struct fs_struct *fs = current->fs;
+
if (clone_flags & CLONE_FS) {
- atomic_inc(&current->fs->count);
+ int retval = mutex_lock_interruptible(&fs->cred_exec_mutex);
+ if (retval < 0)
+ return retval;
+ /* tsk->fs is already what we want */
+ atomic_inc(&fs->count);
+ mutex_unlock(&fs->cred_exec_mutex);
return 0;
}
- tsk->fs = __copy_fs_struct(current->fs);
+
+ tsk->fs = __copy_fs_struct(fs);
if (!tsk->fs)
return -ENOMEM;
return 0;
--- 2.6.29-git8/kernel/ptrace.c 2009-03-23 23:12:14.000000000 +0000
+++ linux/kernel/ptrace.c 2009-04-01 00:52:26.000000000 +0100
@@ -186,7 +186,7 @@ int ptrace_attach(struct task_struct *ta
/* Protect exec's credential calculations against our interference;
* SUID, SGID and LSM creds get determined differently under ptrace.
*/
- retval = mutex_lock_interruptible(&current->cred_exec_mutex);
+ retval = mutex_lock_interruptible(&current->fs->cred_exec_mutex);
if (retval < 0)
goto out;

@@ -230,7 +230,7 @@ repeat:
bad:
write_unlock_irqrestore(&tasklist_lock, flags);
task_unlock(task);
- mutex_unlock(&current->cred_exec_mutex);
+ mutex_unlock(&current->fs->cred_exec_mutex);
out:
return retval;
}

2009-04-01 02:40:17

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
> Minor bisectability issue: the third patch, which introduces
> int unshare_fs_struct(void), needs to return 0 when it succeeds:
> that gets corrected in the fourth patch.

ACK.

> Lockdep objects to how check_unsafe_exec nests write_lock(&p->fs_lock)
> inside lock_task_sighand(p, &flags). It's right: we sometimes take
> sighand->siglock in interrupt, so if such an interrupt occurred just
> after you take fs_lock elsewhere, that could deadlock with this. It
> seems happy with taking fs_lock just outside the lock_task_sighand.

Right you are, check_unsafe_exec() reordered. Will push in a few.

> Otherwise it looks good to me, except I keep worrying about those
> EAGAINs. The more so once I noticed current->cred_exec_mutex is
> already being used to handle a similar issue with ptrace. What
> do you think of this rather smaller patch? which I'd much rather
> send after having slept on it, since it may be embarrassingly and
> obviously wrong, but tomorrow may be too late ...

Eh... I'm not particulary happy with fork() growing heavier and heavier.
Besides, there's a subtle problem avoided by another variant - think what
happens if past the point of no return execve() will unshare fs_struct
(e.g. by explicit unshare() from dynamic linker).

Frankly, -EAGAIN in situation when we have userland race is fine. And
we *do* have a userland race here - execve() will kill -9 those threads
in case of success, so if they'd been doing something useful, they are
about to be suddenly screwed.

So I stand by my variant. Note that if we have *other* tasks sharing
fs_struct, your variant will block their clone() for the duration of
execve() while mine will simply leave them alone (and accept that we
have unsafe sharing).

2009-04-01 03:05:00

by Al Viro

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Wed, Apr 01, 2009 at 03:38:49AM +0100, Al Viro wrote:
> On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
> > Minor bisectability issue: the third patch, which introduces
> > int unshare_fs_struct(void), needs to return 0 when it succeeds:
> > that gets corrected in the fourth patch.
>
> ACK.
>
> > Lockdep objects to how check_unsafe_exec nests write_lock(&p->fs_lock)
> > inside lock_task_sighand(p, &flags). It's right: we sometimes take
> > sighand->siglock in interrupt, so if such an interrupt occurred just
> > after you take fs_lock elsewhere, that could deadlock with this. It
> > seems happy with taking fs_lock just outside the lock_task_sighand.
>
> Right you are, check_unsafe_exec() reordered. Will push in a few.

Rebased and pushed (same tree, same branch; included into for-next, along
with related cleanups).

2009-04-01 11:20:48

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Wed, 1 Apr 2009, Al Viro wrote:
> On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
>
> > Otherwise it looks good to me, except I keep worrying about those
> > EAGAINs. The more so once I noticed current->cred_exec_mutex is
> > already being used to handle a similar issue with ptrace. What
> > do you think of this rather smaller patch? which I'd much rather
> > send after having slept on it, since it may be embarrassingly and
> > obviously wrong, but tomorrow may be too late ...
>
> Eh... I'm not particulary happy with fork() growing heavier and heavier.

I don't see it as making fork() any heavier, but never mind.
The important thing is to get a fix out.

> Besides, there's a subtle problem avoided by another variant - think what
> happens if past the point of no return execve() will unshare fs_struct
> (e.g. by explicit unshare() from dynamic linker).

You're too far ahead of me there.

>
> Frankly, -EAGAIN in situation when we have userland race is fine. And
> we *do* have a userland race here - execve() will kill -9 those threads
> in case of success, so if they'd been doing something useful, they are
> about to be suddenly screwed.

Good point. I found it quite odd the way the awkward case (shared
beyond the threadgroup) is allowed to go forward (with possibility
that setuid will be undone), but the easy case is -EAGAINed. (And
I gave up on trying to find a better name for your "in_exec" flag,
which is rather more subtle than just that!) But odd as it is,
there's good reason for doing it that way.

>
> So I stand by my variant.

Fair enough.

> Note that if we have *other* tasks sharing
> fs_struct, your variant will block their clone() for the duration of
> execve() while mine will simply leave them alone (and accept that we
> have unsafe sharing).

Yes, intentional, consistent with the existing cred_exec_mutex technique.

Hugh

2009-04-01 11:27:09

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Wed, 1 Apr 2009, Al Viro wrote:
>
> Rebased and pushed (same tree, same branch; included into for-next, along
> with related cleanups).

for-next? But don't we need all this (or at least the first four of
your five, on top of the four Linus took from me) in -stable now?

I was wondering if we should ask Chris to do a 2.6.29.1-rc2 with
them; but perhaps you'd much rather expose them in -next for a
while, before sending Linus and 2.6.29.2.

Hugh

2009-04-06 15:41:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 04/01, Al Viro wrote:
>
> Rebased and pushed (same tree, same branch; included into for-next, along
> with related cleanups).

Sorry for delay!

Afaics, the usage of fs->in_exec is not completely right. But firstly, a
couple of minor nits.


check_unsafe_exec() doesn't need ->siglock, we can iterate over sub-threads
under rcu_read_lock(). Note that with RCU or ->siglock we can set the "wrong"
LSM_UNSAFE_SHARE if we race with copy_process(CLONE_THREAD | CLONE_FS), but
as it was already discussed we don't care. This means it is OK to miss the
freshly cloned thread which has already passed copy_fs().


do_execve:

/* execve succeeded */
write_lock(&current->fs->lock);
current->fs->in_exec = 0;
write_unlock(&current->fs->lock);

afaics, fs->lock is not needed. If ->in_exec was set, it was set by this
thread-group and we do not share ->fs with another process. Since we are
the only thread now, we can clear ->in_exec lockless.


And now, what I think is wrong:

do_execve:

out_unmark:
write_lock(&current->fs->lock);
current->fs->in_exec = 0;
write_unlock(&current->fs->lock);

Two threads T1 and T2 and another process P, all share the same ->fs.

T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since ->fs is
shared, we set LSM_UNSAFE but not ->in_exec (actually, not very good name).

P exits and decrements fs->users.

T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not shared,
we set fs->in_exec.

T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and return
to the user-space.

T1 does clone(CLONE_FS /* without CLONE_THREAD */).

T1 continues without LSM_UNSAFE_SHARE while ->fs is shared with another
process.


What do you think about the (uncompiled) patch below ? It doesn't change
compat_do_execve(), just for discussion.

But see also another message I am going to send...

Oleg.

do_execve() must not clear fs->in_exec if it was set by another thread,
and we don't need fs->lock to clear.

Also, s/lock_task_sighand/rcu_read_lock/ in check_unsafe_exec().

--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1060,7 +1060,6 @@ EXPORT_SYMBOL(install_exec_creds);
int check_unsafe_exec(struct linux_binprm *bprm)
{
struct task_struct *p = current, *t;
- unsigned long flags;
unsigned n_fs;
int res = 0;

@@ -1068,11 +1067,12 @@ int check_unsafe_exec(struct linux_binpr

n_fs = 1;
write_lock(&p->fs->lock);
- lock_task_sighand(p, &flags);
+ rcu_read_lock();
for (t = next_thread(p); t != p; t = next_thread(t)) {
if (t->fs == p->fs)
n_fs++;
}
+ rcu_read_unlock();

if (p->fs->users > n_fs) {
bprm->unsafe |= LSM_UNSAFE_SHARE;
@@ -1080,9 +1080,8 @@ int check_unsafe_exec(struct linux_binpr
if (p->fs->in_exec)
res = -EAGAIN;
p->fs->in_exec = 1;
+ res = 1;
}
-
- unlock_task_sighand(p, &flags);
write_unlock(&p->fs->lock);

return res;
@@ -1284,6 +1283,7 @@ int do_execve(char * filename,
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
+ bool clear_in_exec;
int retval;

retval = unshare_files(&displaced);
@@ -1306,8 +1306,9 @@ int do_execve(char * filename,
goto out_unlock;

retval = check_unsafe_exec(bprm);
- if (retval)
+ if (retval < 0)
goto out_unlock;
+ clear_in_exec = retval;

file = open_exec(filename);
retval = PTR_ERR(file);
@@ -1355,9 +1356,7 @@ int do_execve(char * filename,
goto out;

/* execve succeeded */
- write_lock(&current->fs->lock);
current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
current->in_execve = 0;
mutex_unlock(&current->cred_exec_mutex);
acct_update_integrals(current);
@@ -1377,9 +1376,8 @@ out_file:
}

out_unmark:
- write_lock(&current->fs->lock);
- current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
+ if (clear_in_exec)
+ current->fs->in_exec = 0;

out_unlock:
current->in_execve = 0;

2009-04-06 15:56:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 04/01, Al Viro wrote:
>
> On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
>
> > Otherwise it looks good to me, except I keep worrying about those
> > EAGAINs.
>
> Frankly, -EAGAIN in situation when we have userland race is fine. And
> we *do* have a userland race here - execve() will kill -9 those threads
> in case of success, so if they'd been doing something useful, they are
> about to be suddenly screwed.

Can't resist! I dislike the "in_exec && -EAGAIN" oddity too.

Yes sure, we can't break the "well written" applications. But imho this
looks strange. And a bit "assymetrical" wrt LSM_UNSAFE_SHARE, I mean
check_unsafe_exec() allows sub-threads to race or CLONE_FS but only if
LSM_UNSAFE_SHARE.

Another reason, we can have the "my test-case found something strange"
bug-reports.

So. Please feel free to nack or just ignore this message, but since I
personally dislike the current behaviour I should at least try to suggest
something else.

- add "wait_queue_head_t in_exec_wait" to "struct linux_binprm".

- kill fs->in_exec, add "wait_queue_head_t *in_exec_wait_ptr"
instead.

- introduce the new helper,

void fs_lock_check_exec(struct fs_struct *fs)
{
write_lock(&fs->lock);
while (unlikely(fs->in_exec_wait_ptr)) {
DECLARE_WAITQUEUE(wait, current);

if (fatal_signal_pending(current))
/*
* clone/exec can't succeed, and this
* thread won't return to the user-space
*/
break;

__add_wait_queue(fs->in_exec_wait_ptr, &wait);
__set_current_state(TASK_KILLABLE);
write_unlock(&fs->lock);

schedule();

write_lock(&fs->lock);
__remove_wait_queue(&wait);
}
}

Or we can return -EANYTHING when fatal_signal_pending(), this doesn't
matter.

Note that this helper can block only if we race with our sub-thread
in the middle of !LSM_UNSAFE_SHARE exec. Otherwise this is not slower
than write_lock(fs->lock) + if (fs->in_exec) we currently have.


- change copy_fs() to do

if (clone_flags & CLONE_FS) {
fs_lock_check_exec(fs);
fs->users++;
write_unlock(&fs->lock);
return 0;
}


- change check_unsafe_exec:

void check_unsafe_exec(struct linux_binprm *bprm)
{
struct task_struct *p = current, *t;
unsigned n_fs;

bprm->unsafe = tracehook_unsafe_exec(p);

n_fs = 1;
fs_lock_check_exec(&p->fs);
if (p->fs->in_exec_wait_ptr)
/* we are going to die */
goto out;

rcu_read_lock();
for (t = next_thread(p); t != p; t = next_thread(t)) {
if (t->fs == p->fs)
n_fs++;
}
rcu_read_unlock();

if (p->fs->users > n_fs) {
bprm->unsafe |= LSM_UNSAFE_SHARE;
} else {
bprm->unsafe |= __LSM_EXEC_WAKE;
init_waitqueue_head(&bprm->in_exec_wait);
p->fs->in_exec_wait_ptr = &bprm->in_exec_wait;
}
out:
write_unlock(&p->fs->lock);
}



- and, finally, change do_execve()

/* execve succeeded */
current->fs->in_exec_wait_ptr = NULL;

...

out_unmark:
if (bprm->unsafe & __LSM_EXEC_WAKE) {
write_lock(&current->fs->lock);
current->fs->in_exec_wait_ptr = NULL;
wake_up_locked(&bprm->in_exec_wait);
write_unlock(&current->fs->lock);
}

Comments?

Oleg.

2009-04-19 16:39:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, 6 Apr 2009, Oleg Nesterov wrote:
> On 04/01, Al Viro wrote:
> >
> > Rebased and pushed (same tree, same branch; included into for-next, along
> > with related cleanups).
>
> Sorry for delay!

Please don't suppose that you can ever beat me at the slowness game!

>
> Afaics, the usage of fs->in_exec is not completely right. But firstly, a
> couple of minor nits.
>
>
> check_unsafe_exec() doesn't need ->siglock, we can iterate over sub-threads
> under rcu_read_lock(). Note that with RCU or ->siglock we can set the "wrong"
> LSM_UNSAFE_SHARE if we race with copy_process(CLONE_THREAD | CLONE_FS), but
> as it was already discussed we don't care. This means it is OK to miss the
> freshly cloned thread which has already passed copy_fs().

Yes, I agree.
And preferable not to have IRQs disabled over that next_thread() loop.

>
>
> do_execve:
>
> /* execve succeeded */
> write_lock(&current->fs->lock);
> current->fs->in_exec = 0;
> write_unlock(&current->fs->lock);
>
> afaics, fs->lock is not needed. If ->in_exec was set, it was set by this
> thread-group and we do not share ->fs with another process. Since we are
> the only thread now, we can clear ->in_exec lockless.

Right, given your fix below. I wondered for a moment if a barrier
would then be needed, but no, this is all racy (erring on the safe
side) if the userspace insists on being racy here.

>
>
> And now, what I think is wrong:
>
> do_execve:
>
> out_unmark:
> write_lock(&current->fs->lock);
> current->fs->in_exec = 0;
> write_unlock(&current->fs->lock);
>
> Two threads T1 and T2 and another process P, all share the same ->fs.
>
> T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since ->fs is
> shared, we set LSM_UNSAFE but not ->in_exec (actually, not very good name).
>
> P exits and decrements fs->users.
>
> T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not shared,
> we set fs->in_exec.
>
> T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and return
> to the user-space.
>
> T1 does clone(CLONE_FS /* without CLONE_THREAD */).
>
> T1 continues without LSM_UNSAFE_SHARE while ->fs is shared with another
> process.

If I follow you correctly, you meant to say T2 not T1 in the last step.

>
>
> What do you think about the (uncompiled) patch below ? It doesn't change
> compat_do_execve(), just for discussion.
>
> But see also another message I am going to send...
>
> Oleg.
>
> do_execve() must not clear fs->in_exec if it was set by another thread,
> and we don't need fs->lock to clear.
>
> Also, s/lock_task_sighand/rcu_read_lock/ in check_unsafe_exec().

Yes, I think your clear_in_exec change is a necessary one,
and your rcu_read_lock well worth while.

One tiny change (aside from extending to compat_do_execve):
Al originally had check_unsafe_exec()'s write_lock(&p->fs->lock)
after the lock_task_sighand(p, &flags), but was forced to invert
that by the IRQ issue lockdep flagged. I think we'd all prefer
to think of fs->lock as an innermost lock, and would like it
now to go after your rcu_read_lock().

(You do rcu_read_unlock() earlier, but that's okay.)

Hugh

>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1060,7 +1060,6 @@ EXPORT_SYMBOL(install_exec_creds);
> int check_unsafe_exec(struct linux_binprm *bprm)
> {
> struct task_struct *p = current, *t;
> - unsigned long flags;
> unsigned n_fs;
> int res = 0;
>
> @@ -1068,11 +1067,12 @@ int check_unsafe_exec(struct linux_binpr
>
> n_fs = 1;
> write_lock(&p->fs->lock);
> - lock_task_sighand(p, &flags);
> + rcu_read_lock();
> for (t = next_thread(p); t != p; t = next_thread(t)) {
> if (t->fs == p->fs)
> n_fs++;
> }
> + rcu_read_unlock();
>
> if (p->fs->users > n_fs) {
> bprm->unsafe |= LSM_UNSAFE_SHARE;
> @@ -1080,9 +1080,8 @@ int check_unsafe_exec(struct linux_binpr
> if (p->fs->in_exec)
> res = -EAGAIN;
> p->fs->in_exec = 1;
> + res = 1;
> }
> -
> - unlock_task_sighand(p, &flags);
> write_unlock(&p->fs->lock);
>
> return res;
> @@ -1284,6 +1283,7 @@ int do_execve(char * filename,
> struct linux_binprm *bprm;
> struct file *file;
> struct files_struct *displaced;
> + bool clear_in_exec;
> int retval;
>
> retval = unshare_files(&displaced);
> @@ -1306,8 +1306,9 @@ int do_execve(char * filename,
> goto out_unlock;
>
> retval = check_unsafe_exec(bprm);
> - if (retval)
> + if (retval < 0)
> goto out_unlock;
> + clear_in_exec = retval;
>
> file = open_exec(filename);
> retval = PTR_ERR(file);
> @@ -1355,9 +1356,7 @@ int do_execve(char * filename,
> goto out;
>
> /* execve succeeded */
> - write_lock(&current->fs->lock);
> current->fs->in_exec = 0;
> - write_unlock(&current->fs->lock);
> current->in_execve = 0;
> mutex_unlock(&current->cred_exec_mutex);
> acct_update_integrals(current);
> @@ -1377,9 +1376,8 @@ out_file:
> }
>
> out_unmark:
> - write_lock(&current->fs->lock);
> - current->fs->in_exec = 0;
> - write_unlock(&current->fs->lock);
> + if (clear_in_exec)
> + current->fs->in_exec = 0;
>
> out_unlock:
> current->in_execve = 0;

2009-04-19 16:47:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Mon, 6 Apr 2009, Oleg Nesterov wrote:
> On 04/01, Al Viro wrote:
> >
> > On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
> >
> > > Otherwise it looks good to me, except I keep worrying about those
> > > EAGAINs.
> >
> > Frankly, -EAGAIN in situation when we have userland race is fine. And
> > we *do* have a userland race here - execve() will kill -9 those threads
> > in case of success, so if they'd been doing something useful, they are
> > about to be suddenly screwed.
>
> Can't resist! I dislike the "in_exec && -EAGAIN" oddity too.
>
> Yes sure, we can't break the "well written" applications. But imho this
> looks strange. And a bit "assymetrical" wrt LSM_UNSAFE_SHARE, I mean
> check_unsafe_exec() allows sub-threads to race or CLONE_FS but only if
> LSM_UNSAFE_SHARE.
>
> Another reason, we can have the "my test-case found something strange"
> bug-reports.
>
> So. Please feel free to nack or just ignore this message, but since I
> personally dislike the current behaviour I should at least try to suggest
> something else.

I didn't spend very long on this: it looked rather equivalent to the
current->fs->cred_exec_mutex patch that I proposed, but spinning its
own infrastructure rather than relying on the existing mutex (and of
course based on top of Al's patches, now in the tree, which have
changed the path of least resistance).

I've probably missed subtleties, but I still prefer my own suggestion;
though Al hinted at a subtle problem with that which I never grasped.

Of course I agree with you sharing my unease at -EAGAIN and in_exec.
Maybe we just lie in wait preparing a "told you so" for when someone
reports "something strange"! But I'd really like to see your fix
patch go in.

Hugh

2009-04-21 16:16:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 04/19, Hugh Dickins wrote:
>
> On Mon, 6 Apr 2009, Oleg Nesterov wrote:
> >
> > Sorry for delay!
>
> Please don't suppose that you can ever beat me at the slowness game!

I am still trying to compete...

> > check_unsafe_exec() doesn't need ->siglock, we can iterate over sub-threads
> > under rcu_read_lock(). Note that with RCU or ->siglock we can set the "wrong"
> > LSM_UNSAFE_SHARE if we race with copy_process(CLONE_THREAD | CLONE_FS), but
> > as it was already discussed we don't care. This means it is OK to miss the
> > freshly cloned thread which has already passed copy_fs().
>
> Yes, I agree.
> And preferable not to have IRQs disabled over that next_thread() loop.

Yes sure, we don't need local_irq_disable(), only rcu_read_lock().

> > T1 does clone(CLONE_FS /* without CLONE_THREAD */).
> >
> > T1 continues without LSM_UNSAFE_SHARE while ->fs is shared with another
> > process.
>
> If I follow you correctly, you meant to say T2 not T1 in the last step.

Yes,

> Yes, I think your clear_in_exec change is a necessary one,
> and your rcu_read_lock well worth while.

OK, I'll send 2 simple patches, the first one kills lock_task_sighand(),
another adds clear_in_exec.

But,

> One tiny change (aside from extending to compat_do_execve):
> Al originally had check_unsafe_exec()'s write_lock(&p->fs->lock)
> after the lock_task_sighand(p, &flags), but was forced to invert
> that by the IRQ issue lockdep flagged. I think we'd all prefer
> to think of fs->lock as an innermost lock, and would like it
> now to go after your rcu_read_lock().

Since we are not going to disable IRQs, perhaps the above does
not matter? It is always safe to take rcu_read_lock(), no matter
which locks we already hold.

> (You do rcu_read_unlock() earlier, but that's okay.)

Yes, but unless we have a "strong" reason, it is better to take
fs->lock first. rcu_read_lock() is free, but disables preemption.

Oleg.

2009-04-21 16:42:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)



On Tue, 21 Apr 2009, Oleg Nesterov wrote:
>
> > (You do rcu_read_unlock() earlier, but that's okay.)
>
> Yes, but unless we have a "strong" reason, it is better to take
> fs->lock first. rcu_read_lock() is free, but disables preemption.

.. but so does taking a spinlock. So it shouldn't matter.

We could play games with that (the same way I think we have some games for
large-system irq latency with '__raw_spin_lock_flags()' on ia64), but that
makes sense only when you have lots of CPU's and expect irq latency to
suffer.

And it doesn't tend to make sense for preemption latency, because if you
have so many CPU's that you have lots of spinning on locks, you would
normally not really care deeply about preemption (sure, in theory it's a
real-time thing, in practice I doubt you'll find anybody who cares).

Linus

2009-04-21 16:46:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 04/19, Hugh Dickins wrote:
>
> On Mon, 6 Apr 2009, Oleg Nesterov wrote:
> > On 04/01, Al Viro wrote:
> > >
> > > On Wed, Apr 01, 2009 at 01:28:01AM +0100, Hugh Dickins wrote:
> > >
> > > > Otherwise it looks good to me, except I keep worrying about those
> > > > EAGAINs.
> > >
> > > Frankly, -EAGAIN in situation when we have userland race is fine. And
> > > we *do* have a userland race here - execve() will kill -9 those threads
> > > in case of success, so if they'd been doing something useful, they are
> > > about to be suddenly screwed.
> >
> > Can't resist! I dislike the "in_exec && -EAGAIN" oddity too.
> >
> > Yes sure, we can't break the "well written" applications. But imho this
> > looks strange. And a bit "assymetrical" wrt LSM_UNSAFE_SHARE, I mean
> > check_unsafe_exec() allows sub-threads to race or CLONE_FS but only if
> > LSM_UNSAFE_SHARE.
> >
> > Another reason, we can have the "my test-case found something strange"
> > bug-reports.
> >
> > So. Please feel free to nack or just ignore this message, but since I
> > personally dislike the current behaviour I should at least try to suggest
> > something else.
>
> I didn't spend very long on this: it looked rather equivalent to the
> current->fs->cred_exec_mutex patch that I proposed, but spinning its
> own infrastructure rather than relying on the existing mutex (and of
> course based on top of Al's patches, now in the tree, which have
> changed the path of least resistance).
>
> I've probably missed subtleties, but I still prefer my own suggestion;
> though Al hinted at a subtle problem with that which I never grasped.

I like your patch more too. Except it has problems with unshare_fs() as
Al pointed out.

I agree, this fs->in_exec_wait_ptr looks really ugly, so...

> Of course I agree with you sharing my unease at -EAGAIN and in_exec.
> Maybe we just lie in wait preparing a "told you so" for when someone
> reports "something strange"! But I'd really like to see your fix
> patch go in.

Perhaps I'll try to make the patch later, but I am not sure I send it ;)
In any case it complicates the code.

Oleg.

2009-04-21 17:23:01

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On 04/21, Linus Torvalds wrote:
>
> On Tue, 21 Apr 2009, Oleg Nesterov wrote:
> >
> > > (You do rcu_read_unlock() earlier, but that's okay.)
> >
> > Yes, but unless we have a "strong" reason, it is better to take
> > fs->lock first. rcu_read_lock() is free, but disables preemption.
>
> .. but so does taking a spinlock. So it shouldn't matter.
>
> We could play games with that (the same way I think we have some games for
> large-system irq latency with '__raw_spin_lock_flags()' on ia64), but that
> makes sense only when you have lots of CPU's and expect irq latency to
> suffer.
>
> And it doesn't tend to make sense for preemption latency, because if you
> have so many CPU's that you have lots of spinning on locks, you would
> normally not really care deeply about preemption (sure, in theory it's a
> real-time thing, in practice I doubt you'll find anybody who cares).

OK, I agree, it doesn't really matter from latency/etc pov.

But still I can't understand why it is better to take fs->lock under
RCU lock. I mean, "fs->lock is the innermost lock" should not apply
to rcu_read_lock(). Because the latter is a bit special, no?

Oleg.

2009-04-21 17:44:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)



On Tue, 21 Apr 2009, Oleg Nesterov wrote:
>
> OK, I agree, it doesn't really matter from latency/etc pov.
>
> But still I can't understand why it is better to take fs->lock under
> RCU lock. I mean, "fs->lock is the innermost lock" should not apply
> to rcu_read_lock(). Because the latter is a bit special, no?

Oh, I don't think it matters. If you want to put the RCU read-lock
innermost, that's fine by me. I just reacted to your latency argument as
not being very strong :)

All I personally want is a patch that everybody can agree on, and that
has sane semantics.

Linus

2009-04-21 19:42:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: Q: check_unsafe_exec() races (Was: [PATCH 2/4] fix setuid sometimes doesn't)

On Tue, 21 Apr 2009, Linus Torvalds wrote:
> On Tue, 21 Apr 2009, Oleg Nesterov wrote:
> >
> > OK, I agree, it doesn't really matter from latency/etc pov.
> >
> > But still I can't understand why it is better to take fs->lock under
> > RCU lock. I mean, "fs->lock is the innermost lock" should not apply
> > to rcu_read_lock(). Because the latter is a bit special, no?
>
> Oh, I don't think it matters. If you want to put the RCU read-lock
> innermost, that's fine by me. I just reacted to your latency argument as
> not being very strong :)
>
> All I personally want is a patch that everybody can agree on, and that
> has sane semantics.

Right, that ordering scarcely matters, and can probably be argued
either way. I should have been clearer when I suggested inverting
them to Oleg: I meant it merely as a suggestion, that we go back
to the ordering which came more naturally to Al in the first place.
And since Al hasn't spoken up (probably has more important things
to care about), please do go ahead with your two patches, Oleg,
with the rcu_read_lock() on whichever side takes your fancy!

Thanks,
Hugh

2009-04-23 23:08:26

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 1/2] do_execve() must not clear fs->in_exec if it was set by another thread

If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
unconditionally. This is wrong if we race with our sub-thread which
also does do_execve:

Two threads T1 and T2 and another process P, all share the same
->fs.

T1 starts do_execve(BAD_FILE). It calls check_unsafe_exec(), since
->fs is shared, we set LSM_UNSAFE but not ->in_exec.

P exits and decrements fs->users.

T2 starts do_execve(), calls check_unsafe_exec(), now ->fs is not
shared, we set fs->in_exec.

T1 continues, open_exec(BAD_FILE) fails, we clear ->in_exec and
return to the user-space.

T1 does clone(CLONE_FS /* without CLONE_THREAD */).

T2 continues without LSM_UNSAFE_SHARE while ->fs is shared with
another process.

Change check_unsafe_exec() to return res = 1 if we set ->in_exec, and change
do_execve() to clear ->in_exec depending on res.

When do_execve() suceeds, it is safe to clear ->in_exec unconditionally.
It can be set only if we don't share ->fs with another process, and since
we already killed all sub-threads either ->in_exec == 0 or we are the
only user of this ->fs.

Also, we do not need fs->lock to clear fs->in_exec.

Signed-off-by: Oleg Nesterov <[email protected]>

fs/exec.c | 19 ++++++++++---------
fs/compat.c | 11 +++++------
2 files changed, 15 insertions(+), 15 deletions(-)

--- PTRACE/fs/exec.c~1_IN_EXEC 2009-04-06 00:03:41.000000000 +0200
+++ PTRACE/fs/exec.c 2009-04-24 00:01:53.000000000 +0200
@@ -1077,9 +1077,11 @@ int check_unsafe_exec(struct linux_binpr
if (p->fs->users > n_fs) {
bprm->unsafe |= LSM_UNSAFE_SHARE;
} else {
- if (p->fs->in_exec)
- res = -EAGAIN;
- p->fs->in_exec = 1;
+ res = -EAGAIN;
+ if (!p->fs->in_exec) {
+ p->fs->in_exec = 1;
+ res = 1;
+ }
}

unlock_task_sighand(p, &flags);
@@ -1284,6 +1286,7 @@ int do_execve(char * filename,
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
+ bool clear_in_exec;
int retval;

retval = unshare_files(&displaced);
@@ -1306,8 +1309,9 @@ int do_execve(char * filename,
goto out_unlock;

retval = check_unsafe_exec(bprm);
- if (retval)
+ if (retval < 0)
goto out_unlock;
+ clear_in_exec = retval;

file = open_exec(filename);
retval = PTR_ERR(file);
@@ -1355,9 +1359,7 @@ int do_execve(char * filename,
goto out;

/* execve succeeded */
- write_lock(&current->fs->lock);
current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
current->in_execve = 0;
mutex_unlock(&current->cred_exec_mutex);
acct_update_integrals(current);
@@ -1377,9 +1379,8 @@ out_file:
}

out_unmark:
- write_lock(&current->fs->lock);
- current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
+ if (clear_in_exec)
+ current->fs->in_exec = 0;

out_unlock:
current->in_execve = 0;
--- PTRACE/fs/compat.c~1_IN_EXEC 2009-04-22 20:49:07.000000000 +0200
+++ PTRACE/fs/compat.c 2009-04-24 00:09:36.000000000 +0200
@@ -1476,6 +1476,7 @@ int compat_do_execve(char * filename,
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
+ bool clear_in_exec;
int retval;

retval = unshare_files(&displaced);
@@ -1498,8 +1499,9 @@ int compat_do_execve(char * filename,
goto out_unlock;

retval = check_unsafe_exec(bprm);
- if (retval)
+ if (retval < 0)
goto out_unlock;
+ clear_in_exec = retval;

file = open_exec(filename);
retval = PTR_ERR(file);
@@ -1546,9 +1548,7 @@ int compat_do_execve(char * filename,
goto out;

/* execve succeeded */
- write_lock(&current->fs->lock);
current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
current->in_execve = 0;
mutex_unlock(&current->cred_exec_mutex);
acct_update_integrals(current);
@@ -1568,9 +1568,8 @@ out_file:
}

out_unmark:
- write_lock(&current->fs->lock);
- current->fs->in_exec = 0;
- write_unlock(&current->fs->lock);
+ if (clear_in_exec)
+ current->fs->in_exec = 0;

out_unlock:
current->in_execve = 0;

2009-04-23 23:09:21

by Oleg Nesterov

[permalink] [raw]
Subject: [PATCH 2/2] check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/

write_lock(&current->fs->lock) guarantees we can't wrongly miss
LSM_UNSAFE_SHARE, this is what we care about. Use rcu_read_lock()
instead of ->siglock to iterate over the sub-threads. We must see
all CLONE_THREAD|CLONE_FS threads which didn't pass exit_fs(), it
takes fs->lock too.

With or without this patch we can miss the freshly cloned thread
and set LSM_UNSAFE_SHARE, we don't care.

Signed-off-by: Oleg Nesterov <[email protected]>

--- PTRACE/fs/exec.c~2_NO_SIGLOCK 2009-04-24 00:01:53.000000000 +0200
+++ PTRACE/fs/exec.c 2009-04-24 00:36:22.000000000 +0200
@@ -1060,7 +1060,6 @@ EXPORT_SYMBOL(install_exec_creds);
int check_unsafe_exec(struct linux_binprm *bprm)
{
struct task_struct *p = current, *t;
- unsigned long flags;
unsigned n_fs;
int res = 0;

@@ -1068,11 +1067,12 @@ int check_unsafe_exec(struct linux_binpr

n_fs = 1;
write_lock(&p->fs->lock);
- lock_task_sighand(p, &flags);
+ rcu_read_lock();
for (t = next_thread(p); t != p; t = next_thread(t)) {
if (t->fs == p->fs)
n_fs++;
}
+ rcu_read_lock();

if (p->fs->users > n_fs) {
bprm->unsafe |= LSM_UNSAFE_SHARE;
@@ -1083,8 +1083,6 @@ int check_unsafe_exec(struct linux_binpr
res = 1;
}
}
-
- unlock_task_sighand(p, &flags);
write_unlock(&p->fs->lock);

return res;

2009-04-23 23:33:41

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 1/2] do_execve() must not clear fs->in_exec if it was set by another thread

On Fri, Apr 24, 2009 at 01:01:56AM +0200, Oleg Nesterov wrote:
> If do_execve() fails after check_unsafe_exec(), it clears fs->in_exec
> unconditionally. This is wrong if we race with our sub-thread which
> also does do_execve:


[snip]

Applied (both of those).

2009-04-24 04:33:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/2] check_unsafe_exec: s/lock_task_sighand/rcu_read_lock/

On Fri, 23 Apr 2009, Oleg Nesterov wrote:
> write_lock(&p->fs->lock);
> - lock_task_sighand(p, &flags);
> + rcu_read_lock();
> for (t = next_thread(p); t != p; t = next_thread(t)) {
> if (t->fs == p->fs)
> n_fs++;
> }
> + rcu_read_lock();

Not quite! if second rcu_read_lock() changed to rcu_read_unlock(), then
Acked-by: Hugh Dickins <[email protected]>

2009-04-24 12:23:59

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 3/2] check_unsafe_exec: rcu_read_unlock

Fix typo in previous commit: second rcu_read_lock should be rcu_read_unlock.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/exec.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- 2.6.30-rc3-next-20090424/fs/exec.c 2009-04-24 12:23:43.000000000 +0100
+++ linux/fs/exec.c 2009-04-24 12:26:10.000000000 +0100
@@ -1043,7 +1043,7 @@ int check_unsafe_exec(struct linux_binpr
if (t->fs == p->fs)
n_fs++;
}
- rcu_read_lock();
+ rcu_read_unlock();

if (p->fs->users > n_fs) {
bprm->unsafe |= LSM_UNSAFE_SHARE;

2009-04-24 14:41:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 3/2] check_unsafe_exec: rcu_read_unlock

On 04/24, Hugh Dickins wrote:
>
> --- 2.6.30-rc3-next-20090424/fs/exec.c 2009-04-24 12:23:43.000000000 +0100
> +++ linux/fs/exec.c 2009-04-24 12:26:10.000000000 +0100
> @@ -1043,7 +1043,7 @@ int check_unsafe_exec(struct linux_binpr
> if (t->fs == p->fs)
> n_fs++;
> }
> - rcu_read_lock();
> + rcu_read_unlock();

I'd say this change is not bad. Except it discloses the fact I didn't
bother to test my trivial patch.

Thanks a lot Hugh!!!

And my apologies to all.

Oleg.