LinuxLists.cc - futex question

[permalink] [raw]

Subject: RE: futex question

>
>(Cc:-ed more futex folks.)
>

That is great but I am wondering, what's your opinion on this Ingo?

2009-10-01 23:46:54

[permalink] [raw]

Subject: RE: futex question

> Should we not just clear the pointer (and
>> it's compat version) within do_execve()?

In our private repository, applying the following patch resolved the
issues I mentioned. I no longer see messages like this:

[futex] ("ifconfig")(pid=2509) exit_robust_list:unable to fetch robust
entry. uaddr=0x000000002abbc4f0

from my instrumented kernel within exit_robust_list(). My
instrumentation looked something like this:

if (fetch_robust_entry(...)) {
printk(...);
return;
}

Just tossing the patch in the community in case someone is interested
...

Signed-off-by: Anirban Sinha <[email protected]>
---
fs/compat.c | 3 +++
fs/exec.c | 3 +++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 6d6f98f..c3d117c 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1510,6 +1510,9 @@ int compat_do_execve(char * filename,
if (retval)
goto out_file;

+#ifdef CONFIG_FUTEX
+ current->compat_robust_list = NULL;
+#endif
bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
goto out;
diff --git a/fs/exec.c b/fs/exec.c
index 172ceb6..e9334b8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1323,6 +1323,9 @@ int do_execve(char * filename,
retval = bprm_mm_init(bprm);
if (retval)
goto out_file;
+#ifdef CONFIG_FUTEX
+ current->robust_list = NULL;
+#endif

bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)

2009-10-02 23:39:03

by Darren Hart

[permalink] [raw]

Subject: Re: futex question

Anirban Sinha wrote:
>> Should we not just clear the pointer (and
>>> it's compat version) within do_execve()?
>
>
> In our private repository, applying the following patch resolved the
> issues I mentioned. I no longer see messages like this:
>
> [futex] ("ifconfig")(pid=2509) exit_robust_list:unable to fetch robust
> entry. uaddr=0x000000002abbc4f0
>
> from my instrumented kernel within exit_robust_list(). My
> instrumentation looked something like this:

>
> if (fetch_robust_entry(...)) {
> printk(...);
> return;
> }
>
> Just tossing the patch in the community in case someone is interested

Thanks for sending the patch. I'm looking into it now. Couple questions:

1) What caused you to instrument this path in the first place? Were you
seeing some unexpected behavior?

2) I wonder why we would need to clear the robust list, but I don't see
other things like pi_blocked_on, etc. in execve being cleared. I'm
looking into this now (perhaps we don't do the same cleanup, need to
check).... have to get on the plane...

--
Darren Hart

> ...
>
>
>
> Signed-off-by: Anirban Sinha <[email protected]>
> ---
> fs/compat.c | 3 +++
> fs/exec.c | 3 +++
> 2 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/fs/compat.c b/fs/compat.c
> index 6d6f98f..c3d117c 100644
> --- a/fs/compat.c
> +++ b/fs/compat.c
> @@ -1510,6 +1510,9 @@ int compat_do_execve(char * filename,
> if (retval)
> goto out_file;
>
> +#ifdef CONFIG_FUTEX
> + current->compat_robust_list = NULL;
> +#endif
> bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
> if ((retval = bprm->argc) < 0)
> goto out;
> diff --git a/fs/exec.c b/fs/exec.c
> index 172ceb6..e9334b8 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1323,6 +1323,9 @@ int do_execve(char * filename,
> retval = bprm_mm_init(bprm);
> if (retval)
> goto out_file;
> +#ifdef CONFIG_FUTEX
> + current->robust_list = NULL;
> +#endif
>
> bprm->argc = count(argv, MAX_ARG_STRINGS);
> if ((retval = bprm->argc) < 0)
>
>

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2009-10-03 00:37:46

[permalink] [raw]

Subject: RE: futex question

>
>
>Thanks for sending the patch. I'm looking into it now. Couple
questions:
>
>1) What caused you to instrument this path in the first place? Were
you
>seeing some unexpected behavior?

Basically, all this started as a means to aid debug or at least isolate
cases of memory corruption. When a process holding a futex died, the
robust futex cleanup operation can be foiled if there are any memory
corruptions in the user land. The "carefully inspecting the user land
linked list" part would bail out silently. So no process would get
EOWNERDEAD and wake up. So we decided to add printks so that we can
track these silent return cases.

We thought that actual number of cases of silently bailing out would be
rare so we did not expect any of those logs in the kernel buffer under
regular circumstances. To our surprise, we found lots of those logs!
This puzzled us. I looked at the code again and again but it deed some
seem to have any issues. Then it occurred to us (kaz) that an execve()
call can also cause invalid pointer values to remain in the task
structure. I did some testing and it seemed to indicate that this was
indeed the case.

There is a discussion on this by Kaz on the linux mips mailing list:

http://www.linux-mips.org/archives/linux-mips/2009-09/msg00130.html

>
>2) I wonder why we would need to clear the robust list, but I don't see
>other things like pi_blocked_on, etc. in execve being cleared.

Yea, may be. We don't use pi-futexes, so not too much confident of the
pi futex codebase there. You can look into those too.

I'm
>looking into this now (perhaps we don't do the same cleanup, need to
>check).... have to get on the plane...

Have a good trip and thanks for looking into it. :)

Ani

>>
>>
>> Signed-off-by: Anirban Sinha <[email protected]>
>> ---
>> fs/compat.c | 3 +++
>> fs/exec.c | 3 +++
>> 2 files changed, 6 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/compat.c b/fs/compat.c
>> index 6d6f98f..c3d117c 100644
>> --- a/fs/compat.c
>> +++ b/fs/compat.c
>> @@ -1510,6 +1510,9 @@ int compat_do_execve(char * filename,
>> if (retval)
>> goto out_file;
>>
>> +#ifdef CONFIG_FUTEX
>> + current->compat_robust_list = NULL;
>> +#endif
>> bprm->argc = compat_count(argv, MAX_ARG_STRINGS);
>> if ((retval = bprm->argc) < 0)
>> goto out;
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 172ceb6..e9334b8 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1323,6 +1323,9 @@ int do_execve(char * filename,
>> retval = bprm_mm_init(bprm);
>> if (retval)
>> goto out_file;
>> +#ifdef CONFIG_FUTEX
>> + current->robust_list = NULL;
>> +#endif
>>
>> bprm->argc = count(argv, MAX_ARG_STRINGS);
>> if ((retval = bprm->argc) < 0)
>>
>>
>
>
>--
>Darren Hart
>IBM Linux Technology Center
>Real-Time Linux Team

2009-10-03 04:14:29

by Eric Dumazet

[permalink] [raw]

Subject: Re: futex question

Anirban Sinha a ?crit :
>>
>> Thanks for sending the patch. I'm looking into it now. Couple
> questions:
>> 1) What caused you to instrument this path in the first place? Were
> you
>> seeing some unexpected behavior?
>
> Basically, all this started as a means to aid debug or at least isolate
> cases of memory corruption. When a process holding a futex died, the
> robust futex cleanup operation can be foiled if there are any memory
> corruptions in the user land. The "carefully inspecting the user land
> linked list" part would bail out silently. So no process would get
> EOWNERDEAD and wake up. So we decided to add printks so that we can
> track these silent return cases.
>
> We thought that actual number of cases of silently bailing out would be
> rare so we did not expect any of those logs in the kernel buffer under
> regular circumstances. To our surprise, we found lots of those logs!
> This puzzled us. I looked at the code again and again but it deed some
> seem to have any issues. Then it occurred to us (kaz) that an execve()
> call can also cause invalid pointer values to remain in the task
> structure. I did some testing and it seemed to indicate that this was
> indeed the case.
>
> There is a discussion on this by Kaz on the linux mips mailing list:
>
> http://www.linux-mips.org/archives/linux-mips/2009-09/msg00130.html

This exactly looks like what I discovered a while ago about futex used
for pthread management. Anirban, this is a real security flaw and this
should be fixed as fast as possible :)

Commit 9c8a8228d0827e0d91d28527209988f672f97d28
author Eric Dumazet <[email protected]>
Thu, 6 Aug 2009 22:09:28 +0000 (15:09 -0700)
execve: must clear current->clear_child_tid

While looking at Jens Rosenboom bug report
(http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
a dying "ps" program, we found following problem.

clone() syscall has special support for TID of created threads. This
support includes two features.

One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
TID value.

One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
thread dies.

The integer location is a user provided pointer, provided at clone()
time.

kernel keeps this pointer value into current->clear_child_tid.

At execve() time, we should make sure kernel doesnt keep this user
provided pointer, as full user memory is replaced by a new one.

As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
memory in forked processes.

Following sequence could happen:

1) bash (or any program) starts a new process, by a fork() call that
glibc maps to a clone( ... CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
...) syscall

2) When new process starts, its current->clear_child_tid is set to a
location that has a meaning only in bash (or initial program) context
(&THREAD_SELF->tid)

3) This new process does the execve() syscall to start a new program.
current->clear_child_tid is left unchanged (a non NULL value)

4) If this new program creates some threads, and initial thread exits,
kernel will attempt to clear the integer pointed by
current->clear_child_tid from mm_release() :

if (tsk->clear_child_tid
&& !(tsk->flags & PF_SIGNALED)
&& atomic_read(&mm->mm_users) > 1) {
u32 __user * tidptr = tsk->clear_child_tid;
tsk->clear_child_tid = NULL;

/*
* We don't check the error code - if userspace has
* not set up a proper pointer then tough luck.
*/
<< here >> put_user(0, tidptr);
sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
}

5) OR : if new program is not multi-threaded, but spied by /proc/pid
users (ps command for example), mm_users > 1, and the exiting program
could corrupt 4 bytes in a persistent memory area (shm or memory mapped
file)

If current->clear_child_tid points to a writeable portion of memory of the
new program, kernel happily and silently corrupts 4 bytes of memory, with
unexpected effects.

Fix is straightforward and should not break any sane program.

Reported-by: Jens Rosenboom <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Sonny Rao <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ulrich Drepper <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

2009-10-04 08:45:28

[permalink] [raw]

Subject: Re: futex question

On Fri, 2 Oct 2009, Darren Hart wrote:
> Anirban Sinha wrote:
> > > Should we not just clear the pointer (and
> > > > it's compat version) within do_execve()?
> >
> >
> > In our private repository, applying the following patch resolved the
> > issues I mentioned. I no longer see messages like this:
> >
> > [futex] ("ifconfig")(pid=2509) exit_robust_list:unable to fetch robust
> > entry. uaddr=0x000000002abbc4f0
> >
> > from my instrumented kernel within exit_robust_list(). My
> > instrumentation looked something like this:
>
> >
> > if (fetch_robust_entry(...)) {
> > printk(...);
> > return;
> > }
> >
> > Just tossing the patch in the community in case someone is interested
>
>
> Thanks for sending the patch. I'm looking into it now. Couple questions:
>
> 1) What caused you to instrument this path in the first place? Were you
> seeing some unexpected behavior?
>
> 2) I wonder why we would need to clear the robust list, but I don't see other
> things like pi_blocked_on, etc. in execve being cleared. I'm looking into
> this now (perhaps we don't do the same cleanup, need to check).... have to get
> on the plane...

Hmm, just setting the robust list pointer to NULL fixes the problem at
hand, but I wonder whether we need to call exit_robust_list() as
well.

Vs. pi_blocked_on: If that's != NULL then you are not executing that
code path because you hang in the scheduler waiting for the lock.

The interesting question is whether we need to call
exit_pi_state_list() to fix up held locks.

Thanks,

tglx

2009-10-04 16:37:46

[permalink] [raw]

Subject: Re: futex question

>> 1) What caused you to instrument this path in the first place? Were you
>> seeing some unexpected behavior?
>>
>> 2) I wonder why we would need to clear the robust list, but I don't see other
>> things like pi_blocked_on, etc. in execve being cleared. I'm looking into
>> this now (perhaps we don't do the same cleanup, need to check).... have to get
>> on the plane...
>
> Hmm, just setting the robust list pointer to NULL fixes the problem at
> hand, but I wonder whether we need to call exit_robust_list() as
> well.

hmm. That is an interesting thought. But I wonder if acquiring a lock and then exec()ing in the critical section is a legal thing to do. It does not feel right. Currently, with or without my change, such a thing would indefinitely block other waiters on the same futex.

Ani

2009-10-04 17:00:36

[permalink] [raw]

Subject: Re: futex question

On Sun, 4 Oct 2009, Anirban Sinha wrote:

> >> 1) What caused you to instrument this path in the first place? Were you
> >> seeing some unexpected behavior?
> >>
> >> 2) I wonder why we would need to clear the robust list, but I don't see other
> >> things like pi_blocked_on, etc. in execve being cleared. I'm looking into
> >> this now (perhaps we don't do the same cleanup, need to check).... have to get
> >> on the plane...
> >
> > Hmm, just setting the robust list pointer to NULL fixes the problem at
> > hand, but I wonder whether we need to call exit_robust_list() as
> > well.

> hmm. That is an interesting thought. But I wonder if acquiring a
> lock and then exec()ing in the critical section is a legal thing to

It's definitely legal. There is no law which forbids to do that. :)

> do. It does not feel right. Currently, with or without my change,
> such a thing would indefinitely block other waiters on the same
> futex.

Right. Which completely defeats the purpose of the robust list. Will
have a look tomorrow.

Thanks,

tglx

2009-10-05 10:33:54

[permalink] [raw]

Subject: Re: futex question

On Sun, 2009-10-04 at 18:59 +0200, Thomas Gleixner wrote:

> > do. It does not feel right. Currently, with or without my change,
> > such a thing would indefinitely block other waiters on the same
> > futex.
>
> Right. Which completely defeats the purpose of the robust list. Will
> have a look tomorrow.

Right, so mm_release() which is meant to destroy the old mm context
actually does exit_robust_list(), but the problem is that it does so on
the new mm, not the old one that got passed down to mm_release().

The other detail is that exit_robust_list() doesn't clear
current->robust_list.

The problem with the patch send my Ani is that it clears the robust
lists before the point of no return, so on a failing execve() we'd have
messed up the state.

Making exit_robust_list() deal with an mm that is not the current mm is
interesting indeed.

2009-10-05 10:57:30

[permalink] [raw]

Subject: Re: futex question

Peter,

On Mon, 5 Oct 2009, Peter Zijlstra wrote:

> On Sun, 2009-10-04 at 18:59 +0200, Thomas Gleixner wrote:
>
> > > do. It does not feel right. Currently, with or without my change,
> > > such a thing would indefinitely block other waiters on the same
> > > futex.
> >
> > Right. Which completely defeats the purpose of the robust list. Will
> > have a look tomorrow.
>
> Right, so mm_release() which is meant to destroy the old mm context
> actually does exit_robust_list(), but the problem is that it does so on
> the new mm, not the old one that got passed down to mm_release().
>
> The other detail is that exit_robust_list() doesn't clear
> current->robust_list.

I know.

> The problem with the patch send my Ani is that it clears the robust
> lists before the point of no return, so on a failing execve() we'd have
> messed up the state.

Right. We need to do that at the latest possible point.

Looking more into that I think we should check whether the robust list
has an entry (lock held) in do_execve() and return -EWOULDBLOCK to
luser space. Same if pi_waiters is not empty. Holding a lock and
calling execve() is simply broken.

Thanks,

tglx

2009-10-05 11:13:51

[permalink] [raw]

Subject: Re: futex question

On Mon, 2009-10-05 at 12:56 +0200, Thomas Gleixner wrote:
>
> Looking more into that I think we should check whether the robust list
> has an entry (lock held) in do_execve() and return -EWOULDBLOCK to
> luser space. Same if pi_waiters is not empty. Holding a lock and
> calling execve() is simply broken.

Fine by me ;-)

something like the below?

The question is of course what Ani was doing that triggered this in the
first place and if he can live with this.. :-)

---
fs/exec.c | 16 ++++++++++++++++
1 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d49be6b..0812ba6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1295,6 +1295,22 @@ int do_execve(char * filename,
bool clear_in_exec;
int retval;

+ retval = -EWOULDBLOCK;
+#ifdef CONFIG_FUTEX
+ if (unlikely(current->robust_list))
+ goto out_ret;
+#ifdef CONFIG_COMPAT
+ if (unlikely(current->compat_robust_list))
+ goto out_ret;
+#endif
+ spin_lock_irq(&current->pi_lock);
+ if (!list_empty(&current->pi_state_list)) {
+ spin_unlock_irq(&current->pi_lock);
+ goto out_ret;
+ }
+ spin_unlock_irq(&current->pi_lock);
+#endif
+
retval = unshare_files(&displaced);
if (retval)
goto out_ret;

2009-10-05 11:20:28

by Ingo Molnar

[permalink] [raw]

Subject: Re: futex question

* Peter Zijlstra <[email protected]> wrote:

> diff --git a/fs/exec.c b/fs/exec.c
> index d49be6b..0812ba6 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1295,6 +1295,22 @@ int do_execve(char * filename,
> bool clear_in_exec;
> int retval;
>
> + retval = -EWOULDBLOCK;
> +#ifdef CONFIG_FUTEX
> + if (unlikely(current->robust_list))
> + goto out_ret;
> +#ifdef CONFIG_COMPAT
> + if (unlikely(current->compat_robust_list))
> + goto out_ret;
> +#endif
> + spin_lock_irq(&current->pi_lock);
> + if (!list_empty(&current->pi_state_list)) {
> + spin_unlock_irq(&current->pi_lock);
> + goto out_ret;
> + }
> + spin_unlock_irq(&current->pi_lock);
> +#endif

i suspect this should have the form of:

retval = can_exec_robust_futexes();
if (retval)
goto out_ret

retval = unshare_files(&displaced);
if (retval)
goto out_ret;

...

but ... shouldnt we just do what exec normally does and zap any state
that shouldnt be carried over into the new context - instead of denying
the exec? Am i missing something?

Ingo

2009-10-05 11:48:32

[permalink] [raw]

Subject: Re: futex question

On Mon, 5 Oct 2009, Peter Zijlstra wrote:
> On Mon, 2009-10-05 at 12:56 +0200, Thomas Gleixner wrote:
> >
> > Looking more into that I think we should check whether the robust list
> > has an entry (lock held) in do_execve() and return -EWOULDBLOCK to
> > luser space. Same if pi_waiters is not empty. Holding a lock and
> > calling execve() is simply broken.
>
> Fine by me ;-)
>
> something like the below?
>
> The question is of course what Ani was doing that triggered this in the
> first place and if he can live with this.. :-)
>
> ---
> fs/exec.c | 16 ++++++++++++++++
> 1 files changed, 16 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index d49be6b..0812ba6 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1295,6 +1295,22 @@ int do_execve(char * filename,
> bool clear_in_exec;
> int retval;
>
> + retval = -EWOULDBLOCK;
> +#ifdef CONFIG_FUTEX
> + if (unlikely(current->robust_list))
> + goto out_ret;
> +#ifdef CONFIG_COMPAT
> + if (unlikely(current->compat_robust_list))
> + goto out_ret;
> +#endif

That needs to call into the futex code and check whether the list is
empty. If not empty, return. If empty set the pointer to NULL

Thanks,

tglx

2009-10-05 11:51:40

[permalink] [raw]

Subject: Re: futex question

On Mon, 5 Oct 2009, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > diff --git a/fs/exec.c b/fs/exec.c
> > index d49be6b..0812ba6 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1295,6 +1295,22 @@ int do_execve(char * filename,
> > bool clear_in_exec;
> > int retval;
> >
> > + retval = -EWOULDBLOCK;
> > +#ifdef CONFIG_FUTEX
> > + if (unlikely(current->robust_list))
> > + goto out_ret;
> > +#ifdef CONFIG_COMPAT
> > + if (unlikely(current->compat_robust_list))
> > + goto out_ret;
> > +#endif
> > + spin_lock_irq(&current->pi_lock);
> > + if (!list_empty(&current->pi_state_list)) {
> > + spin_unlock_irq(&current->pi_lock);
> > + goto out_ret;
> > + }
> > + spin_unlock_irq(&current->pi_lock);
> > +#endif
>
> i suspect this should have the form of:
>
> retval = can_exec_robust_futexes();
> if (retval)
> goto out_ret

Yes.

> retval = unshare_files(&displaced);
> if (retval)
> goto out_ret;
>
> ...
>
> but ... shouldnt we just do what exec normally does and zap any state
> that shouldnt be carried over into the new context - instead of denying
> the exec? Am i missing something?

We want to check whether the robust list is empty. If it's not empty
then we deny the exec instead of silently releasing the futexes or
just ignoring the robust list entirely. Same applies for the pi
waiters list.

Thanks,

tglx

2009-10-05 11:56:39

[permalink] [raw]

Subject: Re: futex question

On Mon, 2009-10-05 at 12:36 +0200, Peter Zijlstra wrote:
> On Sun, 2009-10-04 at 18:59 +0200, Thomas Gleixner wrote:
>
> > > do. It does not feel right. Currently, with or without my change,
> > > such a thing would indefinitely block other waiters on the same
> > > futex.
> >
> > Right. Which completely defeats the purpose of the robust list. Will
> > have a look tomorrow.
>
> Right, so mm_release() which is meant to destroy the old mm context
> actually does exit_robust_list(), but the problem is that it does so on
> the new mm, not the old one that got passed down to mm_release().
>
> The other detail is that exit_robust_list() doesn't clear
> current->robust_list.
>
> The problem with the patch send my Ani is that it clears the robust
> lists before the point of no return, so on a failing execve() we'd have
> messed up the state.
>
> Making exit_robust_list() deal with an mm that is not the current mm is
> interesting indeed.

Hmm...

static int exec_mmap(struct mm_struct *mm)
{
struct task_struct *tsk;
struct mm_struct * old_mm, *active_mm;

/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
mm_release(tsk, old_mm);

if (old_mm) {
/*
* Make sure that if there is a core dump in progress
* for the old mm, we get out and die instead of going
* through with the exec. We must hold mmap_sem around
* checking core_state and changing tsk->mm.
*/
down_read(&old_mm->mmap_sem);
if (unlikely(old_mm->core_state)) {
up_read(&old_mm->mmap_sem);
return -EINTR;
}
}
task_lock(tsk);
active_mm = tsk->active_mm;
tsk->mm = mm;
tsk->active_mm = mm;
activate_mm(active_mm, mm);
task_unlock(tsk);
arch_pick_mmap_layout(mm);
if (old_mm) {
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
mmdrop(active_mm);
return 0;
}

Actually calls mm_release() before the flip, so the below might actually
be sufficient?

---
kernel/fork.c | 10 ++++++++--
1 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 266c6af..4c20fff 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -570,12 +570,18 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)

/* Get rid of any futexes when releasing the mm */
#ifdef CONFIG_FUTEX
- if (unlikely(tsk->robust_list))
+ if (unlikely(tsk->robust_list)) {
exit_robust_list(tsk);
+ tsk->robust_list = NULL;
+ }
#ifdef CONFIG_COMPAT
- if (unlikely(tsk->compat_robust_list))
+ if (unlikely(tsk->compat_robust_list)) {
compat_exit_robust_list(tsk);
+ tsk->compat_robust_list = NULL;
+ }
#endif
+ if (unlikely(!list_empty(&tsk->pi_state_list)))
+ exit_pi_state_list(tsk);
#endif

/* Get rid of any cached register state */

2009-10-05 12:00:44

[permalink] [raw]

Subject: Re: futex question

On Mon, 5 Oct 2009, Peter Zijlstra wrote:

> On Mon, 2009-10-05 at 12:36 +0200, Peter Zijlstra wrote:
> > On Sun, 2009-10-04 at 18:59 +0200, Thomas Gleixner wrote:
> >
> > > > do. It does not feel right. Currently, with or without my change,
> > > > such a thing would indefinitely block other waiters on the same
> > > > futex.
> > >
> > > Right. Which completely defeats the purpose of the robust list. Will
> > > have a look tomorrow.
> >
> > Right, so mm_release() which is meant to destroy the old mm context
> > actually does exit_robust_list(), but the problem is that it does so on
> > the new mm, not the old one that got passed down to mm_release().
> >
> > The other detail is that exit_robust_list() doesn't clear
> > current->robust_list.
> >
> > The problem with the patch send my Ani is that it clears the robust
> > lists before the point of no return, so on a failing execve() we'd have
> > messed up the state.
> >
> > Making exit_robust_list() deal with an mm that is not the current mm is
> > interesting indeed.
>
> Hmm...
>
> static int exec_mmap(struct mm_struct *mm)
> {
> struct task_struct *tsk;
> struct mm_struct * old_mm, *active_mm;
>
> /* Notify parent that we're no longer interested in the old VM */
> tsk = current;
> old_mm = current->mm;
> mm_release(tsk, old_mm);
>
> if (old_mm) {
> /*
> * Make sure that if there is a core dump in progress
> * for the old mm, we get out and die instead of going
> * through with the exec. We must hold mmap_sem around
> * checking core_state and changing tsk->mm.
> */
> down_read(&old_mm->mmap_sem);
> if (unlikely(old_mm->core_state)) {
> up_read(&old_mm->mmap_sem);
> return -EINTR;
> }
> }
> task_lock(tsk);
> active_mm = tsk->active_mm;
> tsk->mm = mm;
> tsk->active_mm = mm;
> activate_mm(active_mm, mm);
> task_unlock(tsk);
> arch_pick_mmap_layout(mm);
> if (old_mm) {
> up_read(&old_mm->mmap_sem);
> BUG_ON(active_mm != old_mm);
> mm_update_next_owner(old_mm);
> mmput(old_mm);
> return 0;
> }
> mmdrop(active_mm);
> return 0;
> }
>
> Actually calls mm_release() before the flip, so the below might actually
> be sufficient?

Stared at the same place a minute ago :) But still I wonder if it's a
good idea to silently release locks and set the state to OWNERDEAD
instead of hitting the app programmer with a big clue stick in case
the app holds locks when calling execve().

Thanks,

tglx

> ---
> kernel/fork.c | 10 ++++++++--
> 1 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 266c6af..4c20fff 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -570,12 +570,18 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
>
> /* Get rid of any futexes when releasing the mm */
> #ifdef CONFIG_FUTEX
> - if (unlikely(tsk->robust_list))
> + if (unlikely(tsk->robust_list)) {
> exit_robust_list(tsk);
> + tsk->robust_list = NULL;
> + }
> #ifdef CONFIG_COMPAT
> - if (unlikely(tsk->compat_robust_list))
> + if (unlikely(tsk->compat_robust_list)) {
> compat_exit_robust_list(tsk);
> + tsk->compat_robust_list = NULL;
> + }
> #endif
> + if (unlikely(!list_empty(&tsk->pi_state_list)))
> + exit_pi_state_list(tsk);
> #endif
>
> /* Get rid of any cached register state */
>

2009-10-05 12:16:16

[permalink] [raw]

Subject: Re: futex question

On Mon, 2009-10-05 at 13:59 +0200, Thomas Gleixner wrote:
> Stared at the same place a minute ago :) But still I wonder if it's a
> good idea to silently release locks and set the state to OWNERDEAD
> instead of hitting the app programmer with a big clue stick in case
> the app holds locks when calling execve().

Agreed, I rather like the feedback. With regular exit like things
there's just not much we can do to avoid the mess, but here we can
actually avoid it, seems a waste not to do so.

2009-10-05 12:25:39

by Ingo Molnar

[permalink] [raw]

Subject: Re: futex question

* Peter Zijlstra <[email protected]> wrote:

> On Mon, 2009-10-05 at 13:59 +0200, Thomas Gleixner wrote:
>
> > Stared at the same place a minute ago :) But still I wonder if it's
> > a good idea to silently release locks and set the state to OWNERDEAD
> > instead of hitting the app programmer with a big clue stick in case
> > the app holds locks when calling execve().
>
> Agreed, I rather like the feedback. With regular exit like things
> there's just not much we can do to avoid the mess, but here we can
> actually avoid it, seems a waste not to do so.

Well, exec() has been a 'exit() + boot-strap next process' kind of thing
from the get go - with little state carried over into the new task. This
has security and robustness reasons as well.

So i think exec() should release all existing state, unless told
otherwise. Making it behave differently for robust futexes sounds
assymetric to me.

It might make sense though - a 'prevent exec because you are holding
locks!' thing. Dunno.

Cc:-ed a few execve() semantics experts who might want to chime in.

If a (buggy) app calls execve() with a (robust) futex still held should
we auto-force-release robust locks held, or fail the exec with an error
code? I think the forced release is a 'anomalous exit' thing mostly,
while calling exec() is not anomalous at all.

Ingo

2009-10-05 13:11:46

[permalink] [raw]

Subject: Re: futex question

Hi all:

catching up with the mails now. still pretty early here in the west coast.

Once upon a time, like on 09-10-05 4:47 AM, Thomas Gleixner wrote:
> On Mon, 5 Oct 2009, Peter Zijlstra wrote:
>> On Mon, 2009-10-05 at 12:56 +0200, Thomas Gleixner wrote:
>>>
>>> Looking more into that I think we should check whether the robust list
>>> has an entry (lock held) in do_execve() and return -EWOULDBLOCK to
>>> luser space. Same if pi_waiters is not empty. Holding a lock and
>>> calling execve() is simply broken.
>>

ha ha! Now you say it's broken :) Indeed, I also thought so. However, I wonder if this behavior change could potentially break some user space application?

>> Fine by me ;-)
>>
>> something like the below?
>>
>> The question is of course what Ani was doing that triggered this in the
>> first place and if he can live with this.. :-)
>>
>> ---
>> fs/exec.c | 16 ++++++++++++++++
>> 1 files changed, 16 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index d49be6b..0812ba6 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1295,6 +1295,22 @@ int do_execve(char * filename,
>> bool clear_in_exec;
>> int retval;
>>
>> + retval = -EWOULDBLOCK;
>> +#ifdef CONFIG_FUTEX
>> + if (unlikely(current->robust_list))
>> + goto out_ret;
>> +#ifdef CONFIG_COMPAT
>> + if (unlikely(current->compat_robust_list))
>> + goto out_ret;
>> +#endif
>
> That needs to call into the futex code and check whether the list is
> empty. If not empty, return. If empty set the pointer to NULL

True. Just because the head pointer has been registered does not mean that the list is non-empty. It can point back to itself when no locks are held as it's circular.

We need to clear those pointers regardless. After the exceve(), the address values are meaningless under the new mm context.

Ani

2009-10-05 13:29:12

[permalink] [raw]

Subject: Re: futex question

Can you please fix your mail client to do proper line breaks at around
78 chars ?

On Mon, 5 Oct 2009, Anirban Sinha wrote:
>
> We need to clear those pointers regardless. After the exceve(), the
> address values are meaningless under the new mm context.

That's out of question. We just need to come to a decision whether we
silently clean up callers dumbness or not.

Thanks,

tglx

2009-10-05 14:03:58

[permalink] [raw]

Subject: Re: futex question

Once upon a time, like on 09-10-05 6:28 AM, Thomas Gleixner wrote:
> Can you please fix your mail client to do proper line breaks at around
> 78 chars ?

Hmm. I made that change deliberately according to instructions in
Documentation/email-clients.txt so that it does not break my patches :) Anyway,
back to previous setting.

>>
>> We need to clear those pointers regardless. After the exceve(), the
>> address values are meaningless under the new mm context.
>
> That's out of question. We just need to come to a decision whether we
> silently clean up callers dumbness or not.

Correct me if I am wrong, but according to Ingo:

> So i think exec() should release all existing state, unless told
> otherwise. Making it behave differently for robust futexes sounds
> assymetric to me.

I thought clearing the head pointers was a part of the stale existing state?

2009-10-05 14:10:20

by Darren Hart

[permalink] [raw]

Subject: Re: futex question

Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, 2009-10-05 at 13:59 +0200, Thomas Gleixner wrote:
>>
>>> Stared at the same place a minute ago :) But still I wonder if it's
>>> a good idea to silently release locks and set the state to OWNERDEAD
>>> instead of hitting the app programmer with a big clue stick in case
>>> the app holds locks when calling execve().
>> Agreed, I rather like the feedback. With regular exit like things
>> there's just not much we can do to avoid the mess, but here we can
>> actually avoid it, seems a waste not to do so.
>
> Well, exec() has been a 'exit() + boot-strap next process' kind of thing
> from the get go - with little state carried over into the new task. This
> has security and robustness reasons as well.
>
> So i think exec() should release all existing state, unless told
> otherwise. Making it behave differently for robust futexes sounds
> assymetric to me.
>
> It might make sense though - a 'prevent exec because you are holding
> locks!' thing. Dunno.
>
> Cc:-ed a few execve() semantics experts who might want to chime in.
>
> If a (buggy) app calls execve() with a (robust) futex still held should
> we auto-force-release robust locks held, or fail the exec with an error
> code? I think the forced release is a 'anomalous exit' thing mostly,
> while calling exec() is not anomalous at all.

My first thought earlier in the thread was that changing the exec
behavior to fail if either a robust or pi futex is held would be liable
to break existing applications. I can now see the argument that such
apps are broken already, and if they aren't hanging, it's simply because
they are hacking around it.

I think the semantics work for robust mutexes, if you exec, the exec'ing
"thread" is effectively dead, so EOWNERDEAD makes sense.

This doesn't seem to work for PI futexes, unless they are also Robust of
course. Here I would expect a userspace application to hang.

The only locking related statements made in the SUS or our Linux man
pages is in regards to named semaphores. And here it is only said that
the semaphore will be closed like a call to sem_close(). sem_close(3)
doesn't specify a return value if the semaphore is held when called.

The closing of message queues and canceling of any pending asynchronous
I/O might provide precedent for just unlocking held locks and moving on
in the case of PI. EOWNERDEAD still makes more sense to me from a robust
point of view.

And from the ignorant-fool department, the docs refer to replacing the
"process image" on execve, doesn't that mean that if there are 20
threads in a process and one of them calls execve that all 20 are
destroyed? If so, then we are only concerned with
PTHREAD_PROCESS_SHARED futexes, since none of the private futexes will
have any users after the execve.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2009-10-05 18:11:39

[permalink] [raw]

Subject: RE: futex question

>
>The problem with the patch send my Ani is that it clears the robust
>lists before the point of no return, so on a failing execve() we'd have
>messed up the state.

Ah! yes. I should have added the lines after load_binary() succeeds:

fs/compat.c | 3 +++
fs/exec.c | 3 +++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 6d6f98f..7d1baf5 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1539,6 +1539,9 @@ int compat_do_execve(char * filename,
if (retval < 0)
goto out;

+#ifdef CONFIG_FUTEX
+ current->compat_robust_list = NULL;
+#endif
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
diff --git a/fs/exec.c b/fs/exec.c
index 172ceb6..d7b4ca3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1354,6 +1354,9 @@ int do_execve(char * filename,
if (retval < 0)
goto out;

+#ifdef CONFIG_FUTEX
+ current->robust_list = NULL;
+#endif
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;

btw, by the same token, shouldn't we call sched_exec() after the exec()
actually succeeds and the process has a new mm (thus having a smallest
effective memory and cache footprint)? I know that I could be missing
something subtle.

Ani

2009-10-05 18:37:08