The current futex_wait() code (I'm looking at tip/core/futexes)
conflicts with a warning in the comments about checking *uaddr==val
before the futex_q is queued on the hb list. While userspace is able to
alter *uaddr at will and should expect to hang in the kernel forever
should it do so haphazardly, there are legitimate scenarios where the
futex value might change between the call to futex_wait() and when the
futex_q gets on the hb list.
For example, glibc protects access to the value of cond.__data.__futex
via the cond.__data.__lock. However, before it can issue the syscall it
has to drop the cond.__data.__lock, leaving a small race window where
userspace might issue a signal or broadcast, which will modify the value
of cond.__data.__futex. As I understand it, this will result in the
waiter having changed the value of the futex prior to entering the
kernel, but not enqueuing itself on the hb list until after the waiter
issues the broadcast that was intended to wake it up.
I was working up a patch to move the test to after the call to
queue_me(), but in order to do the test we also have to perform the
get_user() after the queue_me(), which might sleep if we still hold the
hb->lock. If we let queue_me() drop the hb->lock before we call
get_user() then we may see a legitimate change in *uaddr that occured
after the queue_me() and before the get_user().
I'm at a loss for how to resolve the race without causing the false
positive inside the kernel. It might be resolvable in glibc by looking
at the return code from futex_requeue and checking if the number
woken_or_requeued agrees with the number it expected to be sleeping;
this likely leaves other gaps for other waking calls, like FUTEX_WAKE.
Any thoughts? Am I missing something that guards against this race?
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Adding a few key folks to the Cc, apologies for the short initial Cc list.
Darren Hart wrote:
> The current futex_wait() code (I'm looking at tip/core/futexes)
> conflicts with a warning in the comments about checking *uaddr==val
> before the futex_q is queued on the hb list. While userspace is able to
> alter *uaddr at will and should expect to hang in the kernel forever
> should it do so haphazardly, there are legitimate scenarios where the
> futex value might change between the call to futex_wait() and when the
> futex_q gets on the hb list.
>
> For example, glibc protects access to the value of cond.__data.__futex
> via the cond.__data.__lock. However, before it can issue the syscall it
> has to drop the cond.__data.__lock, leaving a small race window where
> userspace might issue a signal or broadcast, which will modify the value
> of cond.__data.__futex. As I understand it, this will result in the
> waiter having changed the value of the futex prior to entering the
> kernel, but not enqueuing itself on the hb list until after the waiter
> issues the broadcast that was intended to wake it up.
>
> I was working up a patch to move the test to after the call to
> queue_me(), but in order to do the test we also have to perform the
> get_user() after the queue_me(), which might sleep if we still hold the
> hb->lock. If we let queue_me() drop the hb->lock before we call
> get_user() then we may see a legitimate change in *uaddr that occured
> after the queue_me() and before the get_user().
>
> I'm at a loss for how to resolve the race without causing the false
> positive inside the kernel. It might be resolvable in glibc by looking
> at the return code from futex_requeue and checking if the number
> woken_or_requeued agrees with the number it expected to be sleeping;
> this likely leaves other gaps for other waking calls, like FUTEX_WAKE.
>
> Any thoughts? Am I missing something that guards against this race?
>
--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
On Thu, 2009-03-19 at 15:01 -0700, Darren Hart wrote:
> Adding a few key folks to the Cc, apologies for the short initial Cc list.
>
> Darren Hart wrote:
> > The current futex_wait() code (I'm looking at tip/core/futexes)
> > conflicts with a warning in the comments about checking *uaddr==val
> > before the futex_q is queued on the hb list. While userspace is able to
> > alter *uaddr at will and should expect to hang in the kernel forever
> > should it do so haphazardly, there are legitimate scenarios where the
> > futex value might change between the call to futex_wait() and when the
> > futex_q gets on the hb list.
> >
> > For example, glibc protects access to the value of cond.__data.__futex
> > via the cond.__data.__lock. However, before it can issue the syscall it
> > has to drop the cond.__data.__lock, leaving a small race window where
> > userspace might issue a signal or broadcast, which will modify the value
> > of cond.__data.__futex. As I understand it, this will result in the
> > waiter having changed the value of the futex prior to entering the
> > kernel, but not enqueuing itself on the hb list until after the waiter
> > issues the broadcast that was intended to wake it up.
> >
> > I was working up a patch to move the test to after the call to
> > queue_me(), but in order to do the test we also have to perform the
> > get_user() after the queue_me(), which might sleep if we still hold the
> > hb->lock. If we let queue_me() drop the hb->lock before we call
> > get_user() then we may see a legitimate change in *uaddr that occured
> > after the queue_me() and before the get_user().
> >
> > I'm at a loss for how to resolve the race without causing the false
> > positive inside the kernel. It might be resolvable in glibc by looking
> > at the return code from futex_requeue and checking if the number
> > woken_or_requeued agrees with the number it expected to be sleeping;
> > this likely leaves other gaps for other waking calls, like FUTEX_WAKE.
> >
> > Any thoughts? Am I missing something that guards against this race?
get_user_pages_fast() the futex page, that will pin it, then under the
lock you can kmap_atomic() the page, and read it.
Probably massive overkill though :-)