2014-11-27 14:27:42

by Torvald Riegel

Subject: POSIX mutex destruction requirements vs. futexes

TLDR: POSIX and C++11 impose requirements on mutex destruction that, if
we want to keep the userspace fast-path / kernel slow-path approach of
non-PI futexes, conflict with FUTEX_WAIT calls that return 0 not being
allowed to wake up spuriously. The best solutions seem to be to either
allow spurious wake-ups, or to create new variants of FUTEX_WAIT and/or
FUTEX_WAKE that allow spurious wake-ups.


Using reference-counting in critical sections to decide when the mutex
protecting the critical section can be destroyed has been recently
discussed on LKML. For example, something like this is supposed to
work:

int free = 0;

mutex_lock(&s->lock);
if (--s->refcount == 0)
  free = 1;
mutex_unlock(&s->lock);
if (free)
  kfree(s);
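
In pthreads terms, the same pattern looks roughly like this (an
illustrative sketch; the struct and its fields are made up, not taken
from any particular program):

struct obj {
  pthread_mutex_t lock;
  int refcount;
  /* ... payload ... */
};

void obj_unref(struct obj *s)
{
  int do_free = 0;

  pthread_mutex_lock(&s->lock);
  if (--s->refcount == 0)
    do_free = 1;
  pthread_mutex_unlock(&s->lock);
  if (do_free) {
    /* No thread owns the mutex anymore, but an unlock call by an
       earlier owner may not have returned to its caller yet. */
    pthread_mutex_destroy(&s->lock);
    free(s);
  }
}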

The Austin Group has clarified that pthreads mutexes have to support
such uses:
http://austingroupbugs.net/view.php?id=811
C++11 makes essentially the same requirement (see 30.4.1.2.1p5;
destruction is allowed if no thread owns the mutex; it is not required
that all unlock calls have returned before destruction starts).
C11 is silent on this particular aspect of mutex semantics, but it
usually follows POSIX and/or C++ semantics.

The destruction requirement doesn't just apply in cases of explicit
reference counting. Another example would be:

Thread1:
  pthread_mutex_lock(&m);
  pthread_create(..., Thread2, ...);
  pthread_mutex_unlock(&m);

Thread2:
  pthread_mutex_lock(&m);
  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);
  // reuse the memory of mutex m for something else;
  // a new, unrelated futex happens to reside at the same address as
  // m's internal futex

Here, the program has ensured in a different way that no thread owns the
mutex when it is destroyed.

This requirement is tough to implement for glibc -- or with futexes in
general -- because what one would like to do in a mutex unlock
implementation based on futexes is the following, roughly:

lock():
  while (1) {
    // fast path: assume uncontended lock
    if (atomic_compare_exchange_acquire(&futex, NOT_ACQUIRED, ACQUIRED)
        == SUCCESS)
      return;
    // slow path: signal that there is a slow-path waiter and block
    prev = atomic_exchange(&futex, ACQUIRED_AND_WAITERS);
    if (prev == NOT_ACQUIRED) return;
    futex_wait(&futex, ACQUIRED_AND_WAITERS, ...);
  }

unlock():
  // fast path unlock
  prev = atomic_exchange_release(&futex, NOT_ACQUIRED);
  // slow path unlock
  if (prev == ACQUIRED_AND_WAITERS)
    futex_wake(&futex, ...);

Having the fast path is good for performance. It enables spinning on
acquisition, and avoids a syscall when releasing the mutex. The latter
is good in the uncontended case but also when there is contention,
because another spinning thread may be able to acquire the mutex more
quickly than if we'd require the kernel to wake a blocked thread -- thus
increasing scalability.

However, because another thread can acquire the mutex right after the
fast path of the unlock(), that thread's next steps can then be
concurrent with the slow path of the unlock(), in particular with the
futex_wake call. All that is needed to make such a pending, concurrent
futex_wake call possible is that, at some time in the past, the futex
was in the ACQUIRED_AND_WAITERS state.
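
Spelled out for the second example, one problematic interleaving
(purely illustrative, in terms of the pseudocode above) is:

Thread2 (in lock(), slow path):
  prev = atomic_exchange(&futex, ACQUIRED_AND_WAITERS);  // prev == ACQUIRED

Thread1 (in unlock()):
  prev = atomic_exchange_release(&futex, NOT_ACQUIRED);
  // prev == ACQUIRED_AND_WAITERS; preempted before futex_wake

Thread2 (continues):
  futex_wait(&futex, ACQUIRED_AND_WAITERS, ...);  // EWOULDBLOCK: value changed
  // retries the loop, acquires the mutex on the fast path
  pthread_mutex_unlock(&m);
  pthread_mutex_destroy(&m);  // allowed: no thread owns m
  // m's memory is reused; an unrelated futex now lives at the same address

Thread1 (resumes):
  futex_wake(&futex, ...);  // may wake a waiter on that unrelated futex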

This means that in the second example above, futex_wake can be
concurrent with whatever happens on the mutex' memory location after the
mutex has been destroyed. Examples are:
* The memory is unmapped. futex_wake will return an error. OK.
* The memory is reused, but not for a futex. No thread will get
woken. OK.
* The memory is reused for another glibc mutex. The slow-path
futex wake will now hit another, unrelated futex -- but the
mutex implementation is robust to such spurious wake-ups anyway,
because it can always happen when a mutex is acquired and
released more than once. OK.
* The memory is reused for another futex in some custom data
structure that expects there is just one wait/wake cycle, and
relies on FUTEX_WAIT returning 0 to mean that this is caused by
the matching FUTEX_WAKE call by *this* data structure. Not OK,
because now the delayed slow-path wake-up introduces a spurious
wake-up in an unrelated futex.

Thus, introducing spurious wake-ups is the core issue. This is not
restricted to pthreads mutexes
(https://sourceware.org/bugzilla/show_bug.cgi?id=13690) but also is an
issue for semaphores
(https://sourceware.org/bugzilla/show_bug.cgi?id=12674; I have a patch
that I'll send upstream soon that fixes the memory access issues under
concurrent destruction -- the futex issue remains) and barriers
(https://sourceware.org/bugzilla/show_bug.cgi?id=13065).
In general, this is an issue for all futex uses that rely on this same
fast-path / slow-path pattern and a similar destruction scheme.
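
For illustration, a semaphore post operation with the same structure
could look like this (a simplified sketch in the same pseudocode style
as above, not the actual glibc implementation; the field names are made
up):

sem_post():
  // fast path: publish the token
  atomic_fetch_add_release(&sem->value, 1);
  // slow path: only needed if somebody blocked; this futex_wake can
  // still be pending when the semaphore is destroyed and its memory
  // reused, just like in the mutex case
  if (atomic_load_relaxed(&sem->nwaiters) > 0)
    futex_wake(&sem->value, 1, ...);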


There are ways to solve this in userspace only and with the existing
futex operations, but they all have disadvantages:
* Using truly deferred memory reclamation schemes such as hazard
pointers or RCU is probably impossible to implement for
process-shared pthreads mutexes.
* Mutexes could count pending slow-path futex_wake calls, but this
increases contention on the mutex. Also, mutex destruction
would have to wait for these to finish. We would either need to do
this with spin-waiting, which isn't ideal in terms of avoiding
pathological performance cases, or use a futex in stable
storage, but, again, the latter doesn't work for process-shared
mutexes.
* For non-process-shared mutexes, glibc could use per-thread
futexes in stable storage or such (e.g., with something like an
MCS lock).
* (Ab)use PI futexes to solve the destruction problem.
FUTEX_UNLOCK_PI does the whole mutex release in the kernel, so
we avoid the split fast-path/slow-path problem. However, mutex
release latency (best case, perhaps common case depending on the
length of the critical section and contention, etc.) will be
higher.

(For more details on these options, see the glibc discussion thread of
this topic: https://sourceware.org/ml/libc-alpha/2014-04/msg00075.html)


It seems far better to just allow spurious wake-ups from FUTEX_WAIT. I
can imagine a few options:

(1) Allow spurious wake-ups from FUTEX_WAIT.

If we'd start from scratch, we could just specify that if FUTEX_WAIT
returns 0, this might be due to a spurious wake-up as well. I cannot
think of any use case that might be significantly harder to implement
with such a specification. Also, I would guess that the effects on
performance are small; after all, a racing, pending FUTEX_WAKE should be
rare (eg, it needs memory reuse at the exact same address, a thread
being suspended right between the fast and slow path of unlock, ...).

However, this is a change to the futex specs, if we were to make this
change now. I'm aware that you don't want to break userspace (and that's
a good thing IMO :), so I'm mentioning this option for completeness and
to some extent because there are reasons that make actually breaking
userspace in practice potentially unlikely:
* FUTEX_WAIT can return EINTR, so calls of it need to be in a loop
anyway.
* All uses of futexes that are not one-shot mechanisms (ie, just
one FUTEX_WAIT and FUTEX_WAKE for a particular futex instance)
have to be robust to spurious wake-ups anyway, even if
FUTEX_WAIT returns 0.
* The current futex documentation (ie, before the additions by
Ingo and Darren) is incomplete, so users may have followed
existing practice anyway (e.g., glibc code and Ulrich's paper),
which do handle spurious wake-ups. OTOH, this may not be
necessary, and the current incomplete docs don't allow for
spurious wake-ups.


(2) New FUTEX_WAKE_SPURIOUS operation that just emits spurious wakeups

The kernel could provide a new futex wake operation that enforces that
all FUTEX_WAIT calls it wakes return with EINTR. EINTR is allowed by
the current futex documentation, so no change to the specs.
Callers such as glibc would then use this new futex op and make sure
that *their own* futexes interpret EINTR returns from FUTEX_WAIT as
caused by potentially both spurious and nonspurious wake-ups. In
practice, that would require no or just a few changes on the FUTEX_WAIT
side, because rechecking the condition after EINTR is the natural thing
to do.
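
For example, a waiter written in the style of the pseudocode above
would not need to change at all (a sketch; FUTEX_WAKE_SPURIOUS is the
hypothetical new operation, it does not exist today):

wait_while_equal(expected):
  while (atomic_load_relaxed(&futex) == expected) {
    err = futex_wait(&futex, expected, ...);
    if (err == EINTR)
      continue;  // a signal, or a FUTEX_WAKE_SPURIOUS issued by some
                 // stale waker -- just re-check the condition
    // handle fatal errors (ENOSYS, EINVAL, ...); for 0 or EWOULDBLOCK
    // we also just re-check the condition via the loop
  }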


(3) New FUTEX_WAKE_SPURIOUS and FUTEX_WAIT_SPURIOUS operations

In this approach, FUTEX_WAKE_SPURIOUS would only wake
FUTEX_WAIT_SPURIOUS calls (and the latter including spurious wake-ups on
return of 0). This separates new from old semantics; it's in some sense
similar to (2), except that there's an explicit "opt-in" for spurious
wake-ups on both sides.


There's practically nothing we can really do to change the mutex
destruction semantics specified by POSIX and C++11 (and that C11 likely
intended). Therefore, I think we need to support this in
implementations.

I'm looking for feedback from the kernel community on the following
points:
* Is this issue best solved on the kernel side? If not, what else
would you like to see before agreeing to that?
* Is allowing spurious wake-ups in some way the best way to solve
this? If not, do you have other suggestions?
* If we allow spurious wake-ups, which of (1), (2), or (3) is best
in your opinion? Or other suggestions?

Please CC me in replies because I'm not subscribed to LKML.


Thanks,

Torvald


PS: I'm aware that glibc is currently still a little sloppy when
checking return values from futex operations; this is work-in-progress
on the glibc side:
https://sourceware.org/ml/libc-alpha/2014-09/msg00381.html


2014-11-27 19:38:13

by Linus Torvalds

Subject: Re: POSIX mutex destruction requirements vs. futexes

On Thu, Nov 27, 2014 at 6:27 AM, Torvald Riegel <[email protected]> wrote:
>
> Using reference-counting in critical sections to decide when the mutex
> protecting the critical section can be destroyed has been recently
> discussed on LKML. For example, something like this is supposed to
> work:
>
> int free = 0;
>
> mutex_lock(&s->lock);
> if (--s->refcount == 0)
> free = 1
> mutex_unlock(&s->lock);
> if (free)
> kfree(s);

Yeah, this is a nasty case. We've had this bug in the kernel, and only
allow self-locking data structures with spinlocks (in which the unlock
operation is guaranteed to release the lock and never touch the data
structure afterwards in any way - no "unlock fastpath followed by
still touching it").


> This requirement is tough to implement for glibc -- or with futexes in
> general -- because what one would like to do in a mutex unlock
> implementation based on futexes is the following, roughly:
>
> lock():
> while (1) {
> // fast path: assume uncontended lock
> if (atomic_compare_exchange_acquire(&futex, NOT_ACQUIRED, ACQUIRED)
> == SUCCESS)
> return;
> // slow path: signal that there is a slow-path waiter and block
> prev = atomic_exchange(&futex, ACQUIRED_AND_WAITERS);
> if (prev == NOT_ACQUIRED) return;
> futex_wait(&futex, ACQUIRED_AND_WAITERS, ...);
> }
>
> unlock():
> // fast path unlock
> prev = atomic_exchange_release(&futex, NOT_ACQUIRED);
> // slow path unlock
> if (prev == ACQUIRED_AND_WAITERS)
> futex_wake(&futex, ...);

Yup.

> This means that in the second example above, futex_wake can be
> concurrent with whatever happens on the mutex' memory location after the
> mutex has been destroyed. Examples are:
> * The memory is unmapped. futex_wake will return an error. OK.
> * The memory is reused, but not for a futex. No thread will get
> woken. OK.
> * The memory is reused for another glibc mutex. The slow-path
> futex wake will now hit another, unrelated futex -- but the
> mutex implementation is robust to such spurious wake-ups anyway,
> because it can always happen when a mutex is acquired and
> released more than once. OK.
> * The memory is reused for another futex in some custom data
> structure that expects there is just one wait/wake cycle, and
> relies on FUTEX_WAIT returning 0 to mean that this is caused by
> the matching FUTEX_WAKE call by *this* data structure. Not OK,
> because now the delayed slow-path wake-up introduces a spurious
> wake-up in an unrelated futex.
>
> Thus, introducing spurious wake-ups is the core issue.

So my gut feeling is that we should just try to see if we can live
with spurious wakeups, ie your:

> (1) Allow spurious wake-ups from FUTEX_WAIT.

because afaik that is what we actually *do* today (we'll wake up
whoever re-used that location in another thread), and it's mainly
about the whole documentation issue. No?

Linus

2014-12-01 12:05:55

by Torvald Riegel

Subject: Re: POSIX mutex destruction requirements vs. futexes

On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
> On Thu, Nov 27, 2014 at 6:27 AM, Torvald Riegel <[email protected]> wrote:
> > This means that in the second example above, futex_wake can be
> > concurrent with whatever happens on the mutex' memory location after the
> > mutex has been destroyed. Examples are:
> > * The memory is unmapped. futex_wake will return an error. OK.
> > * The memory is reused, but not for a futex. No thread will get
> > woken. OK.
> > * The memory is reused for another glibc mutex. The slow-path
> > futex wake will now hit another, unrelated futex -- but the
> > mutex implementation is robust to such spurious wake-ups anyway,
> > because it can always happen when a mutex is acquired and
> > released more than once. OK.
> > * The memory is reused for another futex in some custom data
> > structure that expects there is just one wait/wake cycle, and
> > relies on FUTEX_WAIT returning 0 to mean that this is caused by
> > the matching FUTEX_WAKE call by *this* data structure. Not OK,
> > because now the delayed slow-path wake-up introduces a spurious
> > wake-up in an unrelated futex.
> >
> > Thus, introducing spurious wake-ups is the core issue.
>
> So my gut feeling is that we should just try to see if we can live
> with spurious wakeups, ie your:
>
> > (1) Allow spurious wake-ups from FUTEX_WAIT.
>
> because afaik that is what we actually *do* today (we'll wake up
> whoever re-used that location in another thread), and it's mainly
> about the whole documentation issue. No?

If that's what the kernel prefers to do, this would be fine with me. In
this case, I'd follow up with Michael Kerrisk to get this documented. I
guess something along the lines of this would work:
"Return 0 if the process was woken by a FUTEX_WAKE call, or
spuriously.
(Note that these spurious wake-ups can result from futex operations by
other threads or processes.)"
A more detailed explanation outside of the manpages (ie, the rationale
for this wording) would also be useful for users I suppose.

However, I've heard others being more concerned over this change in
semantics. Therefore, I'll describe the pros/cons that I see, just so
that we're all on the same page on what the decision for (1) would mean.

The change in return-value-of-0 semantics that (1) is about can violate
the futex semantics that existing code like the following example might
assume:

Initially f==0

Thread 1:
  atomic_store_relaxed(&f, 1);
  futex_wake(&f, 1, ...);

Thread 2:
  do {
    err = futex_wait(&f, 0, NULL);
    // handle fatal errors like ENOSYS, EINVAL, ...
  } while (err == EINTR);
  // err must have been EWOULDBLOCK or 0

For this to work according to the specs, the program itself must not
have pending FUTEX_WAKE's to f (e.g., by having a barrier after the
operations by both threads). The program can use glibc pthreads
synchronization operations, because they don't specify anywhere that
they cause concurrent FUTEX_WAKE's after the synchronization data
structures have been destroyed and their memory has been reused.

For this to work in practice, the program would need to not use
reference-counting to trigger destruction of pthread synchronization
data structures (or do something similar that causes a mutex unlock that
has not returned to the caller to be concurrent with destruction).

Thus, there can be correct programs out there that never are affected by
spurious wake-ups, but now would have to change if they want to be
correct after the change in documentation. I would bet that those cases
are the minority, but I don't have any data to back this up or even make
a guess about how small a minority they would be.

OTOH, in practice, they would *not* become buggy after the documentation
change, *unless* the kernel should decide to, in the future, also
introduce spurious wake-ups due to other reasons and return 0 instead of
EINTR. (That's why I put the note in the suggested docs change above.)

One could also argue that the current spec actually allows the spurious
wake-ups we are concerned with. The manpage currently states:
"Returns 0 if the process was woken by a FUTEX_WAKE call."
The spurious wake-ups *are* due to other FUTEX_WAKE calls, just not due
to the calls that the piece of code assumed they might come from. But
that's probably not quite obvious :)

It also seems a little odd that consequences of the design of how
futexes are used in practice (ie, fast/slow path split) and the
POSIX/C++11 requirements on destruction bleed into the semantics of kernel
operations. However, allowing spurious wake-ups for a return value of 0
might be the simplest way to describe how a futex can be used safely;
this would be *easier* for programmers than giving them all the details
on pending futexes etc.
So maybe we could also change the futex specs like this (or similar):
"Return 0 if the process was woken by a FUTEX_WAKE call.
Note that typical uses of futexes elsewhere in the process (e.g., by
other threads or in libraries) can lead to spurious wake-ups of
FUTEX_WAIT operations that return 0. Therefore, callers should assume
that spurious wake-ups can happen, unless they have control over all
uses of futexes in the process and other processes that have access to
the futex."


So, if you continue to prefer option (1), I think that's what we should
do. Also, in this case I'd like some feedback on the direction of the
documentation change, in particular whether
* We should explicitly allow spurious wake-ups for return value of 0 or
just add weasel-wording that spurious wake-ups may happen due to other
futex uses in the process or other processes that have access to the
futex variable.
* If we allow spurious wake-ups, whether this should still include a
promise that the kernel itself will not add spurious wake-ups (so that
programs that work today will continue to work in the future (see
example above)).

Thanks,

Torvald

2014-12-01 18:31:34

by Linus Torvalds

Subject: Re: POSIX mutex destruction requirements vs. futexes

On Mon, Dec 1, 2014 at 4:05 AM, Torvald Riegel <[email protected]> wrote:
> On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
>>
>> > (1) Allow spurious wake-ups from FUTEX_WAIT.
>>
>> because afaik that is what we actually *do* today (we'll wake up
>> whoever re-used that location in another thread), and it's mainly
>> about the whole documentation issue. No?
>
> If that's what the kernel prefers to do, this would be fine with me.

I think it's more of a "can we even do anything else"?

The kernel doesn't even see the reuse of the futex, or the fast path
parts. Not seeing the fast path is obviously by design, and not seeing
the reuse is partly due to pthreads interfaces (I guess people should
generally call mutex_destroy, but I doubt people really do, and even
if they did, how would user space actually handle the nasty race
between "pthread_unlock -> stale futex_wake" "pthread_mutex_destroy()"
anyway?).

So the thing is, I don't see how we could possibly change the existing
FUTEX_WAKE behavior.

And introducing a new "debug mode" that explicitly adds spurious
events might as well be done in user space with a wrapper around
FUTEX_WAKE or something.

Because as far as the *kernel* is concerned, there is no "spurious
wake" event. It's not spurious. It's real and exists, and the wakeup
was really done by the user. The fact that it's really a stale wakeup
for a previous allocation of a pthread mutex data structure is
something that is completely and fundamentally invisible to the
kernel.

No?

So even from a documentation standpoint, it's really not so much about
"kernel interfaces" being incorrectly documented as about a big honking
warning about internal pthread_mutex() implementation in a library,
and the impact that library choice has on allocation re-use, and how
that means that even if the FUTEX_WAKE is guaranteed to not be
spurious from a *kernel* standpoint, certain use models will create
their own spurious wake events in very subtle ways.

So I do disagree with you on the "explicitly allow spurious wake-ups",
in the sense that no, the kernel doesn't do that. But it is definitely
worth adding documentation about how users can cause their *own*
"spurious" wakeups.

So I think we continue to guarantee "no spurious wakeups" from a
kernel interface standpoint. Old programs that depend on it will
continue to work.

But at the same time, old programs that have broken libraries that
didn't realize that those libraries themselves caused "spurious"
wakeups are broken, but it's not due to some kernel issue, it's a
user-mode internal implementation (that is easy to get wrong - as
mentioned, we've had this bug ourselves inside the kernel).

Linus

2014-12-01 20:44:08

by Torvald Riegel

Subject: Re: POSIX mutex destruction requirements vs. futexes

On Mon, 2014-12-01 at 10:31 -0800, Linus Torvalds wrote:
> On Mon, Dec 1, 2014 at 4:05 AM, Torvald Riegel <[email protected]> wrote:
> > On Thu, 2014-11-27 at 11:38 -0800, Linus Torvalds wrote:
> >>
> >> > (1) Allow spurious wake-ups from FUTEX_WAIT.
> >>
> >> because afaik that is what we actually *do* today (we'll wake up
> >> whoever re-used that location in another thread), and it's mainly
> >> about the whole documentation issue. No?
> >
> > If that's what the kernel prefers to do, this would be fine with me.
>
> I think it's more of a "can we even do anything else"?
>
> The kernel doesn't even see the reuse of the futex, or the fast path
> parts. Not seeing the fast path is obviously by design, and not seeing
> the reuse is partly due to pthreads interfaces (I guess people should
> generally call mutex_destroy, but I doubt people really do, and even
> if they did, how would user space actually handle the nasty race
> between "pthread_unlock -> stale futex_wake" "pthread_mutex_destroy()"
> anyway?).

User space could count stale (or pending) futex_wake calls and spin in
pthread_mutex_destroy() until this count is zero. However, that would
increase contention significantly, and we must spin, not block, in
pthread_mutex_destroy(), at least for process-shared mutexes, because
there's no place to put a futex for this destruction-time blocking that
is not subject to the same issue again. (For process-private futexes,
this could be a memory location that is only ever used by glibc.)
There might even be more issues related to unmapping memory of
process-shared mutexes based on reference-counting in the critical
sections.
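
A rough sketch of that counting scheme (illustrative pseudocode, in the
style of the unlock() from my first mail):

unlock():
  // announce a possible slow-path wake *before* releasing the mutex,
  // so that destroy() cannot miss it; note the two extra atomic RMWs
  // on every unlock
  atomic_fetch_add_relaxed(&mutex->pending_wakes, 1);
  prev = atomic_exchange_release(&mutex->futex, NOT_ACQUIRED);
  if (prev == ACQUIRED_AND_WAITERS)
    futex_wake(&mutex->futex, ...);
  atomic_fetch_sub_release(&mutex->pending_wakes, 1);

destroy():
  // must spin: for process-shared mutexes there is no place to put a
  // futex for blocking here that isn't subject to the same problem
  while (atomic_load_acquire(&mutex->pending_wakes) != 0)
    ;  // spin-wait; pathological if an unlocking thread gets preempted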

The additional contention of counting stale futex_wake's worries me
most. You only need to count when threads actually use futexes to
block, and perhaps glibc's mutex implementation could be changed to
spin-wait aggressively, and perhaps we could add some explicit handover
to other lock owners or something similar to avoid issuing a futex_wake
unless really necessary -- but this seems like quite a lot of hoops to
jump through to work around a relatively small aspect of the current
futexes.

> So the thing is, I don't see how we could possibly change the existing
> FUTEX_WAKE behavior.
>
> And introducing a new "debug mode" that explicitly adds spurious
> events might as well be done in user space with a wrapper about
> FUTEX_WAKE or something.

User space could introduce a wrapper (e.g., glibc could provide a futex
wrapper that allows spurious wake-ups on return of 0) -- but glibc can't
prevent users from using futexes directly rather than through the
wrapper. Or should it try to intercept direct, non-wrapper uses of the
futex syscall in some way?

That's why I included a "debug mode" -- I'd rather call it a "new" mode
than a debug mode, because it's not really about debugging -- in the
list of options (ie, options (2) and (3)). This would allow glibc (and
other libraries) to use the futex variants with the new semantics in the
most natural way, and yet would not create situations in which calls to
the old futex variant *appear* to allow spurious wake-ups (due to
glibc's or other libraries' fault -- libstdc++ is in the same boat, for
example).

> Because as far as the *kernel* is concerned, there is no "spurious
> wake" event. It's not spurious. It's real and exists, and the wakeup
> was really done by the user. The fact that it's really a stale wakeup
> for a previous allocation of a pthread mutex data structure is
> something that is completely and fundamentally invisible to the
> kernel.
>
> No?

I agree, and that's why I mentioned that it may seem odd to fix this on
the level of the kernel interface. However, it just seems the best
approach when considering practice in kernel and user space, the
by-design futex properties, and the realities of what POSIX requires.

> So even from a documentation standpoint, it's really not about "kernel
> interfaces" being incorrectly documented, as about a big honking
> warning about internal pthread_mutex() implementation in a library,
> and the impact that library choice has on allocation re-use, and how
> that means that even if the FUTEX_WAKE is guaranteed to not be
> spurious from a *kernel* standpoint, certain use models will create
> their own spurious wake events in very subtle ways.

I agree it's not the kernel's fault, but that doesn't solve the dilemma.
It's one aspect of the futex design -- whether spurious wake-ups are
allowed for a return of 0 -- that makes it hard to use futexes for
POSIX, C++11 (and C11, most likely) synchronization primitives without
user space violating the kernel interface's semantics. In an ideal
world, we'd have considered that right away when designing futexes, made
a decision that works well for both the kernel and user space, and
specified the return conditions accordingly. IMO, allowing spurious
wake-ups doesn't actually make using futexes any harder, so allowing
them from the start would have been fine.

If somebody has any suggestions for how to fix this in user space, with
no or at least an acceptable performance hit, and in a way that works
with process-shared POSIX mutexes, I'm all ears. I don't see a good
user space solution in the sense of being more efficient and more
effective than a kernel-side solution, so this is why I think options
(2) and (3) might be good. They would give us the futex design that
works best for POSIX etc. and yet would never affect old programs using
futexes directly because they would use the old-style semantics.

If options (2) and (3) are not acceptable to the kernel community and if
we don't find other solutions, then I'd like to at least document
the issue in the kernel man pages (making sure to point out that this is
caused by user space realities).

Thoughts?

2014-12-01 21:31:31

by Linus Torvalds

Subject: Re: POSIX mutex destruction requirements vs. futexes

On Mon, Dec 1, 2014 at 12:44 PM, Torvald Riegel <[email protected]> wrote:
>
> User space could introduce a wrapper (e.g., glibc could provide a futex
> wrapper that allows spurious wakeup on return of 0) -- but glibc can't
> prevent users from not using futexes directly and not through the
> wrapper. Or should it try to intercept direct, non-wrapper uses of the
> futex syscall in some way?

So I don't think we even need to worry about direct users of the futex
system call. They had better know what they are doing.

> I agree, and that's why I mentioned that it may seem odd to fix this on
> the level of the kernel interface. However, it just seems the best
> approach when considering practice in kernel and user space, the
> by-design futex properties, and the realities of what POSIX requires.

Sure. That said, I think there really are very few cases we need to worry about:

- the pthread_mutex implementation itself.

This is, after all, what 99% of all applications will use. And
here, the only worry is to make sure that glibc (or whatever other
libc) is aware of its own reuse of allocations, and basically knows
that it can get into the situation where a reused mutex memory
allocation can effectively see "stale" wakeups. That is, of course,
assuming that the library implementation is the expected racy one.
There *are* ways to avoid the races, although they are generally not
really even worth the bother - just calling futex_wait() in a loop is
the RightThing(tm) to do.

- other random direct futex users

Quite frankly, these are few. There are valid and real use cases
(the CLONE_CHILD_CLEARTID interface, for example, expects the user to
use a raw futex), but they tend to be low-level enough that they'd
really better know about this. And again, it should be easy enough to
just loop around FUTEX_WAIT and just reading the value if somebody is
worried about this.

Making the documentation talk about the issue, and making the strong
suggestion to make any FUTEX_WAIT style use just always loop and check
the actual value for simplicity and robustness is probably the right
approach.

Linus

2015-02-11 13:15:35

by Jiri Kosina

[permalink] [raw]
Subject: Re: POSIX mutex destruction requirements vs. futexes

On Thu, 27 Nov 2014, Linus Torvalds wrote:

> On Thu, Nov 27, 2014 at 6:27 AM, Torvald Riegel <[email protected]> wrote:
> >
> > Using reference-counting in critical sections to decide when the mutex
> > protecting the critical section can be destroyed has been recently
> > discussed on LKML. For example, something like this is supposed to
> > work:
> >
> > int free = 0;
> >
> > mutex_lock(&s->lock);
> > if (--s->refcount == 0)
> > free = 1
> > mutex_unlock(&s->lock);
> > if (free)
> > kfree(s);
>
> Yeah, this is a nasty case. We've had this bug in the kernel, and only
> allow self-locking data structures with spinlocks (in which the unlock
> operation is guaranteed to release the lock and never touch the data
> structure afterwards in any way - no "unlock fastpath followed by
> still touching it").

BTW, is this even documented anywhere?

I don't think we can easily perform any runtime checks on this potentially
pathological pattern (say, in lockdep), but I think we are clearly not
even properly documenting it anywhere (at least
Documentation/locking/mutex-design.txt doesn't mention it at all).

--
Jiri Kosina
SUSE Labs