2013-04-02 11:00:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> +Reservation type mutexes

> +struct ticket_mutex {

> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,

That's two different names and two different forms of one (for a total
of 3 variants) for the same scheme.

FAIL...

Also, is there anything in CS literature that comes close to this? I'd
think the DBMS people would have something similar with their
transactional systems. What do they call it?


2013-04-02 14:57:10

by Maarten Lankhorst

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

Hey,

Thanks for reviewing.

Op 02-04-13 13:00, Peter Zijlstra schreef:
> On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
>> +Reservation type mutexes
>> +struct ticket_mutex {
>> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
> That's two different names and two different forms of one (for a total
> of 3 variants) for the same scheme.
>
> FAIL...
It's been hard since I haven't seen anything similar in the kernel, I originally went with tickets
since that's what ttm originally called it, and tried to kill as many references as I could
when I noticed ticket mutexes already being taken.

I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_* -> reserve_mutex_*

> On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
>> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
>> + Similar to mutex_reserve_lock, except it won't backoff with
>> -EAGAIN.
>> + This is useful when mutex_reserve_lock failed with -EAGAIN, and you
>> + unreserved all reservation_locks so no deadlock can occur.
>> +
> I don't particularly like these function names, with lock
> implementations the _slow post-fix is typically used for slow path
> implementations, not API type interfaces.
I didn't intend for drivers to use the new calls directly, but rather through a wrapper,
for example by ttm_eu_reserve_buffers in drivers/gpu/drm/ttm/ttm_execbuf_util.c

> Also, is there anything in CS literature that comes close to this? I'd
> think the DBMS people would have something similar with their
> transactional systems. What do they call it?
I didn't study cs, but judging from your phrasing I guess you mean you want me to call it transaction_mutexes instead?

> Head hurts, needs more time to ponder. It would be good if someone else
> (this would probably be you maarten) would also consider this and
> explore
> this 'interesting' problem space :-)
My head too, evil priority stuff!

Hacky but pragmatical workaround for now: use a real mutex around all the reserve_mutex_lock* calls instead of a virtual lock.
It can be unlocked as soon as all locks have been taken, before any actual work is done.

It only slightly kills the point of having a reservation in the first place, but at least it won't break completely -rt completely for now.

~Maarten

2013-04-02 15:56:11

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, Apr 2, 2013 at 1:00 PM, Peter Zijlstra <[email protected]> wrote:
> Also, is there anything in CS literature that comes close to this? I'd
> think the DBMS people would have something similar with their
> transactional systems. What do they call it?

I've looked around a bit and in dbms row-locking land this seems to be
called the wound-wait deadlock avoidance algorithm. It's the same
approach where if you encounter an older ticket (there called
transaction timestamp) you drop all locked rows and retry (or abort)
the transaction. If you encounter a newer ticket when trying to grab a
lock simply do a blocking wait. So ticket/reservation in Maartens
patches is the analog of timestamp/transaction.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-02 16:59:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, 2013-04-02 at 16:57 +0200, Maarten Lankhorst wrote:
> Hey,
>
> Thanks for reviewing.

Only partway through so far :-)

> Op 02-04-13 13:00, Peter Zijlstra schreef:
> > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> >> +Reservation type mutexes
> >> +struct ticket_mutex {
> >> +extern int __must_check _mutex_reserve_lock(struct ticket_mutex *lock,
> > That's two different names and two different forms of one (for a total
> > of 3 variants) for the same scheme.
> >
> > FAIL...

> It's been hard since I haven't seen anything similar in the kernel, I
> originally went with tickets since that's what ttm originally called
> it, and tried to kill as many references as I could when I noticed
> ticket mutexes already being taken.

Ticket mutexes as such don't exist, but we have ticket based spinlock
implementations. It seems a situation ripe for confusion to have two
locking primitives (mutex, spinlock) with similar names (ticket) but
vastly different semantics.

> I'll fix up the ticket_mutex -> reservation_mutex, and mutex_reserve_*
> -> reserve_mutex_*

Do a google for "lock reservation" and observe the results.. its some
scheme where they pre-assign lock ownership to the most likely thread.

> > On Thu, 2013-02-28 at 11:25 +0100, Maarten Lankhorst wrote:
> >> +mutex_reserve_lock_slow and mutex_reserve_lock_intr_slow:
> >> + Similar to mutex_reserve_lock, except it won't backoff with
> >> -EAGAIN.
> >> + This is useful when mutex_reserve_lock failed with -EAGAIN, and you
> >> + unreserved all reservation_locks so no deadlock can occur.
> >> +
> > I don't particularly like these function names, with lock
> > implementations the _slow post-fix is typically used for slow path
> > implementations, not API type interfaces.

> I didn't intend for drivers to use the new calls directly, but rather
> through a wrapper, for example by ttm_eu_reserve_buffers in
> drivers/gpu/drm/ttm/ttm_execbuf_util.c

You're providing a generic interface to the core kernel, other people
will end up using it. Providing a proper API is helpful.

> > Also, is there anything in CS literature that comes close to this? I'd
> > think the DBMS people would have something similar with their
> > transactional systems. What do they call it?

> I didn't study cs, but judging from your phrasing I guess you mean you
> want me to call it transaction_mutexes instead?

Nah, me neither, I just hate reinventing names for something that's
already got a perfectly fine name under which a bunch of people know
it.

See the email from Daniel, apparently its known as wound-wait deadlock
avoidance -- its actually described in the "deadlock" wikipedia
article.

So how about we call the thing something like:

struct ww_mutex; /* wound/wait */

int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
int mutex_wait_lock(struct ww_mutex *); /* does not fail */

Hmm.. thinking about that,.. you only need that second variant because
you don't have a clear lock to wait for the 'older' process to
complete; but having the unconditional wait makes the entire thing
prone to accidents and deadlocks when the 'user' (read your fellow
programmers) make a mistake.

Ideally we'd only have the one primitive that returns -EDEADLK and use
a 'proper' mutex to wait on or somesuch.. let me ponder this a bit more.

> > Head hurts, needs more time to ponder. It would be good if someone else
> > (this would probably be you maarten) would also consider this explore
> > this 'interesting' problem space :-)

> My head too, evil priority stuff!
>
> Hacky but pragmatical workaround for now: use a real mutex around all
> the reserve_mutex_lock* calls instead of a virtual lock. It can be
> unlocked as soon as all locks have been taken, before any actual work
> is done.
>
> It only slightly kills the point of having a reservation in the first
> place, but at least it won't break completely -rt completely for now.

Yeah, global lock, yay :-(

2013-04-02 17:23:25

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <[email protected]> wrote:
>> > Head hurts, needs more time to ponder. It would be good if someone else
>> > (this would probably be you maarten) would also consider this explore
>> > this 'interesting' problem space :-)
>
>> My head too, evil priority stuff!
>>
>> Hacky but pragmatical workaround for now: use a real mutex around all
>> the reserve_mutex_lock* calls instead of a virtual lock. It can be
>> unlocked as soon as all locks have been taken, before any actual work
>> is done.
>>
>> It only slightly kills the point of having a reservation in the first
>> place, but at least it won't break completely -rt completely for now.
>
> Yeah, global lock, yay :-(

We've discussed this quite a bit on irc and came up with a bunch of
other ideas. The global lock is completely transparent to users, since
the lockdep annotations already rely on ticket_init/fini being a
virtual lock. So we can always fall back to that option.

For more fancy approaches we need to consider the aim first - do we
just want to prevent deadlocks through PI or do we aim for bounded
per-reservation_mutex wait block-to-acquire times for the thread with
highest rt-prio.

If it's just the former I think we can get by by piggy-packing on top
of the existing PI mutex code. Only downside is that threads can lock
arbitrary many reservation locks and so we're looking at boosting an
entire tree of processes. Otoh common operations done while holding
such a lock are swapping buffer objects in or waiting for gpu
rendering. And since we can easily queue up a few ms of rendering rt
guarantees are out the window ;-)

If that's not good enough and the global lock not scalable enough we
could try to limit the fan-out by setting a PI-boost flag in the
reservation ticket (in additional to the normal PI boosting for the
reservation mutex). Threads which are boosted in that fashion will get
a -EAGAIN on the next mutex_reserv_lock call, ensuring that the
blocking doesn't spread to further threads. But that requires that we
pass around pointers to tickets instead of values, so lifetime fun
(atm the ticket is on the stack) and probably tons of races in
updating the ticket boost state. I'd like to avoid that until we've
demonstrated a need for it ...

In any way I think that all three approaches should fit into the
proposed interfaces, so we should be able to do something sane here.
But since I have pretty much zero clue about rt I have no idea which
of the first two approaches would be preferable.

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-02 17:31:00

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, Apr 2, 2013 at 6:59 PM, Peter Zijlstra <[email protected]> wrote:
>> > Also, is there anything in CS literature that comes close to this? I'd
>> > think the DBMS people would have something similar with their
>> > transactional systems. What do they call it?
>
>> I didn't study cs, but judging from your phrasing I guess you mean you
>> want me to call it transaction_mutexes instead?
>
> Nah, me neither, I just hate reinventing names for something that's
> already got a perfectly fine name under which a bunch of people know
> it.
>
> See the email from Daniel, apparently its known as wound-wait deadlock
> avoidance -- its actually described in the "deadlock" wikipedia
> article.
>
> So how about we call the thing something like:
>
> struct ww_mutex; /* wound/wait */
>
> int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
> int mutex_wait_lock(struct ww_mutex *); /* does not fail */

I'm not sold on this prefix, since wound-wait is just the
implementation detail of how it detects/handles deadlocks. For users a
really dumb strategy of just doing a mutex trylock and always
returning -EAGAIN if that fails (together with a msleep(rand) in the
slowpath) would have the same interface. Almost at least, we could
ditch the ticket - but the ticket is used as a virtual lock for the
lockdep annotation, so ditching it would also reduce lockdep
usefulness (due to all those trylocks). So in case we ever switch the
deadlock/backoff algo ww_ would be a bit misleading.

Otoh reservation_ is just what it's used for in graphics-land, so not
that much better. I don't really have a good idea for what this is
besides mutexes_with_magic_deadlock_detection_and_backoff. Which is a
bit long.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-04 12:02:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2


I'm sorry, this email ended up quite a bit longer than I had hoped for;
please bear with me.

On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote:
> struct ww_mutex; /* wound/wait */
>
> int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
> int mutex_wait_lock(struct ww_mutex *); /* does not fail */
>
> Hmm.. thinking about that,.. you only need that second variant because
> you don't have a clear lock to wait for the 'older' process to
> complete; but having the unconditional wait makes the entire thing
> prone to accidents and deadlocks when the 'user' (read your fellow
> programmers) make a mistake.
>
> Ideally we'd only have the one primitive that returns -EDEADLK and use
> a 'proper' mutex to wait on or somesuch.. let me ponder this a bit
> more.

So I had me a little ponder..

Its all really about breaking the symmetry (equivalence of both sides)
of the deadlock. Lets start with the simplest AB-BA deadlock:

task-O task-Y
A
B
A <-- blocks on O
B <-- blocks on Y

In this case the wound-wait algorithm says that the older task (O) has
precedence over the younger task (Y) and we'll 'kill' Y to allow
progress for O.

Now clearly we don't really want to kill Y, instead we'll allow its
lock operation to return -EDEADLK.

This would suggest that our 'new' mutex implementation (ww_mutex
henceforth) has owner tracking -- so O acquiring B can find Y to 'kill'
-- and that we have a new wakeup state TASK_DEADLOCK similar to
TASK_WAKEKILL (can we avoid this by using the extra state required
below? I don't think so since we'd have the risk of waking Y while it
would be blocked on a !ww_mutex)

Now this gets a little more interesting if we change the scenario a
little:

task-O task-Y
A
B
B <-- blocks on Y
* <-- could be A

In this case when O blocks Y isn't actually blocked, so our
TASK_DEADLOCK wakeup doesn't actually achieve anything.

This means we also have to track (task) state so that once Y tries to
acquire A (creating the actual deadlock) we'll not wait so our
TASK_DEADLOCK wakeup doesn't actually achieve anything.

Note that Y doesn't need to acquire A in order to return -EDEADLK, any
acquisition from the same set (see below) would return -EDEADLK even if
there isn't an actual deadlock. This is the cost of heuristic; we could
walk the actual block graph but that would be prohibitively expensive
since we'd have to do this on every acquire.

This raises the point of what would happen if Y were to acquire a
'regular' mutex in between the series of ww_mutexes. In this case we'd
clearly have an actual deadlock scenario. However lockdep would catch
this for we'd clearly invert the lock class order:

ww_mutex -> other -> ww_mutex

For the more complex scenarios there's the case of multiple waiters. In
this case its desirable to order the waiters in age order so that we
don't introduce age inversion -- this minimizes the amount of -EDEADLK
occurrences, but also allows doing away with the unconditional wait.

With the lock queue ordering we can simply immediately re-try the lock
acquisition sequence. Since the older task is already queued it will
obtain the lock, even if we re-queue ourselves before the lock is
assigned.

Now the one thing I've so far 'ignored' is how to assign age. Forward
progress guarantees demand that the age doesn't get reset on retries;
doing so would mean you're always pushing yourself fwd, decreasing your
'priority', never getting to be the most eligible. However it also
means that we should (re)set the time every time we start a 'new'
acquisition sequence. If we'd use a static (per task) age the task with
the lowest age would be able to 'starve' all other.

This means we need means to mark the start of an acquisition sequence;
one that is not included in the restart loop.

Having a start suggests having an end marker too, and indeed we can
find a reason for having one; suppose our second scenario above, where
Y has already acquired all locks it needs to proceed. In this case we
would have marked Y to fail on another acquisition. This doesn't seem
like a problem until you consider the case where we nest different
mm_mutex sets. In this case Y would -EDEADLK on the first of the second
(nested) set, even though re-trying that set is pointless, O doesn't
care.

Furthermore, considering the first scenario, O could block on a lock of
the first set when Y is blocked on a lock of the second (nested) set;
we need means of discerning the different lock sets. This cannot be
done by local means, that is, any context created by the start/end
markers would be local to that task and would not be recognizable by
another as to belong to the same group.

Instead we'd need to create a set-class, much like lockdep has and
somehow ensure all locks in a set are of that class.

One thing I'm not entirely sure of is the strict need for a local
context it could be used to track the set-class, but I'm not entirely
sure we need that to clean up state at the end marker. I haven't been
creative enough to construct a fail case where both O and Y nest and a
pending EDEADLK state could cross the end marker wrongly.

However, having a local context, even if empty, does put a few
constraints on the API usage, which as per below, is a good thing.

To recap, we now have:

struct ww_mutex_key { };
struct ww_mutex;
struct ww_mutex_ctx;

void __ww_mutex_init(struct ww_mutex *, void *);

#define ww_mutex_init(ww_mutex) \
do { \
static struct ww_mutex_key __key; \
__ww_mutex_init(ww_mutex, &__key); \
} while (0)

void ww_mutex_acquire_start(struct ww_mutex_ctx *);
void ww_mutex_acquire_end (struct ww_mutex_ctx *);

int ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *);
void ww_mutex_unlock(struct ww_mutex *);

Which are to be used like:

int bufs_lock(bufs)
{
struct ww_mutex_ctx ww_ctx;

ww_mutex_acquire_start(&ww_ctx);
retry:
for-all-buffers(buf, bufs) {
err = ww_mutex_lock(&ww_ctx, &buf->lock);
if (err)
goto error;
}

ww_mutex_acquire_end(&ww_ctx);

return 0; /* success */

error:
for-all-buffers(buf2, bufs) {
if (buf2 == buf)
break;
ww_mutex_unlock(&buf->lock);
}
if (err == -EDEADLK)
goto again;

return err; /* other error */
}

void bufs_unlock(bufs)
{
for-all-buffers(buf, bufs)
ww_mutex_unlock(&buf->lock);
}

This API has a few problems as per Rusty's API guidelines, it is fairly
trivial to use it wrong; mostly the placement of the start and end
markers are easy to get wrong, but of course one can get all kinds of
badness from doing the retry wrong. At least having the local context
forces one to use the start/end markers.

The only remaining 'problem' left is RT, however the above suggests a
very clean solution; change the lock queueing order to first sort on
priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE
dynamic priority) and secondly on the age.

Note that (re)setting the age at every acquisition set is really only a
means to provide FIFO fairness between tasks, RT tasks don't strictly
require this, they have their own ordering.

With all that this locking scheme should be deterministic; and in cases
where the older task would block (the young task has passed the end
marker) we can apply PI and push Y's priority.


Did I miss something?

Anyway, I haven't actually looked at the details of your implementation
yet, I'll get to that next, but I think something like the above should
be what we want to aim for.

*phew* thanks for reading this far ! :-)

2013-04-04 13:28:32

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 04, 2013 at 02:01:48PM +0200, Peter Zijlstra wrote:
>
> I'm sorry, this email ended up quite a bit longer than I had hoped for;
> please bear with me.
>
> On Tue, 2013-04-02 at 18:59 +0200, Peter Zijlstra wrote:
> > struct ww_mutex; /* wound/wait */
> >
> > int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
> > int mutex_wait_lock(struct ww_mutex *); /* does not fail */
> >
> > Hmm.. thinking about that,.. you only need that second variant because
> > you don't have a clear lock to wait for the 'older' process to
> > complete; but having the unconditional wait makes the entire thing
> > prone to accidents and deadlocks when the 'user' (read your fellow
> > programmers) make a mistake.
> >
> > Ideally we'd only have the one primitive that returns -EDEADLK and use
> > a 'proper' mutex to wait on or somesuch.. let me ponder this a bit
> > more.
>
> So I had me a little ponder..

I need to chew some more through this, but figured some early comments to
clarify a few things would be good.

> Its all really about breaking the symmetry (equivalence of both sides)
> of the deadlock. Lets start with the simplest AB-BA deadlock:
>
> task-O task-Y
> A
> B
> A <-- blocks on O
> B <-- blocks on Y
>
> In this case the wound-wait algorithm says that the older task (O) has
> precedence over the younger task (Y) and we'll 'kill' Y to allow
> progress for O.
>
> Now clearly we don't really want to kill Y, instead we'll allow its
> lock operation to return -EDEADLK.
>
> This would suggest that our 'new' mutex implementation (ww_mutex
> henceforth) has owner tracking -- so O acquiring B can find Y to 'kill'
> -- and that we have a new wakeup state TASK_DEADLOCK similar to
> TASK_WAKEKILL (can we avoid this by using the extra state required
> below? I don't think so since we'd have the risk of waking Y while it
> would be blocked on a !ww_mutex)

We do add some form of owner tracking by storing the reservation ticket
of the current holder into every ww_mutex. So when task-Y in your above
example tries to acquire lock A it notices that it's behind in the global
queue and immedieately returns -EAGAIN to indicate the deadlock.

Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you try to
reserve the same buffer twice - it can easily detect that by comparing the
lock owner ticket with the provided one and if it matches bail out. The
special error code is useful since userspace could prepare a gpu
batch command submission which lists a buffer twice, which is a userspace
bug with the current drm/gem apis. But since userspace usually doesn't try
to trick the kernel it's better to detect this lazily, and due to the
retry logic we have all the relevant error bail-out code anyway.

Hence we've kept the special -EDEADLCK semantics even for the ww_mutex
stuff.

> Now this gets a little more interesting if we change the scenario a
> little:
>
> task-O task-Y
> A
> B
> B <-- blocks on Y
> * <-- could be A

The current code at least simple blocks task-O on B until task-Y unlocks
B. Deadlocks cannot happen since if task-Y ever tries to acquire a lock
which is held by an older task (e.g. lock A) it will bail out with
-EAGAIN.

> In this case when O blocks Y isn't actually blocked, so our
> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>
> This means we also have to track (task) state so that once Y tries to
> acquire A (creating the actual deadlock) we'll not wait so our
> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>
> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
> acquisition from the same set (see below) would return -EDEADLK even if
> there isn't an actual deadlock. This is the cost of heuristic; we could
> walk the actual block graph but that would be prohibitively expensive
> since we'd have to do this on every acquire.

Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
times of older task. This could be interesting for RT, but I'm unsure of
the implications. The trick with the current code is that the oldest task
will never see an -EAGAIN ever and hence is guaranteed to make forward
progress. If the task is really unlucky though it might be forced to wait
for a younger task for every ww_mutex it tries to acquire.

The thing is now that you're not expected to hold these locks for a long
time - if you need to synchronously stall while holding a lock performance
will go down the gutters anyway. And since most current gpus/co-processors
still can't really preempt fairness isn't that high a priority, either.
So we didn't think too much about that.

> This raises the point of what would happen if Y were to acquire a
> 'regular' mutex in between the series of ww_mutexes. In this case we'd
> clearly have an actual deadlock scenario. However lockdep would catch
> this for we'd clearly invert the lock class order:
>
> ww_mutex -> other -> ww_mutex
>
> For the more complex scenarios there's the case of multiple waiters. In
> this case its desirable to order the waiters in age order so that we
> don't introduce age inversion -- this minimizes the amount of -EDEADLK
> occurrences, but also allows doing away with the unconditional wait.
>
> With the lock queue ordering we can simply immediately re-try the lock
> acquisition sequence. Since the older task is already queued it will
> obtain the lock, even if we re-queue ourselves before the lock is
> assigned.
>
> Now the one thing I've so far 'ignored' is how to assign age. Forward
> progress guarantees demand that the age doesn't get reset on retries;
> doing so would mean you're always pushing yourself fwd, decreasing your
> 'priority', never getting to be the most eligible. However it also
> means that we should (re)set the time every time we start a 'new'
> acquisition sequence. If we'd use a static (per task) age the task with
> the lowest age would be able to 'starve' all other.
>
> This means we need means to mark the start of an acquisition sequence;
> one that is not included in the restart loop.
>
> Having a start suggests having an end marker too, and indeed we can
> find a reason for having one; suppose our second scenario above, where
> Y has already acquired all locks it needs to proceed. In this case we
> would have marked Y to fail on another acquisition. This doesn't seem
> like a problem until you consider the case where we nest different
> mm_mutex sets. In this case Y would -EDEADLK on the first of the second
> (nested) set, even though re-trying that set is pointless, O doesn't
> care.

Another big reason for having a start/end marker like you've describe is
lockdep support. Lockdep is oblivious to the magic deadlock avoidance, so
when trying to just annotate the ww_mutexes themselves you can only
annotate them as trylocks (which in a way they are). But that completely
breaks lockdep warnings for the above mentioned

> ww_mutex -> other -> ww_mutex

loop. So we don't want that. The trick Maarten's patches employ is to add
a fake/virtual lockdep lock for the reservation ticket/age itself, which
is acquired in start and dropped again in end. All the ww_mutex locking is
then annotated as blocking locks nested within that virtual mutex (this is
already used in mm_take_all_locks). Maarten added a few patches to ensure
that lockdep properly checks this nesting, too:

commit d094595078d00b63839d0c5ccb8b184ef242cb45
Author: Maarten Lankhorst <[email protected]>
Date: Thu Sep 13 11:39:51 2012 +0200

lockdep: Check if nested lock is actually held

> Furthermore, considering the first scenario, O could block on a lock of
> the first set when Y is blocked on a lock of the second (nested) set;
> we need means of discerning the different lock sets. This cannot be
> done by local means, that is, any context created by the start/end
> markers would be local to that task and would not be recognizable by
> another as to belong to the same group.
>
> Instead we'd need to create a set-class, much like lockdep has and
> somehow ensure all locks in a set are of that class.

I'm a bit confused about the different classes you're talking about. Since
the ticket queue is currently a global counter there's only one class of
ww_mutexes. I guess we could change that once a second user shows up, but
currently with dma-bufs we _want_ all reservartion mutexes to be in the
same class. And if you nest reservation mutexe loops (i.e. nest calls to
acquire_start in your example below) we want that to fail. So currently we
only have one lockdep class for the mutexes plus one class for the virtual
lock everything nests in.

> One thing I'm not entirely sure of is the strict need for a local
> context it could be used to track the set-class, but I'm not entirely
> sure we need that to clean up state at the end marker. I haven't been
> creative enough to construct a fail case where both O and Y nest and a
> pending EDEADLK state could cross the end marker wrongly.
>
> However, having a local context, even if empty, does put a few
> constraints on the API usage, which as per below, is a good thing.
>
> To recap, we now have:
>
> struct ww_mutex_key { };
> struct ww_mutex;
> struct ww_mutex_ctx;
>
> void __ww_mutex_init(struct ww_mutex *, void *);
>
> #define ww_mutex_init(ww_mutex) \
> do { \
> static struct ww_mutex_key __key; \
> __ww_mutex_init(ww_mutex, &__key); \
> } while (0)
>
> void ww_mutex_acquire_start(struct ww_mutex_ctx *);
> void ww_mutex_acquire_end (struct ww_mutex_ctx *);
>
> int ww_mutex_lock(struct ww_mutex_ctx *, struct ww_mutex *);
> void ww_mutex_unlock(struct ww_mutex *);
>
> Which are to be used like:
>
> int bufs_lock(bufs)
> {
> struct ww_mutex_ctx ww_ctx;
>
> ww_mutex_acquire_start(&ww_ctx);
> retry:
> for-all-buffers(buf, bufs) {
> err = ww_mutex_lock(&ww_ctx, &buf->lock);
> if (err)
> goto error;
> }
>
> ww_mutex_acquire_end(&ww_ctx);
>
> return 0; /* success */
>
> error:
> for-all-buffers(buf2, bufs) {
> if (buf2 == buf)
> break;
> ww_mutex_unlock(&buf->lock);
> }
> if (err == -EDEADLK)
> goto again;
>
> return err; /* other error */
> }
>
> void bufs_unlock(bufs)
> {
> for-all-buffers(buf, bufs)
> ww_mutex_unlock(&buf->lock);
> }
>
> This API has a few problems as per Rusty's API guidelines, it is fairly
> trivial to use it wrong; mostly the placement of the start and end
> markers are easy to get wrong, but of course one can get all kinds of
> badness from doing the retry wrong. At least having the local context
> forces one to use the start/end markers.

I share your concerns about the api, it's way too easy to get wrong. But
with the lockdep nesting checks we should at least be covered for the
ordering in the fastpath. For the error paths themselves I've just thought
about ranodm -EAGAIN injection as a debug option. With that we should at
least be able to cover these paths sufficiently in regression test suites.

Would be a new patch for Maarten to write though ;-)

> The only remaining 'problem' left is RT, however the above suggests a
> very clean solution; change the lock queueing order to first sort on
> priority (be it the SCHED_FIFO/RR static priority or SCHED_DEADLINE
> dynamic priority) and secondly on the age.
>
> Note that (re)setting the age at every acquisition set is really only a
> means to provide FIFO fairness between tasks, RT tasks don't strictly
> require this, they have their own ordering.
>
> With all that this locking scheme should be deterministic; and in cases
> where the older task would block (the young task has passed the end
> marker) we can apply PI and push Y's priority.

We've discussed this approach of using (rt-prio, age) instead of just age
to determine the the "oldness" of a task for deadlock-breaking with
-EAGAIN. The problem is that through PI boosting or normal rt-prio changes
while tasks are trying to acquire ww_mutexes you can create acyclic loops
in the blocked-on graph of ww_mutexes after the fact and so cause
deadlocks. So we've convinced ourselves that this approche doesn't work.

The only somewhat similar approache we couldn't poke holes into is to set
a PI-boost flag on the reservation ticket (your ww_ctx) of the task a
higher-prio RT thread is blocking on. Any task who has set that flag on
its ww_ctx would then unconditionally fail with -EAGAIN on the next
ww_mutex_lock. See my other mail from yesterday for some of the ideas
we've discussed about RT-compliant ww_mutexes on irc.

> Did I miss something?
>
> Anyway, I haven't actually looked at the details of your implementation
> yet, I'll get to that next, but I think something like the above should
> be what we want to aim for.
>
> *phew* thanks for reading this far ! :-)

Well, it was a good read and I'm rather happy that we agree on the ww_ctx
thing (whatever it's called in the end), even though we have slightly
different reasons for it.

I don't really have a useful idea to make the retry handling for users
more rusty-compliant though, and I'm still unhappy with all current naming
proposals ;-)

Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-04 16:33:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> We do add some form of owner tracking by storing the reservation
> ticket
> of the current holder into every ww_mutex. So when task-Y in your
> above
> example tries to acquire lock A it notices that it's behind in the
> global
> queue and immedieately returns -EAGAIN to indicate the deadlock.
>
> Aside, there's a bit of fun in that ttm uses -EDEADLCK for when you
> try to
> reserve the same buffer twice - it can easily detect that by comparing
> the
> lock owner ticket with the provided one and if it matches bail out.

Sure, but this should bear no influence on the design of the mutex
primitive. At most we should ensure this situation is recognizable.

> Hence we've kept the special -EDEADLCK semantics even for the ww_mutex
> stuff.

There's EDEADLK and EDEADLOCK, EDEADLCK does alas not exist :-)

> > Now this gets a little more interesting if we change the scenario a
> > little:
> >
> > task-O task-Y
> > A
> > B
> > B <-- blocks on Y
> > * <-- could be A
>
> The current code at least simple blocks task-O on B until task-Y
> unlocks
> B. Deadlocks cannot happen since if task-Y ever tries to acquire a
> lock
> which is held by an older task (e.g. lock A) it will bail out with
> -EAGAIN.

Agreed, O would have to wait until Y unlocks B. It was a 'detail' I
skimped over for the sake of brevity. I was merely trying to illustrate
the ways in which we must bring Y to return -EDEADLK (or -EAGAIN in
your version).

2013-04-04 16:38:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
> wait
> times of older task.

No, imagine the following:

struct ww_mutex A, B;
struct mutex C;

task-O task-Y task-X
A
B
C
C
B

At this point O finds that Y owns B and thus we want to make Y 'yield'
B to make allow B progress. Since Y is blocked, we'll send a wakeup.
However Y is blocked on a different locking primitive; one that doesn't
collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
succeed.


2013-04-04 16:39:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> The trick with the current code is that the oldest task
> will never see an -EAGAIN ever and hence is guaranteed to make forward
> progress. If the task is really unlucky though it might be forced to
> wait
> for a younger task for every ww_mutex it tries to acquire.

Agreed on that.. while I didn't state this my proposed thing should
behave the same. It follows from the symmetry breaking in that only
younger tasks can get 'kill'ed.

2013-04-04 16:41:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> The thing is now that you're not expected to hold these locks for a
> long
> time - if you need to synchronously stall while holding a lock
> performance
> will go down the gutters anyway. And since most current
> gpus/co-processors
> still can't really preempt fairness isn't that high a priority,
> either.
> So we didn't think too much about that.

Yeah but you're proposing a new synchronization primitive for the core
kernel.. all such 'fun' details need to be considered, not only those
few that bear on the one usecase.

2013-04-04 16:43:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Another big reason for having a start/end marker like you've describe
> is
> lockdep support.

Yeah, I saw how you did that.. but there's other ways of making it work
too, you could for instance create a new validation state for this type
of lock.

That said, I didn't consider lockdep too much, I first want the regular
semantics clear.

2013-04-04 16:46:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> I'm a bit confused about the different classes you're talking about.
> Since
> the ticket queue is currently a global counter there's only one class
> of
> ww_mutexes.

Right, so that's not something that's going to fly.. we need to support
multiple users, including nesting of those. Again, you're adding
something to the generic kernel infrastructure, it had better be usable
:-)

> I guess we could change that once a second user shows up

No, we fix that before it all goes in. I would _so_ hate to find out it
cannot be 'fixed' and be stuck with a half-arsed sync primitive in the
core kernel that's only every usable by the one existent user.

So for now, forget all about TTM, DMA-BUF and other such coolness
except to verify that whatever we end up with does indeed work for the
case you need it for ;-)

2013-04-04 16:49:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> We've discussed this approach of using (rt-prio, age) instead of just
> age
> to determine the the "oldness" of a task for deadlock-breaking with
> -EAGAIN. The problem is that through PI boosting or normal rt-prio
> changes
> while tasks are trying to acquire ww_mutexes you can create acyclic
> loops
> in the blocked-on graph of ww_mutexes after the fact and so cause
> deadlocks. So we've convinced ourselves that this approche doesn't
> work.

Could you pretty please draw me a picture, I'm having trouble seeing
what you mean.

AFAICT as long as you boost Y while its the lock owner things should
work out, no?


2013-04-04 16:54:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> Well, it was a good read and I'm rather happy that we agree on the
> ww_ctx
> thing (whatever it's called in the end), even though we have slightly
> different reasons for it.

Yeah, I tried various weirdness to get out from under it, but the whole
progress/fairness thing made it rather hard. Ideally you'd be able to
use some existing scheduler state since its the same goal, but the
whole wakeup-retry muck makes that hard.

> I don't really have a useful idea to make the retry handling for users
> more rusty-compliant though, and I'm still unhappy with all current
> naming
> proposals ;-)

Ah, naming,.. yeah I'm not too terribly attached to most of them. I
just want to avoid something that's reasonably well known to mean
something different.

Furthermore, since we use the wound/wait symmetry breaking it would
make sense for that to appear somewhere in the name.


2013-04-04 16:57:03

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <[email protected]> wrote:
>> In this case when O blocks Y isn't actually blocked, so our
>> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>>
>> This means we also have to track (task) state so that once Y tries to
>> acquire A (creating the actual deadlock) we'll not wait so our
>> TASK_DEADLOCK wakeup doesn't actually achieve anything.
>>
>> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
>> acquisition from the same set (see below) would return -EDEADLK even if
>> there isn't an actual deadlock. This is the cost of heuristic; we could
>> walk the actual block graph but that would be prohibitively expensive
>> since we'd have to do this on every acquire.
>
> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
> times of older task. This could be interesting for RT, but I'm unsure of
> the implications. The trick with the current code is that the oldest task
> will never see an -EAGAIN ever and hence is guaranteed to make forward
> progress. If the task is really unlucky though it might be forced to wait
> for a younger task for every ww_mutex it tries to acquire.

[Aside: I'm writing this while your replies trickle in, but I think
it's not yet answered already.]

Ok, I've discussed this a lot with Maarten on irc and I think I see a
bit clearer now what's the aim with the new sleep state. Or at least I
have an illusion about it ;-) So let me try to recap my understanding
to check whether we're talking roughly about the same idea.

I think for starters we need to have a slightly more interesting example:

3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
is in between.
2 ww_mutexes: A, B

Y has already acquired ww_mutex A, M has already acquired ww_mutex B.

Now O wants to acquire B and M wants to acquire A (let's ignore
detailed ordering for now), resulting in O blocking on M (M holds B
already, but O is older) and M blocking on Y (same for lock B).

Now first question to check my understanding: Your aim with that
special wakeup is to kick M so that it backs off and drops B? That way
O does not need to wait for Y to complete whatever it's currently
doing, unlock A and then in turn M to complete whatever it's doing so
that it can unlock A&B and finally allows O to grab the lock.

Presuming I'm still following we should be able to fix this with the
new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
(let's call it PF_GTFO for simplicity). Then every time a task does a
blocking wait on a ww_mutex it would set this special sleep state and
also check the PF_GTFO bit. If the later is set, it bails out with
-EAGAIN (so that all locks are dropped).

Now if a task wants to take a lock and notices that it's held by a
younger locker it can set that flag and wake the thread up (need to
think about all the races a bit, but we should be able to make this
work). Then it can do the normal blocking mutex slowpath and wait for
the unlock.

Now if O and M race a bit against each another M should either get
woken (if it's already blocked on Y) and back off, or notice that the
thread flag is set before it even tries to grab another mutex (and so
before the block tree can extend further to Y). And the special sleep
state is to make sure we don't cause any other spurious interrupts.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-04 16:59:14

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 4, 2013 at 6:38 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
>> wait
>> times of older task.
>
> No, imagine the following:
>
> struct ww_mutex A, B;
> struct mutex C;
>
> task-O task-Y task-X
> A
> B
> C
> C
> B
>
> At this point O finds that Y owns B and thus we want to make Y 'yield'
> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
> However Y is blocked on a different locking primitive; one that doesn't
> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
> succeed.

Yeah, I've thought about this some more and the special sleep state
seems to be only required to ensure we don't cause spurious wakeups
for other any other reasons a task blocks. But I think we can use that
kick-current-holder approach to ensure that older tasks get the lock
in a more timely fashion than the current code. I've tried to detail
it a bit with another 3 task example - that seems to be the point
where the fun starts ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-04 20:44:34

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 4, 2013 at 6:49 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> We've discussed this approach of using (rt-prio, age) instead of just
>> age
>> to determine the the "oldness" of a task for deadlock-breaking with
>> -EAGAIN. The problem is that through PI boosting or normal rt-prio
>> changes
>> while tasks are trying to acquire ww_mutexes you can create acyclic
>> loops
>> in the blocked-on graph of ww_mutexes after the fact and so cause
>> deadlocks. So we've convinced ourselves that this approche doesn't
>> work.
>
> Could you pretty please draw me a picture, I'm having trouble seeing
> what you mean.
>
> AFAICT as long as you boost Y while its the lock owner things should
> work out, no?

So this one here took a bit of pondering. But I think you'll like the
conclusion.

Now the deadlock detection works because it impose a dag onto all
lockers which clearly denotes who'll wait and who will get wounded.
The problem with using (rt_prio, ww_age/ticket) instead of just the
ticket si that it's ambigous in the face of PI boosting. E.g.
- rt_prio of A > rt_prio of B, but
- task A is younger than task B

Before boosting task A wins, but after boosting (when only the age
matters) task B wins, since it's now older. So now when B comes around
and tries to grab a lock A currently holds, it'll also block.

Now the cool thing with TASK_KILLABLE (hey, I start to appreciate it,
bear with me!) is that this is no problem - it limits the length of
any chain of blocked tasks to just one link of blocked tasks:
- If the length is currently one it cannot be extended - the blocked
task will set the PF_GTFO flag on the current holder, and since that
would be checked before blocking on another ww_mutex the chain can't
grow past one blocked task.
- If the current holder itself is blocked on another ww_mutex it'll be
in TASK_DEADLOCK state and get woken up.

In both case the current holder of the lock will either continue with
what it's been doing (PI-boosted ofc) until it unlocks everything or
hits the PF_GTFO check and backs off from all currently held locks. RT
mission accomplished I think since the wait time for the highest-prio
task for grabbing a lock is bounded by the lock hold time for the
completely uncontended case. And within a given rt-prio the normal
wait/wound algorithm will resolve conflicts (and ensure forward
progress).

So I'm now rather convinced that with the TASK_DEADLOCK implementation
we can make (rt_prio, age) lock ordering work. But it's not an
artifact of consistent PI boosting (I've raced down that alley first
trying to prove it to no avail), but the fact that lock holder kicking
through PF_GTFO naturally limits blocked task chains on ww_mutexes to
just one link. That in turn makes any post-PI-boosted considerations
moot and so can't result in havoc due to a PI-boost having flipped an
edge in the lock_er_ ordering DAG (we sort tasks with wait/wound, not
locks, hence it's not a lock ordering DAG).

Please poke holes into this argument, I've probably missed a sublety
somewhere ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-08 10:39:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> On Thu, Apr 4, 2013 at 3:31 PM, Daniel Vetter <[email protected]> wrote:
> >> In this case when O blocks Y isn't actually blocked, so our
> >> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> >>
> >> This means we also have to track (task) state so that once Y tries to
> >> acquire A (creating the actual deadlock) we'll not wait so our
> >> TASK_DEADLOCK wakeup doesn't actually achieve anything.
> >>
> >> Note that Y doesn't need to acquire A in order to return -EDEADLK, any
> >> acquisition from the same set (see below) would return -EDEADLK even if
> >> there isn't an actual deadlock. This is the cost of heuristic; we could
> >> walk the actual block graph but that would be prohibitively expensive
> >> since we'd have to do this on every acquire.
> >
> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the wait
> > times of older task. This could be interesting for RT, but I'm unsure of
> > the implications. The trick with the current code is that the oldest task
> > will never see an -EAGAIN ever and hence is guaranteed to make forward
> > progress. If the task is really unlucky though it might be forced to wait
> > for a younger task for every ww_mutex it tries to acquire.
>
> [Aside: I'm writing this while your replies trickle in, but I think
> it's not yet answered already.]
>
> Ok, I've discussed this a lot with Maarten on irc and I think I see a
> bit clearer now what's the aim with the new sleep state. Or at least I
> have an illusion about it ;-) So let me try to recap my understanding
> to check whether we're talking roughly about the same idea.
>
> I think for starters we need to have a slightly more interesting example:
>
> 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
> is in between.
> 2 ww_mutexes: A, B
>
> Y has already acquired ww_mutex A, M has already acquired ww_mutex B.
>
> Now O wants to acquire B and M wants to acquire A (let's ignore
> detailed ordering for now), resulting in O blocking on M (M holds B
> already, but O is older) and M blocking on Y (same for lock B).

drawing the picture for myself:

task-O task-M task-Y
A
B
B
A

> Now first question to check my understanding: Your aim with that
> special wakeup is to kick M so that it backs off and drops B? That way
> O does not need to wait for Y to complete whatever it's currently
> doing, unlock A and then in turn M to complete whatever it's doing so
> that it can unlock A&B and finally allows O to grab the lock.

No, we always need to wait for locks to be unlocked. The sole purpose
of the special wakeups state is to not wake other (!ww_mutex) locks
that might be held by the task holding the contended ww_mutex. While
all schedule() sites should deal with spurious wakeups its a sad fact
of life that they do not :/

> Presuming I'm still following we should be able to fix this with the
> new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> (let's call it PF_GTFO for simplicity).

I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
unsuitable since they are not atomic and we'd need to set it from
another thread.

> Then every time a task does a
> blocking wait on a ww_mutex it would set this special sleep state and
> also check the PF_GTFO bit.

So its the contending task (O for B) setting PF_GTFO on the owning task
(M for B), right?

But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK
sleep state added.

> If the later is set, it bails out with
> -EAGAIN (so that all locks are dropped).

I would really rather see -EDEADLK for that..

> Now if a task wants to take a lock and notices that it's held by a
> younger locker it can set that flag and wake the thread up (need to
> think about all the races a bit, but we should be able to make this
> work). Then it can do the normal blocking mutex slowpath and wait for
> the unlock.

Right.

> Now if O and M race a bit against each another M should either get
> woken (if it's already blocked on Y) and back off, or notice that the
> thread flag is set before it even tries to grab another mutex

ww_mutex, it should block just fine on regular mutexes and other
primitives.

> (and so
> before the block tree can extend further to Y). And the special sleep
> state is to make sure we don't cause any other spurious interrupts.

Right, I think we're understanding one another here.

2013-04-08 11:47:31

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> > Presuming I'm still following we should be able to fix this with the
> > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> > (let's call it PF_GTFO for simplicity).
>
> I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
> unsuitable since they are not atomic and we'd need to set it from
> another thread.

I think the PF_ flag is a non-starter for a different reason, too. To make
different clases of ww_mutexes nest properly, we need to make sure that we
don't dead/live-lock trying to wake up a task holdinga ww_mutex of class
A, while that task is trying to acquire ww_mutexes of class B. Picture:

task 1 task 2 task 3
holds lock B
ticket_A = acquire_start(class A)
ww_mutex_lock(lock A, ticket_A) busy doing something

ticket_B = acquire_start(class B)
ww_mutex_lock(lock B, ticket_B)
-> contention with task 3

ticket_task2 = acquire_start(class A)
ww_mutex_lock(lock A, ticket_task2)
-> contention with task 1

If we go with the PF_GTFO task flag, task 2 will set it on task 1
(handwave locking/atomicity again here ;-). But taks 1 will only back off
from any locks in class B. Or at least I think we should impose that
restriction, since otherwise a ww_mutex acquire loop will leak out badly
abstraction-wise and making nesting pretty much impossible in practice.

But if we really want nesting to work transparently, then task 1 should be
allowed to continue taking locks from class A (after an acquire_end(class
B) call ofc). But then it will have missed the wakeup from task 2, so task
2 blocks for too long and task 1 doesn't back off from further acquiring
locks in class A.

Presuming my reasoning for the rt case isn't broken, this break deadlock
avoidance if we sort by (rt_prio, ticket).

So I think we need to move the PF_GTFO flag to the ticket to make sure
that task1 will notice that there's contention with one of the locks it
holds from class A and that it better gtfo asap (i.e. back off, drop all
currently held locks in class A and go into the slowpath).

But I still need to think the nesting case through a bit more (and draw
some diagrams), so not sure yet. One thing I still need to ponder is how
to best stack tickets (for tasks doing nested ww_mutex locking) and where
all we need a pointer to relevant tickets and what kind of fun this
entails ... I need to ponder this some more.

> > Then every time a task does a
> > blocking wait on a ww_mutex it would set this special sleep state and
> > also check the PF_GTFO bit.
>
> So its the contending task (O for B) setting PF_GTFO on the owning task
> (M for B), right?

Yeah, the idea is that the contending task sets this bit on the current
holder (whether we put that bit into the task or the ticket), so that the
current owner can back off and drop all currently held locks at the next
opportunity (either due to being woken up in TASK_DEADLCOK state or in the
next ww_mutex_lock call).

> But yeah, all ww_mutex sleep states should have the new TASK_DEADLOCK
> sleep state added.
>
> > If the later is set, it bails out with
> > -EAGAIN (so that all locks are dropped).
>
> I would really rather see -EDEADLK for that..

I agree it makes more sense for a general api (outside of ttm), but I
struggle a bit to come up with a good errno for the "you hold this lock
already" case. Best I could do is -EBUSY ...

Hm, looking through errno.h I've just spotted -EALREADY. Seems to be used
all over (not just networking), so I guess we could extend it's meaning a
bit?

> > Now if a task wants to take a lock and notices that it's held by a
> > younger locker it can set that flag and wake the thread up (need to
> > think about all the races a bit, but we should be able to make this
> > work). Then it can do the normal blocking mutex slowpath and wait for
> > the unlock.
>
> Right.
>
> > Now if O and M race a bit against each another M should either get
> > woken (if it's already blocked on Y) and back off, or notice that the
> > thread flag is set before it even tries to grab another mutex
>
> ww_mutex, it should block just fine on regular mutexes and other
> primitives.

Yes, that special behaviour should only apply to ww_mutexes not to any
other locking. I've not been terribly careful here ...
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-09 22:18:36

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, Apr 02, 2013 at 06:59:14PM +0200, Peter Zijlstra wrote:
>
> So how about we call the thing something like:
>
> struct ww_mutex; /* wound/wait */

Reading this I can't help but think of Elmer Fudd saying "Round Robin"
as "Wound Wobin"

-- Steve

>
> int mutex_wound_lock(struct ww_mutex *); /* returns -EDEADLK */
> int mutex_wait_lock(struct ww_mutex *); /* does not fail */
>

2013-04-09 22:27:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
> > wait
> > times of older task.
>
> No, imagine the following:
>
> struct ww_mutex A, B;
> struct mutex C;
>
> task-O task-Y task-X
> A
> B
> C
> C
> B
>
> At this point O finds that Y owns B and thus we want to make Y 'yield'
> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
> However Y is blocked on a different locking primitive; one that doesn't
> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
> succeed.

I'm confused to why the above is a problem. Task-X will eventually
release C, and then Y will release B and O will get to continue. Do we
have to drop them once the owner is blocked? Can't we follow the chain
like the PI code does?

-- Steve

2013-04-09 22:28:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > The thing is now that you're not expected to hold these locks for a
> > long
> > time - if you need to synchronously stall while holding a lock
> > performance
> > will go down the gutters anyway. And since most current
> > gpus/co-processors
> > still can't really preempt fairness isn't that high a priority,
> > either.
> > So we didn't think too much about that.
>
> Yeah but you're proposing a new synchronization primitive for the core
> kernel.. all such 'fun' details need to be considered, not only those
> few that bear on the one usecase.

Which bares the question, what other use cases are there?

-- Steve

2013-04-09 22:42:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Thu, Apr 04, 2013 at 06:56:58PM +0200, Daniel Vetter wrote:
>
> I think for starters we need to have a slightly more interesting example:
>
> 3 threads O, M, Y: O has the oldest ww_age/ticket, Y the youngest, M
> is in between.
> 2 ww_mutexes: A, B
>
> Y has already acquired ww_mutex A, M has already acquired ww_mutex B.
>

Now I probably missed it in the thread somewhere, but what makes task O
the oldest and Y the youngest? Is it the actual time from when the task
was created? What about setting an age as soon as it starts the process
of grabbing one of these locks? And it keeps the age until it
successfully grabs and releases all the locks again. It wont reset if it
had to drop the locks and start over.

Or am I totally not understanding what's going on here? Which could also
be the case ;-)

-- Steve

2013-04-10 07:34:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, 2013-04-09 at 18:42 -0400, Steven Rostedt wrote:
> What about setting an age as soon as it starts the process
> of grabbing one of these locks? And it keeps the age until it
> successfully grabs and releases all the locks again. It wont reset if
> it
> had to drop the locks and start over.


That is indeed the proposed mechanism. It ensures FIFO fairness between
the various threads that try to acquire a set.

2013-04-10 08:27:49

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Wed, Apr 10, 2013 at 12:27 AM, Steven Rostedt <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 06:38:36PM +0200, Peter Zijlstra wrote:
>> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> > Hm, I guess your aim with the TASK_DEADLOCK wakeup is to bound the
>> > wait
>> > times of older task.
>>
>> No, imagine the following:
>>
>> struct ww_mutex A, B;
>> struct mutex C;
>>
>> task-O task-Y task-X
>> A
>> B
>> C
>> C
>> B
>>
>> At this point O finds that Y owns B and thus we want to make Y 'yield'
>> B to make allow B progress. Since Y is blocked, we'll send a wakeup.
>> However Y is blocked on a different locking primitive; one that doesn't
>> collaborate in the -EDEADLK scheme therefore we don't want the wakeup to
>> succeed.
>
> I'm confused to why the above is a problem. Task-X will eventually
> release C, and then Y will release B and O will get to continue. Do we
> have to drop them once the owner is blocked? Can't we follow the chain
> like the PI code does?

Just waiting until every task already holding a lock completes and
unlucks it is indeed a viable solution - it's the currently
implemented algorithm in ttm and Maarten's current patches.

The nice thing with Peter's wakeup idea on top is:
- It bounds blocked times.
- And (at least I think so) it's the key thing making PI boosting
possible without any ugly PI inversion deadlocks happening. See

Message-ID: <CAKMK7uEUdtiDDCRPwpiumkrST6suFY7YuQcPAXR_nJ0XHKzsAw@mail.gmail.com>

for my current reasoning about this (I have not yet managed to poke a
hole into it).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-10 09:30:27

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Tue, Apr 09, 2013 at 06:28:08PM -0400, Steven Rostedt wrote:
> On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> > On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> > > The thing is now that you're not expected to hold these locks for a
> > > long
> > > time - if you need to synchronously stall while holding a lock
> > > performance
> > > will go down the gutters anyway. And since most current
> > > gpus/co-processors
> > > still can't really preempt fairness isn't that high a priority,
> > > either.
> > > So we didn't think too much about that.
> >
> > Yeah but you're proposing a new synchronization primitive for the core
> > kernel.. all such 'fun' details need to be considered, not only those
> > few that bear on the one usecase.
>
> Which bares the question, what other use cases are there?

Tbh I don't see any other either - but I agree with Peter and thinking
things through and making the api a bit more generic seems to help in
clarifying the semantics. Reminds me that I still need to draw a few
diagrams ;-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-10 10:31:14

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Mon, Apr 08, 2013 at 01:50:26PM +0200, Daniel Vetter wrote:
> On Mon, Apr 08, 2013 at 12:39:24PM +0200, Peter Zijlstra wrote:
> > On Thu, 2013-04-04 at 18:56 +0200, Daniel Vetter wrote:
> > > Presuming I'm still following we should be able to fix this with the
> > > new sleep state TASK_DEADLOCK and a flag somewhere in the thread info
> > > (let's call it PF_GTFO for simplicity).
> >
> > I'm reading "Get The F*ck Out" ? I like the name, except PF_flags are
> > unsuitable since they are not atomic and we'd need to set it from
> > another thread.
>
> I think the PF_ flag is a non-starter for a different reason, too. To make
> different clases of ww_mutexes nest properly, we need to make sure that we
> don't dead/live-lock trying to wake up a task holdinga ww_mutex of class
> A, while that task is trying to acquire ww_mutexes of class B. Picture:
>
> task 1 task 2 task 3
> holds lock B
> ticket_A = acquire_start(class A)
> ww_mutex_lock(lock A, ticket_A) busy doing something
>
> ticket_B = acquire_start(class B)
> ww_mutex_lock(lock B, ticket_B)
> -> contention with task 3
>
> ticket_task2 = acquire_start(class A)
> ww_mutex_lock(lock A, ticket_task2)
> -> contention with task 1
>
> If we go with the PF_GTFO task flag, task 2 will set it on task 1
> (handwave locking/atomicity again here ;-). But taks 1 will only back off
> from any locks in class B. Or at least I think we should impose that
> restriction, since otherwise a ww_mutex acquire loop will leak out badly
> abstraction-wise and making nesting pretty much impossible in practice.
>
> But if we really want nesting to work transparently, then task 1 should be
> allowed to continue taking locks from class A (after an acquire_end(class
> B) call ofc). But then it will have missed the wakeup from task 2, so task
> 2 blocks for too long and task 1 doesn't back off from further acquiring
> locks in class A.
>
> Presuming my reasoning for the rt case isn't broken, this break deadlock
> avoidance if we sort by (rt_prio, ticket).
>
> So I think we need to move the PF_GTFO flag to the ticket to make sure
> that task1 will notice that there's contention with one of the locks it
> holds from class A and that it better gtfo asap (i.e. back off, drop all
> currently held locks in class A and go into the slowpath).
>
> But I still need to think the nesting case through a bit more (and draw
> some diagrams), so not sure yet. One thing I still need to ponder is how
> to best stack tickets (for tasks doing nested ww_mutex locking) and where
> all we need a pointer to relevant tickets and what kind of fun this
> entails ... I need to ponder this some more.

Ok, so I think the nesting case should all work if we have a
per-task/ticket flag to signal contention. The point I've mused over a bit
is how to get at both the flag and the corresponding task to do the
wakeup. I think for non-rt the simplest way is to store a pointer to the
ticket in the ww_mutex, and the ticket the holds both the contention flag
and a pointer to the task. Lifetime of that stuff would be protected by
the wait_lock from disappearing untimely, which should allow the
lock/unlock fastpaths to set/clear it non-atomically - only careful places
is in the contented unlock slowpath. So rough api sketch for nesting
ww_mutexes:

struct ww_class {
atomic_t next_ticket;
/* virtual lock to annotate the acquire_start/end sections. */
struct lock_class_key acquire_key;
/* lockdep class of all ww_mutexes in this class */
struct lock_class_key mutex_key;
/* for debug/tracepts */
const char *name
};

DEFINE_WW_CLASS(name) /* ... and all the other usual magic init funcitons */

struct ww_acquire_ctx {
struct task_struct *task; /* invariant */
usinged ticket; /* invariant */
/* atomic is probably overkill, but need to review the resulting
* code with all the barriers first. */
atomic_t contention;
}

void ww_acquire_start(struct ww_acuire_ctx *ctx, struct ww_class *class);
void ww_acquire_end(struct ww_acuire_ctx *ctx);

Originally I've thought we could store the pointer to the current acquire
ctx in task_struct (since we need that handy for PI boosting anyway), but
that makes things a bit ugly with nesting. Having a (somewhat) redundant
pointer to the acquiring taks in the ctx seemed less evil.

struct ww_mutex {
struct mutex base;
struct ww_acquire_ctx *holding_ctx;
}

I've considered whether we should try to pull clever tricks like with
lockdep keys and make the ww_class more implicit (by auto-instantiating it
somewhere and adding a pointer to it in each ww_mutex). But I think we
should not hide this complexity, but instead make it explicit.

All the ww_mutex_(un)lock* variants would then also need to take the
context instead of the ticket. Besides the usual return codes matching the
normal mutex functions we'd have:

-EDEADLK: Deadlock detected (or got kicked by a higher-prio task in RT),
drop all currently held locks (of the same class) and block on this lock
(since it's the contention point) with ww_mutex_lock_slow.

-EALREADY: Lock already held by the same acquire context. We need this to
be able to efficiently yell at evil userspace.

If ww_mutex_lock goes into the slowpath and it's not one of the above
cases (i.e. wait, not wound) then it sets the contention flag
(unconditionally) and wakes up the current holder iff the current holder
is in the TASK_DEADLOCK state.

If ww_mutex_lock gets woken up it checks the contextion flag on the
acquire context. If it is set it backs off and returns -EDEADLK. Obviously
it also needs to check that before blocking, and we might even want to
check unconditionally before attempting any lock operation (to ensure we
get out of the way of higher-prio tasks quicker).

One thing we probably need is to make TASK_DEADLOCK a flag, since we still
want the different interruptible types of blocking. At least i915 goes to
great lengths to make sure that we take every lock which could be held
while waiting for the gpu with interruptible sleeps (or at least
killable).

Did I miss anything?
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-17 19:08:21

by Daniel Vetter

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
>> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
>> > The thing is now that you're not expected to hold these locks for a
>> > long
>> > time - if you need to synchronously stall while holding a lock
>> > performance
>> > will go down the gutters anyway. And since most current
>> > gpus/co-processors
>> > still can't really preempt fairness isn't that high a priority,
>> > either.
>> > So we didn't think too much about that.
>>
>> Yeah but you're proposing a new synchronization primitive for the core
>> kernel.. all such 'fun' details need to be considered, not only those
>> few that bear on the one usecase.
>
> Which bares the question, what other use cases are there?

Just stumbled over one I think: If we have a big graph of connected
things (happens really often for video pipelines). And we want
multiple users to use them in parallel. But sometimes a configuration
change could take way too long and so would unduly stall a 2nd thread
with just a global mutex, then per-object ww_mutexes would be a fit:
You'd start with grabbing all the locks for the objects you want to
change anything with, then grab anything in the graph that you also
need to check. Thanks to loop detection and self-recursion this would
all nicely work out, even for cyclic graphs of objects.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

2013-04-18 17:37:11

by Ville Syrjälä

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mutex: add support for reservation style locks, v2

On Wed, Apr 17, 2013 at 09:08:17PM +0200, Daniel Vetter wrote:
> On Wed, Apr 10, 2013 at 12:28 AM, Steven Rostedt <[email protected]> wrote:
> > On Thu, Apr 04, 2013 at 06:41:02PM +0200, Peter Zijlstra wrote:
> >> On Thu, 2013-04-04 at 15:31 +0200, Daniel Vetter wrote:
> >> > The thing is now that you're not expected to hold these locks for a
> >> > long
> >> > time - if you need to synchronously stall while holding a lock
> >> > performance
> >> > will go down the gutters anyway. And since most current
> >> > gpus/co-processors
> >> > still can't really preempt fairness isn't that high a priority,
> >> > either.
> >> > So we didn't think too much about that.
> >>
> >> Yeah but you're proposing a new synchronization primitive for the core
> >> kernel.. all such 'fun' details need to be considered, not only those
> >> few that bear on the one usecase.
> >
> > Which bares the question, what other use cases are there?
>
> Just stumbled over one I think: If we have a big graph of connected
> things (happens really often for video pipelines). And we want
> multiple users to use them in parallel. But sometimes a configuration
> change could take way too long and so would unduly stall a 2nd thread
> with just a global mutex, then per-object ww_mutexes would be a fit:
> You'd start with grabbing all the locks for the objects you want to
> change anything with, then grab anything in the graph that you also
> need to check. Thanks to loop detection and self-recursion this would
> all nicely work out, even for cyclic graphs of objects.

Indeed, that would make the locking for atomic modeset/page flip very
easy to handle, while still allowing the use of suitable fine grained
locks. I like the idea.

--
Ville Syrj?l?
Intel OTC