Hi all,
I was fuzzing with trinity on a -next kernel with the KASan patchset, and
got what initially appeared to be a rather odd trace:
[ 856.817966] BUG: AddressSanitizer: out of bounds on stack in do_raw_spin_unlock+0x417/0x4f0 at addr ffff8803875c7c42
[ 856.817966] Read of size 2 by task migration/15/140
[ 856.821565] page:ffffea000e1d71c0 count:0 mapcount:0 mapping: (null) index:0x0
[ 856.821565] flags: 0x2efffff80000000()
[ 856.821565] page dumped because: kasan: bad access detected
[ 856.821565] CPU: 15 PID: 140 Comm: migration/15 Not tainted 3.19.0-rc5-next-20150123-sasha-00061-g527ff0d-dirty #1814
[ 856.841712] 0000000000000000 0000000000000000 ffff8804d0447aa0 ffff8804d04479e8
[ 856.843478] ffffffff92eb033b 1ffffd4001c3ae3f ffffea000e1d71c0 ffff8804d0447a88
[ 856.843478] ffffffff81b4c292 ffff8804d0447a58 ffffffff815ef1e1 0000000000000000
[ 856.843478] Call Trace:
[ 856.843478] dump_stack (lib/dump_stack.c:52)
[ 856.843478] kasan_report_error (mm/kasan/report.c:136 mm/kasan/report.c:194)
[ 856.843478] ? __lock_is_held (kernel/locking/lockdep.c:3518)
[ 856.843478] ? sched_ttwu_pending (kernel/sched/core.c:4921)
[ 856.843478] __asan_report_load2_noabort (mm/kasan/report.c:234)
[ 856.843478] ? do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159)
[ 856.843478] do_raw_spin_unlock (./arch/x86/include/asm/spinlock.h:157 kernel/locking/spinlock_debug.c:159)
[ 856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157)
[ 856.843478] _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
[ 856.843478] complete (kernel/sched/completion.c:37)
[ 856.843478] cpu_stop_signal_done (kernel/stop_machine.c:69)
[ 856.843478] cpu_stopper_thread (include/linux/spinlock.h:342 kernel/stop_machine.c:456)
[ 856.843478] ? irq_cpu_stop_queue_work (kernel/stop_machine.c:449)
[ 856.843478] ? do_raw_spin_trylock (kernel/locking/spinlock_debug.c:157)
[ 856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/paravirt.h:809 include/linux/spinlock_api_smp.h:162 kernel/locking/spinlock.c:191)
[ 856.843478] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2554 kernel/locking/lockdep.c:2601)
[ 856.843478] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:163 kernel/locking/spinlock.c:191)
[ 856.843478] smpboot_thread_fn (kernel/smpboot.c:161)
[ 856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105)
[ 856.843478] ? schedule (kernel/sched/core.c:2853)
[ 856.843478] ? __kthread_parkme (./arch/x86/include/asm/current.h:14 kernel/kthread.c:164)
[ 856.843478] ? smpboot_unpark_thread (kernel/smpboot.c:105)
[ 856.843478] kthread (kernel/kthread.c:207)
[ 856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[ 856.843478] ? schedule_tail (./arch/x86/include/asm/preempt.h:95 kernel/sched/core.c:2318)
[ 856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[ 856.843478] ret_from_fork (arch/x86/kernel/entry_64.S:349)
[ 856.843478] ? flush_kthread_work (kernel/kthread.c:176)
[ 856.843478] Memory state around the buggy address:
[ 856.843478] ffff8803875c7b00: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 856.843478] ffff8803875c7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 856.843478] >ffff8803875c7c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1
[ 856.843478] ^
[ 856.843478] ffff8803875c7c80: f1 f1 f1 00 f4 f4 f4 f2 f2 f2 f2 00 00 f4 f4 00
[ 856.843478] ffff8803875c7d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
We see this trace only now because Andrey has recently made KASan able to detect issues
with memory on stack, rather than just kmalloc()ed memory, so the underlying bug has
probably been there for a while.
Initially we couldn't explain it. It was reproducing often, and always on the
spin_unlock_irqrestore() call at the end of the complete() code.
I now have a theory for why it happens:
Thread A                                  Thread B
------------------------------------------------------------------
[Enter function]
DECLARE_COMPLETION_ONSTACK(x)
wait_for_completion(x)
                                          complete(x)
                                          [In complete(x):]
                                          spin_lock_irqsave(&x->wait.lock, flags);
                                          x->done++;
                                          __wake_up_locked(&x->wait, TASK_NORMAL, 1);
[Done waiting, wakes up]
[Exit function]
                                          spin_unlock_irqrestore(&x->wait.lock, flags);
So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
the stack of thread A.
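For anyone not familiar with the pattern, here is a minimal sketch of the kind of
caller that can hit this (hand_off_to_thread_b() is a made-up placeholder for
whatever mechanism passes the completion pointer to the other thread;
DECLARE_COMPLETION_ONSTACK(), wait_for_completion() and complete() are the real APIs):

/* Thread A: the completion lives in this stack frame. */
static int wait_for_thread_b(void)
{
        DECLARE_COMPLETION_ONSTACK(x);

        hand_off_to_thread_b(&x);       /* illustrative helper, not a real API */
        wait_for_completion(&x);        /* may return as soon as B's unlock
                                         * store makes the lock visible as free */
        return 0;                       /* frame (and 'x') is gone after this */
}

/* Thread B: may still be inside complete()'s spin_unlock_irqrestore()
 * on &x->wait.lock when A has already returned and reused the stack. */
static void thread_b_work(struct completion *x)
{
        complete(x);
}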
Thanks,
Sasha
On Wed, 04 Feb 2015 18:24:06 -0500 Sasha Levin <[email protected]> wrote:
> Hi all,
>
> I was fuzzing with trinity on a -next kernel with the KASan patchset, and
> got what initially appeared to be a rather odd trace:
>
> ...
>
>
> I now have a theory for why it happens:
>
> Thread A                                  Thread B
> ------------------------------------------------------------------
>
> [Enter function]
> DECLARE_COMPLETION_ONSTACK(x)
> wait_for_completion(x)
>                                           complete(x)
>                                           [In complete(x):]
>                                           spin_lock_irqsave(&x->wait.lock, flags);
>                                           x->done++;
>                                           __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> [Done waiting, wakes up]
> [Exit function]
>                                           spin_unlock_irqrestore(&x->wait.lock, flags);
>
>
>
> So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
> the stack of thread A.
But wait_for_completion() takes ->wait.lock as well, which should
provide the needed synchronization (__wait_for_common,
do_wait_for_common). I'm not seeing a hole in the logic, but it looks
like there might be one.
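(For reference, the waiter side roughly looks like this - paraphrased and heavily
trimmed from kernel/sched/completion.c of that era, with the action callback,
wait-queue handling and signal/timeout details elided - which is why the locking
looks sufficient on paper: the waiter only returns after it has re-taken and
dropped ->wait.lock itself.)

static long __wait_for_common(struct completion *x, long timeout, int state)
{
        spin_lock_irq(&x->wait.lock);
        while (!x->done) {
                /* queue ourselves on x->wait, drop the lock, sleep ... */
                spin_unlock_irq(&x->wait.lock);
                timeout = schedule_timeout(timeout);
                /* ... and re-take the lock before re-checking ->done */
                spin_lock_irq(&x->wait.lock);
        }
        x->done--;
        spin_unlock_irq(&x->wait.lock);
        return timeout;
}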
On Wed, Feb 4, 2015 at 3:24 PM, Sasha Levin <[email protected]> wrote:
>
> I now have a theory for why it happens:
>
> Thread A                                  Thread B
> ------------------------------------------------------------------
>
> [Enter function]
> DECLARE_COMPLETION_ONSTACK(x)
> wait_for_completion(x)
>                                           complete(x)
>                                           [In complete(x):]
>                                           spin_lock_irqsave(&x->wait.lock, flags);
>                                           x->done++;
>                                           __wake_up_locked(&x->wait, TASK_NORMAL, 1);
> [Done waiting, wakes up]
> [Exit function]
>                                           spin_unlock_irqrestore(&x->wait.lock, flags);
>
> So the spin_unlock_irqrestore() at the end of complete() would proceed to corrupt
> the stack of thread A.
We have indeed had bugs like that, but it *shouldn't* be the case any
more. The hard rule for spinlocks is:
- the last thing *unlock* does is to release the lock with a single
store. After that has happened, it will not touch it, and before that
has happened, nobody else can get the spinlock
which means that you cannot actually get the case you are talking
about (because the "done waiting, wakes up" in thread A happens with
the spinlock held).
That said, as mentioned, we have had bugs in spinlock debugging code
in particular, where the last unlock does other things after it
releases the low-level lock. And you are clearly not using normal
spinlocks, since your error trace has this:
do_raw_spin_unlock+0x417/0x4f0
which means that "do_raw_spin_unlock()" is 1250+ bytes in size. A
"normal" do_raw_spin_unlock() is inlined because it's a single store
operation.
Looks like the debug version of do_raw_spin_unlock() is slightly
larger than that ;) Anyway, apparently that version of
do_raw_spin_unlock isn't just huge, it's also buggy.. Although I
thought we fixed that bug some time ago, and looking at the code it
does
void do_raw_spin_unlock(raw_spinlock_t *lock)
{
        debug_spin_unlock(lock);
        arch_spin_unlock(&lock->raw_lock);
}
so it *looks* ok in the generic code.
So off to look at the arch code...
And looking at the arch version, I think the paravirtualized code is crap.
It does:
        prev = *lock;
        add_smp(&lock->tickets.head, TICKET_LOCK_INC);

        /* add_smp() is a full mb() */

        if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                __ticket_unlock_slowpath(lock, prev);
which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.
As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.
Why did I think we had this bug but already fixed it? Maybe it's one
of those things that Waiman fixed in his long delayed qspinlock
series? Waiman? Or maybe I just remember the fixes where we changed
from a mutex to a spinlock, which fixed a very similar case for
non-paravirtualized cases.
Linus
* Linus Torvalds <[email protected]> wrote:
> [...]
>
> As usual, the paravirt code is a horribly buggy heap of crud.
> Film at 11.
>
> Why did I think we had this bug but already fixed it ? Maybe
> it's one of those things that Waiman fixed in his long delayed
> qspinlock series? Waiman? Or maybe I just remember the fixes
> where we changed from a mutex to a spinlock, which fixed a very
> similar case for non-paravirtualized cases.
We definitely had such a high profile bug in the native case,
years before the qspinlock series came along. I don't remember
the specifics anymore, but maybe it was in the VFS code?
Thanks,
Ingo
On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote:
>
> Why did I think we had this bug but already fixed it ? Maybe it's one
> of those things that Waiman fixed in his long delayed qspinlock
> series? Waiman?
ISTR that that would do the exact same thing, but I need to go look at
the latest paravirt code -- that's the part that we all were still
bothered with.
> Or maybe I just remember the fixes where we changed
> from a mutex to a spinlock, which fixed a very similar case for
> non-paravirtualized cases.
I think we had a non paravirt mutex case last year or so.
On 02/05/2015 04:30 AM, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 04:16:54PM -0800, Linus Torvalds wrote:
>> > Why did I think we had this bug but already fixed it ? Maybe it's one
>> > of those things that Waiman fixed in his long delayed qspinlock
>> > series? Waiman?
> ISTR that that would do the exact same thing, but I need to go look at
> the latest paravirt code -- that's the part that we all were still
> bothered with.
Testing Linus's explanation, I tried simply:
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 7050d86..54454da 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -142,7 +142,7 @@ static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
                 __ticket_unlock_kick(lock, old.tickets.head);
         }
 }
-
+static inline int arch_spin_is_locked(arch_spinlock_t *lock);
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
         if (TICKET_SLOWPATH_FLAG &&
@@ -153,7 +153,7 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
                 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
 
                 /* add_smp() is a full mb() */
-
+                WARN_ON(arch_spin_is_locked(lock));
                 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                         __ticket_unlock_slowpath(lock, prev);
         } else
And the warnings confirmed that the lock is indeed "unlocked" before we finished
arch_spin_unlock()...
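(For context, arch_spin_is_locked() is essentially just the ticket head/tail
comparison - a simplified sketch follows, the exact accessor in the x86 header of
that time may differ - so the warning firing right after the add_smp() means the
lock word already shows a new owner or queued waiter past our release, i.e. the
add_smp() alone handed the lock on.)

static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
        struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);

        /* held whenever a ticket has been handed out beyond the current head */
        return tmp.tail != tmp.head;
}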
Thanks,
Sasha
On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote:
> And looking at the arch version, I think the paravirtualized code is crap.
>
> It does:
>
>         prev = *lock;
>         add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>
>         /* add_smp() is a full mb() */
>
>         if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>                 __ticket_unlock_slowpath(lock, prev);
>
>
> which is *exactly* the kind of things you cannot do with spinlocks,
> because after you've done the "add_smp()" and released the spinlock
> for the fast-path, you can't access the spinlock any more. Exactly
> because a fast-path lock might come in, and release the whole data
> structure.
>
> As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.
Per http://lwn.net/Articles/495597/, which clearly describes the intent
of the slowpath unlocking. Cc'ing Raghavendra.
Thanks,
Davidlohr
On 02/05/2015 03:59 PM, Davidlohr Bueso wrote:
> On Wed, 2015-02-04 at 16:16 -0800, Linus Torvalds wrote:
>> And looking at the arch version, I think the paravirtualized code is crap.
>>
>> It does:
>>
>>         prev = *lock;
>>         add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>>
>>         /* add_smp() is a full mb() */
>>
>>         if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>>                 __ticket_unlock_slowpath(lock, prev);
>>
>>
>> which is *exactly* the kind of things you cannot do with spinlocks,
>> because after you've done the "add_smp()" and released the spinlock
>> for the fast-path, you can't access the spinlock any more. Exactly
>> because a fast-path lock might come in, and release the whole data
>> structure.
>>
>> As usual, the paravirt code is a horribly buggy heap of crud. Film at 11.
>
> Per http://lwn.net/Articles/495597/ which clearly describes the intent
> of the slowpath unlocking. Cc'ing Raghavendra.
Interestingly enough, according to that article this behaviour seems to be
"by design":
"""
This version of the patch uses a locked add to do this, followed by a test
to see if the slowflag is set. The lock prefix acts as a full memory barrier,
so we can be sure that other CPUs will have seen the unlock before we read
the flag
"""
Thanks,
Sasha
On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <[email protected]> wrote:
>
> Interestingly enough, according to that article this behaviour seems to be
> "by design":
Oh, it's definitely by design, it's just that the design looked at
spinlocks without the admittedly very subtle issue of lifetime vs
unlocking.
Spinlocks (and completions) are special - for other locks we have
basically allowed lifetimes to be separate from the lock state, and if
you have a data structure with a mutex in it, you'll have to have some
separate lifetime rule outside of the lock itself. But spinlocks and
completions have their locking state tied into their lifetime.
Completions are so very much by design (because dynamic completions on
the stack is one of the core use cases), and spinlocks do it because
in some cases you cannot sanely avoid it (and one of those cases is
the implementation of completions - they aren't actually first-class
locking primitives of their own, although they actually *used* to be,
originally).
It is possible that the paravirt spinlocks could be saved by:
- moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
- making sure that the "unlock" path never does a *write* to the
possibly stale lock. KASan would still complain about the read, but we
could just say that it's a speculative read - bad form, but not
disastrous.
but right now they are clearly horribly broken. Because the unlock
path doesn't just read the value, it can end up writing to it too.
Looking at the Xen version, for example, the unlock_kick part looks
fine. It just compares the pointer to the lock against its data
structures, and doesn't touch the lock itself. Which is fine.
But right now, the real problem is that "__ticket_unlock_slowpath()"
will clear the TICKET_SLOWPATH_FLAG on a lock that has already been
unlocked - which means that it may actually *change* a byte that has
already been freed. It might not be a spinlock any more, it might have
been reused for something else, and might be anything else, and
clearing TICKET_SLOWPATH_FLAG might be clearing a random bit in a
kernel pointer or other value..
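(To make the offending write concrete, the slowpath helper of that era looked
roughly like this - paraphrased from arch/x86/include/asm/spinlock.h, with
BUILD_BUG_ON and some comments dropped; the cmpxchg is the store into a lock word
we may no longer own:)

static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
                                            arch_spinlock_t old)
{
        arch_spinlock_t new;

        /* perform the unlock on the "before" copy */
        old.tickets.head += TICKET_LOCK_INC;

        /* clear the slowpath flag */
        new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);

        /*
         * If the lock is uncontended, try to clear the flag with a cmpxchg -
         * but that is a *write* to a lock word that the fastpath store above
         * may already have allowed to be freed and reused.
         */
        if (new.tickets.head != new.tickets.tail ||
            cmpxchg(&lock->head_tail, old.head_tail,
                    new.head_tail) != old.head_tail) {
                /* still contended (or raced): wake up an appropriate waiter */
                __ticket_unlock_kick(lock, old.tickets.head);
        }
}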
It is *very* hard to hit, though. And it obviously doesn't happen in
raw hardware, but paravirt people need to think hard about this.
Linus
On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote:
> On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <[email protected]> wrote:
> >
> > Interestingly enough, according to that article this behaviour seems to be
> > "by design":
>
> Oh, it's definitely by design, it's just that the design looked at
> spinlocks without the admittedly very subtle issue of lifetime vs
> unlocking.
>
> Spinlocks (and completions) are special - for other locks we have
> basically allowed lifetimes to be separate from the lock state, and if
> you have a data structure with a mutex in it, you'll have to have some
> separate lifetime rule outside of the lock itself. But spinlocks and
> completions have their locking state tied into their lifetime.
For spinlocks I find this very much a virtue. Tight lifetimes allow the
overall locking logic to be *simple* - keeping people from being "smart"
and bloating up spinlocks. Similarly, I hate how the paravirt
alternative blends in with regular (sane) bare metal code. What was
preventing this instead??
#ifdef CONFIG_PARAVIRT_SPINLOCKS
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        if (!static_key_false(&paravirt_ticketlocks_enabled))
                return;

        add_smp(&lock->tickets.head, TICKET_LOCK_INC);
        /* Do slowpath tail stuff... */
}
#else
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}
#endif
I just don't see the point to all this TICKET_SLOWPATH_FLAG:
#ifdef CONFIG_PARAVIRT_SPINLOCKS
#define __TICKET_LOCK_INC 2
#define TICKET_SLOWPATH_FLAG ((__ticket_t)1)
#else
#define __TICKET_LOCK_INC 1
#define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
#endif
when it is only for paravirt -- and the word slowpath implies the
general steps as part of the generic algorithm. Let's keep code for
simple locks simple.
> Completions are so very much by design (because dynamic completions on
> the stack is one of the core use cases), and spinlocks do it because
> in some cases you cannot sanely avoid it (and one of those cases is
> the implementation of completions - they aren't actually first-class
> locking primitives of their own, although they actually *used* to be,
> originally).
>
> It is possible that the paravirt spinlocks could be saved by:
>
> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
Ouch, to avoid deadlocks they explicitly need the unlock to occur before
the slowpath tail flag is read.
> - making sure that the "unlock" path never does a *write* to the
> possibly stale lock. KASan would still complain about the read, but we
> could just say that it's a speculative read - bad form, but not
> disastrous.
Yeah, you just cannot have a slowpath without reads or writes :D
Thanks,
Davidlohr
On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <[email protected]> wrote:
>>
>> It is possible that the paravirt spinlocks could be saved by:
>>
>> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
>
> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
> the slowpath tail flag is read.
Well, just make the unlock do the actual real unlock operation
("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
can *test* the flag and do whatever kicking it needs to get people
started again, but not clear it again.
Then, the clearing could be done by the lockers. That way the unlock
path at least doesn't change the spinlock any more - since the
spinlock might have been free'd at any time after the actual unlock.
Or something. I'm handwaving. I really dislike all the paravirt crap
we do. It tends to be really ugly, and I'm not just talking about
spinlocks. TLB flushing etc tends to get interesting too.
> Yeah, you just cannot have a slowpath without reads or writes :D
Well, you *could* do the slow-path while holding the lock, and make
the actual lock release be the last part of the operation (and do the
final unlock with a "cmpxchg" that fails and re-does the slowpath if
something changed). Then the slowpath can read and write the lock all
it wants.
But people generally want to avoid that, since it makes the hold times
longer. So the "drop the lock, *then* test if we should do some slow
path fixup" tends to be the preferred model, it's just that that model
is broken due to the lifetime rules. Making it read-only would at
least make it a lot less broken..
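(For illustration only, the shape of such a read-only unlock might be something
like the sketch below - hand-waving in the spirit of the above, not a real patch;
it still does the speculative read of a possibly stale lock, and it leaves
clearing TICKET_SLOWPATH_FLAG entirely to the lock side:)

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        arch_spinlock_t prev;

        prev = *lock;   /* safe: we still own the lock at this point */
        add_smp(&lock->tickets.head, TICKET_LOCK_INC);

        /*
         * From here on the lock may have been freed and reused, so we may
         * only (speculatively) read it - never write it, and in particular
         * never clear TICKET_SLOWPATH_FLAG here.
         */
        if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                __ticket_unlock_kick(lock, prev.tickets.head + TICKET_LOCK_INC);
}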
Linus
On 02/06/2015 04:27 AM, Linus Torvalds wrote:
> On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <[email protected]> wrote:
>>>
>>> It is possible that the paravirt spinlocks could be saved by:
>>>
>>> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath locking code.
>>
>> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
>> the slowpath tail flag is read.
>
> Well, just make the unlock do the actual real unlock operation
> ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
> can *test* the flag and do whatever kicking it needs to get people
> started again, but not clear it again.
>
This is definitely a good idea, will think more on this
(especially since any remote possibility of forgetting to wake up the
lock-waiter would result in an eventual hang of the kvm guest).
Hoping to come up with a solution soon.
/me agreeing with the fact that we did not have the 'lifetime' in mind
during the design :(.
On 02/06/2015 04:07 AM, Davidlohr Bueso wrote:
> On Thu, 2015-02-05 at 13:34 -0800, Linus Torvalds wrote:
>> On Thu, Feb 5, 2015 at 1:02 PM, Sasha Levin <[email protected]> wrote:
>>>
>>> Interestingly enough, according to that article this behaviour seems to be
>>> "by design":
>>
>> Oh, it's definitely by design, it's just that the design looked at
>> spinlocks without the admittedly very subtle issue of lifetime vs
>> unlocking.
>>
>> Spinlocks (and completions) are special - for other locks we have
>> basically allowed lifetimes to be separate from the lock state, and if
>> you have a data structure with a mutex in it, you'll have to have some
>> separate lifetime rule outside of the lock itself. But spinlocks and
>> completions have their locking state tied into their lifetime.
>
> For spinlocks I find this very much a virtue. Tight lifetimes allow the
> overall locking logic to be *simple* - keeping people from being "smart"
> and bloating up spinlocks. Similarly, I hate how the paravirt
> alternative blends in with regular (sane) bare metal code. What was
> preventing this instead??
>
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
>         if (!static_key_false(&paravirt_ticketlocks_enabled))
>                 return;
>
>         add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>         /* Do slowpath tail stuff... */
> }
> #else
> static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
>         __add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
> }
> #endif
>
> I just don't see the point to all this TICKET_SLOWPATH_FLAG:
>
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> #define __TICKET_LOCK_INC 2
> #define TICKET_SLOWPATH_FLAG ((__ticket_t)1)
> #else
> #define __TICKET_LOCK_INC 1
> #define TICKET_SLOWPATH_FLAG ((__ticket_t)0)
> #endif
>
> when it is only for paravirt -- and the word slowpath implies the
> general steps as part of the generic algorithm. Let's keep code for
> simple locks simple.
>
Good point, I will send this as a separate cleanup once I test the
patch I have to correct the current problem.
On 02/06/2015 12:18 PM, Raghavendra K T wrote:
> On 02/06/2015 04:27 AM, Linus Torvalds wrote:
>> On Thu, Feb 5, 2015 at 2:37 PM, Davidlohr Bueso <[email protected]>
>> wrote:
>>>>
>>>> It is possible that the paravirt spinlocks could be saved by:
>>>>
>>>> - moving the clearing of TICKET_SLOWPATH_FLAG into the fastpath
>>>> locking code.
>>>
>>> Ouch, to avoid deadlocks they explicitly need the unlock to occur before
>>> the slowpath tail flag is read.
>>
>> Well, just make the unlock do the actual real unlock operation
>> ("fastpath"), leaving the TICKET_SLOWPATH_FLAG alone. The unlock path
>> can *test* the flag and do whatever kicking it needs to get people
>> started again, but not clear it again.
>>
>
> This is definitely a good idea, will think more on this.
>
Just sent the patch with that idea, also keeping in mind that
head == tail may no longer hold for trylock checking
(the last unlock might have left TICKET_SLOWPATH_FLAG set):
http://article.gmane.org/gmane.linux.kernel/1883900
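(To illustrate that caveat - a hypothetical sketch only, not the code behind the
link above: once an unlock may leave TICKET_SLOWPATH_FLAG behind, any head/tail
comparison has to mask the flag out before deciding whether the lock is actually
held.)

static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
        /*
         * A stale TICKET_SLOWPATH_FLAG left by the last unlock must not make
         * an otherwise free lock look held, so mask it out of both halves.
         */
        return (lock.tickets.head & ~TICKET_SLOWPATH_FLAG) ==
               (lock.tickets.tail & ~TICKET_SLOWPATH_FLAG);
}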