LinuxLists.cc - [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

2015-02-13 06:44:55

Subject: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

Paravirt spinlock clears slowpath flag after doing unlock.
As explained by Linus currently it does:
prev = *lock;
add_smp(&lock->tickets.head, TICKET_LOCK_INC);

/* add_smp() is a full mb() */

if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
__ticket_unlock_slowpath(lock, prev);

which is *exactly* the kind of things you cannot do with spinlocks,
because after you've done the "add_smp()" and released the spinlock
for the fast-path, you can't access the spinlock any more. Exactly
because a fast-path lock might come in, and release the whole data
structure.

Linus suggested that we should not do any writes to lock after unlock(),
and we can move slowpath clearing to fastpath lock.

So this patch implements the fix with:
1. Moving slowpath flag to head (Oleg):
Unlocked locks don't care about the slowpath flag; therefore we can keep
it set after the last unlock, and clear it again on the first (try)lock.
-- this removes the write after unlock. note that keeping slowpath flag would
result in unnecessary kicks.
By moving the slowpath flag from the tail to the head ticket we also avoid
the need to access both the head and tail tickets on unlock.

2. use xadd to avoid read/write after unlock that checks the need for
unlock_kick (Linus):
We further avoid the need for a read-after-release by using xadd;
the prev head value will include the slowpath flag and indicate if we
need to do PV kicking of suspended spinners -- on modern chips xadd
isn't (much) more expensive than an add + load.

Result:
setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
benchmark overcommit %improve
kernbench 1x -0.13
kernbench 2x 0.02
dbench 1x -1.77
dbench 2x -0.63

[Jeremy: hinted missing TICKET_LOCK_INC for kick]
[Oleg: Moving slowpath flag to head, ticket_equals idea]
[PeterZ: Detailed changelog]

Reported-by: Sasha Levin <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---
arch/x86/include/asm/spinlock.h | 91 ++++++++++++++++++++---------------------
arch/x86/kernel/kvm.c | 10 +++--
arch/x86/xen/spinlock.c | 10 +++--
3 files changed, 56 insertions(+), 55 deletions(-)

potential TODO:
* The whole patch be splitted into, 1. move slowpath flag
2. fix memory corruption in completion problem ??

Changes since V3:
- Detailed changelog (PeterZ)
- Replace ACCESS_ONCE with READ_ONCE (oleg)
- Add xen changes (Oleg)
- Correct break logic in unlock_wait() (Oleg)

Changes since V2:
- Move the slowpath flag to head, this enables xadd usage in unlock code
and inturn we can get rid of read/write after unlock (Oleg)
- usage of ticket_equals (Oleg)

Changes since V1:
- Add missing TICKET_LOCK_INC before unlock kick (fixes hang in overcommit: Jeremy).
- Remove SLOWPATH_FLAG clearing in fast lock. (Jeremy)
- clear SLOWPATH_FLAG in arch_spin_value_unlocked during comparison.
Note: The current implementation is still based on avoid writing after unlock.
we could still have potential invalid memory read. (Sasha)

Result:
setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
base = 3_19_rc7

3_19_rc7_spinfix_v3
+-----------+-----------+-----------+------------+-----------+
kernbench (Time taken in sec lower is better)
+-----------+-----------+-----------+------------+-----------+
base %stdev patched %stdev %improve
+-----------+-----------+-----------+------------+-----------+
1x 54.2300 3.0652 54.3008 4.0366 -0.13056
2x 90.1883 5.5509 90.1650 6.4336 0.02583
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
dbench (Throughput higher is better)
+-----------+-----------+-----------+------------+-----------+
base %stdev patched %stdev %improve
+-----------+-----------+-----------+------------+-----------+
1x 7029.9188 2.5952 6905.0712 4.4737 -1.77595
2x 3254.2075 14.8291 3233.7137 26.8784 -0.62976
+-----------+-----------+-----------+------------+-----------+

(here is the result I got from the patches, I believe there may
be some small overhead from xadd etc, but overall looks fine but
a thorough test may be needed)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 625660f..646a1a3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -46,7 +46,7 @@ static __always_inline bool static_key_false(struct static_key *key);

static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
{
- set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
+ set_bit(0, (volatile unsigned long *)&lock->tickets.head);
}

#else /* !CONFIG_PARAVIRT_SPINLOCKS */
@@ -60,10 +60,30 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
}

#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+static inline int __tickets_equal(__ticket_t one, __ticket_t two)
+{
+ return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
+}
+
+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
+ __ticket_t head)
+{
+ if (head & TICKET_SLOWPATH_FLAG) {
+ arch_spinlock_t old, new;
+
+ old.tickets.head = head;
+ new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+ old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+ new.tickets.tail = old.tickets.tail;
+
+ /* try to clear slowpath flag when there are no contenders */
+ cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+ }
+}

static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
- return lock.tickets.head == lock.tickets.tail;
+ return __tickets_equal(lock.tickets.head, lock.tickets.tail);
}

/*
@@ -87,18 +107,21 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
if (likely(inc.head == inc.tail))
goto out;

- inc.tail &= ~TICKET_SLOWPATH_FLAG;
for (;;) {
unsigned count = SPIN_THRESHOLD;

do {
- if (READ_ONCE(lock->tickets.head) == inc.tail)
- goto out;
+ inc.head = READ_ONCE(lock->tickets.head);
+ if (__tickets_equal(inc.head, inc.tail))
+ goto clear_slowpath;
cpu_relax();
} while (--count);
__ticket_lock_spinning(lock, inc.tail);
}
-out: barrier(); /* make sure nothing creeps before the lock is taken */
+clear_slowpath:
+ __ticket_check_and_clear_slowpath(lock, inc.head);
+out:
+ barrier(); /* make sure nothing creeps before the lock is taken */
}

static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
@@ -106,56 +129,30 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock)
arch_spinlock_t old, new;

old.tickets = READ_ONCE(lock->tickets);
- if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
+ if (!__tickets_equal(old.tickets.head, old.tickets.tail))
return 0;

new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+ new.head_tail &= ~TICKET_SLOWPATH_FLAG;

/* cmpxchg is a full barrier, so nothing can move before it */
return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
}

-static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock,
- arch_spinlock_t old)
-{
- arch_spinlock_t new;
-
- BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
-
- /* Perform the unlock on the "before" copy */
- old.tickets.head += TICKET_LOCK_INC;
-
- /* Clear the slowpath flag */
- new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT);
-
- /*
- * If the lock is uncontended, clear the flag - use cmpxchg in
- * case it changes behind our back though.
- */
- if (new.tickets.head != new.tickets.tail ||
- cmpxchg(&lock->head_tail, old.head_tail,
- new.head_tail) != old.head_tail) {
- /*
- * Lock still has someone queued for it, so wake up an
- * appropriate waiter.
- */
- __ticket_unlock_kick(lock, old.tickets.head);
- }
-}
-
static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
if (TICKET_SLOWPATH_FLAG &&
- static_key_false(&paravirt_ticketlocks_enabled)) {
- arch_spinlock_t prev;
+ static_key_false(&paravirt_ticketlocks_enabled)) {
+ __ticket_t head;

- prev = *lock;
- add_smp(&lock->tickets.head, TICKET_LOCK_INC);
+ BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);

- /* add_smp() is a full mb() */
+ head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

- if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
- __ticket_unlock_slowpath(lock, prev);
+ if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
+ head &= ~TICKET_SLOWPATH_FLAG;
+ __ticket_unlock_kick(lock, (head + TICKET_LOCK_INC));
+ }
} else
__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}
@@ -164,7 +161,7 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
{
struct __raw_tickets tmp = READ_ONCE(lock->tickets);

- return tmp.tail != tmp.head;
+ return tmp.tail != (tmp.head & ~TICKET_SLOWPATH_FLAG);
}

static inline int arch_spin_is_contended(arch_spinlock_t *lock)
@@ -183,16 +180,16 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,

static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
{
- __ticket_t head = ACCESS_ONCE(lock->tickets.head);
+ __ticket_t head = READ_ONCE(lock->tickets.head);

for (;;) {
- struct __raw_tickets tmp = ACCESS_ONCE(lock->tickets);
+ struct __raw_tickets tmp = READ_ONCE(lock->tickets);
/*
* We need to check "unlocked" in a loop, tmp.head == head
* can be false positive because of overflow.
*/
- if (tmp.head == (tmp.tail & ~TICKET_SLOWPATH_FLAG) ||
- tmp.head != head)
+ if (__tickets_equal(tmp.head, tmp.tail) ||
+ !__tickets_equal(tmp.head, head))
break;

cpu_relax();
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 94f6434..9c6c8cf 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -609,7 +609,7 @@ static inline void check_zero(void)
u8 ret;
u8 old;

- old = ACCESS_ONCE(zero_stats);
+ old = READ_ONCE(zero_stats);
if (unlikely(old)) {
ret = cmpxchg(&zero_stats, old, 0);
/* This ensures only one fellow resets the stat */
@@ -727,6 +727,7 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
int cpu;
u64 start;
unsigned long flags;
+ __ticket_t head;

if (in_nmi())
return;
@@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
* check again make sure it didn't become free while
* we weren't looking.
*/
- if (ACCESS_ONCE(lock->tickets.head) == want) {
+ head = READ_ONCE(lock->tickets.head);
+ if (__tickets_equal(head, want)) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
@@ -803,8 +805,8 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket)
add_stats(RELEASED_SLOW, 1);
for_each_cpu(cpu, &waiting_cpus) {
const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu);
- if (ACCESS_ONCE(w->lock) == lock &&
- ACCESS_ONCE(w->want) == ticket) {
+ if (READ_ONCE(w->lock) == lock &&
+ READ_ONCE(w->want) == ticket) {
add_stats(RELEASED_SLOW_KICKED, 1);
kvm_kick_cpu(cpu);
break;
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 23b45eb..ed5bb2b 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -41,7 +41,7 @@ static u8 zero_stats;
static inline void check_zero(void)
{
u8 ret;
- u8 old = ACCESS_ONCE(zero_stats);
+ u8 old = READ_ONCE(zero_stats);
if (unlikely(old)) {
ret = cmpxchg(&zero_stats, old, 0);
/* This ensures only one fellow resets the stat */
@@ -112,6 +112,7 @@ __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
struct xen_lock_waiting *w = this_cpu_ptr(&lock_waiting);
int cpu = smp_processor_id();
u64 start;
+ __ticket_t head;
unsigned long flags;

/* If kicker interrupts not initialized yet, just spin */
@@ -163,7 +164,8 @@ __visible void xen_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
* check again make sure it didn't become free while
* we weren't looking
*/
- if (ACCESS_ONCE(lock->tickets.head) == want) {
+ head = READ_ONCE(lock->tickets.head);
+ if (__tickets_equal(head, want)) {
add_stats(TAKEN_SLOW_PICKUP, 1);
goto out;
}
@@ -204,8 +206,8 @@ static void xen_unlock_kick(struct arch_spinlock *lock, __ticket_t next)
const struct xen_lock_waiting *w = &per_cpu(lock_waiting, cpu);

/* Make sure we read lock before want */
- if (ACCESS_ONCE(w->lock) == lock &&
- ACCESS_ONCE(w->want) == next) {
+ if (READ_ONCE(w->lock) == lock &&
+ READ_ONCE(w->want) == next) {
add_stats(RELEASED_SLOW_KICKED, 1);
xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
break;
--
1.7.11.7

2015-02-13 15:35:44

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

On 02/13, Raghavendra K T wrote:
>
> @@ -164,7 +161,7 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
> {
> struct __raw_tickets tmp = READ_ONCE(lock->tickets);
>
> - return tmp.tail != tmp.head;
> + return tmp.tail != (tmp.head & ~TICKET_SLOWPATH_FLAG);
> }

Well, this can probably use __tickets_equal() too. But this is cosmetic.

It seems that arch_spin_is_contended() should be fixed with this change,

(__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC

can be true because of TICKET_SLOWPATH_FLAG in .head, even if it is actually
unlocked. And the "(__ticket_t)" typecast looks unnecessary, it only adds more
confusuin, but this is cosmetic too.

> @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> * check again make sure it didn't become free while
> * we weren't looking.
> */
> - if (ACCESS_ONCE(lock->tickets.head) == want) {
> + head = READ_ONCE(lock->tickets.head);
> + if (__tickets_equal(head, want)) {
> add_stats(TAKEN_SLOW_PICKUP, 1);
> goto out;

This is off-topic, but with or without this change perhaps it makes sense
to add smp_mb__after_atomic(). It is nop on x86, just to make this code
more understandable for those (for me ;) who can never remember even the
x86 rules.

Oleg.

2015-02-13 15:43:16

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

On 02/13, Oleg Nesterov wrote:
>
> On 02/13, Raghavendra K T wrote:
> >
> > @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
> > * check again make sure it didn't become free while
> > * we weren't looking.
> > */
> > - if (ACCESS_ONCE(lock->tickets.head) == want) {
> > + head = READ_ONCE(lock->tickets.head);
> > + if (__tickets_equal(head, want)) {
> > add_stats(TAKEN_SLOW_PICKUP, 1);
> > goto out;
>
> This is off-topic, but with or without this change perhaps it makes sense
> to add smp_mb__after_atomic(). It is nop on x86, just to make this code
> more understandable for those (for me ;) who can never remember even the
> x86 rules.

Not that I think you should do this in v5, so please ignore.

Oleg.

2015-02-15 05:46:42

by Raghavendra K T

[permalink] [raw]

Subject: Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

On 02/13/2015 09:02 PM, Oleg Nesterov wrote:
> On 02/13, Raghavendra K T wrote:
>>
>> @@ -164,7 +161,7 @@ static inline int arch_spin_is_locked(arch_spinlock_t *lock)
>> {
>> struct __raw_tickets tmp = READ_ONCE(lock->tickets);
>>
>> - return tmp.tail != tmp.head;
>> + return tmp.tail != (tmp.head & ~TICKET_SLOWPATH_FLAG);
>> }
>
> Well, this can probably use __tickets_equal() too. But this is cosmetic.

That looks good. Added.

> It seems that arch_spin_is_contended() should be fixed with this change,
>
> (__ticket_t)(tmp.tail - tmp.head) > TICKET_LOCK_INC
>
> can be true because of TICKET_SLOWPATH_FLAG in .head, even if it is actually
> unlocked.

Done.
Hmm! it was because I was still under impression that slowpath bit is
in tail. You are right, situation could lead to positive max and may
report false contention.

And the "(__ticket_t)" typecast looks unnecessary, it only adds more
> confusuin, but this is cosmetic too.
>

Done.

>> @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
>> * check again make sure it didn't become free while
>> * we weren't looking.
>> */
>> - if (ACCESS_ONCE(lock->tickets.head) == want) {
>> + head = READ_ONCE(lock->tickets.head);
>> + if (__tickets_equal(head, want)) {
>> add_stats(TAKEN_SLOW_PICKUP, 1);
>> goto out;
>
> This is off-topic, but with or without this change perhaps it makes sense
> to add smp_mb__after_atomic(). It is nop on x86, just to make this code
> more understandable for those (for me ;) who can never remember even the
> x86 rules.
>

Hope you meant it for add_stat. yes smp_mb__after_atomic() would be
harmless barrier() in x86. Did not add this V5 as yoiu though but this
made me look at slowpath_enter code and added an explicit barrier()
there :).

2015-02-15 16:09:33

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [PATCH V4] x86 spinlock: Fix memory corruption on completing completions

On 02/15, Raghavendra K T wrote:
>
> On 02/13/2015 09:02 PM, Oleg Nesterov wrote:
>
>>> @@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
>>> * check again make sure it didn't become free while
>>> * we weren't looking.
>>> */
>>> - if (ACCESS_ONCE(lock->tickets.head) == want) {
>>> + head = READ_ONCE(lock->tickets.head);
>>> + if (__tickets_equal(head, want)) {
>>> add_stats(TAKEN_SLOW_PICKUP, 1);
>>> goto out;
>>
>> This is off-topic, but with or without this change perhaps it makes sense
>> to add smp_mb__after_atomic(). It is nop on x86, just to make this code
>> more understandable for those (for me ;) who can never remember even the
>> x86 rules.
>
> Hope you meant it for add_stat.

No, no. We need a barrier between set_bit(SLOWPATH) and tickets_equal().

Yes, on x86 set_bit() can't be reordered so smp_mb_*_atomic() is nop, but
it can make the code more understandable.

> yes smp_mb__after_atomic() would be
> harmless barrier() in x86. Did not add this V5 as yoiu though but this
> made me look at slowpath_enter code and added an explicit barrier()
> there :).

Well. it looks even more confusing than a lack of barrier ;)

Oleg.