Hi
I think the MIPS arch_spin_unlock() is borken.
spin_unlock() must have RELEASE semantics, these require that no LOADs
nor STOREs leak out from the critical section.
From what I know MIPS has a relaxed memory model which allows reads to
pass stores, and as implemented arch_spin_unlock() only issues a wmb
which doesn't order prior reads vs later stores.
Therefore upgrade the wmb() to smp_mb().
(Also, why the unconditional wmb, as opposed to smp_wmb() ?)
Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
index 40196bebe849..b2ca13f06152 100644
--- a/arch/mips/include/asm/spinlock.h
+++ b/arch/mips/include/asm/spinlock.h
@@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
static inline void arch_spin_unlock(arch_spinlock_t *lock)
{
unsigned int serving_now = lock->h.serving_now + 1;
- wmb();
+ smp_mb();
lock->h.serving_now = (u16)serving_now;
nudge_writes();
}
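
To make the problem concrete, here is a purely illustrative sketch (not
part of the patch; 'shared' and 'val' are made-up names) of what the
current unlock path permits on a weakly ordered CPU:

	/* Illustrative only: a load performed inside the critical
	 * section, followed by the unlock path as it stands today. */
	val = READ_ONCE(shared);		/* LOAD under the lock        */
	wmb();					/* orders earlier STOREs only */
	lock->h.serving_now = serving_now;	/* hands the lock over        */

Nothing here forbids the CPU from making the serving_now store visible
before the load of 'shared' has completed, so the next ticket holder can
enter the critical section while that read is still in flight; that is
exactly the LOAD leaking out that RELEASE semantics must rule out.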
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> unsigned int serving_now = lock->h.serving_now + 1;
> - wmb();
> + smp_mb();
> lock->h.serving_now = (u16)serving_now;
Also, you might want to consider using WRITE_ONCE() for that store.
> nudge_writes();
And that thing is creative; I've not seen any other arch issue barriers
to speed up the store buffer flush. Does it really matter for MIPS?
> }
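
For illustration only, a minimal sketch of the unlock body with both of
the above folded in (nothing else changed):

	unsigned int serving_now = lock->h.serving_now + 1;

	smp_mb();
	/* Tell the compiler not to tear or re-issue this store. */
	WRITE_ONCE(lock->h.serving_now, (u16)serving_now);
	nudge_writes();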
Peter Zijlstra <[email protected]> writes:
> Hi
>
> I think the MIPS arch_spin_unlock() is borken.
>
> spin_unlock() must have RELEASE semantics, these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb
> which doesn't order prior reads vs later stores.
This is correct.
> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
Good question.
The current MIPS asm/barrier.h uses a plain SYNC instruction for all
kinds of barriers (except on Cavium Octeon), which is a bit wasteful.
A MIPS implementation can optionally support partial barriers (load,
store, acquire, release) which all behave like a full barrier if not
implemented, so those really ought to be used.
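As a rough sketch (not the current asm/barrier.h, and glossing over the
ISA/assembler gating), the lighter variants could be built on the
optional SYNC stype encodings, which degrade to a full SYNC on CPUs
that do not implement them:

	/* Hedged sketch only; stype values are from the MIPS architecture spec. */
	#define __sync_stype(stype)					\
		__asm__ __volatile__("sync " #stype : : : "memory")

	#define smp_rmb()	__sync_stype(19)	/* SYNC_RMB, loads only  */
	#define smp_wmb()	__sync_stype(4)		/* SYNC_WMB, stores only */
	/* acquire/release variants would use stypes 17 and 18 */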
> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> unsigned int serving_now = lock->h.serving_now + 1;
> - wmb();
> + smp_mb();
> lock->h.serving_now = (u16)serving_now;
> nudge_writes();
> }
All this weirdness was added in commit 500c2e1f:
MIPS: Optimize spinlocks.
The current locking mechanism uses a ll/sc sequence to release a
spinlock. This is slower than a wmb() followed by a store to unlock.
The branching forward to .subsection 2 on sc failure slows down the
contended case. So we get rid of that part too.
Since we are now working on naturally aligned u16 values, we can get
rid of a masking operation as the LHU already does the right thing.
The ANDI are reversed for better scheduling on multi-issue CPUs
On a 12 CPU 750MHz Octeon cn5750 this patch improves ipv4 UDP packet
forwarding rates from 3.58*10^6 PPS to 3.99*10^6 PPS, or about 11%.
Signed-off-by: David Daney <[email protected]>
To: [email protected]
Patchwork: http://patchwork.linux-mips.org/patch/937/
Signed-off-by: Ralf Baechle <[email protected]>
--
Måns Rullgård
[email protected]
On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> Hi
>
> I think the MIPS arch_spin_unlock() is borken.
>
> spin_unlock() must have RELEASE semantics, these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb
> which doesn't order prior reads vs later stores.
>
> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
One guess is that they want to order I/O accesses within the critical
section?
Thanx, Paul
> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> unsigned int serving_now = lock->h.serving_now + 1;
> - wmb();
> + smp_mb();
> lock->h.serving_now = (u16)serving_now;
> nudge_writes();
> }
>
"Paul E. McKenney" <[email protected]> writes:
> On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
>> Hi
>>
>> I think the MIPS arch_spin_unlock() is borken.
>>
>> spin_unlock() must have RELEASE semantics, these require that no LOADs
>> nor STOREs leak out from the critical section.
>>
>> From what I know MIPS has a relaxed memory model which allows reads to
>> pass stores, and as implemented arch_spin_unlock() only issues a wmb
>> which doesn't order prior reads vs later stores.
>>
>> Therefore upgrade the wmb() to smp_mb().
>>
>> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
>
> One guess is that they want to order I/O accesses within the critical
> section?
Isn't that what mmiowb() is for?
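
(For reference, the usual driver-side pattern mmiowb() is meant for,
with hypothetical device fields:

	spin_lock_irqsave(&dev->lock, flags);
	writel(val, dev->regs + CTRL_REG);	/* MMIO write under the lock         */
	mmiowb();				/* keep it ordered before the unlock */
	spin_unlock_irqrestore(&dev->lock, flags);

so the spinlock itself should not need to provide I/O ordering.)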
>> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> ---
>> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
>> index 40196bebe849..b2ca13f06152 100644
>> --- a/arch/mips/include/asm/spinlock.h
>> +++ b/arch/mips/include/asm/spinlock.h
>> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>> static inline void arch_spin_unlock(arch_spinlock_t *lock)
>> {
>> unsigned int serving_now = lock->h.serving_now + 1;
>> - wmb();
>> + smp_mb();
>> lock->h.serving_now = (u16)serving_now;
>> nudge_writes();
>> }
>>
>
--
Måns Rullgård
[email protected]
On Thu, Nov 12, 2015 at 02:50:00PM +0000, Måns Rullgård wrote:
> "Paul E. McKenney" <[email protected]> writes:
>
> > On Thu, Nov 12, 2015 at 01:31:23PM +0100, Peter Zijlstra wrote:
> >> Hi
> >>
> >> I think the MIPS arch_spin_unlock() is borken.
> >>
> >> spin_unlock() must have RELEASE semantics, these require that no LOADs
> >> nor STOREs leak out from the critical section.
> >>
> >> From what I know MIPS has a relaxed memory model which allows reads to
> >> pass stores, and as implemented arch_spin_unlock() only issues a wmb
> >> which doesn't order prior reads vs later stores.
> >>
> >> Therefore upgrade the wmb() to smp_mb().
> >>
> >> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
> >
> > One guess is that they want to order I/O accesses within the critical
> > section?
>
> Isn't that what mmiowb() is for?
Indeed it is. Perhaps they didn't trust the device drivers that they
care about to use it? Anyway, just my guess. Just out of curiosity,
what is your guess?
Thanx, Paul
> >> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> >> ---
> >> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> >> index 40196bebe849..b2ca13f06152 100644
> >> --- a/arch/mips/include/asm/spinlock.h
> >> +++ b/arch/mips/include/asm/spinlock.h
> >> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> >> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> >> {
> >> unsigned int serving_now = lock->h.serving_now + 1;
> >> - wmb();
> >> + smp_mb();
> >> lock->h.serving_now = (u16)serving_now;
> >> nudge_writes();
> >> }
> >>
> >
>
> --
> Måns Rullgård
> [email protected]
>
On 11/12/2015 04:31 AM, Peter Zijlstra wrote:
> Hi
>
> I think the MIPS arch_spin_unlock() is borken.
>
> spin_unlock() must have RELEASE semantics, these require that no LOADs
> nor STOREs leak out from the critical section.
>
> From what I know MIPS has a relaxed memory model which allows reads to
> pass stores, and as implemented arch_spin_unlock() only issues a wmb
> which doesn't order prior reads vs later stores.
>
> Therefore upgrade the wmb() to smp_mb().
>
> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
asm/spinlock.h is only used for CONFIG_SMP builds. So, smp_wmb() would imply
that special handling for non-SMP is needed, when this is already only
used for the SMP build case.
>
> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
> index 40196bebe849..b2ca13f06152 100644
> --- a/arch/mips/include/asm/spinlock.h
> +++ b/arch/mips/include/asm/spinlock.h
> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
> static inline void arch_spin_unlock(arch_spinlock_t *lock)
> {
> unsigned int serving_now = lock->h.serving_now + 1;
> - wmb();
> + smp_mb();
That is too heavy.
It implies a full MIPS "SYNC" operation which stalls execution until all
previous writes are committed and globally visible.
We really want just release semantics, and there is no standard named
primitive that gives us that.
For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be:
smp_wmb();
smp_rmb();
Which expands to exactly the same thing as wmb() because smp_rmb()
expands to nothing.
For CPUs that have out-of-order loads, smp_rmb() should expand to
something lighter weight than "SYNC"
Certainly we can load up the code with "SYNC" all over the place, but it
will kill performance on SMP systems. So, my vote would be to make it
as light weight as possible, but no lighter. That will mean inventing
the proper barrier primitives.
You yourself seem to have added smp_store_release(), so we could even do:
smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1);
That would leave us to cook up a proper definition of smp_store_release().
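
A minimal sketch of such a definition, keeping the generic shape
(barrier, then a plain store) and leaving room to swap in a cheaper
release barrier later:

	/* Sketch only, not a tested patch. */
	#define smp_store_release(p, v)					\
	do {								\
		smp_mb();	/* could become a lighter release barrier */ \
		WRITE_ONCE(*(p), (v));					\
	} while (0)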
David Daney
> lock->h.serving_now = (u16)serving_now;
> nudge_writes();
> }
On Thu, Nov 12, 2015 at 09:46:53AM -0800, David Daney wrote:
> For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be:
>
> smp_wmb();
> smp_rmb();
>
> Which expands to exactly the same thing as wmb() because smp_rmb() expands
> to nothing.
OK, so the current code isn't broken: for Cavium wmb is sufficient
because rmb is a no-op, and for !Cavium wmb expands to SYNC.
> You yourself seem to have added smp_store_release(), so we could even do:
>
> smp_store_release(&lock->h.serving_now, lock->h.serving_now + 1);
>
> That would leave us to cook up a proper definition of smp_store_release().
That is indeed the better solution.
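
Roughly, the unlock would then collapse to (sketch, not a tested patch):

	static inline void arch_spin_unlock(arch_spinlock_t *lock)
	{
		/* RELEASE-ordered publish of the next ticket. */
		smp_store_release(&lock->h.serving_now,
				  (u16)(lock->h.serving_now + 1));
		nudge_writes();
	}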
David Daney <[email protected]> writes:
> On 11/12/2015 04:31 AM, Peter Zijlstra wrote:
>> Hi
>>
>> I think the MIPS arch_spin_unlock() is borken.
>>
>> spin_unlock() must have RELEASE semantics, these require that no LOADs
>> nor STOREs leak out from the critical section.
>>
>> From what I know MIPS has a relaxed memory model which allows reads to
>> pass stores, and as implemented arch_spin_unlock() only issues a wmb
>> which doesn't order prior reads vs later stores.
>>
>> Therefore upgrade the wmb() to smp_mb().
>>
>> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
>
> asm/spinlock.h is only used for CONFIG_SMP builds. So, smp_wmb() would
> imply that special handling for non-SMP is needed, when this is
> already only used for the SMP build case.
>
>>
>> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>> ---
>> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
>> index 40196bebe849..b2ca13f06152 100644
>> --- a/arch/mips/include/asm/spinlock.h
>> +++ b/arch/mips/include/asm/spinlock.h
>> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>> static inline void arch_spin_unlock(arch_spinlock_t *lock)
>> {
>> unsigned int serving_now = lock->h.serving_now + 1;
>> - wmb();
>> + smp_mb();
>
> That is too heavy.
>
> It implies a full MIPS "SYNC" operation which stalls execution until
> all previous writes are committed and globally visible.
>
> We really want just release semantics, and there is no standard named
> primitive that gives us that.
>
> For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be:
>
> smp_wmb();
> smp_rmb();
>
> Which expands to exactly the same thing as wmb() because smp_rmb()
> expands to nothing.
>
> For CPUs that have out-of-order loads, smp_rmb() should expand to
> something lighter weight than "SYNC"
>
> Certainly we can load up the code with "SYNC" all over the place, but
> it will kill performance on SMP systems. So, my vote would be to make
> it as light weight as possible, but no lighter. That will mean
> inventing the proper barrier primitives.
It seems to me that the proper barrier here is a "SYNC 18" aka
SYNC_RELEASE instruction, at least on CPUs that implement that variant.
--
Måns Rullgård
[email protected]
On 11/12/2015 10:13 AM, Måns Rullgård wrote:
> David Daney <[email protected]> writes:
>
>> On 11/12/2015 04:31 AM, Peter Zijlstra wrote:
>>> Hi
>>>
>>> I think the MIPS arch_spin_unlock() is borken.
>>>
>>> spin_unlock() must have RELEASE semantics, these require that no LOADs
>>> nor STOREs leak out from the critical section.
>>>
>>> From what I know MIPS has a relaxed memory model which allows reads to
>>> pass stores, and as implemented arch_spin_unlock() only issues a wmb
>>> which doesn't order prior reads vs later stores.
>>>
>>> Therefore upgrade the wmb() to smp_mb().
>>>
>>> (Also, why the unconditional wmb, as opposed to smp_wmb() ?)
>>
>> asm/spinlock.h is only used for CONFIG_SMP builds. So, smp_wmb() would
>> imply that special handling for non-SMP is needed, when this is
>> already only used for the SMP build case.
>>
>>>
>>> Maybe-Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
>>> ---
>>> diff --git a/arch/mips/include/asm/spinlock.h b/arch/mips/include/asm/spinlock.h
>>> index 40196bebe849..b2ca13f06152 100644
>>> --- a/arch/mips/include/asm/spinlock.h
>>> +++ b/arch/mips/include/asm/spinlock.h
>>> @@ -140,7 +140,7 @@ static inline void arch_spin_lock(arch_spinlock_t *lock)
>>> static inline void arch_spin_unlock(arch_spinlock_t *lock)
>>> {
>>> unsigned int serving_now = lock->h.serving_now + 1;
>>> - wmb();
>>> + smp_mb();
>>
>> That is too heavy.
>>
>> It implies a full MIPS "SYNC" operation which stalls execution until
>> all previous writes are committed and globally visible.
>>
>> We really want just release semantics, and there is no standard named
>> primitive that gives us that.
>>
>> For CONFIG_CPU_CAVIUM_OCTEON the proper thing would be:
>>
>> smp_wmb();
>> smp_rmb();
>>
>> Which expands to exactly the same thing as wmb() because smp_rmb()
>> expands to nothing.
>>
>> For CPUs that have out-of-order loads, smp_rmb() should expand to
>> something lighter weight than "SYNC"
>>
>> Certainly we can load up the code with "SYNC" all over the place, but
>> it will kill performance on SMP systems. So, my vote would be to make
>> it as light weight as possible, but no lighter. That will mean
>> inventing the proper barrier primitives.
>
> It seems to me that the proper barrier here is a "SYNC 18" aka
> SYNC_RELEASE instruction, at least on CPUs that implement that variant.
>
Yes, unfortunately very few CPUs implement that. It is an instruction
that MIPS invented only recently, so older CPUs need a different solution.
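
One way to cover both cases (sketch only; CONFIG_CPU_HAS_SYNC_RELEASE is
a made-up symbol standing in for whatever CPU-feature test gets used) is
to pick the barrier at build time, falling back to a full SYNC, which is
architecturally a superset of release ordering:

	#ifdef CONFIG_CPU_HAS_SYNC_RELEASE
	#define __release_barrier()	__asm__ __volatile__("sync 18" : : : "memory")
	#else
	#define __release_barrier()	__asm__ __volatile__("sync" : : : "memory")
	#endif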
David Daney