Hi Huacai,
On Mon, Jul 09, 2018 at 10:26:38AM +0800, Huacai Chen wrote:
> After commit 7f56b58a92aaf2c ("locking/mcs: Use smp_cond_load_acquire()
> in MCS spin loop") Loongson-3 fails to boot. This is because Loongson-3
> has an SFB (Store Fill Buffer) and the weak ordering may cause
> READ_ONCE() to keep returning an old value in a tight loop. So in
> smp_cond_load_acquire() we need a __smp_rmb() before the READ_ONCE()
> loop.
>
> This patch introduces a Loongson-specific smp_cond_load_acquire(). It
> should be backported as far back as linux-4.5, the release in which
> smp_cond_acquire() was introduced.
>
> There may be other cases where memory barriers are needed; we will fix
> them one by one.
>
> Cc: [email protected]
> Signed-off-by: Huacai Chen <[email protected]>
> ---
> arch/mips/include/asm/barrier.h | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> index a5eb1bb..e8c4c63 100644
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -222,6 +222,24 @@
> #define __smp_mb__before_atomic() __smp_mb__before_llsc()
> #define __smp_mb__after_atomic() smp_llsc_mb()
>
> +#ifdef CONFIG_CPU_LOONGSON3
> +/* Loongson-3 needs a __smp_rmb() before the READ_ONCE() loop */
> +#define smp_cond_load_acquire(ptr, cond_expr) \
> +({ \
> + typeof(ptr) __PTR = (ptr); \
> + typeof(*ptr) VAL; \
> + __smp_rmb(); \
> + for (;;) { \
> + VAL = READ_ONCE(*__PTR); \
> + if (cond_expr) \
> + break; \
> + cpu_relax(); \
> + } \
> + __smp_rmb(); \
> + VAL; \
> +})
> +#endif /* CONFIG_CPU_LOONGSON3 */
The discussion on v1 of this patch [1] seemed to converge on the view
that Loongson suffers from the same problem as ARM platforms which
enable the CONFIG_ARM_ERRATA_754327 workaround, and that we might
require a similar workaround.
Is there a reason you've not done that, and instead tweaked your patch
that's specific to the smp_cond_load_acquire() case? I'm not comfortable
with fixing just this one case when there could be many similar
problematic pieces of code you just haven't hit yet.
Please also keep the LKMM maintainers on copy for this - their feedback
will be valuable & I'll be much more comfortable applying a workaround
for Loongson's behavior here if it's one that they're OK with.
Thanks,
Paul
[1] https://www.linux-mips.org/archives/linux-mips/2018-06/msg00139.html
Hi, Paul and Peter,
I think we have found the real root cause. READ_ONCE() doesn't need any
barriers; the problematic code is queued_spin_lock_slowpath() in
kernel/locking/qspinlock.c:
	if (old & _Q_TAIL_MASK) {
		prev = decode_tail(old);

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);

		/*
		 * While waiting for the MCS lock, the next pointer may have
		 * been set by another lock waiter. We optimistically load
		 * the next pointer & prefetch the cacheline for writing
		 * to reduce latency in the upcoming MCS unlock operation.
		 */
		next = READ_ONCE(node->next);
		if (next)
			prefetchw(next);
	}
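For reference, after the commit above arch_mcs_spin_lock_contended() is
(roughly, from memory) just smp_cond_load_acquire() spinning on
node->locked:

	#define arch_mcs_spin_lock_contended(l)				\
	do {								\
		smp_cond_load_acquire(l, VAL);				\
	} while (0)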
After WRITE_ONCE(prev->next, node), arch_mcs_spin_lock_contended()
enters a READ_ONCE() loop, so the effect of the WRITE_ONCE() stays
invisible to other cores because of the write buffer. As a result,
arch_mcs_spin_lock_contended() waits forever, because the waiters on
prev->next wait forever. I think the right way to fix this is to flush
the SFB after this WRITE_ONCE(), but I don't have a good solution:
1, MIPS has wbflush(), which can be used to flush the SFB, but other
archs don't have it;
2, Every arch has mb(), and adding mb() after the WRITE_ONCE() actually
solves Loongson's problem (see the sketch below), but semantically mb()
is different from wbflush();
3, Maybe we can define a Loongson-specific WRITE_ONCE(), but not every
WRITE_ONCE() needs wbflush(); we only need wbflush() between a
WRITE_ONCE() and a READ_ONCE() loop.
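For example, option 2 would look something like this in
queued_spin_lock_slowpath() (an untested sketch, just to show the idea):

		/* Link @node into the waitqueue. */
		WRITE_ONCE(prev->next, node);
		smp_mb();	/* untested: drain the SFB so prev can see @node */

		pv_wait_node(node, prev);
		arch_mcs_spin_lock_contended(&node->locked);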
Any ideas?
Huacai
On Tue, Jul 10, 2018 at 12:26:34PM +0800, Huacai Chen wrote:
> Hi, Paul and Peter,
>
> I think we find the real root cause, READ_ONCE() doesn't need any
> barriers, the problematic code is queued_spin_lock_slowpath() in
> kernel/locking/qspinlock.c:
>
> 	if (old & _Q_TAIL_MASK) {
> 		prev = decode_tail(old);
>
> 		/* Link @node into the waitqueue. */
> 		WRITE_ONCE(prev->next, node);
>
> 		pv_wait_node(node, prev);
> 		arch_mcs_spin_lock_contended(&node->locked);
>
> 		/*
> 		 * While waiting for the MCS lock, the next pointer may have
> 		 * been set by another lock waiter. We optimistically load
> 		 * the next pointer & prefetch the cacheline for writing
> 		 * to reduce latency in the upcoming MCS unlock operation.
> 		 */
> 		next = READ_ONCE(node->next);
> 		if (next)
> 			prefetchw(next);
> 	}
>
> After WRITE_ONCE(prev->next, node); arch_mcs_spin_lock_contended()
> enter a READ_ONCE() loop, so the effect of WRITE_ONCE() is invisible
> by other cores because of the write buffer.
And _that_ is a hardware bug. Also please explain how that is different
from the ARM bug mentioned elsewhere.
> As a result,
> arch_mcs_spin_lock_contended() will wait for ever because the waiters
> of prev->next will wait for ever. I think the right way to fix this is
> flush SFB after this WRITE_ONCE(), but I don't have a good solution:
> 1, MIPS has wbflush() which can be used to flush SFB, but other archs
> don't have;
Sane archs don't need this.
> 2, Every arch has mb(), and add mb() after WRITE_ONCE() can actually
> solve Loongson's problem, but in syntax, mb() is different from
> wbflush();
Still wrong, because any non-broken arch doesn't need that flush to
begin with.
> 3, Maybe we can define a Loongson-specific WRITE_ONCE(), but not every
> WRITE_ONCE() need wbflush(), we only need wbflush() between
> WRITE_ONCE() and a READ_ONCE() loop.
No no no no ...
So now explain why the cpu_relax() hack that arm did doesn't work for
you?
On Tue, Jul 10, 2018 at 11:36:37AM +0200, Peter Zijlstra wrote:
> So now explain why the cpu_relax() hack that arm did doesn't work for
> you?
So below is the patch I think you want; if not explain in detail how
this is wrong.
diff --git a/arch/mips/include/asm/processor.h b/arch/mips/include/asm/processor.h
index af34afbc32d9..e59773de6528 100644
--- a/arch/mips/include/asm/processor.h
+++ b/arch/mips/include/asm/processor.h
@@ -386,7 +386,17 @@ unsigned long get_wchan(struct task_struct *p);
#define KSTK_ESP(tsk) (task_pt_regs(tsk)->regs[29])
#define KSTK_STATUS(tsk) (task_pt_regs(tsk)->cp0_status)
+#ifdef CONFIG_CPU_LOONGSON3
+/*
+ * Loongson-3 has a CPU bug where the store buffer gets starved when stuck in a
+ * read loop. Since spin loops of any kind should have a cpu_relax() in them,
+ * force a store-buffer flush from cpu_relax() such that any pending writes
+ * will become available as expected.
+ */
+#define cpu_relax() smp_mb()
+#else
#define cpu_relax() barrier()
+#endif
/*
* Return_address is a replacement for __builtin_return_address(count)
Hi, Peter,
I'm afraid that you have missed something...

Firstly, our previous conclusion (that READ_ONCE() needs a barrier to
avoid 'reads prioritised over writes') is totally wrong. So defining
cpu_relax() as smp_mb(), as ARM11MPCore does, is incorrect, even if it
can 'solve' Loongson's problem.
Secondly, I think the real problem is like this:
1, CPU0 sets the lock to 0, then does something;
2, While CPU0 is doing something, CPU1 sets the flag to 1 with
WRITE_ONCE(), and then waits for the lock to become 1 in a READ_ONCE()
loop;
3, After CPU0 completes its work, it waits for the flag to become 1,
and then sets the lock to 1;
4, When the lock becomes 1, CPU1 leaves the READ_ONCE() loop.

Without the SFB, everything is OK. But with the SFB, in step 2 the
READ_ONCE() loop comes right after the WRITE_ONCE(), which leaves the
flag cached in the SFB (and thus invisible to other CPUs) forever, so
both CPU0 and CPU1 wait forever.
I don't think this is a hardware bug; by design, the SFB is flushed to
the L1 cache in three cases:
1, the data in the SFB is full (forms a complete cache line);
2, there is a subsequent read access to the same cache line;
3, a 'sync' instruction is executed.

In this case, there is no other memory access (read or write) between
the WRITE_ONCE() and the READ_ONCE() loop, so cases 1 and 2 will not
happen, and the only way to make the flag visible is wbflush() (which
is 'sync' in Loongson's case).

I think this problem not only happens on Loongson, but will also happen
on other CPUs that have a write buffer (unless the write buffer has a
fourth case in which it is flushed).
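In pseudo-C, with flag and lock both initially 0, the scenario above is:

	/* CPU1 */
	WRITE_ONCE(flag, 1);		/* sits in CPU1's SFB...              */
	while (!READ_ONCE(lock))	/* ...and nothing after it reads or   */
		;			/* writes that cache line, so the SFB */
					/* never drains                       */

	/* CPU0 */
	/* do something */
	while (!READ_ONCE(flag))	/* never observes flag == 1 */
		;
	WRITE_ONCE(lock, 1);		/* never reached */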
Huacai
Please!! Learn to use email.
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
Also, wrap non-quoted lines to 78 characters.
On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
> I'm afraid that you have missing something......
>
> Firstly, our previous conclusion (READ_ONCE need a barrier to avoid
> 'reads prioritised over writes') is totally wrong. So define
> cpu_relax() to smp_mb() like ARM11MPCore is incorrect, even if it can
> 'solve' Loongson's problem. Secondly, I think the real problem is
> like this:
> 1, CPU0 set the lock to 0, then do something;
> 2, While CPU0 is doing something, CPU1 set the flag to 1 with
> WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE()
> loop;
> 3, After CPU0 complete its work, it wait the flag become to 1, and if
> so then set the lock to 1;
> 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop.
> If without SFB, everything is OK. But with SFB in step 2, a
> READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag
> cached in SFB (so be invisible by other CPUs) for ever, then both CPU0
> and CPU1 wait for ever.
Sure.. we all got that far. And no, this isn't the _real_ problem. This
is a manifestation of the problem.
The problem is that your SFB is broken (per the Linux requirements). We
require that stores will become visible. That is, they must not
indefinitely (for whatever reason) stay in the store buffer.
> I don't think this is a hardware bug, in design, SFB will flushed to
> L1 cache in three cases:
> 1, data in SFB is full (be a complete cache line);
> 2, there is a subsequent read access in the same cache line;
> 3, a 'sync' instruction is executed.
And I think this _is_ a hardware bug. You just designed the bug instead
of it being by accident.
> In this case, there is no other memory access (read or write) between
> WRITE_ONCE() and READ_ONCE() loop. So Case 1 and Case 2 will not
> happen, and the only way to make the flag be visible is wbflush
> (wbflush is sync in Loongson's case).
>
> I think this problem is not only happens on Loongson, but will happen
> on other CPUs which have write buffer (unless the write buffer has a
> 4th case to be flushed).
It doesn't happen on _any_ other architecture except that dodgy
ARM11MPCore part. Linux hard relies on stores to become available
_eventually_.
Still, even with the rules above, the best work-around is still the very
same cpu_relax() hack.
On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote:
>
> Please!! Learn to use email.
>
> A: Because it messes up the order in which people normally read text.
> Q: Why is top-posting such a bad thing?
> A: Top-posting.
> Q: What is the most annoying thing in e-mail?
>
> Also, wrap non-quoted lines to 78 characters.
>
> On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
> > I'm afraid that you have missing something......
> >
> > Firstly, our previous conclusion (READ_ONCE need a barrier to avoid
> > 'reads prioritised over writes') is totally wrong. So define
> > cpu_relax() to smp_mb() like ARM11MPCore is incorrect, even if it can
> > 'solve' Loongson's problem. Secondly, I think the real problem is
> > like this:
>
> > 1, CPU0 set the lock to 0, then do something;
> > 2, While CPU0 is doing something, CPU1 set the flag to 1 with
> > WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE()
> > loop;
> > 3, After CPU0 complete its work, it wait the flag become to 1, and if
> > so then set the lock to 1;
> > 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop.
Are there specific loops in the kernel whose conditions are controlled
by READ_ONCE() that don't contain cpu_relax(), smp_mb(), etc.? One
way to find them given your description of your hardware is to make
cpu_relax() be smp_mb() as Peter suggests, and then run tests to find
the problems.
Or have you already done this?
Thanx, Paul
Hello,
On Tue, Jul 10, 2018 at 02:17:27PM +0200, Peter Zijlstra wrote:
> > 1, CPU0 set the lock to 0, then do something;
> > 2, While CPU0 is doing something, CPU1 set the flag to 1 with
> > WRITE_ONCE(), and then wait the lock become to 1 with a READ_ONCE()
> > loop;
> > 3, After CPU0 complete its work, it wait the flag become to 1, and if
> > so then set the lock to 1;
> > 4, If the lock becomes to 1, CPU1 will leave the READ_ONCE() loop.
>
> > If without SFB, everything is OK. But with SFB in step 2, a
> > READ_ONCE() loop is right after WRITE_ONCE(), which makes the flag
> > cached in SFB (so be invisible by other CPUs) for ever, then both CPU0
> > and CPU1 wait for ever.
>
> Sure.. we all got that far. And no, this isn't the _real_ problem. This
> is a manifestation of the problem.
>
> The problem is that your SFB is broken (per the Linux requirements). We
> require that stores will become visible. That is, they must not
> indefinitely (for whatever reason) stay in the store buffer.
For the record, my understanding is that this doesn't really comply with
the MIPS architecture either. It doesn't cover store buffers explicitly
but does cover the more general notion of caches.
From section 2.8.8.2 "Cached Memory Access" of the Introduction to the
MIPS64 Architecture document, revision 5.04 [1]:
> In a cached access, physical memory and all caches in the system
> containing a copy of the physical location are used to resolve the
> access. A copy of a location is coherent if the copy was placed in the
> cache by a cached coherent access; a copy of a location is noncoherent
> if the copy was placed in the cache by a cached noncoherent access.
> (Coherency is dictated by the system architecture, not the processor
> implementation.)
>
> Caches containing a coherent copy of the location are examined and/or
> modified to keep the contents of the location coherent. It is not
> possible to predict whether caches holding a noncoherent copy of the
> location will be examined and/or modified during a cached coherent
> access.
All of the accesses Linux is performing are cached coherent.
I'm not sure which is the intent (I can ask if someone's interested),
but you could either:
1) Consider the store buffer a cache, in which case loads need to
check all store buffers from all CPUs because of the "all caches"
part of the first quoted sentence.
or
2) Decide store buffers aren't covered by the MIPS architecture
documentation at all in which case the only sane thing to do would
be to make it transparent to software (and here Loongson's isn't).
> > I don't think this is a hardware bug, in design, SFB will flushed to
> > L1 cache in three cases:
>
> > 1, data in SFB is full (be a complete cache line);
> > 2, there is a subsequent read access in the same cache line;
> > 3, a 'sync' instruction is executed.
>
> And I think this _is_ a hardware bug. You just designed the bug instead
> of it being by accident.
Presuming that the data in the SFB isn't visible to other cores, and
presuming that case 2 is only covering loads from the same core that
contains the SFB, I agree that this doesn't seem like valid behavior.
Thanks,
Paul
[1] https://www.mips.com/?do-download=introduction-to-the-mips64-architecture-v5-04
Hi Huacai,
On Tue, Jul 10, 2018 at 07:45:22PM +0800, 陈华才 wrote:
> I don't think this is a hardware bug, in design, SFB will flushed to L1
> cache in three cases:
> 1, data in SFB is full (be a complete cache line);
> 2, there is a subsequent read access in the same cache line;
> 3, a 'sync' instruction is executed.
I'd expect successful LL/SC, cache maintenance (and potentially TLB)
operations to flush your SFB as well, not that I think that provides a
better workaround than throwing a 'sync' into cpu_relax(). I assume the
SFB is all physically addressed?
Generally, CPU architectures guarantee that store buffers drain "in finite
time" which is a pretty crappy guarantee, but one which tends to be
sufficient in practice and therefore relied upon by software.
Will
On 2018-7-10 Tue at 20:17:27, Peter Zijlstra wrote:
Hi Peter,
Since Huacai is unable to send email via his client, I'm going to reply
for him.
> Sure.. we all got that far. And no, this isn't the _real_ problem. This
> is a manifestation of the problem.
>
> The problem is that your SFB is broken (per the Linux requirements). We
> require that stores will become visible. That is, they must not
> indefinitely (for whatever reason) stay in the store buffer.
>
> > I don't think this is a hardware bug, in design, SFB will flushed to
> > L1 cache in three cases:
> >
> > 1, data in SFB is full (be a complete cache line);
> > 2, there is a subsequent read access in the same cache line;
> > 3, a 'sync' instruction is executed.
>
> And I think this _is_ a hardware bug. You just designed the bug instead
> of it being by accident.
Yes, we understand that this hardware feature is not supported by
Linux, so as far as Linux is concerned it should be treated as a
hardware bug.
>
> It doesn't happen an _any_ other architecture except that dodgy
> ARM11MPCore part. Linux hard relies on stores to become available
> _eventually_.
>
> Still, even with the rules above, the best work-around is still the very
> same cpu_relax() hack.
As you say, the SFB makes Loongson not fully SMP-coherent. Modifying
cpu_relax() can solve the current problem, but it is not so
straightforward. On the other hand, providing a Loongson-specific
WRITE_ONCE() looks more reasonable, because it eliminates the
"non-coherency", so we can solve the bug at its root.
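That is, something along these lines (a sketch of the idea only; the
name is made up):

	/* Sketch: a store that also drains the SFB. */
	#define loongson_write_once(p, v)		\
	do {						\
		WRITE_ONCE(*(p), (v));			\
		wbflush();	/* SYNC on Loongson */	\
	} while (0)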
Thanks.
--
Jiaxun Yang
On Wed, Jul 11, 2018 at 10:04:52AM +0000, David Laight wrote:
> I also suspect that 'write starvation' is also common - after all the
> purpose of the store buffer is to do reads in preference to writes in
> order to reduce the cpu stalls waiting for the memory bus (probably
> the cpu to cache interface).
>
> I think your example is just:
> *(volatile int *)xxx = 1;
> while (!*(volatile int *)yyy) continue;
> running on two cpu with xxx and yyy swapped?
Yep. And Linux has been relying on that working for (afaict) basically
forever.
> You need a stronger bus cycle in there somewhere.
Since all spin-wait loops _should_ have cpu_relax(), that is the
natural place to put it.
On Wed, Jul 11, 2018 at 11:21:06AM +0100, Will Deacon wrote:
> On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
> > On 2018-7-10 Tue at 20:17:27,Peter Zijlstra Wrote:
> > > Still, even with the rules above, the best work-around is still the very
> > > same cpu_relax() hack.
> >
> > As you say, SFB makes Loongson not fully SMP-coherent.
> > However, modify cpu_relax can solve the current problem,
> > but not so straight forward. On the other hand, providing a Loongson-specific
> > WRITE_ONCE looks more reasonable, because it the eliminate the "non-cohrency".
> > So we can solve the bug from the root.
>
> Curious, but why is it not straight-forward to hack cpu_relax()? If you try
> to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all
> the places that should be using WRITE_ONCE but aren't ;~)
Right.
The problem isn't stores per se; normal progress should contain enough
stores to flush out 'old' bits in the natural order of things. But the
problem is spin-wait loops that inhibit normal progress (and thereby
store-buffer flushing).
And all spin-wait loops should have cpu_relax() in them. So
cpu_relax() is the natural place to fix this.
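E.g. a waiter that has just done a WRITE_ONCE() and then sits in a
spin-wait loop looks roughly like this:

	WRITE_ONCE(*flag, 1);		/* may be stuck in this CPU's store buffer */
	while (!READ_ONCE(*lock))
		cpu_relax();		/* with the change above this is a SYNC,   */
					/* which pushes the pending store out on   */
					/* every iteration                          */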
Adding SYNC to WRITE_ONCE()/atomic* will hurt performance a lot and
will ultimately not guarantee anything more; and as Will said, it keeps
you chasing dragons where people forgot to use WRITE_ONCE() where they
maybe should've.
From: Paul Burton
> Sent: 10 July 2018 18:11
...
> I'm not sure which is the intent (I can ask if someone's interested),
> but you could either:
>
> 1) Consider the store buffer a cache, in which case loads need to
> check all store buffers from all CPUs because of the "all caches"
> part of the first quoted sentence.
>
> or
>
> 2) Decide store buffers aren't covered by the MIPS architecture
> documentation at all in which case the only sane thing to do would
> be to make it transparent to software (and here Loongson's isn't)
...
Store buffers are common and are never transparent to multi-threaded code.
They are largely why you need locks.
At least on (early) sparc systems they were between the execution unit
and the data cache.
I also suspect that 'write starvation' is common - after all, the
purpose of the store buffer is to service reads in preference to writes
in order to reduce the CPU stalls waiting for the memory bus (probably
the CPU-to-cache interface).
I think your example is just:

	*(volatile int *)xxx = 1;
	while (!*(volatile int *)yyy)
		continue;

running on two CPUs with xxx and yyy swapped?
You need a stronger bus cycle in there somewhere.
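i.e. something along the lines of:

	*(volatile int *)xxx = 1;
	mb();		/* or wbflush()/SYNC, to push the store out */
	while (!*(volatile int *)yyy)
		continue;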
David
On Wed, Jul 11, 2018 at 06:05:51PM +0800, Jiaxun Yang wrote:
> As you say, SFB makes Loongson not fully SMP-coherent.
> However, modify cpu_relax can solve the current problem,
> but not so straight forward. On the other hand, providing a Loongson-specific
> WRITE_ONCE looks more reasonable, because it the eliminate the "non-cohrency".
> So we can solve the bug from the root.
Curious, but why is it not straight-forward to hack cpu_relax()? If you try
to hack WRITE_ONCE, you also need to hack atomic_set, atomic64_set and all
the places that should be using WRITE_ONCE but aren't ;~)
Will
From: Peter Zijlstra
> Sent: 11 July 2018 12:10
..
> Adding SYNC to WRITE_ONCE()/atomic* will hurt performance lots and will
> ultimately not guarantee anything more; and as Will said, keep you
> chasing dragons where people forgot to use WRITE_ONCE() where they maybe
> should've.
Also WRITE_ONCE() is there to solve an entirely different problem.
If you have a function that does:
<lots of code without any function calls>
foo->bar = 1;
the compiler is allowed to write other (unrelated) values
to foo->bar in the generated code for <lots of code>.
A long time ago I used a compiler that managed to convert:

	if (foo->bar == 2)
		foo->bar = 3;

into:

	if (foo->bar == 2) {
		foo->bar = 0;
		foo->bar = 3;
	}

When an interrupt read the value 0, a lot of linked lists got screwed
up. WRITE_ONCE() is there to ensure that doesn't happen.
(In my case 'foo' was a 2-bit wide bitfield, and I suspect you
can't use WRITE_ONCE() on bitfields.)
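For a plain scalar field WRITE_ONCE() boils down to a volatile store,
so the compiler has to emit exactly the store you asked for and the
transient 0 above can't appear. Assuming foo->bar were an ordinary int:

	if (READ_ONCE(foo->bar) == 2)
		WRITE_ONCE(foo->bar, 3);	/* exactly one store of 3 */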
David