2016-11-18 02:22:28

by Vikram Mulukutla

Subject: spin_lock behavior with ARM64 big.Little/HMP

Hello,

This isn't really a bug report, but just a description of a
frequency/IPC-dependent behavior that I'm curious if we should worry
about. The behavior is exposed by questionable design, so I'm leaning
towards don't-care.

Consider these threads running in parallel on two ARM64 CPUs running
mainline Linux:

(Ordering of lines between the two columns does not indicate a sequence
of execution. Assume flag=0 initially.)

LittleARM64_CPU @ 300MHz (e.g. A53)   | BigARM64_CPU @ 1.5GHz (e.g. A57)
--------------------------------------+---------------------------------
spin_lock_irqsave(s)                  | local_irq_save()
/* critical section */                |
flag = 1                              | spin_lock(s)
spin_unlock_irqrestore(s)             | while (!flag) {
                                      |     spin_unlock(s)
                                      |     cpu_relax();
                                      |     spin_lock(s)
                                      | }
                                      | spin_unlock(s)
                                      | local_irq_restore()

I see a livelock occurring where the LittleCPU is never able to acquire
the lock, and the BigCPU is stuck forever waiting on 'flag' to be set.

Even with ticket spinlocks, this bit of code can cause a livelock (or
very long delays) if the BigCPU runs fast enough. AFAICS this can only
happen if the LittleCPU is unable to put its ticket in the queue (i.e.,
increment the 'next' field) because the store-exclusive keeps failing.
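
To make the failure mode concrete, here is a simplified C model of a
ticket lock (a sketch only, not the kernel's implementation; the actual
arm64 code is the hand-written LL/SC assembly quoted later in this
thread):

struct ticket_lock {
	unsigned short owner;	/* ticket currently being served */
	unsigned short next;	/* next ticket to hand out */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	/*
	 * On arm64 this atomic RMW compiles to a load-exclusive/
	 * store-exclusive (ldaxr/stxr) loop. Each time another CPU
	 * writes the lock word in between, the exclusive monitor is
	 * cleared and the store-exclusive fails. A fast CPU doing
	 * back-to-back lock/unlock cycles can starve a slow CPU right
	 * here: the slow CPU never obtains a ticket, so the ticket
	 * lock's FIFO fairness never gets a chance to kick in.
	 */
	unsigned short me = __atomic_fetch_add(&l->next, 1,
					       __ATOMIC_ACQUIRE);

	while (__atomic_load_n(&l->owner, __ATOMIC_ACQUIRE) != me)
		;	/* spin until our ticket comes up */
}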

The problem is not present on symmetric SMP (all CPUs at the same
frequency), and is mitigated by adding enough additional clock cycles
between the unlock and the lock in the loop running on the BigCPU. On
big.LITTLE, if both threads are scheduled on the same cluster within
the same clock domain, the problem is avoided.

Now the infinite loop may seem like questionable design, but the
problem isn't entirely hypothetical; if the BigCPU calls
hrtimer_cancel() with interrupts disabled, this scenario can result if
the hrtimer is about to run on a LittleCPU. It's of course possible
that there's just enough intervening code for the problem to not occur.
At the very least it seems that loops like the one running on the
BigCPU above should come with a WARN_ON(irqs_disabled()) or with a
sufficient udelay() instead of the cpu_relax().
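
Something like this sketch is what I have in mind (the 10us is an
arbitrary illustrative value; the delay actually required would depend
on the frequency ratio between the clusters):

	WARN_ON(irqs_disabled());	/* polling with IRQs off risks
					 * livelock on big.LITTLE */
	spin_lock(s);
	while (!flag) {
		spin_unlock(s);
		udelay(10);		/* unlike cpu_relax(), leaves a
					 * window wide enough for the
					 * slower CPU to win the race */
		spin_lock(s);
	}
	spin_unlock(s);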

Thoughts?

Thanks,
Vikram


2016-11-18 10:30:52

by Sudeep Holla

Subject: Re: spin_lock behavior with ARM64 big.Little/HMP

Hi Vikram,

On 18/11/16 02:22, Vikram Mulukutla wrote:
> Hello,
>
> This isn't really a bug report, but just a description of a
> frequency/IPC-dependent behavior that I'm curious if we should worry
> about. The behavior is exposed by questionable design, so I'm leaning
> towards don't-care.
>
> Consider these threads running in parallel on two ARM64 CPUs running
> mainline
> Linux:
>

Are you seeing this behavior with the mainline kernel on any platform?
We have a sort of workaround for this.

> (Ordering of lines between the two columns does not indicate a sequence
> of execution. Assume flag=0 initially.)
>
> LittleARM64_CPU @ 300MHz (e.g. A53)   | BigARM64_CPU @ 1.5GHz (e.g. A57)
> --------------------------------------+---------------------------------
> spin_lock_irqsave(s)                  | local_irq_save()
> /* critical section */                |
> flag = 1                              | spin_lock(s)
> spin_unlock_irqrestore(s)             | while (!flag) {
>                                       |     spin_unlock(s)
>                                       |     cpu_relax();
>                                       |     spin_lock(s)
>                                       | }
>                                       | spin_unlock(s)
>                                       | local_irq_restore()
>
> I see a livelock occurring where the LittleCPU is never able to acquire the
> lock, and the BigCPU is stuck forever waiting on 'flag' to be set.
>

Yes, we saw this issue 3 years back on TC2, which has A7 (lowest
frequency of 300MHz, IIRC) and A15 (1.2GHz) clusters. We were observing
that inter-cluster events were missed since the two clusters operate at
different frequencies (details below).

The hardware recommendation is that there should be glue logic between
the two clusters which captures events from one cluster and replays
them on the other if it is operating at a different frequency.

Generally, EVENTO from cluster 1 is connected to the EVENTI of cluster
2 and vice versa. The only extra logic required is a double
synchronizer in the receiving clock domain.

This issue arises in practice if the synchronizer is missing and
different CPUs hold EVENTO for different numbers of clock cycles.

However, there was a separate requirement to implement the timer event
stream in Linux for some user-space locking, and that indirectly helped
resolve the issue on TC2. The event stream feature is enabled by
default in Linux and should fix the issue, hence my question about
whether you still see it.
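
To illustrate the mechanism, here is a minimal sketch of the kind of
wait the event stream bounds (a simplification: the actual arm64 lock
code arms the exclusive monitor with a load-exclusive, so a store to
the lock word itself generates the wake-up event):

static inline void wait_until_set(int *flag)
{
	while (!READ_ONCE(*flag)) {
		/*
		 * WFE sleeps until an event arrives: an explicit SEV
		 * from another CPU/cluster, or, with
		 * CONFIG_ARM_ARCH_TIMER_EVTSTREAM enabled, the
		 * periodic timer event stream, which bounds the wait
		 * even if a cross-cluster event is lost.
		 */
		asm volatile("wfe" ::: "memory");
	}
}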

--
Regards,
Sudeep

2016-11-18 20:22:26

by Vikram Mulukutla

Subject: Re: spin_lock behavior with ARM64 big.Little/HMP


Hi Sudeep,

Thanks for taking a look!

On 2016-11-18 02:30, Sudeep Holla wrote:
> Hi Vikram,
>
> On 18/11/16 02:22, Vikram Mulukutla wrote:
>> Hello,
>>
>> This isn't really a bug report, but just a description of a
>> frequency/IPC-dependent behavior that I'm curious if we should worry
>> about. The behavior is exposed by questionable design, so I'm leaning
>> towards don't-care.
>>
>> Consider these threads running in parallel on two ARM64 CPUs running
>> mainline Linux:
>>
>
> Are you seeing this behavior with the mainline kernel on any platform?
> We have a sort of workaround for this.
>

If I understand that workaround correctly, the ARM timer event stream
is used to periodically wake up CPUs that are waiting in WFE, is that
right? I think my scenario below may be different because the LittleCPU
doesn't actually wait on a WFE event in the loop that is trying to
increment lock->next, i.e. it's stuck in the following loop:

ARM64_LSE_ATOMIC_INSN(
	/* LL/SC */
"	prfm	pstl1strm, %3\n"	/* prefetch lock word for store */
"1:	ldaxr	%w0, %3\n"		/* load-exclusive the lock word */
"	add	%w1, %w0, %w5\n"	/* bump the 'next' ticket field */
"	stxr	%w2, %w1, %3\n"		/* store-exclusive: fails if the */
"	cbnz	%w2, 1b\n",		/* monitor was cleared; retry @1 */


I have been testing internal platforms; I'll try to test on something
publicly available that's b.L. In any case, the timer event stream was
enabled when I tried this out.

>> (Ordering of lines between the two columns does not indicate a sequence
>> of execution. Assume flag=0 initially.)
>>
>> LittleARM64_CPU @ 300MHz (e.g. A53)   | BigARM64_CPU @ 1.5GHz (e.g. A57)
>> --------------------------------------+---------------------------------
>> spin_lock_irqsave(s)                  | local_irq_save()
>> /* critical section */                |
>> flag = 1                              | spin_lock(s)
>> spin_unlock_irqrestore(s)             | while (!flag) {
>>                                       |     spin_unlock(s)
>>                                       |     cpu_relax();
>>                                       |     spin_lock(s)
>>                                       | }
>>                                       | spin_unlock(s)
>>                                       | local_irq_restore()
>>

[...]

Thanks,
Vikram

2016-11-21 15:21:41

by Sudeep Holla

Subject: Re: spin_lock behavior with ARM64 big.Little/HMP



On 18/11/16 20:22, Vikram Mulukutla wrote:
>
> Hi Sudeep,
>
> Thanks for taking a look!
>
> On 2016-11-18 02:30, Sudeep Holla wrote:
>> Hi Vikram,
>>
>> On 18/11/16 02:22, Vikram Mulukutla wrote:
>>> Hello,
>>>
>>> This isn't really a bug report, but just a description of a
>>> frequency/IPC-dependent behavior that I'm curious if we should worry
>>> about. The behavior is exposed by questionable design, so I'm leaning
>>> towards don't-care.
>>>
>>> Consider these threads running in parallel on two ARM64 CPUs running
>>> mainline Linux:
>>>
>>
>> Are you seeing this behavior with the mainline kernel on any platform?
>> We have a sort of workaround for this.
>>
>
> If I understand that workaround correctly, the ARM timer event stream
> is used to periodically wake up CPUs that are waiting in WFE, is that
> right? I think my scenario below may be different because the LittleCPU
> doesn't actually wait on a WFE event in the loop that is trying to
> increment lock->next, i.e. it's stuck in the following loop:
>
> ARM64_LSE_ATOMIC_INSN(
> 	/* LL/SC */
> "	prfm	pstl1strm, %3\n"	/* prefetch lock word for store */
> "1:	ldaxr	%w0, %3\n"		/* load-exclusive the lock word */
> "	add	%w1, %w0, %w5\n"	/* bump the 'next' ticket field */
> "	stxr	%w2, %w1, %3\n"		/* store-exclusive: fails if the */
> "	cbnz	%w2, 1b\n",		/* monitor was cleared; retry @1 */
>

Interesting. Just curious: is this an r0p0/r0p1 A53? If so, is the
erratum 819472 workaround enabled?

--
Regards,
Sudeep

2017-03-30 04:12:26

by Vikram Mulukutla

Subject: Re: spin_lock behavior with ARM64 big.Little/HMP


Hi Sudeep,

>
> Interesting. Just curious: is this an r0p0/r0p1 A53? If so, is the
> erratum 819472 workaround enabled?

Sorry for bringing this up after the loo-ong delay, but I've been
assured that the A53 involved is > r0p1. I've also confirmed this
problem on multiple internal platforms, and I'm pretty sure that it
occurs on any b.L out there today. Also, we found the same problematic
lock design used in the workqueue code in the kernel, causing the same
livelock. It's very very rare and requires a perfect set of
circumstances.

If it would help I can provide a unit test if you folks would be
generous enough to test it on the latest Juno or something b.L that's
also upstream.
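
If it helps to picture it, the test would be roughly this shape (a
sketch only: the CPU numbers are assumptions about the board's
topology, and error handling is omitted):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/delay.h>

static DEFINE_SPINLOCK(test_lock);
static int flag;

/* Runs on an assumed little CPU: take the lock once, set the flag. */
static int little_thread(void *unused)
{
	unsigned long irqflags;

	spin_lock_irqsave(&test_lock, irqflags);
	flag = 1;			/* the critical section */
	spin_unlock_irqrestore(&test_lock, irqflags);
	return 0;
}

/* Runs on an assumed big CPU: the problematic poll-under-lock loop. */
static int big_thread(void *unused)
{
	local_irq_disable();
	spin_lock(&test_lock);
	while (!flag) {
		spin_unlock(&test_lock);
		cpu_relax();		/* tight relock window: livelock */
		spin_lock(&test_lock);
	}
	spin_unlock(&test_lock);
	local_irq_enable();
	return 0;
}

static int __init livelock_test_init(void)
{
	struct task_struct *big, *little;

	/* Start the big CPU's poller first so it is already spinning. */
	big = kthread_create(big_thread, NULL, "big_poller");
	kthread_bind(big, 4);		/* assumption: CPU4 is a big core */
	wake_up_process(big);

	msleep(10);

	little = kthread_create(little_thread, NULL, "little_setter");
	kthread_bind(little, 0);	/* assumption: CPU0 is little */
	wake_up_process(little);
	return 0;
}
module_init(livelock_test_init);

MODULE_LICENSE("GPL");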

Thanks,
Vikram

2017-03-30 10:24:12

by Sudeep Holla

Subject: Re: spin_lock behavior with ARM64 big.Little/HMP



On 30/03/17 05:12, Vikram Mulukutla wrote:
>
> Hi Sudeep,
>
>>
>> Interesting. Just curious: is this an r0p0/r0p1 A53? If so, is the
>> erratum 819472 workaround enabled?
>
> Sorry for bringing this up after the loo-ong delay, but I've been
> assured that the A53 involved is > r0p1. I've also confirmed this
> problem on multiple internal platforms, and I'm pretty sure that it
> occurs on any b.L out there today. Also, we found the same problematic
> lock design used in the workqueue code in the kernel, causing the same
> livelock. It's very very rare and requires a perfect set of circumstances.
>
> If it would help I can provide a unit test if you folks would be
> generous enough to test it on the latest Juno or something b.L that's
> also upstream.
>

Sure, please do share the unit test. I will give that a try on Juno.

--
Regards,
Sudeep