2017-07-26 14:16:45

by Thomas Gleixner

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On Wed, 26 Jul 2017, qiaozhou wrote:

Cc'ed ARM folks.

> I want to ask you for suggestions about how to fix one contention between
> expire_timers and try_to_del_timer_sync. Thanks in advance.
> The issue is a hard-lockup issue detected on our platform(arm64, one cluster
> with 4 a53, and the other with 4 a73). The sequence is as below:
> 1. core0 checks expired timers, and wakes up a process on a73, in step 1.3.
> 2. core4 starts to run the process and try to delete the timer in sync method,
> in step 2.1.
> 3. Before core0 can run to 1.4 and get the lock, core4 has already run from
> 2.1 to 2.6 in while loop. And in step 2.4, it fails since the timer is still
> the running timer.
>
> core0(a53): core4(a73):
> run_timer_softirq
> __run_timers
> spin_lock_irq(&base->lock)
> expire_timers()
> 1.1: base->running_timer = timer;
> 1.2: spin_unlock_irq(&base->lock);
> 1.3: call_timer_fn(timer, fn, data);
> schedule()
> 1.4: spin_lock_irq(&base->lock); del_timer_sync(&timer)
> 1.5: back to 1.1 in while loop
> 2.1: try_to_del_timer_sync()
> 2.2: get_timer_base()
> 2.3: spin_lock_irqsave(&base->lock, flags);
> 2.4: check "base->running_timer != timer"
> 2.5: spin_unlock_irqrestore(&base->lock, flags);
> 2.6:cpu_relax(); (back to 2.1 in while loop)

This is horribly formatted.

> Core0 runs @832MHz, and core4 runs @1.8GHz. A73 is also much more powerful in
> design than a53. So the actual running is that a73 keeps looping in spin_lock
> -> check running_timer -> spin_unlock -> spin_lock ->...., while a53 can't
> even succeed to complete one store exclusive instruction, before it can enter
> WFE to wait for lock release. It just keeps looping in +30 and+3c in below
> asm, due to stxr fails.
>
> So it deadloop, a53 can't jump out ldaxr/stxr loop, while a73 can't pass the
> running_timer check. Finally the hard-lockup is triggered.(BTW, the issue
> needs a long time to occur.)
>
> <_raw_spin_lock_irq+0x2c>: prfm pstl1strm, [x19]
> /<_raw_spin_lock_irq+0x30>: ldaxr w0, [x19]//
> //<_raw_spin_lock_irq+0x34>: add w1, w0, #0x10, lsl #12//
> //<_raw_spin_lock_irq+0x38>: stxr w2, w1, [x19]//
> //<_raw_spin_lock_irq+0x3c>: cbnz w2, 0xffffff80089f52e0
> <_raw_spin_lock_irq+0x30>/
> <_raw_spin_lock_irq+0x40>: eor w1, w0, w0, ror #16
> <_raw_spin_lock_irq+0x44>: cbz w1, 0xffffff80089f530c
> <_raw_spin_lock_irq+0x5c>
> <_raw_spin_lock_irq+0x48>: sevl
> <_raw_spin_lock_irq+0x4c>: wfe
> <_raw_spin_lock_irq+0x50>: ldaxrh w2, [x19]
> <_raw_spin_lock_irq+0x54>: eor w1, w2, w0, lsr #16
> <_raw_spin_lock_irq+0x58>: cbnz w1, 0xffffff80089f52fc
> <_raw_spin_lock_irq+0x4c>
>
> The loop on a53 only has 4 instructions, and loop on a73 has ~100
> instructions. Still a53 has no chance to store exclusive successfully. It may
> be related with ldaxr/stxr cost, core frequency, core number etc.
>
> I have no idea of fixing it in spinlock/unlock implement, so I try to fix it
> in timer driver. I prepared a raw patch, not sure it's the correct direction
> to solve this issue. Could you help to give some suggestions? Thanks.
>
> From fb8fbfeb8f9f92fdadd9920ce234fd433bc883e1 Mon Sep 17 00:00:00 2001
> From: Qiao Zhou <[email protected]>
> Date: Wed, 26 Jul 2017 20:30:33 +0800
> Subject: [PATCH] RFC: timers: try to fix contention between expire_timers and
> del_timer_sync
>
> try to fix contention between expire_timers and del_timer_sync by
> adding TIMER_WOKEN status, which means that the timer has already
> woken up corresponding process.

Timers do a lot more things than waking up a process.

Just because this happens in a particular case for you where a wakeup is
involved does not mean that this is generally true.

And that whole flag business is completely broken. There is a comment in
call_timer_fn() which says:

/*
* It is permissible to free the timer from inside the
* function that is called from it ....

So you _CANNOT_ touch the timer after the function has returned. It might be gone
already.
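
As an illustration (a made-up example, not code from this thread): the
callback is allowed to free the very object that embeds the timer, e.g.:

#include <linux/timer.h>
#include <linux/slab.h>

struct my_obj {
	struct timer_list timer;
	/* ... */
};

/* Timer callback (old-style unsigned long argument, as in call_timer_fn
 * above). After kfree() the embedded timer_list no longer exists, so
 * nothing may dereference the timer once the callback has returned. */
static void my_timeout(unsigned long data)
{
	struct my_obj *obj = (struct my_obj *)data;

	/* ... handle the timeout ... */
	kfree(obj);
}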

For that particular timer case we can clear base->running_timer w/o the
lock held (see patch below), but this kind of

lock -> test -> unlock -> retry

loops are all over the place in the kernel, so this is going to hurt you
sooner rather than later in some other place.
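
For reference, the loop on the del_timer_sync() side is roughly this
shape (simplified sketch, not the exact kernel source):

#include <linux/timer.h>

/*
 * Sketch: try_to_del_timer_sync() takes base->lock, checks
 * base->running_timer, drops the lock and returns -1 while the callback
 * is still running, so the caller just spins here.
 */
static int del_timer_sync_sketch(struct timer_list *timer)
{
	for (;;) {
		int ret = try_to_del_timer_sync(timer);	/* lock -> test -> unlock */

		if (ret >= 0)
			return ret;
		cpu_relax();				/* retry */
	}
}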

Thanks,

tglx

8<------------

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
if (timer->flags & TIMER_IRQSAFE) {
raw_spin_unlock(&base->lock);
call_timer_fn(timer, fn, data);
+ base->running_timer = NULL;
raw_spin_lock(&base->lock);
} else {
raw_spin_unlock_irq(&base->lock);
call_timer_fn(timer, fn, data);
+ base->running_timer = NULL;
raw_spin_lock_irq(&base->lock);
}
}


2017-07-27 01:30:00

by Zhou Qiao(周侨)

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync


On 2017-07-26 22:16, Thomas Gleixner wrote:
> On Wed, 26 Jul 2017, qiaozhou wrote:
>
> Cc'ed ARM folks.
>
>> I want to ask you for suggestions about how to fix one contention between
>> expire_timers and try_to_del_timer_sync. Thanks in advance.
>> The issue is a hard-lockup issue detected on our platform(arm64, one cluster
>> with 4 a53, and the other with 4 a73). The sequence is as below:
>> 1. core0 checks expired timers, and wakes up a process on a73, in step 1.3.
>> 2. core4 starts to run the process and try to delete the timer in sync method,
>> in step 2.1.
>> 3. Before core0 can run to 1.4 and get the lock, core4 has already run from
>> 2.1 to 2.6 in while loop. And in step 2.4, it fails since the timer is still
>> the running timer.
>>
>> core0(a53): core4(a73):
>> run_timer_softirq
>> __run_timers
>> spin_lock_irq(&base->lock)
>> expire_timers()
>> 1.1: base->running_timer = timer;
>> 1.2: spin_unlock_irq(&base->lock);
>> 1.3: call_timer_fn(timer, fn, data);
>> schedule()
>> 1.4: spin_lock_irq(&base->lock); del_timer_sync(&timer)
>> 1.5: back to 1.1 in while loop
>> 2.1: try_to_del_timer_sync()
>> 2.2: get_timer_base()
>> 2.3: spin_lock_irqsave(&base->lock, flags);
>> 2.4: check "base->running_timer != timer"
>> 2.5: spin_unlock_irqrestore(&base->lock, flags);
>> 2.6:cpu_relax(); (back to 2.1 in while loop)
> This is horribly formatted.
Sorry for the formatting. Here is the updated calling sequence for the two cores.

core0(a53):
run_timer_softirq()
  __run_timers()
    spin_lock_irq(&base->lock)
    expire_timers()
      1.1: base->running_timer = timer;
      1.2: spin_unlock_irq(&base->lock);
      1.3: call_timer_fn(timer, fn, data);
      1.4: spin_lock_irq(&base->lock);
      1.5: back to 1.1 in while loop

core4(a73):
schedule()
  del_timer_sync(&timer)
    2.1: try_to_del_timer_sync()
    2.2: get_timer_base()
    2.3: spin_lock_irqsave(&base->lock, flags);
    2.4: check "base->running_timer != timer"
    2.5: spin_unlock_irqrestore(&base->lock, flags);
    2.6: cpu_relax(); (back to 2.1 in while loop)

>
>> Core0 runs @832MHz, and core4 runs @1.8GHz. A73 is also much more powerful in
>> design than a53. So the actual running is that a73 keeps looping in spin_lock
>> -> check running_timer -> spin_unlock -> spin_lock ->...., while a53 can't
>> even succeed to complete one store exclusive instruction, before it can enter
>> WFE to wait for lock release. It just keeps looping in +30 and+3c in below
>> asm, due to stxr fails.
>>
>> So it deadloop, a53 can't jump out ldaxr/stxr loop, while a73 can't pass the
>> running_timer check. Finally the hard-lockup is triggered.(BTW, the issue
>> needs a long time to occur.)
>>
>> <_raw_spin_lock_irq+0x2c>: prfm pstl1strm, [x19]
>> /<_raw_spin_lock_irq+0x30>: ldaxr w0, [x19]//
>> //<_raw_spin_lock_irq+0x34>: add w1, w0, #0x10, lsl #12//
>> //<_raw_spin_lock_irq+0x38>: stxr w2, w1, [x19]//
>> //<_raw_spin_lock_irq+0x3c>: cbnz w2, 0xffffff80089f52e0
>> <_raw_spin_lock_irq+0x30>/
>> <_raw_spin_lock_irq+0x40>: eor w1, w0, w0, ror #16
>> <_raw_spin_lock_irq+0x44>: cbz w1, 0xffffff80089f530c
>> <_raw_spin_lock_irq+0x5c>
>> <_raw_spin_lock_irq+0x48>: sevl
>> <_raw_spin_lock_irq+0x4c>: wfe
>> <_raw_spin_lock_irq+0x50>: ldaxrh w2, [x19]
>> <_raw_spin_lock_irq+0x54>: eor w1, w2, w0, lsr #16
>> <_raw_spin_lock_irq+0x58>: cbnz w1, 0xffffff80089f52fc
>> <_raw_spin_lock_irq+0x4c>
>>
>> The loop on a53 only has 4 instructions, and loop on a73 has ~100
>> instructions. Still a53 has no chance to store exclusive successfully. It may
>> be related with ldaxr/stxr cost, core frequency, core number etc.
>>
>> I have no idea of fixing it in spinlock/unlock implement, so I try to fix it
>> in timer driver. I prepared a raw patch, not sure it's the correct direction
>> to solve this issue. Could you help to give some suggestions? Thanks.
>>
>> From fb8fbfeb8f9f92fdadd9920ce234fd433bc883e1 Mon Sep 17 00:00:00 2001
>> From: Qiao Zhou <[email protected]>
>> Date: Wed, 26 Jul 2017 20:30:33 +0800
>> Subject: [PATCH] RFC: timers: try to fix contention between expire_timers and
>> del_timer_sync
>>
>> try to fix contention between expire_timers and del_timer_sync by
>> adding TIMER_WOKEN status, which means that the timer has already
>> woken up corresponding process.
> Timers do a lot more things than waking up a process.
>
> Just because this happens in a particular case for you where a wakeup is
> involved this does not mean, that this is generally true.
Yes, you're right. My intention was that once the timer function has been
executed, maybe we could do something at that point.
>
> And that whole flag business is completely broken. There is a comment in
> call_timer_fn() which says:
>
> /*
> * It is permissible to free the timer from inside the
> * function that is called from it ....
>
> So you _CANNOT_ touch timer after the function returned. It might be gone
> already.
You're right. Touching the timer here is risky.
>
> For that particular timer case we can clear base->running_timer w/o the
> lock held (see patch below), but this kind of
>
> lock -> test -> unlock -> retry
>
> loops are all over the place in the kernel, so this is going to hurt you
> sooner than later in some other place.
It's true. This is the way spinlocks are normally and widely used in the
kernel. I'll also ask ARM experts whether we can do something to avoid
or reduce the chance of such an issue. ARMv8.1 has a single instruction
(ldadda) to replace the ldaxr/stxr loop. Hopefully it can improve things
and reduce the chance.
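
To illustrate the difference (a standalone sketch for reference, not the
kernel's actual lock code; the function names are made up and the LSE
version needs -march=armv8.1-a):

/*
 * With LL/SC the store-exclusive can keep failing if another CPU keeps
 * stealing the cache line; the ARMv8.1 LSE form does the whole
 * read-modify-write as one atomic instruction.
 */
static inline unsigned int fetch_add_llsc(unsigned int *p, unsigned int val)
{
	unsigned int old, new;
	unsigned long tmp;

	asm volatile(
	"1:	ldaxr	%w0, %3\n"		/* load-exclusive             */
	"	add	%w1, %w0, %w4\n"
	"	stxr	%w2, %w1, %3\n"		/* may fail under contention  */
	"	cbnz	%w2, 1b\n"		/* ... and then we loop       */
	: "=&r" (old), "=&r" (new), "=&r" (tmp), "+Q" (*p)
	: "r" (val)
	: "memory");
	return old;
}

static inline unsigned int fetch_add_lse(unsigned int *p, unsigned int val)
{
	unsigned int old;

	asm volatile(
	"	ldadda	%w2, %w0, %1\n"		/* single atomic RMW, acquire */
	: "=&r" (old), "+Q" (*p)
	: "r" (val)
	: "memory");
	return old;
}
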
>
> Thanks,
>
> tglx
>
> 8<------------
>
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
> if (timer->flags & TIMER_IRQSAFE) {
> raw_spin_unlock(&base->lock);
> call_timer_fn(timer, fn, data);
> + base->running_timer = NULL;
> raw_spin_lock(&base->lock);
> } else {
> raw_spin_unlock_irq(&base->lock);
> call_timer_fn(timer, fn, data);
> + base->running_timer = NULL;
> raw_spin_lock_irq(&base->lock);
> }
> }
It should work for this particular issue and I'll test it. Previously I
thought it was unsafe to touch base->running_timer without holding the lock.

Thanks a lot for all the suggestions.

Best Regards
Qiao

2017-07-27 15:13:58

by Will Deacon

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On Thu, Jul 27, 2017 at 09:29:20AM +0800, qiaozhou wrote:
> On 2017-07-26 22:16, Thomas Gleixner wrote:
> >--- a/kernel/time/timer.c
> >+++ b/kernel/time/timer.c
> >@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
> > if (timer->flags & TIMER_IRQSAFE) {
> > raw_spin_unlock(&base->lock);
> > call_timer_fn(timer, fn, data);
> >+ base->running_timer = NULL;
> > raw_spin_lock(&base->lock);
> > } else {
> > raw_spin_unlock_irq(&base->lock);
> > call_timer_fn(timer, fn, data);
> >+ base->running_timer = NULL;
> > raw_spin_lock_irq(&base->lock);
> > }
> > }
> It should work for this particular issue and I'll test it. Previously I
> thought it was unsafe to touch base->running_timer without holding lock.

I think it works out in practice because base->lock and base->running_timer
share a cacheline, so they end up being ordered correctly. We should probably
be using READ_ONCE/WRITE_ONCE for accessing the running_timer field though.
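
Something like this is what I mean (sketch only, hypothetical helper
names; the point is just where the _ONCE annotations would go):

/* Writer side: expire_timers(), after the callback has returned,
 * storing without base->lock held. */
static void clear_running_timer(struct timer_base *base)
{
	WRITE_ONCE(base->running_timer, NULL);
}

/* Reader side: try_to_del_timer_sync(), called with base->lock held. */
static bool timer_is_running(struct timer_base *base, struct timer_list *timer)
{
	return READ_ONCE(base->running_timer) == timer;
}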

One thing I don't get though, is why try_to_del_timer_sync needs to check
base->running_timer at all. Given that it holds the base->lock, can't it
be the person that sets it to NULL?

Will

2017-07-27 15:19:05

by Thomas Gleixner

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On Thu, 27 Jul 2017, Will Deacon wrote:
> On Thu, Jul 27, 2017 at 09:29:20AM +0800, qiaozhou wrote:
> > On 2017-07-26 22:16, Thomas Gleixner wrote:
> > >--- a/kernel/time/timer.c
> > >+++ b/kernel/time/timer.c
> > >@@ -1301,10 +1301,12 @@ static void expire_timers(struct timer_b
> > > if (timer->flags & TIMER_IRQSAFE) {
> > > raw_spin_unlock(&base->lock);
> > > call_timer_fn(timer, fn, data);
> > >+ base->running_timer = NULL;
> > > raw_spin_lock(&base->lock);
> > > } else {
> > > raw_spin_unlock_irq(&base->lock);
> > > call_timer_fn(timer, fn, data);
> > >+ base->running_timer = NULL;
> > > raw_spin_lock_irq(&base->lock);
> > > }
> > > }
> > It should work for this particular issue and I'll test it. Previously I
> > thought it was unsafe to touch base->running_timer without holding lock.
>
> I think it works out in practice because base->lock and base->running_timer
> share a cacheline, so end up being ordered correctly. We should probably be
> using READ_ONCE/WRITE_ONCE for accessing the running_time field though.
>
> One thing I don't get though, is why try_to_del_timer_sync needs to check
> base->running_timer at all. Given that it holds the base->lock, can't it
> be the person that sets it to NULL?

No. The timer callback code does:

base->running_timer = timer;
spin_unlock(base->lock);
fn(timer);
spin_lock(base->lock);
base->running_timer = NULL;

So for del_timer_sync() the only way to figure out whether the timer
callback is running is to check base->running_timer. We cannot store state
in the timer itself because we cannot clear that state when the callback
returns, as the timer might have been freed in the callback. Yes, that's
nasty, but reality.

Thanks,

tglx


2017-07-28 01:10:38

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



cc: Sudeep Holla

On 2017-07-26 18:29, qiaozhou wrote:
> On 2017-07-26 22:16, Thomas Gleixner wrote:
>> On Wed, 26 Jul 2017, qiaozhou wrote:
>>
>> Cc'ed ARM folks.
>>

<snip>

>>
>> For that particular timer case we can clear base->running_timer w/o
>> the
>> lock held (see patch below), but this kind of
>>
>> lock -> test -> unlock -> retry
>>
>> loops are all over the place in the kernel, so this is going to hurt
>> you
>> sooner than later in some other place.
> It's true. This is the way spinlock is used normally and widely in
> kernel. I'll also ask ARM experts whether we can do something to avoid
> or reduce the chance of such issue. ARMv8.1 has one single
> instruction(ldadda) to replace the ldaxr/stxr loop. Hope it can
> improve and reduce the chance.

I think we should have this discussion now - I brought this up earlier [1]
and I promised a test case that I completely forgot about - but here it
is (attached). Essentially a Big CPU in an acquire-check-release loop
will have an unfair advantage over a little CPU concurrently attempting
to acquire the same lock, in spite of the ticket implementation. If the
Big CPU needs the little CPU to make forward progress: livelock.

We've run into the same loop construct in other spots in the kernel and
the reason that a real symptom is so rare is that the retry-loop on the
'Big' CPU needs to be interrupted just once by say an IRQ/FIQ and the
live-lock is broken. If the entire retry loop is within an
interrupt-disabled critical section then the odds of live-locking are
much higher.

An example of the problem on a previous kernel is here [2]. Changes to
the workqueue code since may have fixed this particular instance.

One solution was to use udelay(1) in such loops instead of cpu_relax(), but
that's not very 'relaxing'. I'm not sure if there's something we could do
within the ticket spin-lock implementation to deal with this.

Note that I ran my test on a 4.9 kernel so that didn't include any spinlock
implementation changes since then. The test schedules two threads, one on
a big CPU and one on a little CPU. The big CPU thread does the
lock/unlock/retry loop for a full 1 second with interrupts disabled, while
the little CPU attempts to acquire the same lock while enabling interrupts
after every successful lock+unlock. With unfairness, the little CPU may
take up to 1 second (or several milliseconds at the least) just to acquire
the lock once. This varies depending on the IPC difference and frequencies
of the big and little ARM64 CPUs:

Big cpu frequency | Little cpu frequency | Max time taken by little to acquire lock
==================|======================|==========================================
2GHz              | 1.5GHz               | 133 microseconds
2GHz              | 300MHz               | 734 milliseconds
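
For readers without the attachment, the test is roughly the following
shape (my own minimal sketch, not the attached patch; CPU numbers, names
and the duration are placeholders, and error handling is omitted):

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>
#include <linux/timekeeping.h>
#include <linux/sched.h>

#define TEST_DURATION	(1 * HZ)		/* placeholder: 1 second */

static DEFINE_SPINLOCK(test_lock);
static u64 little_max_ns;
static unsigned long little_count, big_count;

/* Big CPU: hammer the lock with interrupts disabled for the whole run. */
static int big_thread(void *unused)
{
	unsigned long flags, end = jiffies + TEST_DURATION;

	local_irq_save(flags);
	while (time_before(jiffies, end)) {
		spin_lock(&test_lock);
		spin_unlock(&test_lock);
		big_count++;
		cpu_relax();
	}
	local_irq_restore(flags);
	return 0;
}

/* Little CPU: re-enable interrupts after every lock+unlock and record
 * the worst-case time spent waiting for the lock. */
static int little_thread(void *unused)
{
	unsigned long end = jiffies + TEST_DURATION;

	while (time_before(jiffies, end)) {
		u64 t0 = ktime_get_ns(), delta;

		spin_lock_irq(&test_lock);
		delta = ktime_get_ns() - t0;
		if (delta > little_max_ns)
			little_max_ns = delta;
		spin_unlock_irq(&test_lock);
		little_count++;
	}
	return 0;
}

static int __init lock_fairness_init(void)
{
	struct task_struct *little = kthread_create(little_thread, NULL, "lock_little");
	struct task_struct *big = kthread_create(big_thread, NULL, "lock_big");

	kthread_bind(little, 0);		/* a little CPU */
	kthread_bind(big, 4);			/* a big CPU */
	wake_up_process(little);
	wake_up_process(big);
	return 0;
}
module_init(lock_fairness_init);
MODULE_LICENSE("GPL");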

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/17/934
[2] - https://goo.gl/uneFjt

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Attachments:
0001-measure-spinlock-fairness-across-differently-capable.patch (7.11 kB)

2017-07-28 09:28:23

by Peter Zijlstra

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

> I think we should have this discussion now - I brought this up earlier [1]
> and I promised a test case that I completely forgot about - but here it
> is (attached). Essentially a Big CPU in an acquire-check-release loop
> will have an unfair advantage over a little CPU concurrently attempting
> to acquire the same lock, in spite of the ticket implementation. If the Big
> CPU needs the little CPU to make forward progress : livelock.

This needs to be fixed in hardware. There really isn't anything the
software can sanely do about it.

It also doesn't have anything to do with the spinlock implementation.
Ticket or not, it's a fundamental problem of LL/SC. Any situation where
we use atomics for fwd progress guarantees this can happen.

The little core (or really any core) should hold on to the locked
cacheline for a while and not insta relinquish it. Giving it a chance to
reach the SC.

2017-07-28 09:28:28

by Will Deacon

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
> On 2017-07-26 18:29, qiaozhou wrote:
> >On 2017-07-26 22:16, Thomas Gleixner wrote:
> >>On Wed, 26 Jul 2017, qiaozhou wrote:
> >>For that particular timer case we can clear base->running_timer w/o the
> >>lock held (see patch below), but this kind of
> >>
> >> lock -> test -> unlock -> retry
> >>
> >>loops are all over the place in the kernel, so this is going to hurt you
> >>sooner than later in some other place.
> >It's true. This is the way spinlock is used normally and widely in
> >kernel. I'll also ask ARM experts whether we can do something to avoid
> >or reduce the chance of such issue. ARMv8.1 has one single
> >instruction(ldadda) to replace the ldaxr/stxr loop. Hope it can
> >improve and reduce the chance.
>
> I think we should have this discussion now - I brought this up earlier [1]
> and I promised a test case that I completely forgot about - but here it
> is (attached). Essentially a Big CPU in an acquire-check-release loop
> will have an unfair advantage over a little CPU concurrently attempting
> to acquire the same lock, in spite of the ticket implementation. If the Big
> CPU needs the little CPU to make forward progress : livelock.
>
> We've run into the same loop construct in other spots in the kernel and
> the reason that a real symptom is so rare is that the retry-loop on the
> 'Big'
> CPU needs to be interrupted just once by say an IRQ/FIQ and the live-lock
> is broken. If the entire retry loop is within an interrupt-disabled critical
> section then the odds of live-locking are much higher.
>
> An example of the problem on a previous kernel is here [2]. Changes to the
> workqueue code since may have fixed this particular instance.
>
> One solution was to use udelay(1) in such loops instead of cpu_relax(), but
> that's not very 'relaxing'. I'm not sure if there's something we could do
> within the ticket spin-lock implementation to deal with this.

Does bodging cpu_relax to back-off to wfe after a while help? The event
stream will wake it up if nothing else does. Nasty patch below, but I'd be
interested to know whether or not it helps.

Will

--->8

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 64c9e78f9882..1f5a29c8612e 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -149,9 +149,11 @@ extern void release_thread(struct task_struct *);

unsigned long get_wchan(struct task_struct *p);

+void __cpu_relax(unsigned long pc);
+
static inline void cpu_relax(void)
{
- asm volatile("yield" ::: "memory");
+ __cpu_relax(_THIS_IP_);
}

/* Thread switching */
diff --git a/arch/arm64/kernel/arm64ksyms.c b/arch/arm64/kernel/arm64ksyms.c
index 67368c7329c0..be8a698ea680 100644
--- a/arch/arm64/kernel/arm64ksyms.c
+++ b/arch/arm64/kernel/arm64ksyms.c
@@ -72,6 +72,8 @@ EXPORT_SYMBOL(_mcount);
NOKPROBE_SYMBOL(_mcount);
#endif

+EXPORT_SYMBOL(__cpu_relax);
+
/* arm-smccc */
EXPORT_SYMBOL(__arm_smccc_smc);
EXPORT_SYMBOL(__arm_smccc_hvc);
diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
index 659ae8094ed5..c394c3704b7f 100644
--- a/arch/arm64/kernel/process.c
+++ b/arch/arm64/kernel/process.c
@@ -403,6 +403,31 @@ unsigned long get_wchan(struct task_struct *p)
return ret;
}

+static DEFINE_PER_CPU(u64, __cpu_relax_data);
+
+#define CPU_RELAX_WFE_THRESHOLD 10000
+void __cpu_relax(unsigned long pc)
+{
+ u64 new, old = raw_cpu_read(__cpu_relax_data);
+ u32 old_pc, new_pc;
+ bool wfe = false;
+
+ old_pc = (u32)old;
+ new = new_pc = (u32)pc;
+
+ if (old_pc == new_pc) {
+ if ((old >> 32) > CPU_RELAX_WFE_THRESHOLD) {
+ asm volatile("sevl; wfe; wfe\n" ::: "memory");
+ wfe = true;
+ } else {
+ new = old + (1UL << 32);
+ }
+ }
+
+ if (this_cpu_cmpxchg(__cpu_relax_data, old, new) == old && !wfe)
+ asm volatile("yield" ::: "memory");
+}
+
unsigned long arch_align_stack(unsigned long sp)
{
if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
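
(For anyone finding the packing hard to follow from the diff, here is the
same function restated with comments; this is just an annotated paraphrase
of the patch above, not additional code.)

void __cpu_relax(unsigned long pc)
{
	/* Per-CPU u64: low 32 bits = caller's PC, high 32 bits = call count. */
	u64 new, old = raw_cpu_read(__cpu_relax_data);
	u32 old_pc = (u32)old;			/* PC seen on the previous call */
	u32 new_pc = (u32)pc;			/* PC of this call site         */
	bool wfe = false;

	new = new_pc;				/* default: record PC, count 0  */

	if (old_pc == new_pc) {			/* still spinning at the same PC */
		if ((old >> 32) > CPU_RELAX_WFE_THRESHOLD) {
			/* Spun long enough: wait for the event stream. */
			asm volatile("sevl; wfe; wfe\n" ::: "memory");
			wfe = true;		/* count resets via 'new' above */
		} else {
			new = old + (1UL << 32);	/* bump the count       */
		}
	}

	/* Publish (pc, count); only yield if we did not already wfe. */
	if (this_cpu_cmpxchg(__cpu_relax_data, old, new) == old && !wfe)
		asm volatile("yield" ::: "memory");
}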

2017-07-28 19:09:41

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On 2017-07-28 02:28, Will Deacon wrote:
> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

<snip>

>>
>> I think we should have this discussion now - I brought this up earlier
>> [1]
>> and I promised a test case that I completely forgot about - but here
>> it
>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>> will have an unfair advantage over a little CPU concurrently
>> attempting
>> to acquire the same lock, in spite of the ticket implementation. If
>> the Big
>> CPU needs the little CPU to make forward progress : livelock.
>>

<snip>

>>
>> One solution was to use udelay(1) in such loops instead of
>> cpu_relax(), but
>> that's not very 'relaxing'. I'm not sure if there's something we could
>> do
>> within the ticket spin-lock implementation to deal with this.
>
> Does bodging cpu_relax to back-off to wfe after a while help? The event
> stream will wake it up if nothing else does. Nasty patch below, but I'd
> be
> interested to know whether or not it helps.
>
> Will
>
This does seem to help. Here's some data after 5 runs with and without
the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
117893us| 2349144| 2us| 6748236|
571260us| 2125651| 2us| 7643264|
19780us| 2392770| 2us| 5987203|
19948us| 2395413| 2us| 5977286|
19822us| 2429619| 2us| 5768252|
19888us| 2444940| 2us| 5675657|
=====================================================

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3us| 2737438| 2us| 6907147|
2us| 2742478| 2us| 6902241|
132us| 2745636| 2us| 6876485|
3us| 2744554| 2us| 6898048|
3us| 2741391| 2us| 6882901|
=====================================================

The patch also seems to have helped with fairness in general
allowing more work to be done if the CPU frequencies are more
closely matched (I don't know if this translates to real world
performance - probably not). The counter values are higher
with the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2us| 5240654| 1us| 5339009|
2us| 5287797| 97us| 5327073|
2us| 5237634| 1us| 5334694|
2us| 5236676| 88us| 5333582|
84us| 5285880| 84us| 5329489|
=====================================================

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
140us| 10449121| 1us| 11154596|
1us| 10757081| 1us| 11479395|
83us| 10237109| 1us| 10902557|
2us| 9871101| 1us| 10514313|
2us| 9758763| 1us| 10391849|
=====================================================


Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-07-28 19:11:38

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

On 2017-07-28 02:28, Peter Zijlstra wrote:
> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>
>> I think we should have this discussion now - I brought this up earlier
>> [1]
>> and I promised a test case that I completely forgot about - but here
>> it
>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>> will have an unfair advantage over a little CPU concurrently
>> attempting
>> to acquire the same lock, in spite of the ticket implementation. If
>> the Big
>> CPU needs the little CPU to make forward progress : livelock.
>
> This needs to be fixed in hardware. There really isn't anything the
> software can sanely do about it.
>
> It also doesn't have anything to do with the spinlock implementation.
> Ticket or not, its a fundamental problem of LL/SC. Any situation where
> we use atomics for fwd progress guarantees this can happen.
>

Agreed, it seems like trying to build a fair SW protocol over unfair HW.
But if we can minimally change such loop constructs to address this (all
instances I've seen so far use cpu_relax) it would save a lot of hours
spent debugging these problems. Lot of b.L devices out there :-)

It's also possible that such a workaround may help contention performance
since the big CPU may have to wait for say a tick before breaking out of
that loop (the non-livelock scenario where the entire loop isn't in a
critical section).

> The little core (or really any core) should hold on to the locked
> cacheline for a while and not insta relinquish it. Giving it a chance
> to
> reach the SC.

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-07-31 11:20:36

by Zhou Qiao(周侨)

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



> On 2017-07-29 03:09, Vikram Mulukutla wrote:
> On 2017-07-28 02:28, Will Deacon wrote:
>> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>
> <snip>
>
>>>
>>> I think we should have this discussion now - I brought this up
>>> earlier [1]
>>> and I promised a test case that I completely forgot about - but here it
>>> is (attached). Essentially a Big CPU in an acquire-check-release loop
>>> will have an unfair advantage over a little CPU concurrently attempting
>>> to acquire the same lock, in spite of the ticket implementation. If
>>> the Big
>>> CPU needs the little CPU to make forward progress : livelock.
>>>
>
> <snip>
>
>>>
>>> One solution was to use udelay(1) in such loops instead of
>>> cpu_relax(), but
>>> that's not very 'relaxing'. I'm not sure if there's something we
>>> could do
>>> within the ticket spin-lock implementation to deal with this.
>>
>> Does bodging cpu_relax to back-off to wfe after a while help? The event
>> stream will wake it up if nothing else does. Nasty patch below, but
>> I'd be
>> interested to know whether or not it helps.
>>
>> Will
>>
The patch also helps a lot on my platform. (Though it does cause a
deadlock (related to udelay) in the uart driver in early boot; I'm not
sure it's a uart driver issue and have just worked around it for now.)

Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
Test condition #1:
a. core2: a53, while loop (spinlock, spin_unlock)
b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)

Test result: recording the number of lock acquisitions (a53, a73) and the
max lock acquisition time (a53), over 20 seconds

Without cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
182| 38371616| 1,951,954|
202| 38427652| 2,261,319|
210| 38477427| 15,309,597|
207| 38494479| 6,656,453|
220| 38422283| 2,064,155|
===============================================================

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
1849898| 37799379| 131,255|
1574172| 38557653| 38,410|
1924777| 37831725| 42,999|
1477665| 38723741| 52,087|
1865793| 38007741| 783,965|
===============================================================

Also add some workload to the whole system to check the result.
Test condition #2: based on #1
c. core6: a73, 1.8GHz, run "while(1);" loop

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
20| 42563981| 2,317,070|
10| 42652793| 4,210,944|
9| 42651075| 5,691,834|
28| 42652591| 4,539,555|
10| 42652801| 5,850,639|
===============================================================

Also hotplug out other cores.
Test condition #3: based on #1
d. hotplug out core1/3/4/5/6, keep core0 for scheduling

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
447| 42652450| 309,549|
515| 42650382| 337,661|
415| 42646669| 628,525|
431| 42651137| 365,862|
464| 42648916| 379,934|
===============================================================

The last two tests are the actual cases where the hard-lockup is
triggered on my platform. So I gathered some data, and it shows that the
a53 needs a much longer time to acquire the lock.

All tests are done in Android, black screen with a USB cable attached. The
data is not as pretty as Vikram's. It might be related to CPU topology,
core count, CCI frequency, etc. (I'll do another test with both a53 and
a73 running at 1.2GHz, to check whether it's the core frequency that
leads to the major difference.)

> This does seem to help. Here's some data after 5 runs with and without
> the patch.
>
> time = max time taken to acquire lock
> counter = number of times lock acquired
>
> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
> Without the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
> 117893us| 2349144| 2us| 6748236|
> 571260us| 2125651| 2us| 7643264|
> 19780us| 2392770| 2us| 5987203|
> 19948us| 2395413| 2us| 5977286|
> 19822us| 2429619| 2us| 5768252|
> 19888us| 2444940| 2us| 5675657|
> =====================================================
>
> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
> With the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
> 3us| 2737438| 2us| 6907147|
> 2us| 2742478| 2us| 6902241|
> 132us| 2745636| 2us| 6876485|
> 3us| 2744554| 2us| 6898048|
> 3us| 2741391| 2us| 6882901|
> =====================================================
>
> The patch also seems to have helped with fairness in general
> allowing more work to be done if the CPU frequencies are more
> closely matched (I don't know if this translates to real world
> performance - probably not). The counter values are higher
> with the patch.
>
> time = max time taken to acquire lock
> counter = number of times lock acquired
>
> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
> Without the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
> 2us| 5240654| 1us| 5339009|
> 2us| 5287797| 97us| 5327073|
> 2us| 5237634| 1us| 5334694|
> 2us| 5236676| 88us| 5333582|
> 84us| 5285880| 84us| 5329489|
> =====================================================
>
> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
> With the cpu_relax() bodging patch:
> =====================================================
> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
> ==========|==============|===========|==============|
> 140us| 10449121| 1us| 11154596|
> 1us| 10757081| 1us| 11479395|
> 83us| 10237109| 1us| 10902557|
> 2us| 9871101| 1us| 10514313|
> 2us| 9758763| 1us| 10391849|
> =====================================================
>
>
> Thanks,
> Vikram
>

2017-07-31 13:13:24

by Will Deacon

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

Hi Vikram,

On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
> On 2017-07-28 02:28, Will Deacon wrote:
> >On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>
> <snip>
>
> >>
> >>I think we should have this discussion now - I brought this up earlier
> >>[1]
> >>and I promised a test case that I completely forgot about - but here it
> >>is (attached). Essentially a Big CPU in an acquire-check-release loop
> >>will have an unfair advantage over a little CPU concurrently attempting
> >>to acquire the same lock, in spite of the ticket implementation. If the
> >>Big
> >>CPU needs the little CPU to make forward progress : livelock.
> >>
>
> <snip>
>
> >>
> >>One solution was to use udelay(1) in such loops instead of cpu_relax(),
> >>but
> >>that's not very 'relaxing'. I'm not sure if there's something we could
> >>do
> >>within the ticket spin-lock implementation to deal with this.
> >
> >Does bodging cpu_relax to back-off to wfe after a while help? The event
> >stream will wake it up if nothing else does. Nasty patch below, but I'd be
> >interested to know whether or not it helps.
> >
> >Will
> >
> This does seem to help. Here's some data after 5 runs with and without the
> patch.

Blimey, that does seem to make a difference. Shame it's so ugly! Would you
be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
it set to 10000 in the diff I posted, but that might be higher than optimal.
It would be interesting to see if it correlates with num_possible_cpus()
for the highly contended case.

Will

2017-08-01 07:37:34

by Zhou Qiao(周侨)

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



On 2017-07-31 19:20, qiaozhou wrote:
>
>
> On 2017-07-29 03:09, Vikram Mulukutla wrote:
>> On 2017-07-28 02:28, Will Deacon wrote:
>>> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>
>> <snip>
>>>
>>> Does bodging cpu_relax to back-off to wfe after a while help? The event
>>> stream will wake it up if nothing else does. Nasty patch below, but
>>> I'd be
>>> interested to know whether or not it helps.
>>>
>>> Will
>>>
> The patch also helps a lot on my platform. (Though it does cause
> deadlock(related with udelay) in uart driver in early boot, and not sure
> it's uart driver issue. Just workaround it firstly)
>
> Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
> Test condition #1:
> a. core2: a53, while loop (spinlock, spin_unlock)
> b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)
>
> Test result: recording the lock acquire times(a53, a73), max lock
> acquired time(a53), in 20 seconds
>
> Without cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
> 182| 38371616| 1,951,954|
> 202| 38427652| 2,261,319|
> 210| 38477427| 15,309,597|
> 207| 38494479| 6,656,453|
> 220| 38422283| 2,064,155|
> ===============================================================
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
> 1849898| 37799379| 131,255|
> 1574172| 38557653| 38,410|
> 1924777| 37831725| 42,999|
> 1477665| 38723741| 52,087|
> 1865793| 38007741| 783,965|
> ===============================================================
>
> Also add some workload to the whole system to check the result.
> Test condition #2: based on #1
> c. core6: a73, 1.8GHz, run "while(1);" loop
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
> 20| 42563981| 2,317,070|
> 10| 42652793| 4,210,944|
> 9| 42651075| 5,691,834|
> 28| 42652591| 4,539,555|
> 10| 42652801| 5,850,639|
> ===============================================================
>
> Also hotplug out other cores.
> Test condition #3: based on #1
> d. hotplug out core1/3/4/5/6, keep core0 for scheduling
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
> 447| 42652450| 309,549|
> 515| 42650382| 337,661|
> 415| 42646669| 628,525|
> 431| 42651137| 365,862|
> 464| 42648916| 379,934|
> ===============================================================
>
> The last two tests are the actual cases where the hard-lockup is
> triggered on my platform. So I gathered some data, and it shows that a53
> needs much longer time to acquire the lock.
>
> All tests are done in android, black screen with USB cable attached. The
> data is not so pretty as Vikram's. It might be related with cpu
> topology, core numbers, CCI frequency etc. (I'll do another test with
> both a53 and a73 running at 1.2GHz, to check whether it's the core
> frequency which leads to the major difference.)
>
Tested the contention with the a53 and a73 cores running at the same frequency.
Platform: 4 a53(1248MHz) + 4 a73(1248MHz)
Test condition #4:
a. core2: a53, while loop (spinlock, spin_unlock)
b. core7: a73, while loop (spinlock, spin_unlock)
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
12945632| 13021576| 14|
12934181| 13059230| 16|
12987186| 13059016| 49|
12958583| 13038884| 24|
14637546| 14672522| 14|
===============================================================

The lock counts are almost the same, and the max time to acquire the
lock on the a53 also drops. On my platform, core frequency seems to be the
key factor.

>> This does seem to help. Here's some data after 5 runs with and without
>> the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>> 117893us| 2349144| 2us| 6748236|
>> 571260us| 2125651| 2us| 7643264|
>> 19780us| 2392770| 2us| 5987203|
>> 19948us| 2395413| 2us| 5977286|
>> 19822us| 2429619| 2us| 5768252|
>> 19888us| 2444940| 2us| 5675657|
>> =====================================================
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>> 3us| 2737438| 2us| 6907147|
>> 2us| 2742478| 2us| 6902241|
>> 132us| 2745636| 2us| 6876485|
>> 3us| 2744554| 2us| 6898048|
>> 3us| 2741391| 2us| 6882901|
>> =====================================================
>>
>> The patch also seems to have helped with fairness in general
>> allowing more work to be done if the CPU frequencies are more
>> closely matched (I don't know if this translates to real world
>> performance - probably not). The counter values are higher
>> with the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>> 2us| 5240654| 1us| 5339009|
>> 2us| 5287797| 97us| 5327073|
>> 2us| 5237634| 1us| 5334694|
>> 2us| 5236676| 88us| 5333582|
>> 84us| 5285880| 84us| 5329489|
>> =====================================================
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>> 140us| 10449121| 1us| 11154596|
>> 1us| 10757081| 1us| 11479395|
>> 83us| 10237109| 1us| 10902557|
>> 2us| 9871101| 1us| 10514313|
>> 2us| 9758763| 1us| 10391849|
>> =====================================================

Also applied Vikram's patch and ran the test.

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
Without cpu_relax bodging patch
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
16505| 5243| 2| 12487322|
16494| 5619| 1| 12013291|
16498| 5276| 2| 11706824|
16494| 7123| 1| 12532355|
16470| 7208| 2| 11784617|
=====================================================

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With cpu_relax bodging patch:
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
3991| 140714| 1| 11430528|
4018| 144371| 1| 11430528|
4034| 143250| 1| 11427011|
4330| 147345| 1| 11423583|
4752| 138273| 1| 11433241|
=====================================================

It shows some improvement, but it is not as good as Vikram's data. The big
core still has a much better chance of acquiring the lock.
>>
>> Thanks,
>> Vikram
>>

2017-08-03 23:25:15

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

Hi Will,

On 2017-07-31 06:13, Will Deacon wrote:
> Hi Vikram,
>
> On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>> On 2017-07-28 02:28, Will Deacon wrote:
>> >On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

>> >
>> This does seem to help. Here's some data after 5 runs with and without
>> the
>> patch.
>
> Blimey, that does seem to make a difference. Shame it's so ugly! Would
> you
> be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I
> had
> it set to 10000 in the diff I posted, but that might be higher than
> optimal.
> It would be interested to see if it correlates with num_possible_cpus()
> for the highly contended case.
>
> Will

Sorry for the late response - I should hopefully have some more data with
different thresholds before the week is finished or on Monday.

Thanks,
Vikram
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-08-03 23:32:28

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync


Hi Qiao,


On 2017-08-01 00:37, qiaozhou wrote:
> On 2017-07-31 19:20, qiaozhou wrote:
>>
>>

<snip>

>>> =====================================================
>>> Also apply Vikram's patch and have a test.
>
> cpu2: a53, 832MHz, cpu7: a73, 1.75Hz
> Without cpu_relax bodging patch
> =====================================================
> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
> ==========|==============|===========|==============|
> 16505| 5243| 2| 12487322|
> 16494| 5619| 1| 12013291|
> 16498| 5276| 2| 11706824|
> 16494| 7123| 1| 12532355|
> 16470| 7208| 2| 11784617|
> =====================================================
>
> cpu2: a53, 832MHz, cpu7: a73, 1.75Hz
> With cpu_relax bodging patch:
> =====================================================
> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
> ==========|==============|===========|==============|
> 3991| 140714| 1| 11430528|
> 4018| 144371| 1| 11430528|
> 4034| 143250| 1| 11427011|
> 4330| 147345| 1| 11423583|
> 4752| 138273| 1| 11433241|
> =====================================================
>
> It has some improvements, but not so good as Vikram's data. The big
> core still has much more chance to acquire lock.
>>>
>>> Thanks,
>>> Vikram
>>>

Thanks for your data! I'll check on one of our other platforms to see
if I see similar behavior. This may have something to do with the
event-stream on your platform or the A53 revision as Sudeep pointed
out here [1] - something to check I suppose...

Thanks,
Vikram

[1] - https://lkml.org/lkml/2016/11/21/458

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-08-04 03:15:43

by Zhou Qiao(周侨)

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



On 2017-08-04 07:32, Vikram Mulukutla wrote:
>
> Hi Qiao,
>
>
> On 2017-08-01 00:37, qiaozhou wrote:
>> On 2017-07-31 19:20, qiaozhou wrote:
>>>
>>>
>
> <snip>
>
>>>> =====================================================
>>>> Also apply Vikram's patch and have a test.
>>
>> cpu2: a53, 832MHz, cpu7: a73, 1.75Hz
>> Without cpu_relax bodging patch
>> =====================================================
>> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
>> ==========|==============|===========|==============|
>> 16505| 5243| 2| 12487322|
>> 16494| 5619| 1| 12013291|
>> 16498| 5276| 2| 11706824|
>> 16494| 7123| 1| 12532355|
>> 16470| 7208| 2| 11784617|
>> =====================================================
>>
>> cpu2: a53, 832MHz, cpu7: a73, 1.75Hz
>> With cpu_relax bodging patch:
>> =====================================================
>> cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
>> ==========|==============|===========|==============|
>> 3991| 140714| 1| 11430528|
>> 4018| 144371| 1| 11430528|
>> 4034| 143250| 1| 11427011|
>> 4330| 147345| 1| 11423583|
>> 4752| 138273| 1| 11433241|
>> =====================================================
>>
>> It has some improvements, but not so good as Vikram's data. The big
>> core still has much more chance to acquire lock.
>>>>
>>>> Thanks,
>>>> Vikram
>>>>
>
> Thanks for your data! I'll check on one of our other platforms to see
> if I see similar behavior. This may have something to do with the
> event-stream on your platform or the A53 revision as Sudeep pointed
> out here [1] - something to check I suppose...
Thanks for the reminder. Our HW IP has already fixed the bug Sudeep
mentioned. I'm also checking the global exclusive monitor implementation
on our platform with our ASIC guys, and it might be related to snoop
transactions, or to implementation details of the global exclusive monitor.

Thanks a lot.
Qiao
>
> Thanks,
> Vikram
>
> [1] - https://lkml.org/lkml/2016/11/21/458
>

2017-08-15 18:40:45

by Will Deacon

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

Hi Vikram,

On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
> On 2017-07-31 06:13, Will Deacon wrote:
> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
> >>On 2017-07-28 02:28, Will Deacon wrote:
> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>
> >>>
> >>This does seem to help. Here's some data after 5 runs with and without
> >>the
> >>patch.
> >
> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
> >it set to 10000 in the diff I posted, but that might be higher than
> >optimal.
> >It would be interested to see if it correlates with num_possible_cpus()
> >for the highly contended case.
> >
> >Will
>
> Sorry for the late response - I should hopefully have some more data with
> different thresholds before the week is finished or on Monday.

Did you get anywhere with the threshold heuristic?

Will

2017-08-25 19:48:47

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync


Hi Will,

On 2017-08-15 11:40, Will Deacon wrote:
> Hi Vikram,
>
> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>> On 2017-07-31 06:13, Will Deacon wrote:
>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>> >>On 2017-07-28 02:28, Will Deacon wrote:
>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>
>> >>>
>> >>This does seem to help. Here's some data after 5 runs with and without
>> >>the
>> >>patch.
>> >
>> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>> >it set to 10000 in the diff I posted, but that might be higher than
>> >optimal.
>> >It would be interested to see if it correlates with num_possible_cpus()
>> >for the highly contended case.
>> >
>> >Will
>>
>> Sorry for the late response - I should hopefully have some more data
>> with
>> different thresholds before the week is finished or on Monday.
>
> Did you get anywhere with the threshold heuristic?
>
> Will

Here's some data from experiments that I finally got to today. I decided
to recompile for every value of the threshold. Was doing a binary search
of sorts and then started reducing by orders of magnitude. There are
pairs of rows here:

Row1 is with cpu0 (little) at 300MHz and cpu4 at 1.9GHz
Row2 is with cpu0 (little) at 1.5GHz and cpu4 at 1.9GHz

It looks like even with the threshold set to 1, we don't hit the worst
case of a single instance of locking taking a long time, probably a
consequence of how the test is designed? However as we reduce the
threshold, the fairness in terms of how many times each CPU acquires the
lock skews towards the big CPU, starting with threshold=500.

If I understand the code correctly, the upper 32 bits of an ARM64 virtual
address will overflow when 1 is added to it, and so we'll keep WFE'ing on
every subsequent cpu_relax invoked from the same PC, until we cross the
hard-coded threshold, right?


CPU_RELAX_WFE_THRESHOLD = 5000 and 2500 (very similar results)
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2|2763059|2|7323169
0|11477590|1|12110373

3|2762788|2|7329881
1|11557987|1|12042980

3|2765912|2|7308789
1|11470294|1|12120074

3|2761793|2|7333907
1|11431841|1|12155046

3|2762402|2|7328843
1|11495705|1|12096518

3|2764392|2|7308640
1|11479146|1|12111419
====================================================|

CPU_RELAX_WFE_THRESHOLD = 500
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3|2338277|2|10052592
1|6963131|1|18103639

3|2337982|2|10037188
1|6979396|1|18082811

3|2337282|2|10052184
0|6954990|1|18109860

3|2338737|2|10039556
1|7185046|1|17809240

4|2338857|2|10027407
1|6958274|1|18111394

4|2340208|2|10031173
0|7097088|1|17921861
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 50
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
4|1219792|2|18005180
0|1252767|1|25296935

4|1219312|2|18049566
1|1252625|1|25227292

4|1219884|2|18020775
1|1252363|1|25298387

4|1220862|2|18012062
1|1251827|1|25283787

4|1220489|2|18010055
0|1251729|1|25272917

3|1220088|2|18027279
0|1253264|1|25268834
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 10
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3|298604|1|23784805
0|293511|1|24604172

3|294707|2|23857487
0|292519|1|24564209

4|294199|1|23832180
0|293840|1|24593323

4|294314|1|23853353
0|293609|1|24635190

4|293802|1|23836764
0|293322|1|24553212

3|293658|1|23889801
0|292663|1|24552118
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 5
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
3|173061|1|22332479
0|173759|1|23774009

3|174471|1|22342362
0|173161|1|23814466

3|173851|2|22235422
0|172734|1|23705848

2|173452|1|22255166
0|172830|1|23824301

2|173028|1|22390297
0|172336|1|23836407

3|172968|1|22285954
0|173207|1|23844900
-----------------------------------------------------
=====================================================

CPU_RELAX_WFE_THRESHOLD = 1
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
2|64245|1|6266848
0|77117|1|20917346

2|71310|1|5184106
1|77426|1|21040797

3|71335|2|5024650
0|77167|1|20934429

3|71295|1|5361696
0|77377|1|20902970

2|71357|1|5302482
0|77278|1|20967106

3|71158|1|5214564
0|77334|1|21022485
-----------------------------------------------------
=====================================================

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-08-25 20:25:45

by Vikram Mulukutla

Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



On 2017-08-25 12:48, Vikram Mulukutla wrote:

>
> If I understand the code correctly, the upper 32 bits of an ARM64
> virtual
> address will overflow when 1 is added to it, and so we'll keep WFE'ing
> on
> every subsequent cpu_relax invoked from the same PC, until we cross the
> hard-coded threshold, right?
>

Oops, misread that. Second time we enter cpu_relax from the same PC, we
do a WFE. Then we stop doing the WFE until we hit the threshold using
the
per-cpu counter. So with a higher threshold, we wait for more
cpu_relax()
calls before starting the WFE again.

So a lower threshold implies we hit the WFE branch sooner. Since my test
keeps the while loop going for a full 5 seconds, a lower threshold will
obviously result in more WFEs and a lower lock-acquired count.

I guess we want a high threshold but not so high that the little CPU has
to wait a while before the big CPU counts up to the threshold, is that
correct?
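
To make that concrete, the behaviour being described would look roughly like
the sketch below. This is only an illustration of the idea, not the actual
diff posted earlier in the thread; the per-cpu variable names, the use of
_THIS_IP_, and the exact back-off policy are assumptions on my part.

#include <linux/kernel.h>   /* _THIS_IP_ */
#include <linux/percpu.h>

#define CPU_RELAX_WFE_THRESHOLD 10000

static DEFINE_PER_CPU(unsigned long, cpu_relax_pc);
static DEFINE_PER_CPU(unsigned long, cpu_relax_cnt);

static __always_inline void cpu_relax(void)
{
    unsigned long pc = _THIS_IP_;   /* PC of this cpu_relax() call site */

    if (pc != __this_cpu_read(cpu_relax_pc)) {
        /* First call from this site: just remember where we are. */
        __this_cpu_write(cpu_relax_pc, pc);
        __this_cpu_write(cpu_relax_cnt, 0);
    } else {
        unsigned long cnt = __this_cpu_read(cpu_relax_cnt) + 1;

        if (cnt == 1) {
            /* Second call from the same PC: park in WFE once. */
            asm volatile("sevl; wfe" ::: "memory");
        } else if (cnt >= CPU_RELAX_WFE_THRESHOLD) {
            /* Spun up to the threshold: WFE again and start over. */
            asm volatile("sevl; wfe" ::: "memory");
            cnt = 1;
        }
        __this_cpu_write(cpu_relax_cnt, cnt);
    }

    asm volatile("yield" ::: "memory");
}

With a shape like that, a lower threshold means the busy core parks in WFE
more often within the same 5-second window, which is consistent with the
lower lock-acquired counts in the tables above.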

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-08-28 23:12:04

by Vikram Mulukutla

[permalink] [raw]
Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

Hi Will,

On 2017-08-25 12:48, Vikram Mulukutla wrote:
> Hi Will,
>
> On 2017-08-15 11:40, Will Deacon wrote:
>> Hi Vikram,
>>
>> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>>> On 2017-07-31 06:13, Will Deacon wrote:
>>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>> >>On 2017-07-28 02:28, Will Deacon wrote:
>>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>>
>>> >>>
>>> >>This does seem to help. Here's some data after 5 runs with and without
>>> >>the
>>> >>patch.
>>> >
>>> >Blimey, that does seem to make a difference. Shame it's so ugly! Would you
>>> >be able to experiment with other values for CPU_RELAX_WFE_THRESHOLD? I had
>>> >it set to 10000 in the diff I posted, but that might be higher than
>>> >optimal.
>>> >It would be interested to see if it correlates with num_possible_cpus()
>>> >for the highly contended case.
>>> >
>>> >Will
>>>
>>> Sorry for the late response - I should hopefully have some more data
>>> with
>>> different thresholds before the week is finished or on Monday.
>>
>> Did you get anywhere with the threshold heuristic?
>>
>> Will
>
> Here's some data from experiments that I finally got to today. I
> decided
> to recompile for every value of the threshold. Was doing a binary
> search
> of sorts and then started reducing by orders of magnitude. There pairs
> of rows here:
>

Well, here's something interesting. I tried a different platform and found
that the workaround doesn't help much at all, similar to Qiao's observation
on his b.L chipset. Something to do with the WFE implementation or
event-stream?

I modified your patch to use a __delay(1) in place of the WFEs and this was
the result (still with the 10k threshold). The worst-case lock time for cpu0
improves drastically. Given that cpu0 re-enables interrupts between each lock
attempt in my test case, I think the lock count matters less here.

cpu_relax() patch with WFEs (original workaround):
(pairs of rows: first row with c0 at 300MHz, second with c0 at 1.9GHz;
both rows have cpu4 at 2.3GHz; max time is in microseconds)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
     999843|            25|           2|      12988498| -> c0/cpu0 at 300MHz
          0|       8421132|           1|       9152979| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
     999860|           160|           2|      12963487|
          1|       8418492|           1|       9158001|
------------------------------------------------------|
     999381|           734|           2|      12988636|
          1|       8387562|           1|       9128056|
------------------------------------------------------|
     989800|           750|           3|      12996473|
          1|       8389091|           1|       9112444|
------------------------------------------------------|

cpu_relax() patch with __delay(1):
(pairs of rows: first row with c0 at 300MHz, second with c0 at 1.9GHz;
both rows have cpu4 at 2.3GHz; max time is in microseconds)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
       7703|          1532|           2|      13035203| -> c0/cpu0 at 300MHz
          1|       8511686|           1|       8550411| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
       7801|          1561|           2|      13040188|
          1|       8553985|           1|       8609853|
------------------------------------------------------|
       3953|          1576|           2|      13049991|
          1|       8576370|           1|       8611533|
------------------------------------------------------|
       3953|          1557|           2|      13030553|
          1|       8509020|           1|       8543883|
------------------------------------------------------|

I should also note that my earlier kernel was 4.9-stable based
and the one above was on a 4.4-stable based kernel.
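
Concretely, the __delay(1) variant amounts to swapping the WFE in the earlier
illustrative cpu_relax() sketch for a short bounded busy-wait. Again, this is
a reconstruction for clarity, not the diff that was actually tested, and the
names are made up:

#include <linux/kernel.h>   /* _THIS_IP_ */
#include <linux/percpu.h>
#include <asm/delay.h>      /* __delay() */

#define CPU_RELAX_WFE_THRESHOLD 10000

static DEFINE_PER_CPU(unsigned long, cpu_relax_pc);
static DEFINE_PER_CPU(unsigned long, cpu_relax_cnt);

static __always_inline void cpu_relax(void)
{
    unsigned long pc = _THIS_IP_;
    unsigned long cnt;

    if (pc != __this_cpu_read(cpu_relax_pc)) {
        /* New call site: record it and reset the counter. */
        __this_cpu_write(cpu_relax_pc, pc);
        __this_cpu_write(cpu_relax_cnt, 0);
        return;
    }

    cnt = __this_cpu_read(cpu_relax_cnt) + 1;
    if (cnt == 1 || cnt >= CPU_RELAX_WFE_THRESHOLD) {
        /*
         * Back off with a short bounded delay instead of parking in WFE,
         * so the fast core stays away from the lock cacheline without
         * depending on the WFE/event-stream behaviour of a particular
         * implementation.
         */
        __delay(1);
        if (cnt >= CPU_RELAX_WFE_THRESHOLD)
            cnt = 1;
    }
    __this_cpu_write(cpu_relax_cnt, cnt);
}

The point is only that the back-off action changes from "wait for an event"
to "burn a fixed, small number of cycles", which appears to be enough for the
slow core's store-exclusive to succeed on this particular platform.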

Thanks,
Vikram

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2017-09-06 11:20:59

by Zhou Qiao(周侨)

[permalink] [raw]
Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync



On 2017-08-29 07:12, Vikram Mulukutla wrote:
>
> Well, here's something interesting. I tried a different platform and found
> that the workaround doesn't help much at all, similar to Qiao's observation
> on his b.L chipset. Something to do with the WFE implementation or
> event-stream?

Hi Vikram,

I did some experiments to tune the DDR controller (and DDR RAM) frequency
and the CCI frequency. The results are below:

cpu2: a53, 832MHz; cpu7: a73, 1.75GHz
cci: 832MHz
dclk: DDR controller clock (data rate = 4 * dclk)
With the cpu_relax() bodging patch:
==============================================================
  dclk | cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
=======|===========|==============|===========|==============|
   78M |      8906 |        55438 |        13 |      4015789 |
  156M |      5964 |        75109 |         4 |      8229050 |
  500M |       102 |      5984783 |         1 |      6400885 |
  600M |        16 |      6233601 |         1 |      6504718 |
==============================================================

I suspect that the global exclusive monitor in the DDR controller may play
an important part. When the DDR frequency is high enough, it seems to
handle the exclusive requests efficiently and fairly.

If the CCI frequency is reduced to a lower value, the little core's result
drops a lot again.

cpu2: a53, 832MHz; cpu7: a73, 1.75GHz
cci: 416MHz
dclk: DDR controller clock (data rate = 4 * dclk)
With the cpu_relax() bodging patch:
==============================================================
  dclk | cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
=======|===========|==============|===========|==============|
   78M |      8837 |        10596 |        11 |      3873635 |
  156M |     17597 |        10211 |         4 |      6513493 |
  500M |     10888 |        13214 |         2 |      8916396 |
  600M |      8934 |        15842 |         2 |      9394124 |
==============================================================

I guess the result on your different platform might be related to the DDR
frequency too.

Best Regards
Qiao

2017-09-25 11:17:49

by Zhou Qiao(周侨)

[permalink] [raw]
Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

Hi Will,

Will this bodging patch be merged? It solves the livelock issue on arm64
platforms (or at least improves it a lot).

I suspected that the CCI frequency might affect the contention between the
little and big cores, but on my platform it has little impact. In fact it is
the frequency of the external DDR controller that affects the contention (my
last reply has detailed data). The lock might be flushed out of the cache
after entering WFE and loaded back from DDR when the core is woken up; I
guess that's why the external DDR frequency matters.

Even with the lowest DDR frequency (78MHz) on my platform, the little core's
maximum delay to acquire the lock drops to ~10 ms with this bodging patch,
while without the patch the delay can reach the 10-second level in my
testing, as discussed previously. So I'm wondering whether this will be
pushed into mainline, or whether more data is still needed?

Thanks a lot.

Best Regards
Qiao
