The 1st spinner (the head of the MCS queue) spins on the whole qspinlock
variable to check whether the lock has been released. For a contended
qspinlock, this spinning is a hotspot, as each CPU queued in the MCS
queue performs this spinning when it becomes the 1st spinner (the head
of the MCS queue).

The granularity among SMT h/w threads in the same core could be a "byte",
which the Load-Store Unit (LSU) inside the core handles. Making the 1st
spinner spin only on the locked_pending bits (rather than the whole
qspinlock) avoids a false dependency between the tail field and the
locked_pending field. This micro-optimization helps the h/w thread (the
1st spinner) stay in a low-power state and prevents it from being woken
up by other h/w threads in the same core when they perform xchg_tail()
to update only the tail field. Please see a similar discussion in link [1].
[1] https://lore.kernel.org/r/[email protected]
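
For reference, with _Q_PENDING_BITS == 8 the lock word already splits into
two independent 16-bit halves. The sketch below paraphrases the little-endian
layout from include/asm-generic/qspinlock_types.h (field comments added here
for illustration only):

    typedef struct qspinlock {
        union {
            atomic_t val;              /* whole 32-bit lock word */
            struct {
                u8  locked;            /* bits  0-7  */
                u8  pending;           /* bits  8-15 */
            };
            struct {
                u16 locked_pending;    /* bits  0-15 */
                u16 tail;              /* bits 16-31 */
            };
        };
    } arch_spinlock_t;

The 1st spinner only needs to observe the low halfword, while queueing CPUs
only store to the high halfword via xchg_tail().
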
Signed-off-by: Qiuxu Zhuo <[email protected]>
---
kernel/locking/qspinlock.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index efebbf19f887..e7b990b28610 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -513,7 +513,20 @@ void __lockfunc queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
+#if _Q_PENDING_BITS == 8
+ /*
+ * Spinning on the 2-byte locked_pending instead of the 4-byte qspinlock
+ * variable can avoid the false dependency between the tail field and
+ * the locked_pending field. This helps the h/w thread (the 1st spinner)
+ * stay in a low power state and prevents it from being woken up by other
+ * h/w threads in the same core when they perform xchg_tail() to update
+ * the tail field only.
+ */
+ smp_cond_load_acquire(&lock->locked_pending, !VAL);
+ val = atomic_read_acquire(&lock->val);
+#else
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
+#endif
locked:
/*
--
2.17.1

On 5/8/23 04:15, Qiuxu Zhuo wrote:
> [...]
What hardware can benefit from this change? Do you have any
micro-benchmark that can show the performance benefit?
Cheers,
Longman

Hi Waiman,

Thanks for your review of this patch.
Please see the comments below.

> From: Waiman Long <[email protected]>
> Sent: Monday, May 8, 2023 11:31 PM
> [...]
>
> What hardware can benefit from this change? Do you have any micro-
> benchmark that can show the performance benefit?
i) I don't have the hardware to measure the data.
But I ran a benchmark [1] for the contended spinlock on an Intel Sapphire
Rapids server (192 h/w threads, 2 sockets), which showed that the 1st
spinner's spinning was a hotspot (it contributed ~55% of the cache-bouncing
traffic measured by perf c2c; I don't analyze the cache bouncing here, just
note that the spinning is a hotspot).

ii) A similar micro-optimization discussion [2] (it looked like it was
accepted by you :-)) suggested that avoiding the false dependency between
the tail field and the locked_pending field should help the h/w thread
(the spinner) on some arches (e.g., some ARM64?) stay in a low-power state,
without being disturbed by its sibling h/w threads in the same core (see
the xchg_tail() excerpt below).
[1] https://github.com/yamahata/ipi_benchmark.git
[2] https://lore.kernel.org/r/[email protected]
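
For reference on point ii): in the _Q_PENDING_BITS == 8 case, the CPU that
queues itself only exchanges the 16-bit tail halfword, so it never stores to
the locked_pending bytes that the 1st spinner is watching. Paraphrased from
kernel/locking/qspinlock.c (comments condensed here):

    #if _Q_PENDING_BITS == 8
    /*
     * xchg_tail - Put in the new queue tail code word & retrieve the
     * previous one. Only the 16-bit tail halfword of the lock is written.
     */
    static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
    {
        /*
         * Relaxed semantics are sufficient: the caller ensures the MCS
         * node is properly initialized before updating the tail.
         */
        return (u32)xchg_relaxed(&lock->tail,
                                 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
    }
    #endif
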
Thanks
-Qiuxu
> Cheers,
> Longman

On 5/8/23 22:45, Zhuo, Qiuxu wrote:
> [...]
>> What hardware can benefit from this change? Do you have any micro-
>> benchmark that can show the performance benefit?
> i) I don't have the hardware to measure the data.
> But I ran a benchmark [1] for the contended spinlock on an Intel Sapphire
> Rapids server (192 h/w threads, 2 sockets), which showed that the 1st
> spinner's spinning was a hotspot (it contributed ~55% of the cache-bouncing
> traffic measured by perf c2c; I don't analyze the cache bouncing here, just
> note that the spinning is a hotspot).
I believe the amount of cacheline bouncing will be the same whether you
read 32 or 16 bits from the lock word. At least this is my understanding
of the x86 arch. Please correct me if my assumption is incorrect.
>
> ii) A similar micro-optimization discussion [2] (it looked like it was
> accepted by you :-)) suggested that avoiding the false dependency between
> the tail field and the locked_pending field should help the h/w thread
> (the spinner) on some arches (e.g., some ARM64?) stay in a low-power state,
> without being disturbed by its sibling h/w threads in the same core.
That is true. However, this patch causes one more read from the lock
cacheline which isn't necessary for arches that won't benefit from it.
So I am less inclined to accept it unless there is evidence of the
benefit it can bring.
Cheers,
Longman

> From: Waiman Long <[email protected]>
> Sent: Tuesday, May 9, 2023 10:58 AM
> [...]
>
> I believe the amount of cacheline bouncing will be the same whether you
> read 32 or 16 bits from the lock word. At least this is my understanding of the
> x86 arch. Please correct me if my assumption is incorrect.
You're right.
The amount of cache line bouncing was nearly the same whether spinning on 32
or 16 bits (according to my measured perf c2c data on an x86 server).
> [...]
>
> That is true. However, this patch causes one more read from the lock
> cacheline which isn't necessary for arches that won't benefit from it.
> So I am less inclined to accept it unless there is evidence of the
> benefit it can bring.
This patch removes a bitwise AND operation on the VAL value.
Does it compensate for the one more read from the cache line?
Thanks!
- Qiuxu
> Cheers,
> Longman

On 5/8/23 23:30, Zhuo, Qiuxu wrote:
> [...]
>>
>> That is true. However, this patch causes one more read from the lock
>> cacheline which isn't necessary for arches that won't benefit from it.
>> So I am less inclined to accept it unless there is evidence of the
>> benefit it can bring.
> This patch removes a bitwise AND operation on the VAL value.
> Does it compensate for the one more read from the cache line?
A register-to-register operation like the bitwise AND doesn't cost much;
it is the potential read from a hot cacheline that causes most of the
delay. Besides, there is also an additional acquire barrier.

Again, if there is no concrete proof of a benefit, there is no point in
making the code more complicated.
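
To spell out the trade-off, here are the two head-waiter wait paths,
condensed from the diff in this thread (nothing new, just the two
alternatives side by side):

    /* Current code: one conditional acquire load of the 32-bit lock word. */
    val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));

    /*
     * Proposed code: the AND on VAL goes away, but a second acquire load of
     * the (potentially hot) lock cacheline is added after the wait loop exits.
     */
    smp_cond_load_acquire(&lock->locked_pending, !VAL);
    val = atomic_read_acquire(&lock->val);
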
Cheers,
Longman

Hello,

kernel test robot noticed "kernel_BUG_at_lib/list_debug.c" on:
commit: 3fa78d3cb9dfb63f66f86b76a7621c9d1c1ee302 ("[PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits")
url: https://github.com/intel-lab-lkp/linux/commits/Qiuxu-Zhuo/locking-qspinlock-Make-the-1st-spinner-only-spin-on-locked_pending-bits/20230508-161751
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ec570320b09f76d52819e60abdccf372658216b6
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH 1/1] locking/qspinlock: Make the 1st spinner only spin on locked_pending bits
in testcase: phoronix-test-suite
version:
with following parameters:
test: sqlite-2.1.0
option_a: 128
cpufreq_governor: performance
test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/
compiler: gcc-11
test machine: 16 threads 1 sockets Intel(R) Xeon(R) E-2278G CPU @ 3.40GHz (Coffee Lake) with 32G memory
(please refer to attached dmesg/kmsg for entire log/backtrace)
+----------------+------------+------------+
|                | ec570320b0 | 3fa78d3cb9 |
+----------------+------------+------------+
| boot_successes | 0          | 2          |
+----------------+------------+------------+
If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]
[ 54.397344][ T2693] ------------[ cut here ]------------
[ 54.402911][ T2693] kernel BUG at lib/list_debug.c:59!
[ 54.408314][ T2693] invalid opcode: 0000 [#1] SMP NOPTI
[ 54.413773][ T2693] CPU: 2 PID: 2693 Comm: sqlite3 Tainted: G S 6.3.0-rc1-00010-g3fa78d3cb9df #1
[ 54.424166][ T2693] Hardware name: Intel Corporation Mehlow UP Server Platform/Moss Beach Server, BIOS CNLSE2R1.R00.X188.B13.1903250419 03/25/2019
[ 54.437758][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3))
[ 54.443830][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8
All code
========
0: 36 6f outsl %ss:(%rsi),(%dx)
2: 82 (bad)
3: e8 87 98 a8 ff callq 0xffffffffffa8988f
8: 0f 0b ud2
a: 48 89 ca mov %rcx,%rdx
d: 48 c7 c7 20 37 6f 82 mov $0xffffffff826f3720,%rdi
14: e8 76 98 a8 ff callq 0xffffffffffa8988f
19: 0f 0b ud2
1b: 4c 89 c2 mov %r8,%rdx
1e: 48 c7 c7 58 37 6f 82 mov $0xffffffff826f3758,%rdi
25: e8 65 98 a8 ff callq 0xffffffffffa8988f
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 89 d1 mov %rdx,%rcx
2f: 4c 89 c6 mov %r8,%rsi
32: 4c 89 ca mov %r9,%rdx
35: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi
3c: e8 .byte 0xe8
3d: 4e 98 rex.WRX cltq
3f: a8 .byte 0xa8
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 89 d1 mov %rdx,%rcx
5: 4c 89 c6 mov %r8,%rsi
8: 4c 89 ca mov %r9,%rdx
b: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi
12: e8 .byte 0xe8
13: 4e 98 rex.WRX cltq
15: a8 .byte 0xa8
[ 54.463787][ T2693] RSP: 0018:ffffc90004ad3e50 EFLAGS: 00010246
[ 54.469954][ T2693] RAX: 000000000000006d RBX: ffff88887567c6c0 RCX: 0000000000000000
[ 54.478031][ T2693] RDX: 0000000000000000 RSI: ffff888854e9c700 RDI: ffff888854e9c700
[ 54.486105][ T2693] RBP: ffff888108530300 R08: 0000000000000000 R09: 00000000ffff7fff
[ 54.494179][ T2693] R10: ffffc90004ad3d08 R11: ffffffff82bd8d88 R12: ffff888108530358
[ 54.502264][ T2693] R13: 00000000ffffff9c R14: ffff8881086e96a0 R15: ffff888108530300
[ 54.510352][ T2693] FS: 00007f9d97379740(0000) GS:ffff888854e80000(0000) knlGS:0000000000000000
[ 54.519388][ T2693] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 54.526081][ T2693] CR2: 0000556700799178 CR3: 000000018215c001 CR4: 00000000003706e0
[ 54.534170][ T2693] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 54.542247][ T2693] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 54.550328][ T2693] Call Trace:
[ 54.553758][ T2693] <TASK>
[ 54.556798][ T2693] __dentry_kill (include/linux/list.h:134 fs/dcache.c:550 fs/dcache.c:603)
[ 54.561399][ T2693] dentry_kill (fs/dcache.c:755)
[ 54.565830][ T2693] dput (fs/dcache.c:913)
[ 54.569770][ T2693] do_unlinkat (fs/namei.c:4321)
[ 54.574289][ T2693] __x64_sys_unlink (fs/namei.c:4364 fs/namei.c:4362 fs/namei.c:4362)
[ 54.579056][ T2693] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 54.583593][ T2693] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
[ 54.589613][ T2693] RIP: 0033:0x7f9d9749d247
[ 54.594121][ T2693] Code: f0 ff ff 73 01 c3 48 8b 0d 46 bc 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 19 bc 0c 00 f7 d8 64 89 01 48
All code
========
0: f0 ff lock (bad)
2: ff 73 01 pushq 0x1(%rbx)
5: c3 retq
6: 48 8b 0d 46 bc 0c 00 mov 0xcbc46(%rip),%rcx # 0xcbc53
d: f7 d8 neg %eax
f: 64 89 01 mov %eax,%fs:(%rcx)
12: 48 83 c8 ff or $0xffffffffffffffff,%rax
16: c3 retq
17: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
1e: 00 00 00
21: 66 90 xchg %ax,%ax
23: b8 57 00 00 00 mov $0x57,%eax
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 19 bc 0c 00 mov 0xcbc19(%rip),%rcx # 0xcbc53
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W
Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 19 bc 0c 00 mov 0xcbc19(%rip),%rcx # 0xcbc29
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 54.614042][ T2693] RSP: 002b:00007fff66911a48 EFLAGS: 00000206 ORIG_RAX: 0000000000000057
[ 54.622596][ T2693] RAX: ffffffffffffffda RBX: 00005637c6657f58 RCX: 00007f9d9749d247
[ 54.630716][ T2693] RDX: 0000000000000000 RSI: 00005637c6658287 RDI: 00005637c6658287
[ 54.638820][ T2693] RBP: 0000000000000001 R08: 00005637c6657b98 R09: 0000000000000000
[ 54.646908][ T2693] R10: 00007fff66911a20 R11: 0000000000000206 R12: 0000000000000000
[ 54.654990][ T2693] R13: 00005637c6658287 R14: 00005637c6657a98 R15: 00005637c6673008
[ 54.663083][ T2693] </TASK>
[ 54.666234][ T2693] Modules linked in: sg ip_tables overlay rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver btrfs blake2b_generic xor raid6_pq libcrc32c intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp sd_mod coretemp t10_pi crc64_rocksoft_generic kvm_intel crc64_rocksoft crc64 kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ast ghash_clmulni_intel drm_shmem_helper drm_kms_helper sha512_ssse3 syscopyarea sysfillrect sysimgblt ahci rapl libahci ppdev intel_cstate parport_pc wmi_bmof intel_wmi_thunderbolt mei_me video i2c_designware_platform intel_uncore drm libata intel_pmc_core mei i2c_designware_core idma64 intel_pch_thermal ie31200_edac wmi parport acpi_tad acpi_power_meter acpi_pad
[ 54.731069][ T2693] ---[ end trace 0000000000000000 ]---
[ 54.736720][ T2693] RIP: 0010:__list_del_entry_valid (lib/list_debug.c:59 (discriminator 3))
[ 54.742881][ T2693] Code: 36 6f 82 e8 87 98 a8 ff 0f 0b 48 89 ca 48 c7 c7 20 37 6f 82 e8 76 98 a8 ff 0f 0b 4c 89 c2 48 c7 c7 58 37 6f 82 e8 65 98 a8 ff <0f> 0b 48 89 d1 4c 89 c6 4c 89 ca 48 c7 c7 a8 37 6f 82 e8 4e 98 a8
All code
========
0: 36 6f outsl %ss:(%rsi),(%dx)
2: 82 (bad)
3: e8 87 98 a8 ff callq 0xffffffffffa8988f
8: 0f 0b ud2
a: 48 89 ca mov %rcx,%rdx
d: 48 c7 c7 20 37 6f 82 mov $0xffffffff826f3720,%rdi
14: e8 76 98 a8 ff callq 0xffffffffffa8988f
19: 0f 0b ud2
1b: 4c 89 c2 mov %r8,%rdx
1e: 48 c7 c7 58 37 6f 82 mov $0xffffffff826f3758,%rdi
25: e8 65 98 a8 ff callq 0xffffffffffa8988f
2a:* 0f 0b ud2 <-- trapping instruction
2c: 48 89 d1 mov %rdx,%rcx
2f: 4c 89 c6 mov %r8,%rsi
32: 4c 89 ca mov %r9,%rdx
35: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi
3c: e8 .byte 0xe8
3d: 4e 98 rex.WRX cltq
3f: a8 .byte 0xa8
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 89 d1 mov %rdx,%rcx
5: 4c 89 c6 mov %r8,%rsi
8: 4c 89 ca mov %r9,%rdx
b: 48 c7 c7 a8 37 6f 82 mov $0xffffffff826f37a8,%rdi
12: e8 .byte 0xe8
13: 4e 98 rex.WRX cltq
15: a8 .byte 0xa8
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests