core_idle_state is maintained for each core. It uses 0-7 bits to track
whether a thread in the core has entered fastsleep or winkle. 8th bit is
used as a lock bit.
The lock bit is set in these 2 scenarios-
- The thread is first in subcore to wakeup from sleep/winkle.
- If its the last thread in the core about to enter sleep/winkle
While the lock bit is set, if any other thread in the core wakes up, it
loops until the lock bit is cleared before proceeding in the wakeup
path. This helps prevent race conditions w.r.t fastsleep workaround and
prevents threads from switching to process context before core/subcore
resources are restored.
But, in the path to sleep/winkle entry, we currently don't check for
lock-bit. This exposes us to following race when running with subcore
on-
First thread in the subcorea Another thread in the same
waking up core entering sleep/winkle
lwarx r15,0,r14
ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
stwcx. r15,0,r14
[Code to restore subcore state]
lwarx r15,0,r14
[clear thread bit]
stwcx. r15,0,r14
andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
stw r15,0(r14)
Here, after the thread entering sleep clears its thread bit in
core_idle_state, the value is overwritten by the thread waking up.
This patch fixes the above race by looping on the lock bit even while
entering the idle states.
Signed-off-by: Shreyas B. Prabhu <[email protected]>
---
arch/powerpc/kernel/idle_power7.S | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/arch/powerpc/kernel/idle_power7.S b/arch/powerpc/kernel/idle_power7.S
index ccde8f0..48c7d4f 100644
--- a/arch/powerpc/kernel/idle_power7.S
+++ b/arch/powerpc/kernel/idle_power7.S
@@ -144,12 +144,26 @@ power7_enter_nap_mode:
bge cr3,2f
IDLE_STATE_ENTER_SEQ(PPC_NAP)
/* No return */
+
+core_idle_lock_held_entry:
+ HMT_LOW
+core_idle_lock_loop_entry:
+ lwz r15,0(r14)
+ andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
+ bne core_idle_lock_loop_entry
+ HMT_MEDIUM
+ b lwarx_loop1
+
2:
/* Sleep or winkle */
lbz r7,PACA_THREAD_MASK(r13)
ld r14,PACA_CORE_IDLE_STATE_PTR(r13)
lwarx_loop1:
lwarx r15,0,r14
+
+ andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT
+ bne core_idle_lock_held_entry
+
andc r15,r15,r7 /* Clear thread bit */
andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
--
1.9.3
On Wed, 2015-01-07 at 06:34:10 UTC, "Shreyas B. Prabhu" wrote:
> core_idle_state is maintained for each core. It uses 0-7 bits to track
> whether a thread in the core has entered fastsleep or winkle. 8th bit is
> used as a lock bit.
> The lock bit is set in these 2 scenarios-
> - The thread is first in subcore to wakeup from sleep/winkle.
> - If its the last thread in the core about to enter sleep/winkle
>
> While the lock bit is set, if any other thread in the core wakes up, it
> loops until the lock bit is cleared before proceeding in the wakeup
> path. This helps prevent race conditions w.r.t fastsleep workaround and
> prevents threads from switching to process context before core/subcore
> resources are restored.
>
> But, in the path to sleep/winkle entry, we currently don't check for
> lock-bit. This exposes us to following race when running with subcore
> on-
>
> First thread in the subcorea Another thread in the same
> waking up core entering sleep/winkle
>
> lwarx r15,0,r14
> ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
> stwcx. r15,0,r14
> [Code to restore subcore state]
>
> lwarx r15,0,r14
> [clear thread bit]
> stwcx. r15,0,r14
>
> andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS
> stw r15,0(r14)
>
> Here, after the thread entering sleep clears its thread bit in
> core_idle_state, the value is overwritten by the thread waking up.
> This patch fixes the above race by looping on the lock bit even while
> entering the idle states.
What are the symptoms of this bug?
I assume they're not good. In which case this should go to stable, shouldn't
it? If so which versions?
And which commit introduced the bug?
cheers
>
> What are the symptoms of this bug?
>
In the cases where we hit this race and the core enters fastsleep,
code mistakes an idle thread as running. Because of this, the first
thread waking up from fastsleep which is supposed to resync timebase
skips it. So we can end up having a core with stale timebase value.
We suspect this is causing soft lockups with call stacks similar to this-
[126529.208714] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [opal_errd:7722]
[126529.208849] CPU: 8 PID: 7722 Comm: opal_errd
[126529.208853] task: c00000bf67803a80 ti: c00000bf6788c000 task.ti: c00000bf6788c000
[126529.208856] NIP: c00000000015a180 LR: c00000000015a0d0 CTR: c00000000001ed70
[126529.208859] REGS: c00000bf6788faa0 TRAP: 0901 Not tainted (3.18.13-336.el7_1.pkvm3_1_0.2000.1.ppc64le)
[126529.208860] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24004824 XER: 20000000
[126529.208871] CFAR: c00000000015a194 SOFTE: 1
GPR00: c0000000002db9e8 c00000bf6788fd20 c0000000012b1800 00003af5b88f569e
GPR04: 0000000000d3dbb8 00003af5c236ca0b ffffffffffffffff 000000000001ee28
GPR08: 000000003b9ac9ff 5bfc723fba82c8f9 00000000c06f2b88 c0000000009908c8
GPR12: c00000000001ed70 c000000007da4c00
[126529.208896] NIP [c00000000015a180] ktime_get_ts64+0x130/0x1f0
[126529.208899] LR [c00000000015a0d0] ktime_get_ts64+0x80/0x1f0
[126529.208902] Call Trace:
[126529.208909] [c00000bf6788fd20] [c00000000019c0e4] __audit_syscall_exit+0x214/0x2e0 (unreliable)
[126529.208916] [c00000bf6788fda0] [c0000000002db9e8] poll_select_set_timeout+0x98/0xe0
[126529.208919] [c00000bf6788fde0] [c0000000002dcf7c] SyS_poll+0x8c/0x160
[126529.208925] [c00000bf6788fe30] [c000000000009358] syscall_exit+0x0/0x98
[126529.208927] Instruction dump:
[126529.208930] 7d29ea14 6108c9ff 39400000 7fa94040 409d0038 4800001c 60000000 60000000
[126529.208936] 60000000 60000000 60000000 60420000 <3d29c465> 394a0001 39293600 794a0020
> I assume they're not good. In which case this should go to stable, shouldn't
> it? If so which versions?
>
Yes this should go into stable. 3.19+
> And which commit introduced the bug?
>
77b54e9f213f76a powernv/powerpc: Add winkle support for offline cpus
Thanks,
Shreyas