Hi all,
We are testing the patches for Spectre and Meltdown under an OS derived
from RH7.2, and we were hit by a hard LOCKUP panic in an x86_64 host
environment.

The hard LOCKUP is reproducible: it goes away if we disable IBPB by
writing 0 to the ibpb_enabled file, and it appears again when we enable
IBPB (writing 1 or 2).
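
(For reference, the small helper below is equivalent to what we do by
hand; the debugfs path is an assumption based on the RHEL 7
speculative-control patches, so adjust it if your kernel exposes the
knob elsewhere. It needs root and a mounted debugfs.)

/* toggle_ibpb.c -- write "0", "1" or "2" to the ibpb_enabled knob.
 * The path is an assumption from the RHEL 7 speculative-control
 * patches; change it if your derived kernel puts the file elsewhere. */
#include <stdio.h>

int main(int argc, char **argv)
{
        const char *path = "/sys/kernel/debug/x86/ibpb_enabled";
        const char *val = (argc > 1) ? argv[1] : "0";

        FILE *f = fopen(path, "w");
        if (!f) {
                perror(path);
                return 1;
        }
        fputs(val, f);
        return fclose(f) == 0 ? 0 : 1;
}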
The workload running on the host just starts two hundred security
containers sequentially, then stops them and repeats. The security
container is implemented using docker and kvm, so there are many
"docker-containerd-shim" and "qemu-system-x86_64uvm" processes. The
reproduction of the hard LOCKUP can be accelerated by running the
following command ("hackbench" comes from the LTP project):
while true; do ./hackbench 100 process 1000; done
We have saved vmcore files for the hard LOCKUPs by using kdump. The hard
LOCKUPs are triggered by different processes and on different kernel
stacks. We analyzed one hard LOCKUP: it was caused by wake_up_new_task()
when it tried to take rq->lock by invoking __task_rq_lock(). The value
of the lock is 422320416 (head = 6432, tail = 6444). We found the five
processes that are waiting on the lock, but we could not find the
process that holds it.
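
To decode the ticket lock we assumed the 3.10-era x86 arch_spinlock_t
layout (now-serving "head" in the low 16 bits, next-ticket "tail" in
the high 16 bits) and a ticket step of 2, as used with paravirt
ticketlocks; the standalone program below is just that arithmetic, not
crash output:

/* decode_ticket.c -- decode the rq->lock value seen in the vmcore,
 * assuming head in the low 16 bits, tail in the high 16 bits, and
 * tickets advancing by 2 (paravirt ticketlocks). */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t lock = 422320416;              /* 0x192c1920 */
        uint16_t head = lock & 0xffff;          /* 0x1920 = 6432 */
        uint16_t tail = (lock >> 16) & 0xffff;  /* 0x192c = 6444 */

        printf("head=%u tail=%u queued=%u\n",
               head, tail, (tail - head) / 2);
        return 0;
}

It prints "head=6432 tail=6444 queued=6": six outstanding tickets,
which matches the five waiters we found plus the holder we cannot find.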
We suspect that something is wrong with the CPU scheduler, because the
RSP register of the process runv, which is waiting for rq->lock, is
incorrect: the RSP points into the stack of swapper/57, yet runv is
also shown as running on CPU 57 (more details at the end of this mail).
The same phenomenon exists in the other hard LOCKUPs.

So has anyone encountered a similar problem before? Any suggestions or
directions for debugging the hard LOCKUP problem would be appreciated.
Thanks,
Tao
---
The following lines are the output from one instance of the hard LOCKUP
panics:
* output from crash, which complains about the unexpected RSP register:
crash: inconsistent active task indications for CPU 57:
runqueue: ffff882eac72e780 "runv" (default)
current_task: ffff882f768e1700 "swapper/57"
crash> runq -m -c 57
CPU 57: [0 00:00:00.000] PID: 8173 TASK: ffff882eac72e780 COMMAND: "runv"
crash> bt 8173
PID: 8173 TASK: ffff882eac72e780 CPU: 57 COMMAND: "runv"
#0 [ffff885fbe145e00] stop_this_cpu at ffffffff8101f66d
#1 [ffff885fbe145e10] kbox_rlock_stop_other_cpus_call at ffffffffa031e649
#2 [ffff885fbe145e50] smp_nmi_call_function_handler at ffffffff81047dd6
#3 [ffff885fbe145e68] nmi_handle at ffffffff8164fc09
#4 [ffff885fbe145eb0] do_nmi at ffffffff8164fd84
#5 [ffff885fbe145ef0] end_repeat_nmi at ffffffff8164eff9
[exception RIP: _raw_spin_lock+48]
RIP: ffffffff8164dc50 RSP: ffff882f768f3b18 RFLAGS: 00000002
RAX: 0000000000000a58 RBX: ffff882f76f1d080 RCX: 0000000000001920
RDX: 0000000000001922 RSI: 0000000000001922 RDI: ffff885fbe159580
RBP: ffff882f768f3b18 R8: 0000000000000012 R9: 0000000000000001
R10: 0000000000000400 R11: 0000000000000000 R12: ffff882f76f1d884
R13: 0000000000000046 R14: ffff885fbe159580 R15: 0000000000000039
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff882f768f3b18] _raw_spin_lock at ffffffff8164dc50
bt: cannot transition from exception stack to current process stack:
exception stack pointer: ffff885fbe145e00
process stack pointer: ffff882f768f3b18
current stack base: ffff882f34e38000
* kernel panic message when hard LOCKUP occurs
[ 4396.807556] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 55
[ 4396.807561] CPU: 55 PID: 8267 Comm: docker Tainted: G O ---- ------- 3.10.0-327.59.59.46.x86_64 #1
[ 4396.807563] Hardware name: Huawei RH2288H V3/BC11HGSA0, BIOS 1.57 08/11/2015
[ 4396.807564] Call Trace:
[ 4396.807571] <NMI> [<ffffffff81646140>] dump_stack+0x19/0x1b
[ 4396.807575] [<ffffffff8163f792>] panic+0xd8/0x214
[ 4396.807582] [<ffffffff811228b1>] watchdog_overflow_callback+0xd1/0xe0
[ 4396.807589] [<ffffffff81166161>] __perf_event_overflow+0xa1/0x250
[ 4396.807595] [<ffffffff81166c34>] perf_event_overflow+0x14/0x20
[ 4396.807600] [<ffffffff810339b8>] intel_pmu_handle_irq+0x1e8/0x470
[ 4396.807610] [<ffffffff812ffc11>] ? ioremap_page_range+0x241/0x320
[ 4396.807617] [<ffffffff813a1044>] ? ghes_copy_tofrom_phys+0x124/0x210
[ 4396.807621] [<ffffffff813a11d0>] ? ghes_read_estatus+0xa0/0x190
[ 4396.807626] [<ffffffff8165058b>] perf_event_nmi_handler+0x2b/0x50
[ 4396.807629] [<ffffffff8164fc09>] nmi_handle.isra.0+0x69/0xb0
[ 4396.807633] [<ffffffff8164fd84>] do_nmi+0x134/0x410
[ 4396.807637] [<ffffffff8164eff9>] end_repeat_nmi+0x1e/0x7e
[ 4396.807643] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807648] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807653] [<ffffffff8164dc5a>] ? _raw_spin_lock+0x3a/0x50
[ 4396.807658] <<EOE>> [<ffffffff810bd33c>] wake_up_new_task+0x9c/0x170
[ 4396.807662] [<ffffffff8107dfbb>] do_fork+0x13b/0x320
[ 4396.807667] [<ffffffff8107e226>] SyS_clone+0x16/0x20
[ 4396.807672] [<ffffffff816577f4>] stub_clone+0x44/0x70
[ 4396.807676] [<ffffffff8165743d>] ? system_call_fastpath+0x16/0x1b
* cpu info for the first CPU (72 CPUs in total)
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
stepping : 2
microcode : 0x3b
cpu MHz : 2300.000
cache size : 46080 KB
physical id : 0
siblings : 36
core id : 0
cpu cores : 18
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm arat epb invpcid_single pln pts dtherm spec_ctrl ibpb_support tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bogomips : 4589.42
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
---
David Woodhouse replied:

On Sat, 2018-01-20 at 17:00 +0800, Hou Tao wrote:
>
> So has anyone encountered a similar problem before? Any suggestions or
> directions for debugging the hard LOCKUP problem would be appreciated.

Arjan, what is the Intel recommendation here?
---
Arjan van de Ven replied:

Well, first of all, don't use IBRS; use retpoline.

And if Andrea says this was a known issue in their code, then I think
that closes the issue.
> -----Original Message-----
> From: David Woodhouse [mailto:[email protected]]
> Sent: Saturday, January 20, 2018 1:03 AM
> To: Hou Tao <[email protected]>; [email protected]; linux-
> [email protected]; Van De Ven, Arjan <[email protected]>
> Cc: [email protected]; Thomas Gleixner <[email protected]>;
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [RH72 Spectre] ibpb_enabled = 1 leads to hard LOCKUP under
> x86_64 host machine
>
> On Sat, 2018-01-20 at 17:00 +0800, Hou Tao wrote:
> >
> > So has anyone encountered a similar problem before? Any suggestions or
> > directions for debugging the hard LOCKUP problem would be appreciated.
>
> Arjan, what is the Intel recommendation here?
---
Andrea Arcangeli replied:

Hello everyone,

On Sat, Jan 20, 2018 at 01:56:08PM +0000, Van De Ven, Arjan wrote:
> Well, first of all, don't use IBRS; use retpoline.

This issue triggers in the IBPB code during user-to-user context
switches, and IBPB is still needed there regardless of whether the
kernel uses retpolines or kernel IBRS. In fact, IBPB is still needed
there even if retpolines+user_ibrs is used, or if
always_ibrs/ibrs_enabled=2 is used (IBRS doesn't protect against
poison generated in the same predictor mode, "especially" on future
CPUs).

Only retpolining all of userland would avoid the need for IBPB here,
but I doubt you are suggesting that.

Kernel retpolines or kernel IBRS would make zero difference to this
specific issue.
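
In case it helps to picture the path in question, below is a small
compilable userspace toy of that logic, not the vendor patch: the MSR
number and bit follow the upstream msr-index.h definitions, everything
else is a stand-in.

/* ibpb_toy.c -- toy model of issuing IBPB on a user-to-user context
 * switch. In the kernel this is a WRMSR on the switch path, gated by
 * the ibpb_enabled knob; here the MSR write is just logged. */
#include <stdio.h>
#include <stdint.h>

#define MSR_IA32_PRED_CMD 0x00000049           /* from msr-index.h */
#define PRED_CMD_IBPB     (1ULL << 0)

static int ibpb_enabled = 1;                   /* models the debugfs knob */

static void wrmsrl(uint32_t msr, uint64_t val)
{
        printf("wrmsr(0x%x, 0x%llx)\n", msr, (unsigned long long)val);
}

/* A real kernel compares mm_struct pointers; ints stand in for them. */
static void switch_mm_ibpb(int prev_mm, int next_mm)
{
        /* Barrier only when crossing user address spaces: neither
         * retpolines nor kernel IBRS stop one user process from
         * poisoning the predictor for the next one. */
        if (ibpb_enabled && next_mm != prev_mm)
                wrmsrl(MSR_IA32_PRED_CMD, PRED_CMD_IBPB);
}

int main(void)
{
        switch_mm_ibpb(1, 2);  /* user->user across mms: IBPB issued */
        switch_mm_ibpb(2, 2);  /* same mm (thread switch): no IBPB */
        return 0;
}
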
> And if Andrea says this was a known issue in their code, then I think
> that closes the issue.
>
It's an implementation bug that we inherited from the merge of a CPU
vendor patch, and I can confirm it's already closed. In fact, the fix
has already shipped with the wave 2 update, and some other versions
even had the bug fixed since the very first wave on 0day.

That deadlock nuisance only ever triggered in artificial QA testcases,
and even then it wasn't easily reproducible.

We have already moved the follow-ups to the vendor BZ to avoid using
bandwidth here.
Thank you!
Andrea