2018-05-01 16:44:35

by Rick Warner

[permalink] [raw]
Subject: stall/hang on 4.15 kernel with some Xeon skylake CPUs and extended APIC

Hi All,

I've discovered that some new Supermicro skylake systems will hang/stall
while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
the BIOS. The issue happens on specific CPUs only and follows the CPUs.

We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
exhibiting this behavior.  We replaced 2 CPUs at that time and the
behavior was eliminated. Those systems were then shipped to our customer
(we are an HPC system integrator).

Now, we have 5 single socket systems with 5122 CPUs.  2 out of the 5 are
hanging.  If we swap the CPUs from the hanging systems with working
systems, the behavior follows the CPU.

I've done a git bisect between 4.14 and 4.15 and found this commit is
triggering the issue:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5

Some of the commits right before it also seemed to trigger this warning:
[    5.062563] Debug warning: early ioremap leak of 1 areas detected.
               please boot with early_ioremap_debug and report the dmesg.

I have a dmesg log of 1 commit prior to the referenced link with
early_ioremap_debug enabled if it is desired.

The latest git still has the issue.

I've attached a dmesg log captured via serial console from a system
exhibiting this problem.  Here is an excerpt from it where the problems
start:

ACPI: Added _OSI(Module Device)
ACPI: Added _OSI(Processor Device)
ACPI: Added _OSI(3.0 _SCP Extensions)
ACPI: Added _OSI(Processor Aggregator Device)
ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
INFO: rcu_sched self-detected stall on CPU
        34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7497
INFO: rcu_sched detected stalls on CPUs/tasks:

        34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
softirq=18/18 fqs=7498
 (t=15002 jiffies g=-294 c=-295 q=391)
        (detected by 0, t=15002 jiffies, g=-294, c=-295, q=391)
NMI backtrace for cpu 34
CPU: 34 PID: 1 Comm: swapper/0 Not tainted 4.15.7-gentoo-r1-netuno-x86_64 #4
Hardware name: Supermicro SYS-2049U-TR4/X11QPH+, BIOS 2.0c 02/23/2018
Call Trace:
 <IRQ>
 dump_stack+0x5d/0x79
 nmi_cpu_backtrace+0x94/0xae
 ? irq_force_complete_move+0x6f/0x6f
 nmi_trigger_cpumask_backtrace+0x56/0xd3
 rcu_dump_cpu_stacks+0x96/0xc0
 rcu_check_callbacks+0x285/0x697
 update_process_times+0x28/0x4a
 tick_handle_periodic+0x20/0x5f
 smp_apic_timer_interrupt+0x93/0xf9
 apic_timer_interrupt+0x7d/0x90
 </IRQ>
RIP: 0010:smp_call_function_many+0x1f1/0x204
RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
 ? slub_cpu_dead+0xa0/0xa0
 ? slub_cpu_dead+0xa0/0xa0
 ? __mmu_notifier_mm_destroy+0x32/0x32
 on_each_cpu_mask+0x23/0x53
 ? slub_cpu_dead+0xa0/0xa0
 on_each_cpu_cond+0x7c/0x8b
 __kmem_cache_shrink+0x3c/0x237
 ? acpi_ps_delete_parse_tree+0x2d/0x59
 ? set_debug_rodata+0x11/0x11
 ? acpi_os_purge_cache+0xa/0xd
 acpi_os_purge_cache+0xa/0xd
 acpi_purge_cached_objects+0x29/0x38
 acpi_initialize_objects+0x46/0x4f
 ? acpi_sleep_init+0xd6/0xd6
 acpi_init+0xb6/0x324
 ? scan_for_dmi_ipmi+0x15/0xec
 ? acpi_sleep_init+0xd6/0xd6
 do_one_initcall+0x89/0x128
 ? set_debug_rodata+0x11/0x11
 ? set_debug_rodata+0x11/0x11
 kernel_init_freeable+0x112/0x18e
 ? rest_init+0xaa/0xaa
 kernel_init+0xa/0xf0
 ret_from_fork+0x35/0x40

The NMI dump info repeats periodically after that but never progresses
further.

If any other information is needed, please let me know.  I've reported
this issue to Supermicro already and they believe it is an issue with
the kernel opposed to an issue specific to their systems.  I don't have
any other brand Xeon skylake systems with extended APIC support that I
can try this with.

Thanks,
Rick


Richard Warner
Chief Technology Officer
Microway, Inc


Attachments:
dmesg-hang-with-extended-APIC-enabled.txt.gz (6.89 kB)

2018-05-15 16:08:57

by Rick Warner

[permalink] [raw]
Subject: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

Hi All,

Does anyone have ideas on this?  Is there any other data I can provide
to help debug this?

Thanks,
Rick

On 05/01/2018 12:37 PM, Rick Warner wrote:
> Hi All,
>
> I've discovered that some new Supermicro skylake systems will hang/stall
> while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
> the BIOS. The issue happens on specific CPUs only and follows the CPUs.
>
> We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
> exhibiting this behavior.  We replaced 2 CPUs at that time and the
> behavior was eliminated. Those systems were then shipped to our customer
> (we are an HPC system integrator).
>
> Now, we have 5 single socket systems with 5122 CPUs.  2 out of the 5 are
> hanging.  If we swap the CPUs from the hanging systems with working
> systems, the behavior follows the CPU.
>
> I've done a git bisect between 4.14 and 4.15 and found this commit is
> triggering the issue:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
>
> Some of the commits right before it also seemed to trigger this warning:
> [    5.062563] Debug warning: early ioremap leak of 1 areas detected.
>                please boot with early_ioremap_debug and report the dmesg.
>
> I have a dmesg log of 1 commit prior to the referenced link with
> early_ioremap_debug enabled if it is desired.
>
> The latest git still has the issue.
>
> I've attached a dmesg log captured via serial console from a system
> exhibiting this problem.  Here is an excerpt from it where the problems
> start:
>
> ACPI: Added _OSI(Module Device)
> ACPI: Added _OSI(Processor Device)
> ACPI: Added _OSI(3.0 _SCP Extensions)
> ACPI: Added _OSI(Processor Aggregator Device)
> ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> INFO: rcu_sched self-detected stall on CPU
>         34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
> softirq=18/18 fqs=7497
> INFO: rcu_sched detected stalls on CPUs/tasks:
>
>         34-....: (14997 ticks this GP) idle=b3e/140000000000001/0
> softirq=18/18 fqs=7498
>  (t=15002 jiffies g=-294 c=-295 q=391)
>         (detected by 0, t=15002 jiffies, g=-294, c=-295, q=391)
> NMI backtrace for cpu 34
> CPU: 34 PID: 1 Comm: swapper/0 Not tainted 4.15.7-gentoo-r1-netuno-x86_64 #4
> Hardware name: Supermicro SYS-2049U-TR4/X11QPH+, BIOS 2.0c 02/23/2018
> Call Trace:
>  <IRQ>
>  dump_stack+0x5d/0x79
>  nmi_cpu_backtrace+0x94/0xae
>  ? irq_force_complete_move+0x6f/0x6f
>  nmi_trigger_cpumask_backtrace+0x56/0xd3
>  rcu_dump_cpu_stacks+0x96/0xc0
>  rcu_check_callbacks+0x285/0x697
>  update_process_times+0x28/0x4a
>  tick_handle_periodic+0x20/0x5f
>  smp_apic_timer_interrupt+0x93/0xf9
>  apic_timer_interrupt+0x7d/0x90
>  </IRQ>
> RIP: 0010:smp_call_function_many+0x1f1/0x204
> RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
> RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
> RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
> RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
> R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
> R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
>  ? slub_cpu_dead+0xa0/0xa0
>  ? slub_cpu_dead+0xa0/0xa0
>  ? __mmu_notifier_mm_destroy+0x32/0x32
>  on_each_cpu_mask+0x23/0x53
>  ? slub_cpu_dead+0xa0/0xa0
>  on_each_cpu_cond+0x7c/0x8b
>  __kmem_cache_shrink+0x3c/0x237
>  ? acpi_ps_delete_parse_tree+0x2d/0x59
>  ? set_debug_rodata+0x11/0x11
>  ? acpi_os_purge_cache+0xa/0xd
>  acpi_os_purge_cache+0xa/0xd
>  acpi_purge_cached_objects+0x29/0x38
>  acpi_initialize_objects+0x46/0x4f
>  ? acpi_sleep_init+0xd6/0xd6
>  acpi_init+0xb6/0x324
>  ? scan_for_dmi_ipmi+0x15/0xec
>  ? acpi_sleep_init+0xd6/0xd6
>  do_one_initcall+0x89/0x128
>  ? set_debug_rodata+0x11/0x11
>  ? set_debug_rodata+0x11/0x11
>  kernel_init_freeable+0x112/0x18e
>  ? rest_init+0xaa/0xaa
>  kernel_init+0xa/0xf0
>  ret_from_fork+0x35/0x40
>
> The NMI dump info repeats periodically after that but never progresses
> further.
>
> If any other information is needed, please let me know.  I've reported
> this issue to Supermicro already and they believe it is an issue with
> the kernel opposed to an issue specific to their systems.  I don't have
> any other brand Xeon skylake systems with extended APIC support that I
> can try this with.
>
> Thanks,
> Rick
>
>
> Richard Warner
> Chief Technology Officer
> Microway, Inc


2018-05-15 20:20:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

On Tue, 15 May 2018, Rick Warner wrote:
> > I've discovered that some new Supermicro skylake systems will hang/stall
> > while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
> > the BIOS. The issue happens on specific CPUs only and follows the CPUs.
> >
> > We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
> > exhibiting this behavior.  We replaced 2 CPUs at that time and the
> > behavior was eliminated. Those systems were then shipped to our customer
> > (we are an HPC system integrator).
> >
> > Now, we have 5 single socket systems with 5122 CPUs.  2 out of the 5 are
> > hanging.  If we swap the CPUs from the hanging systems with working
> > systems, the behavior follows the CPU.

That's weird.

> > I've done a git bisect between 4.14 and 4.15 and found this commit is
> > triggering the issue:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
> >

Interesting.

> > I've attached a dmesg log captured via serial console from a system
> > exhibiting this problem.  Here is an excerpt from it where the problems
> > start:

> > NMI backtrace for cpu 34

> > RIP: 0010:smp_call_function_many+0x1f1/0x204

So this waits for the IPI to be handled on some other CPU(s).

> > RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
> > RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
> > RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
> > RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
> > R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
> > R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
> >  ? slub_cpu_dead+0xa0/0xa0
> >  ? slub_cpu_dead+0xa0/0xa0
> >  ? __mmu_notifier_mm_destroy+0x32/0x32
> >  on_each_cpu_mask+0x23/0x53
> >  ? slub_cpu_dead+0xa0/0xa0
> >  on_each_cpu_cond+0x7c/0x8b
> >  __kmem_cache_shrink+0x3c/0x237
> >  ? acpi_ps_delete_parse_tree+0x2d/0x59
> >  ? set_debug_rodata+0x11/0x11
> >  ? acpi_os_purge_cache+0xa/0xd
> >  acpi_os_purge_cache+0xa/0xd
> >  acpi_purge_cached_objects+0x29/0x38
> >  acpi_initialize_objects+0x46/0x4f
> >  ? acpi_sleep_init+0xd6/0xd6
> >  acpi_init+0xb6/0x324
> >  ? scan_for_dmi_ipmi+0x15/0xec
> >  ? acpi_sleep_init+0xd6/0xd6
> >  do_one_initcall+0x89/0x128
> >  ? set_debug_rodata+0x11/0x11
> >  ? set_debug_rodata+0x11/0x11
> >  kernel_init_freeable+0x112/0x18e
> >  ? rest_init+0xaa/0xaa
> >  kernel_init+0xa/0xf0
> >  ret_from_fork+0x35/0x40

> > If any other information is needed, please let me know.  I've reported
> > this issue to Supermicro already and they believe it is an issue with
> > the kernel opposed to an issue specific to their systems.  I don't have
> > any other brand Xeon skylake systems with extended APIC support that I
> > can try this with.

I can't spot an immediate fail with that commit, but I'll have a look
tomorrow for instrumenting this with tracepoints which can be dumped from
the stall detector.

Thanks,

tglx

2018-05-16 14:50:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

Rick,

On Tue, 15 May 2018, Thomas Gleixner wrote:
> I can't spot an immediate fail with that commit, but I'll have a look
> tomorrow for instrumenting this with tracepoints which can be dumped from
> the stall detector.

can you please give the patch below a try? I assume you have a serial
console, otherwise this is going to be tedious.

Please add the following to the kernel command line:

ftrace_dump_on_oops

Note, the box will panic after the first stall and spill out the trace
buffer over serial, which might take a while.

Thanks,

tglx

8<--------------------

--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -52,20 +52,28 @@ static void
if (apic_dest != APIC_DEST_ALLINC)
cpumask_clear_cpu(smp_processor_id(), tmpmsk);

+ trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk));
+
/* Collapse cpus in a cluster so a single IPI per cluster is sent */
for_each_cpu(cpu, tmpmsk) {
struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu);

dest = 0;
+ trace_printk("CPU: %u cluster: %*pbl\n", cpu,
+ cpumask_pr_args(&cmsk->mask));
for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask)
dest |= per_cpu(x86_cpu_to_logical_apicid, clustercpu);

- if (!dest)
+ if (!dest) {
+ trace_printk("dest = 0!?\n");
continue;
+ }

__x2apic_send_IPI_dest(dest, vector, apic->dest_logical);
/* Remove cluster CPUs from tmpmask */
cpumask_andnot(tmpmsk, tmpmsk, &cmsk->mask);
+ trace_printk("dest %08x --> tmpmsk %*pbl\n", dest,
+ cpumask_pr_args(tmpmsk));
}

local_irq_restore(flags);
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -124,7 +124,7 @@ int rcu_num_lvls __read_mostly = RCU_NUM
int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
/* panic() on RCU Stall sysctl. */
-int sysctl_panic_on_rcu_stall __read_mostly;
+int sysctl_panic_on_rcu_stall __read_mostly = 1;

/*
* The rcu_scheduler_active variable is initialized to the value
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -681,6 +681,7 @@ void on_each_cpu_cond(bool (*cond_func)(
for_each_online_cpu(cpu)
if (cond_func(cpu, info))
cpumask_set_cpu(cpu, cpus);
+ trace_printk("%*pbl\n", cpumask_pr_args(cpus));
on_each_cpu_mask(cpus, func, info, wait);
preempt_enable();
free_cpumask_var(cpus);

2018-05-16 23:03:25

by Rick Warner

[permalink] [raw]
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

Thanks Thomas!

I've attached the dmesg output with the kernel parameter and supplied patch.

Here is the excerpt of the ftrace dump:
Dumping ftrace buffer:
---------------------------------
swapper/-1       0d... 463331us : __x2apic_send_IPI_mask: To:
swapper/-1       0d... 463333us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1       0d... 463334us : __x2apic_send_IPI_mask: dest 00010001
--> tmpmsk
swapper/-1       0d... 504423us : __x2apic_send_IPI_mask: To:
swapper/-1       0d... 504424us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1       0d... 504425us : __x2apic_send_IPI_mask: dest 00000400
--> tmpmsk
swapper/-1       0d... 504425us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1       0d... 504425us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1       0d... 504425us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1       0d... 504426us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1       0d... 504430us : __x2apic_send_IPI_mask: To:
swapper/-1       0d... 504430us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1       0d... 504431us : __x2apic_send_IPI_mask: dest 00000400
--> tmpmsk
swapper/-1       0d... 504431us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1       0d... 504431us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1       0d... 504431us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1       0d... 504431us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1       0d... 504433us : __x2apic_send_IPI_mask: To:
swapper/-1       0d... 504433us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
swapper/-1       0d... 504433us : __x2apic_send_IPI_mask: dest 00000400
--> tmpmsk
swapper/-1       0d... 504434us : __x2apic_send_IPI_mask: CPU: 2 cluster: 2
swapper/-1       0d... 504434us : __x2apic_send_IPI_mask: dest 00010010
--> tmpmsk
swapper/-1       0d... 504434us : __x2apic_send_IPI_mask: CPU: 3 cluster: 3
swapper/-1       0d... 504434us : __x2apic_send_IPI_mask: dest 00010040
--> tmpmsk
swapper/-1       3d... 1813404us : __x2apic_send_IPI_mask: To:
swapper/-1       3d... 1813405us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1       3d... 1813406us : __x2apic_send_IPI_mask: dest 00010001
--> tmpmsk
swapper/-1       3.... 1879964us : on_each_cpu_cond: 0-1,3
swapper/-1       3.... 1879972us : on_each_cpu_cond: 0-1,3
swapper/-1       3.... 1879975us : on_each_cpu_cond: 0-1,3
swapper/-1       3d... 1879976us : __x2apic_send_IPI_mask: To:
swapper/-1       3d... 1879977us : __x2apic_send_IPI_mask: CPU: 0
cluster: 0-1
swapper/-1       3d... 1879977us : __x2apic_send_IPI_mask: dest 00010401
--> tmpmsk
  <idle>-0       0d.h. 61903171us : __x2apic_send_IPI_mask: To:
  <idle>-0       0d.h. 61903172us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  <idle>-0       0d.h. 61903172us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk
  <idle>-0       0d.h. 61904223us : __x2apic_send_IPI_mask: To:
  <idle>-0       0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 1
cluster: 0-1
  <idle>-0       0d.h. 61904223us : __x2apic_send_IPI_mask: dest
00000400 --> tmpmsk
  <idle>-0       0d.h. 61904223us : __x2apic_send_IPI_mask: CPU: 2
cluster: 2
  <idle>-0       0d.h. 61904224us : __x2apic_send_IPI_mask: dest
00010010 --> tmpmsk
  <idle>-0       0d.h. 61904224us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  <idle>-0       0d.h. 61904224us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk
  <idle>-0       0d.h. 62924392us : __x2apic_send_IPI_mask: To:
  <idle>-0       0d.h. 62924393us : __x2apic_send_IPI_mask: CPU: 3
cluster: 3
  <idle>-0       0d.h. 62924394us : __x2apic_send_IPI_mask: dest
00010040 --> tmpmsk

Thanks,
Rick

On 05/16/18 10:50, Thomas Gleixner wrote:
> Rick,
>
> On Tue, 15 May 2018, Thomas Gleixner wrote:
>> I can't spot an immediate fail with that commit, but I'll have a look
>> tomorrow for instrumenting this with tracepoints which can be dumped from
>> the stall detector.
> can you please give the patch below a try? I assume you have a serial
> console, otherwise this is going to be tedious.
>
> Please add the following to the kernel command line:
>
> ftrace_dump_on_oops
>
> Note, the box will panic after the first stall and spill out the trace
> buffer over serial, which might take a while.
>
> Thanks,
>
> tglx
>
> 8<--------------------
>
> --- a/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> @@ -52,20 +52,28 @@ static void
> if (apic_dest != APIC_DEST_ALLINC)
> cpumask_clear_cpu(smp_processor_id(), tmpmsk);
>
> + trace_printk("To: %*pbl\n", cpumask_pr_args(tmpmsk));
> +
> /* Collapse cpus in a cluster so a single IPI per cluster is sent */
> for_each_cpu(cpu, tmpmsk) {
> struct cluster_mask *cmsk = per_cpu(cluster_masks, cpu);
>
> dest = 0;
> + trace_printk("CPU: %u cluster: %*pbl\n", cpu,
> + cpumask_pr_args(&cmsk->mask));
> for_each_cpu_and(clustercpu, tmpmsk, &cmsk->mask)
> dest |= per_cpu(x86_cpu_to_logical_apicid, clustercpu);
>
> - if (!dest)
> + if (!dest) {
> + trace_printk("dest = 0!?\n");
> continue;
> + }
>
> __x2apic_send_IPI_dest(dest, vector, apic->dest_logical);
> /* Remove cluster CPUs from tmpmask */
> cpumask_andnot(tmpmsk, tmpmsk, &cmsk->mask);
> + trace_printk("dest %08x --> tmpmsk %*pbl\n", dest,
> + cpumask_pr_args(tmpmsk));
> }
>
> local_irq_restore(flags);
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -124,7 +124,7 @@ int rcu_num_lvls __read_mostly = RCU_NUM
> int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
> int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
> /* panic() on RCU Stall sysctl. */
> -int sysctl_panic_on_rcu_stall __read_mostly;
> +int sysctl_panic_on_rcu_stall __read_mostly = 1;
>
> /*
> * The rcu_scheduler_active variable is initialized to the value
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -681,6 +681,7 @@ void on_each_cpu_cond(bool (*cond_func)(
> for_each_online_cpu(cpu)
> if (cond_func(cpu, info))
> cpumask_set_cpu(cpu, cpus);
> + trace_printk("%*pbl\n", cpumask_pr_args(cpus));
> on_each_cpu_mask(cpus, func, info, wait);
> preempt_enable();
> free_cpumask_var(cpus);


Attachments:
x2apic-hang-on-boot-dmesg-patch1.txt (23.72 kB)

2018-05-17 12:37:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

Rick,

On Wed, 16 May 2018, Rick Warner wrote:

> I've attached the dmesg output with the kernel parameter and supplied patch.

Thanks for providing the data. I think I know what's wrong here. Does the
patch below fix your problem?

While you test, I go looking for a big brown paperbag.

Thanks,

tglx

8<----------------

--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -124,6 +124,7 @@ static void init_x2apic_ldr(void)
goto update;
}
cmsk = cluster_hotplug_mask;
+ cmsk->clusterid = cluster;
cluster_hotplug_mask = NULL;
update:
this_cpu_write(cluster_masks, cmsk);

2018-05-17 16:01:35

by Rick Warner

[permalink] [raw]
Subject: Re: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC

Success!

This resolved the issue.  Thanks Thomas!

Rick

On 05/17/18 08:36, Thomas Gleixner wrote:
> Rick,
>
> On Wed, 16 May 2018, Rick Warner wrote:
>
>> I've attached the dmesg output with the kernel parameter and supplied patch.
> Thanks for providing the data. I think I know what's wrong here. Does the
> patch below fix your problem?
>
> While you test, I go looking for a big brown paperbag.
>
> Thanks,
>
> tglx
>
> 8<----------------
>
> --- a/arch/x86/kernel/apic/x2apic_cluster.c
> +++ b/arch/x86/kernel/apic/x2apic_cluster.c
> @@ -124,6 +124,7 @@ static void init_x2apic_ldr(void)
> goto update;
> }
> cmsk = cluster_hotplug_mask;
> + cmsk->clusterid = cluster;
> cluster_hotplug_mask = NULL;
> update:
> this_cpu_write(cluster_masks, cmsk);


Subject: [tip:x86/urgent] x86/apic/x2apic: Initialize cluster ID properly

Commit-ID: fed71f7d98795ed0fa1d431910787f0f4a68324f
Gitweb: https://git.kernel.org/tip/fed71f7d98795ed0fa1d431910787f0f4a68324f
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 17 May 2018 14:36:39 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 17 May 2018 21:00:12 +0200

x86/apic/x2apic: Initialize cluster ID properly

Rick bisected a regression on large systems which use the x2apic cluster
mode for interrupt delivery to the commit wich reworked the cluster
management.

The problem is caused by a missing initialization of the clusterid field
in the shared cluster data structures. So all structures end up with
cluster ID 0 which only allows sharing between all CPUs which belong to
cluster 0. All other CPUs with a cluster ID > 0 cannot share the data
structure because they cannot find existing data with their cluster
ID. This causes malfunction with IPIs because IPIs are sent to the wrong
cluster and the caller waits for ever that the target CPU handles the IPI.

Add the missing initialization when a upcoming CPU is the first in a
cluster so that the later booting CPUs can find the data and share it for
proper operation.

Fixes: 023a611748fd ("x86/apic/x2apic: Simplify cluster management")
Reported-by: Rick Warner <[email protected]>
Bisected-by: Rick Warner <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Rick Warner <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/apic/x2apic_cluster.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index 8b04234e010b..7685444a106b 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -116,6 +116,7 @@ static void init_x2apic_ldr(void)
goto update;
}
cmsk = cluster_hotplug_mask;
+ cmsk->clusterid = cluster;
cluster_hotplug_mask = NULL;
update:
this_cpu_write(cluster_masks, cmsk);