Subject: [bisected] rcu_sched detected stalls - 4.15 or newer kernel with some Xeon skylake CPUs and extended APIC
From: Rick Warner
To: Linux Kernel Mailing List, Thomas Gleixner
References: <831e8a53-05d1-edfb-6287-fecfba22b8bd@microway.com>
Date: Tue, 15 May 2018 12:07:56 -0400
In-Reply-To:
 <831e8a53-05d1-edfb-6287-fecfba22b8bd@microway.com>

Hi All,

Does anyone have ideas on this? Is there any other data I can provide to
help debug this?

Thanks,
Rick

On 05/01/2018 12:37 PM, Rick Warner wrote:
> Hi All,
>
> I've discovered that some new Supermicro skylake systems will hang/stall
> while booting the 4.15 kernel when extended APIC (x2apic) is enabled in
> the BIOS. The issue happens on specific CPUs only and follows the CPUs.
>
> We had (4) quad socket systems with Xeon 6134 CPUs; 2 out of 4 were
> exhibiting this behavior. We replaced 2 CPUs at that time and the
> behavior was eliminated. Those systems were then shipped to our customer
> (we are an HPC system integrator).
>
> Now, we have 5 single socket systems with 5122 CPUs. 2 out of the 5 are
> hanging. If we swap the CPUs from the hanging systems with working
> systems, the behavior follows the CPU.
>
> I've done a git bisect between 4.14 and 4.15 and found this commit is
> triggering the issue:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=023a611748fd58d46c8aa049cf4f22ebada983f5
>
> Some of the commits right before it also seemed to trigger this warning:
> [    5.062563] Debug warning: early ioremap leak of 1 areas detected.
>                please boot with early_ioremap_debug and report the dmesg.
>
> I have a dmesg log of 1 commit prior to the referenced link with
> early_ioremap_debug enabled if it is desired.
>
> The latest git still has the issue.
>
> I've attached a dmesg log captured via serial console from a system
> exhibiting this problem.
> Here is an excerpt from it where the problems start:
>
> ACPI: Added _OSI(Module Device)
> ACPI: Added _OSI(Processor Device)
> ACPI: Added _OSI(3.0 _SCP Extensions)
> ACPI: Added _OSI(Processor Aggregator Device)
> ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> INFO: rcu_sched self-detected stall on CPU
>         34-....: (14997 ticks this GP) idle=b3e/140000000000001/0 softirq=18/18 fqs=7497
> INFO: rcu_sched detected stalls on CPUs/tasks:
>
>         34-....: (14997 ticks this GP) idle=b3e/140000000000001/0 softirq=18/18 fqs=7498
>  (t=15002 jiffies g=-294 c=-295 q=391)
>         (detected by 0, t=15002 jiffies, g=-294, c=-295, q=391)
> NMI backtrace for cpu 34
> CPU: 34 PID: 1 Comm: swapper/0 Not tainted 4.15.7-gentoo-r1-netuno-x86_64 #4
> Hardware name: Supermicro SYS-2049U-TR4/X11QPH+, BIOS 2.0c 02/23/2018
> Call Trace:
>  <IRQ>
>  dump_stack+0x5d/0x79
>  nmi_cpu_backtrace+0x94/0xae
>  ? irq_force_complete_move+0x6f/0x6f
>  nmi_trigger_cpumask_backtrace+0x56/0xd3
>  rcu_dump_cpu_stacks+0x96/0xc0
>  rcu_check_callbacks+0x285/0x697
>  update_process_times+0x28/0x4a
>  tick_handle_periodic+0x20/0x5f
>  smp_apic_timer_interrupt+0x93/0xf9
>  apic_timer_interrupt+0x7d/0x90
>  </IRQ>
> RIP: 0010:smp_call_function_many+0x1f1/0x204
> RSP: 0000:ffffc900000f3af0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
> RAX: 0000000000000001 RBX: ffff880c110a0488 RCX: 0000000000000001
> RDX: ffff880c10e64440 RSI: 0000000000000000 RDI: ffff880c110a0488
> RBP: ffff880c110a0480 R08: fffffffffffffffe R09: 0000000000000003
> R10: 0000000000000000 R11: ffffea00c03c1a60 R12: 0000000000000001
> R13: ffff880c110a04b8 R14: 0000000000020440 R15: ffffffff81ed5400
>  ? slub_cpu_dead+0xa0/0xa0
>  ? slub_cpu_dead+0xa0/0xa0
>  ? __mmu_notifier_mm_destroy+0x32/0x32
>  on_each_cpu_mask+0x23/0x53
>  ? slub_cpu_dead+0xa0/0xa0
>  on_each_cpu_cond+0x7c/0x8b
>  __kmem_cache_shrink+0x3c/0x237
>  ? acpi_ps_delete_parse_tree+0x2d/0x59
>  ? set_debug_rodata+0x11/0x11
>  ? acpi_os_purge_cache+0xa/0xd
>  acpi_os_purge_cache+0xa/0xd
>  acpi_purge_cached_objects+0x29/0x38
>  acpi_initialize_objects+0x46/0x4f
>  ? acpi_sleep_init+0xd6/0xd6
>  acpi_init+0xb6/0x324
>  ? scan_for_dmi_ipmi+0x15/0xec
>  ? acpi_sleep_init+0xd6/0xd6
>  do_one_initcall+0x89/0x128
>  ? set_debug_rodata+0x11/0x11
>  ? set_debug_rodata+0x11/0x11
>  kernel_init_freeable+0x112/0x18e
>  ? rest_init+0xaa/0xaa
>  kernel_init+0xa/0xf0
>  ret_from_fork+0x35/0x40
>
> The NMI dump info repeats periodically after that but never progresses
> further.
>
> If any other information is needed, please let me know. I've reported
> this issue to Supermicro already and they believe it is an issue with
> the kernel opposed to an issue specific to their systems. I don't have
> any other brand Xeon skylake systems with extended APIC support that I
> can try this with.
>
> Thanks,
> Rick
>
>
> Richard Warner
> Chief Technology Officer
> Microway, Inc