Received: by 2002:a25:b794:0:0:0:0:0 with SMTP id n20csp2497469ybh; Mon, 5 Aug 2019 02:01:21 -0700 (PDT) X-Google-Smtp-Source: APXvYqzGqXtWH03ZVktyTiR/jKl56ih/brnwaVH6MhHRt7nUDVYUVX1pqdaF6dn+gu6/WyaR14o9 X-Received: by 2002:a17:902:2bcb:: with SMTP id l69mr144539249plb.155.1564995681679; Mon, 05 Aug 2019 02:01:21 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1564995681; cv=none; d=google.com; s=arc-20160816; b=03Ty76vERoeTNqpI+vTaMwySMC58S/ByrVGdVooxkVn353lzN4OAozxJpzQ7MZkvyT QuSBgWrOMhmzBfbE3axa0A17aOUdX8/ZdqJKfppDl74UZJmK7BoX9jGcX3FcsT1f93Jr MNt/Pw8mbyeAP1KuDVkWjRQZHzJkBDija1Hooe0LqF6Lq9+l4sxSBoifxs/E4Rb3Wc8/ 4RG+zJi07Tyus6K19PKhwQdFyMhx1RqasBBdNJhIsCyQANzJb4UNKQ4fKIfSqFnU3ge0 G21CyHUt9pu6MedbUOKBy4ModVDY9RQjAG2jkO2cTpTOMatvNmqDPIN1O4v5ZsMluAFa faIg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:references:in-reply-to:message-id:date :subject:cc:to:from:dkim-signature; bh=rjwVvQQidw169Jm3n3lgZp9lgXOA8D2Mu3C4QxVGgVE=; b=O4rX/x8w+56P3oar5ijDIqoQib0w/TU3KcMrIQqmqy5FQZRw69gD1gOy/m/P2h3U3H tQNMEwIb1vsR8cJUbzoZ1gN/Ocl4xRCii7LTwj2kXPtX/TGoNK27Ji2VQu50ijs9C4s8 Ueb6vu6JGNStzLW6FzyfuEIrtN5I9w55WY3/5KcayNATLip2o3PbAH9pfyN8KuJUfJ5O lTjy9rZZh2ZC6IiYWiysiPSQiouVsKCyARJbMUV6Cp+of1uv17uE9acV+2QOSorGJaDp 27Xj2AmLtCHkMsOgbDtpiEiSQ+rCf2aoLn2Ii5z7o3LOzxpyvMncM1XK5A8WcjNbY7j0 z07g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=LjlqH2tn; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p3si11996661pjr.18.2019.08.05.02.01.05; Mon, 05 Aug 2019 02:01:21 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=LjlqH2tn; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727942AbfHEJAD (ORCPT + 99 others); Mon, 5 Aug 2019 05:00:03 -0400 Received: from mail-pf1-f195.google.com ([209.85.210.195]:35942 "EHLO mail-pf1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727855AbfHEJAC (ORCPT ); Mon, 5 Aug 2019 05:00:02 -0400 Received: by mail-pf1-f195.google.com with SMTP id r7so39302684pfl.3 for ; Mon, 05 Aug 2019 02:00:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references; bh=rjwVvQQidw169Jm3n3lgZp9lgXOA8D2Mu3C4QxVGgVE=; b=LjlqH2tntzgY1ntlSynASEYypUp7REv5zTRSmv7L02OieZzVhFWPNxJQbR9Q8A/M2a EuFxy7OulufVEDaQ+LePJNT1VZEbjiBNVIgItbAVfO9zyt4LkZutFHb//EazkR9clAa7 e3F+LHEw+JVZaAyTDpPB0bjxU7Rd2lUcx5OTGVmJglNEXkvYRFN1QlbMF8lYiRc0Ph4v pGsQQ9J8HchC4ZGH08Dre6X9FWN6MAFL7LMr6AZh93JPWg2s9MbamxT5ziAdNhKICKVw s9TVE0x+fMiRb4sqlvHqCMBJCgb4Kgjwdl3eNCO4xNMeQQQENeBcvSdVLek57fZjFUx1 kIPA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references; bh=rjwVvQQidw169Jm3n3lgZp9lgXOA8D2Mu3C4QxVGgVE=; b=W74+OmqIiHBD036+WW5AlJ9H+0xePHTbO1BixaLhdTRYgc/j5AL+MxaZf3VxrAU79u OnwLXacjFjRXhalAH19L4PYMQinkS6DDKV+RGCUxWOdlhVM0e7NhFNAKh2bk+c2FDSv1 xUpXu10OK+jT33wYbcFJFvxQCRsYd631vthK0UFZTEXTFuL6JuhHhGZLgBEV8pnMBHKC u8jgBE7LAeImTM26LCVFeNZ7hqiCSjMX5Fry35JXc09GWZLfWbY9yqwr4KYRSauoxaL3 LiPbUrFyvI8rHCopCXrrRtf8zHP0Sv1L5YEsv3sPsfBbQc9S8JxzMcqXfRNVO5fwYKxv XNhA== X-Gm-Message-State: APjAAAXevPunUEc3A8hzicSz8r389jlB8qI/cuGG/VvH8lWQw/UI5rsm 9QJW6GwKnhENlbFjXKmKLw== X-Received: by 2002:a17:90a:d814:: with SMTP id a20mr17104592pjv.48.1564995600804; Mon, 05 Aug 2019 02:00:00 -0700 (PDT) Received: from mylaptop.redhat.com ([209.132.188.80]) by smtp.gmail.com with ESMTPSA id v184sm82428375pfb.82.2019.08.05.01.59.53 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 05 Aug 2019 02:00:00 -0700 (PDT) From: Pingfan Liu To: Thomas Gleixner , Andy Lutomirski , x86@kernel.org Cc: Pingfan Liu , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Dave Hansen , Peter Zijlstra , Masami Hiramatsu , Qian Cai , Vlastimil Babka , Daniel Drake , Jacob Pan , Michal Hocko , Eric Biederman , linux-kernel@vger.kernel.org, Dave Young , Baoquan He , kexec@lists.infradead.org Subject: [PATCH 3/4] x86/smp: send capped cpus to a stable state when smp_init() Date: Mon, 5 Aug 2019 16:58:58 +0800 Message-Id: <1564995539-29609-4-git-send-email-kernelfans@gmail.com> X-Mailer: git-send-email 2.7.5 In-Reply-To: <1564995539-29609-1-git-send-email-kernelfans@gmail.com> References: <1564995539-29609-1-git-send-email-kernelfans@gmail.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org *** Back ground *** On x86 it's required to have all logical CPUs set CR4.MCE=1. Otherwise, a broadcast MCE observing CR4.MCE=0b on any core will shutdown the machine. The option 'nosmt' has already complied with the above rule by Thomas's patch. For detail, refer to 506a66f3748 (Revert "x86/apic: Ignore secondary threads if nosmt=force") But for nr_cpus option, the exposure to broadcast MCE is a little complicated, and can be categorized into three cases. -1. boot up by BIOS. Since no one set CR4.MCE=1, nr_cpus risks rebooting by broadcast MCE. -2. boot up by "kexec -p nr_cpus=". Since the 1st kernel has all cpus' CR4.MCE=1 set before kexec -p, nr_cpus is free of rebooting by broadcast MCE. Furthermore, the crashed kernel's wreckage, including page table and text, is not touched by capture kernel. Hence if MCE event happens on capped cpu, do_machine_check->__mc_check_crashing_cpu() runs smoothly and returns immediately, the capped cpu is still pinned on "halt". -3. boot up by "kexec -l nr_cpus=". As "kexec -p", it is free of rebooting by broadcast MCE. But the 1st kernel's wreckage is discarded and changed. when capped cpus execute do_machine_check(), they may crack the new kernel. But this is not related with broadcast MCE, and need an extra fix. In a word, only case 1 suffers from unexpected rebooting due to broadcast MCE. *** Solution *** When fixing case 1, "nr_cpus" can not follow the same way as "nosmt". Because nr_cpus limits the allocation of percpu area and some other kthread memory, which is critical to cpu hotplug framework. Instead, developing a dedicated SIPI callback make_capped_cpu_stable() for capped cpu, which does not lean on percpu area to work. BTW this patch has side effect to suppress case 3. It brings capped cpu to make_capped_cpu_stable() in the 2nd kernel's context, instead of leaving them with the 1st kernel's context. Signed-off-by: Pingfan Liu To: Thomas Gleixner To: Andy Lutomirski Cc: Ingo Molnar Cc: Borislav Petkov Cc: "H. Peter Anvin" Cc: Dave Hansen Cc: Peter Zijlstra To: x86@kernel.org Cc: Masami Hiramatsu Cc: Qian Cai Cc: Vlastimil Babka Cc: Daniel Drake Cc: Jacob Pan Cc: Michal Hocko Cc: Eric Biederman Cc: linux-kernel@vger.kernel.org Cc: Dave Young Cc: Baoquan He Cc: kexec@lists.infradead.org --- arch/x86/include/asm/apic.h | 1 + arch/x86/include/asm/smp.h | 1 + arch/x86/kernel/apic/apic.c | 2 +- arch/x86/kernel/cpu/common.c | 3 ++ arch/x86/kernel/smpboot.c | 83 ++++++++++++++++++++++++++++++++++++++++++++ kernel/smp.c | 6 ++++ 6 files changed, 95 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index e647aa0..02232e3 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -49,6 +49,7 @@ static inline void generic_apic_probe(void) #ifdef CONFIG_X86_LOCAL_APIC +extern int cpuid_to_apicid[]; extern int apic_verbosity; extern int local_apic_timer_c2_ok; diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h index 5f63399..f67ce77 100644 --- a/arch/x86/include/asm/smp.h +++ b/arch/x86/include/asm/smp.h @@ -197,6 +197,7 @@ extern void nmi_selftest(void); #endif extern struct cpumask *cpu_capped_mask; +extern struct cpumask *cpu_capped_done_mask; #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_SMP_H */ diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 6a57bad3..b86c9e6 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -2244,7 +2244,7 @@ static int nr_logical_cpuids = 1; /* * Used to store mapping between logical CPU IDs and APIC IDs. */ -static int cpuid_to_apicid[] = { +int cpuid_to_apicid[] = { [0 ... NR_CPUS - 1] = -1, }; diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 4d87df5..1a62b37 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -69,6 +69,8 @@ cpumask_var_t cpu_callin_mask; /* size of NR_CPUS is required. */ struct cpumask __cpu_capped_mask __initdata; struct cpumask *cpu_capped_mask; +struct cpumask __cpu_capped_done_mask __initdata; +struct cpumask *cpu_capped_done_mask; /* representing cpus for which sibling maps can be computed */ cpumask_var_t cpu_sibling_setup_mask; @@ -88,6 +90,7 @@ void __init setup_cpu_local_masks(void) alloc_bootmem_cpumask_var(&cpu_callout_mask); alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask); cpu_capped_mask = &__cpu_capped_mask; + cpu_capped_done_mask = &__cpu_capped_done_mask; } static void default_init(struct cpuinfo_x86 *c) diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index fdbd47c..d247fbd 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1014,6 +1014,89 @@ int common_cpu_up(unsigned int cpu, struct task_struct *idle) return 0; } +/* SIPI callback for the capped cpu */ +static void notrace make_capped_cpu_stable(void *unused) +{ + unsigned int apic_id; + int i, cpu = -1; + + /* protect against MCE broadcast */ + native_write_cr4(X86_CR4_MCE); + x2apic_setup(); + apic_id = read_apic_id(); + for (i = 0; i < NR_CPUS; i++) + if (cpuid_to_apicid[i] == apic_id) { + cpu = i; + break; + } + + /* trampoline_*.S has loaded idt with 0, 0 */ + + /* No response to any interrupt */ + disable_local_APIC(); + /* Done with the temporary stack, signal the master to go ahead */ + if (cpu != -1) { + cpumask_clear_cpu(cpu, cpu_capped_mask); + cpumask_set_cpu(cpu, cpu_capped_done_mask); + } + + /* run without stack and can not have the __init attribute */ + do { + asm volatile("hlt" : : : "memory"); + } while (true); +} + +static void __init do_stable_cpu(int cpu) +{ + static char capped_tmp_stack[512]; + int cpu0_nmi_registered = 0, apicid = cpuid_to_apicid[cpu]; + unsigned long start_ip = real_mode_header->trampoline_start; + unsigned long timeout, boot_error = 0; + + /* invalid percpu area */ + initial_gs = 0; + /* + * Borrow the value of cpu 0. Since capped cpu segment shadow register + * can cache the content, and keep it unchanged. + */ + early_gdt_descr.address = (unsigned long)get_cpu_gdt_rw(0); + initial_code = (unsigned long)make_capped_cpu_stable; + initial_stack = (unsigned long)&capped_tmp_stack; + + if (apic->wakeup_secondary_cpu) + boot_error = apic->wakeup_secondary_cpu(apicid, start_ip); + else + boot_error = wakeup_cpu_via_init_nmi(cpu, start_ip, apicid, + &cpu0_nmi_registered); + if (cpu0_nmi_registered) + unregister_nmi_handler(NMI_LOCAL, "wake_cpu0"); + + if (!boot_error) { + /* Wait 10s total for first sign of life from capped cpu */ + boot_error = -1; + timeout = jiffies + 10*HZ; + while (time_before(jiffies, timeout)) { + if (cpumask_test_cpu(cpu, cpu_capped_done_mask)) + break; + schedule(); + } + } +} + +void __init bring_capped_cpu_stable(void) +{ + int cpu; + + /* Guest does not suffer from MCE broadcast */ + if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) + return; + + for_each_cpu(cpu, cpu_capped_mask) + do_stable_cpu(cpu); + cpu_capped_mask = NULL; + cpu_capped_done_mask = NULL; +} + /* * NOTE - on most systems this is a PHYSICAL apic ID, but on multiquad * (ie clustered apic addressing mode), this is a LOGICAL apic ID. diff --git a/kernel/smp.c b/kernel/smp.c index 7dbcb40..de528b7 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -574,6 +574,10 @@ void __init setup_nr_cpu_ids(void) nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1; } +void __weak __init bring_capped_cpu_stable(void) +{ +} + /* Called by boot processor to activate the rest. */ void __init smp_init(void) { @@ -593,6 +597,8 @@ void __init smp_init(void) cpu_up(cpu); } + /* force cpus capped by nr_cpus option into stable state */ + bring_capped_cpu_stable(); num_nodes = num_online_nodes(); num_cpus = num_online_cpus(); pr_info("Brought up %d node%s, %d CPU%s\n", -- 2.7.5