Received: by 2002:a05:7412:5112:b0:fa:6e18:a558 with SMTP id fm18csp1095497rdb; Wed, 24 Jan 2024 04:58:27 -0800 (PST) X-Google-Smtp-Source: AGHT+IEzxUV07Thwkw+wCFkaluRKhIgpL6DJbrxnr41trPyDb/0I13P3QrqRhntzfZ0YHc/4H/vo X-Received: by 2002:a05:620a:123b:b0:783:4195:90e7 with SMTP id v27-20020a05620a123b00b00783419590e7mr8632905qkj.46.1706101107755; Wed, 24 Jan 2024 04:58:27 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1706101107; cv=pass; d=google.com; s=arc-20160816; b=qlceINa0swGt1VRXYQOa5uKT35Qh8cXOzlm0xyzJ4iTyFrMrlqlNQNJXj/F6O5U9d4 LDPLxQHfiPdqxKMOr0PDbeRj5fVJhODcu8OSSQE7a26+5rWLD+DSri/VOP5zl63vyXYF HDwC1KzTEsPE3iuMpg+V1buLcZ5reRmLfLmD3h6kmxlMN2oOkqfiAuNWvvmZ/4/QCwuO KFieBCzxc3/Cax8MrCQZp2aJ0e2DWGjpVLGEJ5NCpTIcOZwni2P3+j2Ga9JC7tjMQn6o KAPxCVf3ob+nxNVloZe/GLmHmPbTmcbVpsfkQW3GepRkXgCKMbdpWsyuhNK8zUmLpx76 EOSQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from:dkim-signature; bh=MMCljq9G+/X0CX8pg1sLFLETXJmalrWCOIsg/JZC3XI=; fh=OvJRnOqsMTm9XoNmEwebcqh9Ud7yh1CTeKAP84ols98=; b=Bh8cyJijZh8/nJjKtSlhOKDvjkKWCmpbKt9HSgI5GzYDEjekNzLjV6sgF3ugSo5Rdx N6qOXYvesbe9oc91IAy+b0pZ33bINkjr4wOWOTr07FrmeWasmj+zOU/n52R/g9LKk1Vt eidOpVuOoGlrL/pPplO78G3n2zGiVrD1R2Iy+X6o/iVY9iIObe9B6ixfODol7VHNQBJO nJ1nH2gaXCcDeZnT7YUdIsdizKR/XmmjlF/54zKyYz7CpCktZDHChm+U9qb75/npoQyW S/AZ63jykHnebM6qJJs09Mc2urjmvZCfCyKlCAU/BWRV7vpH4JVG5prMSUgRO8uYKm0a dBSw== ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=UN2LqsG5; arc=pass (i=1 dkim=pass dkdomain=intel.com dmarc=pass fromdomain=linux.intel.com); spf=pass (google.com: domain of linux-kernel+bounces-37030-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-37030-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [147.75.199.223]) by mx.google.com with ESMTPS id vu11-20020a05620a560b00b00783826269ccsi9815383qkn.250.2024.01.24.04.58.27 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 24 Jan 2024 04:58:27 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-37030-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=UN2LqsG5; arc=pass (i=1 dkim=pass dkdomain=intel.com dmarc=pass fromdomain=linux.intel.com); spf=pass (google.com: domain of linux-kernel+bounces-37030-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) smtp.mailfrom="linux-kernel+bounces-37030-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 5F28C1C24E3D for ; Wed, 24 Jan 2024 12:58:27 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 67EC57C09D; Wed, 24 Jan 2024 12:56:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b="UN2LqsG5" Received: from mgamail.intel.com (mgamail.intel.com [134.134.136.31]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7440D7A71B for ; Wed, 24 Jan 2024 12:56:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=134.134.136.31 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706101018; cv=none; b=M0tw1ArS66qjy5901mG16oUMjz3NwKye40X+3+/9DEVzDm4+TgrEknOj5FPtQgTMoMaIEy2NsbLfXrSqsoqo0GxGEv7NNSEsbVAvDjIR/ofjHOLSjTgsV1B9Hv5o8TjRNoU3jB3kK2xYu/Au3wMBFGbhsHeYFtYEGzx0YxrYEMQ= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706101018; c=relaxed/simple; bh=csq/DQ86q1cyKE+6/psjeBOQygPbyK7cNC3thAaHsaQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Cc/5DArw64NLJ3AO1YphAyy2+KZUI8V8TCoLWm7RmOShhuqtGojUrmPOIMY2nb5X4hHlskdwLLIlYBgrxNWtdbYpleFSthgW874ezJgnPddjAmx9m7gMYPf4lLcgxG1W+SdROq7pDwhOkOhqThv3aYoDCncGsuALrHz1ULwa8I8= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com; spf=none smtp.helo=mgamail.intel.com; dkim=pass (2048-bit key) header.d=intel.com header.i=@intel.com header.b=UN2LqsG5; arc=none smtp.client-ip=134.134.136.31 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.intel.com Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.helo=mgamail.intel.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1706101016; x=1737637016; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=csq/DQ86q1cyKE+6/psjeBOQygPbyK7cNC3thAaHsaQ=; b=UN2LqsG5faC5iWhiYKUToyRXHFOruSXJCY5pC2t7Z9mtvgeh1Jztdk1/ FkGqg4KohMsdI2ojdEwr6nSfHwXV2wkvys/KqJStMM2AdydBFw81zmw5S 5RUGJTWBdieK7EJBO6jL/SGV9sFgU+z0NaL60hI+Ne+gMqmuvO0zpVcV4 AVPLWlnLr8ODJc3SJCToBS7JdT3cNPF1GAczg4g89351QtDORMDZgqVbq 3axF72IFNCPU+kuHe07uV/jo8Ogl6KSZlQCbIMinp/F4CPDTzPv9zSMHH b5EQptQ5sEEgFhGoPRiReSjDH30oaEcnRxFi5RsCTmw/ZbbjD290nvDlt w==; X-IronPort-AV: E=McAfee;i="6600,9927,10962"; a="466110138" X-IronPort-AV: E=Sophos;i="6.05,216,1701158400"; d="scan'208";a="466110138" Received: from orsmga003.jf.intel.com ([10.7.209.27]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Jan 2024 04:56:53 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10962"; a="735924080" X-IronPort-AV: E=Sophos;i="6.05,216,1701158400"; d="scan'208";a="735924080" Received: from black.fi.intel.com ([10.237.72.28]) by orsmga003.jf.intel.com with ESMTP; 24 Jan 2024 04:56:47 -0800 Received: by black.fi.intel.com (Postfix, from userid 1000) id DE9BFBF1; Wed, 24 Jan 2024 14:56:02 +0200 (EET) From: "Kirill A. Shutemov" To: Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org Cc: "Rafael J. Wysocki" , Peter Zijlstra , Adrian Hunter , Kuppuswamy Sathyanarayanan , Elena Reshetova , Jun Nakajima , Rick Edgecombe , Tom Lendacky , "Kalra, Ashish" , Sean Christopherson , "Huang, Kai" , Baoquan He , kexec@lists.infradead.org, linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCHv6 16/16] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method Date: Wed, 24 Jan 2024 14:55:57 +0200 Message-ID: <20240124125557.493675-17-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240124125557.493675-1-kirill.shutemov@linux.intel.com> References: <20240124125557.493675-1-kirill.shutemov@linux.intel.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit MADT Multiprocessor Wakeup structure version 1 brings support of CPU offlining: BIOS provides a reset vector where the CPU has to jump to for offlining itself. The new TEST mailbox command can be used to test whether the CPU offlined itself which means the BIOS has control over the CPU and can online it again via the ACPI MADT wakeup method. Add CPU offling support for the ACPI MADT wakeup method by implementing custom cpu_die(), play_dead() and stop_this_cpu() SMP operations. CPU offlining makes is possible to hand over secondary CPUs over kexec, not limiting the second kernel to a single CPU. The change conforms to the approved ACPI spec change proposal. See the Link. Signed-off-by: Kirill A. Shutemov Link: https://lore.kernel.org/all/13356251.uLZWGnKmhe@kreacher --- arch/x86/include/asm/acpi.h | 2 + arch/x86/kernel/acpi/Makefile | 2 +- arch/x86/kernel/acpi/madt_playdead.S | 28 ++++ arch/x86/kernel/acpi/madt_wakeup.c | 184 ++++++++++++++++++++++++++- include/acpi/actbl2.h | 15 ++- 5 files changed, 227 insertions(+), 4 deletions(-) create mode 100644 arch/x86/kernel/acpi/madt_playdead.S diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h index 2625b915ae7f..021cafa214c2 100644 --- a/arch/x86/include/asm/acpi.h +++ b/arch/x86/include/asm/acpi.h @@ -81,6 +81,8 @@ union acpi_subtable_headers; int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, const unsigned long end); +void asm_acpi_mp_play_dead(u64 reset_vector, u64 pgd_pa); + /* * Check if the CPU can handle C2 and deeper */ diff --git a/arch/x86/kernel/acpi/Makefile b/arch/x86/kernel/acpi/Makefile index 8c7329c88a75..37b1f28846de 100644 --- a/arch/x86/kernel/acpi/Makefile +++ b/arch/x86/kernel/acpi/Makefile @@ -4,7 +4,7 @@ obj-$(CONFIG_ACPI) += boot.o obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o obj-$(CONFIG_ACPI_APEI) += apei.o obj-$(CONFIG_ACPI_CPPC_LIB) += cppc.o -obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o +obj-$(CONFIG_X86_ACPI_MADT_WAKEUP) += madt_wakeup.o madt_playdead.o ifneq ($(CONFIG_ACPI_PROCESSOR),) obj-y += cstate.o diff --git a/arch/x86/kernel/acpi/madt_playdead.S b/arch/x86/kernel/acpi/madt_playdead.S new file mode 100644 index 000000000000..4e498d28cdc8 --- /dev/null +++ b/arch/x86/kernel/acpi/madt_playdead.S @@ -0,0 +1,28 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include +#include +#include +#include + + .text + .align PAGE_SIZE + +/* + * asm_acpi_mp_play_dead() - Hand over control of the CPU to the BIOS + * + * rdi: Address of the ACPI MADT MPWK ResetVector + * rsi: PGD of the identity mapping + */ +SYM_FUNC_START(asm_acpi_mp_play_dead) + /* Turn off global entries. Following CR3 write will flush them. */ + movq %cr4, %rdx + andq $~(X86_CR4_PGE), %rdx + movq %rdx, %cr4 + + /* Switch to identity mapping */ + movq %rsi, %cr3 + + /* Jump to reset vector */ + ANNOTATE_RETPOLINE_SAFE + jmp *%rdi +SYM_FUNC_END(asm_acpi_mp_play_dead) diff --git a/arch/x86/kernel/acpi/madt_wakeup.c b/arch/x86/kernel/acpi/madt_wakeup.c index 30820f9de5af..9e984e2191ba 100644 --- a/arch/x86/kernel/acpi/madt_wakeup.c +++ b/arch/x86/kernel/acpi/madt_wakeup.c @@ -1,10 +1,19 @@ // SPDX-License-Identifier: GPL-2.0-or-later #include #include +#include #include +#include +#include +#include +#include #include #include +#include +#include +#include #include +#include /* Physical address of the Multiprocessor Wakeup Structure mailbox */ static u64 acpi_mp_wake_mailbox_paddr __ro_after_init; @@ -12,6 +21,154 @@ static u64 acpi_mp_wake_mailbox_paddr __ro_after_init; /* Virtual address of the Multiprocessor Wakeup Structure mailbox */ static struct acpi_madt_multiproc_wakeup_mailbox *acpi_mp_wake_mailbox __ro_after_init; +static u64 acpi_mp_pgd __ro_after_init; +static u64 acpi_mp_reset_vector_paddr __ro_after_init; + +static void acpi_mp_stop_this_cpu(void) +{ + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, acpi_mp_pgd); +} + +static void acpi_mp_play_dead(void) +{ + play_dead_common(); + asm_acpi_mp_play_dead(acpi_mp_reset_vector_paddr, acpi_mp_pgd); +} + +static void acpi_mp_cpu_die(unsigned int cpu) +{ + u32 apicid = per_cpu(x86_cpu_to_apicid, cpu); + unsigned long timeout; + + /* + * Use TEST mailbox command to prove that BIOS got control over + * the CPU before declaring it dead. + * + * BIOS has to clear 'command' field of the mailbox. + */ + acpi_mp_wake_mailbox->apic_id = apicid; + smp_store_release(&acpi_mp_wake_mailbox->command, + ACPI_MP_WAKE_COMMAND_TEST); + + /* Don't wait longer than a second. */ + timeout = USEC_PER_SEC; + while (READ_ONCE(acpi_mp_wake_mailbox->command) && --timeout) + udelay(1); + + if (!timeout) + pr_err("Failed to hand over CPU %d to BIOS\n", cpu); +} + +/* The argument is required to match type of x86_mapping_info::alloc_pgt_page */ +static void __init *alloc_pgt_page(void *dummy) +{ + return memblock_alloc(PAGE_SIZE, PAGE_SIZE); +} + +static void __init free_pgt_page(void *pgt, void *dummy) +{ + return memblock_free(pgt, PAGE_SIZE); +} + +/* + * Make sure asm_acpi_mp_play_dead() is present in the identity mapping at + * the same place as in the kernel page tables. asm_acpi_mp_play_dead() switches + * to the identity mapping and the function has be present at the same spot in + * the virtual address space before and after switching page tables. + */ +static int __init init_transition_pgtable(pgd_t *pgd) +{ + pgprot_t prot = PAGE_KERNEL_EXEC_NOENC; + unsigned long vaddr, paddr; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + vaddr = (unsigned long)asm_acpi_mp_play_dead; + pgd += pgd_index(vaddr); + if (!pgd_present(*pgd)) { + p4d = (p4d_t *)alloc_pgt_page(NULL); + if (!p4d) + return -ENOMEM; + set_pgd(pgd, __pgd(__pa(p4d) | _KERNPG_TABLE)); + } + p4d = p4d_offset(pgd, vaddr); + if (!p4d_present(*p4d)) { + pud = (pud_t *)alloc_pgt_page(NULL); + if (!pud) + return -ENOMEM; + set_p4d(p4d, __p4d(__pa(pud) | _KERNPG_TABLE)); + } + pud = pud_offset(p4d, vaddr); + if (!pud_present(*pud)) { + pmd = (pmd_t *)alloc_pgt_page(NULL); + if (!pmd) + return -ENOMEM; + set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE)); + } + pmd = pmd_offset(pud, vaddr); + if (!pmd_present(*pmd)) { + pte = (pte_t *)alloc_pgt_page(NULL); + if (!pte) + return -ENOMEM; + set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE)); + } + pte = pte_offset_kernel(pmd, vaddr); + + paddr = __pa(vaddr); + set_pte(pte, pfn_pte(paddr >> PAGE_SHIFT, prot)); + + return 0; +} + +static int __init acpi_mp_setup_reset(u64 reset_vector) +{ + pgd_t *pgd; + struct x86_mapping_info info = { + .alloc_pgt_page = alloc_pgt_page, + .free_pgt_page = free_pgt_page, + .page_flag = __PAGE_KERNEL_LARGE_EXEC, + .kernpg_flag = _KERNPG_TABLE_NOENC, + }; + + pgd = alloc_pgt_page(NULL); + if (!pgd) + return -ENOMEM; + + for (int i = 0; i < nr_pfn_mapped; i++) { + unsigned long mstart, mend; + + mstart = pfn_mapped[i].start << PAGE_SHIFT; + mend = pfn_mapped[i].end << PAGE_SHIFT; + if (kernel_ident_mapping_init(&info, pgd, mstart, mend)) { + kernel_ident_mapping_free(&info, pgd); + return -ENOMEM; + } + } + + if (kernel_ident_mapping_init(&info, pgd, + PAGE_ALIGN_DOWN(reset_vector), + PAGE_ALIGN(reset_vector + 1))) { + kernel_ident_mapping_free(&info, pgd); + return -ENOMEM; + } + + if (init_transition_pgtable(pgd)) { + kernel_ident_mapping_free(&info, pgd); + return -ENOMEM; + } + + smp_ops.play_dead = acpi_mp_play_dead; + smp_ops.stop_this_cpu = acpi_mp_stop_this_cpu; + smp_ops.cpu_die = acpi_mp_cpu_die; + + acpi_mp_reset_vector_paddr = reset_vector; + acpi_mp_pgd = __pa(pgd); + + return 0; +} + static int acpi_wakeup_cpu(u32 apicid, unsigned long start_ip) { if (!acpi_mp_wake_mailbox_paddr) { @@ -97,14 +254,37 @@ int __init acpi_parse_mp_wake(union acpi_subtable_headers *header, struct acpi_madt_multiproc_wakeup *mp_wake; mp_wake = (struct acpi_madt_multiproc_wakeup *)header; - if (BAD_MADT_ENTRY(mp_wake, end)) + + /* + * Cannot use the standard BAD_MADT_ENTRY() to sanity check the @mp_wake + * entry. 'sizeof (struct acpi_madt_multiproc_wakeup)' can be larger + * than the actual size of the MP wakeup entry in ACPI table because the + * 'reset_vector' is only available in the V1 MP wakeup structure. + */ + if (!mp_wake) + return -EINVAL; + if (end - (unsigned long)mp_wake < ACPI_MADT_MP_WAKEUP_SIZE_V0) + return -EINVAL; + if (mp_wake->header.length < ACPI_MADT_MP_WAKEUP_SIZE_V0) return -EINVAL; acpi_table_print_madt_entry(&header->common); acpi_mp_wake_mailbox_paddr = mp_wake->mailbox_address; - acpi_mp_disable_offlining(mp_wake); + if (mp_wake->version >= ACPI_MADT_MP_WAKEUP_VERSION_V1 && + mp_wake->header.length >= ACPI_MADT_MP_WAKEUP_SIZE_V1) { + if (acpi_mp_setup_reset(mp_wake->reset_vector)) { + pr_warn("Failed to setup MADT reset vector\n"); + acpi_mp_disable_offlining(mp_wake); + } + } else { + /* + * CPU offlining requires version 1 of the ACPI MADT wakeup + * structure. + */ + acpi_mp_disable_offlining(mp_wake); + } apic_update_callback(wakeup_secondary_cpu_64, acpi_wakeup_cpu); diff --git a/include/acpi/actbl2.h b/include/acpi/actbl2.h index e1a395af7591..2aedda70ef88 100644 --- a/include/acpi/actbl2.h +++ b/include/acpi/actbl2.h @@ -1120,8 +1120,20 @@ struct acpi_madt_multiproc_wakeup { u16 version; u32 reserved; /* reserved - must be zero */ u64 mailbox_address; + u64 reset_vector; }; +/* Values for Version field above */ + +enum acpi_madt_multiproc_wakeup_version { + ACPI_MADT_MP_WAKEUP_VERSION_NONE = 0, + ACPI_MADT_MP_WAKEUP_VERSION_V1 = 1, + ACPI_MADT_MP_WAKEUP_VERSION_RESERVED = 2, /* 2 and greater are reserved */ +}; + +#define ACPI_MADT_MP_WAKEUP_SIZE_V0 16 +#define ACPI_MADT_MP_WAKEUP_SIZE_V1 24 + #define ACPI_MULTIPROC_WAKEUP_MB_OS_SIZE 2032 #define ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE 2048 @@ -1134,7 +1146,8 @@ struct acpi_madt_multiproc_wakeup_mailbox { u8 reserved_firmware[ACPI_MULTIPROC_WAKEUP_MB_FIRMWARE_SIZE]; /* reserved for firmware use */ }; -#define ACPI_MP_WAKE_COMMAND_WAKEUP 1 +#define ACPI_MP_WAKE_COMMAND_WAKEUP 1 +#define ACPI_MP_WAKE_COMMAND_TEST 2 /* 17: CPU Core Interrupt Controller (ACPI 6.5) */ -- 2.43.0