From: Andy Lutomirski
Date: Tue, 30 Aug 2022 12:15:23 -0700
Subject: Re: [PATCH v3 5/5] x86/microcode: Place siblings in NMI loop while update in progress
To: Ashok Raj
Cc: Borislav Petkov, Thomas Gleixner, Tony Luck, Dave Hansen,
    LKML Mailing List, X86-kernel, Tom Lendacky, Jacob Jun Pan
In-Reply-To: <20220817051127.3323755-6-ashok.raj@intel.com>
References: <20220817051127.3323755-1-ashok.raj@intel.com> <20220817051127.3323755-6-ashok.raj@intel.com>

On Tue, Aug 16, 2022 at 10:12 PM Ashok Raj wrote:
> Microcode updates need a guarantee that the thread sibling that is waiting
> for the update to finish on the primary core will not execute any
> instructions until the update is complete. This is required to guarantee
> that any MSR or instruction that's being patched will be executed before
> the update is complete.
>
> After the stop_machine() rendezvous, an NMI handler is registered. If an
> NMI were to happen while the microcode update is not complete, the
> secondary thread will spin until the ucode update state is cleared.
>
> A couple of choices discussed are:
>
> 1. Rendezvous inside the NMI handler, and also perform the update from
>    within the handler. This seemed too risky and might cause instability
>    with the races that we would need to solve. This would be a difficult
>    choice.
>    1.a Since the primary thread of every core is performing a wrmsr
>        for the update, once the wrmsr has started, it can't be
>        interrupted. Hence it's not required to NMI the primary thread of
>        the core. Only the secondary thread needs to be parked in NMI
>        before the update begins. (Suggested by Andy Cooper.)
> 2. Thomas (tglx) suggested that we could look into masking all the
>    LVT-originating NMIs, such as LINT1, the perf control LVT entries and
>    so on. Since we are in the rendezvous loop, we don't need to worry
>    about any NMI IPIs generated by the OS.
>
>    The one we don't have any control over is the ACPI mechanism for
>    sending notifications to the kernel for Firmware First Processing
>    (FFM). Apparently there is a PCH register that the BIOS, in SMI, would
>    write to generate such an interrupt (ACPI GHES).
> 3. This is a simpler option. The OS registers an NMI handler and doesn't
>    do any NMI rendezvous dance. But if an NMI were to happen, we check
>    whether any of the CPU's thread siblings have an update in progress.
>    Only those CPUs would take an NMI. The thread performing the wrmsr()
>    will only take an NMI after the completion of the wrmsr 0x79 flow.
>
>    [ Lutomirski thinks this is weak, and what happens from taking the
>      interrupt and the path to the registered callback handler might be
>      exposed. ]
>
> Seems like 1.a is the best candidate.
>
> The algorithm is something like this:
>
> After stop_machine(), all threads are executing __reload_late().
>
> nmi_callback()
> {
> 	if (!in_ucode_update)
> 		return NMI_DONE;
> 	if (cpu not in sibling_mask)
> 		return NMI_DONE;
> 	update "sibling reached NMI" for primary to continue;
>
> 	while (cpu in sibling_mask)
> 		wait;
> 	return NMI_HANDLED;
> }
>
> __reload_late()
> {
> 	entry_rendezvous(&late_cpus_in);
> 	set_mcip();
> 	if (this_cpu is first_cpu in the core) {
> 		wait for siblings to drop into NMI;
> 		apply_microcode();
> 	} else {
> 		send self_ipi(NMI_VECTOR);
> 		goto wait_for_siblings;
> 	}
>
> wait_for_siblings:
> 	exit_rendezvous(&late_cpus_out);
> 	clear_mcip();
> }
>
> reload_late()
> {
> 	register_nmi_handler();
> 	prepare mask of all sibling cpus;
> 	update state = ucode in progress;
> 	stop_machine();
> 	unregister_nmi_handler();
> }
>
> Signed-off-by: Ashok Raj <ashok.raj@intel.com>
> ---
>  arch/x86/kernel/cpu/microcode/core.c | 218 ++++++++++++++++++++++++++-
>  1 file changed, 211 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/microcode/core.c b/arch/x86/kernel/cpu/microcode/core.c
> index d24e1c754c27..fd3b8ce2c82a 100644
> --- a/arch/x86/kernel/cpu/microcode/core.c
> +++ b/arch/x86/kernel/cpu/microcode/core.c
> @@ -39,7 +39,9 @@
>  #include <...>
>  #include <...>
>  #include <...>
> +#include <...>
>  #include <...>
> +#include <...>
>
>  #define DRIVER_VERSION	"2.2"
>
> @@ -411,6 +413,13 @@ static int check_online_cpus(void)
>
>  static atomic_t late_cpus_in;
>  static atomic_t late_cpus_out;
> +static atomic_t nmi_cpus;	// number of CPUs that enter NMI
> +static atomic_t nmi_timeouts;	// number of siblings that time out
> +static atomic_t nmi_siblings;	// number of siblings that enter NMI
> +static atomic_t in_ucode_update;// Are we in microcode update?
> +static atomic_t nmi_exit;	// siblings that exit NMI

Some of these variables seem oddly managed, and several look like they
exist just for debugging.

> +
> +static struct cpumask all_sibling_mask;
>
>  static int __wait_for_cpus(atomic_t *t, long long timeout)
>  {
> @@ -433,6 +442,104 @@ static int __wait_for_cpus(atomic_t *t, long long timeout)
>  	return 0;
>  }
>
> +struct core_rendez {
> +	int num_core_cpus;
> +	atomic_t callin;
> +	atomic_t core_done;
> +};
> +
> +static DEFINE_PER_CPU(struct core_rendez, core_sync);
> +
> +static int __wait_for_update(atomic_t *t, long long timeout)
> +{
> +	while (!atomic_read(t)) {
> +		if (timeout < SPINUNIT)
> +			return 1;

Since you're using signed arithmetic, timeout < 0 would be a less
error-prone condition.

Anyway, this patch is full of debugging stuff, so I won't do a
line-for-line review, but I do have a suggestion. Instead of all this
bookkeeping, maybe just track the set of CPUs to park in NMI, kind of
like this (hand-wavy pseudocode):

static struct cpumask cpus_to_park_in_nmi;	/* fill out the cpumask */
static atomic_t nmi_parked_cpus;
static bool park_enabled;

Then, after __wait_for_cpus() (once everything is stopped), one CPU sets
up the NMI handler, sets park_enabled, and sends the NMI IPI to all the
CPUs that are supposed to park in there. The handler does:

if (this cpu is in cpus_to_park_in_nmi) {
	WARN_ON_ONCE(!park_enabled);
	atomic_inc(&nmi_parked_cpus);
	while (READ_ONCE(park_enabled))
		; /* because Intel won't promise that cpu_relax() is okay */
	atomic_dec(&nmi_parked_cpus);
}

and the CPUs that aren't supposed to park wait for nmi_parked_cpus to
have the right value. After the update, park_enabled gets cleared and
everything resumes.

Does this seem reasonable?
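
Fleshed out a little (equally hand-wavy and untested -- the function
names, the registration boilerplate, and apply_microcode_local() as a
stand-in for the actual WRMSR 0x79 flow are invented for illustration,
not taken from the patch), the parked side and the waiting side would
look something like:

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <asm/apic.h>
#include <asm/nmi.h>

static struct cpumask cpus_to_park_in_nmi; /* filled in before stop_machine() */
static atomic_t nmi_parked_cpus;
static bool park_enabled;

static void apply_microcode_local(void);   /* stand-in for the WRMSR 0x79 flow */

/* Siblings land here on the NMI IPI and spin until the update is done. */
static int ucode_park_nmi(unsigned int cmd, struct pt_regs *regs)
{
	if (!cpumask_test_cpu(smp_processor_id(), &cpus_to_park_in_nmi))
		return NMI_DONE;

	WARN_ON_ONCE(!READ_ONCE(park_enabled));

	atomic_inc(&nmi_parked_cpus);
	while (READ_ONCE(park_enabled))
		; /* because Intel won't promise that cpu_relax() is okay */
	atomic_dec(&nmi_parked_cpus);

	return NMI_HANDLED;
}

/*
 * Called once, by a single CPU, after the stop_machine() rendezvous and
 * after register_nmi_handler(NMI_LOCAL, ucode_park_nmi, 0, "ucode_park").
 * Every primary then waits for nmi_parked_cpus to reach the right value
 * before doing its own WRMSR.
 */
static void ucode_park_siblings(void)
{
	int want = cpumask_weight(&cpus_to_park_in_nmi);

	WRITE_ONCE(park_enabled, true);
	apic->send_IPI_mask(&cpus_to_park_in_nmi, NMI_VECTOR);

	/* Wait until every sibling has checked in from its NMI. */
	while (atomic_read(&nmi_parked_cpus) != want)
		;
}

/* After every primary has done its update: release the parked CPUs. */
static void ucode_unpark_siblings(void)
{
	WRITE_ONCE(park_enabled, false);
	while (atomic_read(&nmi_parked_cpus) != 0)
		;
}

The waiters should presumably time out and scream rather than spin
forever, but that's the shape of it.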
I was thinking it would be straightforward to have __wait_for_cpus()
handle this, but that would only really be easy in a language with
closures or continuation passing.
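In C you'd end up threading a function pointer and a context argument
through instead -- something like this (an untested sketch with invented
names, modeled on the existing __wait_for_cpus() and its SPINUNIT
constant, not anything in the patch):

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/delay.h>
#include <linux/nmi.h>

#define SPINUNIT 100	/* 100 ns, as in microcode/core.c */

/* Like __wait_for_cpus(), but with a per-iteration callback standing in
 * for the closure: poll() gets invoked with its context on every spin. */
static int __wait_for_cpus_poll(atomic_t *t, long long timeout,
				void (*poll)(void *data), void *data)
{
	int all_cpus = num_online_cpus();

	atomic_inc(t);
	while (atomic_read(t) < all_cpus) {
		if (timeout < 0)
			return 1;
		ndelay(SPINUNIT);
		timeout -= SPINUNIT;
		touch_nmi_watchdog();
		if (poll)
			poll(data);	/* e.g. the parking bookkeeping */
	}
	return 0;
}

That's doable, but not obviously nicer than open-coding the loop.

--Andy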