Message-ID: <bfd85699-4b7a-c606-266a-cea7ff336950@intel.com>
Date:   Wed, 1 Feb 2023 14:21:18 -0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.6.1
Subject: Re: [Patch v3 Part2 4/9] x86/microcode: Do not call apply_microcode()
 on sibling threads
Content-Language: en-US
To:     Ashok Raj <ashok.raj@intel.com>, Borislav Petkov <bp@alien8.de>,
        Thomas Gleixner <tglx@linutronix.de>
Cc:     LKML <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>,
        Ingo Molnar <mingo@kernel.org>,
        Tony Luck <tony.luck@intel.com>,
        Alison Schofield <alison.schofield@intel.com>,
        Reinette Chatre <reinette.chatre@intel.com>,
        Tom Lendacky <thomas.lendacky@amd.com>,
        Stefan Talpalaru <stefantalpalaru@yahoo.com>,
        David Woodhouse <dwmw2@infradead.org>,
        Benjamin Herrenschmidt <benh@kernel.crashing.org>,
        Jonathan Corbet <corbet@lwn.net>,
        "Rafael J . Wysocki" <rafael@kernel.org>,
        Peter Zilstra <peterz@infradead.org>,
        Andy Lutomirski <luto@kernel.org>,
        Andrew Cooper <Andrew.Cooper3@citrix.com>,
        Boris Ostrovsky <boris.ostrovsky@oracle.com>,
        Martin Pohlack <mpohlack@amazon.de>
References: <20230130213955.6046-1-ashok.raj@intel.com>
 <20230130213955.6046-5-ashok.raj@intel.com>
From:   Dave Hansen <dave.hansen@intel.com>
In-Reply-To: <20230130213955.6046-5-ashok.raj@intel.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Precedence: bulk

On 1/30/23 13:39, Ashok Raj wrote:
> Microcode updates are applied at the core, so an update to one HT sibling
> is effective on all HT siblings of the same core.
> 
> During late-load, after the primary has updated the microcode, it also
> reflects that in the per-cpu structure (cpuinfo_x86) holding the current
> revision.
> 
> Current code calls apply_microcode() to update the SW per-cpu revision.

I'm having a hard time following this.  I can't even suggest a better
message because I can't grok this one.

> But in the odd case when primary returned with an error, and as a result
> the secondary didn't get the revision updated, will attempt to perform
> a patch load and the primary has already been released to the system.
> This could be problematic, because the whole rendezvous dance is to
> prevent updates when one of the siblings could be executing arbitrary code.

OK, let me see if I understand:

Today, ->apply_microcode() is called for both CPU threads.  Typically,
T0 comes in and will actually successfully update the microcode.  T1
will come in later, notice that T0 updated the microcode already and
return without even trying to do the update WRMSR.  One thing T1 _does_
do before returning is to update the per-cpu data.

That works great, unless T0 experiences an error.  In that case, T0 will
jump out of __reload_late() after failing to do the update.  T1 will
come bumbling along after it and will enter ->apply_microcode(),
blissfully unaware of T0's failure.  T1 will assume that it is supposed
to do T0's job, noting "rev < mc->hdr.rev".  T1 will write the MSR while
T0 is off doing god knows what.

T1 should not even be attempting to do ->apply_microcode() because T0 is
not quiescent.

> Replace apply_microcode() with a call to collect_cpu_info() and let that
> call also update the per-cpu structure instead of returning the previously
> cached values.

	To fix this, remove the path where T1 calls ->apply_microcode().
	However, this alone would leave the per-cpu metadata for T1 out
	of date.  Call collect_cpu_info() to ensure it is updated.

Right?

FWIW, this seems a bit hacky and inconsistent to me.  It would be nice
if the common T0/T1 work (updating the per-cpu metadata) was done with
common code.

Could we zap the uci->cpu_sig.rev work entirely from ->apply_microcode()
and do it in __reload_late() instead?