Date:   Tue, 31 Jan 2023 22:08:48 +0100
From:   Borislav Petkov <bp@alien8.de>
To:     "Luck, Tony" <tony.luck@intel.com>
Cc:     "Raj, Ashok" <ashok.raj@intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        LKML <linux-kernel@vger.kernel.org>, x86 <x86@kernel.org>,
        Ingo Molnar <mingo@kernel.org>,
        "Hansen, Dave" <dave.hansen@intel.com>,
        "Schofield, Alison" <alison.schofield@intel.com>,
        "Chatre, Reinette" <reinette.chatre@intel.com>,
        Tom Lendacky <thomas.lendacky@amd.com>,
        Stefan Talpalaru <stefantalpalaru@yahoo.com>,
        David Woodhouse <dwmw2@infradead.org>,
        Benjamin Herrenschmidt <benh@kernel.crashing.org>,
        Jonathan Corbet <corbet@lwn.net>,
        "Rafael J . Wysocki" <rafael@kernel.org>,
        Peter Zilstra <peterz@infradead.org>,
        "Lutomirski, Andy" <luto@kernel.org>,
        "andrew.cooper3@citrix.com" <andrew.cooper3@citrix.com>,
        "Ostrovsky, Boris" <boris.ostrovsky@oracle.com>,
        Martin Pohlack <mpohlack@amazon.de>
Subject: Re: [Patch v3 Part2 3/9] x86/microcode/intel: Fix collect_cpu_info()
 to reflect current microcode
Message-ID: <Y9mDYMASXCFaFkNU@zn.tnic>
References: <20230130213955.6046-1-ashok.raj@intel.com>
 <20230130213955.6046-4-ashok.raj@intel.com>
 <Y9lGdh+0faIrIIiQ@zn.tnic>
 <SJ1PR11MB6083580526A7FFA11F110B77FCD09@SJ1PR11MB6083.namprd11.prod.outlook.com>
 <Y9l8yGVvVHBLKAoh@zn.tnic>
 <SJ1PR11MB60837E6E6AE7C82511DC039EFCD09@SJ1PR11MB6083.namprd11.prod.outlook.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <SJ1PR11MB60837E6E6AE7C82511DC039EFCD09@SJ1PR11MB6083.namprd11.prod.outlook.com>
Precedence: bulk

On Tue, Jan 31, 2023 at 08:49:52PM +0000, Luck, Tony wrote:
> What happens here if the update on the first hyperthread failed (sure, it shouldn't,
> but stuff happens at large scale).  In this case the current rev is still older that the
> the cache version ... so there is no "goto out", and this hyperthread will now write
> the MSR to initiate microcode update here, while the first thread is off executing
> arbitrary code (the situation that we want to avoid).

Lemme see if I can follow: we sync all threads in __reload_late() and
once they all arrive, we send them down into ->apply_microcode.

T0 arrives, and fails the update. That is this piece:

        /* write microcode via MSR 0x79 */
        wrmsrl(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);

        rev = intel_get_microcode_revision();

        if (rev != mc->hdr.rev) {
                pr_err("CPU%d update to revision 0x%x failed\n",
                       cpu, mc->hdr.rev);
                return UCODE_ERROR;
        }

We return here without updating cpu_sig.rev, as we should.

T1 arrives, updates successfully and updates its cpu_sig.rev.

T0's patch level has been updated too with that because the microcode
engine is shared between the threads. T0's cpu_sig.rev isn't, however,
as that has happened "behind its back", so to speak.

Is that the scenario you're talking about?

If so, if you look at __reload_late(), it'll say

	pr_warn("Error reloading microcode on CPU %d\n", cpu);

and the large scale operator will know.

And well, the easy fix is, do the reload again. :-)

That'll update the cached values too.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette