Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp5315772pxb; Sun, 6 Feb 2022 21:56:42 -0800 (PST) X-Google-Smtp-Source: ABdhPJxIyx/a67ZNVPjdB8WL7wjTBvPBx3Jwvhw4urd8wfLnY1knCzKBdRoYsgimSkZUbVh0yf5X X-Received: by 2002:aa7:d553:: with SMTP id u19mr11959627edr.298.1644213402768; Sun, 06 Feb 2022 21:56:42 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1644213402; cv=none; d=google.com; s=arc-20160816; b=IiKPstBed3d6luTdNNSNL2FpHI0qaO8GLOekOYht7z7QnfYRLrzOz2VRt/92d/CnSY ZiskEHchat4OXORgvQ8RKu+NQS6LaXg5O5l16v+39QRmVd+egd/iAm1qDq3hKlfnnkd2 v7ugOJ8Qu71YAUCTkwQiLBuYng1TNUJaakOPrPvO44KNIk602muokNSMXOPr9DGKjd5/ A+k6xg7SBrvSpNo9SBCU/4gO34vWcTkIwrBhrRC8AzW/ls/vGiLFihhEmAdvFJXgwwqu 3AKcf99Lfr1ulj8CslCCtyp+z9kqd4n4k/UvgTksQZOOCG7ontaAMxDHPrHUGdRuyO8w etDQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=1SzB+qecFpF7d3ylySoM5Jgfe/KbFf09pYSzNH83hVY=; b=lmJigL3x6S2HXQxlC1gTUyTNO4EHV9KhoV8QXYMdCBvVdhAADPB5g8mnrW2NWYniLn zfxd2Epm3NKlPnP0A+XFXoRl33XzBZRmDPNh9l3sOBATi33VcHu1LiPJPVajKxY2pC77 lOq5fls3HlpWtWZfpl3Qr0EpQZt6nXIcSoV+fXAKdhwO2G6mVXvcVf3wl5aSbDkQhIpC lbokLbA/STx5Bg4TYGX0mo17XnrfHwTTUTWNpRWraxPcdEAW+DsHOuT5lBpidhuaAWqM 8HxOgLusqsnVcFQ3Cm1XQITN1GC/nSSfvqbHgRa0bjn1M5BocGH/KDhZTxHCDxr9WWNq fYug== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Uxbl8Pdd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f19si7884989edd.169.2022.02.06.21.56.18; Sun, 06 Feb 2022 21:56:42 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=Uxbl8Pdd; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229720AbiBCUWT (ORCPT + 99 others); Thu, 3 Feb 2022 15:22:19 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55560 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230100AbiBCUWS (ORCPT ); Thu, 3 Feb 2022 15:22:18 -0500 Received: from mail-pg1-x52a.google.com (mail-pg1-x52a.google.com [IPv6:2607:f8b0:4864:20::52a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 292E7C06173B for ; Thu, 3 Feb 2022 12:22:18 -0800 (PST) Received: by mail-pg1-x52a.google.com with SMTP id h23so3190537pgk.11 for ; Thu, 03 Feb 2022 12:22:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=1SzB+qecFpF7d3ylySoM5Jgfe/KbFf09pYSzNH83hVY=; b=Uxbl8Pddg7ptvrCfhGGPT2VCjofSGjHKur3RFnhwTPwe1zAOX7ON0pTVw1avHA/ysE w2FRkn3rZ4nrFyU4GkduyWHjF+JQhuwrjr5AoPShnS4asAsT4QEQ3hYFJvnPPq8AEKQu rOQgFWh3oBskRswH/Hll7mhOlthuVuM4hKTtnQx6R8H4xkY9nJqtyYzYB1ibvMqfGdvS Hpb099AZw7qOD2GQGGyZR6gMho+6aXJyU38mDcUbPfuhCo5lQdVnkXhCPtYsWVYuf9aw YKF5OrlshJZxqvlK9t7wwAGyA0PrklvTDMgMuZjyMGZcWgm4zL4qUbhx7v4BaAz5GuYo +7Yg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=1SzB+qecFpF7d3ylySoM5Jgfe/KbFf09pYSzNH83hVY=; b=5+yavgU9V/RA92y2Wd7DG041ceMEQaGEiBHbyYwYofW02/7ExaOhiclP+7RGPFRO/W bnudPH3FFy0M4uRY+P4oZdw/tPObwAYTJh2lt0sWJksEGYAEreA9wvekaPn7cZaweV0O b4HqqKJBTe0KN+PewnCEftAvr89NVN/XOutBHLFywbIu1+1KlcCqyjGMqxWUs6htFSU2 LNl9OKRuuNtd2rTZF5zut8YCJFeBHUwWfMk/6eKh4LEgaC8Q6QwHM5TI+xFYlobtVpx/ rbQsmxe3nDZ9n5jYqFIXVecFyjkhJka0P6KpRGwHFHdCtfHjl/MVhu9sIkqYaQrsHCoI Zl6A== X-Gm-Message-State: AOAM5301OAJqXn/JbhBGX7J7mCg5QWDtmrixXrFPPL6kT/Hf4XTQdSxJ aN0y7vHonY9n7BHwlCm1T/BnLA== X-Received: by 2002:a05:6a00:793:: with SMTP id g19mr2402889pfu.21.1643919737311; Thu, 03 Feb 2022 12:22:17 -0800 (PST) Received: from google.com (157.214.185.35.bc.googleusercontent.com. [35.185.214.157]) by smtp.gmail.com with ESMTPSA id x126sm18177131pgb.85.2022.02.03.12.22.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 03 Feb 2022 12:22:16 -0800 (PST) Date: Thu, 3 Feb 2022 20:22:13 +0000 From: Sean Christopherson To: Maxim Levitsky Cc: Chao Gao , Zeng Guang , Tom Lendacky , Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , "kvm@vger.kernel.org" , Dave Hansen , "Luck, Tony" , Kan Liang , Thomas Gleixner , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Kim Phillips , Jarkko Sakkinen , Jethro Beekman , "Huang, Kai" , "x86@kernel.org" , "linux-kernel@vger.kernel.org" , "Hu, Robert" Subject: Re: [PATCH v5 7/8] KVM: VMX: Update PID-pointer table entry when APIC ID is changed Message-ID: References: <20211231142849.611-8-guang.zeng@intel.com> <640e82f3-489d-60af-1d31-25096bef1a46@amd.com> <4eee5de5-ab76-7094-17aa-adc552032ba0@intel.com> <20220110074523.GA18434@gao-cwp> <1ff69ed503faa4c5df3ad1b5abe8979d570ef2b8.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 02, 2022, Sean Christopherson wrote: > On Thu, Jan 13, 2022, Sean Christopherson wrote: > > On Tue, Jan 11, 2022, Maxim Levitsky wrote: > > > Both Intel and AMD's PRM also state that changing APIC ID is implementation > > > dependent. > > > > > > I vote to forbid changing apic id, at least in the case any APIC acceleration > > > is used, be that APICv or AVIC. > > > > That has my vote as well. For IPIv in particular there's not much concern with > > backwards compability, i.e. we can tie the behavior to enable_ipiv. > > Hrm, it may not be that simple. There's some crusty (really, really crusty) code > in Linux's boot code that writes APIC_ID. IIUC, the intent is to play nice with > running a UP crash dump kernel on "BSP" that isn't "the BSP", e.g. has a non-zero > APIC ID. ... > It's entirely possible that this path is unused in a KVM guest, but I don't think > we can know that with 100% certainty. > > But I also completely agree that attempting to keep the tables up-to-date is ugly > and a waste of time and effort, e.g. as Maxim pointed out, the current AVIC code > is comically broken. > > Rather than disallowing the write, what if we add yet another inhibit that disables > APICv if IPI virtualization is enabled and a vCPU has an APIC ID != vcpu_id? KVM > is equipped to handle the emulation, so it just means that a guest that's doing > weird things loses a big of performance. LOL, this is all such a mess. The x2apic ID is actually indirectly writable on AMD CPUs. Per the APM: A value previously written by software to the 8-bit APIC_ID register (MMIO offset 30h) is converted by hardware into the appropriate format and reflected into the 32-bit x2APIC_ID register (MSR 802h). I confirmed this on hardware (Milan). That means KVM's handling of the x2APIC ID in kvm_lapic_set_base() is wrong, at least with respect to AMD. Intel's SDM is a bit vague. I _think_ it means Intel CPUs treat the the x2APIC ID and xAPIC ID as two completely independent assets. I haven't been able to glean any info from hardware because writes to the legacy xAPIC ID are ignored on all CPUs I've tested (Haswell and Cascade lake). The x2APIC ID (32 bits) and the legacy local xAPIC ID (8 bits) are preserved across this transition. Given that the xAPIC ID appears to no longer be writable on Intel CPUs, it's impossible that _generic_ kernel code can rely on xAPIC ID being writable. That just leaves the aforementioned amd_numa_init() crud. Linux's handling of that is: void __init x86_numa_init(void) { if (!numa_off) { #ifdef CONFIG_ACPI_NUMA if (!numa_init(x86_acpi_numa_init)) return; #endif #ifdef CONFIG_AMD_NUMA if (!numa_init(amd_numa_init)) return; #endif } numa_init(dummy_numa_init); } i.e. ACPI_NUMA gets priority and thus amd_numa_init() will never be reached if the NUMA topology is enumerated in the ACPI tables. Furthermore, the VMM would have to actually emulate an old AMD northbridge, which is also extremely unlikely. The odds of breaking a guest are further diminised given that KVM doesn't emulate the xAPIC ID => x2APIC ID hilarity on AMD CPUs and no one has complained. So, rather than tie this to IPI virtualization, I think we should either make the xAPIC ID read-only across the board, or if we want to hedge in case someone has a crazy use case, make the xAPIC ID read-only by default, add a module param to let userspace opt-in to a writable xAPIC ID, and report x2APIC and APICv as unsupported if the xAPIC ID is writable. E.g. rougly this, plus your AVIC patches if we want to hedge. diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 28be02adc669..32854ac403a8 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -539,8 +539,15 @@ void kvm_set_cpu_caps(void) 0 /* Reserved*/ | F(AES) | F(XSAVE) | 0 /* OSXSAVE */ | F(AVX) | F(F16C) | F(RDRAND) ); - /* KVM emulates x2apic in software irrespective of host support. */ - kvm_cpu_cap_set(X86_FEATURE_X2APIC); + /* + * KVM emulates x2apic in software irrespective of host support. Due + * to architecturally difference between Intel and AMD, x2APIC is not + * supported if the xAPIC ID is writable. + */ + if (!xapic_id_writable) + kvm_cpu_cap_set(X86_FEATURE_X2APIC); + else + kvm_cpu_cap_clear(X86_FEATURE_X2APIC); kvm_cpu_cap_mask(CPUID_1_EDX, F(FPU) | F(VME) | F(DE) | F(PSE) | diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 670361bf1d81..6b42b65e7a42 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2047,10 +2047,10 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val) switch (reg) { case APIC_ID: /* Local APIC ID */ - if (!apic_x2apic_mode(apic)) + if (apic_x2apic_mode(apic)) + ret = 1; + else if (xapic_id_writable) kvm_apic_set_xapic_id(apic, val >> 24); - else - ret = 1; break; case APIC_TASKPRI: diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 9cef8e4598df..71a3bcdb3317 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4743,7 +4743,8 @@ static __init int svm_hardware_setup(void) nrips = false; } - enable_apicv = avic = avic && npt_enabled && boot_cpu_has(X86_FEATURE_AVIC); + enable_apicv = avic = avic && !xapic_id_writable && npt_enabled && + boot_cpu_has(X86_FEATURE_AVIC); if (enable_apicv) { pr_info("AVIC enabled\n"); diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 1b135473677b..fad7b36fbb1d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7939,7 +7939,7 @@ static __init int hardware_setup(void) ple_window_shrink = 0; } - if (!cpu_has_vmx_apicv()) + if (!cpu_has_vmx_apicv() || xapic_id_writable) enable_apicv = 0; if (!enable_apicv) vmx_x86_ops.sync_pir_to_irr = NULL; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index eaad2f485b64..ef6eba8c832a 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -174,6 +174,10 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR); static int __read_mostly lapic_timer_advance_ns = -1; module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR); +bool __read_mostly xapic_id_writable; +module_param(xapic_id_writable, bool, 0444); +EXPORT_SYMBOL_GPL(xapic_id_writable); + static bool __read_mostly vector_hashing = true; module_param(vector_hashing, bool, S_IRUGO); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 1ebd5a7594da..142663ff9cba 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -346,6 +346,7 @@ static inline bool kvm_mpx_supported(void) extern unsigned int min_timer_period_us; +extern bool xapic_id_writable; extern bool enable_vmware_backdoor; extern int pi_inject_timer;