Received: by 2002:ab2:7855:0:b0:1f9:5764:f03e with SMTP id m21csp78336lqp; Tue, 21 May 2024 19:29:01 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCXGpAZre3/yP3l8sPdYY6yzCqbZNkdOL9p+mabhJ7nx2/rihzOgvf/0vgKIcRdAh74RucIIPvX16uAa51MI8D9y1UoyIRo5tIERawjKlA== X-Google-Smtp-Source: AGHT+IE3adWqV1ELKDqLW4eTefvvhhwKwqGXUgN/O/fNQ+DpuxY0XlSwXVvESBlTQkZt23IFbJCK X-Received: by 2002:a19:7410:0:b0:523:b672:25c9 with SMTP id 2adb3069b0e04-526bdc520e5mr255323e87.30.1716344940944; Tue, 21 May 2024 19:29:00 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1716344940; cv=pass; d=google.com; s=arc-20160816; b=R1i/P5Jdl2XpNIX5MltJGsB0Ua+Gst3TLXWtFXA+j1haacH3wVaN/I+062RSuPhyxa tF5P53BdoH5t6HndeM2iyPew5udxd2QcIDjZOnK0pLFaYtwQrxDZIuSseOwrjKPodKgE wsGO3ahE9vuVeUNubx/erajOwaNre1ueK/YNgoEtYc5SYyRAcuurih1vjXKZUZtd0DBu KWZpn8a8KV7phZOvfvpoLJu/diuzw9peD5H4Hy8OlKQp5jNpLJDx89J3YIHwx+eJ0BFf MXXtmZXEkIoI3C85tIM9Xz9qDnDRRRZacLiPbZlXNRCYwTTfGs0x4k1RDiKv87wp/Zds mNGQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:from:subject:message-id:references:mime-version :list-unsubscribe:list-subscribe:list-id:precedence:in-reply-to:date :reply-to:dkim-signature; bh=mfIvjgKueQKVKfXuZvg9684V73mkUfXcfkr0mjRx2Pg=; fh=vLRTqemQ6rT7lRBvDZC5PEDGmd+RsrYNNeBHtMweRmc=; b=YB/OULkmI6VP5cC+XUqM0w6oEQjGFKqq4eIOyIVDpCe0WS49PK39gpMWMa2vcVcVRw WVQIDuDKksxt0jLLKbnAgOpJ2sd1IiJ8nU2QhwEYBgtebPzFLguUdQaJzaEM9hiBv7ac GF2ZiREd1hQ0/RCL0ezvA9AjyfEcCOVYjJK9CfKPHIpZ7d/hZjNKkxJi2LNpIu1wg5Cv Vz9jno8LOhljQ8Nm4ZWoKesHiT18dCfQ5HyK4dqAmgJgGgtbStI81rJbvddGiUEcv6qP m8BPSbukKHMJVeeEo8B9OgIT3a0XF3CPGwcaJT/R4aaQDd32Wh9XKpDiryrBDGrN2/b/ Ap5Q==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=QwRHV1hb; arc=pass (i=1 spf=pass spfdomain=flex--seanjc.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-185689-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-185689-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id a640c23a62f3a-a5a17be64casi1515720966b.746.2024.05.21.19.29.00 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 21 May 2024 19:29:00 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-185689-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=QwRHV1hb; arc=pass (i=1 spf=pass spfdomain=flex--seanjc.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-185689-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-185689-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 81D3D1F26715 for ; Wed, 22 May 2024 02:29:00 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C1D9F757EE; Wed, 22 May 2024 02:28:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="QwRHV1hb" Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C1B615644E for ; Wed, 22 May 2024 02:28:32 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716344914; cv=none; b=r8pAlnhfitC8PXuaZ7IOs3yE52BxLzkSN8zKLJJezC39MuvzX/iup6nFvA+7SCBWqIlygmE737XyIxLKGqEbTRhDj8h+m0nT2xapoL1Gbo6eidQJh/Lu9rwn/ydHgTHJGKEbsXk4IdEVgfXpioGWVYG/nb/03gE71/22xNpwrn0= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1716344914; c=relaxed/simple; bh=OEUt2h0mBOqIq2hy4jSpEIVHUPabTFaHVJ651yVSDWk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=BRfcTgFqf8bvwvHguCVvNMKM4ZUtHb7/m3e6mN6dwZp6dnU8aNCyl+mX4Zo4uT+PbppmarXRzeIxmBjpFmoyUztyo+VBEN94nSYOqgbJG8xNwRqkxjugcZAc/JGuc6JzClLAQ1OBtzlcxw/lYQcS5Mt3OOgspE5eprQBJTpps0Q= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=QwRHV1hb; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-627e7734a29so2401787b3.1 for ; Tue, 21 May 2024 19:28:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1716344912; x=1716949712; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:from:to:cc:subject:date:message-id:reply-to; bh=mfIvjgKueQKVKfXuZvg9684V73mkUfXcfkr0mjRx2Pg=; b=QwRHV1hbnAWmDpmQrrGNZTpBCgkBJn0d+rJ2cj3HwiFe8ADrKXNB+e7c0LEbc0v5OZ W0iWedD9rGEc4w/PuELJvtIHJqeLVHHCWc9z5gVvRhC5uE1STZ2WDgpoPOFX4UzNI5Vu vV5fdZsY+K7EdnW48p+yligYaOHa727279VwBFgHE+gY+M9gFRjd1wWwLme9ynJc4gTR 3DdAVur1ngEhul99dsvnTdv0X72KHHncneWG6sXzAOsyXT1VPHU2n1jDZSlrXIVEnmO6 D24KNdoqNioc1/LiWPHZsdB7NSCUCaWiFnKEe/fq3tELjB2EIAYggcNgsONzyQ7gyhtw g+gQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1716344912; x=1716949712; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:reply-to:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=mfIvjgKueQKVKfXuZvg9684V73mkUfXcfkr0mjRx2Pg=; b=sTtrUeJAzZIY8rAI1do/wrY9lHL7Z8G72yGDGNe+HivdOZC2vAqHoHcrmTLx/0ya5V hBB5Poq4wykGnSr9nB/FCQSAY7zSsnh565tX86pJ/bis6oEhCkAtmfswm95qdX4bPU+5 hOmEiP1AhvxJhAJvEIdTOn85g2+hC8rI1yN/XXKTaYsmbv5VXdV8oEq/WT4usM78U7Dg NlxPL/gUnVjY4TOjKHso5NW2V3NlKsgnF5a/mh2+hZ81yUTS9JS3diQeAVRpaUNe7CM+ rvJksiIToC6ZjL0vyPm7+5Zw13TTHuhDgIYSmpfBvrECfw22203HYzKhvZoDBGfsL3ce e3KA== X-Forwarded-Encrypted: i=1; AJvYcCXK6pKRobIIoJIozI9YE9+CL7EexQ71Y5HZpToszwyKgUJ7NYQcpiVLS63crUmnW9JLRTRpTSw4jMQGDd6skHRt92Z3raVV0sXMBTtI X-Gm-Message-State: AOJu0YzpPfAmbs7K3SbJ/2ZpZb9I6sDYztqripICx2YSLSa+CZZIQ7DE wJZUqx7k5vZkYFCinVMrhayEvbuASUeFrAY/Bba/1vv8TYjXgkkdaYuEvAbOnZGhcJ2gTcNOPR+ LIA== X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a0d:d78e:0:b0:61b:7912:6cad with SMTP id 00721157ae682-627e46cfe0emr2133237b3.2.1716344911878; Tue, 21 May 2024 19:28:31 -0700 (PDT) Reply-To: Sean Christopherson Date: Tue, 21 May 2024 19:28:22 -0700 In-Reply-To: <20240522022827.1690416-1-seanjc@google.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240522022827.1690416-1-seanjc@google.com> X-Mailer: git-send-email 2.45.0.215.g3402c0e53f-goog Message-ID: <20240522022827.1690416-2-seanjc@google.com> Subject: [PATCH v2 1/6] KVM: Register cpuhp and syscore callbacks when enabling hardware From: Sean Christopherson To: Sean Christopherson , Paolo Bonzini Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Gao , Kai Huang Content-Type: text/plain; charset="UTF-8" Register KVM's cpuhp and syscore callback when enabling virtualization in hardware instead of registering the callbacks during initialization, and let the CPU up/down framework invoke the inner enable/disable functions. Registering the callbacks during initialization makes things more complex than they need to be, as KVM needs to be very careful about handling races between enabling CPUs being onlined/offlined and hardware being enabled/disabled. Intel TDX support will require KVM to enable virtualization during KVM initialization, i.e. will add another wrinkle to things, at which point sorting out the potential races with kvm_usage_count would become even more complex. Use a dedicated mutex to guard kvm_usage_count, as taking kvm_lock outside cpu_hotplug_lock is disallowed. Ideally, KVM would *always* take kvm_lock outside cpu_hotplug_lock, but KVM x86 takes kvm_lock in several notifiers that may be called under cpus_read_lock(). kvmclock_cpufreq_notifier() in particular has callchains that are infeasible to guarantee will never be called with cpu_hotplug_lock held. And practically speaking, using a dedicated mutex is a non-issue as the cost is a few bytes for all of KVM. Note, using the cpuhp framework has a subtle behavioral change: enabling will be done serially across all CPUs, whereas KVM currently sends an IPI to all CPUs in parallel. While serializing virtualization enabling could create undesirable latency, the issue is limited to creation of KVM's first VM, and even that can be mitigated, e.g. by letting userspace force virtualization to be enabled when KVM is initialized. Cc: Chao Gao Signed-off-by: Sean Christopherson --- virt/kvm/kvm_main.c | 189 ++++++++++++++++---------------------------- 1 file changed, 69 insertions(+), 120 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index a1756d5077ee..97783d6987e9 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -5499,9 +5499,10 @@ __visible bool kvm_rebooting; EXPORT_SYMBOL_GPL(kvm_rebooting); static DEFINE_PER_CPU(bool, hardware_enabled); +static DEFINE_MUTEX(kvm_usage_lock); static int kvm_usage_count; -static int __hardware_enable_nolock(void) +static int hardware_enable_nolock(void) { if (__this_cpu_read(hardware_enabled)) return 0; @@ -5516,34 +5517,18 @@ static int __hardware_enable_nolock(void) return 0; } -static void hardware_enable_nolock(void *failed) -{ - if (__hardware_enable_nolock()) - atomic_inc(failed); -} - static int kvm_online_cpu(unsigned int cpu) { - int ret = 0; - /* * Abort the CPU online process if hardware virtualization cannot * be enabled. Otherwise running VMs would encounter unrecoverable * errors when scheduled to this CPU. */ - mutex_lock(&kvm_lock); - if (kvm_usage_count) - ret = __hardware_enable_nolock(); - mutex_unlock(&kvm_lock); - return ret; + return hardware_enable_nolock(); } static void hardware_disable_nolock(void *junk) { - /* - * Note, hardware_disable_all_nolock() tells all online CPUs to disable - * hardware, not just CPUs that successfully enabled hardware! - */ if (!__this_cpu_read(hardware_enabled)) return; @@ -5554,78 +5539,10 @@ static void hardware_disable_nolock(void *junk) static int kvm_offline_cpu(unsigned int cpu) { - mutex_lock(&kvm_lock); - if (kvm_usage_count) - hardware_disable_nolock(NULL); - mutex_unlock(&kvm_lock); + hardware_disable_nolock(NULL); return 0; } -static void hardware_disable_all_nolock(void) -{ - BUG_ON(!kvm_usage_count); - - kvm_usage_count--; - if (!kvm_usage_count) - on_each_cpu(hardware_disable_nolock, NULL, 1); -} - -static void hardware_disable_all(void) -{ - cpus_read_lock(); - mutex_lock(&kvm_lock); - hardware_disable_all_nolock(); - mutex_unlock(&kvm_lock); - cpus_read_unlock(); -} - -static int hardware_enable_all(void) -{ - atomic_t failed = ATOMIC_INIT(0); - int r; - - /* - * Do not enable hardware virtualization if the system is going down. - * If userspace initiated a forced reboot, e.g. reboot -f, then it's - * possible for an in-flight KVM_CREATE_VM to trigger hardware enabling - * after kvm_reboot() is called. Note, this relies on system_state - * being set _before_ kvm_reboot(), which is why KVM uses a syscore ops - * hook instead of registering a dedicated reboot notifier (the latter - * runs before system_state is updated). - */ - if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF || - system_state == SYSTEM_RESTART) - return -EBUSY; - - /* - * When onlining a CPU, cpu_online_mask is set before kvm_online_cpu() - * is called, and so on_each_cpu() between them includes the CPU that - * is being onlined. As a result, hardware_enable_nolock() may get - * invoked before kvm_online_cpu(), which also enables hardware if the - * usage count is non-zero. Disable CPU hotplug to avoid attempting to - * enable hardware multiple times. - */ - cpus_read_lock(); - mutex_lock(&kvm_lock); - - r = 0; - - kvm_usage_count++; - if (kvm_usage_count == 1) { - on_each_cpu(hardware_enable_nolock, &failed, 1); - - if (atomic_read(&failed)) { - hardware_disable_all_nolock(); - r = -EBUSY; - } - } - - mutex_unlock(&kvm_lock); - cpus_read_unlock(); - - return r; -} - static void kvm_shutdown(void) { /* @@ -5648,27 +5565,25 @@ static int kvm_suspend(void) { /* * Secondary CPUs and CPU hotplug are disabled across the suspend/resume - * callbacks, i.e. no need to acquire kvm_lock to ensure the usage count - * is stable. Assert that kvm_lock is not held to ensure the system - * isn't suspended while KVM is enabling hardware. Hardware enabling - * can be preempted, but the task cannot be frozen until it has dropped - * all locks (userspace tasks are frozen via a fake signal). + * callbacks, i.e. no need to acquire kvm_usage_lock to ensure the usage + * count is stable. Assert that kvm_usage_lock is not held to ensure + * the system isn't suspended while KVM is enabling hardware. Hardware + * enabling can be preempted, but the task cannot be frozen until it has + * dropped all locks (userspace tasks are frozen via a fake signal). */ - lockdep_assert_not_held(&kvm_lock); + lockdep_assert_not_held(&kvm_usage_lock); lockdep_assert_irqs_disabled(); - if (kvm_usage_count) - hardware_disable_nolock(NULL); + hardware_disable_nolock(NULL); return 0; } static void kvm_resume(void) { - lockdep_assert_not_held(&kvm_lock); + lockdep_assert_not_held(&kvm_usage_lock); lockdep_assert_irqs_disabled(); - if (kvm_usage_count) - WARN_ON_ONCE(__hardware_enable_nolock()); + WARN_ON_ONCE(hardware_enable_nolock()); } static struct syscore_ops kvm_syscore_ops = { @@ -5676,6 +5591,60 @@ static struct syscore_ops kvm_syscore_ops = { .resume = kvm_resume, .shutdown = kvm_shutdown, }; + +static int hardware_enable_all(void) +{ + int r; + + guard(mutex)(&kvm_usage_lock); + + if (kvm_usage_count++) + return 0; + + r = cpuhp_setup_state(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online", + kvm_online_cpu, kvm_offline_cpu); + if (r) + goto err_cpuhp; + + register_syscore_ops(&kvm_syscore_ops); + + /* + * Undo virtualization enabling and bail if the system is going down. + * If userspace initiated a forced reboot, e.g. reboot -f, then it's + * possible for an in-flight operation to enable virtualization after + * syscore_shutdown() is called, i.e. without kvm_shutdown() being + * invoked. Note, this relies on system_state being set _before_ + * kvm_shutdown(), e.g. to ensure either kvm_shutdown() is invoked + * or this CPU observes the impending shutdown. Which is why KVM uses + * a syscore ops hook instead of registering a dedicated reboot + * notifier (the latter runs before system_state is updated). + */ + if (system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF || + system_state == SYSTEM_RESTART) { + r = -EBUSY; + goto err_rebooting; + } + + return 0; + +err_rebooting: + unregister_syscore_ops(&kvm_syscore_ops); + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE); +err_cpuhp: + --kvm_usage_count; + return r; +} + +static void hardware_disable_all(void) +{ + guard(mutex)(&kvm_usage_lock); + + if (--kvm_usage_count) + return; + + unregister_syscore_ops(&kvm_syscore_ops); + cpuhp_remove_state(CPUHP_AP_KVM_ONLINE); +} #else /* CONFIG_KVM_GENERIC_HARDWARE_ENABLING */ static int hardware_enable_all(void) { @@ -6381,15 +6350,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) int r; int cpu; -#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING - r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_ONLINE, "kvm/cpu:online", - kvm_online_cpu, kvm_offline_cpu); - if (r) - return r; - - register_syscore_ops(&kvm_syscore_ops); -#endif - /* A kmem cache lets us meet the alignment requirements of fx_save. */ if (!vcpu_align) vcpu_align = __alignof__(struct kvm_vcpu); @@ -6400,10 +6360,8 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) offsetofend(struct kvm_vcpu, stats_id) - offsetof(struct kvm_vcpu, arch), NULL); - if (!kvm_vcpu_cache) { - r = -ENOMEM; - goto err_vcpu_cache; - } + if (!kvm_vcpu_cache) + return -ENOMEM; for_each_possible_cpu(cpu) { if (!alloc_cpumask_var_node(&per_cpu(cpu_kick_mask, cpu), @@ -6460,11 +6418,6 @@ int kvm_init(unsigned vcpu_size, unsigned vcpu_align, struct module *module) for_each_possible_cpu(cpu) free_cpumask_var(per_cpu(cpu_kick_mask, cpu)); kmem_cache_destroy(kvm_vcpu_cache); -err_vcpu_cache: -#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING - unregister_syscore_ops(&kvm_syscore_ops); - cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE); -#endif return r; } EXPORT_SYMBOL_GPL(kvm_init); @@ -6486,10 +6439,6 @@ void kvm_exit(void) kmem_cache_destroy(kvm_vcpu_cache); kvm_vfio_ops_exit(); kvm_async_pf_deinit(); -#ifdef CONFIG_KVM_GENERIC_HARDWARE_ENABLING - unregister_syscore_ops(&kvm_syscore_ops); - cpuhp_remove_state_nocalls(CPUHP_AP_KVM_ONLINE); -#endif kvm_irqfd_exit(); } EXPORT_SYMBOL_GPL(kvm_exit); -- 2.45.0.215.g3402c0e53f-goog