Date: Mon, 12 Jul 2021 17:24:42 +0100
Message-ID: <874kcz33g5.wl-maz@kernel.org>
From: Marc Zyngier <maz@kernel.org>
To: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Will Deacon <will@kernel.org>,
	Suleiman Souhlal <suleiman@google.com>,
	Joel Fernandes <joelaf@google.com>,
	linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.cs.columbia.edu,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
Subject: Re: [PATCHv2 4/4] arm64: add host pv-vcpu-state support
In-Reply-To: <20210709043713.887098-5-senozhatsky@chromium.org>
References: <20210709043713.887098-1-senozhatsky@chromium.org>
	<20210709043713.887098-5-senozhatsky@chromium.org>

On Fri, 09 Jul 2021 05:37:13 +0100,
Sergey Senozhatsky <senozhatsky@chromium.org> wrote:
> 
> Add PV-vcpu-state support bits to the host. The host uses the guest
> PV-state per-CPU pointers to update the VCPU state each time it
> calls kvm_arch_vcpu_load() or kvm_arch_vcpu_put() for the VCPU, so
> that the guest scheduler can become aware of the fact that not all
> VCPUs are always available. Currently the guest scheduler on arm64
> always assumes that all CPUs are available, because
> vcpu_is_preempted() is not implemented on arm64.
> 
> - schbench -t 3 -m 3 -p 4096
> 
>   Latency percentiles (usec)
>   BASE
>   ================================================
>   50.0th: 1      (3556427 samples)
>   75.0th: 13     (879210 samples)
>   90.0th: 15     (893311 samples)
>   95.0th: 18     (159594 samples)
>   *99.0th: 118   (224187 samples)
>   99.5th: 691    (28555 samples)
>   99.9th: 7384   (23076 samples)
>   min=1, max=104218
>   avg worker transfer: 25192.00 ops/sec 98.41MB/s
> 
>   PATCHED
>   ================================================
>   50.0th: 1      (3507010 samples)
>   75.0th: 13     (1635775 samples)
>   90.0th: 16     (901271 samples)
>   95.0th: 24     (281051 samples)
>   *99.0th: 114   (255581 samples)
>   99.5th: 382    (33051 samples)
>   99.9th: 6392   (26592 samples)
>   min=1, max=83877
>   avg worker transfer: 28613.39 ops/sec 111.77MB/s
> 
> - perf bench sched all
> 
>   ops/sec
>   BASE     PATCHED
>   ================================================
>   33452    36485
>   33541    39405
>   33365    36858
>   33455    38047
>   33449    37866
>   33616    34922
>   33479    34388
>   33594    37203
>   33458    35363
>   33704    35180
> 
> Student's T-test
> 
>           N    Min      Max      Median   Avg        Stddev
> base     10   33365    33704    33479    33511.3    100.92467
> patched  10   34388    39405    36858    36571.7    1607.454
> Difference at 95.0% confidence
>         3060.4 +/- 1070.09
>         9.13244% +/- 3.19321%
>         (Student's t, pooled s = 1138.88)
> 
> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  arch/arm64/include/asm/kvm_host.h | 23 +++++++++++
>  arch/arm64/kvm/Makefile           |  3 +-
>  arch/arm64/kvm/arm.c              |  3 ++
>  arch/arm64/kvm/hypercalls.c       | 11 ++++++
>  arch/arm64/kvm/pv-vcpu-state.c    | 64 +++++++++++++++++++++++++++++++
>  5 files changed, 103 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/pv-vcpu-state.c
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 41911585ae0c..e782f4d0c916 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -381,6 +381,12 @@ struct kvm_vcpu_arch {
>  		u64 last_steal;
>  		gpa_t base;
>  	} steal;
> +
> +	/* PV state of the VCPU */
> +	struct {
> +		gpa_t base;
> +		struct gfn_to_hva_cache ghc;

Please stay consistent with what we use for steal time accounting
(i.e. don't use gfn_to_hva_cache, but directly use a gpa with
kvm_put_guest()). If you can demonstrate that there is an unacceptable
overhead in doing so, then both steal time and preemption will be
updated at the same time.
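
Concretely, that suggestion maps onto something like the sketch below,
modelled on kvm_update_stolen_time() in arch/arm64/kvm/pvtime.c. This
is only a sketch: the "preempted" member of struct vcpu_state (taken
to be a u8 at a fixed offset) is an assumption about this series'
layout, which the quoted hunks don't show.

/*
 * Hypothetical kvm_update_vcpu_preempted() without the gfn_to_hva_cache:
 * resolve the gpa on every update, as the stolen time code does.
 */
void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted)
{
	struct kvm *kvm = vcpu->kvm;
	u64 base = vcpu->arch.vcpu_state.base;
	u64 offset = offsetof(struct vcpu_state, preempted); /* assumed field */
	int idx;

	if (base == GPA_INVALID)
		return;

	/*
	 * Called from the preempt notifiers (atomic context), so keep
	 * page faults disabled; the put_user() behind kvm_put_guest()
	 * then fails with -EFAULT instead of sleeping.
	 */
	pagefault_disable();
	/* The memslot lookup in gfn_to_hva() requires the SRCU lock. */
	idx = srcu_read_lock(&kvm->srcu);
	kvm_put_guest(kvm, base + offset, (u8)preempted);
	srcu_read_unlock(&kvm->srcu, idx);
	pagefault_enable();
}

The trade-off behind "unacceptable overhead" above: kvm_put_guest()
redoes the gfn-to-hva translation on every vcpu_load()/vcpu_put(),
while the cache amortises that lookup across calls.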
> +	} vcpu_state;
>  };
>  
>  /* Pointer to the vcpu's SVE FFR for sve_{save,load}_state() */
> @@ -695,6 +701,23 @@ static inline bool kvm_arm_is_pvtime_enabled(struct kvm_vcpu_arch *vcpu_arch)
>  	return (vcpu_arch->steal.base != GPA_INVALID);
>  }
>  
> +int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gfn_t addr);
> +int kvm_release_vcpu_state(struct kvm_vcpu *vcpu);
> +
> +static inline void kvm_arm_vcpu_state_init(struct kvm_vcpu_arch *vcpu_arch)
> +{
> +	vcpu_arch->vcpu_state.base = GPA_INVALID;
> +	memset(&vcpu_arch->vcpu_state.ghc, 0, sizeof(struct gfn_to_hva_cache));

Does it really need to be inlined?

> +}
> +
> +static inline bool
> +kvm_arm_is_vcpu_state_enabled(struct kvm_vcpu_arch *vcpu_arch)
> +{
> +	return (vcpu_arch->vcpu_state.base != GPA_INVALID);
> +}
> +
> +void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted);
> +
>  void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
>  
>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
> diff --git a/arch/arm64/kvm/Makefile b/arch/arm64/kvm/Makefile
> index 989bb5dad2c8..2a3ee82c6d90 100644
> --- a/arch/arm64/kvm/Makefile
> +++ b/arch/arm64/kvm/Makefile
> @@ -12,7 +12,8 @@ obj-$(CONFIG_KVM) += hyp/
>  
>  kvm-y := $(KVM)/kvm_main.o $(KVM)/coalesced_mmio.o $(KVM)/eventfd.o \
>  	 $(KVM)/vfio.o $(KVM)/irqchip.o $(KVM)/binary_stats.o \
> -	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o pvtime.o \
> +	 arm.o mmu.o mmio.o psci.o perf.o hypercalls.o \
> +	 pvtime.o pv-vcpu-state.o \
>  	 inject_fault.o va_layout.o handle_exit.o \
>  	 guest.o debug.o reset.o sys_regs.o \
>  	 vgic-sys-reg-v3.o fpsimd.o pmu.o \
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e9a2b8f27792..43e995c9fddb 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -332,6 +332,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  	kvm_arm_reset_debug_ptr(vcpu);
>  
>  	kvm_arm_pvtime_vcpu_init(&vcpu->arch);
> +	kvm_arm_vcpu_state_init(&vcpu->arch);
>  
>  	vcpu->arch.hw_mmu = &vcpu->kvm->arch.mmu;
>  
> @@ -429,10 +430,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  	if (vcpu_has_ptrauth(vcpu))
>  		vcpu_ptrauth_disable(vcpu);
>  	kvm_arch_vcpu_load_debug_state_flags(vcpu);
> +	kvm_update_vcpu_preempted(vcpu, false);
>  }
>  
>  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  {
> +	kvm_update_vcpu_preempted(vcpu, true);

This doesn't look right. With this, you are now telling the guest that
a vcpu that is blocked on WFI is preempted. This really isn't the
case, as it has voluntarily entered a low-power mode while waiting for
an interrupt. Indeed, the vcpu isn't running. A physical CPU wouldn't
be running either.
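
For reference, core KVM already tracks the distinction being drawn
here: kvm_sched_out() in virt/kvm/kvm_main.c sets vcpu->preempted only
when the vcpu task is descheduled while still runnable, so a vcpu
sleeping in kvm_vcpu_block() on WFI is left unflagged. A minimal
sketch of how the hook could test it, reusing this series' helper and
eliding the rest of the function:

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
	/*
	 * Only involuntary context switches count as preemption;
	 * vcpu->preempted stays false across a voluntary WFI sleep.
	 */
	if (vcpu->preempted)
		kvm_update_vcpu_preempted(vcpu, true);

	kvm_arch_vcpu_put_debug_state_flags(vcpu);
	kvm_arch_vcpu_put_fp(vcpu);
	/* ... remainder of kvm_arch_vcpu_put() unchanged ... */
}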
>  	kvm_arch_vcpu_put_debug_state_flags(vcpu);
>  	kvm_arch_vcpu_put_fp(vcpu);
>  	if (has_vhe())
> diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> index 30da78f72b3b..95bcf86e0b6f 100644
> --- a/arch/arm64/kvm/hypercalls.c
> +++ b/arch/arm64/kvm/hypercalls.c
> @@ -110,6 +110,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  		case ARM_SMCCC_HV_PV_TIME_FEATURES:
>  			val[0] = SMCCC_RET_SUCCESS;
>  			break;
> +		case ARM_SMCCC_HV_PV_VCPU_STATE_FEATURES:
> +			val[0] = SMCCC_RET_SUCCESS;
> +			break;
>  		}
>  		break;
>  	case ARM_SMCCC_HV_PV_TIME_FEATURES:
> @@ -139,6 +142,14 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
>  	case ARM_SMCCC_TRNG_RND32:
>  	case ARM_SMCCC_TRNG_RND64:
>  		return kvm_trng_call(vcpu);
> +	case ARM_SMCCC_HV_PV_VCPU_STATE_INIT:
> +		if (kvm_init_vcpu_state(vcpu, smccc_get_arg1(vcpu)) == 0)
> +			val[0] = SMCCC_RET_SUCCESS;
> +		break;
> +	case ARM_SMCCC_HV_PV_VCPU_STATE_RELEASE:
> +		if (kvm_release_vcpu_state(vcpu) == 0)
> +			val[0] = SMCCC_RET_SUCCESS;
> +		break;
>  	default:
>  		return kvm_psci_call(vcpu);
>  	}
> diff --git a/arch/arm64/kvm/pv-vcpu-state.c b/arch/arm64/kvm/pv-vcpu-state.c
> new file mode 100644
> index 000000000000..8496bb2a5966
> --- /dev/null
> +++ b/arch/arm64/kvm/pv-vcpu-state.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +#include
> +#include
> +
> +#include
> +#include
> +
> +#include

Why this include?

> +
> +int kvm_init_vcpu_state(struct kvm_vcpu *vcpu, gpa_t addr)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	int ret;
> +	u64 idx;
> +
> +	if (kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return 0;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	ret = kvm_gfn_to_hva_cache_init(vcpu->kvm,
> +					&vcpu->arch.vcpu_state.ghc,
> +					addr,
> +					sizeof(struct vcpu_state));
> +	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	if (!ret)
> +		vcpu->arch.vcpu_state.base = addr;
> +	return ret;
> +}
> +
> +int kvm_release_vcpu_state(struct kvm_vcpu *vcpu)
> +{
> +	if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return 0;
> +
> +	kvm_arm_vcpu_state_init(&vcpu->arch);
> +	return 0;
> +}
> +
> +void kvm_update_vcpu_preempted(struct kvm_vcpu *vcpu, bool preempted)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	u64 idx;
> +
> +	if (!kvm_arm_is_vcpu_state_enabled(&vcpu->arch))
> +		return;
> +
> +	/*
> +	 * This function is called from atomic context, so we need to
> +	 * disable page faults. kvm_write_guest_cached() will call
> +	 * might_fault().
> +	 */
> +	pagefault_disable();
> +	/*
> +	 * Need to take the SRCU lock because kvm_write_guest_offset_cached()
> +	 * calls kvm_memslots();
> +	 */
> +	idx = srcu_read_lock(&kvm->srcu);
> +	kvm_write_guest_cached(kvm, &vcpu->arch.vcpu_state.ghc,
> +			       &preempted, sizeof(bool));
> +	srcu_read_unlock(&kvm->srcu, idx);
> +	pagefault_enable();
> +}

Before rushing into an implementation, this really could do with some
specification:

- where is the memory allocated?
- what is the lifetime of the memory?
- what is its expected memory type?
- what is the coherency model (what if the guest runs with caching
  disabled, for example)?
- is the functionality preserved across a VM reset?
- what is the strategy w.r.t. live migration?
- can userspace detect that the feature is implemented?
- can userspace prevent the feature from being used?
- where is the documentation?

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.