Received: by 2002:a05:7412:e794:b0:fa:551:50a7 with SMTP id o20csp2406542rdd; Fri, 12 Jan 2024 08:28:50 -0800 (PST) X-Google-Smtp-Source: AGHT+IEtybEBbtbPv+BUXeXjrEIJfKNAKIBQdPukQA8qdGCjHl2+DwJOvD18E0UonYhgthgmqoT3 X-Received: by 2002:ad4:5f4e:0:b0:680:f972:b74a with SMTP id p14-20020ad45f4e000000b00680f972b74amr1458930qvg.0.1705076930166; Fri, 12 Jan 2024 08:28:50 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1705076930; cv=none; d=google.com; s=arc-20160816; b=daQ8dLxz13lbuWao/OLxK89WACnKLatR2ZAQ4er3qToepYtGYrayoLX05pwnADyQN7 vClUTinM1hawpC+TqUqzF5Q8RkLkdpoqLkLZpFgWkj/GGhxUbe9juPnaanv3+ZxNXLPL DIyA+chuO4Zg0FpzWjiccPgmA650TA9iv90V7LaRayfctf5Ki+RDJ4Z0tO/gU7AnKvyZ kF+M2QaEQUAKbIJ5QZ69L2JNyfWoHjjOPl++R9F2qjeAJNfES1RbqgC0CjsB+qQf+si4 COlv8neOrQ1aREhpzseQ24wrJ/hqGQyNkkhrPBwaGjVRsFkgdpzy9RFfqJjKqDsZfFtP Fu3w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:from:subject:message-id:references:mime-version :list-unsubscribe:list-subscribe:list-id:precedence:in-reply-to:date :dkim-signature; bh=N1LDGThbTCa8jpCR+m8/s9AcuWvCYLNXRN2os5z9iYI=; fh=AqEywIqJr99b5UAjC8+GW+sEbdZMvVH2EGwFgDUXQ5k=; b=ZJG0sGok5J33dSIiGyVwvHMtRA5Yu1UlHefFfak/+b9LOCm9Y3GtZPLsvT6AEGEPtq PBDP/HGaJ+Hy4wbueBakri7YSkbZ32Lj/7Y4rG7CtuH9NvSBUYA3NNXj898pYjo4Ljd1 oU195DkqzV3WfQMfv+14xPWS79zOrgxXn5/pe4rYcx7jv2aCPp+5TToZSSUIvFV4iGJx 1EPR5B502SzjzwtYpbveXRM9Md9lHONIfN8DtqWIK7T/PA83nmBnhYV2zS4O7e5Ng3U9 OZnjy2p/+0rakxbzEq44KXq0iC1IBFkwNr/A4Fr9l55uqTozt823S/lNBAAAsZOTYeua mv7w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=xPSFjvJR; spf=pass (google.com: domain of linux-kernel+bounces-24858-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-24858-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [2604:1380:45d1:ec00::1]) by mx.google.com with ESMTPS id j12-20020a0cf30c000000b0067f52e92488si3130572qvl.121.2024.01.12.08.28.50 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 12 Jan 2024 08:28:50 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-24858-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=xPSFjvJR; spf=pass (google.com: domain of linux-kernel+bounces-24858-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-24858-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 73A201C21193 for ; Fri, 12 Jan 2024 16:28:49 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 257D9745F4; Fri, 12 Jan 2024 16:28:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="xPSFjvJR" Received: from mail-pf1-f201.google.com (mail-pf1-f201.google.com [209.85.210.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1745C6D1AA for ; Fri, 12 Jan 2024 16:28:33 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Received: by mail-pf1-f201.google.com with SMTP id d2e1a72fcca58-6da57e2d2b9so4640821b3a.2 for ; Fri, 12 Jan 2024 08:28:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1705076913; x=1705681713; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=N1LDGThbTCa8jpCR+m8/s9AcuWvCYLNXRN2os5z9iYI=; b=xPSFjvJR/ZpuMZE0CnIUQnxqJjxuPqG8Quh6OQEcxVGjxtaWGiWFgi+rAV59Biiiu1 cMgiYVpmbFvoq3NeA3P9Kp6tDuD2YaySp/3k2AbislZuIP873dsD1P6nEAUy9O4vtpuW A65K0ZrnJ/o+ybRPzXlxta9Td4q9yaWqytzb9kDZBv+GM9h50NTxlUEmnIBHn2S96z4z 9lEqW8u2CM4t5IR0s6G+LluIkv0yec5jCm5wfhjQ6iKmmwIWAdZsUB4IM7FcvE/MkAKC L2C/xeJvV5BRHKiqarxWPWm3UcMcdchg9teSIMbiYfGLroGOCHIrSnpkp5Wem4QROH+M A39A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1705076913; x=1705681713; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=N1LDGThbTCa8jpCR+m8/s9AcuWvCYLNXRN2os5z9iYI=; b=Hn04T6XUv0BCdh2BxzkJTk74KB1aB4l7kQWRBKPwBtoy+1J+eW/SjohWGKQvBSA4I8 vFkegBLWjWcvK9ztsKYMvpDir+ibTNlQ0yjnnilQuHREu7rgW2FT2Bryzpwn6CGrZc4H wpFDLjOp+BQh9g7u+cWEa7LydoyZ3TsMx4MEc4bzKZH2ZJIRJ497mUqZye6cTuGo3+4c hZyRJt30ceEceJwLAnATV3E7vkfMnS9bdRavdwa7ofTuTGuUmwtp4Lo/0TVSZk/25qYz KBl2gZWIkV2VLs3iRuURIbqp7jhpeUOoqdDncM8YgjvscgDm1jpKepg1Re5nXj3eI/mF QmdA== X-Gm-Message-State: AOJu0Yzg1HBSGJEwNLbg8kU4vwzl6MY04We4TZiWh9QkLO1pM0H76QKY CkTXFI972ifSFFazUi6kFeCa/u4X/YyLtjBiHA== X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a05:6a00:1483:b0:6d9:f3b6:c6a9 with SMTP id v3-20020a056a00148300b006d9f3b6c6a9mr100097pfu.4.1705076913252; Fri, 12 Jan 2024 08:28:33 -0800 (PST) Date: Fri, 12 Jan 2024 08:28:31 -0800 In-Reply-To: <20240112091128.3868059-1-foxywang@tencent.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240112091128.3868059-1-foxywang@tencent.com> Message-ID: Subject: Re: [PATCH] KVM: irqchip: synchronize srcu only if needed From: Sean Christopherson To: Yi Wang Cc: pbonzini@redhat.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, wanpengli@tencent.com, Yi Wang , Oliver Upton , Marc Zyngier , Anup Patel , Atish Patra , Christian Borntraeger , Janosch Frank , Claudio Imbrenda Content-Type: text/plain; charset="us-ascii" +other KVM maintainers On Fri, Jan 12, 2024, Yi Wang wrote: > From: Yi Wang > > We found that it may cost more than 20 milliseconds very accidentally > to enable cap of KVM_CAP_SPLIT_IRQCHIP on a host which has many vms > already. > > The reason is that when vmm(qemu/CloudHypervisor) invokes > KVM_CAP_SPLIT_IRQCHIP kvm will call synchronize_srcu_expedited() and > might_sleep and kworker of srcu may cost some delay during this period. might_sleep() yielding is not justification for changing KVM. That's more or less saying "my task got preempted and took longer to run". Well, yeah. > Since this happens during creating vm, it's no need to synchronize srcu > now 'cause everything is not ready(vcpu/irqfd) and none uses irq_srcu now. > > Signed-off-by: Yi Wang > --- > arch/x86/kvm/irq_comm.c | 2 +- > include/linux/kvm_host.h | 2 ++ > virt/kvm/irqchip.c | 3 ++- > 3 files changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kvm/irq_comm.c b/arch/x86/kvm/irq_comm.c > index 16d076a1b91a..37c92b7486c7 100644 > --- a/arch/x86/kvm/irq_comm.c > +++ b/arch/x86/kvm/irq_comm.c > @@ -394,7 +394,7 @@ static const struct kvm_irq_routing_entry empty_routing[] = {}; > > int kvm_setup_empty_irq_routing(struct kvm *kvm) > { > - return kvm_set_irq_routing(kvm, empty_routing, 0, 0); > + return kvm_set_irq_routing(kvm, empty_routing, 0, NONEED_SYNC_SRCU); > } > > void kvm_arch_post_irq_routing_update(struct kvm *kvm) > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 4944136efaa2..a46370cca355 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -1995,6 +1995,8 @@ static inline int mmu_invalidate_retry_hva(struct kvm *kvm, > > #define KVM_MAX_IRQ_ROUTES 4096 /* might need extension/rework in the future */ > > +#define NONEED_SYNC_SRCU (1U << 0) > + > bool kvm_arch_can_set_irq_routing(struct kvm *kvm); > int kvm_set_irq_routing(struct kvm *kvm, > const struct kvm_irq_routing_entry *entries, > diff --git a/virt/kvm/irqchip.c b/virt/kvm/irqchip.c > index 1e567d1f6d3d..cea5c43c1a49 100644 > --- a/virt/kvm/irqchip.c > +++ b/virt/kvm/irqchip.c > @@ -224,7 +224,8 @@ int kvm_set_irq_routing(struct kvm *kvm, > > kvm_arch_post_irq_routing_update(kvm); > > - synchronize_srcu_expedited(&kvm->irq_srcu); > + if (!(flags & NONEED_SYNC_SRCU)) > + synchronize_srcu_expedited(&kvm->irq_srcu); I'm not a fan of x86 passing in a magic flag. It's not immediately clear why skipping synchronization is safe. Piecing things together, _on x86_, I believe the answer is that vCPU can't have yet been created, kvm->lock is held, _and_ kvm_arch_irqfd_allowed() will subtly reject attempts to assign irqfds if the local APIC isn't in-kernel. But AFAICT, s390's implementation of KVM_CREATE_IRQCHIP, which sets up identical dummy/empty routing, doesn't provide the same guarantees. case KVM_CREATE_IRQCHIP: { struct kvm_irq_routing_entry routing; r = -EINVAL; if (kvm->arch.use_irqchip) { /* Set up dummy routing. */ memset(&routing, 0, sizeof(routing)); r = kvm_set_irq_routing(kvm, &routing, 0, 0); } break; } It's entirely possible that someday, kvm_setup_empty_irq_routing() is moved to common KVM and used for s390, at which point we have a mess on our hands because it's not at all obvious whether or not it's safe for s390 to also skip synchronization. Rather than hack in a workaround for x86, I would rather we try and clean up this mess. Except for kvm_irq_map_gsi(), it looks like all flows assume irq_routing is non-NULL. But I'm not remotely confident that that holds true on all architectures, e.g. the only reason kvm_irq_map_gsi() checks for a NULL irq_routing is because syzkaller generated a splat (commit c622a3c21ede ("KVM: irqfd: fix NULL pointer dereference in kvm_irq_map_gsi")). And on x86, I'm pretty sure as of commit 654f1f13ea56 ("kvm: Check irqchip mode before assign irqfd"), which added kvm_arch_irqfd_allowed(), it's impossible for kvm_irq_map_gsi() to encounter a NULL irq_routing _on x86_. But I strongly suspect other architectures can reach kvm_irq_map_gsi() with a NULL irq_routing, e.g. RISC-V dynamically configures its interrupt controller, yet doesn't implement kvm_arch_intc_initialized(). So instead of special casing x86, what if we instead have KVM setup an empty IRQ routing table during kvm_create_vm(), and then avoid this mess entirely? That way x86 and s390 no longer need to set empty/dummy routing when creating an IRQCHIP, and the worst case scenario of userspace misusing an ioctl() is no longer a NULL pointer deref.