Date: Wed, 17 Apr 2024 10:22:18 -0700
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu
From: Sean Christopherson
To: Marcelo Tosatti
Cc: "Paul E. McKenney", Leonardo Bras, Paolo Bonzini, Frederic Weisbecker,
 Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng, Steven Rostedt,
 Mathieu Desnoyers, Lai Jiangshan, Zqiang, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, rcu@vger.kernel.org

On Wed, Apr 17, 2024, Marcelo Tosatti wrote:
> On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote:
> > On Tue, Apr 16, 2024, Marcelo Tosatti wrote:
> > > > Why not have
> > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout
> > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest?
> > >
> > > Do you mean something like:
> > >
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index d9642dd06c25..0ca5a6a45025 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user)
> > >                 return 1;
> > >
> > >         /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
> > > -       if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> > > +       if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu())
> > >                 return 0;
> >
> > Yes. This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic
> > in kvm_sched_{in,out}().
>
> Question: where is vcpu->wants_to_run set? (or, where is the full series
> again?).

Precisely around the call to kvm_arch_vcpu_ioctl_run(). I am planning on applying
the patch that introduces the code for 6.10[*], I just haven't yet for a variety
of reasons.

[*] https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@google.com

> So for guest HLT emulation, there is a window between
>
> kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put
> and the idle's task call to ct_cpuidle_enter, where
>
> ct_dynticks_nesting() != 0 and vcpu_put has already executed.
>
> Even for idle=poll, the race exists.

Is waking rcuc actually problematic? I agree it's not ideal, but it's a smallish
window, i.e. is unlikely to happen frequently, and if rcuc is awakened, it will
effectively steal cycles from the idle thread, not the vCPU thread. If the vCPU
gets a wake event before rcuc completes, then the vCPU could experience jitter,
but that could also happen if the CPU ends up in a deep C-state.

And that race exists in general, i.e. any IRQ that arrives just as the idle task
is being scheduled in will unnecessarily wake up rcuc.

> > >         /* Is the RCU core waiting for a quiescent state from this CPU? */
> > >
> > > The problem is:
> > >
> > > 1) You should only set that flag, in the VM-entry path, after the point
> > > where no use of RCU is made: close to guest_state_enter_irqoff call.
> >
> > Why? As established above, KVM essentially has 1 second to enter the guest after
> > setting in_guest_run_loop (or whatever we call it). In the vast majority of cases,
> > the time before KVM enters the guest can probably be measured in microseconds.
>
> OK.
>
> > Snapshotting the exit time has the exact same problem of depending on KVM to
> > re-enter the guest soon-ish, so I don't understand why this would be considered
> > a problem with a flag to note the CPU is in KVM's run loop, but not with a
> > snapshot to say the CPU recently exited a KVM guest.
>
> See the race above.

Ya, but if kvm_last_guest_exit is zeroed in kvm_sched_out(), then the snapshot
approach ends up with the same race. And not zeroing kvm_last_guest_exit is
arguably much more problematic as encountering a false positive doesn't require
hitting a small window.

> > > 2) While handling a VM-exit, a host timer interrupt can occur before that,
> > > or after the point where "this_cpu->in_kvm_run" is set to false.
> > >
> > > And a host timer interrupt calls rcu_sched_clock_irq which is going to
> > > wake up rcuc.
> >
> > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace
> > or the vCPU was scheduled out. In the former case, rcuc won't be woken up if the
> > CPU is in userspace. And in the latter case, waking up rcuc is absolutely the
> > correct thing to do as VM-Enter is not imminent.
> >
> > For exits to userspace, there would be a small window where an IRQ could arrive
> > between KVM putting the vCPU and the CPU actually returning to userspace, but
> > unless that's problematic in practice, I think it's a reasonable tradeoff.
>
> OK, your proposal looks alright except these races.
>
> We don't want those races to occur in production (and they likely will).
>
> Is there any way to fix the races? Perhaps cmpxchg?

I don't think an atomic switch from the vCPU task to the idle task is feasible,
e.g. KVM would somehow have to know that the idle task is going to run next.
This seems like something that needs a generic solution, e.g. to prevent waking
rcuc if the idle task is in the process of being scheduled in.
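
Illustrative sketch only, not taken from any posted patch: a small userspace
model of the flag-based check discussed in this thread, which may help when
reasoning about the race window. Every name below (struct cpu_model,
model_sched_in(), model_sched_out(), model_rcu_pending(), and all fields) is
hypothetical; the real rcu_pending() checks more conditions, and the roughly
1-second timeout mentioned above is reduced here to a single gp_overdue
boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; these are NOT real kernel fields. */
struct cpu_model {
	bool nohz_full;      /* CPU is configured as nohz_full */
	bool irq_from_user;  /* tick interrupted userspace */
	bool irq_from_idle;  /* tick interrupted the idle task */
	bool in_kvm_run;     /* assumed flag: task is inside KVM's run loop */
	bool gp_overdue;     /* assumed: grace period outlived the timeout */
};

/* Hypothetical hooks, standing in for logic in kvm_sched_{in,out}(). */
static void model_sched_in(struct cpu_model *c)  { c->in_kvm_run = true; }
static void model_sched_out(struct cpu_model *c) { c->in_kvm_run = false; }

/* Returns true if, in this model, the tick would wake rcuc. */
static bool model_rcu_pending(const struct cpu_model *c)
{
	/* The timeout wins: an overdue grace period is never ignored. */
	if (c->gp_overdue)
		return true;

	/* Mirrors the shape of the quoted diff: skip RCU work on a
	 * nohz_full CPU that is in userspace, idle, or in the run loop. */
	if ((c->irq_from_user || c->irq_from_idle || c->in_kvm_run) &&
	    c->nohz_full)
		return false;

	/* Simplification: everything else is treated as pending work. */
	return true;
}
```

In this model, calling model_sched_out() and then model_rcu_pending() before
irq_from_idle is set returns true, which corresponds to the window between
the vCPU being put and the idle task being recognized; once irq_from_idle is
set, the check returns false again.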