Received: by 2002:ab2:3350:0:b0:1f4:6588:b3a7 with SMTP id o16csp1491911lqe; Mon, 8 Apr 2024 10:16:45 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCXKWwHZQm/vOYLzzJgBcoZN7kxZIyYcFHieilVairw/0eOtO8atLgPWfAT05X7xID/PgcfqVvuKRQgkdoigaTnbax95hzrkHT/gZlqbSQ== X-Google-Smtp-Source: AGHT+IEofKyJ7Y71OtQZIa0CjQUoyFfTD8duDsYIO2QYhKdHS2ihVd7+SVcmtkHRO+bX55dzleS2 X-Received: by 2002:a50:8d5d:0:b0:56e:2a21:3017 with SMTP id t29-20020a508d5d000000b0056e2a213017mr7555389edt.5.1712596605152; Mon, 08 Apr 2024 10:16:45 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1712596605; cv=pass; d=google.com; s=arc-20160816; b=TQyGDl+gxK95muNpZP8bc6nzT86GRjE+2Q/uOpz7lyviPoHOD1ISzDpomxyyxH7iXq v3NRs3/B9p7fTCHHuf80bQOiUGKZ7jWbUZvFfFAJhxZkUoCYdj0iFyPMQbGgH9++r+OP LaaSFUuCii0pk90HQi12B0vOu3iYItUY3KQOJe9b2gH4oF0sPHq4hg9hUM8m21JiGaXr 2ygN1x5hTHC1kWoyRd/Vifp9s8DE0SjCuO/MPx+u0dGif+OeyedLOFDuMMemC2zEvYj1 NEv99mjD2/AgR4yvPW//CUB1uScmLandRiwBOLrTWAwQDthghi7QBTOrYUOy36M4rNj8 4uBA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=cc:to:from:subject:message-id:references:mime-version :list-unsubscribe:list-subscribe:list-id:precedence:in-reply-to:date :dkim-signature; bh=XYkMSUzqJlpjQtSfhn57aIvoeBGIOTZt8KfeTsbtdM8=; fh=1yRg7Wc789LPAUTMrJFog72nsr2puiXOkR9gJFiCsT0=; b=K6PXV56yA2YKD/LL+jsEj0yvs6skyTnsRCwfoOWMcyAeWBPBJWFBuvuZeQf/gEm0TZ osJdN9sSf9aPO03hsQYr2JyATAV6vgDq6Uohxx/mPPtbxRTbhpwr7UjgDfVOz/XG08KC VVrx8WZIHSF8D98+7baRkj4yOHEweLguGK0Wivc83BYyPUhsTJ3Lbot1LRBU80trOEkQ 8gwsQTnPKXZzUm0UATzpe+Axuu3JFyKeolEN+HtI5UipSaBrJ3ct3uRKtcV4WN2x9SKa afSXy1+JoMqbxSw9H9fzJvUBM3g+beEip2wvPJe23CF61O5EGwc6S/DJZweBjrlTzzMf /pbg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=lid8DAZl; arc=pass (i=1 spf=pass spfdomain=flex--seanjc.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-135738-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-135738-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id z7-20020a05640235c700b0056dc926503csi3882489edc.218.2024.04.08.10.16.45 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 08 Apr 2024 10:16:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-135738-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=lid8DAZl; arc=pass (i=1 spf=pass spfdomain=flex--seanjc.bounces.google.com dkim=pass dkdomain=google.com dmarc=pass fromdomain=google.com); spf=pass (google.com: domain of linux-kernel+bounces-135738-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-135738-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id B36301F2577E for ; Mon, 8 Apr 2024 17:16:44 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 043C2143C59; Mon, 8 Apr 2024 17:16:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="lid8DAZl" Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D49E1142E68 for ; Mon, 8 Apr 2024 17:16:26 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712596588; cv=none; b=QGnKgEak/izMm0yepyqgGrnjK2qIu2tKlyUpJfKFS2UrVmmCqDysPQ9RyQBFhJCIWjha6CKhzfmF569xg0ULoBWfEahgxNgblayT2ZR399AGrlXYhETS3XpfvgaPEeDWDtn1KOeDvgGcEdR3cbv9OPgH8B/4JM0wAcxnjGWx7Jw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712596588; c=relaxed/simple; bh=vmPDnU2rbQte5MXl0eNshZbSVgsEa6yKXoSEo0wMagw=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=p1WxqpKvnJjbvwK7RmkyEhDQN+FNNAEOThJVoCusUxXQuo7mHq2jDKe9lC/N170k5sfhs42p8cJFh1O2tagXUCGXjAOPWFZXNTS4Z14MXBzmGJezsFoVL73b8rl16Tpm9xH/rvBMOMDwyzBUaYnlhSvp5bTVmkoIpgRzKr9hnXE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=lid8DAZl; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-5d8bdadc79cso4129665a12.2 for ; Mon, 08 Apr 2024 10:16:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1712596586; x=1713201386; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=XYkMSUzqJlpjQtSfhn57aIvoeBGIOTZt8KfeTsbtdM8=; b=lid8DAZl+MQvmnSl6vAZQSIOyYwV20ybesJgzgN7BdSGtkm27tg8n3AoinKX72nytQ ZPBperwVtyZV1qS5v4LsUfx8xK84t+mt1q7nSaArMv3hQbRxzmrL1nRQQN35kSTmIHtY BS9GkMYzDexKnHFCxmXk+T+vyLMQ0bSGfELkNKZYgIePvcgYPrUAcHtB/fXsXzWDfEJp 88hqDUf7tOVMS3C8QwsV6VEHCeF/GvsYPMSj4JjHPvaORSo0Hh4bYOpSSSOJWz++omPB RBWB8RWzdwwfKMj7vCb+jf1+A6Qj8b9KpxpAGWlwhmxqP7RaP4/V8pe8gXWD83LQ9Rh/ LWZg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712596586; x=1713201386; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=XYkMSUzqJlpjQtSfhn57aIvoeBGIOTZt8KfeTsbtdM8=; b=YFoT/vYPa7Q41pqEs/6NqIabcIuA1FVqIFwQMQLq8ik1Xz/SmQcj1FPhcvwzezkgMZ YObBmtUQJyMgUdk230yAyAcnsg/SR50NaVUHRv/ljLYOx376xIY9BA9dcDs9TTsbsWOZ Otsiw3Bx1aBj4mmI0kvc6drpHbyH8KHaRRuI86BhLCkL3vx2a+VLmAKijecj3pFaYRp0 gBu2FV12iD1ElcFCxzj8X5G5/AQGfGkEGdzpk6qZlF41W8fhCE18TdoNyldTdW5sHXqt lnrJS0y0VEl3yuh/+ZUpceHDLeSpDQRYTyaTdQy9NP5ypTdavLD4CG/u4g3SavP0fjfD wUSg== X-Forwarded-Encrypted: i=1; AJvYcCXDMN1u2+XSGyQlTU+uA5vKsyKkvd0R+oGBCaLZ+hJanQyeN2Y/kun7KOXIKlOudyL4HsBnzyKCrmPQvswm+NbuNUF/bZHhoM/gqSJs X-Gm-Message-State: AOJu0Yw+xYS7Oo9Bsmdu1RwEWqxQ/pQ9FfdrGOa9u7AeN8ACWCHM4zaM bXfKCvpsYQx+VqCuoUygAgTgT5wwt62XhOJ0c/LM2MhTCp/64Y8KF9/RERUAhuNPA5vKseQIroH /Ww== X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a63:5006:0:b0:5dc:af76:f57d with SMTP id e6-20020a635006000000b005dcaf76f57dmr30599pgb.7.1712596586030; Mon, 08 Apr 2024 10:16:26 -0700 (PDT) Date: Mon, 8 Apr 2024 10:16:24 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240328171949.743211-1-leobras@redhat.com> Message-ID: Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu From: Sean Christopherson To: "Paul E. McKenney" Cc: Marcelo Tosatti , Leonardo Bras , Paolo Bonzini , Frederic Weisbecker , Neeraj Upadhyay , Joel Fernandes , Josh Triplett , Boqun Feng , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Zqiang , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, rcu@vger.kernel.org Content-Type: text/plain; charset="us-ascii" On Fri, Apr 05, 2024, Paul E. McKenney wrote: > On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote: > > On Fri, Apr 05, 2024, Marcelo Tosatti wrote: > > > rcuc wakes up (which might exceed the allowed latency threshold > > > for certain realtime apps). > > > > Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter > > a guest) I was trying to ask about the case where RCU thinks a CPU is about to > > enter a guest, but the CPU never does (at least, not in the immediate future). > > > > Or am I just not understanding how RCU's kthreads work? > > It is quite possible that the current rcu_pending() code needs help, > given the possibility of vCPU preemption. I have heard of people doing > nested KVM virtualization -- or is that no longer a thing? Nested virtualization is still very much a thing, but I don't see how it is at all unique with respect to RCU grace periods and quiescent states. More below. > But the help might well involve RCU telling the hypervisor that a given > vCPU needs to run. Not sure how that would go over, though it has been > prototyped a couple times in the context of RCU priority boosting. > > > > > > 3 - It checks if the guest exit happened over than 1 second ago. This 1 > > > > > second value was copied from rcu_nohz_full_cpu() which checks if the > > > > > grace period started over than a second ago. If this value is bad, > > > > > I have no issue changing it. > > > > > > > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless > > > > of what magic time threshold is used. > > > > > > Why? It works for this particular purpose. > > > > Because maintaining magic numbers is no fun, AFAICT the heurisitic doesn't guard > > against edge cases, and I'm pretty sure we can do better with about the same amount > > of effort/churn. > > Beyond a certain point, we have no choice. How long should RCU let > a CPU run with preemption disabled before complaining? We choose 21 > seconds in mainline and some distros choose 60 seconds. Android chooses > 20 milliseconds for synchronize_rcu_expedited() grace periods. Issuing a warning based on an arbitrary time limit is wildly different than using an arbitrary time window to make functional decisions. My objection to the "assume the CPU will enter a quiescent state if it exited a KVM guest in the last second" is that there are plenty of scenarios where that assumption falls apart, i.e. where _that_ physical CPU will not re-enter the guest. Off the top of my head: - If the vCPU is migrated to a different physical CPU (pCPU), the *old* pCPU will get false positives, and the *new* pCPU will get false negatives (though the false negatives aren't all that problematic since the pCPU will enter a quiescent state on the next VM-Enter. - If the vCPU halts, in which case KVM will schedule out the vCPU/task, i.e. won't re-enter the guest. And so the pCPU will get false positives until the vCPU gets a wake event or the 1 second window expires. - If the VM terminates, the pCPU will get false positives until the 1 second window expires. The false positives are solvable problems, by hooking vcpu_put() to reset kvm_last_guest_exit. And to help with the false negatives when a vCPU task is scheduled in on a different pCPU, KVM would hook vcpu_load(). > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index d9642dd06c25..303ae9ae1c53 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3937,8 +3937,13 @@ static int rcu_pending(int user) > > if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE)) > > return 1; > > > > - /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */ > > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu()) > > + /* > > + * Is this a nohz_full CPU in userspace, idle, or likely to enter a > > + * guest in the near future? (Ignore RCU if so.) > > + */ > > + if ((user || rcu_is_cpu_rrupt_from_idle() || > > + __this_cpu_read(context_tracking.in_guest_run_loop)) && > > In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in > a quiescent just before the current scheduling-clock interrupt and will > again be in a quiescent state right after return from this interrupt. > This means that the grace-period kthread will be able to remotely sense > this quiescent state, so that the current CPU need do nothing. > > In constrast, it looks like context_tracking.in_guest_run_loop instead > means that when we return from this interrupt, this CPU will still be > in a non-quiescent state. > > Now, in the nested-virtualization case, your point might be that the > lower-level hypervisor could preempt the vCPU in the interrupt handler > just as easily as in the .in_guest_run_loop code. Which is a good point. > But I don't know of a way to handle this other than heuristics and maybe > hinting to the hypervisor (which has been prototyped for RCU priority > boosting). Regarding nested virtualization, what exactly is your concern? IIUC, you are worried about this code running at L1, i.e. as a nested hypervisor, and L0, i.e. the bare metal hypervisor, scheduling out the L1 CPU. And because the L1 CPU doesn't get run "soon", it won't enter a quiescent state as expected by RCU. But that's 100% the case with RCU in a VM in general. If an L1 CPU gets scheduled out by L0, that L1 CPU won't participate in any RCU stuff until it gets scheduled back in by L0. E.g. throw away all of the special case checks for rcu_nohz_full_cpu() in rcu_pending(), and the exact same problem exists. The L1 CPU could get scheduled out while trying to run the RCU core kthread just as easily as it could get scheduled out while trying to run the vCPU task. Or the L1 CPU could get scheduled out while it's still in the IRQ handler, before it even completes it rcu_pending(). And FWIW, it's not just L0 scheduling that is problematic. If something in L0 prevents an L1 CPU (vCPU from L0's perspective) from making forward progress, e.g. due to a bug in L0, or severe resource contention, from the L1 kernel's perspective, the L1 CPU will appear stuck and trigger various warnings, e.g. soft-lockup, need_resched, RCU stalls, etc. > Maybe the time for such hinting has come? That's a largely orthogonal discussion. As above, boosting the scheduling priority of a vCPU because that vCPU is in critical section of some form is not at all unique to nested virtualization (or RCU). For basic functional correctness, the L0 hypervisor already has the "hint" it needs. L0 knows that the L1 CPU wants to run by virtue of the L1 CPU being runnable, i.e. not halted, not in WFS, etc. > > + rcu_nohz_full_cpu()) > > And rcu_nohz_full_cpu() has a one-second timeout, and has for quite > some time. That's not a good reason to use a suboptimal heuristic for determining whether or not a CPU is likely to enter a KVM guest, it simply mitigates the worst case scenario of a false positive. > > return 0; > > > > /* Is the RCU core waiting for a quiescent state from this CPU? */ > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > > index bfb2b52a1416..5a7efc669a0f 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu) > > { > > int cpu = get_cpu(); > > > > + if (vcpu->wants_to_run) > > + context_tracking_guest_start_run_loop(); > > At this point, if this is a nohz_full CPU, it will no longer report > quiescent states until the grace period is at least one second old. I don't think I follow the "will no longer report quiescent states" issue. Are you saying that this would prevent guest_context_enter_irqoff() from reporting that the CPU is entering a quiescent state? If so, that's an issue that would need to be resolved regardless of what heuristic we use to determine whether or not a CPU is likely to enter a KVM guest. > > __this_cpu_write(kvm_running_vcpu, vcpu); > > preempt_notifier_register(&vcpu->preempt_notifier); > > kvm_arch_vcpu_load(vcpu, cpu); > > @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu) > > kvm_arch_vcpu_put(vcpu); > > preempt_notifier_unregister(&vcpu->preempt_notifier); > > __this_cpu_write(kvm_running_vcpu, NULL); > > + > > And also at this point, if this is a nohz_full CPU, it will no longer > report quiescent states until the grace period is at least one second old. > > > + if (vcpu->wants_to_run) > > + context_tracking_guest_stop_run_loop(); > > + > > preempt_enable(); > > } > > EXPORT_SYMBOL_GPL(vcpu_put); > > > > base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968 > > All of which might be OK. Just checking as to whether all of that was > in fact the intent. > > Thanx, Paul