Received: by 2002:a05:6500:1b45:b0:1f5:f2ab:c469 with SMTP id cz5csp741595lqb; Wed, 17 Apr 2024 09:23:00 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCX152PYOlAN3g7+0dOwchLDGyan+6QQu5t/X9TGaHeHi9gj04jakuBsnlo1XnPJ7I0pZusDlZs5u1ZItSnV2NWoIjDs0wHaOJu6I7Nzqg== X-Google-Smtp-Source: AGHT+IE95X20BZSHfhflRFyFn7fYUIUoSauQ5O3T7VzO7RxV5UghsOijZsFWVrFO9+rEjI7lASfe X-Received: by 2002:a50:9e49:0:b0:570:1026:c985 with SMTP id z67-20020a509e49000000b005701026c985mr137478ede.0.1713370980032; Wed, 17 Apr 2024 09:23:00 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1713370980; cv=pass; d=google.com; s=arc-20160816; b=iT+EW8sK8ccM6vIy8pi+tMnMSb+9A84xGl98pPkmEaO0b9TV1L4z9bA4coEL80a2+f 0xKCJX2TMCwHCxxj4Xl3aMvNLh3LcSg0H4AZpKcQp0fY4ahew5qBGM+80etwve7WNQLW DBMmR5jlJkF7X9t/knaYd83lF4cuFLG3t5H0WYxIFs8tAltjxDxFja7AJ6XDy974J7t4 j+ogfsJY05yV/5/swru8CFgqU2Z99VfgZXD1PahNiwWs2CFoQgSX1s1SxR9P9yMqtiI4 GDXubSNXilVzTvxNDIahcEKBRwyt5HVj94VZFRPYjlaNgAEUrZprxIokVH7T/ihK/fJ7 EO5w== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:message-id:subject:cc :to:from:date:dkim-signature; bh=MVwYiuDV1ATQ7UBW8Sq0lK/csErv1chKrLOaAmc5mOg=; fh=GDdg9PWL2nCiQO3Ze161jd6IuubKeGfjLOdpADw+o5k=; b=QSVnKSRvT8cgR5BTtOmB0OAVJqS3o8Kz5GW7g8yh6/A9bxvjzWT5SEzWIE7JOeiYBO gjU1353q0JJWhG2vqkb4j8F/uqnLZgNG8jEt9FRGk6e5qFAS9r38JTznenV4CTUcc/9T xcWZ4Aazyfsi2ioo7RE+lhhJtNZ/aCYGgEHSGTUw687ZbxnHs+90qas7EpTL8PUoIDXw ZM88q4/5EtbIb3LemLImGYaeY5ns9UXDmOjb48WcO+nNNj3b8F/vM0MrnBJdRLXMGAi5 iTROwgJi0rq+o3G/deslONEAX6JW2aucgPaRAYZlifWTuVJZAXKq56ranulS3r9x3czq 1Cgw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=EQVLk93Q; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-148890-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-148890-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id w1-20020a05640234c100b005701ec0c6e9si4061466edc.463.2024.04.17.09.22.59 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Apr 2024 09:23:00 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-148890-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=EQVLk93Q; arc=pass (i=1 spf=pass spfdomain=redhat.com dkim=pass dkdomain=redhat.com dmarc=pass fromdomain=redhat.com); spf=pass (google.com: domain of linux-kernel+bounces-148890-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-148890-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 429861F264E5 for ; Wed, 17 Apr 2024 16:14:55 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id A9A7D148839; Wed, 17 Apr 2024 16:14:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="EQVLk93Q" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 045E4146D59 for ; Wed, 17 Apr 2024 16:14:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1713370482; cv=none; b=jsAGaWZzIuG7mK9y2ebvFoeAZg3gBAg1UdU2kToTCsdm5VjzM2xrDy8GJYCUkmbbfJzCprjeADXWFuOUkF5fOP522dLYA8wYhrOTdFp/YUa8w5JhjyJO01sv0NW97zsaarJaYEDSgExJw+/zCgb+mumMU7D8BjwtCf2hwKiBGF4= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1713370482; c=relaxed/simple; bh=S1yhgaRLFPOXidC/cYYslItIjG46TDGeuQmOH40nfME=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=V2zi4ruF1N9qaLjFuVLX/FSlqMDTdacPTUaTF/QC68FXD5IdzOHbawJj44BDmxdBUJxKooGuXGa/9XGHmHvXIuNORl1c9GbqBwfP/4FU0n98KSXycNmzDlHGF+K0VqeDxt5UvToOawOXj/cdUsdVtjSgYywepfYvPEp7yGxVdjo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=EQVLk93Q; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1713370479; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=MVwYiuDV1ATQ7UBW8Sq0lK/csErv1chKrLOaAmc5mOg=; b=EQVLk93QTlbHuyPqPxx1IzBrFY3j/UkIemGhTs+zNJIzYqPIu1ylpYQssywE/i8OJbp7zQ E8Dtx3pgbPBMkTuNSxn/tpXAv+4PdDE9rCTbWjIsJfg76mYq/+7iLMJX+c9aCb1NUPl2pu TLMI1Ek+k+JYwMbLV9BlUZKLbknGBMs= Received: from mimecast-mx02.redhat.com (mx-ext.redhat.com [66.187.233.73]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-346-6kAP76NnOCmOGM6mQn6s8w-1; Wed, 17 Apr 2024 12:14:35 -0400 X-MC-Unique: 6kAP76NnOCmOGM6mQn6s8w-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.rdu2.redhat.com [10.11.54.2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 2CAB63C14940; Wed, 17 Apr 2024 16:14:35 +0000 (UTC) Received: from tpad.localdomain (unknown [10.96.133.8]) by smtp.corp.redhat.com (Postfix) with ESMTPS id BD27340C6CB2; Wed, 17 Apr 2024 16:14:33 +0000 (UTC) Received: by tpad.localdomain (Postfix, from userid 1000) id 50881400DC647; Wed, 17 Apr 2024 13:14:11 -0300 (-03) Date: Wed, 17 Apr 2024 13:14:11 -0300 From: Marcelo Tosatti To: Sean Christopherson Cc: "Paul E. McKenney" , Leonardo Bras , Paolo Bonzini , Frederic Weisbecker , Neeraj Upadhyay , Joel Fernandes , Josh Triplett , Boqun Feng , Steven Rostedt , Mathieu Desnoyers , Lai Jiangshan , Zqiang , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, rcu@vger.kernel.org Subject: Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu Message-ID: References: <20240328171949.743211-1-leobras@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.2 On Tue, Apr 16, 2024 at 07:07:32AM -0700, Sean Christopherson wrote: > On Tue, Apr 16, 2024, Marcelo Tosatti wrote: > > On Mon, Apr 15, 2024 at 02:29:32PM -0700, Sean Christopherson wrote: > > > And snapshotting the VM-Exit time will get false negatives when the vCPU is about > > > to run, but for whatever reason has kvm_last_guest_exit=0, e.g. if a vCPU was > > > preempted and/or migrated to a different pCPU. > > > > Right, for the use-case where waking up rcuc is a problem, the pCPU is > > isolated (there are no userspace processes and hopefully no kernel threads > > executing there), vCPU pinned to that pCPU. > > > > So there should be no preemptions or migrations. > > I understand that preemption/migration will not be problematic if the system is > configured "correctly", but we still need to play nice with other scenarios and/or > suboptimal setups. While false positives aren't fatal, KVM still should do its > best to avoid them, especially when it's relatively easy to do so. Sure. > > > My understanding is that RCU already has a timeout to avoid stalling RCU. I don't > > > see what is gained by effectively duplicating that timeout for KVM. > > > > The point is not to avoid stalling RCU. The point is to not perform RCU > > core processing through rcuc thread (because that interrupts execution > > of the vCPU thread), if it is known that an extended quiescent state > > will occur "soon" anyway (via VM-entry). > > I know. My point is that, as you note below, RCU will wake-up rcuc after 1 second > even if KVM is still reporting a VM-Enter is imminent, i.e. there's a 1 second > timeout to avoid an RCU stall to due to KVM never completing entry to the guest. Right. So a reply to the sentence: "My understanding is that RCU already has a timeout to avoid stalling RCU. I don't see what is gained by effectively duplicating that timeout for KVM." Is that the current RCU timeout is not functional for KVM VM entries, therefore it needs modification. > > If the extended quiescent state does not occur in 1 second, then rcuc > > will be woken up (the time_before call in rcu_nohz_full_cpu function > > above). > > > > > Why not have > > > KVM provide a "this task is in KVM_RUN" flag, and then let the existing timeout > > > handle the (hopefully rare) case where KVM doesn't "immediately" re-enter the guest? > > > > Do you mean something like: > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c > > index d9642dd06c25..0ca5a6a45025 100644 > > --- a/kernel/rcu/tree.c > > +++ b/kernel/rcu/tree.c > > @@ -3938,7 +3938,7 @@ static int rcu_pending(int user) > > return 1; > > > > /* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */ > > - if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu()) > > + if ((user || rcu_is_cpu_rrupt_from_idle() || this_cpu->in_kvm_run) && rcu_nohz_full_cpu()) > > return 0; > > Yes. This, https://lore.kernel.org/all/ZhAN28BcMsfl4gm-@google.com, plus logic > in kvm_sched_{in,out}(). Question: where is vcpu->wants_to_run set? (or, where is the full series again?). diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bfb2b52a1416..5a7efc669a0f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu) { int cpu = get_cpu(); + if (vcpu->wants_to_run) + context_tracking_guest_start_run_loop(); + __this_cpu_write(kvm_running_vcpu, vcpu); preempt_notifier_register(&vcpu->preempt_notifier); kvm_arch_vcpu_load(vcpu, cpu); @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu) kvm_arch_vcpu_put(vcpu); preempt_notifier_unregister(&vcpu->preempt_notifier); __this_cpu_write(kvm_running_vcpu, NULL); + + if (vcpu->wants_to_run) + context_tracking_guest_stop_run_loop(); + preempt_enable(); } EXPORT_SYMBOL_GPL(vcpu_put); A little worried about guest HLT: /** * rcu_is_cpu_rrupt_from_idle - see if 'interrupted' from idle * * If the current CPU is idle and running at a first-level (not nested) * interrupt, or directly, from idle, return true. * * The caller must have at least disabled IRQs. */ static int rcu_is_cpu_rrupt_from_idle(void) { long nesting; /* * Usually called from the tick; but also used from smp_function_call() * for expedited grace periods. This latter can result in running from * the idle task, instead of an actual IPI. */ ... /* Does CPU appear to be idle from an RCU standpoint? */ return ct_dynticks_nesting() == 0; } static __always_inline void ct_cpuidle_enter(void) { lockdep_assert_irqs_disabled(); /* * Idle is allowed to (temporary) enable IRQs. It * will return with IRQs disabled. * * Trace IRQs enable here, then switch off RCU, and have * arch_cpu_idle() use raw_local_irq_enable(). Note that * ct_idle_enter() relies on lockdep IRQ state, so switch that * last -- this is very similar to the entry code. */ trace_hardirqs_on_prepare(); lockdep_hardirqs_on_prepare(); instrumentation_end(); ct_idle_enter(); lockdep_hardirqs_on(_RET_IP_); } So for guest HLT emulation, there is a window between kvm_vcpu_block -> fire_sched_out_preempt_notifiers -> vcpu_put and the idle's task call to ct_cpuidle_enter, where ct_dynticks_nesting() != 0 and vcpu_put has already executed. Even for idle=poll, the race exists. > > /* Is the RCU core waiting for a quiescent state from this CPU? */ > > > > The problem is: > > > > 1) You should only set that flag, in the VM-entry path, after the point > > where no use of RCU is made: close to guest_state_enter_irqoff call. > > Why? As established above, KVM essentially has 1 second to enter the guest after > setting in_guest_run_loop (or whatever we call it). In the vast majority of cases, > the time before KVM enters the guest can probably be measured in microseconds. OK. > Snapshotting the exit time has the exact same problem of depending on KVM to > re-enter the guest soon-ish, so I don't understand why this would be considered > a problem with a flag to note the CPU is in KVM's run loop, but not with a > snapshot to say the CPU recently exited a KVM guest. See the race above. > > 2) While handling a VM-exit, a host timer interrupt can occur before that, > > or after the point where "this_cpu->in_kvm_run" is set to false. > > > > And a host timer interrupt calls rcu_sched_clock_irq which is going to > > wake up rcuc. > > If in_kvm_run is false when the IRQ is handled, then either KVM exited to userspace > or the vCPU was scheduled out. In the former case, rcuc won't be woken up if the > CPU is in userspace. And in the latter case, waking up rcuc is absolutely the > correct thing to do as VM-Enter is not imminent. > > For exits to userspace, there would be a small window where an IRQ could arrive > between KVM putting the vCPU and the CPU actually returning to userspace, but > unless that's problematic in practice, I think it's a reasonable tradeoff. OK, your proposal looks alright except these races. We don't want those races to occur in production (and they likely will). Is there any way to fix the races? Perhaps cmpxchg?