In-Reply-To: <481be19e33915804c855a55181c310dd8071b546.camel@redhat.com>
From: Jim Mattson
Date: Tue, 2 Jan 2024 15:49:32 -0800
Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers
To: Maxim Levitsky
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Paolo Bonzini,
    Sean Christopherson, Marc Zyngier, Thomas Gleixner, Vitaly Kuznetsov

On Tue, Jan 2, 2024 at 2:21 PM Maxim Levitsky wrote:
>
> On Thu, 2023-12-21 at 11:09 -0800, Jim Mattson wrote:
> > On Thu, Dec 21, 2023 at 8:52 AM Maxim Levitsky wrote:
> > >
> > > Hi!
> > >
> > > Recently I was tasked with triage of the failures of 'vmx_preemption_timer'
> > > that happen in our kernel CI pipeline.
> > >
> > >
> > > The test usually fails because L2 observes TSC after the
> > > preemption timer deadline, before the VM exit happens.
> > >
> > > This happens because KVM emulates nested preemption timer with HR timers,
> > > so it converts the preemption timer value to nanoseconds, taking in account
> > > tsc scaling and host tsc frequency, and sets HR timer.
> > >
> > > HR timer however as I found out the hard way is bound to CLOCK_MONOTONIC,
> > > and thus its rate can be adjusted by NTP, which means that it can run slower or
> > > faster than KVM expects, which can result in the interrupt arriving earlier,
> > > or late, which is what is happening.
> > >
> > > This is how you can reproduce it on an Intel machine:
> > >
> > >
> > > 1. stop the NTP daemon:
> > >       sudo systemctl stop chronyd.service
> > > 2. introduce a small error in the system time:
> > >       sudo date -s "$(date)"
> > >
> > > 3. start NTP daemon:
> > >       sudo chronyd -d -n (for debug) or start the systemd service again
> > >
> > > 4. run the vmx_preemption_timer test a few times until it fails:
> > >
> > >
> > > I did some research and it looks like I am not the first to encounter this:
> > >
> > > From the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW with
> > > timer subsystem which was even merged but then reverted due to issues:
> > >
> > > https://lore.kernel.org/all/1452879670-16133-3-git-send-email-marc.zyngier@arm.com/T/#u
> > >
> > > It looks like this issue was later worked around in the ARM code:
> > >
> > >
> > > commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
> > > Author: Marc Zyngier
> > > Date:   Wed Apr 6 09:37:22 2016 +0100
> > >
> > >     KVM: arm/arm64: Handle forward time correction gracefully
> > >
> > >     On a host that runs NTP, corrections can have a direct impact on
> > >     the background timer that we program on the behalf of a vcpu.
> > >
> > >     In particular, NTP performing a forward correction will result in
> > >     a timer expiring sooner than expected from a guest point of view.
> > >     Not a big deal, we kick the vcpu anyway.
> > >
> > >     But on wake-up, the vcpu thread is going to perform a check to
> > >     find out whether or not it should block. And at that point, the
> > >     timer check is going to say "timer has not expired yet, go back
> > >     to sleep". This results in the timer event being lost forever.
> > >
> > >     There are multiple ways to handle this. One would be record that
> > >     the timer has expired and let kvm_cpu_has_pending_timer return
> > >     true in that case, but that would be fairly invasive. Another is
> > >     to check for the "short sleep" condition in the hrtimer callback,
> > >     and restart the timer for the remaining time when the condition
> > >     is detected.
> > >
> > >     This patch implements the latter, with a bit of refactoring in
> > >     order to avoid too much code duplication.
> > >
> > >     Cc:
> > >     Reported-by: Alexander Graf
> > >     Reviewed-by: Alexander Graf
> > >     Signed-off-by: Marc Zyngier
> > >     Signed-off-by: Christoffer Dall
> > >
> > >
> > > So to solve this issue there are two options:
> > >
> > >
> > > 1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
> > >    I don't know if that is feasible and I would be very happy to hear a feedback from you.
> > >
> > > 2. Also work this around in KVM. KVM does listen to changes in the timekeeping system
> > >    (kernel calls its update_pvclock_gtod), and it even notes rates of both regular and raw clocks.
> > >
> > >    When starting a HR timer I can adjust its period for the difference in rates, which will in most
> > >    cases produce more correct result that what we have now, but will still fail if the rate
> > >    is changed at the same time the timer is started or before it expires.
> > >
> > >    Or I can also restart the timer, although that might cause more harm than
> > >    good to the accuracy.
> > >
> > >
> > > What do you think?
> >
> > Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE
> > timer is all about (lapic_timer_advance_ns = -1)?
>
> Hi,
>
> I don't think that 'lapic_timer_advance' is designed for that but it does
> mask this problem somewhat.
>
> The goal of 'lapic_timer_advance' is to decrease time between deadline passing and start
> of guest timer irq routine by making the deadline happen a bit earlier (by timer_advance_ns), and then busy-waiting
> (hopefully only a bit) until the deadline passes, and then immediately do the VM entry.
>
> This way instead of overhead of VM exit and VM entry that both happen after the deadline,
> only the VM entry happens after the deadline.
>
>
> In relation to NTP interference: If the deadline happens earlier than expected, then
> KVM will busy wait and decrease the 'timer_advance_ns', and next time the deadline
> will happen a bit later thus adopting for the NTP adjustment somewhat.
>
> Note though that 'timer_advance_ns' variable is unsigned and adjust_lapic_timer_advance can underflow
> it, which can be fixed.
>
> Now if the deadline happens later than expected, then the guest will see this happen,
> but at least adjust_lapic_timer_advance should increase the 'timer_advance_ns' so next
> time the deadline will happen earlier which will also eventually hide the problem.
>
> So overall I do think that implementing the 'lapic_timer_advance' for nested VMX preemption timer
> is a good idea, especially since this feature is not really nested in some sense - the timer is
> just delivered as a VM exit but it is always delivered to L1, so VMX preemption timer can
> be seen as just an extra L1's deadline timer.
>
> I do think that nested VMX preemption timer should use its own value of timer_advance_ns, thus
> we need to extract the common code and make both timers use it. Does this make sense?

Alternatively, why not just use the hardware VMX-preemption timer to
deliver the virtual VMX-preemption timer?

Today, I believe that we only use the hardware VMX-preemption timer to
deliver the virtual local APIC timer. However, it shouldn't be that
hard to pick the first deadline of {VMX-preemption timer, local APIC
timer} at each emulated VM-entry to L2.

> Best regards,
>         Maxim Levitsky
>
>
> > If so, can we
> > leverage that for the VMX-preemption timer as well?
> > > Best regards,
> > >         Maxim Levitsky
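
For illustration only, here is a rough sketch of the "pick the first deadline" idea above, written in Python rather than kernel C. Everything in it is an assumption made for the sketch and not KVM code: the function names are hypothetical, the rate shift of 5 is just an example value (hardware reports its shift in IA32_VMX_MISC[4:0]), and all values are treated as being in a single TSC domain, so TSC offsetting and scaling between L1 and L2 are ignored. The expiry computation is also approximate, since the real timer decrements when a particular TSC bit toggles.

```python
from typing import Optional

# Example value only: the VMX-preemption timer counts down once every
# 2**RATE_SHIFT TSC cycles. Real hardware reports its shift in
# IA32_VMX_MISC[4:0]; 5 is an assumption for this sketch.
RATE_SHIFT = 5


def preemption_timer_deadline_tsc(tsc_now: int, timer_value: int,
                                  rate_shift: int = RATE_SHIFT) -> int:
    """Approximate absolute TSC at which a timer loaded now would expire."""
    return tsc_now + (timer_value << rate_shift)


def first_deadline_tsc(tsc_now: int,
                       vmx_timer_value: Optional[int],
                       apic_deadline_tsc: Optional[int],
                       rate_shift: int = RATE_SHIFT) -> Optional[int]:
    """Earliest of the two virtual deadlines, or None if neither is armed.

    vmx_timer_value is the virtual VMX-preemption timer value (a relative
    count), apic_deadline_tsc is the virtual TSC-deadline (absolute).
    None means "not armed" in this sketch.
    """
    candidates = []
    if vmx_timer_value is not None:
        candidates.append(
            preemption_timer_deadline_tsc(tsc_now, vmx_timer_value, rate_shift))
    if apic_deadline_tsc is not None:
        candidates.append(apic_deadline_tsc)
    return min(candidates) if candidates else None


def hardware_timer_value(tsc_now: int, deadline_tsc: int,
                         rate_shift: int = RATE_SHIFT) -> int:
    """Count to load into the hardware timer for an absolute TSC deadline.

    Rounds down and returns 0 when the deadline has already passed, so the
    exit would happen immediately. The rounding direction is a choice made
    for this sketch.
    """
    if deadline_tsc <= tsc_now:
        return 0
    return (deadline_tsc - tsc_now) >> rate_shift
```

With these definitions, first_deadline_tsc(1000, 10, 1200) selects the local APIC deadline (1200) because the preemption timer would only expire at 1000 + (10 << 5) = 1320. Whichever source wins, the handler for the resulting exit would still need to work out which virtual timer actually expired.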