From: Oliver Upton
Date: Thu, 10 Dec 2020 11:59:44 -0600
Subject: Re: [PATCH v2 1/3] KVM: x86: implement KVM_{GET|SET}_TSC_STATE
To: Andy Lutomirski
Cc: Maxim Levitsky, Paolo Bonzini, Thomas Gleixner, Marcelo Tosatti,
    kvm list, "H. Peter Anvin", Jonathan Corbet, Jim Mattson, Wanpeng Li,
    "open list:KERNEL SELFTEST FRAMEWORK", Vitaly Kuznetsov,
    Sean Christopherson, open list, Ingo Molnar,
    "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Joerg Roedel,
    Borislav Petkov, Shuah Khan, Andrew Jones, "open list:DOCUMENTATION"
References: <9389c1198da174bcc9483d6ebf535405aa8bdb45.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Dec 10, 2020 at 9:16 AM Andy Lutomirski wrote:
>
> > On Dec 10, 2020, at 6:52 AM, Maxim Levitsky wrote:
> >
> > On Thu, 2020-12-10 at 12:48 +0100, Paolo Bonzini wrote:
> >>> On 08/12/20 22:20, Thomas Gleixner wrote:
> >>> So now life migration comes a long time after timekeeping had set the
> >>> limits and just because it's virt it expects that everything works and it
> >>> just can ignore these limits.
> >>>
> >>> TBH. That's not any different than SMM or hard/firmware taking the
> >>> machine out for lunch. It's exactly the same: It's broken.
> >>
> >> I agree. If *live* migration stops the VM for 200 seconds, it's broken.
> >>
> >> Sure, there's the case of snapshotting the VM over the weekend. My
> >> favorite solution would be to just put it in S3 before doing that. *Do
> >> what bare metal does* and you can't go that wrong.
> >
> > Note though that qemu has a couple of issues with s3, and it is disabled
> > by default in libvirt.
> > I would be very happy to work on improving this if there is a need for that.
>
> There’s also the case where someone has a VM running on a laptop and someone closes the lid. The host QEMU might not have a chance to convince the guest to enter S3.
>
> >
> >
> >>
> >> In general it's userspace policy whether to keep the TSC value the same
> >> across live migration.
> >> There's pros and cons to both approaches, so KVM
> >> should provide the functionality to keep the TSC running (which the
> >> guest will see as a very long, but not extreme SMI), and this is what
> >> this series does. Maxim will change it to operate per-VM. Thanks
> >> Thomas, Oliver and everyone else for the input.

So, to be clear, this per-VM ioctl will work something like the following:

static u64 kvm_read_tsc_base(struct kvm *kvm, u64 host_tsc)
{
        /* Guest-visible TSC base: scaled host TSC plus the per-VM offset. */
        return kvm_scale_tsc(kvm, host_tsc) + kvm->host_tsc_offset;
}

case KVM_GET_TSC_BASE: {
        struct kvm_tsc_base base = {
                .flags = KVM_TSC_BASE_TIMESTAMP_VALID,
        };
        u64 host_tsc;

        /* Sample wall time and the host TSC together. */
        kvm_get_walltime(&base.nsec, &host_tsc);
        base.tsc = kvm_read_tsc_base(kvm, host_tsc);

        copy_to_user(...);
        [...]
}

case KVM_SET_TSC_BASE: {
        struct kvm_tsc_base base;
        u64 host_tsc, nsec;
        s64 delta = 0;

        copy_from_user(...);

        kvm_get_walltime(&nsec, &host_tsc);
        /* Cycles needed to make the current base match the saved one. */
        delta += base.tsc - kvm_read_tsc_base(kvm, host_tsc);

        if (base.flags & KVM_TSC_BASE_TIMESTAMP_VALID) {
                /* Signed: the saved timestamp may be ahead of our clock. */
                s64 delta_nsec = nsec - base.nsec;

                if (delta_nsec > 0)
                        delta += nsec_to_cycles(kvm, delta_nsec);
                else
                        delta -= nsec_to_cycles(kvm, -delta_nsec);
        }

        kvm->host_tsc_offset += delta;
        /* plumb host_tsc_offset through to each vcpu */
}

However, I don't believe we can assume the guest's TSCs to be synchronized,
even if sane guests will never touch them. In this case, I think a per-vCPU
ioctl is still warranted, allowing userspace to get at the guest CPU adjust
component of Thomas' equation below (paraphrased):

        TSC guest CPU = host tsc base + guest base offset + guest CPU adjust

Alternatively, a write from userspace to the guest's IA32_TSC_ADJUST with
KVM_X86_QUIRK_TSC_HOST_ACCESS could have the same effect, but that seems to
be problematic for a couple of reasons. First, depending on the guest's
CPUID, the TSC_ADJUST MSR may not even be available, meaning that the guest
could've used IA32_TSC to adjust the TSC (eww).
Second, userspace replaying writes to IA32_TSC (in the case IA32_TSC_ADJUST
doesn't exist for the guest) seems _very_ unlikely to work given all the
magic handling that KVM does for writes to it.

Is this roughly where we are or have I entirely missed the mark? :-)

--
Thanks,
Oliver

> >
> > I agree with that.
> >
> > I still think though that we should have a discussion on feasibility
> > of making the kernel time code deal with large *forward* tsc jumps
> > without crashing.
> >
> > If that is indeed hard to do, or will cause performance issues,
> > then I agree that we might indeed inform the guest of time jumps instead.
> >
>
> Tglx, even without fancy shared host/guest timekeeping, count the guest kernel manage to update its timekeeping if the host sent the guest an interrupt or NMI on all CPUs synchronously on resume?
>
> Alternatively, if we had the explicit “max TSC value that makes sense right now” in the timekeeping data, the guest would reliably notice the large jump and could at least do something intelligent about it instead of overflowing its internal calculation.
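
The last quoted point, an explicit upper bound in the timekeeping data so a large forward jump is noticed instead of overflowing, can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct tk_model, tk_max_delta() and tk_cycles_to_ns() are hypothetical names invented for illustration, not kernel interfaces. The conversion assumes the usual multiply-then-shift scaling of cycles to nanoseconds, and the bound is taken to be the largest delta whose product with mult still fits in 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the timekeeping data; not a kernel structure. */
struct tk_model {
        uint64_t cycle_last;  /* counter value at the last update */
        uint32_t mult;        /* cycles -> ns multiplier */
        uint32_t shift;       /* cycles -> ns right shift (assumed < 64) */
        uint64_t max_delta;   /* largest delta that converts without overflow */
};

/* Largest cycle delta for which delta * mult still fits in 64 bits. */
static uint64_t tk_max_delta(uint32_t mult)
{
        assert(mult != 0);
        return UINT64_MAX / mult;
}

/*
 * Convert the cycles elapsed since cycle_last to nanoseconds.
 * Returns false and leaves *ns untouched when the jump exceeds max_delta,
 * so the caller can react instead of consuming a wrapped result.
 */
static bool tk_cycles_to_ns(const struct tk_model *tk, uint64_t now,
                            uint64_t *ns)
{
        uint64_t delta = now - tk->cycle_last;

        if (delta > tk->max_delta)
                return false;

        *ns = (delta * tk->mult) >> tk->shift;
        return true;
}
```

A caller would fill in max_delta once from tk_max_delta(mult) and treat a false return as "the counter jumped further than the conversion can represent", which is the point at which a guest could resynchronize rather than compute a bogus time.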