Date: Tue, 22 Aug 2017 17:00:53 -0400 (EDT)
From: Paolo Bonzini
To: John Stultz
Cc: Denis Plotnikov, Radim Krcmar, kvm list, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin", lkml, x86@kernel.org, rkagan@virtuozzo.com, den@virtuozzo.com, Marcelo Tosatti
Message-ID: <894362115.582988.1503435653874.JavaMail.zimbra@redhat.com>
References: <1501684690-211093-1-git-send-email-dplotnikov@virtuozzo.com>
Subject: Re: [PATCH v4 00/10] make L2's kvm-clock stable, get rid of pvclock_gtod_copy in KVM

> I still don't feel my questions have been well answered. It's really
> not clear to me why, in order to allow the level-2 guest to use a vdso
> that the answer is to export more data through the entire stack rather
> than to make the kvmclock to be usable from the vsyscall.

Thanks, this helps.
A stable kvmclock is already usable from the vsyscall. However, it is not yet usable _in the hypervisor_ as a way to provide another stable kvmclock to the nested guest; right now the only clocksource that a hypervisor can use to provide a stable kvmclock is the TSC.

So, regarding the "why is it necessary" part. Even on a modern host with invariant TSC, kvmclock mediates between the TSC and the guest and provides, for example, support for live migration, where the TSC frequency may differ between source and destination. If the L1 hypervisor could use the TSC to provide a stable kvmclock, there would be no need for kvmclock in the first place. The paravirtualized clock may well disappear in a few years, since Skylake provides TSC scaling. However, I'm not that optimistic, because people are complaining that I removed support for 2007 processors and it seems that I'll have to put it back. So, as more people use nested virtualization (and we have nested virt migration in the works, too), nested kvmclock becomes more important too.

Regarding the "why is it best" part. Right now, the hypervisor makes a copy of the timekeeper information in order to prepare the stable kvmclock. This code is very much tied to the TSC. However, a snapshot of the timekeeper information is almost exactly what ktime_get_snapshot returns, so my suggestion to "untie" the hypervisor code from the TSC was to use ktime_get_snapshot instead. This way, the clocksource itself tells KVM whether it can be the base for a vsyscall-happy kvmclock (which means it must be the TSC or a linear transformation of it).

While I am very happy with how the KVM code comes out, it may well not be the best solution; I definitely need help from the clocksource maintainers here, not just approval! In particular, it doesn't help that a lot of the code surrounding ktime_get_snapshot is unused, so that may have sent me off track.
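To make "the TSC or a linear transformation of it" concrete: the time a guest reads from kvmclock is computed from a per-vCPU page that the host fills in, as system_time plus a fixed-point scaling of (TSC - tsc_timestamp). The sketch below is a simplified, standalone model of that computation, not the kernel's code: the names pv_time_info, scale_delta and pv_read_ns are hypothetical (loosely modeled on struct pvclock_vcpu_time_info), and it omits the version-field retry loop and the PVCLOCK_TSC_STABLE_BIT flag handling that a real reader needs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical standalone mirror of the time fields in a kvmclock
 * (pvclock) page; the version and flags fields are omitted.
 */
struct pv_time_info {
	uint64_t tsc_timestamp;     /* host TSC value when the page was written */
	uint64_t system_time;       /* guest nanoseconds at tsc_timestamp */
	uint32_t tsc_to_system_mul; /* ns per shifted tick, scaled by 2^32 */
	int8_t   tsc_shift;         /* power-of-two pre-scaling of the delta */
};

/* Convert a TSC delta to nanoseconds: shift, then multiply by mul / 2^32. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	/*
	 * Compute (delta * mul) >> 32 without 128-bit arithmetic: split
	 * delta into 32-bit halves; the high half contributes
	 * (delta >> 32) * mul directly, the low half needs the shift.
	 */
	uint64_t hi = (delta >> 32) * mul;
	uint64_t lo = ((delta & 0xffffffffu) * mul) >> 32;
	return hi + lo;
}

/* Guest clock read: a linear function of the TSC value passed in. */
static uint64_t pv_read_ns(const struct pv_time_info *ti, uint64_t tsc)
{
	/* This sketch assumes the TSC read is not older than the snapshot. */
	assert(tsc >= ti->tsc_timestamp);
	return ti->system_time +
	       scale_delta(tsc - ti->tsc_timestamp,
			   ti->tsc_to_system_mul, ti->tsc_shift);
}
```

This is why kvmclock can hide a TSC frequency change across live migration: the destination host only has to rewrite tsc_timestamp, system_time and the multiplier/shift for its own TSC. For example (figures chosen so the arithmetic is exact), on a 2 GHz source host with multiplier 2^31 and shift 0, a read 2×10^9 ticks after tsc_timestamp yields one second; after migrating to a 4 GHz host whose page carries multiplier 2^30 and system_time = 10^9 ns, a read 4×10^9 ticks later yields two seconds.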
Specifically, the return value of the new callback could be defined as "is it the TSC or a linear transformation of it". But that's only about as good a definition as "is it good for KVM" (i.e., not very good) without some documentation on the meaning of "cycles" in the struct returned by ktime_get_snapshot. Once I understand that, I hope I can provide a better explanation for the return value of the callback.

Paolo

> So far for a problem statement, all I've got is:
> "However, when using nested virtualization you have
>
>    L0: bare-metal hypervisor (uses TSC)
>    L1: nested hypervisor (uses kvmclock, can use vsyscall)
>    L2: nested guest
>
> and L2 cannot use vsyscall because it is not using the TSC."
>
> Which is a start but doesn't really make it clear why the proposed
> solution is best/necessary.
>
> thanks
> -john