DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com B858A81E01
Date: Tue, 22 Aug 2017 17:00:53 -0400 (EDT)
From: Paolo Bonzini <pbonzini@redhat.com>
To: John Stultz <john.stultz@linaro.org>
Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>,
        Radim Krcmar <rkrcmar@redhat.com>, kvm list <kvm@vger.kernel.org>,
        Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>,
        "H. Peter Anvin" <hpa@zytor.com>, lkml <linux-kernel@vger.kernel.org>,
        x86@kernel.org, rkagan@virtuozzo.com, den@virtuozzo.com,
        Marcelo Tosatti <mtosatti@redhat.com>
Message-ID: <894362115.582988.1503435653874.JavaMail.zimbra@redhat.com>
In-Reply-To: <CALAqxLWGf6BAdji=rqKPnyvFzzUQscA5rXU8SY4RcWXymyVL1Q@mail.gmail.com>
References: <1501684690-211093-1-git-send-email-dplotnikov@virtuozzo.com> <CALAqxLVngARqG4g8URMQGQtZec_p5NVP7hFXQGS6+JWp74FVMA@mail.gmail.com> <b913b3b2-b3bb-348a-4069-c251a22233cd@redhat.com> <f93105c1-0d6f-f9d4-fc43-31efae1c9aeb@virtuozzo.com> <CALAqxLWGf6BAdji=rqKPnyvFzzUQscA5rXU8SY4RcWXymyVL1Q@mail.gmail.com>
Subject: Re: [PATCH v4 00/10] make L2's kvm-clock stable, get rid of
 pvclock_gtod_copy in KVM
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Thread-Topic: make L2's kvm-clock stable, get rid of pvclock_gtod_copy in KVM
Thread-Index: 7T3VojsREar+FlmFKPHkymIzDcNCkA==
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3087
Lines: 64


> I still don't feel my questions have been well answered. It's really
> not clear to me why, in order to allow the level-2 guest to use a vdso,
> the answer is to export more data through the entire stack rather
> than to make the kvmclock usable from the vsyscall.

Thanks, this helps.

A stable kvmclock is already usable from the vsyscall.  It is however not
yet usable _in the hypervisor_ as a way to provide another stable kvmclock
to the nested guest; right now the only clocksource that a hypervisor can
use to provide a stable kvmclock is the TSC.

So, regarding the "why is it necessary" part: even on a modern host with
invariant TSC, kvmclock mediates between the TSC and the guest and provides,
for example, support for live migration, where the TSC frequency may differ
between source and destination.  If the L1 hypervisor could use the TSC to
provide a stable kvmclock, there would be no need for kvmclock in the first
place.  The paravirtualized clock may well disappear in a few years, since
Skylake provides TSC scaling.  However, I'm not that optimistic, because
people are complaining that I removed support for 2007 processors and it
seems that I'll have to put it back.  So, as more people use nested
virtualization (and we have nested virt migration in the works, too), nested
kvmclock becomes more important as well.
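
To make the "mediates between the TSC and the guest" part concrete, here is a
rough Python model of how a guest turns a raw TSC reading into nanoseconds
from the parameters the hypervisor publishes.  The field names follow
struct pvclock_vcpu_time_info, but the version and flags fields are left out,
and simple_mul_for_hz() is a made-up helper that only covers the shift-0 case.
Treat it as a sketch under those assumptions, not as the kernel's code.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # model u64 wraparound


@dataclass(frozen=True)
class PvclockInfo:
    tsc_timestamp: int      # TSC value when the hypervisor filled in the record
    system_time: int        # guest time in ns at tsc_timestamp
    tsc_to_system_mul: int  # 32.32 fixed-point cycles-to-ns multiplier
    tsc_shift: int          # signed shift applied to the delta before scaling


def pvclock_read_ns(info: PvclockInfo, tsc: int) -> int:
    """Convert a TSC reading to nanoseconds using the published parameters."""
    delta = (tsc - info.tsc_timestamp) & MASK64
    if info.tsc_shift < 0:
        delta >>= -info.tsc_shift
    else:
        delta = (delta << info.tsc_shift) & MASK64
    scaled = (delta * info.tsc_to_system_mul) >> 32
    return (info.system_time + scaled) & MASK64


def simple_mul_for_hz(tsc_hz: int) -> int:
    """Hypothetical helper: multiplier for tsc_shift == 0.

    Only valid when the result fits in 32 bits, i.e. for TSC frequencies
    above 1 GHz; a real implementation also picks a suitable shift.
    """
    mul = (1_000_000_000 << 32) // tsc_hz
    if not 0 < mul < (1 << 32):
        raise ValueError("frequency needs a non-zero tsc_shift")
    return mul
```

In this model, migrating from a 2 GHz host to a 3 GHz host only means that the
hypervisor republishes the record with a new tsc_timestamp, system_time and
multiplier; the guest keeps running the same conversion and its nanosecond
clock carries on from where it was.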

Regarding the "why is it best" part.  Right now, the hypervisor makes a
copy of the timekeeper information in order to prepare the stable kvmclock.
This code is very much tied to the TSC.  However, a snapshot of the timekeeper
information is almost entirely the same thing that ktime_get_snapshot returns,
so my suggestion to "untie" the hypervisor code from the TSC was to use
ktime_get_snapshot instead.  This way, the clocksource itself tells KVM
whether it can be the base for a vsyscall-happy kvmclock (which means, it
must be the TSC or a linear transformation of it).
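
The "linear transformation" criterion can be illustrated with a toy model.
If L1's clock is a linear function of the host TSC, and the clock handed to
L2 is a linear function of L1's clock, then the composition is again a single
scale and offset applied to the TSC.  The Linear class and the numbers below
are purely illustrative assumptions and do not correspond to any KVM or
timekeeping structure.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Linear:
    """A clock modeled as y = scale * x + offset (exact rational arithmetic)."""
    scale: Fraction
    offset: Fraction

    def __call__(self, x):
        return self.scale * x + self.offset

    def after(self, inner: "Linear") -> "Linear":
        """Return the single linear map equivalent to self(inner(x))."""
        return Linear(self.scale * inner.scale,
                      self.scale * inner.offset + self.offset)


# Illustrative numbers only: L1's clock in terms of the host TSC,
# and L2's clock in terms of L1's clock.
l1_from_tsc = Linear(Fraction(1, 2), Fraction(1000))
l2_from_l1 = Linear(Fraction(1), Fraction(-250))

# L2's clock expressed directly in terms of the host TSC.
l2_from_tsc = l2_from_l1.after(l1_from_tsc)
```

Here l2_from_tsc(tsc) gives the same result as l2_from_l1(l1_from_tsc(tsc))
for any tsc, which is the property that would let a clock layered on top of
another stable, TSC-based clock still be read with a plain scale-and-offset
computation.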

While I am very happy with how the KVM code comes out, it may well not be
the best solution.  I definitely need help from the clocksource maintainers
here, not just approval!  It doesn't help that a lot of the code surrounding
ktime_get_snapshot is unused, so that may have sent me off track.

In particular, the return value of the new callback can be defined as "is
it the TSC or a linear transformation of it".  But that's as good a definition
as "is it good for KVM" (i.e., not very good) without some documentation on
the meaning of "cycles" in the struct returned by ktime_get_snapshot. Once I
understand that, I hope I can provide a better explanation for the return
value of the callback.

Paolo

> So far for a problem statement, all I've got is:
> "However, when using nested virtualization you have
> 
>         L0: bare-metal hypervisor (uses TSC)
>         L1: nested hypervisor (uses kvmclock, can use vsyscall)
>         L2: nested guest
> 
> and L2 cannot use vsyscall because it is not using the TSC."
> 
> Which is a start but doesn't really make it clear why the proposed
> solution is best/necessary.
> 
> thanks
> -john
>