Date: Wed, 23 Aug 2017 14:45:48 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
To: Paolo Bonzini <pbonzini@redhat.com>
cc: John Stultz <john.stultz@linaro.org>,
        Denis Plotnikov <dplotnikov@virtuozzo.com>,
        Radim Krcmar <rkrcmar@redhat.com>, kvm list <kvm@vger.kernel.org>,
        Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
        lkml <linux-kernel@vger.kernel.org>, x86@kernel.org,
        rkagan@virtuozzo.com, den@virtuozzo.com,
        Marcelo Tosatti <mtosatti@redhat.com>,
        Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH v4 00/10] make L2's kvm-clock stable, get rid of
 pvclock_gtod_copy in KVM
In-Reply-To: <894362115.582988.1503435653874.JavaMail.zimbra@redhat.com>
Message-ID: <alpine.DEB.2.20.1708230908330.1821@nanos>
References: <1501684690-211093-1-git-send-email-dplotnikov@virtuozzo.com> <CALAqxLVngARqG4g8URMQGQtZec_p5NVP7hFXQGS6+JWp74FVMA@mail.gmail.com> <b913b3b2-b3bb-348a-4069-c251a22233cd@redhat.com> <f93105c1-0d6f-f9d4-fc43-31efae1c9aeb@virtuozzo.com>
 <CALAqxLWGf6BAdji=rqKPnyvFzzUQscA5rXU8SY4RcWXymyVL1Q@mail.gmail.com> <894362115.582988.1503435653874.JavaMail.zimbra@redhat.com>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7652
Lines: 208

On Tue, 22 Aug 2017, Paolo Bonzini wrote:
> Regarding the "why is it best" part.  Right now, the hypervisor makes a
> copy of the timekeeper information in order to prepare the stable kvmclock.
> This code is very much tied to the TSC.  However, a snapshot of the timekeeper
> information is almost entirely the same thing that ktime_get_snapshot returns,
> so my suggestion to "untie" the hypervisor code from the TSC was to use
> ktime_get_snapshot instead.  This way, the clocksource itself tells KVM
> whether it can be the base for a vsyscall-happy kvmclock (which means, it
> must be the TSC or a linear transformation of it).
>
> While I am very happy with how the KVM code comes out, it might certainly
> be not the best solution---I definitely need help from the clocksource
> maintainers here, not just approval!  In particular, it doesn't help that
> a lot of code surrounding ktime_get_snapshot is unused, so that may have
> sent me off track.
> 
> In particular, the return value of the new callback can be defined as "is
> it the TSC or a linear transformation of it".  But that's as good a
> definition as "is it good for KVM" (i.e., not very good) without some
> documentation on the meaning of "cycles" in the struct returned by
> ktime_get_snapshot. Once I understand that, I hope I can provide a better
> explanation for the return value of the callback.

This all looks wrong to begin with and you are just bolting and duct taping
stuff together as you see fit.

I understand the idea of providing a clocksource to the core code which
provides direct nano second resolution and do the host -> guest conversion
in the kvmclock implementation. But that's the root cause of all evils.

The reason why you do that is to support live migration of VMs to hosts
with a different TSC frequency. And for exactly that particular corner case
the whole conversion magic is implemented and now you need even more duct
tape to make it work with nested VMs.

That's just wrong. We need to sit back and look at that whole kvm clock
mechanism and redesign it proper.

Let's look at the goals:

   1) Provide the information of host frequency to the guest

   2) Propagate host frequency changes to the guest

      That's solely used for migration purposes.

      It's not there for dealing with non constant frequency TSCs, which
      would be beyond insane.

      Neither to runtime propagate host clocksource adjustments (via NTP &
      friends) to a guest. If you use it that way today, then this needs to
      be fixed, simply because that's the wrong approach. We have proper
      correction mechanisms (NTP,PPS,PTP ...) which can be utilized to do
      so. Aside of that it'd be interesting to see the interaction between
      the host frequency scaling change and NTP in the guest trying to
      adjust as well.

Now lets look at a simple ktime_get() with the current implementation:

   ktime_get()
     do {
	seq = seqcount_begin(tk);
	data = tk->data;
	now = tk->clock->read(tk->clock);
	  kvmclock_read()
            do {
	       seqk = seqcount_begin(kvmclock);
	       nowk = kvmclock_read();
	       datak = kvmclock_data;
	    } while (seqcount_retry(seqk, kvmclock));
	    nsec = convert(nowk, datak);
	    return nsec;
     } while (seqcount_retry(seq, tk));

     nsec = convert(now, data);
     return nsec;

So you need two sequence counters and two conversions for reading the
clock. Sorry, but that's just crap.

Let's look at the normal usecase (no migration) first:

  The TSC frequency can be retrieved at guest boot time and does not ever
  change. For this case the above is bloat and completely useless.

  All you need to do is to register the kvm clocksource with the proper
  frequency and the readout is just the plain TSC read. Nothing
  else.

So now the special case of live migration. You need to make sure that the
change of the TSC frequency is properly propagated and any reader which is
in the middle of a time getter function will retry with the new parameters.

That's not any different from the case where the timekeeper is adjusted by
any of the regular mechanisms: update_wall_time(), ntp, pps, ptp ....

The only thing you need to ensure is that the timekeeping core is properly
informed about the change and code which executes time getter functions
will retry. The core has _ALL_ mechanisms available for that.

So the real question is how to ensure that:

  1) None of the update functions is in progress

  2) The update is propagated via the existing mechanisms

The whole live migration magic is orchestrated by qemu and the kernel. So
it's reasonably simple to ensure that.

Something like the below should work for this:

	Host				Guest
1:	prevent_nmi_delivery_to_guest()
2:	inject_kvmclock_irq()
3:					handle_kvmclock_irq()
4:					  timekeeping_freeze()
5:					  exit_vm()
6:	stop_and_migrate_guest()
7:	update_guest_data()
8:	resume_guest()
9:					  timekeeping_reconfigure();
10:					  timekeeping_unfreeze();
11:	resume_nmi_delivery_to_guest()

#1  Ensures that no NMI is delivered to the guest so that the timekeeper
    core does not have to worry about NMI context calling any of the NMI
    safe time getters.

    If that's not possible then it's simple enough to do something about
    the NMI safe time getters in the time keeping core code, so you don't
    have to worry much about it.

#2  Inject a special interrupt vector which initiates the timekeeping
    freeze mechanism in the guest

#3  kvm clock interrupt handler gets invoked

#4  Ensures that:

    - All potential timekeeping update mechanisms have left the critical
      region

    - All potential time getters are blocked

    The mechanism for that is simple enough:

    timekeeping_freeze()
    {
    	raw_spin_lock(&timekeeper_lock);
	write_seqcount_begin(&tk_core.seq);
	/* Ensure a consistent state */
	timekeeping_forward_now(tk);
	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);	
    }

#5  Exit the VM with some magic exit reason so the host side knows that
    timekeeping is frozen

#6  Stop the VM, migrate it

#9  Retrieve the new conversion factor and adjust the timekeeper data
    accordingly

#10 Unfreeze the time keeper with the new configuration

    timekeeping_freeze()
    {
	/* Ensure a consistent state */
	timekeeping_forward_now(tk);
	timekeeping_update(tk, TK_CLEAR_NTP | TK_MIRROR | TK_CLOCK_WAS_SET);	
	
	write_seqcount_end(&tk_core.seq);
    	raw_spin_unlock(&timekeeper_lock);
    }

#11 Resume NMI delivery

    As I said in #1 this can be completely handled in the timekeeper core,
    but if we can avoid that then stuff becomes simpler. And simpler is
    preferred ....


So now for the nested KVM case. If you follow the above scheme then this
becomes really simple:

  1) The TSC frequency is merily propagated to the L2 guest. It's the
     same as the L1 guest TSC frequency. No magic voodoo required.

  2) Migration of a L2 guest to a different L1 guest follows the
     same scheme as above

  3) Migration of a L2 guest to a physcial host follows the same scheme as
     above - no idea whether that's supported at all

  4) Migration of a L1 guest with a embedded L2 guest is not rocket science
     either. The above needs some extra code which propagates the time
     keeper freeze to L2 and then when L1 resumes, the updated frequency
     data is propagated to L2 and L2 resumed along with it.

  You don't need any timestamp snapshot magic and voodoo callbacks. All you
  need is a proper mechanism to update the timekeeper.

I might be missing something really important as usual. If so, I'm happy to
be educated.

Thanks,

	tglx