Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932419AbdHWQCZ (ORCPT ); Wed, 23 Aug 2017 12:02:25 -0400 Received: from mail-wr0-f193.google.com ([209.85.128.193]:34251 "EHLO mail-wr0-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932120AbdHWQCK (ORCPT ); Wed, 23 Aug 2017 12:02:10 -0400 Subject: Re: [PATCH v4 00/10] make L2's kvm-clock stable, get rid of pvclock_gtod_copy in KVM To: Thomas Gleixner Cc: John Stultz , Denis Plotnikov , Radim Krcmar , kvm list , Ingo Molnar , "H. Peter Anvin" , lkml , x86@kernel.org, rkagan@virtuozzo.com, den@virtuozzo.com, Marcelo Tosatti , Peter Zijlstra References: <1501684690-211093-1-git-send-email-dplotnikov@virtuozzo.com> <894362115.582988.1503435653874.JavaMail.zimbra@redhat.com> From: Paolo Bonzini Message-ID: Date: Wed, 23 Aug 2017 18:02:03 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3644 Lines: 93 On 23/08/2017 14:45, Thomas Gleixner wrote: > So the real question is how to ensure that: > > 1) None of the update functions is in progress > > 2) The update is propagated via the existing mechanisms > > The whole live migration magic is orchestrated by qemu and the kernel. So > it's reasonably simple to ensure that. There is no orchestration whatsoever between QEMU and the host kernel, much less between anything and the guest. QEMU just sends ioctls. The complex part is above QEMU, just because you have to make sure that the destination sets up networking and everything else just like the source, but as far as QEMU is concerned live migration is as simple as stop_guest() get_guest_state() write_data_to_socket() read_data_from_socket() set_guest_state() resume_guest() and as far as KVM is concerned, it's as simple as get a signal, KVM_RUN exits KVM_GET_REGS and a bunch more ioctls KVM_SET_REGS and a bunch more ioctls KVM_SET_KVMCLOCK triggers kvmclock update ioctl(KVM_RUN) process KVM_REQ_CLOCK_UPDATE enter the guest Adding interrupts and all that is much, much more complicated than just reusing code that runs all the time (albeit only on hosts with impaired TSC) and needs no special case at all. > The reason why you do that is to support live migration of VMs to hosts > with a different TSC frequency. And for exactly that particular corner case > the whole conversion magic is implemented and now you need even more duct > tape to make it work with nested VMs. The point of the first part of this series is to _remove_ the duct tape and actually make KVM use generic timekeeper services, namely ktime_get_snapshot. If you look at patches 1-2-3-4-5-7 the delta is -120 lines of code, without any nested virtualization stuff. More duct tape would have been just: - if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC) + mode = READ_ONCE(pvclock_gtod_data.clock.vclock_mode); + if (mode != VCLOCK_TSC && + (mode != VCLOCK_PVCLOCK || !pvclock_nested_virt_magic()) return false; - return do_realtime(ts, cycle_now) == VCLOCK_TSC; + switch (mode) { + case VCLOCK_TSC: + return do_realtime_tsc(ts, cycle_now); + case VCLOCK_PVCLOCK: + return do_realtime_pvclock(ts, cycle_now); + } Nested virtualization does need a clocksource change notifier on top, but we can cross that bridge later. Maybe Denis can post just those patches to begin with. Paolo > So now for the nested KVM case. If you follow the above scheme then this > becomes really simple: > > 1) The TSC frequency is merily propagated to the L2 guest. It's the > same as the L1 guest TSC frequency. No magic voodoo required. > > 2) Migration of a L2 guest to a different L1 guest follows the > same scheme as above > > 3) Migration of a L2 guest to a physcial host follows the same scheme as > above - no idea whether that's supported at all > > 4) Migration of a L1 guest with a embedded L2 guest is not rocket science > either. The above needs some extra code which propagates the time > keeper freeze to L2 and then when L1 resumes, the updated frequency > data is propagated to L2 and L2 resumed along with it. > > You don't need any timestamp snapshot magic and voodoo callbacks. All you > need is a proper mechanism to update the timekeeper. > > I might be missing something really important as usual. If so, I'm happy to > be educated.