LinuxLists.cc - RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

2023-12-21 16:52:08

Subject: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

Hi!

Recently I was tasked with triaging the failures of the 'vmx_preemption_timer'
test that happen in our kernel CI pipeline.

The test usually fails because L2 observes a TSC value past the
preemption timer deadline before the VM exit happens.

This happens because KVM emulates the nested preemption timer with HR timers:
it converts the preemption timer value to nanoseconds, taking into account
TSC scaling and the host TSC frequency, and arms an HR timer.

However, as I found out the hard way, the HR timer is bound to CLOCK_MONOTONIC,
whose rate can be adjusted by NTP. The timer can therefore run slower or
faster than KVM expects, so the interrupt can arrive early or late; late
is what is happening here.
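
The drift described above can be illustrated with a small model. This is only a sketch in Python, not KVM code: the function names, the 3 GHz TSC frequency, the 10 ms deadline and the 100 ppm slew are assumptions chosen for illustration, and the preemption timer rate shift is ignored.

```python
def tsc_ticks_to_ns(ticks: int, tsc_khz: int) -> int:
    """Convert a TSC tick count to nanoseconds at the nominal TSC frequency."""
    # ticks / (tsc_khz * 1000 Hz) seconds == ticks * 1_000_000 / tsc_khz ns
    return ticks * 1_000_000 // tsc_khz


def tsc_ticks_elapsed(timer_ns: int, tsc_khz: int, clock_skew_ppm: int) -> int:
    """TSC ticks that really elapse while an hrtimer counts `timer_ns`.

    A positive skew models CLOCK_MONOTONIC being slewed slower than the raw
    hardware clock by that many parts per million, so the timer takes longer
    in real (TSC) time than the nominal conversion assumed.
    """
    real_ns = timer_ns * (1_000_000 + clock_skew_ppm) // 1_000_000
    return real_ns * tsc_khz // 1_000_000


TSC_KHZ = 3_000_000            # assumed 3 GHz host TSC
deadline_ticks = 30_000_000    # assumed deadline: 10 ms worth of TSC ticks

timer_ns = tsc_ticks_to_ns(deadline_ticks, TSC_KHZ)
late_by = tsc_ticks_elapsed(timer_ns, TSC_KHZ, 100) - deadline_ticks
early_by = deadline_ticks - tsc_ticks_elapsed(timer_ns, TSC_KHZ, -100)
print(timer_ns, late_by, early_by)
```

With these assumed numbers, a 100 ppm slew moves the expiry by 3000 TSC ticks (1 us at 3 GHz) in either direction, which is enough for L2 to read a TSC value past the deadline.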

This is how you can reproduce it on an Intel machine:

1. Stop the NTP daemon:
   sudo systemctl stop chronyd.service

2. Introduce a small error in the system time:
   sudo date -s "$(date)"

3. Start the NTP daemon again:
   sudo chronyd -d -n   (for debugging; or start the systemd service again)

4. Run the vmx_preemption_timer test a few times until it fails.

I did some research and it looks like I am not the first to encounter this:

On the ARM side there was an attempt to support CLOCK_MONOTONIC_RAW in the
timer subsystem, which was even merged but then reverted due to issues:

https://lore.kernel.org/all/[email protected]/T/#u

It looks like this issue was later worked around in the ARM code:

commit 1c5631c73fc2261a5df64a72c155cb53dcdc0c45
Author: Marc Zyngier <[email protected]>
Date: Wed Apr 6 09:37:22 2016 +0100

KVM: arm/arm64: Handle forward time correction gracefully

On a host that runs NTP, corrections can have a direct impact on
the background timer that we program on the behalf of a vcpu.

In particular, NTP performing a forward correction will result in
a timer expiring sooner than expected from a guest point of view.
Not a big deal, we kick the vcpu anyway.

But on wake-up, the vcpu thread is going to perform a check to
find out whether or not it should block. And at that point, the
timer check is going to say "timer has not expired yet, go back
to sleep". This results in the timer event being lost forever.

There are multiple ways to handle this. One would be record that
the timer has expired and let kvm_cpu_has_pending_timer return
true in that case, but that would be fairly invasive. Another is
to check for the "short sleep" condition in the hrtimer callback,
and restart the timer for the remaining time when the condition
is detected.

This patch implements the latter, with a bit of refactoring in
order to avoid too much code duplication.

Cc: <[email protected]>
Reported-by: Alexander Graf <[email protected]>
Reviewed-by: Alexander Graf <[email protected]>
Signed-off-by: Marc Zyngier <[email protected]>
Signed-off-by: Christoffer Dall <[email protected]>
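
The "short sleep" check from the commit message can be sketched as follows. This is a simplified Python model rather than the actual arm/arm64 code; `BackgroundTimer`, `on_hrtimer_expired` and the counter values are hypothetical names and numbers used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BackgroundTimer:
    guest_deadline: int  # guest counter value at which the guest timer is due
    restarts: int = 0    # how often the host timer had to be re-armed


def on_hrtimer_expired(timer: BackgroundTimer, guest_now: int) -> int:
    """Model of the host timer callback.

    Returns the remaining guest time the host timer should be re-armed for,
    or 0 when the guest deadline has really passed and the vcpu can be woken.
    """
    remaining = timer.guest_deadline - guest_now
    if remaining > 0:
        # The host timer fired early (for example after a forward correction):
        # re-arm it instead of waking a vcpu that would go back to sleep.
        timer.restarts += 1
        return remaining
    return 0


timer = BackgroundTimer(guest_deadline=1_000)
early = on_hrtimer_expired(timer, guest_now=990)    # fired 10 units early
on_time = on_hrtimer_expired(timer, guest_now=1_000)
print(early, on_time, timer.restarts)
```

The point of the model is that the decision is taken against the guest's view of time, so an early host expiry costs one extra re-arm instead of a lost event.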

So to solve this issue there are two options:

1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
I don't know if that is feasible, and I would be very happy to hear your feedback.

2. Work around this in KVM as well. KVM already listens to changes in the timekeeping
system (the kernel calls its update_pvclock_gtod), and it even records the rates of both
the regular and the raw clocks.

When starting an HR timer I can adjust its period for the difference in rates, which will
in most cases produce a more correct result than what we have now, but will still fail if
the rate is changed at the same time the timer is started or before it expires.

Alternatively, I can restart the timer, although that might do more harm than
good to the accuracy.

What do you think?

Best regards,
Maxim Levitsky

2023-12-21 19:10:31

by Jim Mattson

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Thu, Dec 21, 2023 at 8:52 AM Maxim Levitsky <[email protected]> wrote:
> What do you think?

Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE
timer is all about (lapic_timer_advance_ns = -1)? If so, can we
leverage that for the VMX-preemption timer as well?

2024-01-02 22:22:15

by Maxim Levitsky

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Thu, 2023-12-21 at 11:09 -0800, Jim Mattson wrote:
> Is this what the "adaptive tuning" in the local APIC TSC_DEADLINE
> timer is all about (lapic_timer_advance_ns = -1)?

Hi,

I don't think that 'lapic_timer_advance' is designed for that, but it does
mask this problem somewhat.

The goal of 'lapic_timer_advance' is to decrease the time between the deadline
passing and the start of the guest timer IRQ routine. It does so by making the
deadline happen a bit earlier (by timer_advance_ns), busy-waiting (hopefully
only briefly) until the deadline passes, and then immediately doing the VM entry.

This way, instead of the overhead of both the VM exit and the VM entry landing
after the deadline, only the VM entry happens after the deadline.

In relation to NTP interference: if the deadline happens earlier than expected,
KVM will busy-wait and decrease 'timer_advance_ns', so next time the deadline
will happen a bit later, thus adapting to the NTP adjustment somewhat.

Note though that the 'timer_advance_ns' variable is unsigned and
adjust_lapic_timer_advance can underflow it, which can be fixed.

If the deadline happens later than expected, the guest will see this happen,
but at least adjust_lapic_timer_advance should increase 'timer_advance_ns', so
next time the deadline will happen earlier, which will also eventually hide
the problem.
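
The adaptive behaviour and the underflow concern can be sketched like this. It is a simplified Python model, not the kernel's adjust_lapic_timer_advance: the step divisor, the upper bound, the 32-bit width and the helper names are assumptions made for illustration.

```python
ADJUST_STEP = 8          # assumed damping: apply 1/8 of the observed error
MAX_ADVANCE_NS = 5_000   # assumed upper bound for the advance
U32_MASK = 0xFFFF_FFFF   # assumed width of the unsigned advance variable


def adjust_unclamped(advance_ns: int, error_ns: int) -> int:
    """Adjust the advance with plain unsigned arithmetic (can wrap around).

    error_ns > 0: the deadline was observed late, so advance more.
    error_ns < 0: the timer fired early and KVM busy-waited, so advance less.
    """
    step = abs(error_ns) // ADJUST_STEP
    delta = step if error_ns > 0 else -step
    return (advance_ns + delta) & U32_MASK


def adjust_clamped(advance_ns: int, error_ns: int) -> int:
    """Same adjustment, but clamped to [0, MAX_ADVANCE_NS] to avoid the wrap."""
    step = abs(error_ns) // ADJUST_STEP
    delta = step if error_ns > 0 else -step
    return max(0, min(MAX_ADVANCE_NS, advance_ns + delta))


print(adjust_clamped(1_000, 800))     # late by 800 ns: advance grows
print(adjust_clamped(1_000, -800))    # early by 800 ns: advance shrinks
print(adjust_unclamped(50, -8_000))   # wraps to a huge unsigned value
print(adjust_clamped(50, -8_000))     # clamped at zero instead
```

Clamping is only one possible fix; the model is meant to show why a large early expiry combined with a small current advance is the problematic case.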

So overall I do think that implementing 'lapic_timer_advance' for the nested VMX
preemption timer is a good idea, especially since this feature is not really nested
in some sense: the timer is just delivered as a VM exit, but it is always delivered
to L1, so the VMX preemption timer can be seen as just an extra deadline timer of L1.

I do think that the nested VMX preemption timer should use its own value of
timer_advance_ns, so we need to extract the common code and make both timers
use it. Does this make sense?

Best regards,
Maxim Levitsky

> If so, can we
> leverage that for the VMX-preemption timer as well?

2024-01-02 23:50:02

by Jim Mattson

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Tue, Jan 2, 2024 at 2:21 PM Maxim Levitsky <[email protected]> wrote:
> So overall I do think that implementing the 'lapic_timer_advance' for nested VMX preemption timer
> is a good idea, especially since this feature is not really nested in some sense - the timer is
> just delivered as a VM exit but it is always delivered to L1, so VMX preemption timer can
> be seen as just an extra L1's deadline timer.
>
> I do think that nested VMX preemption timer should use its own value of timer_advance_ns, thus
> we need to extract the common code and make both timers use it. Does this make sense?

Alternatively, why not just use the hardware VMX-preemption timer to
deliver the virtual VMX-preemption timer?

Today, I believe that we only use the hardware VMX-preemption timer to
deliver the virtual local APIC timer. However, it shouldn't be that
hard to pick the first deadline of {VMX-preemption timer, local APIC
timer} at each emulated VM-entry to L2.
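
Picking the first deadline could look roughly like this. The sketch is in Python and is not KVM code; the function name, the dictionary keys and the example numbers are hypothetical. It assumes both deadlines are already expressed in host TSC ticks and that the hardware timer counts down once every 2^rate_shift TSC ticks.

```python
TIMER_MAX = 0xFFFF_FFFF  # the VMX-preemption timer value is a 32-bit field


def program_first_deadline(now_tsc: int, deadlines_tsc: dict, rate_shift: int):
    """Return (source, timer_value) for the earliest pending deadline.

    deadlines_tsc maps a timer name to its deadline in host TSC ticks.
    A deadline that already passed yields 0, i.e. an immediate exit.
    """
    source = min(deadlines_tsc, key=deadlines_tsc.get)
    delta = max(0, deadlines_tsc[source] - now_tsc)
    return source, min(delta >> rate_shift, TIMER_MAX)


deadlines = {"vmx_preemption_timer": 1_640_000, "lapic_timer": 1_320_000}
print(program_first_deadline(1_000_000, deadlines, rate_shift=5))
print(program_first_deadline(2_000_000, deadlines, rate_shift=5))
```

On expiry the caller would still have to work out which of the two virtual timers is actually due and re-arm the hardware timer for the other one.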

2024-05-15 21:03:57

by Maxim Levitsky

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Tue, 2024-01-02 at 15:49 -0800, Jim Mattson wrote:
> Alternatively, why not just use the hardware VMX-preemption timer to
> deliver the virtual VMX-preemption timer?
>
> Today, I believe that we only use the hardware VMX-preemption timer to
> deliver the virtual local APIC timer. However, it shouldn't be that
> hard to pick the first deadline of {VMX-preemption timer, local APIC
> timer} at each emulated VM-entry to L2.

I assume that this is possible, but it might add some complexity.

AFAIK the design choice here was that L1 always uses the hardware VMX preemption timer,
while L2 uses the software preemption timer, which is relatively simple.

I do agree that this might work, and if it does, it might even be a worthwhile
change on its own.

If you agree that this is a good idea, I can prepare a patch series for that.

Note though that the same problem (although somewhat masked by lapic_timer_advance)
does exist on AMD as well: while AMD lacks both the VMX preemption timer and even the
TSC deadline timer, KVM exposes the TSC deadline timer to the guest, and HR timers are
always used for its emulation and are prone to NTP interference, as I discovered.

Best regards,
Maxim Levitsky

2024-05-22 20:20:32

by Thomas Gleixner

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Thu, May 16 2024 at 09:53, Jim Mattson wrote:
> On Wed, May 15, 2024 at 2:03 PM Maxim Levitsky <[email protected]> wrote:
>> > Today, I believe that we only use the hardware VMX-preemption timer to
>> > deliver the virtual local APIC timer. However, it shouldn't be that
>> > hard to pick the first deadline of {VMX-preemption timer, local APIC
>> > timer} at each emulated VM-entry to L2.
>>
>> I assume that this is possible but it might add some complexity.
>>
>> AFAIK the design choice here was that L1 uses the hardware VMX preemption timer always,
>> while L2 uses the software preemption timer which is relatively simple.
>>
>> I do agree that this might work and if it does work it might be even worthwhile
>> change on its own.
>>
>> If you agree that this is a good idea, I can prepare a patch series for that.
>
> I do think it would be worthwhile to provide the infrastructure for
> multiple clients of the VMX-preemption timer.

That only solves the problem when the guests are on the CPU, but it does
not solve anything when they are off the CPU because they are waiting
for a timer to expire. In that case you are back at square one, no?

> (Better yet would be to provide a CLOCK_MONOTONIC_RAW hrtimer, but
> that's outwith our domain.)

That's a non-trivial exercise. I'll respond to that in a separate mail.

Thanks,

tglx

2024-05-22 21:07:24

by Thomas Gleixner

Subject: Re: RFC: NTP adjustments interfere with KVM emulation of TSC deadline timers

On Thu, Dec 21 2023 at 18:51, Maxim Levitsky wrote:
> The test usually fails because L2 observes TSC after the
> preemption timer deadline, before the VM exit happens.

That's an arguably silly failure condition.

Timer interrupt delivery can be late even on bare metal, so observing
TSC ahead of the expected timer event is not really wrong.

Btw, the kernel also handles it nicely when the timer event arrives
_before_ the expected time. It simply reprograms the timer and is done
with it. That's actually required because clocksource (which determines
time) and clockevent (which expires timers) can be on different clocks
which might drift against each other.

> In particular, NTP performing a forward correction will result in
> a timer expiring sooner than expected from a guest point of view.
> Not a big deal, we kick the vcpu anyway.
>
> But on wake-up, the vcpu thread is going to perform a check to
> find out whether or not it should block. And at that point, the
> timer check is going to say "timer has not expired yet, go back
> to sleep". This results in the timer event being lost forever.

That's obviously a real problem.

> There are multiple ways to handle this. One would be record that
> the timer has expired and let kvm_cpu_has_pending_timer return
> true in that case, but that would be fairly invasive. Another is
> to check for the "short sleep" condition in the hrtimer callback,
> and restart the timer for the remaining time when the condition
> is detected.

:)

> So to solve this issue there are two options:

There is a third option:

3. Unconditionally inject the timer interrupt into the guest when the
underlying hrtimer has expired

That's fine because timer interrupts can be early (see above) and
any sane OS has to be able to handle it.

> 1. Have another go at implementing support for CLOCK_MONOTONIC_RAW timers.
> I don't know if that is feasible and I would be very happy to hear
> a feedback from you.

That's a non-trivial exercise.

The charm of having all clocks related to CLOCK_MONOTONIC is that there
is zero requirement to take NTP frequency adjustments into account,
which makes the implementation reasonably simple and robust.

Changing everything over in that area (hrtimers, clockevents, NOHZ) to
be raw hardware frequency based would be a Herculean task and just a
huge pile of horrors.

So the only realistic way to do that is to correlate a
CLOCK_MONOTONIC_RAW timer to CLOCK_MONOTONIC, which obviously has the
same problem you are trying to solve :)

But we could be smart about it. Let's look at the math:

    mraw = base_mraw + (tsc - base_r) * factor_r;
    mono = base_mono + (tsc - base_m) * factor_m;

So converting a MONOTONIC_RAW time into MONOTONIC would be:

    tsc = (mraw - base_mraw)/factor_r + base_r

    mono = base_mono + ((mraw - base_mraw)/factor_r + base_r - base_m) * factor_m;

It's guaranteed that base_r == base_m, so:

    mono = base_mono + (mraw - base_mraw) * factor_m / factor_r;

The conversion factors are actually implemented with scaled math:

    mono = base_mono + (((delta_raw * mult_m) >> sft_m) << sft_r) / mult_r;

As sft_m and sft_r are guaranteed to be identical:

    mono = base_mono + (delta_raw * mult_m) / mult_r;

That obviously only works correctly when mult_m is constant between the
time the timer is enqueued and the time the timer is expired as you
figured out.
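
The final formula can be written down directly. The following Python sketch only mirrors the arithmetic above; the function name and the multiplier values (a mult_m that runs 100 ppm faster than mult_r) are made up for illustration and are not taken from the timekeeping code.

```python
def raw_expiry_to_mono(base_mono: int, base_mraw: int, expiry_mraw: int,
                       mult_m: int, mult_r: int) -> int:
    """Map a MONOTONIC_RAW expiry time onto the MONOTONIC timeline.

    Implements mono = base_mono + (delta_raw * mult_m) / mult_r, which is
    only exact while mult_m stays constant until the timer expires.
    """
    delta_raw = expiry_mraw - base_mraw
    return base_mono + delta_raw * mult_m // mult_r


MULT_R = 4_000_000   # assumed raw multiplier
MULT_M = 4_000_400   # assumed NTP-adjusted multiplier, 100 ppm faster

# A raw timer due 10 ms (in raw nanoseconds) after the common base.
mono_expiry = raw_expiry_to_mono(base_mono=5_000_000_000,
                                 base_mraw=4_000_000_000,
                                 expiry_mraw=4_010_000_000,
                                 mult_m=MULT_M, mult_r=MULT_R)
print(mono_expiry - 5_000_000_000)
```

With these numbers the 10 ms raw interval corresponds to 10.001 ms on the MONOTONIC timeline, so the timer would be enqueued 1 us later in MONOTONIC terms than a naive one-to-one mapping would suggest.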

But even if mult_m changes this will be correct if we take NOHZ out of
the picture for a moment. Why?

In a NOHZ=n scenario the next expiring timer is at least reevaluated
once every tick. As mult_m is stable between ticks any MONOTONIC_RAW
timer which expires before the next tick will be pretty accurately
mapped back onto MONOTONIC and therefore expire at the expected time.

Now NOHZ comes into play and ruins everything under the following
condition:

1) CPU takes an idle nap for a longer period of time

2) Time synchronization (NTP/PTP/PPS) is adjusting mult_m during that
#1 period

That's the only condition where the conversion fails. If NTP slows down
the conversion then the timer is going to be late. If it speeds it up
then the hrtimer core will take care of it and guarantee that the timer
callback is never invoked early.

But that's going to be a rare problem because it requires:

1) the CPU to be in idle for a longer period

2) the MONOTONIC_RAW timer to be the effective first timer to fire
after that idle period

3) Time synchronization adjusting right during that idle period

Sure that can happen, but the question is whether it's really a
problem. As I said before timer events coming late is to be expected
even on bare metal (think SMI, NMI, long interrupt disabled regions).

So the main benefit of such a change would be to spare the various
architecture specific implementations the stupid exercise of
implementing half baked workarounds which will suffer from
the very same problems.

If done right then the extra overhead of the division will not really be
noticeable and only take effect when there is a MONOTONIC_RAW timer
queued. IOW, it's a penalty on virtualization hosts, but not for
everyone. The facility will introduce some extra cycles due to
conditionals vs. MONOTONIC_RAW in a few places, but that's probably
something which can't even be measured.

Thanks,

tglx