Under virtualisation it is possible to get unexpected latency during a
clockevent device's set_next_event() callback which can make it return
-ETIME even for a delta based on min_delta_ns.
The clockevents_program_min_delta() implementation for
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n doesn't handle retries when this
happens, nor does clockevents_program_event() or its callers when force
is true (for example hrtimer_reprogram()). This can result in hangs
until the clock event device does a full period.
It isn't appropriate to use MIN_ADJUST in this case as occasional
hypervisor induced high latency will cause min_delta_ns to quickly
increase to the maximum.
Instead, borrow the retry pattern from the MIN_ADJUST case, but without
making adjustments. We retry up to 10 times before giving up.
Signed-off-by: James Hogan <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
---
kernel/time/clockevents.c | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index a9b76a40319e..35f445f1bf39 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -281,16 +281,28 @@ static int clockevents_program_min_delta(struct clock_event_device *dev)
{
unsigned long long clc;
int64_t delta;
+ int i;
- delta = dev->min_delta_ns;
- dev->next_event = ktime_add_ns(ktime_get(), delta);
+ for (i = 0;;) {
+ delta = dev->min_delta_ns;
+ dev->next_event = ktime_add_ns(ktime_get(), delta);
- if (clockevent_state_shutdown(dev))
- return 0;
+ if (clockevent_state_shutdown(dev))
+ return 0;
- dev->retries++;
- clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
- return dev->set_next_event((unsigned long) clc, dev);
+ dev->retries++;
+ clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
+ if (dev->set_next_event((unsigned long) clc, dev) == 0)
+ return 0;
+
+ if (++i > 9) {
+ /*
+ * We tried 10 times to program the device with the
+ * given min_delta_ns. Get out of here.
+ */
+ return -ETIME;
+ }
+ }
}
#endif /* CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST */
--
2.4.10
On Fri, 22 Apr 2016 11:40:11 +0100
James Hogan <[email protected]> wrote:
> Under virtualisation it is possible to get unexpected latency during a
> clockevent device's set_next_event() callback which can make it return
> -ETIME even for a delta based on min_delta_ns.
Do you have an example for this behavior? I would call that a BUG in the
implementation of the clockevent device, no?
> The clockevents_program_min_delta() implementation for
> CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n doesn't handle retries when this
> happens, nor does clockevents_program_event() or its callers when force
> is true (for example hrtimer_reprogram()). This can result in hangs
> until the clock event device does a full period.
Is that because some clockevent devices can not program the minimum delta
in some corner cases?
> It isn't appropriate to use MIN_ADJUST in this case as occasional
> hypervisor induced high latency will cause min_delta_ns to quickly
> increase to the maximum.
I agree, the whole minimum delta adjustment is quite broken on a virtualized
system. On s390 we have seen the rise of the min_delta_ns to the maximum
value due to a busy hypervisor.
> Instead, borrow the retry pattern from the MIN_ADJUST case, but without
> making adjustments. We retry up to 10 times before giving up.
That will add a few unnecessary instruction for architectures that have a
sane set_next_event function, namely those that always returns 0. Should
not be too bad though.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Mon, Apr 25, 2016 at 03:48:58PM +0200, Martin Schwidefsky wrote:
> On Fri, 22 Apr 2016 11:40:11 +0100
> James Hogan <[email protected]> wrote:
>
> > Under virtualisation it is possible to get unexpected latency during a
> > clockevent device's set_next_event() callback which can make it return
> > -ETIME even for a delta based on min_delta_ns.
>
> Do you have an example for this behavior?
The place where I've observed it is arch/mips/kernel/cevt-r4k.c, which
returns -ETIME when the delay is too short for it to be able to set it
and read back the timer.
I've also recently (Friday afternoon) seen a report of it apparently
happening with the MIPS GIC clockevent driver too
(drivers/clocksource/mips-gic-timer.c) which has similar logic, probably
copied from cevt-r4k, and this patch appeared to help (I still need to
confirm that one). That wasn't with virtualisation, but was on a
multithreaded core being stress tested, a case when its also hard to
find a guaranteed min delta.
> I would call that a BUG in the implementation of the clockevent
> device, no?
Several drivers seem to do that. I'm open to alternatives. Do you think
the driver should retry itself when it detects this race may have been
hit?
>
> > The clockevents_program_min_delta() implementation for
> > CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n doesn't handle retries when this
> > happens, nor does clockevents_program_event() or its callers when force
> > is true (for example hrtimer_reprogram()). This can result in hangs
> > until the clock event device does a full period.
>
> Is that because some clockevent devices can not program the minimum delta
> in some corner cases?
yes.
I think it actually ended up causing an arithmetic overflow somewhere in
ktime_get() (I'd have to dig through my notes to find specifics)
which resulted in __iter_div_u64_rem() being given an excessively large
dividend, which effectively hung the CPU.
Thanks
James
>
> > It isn't appropriate to use MIN_ADJUST in this case as occasional
> > hypervisor induced high latency will cause min_delta_ns to quickly
> > increase to the maximum.
>
> I agree, the whole minimum delta adjustment is quite broken on a virtualized
> system. On s390 we have seen the rise of the min_delta_ns to the maximum
> value due to a busy hypervisor.
>
> > Instead, borrow the retry pattern from the MIN_ADJUST case, but without
> > making adjustments. We retry up to 10 times before giving up.
>
> That will add a few unnecessary instruction for architectures that have a
> sane set_next_event function, namely those that always returns 0. Should
> not be too bad though.
>
> --
> blue skies,
> Martin.
>
> "Reality continues to ruin my life." - Calvin.
>
Hi,
On Fri, Apr 22, 2016 at 11:40:11AM +0100, James Hogan wrote:
> Under virtualisation it is possible to get unexpected latency during a
> clockevent device's set_next_event() callback which can make it return
> -ETIME even for a delta based on min_delta_ns.
>
> The clockevents_program_min_delta() implementation for
> CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n doesn't handle retries when this
> happens, nor does clockevents_program_event() or its callers when force
> is true (for example hrtimer_reprogram()). This can result in hangs
> until the clock event device does a full period.
>
> It isn't appropriate to use MIN_ADJUST in this case as occasional
> hypervisor induced high latency will cause min_delta_ns to quickly
> increase to the maximum.
>
> Instead, borrow the retry pattern from the MIN_ADJUST case, but without
> making adjustments. We retry up to 10 times before giving up.
>
> Signed-off-by: James Hogan <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Martin Schwidefsky <[email protected]>
This is still an issue, both under virtualization and on
hardware-multi-threaded MIPS cores with the mips-gic driver where
general latency can by impacted by the behaviour of software running on
other hardware threads (which are seen as other CPUs from Linux' point
of view) in the same core.
Does anybody have any more feedback about this patch?
Thanks
James