2022-07-25 09:55:52

by Valentin Schneider

[permalink] [raw]
Subject: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

Hi,

I've been incidentally staring at some NOHZ bits related to the timer
wheels, and trigger_dyntick_cpu() confuses me:

static void
trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
{
[...]
/*
* TODO: This wants some optimizing similar to the code below, but we
* will do that when we switch from push to pull for deferrable timers.
*/
if ((timer->flags & TIMER_DEFERRABLE)) {
if (tick_nohz_full_cpu(base->cpu))
wake_up_nohz_cpu(base->cpu);
return;
}
[...]
}

From what I grok out of get_nohz_timer_target(), under
timers_migration_enabled we should migrate the timer to an non-idle CPU
(or at the very least a non-isolated CPU) *before* enqueuing the
timer. Without timers_migration_enabled (or if TIMER_PINNED), I don't see
anything that could migrate the timer elsewhere, so:

Why bother kicking a NOHZ CPU for a deferrable timer if it is the next
expiring one? Per the definition:

* @TIMER_DEFERRABLE: A deferrable timer will work normally when the
* system is busy, but will not cause a CPU to come out of idle just
* to service it; instead, the timer will be serviced when the CPU
* eventually wakes up with a subsequent non-deferrable timer.

I tried to find some discussion over this in LKML, but found nothing.
v3 of the patch did *not* kick a CPU for a deferrable timer, but v4 (the
one that ended up merged) did (see below). Patch in question is:

a683f390b93f ("timers: Forward the wheel clock whenever possible")

Thanks

====
v3
====
@@ -520,23 +522,27 @@ static void internal_add_timer(struct ti
__internal_add_timer(base, timer);

/*
- * Check whether the other CPU is in dynticks mode and needs
- * to be triggered to reevaluate the timer wheel. We are
- * protected against the other CPU fiddling with the timer by
- * holding the timer base lock. This also makes sure that a
- * CPU on the way to stop its tick can not evaluate the timer
- * wheel.
- *
- * Spare the IPI for deferrable timers on idle targets though.
- * The next busy ticks will take care of it. Except full dynticks
- * require special care against races with idle_cpu(), lets deal
- * with that later.
- */
- if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active) {
- if (!(timer->flags & TIMER_DEFERRABLE) ||
- tick_nohz_full_cpu(base->cpu))
- wake_up_nohz_cpu(base->cpu);
- }
+ * We might have to IPI the remote CPU if the base is idle and the
+ * timer is not deferrable. If the other cpu is on the way to idle
+ * then it can't set base->is_idle as we hold base lock.
+ */
+ if (!IS_ENABLED(CONFIG_NO_HZ_COMMON) || !base->is_idle ||
+ (timer->flags & TIMER_DEFERRABLE))
+ return;
+
+ /* Check whether this is the new first expiring timer */
+ if (time_after_eq(timer->expires, base->next_expiry))
+ return;
+ base->next_expiry = timer->expires;
+
+ /*
+ * Check whether the other CPU is in dynticks mode and needs to be
+ * triggered to reevaluate the timer wheel. We are protected against
+ * the other CPU fiddling with the timer by holding the timer base
+ * lock.
+ */
+ if (tick_nohz_full_cpu(base->cpu))
+ wake_up_nohz_cpu(base->cpu);
}
====
v4
====
@@ -519,24 +521,37 @@ static void internal_add_timer(struct ti
{
__internal_add_timer(base, timer);

+ if (!IS_ENABLED(CONFIG_NO_HZ_COMMON) || !base->nohz_active)
+ return;
+
/*
- * Check whether the other CPU is in dynticks mode and needs
- * to be triggered to reevaluate the timer wheel. We are
- * protected against the other CPU fiddling with the timer by
- * holding the timer base lock. This also makes sure that a
- * CPU on the way to stop its tick can not evaluate the timer
- * wheel.
- *
- * Spare the IPI for deferrable timers on idle targets though.
- * The next busy ticks will take care of it. Except full dynticks
- * require special care against races with idle_cpu(), lets deal
- * with that later.
- */
- if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active) {
- if (!(timer->flags & TIMER_DEFERRABLE) ||
- tick_nohz_full_cpu(base->cpu))
+ * This wants some optimizing similar to the below, but we do that
+ * when we switch from push to pull for deferrable timers.
+ */
+ if (timer->flags & TIMER_DEFERRABLE) {
+ if (tick_nohz_full_cpu(base->cpu))
wake_up_nohz_cpu(base->cpu);
+ return;
}
+
+ /*
+ * We might have to IPI the remote CPU if the base is idle and the
+ * timer is not deferrable. If the other cpu is on the way to idle
+ * then it can't set base->is_idle as we hold base lock.
+ */
+ if (!base->is_idle)
+ return;
+
+ /* Check whether this is the new first expiring timer */
+ if (time_after_eq(timer->expires, base->next_expiry))
+ return;
+
+ /*
+ * Set the next expiry time and kick the cpu so it can reevaluate the
+ * wheel
+ */
+ base->next_expiry = timer->expires;
+ wake_up_nohz_cpu(base->cpu);
}


2022-07-25 10:57:28

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

On Mon, Jul 25, 2022 at 10:32:42AM +0100, Valentin Schneider wrote:
> Hi,
>
> I've been incidentally staring at some NOHZ bits related to the timer
> wheels, and trigger_dyntick_cpu() confuses me:
>
> static void
> trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
> {
> [...]
> /*
> * TODO: This wants some optimizing similar to the code below, but we
> * will do that when we switch from push to pull for deferrable timers.
> */
> if ((timer->flags & TIMER_DEFERRABLE)) {
> if (tick_nohz_full_cpu(base->cpu))
> wake_up_nohz_cpu(base->cpu);
> return;
> }
> [...]
> }
>
> From what I grok out of get_nohz_timer_target(), under
> timers_migration_enabled we should migrate the timer to an non-idle CPU
> (or at the very least a non-isolated CPU) *before* enqueuing the
> timer.

That's not always the case. For example TIMER_PINNED timers might have
to run on a buzy or isolated CPU.

And note that even when (base->cpu == smp_processor_id()) we want to kick
the current CPU with a self-IPI. This way we force, from IRQ-tail, the
tick to recalculate the next deadline to fire, considering the new enqueued
timer callback.

> Without timers_migration_enabled (or if TIMER_PINNED), I don't see
> anything that could migrate the timer elsewhere, so:
>
> Why bother kicking a NOHZ CPU for a deferrable timer if it is the next
> expiring one? Per the definition:
>
> * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
> * system is busy, but will not cause a CPU to come out of idle just
> * to service it; instead, the timer will be serviced when the CPU
> * eventually wakes up with a subsequent non-deferrable timer.
>
> I tried to find some discussion over this in LKML, but found nothing.
> v3 of the patch did *not* kick a CPU for a deferrable timer, but v4 (the
> one that ended up merged) did (see below). Patch in question is:
>
> a683f390b93f ("timers: Forward the wheel clock whenever possible")

Because TIMER_DEFERRABLE timers should only be deferred when the CPU is
in "nohz-idle". If the CPU runs an actual task with the tick shutdown
("nohz-full"), we should execute those deferrable timers.

Now that's the theory. In practice the deferrable timers are ignored by
both nohz-idle and nohz-full when it comes to compute the next nohz delta.
This is a mistake that is there since the introduction of nohz-full but I've
always been scared to break some user setup while fixing it. Anyway things
should look like this (untested):

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index de192dcff828..5f8ef777a785 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -819,7 +819,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
* disabled this also looks at the next expiring
* hrtimer.
*/
- next_tick = get_next_timer_interrupt(basejiff, basemono);
+ next_tick = get_next_timer_interrupt(basejiff, basemono, ts->inidle);
ts->next_timer = next_tick;
}

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 717fcb9fb14a..8279d4e9b7a0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -574,16 +574,6 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
if (!is_timers_nohz_active())
return;

- /*
- * TODO: This wants some optimizing similar to the code below, but we
- * will do that when we switch from push to pull for deferrable timers.
- */
- if (timer->flags & TIMER_DEFERRABLE) {
- if (tick_nohz_full_cpu(base->cpu))
- wake_up_nohz_cpu(base->cpu);
- return;
- }
-
/*
* We might have to IPI the remote CPU if the base is idle and the
* timer is not deferrable. If the other CPU is on the way to idle
@@ -1678,17 +1668,9 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
}

-/**
- * get_next_timer_interrupt - return the time (clock mono) of the next timer
- * @basej: base time jiffies
- * @basem: base time clock monotonic
- *
- * Returns the tick aligned clock monotonic time of the next pending
- * timer or KTIME_MAX if no timer is pending.
- */
-u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
+static u64 get_next_base_interrupt(struct timer_base *base,
+ unsigned long basej, u64 basem)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
u64 expires = KTIME_MAX;
unsigned long nextevt;

@@ -1734,6 +1716,32 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
}
raw_spin_unlock(&base->lock);

+ return expires;
+}
+
+/**
+ * get_next_timer_interrupt - return the time (clock mono) of the next timer
+ * @basej: base time jiffies
+ * @basem: base time clock monotonic
+ * @idle: is the CPU idle?
+ *
+ * Returns the tick aligned clock monotonic time of the next pending
+ * timer or KTIME_MAX if no timer is pending.
+ */
+u64 get_next_timer_interrupt(unsigned long basej, u64 basem, bool idle)
+{
+ struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ u64 expires;
+
+ expires = get_next_base_interrupt(base, basej, basem);
+ if (!idle) {
+ u64 expires_def;
+
+ base = this_cpu_ptr(&timer_bases[BASE_DEF]);
+ expires_def = get_next_base_interrupt(base, basej, basem);
+ expires = min(expires, expires_def);
+ }
+
return cmp_next_hrtimer_event(basem, expires);
}

@@ -1744,15 +1752,15 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
void timer_clear_idle(void)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
/*
* We do this unlocked. The worst outcome is a remote enqueue sending
* a pointless IPI, but taking the lock would just make the window for
* sending the IPI a few instructions smaller for the cost of taking
* the lock in the exit from idle path.
*/
- base->is_idle = false;
+ __this_cpu_write(timer_bases[BASE_STD].is_idle, false);
+ if (tick_nohz_full_cpu(smp_processor_id()))
+ __this_cpu_write(timer_bases[BASE_DEF].is_idle, false);
}
#endif

2022-07-25 15:38:01

by Valentin Schneider

[permalink] [raw]
Subject: Re: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

On 25/07/22 12:43, Frederic Weisbecker wrote:
> On Mon, Jul 25, 2022 at 10:32:42AM +0100, Valentin Schneider wrote:
>> From what I grok out of get_nohz_timer_target(), under
>> timers_migration_enabled we should migrate the timer to an non-idle CPU
>> (or at the very least a non-isolated CPU) *before* enqueuing the
>> timer.
>
> That's not always the case. For example TIMER_PINNED timers might have
> to run on a buzy or isolated CPU.
>
> And note that even when (base->cpu == smp_processor_id()) we want to kick
> the current CPU with a self-IPI. This way we force, from IRQ-tail, the
> tick to recalculate the next deadline to fire, considering the new enqueued
> timer callback.
>

Right, tick_irq_exit() & friends... I'm still figuring the different
dependencies down there, but I think I can roughly map the bits of what
you're describing.

>> Without timers_migration_enabled (or if TIMER_PINNED), I don't see
>> anything that could migrate the timer elsewhere, so:
>>
>> Why bother kicking a NOHZ CPU for a deferrable timer if it is the next
>> expiring one? Per the definition:
>>
>> * @TIMER_DEFERRABLE: A deferrable timer will work normally when the
>> * system is busy, but will not cause a CPU to come out of idle just
>> * to service it; instead, the timer will be serviced when the CPU
>> * eventually wakes up with a subsequent non-deferrable timer.
>>
>> I tried to find some discussion over this in LKML, but found nothing.
>> v3 of the patch did *not* kick a CPU for a deferrable timer, but v4 (the
>> one that ended up merged) did (see below). Patch in question is:
>>
>> a683f390b93f ("timers: Forward the wheel clock whenever possible")
>
> Because TIMER_DEFERRABLE timers should only be deferred when the CPU is
> in "nohz-idle". If the CPU runs an actual task with the tick shutdown
> ("nohz-full"), we should execute those deferrable timers.
>

Ah, that makes sense, thank you for highlighting the difference. The
comment *does* say "come out of *idle*", not *tickless*...

> Now that's the theory. In practice the deferrable timers are ignored by
> both nohz-idle and nohz-full when it comes to compute the next nohz delta.
> This is a mistake that is there since the introduction of nohz-full but I've
> always been scared to break some user setup while fixing it. Anyway things
> should look like this (untested):
>

IIUC that's making get_next_timer_interrupt() poke the deferrable base if the
CPU isn't tickless idle (IOW if it is tickless "busy" or ticking
idle). That makes sense from what you've written above, but I get your
apprehension (though AIUI "only" pinned deferrable timers should be
problematic, as the others should be migrated away).


Thanks again for your detailed reply!

2022-07-25 15:49:06

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Question] timers: trigger_dyntick_cpu() vs TIMER_DEFERRABLE

> Ah, that makes sense, thank you for highlighting the difference. The
> comment *does* say "come out of *idle*", not *tickless*...
>
> > Now that's the theory. In practice the deferrable timers are ignored by
> > both nohz-idle and nohz-full when it comes to compute the next nohz delta.
> > This is a mistake that is there since the introduction of nohz-full but I've
> > always been scared to break some user setup while fixing it. Anyway things
> > should look like this (untested):
> >
>
> IIUC that's making get_next_timer_interrupt() poke the deferrable base if the
> CPU isn't tickless idle (IOW if it is tickless "busy" or ticking
> idle). That makes sense from what you've written above, but I get your
> apprehension (though AIUI "only" pinned deferrable timers should be
> problematic, as the others should be migrated away).
>
>


taking a small step back; the idea behind deferrable timers is
(mostly) that these are timers that do ongoing "maintenance" of sorts,
be it counter updates or cleanup or whatever...
and that if the CPU is idle, no things happen that would require such
maintenance... so it's not worth waking the CPU out of idle for it...
and the nature of the timer is such
that delays are benign; sure the counters get sync'd maybe a bit later
but it does not impact correctness.

But once "real code" is executing (we can debate if an occasional
network interrupt counts as such but that's a detail) work is done
that would cause the maintenance to be meaningful again,
and then we do fire these timers "as usual", but ideally grouped with
other timers. So frankly that doesn't mean "right the exact
millisecond", after all the nature of deferrable already means that
exact timing is not implied, but within reason if something else
happens anyway/