Hi,
the queue grew because of some additional cleanup patches and splitting
already existing patches to make review easier. Several minor things had to
be reworked. For all people who did testing, I would be really happy if you
could test v9 as well!
The queue contains three parts:
- Patches 1 - 12: Cleanups and minor fixes
- Patches 13 - 16: timer base idle marking rework with two preparatory
changes. See the section below for more details.
- Patches 17 - 32: Updated timer pull model on top of timer idle rework
The queue is available here:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel timers/pushpull
Move marking timer bases as idle into tick_nohz_stop_tick()
-----------------------------------------------------------
The idle marking of timer bases is done in get_next_timer_interrupt()
whenever possible. The timer bases are marked idle even if the tick will not
be stopped. This leads to an IPI when a new first timer is enqueued remotely.
To prevent this, setting the timer_base->is_idle flag is postponed to
tick_nohz_stop_tick().
Furthermore this synchronizes the states of timer base is_idle and
tick_stopped. With the timer pull model in place, also the idle state in
the hierarchy of a CPU is synchronized with the other idle related states.
Timer pull model
----------------
Placing timers at enqueue time on a target CPU based on dubious heuristics
does not make any sense:
1) Most timer wheel timers are canceled or rearmed before they expire.
2) The heuristics to predict which CPU will be busy when the timer expires
are wrong by definition.
So placing the timers at enqueue wastes precious cycles.
The proper solution to this problem is to always queue the timers on the
local CPU and allow the non pinned timers to be pulled onto a busy CPU at
expiry time.
Therefore split the timer storage into local pinned and global timers:
Local pinned timers are always expired on the CPU on which they have been
queued. Global timers can be expired on any CPU.
As long as a CPU is busy it expires both local and global timers. When a
CPU goes idle it arms for the first expiring local timer. If the first
expiring pinned (local) timer is before the first expiring movable timer,
then no action is required because the CPU will wake up before the first
movable timer expires. If the first expiring movable timer is before the
first expiring pinned (local) timer, then this timer is queued into an idle
timerqueue and eventually expired by some other active CPU.
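The decision a CPU makes when going idle can be sketched in user space as follows. This is only an illustration of the paragraph above, not the code from the series: struct idle_plan, plan_idle() and NO_TIMER are hypothetical names, and expiry times are plain integers.

```c
#include <stdint.h>

/* Hypothetical marker for "no timer queued". */
#define NO_TIMER UINT64_MAX

/* Hypothetical result of the go-idle decision described above. */
struct idle_plan {
	uint64_t arm_local;    /* expiry the idle CPU arms for itself */
	uint64_t queue_global; /* expiry handed to the idle timerqueue, or NO_TIMER */
};

/*
 * The CPU always arms for its first pinned (local) timer. Only when the
 * first movable (global) timer expires earlier does it need to be queued
 * for expiry by some other active CPU.
 */
struct idle_plan plan_idle(uint64_t first_local, uint64_t first_global)
{
	struct idle_plan plan = {
		.arm_local = first_local,
		.queue_global = NO_TIMER,
	};

	if (first_global < first_local)
		plan.queue_global = first_global;

	return plan;
}
```

When both timers expire at the same time the sketch treats the local timer as first, since the CPU wakes up at that point anyway.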
To avoid global locking the timerqueues are implemented as a hierarchy. The
lowest level of the hierarchy holds the CPUs. The CPUs are assigned to
groups of 8, which are separated per node. If more than one CPU group
exists, then a second level in the hierarchy collects the groups. Depending
on the size of the system more than 2 levels may be required. Each group has a
"migrator" which checks the timerqueue during the tick for remote timers to
be expired.
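How the number of levels grows with the system size can be estimated with a small sketch. It assumes 8 children per group on every level and deliberately ignores the per node separation mentioned above, so real systems may need more levels; group_levels() and CHILDREN_PER_GROUP are names made up for this illustration.

```c
/* Assumption for this sketch: every group collects up to 8 children. */
#define CHILDREN_PER_GROUP 8u

/*
 * Count the group levels needed until a single top level group remains.
 * Level 1 groups hold CPUs, higher levels hold groups.
 */
unsigned int group_levels(unsigned int nr_cpus)
{
	unsigned int groups = (nr_cpus + CHILDREN_PER_GROUP - 1) / CHILDREN_PER_GROUP;
	unsigned int levels = 1;

	while (groups > 1) {
		groups = (groups + CHILDREN_PER_GROUP - 1) / CHILDREN_PER_GROUP;
		levels++;
	}

	return levels;
}
```

With these assumptions up to 8 CPUs fit into one level, up to 64 CPUs into two levels and up to 512 CPUs into three levels.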
If the last CPU in a group goes idle it reports the first expiring event in
the group up to the next group(s) in the hierarchy. If the last CPU goes
idle it arms its timer for the first system wide expiring timer to ensure
that no timer event is missed.
Testing
~~~~~~~
Enqueue
^^^^^^^
The impact of wasting cycles during enqueue by using the heuristic in
contrast to always queuing the timer on the local CPU was measured with a
micro benchmark. For this, a timer is enqueued and dequeued in a loop with
1000 repetitions on an isolated CPU. The time the loop takes is measured. A
quarter of the remaining CPUs was kept busy. This measurement was repeated
several times. With the patch queue the average duration was reduced by
approximately 25%.
145ns plain v6
109ns v6 with patch queue
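The quoted reduction follows directly from the two averages. A minimal sketch of the arithmetic, with relative_change_percent() being a helper name chosen for this example:

```c
/*
 * Relative change of @value against @base in percent. Negative results
 * mean that @value is smaller than @base.
 */
double relative_change_percent(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

For the numbers above, relative_change_percent(145.0, 109.0) is (109 - 145) / 145 * 100, which is roughly -24.8, i.e. the reduction of approximately 25%.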
Furthermore the impact on residence in deep idle states of an idle system
was investigated. The patch queue doesn't degrade this behavior.
dbench test
^^^^^^^^^^^
A dbench test starting X client/server pairs is used to create load on
the system. The measurable value is the throughput. The tests were executed
on a zen3 machine. The base is the tip tree branch timers/core which is
based on v6.6-rc1.
governor menu
NR     timers/core       pull-model        impact
-------------------------------------------------
1        353.19 (0.19)     353.45 (0.30)    0.07%
2        700.10 (0.96)     687.00 (0.20)   -1.87%
4       1329.37 (0.63)    1282.91 (0.64)   -3.49%
8       2561.16 (1.28)    2493.56 (1.76)   -2.64%
16      4959.96 (0.80)    4914.59 (0.64)   -0.91%
32      9741.92 (3.44)    8979.83 (1.13)   -7.82%
64     16535.40 (2.84)   16388.47 (4.02)   -0.89%
128    22136.83 (2.42)   23174.50 (1.43)    4.69%
256    39256.77 (4.48)   38994.00 (0.39)   -0.67%
512    36799.03 (1.83)   38091.10 (0.63)    3.51%
1024   32903.03 (0.86)   35370.70 (0.89)    7.50%
governor teo
NR     timers/core       pull-model        impact
-------------------------------------------------
1        350.83 (1.27)     352.45 (0.96)    0.46%
2        699.52 (0.85)     690.10 (0.54)   -1.35%
4       1339.53 (1.99)    1294.71 (2.71)   -3.35%
8       2574.10 (0.76)    2495.46 (1.97)   -3.06%
16      4898.50 (1.74)    4783.06 (1.64)   -2.36%
32      9115.50 (4.63)    9037.83 (1.58)   -0.85%
64     16663.90 (3.80)   16042.00 (1.72)   -3.73%
128    25044.93 (1.11)   23250.03 (1.08)   -7.17%
256    38059.53 (1.70)   39658.57 (2.98)    4.20%
512    36369.30 (0.39)   38890.13 (0.36)    6.93%
1024   33956.83 (1.14)   35514.83 (0.29)    4.59%
Ping Pong Oberservation
^^^^^^^^^^^^^^^^^^^^^^^
During testing on a mostly idle machine a ping pong game could be observed:
a process_timeout timer is expired remotely on a non idle CPU. Then the CPU
where the schedule_timeout() was executed to enqueue the timer comes out of
idle and restarts the timer using schedule_timeout() and goes back to idle
again. This is due to the fair scheduler which tries to keep the task on
the CPU which it previously executed on.
Possible Next Steps
~~~~~~~~~~~~~~~~~~~
Simple deferrable timers are no longer required as they can be converted to
global timers. If a CPU goes idle, a formerly deferrable timer will not
prevent the CPU from sleeping as long as possible. Only the last migrator CPU
has to take care of them. Deferrable timers with the TIMER_PINNED flag need
to be expired on the specified CPU but must not prevent the CPU from going
idle. They require their own timer base which is never taken into account
when calculating the next expiry time. This conversion and the required
cleanup will be done in a follow up series.
v8..v9: https://lore.kernel.org/r/[email protected]
- Address review feedback
- Add more minor cleanup fixes
- Fix inconsistent idle related states
v7..v8: https://lore.kernel.org/r/[email protected]
- Address review feedback
- Move marking timer base idle into tick_nohz_stop_tick()
- Look ahead function to determine possible sleep length
v6..v7:
- Address review feedback of Frederic and bigeasy
- Change lock, unlock fetch next timer interrupt logic after remote expiry
- Move timer_expire_remote() into tick-internal.h
- Add documentation section about "Required event and timerqueue update
after remote expiry"
- Fix fallout of kernel test robot
v5..v6:
- Address review of Frederic Weisbecker and Peter Zijlstra (spelling,
locking, race in tmigr_handle_remote_cpu())
- unconditionally set TIMER_PINNED flag in add_timer_on(); introduce
add_timer() variants which set/unset TIMER_PINNED flag; drop fixing
add_timer_on() call sites, as TIMER_PINNED flag is set implicitly;
Fixing workqueue to use add_timer_global() instead of simply
add_timer() for unbound work.
- Drop support for siblings to end up in the same level 0 group (could be
added again in a better way as an improvement later on)
- Do not send IPI for new first deferrable timers
v4..v5:
- address review feedback of Frederic Weisbecker
- fix issue with group timer update after remote expiry
v3..v4:
- address review feedback of Frederic Weisbecker
- address kernel test robot fallout
- Move patch 16 "add_timer_on(): Make sure callers have TIMER_PINNED
flag" to the beginning of the queue to prevent timers from ending up in the
global timer base when they were queued using add_timer_on()
- Fix some comments and typos
v2..v3: https://lore.kernel.org/r/[email protected]/
- Minimize usage of locks by storing data using atomic_cmpxchg() for
migrator information and information about active cpus.
Thanks,
Anna-Maria
Anna-Maria Behnsen (29):
tick-sched: Fix function names in comments
tick/sched: Cleanup confusing variables
tick-sched: Warn when next tick seems to be in the past
tracing/timers: Enhance timer_start tracepoint
tracing/timers: Add tracepoint for tracking timer base is_idle flag
timers: Do not IPI for deferrable timers
timers: Move store of next event into __next_timer_interrupt()
timers: Clarify check in forward_timer_base()
timers: Split out forward timer base functionality
timers: Use already existing function for forwarding timer base
timers: Fix nextevt calculation when no timers are pending
timers: Restructure get_next_timer_interrupt()
timers: Split out get next timer interrupt
timers: Move marking timer bases idle into tick_nohz_stop_tick()
timers: Optimization for timer_base_try_to_set_idle()
timers: Introduce add_timer() variants which modify timer flags
workqueue: Use global variant for add_timer()
timers: add_timer_on(): Make sure TIMER_PINNED flag is set
timers: Ease code in run_local_timers()
timers: Split next timer interrupt logic
timers: Keep the pinned timers separate from the others
timers: Retrieve next expiry of pinned/non-pinned timers separately
timers: Split out "get next timer interrupt" functionality
timers: Add get next timer interrupt functionality for remote CPUs
timers: Check if timers base is handled already
timers: Introduce function to check timer base is_idle flag
timers: Implement the hierarchical pull model
timer_migration: Add tracepoints
timers: Always queue timers on the local CPU
Richard Cochran (linutronix GmbH) (2):
timers: Restructure internal locking
tick/sched: Split out jiffies update helper function
Thomas Gleixner (1):
timers: Rework idle logic
include/linux/cpuhotplug.h | 1 +
include/linux/timer.h | 16 +-
include/trace/events/timer.h | 40 +-
include/trace/events/timer_migration.h | 297 +++++
kernel/time/Makefile | 3 +
kernel/time/tick-internal.h | 14 +
kernel/time/tick-sched.c | 89 +-
kernel/time/timer.c | 562 ++++++--
kernel/time/timer_migration.c | 1662 ++++++++++++++++++++++++
kernel/time/timer_migration.h | 144 ++
kernel/workqueue.c | 2 +-
11 files changed, 2680 insertions(+), 150 deletions(-)
create mode 100644 include/trace/events/timer_migration.h
create mode 100644 kernel/time/timer_migration.c
create mode 100644 kernel/time/timer_migration.h
--
2.39.2
The current check whether a forward of the timer base is required can be
simplified by using an already existing comparison function which is easier
to read. The related comment is outdated and was not updated when the check
changed in commit 36cd28a4cdd0 ("timers: Lower base clock forwarding
threshold").
Use time_before_eq() for the check and replace the comment by copying the
comment from the same check inside get_next_timer_interrupt(). Move the
precious information of the outdated comment to the proper place in
__run_timers().
No functional change.
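The equivalence of the two checks can be tried out in user space. The sketch below copies the wrap-safe comparison macros as they are commonly defined in include/linux/jiffies.h (without the kernel's type checking); old_check() and new_check() are names invented for this illustration.

```c
#include <stdbool.h>

/* User-space copies of the wrap-safe jiffies comparisons (no typecheck). */
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)	time_after_eq(b, a)

/* Open-coded check as it was before this patch. */
bool old_check(unsigned long jnow, unsigned long clk)
{
	return (long)(jnow - clk) < 1;
}

/* Same check expressed with the existing helper. */
bool new_check(unsigned long jnow, unsigned long clk)
{
	return time_before_eq(jnow, clk);
}
```

Both variants return true when jnow is at or behind clk and false when jnow is ahead, including the case where jiffies wrapped around. They only differ in the corner case where both values are exactly half of the counter range apart, which should not matter in practice.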
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9: Move precious information of outdated comment to proper place (as
suggested by Frederic)
---
kernel/time/timer.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 3ca706db1d20..66bac56909ba 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -944,11 +944,10 @@ static inline void forward_timer_base(struct timer_base *base)
unsigned long jnow = READ_ONCE(jiffies);
/*
- * No need to forward if we are close enough below jiffies.
- * Also while executing timers, base->clk is 1 offset ahead
- * of jiffies to avoid endless requeuing to current jiffies.
+ * Check whether we can forward the base. We can only do that when
+ * @basej is past base->clk otherwise we might rewind base->clk.
*/
- if ((long)(jnow - base->clk) < 1)
+ if (time_before_eq(jnow, base->clk))
return;
/*
@@ -2015,6 +2014,10 @@ static inline void __run_timers(struct timer_base *base)
*/
WARN_ON_ONCE(!levels && !base->next_expiry_recalc
&& base->timers_pending);
+ /*
+ * While executing timers, base->clk is set 1 offset ahead of
+ * jiffies to avoid endless requeuing to current jiffies.
+ */
base->clk++;
next_expiry_recalc(base);
--
2.39.2
When no timer is queued into an empty timer base, the next_expiry will not
be updated. It was originally calculated as
base->clk + NEXT_TIMER_MAX_DELTA
When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
next_expiry value of the empty base suggests that there is a timer pending
soon. This is probably more of a theoretical problem, but the fix
doesn't hurt.
Use only base->next_expiry value as nextevt when timers are
pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
information is in place, update base->next_expiry value of the empty timer
base as well.
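The resulting logic can be modelled in user space roughly like this. struct base_model and next_event() are simplified stand-ins made up for this sketch (no locking, no base forwarding), and TICK_NSEC assumes HZ=1000.

```c
#include <stdbool.h>
#include <stdint.h>

#define NEXT_TIMER_MAX_DELTA	((1UL << 30) - 1)
#define TICK_NSEC		1000000ULL		/* assumption: HZ=1000 */
#define KTIME_MAX		((uint64_t)INT64_MAX)	/* user-space stand-in */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* Hypothetical, reduced model of struct timer_base. */
struct base_model {
	bool timers_pending;
	unsigned long next_expiry;
};

uint64_t next_event(struct base_model *base, unsigned long basej, uint64_t basem)
{
	/* Default for an empty base: as far in the future as possible. */
	unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
	uint64_t expires = KTIME_MAX;

	if (base->timers_pending) {
		nextevt = base->next_expiry;
		/* If a tick was missed already, force 0 delta. */
		if (time_before(nextevt, basej))
			nextevt = basej;
		expires = basem + (uint64_t)(nextevt - basej) * TICK_NSEC;
	} else {
		/* Keep next_expiry of the empty base in the future. */
		base->next_expiry = nextevt;
	}

	return expires;
}
```

For an empty base the function returns KTIME_MAX and pushes next_expiry to basej + NEXT_TIMER_MAX_DELTA, so a stale value cannot suggest a pending timer later on.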
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: New patch
---
kernel/time/timer.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 0826018d9873..4dffe966424c 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1922,8 +1922,8 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
u64 expires = KTIME_MAX;
- unsigned long nextevt;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -1935,7 +1935,6 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
next_expiry_recalc(base);
- nextevt = base->next_expiry;
/*
* We have a fresh next event. Check whether we can forward the
@@ -1944,10 +1943,20 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
__forward_timer_base(base, basej);
if (base->timers_pending) {
+ nextevt = base->next_expiry;
+
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
+ } else {
+ /*
+ * Move next_expiry for the empty base into the future to
+ * prevent a unnecessary raise of the timer softirq when the
+ * next_expiry value will be reached even if there is no timer
+ * pending.
+ */
+ base->next_expiry = nextevt;
}
/*
--
2.39.2
When debugging timer code the timer tracepoints are very important. There
is no tracepoint when the is_idle flag of the timer base changes. Instead
of always adding trace_printk() manually, add tracepoints which can be
easily enabled whenever required.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: New in v9
---
include/trace/events/timer.h | 20 ++++++++++++++++++++
kernel/time/timer.c | 2 ++
2 files changed, 22 insertions(+)
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 99ada928d445..1ef58a04fc57 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -142,6 +142,26 @@ DEFINE_EVENT(timer_class, timer_cancel,
TP_ARGS(timer)
);
+TRACE_EVENT(timer_base_idle,
+
+ TP_PROTO(bool is_idle, unsigned int cpu),
+
+ TP_ARGS(is_idle, cpu),
+
+ TP_STRUCT__entry(
+ __field( bool, is_idle )
+ __field( unsigned int, cpu )
+ ),
+
+ TP_fast_assign(
+ __entry->is_idle = is_idle;
+ __entry->cpu = cpu;
+ ),
+
+ TP_printk("is_idle=%d cpu=%d",
+ __entry->is_idle, __entry->cpu)
+);
+
#define decode_clockid(type) \
__print_symbolic(type, \
{ CLOCK_REALTIME, "CLOCK_REALTIME" }, \
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a81d793a43d0..46a9b96a3976 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1964,6 +1964,7 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
if ((expires - basem) > TICK_NSEC)
base->is_idle = true;
}
+ trace_timer_base_idle(base->is_idle, base->cpu);
raw_spin_unlock(&base->lock);
return cmp_next_hrtimer_event(basem, expires);
@@ -1985,6 +1986,7 @@ void timer_clear_idle(void)
* the lock in the exit from idle path.
*/
base->is_idle = false;
+ trace_timer_base_idle(0, smp_processor_id());
}
#endif
--
2.39.2
Both call sites of __next_timer_interrupt() store the return value directly
in base->next_expiry. Move the store into __next_timer_interrupt() and to
make its purpose more clear, rename the function to next_expiry_recalc().
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9: Typo fix only
v6: Fix typos in commit message and drop not required return as suggested
by Peter Zijlstra
v4: rename function as suggested by Frederic Weisbecker
---
kernel/time/timer.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a6e31b09637c..3ca706db1d20 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1800,8 +1800,10 @@ static int next_pending_bucket(struct timer_base *base, unsigned offset,
/*
* Search the first expiring timer in the various clock levels. Caller must
* hold base->lock.
+ *
+ * Store next expiry time in base->next_expiry.
*/
-static unsigned long __next_timer_interrupt(struct timer_base *base)
+static void next_expiry_recalc(struct timer_base *base)
{
unsigned long clk, next, adj;
unsigned lvl, offset = 0;
@@ -1867,10 +1869,9 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
clk += adj;
}
+ base->next_expiry = next;
base->next_expiry_recalc = false;
base->timers_pending = !(next == base->clk + NEXT_TIMER_MAX_DELTA);
-
- return next;
}
#ifdef CONFIG_NO_HZ_COMMON
@@ -1930,7 +1931,7 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
- base->next_expiry = __next_timer_interrupt(base);
+ next_expiry_recalc(base);
nextevt = base->next_expiry;
/*
@@ -2015,7 +2016,7 @@ static inline void __run_timers(struct timer_base *base)
WARN_ON_ONCE(!levels && !base->next_expiry_recalc
&& base->timers_pending);
base->clk++;
- base->next_expiry = __next_timer_interrupt(base);
+ next_expiry_recalc(base);
while (levels--)
expire_timers(base, heads + levels);
--
2.39.2
get_next_timer_interrupt() contains two parts for the next timer interrupt
calculation. Those two parts are separated by forwarding the base
clock. But the second part does not depend on the forwarded base
clock.
Therefore restructure get_next_timer_interrupt() to keep things together
which belong together.
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: New patch to ease patch "timers: Split out get next timer functionality"
---
kernel/time/timer.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 4dffe966424c..9d377ebb7395 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1936,12 +1936,6 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
if (base->next_expiry_recalc)
next_expiry_recalc(base);
- /*
- * We have a fresh next event. Check whether we can forward the
- * base.
- */
- __forward_timer_base(base, basej);
-
if (base->timers_pending) {
nextevt = base->next_expiry;
@@ -1959,6 +1953,12 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
base->next_expiry = nextevt;
}
+ /*
+ * We have a fresh next event. Check whether we can forward the
+ * base.
+ */
+ __forward_timer_base(base, basej);
+
/*
* Base is idle if the next event is more than a tick away.
*
--
2.39.2
From: Thomas Gleixner <[email protected]>
To improve readability of the code, split base->is_idle calculation and
expires calculation into separate parts. While at it, update the comment
about timer base idle marking.
Thereby the following subtle change happens if the next event is just one
jiffy ahead and the tick was already stopped: Originally base->is_idle
remains true in this situation. Now base->is_idle turns to false. This may
spare an IPI if a timer is enqueued remotely to an idle CPU that is going
to tick on the next jiffy.
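The subtle change can be made visible with a small user-space model of both variants. old_is_idle() and new_is_idle() are names made up for this sketch; was_idle stands for the value of base->is_idle before the call, basem is taken as 0 and TICK_NSEC assumes HZ=1000.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC		1000000ULL		/* assumption: HZ=1000 */
#define KTIME_MAX		((uint64_t)INT64_MAX)	/* user-space stand-in */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)	time_after_eq(b, a)

/* Model of the logic before this patch; basem is assumed to be 0. */
bool old_is_idle(bool was_idle, bool timers_pending,
		 unsigned long nextevt, unsigned long basej)
{
	uint64_t expires = KTIME_MAX;

	if (time_before_eq(nextevt, basej))
		return false;

	if (timers_pending)
		expires = (uint64_t)(nextevt - basej) * TICK_NSEC;

	/* Only set, never cleared, in this branch. */
	if (expires > TICK_NSEC)
		return true;

	return was_idle;
}

/* Model of the logic after this patch. */
bool new_is_idle(unsigned long nextevt, unsigned long basej)
{
	return time_after(nextevt, basej + 1);
}
```

With the next event exactly one jiffy ahead and the base already marked idle, old_is_idle() keeps returning true while new_is_idle() returns false. For events further away or already due, both variants agree.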
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9: Re-ordering to not hurt the eyes and update comment
v4: Change condition to force 0 delta and update commit message (Frederic)
---
kernel/time/timer.c | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index fee42dda8237..0826018d9873 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1943,22 +1943,23 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
__forward_timer_base(base, basej);
- if (time_before_eq(nextevt, basej)) {
- expires = basem;
- base->is_idle = false;
- } else {
- if (base->timers_pending)
- expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
- /*
- * If we expect to sleep more than a tick, mark the base idle.
- * Also the tick is stopped so any added timer must forward
- * the base clk itself to keep granularity small. This idle
- * logic is only maintained for the BASE_STD base, deferrable
- * timers may still see large granularity skew (by design).
- */
- if ((expires - basem) > TICK_NSEC)
- base->is_idle = true;
+ if (base->timers_pending) {
+ /* If we missed a tick already, force 0 delta */
+ if (time_before(nextevt, basej))
+ nextevt = basej;
+ expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
}
+
+ /*
+ * Base is idle if the next event is more than a tick away.
+ *
+ * If the base is marked idle then any timer add operation must forward
+ * the base clk itself to keep granularity small. This idle logic is
+ * only maintained for the BASE_STD base, deferrable timers may still
+ * see large granularity skew (by design).
+ */
+ base->is_idle = time_after(nextevt, basej + 1);
+
trace_timer_base_idle(base->is_idle, base->cpu);
raw_spin_unlock(&base->lock);
--
2.39.2
When the tick is stopped, the timer base is_idle flag is set as well. When
reentering timer_base_try_to_set_idle() with the tick stopped, there is
no need to check whether the timer base needs to be set idle again. When a
timer was enqueued in the meantime, this is already handled by the
nohz_get_next_event() call which was executed before tick_nohz_stop_tick().
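The effect of the early return can be sketched with a stub in place of the real calculation. slow_path() is a hypothetical stand-in for __get_next_timer_interrupt() which only counts its invocations, and try_to_set_idle() and slow_path_count() are names made up for this model.

```c
#include <stdbool.h>
#include <stdint.h>

#define KTIME_MAX ((uint64_t)INT64_MAX)	/* user-space stand-in */

static unsigned int slow_path_calls;

/* Hypothetical stand-in for the real next timer interrupt calculation. */
static uint64_t slow_path(bool *idle)
{
	slow_path_calls++;
	*idle = true;
	return 12345;
}

/* On entry *idle tells whether the tick was already stopped. */
uint64_t try_to_set_idle(bool *idle)
{
	if (*idle)
		return KTIME_MAX;

	return slow_path(idle);
}

unsigned int slow_path_count(void)
{
	return slow_path_calls;
}
```

Calling try_to_set_idle() twice with the same flag runs the calculation only once: the first call sets the flag, the second call returns KTIME_MAX right away.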
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: New, as this optimization was split from the patch before.
---
kernel/time/tick-sched.c | 2 +-
kernel/time/timer.c | 11 ++++++++---
2 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e1cdb7c6966..c6b415052c56 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -886,7 +886,7 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
unsigned long basejiff = ts->last_jiffies;
u64 basemono = ts->timer_expires_base;
- bool timer_idle;
+ bool timer_idle = ts->tick_stopped;
u64 expires;
/* Make sure we won't be trying to stop it twice in a row. */
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index df6558f62e6f..177bcde8a2c0 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1996,13 +1996,18 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
* timer_base_try_to_set_idle() - Try to set the idle state of the timer bases
* @basej: base time jiffies
* @basem: base time clock monotonic
- * @idle: pointer to store the value of timer_base->is_idle
+ * @idle: pointer to store the value of timer_base->is_idle on return;
+ * *idle contains the information whether tick was already stopped
*
- * Returns the tick aligned clock monotonic time of the next pending
- * timer or KTIME_MAX if no timer is pending.
+ * Returns the tick aligned clock monotonic time of the next pending timer or
+ * KTIME_MAX if no timer is pending. When tick was already stopped KTIME_MAX is
+ * returned as well.
*/
u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
{
+ if (*idle)
+ return KTIME_MAX;
+
return __get_next_timer_interrupt(basej, basem, idle);
}
--
2.39.2
There is an already existing function for forwarding the timer
base. Forwarding the timer base is implemented directly in
get_next_timer_interrupt() as well.
Remove the code duplication and invoke __forward_timer_base() instead.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
kernel/time/timer.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index e289cbd84e8c..fee42dda8237 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1939,15 +1939,9 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
/*
* We have a fresh next event. Check whether we can forward the
- * base. We can only do that when @basej is past base->clk
- * otherwise we might rewind base->clk.
+ * base.
*/
- if (time_after(basej, base->clk)) {
- if (time_after(nextevt, basej))
- base->clk = basej;
- else if (time_after(nextevt, base->clk))
- base->clk = nextevt;
- }
+ __forward_timer_base(base, basej);
if (time_before_eq(nextevt, basej)) {
expires = basem;
--
2.39.2
When adding a timer to the timer wheel using add_timer_on(), it is an
implicitly pinned timer. With the timer pull at expiry time model in place,
the TIMER_PINNED flag is required to make sure timers end up in the proper
base.
Add the TIMER_PINNED flag unconditionally when add_timer_on() is executed.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
kernel/time/timer.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 0ce0e6b25482..ea94479ee7e2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1284,7 +1284,10 @@ EXPORT_SYMBOL(add_timer_global);
* @timer: The timer to be started
* @cpu: The CPU to start it on
*
- * Same as add_timer() except that it starts the timer on the given CPU.
+ * Same as add_timer() except that it starts the timer on the given CPU and
+ * the TIMER_PINNED flag is set. When timer shouldn't be a pinned timer in
+ * the next round, add_timer_global() should be used instead as it unsets
+ * the TIMER_PINNED flag.
*
* See add_timer() for further details.
*/
@@ -1298,6 +1301,9 @@ void add_timer_on(struct timer_list *timer, int cpu)
if (WARN_ON_ONCE(timer_pending(timer)))
return;
+ /* Make sure timer flags have TIMER_PINNED flag set */
+ timer->flags |= TIMER_PINNED;
+
new_base = get_timer_cpu_base(timer->flags, cpu);
/*
--
2.39.2
A timer might be used as a pinned timer (using add_timer_on()) and later on
as a non pinned timer using add_timer(). When the NOHZ timer pull at expiry
model is in place, the TIMER_PINNED flag is required to be used whenever a
timer needs to expire on a dedicated CPU. The flag must not be set if
expiration on a dedicated CPU is not required.
add_timer_on()'s behavior will be changed during the preparation patches
for the NOHZ timer pull at expiry model to unconditionally set the
TIMER_PINNED flag. To be able to reset/set the flag when queueing a timer,
two variants of add_timer() are introduced.
This is a preparatory patch and has no functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: Update documentation to match kernel-doc style (missing brackets after
function names)
New in v6
---
include/linux/timer.h | 2 ++
kernel/time/timer.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 36 insertions(+)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 26a545bb0153..404bb31a95c7 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -179,6 +179,8 @@ extern int timer_reduce(struct timer_list *timer, unsigned long expires);
#define NEXT_TIMER_MAX_DELTA ((1UL << 30) - 1)
extern void add_timer(struct timer_list *timer);
+extern void add_timer_local(struct timer_list *timer);
+extern void add_timer_global(struct timer_list *timer);
extern int try_to_del_timer_sync(struct timer_list *timer);
extern int timer_delete_sync(struct timer_list *timer);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 177bcde8a2c0..0ce0e6b25482 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1245,6 +1245,40 @@ void add_timer(struct timer_list *timer)
}
EXPORT_SYMBOL(add_timer);
+/**
+ * add_timer_local() - Start a timer on the local CPU
+ * @timer: The timer to be started
+ *
+ * Same as add_timer() except that the timer flag TIMER_PINNED is set.
+ *
+ * See add_timer() for further details.
+ */
+void add_timer_local(struct timer_list *timer)
+{
+ if (WARN_ON_ONCE(timer_pending(timer)))
+ return;
+ timer->flags |= TIMER_PINNED;
+ __mod_timer(timer, timer->expires, MOD_TIMER_NOTPENDING);
+}
+EXPORT_SYMBOL(add_timer_local);
+
+/**
+ * add_timer_global() - Start a timer without TIMER_PINNED flag set
+ * @timer: The timer to be started
+ *
+ * Same as add_timer() except that the timer flag TIMER_PINNED is unset.
+ *
+ * See add_timer() for further details.
+ */
+void add_timer_global(struct timer_list *timer)
+{
+ if (WARN_ON_ONCE(timer_pending(timer)))
+ return;
+ timer->flags &= ~TIMER_PINNED;
+ __mod_timer(timer, timer->expires, MOD_TIMER_NOTPENDING);
+}
+EXPORT_SYMBOL(add_timer_global);
+
/**
* add_timer_on - Start a timer on a particular CPU
* @timer: The timer to be started
--
2.39.2
The implementation of the NOHZ pull at expiry model will change the timer
bases per CPU. Timers that have to expire on a specific CPU require the
TIMER_PINNED flag. If the CPU doesn't matter, the TIMER_PINNED flag must be
dropped. This is required for call sites which use the timer alternately as
a pinned and a non pinned timer, like workqueues do.
Therefore use add_timer_global() to make sure TIMER_PINNED flag is dropped.
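Why plain add_timer() is not sufficient for such call sites can be shown with a reduced model of the flag handling. The model_*() functions and struct timer_model are hypothetical and only mimic what the add_timer() variants of this series do with the flag; the numeric value of TIMER_PINNED is illustrative.

```c
#include <stdint.h>

#define TIMER_PINNED 0x00100000u	/* illustrative value for this model */

/* Hypothetical, reduced model of struct timer_list. */
struct timer_model {
	uint32_t flags;
};

/* add_timer(): leaves the flags untouched. */
void model_add_timer(struct timer_model *timer)
{
	(void)timer;
}

/* add_timer_on(): sets TIMER_PINNED unconditionally. */
void model_add_timer_on(struct timer_model *timer, int cpu)
{
	(void)cpu;
	timer->flags |= TIMER_PINNED;
}

/* add_timer_global(): makes sure TIMER_PINNED is dropped. */
void model_add_timer_global(struct timer_model *timer)
{
	timer->flags &= ~TIMER_PINNED;
}
```

A timer which was queued once with model_add_timer_on() stays pinned when it is queued again with model_add_timer(). Only model_add_timer_global() clears the flag, which matches what unbound delayed work needs.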
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Lai Jiangshan <[email protected]>
---
v6:
- New patch: As v6 provides unconditionally setting TIMER_PINNED flag in
add_timer_on() workqueue requires new add_timer_global() variant.
---
kernel/workqueue.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6e578f576a6f..3a1360794137 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1958,7 +1958,7 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
if (unlikely(cpu != WORK_CPU_UNBOUND))
add_timer_on(timer, cpu);
else
- add_timer(timer);
+ add_timer_global(timer);
}
/**
--
2.39.2
For the conversion of the NOHZ timer placement to a pull at expiry time
model it's required to have separate expiry times for the pinned and the
non-pinned (movable) timers. Therefore struct timer_events is introduced.
No functional change
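How the two expiry values are filled and combined can be modelled in user space as below. It is a simplified sketch without locking and base forwarding: next_expiry_model() is a made-up name, the way local_first is derived is an assumption as it is not spelled out here, and TICK_NSEC assumes HZ=1000.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_NSEC		1000000ULL		/* assumption: HZ=1000 */
#define KTIME_MAX		((uint64_t)INT64_MAX)	/* user-space stand-in */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)	time_after_eq(b, a)

struct timer_events {
	uint64_t local;
	uint64_t global;
};

uint64_t next_expiry_model(struct timer_events *tevt,
			   bool local_pending, unsigned long nextevt_local,
			   bool global_pending, unsigned long nextevt_global,
			   unsigned long basej, uint64_t basem)
{
	/* Assumption: ties are treated as "local expires first". */
	bool local_first = time_before_eq(nextevt_local, nextevt_global);
	unsigned long nextevt = local_first ? nextevt_local : nextevt_global;

	tevt->local = KTIME_MAX;
	tevt->global = KTIME_MAX;

	if (time_before_eq(nextevt, basej + 1)) {
		/* At most one tick away: only the local value matters. */
		if (time_before(nextevt, basej))
			nextevt = basej;
		tevt->local = basem + (uint64_t)(nextevt - basej) * TICK_NSEC;
	} else {
		if (!local_first && global_pending)
			tevt->global = basem + (uint64_t)(nextevt_global - basej) * TICK_NSEC;
		if (local_pending)
			tevt->local = basem + (uint64_t)(nextevt_local - basej) * TICK_NSEC;
	}

	return tevt->local < tevt->global ? tevt->local : tevt->global;
}
```

As long as only the minimum of both values is returned the caller sees the same expiry as before, which fits the "no functional change" statement. The separate values become relevant once global timers can be expired by another CPU.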
Originally-by: Richard Cochran (linutronix GmbH) <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9: Update was required (change of preceding patches)
---
kernel/time/timer.c | 35 +++++++++++++++++++++++++++++++----
1 file changed, 31 insertions(+), 4 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 366ea26ce3ba..0d53d853ae22 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -221,6 +221,11 @@ struct timer_base {
static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]);
+struct timer_events {
+ u64 local;
+ u64 global;
+};
+
#ifdef CONFIG_NO_HZ_COMMON
static DEFINE_STATIC_KEY_FALSE(timers_nohz_active);
@@ -1983,10 +1988,11 @@ static unsigned long next_timer_interrupt(struct timer_base *base,
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
bool *idle)
{
+ struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
unsigned long nextevt, nextevt_local, nextevt_global;
struct timer_base *base_local, *base_global;
- u64 expires = KTIME_MAX;
bool local_first;
+ u64 expires;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -1995,7 +2001,7 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
if (cpu_is_offline(smp_processor_id())) {
if (idle)
*idle = true;
- return expires;
+ return tevt.local;
}
base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
@@ -2022,13 +2028,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
nextevt = local_first ? nextevt_local : nextevt_global;
- if (base_local->timers_pending || base_global->timers_pending) {
+ /*
+ * If the @nextevt is at max. one tick away, use @nextevt and store
+ * it in the local expiry value. The next global event is irrelevant in
+ * this case and can be left as KTIME_MAX.
+ */
+ if (time_before_eq(nextevt, basej + 1)) {
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
- expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
+ tevt.local = basem + (u64)(nextevt - basej) * TICK_NSEC;
+ goto unlock;
}
+ /*
+ * Update tevt.* values:
+ *
+ * If the local queue expires first, then the global event can be
+ * ignored. If the global queue is empty, nothing to do either.
+ */
+ if (!local_first && base_global->timers_pending)
+ tevt.global = basem + (u64)(nextevt_global - basej) * TICK_NSEC;
+
+ if (base_local->timers_pending)
+ tevt.local = basem + (u64)(nextevt_local - basej) * TICK_NSEC;
+
/*
* We have a fresh next event. Check whether we can forward the
* base.
@@ -2058,9 +2082,12 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
trace_timer_base_idle(base_local->is_idle, base_local->cpu);
}
+unlock:
raw_spin_unlock(&base_global->lock);
raw_spin_unlock(&base_local->lock);
+ expires = min_t(u64, tevt.local, tevt.global);
+
return cmp_next_hrtimer_event(basem, expires);
}
--
2.39.2
Separate the storage space for pinned timers. Deferrable timers (no
matter whether pinned or non pinned) are still enqueued into their own base.
This is preparatory work for changing the NOHZ timer placement from a push
at enqueue time to a pull at expiry time model.
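The resulting base selection can be sketched as a small user-space model. This is an illustration under assumptions, not the kernel implementation: the MODEL_* flag bits are made-up values (the real ones live in include/linux/timer.h), model_base_index() is a hypothetical helper, and CONFIG_NO_HZ_COMMON=y is assumed so that three distinct bases exist.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag bits only; not the values used by the kernel. */
#define MODEL_TIMER_DEFERRABLE 0x1u
#define MODEL_TIMER_PINNED     0x2u

/* Base indices as defined by this patch for CONFIG_NO_HZ_COMMON=y. */
enum model_base { MODEL_BASE_LOCAL = 0, MODEL_BASE_GLOBAL = 1, MODEL_BASE_DEF = 2 };

/* Pick the base index for a timer with the given flags. */
static enum model_base model_base_index(uint32_t tflags)
{
	/* The model only knows the two flags defined above. */
	assert((tflags & ~(MODEL_TIMER_DEFERRABLE | MODEL_TIMER_PINNED)) == 0);

	/* Deferrable timers keep their own base, pinned or not. */
	if (tflags & MODEL_TIMER_DEFERRABLE)
		return MODEL_BASE_DEF;

	return (tflags & MODEL_TIMER_PINNED) ? MODEL_BASE_LOCAL : MODEL_BASE_GLOBAL;
}
```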
Originally-by: Richard Cochran (linutronix GmbH) <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9:
- Update was required (change of preceding patches)
v6:
- Drop setting the TIMER_PINNED flag in add_timer_on() and drop the related
warning. The add_timer_on() fix is split into a separate
patch. Therefore also drop the "Reviewed-by" of Frederic Weisbecker
v5:
- Add WARN_ONCE() in add_timer_on()
- Decrease patch size by splitting into three patches (this patch and the
two before)
v4:
- split out logic to forward base clock into a helper function
forward_base_clk() (Frederic)
- ease the code in run_local_timers() and timer_clear_idle() (Frederic)
---
kernel/time/timer.c | 95 ++++++++++++++++++++++++++++++++-------------
1 file changed, 68 insertions(+), 27 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index eda4972ca862..366ea26ce3ba 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -187,12 +187,18 @@ EXPORT_SYMBOL(jiffies_64);
#define WHEEL_SIZE (LVL_SIZE * LVL_DEPTH)
#ifdef CONFIG_NO_HZ_COMMON
-# define NR_BASES 2
-# define BASE_STD 0
-# define BASE_DEF 1
+/*
+ * If multiple bases need to be locked, use the base ordering for lock
+ * nesting, i.e. lowest number first.
+ */
+# define NR_BASES 3
+# define BASE_LOCAL 0
+# define BASE_GLOBAL 1
+# define BASE_DEF 2
#else
# define NR_BASES 1
-# define BASE_STD 0
+# define BASE_LOCAL 0
+# define BASE_GLOBAL 0
# define BASE_DEF 0
#endif
@@ -899,7 +905,10 @@ static int detach_if_pending(struct timer_list *timer, struct timer_base *base,
static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
{
- struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_STD], cpu);
+ int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+ struct timer_base *base;
+
+ base = per_cpu_ptr(&timer_bases[index], cpu);
/*
* If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -912,7 +921,10 @@ static inline struct timer_base *get_timer_cpu_base(u32 tflags, u32 cpu)
static inline struct timer_base *get_timer_this_cpu_base(u32 tflags)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ int index = tflags & TIMER_PINNED ? BASE_LOCAL : BASE_GLOBAL;
+ struct timer_base *base;
+
+ base = this_cpu_ptr(&timer_bases[index]);
/*
* If the timer is deferrable and NO_HZ_COMMON is set then we need
@@ -1971,9 +1983,10 @@ static unsigned long next_timer_interrupt(struct timer_base *base,
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
bool *idle)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ unsigned long nextevt, nextevt_local, nextevt_global;
+ struct timer_base *base_local, *base_global;
u64 expires = KTIME_MAX;
- unsigned long nextevt;
+ bool local_first;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -1985,10 +1998,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
return expires;
}
- raw_spin_lock(&base->lock);
- nextevt = next_timer_interrupt(base, basej);
+ base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
+ base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
+
+ raw_spin_lock(&base_local->lock);
+ raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
+
+ nextevt_local = next_timer_interrupt(base_local, basej);
+ nextevt_global = next_timer_interrupt(base_global, basej);
- if (base->timers_pending) {
+ /*
+ * Check whether the local event is expiring before or at the same
+ * time as the global event.
+ *
+ * Note, that nextevt_global and nextevt_local might be based on
+ * different base->clk values. So it's not guaranteed that
+ * comparing with empty bases results in a correct local_first.
+ */
+ if (base_local->timers_pending && base_global->timers_pending)
+ local_first = time_before_eq(nextevt_local, nextevt_global);
+ else
+ local_first = base_local->timers_pending;
+
+ nextevt = local_first ? nextevt_local : nextevt_global;
+
+ if (base_local->timers_pending || base_global->timers_pending) {
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
@@ -1999,28 +2033,33 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
* We have a fresh next event. Check whether we can forward the
* base.
*/
- __forward_timer_base(base, basej);
+ __forward_timer_base(base_local, basej);
+ __forward_timer_base(base_global, basej);
/*
* Set base->is_idle only when caller is timer_base_try_to_set_idle()
*/
if (idle) {
/*
- * Base is idle if the next event is more than a tick away.
+ * Bases are idle if the next event is more than a tick away.
*
* If the base is marked idle then any timer add operation must
* forward the base clk itself to keep granularity small. This
- * idle logic is only maintained for the BASE_STD base,
- * deferrable timers may still see large granularity skew (by
- * design).
+ * idle logic is only maintained for the BASE_LOCAL and
+ * BASE_GLOBAL base, deferrable timers may still see large
+ * granularity skew (by design).
*/
- if (!base->is_idle)
- base->is_idle = time_after(nextevt, basej + 1);
- *idle = base->is_idle;
- trace_timer_base_idle(base->is_idle, base->cpu);
+ if (!base_local->is_idle) {
+ bool is_idle = time_after(nextevt, basej + 1);
+
+ base_local->is_idle = base_global->is_idle = is_idle;
+ }
+ *idle = base_local->is_idle;
+ trace_timer_base_idle(base_local->is_idle, base_local->cpu);
}
- raw_spin_unlock(&base->lock);
+ raw_spin_unlock(&base_global->lock);
+ raw_spin_unlock(&base_local->lock);
return cmp_next_hrtimer_event(basem, expires);
}
@@ -2064,15 +2103,15 @@ u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
*/
void timer_clear_idle(void)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
-
/*
* We do this unlocked. The worst outcome is a remote enqueue sending
* a pointless IPI, but taking the lock would just make the window for
* sending the IPI a few instructions smaller for the cost of taking
* the lock in the exit from idle path.
*/
- base->is_idle = false;
+ __this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
+ __this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
+
trace_timer_base_idle(0, smp_processor_id());
}
#endif
@@ -2123,11 +2162,13 @@ static inline void __run_timers(struct timer_base *base)
*/
static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
__run_timers(base);
- if (IS_ENABLED(CONFIG_NO_HZ_COMMON))
+ if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
+ __run_timers(this_cpu_ptr(&timer_bases[BASE_GLOBAL]));
__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
+ }
}
/*
@@ -2135,7 +2176,7 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
*/
static void run_local_timers(void)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
hrtimer_run_queues();
--
2.39.2
The functionality for getting the next timer interrupt in
get_next_timer_interrupt() is split into a separate function
fetch_next_timer_interrupt() to be usable by other call sites.
This is preparatory work for the conversion of the NOHZ timer
placement to a pull at expiry time model. No functional change.
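The decision logic that ends up in the split-out helper can be modelled in user space as follows. This is a reduced sketch under assumptions: struct model_base only carries the two fields used here, the model_*() names and MODEL_* constants are hypothetical, HZ=1000 is assumed, and an empty base is expected to report a next_expiry far in the future.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TICK_NSEC 1000000ULL		/* assumes HZ=1000 */
#define MODEL_KTIME_MAX ((uint64_t)INT64_MAX)	/* stand-in for KTIME_MAX */

/* Reduced stand-in for struct timer_base: only the fields used here. */
struct model_base {
	bool timers_pending;
	unsigned long next_expiry;	/* in jiffies */
};

struct timer_events {
	uint64_t local;
	uint64_t global;
};

/* Wrap-safe comparisons, like the kernel's time_before{,_eq}(). */
static bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static bool model_time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/*
 * Fill @tevt and return the next event in jiffies. The caller presets
 * @tevt to MODEL_KTIME_MAX and, in the kernel, holds both base locks.
 */
static unsigned long model_fetch_next(unsigned long basej, uint64_t basem,
				      const struct model_base *local,
				      const struct model_base *global,
				      struct timer_events *tevt)
{
	unsigned long nextevt;
	bool local_first;

	assert(local && global && tevt);

	if (local->timers_pending && global->timers_pending)
		local_first = model_time_before_eq(local->next_expiry,
						   global->next_expiry);
	else
		local_first = local->timers_pending;

	nextevt = local_first ? local->next_expiry : global->next_expiry;

	/* At most one tick away: report it as a local event and stop. */
	if (model_time_before_eq(nextevt, basej + 1)) {
		if (model_time_before(nextevt, basej))
			nextevt = basej;	/* missed a tick, force 0 delta */
		tevt->local = basem + (uint64_t)(nextevt - basej) * MODEL_TICK_NSEC;
		return nextevt;
	}

	if (!local_first && global->timers_pending)
		tevt->global = basem +
			(uint64_t)(global->next_expiry - basej) * MODEL_TICK_NSEC;
	if (local->timers_pending)
		tevt->local = basem +
			(uint64_t)(local->next_expiry - basej) * MODEL_TICK_NSEC;

	return nextevt;
}
```

Because the helper takes the bases and the result storage as parameters, the same logic can serve a caller that looks at the bases of another CPU.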
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v9: Update was required (change of preceding patches)
v6: s/splitted/split
v5: Update commit message
v4: Fix typo in comment
---
kernel/time/timer.c | 64 +++++++++++++++++++++++++++------------------
1 file changed, 38 insertions(+), 26 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 0d53d853ae22..fc376e06980e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1985,30 +1985,13 @@ static unsigned long next_timer_interrupt(struct timer_base *base,
return base->next_expiry;
}
-static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
- bool *idle)
+static unsigned long fetch_next_timer_interrupt(unsigned long basej, u64 basem,
+ struct timer_base *base_local,
+ struct timer_base *base_global,
+ struct timer_events *tevt)
{
- struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
unsigned long nextevt, nextevt_local, nextevt_global;
- struct timer_base *base_local, *base_global;
bool local_first;
- u64 expires;
-
- /*
- * Pretend that there is no timer pending if the cpu is offline.
- * Possible pending timers will be migrated later to an active cpu.
- */
- if (cpu_is_offline(smp_processor_id())) {
- if (idle)
- *idle = true;
- return tevt.local;
- }
-
- base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
- base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
-
- raw_spin_lock(&base_local->lock);
- raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
nextevt_local = next_timer_interrupt(base_local, basej);
nextevt_global = next_timer_interrupt(base_global, basej);
@@ -2037,8 +2020,8 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
- tevt.local = basem + (u64)(nextevt - basej) * TICK_NSEC;
- goto unlock;
+ tevt->local = basem + (u64)(nextevt - basej) * TICK_NSEC;
+ return nextevt;
}
/*
@@ -2048,10 +2031,40 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
* ignored. If the global queue is empty, nothing to do either.
*/
if (!local_first && base_global->timers_pending)
- tevt.global = basem + (u64)(nextevt_global - basej) * TICK_NSEC;
+ tevt->global = basem + (u64)(nextevt_global - basej) * TICK_NSEC;
if (base_local->timers_pending)
- tevt.local = basem + (u64)(nextevt_local - basej) * TICK_NSEC;
+ tevt->local = basem + (u64)(nextevt_local - basej) * TICK_NSEC;
+
+ return nextevt;
+}
+
+static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
+ bool *idle)
+{
+ struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
+ struct timer_base *base_local, *base_global;
+ unsigned long nextevt;
+ u64 expires;
+
+ /*
+ * Pretend that there is no timer pending if the cpu is offline.
+ * Possible pending timers will be migrated later to an active cpu.
+ */
+ if (cpu_is_offline(smp_processor_id())) {
+ if (idle)
+ *idle = true;
+ return tevt.local;
+ }
+
+ base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
+ base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
+
+ raw_spin_lock(&base_local->lock);
+ raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
+
+ nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
+ base_global, &tevt);
/*
* We have a fresh next event. Check whether we can forward the
@@ -2082,7 +2095,6 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
trace_timer_base_idle(base_local->is_idle, base_local->cpu);
}
-unlock:
raw_spin_unlock(&base_global->lock);
raw_spin_unlock(&base_local->lock);
--
2.39.2
The logic for raising a softirq, the way it is implemented right now, is
readable for two timer bases. With an increasing number of timer bases, the
code gets harder to read. With the introduction of the timer migration
hierarchy, there will be three timer bases.
Therefore simplify the code. No functional change.
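The loop form scales to any number of bases. A minimal user-space sketch of the check, with a hypothetical struct model_base and model_*() helpers standing in for the kernel types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for struct timer_base. */
struct model_base {
	unsigned long next_expiry;	/* in jiffies */
};

/* Wrap-safe "a is after or equal to b", like the kernel's time_after_eq(). */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Return true if the timer softirq would be raised for any of the bases. */
static bool model_softirq_required(const struct model_base *bases,
				   size_t nr_bases, unsigned long now)
{
	assert(bases || nr_bases == 0);

	for (size_t i = 0; i < nr_bases; i++) {
		/* One expired base is enough, no need to look further. */
		if (model_time_after_eq(now, bases[i].next_expiry))
			return true;
	}
	return false;
}
```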
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v5: New patch to decrease patch size of follow up patches
---
kernel/time/timer.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ea94479ee7e2..b14d84f1fe50 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2132,16 +2132,14 @@ static void run_local_timers(void)
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
hrtimer_run_queues();
- /* Raise the softirq only if required. */
- if (time_before(jiffies, base->next_expiry)) {
- if (!IS_ENABLED(CONFIG_NO_HZ_COMMON))
- return;
- /* CPU is awake, so check the deferrable base. */
- base++;
- if (time_before(jiffies, base->next_expiry))
+
+ for (int i = 0; i < NR_BASES; i++, base++) {
+ /* Raise the softirq only if required. */
+ if (time_after_eq(jiffies, base->next_expiry)) {
+ raise_softirq(TIMER_SOFTIRQ);
return;
+ }
}
- raise_softirq(TIMER_SOFTIRQ);
}
/*
--
2.39.2
Due to the conversion of the NOHZ timer placement to a pull at expiry
time model, the per CPU timer bases with non pinned timers are no
longer handled only by the local CPU. If a remote CPU is already
expiring the non pinned timer base of the local CPU, nothing more
needs to be done by the local CPU. A check at the beginning of the expire
timers routine is required, because the timer base lock is dropped before
executing the timer callback function.
This is preparatory work and has no functional impact right now.
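The effect of the early return can be shown with a small user-space model. It is a sketch under assumptions: struct model_base, struct model_timer and model_run_timers() are hypothetical, there is no locking, and the expiry pass is reduced to a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_timer {
	int id;
};

/* Reduced stand-in for struct timer_base. */
struct model_base {
	struct model_timer *running_timer;	/* set while a callback runs */
	unsigned long clk;
	unsigned long next_expiry;
	unsigned int expired_runs;		/* expiry passes, for the example */
};

/* Returns true if this caller did the expiry work, false if it backed off. */
static bool model_run_timers(struct model_base *base, unsigned long now)
{
	assert(base);

	/* Someone else is already expiring this base: nothing more to do. */
	if (base->running_timer)
		return false;

	while ((long)(now - base->clk) >= 0 &&
	       (long)(now - base->next_expiry) >= 0) {
		base->expired_runs++;
		base->clk++;
	}
	return true;
}
```

In the kernel the window exists because the base lock is dropped around the callback, so a second caller can take the lock while running_timer is still set.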
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v6: Drop double negation
---
kernel/time/timer.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index b0fa8afe9059..a797603dfd49 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2232,6 +2232,9 @@ static inline void __run_timers(struct timer_base *base)
lockdep_assert_held(&base->lock);
+ if (base->running_timer)
+ return;
+
while (time_after_eq(jiffies, base->clk) &&
time_after_eq(jiffies, base->next_expiry)) {
levels = collect_expired_timers(base, heads);
--
2.39.2
To prepare for the conversion of the NOHZ timer placement to a pull at
expiry time model, functionality is required to get the next timer
interrupt on a remote CPU.
Locking of the timer bases and fetching the information for the next timer
interrupt are split into separate functions. This is required
to be compliant with lock ordering when the new model is in place.
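The intended calling sequence is lock, fetch, unlock, with the bases always taken in index order. The following toy model only illustrates that ordering: the locks are plain flags instead of raw spinlocks, and all model_*() names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { MODEL_BASE_LOCAL, MODEL_BASE_GLOBAL, MODEL_NR_BASES };

/* Toy lock: a flag instead of a raw spinlock, enough to show the ordering. */
struct model_base {
	bool locked;
};

struct model_cpu {
	struct model_base bases[MODEL_NR_BASES];
};

/* Lock both bases of @cpu, lowest base index first. */
static void model_lock_remote_bases(struct model_cpu *cpu)
{
	assert(cpu);
	assert(!cpu->bases[MODEL_BASE_LOCAL].locked);
	cpu->bases[MODEL_BASE_LOCAL].locked = true;
	assert(!cpu->bases[MODEL_BASE_GLOBAL].locked);
	cpu->bases[MODEL_BASE_GLOBAL].locked = true;
}

/* Unlock in reverse order. */
static void model_unlock_remote_bases(struct model_cpu *cpu)
{
	assert(cpu);
	cpu->bases[MODEL_BASE_GLOBAL].locked = false;
	cpu->bases[MODEL_BASE_LOCAL].locked = false;
}

/* Mirrors the lockdep_assert_held() checks done by the fetch function. */
static bool model_bases_locked(const struct model_cpu *cpu)
{
	assert(cpu);
	return cpu->bases[MODEL_BASE_LOCAL].locked &&
	       cpu->bases[MODEL_BASE_GLOBAL].locked;
}
```

Keeping lock and unlock as separate calls lets a caller take other locks between them in whatever order the new model requires.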
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
---
v8:
- Update comment
v7:
- Move functions into CONFIG_SMP && CONFIG_NO_HZ_COMMON section
- change lock, fetch functions to be unconditional
- split out unlock function into a separate function
v6:
- introduce timer_lock_remote_bases() to fix race
---
kernel/time/tick-internal.h | 10 +++++
kernel/time/timer.c | 76 ++++++++++++++++++++++++++++++++++---
2 files changed, 81 insertions(+), 5 deletions(-)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index dc12a938f00f..183ad32330fb 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -8,6 +8,11 @@
#include "timekeeping.h"
#include "tick-sched.h"
+struct timer_events {
+ u64 local;
+ u64 global;
+};
+
#ifdef CONFIG_GENERIC_CLOCKEVENTS
# define TICK_DO_TIMER_NONE -1
@@ -155,6 +160,11 @@ extern unsigned long tick_nohz_active;
extern void timers_update_nohz(void);
# ifdef CONFIG_SMP
extern struct static_key_false timers_migration_enabled;
+extern void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
+ struct timer_events *tevt,
+ unsigned int cpu);
+extern void timer_lock_remote_bases(unsigned int cpu);
+extern void timer_unlock_remote_bases(unsigned int cpu);
# endif
#else /* CONFIG_NO_HZ_COMMON */
static inline void timers_update_nohz(void) { }
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index fc376e06980e..2cff43c10329 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -221,11 +221,6 @@ struct timer_base {
static DEFINE_PER_CPU(struct timer_base, timer_bases[NR_BASES]);
-struct timer_events {
- u64 local;
- u64 global;
-};
-
#ifdef CONFIG_NO_HZ_COMMON
static DEFINE_STATIC_KEY_FALSE(timers_nohz_active);
@@ -2039,6 +2034,77 @@ static unsigned long fetch_next_timer_interrupt(unsigned long basej, u64 basem,
return nextevt;
}
+# ifdef CONFIG_SMP
+/**
+ * fetch_next_timer_interrupt_remote() - Store next timers into @tevt
+ * @basej: base time jiffies
+ * @basem: base time clock monotonic
+ * @tevt: Pointer to the storage for the expiry values
+ * @cpu: Remote CPU
+ *
+ * Stores the next pending local and global timer expiry values in the
+ * struct pointed to by @tevt. If a queue is empty the corresponding
+ * field is set to KTIME_MAX. If local event expires before global
+ * event, global event is set to KTIME_MAX as well.
+ *
+ * Caller needs to make sure timer base locks are held (use
+ * timer_lock_remote_bases() for this purpose).
+ */
+void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
+ struct timer_events *tevt,
+ unsigned int cpu)
+{
+ struct timer_base *base_local, *base_global;
+
+ /* Preset local / global events */
+ tevt->local = tevt->global = KTIME_MAX;
+
+ base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
+ base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
+
+ lockdep_assert_held(&base_local->lock);
+ lockdep_assert_held(&base_global->lock);
+
+ fetch_next_timer_interrupt(basej, basem, base_local, base_global, tevt);
+}
+
+/**
+ * timer_unlock_remote_bases - unlock timer bases of cpu
+ * @cpu: Remote CPU
+ *
+ * Unlocks the remote timer bases.
+ */
+void timer_unlock_remote_bases(unsigned int cpu)
+{
+ struct timer_base *base_local, *base_global;
+
+ base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
+ base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
+
+ raw_spin_unlock(&base_global->lock);
+ raw_spin_unlock(&base_local->lock);
+}
+
+/**
+ * timer_lock_remote_bases - lock timer bases of cpu
+ * @cpu: Remote CPU
+ *
+ * Locks the remote timer bases.
+ */
+void timer_lock_remote_bases(unsigned int cpu)
+{
+ struct timer_base *base_local, *base_global;
+
+ base_local = per_cpu_ptr(&timer_bases[BASE_LOCAL], cpu);
+ base_global = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
+
+ lockdep_assert_irqs_disabled();
+
+ raw_spin_lock(&base_local->lock);
+ raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
+}
+# endif /* CONFIG_SMP */
+
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
bool *idle)
{
--
2.39.2
From: "Richard Cochran (linutronix GmbH)" <[email protected]>
The logic to get the time of the last jiffies update will be needed by
the timer pull model as well.
Move the code into a global function in anticipation of the new caller.
No functional change.
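The read side relies on a sequence counter retry loop. A user-space sketch of that pattern with C11 atomics is shown below. It is an illustration under assumptions, not the kernel's seqcount implementation: all model_* names are hypothetical and only a single writer is supported.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Toy sequence counter: odd while a writer is active, even otherwise. */
static atomic_uint model_seq;
static _Atomic uint64_t model_last_update;	/* stand-in for last_jiffies_update */
static _Atomic unsigned long model_jiffies;	/* stand-in for jiffies */

/* Writer side (single writer assumed), shown to make the example complete. */
static void model_tick_update(uint64_t now_ns)
{
	atomic_fetch_add(&model_seq, 1);	/* odd: update in progress */
	atomic_store(&model_last_update, now_ns);
	atomic_fetch_add(&model_jiffies, 1);
	atomic_fetch_add(&model_seq, 1);	/* even: update done */
}

/* Reader: retry until both values were read without a concurrent update. */
static uint64_t model_get_jiffies_update(unsigned long *basej)
{
	unsigned long basejiff;
	uint64_t basemono;
	unsigned int seq;

	assert(basej);

	do {
		seq = atomic_load(&model_seq);
		basemono = atomic_load(&model_last_update);
		basejiff = atomic_load(&model_jiffies);
	} while ((seq & 1) || seq != atomic_load(&model_seq));

	*basej = basejiff;
	return basemono;
}
```

Returning one value and passing the other through a pointer follows the signature chosen in the patch, so a caller gets a consistent pair from a single call.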
Signed-off-by: Richard Cochran (linutronix GmbH) <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
kernel/time/tick-internal.h | 1 +
kernel/time/tick-sched.c | 18 +++++++++++++++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 183ad32330fb..e0e58dd18919 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -158,6 +158,7 @@ static inline void tick_nohz_init(void) { }
#ifdef CONFIG_NO_HZ_COMMON
extern unsigned long tick_nohz_active;
extern void timers_update_nohz(void);
+extern u64 get_jiffies_update(unsigned long *basej);
# ifdef CONFIG_SMP
extern struct static_key_false timers_migration_enabled;
extern void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c6b415052c56..aca0e133ba09 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -799,18 +799,30 @@ static inline bool local_timer_softirq_pending(void)
return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
}
-static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
+/*
+ * Read jiffies and the time when jiffies were updated last
+ */
+u64 get_jiffies_update(unsigned long *basej)
{
- u64 basemono, next_tick, delta, expires;
unsigned long basejiff;
unsigned int seq;
+ u64 basemono;
- /* Read jiffies and the time when jiffies were updated last */
do {
seq = read_seqcount_begin(&jiffies_seq);
basemono = last_jiffies_update;
basejiff = jiffies;
} while (read_seqcount_retry(&jiffies_seq, seq));
+ *basej = basejiff;
+ return basemono;
+}
+
+static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
+{
+ u64 basemono, next_tick, delta, expires;
+ unsigned long basejiff;
+
+ basemono = get_jiffies_update(&basejiff);
ts->last_jiffies = basejiff;
ts->timer_expires_base = basemono;
--
2.39.2
To prepare for the conversion of the NOHZ timer placement to a pull at
expiry time model, a function is required that returns the value
of the is_idle flag of the timer base, to keep the hierarchy state in sync
with the timer base state when a CPU comes online.
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: new in v9
---
kernel/time/tick-internal.h | 1 +
kernel/time/timer.c | 10 ++++++++++
2 files changed, 11 insertions(+)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index e0e58dd18919..2d1a44850c20 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -166,6 +166,7 @@ extern void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
unsigned int cpu);
extern void timer_lock_remote_bases(unsigned int cpu);
extern void timer_unlock_remote_bases(unsigned int cpu);
+extern bool timer_base_is_idle(void);
# endif
#else /* CONFIG_NO_HZ_COMMON */
static inline void timers_update_nohz(void) { }
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a797603dfd49..b6c9ac0c3712 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2201,6 +2201,16 @@ u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
return __get_next_timer_interrupt(basej, basem, idle);
}
+/**
+ * timer_base_is_idle() - Return whether timer base is set idle
+ *
+ * Returns value of local timer base is_idle value.
+ */
+bool timer_base_is_idle(void)
+{
+ return __this_cpu_read(timer_bases[BASE_LOCAL].is_idle);
+}
+
/**
* timer_clear_idle - Clear the idle state of the timer base
*
--
2.39.2
The timer base is marked idle when get_next_timer_interrupt() is
executed. But the decision whether the tick will be stopped and whether the
system is able to go idle is made later. When the timer base is marked
idle and a new first timer is enqueued remotely, an IPI is raised, even if it
is not required because the tick is not stopped and the timer base is
evaluated again at the next tick.
To prevent this, the timer base is marked idle in tick_nohz_stop_tick() and
get_next_timer_interrupt() is streamlined by only looking for the next
timer interrupt. All other work is postponed to
timer_base_try_to_set_idle() which is called by tick_nohz_stop_tick().
While at it fix some nearby whitespace damage as well.
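The expiry selection done in tick_nohz_stop_tick() after trying to mark the base idle can be written as a pure function. This is a user-space sketch under assumptions: model_pick_expiry() and its parameter names are hypothetical, and HZ=1000 is assumed for the tick length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_TICK_NSEC 1000000ULL	/* assumes HZ=1000 */

/*
 * @basemono:       time of the last jiffies update
 * @planned_expiry: value computed earlier (ts->timer_expires in the patch)
 * @fresh_expiry:   value returned when trying to mark the base idle
 * @timer_idle:     whether the base could be marked idle
 */
static uint64_t model_pick_expiry(uint64_t basemono, uint64_t planned_expiry,
				  uint64_t fresh_expiry, bool timer_idle)
{
	/* Model assumption: the planned expiry is not in the past. */
	assert(planned_expiry >= basemono);

	/* Base not idle: a timer is due within a tick, so keep ticking. */
	if (!timer_idle)
		return basemono + MODEL_TICK_NSEC;

	/* First timer vanished in between: do not sleep longer than planned. */
	if (fresh_expiry > planned_expiry)
		return planned_expiry;

	return fresh_expiry;
}
```

The second branch keeps the sleep length that cpuidle already based its C-state choice on, even if the timer wheel would now allow a longer one.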
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9:
- update to the changes of the patch before
- Cleanup logic in tick_nohz_stop_tick() after executing timer_base_try_to_set_idle()
---
kernel/time/tick-internal.h | 1 +
kernel/time/tick-sched.c | 46 ++++++++++++++++++++++++++---------
kernel/time/timer.c | 48 ++++++++++++++++++++++++++++---------
3 files changed, 73 insertions(+), 22 deletions(-)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 649f2b48e8f0..dc12a938f00f 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -164,6 +164,7 @@ static inline void timers_update_nohz(void) { }
DECLARE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases);
extern u64 get_next_timer_interrupt(unsigned long basej, u64 basem);
+u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle);
void timer_clear_idle(void);
#define CLOCK_SET_WALL \
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b1b591de781e..3e1cdb7c6966 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -849,11 +849,6 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
*/
delta = next_tick - basemono;
if (delta <= (u64)TICK_NSEC) {
- /*
- * Tell the timer code that the base is not idle, i.e. undo
- * the effect of get_next_timer_interrupt():
- */
- timer_clear_idle();
/*
* We've not stopped the tick yet, and there's a timer in the
* next period, so no point in stopping it either, bail.
@@ -889,12 +884,41 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
{
struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);
+ unsigned long basejiff = ts->last_jiffies;
u64 basemono = ts->timer_expires_base;
- u64 expires = ts->timer_expires;
+ bool timer_idle;
+ u64 expires;
/* Make sure we won't be trying to stop it twice in a row. */
ts->timer_expires_base = 0;
+ /*
+ * Now the tick should be stopped definitely - so the timer base needs
+ * to be marked idle as well to not miss a newly queued timer.
+ */
+ expires = timer_base_try_to_set_idle(basejiff, basemono, &timer_idle);
+ if (!timer_idle) {
+ /*
+ * Do not clear tick_stopped here when it was already set - it
+ * will be retained on the next idle iteration when the tick
+ * expired earlier than expected.
+ */
+ expires = basemono + TICK_NSEC;
+ } else if (expires > ts->timer_expires) {
+ /*
+ * This path could only happen when the first timer was removed
+ * between calculating the possible sleep length and now (when
+ * high resolution mode is not active, timer could also be a
+ * hrtimer).
+ *
+ * We have to stick to the original calculated expiry value to
+ * not stop the tick for too long with a shallow C-state (which
+ * was programmed by cpuidle because of an early next expiration
+ * value).
+ */
+ expires = ts->timer_expires;
+ }
+
/*
* If this CPU is the one which updates jiffies, then give up
* the assignment and let it be taken by the CPU which runs
@@ -930,6 +954,10 @@ static void tick_nohz_stop_tick(struct tick_sched *ts, int cpu)
* in tick_nohz_restart_sched_tick().
*/
if (!ts->tick_stopped) {
+ /* If the timer base is not idle, retain the tick. */
+ if (!timer_idle)
+ return;
+
calc_load_nohz_start();
quiet_vmstat();
@@ -991,7 +1019,7 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
touch_softlockup_watchdog_sched();
/* Cancel the scheduled timer and restore the tick: */
- ts->tick_stopped = 0;
+ ts->tick_stopped = 0;
tick_nohz_restart(ts, now);
}
@@ -1147,10 +1175,6 @@ void tick_nohz_idle_stop_tick(void)
void tick_nohz_idle_retain_tick(void)
{
tick_nohz_retain_tick(this_cpu_ptr(&tick_cpu_sched));
- /*
- * Undo the effect of get_next_timer_interrupt() called from
- * tick_nohz_next_event().
- */
timer_clear_idle();
}
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index c9f7f86e95fd..df6558f62e6f 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1911,7 +1911,8 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
}
-static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem)
+static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
+ bool *idle)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
@@ -1921,8 +1922,11 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem)
* Pretend that there is no timer pending if the cpu is offline.
* Possible pending timers will be migrated later to an active cpu.
*/
- if (cpu_is_offline(smp_processor_id()))
+ if (cpu_is_offline(smp_processor_id())) {
+ if (idle)
+ *idle = true;
return expires;
+ }
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
@@ -1952,16 +1956,24 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem)
__forward_timer_base(base, basej);
/*
- * Base is idle if the next event is more than a tick away.
- *
- * If the base is marked idle then any timer add operation must forward
- * the base clk itself to keep granularity small. This idle logic is
- * only maintained for the BASE_STD base, deferrable timers may still
- * see large granularity skew (by design).
+ * Set base->is_idle only when caller is timer_base_try_to_set_idle()
*/
- base->is_idle = time_after(nextevt, basej + 1);
+ if (idle) {
+ /*
+ * Base is idle if the next event is more than a tick away.
+ *
+ * If the base is marked idle then any timer add operation must
+ * forward the base clk itself to keep granularity small. This
+ * idle logic is only maintained for the BASE_STD base,
+ * deferrable timers may still see large granularity skew (by
+ * design).
+ */
+ if (!base->is_idle)
+ base->is_idle = time_after(nextevt, basej + 1);
+ *idle = base->is_idle;
+ trace_timer_base_idle(base->is_idle, base->cpu);
+ }
- trace_timer_base_idle(base->is_idle, base->cpu);
raw_spin_unlock(&base->lock);
return cmp_next_hrtimer_event(basem, expires);
@@ -1977,7 +1989,21 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
{
- return __get_next_timer_interrupt(basej, basem);
+ return __get_next_timer_interrupt(basej, basem, NULL);
+}
+
+/**
+ * timer_base_try_to_set_idle() - Try to set the idle state of the timer bases
+ * @basej: base time jiffies
+ * @basem: base time clock monotonic
+ * @idle: pointer to store the value of timer_base->is_idle
+ *
+ * Returns the tick aligned clock monotonic time of the next pending
+ * timer or KTIME_MAX if no timer is pending.
+ */
+u64 timer_base_try_to_set_idle(unsigned long basej, u64 basem, bool *idle)
+{
+ return __get_next_timer_interrupt(basej, basem, idle);
}
/**
--
2.39.2
The timer pull model is in place so we can remove the heuristics which try
to guess the best target CPU at enqueue/modification time.
All non pinned timers are queued on the local CPU in the separate storage
and eventually pulled at expiry time to a remote CPU.
Originally-by: Richard Cochran (linutronix GmbH) <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9:
- Update to the changes of the preceding patches
v6:
- Update TIMER_PINNED flag description.
v5:
- Move WARN_ONCE() in add_timer_on() into a previous patch
- Fold crystal ball magic related hunks into this patch
v4: Update comment about TIMER_PINNED flag (heuristic is removed)
---
include/linux/timer.h | 14 ++++---------
kernel/time/timer.c | 46 +++++++++++++++++++++----------------------
2 files changed, 26 insertions(+), 34 deletions(-)
diff --git a/include/linux/timer.h b/include/linux/timer.h
index 404bb31a95c7..4dd59e4e5681 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -50,16 +50,10 @@ struct timer_list {
* workqueue locking issues. It's not meant for executing random crap
* with interrupts disabled. Abuse is monitored!
*
- * @TIMER_PINNED: A pinned timer will not be affected by any timer
- * placement heuristics (like, NOHZ) and will always expire on the CPU
- * on which the timer was enqueued.
- *
- * Note: Because enqueuing of timers can migrate the timer from one
- * CPU to another, pinned timers are not guaranteed to stay on the
- * initialy selected CPU. They move to the CPU on which the enqueue
- * function is invoked via mod_timer() or add_timer(). If the timer
- * should be placed on a particular CPU, then add_timer_on() has to be
- * used.
+ * @TIMER_PINNED: A pinned timer will always expire on the CPU on which the
+ * timer was enqueued. When a particular CPU is required, add_timer_on()
+ * has to be used. Enqueue via mod_timer() and add_timer() is always done
+ * on the local CPU.
*/
#define TIMER_CPUMASK 0x0003FFFF
#define TIMER_MIGRATING 0x00040000
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ac3e888d053f..6e9e1d852438 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -590,10 +590,13 @@ trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
/*
* We might have to IPI the remote CPU if the base is idle and the
- * timer is not deferrable. If the other CPU is on the way to idle
- * then it can't set base->is_idle as we hold the base lock:
+ * timer is pinned. A non pinned timer is only queued on the remote
+ * CPU when the timer was running during queueing. Then everything
+ * is handled by the remote CPU anyway. If the other CPU is on the
+ * way to idle then it can't set base->is_idle as we hold the base
+ * lock:
*/
- if (base->is_idle)
+ if (base->is_idle && timer->flags & TIMER_PINNED)
wake_up_nohz_cpu(base->cpu);
}
@@ -941,17 +944,6 @@ static inline struct timer_base *get_timer_base(u32 tflags)
return get_timer_cpu_base(tflags, tflags & TIMER_CPUMASK);
}
-static inline struct timer_base *
-get_target_base(struct timer_base *base, unsigned tflags)
-{
-#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
- if (static_branch_likely(&timers_migration_enabled) &&
- !(tflags & TIMER_PINNED))
- return get_timer_cpu_base(tflags, get_nohz_timer_target());
-#endif
- return get_timer_this_cpu_base(tflags);
-}
-
static inline void __forward_timer_base(struct timer_base *base,
unsigned long basej)
{
@@ -1106,7 +1098,7 @@ __mod_timer(struct timer_list *timer, unsigned long expires, unsigned int option
if (!ret && (options & MOD_TIMER_PENDING_ONLY))
goto out_unlock;
- new_base = get_target_base(base, timer->flags);
+ new_base = get_timer_this_cpu_base(timer->flags);
if (base != new_base) {
/*
@@ -2228,11 +2220,17 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
* BASE_GLOBAL base, deferrable timers may still see large
* granularity skew (by design).
*/
- if (!base_local->is_idle) {
- bool is_idle = time_after(nextevt, basej + 1);
- base_local->is_idle = base_global->is_idle = is_idle;
- }
+ /*
+ * base->is_idle information is required to wake up an idle CPU
+ * when a new timer was enqueued. Only pinned timers can be
+ * enqueued remotely into an idle base. Therefore maintain only
+ * base_local->is_idle information and ignore
+ * base_global->is_idle information.
+ */
+ if (!base_local->is_idle)
+ base_local->is_idle = time_after(nextevt, basej + 1);
+
*idle = base_local->is_idle;
trace_timer_base_idle(base_local->is_idle, base_local->cpu);
@@ -2307,13 +2305,13 @@ bool timer_base_is_idle(void)
void timer_clear_idle(void)
{
/*
- * We do this unlocked. The worst outcome is a remote enqueue sending
- * a pointless IPI, but taking the lock would just make the window for
- * sending the IPI a few instructions smaller for the cost of taking
- * the lock in the exit from idle path.
+ * We do this unlocked. The worst outcome is a remote pinned timer
+ * enqueue sending a pointless IPI, but taking the lock would just
+ * make the window for sending the IPI a few instructions smaller
+ * for the cost of taking the lock in the exit from idle
+ * path. Required for BASE_LOCAL only.
*/
__this_cpu_write(timer_bases[BASE_LOCAL].is_idle, false);
- __this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
trace_timer_base_idle(0, smp_processor_id());
--
2.39.2
Logic for getting the next timer interrupt (no matter whether it is
recalculated or already stored in base->next_expiry) is split into a separate
function "next_timer_interrupt()" to make it available for new call sites.
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: Adapt to the fix for empty timer bases.
---
kernel/time/timer.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index b14d84f1fe50..eda4972ca862 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1951,12 +1951,29 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
return DIV_ROUND_UP_ULL(nextevt, TICK_NSEC) * TICK_NSEC;
}
+static unsigned long next_timer_interrupt(struct timer_base *base,
+ unsigned long basej)
+{
+ if (base->next_expiry_recalc)
+ next_expiry_recalc(base);
+
+ /*
+ * Move next_expiry for the empty base into the future to prevent an
+ * unnecessary raise of the timer softirq when the next_expiry value
+ * will be reached even if there is no timer pending.
+ */
+ if (!base->timers_pending)
+ base->next_expiry = basej + NEXT_TIMER_MAX_DELTA;
+
+ return base->next_expiry;
+}
+
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
bool *idle)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
- unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
u64 expires = KTIME_MAX;
+ unsigned long nextevt;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -1969,24 +1986,13 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
}
raw_spin_lock(&base->lock);
- if (base->next_expiry_recalc)
- next_expiry_recalc(base);
+ nextevt = next_timer_interrupt(base, basej);
if (base->timers_pending) {
- nextevt = base->next_expiry;
-
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
- } else {
- /*
- * Move next_expiry for the empty base into the future to
- * prevent a unnecessary raise of the timer softirq when the
- * next_expiry value will be reached even if there is no timer
- * pending.
- */
- base->next_expiry = nextevt;
}
/*
--
2.39.2
Placing timers at enqueue time on a target CPU based on dubious heuristics
does not make any sense:
1) Most timer wheel timers are canceled or rearmed before they expire.
2) The heuristics to predict which CPU will be busy when the timer expires
are wrong by definition.
So placing the timers at enqueue wastes precious cycles.
The proper solution to this problem is to always queue the timers on the
local CPU and allow the non pinned timers to be pulled onto a busy CPU at
expiry time.
Therefore split the timer storage into local pinned and global timers:
Local pinned timers are always expired on the CPU on which they have been
queued. Global timers can be expired on any CPU.
As long as a CPU is busy it expires both local and global timers. When a
CPU goes idle it arms for the first expiring local timer. If the first
expiring pinned (local) timer is before the first expiring movable timer,
then no action is required because the CPU will wake up before the first
movable timer expires. If the first expiring movable timer is before the
first expiring pinned (local) timer, then this timer is queued into an idle
timerqueue and eventually expired by some other active CPU.
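The go-idle decision described above can be sketched in plain C. This is only
an illustration under stated assumptions, not the code of this patch: the
struct, the SKETCH_NO_TIMER constant and the function name are hypothetical,
ties are assumed to be treated as "local first", and the special case of the
last CPU in the hierarchy going idle is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for "no timer pending" (KTIME_MAX in the kernel). */
#define SKETCH_NO_TIMER UINT64_MAX

/* First expiring timer per class; mirrors the local/global split above. */
struct sketch_events {
	uint64_t local;		/* first pinned (local) timer */
	uint64_t global;	/* first movable (global) timer */
};

/*
 * Return true when the first global timer expires before the first local
 * timer and therefore has to be handed over to the idle timerqueue. When
 * the local timer is first (or both expire at the same time, an assumption
 * of this sketch), the CPU wakes up in time anyway and nothing is queued.
 */
static bool sketch_must_queue_global(const struct sketch_events *ev)
{
	if (ev->global == SKETCH_NO_TIMER)
		return false;

	return ev->global < ev->local;
}
```

In both cases the idle CPU itself only arms for ev->local; the global timer
is left to whichever CPU is the migrator at expiry time.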
To avoid global locking the timerqueues are implemented as a hierarchy. The
lowest level of the hierarchy holds the CPUs. The CPUs are associated to
groups of 8, which are separated per node. If more than one CPU group
exists, then a second level in the hierarchy collects the groups. Depending
on the size of the system more than 2 levels may be required. Each group has
a "migrator" which checks the timerqueue during the tick for remote expirable
timers.
If the last CPU in a group goes idle it reports the first expiring event in
the group up to the next group(s) in the hierarchy. If the last CPU goes
idle it arms its timer for the first system wide expiring timer to ensure
that no timer event is missed.
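As a rough illustration of how the hierarchy depth grows with the system
size, the following sketch counts the levels needed when every group holds up
to 8 children. It is a simplification under stated assumptions: the function
and macro names are hypothetical, and the per node separation of the groups
is ignored, so a real multi node system can end up with more levels than this
returns.

```c
#include <assert.h>

/* Assumed fan-out: up to 8 children (CPUs or groups) per group. */
#define SKETCH_CHILDREN_PER_GROUP 8u

/*
 * Number of hierarchy levels needed for @ncpus CPUs when nodes are ignored.
 * Each iteration packs the units of the level below into groups of up to 8
 * until a single top level group is left.
 */
static unsigned int sketch_hierarchy_levels(unsigned int ncpus)
{
	unsigned int levels = 0;
	unsigned int units = ncpus;

	assert(ncpus > 0);

	do {
		units = (units + SKETCH_CHILDREN_PER_GROUP - 1) /
			SKETCH_CHILDREN_PER_GROUP;
		levels++;
	} while (units > 1);

	return levels;
}
```

With these assumptions up to 8 CPUs need one level, 9 to 64 CPUs need two
levels and 65 CPUs already need three.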
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9:
- Adapt to the changes of the preceding patches
- Fix state inconsistency (when timer base is idle, cpu must also be
marked as idle in hierarchy)
- Make sure new timers are considered, when timer base is idle and a
timer is enqueued into global queue (e.g. during interrupt) ->
timer_use_tmigr()
- Changes which are required due to the timer code change of marking the
timer base idle in tick_nohz_stop_tick()
v8:
- Review of Frederic:
- Fix hotplug race (introduction of wakeup_recalc)
- Make wakeup and wakeup_recalc logic consistent all over the place
- Fix child/group state race and read it with locks held
- Add more clarifying comments
- Fix grammar all over the place
- change integers which act as boolean value into bool
- rewrite condition in tmigr_check_migrator() without negation
- Improve update events logic with a check of the first event
- Implement a quick forecast which is called when
get_next_timer_interrupt() is executed.
v7:
- Review remarks of Frederic and bigeasy:
- change logic in tmigr_handle_remote_cpu()
- s/kzalloc/kcalloc
- move timer_expire_remote() into NO_HZ_COMMON && SMP config section
- drop DBG_BUG_ON() macro and use only WARN_ON_ONCE()
- remove leftovers from sibling logic during setup
- Move timer_expire_remote() into tick-internal.h
- Add documentation section about "Required event and timerqueue update
after remote expiry"
- Fix fallout of kernel test robot
v6:
- Fix typos
- Review remarks of Peter Zijlstra (locking, struct member cleanup, use
atomic_try_cmpxchg(), update struct member descriptions)
- Fix race in tmigr_handle_remote_cpu() (Frederic Weisbecker)
v5:
- Review remarks of Frederic
- Return nextevt when CPU is marked offline in timer migration hierarchy
instead of KTIME_MAX
- Fix update of group events issue, after remote expiring
v4:
- Fold typo fix in comment into proper patch "timer: Split out "get next
timer interrupt" functionality"
- Update wrong comment for tmigr_state union definition
- Fix fallout of kernel test robot
---
include/linux/cpuhotplug.h | 1 +
kernel/time/Makefile | 3 +
kernel/time/tick-internal.h | 1 +
kernel/time/timer.c | 111 ++-
kernel/time/timer_migration.c | 1636 +++++++++++++++++++++++++++++++++
kernel/time/timer_migration.h | 144 +++
6 files changed, 1888 insertions(+), 8 deletions(-)
create mode 100644 kernel/time/timer_migration.c
create mode 100644 kernel/time/timer_migration.h
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index efc0c0b07efb..85a78c5a7c01 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -245,6 +245,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_POWERPC_HV_24x7_ONLINE,
CPUHP_AP_PERF_POWERPC_HV_GPCI_ONLINE,
CPUHP_AP_PERF_CSKY_ONLINE,
+ CPUHP_AP_TMIGR_ONLINE,
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RANDOM_ONLINE,
diff --git a/kernel/time/Makefile b/kernel/time/Makefile
index 7e875e63ff3b..4af2a264a160 100644
--- a/kernel/time/Makefile
+++ b/kernel/time/Makefile
@@ -17,6 +17,9 @@ endif
obj-$(CONFIG_GENERIC_SCHED_CLOCK) += sched_clock.o
obj-$(CONFIG_TICK_ONESHOT) += tick-oneshot.o tick-sched.o
obj-$(CONFIG_LEGACY_TIMER_TICK) += tick-legacy.o
+ifeq ($(CONFIG_SMP),y)
+ obj-$(CONFIG_NO_HZ_COMMON) += timer_migration.o
+endif
obj-$(CONFIG_HAVE_GENERIC_VDSO) += vsyscall.o
obj-$(CONFIG_DEBUG_FS) += timekeeping_debug.o
obj-$(CONFIG_TEST_UDELAY) += test_udelay.o
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index 2d1a44850c20..9cec0c0c314f 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -167,6 +167,7 @@ extern void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
extern void timer_lock_remote_bases(unsigned int cpu);
extern void timer_unlock_remote_bases(unsigned int cpu);
extern bool timer_base_is_idle(void);
+extern void timer_expire_remote(unsigned int cpu);
# endif
#else /* CONFIG_NO_HZ_COMMON */
static inline void timers_update_nohz(void) { }
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index b6c9ac0c3712..ac3e888d053f 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -53,6 +53,7 @@
#include <asm/io.h>
#include "tick-internal.h"
+#include "timer_migration.h"
#define CREATE_TRACE_POINTS
#include <trace/events/timer.h>
@@ -2103,6 +2104,64 @@ void timer_lock_remote_bases(unsigned int cpu)
raw_spin_lock(&base_local->lock);
raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
}
+
+static void __run_timer_base(struct timer_base *base);
+
+/**
+ * timer_expire_remote() - expire global timers of cpu
+ * @cpu: Remote CPU
+ *
+ * Expire timers of global base of remote CPU.
+ */
+void timer_expire_remote(unsigned int cpu)
+{
+ struct timer_base *base = per_cpu_ptr(&timer_bases[BASE_GLOBAL], cpu);
+
+ __run_timer_base(base);
+}
+
+static void timer_use_tmigr(unsigned long basej, u64 basem,
+ unsigned long *nextevt, bool *tick_stop_path,
+ bool timer_base_idle, struct timer_events *tevt)
+{
+ u64 next_tmigr;
+
+ if (timer_base_idle)
+ next_tmigr = tmigr_cpu_new_timer(tevt->global);
+ else if (tick_stop_path)
+ next_tmigr = tmigr_cpu_deactivate(tevt->global);
+ else
+ next_tmigr = tmigr_quick_check();
+
+ /*
+ * If the CPU is the last going idle in timer migration hierarchy, make
+ * sure the CPU will wake up in time to handle remote timers.
+ * next_tmigr == KTIME_MAX if other CPUs are still active.
+ */
+ if (next_tmigr < tevt->local) {
+ u64 tmp;
+
+ /* If we missed a tick already, force 0 delta */
+ if (next_tmigr < basem)
+ next_tmigr = basem;
+
+ tmp = div_u64(next_tmigr - basem, TICK_NSEC);
+
+ *nextevt = basej + (unsigned long)tmp;
+ tevt->local = next_tmigr;
+ }
+}
+# else
+static void timer_use_tmigr(unsigned long basej, u64 basem,
+ unsigned long *nextevt, bool *tick_stop_path,
+ bool timer_base_idle, struct timer_events *tevt)
+{
+ /*
+ * Make sure first event is written into tevt->local to not miss a
+ * timer on !SMP systems.
+ */
+ tevt->local = min_t(u64, tevt->local, tevt->global);
+}
# endif /* CONFIG_SMP */
static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
@@ -2111,7 +2170,6 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
struct timer_base *base_local, *base_global;
unsigned long nextevt;
- u64 expires;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -2132,6 +2190,21 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
base_global, &tevt);
+ /*
+ * When the next event is only one jiffy ahead there is no need to
+ * call timer migration hierarchy related functions.
+ * @tevt->global will be KTIME_MAX nevertheless, even if the next
+ * timer is a global timer. This is also true when the timer base
+ * is idle.
+ *
+ * The proper timer migration hierarchy function depends on the callsite
+ * and whether timer base is idle or not. @nextevt will be updated when
+ * this CPU needs to handle the first timer migration hierarchy event.
+ */
+ if (time_after(nextevt, basej + 1))
+ timer_use_tmigr(basej, basem, &nextevt, idle,
+ base_local->is_idle, &tevt);
+
/*
* We have a fresh next event. Check whether we can forward the
* base.
@@ -2144,7 +2217,10 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
*/
if (idle) {
/*
- * Bases are idle if the next event is more than a tick away.
+ * Bases are idle if the next event is more than a tick
+ * away. Caution: @nextevt could have changed by enqueueing a
+ * global timer into timer migration hierarchy. Therefore a new
+ * check is required here.
*
* If the base is marked idle then any timer add operation must
* forward the base clk itself to keep granularity small. This
@@ -2159,14 +2235,23 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
}
*idle = base_local->is_idle;
trace_timer_base_idle(base_local->is_idle, base_local->cpu);
+
+ /*
+ * When timer base is not set idle, undo the effect of
+ * tmigr_cpu_deactivate() to prevent inconsistent states - active
+ * timer base but inactive timer migration hierarchy.
+ *
+ * When timer base was already marked idle, nothing will be
+ * changed here.
+ */
+ if (!base_local->is_idle)
+ tmigr_cpu_activate();
}
raw_spin_unlock(&base_global->lock);
raw_spin_unlock(&base_local->lock);
- expires = min_t(u64, tevt.local, tevt.global);
-
- return cmp_next_hrtimer_event(basem, expires);
+ return cmp_next_hrtimer_event(basem, tevt.local);
}
/**
@@ -2174,8 +2259,11 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
* @basej: base time jiffies
* @basem: base time clock monotonic
*
- * Returns the tick aligned clock monotonic time of the next pending
- * timer or KTIME_MAX if no timer is pending.
+ * Returns the tick aligned clock monotonic time of the next pending timer or
+ * KTIME_MAX if no timer is pending. If timer of global base was queued into
+ * timer migration hierarchy, first global timer is not taken into account. If
+ * it was the last CPU of timer migration hierarchy going idle, first global
+ * event is taken into account.
*/
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
{
@@ -2228,6 +2316,9 @@ void timer_clear_idle(void)
__this_cpu_write(timer_bases[BASE_GLOBAL].is_idle, false);
trace_timer_base_idle(0, smp_processor_id());
+
+ /* Activate without holding the timer_base->lock */
+ tmigr_cpu_activate();
}
#endif
@@ -2297,6 +2388,9 @@ static __latent_entropy void run_timer_softirq(struct softirq_action *h)
if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
run_timer_base(BASE_GLOBAL);
run_timer_base(BASE_DEF);
+
+ if (is_timers_nohz_active())
+ tmigr_handle_remote();
}
}
@@ -2311,7 +2405,8 @@ static void run_local_timers(void)
for (int i = 0; i < NR_BASES; i++, base++) {
/* Raise the softirq only if required. */
- if (time_after_eq(jiffies, base->next_expiry)) {
+ if (time_after_eq(jiffies, base->next_expiry) ||
+ (i == BASE_DEF && tmigr_requires_handle_remote())) {
raise_softirq(TIMER_SOFTIRQ);
return;
}
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
new file mode 100644
index 000000000000..05cd8f1bc45d
--- /dev/null
+++ b/kernel/time/timer_migration.c
@@ -0,0 +1,1636 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Infrastructure for migratable timers
+ *
+ * Copyright(C) 2022 linutronix GmbH
+ */
+#include <linux/cpuhotplug.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/spinlock.h>
+#include <linux/timerqueue.h>
+#include <trace/events/ipi.h>
+
+#include "timer_migration.h"
+#include "tick-internal.h"
+
+/*
+ * The timer migration mechanism is built on a hierarchy of groups. The
+ * lowest level group contains CPUs, the next level groups of CPU groups
+ * and so forth. The CPU groups are kept per node so for the normal case
+ * lock contention won't happen across nodes. Depending on the number of
+ * CPUs per node even the next level might be kept as groups of CPU groups
+ * per node and only the levels above cross the node topology.
+ *
+ * Example topology for a two node system with 24 CPUs each.
+ *
+ * LVL 2                           [GRP2:0]
+ *                              GRP1:0 = GRP1:M
+ *
+ * LVL 1            [GRP1:0]                      [GRP1:1]
+ *               GRP0:0 - GRP0:2               GRP0:3 - GRP0:5
+ *
+ * LVL 0  [GRP0:0]  [GRP0:1]  [GRP0:2]  [GRP0:3]  [GRP0:4]  [GRP0:5]
+ * CPUS     0-7       8-15      16-23     24-31     32-39     40-47
+ *
+ * The groups hold a timer queue of events sorted by expiry time. These
+ * queues are updated when CPUs go idle. When they come out of idle, the
+ * ignore flag of their events is set.
+ *
+ * Each group has a designated migrator CPU/group as long as a CPU/group is
+ * active in the group. This designated role is necessary to avoid that all
+ * active CPUs in a group try to migrate expired timers from other CPUs,
+ * which would result in massive lock bouncing.
+ *
+ * When a CPU is awake, it checks in its own timer tick the group
+ * hierarchy up to the point where it is assigned the migrator role or if
+ * no CPU is active, it also checks the groups where no migrator is set
+ * (TMIGR_NONE).
+ *
+ * If it finds expired timers in one of the group queues it pulls them over
+ * from the idle CPU and runs the timer function. After that it updates the
+ * group and the parent groups if required.
+ *
+ * CPUs which go idle arm their CPU local timer hardware for the next local
+ * (pinned) timer event. If the next migratable timer expires after the
+ * next local timer or the CPU has no migratable timer pending then the
+ * CPU does not queue an event in the LVL0 group. If the next migratable
+ * timer expires before the next local timer then the CPU queues that timer
+ * in the LVL0 group. In both cases the CPU marks itself idle in the LVL0
+ * group.
+ *
+ * When a CPU comes out of idle and when a group has at least a single active
+ * child, the ignore flag of the tmigr_event is set. This indicates that
+ * the event is ignored even if it is still enqueued in the parent group's
+ * timer queue. It will be removed when touching the timer queue the next
+ * time. This spares locking in the active path as the lock protects (after
+ * setup) only event information. For more information about locking,
+ * please read the section "Locking rules".
+ *
+ * If the CPU is the migrator of the group then it delegates that role to
+ * the next active CPU in the group or sets migrator to TMIGR_NONE when
+ * there is no active CPU in the group. This delegation needs to be
+ * propagated up the hierarchy so hand over from other leaves can happen at
+ * all hierarchy levels w/o doing a search.
+ *
+ * When the last CPU in the system goes idle, then it drops all migrator
+ * duties up to the top level of the hierarchy (LVL2 in the example). It
+ * then has to make sure that it arms its own local hardware timer for
+ * the earliest event in the system.
+ *
+ *
+ * Lifetime rules:
+ * ---------------
+ *
+ * The groups are built up at init time or when CPUs come online. They are
+ * not destroyed when a group becomes empty due to offlining. The group
+ * just won't participate in the hierarchy management anymore. Destroying
+ * groups would result in interesting race conditions which would just make
+ * the whole mechanism slow and complex.
+ *
+ *
+ * Locking rules:
+ * --------------
+ *
+ * For setting up new groups and handling events it's required to lock both
+ * child and parent group. The lock ordering is always bottom up. This also
+ * includes the per CPU locks in struct tmigr_cpu. For updating the migrator and
+ * active CPU/group information atomic_try_cmpxchg() is used instead and only
+ * the per CPU tmigr_cpu->lock is held.
+ *
+ * During the setup of groups tmigr_level_list is required. It is protected by
+ * @tmigr_mutex.
+ *
+ * When @timer_base->lock as well as tmigr related locks are required, the lock
+ * ordering is: first @timer_base->lock, afterwards tmigr related locks.
+ *
+ *
+ * Protection of the tmigr group state information:
+ * ------------------------------------------------
+ *
+ * The state information with the list of active children and migrator needs to
+ * be protected by a sequence counter. It prevents a race when updates in
+ * child groups are propagated in changed order. The following scenario
+ * describes what happens without updating the sequence counter:
+ *
+ * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
+ * as GRP0:1 will not change during the scenario):
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:0, GRP0:1
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *           migrator = CPU0           migrator = CPU2
+ *           active   = CPU0           active   = CPU2
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             active      idle           active      idle
+ *
+ *
+ * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
+ * states of GRP0:0 and GRP1:0 are stored in the data for walking the
+ * hierarchy):
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:0, GRP0:1
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *       --> migrator = TMIGR_NONE     migrator = CPU2
+ *       --> active   =                active   = CPU2
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *         --> idle        idle           active      idle
+ *
+ * 2. CPU1 comes out of idle (changes are updated in GRP0:0; afterwards the
+ * current states of GRP0:0 and GRP1:0 are stored in the data for walking the
+ * hierarchy):
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:0, GRP0:1
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *       --> migrator = CPU1           migrator = CPU2
+ *       --> active   = CPU1           active   = CPU2
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             idle    --> active         active      idle
+ *
+ * 3. Here comes the change of the order: Propagating the changes of step 2
+ * through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
+ * is already up to date.
+ *
+ * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0
+ *
+ *    LVL 1            [GRP1:0]
+ *                 --> migrator = GRP0:1
+ *                 --> active   = GRP0:1
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *           migrator = CPU1           migrator = CPU2
+ *           active   = CPU1           active   = CPU2
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             idle        active         active      idle
+ *
+ * Now there is an inconsistent overall state because GRP0:0 is active, but
+ * it is marked as idle in GRP1:0. This is prevented by incrementing the
+ * sequence counter whenever changing the state.
+ *
+ *
+ * Required event and timerqueue update after a remote expiry:
+ * -----------------------------------------------------------
+ *
+ * After a remote expiry of a CPU, a walk through the hierarchy updating the
+ * events and timerqueues has to be done when there is a 'new' global timer of
+ * the remote CPU (which is obvious) but also if there is no new global timer,
+ * but the remote CPU is still idle:
+ *
+ * 1. CPU2 is the migrator and does the remote expiry in GRP1:0; expiry of
+ * evt-CPU0 and evt-CPU1 are equal:
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:1
+ *                 --> timerqueue = evt-GRP0:0
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *           migrator = TMIGR_NONE     migrator = CPU2
+ *           active   =                active   = CPU2
+ *           groupevt.ignore = false   groupevt.ignore = true
+ *           groupevt.cpu = CPU0       groupevt.cpu =
+ *           timerqueue = evt-CPU0,    timerqueue =
+ *                        evt-CPU1
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             idle        idle           active      idle
+ *
+ * 2. Remove the first event of the timerqueue in GRP1:0 and expire the timers
+ * of CPU0 (see evt-GRP0:0->cpu value):
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:1
+ *                 --> timerqueue =
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *           migrator = TMIGR_NONE     migrator = CPU2
+ *           active   =                active   = CPU2
+ *           groupevt.ignore = false   groupevt.ignore = true
+ *       --> groupevt.cpu = CPU0       groupevt.cpu =
+ *           timerqueue = evt-CPU0,    timerqueue =
+ *                        evt-CPU1
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             idle        idle           active      idle
+ *
+ * 3. After the remote expiry CPU0 has no global timer that needs to be
+ * enqueued. When skipping the walk, the global timer of CPU1 is not handled,
+ * as the group event of GRP0:0 is not updated and not enqueued into GRP1:0. The
+ * walk has to be done to update the group events and timerqueues:
+ *
+ *    LVL 1            [GRP1:0]
+ *                     migrator = GRP0:1
+ *                     active   = GRP0:1
+ *                 --> timerqueue = evt-GRP0:0
+ *                   /                \
+ *    LVL 0  [GRP0:0]                  [GRP0:1]
+ *           migrator = TMIGR_NONE     migrator = CPU2
+ *           active   =                active   = CPU2
+ *           groupevt.ignore = false   groupevt.ignore = true
+ *       --> groupevt.cpu = CPU1       groupevt.cpu =
+ *       --> timerqueue = evt-CPU1     timerqueue =
+ *              /         \                /         \
+ *    CPUs     0           1              2           3
+ *             idle        idle           active      idle
+ *
+ * Now CPU2 (migrator) is able to handle the timer of CPU1 as CPU2 only scans
+ * the timerqueues of GRP0:1 and GRP1:0.
+ *
+ * The update of step 3 can be skipped when the remote CPU went offline in
+ * the meantime, because an update was already done during the inactive
+ * path. When the CPU became active in the meantime, the update isn't
+ * required either, because GRP0:0 is no longer idle.
+ */
+
+static DEFINE_MUTEX(tmigr_mutex);
+static struct list_head *tmigr_level_list __read_mostly;
+
+static unsigned int tmigr_hierarchy_levels __read_mostly;
+static unsigned int tmigr_crossnode_level __read_mostly;
+
+static DEFINE_PER_CPU(struct tmigr_cpu, tmigr_cpu);
+
+#define TMIGR_NONE 0xFF
+#define BIT_CNT 8
+
+static inline bool tmigr_is_not_available(struct tmigr_cpu *tmc)
+{
+ return !(tmc->tmgroup && tmc->online);
+}
+
+/*
+ * Returns true, when @childmask corresponds to the group migrator or when the
+ * group is not active - so no migrator is set.
+ */
+static bool tmigr_check_migrator(struct tmigr_group *group, u8 childmask)
+{
+ union tmigr_state s;
+
+ s.state = atomic_read(&group->migr_state);
+
+ if ((s.migrator == childmask) || (s.migrator == TMIGR_NONE))
+ return true;
+
+ return false;
+}
+
+static bool tmigr_check_migrator_and_lonely(struct tmigr_group *group, u8 childmask)
+{
+ bool lonely, migrator = false;
+ unsigned long active;
+ union tmigr_state s;
+
+ s.state = atomic_read(&group->migr_state);
+
+ if ((s.migrator == childmask) || (s.migrator == TMIGR_NONE))
+ migrator = true;
+
+ active = s.active;
+ lonely = bitmap_weight(&active, BIT_CNT) <= 1;
+
+ return (migrator && lonely);
+}
+
+static bool tmigr_check_lonely(struct tmigr_group *group)
+{
+ unsigned long active;
+ union tmigr_state s;
+
+ s.state = atomic_read(&group->migr_state);
+
+ active = s.active;
+
+ return bitmap_weight(&active, BIT_CNT) <= 1;
+}
+
+typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
+
+static void __walk_groups(up_f up, void *data,
+ struct tmigr_cpu *tmc)
+{
+ struct tmigr_group *child = NULL, *group = tmc->tmgroup;
+
+ do {
+ WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
+
+ if (up(group, child, data))
+ break;
+
+ child = group;
+ group = group->parent;
+ } while (group);
+}
+
+static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
+{
+ lockdep_assert_held(&tmc->lock);
+
+ __walk_groups(up, data, tmc);
+}
+
+/**
+ * struct tmigr_walk - data required for walking the hierarchy
+ * @evt: Pointer to tmigr_event which needs to be queued (of idle
+ * child group)
+ * @childmask: childmask of child group
+ * @nextexp: Next CPU event expiry information which is handed into
+ * the timer migration code by the timer code
+ * (get_next_timer_interrupt()); it is furthermore used for
+ * the first event which is queued, if the timer migration
+ * hierarchy is completely idle
+ * @childstate: tmigr_group->migr_state of the child - will be only
+ * reread when cmpxchg in the group fails (is required for
+ * the deactivate path and the new timer path)
+ * @groupstate: tmigr_group->migr_state of the group - will be only
+ * reread when cmpxchg in the group fails (is required for
+ * the active, the deactivate and the new timer path)
+ * @remote: Is set, when the new timer path is executed in
+ * tmigr_handle_remote_cpu()
+ */
+struct tmigr_walk {
+ struct tmigr_event *evt;
+ u8 childmask;
+ u64 nextexp;
+ union tmigr_state childstate;
+ union tmigr_state groupstate;
+ bool remote;
+};
+
+/**
+ * struct tmigr_remote_data - data required for remote expiry hierarchy walk
+ * @basej: timer base in jiffies
+ * @now: timer base monotonic
+ * @nextexp: returns expiry of the first timer in the idle timer
+ * migration hierarchy to make sure the timer is handled in
+ * time; it is stored in the per CPU tmigr_cpu struct of
+ * CPU which expires remote timers
+ * @childmask: childmask of child group
+ * @check: is set if there is the need to handle remote timers;
+ * required in tmigr_check_handle_remote() only
+ * @tmc_active: this flag indicates, whether the CPU which triggers
+ * the hierarchy walk is !idle in the timer migration
+ * hierarchy. When the CPU is idle and the whole hierarchy is
+ * idle, only the first event of the top level has to be
+ * considered.
+ */
+struct tmigr_remote_data {
+ unsigned long basej;
+ u64 now;
+ u64 nextexp;
+ u8 childmask;
+ bool check;
+ bool tmc_active;
+};
+
+/*
+ * Returns the next event of the timerqueue @group->events
+ *
+ * Removes timers with the ignore flag set and updates next_expiry of the
+ * group. Values of the group event are updated in tmigr_update_events() only.
+ */
+static struct tmigr_event *tmigr_next_groupevt(struct tmigr_group *group)
+{
+ struct timerqueue_node *node = NULL;
+ struct tmigr_event *evt = NULL;
+
+ lockdep_assert_held(&group->lock);
+
+ WRITE_ONCE(group->next_expiry, KTIME_MAX);
+
+ while ((node = timerqueue_getnext(&group->events))) {
+ evt = container_of(node, struct tmigr_event, nextevt);
+
+ if (!evt->ignore) {
+ WRITE_ONCE(group->next_expiry, evt->nextevt.expires);
+ return evt;
+ }
+
+ /*
+ * Remove next timers with ignore flag, because the group lock
+ * is held anyway
+ */
+ if (!timerqueue_del(&group->events, node))
+ break;
+ }
+
+ return NULL;
+}
+
+/*
+ * Returns the next event of the group timerqueue which is already expired
+ *
+ * The returned event is also removed from the queue.
+ */
+static struct tmigr_event *tmigr_next_expired_groupevt(struct tmigr_group *group,
+ u64 now)
+{
+ struct tmigr_event *evt = tmigr_next_groupevt(group);
+
+ if (!evt || now < evt->nextevt.expires)
+ return NULL;
+
+ /*
+ * The event is already expired. Remove it. If it's not the last event,
+ * then update all group event related information.
+ */
+ if (timerqueue_del(&group->events, &evt->nextevt))
+ tmigr_next_groupevt(group);
+ else
+ WRITE_ONCE(group->next_expiry, KTIME_MAX);
+
+ return evt;
+}
+
+static u64 tmigr_next_groupevt_expires(struct tmigr_group *group)
+{
+ struct tmigr_event *evt;
+
+ evt = tmigr_next_groupevt(group);
+
+ if (!evt)
+ return KTIME_MAX;
+ else
+ return evt->nextevt.expires;
+}
+
+static bool tmigr_active_up(struct tmigr_group *group,
+ struct tmigr_group *child,
+ void *ptr)
+{
+ union tmigr_state curstate, newstate;
+ struct tmigr_walk *data = ptr;
+ bool walk_done;
+ u8 childmask;
+
+ childmask = data->childmask;
+ newstate = curstate = data->groupstate;
+
+retry:
+ walk_done = true;
+
+ if (newstate.migrator == TMIGR_NONE) {
+ newstate.migrator = childmask;
+
+ /* Changes need to be propagated */
+ walk_done = false;
+ }
+
+ newstate.active |= childmask;
+
+ newstate.seq++;
+
+ if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+ newstate.state = curstate.state;
+ goto retry;
+ }
+
+ if (group->parent && (walk_done == false)) {
+ data->groupstate.state = atomic_read(&group->parent->migr_state);
+ data->childmask = group->childmask;
+ }
+
+ /*
+ * The group is active and the event will be ignored - the ignore flag is
+ * updated without holding the lock. In case the bit is set while
+ * another CPU already handles remote events, nothing happens, because
+ * it is clear that the CPU became active just in this moment, or in
+ * worst case the event is handled remote. Nothing to worry about.
+ */
+ group->groupevt.ignore = true;
+
+ return walk_done;
+}
+
+static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
+{
+ struct tmigr_walk data;
+
+ data.childmask = tmc->childmask;
+ data.groupstate.state = atomic_read(&tmc->tmgroup->migr_state);
+
+ tmc->cpuevt.ignore = true;
+ WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+ tmc->wakeup_recalc = false;
+
+ walk_groups(&tmigr_active_up, &data, tmc);
+}
+
+/**
+ * tmigr_cpu_activate() - set this CPU active in timer migration hierarchy
+ *
+ * Call site timer_clear_idle() is called with interrupts disabled.
+ */
+void tmigr_cpu_activate(void)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+ if (tmigr_is_not_available(tmc))
+ return;
+
+ if (WARN_ON_ONCE(!tmc->idle))
+ return;
+
+ raw_spin_lock(&tmc->lock);
+ tmc->idle = false;
+ __tmigr_cpu_activate(tmc);
+ raw_spin_unlock(&tmc->lock);
+}
+
+/*
+ * Returns true, if there is nothing to be propagated to the next level
+ *
+ * @data->nextexp is reset to KTIME_MAX; it is reused for the first global event
+ * which needs to be handled by the migrator (in the top level group).
+ *
+ * This is the only place where the group event expiry value is set.
+ */
+static bool tmigr_update_events(struct tmigr_group *group,
+ struct tmigr_group *child,
+ struct tmigr_walk *data)
+{
+ struct tmigr_event *evt, *first_childevt;
+ bool walk_done, remote = data->remote;
+ bool leftmost_change = false;
+ u64 nextexp;
+
+ if (child) {
+ raw_spin_lock(&child->lock);
+ raw_spin_lock_nested(&group->lock, SINGLE_DEPTH_NESTING);
+
+ data->childstate.state = atomic_read(&child->migr_state);
+ data->groupstate.state = atomic_read(&group->migr_state);
+
+ if (data->childstate.active) {
+ walk_done = true;
+ goto unlock;
+ }
+
+ first_childevt = tmigr_next_groupevt(child);
+ nextexp = child->next_expiry;
+ evt = &child->groupevt;
+ } else {
+ nextexp = data->nextexp;
+
+ /*
+ * Set @data->nextexp to KTIME_MAX; it is reused for the first
+ * global event which needs to be handled by the migrator (in
+ * the top level group).
+ */
+ data->nextexp = KTIME_MAX;
+
+ first_childevt = evt = data->evt;
+
+ /*
+ * Walking the hierarchy is required in any case when a
+ * remote expiry was done before. This ensures to not lose
+ * already queued events in non active groups (see section
+ * "Required event and timerqueue update after remote
+ * expiry" in the documentation at the top).
+ *
+ * The two call sites which are executed without a remote expiry
+ * before, are not prevented from propagating changes through
+ * the hierarchy by the return:
+ * - When entering this path by tmigr_new_timer(), @evt->ignore
+ * is never set.
+ * - tmigr_inactive_up() takes care of the propagation by
+ * itself and ignores the return value. But an immediate
+ * return is required because nothing has to be done in this
+ * level as the event could be ignored.
+ */
+ if (evt->ignore && !remote)
+ return true;
+
+ raw_spin_lock(&group->lock);
+ data->groupstate.state = atomic_read(&group->migr_state);
+ }
+
+ if (nextexp == KTIME_MAX) {
+ evt->ignore = true;
+
+ /*
+ * When the next child event could be ignored (nextexp is
+ * KTIME_MAX) and there was no remote timer handling before or
+ * the group is already active, there is no need to walk the
+ * hierarchy even if there is a parent group.
+ *
+ * The other way round: even if the event could be ignored, but
+ * if a remote timer handling was executed before and the group
+ * is not active, walking the hierarchy is required to not miss
+ * an enqueued timer in the non active group. The enqueued timer
+ * of the group needs to be propagated to a higher level to
+ * ensure it is handled.
+ */
+ if (!remote || data->groupstate.active) {
+ walk_done = true;
+ goto unlock;
+ }
+ } else {
+ /*
+ * An update of @evt->cpu and @evt->ignore flag is required only
+ * when @child is set (the child is equal or higher than lvl0),
+ * but it doesn't matter if it is written once more to the per
+ * CPU event; make the update unconditional.
+ */
+ evt->cpu = first_childevt->cpu;
+ evt->ignore = false;
+ }
+
+ walk_done = !group->parent;
+
+ /*
+ * If the child event is already queued in the group, remove it from the
+ * queue when the expiry time changed only.
+ */
+ if (timerqueue_node_queued(&evt->nextevt)) {
+ if (evt->nextevt.expires == nextexp)
+ goto check_toplvl;
+
+ leftmost_change = timerqueue_getnext(&group->events) == &evt->nextevt;
+ if (!timerqueue_del(&group->events, &evt->nextevt))
+ WRITE_ONCE(group->next_expiry, KTIME_MAX);
+ }
+
+ evt->nextevt.expires = nextexp;
+
+ if (timerqueue_add(&group->events, &evt->nextevt)) {
+ leftmost_change = true;
+ WRITE_ONCE(group->next_expiry, nextexp);
+ }
+
+check_toplvl:
+ if (walk_done && (data->groupstate.migrator == TMIGR_NONE)) {
+ /*
+ * Nothing to do when the first event didn't change and the
+ * update was done during remote timer handling.
+ */
+ if (remote && !leftmost_change)
+ goto unlock;
+ /*
+ * The top level group is idle and it has to be ensured the
+ * global timers are handled in time. (This could be optimized
+ * by keeping track of the last global scheduled event and only
+ * arming it on the CPU if the new event is earlier. Not sure if
+ * it's worth the complexity.)
+ */
+ data->nextexp = tmigr_next_groupevt_expires(group);
+ }
+
+unlock:
+ raw_spin_unlock(&group->lock);
+
+ if (child)
+ raw_spin_unlock(&child->lock);
+
+ return walk_done;
+}
+
+static bool tmigr_new_timer_up(struct tmigr_group *group,
+ struct tmigr_group *child,
+ void *ptr)
+{
+ struct tmigr_walk *data = ptr;
+
+ return tmigr_update_events(group, child, data);
+}
+
+/*
+ * Returns the expiry of the next timer that needs to be handled. KTIME_MAX is
+ * returned, when an active CPU will handle all the timer migration hierarchy
+ * timers.
+ */
+static u64 tmigr_new_timer(struct tmigr_cpu *tmc, u64 nextexp)
+{
+ struct tmigr_walk data = { .evt = &tmc->cpuevt,
+ .nextexp = nextexp };
+
+ lockdep_assert_held(&tmc->lock);
+
+ if (tmc->remote)
+ return KTIME_MAX;
+
+ tmc->cpuevt.ignore = false;
+ data.remote = false;
+
+ walk_groups(&tmigr_new_timer_up, &data, tmc);
+
+ /* If there is a new first global event, make sure it is handled */
+ return data.nextexp;
+}
+
+static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
+ unsigned long jif)
+{
+ struct timer_events tevt;
+ struct tmigr_walk data;
+ struct tmigr_cpu *tmc;
+ u64 next = KTIME_MAX;
+
+ tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+
+ raw_spin_lock_irq(&tmc->lock);
+
+ /*
+ * The remote CPU is offline or the CPU event does not have to be handled
+ * (the CPU is active or there is no longer an event to expire) or
+ * another CPU handles the CPU timers already or the next event was
+ * already expired - return!
+ */
+ if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
+ now < tmc->cpuevt.nextevt.expires) {
+ raw_spin_unlock_irq(&tmc->lock);
+ return next;
+ }
+
+ tmc->remote = true;
+ WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+ /* Drop the lock to allow the remote CPU to exit idle */
+ raw_spin_unlock_irq(&tmc->lock);
+
+ if (cpu != smp_processor_id())
+ timer_expire_remote(cpu);
+
+ /*
+ * Lock ordering needs to be preserved - timer_base locks before tmigr
+ * related locks (see section "Locking rules" in the documentation at
+ * the top). During fetching the next timer interrupt, also tmc->lock
+ * needs to be held. Otherwise there is a possible race window against
+ * the CPU itself when it comes out of idle, updates the first timer in
+ * the hierarchy and goes back to idle.
+ *
+ * timer base locks are dropped as fast as possible: After checking
+ * whether the remote CPU went offline in the meantime and after
+ * fetching the next remote timer interrupt. Dropping the locks as fast
+ * as possible keeps the locking region small and prevents holding
+ * several (unnecessary) locks during walking the hierarchy for updating
+ * the timerqueue and group events.
+ */
+ local_irq_disable();
+ timer_lock_remote_bases(cpu);
+ raw_spin_lock(&tmc->lock);
+
+ /*
+ * When the CPU went offline in the meantime, no hierarchy walk has to
+ * be done for updating the queued events, because the walk was
+ * already done during marking the CPU offline in the hierarchy.
+ *
+ * When the CPU is no longer idle, the CPU takes care of the timers and
+ * also of the timers in the path to the top.
+ *
+ * (See also section "Required event and timerqueue update after
+ * remote expiry" in the documentation at the top)
+ */
+ if (!tmc->online || !tmc->idle) {
+ timer_unlock_remote_bases(cpu);
+ goto unlock;
+ } else {
+ /* next event of CPU */
+ fetch_next_timer_interrupt_remote(jif, now, &tevt, cpu);
+ }
+
+ timer_unlock_remote_bases(cpu);
+
+ data.evt = &tmc->cpuevt;
+ data.nextexp = tevt.global;
+ data.remote = true;
+
+ /*
+ * The update is done even when there is no 'new' global timer pending
+ * on the remote CPU (see section "Required event and timerqueue update
+ * after remote expiry" in the documentation at the top)
+ */
+ walk_groups(&tmigr_new_timer_up, &data, tmc);
+
+ next = data.nextexp;
+
+unlock:
+ tmc->remote = false;
+ raw_spin_unlock_irq(&tmc->lock);
+
+ return next;
+}
+
+static bool tmigr_handle_remote_up(struct tmigr_group *group,
+ struct tmigr_group *child,
+ void *ptr)
+{
+ struct tmigr_remote_data *data = ptr;
+ u64 now, next = KTIME_MAX;
+ struct tmigr_event *evt;
+ unsigned long jif;
+ u8 childmask;
+
+ jif = data->basej;
+ now = data->now;
+
+ childmask = data->childmask;
+
+again:
+ /*
+ * Handle the group only if @childmask is the migrator or if the
+ * group has no migrator. Otherwise the group is active and is
+ * handled by its own migrator.
+ */
+ if (!tmigr_check_migrator(group, childmask))
+ return true;
+
+ raw_spin_lock_irq(&group->lock);
+
+ evt = tmigr_next_expired_groupevt(group, now);
+
+ if (evt) {
+ unsigned int remote_cpu = evt->cpu;
+
+ raw_spin_unlock_irq(&group->lock);
+
+ next = tmigr_handle_remote_cpu(remote_cpu, now, jif);
+
+ /* check if there is another event, that needs to be handled */
+ goto again;
+ } else {
+ raw_spin_unlock_irq(&group->lock);
+ }
+
+ /* Update of childmask for the next level */
+ data->childmask = group->childmask;
+ data->nextexp = next;
+
+ return false;
+}
+
+/**
+ * tmigr_handle_remote() - Handle global timers of remote idle CPUs
+ *
+ * Called from the timer soft interrupt with interrupts enabled.
+ */
+void tmigr_handle_remote(void)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ struct tmigr_remote_data data;
+
+ if (tmigr_is_not_available(tmc))
+ return;
+
+ data.childmask = tmc->childmask;
+ data.nextexp = KTIME_MAX;
+
+ /*
+ * NOTE: This is a duplicate check because the migrator test will be done
+ * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
+ * return when nothing has to be done.
+ */
+ if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask))
+ return;
+
+ data.now = get_jiffies_update(&data.basej);
+
+ /*
+ * Update @tmc->wakeup only at the end and do not reset @tmc->wakeup to
+ * KTIME_MAX. Even if tmc->lock is not held during the whole remote
+ * handling, tmc->wakeup is fine to be stale as it is called in
+ * interrupt context and tick_nohz_next_event() is executed in interrupt
+ * exit path only after processing the last pending interrupt.
+ */
+
+ __walk_groups(&tmigr_handle_remote_up, &data, tmc);
+
+ raw_spin_lock_irq(&tmc->lock);
+ WRITE_ONCE(tmc->wakeup, data.nextexp);
+ raw_spin_unlock_irq(&tmc->lock);
+}
+
+static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
+ struct tmigr_group *child,
+ void *ptr)
+{
+ struct tmigr_remote_data *data = ptr;
+ u8 childmask;
+
+ childmask = data->childmask;
+
+ /*
+ * Handle the group only if the child is the migrator or if the group
+ * has no migrator. Otherwise the group is active and is handled by its
+ * own migrator.
+ */
+ if (!tmigr_check_migrator(group, childmask))
+ return true;
+
+ /*
+ * When there is a parent group and the CPU which triggered the
+ * hierarchy walk is not active, continue the walk to reach the top level
+ * group before reading the next_expiry value.
+ */
+ if (group->parent && !data->tmc_active)
+ goto out;
+
+ /*
+ * On 32 bit systems the racy lockless check for next_expiry will
+ * turn into a random number generator. Therefore do the lockless
+ * check only on 64 bit systems.
+ */
+ if (IS_ENABLED(CONFIG_64BIT)) {
+ data->nextexp = READ_ONCE(group->next_expiry);
+ if (data->now >= data->nextexp) {
+ data->check = true;
+ return true;
+ }
+ } else {
+ raw_spin_lock(&group->lock);
+ data->nextexp = group->next_expiry;
+ if (data->now >= group->next_expiry) {
+ data->check = true;
+ raw_spin_unlock(&group->lock);
+ return true;
+ }
+ raw_spin_unlock(&group->lock);
+ }
+
+out:
+ /* Update of childmask for the next level */
+ data->childmask = group->childmask;
+ return false;
+}
+
+/**
+ * tmigr_requires_handle_remote() - Check the need of remote timer handling
+ *
+ * Must be called with interrupts disabled.
+ */
+int tmigr_requires_handle_remote(void)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ struct tmigr_remote_data data;
+ unsigned int ret = 0;
+ unsigned long jif;
+
+ if (tmigr_is_not_available(tmc))
+ return ret;
+
+ data.now = get_jiffies_update(&jif);
+ data.childmask = tmc->childmask;
+ data.nextexp = KTIME_MAX;
+ data.tmc_active = !tmc->idle;
+ data.check = false;
+
+ /*
+ * When the CPU is active, walk the hierarchy to check whether a
+ * remote expiry is required.
+ *
+ * Check is done lockless as interrupts are disabled and @tmc->idle is
+ * set only by the local CPU.
+ */
+ if (!tmc->idle) {
+ __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
+
+ if (data.nextexp != KTIME_MAX)
+ ret = 1;
+
+ return ret;
+ }
+
+ /*
+ * When the CPU is idle, check whether the recalculation of @tmc->wakeup
+ * is required. @tmc->wakeup_recalc is set by a remote CPU which is
+ * about to go offline, was the last active CPU in the whole timer
+ * migration hierarchy and now delegates handling of the hierarchy to
+ * this CPU.
+ *
+ * Racy lockless check is valid:
+ * - @tmc->wakeup_recalc is set by the remote CPU before it issues
+ * reschedule IPI.
+ * - As interrupts are disabled here this CPU will either observe
+ * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
+ * it will observe it when this function is called again on return
+ * from handling the reschedule IPI.
+ */
+ if (tmc->wakeup_recalc) {
+ raw_spin_lock(&tmc->lock);
+
+ __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
+
+ if (data.nextexp != KTIME_MAX)
+ ret = 1;
+
+ WRITE_ONCE(tmc->wakeup, data.nextexp);
+ tmc->wakeup_recalc = false;
+ raw_spin_unlock(&tmc->lock);
+
+ return ret;
+ }
+
+ /*
+ * When the CPU is idle and @tmc->wakeup is reliable, compare it with
+ * @data.now. On 64 bit it is valid to do this lockless. On 32 bit
+ * systems, holding the lock is required to get valid data on concurrent
+ * writers.
+ */
+ if (IS_ENABLED(CONFIG_64BIT)) {
+ if (data.now >= READ_ONCE(tmc->wakeup))
+ ret = 1;
+ } else {
+ raw_spin_lock(&tmc->lock);
+ if (data.now >= tmc->wakeup)
+ ret = 1;
+ raw_spin_unlock(&tmc->lock);
+ }
+
+ return ret;
+}
+
+/**
+ * tmigr_cpu_new_timer() - enqueue next global timer into hierarchy (idle tmc)
+ * @nextexp: Next expiry of global timer (or KTIME_MAX if not)
+ *
+ * The CPU is already deactivated in the timer migration
+ * hierarchy. tick_nohz_get_sleep_length() calls tick_nohz_next_event()
+ * and thereby the timer idle path is executed once more. @tmc->wakeup
+ * holds the first timer, when the timer migration hierarchy is
+ * completely idle.
+ *
+ * Returns the first timer that needs to be handled by this CPU or KTIME_MAX if
+ * nothing needs to be done.
+ */
+u64 tmigr_cpu_new_timer(u64 nextexp)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ u64 ret;
+
+ if (tmigr_is_not_available(tmc))
+ return nextexp;
+
+ raw_spin_lock(&tmc->lock);
+
+ ret = READ_ONCE(tmc->wakeup);
+ if (nextexp != KTIME_MAX) {
+ if (nextexp != tmc->cpuevt.nextevt.expires ||
+ tmc->cpuevt.ignore) {
+ ret = tmigr_new_timer(tmc, nextexp);
+ }
+ } else if (tmc->wakeup_recalc) {
+ struct tmigr_remote_data data;
+
+ data.now = KTIME_MAX;
+ data.childmask = tmc->childmask;
+ data.nextexp = KTIME_MAX;
+ data.tmc_active = false;
+ data.check = false;
+
+ __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
+
+ ret = data.nextexp;
+ }
+ tmc->wakeup_recalc = false;
+
+ /*
+ * Make sure the reevaluation of timers in idle path will not miss an
+ * event.
+ */
+ WRITE_ONCE(tmc->wakeup, ret);
+
+ raw_spin_unlock(&tmc->lock);
+ return ret;
+}
+
+static bool tmigr_inactive_up(struct tmigr_group *group,
+ struct tmigr_group *child,
+ void *ptr)
+{
+ union tmigr_state curstate, newstate;
+ struct tmigr_walk *data = ptr;
+ bool walk_done;
+ u8 childmask;
+
+ childmask = data->childmask;
+ newstate = curstate = data->groupstate;
+
+retry:
+ walk_done = true;
+
+ /* Reset active bit when the child is no longer active */
+ if (!data->childstate.active)
+ newstate.active &= ~childmask;
+
+ if (newstate.migrator == childmask) {
+ /*
+ * Find a new migrator for the group, because the child group is
+ * idle!
+ */
+ if (!data->childstate.active) {
+ unsigned long new_migr_bit, active = newstate.active;
+
+ new_migr_bit = find_first_bit(&active, BIT_CNT);
+
+ if (new_migr_bit != BIT_CNT) {
+ newstate.migrator = BIT(new_migr_bit);
+ } else {
+ newstate.migrator = TMIGR_NONE;
+
+ /* Changes need to be propagated */
+ walk_done = false;
+ }
+ }
+ }
+
+ newstate.seq++;
+
+ WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
+
+ if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+ newstate.state = curstate.state;
+
+ /*
+ * Something changed in the child/parent group in the meantime,
+ * reread the state of the child and parent. The update of
+ * data->childstate is required for event handling.
+ */
+ if (child)
+ data->childstate.state = atomic_read(&child->migr_state);
+
+ goto retry;
+ }
+
+ data->groupstate = newstate;
+ data->remote = false;
+
+ /* Event Handling */
+ tmigr_update_events(group, child, data);
+
+ if (group->parent && (walk_done == false)) {
+ data->childmask = group->childmask;
+ data->childstate = newstate;
+ data->groupstate.state = atomic_read(&group->parent->migr_state);
+ }
+
+ /*
+ * data->nextexp was set by tmigr_update_events() and contains the
+ * expiry of the first global event which needs to be handled
+ */
+ if (data->nextexp != KTIME_MAX) {
+ WARN_ON_ONCE(group->parent);
+ /*
+ * Top level path - If this CPU is about to go offline, wake
+ * up some random other CPU so it will take over the
+ * migrator duty and program its timer properly. Ideally
+ * wake the CPU with the closest expiry time, but that's
+ * overkill to figure out.
+ *
+ * Set wakeup_recalc of remote CPU, to make sure the complete
+ * idle hierarchy with enqueued timers is reevaluated.
+ */
+ if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ unsigned int cpu = smp_processor_id();
+ struct tmigr_cpu *tmc_resched;
+
+ cpu = cpumask_any_but(cpu_online_mask, cpu);
+ tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
+
+ raw_spin_unlock(&tmc->lock);
+
+ raw_spin_lock(&tmc_resched->lock);
+ tmc_resched->wakeup_recalc = true;
+ raw_spin_unlock(&tmc_resched->lock);
+
+ raw_spin_lock(&tmc->lock);
+ smp_send_reschedule(cpu);
+ }
+ }
+
+ return walk_done;
+}
+
+static u64 __tmigr_cpu_deactivate(struct tmigr_cpu *tmc, u64 nextexp)
+{
+ struct tmigr_walk data = { .childmask = tmc->childmask,
+ .evt = &tmc->cpuevt,
+ .nextexp = nextexp,
+ .childstate.state = 0 };
+
+ data.groupstate.state = atomic_read(&tmc->tmgroup->migr_state);
+
+ /*
+ * If nextexp is KTIME_MAX, the CPU event will be ignored because the
+ * local timer expires before the global timer, no global timer is set
+ * or CPU goes offline.
+ */
+ if (nextexp != KTIME_MAX)
+ tmc->cpuevt.ignore = false;
+
+ walk_groups(&tmigr_inactive_up, &data, tmc);
+ return data.nextexp;
+}
+
+/**
+ * tmigr_cpu_deactivate() - Put current CPU into inactive state
+ * @nextexp: The next timer event expiry set in the current CPU
+ *
+ * Must be called with interrupts disabled.
+ *
+ * Return: the next event expiry of the current CPU or the next event expiry
+ * from the hierarchy if this CPU is the top level migrator or the hierarchy is
+ * completely idle.
+ */
+u64 tmigr_cpu_deactivate(u64 nextexp)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ u64 ret;
+
+ if (tmigr_is_not_available(tmc))
+ return nextexp;
+
+ raw_spin_lock(&tmc->lock);
+
+ ret = __tmigr_cpu_deactivate(tmc, nextexp);
+
+ tmc->idle = true;
+
+ /*
+ * Make sure the reevaluation of timers in idle path will not miss an
+ * event.
+ */
+ WRITE_ONCE(tmc->wakeup, ret);
+
+ raw_spin_unlock(&tmc->lock);
+ return ret;
+}
+
+/**
+ * tmigr_quick_check() - Quick forecast of next tmigr event when CPU wants to
+ * go idle
+ *
+ * Returns KTIME_MAX, when it is probable that nothing has to be done (not the
+ * only one in the level 0 group; and if it is the only one in level 0 group,
+ * but there are more than a single group active in top level)
+ *
+ * Returns first expiry of the top level group, when it is the only one in level
+ * 0 and top level also only has a single active child.
+ */
+u64 tmigr_quick_check(void)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ struct tmigr_group *topgroup;
+ struct list_head *lvllist;
+
+ if (tmigr_is_not_available(tmc))
+ return KTIME_MAX;
+
+ if (WARN_ON_ONCE(tmc->idle))
+ return KTIME_MAX;
+
+ if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->childmask))
+ return KTIME_MAX;
+
+ for (int i = tmigr_hierarchy_levels; i > 0 ; i--) {
+ lvllist = &tmigr_level_list[i - 1];
+ if (list_is_singular(lvllist)) {
+ topgroup = list_first_entry(lvllist, struct tmigr_group, list);
+
+ if (tmigr_check_lonely(topgroup))
+ return READ_ONCE(topgroup->next_expiry);
+ } else {
+ continue;
+ }
+ }
+
+ return KTIME_MAX;
+}
+
+static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
+ int node)
+{
+ union tmigr_state s;
+
+ raw_spin_lock_init(&group->lock);
+
+ group->level = lvl;
+ group->numa_node = lvl < tmigr_crossnode_level ? node : NUMA_NO_NODE;
+
+ group->num_children = 0;
+
+ s.migrator = TMIGR_NONE;
+ s.active = 0;
+ s.seq = 0;
+ atomic_set(&group->migr_state, s.state);
+
+ timerqueue_init_head(&group->events);
+ timerqueue_init(&group->groupevt.nextevt);
+ group->groupevt.nextevt.expires = KTIME_MAX;
+ WRITE_ONCE(group->next_expiry, KTIME_MAX);
+ group->groupevt.ignore = true;
+}
+
+static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
+ unsigned int lvl)
+{
+ struct tmigr_group *tmp, *group = NULL;
+
+ lockdep_assert_held(&tmigr_mutex);
+
+ /* Try to attach to an existing group first */
+ list_for_each_entry(tmp, &tmigr_level_list[lvl], list) {
+ /*
+ * If @lvl is below the cross numa node level, check whether
+ * this group belongs to the same numa node.
+ */
+ if (lvl < tmigr_crossnode_level && tmp->numa_node != node)
+ continue;
+
+ /* Capacity left? */
+ if (tmp->num_children >= TMIGR_CHILDREN_PER_GROUP)
+ continue;
+
+ /*
+ * TODO: A possible further improvement: Make sure that all CPU
+ * siblings end up in the same group of the lowest level of the
+ * hierarchy. Relying on the topology sibling mask would be a
+ * reasonable solution.
+ */
+
+ group = tmp;
+ break;
+ }
+
+ if (group)
+ return group;
+
+ /* Allocate and set up a new group */
+ group = kzalloc_node(sizeof(*group), GFP_KERNEL, node);
+ if (!group)
+ return ERR_PTR(-ENOMEM);
+
+ tmigr_init_group(group, lvl, node);
+
+ /* Setup successful. Add it to the hierarchy */
+ list_add(&group->list, &tmigr_level_list[lvl]);
+ return group;
+}
+
+static void tmigr_connect_child_parent(struct tmigr_group *child,
+ struct tmigr_group *parent)
+{
+ union tmigr_state childstate;
+
+ raw_spin_lock_irq(&child->lock);
+ raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
+
+ child->parent = parent;
+ child->childmask = BIT(parent->num_children++);
+
+ raw_spin_unlock(&parent->lock);
+ raw_spin_unlock_irq(&child->lock);
+
+ /*
+ * To prevent inconsistent states, active children need to be active in
+ * the new parent as well. Inactive children are already marked inactive
+ * in the parent group.
+ */
+ childstate.state = atomic_read(&child->migr_state);
+ if (childstate.migrator != TMIGR_NONE) {
+ struct tmigr_walk data;
+
+ data.childmask = child->childmask;
+ data.groupstate.state = atomic_read(&parent->migr_state);
+
+ /*
+ * Only one new level is added at a time. When the child is
+ * set active while connecting it to an inactive parent, the
+ * parent needs to be the uppermost level. Otherwise something
+ * went wrong!
+ */
+ WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
+ }
+}
+
+static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
+{
+ struct tmigr_group *group, *child, **stack;
+ int top = 0, err = 0, i = 0;
+ struct list_head *lvllist;
+
+ stack = kcalloc(tmigr_hierarchy_levels, sizeof(*stack), GFP_KERNEL);
+ if (!stack)
+ return -ENOMEM;
+
+ do {
+ group = tmigr_get_group(cpu, node, i);
+ if (IS_ERR(group)) {
+ err = PTR_ERR(group);
+ break;
+ }
+
+ top = i;
+ stack[i++] = group;
+
+ /*
+ * When fewer CPUs are booted than are available in the system,
+ * not all calculated hierarchy levels are required.
+ *
+ * The loop is aborted as soon as the highest level, which might
+ * be different from tmigr_hierarchy_levels, contains only a
+ * single group.
+ */
+ if (group->parent || i == tmigr_hierarchy_levels ||
+ (list_empty(&tmigr_level_list[i]) &&
+ list_is_singular(&tmigr_level_list[i - 1])))
+ break;
+
+ } while (i < tmigr_hierarchy_levels);
+
+ do {
+ group = stack[--i];
+
+ if (err < 0) {
+ list_del(&group->list);
+ kfree(group);
+ continue;
+ }
+
+ WARN_ON_ONCE(i != group->level);
+
+ /*
+ * Update tmc -> group / child -> group connection
+ */
+ if (i == 0) {
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+ raw_spin_lock_irq(&group->lock);
+
+ tmc->tmgroup = group;
+ tmc->childmask = BIT(group->num_children++);
+
+ raw_spin_unlock_irq(&group->lock);
+
+ /* There are no children that need to be connected */
+ continue;
+ } else {
+ child = stack[i - 1];
+ tmigr_connect_child_parent(child, group);
+ }
+
+ /* check if uppermost level was newly created */
+ if (top != i)
+ continue;
+
+ WARN_ON_ONCE(top == 0);
+
+ lvllist = &tmigr_level_list[top];
+ if (group->num_children == 1 && list_is_singular(lvllist)) {
+ lvllist = &tmigr_level_list[top - 1];
+ list_for_each_entry(child, lvllist, list) {
+ if (child->parent)
+ continue;
+
+ tmigr_connect_child_parent(child, group);
+ }
+ }
+ } while (i > 0);
+
+ kfree(stack);
+
+ return err;
+}
+
+static int tmigr_add_cpu(unsigned int cpu)
+{
+ int node = cpu_to_node(cpu);
+ int ret;
+
+ mutex_lock(&tmigr_mutex);
+ ret = tmigr_setup_groups(cpu, node);
+ mutex_unlock(&tmigr_mutex);
+
+ return ret;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+ int ret;
+
+ /* First online attempt? Initialize CPU data */
+ if (!tmc->tmgroup) {
+ raw_spin_lock_init(&tmc->lock);
+
+ ret = tmigr_add_cpu(cpu);
+ if (ret < 0)
+ return ret;
+
+ if (tmc->childmask == 0)
+ return -EINVAL;
+
+ timerqueue_init(&tmc->cpuevt.nextevt);
+ tmc->cpuevt.nextevt.expires = KTIME_MAX;
+ tmc->cpuevt.ignore = true;
+ tmc->cpuevt.cpu = cpu;
+
+ tmc->remote = false;
+ WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+ }
+ raw_spin_lock_irq(&tmc->lock);
+ if (timer_base_is_idle())
+ tmc->idle = true;
+ else
+ __tmigr_cpu_activate(tmc);
+ tmc->online = true;
+ raw_spin_unlock_irq(&tmc->lock);
+ return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+ struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+ raw_spin_lock_irq(&tmc->lock);
+ tmc->online = false;
+ WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+ /*
+ * CPU has to handle the local events on its own, when on the way to
+ * offline; therefore the nextevt value is set to KTIME_MAX
+ */
+ __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+ raw_spin_unlock_irq(&tmc->lock);
+
+ return 0;
+}
+
+static int __init tmigr_init(void)
+{
+ unsigned int cpulvl, nodelvl, cpus_per_node, i;
+ unsigned int nnodes = num_possible_nodes();
+ unsigned int ncpus = num_possible_cpus();
+ int ret = -ENOMEM;
+
+ /* Nothing to do if running on UP */
+ if (ncpus == 1)
+ return 0;
+
+ /*
+ * Calculate the required hierarchy levels. Unfortunately there is no
+ * reliable information available, unless all possible CPUs have been
+ * brought up and all numa nodes are populated.
+ *
+ * Estimate the number of levels with the number of possible nodes and
+ * the number of possible CPUs. Assume CPUs are spread evenly across
+ * nodes. We cannot rely on cpumask_of_node() because it only considers
+ * CPUs that are already online.
+ */
+ cpus_per_node = DIV_ROUND_UP(ncpus, nnodes);
+
+ /* Calc the hierarchy levels required to hold the CPUs of a node */
+ cpulvl = DIV_ROUND_UP(order_base_2(cpus_per_node),
+ ilog2(TMIGR_CHILDREN_PER_GROUP));
+
+ /* Calculate the extra levels to connect all nodes */
+ nodelvl = DIV_ROUND_UP(order_base_2(nnodes),
+ ilog2(TMIGR_CHILDREN_PER_GROUP));
+
+ tmigr_hierarchy_levels = cpulvl + nodelvl;
+
+ /*
+ * If a numa node spawns more than one CPU level group then the next
+ * level(s) of the hierarchy contains groups which handle all CPU groups
+ * of the same numa node. The level above goes across numa nodes. Store
+ * this information for the setup code to decide when node matching is
+ * no longer required.
+ */
+ tmigr_crossnode_level = cpulvl;
+
+ tmigr_level_list = kcalloc(tmigr_hierarchy_levels, sizeof(struct list_head), GFP_KERNEL);
+ if (!tmigr_level_list)
+ goto err;
+
+ for (i = 0; i < tmigr_hierarchy_levels; i++)
+ INIT_LIST_HEAD(&tmigr_level_list[i]);
+
+ pr_info("Timer migration: %d hierarchy levels; %d children per group;"
+ " %d crossnode level\n",
+ tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
+ tmigr_crossnode_level);
+
+ ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
+ tmigr_cpu_online, tmigr_cpu_offline);
+ if (ret)
+ goto err;
+
+ return 0;
+
+err:
+ pr_err("Timer migration setup failed\n");
+ return ret;
+}
+late_initcall(tmigr_init);
diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
new file mode 100644
index 000000000000..260b87e5708d
--- /dev/null
+++ b/kernel/time/timer_migration.h
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _KERNEL_TIME_MIGRATION_H
+#define _KERNEL_TIME_MIGRATION_H
+
+/* Per group capacity. Must be a power of 2! */
+#define TMIGR_CHILDREN_PER_GROUP 8
+
+/**
+ * struct tmigr_event - a timer event associated to a CPU
+ * @nextevt: The node to enqueue an event in the parent group queue
+ * @cpu: The CPU to which this event belongs
+ * @ignore: Hint whether the event could be ignored; it is set when
+ * CPU or group is active;
+ */
+struct tmigr_event {
+ struct timerqueue_node nextevt;
+ unsigned int cpu;
+ bool ignore;
+};
+
+/**
+ * struct tmigr_group - timer migration hierarchy group
+ * @lock: Lock protecting the event information and group hierarchy
+ * information during setup
+ * @migr_state: State of the group (see union tmigr_state)
+ * @parent: Pointer to the parent group
+ * @groupevt: Next event of the group which is only used when the
+ * group is !active. The group event is then queued into
+ * the parent timer queue.
+ * Ignore bit of @groupevt is set when the group is active.
+ * @next_expiry: Base monotonic expiry time of the next event of the
+ * group; It is used for the racy lockless check whether a
+ * remote expiry is required; it is always reliable
+ * @events: Timer queue for child events queued in the group
+ * @childmask: childmask of the group in the parent group; is set
+ * during setup and will never change; could be read
+ * lockless
+ * @level: Hierarchy level of the group; Required during setup
+ * @list: List head that is added to the per level
+ * tmigr_level_list; is required during setup when a
+ * new group needs to be connected to the existing
+ * hierarchy groups
+ * @numa_node: Is set to numa node when level < tmigr_crossnode_level;
+ * otherwise it is set to NUMA_NO_NODE; Required for
+ * setup only to make sure CPUs and groups are per
+ * numa node as long as level < tmigr_crossnode_level
+ * @num_children: Counter of group children to make sure the group is only
+ * filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
+ * only
+ */
+struct tmigr_group {
+ raw_spinlock_t lock;
+ atomic_t migr_state;
+ struct tmigr_group *parent;
+ struct tmigr_event groupevt;
+ u64 next_expiry;
+ struct timerqueue_head events;
+ u8 childmask;
+ unsigned int level;
+ struct list_head list;
+ int numa_node;
+ unsigned int num_children;
+};
+
+/**
+ * struct tmigr_cpu - timer migration per CPU group
+ * @lock: Lock protecting the tmigr_cpu group information
+ * @online: Indicates whether the CPU is online; In deactivate path
+ * it is required to know whether the migrator in the top
+ * level group is on the way to go offline when a timer is
+ * pending. Then another online CPU needs to be rescheduled
+ * to make sure the timers are handled properly;
+ * Furthermore the information is required in CPU hotplug
+ * path as the CPU is able to go idle before the timer
+ * migration hierarchy hotplug AP is reached. During this
+ * phase, the CPU has to handle the global timers on its
+ * own and does not act as a migrator.
+ * @idle: Indicates whether the CPU is idle in the timer migration
+ * hierarchy
+ * @remote: Is set when timers of the CPU are expired remote
+ * @wakeup_recalc: Indicates, whether a recalculation of the @wakeup value
+ * is required. It is only used when the CPU is marked idle
+ * in the timer migration hierarchy.
+ * @tmgroup: Pointer to the parent group
+ * @childmask: childmask of tmigr_cpu in the parent group
+ * @wakeup: Stores the first timer when the timer migration
+ * hierarchy is completely idle and remote expiry was done;
+ * is returned to timer code in the idle path; it is only
+ * valid, when @wakeup_recalc is not set
+ * @cpuevt: CPU event which could be queued into the parent group
+ */
+struct tmigr_cpu {
+ raw_spinlock_t lock;
+ bool online;
+ bool idle;
+ bool remote;
+ bool wakeup_recalc;
+ struct tmigr_group *tmgroup;
+ u8 childmask;
+ u64 wakeup;
+ struct tmigr_event cpuevt;
+};
+
+/**
+ * union tmigr_state - state of tmigr_group
+ * @state: Combined version of the state - only used for atomic
+ * read/cmpxchg function
+ * @struct: Split version of the state - only use the struct members to
+ * update information to stay independent of endianness
+ */
+union tmigr_state {
+ u32 state;
+ /**
+ * struct - split state of tmigr_group
+ * @active: Contains each childmask bit of the active children
+ * @migrator: Contains childmask of the child which is migrator
+ * @seq: Sequence counter needs to be increased when an update
+ * to the tmigr_state is done. It prevents a race when
+ * updates in the child groups are propagated in changed
+ * order. Detailed information about the scenario is
+ * given in the documentation at the beginning of
+ * timer_migration.c.
+ */
+ struct {
+ u8 active;
+ u8 migrator;
+ u16 seq;
+ } __packed;
+};
+
+#if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
+extern void tmigr_handle_remote(void);
+extern int tmigr_requires_handle_remote(void);
+extern void tmigr_cpu_activate(void);
+extern u64 tmigr_cpu_deactivate(u64 nextevt);
+extern u64 tmigr_cpu_new_timer(u64 nextevt);
+extern u64 tmigr_quick_check(void);
+#else
+static inline void tmigr_handle_remote(void) { }
+static inline int tmigr_requires_handle_remote(void) { return 0; }
+static inline void tmigr_cpu_activate(void) { }
+#endif
+
+#endif
--
2.39.2
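Editorial note on the level estimate in tmigr_init() above: the arithmetic can be tried out in userspace. The following is only a sketch under stated assumptions, not part of the patch. ilog2_u(), order_base_2_u(), struct tmigr_levels and estimate_levels() are hypothetical names that imitate the kernel helpers ilog2() and order_base_2(); CHILDREN_PER_GROUP mirrors TMIGR_CHILDREN_PER_GROUP.

```c
#include <assert.h>

/* Mirrors TMIGR_CHILDREN_PER_GROUP from the patch; must be a power of 2. */
#define CHILDREN_PER_GROUP 8U

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* floor(log2(n)) for n > 0, imitating the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* ceil(log2(n)), imitating the kernel's order_base_2(); 0 for n <= 1. */
static unsigned int order_base_2_u(unsigned int n)
{
	return n > 1 ? ilog2_u(n - 1) + 1 : 0;
}

/* Hypothetical container for the three values tmigr_init() derives. */
struct tmigr_levels {
	unsigned int cpulvl;	/* levels needed for the CPUs of one node */
	unsigned int nodelvl;	/* extra levels needed to connect all nodes */
	unsigned int total;	/* corresponds to tmigr_hierarchy_levels */
};

static struct tmigr_levels estimate_levels(unsigned int ncpus,
					   unsigned int nnodes)
{
	struct tmigr_levels l;
	unsigned int cpus_per_node;

	assert(ncpus > 0 && nnodes > 0);

	/* Assume CPUs are spread evenly across nodes, as the patch does. */
	cpus_per_node = DIV_ROUND_UP(ncpus, nnodes);

	l.cpulvl = DIV_ROUND_UP(order_base_2_u(cpus_per_node),
				ilog2_u(CHILDREN_PER_GROUP));
	l.nodelvl = DIV_ROUND_UP(order_base_2_u(nnodes),
				 ilog2_u(CHILDREN_PER_GROUP));
	l.total = l.cpulvl + l.nodelvl;

	return l;
}
```

For example, estimate_levels(256, 4) yields cpulvl = 2, nodelvl = 1 and total = 3: 64 CPUs per node need two levels of 8-way groups, and one more level connects the four nodes. In that case tmigr_crossnode_level would be 2.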
The timer pull logic needs proper debugging aids. Add tracepoints so the
hierarchical idle machinery can be diagnosed.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
v9: Add tmigr_cpu_new_timer_idle tracepoint
v8: Add wakeup value to tracepoints
---
include/trace/events/timer_migration.h | 297 +++++++++++++++++++++++++
kernel/time/timer_migration.c | 26 +++
2 files changed, 323 insertions(+)
create mode 100644 include/trace/events/timer_migration.h
diff --git a/include/trace/events/timer_migration.h b/include/trace/events/timer_migration.h
new file mode 100644
index 000000000000..a2e7e32058f8
--- /dev/null
+++ b/include/trace/events/timer_migration.h
@@ -0,0 +1,297 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM timer_migration
+
+#if !defined(_TRACE_TIMER_MIGRATION_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_TIMER_MIGRATION_H
+
+#include <linux/tracepoint.h>
+
+/* Group events */
+TRACE_EVENT(tmigr_group_set,
+
+ TP_PROTO(struct tmigr_group *group),
+
+ TP_ARGS(group),
+
+ TP_STRUCT__entry(
+ __field( void *, group )
+ __field( unsigned int, lvl )
+ __field( unsigned int, numa_node )
+ ),
+
+ TP_fast_assign(
+ __entry->group = group;
+ __entry->lvl = group->level;
+ __entry->numa_node = group->numa_node;
+ ),
+
+ TP_printk("group=%p lvl=%d numa=%d",
+ __entry->group, __entry->lvl, __entry->numa_node)
+);
+
+TRACE_EVENT(tmigr_connect_child_parent,
+
+ TP_PROTO(struct tmigr_group *child),
+
+ TP_ARGS(child),
+
+ TP_STRUCT__entry(
+ __field( void *, child )
+ __field( void *, parent )
+ __field( unsigned int, lvl )
+ __field( unsigned int, numa_node )
+ __field( unsigned int, num_children )
+ __field( u32, childmask )
+ ),
+
+ TP_fast_assign(
+ __entry->child = child;
+ __entry->parent = child->parent;
+ __entry->lvl = child->parent->level;
+ __entry->numa_node = child->parent->numa_node;
+ __entry->num_children = child->parent->num_children;
+ __entry->childmask = child->childmask;
+ ),
+
+ TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
+ __entry->child, __entry->childmask, __entry->parent,
+ __entry->lvl, __entry->numa_node, __entry->num_children)
+);
+
+TRACE_EVENT(tmigr_connect_cpu_parent,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc),
+
+ TP_STRUCT__entry(
+ __field( void *, parent )
+ __field( unsigned int, cpu )
+ __field( unsigned int, lvl )
+ __field( unsigned int, numa_node )
+ __field( unsigned int, num_children )
+ __field( u32, childmask )
+ ),
+
+ TP_fast_assign(
+ __entry->parent = tmc->tmgroup;
+ __entry->cpu = tmc->cpuevt.cpu;
+ __entry->lvl = tmc->tmgroup->level;
+ __entry->numa_node = tmc->tmgroup->numa_node;
+ __entry->num_children = tmc->tmgroup->num_children;
+ __entry->childmask = tmc->childmask;
+ ),
+
+ TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
+ __entry->cpu, __entry->childmask, __entry->parent,
+ __entry->lvl, __entry->numa_node, __entry->num_children)
+);
+
+DECLARE_EVENT_CLASS(tmigr_group_and_cpu,
+
+ TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
+
+ TP_ARGS(group, state, childmask),
+
+ TP_STRUCT__entry(
+ __field( void *, group )
+ __field( void *, parent )
+ __field( unsigned int, lvl )
+ __field( unsigned int, numa_node )
+ __field( u8, active )
+ __field( u8, migrator )
+ __field( u32, childmask )
+ ),
+
+ TP_fast_assign(
+ __entry->group = group;
+ __entry->parent = group->parent;
+ __entry->lvl = group->level;
+ __entry->numa_node = group->numa_node;
+ __entry->active = state.active;
+ __entry->migrator = state.migrator;
+ __entry->childmask = childmask;
+ ),
+
+ TP_printk("group=%p lvl=%d numa=%d active=%0x migrator=%0x "
+ "parent=%p childmask=%0x",
+ __entry->group, __entry->lvl, __entry->numa_node,
+ __entry->active, __entry->migrator,
+ __entry->parent, __entry->childmask)
+);
+
+DEFINE_EVENT(tmigr_group_and_cpu, tmigr_group_set_cpu_inactive,
+
+ TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
+
+ TP_ARGS(group, state, childmask)
+);
+
+DEFINE_EVENT(tmigr_group_and_cpu, tmigr_group_set_cpu_active,
+
+ TP_PROTO(struct tmigr_group *group, union tmigr_state state, u32 childmask),
+
+ TP_ARGS(group, state, childmask)
+);
+
+/* CPU events */
+DECLARE_EVENT_CLASS(tmigr_cpugroup,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc),
+
+ TP_STRUCT__entry(
+ __field( void *, parent)
+ __field( unsigned int, cpu)
+ __field( u64, wakeup)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = tmc->cpuevt.cpu;
+ __entry->parent = tmc->tmgroup;
+ __entry->wakeup = tmc->wakeup;
+ ),
+
+ TP_printk("cpu=%d parent=%p wakeup=%llu", __entry->cpu, __entry->parent, __entry->wakeup)
+);
+
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_new_timer,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc)
+);
+
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_active,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc)
+);
+
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_online,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc)
+);
+
+DEFINE_EVENT(tmigr_cpugroup, tmigr_cpu_offline,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc)
+);
+
+DEFINE_EVENT(tmigr_cpugroup, tmigr_handle_remote_cpu,
+
+ TP_PROTO(struct tmigr_cpu *tmc),
+
+ TP_ARGS(tmc)
+);
+
+DECLARE_EVENT_CLASS(tmigr_idle,
+
+ TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
+
+ TP_ARGS(tmc, nextevt),
+
+ TP_STRUCT__entry(
+ __field( void *, parent)
+ __field( unsigned int, cpu)
+ __field( u64, nextevt)
+ __field( u64, wakeup)
+ ),
+
+ TP_fast_assign(
+ __entry->cpu = tmc->cpuevt.cpu;
+ __entry->parent = tmc->tmgroup;
+ __entry->nextevt = nextevt;
+ __entry->wakeup = tmc->wakeup;
+ ),
+
+ TP_printk("cpu=%d parent=%p nextevt=%llu wakeup=%llu",
+ __entry->cpu, __entry->parent, __entry->nextevt, __entry->wakeup)
+);
+
+DEFINE_EVENT(tmigr_idle, tmigr_cpu_idle,
+
+ TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
+
+ TP_ARGS(tmc, nextevt)
+);
+
+DEFINE_EVENT(tmigr_idle, tmigr_cpu_new_timer_idle,
+
+ TP_PROTO(struct tmigr_cpu *tmc, u64 nextevt),
+
+ TP_ARGS(tmc, nextevt)
+);
+
+TRACE_EVENT(tmigr_update_events,
+
+ TP_PROTO(struct tmigr_group *child, struct tmigr_group *group,
+ union tmigr_state childstate, union tmigr_state groupstate,
+ u64 nextevt),
+
+ TP_ARGS(child, group, childstate, groupstate, nextevt),
+
+ TP_STRUCT__entry(
+ __field( void *, child )
+ __field( void *, group )
+ __field( u64, nextevt )
+ __field( u64, group_next_expiry )
+ __field( unsigned int, group_lvl )
+ __field( u8, child_active )
+ __field( u8, group_active )
+ __field( unsigned int, child_evtcpu )
+ __field( u64, child_evt_expiry )
+ ),
+
+ TP_fast_assign(
+ __entry->child = child;
+ __entry->group = group;
+ __entry->nextevt = nextevt;
+ __entry->group_next_expiry = group->next_expiry;
+ __entry->group_lvl = group->level;
+ __entry->child_active = childstate.active;
+ __entry->group_active = groupstate.active;
+ __entry->child_evtcpu = child ? child->groupevt.cpu : 0;
+ __entry->child_evt_expiry = child ? child->groupevt.nextevt.expires : 0;
+ ),
+
+ TP_printk("child=%p group=%p group_lvl=%d child_active=%0x group_active=%0x "
+ "nextevt=%llu next_expiry=%llu child_evt_expiry=%llu child_evtcpu=%d",
+ __entry->child, __entry->group, __entry->group_lvl, __entry->child_active,
+ __entry->group_active,
+ __entry->nextevt, __entry->group_next_expiry, __entry->child_evt_expiry,
+ __entry->child_evtcpu)
+);
+
+TRACE_EVENT(tmigr_handle_remote,
+
+ TP_PROTO(struct tmigr_group *group),
+
+ TP_ARGS(group),
+
+ TP_STRUCT__entry(
+ __field( void * , group )
+ __field( unsigned int , lvl )
+ ),
+
+ TP_fast_assign(
+ __entry->group = group;
+ __entry->lvl = group->level;
+ ),
+
+ TP_printk("group=%p lvl=%d",
+ __entry->group, __entry->lvl)
+);
+
+#endif /* _TRACE_TIMER_MIGRATION_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
index 05cd8f1bc45d..54ab18ccc62b 100644
--- a/kernel/time/timer_migration.c
+++ b/kernel/time/timer_migration.c
@@ -14,6 +14,9 @@
#include "timer_migration.h"
#include "tick-internal.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/timer_migration.h>
+
/*
* The timer migration mechanism is built on a hierarchy of groups. The
* lowest level group contains CPUs, the next level groups of CPU groups
@@ -511,6 +514,8 @@ static bool tmigr_active_up(struct tmigr_group *group,
*/
group->groupevt.ignore = true;
+ trace_tmigr_group_set_cpu_active(group, newstate, childmask);
+
return walk_done;
}
@@ -521,6 +526,8 @@ static void __tmigr_cpu_activate(struct tmigr_cpu *tmc)
data.childmask = tmc->childmask;
data.groupstate.state = atomic_read(&tmc->tmgroup->migr_state);
+ trace_tmigr_cpu_active(tmc);
+
tmc->cpuevt.ignore = true;
WRITE_ONCE(tmc->wakeup, KTIME_MAX);
tmc->wakeup_recalc = false;
@@ -688,6 +695,9 @@ static bool tmigr_update_events(struct tmigr_group *group,
data->nextexp = tmigr_next_groupevt_expires(group);
}
+ trace_tmigr_update_events(child, group, data->childstate,
+ data->groupstate, nextexp);
+
unlock:
raw_spin_unlock(&group->lock);
@@ -721,6 +731,8 @@ static u64 tmigr_new_timer(struct tmigr_cpu *tmc, u64 nextexp)
if (tmc->remote)
return KTIME_MAX;
+ trace_tmigr_cpu_new_timer(tmc);
+
tmc->cpuevt.ignore = false;
data.remote = false;
@@ -754,6 +766,8 @@ static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
return next;
}
+ trace_tmigr_handle_remote_cpu(tmc);
+
tmc->remote = true;
WRITE_ONCE(tmc->wakeup, KTIME_MAX);
@@ -838,6 +852,7 @@ static bool tmigr_handle_remote_up(struct tmigr_group *group,
childmask = data->childmask;
+ trace_tmigr_handle_remote(group);
again:
/*
* Handle the group only if @childmask is the migrator or if the
@@ -1101,6 +1116,7 @@ u64 tmigr_cpu_new_timer(u64 nextexp)
*/
WRITE_ONCE(tmc->wakeup, ret);
+ trace_tmigr_cpu_new_timer_idle(tmc, nextexp);
raw_spin_unlock(&tmc->lock);
return ret;
}
@@ -1210,6 +1226,8 @@ static bool tmigr_inactive_up(struct tmigr_group *group,
}
}
+ trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
+
return walk_done;
}
@@ -1264,6 +1282,7 @@ u64 tmigr_cpu_deactivate(u64 nextexp)
*/
WRITE_ONCE(tmc->wakeup, ret);
+ trace_tmigr_cpu_idle(tmc, nextexp);
raw_spin_unlock(&tmc->lock);
return ret;
}
@@ -1376,6 +1395,7 @@ static struct tmigr_group *tmigr_get_group(unsigned int cpu, int node,
/* Setup successful. Add it to the hierarchy */
list_add(&group->list, &tmigr_level_list[lvl]);
+ trace_tmigr_group_set(group);
return group;
}
@@ -1393,6 +1413,8 @@ static void tmigr_connect_child_parent(struct tmigr_group *child,
raw_spin_unlock(&parent->lock);
raw_spin_unlock_irq(&child->lock);
+ trace_tmigr_connect_child_parent(child);
+
/*
* To prevent inconsistent states, active children need to be active in
* the new parent as well. Inactive children are already marked inactive
@@ -1474,6 +1496,8 @@ static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
raw_spin_unlock_irq(&group->lock);
+ trace_tmigr_connect_cpu_parent(tmc);
+
/* There are no children that need to be connected */
continue;
} else {
@@ -1541,6 +1565,7 @@ static int tmigr_cpu_online(unsigned int cpu)
WRITE_ONCE(tmc->wakeup, KTIME_MAX);
}
raw_spin_lock_irq(&tmc->lock);
+ trace_tmigr_cpu_online(tmc);
if (timer_base_is_idle())
tmc->idle = true;
else
@@ -1563,6 +1588,7 @@ static int tmigr_cpu_offline(unsigned int cpu)
* offline; Therefore nextevt value is set to KTIME_MAX
*/
__tmigr_cpu_deactivate(tmc, KTIME_MAX);
+ trace_tmigr_cpu_offline(tmc);
raw_spin_unlock_irq(&tmc->lock);
return 0;
--
2.39.2
From: "Richard Cochran (linutronix GmbH)" <[email protected]>
Move the locking out from __run_timers() to the call sites, so the
protected section can be extended at the call site. Preparatory patch for
changing the NOHZ timer placement to a pull at expiry time model.
No functional change.
Signed-off-by: Richard Cochran (linutronix GmbH) <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
---
kernel/time/timer.c | 31 +++++++++++++++++++++----------
1 file changed, 21 insertions(+), 10 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2cff43c10329..b0fa8afe9059 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2230,11 +2230,7 @@ static inline void __run_timers(struct timer_base *base)
struct hlist_head heads[LVL_DEPTH];
int levels;
- if (time_before(jiffies, base->next_expiry))
- return;
-
- timer_base_lock_expiry(base);
- raw_spin_lock_irq(&base->lock);
+ lockdep_assert_held(&base->lock);
while (time_after_eq(jiffies, base->clk) &&
time_after_eq(jiffies, base->next_expiry)) {
@@ -2258,21 +2254,36 @@ static inline void __run_timers(struct timer_base *base)
while (levels--)
expire_timers(base, heads + levels);
}
+}
+
+static void __run_timer_base(struct timer_base *base)
+{
+ if (time_before(jiffies, base->next_expiry))
+ return;
+
+ timer_base_lock_expiry(base);
+ raw_spin_lock_irq(&base->lock);
+ __run_timers(base);
raw_spin_unlock_irq(&base->lock);
timer_base_unlock_expiry(base);
}
+static void run_timer_base(int index)
+{
+ struct timer_base *base = this_cpu_ptr(&timer_bases[index]);
+
+ __run_timer_base(base);
+}
+
/*
* This function runs timers and the timer-tq in bottom half context.
*/
static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
- struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
-
- __run_timers(base);
+ run_timer_base(BASE_LOCAL);
if (IS_ENABLED(CONFIG_NO_HZ_COMMON)) {
- __run_timers(this_cpu_ptr(&timer_bases[BASE_GLOBAL]));
- __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
+ run_timer_base(BASE_GLOBAL);
+ run_timer_base(BASE_DEF);
}
}
--
2.39.2
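Editorial note on the locking split above: the pattern is "the wrapper takes the lock, the worker only asserts that it is held". The following userspace sketch illustrates that shape only; it is not the kernel code. struct fake_base, after_eq(), run_timers_locked() and the rounds counter are hypothetical, and a plain flag stands in for base->lock and lockdep_assert_held().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct timer_base; only what the sketch needs. */
struct fake_base {
	bool locked;			/* models base->lock being held */
	unsigned long clk;
	unsigned long next_expiry;
	unsigned int rounds;		/* how often the expiry loop body ran */
};

/* Wrap-safe "a >= b", in the style of time_after_eq(). */
static bool after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Worker: expects the caller to hold the lock, like __run_timers() after the patch. */
static void run_timers_locked(struct fake_base *base, unsigned long now)
{
	assert(base->locked);	/* plays the role of lockdep_assert_held() */

	while (after_eq(now, base->clk) && after_eq(now, base->next_expiry)) {
		base->clk++;
		base->rounds++;
	}
}

/* Wrapper: early exit, lock, run, unlock, like __run_timer_base(). */
static bool run_timer_base(struct fake_base *base, unsigned long now)
{
	if (!after_eq(now, base->next_expiry))
		return false;

	base->locked = true;
	run_timers_locked(base, now);
	/* A call site could do more work here while still holding the lock. */
	base->locked = false;

	return true;
}
```

The point of the split is the comment in the wrapper: once locking lives at the call site, a different caller can keep the lock held across additional work, which is what the commit message means by extending the protected section.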
On 2023-12-01 10:26:34 [+0100], Anna-Maria Behnsen wrote:
> When no timer is queued into an empty timer base, the next_expiry will not
> be updated. It was originally calculated as
>
> base->clk + NEXT_TIMER_MAX_DELTA
>
> When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
> next_expiry value of the empty base suggests that there is a timer pending
> soon. This might be more a kind of a theoretical problem, but the fix
> doesn't hurt.
So __run_timers() sets base::next_expiry to base->clk +
NEXT_TIMER_MAX_DELTA and then we have no more timers enqueued.
But wouldn't base->timers_pending remain false? Therefore it would use
"expires = KTIME_MAX" as return value (well cmp_next_hrtimer_event())?
Based on the code as of #11, it would only set timer_base::is_idle
wrongly false if it wraps around. Other than that, I don't see an issue.
What do I miss?
If you update it regardless here then it would make a difference to
run_local_timers(), assuming we still have hrtimers which expire, and this
next_expiry check might raise a softirq since it does not consider the
timers_pending value.
> Use only base->next_expiry value as nextevt when timers are
> pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
> information is in place, update base->next_expiry value of the empty timer
> base as well.
or consider timers_pending in run_local_timers()? An additional read vs
write?
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1944,10 +1943,20 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
> __forward_timer_base(base, basej);
>
> if (base->timers_pending) {
> + nextevt = base->next_expiry;
> +
> /* If we missed a tick already, force 0 delta */
> if (time_before(nextevt, basej))
> nextevt = basej;
> expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
> + } else {
> + /*
> + * Move next_expiry for the empty base into the future to
> + * prevent a unnecessary raise of the timer softirq when the
s/a unnecessary/an unnecessary/
> + * next_expiry value will be reached even if there is no timer
> + * pending.
> + */
> + base->next_expiry = nextevt;
> }
>
> /*
Sebastian
On 2023-12-01 10:26:38 [+0100], Anna-Maria Behnsen wrote:
> When tick is stopped also the timer base is_idle flag is set. When
> reentering the timer_base_try_to_set_idle() with the tick stopped, there is
> no need to check whether the timer base needs to be set idle again. When a
> timer was enqueued in the meantime, this is already handled by the
> nohz_get_next_event() call which was executed before tick_nohz_stop_tick().
as of #15 tick_stopped is set later in tick_nohz_stop_tick() and both
(tick_sched::tick_stopped and timer_base::is_idle) are cleared in
tick_nohz_restart_sched_tick().
Then we have tick_nohz_idle_retain_tick() with only one caller and is
only clearing timer_base::is_idle. Now, wouldn't it make sense to
preload timer_idle based on timer_base::is_idle?
I don't know if there is a different outcome if timer_base::is_idle
gets cleared in the idle path vs tick_sched::tick_stopped.
I can't find nohz_get_next_event().
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
Sebastian
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:34 [+0100], Anna-Maria Behnsen wrote:
>> When no timer is queued into an empty timer base, the next_expiry will not
>> be updated. It was originally calculated as
>>
>> base->clk + NEXT_TIMER_MAX_DELTA
>>
>> When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
>> next_expiry value of the empty base suggests that there is a timer pending
>> soon. This might be more a kind of a theoretical problem, but the fix
>> doesn't hurt.
>
> So __run_timers() sets base::next_expiry to base->clk +
> NEXT_TIMER_MAX_DELTA and then we have no more timers enqueued.
>
> But wouldn't base->timers_pending remain false? Therefore it would use
> "expires = KTIME_MAX" as return value (well cmp_next_hrtimer_event())?
Jupp.
> Based on the code as of #11, it would only set timer_base::is_idle
> wrongly false if it wraps around. Other than that, I don't see an issue.
> What do I miss?
And it will raise an unnecessary softirq when it wraps around, as you
also mentioned in the next paragraph.
> If you update it regardless here then it would make a difference to
> run_local_timers() assuming we have still hrtimer which expire and this
> next_expiry check might raise a softirq since it does not consider the
> timers_pending value.
The only difference with this change would be that the softirq will not
be raised when it wraps around.
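Editorial note: the stale next_expiry case discussed here can be reproduced with plain integer arithmetic. This is a minimal sketch under stated assumptions, not the kernel code. It uses 32-bit counters as a stand-in for jiffies, assumes MAX_DELTA matches NEXT_TIMER_MAX_DELTA, and the helpers before32(), raises_softirq() and raises_softirq_pending() are hypothetical. The second helper models the alternative raised in the thread, namely also considering timers_pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match NEXT_TIMER_MAX_DELTA in kernel/time/timer.c. */
#define MAX_DELTA ((1U << 30) - 1)

/* 32-bit stand-in for the kernel's wrap-safe time_before(a, b). */
static bool before32(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Check that looks at next_expiry only. */
static bool raises_softirq(uint32_t jiffies, uint32_t next_expiry)
{
	return !before32(jiffies, next_expiry);
}

/* Variant that additionally requires a pending timer. */
static bool raises_softirq_pending(uint32_t jiffies, uint32_t next_expiry,
				   bool timers_pending)
{
	return timers_pending && !before32(jiffies, next_expiry);
}
```

With an empty base whose next_expiry was left at clk + MAX_DELTA, the first helper stays quiet while jiffies is below that value and starts reporting an expiry once jiffies passes it, although no timer is queued. The second helper stays quiet in both cases as long as timers_pending is false.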
>> Use only base->next_expiry value as nextevt when timers are
>> pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
>> information is in place, update base->next_expiry value of the empty timer
>> base as well.
>
> or consider timers_pending in run_local_timers()? An additional read vs
> write?
This would also be a possibility to add the check in run_local_timers()
with timers_pending. And we also have to make the is_idle marking in
get_next_timer_interrupt() dependent on base::timers_pending bit. But
this also means, we cannot rely on next_expiry when no timer is pending.
Frederic, what do you think?
Thanks,
Anna-Maria
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:38 [+0100], Anna-Maria Behnsen wrote:
>> When tick is stopped also the timer base is_idle flag is set. When
>> reentering the timer_base_try_to_set_idle() with the tick stopped, there is
>> no need to check whether the timer base needs to be set idle again. When a
>> timer was enqueued in the meantime, this is already handled by the
>> nohz_get_next_event() call which was executed before tick_nohz_stop_tick().
>
> as of #15 tick_stopped is set later in tick_nohz_stop_tick() and both
> (tick_sched::tick_stopped and timer_base::is_idle) are cleared in
> tick_nohz_restart_sched_tick().
>
> Then we have tick_nohz_idle_retain_tick() with only one caller and is
> only clearing timer_base::is_idle. Now, wouldn't it make sense to
> preload timer_idle based on timer_base::is_idle?
When revisiting the code, this timer_clear_idle() is no longer required
in tick_nohz_idle_retain_tick(). This is only called when the tick is
not stopped - so timer base is not idle as well and this call is
superfluous.
As we keep both states in sync (tick_sched::tick_stopped and
timer_base::is_idle) it doesn't matter which one is used. In
tick_nohz_stop_tick() I don't have access to timer base. I could add it
to timer_base_try_to_set_idle() but it will not make a difference.
> I don't know if there is a different outcome if timer_base::is_idle
> gets cleared in the idle path vs tick_sched::tick_stopped.
> I can't find nohz_get_next_event().
s/nohz_get_next_event/tick_nohz_next_event/ ...
>> Signed-off-by: Anna-Maria Behnsen <[email protected]>
>
> Sebastian
Thanks,
Anna-Maria
On 2023-12-01 10:26:39 [+0100], Anna-Maria Behnsen wrote:
> Timer might be used as pinned timer (using add_timer_on()) and later on as
> non pinned timers using add_timer(). When the NOHZ timer pull at expiry
> model is in place, TIMER_PINNED flag is required to be used whenever a
> timer needs to expire on a dedicated CPU. Flag must no be set, if
> expiration on a dedicated CPU is not required.
Slightly reworded.
| A timer might be used as a pinned timer (using add_timer_on()) and later
| on as non-pinned timer using add_timer(). When the "NOHZ timer pull at
| expiry model" is in place, the TIMER_PINNED flag is required to be used
| whenever a timer needs to expire on a dedicated CPU. Otherwise the flag
| must not be set if expiration on a dedicated CPU is not required.
> add_timer_on()'s behavior will be changed during the preparation patches
> for the NOHZ timer pull at expiry model to unconditionally set TIMER_PINNED
> flag. To be able to reset/set the flag when queueing a timer, two variants
> of add_timer() are introduced.
and here.
| add_timer_on()'s behavior will be changed during the preparation patches
| for the "NOHZ timer pull at expiry model" to unconditionally set
| TIMER_PINNED flag. To be able to clear/set the flag when queueing a
| timer, two variants of add_timer() are introduced.
I'll let you be the judge of this.
Sebastian
On 2023-12-01 10:26:41 [+0100], Anna-Maria Behnsen wrote:
> When adding a timer to the timer wheel using add_timer_on(), it is an
> implicitly pinned timer. With the timer pull at expiry time model in place,
> TIMER_PINNED flag is required to make sure timers end up in proper base.
>
> Add TIMER_PINNED flag unconditionally when add_timer_on() is executed.
This is odd. I have some vague memory that this was already the case.
Otherwise all add_timer_on() users without TIMER_PINNED may get it wrong
due to optimisation. Looking at git history it was never the case and I
can't confuse it with hrtimer since it never supported the "_on()"
feature…
At least the mix timer in drivers/char/random.c is not using the PINNED
flag with _on(). So this was wrong then?…
Sebastian
On 2023-12-01 10:26:43 [+0100], Anna-Maria Behnsen wrote:
> Logic for getting next timer interrupt (no matter of recalculated or
> already stored in base->next_expiry) is split into a separate function
> "next_timer_interrupt()" to make it available for new call sites.
Be authoritative.
| Split the logic for getting the next timer interrupt (regardless of whether
| it is recalculated or already stored in base->next_expiry) into a separate
| function named next_timer_interrupt(). Make it available to local call
| sites only.
> No functional change.
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
Sebastian
On 2023-12-01 10:26:44 [+0100], Anna-Maria Behnsen wrote:
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1985,10 +1998,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> return expires;
> }
>
> - raw_spin_lock(&base->lock);
> - nextevt = next_timer_interrupt(base, basej);
> + base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
> + base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
> +
> + raw_spin_lock(&base_local->lock);
> + raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
> +
> + nextevt_local = next_timer_interrupt(base_local, basej);
> + nextevt_global = next_timer_interrupt(base_global, basej);
>
> - if (base->timers_pending) {
> + /*
> + * Check whether the local event is expiring before or at the same
> + * time as the global event.
> + *
> + * Note, that nextevt_global and nextevt_local might be based on
> + * different base->clk values. So it's not guaranteed that
> + * comparing with empty bases results in a correct local_first.
This ends like an unsolved mystery case. Could you add why one should
not worry about an incorrect local_first?
But seriously, how far apart can they get and what difference does it
make? At timer enqueue time clk equals jiffies. At this point one clk
base could be at jiffies and the other might be a few jiffies before
that.
The next event (as in next_expiry) should be valid for both compare
wise. Both must be larger than jiffies. The delta between jiffies and
next event has to be less than NEXT_TIMER_MAX_DELTA for each base.
> + */
> + if (base_local->timers_pending && base_global->timers_pending)
> + local_first = time_before_eq(nextevt_local, nextevt_global);
> + else
> + local_first = base_local->timers_pending;
> +
> + nextevt = local_first ? nextevt_local : nextevt_global;
> +
> + if (base_local->timers_pending || base_global->timers_pending) {
> /* If we missed a tick already, force 0 delta */
> if (time_before(nextevt, basej))
> nextevt = basej;
So if nextevt_local missed a tick and nextevt_global is
NEXT_TIMER_MAX_DELTA-1 (so we get the largest difference possible
between those two) then the time_before_eq() should still come out
right. We could still miss more than one tick.
This looks good. I just don't understand the (above) comment.
Sebastian
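The wrap-around argument above can be checked in a small userspace model. This is only a sketch: the helpers imitate the signed-difference idiom behind time_before()/time_before_eq(), and MODEL_MAX_DELTA is an assumed stand-in for NEXT_TIMER_MAX_DELTA, not the kernel's definition.

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Userspace model of wrap-safe jiffies comparisons. The unsigned
 * difference is reinterpreted as signed, so the result is meaningful
 * as long as both values are less than LONG_MAX apart.
 */
static inline bool model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static inline bool model_time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/* Assumed stand-in for NEXT_TIMER_MAX_DELTA; the value is illustrative. */
#define MODEL_MAX_DELTA ((1UL << 30) - 1)

/* Pick the earlier of two next events, preferring the local one on a tie. */
static inline unsigned long model_first_event(unsigned long nextevt_local,
					      unsigned long nextevt_global)
{
	return model_time_before_eq(nextevt_local, nextevt_global) ?
		nextevt_local : nextevt_global;
}
```

With basej close to ULONG_MAX, a local event one tick in the past and a global event MODEL_MAX_DELTA - 1 ahead, the global value wraps and is numerically smaller, yet the model still selects the local event.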
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:39 [+0100], Anna-Maria Behnsen wrote:
>> Timer might be used as pinned timer (using add_timer_on()) and later on as
>> non pinned timers using add_timer(). When the NOHZ timer pull at expiry
>> model is in place, TIMER_PINNED flag is required to be used whenever a
>> timer needs to expire on a dedicated CPU. Flag must no be set, if
>> expiration on a dedicated CPU is not required.
>
> Slightly reworded.
>
> | A timer might be used as a pinned timer (using add_timer_on()) and later
> | on as non-pinned timer using add_timer(). When the "NOHZ timer pull at
> | expiry model" is in place, the TIMER_PINNED flag is required to be used
> | whenever a timer needs to expire on a dedicated CPU. Otherwise the flag
> | must not be set if expiration on a dedicated CPU is not required.
>
>> add_timer_on()'s behavior will be changed during the preparation patches
>> for the NOHZ timer pull at expiry model to unconditionally set TIMER_PINNED
>> flag. To be able to reset/set the flag when queueing a timer, two variants
>> of add_timer() are introduced.
>
> and here.
>
> | add_timer_on()'s behavior will be changed during the preparation patches
> | for the "NOHZ timer pull at expiry model" to unconditionally set
> | TIMER_PINNED flag. To be able to clear/ set the flag when queueing a
> | timer, two variants of add_timer() are introduced.
>
> I let you be judge of this.
I will take your reworded version.
Thanks,
Anna-Maria
On 2023-12-01 10:26:45 [+0100], Anna-Maria Behnsen wrote:
> For the conversion of the NOHZ timer placement to a pull at expiry time
> model it's required to have separate expiry times for the pinned and the
> non-pinned (movable) timers. Therefore struct timer_events is introduced.
>
> No functional change
>
> Originally-by: Richard Cochran (linutronix GmbH) <[email protected]>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Reviewed-by: Frederic Weisbecker <[email protected]>
…
> index 366ea26ce3ba..0d53d853ae22 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
…
> @@ -2022,13 +2028,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>
> nextevt = local_first ? nextevt_local : nextevt_global;
>
> - if (base_local->timers_pending || base_global->timers_pending) {
> + /*
> + * If the @nextevt is at max. one tick away, use @nextevt and store
> + * it in the local expiry value. The next global event is irrelevant in
> + * this case and can be left as KTIME_MAX.
> + */
> + if (time_before_eq(nextevt, basej + 1)) {
> /* If we missed a tick already, force 0 delta */
> if (time_before(nextevt, basej))
> nextevt = basej;
> - expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
> + tevt.local = basem + (u64)(nextevt - basej) * TICK_NSEC;
> + goto unlock;
You claim "No functional change" in the patch description. However if
you take the shortcut here you don't update `idle' if set and you don't
__forward_timer_base(). The `idle` parameter doesn't matter because it
was false and will remain false as per current logic.
But what about the forward of the timer base? It is probably not a real
problem since the next add/mod timer call will forward it.
Sebastian
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:41 [+0100], Anna-Maria Behnsen wrote:
>> When adding a timer to the timer wheel using add_timer_on(), it is an
>> implicitly pinned timer. With the timer pull at expiry time model in place,
>> TIMER_PINNED flag is required to make sure timers end up in proper base.
>>
>> Add TIMER_PINNED flag unconditionally when add_timer_on() is executed.
>
> This is odd. I have some vague memory that this was already the case.
> Otherwise all add_timer_on() users without TIMER_PINNED may get it wrong
> due to optimisation.
Which optimisation are you talking about? Are you talking about the
heuristic for finding the best CPU in get_target_base()? This heuristic
is not used for add_timer_on().
> Looking at git history it was never the case and I
> can't confuse it with hrtimer since it never supported the "_on()"
> feature…
> At least the mix timer in drivers/char/random.c is not using the PINNED
> flag with _on(). So this was wrong then?…
No, this is not wrong, as at the moment timers always expire on the
CPU where they have been queued. So when a timer should be queued on a
dedicated CPU several approaches are valid:
- using add_timer_on() with or without TIMER_PINNED flag set to enqueue
timers on any specified CPU
- use add_timer()/mod_timer()/... with TIMER_PINNED flag set - but this
only works to enqueue timer on this CPU!
When using the add_timer()/mod_timer()/... functions without
TIMER_PINNED flag, the heuristic is used for finding the best CPU.
So without the timer pull model the change doesn't hurt.
But with the timer pull model in place, it is required to keep the
pinned and non pinned timers in separate per CPU wheels (local wheel =
TIMER_PINNED is set; global wheel = TIMER_PINNED is not set). So without
this change but with the timer pull model, the mix timer of random.c
would be enqueued on the dedicated CPU, but it would end up in the wrong
wheel (global wheel). And then the timer could also expire on another
CPU as the global wheels are handled by the migrator when the CPU is
idle.
Does this make it a little clearer why the change is required and
why it is also valid right now?
Thanks,
Anna-Maria
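The wheel selection described in this reply can be sketched as a toy model. All names below (MODEL_TIMER_PINNED, model_pick_base() and so on) are hypothetical and only mirror the terms used in the thread; the real code in kernel/time/timer.c is structured differently.

```c
#include <stdbool.h>

/* Hypothetical flag value; only the name mirrors TIMER_PINNED. */
#define MODEL_TIMER_PINNED 0x1u

enum model_base { MODEL_BASE_LOCAL, MODEL_BASE_GLOBAL };

struct model_timer {
	unsigned int flags;
	int cpu;	/* CPU whose wheel holds the timer */
};

/* Pinned timers go to the local wheel, everything else to the global one. */
static enum model_base model_pick_base(const struct model_timer *t)
{
	return (t->flags & MODEL_TIMER_PINNED) ?
		MODEL_BASE_LOCAL : MODEL_BASE_GLOBAL;
}

/* Models add_timer_on() without the change: the flag stays as the caller set it. */
static enum model_base model_add_timer_on_old(struct model_timer *t, int cpu)
{
	t->cpu = cpu;
	return model_pick_base(t);
}

/* Models add_timer_on() with the change: the flag is set unconditionally. */
static enum model_base model_add_timer_on(struct model_timer *t, int cpu)
{
	t->flags |= MODEL_TIMER_PINNED;
	t->cpu = cpu;
	return model_pick_base(t);
}
```

In this model a timer queued without the flag, like the mix timer case discussed above, lands in the global wheel of the chosen CPU under the old behaviour and in the local wheel under the new one.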
On 2023-12-01 10:26:46 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 0d53d853ae22..fc376e06980e 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
…
> +static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> + bool *idle)
> +{
> + struct timer_events tevt = { .local = KTIME_MAX, .global = KTIME_MAX };
> + struct timer_base *base_local, *base_global;
> + unsigned long nextevt;
> + u64 expires;
> +
> + /*
> + * Pretend that there is no timer pending if the cpu is offline.
> + * Possible pending timers will be migrated later to an active cpu.
> + */
> + if (cpu_is_offline(smp_processor_id())) {
> + if (idle)
> + *idle = true;
> + return tevt.local;
> + }
> +
> + base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
> + base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
> +
> + raw_spin_lock(&base_local->lock);
> + raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
> +
> + nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
> + base_global, &tevt);
Now you split it, move it and we have the __forward_timer_base() back in
case of the shortcut which is now in fetch_next_timer_interrupt().
All good.
>
> /*
> * We have a fresh next event. Check whether we can forward the
Sebastian
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:44 [+0100], Anna-Maria Behnsen wrote:
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1985,10 +1998,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>> return expires;
>> }
>>
>> - raw_spin_lock(&base->lock);
>> - nextevt = next_timer_interrupt(base, basej);
>> + base_local = this_cpu_ptr(&timer_bases[BASE_LOCAL]);
>> + base_global = this_cpu_ptr(&timer_bases[BASE_GLOBAL]);
>> +
>> + raw_spin_lock(&base_local->lock);
>> + raw_spin_lock_nested(&base_global->lock, SINGLE_DEPTH_NESTING);
>> +
>> + nextevt_local = next_timer_interrupt(base_local, basej);
>> + nextevt_global = next_timer_interrupt(base_global, basej);
>>
>> - if (base->timers_pending) {
>> + /*
>> + * Check whether the local event is expiring before or at the same
>> + * time as the global event.
>> + *
>> + * Note, that nextevt_global and nextevt_local might be based on
>> + * different base->clk values. So it's not guaranteed that
>> + * comparing with empty bases results in a correct local_first.
>
> This ends like an unsolved mystery case. Could you add why one should
> not worry about an incorrect local_first?
>
> But seriously, how far apart can they get and what difference does it
> make? At timer enqueue time clk equals jiffies. At this point one clk
> base could be at jiffies and the other might be a few jiffies before
> that.
> The next event (as in next_expiry) should be valid for both compare
> wise. Both must be larger than jiffies. The delta between jiffies and
> next event has to be less than NEXT_TIMER_MAX_DELTA for each base.
>
>> + */
>> + if (base_local->timers_pending && base_global->timers_pending)
>> + local_first = time_before_eq(nextevt_local, nextevt_global);
>> + else
>> + local_first = base_local->timers_pending;
>> +
>> + nextevt = local_first ? nextevt_local : nextevt_global;
>> +
>> + if (base_local->timers_pending || base_global->timers_pending) {
>> /* If we missed a tick already, force 0 delta */
>> if (time_before(nextevt, basej))
>> nextevt = basej;
>
> So if nextevt_local missed a tick and nextevt_global is
> NEXT_TIMER_MAX_DELTA-1 (so we get the largest difference possible
> between those two) then the time_before_eq() should still come out
> right. We could still miss more than one tick.
>
This problem was only there when comparing _empty_ bases
(!timer_base::timers_pending) because of the different base clocks and
the stale next_expiry.
But I didn't update the check and the comment after introducing the
forward of the next_expiry when !timer_base::timers_pending in
next_timer_interrupt(). So now it is sufficient to replace the
local_first detection by simply doing:
local_first = time_before_eq(nextevt_local, nextevt_global);
Will fix it and will also add a comment to next_timer_interrupt() where
the next_expiry is updated when !timer_base::timers_pending.
Thanks,
Anna-Maria
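Why the plain comparison becomes sufficient can be illustrated with a simplified model. It assumes, based only on the description above, that an empty base gets its next_expiry forwarded relative to basej; the struct, the helper names and MODEL_MAX_DELTA are hypothetical.

```c
#include <stdbool.h>

/* Assumed stand-in for NEXT_TIMER_MAX_DELTA; the value is illustrative. */
#define MODEL_MAX_DELTA ((1UL << 30) - 1)

struct model_base {
	unsigned long next_expiry;
	bool timers_pending;
};

static inline bool model_time_before_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) <= 0;
}

/* An empty base has its (possibly stale) next_expiry moved into the future. */
static unsigned long model_next_timer_interrupt(struct model_base *base,
						unsigned long basej)
{
	if (!base->timers_pending)
		base->next_expiry = basej + MODEL_MAX_DELTA;
	return base->next_expiry;
}

/* With the forward in place, no timers_pending special case is needed. */
static bool model_local_first(struct model_base *local,
			      struct model_base *global,
			      unsigned long basej)
{
	unsigned long nextevt_local = model_next_timer_interrupt(local, basej);
	unsigned long nextevt_global = model_next_timer_interrupt(global, basej);

	return model_time_before_eq(nextevt_local, nextevt_global);
}
```

Without the forward, a stale next_expiry left in an empty local base could compare as earlier than a real global event; with it, the empty base always compares as later.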
On 2023-12-06 10:57:57 [+0100], Anna-Maria Behnsen wrote:
> Sebastian Siewior <[email protected]> writes:
>
> > On 2023-12-01 10:26:41 [+0100], Anna-Maria Behnsen wrote:
> >> When adding a timer to the timer wheel using add_timer_on(), it is an
> >> implicitly pinned timer. With the timer pull at expiry time model in place,
> >> TIMER_PINNED flag is required to make sure timers end up in proper base.
> >>
> >> Add TIMER_PINNED flag unconditionally when add_timer_on() is executed.
> >
> > This is odd. I have some vague memory that this was already the case.
> > Otherwise all add_timer_on() users without TIMER_PINNED may get it wrong
> > due to optimisation.
>
> Which optimisation are you talking about? Are you talking about the
> heuristic for finding the best CPU in get_target_base()? This heuristic
> is not used for add_timer_on().
Yes, true. And nobody probably mixes add_timer_on() and mod_timer().
> Does this makes it a little more clear, why the change is required and
> why it is also valid right now?
Yes, all good. Thanks.
> Thanks,
>
> Anna-Maria
Sebastian
On 2023-12-01 10:26:47 [+0100], Anna-Maria Behnsen wrote:
> To prepare for the conversion of the NOHZ timer placement to a pull at
> expiry time model it's required to have functionality available getting the
> next timer interrupt on a remote CPU.
>
> Locking of the timer bases and getting the information for the next timer
> interrupt functionality is split into separate functions. This is required
> to be compliant with lock ordering when the new model is in place.
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Reviewed-by: Frederic Weisbecker <[email protected]>
Please fold the hunk below, it keeps sparse happy.
------->8---------
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2cff43c103295..00420d8faa042 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2075,6 +2075,8 @@ void fetch_next_timer_interrupt_remote(unsigned long basej, u64 basem,
* Unlocks the remote timer bases.
*/
void timer_unlock_remote_bases(unsigned int cpu)
+ __releases(timer_bases[BASE_LOCAL]->lock)
+ __releases(timer_bases[BASE_GLOBAL]->lock)
{
struct timer_base *base_local, *base_global;
@@ -2092,6 +2094,8 @@ void timer_unlock_remote_bases(unsigned int cpu)
* Locks the remote timer bases.
*/
void timer_lock_remote_bases(unsigned int cpu)
+ __acquires(timer_bases[BASE_LOCAL]->lock)
+ __acquires(timer_bases[BASE_GLOBAL]->lock)
{
struct timer_base *base_local, *base_global;
Sebastian
Sebastian Siewior <[email protected]> writes:
> On 2023-12-06 10:57:57 [+0100], Anna-Maria Behnsen wrote:
>> Sebastian Siewior <[email protected]> writes:
>>
>> > On 2023-12-01 10:26:41 [+0100], Anna-Maria Behnsen wrote:
>> >> When adding a timer to the timer wheel using add_timer_on(), it is an
>> >> implicitly pinned timer. With the timer pull at expiry time model in place,
>> >> TIMER_PINNED flag is required to make sure timers end up in proper base.
>> >>
>> >> Add TIMER_PINNED flag unconditionally when add_timer_on() is executed.
>> >
>> > This is odd. I have some vague memory that this was already the case.
>> > Otherwise all add_timer_on() users without TIMER_PINNED may get it wrong
>> > due to optimisation.
>>
>> Which optimisation are you talking about? Are you talking about the
>> heuristic for finding the best CPU in get_target_base()? This heuristic
>> is not used for add_timer_on().
>
> Yes, true. And nobody probably mixes add_timer_on() and mod_timer().
Workqueue mixes add_timer_on() and add_timer(). That is why the
add_timer_global() function is introduced in patch 17; it clears the
TIMER_PINNED flag. In patch 18 the add_timer() in workqueue code is
replaced by the new function.
Thanks,
Anna-Maria
On 2023-12-01 10:26:49 [+0100], Anna-Maria Behnsen wrote:
> Due to the conversion of the NOHZ timer placement to a pull at expiry
> time model, the per CPU timer bases with non pinned timers are no
> longer handled only by the local CPU. In case a remote CPU already
> expires the non pinned timers base of the local cpu, nothing more
CPU
so it is consistent with the other.
> needs to be done by the local CPU. A check at the begin of the expire
> timers routine is required, because timer base lock is dropped before
> executing the timer callback function.
>
> This is a preparatory work, but has no functional impact right now.
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
Sebastian
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
…
> As long as a CPU is busy it expires both local and global timers. When a
> CPU goes idle it arms for the first expiring local timer. If the first
> expiring pinned (local) timer is before the first expiring movable timer,
> then no action is required because the CPU will wake up before the first
> movable timer expires. If the first expiring movable timer is before the
> first expiring pinned (local) timer, then this timer is queued into a idle
an
> timerqueue and eventually expired by some other active CPU.
s/some other/another ?
…
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> ---
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index b6c9ac0c3712..ac3e888d053f 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -2103,6 +2104,64 @@ void timer_lock_remote_bases(unsigned int cpu)
…
> +static void timer_use_tmigr(unsigned long basej, u64 basem,
> + unsigned long *nextevt, bool *tick_stop_path,
> + bool timer_base_idle, struct timer_events *tevt)
> +{
> + u64 next_tmigr;
> +
> + if (timer_base_idle)
> + next_tmigr = tmigr_cpu_new_timer(tevt->global);
> + else if (tick_stop_path)
> + next_tmigr = tmigr_cpu_deactivate(tevt->global);
> + else
> + next_tmigr = tmigr_quick_check();
> +
> + /*
> + * If the CPU is the last going idle in timer migration hierarchy, make
> + * sure the CPU will wake up in time to handle remote timers.
> + * next_tmigr == KTIME_MAX if other CPUs are still active.
> + */
> + if (next_tmigr < tevt->local) {
> + u64 tmp;
> +
> + /* If we missed a tick already, force 0 delta */
> + if (next_tmigr < basem)
> + next_tmigr = basem;
> +
> + tmp = div_u64(next_tmigr - basem, TICK_NSEC);
Is this considered a hot path? Asking because u64 divs are nice if they
can be avoided ;)
I guess the original value is from fetch_next_timer_interrupt(). But
then you only need it if the caller (__get_next_timer_interrupt()) has
the `idle' value set. Otherwise the operation is pointless.
Would it somehow work to replace
base_local->is_idle = time_after(nextevt, basej + 1);
with maybe something like
base_local->is_idle = tevt.local > basem + TICK_NSEC
If so you could avoid the `nextevt' maneuver.
> + *nextevt = basej + (unsigned long)tmp;
> + tevt->local = next_tmigr;
> + }
> +}
> +# else
…
> @@ -2132,6 +2190,21 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
> nextevt = fetch_next_timer_interrupt(basej, basem, base_local,
> base_global, &tevt);
>
> + /*
> + * When the when the next event is only one jiffie ahead there is no
If the next event is only one jiffy ahead then there is no
> + * need to call timer migration hierarchy related
> + * functions. @tevt->global will be KTIME_MAX, nevertheless if the next
> + * timer is a global timer. This is also true, when the timer base is
The second sentence is hard to parse.
> + * idle.
> + *
> + * The proper timer migration hierarchy function depends on the callsite
> + * and whether timer base is idle or not. @nextevt will be updated when
> + * this CPU needs to handle the first timer migration hierarchy event.
> + */
> + if (time_after(nextevt, basej + 1))
> + timer_use_tmigr(basej, basem, &nextevt, idle,
> + base_local->is_idle, &tevt);
> +
> /*
> * We have a fresh next event. Check whether we can forward the
> * base.
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> +/*
> + * The timer migration mechanism is built on a hierarchy of groups. The
> + * lowest level group contains CPUs, the next level groups of CPU groups
> + * and so forth. The CPU groups are kept per node so for the normal case
> + * lock contention won't happen across nodes. Depending on the number of
> + * CPUs per node even the next level might be kept as groups of CPU groups
> + * per node and only the levels above cross the node topology.
> + *
> + * Example topology for a two node system with 24 CPUs each.
> + *
> + * LVL 2 [GRP2:0]
> + * GRP1:0 = GRP1:M
> + *
> + * LVL 1 [GRP1:0] [GRP1:1]
> + * GRP0:0 - GRP0:2 GRP0:3 - GRP0:5
> + *
> + * LVL 0 [GRP0:0] [GRP0:1] [GRP0:2] [GRP0:3] [GRP0:4] [GRP0:5]
> + * CPUS 0-7 8-15 16-23 24-31 32-39 40-47
In the CPUS list between 24-31 and 32-39 is a tab while the other
separators are spaces. Could you please align it with spaces? Judging
from the top you have tabstop=8 but here tabstop=4 looks "nice".
> + *
> + * The groups hold a timer queue of events sorted by expiry time. These
> + * queues are updated when CPUs go in idle. When they come out of idle
> + * ignore flag of events is set.
> + *
Sebastian
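The example topology quoted above (two nodes, 24 CPUs each, eight CPUs per lowest-level group) can be written down as a mapping from CPU number to group index per level. This is a sketch of that one example only; the constants and model_group_path() are hypothetical and do not describe how timer_migration.c builds its hierarchy.

```c
/* Taken from the example diagram: CPUs 0-7, 8-15, ... per LVL 0 group. */
#define MODEL_CPUS_PER_GROUP	8
/* Taken from the example text: two nodes with 24 CPUs each. */
#define MODEL_CPUS_PER_NODE	24

struct model_path {
	unsigned int lvl0;	/* index N of GRP0:N */
	unsigned int lvl1;	/* index N of GRP1:N, one per node here */
	unsigned int lvl2;	/* single top-level group GRP2:0 */
};

/* Map a CPU number to the groups it belongs to in the example topology. */
static struct model_path model_group_path(unsigned int cpu)
{
	struct model_path p = {
		.lvl0 = cpu / MODEL_CPUS_PER_GROUP,
		.lvl1 = cpu / MODEL_CPUS_PER_NODE,
		.lvl2 = 0,
	};

	return p;
}
```

For instance, CPU 24 maps to GRP0:3 and GRP1:1, which matches the second node in the diagram.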
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:45 [+0100], Anna-Maria Behnsen wrote:
>> For the conversion of the NOHZ timer placement to a pull at expiry time
>> model it's required to have separate expiry times for the pinned and the
>> non-pinned (movable) timers. Therefore struct timer_events is introduced.
>>
>> No functional change
>>
>> Originally-by: Richard Cochran (linutronix GmbH) <[email protected]>
>> Signed-off-by: Anna-Maria Behnsen <[email protected]>
>> Reviewed-by: Frederic Weisbecker <[email protected]>
> …
>> index 366ea26ce3ba..0d53d853ae22 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
> …
>> @@ -2022,13 +2028,31 @@ static inline u64 __get_next_timer_interrupt(unsigned long basej, u64 basem,
>>
>> nextevt = local_first ? nextevt_local : nextevt_global;
>>
>> - if (base_local->timers_pending || base_global->timers_pending) {
>> + /*
>> + * If the @nextevt is at max. one tick away, use @nextevt and store
>> + * it in the local expiry value. The next global event is irrelevant in
>> + * this case and can be left as KTIME_MAX.
>> + */
>> + if (time_before_eq(nextevt, basej + 1)) {
>> /* If we missed a tick already, force 0 delta */
>> if (time_before(nextevt, basej))
>> nextevt = basej;
>> - expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
>> + tevt.local = basem + (u64)(nextevt - basej) * TICK_NSEC;
>> + goto unlock;
>
> You claim "No functional change" in the patch description. However if
> you take the shortcut here you don't update `idle' if set and you don't
> __forward_timer_base(). The `idle` parameter doesn't matter because it
> was false and will remain false as per current logic.
>
> But what about the forward of the timer base? It is probably not real
> problem since the next add/mod timer call will forward it.
You are right. It is not a problem as the timer base will be forwarded
by add/mod timer and also when timers need to expire. (It is reworked
by the next patch...)
But it is not consistent and happened within one of the last rework
iterations. I'll change the goto label into 'forward' and place it
before the forward calls.
Thanks,
Anna-Maria
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:47 [+0100], Anna-Maria Behnsen wrote:
>> To prepare for the conversion of the NOHZ timer placement to a pull at
>> expiry time model it's required to have functionality available getting the
>> next timer interrupt on a remote CPU.
>>
>> Locking of the timer bases and getting the information for the next timer
>> interrupt functionality is split into separate functions. This is required
>> to be compliant with lock ordering when the new model is in place.
>>
>> Signed-off-by: Anna-Maria Behnsen <[email protected]>
>> Reviewed-by: Frederic Weisbecker <[email protected]>
>
> Please fold the hunk below, it keeps sparse happy.
Thanks, I will do!
Anna-Maria
Anna-Maria Behnsen <[email protected]> writes:
> Hi,
>
> the queue grew because of some additional cleanup patches and splitting
> already existing patches to make review easier. Several minor things had to
> be reworked. For all people who did testing, I would be really happy if you
> could test v9 as well!
bigeasy triggered a warning during boot. So I'll go and fix it and will
send a v10 when I'm done.
Thanks,
Anna-Maria
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> + * Protection of the tmigr group state information:
> + * ------------------------------------------------
> + *
> + * The state information with the list of active children and migrator needs to
> + * be protected by a sequence counter. It prevents a race when updates in a
s/a$//
> + * child groups are propagated in changed order. The following scenario
> + * describes what happens without updating the sequence counter:
> + *
> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
> + * as GRP0:1 will not change during the scenario):
> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = CPU0 migrator = CPU2
> + * active = CPU0 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * active idle active idle
> + *
> + *
> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
> + * states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + * hierarchy):
CPU0 goes idle. The state update is performed lockless and
group-wise. In the first step only GRP0:0 has been updated. The update
of GRP1:0 is pending; the CPU walks through the hierarchy.
> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * --> migrator = TMIGR_NONE migrator = CPU2
> + * --> active = active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * --> idle idle active idle
> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
> + * current states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + * hierarchy):
While CPU0 goes idle and continues to update the state, CPU1 comes
out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
the CPU walks through the hierarchy. Both CPUs now walk the hierarchy
to perform the needed update from their point of view.
The currently visible state:
> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * --> migrator = CPU1 migrator = CPU2
> + * --> active = CPU1 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle --> active active idle
> + *
> + * 3. Here comes the change of the order: Propagating the changes of step 2
> + * through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
> + * is already up to date.
Here is the race condition: CPU1 managed to propagate its changes
through the hierarchy to GRP1:0 before CPU0 did. The active members
of GRP1:0 remain unchanged after the update since it is still valid
from CPU1's current point of view:
LVL 1 [GRP1:0]
--> migrator = GRP0:1
--> active = GRP0:0, GRP0:1
/ \
LVL 0 [GRP0:0] [GRP0:1]
migrator = CPU1 migrator = CPU2
active = CPU1 active = CPU2
/ \ / \
CPUs 0 1 2 3
idle active active idle
[ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
be changed to GRP0:0. I assume that both fields (migrator+active) are
changed there via the propagation and the arrow in both fields denotes
this. ]
> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0
Now CPU0 finally propagates its changes to GRP1:0.
> + *
> + * LVL 1 [GRP1:0]
> + * --> migrator = GRP0:1
> + * --> active = GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = CPU1 migrator = CPU2
> + * active = CPU1 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle active active idle
> + *
> + * Now there is a inconsistent overall state because GRP0:0 is active, but
> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
> + * sequence counter whenever changing the state.
The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
CPU1 is active and is correctly listed as active in GRP0:0. However
GRP1:0 does not have GRP0:0 listed as active which is wrong.
The sequence counter has been added to avoid inconsistent states
during updates. The state is updated atomically only if all members,
including the sequence counter, match the expected value
(compare-and-exchange).
Looking back at the previous example with the addition of the
sequence number: The update as performed by CPU0 in step 4 will fail.
CPU1 changed the sequence number during the update in step 3 so the
expected old value (as seen by CPU0 before starting the walk) does
not match.
Sebastian
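The compare-and-exchange behaviour described in the reworded comment can be modelled with C11 atomics. The sketch below is single-threaded and replays steps 3 and 4 in order; the bit layout and all helper names are assumptions for illustration, not the layout used by the patch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing: bits 0-7 active mask, 8-15 migrator, 16-31 sequence. */
static inline uint32_t model_pack(uint8_t active, uint8_t migrator, uint16_t seq)
{
	return (uint32_t)active | ((uint32_t)migrator << 8) |
	       ((uint32_t)seq << 16);
}

static inline uint8_t model_active(uint32_t s)   { return (uint8_t)(s & 0xff); }
static inline uint8_t model_migrator(uint32_t s) { return (uint8_t)((s >> 8) & 0xff); }
static inline uint16_t model_seq(uint32_t s)     { return (uint16_t)(s >> 16); }

/*
 * Try to replace the group state. This succeeds only if the whole word,
 * including the sequence number, still equals @snapshot. Every successful
 * update bumps the sequence number, even if active and migrator stay the same.
 */
static bool model_update(_Atomic uint32_t *state, uint32_t snapshot,
			 uint8_t active, uint8_t migrator)
{
	uint32_t next = model_pack(active, migrator,
				   (uint16_t)(model_seq(snapshot) + 1));

	return atomic_compare_exchange_strong(state, &snapshot, next);
}
```

Replaying the scenario with child bit 0 for GRP0:0 and bit 1 for GRP0:1: both CPUs snapshot GRP1:0 at sequence 0, CPU1's update succeeds and moves the sequence to 1, and CPU0's later attempt to drop GRP0:0 from the active mask fails because its snapshot no longer matches. In this model the failed caller would have to re-read the state and retry.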
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
[...]
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index b6c9ac0c3712..ac3e888d053f 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -2103,6 +2104,64 @@ void timer_lock_remote_bases(unsigned int cpu)
> …
>> +static void timer_use_tmigr(unsigned long basej, u64 basem,
>> + unsigned long *nextevt, bool *tick_stop_path,
>> + bool timer_base_idle, struct timer_events *tevt)
>> +{
>> + u64 next_tmigr;
>> +
>> + if (timer_base_idle)
>> + next_tmigr = tmigr_cpu_new_timer(tevt->global);
>> + else if (tick_stop_path)
>> + next_tmigr = tmigr_cpu_deactivate(tevt->global);
>> + else
>> + next_tmigr = tmigr_quick_check();
>> +
>> + /*
>> + * If the CPU is the last going idle in timer migration hierarchy, make
>> + * sure the CPU will wake up in time to handle remote timers.
>> + * next_tmigr == KTIME_MAX if other CPUs are still active.
>> + */
>> + if (next_tmigr < tevt->local) {
>> + u64 tmp;
>> +
>> + /* If we missed a tick already, force 0 delta */
>> + if (next_tmigr < basem)
>> + next_tmigr = basem;
>> +
>> + tmp = div_u64(next_tmigr - basem, TICK_NSEC);
>
> Is this considered a hot path? Asking because u64 divs are nice if can
> be avoided ;)
It's the 'try to go idle path' - so no hot path. Please correct me if
I'm wrong.
> I guess the original value is from fetch_next_timer_interrupt(). But
> then you only need it if the caller (__get_next_timer_interrupt()) has
> the `idle' value set. Otherwise the operation is pointless.
> Would it somehow work to replace
> base_local->is_idle = time_after(nextevt, basej + 1);
>
> with maybe something like
> base_local->is_idle = tevt.local > basem + TICK_NSEC
>
> If so you could avoid the `nextevt' maneuver.
>
This change could be done independently as an improvement on top of the
queue as well. I will not improve it right now, if it's ok.
>> + *nextevt = basej + (unsigned long)tmp;
>> + tevt->local = next_tmigr;
>> + }
>> +}
>> +# else
Thanks for the other input - I already changed it for v10 of the queue.
Thanks,
Anna-Maria
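The two forms of the idle check discussed in this exchange can be compared in a small model. It is a sketch under assumptions: MODEL_TICK_NSEC is an arbitrary 250 Hz tick, and the jiffies variant reproduces only the "force 0 delta" clamp and the rounding-down division from the quoted hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick length (250 Hz); purely illustrative. */
#define MODEL_TICK_NSEC 4000000ULL

/*
 * Jiffies-style check: convert the nanosecond expiry to whole ticks
 * (rounding down) and test whether it is more than one tick away.
 */
static bool model_idle_by_jiffies(uint64_t basem, uint64_t next_ns)
{
	uint64_t ticks;

	/* If we missed a tick already, force 0 delta */
	if (next_ns < basem)
		next_ns = basem;

	ticks = (next_ns - basem) / MODEL_TICK_NSEC;
	return ticks > 1;
}

/* Nanosecond-style check as floated in the review. */
static bool model_idle_by_ns(uint64_t basem, uint64_t next_ns)
{
	return next_ns > basem + MODEL_TICK_NSEC;
}
```

In this model both checks agree when the expiry is a whole number of ticks away, but they differ for values strictly between one and two ticks: the division rounds down to one tick, while the nanosecond comparison already reports more than one tick. Whether that difference matters for the kernel code is not settled by the thread.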
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
>> new file mode 100644
>> index 000000000000..05cd8f1bc45d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.c
>> @@ -0,0 +1,1636 @@
> …
>> + * Protection of the tmigr group state information:
>> + * ------------------------------------------------
>> + *
>> + * The state information with the list of active children and migrator needs to
>> + * be protected by a sequence counter. It prevents a race when updates in a
>
> s/a$//
>
>> + * child groups are propagated in changed order. The following scenario
>> + * describes what happens without updating the sequence counter:
>> + *
>> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
>> + * as GRP0:1 will not change during the scenario):
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * migrator = CPU0 migrator = CPU2
>> + * active = CPU0 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * active idle active idle
>> + *
>> + *
>> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
>> + * states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + * hierarchy):
>
> CPU0 goes idle. The state update is performed lock less and group
> wise. In the first step only GRP0:0 has been updated. The update of
> GRP1:0 is pending, the CPU walks through the hierarchy.
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * --> migrator = TMIGR_NONE migrator = CPU2
>> + * --> active = active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * --> idle idle active idle
>
>> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
>> + * current states of GRP0:0 and GRP1:0 are stored in the data for walking the
>> + * hierarchy):
>
> While CPU0 goes idle and continues to update the state, CPU1 comes
> out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
> the CPU walks through the hierarchy. Both CPUs now walk the hierarchy
> to perform the needed update from their point of view.
> The currently visible state:
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * migrator = GRP0:1
>> + * active = GRP0:0, GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * --> migrator = CPU1 migrator = CPU2
>> + * --> active = CPU1 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * idle --> active active idle
>> + *
>> + * 3. Here comes the change of the order: Propagating the changes of step 2
>> + * through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
>> + * is already up to date.
>
> Here is the race condition: CPU1 managed to propagate its changes
> through the hierarchy to GRP1:0 before CPU0 did. The active members
> of GRP1:0 remain unchanged after the update since they are still valid
> from CPU1's current point of view:
>
> LVL 1 [GRP1:0]
> --> migrator = GRP0:1
> --> active = GRP0:0, GRP0:1
> / \
> LVL 0 [GRP0:0] [GRP0:1]
> migrator = CPU1 migrator = CPU2
> active = CPU1 active = CPU2
> / \ / \
> CPUs 0 1 2 3
> idle active active idle
>
> [ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
> be changed to GRP0:0. I assume that both fields (migrator+active) are
> changed there via the propagation and the arrow in both fields denotes
> this. ]
>
>> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0
>
> Now CPU0 finally propagates its changes to GRP1:0.
>
>> + *
>> + * LVL 1 [GRP1:0]
>> + * --> migrator = GRP0:1
>> + * --> active = GRP0:1
>> + * / \
>> + * LVL 0 [GRP0:0] [GRP0:1]
>> + * migrator = CPU1 migrator = CPU2
>> + * active = CPU1 active = CPU2
>> + * / \ / \
>> + * CPUs 0 1 2 3
>> + * idle active active idle
>> + *
>> + * Now there is a inconsistent overall state because GRP0:0 is active, but
>> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
>> + * sequence counter whenever changing the state.
>
> The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
> CPU1 is active and is correctly listed as active in GRP0:0. However
> GRP1:0 does not have GRP0:0 listed as active which is wrong.
> The sequence counter has been added to avoid inconsistent states
> during updates. The state is updated atomically only if all members,
> including the sequence counter, match the expected value
> (compare-and-exchange).
> Looking back at the previous example with the addition of the
> sequence number: The update as performed by CPU0 in step 4 will fail.
> CPU1 changed the sequence number during the update in step 3 so the
> expected old value (as seen by CPU0 before starting the walk) does
> not match.
>
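To make the compare-and-exchange argument concrete, here is a minimal userspace sketch of step 3 and step 4 with C11 atomics. The union layout, the bit values and grp_try_update() are assumptions made up for this illustration and are not taken from the patch:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a group state word; not the layout used by the patch. */
union grp_state {
	uint32_t state;
	struct {
		uint8_t  active;   /* bitmask of active children */
		uint8_t  migrator; /* bit of the migrating child, 0 means none */
		uint16_t seq;      /* incremented by every successful update */
	};
};

/*
 * Apply an update that was computed from the snapshot @seen. The exchange
 * only succeeds if the group still holds exactly that snapshot, sequence
 * number included.
 */
static bool grp_try_update(_Atomic uint32_t *grp, union grp_state seen,
			   uint8_t active, uint8_t migrator)
{
	union grp_state next = seen;

	next.active = active;
	next.migrator = migrator;
	next.seq++;

	return atomic_compare_exchange_strong(grp, &seen.state, next.state);
}
```

With GRP0:0 as bit 0x1 and GRP0:1 as bit 0x2, CPU1's update in step 3 leaves active and migrator untouched, so without the sequence number the state word would be identical afterwards and CPU0's stale update in step 4 would go through. With the sequence number, the word differs and CPU0's exchange fails. In the real code the loser would presumably reread the state and retry.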
Thanks a lot for rephrasing the documentation to make it clearer for the
reader! I will use your proposal with some minor changes.
Thanks,
Anna-Maria
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> + * Required event and timerqueue update after a remote expiry:
> + * -----------------------------------------------------------
> + *
> + * After a remote expiry of a CPU, a walk through the hierarchy updating the
> + * events and timerqueues has to be done when there is a 'new' global timer of
> + * the remote CPU (which is obvious) but also if there is no new global timer,
> + * but the remote CPU is still idle:
After expiring timers of a remote CPU, a walk through the hierarchy to
update the events and timerqueues is required. It is obviously needed if
there is a 'new' global timer, but also if there is no new global timer
but the remote CPU is still idle.
> + * 1. CPU2 is the migrator and does the remote expiry in GRP1:0; expiry of
> + * evt-CPU0 and evt-CPU1 are equal:
CPU0 and CPU1 both have a timer expiring at the same time, so both
have an event enqueued in the timerqueue. CPU2 and CPU3 have no
global timer pending. CPU2 is the only active CPU and also the
migrator.
> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:1
> + * --> timerqueue = evt-GRP0:0
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = TMIGR_NONE migrator = CPU2
> + * active = active = CPU2
> + * groupevt.ignore = false groupevt.ignore = true
> + * groupevt.cpu = CPU0 groupevt.cpu =
> + * timerqueue = evt-CPU0, timerqueue =
> + * evt-CPU1
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle idle active idle
> + *
> + * 2. Remove the first event of the timerqueue in GRP1:0 and expire the timers
> + * of CPU0 (see evt-GRP0:0->cpu value):
CPU2 begins to expire remote timers. It starts with its own group
GRP0:1. GRP0:1 has nothing in its timerqueue, so CPU2 continues with
the parent, GRP1:0. In GRP1:0 it dequeues the first event. It looks at
the CPU member and expires the pending timer of CPU0.
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:1
> + * --> timerqueue =
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = TMIGR_NONE migrator = CPU2
> + * active = active = CPU2
> + * groupevt.ignore = false groupevt.ignore = true
> + * --> groupevt.cpu = CPU0 groupevt.cpu =
> + * timerqueue = evt-CPU0, timerqueue =
> + * evt-CPU1
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle idle active idle
> + *
> + * 3. After the remote expiry CPU0 has no global timer that needs to be
> + * enqueued. When skipping the walk, the global timer of CPU1 is not handled,
> + * as the group event of GRP0:0 is not updated and not enqueued into GRP1:0. The
> + * walk has to be done to update the group events and timerqueues:
The work isn't over after expiring the timers of CPU0. If we stop
here, then CPU1's timers have not been expired and the timerqueue of
GRP0:0 still has an event for CPU0 enqueued which has just been
processed. So it is required to walk the hierarchy from CPU0's
point of view and update it accordingly.
CPU0 will be removed from the timerqueue because it has no pending
timer. If CPU0 had a timer pending, then it would have to expire
after CPU1's first timer because all timers from this period have
just been expired.
Either way CPU1 will be first in GRP0:0's timerqueue and therefore
set in the CPU field of the group event which is enqueued in
GRP1:0's timerqueue.
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:1
> + * --> timerqueue = evt-GRP0:0
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = TMIGR_NONE migrator = CPU2
> + * active = active = CPU2
> + * groupevt.ignore = false groupevt.ignore = true
> + * --> groupevt.cpu = CPU1 groupevt.cpu =
> + * --> timerqueue = evt-CPU1 timerqueue =
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle idle active idle
> + *
> + * Now CPU2 (migrator) is able to handle the timer of CPU1 as CPU2 only scans
> + * the timerqueues of GRP0:1 and GRP1:0.
Now CPU2 continues step 2 at GRP1:0 and will expire the timer of
CPU1.
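The requeue in step 3 can be illustrated with a small userspace sketch. The types and group_first_child() are hypothetical and only model "pick the earliest child event of the group"; the patch uses timerqueues and group events for this:

```c
#include <stddef.h>
#include <stdint.h>

#define NO_EVENT UINT64_MAX

/* Hypothetical per-child event: expiry time, or NO_EVENT if nothing is queued. */
struct child_evt {
	uint64_t expires;
};

/*
 * Recompute the group event from the children: return the index of the
 * child with the earliest expiry and store that expiry in @expires. On a
 * tie the lowest index wins. Returns -1 if no child has an event queued.
 */
static int group_first_child(const struct child_evt *evts, size_t nr,
			     uint64_t *expires)
{
	int first = -1;

	*expires = NO_EVENT;
	for (size_t i = 0; i < nr; i++) {
		if (evts[i].expires < *expires) {
			*expires = evts[i].expires;
			first = (int)i;
		}
	}
	return first;
}
```

With two children expiring at the same time the group event names child 0 first. Only after child 0's event is gone and the group is recomputed does child 1 become visible to the level above, which is why the walk cannot be skipped.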
> + * The update of step 3 is valid to be skipped, when the remote CPU went offline
> + * in the meantime because an update was already done during inactive path. When
> + * CPU became active in the meantime, update isn't required as well, because
> + * GRP0:0 is now longer idle.
The hierarchy walk in step 3 can be skipped if the migrator notices
that a CPU of GRP0:0 is active. The CPU will mark GRP0:0 active and
take care of the group and any needed updates within the hierarchy.
I skipped the "offline" part because it is not needed. Before the CPU
can go offline it first has to come out of idle. While going offline it
won't (probably) participate here and the remaining timers will be
migrated to another CPU.
> + */
…
> +
> +typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
> +
> +static void __walk_groups(up_f up, void *data,
> + struct tmigr_cpu *tmc)
> +{
> + struct tmigr_group *child = NULL, *group = tmc->tmgroup;
> +
> + do {
> + WARN_ON_ONCE(group->level >= tmigr_hierarchy_levels);
> +
> + if (up(group, child, data))
> + break;
> +
> + child = group;
> + group = group->parent;
> + } while (group);
> +}
> +
> +static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
> +{
> + lockdep_assert_held(&tmc->lock);
> +
> + __walk_groups(up, data, tmc);
> +}
So these two: all users of walk_groups() have tmigr_cpu::lock acquired
and the users of __walk_groups() don't. Also, the `up' functions passed to
walk_groups() always take the same data type, while the data argument
passed to __walk_groups() also always has the same type, but a different
one compared to the former.
Given the locking situation and the type of the data argument, it looks
like walk_groups() is used for thing#1 and __walk_groups() for thing#2.
Therefore it could make sense to have two separate functions (instead of
walk_groups() and __walk_groups()) to distinguish this.
Now it is too late, but maybe later I will figure out why the one type
requires locking and the other doesn't.
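For reference, the walk pattern itself boiled down to a userspace sketch. struct group, walk_up() and count_up() are simplified stand-ins invented here, not the structures from the patch. The point is that the void pointer carries whatever type the caller and its callback agree on, and nothing in the walker checks it:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a hierarchy group. */
struct group {
	int level;
	struct group *parent; /* NULL for the top level group */
};

/* Callback type: return true to stop the walk at this level. */
typedef bool (*up_fn)(struct group *group, struct group *child, void *data);

/* Walk from @start towards the top level, calling @up on every group. */
static void walk_up(up_fn up, void *data, struct group *start)
{
	struct group *child = NULL, *group = start;

	do {
		if (up(group, child, data))
			break;
		child = group;
		group = group->parent;
	} while (group);
}

/* Example payload and callback: count visited levels, stop at @stop_level. */
struct count_data {
	int visited;
	int stop_level;
};

static bool count_up(struct group *group, struct group *child, void *ptr)
{
	struct count_data *data = ptr;

	(void)child;
	data->visited++;
	return group->level >= data->stop_level;
}
```

Two wrappers with typed data arguments, one per payload type, would let the compiler catch a mismatched callback, which is roughly what the suggestion above amounts to.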
…
> +/*
> + * Return the next event which is already expired of the group timerqueue
> + *
> + * Event, which is returned, is also removed from the queue.
> + */
> +static struct tmigr_event *tmigr_next_expired_groupevt(struct tmigr_group *group,
> + u64 now)
> +{
> + struct tmigr_event *evt = tmigr_next_groupevt(group);
> +
> + if (!evt || now < evt->nextevt.expires)
> + return NULL;
> +
> + /*
> + * The event is already expired. Remove it. If it's not the last event,
> + * then update all group event related information.
> + */
The event expired, remove it. Update the group's next expiry time.
> + if (timerqueue_del(&group->events, &evt->nextevt))
> + tmigr_next_groupevt(group);
> + else
> + WRITE_ONCE(group->next_expiry, KTIME_MAX);
And then you can invoke tmigr_next_groupevt() unconditionally.
> + return evt;
> +}
> +
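A sketch of what the unconditional variant could look like, modelled in userspace with a sorted array instead of a timerqueue. It assumes that the peek helper itself resets next_expiry when the queue is empty; that is an assumption of this example and may not match what tmigr_next_groupevt() does in this version of the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EXP_MAX UINT64_MAX

/* Hypothetical group: events[] holds expiry times in ascending order. */
struct grp {
	uint64_t events[4];
	size_t nr;
	uint64_t next_expiry;
};

/*
 * Peek at the first event and refresh next_expiry. An empty queue resets
 * next_expiry to EXP_MAX (assumed behaviour, see the text above).
 */
static const uint64_t *grp_next_evt(struct grp *g)
{
	g->next_expiry = g->nr ? g->events[0] : EXP_MAX;
	return g->nr ? &g->events[0] : NULL;
}

/* Remove the first event if it already expired. Returns 1 and its expiry. */
static int grp_next_expired(struct grp *g, uint64_t now, uint64_t *expired)
{
	const uint64_t *evt = grp_next_evt(g);

	if (!evt || now < *evt)
		return 0;

	*expired = *evt;
	memmove(g->events, g->events + 1, (g->nr - 1) * sizeof(g->events[0]));
	g->nr--;

	/* No empty/non-empty distinction needed, the helper covers both. */
	grp_next_evt(g);
	return 1;
}
```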
Sebastian
On Tue, Dec 05, 2023 at 12:53:03PM +0100, Anna-Maria Behnsen wrote:
> Sebastian Siewior <[email protected]> writes:
> >> Use only base->next_expiry value as nextevt when timers are
> >> pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
> >> information is in place, update base->next_expiry value of the empty timer
> >> base as well.
> >
> > or consider timers_pending in run_local_timers()? An additional read vs
> > write?
>
> This would also be a possibility to add the check in run_local_timers()
> with timers_pending.
We could, but do we really care about avoiding a potential softirq every 12 days
(on 1000 Hz...)?
> And we also have to make the is_idle marking in
> get_next_timer_interrupt() dependant on base::timers_pending bit.
Yes that, on the other hand, looks mandatory! Because if the CPU sleeps for 12
days and then gets an interrupt and then goes back to sleep, base->is_idle will be
set to false and remote enqueues won't be notified.
> But this also means, we cannot rely on next_expiry when no timer is pending.
But note that this patch only fixes that partially anyway. Suppose the tick is
stopped entirely and the CPU sleeps for 13 days without any interruption.
Then it's woken up with TIF_RESCHED without any timer queued,
get_next_timer_interrupt() won't be called upon tick restart to fix
->next_expiry.
>
> Frederic, what do you think?
So it looks like is_idle must be fixed.
As for the timer softirq, ->next_expiry is already unreliable because when
a timer is removed, ->next_expiry is not updated (even though that removed
timer might have been the earliest). So ->next_expiry can already carry a
"too early" value. The only constraint is that ->next_expiry can't be later
than the first timer.
So I'd rather put a comment somewhere about the fact that wrapping is expected
to behave ok. But it's your call.
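For whoever writes that comment, a small userspace sketch of the two facts it would rest on. MAX_DELTA mirrors NEXT_TIMER_MAX_DELTA as I remember it (2^30 - 1 jiffies), which is an assumption here, and tick_after_eq() is a 32 bit stand-in for the usual wrap-safe jiffies comparison, not the kernel macro:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed maximum delta for an empty base: 2^30 - 1 ticks. */
#define MAX_DELTA ((UINT32_C(1) << 30) - 1)

/*
 * Wrap-safe "a is at or after b" on a 32 bit tick counter. The signed
 * difference stays correct as long as a and b are less than 2^31 ticks
 * apart. The unsigned to signed conversion relies on two's complement
 * behaviour, which common compilers provide.
 */
static bool tick_after_eq(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) >= 0;
}

/* Whole days covered by MAX_DELTA at a tick rate of @hz. */
static uint32_t max_delta_days(uint32_t hz)
{
	return MAX_DELTA / hz / 86400;
}
```

At 1000 Hz this gives the 12 days mentioned above, and the comparison keeps working across the counter wrap as long as the stale value is not more than 2^31 ticks away.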
Thanks.
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> +static bool tmigr_active_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + union tmigr_state curstate, newstate;
> + struct tmigr_walk *data = ptr;
> + bool walk_done;
> + u8 childmask;
> +
> + childmask = data->childmask;
> + newstate = curstate = data->groupstate;
> +
> +retry:
> + walk_done = true;
> +
> + if (newstate.migrator == TMIGR_NONE) {
> + newstate.migrator = childmask;
> +
> + /* Changes need to be propagated */
> + walk_done = false;
> + }
> +
> + newstate.active |= childmask;
> +
> + newstate.seq++;
> +
> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> + newstate.state = curstate.state;
This does not look right. If
group->migr_state != curstate.state
then
curstate.state = newstate.state
does not make a difference since curstate is on stack. So this should
become
| if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
| curstate.state = newstate.state = atomic_read(&group->parent->migr_state);
and now I question the existence of tmigr_walk::groupstate. It does not
match the comment for the struct saying it will be re-read if the
cmpxchg() fails because it does not happen (at least here). Also why do
we have it? Is it to avoid atomic_read() and have it "cached"?
> + goto retry;
> + }
> +
> + if (group->parent && (walk_done == false)) {
The group's parent doesn't change so it can be accessed lock-less. It is
just that the topmost group has no parent so we need this check. I would
move the walk_done check to the left so it can be evaluated first.
> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> + data->childmask = group->childmask;
We don't re-read in case the cmpxchg failed, assuming someone else is
updating the state. Wouldn't it make sense to read the childmask at the top
of the function from the child pointer? There is no need to keep it around
in the data pointer, right?
> + }
> +
> + /*
> + * The group is active and the event will be ignored - the ignore flag is
> + * updated without holding the lock. In case the bit is set while
> + * another CPU already handles remote events, nothing happens, because
> + * it is clear that the CPU became active just in this moment, or in
> + * worst case the event is handled remote. Nothing to worry about.
> + */
The CPU is becoming active, so is the group. The ignore flag for the
group is updated lockless to reflect this. Another CPU might also
set it to true while becoming active. In the worst case the migrator
observes it too late and remotely expires a timer belonging to this
group. The lock is held while going idle (and setting ignore to
false) so the state is not lost.
> + group->groupevt.ignore = true;
> +
> + return walk_done;
> +}
…
> +/*
> + * Returns the expiry of the next timer that needs to be handled. KTIME_MAX is
> + * returned, when an active CPU will handle all the timer migration hierarchy
s/when/if ?
> + * timers.
> + */
> +static u64 tmigr_new_timer(struct tmigr_cpu *tmc, u64 nextexp)
> +{
> + struct tmigr_walk data = { .evt = &tmc->cpuevt,
> + .nextexp = nextexp };
> +
> + lockdep_assert_held(&tmc->lock);
> +
> + if (tmc->remote)
> + return KTIME_MAX;
> +
> + tmc->cpuevt.ignore = false;
> + data.remote = false;
> +
> + walk_groups(&tmigr_new_timer_up, &data, tmc);
> +
> + /* If there is a new first global event, make sure it is handled */
> + return data.nextexp;
> +}
> +
> +static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
> + unsigned long jif)
> +{
> + struct timer_events tevt;
> + struct tmigr_walk data;
> + struct tmigr_cpu *tmc;
> + u64 next = KTIME_MAX;
> +
> + tmc = per_cpu_ptr(&tmigr_cpu, cpu);
> +
> + raw_spin_lock_irq(&tmc->lock);
> +
> + /*
> + * The remote CPU is offline or the CPU event does not has to be handled
> + * (the CPU is active or there is no longer an event to expire) or
> + * another CPU handles the CPU timers already or the next event was
> + * already expired - return!
> + */
The comment is the English version of the C code below. The *why* is
usually the thing we care about. This basically sums up to:
If the remote CPU is offline then the timer events have been
migrated away.
If tmigr_cpu::remote is set then another CPU takes care of this.
If tmigr_event::ignore is set then the CPU is returning from
idle and takes care of its timers.
If the next event expires in the future then the event node has
been updated and there are no timers to expire right now.
> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
> + now < tmc->cpuevt.nextevt.expires) {
> + raw_spin_unlock_irq(&tmc->lock);
> + return next;
Looking at the last condition where the timerqueue has been forwarded by
a jiffy+, shouldn't we return _that_ value as next instead of KTIME_MAX?
> + }
> +
> + tmc->remote = true;
> + WRITE_ONCE(tmc->wakeup, KTIME_MAX);
> +
> + /* Drop the lock to allow the remote CPU to exit idle */
> + raw_spin_unlock_irq(&tmc->lock);
> +
> + if (cpu != smp_processor_id())
> + timer_expire_remote(cpu);
> +
> + /*
> + * Lock ordering needs to be preserved - timer_base locks before tmigr
> + * related locks (see section "Locking rules" in the documentation at
> + * the top). During fetching the next timer interrupt, also tmc->lock
> + * needs to be held. Otherwise there is a possible race window against
> + * the CPU itself when it comes out of idle, updates the first timer in
> + * the hierarchy and goes back to idle.
> + *
> + * timer base locks are dropped as fast as possible: After checking
> + * whether the remote CPU went offline in the meantime and after
> + * fetching the next remote timer interrupt. Dropping the locks as fast
> + * as possible keeps the locking region small and prevents holding
> + * several (unnecessary) locks during walking the hierarchy for updating
> + * the timerqueue and group events.
> + */
> + local_irq_disable();
> + timer_lock_remote_bases(cpu);
> + raw_spin_lock(&tmc->lock);
> +
> + /*
> + * When the CPU went offline in the meantime, no hierarchy walk has to
> + * be done for updating the queued events, because the walk was
> + * already done during marking the CPU offline in the hierarchy.
> + *
> + * When the CPU is no longer idle, the CPU takes care of the timers and
If the CPU is no longer idle then it takes care of its timers and
also of the timers in its hierarchy.
> + * also of the timers in the path to the top.
> + *
> + * (See also section "Required event and timerqueue update after
> + * remote expiry" in the documentation at the top)
Perfect. The reasoning why it is safe to skip walk + expire timers, has
been explained.
> + */
> + if (!tmc->online || !tmc->idle) {
> + timer_unlock_remote_bases(cpu);
> + goto unlock;
> + } else {
> + /* next event of CPU */
> + fetch_next_timer_interrupt_remote(jif, now, &tevt, cpu);
> + }
> +
> + timer_unlock_remote_bases(cpu);
> +
> + data.evt = &tmc->cpuevt;
> + data.nextexp = tevt.global;
> + data.remote = true;
> +
> + /*
> + * The update is done even when there is no 'new' global timer pending
> + * on the remote CPU (see section "Required event and timerqueue update
> + * after remote expiry" in the documentation at the top)
> + */
> + walk_groups(&tmigr_new_timer_up, &data, tmc);
> +
> + next = data.nextexp;
> +
> +unlock:
> + tmc->remote = false;
> + raw_spin_unlock_irq(&tmc->lock);
> +
> + return next;
> +}
> +
…
> +/**
> + * tmigr_handle_remote() - Handle global timers of remote idle CPUs
> + *
> + * Called from the timer soft interrupt with interrupts enabled.
> + */
> +void tmigr_handle_remote(void)
> +{
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + struct tmigr_remote_data data;
> +
> + if (tmigr_is_not_available(tmc))
> + return;
> +
> + data.childmask = tmc->childmask;
> + data.nextexp = KTIME_MAX;
> +
> + /*
> + * NOTE: This is a doubled check because the migrator test will be done
> + * in tmigr_handle_remote_up() anyway. Keep this check to fasten the
s/fasten/speed up/
> + * return when nothing has to be done.
> + */
> + if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask))
> + return;
> +
> + data.now = get_jiffies_update(&data.basej);
> +
> + /*
> + * Update @tmc->wakeup only at the end and do not reset @tmc->wakeup to
> + * KTIME_MAX. Even if tmc->lock is not held during the whole remote
> + * handling, tmc->wakeup is fine to be stale as it is called in
> + * interrupt context and tick_nohz_next_event() is executed in interrupt
> + * exit path only after processing the last pending interrupt.
> + */
> +
> + __walk_groups(&tmigr_handle_remote_up, &data, tmc);
> +
> + raw_spin_lock_irq(&tmc->lock);
> + WRITE_ONCE(tmc->wakeup, data.nextexp);
> + raw_spin_unlock_irq(&tmc->lock);
> +}
> +
> +static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + struct tmigr_remote_data *data = ptr;
> + u8 childmask;
> +
> + childmask = data->childmask;
> +
> + /*
> + * Handle the group only if the child is the migrator or if the group
> + * has no migrator. Otherwise the group is active and is handled by its
> + * own migrator.
> + */
> + if (!tmigr_check_migrator(group, childmask))
> + return true;
> +
> + /*
> + * When there is a parent group and the CPU which triggered the
> + * hierarchy walk is not active, proceed the walk to reach the top level
> + * group before reading the next_expiry value.
> + */
> + if (group->parent && !data->tmc_active)
> + goto out;
> +
> + /*
> + * On 32 bit systems the racy lockless check for next_expiry will
> + * turn into a random number generator. Therefore do the lockless
> + * check only on 64 bit systems.
> + */
The lock is required on 32bit architectures to read the
variable consistently with a concurrent writer. On 64bit the
lock is not required because the read operation is not split
and so is always consistent.
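A deterministic userspace illustration of the "random number generator" on 32 bit. struct split_u64 and its helpers are made up for this example and simply model a 64 bit value that is stored as two 32 bit words:

```c
#include <stdint.h>

/* Model of a 64 bit value that a 32 bit CPU stores as two separate words. */
struct split_u64 {
	uint32_t lo;
	uint32_t hi;
};

/* What a lockless reader sees when it loads both halves. */
static uint64_t split_read(const struct split_u64 *v)
{
	return ((uint64_t)v->hi << 32) | v->lo;
}

/* First half of a writer's update: store the low word of @val. */
static void split_store_lo(struct split_u64 *v, uint64_t val)
{
	v->lo = (uint32_t)val;
}

/* Second half of a writer's update: store the high word of @val. */
static void split_store_hi(struct split_u64 *v, uint64_t val)
{
	v->hi = (uint32_t)(val >> 32);
}
```

If next_expiry moves from 0xFFFFFFFF to 0x100000000 and the reader runs between the two stores, it reads 0, which is neither the old nor the new value. Taking the lock on both sides closes that window.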
> + if (IS_ENABLED(CONFIG_64BIT)) {
> + data->nextexp = READ_ONCE(group->next_expiry);
> + if (data->now >= data->nextexp) {
> + data->check = true;
> + return true;
> + }
> + } else {
> + raw_spin_lock(&group->lock);
> + data->nextexp = group->next_expiry;
> + if (data->now >= group->next_expiry) {
> + data->check = true;
> + raw_spin_unlock(&group->lock);
> + return true;
> + }
> + raw_spin_unlock(&group->lock);
> + }
> +
> +out:
> + /* Update of childmask for the next level */
> + data->childmask = group->childmask;
> + return false;
> +}
> +
> +/**
> + * tmigr_requires_handle_remote() - Check the need of remote timer handling
> + *
> + * Must be called with interrupts disabled.
> + */
> +int tmigr_requires_handle_remote(void)
> +{
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + struct tmigr_remote_data data;
> + unsigned int ret = 0;
> + unsigned long jif;
> +
> + if (tmigr_is_not_available(tmc))
> + return ret;
> +
> + data.now = get_jiffies_update(&jif);
> + data.childmask = tmc->childmask;
> + data.nextexp = KTIME_MAX;
> + data.tmc_active = !tmc->idle;
> + data.check = false;
> +
> + /*
> + * When the CPU is active, walking the hierarchy to check whether a
> + * remote expiry is required.
s/When/If
s/walking/walk
> + *
> + * Check is done lockless as interrupts are disabled and @tmc->idle is
> + * set only by the local CPU.
> + */
> + if (!tmc->idle) {
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.nextexp != KTIME_MAX)
> + ret = 1;
> +
> + return ret;
> + }
> +
> + /*
> + * When the CPU is idle, check whether the recalculation of @tmc->wakeup
> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is
> + * about to go offline, was the last active CPU in the whole timer
> + * migration hierarchy and now delegates handling of the hierarchy to
> + * this CPU.
I'm failing here…
> + * Racy lockless check is valid:
> + * - @tmc->wakeup_recalc is set by the remote CPU before it issues
> + * reschedule IPI.
> + * - As interrupts are disabled here this CPU will either observe
> + * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
> + * it will observe it when this function is called again on return
> + * from handling the reschedule IPI.
> + */
> + if (tmc->wakeup_recalc) {
> + raw_spin_lock(&tmc->lock);
> +
> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
> +
> + if (data.nextexp != KTIME_MAX)
> + ret = 1;
> +
> + WRITE_ONCE(tmc->wakeup, data.nextexp);
> + tmc->wakeup_recalc = false;
> + raw_spin_unlock(&tmc->lock);
> +
> + return ret;
> + }
> +
> + /*
> + * When the CPU is idle and @tmc->wakeup is reliable, compare it with
> + * @data.now. On 64 bit it is valid to do this lockless. On 32 bit
> + * systems, holding the lock is required to get valid data on concurrent
> + * writers.
The wakeup value is reliable when the CPU is idle?
See the comments regarding 64bit.
> + */
> + if (IS_ENABLED(CONFIG_64BIT)) {
> + if (data.now >= READ_ONCE(tmc->wakeup))
> + ret = 1;
> + } else {
> + raw_spin_lock(&tmc->lock);
> + if (data.now >= tmc->wakeup)
> + ret = 1;
> + raw_spin_unlock(&tmc->lock);
> + }
> +
> + return ret;
> +}
…
> +static bool tmigr_inactive_up(struct tmigr_group *group,
> + struct tmigr_group *child,
> + void *ptr)
> +{
> + union tmigr_state curstate, newstate;
> + struct tmigr_walk *data = ptr;
> + bool walk_done;
> + u8 childmask;
> +
> + childmask = data->childmask;
> + newstate = curstate = data->groupstate;
> +
> +retry:
> + walk_done = true;
> +
> + /* Reset active bit when the child is no longer active */
> + if (!data->childstate.active)
> + newstate.active &= ~childmask;
> +
> + if (newstate.migrator == childmask) {
> + /*
> + * Find a new migrator for the group, because the child group is
> + * idle!
> + */
> + if (!data->childstate.active) {
> + unsigned long new_migr_bit, active = newstate.active;
> +
> + new_migr_bit = find_first_bit(&active, BIT_CNT);
> +
> + if (new_migr_bit != BIT_CNT) {
> + newstate.migrator = BIT(new_migr_bit);
> + } else {
> + newstate.migrator = TMIGR_NONE;
> +
> + /* Changes need to be propagated */
> + walk_done = false;
> + }
> + }
> + }
> +
> + newstate.seq++;
> +
> + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> +
> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
This one appears wrong, too. The curstate is not getting updated during
retry.
> + newstate.state = curstate.state;
> +
> + /*
> + * Something changed in the child/parent group in the meantime,
> + * reread the state of the child and parent; Update of
> + * data->childstate is required for event handling;
> + */
> + if (child)
> + data->childstate.state = atomic_read(&child->migr_state);
> +
> + goto retry;
> + }
> +
> + data->groupstate = newstate;
> + data->remote = false;
> +
> + /* Event Handling */
> + tmigr_update_events(group, child, data);
> +
> + if (group->parent && (walk_done == false)) {
> + data->childmask = group->childmask;
> + data->childstate = newstate;
> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> + }
> +
> + /*
> + * data->nextexp was set by tmigr_update_events() and contains the
> + * expiry of the first global event which needs to be handled
> + */
> + if (data->nextexp != KTIME_MAX) {
> + WARN_ON_ONCE(group->parent);
> + /*
> + * Top level path - If this CPU is about going offline, wake
> + * up some random other CPU so it will take over the
> + * migrator duty and program its timer properly. Ideally
> + * wake the CPU with the closest expiry time, but that's
> + * overkill to figure out.
> + *
> + * Set wakeup_recalc of remote CPU, to make sure the complete
> + * idle hierarchy with enqueued timers is reevaluated.
> + */
> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> + unsigned int cpu = smp_processor_id();
> + struct tmigr_cpu *tmc_resched;
> +
> + cpu = cpumask_any_but(cpu_online_mask, cpu);
> + tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
> +
> + raw_spin_unlock(&tmc->lock);
> +
> + raw_spin_lock(&tmc_resched->lock);
> + tmc_resched->wakeup_recalc = true;
> + raw_spin_unlock(&tmc_resched->lock);
> +
> + raw_spin_lock(&tmc->lock);
> + smp_send_reschedule(cpu);
This whole thing confuses me.
If the CPU goes offline, it needs to get removed from the migration
hierarchy and this is it. Everything else is handled by the migrator. If
the migrator is going offline then it needs to wake a random CPU and make
sure it takes the migrator role. I am confused by the whole ::wakeup and
::wakeup_recalc thingy.
> + }
> + }
> +
> + return walk_done;
> +}
Sebastian
Sebastian Siewior <[email protected]> writes:
> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
>> new file mode 100644
>> index 000000000000..05cd8f1bc45d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.c
>> @@ -0,0 +1,1636 @@
> …
>> +static bool tmigr_active_up(struct tmigr_group *group,
>> + struct tmigr_group *child,
>> + void *ptr)
>> +{
>> + union tmigr_state curstate, newstate;
>> + struct tmigr_walk *data = ptr;
>> + bool walk_done;
>> + u8 childmask;
>> +
>> + childmask = data->childmask;
>> + newstate = curstate = data->groupstate;
>> +
>> +retry:
>> + walk_done = true;
>> +
>> + if (newstate.migrator == TMIGR_NONE) {
>> + newstate.migrator = childmask;
>> +
>> + /* Changes need to be propagated */
>> + walk_done = false;
>> + }
>> +
>> + newstate.active |= childmask;
>> +
>> + newstate.seq++;
>> +
>> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
>> + newstate.state = curstate.state;
>
> This does not look right. If
> group->migr_state != curstate.state
> then
> curstate.state = newstate.state
>
> does not make a difference since curstate is on stack. So this should
> become
>
> | if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> | curstate.state = newstate.state = atomic_read(&group->parent->migr_state);
>
> and now I question the existence of tmigr_walk::groupstate. It does not
> match the comment for the struct saying it will be re-read if the
> cmpxchg() fails because it does not happen (at least here). Also why do
> we have it? Is it to avoid atomic_read() and have it "cached"?
atomic_try_cmpxchg() updates curstate.state with the actual
group->migr_state when those values do not match. So it is reread by
atomic_try_cmpxchg() and still matches the description. (At least this
is what the function description of atomic_try_cmpxchg() says.)
But besides this, why would I want to update curstate.state with
group->parent->migr_state when the cmpxchg of this group wasn't successful
yet? Or was it a copy paste error?
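The same convention exists in C11, so the behaviour can be shown in userspace: atomic_compare_exchange_strong() writes the value it found into the "expected" argument when the exchange fails. set_bit_retry() below is a hypothetical helper for this illustration only:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Set @bit in @word with a compare-and-exchange loop, starting from a
 * possibly stale snapshot @seen. Returns the number of attempts needed.
 */
static int set_bit_retry(_Atomic uint32_t *word, uint32_t seen, uint32_t bit)
{
	uint32_t cur = seen;
	int attempts = 0;

	for (;;) {
		uint32_t next = cur | bit;

		attempts++;
		/* On failure @cur is overwritten with the value found in @word. */
		if (atomic_compare_exchange_strong(word, &cur, next))
			return attempts;
	}
}
```

Starting from a stale snapshot the first attempt fails, the snapshot is refreshed as a side effect, and the second attempt succeeds without any explicit reread in the loop body.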
>> + goto retry;
>> + }
>> +
>> + if (group->parent && (walk_done == false)) {
>
> The group's parent doesn't change so it can be accessed lock-less. It is
> just that the topmost group has no parent so we need this check. I would
> move the walk_done check to the left so it can be evaluated first.
Will change it.
>> + data->groupstate.state = atomic_read(&group->parent->migr_state);
>> + data->childmask = group->childmask;
>
> We don't re-read in case the cmpxchg failed, assuming someone else is
> updating the state. Wouldn't it make sense to read the childmask at the top
> of the function from the child pointer? There is no need to keep it around
> in the data pointer, right?
When we are in lvl0, then @child is NULL as the child is a tmigr_cpu and
not a tmigr_group. This is the reason why I decided to store it inside
the tmigr_walk struct.
>> + }
>> +
>> + /*
>> + * The group is active and the event will be ignored - the ignore flag is
>> + * updated without holding the lock. In case the bit is set while
>> + * another CPU already handles remote events, nothing happens, because
>> + * it is clear that the CPU became active just in this moment, or in
>> + * worst case the event is handled remote. Nothing to worry about.
>> + */
>
> The CPU is becoming active, so is the group. The ignore flag for the
> group is updated lockless to reflect this. Another CPU might also
> set it to true while becoming active. In the worst case the migrator
> observes it too late and remotely expires a timer belonging to this
> group. The lock is held while going idle (and setting ignore to
> false) so the state is not lost.
>
This is what I wanted to explain:
/*
* The group is active (again). The group event might be still queued
* into the parent group's timerqueue but can now be handled by the the
* migrator of this group. Therefore the ignore flag for the group event
* is updated to reflect this.
*
* The update of the ignore flag in the active path is done lock
* less. In worst case the migrator of the parent group observes the
* change too late and expires remotely timer belonging to this
* group. The lock is held while updating the ignore flag in idle
* path. So this state change will not be lost.
*/
>> + group->groupevt.ignore = true;
>> +
>> + return walk_done;
>> +}
[...]
>> +
>> +static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
>> + unsigned long jif)
>> +{
>> + struct timer_events tevt;
>> + struct tmigr_walk data;
>> + struct tmigr_cpu *tmc;
>> + u64 next = KTIME_MAX;
>> +
>> + tmc = per_cpu_ptr(&tmigr_cpu, cpu);
>> +
>> + raw_spin_lock_irq(&tmc->lock);
>> +
>> + /*
>> + * The remote CPU is offline or the CPU event does not has to be handled
>> + * (the CPU is active or there is no longer an event to expire) or
>> + * another CPU handles the CPU timers already or the next event was
>> + * already expired - return!
>> + */
>
> The comment is the english version of the C code below. The *why* is
> usually the thing we care about. This basically sums up to:
>
> If remote CPU is offline then the timer events have been
> migrated away.
> If tmigr_cpu::remote is set then another CPU takes care of this.
> If tmigr_event::ignore is set then the CPU is returning from
> idle and take care of its timers.
> If the next event expires in the future then the event node has
> been updated and there are no timer to expire right now.
I'll change it. Thanks.
>> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
>> + now < tmc->cpuevt.nextevt.expires) {
>> + raw_spin_unlock_irq(&tmc->lock);
>> + return next;
>
> Looking at the last condition where the timerqueue has been forwarded by
> a jiffy+, shouldn't we return _that_ values next instead of KTIME_MAX?
No, because the event is already queued into the hierarchy and the
migrator takes care of it. If the hierarchy is completely idle, the CPU
which updated the event takes care of it. I'll add this to the comment above.
>> + }
>> +
>> + tmc->remote = true;
>> + WRITE_ONCE(tmc->wakeup, KTIME_MAX);
>> +
>> + /* Drop the lock to allow the remote CPU to exit idle */
>> + raw_spin_unlock_irq(&tmc->lock);
>> +
>> + if (cpu != smp_processor_id())
>> + timer_expire_remote(cpu);
>> +
>> + /*
>> + * Lock ordering needs to be preserved - timer_base locks before tmigr
>> + * related locks (see section "Locking rules" in the documentation at
>> + * the top). During fetching the next timer interrupt, also tmc->lock
>> + * needs to be held. Otherwise there is a possible race window against
>> + * the CPU itself when it comes out of idle, updates the first timer in
>> + * the hierarchy and goes back to idle.
>> + *
>> + * timer base locks are dropped as fast as possible: After checking
>> + * whether the remote CPU went offline in the meantime and after
>> + * fetching the next remote timer interrupt. Dropping the locks as fast
>> + * as possible keeps the locking region small and prevents holding
>> + * several (unnecessary) locks during walking the hierarchy for updating
>> + * the timerqueue and group events.
>> + */
>> + local_irq_disable();
>> + timer_lock_remote_bases(cpu);
>> + raw_spin_lock(&tmc->lock);
>> +
>> + /*
>> + * When the CPU went offline in the meantime, no hierarchy walk has to
>> + * be done for updating the queued events, because the walk was
>> + * already done during marking the CPU offline in the hierarchy.
>> + *
>> + * When the CPU is no longer idle, the CPU takes care of the timers and
> If the CPU is no longer idle then it takes care of its timers and
> also of the timers in its hierarchy.
>
>> + * also of the timers in the path to the top.
>> + *
>> + * (See also section "Required event and timerqueue update after
>> + * remote expiry" in the documentation at the top)
>
> Perfect. The reasoning why it is safe to skip walk + expire timers, has
> been explained.
:)
>> + */
>> + if (!tmc->online || !tmc->idle) {
>> + timer_unlock_remote_bases(cpu);
>> + goto unlock;
>> + } else {
>> + /* next event of CPU */
>> + fetch_next_timer_interrupt_remote(jif, now, &tevt, cpu);
>> + }
>> +
>> + timer_unlock_remote_bases(cpu);
>> +
>> + data.evt = &tmc->cpuevt;
>> + data.nextexp = tevt.global;
>> + data.remote = true;
>> +
>> + /*
>> + * The update is done even when there is no 'new' global timer pending
>> + * on the remote CPU (see section "Required event and timerqueue update
>> + * after remote expiry" in the documentation at the top)
>> + */
>> + walk_groups(&tmigr_new_timer_up, &data, tmc);
>> +
>> + next = data.nextexp;
>> +
>> +unlock:
>> + tmc->remote = false;
>> + raw_spin_unlock_irq(&tmc->lock);
>> +
>> + return next;
>> +}
[...]
>> +/**
>> + * tmigr_requires_handle_remote() - Check the need of remote timer handling
>> + *
>> + * Must be called with interrupts disabled.
>> + */
>> +int tmigr_requires_handle_remote(void)
>> +{
>> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>> + struct tmigr_remote_data data;
>> + unsigned int ret = 0;
>> + unsigned long jif;
>> +
>> + if (tmigr_is_not_available(tmc))
>> + return ret;
>> +
>> + data.now = get_jiffies_update(&jif);
>> + data.childmask = tmc->childmask;
>> + data.nextexp = KTIME_MAX;
>> + data.tmc_active = !tmc->idle;
>> + data.check = false;
>> +
>> + /*
>> + * When the CPU is active, walking the hierarchy to check whether a
>> + * remote expiry is required.
>
> s/When/If
> s/walking/walk
>
>> + *
>> + * Check is done lockless as interrupts are disabled and @tmc->idle is
>> + * set only by the local CPU.
>> + */
>> + if (!tmc->idle) {
>> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> + if (data.nextexp != KTIME_MAX)
>> + ret = 1;
>> +
>> + return ret;
>> + }
>> +
>> + /*
>> + * When the CPU is idle, check whether the recalculation of @tmc->wakeup
>> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is
>> + * about to go offline, was the last active CPU in the whole timer
>> + * migration hierarchy and now delegates handling of the hierarchy to
>> + * this CPU.
>
> I'm failing here…
If the CPU is idle, check whether the recalculation of @tmc->wakeup
is required. @tmc->wakeup_recalc is set, when the last active CPU
went offline. The last active CPU delegated the handling of the timer
migration hierarchy to another (this) CPU by updating this flag and
sending a reschedule.
Better?
>> + * Racy lockless check is valid:
>> + * - @tmc->wakeup_recalc is set by the remote CPU before it issues
>> + * reschedule IPI.
>> + * - As interrupts are disabled here this CPU will either observe
>> + * @tmc->wakeup_recalc set before the reschedule IPI can be handled or
>> + * it will observe it when this function is called again on return
>> + * from handling the reschedule IPI.
>> + */
>> + if (tmc->wakeup_recalc) {
>> + raw_spin_lock(&tmc->lock);
>> +
>> + __walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> + if (data.nextexp != KTIME_MAX)
>> + ret = 1;
>> +
>> + WRITE_ONCE(tmc->wakeup, data.nextexp);
>> + tmc->wakeup_recalc = false;
>> + raw_spin_unlock(&tmc->lock);
>> +
>> + return ret;
>> + }
>> +
>> + /*
>> + * When the CPU is idle and @tmc->wakeup is reliable, compare it with
>> + * @data.now. On 64 bit it is valid to do this lockless. On 32 bit
>> + * systems, holding the lock is required to get valid data on concurrent
>> + * writers.
>
> The wakeup value is reliable when the CPU is idle?
Yes and when wakeup_recalc is not set.
> See the comments regarding 64bit.
Updated it accordingly.
>
>> + */
>> + if (IS_ENABLED(CONFIG_64BIT)) {
>> + if (data.now >= READ_ONCE(tmc->wakeup))
>> + ret = 1;
>> + } else {
>> + raw_spin_lock(&tmc->lock);
>> + if (data.now >= tmc->wakeup)
>> + ret = 1;
>> + raw_spin_unlock(&tmc->lock);
>> + }
>> +
>> + return ret;
>> +}
>
> …
>
>> +static bool tmigr_inactive_up(struct tmigr_group *group,
>> + struct tmigr_group *child,
>> + void *ptr)
>> +{
>> + union tmigr_state curstate, newstate;
>> + struct tmigr_walk *data = ptr;
>> + bool walk_done;
>> + u8 childmask;
>> +
>> + childmask = data->childmask;
>> + newstate = curstate = data->groupstate;
>> +
>> +retry:
>> + walk_done = true;
>> +
>> + /* Reset active bit when the child is no longer active */
>> + if (!data->childstate.active)
>> + newstate.active &= ~childmask;
>> +
>> + if (newstate.migrator == childmask) {
>> + /*
>> + * Find a new migrator for the group, because the child group is
>> + * idle!
>> + */
>> + if (!data->childstate.active) {
>> + unsigned long new_migr_bit, active = newstate.active;
>> +
>> + new_migr_bit = find_first_bit(&active, BIT_CNT);
>> +
>> + if (new_migr_bit != BIT_CNT) {
>> + newstate.migrator = BIT(new_migr_bit);
>> + } else {
>> + newstate.migrator = TMIGR_NONE;
>> +
>> + /* Changes need to be propagated */
>> + walk_done = false;
>> + }
>> + }
>> + }
>> +
>> + newstate.seq++;
>> +
>> + WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
>> +
>> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
>
> This one appears wrong, too. The curstate is not getting updated during
> retry.
See the answer above.
>> + newstate.state = curstate.state;
>> +
>> + /*
>> + * Something changed in the child/parent group in the meantime,
>> + * reread the state of the child and parent; Update of
>> + * data->childstate is required for event handling;
>> + */
>> + if (child)
>> + data->childstate.state = atomic_read(&child->migr_state);
>> +
>> + goto retry;
>> + }
>> +
>> + data->groupstate = newstate;
>> + data->remote = false;
>> +
>> + /* Event Handling */
>> + tmigr_update_events(group, child, data);
>> +
>> + if (group->parent && (walk_done == false)) {
>> + data->childmask = group->childmask;
>> + data->childstate = newstate;
>> + data->groupstate.state = atomic_read(&group->parent->migr_state);
>> + }
>> +
>> + /*
>> + * data->nextexp was set by tmigr_update_events() and contains the
>> + * expiry of the first global event which needs to be handled
>> + */
>> + if (data->nextexp != KTIME_MAX) {
>> + WARN_ON_ONCE(group->parent);
>> + /*
>> + * Top level path - If this CPU is about going offline, wake
>> + * up some random other CPU so it will take over the
>> + * migrator duty and program its timer properly. Ideally
>> + * wake the CPU with the closest expiry time, but that's
>> + * overkill to figure out.
>> + *
>> + * Set wakeup_recalc of remote CPU, to make sure the complete
>> + * idle hierarchy with enqueued timers is reevaluated.
>> + */
>> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
>> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>> + unsigned int cpu = smp_processor_id();
>> + struct tmigr_cpu *tmc_resched;
>> +
>> + cpu = cpumask_any_but(cpu_online_mask, cpu);
>> + tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
>> +
>> + raw_spin_unlock(&tmc->lock);
>> +
>> + raw_spin_lock(&tmc_resched->lock);
>> + tmc_resched->wakeup_recalc = true;
>> + raw_spin_unlock(&tmc_resched->lock);
>> +
>> + raw_spin_lock(&tmc->lock);
>> + smp_send_reschedule(cpu);
>
> This whole thing confuses me.
> If the CPU goes offline, it needs to get removed from the migration
> hierarchy and this is it. Everything else is handled by the migrator. If
> the migrator is going offline then it needs wake a random CPU and make
> sure it takes the migrator role. I am confused by the whole ::wakeup and
> ::wakeup_recalc thingy.
>
wakeup_recalc is required to indicate that the CPU was chosen as the
new migrator CPU when the last active CPU in the timer migration hierarchy
went offline.
When a CPU goes idle and it is the last active CPU in the timer
migration hierarchy, it has to make sure that it wakes up in time to
handle the first event of the hierarchy. On the normal idle path this is
not a problem as the value of the first event of the hierarchy is
returned. But when an IRQ occurs on this idle CPU, the timers are
revisited. The first event of the timer migration hierarchy must then
still be considered, as the CPU cannot be sure that another CPU will
handle it. So the value is stored on the idle path in tmc->wakeup.
Otherwise every idle CPU would have to walk the hierarchy again after an
IRQ to make sure nothing is lost, as the CPU doesn't know what happened
in the meantime. I'm aware that several CPUs can have the same wakeup
value when there is no new first event, the hierarchy is idle and some
CPUs become active and go idle again directly. But to cover this, we
would need a single point holding the first event and the last migrator
information, and I'm quite sure that it will not scale.
Hopefully this is now clear?
>> + }
>> + }
>> +
>> + return walk_done;
>> +}
>
> Sebastian
Thanks,
Anna-Maria
Anna-Maria Behnsen writes:
> Sebastian Siewior writes:
>
>> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>>
>> This whole thing confuses me.
>> If the CPU goes offline, it needs to get removed from the migration
>> hierarchy and this is it. Everything else is handled by the migrator. If
>> the migrator is going offline then it needs wake a random CPU and make
>> sure it takes the migrator role. I am confused by the whole ::wakeup and
>> ::wakeup_recalc thingy.
>>
>
> wakeup_recalc is required to indicate, that the CPU was chosen as the
> new migrator CPU when the last active CPU in timer migration hierarchy
> went offline.
>
> When a CPU goes idle and it is the last active CPU in the timer
> migration hierarchy, it has to make sure that it wakes up in time to
> handle the first event of the hierarchy. On the normal idle path this is
> not a problem as the value of the first event of the hierarchy is
> returned. But when an IRQ occurs on this idle CPU, the timers are
> revisited again. But then it is also required that the first event of
> the timer migration hierarchy is still considered, as the CPU cannot
> make sure another CPU will handle it. So the value is stored on idle
> path to tmc->wakeup. Otherwise every idle CPU has to walk the hierarchy
> again after an IRQ to make sure nothing is lost as the CPU doesn't know
> what happened in the meantime. I'm aware, that it is possible that
> several CPUs have the same wakeup value when there is no new first event
> and if the hierarchy is idle and some CPUs become active and go idle
> directly again. But if we want to cover this, we need a single point
> with the first event and with the last migrator information and I'm
> quite sure that it will not scale.
>
I forgot to explain one thing: A CPU which does remote expiry can itself
be idle in the timer migration hierarchy. Therefore the wakeup value is
updated unconditionally at the end of remote handling with the new first
event of the timer migration hierarchy.
Thanks,
Anna-Maria
On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> +static int __init tmigr_init(void)
> +{
> + unsigned int cpulvl, nodelvl, cpus_per_node, i;
> + unsigned int nnodes = num_possible_nodes();
> + unsigned int ncpus = num_possible_cpus();
> + int ret = -ENOMEM;
> +
> + /* Nothing to do if running on UP */
> + if (ncpus == 1)
> + return 0;
> +
> + /*
> + * Calculate the required hierarchy levels. Unfortunately there is no
> + * reliable information available, unless all possible CPUs have been
> + * brought up and all numa nodes are populated.
NUMA
> + *
> + * Estimate the number of levels with the number of possible nodes and
> + * the number of possible CPUs. Assume CPUs are spread evenly across
> + * nodes. We cannot rely on cpumask_of_node() because there only already
> + * online CPUs are considered.
> + */
We cannot rely on cpumask_of_node() because it only works for online
CPUs.
> + cpus_per_node = DIV_ROUND_UP(ncpus, nnodes);
> +
> + /* Calc the hierarchy levels required to hold the CPUs of a node */
> + cpulvl = DIV_ROUND_UP(order_base_2(cpus_per_node),
> + ilog2(TMIGR_CHILDREN_PER_GROUP));
> +
> + /* Calculate the extra levels to connect all nodes */
> + nodelvl = DIV_ROUND_UP(order_base_2(nnodes),
> + ilog2(TMIGR_CHILDREN_PER_GROUP));
> +
> + tmigr_hierarchy_levels = cpulvl + nodelvl;
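For anyone who wants to check the level estimate with concrete numbers,
here is a userspace sketch of the calculation quoted above. The demo_
helpers are stand-ins written for this example; they are assumed to mirror
DIV_ROUND_UP(), ilog2() and order_base_2() for the small positive inputs
used here:

```c
#include <assert.h>

#define DEMO_CHILDREN_PER_GROUP	8
#define DEMO_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* floor(log2(n)) for n >= 1 */
static unsigned int demo_ilog2(unsigned int n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* ceil(log2(n)), 0 for n <= 1 */
static unsigned int demo_order_base_2(unsigned int n)
{
	return n > 1 ? demo_ilog2(n - 1) + 1 : 0;
}

struct demo_levels {
	unsigned int cpulvl;	/* levels needed for the CPUs of one node */
	unsigned int nodelvl;	/* extra levels to connect all nodes */
	unsigned int total;
};

static struct demo_levels demo_calc_levels(unsigned int ncpus,
					   unsigned int nnodes)
{
	unsigned int bits = demo_ilog2(DEMO_CHILDREN_PER_GROUP);
	unsigned int cpus_per_node;
	struct demo_levels lvl;

	assert(ncpus > 0 && nnodes > 0);

	/* Assume CPUs are spread evenly across nodes */
	cpus_per_node = DEMO_DIV_ROUND_UP(ncpus, nnodes);

	lvl.cpulvl = DEMO_DIV_ROUND_UP(demo_order_base_2(cpus_per_node), bits);
	lvl.nodelvl = DEMO_DIV_ROUND_UP(demo_order_base_2(nnodes), bits);
	lvl.total = lvl.cpulvl + lvl.nodelvl;

	return lvl;
}
```

With 256 possible CPUs on 4 nodes this gives cpulvl = 2 and nodelvl = 1,
so three hierarchy levels and a crossnode level of 2.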
> +
> + /*
> + * If a numa node spawns more than one CPU level group then the next
NUMA
> + * level(s) of the hierarchy contains groups which handle all CPU groups
> + * of the same numa node. The level above goes across numa nodes. Store
NUMA
> + * this information for the setup code to decide when node matching is
> + * not longer required.
s/not longer/no longer ?
> + */
> + tmigr_crossnode_level = cpulvl;
> +
> + tmigr_level_list = kcalloc(tmigr_hierarchy_levels, sizeof(struct list_head), GFP_KERNEL);
> + if (!tmigr_level_list)
> + goto err;
> +
> + for (i = 0; i < tmigr_hierarchy_levels; i++)
> + INIT_LIST_HEAD(&tmigr_level_list[i]);
> +
> + pr_info("Timer migration: %d hierarchy levels; %d children per group;"
> + " %d crossnode level\n",
> + tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
> + tmigr_crossnode_level);
> +
> + ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> + tmigr_cpu_online, tmigr_cpu_offline);
> + if (ret)
> + goto err;
> +
> + return 0;
> +
> +err:
> + pr_err("Timer migration setup failed\n");
> + return ret;
> +}
> +late_initcall(tmigr_init);
> diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
> new file mode 100644
> index 000000000000..260b87e5708d
> --- /dev/null
> +++ b/kernel/time/timer_migration.h
> @@ -0,0 +1,144 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _KERNEL_TIME_MIGRATION_H
> +#define _KERNEL_TIME_MIGRATION_H
> +
> +/* Per group capacity. Must be a power of 2! */
> +#define TMIGR_CHILDREN_PER_GROUP 8
BUILD_BUG_ON_NOT_POWER_OF_2(TMIGR_CHILDREN_PER_GROUP)
Maybe in the .c file.
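A rough userspace equivalent of such a check, in case it helps. This is a
sketch only: it uses C11 static_assert instead of the kernel macro, and
the DEMO_ names are invented for the example:

```c
#include <assert.h>

#define DEMO_CHILDREN_PER_GROUP	8

/* Non-zero and exactly one bit set */
#define DEMO_IS_POWER_OF_2(n)	((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Breaks the build when the group capacity is changed to e.g. 6 */
static_assert(DEMO_IS_POWER_OF_2(DEMO_CHILDREN_PER_GROUP),
	      "group capacity must be a power of 2");

/* Runtime variant of the same test, only here to try other values */
static int demo_is_power_of_2(unsigned int n)
{
	return DEMO_IS_POWER_OF_2(n);
}
```

The ilog2() based level calculation depends on this property, so failing
at build time seems preferable to a comment only.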
> +/**
> + * struct tmigr_event - a timer event associated to a CPU
> + * @nextevt: The node to enqueue an event in the parent group queue
> + * @cpu: The CPU to which this event belongs
> + * @ignore: Hint whether the event could be ignored; it is set when
> + * CPU or group is active;
> + */
> +struct tmigr_event {
> + struct timerqueue_node nextevt;
> + unsigned int cpu;
> + bool ignore;
> +};
> +
> +/**
> + * struct tmigr_group - timer migration hierarchy group
> + * @lock: Lock protecting the event information and group hierarchy
> + * information during setup
> + * @migr_state: State of the group (see union tmigr_state)
So the lock does not protect migr_state? Mind moving it a little down the
road? Not only would it be more obvious what is protected by the lock
but it would also move migr_state into another/later cache line.
> + * @parent: Pointer to the parent group
> + * @groupevt: Next event of the group which is only used when the
> + * group is !active. The group event is then queued into
> + * the parent timer queue.
> + * Ignore bit of @groupevt is set when the group is active.
> + * @next_expiry: Base monotonic expiry time of the next event of the
> + * group; It is used for the racy lockless check whether a
> + * remote expiry is required; it is always reliable
> + * @events: Timer queue for child events queued in the group
> + * @childmask: childmask of the group in the parent group; is set
> + * during setup and will never change; could be read
_can_ be read lockless.
> + * lockless
> + * @level: Hierarchy level of the group; Required during setup
> + * @list: List head that is added to the per level
> + * tmigr_level_list; is required during setup when a
> + * new group needs to be connected to the existing
> + * hierarchy groups
> + * @numa_node: Is set to numa node when level < tmigr_crossnode_level;
NUMA
(as long as the group level is per NUMA node).
> + * otherwise it is set to NUMA_NO_NODE; Required for
> + * setup only to make sure CPUs and groups are per
> + * numa node as long as level < tmigr_crossnode_level
… to make sure CPU and group information is NUMA local. This is
true until the top most hierarchy level (level <
tmigr_crossnode_level).
> + * @num_children: Counter of group children to make sure the group is only
> + * filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
> + * only
> + */
> +struct tmigr_group {
> + raw_spinlock_t lock;
> + atomic_t migr_state;
> + struct tmigr_group *parent;
> + struct tmigr_event groupevt;
> + u64 next_expiry;
> + struct timerqueue_head events;
> + u8 childmask;
> + unsigned int level;
> + struct list_head list;
> + int numa_node;
> + unsigned int num_children;
> +};
> +
> +/**
> + * struct tmigr_cpu - timer migration per CPU group
> + * @lock: Lock protecting the tmigr_cpu group information
> + * @online: Indicates whether the CPU is online; In deactivate path
> + * it is required to know whether the migrator in the top
> + * level group is on the way to go offline when a timer is
level group, which is to be set offline, while a timer is pending.
> + * pending. Then another online CPU needs to be rescheduled
> + * to make sure the timers are handled properly;
Then another online CPU needs to be notified to take over the
migrator role.
The "rescheduled" part sounds like the current implementation.
> + * Furthermore the information is required in CPU hotplug
> + * path as the CPU is able to go idle before the timer
> + * migration hierarchy hotplug AP is reached. During this
> + * phase, the CPU has to handle the global timers by its
s/by its own/on its own/
> + * own and does not act as a migrator.
s/does not/must not
> + * @idle: Indicates whether the CPU is idle in the timer migration
> + * hierarchy
> + * @remote: Is set when timers of the CPU are expired remote
s/remote/remotely
> + * @wakeup_recalc: Indicates, whether a recalculation of the @wakeup value
> + * is required. It is only used when the CPU is marked idle
> + * in the timer migration hierarchy.
What does `It' refer to? Is it `wakeup_recalc' or `wakeup' ?
> + * @tmgroup: Pointer to the parent group
> + * @childmask: childmask of tmigr_cpu in the parent group
> + * @wakeup: Stores the first timer when the timer migration
> + * hierarchy is completely idle and remote expiry was done;
> + * is returned to timer code in the idle path; it is only
is used in the idle path only (what is the idle path (probably
obvious))
> + * valid, when @wakeup_recalc is not set
> + * @cpuevt: CPU event which could be queued into the parent group
I don't know why but it feels like s/queued/enqueued/g
But it might be a British vs American thing.
Sebastian
Frederic Weisbecker writes:
> On Tue, Dec 05, 2023 at 12:53:03PM +0100, Anna-Maria Behnsen wrote:
>>
>> Frederic, what do you think?
>
> So it looks like is_idle must be fixed.
>
> As for the timer softirq, ->next_expiry is already unreliable because when
> a timer is removed, ->next_expiry is not updated (even though that removed
> timer might have been the earliest). So ->next_expiry can already carry a
> "too early" value. The only constraint is that ->next_expiry can't be later
> than the first timer.
>
> So I'd rather put a comment somewhere about the fact that wrapping is expected
> to behave ok. But it's your call.
Ok. If both solutions are fine, I would like to take the solution which
updates the next_expiry values for empty bases. It will make the
comparison of the expiry values of the global and local timer bases
easier in one of the later patches.
Thanks,
Anna-Maria
On Tue, Dec 12, 2023 at 02:21:25PM +0100, Anna-Maria Behnsen wrote:
> Frederic Weisbecker writes:
>
> > On Tue, Dec 05, 2023 at 12:53:03PM +0100, Anna-Maria Behnsen wrote:
> >>
> >> Frederic, what do you think?
> >
> > So it looks like is_idle must be fixed.
> >
> > As for the timer softirq, ->next_expiry is already unreliable because when
> > a timer is removed, ->next_expiry is not updated (even though that removed
> > timer might have been the earliest). So ->next_expiry can already carry a
> > "too early" value. The only constraint is that ->next_expiry can't be later
> > than the first timer.
> >
> > So I'd rather put a comment somewhere about the fact that wrapping is expected
> > to behave ok. But it's your call.
>
> Ok. If both solutions are fine, I would like to take the solution with
> updating the next_expiry values for empty bases. It will make the
> compare of expiry values of global and local timer base easier in one of
> the patches later on.
Fine by me at least!
Thanks.
> Thanks,
>
> Anna-Maria
>
Sebastian Siewior writes:
> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
[...]
>> diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
>> new file mode 100644
>> index 000000000000..260b87e5708d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.h
>> @@ -0,0 +1,144 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +#ifndef _KERNEL_TIME_MIGRATION_H
>> +#define _KERNEL_TIME_MIGRATION_H
>> +
>> +/* Per group capacity. Must be a power of 2! */
>> +#define TMIGR_CHILDREN_PER_GROUP 8
>
> BUILD_BUG_ON_NOT_POWER_OF_2(TMIGR_CHILDREN_PER_GROUP)
>
> Maybe in the .c file.
>
in tmigr_init() ?
>> +/**
>> + * struct tmigr_event - a timer event associated to a CPU
>> + * @nextevt: The node to enqueue an event in the parent group queue
>> + * @cpu: The CPU to which this event belongs
>> + * @ignore: Hint whether the event could be ignored; it is set when
>> + * CPU or group is active;
>> + */
>> +struct tmigr_event {
>> + struct timerqueue_node nextevt;
>> + unsigned int cpu;
>> + bool ignore;
>> +};
>> +
>> +/**
>> + * struct tmigr_group - timer migration hierarchy group
>> + * @lock: Lock protecting the event information and group hierarchy
>> + * information during setup
>> + * @migr_state: State of the group (see union tmigr_state)
>
> So the lock does not protect migr_state?
Right - this is not required due to the atomic cmpxchg and sequence
counter.
> Mind moving it a little down the road? Not only would it be more
> obvious what is protected by the lock but it would also move
> migr_state in another/ later cache line.
>
Where do you want me to move it? Switch places of lock and migr_state?
If I put it in another place, I would generate holes. A general
question: Is it required to have a look at the struct with pahole also
with LOCKDEP enabled? If yes, lock should stay at the first position.
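To illustrate what I mean by holes, here is a small userspace sketch. The
two structs are hypothetical and are not the tmigr_group layout; they
contain the same members in a different order, and the helpers report how
many padding bytes the compiler adds:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Members sorted by decreasing alignment: only tail padding remains */
struct demo_dense_order {
	void		*parent;
	uint64_t	next_expiry;
	uint32_t	state;
	uint8_t		childmask;
};

/* Same members, small ones placed in front of larger ones */
struct demo_holey_order {
	uint8_t		childmask;	/* hole up to the pointer alignment */
	void		*parent;
	uint32_t	state;		/* possible hole up to u64 alignment */
	uint64_t	next_expiry;
};

/* The first member after childmask can never start inside it */
static_assert(offsetof(struct demo_holey_order, parent) >= sizeof(uint8_t),
	      "parent must follow childmask");

#define DEMO_PAYLOAD	(sizeof(void *) + sizeof(uint64_t) + \
			 sizeof(uint32_t) + sizeof(uint8_t))

static size_t demo_padding_dense(void)
{
	return sizeof(struct demo_dense_order) - DEMO_PAYLOAD;
}

static size_t demo_padding_holey(void)
{
	return sizeof(struct demo_holey_order) - DEMO_PAYLOAD;
}
```

The exact numbers depend on the ABI, which is why the layout would have to
be checked with pahole for every relevant config.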
[...]
>> + * @cpuevt: CPU event which could be queued into the parent group
>
> I don't know why but it feels like s/queued/enqueued/g
> But it might be a British vs American thing.
I think queued/enqueued is not used consistently all over the place
(in my patchset). But I'm also not a native speaker and not sure which
is the proper one :). Nevertheless, I'll change it.
Thanks,
Anna-Maria
On 2023-12-12 12:31:19 [+0100], Anna-Maria Behnsen wrote:
> Sebastian Siewior writes:
>
> > On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> >> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> >> new file mode 100644
> >> index 000000000000..05cd8f1bc45d
> >> --- /dev/null
> >> +++ b/kernel/time/timer_migration.c
> >> @@ -0,0 +1,1636 @@
> > …
> >> +static bool tmigr_active_up(struct tmigr_group *group,
> >> + struct tmigr_group *child,
> >> + void *ptr)
> >> +{
> >> + union tmigr_state curstate, newstate;
> >> + struct tmigr_walk *data = ptr;
> >> + bool walk_done;
> >> + u8 childmask;
> >> +
> >> + childmask = data->childmask;
> >> + newstate = curstate = data->groupstate;
> >> +
> >> +retry:
> >> + walk_done = true;
> >> +
> >> + if (newstate.migrator == TMIGR_NONE) {
> >> + newstate.migrator = childmask;
> >> +
> >> + /* Changes need to be propagated */
> >> + walk_done = false;
> >> + }
> >> +
> >> + newstate.active |= childmask;
> >> +
> >> + newstate.seq++;
> >> +
> >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> >> + newstate.state = curstate.state;
> >
> > This does not look right. If
> > group->migr_state != curstate.state
> > then
> > curstate.state = newstate.state
> >
> > does not make a difference since curstate is on stack. So this should
> > become
> >
> > | if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> > | curstate.state = newstate.state = atomic_read(&group->parent->migr_state);
> >
> > and now I question the existence of tmigr_walk::groupstate. It does not
> > match the comment for the struct saying it will be re-read if the
> > cmpxchg() fails because it does not happen (at least here). Also why do
> > we have it? Is it avoid atomic_read() and have it "cached"?
>
> atomic_try_cmpxchg() updates curstate.state with the actual
> group->migr_state when those values do not match. So it is reread by
> atomic_try_cmpxchg() and still matches the description. (This at least
> tells the function description of atomic_try_cmpxchg()).
Ach. Indeed. That part slipped my mind. Could still replace it with:
newstate = curstate
to match the assignment at the top of the function? Or do something
like:
| childmask = data->childmask;
| curstate = data->groupstate;
| retry:
| newstate = curstate;
|
| walk_done = true;
…
| if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state))
| goto retry;
So gcc can save a branch and recycle the upper part of the code.
gcc-13 does not recognise this, clang-16 does.
> But beside of this, why do I want to update curstate.state with
> group->parent->migr_state when cmpxchg of this group wasn't successful
> yet or was it a copy paste error?
It was an error.
> >> + data->groupstate.state = atomic_read(&group->parent->migr_state);
> >> + data->childmask = group->childmask;
> >
> > We don't re-read in case the cmpxchg failed assuming someone else is
> > updating the state. Wouldn't it make sense to read the childmask at top
> > of the function from the child pointer. There is no need to keep around
> > in the data pointer, right?
>
> When we are in lvl0, then @child is NULL as the child is a tmigr_cpu and
> not a tmigr_group. This is the reason why I decided to store it inside
> the tmigr_walk struct.
But it is supposed to be group->migr_state for the cmpxchg. So considering
the previous bit, why not:
| childmask = data->childmask;
| curstate.state = atomic_read(&group->migr_state);
|
| do {
| newstate = curstate;
| walk_done = true;
|
| if (newstate.migrator == TMIGR_NONE) {
| newstate.migrator = childmask;
|
| /* Changes need to be propagated */
| walk_done = false;
| }
|
| newstate.active |= childmask;
|
| newstate.seq++;
|
| } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
this seems nice.
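A compilable userspace version of that loop, as a sketch under
assumptions: C11 atomics replace atomic_read()/atomic_try_cmpxchg(), the
union layout is guessed from the field names used in the patch, and
DEMO_NONE is a made-up stand-in for TMIGR_NONE:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NONE	0xff	/* assumed "no migrator" marker */

/* Assumed layout, only the field names are taken from the patch */
union demo_state {
	uint32_t state;
	struct {
		uint8_t  active;
		uint8_t  migrator;
		uint16_t seq;
	};
};

/*
 * Mark @childmask active in @migr_state. Returns true when the walk is
 * done and false when the change has to be propagated to the parent.
 */
static bool demo_active_up(_Atomic uint32_t *migr_state, uint8_t childmask)
{
	union demo_state curstate, newstate;
	bool walk_done;

	assert(childmask != 0);

	curstate.state = atomic_load(migr_state);

	do {
		/* After a failed attempt curstate holds the fresh value */
		newstate = curstate;
		walk_done = true;

		if (newstate.migrator == DEMO_NONE) {
			newstate.migrator = childmask;

			/* Changes need to be propagated */
			walk_done = false;
		}

		newstate.active |= childmask;
		newstate.seq++;

	} while (!atomic_compare_exchange_strong(migr_state, &curstate.state,
						 newstate.state));

	return walk_done;
}
```

Since newstate is rebuilt from curstate at the top of every iteration, no
explicit fixup is needed in the failure path.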
> >> + }
> >> +
> >> + /*
> >> + * The group is active and the event will be ignored - the ignore flag is
> >> + * updated without holding the lock. In case the bit is set while
> >> + * another CPU already handles remote events, nothing happens, because
> >> + * it is clear that the CPU became active just in this moment, or in
> >> + * worst case the event is handled remote. Nothing to worry about.
> >> + */
> >
> > The CPU is becoming active, so is the group. The ignore flag for the
> > group is updated lock less to reflect this. Another CPU might also
> > set it true while becoming active. In worst case the migrator
> > observes it too late and expires remotely timer belonging to this
> > group. The lock is held while going idle (and setting ignore to
> > false) so the state is not lost.
> >
>
> This is what I wanted to explain:
>
> /*
> * The group is active (again). The group event might be still queued
> * into the parent group's timerqueue but can now be handled by the the
s/the$@@
> * migrator of this group. Therefore the ignore flag for the group event
> * is updated to reflect this.
> *
> * The update of the ignore flag in the active path is done lock
lockless
> * less. In worst case the migrator of the parent group observes the
> * change too late and expires remotely timer belonging to this
a timer?
> * group. The lock is held while updating the ignore flag in idle
> * path. So this state change will not be lost.
> */
>
> >> + group->groupevt.ignore = true;
> >> +
> >> + return walk_done;
> >> +}
…
> >> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
> >> + now < tmc->cpuevt.nextevt.expires) {
> >> + raw_spin_unlock_irq(&tmc->lock);
> >> + return next;
> >
> > Looking at the last condition where the timerqueue has been forwarded by
> > a jiffy+, shouldn't we return _that_ values next instead of KTIME_MAX?
>
> No. Because the event is already queued into the hierarchy and the
> migrator takes care. If the hierarchy is completely idle, the CPU which
> updated the event takes care. I'll add this to the comment above.
So another CPU took care of it and we set tmc->wakeup to KTIME_MAX…
One confusing part is that this return value (if not aborted early but
after completing this function) is used to set tmc->wakeup based on the
next pending timer for the CPU that was expired remotely.
We could have expired four CPUs and the next timer of the last CPU may
not be the earliest timer for the first CPU in the group.
And this is fine because it can be stale and is only valid if the CPU goes
idle?
> >> + /*
> >> + * When the CPU is idle, check whether the recalculation of @tmc->wakeup
> >> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is
> >> + * about to go offline, was the last active CPU in the whole timer
> >> + * migration hierarchy and now delegates handling of the hierarchy to
> >> + * this CPU.
> >
> > I'm failing here…
>
> If the CPU is idle, check whether the recalculation of @tmc->wakeup
> is required. @tmc->wakeup_recalc is set, when the last active CPU
> went offline. The last active CPU delegated the handling of the timer
> migration hierarchy to another (this) CPU by updating this flag and
> sending a reschedule.
>
> Better?
So the last CPU going offline had to be the migrator because otherwise
it wouldn't matter?
…
> >> +
> >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> >
> > This one appears wrong, too. The curstate is not getting updated during
> > retry.
>
> See the answer above.
Yes, and I think the do { } while should work here, too.
> >> + * data->nextexp was set by tmigr_update_events() and contains the
> >> + * expiry of the first global event which needs to be handled
> >> + */
> >> + if (data->nextexp != KTIME_MAX) {
> >> + WARN_ON_ONCE(group->parent);
> >> + /*
> >> + * Top level path - If this CPU is about going offline, wake
> >> + * up some random other CPU so it will take over the
> >> + * migrator duty and program its timer properly. Ideally
> >> + * wake the CPU with the closest expiry time, but that's
> >> + * overkill to figure out.
> >> + *
> >> + * Set wakeup_recalc of remote CPU, to make sure the complete
> >> + * idle hierarchy with enqueued timers is reevaluated.
> >> + */
> >> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
> >> + struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
> >> + unsigned int cpu = smp_processor_id();
> >> + struct tmigr_cpu *tmc_resched;
> >> +
> >> + cpu = cpumask_any_but(cpu_online_mask, cpu);
> >> + tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
> >> +
> >> + raw_spin_unlock(&tmc->lock);
> >> +
> >> + raw_spin_lock(&tmc_resched->lock);
> >> + tmc_resched->wakeup_recalc = true;
> >> + raw_spin_unlock(&tmc_resched->lock);
> >> +
> >> + raw_spin_lock(&tmc->lock);
> >> + smp_send_reschedule(cpu);
> >
> > This whole thing confuses me.
> > If the CPU goes offline, it needs to get removed from the migration
> > hierarchy and this is it. Everything else is handled by the migrator. If
> > the migrator is going offline then it needs to wake a random CPU and make
> > sure it takes the migrator role. I am confused by the whole ::wakeup and
> > ::wakeup_recalc thingy.
> >
>
> wakeup_recalc is required to indicate, that the CPU was chosen as the
> new migrator CPU when the last active CPU in timer migration hierarchy
> went offline.
Aha! I suspected this. So this is more like need_new_migrator.
…
> Hopefully this is now clear?
yes.
> Thanks,
>
> Anna-Maria
Sebastian
On 2023-12-12 15:52:19 [+0100], Anna-Maria Behnsen wrote:
> Sebastian Siewior <[email protected]> writes:
>
> >> +/* Per group capacity. Must be a power of 2! */
> >> +#define TMIGR_CHILDREN_PER_GROUP 8
> >
> > BUILD_BUG_ON_NOT_POWER_OF_2(TMIGR_CHILDREN_PER_GROUP)
> >
> > Maybe in the .c file.
> >
>
> in tmigr_init() ?
Yeah why not. It is used there for the init of the structs.
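As a rough userspace illustration of what such a compile-time check buys,
assuming nothing about how the final patch wires it up: IS_POWER_OF_2() and
is_power_of_2_runtime() below are made-up helpers, and _Static_assert stands in
for the kernel's BUILD_BUG_ON_NOT_POWER_OF_2().

```c
/* Per group capacity. Must be a power of 2! */
#define TMIGR_CHILDREN_PER_GROUP 8

/* Hypothetical helper: true for 1, 2, 4, 8, ... and false for 0. */
#define IS_POWER_OF_2(n) ((n) != 0 && (((n) & ((n) - 1)) == 0))

/* Fails the build instead of relying on the comment above being read. */
_Static_assert(IS_POWER_OF_2(TMIGR_CHILDREN_PER_GROUP),
	       "TMIGR_CHILDREN_PER_GROUP must be a power of 2");

/* Runtime variant of the same test, only here to make the macro testable. */
static int is_power_of_2_runtime(unsigned long n)
{
	return IS_POWER_OF_2(n);
}
```

Changing the define to 6 would stop the compile at the _Static_assert line.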
> >> +/**
> >> + * struct tmigr_group - timer migration hierarchy group
> >> + * @lock: Lock protecting the event information and group hierarchy
> >> + * information during setup
> >> + * @migr_state: State of the group (see union tmigr_state)
> >
> > So the lock does not protect migr_state?
>
> Right - this is not required due to the atomic cmpxchg and sequence
> counter.
>
> > Mind moving it a little down the road? Not only would it be more
> > obvious what is protected by the lock but it would also move
> > migr_state in another/ later cache line.
> >
>
> Where do you want me to move it? Switch places of lock and migr_state?
> When I put it to another place, I would generate holes. A general
> question: Is it required to have a look at the struct with pahole also
> with LOCKDEP enabled? If yes, lock should stay at the first position.
Maybe something like:
| struct tmigr_group {
| raw_spinlock_t lock; /* 0 4 */
|
| /* XXX 4 bytes hole, try to pack */
|
| struct tmigr_group * parent; /* 8 8 */
| struct tmigr_event groupevt __attribute__((__aligned__(8))); /* 16 40 */
|
| /* XXX last struct has 3 bytes of padding */
|
| u64 next_expiry; /* 56 8 */
| /* --- cacheline 1 boundary (64 bytes) --- */
| struct timerqueue_head events; /* 64 16 */
| atomic_t migr_state; /* 80 4 */
| unsigned int level; /* 84 4 */
| int numa_node; /* 88 4 */
| unsigned int num_children; /* 92 4 */
| u8 childmask; /* 96 1 */
|
| /* XXX 7 bytes hole, try to pack */
|
| struct list_head list; /* 104 16 */
Starting with lock isn't bad as you see everything from here is
protected by lock. If it makes sense you could start with list so that
the container_of() becomes a NOP.
I wouldn't make lockdep a thing and assume it is off. Also, I would
assume the architecture is 64bit.
However with lockdep enabled it becomes:
| struct tmigr_group {
| raw_spinlock_t lock; /* 0 64 */
| /* --- cacheline 1 boundary (64 bytes) --- */
| struct tmigr_group * parent; /* 64 8 */
| struct tmigr_event groupevt __attribute__((__aligned__(8))); /* 72 40 */
|
| /* XXX last struct has 3 bytes of padding */
|
| u64 next_expiry; /* 112 8 */
| struct timerqueue_head events; /* 120 16 */
| /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
| atomic_t migr_state; /* 136 4 */
| unsigned int level; /* 140 4 */
| int numa_node; /* 144 4 */
| unsigned int num_children; /* 148 4 */
| u8 childmask; /* 152 1 */
|
| /* XXX 7 bytes hole, try to pack */
|
| struct list_head list; /* 160 16 */
| } __attribute__((__aligned__(8)));
so it didn't change much.
I shuffled it a bit and everything after migr_state is read only.
I don't think looking at pahole is required but in your case it makes
sense to put the locked section into a separate cache line vs migr_state
variable. It doesn't cost much.
You can decide if it is worth moving childmask after the lock so you
avoid the 7 byte hole at the end. I wouldn't do it just to satisfy pahole
here. If it makes sense and doesn't hurt/confuse, why not.
You would consider the pahole output more on a structure like dentry
which is used a _lot_. So saving 4 bytes would mean saving a megabyte or
ten in the end.
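The cache line placement can also be checked without pahole. The sketch below
uses stand-in types (struct event_stub, struct group_stub) sized to mirror the
non-lockdep, 64-bit layout quoted above; they are assumptions for illustration
and not the real kernel definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64	/* assumed cache line size */

/* Stand-in for struct tmigr_event: 37 bytes of data, padded to 40. */
struct event_stub {
	uint64_t	node[3];
	int64_t		expires;
	unsigned int	cpu;
	bool		ignore;
};

/* Stand-in for the shuffled struct tmigr_group (no lockdep, 64-bit). */
struct group_stub {
	uint32_t		lock;		/* stand-in for raw_spinlock_t */
	struct group_stub	*parent;
	struct event_stub	groupevt;
	uint64_t		next_expiry;
	void			*events[2];	/* stand-in for timerqueue_head */
	uint32_t		migr_state;	/* stand-in for atomic_t */
	unsigned int		level;
	int			numa_node;
	unsigned int		num_children;
	uint8_t			childmask;
	void			*list[2];	/* stand-in for list_head */
};

/* Index of the cache line in which the lock starts. */
static size_t cacheline_of_lock(void)
{
	return offsetof(struct group_stub, lock) / CACHELINE;
}

/* Index of the cache line in which migr_state starts. */
static size_t cacheline_of_migr_state(void)
{
	return offsetof(struct group_stub, migr_state) / CACHELINE;
}
```

On an LP64 target migr_state should land at offset 80, i.e. in the second cache
line, while the lock stays in the first one.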
> Thanks,
>
> Anna-Maria
>
Sebastian
On Fri, Dec 01, 2023 at 10:26:27AM +0100, Anna-Maria Behnsen wrote:
> When debugging timer code the timer tracepoints are very important. There
> is no tracepoint when the is_idle flag of the timer base changes. Instead
> of always adding manually trace_printk(), add tracepoints which can be
> easily enabled whenever required.
>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Just a detail below, again this can be posted as a delta patch or
edited before applying:
> ---
> v9: New in v9
> ---
> include/trace/events/timer.h | 20 ++++++++++++++++++++
> kernel/time/timer.c | 2 ++
> 2 files changed, 22 insertions(+)
>
> diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
> index 99ada928d445..1ef58a04fc57 100644
> --- a/include/trace/events/timer.h
> +++ b/include/trace/events/timer.h
> @@ -142,6 +142,26 @@ DEFINE_EVENT(timer_class, timer_cancel,
> TP_ARGS(timer)
> );
>
> +TRACE_EVENT(timer_base_idle,
> +
> + TP_PROTO(bool is_idle, unsigned int cpu),
> +
> + TP_ARGS(is_idle, cpu),
> +
> + TP_STRUCT__entry(
> + __field( bool, is_idle )
> + __field( unsigned int, cpu )
> + ),
> +
> + TP_fast_assign(
> + __entry->is_idle = is_idle;
> + __entry->cpu = cpu;
> + ),
> +
> + TP_printk("is_idle=%d cpu=%d",
> + __entry->is_idle, __entry->cpu)
> +);
> +
> #define decode_clockid(type) \
> __print_symbolic(type, \
> { CLOCK_REALTIME, "CLOCK_REALTIME" }, \
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index a81d793a43d0..46a9b96a3976 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1964,6 +1964,7 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
> if ((expires - basem) > TICK_NSEC)
> base->is_idle = true;
> }
> + trace_timer_base_idle(base->is_idle, base->cpu);
This will trigger a trace every time we loop into idle or remotely
compute the next timer. Can we move that to when base->is_idle is set
from false to true only?
> raw_spin_unlock(&base->lock);
>
> return cmp_next_hrtimer_event(basem, expires);
> @@ -1985,6 +1986,7 @@ void timer_clear_idle(void)
> * the lock in the exit from idle path.
> */
> base->is_idle = false;
> + trace_timer_base_idle(0, smp_processor_id());
Same here. If base->is_idle was already false, you could spare a noisy
trace.
Thanks.
> }
> #endif
>
> --
> 2.39.2
>
On Fri, Dec 01, 2023 at 10:26:33AM +0100, Anna-Maria Behnsen wrote:
> From: Thomas Gleixner <[email protected]>
>
> To improve readability of the code, split base->idle calculation and
> expires calculation into separate parts. While at it, update the comment
> about timer base idle marking.
>
> Thereby the following subtle change happens if the next event is just one
> jiffy ahead and the tick was already stopped: Originally base->is_idle
> remains true in this situation. Now base->is_idle turns to false. This may
> spare an IPI if a timer is enqueued remotely to an idle CPU that is going
> to tick on the next jiffy.
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Signed-off-by: Anna-Maria Behnsen <[email protected]>
> Reviewed-by: Frederic Weisbecker <[email protected]>
> ---
> v9: Re-ordering to not hurt the eyes and update comment
> v4: Change condition to force 0 delta and update commit message (Frederic)
> ---
> kernel/time/timer.c | 31 ++++++++++++++++---------------
> 1 file changed, 16 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index fee42dda8237..0826018d9873 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1943,22 +1943,23 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
> */
> __forward_timer_base(base, basej);
>
> - if (time_before_eq(nextevt, basej)) {
> - expires = basem;
> - base->is_idle = false;
> - } else {
> - if (base->timers_pending)
> - expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
> - /*
> - * If we expect to sleep more than a tick, mark the base idle.
> - * Also the tick is stopped so any added timer must forward
> - * the base clk itself to keep granularity small. This idle
> - * logic is only maintained for the BASE_STD base, deferrable
> - * timers may still see large granularity skew (by design).
> - */
> - if ((expires - basem) > TICK_NSEC)
> - base->is_idle = true;
> + if (base->timers_pending) {
> + /* If we missed a tick already, force 0 delta */
> + if (time_before(nextevt, basej))
> + nextevt = basej;
> + expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
> }
> +
> + /*
> + * Base is idle if the next event is more than a tick away.
> + *
> + * If the base is marked idle then any timer add operation must forward
> + * the base clk itself to keep granularity small. This idle logic is
> + * only maintained for the BASE_STD base, deferrable timers may still
> + * see large granularity skew (by design).
> + */
> + base->is_idle = time_after(nextevt, basej + 1);
> +
Much better, thanks! :-)
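To make the subtle change from the changelog concrete, here is a small userspace
model of the reworked calculation. It is a sketch only: struct idle_result and
calc_idle() are names invented here, and the comparison macros are re-declared
locally in the style of the kernel's jiffies helpers.

```c
#include <stdbool.h>

/* Wraparound-safe jiffies comparisons, modeled on the kernel macros. */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* Hypothetical result holder for this sketch. */
struct idle_result {
	unsigned long	nextevt;
	bool		is_idle;
};

static struct idle_result calc_idle(unsigned long nextevt, unsigned long basej,
				    bool timers_pending)
{
	struct idle_result r;

	/* If we missed a tick already, force 0 delta */
	if (timers_pending && time_before(nextevt, basej))
		nextevt = basej;

	r.nextevt = nextevt;
	/* Base is idle if the next event is more than a tick away. */
	r.is_idle = time_after(nextevt, basej + 1);
	return r;
}
```

With nextevt == basej + 1 the base is no longer reported idle, which is the case
the commit message describes as sparing an IPI.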
On Fri, Dec 01, 2023 at 10:26:34AM +0100, Anna-Maria Behnsen wrote:
> When no timer is queued into an empty timer base, the next_expiry will not
> be updated. It was originally calculated as
>
> base->clk + NEXT_TIMER_MAX_DELTA
>
> When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
> next_expiry value of the empty base suggests that there is a timer pending
> soon. This might be more a kind of a theoretical problem, but the fix
> doesn't hurt.
This solves a real issue. I suggest removing the last sentence and adding instead:
If the CPU sleeps in idle for a bit more than NEXT_TIMER_MAX_DELTA
(~12 days with HZ=1000) and then an interrupt fires, upon going back to idle
get_next_timer_interrupt() will still return KTIME_MAX but incorrectly set
is_idle to false. Therefore the CPU will keep the tick stopped and go back to
sleep, but further remote enqueues of timers to this CPU will fail to send an IPI.
As a result the timer will remain ignored.
Reviewed-by: Frederic Weisbecker <[email protected]>
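The "~12 days" figure and the stale-value effect can be reproduced with a few
lines of userspace C. Assumptions made here: NEXT_TIMER_MAX_DELTA is taken as
(1UL << 30) - 1, HZ is fixed to 1000 as in the text above, and
max_delta_days() and stale_base_marked_idle() are helpers invented for this
sketch.

```c
#include <stdbool.h>

#define HZ			1000UL			/* assumption from the text above */
#define NEXT_TIMER_MAX_DELTA	((1UL << 30) - 1)	/* assumed value */

/* Wraparound-safe jiffies comparison, modeled on the kernel macro. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

/* How long NEXT_TIMER_MAX_DELTA jiffies last, in days. */
static double max_delta_days(void)
{
	return (double)NEXT_TIMER_MAX_DELTA / HZ / 86400.0;
}

/*
 * An empty base keeps next_expiry at @clk + NEXT_TIMER_MAX_DELTA. Evaluate
 * the idle condition against that never-updated value at time @basej.
 */
static bool stale_base_marked_idle(unsigned long clk, unsigned long basej)
{
	unsigned long next_expiry = clk + NEXT_TIMER_MAX_DELTA;

	return time_after(next_expiry, basej + 1);
}
```

2^30 jiffies at 1000 per second is about 1073742 seconds, so roughly 12.4 days.
Once basej has moved past clk + NEXT_TIMER_MAX_DELTA the stale value lies in the
past and the base is no longer considered idle although it is still empty.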
The following commit has been merged into the timers/core branch of tip:
Commit-ID: 7a39a5080ef0e3cf233d92165f6a778f08a08244
Gitweb: https://git.kernel.org/tip/7a39a5080ef0e3cf233d92165f6a778f08a08244
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:32 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:38 +01:00
timers: Use already existing function for forwarding timer base
There is an already existing function for forwarding the timer
base. Forwarding the timer base is implemented directly in
get_next_timer_interrupt() as well.
Remove the code duplication and invoke __forward_timer_base() instead.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/timer.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 5b02e16..1a73d39 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1939,15 +1939,9 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
/*
* We have a fresh next event. Check whether we can forward the
- * base. We can only do that when @basej is past base->clk
- * otherwise we might rewind base->clk.
+ * base.
*/
- if (time_after(basej, base->clk)) {
- if (time_after(nextevt, basej))
- base->clk = basej;
- else if (time_after(nextevt, base->clk))
- base->clk = nextevt;
- }
+ __forward_timer_base(base, basej);
if (time_before_eq(nextevt, basej)) {
expires = basem;
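For reference, the forwarding rule that the removed open-coded block expressed
can be modeled in userspace as follows. This is a sketch derived from the
removed lines only, not a copy of __forward_timer_base(); struct base_stub and
forward_base() are names made up here.

```c
/* Wraparound-safe jiffies comparison, modeled on the kernel macro. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

/* Hypothetical stand-in for the two timer_base fields involved. */
struct base_stub {
	unsigned long clk;
	unsigned long next_expiry;
};

static void forward_base(struct base_stub *base, unsigned long basej)
{
	/* Only forward when @basej is past base->clk, never rewind. */
	if (!time_after(basej, base->clk))
		return;

	if (time_after(base->next_expiry, basej))
		base->clk = basej;		/* next event is still ahead */
	else if (time_after(base->next_expiry, base->clk))
		base->clk = base->next_expiry;	/* stop at the pending event */
}
```

The clock either jumps to basej or, if an event is due before that, only up to
that event, so a pending timer is never skipped.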
The following commit has been merged into the timers/core branch of tip:
Commit-ID: 8a2c9c7e7848d7f63d38b698209148b5bb4ba7f3
Gitweb: https://git.kernel.org/tip/8a2c9c7e7848d7f63d38b698209148b5bb4ba7f3
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:30 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:38 +01:00
timers: Clarify check in forward_timer_base()
The current check whether a forward of the timer base is required can be
simplified by using an already existing comparison function which is easier
to read. The related comment is outdated and was not updated when the check
changed in commit 36cd28a4cdd0 ("timers: Lower base clock forwarding
threshold").
Use time_before_eq() for the check and replace the comment by copying the
comment from the same check inside get_next_timer_interrupt(). Move the
precious information of the outdated comment to the proper place in
__run_timers().
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/timer.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 490ff8e..f75f932 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -944,11 +944,10 @@ static inline void forward_timer_base(struct timer_base *base)
unsigned long jnow = READ_ONCE(jiffies);
/*
- * No need to forward if we are close enough below jiffies.
- * Also while executing timers, base->clk is 1 offset ahead
- * of jiffies to avoid endless requeuing to current jiffies.
+ * Check whether we can forward the base. We can only do that when
+ * @basej is past base->clk otherwise we might rewind base->clk.
*/
- if ((long)(jnow - base->clk) < 1)
+ if (time_before_eq(jnow, base->clk))
return;
/*
@@ -2021,6 +2020,10 @@ static inline void __run_timers(struct timer_base *base)
*/
WARN_ON_ONCE(!levels && !base->next_expiry_recalc
&& base->timers_pending);
+ /*
+ * While executing timers, base->clk is set 1 offset ahead of
+ * jiffies to avoid endless requeuing to current jiffies.
+ */
base->clk++;
next_expiry_recalc(base);
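The claim that the old cast-based check and time_before_eq() agree can be
probed in userspace. The helper names old_check(), new_check() and
checks_agree() are invented for this sketch, and the macros are re-declared
locally in the style of the kernel's jiffies helpers.

```c
#include <stdbool.h>

/* Wraparound-safe jiffies comparisons, modeled on the kernel macros. */
#define time_after_eq(a, b)	((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)	time_after_eq(b, a)

/* The check as it was written before the change. */
static bool old_check(unsigned long jnow, unsigned long clk)
{
	return (long)(jnow - clk) < 1;
}

/* The check as written after the change. */
static bool new_check(unsigned long jnow, unsigned long clk)
{
	return time_before_eq(jnow, clk);
}

/*
 * Both forms test "jnow is not past clk". They can only differ when the
 * two values are exactly half the counter range apart.
 */
static bool checks_agree(unsigned long jnow, unsigned long clk)
{
	return old_check(jnow, clk) == new_check(jnow, clk);
}
```

Both variants return early for jnow <= clk and fall through for jnow > clk,
including across a counter wraparound.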
The following commit has been merged into the timers/core branch of tip:
Commit-ID: b573c73101d8786446535b2ab28cbc8907bda9a9
Gitweb: https://git.kernel.org/tip/b573c73101d8786446535b2ab28cbc8907bda9a9
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:27 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:38 +01:00
tracing/timers: Add tracepoint for tracking timer base is_idle flag
When debugging timer code the timer tracepoints are very important. There
is no tracepoint when the is_idle flag of the timer base changes. Instead
of always adding manually trace_printk(), add tracepoints which can be
easily enabled whenever required.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/trace/events/timer.h | 20 ++++++++++++++++++++
kernel/time/timer.c | 14 +++++++++++---
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index 99ada92..1ef58a0 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -142,6 +142,26 @@ DEFINE_EVENT(timer_class, timer_cancel,
TP_ARGS(timer)
);
+TRACE_EVENT(timer_base_idle,
+
+ TP_PROTO(bool is_idle, unsigned int cpu),
+
+ TP_ARGS(is_idle, cpu),
+
+ TP_STRUCT__entry(
+ __field( bool, is_idle )
+ __field( unsigned int, cpu )
+ ),
+
+ TP_fast_assign(
+ __entry->is_idle = is_idle;
+ __entry->cpu = cpu;
+ ),
+
+ TP_printk("is_idle=%d cpu=%d",
+ __entry->is_idle, __entry->cpu)
+);
+
#define decode_clockid(type) \
__print_symbolic(type, \
{ CLOCK_REALTIME, "CLOCK_REALTIME" }, \
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index a81d793..ed8d606 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1950,7 +1950,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
if (time_before_eq(nextevt, basej)) {
expires = basem;
- base->is_idle = false;
+ if (base->is_idle) {
+ base->is_idle = false;
+ trace_timer_base_idle(false, base->cpu);
+ }
} else {
if (base->timers_pending)
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
@@ -1961,8 +1964,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
* logic is only maintained for the BASE_STD base, deferrable
* timers may still see large granularity skew (by design).
*/
- if ((expires - basem) > TICK_NSEC)
+ if ((expires - basem) > TICK_NSEC && !base->is_idle) {
base->is_idle = true;
+ trace_timer_base_idle(true, base->cpu);
+ }
}
raw_spin_unlock(&base->lock);
@@ -1984,7 +1989,10 @@ void timer_clear_idle(void)
* sending the IPI a few instructions smaller for the cost of taking
* the lock in the exit from idle path.
*/
- base->is_idle = false;
+ if (base->is_idle) {
+ base->is_idle = false;
+ trace_timer_base_idle(false, smp_processor_id());
+ }
}
#endif
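The pattern applied in all three hunks is "emit the trace only when the flag
actually changes". A minimal userspace model, with struct base_stub,
trace_stub() and set_idle() as made-up stand-ins and a counter in place of the
real tracepoint:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the timer_base fields involved. */
struct base_stub {
	bool		is_idle;
	unsigned int	cpu;
	unsigned int	trace_count;	/* counts emitted trace events */
};

/* Stand-in for trace_timer_base_idle(); only counts the calls. */
static void trace_stub(struct base_stub *base, bool is_idle)
{
	(void)is_idle;
	base->trace_count++;
}

/* Update the flag and trace only on a real state transition. */
static void set_idle(struct base_stub *base, bool idle)
{
	if (base->is_idle == idle)
		return;

	base->is_idle = idle;
	trace_stub(base, idle);
}
```

Repeated calls with the same value stay silent, which is what keeps the trace
from becoming noisy when the idle loop is entered again and again.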
The following commit has been merged into the timers/core branch of tip:
Commit-ID: da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
Gitweb: https://git.kernel.org/tip/da65f29dada7f7cbbf0d6375b88a0316f5f7d6f5
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:34 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:39 +01:00
timers: Fix nextevt calculation when no timers are pending
When no timer is queued into an empty timer base, the next_expiry will not
be updated. It was originally calculated as
base->clk + NEXT_TIMER_MAX_DELTA
When the timer base stays empty long enough (> NEXT_TIMER_MAX_DELTA), the
next_expiry value of the empty base suggests that there is a timer pending
soon. This might be more a kind of a theoretical problem, but the fix
doesn't hurt.
Use only base->next_expiry value as nextevt when timers are
pending. Otherwise nextevt will be jiffies + NEXT_TIMER_MAX_DELTA. As all
information is in place, update base->next_expiry value of the empty timer
base as well.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/timer.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index cf51655..352b161 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1922,8 +1922,8 @@ static u64 cmp_next_hrtimer_event(u64 basem, u64 expires)
u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
+ unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;
u64 expires = KTIME_MAX;
- unsigned long nextevt;
bool was_idle;
/*
@@ -1936,7 +1936,6 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
next_expiry_recalc(base);
- nextevt = base->next_expiry;
/*
* We have a fresh next event. Check whether we can forward the
@@ -1945,10 +1944,20 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
__forward_timer_base(base, basej);
if (base->timers_pending) {
+ nextevt = base->next_expiry;
+
/* If we missed a tick already, force 0 delta */
if (time_before(nextevt, basej))
nextevt = basej;
expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
+ } else {
+ /*
+ * Move next_expiry for the empty base into the future to
+ * prevent a unnecessary raise of the timer softirq when the
+ * next_expiry value will be reached even if there is no timer
+ * pending.
+ */
+ base->next_expiry = nextevt;
}
/*
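The effect of the hunks above can be modeled in userspace as follows. This is a
sketch under assumptions: struct base_stub and pick_nextevt() are invented
names, NEXT_TIMER_MAX_DELTA is taken as (1UL << 30) - 1, and locking as well as
the expires calculation are left out.

```c
#include <stdbool.h>

#define NEXT_TIMER_MAX_DELTA	((1UL << 30) - 1)	/* assumed value */

/* Wraparound-safe jiffies comparisons, modeled on the kernel macros. */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/* Hypothetical stand-in for the timer_base fields involved. */
struct base_stub {
	unsigned long	next_expiry;
	bool		timers_pending;
};

static unsigned long pick_nextevt(struct base_stub *base, unsigned long basej)
{
	/* Default for an empty base: as far in the future as possible. */
	unsigned long nextevt = basej + NEXT_TIMER_MAX_DELTA;

	if (base->timers_pending) {
		nextevt = base->next_expiry;

		/* If we missed a tick already, force 0 delta */
		if (time_before(nextevt, basej))
			nextevt = basej;
	} else {
		/* Keep the empty base's next_expiry from going stale. */
		base->next_expiry = nextevt;
	}

	return nextevt;
}
```

A stale next_expiry of an empty base is now ignored and refreshed, while a base
with pending timers keeps using its recorded expiry.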
The following commit has been merged into the timers/core branch of tip:
Commit-ID: bb8caad5083f8fbba70faf41f1d3bab7cf09da6d
Gitweb: https://git.kernel.org/tip/bb8caad5083f8fbba70faf41f1d3bab7cf09da6d
Author: Thomas Gleixner <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:33 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:39 +01:00
timers: Rework idle logic
To improve readability of the code, split base->idle calculation and
expires calculation into separate parts. While at it, update the comment
about timer base idle marking.
Thereby the following subtle change happens if the next event is just one
jiffy ahead and the tick was already stopped: Originally base->is_idle
remains true in this situation. Now base->is_idle turns to false. This may
spare an IPI if a timer is enqueued remotely to an idle CPU that is going
to tick on the next jiffy.
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/timer.c | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1a73d39..cf51655 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1924,6 +1924,7 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
u64 expires = KTIME_MAX;
unsigned long nextevt;
+ bool was_idle;
/*
* Pretend that there is no timer pending if the cpu is offline.
@@ -1943,27 +1944,26 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
*/
__forward_timer_base(base, basej);
- if (time_before_eq(nextevt, basej)) {
- expires = basem;
- if (base->is_idle) {
- base->is_idle = false;
- trace_timer_base_idle(false, base->cpu);
- }
- } else {
- if (base->timers_pending)
- expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
- /*
- * If we expect to sleep more than a tick, mark the base idle.
- * Also the tick is stopped so any added timer must forward
- * the base clk itself to keep granularity small. This idle
- * logic is only maintained for the BASE_STD base, deferrable
- * timers may still see large granularity skew (by design).
- */
- if ((expires - basem) > TICK_NSEC && !base->is_idle) {
- base->is_idle = true;
- trace_timer_base_idle(true, base->cpu);
- }
+ if (base->timers_pending) {
+ /* If we missed a tick already, force 0 delta */
+ if (time_before(nextevt, basej))
+ nextevt = basej;
+ expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
}
+
+ /*
+ * Base is idle if the next event is more than a tick away.
+ *
+ * If the base is marked idle then any timer add operation must forward
+ * the base clk itself to keep granularity small. This idle logic is
+ * only maintained for the BASE_STD base, deferrable timers may still
+ * see large granularity skew (by design).
+ */
+ was_idle = base->is_idle;
+ base->is_idle = time_after(nextevt, basej + 1);
+ if (was_idle != base->is_idle)
+ trace_timer_base_idle(base->is_idle, base->cpu);
+
raw_spin_unlock(&base->lock);
return cmp_next_hrtimer_event(basem, expires);
The following commit has been merged into the timers/core branch of tip:
Commit-ID: b5e6f59888c7bde3c05f61b3ce06b78a86713fc0
Gitweb: https://git.kernel.org/tip/b5e6f59888c7bde3c05f61b3ce06b78a86713fc0
Author: Anna-Maria Behnsen <[email protected]>
AuthorDate: Fri, 01 Dec 2023 10:26:29 +01:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Wed, 20 Dec 2023 16:49:38 +01:00
timers: Move store of next event into __next_timer_interrupt()
Both call sites of __next_timer_interrupt() store the return value directly
in base->next_expiry. Move the store into __next_timer_interrupt() and to
make its purpose more clear, rename the function to next_expiry_recalc().
No functional change.
Signed-off-by: Anna-Maria Behnsen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Reviewed-by: Frederic Weisbecker <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
kernel/time/timer.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 9188205..490ff8e 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1800,8 +1800,10 @@ static int next_pending_bucket(struct timer_base *base, unsigned offset,
/*
* Search the first expiring timer in the various clock levels. Caller must
* hold base->lock.
+ *
+ * Store next expiry time in base->next_expiry.
*/
-static unsigned long __next_timer_interrupt(struct timer_base *base)
+static void next_expiry_recalc(struct timer_base *base)
{
unsigned long clk, next, adj;
unsigned lvl, offset = 0;
@@ -1867,10 +1869,9 @@ static unsigned long __next_timer_interrupt(struct timer_base *base)
clk += adj;
}
+ base->next_expiry = next;
base->next_expiry_recalc = false;
base->timers_pending = !(next == base->clk + NEXT_TIMER_MAX_DELTA);
-
- return next;
}
#ifdef CONFIG_NO_HZ_COMMON
@@ -1930,7 +1931,7 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
raw_spin_lock(&base->lock);
if (base->next_expiry_recalc)
- base->next_expiry = __next_timer_interrupt(base);
+ next_expiry_recalc(base);
nextevt = base->next_expiry;
/*
@@ -2021,7 +2022,7 @@ static inline void __run_timers(struct timer_base *base)
WARN_ON_ONCE(!levels && !base->next_expiry_recalc
&& base->timers_pending);
base->clk++;
- base->next_expiry = __next_timer_interrupt(base);
+ next_expiry_recalc(base);
while (levels--)
expire_timers(base, heads + levels);