When nohz or nohz_full is configured, the concurrency calls of
tick_do_update_jiffies64 increases, and the conflict between
jiffies_lock and jiffies_seq increases, especially in multi-core
scenarios.
However, it is unnecessary to update the jiffies_seq lock multiple
times in a tick period, so the critical region of the jiffies_seq
can be reduced to reduce latency overheads. By the way,
last_jiffies_update is protected by jiffies_lock, so reducing the
jiffies_seq critical area is safe.
Signed-off-by: Yunfeng Ye <[email protected]>
---
kernel/time/tick-sched.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f0199a4ba1ad..41fb1400439b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -66,11 +66,11 @@ static void tick_do_update_jiffies64(ktime_t now)
/* Reevaluate with jiffies_lock held */
raw_spin_lock(&jiffies_lock);
- write_seqcount_begin(&jiffies_seq);
delta = ktime_sub(now, last_jiffies_update);
if (delta >= tick_period) {
+ write_seqcount_begin(&jiffies_seq);
delta = ktime_sub(delta, tick_period);
/* Pairs with the lockless read in this function. */
WRITE_ONCE(last_jiffies_update,
@@ -91,12 +91,11 @@ static void tick_do_update_jiffies64(ktime_t now)
/* Keep the tick_next_period variable up to date */
tick_next_period = ktime_add(last_jiffies_update, tick_period);
- } else {
write_seqcount_end(&jiffies_seq);
+ } else {
raw_spin_unlock(&jiffies_lock);
return;
}
- write_seqcount_end(&jiffies_seq);
raw_spin_unlock(&jiffies_lock);
update_wall_time();
}
--
2.18.4
On Wed, Nov 11 2020 at 17:11, Yunfeng Ye wrote:
> When nohz or nohz_full is configured, the concurrency calls of
> tick_do_update_jiffies64 increases,
Why?
> and the conflict between jiffies_lock and jiffies_seq increases,
> especially in multi-core scenarios.
This does not make sense. The sequence counter is updated when holding
the lock, so there is no conflict between the lock and the sequence
count.
> However, it is unnecessary to update the jiffies_seq lock multiple
> times in a tick period, so the critical region of the jiffies_seq
> can be reduced to reduce latency overheads.
This does not make sense either. Before taking the lock we have:
delta = ktime_sub(now, READ_ONCE(last_jiffies_update));
if (delta < tick_period)
return;
as a lockless quick check.
We also have mechanisms to avoid that a gazillion of CPUs call this. Why
are they not working or are some of the callsites missing them?
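For example, the tick itself only lets the CPU which owns tick_do_timer_cpu
forward jiffies, roughly (simplified from tick_sched_do_timer()):

	if (tick_do_timer_cpu == smp_processor_id())
		tick_do_update_jiffies64(now);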
I'm not against reducing the seqcount write scope per se, but it needs a
proper and correct explanation.
> By the way, last_jiffies_update is protected by jiffies_lock, so
> reducing the jiffies_seq critical area is safe.
This is misleading. The write to last_jiffies_update is serialized by
the jiffies lock, but the write has also to be inside the sequence write
held section because tick_nohz_next_event() does:
/* Read jiffies and the time when jiffies were updated last */
do {
seq = read_seqcount_begin(&jiffies_seq);
basemono = last_jiffies_update;
basejiff = jiffies;
} while (read_seqcount_retry(&jiffies_seq, seq));
So there is no 'By the way'.
Thanks,
tglx
On Mon, Nov 16 2020 at 14:07, Yunfeng Ye wrote:
> On 2020/11/16 3:43, Thomas Gleixner wrote:
>>> and the conflict between jiffies_lock and jiffies_seq increases,
>>> especially in multi-core scenarios.
>>
>> This does not make sense. The sequence counter is updated when holding
>> the lock, so there is no conflict between the lock and the sequence
>> count.
>>
> Yes, there is no conflict between the lock and the sequence count, but
> when tick_do_update_jiffies64() is called one by one, the sequence count
> will be updated, it will affect the latency of tick_nohz_next_event(),
> because the priority of read seqcount is less than writer.
It's clear to me because I know how that code works, but for someone who
is not familiar with it your description of the problem is confusing at
best.
> During a tick period,
'During a tick period' is misleading. The tick period is the reciprocal
value of the tick frequency.
What you want to explain is that if jiffies require an update, i.e.
now > last_update + period
then multiple CPUs might contend on it until last_update is forwarded
and the quick check is preventing it again.
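Concretely: every CPU whose lockless check

	if (ktime_sub(now, READ_ONCE(last_jiffies_update)) < tick_period)
		return;

ran before the winner forwarded last_jiffies_update falls through to
raw_spin_lock(&jiffies_lock). The losers re-evaluate delta under the lock and
bail out, but today they do that inside write_seqcount_begin()/end(), so each
of them still bumps the sequence count and disturbs concurrent readers.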
> the tick_do_update_jiffies64() is called concurrently, and the
> time is up to 30+us. So the lockless quick check in tick_do_update_jiffies64()
> cannot intercept all of the concurrency.
It cannot catch all of it, right.
There are quite some other inefficiencies in that code and the seqcount
held time can be reduced way further. Let me stare at it.
Thanks,
tglx
On 2020/11/16 19:29, Thomas Gleixner wrote:
> On Mon, Nov 16 2020 at 14:07, Yunfeng Ye wrote:
>> On 2020/11/16 3:43, Thomas Gleixner wrote:
>>>> and the conflict between jiffies_lock and jiffies_seq increases,
>>>> especially in multi-core scenarios.
>>>
>>> This does not make sense. The sequence counter is updated when holding
>>> the lock, so there is no conflict between the lock and the sequence
>>> count.
>>>
>> Yes, there is no conflict between the lock and the sequence count, but
>> when tick_do_update_jiffies64() is called one by one, the sequence count
>> will be updated, it will affect the latency of tick_nohz_next_event(),
>> because the priority of read seqcount is less than writer.
>
> It's clear to me because I know how that code works, but for someone who
> is not familiar with it your description of the problem is confusing at
> best.
>
>> During a tick period,
>
> 'During a tick period' is misleading. The tick period is the reciprocal
> value of the tick frequency.
>
> What you want to explain is that if jiffies require an update, i.e.
>
> now > last_update + period
>
> then multiple CPUs might contend on it until last_update is forwarded
> and the quick check is preventing it again.
>
Yes, you are right, thanks.
>> the tick_do_update_jiffies64() is called concurrently, and the
>> time is up to 30+us. So the lockless quick check in tick_do_update_jiffies64()
>> cannot intercept all of the concurrency.
>
> It cannot catch all of it, right.
>
> There are quite some other inefficiencies in that code and the seqcount
> held time can be reduced way further. Let me stare at it.
>
I think it is enough for the write seqcount to protect only last_jiffies_update/
jiffies_64/tick_next_period. The modification, which has not been tested,
looks like this:
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f0199a4ba1ad..d5f9930e6bc7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -66,15 +66,13 @@ static void tick_do_update_jiffies64(ktime_t now)
/* Reevaluate with jiffies_lock held */
raw_spin_lock(&jiffies_lock);
- write_seqcount_begin(&jiffies_seq);
delta = ktime_sub(now, last_jiffies_update);
if (delta >= tick_period) {
+ ktime_t tmp_jiffies_update =
+ ktime_add(last_jiffies_update, tick_period);
delta = ktime_sub(delta, tick_period);
- /* Pairs with the lockless read in this function. */
- WRITE_ONCE(last_jiffies_update,
- ktime_add(last_jiffies_update, tick_period));
/* Slow path for long timeouts */
if (unlikely(delta >= tick_period)) {
@@ -82,21 +80,25 @@ static void tick_do_update_jiffies64(ktime_t now)
ticks = ktime_divns(delta, incr);
- /* Pairs with the lockless read in this function. */
- WRITE_ONCE(last_jiffies_update,
- ktime_add_ns(last_jiffies_update,
- incr * ticks));
+ tmp_jiffies_update =
+ ktime_add_ns(tmp_jiffies_update,
+ incr * ticks);
}
- do_timer(++ticks);
+ ticks++;
+
+ write_seqcount_begin(&jiffies_seq);
+ /* Pairs with the lockless read in this function. */
+ WRITE_ONCE(last_jiffies_update, tmp_jiffies_update);
+ jiffies_64 += ticks;
/* Keep the tick_next_period variable up to date */
tick_next_period = ktime_add(last_jiffies_update, tick_period);
- } else {
write_seqcount_end(&jiffies_seq);
+ calc_global_load();
+ } else {
raw_spin_unlock(&jiffies_lock);
return;
}
- write_seqcount_end(&jiffies_seq);
raw_spin_unlock(&jiffies_lock);
update_wall_time();
}
> Thanks,
>
> tglx
> .
>
On Mon, Nov 16 2020 at 21:24, Yunfeng Ye wrote:
> On 2020/11/16 19:29, Thomas Gleixner wrote:
>> There are quite some other inefficiencies in that code and the seqcount
>> held time can be reduced way further. Let me stare at it.
>>
> I think it is enough for the write seqcount to protect only last_jiffies_update/
> jiffies_64/tick_next_period. The modification, which has not been tested,
> looks like this:
Yes, but it can be further simplified. I was doing that before I got
distracted. Uncompiled and untested.
A split up patch series is here: https://tglx.de/~tglx/patches-jiffies.tar
Thanks,
tglx
---
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -193,7 +193,6 @@ extern int try_to_del_timer_sync(struct
#define del_singleshot_timer_sync(t) del_timer_sync(t)
extern void init_timers(void);
-extern void run_local_timers(void);
struct hrtimer;
extern enum hrtimer_restart it_real_fn(struct hrtimer *);
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1693,29 +1693,6 @@ void timer_clear_idle(void)
}
#endif
-/*
- * Called from the timer interrupt handler to charge one tick to the current
- * process. user_tick is 1 if the tick is user time, 0 for system.
- */
-void update_process_times(int user_tick)
-{
- struct task_struct *p = current;
-
- PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0);
-
- /* Note: this timer irq context must be accounted for as well. */
- account_process_tick(p, user_tick);
- run_local_timers();
- rcu_sched_clock_irq(user_tick);
-#ifdef CONFIG_IRQ_WORK
- if (in_irq())
- irq_work_tick();
-#endif
- scheduler_tick();
- if (IS_ENABLED(CONFIG_POSIX_TIMERS))
- run_posix_cpu_timers();
-}
-
/**
* __run_timers - run all expired timers (if any) on this CPU.
* @base: the timer vector to be processed.
@@ -1765,7 +1742,7 @@ static __latent_entropy void run_timer_s
/*
* Called by the local, per-CPU timer interrupt on SMP.
*/
-void run_local_timers(void)
+static void run_local_timers(void)
{
struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
@@ -1783,6 +1760,29 @@ void run_local_timers(void)
}
/*
+ * Called from the timer interrupt handler to charge one tick to the current
+ * process. user_tick is 1 if the tick is user time, 0 for system.
+ */
+void update_process_times(int user_tick)
+{
+ struct task_struct *p = current;
+
+ PRANDOM_ADD_NOISE(jiffies, user_tick, p, 0);
+
+ /* Note: this timer irq context must be accounted for as well. */
+ account_process_tick(p, user_tick);
+ run_local_timers();
+ rcu_sched_clock_irq(user_tick);
+#ifdef CONFIG_IRQ_WORK
+ if (in_irq())
+ irq_work_tick();
+#endif
+ scheduler_tick();
+ if (IS_ENABLED(CONFIG_POSIX_TIMERS))
+ run_posix_cpu_timers();
+}
+
+/*
* Since schedule_timeout()'s timer is defined on the stack, it must store
* the target task on the stack as well.
*/
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -877,6 +877,21 @@ static void tick_broadcast_init_next_eve
}
}
+static inline ktime_t tick_get_next_period(void)
+{
+ ktime_t next;
+
+ /*
+ * Protect against concurrent updates (store /load tearing on
+ * 32bit). It does not matter if the time is already in the
+ * past. The broadcast device which is about to be programmed will
+ * fire in any case.
+ */
+ raw_spin_lock(&jiffies_lock);
+ next = tick_next_period;
+ raw_spin_unlock(&jiffies_lock);
+ return next;
+}
+
/**
* tick_broadcast_setup_oneshot - setup the broadcast device
*/
@@ -905,10 +920,11 @@ static void tick_broadcast_setup_oneshot
tick_broadcast_oneshot_mask, tmpmask);
if (was_periodic && !cpumask_empty(tmpmask)) {
+ ktime_t nextevt = tick_get_next_period();
+
clockevents_switch_state(bc, CLOCK_EVT_STATE_ONESHOT);
- tick_broadcast_init_next_event(tmpmask,
- tick_next_period);
- tick_broadcast_set_event(bc, cpu, tick_next_period);
+ tick_broadcast_init_next_event(tmpmask, nextevt);
+ tick_broadcast_set_event(bc, cpu, nextevt);
} else
bc->next_event = KTIME_MAX;
} else {
--- a/kernel/time/tick-common.c
+++ b/kernel/time/tick-common.c
@@ -27,7 +27,9 @@
*/
DEFINE_PER_CPU(struct tick_device, tick_cpu_device);
/*
- * Tick next event: keeps track of the tick time
+ * Tick next event: keeps track of the tick time. It's updated by the
+ * CPU which handles the tick and protected by jiffies_lock. There is
+ * no requirement to write hold the jiffies seqcount for it.
*/
ktime_t tick_next_period;
ktime_t tick_period;
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -44,7 +44,9 @@ struct tick_sched *tick_get_tick_sched(i
#if defined(CONFIG_NO_HZ_COMMON) || defined(CONFIG_HIGH_RES_TIMERS)
/*
- * The time, when the last jiffy update happened. Protected by jiffies_lock.
+ * The time, when the last jiffy update happened. Write access must hold
+ * jiffies_lock and jiffies_seq. tick_nohz_next_event() needs to get a
+ * consistent view of jiffies and last_jiffies_update.
*/
static ktime_t last_jiffies_update;
@@ -53,50 +55,59 @@ static ktime_t last_jiffies_update;
*/
static void tick_do_update_jiffies64(ktime_t now)
{
- unsigned long ticks = 0;
+ unsigned long ticks = 1;
ktime_t delta;
/*
- * Do a quick check without holding jiffies_lock:
- * The READ_ONCE() pairs with two updates done later in this function.
+ * Do a quick check without holding jiffies_lock. The READ_ONCE()
+ * pairs with the update done later in this function.
*/
- delta = ktime_sub(now, READ_ONCE(last_jiffies_update));
- if (delta < tick_period)
+ if (ktime_before(now, READ_ONCE(tick_next_period)))
return;
/* Reevaluate with jiffies_lock held */
raw_spin_lock(&jiffies_lock);
+ if (ktime_before(now, tick_next_period)) {
+ raw_spin_unlock(&jiffies_lock);
+ return;
+ }
+
write_seqcount_begin(&jiffies_seq);
- delta = ktime_sub(now, last_jiffies_update);
- if (delta >= tick_period) {
+ delta = ktime_sub(now, tick_next_period);
+ if (unlikely(delta >= tick_period)) {
+ /* Slow path for long idle sleep times */
+ s64 incr = ktime_to_ns(tick_period);
- delta = ktime_sub(delta, tick_period);
- /* Pairs with the lockless read in this function. */
- WRITE_ONCE(last_jiffies_update,
- ktime_add(last_jiffies_update, tick_period));
-
- /* Slow path for long timeouts */
- if (unlikely(delta >= tick_period)) {
- s64 incr = ktime_to_ns(tick_period);
-
- ticks = ktime_divns(delta, incr);
-
- /* Pairs with the lockless read in this function. */
- WRITE_ONCE(last_jiffies_update,
- ktime_add_ns(last_jiffies_update,
- incr * ticks));
- }
- do_timer(++ticks);
+ ticks += ktime_divns(delta, incr);
- /* Keep the tick_next_period variable up to date */
- tick_next_period = ktime_add(last_jiffies_update, tick_period);
+ last_jiffies_update = ktime_add_ns(last_jiffies_update,
+ incr * ticks);
} else {
- write_seqcount_end(&jiffies_seq);
- raw_spin_unlock(&jiffies_lock);
- return;
+ last_jiffies_update = ktime_add(last_jiffies_update,
+ tick_period);
}
+
+ /*
+ * Keep the tick_next_period variable up to date. WRITE_ONCE()
+ * pairs with the READ_ONCE() in the lockless quick check above. Do
+ * this here to make the lockless quick check above more efficient.
+ */
+ WRITE_ONCE(tick_next_period,
+ ktime_add(last_jiffies_update, tick_period));
+
+ /* Advance jiffies to complete the jiffies_seq protected job */
+ jiffies_64 += ticks;
+
+ /*
+ * Release the sequence count. calc_global_load() below is not
+ * protected by it, but jiffies_lock needs to be held to prevent
+ * concurrent invocations.
+ */
write_seqcount_end(&jiffies_seq);
+
+ calc_global_load();
+
raw_spin_unlock(&jiffies_lock);
update_wall_time();
}
On 2020/11/16 3:43, Thomas Gleixner wrote:
> On Wed, Nov 11 2020 at 17:11, Yunfeng Ye wrote:
>> When nohz or nohz_full is configured, the concurrency calls of
>> tick_do_update_jiffies64 increases,
>
> Why?
>
When nohz=off, tick_do_update_jiffies64() is called by tick_sched_do_timer()
on the tick_do_timer_cpu only. But when nohz or nohz_full is on, the number of
concurrent calls to tick_do_update_jiffies64() increases, and it may be
called on every CPU core, for example:
1)
irq_enter
tick_irq_enter
tick_nohz_irq_enter
tick_nohz_update_jiffies
2)
irq_exit
irq_exit_rcu
tick_irq_exit
tick_nohz_irq_exit
tick_nohz_full_update_tick
tick_nohz_restart_sched_tick
tick_do_update_jiffies64
3)
tick_nohz_idle_exit
__tick_nohz_idle_restart_tick
tick_nohz_restart_sched_tick
tick_do_update_jiffies64
>> and the conflict between jiffies_lock and jiffies_seq increases,
>> especially in multi-core scenarios.
>
> This does not make sense. The sequence counter is updated when holding
> the lock, so there is no conflict between the lock and the sequence
> count.
>
Yes, there is no conflict between the lock and the sequence count, but
when tick_do_update_jiffies64() is called one by one, the sequence count
will be updated, it will affect the latency of tick_nohz_next_event(),
because the priority of read seqcount is less than writer.
We hit a problem: the latency between irq_handler_exit and schedule is up to
9us or more, and we want to schedule quickly. Below is the trace:
=>262651: <...>-87332 [013] dnh. 3773.487455: irq_handler_exit: irq=4 ret=handled
=>262666: <...>-87332 [013] dn.. 3773.487464: rcu_utilization: Start context switch
262667: <...>-87332 [013] dn.. 3773.487464: rcu_utilization: End context switch
We used function_graph ftrace to find which function costs so much, and
found that it is tick_nohz_irq_exit():
80519.988765 | 31) | irq_exit() {
80519.988766 | 31) | tick_nohz_irq_exit() {
=>80519.988766 | 31) | tick_nohz_next_event() {
=>80519.988774 | 31) 0.570 us | get_next_timer_interrupt();
80519.988775 | 31) 0.390 us | timekeeping_max_deferment();
80519.988775 | 31) 9.200 us | }
80519.988776 | 31) 0.390 us | tick_nohz_stop_tick();
80519.988776 | 31) + 10.700 us | }
80519.988777 | 31) + 11.630 us | }
80519.988777 | 31) | /* rcu_utilization: Start context switch */
Most of the time is spent between timestamps 80519.988766 and 80519.988774, in
tick_nohz_next_event(). This is the code before the call to
get_next_timer_interrupt():
static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
{
u64 basemono, next_tick, next_tmr, next_rcu, delta, expires;
unsigned long basejiff;
unsigned int seq;
/* Read jiffies and the time when jiffies were updated last */
do {
seq = read_seqcount_begin(&jiffies_seq);
basemono = last_jiffies_update;
basejiff = jiffies;
} while (read_seqcount_retry(&jiffies_seq, seq));
ts->last_jiffies = basejiff;
ts->timer_expires_base = basemono;
So the reason is that the read_seqcount loop leads to the latency problem. We
want to reduce the critical region of jiffies_seq.
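read_seqcount_begin() cannot make progress while any CPU is between
write_seqcount_begin() and write_seqcount_end(), and read_seqcount_retry()
makes the reader run the whole loop again when the count changed in between.
A simplified model, only for illustration:

	/* simplified model of the jiffies_seq read side */
	do {
		seq = READ_ONCE(sequence);
	} while (seq & 1);			/* odd: a writer is active, spin */
	/* ... read last_jiffies_update and jiffies ... */
	retry = (READ_ONCE(sequence) != seq);	/* a writer ran meanwhile: retry */

So every write_seqcount_begin()/end() pair on another CPU can make the reader
in tick_nohz_next_event() spin or retry, which is the latency we see.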
We also traced the tick_do_update_jiffies64() function:
167044.988746 | 5) + 34.720 us | } /* tick_do_update_jiffies64.part.2 */
167044.988747 | 90) + 24.920 us | } /* tick_nohz_next_event */
167044.988747 | 2) + 18.990 us | } /* tick_nohz_next_event */
167044.988747 | 84) + 17.430 us | } /* irq_exit */
167044.988748 | 92) + 34.150 us | } /* irq_exit */
167044.988749 | 63) 7.150 us | } /* generic_handle_irq */
167044.988750 | 5) 3.120 us | } /* generic_handle_irq */
167044.988750 | 14) + 38.940 us | } /* tick_do_update_jiffies64.part.2 */
167044.988751 | 61) 5.080 us | } /* tick_nohz_next_event */
167044.988751 | 23) + 35.010 us | } /* tick_nohz_next_event */
167044.988751 | 22) + 24.830 us | } /* tick_nohz_next_event */
167044.988751 | 28) + 40.140 us | } /* tick_do_update_jiffies64.part.2 */
During a tick period, the tick_do_update_jiffies64() is called concurrently, and the
time is up to 30+us. So the lockless quick check in tick_do_update_jiffies64()
cannot intercept all of the concurrency.
Currently, using the cmdline parameter "skew_tick=1" reduces the latency mostly,
because the conflict is mainly caused by the tick timer. But we still want to reduce
the critical region of jiffies_seq to reduce some latency, because many other
interrupts or timers happening at the same time may still trigger the conflict.
>> However, it is unnecessary to update the jiffies_seq lock multiple
>> times in a tick period, so the critical region of the jiffies_seq
>> can be reduced to reduce latency overheads.
>
> This does not make sense either. Before taking the lock we have:
>
> delta = ktime_sub(now, READ_ONCE(last_jiffies_update));
> if (delta < tick_period)
> return;
>
> as a lockless quick check.
>
> We also have mechanisms to avoid that a gazillion of CPUs call this. Why
> are they not working or are some of the callsites missing them?
>
> I'm not against reducing the seqcount write scope per se, but it needs a
> proper and correct explanation.
>
Yes, there is a lockless quick check, but we have hundreds of CPU cores, and
this lockless check cannot intercept all of the concurrency.
>> By the way, last_jiffies_update is protected by jiffies_lock, so
>> reducing the jiffies_seq critical area is safe.
>
> This is misleading. The write to last_jiffies_update is serialized by
> the jiffies lock, but the write has also to be inside the sequence write
> held section because tick_nohz_next_event() does:
>
> /* Read jiffies and the time when jiffies were updated last */
> do {
> seq = read_seqcount_begin(&jiffies_seq);
> basemono = last_jiffies_update;
> basejiff = jiffies;
> } while (read_seqcount_retry(&jiffies_seq, seq));
>
> So there is no 'By the way'.
>
It is misleading indeed. I mean that when reducing the critical region of jiffies_seq,
the read of last_jiffies_update is still protected by jiffies_lock.
Thanks.
> Thanks,
>
> tglx
> .
>