2022-04-05 03:39:58

by Frederic Weisbecker

[permalink] [raw]
Subject: [RT][PATCH 1/2] rcutorture: Also force sched priority to timersd on boosting test

ksoftirqd is statically boosted to the priority level right above the
one of rcu_torture_boost() so that timers, which torture readers rely on,
get a chance to run while rcu_torture_boost() is polling.

However timers processing got split from ksoftirqd into their own kthread
(timersd) that isn't boosted. It has the same SCHED_FIFO low prio as
rcu_torture_boost() and therefore timers can't preempt it and may
starve.

The issue can be triggered in practice on v5.17.1-rt17 using:

./kvm.sh --allcpus --configs TREE04 --duration 10m --kconfig "CONFIG_EXPERT=y CONFIG_PREEMPT_RT=y"

triggering the following:

rcu-torture: rcu_torture_boost boosting failed
Boost inversion thread ->rt_priority 1 gp_state 12908 jiffies 3501
rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 2 delta ->gp_start 3502 ->gp_activity 3502 ->gp_req_activity 3502 ->gp_wake_time 3502 ->gp_wake_seq 12897 ->gp_seq 12901 ->gp_seq_needed 12904 ->gp_max 4057 ->gp_flags 0x0
rcu: rcu_node 0:7 ->gp_seq 12901 ->gp_seq_needed 12904 ->qsmask 0x2 .... ->n_boosts 0
rcu: rcu_node 0:3 ->gp_seq 12901 ->gp_seq_needed 12908 ->qsmask 0x0 b... ->n_boosts 22
rcu: cpu 0 ->gp_seq_needed 12904
rcu: rcu_node 4:7 ->gp_seq 12901 ->gp_seq_needed 12908 ->qsmask 0x0 b..G ->n_boosts 6
rcu: cpu 5 ->gp_seq_needed 12904
rcu: CB 1^0->2 KbclSW F5089 L5089 C1 ..... q0 S CPU 0
rcu: CB 2^0->3 KbclSW F16859 L16859 C1 ..... q0 S CPU 0
rcu: CB 3^0->-1 KbclSW F3502 L3502 C2 .W12908.N. q4 S CPU 0
rcu: nocb GP 4 KldtS .[W.] .G:12904 rnp 4:7 1459 S CPU 0
rcu: CB 4^4->5 KbclSW F9918 L10534 C4 ..... q0 S CPU 0
rcu: CB 5^4->6 KbclSW F2550 L2550 C2 .W12904.N. q8 S CPU 0
rcu: CB 6^4->7 KbclSW F6639 L7253 C0 ..... q0 S CPU 0
rcu: CB 7^4->-1 KbclSW F30685 L30685 C1 ..... q0 S CPU 0
rcu: RCU callbacks invoked since boot: 236075
rcu_tasks: RTGS_WAIT_CBS(11) since 409853 g:0 i:0/0 k.
rcu_tasks_rude: RTGS_WAIT_CBS(11) since 409851 g:0 i:0/0 k.
rcu_tasks_trace: RTGS_WAIT_CBS(11) since 409848 g:0 i:0/0 k. N0 h:0/0/0
Boost inversion: GP 12908 still pending.
rcu-torture: rtc: 0000000093327eb2 ver: 1504 tfle: 0 rta: 1505 rtaf: 0 rtf: 1493 rtmbe: 0 rtmbkf: 0/782 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 1 rtb: 7 nt: 66220 onoff: 6/6:11/12 40,763:29,5050 1537:11766 (HZ=1000) barrier: 239/240:0 read-exits: 67 nocb-toggles: 0:0
rcu-torture: !!!
------------[ cut here ]------------
WARNING: CPU: 5 PID: 121 at kernel/rcu/rcutorture.c:1870 rcu_torture_stats_print+0x5ad/0x610
Modules linked in:
CPU: 5 PID: 121 Comm: rcu_torture_sta Not tainted 5.17.1-rt17 #63
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
RIP: 0010:rcu_torture_stats_print+0x5ad/0x610
Code: c7 01 02 00 74 02 0f 0b 48 83 3d 35 c7 01 02 00 74 02 0f 0b 48 83 3d 21 c7 01 02 00 74 02 0f 0b 48 83 3d 0d c7 01 02 00 74 02 <0f> 0b 83 eb 01 0f 8e ba fc ff ff 0f 0b e9 b3 fc ff ff 8b 05 8f c6
RSP: 0018:ffffa7560041fdf8 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 40000000ffffe735 RDI: 00000000ffffffff
RBP: 00000000000005d5 R08: ffffffffaf65f6e0 R09: ffffffffafee3a7c
R10: ffffa75600427d58 R11: 0000000020212121 R12: ffffa7560041fe20
R13: ffffa7560041fe78 R14: 0000000000000000 R15: 000000000001bd20
FS: 0000000000000000(0000) GS:ffff9fe7df540000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000014216000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? lock_timer_base+0x62/0x80
? rcu_torture_stats_print+0x610/0x610
rcu_torture_stats+0x29/0x70
kthread+0x118/0x130
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>

Fix this with statically boosting timersd just like is done with ksoftirqd.

Suggested-by: Mel Gorman <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
---
include/linux/interrupt.h | 1 +
kernel/rcu/rcutorture.c | 6 ++++++
kernel/softirq.c | 2 +-
3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index db4a23d07a57..e4b8a04e67ce 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -606,6 +606,7 @@ extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);

#ifdef CONFIG_PREEMPT_RT
+DECLARE_PER_CPU(struct task_struct *, timersd);
extern void raise_timer_softirq(void);
extern void raise_hrtimer_softirq(void);

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 422f7e4cc08d..ac9715ed58ba 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -3283,6 +3283,12 @@ rcu_torture_init(void)
WARN_ON_ONCE(!t);
sp.sched_priority = 2;
sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+#ifdef CONFIG_PREEMPT_RT
+ t = per_cpu(timersd, cpu);
+ WARN_ON_ONCE(!t);
+ sp.sched_priority = 2;
+ sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+#endif
}
}
}
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 22948c2109f5..89eb45614af6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -624,7 +624,7 @@ static inline void tick_irq_exit(void)
#endif
}

-static DEFINE_PER_CPU(struct task_struct *, timersd);
+DEFINE_PER_CPU(struct task_struct *, timersd);
static DEFINE_PER_CPU(unsigned long, pending_timer_softirq);

static unsigned int local_pending_timers(void)
--
2.25.1


2022-04-05 03:42:02

by Frederic Weisbecker

[permalink] [raw]
Subject: [RT][PATCH 2/2] tick: Fix timer storm since introduction of timersd

If timers are pending while the tick is reprogrammed on nohz_mode, the
next expiry is not armed to fire now, it is delayed one jiffy forward
instead so as not to raise an inextinguishable timer storm with such
scenario:

1) IRQ triggers and queue a timer
2) ksoftirqd() is woken up
3) IRQ tail: timer is reprogrammed to fire now
4) IRQ exit
5) TIMER interrupt
6) goto 3)

...all that until we finally reach ksoftirqd.

Unfortunately we are checking the wrong softirq vector bitmask since
timersd kthread has split from ksoftirqd. Timers now have their own
vector state field that must be checked separately. As a result, the
old timer storm is back. This shows up early on boot with extremely long
initcalls:

[ 333.004807] initcall dquot_init+0x0/0x111 returned 0 after 323822879 usecs

and the cause is uncovered with the right trace events showing just
10 microseconds between ticks (~100 000 Hz):

swapper/-1 1dn.h111 60818582us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415486608
swapper/-1 1dn.h111 60818592us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415496082
swapper/-1 1dn.h111 60818601us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415505550
swapper/-1 1dn.h111 60818611us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415515013
swapper/-1 1dn.h111 60818620us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415524483
swapper/-1 1dn.h111 60818630us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415533949
swapper/-1 1dn.h111 60818639us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415543426
swapper/-1 1dn.h111 60818649us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415553061
swapper/-1 1dn.h111 60818658us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415562511

Fix this with checking the right timer vector state from the nohz code.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Sebastian Andrzej Siewior <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
include/linux/interrupt.h | 12 ++++++++++++
kernel/softirq.c | 7 +------
kernel/time/tick-sched.c | 2 +-
3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index e4b8a04e67ce..da248458f4d9 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -607,9 +607,16 @@ extern void raise_softirq(unsigned int nr);

#ifdef CONFIG_PREEMPT_RT
DECLARE_PER_CPU(struct task_struct *, timersd);
+DECLARE_PER_CPU(unsigned long, pending_timer_softirq);
+
extern void raise_timer_softirq(void);
extern void raise_hrtimer_softirq(void);

+static inline unsigned int local_pending_timers(void)
+{
+ return __this_cpu_read(pending_timer_softirq);
+}
+
#else
static inline void raise_timer_softirq(void)
{
@@ -620,6 +627,11 @@ static inline void raise_hrtimer_softirq(void)
{
raise_softirq_irqoff(HRTIMER_SOFTIRQ);
}
+
+static inline unsigned int local_pending_timers(void)
+{
+ return local_softirq_pending();
+}
#endif

DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 89eb45614af6..c0aef5f760e5 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -625,12 +625,7 @@ static inline void tick_irq_exit(void)
}

DEFINE_PER_CPU(struct task_struct *, timersd);
-static DEFINE_PER_CPU(unsigned long, pending_timer_softirq);
-
-static unsigned int local_pending_timers(void)
-{
- return __this_cpu_read(pending_timer_softirq);
-}
+DEFINE_PER_CPU(unsigned long, pending_timer_softirq);

static void wake_timersd(void)
{
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 17a283ce2b20..7c359f029b97 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -763,7 +763,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)

static inline bool local_timer_softirq_pending(void)
{
- return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
+ return local_pending_timers() & BIT(TIMER_SOFTIRQ);
}

static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
--
2.25.1

2022-04-06 05:35:32

by Alison Chaiken

[permalink] [raw]
Subject: Re: [RT][PATCH 2/2] tick: Fix timer storm since introduction of timersd

On Mon, Apr 4, 2022 at 9:33 PM Frederic Weisbecker <[email protected]> wrote:
>
> If timers are pending while the tick is reprogrammed on nohz_mode, the
> next expiry is not armed to fire now, it is delayed one jiffy forward
> instead so as not to raise an inextinguishable timer storm with such
> scenario:
>
> 1) IRQ triggers and queue a timer
> 2) ksoftirqd() is woken up
> 3) IRQ tail: timer is reprogrammed to fire now
> 4) IRQ exit
> 5) TIMER interrupt
> 6) goto 3)
>
> ...all that until we finally reach ksoftirqd.
>
> Unfortunately we are checking the wrong softirq vector bitmask since
> timersd kthread has split from ksoftirqd. Timers now have their own
> vector state field that must be checked separately.

With kernel 5.15 and the timersd patch applied, we've observed that
x86_64 cores tend to enter deeper C-states even when there are pending
hrtimers. Presumably failure to check the right bits could also
explain that observation and, accordingly, the patch might fix it?

> As a result, the
> old timer storm is back. This shows up early on boot with extremely long
> initcalls:
>
> [ 333.004807] initcall dquot_init+0x0/0x111 returned 0 after 323822879 usecs
>
> and the cause is uncovered with the right trace events showing just
> 10 microseconds between ticks (~100 000 Hz):
>
> swapper/-1 1dn.h111 60818582us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415486608
> swapper/-1 1dn.h111 60818592us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415496082
> swapper/-1 1dn.h111 60818601us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415505550
> swapper/-1 1dn.h111 60818611us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415515013
> swapper/-1 1dn.h111 60818620us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415524483
> swapper/-1 1dn.h111 60818630us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415533949
> swapper/-1 1dn.h111 60818639us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415543426
> swapper/-1 1dn.h111 60818649us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415553061
> swapper/-1 1dn.h111 60818658us : hrtimer_expire_entry: hrtimer=00000000e0ef0f6b function=tick_sched_timer now=60415562511
>
> Fix this with checking the right timer vector state from the nohz code.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Sebastian Andrzej Siewior <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> ---
> include/linux/interrupt.h | 12 ++++++++++++
> kernel/softirq.c | 7 +------
> kernel/time/tick-sched.c | 2 +-
> 3 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index e4b8a04e67ce..da248458f4d9 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -607,9 +607,16 @@ extern void raise_softirq(unsigned int nr);
>
> #ifdef CONFIG_PREEMPT_RT
> DECLARE_PER_CPU(struct task_struct *, timersd);
> +DECLARE_PER_CPU(unsigned long, pending_timer_softirq);
> +
> extern void raise_timer_softirq(void);
> extern void raise_hrtimer_softirq(void);
>
> +static inline unsigned int local_pending_timers(void)
> +{
> + return __this_cpu_read(pending_timer_softirq);
> +}
> +
> #else
> static inline void raise_timer_softirq(void)
> {
> @@ -620,6 +627,11 @@ static inline void raise_hrtimer_softirq(void)
> {
> raise_softirq_irqoff(HRTIMER_SOFTIRQ);
> }
> +
> +static inline unsigned int local_pending_timers(void)
> +{
> + return local_softirq_pending();
> +}
> #endif
>
> DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 89eb45614af6..c0aef5f760e5 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -625,12 +625,7 @@ static inline void tick_irq_exit(void)
> }
>
> DEFINE_PER_CPU(struct task_struct *, timersd);
> -static DEFINE_PER_CPU(unsigned long, pending_timer_softirq);
> -
> -static unsigned int local_pending_timers(void)
> -{
> - return __this_cpu_read(pending_timer_softirq);
> -}
> +DEFINE_PER_CPU(unsigned long, pending_timer_softirq);
>
> static void wake_timersd(void)
> {
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 17a283ce2b20..7c359f029b97 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -763,7 +763,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
>
> static inline bool local_timer_softirq_pending(void)
> {
> - return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
> + return local_pending_timers() & BIT(TIMER_SOFTIRQ);
> }
>
> static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
> --
> 2.25.1
>

Thanks,
Alison Chaiken
[email protected]
Aurora Innovation

2022-04-16 01:48:05

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RT][PATCH 2/2] tick: Fix timer storm since introduction of timersd

On Tue, Apr 05, 2022 at 09:21:16AM -0700, Alison Chaiken wrote:
> On Mon, Apr 4, 2022 at 9:33 PM Frederic Weisbecker <[email protected]> wrote:
> >
> > If timers are pending while the tick is reprogrammed on nohz_mode, the
> > next expiry is not armed to fire now, it is delayed one jiffy forward
> > instead so as not to raise an inextinguishable timer storm with such
> > scenario:
> >
> > 1) IRQ triggers and queue a timer
> > 2) ksoftirqd() is woken up
> > 3) IRQ tail: timer is reprogrammed to fire now
> > 4) IRQ exit
> > 5) TIMER interrupt
> > 6) goto 3)
> >
> > ...all that until we finally reach ksoftirqd.
> >
> > Unfortunately we are checking the wrong softirq vector bitmask since
> > timersd kthread has split from ksoftirqd. Timers now have their own
> > vector state field that must be checked separately.
>
> With kernel 5.15 and the timersd patch applied, we've observed that
> x86_64 cores tend to enter deeper C-states even when there are pending
> hrtimers. Presumably failure to check the right bits could also
> explain that observation and, accordingly, the patch might fix it?

Well, this issue rather adds unnecessary ticks. Thus I wouldn't expect
deeper C-states as a consequence but who knows...

Thanks.

Subject: Re: [RT][PATCH 2/2] tick: Fix timer storm since introduction of timersd

On 2022-04-05 03:07:52 [+0200], Frederic Weisbecker wrote:
> If timers are pending while the tick is reprogrammed on nohz_mode, the
> next expiry is not armed to fire now, it is delayed one jiffy forward
> instead so as not to raise an inextinguishable timer storm with such
> scenario:

The series of two has been applied.

Sebastian