2022-10-03 23:31:22

by John Stultz

Subject: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

From: Pavankumar Kondeti <[email protected]>

Defer softirq processing to ksoftirqd if an RT task is
running or queued on the current CPU. This complements the
RT task placement algorithm, which tries to find a CPU that
is not currently busy with softirqs.

Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL
softirqs are deferred, as they can potentially run for a
long time.

Additionally, this patch stubs out ksoftirqd_running() logic,
in the CONFIG_RT_SOFTIRQ_OPTIMIZATION case, as deferring
potentially long-running softirqs will cause the logic to not
process shorter-running softirqs immediately. By stubbing it out
the potentially long running softirqs are deferred, but the
shorter running ones can still run immediately.

This patch includes folded-in fixes by:
Lingutla Chandrasekhar <[email protected]>
Satya Durga Srinivasu Prabhala <[email protected]>
J. Avila <[email protected]>

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Pavankumar Kondeti <[email protected]>
[[email protected]: trivial merge conflict resolution.]
Signed-off-by: Satya Durga Srinivasu Prabhala <[email protected]>
[elavila: Port to mainline, squash with bugfix]
Signed-off-by: J. Avila <[email protected]>
[jstultz: Rebase to linus/HEAD, minor rearranging of code,
included bug fix Reported-by: Qais Yousef <[email protected]> ]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Fix commit message to accurately note long-running softirqs
(suggested by Qais)
* Switch to using rt_task(current) (suggested by Qais)
---
kernel/softirq.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 35ee79dd8786..c8ce12bbab04 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -87,6 +87,7 @@ static void wakeup_softirqd(void)
 		wake_up_process(tsk);
 }
 
+#ifndef CONFIG_RT_SOFTIRQ_OPTIMIZATION
 /*
  * If ksoftirqd is scheduled, we do not want to process pending softirqs
  * right now. Let ksoftirqd handle this at its own rate, to get fairness,
@@ -101,6 +102,9 @@ static bool ksoftirqd_running(unsigned long pending)
 		return false;
 	return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
 }
+#else
+#define ksoftirqd_running(pending) (false)
+#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
 
 #ifdef CONFIG_TRACE_IRQFLAGS
 DEFINE_PER_CPU(int, hardirqs_enabled);
@@ -532,6 +536,21 @@ static inline bool lockdep_softirq_start(void) { return false; }
 static inline void lockdep_softirq_end(bool in_hardirq) { }
 #endif
 
+#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
+static __u32 softirq_deferred_for_rt(__u32 *pending)
+{
+	__u32 deferred = 0;
+
+	if (rt_task(current)) {
+		deferred = *pending & LONG_SOFTIRQ_MASK;
+		*pending &= ~LONG_SOFTIRQ_MASK;
+	}
+	return deferred;
+}
+#else
+#define softirq_deferred_for_rt(x) (0)
+#endif
+
 asmlinkage __visible void __softirq_entry __do_softirq(void)
 {
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -539,6 +558,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	int max_restart = MAX_SOFTIRQ_RESTART;
 	struct softirq_action *h;
 	bool in_hardirq;
+	__u32 deferred;
 	__u32 pending;
 	int softirq_bit;
 
@@ -550,14 +570,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	current->flags &= ~PF_MEMALLOC;
 
 	pending = local_softirq_pending();
+	deferred = softirq_deferred_for_rt(&pending);
 
 	softirq_handle_begin();
+
 	in_hardirq = lockdep_softirq_start();
 	account_softirq_enter(current);
 
 restart:
 	/* Reset the pending bitmask before enabling irqs */
-	set_softirq_pending(0);
+	set_softirq_pending(deferred);
 	__this_cpu_write(active_softirqs, pending);
 
 	local_irq_enable();
@@ -596,13 +618,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 	local_irq_disable();
 
 	pending = local_softirq_pending();
+	deferred = softirq_deferred_for_rt(&pending);
+
 	if (pending) {
 		if (time_before(jiffies, end) && !need_resched() &&
 		    --max_restart)
 			goto restart;
+	}
 
+	if (pending | deferred)
 		wakeup_softirqd();
-	}
 
 	account_softirq_exit(current);
 	lockdep_softirq_end(in_hardirq);
--
2.38.0.rc1.362.ged0d419d3c-goog


2022-10-04 11:06:47

by David Laight

Subject: RE: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

From: John Stultz
> Sent: 04 October 2022 00:21
>
> From: Pavankumar Kondeti <[email protected]>
>
> Defer softirq processing to ksoftirqd if an RT task is
> running or queued on the current CPU. This complements the
> RT task placement algorithm, which tries to find a CPU that
> is not currently busy with softirqs.
>
> Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL
> softirqs are deferred, as they can potentially run for a
> long time.

Deferring NET_RX to ksoftirqd stops the NET_RX code from
running until the RT process completes.

This has exactly the same problems as the softint taking
priority over the RT task - just the other way around.

Under very high traffic loads receive packets get lost.
In many cases that is actually far worse than the wakeup
of an RT process being delayed slightly.

There is no 'one size fits all' answer to the problem.

Plausibly, making this depend on the priority of the RT
task might be useful.
But sometimes it depends on the actual reason for the
wakeup.
For instance a wakeup from an ISR or a high-res timer
might need lower latency than one from a futex.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-10-04 19:35:12

by John Stultz

Subject: Re: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On Tue, Oct 4, 2022 at 3:45 AM David Laight <[email protected]> wrote:
> From: John Stultz
> > Sent: 04 October 2022 00:21
> >
> > From: Pavankumar Kondeti <[email protected]>
> >
> > Defer softirq processing to ksoftirqd if an RT task is
> > running or queued on the current CPU. This complements the
> > RT task placement algorithm, which tries to find a CPU that
> > is not currently busy with softirqs.
> >
> > Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL
> > softirqs are deferred, as they can potentially run for a
> > long time.
>
> Deferring NET_RX to ksoftirqd stops the NET_RX code from
> running until the RT process completes.
>
> This has exactly the same problems as the softint taking
> priority over the RT task - just the other way around.
>
> Under very high traffic loads receive packets get lost.
> In many cases that is actually far worse than the wakeup
> of an RT process being delayed slightly.
>
> There is no 'one size fits all' answer to the problem.

I'm not claiming there is a one size fits all solution, just like
there are cases where PREEMPT_RT is the right choice and cases where
it is not.

I'm only proposing we add this build-time configurable option because
the issues these patches address do affect Android devices, and we need
some sort of solution (though I'm open to alternatives).

> Plausibly, making this depend on the priority of the RT
> task might be useful.
> But sometimes it depends on the actual reason for the
> wakeup.
> For instance a wakeup from an ISR or a high-res timer
> might need lower latency than one from a futex.

I mean, with the PREEMPT_RT series years ago there used to be
configuration guides on setting the rtprio for each softirq thread to
provide tuning knobs like you mention, but I'm not sure what the
current state of that is.

If you're wanting more flexibility from the proposed patch, I guess we
could have a knob to runtime set which softirqs would be deferred by
rt tasks, but I'm not sure if folks want to expose that level of
detail of the kernel's workings as a UABI, so having it be a kernel
build option avoids that.

thanks
-john

2022-10-10 16:34:33

by Qais Yousef

Subject: Re: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

Hi John

On 10/03/22 23:20, John Stultz wrote:
> From: Pavankumar Kondeti <[email protected]>
>
> Defer softirq processing to ksoftirqd if an RT task is
> running or queued on the current CPU. This complements the
> RT task placement algorithm, which tries to find a CPU that
> is not currently busy with softirqs.
>
> Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL
> softirqs are deferred, as they can potentially run for a
> long time.
>
> Additionally, this patch stubs out ksoftirqd_running() logic,
> in the CONFIG_RT_SOFTIRQ_OPTIMIZATION case, as deferring
> potentially long-running softirqs will cause the logic to not
> process shorter-running softirqs immediately. By stubbing it out
> the potentially long running softirqs are deferred, but the
> shorter running ones can still run immediately.

The cover letter didn't make it to my inbox (nor to others AFAICS from lore),
so replying here.

The series looks good to me. It offers a compromise to avoid an existing
conflict between RT and softirqs without disrupting much how both inherently
work. I guess it's up to the maintainers to decide if this direction is
acceptable or not.

By the way, I've seen Thomas had a comment on another softirq series (which
attempts to throttle them ;-) that is worth looking at:

https://lore.kernel.org/lkml/877d81jc13.ffs@tglx/


Meanwhile, I did run a few tests on a test laptop with a 2-core SMT2 i7
(since I assume you tested on Android).

I set priority to 1 for all of these cyclic tests.

First I ran without applying your patch to fix the affinity problem in
cyclictest:

I had a 1 hour run of 4 threads, with 4 iperf threads and 4 dd threads
doing work in the background:

                   |     vanilla     | with softirq patches v4 |
-------------------|-----------------|-------------------------|
t0 max delay (us)  |      6728       |          2096           |
t1 max delay (us)  |      2545       |          1990           |
t2 max delay (us)  |      2282       |          2094           |
t3 max delay (us)  |      6038       |          2162           |

Which shows max latency is improved a lot. Though because I missed applying
your cyclictest patch, I believe this can be attributed only to patch 3,
which defers the softirqs if the current task is an RT one.


I applied your patch to cyclictest to NOT force affinity when specifying -t
option.



Ran cyclictest for 4 hours, -t 3, 3 iperf threads and 3 dd threads running in
the background:

                   |     vanilla     | with softirq patches v4 |
-------------------|-----------------|-------------------------|
t0 max delay (us)  |      2656       |          2164           |
t1 max delay (us)  |      2773       |          1982           |
t2 max delay (us)  |      2272       |          2278           |

I didn't hit a big max delay in this case. It shows things are still better.



Ran another cyclictest for 4 hours, -t 4, 4 iperf threads and 4 dd threads in
the background:

                   |     vanilla     | with softirq patches v4 |
-------------------|-----------------|-------------------------|
t0 max delay (us)  |      4012       |          7074           |
t1 max delay (us)  |      2460       |          9088           |
t2 max delay (us)  |      3755       |          2784           |
t3 max delay (us)  |      4392       |          2295           |

Here the results worryingly show that applying the patches makes things much
worse.

I still have to investigate why. I'll have another run to see if the results
look different, then try to dig more.

All results are from the cyclictest json dump.


Cheers

--
Qais Yousef

2022-10-17 14:54:18

by Qais Yousef

Subject: Re: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On 10/10/22 17:09, Qais Yousef wrote:
> Hi John
>
> On 10/03/22 23:20, John Stultz wrote:
> > From: Pavankumar Kondeti <[email protected]>
> >
> > Defer softirq processing to ksoftirqd if an RT task is
> > running or queued on the current CPU. This complements the
> > RT task placement algorithm, which tries to find a CPU that
> > is not currently busy with softirqs.
> >
> > Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL
> > softirqs are deferred, as they can potentially run for a
> > long time.
> >
> > Additionally, this patch stubs out ksoftirqd_running() logic,
> > in the CONFIG_RT_SOFTIRQ_OPTIMIZATION case, as deferring
> > potentially long-running softirqs will cause the logic to not
> > process shorter-running softirqs immediately. By stubbing it out
> > the potentially long running softirqs are deferred, but the
> > shorter running ones can still run immediately.
>
> The cover letter didn't make it to my inbox (nor to others AFAICS from lore),
> so replying here.
>
> The series looks good to me. It offers a compromise to avoid an existing
> conflict between RT and softirqs without disrupting much how both inherently
> work. I guess it's up to the maintainers to decide if this direction is
> acceptable or not.
>
> By the way, I've seen Thomas had a comment on another softirq series (which
> attempts to throttle them ;-) that is worth looking at:
>
> https://lore.kernel.org/lkml/877d81jc13.ffs@tglx/
>
>
> Meanwhile, I did run a few tests on a test laptop with a 2-core SMT2 i7
> (since I assume you tested on Android).
>
> I set priority to 1 for all of these cyclic tests.
>
> First I ran without applying your patch to fix the affinity problem in
> cyclictest:
>
> I had a 1 hour run of 4 threads, with 4 iperf threads and 4 dd threads
> doing work in the background:
>
>                    |     vanilla     | with softirq patches v4 |
> -------------------|-----------------|-------------------------|
> t0 max delay (us)  |      6728       |          2096           |
> t1 max delay (us)  |      2545       |          1990           |
> t2 max delay (us)  |      2282       |          2094           |
> t3 max delay (us)  |      6038       |          2162           |
>
> Which shows max latency is improved a lot. Though because I missed applying
> your cyclictest patch, I believe this can be attributed only to patch 3,
> which defers the softirqs if the current task is an RT one.
>
>
> I applied your patch to cyclictest to NOT force affinity when specifying -t
> option.
>
>
>
> Ran cyclictest for 4 hours, -t 3, 3 iperf threads and 3 dd threads running in
> the background:
>
>                    |     vanilla     | with softirq patches v4 |
> -------------------|-----------------|-------------------------|
> t0 max delay (us)  |      2656       |          2164           |
> t1 max delay (us)  |      2773       |          1982           |
> t2 max delay (us)  |      2272       |          2278           |
>
> I didn't hit a big max delay in this case. It shows things are still better.
>
>
>
> Ran another cyclictest for 4 hours, -t 4, 4 iperf threads and 4 dd threads in
> the background:
>
>                    |     vanilla     | with softirq patches v4 |
> -------------------|-----------------|-------------------------|
> t0 max delay (us)  |      4012       |          7074           |
> t1 max delay (us)  |      2460       |          9088           |
> t2 max delay (us)  |      3755       |          2784           |
> t3 max delay (us)  |      4392       |          2295           |
>
> Here the results worryingly show that applying the patches makes things much
> worse.
>
> I still have to investigate why. I'll have another run to see if the results
> look different, then try to dig more.
>
> All results are from the cyclictest json dump.

Actually scrap those results. I stupidly forgot to enable the
CONFIG_RT_SOFTIRQ_OPTIMIZATION..


I repeated the 4-hour, 4-thread test 3 times:

                   |      vanilla       | with softirq patches v4  |
-------------------|--------------------|--------------------------|
                   |  #1  |  #2  |  #3  |   #1   |   #2   |   #3   |
-------------------|------|------|------|--------|--------|--------|
t0 max delay (us)  | 9594*| 2246 | 2317 |  2763  |  2274  |  2623  |
t1 max delay (us)  | 3236 | 2356 | 2816 |  2675  |  2962  |  2944  |
t2 max delay (us)  | 2271 | 2622 | 2829 |  2274  |  2848  |  2400  |
t3 max delay (us)  | 2216 | 6587*| 2724 |  2631  |  2753  |  3034  |

Worst case scenario is reduced to 3034us instead of 9594us.

I repeated the 1-hour, 3-thread test again too:

                   |      vanilla       | with softirq patches v4  |
-------------------|--------------------|--------------------------|
                   |  #1  |  #2  |  #3  |   #1   |   #2   |   #3   |
-------------------|------|------|------|--------|--------|--------|
t0 max delay (us)  |18059 | 2251 | 2365 |  2684  |  2779  |  2838  |
t1 max delay (us)  |16311 | 2261 | 2477 |  2963  |  3020  |  3226  |
t2 max delay (us)  | 8887 | 2259 | 2432 |  2904  |  2833  |  3016  |

Worst case scenario is 3226us for softirq compared to 18059us for vanilla 6.0.

This time I paid attention to the average, as the best case number for the
vanilla kernel is better:

                   |      vanilla       | with softirq patches v4  |
-------------------|--------------------|--------------------------|
                   |  #1  |  #2  |  #3  |   #1   |   #2   |   #3   |
-------------------|------|------|------|--------|--------|--------|
t0 avg delay (us)  |31.59 |22.94 |26.50 |  31.81 |  33.57 |  34.90 |
t1 avg delay (us)  |16.85 |16.32 |37.16 |  29.05 |  30.51 |  31.65 |
t2 avg delay (us)  |25.34 |32.12 |17.40 |  26.76 |  28.28 |  28.56 |

It shows that we largely hover around 30us with the patches compared to 16-26us
being more prevalent for vanilla kernels.

I am not sure I can draw a concrete conclusion from these numbers. It seems
I need to run longer than 4 hours to hit the worst case scenario every run on
the vanilla kernel. There's an indication that the worst case scenario is
harder to hit, and it looks like there's a hit on the average delay.

I'm losing access to this system from today. I think I'll wait for more
feedback on this RFC; and do another round of testing for longer periods of
time once there's clearer sense this is indeed the direction we'll be going
for.

HTH.


Cheers

--
Qais Yousef

2022-10-18 00:53:48

by John Stultz

Subject: Re: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On Mon, Oct 17, 2022 at 7:45 AM Qais Yousef <[email protected]> wrote:
> This time I paid attention to the average as the best case number for vanilla
> kernel is better:
>
>                    |      vanilla       | with softirq patches v4  |
> -------------------|--------------------|--------------------------|
>                    |  #1  |  #2  |  #3  |   #1   |   #2   |   #3   |
> -------------------|------|------|------|--------|--------|--------|
> t0 avg delay (us)  |31.59 |22.94 |26.50 |  31.81 |  33.57 |  34.90 |
> t1 avg delay (us)  |16.85 |16.32 |37.16 |  29.05 |  30.51 |  31.65 |
> t2 avg delay (us)  |25.34 |32.12 |17.40 |  26.76 |  28.28 |  28.56 |
>
> It shows that we largely hover around 30us with the patches compared to 16-26us
> being more prevalent for vanilla kernels.
>
> I am not sure I can draw a concrete conclusion from these numbers. It seems
> I need to run longer than 4 hours to hit the worst case scenario every run on
> the vanilla kernel. There's an indication that the worst case scenario is
> harder to hit, and it looks like there's a hit on the average delay.

Thanks so much for running these tests and capturing these detailed numbers!

I'll have to look further into the average case going up here.

> I'm losing access to this system from today. I think I'll wait for more
> feedback on this RFC; and do another round of testing for longer periods of
> time once there's clearer sense this is indeed the direction we'll be going
> for.

Do you mind sending me the script you used to run the test, and I'll
try to reproduce on some x86 hardware locally?

thanks
-john

2022-10-19 12:21:33

by Qais Yousef

Subject: Re: [RFC PATCH v4 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

(note the change in email)

On 10/17/22 17:04, John Stultz wrote:
> On Mon, Oct 17, 2022 at 7:45 AM Qais Yousef <[email protected]> wrote:
> > This time I paid attention to the average as the best case number for vanilla
> > kernel is better:
> >
> >                    |      vanilla       | with softirq patches v4  |
> > -------------------|--------------------|--------------------------|
> >                    |  #1  |  #2  |  #3  |   #1   |   #2   |   #3   |
> > -------------------|------|------|------|--------|--------|--------|
> > t0 avg delay (us)  |31.59 |22.94 |26.50 |  31.81 |  33.57 |  34.90 |
> > t1 avg delay (us)  |16.85 |16.32 |37.16 |  29.05 |  30.51 |  31.65 |
> > t2 avg delay (us)  |25.34 |32.12 |17.40 |  26.76 |  28.28 |  28.56 |
> >
> > It shows that we largely hover around 30us with the patches compared to 16-26us
> > being more prevalent for vanilla kernels.
> >
> > I am not sure I can draw a concrete conclusion from these numbers. It seems
> > I need to run longer than 4 hours to hit the worst case scenario every run on
> > the vanilla kernel. There's an indication that the worst case scenario is
> > harder to hit, and it looks like there's a hit on the average delay.
>
> Thanks so much for running these tests and capturing these detailed numbers!
>
> I'll have to look further into the average case going up here.
>
> > I'm losing access to this system from today. I think I'll wait for more
> > feedback on this RFC; and do another round of testing for longer periods of
> > time once there's clearer sense this is indeed the direction we'll be going
> > for.
>
> Do you mind sending me the script you used to run the test, and I'll
> try to reproduce on some x86 hardware locally?

I ran that in a personal CI setup. I basically run the following 3 'scripts'
in parallel:

cyclictest.sh [1]:

cyclictest -t 3 -p 99 -D 3600 -i 1000 --json=cyclictest.json

iperf.sh [2]:

iperf -s -D
iperf -c localhost -u -b 10g -t 3600 -i 1 -P 3

dd.sh [3]:

while true
do
    cyclictest_running=`ps -e | grep cyclictest || true`
    if [ "x$cyclictest_running" == "x" ]; then
        break
    fi

    #
    # Run dd
    #
    file="/tmp/myci.dd.file"
    for i in $(seq 3)
    do
        dd if=/dev/zero of=$file.$i bs=1M count=2048 &
    done
    wait
    rm -f $file*

    sleep 3
done

[1] https://github.com/qais-yousef/myci-sched-tests/blob/dev/vars/run_cyclictest.groovy
[2] https://github.com/qais-yousef/myci-sched-tests/blob/dev/vars/run_iperf_parallel.groovy
[3] https://github.com/qais-yousef/myci-sched-tests/blob/dev/vars/run_dd_parallel.groovy


Cheers

--
Qais Yousef