2022-11-16 08:20:46

by John Stultz

Subject: [PATCH v5 0/3] Softirq aware -rt scheduling

Hey all,

This series is a set of patches that optimize scheduler decisions around
realtime tasks and softirqs. It is a rebased and reworked set of changes
that has been shipping on Android devices for a number of years,
originally created to resolve audio glitches caused by softirqs for
network or storage drivers.

Long-running softirqs cause issues because they aren't currently taken
into account when a realtime task is woken up: if the realtime task is
placed on a cpu currently running a softirq, the softirq will delay the
realtime task from running.

This can easily be seen on some devices by running cyclictest* along
with some heavy background filesystem noise:

Without the patches:
T: 0 ( 7596) P:99 I:1000 C: 59980 Min: 7 Act: 13 Avg: 29 Max: 4107
T: 1 ( 7597) P:99 I:1500 C: 39990 Min: 14 Act: 26 Avg: 36 Max: 8994
T: 2 ( 7598) P:99 I:2000 C: 29995 Min: 7 Act: 22 Avg: 35 Max: 3616
T: 3 ( 7599) P:99 I:2500 C: 23915 Min: 7 Act: 25 Avg: 49 Max: 40273
T: 4 ( 7600) P:99 I:3000 C: 19995 Min: 8 Act: 22 Avg: 38 Max: 10510
T: 5 ( 7601) P:99 I:3500 C: 17135 Min: 7 Act: 26 Avg: 39 Max: 13194
T: 6 ( 7602) P:99 I:4000 C: 14990 Min: 7 Act: 26 Avg: 40 Max: 9470
T: 7 ( 7603) P:99 I:4500 C: 13318 Min: 8 Act: 29 Avg: 44 Max: 20101

Which you can visually see in the image here:
https://github.com/johnstultz-work/misc/raw/main/images/2022-08-09-softirq-rt-big-latency.png

Which is from the perfetto trace captured here:
https://ui.perfetto.dev/#!/?s=33661aec8ec82c2da0a59263f36f7d72b4a2f4e7a99b28b222bd12ad872f

The first patch adds a bit of generic infrastructure to get the per-cpu
softirq_pending flag.

The second patch in the series adds logic to account for when softirqs
are running, and then, conditionally based on
CONFIG_RT_SOFTIRQ_AWARE_SCHED, makes rt-task placement aware of whether
the softirq currently running on a cpu might be a long-running one, so
the rt task can potentially be placed on another free core.

The third patch in the series adds logic in __do_softirq(), also under
CONFIG_RT_SOFTIRQ_AWARE_SCHED, to defer some of the potentially long
running softirqs to ksoftirqd if a -rt task is currently running on the
cpu. This patch also includes a folded-in fix that stubs out
ksoftirqd_running() based on CONFIG_RT_SOFTIRQ_AWARE_SCHED, since with
long-running softirqs being deferred more frequently, the logic using
ksoftirqd_running() would end up being too conservative and needlessly
delay shorter-running softirqs.

With these patches we see dramatic improvements in the worst case
latencies in the cyclictest* + filesystem noise test above:

With the patches
T: 0 ( 7527) P:99 I:1000 C: 59998 Min: 6 Act: 29 Avg: 35 Max: 1734
T: 1 ( 7528) P:99 I:1500 C: 40000 Min: 7 Act: 39 Avg: 35 Max: 1181
T: 2 ( 7529) P:99 I:2000 C: 30000 Min: 7 Act: 25 Avg: 25 Max: 444
T: 3 ( 7530) P:99 I:2500 C: 24000 Min: 7 Act: 34 Avg: 36 Max: 1729
T: 4 ( 7531) P:99 I:3000 C: 20000 Min: 7 Act: 36 Avg: 25 Max: 406
T: 5 ( 7532) P:99 I:3500 C: 17143 Min: 7 Act: 38 Avg: 34 Max: 1264
T: 6 ( 7533) P:99 I:4000 C: 15000 Min: 7 Act: 27 Avg: 33 Max: 2351
T: 7 ( 7534) P:99 I:4500 C: 13334 Min: 7 Act: 41 Avg: 29 Max: 2285

Since these patches have been carried along for years, and have at times
badly collided with upstream, I wanted to submit them for some initial
review, discussion and feedback so we could hopefully eventually find a
reasonable solution that might land upstream.

* Unfortunately cyclictest had a bug that caused it to always affine
threads to cpus, preventing them from being migrated, so you'll need
to update to the latest version (which includes a fix) to reproduce.
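
For reference, the numbers above came from an invocation along the
lines of the following (exact flags inferred from the reported thread
count, priority, and intervals, so treat this as approximate):

  cyclictest -t 8 -p 99 -i 1000 -d 500 -D 60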

Let me know what you think!

thanks
-john

New in v5:
* Conditionalize active_softirqs handling (suggested by Alexander
Gordeev <[email protected]>)
* Reorder rt_task_fits_cpu to have the "fast" function first
(Suggested by Alexander Gordeev <[email protected]>)
* Fix bug I introduced in v2 condensing
task_thread_info(task)->preempt_count to preempt_count()
(Reported-by: Alexander Gordeev <[email protected]>)
* Tweak comment description to remove the vague "slow"
descriptor of softirqs being run by ksoftirqd
(Suggested by Alexander Gordeev <[email protected]>)
* Switch to using CONFIG_RT_SOFTIRQ_AWARE_SCHED (suggested by
Joel Fernandes <[email protected]>)
* Simplify cpu_busy_with_softirqs() logic as pointed out by
Alexander Gordeev <[email protected]>
* Switch to using IS_ENABLED rather than defining my own macro
(suggested by Joel Fernandes <[email protected]>)

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: [email protected]

Connor O'Brien (1):
sched: Avoid placing RT threads on cores handling long softirqs

John Stultz (1):
softirq: Add generic accessor to percpu softirq_pending data

Pavankumar Kondeti (1):
softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

arch/s390/include/asm/hardirq.h | 6 ++++
include/linux/interrupt.h | 20 ++++++++++++++
init/Kconfig | 10 +++++++
kernel/sched/rt.c | 49 +++++++++++++++++++++++++++------
kernel/softirq.c | 46 +++++++++++++++++++++++++++++--
5 files changed, 120 insertions(+), 11 deletions(-)

--
2.38.1.431.g37b22c650d-goog



2022-11-16 08:23:02

by John Stultz

Subject: [PATCH v5 2/3] sched: Avoid placing RT threads on cores handling long softirqs

From: Connor O'Brien <[email protected]>

In certain audio use cases, scheduling RT threads on cores that
are handling softirqs can lead to glitches. Prevent this
behavior in cases where the softirq is likely to take a long
time. To avoid unnecessary migrations, the old behavior is
preserved for the RCU, SCHED and TIMER softirqs, which are
expected to be relatively quick.

This patch reworks and combines two related changes originally
by John Dias <[email protected]>.

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: [email protected]
Signed-off-by: John Dias <[email protected]>
[elavila: Port to mainline, amend commit text]
Signed-off-by: J. Avila <[email protected]>
[connoro: Reworked, simplified, and merged two patches together]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Further simplified and fixed issues, reworded commit
message, removed arm64-isms]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reformatted Kconfig entry to match coding style
(Reported-by: Randy Dunlap <[email protected]>)
* Made rt_task_fits_capacity_and_may_preempt static to
avoid warnings (Reported-by: kernel test robot <[email protected]>)
* Rework to use preempt_count and drop kconfig dependency on ARM64
v3:
* Use introduced __cpu_softirq_pending() to avoid s390 build
issues (Reported-by: kernel test robot <[email protected]>)
v4:
* Drop TASKLET_SOFTIRQ from LONG_SOFTIRQS (suggested by Qais)
* Depend on !PREEMPT_RT (Suggested by Qais)
* Larger simplification of logic (suggested by Qais)
* Rework LONG_SOFTIRQS to use BIT() macros
* Rename task_may_preempt() to cpu_busy_with_softirqs()
v5:
* Conditionalize active_softirqs handling (suggested by Alexander
Gordeev <[email protected]>)
* Reorder rt_task_fits_cpu to have the "fast" function first
(Suggested by Alexander Gordeev <[email protected]>)
* Fix bug I introduced in v2 condensing
task_thread_info(task)->preempt_count to preempt_count()
(Reported-by: Alexander Gordeev <[email protected]>)
* Tweak comment description to remove the vague "slow"
descriptor of softirqs being run by ksoftirqd
(Suggested by Alexander Gordeev <[email protected]>)
* Switch to using CONFIG_RT_SOFTIRQ_AWARE_SCHED (suggested by
Joel Fernandes <[email protected]>)
* Simplify cpu_busy_with_softirqs() logic as pointed out by
Alexander Gordeev <[email protected]>
* Switch to using IS_ENABLED rather than defining my own macro
(suggested by Joel Fernandes <[email protected]>)
---
include/linux/interrupt.h | 9 +++++++
init/Kconfig | 10 ++++++++
kernel/sched/rt.c | 49 ++++++++++++++++++++++++++++++++-------
kernel/softirq.c | 17 ++++++++++++++
4 files changed, 76 insertions(+), 9 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a749a8663841..7d09eb998d4c 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -582,6 +582,11 @@ enum
* _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
*/
#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
+/* Softirqs where the handling might be long: */
+#define LONG_SOFTIRQ_MASK (BIT(NET_TX_SOFTIRQ) | \
+ BIT(NET_RX_SOFTIRQ) | \
+ BIT(BLOCK_SOFTIRQ) | \
+ BIT(IRQ_POLL_SOFTIRQ))

/* map softirq index to softirq name. update 'softirq_to_name' in
* kernel/softirq.c when adding a new softirq.
@@ -618,6 +623,10 @@ extern void raise_softirq(unsigned int nr);

DECLARE_PER_CPU(struct task_struct *, ksoftirqd);

+#ifdef CONFIG_RT_SOFTIRQ_AWARE_SCHED
+DECLARE_PER_CPU(u32, active_softirqs);
+#endif
+
static inline struct task_struct *this_cpu_ksoftirqd(void)
{
return this_cpu_read(ksoftirqd);
diff --git a/init/Kconfig b/init/Kconfig
index abf65098f1b6..ce0f0be5759c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1291,6 +1291,16 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.

+config RT_SOFTIRQ_AWARE_SCHED
+ bool "Improve RT scheduling during long softirq execution"
+ depends on SMP && !PREEMPT_RT
+ default n
+ help
+ Enable an optimization which tries to avoid placing RT tasks on CPUs
+ occupied by nonpreemptible work, such as a long-running softirq, or
+ on CPUs which may soon block preemption, such as a CPU running a
+ ksoftirqd thread that handles slow softirqs.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ed2a47e4ddae..152347c4394c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1595,6 +1595,32 @@ static void yield_task_rt(struct rq *rq)
#ifdef CONFIG_SMP
static int find_lowest_rq(struct task_struct *task);

+#ifdef CONFIG_RT_SOFTIRQ_AWARE_SCHED
+/*
+ * Return whether the given cpu is currently non-preemptible
+ * while handling a potentially long softirq, or if the current
+ * task is likely to block preemptions soon because it is a
+ * ksoftirq thread that is handling softirqs.
+ */
+static bool cpu_busy_with_softirqs(int cpu)
+{
+ u32 softirqs = per_cpu(active_softirqs, cpu) |
+ __cpu_softirq_pending(cpu);
+
+ return softirqs & LONG_SOFTIRQ_MASK;
+}
+#else
+static bool cpu_busy_with_softirqs(int cpu)
+{
+ return false;
+}
+#endif /* CONFIG_RT_SOFTIRQ_AWARE_SCHED */
+
+static bool rt_task_fits_cpu(struct task_struct *p, int cpu)
+{
+ return rt_task_fits_capacity(p, cpu) && !cpu_busy_with_softirqs(cpu);
+}
+
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
@@ -1633,22 +1659,24 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* This test is optimistic, if we get it wrong the load-balancer
* will have to sort it out.
*
- * We take into account the capacity of the CPU to ensure it fits the
- * requirement of the task - which is only important on heterogeneous
- * systems like big.LITTLE.
+ * We use rt_task_fits_cpu() to evaluate if the CPU is busy with
+ * potentially long-running softirq work, as well as take into
+ * account the capacity of the CPU to ensure it fits the
+ * requirement of the task - which is only important on
+ * heterogeneous systems like big.LITTLE.
*/
test = curr &&
unlikely(rt_task(curr)) &&
(curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);

- if (test || !rt_task_fits_capacity(p, cpu)) {
+ if (test || !rt_task_fits_cpu(p, cpu)) {
int target = find_lowest_rq(p);

/*
* Bail out if we were forcing a migration to find a better
* fitting CPU but our search failed.
*/
- if (!test && target != -1 && !rt_task_fits_capacity(p, target))
+ if (!test && target != -1 && !rt_task_fits_cpu(p, target))
goto out_unlock;

/*
@@ -1890,14 +1918,17 @@ static int find_lowest_rq(struct task_struct *task)
return -1; /* No other targets possible */

/*
- * If we're on asym system ensure we consider the different capacities
- * of the CPUs when searching for the lowest_mask.
+ * If we're using the softirq optimization or if we are
+ * on asym system, ensure we consider the softirq processing
+ * or different capacities of the CPUs when searching for the
+ * lowest_mask.
*/
- if (sched_asym_cpucap_active()) {
+ if (IS_ENABLED(CONFIG_RT_SOFTIRQ_AWARE_SCHED) ||
+ sched_asym_cpucap_active()) {

ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
task, lowest_mask,
- rt_task_fits_capacity);
+ rt_task_fits_cpu);
} else {

ret = cpupri_find(&task_rq(task)->rd->cpupri,
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c8a6913c067d..dd92ce8f771b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -60,6 +60,21 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp

DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

+#ifdef CONFIG_RT_SOFTIRQ_AWARE_SCHED
+/*
+ * active_softirqs -- per cpu, a mask of softirqs that are being handled,
+ * with the expectation that approximate answers are acceptable and
+ * therefore no synchronization is needed.
+ */
+DEFINE_PER_CPU(u32, active_softirqs);
+static inline void set_active_softirqs(u32 pending)
+{
+ __this_cpu_write(active_softirqs, pending);
+}
+#else /* CONFIG_RT_SOFTIRQ_AWARE_SCHED */
+static inline void set_active_softirqs(u32 pending) { }
+#endif /* CONFIG_RT_SOFTIRQ_AWARE_SCHED */
+
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -551,6 +566,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
restart:
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);
+ set_active_softirqs(pending);

local_irq_enable();

@@ -580,6 +596,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
pending >>= softirq_bit;
}

+ set_active_softirqs(0);
if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
__this_cpu_read(ksoftirqd) == current)
rcu_softirq_qs();
--
2.38.1.431.g37b22c650d-goog


2022-11-16 08:52:32

by John Stultz

Subject: [PATCH v5 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

From: Pavankumar Kondeti <[email protected]>

Defer the softirq processing to ksoftirqd if a RT task is
running or queued on the current CPU. This complements the RT
task placement algorithm which tries to find a CPU that is not
currently busy with softirqs.

Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL softirqs
are deferred, as they can potentially run for a long time.

Additionally, this patch stubs out the ksoftirqd_running() logic
in the CONFIG_RT_SOFTIRQ_AWARE_SCHED case, as deferring
potentially long-running softirqs would otherwise cause that
logic to not process shorter-running softirqs immediately. By
stubbing it out, the potentially long-running softirqs are
deferred, but the shorter-running ones can still run immediately.

This patch includes folded-in fixes by:
Lingutla Chandrasekhar <[email protected]>
Satya Durga Srinivasu Prabhala <[email protected]>
J. Avila <[email protected]>

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: [email protected]
Signed-off-by: Pavankumar Kondeti <[email protected]>
[[email protected]: trivial merge conflict resolution.]
Signed-off-by: Satya Durga Srinivasu Prabhala <[email protected]>
[elavila: Port to mainline, squash with bugfix]
Signed-off-by: J. Avila <[email protected]>
[jstultz: Rebase to linus/HEAD, minor rearranging of code,
included bug fix Reported-by: Qais Yousef <[email protected]> ]
Signed-off-by: John Stultz <[email protected]>
---
v4:
* Fix commit message to accurately note long-running softirqs
(suggested by Qais)
* Switch to using rt_task(current) (suggested by Qais)
v5:
* Switch to using CONFIG_RT_SOFTIRQ_AWARE_SCHED (suggested by
Joel Fernandes <[email protected]>)
---
kernel/softirq.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index dd92ce8f771b..5db2afd0be68 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -95,6 +95,7 @@ static void wakeup_softirqd(void)
wake_up_process(tsk);
}

+#ifndef CONFIG_RT_SOFTIRQ_AWARE_SCHED
/*
* If ksoftirqd is scheduled, we do not want to process pending softirqs
* right now. Let ksoftirqd handle this at its own rate, to get fairness,
@@ -109,6 +110,9 @@ static bool ksoftirqd_running(unsigned long pending)
return false;
return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
}
+#else
+#define ksoftirqd_running(pending) (false)
+#endif /* CONFIG_RT_SOFTIRQ_AWARE_SCHED */

#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_PER_CPU(int, hardirqs_enabled);
@@ -540,6 +544,21 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+#ifdef CONFIG_RT_SOFTIRQ_AWARE_SCHED
+static __u32 softirq_deferred_for_rt(__u32 *pending)
+{
+ __u32 deferred = 0;
+
+ if (rt_task(current)) {
+ deferred = *pending & LONG_SOFTIRQ_MASK;
+ *pending &= ~LONG_SOFTIRQ_MASK;
+ }
+ return deferred;
+}
+#else
+#define softirq_deferred_for_rt(x) (0)
+#endif
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -547,6 +566,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
+ __u32 deferred;
__u32 pending;
int softirq_bit;

@@ -558,14 +578,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
current->flags &= ~PF_MEMALLOC;

pending = local_softirq_pending();
+ deferred = softirq_deferred_for_rt(&pending);

softirq_handle_begin();
+
in_hardirq = lockdep_softirq_start();
account_softirq_enter(current);

restart:
/* Reset the pending bitmask before enabling irqs */
- set_softirq_pending(0);
+ set_softirq_pending(deferred);
set_active_softirqs(pending);

local_irq_enable();
@@ -604,13 +626,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
local_irq_disable();

pending = local_softirq_pending();
+ deferred = softirq_deferred_for_rt(&pending);
+
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;
+ }

+ if (pending | deferred)
wakeup_softirqd();
- }

account_softirq_exit(current);
lockdep_softirq_end(in_hardirq);
--
2.38.1.431.g37b22c650d-goog


2022-11-16 09:10:51

by John Stultz

Subject: [PATCH v5 1/3] softirq: Add generic accessor to percpu softirq_pending data

In a previous iteration of this patch series, I was checking:

per_cpu(irq_stat, cpu).__softirq_pending

which resulted in build errors on s390.

This patch tries to create a generic accessor to this percpu
softirq_pending data.

This interface is inherently racy, as it's reading per-cpu data
without a lock. However, being able to peek at the softirq
pending data allows us to make better decisions about rt task
placement vs just ignoring it.

On s390 this call returns 0, which maybe isn't ideal but
results in no functional change from what we do now.

TODO: Heiko suggested changing s390 to use a proper per-cpu
irqstat variable instead.

Feedback or suggestions for better approach here would be
welcome!

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: [email protected]
Reported-by: kernel test robot <[email protected]>
Signed-off-by: John Stultz <[email protected]>
---
arch/s390/include/asm/hardirq.h | 6 ++++++
include/linux/interrupt.h | 11 +++++++++++
2 files changed, 17 insertions(+)

diff --git a/arch/s390/include/asm/hardirq.h b/arch/s390/include/asm/hardirq.h
index 58668ffb5488..cd9cc11588ab 100644
--- a/arch/s390/include/asm/hardirq.h
+++ b/arch/s390/include/asm/hardirq.h
@@ -16,6 +16,12 @@
#define local_softirq_pending() (S390_lowcore.softirq_pending)
#define set_softirq_pending(x) (S390_lowcore.softirq_pending = (x))
#define or_softirq_pending(x) (S390_lowcore.softirq_pending |= (x))
+/*
+ * Not sure what the right thing is here for s390,
+ * but returning 0 will result in no logical change
+ * from what happens now
+ */
+#define __cpu_softirq_pending(x) (0)

#define __ARCH_IRQ_STAT
#define __ARCH_IRQ_EXIT_IRQS_DISABLED
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a92bce40b04b..a749a8663841 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -527,6 +527,17 @@ DECLARE_STATIC_KEY_FALSE(force_irqthreads_key);
#define set_softirq_pending(x) (__this_cpu_write(local_softirq_pending_ref, (x)))
#define or_softirq_pending(x) (__this_cpu_or(local_softirq_pending_ref, (x)))

+/**
+ * __cpu_softirq_pending() - Checks to see if softirq is pending on a cpu
+ *
+ * This helper is inherently racy, as we're accessing per-cpu data w/o locks.
+ * But peeking at the flag can still be useful when deciding where to place a
+ * task.
+ */
+static inline u32 __cpu_softirq_pending(int cpu)
+{
+ return (u32)per_cpu(local_softirq_pending_ref, cpu);
+}
#endif /* local_softirq_pending */

/* Some architectures might implement lazy enabling/disabling of
--
2.38.1.431.g37b22c650d-goog


2022-11-16 11:50:04

by Peter Zijlstra

Subject: Re: [PATCH v5 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On Wed, Nov 16, 2022 at 07:59:28AM +0000, John Stultz wrote:
> From: Pavankumar Kondeti <[email protected]>
>
> Defer the softirq processing to ksoftirqd if a RT task is
> running or queued on the current CPU. This complements the RT
> task placement algorithm which tries to find a CPU that is not
> currently busy with softirqs.
>
> Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL softirqs
> are deferred, as they can potentially run for a long time.
>
> Additionally, this patch stubs out the ksoftirqd_running() logic
> in the CONFIG_RT_SOFTIRQ_AWARE_SCHED case, as deferring
> potentially long-running softirqs would otherwise cause that
> logic to not process shorter-running softirqs immediately. By
> stubbing it out, the potentially long-running softirqs are
> deferred, but the shorter-running ones can still run immediately.

So I'm hating on the new config space, and dubious of the placement
logic (I'm thinking much the same gain can be had when actual softirq
processing stops quickly when we want to reschedule).

However I still have these here patches (revived from the dead):

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=core/softirq

Those fix some of these same issues... I think the last time this fell
on its face due to regressions, me not having time/energy to chase them
down, and the author of the hack-of-the-day solution walking off, or
something like that.

I think this aspect of the whole softirq thing really needs fixing first
and not hidden under some obscure CONFIG symbol.



2022-11-18 08:32:04

by John Stultz

Subject: Re: [PATCH v5 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On Wed, Nov 16, 2022 at 2:37 AM Peter Zijlstra <[email protected]> wrote:
> On Wed, Nov 16, 2022 at 07:59:28AM +0000, John Stultz wrote:
> > From: Pavankumar Kondeti <[email protected]>
> >
> > Defer the softirq processing to ksoftirqd if a RT task is
> > running or queued on the current CPU. This complements the RT
> > task placement algorithm which tries to find a CPU that is not
> > currently busy with softirqs.
> >
> > Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL softirqs
> > are deferred, as they can potentially run for a long time.
> >
> > Additionally, this patch stubs out the ksoftirqd_running() logic
> > in the CONFIG_RT_SOFTIRQ_AWARE_SCHED case, as deferring
> > potentially long-running softirqs would otherwise cause that
> > logic to not process shorter-running softirqs immediately. By
> > stubbing it out, the potentially long-running softirqs are
> > deferred, but the shorter-running ones can still run immediately.
>
> So I'm hating on the new config space, and dubious of the placement
> logic (I'm thinking much the same gain can be had when actual softirq
> processing stops quickly when we want to reschedule).

Hey Peter!
Thanks for taking the time to look this over and provide feedback!

> However I still have these here patches (revived from the dead):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=core/softirq
>
> That fix some of this same... I think the last time this fell on its
> face due to regressions and me not having time/energy to chase them down
> and the author of the hack-of-the-day solution walking off or something
> like that.

So I've taken a look at your patch series and backported it (as well
as extending it to use softirq_needs_break for the BLOCK softirq) to a
5.10 device kernel where I can compare the behaviors against the tuned
code that's shipping.

At first glance at your patches, I was optimistic, as it's great
that they nicely time-bound the number of softirqs run, but I fret
that it doesn't help if a single softirq takes longer than we'd like.
(And indeed, I can see single BLOCK softirq invocations taking quite
some time: >16ms in my last run!)

Obviously "fix that driver to not do that" would be the standard
response, but it does become a constant whack-a-mole game. Much like
the philosophy behind some of the PREEMPT_RT changes, where a broad
solution is used to avoid having to fix and maintain latencies in
code across the kernel, this task placement solution avoids making
rt latencies dependent on the continual perfection of every driver
used.

> I think this aspect of the whole softirq thing really needs fixing first
> and not hidden under some obscure CONFIG symbol.

Sure, and while I am happy to help with your current patch series, as
I do think it should improve things, I'm not sure if it will let us
move away from the softirq-rt placement optimization patches.
Any ideas for additional approaches that would be more agreeable to you?

We could pull most of the logic out from under the CONFIG option, and
maybe let userland set the long-softirq mask that rt tasks would avoid
being scheduled onto? Though I know folks would probably like to avoid
the cpupri_find_fitness() test logic if they could, so I'll have to
think about how to efficiently shortcut that if the mask is null. A
rough sketch of what I'm imagining is below.
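
Something like this, purely as a hypothetical and untested sketch
(the sysctl_long_softirq_mask knob doesn't exist, it's just to
illustrate the shape of the idea):

static bool cpu_busy_with_softirqs(int cpu)
{
	/* Hypothetical userland-settable mask, defaulting to LONG_SOFTIRQ_MASK */
	u32 long_mask = READ_ONCE(sysctl_long_softirq_mask);
	u32 softirqs;

	/* Fast path: a null mask disables the optimization entirely */
	if (!long_mask)
		return false;

	softirqs = per_cpu(active_softirqs, cpu) |
		   __cpu_softirq_pending(cpu);
	return softirqs & long_mask;
}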

Anyway, thanks so much again for the feedback!
-john

2022-12-08 18:03:31

by Joel Fernandes

Subject: Re: [PATCH v5 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

Hi John,

On Wed, Nov 16, 2022 at 07:59:28AM +0000, John Stultz wrote:
> From: Pavankumar Kondeti <[email protected]>
>
> Defer the softirq processing to ksoftirqd if a RT task is
> running or queued on the current CPU. This complements the RT
> task placement algorithm which tries to find a CPU that is not
> currently busy with softirqs.
>
> Currently only the NET_TX, NET_RX, BLOCK and IRQ_POLL softirqs
> are deferred, as they can potentially run for a long time.
>
> Additionally, this patch stubs out the ksoftirqd_running() logic
> in the CONFIG_RT_SOFTIRQ_AWARE_SCHED case, as deferring
> potentially long-running softirqs would otherwise cause that
> logic to not process shorter-running softirqs immediately. By
> stubbing it out, the potentially long-running softirqs are
> deferred, but the shorter-running ones can still run immediately.
>
> This patch includes folded-in fixes by:
> Lingutla Chandrasekhar <[email protected]>
> Satya Durga Srinivasu Prabhala <[email protected]>
> J. Avila <[email protected]>
>
[...]
> ---
> v4:
> * Fix commit message to accurately note long-running softirqs
> (suggested by Qais)
> * Switch to using rt_task(current) (suggested by Qais)
> v5:
> * Switch to using CONFIG_RT_SOFTIRQ_AWARE_SCHED (suggested by
> Joel Fernandes <[email protected]>)
> ---
> kernel/softirq.c | 29 +++++++++++++++++++++++++++--
> 1 file changed, 27 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index dd92ce8f771b..5db2afd0be68 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -95,6 +95,7 @@ static void wakeup_softirqd(void)
> wake_up_process(tsk);
> }
>
> +#ifndef CONFIG_RT_SOFTIRQ_AWARE_SCHED
> /*
> * If ksoftirqd is scheduled, we do not want to process pending softirqs
> * right now. Let ksoftirqd handle this at its own rate, to get fairness,
> @@ -109,6 +110,9 @@ static bool ksoftirqd_running(unsigned long pending)
> return false;
> return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
> }
> +#else
> +#define ksoftirqd_running(pending) (false)
> +#endif /* CONFIG_RT_SOFTIRQ_AWARE_SCHED */
>
> #ifdef CONFIG_TRACE_IRQFLAGS
> DEFINE_PER_CPU(int, hardirqs_enabled);
> @@ -540,6 +544,21 @@ static inline bool lockdep_softirq_start(void) { return false; }
> static inline void lockdep_softirq_end(bool in_hardirq) { }
> #endif
>
> +#ifdef CONFIG_RT_SOFTIRQ_AWARE_SCHED
> +static __u32 softirq_deferred_for_rt(__u32 *pending)
> +{
> + __u32 deferred = 0;
> +
> + if (rt_task(current)) {

Over here, I suggest also checking dl_task(current). SCHED_DEADLINE is
being used for the 'sugov' (schedutil governor) threads on currently
shipping products, and DL should be treated as higher priority than RT.

On the other hand, the DL counterpart of the softirq-avoiding task
placement is not there yet, but at least softirqs can be deferred for
DL threads.
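
I.e., just as a sketch of the suggested check (dl_task() being the
existing helper from include/linux/sched/deadline.h):

	if (rt_task(current) || dl_task(current)) {
		deferred = *pending & LONG_SOFTIRQ_MASK;
		*pending &= ~LONG_SOFTIRQ_MASK;
	}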

> + deferred = *pending & LONG_SOFTIRQ_MASK;
> + *pending &= ~LONG_SOFTIRQ_MASK;
> + }
> + return deferred;
> +}
> +#else
> +#define softirq_deferred_for_rt(x) (0)
> +#endif
> +
> asmlinkage __visible void __softirq_entry __do_softirq(void)
> {
> unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> @@ -547,6 +566,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> int max_restart = MAX_SOFTIRQ_RESTART;
> struct softirq_action *h;
> bool in_hardirq;
> + __u32 deferred;
> __u32 pending;
> int softirq_bit;
>
> @@ -558,14 +578,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> current->flags &= ~PF_MEMALLOC;
>
> pending = local_softirq_pending();
> + deferred = softirq_deferred_for_rt(&pending);
>
> softirq_handle_begin();
> +
> in_hardirq = lockdep_softirq_start();
> account_softirq_enter(current);
>
> restart:
> /* Reset the pending bitmask before enabling irqs */
> - set_softirq_pending(0);
> + set_softirq_pending(deferred);

Over here, the local pending mask is set to whatever was deferred [1]..

> set_active_softirqs(pending);
>
> local_irq_enable();
> @@ -604,13 +626,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> local_irq_disable();
>
> pending = local_softirq_pending();
> + deferred = softirq_deferred_for_rt(&pending);

... so over here, pending should always be set to at least the deferred bits
if deferred is set (unless something woke up and executed softirqs while
interrupts are enabled -- which should not happen because BH is disabled
throughout) [2].

> +
> if (pending) {
> if (time_before(jiffies, end) && !need_resched() &&
> --max_restart)

So then here, pending (which now also contains deferred due to [1] and [2])
should already take care of waking up ksoftirqd, except that perhaps it
should be like this:

if (pending) {
// Also, don't restart if something was deferred; let the RT task
// breathe a bit.
if (time_before(jiffies, end) && !need_resched() &&
!deferred && --max_restart)
goto restart;

wakeup_softirqd();
}

Or, will that not work?

My point being that we probably don't want to go through the retry cycle
if something was deferred, since the plan is to wake up ksoftirqd in such
cases.

> + if (pending | deferred)
> wakeup_softirqd();

And then perhaps this check can be removed.

Thoughts?

thanks,

- Joel



> goto restart;
> + }
>
> + if (pending | deferred)
> wakeup_softirqd();
> - }
>
> account_softirq_exit(current);
> lockdep_softirq_end(in_hardirq);
> --
> 2.38.1.431.g37b22c650d-goog
>