2022-09-21 01:47:40

by John Stultz

Subject: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

Hey all,

This series is a set of patches that optimize scheduler decisions around
realtime tasks and softirqs. This series is a rebased and reworked set
of changes that have been shipping on Android devices for a number of
years, originally created to resolve audio glitches seen on devices
caused by softirqs for network or storage drivers.

Long running softirqs cause issues because they aren’t currently taken
into account when a realtime task is woken up, but they will delay
realtime tasks from running if the realtime tasks are placed on a cpu
currently running a softirq.

This can easily be seen on some devices by running cyclictest* along
with some heavy background filesystem noise:

Without the patches:
T: 0 ( 7596) P:99 I:1000 C: 59980 Min: 7 Act: 13 Avg: 29 Max: 4107
T: 1 ( 7597) P:99 I:1500 C: 39990 Min: 14 Act: 26 Avg: 36 Max: 8994
T: 2 ( 7598) P:99 I:2000 C: 29995 Min: 7 Act: 22 Avg: 35 Max: 3616
T: 3 ( 7599) P:99 I:2500 C: 23915 Min: 7 Act: 25 Avg: 49 Max: 40273
T: 4 ( 7600) P:99 I:3000 C: 19995 Min: 8 Act: 22 Avg: 38 Max: 10510
T: 5 ( 7601) P:99 I:3500 C: 17135 Min: 7 Act: 26 Avg: 39 Max: 13194
T: 6 ( 7602) P:99 I:4000 C: 14990 Min: 7 Act: 26 Avg: 40 Max: 9470
T: 7 ( 7603) P:99 I:4500 C: 13318 Min: 8 Act: 29 Avg: 44 Max: 20101

Which you can visually see in the image here:
https://github.com/johnstultz-work/misc/raw/main/images/2022-08-09-softirq-rt-big-latency.png

Which is from the perfetto trace captured here:
https://ui.perfetto.dev/#!/?s=33661aec8ec82c2da0a59263f36f7d72b4a2f4e7a99b28b222bd12ad872f

The first patch adds a bit of generic infrastructure to get the per-cpu
softirq_pending flag.
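
Roughly, the accessor ends up looking something like this (a sketch only,
with the exact body assumed rather than copied from the patch; s390 gets
its own definition per the diffstat below):

/* Sketch: read another CPU's pending-softirq bitmask, mirroring
 * local_softirq_pending() for the generic case. */
#ifndef __cpu_softirq_pending
#define __cpu_softirq_pending(cpu)	(per_cpu(local_softirq_pending_ref, (cpu)))
#endif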

The second patch in the series adds logic to account for when softirqs
are running, and then, conditionally on CONFIG_RT_SOFTIRQ_OPTIMIZATION,
makes rt-task placement aware of whether the cpu's current softirq might
be a long-running one, so the rt task can potentially be placed on
another free core.

The third patch in the series adds logic in __do_softirq(), also under
CONFIG_RT_SOFTIRQ_OPTIMIZATION, to defer some of the potentially long
running softirqs to ksoftirqd when a -rt task is currently running on the
cpu. This patch also includes a folded-down fix that stubs out
ksoftirqd_running() based on CONFIG_RT_SOFTIRQ_OPTIMIZATION, since with
long-running softirqs now deferred more often, the logic using
ksoftirqd_running() would end up being too conservative and needlessly
delay shorter-running softirqs.

With these patches we see dramatic improvements in the worst case
latencies in the cyclictest* + filesystem noise test above:

With the patches
T: 0 ( 7527) P:99 I:1000 C: 59998 Min: 6 Act: 29 Avg: 35 Max: 1734
T: 1 ( 7528) P:99 I:1500 C: 40000 Min: 7 Act: 39 Avg: 35 Max: 1181
T: 2 ( 7529) P:99 I:2000 C: 30000 Min: 7 Act: 25 Avg: 25 Max: 444
T: 3 ( 7530) P:99 I:2500 C: 24000 Min: 7 Act: 34 Avg: 36 Max: 1729
T: 4 ( 7531) P:99 I:3000 C: 20000 Min: 7 Act: 36 Avg: 25 Max: 406
T: 5 ( 7532) P:99 I:3500 C: 17143 Min: 7 Act: 38 Avg: 34 Max: 1264
T: 6 ( 7533) P:99 I:4000 C: 15000 Min: 7 Act: 27 Avg: 33 Max: 2351
T: 7 ( 7534) P:99 I:4500 C: 13334 Min: 7 Act: 41 Avg: 29 Max: 2285

Since these patches have been carried along for years, and have at times
badly collided with upstream, I wanted to submit them for some initial
review, discussion and feedback so we could hopefully eventually find a
reasonable solution that might land upstream.


* Unfortunately cyclictest had a bug that caused it to always affine
threads to cpus, preventing them from being migrated. So you'll need
to update to the latest version (which includes a fix) to reproduce.

Let me know what you think!

thanks
-john

Changes in v3:
* Added generic __cpu_softirq_pending() accessor to avoid s390 build
trouble

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: [email protected]


Connor O'Brien (1):
sched: Avoid placing RT threads on cores handling long softirqs

John Stultz (1):
softirq: Add generic accessor to percpu softirq_pending data

Pavankumar Kondeti (1):
softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

arch/s390/include/asm/hardirq.h | 6 ++++
include/linux/interrupt.h | 18 ++++++++++
include/linux/sched.h | 10 ++++++
init/Kconfig | 10 ++++++
kernel/sched/cpupri.c | 13 +++++++
kernel/sched/rt.c | 64 ++++++++++++++++++++++++++++-----
kernel/softirq.c | 34 ++++++++++++++++--
7 files changed, 144 insertions(+), 11 deletions(-)

--
2.37.3.968.ga6b4b080e4-goog


2022-09-21 01:50:33

by John Stultz

Subject: [RFC PATCH v3 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

From: Pavankumar Kondeti <[email protected]>

Defer the softirq processing to ksoftirqd if a RT task is
running or queued on the current CPU. This complements the RT
task placement algorithm which tries to find a CPU that is not
currently busy with softirqs.

Currently NET_TX, NET_RX, BLOCK and TASKLET softirqs are the only ones
deferred, as they can potentially run for a long time.

Additionally, this patch stubs out ksoftirqd_running() logic,
in the CONFIG_RT_SOFTIRQ_OPTIMIZATION case, as deferring
potentially long-running softirqs will cause the logic to not
process shorter-running softirqs immediately. By stubbing it out
the potentially long running softirqs are deferred, but the
shorter running ones can still run immediately.

This patch includes folded-in fixes by:
Lingutla Chandrasekhar <[email protected]>
Satya Durga Srinivasu Prabhala <[email protected]>
J. Avila <[email protected]>

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: Pavankumar Kondeti <[email protected]>
[[email protected]: trivial merge conflict resolution.]
Signed-off-by: Satya Durga Srinivasu Prabhala <[email protected]>
[elavila: Port to mainline, squash with bugfix]
Signed-off-by: J. Avila <[email protected]>
[jstultz: Rebase to linus/HEAD, minor rearranging of code,
included bug fix Reported-by: Qais Yousef <[email protected]> ]
Signed-off-by: John Stultz <[email protected]>
---
include/linux/sched.h | 10 ++++++++++
kernel/sched/cpupri.c | 13 +++++++++++++
kernel/softirq.c | 25 +++++++++++++++++++++++--
3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e7b2f8a5c711..7f76371cbbb0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1826,6 +1826,16 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)

extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
+
+#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
+extern bool cpupri_check_rt(void);
+#else
+static inline bool cpupri_check_rt(void)
+{
+ return false;
+}
+#endif
+
#ifdef CONFIG_SMP
extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
index fa9ce9d83683..18dc75d16951 100644
--- a/kernel/sched/cpupri.c
+++ b/kernel/sched/cpupri.c
@@ -64,6 +64,19 @@ static int convert_prio(int prio)
return cpupri;
}

+#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
+/*
+ * cpupri_check_rt - check if CPU has a RT task
+ * should be called from rcu-sched read section.
+ */
+bool cpupri_check_rt(void)
+{
+ int cpu = raw_smp_processor_id();
+
+ return cpu_rq(cpu)->rd->cpupri.cpu_to_pri[cpu] > CPUPRI_NORMAL;
+}
+#endif
+
static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
struct cpumask *lowest_mask, int idx)
{
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 35ee79dd8786..203a70dc9459 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -87,6 +87,7 @@ static void wakeup_softirqd(void)
wake_up_process(tsk);
}

+#ifndef CONFIG_RT_SOFTIRQ_OPTIMIZATION
/*
* If ksoftirqd is scheduled, we do not want to process pending softirqs
* right now. Let ksoftirqd handle this at its own rate, to get fairness,
@@ -101,6 +102,9 @@ static bool ksoftirqd_running(unsigned long pending)
return false;
return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
}
+#else
+#define ksoftirqd_running(pending) (false)
+#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */

#ifdef CONFIG_TRACE_IRQFLAGS
DEFINE_PER_CPU(int, hardirqs_enabled);
@@ -532,6 +536,17 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+static __u32 softirq_deferred_for_rt(__u32 *pending)
+{
+ __u32 deferred = 0;
+
+ if (cpupri_check_rt()) {
+ deferred = *pending & LONG_SOFTIRQ_MASK;
+ *pending &= ~LONG_SOFTIRQ_MASK;
+ }
+ return deferred;
+}
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
@@ -539,6 +554,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
+ __u32 deferred;
__u32 pending;
int softirq_bit;

@@ -551,13 +567,15 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending();

+ deferred = softirq_deferred_for_rt(&pending);
softirq_handle_begin();
+
in_hardirq = lockdep_softirq_start();
account_softirq_enter(current);

restart:
/* Reset the pending bitmask before enabling irqs */
- set_softirq_pending(0);
+ set_softirq_pending(deferred);
__this_cpu_write(active_softirqs, pending);

local_irq_enable();
@@ -596,13 +614,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
local_irq_disable();

pending = local_softirq_pending();
+ deferred = softirq_deferred_for_rt(&pending);
+
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;
+ }

+ if (pending | deferred)
wakeup_softirqd();
- }

account_softirq_exit(current);
lockdep_softirq_end(in_hardirq);
--
2.37.3.968.ga6b4b080e4-goog

2022-09-21 02:17:45

by John Stultz

Subject: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

From: Connor O'Brien <[email protected]>

In certain audio use cases, scheduling RT threads on cores that
are handling softirqs can lead to glitches. Prevent this
behavior in cases where the softirq is likely to take a long
time. To avoid unnecessary migrations, the old behavior is
preserved for RCU, SCHED and TIMER irqs which are expected to be
relatively quick.

This patch reworks and combines two related changes originally
by John Dias <[email protected]>

Cc: John Dias <[email protected]>
Cc: Connor O'Brien <[email protected]>
Cc: Rick Yiu <[email protected]>
Cc: John Kacur <[email protected]>
Cc: Qais Yousef <[email protected]>
Cc: Chris Redpath <[email protected]>
Cc: Abhijeet Dharmapurikar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Signed-off-by: John Dias <[email protected]>
[elavila: Port to mainline, amend commit text]
Signed-off-by: J. Avila <[email protected]>
[connoro: Reworked, simplified, and merged two patches together]
Signed-off-by: Connor O'Brien <[email protected]>
[jstultz: Further simplified and fixed issues, reworded commit
message, removed arm64-isms]
Signed-off-by: John Stultz <[email protected]>
---
v2:
* Reformatted Kconfig entry to match coding style
(Reported-by: Randy Dunlap <[email protected]>)
* Made rt_task_fits_capacity_and_may_preempt static to
avoid warnings (Reported-by: kernel test robot <[email protected]>)
* Rework to use preempt_count and drop kconfig dependency on ARM64
v3:
* Use introduced __cpu_softirq_pending() to avoid s390 build
issues (Reported-by: kernel test robot <[email protected]>)
---
include/linux/interrupt.h | 7 +++++
init/Kconfig | 10 ++++++
kernel/sched/rt.c | 64 +++++++++++++++++++++++++++++++++------
kernel/softirq.c | 9 ++++++
4 files changed, 81 insertions(+), 9 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a749a8663841..1d126b8495bc 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -582,6 +582,12 @@ enum
* _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
*/
#define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
+/* Softirq's where the handling might be long: */
+#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
+ (1 << NET_RX_SOFTIRQ) | \
+ (1 << BLOCK_SOFTIRQ) | \
+ (1 << IRQ_POLL_SOFTIRQ) | \
+ (1 << TASKLET_SOFTIRQ))

/* map softirq index to softirq name. update 'softirq_to_name' in
* kernel/softirq.c when adding a new softirq.
@@ -617,6 +623,7 @@ extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);

DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+DECLARE_PER_CPU(u32, active_softirqs);

static inline struct task_struct *this_cpu_ksoftirqd(void)
{
diff --git a/init/Kconfig b/init/Kconfig
index 532362fcfe31..8b5add74b6cb 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1284,6 +1284,16 @@ config SCHED_AUTOGROUP
desktop applications. Task group autogeneration is currently based
upon task session.

+config RT_SOFTIRQ_OPTIMIZATION
+ bool "Improve RT scheduling during long softirq execution"
+ depends on SMP
+ default n
+ help
+ Enable an optimization which tries to avoid placing RT tasks on CPUs
+ occupied by nonpreemptible tasks, such as a long softirq or CPUs
+ which may soon block preemptions, such as a CPU running a ksoftirq
+ thread which handles slow softirqs.
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 55f39c8f4203..826f56daecc5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1599,12 +1599,49 @@ static void yield_task_rt(struct rq *rq)
#ifdef CONFIG_SMP
static int find_lowest_rq(struct task_struct *task);

+#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
+/*
+ * Return whether the task on the given cpu is currently non-preemptible
+ * while handling a potentially long softirq, or if the task is likely
+ * to block preemptions soon because it is a ksoftirq thread that is
+ * handling slow softirq.
+ */
+static bool task_may_preempt(struct task_struct *task, int cpu)
+{
+ u32 softirqs = per_cpu(active_softirqs, cpu) |
+ __cpu_softirq_pending(cpu);
+ struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
+ struct task_struct *curr;
+ struct rq *rq = cpu_rq(cpu);
+ int ret;
+
+ rcu_read_lock();
+ curr = READ_ONCE(rq->curr); /* unlocked access */
+ ret = !((softirqs & LONG_SOFTIRQ_MASK) &&
+ (curr == cpu_ksoftirqd ||
+ preempt_count() & SOFTIRQ_MASK));
+ rcu_read_unlock();
+ return ret;
+}
+#else
+static bool task_may_preempt(struct task_struct *task, int cpu)
+{
+ return true;
+}
+#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
+
+static bool rt_task_fits_capacity_and_may_preempt(struct task_struct *p, int cpu)
+{
+ return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
+}
+
static int
select_task_rq_rt(struct task_struct *p, int cpu, int flags)
{
struct task_struct *curr;
struct rq *rq;
bool test;
+ bool may_not_preempt;

/* For anything but wake ups, just return the task_cpu */
if (!(flags & (WF_TTWU | WF_FORK)))
@@ -1616,7 +1653,12 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
curr = READ_ONCE(rq->curr); /* unlocked access */

/*
- * If the current task on @p's runqueue is an RT task, then
+ * If the current task on @p's runqueue is a softirq task,
+ * it may run without preemption for a time that is
+ * ill-suited for a waiting RT task. Therefore, try to
+ * wake this RT task on another runqueue.
+ *
+ * Also, if the current task on @p's runqueue is an RT task, then
* try to see if we can wake this RT task up on another
* runqueue. Otherwise simply start this RT task
* on its current runqueue.
@@ -1641,9 +1683,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
* requirement of the task - which is only important on heterogeneous
* systems like big.LITTLE.
*/
- test = curr &&
- unlikely(rt_task(curr)) &&
- (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
+ may_not_preempt = !task_may_preempt(curr, cpu);
+ test = (curr && (may_not_preempt ||
+ (unlikely(rt_task(curr)) &&
+ (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio))));

if (test || !rt_task_fits_capacity(p, cpu)) {
int target = find_lowest_rq(p);
@@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
goto out_unlock;

/*
- * Don't bother moving it if the destination CPU is
+ * If cpu is non-preemptible, prefer remote cpu
+ * even if it's running a higher-prio task.
+ * Otherwise: Don't bother moving it if the destination CPU is
* not running a lower priority task.
*/
if (target != -1 &&
- p->prio < cpu_rq(target)->rt.highest_prio.curr)
+ (may_not_preempt ||
+ p->prio < cpu_rq(target)->rt.highest_prio.curr))
cpu = target;
}

@@ -1901,11 +1947,11 @@ static int find_lowest_rq(struct task_struct *task)

ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
task, lowest_mask,
- rt_task_fits_capacity);
+ rt_task_fits_capacity_and_may_preempt);
} else {

- ret = cpupri_find(&task_rq(task)->rd->cpupri,
- task, lowest_mask);
+ ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
+ task, lowest_mask, task_may_preempt);
}

if (!ret)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c8a6913c067d..35ee79dd8786 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -60,6 +60,13 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp

DEFINE_PER_CPU(struct task_struct *, ksoftirqd);

+/*
+ * active_softirqs -- per cpu, a mask of softirqs that are being handled,
+ * with the expectation that approximate answers are acceptable and therefore
+ * no synchronization.
+ */
+DEFINE_PER_CPU(u32, active_softirqs);
+
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -551,6 +558,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
restart:
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);
+ __this_cpu_write(active_softirqs, pending);

local_irq_enable();

@@ -580,6 +588,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
pending >>= softirq_bit;
}

+ __this_cpu_write(active_softirqs, 0);
if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
__this_cpu_read(ksoftirqd) == current)
rcu_softirq_qs();
--
2.37.3.968.ga6b4b080e4-goog

2022-09-28 13:21:11

by Qais Yousef

Subject: Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

Hi John

On 09/21/22 01:25, John Stultz wrote:
> From: Connor O'Brien <[email protected]>
>
> In certain audio use cases, scheduling RT threads on cores that
> are handling softirqs can lead to glitches. Prevent this
> behavior in cases where the softirq is likely to take a long
> time. To avoid unnecessary migrations, the old behavior is
> preserved for RCU, SCHED and TIMER irqs which are expected to be
> relatively quick.
>
> This patch reworks and combines two related changes originally
> by John Dias <[email protected]>
>
> Cc: John Dias <[email protected]>
> Cc: Connor O'Brien <[email protected]>
> Cc: Rick Yiu <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Chris Redpath <[email protected]>
> Cc: Abhijeet Dharmapurikar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: John Dias <[email protected]>
> [elavila: Port to mainline, amend commit text]
> Signed-off-by: J. Avila <[email protected]>
> [connoro: Reworked, simplified, and merged two patches together]
> Signed-off-by: Connor O'Brien <[email protected]>
> [jstultz: Further simplified and fixed issues, reworded commit
> message, removed arm64-isms]
> Signed-off-by: John Stultz <[email protected]>
> ---
> v2:
> * Reformatted Kconfig entry to match coding style
> (Reported-by: Randy Dunlap <[email protected]>)
> * Made rt_task_fits_capacity_and_may_preempt static to
> avoid warnings (Reported-by: kernel test robot <[email protected]>)
> * Rework to use preempt_count and drop kconfig dependency on ARM64
> v3:
> * Use introduced __cpu_softirq_pending() to avoid s390 build
> issues (Reported-by: kernel test robot <[email protected]>)
> ---
> include/linux/interrupt.h | 7 +++++
> init/Kconfig | 10 ++++++
> kernel/sched/rt.c | 64 +++++++++++++++++++++++++++++++++------
> kernel/softirq.c | 9 ++++++
> 4 files changed, 81 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> index a749a8663841..1d126b8495bc 100644
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -582,6 +582,12 @@ enum
> * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> */
> #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> +/* Softirq's where the handling might be long: */
> +#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
> + (1 << NET_RX_SOFTIRQ) | \
> + (1 << BLOCK_SOFTIRQ) | \
> + (1 << IRQ_POLL_SOFTIRQ) | \
> + (1 << TASKLET_SOFTIRQ))

I'm not sure about TASKLET. I can understand that NET and BLOCK require high
throughput and hence could end up in softirq for a long time, but TASKLET
seems to allow sloppiness. I don't feel strongly about it, but it's worth
debating whether we really need to include it.

NET has actually introduced a knob to help bound their softirq, see

Documentation/admin-guide/sysctl/net.rst::netdev_budget_usecs

Though I have yet to find the time to understand whether this knob allows
improving latencies without impacting network throughput.

There's also threaded NAPI support, which I'm not sure will make NET softirq
latencies better or worse. I think this will cause more CPUs to process the
network softirq, but potentially spend less time there.

I'm not aware of any effort to improve latencies for BLOCK softirqs.

I'm under the impression all TASKLET users can be converted to workqueues or
kthreads.

>
> /* map softirq index to softirq name. update 'softirq_to_name' in
> * kernel/softirq.c when adding a new softirq.
> @@ -617,6 +623,7 @@ extern void raise_softirq_irqoff(unsigned int nr);
> extern void raise_softirq(unsigned int nr);
>
> DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
> +DECLARE_PER_CPU(u32, active_softirqs);
>
> static inline struct task_struct *this_cpu_ksoftirqd(void)
> {
> diff --git a/init/Kconfig b/init/Kconfig
> index 532362fcfe31..8b5add74b6cb 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1284,6 +1284,16 @@ config SCHED_AUTOGROUP
> desktop applications. Task group autogeneration is currently based
> upon task session.
>
> +config RT_SOFTIRQ_OPTIMIZATION
> + bool "Improve RT scheduling during long softirq execution"
> + depends on SMP

Not sure if we need to depend on !PREEMPT_RT. This optimization might not be
desired for those systems. I'll defer to PREEMPT_RT folks to decide on that.

> + default n
> + help
> + Enable an optimization which tries to avoid placing RT tasks on CPUs
> + occupied by nonpreemptible tasks, such as a long softirq or CPUs
> + which may soon block preemptions, such as a CPU running a ksoftirq
> + thread which handles slow softirqs.
> +
> config SYSFS_DEPRECATED
> bool "Enable deprecated sysfs features to support old userspace tools"
> depends on SYSFS
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 55f39c8f4203..826f56daecc5 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1599,12 +1599,49 @@ static void yield_task_rt(struct rq *rq)
> #ifdef CONFIG_SMP
> static int find_lowest_rq(struct task_struct *task);
>
> +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> +/*
> + * Return whether the task on the given cpu is currently non-preemptible
> + * while handling a potentially long softirq, or if the task is likely
> + * to block preemptions soon because it is a ksoftirq thread that is
> + * handling slow softirq.
> + */
> +static bool task_may_preempt(struct task_struct *task, int cpu)
> +{
> + u32 softirqs = per_cpu(active_softirqs, cpu) |
> + __cpu_softirq_pending(cpu);
> + struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
> + struct task_struct *curr;
> + struct rq *rq = cpu_rq(cpu);
> + int ret;
> +
> + rcu_read_lock();
> + curr = READ_ONCE(rq->curr); /* unlocked access */
> + ret = !((softirqs & LONG_SOFTIRQ_MASK) &&
> + (curr == cpu_ksoftirqd ||
> + preempt_count() & SOFTIRQ_MASK));
> + rcu_read_unlock();
> + return ret;
> +}
> +#else
> +static bool task_may_preempt(struct task_struct *task, int cpu)
> +{
> + return true;
> +}
> +#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
> +
> +static bool rt_task_fits_capacity_and_may_preempt(struct task_struct *p, int cpu)
> +{
> + return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
> +}

Maybe better to rename to rt_task_fits_cpu() and make it generic?

See below for more on that.

> +
> static int
> select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> {
> struct task_struct *curr;
> struct rq *rq;
> bool test;
> + bool may_not_preempt;
>
> /* For anything but wake ups, just return the task_cpu */
> if (!(flags & (WF_TTWU | WF_FORK)))
> @@ -1616,7 +1653,12 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> curr = READ_ONCE(rq->curr); /* unlocked access */
>
> /*
> - * If the current task on @p's runqueue is an RT task, then
> + * If the current task on @p's runqueue is a softirq task,
> + * it may run without preemption for a time that is
> + * ill-suited for a waiting RT task. Therefore, try to
> + * wake this RT task on another runqueue.
> + *
> + * Also, if the current task on @p's runqueue is an RT task, then
> * try to see if we can wake this RT task up on another
> * runqueue. Otherwise simply start this RT task
> * on its current runqueue.
> @@ -1641,9 +1683,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> * requirement of the task - which is only important on heterogeneous
> * systems like big.LITTLE.
> */
> - test = curr &&
> - unlikely(rt_task(curr)) &&
> - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> + may_not_preempt = !task_may_preempt(curr, cpu);
> + test = (curr && (may_not_preempt ||
> + (unlikely(rt_task(curr)) &&
> + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio))));

I think this is unnecessary if you create the new rt_task_fits_cpu() and ...

>
> if (test || !rt_task_fits_capacity(p, cpu)) {

... replace the call to rt_task_fits_capacity() with the new
rt_task_fits_cpu()?

> int target = find_lowest_rq(p);
> @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> goto out_unlock;
>
> /*
> - * Don't bother moving it if the destination CPU is
> + * If cpu is non-preemptible, prefer remote cpu
> + * even if it's running a higher-prio task.
> + * Otherwise: Don't bother moving it if the destination CPU is
> * not running a lower priority task.
> */
> if (target != -1 &&
> - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> + (may_not_preempt ||
> + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> cpu = target;

I'm not sure this makes sense. You assume a higher priority task will cause
less delay than softirqs. Which I think is an optimistic assumption?

I think we should just mimic the same fallback behavior when we fail to find
a CPU that fits the capacity requirement. Keeps things more consistent IMO.

> }
>
> @@ -1901,11 +1947,11 @@ static int find_lowest_rq(struct task_struct *task)
>
> ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> task, lowest_mask,
> - rt_task_fits_capacity);
> + rt_task_fits_capacity_and_may_preempt);
> } else {
>
> - ret = cpupri_find(&task_rq(task)->rd->cpupri,
> - task, lowest_mask);
> + ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> + task, lowest_mask, task_may_preempt);

I think we can simplify the code here to call cpupri_find_fitness()
unconditionally passing the new rt_task_fits_cpu(). rt_task_fits_capacity()
will always return true if the static key is disabled or uclamp is not
configured. So rt_task_fits_cpu() will effectively depend on/boil down to
task_may_preempt() in these cases.

Note that I had this called unconditionally in the past, but Steve suggested
doing it this way, IIRC, to avoid the cost of calling the fitness function when
we don't need to. I'm not sure it matters to be honest, but if you follow my
suggestion you might be asked to avoid the costs for the users who don't care
about the fitness criteria.
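
To make the suggestion concrete, here is a sketch of what I mean
(rt_task_fits_cpu() being the name proposed above, not code from the
posted series):

static bool rt_task_fits_cpu(struct task_struct *p, int cpu)
{
	/* rt_task_fits_capacity() returns true when uclamp is not
	 * configured or its static key is disabled, so in that case
	 * this boils down to just task_may_preempt(). */
	return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
}

/* ... and then find_lowest_rq() could call cpupri_find_fitness()
 * unconditionally, passing rt_task_fits_cpu as the fitness callback. */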


Thanks

--
Qais Yousef

> }
>
> if (!ret)
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index c8a6913c067d..35ee79dd8786 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -60,6 +60,13 @@ static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp
>
> DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
>
> +/*
> + * active_softirqs -- per cpu, a mask of softirqs that are being handled,
> + * with the expectation that approximate answers are acceptable and therefore
> + * no synchronization.
> + */
> +DEFINE_PER_CPU(u32, active_softirqs);
> +
> const char * const softirq_to_name[NR_SOFTIRQS] = {
> "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
> "TASKLET", "SCHED", "HRTIMER", "RCU"
> @@ -551,6 +558,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> restart:
> /* Reset the pending bitmask before enabling irqs */
> set_softirq_pending(0);
> + __this_cpu_write(active_softirqs, pending);
>
> local_irq_enable();
>
> @@ -580,6 +588,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> pending >>= softirq_bit;
> }
>
> + __this_cpu_write(active_softirqs, 0);
> if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
> __this_cpu_read(ksoftirqd) == current)
> rcu_softirq_qs();
> --
> 2.37.3.968.ga6b4b080e4-goog
>

2022-09-28 13:29:36

by Qais Yousef

Subject: Re: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

Hi John

On 09/21/22 01:25, John Stultz wrote:
> Hey all,
>
> This series is a set of patches that optimize scheduler decisions around
> realtime tasks and softirqs. This series is a rebased and reworked set
> of changes that have been shipping on Android devices for a number of
> years, originally created to resolve audio glitches seen on devices
> caused by softirqs for network or storage drivers.
>
> Long running softirqs cause issues because they aren’t currently taken
> into account when a realtime task is woken up, but they will delay
> realtime tasks from running if the realtime tasks are placed on a cpu
> currently running a softirq.

Thanks a lot for sending this series. I've raised this problem in various
venues in the past, but it seems it is hard to do something better than what
you propose here.

Borrowing some behaviours from PREEMPT_RT (like threadedirqs) won't cut it
outside PREEMPT_RT AFAIU.

Peter did suggest an alternative at one point in the past to be more aggressive
in limiting softirqs [1], but I never managed to find the time to verify it
- especially its impact on network throughput, as this seems to be the tricky
trade-off (and the tricky thing to verify, for me at least). I'm not sure if
BLOCK softirqs are as sensitive.

I think the proposed approach is not intrusive and offers a good balance; it
is well contained and easy to improve upon in the future. It's protected by
a configuration option, so users that don't want it can easily disable it.

[1] https://gitlab.arm.com/linux-arm/linux-qy/-/commits/core/softirq/


Thanks

--
Qais Yousef

2022-09-28 13:44:02

by Qais Yousef

Subject: Re: [RFC PATCH v3 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On 09/21/22 01:25, John Stultz wrote:
> From: Pavankumar Kondeti <[email protected]>
>
> Defer the softirq processing to ksoftirqd if a RT task is
> running or queued on the current CPU. This complements the RT
> task placement algorithm which tries to find a CPU that is not
> currently busy with softirqs.
>
> Currently NET_TX, NET_RX, BLOCK and TASKLET softirqs are only

Should we mention IRQ_POLL?

I think TASKLET is debatable as I mentioned in my other email.

> deferred as they can potentially run for long time.
>
> Additionally, this patch stubs out ksoftirqd_running() logic,
> in the CONFIG_RT_SOFTIRQ_OPTIMIZATION case, as deferring
> potentially long-running softirqs will cause the logic to not
> process shorter-running softirqs immediately. By stubbing it out
> the potentially long running softirqs are deferred, but the
> shorter running ones can still run immediately.
>
> This patch includes folded-in fixes by:
> Lingutla Chandrasekhar <[email protected]>
> Satya Durga Srinivasu Prabhala <[email protected]>
> J. Avila <[email protected]>
>
> Cc: John Dias <[email protected]>
> Cc: Connor O'Brien <[email protected]>
> Cc: Rick Yiu <[email protected]>
> Cc: John Kacur <[email protected]>
> Cc: Qais Yousef <[email protected]>
> Cc: Chris Redpath <[email protected]>
> Cc: Abhijeet Dharmapurikar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Pavankumar Kondeti <[email protected]>
> [[email protected]: trivial merge conflict resolution.]
> Signed-off-by: Satya Durga Srinivasu Prabhala <[email protected]>
> [elavila: Port to mainline, squash with bugfix]
> Signed-off-by: J. Avila <[email protected]>
> [jstultz: Rebase to linus/HEAD, minor rearranging of code,
> included bug fix Reported-by: Qais Yousef <[email protected]> ]
> Signed-off-by: John Stultz <[email protected]>
> ---
> include/linux/sched.h | 10 ++++++++++
> kernel/sched/cpupri.c | 13 +++++++++++++
> kernel/softirq.c | 25 +++++++++++++++++++++++--
> 3 files changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e7b2f8a5c711..7f76371cbbb0 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1826,6 +1826,16 @@ current_restore_flags(unsigned long orig_flags, unsigned long flags)
>
> extern int cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> extern int task_can_attach(struct task_struct *p, const struct cpumask *cs_effective_cpus);
> +
> +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> +extern bool cpupri_check_rt(void);
> +#else
> +static inline bool cpupri_check_rt(void)
> +{
> + return false;
> +}
> +#endif
> +
> #ifdef CONFIG_SMP
> extern void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask);
> extern int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask);
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index fa9ce9d83683..18dc75d16951 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -64,6 +64,19 @@ static int convert_prio(int prio)
> return cpupri;
> }
>
> +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> +/*
> + * cpupri_check_rt - check if CPU has a RT task
> + * should be called from rcu-sched read section.
> + */
> +bool cpupri_check_rt(void)
> +{
> + int cpu = raw_smp_processor_id();
> +
> + return cpu_rq(cpu)->rd->cpupri.cpu_to_pri[cpu] > CPUPRI_NORMAL;
> +}

Priorities always mess with my brain! I always forget which direction to
look at :D

Hmm I was wondering why not do rt_task(current), but if the task is not running
(which can only indicate there's a DL or a stopper task preempting it), that
won't work. But I think your code has a similar problem; you'll return true
even if there's only a DL task running since we set the priority to
CPUPRI_HIGHER which will cause your condition to return true.

This makes me wonder whether we should enable this optimization for DL tasks
too. Hmm...

That said, is there a reason why we can't remove this function and just call
rt_task(current) directly in softirq_deferred_for_rt()?

If we decided to care about DL we can do rt_task(current) || dl_task(current).
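
i.e. something along these lines (just a sketch to illustrate the idea,
not tested):

static __u32 softirq_deferred_for_rt(__u32 *pending)
{
	__u32 deferred = 0;

	/* Check the current task's class directly instead of going
	 * through cpupri; add "|| dl_task(current)" if we decide DL
	 * should be covered too. */
	if (rt_task(current)) {
		deferred = *pending & LONG_SOFTIRQ_MASK;
		*pending &= ~LONG_SOFTIRQ_MASK;
	}
	return deferred;
}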


Thanks

--
Qais Yousef

> +#endif
> +
> static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
> struct cpumask *lowest_mask, int idx)
> {
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 35ee79dd8786..203a70dc9459 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -87,6 +87,7 @@ static void wakeup_softirqd(void)
> wake_up_process(tsk);
> }
>
> +#ifndef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> /*
> * If ksoftirqd is scheduled, we do not want to process pending softirqs
> * right now. Let ksoftirqd handle this at its own rate, to get fairness,
> @@ -101,6 +102,9 @@ static bool ksoftirqd_running(unsigned long pending)
> return false;
> return tsk && task_is_running(tsk) && !__kthread_should_park(tsk);
> }
> +#else
> +#define ksoftirqd_running(pending) (false)
> +#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
>
> #ifdef CONFIG_TRACE_IRQFLAGS
> DEFINE_PER_CPU(int, hardirqs_enabled);
> @@ -532,6 +536,17 @@ static inline bool lockdep_softirq_start(void) { return false; }
> static inline void lockdep_softirq_end(bool in_hardirq) { }
> #endif
>
> +static __u32 softirq_deferred_for_rt(__u32 *pending)
> +{
> + __u32 deferred = 0;
> +
> + if (cpupri_check_rt()) {
> + deferred = *pending & LONG_SOFTIRQ_MASK;
> + *pending &= ~LONG_SOFTIRQ_MASK;
> + }
> + return deferred;
> +}
> +
> asmlinkage __visible void __softirq_entry __do_softirq(void)
> {
> unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> @@ -539,6 +554,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> int max_restart = MAX_SOFTIRQ_RESTART;
> struct softirq_action *h;
> bool in_hardirq;
> + __u32 deferred;
> __u32 pending;
> int softirq_bit;
>
> @@ -551,13 +567,15 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
> pending = local_softirq_pending();
>
> + deferred = softirq_deferred_for_rt(&pending);
> softirq_handle_begin();
> +
> in_hardirq = lockdep_softirq_start();
> account_softirq_enter(current);
>
> restart:
> /* Reset the pending bitmask before enabling irqs */
> - set_softirq_pending(0);
> + set_softirq_pending(deferred);
> __this_cpu_write(active_softirqs, pending);
>
> local_irq_enable();
> @@ -596,13 +614,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> local_irq_disable();
>
> pending = local_softirq_pending();
> + deferred = softirq_deferred_for_rt(&pending);
> +
> if (pending) {
> if (time_before(jiffies, end) && !need_resched() &&
> --max_restart)
> goto restart;
> + }
>
> + if (pending | deferred)
> wakeup_softirqd();
> - }
>
> account_softirq_exit(current);
> lockdep_softirq_end(in_hardirq);
> --
> 2.37.3.968.ga6b4b080e4-goog
>

2022-09-28 14:10:52

by David Laight

Subject: RE: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

From: Qais Yousef
> Sent: 28 September 2022 14:01
>
> Hi John
>
> On 09/21/22 01:25, John Stultz wrote:
> > Hey all,
> >
> > This series is a set of patches that optimize scheduler decisions around
> > realtime tasks and softirqs. This series is a rebased and reworked set
> > of changes that have been shipping on Android devices for a number of
> > years, originally created to resolve audio glitches seen on devices
> > caused by softirqs for network or storage drivers.
> >
> > Long running softirqs cause issues because they aren’t currently taken
> > into account when a realtime task is woken up, but they will delay
> > realtime tasks from running if the realtime tasks are placed on a cpu
> > currently running a softirq.
>
> Thanks a lot for sending this series. I've raised this problem in various
> venues in the past, but it seems it is hard to do something better than what
> you propose here.
>
> Borrowing some behaviours from PREEMPT_RT (like threadedirqs) won't cut it
> outside PREEMPT_RT AFAIU.
>
> Peter did suggest an alternative at one point in the past to be more aggressive
> in limiting softirqs [1] but I never managed to find the time to verify it
> - especially its impact on network throughput as this seems to be the tricky
> trade-of (and tricky thing to verify for me at least). I'm not sure if BLOCK
> softirqs are as sensitive.

I've had issues with the opposite problem: long-running RT tasks stopping
the softint code from running.

If an RT task is running, the softint will run in the context of the
RT task - so it has priority over it.
If the RT task isn't running, the softint stops the RT task from being scheduled.
This is really just the same.

If the softint defers back to thread context it won't be scheduled
until any RT task finishes. This is the opposite priority.

IIRC there is another strange case where the RT thread has been woken
but isn't yet running - can't remember the exact details.

I can (mostly) handle the RT task being delayed (there are a lot of RT
threads sharing the work) but it is paramount that the ethernet receive
code actually runs - I can't afford to drop packets (they contain audio
the RT threads are processing).

In my case threaded NAPI (mostly) fixes it - provided the NAPI threads are RT.

David


>
> I think the proposed approach is not intrusive and offers a good balance that
> is well contained and easy to improve upon on the future. It's protected with
> a configuration option so users that don't want it can easily disable it.
>
> [1] https://gitlab.arm.com/linux-arm/linux-qy/-/commits/core/softirq/
>
>
> Thanks
>
> --
> Qais Yousef


2022-09-28 16:17:16

by Qais Yousef

Subject: Re: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

On 09/28/22 13:51, David Laight wrote:
> From: Qais Yousef
> > Sent: 28 September 2022 14:01
> >
> > Hi John
> >
> > On 09/21/22 01:25, John Stultz wrote:
> > > Hey all,
> > >
> > > This series is a set of patches that optimize scheduler decisions around
> > > realtime tasks and softirqs. This series is a rebased and reworked set
> > > of changes that have been shipping on Android devices for a number of
> > > years, originally created to resolve audio glitches seen on devices
> > > caused by softirqs for network or storage drivers.
> > >
> > > Long running softirqs cause issues because they aren’t currently taken
> > > into account when a realtime task is woken up, but they will delay
> > > realtime tasks from running if the realtime tasks are placed on a cpu
> > > currently running a softirq.
> >
> > Thanks a lot for sending this series. I've raised this problem in various
> > venues in the past, but it seems it is hard to do something better than what
> > you propose here.
> >
> > Borrowing some behaviours from PREEMPT_RT (like threadedirqs) won't cut it
> > outside PREEMPT_RT AFAIU.
> >
> > Peter did suggest an alternative at one point in the past to be more aggressive
> > in limiting softirqs [1] but I never managed to find the time to verify it
> > - especially its impact on network throughput as this seems to be the tricky
> > trade-of (and tricky thing to verify for me at least). I'm not sure if BLOCK
> > softirqs are as sensitive.
>
> I've had issues with the opposite problem.
> Long running RT tasks stopping the softint code running.
>
> If an RT task is running, the softint will run in the context of the
> RT task - so has priority over it.
> If the RT task isn't running the softint stops the RT task being scheduled.
> This is really just the same.
>
> If the softint defers back to thread context it won't be scheduled
> until any RT task finishes. This is the opposite priority.

If we can get a subset of threadedirqs (call it threadedsoftirqs) from
PREEMPT_RT where softirqs can be converted into RT kthreads, that'll alleviate
both sides of the problem IMO. But last I checked with Thomas this won't be
possible. But things might have changed since then..

>
> IIRC there is another strange case where the RT thread has been woken
> but isn't yet running - can't remember the exact details.
>
> I can (mostly) handle the RT task being delayed (there are a lot of RT
> threads sharing the work) but it is paramount that the ethernet receive
> code actually runs - I can't afford to drop packets (they contain audio
> the RT threads are processing).
>
> In my case threaded NAPI (mostly) fixes it - provided the NAPI thread are RT.

There are netdev_budget and netdev_budget_usecs params in procfs that control
how long NET_RX spends in the softirq. Maybe you need to tweak those too.
In your case, you probably want to increase the budget.

Note that in Android the BLOCK layer seems to cause similar problems which
don't have these NET facilities. So NET is only one side of the problem.


Thanks

--
Qais Yousef

2022-09-28 16:23:16

by David Laight

Subject: RE: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

From: Qais Yousef
> Sent: 28 September 2022 16:56
>
> On 09/28/22 13:51, David Laight wrote:
> > From: Qais Yousef
> > > Sent: 28 September 2022 14:01
> > >
> > > Hi John
> > >
> > > On 09/21/22 01:25, John Stultz wrote:
> > > > Hey all,
> > > >
> > > > This series is a set of patches that optimize scheduler decisions around
> > > > realtime tasks and softirqs. This series is a rebased and reworked set
> > > > of changes that have been shipping on Android devices for a number of
> > > > years, originally created to resolve audio glitches seen on devices
> > > > caused by softirqs for network or storage drivers.
> > > >
> > > > Long running softirqs cause issues because they aren’t currently taken
> > > > into account when a realtime task is woken up, but they will delay
> > > > realtime tasks from running if the realtime tasks are placed on a cpu
> > > > currently running a softirq.
> > >
> > > Thanks a lot for sending this series. I've raised this problem in various
> > > venues in the past, but it seems it is hard to do something better than what
> > > you propose here.
> > >
> > > Borrowing some behaviours from PREEMPT_RT (like threadedirqs) won't cut it
> > > outside PREEMPT_RT AFAIU.
> > >
> > > Peter did suggest an alternative at one point in the past to be more aggressive
> > > in limiting softirqs [1] but I never managed to find the time to verify it
> > > - especially its impact on network throughput as this seems to be the tricky
> > > trade-of (and tricky thing to verify for me at least). I'm not sure if BLOCK
> > > softirqs are as sensitive.
> >
> > I've had issues with the opposite problem.
> > Long running RT tasks stopping the softint code running.
> >
> > If an RT task is running, the softint will run in the context of the
> > RT task - so has priority over it.
> > If the RT task isn't running the softint stops the RT task being scheduled.
> > This is really just the same.
> >
> > If the softint defers back to thread context it won't be scheduled
> > until any RT task finishes. This is the opposite priority.
>
> If we can get a subset of threadedirqs (call it threadedsoftirqs) from
> PREEMPT_RT where softirqs can be converted into RT kthreads, that'll alleviate
> both sides of the problem IMO. But last I checked with Thomas this won't be
> possible. But things might have changed since then..

Part of the problem is that can significantly increase latency.
Some softirq calls will be latency sensitive.

> > IIRC there is another strange case where the RT thread has been woken
> > but isn't yet running - can't remember the exact details.
> >
> > I can (mostly) handle the RT task being delayed (there are a lot of RT
> > threads sharing the work) but it is paramount that the ethernet receive
> > code actually runs - I can't afford to drop packets (they contain audio
> > the RT threads are processing).
> >
> > In my case threaded NAPI (mostly) fixes it - provided the NAPI thread are RT.
>
> There's a netdev_budget and netdev_bugdet_usecs params in procfs that control
> how long the NET_RX spends in the softirq. Maybe you need to tweak those too.
> In your case, you probably want to increase the budget.

Maybe, but the problem is that the softint code is far too willing
to drop to kthread context.
Eric made a change to reduce that (to avoid losing ethernet packets)
but the original test got added back - there are now two tests, but
the original one dominates. Eric's bug fix got reverted (with extra
tests that make the code slower).

I did test with that changed, but still got some lost packets.
Trying to receive 500000 UDP packets/sec is quite hard!
They are also split across 10k unconnected sockets.

> Note that in Android the BLOCK layer seems to cause similar problems which
> don't have these NET facilities. So NET is only one side of the problem.

Isn't that the block layer softints stopping other code?
I've really got the other problem.
Although I do have a 10ms timer wakeup that really must not be delayed.

David


2022-09-29 11:47:11

by Qais Yousef

Subject: Re: [RFC][PATCH v3 0/3] Softirq -rt Optimizations

On 09/28/22 16:19, David Laight wrote:
> From: Qais Yousef
> > Sent: 28 September 2022 16:56
> >
> > On 09/28/22 13:51, David Laight wrote:
> > > From: Qais Yousef
> > > > Sent: 28 September 2022 14:01
> > > >
> > > > Hi John
> > > >
> > > > On 09/21/22 01:25, John Stultz wrote:
> > > > > Hey all,
> > > > >
> > > > > This series is a set of patches that optimize scheduler decisions around
> > > > > realtime tasks and softirqs. This series is a rebased and reworked set
> > > > > of changes that have been shipping on Android devices for a number of
> > > > > years, originally created to resolve audio glitches seen on devices
> > > > > caused by softirqs for network or storage drivers.
> > > > >
> > > > > Long running softirqs cause issues because they aren’t currently taken
> > > > > into account when a realtime task is woken up, but they will delay
> > > > > realtime tasks from running if the realtime tasks are placed on a cpu
> > > > > currently running a softirq.
> > > >
> > > > Thanks a lot for sending this series. I've raised this problem in various
> > > > venues in the past, but it seems it is hard to do something better than what
> > > > you propose here.
> > > >
> > > > Borrowing some behaviours from PREEMPT_RT (like threadedirqs) won't cut it
> > > > outside PREEMPT_RT AFAIU.
> > > >
> > > > Peter did suggest an alternative at one point in the past to be more aggressive
> > > > in limiting softirqs [1] but I never managed to find the time to verify it
> > > > - especially its impact on network throughput as this seems to be the tricky
> > > > trade-of (and tricky thing to verify for me at least). I'm not sure if BLOCK
> > > > softirqs are as sensitive.
> > >
> > > I've had issues with the opposite problem.
> > > Long running RT tasks stopping the softint code running.
> > >
> > > If an RT task is running, the softint will run in the context of the
> > > RT task - so has priority over it.
> > > If the RT task isn't running the softint stops the RT task being scheduled.
> > > This is really just the same.
> > >
> > > If the softint defers back to thread context it won't be scheduled
> > > until any RT task finishes. This is the opposite priority.
> >
> > If we can get a subset of threadedirqs (call it threadedsoftirqs) from
> > PREEMPT_RT where softirqs can be converted into RT kthreads, that'll alleviate
> > both sides of the problem IMO. But last I checked with Thomas this won't be
> > possible. But things might have changed since then..
>
> Part of the problem is that can significantly increase latency.
> Some softirq calls will be latency sensitive.

Probably part of the problem why it can't be made available outside PREEMPT_RT
:)

>
> > > IIRC there is another strange case where the RT thread has been woken
> > > but isn't yet running - can't remember the exact details.
> > >
> > > I can (mostly) handle the RT task being delayed (there are a lot of RT
> > > threads sharing the work) but it is paramount that the ethernet receive
> > > code actually runs - I can't afford to drop packets (they contain audio
> > > the RT threads are processing).
> > >
> > > In my case threaded NAPI (mostly) fixes it - provided the NAPI thread are RT.
> >
> > There's a netdev_budget and netdev_bugdet_usecs params in procfs that control
> > how long the NET_RX spends in the softirq. Maybe you need to tweak those too.
> > In your case, you probably want to increase the budget.
>
> Maybe, but the problem is that the softint code is far too willing
> to drop to kthread context.
> Eric made a change to reduce that (to avoid losing ethernet packets)
> but the original test got added back - there are now two tests, but
> the original one dominates. Eric's bug fix got reverted (with extra
> tests that make the code slower).

Would be good to know what fix you're referring to.

> I did test with that changed, but still got some lost packets.
> Trying to receive 500000 UDP packets/sec is quite hard!
> They are also split across 10k unconnected sockets.

There's a hardcoded value in kernel/softirq.c::MAX_SOFTIRQ_TIME which is set to
2ms.
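
For reference, the relevant limits in kernel/softirq.c look like this in
mainline (quoted from memory, so double-check the tree you're on):

/* __do_softirq() keeps restarting while work is pending, but only for up
 * to MAX_SOFTIRQ_TIME or MAX_SOFTIRQ_RESTART iterations; after that the
 * remaining work is punted to ksoftirqd. */
#define MAX_SOFTIRQ_TIME	msecs_to_jiffies(2)
#define MAX_SOFTIRQ_RESTART	10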

It might be worth raising your problem with the networking community. I don't
think your use case is unique - but they'd know better what needs to be
done to achieve it.

Note there's a physical upper limit that will be dictated by the hardware,
whether it's the number of cores, max frequencies, memory (speed and size), etc.

I'm assuming this is not a problem, but it's worth highlighting.

> > Note that in Android the BLOCK layer seems to cause similar problems which
> > don't have these NET facilities. So NET is only one side of the problem.
>
> Isn't the block layer softints stopping other code?
> I'd really got the other problem.
> Although I do have a 10ms timer wakeup that really needs not to be delayed.

I was just trying to highlight that this series is concerned with more than
just networking.

I thought you had concerns about this series, but it seems you're trying to
highlight another type of relevant problem.


Cheers

--
Qais Yousef

2022-10-03 17:26:09

by John Stultz

Subject: Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

On Wed, Sep 28, 2022 at 5:55 AM Qais Yousef <[email protected]> wrote:
> On 09/21/22 01:25, John Stultz wrote:
> > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > index a749a8663841..1d126b8495bc 100644
> > --- a/include/linux/interrupt.h
> > +++ b/include/linux/interrupt.h
> > @@ -582,6 +582,12 @@ enum
> > * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> > */
> > #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> > +/* Softirq's where the handling might be long: */
> > +#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
> > + (1 << NET_RX_SOFTIRQ) | \
> > + (1 << BLOCK_SOFTIRQ) | \
> > + (1 << IRQ_POLL_SOFTIRQ) | \
> > + (1 << TASKLET_SOFTIRQ))
>
> I'm not sure about the TASKLET. I can understand NET and BLOCK require high
> throughput, hence could end up in softirq for a long time. But TASKLET seems
> allowing sloppiness. I don't feel strongly about it, but worth debating if we
> really need to include it.

That's fair. Digging through the patch history in the Android trees,
the first pass covered all softirqs but was then restricted to exclude
known short-running ones.
From the bug history and what I can directly reproduce, the net and
block softirqs have definitely caused trouble, but I don't see a
specific example from TASKLET, so I'm ok dropping it for now; should
we get specific evidence, we can argue for it in a future patch.

So I'll drop TASKLET from the list here. Thanks for the suggestion!
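
So for v4 the mask would become something like (just a sketch of the same
define minus TASKLET):

/* Softirq's where the handling might be long: */
#define LONG_SOFTIRQ_MASK	((1 << NET_TX_SOFTIRQ) | \
				 (1 << NET_RX_SOFTIRQ) | \
				 (1 << BLOCK_SOFTIRQ)  | \
				 (1 << IRQ_POLL_SOFTIRQ))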

> > @@ -1284,6 +1284,16 @@ config SCHED_AUTOGROUP
> > desktop applications. Task group autogeneration is currently based
> > upon task session.
> >
> > +config RT_SOFTIRQ_OPTIMIZATION
> > + bool "Improve RT scheduling during long softirq execution"
> > + depends on SMP
>
> Not sure if we need to depend on !PREEMPT_RT. This optimization might not be
> desired for those systems. I'll defer to PREEMPT_RT folks to decide on that.

Probably a safer default to turn it off w/ PREEMPT_RT. Thanks for the
suggestion!


> > +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> > +/*
> > + * Return whether the task on the given cpu is currently non-preemptible
> > + * while handling a potentially long softirq, or if the task is likely
> > + * to block preemptions soon because it is a ksoftirq thread that is
> > + * handling slow softirq.
> > + */
> > +static bool task_may_preempt(struct task_struct *task, int cpu)
> > +{
> > + u32 softirqs = per_cpu(active_softirqs, cpu) |
> > + __cpu_softirq_pending(cpu);
> > + struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
> > + struct task_struct *curr;
> > + struct rq *rq = cpu_rq(cpu);
> > + int ret;
> > +
> > + rcu_read_lock();
> > + curr = READ_ONCE(rq->curr); /* unlocked access */
> > + ret = !((softirqs & LONG_SOFTIRQ_MASK) &&
> > + (curr == cpu_ksoftirqd ||
> > + preempt_count() & SOFTIRQ_MASK));
> > + rcu_read_unlock();
> > + return ret;
> > +}
> > +#else
> > +static bool task_may_preempt(struct task_struct *task, int cpu)
> > +{
> > + return true;
> > +}
> > +#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */
> > +
> > +static bool rt_task_fits_capacity_and_may_preempt(struct task_struct *p, int cpu)
> > +{
> > + return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
> > +}
>
> Maybe better to rename to rt_task_fits_cpu() and make it generic?
>
> See below for more on that.
>
> > @@ -1641,9 +1683,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > * requirement of the task - which is only important on heterogeneous
> > * systems like big.LITTLE.
> > */
> > - test = curr &&
> > - unlikely(rt_task(curr)) &&
> > - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> > + may_not_preempt = !task_may_preempt(curr, cpu);
> > + test = (curr && (may_not_preempt ||
> > + (unlikely(rt_task(curr)) &&
> > + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio))));
>
> I think this is unnecessary if you create a new rt_task_fits_cpu() and ...
>
> >
> > if (test || !rt_task_fits_capacity(p, cpu)) {
>
> ... replace the call to rt_task_fits_capacity() with the new
> rt_task_fits_cpu()?


But is that really the same logic? Above we're doing:
if ((!task_may_preempt(curr, cpu) || <other stuff>) ||
    !rt_task_fits_capacity(p, cpu))

And you're suggesting switching it to
if (<other stuff> || !rt_task_fits_cpu(p, cpu))
which would expand to:
if (<other stuff> || !(task_may_preempt(p, cpu) &&
    rt_task_fits_capacity(p, cpu)))

I worry we would be skipping the part where we compare against curr.


> > int target = find_lowest_rq(p);
> > @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > goto out_unlock;
> >
> > /*
> > - * Don't bother moving it if the destination CPU is
> > + * If cpu is non-preemptible, prefer remote cpu
> > + * even if it's running a higher-prio task.
> > + * Otherwise: Don't bother moving it if the destination CPU is
> > * not running a lower priority task.
> > */
> > if (target != -1 &&
> > - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> > + (may_not_preempt ||
> > + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> > cpu = target;
>
> I'm not sure this makes sense. You assume a higher priority task will cause
> less delay than softirqs. Which I think is an optimistic assumption?
>
> I think we should just mimic the same fallback behavior when we fail to find
> a CPU that fits the capacity requirement. Keeps things more consistent IMO.

This sounds reasonable. I do fret that long-running rt tasks are less
common than the long-running softirqs, so this may have an impact on
the effectiveness of the patch, but I also suspect it's even more rare
to have all the other cpus busy with rt tasks, so it's probably very
unlikely.



> > }
> >
> > @@ -1901,11 +1947,11 @@ static int find_lowest_rq(struct task_struct *task)
> >
> > ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> > task, lowest_mask,
> > - rt_task_fits_capacity);
> > + rt_task_fits_capacity_and_may_preempt);
> > } else {
> >
> > - ret = cpupri_find(&task_rq(task)->rd->cpupri,
> > - task, lowest_mask);
> > + ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
> > + task, lowest_mask, task_may_preempt);
>
> I think we can simplify the code here to call cpupri_find_fitness()
> unconditionally passing the new rt_task_fits_cpu(). rt_task_fits_capacity()
> will always return true if the static key is disabled or uclamp is not
> configured. So rt_task_fits_cpu() will effectively depend on/boil down to
> task_may_preempt() in these cases.
>
> Note that I had this called unconditionally in the past, but Steve suggested
> doing it this way, IIRC, to avoid the cost of calling the fitness function when
> we don't need to. I'm not sure it matters to be honest, but if you follow my
> suggestion you might be asked to avoid the costs for the users who don't care
> about the fitness criteria.

I'll take another pass at it and see what I can do.
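
FWIW, my first reading of your suggestion is roughly the following (just an
untested sketch, using your rt_task_fits_cpu() name):

static bool rt_task_fits_cpu(struct task_struct *p, int cpu)
{
	/*
	 * task_may_preempt() is stubbed to return true when
	 * CONFIG_RT_SOFTIRQ_OPTIMIZATION is not set, and
	 * rt_task_fits_capacity() returns true when uclamp isn't
	 * active, so this boils down to the right check in each case.
	 */
	return task_may_preempt(p, cpu) && rt_task_fits_capacity(p, cpu);
}

with find_lowest_rq() then calling cpupri_find_fitness() unconditionally:

	ret = cpupri_find_fitness(&task_rq(task)->rd->cpupri,
				  task, lowest_mask, rt_task_fits_cpu);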

Thanks so much for the feedback!
-john

2022-10-03 18:20:54

by John Stultz

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

On Mon, Oct 3, 2022 at 9:55 AM John Stultz <[email protected]> wrote:
> On Wed, Sep 28, 2022 at 5:55 AM Qais Yousef <[email protected]> wrote:
> > On 09/21/22 01:25, John Stultz wrote:
> > > @@ -1641,9 +1683,10 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > > * requirement of the task - which is only important on heterogeneous
> > > * systems like big.LITTLE.
> > > */
> > > - test = curr &&
> > > - unlikely(rt_task(curr)) &&
> > > - (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio);
> > > + may_not_preempt = !task_may_preempt(curr, cpu);
> > > + test = (curr && (may_not_preempt ||
> > > + (unlikely(rt_task(curr)) &&
> > > + (curr->nr_cpus_allowed < 2 || curr->prio <= p->prio))));
> >
> > I think this is unnecessary if you create a new rt_task_fits_cpu() and ...
> >
> > >
> > > if (test || !rt_task_fits_capacity(p, cpu)) {
> >
> > ... replace the call to rt_task_fits_capacity() with the new
> > rt_task_fits_cpu()?
>
>
> But is that really the same logic? Above we're doing:
> if ((!task_may_preempt(curr, cpu) || <other stuff>) ||
>     !rt_task_fits_capacity(p, cpu))
>
> And you're suggesting switching it to
> if (<other stuff> || !rt_task_fits_cpu(p, cpu))
> which would expand to:
> if (<other stuff> || !(task_may_preempt(p, cpu) &&
>     rt_task_fits_capacity(p, cpu)))
>
> I worry we would be skipping the part where we compare against curr.

Ignore this bit, I've not finished my coffee.

I was mixing up an earlier version of the patch where the task passed
in to task_may_preempt() was compared with the ksoftirqd (which didn't
seem right), and I've since switched it to comparing curr on the cpu
with the ksoftirqd, making the task passed in unused.

I'm reworking this to be less confusing (renaming this to
cpu_busy_with_softirqs()), and will try to take your larger suggestion
here.
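
Roughly what I have in mind for the rename is something like this (only a
sketch, not yet tested - it is just the current task_may_preempt() with the
unused task argument dropped and the sense inverted):

#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
/*
 * Return whether the given cpu is currently non-preemptible
 * while handling a potentially long softirq, or if the cpu is
 * likely to block preemptions soon because it is running
 * ksoftirqd with long softirqs pending.
 */
static bool cpu_busy_with_softirqs(int cpu)
{
	u32 softirqs = per_cpu(active_softirqs, cpu) |
		       __cpu_softirq_pending(cpu);
	struct task_struct *cpu_ksoftirqd = per_cpu(ksoftirqd, cpu);
	struct task_struct *curr;
	struct rq *rq = cpu_rq(cpu);
	bool busy;

	rcu_read_lock();
	curr = READ_ONCE(rq->curr); /* unlocked access */
	busy = (softirqs & LONG_SOFTIRQ_MASK) &&
	       (curr == cpu_ksoftirqd ||
		preempt_count() & SOFTIRQ_MASK);
	rcu_read_unlock();

	return busy;
}
#else
static bool cpu_busy_with_softirqs(int cpu)
{
	return false;
}
#endif /* CONFIG_RT_SOFTIRQ_OPTIMIZATION */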

thanks
-john

2022-10-03 18:57:17

by John Stultz

[permalink] [raw]
Subject: Re: [RFC PATCH v3 3/3] softirq: defer softirq processing to ksoftirqd if CPU is busy with RT

On Wed, Sep 28, 2022 at 5:56 AM Qais Yousef <[email protected]> wrote:
>
> On 09/21/22 01:25, John Stultz wrote:
> > From: Pavankumar Kondeti <[email protected]>
> >
> > Defer the softirq processing to ksoftirqd if a RT task is
> > running or queued on the current CPU. This complements the RT
> > task placement algorithm which tries to find a CPU that is not
> > currently busy with softirqs.
> >
> > Currently NET_TX, NET_RX, BLOCK and TASKLET softirqs are only
>
> Should we mention IRQ_POLL?

Ah, yes. Thank you for pointing that out.

> I think TASKLET is debatable as I mentioned in my other email.

Yeah, I've dropped it for now.


> > +#ifdef CONFIG_RT_SOFTIRQ_OPTIMIZATION
> > +/*
> > + * cpupri_check_rt - check if CPU has a RT task
> > + * should be called from rcu-sched read section.
> > + */
> > +bool cpupri_check_rt(void)
> > +{
> > + int cpu = raw_smp_processor_id();
> > +
> > + return cpu_rq(cpu)->rd->cpupri.cpu_to_pri[cpu] > CPUPRI_NORMAL;
> > +}
>
> Priorities always mess up with my brain! I always forget which direction to
> look at :D

Yeah, cpu_pri logic in particular (as it also depends on which version
you're looking at - the original version of this patch against an
older kernel had an off-by-one error that took a while to find).

> Hmm I was wondering why not do rt_task(current), but if the task is not running
> (which can only indicate there's a DL or a stopper task preempting it), that
> won't work. But I think your code has a similar problem; you'll return true
> even if there's only a DL task running since we set the priority to
> CPUPRI_HIGHER which will cause your condition to return true.
>
> This makes me think if we should enable this optimization for DL tasks too.
> Hmm...
>
> That said, is there a reason why we can't remove this function and just call
> rt_task(current) directly in softirq_deferred_for_rt()?
>

I had thought similarly, but had hesitated to switch in case there was
some subtlety I wasn't seeing.
But I think you've persuaded me to simplify this.
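
i.e. something along these lines (rough sketch - the helper shape is just
approximated from patch 3, with only the cpupri_check_rt() call replaced):

/*
 * Defer potentially long softirqs to ksoftirqd when an RT task is
 * current on this CPU, checking rt_task(current) directly instead
 * of going through cpupri.
 */
static __u32 softirq_deferred_for_rt(__u32 *pending)
{
	__u32 deferred = 0;

	if (rt_task(current)) {
		deferred = *pending & LONG_SOFTIRQ_MASK;
		*pending &= ~LONG_SOFTIRQ_MASK;
	}
	return deferred;
}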

Thanks again for the feedback and suggestions!
-john

2022-10-04 09:57:35

by David Laight

[permalink] [raw]
Subject: RE: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

From: John Stultz
> Sent: 03 October 2022 17:55
>
> On Wed, Sep 28, 2022 at 5:55 AM Qais Yousef <[email protected]> wrote:
> > On 09/21/22 01:25, John Stultz wrote:
> > > diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
> > > index a749a8663841..1d126b8495bc 100644
> > > --- a/include/linux/interrupt.h
> > > +++ b/include/linux/interrupt.h
> > > @@ -582,6 +582,12 @@ enum
> > > * _ IRQ_POLL: irq_poll_cpu_dead() migrates the queue
> > > */
> > > #define SOFTIRQ_HOTPLUG_SAFE_MASK (BIT(RCU_SOFTIRQ) | BIT(IRQ_POLL_SOFTIRQ))
> > > +/* Softirq's where the handling might be long: */
> > > +#define LONG_SOFTIRQ_MASK ((1 << NET_TX_SOFTIRQ) | \
> > > + (1 << NET_RX_SOFTIRQ) | \
> > > + (1 << BLOCK_SOFTIRQ) | \
> > > + (1 << IRQ_POLL_SOFTIRQ) | \
> > > + (1 << TASKLET_SOFTIRQ))
> >
> > I'm not sure about the TASKLET. I can understand NET and BLOCK require high
> > throughput, hence they could end up in softirq for a long time. But TASKLET
> > seems to allow sloppiness. I don't feel strongly about it, but it's worth
> > debating whether we really need to include it.
>
> That's fair. Digging through the patch history in the Android trees,
> the first pass was for all softirqs but then restricted to remove
> known short-running ones.
> From the bug history and what I can directly reproduce, the net and
> block softirqs have definitely caused trouble, but I don't see a
> specific example from TASKLET, so I'm ok dropping that for now, and
> should we get specific evidence we can argue for it in a future patch.
>
> So I'll drop TASKLET from the list here. Thanks for the suggestion!

I've also seen the code that finally frees memory freed under rcu
take a long time.
That was a workload sending a lot of UDP/RTP from a raw socket using
IP_HDRINCL - each send allocated a structure (fib?) that was freed from
the rcu (timer?) softint callback.

But, actually, one of the biggest causes of RT wakeup latency
was a normal thread looping without a cond_resched() call.
In my case some graphics driver doing page flushes of the
display memory.
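
The usual fix for that sort of loop is to add explicit reschedule points,
something like this (generic sketch, not the actual driver code -
flush_dcache_page() here is just a stand-in for the per-page work):

static void flush_display_pages(struct page **pages, unsigned long nr)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		/* whatever per-page work the driver does */
		flush_dcache_page(pages[i]);
		/* let a woken RT task run promptly on kernels
		 * without full preemption */
		cond_resched();
	}
}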

...
> > > int target = find_lowest_rq(p);
> > > @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > > goto out_unlock;
> > >
> > > /*
> > > - * Don't bother moving it if the destination CPU is
> > > + * If cpu is non-preemptible, prefer remote cpu
> > > + * even if it's running a higher-prio task.
> > > + * Otherwise: Don't bother moving it if the destination CPU is
> > > * not running a lower priority task.
> > > */
> > > if (target != -1 &&
> > > - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> > > + (may_not_preempt ||
> > > + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> > > cpu = target;
> >
> > I'm not sure this makes sense. You assume a higher priority task will cause
> > less delay than softirqs. Which I think is an optimistic assumption?
> >
> > I think we should just mimic the same fallback behavior when we fail to find
> > a CPU that fits the capacity requirement. Keeps things more consistent IMO.
>
> This sounds reasonable. I do fret that long-running rt tasks are less
> common than the long-running softirqs, so this may have an impact on
> the effectiveness of the patch, but I also suspect it's even more rare
> to have all the other cpus busy with rt tasks, so it's probably very
> unlikely.

I've a workload that is very much like that :-)
The same RTP audio program.
Running 35 RT threads (on 40 cpus) that all might run for
9ms in every 10ms.
The other 5 cpus might also be running RT threads since I
have to use threaded NAPI and make the NAPI threads RT
in order to avoid dropping packets.
Most of the wakeups can just wait for the previous cpu
to become available, only the sleep on a high-res timer
would benefit from changing the cpu.

The real scheduling problem wasn't actually wakeups at all.
The problem was the softint code running while the RT thread
held a cv - which stopped all the other threads in their
tracks.
I had to replace all the 'hot' cv with atomic operations.

The only 'wakeup' problem I had was that cv_broadcast() woke
each RT task in turn - so if one waited for the softint code
to finish then so would all the rest.
(Fixed by using a separate cv for each thread.)

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-10-05 15:35:47

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

On 10/03/22 09:55, John Stultz wrote:

[...]

> > > int target = find_lowest_rq(p);
> > > @@ -1656,11 +1699,14 @@ select_task_rq_rt(struct task_struct *p, int cpu, int flags)
> > > goto out_unlock;
> > >
> > > /*
> > > - * Don't bother moving it if the destination CPU is
> > > + * If cpu is non-preemptible, prefer remote cpu
> > > + * even if it's running a higher-prio task.
> > > + * Otherwise: Don't bother moving it if the destination CPU is
> > > * not running a lower priority task.
> > > */
> > > if (target != -1 &&
> > > - p->prio < cpu_rq(target)->rt.highest_prio.curr)
> > > + (may_not_preempt ||
> > > + p->prio < cpu_rq(target)->rt.highest_prio.curr))
> > > cpu = target;
> >
> > I'm not sure this makes sense. You assume a higher priority task will cause
> > less delay than softirqs. Which I think is an optimistic assumption?
> >
> > I think we should just mimic the same fallback behavior when we fail to find
> > a CPU that fits the capacity requirement. Keeps things more consistent IMO.
>
> This sounds reasonable. I do fret that long-running rt tasks are less
> common than the long-running softirqs, so this may have an impact on
> the effectiveness of the patch, but I also suspect it's even more rare
> to have all the other cpus busy with rt tasks, so it's probably very
> unlikely.

Yes. I think it is a hard problem to hit, as all these other RT tasks must be
higher priority. So if this ever happens, one should question whether the
priority is set correctly for the audio threads first, or why there are so
many higher-priority tasks running for so long.


Thanks!

--
Qais Yousef

2022-10-05 16:18:38

by Qais Yousef

[permalink] [raw]
Subject: Re: [RFC PATCH v3 2/3] sched: Avoid placing RT threads on cores handling long softirqs

On 10/04/22 09:50, David Laight wrote:

[...]

> > That's fair. Digging through the patch history in the Android trees,
> > the first pass was for all softirqs but then restricted to remove
> > known short-running ones.
> > From the bug history and what I can directly reproduce, the net and
> > block softirqs have definitely caused trouble, but I don't see a
> > specific example from TASKLET, so I'm ok dropping that for now, and
> > should we get specific evidence we can argue for it in a future patch.
> >
> > So I'll drop TASKLET from the list here. Thanks for the suggestion!
>
> I've also seen the code that finally frees memory freed under rcu
> take a long time.
> That was a workload sending a lot of UDP/RTP from a raw socket using
> > IP_HDRINCL - each send allocated a structure (fib?) that was freed from
> the rcu (timer?) softint callback.

I'm assuming this is a network driver that is using an RCU callback to free
some memory?

>
> But, actually, one of the biggest causes of RT wakeup latency
> was a normal thread looping without a cond_resched() call.
> In my case some graphics driver doing page flushes of the
> display memory.

Were the drivers that caused these problems in-tree or out-of-tree?


Thanks

--
Qais Yousef