2018-01-19 15:47:01

by Frederic Weisbecker

Subject: [RFC PATCH 0/4] softirq: Per vector threading v3

As per Linus' suggestion, this take doesn't limit the number of occurrences
per jiffy anymore but instead defers a vector to workqueues as soon as
it gets re-enqueued on the IRQ tail.

No tunable here, so testing should be easier.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
softirq/thread-v3

HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4

Thanks,
Frederic
---

Frederic Weisbecker (4):
softirq: Limit vector to a single iteration on IRQ tail
softirq: Per vector deferment to workqueue
softirq: Defer to workqueue when rescheduling is needed
softirq: Replace ksoftirqd with workqueues entirely


Documentation/RCU/stallwarn.txt | 4 +-
include/linux/interrupt.h | 7 +-
kernel/sched/cputime.c | 12 +--
kernel/sched/sched.h | 4 +-
kernel/softirq.c | 223 +++++++++++++++++++++++-----------------
net/ipv4/tcp_output.c | 5 +-
6 files changed, 144 insertions(+), 111 deletions(-)


2018-01-19 15:47:30

by Frederic Weisbecker

Subject: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

Some softirq vectors can be more CPU hungry than others. Networking
especially may sometimes have to deal with packet storms and need more
CPU time than IRQ tail processing can offer without inducing scheduler
latencies. In this case the current code defers to ksoftirqd, which
behaves more nicely. But this nice behaviour can be bad for other
vectors that usually need quick processing.

To solve this, we defer to threading only the vectors that got
re-enqueued during IRQ tail processing, and leave the others inline on
IRQ tail service. This is achieved using workqueues with
per-CPU/per-vector worklets.

Note ksoftirqd is not yet removed as it is still needed for threaded IRQs
mode.
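
For illustration, the core of the deferral can be condensed into the
following hypothetical helper (a simplified paraphrase of what
do_softirq_workqueue() in the patch below does for each re-enqueued
vector; not actual code from the patch):

static void defer_requeued_vector(struct softirq *softirq, unsigned int vec_nr)
{
	/* Mark the vector as owned by the workqueue so that IRQ tail
	 * processing and set_softirq_pending() leave it alone... */
	softirq->pending_work_mask |= BIT(vec_nr);

	/* ...and let the per-CPU worker serve it, scheduled fairly
	 * against the tasks running on this CPU. */
	schedule_work_on(smp_processor_id(), &softirq->vector[vec_nr].work);
}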

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
---
include/linux/interrupt.h | 2 +
kernel/sched/cputime.c | 5 +-
kernel/softirq.c | 120 ++++++++++++++++++++++++++++++++++++++++++----
net/ipv4/tcp_output.c | 3 +-
4 files changed, 119 insertions(+), 11 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 69c2382..92d044d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -514,6 +514,8 @@ static inline struct task_struct *this_cpu_ksoftirqd(void)
return this_cpu_read(ksoftirqd);
}

+extern int softirq_serving_workqueue(void);
+
/* Tasklets --- multithreaded analogue of BHs.

Main feature differing them of generic softirqs: tasklet
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index bac6ac9..30f70e5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -71,7 +71,8 @@ void irqtime_account_irq(struct task_struct *curr)
*/
if (hardirq_count())
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+ else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() &&
+ !softirq_serving_workqueue())
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
EXPORT_SYMBOL_GPL(irqtime_account_irq);
@@ -375,7 +376,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,

cputime -= other;

- if (this_cpu_ksoftirqd() == p) {
+ if (this_cpu_ksoftirqd() == p || softirq_serving_workqueue()) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
diff --git a/kernel/softirq.c b/kernel/softirq.c
index c8c6841..becb1d9 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
"TASKLET", "SCHED", "HRTIMER", "RCU"
};

+struct vector {
+ int nr;
+ struct work_struct work;
+};
+
+struct softirq {
+ unsigned int pending_work_mask;
+ int work_running;
+ struct vector vector[NR_SOFTIRQS];
+};
+
+static DEFINE_PER_CPU(struct softirq, softirq_cpu);
+
/*
* we cannot loop indefinitely here to avoid userspace starvation,
* but we also don't want to introduce a worst case 1/HZ latency
@@ -223,8 +236,77 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+int softirq_serving_workqueue(void)
+{
+ return __this_cpu_read(softirq_cpu.work_running);
+}
+
+static void vector_work_func(struct work_struct *work)
+{
+ struct vector *vector = container_of(work, struct vector, work);
+ struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
+ int vec_nr = vector->nr;
+ int vec_bit = BIT(vec_nr);
+ u32 pending;
+
+ local_irq_disable();
+ pending = local_softirq_pending();
+ account_irq_enter_time(current);
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
+ lockdep_softirq_enter();
+ set_softirq_pending(pending & ~vec_bit);
+ local_irq_enable();
+
+ if (pending & vec_bit) {
+ struct softirq_action *sa = &softirq_vec[vec_nr];
+
+ kstat_incr_softirqs_this_cpu(vec_nr);
+ softirq->work_running = 1;
+ trace_softirq_entry(vec_nr);
+ sa->action(sa);
+ trace_softirq_exit(vec_nr);
+ softirq->work_running = 0;
+ }
+
+ local_irq_disable();
+
+ pending = local_softirq_pending();
+ if (pending & vec_bit)
+ schedule_work_on(smp_processor_id(), &vector->work);
+ else
+ softirq->pending_work_mask &= ~vec_bit;
+
+ lockdep_softirq_exit();
+ account_irq_exit_time(current);
+ __local_bh_enable(SOFTIRQ_OFFSET);
+ local_irq_enable();
+}
+
+static void do_softirq_workqueue(u32 pending)
+{
+ struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
+ struct softirq_action *h = softirq_vec;
+ int softirq_bit;
+
+ pending &= ~softirq->pending_work_mask;
+
+ while ((softirq_bit = ffs(pending))) {
+ struct vector *vector;
+ unsigned int vec_nr;
+
+ h += softirq_bit - 1;
+ vec_nr = h - softirq_vec;
+ softirq->pending_work_mask |= BIT(vec_nr);
+ vector = &softirq->vector[vec_nr];
+ schedule_work_on(smp_processor_id(), &vector->work);
+ h++;
+ pending >>= softirq_bit;
+ }
+}
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
+ struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
unsigned long old_flags = current->flags;
struct softirq_action *h;
bool in_hardirq;
@@ -238,15 +320,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
*/
current->flags &= ~PF_MEMALLOC;

- pending = local_softirq_pending();
+ /* Ignore vectors pending on workqueues, they have been punished */
+ pending = local_softirq_pending() & ~softirq->pending_work_mask;
account_irq_enter_time(current);

__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
in_hardirq = lockdep_softirq_start();
-
restart:
- /* Reset the pending bitmask before enabling irqs */
- set_softirq_pending(0);
+ /*
+ * Reset the pending bitmask before enabling irqs but keep
+ * those pending on workqueues so they get properly handled there.
+ */
+ set_softirq_pending(softirq->pending_work_mask);

local_irq_enable();

@@ -282,12 +367,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
rcu_bh_qs();
local_irq_disable();

- pending = local_softirq_pending();
+ pending = local_softirq_pending() & ~softirq->pending_work_mask;
if (pending) {
- if (pending & executed || need_resched())
+ if (need_resched()) {
wakeup_softirqd();
- else
- goto restart;
+ } else {
+ /* Vectors that got re-enqueued are threaded */
+ if (executed & pending)
+ do_softirq_workqueue(executed & pending);
+ pending &= ~executed;
+ if (pending)
+ goto restart;
+ }
}

lockdep_softirq_end(in_hardirq);
@@ -624,10 +715,23 @@ void __init softirq_init(void)
int cpu;

for_each_possible_cpu(cpu) {
+ struct softirq *softirq;
+ int i;
+
per_cpu(tasklet_vec, cpu).tail =
&per_cpu(tasklet_vec, cpu).head;
per_cpu(tasklet_hi_vec, cpu).tail =
&per_cpu(tasklet_hi_vec, cpu).head;
+
+ softirq = &per_cpu(softirq_cpu, cpu);
+
+ for (i = 0; i < NR_SOFTIRQS; i++) {
+ struct vector *vector;
+
+ vector = &softirq->vector[i];
+ vector->nr = i;
+ INIT_WORK(&vector->work, vector_work_func);
+ }
}

open_softirq(TASKLET_SOFTIRQ, tasklet_action);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c..b4e4160 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -919,7 +919,8 @@ void tcp_wfree(struct sk_buff *skb)
* - chance for incoming ACK (processed by another cpu maybe)
* to migrate this flow (skb->ooo_okay will be eventually set)
*/
- if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
+ if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
+ (this_cpu_ksoftirqd() == current || softirq_serving_workqueue()))
goto out;

for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
--
2.7.4


2018-01-19 15:47:48

by Frederic Weisbecker

Subject: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

Softirqs are currently allowed to execute as much pending work as needed,
as long as they don't run for more than 10 iterations or 2 ms. Beyond that
limit, the rest of the pending work is deferred to ksoftirqd.

In fact 2 ms is a very high threshold that shouldn't be reachable after
only 10 iterations. And just counting iterations doesn't help diagnose
which vector forces softirqs to be deferred to a thread at the expense
of other vectors that could need quick service. Instead of blindly
kicking ksoftirqd when the softirq load becomes too high, we want to
offload only the culprit vector and let the other vectors be handled
on IRQ tail as usual.

To achieve this, we keep track of which vectors have run on past
iterations and kick them off to threading if they get re-enqueued.
We assume that a vector only needs one iteration under normal load,
while re-enqueuing is a sign of stress, so this should be a simple yet
reliable way to detect which vector needs to be offloaded to a task.

For now, vectors that get re-enqueued trigger ksoftirqd, but they are
going to be handled by per-vector workqueues in subsequent patches.
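
Condensed, the resulting end of the __do_softirq() loop looks roughly
like this (a simplified paraphrase of the hunks below, details omitted):

	__u32 pending, executed = 0;

	/* ... the usual restart loop runs each pending vector once and
	 *     records  executed |= 1 << vec_nr;  for every vector served ... */

	pending = local_softirq_pending();
	if (pending) {
		/* A vector we already served got raised again, or a task
		 * is waiting: stop looping inline. */
		if (pending & executed || need_resched())
			wakeup_softirqd();
		else
			goto restart;
	}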

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
---
kernel/softirq.c | 30 +++++++-----------------------
1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f..c8c6841 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -190,22 +190,6 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
}
EXPORT_SYMBOL(__local_bh_enable_ip);

-/*
- * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
- * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
- * certain cases, such as stop_machine(), jiffies may cease to
- * increment and so we need the MAX_SOFTIRQ_RESTART limit as
- * well to make sure we eventually return from this method.
- *
- * These limits have been established via experimentation.
- * The two things to balance is latency against fairness -
- * we want to handle softirqs as soon as possible, but they
- * should not be able to lock up the box.
- */
-#define MAX_SOFTIRQ_TIME msecs_to_jiffies(2)
-#define MAX_SOFTIRQ_RESTART 10
-
#ifdef CONFIG_TRACE_IRQFLAGS
/*
* When we run softirqs from irq_exit() and thus on the hardirq stack we need
@@ -241,12 +225,10 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }

asmlinkage __visible void __softirq_entry __do_softirq(void)
{
- unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
unsigned long old_flags = current->flags;
- int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
- __u32 pending;
+ __u32 pending, executed = 0;
int softirq_bit;

/*
@@ -284,6 +266,9 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
trace_softirq_entry(vec_nr);
h->action(h);
trace_softirq_exit(vec_nr);
+
+ executed |= 1 << vec_nr;
+
if (unlikely(prev_count != preempt_count())) {
pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
vec_nr, softirq_to_name[vec_nr], h->action,
@@ -299,11 +284,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending();
if (pending) {
- if (time_before(jiffies, end) && !need_resched() &&
- --max_restart)
+ if (pending & executed || need_resched())
+ wakeup_softirqd();
+ else
goto restart;
-
- wakeup_softirqd();
}

lockdep_softirq_end(in_hardirq);
--
2.7.4


2018-01-19 15:48:02

by Frederic Weisbecker

Subject: [RFC PATCH 3/4] softirq: Defer to workqueue when rescheduling is needed

One more step toward converting ksoftirqd to per vector workqueues.

Suggested-by: Paolo Abeni <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
---
kernel/softirq.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index becb1d9..bb0cffa 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -369,16 +369,14 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending() & ~softirq->pending_work_mask;
if (pending) {
- if (need_resched()) {
- wakeup_softirqd();
- } else {
- /* Vectors that got re-enqueued are threaded */
- if (executed & pending)
- do_softirq_workqueue(executed & pending);
- pending &= ~executed;
- if (pending)
- goto restart;
- }
+ if (need_resched())
+ executed = pending;
+ /* Vectors that got re-enqueued are threaded */
+ if (executed & pending)
+ do_softirq_workqueue(executed & pending);
+ pending &= ~executed;
+ if (pending)
+ goto restart;
}

lockdep_softirq_end(in_hardirq);
--
2.7.4


2018-01-19 15:49:46

by Frederic Weisbecker

Subject: [RFC PATCH 4/4] softirq: Replace ksoftirqd with workqueues entirely

Ksoftirqd only remains to implement threaded IRQs. Convert it to the
existing per-vector workqueues to avoid code duplication.

Suggested-by: Linus Torvalds <[email protected]>
Suggested-by: Paolo Abeni <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
---
Documentation/RCU/stallwarn.txt | 4 +-
include/linux/interrupt.h | 7 ----
kernel/sched/cputime.c | 13 +++---
kernel/sched/sched.h | 4 +-
kernel/softirq.c | 87 +++++++++--------------------------------
net/ipv4/tcp_output.c | 4 +-
6 files changed, 31 insertions(+), 88 deletions(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index a08f928..ea3a8de 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -17,8 +17,8 @@ o A CPU looping in an RCU read-side critical section.
o A CPU looping with interrupts disabled.

o A CPU looping with preemption disabled. This condition can
- result in RCU-sched stalls and, if ksoftirqd is in use, RCU-bh
- stalls.
+ result in RCU-sched stalls and, if softirq workqueue is in use,
+ RCU-bh stalls.

o A CPU looping with bottom halves disabled. This condition can
result in RCU-sched and RCU-bh stalls.
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 92d044d..680f620 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -507,13 +507,6 @@ extern void __raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq_irqoff(unsigned int nr);
extern void raise_softirq(unsigned int nr);

-DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
-
-static inline struct task_struct *this_cpu_ksoftirqd(void)
-{
- return this_cpu_read(ksoftirqd);
-}
-
extern int softirq_serving_workqueue(void);

/* Tasklets --- multithreaded analogue of BHs.
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 30f70e5..c5b8dbd 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -64,15 +64,14 @@ void irqtime_account_irq(struct task_struct *curr)
irqtime->irq_start_time += delta;

/*
- * We do not account for softirq time from ksoftirqd here.
- * We want to continue accounting softirq time to ksoftirqd thread
+ * We do not account for softirq time from workqueue here.
+ * We want to continue accounting softirq time to workqueue thread
* in that case, so as not to confuse scheduler with a special task
* that do not consume any time, but still wants to run.
*/
if (hardirq_count())
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if (in_serving_softirq() && curr != this_cpu_ksoftirqd() &&
- !softirq_serving_workqueue())
+ else if (in_serving_softirq() && !softirq_serving_workqueue())
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
EXPORT_SYMBOL_GPL(irqtime_account_irq);
@@ -376,11 +375,11 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,

cputime -= other;

- if (this_cpu_ksoftirqd() == p || softirq_serving_workqueue()) {
+ if (softirq_serving_workqueue()) {
/*
- * ksoftirqd time do not get accounted in cpu_softirq_time.
+ * Softirq wq time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
- * Also, p->stime needs to be updated for ksoftirqd.
+ * Also, p->stime needs to be updated for workqueue.
*/
account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
} else if (user_tick) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a2..5d481f1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2061,8 +2061,8 @@ struct irqtime {
DECLARE_PER_CPU(struct irqtime, cpu_irqtime);

/*
- * Returns the irqtime minus the softirq time computed by ksoftirqd.
- * Otherwise ksoftirqd's sum_exec_runtime is substracted its own runtime
+ * Returns the irqtime minus the softirq time computed by workqueue.
+ * Otherwise workqueue's sum_exec_runtime is substracted its own runtime
* and never move forward.
*/
static inline u64 irq_time_read(int cpu)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bb0cffa..cf43a8d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -55,8 +55,6 @@ EXPORT_SYMBOL(irq_stat);

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

-DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
-
const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
"TASKLET", "SCHED", "HRTIMER", "RCU"
@@ -76,32 +74,6 @@ struct softirq {
static DEFINE_PER_CPU(struct softirq, softirq_cpu);

/*
- * we cannot loop indefinitely here to avoid userspace starvation,
- * but we also don't want to introduce a worst case 1/HZ latency
- * to the pending events, so lets the scheduler to balance
- * the softirq load for us.
- */
-static void wakeup_softirqd(void)
-{
- /* Interrupts are disabled: no need to stop preemption */
- struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
- if (tsk && tsk->state != TASK_RUNNING)
- wake_up_process(tsk);
-}
-
-/*
- * If ksoftirqd is scheduled, we do not want to process pending softirqs
- * right now. Let ksoftirqd handle this at its own rate, to get fairness.
- */
-static bool ksoftirqd_running(void)
-{
- struct task_struct *tsk = __this_cpu_read(ksoftirqd);
-
- return tsk && (tsk->state == TASK_RUNNING);
-}
-
-/*
* preempt_count and SOFTIRQ_OFFSET usage:
* - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
* softirq processing.
@@ -388,7 +360,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

asmlinkage __visible void do_softirq(void)
{
- __u32 pending;
+ __u32 pending, pending_work;
unsigned long flags;

if (in_interrupt())
@@ -397,8 +369,9 @@ asmlinkage __visible void do_softirq(void)
local_irq_save(flags);

pending = local_softirq_pending();
+ pending_work = __this_cpu_read(softirq_cpu.pending_work_mask);

- if (pending && !ksoftirqd_running())
+ if (pending & ~pending_work)
do_softirq_own_stack();

local_irq_restore(flags);
@@ -412,7 +385,7 @@ void irq_enter(void)
rcu_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
- * Prevent raise_softirq from needlessly waking up ksoftirqd
+ * Prevent raise_softirq from needlessly waking up workqueue
* here, as softirq will be serviced on return from interrupt.
*/
local_bh_disable();
@@ -425,7 +398,15 @@ void irq_enter(void)

static inline void invoke_softirq(void)
{
- if (ksoftirqd_running())
+ unsigned int pending_work, pending = local_softirq_pending();
+
+ if (!pending)
+ return;
+
+ pending_work = __this_cpu_read(softirq_cpu.pending_work_mask);
+ pending &= ~pending_work;
+
+ if (!pending)
return;

if (!force_irqthreads) {
@@ -445,7 +426,7 @@ static inline void invoke_softirq(void)
do_softirq_own_stack();
#endif
} else {
- wakeup_softirqd();
+ do_softirq_workqueue(pending);
}
}

@@ -474,7 +455,7 @@ void irq_exit(void)
#endif
account_irq_exit_time(current);
preempt_count_sub(HARDIRQ_OFFSET);
- if (!in_interrupt() && local_softirq_pending())
+ if (!in_interrupt())
invoke_softirq();

tick_irq_exit();
@@ -495,11 +476,11 @@ inline void raise_softirq_irqoff(unsigned int nr)
* actually run the softirq once we return from
* the irq or softirq.
*
- * Otherwise we wake up ksoftirqd to make sure we
+ * Otherwise we wake up workqueue to make sure we
* schedule the softirq soon.
*/
if (!in_interrupt())
- wakeup_softirqd();
+ do_softirq_workqueue(BIT(nr));
}

void raise_softirq(unsigned int nr)
@@ -736,27 +717,6 @@ void __init softirq_init(void)
open_softirq(HI_SOFTIRQ, tasklet_hi_action);
}

-static int ksoftirqd_should_run(unsigned int cpu)
-{
- return local_softirq_pending();
-}
-
-static void run_ksoftirqd(unsigned int cpu)
-{
- local_irq_disable();
- if (local_softirq_pending()) {
- /*
- * We can safely run softirq on inline stack, as we are not deep
- * in the task stack here.
- */
- __do_softirq();
- local_irq_enable();
- cond_resched_rcu_qs();
- return;
- }
- local_irq_enable();
-}
-
#ifdef CONFIG_HOTPLUG_CPU
/*
* tasklet_kill_immediate is called to remove a tasklet which can already be
@@ -819,22 +779,13 @@ static int takeover_tasklets(unsigned int cpu)
#define takeover_tasklets NULL
#endif /* CONFIG_HOTPLUG_CPU */

-static struct smp_hotplug_thread softirq_threads = {
- .store = &ksoftirqd,
- .thread_should_run = ksoftirqd_should_run,
- .thread_fn = run_ksoftirqd,
- .thread_comm = "ksoftirqd/%u",
-};
-
-static __init int spawn_ksoftirqd(void)
+static __init int tasklet_set_takeover(void)
{
cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
takeover_tasklets);
- BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
-
return 0;
}
-early_initcall(spawn_ksoftirqd);
+early_initcall(tasklet_set_takeover);

/*
* [ These __weak aliases are kept in a separate compilation unit, so that
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b4e4160..3b4811e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -912,7 +912,7 @@ void tcp_wfree(struct sk_buff *skb)
*/
WARN_ON(refcount_sub_and_test(skb->truesize - 1, &sk->sk_wmem_alloc));

- /* If this softirq is serviced by ksoftirqd, we are likely under stress.
+ /* If this softirq is serviced by workqueue, we are likely under stress.
* Wait until our queues (qdisc + devices) are drained.
* This gives :
* - less callbacks to tcp_write_xmit(), reducing stress (batches)
@@ -920,7 +920,7 @@ void tcp_wfree(struct sk_buff *skb)
* to migrate this flow (skb->ooo_okay will be eventually set)
*/
if (refcount_read(&sk->sk_wmem_alloc) >= SKB_TRUESIZE(1) &&
- (this_cpu_ksoftirqd() == current || softirq_serving_workqueue()))
+ softirq_serving_workqueue())
goto out;

for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
--
2.7.4


2018-01-19 16:19:06

by David Miller

Subject: Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

From: Frederic Weisbecker <[email protected]>
Date: Fri, 19 Jan 2018 16:46:11 +0100

> For now, vectors that get re-enqueued trigger ksoftirqd but they are
> going to be handled by per-vector workqueues on subsequent patches.

Frederic, first of all, thanks for doing all of this work.

So this "get requeued" condition I think will trigger always for
networking tunnel decapsulation.

Each decap will (eventually) do a:

__raise_softirq_irqoff(NET_RX_SOFTIRQ);

via ____napi_schedule() in net/core/dev.c

Example code path:

ip_tunnel_rcv()
gro_cells_receive()
napi_schedule()
__napi_schedule()
____napi_schedule();
__raise_softirq_irqoff(NET_RX_SOFTIRQ);

Anyways, just FYI...

2018-01-19 18:27:45

by Linus Torvalds

Subject: Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

On Fri, Jan 19, 2018 at 8:16 AM, David Miller <[email protected]> wrote:
>
> So this "get requeued" condition I think will trigger always for
> networking tunnel decapsulation.

Hmm. Interesting, and perhaps a bit discouraging.

Will it always be just a _single_ level of indirection, or will double
tunnels (I assume some people do that, just because the universe is
out to get us) then result in this perhaps repeating several times?

Because it would not be very complicated to just have two bitmasks
(and still no per-softirq counting), and have that "switch to
threading only on the _second_ time this happens".
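
A rough sketch of that two-bitmask variant (hypothetical, not part of the
posted series): remember which vectors already got one extra inline pass,
and only punt to the thread the second time one of them gets re-enqueued.

	__u32 pending, executed = 0, retried = 0;

restart:
	/* ... run the pending vectors, filling 'executed' as in patch 1 ... */

	pending = local_softirq_pending();
	if (pending) {
		__u32 requeued = pending & executed;

		if (need_resched() || (requeued & retried)) {
			/* second re-enqueue (or resched): defer to thread */
			wakeup_softirqd();
		} else {
			/* first re-enqueue: note it and retry inline once */
			retried |= requeued;
			goto restart;
		}
	}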

But let's have people test the behavior of this simpler model first?

Linus

2018-01-19 18:48:23

by David Miller

Subject: Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

From: Linus Torvalds <[email protected]>
Date: Fri, 19 Jan 2018 10:25:03 -0800

> On Fri, Jan 19, 2018 at 8:16 AM, David Miller <[email protected]> wrote:
>>
>> So this "get requeued" condition I think will trigger always for
>> networking tunnel decapsulation.
>
> Hmm. Interesting and a perhaps bit discouraging.
>
> Will it always be just a _single_ level of indirection, or will double
> tunnels (I assume some people do that, just because the universe is
> out to get us) then result in this perhaps repeating several times?

Every level of tunnel encapsulation will trigger a new softirq.

So if you have an IP tunnel inside of an IP tunnel that will trigger
twice.

2018-01-20 08:44:05

by Pavankumar Kondeti

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

Hi Frederic,

On Fri, Jan 19, 2018 at 04:46:12PM +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
>
> To solve this we only defer to threading the vectors that got
> re-enqueued on IRQ tail processing and leave the others inline on IRQ
> tail service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
>
> Note ksoftirqd is not yet removed as it is still needed for threaded IRQs
> mode.
>

<snip>

> +static void do_softirq_workqueue(u32 pending)
> +{
> + struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> + struct softirq_action *h = softirq_vec;
> + int softirq_bit;
> +
> + pending &= ~softirq->pending_work_mask;
> +
> + while ((softirq_bit = ffs(pending))) {
> + struct vector *vector;
> + unsigned int vec_nr;
> +
> + h += softirq_bit - 1;
> + vec_nr = h - softirq_vec;
> + softirq->pending_work_mask |= BIT(vec_nr);
> + vector = &softirq->vector[vec_nr];
> + schedule_work_on(smp_processor_id(), &vector->work);
> + h++;
> + pending >>= softirq_bit;
> + }
> +}
> +

I have a couple of questions/comments.

(1) Since the work is queued on a bounded per-cpu worker, we may run
into a deadlock if a TASKLET is killed from another work running on
the same bounded per-cpu worker.

For example,

(1) Schedule a TASKLET on CPU#0 from IRQ.
(2) Another IRQ comes on the same CPU and we queue a work to kill
the TASKLET.
(3) The TASKLET vector is deferred to workqueue.
(4) We run the TASKLET kill work and wait for the TASKLET to finish,
which won't happen.

We can fix this by queueing the TASKLET kill work on an unbounded
workqueue so that this runs in parallel with TASKLET vector work.

Just wanted to know if we have to be aware of this *condition*.
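
To make that concrete, a hypothetical driver-side sequence showing the
pattern (illustrative only, invented names, not code from the series):

static void my_tasklet_fn(unsigned long data) { /* ... */ }
static DECLARE_TASKLET(my_tasklet, my_tasklet_fn, 0);

static void my_kill_work_fn(struct work_struct *work)
{
	/* Waits for the tasklet to finish. If TASKLET_SOFTIRQ has been
	 * deferred to the same bound per-cpu worker, its work item sits
	 * behind this one and may never run. */
	tasklet_kill(&my_tasklet);
}
static DECLARE_WORK(my_kill_work, my_kill_work_fn);

/* from IRQ context on CPU#0: */
	tasklet_schedule(&my_tasklet);		/* raises TASKLET_SOFTIRQ */
	schedule_work_on(0, &my_kill_work);	/* same bound worker pool */

/* the fix suggested above would be e.g.:
 *	queue_work(system_unbound_wq, &my_kill_work);
 */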

(2) The ksoftirqd thread gets parked when a CPU is hotplugged out, so
there is a guarantee that softirq handling never happens on
another CPU. Whereas a bound worker gets detached and the queued
work can run on another CPU. I guess some special handling is
needed to handle hotplug.

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-21 16:12:01

by Frederic Weisbecker

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:

Hi Pavan,


> I have couple questions/comments.
>
> (1) Since the work is queued on a bounded per-cpu worker, we may run
> into a deadlock if a TASKLET is killed from another work running on
> the same bounded per-cpu worker.
>
> For example,
>
> (1) Schedule a TASKLET on CPU#0 from IRQ.
> (2) Another IRQ comes on the same CPU and we queue a work to kill
> the TASKLET.
> (3) The TASKLET vector is deferred to workqueue.
> (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> which won't happen.
>
> We can fix this by queueing the TASKLET kill work on an unbounded
> workqueue so that this runs in parallel with TASKLET vector work.
>
> Just wanted to know if we have to be aware of this *condition*.

But IIRC the workqueues have several workers per CPU so the tasklet to
be killed can run while the tasklet killer yields.

>
> (2) Ksoftirqd thread gets parked when a CPU is hotplugged out. So
> there is a gaurantee that the softirq handling never happens on
> another CPU. Where as a bounded worker gets detached and the queued
> work can run on another CPU. I guess, some special handling is
> needed to handle hotplug.

Good catch. Funny, I worried a bit about CPU hotplug, but I assumed
the pending CPU-bound worklets would simply be sync'ed before the CPU goes down.

Breaking their CPU-bound properties doesn't look sane to me.

Anyway, I'll need to make a CPU hotplug hook.
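
For what it's worth, one possible shape for it (hypothetical sketch,
names invented; whether the works should be cancelled, flushed or
migrated is an open question):

static int softirq_wq_cpu_dead(unsigned int cpu)
{
	struct softirq *softirq = &per_cpu(softirq_cpu, cpu);
	int i;

	/* Runs on a control CPU once 'cpu' is down: make sure no
	 * per-vector work is left behind on a detached worker. */
	for (i = 0; i < NR_SOFTIRQS; i++)
		cancel_work_sync(&softirq->vector[i].work);

	softirq->pending_work_mask = 0;

	return 0;
}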

Thanks for reporting that!

2018-01-21 16:32:04

by Frederic Weisbecker

Subject: Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

On Fri, Jan 19, 2018 at 01:47:27PM -0500, David Miller wrote:
> From: Linus Torvalds <[email protected]>
> Date: Fri, 19 Jan 2018 10:25:03 -0800
>
> > On Fri, Jan 19, 2018 at 8:16 AM, David Miller <[email protected]> wrote:
> >>
> >> So this "get requeued" condition I think will trigger always for
> >> networking tunnel decapsulation.
> >
> > Hmm. Interesting and a perhaps bit discouraging.
> >
> > Will it always be just a _single_ level of indirection, or will double
> > tunnels (I assume some people do that, just because the universe is
> > out to get us) then result in this perhaps repeating several times?
>
> Every level of tunnel encapsulation will trigger a new softirq.
>
> So if you have an IP tunnel inside of an IP tunnel that will trigger
> twice.

So we may likely need to come back to a call counter based limit :-s

2018-01-21 16:57:48

by David Miller

Subject: Re: [RFC PATCH 1/4] softirq: Limit vector to a single iteration on IRQ tail

From: Frederic Weisbecker <[email protected]>
Date: Sun, 21 Jan 2018 17:30:09 +0100

> On Fri, Jan 19, 2018 at 01:47:27PM -0500, David Miller wrote:
>> From: Linus Torvalds <[email protected]>
>> Date: Fri, 19 Jan 2018 10:25:03 -0800
>>
>> > On Fri, Jan 19, 2018 at 8:16 AM, David Miller <[email protected]> wrote:
>> >>
>> >> So this "get requeued" condition I think will trigger always for
>> >> networking tunnel decapsulation.
>> >
>> > Hmm. Interesting and a perhaps bit discouraging.
>> >
>> > Will it always be just a _single_ level of indirection, or will double
>> > tunnels (I assume some people do that, just because the universe is
>> > out to get us) then result in this perhaps repeating several times?
>>
>> Every level of tunnel encapsulation will trigger a new softirq.
>>
>> So if you have an IP tunnel inside of an IP tunnel that will trigger
>> twice.
>
> So we may likely need to come back to a call counter based limit :-s

I'm not so sure exactly to what extent we should try to handle that,
and if so exactly how.

The only reason we do this is to control stack usage. The re-softirq
on tunnel decapsulation is functioning as a continuation of sorts.
If we are already running in the net_rx_action() softirq, it would
be so much nicer and more efficient to just make sure net_rx_action()
runs the rest of the packet processing.

Right now that doesn't happen because net_rx_action() runs only one
round of NAPI polling. It doesn't re-snapshot the list and try
again before returning.

net_rx_action() has its own logic, like do_softirq() does, for timing
out in the middle of its work, which may or may not have some further
influence upon fairness to other softirqs.

Basically, it runs a snapshot of the NAPI poll list for this CPU until
either usecs_to_jiffies(netdev_budget_usecs) jiffies have elapsed or
the list snapshot has been fully processed.

The default netdev_budget_usecs is 2000, which, if my math isn't broken,
is 2 jiffies when HZ=1000. I know why we use 2000 instead of 1000:
it's to handle the case where we are invoked very close to the end
of a jiffy. That situation does happen often enough in practice to
cause performance problems.
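
Roughly, that loop looks like this (paraphrased from net/core/dev.c of
that era, heavily trimmed); and indeed usecs_to_jiffies(2000) is 2
jiffies with HZ=1000:

	unsigned long time_limit = jiffies +
			usecs_to_jiffies(netdev_budget_usecs);
	int budget = netdev_budget;

	for (;;) {
		struct napi_struct *n;

		if (list_empty(&list))
			break;

		n = list_first_entry(&list, struct napi_struct, poll_list);
		budget -= napi_poll(n, &repoll);

		/* out of budget or out of time: stop and punt */
		if (unlikely(budget <= 0 ||
			     time_after_eq(jiffies, time_limit))) {
			sd->time_squeeze++;
			break;
		}
	}
	/* ... */
	if (!list_empty(&sd->poll_list))
		__raise_softirq_irqoff(NET_RX_SOFTIRQ);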

It would seem that all of these issues are why the tendency has been to
measure cost using time rather than a simpler heuristic such as
whether softirqs were retriggered during a softirq run.

2018-01-21 17:52:01

by Pavankumar Kondeti

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Sun, Jan 21, 2018 at 05:11:17PM +0100, Frederic Weisbecker wrote:
> On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:
>
> Hi Pavan,
>
>
> > I have couple questions/comments.
> >
> > (1) Since the work is queued on a bounded per-cpu worker, we may run
> > into a deadlock if a TASKLET is killed from another work running on
> > the same bounded per-cpu worker.
> >
> > For example,
> >
> > (1) Schedule a TASKLET on CPU#0 from IRQ.
> > (2) Another IRQ comes on the same CPU and we queue a work to kill
> > the TASKLET.
> > (3) The TASKLET vector is deferred to workqueue.
> > (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> > which won't happen.
> >
> > We can fix this by queueing the TASKLET kill work on an unbounded
> > workqueue so that this runs in parallel with TASKLET vector work.
> >
> > Just wanted to know if we have to be aware of this *condition*.
>
> But IIRC the workqueues have several workers per CPU so the tasklet to
> be killed can run while the tasklet killer yields.
>
AFAIK, the work items queued via schedule_work_on() go to the system_wq,
which is bound to a CPU with concurrency restrictions. If any work
item (in this case the tasklet kill) is being executed on this bound
worker, the next items have to wait. Forward progress happens only
when the current work finishes or sleeps.

This also makes me wonder what happens if a CPU-hogging work item gets
executed on the system_wq while softirq work is pending? The softirq work
gets starved, which won't happen with the current ksoftirqd design.
Ideally the CPU-hogging work should not be queued on the system_wq and
instead should be queued on a CPU-intensive workqueue (WQ_CPU_INTENSIVE)
to exempt it from concurrency management. Maybe we need some special
workqueue which is bound but not subject to concurrency management.
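
Something along these lines, for instance (hypothetical, only to
illustrate the suggestion; not part of the series):

/* A dedicated per-cpu workqueue: work items stay bound to the issuing
 * CPU, but WQ_CPU_INTENSIVE exempts them from concurrency management,
 * so they don't block other system_wq items on that CPU. */
static struct workqueue_struct *softirq_wq;

static int __init softirq_wq_init(void)
{
	softirq_wq = alloc_workqueue("softirq", WQ_CPU_INTENSIVE, 0);
	return softirq_wq ? 0 : -ENOMEM;
}
early_initcall(softirq_wq_init);

/* ...and then, instead of schedule_work_on(): */
	queue_work_on(smp_processor_id(), softirq_wq, &vector->work);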

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-21 20:49:00

by Frederic Weisbecker

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Sun, Jan 21, 2018 at 11:20:40PM +0530, Pavan Kondeti wrote:
> On Sun, Jan 21, 2018 at 05:11:17PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 20, 2018 at 02:11:39PM +0530, Pavan Kondeti wrote:
> >
> > Hi Pavan,
> >
> >
> > > I have couple questions/comments.
> > >
> > > (1) Since the work is queued on a bounded per-cpu worker, we may run
> > > into a deadlock if a TASKLET is killed from another work running on
> > > the same bounded per-cpu worker.
> > >
> > > For example,
> > >
> > > (1) Schedule a TASKLET on CPU#0 from IRQ.
> > > (2) Another IRQ comes on the same CPU and we queue a work to kill
> > > the TASKLET.
> > > (3) The TASKLET vector is deferred to workqueue.
> > > (4) We run the TASKLET kill work and wait for the TASKLET to finish,
> > > which won't happen.
> > >
> > > We can fix this by queueing the TASKLET kill work on an unbounded
> > > workqueue so that this runs in parallel with TASKLET vector work.
> > >
> > > Just wanted to know if we have to be aware of this *condition*.
> >
> > But IIRC the workqueues have several workers per CPU so the tasklet to
> > be killed can run while the tasklet killer yields.
> >
> AFAIK, the work items queued via schedule_work_on() goes to the system_wq
> which is bounded to a CPU with concurrency restrictions. If any work
> item (in this case tasklet kill) is getting executed on this bounded
> worker, the next items have to wait. The forward progress happens only
> when the current work is finished or enters sleep.

Workqueues have multiple workers to handle the pending works, unless they
are __WQ_ORDERED.

The worker manager in a pool is supposed to create workers on-demand
when necessary to make sure that no work is blocking the others.

Thanks.

>
> This also makes me wonder what happens if a CPU hogging work gets executed
> on the system_wq while softirq work is pending? The softirq work gets
> starved which won't happen now with ksoftirqd design. Ideally the CPU
> hogging work should not be queued on the system_wq and instead should
> be queued on CPU intenstive workqueue (WQ_CPU_INTENSIVE) to exempt
> from concurrency management. May be we need some special workqueue
> which is bounded but not subjected to concurrency management.
>
> --
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
>

2018-01-22 20:01:26

by Mauro Carvalho Chehab

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

Em Fri, 19 Jan 2018 16:46:10 +0100
Frederic Weisbecker <[email protected]> escreveu:

> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
>
> No tunable here, so testing should be easier.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> softirq/thread-v3

Frederic,

FYI, I'm out of town for two weeks. I'll try to set up a test environment
while traveling, but I'm unsure if it will work, as I won't have access to
my DVB-S2 provider from here.

>
> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
>
> Thanks,
> Frederic
> ---
>
> Frederic Weisbecker (4):
> softirq: Limit vector to a single iteration on IRQ tail
> softirq: Per vector deferment to workqueue
> softirq: Defer to workqueue when rescheduling is needed
> softirq: Replace ksoftirqd with workqueues entirely
>
>
> Documentation/RCU/stallwarn.txt | 4 +-
> include/linux/interrupt.h | 7 +-
> kernel/sched/cputime.c | 12 +--
> kernel/sched/sched.h | 4 +-
> kernel/softirq.c | 223 +++++++++++++++++++++++-----------------
> net/ipv4/tcp_output.c | 5 +-
> 6 files changed, 144 insertions(+), 111 deletions(-)




Cheers,
Mauro

2018-01-23 10:14:42

by Paolo Abeni

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

Hi,

On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
>
> No tunable here, so testing should be easier.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> softirq/thread-v3
>
> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4

I tested this series in the UDP flood scenario, binding the user space
process receiving the packets to the same CPU that processes the related
IRQ, and the throughput sinks nearly to 0, like before Eric's patch.

The perf tool says that almost all the softirq processing is done
inside the workqueue, but the user space process is scheduled very
rarely, while before this series, in this scenario, ksoftirqd and the
user space process got a fair share of the CPU time.

Cheers,

Paolo





2018-01-23 12:33:37

by Dmitry Safonov

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, 2018-01-23 at 11:13 +0100, Paolo Abeni wrote:
> Hi,
>
> On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> > As per Linus suggestion, this take doesn't limit the number of
> > occurences
> > per jiffy anymore but instead defers a vector to workqueues as soon
> > as
> > it gets re-enqueued on IRQ tail.
> >
> > No tunable here, so testing should be easier.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-
> > dynticks.git
> > softirq/thread-v3
> >
> > HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
>
> I tested this series in the UDP flood scenario, binding the user
> space
> process receiving the packets on the same CPU processing the related
> IRQ, and the tput sinks nearly to 0, like before Eric's patch.
>
> The perf tool says that almost all the softirq processing is done
> inside the workqueue, but the user space process is scheduled very
> rarely, while before this series, in this scenario, ksoftirqd and the
> user space process got a fair share of the CPU time.

It gets a fair share of the CPU time only on some lucky platforms (maybe
on most of them). On other, not-so-lucky platforms it was also unfair - as
I've previously described, because of slow raising of softirqs :-/

--
Thanks,
Dmitry

2018-01-23 16:23:50

by David Miller

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

From: Paolo Abeni <[email protected]>
Date: Tue, 23 Jan 2018 11:13:52 +0100

> Hi,
>
> On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
>> As per Linus suggestion, this take doesn't limit the number of occurences
>> per jiffy anymore but instead defers a vector to workqueues as soon as
>> it gets re-enqueued on IRQ tail.
>>
>> No tunable here, so testing should be easier.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>> softirq/thread-v3
>>
>> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
>
> I tested this series in the UDP flood scenario, binding the user space
> process receiving the packets on the same CPU processing the related
> IRQ, and the tput sinks nearly to 0, like before Eric's patch.
>
> The perf tool says that almost all the softirq processing is done
> inside the workqueue, but the user space process is scheduled very
> rarely, while before this series, in this scenario, ksoftirqd and the
> user space process got a fair share of the CPU time.

Do workqueue threads get a higher scheduling priority than user
processes? If so, that's going to work against the entire point of
deferring softirq work into a thread.

Or is it that the workqueue execution is simply not yielding for some
reason?

2018-01-23 16:58:13

by Paolo Abeni

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, 2018-01-23 at 11:22 -0500, David Miller wrote:
> From: Paolo Abeni <[email protected]>
> Date: Tue, 23 Jan 2018 11:13:52 +0100
>
> > Hi,
> >
> > On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> >> As per Linus suggestion, this take doesn't limit the number of occurences
> >> per jiffy anymore but instead defers a vector to workqueues as soon as
> >> it gets re-enqueued on IRQ tail.
> >>
> >> No tunable here, so testing should be easier.
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >> softirq/thread-v3
> >>
> >> HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> >
> > I tested this series in the UDP flood scenario, binding the user space
> > process receiving the packets on the same CPU processing the related
> > IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> >
> > The perf tool says that almost all the softirq processing is done
> > inside the workqueue, but the user space process is scheduled very
> > rarely, while before this series, in this scenario, ksoftirqd and the
> > user space process got a fair share of the CPU time.
>
> Do workqueue threads get a higher scheduling priority than user
> processes?

As far as I can see, no: the workqueue thread has the same priority and
nice level as the user space process.

> Or is it that the workqueue execution is simply not yielding for some
> reason?

It's like that.

I spent little time on it, so I don't have many data points. I'll try to
investigate the scenario later this week.

Cheers,

Paolo

2018-01-23 17:43:39

by Linus Torvalds

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <[email protected]> wrote:
>
>> Or is it that the workqueue execution is simply not yielding for some
>> reason?
>
> It's like that.
>
> I spent little time on it, so I haven't many data point. I'll try to
> investigate the scenario later this week.

Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
cond_resched() (and a RCU quiescent note).

But I wonder if the test triggers the "lets run lots of workqueue
threads", and then the single-threaded user space just gets blown out
of the water by many kernel threads. Each thread gets its own "fair"
amount of CPU, but..

Linus

2018-01-23 18:02:54

by Mike Galbraith

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, 2018-01-23 at 09:42 -0800, Linus Torvalds wrote:
> On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <[email protected]> wrote:
> >
> >> Or is it that the workqueue execution is simply not yielding for some
> >> reason?
> >
> > It's like that.
> >
> > I spent little time on it, so I haven't many data point. I'll try to
> > investigate the scenario later this week.
>
> Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
> cond_resched() (and a RCU quiescent note).
>
> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

If folks aren't careful with workqueues, they can be a generic
starvation problem. Like the below in the here and now.

fs/nfs: Add a resched point to nfs_commit_release_pages()

During heavy NFS write, kworkers can do very large amounts of work
without scheduling (82ms traced). Add a resched point.

Signed-off-by: Mike Galbraith <[email protected]>
Suggested-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 1 +
1 file changed, 1 insertion(+)

--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1837,6 +1837,7 @@ static void nfs_commit_release_pages(str
set_bit(NFS_CONTEXT_RESEND_WRITES, &req->wb_context->flags);
next:
nfs_unlock_and_release_request(req);
+ cond_resched();
}
nfss = NFS_SERVER(data->inode);
if (atomic_long_read(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)


2018-01-23 18:25:33

by David Miller

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

From: Linus Torvalds <[email protected]>
Date: Tue, 23 Jan 2018 09:42:32 -0800

> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

If a single cpu's softirq deferral can end up running on multiple
workqueue threads, indeed that's a serious problem.

So if we're in a workqueue and it does a:

schedule_work_on(this_cpu, currently_executing_work);

it'll potentially make a new thread?

That's exactly the code path that will get exercised during a UDP
flood the way that vector_work_func() is implemented.

2018-01-24 01:58:56

by Frederic Weisbecker

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, Jan 23, 2018 at 01:24:24PM -0500, David Miller wrote:
> From: Linus Torvalds <[email protected]>
> Date: Tue, 23 Jan 2018 09:42:32 -0800
>
> > But I wonder if the test triggers the "lets run lots of workqueue
> > threads", and then the single-threaded user space just gets blown out
> > of the water by many kernel threads. Each thread gets its own "fair"
> > amount of CPU, but..
>
> If a single cpu's softirq deferral can end up running on multiple
> workqueue threads, indeed that's a serious problem.
>
> So if we're in a workqueue and it does a:
>
> schedule_work_on(this_cpu, currently_executing_work);
>
> it'll potentially make a new thread?
>
> That's exactly the code path that will get exercised during a UDP
> flood the way that vector_work_func() is implemented.

It shouldn't create a new thread unless all other workers in the CPU
are voluntarily sleeping while executing a work. Also since softirqs
can't sleep, we shouldn't even have two vectors running concurrently
on workqueues.

2018-01-24 02:03:36

by Frederic Weisbecker

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

2018-01-24 2:57 UTC+01:00, Frederic Weisbecker <[email protected]>:
> On Tue, Jan 23, 2018 at 01:24:24PM -0500, David Miller wrote:
>> From: Linus Torvalds <[email protected]>
>> Date: Tue, 23 Jan 2018 09:42:32 -0800
>>
>> > But I wonder if the test triggers the "lets run lots of workqueue
>> > threads", and then the single-threaded user space just gets blown out
>> > of the water by many kernel threads. Each thread gets its own "fair"
>> > amount of CPU, but..
>>
>> If a single cpu's softirq deferral can end up running on multiple
>> workqueue threads, indeed that's a serious problem.
>>
>> So if we're in a workqueue and it does a:
>>
>> schedule_work_on(this_cpu, currently_executing_work);
>>
>> it'll potentially make a new thread?
>>
>> That's exactly the code path that will get exercised during a UDP
>> flood the way that vector_work_func() is implemented.
>
> It shouldn't create a new thread unless all other workers in the CPU
> are voluntarily sleeping while executing a work. Also since softirqs
> can't sleep, we shouldn't even have two vectors running concurrently
> on workqueues.
>

...unless I misunderstood workqueue internals or missed something, in
which case we may try ordered workqueue.

2018-01-24 02:14:20

by Frederic Weisbecker

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, Jan 23, 2018 at 12:32:13PM +0000, Dmitry Safonov wrote:
> On Tue, 2018-01-23 at 11:13 +0100, Paolo Abeni wrote:
> > Hi,
> >
> > On Fri, 2018-01-19 at 16:46 +0100, Frederic Weisbecker wrote:
> > > As per Linus suggestion, this take doesn't limit the number of
> > > occurences
> > > per jiffy anymore but instead defers a vector to workqueues as soon
> > > as
> > > it gets re-enqueued on IRQ tail.
> > >
> > > No tunable here, so testing should be easier.
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-
> > > dynticks.git
> > > softirq/thread-v3
> > >
> > > HEAD: 6835e92cbd70ef4a056987d2e1ed383b294429d4
> >
> > I tested this series in the UDP flood scenario, binding the user
> > space
> > process receiving the packets on the same CPU processing the related
> > IRQ, and the tput sinks nearly to 0, like before Eric's patch.
> >
> > The perf tool says that almost all the softirq processing is done
> > inside the workqueue, but the user space process is scheduled very
> > rarely, while before this series, in this scenario, ksoftirqd and the
> > user space process got a fair share of the CPU time.
>
> It get a fair share of the CPU time only on some lucky platforms (maybe
> on the most). On other not-so-lucky platforms it was also unfair - as
> I've previously described by the reason of slow raising softirqs :-/

Indeed your case looks pretty special, as IRQs don't get the opportunity to
interrupt softirqs but they fire right after.

I'm wondering if you shouldn't try threaded IRQs. Having softirqs always
running in a thread instead of hardirq tail might help (through threadirqs=
kernel param).

2018-01-24 14:54:58

by Paolo Abeni

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Tue, 2018-01-23 at 09:42 -0800, Linus Torvalds wrote:
> On Tue, Jan 23, 2018 at 8:57 AM, Paolo Abeni <[email protected]> wrote:
> >
> > > Or is it that the workqueue execution is simply not yielding for some
> > > reason?
> >
> > It's like that.
> >
> > I spent little time on it, so I haven't many data point. I'll try to
> > investigate the scenario later this week.
>
> Hmm. workqueues seem to use cond_resched_rcu_qs(), which does a
> cond_resched() (and a RCU quiescent note).
>
> But I wonder if the test triggers the "lets run lots of workqueue
> threads", and then the single-threaded user space just gets blown out
> of the water by many kernel threads. Each thread gets its own "fair"
> amount of CPU, but..

Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
and indeed he was right.

The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
very little CPU time was accounted to the kworker:

[2125 is the relevant kworker's pid]
grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
se.sum_exec_runtime : 13408.239286
se.sum_exec_runtime : 13456.907197

even though that process was processing a lot of packets and basically
burning a CPU.
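
That is consistent with what CONFIG_IRQ_TIME_ACCOUNTING does to the
clock that feeds sum_exec_runtime: time classified as hardirq/softirq
is subtracted from the per-rq task clock (simplified paraphrase of
update_rq_clock_task() in kernel/sched/core.c, not verbatim):

	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
	if (irq_delta > delta)
		irq_delta = delta;
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;
	/* ... */
	rq->clock_task += delta;	/* drives se.sum_exec_runtime */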

Switching CONFIG_IRQ_TIME_ACCOUNTING off I see the expected behaviour:
top reports that the user space process and kworker share the CPU
almost fairly and the user space process is able to receive a
reasonable amount of packets.

Paolo

2018-01-24 15:06:52

by David Miller

Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

From: Paolo Abeni <[email protected]>
Date: Wed, 24 Jan 2018 15:54:05 +0100

> Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
> and indeed he was right.
>
> The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
> very little CPU time was accounted to the kworker:
>
> [2125 is the relevant kworker's pid]
> grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
> se.sum_exec_runtime : 13408.239286
> se.sum_exec_runtime : 13456.907197
>
> despite such process was processing a lot of packets and basically
> burning a CPU.

So IRQ_TIME_ACCOUNTING makes the scheduler think that the worker
threads are using nearly no task time at all.

The existing ksoftirqd code should hit the same problem, right?

2018-01-24 16:12:53

by Paolo Abeni

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Wed, 2018-01-24 at 10:05 -0500, David Miller wrote:
> From: Paolo Abeni <[email protected]>
> Date: Wed, 24 Jan 2018 15:54:05 +0100
>
> > Niklas suggested a possible relation with CONFIG_IRQ_TIME_ACCOUNTING=y
> > and indeed he was right.
> >
> > The patched kernel under test had CONFIG_IRQ_TIME_ACCOUNTING set, and
> > very little CPU time was accounted to the kworker:
> >
> > [2125 is the relevant kworker's pid]
> > grep sum_exec_runtime /proc/2125/sched; sleep 10; grep sum_exec_runtime /proc/2125/sched
> > se.sum_exec_runtime : 13408.239286
> > se.sum_exec_runtime : 13456.907197
> >
> > despite such process was processing a lot of packets and basically
> > burning a CPU.
>
> So IRQ_TIME_ACCOUNTING makes the scheduler think that the worker
> threads are using nearly no task time at all.

Yes, this is the behavior I observe in the test. But a quick look at
the scheduler code - I'm not very familiar with it - makes me think this
is not the intended/expected behaviour for ksoftirqd (and, after
this series, for the kworker serving the softirqs).

> The existing ksoftirqd code should hit the same problem, right?

I just tried the vanilla kernel with CONFIG_IRQ_TIME_ACCOUNTING=y, and
in the same test scenario, I observe the 'good' behavior: the user
space process and ksoftirqd share the relevant CPU almost fairly, and
the CPU time spent in softirq processing is accounted to ksoftirqd.

Cheers,

Paolo


2018-02-07 14:20:03

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

Em Fri, 19 Jan 2018 16:46:10 +0100
Frederic Weisbecker <[email protected]> escreveu:

> As per Linus suggestion, this take doesn't limit the number of occurences
> per jiffy anymore but instead defers a vector to workqueues as soon as
> it gets re-enqueued on IRQ tail.
>
> No tunable here, so testing should be easier.

Sorry for taking so long to test it. I just returned this week from two
weeks of travel, and first had to submit media patches for the merge window.

Didn't work. With this patch series, it still loses frames with
USB bulk transfers while recording DVB programs with tvheadend.

Thanks,
Mauro

by Sebastian Andrzej Siewior

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On 2018-01-19 16:46:12 [+0100], Frederic Weisbecker wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index c8c6841..becb1d9 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {

> +static void vector_work_func(struct work_struct *work)
> +{
> + struct vector *vector = container_of(work, struct vector, work);
> + struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> + int vec_nr = vector->nr;
> + int vec_bit = BIT(vec_nr);
> + u32 pending;
> +
> + local_irq_disable();
> + pending = local_softirq_pending();
> + account_irq_enter_time(current);
> + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> + lockdep_softirq_enter();
> + set_softirq_pending(pending & ~vec_bit);
> + local_irq_enable();
> +
> + if (pending & vec_bit) {
> + struct softirq_action *sa = &softirq_vec[vec_nr];
> +
> + kstat_incr_softirqs_this_cpu(vec_nr);
> + softirq->work_running = 1;
> + trace_softirq_entry(vec_nr);
> + sa->action(sa);

You invoke the softirq handler while BH is disabled (not wrong, I just
state the obvious). That means the scheduler can't preempt/interrupt
the workqueue/BH-handler while it is invoked, so it has to wait until it
completes.
In do_softirq_workqueue() you schedule multiple workqueue items (one for
each softirq vector), which is unnecessary because they can't preempt one
another and are invoked in the order they were enqueued. So it would
be enough to enqueue one item because it is serialized after all. So one
work_struct per CPU with a cond_resched_rcu_qs() while switching from one
vector to another should accomplish what you have now here (not
sure if that cond_resched after each vector is needed). But…

> + trace_softirq_exit(vec_nr);
> + softirq->work_running = 0;
> + }
> +
> + local_irq_disable();
> +
> + pending = local_softirq_pending();
> + if (pending & vec_bit)
> + schedule_work_on(smp_processor_id(), &vector->work);

… on a system that is using system_wq a lot, it might introduce a certain
latency until your softirq worker gets its turn. The workqueue will
spawn new workers if the current worker schedules out, but until that
happens you have to wait. I am not sure if this is intended or whether
this might be a problem. I think you could argue either way depending on
what you currently think is more important.
Further, schedule_work_on(x, ) does not guarantee that the work item is
invoked on CPU x. It tries that, but if CPU x goes down due to
CPU-hotplug then the work item will be moved to a random CPU. For that
reason we have work_on_cpu_safe(), but you don't want to use that / flush
that workqueue while in here.

May I instead suggest to stick to ksoftirqd? So you run in softirq
context (after return from IRQ) and if it takes too long, you offload the
vector to ksoftirqd instead. You may want to play with the metric on
which you decide when you want to switch to ksoftirqd / account how long a
vector runs.

> + else
> + softirq->pending_work_mask &= ~vec_bit;
> +
> + lockdep_softirq_exit();
> + account_irq_exit_time(current);
> + __local_bh_enable(SOFTIRQ_OFFSET);
> + local_irq_enable();
> +}

Sebastian

2018-02-08 18:46:15

by David Miller

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

From: Sebastian Andrzej Siewior <[email protected]>
Date: Thu, 8 Feb 2018 18:44:52 +0100

> May I instead suggest to stick to ksoftirqd? So you run in softirq
> context (after return from IRQ) and if takes too long, you offload the
> vector to ksoftirqd instead. You may want to play with the metric on
> which you decide when you want switch to ksoftirqd / account how long a
> vector runs.

Having read over this stuff for the past few weeks this is how I feel
as well. Just make ksoftirqd do what we want (only execute the
overloaded softirq vectors).

The more I look at the workqueue stuff, the more complications and
weird behavioral artifacts we are getting for questionable gain.

2018-02-08 20:16:43

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> From: Sebastian Andrzej Siewior <[email protected]>
> Date: Thu, 8 Feb 2018 18:44:52 +0100
>
> > May I instead suggest to stick to ksoftirqd? So you run in softirq
> > context (after return from IRQ) and if takes too long, you offload
> the
> > vector to ksoftirqd instead. You may want to play with the metric
> on
> > which you decide when you want switch to ksoftirqd / account how
> long a
> > vector runs.
>
> Having read over this stuff for the past few weeks this is how I feel
> as well. Just make ksofbitrq do what we want (only execute the
> overloaded softirq vectors).
>
> The more I look at the workqueue stuff, the more complications and
> weird behavioral artifacts we are getting for questionable gain.

What about creating several ksoftirqd threads per CPU?
Like I did, with a boot parameter to specify how many threads and which
softirqs to serve.

--
Thanks,
Dmitry

2018-02-08 20:24:47

by David Miller

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

From: Dmitry Safonov <[email protected]>
Date: Thu, 08 Feb 2018 20:14:55 +0000

> On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
>> From: Sebastian Andrzej Siewior <[email protected]>
>> Date: Thu, 8 Feb 2018 18:44:52 +0100
>>
>> > May I instead suggest to stick to ksoftirqd? So you run in softirq
>> > context (after return from IRQ) and if takes too long, you offload
>> the
>> > vector to ksoftirqd instead. You may want to play with the metric
>> on
>> > which you decide when you want switch to ksoftirqd / account how
>> long a
>> > vector runs.
>>
>> Having read over this stuff for the past few weeks this is how I feel
>> as well. Just make ksofbitrq do what we want (only execute the
>> overloaded softirq vectors).
>>
>> The more I look at the workqueue stuff, the more complications and
>> weird behavioral artifacts we are getting for questionable gain.
>
> What about creating several ksoftirqd threads per-cpu?
> Like I did with boot parameter to specify how many threads and which
> softirqs to serve.

Why do we need more than one per cpu?

There is a set of vectors which are "overloaded" and ksoftirqd processes
them one by one.

The only difference with what happens now is that one softirq being
overloaded doesn't defer the processing of all softirqs to ksoftirqd.
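
Roughly, something along these lines is what I have in mind (hand-wavy,
untested sketch; the per-CPU overloaded mask and the masked __do_softirq()
variant are made up, just to show the shape of it):

static DEFINE_PER_CPU(u32, softirq_overloaded_mask);

static void run_ksoftirqd(unsigned int cpu)
{
	u32 overloaded = __this_cpu_read(softirq_overloaded_mask);

	local_irq_disable();
	if (local_softirq_pending() & overloaded) {
		/*
		 * Hypothetical variant of __do_softirq() that only serves
		 * the vectors in the given mask, leaving the others to be
		 * handled inline on IRQ tail as usual.
		 */
		__do_softirq_masked(overloaded);
		local_irq_enable();
		cond_resched_rcu_qs();
		return;
	}
	local_irq_enable();
}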

2018-02-08 20:31:41

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> From: Dmitry Safonov <[email protected]>
> Date: Thu, 08 Feb 2018 20:14:55 +0000
>
> > On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> >> From: Sebastian Andrzej Siewior <[email protected]>
> >> Date: Thu, 8 Feb 2018 18:44:52 +0100
> >>
> >> > May I instead suggest to stick to ksoftirqd? So you run in
> softirq
> >> > context (after return from IRQ) and if takes too long, you
> offload
> >> the
> >> > vector to ksoftirqd instead. You may want to play with the
> metric
> >> on
> >> > which you decide when you want switch to ksoftirqd / account how
> >> long a
> >> > vector runs.
> >>
> >> Having read over this stuff for the past few weeks this is how I
> feel
> >> as well. Just make ksofbitrq do what we want (only execute the
> >> overloaded softirq vectors).
> >>
> >> The more I look at the workqueue stuff, the more complications and
> >> weird behavioral artifacts we are getting for questionable gain.
> >
> > What about creating several ksoftirqd threads per-cpu?
> > Like I did with boot parameter to specify how many threads and
> which
> > softirqs to serve.
>
> Why do we need more than one per cpu?

Ugh, yeah, I remember why I did it - I tried to reuse the scheduler for
each ksoftirqd thread to decide if it needs to run now or later.
That would give an admin a way to prioritise softirqs with nice.
Not sure if it's a nice idea at all..

>
> There is a set of vectors which are "overloaded" and ksoftirqd
> processes
> them one by one.
>
> The only difference with what happens now is that one softirq being
> overloaded doesn't defer the processing of all softirqs to ksoftirqd.

2018-02-09 04:12:57

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Thu, 2018-02-08 at 20:30 +0000, Dmitry Safonov wrote:
> On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> > From: Dmitry Safonov <[email protected]>
> > Date: Thu, 08 Feb 2018 20:14:55 +0000
> >
> > > On Thu, 2018-02-08 at 13:45 -0500, David Miller wrote:
> > >> From: Sebastian Andrzej Siewior <[email protected]>
> > >> Date: Thu, 8 Feb 2018 18:44:52 +0100
> > >>
> > >> > May I instead suggest to stick to ksoftirqd? So you run in
> > softirq
> > >> > context (after return from IRQ) and if takes too long, you
> > offload
> > >> the
> > >> > vector to ksoftirqd instead. You may want to play with the
> > metric
> > >> on
> > >> > which you decide when you want switch to ksoftirqd / account how
> > >> long a
> > >> > vector runs.
> > >>
> > >> Having read over this stuff for the past few weeks this is how I
> > feel
> > >> as well. Just make ksofbitrq do what we want (only execute the
> > >> overloaded softirq vectors).
> > >>
> > >> The more I look at the workqueue stuff, the more complications and
> > >> weird behavioral artifacts we are getting for questionable gain.
> > >
> > > What about creating several ksoftirqd threads per-cpu?
> > > Like I did with boot parameter to specify how many threads and
> > which
> > > softirqs to serve.
> >
> > Why do we need more than one per cpu?
>
> Ugh, yeah, I remember why I did it - I tried to reuse scheduler for
> each ksoftirqd thread to decide if it need to run now or later.
> That would give an admin a way to prioritise softirqs with nice.
> Not sure if it's a nice idea at all..

For RT that can be handy, but for the general case it's a waste of
cycles, so it would want to be opt-in.

-Mike

by Sebastian Andrzej Siewior

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On 2018-02-09 05:11:08 [+0100], Mike Galbraith wrote:
> On Thu, 2018-02-08 at 20:30 +0000, Dmitry Safonov wrote:
> > On Thu, 2018-02-08 at 15:22 -0500, David Miller wrote:
> > > Why do we need more than one per cpu?
> >
> > Ugh, yeah, I remember why I did it - I tried to reuse scheduler for
> > each ksoftirqd thread to decide if it need to run now or later.
> > That would give an admin a way to prioritise softirqs with nice.
> > Not sure if it's a nice idea at all..
>
> For RT that can be handy, but for the general case it's a waste of
> cycles, so would want to be opt-in.

We used to have it in RT. It got replaced because the context switch from
one thread to the other simply took time.
If you need to prioritise one softirq over another I think you would
be better off running with threaded interrupts. In "normal" mode, if you
have a raised NAPI and say a TASKLET, then once both IRQ handlers complete
(and interrupts get enabled again) the softirqs are run. With
threaded interrupts enabled, the NAPI softirq is invoked once the
threaded handler that raised it completes - same goes for the other
threaded handler which raised the tasklet softirq.
So you could simply change the priority of the threaded interrupt in
order to prefer NAPI over tasklet handling (or the other way around).

> -Mike

Sebastian

2018-02-15 16:15:27

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On Thu, Feb 08, 2018 at 06:44:52PM +0100, Sebastian Andrzej Siewior wrote:
> On 2018-01-19 16:46:12 [+0100], Frederic Weisbecker wrote:
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index c8c6841..becb1d9 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -62,6 +62,19 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
> …
> > +static void vector_work_func(struct work_struct *work)
> > +{
> > + struct vector *vector = container_of(work, struct vector, work);
> > + struct softirq *softirq = this_cpu_ptr(&softirq_cpu);
> > + int vec_nr = vector->nr;
> > + int vec_bit = BIT(vec_nr);
> > + u32 pending;
> > +
> > + local_irq_disable();
> > + pending = local_softirq_pending();
> > + account_irq_enter_time(current);
> > + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> > + lockdep_softirq_enter();
> > + set_softirq_pending(pending & ~vec_bit);
> > + local_irq_enable();
> > +
> > + if (pending & vec_bit) {
> > + struct softirq_action *sa = &softirq_vec[vec_nr];
> > +
> > + kstat_incr_softirqs_this_cpu(vec_nr);
> > + softirq->work_running = 1;
> > + trace_softirq_entry(vec_nr);
> > + sa->action(sa);
>
> You invoke the softirq handler while BH is disabled (not wrong, I just
> state the obvious). That means, the scheduler can't preempt/interrupt
> the workqueue/BH-handler while it is invoked so it has to wait until it
> completes its doing.
> In do_softirq_workqueue() you schedule multiple workqueue items (one for
> each softirq vector) which is unnecessary because they can't preempt one
> another and should be invoked the order they were enqueued. So it would
> be enough to enqueue one item because it is serialized after all. So one
> work_struct per CPU with a cond_resched_rcu_qs() while switching from one
> vector to another should accomplish that what you have now here (not
> sure if that cond_resched after each vector is needed). But…

Makes sense.
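
Something like this, I suppose (very rough and untested sketch on top of
this patch, assuming the per-vector work structs get merged into a single
per-CPU work item; irqtime accounting and lockdep annotations left out for
brevity):

static void softirq_work_func(struct work_struct *work)
{
	struct softirq *softirq = this_cpu_ptr(&softirq_cpu);

	/* Serially drain the deferred vectors from a single work item */
	while (softirq->pending_work_mask) {
		unsigned int vec_nr = __ffs(softirq->pending_work_mask);
		struct softirq_action *sa = &softirq_vec[vec_nr];

		local_irq_disable();
		__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
		set_softirq_pending(local_softirq_pending() & ~BIT(vec_nr));
		local_irq_enable();

		kstat_incr_softirqs_this_cpu(vec_nr);
		trace_softirq_entry(vec_nr);
		sa->action(sa);
		trace_softirq_exit(vec_nr);

		local_irq_disable();
		/* Keep the vector deferred if it got re-raised meanwhile */
		if (!(local_softirq_pending() & BIT(vec_nr)))
			softirq->pending_work_mask &= ~BIT(vec_nr);
		__local_bh_enable(SOFTIRQ_OFFSET);
		local_irq_enable();

		/* BH is on again here, so the scheduler can preempt us */
		cond_resched_rcu_qs();
	}
}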

>
> > + trace_softirq_exit(vec_nr);
> > + softirq->work_running = 0;
> > + }
> > +
> > + local_irq_disable();
> > +
> > + pending = local_softirq_pending();
> > + if (pending & vec_bit)
> > + schedule_work_on(smp_processor_id(), &vector->work);
>
> … on a system that is using system_wq a lot, it might introduced a certain
> latency until your softirq-worker gets its turn. The workqueue will
> spawn new workers if the current worker schedules out but until that
> happens you have to wait. I am not sure if this is intended or whether
> this might be a problem. I think you could argue either way depending on
> what you currently think is more important.

Indeed :)

> Further, schedule_work_on(x, ) does not guarentee that the work item is
> invoked on CPU x. It tries that but if CPU x goes down due to
> CPU-hotplug then the workitem will be moved to random CPU. For that
> reason we have work_on_cpu_safe() but you don't want to use that / flush
> that workqueue while in here.

Yeah, someone also reported that hotplug issue to me. I didn't think
workqueues would break the affinity, but here they do. So we would need a
hotplug hook indeed.
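
Something in the spirit of takeover_tasklets() maybe (rough and untested
sketch, the hotplug state name is made up), where whatever the dead CPU
had deferred gets re-raised on the surviving CPU running the callback:

static int softirq_wq_cpu_dead(unsigned int cpu)
{
	/* Runs on a surviving CPU once @cpu is gone */
	struct softirq *softirq = &per_cpu(softirq_cpu, cpu);
	u32 mask = softirq->pending_work_mask;

	softirq->pending_work_mask = 0;

	local_irq_disable();
	while (mask) {
		unsigned int vec_nr = __ffs(mask);

		mask &= ~BIT(vec_nr);
		raise_softirq_irqoff(vec_nr);
	}
	local_irq_enable();

	return 0;
}

/*
 * Registered from softirq_init(), with a new dedicated state, e.g.:
 *
 *	cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_WQ_DEAD, "softirq:wq_dead",
 *				  NULL, softirq_wq_cpu_dead);
 */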

>
> May I instead suggest to stick to ksoftirqd? So you run in softirq
> context (after return from IRQ) and if takes too long, you offload the
> vector to ksoftirqd instead. You may want to play with the metric on
> which you decide when you want switch to ksoftirqd / account how long a
> vector runs.

Yeah that makes sense. These workqueues are too much of a headache eventually.
I'm going to try that ksoftirqd thing.

Thanks.

by Sebastian Andrzej Siewior

Subject: Re: [RFC PATCH 2/4] softirq: Per vector deferment to workqueue

On 2018-02-15 17:13:52 [+0100], Frederic Weisbecker wrote:
> Yeah that makes sense. These workqueues are too much headaches eventually.
> I'm going to try that ksoftirqd thing.

I may have something for that USB problem, I just need some testing
before posting…

> Thanks.

Sebastian

2018-03-01 15:22:34

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] softirq: Per vector threading v3

On Wed, Feb 07, 2018 at 12:18:44PM -0200, Mauro Carvalho Chehab wrote:
> Em Fri, 19 Jan 2018 16:46:10 +0100
> Frederic Weisbecker <[email protected]> escreveu:
>
> > As per Linus suggestion, this take doesn't limit the number of occurences
> > per jiffy anymore but instead defers a vector to workqueues as soon as
> > it gets re-enqueued on IRQ tail.
> >
> > No tunable here, so testing should be easier.
>
> Sorry for taking so long to test it. Just returned this week from a 2
> weeks travel, and had first to submit media patches for the merge window.
>
> Didn't work. With this patch series, it still lose frames with
> usb bulk transfers, while recording DVB programs with tvheadend.

How can we reproduce this workload?