2018-01-12 05:36:07

by Frederic Weisbecker

Subject: [RFC PATCH 0/2] softirq: Per vector threading

So this is a first shot at implementing what Linus suggested.
To summarize: when a softirq vector is stormed and needs more time than
the IRQ tail can offer, the whole softirq processing is offloaded to
ksoftirqd. But this has an impact on the other softirq vectors, which are
then subject to scheduler latencies.

So the softirq time limit is now per vector, and only the vectors that
get stormed are offloaded to a thread (workqueue).

This is in a very proof-of-concept state. It doesn't even boot successfully
every time. So I'll do more debugging tomorrow (today in fact), but
you get the big picture.

It probably won't come for free, given the clock reads around the softirq callbacks.

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
softirq/poc

HEAD: 0e982634115283710d0801048e5a316def26f31d

Thanks,
Frederic
---

Frederic Weisbecker (2):
softirq: Account time and iteration stats per vector
softirq: Per vector thread deferment


kernel/softirq.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 114 insertions(+), 9 deletions(-)


2018-01-12 05:36:15

by Frederic Weisbecker

Subject: [RFC PATCH 1/2] softirq: Account time and iteration stats per vector

As we plan to be able to defer some specific softirq vector processing
to workqueues when those vectors need more time than IRQs can offer,
let's first count the time spent and the number of occurrences per vector.

For now we still defer to ksoftirqd when the per-vector limits are reached.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
---
kernel/softirq.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f..fa267f7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -26,6 +26,7 @@
#include <linux/smpboot.h>
#include <linux/tick.h>
#include <linux/irq.h>
+#include <linux/sched/clock.h>

#define CREATE_TRACE_POINTS
#include <trace/events/irq.h>
@@ -62,6 +63,17 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
"TASKLET", "SCHED", "HRTIMER", "RCU"
};

+struct vector_stat {
+ u64 time;
+ int count;
+};
+
+struct softirq_stat {
+ struct vector_stat stat[NR_SOFTIRQS];
+};
+
+static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
+
/*
* we cannot loop indefinitely here to avoid userspace starvation,
* but we also don't want to introduce a worst case 1/HZ latency
@@ -203,7 +215,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
* we want to handle softirqs as soon as possible, but they
* should not be able to lock up the box.
*/
-#define MAX_SOFTIRQ_TIME msecs_to_jiffies(2)
+#define MAX_SOFTIRQ_TIME (2 * NSEC_PER_MSEC)
#define MAX_SOFTIRQ_RESTART 10

#ifdef CONFIG_TRACE_IRQFLAGS
@@ -241,12 +253,11 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }

asmlinkage __visible void __softirq_entry __do_softirq(void)
{
- unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
+ struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
unsigned long old_flags = current->flags;
- int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
- __u32 pending;
+ __u32 pending, overrun = 0;
int softirq_bit;

/*
@@ -262,6 +273,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
in_hardirq = lockdep_softirq_start();

+ memzero_explicit(sstat, sizeof(*sstat));
restart:
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);
@@ -271,8 +283,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
h = softirq_vec;

while ((softirq_bit = ffs(pending))) {
+ struct vector_stat *vstat;
unsigned int vec_nr;
int prev_count;
+ u64 startime;

h += softirq_bit - 1;

@@ -280,10 +294,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
prev_count = preempt_count();

kstat_incr_softirqs_this_cpu(vec_nr);
+ vstat = &sstat->stat[vec_nr];

trace_softirq_entry(vec_nr);
+ startime = local_clock();
h->action(h);
+ vstat->time += local_clock() - startime;
+ vstat->count++;
trace_softirq_exit(vec_nr);
+
+ if (vstat->time > MAX_SOFTIRQ_TIME || vstat->count > MAX_SOFTIRQ_RESTART)
+ overrun |= 1 << vec_nr;
+
if (unlikely(prev_count != preempt_count())) {
pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
vec_nr, softirq_to_name[vec_nr], h->action,
@@ -299,11 +321,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending();
if (pending) {
- if (time_before(jiffies, end) && !need_resched() &&
- --max_restart)
+ if (overrun || need_resched())
+ wakeup_softirqd();
+ else
goto restart;
-
- wakeup_softirqd();
}

lockdep_softirq_end(in_hardirq);
--
2.7.4

2018-01-12 05:36:23

by Frederic Weisbecker

Subject: [RFC PATCH 2/2] softirq: Per vector thread deferment

Some softirq vectors can be more CPU hungry than others. Especially
networking may sometimes deal with packet storms and need more CPU than
the IRQ tail can offer without inducing scheduler latencies. In this case
the current code defers to ksoftirqd, which behaves nicer. Now this nice
behaviour can be bad for other IRQ vectors that usually need quick
processing.

To solve this, we only defer to threading the vectors that exceeded the
time limit on IRQ tail processing and leave the others inline in actual
softirq service. This is achieved using workqueues with per-CPU,
per-vector work items.

Note ksoftirqd is not removed as it is still needed for threaded IRQs
mode.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Levin Alexander <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>
---
kernel/softirq.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 87 insertions(+), 3 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index fa267f7..0c817ec6 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -74,6 +74,13 @@ struct softirq_stat {

static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);

+struct vector_work {
+ int vec;
+ struct work_struct work;
+};
+
+static DEFINE_PER_CPU(struct vector_work[NR_SOFTIRQS], vector_work_cpu);
+
/*
* we cannot loop indefinitely here to avoid userspace starvation,
* but we also don't want to introduce a worst case 1/HZ latency
@@ -251,6 +258,70 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

+static void vector_work_func(struct work_struct *work)
+{
+ struct vector_work *vector_work;
+ u32 pending;
+ int vec;
+
+ vector_work = container_of(work, struct vector_work, work);
+ vec = vector_work->vec;
+
+ local_irq_disable();
+ pending = local_softirq_pending();
+ account_irq_enter_time(current);
+ __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
+ lockdep_softirq_enter();
+ set_softirq_pending(pending & ~(1 << vec));
+ local_irq_enable();
+
+ if (pending & (1 << vec)) {
+ struct softirq_action *sa = &softirq_vec[vec];
+
+ kstat_incr_softirqs_this_cpu(vec);
+ trace_softirq_entry(vec);
+ sa->action(sa);
+ trace_softirq_exit(vec);
+ }
+
+ local_irq_disable();
+
+ pending = local_softirq_pending();
+ if (pending & (1 << vec))
+ schedule_work_on(smp_processor_id(), work);
+
+ lockdep_softirq_exit();
+ account_irq_exit_time(current);
+ __local_bh_enable(SOFTIRQ_OFFSET);
+ local_irq_enable();
+}
+
+static int do_softirq_overrun(u32 overrun, u32 pending)
+{
+ struct softirq_action *h = softirq_vec;
+ int softirq_bit;
+
+ if (!overrun)
+ return pending;
+
+ overrun &= pending;
+ pending &= ~overrun;
+
+ while ((softirq_bit = ffs(overrun))) {
+ struct vector_work *work;
+ unsigned int vec_nr;
+
+ h += softirq_bit - 1;
+ vec_nr = h - softirq_vec;
+ work = this_cpu_ptr(&vector_work_cpu[vec_nr]);
+ schedule_work_on(smp_processor_id(), &work->work);
+ h++;
+ overrun >>= softirq_bit;
+ }
+
+ return pending;
+}
+
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
@@ -321,10 +392,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending();
if (pending) {
- if (overrun || need_resched())
+ if (need_resched()) {
wakeup_softirqd();
- else
- goto restart;
+ } else {
+ pending = do_softirq_overrun(overrun, pending);
+ if (pending)
+ goto restart;
+ }
}

lockdep_softirq_end(in_hardirq);
@@ -661,10 +735,20 @@ void __init softirq_init(void)
int cpu;

for_each_possible_cpu(cpu) {
+ int i;
+
per_cpu(tasklet_vec, cpu).tail =
&per_cpu(tasklet_vec, cpu).head;
per_cpu(tasklet_hi_vec, cpu).tail =
&per_cpu(tasklet_hi_vec, cpu).head;
+
+ for (i = 0; i < NR_SOFTIRQS; i++) {
+ struct vector_work *work;
+
+ work = &per_cpu(vector_work_cpu[i], cpu);
+ work->vec = i;
+ INIT_WORK(&work->work, vector_work_func);
+ }
}

open_softirq(TASKLET_SOFTIRQ, tasklet_action);
--
2.7.4

2018-01-12 06:23:02

by Eric Dumazet

Subject: Re: [RFC PATCH 1/2] softirq: Account time and iteration stats per vector

On Thu, Jan 11, 2018 at 9:35 PM, Frederic Weisbecker
<[email protected]> wrote:
> As we plan to be able to defer some specific softurq vector processing
> to workqueues when those vectors need more time than IRQs can offer,
> let's first count the time spent and the number of occurences per vector.
>
> For now we still defer to ksoftirqd when the per vector limits are reached
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Miller <[email protected]>
> Cc: Hannes Frederic Sowa <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Levin Alexander <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Radu Rendec <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Stanislaw Gruszka <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> ---
> kernel/softirq.c | 37 +++++++++++++++++++++++++++++--------
> 1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 2f5e87f..fa267f7 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -26,6 +26,7 @@
> #include <linux/smpboot.h>
> #include <linux/tick.h>
> #include <linux/irq.h>
> +#include <linux/sched/clock.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/irq.h>
> @@ -62,6 +63,17 @@ const char * const softirq_to_name[NR_SOFTIRQS] = {
> "TASKLET", "SCHED", "HRTIMER", "RCU"
> };
>
> +struct vector_stat {
> + u64 time;
> + int count;
> +};
> +
> +struct softirq_stat {
> + struct vector_stat stat[NR_SOFTIRQS];
> +};
> +
> +static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
> +
> /*
> * we cannot loop indefinitely here to avoid userspace starvation,
> * but we also don't want to introduce a worst case 1/HZ latency
> @@ -203,7 +215,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
> * we want to handle softirqs as soon as possible, but they
> * should not be able to lock up the box.
> */
> -#define MAX_SOFTIRQ_TIME msecs_to_jiffies(2)
> +#define MAX_SOFTIRQ_TIME (2 * NSEC_PER_MSEC)
> #define MAX_SOFTIRQ_RESTART 10
>
> #ifdef CONFIG_TRACE_IRQFLAGS
> @@ -241,12 +253,11 @@ static inline void lockdep_softirq_end(bool in_hardirq) { }
>
> asmlinkage __visible void __softirq_entry __do_softirq(void)
> {
> - unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> + struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
> unsigned long old_flags = current->flags;
> - int max_restart = MAX_SOFTIRQ_RESTART;
> struct softirq_action *h;
> bool in_hardirq;
> - __u32 pending;
> + __u32 pending, overrun = 0;
> int softirq_bit;
>
> /*
> @@ -262,6 +273,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> in_hardirq = lockdep_softirq_start();
>
> + memzero_explicit(sstat, sizeof(*sstat));

If you clear sstat here, it means it does not need to be a per-cpu
variable, but an automatic one (defined on the stack).

I presume we need a per-cpu var to track CPU usage over the last time window.

(typical case: 99,000 IRQs per second, one packet delivered per IRQ,
10 usec spent per packet, i.e. roughly 99% of a CPU over a second even though
each individual softirq run stays well under the per-invocation limit)



> restart:
> /* Reset the pending bitmask before enabling irqs */
> set_softirq_pending(0);
> @@ -271,8 +283,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> h = softirq_vec;
>
> while ((softirq_bit = ffs(pending))) {
> + struct vector_stat *vstat;
> unsigned int vec_nr;
> int prev_count;
> + u64 startime;
>
> h += softirq_bit - 1;
>
> @@ -280,10 +294,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> prev_count = preempt_count();
>
> kstat_incr_softirqs_this_cpu(vec_nr);
> + vstat = &sstat->stat[vec_nr];
>
> trace_softirq_entry(vec_nr);
> + startime = local_clock();
> h->action(h);
> + vstat->time += local_clock() - startime;

You might store local_clock() in a variable, so that we do not call
local_clock() twice per ->action() call.


> + vstat->count++;
> trace_softirq_exit(vec_nr);
> +
> + if (vstat->time > MAX_SOFTIRQ_TIME || vstat->count > MAX_SOFTIRQ_RESTART)

If we trust local_clock() to be precise enough, we do not need to
track vstat->count anymore.

> + overrun |= 1 << vec_nr;
> +
> if (unlikely(prev_count != preempt_count())) {
> pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
> vec_nr, softirq_to_name[vec_nr], h->action,
> @@ -299,11 +321,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
> pending = local_softirq_pending();
> if (pending) {
> - if (time_before(jiffies, end) && !need_resched() &&
> - --max_restart)
> + if (overrun || need_resched())
> + wakeup_softirqd();
> + else
> goto restart;
> -
> - wakeup_softirqd();
> }
>
> lockdep_softirq_end(in_hardirq);
> --
> 2.7.4
>

2018-01-12 06:27:33

by Frederic Weisbecker

Subject: Re: [RFC PATCH 2/2] softirq: Per vector thread deferment

On Fri, Jan 12, 2018 at 06:35:54AM +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
>
> To solve this we only defer to threading the vectors that outreached the
> time limit on IRQ tail processing and leave the others inline on real
> Soft-IRQs service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
>
> Note ksoftirqd is not removed as it is still needed for threaded IRQs
> mode.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Miller <[email protected]>
> Cc: Hannes Frederic Sowa <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Levin Alexander <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Radu Rendec <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Stanislaw Gruszka <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> ---
> kernel/softirq.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 87 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index fa267f7..0c817ec6 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -74,6 +74,13 @@ struct softirq_stat {
>
> static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
>
> +struct vector_work {
> + int vec;
> + struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct vector_work[NR_SOFTIRQS], vector_work_cpu);
> +
> /*
> * we cannot loop indefinitely here to avoid userspace starvation,
> * but we also don't want to introduce a worst case 1/HZ latency
> @@ -251,6 +258,70 @@ static inline bool lockdep_softirq_start(void) { return false; }
> static inline void lockdep_softirq_end(bool in_hardirq) { }
> #endif
>
> +static void vector_work_func(struct work_struct *work)
> +{
> + struct vector_work *vector_work;
> + u32 pending;
> + int vec;
> +
> + vector_work = container_of(work, struct vector_work, work);
> + vec = vector_work->vec;
> +
> + local_irq_disable();
> + pending = local_softirq_pending();
> + account_irq_enter_time(current);
> + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> + lockdep_softirq_enter();
> + set_softirq_pending(pending & ~(1 << vec));
> + local_irq_enable();
> +
> + if (pending & (1 << vec)) {

Ah I see the problem. Say in do_softirq() we had VECTOR 1 and VECTOR 2 pending,
and we had overrun only VECTOR 1, so VECTOR 1 is enqueued to the workqueue.
Right after that we go back to the restart loop in do_softirq() in order to
handle pending VECTOR 2, but we erase the local_softirq_pending state. So
when the workqueue runs, it no longer sees VECTOR 1 pending and we lose
it.

So I need to remove the above condition and make the vector work
unconditionally execute the vector callback.
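
Something like this (untested), running the callback unconditionally:

	struct softirq_action *sa = &softirq_vec[vec];

	/*
	 * The pending bit may already have been consumed by the
	 * __do_softirq() restart loop that queued this work, so don't
	 * re-check local_softirq_pending() before running the callback.
	 */
	kstat_incr_softirqs_this_cpu(vec);
	trace_softirq_entry(vec);
	sa->action(sa);
	trace_softirq_exit(vec);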

Now I can go to sleep...


> + struct softirq_action *sa = &softirq_vec[vec];
> +
> + kstat_incr_softirqs_this_cpu(vec);
> + trace_softirq_entry(vec);
> + sa->action(sa);
> + trace_softirq_exit(vec);
> + }

2018-01-12 09:07:40

by Paolo Abeni

Subject: Re: [RFC PATCH 2/2] softirq: Per vector thread deferment

On Fri, 2018-01-12 at 06:35 +0100, Frederic Weisbecker wrote:
> Some softirq vectors can be more CPU hungry than others. Especially
> networking may sometimes deal with packet storm and need more CPU than
> IRQ tail can offer without inducing scheduler latencies. In this case
> the current code defers to ksoftirqd that behaves nicer. Now this nice
> behaviour can be bad for other IRQ vectors that usually need quick
> processing.
>
> To solve this we only defer to threading the vectors that outreached the
> time limit on IRQ tail processing and leave the others inline on real
> Soft-IRQs service. This is achieved using workqueues with
> per-CPU/per-vector worklets.
>
> Note ksoftirqd is not removed as it is still needed for threaded IRQs
> mode.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Miller <[email protected]>
> Cc: Hannes Frederic Sowa <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Levin Alexander <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Radu Rendec <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Stanislaw Gruszka <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Wanpeng Li <[email protected]>
> ---
> kernel/softirq.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 87 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index fa267f7..0c817ec6 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -74,6 +74,13 @@ struct softirq_stat {
>
> static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
>
> +struct vector_work {
> + int vec;
> + struct work_struct work;
> +};
> +
> +static DEFINE_PER_CPU(struct vector_work[NR_SOFTIRQS], vector_work_cpu);
> +
> /*
> * we cannot loop indefinitely here to avoid userspace starvation,
> * but we also don't want to introduce a worst case 1/HZ latency
> @@ -251,6 +258,70 @@ static inline bool lockdep_softirq_start(void) { return false; }
> static inline void lockdep_softirq_end(bool in_hardirq) { }
> #endif
>
> +static void vector_work_func(struct work_struct *work)
> +{
> + struct vector_work *vector_work;
> + u32 pending;
> + int vec;
> +
> + vector_work = container_of(work, struct vector_work, work);
> + vec = vector_work->vec;
> +
> + local_irq_disable();
> + pending = local_softirq_pending();
> + account_irq_enter_time(current);
> + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> + lockdep_softirq_enter();
> + set_softirq_pending(pending & ~(1 << vec));
> + local_irq_enable();
> +
> + if (pending & (1 << vec)) {
> + struct softirq_action *sa = &softirq_vec[vec];
> +
> + kstat_incr_softirqs_this_cpu(vec);
> + trace_softirq_entry(vec);
> + sa->action(sa);
> + trace_softirq_exit(vec);
> + }
> +
> + local_irq_disable();
> +
> + pending = local_softirq_pending();
> + if (pending & (1 << vec))
> + schedule_work_on(smp_processor_id(), work);

If we check for the overrun condition here, as done in the
__do_softirq() main loop, we could avoid ksoftirqd completely and
probably have less code duplication.
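
Something like this very rough and untested sketch (reusing MAX_SOFTIRQ_TIME
from patch 1; the exact pending bit handling would need a closer look):

	struct softirq_action *sa = &softirq_vec[vec];
	u64 start = local_clock();

	for (;;) {
		kstat_incr_softirqs_this_cpu(vec);
		trace_softirq_entry(vec);
		sa->action(sa);
		trace_softirq_exit(vec);

		local_irq_disable();
		pending = local_softirq_pending();
		if (!(pending & (1 << vec)))
			break;

		if (local_clock() - start > MAX_SOFTIRQ_TIME || need_resched()) {
			/* Over budget: leave the bit set and requeue ourselves */
			schedule_work_on(smp_processor_id(), work);
			break;
		}

		/* Still under budget: consume the bit and run inline again */
		set_softirq_pending(pending & ~(1 << vec));
		local_irq_enable();
	}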

> +
> + lockdep_softirq_exit();
> + account_irq_exit_time(current);
> + __local_bh_enable(SOFTIRQ_OFFSET);
> + local_irq_enable();
> +}
> +
> +static int do_softirq_overrun(u32 overrun, u32 pending)
> +{
> + struct softirq_action *h = softirq_vec;
> + int softirq_bit;
> +
> + if (!overrun)
> + return pending;
> +
> + overrun &= pending;
> + pending &= ~overrun;
> +
> + while ((softirq_bit = ffs(overrun))) {
> + struct vector_work *work;
> + unsigned int vec_nr;
> +
> + h += softirq_bit - 1;
> + vec_nr = h - softirq_vec;
> + work = this_cpu_ptr(&vector_work_cpu[vec_nr]);
> + schedule_work_on(smp_processor_id(), &work->work);
> + h++;
> + overrun >>= softirq_bit;
> + }
> +
> + return pending;
> +}
> +
> asmlinkage __visible void __softirq_entry __do_softirq(void)
> {
> struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
> @@ -321,10 +392,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
> pending = local_softirq_pending();
> if (pending) {
> - if (overrun || need_resched())
> + if (need_resched()) {
> wakeup_softirqd();
> - else
> - goto restart;
> + } else {
> + pending = do_softirq_overrun(overrun, pending);
> + if (pending)
> + goto restart;
> + }
> }
>
> lockdep_softirq_end(in_hardirq);

This way the 'overrun' branch is not triggered if we (also) need to
reschedule; should we test for overrun first?

Cheers,

Paolo

2018-01-12 14:34:54

by Frederic Weisbecker

Subject: Re: [RFC PATCH 1/2] softirq: Account time and iteration stats per vector

On Thu, Jan 11, 2018 at 10:22:58PM -0800, Eric Dumazet wrote:
> > asmlinkage __visible void __softirq_entry __do_softirq(void)
> > {
> > - unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> > + struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
> > unsigned long old_flags = current->flags;
> > - int max_restart = MAX_SOFTIRQ_RESTART;
> > struct softirq_action *h;
> > bool in_hardirq;
> > - __u32 pending;
> > + __u32 pending, overrun = 0;
> > int softirq_bit;
> >
> > /*
> > @@ -262,6 +273,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> > __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> > in_hardirq = lockdep_softirq_start();
> >
> > + memzero_explicit(sstat, sizeof(*sstat));
>
> If you clear sstat here, it means it does not need to be a per cpu
> variable, but an automatic one (defined on the stack)

That's right. But I thought it was a bit large for the stack:

struct {
	u64 time;
	u64 count;
} [NR_SOFTIRQS]

although arguably we are either using the softirq stack or a fresh task one.

>
> I presume we need a per cpu var to track cpu usage on last time window.
>
> ( typical case of 99,000 IRQ per second, one packet delivered per IRQ,
> 10 usec spent per packet)

So should I account per-vector stats in a jiffy window, for example? And apply
the limits on top of that?

>
>
>
> > restart:
> > /* Reset the pending bitmask before enabling irqs */
> > set_softirq_pending(0);
> > @@ -271,8 +283,10 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> > h = softirq_vec;
> >
> > while ((softirq_bit = ffs(pending))) {
> > + struct vector_stat *vstat;
> > unsigned int vec_nr;
> > int prev_count;
> > + u64 startime;
> >
> > h += softirq_bit - 1;
> >
> > @@ -280,10 +294,18 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> > prev_count = preempt_count();
> >
> > kstat_incr_softirqs_this_cpu(vec_nr);
> > + vstat = &sstat->stat[vec_nr];
> >
> > trace_softirq_entry(vec_nr);
> > + startime = local_clock();
> > h->action(h);
> > + vstat->time += local_clock() - startime;
>
> You might store local_clock() in a variable, so that we do not call
> local_clock() two times per ->action() called.

So you mean I reuse the second local_clock() read as the start time of the next
iteration of the while loop, right? Yep, that sounds possible.
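
I.e. something like this (untested, irrelevant parts of the loop left out):

	u64 now = local_clock();

	while ((softirq_bit = ffs(pending))) {
		u64 prev = now;

		/* ... pick the vector, kstat, entry tracepoint ... */
		h->action(h);

		/*
		 * One clock read both ends this vector's slice and starts
		 * the next one's; the small gap between two callbacks simply
		 * gets charged to the following vector.
		 */
		now = local_clock();
		vstat->time += now - prev;

		/* ... exit tracepoint, overrun check, advance h and pending ... */
	}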

>
>
> > + vstat->count++;
> > trace_softirq_exit(vec_nr);
> > +
> > + if (vstat->time > MAX_SOFTIRQ_TIME || vstat->count > MAX_SOFTIRQ_RESTART)
>
> If we trust local_clock() to be precise enough, we do not need to
> track vstat->count anymore.

That's what I was thinking. Should I keep MAX_SOFTIRQ_TIME at 2 ms BTW? It looks a bit long
to me.

Thanks.

2018-01-12 14:56:11

by Frederic Weisbecker

Subject: Re: [RFC PATCH 2/2] softirq: Per vector thread deferment

On Fri, Jan 12, 2018 at 10:07:25AM +0100, Paolo Abeni wrote:
> On Fri, 2018-01-12 at 06:35 +0100, Frederic Weisbecker wrote:
> > Some softirq vectors can be more CPU hungry than others. Especially
> > networking may sometimes deal with packet storm and need more CPU than
> > IRQ tail can offer without inducing scheduler latencies. In this case
> > the current code defers to ksoftirqd that behaves nicer. Now this nice
> > behaviour can be bad for other IRQ vectors that usually need quick
> > processing.
> >
> > To solve this we only defer to threading the vectors that outreached the
> > time limit on IRQ tail processing and leave the others inline on real
> > Soft-IRQs service. This is achieved using workqueues with
> > per-CPU/per-vector worklets.
> >
> > Note ksoftirqd is not removed as it is still needed for threaded IRQs
> > mode.
> >
> > Suggested-by: Linus Torvalds <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Dmitry Safonov <[email protected]>
> > Cc: Eric Dumazet <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: David Miller <[email protected]>
> > Cc: Hannes Frederic Sowa <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Levin Alexander <[email protected]>
> > Cc: Paolo Abeni <[email protected]>
> > Cc: Paul E. McKenney <[email protected]>
> > Cc: Radu Rendec <[email protected]>
> > Cc: Rik van Riel <[email protected]>
> > Cc: Stanislaw Gruszka <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Wanpeng Li <[email protected]>
> > ---
> > kernel/softirq.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 87 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index fa267f7..0c817ec6 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -74,6 +74,13 @@ struct softirq_stat {
> >
> > static DEFINE_PER_CPU(struct softirq_stat, softirq_stat_cpu);
> >
> > +struct vector_work {
> > + int vec;
> > + struct work_struct work;
> > +};
> > +
> > +static DEFINE_PER_CPU(struct vector_work[NR_SOFTIRQS], vector_work_cpu);
> > +
> > /*
> > * we cannot loop indefinitely here to avoid userspace starvation,
> > * but we also don't want to introduce a worst case 1/HZ latency
> > @@ -251,6 +258,70 @@ static inline bool lockdep_softirq_start(void) { return false; }
> > static inline void lockdep_softirq_end(bool in_hardirq) { }
> > #endif
> >
> > +static void vector_work_func(struct work_struct *work)
> > +{
> > + struct vector_work *vector_work;
> > + u32 pending;
> > + int vec;
> > +
> > + vector_work = container_of(work, struct vector_work, work);
> > + vec = vector_work->vec;
> > +
> > + local_irq_disable();
> > + pending = local_softirq_pending();
> > + account_irq_enter_time(current);
> > + __local_bh_disable_ip(_RET_IP_, SOFTIRQ_OFFSET);
> > + lockdep_softirq_enter();
> > + set_softirq_pending(pending & ~(1 << vec));
> > + local_irq_enable();
> > +
> > + if (pending & (1 << vec)) {
> > + struct softirq_action *sa = &softirq_vec[vec];
> > +
> > + kstat_incr_softirqs_this_cpu(vec);
> > + trace_softirq_entry(vec);
> > + sa->action(sa);
> > + trace_softirq_exit(vec);
> > + }
> > +
> > + local_irq_disable();
> > +
> > + pending = local_softirq_pending();
> > + if (pending & (1 << vec))
> > + schedule_work_on(smp_processor_id(), work);
>
> If we check for the overrun condition here, as done in the
> __do_softirq() main loop, we could avoid ksoftirqd completely and
> probably have less code duplication.

Yes, that could be possible indeed. I guess having workqueues serialize the
vector works is not much different from what ksoftirqd does.

I can try that.

>
> > +
> > + lockdep_softirq_exit();
> > + account_irq_exit_time(current);
> > + __local_bh_enable(SOFTIRQ_OFFSET);
> > + local_irq_enable();
> > +}
> > +
> > +static int do_softirq_overrun(u32 overrun, u32 pending)
> > +{
> > + struct softirq_action *h = softirq_vec;
> > + int softirq_bit;
> > +
> > + if (!overrun)
> > + return pending;
> > +
> > + overrun &= pending;
> > + pending &= ~overrun;
> > +
> > + while ((softirq_bit = ffs(overrun))) {
> > + struct vector_work *work;
> > + unsigned int vec_nr;
> > +
> > + h += softirq_bit - 1;
> > + vec_nr = h - softirq_vec;
> > + work = this_cpu_ptr(&vector_work_cpu[vec_nr]);
> > + schedule_work_on(smp_processor_id(), &work->work);
> > + h++;
> > + overrun >>= softirq_bit;
> > + }
> > +
> > + return pending;
> > +}
> > +
> > asmlinkage __visible void __softirq_entry __do_softirq(void)
> > {
> > struct softirq_stat *sstat = this_cpu_ptr(&softirq_stat_cpu);
> > @@ -321,10 +392,13 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> >
> > pending = local_softirq_pending();
> > if (pending) {
> > - if (overrun || need_resched())
> > + if (need_resched()) {
> > wakeup_softirqd();
> > - else
> > - goto restart;
> > + } else {
> > + pending = do_softirq_overrun(overrun, pending);
> > + if (pending)
> > + goto restart;
> > + }
> > }
> >
> > lockdep_softirq_end(in_hardirq);
>
> This way the 'overrun' branch is not triggered if we (also) need
> resched, should we test for overrun first ?

Yes, they could get similar treatment. If need_resched(), we schedule
everything that is still pending: do_softirq_overrun(pending, pending);
otherwise we take the other branch and still do a goto restart.

In fact it can even be simplified this way:


	if (need_resched())
		overrun = pending;
	pending = do_softirq_overrun(overrun, pending);
	if (pending)
		goto restart;

Thanks.

2018-01-12 18:12:40

by Linus Torvalds

Subject: Re: [RFC PATCH 1/2] softirq: Account time and iteration stats per vector

On Fri, Jan 12, 2018 at 6:34 AM, Frederic Weisbecker
<[email protected]> wrote:
>
> That's right. But I thought it was bit large for the stack:
>
> struct {
> u64 time;
> u64 count;
> } [NR_SOFTIRQS]

Note that you definitely don't want "u64" here.

Both of these values had better be very limited. The "count" is on the
order of 10 - it fits in 4 _bits_ without any overflow.

And 'time' is on the order of 2ms, so even if it's in nanoseconds, we
already know that we want to limit it to a single ms or so (yes, yes,
right now our limit is 2ms, but I think that's long). So even that
doesn't need 64-bit.

Finally, I think you can join them. If we do a "time or count" limit,
let's just make the "count" act as some arbitrary fixed time, so that
we limit things that way.

Say, if we want to limit it to 2ms, consider one count to be 0.2ms. So
instead of keeping track of count at all, just say "make each softirq
call count as at least 200,000ns even if the scheduler clock says it's
less". End result: we'd loop at most ten times.

So now you only need one value, and you know it can't be bigger than 2
million, so it can be a 32-bit one. Boom. Done.
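
Something like this (whiteboard level, not even compiled, reusing the names
from the patch and making up the two constants):

	#define SOFTIRQ_MIN_CHARGE_NS	(200 * NSEC_PER_USEC)
	#define SOFTIRQ_BUDGET_NS	(2 * NSEC_PER_MSEC)

		u64 delta = local_clock() - startime;

		/* Short callbacks still burn at least 200us of the budget */
		if (delta < SOFTIRQ_MIN_CHARGE_NS)
			delta = SOFTIRQ_MIN_CHARGE_NS;

		/* With a ~2ms budget, a u32 nanosecond counter is plenty */
		vstat->time += delta;
		if (vstat->time > SOFTIRQ_BUDGET_NS)
			overrun |= 1 << vec_nr;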

Also, don't you want these to be percpu, and keep accumulating them
until you decide to either age them away (just clear it in the timer
interrupt?) or if the value gets so big that you want to fall back to
the thread instead (and then the thread can clear it every iteration,
so you don't need to track whether the thread is active or not).

I don't know. I'm traveling today, so I didn't actually have time to
really look at the patches, I'm just reacting to Eric's reaction.

Linus

2018-01-12 18:54:34

by Frederic Weisbecker

Subject: Re: [RFC PATCH 1/2] softirq: Account time and iteration stats per vector

On Fri, Jan 12, 2018 at 10:12:32AM -0800, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 6:34 AM, Frederic Weisbecker
> <[email protected]> wrote:
> >
> > That's right. But I thought it was bit large for the stack:
> >
> > struct {
> > u64 time;
> > u64 count;
> > } [NR_SOFTIRQS]
>
> Note that you definitely don't want "u64" here.
>
> Both of these values had better be very limited. The "count" is on the
> order of 10 - it fits in 4 _bits_ without any overflow.
>
> And 'time' is on the order of 2ms, so even if it's in nanoseconds, we
> already know that we want to limit it to a single ms or so (yes, yes,
> right now our limit is 2ms, but I think that's long). So even that
> doesn't need 64-bit.

Ok.

>
> Finally, I think you can join them. If we do a "time or count" limit,
> let's just make the "count" act as some arbitrary fixed time, so that
> we limit things that way.
>
> Say, if we want to limit it to 2ms, consider one count to be 0.2ms. So
> instead of keeping track of count at all, just say "make each softirq
> call count as at least 200,000ns even if the scheduler clock says it's
> less". End result: we'd loop at most ten times.
>
> So now you only need one value, and you know it can't be bigger than 2
> million, so it can be a 32-bit one. Boom. Done.

Right.

Now I believe that the time was added as a limit because count alone was not
reliable enough to diagnose a softirq overrun. But if everyone is fine with
keeping the count as a single metric, I would be much happier because that
means less overhead, no need to fetch the clock, etc...

>
> Also, don't you want these to be percpu, and keep accumulating them
> until you decide to either age them away (just clear it in timer
> interrupt?) or if the value gets so big that you want o fall back to
> the thread instead (and then the thread can clear it every iteration,
> so you don't need to track whether the thread is active or not).
>
> I don't know. I'm traveling today, so I didn't actually have time to
> really look at the patches, I'm just reacting to Eric's reaction.

Clearing the accumulation on tick and on flush sounds like a good plan.
Well I'm probably not going to use the tick for that because of nohz (again),
but I can check whether jiffies changed since we started the accumulation and
reset it if so.
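
Something like this (untested, field and helper names made up):

	struct vector_stat {
		u32		time;	/* ns accumulated in the current window */
		unsigned long	window;	/* jiffies value when the window started */
	};

	static void vstat_account(struct vector_stat *vstat, u32 delta)
	{
		/* A new jiffy has been observed: age the old accumulation away */
		if (vstat->window != jiffies) {
			vstat->window = jiffies;
			vstat->time = 0;
		}
		vstat->time += delta;
	}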

I'm going to respin, thanks!