2018-01-09 13:36:31

by Dmitry Safonov

Subject: [RFC 0/2] Net softirq deferring to ksoftirqd

Based on events I've seen on out-of-tree drivers, but I believe the
same can happen with mainstream ones.
I managed to get some results in Qemu VMs, but this lacks testing
on real mainstream hardware.

Also I'm not sure this is a completely sane thing to do, so I'm sending
it early to get some reviews.

Cc: Andrew Morton <[email protected]>
Cc: David Miller <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Hannes Frederic Sowa <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "Levin, Alexander (Sasha Levin)" <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Radu Rendec <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Wanpeng Li <[email protected]>

Dmitry Safonov (2):
softirq: Defer net rx/tx processing to ksoftirqd context
softirq: Introduce mask for __do_softirq()

include/linux/interrupt.h | 8 ++++----
kernel/softirq.c | 48 ++++++++++++++++++++++++++++++++++-------------
2 files changed, 39 insertions(+), 17 deletions(-)

--
2.13.6


2018-01-09 13:36:35

by Dmitry Safonov

Subject: [RFC 2/2] softirq: Introduce mask for __do_softirq()

Warning: Not merge-ready, tested only on x86_64 & arm32.

To defer net softirqs until ksoftirqd runs, __do_softirq() needs to
process softirqs against a mask, which depends on whether it is called
from the ksoftirqd thread or in the context of some other task.

Signed-off-by: Dmitry Safonov <[email protected]>
---
include/linux/interrupt.h | 8 ++++----
kernel/softirq.c | 41 ++++++++++++++++++++++-------------------
2 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 69c238210325..1e943959b31a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -489,14 +489,14 @@ struct softirq_action
};

asmlinkage void do_softirq(void);
-asmlinkage void __do_softirq(void);
+asmlinkage void __do_softirq(__u32 mask);

#ifdef __ARCH_HAS_DO_SOFTIRQ
-void do_softirq_own_stack(void);
+void do_softirq_own_stack(__u32 mask);
#else
-static inline void do_softirq_own_stack(void)
+static inline void do_softirq_own_stack(__u32 mask)
{
- __do_softirq();
+ __do_softirq(mask);
}
#endif

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ee48f194dcec..5459b079bf73 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,26 +88,26 @@ static bool ksoftirqd_running(void)
return tsk && (tsk->state == TASK_RUNNING);
}

-static bool defer_softirq(void)
+#define DEFER_SOFTIRQS ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))
+
+/* Mask of softirqs that are pending to be processed on the current context. */
+static __u32 current_softirq_pending(void)
{
__u32 pending = local_softirq_pending();

- if (!pending)
- return true;
-
if (ksoftirqd_running())
- return true;
+ return 0;

/*
* Defer net-rx softirqs to ksoftirqd processing as they may
* make userspace starving cpu time.
*/
- if (pending & ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))) {
+ if (pending & DEFER_SOFTIRQS)
wakeup_softirqd();
- return true;
- }

- return false;
+ pending &= ~DEFER_SOFTIRQS;
+
+ return pending;
}

/*
@@ -261,7 +261,7 @@ static inline bool lockdep_softirq_start(void) { return false; }
static inline void lockdep_softirq_end(bool in_hardirq) { }
#endif

-asmlinkage __visible void __softirq_entry __do_softirq(void)
+asmlinkage __visible void __softirq_entry __do_softirq(__u32 mask)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
unsigned long old_flags = current->flags;
@@ -286,7 +286,8 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

restart:
/* Reset the pending bitmask before enabling irqs */
- set_softirq_pending(0);
+ set_softirq_pending(pending & ~mask);
+ pending &= mask;

local_irq_enable();

@@ -320,7 +321,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
local_irq_disable();

pending = local_softirq_pending();
- if (pending) {
+ if (pending & mask) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;
@@ -338,14 +339,14 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
asmlinkage __visible void do_softirq(void)
{
unsigned long flags;
+ __u32 pending = current_softirq_pending();

- if (in_interrupt())
+ if (in_interrupt() || !pending)
return;

local_irq_save(flags);

- if (!defer_softirq())
- do_softirq_own_stack();
+ do_softirq_own_stack(pending);

local_irq_restore(flags);
}
@@ -371,7 +372,9 @@ void irq_enter(void)

static inline void invoke_softirq(void)
{
- if (defer_softirq())
+ __u32 pending = current_softirq_pending();
+
+ if (!pending)
return;

if (!force_irqthreads) {
@@ -381,14 +384,14 @@ static inline void invoke_softirq(void)
* it is the irq stack, because it should be near empty
* at this stage.
*/
- __do_softirq();
+ __do_softirq(pending);
#else
/*
* Otherwise, irq_exit() is called on the task stack that can
* be potentially deep already. So call softirq in its own stack
* to prevent from any overrun.
*/
- do_softirq_own_stack();
+ do_softirq_own_stack(pending);
#endif
} else {
wakeup_softirqd();
@@ -682,7 +685,7 @@ static void run_ksoftirqd(unsigned int cpu)
* We can safely run softirq on inline stack, as we are not deep
* in the task stack here.
*/
- __do_softirq();
+ __do_softirq(~0);
local_irq_enable();
cond_resched_rcu_qs();
return;
--
2.13.6

2018-01-09 13:36:33

by Dmitry Safonov

Subject: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

Warning: Not merge-ready

I. Current workflow of ksoftirqd.
Softirqs are processed in the context of ksoftirqd iff they are
being raised very frequently. How it works:
do_softirq() and invoke_softirq() defer pending softirqs iff
ksoftirqd is on the runqueue. Ksoftirqd is scheduled mostly at the
end of softirq processing, if 2ms were not enough to process all
pending softirqs.
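
For reference, the gating described above looks roughly like this in the
current code (a paraphrased sketch of kernel/softirq.c, not an exact quote):

static bool ksoftirqd_running(void)
{
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);

        /* "on the runqueue" is approximated by TASK_RUNNING */
        return tsk && (tsk->state == TASK_RUNNING);
}

/* do_softirq() and invoke_softirq() both bail out early: */
        if (ksoftirqd_running())
                return;

/* ...and the tail of __do_softirq() either restarts or punts: */
        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() && --max_restart)
                        goto restart;

                wakeup_softirqd();      /* 2ms / 10 restarts were not enough */
        }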

Here is pseudo-picture of the workflow (for simplicity on UMP):
------------- ------------------ ------------------
| ksoftirqd | | User's process | | Softirqs |
------------- ------------------ ------------------
Not scheduled Running
|
o------------------------o
|
__do_softirq()
|
2ms & softirq pending?
Schedule ksoftirqd
|
Scheduled o------------------------o
|
o--------------------o
|
Running Scheduled
|
o--------------------o
|
Not scheduled Running

Timegraph for the workflow,
dash (-) means ksoftirqd not scheduled;
equal(=) ksoftirqd is scheduled, a softirq may still be pending

Pending softirqs
| | | | | | | | |
v v v v | | | | v
Processing o-----o | | | | o--o
softirqs | | | | | | | |
| | | | | | | |
| | | | | | | |
Userspace o-o o=========o | | | | o----o o---------o
<-2ms-> | | | | | |
| v v v v |
Ksoftirqd o----------o

II. Corner-conditions.
During testing of commit [1] on some non-mainstream driver,
I've found that due to platform specifics, the IRQ is being
raised too late (after softirq has been processed).
As a result, softirqs steal time from the userspace process, leaving
it starving for CPU time and never/rarely scheduling ksoftirqd:

Pending softirqs
| | | | | |
v v v v v v
Processing o-----o o-----o o-----o o-----o o-----o o ...
softirqs | | | | | | | | | | |
| | | | | | | | | | |
| | | | | | | | | | |
Userspace o-o o-o o-o o-o o-o o-o (starving)

Ksoftirqd (rarely scheduled)

Afterwards I thought that the same may happen on mainstream
if the PPS rate is chosen so that an IRQ is raised just after the
previous softirq was processed. I managed to reproduce the
conjecture, see (IV).

III. RFC proposal.
Firstly, I tried to account all the time spent in softirq processing
to the ksoftirqd thread that serves the local CPU, and to compare the
vruntime of ksoftirqd and the current task to decide whether a softirq
should be delayed. You may imagine what disgraceful hacks were
involved. The current RFC has nothing of that kind and relies on
fair scheduling of ksoftirqd against other tasks.
To do that, we check pending softirqs and serve them in the current
context only if non-net softirqs are pending.
The following patch adds a mask to __do_softirq() so that net
softirqs are processed only in ksoftirqd context when multiple
softirqs are pending.

IV. Test results.
Unfortunately, I wasn't able to test it on hardware with a mainstream
kernel. So, I only have results from Qemu VMs running Fedora 26.
The first VM stresses the second with UDP packets generated by pktgen.
The receiver VM runs the udp_sink[2] program and prints the
number of PPS served.
The VMs use virtio network cards, run with RT priority and are
pinned to different CPUs on the host.
The host's CPU is an Intel Core i7-7600U @ 2.80GHz.
The RFC definitely needs some testing on real HW (because I
don't expect anyone would quite believe VM perf testing) - any
help with testing it would be appreciated.

 Source |       Destination (pps)
  (pps) |------------------------------------
        |      master      |      RFC       |
        |   (4.15-rc4)     |                |
--------|------------------|----------------|
5000 | 5000.7 | 4999.7 |
--------|------------------|----------------|
7000 | 6997.42 | 6995.88 |
--------|------------------|----------------|
8000 | 7999.55 | 7999.86 |
--------|------------------|----------------|
9000 | 8951.37 | 8986.30 |
--------|------------------|----------------|
10000 | 9864.96 | 9972.05 |
--------|------------------|----------------|
11000 | 10711.92 | 10976.26 |
--------|------------------|----------------|
12000 | 11494.79 | 11962.40 |
--------|------------------|----------------|
13000 | 12161.76 | 12946.91 |
--------|------------------|----------------|
14000 | 11152.07 | 13942.96 |
--------|------------------|----------------|
15000 | 8650.22 | 14878.26 |
--------|------------------|----------------|
16000 | 7662.55 | 15880.60 |
--------|------------------|----------------|
17000 | 6485.49 | 16814.07 |
--------|------------------|----------------|
18000 | 5489.48 | 17679.69 |
--------|------------------|----------------|
19000 | 4679.59 | 18543.60 |
--------|------------------|----------------|
20000 | 4738.24 | 19233.56 |
--------|------------------|----------------|
21000 | 4015.00 | 20247.50 |
--------|------------------|----------------|
22000 | 4376.99 | 20654.62 |
--------|------------------|----------------|
23000 | 9429.80 | 20925.07 |
--------|------------------|----------------|
24000 | 8872.33 | 21336.31 |
--------|------------------|----------------|
25000 | 19824.67 | 21486.84 |
--------|------------------|----------------|
30000 | 20779.49 | 21487.15 |
--------|------------------|----------------|
40000 | 24559.83 | 21452.74 |
--------|------------------|----------------|
50000 | 18469.20 | 21191.34 |
--------|------------------|----------------|
100000 | 19773.00 | 22592.28 |
--------|------------------|----------------|

Note that I tested in VMs, and I've found that if I produce more
hw irqs on the host, then the results for master are not that
dramatically bad, but still much worse than with the RFC.
For that reason I have doubts about whether my test results are correct.

V. References:
[1] 4cd13c21b207 ("softirq: Let ksoftirqd do its job")
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Signed-off-by: Dmitry Safonov <[email protected]>
---
kernel/softirq.c | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..ee48f194dcec 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,6 +88,28 @@ static bool ksoftirqd_running(void)
return tsk && (tsk->state == TASK_RUNNING);
}

+static bool defer_softirq(void)
+{
+ __u32 pending = local_softirq_pending();
+
+ if (!pending)
+ return true;
+
+ if (ksoftirqd_running())
+ return true;
+
+ /*
+ * Defer net-rx softirqs to ksoftirqd processing as they may
+ * make userspace starving cpu time.
+ */
+ if (pending & ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))) {
+ wakeup_softirqd();
+ return true;
+ }
+
+ return false;
+}
+
/*
* preempt_count and SOFTIRQ_OFFSET usage:
* - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
@@ -315,7 +337,6 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

asmlinkage __visible void do_softirq(void)
{
- __u32 pending;
unsigned long flags;

if (in_interrupt())
@@ -323,9 +344,7 @@ asmlinkage __visible void do_softirq(void)

local_irq_save(flags);

- pending = local_softirq_pending();
-
- if (pending && !ksoftirqd_running())
+ if (!defer_softirq())
do_softirq_own_stack();

local_irq_restore(flags);
@@ -352,7 +371,7 @@ void irq_enter(void)

static inline void invoke_softirq(void)
{
- if (ksoftirqd_running())
+ if (defer_softirq())
return;

if (!force_irqthreads) {
--
2.13.6

2018-01-09 16:03:58

by Frederic Weisbecker

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

2018-01-09 14:36 UTC+01:00, Dmitry Safonov <[email protected]>:
> Warning: Not merge-ready
>
> I. Current workflow of ksoftirqd.
> Softirqs are processed in the context of ksoftirqd iff they are
> being raised very frequently. How it works:
> do_softirq() and invoke_softirq() defer pending softirqs iff
> ksoftirqd is on the runqueue. Ksoftirqd is scheduled mostly at the
> end of softirq processing, if 2ms were not enough to process all
> pending softirqs.
>
> Here is pseudo-picture of the workflow (for simplicity on UMP):
> ------------- ------------------ ------------------
> | ksoftirqd | | User's process | | Softirqs |
> ------------- ------------------ ------------------
> Not scheduled Running
> |
> o------------------------o
> |
> __do_softirq()
> |
> 2ms & softirq pending?
> Schedule ksoftirqd
> |
> Scheduled o------------------------o
> |
> o--------------------o
> |
> Running Scheduled
> |
> o--------------------o
> |
> Not scheduled Running
>
> Timegraph for the workflow,
> dash (-) means ksoftirqd not scheduled;
> equal(=) ksoftirqd is scheduled, a softirq may still be pending
>
> Pending softirqs
> | | | | | | | | |
> v v v v | | | | v
> Processing o-----o | | | | o--o
> softirqs | | | | | | | |
> | | | | | | | |
> | | | | | | | |
> Userspace o-o o=========o | | | | o----o o---------o
> <-2ms-> | | | | | |
> | v v v v |
> Ksoftirqd o----------o
>
> II. Corner-conditions.
> During testing of commit [1] on some non-mainstream driver,
> I've found that due to platform specifics, the IRQ is being
> raised too late (after softirq has been processed).

I'm a bit confused about that part. I would expect the softirq to be
raised by the IRQ.
So I guess in this scenario the softirq is raised by something else
and you expect the upcoming IRQ to handle the softirq, right? (sorry
I'm not used to networking code).

Thanks.

> In result softirqs steal time from userspace process, leaving
> it starving for CPU time and never/rarely scheduling ksoftirqd:

2018-01-09 18:02:58

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Tue, Jan 9, 2018 at 5:36 AM, Dmitry Safonov <[email protected]> wrote:
> Warning: Not merge-ready
>
> I. Current workflow of ksoftirqd.
> Softirqs are processed in the context of ksoftirqd iff they are
> being raised very frequently. How it works:
> do_softirq() and invoke_softirq() defer pending softirqs iff
>

...

>
> Note, that I tested in VMs and I've found that if I produce more
> hw irqs on the host, than the results for master are not that
> dramatically bad, but still much worse then with RFC.
> By that reason I have qualms if my test's results are correct.

Note that deferring all NET RX/TX to ksoftirqd is going to
dramatically hurt tail latencies.

You really should test with RPC-like workloads (netperf -t TCP_RR) and
hundreds of threads per cpu :/

It seems we are going to revert/adapt 4cd13c21b2, not defer more
stuff to ksoftirqd.

Thanks

2018-01-10 21:20:42

by Dmitry Safonov

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Tue, 2018-01-09 at 17:03 +0100, Frederic Weisbecker wrote:
> 2018-01-09 14:36 UTC+01:00, Dmitry Safonov <[email protected]>:
> > Warning: Not merge-ready
> >
> > I. Current workflow of ksoftirqd.
> > Softirqs are processed in the context of ksoftirqd iff they are
> > being raised very frequently. How it works:
> > do_softirq() and invoke_softirq() defer pending softirqs iff
> > ksoftirqd is on the runqueue. Ksoftirqd is scheduled mostly at the
> > end of softirq processing, if 2ms were not enough to process all
> > pending softirqs.
> >
> > Here is pseudo-picture of the workflow (for simplicity on UMP):
> > ------------- ------------------ ------------------
> > | ksoftirqd | | User's process | | Softirqs |
> > ------------- ------------------ ------------------
> > Not scheduled Running
> > |
> > o------------------------o
> > |
> > __do_softirq()
> > |
> > 2ms & softirq
> > pending?
> > Schedule ksoftirqd
> > |
> > Scheduled o------------------------o
> > |
> > o--------------------o
> > |
> > Running Scheduled
> > |
> > o--------------------o
> > |
> > Not scheduled Running
> >
> > Timegraph for the workflow,
> > dash (-) means ksoftirqd not scheduled;
> > equal(=) ksoftirqd is scheduled, a softirq may still be
> > pending
> >
> > Pending softirqs
> > | | | | | | | | |
> > v v v v | | | | v
> > Processing o-----o | | | | o--o
> > softirqs | | | | | | | |
> > | | | | | | | |
> > | | | | | | | |
> > Userspace o-o o=========o | | | | o----o o---------o
> > <-2ms-> | | | | | |
> > | v v v v |
> > Ksoftirqd o----------o
> >
> > II. Corner-conditions.
> > During testing of commit [1] on some non-mainstream driver,
> > I've found that due to platform specifics, the IRQ is being
> > raised too late (after softirq has been processed).
>
> I'm a bit confused about that part. I would expect the softirq to be
> raised by the IRQ.

The rx-softirq is raised by napi_schedule(), which is called on
receiving an interrupt from device, yes.

> So I guess in this scenario the softirq is raised by something else
> and you expect the upcoming IRQ to handle the softirq, right? (sorry
> I'm not used to networking code).

So, the softirq is served after the upcoming IRQ arrives. But at the end of
the softirq there is no pending softirq yet. And just after irq_exit(),
another pending softirq emerges.
IOW, here is what I see in the trace:


python-8597 [001] d... 16635.495480: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495498: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495503: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495533: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495542: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495567: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495576: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495601: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495615: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495627: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495637: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495668: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495684: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495703: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495710: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495723: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495736: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495755: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495769: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495788: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495804: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495822: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495834: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495857: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495868: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495891: rcu_irq_exit <-irq_exit
python-8597 [001] d.s1 16635.495907: rcu_irq_exit <-irq_exit
python-8597 [001] d..1 16635.495908: __do_softirq <-do_softirq_own_stack
python-8597 [001] d... 16635.495939: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495959: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.495973: __do_softirq <-irq_exit
python-8597 [001] d... 16635.495988: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496002: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496014: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496027: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496044: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496059: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496088: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496102: __do_softirq <-irq_exit
python-8597 [001] d.s. 16635.496108: rcu_irq_exit <-irq_exit
python-8597 [001] dN.. 16635.496131: rcu_irq_exit <-irq_exit
ksoftirqd/1-14 [001] d... 16635.496132: __do_softirq <-run_ksoftirqd
ksoftirqd/1-14 [001] d.s1 16635.496145: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496171: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496191: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496204: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496218: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496231: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496250: rcu_irq_exit <-irq_exit
python-8597 [001] d... 16635.496263: __do_softirq <-irq_exit
python-8597 [001] d... 16635.496283: rcu_irq_exit <-irq_exit

Ksoftirqd here is very rarely scheduled; the python process here is
the receiver, and it looks CPU-starved:

%Cpu1 : 0.6 us, 9.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 35.3 hi, 54.6 si, 0.0 st
[..]
  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 8597 root      20   0    7664   5736   3924 R   10  0.0   4:00.69 python

That's what I tried to picture there with the diagrams.
(the source of __do_softirq() also shows how ksoftirqd got scheduled)

--
Thanks,
Dmitry

2018-01-10 21:48:05

by Dmitry Safonov

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Tue, 2018-01-09 at 10:02 -0800, Eric Dumazet wrote:
> On Tue, Jan 9, 2018 at 5:36 AM, Dmitry Safonov <[email protected]>
> wrote:
> > Warning: Not merge-ready
> >
> > I. Current workflow of ksoftirqd.
> > Softirqs are processed in the context of ksoftirqd iff they are
> > being raised very frequently. How it works:
> > do_softirq() and invoke_softirq() defer pending softirqs iff
> >
>
> ...
>
> >
> > Note, that I tested in VMs and I've found that if I produce more
> > hw irqs on the host, than the results for master are not that
> > dramatically bad, but still much worse then with RFC.
> > By that reason I have qualms if my test's results are correct.
>
> Note that deferring all NET RX/TX to ksoftirqd is going to
> dramatically hurt tail latencies.
>
> You really should test with RPC like workloads (netperf -t TCP_RR)
> and
> hundred of threads per cpu :/

Yeah, thanks for the reply, will give that a shot.

>
> It seems we are going to revert/adapt 4cd13c21b2 , not defer more
> stuff to ksoftirqd.

Hmm, what if we use some other logic for deferring/non-deferring,
like checking how many softirqs were serviced during the process's
timeslice and deciding whether to proceed with __do_softirq() or defer
it so as not to starve the task? Might that make sense?
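
Something along these lines, purely as an illustration (the counter, the
threshold and the reset point are all invented here):

/*
 * Illustrative only: count softirqs served on top of the current task and
 * punt to ksoftirqd once some arbitrary threshold is crossed.  The counter
 * would have to be reset when the interrupted task is switched out
 * (handwaved here).
 */
static DEFINE_PER_CPU(unsigned int, softirqs_served);

#define SOFTIRQ_SERVE_LIMIT     64      /* arbitrary number for the example */

static bool too_much_softirq(void)
{
        return __this_cpu_inc_return(softirqs_served) > SOFTIRQ_SERVE_LIMIT;
}

/* ...and then in invoke_softirq()/do_softirq(): */
        if (ksoftirqd_running())
                return;
        if (too_much_softirq()) {
                wakeup_softirqd();
                return;
        }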

--
Thanks,
Dmitry

2018-01-11 02:13:15

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 10, 2018 at 1:48 PM, Dmitry Safonov <[email protected]> wrote:
> Hmm, what if we use some other logic for deferring/non-deferring,
> like checking how many softirqs were serviced during the process's
> timeslice and deciding whether to proceed with __do_softirq() or defer
> it so as not to starve the task? Might that make sense?

Yes, but it might also be hard to come up with a good heuristic.

We actually *have* a fairly good heuristic right now: we end up
punting to softirqd if we have too much work at one synchronous event.
We simply count, and refuse to do too much, and say "Ok, wake up
ksoftirqd".

That has worked fairly well for a long time, and I think it's
fundamentally the right thing to do.

I think that the problem with the "once you punt to ksoftirqd, _keep_
punting to it" in commit 4cd13c21b207 ("softirq: Let ksoftirqd do its
job") was that it simply went much too far.

Doing it under heavy load once is fine. But then what happens is that
ksoftirqd keeps running (for the same reason that we woke it up in the
first place), and then eventually it gets scheduled away because it's
doing a lot of work.

And I think THAT is when the ksoftirqd scheduling latencies get bad.
Not on initial "push things to ksoftirqd". If ksoftirqd hasn't been
running, then the scheduler will be pretty eager to pick it.

But if ksoftirqd has been using CPU time, and gets preempted by other
things (and it's pretty eager to do so - see the whole
"need_resched()" in __do_softirq()), now we're really talking long
latencies when there are other runnable processes.

And dammit, softirq latencies are *MORE IMPORTANT* than some random
user process scheduling. But the ksoftirqd_running() code will just
see "ok, it's runnable, I'm not going to run anything synchronously",
and let those softirq things wait. We're talking packet scheduling,
but we're talking other things too.

So just saying "hey, ksoftirqd is runnable - but maybe not running
_now_" and ignoring softirqs entirely is just stupid. Even if we could
easily do another small bunch of them, at least the non-networking
ones.

So maybe that "ksoftirqd_running()" check should actually be something like

static bool ksoftirqd_running(void)
{
        struct task_struct *tsk = __this_cpu_read(ksoftirqd);

        return tsk == current;
}

which actually checks that ksoftirq is running right *now*, and not
scheduled away because somebody is running a perl script.

Linus

2018-01-11 03:22:39

by Frederic Weisbecker

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 10, 2018 at 06:13:01PM -0800, Linus Torvalds wrote:
> So just saying "hey, ksoftirq is runnable - but maybe not running
> _now"" and ignoring softirqs entirely is just stupid. Even if we could
> easily do another small bunch of them, at least the non-networking
> ones.
>
> So maybe that "ksoftirqd_running()" check should actually be something like
>
> static bool ksoftirqd_running(void)
> {
> struct task_struct *tsk = __this_cpu_read(ksoftirqd);
>
> return tsk == current;
> }
>
> which actually checks that ksoftirq is running right *now*, and not
> scheduled away because somebody is running a perl script.

Makes sense, but I think you need to keep the TASK_RUNNING check, in case
the hardirq is interrupting ksoftirqd in TASK_INTERRUPTIBLE state right before
it's going to sleep. In that case neither ksoftirqd nor the hardirq are going
to serve the poor pending softirq. And if we are in nohz mode, it may not be
served for a while. So perhaps it should be:

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f..6e5d7bc 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -85,7 +85,7 @@ static bool ksoftirqd_running(void)
{
struct task_struct *tsk = __this_cpu_read(ksoftirqd);

- return tsk && (tsk->state == TASK_RUNNING);
+ return (tsk == current) && (tsk->state == TASK_RUNNING);
}

/*

2018-01-11 04:19:52

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
<[email protected]> wrote:
>
> Makes sense, but I think you need to keep the TASK_RUNNING check.

Yes, good point.

> So perhaps it should be:
>
> - return tsk && (tsk->state == TASK_RUNNING);
> + return (tsk == current) && (tsk->state == TASK_RUNNING);

Looks good to me - definitely worth trying.

Maybe that weakens the thing so much that it doesn't actually help the
UDP packet storm case?

And maybe it's not sufficient for the dvb issue.

But I think it's worth at least testing. Maybe it makes neither side
entirely happy, but maybe it might be a good halfway point?

Linus

2018-01-11 04:45:01

by Frederic Weisbecker

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote:
> On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
> <[email protected]> wrote:
> >
> > Makes sense, but I think you need to keep the TASK_RUNNING check.
>
> Yes, good point.
>
> > So perhaps it should be:
> >
> > - return tsk && (tsk->state == TASK_RUNNING);
> > + return (tsk == current) && (tsk->state == TASK_RUNNING);
>
> Looks good to me - definitely worth trying.
>
> Maybe that weakens the thing so much that it doesn't actually help the
> UDP packet storm case?
>
> And maybe it's not sufficient for the dvb issue.
>
> But I think it's worth at least testing. Maybe it makes neither side
> entirely happy, but maybe it might be a good halfway point?

Yes I believe Dmitry is facing a different problem where he would rather
see ksoftirqd scheduled more often to handle the queue as a deferred batch
instead of having it served one by one on the tails of IRQ storms.
(Dmitry correct me if I misunderstood).

But your patch still seems to make sense for the case you described: when
ksoftirqd is voluntarily preempted off and the current IRQ could handle the
queue.

2018-01-11 14:31:36

by Dmitry Safonov

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, 2018-01-11 at 05:44 +0100, Frederic Weisbecker wrote:
> On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
> > <[email protected]> wrote:
> > >
> > > Makes sense, but I think you need to keep the TASK_RUNNING check.
> >
> > Yes, good point.
> >
> > > So perhaps it should be:
> > >
> > > - return tsk && (tsk->state == TASK_RUNNING);
> > > + return (tsk == current) && (tsk->state == TASK_RUNNING);
> >
> > Looks good to me - definitely worth trying.
> >
> > Maybe that weakens the thing so much that it doesn't actually help
> > the
> > UDP packet storm case?
> >
> > And maybe it's not sufficient for the dvb issue.
> >
> > But I think it's worth at least testing. Maybe it makes neither
> > side
> > entirely happy, but maybe it might be a good halfway point?
>
> Yes I believe Dmitry is facing a different problem where he would
> rather
> see ksoftirqd scheduled more often to handle the queue as a deferred
> batch
> instead of having it served one by one on the tails of IRQ storms.
> (Dmitry correct me if I misunderstood).

Quite so, what I see is that ksoftirqd is rarely (close to never)
scheduled in case of a UDP packet storm. That's because the upcoming irq
arrives too late for __do_softirq().
So, there is no wakeup on a UDP storm here:
: pending = local_softirq_pending();
: if (pending & mask) {
: if (time_before(jiffies, end) && !need_resched() &&
: --max_restart)
: goto restart;
:
: wakeup_softirqd();
: }
(as there is no pending softirq yet). It comes a bit too late to schedule
ksoftirqd, and as a result the next softirq is processed in the context of
the task again, not in the scheduled ksoftirqd.
That results in cpu-time starvation for the process during an irq storm.

While I saw that on an out-of-tree driver, I believe that at some
frequencies (lower than a storm) one can observe the same on mainstream
drivers. And I *think* that I've reproduced that on mainstream with the
virtio driver and a packet size of 1500 in VMs (though I don't quite like
perf testing in VMs).

So, IOW, maybe there is a better way to *detect* that the cpu time
spent on serving softirqs is close to a storm and that userspace is starting
to starve? (and then launch ksoftirqd, or balance between
deferring and serving the softirq right there).

> But your patch still seems to make sense for the case you described:
> when
> ksoftirqd is voluntarily preempted off and the current IRQ could
> handle the
> queue.

2018-01-11 16:20:23

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 6:31 AM, Dmitry Safonov <[email protected]> wrote:
> On Thu, 2018-01-11 at 05:44 +0100, Frederic Weisbecker wrote:
>> On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote:
>> > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
>> > <[email protected]> wrote:
>> > >
>> > > Makes sense, but I think you need to keep the TASK_RUNNING check.
>> >
>> > Yes, good point.
>> >
>> > > So perhaps it should be:
>> > >
>> > > - return tsk && (tsk->state == TASK_RUNNING);
>> > > + return (tsk == current) && (tsk->state == TASK_RUNNING);
>> >
>> > Looks good to me - definitely worth trying.
>> >
>> > Maybe that weakens the thing so much that it doesn't actually help
>> > the
>> > UDP packet storm case?
>> >
>> > And maybe it's not sufficient for the dvb issue.
>> >
>> > But I think it's worth at least testing. Maybe it makes neither
>> > side
>> > entirely happy, but maybe it might be a good halfway point?
>>
>> Yes I believe Dmitry is facing a different problem where he would
>> rather
>> see ksoftirqd scheduled more often to handle the queue as a deferred
>> batch
>> instead of having it served one by one on the tails of IRQ storms.
>> (Dmitry correct me if I misunderstood).
>
> Quite so, what I see is that ksoftirqd is rarely (close to never)
> scheduled in case of UDP packet storm. That's because the up coming irq
> is too late in __do_softirq().
> So, there is no wakeup on UDP storm here:
> : pending = local_softirq_pending();
> : if (pending & mask) {
> : if (time_before(jiffies, end) && !need_resched() &&
> : --max_restart)
> : goto restart;
> :
> : wakeup_softirqd();
> : }
> (as there is yet no pending softirq). It comes a bit late to schedule
> ksoftirqd and in result the next softirq is processed on the context of
> the task again, not in the scheduled ksoftirqd.
> That results in cpu-time starvation for the process on irq storm.
>
> While I saw that on out-of-tree driver, I believe that on some
> frequencies (lower than storm) one can observe the same on mainstream
> drivers. And I *think* that I've reproduced that on mainstream with
> virtio driver and package size of 1500 in VMs (thou I don't quite like
> the perf testing in VMs).
>
> So, ITOW, maybe there is a bit better way to *detect* that cpu time
> spent on serving softirqs is close to storm and that userspace starts
> starving? (and launch ksoftirqd in the result or balance between
> deferring and serving softirq right-there).
>
>> But your patch still seems to make sense for the case you described:
>> when
>> ksoftirqd is voluntarily preempted off and the current IRQ could
>> handle the
>> queue.


Note that ksoftirqd being kicked (TASK_RUNNING) is the sign of softirq pressure.
Or maybe we lack one bit to signal that __do_softirq() had to
wakeup_softirqd() because of pressure.
(If I remember well, I added such a state when submitting my first patch,
https://www.spinics.net/lists/netdev/msg377172.html
then Peter suggested to use tsk->state == TASK_RUNNING:
https://www.spinics.net/lists/netdev/msg377210.html )
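
Something like this, just to illustrate the "one bit" idea (names invented,
and when exactly to clear the bit is the tricky part):

/*
 * Per-cpu "pressure" bit: set when __do_softirq() had to punt because it
 * could not keep up, cleared once ksoftirqd has drained everything.
 */
static DEFINE_PER_CPU(bool, softirq_pressure);

/* tail of __do_softirq(): */
        pending = local_softirq_pending();
        if (pending) {
                if (time_before(jiffies, end) && !need_resched() && --max_restart)
                        goto restart;

                __this_cpu_write(softirq_pressure, true);
                wakeup_softirqd();
        }

/* run_ksoftirqd(), once the pending mask is empty again: */
        if (!local_softirq_pending())
                __this_cpu_write(softirq_pressure, false);

/* the defer decision would then test the bit, not just TASK_RUNNING */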


Maybe the problem is not the new patch, but the use of need_resched() in
__do_softirq() that I added in 2013 (commit
c10d73671ad30f54692f7f69f0e09e75d3a8926a) combined with the new patch.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba
100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);

/*
* We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
+ * but break the loop after 2 ms.
* The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
* certain cases, such as stop_machine(), jiffies may cease to
* increment and so we need the MAX_SOFTIRQ_RESTART limit as
@@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

pending = local_softirq_pending();
if (pending) {
- if (time_before(jiffies, end) && !need_resched() &&
- --max_restart)
+ if (time_before(jiffies, end) && --max_restart)
goto restart;

wakeup_softirqd();

2018-01-11 16:32:30

by Peter Zijlstra

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote:
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba
> 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>
> /*
> * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
> - * but break the loop if need_resched() is set or after 2 ms.
> + * but break the loop after 2 ms.
> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
> * certain cases, such as stop_machine(), jiffies may cease to
> * increment and so we need the MAX_SOFTIRQ_RESTART limit as
> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
> pending = local_softirq_pending();
> if (pending) {
> - if (time_before(jiffies, end) && !need_resched() &&
> - --max_restart)
> + if (time_before(jiffies, end) && --max_restart)
> goto restart;
>
> wakeup_softirqd();

You just introduced a 2ms preempt-disable region I think, that's not
cool for PREEMPT and a plain bug on PREEMPT_RT.

2018-01-11 16:38:44

by David Miller

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

From: Peter Zijlstra <[email protected]>
Date: Thu, 11 Jan 2018 17:32:04 +0100

> You just introduced a 2ms preempt-disable region I think, that's not
> cool for PREEMPT and a plain bug on PREEMPT_RT.

I'd just like to know what the difference is between ksoftirqd running
softirq work for 2ms and the real softirq context doing it. The
latter already has this "2ms preempt-disable region" of sorts, right?

2018-01-11 16:43:26

by Peter Zijlstra

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 11:38:40AM -0500, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Thu, 11 Jan 2018 17:32:04 +0100
>
> > You just introduced a 2ms preempt-disable region I think, that's not
> > cool for PREEMPT and a plain bug on PREEMPT_RT.
>
> I'd just like to know what the difference is between ksoftirqd running
> softirq work for 2ms and the real softirq context doing it. The
> latter already has this "2ms preempt-disable region" of sorts, right?

True I suppose.. and PREEMPT_RT does the whole softirq business
differently anyway.

I'll go back to staring at PTI crud..

2018-01-11 18:48:51

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote:
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba
>> 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>>
>> /*
>> * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
>> - * but break the loop if need_resched() is set or after 2 ms.
>> + * but break the loop after 2 ms.
>> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>> * certain cases, such as stop_machine(), jiffies may cease to
>> * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>>
>> pending = local_softirq_pending();
>> if (pending) {
>> - if (time_before(jiffies, end) && !need_resched() &&
>> - --max_restart)
>> + if (time_before(jiffies, end) && --max_restart)
>> goto restart;
>>
>> wakeup_softirqd();
>
> You just introduced a 2ms preempt-disable region I think, that's not
> cool for PREEMPT and a plain bug on PREEMPT_RT.

I actually think that the whole "time_before()" is garbage and should be removed.

If doing 10 rounds of softirq processing takes several jiffies, you
have a bigger issue.

Looking at the history, the original time_before() was _replacing_ the
"max_restart" count. See commit c10d73671ad3 ("softirq: reduce latencies").

But then "max_restart" was put back in because jiffies just isn't
reliable in the first place.

And I think *this* is the real issue. Networking does a *shitton* of
things in softirq context, and the problem is not the softirq code,
it's the networking code. The softirq code doesn't understand that the
networking code does billions of operations, and one softirq can take
ages and ages because it is handling millions of packets.

My feeling is that it's the networking code that should notice "I have
a billion packets in my softirq queue, and I just keep getting more,
now *I* should start doing things in a thread instead".

Because none of those limits really make sense for any of the other
softirq users. Everybody else just wants a low-latency "soft
interrupt", not some stupid thread. Who knew? Maybe the name should
have given the thing away?

If you want a thread, use a tasklet and add_task_work().

Maybe networking should reconsider its use of softirqs entirely?

Linus

2018-01-11 19:15:54

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 10:48 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote:
>>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>>> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba
>>> 100644
>>> --- a/kernel/softirq.c
>>> +++ b/kernel/softirq.c
>>> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>>>
>>> /*
>>> * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
>>> - * but break the loop if need_resched() is set or after 2 ms.
>>> + * but break the loop after 2 ms.
>>> * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>>> * certain cases, such as stop_machine(), jiffies may cease to
>>> * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>>> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>>>
>>> pending = local_softirq_pending();
>>> if (pending) {
>>> - if (time_before(jiffies, end) && !need_resched() &&
>>> - --max_restart)
>>> + if (time_before(jiffies, end) && --max_restart)
>>> goto restart;
>>>
>>> wakeup_softirqd();
>>
>> You just introduced a 2ms preempt-disable region I think, that's not
>> cool for PREEMPT and a plain bug on PREEMPT_RT.
>
> I actually think that he whole "time_before()" is garbage and should be removed.
>
> If doing 10 rounds of softirq processing takes several jiffies, you
> have bigger issue.
>
> Looking at the history, the original time_before() was _replacing_ the
> "max_retry". See commit c10d73671ad3 ("softirq: reduce latencies").
>
> But then "max_restart" was put back in because jiffies just isn't
> reliable in the first place.
>
> And I think *this* is the real issue. Networking does a *shitton* of
> things in softirq context, and the problem is not the softirq code,
> it's the networking code. The softirq code doesn't understand that the
> networking code does billions of operations, and one softirq can take
> ages and ages because it is handling millions of packets.
>
> My feeling is that it's the networking code that should notice "I have
> a billion packets in my softirq queue, and I just keep getting more,
> now *I* should start doing things in a thread instead".
>
> Because none of those limits really make sense for any of the other
> softirq users. Everybody else just wants a low-latency "soft
> interrupt", not some stupid thread. Who knew? Maybe the name should
> have given the thing away?
>
> If you want a thread, use a tasklet and add_task_work().
>
> Maybe networking should reconsider its use of softirqs entirely?

Issue can happen when one packet is delivered (and handled) per NIC IRQ.

Not 'billions' :/

Suppose host receives 99,000 packets per second on one queue (one CPU
handling the particular IRQ),
each packet taking 10 usec to handle (various cache misses and all the overhead)

How are we supposed to know this particular workload is problematic
for innocent user threads on the same cpu ?

Is 'networking' supposed to perform some kind of mid-term analysis of
what happened in the past and/or predict the future ?

2018-01-11 19:43:41

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 11:15 AM, Eric Dumazet <[email protected]> wrote:
>
> How are we supposed to know this particular workload is problematic
> for innocent user threads on the same cpu ?

Put that question another way: how is the _softirq_ code supposed to know?

If you can't know, then the softirq code definitely can't know either.
It has even less information. It doesn't know about NAPI, it doesn't
know about irq steering, it doesn't know anything at all.

Linus

2018-01-11 19:48:31

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 11:43 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 11, 2018 at 11:15 AM, Eric Dumazet <[email protected]> wrote:
>>
>> How are we supposed to know this particular workload is problematic
>> for innocent user threads on the same cpu ?
>
> Put that question another way: how is the _softirq_ code supposed to know?

That was the purpose of the last patch: as soon as ksoftirqd is scheduled
(by some kind of jitter in the 99,000 pps workload, or an antagonist wakeup),
we then switch to a mode where the process scheduler can make decisions
based on thread prios and cpu usage.

Then, as soon as the load was able to finish the pending softirqs within its
quantum, we re-enter the mode where softirqs are immediately serviced.

>
> If you can't know, then the softirq code definitely can't know either.
> It has even less information. It doesn't know about NAPI, it doesn't
> know about irq steering, it doesn't know anything at all.

That is true.

2018-01-11 20:03:26

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 11:48 AM, Eric Dumazet <[email protected]> wrote:
> That was the purpose on the last patch : As soon as ksoftirqd is scheduled
> (by some kind of jitter in the 99,000 pps workload, or antagonist wakeup),
> we then switch to a mode where process scheduler can make decisions
> based on threads prios and cpu usage.

Yeah, but that really screws up everybody else.

It really is a soft *interrupt*. That was what it was designed for.
The thread handling is not primary, it's literally a fallback to avoid
complete starvation.

The fact that networking has now - for several years - tried to make
it some kind of thread and get fairness with user threads is all
entirely antithetical to what softirq was designed for.

> Then, as soon as the load was able to finish in its quantum the
> pending irqs, we re-enter the mode
> where softirq are immediately serviced.

Except that's not at all how the code works.

As I pointed out, the softirq thread can be scheduled away, but the
"ksoftirqd_running()" check will still return true - and the networking code
has now screwed up all the *other* softirqs too!

I really suspect that what networking wants is more like the
workqueues. Or at least more isolation between different softirq
users, but that's fairly fundamentally hard, given how softirqs are
designed.

My dvb-fixing patch was an *extremely* stupid version of that "more
isolation". But it really is a complete hack, saying that tasklets
are special and shouldn't trigger ksoftirqd. They're not really all
that special, but it at least isolated the USB/DVB usage from the
"networking wants softirqd" problem.

Linus

2018-01-11 20:16:47

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 12:03 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Jan 11, 2018 at 11:48 AM, Eric Dumazet <[email protected]> wrote:
>> That was the purpose on the last patch : As soon as ksoftirqd is scheduled
>> (by some kind of jitter in the 99,000 pps workload, or antagonist wakeup),
>> we then switch to a mode where process scheduler can make decisions
>> based on threads prios and cpu usage.
>
> Yeah, but that really screws up everybody else.
>
> It really is a soft *interrupt*. That was what it was designed for.
> The thread handling is not primary, it's literally a fallback to avoid
> complete starvation.
>
> The fact that networking has now - for several years - tried to make
> it some kind of thread and get fairness with user threads is all
> entirely antithetical to what softirq was designed for.
>
>> Then, as soon as the load was able to finish in its quantum the
>> pending irqs, we re-enter the mode
>> where softirq are immediately serviced.
>
> Except that's not at all how the code works.
>
> As I pointed out, the softirq thread can be scheduled away, but the
> "softiq_running()" wil stilll return true - and the networking code
> has now screwed up all the *other* softirqs too!
>
> I really suspect that what networking wants is more like the
> workqueues. Or at least more isolation between different softirq
> users, but that's fairly fundamentally hard, given how softirqs are
> designed.
>

Note that when I implemented TCP Small Queues, I did experiments between
using a work queue or a tasklet, and workqueues added unacceptable P99
latencies when many user threads are competing with kernel threads.

I suspect that firing a workqueue for networking RX will likely have
the same effect :/

Note that the current __do_softirq() implementation suffers from the following:

Say we receive a NET_RX softirq
-> While processing the packet, we wake up one thread (thus
need_resched() becomes true), but also raise a tasklet (because a
particular driver needs some extra processing in tasklet context
instead of NET_RX ???)

-> Then we exit __do_softirq() _and_ schedule ksoftirqd (because the
tasklet needs to be serviced)

2018-01-11 20:22:41

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
>
> Note that when I implemented TCP Small queues, I did experiments between
> using a work queue or a tasklet, and workqueues added unacceptable P99
> latencies, when many user threads are competing with kernel threads.

Yes.

So I think one solution might be to have a hybrid system, where we do
the softirqs synchronously normally (which is what you really want
for good latency).

But then fall back to a threaded model - but that fallback case should
be per-softirq, not global. So if one softirq uses a lot of CPU time,
that shouldn't affect the latency of other softirqs.

So maybe we could get rid of the per-cpu ksoftirqd entirely, and
replace it with per-cpu and per-softirq workqueues?

Would something like that sound sane?

Just a SMOP/SMOT (small matter of programming/testing).

Linus

2018-01-11 20:34:48

by Dmitry Safonov

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]>
> wrote:
> >
> > Note that when I implemented TCP Small queues, I did experiments
> > between
> > using a work queue or a tasklet, and workqueues added unacceptable
> > P99
> > latencies, when many user threads are competing with kernel
> > threads.
>
> Yes.
>
> So I think one solution might be to have a hybrid system, where we do
> the softirq's synchronously normally (which is what you really want
> for good latency).
>
> But then fall down on a threaded model - but that fallback case
> should
> be per-softirq, not global. So if one softirq uses a lot of CPU time,
> that shouldn't affect the latency of other softirqs.
>
> So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> replace it with with per-cpu and per-softirq workqueues?
>
> Would something like that sound sane?
>
> Just a SMOP/SMOT (small matter of programming/testing).

I could try to write a PoC for that..
What should be the trigger to fall back to the workqueue?
How do we tell if there are too many softirqs of a kind?
The current logic with if (pending) at the end of __do_softirq()
only seems to work selectively..
It still looks possible to starve a cpu.
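
Something of this shape, maybe (just a sketch of the wiring; all names are
invented and the INIT_WORK() setup at boot is omitted):

/* One work item per softirq vector, per cpu. */
struct softirq_work {
        struct work_struct work;
        unsigned int vec_nr;
};

static DEFINE_PER_CPU(struct softirq_work, softirq_work[NR_SOFTIRQS]);

static void softirq_workfn(struct work_struct *work)
{
        struct softirq_work *sw = container_of(work, struct softirq_work, work);

        local_irq_disable();
        /* process only this vector, e.g. __do_softirq(1U << sw->vec_nr)
         * with the mask from the second RFC patch */
        local_irq_enable();
}

/* fallback path: instead of wakeup_softirqd(), punt just the noisy vector */
static void punt_softirq(unsigned int vec_nr)
{
        struct softirq_work *sw = this_cpu_ptr(&softirq_work[vec_nr]);

        queue_work_on(smp_processor_id(), system_highpri_wq, &sw->work);
}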

--
Thanks,
Dmitry

2018-01-11 20:37:43

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]> wrote:
> On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
>> On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]>
>> wrote:
>> >
>> > Note that when I implemented TCP Small queues, I did experiments
>> > between
>> > using a work queue or a tasklet, and workqueues added unacceptable
>> > P99
>> > latencies, when many user threads are competing with kernel
>> > threads.
>>
>> Yes.
>>
>> So I think one solution might be to have a hybrid system, where we do
>> the softirq's synchronously normally (which is what you really want
>> for good latency).
>>
>> But then fall down on a threaded model - but that fallback case
>> should
>> be per-softirq, not global. So if one softirq uses a lot of CPU time,
>> that shouldn't affect the latency of other softirqs.
>>
>> So maybe we could get rid of the per-cpu ksoftirqd entirely, and
>> replace it with with per-cpu and per-softirq workqueues?
>>
>> Would something like that sound sane?
>>
>> Just a SMOP/SMOT (small matter of programming/testing).
>
> I could try to write a PoC for that..
> What should be the trigger to fall into workqueue?
> How to tell if there're too many softirqs of the kind?
> Current logic with if (pending) in the end of __do_softirq()
> looks working selectively..
> It looks to be still possible to starve a cpu.

I guess we would need to track the amount of time spent processing
softirqs (while interrupting a non-idle task)

2018-01-11 20:40:10

by Linus Torvalds

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]> wrote:
>
> I could try to write a PoC for that..
> What should be the trigger to fall into workqueue?
> How to tell if there're too many softirqs of the kind?

I suspect it would have to be time-based, probably using the scheduler clock.

Most softirqs are really really small. So just counting them probably
isn't all that meaningful, although the count is good as a fallback
(as shown by the jiffy issues).

The good news is that we only have a fairly small handful of softirqs,
so counting/timing them separately is still mainly a pretty small
array (which needs to be percpu, of course).
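
Something like this, maybe (just a sketch; the budget value and the punt
helper are invented, and resetting/decaying the counters is left out):

/* Per-cpu, per-vector time accounting using the scheduler clock. */
static DEFINE_PER_CPU(u64, softirq_time_ns[NR_SOFTIRQS]);

#define SOFTIRQ_TIME_BUDGET     (2 * NSEC_PER_MSEC)     /* arbitrary */

/* inside the __do_softirq() vector loop, around h->action(h): */
        u64 t0 = sched_clock();

        h->action(h);

        __this_cpu_add(softirq_time_ns[vec_nr], sched_clock() - t0);

        if (__this_cpu_read(softirq_time_ns[vec_nr]) > SOFTIRQ_TIME_BUDGET)
                punt_vector_to_thread(vec_nr);  /* invented helper */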

Linus

2018-01-11 20:46:31

by Dmitry Safonov

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]>
> wrote:
> >
> > I could try to write a PoC for that..
> > What should be the trigger to fall into workqueue?
> > How to tell if there're too many softirqs of the kind?
>
> I suspect it would have to be time-based, probably using the
> scheduler clock.

I thought about this, but I was a bit afraid of how pricey it would be
to recalculate it on each clock. Well, I might just try to write that
and measure the impact.

> Most softirqs are really really small. So just counting them probably
> isn't all that meaningful, although the count is good as a fallback
> (as shown by the jiffy issues).
>
> The good news is that we only have a fairly small handful of
> softirqs,
> so counting/timing them separately is still mainly a pretty small
> array (which needs to be percpu, of course).

2018-01-11 20:53:13

by Eric Dumazet

Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <[email protected]> wrote:
> On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote:
>> On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]>
>> wrote:
>> >
>> > I could try to write a PoC for that..
>> > What should be the trigger to fall into workqueue?
>> > How to tell if there're too many softirqs of the kind?
>>
>> I suspect it would have to be time-based, probably using the
>> scheduler clock.
>
> I thought about this, but I was a bit afraid of how much pricey it
> would be recalculate it each clock. Well, might just try to write that
> and measure the impact.
>
>> Most softirqs are really really small. So just counting them probably
>> isn't all that meaningful, although the count is good as a fallback
>> (as shown by the jiffy issues).
>>
>> The good news is that we only have a fairly small handful of
>> softirqs,
>> so counting/timing them separately is still mainly a pretty small
>> array (which needs to be percpu, of course).

Note that using (scheduler) clock might also help to break net_rx_action()
not on a stupid netdev_budget, but on a more precise time limit as well.

netdev_budget of 300 packets is quite big :/

(The time_limit based on jiffies + 2 does not work on hosts with one
cpu, since jiffies won't make progress while net_rx_action() is
running)
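A sketch of what a scheduler-clock based break condition for the
net_rx_action() poll loop could look like, replacing the jiffies-based
time_limit; the helper name and the 2ms value are illustrative only:

/* Evaluated after each napi poll inside net_rx_action(). */
static bool rx_action_done(int budget, u64 start_ns)
{
	if (budget <= 0)
		return true;
	/* sched_clock() keeps advancing even on a single-cpu host,
	 * where jiffies cannot progress while the handler runs. */
	return sched_clock() - start_ns > 2 * NSEC_PER_MSEC;
}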

2018-01-11 21:13:48

by Dmitry Safonov

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, 2018-01-11 at 12:53 -0800, Eric Dumazet wrote:
> On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <[email protected]>
> wrote:
> > On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote:
> > > On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]
> > > >
> > > wrote:
> > > >
> > > > I could try to write a PoC for that..
> > > > What should be the trigger to fall into workqueue?
> > > > How to tell if there're too many softirqs of the kind?
> > >
> > > I suspect it would have to be time-based, probably using the
> > > scheduler clock.
> >
> > I thought about this, but I was a bit afraid of how much pricey it
> > would be recalculate it each clock. Well, might just try to write
> > that
> > and measure the impact.
> >
> > > Most softirqs are really really small. So just counting them
> > > probably
> > > isn't all that meaningful, although the count is good as a
> > > fallback
> > > (as shown by the jiffy issues).
> > >
> > > The good news is that we only have a fairly small handful of
> > > softirqs,
> > > so counting/timing them separately is still mainly a pretty small
> > > array (which needs to be percpu, of course).
>
> Note that using (scheduler) clock might also help to break
> net_rx_action()
> not on a stupid netdev_budget, but on a more precise time limit as
> well.
>
> netdev_budget of 300 packets is quite big :/
>
> (The time_limit based on jiffies + 2 does not work on hosts with one
> cpu, since jiffies wont make progress while net_rx_action() is
> running)

Thanks for the details, Eric.
I'll try to come up with a PoC if no one beats me to it.

--
Thanks,
Dmitry

2018-01-12 05:24:08

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> >
> > Note that when I implemented TCP Small queues, I did experiments between
> > using a work queue or a tasklet, and workqueues added unacceptable P99
> > latencies, when many user threads are competing with kernel threads.
>
> Yes.
>
> So I think one solution might be to have a hybrid system, where we do
> the softirq's synchronously normally (which is what you really want
> for good latency).
>
> But then fall down on a threaded model - but that fallback case should
> be per-softirq, not global. So if one softirq uses a lot of CPU time,
> that shouldn't affect the latency of other softirqs.
>
> So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> replace it with with per-cpu and per-softirq workqueues?

How would that be better than what RT used to do, and I still do for my
RT kernels via boot option, namely split ksoftirqd into per-softirq
threads.

-Mike

2018-01-12 05:41:16

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 09:13:42PM +0000, Dmitry Safonov wrote:
> On Thu, 2018-01-11 at 12:53 -0800, Eric Dumazet wrote:
> > On Thu, Jan 11, 2018 at 12:46 PM, Dmitry Safonov <[email protected]>
> > wrote:
> > > On Thu, 2018-01-11 at 12:40 -0800, Linus Torvalds wrote:
> > > > On Thu, Jan 11, 2018 at 12:34 PM, Dmitry Safonov <[email protected]
> > > > >
> > > > wrote:
> > > > >
> > > > > I could try to write a PoC for that..
> > > > > What should be the trigger to fall into workqueue?
> > > > > How to tell if there're too many softirqs of the kind?
> > > >
> > > > I suspect it would have to be time-based, probably using the
> > > > scheduler clock.
> > >
> > > I thought about this, but I was a bit afraid of how much pricey it
> > > would be recalculate it each clock. Well, might just try to write
> > > that
> > > and measure the impact.
> > >
> > > > Most softirqs are really really small. So just counting them
> > > > probably
> > > > isn't all that meaningful, although the count is good as a
> > > > fallback
> > > > (as shown by the jiffy issues).
> > > >
> > > > The good news is that we only have a fairly small handful of
> > > > softirqs,
> > > > so counting/timing them separately is still mainly a pretty small
> > > > array (which needs to be percpu, of course).
> >
> > Note that using (scheduler) clock might also help to break
> > net_rx_action()
> > not on a stupid netdev_budget, but on a more precise time limit as
> > well.
> >
> > netdev_budget of 300 packets is quite big :/
> >
> > (The time_limit based on jiffies + 2 does not work on hosts with one
> > cpu, since jiffies wont make progress while net_rx_action() is
> > running)
>
> Thanks for the details, Eric.
> I'll try to come up with poc if no one beats me at it.

I just gave it a try. Sorry I couldn't resist :-s

2018-01-12 10:14:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> > >
> > > Note that when I implemented TCP Small queues, I did experiments between
> > > using a work queue or a tasklet, and workqueues added unacceptable P99
> > > latencies, when many user threads are competing with kernel threads.
> >
> > Yes.
> >
> > So I think one solution might be to have a hybrid system, where we do
> > the softirq's synchronously normally (which is what you really want
> > for good latency).
> >
> > But then fall down on a threaded model - but that fallback case should
> > be per-softirq, not global. So if one softirq uses a lot of CPU time,
> > that shouldn't affect the latency of other softirqs.
> >
> > So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> > replace it with with per-cpu and per-softirq workqueues?
>
> How would that be better than what RT used to do, and I still do for my
> RT kernels via boot option, namely split ksoftirqd into per-softirq
> threads.

Since we mention it; one of the problems RT has is that all RX is
through a single softirq context, which generates a priority inversion
between devices.

If we're splitting things, it would be awesome if we could get
per-device context.

2018-01-12 14:58:26

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> > >
> > > Note that when I implemented TCP Small queues, I did experiments between
> > > using a work queue or a tasklet, and workqueues added unacceptable P99
> > > latencies, when many user threads are competing with kernel threads.
> >
> > Yes.
> >
> > So I think one solution might be to have a hybrid system, where we do
> > the softirq's synchronously normally (which is what you really want
> > for good latency).
> >
> > But then fall down on a threaded model - but that fallback case should
> > be per-softirq, not global. So if one softirq uses a lot of CPU time,
> > that shouldn't affect the latency of other softirqs.
> >
> > So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> > replace it with with per-cpu and per-softirq workqueues?
>
> How would that be better than what RT used to do, and I still do for my
> RT kernels via boot option, namely split ksoftirqd into per-softirq
> threads.

Workqueues are probably simpler. Unless you need to set specific prios
for your ksoftirqds? Not sure if that's tunable on workqueues.

2018-01-12 15:16:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote:
> On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> > > >
> > > > Note that when I implemented TCP Small queues, I did experiments between
> > > > using a work queue or a tasklet, and workqueues added unacceptable P99
> > > > latencies, when many user threads are competing with kernel threads.
> > >
> > > Yes.
> > >
> > > So I think one solution might be to have a hybrid system, where we do
> > > the softirq's synchronously normally (which is what you really want
> > > for good latency).
> > >
> > > But then fall down on a threaded model - but that fallback case should
> > > be per-softirq, not global. So if one softirq uses a lot of CPU time,
> > > that shouldn't affect the latency of other softirqs.
> > >
> > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> > > replace it with with per-cpu and per-softirq workqueues?
> >
> > How would that be better than what RT used to do, and I still do for my
> > RT kernels via boot option, namely split ksoftirqd into per-softirq
> > threads.
>
> Workqueue are probably more simple. Unless you need to set specific prios
> to your ksoftirqds? Not sure if that's tunable on workqueues.

No, you can't prioritize workqueues, and they spawn threads whenever
they bloody well feel like.

I carry a hack to give users minimal control over kthread/workqueue
priority. Very handy thing to have, especially if you're doing high
utilization stuff, and would prefer your box actually survive it.

-Mike

2018-01-12 17:00:44

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 04:15:04PM +0100, Mike Galbraith wrote:
> On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote:
> > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> > > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> > > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> > > > >
> > > > > Note that when I implemented TCP Small queues, I did experiments between
> > > > > using a work queue or a tasklet, and workqueues added unacceptable P99
> > > > > latencies, when many user threads are competing with kernel threads.
> > > >
> > > > Yes.
> > > >
> > > > So I think one solution might be to have a hybrid system, where we do
> > > > the softirq's synchronously normally (which is what you really want
> > > > for good latency).
> > > >
> > > > But then fall down on a threaded model - but that fallback case should
> > > > be per-softirq, not global. So if one softirq uses a lot of CPU time,
> > > > that shouldn't affect the latency of other softirqs.
> > > >
> > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> > > > replace it with with per-cpu and per-softirq workqueues?
> > >
> > > How would that be better than what RT used to do, and I still do for my
> > > RT kernels via boot option, namely split ksoftirqd into per-softirq
> > > threads.
> >
> > Workqueue are probably more simple. Unless you need to set specific prios
> > to your ksoftirqds? Not sure if that's tunable on workqueues.
>
> No, you can't prioritize workqueues, and they spawn threads whenever
> they bloody well feel like.
>
> I carry a hack to give users minimal control over kthread/workqueue
> priority. Very handy thing to have, especially if you're doing high
> utilization stuff, and would prefer your box actually survive it.

How useful can system_highpri_wq be in this regard?

2018-01-12 17:24:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Thu, Jan 11, 2018 at 9:23 PM, Mike Galbraith <[email protected]> wrote:
> On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
>>
>> So maybe we could get rid of the per-cpu ksoftirqd entirely, and
>> replace it with with per-cpu and per-softirq workqueues?
>
> How would that be better than what RT used to do, and I still do for my
> RT kernels via boot option, namely split ksoftirqd into per-softirq
> threads.

So I think testing and playing around with things is probably the way
to go, but the reason I would suggest at least looking at workqueues
is that I feel that we already have a *ton* of percpu threads, and
multiplying the ksoftirqd threads by a factor of ten (or however many
softirqs we have) sounds insane.

And yes, they do need to be percpu. Networking wants to do all that
packet handling concurrently across CPUs, so we can't just replace the
per-cpu ksoftirqd threads with per-softirq threads: we really do want
both per-cpu _and_ per-softirq.

And honestly, that sounds insane to do with dedicated threads, most of
which do basically nothing at all.. And it's what the workqueue
threads are designed for - expanding as needed, and not tying up a
dedicated thread for every little work.

But code and testing talks,

Linus

2018-01-12 17:26:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 2:13 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
>>
>> How would that be better than what RT used to do, and I still do for my
>> RT kernels via boot option, namely split ksoftirqd into per-softirq
>> threads.
>
> Since we mention it; one of the problems RT has is that all RX is
> through a single softirq context, which generates a priority inversion
> between devices.

Oh, yes, that is complete and utter shit. Not acceptable at all.

As mentioned, it really would need to be per-cpu _and_ per-softirq.

Which is why I thought workqueues might be the thing. Whatever RT is
doing is apparently just pure and utter garbage.

Linus

2018-01-12 17:45:02

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 2018-01-12 at 09:26 -0800, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 2:13 AM, Peter Zijlstra <[email protected]> wrote:
> > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> >>
> >> How would that be better than what RT used to do, and I still do for my
> >> RT kernels via boot option, namely split ksoftirqd into per-softirq
> >> threads.
> >
> > Since we mention it; one of the problems RT has is that all RX is
> > through a single softirq context, which generates a priority inversion
> > between devices.
>
> Oh, yes, that is complete and utter shit. Not acceptable at all.
>
> As mentioned, it really would need to be per-cpu _and_ per-softirq.
>
> Which is why I thought workqueues might be the thing. Whatever RT is
> doing is apparently just pure and utter garbage.

Nah, a misunderstanding happened. RT that still offers full threading
creates per-softirq threads per cpu. The regular trees split ksoftirqd
into only two threads per cpu, one processes timer/hrtimer softirqs,
the other processes the rest.

-Mike

2018-01-12 17:51:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
>
> Nah, a misunderstanding happened. RT that still offers full threading
> creates per-softirq threads per cpu. The regular trees split ksoftirqd
> into only two threads per cpu, one processes timer/hrtimer softriqs,
> the other processes the rest.

Ok, that sounds like it should work, but it also sounds like it's very
specific to RT itself.

For example, the dvb issue was not about the timer softirqs, but about
the tasklet ones.

So maybe we wouldn't need to split it for _every_ softirq, but we'd
need to split it more than just along the timer case.

And it does sound a bit excessive to have ten fixed threads for every
CPU. The days when tens of CPU's meant "huge system" are gone. These
days it can be a phone.

Linus

2018-01-12 18:10:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 2018-01-12 at 18:00 +0100, Frederic Weisbecker wrote:
> On Fri, Jan 12, 2018 at 04:15:04PM +0100, Mike Galbraith wrote:
> > On Fri, 2018-01-12 at 15:58 +0100, Frederic Weisbecker wrote:
> > > On Fri, Jan 12, 2018 at 06:23:08AM +0100, Mike Galbraith wrote:
> > > > On Thu, 2018-01-11 at 12:22 -0800, Linus Torvalds wrote:
> > > > > On Thu, Jan 11, 2018 at 12:16 PM, Eric Dumazet <[email protected]> wrote:
> > > > > >
> > > > > > Note that when I implemented TCP Small queues, I did experiments between
> > > > > > using a work queue or a tasklet, and workqueues added unacceptable P99
> > > > > > latencies, when many user threads are competing with kernel threads.
> > > > >
> > > > > Yes.
> > > > >
> > > > > So I think one solution might be to have a hybrid system, where we do
> > > > > the softirq's synchronously normally (which is what you really want
> > > > > for good latency).
> > > > >
> > > > > But then fall down on a threaded model - but that fallback case should
> > > > > be per-softirq, not global. So if one softirq uses a lot of CPU time,
> > > > > that shouldn't affect the latency of other softirqs.
> > > > >
> > > > > So maybe we could get rid of the per-cpu ksoftirqd entirely, and
> > > > > replace it with with per-cpu and per-softirq workqueues?
> > > >
> > > > How would that be better than what RT used to do, and I still do for my
> > > > RT kernels via boot option, namely split ksoftirqd into per-softirq
> > > > threads.
> > >
> > > Workqueue are probably more simple. Unless you need to set specific prios
> > > to your ksoftirqds? Not sure if that's tunable on workqueues.
> >
> > No, you can't prioritize workqueues, and they spawn threads whenever
> > they bloody well feel like.
> >
> > I carry a hack to give users minimal control over kthread/workqueue
> > priority. Very handy thing to have, especially if you're doing high
> > utilization stuff, and would prefer your box actually survive it.
>
> How useful system_highpri_wq can be in this regard?

Not at all. You could make the system protect itself via boosting, but
that just makes a noisy mess.

-Mike

2018-01-12 18:16:20

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
> >
> > Nah, a misunderstanding happened. RT that still offers full threading
> > creates per-softirq threads per cpu. The regular trees split ksoftirqd
> > into only two threads per cpu, one processes timer/hrtimer softriqs,
> > the other processes the rest.
>
> Ok, that sounds like it should work, but it also sounds like it's very
> specific to RT itself.
>
> For example, the dvb issue was not about the timer softirqs, but about
> the tasklet ones.
>
> So maybe we wouldn't need to split it for _every_ softirq, but we'd
> need to split it more than just along the timer case.
>
> And it does sound a bit excessive to have ten fixed threads for every
> CPU. The days when tens of CPU's meant "huge system" are gone. These
> days it can be a phone.

Yeah, it is excessive more often than not. You get to prioritize, and
segregate, which is nice, but you pay for it.

-Mike

2018-01-12 18:45:29

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 2018-01-12 at 19:15 +0100, Mike Galbraith wrote:
> On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote:
> > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
> > >
> > > Nah, a misunderstanding happened. RT that still offers full threading
> > > creates per-softirq threads per cpu. The regular trees split ksoftirqd
> > > into only two threads per cpu, one processes timer/hrtimer softriqs,
> > > the other processes the rest.
> >
> > Ok, that sounds like it should work, but it also sounds like it's very
> > specific to RT itself.
> >
> > For example, the dvb issue was not about the timer softirqs, but about
> > the tasklet ones.
> >
> > So maybe we wouldn't need to split it for _every_ softirq, but we'd
> > need to split it more than just along the timer case.
> >
> > And it does sound a bit excessive to have ten fixed threads for every
> > CPU. The days when tens of CPU's meant "huge system" are gone. These
> > days it can be a phone.
>
> Yeah, it is excessive more often than not. You get to prioritize, and
> segregate, which is nice, but you pay for it.

BTW, much of the softirq load in RT is processed by the raising task.

tbench_srv-6985 [000] d...112 293.902511: softirq_raise: vec=3 [action=NET_RX]
tbench_srv-6985 [000] .....13 293.902511: softirq_entry: vec=3 [action=NET_RX]
tbench_srv-6985 [000] .....13 293.902515: softirq_exit: vec=3 [action=NET_RX]
tbench-6984 [003] d...112 293.902520: softirq_raise: vec=3 [action=NET_RX]
tbench-6984 [003] .....13 293.902520: softirq_entry: vec=3 [action=NET_RX]
tbench-6984 [003] .....13 293.902523: softirq_exit: vec=3 [action=NET_RX]

2018-01-12 19:28:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, 12 Jan 2018, Linus Torvalds wrote:
> On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
> >
> > Nah, a misunderstanding happened. RT that still offers full threading
> > creates per-softirq threads per cpu. The regular trees split ksoftirqd
> > into only two threads per cpu, one processes timer/hrtimer softriqs,
> > the other processes the rest.
>
> Ok, that sounds like it should work, but it also sounds like it's very
> specific to RT itself.
>
> For example, the dvb issue was not about the timer softirqs, but about
> the tasklet ones.
>
> So maybe we wouldn't need to split it for _every_ softirq, but we'd
> need to split it more than just along the timer case.
>
> And it does sound a bit excessive to have ten fixed threads for every
> CPU. The days when tens of CPU's meant "huge system" are gone. These
> days it can be a phone.

That's true.

One thing which might be worth a try is utilizing the threaded irq
infrastructure, and that might also pave the way to address Peter's request
for per-device splitting. I've experimented with that in the past but never
got around to finishing it completely. I should have half-baked patches
somewhere in the poison cabinet.

Thanks,

tglx


2018-01-12 19:53:24

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Fri, Jan 12, 2018 at 08:28:06PM +0100, Thomas Gleixner wrote:
> On Fri, 12 Jan 2018, Linus Torvalds wrote:
> > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
> > >
> > > Nah, a misunderstanding happened. RT that still offers full threading
> > > creates per-softirq threads per cpu. The regular trees split ksoftirqd
> > > into only two threads per cpu, one processes timer/hrtimer softriqs,
> > > the other processes the rest.
> >
> > Ok, that sounds like it should work, but it also sounds like it's very
> > specific to RT itself.
> >
> > For example, the dvb issue was not about the timer softirqs, but about
> > the tasklet ones.
> >
> > So maybe we wouldn't need to split it for _every_ softirq, but we'd
> > need to split it more than just along the timer case.
> >
> > And it does sound a bit excessive to have ten fixed threads for every
> > CPU. The days when tens of CPU's meant "huge system" are gone. These
> > days it can be a phone.
>
> That's true.
>
> One thing which might worth a try is utilizing the threaded irq
> infrastructure and that might also pave the way to address Peters request
> for per device splitting. I've experimented with that in the past but never
> got around to finish it completely. I should have half baken patches
> somewhere in the poison cabinet.

I'll gladly have a look at them to see what I can do.

Thanks.

2018-01-17 20:32:20

by David Miller

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

From: Mike Galbraith <[email protected]>
Date: Fri, 12 Jan 2018 19:44:30 +0100

> On Fri, 2018-01-12 at 19:15 +0100, Mike Galbraith wrote:
>> On Fri, 2018-01-12 at 09:51 -0800, Linus Torvalds wrote:
>> > On Fri, Jan 12, 2018 at 9:44 AM, Mike Galbraith <[email protected]> wrote:
>> > >
>> > > Nah, a misunderstanding happened. RT that still offers full threading
>> > > creates per-softirq threads per cpu. The regular trees split ksoftirqd
>> > > into only two threads per cpu, one processes timer/hrtimer softriqs,
>> > > the other processes the rest.
>> >
>> > Ok, that sounds like it should work, but it also sounds like it's very
>> > specific to RT itself.
>> >
>> > For example, the dvb issue was not about the timer softirqs, but about
>> > the tasklet ones.
>> >
>> > So maybe we wouldn't need to split it for _every_ softirq, but we'd
>> > need to split it more than just along the timer case.
>> >
>> > And it does sound a bit excessive to have ten fixed threads for every
>> > CPU. The days when tens of CPU's meant "huge system" are gone. These
>> > days it can be a phone.
>>
>> Yeah, it is excessive more often than not. You get to prioritize, and
>> segregate, which is nice, but you pay for it.
>
> BTW, much of the softirq load in RT is processed by the raising task.
>
> tbench_srv-6985 [000] d...112 293.902511: softirq_raise: vec=3 [action=NET_RX]
> tbench_srv-6985 [000] .....13 293.902511: softirq_entry: vec=3 [action=NET_RX]
> tbench_srv-6985 [000] .....13 293.902515: softirq_exit: vec=3 [action=NET_RX]
> tbench-6984 [003] d...112 293.902520: softirq_raise: vec=3 [action=NET_RX]
> tbench-6984 [003] .....13 293.902520: softirq_entry: vec=3 [action=NET_RX]
> tbench-6984 [003] .....13 293.902523: softirq_exit: vec=3 [action=NET_RX]

And this is because tbench runs over loopback, which triggers softirqs in
the context of whatever generates the loopback packet transmit. Which in
this case is the tbench process calling sendmsg().

I wanted to chime in about this earlier, and make it clear that it isn't
just IRQs that can trigger softirqs. User context actions in the kernel
can trigger softirqs too.

2018-01-17 21:07:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 17, 2018 at 12:30 PM, David Miller <[email protected]> wrote:
>
> I wanted to chime in about this earlier, and make it clear that it isn't
> just IRQs that can trigger softirqs. User context actions in the kernel
> can trigger softirqs too.

Yes, anybody can do that "raise_softirq()" thing, although the common
case tends to be an interrupt that wants to defer longer-running work
to a non-irq-disabled context.

It was in some way always a "poor man's interrupt thread" (with no
blocking like a real thread context, but at least not impacting actual
interrupt latency).

That said, this made me wonder a bit. I wonder how bounded the latency
is for raising a softirq from process context. We only _check_ the
softirq on the last hardirq exit, I think.

The latency was traditionally bounded by the timer irq, and non-idle
CPUs should still be running timers. But I do note that networking
does seem to have some private hacks for the latency problem (i.e.
net_tx_action())?

I wonder if we should run softirqs on return to user mode (and make
softirq set a thread flag if not in interrupt context).

Although maybe some people actually want the "delay until next
interrupt" kind of behavior to throttle things if they do softirqs
from process context. That sounds unlikely, but who knows.

Linus

2018-01-17 21:50:53

by David Miller

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

From: Linus Torvalds <[email protected]>
Date: Wed, 17 Jan 2018 13:06:58 -0800

> It was in some way always a "poor mans interrupt thread" (with no
> blocking like a real thread context, but at least not impacting actual
> interrupt latency).

Or in this loopback device case (and tunnel decapsulation) a poor
man's longjmp, releasing the current stack frame to keep the depth
in check.

Anyways...

> That said, this made me wonder a bit. I wonder how bounded the latency
> is for raising a softirq from process context. We only _check_ the
> softirq on the last hardirq exit, I think.

System call return checks it, otherwise this situation would be
completely bolixed.

> I wonder if we should run softirqs on return to user mode (and make
> softirq set a thread flag if not in interrupt context).

I'm pretty sure we already do.

2018-01-17 21:55:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 17, 2018 at 1:49 PM, David Miller <[email protected]> wrote:
>
>> That said, this made me wonder a bit. I wonder how bounded the latency
>> is for raising a softirq from process context. We only _check_ the
>> softirq on the last hardirq exit, I think.
>
> System call return checks it, otherwise this situation would be
> completely bolixed.

That's what I thought too. But then I went and looked, and I can't find it.

But you're probably right, and I just missed it.

Linus

2018-01-17 21:56:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, 17 Jan 2018, David Miller wrote:

> From: Linus Torvalds <[email protected]>
> Date: Wed, 17 Jan 2018 13:06:58 -0800
>
> > It was in some way always a "poor mans interrupt thread" (with no
> > blocking like a real thread context, but at least not impacting actual
> > interrupt latency).
>
> Or in this loopback device case (and tunnel decapsulation) a poor
> man's longjmp, releasing the current stack frame to keep the depth
> in check.
>
> Anyways...
>
> > That said, this made me wonder a bit. I wonder how bounded the latency
> > is for raising a softirq from process context. We only _check_ the
> > softirq on the last hardirq exit, I think.
>
> System call return checks it, otherwise this situation would be
> completely bolixed.

Errm. No.

>
> > I wonder if we should run softirqs on return to user mode (and make
> > softirq set a thread flag if not in interrupt context).
>
> I'm pretty sure we already do.

Nope.

raise_softirq() -> raise_softirq_irqoff()

set_softirq_bit();

if (!in_interrupt())
wake_softirqd();

So if the caller is not in hard or soft interrupt context, which includes
bottom half disabled regions softirqd is woken.

If the caller is in a bottom half disabled region then local_bh_enable()
will run the pending softirqs.

Thanks,

tglx
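For reference, the mainline helper described above reads roughly like this
in kernel/softirq.c:

inline void raise_softirq_irqoff(unsigned int nr)
{
	__raise_softirq_irqoff(nr);

	/*
	 * If we're in an interrupt or softirq, we're done
	 * (this also catches softirq-disabled code). We will
	 * actually run the softirq once we return from
	 * the irq or softirq.
	 *
	 * Otherwise we wake up ksoftirqd to make sure we
	 * schedule the softirq soon.
	 */
	if (!in_interrupt())
		wakeup_softirqd();
}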


2018-01-17 21:58:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <[email protected]> wrote:
> raise_softirq() -> raise_softirq_irqoff()
>
> set_softirq_bit();
>
> if (!in_interrupt())
> wake_softirqd();
>
> So if the caller is not in hard or soft interrupt context, which includes
> bottom half disabled regions softirqd is woken.

That does seem unnecessarily expensive, and maybe we could just do it
with thread flag (TIF_NOTIFY_RESUME or whatever).

In fact, that was what I *thought* we did. Maybe I just remember some
historical behavior.

Since networking seems to largely prefer softirqd anyway, maybe that
wake_softirqd() is the right thing to do anyway.

Linus
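A purely hypothetical sketch of the thread-flag idea: instead of waking
ksoftirqd from process context, tag the current task and run the pending
softirqs on the way back to user mode. None of this is existing code and
the hook name is invented:

static void raise_softirq_from_task(unsigned int nr)
{
	__raise_softirq_irqoff(nr);
	if (!in_interrupt())
		set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
}

/* Invented hook: would be called from the return-to-user path when the
 * TIF_NOTIFY_RESUME work is processed. */
static void softirq_notify_resume(void)
{
	if (local_softirq_pending())
		do_softirq();
}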

2018-01-17 22:00:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, 17 Jan 2018, Linus Torvalds wrote:

> On Wed, Jan 17, 2018 at 1:49 PM, David Miller <[email protected]> wrote:
> >
> >> That said, this made me wonder a bit. I wonder how bounded the latency
> >> is for raising a softirq from process context. We only _check_ the
> >> softirq on the last hardirq exit, I think.
> >
> > System call return checks it, otherwise this situation would be
> > completely bolixed.
>
> That's what I thought too. But then I went and looked, and I can't find it.
>
> But you're probably right, and I just missed it.

Not really. There is nothing to see there.

Thanks,

tglx

2018-01-17 22:01:27

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, 17 Jan 2018, Linus Torvalds wrote:

> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <[email protected]> wrote:
> > raise_softirq() -> raise_softirq_irqoff()
> >
> > set_softirq_bit();
> >
> > if (!in_interrupt())
> > wake_softirqd();
> >
> > So if the caller is not in hard or soft interrupt context, which includes
> > bottom half disabled regions softirqd is woken.
>
> That does seem unnecessarily expensive, and maybe we could just do it
> with thread flag (TIF_NOTIFY_RESUME or whatever).
>
> In fact, that was what I *thought* we did. Maybe I just remember some
> historical behavior.
>
> Since networking seems to largely prefer softirqd anyway, maybe that
> wake_softirqd() is the right thing to do anyway.

Well, but we only do it when we are not in a bh disabled region. The places
where thread context raises the network softirqs are usually inside a bh
disabled region, so the softirq is executed on local_bh_enable(). The
thread is woken up rarely.

Thanks,

tglx

2018-01-17 22:04:04

by Eric Dumazet

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 17 Jan 2018, Linus Torvalds wrote:
>
>> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <[email protected]> wrote:
>> > raise_softirq() -> raise_softirq_irqoff()
>> >
>> > set_softirq_bit();
>> >
>> > if (!in_interrupt())
>> > wake_softirqd();
>> >
>> > So if the caller is not in hard or soft interrupt context, which includes
>> > bottom half disabled regions softirqd is woken.
>>
>> That does seem unnecessarily expensive, and maybe we could just do it
>> with thread flag (TIF_NOTIFY_RESUME or whatever).
>>
>> In fact, that was what I *thought* we did. Maybe I just remember some
>> historical behavior.
>>
>> Since networking seems to largely prefer softirqd anyway, maybe that
>> wake_softirqd() is the right thing to do anyway.
>
> Well, but we only do it when we are not in a bh disabled region. The places
> where thread context raises the network softirqs is usually inside a bh
> disabled region, so the softirq is executed on local_bh_enable(). The
> thread is woken up rarely.

There is also the netif_rx_ni() stuff.

Can't remember right now why it is not using a
local_bh_{disable,enable}() pair instead
of preempt_disable() ... if (local_softirq_pending()) do_softirq();
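For context, the netif_rx_ni() pattern Eric refers to reads roughly like
this in net/core/dev.c of that era (quoted from memory, so approximate):

int netif_rx_ni(struct sk_buff *skb)
{
	int err;

	preempt_disable();
	err = netif_rx_internal(skb);
	/* Nothing else will run the pending softirq here: we are neither
	 * in interrupt context nor in a bh-disabled region, so kick it
	 * by hand before re-enabling preemption. */
	if (local_softirq_pending())
		do_softirq();
	preempt_enable();

	return err;
}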

2018-01-17 22:27:01

by David Miller

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

From: Eric Dumazet <[email protected]>
Date: Wed, 17 Jan 2018 14:02:43 -0800

> On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <[email protected]> wrote:
>> On Wed, 17 Jan 2018, Linus Torvalds wrote:
>>
>>> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <[email protected]> wrote:
>>> > raise_softirq() -> raise_softirq_irqoff()
>>> >
>>> > set_softirq_bit();
>>> >
>>> > if (!in_interrupt())
>>> > wake_softirqd();
>>> >
>>> > So if the caller is not in hard or soft interrupt context, which includes
>>> > bottom half disabled regions softirqd is woken.
>>>
>>> That does seem unnecessarily expensive, and maybe we could just do it
>>> with thread flag (TIF_NOTIFY_RESUME or whatever).
>>>
>>> In fact, that was what I *thought* we did. Maybe I just remember some
>>> historical behavior.
>>>
>>> Since networking seems to largely prefer softirqd anyway, maybe that
>>> wake_softirqd() is the right thing to do anyway.
>>
>> Well, but we only do it when we are not in a bh disabled region. The places
>> where thread context raises the network softirqs is usually inside a bh
>> disabled region, so the softirq is executed on local_bh_enable(). The
>> thread is woken up rarely.
>
> There is also the netif_rx_ni() stuff.
>
> Can't remember right now why it is not using
> local_bh_{diable,enable}() pair instead
> of preempt_disable() ... if (local_softirq_pending()) do_softirq();

Hmmm, that code predates the initial GIT repository build.

I do remember we had some back and forth with that stuff.


2018-01-17 22:55:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Wed, 17 Jan 2018, David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Wed, 17 Jan 2018 14:02:43 -0800
>
> > On Wed, Jan 17, 2018 at 2:00 PM, Thomas Gleixner <[email protected]> wrote:
> >> On Wed, 17 Jan 2018, Linus Torvalds wrote:
> >>
> >>> On Wed, Jan 17, 2018 at 1:54 PM, Thomas Gleixner <[email protected]> wrote:
> >>> > raise_softirq() -> raise_softirq_irqoff()
> >>> >
> >>> > set_softirq_bit();
> >>> >
> >>> > if (!in_interrupt())
> >>> > wake_softirqd();
> >>> >
> >>> > So if the caller is not in hard or soft interrupt context, which includes
> >>> > bottom half disabled regions softirqd is woken.
> >>>
> >>> That does seem unnecessarily expensive, and maybe we could just do it
> >>> with thread flag (TIF_NOTIFY_RESUME or whatever).
> >>>
> >>> In fact, that was what I *thought* we did. Maybe I just remember some
> >>> historical behavior.
> >>>
> >>> Since networking seems to largely prefer softirqd anyway, maybe that
> >>> wake_softirqd() is the right thing to do anyway.
> >>
> >> Well, but we only do it when we are not in a bh disabled region. The places
> >> where thread context raises the network softirqs is usually inside a bh
> >> disabled region, so the softirq is executed on local_bh_enable(). The
> >> thread is woken up rarely.
> >
> > There is also the netif_rx_ni() stuff.
> >
> > Can't remember right now why it is not using
> > local_bh_{diable,enable}() pair instead
> > of preempt_disable() ... if (local_softirq_pending()) do_softirq();
>
> Hmmm, that code predates the initial GIT repository build.
>
> I do remember we had some back and forth with that stuff.

commit 3ace873f40916682583562c1e1f324a4a50a2811 in

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git

deinlines that function and drops a comment while at it:

-/* Post buffer to the network code from _non interrupt_ context.
- * see net/core/dev.c for netif_rx description.
- */
-static inline int netif_rx_ni(struct sk_buff *skb)

It was added somewhere around 2.5.0

Thanks,

tglx

by Sebastian Andrzej Siewior

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On 2018-01-17 17:24:47 [-0500], David Miller wrote:
> From: Eric Dumazet <[email protected]>
> Date: Wed, 17 Jan 2018 14:02:43 -0800
> > There is also the netif_rx_ni() stuff.
> >
> > Can't remember right now why it is not using
> > local_bh_{diable,enable}() pair instead
> > of preempt_disable() ... if (local_softirq_pending()) do_softirq();
>
> Hmmm, that code predates the initial GIT repository build.
>
> I do remember we had some back and forth with that stuff.

So I did a little research and tried to replace preempt_disable() with
local_bh_disable() [0] a while ago.

[0] https://lkml.kernel.org/r/[email protected]

Sebastian

2018-02-13 16:37:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context

On Tue, 6 Feb 2018, Sebastian Andrzej Siewior wrote:

> On 2018-01-17 17:24:47 [-0500], David Miller wrote:
> > From: Eric Dumazet <[email protected]>
> > Date: Wed, 17 Jan 2018 14:02:43 -0800
> > > There is also the netif_rx_ni() stuff.
> > >
> > > Can't remember right now why it is not using
> > > local_bh_{diable,enable}() pair instead
> > > of preempt_disable() ... if (local_softirq_pending()) do_softirq();
> >
> > Hmmm, that code predates the initial GIT repository build.
> >
> > I do remember we had some back and forth with that stuff.
>
> So I did a little research and tried to replace preempt_disable() with
> local_bh_disable() [0] a while ago.
>
> [0] https://lkml.kernel.org/r/[email protected]

Maybe you should just try again with a slightly better changelog :)