2022-12-22 22:43:12

by Jakub Kicinski

Subject: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

In networking we try to keep Tx packet queues small, so we limit
how many bytes a socket may packetize and queue up. Tx completions
(from NAPI) notify the sockets when packets have left the system
(NIC Tx completion) and the socket schedules a tasklet to queue
the next batch of frames.

This leads to a situation where we go thru the softirq loop twice.
First round we have pending = NET (from the NIC IRQ/NAPI), and
the second iteration has pending = TASKLET (the socket tasklet).

On two web workloads I looked at this condition accounts for 10%
and 23% of all ksoftirqd wake ups respectively. We run NAPI
which wakes some process up, we hit need_resched() and wake up
ksoftirqd just to run the TSQ (TCP small queues) tasklet.

Tweak the need_resched() condition to be ignored if all pending
softIRQs are "non-deferred". The tasklet would run relatively
soon, anyway, but once ksoftirqd is woken we're risking stalls.

I did not see any negative impact on the latency in an RR test
on a loaded machine with this change applied.

Signed-off-by: Jakub Kicinski <[email protected]>
---
kernel/softirq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index ad200d386ec1..4ac59ffb0d55 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

if (time_is_before_eq_jiffies(end) || !--max_restart)
limit = SOFTIRQ_OVERLOAD_TIME;
- else if (need_resched())
+ else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK)
limit = SOFTIRQ_DEFER_TIME;
else
goto restart;
--
2.38.1


2023-01-09 10:03:40

by Peter Zijlstra

Subject: Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote:
> In networking we try to keep Tx packet queues small, so we limit
> how many bytes a socket may packetize and queue up. Tx completions
> (from NAPI) notify the sockets when packets have left the system
> (NIC Tx completion) and the socket schedules a tasklet to queue
> the next batch of frames.
>
> This leads to a situation where we go thru the softirq loop twice.
> First round we have pending = NET (from the NIC IRQ/NAPI), and
> the second iteration has pending = TASKLET (the socket tasklet).

So to me that sounds like you want to fix the network code to not do
this then. Why can't the NAPI thing directly queue the next batch; why
do you have to do a softirq roundtrip like this?

> On two web workloads I looked at this condition accounts for 10%
> and 23% of all ksoftirqd wake ups respectively. We run NAPI
> which wakes some process up, we hit need_resched() and wake up
> ksoftirqd just to run the TSQ (TCP small queues) tasklet.
>
> Tweak the need_resched() condition to be ignored if all pending
> softIRQs are "non-deferred". The tasklet would run relatively
> soon, anyway, but once ksoftirqd is woken we're risking stalls.
>
> I did not see any negative impact on the latency in an RR test
> on a loaded machine with this change applied.

Ignoring need_resched() will get you in trouble with RT people real
fast.

2023-01-09 10:48:07

by Eric Dumazet

Subject: Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

On Mon, Jan 9, 2023 at 10:44 AM Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote:
> > In networking we try to keep Tx packet queues small, so we limit
> > how many bytes a socket may packetize and queue up. Tx completions
> > (from NAPI) notify the sockets when packets have left the system
> > (NIC Tx completion) and the socket schedules a tasklet to queue
> > the next batch of frames.
> >
> > This leads to a situation where we go thru the softirq loop twice.
> > First round we have pending = NET (from the NIC IRQ/NAPI), and
> > the second iteration has pending = TASKLET (the socket tasklet).
>
> So to me that sounds like you want to fix the network code to not do
> this then. Why can't the NAPI thing directly queue the next batch; why
> do you have to do a softirq roundtrip like this?

I think Jakub refers to the tcp_wfree() code, which can be called from
arbitrary contexts, including non-NAPI ones, and with the socket locked
(by this thread or another) or not locked at all (say, if the skb is
freed from a TX completion handler or a qdisc drop).

>
> > On two web workloads I looked at this condition accounts for 10%
> > and 23% of all ksoftirqd wake ups respectively. We run NAPI
> > which wakes some process up, we hit need_resched() and wake up
> > ksoftirqd just to run the TSQ (TCP small queues) tasklet.
> >
> > Tweak the need_resched() condition to be ignored if all pending
> > softIRQs are "non-deferred". The tasklet would run relatively
> > soon, anyway, but once ksoftirqd is woken we're risking stalls.
> >
> > I did not see any negative impact on the latency in an RR test
> > on a loaded machine with this change applied.
>
> Ignoring need_resched() will get you in trouble with RT people real
> fast.

2023-01-09 20:00:38

by Jakub Kicinski

Subject: Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

On Mon, 9 Jan 2023 11:16:45 +0100 Eric Dumazet wrote:
> > On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote:
> > > In networking we try to keep Tx packet queues small, so we limit
> > > how many bytes a socket may packetize and queue up. Tx completions
> > > (from NAPI) notify the sockets when packets have left the system
> > > (NIC Tx completion) and the socket schedules a tasklet to queue
> > > the next batch of frames.
> > >
> > > This leads to a situation where we go thru the softirq loop twice.
> > > First round we have pending = NET (from the NIC IRQ/NAPI), and
> > > the second iteration has pending = TASKLET (the socket tasklet).
> >
> > So to me that sounds like you want to fix the network code to not do
> > this then. Why can't the NAPI thing directly queue the next batch; why
> > do you have to do a softirq roundtrip like this?
>
> I think Jakub refers to the tcp_wfree() code, which can be called from
> arbitrary contexts, including non-NAPI ones, and with the socket locked
> (by this thread or another) or not locked at all (say, if the skb is
> freed from a TX completion handler or a qdisc drop).

Yes, fwiw.

> > > On two web workloads I looked at this condition accounts for 10%
> > > and 23% of all ksoftirqd wake ups respectively. We run NAPI
> > > which wakes some process up, we hit need_resched() and wake up
> > > ksoftirqd just to run the TSQ (TCP small queues) tasklet.
> > >
> > > Tweak the need_resched() condition to be ignored if all pending
> > > softIRQs are "non-deferred". The tasklet would run relatively
> > > soon, anyway, but once ksoftirqd is woken we're risking stalls.
> > >
> > > I did not see any negative impact on the latency in an RR test
> > > on a loaded machine with this change applied.
> >
> > Ignoring need_resched() will get you in trouble with RT people real
> > fast.

Ah, you're right :/ Is it good enough if we throw || force_irqthreads()
into the condition?

Otherwise we can just postpone this optimization; the overload
time horizon / limit patch is much more important.

2023-03-03 11:41:57

by Thomas Gleixner

Subject: Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

On Mon, Jan 09 2023 at 10:44, Peter Zijlstra wrote:
> On Thu, Dec 22, 2022 at 02:12:44PM -0800, Jakub Kicinski wrote:
>> Tweak the need_resched() condition to be ignored if all pending
>> softIRQs are "non-deferred". The tasklet would run relatively
>> soon, anyway, but once ksoftirqd is woken we're risking stalls.
>>
>> I did not see any negative impact on the latency in an RR test
>> on a loaded machine with this change applied.
>
> Ignoring need_resched() will get you in trouble with RT people real
> fast.

In this case not really. softirq processing is preemptible in RT, but
it's still a major pain ...

Thanks,

tglx


2023-03-03 14:17:25

by Thomas Gleixner

Subject: Re: [PATCH 3/3] softirq: don't yield if only expedited handlers are pending

Jakub!

On Thu, Dec 22 2022 at 14:12, Jakub Kicinski wrote:
> This leads to a situation where we go thru the softirq loop twice.
> First round we have pending = NET (from the NIC IRQ/NAPI), and
> the second iteration has pending = TASKLET (the socket tasklet).

...

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index ad200d386ec1..4ac59ffb0d55 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -601,7 +601,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>
> if (time_is_before_eq_jiffies(end) || !--max_restart)
> limit = SOFTIRQ_OVERLOAD_TIME;
> - else if (need_resched())
> + else if (need_resched() && pending & ~SOFTIRQ_NOW_MASK)
> limit = SOFTIRQ_DEFER_TIME;
> else
> goto restart;

While this is the least of my softirq worries on PREEMPT_RT, Peter is
right about real-time tasks being deferred on a PREEMPT_RT=n
kernel. That's a real issue for low-latency audio, which John Stultz is
trying to resolve. Especially as the above check can go in circles.

I fear we need to go back to the drawing board and come up with a real
solution which takes these contradicting aspects into account. Let me
stare at Peter's and John's patches for a while.

Thanks

tglx