2023-07-03 13:05:28

by Wander Lairson Costa

Subject: Splat in kernel RT while processing incoming network packets

Dear all,

I am writing to report a splat issue we encountered while running the
Real-Time (RT) kernel in conjunction with Network RPS (Receive Packet
Steering).

During some testing of the RT kernel version 6.4.0 with Network RPS enabled,
we observed a splat occurring in the SoftIRQ subsystem. The splat message is as
follows:

[ 37.168920] ------------[ cut here ]------------
[ 37.168925] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:291 do_softirq_post_smp_call_flush+0x2d/0x60
[ 37.168935] Modules linked in: xt_conntrack(E) ...
[ 37.168976] Unloaded tainted modules: intel_cstate(E):4 intel_uncore(E):3
[ 37.168994] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E ------- --- 6.4.0-0.rc2.23.test.eln127.x86_64+rt #1
[ 37.168996] Hardware name: Red Hat KVM, BIOS 1.15.0-2.module+el8.6.0+14757+c25ee005 04/01/2014
[ 37.168998] RIP: 0010:do_softirq_post_smp_call_flush+0x2d/0x60
[ 37.169001] Code: 00 0f 1f 44 00 00 53 89 fb 48 c7 c7 f7 98 be 96 e8 d8 97 d2 00 65 66 8b 05 f8 36 ...
[ 37.169002] RSP: 0018:ffffffff97403eb0 EFLAGS: 00010002
[ 37.169004] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
[ 37.169005] RDX: ffff992db7a34840 RSI: ffffffff96be98f7 RDI: ffffffff96bc23d8
[ 37.169006] RBP: ffffffff97410000 R08: ffff992db7a34840 R09: ffff992c87f8dbc0
[ 37.169007] R10: 00000000fffbfc67 R11: 0000000000000018 R12: 0000000000000000
[ 37.169008] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 37.169011] FS: 0000000000000000(0000) GS:ffff992db7a00000(0000) knlGS:0000000000000000
[ 37.169013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 37.169014] CR2: 00007f028b8da3f8 CR3: 0000000118f44001 CR4: 0000000000370eb0
[ 37.169015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 37.169015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 37.169016] Call Trace:
[ 37.169018] <TASK>
[ 37.169020] flush_smp_call_function_queue+0x78/0x80
[ 37.169026] do_idle+0xb2/0xd0
[ 37.169030] cpu_startup_entry+0x1d/0x20
[ 37.169032] rest_init+0xd1/0xe0
[ 37.169037] arch_call_rest_init+0xe/0x30
[ 37.169044] start_kernel+0x342/0x420
[ 37.169046] x86_64_start_reservations+0x18/0x30
[ 37.169051] x86_64_start_kernel+0x96/0xa0
[ 37.169054] secondary_startup_64_no_verify+0x10b/0x10b
[ 37.169059] </TASK>
[ 37.169060] ---[ end trace 0000000000000000 ]---

It comes from [1].

The issue lies in the mechanism of RPS to defer network packet processing to
other CPUs. It sends an IPI to the target CPU. The registered callback
is rps_trigger_softirq, which will raise a softirq, leading to the following
scenario:

CPU0 CPU1
| netif_rx() |
| | enqueue_to_backlog(cpu=1) |
| | | net_rps_send_ipi() |
| | flush_smp_call_function_queue()
| | | was_pending = local_softirq_pending()
| | | __flush_smp_call_function_queue()
| | | rps_trigger_softirq()
| | | | __raise_softirq_irqoff()
| | | do_softirq_post_smp_call_flush()

That has the undesired side effect of raising a softirq in a function call,
leading to the aforementioned splat.
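
For reference, the pieces involved look roughly like this (paraphrased from
net/core/dev.c and kernel/softirq.c; details may differ between versions):

/* net/core/dev.c (approximate): the IPI callback used by RPS */
static void rps_trigger_softirq(void *data)
{
        struct softnet_data *sd = data;

        /* schedules the backlog NAPI instance, which ends up in
         * __raise_softirq_irqoff(NET_RX_SOFTIRQ) */
        ____napi_schedule(sd, &sd->backlog);
        sd->received_rps++;
}

/* kernel/softirq.c (approximate), PREEMPT_RT only: the check that fires
 * when a softirq was raised during the SMP-call flush */
void do_softirq_post_smp_call_flush(unsigned int was_pending)
{
        if (WARN_ON_ONCE(was_pending != local_softirq_pending()))
                invoke_softirq();
}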

The kernel version is kernel-ark [1], os-build-rt branch. It is essentially the
upstream kernel with the PREEMPT_RT patches, and with RHEL configs. I can
provide the .config.

The only solution I imagined so far was to modify RPS to process packets in a
kernel thread in RT. But I wonder how that would be different from processing
them in ksoftirqd.

Any inputs on the issue?

[1] https://elixir.bootlin.com/linux/latest/source/kernel/softirq.c#L306

Cheers,
Wander



2023-07-03 13:32:17

by Wander Lairson Costa

Subject: Re: Splat in kernel RT while processing incoming network packets

On Mon, Jul 03, 2023 at 09:47:26AM -0300, Wander Lairson Costa wrote:
> Dear all,
>
> I am writing to report a splat issue we encountered while running the
> Real-Time (RT) kernel in conjunction with Network RPS (Receive Packet
> Steering).
>
> During some testing of the RT kernel version 6.4.0 with Network RPS enabled,
> we observed a splat occurring in the SoftIRQ subsystem. The splat message is as
> follows:
>
> [ 37.168920] ------------[ cut here ]------------
> [ 37.168925] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:291 do_softirq_post_smp_call_flush+0x2d/0x60
> [ 37.168935] Modules linked in: xt_conntrack(E) ...
> [ 37.168976] Unloaded tainted modules: intel_cstate(E):4 intel_uncore(E):3
> [ 37.168994] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E ------- --- 6.4.0-0.rc2.23.test.eln127.x86_64+rt #1
> [ 37.168996] Hardware name: Red Hat KVM, BIOS 1.15.0-2.module+el8.6.0+14757+c25ee005 04/01/2014
> [ 37.168998] RIP: 0010:do_softirq_post_smp_call_flush+0x2d/0x60
> [ 37.169001] Code: 00 0f 1f 44 00 00 53 89 fb 48 c7 c7 f7 98 be 96 e8 d8 97 d2 00 65 66 8b 05 f8 36 ...
> [ 37.169002] RSP: 0018:ffffffff97403eb0 EFLAGS: 00010002
> [ 37.169004] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000003
> [ 37.169005] RDX: ffff992db7a34840 RSI: ffffffff96be98f7 RDI: ffffffff96bc23d8
> [ 37.169006] RBP: ffffffff97410000 R08: ffff992db7a34840 R09: ffff992c87f8dbc0
> [ 37.169007] R10: 00000000fffbfc67 R11: 0000000000000018 R12: 0000000000000000
> [ 37.169008] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 37.169011] FS: 0000000000000000(0000) GS:ffff992db7a00000(0000) knlGS:0000000000000000
> [ 37.169013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 37.169014] CR2: 00007f028b8da3f8 CR3: 0000000118f44001 CR4: 0000000000370eb0
> [ 37.169015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 37.169015] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 37.169016] Call Trace:
> [ 37.169018] <TASK>
> [ 37.169020] flush_smp_call_function_queue+0x78/0x80
> [ 37.169026] do_idle+0xb2/0xd0
> [ 37.169030] cpu_startup_entry+0x1d/0x20
> [ 37.169032] rest_init+0xd1/0xe0
> [ 37.169037] arch_call_rest_init+0xe/0x30
> [ 37.169044] start_kernel+0x342/0x420
> [ 37.169046] x86_64_start_reservations+0x18/0x30
> [ 37.169051] x86_64_start_kernel+0x96/0xa0
> [ 37.169054] secondary_startup_64_no_verify+0x10b/0x10b
> [ 37.169059] </TASK>
> [ 37.169060] ---[ end trace 0000000000000000 ]---
>
> It comes from [1].
>
> The issue lies in the mechanism of RPS to defer network packet processing to
> other CPUs. It sends an IPI to the target CPU. The registered callback
> is rps_trigger_softirq, which will raise a softirq, leading to the following
> scenario:
>
> CPU0 CPU1
> | netif_rx() |
> | | enqueue_to_backlog(cpu=1) |
> | | | net_rps_send_ipi() |
> | | flush_smp_call_function_queue()
> | | | was_pending = local_softirq_pending()
> | | | __flush_smp_call_function_queue()
> | | | rps_trigger_softirq()
> | | | | __raise_softirq_irqoff()
> | | | do_softirq_post_smp_call_flush()
>
> That has the undesired side effect of raising a softirq in a function call,
> leading to the aforementioned splat.
>
> The kernel version is kernel-ark [1], os-build-rt branch. It is essentially the

Correction: kernel-ark [2]

> upstream kernel with the PREEMPT_RT patches, and with RHEL configs. I can
> provide the .config.
>
> The only solution I imagined so far was to modify RPS to process packets in a
> kernel thread in RT. But I wonder how that would be different from processing
> them in ksoftirqd.
>
> Any inputs on the issue?
>
> [1] https://elixir.bootlin.com/linux/latest/source/kernel/softirq.c#L306
>

[2] https://gitlab.com/cki-project/kernel-ark

> Cheers,
> Wander
>


Subject: Re: Splat in kernel RT while processing incoming network packets

On 2023-07-03 09:47:26 [-0300], Wander Lairson Costa wrote:
> Dear all,
Hi,

> I am writing to report a splat issue we encountered while running the
> Real-Time (RT) kernel in conjunction with Network RPS (Receive Packet
> Steering).
>
> During some testing of the RT kernel version 6.4.0 with Network RPS enabled,
> we observed a splat occurring in the SoftIRQ subsystem. The splat message is as
> follows:
>
> [ 37.168920] ------------[ cut here ]------------
> [ 37.168925] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:291 do_softirq_post_smp_call_flush+0x2d/0x60

> [ 37.169060] ---[ end trace 0000000000000000 ]---
>
> It comes from [1].
>
> The issue lies in the mechanism of RPS to defer network packet processing to
> other CPUs. It sends an IPI to the target CPU. The registered callback
> is rps_trigger_softirq, which will raise a softirq, leading to the following
> scenario:
>
> CPU0 CPU1
> | netif_rx() |
> | | enqueue_to_backlog(cpu=1) |
> | | | net_rps_send_ipi() |
> | | flush_smp_call_function_queue()
> | | | was_pending = local_softirq_pending()
> | | | __flush_smp_call_function_queue()
> | | | rps_trigger_softirq()
> | | | | __raise_softirq_irqoff()
> | | | do_softirq_post_smp_call_flush()
>
> That has the undesired side effect of raising a softirq in a function call,
> leading to the aforementioned splat.

correct.

> The kernel version is kernel-ark [1], os-build-rt branch. It is essentially the
> upstream kernel with the PREEMPT_RT patches, and with RHEL configs. I can
> provide the .config.

It is fine, I see it.

> The only solution I imagined so far was to modify RPS to process packets in a
> kernel thread in RT. But I wonder how that would be different from processing
> them in ksoftirqd.
>
> Any inputs on the issue?

Not sure how to proceed. One thing you could do is a hack similar like
net-Avoid-the-IPI-to-free-the.patch which does it for defer_csd.
On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
remove the warning because we have now commit
d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")

Prior to that, raising a softirq from hardirq would wake ksoftirqd which in
turn would collect all pending softirqs. As a consequence all following
softirqs (networking, …) would run as SCHED_OTHER and compete with
SCHED_OTHER tasks for resources. Not good because the networking work is
no longer processed within the networking interrupt thread. Also not a
DDoS kind of situation where one could want to delay processing.

With that change, this isn't the case anymore. Only an "unrelated" IRQ
thread could pick up the networking work which is less than ideal. That
is because the softirq is added to the global set, ksoftirqd is marked for a
wakeup and could be delayed because other tasks are busy. Then the disk
interrupt (for instance) could pick it up as part of its threaded
interrupt.

Now that I think about it, we could make the backlog pseudo device a
thread. NAPI threading enables one thread but here we would need one
thread per-CPU. So it would remain kind of special. But we would avoid
clobbering the global state and delaying everything to ksoftirqd. Processing
it in ksoftirqd might not be ideal from a performance point of view.
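
Roughly what I have in mind, as a rough and untested sketch (the names and
the hand-off from the RPS IPI / enqueue_to_backlog() into the thread are
made up / left out here); registration would mirror how ksoftirqd is set
up via the smpboot helpers:

#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/smpboot.h>

static DEFINE_PER_CPU(struct task_struct *, backlog_thread);

static int backlog_should_run(unsigned int cpu)
{
        /* e.g. check this CPU's softnet_data backlog/input_pkt_queue */
        return 0;       /* placeholder */
}

static void run_backlog(unsigned int cpu)
{
        /* poll this CPU's backlog NAPI instance here, instead of
         * raising NET_RX_SOFTIRQ from rps_trigger_softirq() */
}

static struct smp_hotplug_thread backlog_threads = {
        .store                  = &backlog_thread,
        .thread_should_run      = backlog_should_run,
        .thread_fn              = run_backlog,
        .thread_comm            = "backlog/%u",
};

static int __init spawn_backlog_threads(void)
{
        return smpboot_register_percpu_thread(&backlog_threads);
}
early_initcall(spawn_backlog_threads);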

> [1] https://elixir.bootlin.com/linux/latest/source/kernel/softirq.c#L306
>
> Cheers,
> Wander

Sebastian

2023-07-03 21:33:26

by Wander Lairson Costa

Subject: Re: Splat in kernel RT while processing incoming network packets

On Mon, Jul 03, 2023 at 04:29:08PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-07-03 09:47:26 [-0300], Wander Lairson Costa wrote:
> > Dear all,
> Hi,
>
> > I am writing to report a splat issue we encountered while running the
> > Real-Time (RT) kernel in conjunction with Network RPS (Receive Packet
> > Steering).
> >
> > During some testing of the RT kernel version 6.4.0 with Network RPS enabled,
> > we observed a splat occurring in the SoftIRQ subsystem. The splat message is as
> > follows:
> >
> > [ 37.168920] ------------[ cut here ]------------
> > [ 37.168925] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:291 do_softirq_post_smp_call_flush+0x2d/0x60
> …
> > [ 37.169060] ---[ end trace 0000000000000000 ]---
> >
> > It comes from [1].
> >
> > The issue lies in the mechanism of RPS to defer network packet processing to
> > other CPUs. It sends an IPI to the target CPU. The registered callback
> > is rps_trigger_softirq, which will raise a softirq, leading to the following
> > scenario:
> >
> > CPU0 CPU1
> > | netif_rx() |
> > | | enqueue_to_backlog(cpu=1) |
> > | | | net_rps_send_ipi() |
> > | | flush_smp_call_function_queue()
> > | | | was_pending = local_softirq_pending()
> > | | | __flush_smp_call_function_queue()
> > | | | rps_trigger_softirq()
> > | | | | __raise_softirq_irqoff()
> > | | | do_softirq_post_smp_call_flush()
> >
> > That has the undesired side effect of raising a softirq in a function call,
> > leading to the aforementioned splat.
>
> correct.
>
> > The kernel version is kernel-ark [1], os-build-rt branch. It is essentially the
> > upstream kernel with the PREEMPT_RT patches, and with RHEL configs. I can
> > provide the .config.
>
> It is fine, I see it.
>
> > The only solution I imagined so far was to modify RPS to process packets in a
> > kernel thread in RT. But I wonder how that would be different from processing
> > them in ksoftirqd.
> >
> > Any inputs on the issue?
>
> Not sure how to proceed. One thing you could do is a hack similar like
> net-Avoid-the-IPI-to-free-the.patch which does it for defer_csd.

At first sight it seems straightforward to implement.

> On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
> remove the warning because we have now commit
> d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")

But I am more in favor of a solution that removes code than one that
adds more :)

>
> Prior to that, raising a softirq from hardirq would wake ksoftirqd which in
> turn would collect all pending softirqs. As a consequence all following
> softirqs (networking, …) would run as SCHED_OTHER and compete with
> SCHED_OTHER tasks for resources. Not good because the networking work is
> no longer processed within the networking interrupt thread. Also not a
> DDoS kind of situation where one could want to delay processing.
>
> With that change, this isn't the case anymore. Only an "unrelated" IRQ
> thread could pick up the networking work which is less than ideal. That
> is because the softirq is added to the global set, ksoftirqd is marked for a
> wakeup and could be delayed because other tasks are busy. Then the disk
> interrupt (for instance) could pick it up as part of its threaded
> interrupt.
>
> Now that I think about it, we could make the backlog pseudo device a
> thread. NAPI threading enables one thread but here we would need one
> thread per-CPU. So it would remain kind of special. But we would avoid
> clobbering the global state and delaying everything to ksoftirqd. Processing
> it in ksoftirqd might not be ideal from a performance point of view.

Before sending this to the ML, I talked to Paolo about using NAPI
thread. He explained that it is implemented per interface. For example,
for this specific case, it happened on the loopback interface, which
doesn't implement NAPI. I am cc'ing him, so he can correct me if I am
saying something wrong.

>
> > [1] https://elixir.bootlin.com/linux/latest/source/kernel/softirq.c#L306
> >
> > Cheers,
> > Wander
>
> Sebastian
>


Subject: Re: Splat in kernel RT while processing incoming network packets

On 2023-07-03 18:15:58 [-0300], Wander Lairson Costa wrote:
> > Not sure how to proceed. One thing you could do is a hack similar like
> > net-Avoid-the-IPI-to-free-the.patch which does it for defer_csd.
>
> At first sight it seems straightforward to implement.
>
> > On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
> > remove the warning because we have now commit
> > d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>
> But I am more in favor of a solution that removes code than one that
> adds more :)

Raising the softirq from anonymous (hardirq context) is not ideal for
the reasons I stated below.

> > Prior to that, raising a softirq from hardirq would wake ksoftirqd which in
> > turn would collect all pending softirqs. As a consequence all following
> > softirqs (networking, …) would run as SCHED_OTHER and compete with
> > SCHED_OTHER tasks for resources. Not good because the networking work is
> > no longer processed within the networking interrupt thread. Also not a
> > DDoS kind of situation where one could want to delay processing.
> >
> > With that change, this isn't the case anymore. Only an "unrelated" IRQ
> > thread could pick up the networking work which is less than ideal. That
> > is because the softirq is added to the global set, ksoftirqd is marked for a
> > wakeup and could be delayed because other tasks are busy. Then the disk
> > interrupt (for instance) could pick it up as part of its threaded
> > interrupt.
> >
> > Now that I think about it, we could make the backlog pseudo device a
> > thread. NAPI threading enables one thread but here we would need one
> > thread per-CPU. So it would remain kind of special. But we would avoid
> > clobbering the global state and delaying everything to ksoftirqd. Processing
> > it in ksoftirqd might not be ideal from a performance point of view.
>
> Before sending this to the ML, I talked to Paolo about using NAPI
> thread. He explained that it is implemented per interface. For example,
> for this specific case, it happened on the loopback interface, which
> doesn't implement NAPI. I am cc'ing him, so he can correct me if I am
> saying something wrong.

It is per NAPI-queue/instance and you could have multiple instances per
interface. However loopback has one and you need per-CPU threads if you
want to RPS your skbs to any CPU.

We could just remove the warning but then your RPS processes the skbs in
SCHED_OTHER. This might not be what you want. Maybe Paolo has a better
idea.

> > > Cheers,
> > > Wander

Sebastian

2023-07-04 11:09:33

by Paolo Abeni

Subject: Re: Splat in kernel RT while processing incoming network packets

On Tue, 2023-07-04 at 12:05 +0200, Sebastian Andrzej Siewior wrote:
> On 2023-07-03 18:15:58 [-0300], Wander Lairson Costa wrote:
> > > Not sure how to proceed. One thing you could do is a hack similar like
> > > net-Avoid-the-IPI-to-free-the.patch which does it for defer_csd.
> >
> > At first sight it seems straightforward to implement.
> >
> > > On the other hand we could drop net-Avoid-the-IPI-to-free-the.patch and
> > > remove the warning because we have now commit
> > > d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
> >
> > But I am more in favor of a solution that removes code than one that
> > adds more :)
>
> Raising the softirq from anonymous (hardirq context) is not ideal for
> the reasons I stated below.
>
> > > Prior to that, raising a softirq from hardirq would wake ksoftirqd which in
> > > turn would collect all pending softirqs. As a consequence all following
> > > softirqs (networking, …) would run as SCHED_OTHER and compete with
> > > SCHED_OTHER tasks for resources. Not good because the networking work is
> > > no longer processed within the networking interrupt thread. Also not a
> > > DDoS kind of situation where one could want to delay processing.
> > >
> > > With that change, this isn't the case anymore. Only an "unrelated" IRQ
> > > thread could pick up the networking work which is less than ideal. That
> > > is because the softirq is added to the global set, ksoftirqd is marked for a
> > > wakeup and could be delayed because other tasks are busy. Then the disk
> > > interrupt (for instance) could pick it up as part of its threaded
> > > interrupt.
> > >
> > > Now that I think about it, we could make the backlog pseudo device a
> > > thread. NAPI threading enables one thread but here we would need one
> > > thread per-CPU. So it would remain kind of special. But we would avoid
> > > clobbering the global state and delaying everything to ksoftirqd. Processing
> > > it in ksoftirqd might not be ideal from a performance point of view.
> >
> > Before sending this to the ML, I talked to Paolo about using NAPI
> > thread. He explained that it is implemented per interface. For example,
> > for this specific case, it happened on the loopback interface, which
> > doesn't implement NAPI. I am cc'ing him, so he can correct me if I am
> > saying something wrong.
>
> It is per NAPI-queue/instance and you could have multiple instances per
> interface. However loopback has one and you need per-CPU threads if you
> want to RPS your skbs to any CPU.

Just to hopefully clarify the networking side of it, napi instances !=
network backlog (used by RPS). The network backlog (RPS) is available
for all the network devices, including the loopback and all the virtual
ones. 

The napi instances (and the threaded mode) are available only on
network device drivers implementing the napi model. The loopback driver
does not implement the napi model, as most virtual devices and even
some H/W NICs (mostly low end ones).

The network backlog can't run in threaded mode: there is no API/sysctl
nor infrastructure for that. The backlog processing threaded mode could
be implemented, even if it should not be completely trivial, and it sounds
a bit weird to me.
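
(To make the distinction concrete: the existing per-device threaded knob,
i.e. /sys/class/net/<dev>/threaded resp. dev_set_threaded(), only covers a
driver's own NAPI instances. The backlog is a per-CPU pseudo NAPI instance
embedded in softnet_data, roughly as below, heavily abridged/approximate:)

/* include/linux/netdevice.h (abridged) */
struct softnet_data {
        struct list_head        poll_list;
        struct sk_buff_head     process_queue;
        /* ... */
        struct sk_buff_head     input_pkt_queue;
        struct napi_struct      backlog;        /* polled by process_backlog() */
};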


Just for the records, I mentioned the following in the bz:

It looks like flush_smp_call_function_queue() has only 2 callers:
migration and do_idle().

What about moving softirq processing from
flush_smp_call_function_queue() into cpu_stopper_thread(), outside the
unpreemptable critical section?
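
For reference, the call site looks roughly like this (kernel/smp.c,
abridged; details may differ between versions):

void flush_smp_call_function_queue(void)
{
        unsigned int was_pending;
        unsigned long flags;

        if (llist_empty(this_cpu_ptr(&call_single_queue)))
                return;

        local_irq_save(flags);
        /* Get the already pending soft interrupts for RT enabled kernels */
        was_pending = local_softirq_pending();
        __flush_smp_call_function_queue(true);
        if (local_softirq_pending())
                do_softirq_post_smp_call_flush(was_pending);
        local_irq_restore(flags);
}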

I *think* (wild guess) the call from do_idle() could be just removed (at
least for RT builds), as according to:

commit b2a02fc43a1f40ef4eb2fb2b06357382608d4d84
Author: Peter Zijlstra <[email protected]>
Date: Tue May 26 18:11:01 2020 +0200

smp: Optimize send_call_function_single_ipi()

is just an optimization.

Cheers,

Paolo


Subject: Re: Splat in kernel RT while processing incoming network packets

On 2023-07-04 12:29:33 [+0200], Paolo Abeni wrote:
> Just to hopefully clarify the networking side of it, napi instances !=
> network backlog (used by RPS). The network backlog (RPS) is available
> for all the network devices, including the loopback and all the virtual
> ones. 

Yes.

> The napi instances (and the threaded mode) are available only on
> network device drivers implementing the napi model. The loopback driver
> does not implement the napi model, as most virtual devices and even
> some H/W NICs (mostly low end ones).

Yes.

> The network backlog can't run in threaded mode: there is no API/sysctl
> nor infrastructure for that. The backlog processing threaded mode could
> be implemented, even if it should not be completely trivial, and it sounds
> a bit weird to me.

Yes, I mean that this needs to be done.

>
> Just for the records, I mentioned the following in the bz:
>
> > It looks like flush_smp_call_function_queue() has only 2 callers:
> > migration and do_idle().
>
> What about moving softirq processing from
> flush_smp_call_function_queue() into cpu_stopper_thread(), outside the
> unpreemptable critical section?

This doesn't solve anything. You schedule softirq from hardirq and from
this moment on you are in "anonymous context" and we solve this by
processing it in ksoftirqd.
For !RT you process it while leaving the hardirq. For RT, we can't.
Processing it in the context of the currently running process (say idle
as in the reported backtrace or another running user task) would lead
to processing network related work that originated somewhere at someone
else's expense. Assume you have a high prio RT task running, not related
to networking at all, and suddenly you throw a bunch of skbs on it.

Therefore it is preferred to process them within the interrupt thread in
which the softirq was raised/ within its origin.

The other problem with ksoftirqd processing is that everything is added
to a global state and then left for ksoftirqd to process. The global
state is considered by every local_bh_enable() instance, so a random
interrupt thread could process it, or even a random task doing a syscall
involving spin_lock_bh().

The NAPI-threads are nice in a way that they don't clobber the global
state.
For RPS we would need either per-CPU threads or to serve this in
ksoftirqd/X. The additional per-CPU thread only makes sense if it runs
at higher priority. However, without the priority it would be no
different from ksoftirqd unless it does only the backlog's work.

puh. I'm undecided here. We might want to throw it into ksoftirqd,
remove the warning. But then this will be processed with other softirqs
(like USB due to a tasklet) and at some point might be picked up by
another interrupt thread.

> Cheers,
>
> Paolo

Sebastian

2023-07-05 16:21:07

by Wander Lairson Costa

Subject: Re: Splat in kernel RT while processing incoming network packets

On Tue, Jul 04, 2023 at 04:47:49PM +0200, Sebastian Andrzej Siewior wrote:
> On 2023-07-04 12:29:33 [+0200], Paolo Abeni wrote:
> > Just to hopefully clarify the networking side of it, napi instances !=
> > network backlog (used by RPS). The network backlog (RPS) is available
> > for all the network devices, including the loopback and all the virtual
> > ones.
>
> Yes.
>
> > The napi instances (and the threaded mode) are available only on
> > network device drivers implementing the napi model. The loopback driver
> > does not implement the napi model, as most virtual devices and even
> > some H/W NICs (mostly low end ones).
>
> Yes.
>
> > The network backlog can't run in threaded mode: there is no API/sysctl
> > nor infrastructure for that. The backlog processing threaded mode could
> > be implemented, even if it should not be completely trivial, and it sounds
> > a bit weird to me.
>
> Yes, I mean that this needs to be done.
>
> >
> > Just for the records, I mentioned the following in the bz:
> >
> > It looks like flush_smp_call_function_queue() has only 2 callers:
> > migration and do_idle().
> >
> > What about moving softirq processing from
> > flush_smp_call_function_queue() into cpu_stopper_thread(), outside the
> > unpreemptable critical section?
>
> This doesn't solve anything. You schedule softirq from hardirq and from
> this moment on you are in "anonymous context" and we solve this by
> processing it in ksoftirqd.
> For !RT you process it while leaving the hardirq. For RT, we can't.
> Processing it in the context of the currently running process (say idle
> as in the reported backtrace or another running user task) would lead
> to processing network related work that originated somewhere at someone
> else's expense. Assume you have a high prio RT task running, not related
> to networking at all, and suddenly you throw a bunch of skbs on it.
>
> Therefore it is preferred to process them within the interrupt thread in
> which the softirq was raised/ within its origin.
>
> The other problem with ksoftirqd processing is that everything is added
> to a global state and then left for ksoftirqd to process. The global
> state is considered by every local_bh_enable() instance, so a random
> interrupt thread could process it, or even a random task doing a syscall
> involving spin_lock_bh().
>
> The NAPI-threads are nice in a way that they don't clobber the global
> state.
> For RPS we would need either per-CPU threads or to serve this in
> ksoftirqd/X. The additional per-CPU thread only makes sense if it runs
> at higher priority. However, without the priority it would be no
> different from ksoftirqd unless it does only the backlog's work.
>
> puh. I'm undecided here. We might want to throw it into ksoftirqd,
> remove the warning. But then this will be processed with other softirqs
> (like USB due to a tasklet) and at some point might be picked up by
> another interrupt thread.
>

Maybe, under RT, some softirqs should run in the context of the "target"
process. For NET_RX, for example, the softirqs would run in the context
of the packet recipient process. Each task_struct would have a list of
pending softirqs, which would be checked at a few points, like on scheduling,
when the process enters the kernel, on softirq raise, etc. The default
target process would be ksoftirqd. Does this idea make sense?
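
Purely to illustrate the idea, a very rough sketch of the data structure
side (nothing like this exists today, all names below are invented):

/* Hypothetical sketch, not existing kernel code. */

/* in include/linux/sched.h, an extra field in task_struct: */
        unsigned long           pending_softirqs;       /* per-task softirq bitmask */

/* raise path: target the recipient task instead of the per-CPU mask;
 * fall back to ksoftirqd when there is no obvious recipient task. */
static inline void raise_softirq_on_task(struct task_struct *t, unsigned int nr)
{
        set_bit(nr, &t->pending_softirqs);
        /* would be checked on schedule(), on kernel entry/exit and in
         * local_bh_enable(); default/fallback target stays ksoftirqd */
}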

> > Cheers,
> >
> > Paolo
>
> Sebastian
>


Subject: Re: Splat in kernel RT while processing incoming network packets

On 2023-07-05 12:59:28 [-0300], Wander Lairson Costa wrote:
> Maybe, under RT, some softirqs should run in the context of the "target"
> process. For NET_RX, for example, the softirqs would run in the context
> of the packet recipient process. Each task_struct would have a list of
> pending softirqs, which would be checked at a few points, like on scheduling,
> when the process enters the kernel, on softirq raise, etc. The default
> target process would be ksoftirqd. Does this idea make sense?

We had something similar. The softirq runs in the context of the task
that raised it. So the networking driver raised NET_RX and it was
processed in its context (and still is). The only difference now is that
we no longer have a task-based "raised bit" but a per-CPU one.

For RPS you have already pulled the skb from the NIC; you need to process
it, and this isn't handled in the task's context but on a specific CPU.

Let me look at a per-CPU backlog thread or ripping the warning out…

Sebastian