2019-10-09 06:47:06

by Jonas Bonn

Subject: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi,

The lockless pfifo_fast qdisc has an issue with packets getting stuck in
the queue. What appears to happen is:

i) Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
ii) Thread 1 dequeues the last packet in the queue.
iii) Thread 1 iterates through the qdisc->dequeue function again and
determines that the queue is empty.

iv) Thread 2 queues up a packet. Since 'seqlock' is busy, it just
assumes the packet will be dequeued by whoever is holding the lock.

v) Thread 1 releases 'seqlock'.

After v), nobody will check if there are packets in the queue until a
new packet is enqueued. Thereby, the packet enqueued by Thread 2 may be
delayed indefinitely.
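
For reference, the relevant paths look roughly like this (paraphrased from a
5.3-era tree and heavily simplified; the bypass path, the DEACTIVATED check
and the seqcount bookkeeping are omitted).  Step iv) above is the silent
early return from qdisc_run():

/* net/core/dev.c, __dev_xmit_skb(), NOLOCK branch - simplified sketch */
	if (q->flags & TCQ_F_NOLOCK) {
		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
		qdisc_run(q);		/* may silently do nothing, see below */
		...
	}

/* include/net/pkt_sched.h - simplified sketch */
static inline void qdisc_run(struct Qdisc *q)
{
	if (qdisc_run_begin(q)) {	/* fails if someone holds 'seqlock' */
		__qdisc_run(q);
		qdisc_run_end(q);
	}
	/* on failure we just return, trusting the lock holder to dequeue
	 * our packet - which it may already have decided not to do
	 */
}

/* include/net/sch_generic.h - simplified sketch */
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	if (qdisc->flags & TCQ_F_NOLOCK) {
		if (!spin_trylock(&qdisc->seqlock))
			return false;
		WRITE_ONCE(qdisc->empty, false);
	} else if (qdisc_is_running(qdisc)) {
		return false;
	}
	/* ... seqcount bookkeeping trimmed ... */
	return true;
}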

What I think should probably happen is that Thread 1 should re-check
whether the queue is empty after releasing 'seqlock'. I poked at this,
but it wasn't obvious to me how to go about it given the way the
layering works here. Roughly:

qdisc_run_end() {
	...
	spin_unlock(seqlock);
	if (!qdisc_is_empty(qdisc))
		qdisc_run();
	...
}

Calling qdisc_run() from qdisc_run_end() doesn't feel right!

There's a qdisc->empty property (and qdisc_is_empty() relies on it), but
it's not particularly useful in this case: there's a race in setting this
property, which makes it unreliable.
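
For completeness: 'empty' is only cleared by whoever wins the seqlock
(qdisc_run_begin() in the sketch above) and only set back to true from the
dequeue side, roughly:

/* tail of pfifo_fast_dequeue() - rough sketch, not the literal source */
	if (likely(skb))
		qdisc_update_stats_at_dequeue(qdisc, skb);
	else
		qdisc->empty = true;	/* the only place it flips back to true */

	return skb;

So a producer that loses the trylock can observe a stale value in either
direction, which is why I don't think qdisc_is_empty() can be trusted at
that moment.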

Hope someone can shine some light on how to proceed here.

/Jonas


2019-10-09 19:15:51

by Paolo Abeni

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Wed, 2019-10-09 at 08:46 +0200, Jonas Bonn wrote:
> Hi,
>
> The lockless pfifo_fast qdisc has an issue with packets getting stuck in
> the queue. What appears to happen is:
>
> i) Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
> ii) Thread 1 dequeues the last packet in the queue.
> iii) Thread 1 iterates through the qdisc->dequeue function again and
> determines that the queue is empty.
>
> iv) Thread 2 queues up a packet. Since 'seqlock' is busy, it just
> assumes the packet will be dequeued by whoever is holding the lock.
>
> v) Thread 1 releases 'seqlock'.
>
> After v), nobody will check if there are packets in the queue until a
> new packet is enqueued. Thereby, the packet enqueued by Thread 2 may be
> delayed indefinitely.

I think you are right.

It looks like this possible race has been present since the initial lockless
implementation - commit 6b3ba9146fe6 ("net: sched: allow qdiscs to
handle locking")

Anyhow the racing window looks quite tiny - I never observed that
issue in my tests. Do you have a working reproducer?

Something like the following code - completely untested - can possibly
address the issue, but it's a bit rough and I would prefer not adding
additional complexity to the lockless qdiscs. Can you please give it a
spin?

Thanks,

Paolo
---
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 6a70845bd9ab..65a1c03330d6 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 		     struct net_device *dev, struct netdev_queue *txq,
 		     spinlock_t *root_lock, bool validate);
 
-void __qdisc_run(struct Qdisc *q);
+int __qdisc_run(struct Qdisc *q);
 
 static inline void qdisc_run(struct Qdisc *q)
 {
+	int quota = 0;
+
 	if (qdisc_run_begin(q)) {
 		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
 		 * to avoid racing with dev_qdisc_reset()
 		 */
 		if (!(q->flags & TCQ_F_NOLOCK) ||
 		    likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
-			__qdisc_run(q);
+			quota = __qdisc_run(q);
 		qdisc_run_end(q);
+
+		if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
+			__netif_schedule(q);
 	}
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 17bd8f539bc7..013480f6a794 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -376,7 +376,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
 }
 
-void __qdisc_run(struct Qdisc *q)
+int __qdisc_run(struct Qdisc *q)
 {
 	int quota = dev_tx_weight;
 	int packets;
@@ -390,9 +390,10 @@ void __qdisc_run(struct Qdisc *q)
 		quota -= packets;
 		if (quota <= 0 || need_resched()) {
 			__netif_schedule(q);
-			break;
+			return 0;
 		}
 	}
+	return quota;
 }
 
 unsigned long dev_trans_start(struct net_device *dev)

2019-10-10 06:28:34

by Jonas Bonn

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Paolo,

On 09/10/2019 21:14, Paolo Abeni wrote:
> On Wed, 2019-10-09 at 08:46 +0200, Jonas Bonn wrote:
>> Hi,
>>
>> The lockless pfifo_fast qdisc has an issue with packets getting stuck in
>> the queue. What appears to happen is:
>>
>> i) Thread 1 holds the 'seqlock' on the qdisc and dequeues packets.
>> ii) Thread 1 dequeues the last packet in the queue.
>> iii) Thread 1 iterates through the qdisc->dequeue function again and
>> determines that the queue is empty.
>>
>> iv) Thread 2 queues up a packet. Since 'seqlock' is busy, it just
>> assumes the packet will be dequeued by whoever is holding the lock.
>>
>> v) Thread 1 releases 'seqlock'.
>>
>> After v), nobody will check if there are packets in the queue until a
>> new packet is enqueued. Thereby, the packet enqueued by Thread 2 may be
>> delayed indefinitely.
>
> I think you are right.
>
> It looks like this possible race has been present since the initial lockless
> implementation - commit 6b3ba9146fe6 ("net: sched: allow qdiscs to
> handle locking")
>
> Anyhow the racing window looks quite tiny - I never observed that
> issue in my tests. Do you have a working reproducer?

Yes, it's reliably reproducible. We do network latency measurements and
see latency spikes for the packets that get stuck in the queue.

>
> Something like the following code - completely untested - can possibly
> address the issue, but it's a bit rough and I would prefer not adding
> additional complexity to the lockless qdiscs. Can you please give it a
> spin?

Your change looks reasonable. I'll give it a try.


>
> Thanks,
>
> Paolo
> ---
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..65a1c03330d6 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
> struct net_device *dev, struct netdev_queue *txq,
> spinlock_t *root_lock, bool validate);
>
> -void __qdisc_run(struct Qdisc *q);
> +int __qdisc_run(struct Qdisc *q);
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> + int quota = 0;
> +
> if (qdisc_run_begin(q)) {
> /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> * to avoid racing with dev_qdisc_reset()
> */
> if (!(q->flags & TCQ_F_NOLOCK) ||
> likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> - __qdisc_run(q);
> + quota = __qdisc_run(q);
> qdisc_run_end(q);
> +
> + if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
> + __netif_schedule(q);

Not sure this is relevant, but there's a subtle difference in the way
that the underlying ptr_ring peeks at the queue head and checks whether
the queue is empty.

For peek it's:

READ_ONCE(r->queue[r->consumer_head]);

For is_empty it's:

!r->queue[READ_ONCE(r->consumer_head)];

The placement of the READ_ONCE differs between the two. I can't get my head around
whether this difference is significant or not. If it is, then perhaps
an is_empty() method is needed on the qdisc_ops...???
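
For reference, the two helpers in full, paraphrased from include/linux/ptr_ring.h:

static inline void *__ptr_ring_peek(struct ptr_ring *r)
{
	if (likely(r->size))
		return READ_ONCE(r->queue[r->consumer_head]);
	return NULL;
}

static inline bool __ptr_ring_empty(struct ptr_ring *r)
{
	if (likely(r->size))
		return !r->queue[READ_ONCE(r->consumer_head)];
	return true;
}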

/Jonas

2019-10-11 00:41:21

by Jonas Bonn

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Paolo,

On 09/10/2019 21:14, Paolo Abeni wrote:
> Something like the following code - completely untested - can possibly
> address the issue, but it's a bit rough and I would prefer not adding
> additional complexity to the lockless qdiscs. Can you please give it a
> spin?

We've tested a couple of variants of this patch today, but unfortunately
it doesn't fix the problem of packets getting stuck in the queue.

A couple of comments:

i) On 5.4, there is the BYPASS path that also needs the same treatment,
as it's essentially replicating the behaviour of qdisc_run, just without
the queue/dequeue steps.

ii) We are working a lot with the 4.19 kernel, so I backported the patch
to that version and tested there. Here the solution would seem to be more
robust, as the BYPASS path does not exist.

Unfortunately, in both cases we continue to see the issue of the "last
packet" getting stuck in the queue.

/Jonas


>
> Thanks,
>
> Paolo
> ---
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..65a1c03330d6 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -113,18 +113,23 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
> struct net_device *dev, struct netdev_queue *txq,
> spinlock_t *root_lock, bool validate);
>
> -void __qdisc_run(struct Qdisc *q);
> +int __qdisc_run(struct Qdisc *q);
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> + int quota = 0;
> +
> if (qdisc_run_begin(q)) {
> /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> * to avoid racing with dev_qdisc_reset()
> */
> if (!(q->flags & TCQ_F_NOLOCK) ||
> likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> - __qdisc_run(q);
> + quota = __qdisc_run(q);
> qdisc_run_end(q);
> +
> + if (quota > 0 && q->flags & TCQ_F_NOLOCK && q->ops->peek(q))
> + __netif_schedule(q);
> }
> }
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 17bd8f539bc7..013480f6a794 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -376,7 +376,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
> }
>
> -void __qdisc_run(struct Qdisc *q)
> +int __qdisc_run(struct Qdisc *q)
> {
> int quota = dev_tx_weight;
> int packets;
> @@ -390,9 +390,10 @@ void __qdisc_run(struct Qdisc *q)
> quota -= packets;
> if (quota <= 0 || need_resched()) {
> __netif_schedule(q);
> - break;
> + return 0;
> }
> }
> + return quota;
> }
>
> unsigned long dev_trans_start(struct net_device *dev)
>

2020-06-23 14:13:03

by Zhivich, Michael

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

> From: Jonas Bonn <[email protected]>
> To: Paolo Abeni <[email protected]>,
> "[email protected]" <[email protected]>,
> LKML <[email protected]>,
> "David S . Miller" <[email protected]>,
> John Fastabend <[email protected]>
> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
> Date: Fri, 11 Oct 2019 02:39:48 +0200
> Message-ID: <[email protected]> (raw)
> In-Reply-To: <[email protected]>
>
> Hi Paolo,
>
> On 09/10/2019 21:14, Paolo Abeni wrote:
> > Something alike the following code - completely untested - can possibly
> > address the issue, but it's a bit rough and I would prefer not adding
> > additonal complexity to the lockless qdiscs, can you please have a spin
> > a it?
>
> We've tested a couple of variants of this patch today, but unfortunately
> it doesn't fix the problem of packets getting stuck in the queue.
>
> A couple of comments:
>
> i) On 5.4, there is the BYPASS path that also needs the same treatment
> as it's essentially replicating the behaviour of qdisc_run, just without
> the queue/dequeue steps
>
> ii) We are working a lot with the 4.19 kernel, so I backported the
> patch to that version and tested there. Here the solution would seem to
> be more robust as the BYPASS path does not exist.
>
> Unfortunately, in both cases we continue to see the issue of the "last
> packet" getting stuck in the queue.
>
> /Jonas

Hello Jonas, Paolo,

We have observed the same problem with pfifo_fast qdisc when sending periodic small
packets on a TCP flow with multiple simultaneous connections on a 4.19.75
kernel. We've been able to catch it in action using perf probes (see trace
below). For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
it takes 200270us to traverse the networking stack on a system that's not otherwise busy.
qdisc only resumes processing when another enqueued packet comes in,
so the packet could have been stuck indefinitely.

proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
proc-19927 19927 [014] 580644.045480: probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0 sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
proc-19927 19927 [014] 580644.045481: probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80 sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
proc-19902 19902 [032] 580644.045481: probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
proc-19927 19927 [014] 580644.045481: net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
proc-19902 19902 [032] 580644.045482: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=1
proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=1
proc-19902 19902 [032] 580644.045484: probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=2
proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=0
proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
proc-19927 19927 [014] 580644.245749: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
proc-19927 19927 [014] 580644.245749: probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
proc-19927 19927 [014] 580644.245750: net:net_dev_start_xmit: dev=eth0 queue_mapping=14 skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1

I was wondering if you had any more luck in finding a solution or workaround for this problem
(that is, aside from switching to a different qdisc)?

Thanks,
~ Michael

2020-06-30 21:09:59

by Josh Hunt

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 6/23/20 6:42 AM, Michael Zhivich wrote:
>> From: Jonas Bonn <[email protected]>
>> To: Paolo Abeni <[email protected]>,
>> "[email protected]" <[email protected]>,
>> LKML <[email protected]>,
>> "David S . Miller" <[email protected]>,
>> John Fastabend <[email protected]>
>> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
>> Date: Fri, 11 Oct 2019 02:39:48 +0200
>> Message-ID: <[email protected]> (raw)
>> In-Reply-To: <[email protected]>
>>
>> Hi Paolo,
>>
>> On 09/10/2019 21:14, Paolo Abeni wrote:
>>> Something alike the following code - completely untested - can possibly
>>> address the issue, but it's a bit rough and I would prefer not adding
>>> additonal complexity to the lockless qdiscs, can you please have a spin
>>> a it?
>>
>> We've tested a couple of variants of this patch today, but unfortunately
>> it doesn't fix the problem of packets getting stuck in the queue.
>>
>> A couple of comments:
>>
>> i) On 5.4, there is the BYPASS path that also needs the same treatment
>> as it's essentially replicating the behavour of qdisc_run, just without
>> the queue/dequeue steps
>>
>> ii) We are working a lot with the 4.19 kernel so I backported to the
>> patch to this version and tested there. Here the solution would seem to
>> be more robust as the BYPASS path does not exist.
>>
>> Unfortunately, in both cases we continue to see the issue of the "last
>> packet" getting stuck in the queue.
>>
>> /Jonas
>
> Hello Jonas, Paolo,
>
> We have observed the same problem with pfifo_fast qdisc when sending periodic small
> packets on a TCP flow with multiple simultaneous connections on a 4.19.75
> kernel. We've been able to catch it in action using perf probes (see trace
> below). For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
> it takes 200270us to traverse the networking stack on a system that's not otherwise busy.
> qdisc only resumes processing when another enqueued packet comes in,
> so the packet could have been stuck indefinitely.
>
> proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
> proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
> proc-19927 19927 [014] 580644.045480: probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0 sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
> proc-19902 19902 [032] 580644.045480: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
> proc-19927 19927 [014] 580644.045481: probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80 sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
> proc-19902 19902 [032] 580644.045481: probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
> proc-19927 19927 [014] 580644.045481: net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
> proc-19902 19902 [032] 580644.045482: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=1
> proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
> proc-19902 19902 [032] 580644.045483: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
> proc-19927 19927 [014] 580644.045483: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=1
> proc-19902 19902 [032] 580644.045484: probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
> proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=18446622925407304000
> proc-19927 19927 [014] 580644.245745: probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f) skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00 to_free=0xffff91d0f67ab940 atomic_qlen=2
> proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=0
> proc-19927 19927 [014] 580644.245746: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
> proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
> proc-19927 19927 [014] 580644.245747: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
> proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00 skb=0xffffffff9b69d8c0 band=2
> proc-19927 19927 [014] 580644.245748: probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d) qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
> proc-19927 19927 [014] 580644.245749: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
> proc-19927 19927 [014] 580644.245749: probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0 q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180 root_lock=0x0 validate=1 ret=-1 again=155
> proc-19927 19927 [014] 580644.245750: net:net_dev_start_xmit: dev=eth0 queue_mapping=14 skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000 vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0 network_offset=14 transport_offset_valid=1 transport_offset=34 tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1
>
> I was wondering if you had any more luck in finding a solution or workaround for this problem
> (that is, aside from switching to a different qdisc)?
>
> Thanks,
> ~ Michael
>

Jonas/Paolo

Do either of you know if there's been any development on a fix for this
issue? If not we can propose something.

Thanks
Josh

2020-07-01 07:54:04

by Jonas Bonn

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc



On 30/06/2020 21:14, Josh Hunt wrote:
> On 6/23/20 6:42 AM, Michael Zhivich wrote:
>>> From: Jonas Bonn <[email protected]>
>>> To: Paolo Abeni <[email protected]>,
>>>     "[email protected]" <[email protected]>,
>>>     LKML <[email protected]>,
>>>     "David S . Miller" <[email protected]>,
>>>     John Fastabend <[email protected]>
>>> Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
>>> Date: Fri, 11 Oct 2019 02:39:48 +0200
>>> Message-ID: <[email protected]> (raw)
>>> In-Reply-To: <[email protected]>
>>>
>>> Hi Paolo,
>>>
>>> On 09/10/2019 21:14, Paolo Abeni wrote:
>>>> Something alike the following code - completely untested - can possibly
>>>> address the issue, but it's a bit rough and I would prefer not adding
>>>> additonal complexity to the lockless qdiscs, can you please have a spin
>>>> a it?
>>>
>>> We've tested a couple of variants of this patch today, but unfortunately
>>> it doesn't fix the problem of packets getting stuck in the queue.
>>>
>>> A couple of comments:
>>>
>>> i) On 5.4, there is the BYPASS path that also needs the same treatment
>>> as it's essentially replicating the behavour of qdisc_run, just without
>>> the queue/dequeue steps
>>>
>>> ii)  We are working a lot with the 4.19 kernel so I backported to the
>>> patch to this version and tested there.  Here the solution would seem to
>>> be more robust as the BYPASS path does not exist.
>>>
>>> Unfortunately, in both cases we continue to see the issue of the "last
>>> packet" getting stuck in the queue.
>>>
>>> /Jonas
>>
>> Hello Jonas, Paolo,
>>
>> We have observed the same problem with pfifo_fast qdisc when sending
>> periodic small
>> packets on a TCP flow with multiple simultaneous connections on a 4.19.75
>> kernel.  We've been able to catch it in action using perf probes (see
>> trace
>> below).  For qdisc = 0xffff900d7c247c00, skb = 0xffff900b72c334f0,
>> it takes 200270us to traverse the networking stack on a system that's
>> not otherwise busy.
>> qdisc only resumes processing when another enqueued packet comes in,
>> so the packet could have been stuck indefinitely.
>>
>>     proc-19902 19902 [032] 580644.045480:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0xffff900bfc294af0 band=2 atomic_qlen=0
>>     proc-19902 19902 [032] 580644.045480:
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.045480:
>> probe:tcp_transmit_skb2: (ffffffff9b6dc4e5) skb=0xffff900b72c334f0
>> sk=0xffff900d62958040 source=0x4b4e dest=0x9abe
>>     proc-19902 19902 [032] 580644.045480:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.045481:
>> probe:ip_finish_output2: (ffffffff9b6bc650) net=0xffffffff9c107c80
>> sk=0xffff900d62958040 skb=0xffff900b72c334f0 __func__=0x0
>>     proc-19902 19902 [032] 580644.045481:
>> probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900bfc294af0
>> q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180
>> root_lock=0x0 validate=1 ret=-1 again=155
>>     proc-19927 19927 [014] 580644.045481:
>> net:net_dev_queue: dev=eth0 skbaddr=0xffff900b72c334f0 len=115
>>     proc-19902 19902 [032] 580644.045482:
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00
>> skb=0xffffffff9b69d8c0 band=1
>>     proc-19927 19927 [014] 580644.045483:
>> probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900b72c334f0
>> qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>>     proc-19902 19902 [032] 580644.045483:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.045483:
>> probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f)
>> skb=0xffff900b72c334f0 qdisc=0xffff900d7c247c00
>> to_free=0xffff91d0f67ab940 atomic_qlen=1
>>     proc-19902 19902 [032] 580644.045484:
>> probe:__qdisc_run_2: (ffffffff9b69ea5a) q=0xffff900d7c247c00 packets=1
>>     proc-19927 19927 [014] 580644.245745:
>> probe:pfifo_fast_enqueue: (ffffffff9b69d9f0) skb=0xffff900d98fdf6f0
>> qdisc=0xffff900d7c247c00 to_free=18446622925407304000
>>     proc-19927 19927 [014] 580644.245745:
>> probe:pfifo_fast_enqueue_end: (ffffffff9b69da9f)
>> skb=0xffff900d98fdf6f0 qdisc=0xffff900d7c247c00
>> to_free=0xffff91d0f67ab940 atomic_qlen=2
>>     proc-19927 19927 [014] 580644.245746:
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00
>> skb=0xffffffff9b69d8c0 band=0
>>     proc-19927 19927 [014] 580644.245746:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0xffff900b72c334f0 band=2 atomic_qlen=1
>>     proc-19927 19927 [014] 580644.245747:
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.245747:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0xffff900d98fdf6f0 band=2 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.245748:
>> probe:pfifo_fast_dequeue: (ffffffff9b69d8c0) qdisc=0xffff900d7c247c00
>> skb=0xffffffff9b69d8c0 band=2
>>     proc-19927 19927 [014] 580644.245748:
>> probe:pfifo_fast_dequeue_end: (ffffffff9b69d99d)
>> qdisc=0xffff900d7c247c00 skb=0x0 band=3 atomic_qlen=0
>>     proc-19927 19927 [014] 580644.245749:
>> qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x0 parent=0xF
>> txq_state=0x0 packets=2 skbaddr=0xffff900b72c334f0
>>     proc-19927 19927 [014] 580644.245749:
>> probe:sch_direct_xmit: (ffffffff9b69e570) skb=0xffff900b72c334f0
>> q=0xffff900d7c247c00 dev=0xffff900d6a140000 txq=0xffff900d6a181180
>> root_lock=0x0 validate=1 ret=-1 again=155
>>     proc-19927 19927 [014] 580644.245750:
>> net:net_dev_start_xmit: dev=eth0 queue_mapping=14
>> skbaddr=0xffff900b72c334f0 vlan_tagged=0 vlan_proto=0x0000
>> vlan_tci=0x0000 protocol=0x0800 ip_summed=3 len=115 data_len=0
>> network_offset=14 transport_offset_valid=1 transport_offset=34
>> tx_flags=0 gso_size=0 gso_segs=1 gso_type=0x1
>>
>> I was wondering if you had any more luck in finding a solution or
>> workaround for this problem
>> (that is, aside from switching to a different qdisc)?
>>
>> Thanks,
>> ~ Michael
>>
>
> Jonas/Paolo
>
> Do either of you know if there's been any development on a fix for this
> issue? If not we can propose something.

Hi Josh,

No, I haven't been able to do any more work on this and the affected
user switched qdisc (to avoid this problem) so I lost the reliable
reproducer that I had...

/Jonas

>
> Thanks
> Josh

2020-07-01 16:08:39

by Cong Wang

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> Do either of you know if there's been any development on a fix for this
> issue? If not we can propose something.

If you have a reproducer, I can look into this.

Thanks.

2020-07-01 20:02:22

by Cong Wang

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> > Do either of you know if there's been any development on a fix for this
> > issue? If not we can propose something.
>
> If you have a reproducer, I can look into this.

Does the attached patch fix this bug completely?

Thanks.


Attachments:
qdisc_run.diff (1.17 kB)

2020-07-01 22:29:31

by Josh Hunt

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc


On 7/1/20 12:58 PM, Cong Wang wrote:
> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
>>
>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
>>> Do either of you know if there's been any development on a fix for this
>>> issue? If not we can propose something.
>>
>> If you have a reproducer, I can look into this.
>
> Does the attached patch fix this bug completely?
>
> Thanks.
>

Hey Cong

Unfortunately we don't have a reproducer that would be easy to share;
however, your patch makes sense to me at a high level. We will test and
let you know if it fixes our problem.

Thanks!
Josh

2020-07-02 06:15:12

by Jonas Bonn

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Cong,

On 01/07/2020 21:58, Cong Wang wrote:
> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
>>
>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
>>> Do either of you know if there's been any development on a fix for this
>>> issue? If not we can propose something.
>>
>> If you have a reproducer, I can look into this.
>
> Does the attached patch fix this bug completely?

It's easier to comment if you inline the patch, but after taking a quick
look it seems too simplistic.

i) Are you sure you haven't got the return values on qdisc_run reversed?
ii) There's a "bypass" path that skips the enqueue/dequeue operation if
the queue is empty; that needs a similar treatment: after releasing
seqlock it needs to ensure that another packet hasn't been enqueued
since it last checked.

/Jonas

>
> Thanks.
>

2020-07-02 09:48:32

by Paolo Abeni

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi all,

On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> Hi Cong,
>
> On 01/07/2020 21:58, Cong Wang wrote:
> > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
> > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> > > > Do either of you know if there's been any development on a fix for this
> > > > issue? If not we can propose something.
> > >
> > > If you have a reproducer, I can look into this.
> >
> > Does the attached patch fix this bug completely?
>
> It's easier to comment if you inline the patch, but after taking a quick
> look it seems too simplistic.
>
> i) Are you sure you haven't got the return values on qdisc_run reversed?

qdisc_run() returns true if it was able to acquire the seq lock. We
need to take special action in the opposite case, so Cong's patch LGTM
from a functional PoV.
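
Presumably the attached qdisc_run.diff also changes qdisc_run() to report
that, along these lines (a reconstruction of the idea, not the literal
attachment):

static inline bool qdisc_run(struct Qdisc *q)
{
	if (qdisc_run_begin(q)) {
		/* ... existing DEACTIVATED check + __qdisc_run(q) ... */
		qdisc_run_end(q);
		return true;	/* we owned the seqlock and drained the queue */
	}
	return false;		/* seqlock busy: the caller must not assume its
				 * freshly enqueued packet will be looked at
				 */
}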

> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> the queue is empty; that needs a similar treatment: after releasing
> seqlock it needs to ensure that another packet hasn't been enqueued
> since it last checked.

That has been reverted with
commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323

---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 90b59fc50dc9..c7e48356132a 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>
> if (q->flags & TCQ_F_NOLOCK) {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> - qdisc_run(q);
> + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> + __netif_schedule(q);

I fear the __netif_schedule() call may cause performance regression to
the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
collect some data.

Thanks!

Paolo

2020-07-02 18:13:25

by Josh Hunt

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 7/2/20 2:45 AM, Paolo Abeni wrote:
> Hi all,
>
> On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
>> Hi Cong,
>>
>> On 01/07/2020 21:58, Cong Wang wrote:
>>> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
>>>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
>>>>> Do either of you know if there's been any development on a fix for this
>>>>> issue? If not we can propose something.
>>>>
>>>> If you have a reproducer, I can look into this.
>>>
>>> Does the attached patch fix this bug completely?
>>
>> It's easier to comment if you inline the patch, but after taking a quick
>> look it seems too simplistic.
>>
>> i) Are you sure you haven't got the return values on qdisc_run reversed?
>
> qdisc_run() returns true if it was able to acquire the seq lock. We
> need to take special action in the opposite case, so Cong's patch LGTM
> from a functional PoV.
>
>> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
>> the queue is empty; that needs a similar treatment: after releasing
>> seqlock it needs to ensure that another packet hasn't been enqueued
>> since it last checked.
>
> That has been reverted with
> commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
>
> ---
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 90b59fc50dc9..c7e48356132a 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>
>> if (q->flags & TCQ_F_NOLOCK) {
>> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
>> - qdisc_run(q);
>> + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
>> + __netif_schedule(q);
>
> I fear the __netif_schedule() call may cause performance regression to
> the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> collect some data.

Initial results with Cong's patch look promising, so far no stalls. We
will let it run over the long weekend and report back on Tuesday.

Paolo - I have concerns about possible performance regression with the
change as well. If you can gather some data that would be great. If
things look good with our low throughput test over the weekend we can
also try assessing performance next week.

Josh

2020-07-07 14:21:56

by Paolo Abeni

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, 2020-07-02 at 11:08 -0700, Josh Hunt wrote:
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> >
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> > > Hi Cong,
> > >
> > > On 01/07/2020 21:58, Cong Wang wrote:
> > > > On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
> > > > > On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> > > > > > Do either of you know if there's been any development on a fix for this
> > > > > > issue? If not we can propose something.
> > > > >
> > > > > If you have a reproducer, I can look into this.
> > > >
> > > > Does the attached patch fix this bug completely?
> > >
> > > It's easier to comment if you inline the patch, but after taking a quick
> > > look it seems too simplistic.
> > >
> > > i) Are you sure you haven't got the return values on qdisc_run reversed?
> >
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> >
> > > ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> > > the queue is empty; that needs a similar treatment: after releasing
> > > seqlock it needs to ensure that another packet hasn't been enqueued
> > > since it last checked.
> >
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> >
> > ---
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 90b59fc50dc9..c7e48356132a 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > >
> > > if (q->flags & TCQ_F_NOLOCK) {
> > > rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > > - qdisc_run(q);
> > > + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> > > + __netif_schedule(q);
> >
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
>
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great.

I finally had the time to run some performance tests vs the above, with
mixed results.

Using several netperf threads over a single pfifo_fast queue with small
UDP packets, perf differences vs vanilla are just above the noise range
(1-1.5%).

Using pktgen in 'queue_xmit' mode on a dummy device (this should
maximise the pkt-rate and thus the contention) I see:

pktgen threads    vanilla    patched    delta
            nr       kpps       kpps        %

             1       3240       3240        0
             2       3910       2710    -30.5
             4       5140       4920       -4

A relevant source of the measured overhead is the contention on
q->state in __netif_schedule(), so the following helps a bit:

---
diff --git a/net/core/dev.c b/net/core/dev.c
index b8e8286a0a34..3cad6e086fac 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3750,7 +3750,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, NULL, &to_free) & NET_XMIT_MASK;
-		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS &&
+		    !test_bit(__QDISC_STATE_SCHED, &q->state))
 			__netif_schedule(q);
 
 		if (unlikely(to_free))
---

With the above incremental patch applied I see:
pktgen threads    vanilla    patched[II]    delta
            nr       kpps           kpps        %
             1       3240           3240        0
             2       3910           2830     -27%
             4       5140           5140        0

So the regression with 2 pktgen threads is still relevant. 'perf' shows a
significant amount of time spent in net_tx_action() and __netif_schedule().

Cheers,

Paolo.

2020-07-08 20:17:41

by Cong Wang

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni <[email protected]> wrote:
> So the regression with 2 pktgen threads is still relevant. 'perf' shows a
> significant amount of time spent in net_tx_action() and __netif_schedule().

So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
not a good idea.

Let me see if there is any other way to fix this.

Thanks.

2020-07-08 21:00:31

by Zhivich, Michael

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 7/2/20, 2:08 PM, "Josh Hunt" <[email protected]> wrote:
>
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> >
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> >> Hi Cong,
> >>
> >> On 01/07/2020 21:58, Cong Wang wrote:
> >>> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
> >>>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> >>>>> Do either of you know if there's been any development on a fix for this
> >>>>> issue? If not we can propose something.
> >>>>
> >>>> If you have a reproducer, I can look into this.
> >>>
> >>> Does the attached patch fix this bug completely?
> >>
> >> It's easier to comment if you inline the patch, but after taking a quick
> >> look it seems too simplistic.
> >>
> >> i) Are you sure you haven't got the return values on qdisc_run reversed?
> >
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> >
> >> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> >> the queue is empty; that needs a similar treatment: after releasing
> >> seqlock it needs to ensure that another packet hasn't been enqueued
> >> since it last checked.
> >
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> >
> > ---
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 90b59fc50dc9..c7e48356132a 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >>
> >> if (q->flags & TCQ_F_NOLOCK) {
> >> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> - qdisc_run(q);
> >> + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> >> + __netif_schedule(q);
> >
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
>
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great. If
> things look good with our low throughput test over the weekend we can
> also try assessing performance next week.
>
> Josh

After running our reproducer over the long weekend, we've observed several more packets getting stuck.
The behavior is an order of magnitude better *with* the patch (that is, only a few packets get stuck),
but the patch does not completely resolve the issue.

I have a nagging suspicion that the same race that we observed between consumer/producer threads can occur with
softirq processing in net_tx_action() as well (as triggered by __netif_schedule()), since both rely on the same semantics of qdisc_run().
Unfortunately, in such a case, we cannot just punt to __netif_schedule() again.
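
For context, the relevant part of net_tx_action() boils down to the same
pattern (paraphrased from 4.19's net/core/dev.c, heavily trimmed):

	while (head) {
		struct Qdisc *q = head;
		spinlock_t *root_lock = NULL;

		head = head->next_sched;

		if (!(q->flags & TCQ_F_NOLOCK)) {
			root_lock = qdisc_lock(q);
			spin_lock(root_lock);
		}
		smp_mb__before_atomic();
		clear_bit(__QDISC_STATE_SCHED, &q->state);
		qdisc_run(q);	/* same trylock-and-bail semantics as the xmit path */
		if (root_lock)
			spin_unlock(root_lock);
	}

If the trylock inside qdisc_run() fails here, __QDISC_STATE_SCHED has already
been cleared and the softirq simply moves on, leaving the packet in the same
situation as before.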

Regards,
~ Michael



2020-07-09 09:23:46

by Paolo Abeni

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Wed, 2020-07-08 at 13:16 -0700, Cong Wang wrote:
> On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni <[email protected]> wrote:
> > So the regression with 2 pktgen threads is still relevant. 'perf' shows
> > relevant time spent into net_tx_action() and __netif_schedule().
>
> So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
> not a good idea.
>
> Let me see if there is any other way to fix this.

Thank you very much for the effort! I'm personally out of ideas for a
real fix that would avoid regressions.

To be more exhaustive, these are the sources of overhead, as far as I can
observe them with perf:

- contention on q->state, in __netif_schedule()
- execution of net_tx_action() when there are no packets to be served
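
Both boil down to this path (paraphrased from net/core/dev.c):

void __netif_schedule(struct Qdisc *q)
{
	/* the test_and_set_bit() on q->state is the contention point */
	if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
		__netif_reschedule(q);
}

static void __netif_reschedule(struct Qdisc *q)
{
	struct softnet_data *sd;
	unsigned long flags;

	local_irq_save(flags);
	sd = this_cpu_ptr(&softnet_data);
	q->next_sched = NULL;
	*sd->output_queue_tailp = q;
	sd->output_queue_tailp = &q->next_sched;
	/* this is what later runs net_tx_action(), even when the qdisc
	 * turns out to have nothing left to send
	 */
	raise_softirq_irqoff(NET_TX_SOFTIRQ);
	local_irq_restore(flags);
}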

Cheers,

Paolo

2020-08-20 07:45:03

by Jike Song

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Josh,

On Fri, Jul 3, 2020 at 2:14 AM Josh Hunt <[email protected]> wrote:
{snip}
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great. If
> things look good with our low throughput test over the weekend we can
> also try assessing performance next week.
>

We met possibly the same problem when testing nvidia/mellanox's
GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
can also have a try?
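
(Concretely, the change on our side was just the build-time default; from
memory, something like:)

CONFIG_NET_SCH_DEFAULT=y
# CONFIG_DEFAULT_PFIFO_FAST is not set
CONFIG_DEFAULT_FQ_CODEL=y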

Besides, our testing is pretty complex, do you have a quick test to
reproduce it?

--
Thanks,
Jike

2020-08-20 19:07:59

by Josh Hunt

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:
> Hi Josh,
>
>
> We met possibly the same problem when testing nvidia/mellanox's
> GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> can also have a try?

We also did something similar where we've switched over to using the fq
scheduler everywhere for now. We believe the bug is in the nolock code
which only pfifo_fast uses atm, but we've been unable to come up with a
satisfactory solution.
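
(For anyone else hitting this, the runtime switch is the usual one; the
interface name below is just an example:)

# new qdisc instances pick up the sysctl default ...
sysctl -w net.core.default_qdisc=fq
# ... and an already-configured device can be switched explicitly
tc qdisc replace dev eth0 root fq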

>
> Besides, our testing is pretty complex, do you have a quick test to
> reproduce it?
>

Unfortunately we don't have a simple test case either. Our current
reproducer is complex as well, although it would seem like we should be
able to come up with something where you have maybe 2 threads trying to
send on the same tx queue running pfifo_fast every few hundred
milliseconds, and not much else/no other tx traffic on that queue. IIRC
we believe the scenario is: one thread is in the process of dequeuing a
packet while another is enqueuing; the enqueue-er (word? :)) sees the
dequeue is in progress and so does not xmit the packet, assuming the
dequeue operation will take care of it. However, b/c the dequeue is in
the process of completing, it doesn't, and the newly enqueued packet
stays in the qdisc until another packet is enqueued, pushing both out.
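
Spelled out as code, that test shape would be something like the sketch
below - untested and not a confirmed reproducer; the destination address
and the assumption that CPUs 0 and 1 map to the same tx queue via XPS are
placeholders, and the stall itself would have to be observed on the
receive side or with perf probes like Michael's.

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static const char *dst_ip = "192.0.2.1";	/* placeholder sink address */

static void *sender(void *arg)
{
	long cpu = (long)arg;
	struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(9) };
	char payload[64] = "ping";
	cpu_set_t set;
	int fd;

	/* pin to a CPU; the two CPUs chosen in main() are assumed to map
	 * to the same tx queue via XPS - adjust for the system under test
	 */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return NULL;
	}
	inet_pton(AF_INET, dst_ip, &sa.sin_addr);

	for (;;) {
		sendto(fd, payload, sizeof(payload), 0,
		       (struct sockaddr *)&sa, sizeof(sa));
		usleep(200 * 1000);	/* one small packet every ~200 ms */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, sender, (void *)0L);
	pthread_create(&t2, NULL, sender, (void *)1L);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}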

Given that we have a workaround (using fq or any other qdisc not named
pfifo_fast), this has gotten bumped down in priority for us. I would
like to work on a reproducer at some point, but that likely won't be for
a few weeks :(

Josh

2020-08-25 02:49:28

by Kehuan Feng

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hillf,

With the latest version (I've attached what I changed on my tree), the
system failed to start up, with the CPU stalled.


Hillf Danton <[email protected]> wrote on Sat, Aug 22, 2020 at 11:30 AM:
>
>
> On Thu, 20 Aug 2020 20:43:17 +0800 Hillf Danton wrote:
> > Hi Jike,
> >
> > On Thu, 20 Aug 2020 15:43:17 +0800 Jike Song wrote:
> > > Hi Josh,
> > >
> > > On Fri, Jul 3, 2020 at 2:14 AM Josh Hunt <[email protected]> wrote:
> > > {snip}
> > > > Initial results with Cong's patch look promising, so far no stalls. We
> > > > will let it run over the long weekend and report back on Tuesday.
> > > >
> > > > Paolo - I have concerns about possible performance regression with the
> > > > change as well. If you can gather some data that would be great. If
> > > > things look good with our low throughput test over the weekend we can
> > > > also try assessing performance next week.
> > > >
> > >
> > > We met possibly the same problem when testing nvidia/mellanox's
> >
> > Below is what was sent in reply to this thread early last month with
> > minor tuning, based on the seqlock. Feel free to drop an echo if it
> > makes ant-antenna-size sense in your tests.
> >
> > > GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> > > DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> > > can also have a try?
> > >
> > > Besides, our testing is pretty complex, do you have a quick test to
> > > reproduce it?
> > >
> > > --
> > > Thanks,
> > > Jike
> >
> >
> > --- a/include/net/sch_generic.h
> > +++ b/include/net/sch_generic.h
> > @@ -79,6 +79,7 @@ struct Qdisc {
> > #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
> > #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
> > #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
> > + int pkt_seq;
> > u32 limit;
> > const struct Qdisc_ops *ops;
> > struct qdisc_size_table __rcu *stab;
> > @@ -156,6 +157,7 @@ static inline bool qdisc_is_empty(const
> > static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> > {
> > if (qdisc->flags & TCQ_F_NOLOCK) {
> > + qdisc->pkt_seq++;
> > if (!spin_trylock(&qdisc->seqlock))
> > return false;
> > WRITE_ONCE(qdisc->empty, false);
> > --- a/include/net/pkt_sched.h
> > +++ b/include/net/pkt_sched.h
> > @@ -117,7 +117,9 @@ void __qdisc_run(struct Qdisc *q);
> >
> > static inline void qdisc_run(struct Qdisc *q)
> > {
> > - if (qdisc_run_begin(q)) {
> > + while (qdisc_run_begin(q)) {
> > + int seq = q->pkt_seq;
> > +
> > /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> > * to avoid racing with dev_qdisc_reset()
> > */
> > @@ -125,6 +127,9 @@ static inline void qdisc_run(struct Qdis
> > likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> > __qdisc_run(q);
> > qdisc_run_end(q);
> > +
> > + if (!(q->flags & TCQ_F_NOLOCK) || seq == q->pkt_seq)
> > + return;
> > }
> > }
>
> The echo from Feng indicates that it's hard to conclude that TCQ_F_NOLOCK
> is the culprit; let's try again with it ignored for now.
>
> Every pkt enqueued on pfifo_fast is tracked in the below diff, and those
> pkts enqueued while we're running qdisc are detected and handled to cut
> the chance for the stuck pkts reported.
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
> #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
> #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
> #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
> + int pkt_seq;
> u32 limit;
> const struct Qdisc_ops *ops;
> struct qdisc_size_table __rcu *stab;
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -631,6 +631,7 @@ static int pfifo_fast_enqueue(struct sk_
> return qdisc_drop(skb, qdisc, to_free);
> }
>
> + qdisc->pkt_seq++;
> qdisc_update_stats_at_enqueue(qdisc, pkt_len);
> return NET_XMIT_SUCCESS;
> }
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -117,7 +117,8 @@ void __qdisc_run(struct Qdisc *q);
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> - if (qdisc_run_begin(q)) {
> + while (qdisc_run_begin(q)) {
> + int seq = q->pkt_seq;
> /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> * to avoid racing with dev_qdisc_reset()
> */
> @@ -125,6 +126,12 @@ static inline void qdisc_run(struct Qdis
> likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> __qdisc_run(q);
> qdisc_run_end(q);
> +
> + /* go another round if there are pkts enqueued after
> + * taking seq_lock
> + */
> + if (seq != q->pkt_seq)
> + continue;
> }
> }
>
>


Attachments:
fix_nolock_from_hillf.patch (1.23 kB)

2020-08-25 07:15:37

by Kehuan Feng

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Hillf,

I just tried the updated version and the system can boot up now.
It does mitigate the issue a lot but still doesn't get rid of it
completely. The effect seems similar to Cong's patch.


Hillf Danton <[email protected]> wrote on Tue, Aug 25, 2020 at 11:23 AM:
>
>
> Hi Feng,
>
> On Tue, 25 Aug 2020 10:18:05 +0800 Fengkehuan Feng wrote:
> > Hillf,
> >
> > With the latest version (attached what I have changed on my tree), the
> > system failed to start up with cpu stalled.
>
> My fault.
>
> There is a missing break while running qdisc and it's fixed
> in the diff below for Linux-5.x.
>
> If it is Linux-4.x in your testing, running qdisc looks a bit
> different based on your diff (better if it's in the message body):
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> - if (qdisc_run_begin(q)) {
> + while (qdisc_run_begin(q)) {
> + int seq = q->pkt_seq;
> __qdisc_run(q);
> qdisc_run_end(q);
> +
> + /* go another round if there are pkts enqueued after
> + * taking seq_lock
> + */
> + if (seq != q->pkt_seq)
> + continue;
> + else
> + return;
> }
> }
>
>
> Hillf
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
> #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
> #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
> #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
> + int pkt_seq;
> u32 limit;
> const struct Qdisc_ops *ops;
> struct qdisc_size_table __rcu *stab;
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -631,6 +631,7 @@ static int pfifo_fast_enqueue(struct sk_
> return qdisc_drop(skb, qdisc, to_free);
> }
>
> + qdisc->pkt_seq++;
> qdisc_update_stats_at_enqueue(qdisc, pkt_len);
> return NET_XMIT_SUCCESS;
> }
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -117,14 +117,27 @@ void __qdisc_run(struct Qdisc *q);
>
> static inline void qdisc_run(struct Qdisc *q)
> {
> - if (qdisc_run_begin(q)) {
> + while (qdisc_run_begin(q)) {
> + int seq = q->pkt_seq;
> + bool check_seq = false;
> +
> /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> * to avoid racing with dev_qdisc_reset()
> */
> if (!(q->flags & TCQ_F_NOLOCK) ||
> - likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> + likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
> __qdisc_run(q);
> + check_seq = true;
> + }
> qdisc_run_end(q);
> +
> + /* go another round if there are pkts enqueued after
> + * taking seq_lock
> + */
> + if (check_seq && seq != q->pkt_seq)
> + continue;
> + else
> + return;
> }
> }
>
> --
>

2020-08-26 02:40:06

by Kehuan Feng

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Hillf,

Thanks for the patch.
I just tried it and it looks better than the previous one. The issue
appeared only once over ~30 minutes of stressing (without the patch, it
usually shows up within 1 minute), so I feel like we are getting close
to the final fix.
(I've pasted the modifications from my tree below in case anything is missing.)

--- ./include/net/sch_generic.h.orig	2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h	2020-08-26 09:41:04.647173869 +0800
@@ -79,6 +79,7 @@
 #define TCQ_F_INVISIBLE		0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK		0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED		0x200 /* qdisc is offloaded to HW */
+	int			pkt_seq;
 	u32			limit;
 	const struct Qdisc_ops	*ops;
 	struct qdisc_size_table	__rcu *stab;
--- ./include/net/pkt_sched.h.orig	2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/pkt_sched.h	2020-08-26 09:42:14.491377514 +0800
@@ -117,8 +117,15 @@
 static inline void qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
+		q->pkt_seq = 0;
+
 		__qdisc_run(q);
 		qdisc_run_end(q);
+
+		/* reschedule qdisc if there are packets enqueued */
+		if (q->pkt_seq != 0)
+			__netif_schedule(q);
+
 	}
 }
 
--- ./net/core/dev.c.orig	2020-03-19 16:31:27.000000000 +0800
+++ ./net/core/dev.c	2020-08-26 09:47:57.783165885 +0800
@@ -2721,6 +2721,7 @@
 
 	local_irq_save(flags);
 	sd = this_cpu_ptr(&softnet_data);
+	q->pkt_seq = 0;
 	q->next_sched = NULL;
 	*sd->output_queue_tailp = q;
 	sd->output_queue_tailp = &q->next_sched;
--- ./net/sched/sch_generic.c.orig	2020-08-24 22:02:04.589830751 +0800
+++ ./net/sched/sch_generic.c	2020-08-26 09:43:40.987852551 +0800
@@ -403,6 +403,9 @@
 	 */
 	quota -= packets;
 	if (quota <= 0 || need_resched()) {
+		/* info caller to reschedule qdisc outside q->seqlock */
+		q->pkt_seq = 1;
+
 		__netif_schedule(q);
 		break;
 	}


Hillf Danton <[email protected]> wrote on Wed, Aug 26, 2020 at 12:26 AM:
>
>
> Hi Feng,
>
> On Tue, 25 Aug 2020 15:14:12 +0800 Fengkehuan Feng wrote:
> > Hi Hillf,
> >
> > I just tried the updated version and the system can boot up now.
>
> Thanks again for your testing.
>
> > It does mitigate the issue a lot but still couldn't get rid of it
> > thoroughly. It seems to me like the effect of Cong's patch.
>
> Your echoes show we're still marching in the dark, so let's try another
> direction in which qdisc is rescheduled outside seqlock to make sure
> tx softirq is raised when there're more packets on the pfifo_fast to
> be transmitted.
>
> CPU0 CPU1
> ---- ----
> seqlock
> test __QDISC_STATE_SCHED
> raise tx softirq
> clear __QDISC_STATE_SCHED
> try seqlock
> __qdisc_run(q);
> sequnlock
> sequnlock
>
>
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -79,6 +79,7 @@ struct Qdisc {
> #define TCQ_F_INVISIBLE 0x80 /* invisible by default in dump */
> #define TCQ_F_NOLOCK 0x100 /* qdisc does not require locking */
> #define TCQ_F_OFFLOADED 0x200 /* qdisc is offloaded to HW */
> + int pkt_seq;
> u32 limit;
> const struct Qdisc_ops *ops;
> struct qdisc_size_table __rcu *stab;
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -118,6 +118,8 @@ void __qdisc_run(struct Qdisc *q);
> static inline void qdisc_run(struct Qdisc *q)
> {
> if (qdisc_run_begin(q)) {
> + q->pkt_seq = 0;
> +
> /* NOLOCK qdisc must check 'state' under the qdisc seqlock
> * to avoid racing with dev_qdisc_reset()
> */
> @@ -125,6 +127,10 @@ static inline void qdisc_run(struct Qdis
> likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
> __qdisc_run(q);
> qdisc_run_end(q);
> +
> + /* reschedule qdisc if there are packets enqueued */
> + if (q->pkt_seq != 0)
> + __netif_schedule(q);
> }
> }
>
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -384,6 +384,8 @@ void __qdisc_run(struct Qdisc *q)
> while (qdisc_restart(q, &packets)) {
> quota -= packets;
> if (quota <= 0) {
> + /* info caller to reschedule qdisc outside q->seqlock */
> + q->pkt_seq = 1;
> __netif_schedule(q);
> break;
> }
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3031,6 +3031,7 @@ static void __netif_reschedule(struct Qd
>
> local_irq_save(flags);
> sd = this_cpu_ptr(&softnet_data);
> + q->pkt_seq = 0;
> q->next_sched = NULL;
> *sd->output_queue_tailp = q;
> sd->output_queue_tailp = &q->next_sched;
> --
>

2020-08-27 06:57:51

by Kehuan Feng

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Hillf,

> Let’s see if TCQ_F_NOLOCK is making fq_codel different in your testing.

I assume you meant disabling NOLOCK for pfifo_fast.

Here is the modification,

--- ./net/sched/sch_generic.c.orig 2020-08-24 22:02:04.589830751 +0800
+++ ./net/sched/sch_generic.c 2020-08-27 10:17:10.148977195 +0800
@@ -792,7 +792,7 @@
.dump = pfifo_fast_dump,
.change_tx_queue_len = pfifo_fast_change_tx_queue_len,
.owner = THIS_MODULE,
- .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_CPUSTATS,

The issue never happened again with it over 3 hours of stressing. And I
restarted the test two times. No surprises at all. Quite stable...


Hillf Danton <[email protected]> wrote on Wed, Aug 26, 2020 at 2:37 PM:
>
> Hi Feng,
>
>
>
> On Wed, 26 Aug 2020 11:12:38 +0800 Fengkehuan Feng wrote:
>
> >Hi Hillf,
> >
> >I just gave more tries on the patch, and it seems not as good as what I told you in the last email.
> >I could see more packets getting stuck now...
>
> We have more to learn here:P
>
> >Let me explain what I am facing in detail in case we are not aligned on fixing the same problem.
> >
> >Our application is in a deep learning scenario and it's based on NVIDIA NCCL to do
> >collective communication intra-node or inter-node (to be more specific, it's data
> >all-reduce on two servers with 8 GPU nodes each).
> >NCCL can support data transmission through TCP/RDMA/GDR. Normally, it takes
> >about 1000 us for TCP, or less for RDMA/GDR, to transmit a 512KB packet, but
> >sometimes it takes hundreds of milliseconds or several seconds to complete.
> >
> >When we change the default qdisc from pfifo_fast to fq_codel, the issue never
> >happens, so we suspect it's something wrong within the networking stack (but
> >it's a bit strange that RDMA or GDR has the same problem).
>
> Let’s see if TCQ_F_NOLOCK is making fq_codel different in your testing.
>
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -791,7 +791,7 @@ struct Qdisc_ops pfifo_fast_ops __read_m
>  .dump = pfifo_fast_dump,
>  .change_tx_queue_len = pfifo_fast_change_tx_queue_len,
>  .owner = THIS_MODULE,
> - .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_CPUSTATS,
>  };
>  EXPORT_SYMBOL(pfifo_fast_ops);
> --
>
> >Here is the log print from our test application,
> >
> >size: 512KB, use_time: 1118us, speed: 0.436745GB/s
> >size: 512KB, use_time: 912us, speed: 0.535396GB/s
> >size: 512KB, use_time: 1023us, speed: 0.477303GB/s
> >size: 512KB, use_time: 919us, speed: 0.531318GB/s
> >size: 512KB, use_time: 1129us, speed: 0.432490GB/s
> >size: 512KB, use_time: 2098748us, speed: 0.000233GB/s
> >size: 512KB, use_time: 1018us, speed: 0.479648GB/s
> >size: 512KB, use_time: 1120us, speed: 0.435965GB/s
> >size: 512KB, use_time: 1071us, speed: 0.455912GB/
>
> JFYI I failed to find this message at lore.kernel.org perhaps
> because of pure text mail.
>
> Thanks
> Hillf

2020-08-28 01:49:08

by Kehuan Feng

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Hillf,

Unfortunately, the above memory barriers don't help. The issue shows up
within 1 minute ...

Hillf Danton <[email protected]> wrote on Thu, Aug 27, 2020 at 8:58 PM:

>
>
> On Thu, 27 Aug 2020 14:56:31 +0800 Kehuan Feng wrote:
> >
> > > Let's see if TCQ_F_NOLOCK is making fq_codel different in your testing.
> >
> > I assume you meant disabling NOLOCK for pfifo_fast.
> >
> > Here is the modification,
> >
> > --- ./net/sched/sch_generic.c.orig 2020-08-24 22:02:04.589830751 +0800
> > +++ ./net/sched/sch_generic.c 2020-08-27 10:17:10.148977195 +0800
> > @@ -792,7 +792,7 @@
> > .dump = pfifo_fast_dump,
> > .change_tx_queue_len = pfifo_fast_change_tx_queue_len,
> > .owner = THIS_MODULE,
> > - .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> > + .static_flags = TCQ_F_CPUSTATS,
> >
> > The issue never happened again with it over 3 hours of stressing. And I
> > restarted the test two times. No surprises at all. Quite stable...
>
> Jaw off. That is great news and I'm failing again to explain the test
> result wrt the difference TCQ_F_NOLOCK can make in running qdisc.
>
> Nothing comes into mind other than two mem barriers though only one is
> needed...
>
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3040,6 +3040,7 @@ static void __netif_reschedule(struct Qd
>
> void __netif_schedule(struct Qdisc *q)
> {
> + smp_mb__before_atomic();
> if (!test_and_set_bit(__QDISC_STATE_SCHED, &q->state))
> __netif_reschedule(q);
> }
> @@ -4899,6 +4900,7 @@ static __latent_entropy void net_tx_acti
> */
> smp_mb__before_atomic();
> clear_bit(__QDISC_STATE_SCHED, &q->state);
> + smp_mb__after_atomic();
> qdisc_run(q);
> if (root_lock)
> spin_unlock(root_lock);
>

2020-09-03 05:04:13

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hello, Kehuan

Can you test the attached one-line fix? I think we are overthinking,
probably all
we need here is a busy wait.

Thanks.


Attachments:
qdisc-seqlock.diff (530.00 B)

2020-09-03 08:42:29

by Paolo Abeni

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> Can you test the attached one-line fix? I think we are overthinking,
> probably all
> we need here is a busy wait.

I think that will solve, but I also think that will kill NOLOCK
performances due to really increased contention.

At this point I fear we could consider reverting the NOLOCK stuff.
I personally would hate doing so, but it looks like NOLOCK benefits are
outweighed by its issues.

Any other opinion more than welcome!

Cheers,

Paolo

2020-09-03 17:45:07

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni <[email protected]> wrote:
>
> On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > Can you test the attached one-line fix? I think we are overthinking,
> > probably all
> > we need here is a busy wait.
>
> I think that will solve, but I also think that will kill NOLOCK
> performances due to really increased contention.

Yeah, we somehow end up with more locks (seqlock, skb array lock)
for lockless qdisc. What an irony... ;)

>
> At this point I fear we could consider reverting the NOLOCK stuff.
> I personally would hate doing so, but it looks like NOLOCK benefits are
> outweighed by its issues.

I agree, NOLOCK brings more pains than gains. There are many race
conditions hidden in generic qdisc layer, another one is enqueue vs.
reset which is being discussed in another thread.

Thanks.

2020-09-04 03:24:31

by Kehuan Feng

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Hillf, Cong, Paolo,

Sorry for the late reply due to other urgent task.

I tried Hillf's patch (shown below on my tree) and it doesn't help and
the jitter shows up very quickly.

--- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h 2020-09-04 10:48:32.081217156 +0800
@@ -108,6 +108,7 @@

spinlock_t busylock ____cacheline_aligned_in_smp;
spinlock_t seqlock;
+ int run, seq;
};

static inline void qdisc_refcount_inc(struct Qdisc *qdisc)
@@ -127,8 +128,11 @@
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK) {
+ qdisc->run++;
+ smp_wmb();
if (!spin_trylock(&qdisc->seqlock))
return false;
+ qdisc->seq = qdisc->run;
} else if (qdisc_is_running(qdisc)) {
return false;
}
@@ -143,8 +147,15 @@
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
write_seqcount_end(&qdisc->running);
- if (qdisc->flags & TCQ_F_NOLOCK)
+ if (qdisc->flags & TCQ_F_NOLOCK) {
+ int seq = qdisc->seq;
+
spin_unlock(&qdisc->seqlock);
+ smp_rmb();
+ if (seq != qdisc->run)
+ __netif_schedule(qdisc);
+
+ }
}


I also tried Cong's patch (shown below on my tree) and it could avoid
the issue (stressing for 30 minutes three times and no jitter
observed).

--- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
+++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
@@ -127,8 +127,7 @@
static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
if (qdisc->flags & TCQ_F_NOLOCK) {
- if (!spin_trylock(&qdisc->seqlock))
- return false;
+ spin_lock(&qdisc->seqlock);
} else if (qdisc_is_running(qdisc)) {
return false;
}

I do not actually know what you are discussing above. It seems to me
that Cong's patch is similar to disabling the lockless feature.

Anyway, we are going to use fq_codel instead, since CentOS 8/kernel
4.18 also uses fq_codel as the default qdisc; not sure whether they
found something related to this.

Thanks,
Kehuan

Hillf Danton <[email protected]> wrote on Thu, Sep 3, 2020 at 6:20 PM:
>
>
> On Thu, 03 Sep 2020 10:39:54 +0200 Paolo Abeni wrote:
> > On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > > Can you test the attached one-line fix? I think we are overthinking,
> > > probably all
> > > we need here is a busy wait.
> >
> > I think that will solve, but I also think that will kill NOLOCK
> > performances due to really increased contention.
> >
> > At this point I fear we could consider reverting the NOLOCK stuff.
> > I personally would hate doing so, but it looks like NOLOCK benefits are
> > outweighed by its issues.
> >
> > Any other opinion more than welcome!
>
> Hi Paolo,
>
> I suspect it's too late to fix the -27% below.
> Surgery to cut NOLOCK seems too early before the fix.
>
> Hillf
>
> >pktgen threads   vanilla   patched[II]   delta
> >nr               kpps      kpps          %
> >1                3240      3240          0
> >2                3910      2830          -27%
> >4                5140      5140          0
>

2020-09-04 05:11:33

by John Fastabend

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Cong Wang wrote:
> On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni <[email protected]> wrote:
> >
> > On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > > Can you test the attached one-line fix? I think we are overthinking,
> > > probably all
> > > we need here is a busy wait.
> >
> > I think that will solve, but I also think that will kill NOLOCK
> > performances due to really increased contention.
>
> Yeah, we somehow end up with more locks (seqlock, skb array lock)
> for lockless qdisc. What an irony... ;)

I went back to the original nolock implementation code to try and figure
out how this was working in the first place.

After initial patch series we have this in __dev_xmit_skb()

if (q->flags & TCQ_F_NOLOCK) {
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
__qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
} else {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
__qdisc_run(q);
}

if (unlikely(to_free))
kfree_skb_list(to_free);
return rc;
}

One important piece here is we used __qdisc_run(q) instead of
what we have there now qdisc_run(q). Here is the latest code,


if (q->flags & TCQ_F_NOLOCK) {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
qdisc_run(q);
...

__qdisc_run is going to always go into a qdisc_restart loop and
dequeue packets. There is no check here to see if another CPU
is running or not. Compare that to qdisc_run()

static inline void qdisc_run(struct Qdisc *q)
{
if (qdisc_run_begin(q)) {
__qdisc_run(q);
qdisc_run_end(q);
}
}

Here we have all the racing around qdisc_is_running() that seems
unsolvable.

Seems we flipped __qdisc_run to qdisc_run in commit 32f7b44d0f566
("sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpers").
It's not clear to me from that patch though why it was even done
there?

Maybe this would unlock us,

diff --git a/net/core/dev.c b/net/core/dev.c
index 7df6c9617321..9b09429103f1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,

if (q->flags & TCQ_F_NOLOCK) {
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
- qdisc_run(q);
+ __qdisc_run(q);

if (unlikely(to_free))
kfree_skb_list(to_free);


Per other thread we also need the state deactivated check added
back.

>
> >
> > At this point I fear we could consider reverting the NOLOCK stuff.
> > I personally would hate doing so, but it looks like NOLOCK benefits are
> > outweighed by its issues.
>
> I agree, NOLOCK brings more pains than gains. There are many race
> conditions hidden in generic qdisc layer, another one is enqueue vs.
> reset which is being discussed in another thread.

Sure. Seems they crept in over time. I had some plans to write a
lockless HTB implementation. But with fq+EDT with BPF it seems that
it is no longer needed, we have a more generic/better solution. So
I dropped it. Also most folks should really be using fq, fq_codel,
etc. by default anyways. Using pfifo_fast alone is not ideal IMO.

Thanks,
John

2020-09-10 20:19:38

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <[email protected]> wrote:
> Maybe this would unlock us,
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7df6c9617321..9b09429103f1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>
> if (q->flags & TCQ_F_NOLOCK) {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> - qdisc_run(q);
> + __qdisc_run(q);
>
> if (unlikely(to_free))
> kfree_skb_list(to_free);
>
>
> Per other thread we also need the state deactivated check added
> back.

I guess no, because pfifo_dequeue() seems to require q->seqlock,
according to comments in qdisc_run(), so we can not just get rid of
qdisc_run_begin()/qdisc_run_end() here.
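
For reference, the dependency in question: pfifo_fast_dequeue() pulls skbs
with the lockless single-consumer skb_array helpers, so something has to
serialize the consumers, and for a NOLOCK qdisc that something is q->seqlock
taken in qdisc_run_begin(). A trimmed-down sketch of the dequeue path (stats
handling simplified and details varying by kernel version, so treat this as
an illustration rather than the exact code):

static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
{
	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
	struct sk_buff *skb = NULL;
	int band;

	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
		struct skb_array *q = band2list(priv, band);

		/* __skb_array_consume() is the single-consumer variant;
		 * callers must already be serialized, which is exactly what
		 * q->seqlock provides on the NOLOCK path.
		 */
		skb = __skb_array_consume(q);
	}
	if (skb)
		qdisc_update_stats_at_dequeue(qdisc, skb);
	else
		WRITE_ONCE(qdisc->empty, true);

	return skb;
}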

Thanks.

2020-09-10 20:20:18

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <[email protected]> wrote:
> I also tried Cong's patch (shown below on my tree) and it could avoid
> the issue (stressing for 30 minutus for three times and not jitter
> observed).

Thanks for verifying it!

>
> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> @@ -127,8 +127,7 @@
> static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> {
> if (qdisc->flags & TCQ_F_NOLOCK) {
> - if (!spin_trylock(&qdisc->seqlock))
> - return false;
> + spin_lock(&qdisc->seqlock);
> } else if (qdisc_is_running(qdisc)) {
> return false;
> }
>
> I am not actually know what you are discussing above. It seems to me
> that Cong's patch is similar as disabling lockless feature.

From performance's perspective, yeah. Did you see any performance
downgrade with my patch applied? It would be great if you can compare
it with removing NOLOCK. And if the performance is as bad as no
NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
for now.

Thanks.

2020-09-10 21:10:48

by John Fastabend

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Cong Wang wrote:
> On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <[email protected]> wrote:
> > Maybe this would unlock us,
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 7df6c9617321..9b09429103f1 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >
> > if (q->flags & TCQ_F_NOLOCK) {
> > rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > - qdisc_run(q);
> > + __qdisc_run(q);
> >
> > if (unlikely(to_free))
> > kfree_skb_list(to_free);
> >
> >
> > Per other thread we also need the state deactivated check added
> > back.
>
> I guess no, because pfifo_dequeue() seems to require q->seqlock,
> according to comments in qdisc_run(), so we can not just get rid of
> qdisc_run_begin()/qdisc_run_end() here.
>
> Thanks.

Seems we would have to revert this as well then,

commit 021a17ed796b62383f7623f4fea73787abddad77
Author: Paolo Abeni <[email protected]>
Date: Tue May 15 16:24:37 2018 +0200

pfifo_fast: drop unneeded additional lock on dequeue

After the previous patch, for NOLOCK qdiscs, q->seqlock is
always held when the dequeue() is invoked, we can drop
any additional locking to protect such operation.

Then I think it should be safe. Back when I was working on the ptr
ring implementation I opted not to do a case without the spinlock
because the performance benefit was minimal in the benchmarks I
was looking at. I assumed at some point it would be worth going
back to it, but just changing those to the __ptr_ring* cases is
not safe without a lock. I remember having a discussion with Tsirkin
about the details, but would have to go through the mail servers
to find it.
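
For reference, the locked consume path being described is essentially the
lockless helper wrapped in the ring's consumer lock, roughly as in
include/linux/ptr_ring.h (paraphrased sketch):

static inline void *ptr_ring_consume(struct ptr_ring *r)
{
	void *ptr;

	spin_lock(&r->consumer_lock);
	/* __ptr_ring_consume() itself is only safe for a single consumer */
	ptr = __ptr_ring_consume(r);
	spin_unlock(&r->consumer_lock);

	return ptr;
}

Reverting 021a17ed796b would bring a consumer spinlock like this back on the
pfifo_fast dequeue path.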

FWIW the initial perf looked like this, (https://lwn.net/Articles/698135/)

nolock pfifo_fast
1: 1417597 1407479 1418913 1439601
2: 1882009 1867799 1864374 1855950
4: 1806736 1804261 1803697 1806994
8: 1354318 1358686 1353145 1356645
12: 1331928 1333079 1333476 1335544

locked pfifo_fast
1: 1471479 1469142 1458825 1456788
2: 1746231 1749490 1753176 1753780
4: 1119626 1120515 1121478 1119220
8: 1001471 999308 1000318 1000776
12: 989269 992122 991590 986581

As you can see measurable improvement on many cores. But, actually
worse if you have enough nic queues to map 1:1 with cores.

nolock mq
1: 1417768 1438712 1449092 1426775
2: 2644099 2634961 2628939 2712867
4: 4866133 4862802 4863396 4867423
8: 9422061 9464986 9457825 9467619
12: 13854470 13213735 13664498 13213292

locked mq
1: 1448374 1444208 1437459 1437088
2: 2687963 2679221 2651059 2691630
4: 5153884 4684153 5091728 4635261
8: 9292395 9625869 9681835 9711651
12: 13553918 13682410 14084055 13946138

So only better if you have more cores than hardware queues
which was the case on some of the devices we had at the time.

Thanks,
John

2020-09-10 21:42:06

by Paolo Abeni

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, 2020-09-10 at 14:07 -0700, John Fastabend wrote:
> Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 10:08 PM John Fastabend <[email protected]> wrote:
> > > Maybe this would unlock us,
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 7df6c9617321..9b09429103f1 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> > >
> > > if (q->flags & TCQ_F_NOLOCK) {
> > > rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> > > - qdisc_run(q);
> > > + __qdisc_run(q);
> > >
> > > if (unlikely(to_free))
> > > kfree_skb_list(to_free);
> > >
> > >
> > > Per other thread we also need the state deactivated check added
> > > back.
> >
> > I guess no, because pfifo_dequeue() seems to require q->seqlock,
> > according to comments in qdisc_run(), so we can not just get rid of
> > qdisc_run_begin()/qdisc_run_end() here.
> >
> > Thanks.
>
> Seems we would have to revert this as well then,
>
> commit 021a17ed796b62383f7623f4fea73787abddad77
> Author: Paolo Abeni <[email protected]>
> Date: Tue May 15 16:24:37 2018 +0200
>
> pfifo_fast: drop unneeded additional lock on dequeue
>
> After the previous patch, for NOLOCK qdiscs, q->seqlock is
> always held when the dequeue() is invoked, we can drop
> any additional locking to protect such operation.
>
> Then I think it should be safe. Back when I was working on the ptr
> ring implementation I opted not to do a case without the spinlock
> because the performance benefit was minimal in the benchmarks I
> was looking at.

The main point behind all those changes was to try to close the gap vs the
locked implementation in the uncontended scenario. In our benchmark,
after commit eb82a994479245a79647d302f9b4eb8e7c9d7ca6, the gap was nearer to
10%.

Anyway I agree reverting back to the bitlock should be safe.

Cheers,

Paolo


2020-09-14 02:12:02

by Yunsheng Lin

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2020/9/11 4:19, Cong Wang wrote:
> On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <[email protected]> wrote:
>> I also tried Cong's patch (shown below on my tree) and it could avoid
>> the issue (stressing for 30 minutus for three times and not jitter
>> observed).
>
> Thanks for verifying it!
>
>>
>> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
>> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
>> @@ -127,8 +127,7 @@
>> static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>> {
>> if (qdisc->flags & TCQ_F_NOLOCK) {
>> - if (!spin_trylock(&qdisc->seqlock))
>> - return false;
>> + spin_lock(&qdisc->seqlock);
>> } else if (qdisc_is_running(qdisc)) {
>> return false;
>> }
>>
>> I am not actually know what you are discussing above. It seems to me
>> that Cong's patch is similar as disabling lockless feature.
>
> From performance's perspective, yeah. Did you see any performance
> downgrade with my patch applied? It would be great if you can compare
> it with removing NOLOCK. And if the performance is as bad as no
> NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> for now.

It seems the lockless qdisc may have the below concurrency problem:

cpu0:                                              cpu1:
q->enqueue                                         .
qdisc_run_begin(q)                                 .
__qdisc_run(q) -> qdisc_restart() -> dequeue_skb() .
               -> sch_direct_xmit()                .
                                                   .
                                                   q->enqueue
                                                   qdisc_run_begin(q)
qdisc_run_end(q)


cpu1 enqueues a skb without calling __qdisc_run(), and cpu0 does not see the
enqueued skb when calling __qdisc_run(q), because cpu1 may enqueue the skb
after cpu0 has called __qdisc_run(q) and before cpu0 calls qdisc_run_end(q).


Kehuan, do you care to try the below patch to see if it is the same problem?

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c3..c97c1ed 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -36,6 +36,7 @@ struct qdisc_rate_table {
enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
+ __QDISC_STATE_ENQUEUED,
};

struct qdisc_size_table {
diff --git a/net/core/dev.c b/net/core/dev.c
index 0362419..5985648 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3748,6 +3748,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
qdisc_calculate_pkt_len(skb, q);

if (q->flags & TCQ_F_NOLOCK) {
+ set_bit(__QDISC_STATE_ENQUEUED, &q->state);
+ smp_mb__after_atomic();
rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
qdisc_run(q);

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 265a61d..c389641 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -381,6 +381,8 @@ void __qdisc_run(struct Qdisc *q)
int quota = dev_tx_weight;
int packets;

+ clear_bit(__QDISC_STATE_ENQUEUED, &q->state);
+ smp_mb__after_atomic();
while (qdisc_restart(q, &packets)) {
quota -= packets;
if (quota <= 0) {
@@ -388,6 +390,9 @@ void __qdisc_run(struct Qdisc *q)
break;
}
}
+
+ if (test_bit(__QDISC_STATE_ENQUEUED, &q->state))
+ __netif_schedule(q);
}

unsigned long dev_trans_start(struct net_device *dev)


>
> Thanks.
>

2020-09-17 20:00:08

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin <[email protected]> wrote:
>
> On 2020/9/11 4:19, Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <[email protected]> wrote:
> >> I also tried Cong's patch (shown below on my tree) and it could avoid
> >> the issue (stressing for 30 minutus for three times and not jitter
> >> observed).
> >
> > Thanks for verifying it!
> >
> >>
> >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> >> @@ -127,8 +127,7 @@
> >> static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >> {
> >> if (qdisc->flags & TCQ_F_NOLOCK) {
> >> - if (!spin_trylock(&qdisc->seqlock))
> >> - return false;
> >> + spin_lock(&qdisc->seqlock);
> >> } else if (qdisc_is_running(qdisc)) {
> >> return false;
> >> }
> >>
> >> I am not actually know what you are discussing above. It seems to me
> >> that Cong's patch is similar as disabling lockless feature.
> >
> > From performance's perspective, yeah. Did you see any performance
> > downgrade with my patch applied? It would be great if you can compare
> > it with removing NOLOCK. And if the performance is as bad as no
> > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > for now.
>
> It seems the lockless qdisc may have below concurrent problem:
> cpu0: cpu1:
> q->enqueue .
> qdisc_run_begin(q) .
> __qdisc_run(q) ->qdisc_restart() -> dequeue_skb() .
> -> sch_direct_xmit() .
> .
> q->enqueue
> qdisc_run_begin(q)
> qdisc_run_end(q)
>
>
> cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).

This is the same problem that my patch fixes, I do not know
why you are suggesting another patch despite quoting mine.
Please read the whole thread if you want to participate.

Thanks.

2020-09-18 02:59:47

by Kehuan Feng

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Sorry, guys, the experiment environment no longer exists now. We
finally went with fq_codel for the online product.

Cong Wang <[email protected]> wrote on Fri, Sep 18, 2020 at 3:52 AM:
>
> On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin <[email protected]> wrote:
> >
> > On 2020/9/11 4:19, Cong Wang wrote:
> > > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng <[email protected]> wrote:
> > >> I also tried Cong's patch (shown below on my tree) and it could avoid
> > >> the issue (stressing for 30 minutus for three times and not jitter
> > >> observed).
> > >
> > > Thanks for verifying it!
> > >
> > >>
> > >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> > >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> > >> @@ -127,8 +127,7 @@
> > >> static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> > >> {
> > >> if (qdisc->flags & TCQ_F_NOLOCK) {
> > >> - if (!spin_trylock(&qdisc->seqlock))
> > >> - return false;
> > >> + spin_lock(&qdisc->seqlock);
> > >> } else if (qdisc_is_running(qdisc)) {
> > >> return false;
> > >> }
> > >>
> > >> I am not actually know what you are discussing above. It seems to me
> > >> that Cong's patch is similar as disabling lockless feature.
> > >
> > > From performance's perspective, yeah. Did you see any performance
> > > downgrade with my patch applied? It would be great if you can compare
> > > it with removing NOLOCK. And if the performance is as bad as no
> > > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > > for now.
> >
> > It seems the lockless qdisc may have below concurrent problem:
> > cpu0: cpu1:
> > q->enqueue .
> > qdisc_run_begin(q) .
> > __qdisc_run(q) ->qdisc_restart() -> dequeue_skb() .
> > -> sch_direct_xmit() .
> > .
> > q->enqueue
> > qdisc_run_begin(q)
> > qdisc_run_end(q)
> >
> >
> > cpu1 enqueue a skb without calling __qdisc_run(), and cpu0 did not see the
> > enqueued skb when calling __qdisc_run(q) because cpu1 may enqueue the skb
> > after cpu0 called __qdisc_run(q) and before cpu0 called qdisc_run_end(q).
>
> This is the same problem that my patch fixes, I do not know
> why you are suggesting another patch despite quoting mine.
> Please read the whole thread if you want to participate.
>
> Thanks.

2021-04-02 19:28:18

by Jiri Kosina

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Thu, 3 Sep 2020, John Fastabend wrote:

> > > At this point I fear we could consider reverting the NOLOCK stuff.
> > > I personally would hate doing so, but it looks like NOLOCK benefits are
> > > outweighed by its issues.
> >
> > I agree, NOLOCK brings more pains than gains. There are many race
> > conditions hidden in generic qdisc layer, another one is enqueue vs.
> > reset which is being discussed in another thread.
>
> Sure. Seems they crept in over time. I had some plans to write a
> lockless HTB implementation. But with fq+EDT with BPF it seems that
> it is no longer needed, we have a more generic/better solution. So
> I dropped it. Also most folks should really be using fq, fq_codel,
> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.

Half a year later, we still have the NOLOCK implementation
present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.

And we've just been bitten by this very same race which appears to be
still unfixed, with single packet being stuck in pfifo_fast qdisc
basically indefinitely due to this very race that this whole thread began
with back in 2019.

Unless there are

(a) any nice ideas how to solve this in an elegant way without
(re-)introducing extra spinlock (Cong's fix) or

(b) any objections to revert as per the argumentation above

I'll be happy to send a revert of the whole NOLOCK implementation next
week.

Thanks,

--
Jiri Kosina
SUSE Labs

2021-04-02 19:34:30

by Josh Hunt

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 4/2/21 12:25 PM, Jiri Kosina wrote:
> On Thu, 3 Sep 2020, John Fastabend wrote:
>
>>>> At this point I fear we could consider reverting the NOLOCK stuff.
>>>> I personally would hate doing so, but it looks like NOLOCK benefits are
>>>> outweighed by its issues.
>>>
>>> I agree, NOLOCK brings more pains than gains. There are many race
>>> conditions hidden in generic qdisc layer, another one is enqueue vs.
>>> reset which is being discussed in another thread.
>>
>> Sure. Seems they crept in over time. I had some plans to write a
>> lockless HTB implementation. But with fq+EDT with BPF it seems that
>> it is no longer needed, we have a more generic/better solution. So
>> I dropped it. Also most folks should really be using fq, fq_codel,
>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
>
> Half a year later, we still have the NOLOCK implementation
> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
>
> And we've just been bitten by this very same race which appears to be
> still unfixed, with single packet being stuck in pfifo_fast qdisc
> basically indefinitely due to this very race that this whole thread began
> with back in 2019.
>
> Unless there are
>
> (a) any nice ideas how to solve this in an elegant way without
> (re-)introducing extra spinlock (Cong's fix) or
>
> (b) any objections to revert as per the argumentation above
>
> I'll be happy to send a revert of the whole NOLOCK implementation next
> week.
>

Jiri

If you have a reproducer can you try
https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your
suggestion of reverting nolock makes sense to me. We've moved to using
fq as our default now b/c of this bug.

Josh

2021-04-03 12:27:52

by Jiri Kosina

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Sat, 3 Apr 2021, Hillf Danton wrote:

> >>> Sure. Seems they crept in over time. I had some plans to write a
> >>> lockless HTB implementation. But with fq+EDT with BPF it seems that
> >>> it is no longer needed, we have a more generic/better solution. So
> >>> I dropped it. Also most folks should really be using fq, fq_codel,
> >>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
> >>
> >> Half a year later, we still have the NOLOCK implementation
> >> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
> >>
> >> And we've just been bitten by this very same race which appears to be
> >> still unfixed, with single packet being stuck in pfifo_fast qdisc
> >> basically indefinitely due to this very race that this whole thread began
> >> with back in 2019.
> >>
> >> Unless there are
> >>
> >> (a) any nice ideas how to solve this in an elegant way without
> >> (re-)introducing extra spinlock (Cong's fix) or
> >>
> >> (b) any objections to revert as per the argumentation above
> >>
> >> I'll be happy to send a revert of the whole NOLOCK implementation next
> >> week.
> >>
> >Jiri
> >
>
> Feel free to revert it as the scorch wont end without a deluge.

I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
coming days. If it works, then we can consider proceeding with it,
otherwise I am all for reverting the whole NOLOCK stuff.

[1] https://lore.kernel.org/linux-can/[email protected]/T/#u

--
Jiri Kosina
SUSE Labs

2021-04-06 11:37:06

by Yunsheng Lin

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2021/4/3 20:23, Jiri Kosina wrote:
> On Sat, 3 Apr 2021, Hillf Danton wrote:
>
>>>>> Sure. Seems they crept in over time. I had some plans to write a
>>>>> lockless HTB implementation. But with fq+EDT with BPF it seems that
>>>>> it is no longer needed, we have a more generic/better solution. So
>>>>> I dropped it. Also most folks should really be using fq, fq_codel,
>>>>> etc. by default anyways. Using pfifo_fast alone is not ideal IMO.
>>>>
>>>> Half a year later, we still have the NOLOCK implementation
>>>> present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself.
>>>>
>>>> And we've just been bitten by this very same race which appears to be
>>>> still unfixed, with single packet being stuck in pfifo_fast qdisc
>>>> basically indefinitely due to this very race that this whole thread began
>>>> with back in 2019.
>>>>
>>>> Unless there are
>>>>
>>>> (a) any nice ideas how to solve this in an elegant way without
>>>> (re-)introducing extra spinlock (Cong's fix) or
>>>>
>>>> (b) any objections to revert as per the argumentation above
>>>>
>>>> I'll be happy to send a revert of the whole NOLOCK implementation next
>>>> week.
>>>>
>>> Jiri
>>>
>>
>> Feel free to revert it as the scorch wont end without a deluge.
>
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> coming days. If it works, then we can consider proceeding with it,
> otherwise I am all for reverting the whole NOLOCK stuff.

Hi, Jiri
Do you have a reproducer that can be shared here?
With reproducer, I can debug and test it myself too.

Thanks.

>
> [1] https://lore.kernel.org/linux-can/[email protected]/T/#u
>

2021-04-06 12:24:52

by Cong Wang

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <[email protected]> wrote:
>
> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> coming days. If it works, then we can consider proceeding with it,
> otherwise I am all for reverting the whole NOLOCK stuff.
>
> [1] https://lore.kernel.org/linux-can/[email protected]/T/#u

I personally prefer to just revert that bit, as it brings more troubles
than gains. Even with Yunsheng's patch, there are still some issues.
Essentially, I think the core qdisc scheduling code is not ready for
lockless, just look at those NOLOCK checks in sch_generic.c. :-/
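
One example of the kind of NOLOCK special-casing being referred to: the
deactivate/reset path has to take q->seqlock conditionally so that a reset
cannot run concurrently with a lockless xmit. Roughly (a sketch of the
dev_reset_queue()-style logic from the enqueue-vs-reset discussion, not an
exact copy of any particular kernel version):

	nolock = qdisc->flags & TCQ_F_NOLOCK;

	if (nolock)
		spin_lock_bh(&qdisc->seqlock);
	spin_lock_bh(qdisc_lock(qdisc));

	qdisc_reset(qdisc);

	spin_unlock_bh(qdisc_lock(qdisc));
	if (nolock)
		spin_unlock_bh(&qdisc->seqlock);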

Thanks.

2021-04-06 14:55:23

by Yunsheng Lin

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2021/4/6 9:49, Cong Wang wrote:
> On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <[email protected]> wrote:
>>
>> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
>> coming days. If it works, then we can consider proceeding with it,
>> otherwise I am all for reverting the whole NOLOCK stuff.
>>
>> [1] https://lore.kernel.org/linux-can/[email protected]/T/#u
>
> I personally prefer to just revert that bit, as it brings more troubles
> than gains. Even with Yunsheng's patch, there are still some issues.
> Essentially, I think the core qdisc scheduling code is not ready for
> lockless, just look at those NOLOCK checks in sch_generic.c. :-/

I am also aware of the NOLOCK checks too:), and I am willing to
take care of them if that is possible.

As the number of cores in a system is increasing, it is the trend
to become lockless, right? Even when there is only one cpu involved, the
spinlock taking and releasing takes about 30ns on our arm64 system
when CONFIG_PREEMPT_VOLUNTARY is enabled (ip forwarding testing).

Currently I have three ideas to optimize the lockless qdisc:
1. implement the qdisc bypass for lockless qdisc too, see [1].

2. implement lockless enqueuing for lockless qdisc using the idea
from Jason and Toke. It has a noticeable performance increase with
1-4 threads running, using the below prototype based on ptr_ring.

static inline int __ptr_ring_multi_produce(struct ptr_ring *r, void *ptr)
{
	int producer, next_producer;

	do {
		producer = READ_ONCE(r->producer);
		if (unlikely(!r->size) || r->queue[producer])
			return -ENOSPC;
		next_producer = producer + 1;
		if (unlikely(next_producer >= r->size))
			next_producer = 0;
	} while (cmpxchg_relaxed(&r->producer, producer, next_producer) != producer);

	/* Make sure the pointer we are storing points to a valid data. */
	/* Pairs with the dependency ordering in __ptr_ring_consume. */
	smp_wmb();

	WRITE_ONCE(r->queue[producer], ptr);
	return 0;
}

3. Maybe it is possible to remove the netif_tx_lock for lockless qdisc
too, because dev_hard_start_xmit is also under the protection of
qdisc_run_begin()/qdisc_run_end() (if there is only one qdisc using
a netdev queue, which is true for pfifo_fast, I believe); see the
sketch after the link below.


[1]. https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/
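
As for idea 3 above, the lock in question is the per-queue HARD_TX_LOCK taken
around the driver xmit; a simplified excerpt of that part of sch_direct_xmit()
(validation and return-code handling trimmed):

	HARD_TX_LOCK(dev, txq, smp_processor_id());
	if (!netif_xmit_frozen_or_stopped(txq))
		skb = dev_hard_start_xmit(skb, dev, txq, &ret);
	HARD_TX_UNLOCK(dev, txq);

The argument is that when a single NOLOCK qdisc owns the txq, this section
already runs under q->seqlock via qdisc_run_begin()/qdisc_run_end(), so the
extra lock would be redundant in that specific case.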

>
> Thanks.
>
> .
>

2021-04-06 16:09:07

by Michal Kubecek

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>
> Hi, Jiri
> Do you have a reproducer that can be shared here?
> With reproducer, I can debug and test it myself too.

I'm afraid we are not aware of a simple reproducer. As mentioned in the
original discussion, the race window is extremely small and the other
thread has to do quite a lot in the meantime which is probably why, as
far as I know, this was never observed on real hardware, only in
virtualization environments. NFS may also be important as, IIUC, it can
often issue an RPC request from a different CPU right after a data
transfer. Perhaps you could cheat a bit and insert a random delay
between the empty queue check and releasing q->seqlock to make it more
likely to happen.

Other than that, it's rather just "run this complex software in a xen VM
and wait".

Michal

2021-04-06 16:40:37

by Michal Kubecek

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Tue, Apr 06, 2021 at 10:46:29AM +0800, Yunsheng Lin wrote:
> On 2021/4/6 9:49, Cong Wang wrote:
> > On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <[email protected]> wrote:
> >>
> >> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
> >> coming days. If it works, then we can consider proceeding with it,
> >> otherwise I am all for reverting the whole NOLOCK stuff.
> >>
> >> [1] https://lore.kernel.org/linux-can/[email protected]/T/#u
> >
> > I personally prefer to just revert that bit, as it brings more troubles
> > than gains. Even with Yunsheng's patch, there are still some issues.
> > Essentially, I think the core qdisc scheduling code is not ready for
> > lockless, just look at those NOLOCK checks in sch_generic.c. :-/
>
> I am also awared of the NOLOCK checks too:), and I am willing to
> take care of it if that is possible.
>
> As the number of cores in a system is increasing, it is the trend
> to become lockless, right? Even there is only one cpu involved, the
> spinlock taking and releasing takes about 30ns on our arm64 system
> when CONFIG_PREEMPT_VOLUNTARY is enable(ip forwarding testing).

I agree with the benefits but currently the situation is that we have
a race condition affecting the default qdisc which is being hit in
production and can cause serious trouble which is made worse by commit
1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host
queues") preventing the retransmits of the stuck packet being sent.

Perhaps rather than patching over the current implementation, which
requires more and more complicated hacks to work around the fact that we
cannot make the "queue is empty" check and leaving the critical section
atomic, it would make sense to reimplement it in a way which would allow
us to make that atomic.
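
For illustration, one shape such an atomic scheme could take: have the
enqueue side record, with an atomic bit, that it lost the trylock, and have
the seqlock owner re-check that bit after unlocking. A rough sketch of the
NOLOCK path only (the bit name __QDISC_STATE_MISSED is made up here for the
sake of the example; this is a sketch of the idea, not a patch from this
thread):

static inline bool nolock_qdisc_run_begin(struct Qdisc *q)
{
	if (spin_trylock(&q->seqlock))
		return true;

	/* Lost the race: tell the current owner there is work left over. */
	set_bit(__QDISC_STATE_MISSED, &q->state);
	return false;
}

static inline void nolock_qdisc_run_end(struct Qdisc *q)
{
	spin_unlock(&q->seqlock);

	/* If someone failed the trylock while we were running, their packet
	 * may still be sitting in the queue: reschedule ourselves so it is
	 * not left behind.
	 */
	if (unlikely(test_and_clear_bit(__QDISC_STATE_MISSED, &q->state)))
		__netif_schedule(q);
}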

Michal

2021-04-06 21:35:10

by Yunsheng Lin

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2021/4/6 15:31, Michal Kubecek wrote:
> On Tue, Apr 06, 2021 at 10:46:29AM +0800, Yunsheng Lin wrote:
>> On 2021/4/6 9:49, Cong Wang wrote:
>>> On Sat, Apr 3, 2021 at 5:23 AM Jiri Kosina <[email protected]> wrote:
>>>>
>>>> I am still planning to have Yunsheng Lin's (CCing) fix [1] tested in the
>>>> coming days. If it works, then we can consider proceeding with it,
>>>> otherwise I am all for reverting the whole NOLOCK stuff.
>>>>
>>>> [1] https://lore.kernel.org/linux-can/[email protected]/T/#u
>>>
>>> I personally prefer to just revert that bit, as it brings more troubles
>>> than gains. Even with Yunsheng's patch, there are still some issues.
>>> Essentially, I think the core qdisc scheduling code is not ready for
>>> lockless, just look at those NOLOCK checks in sch_generic.c. :-/
>>
>> I am also awared of the NOLOCK checks too:), and I am willing to
>> take care of it if that is possible.
>>
>> As the number of cores in a system is increasing, it is the trend
>> to become lockless, right? Even there is only one cpu involved, the
>> spinlock taking and releasing takes about 30ns on our arm64 system
>> when CONFIG_PREEMPT_VOLUNTARY is enable(ip forwarding testing).
>
> I agree with the benefits but currently the situation is that we have
> a race condition affecting the default qdisc which is being hit in
> production and can cause serious trouble which is made worse by commit
> 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host
> queues") preventing the retransmits of the stuck packet being sent.
>
> Perhaps rather than patching over current implementation which requires
> more and more complicated hacks to work around the fact that we cannot
> make the "queue is empty" check and leaving the critical section atomic,
> it would make sense to reimplement it in a way which would allow us
> making it atomic.

Yes, reimplementing that is also an option.
But what if the reimplementation also has the same problem, if we do not find
the root cause of this problem? I think it is better to find the root cause
of it first?

>
> Michal
>
>
> .
>

2021-04-06 23:17:27

by Juergen Gross

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 06.04.21 09:06, Michal Kubecek wrote:
> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>
>> Hi, Jiri
>> Do you have a reproducer that can be shared here?
>> With reproducer, I can debug and test it myself too.
>
> I'm afraid we are not aware of a simple reproducer. As mentioned in the
> original discussion, the race window is extremely small and the other
> thread has to do quite a lot in the meantime which is probably why, as
> far as I know, this was never observed on real hardware, only in
> virtualization environments. NFS may also be important as, IIUC, it can
> often issue an RPC request from a different CPU right after a data
> transfer. Perhaps you could cheat a bit and insert a random delay
> between the empty queue check and releasing q->seqlock to make it more
> likely to happen.
>
> Other than that, it's rather just "run this complex software in a xen VM
> and wait".

Being the one who has managed to reproduce the issue I can share my
setup, maybe you can setup something similar (we have seen the issue
with this kind of setup on two different machines).

I'm using a physical machine with 72 cpus and 48 GB of memory. It is
running Xen as virtualization platform.

Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
limited to run on the first 40 physical cpus (no idea whether that
matters, though).

In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
sysbench instances in a loop, those instances are prepared via

sysbench --file-test-mode=rndrd --test=fileio prepare

and then started in a do while loop via:

sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0
--max-time=300 --max-requests=0 run

Each instance is using a dedicated NFS mount to run on. The NFS
server for the 8 mounts is running in dom0 of the same server, the
data of the NFS shares is located in a RAM disk (size is a little bit
above 16GB). The shares are mounted in the guest with:

mount -t nfs -o
rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport
dom0:/ramdisk/share[1-8] /mnt[1-8]

The guests vcpus are limited to run on physical cpus 40-55, on the same
physical cpus I have 16 small guests running eating up cpu time, each of
those guests is pinned to one of the physical cpus 40-55.

That's basically it. All you need to do is to watch out for sysbench
reporting maximum latencies above one second or so (in my setup there
are latencies of several minutes at least once each hour of testing).

In case you'd like to have some more details about the setup don't
hesitate to contact me directly. I can provide you with some scripts
and config runes if you want.


Juergen



2021-04-07 02:06:25

by Yunsheng Lin

[permalink] [raw]
Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2021/4/6 18:13, Juergen Gross wrote:
> On 06.04.21 09:06, Michal Kubecek wrote:
>> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>>
>>> Hi, Jiri
>>> Do you have a reproducer that can be shared here?
>>> With reproducer, I can debug and test it myself too.
>>
>> I'm afraid we are not aware of a simple reproducer. As mentioned in the
>> original discussion, the race window is extremely small and the other
>> thread has to do quite a lot in the meantime which is probably why, as
>> far as I know, this was never observed on real hardware, only in
>> virtualization environments. NFS may also be important as, IIUC, it can
>> often issue an RPC request from a different CPU right after a data
>> transfer. Perhaps you could cheat a bit and insert a random delay
>> between the empty queue check and releasing q->seqlock to make it more
>> likely to happen.
>>
>> Other than that, it's rather just "run this complex software in a xen VM
>> and wait".
>
> Being the one who has managed to reproduce the issue I can share my
> setup, maybe you can setup something similar (we have seen the issue
> with this kind of setup on two different machines).
>
> I'm using a physical machine with 72 cpus and 48 GB of memory. It is
> running Xen as virtualization platform.
>
> Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
> limited to run on the first 40 physical cpus (no idea whether that
> matters, though).
>
> In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
> sysbench instances in a loop, those instances are prepared via
>
> sysbench --file-test-mode=rndrd --test=fileio prepare
>
> and then started in a do while loop via:
>
> sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300 --max-requests=0 run
>
> Each instance is using a dedicated NFS mount to run on. The NFS
> server for the 8 mounts is running in dom0 of the same server, the
> data of the NFS shares is located in a RAM disk (size is a little bit
> above 16GB). The shares are mounted in the guest with:
>
> mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport dom0:/ramdisk/share[1-8] /mnt[1-8]
>
> The guests vcpus are limited to run on physical cpus 40-55, on the same
> physical cpus I have 16 small guests running eating up cpu time, each of
> those guests is pinned to one of the physical cpus 40-55.
>
> That's basically it. All you need to do is to watch out for sysbench
> reporting maximum latencies above one second or so (in my setup there
> are latencies of several minutes at least once each hour of testing).
>
> In case you'd like to have some more details about the setup don't
> hesitate to contact me directly. I can provide you with some scripts
> and config runes if you want.

The setup is rather complex, so I just tried Michal's suggestion using
the below patch:

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -920,7 +920,7 @@ struct sk_buff {
*data;
unsigned int truesize;
refcount_t users;
-
+ unsigned long enqueue_jiffies;
#ifdef CONFIG_SKB_EXTENSIONS
/* only useable after checking ->active_extensions != 0 */
struct skb_ext *extensions;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 639e465..ba39b86 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -176,8 +176,14 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
write_seqcount_end(&qdisc->running);
- if (qdisc->flags & TCQ_F_NOLOCK)
+ if (qdisc->flags & TCQ_F_NOLOCK) {
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
spin_unlock(&qdisc->seqlock);
+ }
}

static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93..fcfae43 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
return qdisc_drop_cpu(skb, qdisc, to_free);
else
return qdisc_drop(skb, qdisc, to_free);
+ } else {
+ skb->enqueue_jiffies = jiffies;
}

qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -651,6 +653,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
skb = __skb_array_consume(q);
}
if (likely(skb)) {
+ unsigned int delay_ms;
+
+ delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
+
+ if (delay_ms > 100)
+ netdev_err(qdisc_dev(qdisc), "delay: %u ms\n", delay_ms);
+
qdisc_update_stats_at_dequeue(qdisc, skb);
} else {
WRITE_ONCE(qdisc->empty, true);


Using the below shell:

while((1))
do
taskset -c 0 mz dummy1 -d 10000 -c 100&
taskset -c 1 mz dummy1 -d 10000 -c 100&
taskset -c 2 mz dummy1 -d 10000 -c 100&
sleep 3
done

And got the below log:
[ 80.881716] hns3 0000:bd:00.0 eth2: delay: 176 ms
[ 82.036564] hns3 0000:bd:00.0 eth2: delay: 296 ms
[ 87.065820] hns3 0000:bd:00.0 eth2: delay: 320 ms
[ 94.134174] dummy1: delay: 1588 ms
[ 94.137570] dummy1: delay: 1580 ms
[ 94.140963] dummy1: delay: 1572 ms
[ 94.144354] dummy1: delay: 1568 ms
[ 94.147745] dummy1: delay: 1560 ms
[ 99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[ 100.106174] dummy1: delay: 1448 ms
[ 102.169799] hns3 0000:bd:00.0 eth2: delay: 436 ms
[ 103.166221] dummy1: delay: 1604 ms
[ 103.169617] dummy1: delay: 1604 ms
[ 104.985806] dummy1: delay: 316 ms
[ 105.113797] hns3 0000:bd:00.0 eth2: delay: 308 ms
[ 107.289805] hns3 0000:bd:00.0 eth2: delay: 556 ms
[ 108.912922] hns3 0000:bd:00.0 eth2: delay: 188 ms
[ 137.241801] dummy1: delay: 30624 ms
[ 137.245283] dummy1: delay: 30620 ms
[ 137.248760] dummy1: delay: 30616 ms
[ 137.252237] dummy1: delay: 30616 ms


It seems the problem can be easily reproduced using Michal's
suggestion.

Will test and debug it using the above reproducer first.

>
>
> Juergen