On 7/2/20, 2:08 PM, "Josh Hunt" <[email protected]> wrote:
>
> On 7/2/20 2:45 AM, Paolo Abeni wrote:
> > Hi all,
> >
> > On Thu, 2020-07-02 at 08:14 +0200, Jonas Bonn wrote:
> >> Hi Cong,
> >>
> >> On 01/07/2020 21:58, Cong Wang wrote:
> >>> On Wed, Jul 1, 2020 at 9:05 AM Cong Wang <[email protected]> wrote:
> >>>> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt <[email protected]> wrote:
> >>>>> Do either of you know if there's been any development on a fix for this
> >>>>> issue? If not we can propose something.
> >>>>
> >>>> If you have a reproducer, I can look into this.
> >>>
> >>> Does the attached patch fix this bug completely?
> >>
> >> It's easier to comment if you inline the patch, but after taking a quick
> >> look it seems too simplistic.
> >>
> >> i) Are you sure you haven't got the return values on qdisc_run reversed?
> >
> > qdisc_run() returns true if it was able to acquire the seq lock. We
> > need to take special action in the opposite case, so Cong's patch LGTM
> > from a functional PoV.
> >
> >> ii) There's a "bypass" path that skips the enqueue/dequeue operation if
> >> the queue is empty; that needs a similar treatment: after releasing
> >> seqlock it needs to ensure that another packet hasn't been enqueued
> >> since it last checked.
> >
> > That has been reverted with
> > commit 379349e9bc3b42b8b2f8f7a03f64a97623fff323
> >
> > ---
> >> diff --git a/net/core/dev.c b/net/core/dev.c
> >> index 90b59fc50dc9..c7e48356132a 100644
> >> --- a/net/core/dev.c
> >> +++ b/net/core/dev.c
> >> @@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
> >>
> >> if (q->flags & TCQ_F_NOLOCK) {
> >> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> >> - qdisc_run(q);
> >> + if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
> >> + __netif_schedule(q);
> >
> > I fear the __netif_schedule() call may cause performance regression to
> > the point of making a revert of TCQ_F_NOLOCK preferable. I'll try to
> > collect some data.
>
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great. If
> things look good with our low throughput test over the weekend we can
> also try assessing performance next week.
>
> Josh

After running our reproducer over the long weekend, we've observed several more packets getting stuck.
The behavior is order of magnitude better *with* the patch (that is, only a few packets get stuck),
but the patch does not completely resolve the issue.

I have a nagging suspicion that the same race that we observed between consumer/producer threads can occur with
softirq processing in net_tx_action() as well (as triggered by __netif_schedule()), since both rely on the same semantics of qdisc_run().
Unfortunately, in such a case, we cannot just punt to __netif_schedule() again.

Regards,
~ Michael

Attachments:

smime.p7s (3.41 kB)

2020-07-09 09:23:46

by Paolo Abeni

[permalink] [raw]

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On Wed, 2020-07-08 at 13:16 -0700, Cong Wang wrote:
> On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni <[email protected]> wrote:
> > So the regression with 2 pktgen threads is still relevant. 'perf' shows
> > relevant time spent into net_tx_action() and __netif_schedule().
>
> So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
> not a good idea.
>
> Let me see if there is any other way to fix this.

Thank you very much for the effort! I'm personally out of ideas for a
real fix that would avoid regressions.

To be more exaustive this are the sources of overhead, as far as I can
observe them with perf:

- contention on q->state, in __netif_schedule()
- execution of net_tx_action() when there are no packet to be served

Cheers,

Paolo

2020-08-20 07:45:03

by Jike Song

[permalink] [raw]

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Josh,

On Fri, Jul 3, 2020 at 2:14 AM Josh Hunt <[email protected]> wrote:
{snip}
> Initial results with Cong's patch look promising, so far no stalls. We
> will let it run over the long weekend and report back on Tuesday.
>
> Paolo - I have concerns about possible performance regression with the
> change as well. If you can gather some data that would be great. If
> things look good with our low throughput test over the weekend we can
> also try assessing performance next week.
>

We met possibly the same problem when testing nvidia/mellanox's
GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
can also have a try?

Besides, our testing is pretty complex, do you have a quick test to
reproduce it?

--
Thanks,
Jike

2020-08-20 19:07:59

by Josh Hunt

[permalink] [raw]

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

Hi Jike

On 8/20/20 12:43 AM, Jike Song wrote:
> Hi Josh,
>
>
> We met possibly the same problem when testing nvidia/mellanox's
> GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to
> DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you
> can also have a try?

We also did something similar where we've switched over to using the fq
scheduler everywhere for now. We believe the bug is in the nolock code
which only pfifo_fast uses atm, but we've been unable to come up with a
satisfactory solution.

>
> Besides, our testing is pretty complex, do you have a quick test to
> reproduce it?
>

Unfortunately we don't have a simple test case either. Our current
reproducer is complex as well, although it would seem like we should be
able to come up with something where you have maybe 2 threads trying to
send on the same tx queue running pfifo_fast every few hundred
milliseconds and not much else/no other tx traffic on that queue. IIRC
we believe the scenario is when one thread is in the process of
dequeuing a packet while another is enqueuing, the enqueue-er (word? :))
sees the dequeue is in progress and so does not xmit the packet assuming
the dequeue operation will take care of it. However b/c the dequeue is
in the process of completing it doesn't and the newly enqueued packet
stays in the qdisc until another packet is enqueued pushing both out.

Given that we have a workaround with using fq or any other qdisc not
named pfifo_fast this has gotten bumped down in priority for us. I would
like to work on a reproducer at some point, but won't likely be for a
few weeks :(

Josh

2020-08-25 02:49:28

2021-04-07 02:06:25

by Yunsheng Lin

[permalink] [raw]

Subject: Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

On 2021/4/6 18:13, Juergen Gross wrote:
> On 06.04.21 09:06, Michal Kubecek wrote:
>> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>>
>>> Hi, Jiri
>>> Do you have a reproducer that can be shared here?
>>> With reproducer, I can debug and test it myself too.
>>
>> I'm afraid we are not aware of a simple reproducer. As mentioned in the
>> original discussion, the race window is extremely small and the other
>> thread has to do quite a lot in the meantime which is probably why, as
>> far as I know, this was never observed on real hardware, only in
>> virtualization environments. NFS may also be important as, IIUC, it can
>> often issue an RPC request from a different CPU right after a data
>> transfer. Perhaps you could cheat a bit and insert a random delay
>> between the empty queue check and releasing q->seqlock to make it more
>> likely to happen.
>>
>> Other than that, it's rather just "run this complex software in a xen VM
>> and wait".
>
> Being the one who has managed to reproduce the issue I can share my
> setup, maybe you can setup something similar (we have seen the issue
> with this kind of setup on two different machines).
>
> I'm using a physical machine with 72 cpus and 48 GB of memory. It is
> running Xen as virtualization platform.
>
> Xen dom0 is limited to 40 vcpus and 32 GB of memory, the dom0 vcpus are
> limited to run on the first 40 physical cpus (no idea whether that
> matters, though).
>
> In a guest with 16 vcpu and 8GB of memory I'm running 8 parallel
> sysbench instances in a loop, those instances are prepared via
>
> sysbench --file-test-mode=rndrd --test=fileio prepare
>
> and then started in a do while loop via:
>
> sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 --max-time=300 --max-requests=0 run
>
> Each instance is using a dedicated NFS mount to run on. The NFS
> server for the 8 mounts is running in dom0 of the same server, the
> data of the NFS shares is located in a RAM disk (size is a little bit
> above 16GB). The shares are mounted in the guest with:
>
> mount -t nfs -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport dom0:/ramdisk/share[1-8] /mnt[1-8]
>
> The guests vcpus are limited to run on physical cpus 40-55, on the same
> physical cpus I have 16 small guests running eating up cpu time, each of
> those guests is pinned to one of the physical cpus 40-55.
>
> That's basically it. All you need to do is to watch out for sysbench
> reporting maximum latencies above one second or so (in my setup there
> are latencies of several minutes at least once each hour of testing).
>
> In case you'd like to have some more details about the setup don't
> hesitate to contact me directly. I can provide you with some scripts
> and config runes if you want.

The setup is rather complex, I just tried Michal' suggestion using
the below patch:

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 9fb0ad4..b691eda 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -207,6 +207,11 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
{
write_seqcount_end(&qdisc->running);
if (qdisc->flags & TCQ_F_NOLOCK) {
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
spin_unlock(&qdisc->seqlock);

if (unlikely(test_bit(__QDISC_STATE_MISSED,
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 6d7f954..a83c520 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
return qdisc_drop_cpu(skb, qdisc, to_free);
else
return qdisc_drop(skb, qdisc, to_free);
+ } else {
+ skb->enqueue_jiffies = jiffies;
}

qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -653,6 +655,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
skb = __skb_array_consume(q);
}
if (likely(skb)) {
+ unsigned int delay_ms;
+
+ delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
linyunsheng@plinth:~/ci/kernel$ vi qdisc_reproducer.patch
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -920,7 +920,7 @@ struct sk_buff {
*data;
unsigned int truesize;
refcount_t users;
-
+ unsigned long enqueue_jiffies;
#ifdef CONFIG_SKB_EXTENSIONS
/* only useable after checking ->active_extensions != 0 */
struct skb_ext *extensions;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 639e465..ba39b86 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -176,8 +176,14 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
static inline void qdisc_run_end(struct Qdisc *qdisc)
{
write_seqcount_end(&qdisc->running);
- if (qdisc->flags & TCQ_F_NOLOCK)
+ if (qdisc->flags & TCQ_F_NOLOCK) {
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
+ udelay(10000);
spin_unlock(&qdisc->seqlock);
+ }
}

static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93..fcfae43 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
return qdisc_drop_cpu(skb, qdisc, to_free);
else
return qdisc_drop(skb, qdisc, to_free);
+ } else {
+ skb->enqueue_jiffies = jiffies;
}

qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -651,6 +653,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
skb = __skb_array_consume(q);
}
if (likely(skb)) {
+ unsigned int delay_ms;
+
+ delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
+
+ if (delay_ms > 100)
+ netdev_err(qdisc_dev(qdisc), "delay: %u ms\n", delay_ms);
+
qdisc_update_stats_at_dequeue(qdisc, skb);
} else {
WRITE_ONCE(qdisc->empty, true);

Using the below shell:

while((1))
do
taskset -c 0 mz dummy1 -d 10000 -c 100&
taskset -c 1 mz dummy1 -d 10000 -c 100&
taskset -c 2 mz dummy1 -d 10000 -c 100&
sleep 3
done

And got the below log:
[ 80.881716] hns3 0000:bd:00.0 eth2: delay: 176 ms
[ 82.036564] hns3 0000:bd:00.0 eth2: delay: 296 ms
[ 87.065820] hns3 0000:bd:00.0 eth2: delay: 320 ms
[ 94.134174] dummy1: delay: 1588 ms
[ 94.137570] dummy1: delay: 1580 ms
[ 94.140963] dummy1: delay: 1572 ms
[ 94.144354] dummy1: delay: 1568 ms
[ 94.147745] dummy1: delay: 1560 ms
[ 99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[ 100.106174] dummy1: delay: 1448 ms
[ 102.169799] hns3 0000:bd:00.0 eth2: delay: 436 ms
[ 103.166221] dummy1: delay: 1604 ms
[ 103.169617] dummy1: delay: 1604 ms
[ 104.985806] dummy1: delay: 316 ms
[ 105.113797] hns3 0000:bd:00.0 eth2: delay: 308 ms
[ 107.289805] hns3 0000:bd:00.0 eth2: delay: 556 ms
[ 108.912922] hns3 0000:bd:00.0 eth2: delay: 188 ms
[ 137.241801] dummy1: delay: 30624 ms
[ 137.245283] dummy1: delay: 30620 ms
[ 137.248760] dummy1: delay: 30616 ms
[ 137.252237] dummy1: delay: 30616 ms

It seems the problem can be easily reproduced using Michal'
suggestion.

Will test and debug it using the above reproducer first.

>
>
> Juergen