During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on
Linux v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of
packets are delivered out-of-order.
We have tracked the problem down to the driver interface level, and it
seems that the driver's net_device_ops.ndo_start_xmit() function gets
the packets handed over in the wrong order.
This behavior was not observed on Linux v4.15 and I have bisected the
problem down to this patch:
> commit c5ad119fb6c09b0297446be05bd66602fa564758
> Author: John Fastabend <[email protected]>
> Date: Thu Dec 7 09:58:19 2017 -0800
>
> net: sched: pfifo_fast use skb_array
>
> This converts the pfifo_fast qdisc to use the skb_array data structure
> and set the lockless qdisc bit. pfifo_fast is the first qdisc to support
> the lockless bit that can be a child of a qdisc requiring locking. So
> we add logic to clear the lock bit on initialization in these cases when
> the qdisc graft operation occurs.
>
> This also removes the logic used to pick the next band to dequeue from
> and instead just checks a per priority array for packets from top priority
> to lowest. This might need to be a bit more clever but seems to work
> for now.
>
> Signed-off-by: John Fastabend <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
The patch does not revert cleanly, but moving to one commit earlier
makes the problem go away.
Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem
go away as well.
Is this an unintended side-effect of the patch or is there something the
driver has to do to request in-order delivery?
Thanks,
Jakob
On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
<[email protected]> wrote:
> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
> delivered out-of-order.
>
> We have tracked the problem down to the driver interface level, and it seems
> that the driver's net_device_ops.ndo_start_xmit() function gets the packets
> handed over in the wrong order.
>
> This behavior was not observed on Linux v4.15 and I have bisected the
> problem down to this patch:
>
>> commit c5ad119fb6c09b0297446be05bd66602fa564758
>> Author: John Fastabend <[email protected]>
>> Date: Thu Dec 7 09:58:19 2017 -0800
>>
>> net: sched: pfifo_fast use skb_array
>>
>> This converts the pfifo_fast qdisc to use the skb_array data structure
>> and set the lockless qdisc bit. pfifo_fast is the first qdisc to
>> support
>> the lockless bit that can be a child of a qdisc requiring locking. So
>> we add logic to clear the lock bit on initialization in these cases
>> when
>> the qdisc graft operation occurs.
>>
>> This also removes the logic used to pick the next band to dequeue from
>> and instead just checks a per priority array for packets from top
>> priority
>> to lowest. This might need to be a bit more clever but seems to work
>> for now.
>>
>> Signed-off-by: John Fastabend <[email protected]>
>> Signed-off-by: David S. Miller <[email protected]>
>
>
> The patch does not revert cleanly, but moving to one commit earlier makes
> the problem go away.
>
> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go
> away as well.
I am of course, a fan of obsoleting pfifo_fast. There's no good reason
for it anymore.
>
> Is this an unintended side-effect of the patch or is there something the
> driver has to do to request in-order delivery?
>
> Thanks,
> Jakob
--
Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
On 03/13/2018 11:35 AM, Dave Taht wrote:
> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
> <[email protected]> wrote:
>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>> delivered out-of-order.
>>
Is the stress-testing tool available somewhere? What type of packets
are being sent?
>> We have tracked the problem down to the driver interface level, and it seems
>> that the driver's net_device_ops.ndo_start_xmit() function gets the packets
>> handed over in the wrong order.
>>
>> This behavior was not observed on Linux v4.15 and I have bisected the
>> problem down to this patch:
>>
>>> commit c5ad119fb6c09b0297446be05bd66602fa564758
>>> Author: John Fastabend <[email protected]>
>>> Date: Thu Dec 7 09:58:19 2017 -0800
>>>
>>> net: sched: pfifo_fast use skb_array
>>>
>>> This converts the pfifo_fast qdisc to use the skb_array data structure
>>> and set the lockless qdisc bit. pfifo_fast is the first qdisc to
>>> support
>>> the lockless bit that can be a child of a qdisc requiring locking. So
>>> we add logic to clear the lock bit on initialization in these cases
>>> when
>>> the qdisc graft operation occurs.
>>>
>>> This also removes the logic used to pick the next band to dequeue from
>>> and instead just checks a per priority array for packets from top
>>> priority
>>> to lowest. This might need to be a bit more clever but seems to work
>>> for now.
>>>
>>> Signed-off-by: John Fastabend <[email protected]>
>>> Signed-off-by: David S. Miller <[email protected]>
>>
>>
>> The patch does not revert cleanly, but moving to one commit earlier makes
>> the problem go away.
>>
>> Selecting the "fq" scheduler instead of "pfifo_fast" makes the problem go
>> away as well.
>
Is this a single queue device or a multiqueue device? Running
'tc -s qdisc show dev foo' would help some.
> I am of course, a fan of obsoleting pfifo_fast. There's no good reason
> for it anymore.
>
>>
>> Is this an unintended side-effect of the patch or is there something the
>> driver has to do to request in-order delivery?
>>
If we introduced a OOO edge case somewhere that was not
intended so I'll take a look into it. But, if you can provide
a bit more details on how stress testing is done to cause the
issue that would help.
Thanks,
John
>> Thanks,
>> Jakob
>
>
>
On 14.03.18 05:03, John Fastabend wrote:
>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>>> delivered out-of-order.
>>>
>
> Is the stress-testing tool available somewhere? What type of packets
> are being sent?
Not public, no, the problem is that you'd need a CAN adapter as well.
The test is simple, sending CAN frames with an increasing counter and a
random length payload:
[ tx thread rx thread ]
| ^
v |
[ interface 0 ] ---- cable ----> [ interface 1 ]
I'll see if I can come up with a UDP testcase that works with normal
ethernet interfaces.
> Is this a single queue device or a multiqueue device? Running
> 'tc -s qdisc show dev foo' would help some.
Here you go:
root@rk3399-q7:~# tc -s qdisc show dev can0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1
1 1 1 1 1
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
> If we introduced a OOO edge case somewhere that was not
> intended so I'll take a look into it. But, if you can provide
> a bit more details on how stress testing is done to cause the
> issue that would help.
Will do.
Thanks,
Jakob
On 14.03.18 05:03, John Fastabend wrote:
> On 03/13/2018 11:35 AM, Dave Taht wrote:
>> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
>> <[email protected]> wrote:
>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>>> delivered out-of-order.
>>>
>
> Is the stress-testing tool available somewhere? What type of packets
> are being sent?
I have reproduced it using two USB network cards connected to each
other. The test tool sends UDP packets containing a counter and listens
on the other interface, it is available at
https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
Here is what I get:
root@rk3399-q7:~# ./pfifo_stress.py
[...]
expected ctr 0xcdc, received 0xcdd
expected ctr 0xcde, received 0xcdc
expected ctr 0xcdd, received 0xcde
expected ctr 0xe3c, received 0xe3d
expected ctr 0xe3e, received 0xe3c
expected ctr 0xe3d, received 0xe3e
expected ctr 0x1097, received 0x1098
expected ctr 0x1099, received 0x1097
expected ctr 0x1098, received 0x1099
expected ctr 0x17c0, received 0x17c1
expected ctr 0x17c2, received 0x17c0
[...]
Best regards,
Jakob
On 03/15/2018 11:08 AM, Jakob Unterwurzacher wrote:
> On 14.03.18 05:03, John Fastabend wrote:
>> On 03/13/2018 11:35 AM, Dave Taht wrote:
>>> On Tue, Mar 13, 2018 at 11:24 AM, Jakob Unterwurzacher
>>> <[email protected]> wrote:
>>>> During stress-testing our "ucan" USB/CAN adapter SocketCAN driver on Linux
>>>> v4.16-rc4-383-ged58d66f60b3 we observed that a small fraction of packets are
>>>> delivered out-of-order.
>>>>
>>
>> Is the stress-testing tool available somewhere? What type of packets
>> are being sent?
>
>
> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>
Great thanks, can you also run this with taskset to bind to
a single CPU,
# taskset 0x1 ./pifof_stress.py
And let me know if you still see the OOO.
> Here is what I get:
>
> root@rk3399-q7:~# ./pfifo_stress.py
> [...]
> expected ctr 0xcdc, received 0xcdd
> expected ctr 0xcde, received 0xcdc
> expected ctr 0xcdd, received 0xcde
> expected ctr 0xe3c, received 0xe3d
> expected ctr 0xe3e, received 0xe3c
> expected ctr 0xe3d, received 0xe3e
> expected ctr 0x1097, received 0x1098
> expected ctr 0x1099, received 0x1097
> expected ctr 0x1098, received 0x1099
> expected ctr 0x17c0, received 0x17c1
> expected ctr 0x17c2, received 0x17c0
> [...]
>
> Best regards,
> Jakob
On 15.03.18 23:30, John Fastabend wrote:
>> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>>
>
> Great thanks, can you also run this with taskset to bind to
> a single CPU,
>
> # taskset 0x1 ./pifof_stress.py
>
> And let me know if you still see the OOO.
Interesting. Looks like it depends on which core it runs on. CPU0 is
clean, CPU1 is not.
Clean: taskset --cpu-list 0 ./pfifo_stress.py
Broken: taskset --cpu-list 1 ./pfifo_stress.py
Maybe related: CPU0 is where USB interrupts are handled:
> root@rk3399-q7:~# cat /proc/interrupts
> CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
> 217: 2175353 0 0 0 0 0 GICv3 142 Level xhci-hcd:usb5
Thanks,
Jakob
On Friday, March 16, 2018, 11:26:47 AM CET Jakob Unterwurzacher wrote:
> On 15.03.18 23:30, John Fastabend wrote:
> >> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
> >> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
> >>
> >
> > Great thanks, can you also run this with taskset to bind to
> > a single CPU,
> >
> > # taskset 0x1 ./pifof_stress.py
> >
> > And let me know if you still see the OOO.
>
> Interesting. Looks like it depends on which core it runs on. CPU0 is
> clean, CPU1 is not.
>
> Clean: taskset --cpu-list 0 ./pfifo_stress.py
>
> Broken: taskset --cpu-list 1 ./pfifo_stress.py
>
> Maybe related: CPU0 is where USB interrupts are handled:
>
> > root@rk3399-q7:~# cat /proc/interrupts
> > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
> > 217: 2175353 0 0 0 0 0 GICv3 142 Level xhci-hcd:usb5
This reminds me somewhat of this thread: https://marc.info/?l=linux-can&m=148007442317274&w=2
Best regards,
Alexander
Hi,
On Fri, 2018-03-16 at 11:26 +0100, Jakob Unterwurzacher wrote:
> On 15.03.18 23:30, John Fastabend wrote:
> > > I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
> > > https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
> > >
> >
> > Great thanks, can you also run this with taskset to bind to
> > a single CPU,
> >
> > # taskset 0x1 ./pifof_stress.py
> >
> > And let me know if you still see the OOO.
>
> Interesting. Looks like it depends on which core it runs on. CPU0 is
> clean, CPU1 is not.
>
> Clean: taskset --cpu-list 0 ./pfifo_stress.py
>
> Broken: taskset --cpu-list 1 ./pfifo_stress.py
>
> Maybe related: CPU0 is where USB interrupts are handled:
>
> > root@rk3399-q7:~# cat /proc/interrupts
> > CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
> > 217: 2175353 0 0 0 0 0 GICv3 142 Level xhci-hcd:usb5
Is not clear to me if you can reproduce the bug with the vanilla
kernel, or if you need some out-of-tree nic driver. Can you please
clarify which NIC/driver are you using?
Thanks,
Paolo
On 19.03.18 13:32, Paolo Abeni wrote:
> Is not clear to me if you can reproduce the bug with the vanilla
> kernel, or if you need some out-of-tree nic driver. Can you please
> clarify which NIC/driver are you using?
Yes I reproduced it with a vanilla kernel. I use two off-the-shelf USB
NICs, lsusb says:
Bus 005 Device 013: ID 0b95:7720 ASIX Electronics Corp. AX88772
Bus 005 Device 014: ID 0b95:772b ASIX Electronics Corp. AX88772B
Photo of the setup:
https://github.com/jakob-tsd/pfifo_stress/blob/master/usb_nics.jpg
Best regards,
Jakob
On 16.03.18 11:26, Jakob Unterwurzacher wrote:
> On 15.03.18 23:30, John Fastabend wrote:
>>> I have reproduced it using two USB network cards connected to each
>>> other. The test tool sends UDP packets containing a counter and
>>> listens on the other interface, it is available at
>>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>>
>> Great thanks, can you also run this with taskset to bind to
>> a single CPU,
>>
>> # taskset 0x1 ./pifof_stress.py
>>
>> And let me know if you still see the OOO.
>
> Interesting. Looks like it depends on which core it runs on. CPU0 is
> clean, CPU1 is not.
So we are at v4.16-rc6 now - have you managed to reproduce this is or
should I try to get the revert correct?
Best regards,
Jakob
On 03/21/2018 03:01 AM, Jakob Unterwurzacher wrote:
> On 16.03.18 11:26, Jakob Unterwurzacher wrote:
>> On 15.03.18 23:30, John Fastabend wrote:
>>>> I have reproduced it using two USB network cards connected to each other. The test tool sends UDP packets containing a counter and listens on the other interface, it is available at
>>>> https://github.com/jakob-tsd/pfifo_stress/blob/master/pfifo_stress.py
>>>
>>> Great thanks, can you also run this with taskset to bind to
>>> a single CPU,
>>>
>>> # taskset 0x1 ./pifof_stress.py
>>>
>>> And let me know if you still see the OOO.
>>
>> Interesting. Looks like it depends on which core it runs on. CPU0 is clean, CPU1 is not.
>
> So we are at v4.16-rc6 now - have you managed to reproduce this is or should I try to get the revert correct?
I have a theory on what is going on here. Because we now run without
locks we can have multiple qdisc_run() calls running in parallel. Possible
if we send packets using multiple cores.
--- application ---
cpu0 cpu1
| |
| |
enqueue enqueue
| |
pfifo_fast
| |
dequeue dequeue
\ /
ndo_xmit
The skb->ooo_okay flag will keep the enqueue side packets in
order. So that is covered.
But on the dequeue side if two cores dequeue in parallel we will
race to ndo ops to ensure packets are in order. Rarely, I guess the
second dequeue could actually call ndo hook before first dequeued
packet. Because usually the dequeue happens on the same queue the
enqueue happened on we don't see this very often. But there seems
to be a case where the driver calls netif_tx_wake_queue() on a
different core (from the rx interrupt context). The wake queue call
then eventually runs the dequeue on a different core. So when taskset
is aligned with the interrupt everything is in-order when it is moved
to a different core we see the OOO.
Thats my theory at least. Are you able to test a patch if I generate
one to fix this?
FWIW the revert on this is trivial, but I think we can fix this
without too much work. Also, if you had a driver tx queue per core
this would not be an issue.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f..171f470 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -792,7 +792,6 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.dump = pfifo_fast_dump,
.change_tx_queue_len = pfifo_fast_change_tx_queue_len,
.owner = THIS_MODULE,
- .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
};
EXPORT_SYMBOL(pfifo_fast_ops);
>
> Best regards,
> Jakob
On 21.03.18 19:43, John Fastabend wrote:
> Thats my theory at least. Are you able to test a patch if I generate
> one to fix this?
Yes, no problem.
I just tested with the flag change you suggested (see below, I had to
keep TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.
Thanks,
Jakob
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f21b20..51b68ef4977b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.dump = pfifo_fast_dump,
.change_tx_queue_len = pfifo_fast_change_tx_queue_len,
.owner = THIS_MODULE,
- .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
+ .static_flags = TCQ_F_CPUSTATS,
};
EXPORT_SYMBOL(pfifo_fast_ops);
On 03/21/2018 12:44 PM, Jakob Unterwurzacher wrote:
> On 21.03.18 19:43, John Fastabend wrote:
>> Thats my theory at least. Are you able to test a patch if I generate
>> one to fix this?
>
> Yes, no problem.
Can you try this,
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d4907b5..1e596bd 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
+ __QDISC_STATE_RUNNING,
};
struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 190570f..cf7c37d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -377,20 +377,26 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
struct netdev_queue *txq;
struct net_device *dev;
struct sk_buff *skb;
- bool validate;
+ bool more, validate;
/* Dequeue packet */
+ if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+ return false;
+
skb = dequeue_skb(q, &validate, packets);
- if (unlikely(!skb))
+ if (unlikely(!skb)) {
+ clear_bit(__QDISC_STATE_RUNNING, &q->state);
return false;
+ }
if (!(q->flags & TCQ_F_NOLOCK))
root_lock = qdisc_lock(q);
dev = qdisc_dev(q);
txq = skb_get_tx_queue(dev, skb);
-
- return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+ more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+ clear_bit(__QDISC_STATE_RUNNING, &q->state);
+ return more;
}
>
> I just tested with the flag change you suggested (see below, I had to keep TCQ_F_CPUSTATS to prevent a crash) and I have NOT seen OOO so far.
>
Right because the code expects per cpu stats if the CPUSTATS flag is
removed it will crash.
> Thanks,
> Jakob
>
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 190570f21b20..51b68ef4977b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -792,7 +792,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
> .dump = pfifo_fast_dump,
> .change_tx_queue_len = pfifo_fast_change_tx_queue_len,
> .owner = THIS_MODULE,
> - .static_flags = TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
> + .static_flags = TCQ_F_CPUSTATS,
> };
> EXPORT_SYMBOL(pfifo_fast_ops);
On 21.03.18 21:52, John Fastabend wrote:
> Can you try this,
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index d4907b5..1e596bd 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -30,6 +30,7 @@ struct qdisc_rate_table {
> enum qdisc_state_t {
> __QDISC_STATE_SCHED,
> __QDISC_STATE_DEACTIVATED,
> + __QDISC_STATE_RUNNING,
> };
> [...]
Tested, looks good. No OOO observed, no side effects observed, iperf
numbers on Gigabit Ethernet look the same.
Thanks,
Jakob
On 03/22/2018 03:16 AM, Jakob Unterwurzacher wrote:
> On 21.03.18 21:52, John Fastabend wrote:
>> Can you try this,
>>
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index d4907b5..1e596bd 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -30,6 +30,7 @@ struct qdisc_rate_table {
>> enum qdisc_state_t {
>> __QDISC_STATE_SCHED,
>> __QDISC_STATE_DEACTIVATED,
>> + __QDISC_STATE_RUNNING,
>> };
>> [...]
>
> Tested, looks good. No OOO observed, no side effects observed, iperf
> numbers on Gigabit Ethernet look the same.
>
> Thanks,
> Jakob
Thanks, I'll send out an official patch today.
.John