On 12/30/19 6:30 AM, Alexander Lobakin wrote:
> Add GRO callbacks to the AR9331 tagger so GRO layer can now process
> such frames.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
This is a good example, and we should probably build a tagger abstraction
that is much simpler to fill in callbacks for (although indirect
function calls may end up killing performance with retpoline and
friends), but let's consider this idea.
> ---
> net/dsa/tag_ar9331.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 77 insertions(+)
>
> diff --git a/net/dsa/tag_ar9331.c b/net/dsa/tag_ar9331.c
> index c22c1b515e02..99cc7fd92d8e 100644
> --- a/net/dsa/tag_ar9331.c
> +++ b/net/dsa/tag_ar9331.c
> @@ -100,12 +100,89 @@ static void ar9331_tag_flow_dissect(const struct sk_buff *skb, __be16 *proto,
> *proto = ar9331_tag_encap_proto(skb->data);
> }
>
> +static struct sk_buff *ar9331_tag_gro_receive(struct list_head *head,
> + struct sk_buff *skb)
> +{
> + const struct packet_offload *ptype;
> + struct sk_buff *p, *pp = NULL;
> + u32 data_off, data_end;
> + const u8 *data;
> + int flush = 1;
> +
> + data_off = skb_gro_offset(skb);
> + data_end = data_off + AR9331_HDR_LEN;
AR9331_HDR_LEN is effectively a parameter here, which happens to be
dsa_device_ops::overhead.
> +
> + data = skb_gro_header_fast(skb, data_off);
> + if (skb_gro_header_hard(skb, data_end)) {
> + data = skb_gro_header_slow(skb, data_end, data_off);
> + if (unlikely(!data))
> + goto out;
> + }
> +
> + /* Data that is to the left from the current position is already
> + * pulled to the head
> + */
> + if (unlikely(!ar9331_tag_sanity_check(skb->data + data_off)))
> + goto out;
This is applicable to all taggers: they need to verify the sanity of the
header they are being handed.
> +
> + rcu_read_lock();
> +
> + ptype = gro_find_receive_by_type(ar9331_tag_encap_proto(data));
If there is no encapsulation, a tagger can return the frame's protocol
directly, so similarly the tagger could be interrogated for that.
> + if (!ptype)
> + goto out_unlock;
> +
> + flush = 0;
> +
> + list_for_each_entry(p, head, list) {
> + if (!NAPI_GRO_CB(p)->same_flow)
> + continue;
> +
> + if (ar9331_tag_source_port(skb->data + data_off) ^
> + ar9331_tag_source_port(p->data + data_off))
Similarly here, the tagger could provide a function whose job is to
return the port number from within its own tag.
So with that being said, what do you think about building a tagger
abstraction which is comprised of:
- header length which is dsa_device_ops::overhead
- validate_tag()
- get_tag_encap_proto()
- get_port_number()
and the rest is just wrapping the general GRO list manipulation?
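As a rough illustration of that abstraction (purely hypothetical names, modeled in plain userspace C rather than actual kernel code, with an invented tag field layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the proposed tagger abstraction: a tagger
 * fills in its header length plus three small helpers, and a generic
 * routine wraps the common GRO list manipulation around them.
 * None of these names exist in the kernel; the tag layout below is
 * illustrative only.
 */
struct tagger_gro_ops {
	unsigned int hdr_len;                        /* dsa_device_ops::overhead */
	bool (*validate_tag)(const uint8_t *hdr);
	uint16_t (*get_tag_encap_proto)(const uint8_t *hdr);
	unsigned int (*get_port_number)(const uint8_t *hdr);
};

/* Example filler: 2-byte tag, version nibble in byte 0, port in the
 * low bits of byte 1 (made-up layout for demonstration).
 */
static bool demo_validate(const uint8_t *hdr)
{
	return (hdr[0] & 0xf0) == 0x10;
}

static uint16_t demo_proto(const uint8_t *hdr)
{
	(void)hdr;
	return 0x0800;	/* pretend the tag always encapsulates IPv4 */
}

static unsigned int demo_port(const uint8_t *hdr)
{
	return hdr[1] & 0x07;
}

static const struct tagger_gro_ops demo_ops = {
	.hdr_len             = 2,
	.validate_tag        = demo_validate,
	.get_tag_encap_proto = demo_proto,
	.get_port_number     = demo_port,
};
```

The generic GRO wrapper would then call `demo_ops.validate_tag()` where the AR9331 code calls ar9331_tag_sanity_check(), and `demo_ops.get_port_number()` in the same_flow comparison — which is exactly where the indirect-call cost shows up with retpoline.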
Also, I am wondering whether we should somehow expose the DSA master
net_device's napi_struct, such that the DSA slave net_devices could call
napi_gro_receive() themselves directly and also perform additional GRO
on top of Ethernet frames?
--
Florian
On Mon, Dec 30, 2019 at 10:20:50AM -0800, Florian Fainelli wrote:
> On 12/30/19 6:30 AM, Alexander Lobakin wrote:
> > Add GRO callbacks to the AR9331 tagger so GRO layer can now process
> > such frames.
> >
> > Signed-off-by: Alexander Lobakin <[email protected]>
>
> This is a good example and we should probably build a tagger abstraction
> that is much simpler to fill in callbacks for (although indirect
> function calls may end-up killing performance with retpoline and
> friends), but let's consider this idea.
Hi Florian
We really do need some numbers here. Does GRO really help? On an ARM
or MIPS platform, I don't think retpoline is an issue? But x86 is, and
we do have a few x86 boards with switches.
Maybe we can do some macro magic instead of function pointers, if we
can keep it all within one object file?
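One shape such "macro magic" could take (hypothetical names, plain userspace C): a macro stamps out a tagger-specific helper in which the per-tagger accessor is a direct, inlinable call rather than a function pointer, so everything resolves at compile time within the one object file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the macro approach: instead of an ops struct
 * with function pointers, each tagger instantiates the generic logic
 * with its accessors bound as direct calls. Names are illustrative.
 */
#define DEFINE_TAGGER_PORT_CMP(name, get_port)				\
	static bool name##_same_port(const uint8_t *a, const uint8_t *b) \
	{								\
		/* direct call, no retpoline thunk involved */		\
		return get_port(a) == get_port(b);			\
	}

/* Made-up tag layout: port number in the low bits of byte 1. */
static unsigned int demo_get_port(const uint8_t *hdr)
{
	return hdr[1] & 0x07;
}

/* Expands to demo_same_port(), with demo_get_port() called directly. */
DEFINE_TAGGER_PORT_CMP(demo, demo_get_port)
```

The trade-off is code duplication per tagger versus the indirect-call overhead of a shared generic implementation.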
Andrew
Florian Fainelli wrote 30.12.2019 21:20:
> On 12/30/19 6:30 AM, Alexander Lobakin wrote:
>> Add GRO callbacks to the AR9331 tagger so GRO layer can now process
>> such frames.
>>
>> Signed-off-by: Alexander Lobakin <[email protected]>
>
> This is a good example and we should probably build a tagger
> abstraction
> that is much simpler to fill in callbacks for (although indirect
> function calls may end-up killing performance with retpoline and
> friends), but let's consider this idea.
Hey all,
Sorry for the late replies, I was on a long trip.
The performance issue was the main reason why I chose to write a full
.gro_receive() for every single tagger instead of providing a bunch
of abstraction callbacks. It really isn't a problem on MIPS, on
which I'm working on this stuff, but it could kill any advantages that
we could get from GRO support on e.g. x86.
>> ---
>> net/dsa/tag_ar9331.c | 77
>> ++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 77 insertions(+)
>>
>> diff --git a/net/dsa/tag_ar9331.c b/net/dsa/tag_ar9331.c
>> index c22c1b515e02..99cc7fd92d8e 100644
>> --- a/net/dsa/tag_ar9331.c
>> +++ b/net/dsa/tag_ar9331.c
>> @@ -100,12 +100,89 @@ static void ar9331_tag_flow_dissect(const struct
>> sk_buff *skb, __be16 *proto,
>> *proto = ar9331_tag_encap_proto(skb->data);
>> }
>>
>> +static struct sk_buff *ar9331_tag_gro_receive(struct list_head *head,
>> + struct sk_buff *skb)
>> +{
>> + const struct packet_offload *ptype;
>> + struct sk_buff *p, *pp = NULL;
>> + u32 data_off, data_end;
>> + const u8 *data;
>> + int flush = 1;
>> +
>> + data_off = skb_gro_offset(skb);
>> + data_end = data_off + AR9331_HDR_LEN;
>
> AR9331_HDR_LEN is a parameter here which is incidentally
> dsa_device_ops::overhead.
Or we can split .overhead into .rx_len and .tx_len and use the first
to help the GRO layer and the flow dissector, and the second to
determine the total overhead for correcting the MTU value. Something like:
mtu = max(tag_ops->rx_len, tag_ops->tx_len);
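A userspace sketch of what that split could look like (the field names are hypothetical and do not exist in the kernel; this only models the proposal):

```c
#include <assert.h>

/* Hypothetical split of dsa_device_ops::overhead into per-direction
 * lengths, as proposed above. Illustrative only.
 */
struct tag_ops_sketch {
	unsigned int rx_len;	/* tag bytes to skip on RX (GRO, flow dissector) */
	unsigned int tx_len;	/* tag bytes to reserve on TX */
};

/* The MTU correction uses whichever direction needs more room. */
static unsigned int tag_overhead(const struct tag_ops_sketch *ops)
{
	return ops->rx_len > ops->tx_len ? ops->rx_len : ops->tx_len;
}
```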
>> +
>> + data = skb_gro_header_fast(skb, data_off);
>> + if (skb_gro_header_hard(skb, data_end)) {
>> + data = skb_gro_header_slow(skb, data_end, data_off);
>> + if (unlikely(!data))
>> + goto out;
>> + }
>> +
>> + /* Data that is to the left from the current position is already
>> + * pulled to the head
>> + */
>> + if (unlikely(!ar9331_tag_sanity_check(skb->data + data_off)))
>> + goto out;
>
> This is applicable to all taggers, they need to verify the sanity of
> the
> header they are being handed.
>
>> +
>> + rcu_read_lock();
>> +
>> + ptype = gro_find_receive_by_type(ar9331_tag_encap_proto(data));
>
> If there is no encapsulation a tagger can return the frame's protocol
> directly, so similarly the tagger can be interrogated for returning
> that.
>
>> + if (!ptype)
>> + goto out_unlock;
>> +
>> + flush = 0;
>> +
>> + list_for_each_entry(p, head, list) {
>> + if (!NAPI_GRO_CB(p)->same_flow)
>> + continue;
>> +
>> + if (ar9331_tag_source_port(skb->data + data_off) ^
>> + ar9331_tag_source_port(p->data + data_off))
>
> Similarly here, the tagger could provide a function whose job is to
> return the port number from within its own tag.
>
> So with that being said, what do you think about building a tagger
> abstraction which is comprised of:
>
> - header length which is dsa_device_ops::overhead
> - validate_tag()
> - get_tag_encap_proto()
> - get_port_number()
>
> and the rest is just wrapping the general GRO list manipulation?
get_tag_encap_proto() and get_port_number() would be called more
than once per frame in that case. Not sure it is a good idea given
the retpoline issues mentioned above.
> Also, I am wondering should we somehow expose the DSA master
> net_device's napi_struct such that we could have the DSA slave
> net_devices call napi_gro_receive() themselves directly such that they
> could also perform additional GRO on top of Ethernet frames?
There's no reason to pass frames to the GRO layer more than once.
The most correct way to handle frames is to pass them to the networking
stack only after DSA tag extraction and removal. That's kinda how
the mac80211 infra works. But this is rather problematic for DSA, as it
keeps Ethernet controller drivers and taggers completely independent
from each other.
I also had an idea to use net_device::rx_handler for tag processing
instead of dsa_pack_type. CPU ports can't be bridged anyway, so this
should not be a problem at first glance.
Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
Hi Alexander,
On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]> wrote:
>
> CPU ports can't be bridged anyway
>
> Regards,
> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
The fact that CPU ports can't be bridged is already not ideal.
One can have a DSA switch with cascaded switches on each port, so it
acts like N DSA masters (not as DSA links, since the taggers are
incompatible), with each switch forming its own tree. It is desirable
that the ports of the DSA switch on top are bridged, so that
forwarding between cascaded switches does not pass through the CPU.
-Vladimir
Vladimir Oltean wrote 13.01.2020 12:42:
> Hi Alexander,
>
> On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]>
> wrote:
>>
>> CPU ports can't be bridged anyway
>>
>> Regards,
>> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
>
> The fact that CPU ports can't be bridged is already not ideal.
> One can have a DSA switch with cascaded switches on each port, so it
> acts like N DSA masters (not as DSA links, since the taggers are
> incompatible), with each switch forming its own tree. It is desirable
> that the ports of the DSA switch on top are bridged, so that
> forwarding between cascaded switches does not pass through the CPU.
Oh, I see. But currently the DSA infra forbids adding DSA masters to
bridges, IIRC. I can't call it a good or a bad decision, but it was
introduced to prevent accidentally breaking the packet flow on DSA setups.
> -Vladimir
Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
On Mon, 13 Jan 2020 at 11:46, Alexander Lobakin <[email protected]> wrote:
>
> Vladimir Oltean wrote 13.01.2020 12:42:
> > Hi Alexander,
> >
> > On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]>
> > wrote:
> >>
> >> CPU ports can't be bridged anyway
> >>
> >> Regards,
> >> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
> >
> > The fact that CPU ports can't be bridged is already not ideal.
> > One can have a DSA switch with cascaded switches on each port, so it
> > acts like N DSA masters (not as DSA links, since the taggers are
> > incompatible), with each switch forming its own tree. It is desirable
> > that the ports of the DSA switch on top are bridged, so that
> > forwarding between cascaded switches does not pass through the CPU.
>
> Oh, I see. But currently DSA infra forbids the adding DSA masters to
> bridges IIRC. Can't name it good or bad decision, but was introduced
> to prevent accidental packet flow breaking on DSA setups.
>
I just wanted to point out that some people are going to be looking at
ways by which the ETH_P_XDSA handler can be made to play nice with the
master's rx_handler, and that it would be nice to at least not make
the limitation worse than it is by converting everything to
rx_handlers (which "currently" can't be stacked, from the comments in
netdevice.h).
> > -Vladimir
>
> Regards,
> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
On 1/13/20 2:28 AM, Vladimir Oltean wrote:
> On Mon, 13 Jan 2020 at 11:46, Alexander Lobakin <[email protected]> wrote:
>>
>> Vladimir Oltean wrote 13.01.2020 12:42:
>>> Hi Alexander,
>>>
>>> On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]>
>>> wrote:
>>>>
>>>> CPU ports can't be bridged anyway
>>>>
>>>> Regards,
>>>> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
>>>
>>> The fact that CPU ports can't be bridged is already not ideal.
>>> One can have a DSA switch with cascaded switches on each port, so it
>>> acts like N DSA masters (not as DSA links, since the taggers are
>>> incompatible), with each switch forming its own tree. It is desirable
>>> that the ports of the DSA switch on top are bridged, so that
>>> forwarding between cascaded switches does not pass through the CPU.
>>
>> Oh, I see. But currently DSA infra forbids the adding DSA masters to
>> bridges IIRC. Can't name it good or bad decision, but was introduced
>> to prevent accidental packet flow breaking on DSA setups.
>>
>
> I just wanted to point out that some people are going to be looking at
> ways by which the ETH_P_XDSA handler can be made to play nice with the
> master's rx_handler, and that it would be nice to at least not make
> the limitation worse than it is by converting everything to
> rx_handlers (which "currently" can't be stacked, from the comments in
> netdevice.h).
I am not sure this would change the situation much; today we cannot have
anything but switch tags travel on the DSA master network device.
Whether we accomplish the RX tap through a special skb->protocol value
or via an rx_handler probably does not matter functionally, but it
could change the performance.
--
Florian
Florian Fainelli wrote 15.01.2020 00:56:
> On 1/13/20 2:28 AM, Vladimir Oltean wrote:
>> On Mon, 13 Jan 2020 at 11:46, Alexander Lobakin <[email protected]>
>> wrote:
>>>
>>> Vladimir Oltean wrote 13.01.2020 12:42:
>>>> Hi Alexander,
>>>>
>>>> On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]>
>>>> wrote:
>>>>>
>>>>> CPU ports can't be bridged anyway
>>>>>
>>>>> Regards,
>>>>> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
>>>>
>>>> The fact that CPU ports can't be bridged is already not ideal.
>>>> One can have a DSA switch with cascaded switches on each port, so it
>>>> acts like N DSA masters (not as DSA links, since the taggers are
>>>> incompatible), with each switch forming its own tree. It is
>>>> desirable
>>>> that the ports of the DSA switch on top are bridged, so that
>>>> forwarding between cascaded switches does not pass through the CPU.
>>>
>>> Oh, I see. But currently DSA infra forbids the adding DSA masters to
>>> bridges IIRC. Can't name it good or bad decision, but was introduced
>>> to prevent accidental packet flow breaking on DSA setups.
>>>
>>
>> I just wanted to point out that some people are going to be looking at
>> ways by which the ETH_P_XDSA handler can be made to play nice with the
>> master's rx_handler, and that it would be nice to at least not make
>> the limitation worse than it is by converting everything to
>> rx_handlers (which "currently" can't be stacked, from the comments in
>> netdevice.h).
>
> I am not sure this would change the situation much, today we cannot
> have
> anything but switch tags travel on the DSA master network device,
> whether we accomplish the RX tap through a special skb->protocol value
> or via rx_handler, it probably does not functionally matter, but it
> could change the performance.
As for now, I think we should keep this RFC as it is, so that
developers working with different DSA switches can test it or
implement GRO offload for other taggers like DSA and EDSA, *but*
any future work on this should come only once we revise/reimagine
the basic DSA packet flow. We already know (at least Florian and I
can reproduce it well) that the current path through unlikely branches
in eth_type_trans() and frame capturing through packet_type is so
suboptimal that it nearly destroys overall performance on several
setups.
Switching to net_device::rx_handler() is just one of the possible
variants; I'm sure we'll find the best solution together.
Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
Alexander Lobakin wrote 15.01.2020 10:38:
> Florian Fainelli wrote 15.01.2020 00:56:
>> On 1/13/20 2:28 AM, Vladimir Oltean wrote:
>>> On Mon, 13 Jan 2020 at 11:46, Alexander Lobakin <[email protected]>
>>> wrote:
>>>>
>>>> Vladimir Oltean wrote 13.01.2020 12:42:
>>>>> Hi Alexander,
>>>>>
>>>>> On Mon, 13 Jan 2020 at 11:22, Alexander Lobakin <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> CPU ports can't be bridged anyway
>>>>>>
>>>>>> Regards,
>>>>>> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
>>>>>
>>>>> The fact that CPU ports can't be bridged is already not ideal.
>>>>> One can have a DSA switch with cascaded switches on each port, so
>>>>> it
>>>>> acts like N DSA masters (not as DSA links, since the taggers are
>>>>> incompatible), with each switch forming its own tree. It is
>>>>> desirable
>>>>> that the ports of the DSA switch on top are bridged, so that
>>>>> forwarding between cascaded switches does not pass through the CPU.
>>>>
>>>> Oh, I see. But currently DSA infra forbids the adding DSA masters to
>>>> bridges IIRC. Can't name it good or bad decision, but was introduced
>>>> to prevent accidental packet flow breaking on DSA setups.
>>>>
>>>
>>> I just wanted to point out that some people are going to be looking
>>> at
>>> ways by which the ETH_P_XDSA handler can be made to play nice with
>>> the
>>> master's rx_handler, and that it would be nice to at least not make
>>> the limitation worse than it is by converting everything to
>>> rx_handlers (which "currently" can't be stacked, from the comments in
>>> netdevice.h).
>>
>> I am not sure this would change the situation much, today we cannot
>> have
>> anything but switch tags travel on the DSA master network device,
>> whether we accomplish the RX tap through a special skb->protocol value
>> or via rx_handler, it probably does not functionally matter, but it
>> could change the performance.
>
> As for now, I think that we should keep this RFC as it is so
> developers working with different DSA switches could test it or
> implement GRO offload for other taggers like DSA and EDSA, *but*
> any future work on this should come only when we'll revise/reimagine
> basic DSA packet flow, as we already know (at least me and Florian
> reproduce it well) that the current path through unlikely branches
> in eth_type_trans() and frame capturing through packet_type is so
> suboptimal that nearly destroys overall performance on several
> setups.
Well, I had enough free time today to write and test a sort of
blueprint-like DSA via .rx_handler() to compare it with the current
flow and get at least a basic picture of what's going on.
I chose a 600 MHz UP MIPS system to make the difference more noticeable,
as more powerful systems tend to mask plenty of different "heavy"
corner cases and misses.
The Ethernet driver for the CPU port uses BQL and DIM, as well as
hardware TSO. A minimal GRO over DSA is also enabled. The codebase is
Linux 5.5-rc6. I use simple VLAN NAT (with nft flow offload), iperf3,
IPv4 + TCP.
Mainline DSA Rx processing, one flow:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 4.30 GBytes 615 Mbits/sec 2091 sender
[ 5] 0.00-60.01 sec 4.30 GBytes 615 Mbits/sec receiver
10 flows:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 414 MBytes 57.9 Mbits/sec 460 sender
[ 5] 0.00-60.01 sec 413 MBytes 57.7 Mbits/sec receiver
[ 7] 0.00-60.00 sec 392 MBytes 54.8 Mbits/sec 497 sender
[ 7] 0.00-60.01 sec 391 MBytes 54.6 Mbits/sec receiver
[ 9] 0.00-60.00 sec 391 MBytes 54.6 Mbits/sec 438 sender
[ 9] 0.00-60.01 sec 389 MBytes 54.4 Mbits/sec receiver
[ 11] 0.00-60.00 sec 383 MBytes 53.5 Mbits/sec 472 sender
[ 11] 0.00-60.01 sec 382 MBytes 53.4 Mbits/sec receiver
[ 13] 0.00-60.00 sec 404 MBytes 56.5 Mbits/sec 466 sender
[ 13] 0.00-60.01 sec 403 MBytes 56.3 Mbits/sec receiver
[ 15] 0.00-60.00 sec 453 MBytes 63.4 Mbits/sec 490 sender
[ 15] 0.00-60.01 sec 452 MBytes 63.1 Mbits/sec receiver
[ 17] 0.00-60.00 sec 461 MBytes 64.4 Mbits/sec 430 sender
[ 17] 0.00-60.01 sec 459 MBytes 64.2 Mbits/sec receiver
[ 19] 0.00-60.00 sec 365 MBytes 51.0 Mbits/sec 493 sender
[ 19] 0.00-60.01 sec 364 MBytes 50.9 Mbits/sec receiver
[ 21] 0.00-60.00 sec 407 MBytes 56.9 Mbits/sec 517 sender
[ 21] 0.00-60.01 sec 405 MBytes 56.7 Mbits/sec receiver
[ 23] 0.00-60.00 sec 486 MBytes 68.0 Mbits/sec 458 sender
[ 23] 0.00-60.01 sec 484 MBytes 67.7 Mbits/sec receiver
[SUM] 0.00-60.00 sec 4.06 GBytes 581 Mbits/sec 4721 sender
[SUM] 0.00-60.01 sec 4.04 GBytes 579 Mbits/sec receiver
.rx_handler(), one flow:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 4.40 GBytes 630 Mbits/sec 853 sender
[ 5] 0.00-60.01 sec 4.40 GBytes 630 Mbits/sec receiver
And 10:
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-60.00 sec 440 MBytes 61.5 Mbits/sec 551 sender
[ 5] 0.00-60.01 sec 439 MBytes 61.4 Mbits/sec receiver
[ 7] 0.00-60.00 sec 455 MBytes 63.6 Mbits/sec 496 sender
[ 7] 0.00-60.01 sec 454 MBytes 63.4 Mbits/sec receiver
[ 9] 0.00-60.00 sec 484 MBytes 67.7 Mbits/sec 532 sender
[ 9] 0.00-60.01 sec 483 MBytes 67.5 Mbits/sec receiver
[ 11] 0.00-60.00 sec 598 MBytes 83.6 Mbits/sec 452 sender
[ 11] 0.00-60.01 sec 596 MBytes 83.3 Mbits/sec receiver
[ 13] 0.00-60.00 sec 427 MBytes 59.7 Mbits/sec 539 sender
[ 13] 0.00-60.01 sec 426 MBytes 59.5 Mbits/sec receiver
[ 15] 0.00-60.00 sec 469 MBytes 65.5 Mbits/sec 466 sender
[ 15] 0.00-60.01 sec 467 MBytes 65.3 Mbits/sec receiver
[ 17] 0.00-60.00 sec 463 MBytes 64.7 Mbits/sec 472 sender
[ 17] 0.00-60.01 sec 462 MBytes 64.5 Mbits/sec receiver
[ 19] 0.00-60.00 sec 533 MBytes 74.5 Mbits/sec 447 sender
[ 19] 0.00-60.01 sec 532 MBytes 74.3 Mbits/sec receiver
[ 21] 0.00-60.00 sec 444 MBytes 62.1 Mbits/sec 527 sender
[ 21] 0.00-60.01 sec 443 MBytes 61.9 Mbits/sec receiver
[ 23] 0.00-60.00 sec 500 MBytes 69.9 Mbits/sec 449 sender
[ 23] 0.00-60.01 sec 499 MBytes 69.8 Mbits/sec receiver
[SUM] 0.00-60.00 sec 4.70 GBytes 673 Mbits/sec 4931 sender
[SUM] 0.00-60.01 sec 4.69 GBytes 671 Mbits/sec receiver
Pretty significant stats. This happens not only because we get rid of
the out-of-line unlikely() branches (which are natural performance
killers, at least on MIPS), but also because we don't need to call
netif_receive_skb() for the second time -- we can just return
RX_HANDLER_ANOTHER, and the Rx path then becomes not much longer than
in the case of simple VLAN tag removal (_net/core/dev.c:5056_).
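To make the RX_HANDLER_ANOTHER point concrete, here is a userspace model (not kernel code; all names, the fake skb, and the tag layout are invented for illustration) of why the retargeting loop is cheap: the handler strips the tag, points the skb at the slave device, and the core loop simply goes around again instead of re-entering the full receive path.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the rx_handler dispatch loop; mirrors the idea of the
 * "another_round" path in __netif_receive_skb_core(), nothing more.
 */
enum rx_handler_result {
	RX_HANDLER_CONSUMED,
	RX_HANDLER_ANOTHER,	/* skb->dev was changed, loop again */
	RX_HANDLER_PASS,
};

struct fake_skb {
	int dev;		/* 0 = DSA master, >0 = slave port netdev */
	unsigned int len;
	unsigned char data[64];
};

#define TAG_LEN 2		/* illustrative 2-byte tag */

/* DSA-style handler: pop the tag, retarget the skb to the slave. */
static enum rx_handler_result dsa_rx_handler(struct fake_skb *skb)
{
	if (skb->dev != 0)
		return RX_HANDLER_PASS;	/* already retargeted */

	int port = skb->data[1] & 0x07;	/* made-up tag layout */
	memmove(skb->data, skb->data + TAG_LEN, skb->len - TAG_LEN);
	skb->len -= TAG_LEN;
	skb->dev = port + 1;
	return RX_HANDLER_ANOTHER;
}

/* The core RX loop: RX_HANDLER_ANOTHER just means "go around again",
 * with no second trip through the whole receive path.
 */
static int receive(struct fake_skb *skb)
{
	enum rx_handler_result res;

	do {
		res = dsa_rx_handler(skb);
	} while (res == RX_HANDLER_ANOTHER);

	return skb->dev;	/* deliver on this netdev */
}
```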
This should get more attention and tests on a wide variety of other
systems, of course.
> Switching to net_device::rx_handler() is just one of all the possible
> variants, I'm sure we'll find the best solution together.
>
> Regards,
> ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ
Regards,
ᚷ ᛖ ᚢ ᚦ ᚠ ᚱ