2019-05-01 21:04:38

by Lennart Sorensen

Subject: i40e X722 RSS problem with NAT-Traversal IPsec packets

We have hit a strange problem with RSS on the X722 on our new servers
(S2600WFT based).

The RSS hash is distributing most packets across cores quite nicely, with
one exception. ESP encapsulated in UDP is always going to queue 0 no
matter what the hash key is set to or how many cores have queues assigned.
So if terminating IPsec tunnels that are using NAT-Traversal, all packets
arrive on the same core, which clearly isn't good for scalability.
Other UDP packets are fine, TCP is fine, ICMP, ESP, etc have no problem
that we have seen, only the ESP in UDP packets.

Given that these are UDP packets, I would have hoped they would simply
be distributed using the source and destination IP and port values, as
other UDP packets seem to be, but they are not. I vaguely suspect the
card's UDP tunnel handling is responsible: it claims to use the inner
packet's values for RSS, rather than the UDP packet itself, for certain
supported types of UDP-encapsulated IP traffic, but ESP in UDP is not
one of them. So perhaps it sees an IP packet inside a UDP packet, tries
to parse it, doesn't know how to handle it, and stops without assigning
any RSS value to the packet at all rather than falling back to treating
it as a plain UDP packet. But that's just guessing based on the
documentation of the hardware capabilities.
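
To make the expectation above concrete, here is a minimal sketch of how a
Toeplitz-style RSS hash over the outer UDP 4-tuple would spread such packets.
It is an illustration only: the key is the widely published RSS verification
key rather than whatever key the driver programs, and the input layout, the
table size (128) and the queue count (8) are assumptions, not values read
from the X722.

```python
import socket
import struct

# Widely published 40-byte RSS verification key (an assumption here; the
# key actually programmed into a NIC can be inspected with "ethtool -x").
RSS_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0"
    "d0ca2bcbae7b30b477cb2da38030f20c"
    "6a42b73bbeac01fa"
)


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return the 32-bit Toeplitz hash of data under key."""
    key_bits = len(key) * 8
    if key_bits < len(data) * 8 + 32:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        # For every set input bit, XOR in the 32-bit key window starting at bit i.
        if data[i // 8] & (0x80 >> (i % 8)):
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def udp4_tuple(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Hash input for IPv4/UDP: source IP, destination IP, source port, destination port."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


def queue_for_hash(hash_value: int, num_queues: int = 8, table_size: int = 128) -> int:
    """Look the hash up in a default-style indirection table (entry i -> i % num_queues)."""
    indir = [i % num_queues for i in range(table_size)]
    return indir[hash_value % table_size]


if __name__ == "__main__":
    # Same ports as the capture below, varying only the client address.
    for last in range(2, 10):
        src = f"1.99.99.{last}"
        h = toeplitz_hash(RSS_KEY, udp4_tuple(src, 4500, "1.99.99.1", 4500))
        print(src, hex(h), "-> queue", queue_for_hash(h))
```

If the hardware produced no hash at all (a value of 0), the lookup would
always return the first table entry, which is queue 0 with a default table.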

Here is an example of a packet that always hits queue 0:

14:48:09.014360 54:ee:75:30:f1:e1 > a4:bf:01:4e:0c:87, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 3312, offset 0, flags [DF], proto UDP (17), length 160)
1.99.99.2.4500 > 1.99.99.1.4500: [no cksum] UDP-encap: ESP(spi=0xac11cadf,seq=0x480), length 132
0x0000: 4500 00a0 0cf0 4000 4011 6494 0163 6302 E.....@.@.d..cc.
0x0010: 0163 6301 1194 1194 008c 0000 ac11 cadf .cc.............
0x0020: 0000 0480 901d 3b39 e884 0616 fed4 3e37 ......;9......>7
0x0030: bb67 bca2 adac e519 c7a9 ced9 00bf 263e .g............&>
0x0040: 28a6 ba38 1e8c e6e3 bbf9 e093 1c49 8154 (..8.........I.T
0x0050: 0d66 c1d5 2416 f4d2 26ec f5a1 773f 4ae2 .f..$...&...w?J.
0x0060: 8e26 0ed8 0e5f daab 06b2 aa51 2f2f e16e .&..._.....Q//.n
0x0070: 22ca dd94 f499 027b 11d0 de7b 4d9d 7af1 "......{...{M.z.
0x0080: f468 ae0d ad41 5c96 577d 7b44 1cc4 0ba3 .h...A\.W}{D....
0x0090: 9ff7 142f b159 c9d0 38e1 c460 120f f4bb .../.Y..8..`....
14:48:09.014439 a4:bf:01:4e:0c:87 > 54:ee:75:30:f1:e1, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 43796, offset 0, flags [none], proto UDP (17), length 160)
1.99.99.1.4500 > 1.99.99.2.4500: [no cksum] UDP-encap: ESP(spi=0x47f5919c,seq=0x480), length 132
0x0000: 4500 00a0 ab14 0000 4011 0670 0163 6301 E.......@..p.cc.
0x0010: 0163 6302 1194 1194 008c 0000 47f5 919c .cc.........G...
0x0020: 0000 0480 106b cafb 14ee f75b 3533 16fb .....k.....[53..
0x0030: 87f5 9d90 a73b 8daf 481f 22b7 2b30 b482 .....;..H.".+0..
0x0040: a330 1fe4 59da a394 b48b ac77 5a96 dfac .0..Y......wZ...
0x0050: 4798 793a ca7e 1af2 a9a8 2f7b 9327 d5b9 G.y:.~..../{.'..
0x0060: f8d0 e761 c7b3 a85c c843 ec25 62b2 e083 ...a...\.C.%b...
0x0070: f0d5 1097 736b 051a b15d e7de 7f0e b5b7 ....sk...]......
0x0080: 209b 4d1d af37 c1a1 09a0 a6c9 71cf 7d54 ..M..7......q.}T
0x0090: 55c3 2797 e622 581f 09cf 9483 2ba5 e64a U.'.."X.....+..J

This was done on 4.19.28 kernel with the i40e driver in that kernel with
libreswan for IPsec using netkey in the kernel and nat-traversal in use.
The packets are a ping echo and reply pair. NVM version 3.49 and 4.00
tried so far.

No other network interfaces we have used have had this problem. RSS has
always just worked until now.

--
Len Sorensen


2019-05-01 22:54:18

by Alexander Duyck

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Wed, May 1, 2019 at 2:03 PM Lennart Sorensen
<[email protected]> wrote:
>
> We have hit a strange problem with RSS on the X722 on our new servers
> (S2600WFT based).
>
> The RSS hash is distributing most packets across cores quite nicely, with
> one exception. ESP encapsulated in UDP is always going to queue 0 no
> matter what the hash key is set to or how many cores have queues assigned.
> So if terminating IPsec tunnels that are using NAT-Traversal, all packets
> arrive on the same core, which clearly isn't good for scalability.
> Other UDP packets are fine, TCP is fine, ICMP, ESP, etc have no problem
> that we have seen, only the ESP in UDP packets.
>
> Given the packets are UDP packets I would have hoped they would just
> be distributed using the source and destination ip and port values as
> other UDP packets seem to be, but they are not. I vaguely suspect the
> UDP tunnel handling support the card has for this since it claims to
> use the internal packet's values for RSS rather than the UDP packet
> itself for certain supported types of UDP encapsulated IP traffic, but
> not ESP in UDP, so perhaps it sees an IP packet inside a UDP packet,
> and decides to try and parse it instead, doesn't know how to handle it
> and stops without assigning any RSS value to the packet at all rather
> than falling back to treating it as a plain UDP packet. But that's just
> guessing based on the documentation of the hardware capabilities.
>
> Here is an example of a packet that always hits queue 0:
>
> 14:48:09.014360 54:ee:75:30:f1:e1 > a4:bf:01:4e:0c:87, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 3312, offset 0, flags [DF], proto UDP (17), length 160)
> 1.99.99.2.4500 > 1.99.99.1.4500: [no cksum] UDP-encap: ESP(spi=0xac11cadf,seq=0x480), length 132
> 0x0000: 4500 00a0 0cf0 4000 4011 6494 0163 6302 E.....@.@.d..cc.
> 0x0010: 0163 6301 1194 1194 008c 0000 ac11 cadf .cc.............
> 0x0020: 0000 0480 901d 3b39 e884 0616 fed4 3e37 ......;9......>7
> 0x0030: bb67 bca2 adac e519 c7a9 ced9 00bf 263e .g............&>
> 0x0040: 28a6 ba38 1e8c e6e3 bbf9 e093 1c49 8154 (..8.........I.T
> 0x0050: 0d66 c1d5 2416 f4d2 26ec f5a1 773f 4ae2 .f..$...&...w?J.
> 0x0060: 8e26 0ed8 0e5f daab 06b2 aa51 2f2f e16e .&..._.....Q//.n
> 0x0070: 22ca dd94 f499 027b 11d0 de7b 4d9d 7af1 "......{...{M.z.
> 0x0080: f468 ae0d ad41 5c96 577d 7b44 1cc4 0ba3 .h...A\.W}{D....
> 0x0090: 9ff7 142f b159 c9d0 38e1 c460 120f f4bb .../.Y..8..`....
> 14:48:09.014439 a4:bf:01:4e:0c:87 > 54:ee:75:30:f1:e1, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 43796, offset 0, flags [none], proto UDP (17), length 160)
> 1.99.99.1.4500 > 1.99.99.2.4500: [no cksum] UDP-encap: ESP(spi=0x47f5919c,seq=0x480), length 132
> 0x0000: 4500 00a0 ab14 0000 4011 0670 0163 6301 E.......@..p.cc.
> 0x0010: 0163 6302 1194 1194 008c 0000 47f5 919c .cc.........G...
> 0x0020: 0000 0480 106b cafb 14ee f75b 3533 16fb .....k.....[53..
> 0x0030: 87f5 9d90 a73b 8daf 481f 22b7 2b30 b482 .....;..H.".+0..
> 0x0040: a330 1fe4 59da a394 b48b ac77 5a96 dfac .0..Y......wZ...
> 0x0050: 4798 793a ca7e 1af2 a9a8 2f7b 9327 d5b9 G.y:.~..../{.'..
> 0x0060: f8d0 e761 c7b3 a85c c843 ec25 62b2 e083 ...a...\.C.%b...
> 0x0070: f0d5 1097 736b 051a b15d e7de 7f0e b5b7 ....sk...]......
> 0x0080: 209b 4d1d af37 c1a1 09a0 a6c9 71cf 7d54 ..M..7......q.}T
> 0x0090: 55c3 2797 e622 581f 09cf 9483 2ba5 e64a U.'.."X.....+..J
>
> This was done on 4.19.28 kernel with the i40e driver in that kernel with
> libreswan for IPsec using netkey in the kernel and nat-traversal in use.
> The packets are a ping echo and reply pair. NVM version 3.49 and 4.00
> tried so far.
>
> No other network interfaces we have used have had this problem. RSS has
> always just worked until now.
>
> --
> Len Sorensen

I'm not sure how much RSS will do for you here. Basically the source IP
address is your only source of entropy when it comes to RSS, since the
destination IP should always be the same if you are performing a server
role and terminating packets on the local system. As for the ports, in
your example you seem to be using only 4500 for both the source and the
destination.

In your testing are you only looking at a point to point connection
between two systems, or do you have multiple systems accessing the
system you are testing? I ask as the only way this should do any
traffic spreading via RSS would be if the source IPs are different and
that would require multiple client systems accessing the server.

In the case of other encapsulation types over UDP, such as VXLAN, I
know that a hash value is stored in the UDP source port location
instead of the true source port number. This allows the RSS hashing to
occur on this extra information which would allow for a greater
diversity in hash results. Depending on how you are generating the ESP
encapsulation you might look at seeing if it would be possible to have
a hash on the inner data used as the UDP source port in the outgoing
packets. This would help to resolve this sort of issue.
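
The idea in the last paragraph can be sketched as follows, assuming the
sender is free to choose its UDP source port. The function name, the CRC32
stand-in hash and the port range are all assumptions for illustration; the
multiply-and-shift scaling is the kind of mapping used for UDP tunnel source
ports, but this is not kernel code.

```python
import zlib


def flow_source_port(inner_flow: bytes,
                     port_min: int = 49152,
                     port_max: int = 65535) -> int:
    """Map a hash of the inner flow onto a source port in [port_min, port_max)."""
    # CRC32 is only a stand-in for whatever flow hash the sender has available.
    flow_hash = zlib.crc32(inner_flow) & 0xFFFFFFFF
    # Scale the 32-bit hash into the port range without a modulo.
    return port_min + ((flow_hash * (port_max - port_min)) >> 32)


if __name__ == "__main__":
    # Hypothetical inner flow identifiers; any stable per-flow bytes would do.
    for flow in (b"10.0.0.1>10.0.1.1", b"10.0.0.2>10.0.1.1"):
        print(flow, flow_source_port(flow))
```

For NAT-Traversal specifically this may not be applicable, since RFC 3948
requires ESP-in-UDP packets to use the same ports as the IKE traffic.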

Thanks.

- Alex

2019-05-02 15:13:31

by Lennart Sorensen

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Wed, May 01, 2019 at 03:52:57PM -0700, Alexander Duyck wrote:
> I'm not sure how RSS will do much for you here. Basically you only
> have the source IP address as your only source of entropy when it
> comes to RSS since the destination IP should always be the same if you
> are performing a server role and terminating packets on the local
> system and as far as the ports in your example you seem to only be
> using 4500 for both the source and the destination.

I have thousands of IPsec clients connecting. Simply treating them as
normal UDP packets would work. The IP address is different, and often
the port too.

> In your testing are you only looking at a point to point connection
> between two systems, or do you have multiple systems accessing the
> system you are testing? I ask as the only way this should do any
> traffic spreading via RSS would be if the source IPs are different and
> that would require multiple client systems accessing the server.

I tried changing the client IP address and the RSS hash key. It never
changed to another queue. Something is broken.

> In the case of other encapsulation types over UDP, such as VXLAN, I
> know that a hash value is stored in the UDP source port location
> instead of the true source port number. This allows the RSS hashing to
> occur on this extra information which would allow for a greater
> diversity in hash results. Depending on how you are generating the ESP
> encapsulation you might look at seeing if it would be possible to have
> a hash on the inner data used as the UDP source port in the outgoing
> packets. This would help to resolve this sort of issue.

Well it works on every other network card except this one. Every other
intel card in the past we have used had no problem doing this right.

You want all the packets for a given ipsec tunnel to go to the same queue.
That is not a problem here. What you don't want is every ipsec packet
from everyone going to the same queue (always queue 0). So simply
treating them as UDP packets with a source and destination IP and port
would work perfectly fine. The X722 isn't doing that. It is always
assigning a hash value of 0 to these packets.

--
Len Sorensen

2019-05-02 17:04:56

by Alexander Duyck

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 2, 2019 at 8:11 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Wed, May 01, 2019 at 03:52:57PM -0700, Alexander Duyck wrote:
> > I'm not sure how RSS will do much for you here. Basically you only
> > have the source IP address as your only source of entropy when it
> > comes to RSS since the destination IP should always be the same if you
> > are performing a server role and terminating packets on the local
> > system and as far as the ports in your example you seem to only be
> > using 4500 for both the source and the destination.
>
> I have thousands of IPsec clients connecting. Simply treating them as
> normal UDP packets would work. The IP address is different, and often
> the port too.

Thanks for the clarification. I just wanted to verify, since we have
had similar complaints in the past and it turned out those were only
using one set of IP addresses.

> > In your testing are you only looking at a point to point connection
> > between two systems, or do you have multiple systems accessing the
> > system you are testing? I ask as the only way this should do any
> > traffic spreading via RSS would be if the source IPs are different and
> > that would require multiple client systems accessing the server.
>
> I tried changing the client IP address and the RSS hash key. It never
> changed to another queue. Something is broken.

Okay, so if changing the RSS hash key has no effect then it is likely
not being used.

> > In the case of other encapsulation types over UDP, such as VXLAN, I
> > know that a hash value is stored in the UDP source port location
> > instead of the true source port number. This allows the RSS hashing to
> > occur on this extra information which would allow for a greater
> > diversity in hash results. Depending on how you are generating the ESP
> > encapsulation you might look at seeing if it would be possible to have
> > a hash on the inner data used as the UDP source port in the outgoing
> > packets. This would help to resolve this sort of issue.
>
> Well it works on every other network card except this one. Every other
> intel card in the past we have used had no problem doing this right.

The question is what is different about this card, and I don't have an
immediate answer so we would need to do some investigation.

> You want all the packets for a given ipsec tunnel to go to the same queue.
> That is not a problem here. What you don't want is every ipsec packet
> from everyone going to the same queue (always queue 0). So simply
> treating them as UDP packets with a source and destination IP and port
> would work perfectly fine. The X722 isn't doing that. It is always
> assigning a hash value of 0 to these packets.

You had stated in your earlier email that "Other UDP packets are
fine". Perhaps we need to do some further isolation to identify why
the ESP over UDP packets are not being hashed on while other UDP
packets are.

Would it be possible to provide a couple of raw Ethernet frames
instead of IP packets for us to examine? I noticed the two packets you
sent earlier didn't start until the IP header. One possibility would
be that if we had any extra outer headers or trailers added to the
packet that could possibly cause issues since that might either make
the packet not parsable or possibly flag it as some sort of length
error when the size of the packet doesn't match what is reported in
the headers.

One other thing we may want to look at doing is trying to identify the
particular part of the packets that might be causing the hash to not
be generated. One way to do that would be to use something like
netperf to generate packets and send them toward your test system.
Something like the command line below could be used to send packets
that should be similar to the ones you provided earlier:
netperf -H <target IP> -t UDP_STREAM -N -- -P 4500,4500 -m 132

If the packets generated by netperf are not hashed, that would tell us
there may be some sort of issue with how UDP packets are being parsed,
and from there we could narrow things down by modifying port numbers
and changing packet sizes. If they do get hashed, then we need to start
looking outside of the IP/UDP header parsing, since something else is
likely causing the issue.

Thanks.

- Alex

2019-05-02 17:18:23

by Lennart Sorensen

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 02, 2019 at 10:03:23AM -0700, Alexander Duyck wrote:
> On Thu, May 2, 2019 at 8:11 AM Lennart Sorensen
> <[email protected]> wrote:
> >
> > On Wed, May 01, 2019 at 03:52:57PM -0700, Alexander Duyck wrote:
> > > I'm not sure how RSS will do much for you here. Basically you only
> > > have the source IP address as your only source of entropy when it
> > > comes to RSS since the destination IP should always be the same if you
> > > are performing a server role and terminating packets on the local
> > > system and as far as the ports in your example you seem to only be
> > > using 4500 for both the source and the destination.
> >
> > I have thousands of IPsec clients connecting. Simply treating them as
> > normal UDP packets would work. The IP address is different, and often
> > the port too.
>
> Thanks for the clarification. I just wanted to verify that I know we
> have had similar complaints in the past and it turns out those were
> only using one set of IP addresses.
>
> > > In your testing are you only looking at a point to point connection
> > > between two systems, or do you have multiple systems accessing the
> > > system you are testing? I ask as the only way this should do any
> > > traffic spreading via RSS would be if the source IPs are different and
> > > that would require multiple client systems accessing the server.
> >
> > I tried changing the client IP address and the RSS hash key. It never
> > changed to another queue. Something is broken.
>
> Okay, so if changing the RSS hash key has no effect then it is likely
> not being used.
>
> > > In the case of other encapsulation types over UDP, such as VXLAN, I
> > > know that a hash value is stored in the UDP source port location
> > > instead of the true source port number. This allows the RSS hashing to
> > > occur on this extra information which would allow for a greater
> > > diversity in hash results. Depending on how you are generating the ESP
> > > encapsulation you might look at seeing if it would be possible to have
> > > a hash on the inner data used as the UDP source port in the outgoing
> > > packets. This would help to resolve this sort of issue.
> >
> > Well it works on every other network card except this one. Every other
> > intel card in the past we have used had no problem doing this right.
>
> The question is what is different about this card, and I don't have an
> immediate answer so we would need to do some investigation.

I think the firmware has a bug. :) My first email has my speculation
of where the bug could be.

> You had stated in your earlier email that "Other UDP packets are
> fine". Perhaps we need to do some further isolation to identify why
> the ESP over UDP packets are not being hashed on while other UDP
> packets are.

Well they are IP packets encapsulated in UDP, while other UDP packets
are not IP packets encapsulated in UDP, and there is special handling
for some IP types inside UDP on this card, which is an unusual feature.
For the supported IP in UDP types, it actually is supposed to use the IP
packet inside the UDP packet to generate the RSS value, so it pretends it
wasn't even encapsulated. But it does not handle ESP in UDP specifically,
and hence I suspect that is the problem. I think it tries to handle the
IP in UDP and since it doesn't support ESP in UDP it fails to fall back
to using the original UDP packet for the RSS value. That would at least
explain why regular UDP packets that don't contain an IP packet inside
are fine, but this particular type of packet is being handled wrong.

> Would it be possible to provide a couple of raw Ethernet frames
> instead of IP packets for us to examine? I noticed the two packets you
> sent earlier didn't start until the IP header. One possibility would
> be that if we had any extra outer headers or trailers added to the
> packet that could possibly cause issues since that might either make
> the packet not parsable or possibly flag it as some sort of length
> error when the size of the packet doesn't match what is reported in
> the headers.

Oh did I forget the option for that? I can try and capture some today
with the full headers.

> One other thing we may want to look at doing is trying to identify the
> particular part of the packets that might be causing the hash to not
> be generated. One way to do that would be to use something like
> netperf to generate packets and send them toward your test system.
> Something like the command line below could be used to send packets
> that should be similar to the ones you provided earlier:
> netperf -H <target IP> -t UDP_STREAM -N -- -P 4500,4500 -m 132
>
> If the packets generated by netperf were not hashed that would tell us
> then it may be some sort of issue with how UDP packets are being
> parsed, and from there we could narrow things down by modifying port
> numbers and changing packet sizes. If that does get hashed then we
> need to start looking outside of the IP/UDP header parsing for
> possible issues since there is likely something else causing the
> issue.

I will see what I can do with that.

--
Len Sorensen

2019-05-02 17:29:50

by Alexander Duyck

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 2, 2019 at 10:16 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Thu, May 02, 2019 at 10:03:23AM -0700, Alexander Duyck wrote:
> > On Thu, May 2, 2019 at 8:11 AM Lennart Sorensen
> > <[email protected]> wrote:
> > >
> > > On Wed, May 01, 2019 at 03:52:57PM -0700, Alexander Duyck wrote:
> > > > I'm not sure how RSS will do much for you here. Basically you only
> > > > have the source IP address as your only source of entropy when it
> > > > comes to RSS since the destination IP should always be the same if you
> > > > are performing a server role and terminating packets on the local
> > > > system and as far as the ports in your example you seem to only be
> > > > using 4500 for both the source and the destination.
> > >
> > > I have thousands of IPsec clients connecting. Simply treating them as
> > > normal UDP packets would work. The IP address is different, and often
> > > the port too.
> >
> > Thanks for the clarification. I just wanted to verify that I know we
> > have had similar complaints in the past and it turns out those were
> > only using one set of IP addresses.
> >
> > > > In your testing are you only looking at a point to point connection
> > > > between two systems, or do you have multiple systems accessing the
> > > > system you are testing? I ask as the only way this should do any
> > > > traffic spreading via RSS would be if the source IPs are different and
> > > > that would require multiple client systems accessing the server.
> > >
> > > I tried changing the client IP address and the RSS hash key. It never
> > > changed to another queue. Something is broken.
> >
> > Okay, so if changing the RSS hash key has no effect then it is likely
> > not being used.
> >
> > > > In the case of other encapsulation types over UDP, such as VXLAN, I
> > > > know that a hash value is stored in the UDP source port location
> > > > instead of the true source port number. This allows the RSS hashing to
> > > > occur on this extra information which would allow for a greater
> > > > diversity in hash results. Depending on how you are generating the ESP
> > > > encapsulation you might look at seeing if it would be possible to have
> > > > a hash on the inner data used as the UDP source port in the outgoing
> > > > packets. This would help to resolve this sort of issue.
> > >
> > > Well it works on every other network card except this one. Every other
> > > intel card in the past we have used had no problem doing this right.
> >
> > The question is what is different about this card, and I don't have an
> > immediate answer so we would need to do some investigation.
>
> I think the firmware has a bug. :) My first email has my speculation
> of where the bug could be.

The thing is the firmware has to have some idea what it is dealing
with. As far as I know, port number 4500 is not being auto-flagged as
any special type. In the case of the other tunnel types such as VXLAN,
NVGRE, and GENEVE, the driver has to set a port value indicating that
the port will receive special handling. If it isn't added via
i40e_udp_tunnel_add then the firmware/hardware shouldn't know anything
about the tunnel.
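
A rough model of the behaviour described above, for illustration only: the
parser consults a table of registered tunnel ports and treats everything
else as plain UDP. The class and function names are hypothetical and do not
correspond to driver or firmware code; 4789 and 6081 are the IANA-assigned
VXLAN and GENEVE ports.

```python
class TunnelPortTable:
    """Hypothetical stand-in for the set of UDP ports the driver has registered."""

    def __init__(self):
        self._ports = {}

    def add(self, port: int, tunnel_type: str) -> None:
        self._ports[port] = tunnel_type

    def remove(self, port: int) -> None:
        self._ports.pop(port, None)

    def classify(self, dst_port: int) -> str:
        # Ports that were never registered are not treated as tunnels.
        return self._ports.get(dst_port, "plain-udp")


def rss_fields(table: TunnelPortTable, dst_port: int) -> str:
    """Which headers a parser following this model would hash on."""
    if table.classify(dst_port) == "plain-udp":
        return "outer IP/UDP 4-tuple"
    return "inner headers"


if __name__ == "__main__":
    table = TunnelPortTable()
    table.add(4789, "vxlan")
    table.add(6081, "geneve")
    for port in (4789, 6081, 4500):
        print(port, table.classify(port), "->", rss_fields(table, port))
```

Under this model a packet to port 4500 would be hashed on its outer headers,
which is why the observed behaviour is surprising.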

> > You had stated in your earlier email that "Other UDP packets are
> > fine". Perhaps we need to do some further isolation to identify why
> > the ESP over UDP packets are not being hashed on while other UDP
> > packets are.
>
> Well they are IP packets encapsulated in UDP, while other UDP packets
> are not IP packets encapsulated in UDP, and there is special handling
> for some IP types inside UDP on this card, which is an unusual feature.

It really isn't that unusual of a feature. Many NICs have this
functionality now. In order to support it we usually have to populate
the port values for the device so the internal parser knows to expect
them.

> For the supported IP in UDP types, it actually is supposed to use the IP
> packet inside the UDP packet to generate the RSS value, so it pretends it
> wasn't even encapsulated. But it does not handle ESP in UDP specifically,
> and hence I suspect that is the problem. I think it tries to handle the
> IP in UDP and since it doesn't support ESP in UDP it fails to fall back
> to using the original UDP packet for the RSS value. That would at least
> explain why regular UDP packets that don't contain an IP packet inside
> are fine, but this particular type of packet is being handled wrong.

That is one of the reasons I suggested testing with netperf as I did
below. Basically if we construct all the outer headers the same as
your packet we can see if some specific combination is causing a
parsing issue. I tested the netperf approach on an XL710 and didn't
see any issues, but perhaps the X722 is doing something differently.

> > Would it be possible to provide a couple of raw Ethernet frames
> > instead of IP packets for us to examine? I noticed the two packets you
> > sent earlier didn't start until the IP header. One possibility would
> > be that if we had any extra outer headers or trailers added to the
> > packet that could possibly cause issues since that might either make
> > the packet not parsable or possibly flag it as some sort of length
> > error when the size of the packet doesn't match what is reported in
> > the headers.
>
> Oh did I forget the option for that? I can try and capture some today
> with the full headers.

Thanks. If nothing else it should make it possible to just use
tcpreplay if needed to reproduce the issue.

> > One other thing we may want to look at doing is trying to identify the
> > particular part of the packets that might be causing the hash to not
> > be generated. One way to do that would be to use something like
> > netperf to generate packets and send them toward your test system.
> > Something like the command line below could be used to send packets
> > that should be similar to the ones you provided earlier:
> > netperf -H <target IP> -t UDP_STREAM -N -- -P 4500,4500 -m 132
> >
> > If the packets generated by netperf were not hashed that would tell us
> > then it may be some sort of issue with how UDP packets are being
> > parsed, and from there we could narrow things down by modifying port
> > numbers and changing packet sizes. If that does get hashed then we
> > need to start looking outside of the IP/UDP header parsing for
> > possible issues since there is likely something else causing the
> > issue.
>
> I will see what I can do with that.
>
> --
> Len Sorensen

2019-05-02 17:57:53

by Lennart Sorensen

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 02, 2019 at 10:28:22AM -0700, Alexander Duyck wrote:
> The thing is the firmware has to have some idea what it is dealing
> with. As far as I know I don't believe port number 4500 is being
> auto-flagged as any special type. In the case of the other tunnel
> types such as VXLAN, NVGRE, and GENEVE the driver has to set a port
> value indicating that the port will receive special handling. If it
> isn't added via i40e_udp_tunnel_add then the firmware/hardware
> shouldn't know anything about the tunnel.

Well that makes some sense. I was wondering why there didn't seem to
be an on/off switch for that feature.

> It really isn't that unusual of a feature. Many NICs have this
> functionality now. In order to support it we usually have to populate
> the port values for the device so the internal parser knows to expect
> them.
>
> That is one of the reasons I suggested testing with netperf as I did
> below. Basically if we construct all the outer headers the same as
> your packet we can see if some specific combination is causing a
> parsing issue. I tested the netperf approach on an XL710 and didn't
> see any issues, but perhaps the X722 is doing something differently.
>
> Thanks. If nothing else it should make it possible to just use
> tcpreplay if needed to reproduce the issue.

Here are the same packets as before with the link level header included
(I forgot to use -XX rather than -X):

13:43:49.081567 54:ee:75:30:f1:e1 > a4:bf:01:4e:0c:87, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 21783, offset 0, flags [DF], proto UDP (17), length 160)
1.99.99.2.4500 > 1.99.99.1.4500: [no cksum] UDP-encap: ESP(spi=0x8de82290,seq=0x6a56), length 132
0x0000: a4bf 014e 0c87 54ee 7530 f1e1 0800 4500 ...N..T.u0....E.
0x0010: 00a0 5517 4000 4011 1c6d 0163 6302 0163 ..U.@.@..m.cc..c
0x0020: 6301 1194 1194 008c 0000 8de8 2290 0000 c..........."...
0x0030: 6a56 72da 0734 52f6 406e 9346 f946 c698 jVr..4R.@n.F.F..
0x0040: a38c 280c 94da 53e1 91e0 35bf 812a 4500 ..(...S...5..*E.
0x0050: 6003 ca7d 6872 a50b d41a 5c4d 7c22 3fb8 `..}hr....\M|"?.
0x0060: 56d8 2a0f bc3f d3a6 5853 682c 914c c1b1 V.*..?..XSh,.L..
0x0070: c5c3 94e8 4789 d8b4 4ab4 e5f9 d20a e5ef ....G...J.......
0x0080: de1d 05dd e98a 996b 5c11 6657 b667 6af1 .......k\.fW.gj.
0x0090: 2a97 694b 16de 74e2 f8fe 13a3 d45e e3e9 *.iK..t......^..
0x00a0: f0b1 b83b 99e3 55cb b40b 5ba8 9c23 ...;..U...[..#
13:43:49.081658 a4:bf:01:4e:0c:87 > 54:ee:75:30:f1:e1, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 44552, offset 0, flags [none], proto UDP (17), length 160)
1.99.99.1.4500 > 1.99.99.2.4500: [no cksum] UDP-encap: ESP(spi=0x1d4ecfdf,seq=0x6a56), length 132
0x0000: 54ee 7530 f1e1 a4bf 014e 0c87 0800 4500 T.u0.....N....E.
0x0010: 00a0 ae08 0000 4011 037c 0163 6301 0163 ......@..|.cc..c
0x0020: 6302 1194 1194 008c 0000 1d4e cfdf 0000 c..........N....
0x0030: 6a56 28ca 4809 8933 911d f2be 4510 e757 jV(.H..3....E..W
0x0040: 3885 7d26 5238 8c58 38e3 6c07 2f8e 335a 8.}&R8.X8.l./.3Z
0x0050: 6d48 2a72 4619 e8a3 c421 bc54 48b2 6239 mH*rF....!.TH.b9
0x0060: 5e07 7e89 a68e 0161 4e6a 5b6f 8b89 9f53 ^.~....aNj[o...S
0x0070: 4c40 1c6c d159 60f8 68e7 24db 8b21 2ec2 L@.l.Y`.h.$..!..
0x0080: 4b67 9b83 643b b0ac 6e2d bf4f 1ee1 9508 Kg..d;..n-.O....
0x0090: d1bd dcd4 74ee e4dc 78d0 578a 5905 1f4d ....t...x.W.Y..M
0x00a0: 74be e643 910b b4d3 f428 8822 e22b t..C.....(.".+

I will try to see what I can do with netperf.
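
As a quick check of the length-mismatch possibility raised earlier in the
thread, here is a small sketch that parses the first 50 bytes of the first
frame above (Ethernet, IPv4, UDP, then the ESP SPI and sequence number) and
compares the header lengths with the 174-byte frame length tcpdump reported.
The function names are made up for this example, and it only covers an
untagged IPv4 frame like this one.

```python
import struct

# First 50 bytes of the first frame in the dump above.
FRAME_HEAD = bytes.fromhex(
    "a4bf014e0c8754ee7530f1e108004500"
    "00a05517400040111c6d016363020163"
    "630111941194008c00008de822900000"
    "6a56"
)
FRAME_LEN = 174  # frame length as reported by tcpdump


def ip_checksum_ok(header: bytes) -> bool:
    """Ones' complement sum over the IPv4 header must fold to 0xffff."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


def check_frame(head: bytes, frame_len: int) -> dict:
    """Parse an untagged Ethernet/IPv4/UDP/ESP header chain and compare lengths."""
    ethertype, = struct.unpack("!H", head[12:14])
    ihl = (head[14] & 0x0F) * 4                    # IPv4 header length in bytes
    ip = head[14:14 + ihl]
    ip_total_len, = struct.unpack("!H", ip[2:4])
    udp_off = 14 + ihl
    sport, dport, udp_len, udp_csum = struct.unpack("!HHHH", head[udp_off:udp_off + 8])
    spi, seq = struct.unpack("!II", head[udp_off + 8:udp_off + 16])
    return {
        "ethertype": ethertype,
        "ip_len_matches_frame": 14 + ip_total_len == frame_len,
        "udp_len_matches_ip": udp_len == ip_total_len - ihl,
        "ip_checksum_ok": ip_checksum_ok(ip),
        "ports": (sport, dport),
        "udp_checksum": udp_csum,
        "spi": spi,
        "seq": seq,
    }


if __name__ == "__main__":
    for name, value in check_frame(FRAME_HEAD, FRAME_LEN).items():
        print(name, value)
```

For this frame the lengths agree (14 + 160 = 174, and the UDP length of 140
is 160 minus the 20-byte IP header) and the IP header checksum verifies, so
at least as far as the capture shows there are no extra headers or trailers.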

--
Len Sorensen

2019-05-02 18:55:18

by Lennart Sorensen

Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 02, 2019 at 01:55:13PM -0400, Lennart Sorensen wrote:
> Here are the same packets as before with the link level header included
> (I forgot to use -XX rather than -X):
>
> 13:43:49.081567 54:ee:75:30:f1:e1 > a4:bf:01:4e:0c:87, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 21783, offset 0, flags [DF], proto UDP (17), length 160)
> 1.99.99.2.4500 > 1.99.99.1.4500: [no cksum] UDP-encap: ESP(spi=0x8de82290,seq=0x6a56), length 132
> 0x0000: a4bf 014e 0c87 54ee 7530 f1e1 0800 4500 ...N..T.u0....E.
> 0x0010: 00a0 5517 4000 4011 1c6d 0163 6302 0163 ..U.@.@..m.cc..c
> 0x0020: 6301 1194 1194 008c 0000 8de8 2290 0000 c..........."...
> 0x0030: 6a56 72da 0734 52f6 406e 9346 f946 c698 [email protected]..
> 0x0040: a38c 280c 94da 53e1 91e0 35bf 812a 4500 ..(...S...5..*E.
> 0x0050: 6003 ca7d 6872 a50b d41a 5c4d 7c22 3fb8 `..}hr....\M|"?.
> 0x0060: 56d8 2a0f bc3f d3a6 5853 682c 914c c1b1 V.*..?..XSh,.L..
> 0x0070: c5c3 94e8 4789 d8b4 4ab4 e5f9 d20a e5ef ....G...J.......
> 0x0080: de1d 05dd e98a 996b 5c11 6657 b667 6af1 .......k\.fW.gj.
> 0x0090: 2a97 694b 16de 74e2 f8fe 13a3 d45e e3e9 *.iK..t......^..
> 0x00a0: f0b1 b83b 99e3 55cb b40b 5ba8 9c23 ...;..U...[..#
> 13:43:49.081658 a4:bf:01:4e:0c:87 > 54:ee:75:30:f1:e1, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 44552, offset 0, flags [none], proto UDP (17), length 160)
> 1.99.99.1.4500 > 1.99.99.2.4500: [no cksum] UDP-encap: ESP(spi=0x1d4ecfdf,seq=0x6a56), length 132
> 0x0000: 54ee 7530 f1e1 a4bf 014e 0c87 0800 4500 T.u0.....N....E.
> 0x0010: 00a0 ae08 0000 4011 037c 0163 6301 0163 ......@..|.cc..c
> 0x0020: 6302 1194 1194 008c 0000 1d4e cfdf 0000 c..........N....
> 0x0030: 6a56 28ca 4809 8933 911d f2be 4510 e757 jV(.H..3....E..W
> 0x0040: 3885 7d26 5238 8c58 38e3 6c07 2f8e 335a 8.}&R8.X8.l./.3Z
> 0x0050: 6d48 2a72 4619 e8a3 c421 bc54 48b2 6239 mH*rF....!.TH.b9
> 0x0060: 5e07 7e89 a68e 0161 4e6a 5b6f 8b89 9f53 ^.~....aNj[o...S
> 0x0070: 4c40 1c6c d159 60f8 68e7 24db 8b21 2ec2 L@.l.Y`.h.$..!..
> 0x0080: 4b67 9b83 643b b0ac 6e2d bf4f 1ee1 9508 Kg..d;..n-.O....
> 0x0090: d1bd dcd4 74ee e4dc 78d0 578a 5905 1f4d ....t...x.W.Y..M
> 0x00a0: 74be e643 910b b4d3 f428 8822 e22b t..C.....(.".+
>
> I will try to see what I can do with netperf.

Hmm, maybe UDP isn't doing as well as I thought.

Playing with packit doing this:

packit -t UDP -d 1.99.99.1 -D 32432 -S 4500 -i enp0s25 -h -p "0x 00 11 22 33 44 55 66 77 88 99 00 11 22 33 44 55 66 77 88 99 00 11 22 33 44 55 66 77 88 99" -c 5

I have played with the source and destination port numbers, and so far
I have only managed to hit queues 0, 1 and 2 (mostly 0 and 2). No port
number I have tried has made it hit any other queue. That is weird.
Making random changes ought to distribute more than that. And changing
the hkey certainly ought to make a difference, and so far it doesn't
seem to for these packets (I know I saw icmp move around just fine before
when changing the hkey).
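
The port sweep described above can also be scripted instead of being run by hand with packit. The following is a minimal Python sketch, assuming plain UDP sockets are enough to reproduce the traffic; the function name `send_probe_datagrams` and the port range in the usage note are illustrative and not taken from this thread.

```python
import socket


def send_probe_datagrams(dst_ip, dst_port, src_ports, payload=b"\x00" * 30):
    """Send one UDP datagram to the destination from each source port.

    Returns the list of source ports that were actually used.
    """
    used = []
    for port in src_ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # Bind to a fixed source port so the 4-tuple the NIC sees is known.
            sock.bind(("", port))
            sock.sendto(payload, (dst_ip, dst_port))
            used.append(port)
        except OSError:
            # Port busy or not permitted: skip it rather than abort the sweep.
            pass
        finally:
            sock.close()
    return used
```

For example, `send_probe_datagrams("1.99.99.1", 32432, range(4500, 4600))` sends up to 100 datagrams that differ only in source port; the per-queue receive counters on the target (for instance via `ethtool -S`) then show how they were spread.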

--
Len Sorensen

2019-05-02 21:02:20

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 2, 2019 at 11:52 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Thu, May 02, 2019 at 01:55:13PM -0400, Lennart Sorensen wrote:
> > Here are the same packets as before with the link level header included
> > (I forgot to use -XX rather than -X):
> >
> > 13:43:49.081567 54:ee:75:30:f1:e1 > a4:bf:01:4e:0c:87, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 21783, offset 0, flags [DF], proto UDP (17), length 160)
> > 1.99.99.2.4500 > 1.99.99.1.4500: [no cksum] UDP-encap: ESP(spi=0x8de82290,seq=0x6a56), length 132
> > 0x0000: a4bf 014e 0c87 54ee 7530 f1e1 0800 4500 ...N..T.u0....E.
> > 0x0010: 00a0 5517 4000 4011 1c6d 0163 6302 0163 ..U.@.@..m.cc..c
> > 0x0020: 6301 1194 1194 008c 0000 8de8 2290 0000 c..........."...
> > 0x0030: 6a56 72da 0734 52f6 406e 9346 f946 c698 jVr..4R.@n.F.F..
> > 0x0040: a38c 280c 94da 53e1 91e0 35bf 812a 4500 ..(...S...5..*E.
> > 0x0050: 6003 ca7d 6872 a50b d41a 5c4d 7c22 3fb8 `..}hr....\M|"?.
> > 0x0060: 56d8 2a0f bc3f d3a6 5853 682c 914c c1b1 V.*..?..XSh,.L..
> > 0x0070: c5c3 94e8 4789 d8b4 4ab4 e5f9 d20a e5ef ....G...J.......
> > 0x0080: de1d 05dd e98a 996b 5c11 6657 b667 6af1 .......k\.fW.gj.
> > 0x0090: 2a97 694b 16de 74e2 f8fe 13a3 d45e e3e9 *.iK..t......^..
> > 0x00a0: f0b1 b83b 99e3 55cb b40b 5ba8 9c23 ...;..U...[..#
> > 13:43:49.081658 a4:bf:01:4e:0c:87 > 54:ee:75:30:f1:e1, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 64, id 44552, offset 0, flags [none], proto UDP (17), length 160)
> > 1.99.99.1.4500 > 1.99.99.2.4500: [no cksum] UDP-encap: ESP(spi=0x1d4ecfdf,seq=0x6a56), length 132
> > 0x0000: 54ee 7530 f1e1 a4bf 014e 0c87 0800 4500 T.u0.....N....E.
> > 0x0010: 00a0 ae08 0000 4011 037c 0163 6301 0163 ......@..|.cc..c
> > 0x0020: 6302 1194 1194 008c 0000 1d4e cfdf 0000 c..........N....
> > 0x0030: 6a56 28ca 4809 8933 911d f2be 4510 e757 jV(.H..3....E..W
> > 0x0040: 3885 7d26 5238 8c58 38e3 6c07 2f8e 335a 8.}&R8.X8.l./.3Z
> > 0x0050: 6d48 2a72 4619 e8a3 c421 bc54 48b2 6239 mH*rF....!.TH.b9
> > 0x0060: 5e07 7e89 a68e 0161 4e6a 5b6f 8b89 9f53 ^.~....aNj[o...S
> > 0x0070: 4c40 1c6c d159 60f8 68e7 24db 8b21 2ec2 L@.l.Y`.h.$..!..
> > 0x0080: 4b67 9b83 643b b0ac 6e2d bf4f 1ee1 9508 Kg..d;..n-.O....
> > 0x0090: d1bd dcd4 74ee e4dc 78d0 578a 5905 1f4d ....t...x.W.Y..M
> > 0x00a0: 74be e643 910b b4d3 f428 8822 e22b t..C.....(.".+
> >
> > I will try to see what I can do with netperf.
>
> Hmm, maybe UDP isn't doing as well as I thought.
>
> Playing with packit doing this:
>
> packit -t UDP -d 1.99.99.1 -D 32432 -S 4500 -i enp0s25 -h -p "0x 00 11 22 33 44 55 66 77 88 99 00 11 22 33 44 55 66 77 88 99 00 11 22 33 44 55 66 77 88 99" -c 5
>
> I have played with the source and destination port numbers, and so far
> I have only managed to hit queues 0, 1 and 2 (mostly 0 and 2). No port
> number I have tried has made it hit any other queue. That is weird.
> Making random changes ought to distribute more than that. And changing
> the hkey certainly ought to make a difference, and so far it doesn't
> seem to for these packets (I know I saw icmp move around just fine before
> when changing the hkey).
>
> --
> Len Sorensen

If I recall correctly RSS is only using something like the lower 9
bits (indirection table size of 512) of the resultant hash on the
X722, even fewer if you have fewer queues that are a power of 2 and
happen to program the indirection table in a round robin fashion. So
for example on my system setup with 32 queues it is technically only
using the lower 5 bits of the hash.

One issue as a result of that is that you can end up with swaths of
bits that don't really seem to impact the hash all that much since it
will never actually change those bits of the resultant hash. In order
to guarantee that every bit in the input impacts the hash you have to
make certain you have no gaps in the key wider than the bits you
examine in the final hash.

A quick and dirty way to verify that the hash key is part of the issue
would be to use something like a simple repeating value such as AA:55
as your hash key. With something like that every bit you change in the
UDP port number should result in a change in the final RSS hash for
queue counts of 3 or greater. The downside is the upper 16 bits of the
hash are identical to the lower 16 so the actual hash value itself
isn't as useful.
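
To make the argument about key gaps concrete, here is a minimal Python sketch of a Toeplitz-style RSS hash. It assumes the NIC uses the standard construction, where each set input bit XORs in the 32-bit key window that starts at that bit position. The function names, the 12-byte IPv4/UDP input layout and the sample addresses are assumptions for illustration; the fields the X722 actually feeds into the hash may differ.

```python
import ipaddress


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Standard Toeplitz hash: XOR one 32-bit key window per set input bit."""
    key_bits = len(key) * 8
    data_bits = len(data) * 8
    if key_bits < data_bits + 31:
        raise ValueError("key too short for this input length")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(data_bits):
        if data[i // 8] & (0x80 >> (i % 8)):
            # Key bits i .. i+31, counted from the most significant bit.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def rss_input(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Assumed layout: source IP, destination IP, source port, destination port."""
    return (ipaddress.ip_address(src_ip).packed
            + ipaddress.ip_address(dst_ip).packed
            + src_port.to_bytes(2, "big")
            + dst_port.to_bytes(2, "big"))


AA55_KEY = bytes.fromhex("aa55") * 26  # 52-byte repeating test key

h = toeplitz_hash(AA55_KEY, rss_input("1.99.99.2", "1.99.99.1", 4500, 32432))
# With a key that repeats every 16 bits, every 32-bit window is a 16-bit
# pattern repeated twice, so the upper and lower halves of the hash match.
print(f"hash=0x{h:08x} upper==lower: {(h >> 16) == (h & 0xFFFF)}")
```

In this model the AA:55 pattern never contains three consecutive zero bits, so flipping any single input bit changes at least one of the low three hash bits. A real key with long runs of zeros would not give that guarantee.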

2019-05-03 15:16:24

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 02, 2019 at 01:59:46PM -0700, Alexander Duyck wrote:
> If I recall correctly RSS is only using something like the lower 9
> bits (indirection table size of 512) of the resultant hash on the
> X722, even fewer if you have fewer queues that are a power of 2 and
> happen to program the indirection table in a round robin fashion. So
> for example on my system setup with 32 queues it is technically only
> using the lower 5 bits of the hash.
>
> One issue as a result of that is that you can end up with swaths of
> bits that don't really seem to impact the hash all that much since it
> will never actually change those bits of the resultant hash. In order
> to guarantee that every bit in the input impacts the hash you have to
> make certain you have no gaps in the key wider than the bits you
> examine in the final hash.
>
> A quick and dirty way to verify that the hash key is part of the issue
> would be to use something like a simple repeating value such as AA:55
> as your hash key. With something like that every bit you change in the
> UDP port number should result in a change in the final RSS hash for
> queue counts of 3 or greater. The downside is the upper 16 bits of the
> hash are identical to the lower 16 so the actual hash value itself
> isn't as useful.

OK I set the hkey to
aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55
and still only see queue 0 and 2 getting hit with a couple of dozen
different UDP port numbers I picked. Changing the hash with ethtool to
that didn't even move where the tcp packets for my ssh connection are
going (they are always on queue 2 it seems).

Does it just not hash UDP packets correctly? Is it even doing RSS?
(the register I checked claimed it is).

This system has 40 queues assigned by default since that is how many
CPUs there are. Changing it to a lower number didn't make a difference
(I tried 32 and 8).

--
Len Sorensen

2019-05-03 17:27:54

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Fri, May 3, 2019 at 8:14 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Thu, May 02, 2019 at 01:59:46PM -0700, Alexander Duyck wrote:
> > If I recall correctly RSS is only using something like the lower 9
> > bits (indirection table size of 512) of the resultant hash on the
> > X722, even fewer if you have fewer queues that are a power of 2 and
> > happen to program the indirection table in a round robin fashion. So
> > for example on my system setup with 32 queues it is technically only
> > using the lower 5 bits of the hash.
> >
> > One issue as a result of that is that you can end up with swaths of
> > bits that don't really seem to impact the hash all that much since it
> > will never actually change those bits of the resultant hash. In order
> > to guarantee that every bit in the input impacts the hash you have to
> > make certain you have no gaps in the key wider than the bits you
> > examine in the final hash.
> >
> > A quick and dirty way to verify that the hash key is part of the issue
> > would be to use something like a simple repeating value such as AA:55
> > as your hash key. With something like that every bit you change in the
> > UDP port number should result in a change in the final RSS hash for
> > queue counts of 3 or greater. The downside is the upper 16 bits of the
> > hash are identical to the lower 16 so the actual hash value itself
> > isn't as useful.
>
> OK I set the hkey to
> aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55
> and still only see queue 0 and 2 getting hit with a couple of dozen
> different UDP port numbers I picked. Changing the hash with ethtool to
> that didn't even move where the tcp packets for my ssh connection are
> going (they are always on queue 2 it seems).

The TCP flow could be bypassing RSS and may be using ATR to decide
where the Rx packets are processed. Now that I think about it there is
a possibility that ATR could be interfering with the queue selection.
You might try disabling it by running:
ethtool --set-priv-flags <iface> flow-director-atr off

> Does it just not hash UDP packets correctly? Is it even doing RSS?
> (the register I checked claimed it is).

The problem is RSS can be bypassed for queue selection by things like
ATR which I called out above. One possibility is that the
encryption you were using was leaving the skb->encapsulation flag set,
and the NIC might have misidentified the packets as something it could
parse and set up a bunch of rules that were rerouting incoming traffic
based on outgoing traffic. Disabling the feature should switch off
that behavior if that is in fact the case.

> This system has 40 queues assigned by default since that is how many
> CPUs there are. Changing it to a lower number didn't make a difference
> (I tried 32 and 8).

You are probably fine using 40 queues. That isn't an even power of two
so it would actually improve the entropy a bit since the lower bits
don't have a many:1 mapping to queues.
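
The point about queue counts can be checked with a small model. The sketch below assumes a 512-entry indirection table filled round robin (as described earlier in the thread) and reports which of the low hash bits can actually change the selected queue. The helper names are illustrative.

```python
LUT_SIZE = 512  # indirection table size mentioned earlier in the thread


def round_robin_lut(num_queues: int) -> list[int]:
    """Indirection table where entry i points at queue i modulo num_queues."""
    return [i % num_queues for i in range(LUT_SIZE)]


def bits_affecting_queue(lut: list[int]) -> list[int]:
    """Return the hash bit positions whose value can change the chosen queue."""
    index_bits = (len(lut) - 1).bit_length()
    return [
        bit for bit in range(index_bits)
        if any(lut[i] != lut[i ^ (1 << bit)] for i in range(len(lut)))
    ]


for queues in (8, 32, 40):
    print(queues, bits_affecting_queue(round_robin_lut(queues)))
```

With 8 queues only bits 0 to 2 matter and with 32 queues only bits 0 to 4, while with 40 queues all nine index bits (0 to 8) influence the result.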

2019-05-03 21:19:22

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Fri, May 03, 2019 at 10:19:47AM -0700, Alexander Duyck wrote:
> The TCP flow could be bypassing RSS and may be using ATR to decide
> where the Rx packets are processed. Now that I think about it there is
> a possibility that ATR could be interfering with the queue selection.
> You might try disabling it by running:
> ethtool --set-priv-flags <iface> flow-director-atr off

Hmm, I thought I had killed ATR (I certainly meant to), but it appears
I had not. I will experiment to see if that makes a difference.

> The problem is RSS can be bypassed for queue selection by things like
> ATR which I called out above. One possibility is that the
> encryption you were using was leaving the skb->encapsulation flag set,
> and the NIC might have misidentified the packets as something it could
> parse and set up a bunch of rules that were rerouting incoming traffic
> based on outgoing traffic. Disabling the feature should switch off
> that behavior if that is in fact the case.
>
> You are probably fine using 40 queues. That isn't an even power of two
> so it would actually improve the entropy a bit since the lower bits
> don't have a many:1 mapping to queues.

I will let you know Monday how my tests go with atr off. I really
thought that was off already since it was supposed to be. We always
try to turn that off because it does not work well.

--
Len Sorensen

2019-05-13 19:30:48

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Fri, May 03, 2019 at 04:59:35PM -0400, Lennart Sorensen wrote:
> On Fri, May 03, 2019 at 10:19:47AM -0700, Alexander Duyck wrote:
> > The TCP flow could be bypassing RSS and may be using ATR to decide
> > where the Rx packets are processed. Now that I think about it there is
> > a possibility that ATR could be interfering with the queue selection.
> > You might try disabling it by running:
> > ethtool --set-priv-flags <iface> flow-director-atr off
>
> Hmm, I thought I had killed ATR (I certainly meant to), but it appears
> I had not. I will experiment to see if that makes a difference.
>
> > The problem is RSS can be bypassed for queue selection by things like
> > ATR which I called out above. One possibility is that the
> > encryption you were using was leaving the skb->encapsulation flag set,
> > and the NIC might have misidentified the packets as something it could
> > parse and set up a bunch of rules that were rerouting incoming traffic
> > based on outgoing traffic. Disabling the feature should switch off
> > that behavior if that is in fact the case.
> >
> > You are probably fine using 40 queues. That isn't an even power of two
> > so it would actually improve the entropy a bit since the lower bits
> > don't have a many:1 mapping to queues.
>
> I will let you know Monday how my tests go with atr off. I really
> thought that was off already since it was supposed to be. We always
> try to turn that off because it does not work well.

OK it took a while to try a bunch of stuff to make sure ATR really really
was off.

I still see the problem it seems.

# ethtool --show-priv-flags eth2
Private flags for eth2:
MFP : off
LinkPolling : off
flow-director-atr: off
veb-stats : off
hw-atr-eviction : on
legacy-rx : off

# ethtool -i eth2
driver: i40e
version: 2.1.7-k
firmware-version: 4.00 0x80001577 1.1767.0
expansion-rom-version:
bus-info: 0000:3d:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes


Here are two packets that for some reason both go to queue 0 which
seems odd. As far as I can tell all of the packets for UDP port 4500
traffic from any IP is going to queue 0.

UDP from 10.249.1.50:4500 to 10.249.1.1:4500 encapsulating ESP:

a4bf 014e 0c88 001f 45ff f410 0800 45e0
0060 166e 4000 4011 0b1b 0af9 0132 0af9
0101 1194 1194 004c 0000 0000 0201 0000
0000 4eaf 2f76 58cd aae0 4d92 8cb7 0835
1141 7a23 9f06 f323 b816 1a2b c88d 322c
5f16 d4a6 ba72 7c89 2258 9d20 085e d6ed
c7a4 5cc1 3ef2 0753 783d b691 e9d6

UDP from 10.249.1.51:4500 to 10.249.1.1:4500 encapsulating ESP:

a4bf 014e 0c88 20f3 99ae c688 0800 45e0
0060 1671 4000 4011 0b17 0af9 0133 0af9
0101 1194 1194 004c 0000 0000 0200 0000
0000 4ec5 253f 27f1 7fdd 4d82 0697 bef2
45bd 281f 8ecf ac4f 06ed 79ba 3cbb 5eaf
494b 146e a013 8b93 1c38 8aef da3f a73d
6f13 5f80 e946 82e2 7da7 21e8 9d03
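
For anyone wanting to double-check the fields in dumps like these, here is a small Python sketch that decodes the Ethernet, IPv4 and UDP headers from such a hex listing and verifies the IPv4 header checksum. It assumes an untagged Ethernet frame carrying IPv4 and UDP; the function name and the dictionary keys are illustrative. Only the first three rows of the first packet are needed for the headers.

```python
import ipaddress
import struct


def parse_udp_frame(hex_dump: str) -> dict:
    """Decode Ethernet + IPv4 + UDP headers from whitespace-separated hex."""
    frame = bytes.fromhex("".join(hex_dump.split()))
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:
        raise ValueError("not an IPv4 frame")
    ihl = (frame[14] & 0x0F) * 4  # IPv4 header length in bytes
    ip_header = frame[14:14 + ihl]
    if ip_header[9] != 17:
        raise ValueError("not a UDP packet")
    # One's-complement sum over the header; a valid header sums to 0xFFFF.
    total = sum(struct.unpack(f"!{ihl // 2}H", ip_header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    src_port, dst_port = struct.unpack("!HH", frame[14 + ihl:18 + ihl])
    return {
        "src_ip": str(ipaddress.ip_address(ip_header[12:16])),
        "dst_ip": str(ipaddress.ip_address(ip_header[16:20])),
        "src_port": src_port,
        "dst_port": dst_port,
        "ip_checksum_ok": total == 0xFFFF,
    }


first_packet = """
a4bf 014e 0c88 001f 45ff f410 0800 45e0
0060 166e 4000 4011 0b1b 0af9 0132 0af9
0101 1194 1194 004c 0000 0000 0201 0000
"""
print(parse_udp_frame(first_packet))
```

The header bytes `0af9 0132` and `0af9 0101` decode to 10.249.1.50 and 10.249.1.1, both ports are 4500 (0x1194), and the header checksum verifies.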


# ethtool -x eth2
RX flow hash indirection table for eth2 with 12 RX ring(s):
0: 0 1 2 3 4 5 6 7
8: 8 9 10 11 0 1 2 3
16: 4 5 6 7 8 9 10 11
...
488: 8 9 10 11 0 1 2 3
496: 4 5 6 7 8 9 10 11
504: 0 1 2 3 4 5 6 7
RSS hash key:
60:56:66:39:8e:70:46:02:5d:33:5e:9c:5f:f6:fa:9d:ac:50:63:7c:ca:01:23:22:07:a3:8a:23:98:fd:38:5b:74:96:7e:72:0c:aa:83:fc:10:aa:6d:35:bb:8c:4e:eb:46:03:07:6a

Changing the key to:

aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55

makes no change in the queue the packets are going to.

--
Len Sorensen

2019-05-13 19:53:31

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Mon, May 13, 2019 at 9:55 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Fri, May 03, 2019 at 04:59:35PM -0400, Lennart Sorensen wrote:
> > On Fri, May 03, 2019 at 10:19:47AM -0700, Alexander Duyck wrote:
> > > The TCP flow could be bypassing RSS and may be using ATR to decide
> > > where the Rx packets are processed. Now that I think about it there is
> > > a possibility that ATR could be interfering with the queue selection.
> > > You might try disabling it by running:
> > > ethtool --set-priv-flags <iface> flow-director-atr off
> >
> > Hmm, I thought I had killed ATR (I certainly meant to), but it appears
> > I had not. I will experiment to see if that makes a difference.
> >
> > > The problem is RSS can be bypassed for queue selection by things like
> > > ATR which I called out above. One possibility is that the
> > > encryption you were using was leaving the skb->encapsulation flag set,
> > > and the NIC might have misidentified the packets as something it could
> > > parse and set up a bunch of rules that were rerouting incoming traffic
> > > based on outgoing traffic. Disabling the feature should switch off
> > > that behavior if that is in fact the case.
> > >
> > > You are probably fine using 40 queues. That isn't an even power of two
> > > so it would actually improve the entropy a bit since the lower bits
> > > don't have a many:1 mapping to queues.
> >
> > I will let you know Monday how my tests go with atr off. I really
> > thought that was off already since it was supposed to be. We always
> > try to turn that off because it does not work well.
>
> OK it took a while to try a bunch of stuff to make sure ATR really really
> was off.
>
> I still see the problem it seems.
>
> # ethtool --show-priv-flags eth2
> Private flags for eth2:
> MFP : off
> LinkPolling : off
> flow-director-atr: off
> veb-stats : off
> hw-atr-eviction : on
> legacy-rx : off
>
> # ethtool -i eth2
> driver: i40e
> version: 2.1.7-k
> firmware-version: 4.00 0x80001577 1.1767.0
> expansion-rom-version:
> bus-info: 0000:3d:00.1
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
>
> Here are two packets that for some reason both go to queue 0 which
> seems odd. As far as I can tell all of the packets for UDP port 4500
> traffic from any IP is going to queue 0.
>
> UDP from 10.249.1.50:4500 to 10.249.1.1:4500 encapsulating ESP:
>
> a4bf 014e 0c88 001f 45ff f410 0800 45e0
> 0060 166e 4000 4011 0b1b 0af9 0132 0af9
> 0101 1194 1194 004c 0000 0000 0201 0000
> 0000 4eaf 2f76 58cd aae0 4d92 8cb7 0835
> 1141 7a23 9f06 f323 b816 1a2b c88d 322c
> 5f16 d4a6 ba72 7c89 2258 9d20 085e d6ed
> c7a4 5cc1 3ef2 0753 783d b691 e9d6
>
> UDP from 10.249.1.51:4500 to 10.249.1.1:4500 encapsulating ESP:
>
> a4bf 014e 0c88 20f3 99ae c688 0800 45e0
> 0060 1671 4000 4011 0b17 0af9 0133 0af9
> 0101 1194 1194 004c 0000 0000 0200 0000
> 0000 4ec5 253f 27f1 7fdd 4d82 0697 bef2
> 45bd 281f 8ecf ac4f 06ed 79ba 3cbb 5eaf
> 494b 146e a013 8b93 1c38 8aef da3f a73d
> 6f13 5f80 e946 82e2 7da7 21e8 9d03
>
>
> # ethtool -x eth2
> RX flow hash indirection table for eth2 with 12 RX ring(s):
> 0: 0 1 2 3 4 5 6 7
> 8: 8 9 10 11 0 1 2 3
> 16: 4 5 6 7 8 9 10 11
> ...
> 488: 8 9 10 11 0 1 2 3
> 496: 4 5 6 7 8 9 10 11
> 504: 0 1 2 3 4 5 6 7
> RSS hash key:
> 60:56:66:39:8e:70:46:02:5d:33:5e:9c:5f:f6:fa:9d:ac:50:63:7c:ca:01:23:22:07:a3:8a:23:98:fd:38:5b:74:96:7e:72:0c:aa:83:fc:10:aa:6d:35:bb:8c:4e:eb:46:03:07:6a
>
> Changing the key to:
>
> aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55:aa:55
>
> makes no change in the queue the packets are going to.
>
> --
> Len Sorensen

So I recreated the first packet you listed via text2pcap, replayed it
on my test system via tcpreplay, updated my configuration to 12
queues, and used the 2 hash keys you listed. I ended up seeing the
traffic bounce between queues 4 and 8 with an X710 I had to test with
when I was changing the key value.

Unfortunately I don't have an X722 to test with. I'm suspecting that
there may be some difference in the RSS setup, specifically it seems
like values in the PFQF_HENA register were changed for the X722 part
that may be causing the issues we are seeing.

I will see if I can get someone from the networking division to take a
look at this since I don't have access to the part in question nor a
datasheet for it so I am not sure if I can help much more.

Thanks.

- Alex

2019-05-14 16:36:05

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Mon, May 13, 2019 at 12:04:00PM -0700, Alexander Duyck wrote:
> So I recreated the first packet you listed via text2pcap, replayed it
> on my test system via tcpreplay, updated my configuration to 12
> queues, and used the 2 hash keys you listed. I ended up seeing the
> traffic bounce between queues 4 and 8 with an X710 I had to test with
> when I was changing the key value.
>
> Unfortunately I don't have an X722 to test with. I'm suspecting that
> there may be some difference in the RSS setup, specifically it seems
> like values in the PFQF_HENA register were changed for the X722 part
> that may be causing the issues we are seeing.
>
> I will see if I can get someone from the networking division to take a
> look at this since I don't have access to the part in question nor a
> datasheet for it so I am not sure if I can help much more.

Great. I hope someone can figure this out because it is working very
badly so far.

--
Len Sorensen

2019-05-16 19:44:53

by Alexander Duyck

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Tue, May 14, 2019 at 9:34 AM Lennart Sorensen
<[email protected]> wrote:
>
> On Mon, May 13, 2019 at 12:04:00PM -0700, Alexander Duyck wrote:
> > So I recreated the first packet you listed via text2pcap, replayed it
> > on my test system via tcpreplay, updated my configuration to 12
> > queues, and used the 2 hash keys you listed. I ended up seeing the
> > traffic bounce between queues 4 and 8 with an X710 I had to test with
> > when I was changing the key value.
> >
> > Unfortunately I don't have an X722 to test with. I'm suspecting that
> > there may be some difference in the RSS setup, specifically it seems
> > like values in the PFQF_HENA register were changed for the X722 part
> > that may be causing the issues we are seeing.
> >
> > I will see if I can get someone from the networking division to take a
> > look at this since I don't have access to the part in question nor a
> > datasheet for it so I am not sure if I can help much more.
>
> Great. I hope someone can figure this out because it is working very
> badly so far.
>
> --
> Len Sorensen

So I was sent a link to the datasheet for the part and I have a
working theory that what we may be seeing is a problem in the firmware
for the part.

Can you try applying the attached patch and send the output from the
dmesg? Specifically I would want anything with the name "i40e" in it.
What I am looking for is something like the following:
[ 294.383416] i40e 0000:81:00.1: fw 5.0.40043 api 1.5 nvm 5.04 0x800024cd 0.0.0
[ 294.675039] i40e 0000:81:00.1: MAC address: 68:05:ca:37:c7:99
[ 294.685941] i40e 0000:81:00.1: flow_type: 63 input_mask:0x0000000000004000
[ 294.686056] i40e 0000:81:00.1: flow_type: 46 input_mask:0x0007fff800000000
[ 294.686170] i40e 0000:81:00.1: flow_type: 45 input_mask:0x0007fff800000000
[ 294.686284] i40e 0000:81:00.1: flow_type: 44 input_mask:0x0007ffff80000000
[ 294.686399] i40e 0000:81:00.1: flow_type: 43 input_mask:0x0007fffe00000000
[ 294.686513] i40e 0000:81:00.1: flow_type: 41 input_mask:0x0007fffe00000000
[ 294.686628] i40e 0000:81:00.1: flow_type: 36 input_mask:0x0001801800000000
[ 294.686743] i40e 0000:81:00.1: flow_type: 35 input_mask:0x0001801800000000
[ 294.686858] i40e 0000:81:00.1: flow_type: 34 input_mask:0x0001801f80000000
[ 294.686973] i40e 0000:81:00.1: flow_type: 33 input_mask:0x0001801e00000000
[ 294.687087] i40e 0000:81:00.1: flow_type: 31 input_mask:0x0001801e00000000
[ 294.691906] i40e 0000:81:00.1 ens5f1: renamed from eth0
[ 294.711173] i40e 0000:81:00.1 ens5f1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
[ 294.759061] i40e 0000:81:00.1: PCI-Express: Speed 8.0GT/s Width x8
[ 294.863363] i40e 0000:81:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 32 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA

With that we can tell what flow types are enabled, and what input
fields are enabled for each flow type. My suspicion is that
the two new types added to X722 for UDP, 29 and 30, may not match type
31 which is the current flow type supported on the X710.
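
The debug loop in the inline patch in this message walks the 64-bit hash-enable mask from the top bit down and reports each enabled flow type. A rough Python equivalent is sketched below; the mask is rebuilt from the flow types in the sample output above purely for illustration and is not read from hardware.

```python
def enabled_flow_types(hena: int) -> list[int]:
    """List the set bit positions of a 64-bit mask, highest first."""
    return [bit for bit in range(63, -1, -1) if hena & (1 << bit)]


# Flow types that appear in the sample output in this message.
sample_types = [63, 46, 45, 44, 43, 41, 36, 35, 34, 33, 31]
hena = sum(1 << t for t in sample_types)

print(f"hena=0x{hena:016x}")
print(enabled_flow_types(hena))
```

Any flow type whose bit is set but that is missing from a known-good list (here, types 29 and 30 would be such candidates on an X722) stands out immediately in the printed list.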

I have included a copy inline below in case the patch is stripped,
however I suspect it will not apply cleanly as the mail client I am
using usually ends up causing white space mangling by replacing tabs
with spaces.

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 65c2b9d2652b..0c93859f8184 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -10998,6 +10998,15 @@ static int i40e_pf_config_rss(struct i40e_pf *pf)
                 ((u64)i40e_read_rx_ctl(hw, I40E_PFQF_HENA(1)) << 32);
         hena |= i40e_pf_get_default_rss_hena(pf);

+        for (ret = 64; ret--;) {
+                if (!(hena & (1ull << ret)))
+                        continue;
+                dev_info(&pf->pdev->dev, "flow_type: %d input_mask:0x%08x%08x\n",
+                         ret,
+                         i40e_read_rx_ctl(hw, I40E_GLQF_HASH_INSET(1, ret)),
+                         i40e_read_rx_ctl(hw, I40E_GLQF_HASH_INSET(0, ret)));
+        }
+
         i40e_write_rx_ctl(hw, I40E_PFQF_HENA(0), (u32)hena);
         i40e_write_rx_ctl(hw, I40E_PFQF_HENA(1), (u32)(hena >> 32));


Attachments:
i40e-debug-hash-inputs.patch (999.00 B)

2019-05-16 23:10:40

by Lennart Sorensen

[permalink] [raw]
Subject: Re: [Intel-wired-lan] i40e X722 RSS problem with NAT-Traversal IPsec packets

On Thu, May 16, 2019 at 10:10:55AM -0700, Alexander Duyck wrote:
> So I was sent a link to the datasheet for the part and I have a
> working theory that what we may be seeing is a problem in the firmware
> for the part.
>
> Can you try applying the attached patch and send the output from the
> dmesg? Specifically I would want anything with the name "i40e" in it.
> What I am looking for is something like the following:
> [ 294.383416] i40e 0000:81:00.1: fw 5.0.40043 api 1.5 nvm 5.04 0x800024cd 0.0.0
> [ 294.675039] i40e 0000:81:00.1: MAC address: 68:05:ca:37:c7:99
> [ 294.685941] i40e 0000:81:00.1: flow_type: 63 input_mask:0x0000000000004000
> [ 294.686056] i40e 0000:81:00.1: flow_type: 46 input_mask:0x0007fff800000000
> [ 294.686170] i40e 0000:81:00.1: flow_type: 45 input_mask:0x0007fff800000000
> [ 294.686284] i40e 0000:81:00.1: flow_type: 44 input_mask:0x0007ffff80000000
> [ 294.686399] i40e 0000:81:00.1: flow_type: 43 input_mask:0x0007fffe00000000
> [ 294.686513] i40e 0000:81:00.1: flow_type: 41 input_mask:0x0007fffe00000000
> [ 294.686628] i40e 0000:81:00.1: flow_type: 36 input_mask:0x0001801800000000
> [ 294.686743] i40e 0000:81:00.1: flow_type: 35 input_mask:0x0001801800000000
> [ 294.686858] i40e 0000:81:00.1: flow_type: 34 input_mask:0x0001801f80000000
> [ 294.686973] i40e 0000:81:00.1: flow_type: 33 input_mask:0x0001801e00000000
> [ 294.687087] i40e 0000:81:00.1: flow_type: 31 input_mask:0x0001801e00000000
> [ 294.691906] i40e 0000:81:00.1 ens5f1: renamed from eth0
> [ 294.711173] i40e 0000:81:00.1 ens5f1: NIC Link is Up, 10 Gbps Full Duplex, Flow Control: None
> [ 294.759061] i40e 0000:81:00.1: PCI-Express: Speed 8.0GT/s Width x8
> [ 294.863363] i40e 0000:81:00.1: Features: PF-id[1] VFs: 32 VSIs: 34 QP: 32 RSS FD_ATR FD_SB NTUPLE VxLAN Geneve PTP VEPA
>
> With that we can tell what flow types are enabled, and what input
> fields are enabled for each flow type. My suspicion is that
> the two new types added to X722 for UDP, 29 and 30, may not match type
> 31 which is the current flow type supported on the X710.
>
> I have included a copy inline below in case the patch is stripped,
> however I suspect it will not apply cleanly as the mail client I am
> using usually ends up causing white space mangling by replacing tabs
> with spaces.
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
> index 65c2b9d2652b..0c93859f8184 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
> @@ -10998,6 +10998,15 @@ static int i40e_pf_config_rss(struct i40e_pf *pf)
>                  ((u64)i40e_read_rx_ctl(hw, I40E_PFQF_HENA(1)) << 32);
>          hena |= i40e_pf_get_default_rss_hena(pf);
>
> +        for (ret = 64; ret--;) {
> +                if (!(hena & (1ull << ret)))
> +                        continue;
> +                dev_info(&pf->pdev->dev, "flow_type: %d input_mask:0x%08x%08x\n",
> +                         ret,
> +                         i40e_read_rx_ctl(hw, I40E_GLQF_HASH_INSET(1, ret)),
> +                         i40e_read_rx_ctl(hw, I40E_GLQF_HASH_INSET(0, ret)));
> +        }
> +
>          i40e_write_rx_ctl(hw, I40E_PFQF_HENA(0), (u32)hena);
>          i40e_write_rx_ctl(hw, I40E_PFQF_HENA(1), (u32)(hena >> 32));

> i40e: Debug hash inputs

Here is what I see:

i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.1.7-k
i40e: Copyright (c) 2013 - 2014 Intel Corporation.
i40e 0000:3d:00.0: fw 3.10.52896 api 1.6 nvm 4.00 0x80001577 1.1767.0
i40e 0000:3d:00.0: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
i40e 0000:3d:00.0: MAC address: a4:bf:01:4e:0c:87
i40e 0000:3d:00.0: flow_type: 63 input_mask:0x0000000000004000
i40e 0000:3d:00.0: flow_type: 46 input_mask:0x0007fff800000000
i40e 0000:3d:00.0: flow_type: 45 input_mask:0x0007fff800000000
i40e 0000:3d:00.0: flow_type: 44 input_mask:0x0007ffff80000000
i40e 0000:3d:00.0: flow_type: 43 input_mask:0x0007fffe00000000
i40e 0000:3d:00.0: flow_type: 42 input_mask:0x0007fffe00000000
i40e 0000:3d:00.0: flow_type: 41 input_mask:0x0007fffe00000000
i40e 0000:3d:00.0: flow_type: 40 input_mask:0x0007fffe00000000
i40e 0000:3d:00.0: flow_type: 39 input_mask:0x0007fffe00000000
i40e 0000:3d:00.0: flow_type: 36 input_mask:0x0006060000000000
i40e 0000:3d:00.0: flow_type: 35 input_mask:0x0006060000000000
i40e 0000:3d:00.0: flow_type: 34 input_mask:0x0006060780000000
i40e 0000:3d:00.0: flow_type: 33 input_mask:0x0006060600000000
i40e 0000:3d:00.0: flow_type: 32 input_mask:0x0006060600000000
i40e 0000:3d:00.0: flow_type: 31 input_mask:0x0006060600000000
i40e 0000:3d:00.0: flow_type: 30 input_mask:0x0006060600000000
i40e 0000:3d:00.0: flow_type: 29 input_mask:0x0006060600000000
i40e 0000:3d:00.0: Features: PF-id[0] VSIs: 34 QP: 12 TXQ: 13 RSS VxLAN Geneve VEPA
i40e 0000:3d:00.1: fw 3.10.52896 api 1.6 nvm 4.00 0x80001577 1.1767.0
i40e 0000:3d:00.1: The driver for the device detected a newer version of the NVM image than expected. Please install the most recent version of the network driver.
i40e 0000:3d:00.1: MAC address: a4:bf:01:4e:0c:88
i40e 0000:3d:00.1: flow_type: 63 input_mask:0x0000000000004000
i40e 0000:3d:00.1: flow_type: 46 input_mask:0x0007fff800000000
i40e 0000:3d:00.1: flow_type: 45 input_mask:0x0007fff800000000
i40e 0000:3d:00.1: flow_type: 44 input_mask:0x0007ffff80000000
i40e 0000:3d:00.1: flow_type: 43 input_mask:0x0007fffe00000000
i40e 0000:3d:00.1: flow_type: 42 input_mask:0x0007fffe00000000
i40e 0000:3d:00.1: flow_type: 41 input_mask:0x0007fffe00000000
i40e 0000:3d:00.1: flow_type: 40 input_mask:0x0007fffe00000000
i40e 0000:3d:00.1: flow_type: 39 input_mask:0x0007fffe00000000
i40e 0000:3d:00.1: flow_type: 36 input_mask:0x0006060000000000
i40e 0000:3d:00.1: flow_type: 35 input_mask:0x0006060000000000
i40e 0000:3d:00.1: flow_type: 34 input_mask:0x0006060780000000
i40e 0000:3d:00.1: flow_type: 33 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 32 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 31 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 30 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 29 input_mask:0x0006060600000000
i40e 0000:3d:00.1: Features: PF-id[1] VSIs: 34 QP: 12 TXQ: 13 RSS VxLAN Geneve VEPA
i40e 0000:3d:00.1 eth2: NIC Link is Up, 1000 Mbps Full Duplex, Flow Control: None
i40e_ioctl: power down: eth1
i40e_ioctl: power down: eth2

--
Len Sorensen
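
One way to compare the two dumps in this thread is to parse the flow_type lines and look at which input-mask bits differ per flow type. The Python sketch below does that for a few lines copied from the messages above. The function name is illustrative, and no meaning is assigned to individual mask bits since the thread does not document them.

```python
import re

LINE_RE = re.compile(r"flow_type:\s*(\d+)\s+input_mask:0x([0-9a-fA-F]{16})")


def parse_input_masks(dmesg_text: str) -> dict[int, int]:
    """Map each flow type number to its 64-bit input mask."""
    return {int(t): int(m, 16) for t, m in LINE_RE.findall(dmesg_text)}


# Lines from the sample output Alex posted.
reference = parse_input_masks("""
i40e 0000:81:00.1: flow_type: 34 input_mask:0x0001801f80000000
i40e 0000:81:00.1: flow_type: 31 input_mask:0x0001801e00000000
""")
# Lines from the X722 output in the message above.
observed = parse_input_masks("""
i40e 0000:3d:00.1: flow_type: 34 input_mask:0x0006060780000000
i40e 0000:3d:00.1: flow_type: 31 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 30 input_mask:0x0006060600000000
i40e 0000:3d:00.1: flow_type: 29 input_mask:0x0006060600000000
""")

for flow_type in sorted(observed, reverse=True):
    if flow_type in reference:
        diff = reference[flow_type] ^ observed[flow_type]
        print(f"flow_type {flow_type}: differing bits 0x{diff:016x}")
    else:
        print(f"flow_type {flow_type}: only in second dump")
```

In this data the masks for types 29, 30 and 31 are identical within the X722 dump, while types 31 and 34 both differ from the sample output by the same bit pattern, 0x0007861800000000.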