We recently encountered a bug where a few customers using ibmveth on the
same LPAR hit an issue where a TCP session hung when large receive was
enabled. Closer analysis revealed that the session was stuck because the
one side was advertising a zero window repeatedly.
We narrowed this down to the fact the ibmveth driver did not set gso_size
which is translated by TCP into the MSS later up the stack. The MSS is
used to calculate the TCP window size and as that was abnormally large,
it was calculating a zero window, even although the sockets receive buffer
was completely empty.
We were able to reproduce this and worked with IBM to fix this. Thanks Tom
and Marcelo for all your help and review on this.
The patch fixes both our internal reproduction tests and our customers tests.
Signed-off-by: Jon Maxwell <[email protected]>
---
drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 29c05d0..c51717e 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
int frames_processed = 0;
unsigned long lpar_rc;
struct iphdr *iph;
+ bool large_packet = 0;
+ u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
restart_poll:
while (frames_processed < budget) {
@@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
iph->check = 0;
iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
adapter->rx_large_packets++;
+ large_packet = 1;
}
}
}
+ if (skb->len > netdev->mtu) {
+ iph = (struct iphdr *)skb->data;
+ if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
+ iph->protocol == IPPROTO_TCP) {
+ hdr_len += sizeof(struct iphdr);
+ skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+ skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+ } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
+ iph->protocol == IPPROTO_TCP) {
+ hdr_len += sizeof(struct ipv6hdr);
+ skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+ skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
+ }
+ if (!large_packet)
+ adapter->rx_large_packets++;
+ }
+
napi_gro_receive(napi, skb); /* send it up */
netdev->stats.rx_packets++;
--
1.8.3.1
On 10/25/2016 07:09 PM, Jon Maxwell wrote:
> We recently encountered a bug where a few customers using ibmveth on the
> same LPAR hit an issue where a TCP session hung when large receive was
> enabled. Closer analysis revealed that the session was stuck because the
> one side was advertising a zero window repeatedly.
>
> We narrowed this down to the fact the ibmveth driver did not set gso_size
> which is translated by TCP into the MSS later up the stack. The MSS is
> used to calculate the TCP window size and as that was abnormally large,
> it was calculating a zero window, even although the sockets receive buffer
> was completely empty.
>
> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
> and Marcelo for all your help and review on this.
>
> The patch fixes both our internal reproduction tests and our customers tests.
>
> Signed-off-by: Jon Maxwell <[email protected]>
Thanks, Jon.
Acked-by: Thomas Falcon <[email protected]>
> ---
> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index 29c05d0..c51717e 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> int frames_processed = 0;
> unsigned long lpar_rc;
> struct iphdr *iph;
> + bool large_packet = 0;
> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>
> restart_poll:
> while (frames_processed < budget) {
> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> iph->check = 0;
> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> adapter->rx_large_packets++;
> + large_packet = 1;
> }
> }
> }
>
> + if (skb->len > netdev->mtu) {
> + iph = (struct iphdr *)skb->data;
> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
> + iph->protocol == IPPROTO_TCP) {
> + hdr_len += sizeof(struct iphdr);
> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
> + iph->protocol == IPPROTO_TCP) {
> + hdr_len += sizeof(struct ipv6hdr);
> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> + }
> + if (!large_packet)
> + adapter->rx_large_packets++;
> + }
> +
> napi_gro_receive(napi, skb); /* send it up */
>
> netdev->stats.rx_packets++;
On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
> We recently encountered a bug where a few customers using ibmveth on the
> same LPAR hit an issue where a TCP session hung when large receive was
> enabled. Closer analysis revealed that the session was stuck because the
> one side was advertising a zero window repeatedly.
>
> We narrowed this down to the fact the ibmveth driver did not set gso_size
> which is translated by TCP into the MSS later up the stack. The MSS is
> used to calculate the TCP window size and as that was abnormally large,
> it was calculating a zero window, even although the sockets receive buffer
> was completely empty.
>
> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
> and Marcelo for all your help and review on this.
>
> The patch fixes both our internal reproduction tests and our customers tests.
>
> Signed-off-by: Jon Maxwell <[email protected]>
> ---
> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index 29c05d0..c51717e 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> int frames_processed = 0;
> unsigned long lpar_rc;
> struct iphdr *iph;
> + bool large_packet = 0;
> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>
> restart_poll:
> while (frames_processed < budget) {
> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> iph->check = 0;
> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> adapter->rx_large_packets++;
> + large_packet = 1;
> }
> }
> }
>
> + if (skb->len > netdev->mtu) {
> + iph = (struct iphdr *)skb->data;
> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
> + iph->protocol == IPPROTO_TCP) {
> + hdr_len += sizeof(struct iphdr);
> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
> + iph->protocol == IPPROTO_TCP) {
> + hdr_len += sizeof(struct ipv6hdr);
> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> + }
> + if (!large_packet)
> + adapter->rx_large_packets++;
> + }
> +
>
This might break forwarding and PMTU discovery.
You force gso_size to device mtu, regardless of real MSS used by the TCP
sender.
Don't you have the MSS provided in RX descriptor, instead of guessing
the value ?
On 10/27/2016 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>> We recently encountered a bug where a few customers using ibmveth on the
>> same LPAR hit an issue where a TCP session hung when large receive was
>> enabled. Closer analysis revealed that the session was stuck because the
>> one side was advertising a zero window repeatedly.
>>
>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>> which is translated by TCP into the MSS later up the stack. The MSS is
>> used to calculate the TCP window size and as that was abnormally large,
>> it was calculating a zero window, even although the sockets receive buffer
>> was completely empty.
>>
>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>> and Marcelo for all your help and review on this.
>>
>> The patch fixes both our internal reproduction tests and our customers tests.
>>
>> Signed-off-by: Jon Maxwell <[email protected]>
>> ---
>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index 29c05d0..c51717e 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> int frames_processed = 0;
>> unsigned long lpar_rc;
>> struct iphdr *iph;
>> + bool large_packet = 0;
>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>
>> restart_poll:
>> while (frames_processed < budget) {
>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> iph->check = 0;
>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> adapter->rx_large_packets++;
>> + large_packet = 1;
>> }
>> }
>> }
>>
>> + if (skb->len > netdev->mtu) {
>> + iph = (struct iphdr *)skb->data;
>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct iphdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct ipv6hdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + }
>> + if (!large_packet)
>> + adapter->rx_large_packets++;
>> + }
>> +
>>
> This might break forwarding and PMTU discovery.
>
> You force gso_size to device mtu, regardless of real MSS used by the TCP
> sender.
>
> Don't you have the MSS provided in RX descriptor, instead of guessing
> the value ?
>
>
>
The MSS is not always available unfortunately, so this is the best solution there is at the moment.
On Thu, 2016-10-27 at 12:54 -0500, Thomas Falcon wrote:
> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
> > On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
> >> We recently encountered a bug where a few customers using ibmveth on the
> >> same LPAR hit an issue where a TCP session hung when large receive was
> >> enabled. Closer analysis revealed that the session was stuck because the
> >> one side was advertising a zero window repeatedly.
> >>
> >> We narrowed this down to the fact the ibmveth driver did not set gso_size
> >> which is translated by TCP into the MSS later up the stack. The MSS is
> >> used to calculate the TCP window size and as that was abnormally large,
> >> it was calculating a zero window, even although the sockets receive buffer
> >> was completely empty.
> >>
> >> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
> >> and Marcelo for all your help and review on this.
> >>
> >> The patch fixes both our internal reproduction tests and our customers tests.
> >>
> >> Signed-off-by: Jon Maxwell <[email protected]>
> >> ---
> >> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> >> index 29c05d0..c51717e 100644
> >> --- a/drivers/net/ethernet/ibm/ibmveth.c
> >> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> >> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> int frames_processed = 0;
> >> unsigned long lpar_rc;
> >> struct iphdr *iph;
> >> + bool large_packet = 0;
> >> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
> >>
> >> restart_poll:
> >> while (frames_processed < budget) {
> >> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
> >> iph->check = 0;
> >> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
> >> adapter->rx_large_packets++;
> >> + large_packet = 1;
> >> }
> >> }
> >> }
> >>
> >> + if (skb->len > netdev->mtu) {
> >> + iph = (struct iphdr *)skb->data;
> >> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
> >> + iph->protocol == IPPROTO_TCP) {
> >> + hdr_len += sizeof(struct iphdr);
> >> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
> >> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> >> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
> >> + iph->protocol == IPPROTO_TCP) {
> >> + hdr_len += sizeof(struct ipv6hdr);
> >> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
> >> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
> >> + }
> >> + if (!large_packet)
> >> + adapter->rx_large_packets++;
> >> + }
> >> +
> >>
> > This might break forwarding and PMTU discovery.
> >
> > You force gso_size to device mtu, regardless of real MSS used by the TCP
> > sender.
> >
> > Don't you have the MSS provided in RX descriptor, instead of guessing
> > the value ?
> >
> >
> >
> The MSS is not always available unfortunately, so this is the best solution there is at the moment.
Hmm... then what about skb_shinfo(skb)->gso_segs ?
ip_rcv() for example has :
__IP_ADD_STATS(net,
IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
Also prefer : (skb->protocol == htons(ETH_P_IP)) tests
And the ipv6 test is wrong :
} else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
iph->protocol == IPPROTO_TCP) {
Since iph is a pointer to ipv4 iphdr .
On Fri, Oct 28, 2016 at 5:08 AM, Eric Dumazet <[email protected]> wrote:
> On Thu, 2016-10-27 at 12:54 -0500, Thomas Falcon wrote:
>> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>> > On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>> >> We recently encountered a bug where a few customers using ibmveth on the
>> >> same LPAR hit an issue where a TCP session hung when large receive was
>> >> enabled. Closer analysis revealed that the session was stuck because the
>> >> one side was advertising a zero window repeatedly.
>> >>
>> >> We narrowed this down to the fact the ibmveth driver did not set gso_size
>> >> which is translated by TCP into the MSS later up the stack. The MSS is
>> >> used to calculate the TCP window size and as that was abnormally large,
>> >> it was calculating a zero window, even although the sockets receive buffer
>> >> was completely empty.
>> >>
>> >> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>> >> and Marcelo for all your help and review on this.
>> >>
>> >> The patch fixes both our internal reproduction tests and our customers tests.
>> >>
>> >> Signed-off-by: Jon Maxwell <[email protected]>
>> >> ---
>> >> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>> >> 1 file changed, 20 insertions(+)
>> >>
>> >> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> >> index 29c05d0..c51717e 100644
>> >> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> >> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> >> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> >> int frames_processed = 0;
>> >> unsigned long lpar_rc;
>> >> struct iphdr *iph;
>> >> + bool large_packet = 0;
>> >> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>> >>
>> >> restart_poll:
>> >> while (frames_processed < budget) {
>> >> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> >> iph->check = 0;
>> >> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> >> adapter->rx_large_packets++;
>> >> + large_packet = 1;
>> >> }
>> >> }
>> >> }
>> >>
>> >> + if (skb->len > netdev->mtu) {
>> >> + iph = (struct iphdr *)skb->data;
>> >> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>> >> + iph->protocol == IPPROTO_TCP) {
>> >> + hdr_len += sizeof(struct iphdr);
>> >> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>> >> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> >> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>> >> + iph->protocol == IPPROTO_TCP) {
>> >> + hdr_len += sizeof(struct ipv6hdr);
>> >> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>> >> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> >> + }
>> >> + if (!large_packet)
>> >> + adapter->rx_large_packets++;
>> >> + }
>> >> +
>> >>
>> > This might break forwarding and PMTU discovery.
>> >
>> > You force gso_size to device mtu, regardless of real MSS used by the TCP
>> > sender.
>> >
>> > Don't you have the MSS provided in RX descriptor, instead of guessing
>> > the value ?
>> >
>> >
>> >
>> The MSS is not always available unfortunately, so this is the best solution there is at the moment.
>
> Hmm... then what about skb_shinfo(skb)->gso_segs ?
>
> ip_rcv() for example has :
>
> __IP_ADD_STATS(net,
> IPSTATS_MIB_NOECTPKTS + (iph->tos & INET_ECN_MASK),
> max_t(unsigned short, 1, skb_shinfo(skb)->gso_segs));
>
>
Okay thanks Eric. As the MSS is the main concern, let me work with the
team here and see if we find can a better way to do this. Will take care of
the other points you raised at the same time.
>
> Also prefer : (skb->protocol == htons(ETH_P_IP)) tests
>
> And the ipv6 test is wrong :
>
> } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
> iph->protocol == IPPROTO_TCP) {
>
>
> Since iph is a pointer to ipv4 iphdr .
>
>
>
On 10/27/2016 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>> We recently encountered a bug where a few customers using ibmveth on the
>> same LPAR hit an issue where a TCP session hung when large receive was
>> enabled. Closer analysis revealed that the session was stuck because the
>> one side was advertising a zero window repeatedly.
>>
>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>> which is translated by TCP into the MSS later up the stack. The MSS is
>> used to calculate the TCP window size and as that was abnormally large,
>> it was calculating a zero window, even although the sockets receive buffer
>> was completely empty.
>>
>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>> and Marcelo for all your help and review on this.
>>
>> The patch fixes both our internal reproduction tests and our customers tests.
>>
>> Signed-off-by: Jon Maxwell <[email protected]>
>> ---
>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index 29c05d0..c51717e 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> int frames_processed = 0;
>> unsigned long lpar_rc;
>> struct iphdr *iph;
>> + bool large_packet = 0;
>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>
>> restart_poll:
>> while (frames_processed < budget) {
>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> iph->check = 0;
>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> adapter->rx_large_packets++;
>> + large_packet = 1;
>> }
>> }
>> }
>>
>> + if (skb->len > netdev->mtu) {
>> + iph = (struct iphdr *)skb->data;
>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct iphdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct ipv6hdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + }
>> + if (!large_packet)
>> + adapter->rx_large_packets++;
>> + }
>> +
>>
>
> This might break forwarding and PMTU discovery.
>
> You force gso_size to device mtu, regardless of real MSS used by the TCP
> sender.
>
> Don't you have the MSS provided in RX descriptor, instead of guessing
> the value ?
We've had some further discussions on this with the Virtual I/O Server (VIOS)
development team. The large receive aggregation in the VIOS (AIX based) is actually
being done by software in the VIOS. What they may be able to do is when performing
this aggregation, they could look at the packet lengths of all the packets being
aggregated and take the largest packet size within the aggregation unit, minus the
header length and return that to the virtual ethernet client which we could then stuff
into gso_size. They are currently assessing how feasible this would be to do and whether
it would impact other bits of the code. However, assuming this does end up being an option,
would this address the concerns here or is that going to break something else I'm
not thinking of?
Unfortunately, I don't think we'd have a good way to get gso_segs set correctly as I don't
see how that would get passed back up the interface.
Thanks,
Brian
--
Brian King
Power Linux I/O
IBM Linux Technology Center
On Thu, Nov 3, 2016 at 8:40 AM, Brian King <[email protected]> wrote:
> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>>> We recently encountered a bug where a few customers using ibmveth on the
>>> same LPAR hit an issue where a TCP session hung when large receive was
>>> enabled. Closer analysis revealed that the session was stuck because the
>>> one side was advertising a zero window repeatedly.
>>>
>>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>>> which is translated by TCP into the MSS later up the stack. The MSS is
>>> used to calculate the TCP window size and as that was abnormally large,
>>> it was calculating a zero window, even although the sockets receive buffer
>>> was completely empty.
>>>
>>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>>> and Marcelo for all your help and review on this.
>>>
>>> The patch fixes both our internal reproduction tests and our customers tests.
>>>
>>> Signed-off-by: Jon Maxwell <[email protected]>
>>> ---
>>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>>> 1 file changed, 20 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>>> index 29c05d0..c51717e 100644
>>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>> int frames_processed = 0;
>>> unsigned long lpar_rc;
>>> struct iphdr *iph;
>>> + bool large_packet = 0;
>>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>>
>>> restart_poll:
>>> while (frames_processed < budget) {
>>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>> iph->check = 0;
>>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>>> adapter->rx_large_packets++;
>>> + large_packet = 1;
>>> }
>>> }
>>> }
>>>
>>> + if (skb->len > netdev->mtu) {
>>> + iph = (struct iphdr *)skb->data;
>>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>>> + iph->protocol == IPPROTO_TCP) {
>>> + hdr_len += sizeof(struct iphdr);
>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>>> + iph->protocol == IPPROTO_TCP) {
>>> + hdr_len += sizeof(struct ipv6hdr);
>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>> + }
>>> + if (!large_packet)
>>> + adapter->rx_large_packets++;
>>> + }
>>> +
>>>
>>
>> This might break forwarding and PMTU discovery.
>>
>> You force gso_size to device mtu, regardless of real MSS used by the TCP
>> sender.
>>
>> Don't you have the MSS provided in RX descriptor, instead of guessing
>> the value ?
>
> We've had some further discussions on this with the Virtual I/O Server (VIOS)
> development team. The large receive aggregation in the VIOS (AIX based) is actually
> being done by software in the VIOS. What they may be able to do is when performing
> this aggregation, they could look at the packet lengths of all the packets being
> aggregated and take the largest packet size within the aggregation unit, minus the
> header length and return that to the virtual ethernet client which we could then stuff
> into gso_size. They are currently assessing how feasible this would be to do and whether
> it would impact other bits of the code. However, assuming this does end up being an option,
> would this address the concerns here or is that going to break something else I'm
> not thinking of?
I was discussing this with a colleague and although this is better than
what we have so far. We wonder if there could be a corner case where
it ends up with a smaller value than the current MSS. For example if
the application sent a burst of small TCP packets with the PUSH
bit set. In that case they may not be coalesced by GRO. The VIOS could
probably be coded to detect that condition and use the previous MSS.
But that may not necessarily be the current MSS.
The ibmveth driver passes the MSS via gso_size to the VIOS. Either as the
3rd argument of ibmveth_send() or via tcp_hdr(skb)->check which is
presumably over-written when the VIOS does the TSO. Would it be possible
to keep a copy of this value on the TX side on the VIOS before it over-written
and then some how pass that up to the RX side along with frame and set
gso_size to that which should be the current MSS?
Regards
Jon
>
> Unfortunately, I don't think we'd have a good way to get gso_segs set correctly as I don't
> see how that would get passed back up the interface.
>
> Thanks,
>
> Brian
>
>
> --
> Brian King
> Power Linux I/O
> IBM Linux Technology Center
>
On 11/06/2016 03:22 PM, Jonathan Maxwell wrote:
> On Thu, Nov 3, 2016 at 8:40 AM, Brian King <[email protected]> wrote:
>> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>>> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>>>> We recently encountered a bug where a few customers using ibmveth on the
>>>> same LPAR hit an issue where a TCP session hung when large receive was
>>>> enabled. Closer analysis revealed that the session was stuck because the
>>>> one side was advertising a zero window repeatedly.
>>>>
>>>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>>>> which is translated by TCP into the MSS later up the stack. The MSS is
>>>> used to calculate the TCP window size and as that was abnormally large,
>>>> it was calculating a zero window, even although the sockets receive buffer
>>>> was completely empty.
>>>>
>>>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>>>> and Marcelo for all your help and review on this.
>>>>
>>>> The patch fixes both our internal reproduction tests and our customers tests.
>>>>
>>>> Signed-off-by: Jon Maxwell <[email protected]>
>>>> ---
>>>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>>>> index 29c05d0..c51717e 100644
>>>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>>>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>>>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>> int frames_processed = 0;
>>>> unsigned long lpar_rc;
>>>> struct iphdr *iph;
>>>> + bool large_packet = 0;
>>>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>>>
>>>> restart_poll:
>>>> while (frames_processed < budget) {
>>>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>> iph->check = 0;
>>>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>>>> adapter->rx_large_packets++;
>>>> + large_packet = 1;
>>>> }
>>>> }
>>>> }
>>>>
>>>> + if (skb->len > netdev->mtu) {
>>>> + iph = (struct iphdr *)skb->data;
>>>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>>>> + iph->protocol == IPPROTO_TCP) {
>>>> + hdr_len += sizeof(struct iphdr);
>>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>>>> + iph->protocol == IPPROTO_TCP) {
>>>> + hdr_len += sizeof(struct ipv6hdr);
>>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> + }
>>>> + if (!large_packet)
>>>> + adapter->rx_large_packets++;
>>>> + }
>>>> +
>>>>
>>>
>>> This might break forwarding and PMTU discovery.
>>>
>>> You force gso_size to device mtu, regardless of real MSS used by the TCP
>>> sender.
>>>
>>> Don't you have the MSS provided in RX descriptor, instead of guessing
>>> the value ?
>>
>> We've had some further discussions on this with the Virtual I/O Server (VIOS)
>> development team. The large receive aggregation in the VIOS (AIX based) is actually
>> being done by software in the VIOS. What they may be able to do is when performing
>> this aggregation, they could look at the packet lengths of all the packets being
>> aggregated and take the largest packet size within the aggregation unit, minus the
>> header length and return that to the virtual ethernet client which we could then stuff
>> into gso_size. They are currently assessing how feasible this would be to do and whether
>> it would impact other bits of the code. However, assuming this does end up being an option,
>> would this address the concerns here or is that going to break something else I'm
>> not thinking of?
>
> I was discussing this with a colleague and although this is better than
> what we have so far. We wonder if there could be a corner case where
> it ends up with a smaller value than the current MSS. For example if
> the application sent a burst of small TCP packets with the PUSH
> bit set. In that case they may not be coalesced by GRO. The VIOS could
Would that be a big problem though? Other than a performance degradation
in this specific case, do you see a functional issue with this approach?
> probably be coded to detect that condition and use the previous MSS.
> But that may not necessarily be the current MSS.
>
> The ibmveth driver passes the MSS via gso_size to the VIOS. Either as the
> 3rd argument of ibmveth_send() or via tcp_hdr(skb)->check which is
> presumably over-written when the VIOS does the TSO. Would it be possible
> to keep a copy of this value on the TX side on the VIOS before it over-written
> and then some how pass that up to the RX side along with frame and set
> gso_size to that which should be the current MSS?
That seems like it might be more difficult. Wouldn't that require the VIOS
to track the MSS on a per connection basis, since the MSS might differ
based on the source / destination? I discussed with VIOS development, and
they don't do any connection tracking where they'd be able to insert this.
Unless there is a functional issue with the approach to have the VIOS insert
the MSS based on the size of the packets received off the wire, I think that
might be our best option...
Thanks,
Brian
>
> Regards
>
> Jon
>
>>
>> Unfortunately, I don't think we'd have a good way to get gso_segs set correctly as I don't
>> see how that would get passed back up the interface.
>>
>> Thanks,
>>
>> Brian
>>
>>
>> --
>> Brian King
>> Power Linux I/O
>> IBM Linux Technology Center
>>
>
--
Brian King
Power Linux I/O
IBM Linux Technology Center
On 10/27/2016 10:26 AM, Eric Dumazet wrote:
> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>> We recently encountered a bug where a few customers using ibmveth on the
>> same LPAR hit an issue where a TCP session hung when large receive was
>> enabled. Closer analysis revealed that the session was stuck because the
>> one side was advertising a zero window repeatedly.
>>
>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>> which is translated by TCP into the MSS later up the stack. The MSS is
>> used to calculate the TCP window size and as that was abnormally large,
>> it was calculating a zero window, even although the sockets receive buffer
>> was completely empty.
>>
>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>> and Marcelo for all your help and review on this.
>>
>> The patch fixes both our internal reproduction tests and our customers tests.
>>
>> Signed-off-by: Jon Maxwell <[email protected]>
>> ---
>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>> index 29c05d0..c51717e 100644
>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> int frames_processed = 0;
>> unsigned long lpar_rc;
>> struct iphdr *iph;
>> + bool large_packet = 0;
>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>
>> restart_poll:
>> while (frames_processed < budget) {
>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>> iph->check = 0;
>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>> adapter->rx_large_packets++;
>> + large_packet = 1;
>> }
>> }
>> }
>>
>> + if (skb->len > netdev->mtu) {
>> + iph = (struct iphdr *)skb->data;
>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct iphdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>> + iph->protocol == IPPROTO_TCP) {
>> + hdr_len += sizeof(struct ipv6hdr);
>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>> + }
>> + if (!large_packet)
>> + adapter->rx_large_packets++;
>> + }
>> +
>>
>
> This might break forwarding and PMTU discovery.
>
> You force gso_size to device mtu, regardless of real MSS used by the TCP
> sender.
>
> Don't you have the MSS provided in RX descriptor, instead of guessing
> the value ?
Eric,
We are currently pursuing making changes to the Power Virtual I/O Server to provide
the MSS to the ibmveth driver. However, this will take time to go through test
and ultimately get released. Although imperfect, this patch does help a real customer
hitting this issue right now. Would you object to this patch getting merged as is,
with the understanding that when we get the change in the Virtual I/O Server released,
we will revert this interim change and apply the new method?
Thanks,
Brian
--
Brian King
Power Linux I/O
IBM Linux Technology Center