Subject: Re: [PATCH net-next] ibmveth: v1 calculate correct gso_size and set gso_type
To: Jonathan Maxwell
References: <1477440555-21133-1-git-send-email-jmaxwell37@gmail.com>
 <1477582016.7065.212.camel@edumazet-glaptop3.roam.corp.google.com>
 <6b7003ec-6059-c309-dfee-761216f3d058@linux.vnet.ibm.com>
Cc: Eric Dumazet, Thomas Falcon, Jon Maxwell, hofrat@osadl.org,
 linux-kernel@vger.kernel.org, jarod@redhat.com, netdev@vger.kernel.org,
 paulus@samba.org, Tom Herbert, Marcelo Ricardo Leitner,
 linuxppc-dev@lists.ozlabs.org, David Miller
From: Brian King
Date: Wed, 9 Nov 2016 10:02:19 -0600

On 11/06/2016 03:22 PM, Jonathan Maxwell wrote:
> On Thu, Nov 3, 2016 at 8:40 AM, Brian King wrote:
>> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>>> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>>>> We recently encountered a bug where a few customers using ibmveth on the
>>>> same LPAR hit an issue where a TCP session hung when large receive was
>>>> enabled. Closer analysis revealed that the session was stuck because the
>>>> one side was advertising a zero window repeatedly.
>>>>
>>>> We narrowed this down to the fact the ibmveth driver did not set gso_size
>>>> which is translated by TCP into the MSS later up the stack. The MSS is
>>>> used to calculate the TCP window size and as that was abnormally large,
>>>> it was calculating a zero window, even although the sockets receive buffer
>>>> was completely empty.
>>>>
>>>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>>>> and Marcelo for all your help and review on this.
>>>>
>>>> The patch fixes both our internal reproduction tests and our customers tests.
>>>>
>>>> Signed-off-by: Jon Maxwell
>>>> ---
>>>>  drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>>>> index 29c05d0..c51717e 100644
>>>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>>>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>>>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>>  	int frames_processed = 0;
>>>>  	unsigned long lpar_rc;
>>>>  	struct iphdr *iph;
>>>> +	bool large_packet = 0;
>>>> +	u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>>>
>>>>  restart_poll:
>>>>  	while (frames_processed < budget) {
>>>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>>  					iph->check = 0;
>>>>  					iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>>>>  					adapter->rx_large_packets++;
>>>> +					large_packet = 1;
>>>>  				}
>>>>  			}
>>>>  		}
>>>>
>>>> +		if (skb->len > netdev->mtu) {
>>>> +			iph = (struct iphdr *)skb->data;
>>>> +			if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>>>> +			    iph->protocol == IPPROTO_TCP) {
>>>> +				hdr_len += sizeof(struct iphdr);
>>>> +				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>>>> +				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> +			} else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>>>> +				   iph->protocol == IPPROTO_TCP) {
>>>> +				hdr_len += sizeof(struct ipv6hdr);
>>>> +				skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>>>> +				skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> +			}
>>>> +			if (!large_packet)
>>>> +				adapter->rx_large_packets++;
>>>> +		}
>>>> +
>>>>
>>>
>>> This might break forwarding and PMTU discovery.
>>>
>>> You force gso_size to device mtu, regardless of real MSS used by the TCP
>>> sender.
>>>
>>> Don't you have the MSS provided in RX descriptor, instead of guessing
>>> the value ?
>>
>> We've had some further discussions on this with the Virtual I/O Server (VIOS)
>> development team. The large receive aggregation in the VIOS (AIX based) is actually
>> being done by software in the VIOS. What they may be able to do is when performing
>> this aggregation, they could look at the packet lengths of all the packets being
>> aggregated and take the largest packet size within the aggregation unit, minus the
>> header length and return that to the virtual ethernet client which we could then stuff
>> into gso_size. They are currently assessing how feasible this would be to do and whether
>> it would impact other bits of the code. However, assuming this does end up being an option,
>> would this address the concerns here or is that going to break something else I'm
>> not thinking of?
>
> I was discussing this with a colleague and although this is better than
> what we have so far. We wonder if there could be a corner case where
> it ends up with a smaller value than the current MSS. For example if
> the application sent a burst of small TCP packets with the PUSH
> bit set. In that case they may not be coalesced by GRO. The VIOS could

Would that be a big problem though? Other than a performance degradation in
this specific case, do you see a functional issue with this approach?

> probably be coded to detect that condition and use the previous MSS.
> But that may not necessarily be the current MSS.
>
> The ibmveth driver passes the MSS via gso_size to the VIOS. Either as the
> 3rd argument of ibmveth_send() or via tcp_hdr(skb)->check which is
> presumably over-written when the VIOS does the TSO.
> Would it be possible to keep a copy of this value on the TX side on the VIOS
> before it over-written and then some how pass that up to the RX side along
> with frame and set gso_size to that which should be the current MSS?

That seems like it might be more difficult. Wouldn't that require the VIOS to
track the MSS on a per connection basis, since the MSS might differ based on
the source / destination? I discussed with VIOS development, and they don't do
any connection tracking where they'd be able to insert this.

Unless there is a functional issue with the approach to have the VIOS insert
the MSS based on the size of the packets received off the wire, I think that
might be our best option...

Thanks,

Brian

>
> Regards
>
> Jon
>
>>
>> Unfortunately, I don't think we'd have a good way to get gso_segs set
>> correctly as I don't see how that would get passed back up the interface.
>>
>> Thanks,
>>
>> Brian
>>
>>
>> --
>> Brian King
>> Power Linux I/O
>> IBM Linux Technology Center
>>
>

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
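
P.S. To make the idea concrete, below is a rough, untested sketch of how the
receive path could consume an MSS value handed back by the VIOS instead of
deriving gso_size from netdev->mtu. The mss_hint parameter and the
ibmveth_rx_set_gso() helper are hypothetical, since how that value would
actually get from the VIOS to the driver is exactly the open question above,
so treat this as illustration only rather than a proposed patch.

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/in.h>

/* Hypothetical sketch: would be called from ibmveth_poll() after
 * eth_type_trans(), so skb->data points at the network header.  mss_hint
 * is whatever per-buffer MSS the VIOS reports for an aggregated (large
 * receive) frame; zero means no hint was provided.
 */
static void ibmveth_rx_set_gso(struct sk_buff *skb, u16 mss_hint)
{
	struct iphdr *iph = (struct iphdr *)skb->data;

	if (!mss_hint)
		return;		/* no hint from the VIOS; leave the skb alone */

	if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
	    iph->protocol == IPPROTO_TCP)
		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
	else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
		 ((struct ipv6hdr *)skb->data)->nexthdr == IPPROTO_TCP)
		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
	else
		return;		/* only TCP frames are aggregated by large receive */

	/* Use the MSS the VIOS saw on the wire rather than netdev->mtu minus
	 * header length, so forwarding and PMTU discovery see the sender's
	 * real segment size.
	 */
	skb_shinfo(skb)->gso_size = mss_hint;
}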