Date: Tue, 19 Dec 2017 09:02:25 +1030
From: Jonathan Woithe
To: Holger Hoffstätte
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: r8169 regression: UDP packets dropped intermittantly
Message-ID: <20171218223224.GA13172@marvin.atrad.com.au>
References: <20171218054951.GJ17747@marvin.atrad.com.au>

Hi Holger

On Mon, Dec 18, 2017 at 02:38:53PM +0100, Holger Hoffstätte wrote:
> On 12/18/17 06:49, Jonathan Woithe wrote:
> > Resend to netdev.  LKML CCed in case anyone in the wider kernel community
> > can suggest a way forward.  Please CC responses if replying only to LKML.
> >
> > It seems that this 4+ year old regression in the r8169 driver (documented in
> > this thread on netdev beginning on 9 March 2013) will never be fixed,
> > despite the identification of the commit which broke it.  Cards using this
> > driver will therefore remain unusable for certain workloads utilising UDP.
> (snip)
>
> Since I've seen your postings several times now with no comment or resolution
> I've decided to try your reproducer on my own systems. In short, I cannot
> reproduce any packet loss, despite having 2 (cheap) 1Gb switches between the
> two machines. Both are running 4.14.7.

Thanks for trying the test program on your system.
The result indicates that the problem might be specific to the behaviour of a
particular variant of the r8169 chip.  The systems we use are all equipped
with a PCI Netgear GA311 card, which identifies as

  05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
          Subsystem: Netgear GA311

Respective IDs are

  05:01.0 0200: 10ec:8169 (rev 10)
          Subsystem: 1385:311a

> Both NICs are onboard PCIe

This is a significant difference between your test systems and ours: the
cards we are using are PCI and are not onboard.

> Nevertheless your reproducer runs forever and all I see is 6 bytes
> request, 14 bytes response, with no drops. Not one. I tried in both
> directions - no difference.

That's very interesting.  On the system noted above with the GA311 the packet
sequence certainly works most of the time.  However, within an hour the 14
byte response will not be seen by the system which sent the 6 byte request.
The slave sees the 6 byte request and sends the 14 byte response: the
problem is in the master (the system sending the 6 byte request).  Neither
the NIC in the slave nor the kernel version running on the slave affects the
result.

> I realize this doesn't actually solve your immediate problem, but it is
> nevertheless an indicator that whatever you have been observing is caused
> by something else.

The inability to trigger the problem on your systems could be due to the
NICs in use.  That is an obvious difference between our system (which
reliably experiences the problem) and yours (which doesn't).  This may
indicate that only certain variants of the r8169 chip are affected, which
obviously complicates things.

In any case, this tester (and the production program with which the problem
was first noticed) work perfectly until commit
da78dbff2e05630921c551dbbc70a4b7981a8fff (identified with git bisect).
Furthermore, when the pre-da78dbff...981a8fff driver was ported to 4.3 as a
test the problem was resolved, verified over a week of continuous testing;
the standard 4.3 reliably triggered the problem within minutes.  Of course
the ported driver isn't a viable long term solution since it's essentially
an out of tree driver.

It's hard to see how this problem is unrelated to da78dbff...981a8fff.
Before this commit, everything worked fine.  While keeping everything else
on the system unchanged, applying this single commit to the r8169 driver
causes the problem.

Thank you again for running the tests.

Regards
  jonathan
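
For readers without the reproducer at hand, the request/response pattern
discussed in this message can be sketched roughly as follows.  This is not
the test program from the thread, only a minimal illustration of the idea: a
master sends a 6 byte UDP request, a slave answers with a 14 byte response,
and the master records every request for which no response arrives.  The
function names, the sequence-number payload and the timeout values are all
assumptions made for the sketch.

```python
import socket
import threading

REQUEST_LEN = 6    # request size mentioned in the thread
RESPONSE_LEN = 14  # response size mentioned in the thread


def run_slave(sock, stop):
    """Answer each 6 byte request with a 14 byte response (hypothetical helper)."""
    sock.settimeout(0.1)  # wake up regularly so the stop flag is noticed
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(64)
        except socket.timeout:
            continue
        if len(data) == REQUEST_LEN:
            # Echo the request bytes back, zero-padded to the response size.
            sock.sendto(data.ljust(RESPONSE_LEN, b"\0"), addr)


def run_master(slave_addr, count, timeout=1.0):
    """Send `count` requests; return the sequence numbers that got no valid reply."""
    missing = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            # Assumed payload: the sequence number as 6 big-endian bytes.
            request = seq.to_bytes(REQUEST_LEN, "big")
            sock.sendto(request, slave_addr)
            try:
                reply, _ = sock.recvfrom(64)
            except socket.timeout:
                missing.append(seq)  # the response never reached the master
                continue
            # A late reply to an earlier request is also counted as a miss.
            if len(reply) != RESPONSE_LEN or reply[:REQUEST_LEN] != request:
                missing.append(seq)
    return missing
```

To mirror the setup described above, run_slave() would run on one machine
with a socket bound to an agreed UDP port, and run_master() on the machine
with the NIC under test, pointed at the slave's address.  A non-empty return
value from run_master() corresponds to the dropped responses reported in
this thread.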