Date: Tue, 19 Dec 2017 16:15:32 +1030
From: Jonathan Woithe
To: Holger Hoffstätte
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: r8169 regression: UDP packets dropped intermittantly
Message-ID: <20171219054532.GA13685@marvin.atrad.com.au>

Hi again

This is a follow-up to my earlier message.

On Tue, Dec 19, 2017 at 09:02:25AM +1030, Jonathan Woithe wrote:
> On Mon, Dec 18, 2017 at 02:38:53PM +0100, Holger Hoffstätte wrote:
> > Since I've seen your postings several times now with no comment or resolution
> > I've decided to try your reproducer on my own systems.  In short, I cannot
> > reproduce any packet loss, despite having 2 (cheap) 1Gb switches between the
> > two machines.  Both are running 4.14.7.
>
> Thanks for trying the test program on your system.  The result indicates
> that the problem might be specific to the behaviour of a particular network
> variant of the r8169 chip.

I was able to temporarily acquire a PCIe card which uses the r8169 driver.
This allowed me to run the reproducer on the same machine with two different
r8169-based cards.
The original NIC is this:

  05:01.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10) [10ec:8169]
          Subsystem: Netgear GA311 [1385:311a]

The PCIe card is this:

  02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06) [10ec:8168]
          Subsystem: Realtek Semiconductor Co., Ltd. Device 0123 [10ec:0123]

The test was conducted with kernel 4.3.0, since both the 4.3.0 driver (which
triggers the fault) and the forward-ported driver (which predates commit
da78dbff2e05630921c551dbbc70a4b7981a8fff) were available for it.

For the record, the machine used as the slave in these tests (the one
receiving the 6-byte request and sending the 14-byte response) was using its
onboard NIC:

  00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network Connection (rev 05) [8086:1503]
          Subsystem: Gigabyte Technology Co., Ltd 82579V Gigabit Network Connection [1458:e000]

Test outcomes were as follows:

  PCIe card,  unpatched 4.3.0 r8169 driver:  no error (tested for 1 hour)
  PCIe card,  forward-ported r8169 driver:   no error (tested for 1 hour)
  GA311 card, unpatched 4.3.0 r8169 driver:  test failed in under 4 minutes
  GA311 card, forward-ported r8169 driver:   no error (tested for 1 hour)

For completeness, I then booted 4.14 and repeated the test with its r8169
driver.  The PCIe card ran for an hour without triggering the error, while
the GA311 triggered it quickly (in under 3 minutes).

This clearly indicates that not every card using the r8169 driver is
vulnerable to the problem.  It also explains why Holger was unable to
reproduce the result on his system: the PCIe cards do not appear to suffer
from the problem.  Most likely the PCI RTL-8169 chip is affected, while the
newer PCIe variants are not.  However, testing with a wider variety of cards
will be required before this inference can be relied upon.
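For readers unfamiliar with the test, the traffic pattern is simple: the master sends a 6-byte UDP request and waits for a 14-byte UDP response from the slave, and a missing response counts as a dropped packet.  The sketch below is NOT the reproducer used in this thread (that program is not included in this message); it is a hypothetical Python illustration of the same request/response pattern.  The loopback address, ephemeral port, 0.5 s timeout, sequence-number request payload and zero-padded response are all assumptions made for illustration.

```python
import socket
import threading

REQUEST_LEN = 6    # request size described in the message
RESPONSE_LEN = 14  # response size described in the message


def responder(sock: socket.socket) -> None:
    """Slave side: answer each 6-byte request with a 14-byte response."""
    while True:
        try:
            data, addr = sock.recvfrom(64)
        except socket.timeout:
            return  # no traffic for a while: assume the master has finished
        if len(data) == REQUEST_LEN:
            # Hypothetical payload: the request echoed back, zero-padded to 14 bytes.
            sock.sendto(data + bytes(RESPONSE_LEN - REQUEST_LEN), addr)


def run_exchange(count: int = 1000, timeout: float = 0.5) -> int:
    """Master side: send `count` requests and return how many replies never arrived."""
    slave = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    slave.bind(("127.0.0.1", 0))  # loopback and an ephemeral port are assumptions
    slave.settimeout(2 * timeout)
    master = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    master.settimeout(timeout)
    worker = threading.Thread(target=responder, args=(slave,), daemon=True)
    worker.start()

    lost = 0
    for seq in range(count):
        # Hypothetical request payload: a 6-byte big-endian sequence number.
        request = seq.to_bytes(REQUEST_LEN, "big")
        master.sendto(request, slave.getsockname())
        try:
            while True:
                reply, _ = master.recvfrom(64)
                if reply[:REQUEST_LEN] == request:
                    break  # late replies to earlier requests are skipped
        except socket.timeout:
            lost += 1  # no matching 14-byte response within the timeout

    master.close()
    worker.join()  # responder exits once it sees no traffic for 2 * timeout
    slave.close()
    return lost
```

Over loopback this should normally report zero lost replies.  Run between two machines, a nonzero count would correspond to the kind of intermittent loss reported above; the per-request sequence number lets a late reply be told apart from a lost one rather than being miscounted.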
The above results (and Holger's) allow the problem description to be refined
a little: changes in commit da78dbff2e05630921c551dbbc70a4b7981a8fff cause
GA311 NICs (and possibly other PCI cards using an RTL-8169) to have trouble
with small UDP packets, while PCIe variants are seemingly unaffected.

Does this help?

Regards
  jonathan