Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1754099AbdGUQJj (ORCPT ); Fri, 21 Jul 2017 12:09:39 -0400
Received: from caffeine.csclub.uwaterloo.ca ([129.97.134.17]:40677
    "EHLO caffeine.csclub.uwaterloo.ca" rhost-flags-OK-OK-OK-OK)
    by vger.kernel.org with ESMTP id S1750849AbdGUQJi (ORCPT );
    Fri, 21 Jul 2017 12:09:38 -0400
Date: Fri, 21 Jul 2017 12:09:37 -0400
To: Benjamin Poirier
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
    intel-wired-lan@lists.osuosl.org, Jeff Kirsher
Subject: Re: commit 16ecba59 breaks 82574L under heavy load.
Message-ID: <20170721160937.GA22632@csclub.uwaterloo.ca>
References: <20170718142109.GO18556@csclub.uwaterloo.ca>
    <20170718231435.64us7vu67wtp6pwe@f1.synalogic.ca>
    <20170719141953.GQ18556@csclub.uwaterloo.ca>
    <20170720000747.4jiadqubv7hg5esz@f1.synalogic.ca>
    <20170720140027.GR18556@csclub.uwaterloo.ca>
    <20170720234455.7nxtui7shmnzivxd@f1.synalogic.ca>
    <20170721152709.GT18556@csclub.uwaterloo.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170721152709.GT18556@csclub.uwaterloo.ca>
User-Agent: Mutt/1.5.23 (2014-03-12)
From: lsorense@csclub.uwaterloo.ca (Lennart Sorensen)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 1850
Lines: 41

On Fri, Jul 21, 2017 at 11:27:09AM -0400, wrote:
> On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> > Could you please test the following patch and let me know if it:
> > 1) reduces the interrupt rate of the Other msi-x vector
> > 2) avoids the link flaps
> > or
> > 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> > In this case, please paste icr values printed.
>
> I will give it a try.

So test looks excellent.  Seems to only get interrupts when link state
actually changes now.

> Another odd behaviour I see is that the driver will hang in
> napi_synchronize on shutdown if there is traffic at the time (at least
> I think that's the trigger, maybe the trigger is if there has been an
> overload of traffic and the backlog in napi was used).
>
> From doing some searching, this seems to be a problem that has plagued
> some people for years with this driver.
>
> I am having trouble figuring out exactly what napi_synchronize is waiting
> for and who is supposed to toggle the flag it is waiting on.  The flag
> appears to work backwards from what I would have expected it to do.
> I see lots of places that can set the bit, but only napi_enable seems
> to clear it again, and I don't see how that would get called for all
> the places that potentially set the bit.

I just realized NAPI_STATE_SCHED and NAPIF_STATE_SCHED are the same thing
and I need to look at both of those.

Still something seems odd in some corner case where napi gets stuck and
you can't close the port anymore due to napi_synchronize never being able
to finish.  Some traffic pattern causes that SCHED state bit to get into
the wrong state and nothing ever clears it.  Even managed to see it get
stuck so it never passed traffic again and hung on shutdown.  The napi
poll was never called again.

-- 
Len Sorensen
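
For anyone following along, here is a rough userspace sketch of the SCHED
bit handshake as I understand it.  It is not kernel code: every model_*
and MODEL_* name is made up for illustration, the real functions use
atomic bit operations, and the real napi_synchronize sleeps in a loop
with no bound.  The point is only that the bit number (NAPI_STATE_SCHED)
and the mask (NAPIF_STATE_SCHED) refer to the same bit, and that if the
poll never completes, nothing clears the bit and the wait never ends.

```c
#include <stdbool.h>

/* Bit number, standing in for the kernel's NAPI_STATE_SCHED (assumed 0 here). */
enum { MODEL_NAPI_STATE_SCHED = 0 };
/* Mask form of the same bit, standing in for NAPIF_STATE_SCHED. */
#define MODEL_NAPIF_STATE_SCHED (1UL << MODEL_NAPI_STATE_SCHED)

/* Hypothetical stand-in for struct napi_struct; only the state word is modelled. */
struct model_napi {
    unsigned long state;
};

/* Roughly napi_schedule_prep(): claim SCHED, return false if already claimed. */
static bool model_schedule_prep(struct model_napi *n)
{
    bool was_set = (n->state & MODEL_NAPIF_STATE_SCHED) != 0;

    n->state |= MODEL_NAPIF_STATE_SCHED;
    return !was_set;
}

/* Roughly napi_complete_done(): the poll finished, release SCHED via the mask. */
static void model_complete(struct model_napi *n)
{
    n->state &= ~MODEL_NAPIF_STATE_SCHED;
}

/*
 * Roughly napi_synchronize(), but bounded so the model cannot hang:
 * returns true if SCHED was seen clear within max_spins checks.
 * Note it tests by bit number while model_complete() clears by mask.
 */
static bool model_synchronize(const struct model_napi *n, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (!(n->state & (1UL << MODEL_NAPI_STATE_SCHED)))
            return true;
    }
    return false;
}
```

In this model, calling model_schedule_prep() and then model_synchronize()
without model_complete() in between returns false, which corresponds to
the hang on shutdown: the bit was set, the poll never ran to completion,
and the waiter has nothing to wake it.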