Message-ID: <8409fd7ad6b83da75c914a71accf522953a460a0.camel@pengutronix.de>
Subject: Re: Issues with "PCI/LINK: Report degraded links via link bandwidth notification"
From: Lucas Stach
To: "Alex G.", Bjorn Helgaas, Alexandru Gagniuc, Keith Busch, Jens Axboe,
    Christoph Hellwig, Sagi Grimberg, David Airlie, Daniel Vetter
Cc: Jan Vesely, Lukas Wunner, Alex Williamson, Austin Bolen, Shyam Iyer,
    Sinan Kaya, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 21 Jan 2020 12:10:08 +0100
References: <20200120023326.GA149019@google.com>

On Mon, 2020-01-20 at 10:01 -0600, Alex G. wrote:
>
> On 1/19/20 8:33 PM, Bjorn Helgaas wrote:
> > [+cc NVMe, GPU driver folks]
> >
> > On Wed, Jan 15, 2020 at 04:10:08PM -0600, Bjorn Helgaas wrote:
> > > I think we have a problem with link bandwidth change notifications
> > > (see https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/pcie/bw_notification.c).
> > >
> > > Here's a recent bug report where Jan reported "_tons_" of these
> > > notifications on an nvme device:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=206197
> > >
> > > There was similar discussion involving GPU drivers at
> > > https://lore.kernel.org/r/20190429185611.121751-2-helgaas@kernel.org
> > >
> > > The current solution is the CONFIG_PCIE_BW config option, which
> > > disables the messages completely. That option defaults to "off" (no
> > > messages), but even so, I think it's a little problematic.
> > >
> > > Users are not really in a position to figure out whether it's safe to
> > > enable. All they can do is experiment and see whether it works with
> > > their current mix of devices and drivers.
> > >
> > > I don't think it's currently useful for distros because it's a
> > > compile-time switch, and distros cannot predict what system configs
> > > will be used, so I don't think they can enable it.
> > >
> > > Does anybody have proposals for making it smarter about distinguishing
> > > real problems from intentional power management, or maybe interfaces
> > > drivers could use to tell us when we should ignore bandwidth changes?
> >
> > NVMe, GPU folks, do your drivers or devices change PCIe link
> > speed/width for power saving or other reasons? When CONFIG_PCIE_BW=y,
> > the PCI core interprets changes like that as problems that need to be
> > reported.
> >
> > If drivers do change link speed/width, can you point me to where
> > that's done? Would it be feasible to add some sort of PCI core
> > interface so the driver could say "ignore" or "pay attention to"
> > subsequent link changes?
> >
> > Or maybe there would even be a way to move the link change itself into
> > the PCI core, so the core would be aware of what's going on?
>
> Funny thing is, I was going to suggest an in-kernel API for this.
>    * Driver requests lower link speed 'X'
>    * Link management interrupt fires
>    * If link speed is at or above 'X' then do not report it.
> I think an "ignore" flag would defeat the purpose of having link
> bandwidth reporting in the first place. If some drivers set it, and
> others don't, then it would be inconsistent enough to not be useful.
>
> A second suggestion is, if there is a way to ratelimit these messages on
> a per-downstream port basis.

Both AMD and Nvidia GPUs have embedded controllers, which are
responsible for the power management. IIRC those controllers can
autonomously initiate PCIe link speed changes depending on measured bus
load. So there is no way for the driver to signal the requested bus
speed to the PCIe core.

I guess for the GPU use case the best we can do is to have the driver
opt out of the link bandwidth notifications, as the driver knows that
there is some autonomous entity on the endpoint mucking with the link
parameters.

Regards,
Lucas
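
For illustration only, the two options discussed in this thread (Alex's "at or above 'X'" filter and a per-driver opt-out) can be modeled in plain user-space C as a single reporting decision. None of the names below (`bw_policy`, `bw_notify_opt_out`, `bw_notify_set_floor`, `bw_should_report`) exist in the kernel; they are made up for this sketch, and link speeds are simplified to PCIe generation numbers.

```c
/*
 * User-space model of a possible reporting policy. Every name here is
 * hypothetical; this is not an existing PCI core interface.
 */
#include <stdbool.h>

/* Link parameters, simplified: speed as PCIe generation, width in lanes. */
struct link_state {
	int speed_gen;
	int width;
};

/* Hypothetical per-device policy consulted when the link changes. */
struct bw_policy {
	bool opt_out;            /* endpoint retrains the link autonomously */
	bool has_floor;          /* driver announced a lowest expected link */
	struct link_state floor; /* valid only if has_floor is set */
};

/* Hypothetical: the driver knows the endpoint changes the link by itself. */
void bw_notify_opt_out(struct bw_policy *p)
{
	p->opt_out = true;
}

/* Hypothetical: the driver announces the lowest link it will ask for. */
void bw_notify_set_floor(struct bw_policy *p, int speed_gen, int width)
{
	p->has_floor = true;
	p->floor.speed_gen = speed_gen;
	p->floor.width = width;
}

/*
 * Decide whether a link change should be reported as degraded.
 * 'cap' is the best link both ends support, 'cur' is the current link.
 */
bool bw_should_report(const struct bw_policy *p,
		      const struct link_state *cap,
		      const struct link_state *cur)
{
	/* Driver opted out: never report. */
	if (p->opt_out)
		return false;

	/* Link runs at full capability: nothing is degraded. */
	if (cur->speed_gen >= cap->speed_gen && cur->width >= cap->width)
		return false;

	/* Link is at or above what the driver requested: expected. */
	if (p->has_floor &&
	    cur->speed_gen >= p->floor.speed_gen &&
	    cur->width >= p->floor.width)
		return false;

	return true;
}
```

In this model a GPU driver would call the opt-out helper once, since it cannot predict what the embedded controller will do, while a driver that lowers the link itself would set a floor and still get a report for anything below it.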