Date: Tue, 30 Apr 2019 11:11:51 -0500
From: Bjorn Helgaas
To: Alex G
Cc: Lukas Wunner, Alex Williamson, Austin Bolen, Alexandru Gagniuc,
 Keith Busch, Shyam Iyer, Sinan Kaya, Linus Torvalds,
 linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] Revert "PCI/LINK: Report degraded links via link bandwidth notification"
Message-ID: <20190430161151.GB145057@google.com>
References: <20190429185611.121751-1-helgaas@kernel.org>
 <20190429185611.121751-2-helgaas@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 29, 2019 at 08:07:53PM -0500, Alex G wrote:
> On 4/29/19 1:56 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas
> >
> > This reverts commit e8303bb7a75c113388badcc49b2a84b4121c1b3e.
> >
> > e8303bb7a75c added logging whenever a link changed speed or width to a
> > state that is considered degraded. Unfortunately, it cannot differentiate
> > signal integrity-related link changes from those intentionally initiated by
> > an endpoint driver, including drivers that may live in userspace or VMs
> > when making use of vfio-pci. Some GPU drivers actively manage the link
> > state to save power, which generates a stream of messages like this:
> >
> >   vfio-pci 0000:07:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:02.0 (capable of 64.000 Gb/s with 5 GT/s x16 link)
> >
> > We really *do* want to be alerted when the link bandwidth is reduced
> > because of hardware failures, but degradation by intentional link state
> > management is probably far more common, so the signal-to-noise ratio is
> > currently low.
> >
> > Until we figure out a way to identify the real problems or silence the
> > intentional situations, revert the following commits, which include the
> > initial implementation (e8303bb7a75c) and subsequent fixes:
>
> I think we're overreacting to a bit of perceived verbosity in the system
> log. Intentional degradation does not seem to me to be as common as
> advertised. I have not observed this with either radeon, nouveau, or amdgpu,
> and the proper mechanism to save power at the link level is ASPM.
> I stand to be corrected and we have on CC some very knowledgeable
> fellows that I am certain will jump at the opportunity to do so.

I can't quantify how common it is, but the verbosity is definitely
*there*, and it seems unlikely to me that a hardware failure is more
common than any intentional driver-driven degradation.

If we can reliably distinguish hardware failures from benign changes,
we should certainly log the failures. But in this case even the
failures are fully functional, albeit at lower performance, so if the
messages end up being 99% false positives, I think it'll just be
confusing for users.

> What it seems like to me is that a proprietary driver running in a VM is
> initiating these changes. And if that is the case then it seems this is a
> virtualization problem. A quick glance over GPU drivers in linux did not
> reveal any obvious places where we intentionally downgrade a link.

I'm not 100% on board with the idea of drivers directly manipulating
the link because it seems like the PCI core might need to be at least
aware of this. But some drivers certainly do manipulate it today for
ASPM, gen2/gen3 retraining, etc.

If we treat this as a virtualization problem, I guess you're
suggesting the host kernel should prevent that sort of link
manipulation? We could have a conversation about that, but it doesn't
seem like the immediate solution to this problem.

> I'm not convinced a revert is the best call.

I have very limited options at this stage of the release, but I'd be
glad to hear suggestions. My concern is that if we release v5.1
as-is, we'll spend a lot of energy on those false positives.

Bjorn