Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp8204155rwd; Tue, 20 Jun 2023 11:36:48 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ4MbPG+IwhNv5gW0rfYp1GqyAuLpXKvUCTHWVFpdxFUQ+Nb0e2d5KiWGqv9jQTooz1Xb0tA X-Received: by 2002:a17:903:1109:b0:1b0:3637:384e with SMTP id n9-20020a170903110900b001b03637384emr9872241plh.25.1687286208077; Tue, 20 Jun 2023 11:36:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1687286208; cv=none; d=google.com; s=arc-20160816; b=Dotsi4auc4HruNxmT+mfOHhB+/CP9QEwn2CyJM+AUKamkHFRPJIHbqn65J0H5SdmXK gNmsBq3HLu+V/2XSUQeoP71tPa8NyEUH1mTTO3jkLrZyeOvEXJ3OScbDmlK5YwnhEad4 NbEUvuW+M+p1hKKM1n8ncmId5tR0yx2BLD5+2Eu9Yvi1bgN0KFWFsoZzlJ3Ld4dSwDeb r+b7GIdyeqf8VFlRAeo09b3fZeYAVWT3WDKO4TadAwZbtDbktnwln4YDenEfoFjEUNwn v2eC47jESjWDfkPWgYA6mFGgGOcrKPeCoOJbGMDxHyYlq6tj/GvUxZ/WOzf4CSraEfSM nB1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:message-id:subject:cc:to:from:date :dkim-signature; bh=O+o9eMD3y9Tr9dqoEu+YBC63G/WCHMcMJPmT8gabU7Q=; b=uU/iM7r5M4TGy8BKg60OfotZcd7QbViQN0+ZKDF1ENqlGfvp9PVdfaGEhHEa9DjBqE ZmMU1j2a1G3eDT33iiC2gEAdn7WEooH0bCcYkDeGkPqTWZJWcmfmxZM7PmM4WygplX2s F+fDHs3Z6uWY/ernxOcBpr0RNJtNbroyzj5xWSCNePAzH8JQTqopkHkrNaed7XA//2F9 KEi3IMVuYr9fsF1AZwGZCEPLJeDmavTvV6hHXpSRAEhCOwU+tWwbJ+NGjtdaCMcbQPRA g/r6JxJiqxrmA+i25wQ2QXfEoMpxK3GRZ2Zv+B/budHwcL1QzlYeNR8rF0dJalL438LZ aoPg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Lc7sjKnp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id a3-20020a170902ecc300b0019ec2a633f4si2561809plh.505.2023.06.20.11.36.35; Tue, 20 Jun 2023 11:36:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=Lc7sjKnp; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229652AbjFTS26 (ORCPT + 99 others); Tue, 20 Jun 2023 14:28:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56674 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229520AbjFTS25 (ORCPT ); Tue, 20 Jun 2023 14:28:57 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92A7C1717; Tue, 20 Jun 2023 11:28:54 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 2FB1661361; Tue, 20 Jun 2023 18:28:54 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 32936C433C8; Tue, 20 Jun 2023 18:28:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1687285733; bh=H/rq7SZDvG71JLbWFDr2WU3UgH4X4GJ/7+kcHnFx6Gc=; h=Date:From:To:Cc:Subject:In-Reply-To:From; b=Lc7sjKnpvayvN5gv39BysvD0hrG8waBAisqT3s+zXvYaHTrE1Faj3tNujp/wOXNU+ KTTrUfJXD8EICKYTaCSExjHZvttD/913+dqM7LFK+0mUIITnls7O82q7vsKEZTF4W6 JIViruwdQMvgKvktOpaNSGmIl972C2lQFgwr/ppIE6jfxRl8YgVY7fFZgIPMJ54tzs CQOz4N9G3CufCcj0UJ9UslJU3ZmkMyHIroA0uqDDLR0qYOBQyYC4kSmg9ENTM/5OOe NeJbgxz4rz3pjTVz5kg1WH08yLyExilw2fdEi07fdlkUt4PnX18oG5UjtGrp2zJvbO 6k14tTZawN48w== Date: Tue, 20 Jun 2023 13:28:51 -0500 From: Bjorn Helgaas To: "Limonciello, Mario" Cc: Kai-Heng Feng , bhelgaas@google.com, Mika Westerberg , Kuppuswamy Sathyanarayanan , Vidya Sagar , Michael Bottini , "Rafael J. Wysocki" , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] PCI/ASPM: Enable ASPM on external PCIe devices Message-ID: <20230620182851.GA52252@bhelgaas> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Jun 19, 2023 at 05:09:55PM -0500, Limonciello, Mario wrote: > On 6/19/2023 4:37 PM, Bjorn Helgaas wrote: > > On Mon, Jun 19, 2023 at 11:16:35AM -0500, Limonciello, Mario wrote: > > > On 6/15/2023 10:01 PM, Kai-Heng Feng wrote: > > > > On Fri, Jun 16, 2023 at 1:12 AM Bjorn Helgaas wrote: > > > > > On Thu, Jun 15, 2023 at 03:04:20PM +0800, Kai-Heng Feng wrote: > > > > > > When a PCIe device is hotplugged to a Thunderbolt port, ASPM is not > > > > > > enabled for that device. However, when the device is plugged preboot, > > > > > > ASPM is enabled by default. > > > > > > > > > > > > The disparity happens because BIOS doesn't have the ability to program > > > > > > ASPM on hotplugged devices. > > > > > > > > > > > > So enable ASPM by default for external connected PCIe devices so ASPM > > > > > > settings are consitent between preboot and hotplugged. > > > > > > > > > > > > On HP Thunderbolt Dock G4, enable ASPM can also fix BadDLLP error: > > > > > > pcieport 0000:00:1d.0: AER: Corrected error received: 0000:07:04.0 > > > > > > pcieport 0000:07:04.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) > > > > > > pcieport 0000:07:04.0: device [8086:0b26] error status/mask=00000080/00002000 > > > > > > pcieport 0000:07:04.0: [ 7] BadDLLP > > > > > > > > > > > > The root cause is still unclear, but quite likely because the I225 on > > > > > > the dock supports PTM, where ASPM timing is precalculated for the PTM. > > > > > > > > > > > > Cc: Mario Limonciello > > > > > > Cc: Mika Westerberg > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217557 > > > > > > Signed-off-by: Kai-Heng Feng > > > > > > --- > > > > > > drivers/pci/pcie/aspm.c | 4 +++- > > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c > > > > > > index 66d7514ca111..613b0754c9bb 100644 > > > > > > --- a/drivers/pci/pcie/aspm.c > > > > > > +++ b/drivers/pci/pcie/aspm.c > > > > > > @@ -119,7 +119,9 @@ static int policy_to_aspm_state(struct pcie_link_state *link) > > > > > > /* Enable Everything */ > > > > > > return ASPM_STATE_ALL; > > > > > > case POLICY_DEFAULT: > > > > > > - return link->aspm_default; > > > > > > + return dev_is_removable(&link->downstream->dev) ? > > > > > > + link->aspm_capable : > > > > > > + link->aspm_default; > > > > > I'm a little hesitant because dev_is_removable() is a convenient > > > > > test that covers the current issue, but it doesn't seem tightly > > > > > connected from a PCIe architecture perspective. > > > > > > > > > > I think the current model of compile-time ASPM policy selection: > > > > > > > > > > CONFIG_PCIEASPM_DEFAULT /* BIOS default setting */ > > > > > CONFIG_PCIEASPM_PERFORMANCE /* disable L0s and L1 */ > > > > > CONFIG_PCIEASPM_POWERSAVE /* enable L0s and L1 */ > > > > > CONFIG_PCIEASPM_POWER_SUPERSAVE /* enable L1 substates */ > > > > > > > > > > is flawed. As far as I know, there's no technical reason we > > > > > have to select this at kernel build-time. I suspect the > > > > > original reason was risk avoidance, i.e., we were worried that > > > > > we might expose hardware defects if we enabled ASPM states that > > > > > BIOS hadn't already enabled. > > > > > > > > > > How do we get out of that model? We do have sysfs knobs that > > > > > should cover all the functionality (set overall policy as above > > > > > via /sys/module/pcie_aspm/parameters/policy; set device-level > > > > > exceptions via /sys/bus/pci/devices/.../link/*_aspm). > > > > Agree. Build-time policy can be obsoleted by boot-time argument. > > > I agree as well. > > This isn't strictly relevant to the current problem, so let's put this > > on the back burner for now. > > I think it could fold into it if we end up treating the i225 > PCIe device as a quirk as mentioned below. > > > > > > In my opinion, the cleanest solution would be to enable all ASPM > > > > > functionality whenever possible and let users disable it if they > > > > > need to for performance. If there are device defects when > > > > > something is enabled, deal with it via quirks, as we do for > > > > > other PCI features. > > > > > > > > > > That feels a little risky, but let's have a conversation about > > > > > where we want to go in the long term. It's good to avoid risk, > > > > > but too much avoidance leads to its own complexity and an > > > > > inability to change things. > > > > I think we should separate the situation into two cases: > > > > - When BIOS/system firmware has the ability to program ASPM, honor > > > > it. This applies to most "internal" PCI devices. > > > > - When BIOS/system can't program ASPM, enable ASPM for whatever > > > > it's capable of. Most notable case is Intel VMD controller, and > > > > this patch for devices connected through TBT. > > > > > > > > Enabling all ASPM functionality regardless of what's being > > > > pre-programmed by BIOS is way too risky. Disabling ASPM to > > > > workaround issues and defects are still quite common among > > > > hardware manufacturers. > > > > It sounds like you have actual experience with this :) Do you have > > any concrete examples that we can use as "known breakage"? > > A variety of Intel chipsets don't support lane width switching > or speed switching.  When ASPM has been enabled on a dGPU, > these features are utilized and breakage ensues. Maybe this helps explain all the completely unmaintainable ASPM garbage in amdgpu and radeon. If these devices are broken, we need quirks for them. We can't avoid ASPM in general just because random devices break. > There are various methods to try to mitigate the impact both in > firmware and driver code. > > > This feels like a real problem to me. There are existing mechanisms > > (ACPI_FADT_NO_ASPM and _OSC PCIe cap ownership) the platform can use > > to prevent the OS from using ASPM. > > > > If vendors assume that *in addition*, the OS will pay attention to > > whatever ASPM configuration BIOS did, that's a major disconnect. We > > don't do anything like that for other PCI features, and I'm not aware > > of any restriction like that being documented. > > With both of those policies in place, how did we get into > the situation of having configuration options and knobs? The kernel parameters and config options been there pretty much from the beginning. We didn't have the per-device sysfs knobs until very recently. > > > I think the pragmatic way to approach it is to (essentially) apply > > > the policy as BIOS defaults and allow overrides from that. > > > > Do you mean that when enumerating a device (at boot-time or hot-add > > time), we would read the current ASPM config but not change it? And > > users could use the sysfs knobs to enable/disable ASPM as desired? > > Yes. Hot-added devices power up with ASPM disabled. This policy would mean the user has to explicitly enable it, which doesn't seem practical to me. > > That wouldn't solve the problem Kai-Heng is trying to solve. > > Alone it wouldn't; but if you treated the i225 PCIe device > connected to the system as a "quirk" to apply ASPM policy > from the parent device to this child device it could. I want quirks for BROKEN devices. Quirks for working hardware is a maintenance nightmare. Bjorn