Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp17350612rwd; Tue, 27 Jun 2023 01:34:31 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6jfmxJKmamO4LafW5ZzpkcxOq35CQBIgRATN3r3jJNZhHomFH23KdP92+IBHtR3S6nL77J X-Received: by 2002:a17:907:2687:b0:97c:64bd:50a5 with SMTP id bn7-20020a170907268700b0097c64bd50a5mr27466742ejc.53.1687854871678; Tue, 27 Jun 2023 01:34:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1687854871; cv=none; d=google.com; s=arc-20160816; b=bV3dL+zO2HaaprOi+I0LRPDXswOoRCzRMZno5mrm4hvhKjAARP3+steTP9JDPxT7eG JvljeHMyIhNzTb+gHlS2Dsj262IVc3RDpcZY7c3kg1HmUyo3B83sUoQZbVxbuodePd+/ nfTk+h65jJ0YRG26antnxzqyHOxq4i3hPVqvy/rPIlSOrNMLHpYBGDdMgC8OH8V6yGnk q4rko8DbxR3TdVPHTY5h4xC4KlrQf+pokEBAq5p+2ynL867/6qRuqHb3K/Mi44UPON3V KF5GvpsxevveTtfeiFcbMQskRRExvgro1zvU4pFruKN/MNF26A6cJclnlSd6XHW8dP7F 3aTg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=gg3OEKCdXYQ44h8MuGgBj8DaSBevkCPyPNd3zprHKaY=; fh=eLOmOmAOYAv6RxHXUHwJ5Bq9aUNO3x2jTMHSfmqRF+0=; b=Ch7DabYBVadG7v9V1Olu70qFwqT3gZCmYUFtngaYGMcAJeiii5Zi/czRx7T8FDriXA clmOpICZcTz3LXyIE9hi6R7PZwS94Djes/r8QsIOb0Q3NNgFxnOYZi32GjGsw+3RjZZz XLHCe7iFZLbK5mDavhiPmVVEETkVcUtVLSv+LHPnZQ2WvnztR8hYslwmlBJX5aGiF3YH tfQ/nTM5/9uq/GVjAUHTFnUGpz6BryIYnTDJ0TnqHM0ukvNT5zdLiX8Q2jtL6NvrjCwH IzHzyj4U30qx+ZYcXOxm4z1BCXpF7jJkrpFb/XEjxiJQoZwUnk4FvELnlBDEXYof/wx0 RpeQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@canonical.com header.s=20210705 header.b=l9VaXRFl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=canonical.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ox27-20020a170907101b00b0098ad2c36e5fsi3797646ejb.883.2023.06.27.01.34.06; Tue, 27 Jun 2023 01:34:31 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@canonical.com header.s=20210705 header.b=l9VaXRFl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=canonical.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230353AbjF0INN (ORCPT + 99 others); Tue, 27 Jun 2023 04:13:13 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53818 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230093AbjF0INL (ORCPT ); Tue, 27 Jun 2023 04:13:11 -0400 Received: from smtp-relay-internal-1.canonical.com (smtp-relay-internal-1.canonical.com [185.125.188.123]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A84B3C7 for ; Tue, 27 Jun 2023 01:13:09 -0700 (PDT) Received: from mail-qk1-f199.google.com (mail-qk1-f199.google.com [209.85.222.199]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-internal-1.canonical.com (Postfix) with ESMTPS id 85C203F266 for ; Tue, 27 Jun 2023 08:13:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=canonical.com; s=20210705; t=1687853587; bh=gg3OEKCdXYQ44h8MuGgBj8DaSBevkCPyPNd3zprHKaY=; h=MIME-Version:References:In-Reply-To:From:Date:Message-ID:Subject: To:Cc:Content-Type; b=l9VaXRFlSrjjfvIddCX0D+ROs32wEd2nMlQOpqXXFhArBSn8aIQgInZl2tJIQbx+g c/VYK+CqPkiQm5bm/KuiVs2hKc+C2TjepLnAJiegKcDa5jBEmrFXj3g4HUCd6EstsO lp361EIJgx57lWIu/mSPRzVljwSAHqgufkqsPNoY8YxQtfk7imkbsuPULwdQr1cv0K MrnEE1vQv6arI6vBsGs4VMFH4iRNGjm294EEwZnITKbsaKIR/tvMOQg+y1khaq+Od2 +2x03hd1ZSE2aurJONpHYGevlgjF5tFV2CM7Q1wVoaLoZdpyqRyh27CnvJgeWLhSMW 1XGNAfvU0Sn2Q== Received: by mail-qk1-f199.google.com with SMTP id af79cd13be357-7659b44990eso197863085a.1 for ; Tue, 27 Jun 2023 01:13:07 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687853585; x=1690445585; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=gg3OEKCdXYQ44h8MuGgBj8DaSBevkCPyPNd3zprHKaY=; b=SvT/5olHVaTAzrfk8eumIqScxHoflqJkNAk0FN1fx+zpoJTEkzi52aepBZ5MD3EjNp mVAFgOjFL0H+SrNMqUl3wa5V+8E3T2hCv5W35Qqxbi9987GFg/wkuPs3zMxVE7hnEHIG 8sQMqRHCe78RTOIQc/6+q2kjO6ArKJx44zj8rKdSkKpcSKQIpIn0CTAmYq2kYlMMxHA5 BrQUubODMsk7FRcEljuu6hQU5XNOpVMsEeeI8z7UqYj3vCsK2Z9ahfBI9/eHLumbPd4H TWkqxxZNP2YAxHXIKJoRffiNyK3Z11mky9jutYmS0RYTc1B+xCYtgADEBIIHOTvK9HNL O2rw== X-Gm-Message-State: AC+VfDwQuSfUD2c5qy2odIS6zR6O3m2da1xav9ENd3/IU4GVf0+XWzUZ kZVjrOF+UDn192yMT3jFqpmF9EEUmQ1VsFrsvh1ssefICtc+G9w6BNE61VcbjjvhQevu4z/CX/6 GePcJlBMljFnklk4DFk6N9XmkZxtUniV2M9IsnlmR2RhF06LqogxNqN5daw== X-Received: by 2002:a05:620a:b5d:b0:765:a99c:96f3 with SMTP id x29-20020a05620a0b5d00b00765a99c96f3mr3991341qkg.28.1687853585410; Tue, 27 Jun 2023 01:13:05 -0700 (PDT) X-Received: by 2002:a05:620a:b5d:b0:765:a99c:96f3 with SMTP id x29-20020a05620a0b5d00b00765a99c96f3mr3991319qkg.28.1687853585132; Tue, 27 Jun 2023 01:13:05 -0700 (PDT) MIME-Version: 1.0 References: <16bcc313-a4e1-ab50-4487-c99ccf5ecdf9@intel.com> <20230622131123.GA137990@bhelgaas> In-Reply-To: <20230622131123.GA137990@bhelgaas> From: Kai-Heng Feng Date: Tue, 27 Jun 2023 16:12:53 +0800 Message-ID: Subject: Re: [Intel-wired-lan] [PATCH] igc: Ignore AER reset when device is suspended To: Bjorn Helgaas Cc: "Neftin, Sasha" , "Ruinskiy, Dima" , "Gomes, Vinicius" , "Zulkifli, Muhammad Husaini" , Tony Luck , Kees Cook , linux-pci@vger.kernel.org, "Mushayev, Nikolay" , linux-kernel@vger.kernel.org, jesse.brandeburg@intel.com, "Edri, Michael" , "Guilherme G. Piccoli" , Eric Dumazet , anthony.l.nguyen@intel.com, linux-hardening@vger.kernel.org, Jakub Kicinski , netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org, Paolo Abeni , "Avivi, Amir" , "David S. Miller" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Jun 22, 2023 at 9:11=E2=80=AFPM Bjorn Helgaas = wrote: > > On Thu, Jun 22, 2023 at 08:09:34AM +0300, Neftin, Sasha wrote: > > On 6/21/2023 23:43, Bjorn Helgaas wrote: > > > On Tue, Jun 20, 2023 at 08:36:36PM +0800, Kai-Heng Feng wrote: > > > > When a system that connects to a Thunderbolt dock equipped with I22= 5, > > > > I225 stops working after S3 resume: > > > > > The issue is that the PTM requests are sending before driver resume= s the > > > > device. Since the issue can also be observed on Windows, it's quite > > > > likely a firmware/hardwar limitation. > > > > > > I thought c01163dbd1b8 ("PCI/PM: Always disable PTM for all devices > > > during suspend") would turn off PTM. Is that not working for this > > > path, or are we re-enabling PTM incorrectly, or something else? > > > > I think we hit on the HW bug here. On some i225/6 parts, PTM requests a= re > > sent before SW takes ownership of the device. This patch could help. > > Is there an erratum we can read? If this is needed to work around a > hardware defect, there should be a comment in the code to that effect, > and we should have a better understanding because there may be other > scenarios (suspend/resume, hotplug, etc) that need similar changes. Actually, similar message can be seen on hotplugging the device. The AER message will be gone shortly after the driver done it's probing. > > (I know this patch is to work around a suspend/resume issue, but the > change is in the AER error recovery path, so it doesn't quite fit > together for me yet.) This is something I really want to discuss. This is not the first time that AER handling doesn't play well with system resume because the error handling and resume routine can happen at the same time. Some possible way going forward: 1) Serialize error recovery and resume routine. - If error recovery happens first and it's a successful recovery, does the resume callback still need to be called? - If the device successfully resume, is the error recovery routine still needed? So I think the most plausible way is to call error recovery only if the resume fails. Ignore the AER if resume success. 2) Disable the AER interrupt during suspend - Since the AER is still recorded and AER interrupt gets enabled by port driver before child device resuming, the error recovery/resume race can still happen. - So the port services resume routines can only be called after the entire PCIe hierarchy is resumed. 3) Disable the AER service completely during suspend - This is what's in my mind. If the AER is caused by firmware and hardware (like most cases), the most feasible way is to workaround the issue in the driver. IMO ether 1) or 2) requires involvements that add little benefit. So hopefully we can opt to 3). > > Are you saying the NIC sends PTM requests when it doesn't have PTM > Enable set? I think I mentioned during previous discussion. The PTM gets enabled by the firmware/hardware on the TBT dock right on S3 resume. The issue is also logged on Windows' Event Viewer, but hardware vendor doesn't care at all since the device is still functional :) > > What exactly does it mean for "SW to take ownership of the device"? > What PCIe transaction would tell the device the SW has taken > ownership? Please correct me if I am wrong, but Intel ethernet devices may need the driver to perform some actions so the ownership can be switched between software and firmware. Kai-Heng > > So far this feels kind of hand-wavey. > > > > Checking pci_is_enable() in the .error_detected() callback looks like > > > a pattern that may need to be replicated in many other drivers, which > > > makes me think it may not be the best approach. > > > > > > > So avoid resetting the device if it's not resumed. Once the device = is > > > > fully resumed, the device can work normally. > > > > > > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D216850 > > > > Signed-off-by: Kai-Heng Feng > > > > --- > > > > drivers/net/ethernet/intel/igc/igc_main.c | 3 +++ > > > > 1 file changed, 3 insertions(+) > > > > > > > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/ne= t/ethernet/intel/igc/igc_main.c > > > > index fa764190f270..6a46f886ff43 100644 > > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c > > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c > > > > @@ -6962,6 +6962,9 @@ static pci_ers_result_t igc_io_error_detected= (struct pci_dev *pdev, > > > > struct net_device *netdev =3D pci_get_drvdata(pdev); > > > > struct igc_adapter *adapter =3D netdev_priv(netdev); > > > > + if (!pci_is_enabled(pdev)) > > > > + return 0; > > > > + > > > > netif_device_detach(netdev); > > > > if (state =3D=3D pci_channel_io_perm_failure)