Date: Fri, 8 Sep 2023 15:11:04 -0500
From: Bjorn Helgaas
To: Alex Williamson
Cc: Wu Zongyong, lukas@wunner.de, sdonthineni@nvidia.com, bhelgaas@google.com,
    linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
    wllenyj@linux.alibaba.com, wutu.xq2@linux.alibaba.com,
    gerry@linux.alibaba.com, pjaroszynski@nvidia.com
Subject: Re: [PATCH] PCI: Mark NVIDIA T4 GPUs to avoid bus reset
Message-ID: <20230908201104.GA305023@bhelgaas>
In-Reply-To: <20230907214037.7f35f26a.alex.williamson@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 07, 2023 at 09:40:37PM -0600, Alex Williamson wrote:
> On Fri, 8 Sep 2023 10:50:48 +0800
> Wu Zongyong wrote:
>
> > On Wed, Aug 09, 2023 at 06:05:18PM -0500, Bjorn Helgaas wrote:
> > > On Mon, Apr 10, 2023 at 08:34:11PM +0800, Wu Zongyong wrote:
> > > > NVIDIA T4 GPUs do not work with SBR. This problem is found when the T4
> > > > card is direct attached to a Root Port only. So avoid bus reset by
> > > > marking T4 GPUs PCI_DEV_FLAGS_NO_BUS_RESET.
> > > >
> > > > Fixes: 4c207e7121fa ("PCI: Mark some NVIDIA GPUs to avoid bus reset")
> > > > Signed-off-by: Wu Zongyong
> > >
> > > Applied to pci/virtualization for v6.6, thanks!
> >
> > I talk about the issue with NVIDIA, and they think the issue is probably related
> > the pci link instead of the T4 GPU card.
> >
> > I will try to describe the issue I met in detail.
> >
> > The T4 card which is direct attached to a Root Port and I rebind it to
> > vfio-pci driver. Then I try to use to call some vfio-related api and the
> > ioctl VFIO_GROUP_GET_DEVICE_FD failed.
> >
> > The stack is (base on kernel v5.10):
> >   vfio_group_fops_unl_ioctl
> >   vfio_group_get_device_fd
> >   vfio_pci_open
> >   vfio_pci_enable // return value is -19
> >   pci_try_reset_function
> >   __pci_reset_function_locked
> >
> > After the __pci_reset_function_locked(), the dmesg shows:
> > [12207494.508467] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Down
> > [12207494.508535] vfio-pci 0000:40:00.0: No device request channel registered, blocked until released by user
> > [12207494.518426] pci 0000:40:00.0: Removing from iommu group 84
> > [12207495.532365] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Card present
> > [12207495.532367] pcieport 0000:3f:00.0: pciehp: Slot(5-1): Link Up
> >
> > NVIDIA people thinks this root port is not going through this reset logic and getting the
> > link down/hot plug interrupts[1].
> >
> > Can you revert the patch I sent and maybe we should dig it deeply.
>
> Yes, please revert, we do testing with T4 and have not seen any issues
> with bus reset. The T4 provides neither PM nor FLR reset, so masking
> bus reset compromises this device for assignment scenarios. I can send
> a revert patch if requested. Thanks,

Reverted as below. Hopefully this will make v6.6-rc1.

commit 42f5c40846f3 ("Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"")
Author: Bjorn Helgaas
Date:   Fri Sep 8 14:55:30 2023 -0500

    Revert "PCI: Mark NVIDIA T4 GPUs to avoid bus reset"

    This reverts commit d5af729dc2071273f14cbb94abbc60608142fd83.

    d5af729dc207 ("PCI: Mark NVIDIA T4 GPUs to avoid bus reset") avoided
    Secondary Bus Reset on the T4 because the reset seemed to not work when
    the T4 was directly attached to a Root Port.

    But NVIDIA thinks the issue is probably related to some issue with the
    Root Port, not with the T4. The T4 provides neither PM nor FLR reset, so
    masking bus reset compromises this device for assignment scenarios.

    Revert d5af729dc207 as requested by Wu Zongyong.
    This will leave SBR broken in the specific configuration Wu tested, as it
    was in v6.5, so Wu will debug that further.

    Link: https://lore.kernel.org/r/ZPqMCDWvITlOLHgJ@wuzongyong-alibaba
    Signed-off-by: Bjorn Helgaas

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 5de09d2eb014..eeec1d6f9023 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3726,7 +3726,7 @@ static void quirk_no_bus_reset(struct pci_dev *dev)
  */
 static void quirk_nvidia_no_bus_reset(struct pci_dev *dev)
 {
-	if ((dev->device & 0xffc0) == 0x2340 || dev->device == 0x1eb8)
+	if ((dev->device & 0xffc0) == 0x2340)
 		quirk_no_bus_reset(dev);
 }
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,