From: Dan Williams
Date: Wed, 17 Mar 2021 10:45:21 -0700
Subject: Re: [PATCH v2 1/1] PCI: pciehp: Skip DLLSC handling if DPC is triggered
To: Sathyanarayanan Kuppuswamy Natarajan
Cc: Lukas Wunner, Kuppuswamy Sathyanarayanan, Bjorn Helgaas, Linux PCI,
    Linux Kernel Mailing List, "Raj, Ashok", Keith Busch,
    knsathya@kernel.org, Sinan Kaya

On Wed, Mar 17, 2021 at 10:20 AM Sathyanarayanan Kuppuswamy Natarajan wrote:
>
> Hi,
>
> On Wed, Mar 17, 2021 at 9:31 AM Dan Williams wrote:
> >
> > On Tue, Mar 16, 2021 at 10:31 PM Lukas Wunner wrote:
> > >
> > > On Tue, Mar 16, 2021 at 10:08:31PM -0700, Dan Williams wrote:
> > > > On Tue, Mar 16, 2021 at 9:14 PM Lukas Wunner wrote:
> > > > >
> > > > > On Fri, Mar 12, 2021 at 07:32:08PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> > > > > > +       if ((events == PCI_EXP_SLTSTA_DLLSC) && is_dpc_reset_active(pdev)) {
> > > > > > +               ctrl_info(ctrl, "Slot(%s): DLLSC event(DPC), skipped\n",
> > > > > > +                         slot_name(ctrl));
> > > > > > +               ret = IRQ_HANDLED;
> > > > > > +               goto out;
> > > > > > +       }
> > > > >
> > > > > Two problems here:
> > > > >
> > > > > (1) If recovery fails, the link will *remain* down, so there'll be
> > > > > no Link Up event. You've filtered the Link Down event, thus the
> > > > > slot will remain in ON_STATE even though the device in the slot is
> > > > > no longer accessible. That's not good, the slot should be brought
> > > > > down in this case.
> > > > >
> > > > Can you elaborate on why that is "not good" from the end user
> > > > perspective? From a driver perspective the device driver context is
> > > > lost and the card needs servicing. The service event starts a new
> > > > cycle of slot-attention being triggered and that syncs the slot-down
> > > > state at that time.
> > > >
> > > All of pciehp's code assumes that if the link is down, the slot must be
> > > off. A slot which is in ON_STATE for a prolonged period of time even
> > > though the link is down is an oddity the code doesn't account for.
> > >
> > > If the link goes down, the slot should be brought into OFF_STATE.
> > > (It's okay though to delay bringdown until DPC recovery has completed
> > > unsuccessfully, which is what the patch I'm proposing does.)
> > >
> > > I don't understand what you mean by "service event". Someone unplugging
> > > and replugging the NVMe drive?
> > >
> > Yes, service meaning a technician physically removes the card.
> >
> > > >
> > > > > (2) If recovery succeeds, there's a race where pciehp may call
> > > > > is_dpc_reset_active() *after* dpc_reset_link() has finished.
> > > > > So both the DPC Trigger Status bit as well as pdev->dpc_reset_active
> > > > > will be cleared. Thus, the Link Up event is not filtered by pciehp
> > > > > and the slot is brought down and back up even though DPC recovery
> > > > > was successful, which seems undesirable.
> > > > >
> > > > The hotplug driver never saw the Link Down, so what does it do when
> > > > the slot transitions from Link Up to Link Up? Do you mean the Link
> > > > Down might fire after the dpc recovery has completed if the hotplug
> > > > notification was delayed?
> > > >
> > > If the Link Down is filtered and the Link Up is not, pciehp will
> > > bring down the slot and then bring it back up. That's because pciehp
> > > can't really tell whether a DLLSC event is Link Up or Link Down.
> > >
> > > It just knows that the link was previously up, is now up again,
> > > but must have been down intermittently, so transactions to the
> > > device in the slot may have been lost and the slot is therefore
> > > brought down for safety. Because the link is up, it is then
> > > brought back up.
> > >
> > I wonder why we're not seeing that effect in testing?
>
> In our test case, there is a good chance that the LINK UP event is also
> filtered. We change the dpc_reset_active status only after we verify
> the link is up. So if hotplug handler handles the LINK UP event before
> we change the status of dpc_reset_active, then it will not lead to the
> issue mentioned by Lukas.
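For context, a minimal sketch of the ordering described above, as if it
lived in drivers/pci/pcie/dpc.c. The dpc_reset_active flag is the one
assumed to be added by this series (modelled here as a plain bool on
struct pci_dev for brevity), and the DPC recovery flow is heavily
simplified; only pci_write_config_word(), pcie_wait_for_link() and the
PCI_EXP_DPC_* definitions are existing kernel interfaces, everything
else is illustrative:

/*
 * Simplified sketch, not the code in the series: dpc_reset_active is
 * assumed to be the flag added by this patch set. It is set before
 * DPC recovery begins and cleared only after the link has been
 * confirmed up again, so a DLLSC event handled before that point is
 * still filtered by pciehp's is_dpc_reset_active() check.
 */
static pci_ers_result_t dpc_reset_link_sketch(struct pci_dev *pdev)
{
        pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
        u16 cap = pdev->dpc_cap;

        pdev->dpc_reset_active = true;  /* visible to pciehp's DLLSC filter */

        /* Clear DPC Trigger Status so the port may leave DPC and retrain. */
        pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
                              PCI_EXP_DPC_STATUS_TRIGGER);

        /* Wait for Data Link Layer Link Active before declaring success. */
        if (pcie_wait_for_link(pdev, true))
                ret = PCI_ERS_RESULT_RECOVERED;

        /*
         * Only now drop the flag; this is the window Lukas points at:
         * a DLLSC event processed after this point is no longer filtered.
         */
        pdev->dpc_reset_active = false;
        return ret;
}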
Ah, ok, we're missing a flush of the hotplug event handler after the
link is up to make sure the hotplug handler does not see the Link Up.
I'm not immediately seeing how the new proposal ensures that there is
no Link Up event still in flight after DPC completes its work. Wouldn't
it be required to throw away Link Up to Link Up transitions?
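One possible shape such a "throw away Link Up to Link Up transitions"
filter could take, purely as an illustrative sketch against pciehp's
existing internals (struct controller, ON_STATE, ctrl->state_lock,
pciehp_check_link_active(), pciehp_card_present()), as if it were a
helper in drivers/pci/hotplug/pciehp_ctrl.c; this is not code from the
patch under discussion:

/*
 * Illustrative sketch only, not the proposed patch: if pciehp already
 * considers the slot up and both the link and card presence still look
 * good, a DLLSC event that raced with a successful DPC recovery could
 * be treated as a spurious Link Up -> Link Up transition and dropped.
 */
static bool dllsc_is_spurious_link_up(struct controller *ctrl)
{
        bool spurious = false;

        mutex_lock(&ctrl->state_lock);
        if (ctrl->state == ON_STATE &&
            pciehp_check_link_active(ctrl) > 0 &&
            pciehp_card_present(ctrl) > 0) {
                ctrl_info(ctrl, "Slot(%s): Link Up -> Link Up, ignoring\n",
                          slot_name(ctrl));
                spurious = true;
        }
        mutex_unlock(&ctrl->state_lock);

        return spurious;
}

The obvious caveat, per Lukas's point above, is that an intermittent
link down caused by a genuine surprise removal and reinsertion would be
masked by such a check, so it only seems safe when the event can be
positively attributed to a completed DPC recovery.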