From: Ethan Zhao
To: bhelgaas@google.com, oohall@gmail.com, ruscur@russell.cc, lukas@wunner.de, andriy.shevchenko@linux.intel.com, stuart.w.hayes@gmail.com, mr.nuke.me@gmail.com, mika.westerberg@linux.intel.com
Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, ashok.raj@linux.intel.com, sathyanarayanan.kuppuswamy@intel.com,
 xerces.zhao@gmail.com, Ethan Zhao
Subject: [PATCH v7 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
Date: Sat, 3 Oct 2020 03:55:12 -0400
Message-Id: <20201003075514.32935-4-haifeng.zhao@intel.com>
X-Mailer: git-send-email 2.18.4
In-Reply-To: <20201003075514.32935-1-haifeng.zhao@intel.com>
References: <20201003075514.32935-1-haifeng.zhao@intel.com>

When a root port has the DPC capability and it is enabled, an error that
triggers DPC causes the DPC, DLLSC and PDC interrupts to be delivered to
the DPC driver and the pciehp driver at almost the same time. Both
drivers then act on the same event concurrently, which mixes up the
error handling, recovery, removal and plugin procedures:

1. The port and device are in an error-recovery reset initiated by the
   DPC hardware, while the pciehp driver treats the same event as a
   hot-remove or hot-plugin of the device.

2. While the DPC handler is calling the device driver's err_handler
   callbacks (error_detected/resume etc.), the slot may be powered off
   by pciehp -> remove_board() -> pciehp_power_off_slot().

3. While DPC handler -> pci_do_recovery is taking different actions to
   detect and recover from the error based on device->error_state, the
   pciehp driver could change that state on the fly via
   pciehp_unconfigure_device() -> pci_walk_bus() ->
   pci_dev_set_disconnected().

4. While the DPC handler is calling the device driver's err_handler
   callbacks to detect and recover from the error, the pciehp driver
   could be unbinding the device and releasing its driver.

...

Whether or not a hotplug operation is in progress when a
NON-FATAL/FATAL error happens, the result is not deterministic.

So we need some kind of synchronization between pciehp DLLSC/PDC
handling and the DPC driver's error recovery handling. We need a
determinate result from DPC error containment (link is recovered or
not, device is still there or removed) before doing the pciehp
hot-remove and hot-plugin procedure. Don't mix them together.

Per our test on ICS platform, DPC error containment and the software
handler take from 10ms up to 50ms to clear the DPC triggered status.
That is quick enough for pciehp, compared with the 1000ms it waits to
ignore DLLSC/PDC after powering off the slot.

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized: the DPC hardware does the controller
reset and other recovery actions first, then the DPC driver runs the
device drivers' callbacks and clears the DPC status, and finally pciehp
handles DLLSC, PDC etc.

Tested with tens of brute-force hot-remove and hot-plugin cycles of
PCIe Gen4 NVMe SSDs, with arbitrary time intervals between the two
actions, and also stressed with the DPC injection test. The system
recovered to a normal working state from NON-FATAL/FATAL errors as
expected. Hotplug works well without any random indeterminate errors or
malfunction.

Brute DPC error injection script:

for i in {0..100}
do
        setpci -s 64:02.0 0x196.w=000a
        setpci -s 65:00.0 0x04.w=0544
        mount /dev/nvme0n1p1 /root/nvme
        sleep 1
done

Signed-off-by: Ethan Zhao
Tested-by: Wen Jin
Tested-by: Shanshan Zhang
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [3/5] from [2/5] and re-wrote description.
 v7: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
 	down_read(&ctrl->reset_lock);
 	if (events & DISABLE_SLOT)
 		pciehp_handle_disable_request(ctrl);
-	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+	else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+		pci_wait_port_outdpc(pdev);
 		pciehp_handle_presence_or_link_change(ctrl, events);
+	}
 	up_read(&ctrl->reset_lock);
 
 	ret = IRQ_HANDLED;
-- 
2.18.4
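
Note: pci_wait_port_outdpc() is not defined in this patch. Purely as an
illustration of the "wait until the port is out of DPC, with a bound"
idea described in the changelog, here is a user-space C sketch. It is
not the kernel helper. The names fake_port, read_dpc_status() and
wait_port_out_of_dpc(), the simulated clock and the polling step are all
assumptions made up for this example. The only fact taken from the PCIe
spec is that bit 0 of the DPC Status register is DPC Trigger Status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the DPC Status register: DPC Trigger Status. */
#define DPC_STATUS_TRIGGER 0x0001u

/* Hypothetical simulated port; stands in for real config-space access. */
struct fake_port {
	uint16_t dpc_status;	/* simulated DPC Status register */
	int clear_after_ms;	/* simulated time at which recovery clears the bit */
	int elapsed_ms;		/* simulated clock */
};

/* Hypothetical register read: the trigger bit drops once recovery is done. */
static uint16_t read_dpc_status(struct fake_port *p)
{
	if (p->elapsed_ms >= p->clear_after_ms)
		p->dpc_status &= (uint16_t)~DPC_STATUS_TRIGGER;
	return p->dpc_status;
}

/*
 * Poll until the trigger bit clears or timeout_ms of simulated time has
 * passed. Returns true if the port left DPC, false on timeout.
 */
static bool wait_port_out_of_dpc(struct fake_port *p, int timeout_ms,
				 int step_ms)
{
	assert(step_ms > 0);

	while (read_dpc_status(p) & DPC_STATUS_TRIGGER) {
		if (p->elapsed_ms >= timeout_ms)
			return false;
		p->elapsed_ms += step_ms;	/* stands in for a sleep */
	}
	return true;
}
```

With the figures quoted above (status cleared within 50ms, pciehp
tolerating 1000ms), a caller would pass a timeout of 1000 and see the
wait return well before the bound. A port that never leaves DPC makes
the wait give up at the bound instead of blocking the hotplug handler
forever.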