From: wenxiong@linux.vnet.ibm.com
To: linux-nvme@lists.infradead.org
Cc: keith.busch@intel.com, axboe@fb.com, linux-kernel@vger.kernel.org, wenxiong@us.ibm.com, Wen Xiong
Subject: [PATCH V3] nvme-pci: Fixes EEH failure on ppc
Date: Thu, 15 Feb 2018 14:05:10 -0600
Message-Id: <1518725110-25894-1-git-send-email-wenxiong@linux.vnet.ibm.com>
X-Mailer: git-send-email 1.7.1

From: Wen Xiong

With commit b2a0eb1a0ac72869c910a79d935a0b049ec78ad9 ("nvme-pci: Remove
watchdog timer"), EEH recovery stops working on ppc.

After the watchdog timer routine was removed, triggering EEH on ppc hits
EEH in nvme_timeout(). Check whether the PCI channel is offline at the
beginning of nvme_timeout(); if it is already offline, no further nvme
timeout processing is needed.

Add a memory barrier before calling pci_channel_offline().

With this patch, EEH recovery works successfully on ppc.

Signed-off-by: Wen Xiong

[ 232.585495] EEH: PHB#3 failure detected, location: N/A
[ 232.585545] CPU: 8 PID: 4873 Comm: kworker/8:1H Not tainted 4.14.0-6.el7a.ppc64le #1
[ 232.585646] Workqueue: kblockd blk_mq_timeout_work
[ 232.585705] Call Trace:
[ 232.585743] [c000003f7a533940] [c000000000c3556c] dump_stack+0xb0/0xf4 (unreliable)
[ 232.585823] [c000003f7a533980] [c000000000043eb0] eeh_check_failure+0x290/0x630
[ 232.585924] [c000003f7a533a30] [c008000011063f30] nvme_timeout+0x1f0/0x410 [nvme]
[ 232.586038] [c000003f7a533b00] [c000000000637fc8] blk_mq_check_expired+0x118/0x1a0
[ 232.586134] [c000003f7a533b80] [c00000000063e65c] bt_for_each+0x11c/0x200
[ 232.586191] [c000003f7a533be0] [c00000000063f1f8] blk_mq_queue_tag_busy_iter+0x78/0x110
[ 232.586272] [c000003f7a533c30] [c0000000006367b8] blk_mq_timeout_work+0xa8/0x1c0
[ 232.586351] [c000003f7a533c80] [c00000000015d5ec] process_one_work+0x1bc/0x5f0
[ 232.586431] [c000003f7a533d20] [c00000000016060c] worker_thread+0xac/0x6b0
[ 232.586485] [c000003f7a533dc0] [c00000000016a528] kthread+0x168/0x1b0
[ 232.586539] [c000003f7a533e30] [c00000000000b4e8] ret_from_kernel_thread+0x5c/0x74
[ 232.586640] nvme nvme0: I/O 10 QID 0 timeout, reset controller
[ 232.586640] EEH: Detected error on PHB#3
[ 232.586642] EEH: This PCI device has failed 1 times in the last hour
[ 232.586642] EEH: Notify device drivers to shutdown
[ 232.586645] nvme nvme0: frozen state error detected, reset controller
[ 234.098667] EEH: Collect temporary log
[ 234.098694] PHB4 PHB#3 Diag-data (Version: 1)
[ 234.098728] brdgCtl: 00000002
[ 234.098748] RootSts: 00070020 00402000 c1010008 00100107 00000000
[ 234.098807] RootErrSts: 00000000 00000020 00000001
[ 234.098878] nFir: 0000800000000000 0030001c00000000 0000800000000000
[ 234.098937] PhbSts: 0000001800000000 0000001800000000
[ 234.098990] Lem: 0000000100000100 0000000000000000 0000000100000000
[ 234.099067] PhbErr: 000004a000000000 0000008000000000 2148000098000240 a008400000000000
[ 234.099140] RxeMrgErr: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
[ 234.099250] PcieDlp: 0000000000000000 0000000000000000 8000000000000000
[ 234.099326] RegbErr: 00d0000010000000 0000000010000000 8800005800000000 0000000007011000
[ 234.099418] EEH: Reset without hotplug activity
[ 237.317675] nvme 0003:01:00.0: Refused to change power state, currently in D3
[ 237.317740] nvme 0003:01:00.0: Using 64-bit DMA iommu bypass
[ 237.317797] nvme nvme0: Removing after probe failure status: -19
[ 361.139047689,3] PHB#0003[0:3]: Escalating freeze to fence PESTA[0]=a440002a01000000
[ 237.617706] EEH: Notify device drivers the completion of reset
[ 237.617754] nvme nvme0: restart after slot reset
[ 237.617834] EEH: Notify device driver to resume
[ 238.777746] nvme0n1: detected capacity change from 24576000000 to 0
[ 238.777841] nvme0n2: detected capacity change from 24576000000 to 0
[ 238.777944] nvme0n3: detected capacity change from 24576000000 to 0
[ 238.778019] nvme0n4: detected capacity change from 24576000000 to 0
[ 238.778132] nvme0n5: detected capacity change from 24576000000 to 0
[ 238.778222] nvme0n6: detected capacity change from 24576000000 to 0
[ 238.778314] nvme0n7: detected capacity change from 24576000000 to 0
[ 238.778416] nvme0n8: detected capacity change from 24576000000 to 0
---
 drivers/nvme/host/pci.c |   13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 6fe7af0..dfba90d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1153,12 +1153,6 @@ static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 	if (!(csts & NVME_CSTS_CFS) && !nssro)
 		return false;
 
-	/* If PCI error recovery process is happening, we cannot reset or
-	 * the recovery mechanism will surely fail.
-	 */
-	if (pci_channel_offline(to_pci_dev(dev->dev)))
-		return false;
-
 	return true;
 }
 
@@ -1189,6 +1183,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_command cmd;
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
+	/* If PCI error recovery process is happening, we cannot reset or
+	 * the recovery mechanism will surely fail.
+	 */
+	mb();
+	if (pci_channel_offline(to_pci_dev(dev->dev)))
+		return BLK_EH_RESET_TIMER;
+
 	/*
 	 * Reset immediately if the controller is failed
 	 */
-- 
1.7.1
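
Illustration only, not part of the patch: the control-flow change can be
sketched in user space. This is a minimal model under stated assumptions,
not kernel code. Every name below (model_dev, model_timeout, the
MODEL_EH_* values) is hypothetical. A plain atomic flag stands in for
pci_channel_offline(), a C11 sequentially consistent fence stands in for
mb(), and the rest of nvme_timeout() (abort handling and so on) is
collapsed into a single "controller failed" branch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's blk_eh_timer_return values. */
enum model_eh_return {
	MODEL_EH_RESET_TIMER,	/* re-arm the request timer, take no action */
	MODEL_EH_HANDLED,	/* the handler reset the controller */
};

/* Hypothetical device model; this is not struct nvme_dev. */
struct model_dev {
	atomic_bool channel_offline;	/* models pci_channel_offline() */
	bool controller_failed;		/* models a true nvme_should_reset() */
	int resets_issued;		/* counts modelled controller resets */
};

static enum model_eh_return model_timeout(struct model_dev *dev)
{
	/* Full fence, standing in for the mb() added by the patch. */
	atomic_thread_fence(memory_order_seq_cst);

	/*
	 * While error recovery owns the device, only re-arm the timer,
	 * so the timeout path does not race with the recovery path.
	 */
	if (atomic_load(&dev->channel_offline))
		return MODEL_EH_RESET_TIMER;

	/* Simplified: a failed controller is reset immediately. */
	if (dev->controller_failed) {
		dev->resets_issued++;
		return MODEL_EH_HANDLED;
	}

	/* Simplified: every other case just re-arms the timer. */
	return MODEL_EH_RESET_TIMER;
}
```

In this model, calling model_timeout() with channel_offline set leaves
resets_issued untouched even when controller_failed is true; once the flag
is cleared, the same call issues the reset. That mirrors the intent of
moving the offline check from nvme_should_reset() to the top of
nvme_timeout().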