Date: Fri, 11 May 2018 11:26:11 -0600
From: Keith Busch
To: Bjorn Helgaas
Cc: Andrew Lutomirski, Jesse Vincent, Sagi Grimberg,
    linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-nvme@lists.infradead.org, Jens Axboe, Bjorn Helgaas,
    Christoph Hellwig
Subject: Re: Another NVMe failure, this time with AER info
Message-ID: <20180511172610.GB7344@localhost.localdomain>
References: <20180511165752.GG190385@bhelgaas-glaptop.roam.corp.google.com>
In-Reply-To: <20180511165752.GG190385@bhelgaas-glaptop.roam.corp.google.com>

On Fri, May 11, 2018 at 11:57:52AM -0500, Bjorn Helgaas wrote:
> We reported several corrected errors before the nvme timeout:
>
> [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
> [12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
> [12750.305465] nvme nvme0: Removing after probe failure status: -19
> [12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
> [12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller
>
> The corrected errors are supposedly recovered in hardware without
> software intervention, and AER logs them for informational purposes.
>
> But it seems very likely that these corrected errors are related to
> the nvme timeout: the first corrected errors were logged at
> 12720.894411, nvme_io_timeout defaults to 30 seconds, and the nvme
> timeout was at 12750.281158.

The nvme_timeout handling is broken at the moment, but I'm not sure any
of the fixes being considered will help here if we're really getting
MMIO errors (that's what it looks like).

> I don't have any good ideas. As a shot in the dark, you could try
> running these commands before doing a suspend:
>
> # setpci -s01:00.0 0x98.W
> # setpci -s00:1c.0 0x68.W
> # setpci -s01:00.0 0x198.L
> # setpci -s00:1c.0 0x208.L
>
> # setpci -s01:00.0 0x198.L=0x00000000
> # setpci -s01:00.0 0x98.W=0x0000
> # setpci -s00:1c.0 0x208.L=0x00000000
> # setpci -s00:1c.0 0x68.W=0x0000
>
> # lspci -vv -s00:1c.0
> # lspci -vv -s01:00.0
>
> The idea is to turn off ASPM L1.2 and LTR, just because that's new and
> we've had issues with it before. If you try this, please collect the
> output of the commands above in addition to the dmesg log, in case my
> math is bad.

I trust you know the offsets here, but it's hard to tell what this
is doing with hard-coded addresses.
Just to be safe and for clarity, I recommend the 'CAP_*+' with a mask.

For example, disabling ASPM L1.2 can look like:

  # setpci -s CAP_PM+8.l=0:4

And disabling LTR:

  # setpci -s CAP_EXP+28.w=0:400
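
For anyone not used to the value:mask form: setpci changes only the bits
set in the mask and leaves the rest of the register alone, which is why
it is safer than writing the whole register to zero. A rough sketch of
that arithmetic in shell is below. masked_write is a made-up helper name
for illustration (it is not part of pciutils), and the starting register
values are invented, not read from the hardware in this thread.

```shell
#!/bin/bash
# Illustrative only: models how a setpci "value:mask" write combines with
# the current register contents. Bits outside the mask are preserved.
# masked_write is a hypothetical helper, not a pciutils command.
masked_write() {
    old=$1
    value=$2
    mask=$3
    printf '0x%x\n' $(( (old & ~mask) | (value & mask) ))
}

# "0:4" clears only bit 2; the starting value 0xf is made up.
masked_write 0xf 0x0 0x4        # prints 0xb

# "0:400" clears only bit 10; the starting value 0x40f is made up.
masked_write 0x40f 0x0 0x400    # prints 0xf
```

By contrast, a plain "=0x0000" write as in the quoted commands would clear
every bit in the register, not just the one of interest.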