Date: Fri, 11 May 2018 11:57:52 -0500
From: Bjorn Helgaas
To: Andrew Lutomirski
Cc: Jesse Vincent, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg,
    Jens Axboe, linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-kernel@vger.kernel.org
Subject: Re: Another NVMe failure, this time with AER info
Message-ID: <20180511165752.GG190385@bhelgaas-glaptop.roam.corp.google.com>

Andrew wrote:
> A friend of mine has a brand new LG laptop that has intermittent NVMe
> failures.  They mostly happen during a suspend/resume cycle
> (apparently during suspend, not resume).  Unlike the earlier
> Dell/Samsung issue, the NVMe device isn't completely gone -- MMIO
> reads fail, but PCI configuration space is apparently still there:
>
>   nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
>
> and it comes with a nice AER dump:
>
>   [12720.894411] pcieport 0000:00:1c.0: AER: Multiple Corrected error received: id=00e0
>   [12720.909747] pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Transmitter ID)
>   [12720.909751] pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00001001/00002000
>   [12720.909754] pcieport 0000:00:1c.0:    [ 0] Receiver Error         (First)
>   [12720.909756] pcieport 0000:00:1c.0:    [12] Replay Timer Timeout

I opened this bugzilla and attached the dmesg and lspci -vv output to it:

  https://bugzilla.kernel.org/show_bug.cgi?id=199695

The root port at 00:1c.0 leads to the NVMe device at 01:00.0 (this is
nvme0):

  00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
  01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 (prog-if 02 [NVM Express])
    Subsystem: Samsung Electronics Co Ltd Device a801

We reported several corrected errors before the nvme timeout:

  [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
  [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable controller
  [12750.305196] nvme 0000:01:00.0: enabling device (0000 -> 0002)
  [12750.305465] nvme nvme0: Removing after probe failure status: -19
  [12750.313188] nvme nvme0: I/O 456 QID 2 timeout, disable controller
  [12750.329152] nvme nvme0: I/O 457 QID 2 timeout, disable controller

The corrected errors are supposedly recovered in hardware without
software intervention, and AER logs them for informational purposes.

But it seems very likely that these corrected errors are related to the
nvme timeout: the first corrected errors were logged at 12720.894411,
nvme_io_timeout defaults to 30 seconds, and the nvme timeout was at
12750.281158.

I don't have any good ideas.  As a shot in the dark, you could try
running these commands before doing a suspend:

  # setpci -s01:00.0 0x98.W
  # setpci -s00:1c.0 0x68.W
  # setpci -s01:00.0 0x198.L
  # setpci -s00:1c.0 0x208.L

  # setpci -s01:00.0 0x198.L=0x00000000
  # setpci -s01:00.0 0x98.W=0x0000
  # setpci -s00:1c.0 0x208.L=0x00000000
  # setpci -s00:1c.0 0x68.W=0x0000

  # lspci -vv -s00:1c.0
  # lspci -vv -s01:00.0

The idea is to turn off ASPM L1.2 and LTR, just because that's new and
we've had issues with it before.

If you try this, please collect the output of the commands above in
addition to the dmesg log, in case my math is bad.

Bjorn
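
For anyone who wants to check the timing argument above, here is a minimal Python sketch that pulls the timestamps out of two dmesg lines quoted in this thread and compares the gap with the 30-second nvme_io_timeout default mentioned above. The helper name dmesg_seconds and the constant names are made up for illustration; this is not part of any kernel tooling.

```python
import re

# Two lines copied from the dmesg excerpts quoted in this thread.
FIRST_AER = ("[12720.894411] pcieport 0000:00:1c.0: AER: Multiple Corrected "
             "error received: id=00e0")
NVME_DOWN = ("[12750.281158] nvme nvme0: controller is down; will reset: "
             "CSTS=0xffffffff, PCI_STATUS=0x10")

# Taken from the message text: nvme_io_timeout defaults to 30 seconds.
NVME_IO_TIMEOUT_S = 30.0


def dmesg_seconds(line: str) -> float:
    """Return the leading "[seconds.microseconds]" timestamp of a dmesg line."""
    match = re.match(r"\[\s*(\d+\.\d+)\]", line)
    if match is None:
        raise ValueError(f"no dmesg timestamp found in: {line!r}")
    return float(match.group(1))


# Time between the first corrected error and the "controller is down" message.
gap = dmesg_seconds(NVME_DOWN) - dmesg_seconds(FIRST_AER)

print(f"first corrected error -> nvme timeout: {gap:.3f} s")
print(f"short of the {NVME_IO_TIMEOUT_S:.0f} s I/O timeout by: "
      f"{NVME_IO_TIMEOUT_S - gap:.3f} s")
```

The gap comes out at roughly 29.4 seconds, which is consistent with an I/O that was submitted shortly before the first corrected error and then hit the 30-second timeout. That is only a correlation and does not show that the link errors caused the timeout.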