In-Reply-To: <20180111175916.GB2860@localhost.localdomain>
References: <20171214184701.GA6322@libmpq.org> <20171215002155.GR30595@bhelgaas-glaptop.roam.corp.google.com> <20171215190126.GI19904@libmpq.org> <20180111175040.GJ1377@libmpq.org> <20180111175916.GB2860@localhost.localdomain>
From: Rajat Jain
Date: Thu, 11 Jan 2018 12:22:26 -0800
Subject: Re: ASPM powersupersave change NVMe SSD Samsung 960 PRO capacity to 0 and read-only
To: Keith Busch
Cc: Maik Broemme, Bjorn Helgaas, linux-pci, Linux Kernel Mailing List
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 11, 2018 at 9:59 AM, Keith Busch wrote:
> On Thu, Jan 11, 2018 at 06:50:40PM +0100, Maik Broemme wrote:
>> I've re-run the test with 4.15rc7.r111.g5f615b97cdea and the following
>> patches from Keith:
>>
>> [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
>> [PATCH 2/4] PCI/AER: Provide API for getting AER information
>> [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
>> [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
>>
>> The issue is still the same.
>> Additionally to the output before I see now:
>>
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC containment event, status:0x1f09 source:0x0000
>> Jan 11 18:34:45 server.theraso.int kernel: dpc 0000:00:10.0:pcie010: DPC unmasked uncorrectable error detected, remove downstream devices
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0080(Receiver ID)
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: device [8086:19aa] error status/mask=00000020/00000000
>> Jan 11 18:34:45 server.theraso.int kernel: pcieport 0000:00:10.0: [ 5] Surprise Down Error (First)
>> Jan 11 18:34:46 server.theraso.int kernel: nvme0n1: detected capacity change from 1024209543168 to 0
>
> Okay, so that series wasn't going to fix anything, but at least it gets
> some visibility into what's happened. The DPC was triggered due to a
> Surprise Down uncorrectable error, so the power setting is causing the
> link to fail.
>
> The NVMe driver has quirks specifically for this vendor's devices to
> fence off NVMe specific automated power settings. Your observations
> appear to align with the same issues.

Agree. For example:

	/*
	 * Samsung SSD 960 EVO drops off the PCIe bus after system
	 * suspend on a Ryzen board, ASUS PRIME B350M-A.
	 */
	if (dmi_match(DMI_BOARD_VENDOR, "ASUSTeK COMPUTER INC.") &&
	    dmi_match(DMI_BOARD_NAME, "PRIME B350M-A"))
		return NVME_QUIRK_NO_APST;

It seems that the attempt to save extra power using ASPM L1 substates is
causing it to fall off. Sorry, but I suspect that it may be difficult to
debug without a PCIe analyzer. Some debugging directions:

- Assuming this is a hotpluggable device, try another NVMe device to
  verify whether the issue is specific to this one.
- Can you please try switching the ASPM policy back from "powersupersave"
  to "powersave", and potentially do a rescan
  (echo 1 > /sys/bus/pci/rescan), and see if the device comes back (and
  goes away again when you switch back to "powersupersave")?
- Maybe put some debug prints in pcie_config_aspm_l1ss() to see which
  register write causes the device to fall off (most likely it would be
  the last statement, but just throwing out ideas).
- Maybe dump the timing parameters link->l1ss.ctl1 and link->l1ss.ctl2
  from aspm_calc_l1ss_info(), and try to play with them a little.
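
For the policy-switching experiment, a minimal shell sketch might look like
the following. /sys/module/pcie_aspm/parameters/policy and
/sys/bus/pci/rescan are the usual sysfs entries; the function names and the
SYSFS_ROOT variable are made up here for illustration (SYSFS_ROOT only
exists so the helpers can be tried against a scratch directory). On a real
system this needs root, and it has not been run on the hardware discussed
in this thread.

```shell
#!/bin/sh
# Sketch only. SYSFS_ROOT is a hypothetical override for dry runs;
# leave it unset to act on the real /sys tree (requires root).

# Print the active ASPM policy. The kernel lists all policies and marks
# the active one with square brackets, e.g.
#   default performance powersave [powersupersave]
aspm_current_policy() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p' \
        "${SYSFS_ROOT:-}/sys/module/pcie_aspm/parameters/policy"
}

# Select a policy by name, e.g. "powersave" or "powersupersave".
aspm_set_policy() {
    printf '%s\n' "$1" \
        > "${SYSFS_ROOT:-}/sys/module/pcie_aspm/parameters/policy"
}

# Ask the PCI core to rescan all buses.
pci_rescan() {
    printf '1\n' > "${SYSFS_ROOT:-}/sys/bus/pci/rescan"
}

# Possible sequence, as root:
#   aspm_current_policy
#   aspm_set_policy powersave && pci_rescan
#   dmesg | tail        # did nvme0 come back?
#   aspm_set_policy powersupersave
```

Keeping each step in its own function makes it easy to run them one at a
time and watch dmesg in between, which is the point of the experiment.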
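
For the last suggestion, once ctl1 and ctl2 have been printed (for instance
from a temporary debug print), the raw values can be split into their timing
fields. The sketch below assumes the values use the layout of the L1 PM
Substates Control 1 and Control 2 registers as I understand it from the
PCIe spec and include/uapi/linux/pci_regs.h. The function names and the
sample values in the comments are hypothetical and not taken from the
failing system.

```shell
#!/bin/sh
# Sketch only: decode raw L1 PM Substates control register values.

# Control 1 (assumed layout):
#   bits  3:0   L1 substate enable bits
#   bits 15:8   Common_Mode_Restore_Time, in microseconds
#   bits 25:16  LTR_L1.2_THRESHOLD_Value
#   bits 31:29  LTR_L1.2_THRESHOLD_Scale (value * 32^scale ns, scale 0..5)
l1ss_decode_ctl1() {
    ctl1=$(( $1 ))
    cm_restore=$(( (ctl1 >> 8) & 0xff ))
    ltr_value=$(( (ctl1 >> 16) & 0x3ff ))
    ltr_scale=$(( (ctl1 >> 29) & 0x7 ))
    # 32^scale == 1 << (5 * scale); scales 6 and 7 are reserved.
    ltr_ns=$(( ltr_value << (5 * ltr_scale) ))
    printf 'enable_bits=0x%x cm_restore_us=%d ltr_l12_threshold_ns=%d\n' \
        $(( ctl1 & 0xf )) "$cm_restore" "$ltr_ns"
}

# Control 2 (assumed layout):
#   bits 1:0  T_POWER_ON scale (0 = 2 us, 1 = 10 us, 2 = 100 us)
#   bits 7:3  T_POWER_ON value
l1ss_decode_ctl2() {
    ctl2=$(( $1 ))
    scale=$(( ctl2 & 0x3 ))
    value=$(( (ctl2 >> 3) & 0x1f ))
    case "$scale" in
        0) unit=2 ;;
        1) unit=10 ;;
        2) unit=100 ;;
        *) echo 't_power_on=reserved-scale'; return 1 ;;
    esac
    printf 't_power_on_us=%d\n' $(( value * unit ))
}

# Hypothetical sample values:
#   l1ss_decode_ctl1 0x40a0320f
#     enable_bits=0xf cm_restore_us=50 ltr_l12_threshold_ns=163840
#   l1ss_decode_ctl2 0x28
#     t_power_on_us=10
```

Seeing the fields in microseconds and nanoseconds should make it easier to
judge which parameter to relax first when playing with the values.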