Date: Tue, 8 May 2018 07:34:32 -0500
From: Bjorn Helgaas
To: Paul Menzel
Cc: okaya@codeaurora.org, Bjorn Helgaas, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Lukas Wunner
Subject: Re: pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
Message-ID: <20180508123432.GJ161390@bhelgaas-glaptop.roam.corp.google.com>

On Tue, May 08, 2018 at 08:59:34AM +0200, Paul Menzel wrote:
> Dear Bjorn,
>
>
> On 07.05.2018 at 23:33, Bjorn Helgaas wrote:
> > On Fri, May 04, 2018 at 08:33:27AM -0500, Bjorn Helgaas wrote:
> > > commit b0d6f2230e12c85ae3b65a854a53c67c7c1f6406
> > > Author: Bjorn Helgaas
> > > Date:   Thu May 3 18:39:38 2018 -0500
> > >
> > >     PCI: pciehp: Add quirk for Intel Command Completed erratum
> > >
> > >     The Intel CF118 erratum means the controller does not set the Command
> > >     Completed bit unless writes to the Slot Command register change "Control"
> > >     bits.  Command Completed is never set for writes that only change software
> > >     notification "Enable" bits.  This results in timeouts like this:
> > >
> > >       pciehp 0000:00:1c.0:pcie004: Timeout on hotplug command 0x1038 (issued 65284 msec ago)
> > >
> > >     When this erratum is present, avoid these timeouts by marking commands
> > >     "completed" immediately unless they change the "Control" bits.
> > >
> > >     Here's the text of the erratum from the Intel document:
> > >
> > >       CF118        PCIe Slot Status Register Command Completed bit not always
> > >                    updated on any configuration write to the Slot Control
> > >                    Register
> > >
> > >       Problem:     For PCIe root ports (devices 0 - 10) supporting hot-plug,
> > >                    the Slot Status Register (offset AAh) Command Completed
> > >                    (bit[4]) status is updated under the following condition:
> > >                    IOH will set Command Completed bit after delivering the new
> > >                    commands written in the Slot Controller register (offset
> > >                    A8h) to VPP.  The IOH detects new commands written in Slot
> > >                    Control register by checking the change of value for Power
> > >                    Controller Control (bit[10]), Power Indicator Control
> > >                    (bits[9:8]), Attention Indicator Control (bits[7:6]), or
> > >                    Electromechanical Interlock Control (bit[11]) fields.  Any
> > >                    other configuration writes to the Slot Control register
> > >                    without changing the values of these fields will not cause
> > >                    Command Completed bit to be set.
> > >
> > >                    The PCIe Base Specification Revision 2.0 or later describes
> > >                    the “Slot Control Register” in section 7.8.10, as follows
> > >                    (Reference section 7.8.10, Slot Control Register, Offset
> > >                    18h).  In hot-plug capable Downstream Ports, a write to the
> > >                    Slot Control register must cause a hot-plug command to be
> > >                    generated (see Section 6.7.3.2 for details on hot-plug
> > >                    commands).  A write to the Slot Control register in a
> > >                    Downstream Port that is not hotplug capable must not cause a
> > >                    hot-plug command to be executed.
> > >
> > >                    The PCIe Spec intended that every write to the Slot Control
> > >                    Register is a command and expected a command complete status
> > >                    to abstract the VPP implementation specific nuances from the
> > >                    OS software.  IOH PCIe Slot Control Register implementation
> > >                    is not fully conforming to the PCIe Specification in this
> > >                    respect.
> > >
> > >       Implication: Software checking on the Command Completed status after
> > >                    writing to the Slot Control register may time out.
> > >
> > >       Workaround:  Software can read the Slot Control register and compare the
> > >                    existing and new values to determine if it should check the
> > >                    Command Completed status after writing to the Slot Control
> > >                    register.
> > >
> > >     Link: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-v2-spec-update.html
> > >     Link: https://lkml.kernel.org/r/8770820b-85a0-172b-7230-3a44524e6c9f@molgen.mpg.de
> > >     Reported-by: Paul Menzel
> > >     Signed-off-by: Bjorn Helgaas
> >
> > I applied this with Paul's tested-by on pci/hotplug for v4.18.
>
> Thank you very much. Will this also be picked up by the stable Linux kernel
> series?

I did not tag it for stable because I didn't think it was a serious
enough problem, based on this from
Documentation/process/stable-kernel-rules.rst:

 - It must fix a problem that causes a build error (but not for things
   marked CONFIG_BROKEN), an oops, a hang, data corruption, a real
   security issue, or some "oh, that's not good" issue.  In short,
   something critical.

I know I'm on the conservative end of the stable-tagging spectrum, so
maybe I could be convinced to add a stable tag.  My impression was
that this bug caused annoying messages and annoying delays of a couple
seconds during shutdown and resume.  Is it more serious than that?

Bjorn
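
For readers following the thread, the workaround quoted from the erratum
(compare the existing and new Slot Control values before deciding whether
to wait for Command Completed) can be sketched in plain C as below.  This
is only an illustration under stated assumptions, not the code from the
commit: the macro and function names are hypothetical, and the masks are
derived solely from the bit positions given in the erratum text.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Slot Control "Control" fields, using the bit positions quoted in the
 * CF118 erratum.  These macro names are illustrative, not kernel names.
 */
#define SLTCTL_ATTN_IND_CTRL  0x00c0u  /* Attention Indicator Control, bits[7:6] */
#define SLTCTL_PWR_IND_CTRL   0x0300u  /* Power Indicator Control, bits[9:8] */
#define SLTCTL_PWR_CTRL_CTRL  0x0400u  /* Power Controller Control, bit[10] */
#define SLTCTL_EMI_CTRL       0x0800u  /* Electromechanical Interlock Control, bit[11] */

#define SLTCTL_CONTROL_MASK \
	(SLTCTL_ATTN_IND_CTRL | SLTCTL_PWR_IND_CTRL | \
	 SLTCTL_PWR_CTRL_CTRL | SLTCTL_EMI_CTRL)

/*
 * Hypothetical helper: returns true if a write that takes Slot Control
 * from old_val to new_val changes any "Control" field.  Per the erratum,
 * only such writes cause affected hardware to set Command Completed, so
 * only then is it worth waiting for that bit.
 */
static bool slot_write_sets_cmd_completed(uint16_t old_val, uint16_t new_val)
{
	return ((old_val ^ new_val) & SLTCTL_CONTROL_MASK) != 0;
}
```

With these masks, the value 0x1038 from the timeout message has no bits
inside the control fields, so a write going from 0x0000 (an arbitrary
starting value chosen for illustration) to 0x1038 would not be expected
to set Command Completed on affected hardware.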