Date: Tue, 8 Jan 2019 08:07:34 +0200
From: Leon Romanovsky
To: Jason Gunthorpe
Cc: Benjamin Herrenschmidt, David Gibson, davem@davemloft.net,
    saeedm@mellanox.com, ogerlitz@mellanox.com, tariqt@mellanox.com,
    bhelgaas@google.com, linux-kernel@vger.kernel.org,
    linuxppc-dev@lists.ozlabs.org, netdev@vger.kernel.org,
    alex.williamson@redhat.com, linux-pci@vger.kernel.org,
    linux-rdma@vger.kernel.org, sbest@redhat.com, paulus@samba.org
Subject: Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]
Message-ID: <20190108060734.GH3632@mtr-leonro.mtl.com>
References: <20181206041951.22413-1-david@gibson.dropbear.id.au>
    <20181206064509.GM15544@mtr-leonro.mtl.com>
    <20190104034401.GA2801@umbus.fritz.box>
    <20190105175116.GB14238@ziepe.ca>
    <20190108040129.GE5336@ziepe.ca>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
    protocol="application/pgp-signature"; boundary="QNDPHrPUIc00TOLW"
Content-Disposition: inline
In-Reply-To: <20190108040129.GE5336@ziepe.ca>
User-Agent: Mutt/1.10.1 (2018-07-13)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

--QNDPHrPUIc00TOLW
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Jan 07, 2019 at 09:01:29PM -0700, Jason Gunthorpe wrote:
> On Sun, Jan 06, 2019 at 09:43:46AM +1100, Benjamin Herrenschmidt wrote:
> > On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> > >
> > > > Interesting. I've investigated this further, though I don't have as
> > > > many new clues as I'd like. The problem occurs reliably, at least on
> > > > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > > > I don't yet know if it occurs with other machines, I'm having trouble
> > > > getting access to other machines with a suitable card. I didn't
> > > > manage to reproduce it on a different POWER8 machine with a
> > > > ConnectX-5, but I don't know if it's the difference in machine or
> > > > difference in card revision that's important.
> > >
> > > Making sure the card has the latest firmware is always good advice.
> > >
> > > > So possibilities that occur to me:
> > > > * It's something specific about how the vfio-pci driver uses D3
> > > >   state - have you tried rebinding your device to vfio-pci?
> > > > * It's something specific about POWER, either the kernel or the PCI
> > > >   bridge hardware
> > > > * It's something specific about this particular type of machine
> > >
> > > Does the EEH indicate what happened to actually trigger it?
> >
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> >
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
>
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?
>
> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?
>
> Every time or sometimes?
>
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
>
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

+1, I tried to find any Mellanox-internal bugs related to your issue
and didn't find anything concrete.

Thanks

>
> Jason

--QNDPHrPUIc00TOLW
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIcBAEBAgAGBQJcND4mAAoJEORje4g2clinWmYQALc1t9Jj1WUm7zYVTd84U3pI
gnGibiWBO2l7MI+MYk14ZBFGEYJlskNRHIigRcOFEkzha5dy6p2JOnQyS3yBvHjO
Bl3JfvqJLZ6gq4EFqtQlvuH8TaJrkB2L3rxTmWXhbNVcxIw5SIyylhpVDgSncpde
MtP+XC7viTd15bBrYBTqVJsjr0LnIUfyPzBpDcn6vHht6iPln3pUv90T7w49/Vkm
EgDUN3bNYjyXbX07sj78Z5t8UuKv0UcQ2oGAWmA/YLGo04XZRQFcUlu4BnWT2YOf
9z4yHBx/KdBMpxtRue74mqHitFjSu9u+Na5Leq6j3davuFg000q+f3AfE8nCWqLp
DraqvSZKIhAiCFpQAcBAzEvVM0QzaKS8xqftPpnZ+509cnAwzRzlKDAO3xzyaXNN
KM56HOXCPSJPvf0uCsTr3zTLpsAnzm1QOSt3J6SW4DxvBPTsdrvro08UQErbUAVL
VieGklltiu+OeNY2DsCE6JSlxFIMOxMql3zVf5vD9GR7zzhtYA+sgVJssOzBMEY5
4yHnrg42lQ3OvjBF686S5xFHJ13hNHvd4CvdUiNvlncJS14zlEiFoGzmnk4+44bu
LSXy8AVLNDtmqk2WG+DlrCZPnm6zp1wC8mvvkvywSpbrWVEkBn2DIlfK788i1BoR
UMzhWxopBvsJGk41pwm3
=BVjQ
-----END PGP SIGNATURE-----

--QNDPHrPUIc00TOLW--