Received: by 2002:a05:6a10:6744:0:0:0:0 with SMTP id w4csp4033529pxu; Mon, 12 Oct 2020 07:52:06 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyyVknhwXY0AsuoMuoZqQ82Xj7GIJUxTO2nkP2CuOkFmmGDc4+Lolutt9piDTataa0KPKO3 X-Received: by 2002:a50:9e87:: with SMTP id a7mr14788101edf.297.1602514326012; Mon, 12 Oct 2020 07:52:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1602514326; cv=none; d=google.com; s=arc-20160816; b=vdz7hny2jVOj1XBNzyOfaMYkGnYILQsuqQshKuVHF7+h5cGQYLvBSV49KRLXd864l4 /BgIZUfyuR6ouPyYBl9hpL3+UNu9WZBkT79X2/ZbtJV1kMBlUUKdHlaqgabeyJC3r1Xw ZM9br1466IqNQIszle8A5p+WmZ5G5/PwI6lVHLqQczJdPQlSj91HWhFGeXx0FH8c14zk jnHz5S38LWgBLSBFh5eeDtPpynqqLlmUojbz9eG3E1U1VBX0d+/71LTfQtdXjSsrAicV aGr+1R7+9IZUfSQst8Dp3jRHdnPoQpG1Bqp0y+51ZkIPIxT4kOvcKh4ky9ok4rCVRqtu o7kg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:autocrypt:from :references:cc:to:subject:dkim-signature; bh=1FmEDIsZ+pE2L+i0iPPU7xoWGF9PNLpJtX2hDf1wXqA=; b=0G5oedHuLRa2pSgKhxGscLTyOVBjgVjDF6Sz+Zduu5exYRvx5wYTEY1ZBE1SacPsYH UPCZD+dbzMzKKWybaqC/5Y7Qhj0LDGE5XIecWfv5HfagAjITavwBbwIpcXJ1ZhdER1b7 KA3UKGvyWeqYdNDVICmohH3Qeizh5K2wVufP6+YlLn99HkvBt4p9dDJJCie0OMdKt3tr ASttI5VpDqc0N0ce5ePe1p0a2avpMJw4GHsHPqwzOt4g/C8xjciSmQvhJQTE9dm6mzSE doL70zDlFeQD0KL7renmvY8R/NflHtwT4+XTGvzehs3FGx6wX5c/rHL9AAeJhhlbpAte jn/Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=LJqc3NPz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ob21si11704826ejb.276.2020.10.12.07.51.41; Mon, 12 Oct 2020 07:52:05 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=LJqc3NPz; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389400AbgJLOuf (ORCPT + 99 others); Mon, 12 Oct 2020 10:50:35 -0400 Received: from mail.kernel.org ([198.145.29.99]:57014 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2387930AbgJLOub (ORCPT ); Mon, 12 Oct 2020 10:50:31 -0400 Received: from [192.168.0.112] (75-58-59-55.lightspeed.rlghnc.sbcglobal.net [75.58.59.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 37B142080A; Mon, 12 Oct 2020 14:50:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1602514230; bh=Um3brhR4V7mbgCo0NMThdWSXjyTwvJ/bNPfg+5StKaM=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=LJqc3NPzieo+zCMFKTWJbZTtyeoHuwpGsCfHSBwse3DTJOJaJoamovVG0tKre3pIP kP3vaN9XQyoxFskgTFvlBth9KjiJftUWI6h5Y6yOUjlPz5wzlrnUxtMt4Ayi4aIHuE 3R4tIBD50qhLdnfnA5/+1oYfSdQ37BN3Tka2EsjU= Subject: Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling To: sathyanarayanan.nkuppuswamy@gmail.com, bhelgaas@google.com Cc: linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, ashok.raj@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com References: <5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppuswamy@linux.intel.com> From: Sinan Kaya Autocrypt: addr=okaya@kernel.org; keydata= mQENBFrnOrUBCADGOL0kF21B6ogpOkuYvz6bUjO7NU99PKhXx1MfK/AzK+SFgxJF7dMluoF6 uT47bU7zb7HqACH6itTgSSiJeSoq86jYoq5s4JOyaj0/18Hf3/YBah7AOuwk6LtV3EftQIhw 9vXqCnBwP/nID6PQ685zl3vH68yzF6FVNwbDagxUz/gMiQh7scHvVCjiqkJ+qu/36JgtTYYw 8lGWRcto6gr0eTF8Wd8f81wspmUHGsFdN/xPsZPKMw6/on9oOj3AidcR3P9EdLY4qQyjvcNC V9cL9b5I/Ud9ghPwW4QkM7uhYqQDyh3SwgEFudc+/RsDuxjVlg9CFnGhS0nPXR89SaQZABEB AAG0HVNpbmFuIEtheWEgPG9rYXlhQGtlcm5lbC5vcmc+iQFOBBMBCAA4FiEEYdOlMSE+a7/c ckrQvGF4I+4LAFcFAlztcAoCGwMFCwkIBwIGFQoJCAsCBBYCAwECHgECF4AACgkQvGF4I+4L AFfidAf/VKHInxep0Z96iYkIq42432HTZUrxNzG9IWk4HN7c3vTJKv2W+b9pgvBF1SmkyQSy 8SJ3Zd98CO6FOHA1FigFyZahVsme+T0GsS3/OF1kjrtMktoREr8t0rK0yKpCTYVdlkHadxmR Qs5xLzW1RqKlrNigKHI2yhgpMwrpzS+67F1biT41227sqFzW9urEl/jqGJXaB6GV+SRKSHN+ ubWXgE1NkmfAMeyJPKojNT7ReL6eh3BNB/Xh1vQJew+AE50EP7o36UXghoUktnx6cTkge0ZS qgxuhN33cCOU36pWQhPqVSlLTZQJVxuCmlaHbYWvye7bBOhmiuNKhOzb3FcgT7kBDQRa5zq1 AQgAyRq/7JZKOyB8wRx6fHE0nb31P75kCnL3oE+smKW/sOcIQDV3C7mZKLf472MWB1xdr4Tm eXeL/wT0QHapLn5M5wWghC80YvjjdolHnlq9QlYVtvl1ocAC28y43tKJfklhHiwMNDJfdZbw 9lQ2h+7nccFWASNUu9cqZOABLvJcgLnfdDpnSzOye09VVlKr3NHgRyRZa7me/oFJCxrJlKAl 2hllRLt0yV08o7i14+qmvxI2EKLX9zJfJ2rGWLTVe3EJBnCsQPDzAUVYSnTtqELu2AGzvDiM gatRaosnzhvvEK+kCuXuCuZlRWP7pWSHqFFuYq596RRG5hNGLbmVFZrCxQARAQABiQEfBBgB CAAJBQJa5zq1AhsMAAoJELxheCPuCwBX2UYH/2kkMC4mImvoClrmcMsNGijcZHdDlz8NFfCI gSb3NHkarnA7uAg8KJuaHUwBMk3kBhv2BGPLcmAknzBIehbZ284W7u3DT9o1Y5g+LDyx8RIi e7pnMcC+bE2IJExCVf2p3PB1tDBBdLEYJoyFz/XpdDjZ8aVls/pIyrq+mqo5LuuhWfZzPPec 9EiM2eXpJw+Rz+vKjSt1YIhg46YbdZrDM2FGrt9ve3YaM5H0lzJgq/JQPKFdbd5MB0X37Qc+ 2m/A9u9SFnOovA42DgXUyC2cSbIJdPWOK9PnzfXqF3sX9Aol2eLUmQuLpThJtq5EHu6FzJ7Y L+s0nPaNMKwv/Xhhm6Y= Message-ID: <5ae14b67-94a5-6d2f-b74d-ca32bbd079cd@kernel.org> Date: Mon, 12 Oct 2020 10:50:29 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.12.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/12/2020 1:03 AM, sathyanarayanan.nkuppuswamy@gmail.com wrote: > From: Kuppuswamy Sathyanarayanan > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > merged fatal and non-fatal error recovery paths, and also made > recovery code depend on hotplug handler for "remove affected > device + rescan" support. But this change also complicated the > error recovery path and which in turn led to the following > issues. > > 1. We depend on hotplug handler for removing the affected > devices/drivers on DLLSC LINK down event (on DPC event > trigger) and DPC handler for handling the error recovery. Since > both handlers operate on same set of affected devices, it leads > to race condition, which in turn leads to  NULL pointer > exceptions or error recovery failures.You can find more details > about this issue in following link. > > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.zhao@intel.com/T/#t > > 2. For non-hotplug capable devices fatal (DPC) error recovery > is currently broken. Current fatal error recovery implementation > relies on PCIe hotplug (pciehp) handler for detaching and > re-enumerating the affected devices/drivers. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. You can find more details about > this issue in the following links. > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-Zhiqiang.Hou@nxp.com/ > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com/ > > In order to fix the above two issues, we should stop relying on hotplug > handler for cleaning the affected devices/drivers and let error recovery > handler own this functionality. So this patch reverts Commit bdb5ac85777d > ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove > affected device + rescan"  functionality in fatal error recovery handler. > > Also holding pci_lock_rescan_remove() will prevent the race between hotplug > and DPC handler. > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > Signed-off-by: Kuppuswamy Sathyanarayanan > --- > Documentation/PCI/pci-error-recovery.rst | 47 ++++++++++------ > drivers/pci/pcie/err.c | 71 +++++++++++++++++++----- > 2 files changed, 87 insertions(+), 31 deletions(-) I'm not sure about locks involved but I do like the concept. Current fatal error handling is best effort. There is no way to recover if link is down by the time we reach to fatal error handling routine. This change will make the solution more reliable. Reviewed-by: Sinan Kaya