Date: Tue, 03 Jul 2018 08:04:20 -0400
From: okaya@codeaurora.org
To: poza@codeaurora.org
Cc: Lukas Wunner, linux-pci@vger.kernel.org, linux-arm-msm@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, Bjorn Helgaas, Keith Busch,
    open list, linux-pci-owner@vger.kernel.org
Subject: Re: [PATCH V5 3/3] PCI: Mask and unmask hotplug interrupts during reset

On 2018-07-03 06:52, poza@codeaurora.org wrote:
> On 2018-07-03 14:04, Lukas Wunner wrote:
>> On Mon, Jul 02, 2018 at 06:52:47PM -0400, Sinan Kaya wrote:
>>> If a bridge supports hotplug and observes a PCIe fatal error, the
>>> following events happen:
>>>
>>> 1. AER driver removes the devices from PCI tree on fatal error
>>> 2. AER driver brings down the link by issuing a secondary bus reset
>>>    and waits for the link to come up.
>>> 3. Hotplug driver observes a link down interrupt
>>> 4. Hotplug driver tries to remove the devices waiting for the rescan
>>>    lock but devices are already removed by the AER driver and AER
>>>    driver is waiting for the link to come back up.
>>> 5. AER driver tries to re-enumerate devices after polling for the
>>>    link state to go up.
>>> 6. Hotplug driver obtains the lock and tries to remove the devices
>>>    again.
>>>
>>> If a bridge is a hotplug capable bridge, mask hotplug interrupts
>>> before the reset and unmask afterwards.
>>
>> Would it work for you if you just amended the AER driver to skip
>> removal and re-enumeration of devices if the port is a hotplug bridge?
>> Just check for is_hotplug_bridge in struct pci_dev.
>>
>
> I tend to agree with you, Lukas.
>
> Along this line I already have follow-up patches,
> although I am waiting for Bjorn to review some patch series before
> that.
> [PATCH v2 0/6] Fix issues and cleanup for ERR_FATAL and ERR_NONFATAL
>
> It doesn't look to me like entirely a race condition, since it's
> guarded by pci_lock_rescan_remove().
> I observed that both hotplug and aer/dpc come out of it in a quite
> sane state.
>

To add more detail on when this issue happens: this problem is more
visible on root ports with MSI-X capability or with multiple MSI
interrupt numbers.

AFAIK, QDT root ports have only a single shared MSI interrupt.
Therefore, you won't see this issue.

As you can see in the code, the rescan lock is held for the entire
fatal error handling path.

> My thinking is: disabling hotplug interrupts during ERR_FATAL
> is a little away from the natural course of link_down event
> handling, which is handled by pciehp more maturely.
> So it would be just as easy not to take any action, e.g. removal and
> re-enumeration of devices, from the ERR_FATAL handling point of view.
>

I think it is more unnatural to fragment the code flow and allow two
drivers to do the same thing in parallel, or to create an inter-driver
dependency.

I got the idea from the pci_reset_slot() function, which already masks
hotplug interrupts when called by external entities during a secondary
bus reset. We just didn't handle the same for fatal error cases.
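
To make the ordering concrete, here is a minimal userspace model of
"mask, reset, unmask". It is only a sketch and not the patch itself:
struct model_port, model_bus_reset() and model_reset_masked() are
hypothetical names, the register is a plain variable, and the choice of
PDCE and DLLSCE as the bits to mask is an assumption. Only the two
Slot Control bit values are taken from pci_regs.h. A real
implementation would also have to deal with status bits latched during
the reset, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Slot Control enable bits; values as in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_SLTCTL_PDCE   0x0008 /* Presence Detect Changed Enable */
#define PCI_EXP_SLTCTL_DLLSCE 0x1000 /* Data Link Layer State Changed Enable */

/* Hypothetical stand-in for a bridge; not the kernel's struct pci_dev. */
struct model_port {
	bool is_hotplug_bridge;
	uint16_t slot_ctl;      /* modelled Slot Control register */
	unsigned int hp_events; /* link-down events seen by the hotplug side */
};

/*
 * Model a secondary bus reset: the link drops, and the hotplug side is
 * only notified if the link state change interrupt is enabled.
 */
static void model_bus_reset(struct model_port *port)
{
	assert(port);
	if (port->is_hotplug_bridge &&
	    (port->slot_ctl & PCI_EXP_SLTCTL_DLLSCE))
		port->hp_events++;
}

/* Mask the hotplug enables, reset, then restore what was enabled before. */
static void model_reset_masked(struct model_port *port)
{
	const uint16_t mask = PCI_EXP_SLTCTL_PDCE | PCI_EXP_SLTCTL_DLLSCE;
	uint16_t saved;

	assert(port);
	saved = port->slot_ctl & mask;          /* remember enabled bits */
	port->slot_ctl &= (uint16_t)~mask;      /* mask hotplug events */
	model_bus_reset(port);                  /* link bounce goes unseen */
	port->slot_ctl |= saved;                /* unmask afterwards */
}
```

With both enables set, calling model_bus_reset() directly bumps
hp_events, which corresponds to step 3 in the list quoted above.
Calling model_reset_masked() instead leaves hp_events untouched and
slot_ctl back at its original value.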