Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752131AbdI2Ncr (ORCPT ); Fri, 29 Sep 2017 09:32:47 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:48574 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751176AbdI2Ncp (ORCPT ); Fri, 29 Sep 2017 09:32:45 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 smtp.codeaurora.org 53F2E60227 Authentication-Results: pdx-caf-mail.web.codeaurora.org; dmarc=none (p=none dis=none) header.from=codeaurora.org Authentication-Results: pdx-caf-mail.web.codeaurora.org; spf=none smtp.mailfrom=okaya@codeaurora.org Subject: Re: [PATCH 3/4] pci aer: fix deadlock in do_recovery To: Govindarajulu Varadarajan Cc: benve@cisco.com, bhelgaas@google.com, linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, jlbec@evilplan.org, hch@lst.de, mingo@redhat.com, peterz@infradead.org References: <20170927214220.41216-1-gvaradar@cisco.com> <20170927214220.41216-4-gvaradar@cisco.com> <2dc437fe-2ab4-23e3-44f3-f06feaf88d86@codeaurora.org> From: Sinan Kaya Message-ID: <0a2a41c5-2872-fdb6-8ad2-97b0b6dc69b1@codeaurora.org> Date: Fri, 29 Sep 2017 09:32:42 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2140 Lines: 48 On 9/28/2017 7:46 PM, Govindarajulu Varadarajan wrote: >> How about releasing the device_lock here on CPU0?> > > pci_device_add() is called by driver's pci probe function. device_lock(dev) > should be held before calling pci driver probe function. I see. The goal of the lock held here is to protect probe() operation from being disrupted. I also don't think we can change this. > >> or in other words keep device_lock as short as possible? > > The problem is not the duration device_lock is held. It is the order two locks > are aquired. We cannot control or implement a restriction that during > device_lock() is held, driver probe should not call pci function which aquires > pci_bus_sem. And in case of pci aer, aer handler needs to call driver err_handler() > for which we need to hold device_lock() before calling err_handler(). In order > to find all the devices on a pci bus, we should hold pci_bus_sem to do > pci_walk_bus(). I was reacting to this to see if there is a better way to do this. "Only fix I could think of is to lock &pci_bus_sem and try locking all device->mutex under that pci_bus. If it fails, unlock all device->mutex and &pci_bus_sem and try again." How about gracefully returning from report_error_detected() when we cannot obtain the device_lock() by replacing it with device_trylock()? aer_pci_walk_bus() can still poll like you did until it gets the lock. At least, we don't get to introduce a new API, new lock semantics and code refactoring. __pci_bus_trylock() looked very powerful and also dangerously flexible to introduce new bugs to me. For instance, you called it like this. + down_read(&pci_bus_sem); + locked = __pci_bus_trylock(bus, pci_device_trylock, + pci_device_unlock); pci_bus_trylock() would obtain device + cfg locks whereas pci_device_trylock() only obtains the device lock. Can it race against cfg lock? It depends on the caller. Very subtle difference. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.