Date: Wed, 21 Mar 2018 10:02:39 -0600
From: Keith Busch
To: Ming Lei
Cc: Marta Rybczynska, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
 linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
 bhelgaas@google.com, linux-pci@vger.kernel.org, Pierre-Yves Kerbrat
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
Message-ID: <20180321160238.GF12909@localhost.localdomain>
References: <744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu>
 <20180321115037.GA26083@ming.t460p>
 <464125757.5843583.1521634231341.JavaMail.zimbra@kalray.eu>
 <20180321154807.GD22254@ming.t460p>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
 <20180321154807.GD22254@ming.t460p>
User-Agent: Mutt/1.9.1 (2017-09-22)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> > >> NVMe driver uses threads for the work at device reset, including enabling
> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> > >> called in parallel on multiple cores.
> > >>
> > >> This causes a loop of enabling of all upstream bridges in
> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> > >> including __pci_set_master and architecture-specific functions that
> > >> call ones like and pci_enable_resources(). Both __pci_set_master()
> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> > >> and change it. This is done as read/modify/write.
> > >>
> > >> Imagine that the PCIe tree looks like:
> > >> A - B - switch - C - D
> > >>                \- E - F
> > >>
> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> > >> mastering is not set. If their reset work are scheduled in parallel the two
> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
> > >> system may end up with the part of PCIe tree not enabled.
> > >
> > > Then looks serialized reset should be used, and I did see the commit
> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> > > to mark controller state' in reset stress test.
> > >
> > > But that commit only covers case of PCI reset from sysfs attribute, and
> > > maybe other cases need to be dealt with in similar way too.
> > >
> >
> > It seems to me that the serialized reset works for multiple resets of the
> > same device, doesn't it? Our problem is linked to resets of different devices
> > that share the same PCIe tree.
>
> Given reset shouldn't be a frequent action, it might be fine to serialize all
> reset from different devices.

The driver was much simpler when we had serialized resets in line with
probe, but that had bigger problems with certain init systems when
you put enough nvme devices in your server, making them unbootable.

Would it be okay to serialize just the pci_enable_device across all
other tasks messing with the PCI topology?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index cef5ce851a92..e0a2f6c0f1cf 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 	int result = -ENOMEM;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
-	if (pci_enable_device_mem(pdev))
-		return result;
+	pci_lock_rescan_remove();
+	result = pci_enable_device_mem(pdev);
+	pci_unlock_rescan_remove();
+	if (result)
+		return -ENODEV;
 
 	pci_set_master(pdev);
 
--
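
For readers following the thread, here is a minimal user-space sketch of the
lost update described above. It is not kernel code and touches no real PCI
device: it is a single-threaded model that plays out the two orderings by
hand. The function names (rmw_interleaved, rmw_serialized, model_self_check)
are hypothetical and exist only for this illustration; the two bit values are
the ones the kernel defines as PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit values as PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER. */
#define CMD_MEMORY 0x2u
#define CMD_MASTER 0x4u

/*
 * Racy ordering: both tasks read the shared bridge's command register
 * before either one writes it back, so the second write is based on a
 * stale value and drops the first task's bit.
 */
uint16_t rmw_interleaved(uint16_t reg, uint16_t bit_a, uint16_t bit_b)
{
	uint16_t seen_a = reg;			/* task A reads */
	uint16_t seen_b = reg;			/* task B reads the same value */

	reg = (uint16_t)(seen_a | bit_a);	/* task A writes its bit */
	reg = (uint16_t)(seen_b | bit_b);	/* task B overwrites, A's bit is lost */
	return reg;
}

/*
 * Serialized ordering: each read/modify/write finishes before the next
 * one starts, which is what holding a common lock around the enable
 * path would guarantee.
 */
uint16_t rmw_serialized(uint16_t reg, uint16_t bit_a, uint16_t bit_b)
{
	reg = (uint16_t)(reg | bit_a);		/* task A: complete RMW */
	reg = (uint16_t)(reg | bit_b);		/* task B: sees A's result */
	return reg;
}

/* Small self-check showing the difference between the two orderings. */
void model_self_check(void)
{
	/* Interleaved: only the last writer's bit survives. */
	assert(rmw_interleaved(0, CMD_MEMORY, CMD_MASTER) == CMD_MASTER);
	/* Serialized: both bits end up set. */
	assert(rmw_serialized(0, CMD_MEMORY, CMD_MASTER) ==
	       (CMD_MEMORY | CMD_MASTER));
}
```

In the model, the interleaved ordering leaves the shared register with only
one of the two bits set, which corresponds to part of the PCIe tree ending up
not enabled. The serialized ordering keeps both, which is the effect the
locking in the patch is aiming for on the real enable path.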