Date: Wed, 21 Mar 2018 16:53:08 -0500
From: Bjorn Helgaas
To: Marta Rybczynska
Cc: Keith Busch, Ming Lei, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
 linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
 bhelgaas@google.com, linux-pci@vger.kernel.org, Pierre-Yves Kerbrat,
 Srinath Mannam
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
Message-ID: <20180321215308.GH38649@bhelgaas-glaptop.roam.corp.google.com>
References: <744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu>
 <20180321115037.GA26083@ming.t460p>
 <464125757.5843583.1521634231341.JavaMail.zimbra@kalray.eu>
 <20180321154807.GD22254@ming.t460p>
 <20180321160238.GF12909@localhost.localdomain>
 <1220434088.5871933.1521648656789.JavaMail.zimbra@kalray.eu>
In-Reply-To: <1220434088.5871933.1521648656789.JavaMail.zimbra@kalray.eu>
X-Mailing-List: linux-kernel@vger.kernel.org

[+cc Srinath]

On Wed, Mar 21, 2018 at 05:10:56PM +0100, Marta Rybczynska wrote:
> > On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
> >> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
> >> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
> >> > >> NVMe driver uses threads for the work at device reset, including enabling
> >> > >> the PCIe device. When multiple NVMe devices are initialized, their reset
> >> > >> works may be scheduled in parallel. Then pci_enable_device_mem can be
> >> > >> called in parallel on multiple cores.
> >> > >>
> >> > >> This causes a loop of enabling of all upstream bridges in
> >> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
> >> > >> including __pci_set_master and architecture-specific functions that
> >> > >> call ones like and pci_enable_resources(). Both __pci_set_master()
> >> > >> and pci_enable_resources() read PCI_COMMAND field in the PCIe space
> >> > >> and change it. This is done as read/modify/write.
> >> > >>
> >> > >> Imagine that the PCIe tree looks like:
> >> > >> A - B - switch - C - D
> >> > >>               \- E - F
> >> > >>
> >> > >> D and F are two NVMe disks and all devices from B are not enabled and bus
> >> > >> mastering is not set. If their reset work are scheduled in parallel the two
> >> > >> modifications of PCI_COMMAND may happen in parallel without locking and the
> >> > >> system may end up with the part of PCIe tree not enabled.
> >> > >
> >> > > Then looks serialized reset should be used, and I did see the commit
> >> > > 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes issue of 'failed
> >> > > to mark controller state' in reset stress test.
> >> > >
> >> > > But that commit only covers case of PCI reset from sysfs attribute, and
> >> > > maybe other cases need to be dealt with in similar way too.
> >> > >
> >> >
> >> > It seems to me that the serialized reset works for multiple resets of the
> >> > same device, doesn't it? Our problem is linked to resets of different devices
> >> > that share the same PCIe tree.
> >>
> >> Given reset shouldn't be a frequent action, it might be fine to serialize all
> >> reset from different devices.
> >
> > The driver was much simpler when we had serialized resets in line with
> > probe, but that had a bigger problems with certain init systems when
> > you put enough nvme devices in your server, making them unbootable.
> >
> > Would it be okay to serialize just the pci_enable_device across all
> > other tasks messing with the PCI topology?
> >
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index cef5ce851a92..e0a2f6c0f1cf 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
> >  	int result = -ENOMEM;
> >  	struct pci_dev *pdev = to_pci_dev(dev->dev);
> >
> > -	if (pci_enable_device_mem(pdev))
> > -		return result;
> > +	pci_lock_rescan_remove();
> > +	result = pci_enable_device_mem(pdev);
> > +	pci_unlock_rescan_remove();
> > +	if (result)
> > +		return -ENODEV;
> >
> >  	pci_set_master(pdev);
>
> The problem may happen also with other device doing its probe and
> nvme running its workqueue (and we probably have seen it in practice
> too). We were thinking about a lock in the pci generic code too,
> that's why I've put the linux-pci@ list in copy.

Yes, this is a generic problem in the PCI core.
We've tried to fix it in the past but haven't figured it out yet.
See 40f11adc7cd9 ("PCI: Avoid race while enabling upstream bridges") and
0f50a49e3008 ("Revert "PCI: Avoid race while enabling upstream bridges"").

It's not trivial, but if you figure out a good way to fix this, I'd be
thrilled.

Bjorn
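
For readers following the thread, the lost update described in the quoted
report can be replayed deterministically in user space. The sketch below is
a model only, not kernel code: a plain variable stands in for the
PCI_COMMAND register of a bridge shared by both disks, the bit values mirror
PCI_COMMAND_MEMORY (0x2) and PCI_COMMAND_MASTER (0x4), and the function names
racy_interleaving(), serialized_order() and self_check() are hypothetical.
It shows one possible interleaving, assuming each path does an unlocked
read/modify/write on the same register.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit values as PCI_COMMAND_MEMORY and PCI_COMMAND_MASTER. */
#define CMD_MEMORY 0x2
#define CMD_MASTER 0x4

/*
 * Bad interleaving: both enable paths read the shared bridge's command
 * register before either one writes it back.
 */
static uint16_t racy_interleaving(void)
{
	uint16_t reg = 0;		/* model of the bridge's PCI_COMMAND */
	uint16_t seen_by_d = reg;	/* path enabling disk D reads 0 */
	uint16_t seen_by_f = reg;	/* path enabling disk F reads 0 */

	reg = seen_by_d | CMD_MEMORY;	/* D's path writes 0x2 */
	reg = seen_by_f | CMD_MASTER;	/* F's path writes 0x4, losing 0x2 */
	return reg;
}

/*
 * Ordering that a lock around the whole read/modify/write would force:
 * the second path reads only after the first path has written.
 */
static uint16_t serialized_order(void)
{
	uint16_t reg = 0;
	uint16_t seen;

	seen = reg;			/* D's path: read ... */
	reg = seen | CMD_MEMORY;	/* ... and write back 0x2 */

	seen = reg;			/* F's path now reads 0x2 */
	reg = seen | CMD_MASTER;	/* ... and writes back 0x6 */
	return reg;
}

/* Sanity check of the model: the race drops a bit, serialization keeps both. */
static void self_check(void)
{
	assert(racy_interleaving() == CMD_MASTER);
	assert(serialized_order() == (CMD_MEMORY | CMD_MASTER));
}
```

In the racy replay the memory-decode bit is overwritten, which matches the
"part of PCIe tree not enabled" symptom in the report. Where such a lock
should live (in the NVMe driver, as in the quoted diff, or in the PCI core)
is the open question of this thread and is not settled by the model.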