Date: Wed, 21 Mar 2018 17:10:56 +0100 (CET)
From: Marta Rybczynska
To: Keith Busch
Cc: Ming Lei, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
    linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
    bhelgaas@google.com, linux-pci@vger.kernel.org, Pierre-Yves Kerbrat
Message-ID: <1220434088.5871933.1521648656789.JavaMail.zimbra@kalray.eu>
In-Reply-To: <20180321160238.GF12909@localhost.localdomain>
References: <744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu>
    <20180321115037.GA26083@ming.t460p>
    <464125757.5843583.1521634231341.JavaMail.zimbra@kalray.eu>
    <20180321154807.GD22254@ming.t460p>
    <20180321160238.GF12909@localhost.localdomain>
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
X-Mailing-List: linux-kernel@vger.kernel.org

> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>> > >> The NVMe driver uses threads for the work at device reset, including
>> > >> enabling the PCIe device. When multiple NVMe devices are initialized,
>> > >> their reset works may be scheduled in parallel.
>> > >> Then pci_enable_device_mem can be called in parallel on multiple
>> > >> cores.
>> > >>
>> > >> This causes a loop of enabling of all upstream bridges in
>> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>> > >> including __pci_set_master() and architecture-specific functions
>> > >> that call ones like pci_enable_resources(). Both __pci_set_master()
>> > >> and pci_enable_resources() read the PCI_COMMAND field in the PCIe
>> > >> config space and change it. This is done as a read/modify/write.
>> > >>
>> > >> Imagine that the PCIe tree looks like:
>> > >> A - B - switch - C - D
>> > >>              \- E - F
>> > >>
>> > >> D and F are two NVMe disks, and all devices from B down are not
>> > >> enabled and bus mastering is not set. If their reset works are
>> > >> scheduled in parallel, the two modifications of PCI_COMMAND may
>> > >> happen in parallel without locking, and the system may end up with
>> > >> part of the PCIe tree not enabled.
>> > >
>> > > Then it looks like serialized reset should be used, and I did see that
>> > > commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fixes the issue
>> > > of 'failed to mark controller state' in a reset stress test.
>> > >
>> > > But that commit only covers the case of a PCI reset from the sysfs
>> > > attribute, and maybe other cases need to be dealt with in a similar
>> > > way too.
>> > >
>> > It seems to me that the serialized reset works for multiple resets of
>> > the same device, doesn't it? Our problem is linked to resets of
>> > different devices that share the same PCIe tree.
>>
>> Given that reset shouldn't be a frequent action, it might be fine to
>> serialize all resets from different devices.
>
> The driver was much simpler when we had serialized resets in line with
> probe, but that had bigger problems with certain init systems when you
> put enough nvme devices in your server, making them unbootable.
>
> Would it be okay to serialize just the pci_enable_device across all
> other tasks messing with the PCI topology?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index cef5ce851a92..e0a2f6c0f1cf 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>  	int result = -ENOMEM;
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>
> -	if (pci_enable_device_mem(pdev))
> -		return result;
> +	pci_lock_rescan_remove();
> +	result = pci_enable_device_mem(pdev);
> +	pci_unlock_rescan_remove();
> +	if (result)
> +		return -ENODEV;
>
>  	pci_set_master(pdev);

The problem may happen also with another device doing its probe while nvme
is running its workqueue (and we have probably seen it in practice too).
We were thinking about a lock in the PCI generic code as well, that's why
I've put the linux-pci@ list in copy.

Marta