Date: Fri, 23 Mar 2018 08:44:27 +0100 (CET)
From: Marta Rybczynska
To: Keith Busch
Cc: Ming Lei, axboe@fb.com, hch@lst.de, sagi@grimberg.me,
 linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
 bhelgaas@google.com, linux-pci@vger.kernel.org, Pierre-Yves Kerbrat
Message-ID: <1716948335.5924942.1521791067272.JavaMail.zimbra@kalray.eu>
In-Reply-To: <1220434088.5871933.1521648656789.JavaMail.zimbra@kalray.eu>
References: <744877924.5841545.1521630049567.JavaMail.zimbra@kalray.eu>
 <20180321115037.GA26083@ming.t460p>
 <464125757.5843583.1521634231341.JavaMail.zimbra@kalray.eu>
 <20180321154807.GD22254@ming.t460p>
 <20180321160238.GF12909@localhost.localdomain>
 <1220434088.5871933.1521648656789.JavaMail.zimbra@kalray.eu>
Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
X-Mailing-List: linux-kernel@vger.kernel.org

----- Original Message -----
> From: "Marta Rybczynska"
> To: "Keith Busch"
> Cc: "Ming Lei", axboe@fb.com, hch@lst.de, sagi@grimberg.me,
> linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org,
> bhelgaas@google.com, linux-pci@vger.kernel.org, "Pierre-Yves Kerbrat"
> Sent: Wednesday, March 21, 2018 17:10:56
> Subject: Re: [RFC PATCH] nvme: avoid race-conditions when enabling devices
>
>> On Wed, Mar 21, 2018 at 11:48:09PM +0800, Ming Lei wrote:
>>> On Wed, Mar 21, 2018 at 01:10:31PM +0100, Marta Rybczynska wrote:
>>> > > On Wed, Mar 21, 2018 at 12:00:49PM +0100, Marta Rybczynska wrote:
>>> > >> The NVMe driver uses threads for the work at device reset, including
>>> > >> enabling the PCIe device. When multiple NVMe devices are initialized,
>>> > >> their reset works may be scheduled in parallel. Then
>>> > >> pci_enable_device_mem can be called in parallel on multiple cores.
>>> > >>
>>> > >> This causes a loop of enabling of all upstream bridges in
>>> > >> pci_enable_bridge(). pci_enable_bridge() causes multiple operations
>>> > >> including __pci_set_master and architecture-specific functions that
>>> > >> call ones like pci_enable_resources(). Both __pci_set_master()
>>> > >> and pci_enable_resources() read the PCI_COMMAND field in the PCIe
>>> > >> config space and change it. This is done as read/modify/write.
>>> > >>
>>> > >> Imagine that the PCIe tree looks like:
>>> > >> A - B - switch - C - D
>>> > >>              \- E - F
>>> > >>
>>> > >> D and F are two NVMe disks, and all devices from B down are not
>>> > >> enabled and bus mastering is not set. If their reset works are
>>> > >> scheduled in parallel, the two modifications of PCI_COMMAND may
>>> > >> happen in parallel without locking, and the system may end up with
>>> > >> part of the PCIe tree not enabled.
>>> > >
>>> > > Then it looks like serialized reset should be used, and I did see the
>>> > > commit 79c48ccf2fe ("nvme-pci: serialize pci resets") fix the issue of
>>> > > 'failed to mark controller state' in a reset stress test.
>>> > >
>>> > > But that commit only covers the case of PCI reset from the sysfs
>>> > > attribute, and maybe other cases need to be dealt with in a similar
>>> > > way too.
>>> > >
>>> >
>>> > It seems to me that the serialized reset works for multiple resets of
>>> > the same device, doesn't it? Our problem is linked to resets of
>>> > different devices that share the same PCIe tree.
>>>
>>> Given that reset shouldn't be a frequent action, it might be fine to
>>> serialize all resets from different devices.
>>
>> The driver was much simpler when we had serialized resets in line with
>> probe, but that had a bigger problem with certain init systems when
>> you put enough nvme devices in your server, making them unbootable.
>>
>> Would it be okay to serialize just the pci_enable_device across all
>> other tasks messing with the PCI topology?
>>
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index cef5ce851a92..e0a2f6c0f1cf 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2094,8 +2094,11 @@ static int nvme_pci_enable(struct nvme_dev *dev)
>>  	int result = -ENOMEM;
>>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>>
>> -	if (pci_enable_device_mem(pdev))
>> -		return result;
>> +	pci_lock_rescan_remove();
>> +	result = pci_enable_device_mem(pdev);
>> +	pci_unlock_rescan_remove();
>> +	if (result)
>> +		return -ENODEV;
>>
>>  	pci_set_master(pdev);
>>
>
> The problem may also happen with another device doing its probe while nvme
> runs its workqueue (and we have probably seen it in practice too). We were
> thinking about a lock in the PCI generic code too; that's why I've put the
> linux-pci@ list in copy.
>

Keith, it looks to me that this is going to fix the issue between two nvme
driver instances at hotplug time. This is the one we didn't cover in the
first patch. We can see the issue at driver load (so at boot), and the lock
isn't taken by the generic non-rescan code. Other calls to
pci_enable_device_mem aren't protected either (see Bjorn's message).

What do you think about applying both for now, until we have a generic fix
in pci?

Marta