Date: Wed, 30 May 2018 18:02:06 -0400
From: Mike Snitzer <msnitzer@redhat.com>
To: Sagi Grimberg
Cc: Christoph Hellwig, Johannes Thumshirn, Keith Busch, Hannes Reinecke,
    Laurence Oberman, Ewan Milne, James Smart, Linux Kernel Mailinglist,
    Linux NVMe Mailinglist, Martin K. Petersen, Martin George,
    John Meneghini
Petersen" , Martin George , John Meneghini Subject: Re: [PATCH 0/3] Provide more fine grained control over multipathing Message-ID: <20180530220206.GA7037@redhat.com> References: <20180525125322.15398-1-jthumshirn@suse.de> <20180525130535.GA24239@lst.de> <20180525135813.GB9591@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Scanned-By: MIMEDefang 2.78 on 10.11.54.4 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Wed, 30 May 2018 22:02:08 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Wed, 30 May 2018 22:02:08 +0000 (UTC) for IP:'10.11.54.4' DOMAIN:'int-mx04.intmail.prod.int.rdu2.redhat.com' HELO:'smtp.corp.redhat.com' FROM:'msnitzer@redhat.com' RCPT:'' Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 30 2018 at 5:20pm -0400, Sagi Grimberg wrote: > Hi Folks, > > I'm sorry to chime in super late on this, but a lot has been > going on for me lately which got me off the grid. > > So I'll try to provide my input hopefully without starting any more > flames.. > > >>>This patch series aims to provide a more fine grained control over > >>>nvme's native multipathing, by allowing it to be switched on and off > >>>on a per-subsystem basis instead of a big global switch. > >> > >>No. The only reason we even allowed to turn multipathing off is > >>because you complained about installer issues. The path forward > >>clearly is native multipathing and there will be no additional support > >>for the use cases of not using it. > > > >We all basically knew this would be your position. But at this year's > >LSF we pretty quickly reached consensus that we do in fact need this. > >Except for yourself, Sagi and afaik Martin George: all on the cc were in > >attendance and agreed. > > Correction, I wasn't able to attend LSF this year (unfortunately). Yes, I was trying to say you weren't at LSF (but are on the cc). > >And since then we've exchanged mails to refine and test Johannes' > >implementation. > > > >You've isolated yourself on this issue. Please just accept that we all > >have a pretty solid command of what is needed to properly provide > >commercial support for NVMe multipath. > > > >The ability to switch between "native" and "other" multipath absolutely > >does _not_ imply anything about the winning disposition of native vs > >other. It is purely about providing commercial flexibility to use > >whatever solution makes sense for a given environment. The default _is_ > >native NVMe multipath. It is on userspace solutions for "other" > >multipath (e.g. multipathd) to allow user's to whitelist an NVMe > >subsystem to be switched to "other". > > > >Hopefully this clarifies things, thanks. > > Mike, I understand what you're saying, but I also agree with hch on > the simple fact that this is a burden on linux nvme (although less > passionate about it than hch). > > Beyond that, this is going to get much worse when we support "dispersed > namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed > namespaces" makes NVMe namespaces share-able over different subsystems > so changing the personality on a per-subsystem basis is just asking for > trouble. 
>
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.

Wouldn't expect you guys to nurture this 'mpath_personality' knob. So
when features like "dispersed namespaces" land, a negative check would
need to be added in the code to prevent switching away from "native".
And once something like "dispersed namespaces" lands we'd then have to
see about a more sophisticated switch that operates at a different
granularity. Could also be that switching one subsystem that is part of
"dispersed namespaces" would then cascade to all other associated
subsystems? Not that dissimilar from the 3rd patch in this series that
allows a 'device' switch to be done in terms of the subsystem. Anyway,
I don't know the end from the beginning on something you just told me
about ;) But we're all in this together and can take it as it comes.

I'm merely trying to bridge the gap from old dm-multipath while native
NVMe multipath gets its legs. In time I really do have aspirations to
contribute more to NVMe multipathing. I think Christoph's NVMe multipath
implementation of a bio-based device on top of NVMe core's blk-mq
device(s) is very clever and effective (blk_steal_bios() hack and all).

> Don't get me wrong, I do support your cause, and I think nvme should try
> to help, I just think that subsystem granularity is not the correct
> approach going forward.

I understand there will be limits to this 'mpath_personality' knob's
utility and it'll need to evolve over time. But the burden of making
more advanced NVMe multipath features accessible outside of native NVMe
isn't intended to be on any of the NVMe maintainers (other than maybe
remembering to disallow the switch where it makes sense in the future).

> As I said, I've been off the grid, can you remind me why global knob is
> not sufficient?

Because once nvme_core.multipath=N is set, native NVMe multipath is no
longer accessible from the same host. The goal of this patchset is to
give users a choice, not to limit them to _only_ using dm-multipath just
because they have some legacy needs.

Tough to be convincing with hypotheticals, but I could imagine a very
obvious use case for native NVMe multipathing being PCI-based embedded
NVMe "fabrics" (especially if/when the NUMA-based path selector lands).
But the same host with PCI NVMe could be connected to an FC network that
has historically always been managed via dm-multipath. Say that FC-based
infrastructure gets updated to use NVMe (to leverage a wider NVMe
investment, whatever) -- maybe those admins would still prefer to keep
using dm-multipath for NVMe over FC.

> This might sound stupid to you, but can't users that desperately must
> keep using dm-multipath (for its mature toolset or what-not) just
> stack it on multipath nvme device? (I might be completely off on
> this so feel free to correct my ignorance).

We could certainly pursue adding multipath-tools support for native NVMe
multipathing. Not opposed to it (even if just reporting topology and
state). But given the lengths NVMe multipath goes to in order to hide
devices, we'd need some way of piercing through the opaque nvme device
that native NVMe multipath exposes. That really is a tangent relative to
this patchset, though, since that kind of visibility would also benefit
the nvme cli... otherwise how are users even able to trust, but verify,
that native NVMe multipathing did what they expected it to?
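(For what it's worth, here is a rough sketch of the kind of "negative
check" I mean above. To be clear, none of this is from Johannes' patches
or from the nvme core: the struct fields and the function name are made
up purely to show the shape of the idea -- in the kernel it would simply
be a guard in the sysfs store path for the per-subsystem attribute. The
mock-up below is plain userspace C so it actually compiles and runs.)

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the real per-subsystem object; both fields are invented. */
struct nvme_subsystem {
        bool native_mpath;     /* true = native NVMe multipath personality */
        bool has_dispersed_ns; /* would be set once such a feature lands */
};

/*
 * The "negative check" idea: refuse to switch a subsystem away from
 * "native" while a feature that depends on native multipathing (e.g.
 * dispersed namespaces) is in use. Returns 0 or a negative errno, the
 * way a sysfs store handler would.
 */
static int mpath_personality_switch(struct nvme_subsystem *subsys,
                                    const char *personality)
{
        if (strcmp(personality, "native") == 0) {
                subsys->native_mpath = true;
        } else if (strcmp(personality, "other") == 0) {
                if (subsys->has_dispersed_ns)
                        return -EBUSY;  /* must stay "native" */
                subsys->native_mpath = false;
        } else {
                return -EINVAL;
        }
        return 0;
}

int main(void)
{
        struct nvme_subsystem subsys = {
                .native_mpath = true,
                .has_dispersed_ns = true,
        };

        /* switching to "other" is rejected while dispersed ns are in use */
        printf("switch to other: %d\n",
               mpath_personality_switch(&subsys, "other"));
        return 0;
}

(Obviously the real thing would also need locking and coordination with
namespaces already attached; the sketch is only meant to show where such
a check would live and why it should be cheap for the nvme maintainers
to carry.)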
Mike