Subject: Re: [PATCH v2 10/10] nvmet: Optionally use PCI P2P memory
To: Logan Gunthorpe, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
    linux-nvdimm@lists.01.org, linux-block@vger.kernel.org
Cc: Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
    Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy, Dan Williams,
    Jérôme Glisse, Benjamin Herrenschmidt, Alex Williamson, Steve Wise
References: <20180228234006.21093-1-logang@deltatee.com>
    <20180228234006.21093-11-logang@deltatee.com>
    <749e3752-4349-0bdf-5243-3d510c2b26db@grimberg.me>
    <40d69074-31a8-d06a-ade9-90de7712c553@deltatee.com>
From: Sagi Grimberg
Message-ID: <5649098f-b775-815b-8b9a-f34628873ff4@grimberg.me>
Date: Thu, 1 Mar 2018 20:35:55 +0200
In-Reply-To: <40d69074-31a8-d06a-ade9-90de7712c553@deltatee.com>

> On 01/03/18 04:03 AM, Sagi Grimberg wrote:
>> Can you describe what the plan would be to support them when these
>> devices do come along? I'd say that p2p_dev needs to become an
>> nvmet_ns reference rather than an nvmet_ctrl one. Then, when
>> CMB-capable devices come along, the ns can prefer to use its own CMB
>> instead of locating a p2p_dev device?
>
> The patchset already supports CMB drives. That's essentially what patch
> 7 is for. We change the nvme-pci driver to use the p2pmem code to
> register and manage the CMB memory. After that, it is simply available
> to the nvmet code. We have already been using this with a couple of
> prototype CMB drives.

The comment was in reference to your statement:

"Ideally, we'd want to use an NVME CMB buffer as p2p memory. This would
save an extra PCI transfer as the NVME card could just take the data
out of its own memory. However, at this time, cards with CMB buffers
don't seem to be available."

Maybe it's a leftover that confused me...

Anyway, my question still holds. If I rack several of these nvme
drives, ideally we would use _their_ CMBs for I/O that is directed to
these namespaces. This is why I was suggesting that p2p_dev should live
in nvmet_ns and not in nvmet_ctrl as a single p2p_dev used by all
namespaces. nvmet_ns seems like a more natural place to host p2p_dev
with CMB in mind.
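Roughly, what I'm picturing is something like the sketch below. It is
completely untested and not against any real tree: the helper name
(nvmet_ns_setup_p2p), the per-port client list argument, and the type I
assume pci_p2pmem_find() takes are all made up for illustration; only
blk_queue_pci_p2pdma() and the nvmet_ns fields come from your patch.

	/* added to struct nvmet_ns: */
	struct pci_dev	*p2p_dev;	/* NULL means use host memory */

	static void nvmet_ns_setup_p2p(struct nvmet_ns *ns,
				       struct list_head *port_p2p_clients)
	{
		ns->p2p_dev = NULL;

		/* Namespace can't do p2p?  Use host memory, don't fail. */
		if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
			pr_debug("%s: no p2p support, using host memory\n",
				 ns->device_path);
			return;
		}

		/*
		 * Once CMB-capable drives show up, prefer the namespace's
		 * own CMB here before searching the port's client list.
		 */
		ns->p2p_dev = pci_p2pmem_find(port_p2p_clients);
		if (!ns->p2p_dev)
			pr_debug("%s: no p2p memory found, using host memory\n",
				 ns->device_path);
	}

That way an incompatible namespace just means host memory for that one
namespace, instead of failing its enable or forcing every namespace in
the subsystem to be p2p-capable.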
>>> +static int nvmet_p2pdma_add_client(struct nvmet_ctrl *ctrl,
>>> +                   struct nvmet_ns *ns)
>>> +{
>>> +    int ret;
>>> +
>>> +    if (!blk_queue_pci_p2pdma(ns->bdev->bd_queue)) {
>>> +        pr_err("peer-to-peer DMA is not supported by %s\n",
>>> +               ns->device_path);
>>> +        return -EINVAL;
>>
>> I'd say just skip it instead of failing; in theory one can connect
>> nvme devices via p2p mem and expose other devices in the same
>> subsystem. The message could be pr_debug to reduce chattiness.
>
> No, it's actually a bit more complicated than you make it. There are a
> couple of cases:
>
> 1) The user adds a namespace but there hasn't been a connection and no
> p2pmem has been selected. In this case the add_client function is never
> called and the user can add whatever namespace they like.
>
> 2) When the first connect happens, nvmet_setup_p2pmem() will call
> add_client() for each namespace and rdma device. If any of them fail
> then it does not use a P2P device and falls back, as you'd like, to
> regular memory.
>
> 3) When the user adds a namespace after a port is in use and a
> compatible P2P device has been found. In this case, if the user tries
> to add a namespace that is not compatible with the P2P device in use,
> then adding the new namespace fails. The only alternative here is to
> tear everything down, ensure there are no P2P enabled buffers open and
> start using regular memory again... That is very difficult.

Wouldn't it all be simpler if the p2p_dev resolution were private to
the namespace?

So every namespace in a subsystem must comply with using p2p? That
seems a little harsh if it's not absolutely needed. It would be nice to
export a subsystem between two ports (on two HCAs, across NUMA nodes)
where the home node (primary path) would use p2p and failover would use
host memory...

Can you help me understand why this is absolutely not feasible?

> I also disagree that these messages should be pr_debug. If a user is
> trying to use P2P memory and they enable it but the system doesn't
> choose to use their memory, they must know why that is so they can
> make the necessary adjustments. If the system doesn't at least print a
> dmesg they are in the dark.

I was arguing that the user might have intentionally wanted to use p2p
where possible but still expose other namespaces over host memory. For
that user, the messages are spam. I guess info/warn could also suffice
(assuming we allow it; if we fail it, then there is no point in the
debug-level discussion).

>
>>> +    }
>>> +
>>> +    ret = pci_p2pdma_add_client(&ctrl->p2p_clients, nvmet_ns_dev(ns));
>>> +    if (ret)
>>> +        pr_err("failed to add peer-to-peer DMA client %s: %d\n",
>>> +               ns->device_path, ret);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>   int nvmet_ns_enable(struct nvmet_ns *ns)
>>>   {
>>>       struct nvmet_subsys *subsys = ns->subsys;
>>> @@ -299,6 +319,14 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
>>>       if (ret)
>>>           goto out_blkdev_put;
>>> +    list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
>>> +        if (ctrl->p2p_dev) {
>>> +            ret = nvmet_p2pdma_add_client(ctrl, ns);
>>> +            if (ret)
>>> +                goto out_remove_clients;
>>
>> Is this really a fatal failure given that we fall back to main
>> memory? Why not continue with main memory (and warn at best)?
>
> See above. It's fatal because we are already using p2p memory and we
> can't easily tear that all down when a user adds a new namespace.

In my mind, I/O to this namespace would use host memory and be done
with it. I guess I still don't understand why this is not possible.

>>> +    ctrl->p2p_dev = pci_p2pmem_find(&ctrl->p2p_clients);
>>
>> This is the first p2p_dev found, right? What happens if I have more
>> than a single p2p device? In theory I'd have more p2p memory I could
>> use. Have you considered making pci_p2pmem_find return the least used
>> suitable device?
>
> Yes, it currently returns the first found. I imagine a bunch of
> improvements could be made to it in future work. However, I'd probably
> start with finding the nearest p2p device and then falling back to the
> least used if that's necessary.
> At this time, though, it's not clear how complicated these systems
> will get and what's actually needed here.

Fair enough. Just wondering.
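FWIW, the kind of thing I had in mind for "least used" is sketched
below, purely as illustration: the candidate list and the size/used
counters do not exist anywhere in your patchset, they are stand-ins for
whatever accounting a p2pmem provider could export. The selection would
presumably run inside pci_p2pmem_find() after compatibility filtering.

	/*
	 * Among providers already known to be compatible with every
	 * client, prefer the one with the most free p2p memory.
	 */
	struct p2pmem_candidate {
		struct list_head	entry;
		struct pci_dev		*pdev;
		size_t			size;	/* total published p2p memory */
		size_t			used;	/* currently allocated */
	};

	static struct pci_dev *
	p2pmem_pick_least_used(struct list_head *candidates)
	{
		struct p2pmem_candidate *c, *best = NULL;

		list_for_each_entry(c, candidates, entry) {
			if (!best ||
			    c->size - c->used > best->size - best->used)
				best = c;
		}

		return best ? best->pdev : NULL;
	}

Nearest-first could then just be a pre-sort of the same candidate list
by distance from the clients before this tie-break.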