From: Mina Almasry
Date: Wed, 12 Jul 2023 13:16:00 -0700
Subject: Re: Memory providers multiplexing (Was: [PATCH net-next v4 4/5] page_pool: remove PP_FLAG_PAGE_FRAG flag)
To: Jason Gunthorpe
Cc: David Ahern, Samiullah Khawaja, Willem de Bruijn, Jakub Kicinski,
    Christoph Hellwig, John Hubbard, Dan Williams, Jesper Dangaard Brouer,
    brouer@redhat.com, Alexander Duyck, Yunsheng Lin, davem@davemloft.net,
    pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
    Lorenzo Bianconi, Yisen Zhuang, Salil Mehta, Eric Dumazet, Sunil Goutham,
    Geetha sowjanya, Subbaraya Sundeep, hariprasad,
    Saeed Mahameed, Leon Romanovsky, Felix Fietkau, Ryder Lee, Shayne Chen,
    Sean Wang, Kalle Valo, Matthias Brugger, AngeloGioacchino Del Regno,
    Jesper Dangaard Brouer, Ilias Apalodimas, linux-rdma@vger.kernel.org,
    linux-wireless@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-mediatek@lists.infradead.org, Jonathan Lemon, Christian König,
    logang@deltatee.com, Bjorn Helgaas

On Wed, Jul 12, 2023 at 6:01 AM Jason Gunthorpe wrote:
>
> On Tue, Jul 11, 2023 at 08:42:24PM -0700, Mina Almasry wrote:
>
> > 1. The device memory driver would be the p2pdma provider. It would
> > expose a user API which allocates a device memory region, calls
> > pci_p2pdma_add_resource() and pci_p2pmem_publish() on it, and returns
> > a reference to it to the userspace.
>
> This is not quite right. If you convert any of the GPU drivers to use
> P2PDMA, you are going to need to restructure the p2pmem stuff to
> separate the genalloc. The GPU driver must continue to be the owner
> and allocator of the MMIO memory it already controls; we can't have
> two allocators working in parallel.
>
> The genalloc stuff supports the special NVMe use case; I don't know of
> anything else that would like to work that way.
>

I think maybe you misunderstood the proposal. AFAICT the genalloc stuff
works for us, although there are other issues with p2pdma that I need
to solve.
The proposal was that the uapi in step #1 allocates a region of GPU
memory and sets up a p2pdma provider for this region of memory. From
the perspective of the GPU, the memory is allocated and in use by the
user. The p2pdma provider uses genalloc to give out 4K regions with
struct pages to in-kernel allocators from this memory region. Why would
that not work? Looking at the code, that seems to be how p2pdma works
today: the p2pdma provider calls pci_p2pdma_add_resource() on a chunk
of its memory, and genalloc allocates memory from that chunk.

The actual issues I see with this approach are:

1. There is no way for the p2pdma provider to relinquish the memory it
has provided via pci_p2pdma_add_resource(), in the case that the user
crashed or would like to free the GPU buffer. I would need to add a
pci_p2pdma_remove_resource(). Would that be acceptable?

2. The p2pdma semantics seem to be global to the pci device, i.e. 1 GPU
can export 1 p2pdma resource at a time (the way I'm reading the API).
This is not usable for my use case. I would need multiple users to be
able to use the uapi in step #1 simultaneously: the same pci device
would need to export different p2pdma resources simultaneously, and the
p2pdma clients would need to be able to import some of those resources.
I would likely need to add an api like this:

diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..c9d754713fdc 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -19,6 +19,33 @@ struct scatterlist;
 #ifdef CONFIG_PCI_P2PDMA
 int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
 			    u64 offset);
+
+/* Adds a resource similar to pci_p2pdma_add_resource(), and returns a file
+ * handle referring to this resource. Multiple such resources can be exported
+ * by the same pci device.
+ */
+struct file *pci_p2pdma_add_resource_with_handle(struct pci_dev *pdev,
+						 int bar,
+						 size_t size,
+						 u64 offset);
+
+/* Removes a resource added via pci_p2pdma_add_resource_with_handle(). */
+void pci_p2pdma_remove_resource_with_handle(
+		struct file *p2pdma_resource_file);
+
+/* Allocates memory from a resource created using
+ * pci_p2pdma_add_resource_with_handle().
+ */
+void *pci_alloc_p2pmem_from_handle(struct file *p2pdma_resource_file,
+				   size_t size);
+
+/* Frees p2pmem to a resource created using
+ * pci_p2pdma_add_resource_with_handle().
+ */
+void pci_free_p2pmem_to_handle(struct file *p2pdma_resource_file,
+			       void *addr,
+			       size_t size);

Looking for feedback from anyone knowledgeable, but the p2pdma
maintainers as well if possible.

> > 2. The NIC driver would be the p2pdma client and orchestrator. It
> > would expose a user API which binds an rxq to a pci device. Prior to
> > the bind, the user API would check that the pci device has published
> > p2p memory (pci_has_p2pmem()), and check that the p2p mem is
> > accessible to the driver (pci_p2pdma_distance(), I think), etc.
>
> This doesn't fit the programming model for GPUs at all. You don't want
> to get packets landing in random GPU memory that a kernel side
> allocator selects; you want packets landing in GPU memory owned by a
> specific process that owns the TCP connection.
>

I think this comment is maybe a side effect of the misunderstanding. In
the proposal, the user allocates a GPU buffer using the API in step #1,
and then binds that memory to the NIC rx queue using the API in step
#2. We use flow steering & RSS to steer this user's TCP traffic to the
buffer owned by them.

> This is why DMABUF is used here as it gives a handle to the GPU
> memory. What you want is to get the P2P pages either directly from the
> DMABUF or via pin_user_pages() on the DMABUF's mmap.
>
> > AFAICT, all the concerns brought up in this thread are sidestepped by
> > using p2pdma.
> > I need not allocate struct pages in the core dma-buf code anymore
> > (or anywhere), and I need not allocate pgmaps. I would just re-use
> > the p2pdma support.
>
> Well, as I said, it is going to be a big ask to P2P enable any of the
> DRM drivers.
>
> And you still have the netmem vs zone_device struct page conflict to
> figure out.
>
> But it is a lot closer to reasonable than this RFC.
>
> Jason

--
Thanks,
Mina