2022-10-21 18:03:42

by Logan Gunthorpe

Subject: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

Hi,

This is the latest P2PDMA userspace patch set. This version includes
some cleanup from feedback from the last posting[1].

This patch set enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. Patch 1 is a
cleanup that lets try_grab_page() return a proper error code. A flag is
added to GUP() in Patch 2, then Patches 3 through 7 wire this flag up
based on whether the block queue indicates P2PDMA support. Patch 8
creates the sysfs resource that can hand out the VMAs and Patch 9
adds brief documentation for the new interface.
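
As a rough illustration of the intended userspace flow (not part of the
series itself), the sketch below maps a chunk of p2pmem and reads into
it. The helper names p2pmem_map() and p2pmem_read() are made up for this
example, and the use of MAP_SHARED with offset 0 is an assumption about
what the sysfs file accepts. The paths and file descriptors are supplied
by the caller.

```c
/*
 * Illustrative userspace sketch only. p2pmem_map() and p2pmem_read()
 * are hypothetical helpers; MAP_SHARED and offset 0 are assumptions.
 */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map len bytes of p2pmem. allocate_path is the p2pmem/allocate sysfs
 * file of the device providing the memory. Returns NULL on failure.
 */
void *p2pmem_map(const char *allocate_path, size_t len)
{
	void *buf;
	int fd;

	fd = open(allocate_path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* Each mmap() call asks the kernel for a new chunk of p2pmem. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* a mapping outlives the descriptor it came from */

	return buf == MAP_FAILED ? NULL : buf;
}

/*
 * Read len bytes at offset off straight into the p2pmem buffer.
 * nvme_fd must refer to an NVMe backed file or block device and must
 * have been opened with O_DIRECT, for example
 * open(path, O_RDONLY | O_DIRECT). len and off have to follow the
 * usual O_DIRECT alignment rules. Returns pread()'s result.
 */
ssize_t p2pmem_read(int nvme_fd, void *buf, size_t len, off_t off)
{
	return pread(nvme_fd, buf, len, off);
}
```

A caller would pass the p2pmem/allocate path of the providing device to
p2pmem_map() and hand the returned buffer to p2pmem_read(). With this
series, IO on anything other than an O_DIRECT descriptor for an NVMe
backed file or block device is expected to be rejected for such a buffer.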

Feedback welcome.

This series is based on v6.1-rc1. A git branch is available here:

https://github.com/sbates130272/linux-p2pmem/ p2pdma_user_cmb_v11

Thanks,

Logan

[1] https://lkml.kernel.org/r/[email protected]

--

Changes in v11:
- Rebased onto v6.1-rc1, fixed minor conflict in bio_map_user_iov
- The GUP test was moved to try_grab_page() and try_grab_folio().
  This ought to be a bit more future proof. It required adding a new
  cleanup patch to return a proper error code from try_grab_page().
  (Per Jason)

Changes in v10:
- Rebased onto v6.0-rc6
- Reworked iov iter changes to reuse the code better and
  name them without the _flags() suffix (per Christoph)
- Renamed a number of flags variables to gup_flags (per John)
- Minor fixups to the last documentation patch (from Greg and John)

Changes in v9:
- Rebased onto v6.0-rc2, included reworking the iov_iter patch
  due to changes there
- Drop the char device mmap implementation in favour of a sysfs
  based interface. (per Christoph)

(v8 only included the first half of the series and was merged for v6.0)

Changes in v8:
- Rebase onto v5.19-rc1
- Rework how the pages are stored in the VMA per Jason's suggestion

Changes in v7:
- Rebased onto v5.18-rc1 which includes Christoph's cleanup to
  free_zone_device_page() (similar to Ralph's patch).
- Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
  that caused a double allocation and lost p2p memory. Noticed
  by Andrew Maier.
- Collected a Reviewed-by tag from Chaitanya.
- Numerous minor fixes to commit messages

--

Logan Gunthorpe (9):
  mm: allow multiple error returns in try_grab_page()
  mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
  iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
  block: add check when merging zone device pages
  lib/scatterlist: add check when merging zone device pages
  block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
  block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
  PCI/P2PDMA: Allow userspace VMA allocations through sysfs
  ABI: sysfs-bus-pci: add documentation for p2pmem allocate

 Documentation/ABI/testing/sysfs-bus-pci |  10 ++
 block/bio.c                             |  11 ++-
 block/blk-map.c                         |  12 ++-
 drivers/pci/p2pdma.c                    | 124 ++++++++++++++++++++++++
 include/linux/mm.h                      |   3 +-
 include/linux/mmzone.h                  |  24 +++++
 include/linux/uio.h                     |   6 ++
 lib/iov_iter.c                          |  32 ++++--
 lib/scatterlist.c                       |  25 +++--
 mm/gup.c                                |  45 ++++++---
 mm/huge_memory.c                        |  19 ++--
 mm/hugetlb.c                            |  23 +++--
 12 files changed, 280 insertions(+), 54 deletions(-)


base-commit: 9abf2313adc1ca1b6180c508c25f22f9395cc780
--
2.30.2


2022-10-21 18:08:00

by Logan Gunthorpe

Subject: [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate

Add documentation for the p2pmem/allocate binary file which allows
for allocating p2pmem buffers in userspace for passing to drivers
that support them. (Currently only O_DIRECT to NVMe devices.)

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 840727fc75dc..ecf47559f495 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -407,6 +407,16 @@ Description:
 		file contains a '1' if the memory has been published for
 		use outside the driver that owns the device.

+What:		/sys/bus/pci/devices/.../p2pmem/allocate
+Date:		August 2022
+Contact:	Logan Gunthorpe <[email protected]>
+Description:
+		This file allows mapping p2pmem into userspace. For each
+		mmap() call on this file, the kernel will allocate a chunk
+		of Peer-to-Peer memory for use in Peer-to-Peer transactions.
+		This memory can be used in O_DIRECT calls to NVMe backed
+		files for Peer-to-Peer copies.
+
 What:		/sys/bus/pci/devices/.../link/clkpm
 		/sys/bus/pci/devices/.../link/l0s_aspm
 		/sys/bus/pci/devices/.../link/l1_aspm
--
2.30.2

2022-10-21 18:09:22

by Logan Gunthorpe

Subject: [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables the NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
block/blk-map.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 34735626b00f..8750f82d7da4 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -267,6 +267,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 {
 	unsigned int max_sectors = queue_max_hw_sectors(rq->q);
 	unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+	unsigned int gup_flags = 0;
 	struct bio *bio;
 	int ret;
 	int j;
@@ -278,6 +279,9 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
 	if (bio == NULL)
 		return -ENOMEM;

+	if (blk_queue_pci_p2pdma(rq->q))
+		gup_flags |= FOLL_PCI_P2PDMA;
+
 	while (iov_iter_count(iter)) {
 		struct page **pages, *stack_pages[UIO_FASTIOV];
 		ssize_t bytes;
@@ -286,11 +290,11 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,

 		if (nr_vecs <= ARRAY_SIZE(stack_pages)) {
 			pages = stack_pages;
-			bytes = iov_iter_get_pages2(iter, pages, LONG_MAX,
-						    nr_vecs, &offs);
+			bytes = iov_iter_get_pages(iter, pages, LONG_MAX,
+						   nr_vecs, &offs, gup_flags);
 		} else {
-			bytes = iov_iter_get_pages_alloc2(iter, &pages,
-							  LONG_MAX, &offs);
+			bytes = iov_iter_get_pages_alloc(iter, &pages,
+						LONG_MAX, &offs, gup_flags);
 		}
 		if (unlikely(bytes <= 0)) {
 			ret = bytes ? bytes : -EFAULT;
--
2.30.2
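
The gating in this patch can be pictured with a small userspace model.
This is only a sketch and none of it is kernel code: the struct names,
the flag value and the error code are stand-ins chosen for the example
and should not be read as the kernel's definitions. It shows the two
halves of the scheme: the block layer opts in only when the queue
advertises P2PDMA support, and the page-grabbing side refuses P2PDMA
pages for callers that did not opt in.

```c
/*
 * Userspace model of the FOLL_PCI_P2PDMA gating. All names with a
 * _model suffix, the flag value and the error code are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>

#define FOLL_PCI_P2PDMA_MODEL	0x1	/* placeholder flag value */

struct queue_model {
	bool pci_p2pdma;	/* stands in for blk_queue_pci_p2pdma() */
};

struct page_model {
	bool is_pci_p2pdma;	/* stands in for is_pci_p2pdma_page() */
};

/* Mirrors the check added to bio_map_user_iov() by this patch. */
unsigned int gup_flags_for_queue(const struct queue_model *q)
{
	unsigned int gup_flags = 0;

	if (q->pci_p2pdma)
		gup_flags |= FOLL_PCI_P2PDMA_MODEL;

	return gup_flags;
}

/*
 * Models the GUP side: a P2PDMA page is only handed out when the
 * caller opted in. The error code is an assumption of this sketch.
 */
int grab_page_model(const struct page_model *page, unsigned int gup_flags)
{
	if (page->is_pci_p2pdma && !(gup_flags & FOLL_PCI_P2PDMA_MODEL))
		return -EREMOTEIO;

	return 0;
}
```

In this model an ordinary page is accepted regardless of the queue,
while a P2PDMA page is accepted only when the flags came from a queue
with P2PDMA support.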

2022-10-24 19:21:24

by Christoph Hellwig

Subject: Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

The series looks good to me now. How do we want to handle it? I think
we need a special branch somewhere (maybe in the block or mm trees?)
so that we can base the other iov_iter work from John on it. Also
Al has a whole bunch of iov_iter changes that we probably want on
the same branch as well, although some of those (READ vs WRITE fixups)
look like 6.1 material to me.

2022-10-24 21:13:29

by John Hubbard

Subject: Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

On 10/24/22 08:03, Christoph Hellwig wrote:
> The series looks good to me now. How do we want to handle it? I think
> we need a special branch somewhere (maybe in the block or mm trees?)
> so that we can base the other iov_iter work from John on it. Also
> Al has a whole bunch of iov_iter changes that we probably want on
> the same branch as well, although some of those (READ vs WRITE fixups)
> look like 6.1 material to me.
>

A little earlier, Jens graciously offered [1] to provide a topic branch,
such as:

for-6.2/block-gup [2]

(I've moved the name forward from 6.1 to 6.2, because that discussion
was 7 weeks ago.)


[1] https://lore.kernel.org/[email protected]
[2] https://lore.kernel.org/[email protected]

thanks,
--
John Hubbard
NVIDIA

2022-10-25 02:15:29

by Chaitanya Kulkarni

Subject: Re: [PATCH v11 9/9] ABI: sysfs-bus-pci: add documentation for p2pmem allocate

On 10/21/22 10:41, Logan Gunthorpe wrote:
> Add documentation for the p2pmem/allocate binary file which allows
> for allocating p2pmem buffers in userspace for passing to drivers
> that support them. (Currently only O_DIRECT to NVMe devices.)
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> Reviewed-by: Greg Kroah-Hartman <[email protected]>
> ---

Reviewed-by: Chaitanya Kulkarni <[email protected]>

-ck

2022-10-25 02:22:42

by Chaitanya Kulkarni

Subject: Re: [PATCH v11 7/9] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

On 10/21/22 10:41, Logan Gunthorpe wrote:
> When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
> iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
> passed from userspace and enables the NVMe passthru requests to
> use P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: John Hubbard <[email protected]>
> ---
> block/blk-map.c | 12 ++++++++----

Reviewed-by: Chaitanya Kulkarni <[email protected]>

-ck


2022-11-08 08:05:17

by Christoph Hellwig

Subject: Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
> A little earlier, Jens graciously offered [1] to provide a topic branch,
> such as:
>
> for-6.2/block-gup [2]
>
> (I've moved the name forward from 6.1 to 6.2, because that discussion
> was 7 weeks ago.)

So what are we going to do with this series? It would be sad to miss
the merge window again.

2022-11-09 17:51:31

by Logan Gunthorpe

Subject: Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

@add Jens

On 2022-11-07 23:56, Christoph Hellwig wrote:
> On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
>> A little earlier, Jens graciously offered [1] to provide a topic branch,
>> such as:
>>
>> for-6.2/block-gup [2]
>>
>> (I've moved the name forward from 6.1 to 6.2, because that discussion
>> was 7 weeks ago.)
>
> So what are we going to do with this series? It would be sad to miss
> the merge window again.

I noticed Jens wasn't copied on this series. I've added him. It would be
nice to get this in someone's tree soon.

Thanks!

Logan

2022-11-09 18:38:25

by Jens Axboe

Subject: Re: [PATCH v11 0/9] Userspace P2PDMA with O_DIRECT NVMe devices

On 11/9/22 10:28 AM, Logan Gunthorpe wrote:
> @add Jens
>
> On 2022-11-07 23:56, Christoph Hellwig wrote:
>> On Mon, Oct 24, 2022 at 12:15:56PM -0700, John Hubbard wrote:
>>> A little earlier, Jens graciously offered [1] to provide a topic branch,
>>> such as:
>>>
>>> for-6.2/block-gup [2]
>>>
>>> (I've moved the name forward from 6.1 to 6.2, because that discussion
>>> was 7 weeks ago.)
>>
>> So what are we going to do with this series? It would be sad to miss
>> the merge window again.
>
> I noticed Jens wasn't copied on this series. I've added him. It would be
> nice to get this in someone's tree soon.

I took a look and the series looks fine to me.

--
Jens Axboe