2022-08-25 15:27:33

by Logan Gunthorpe

Subject: [PATCH v9 0/8] Userspace P2PDMA with O_DIRECT NVMe devices

Hi,

This is the latest P2PDMA userspace patch set. Since the last full
posting[1] the first half of the series[2] has made it into v6.0-rc1.

This version of the patchset also switches to using a sysfs binary
file instead of mmapping the nvme char device directly. This
removes the need for the anonymous inode as well as the numerous
hooks into the nvme driver. The file to mmap will be found in
/sys/<pci_device>/p2pmem/allocate. The latest version of this patch
set is much smaller as a result of these simplifications.

This patch set enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch 1, then Patches 2 through 6 wire this flag up based
on whether the block queue indicates P2PDMA support. Patch 7
creates the sysfs resource that hands out the VMAs, and Patch 8
adds brief documentation for the new interface.
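
As a rough sketch of the intended usage from userspace (the device
path, file name and buffer size below are illustrative assumptions,
not fixed by this series):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2 * 1024 * 1024;

            /* hypothetical device; any PCI device exposing p2pmem works */
            int afd = open("/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate",
                           O_RDWR);
            int dfd = open("/mnt/nvme/test.dat", O_RDWR | O_DIRECT);

            if (afd < 0 || dfd < 0)
                    return 1;

            /* each mmap() on the file hands back a chunk of the CMB */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_SHARED, afd, 0);
            if (buf == MAP_FAILED)
                    return 1;

            /* the VMA may only be used with O_DIRECT on NVMe-backed files */
            if (write(dfd, buf, len) < 0)
                    return 1;

            munmap(buf, len);
            return 0;
    }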

Feedback welcome.

This series is based on v6.0-rc2. A git branch is available here:

https://github.com/sbates130272/linux-p2pmem/ p2pdma_user_cmb_v9

Thanks,

Logan

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lkml.kernel.org/r/[email protected]

--

Changes since v7:
- Rebase onto v6.0-rc2, which included reworking the iov_iter patch
due to changes there
- Drop the char device mmap implementation in favour of a sysfs
based interface. (per Christoph)

Changes since v6:
- Rebase onto v5.19-rc1
- Rework how the pages are stored in the VMA per Jason's suggestion

Changes since v5:
- Rebased onto v5.18-rc1 which includes Christoph's cleanup to
free_zone_device_page() (similar to Ralph's patch).
- Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
- Collected a Reviewed-by tag from Chaitanya.
- Numerous minor fixes to commit messages


--

Logan Gunthorpe (8):
mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
block: add check when merging zone device pages
lib/scatterlist: add check when merging zone device pages
block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
PCI/P2PDMA: Allow userspace VMA allocations through sysfs
ABI: sysfs-bus-pci: add documentation for p2pmem allocate

Documentation/ABI/testing/sysfs-bus-pci | 12 ++-
block/bio.c | 12 ++-
block/blk-map.c | 7 +-
drivers/pci/p2pdma.c | 124 ++++++++++++++++++++++++
include/linux/mm.h | 1 +
include/linux/mmzone.h | 24 +++++
include/linux/uio.h | 6 ++
lib/iov_iter.c | 40 +++++++-
lib/scatterlist.c | 25 +++--
mm/gup.c | 22 ++++-
10 files changed, 254 insertions(+), 19 deletions(-)


base-commit: 1c23f9e627a7b412978b4e852793c5e3c3efc555
--
2.30.2


2022-08-25 15:27:34

by Logan Gunthorpe

Subject: [PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().

This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/uio.h | 6 ++++++
lib/iov_iter.c | 40 ++++++++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 5896af36199c..76ba69edb8c5 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
loff_t start, size_t count);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start,
+ unsigned int gup_flags);
ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
size_t maxsize, unsigned maxpages, size_t *start);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start,
+ unsigned int gup_flags);
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
size_t maxsize, size_t *start);
int iov_iter_npages(const struct iov_iter *i, int maxpages);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 4b7fce72e3e5..dedb78f3c655 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1427,7 +1427,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,

static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- unsigned int maxpages, size_t *start)
+ unsigned int maxpages, size_t *start,
+ unsigned int gup_flags)
{
unsigned int n;

@@ -1439,7 +1440,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
maxsize = MAX_RW_COUNT;

if (likely(user_backed_iter(i))) {
- unsigned int gup_flags = 0;
unsigned long addr;
int res;

@@ -1497,10 +1497,24 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i,
return 0;
BUG_ON(!pages);

- return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
+ return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
+ start, 0);
}
EXPORT_SYMBOL(iov_iter_get_pages2);

+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start,
+ unsigned int gup_flags)
+{
+ if (!maxpages)
+ return 0;
+ BUG_ON(!pages);
+
+ return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
+ start, gup_flags);
+}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
+
ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
struct page ***pages, size_t maxsize,
size_t *start)
@@ -1509,7 +1523,7 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,

*pages = NULL;

- len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
+ len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start, 0);
if (len <= 0) {
kvfree(*pages);
*pages = NULL;
@@ -1518,6 +1532,24 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
}
EXPORT_SYMBOL(iov_iter_get_pages_alloc2);

+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ size_t *start, unsigned int gup_flags)
+{
+ ssize_t len;
+
+ *pages = NULL;
+
+ len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
+ gup_flags);
+ if (len <= 0) {
+ kvfree(*pages);
+ *pages = NULL;
+ }
+ return len;
+}
+EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
+
size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
struct iov_iter *i)
{
--
2.30.2
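
For illustration, a caller on a P2PDMA-capable path would use the new
helper roughly as follows (a sketch, not code from the series; 'iter'
and the page-array bounds come from the surrounding driver context):

    struct page *pages[16];
    size_t offset;
    ssize_t n;

    /* mirrors the gup_flags that patches 5 and 6 pass down */
    n = iov_iter_get_pages_flags(iter, pages, LONG_MAX, 16, &offset,
                                 FOLL_PCI_P2PDMA);
    if (n <= 0)
            return n ? n : -EFAULT;
    /* n bytes pinned, starting at 'offset' into pages[0] */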

2022-08-25 15:28:02

by Logan Gunthorpe

Subject: [PATCH v9 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace, enabling the O_DIRECT path both in iomap-based
filesystems and directly to block devices.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/bio.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 969607bc1f4d..ca09d79a0683 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1200,6 +1200,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
+ unsigned int flags = 0;
ssize_t size, left;
unsigned len, i = 0;
size_t offset, trim;
@@ -1213,6 +1214,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);

+ if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
+ blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
+ flags |= FOLL_PCI_P2PDMA;
+
/*
* Each segment in the iov is required to be a block size multiple.
* However, we may not be able to get the entire segment if it spans
@@ -1220,8 +1225,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
* result to ensure the bio's total size is correct. The remainder of
* the iov data will be picked up in the next bio iteration.
*/
- size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
- nr_pages, &offset);
+ size = iov_iter_get_pages_flags(iter, pages,
+ UINT_MAX - bio->bi_iter.bi_size,
+ nr_pages, &offset, flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;

--
2.30.2

2022-08-25 15:29:20

by Logan Gunthorpe

Subject: [PATCH v9 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate

Add documentation for the p2pmem/allocate binary file which allows
for allocating p2pmem buffers in userspace for passing to drivers
that support them. (Currently only O_DIRECT to NVMe devices.)

Signed-off-by: Logan Gunthorpe <[email protected]>
---
Documentation/ABI/testing/sysfs-bus-pci | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 6fc2c2efe8ab..dca5e032b4fa 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -171,7 +171,7 @@ Description:
binary file containing the Vital Product Data for the
device. It should follow the VPD format defined in
PCI Specification 2.1 or 2.2, but users should consider
- that some devices may have incorrectly formatted data.
+ that some devices may have incorrectly formatted data.
If the underlying VPD has a writable section then the
corresponding section of this file will be writable.

@@ -407,6 +407,16 @@ Description:
file contains a '1' if the memory has been published for
use outside the driver that owns the device.

+What: /sys/bus/pci/devices/.../p2pmem/allocate
+Date: August 2022
+Contact: Logan Gunthorpe <[email protected]>
+Description:
+ This attribute allows mapping p2pmem into userspace. For
+ each mmap() call on this file, the kernel will allocate
+ a chunk of Peer-to-Peer memory for use in Peer-to-Peer
+ transactions. This memory can be used in O_DIRECT calls
+ to NVMe backed files for Peer-to-Peer copies.
+
What: /sys/bus/pci/devices/.../link/clkpm
/sys/bus/pci/devices/.../link/l0s_aspm
/sys/bus/pci/devices/.../link/l1_aspm
--
2.30.2

2022-08-25 15:29:51

by Logan Gunthorpe

Subject: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

Create a sysfs bin attribute called "allocate" under the existing
"p2pmem" group. The only allowable operation on this file is the mmap()
call.

When mmap() is called on this attribute, the kernel allocates a chunk of
memory from the genalloc and inserts the pages into the VMA. The
dev_pagemap .page_free callback will indicate when these pages are no
longer used and they will be put back into the genalloc.

On device unbind, remove the sysfs file before the memremap_pages are
cleaned up. This ensures unmap_mapping_range() is called on the file's
inode and no new mappings can be created.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 124 insertions(+)

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4496a7c5c478..a6ed6bbca214 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(published);

+static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+ struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
+ size_t len = vma->vm_end - vma->vm_start;
+ struct pci_p2pdma *p2pdma;
+ struct percpu_ref *ref;
+ unsigned long vaddr;
+ void *kaddr;
+ int ret;
+
+ /* prevent private mappings from being established */
+ if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted private mapping\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ if (vma->vm_pgoff) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted mapping with non-zero offset\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ rcu_read_lock();
+ p2pdma = rcu_dereference(pdev->p2pdma);
+ if (!p2pdma) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
+ if (!kaddr) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * vm_insert_page() can sleep, so a reference is taken to mapping
+ * such that rcu_read_unlock() can be done before inserting the
+ * pages
+ */
+ if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
+ ret = -ENODEV;
+ goto out_free_mem;
+ }
+ rcu_read_unlock();
+
+ for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
+ ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+ if (ret) {
+ gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+ return ret;
+ }
+ percpu_ref_get(ref);
+ put_page(virt_to_page(kaddr));
+ kaddr += PAGE_SIZE;
+ len -= PAGE_SIZE;
+ }
+
+ percpu_ref_put(ref);
+
+ return 0;
+out_free_mem:
+ gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
+static struct bin_attribute p2pmem_alloc_attr = {
+ .attr = { .name = "allocate", .mode = 0660 },
+ .mmap = p2pmem_alloc_mmap,
+ /*
+ * Some places where we want to call mmap (ie. python) will check
+ * that the file size is greater than the mmap size before allowing
+ * the mmap to continue. To work around this, just set the size
+ * to be very large.
+ */
+ .size = SZ_1T,
+};
+
static struct attribute *p2pmem_attrs[] = {
&dev_attr_size.attr,
&dev_attr_available.attr,
@@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
NULL,
};

+static struct bin_attribute *p2pmem_bin_attrs[] = {
+ &p2pmem_alloc_attr,
+ NULL,
+};
+
static const struct attribute_group p2pmem_group = {
.attrs = p2pmem_attrs,
+ .bin_attrs = p2pmem_bin_attrs,
.name = "p2pmem",
};

+static void p2pdma_page_free(struct page *page)
+{
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct percpu_ref *ref;
+
+ gen_pool_free_owner(pgmap->provider->p2pdma->pool,
+ (uintptr_t)page_to_virt(page), PAGE_SIZE,
+ (void **)&ref);
+ percpu_ref_put(ref);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+ .page_free = p2pdma_page_free,
+};
+
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
@@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
}

+static void pci_p2pdma_unmap_mappings(void *data)
+{
+ struct pci_dev *pdev = data;
+
+ /*
+ * Removing the alloc attribute from sysfs will call
+ * unmap_mapping_range() on the inode, teardown any existing userspace
+ * mappings and prevent new ones from being created.
+ */
+ sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+ p2pmem_group.name);
+}
+
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
@@ -198,6 +316,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->range.end = pgmap->range.start + size - 1;
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->ops = &p2pdma_pgmap_ops;

p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
@@ -209,6 +328,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
goto pgmap_free;
}

+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+ pdev);
+ if (error)
+ goto pages_free;
+
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
--
2.30.2
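
For context, the allocate file only appears for devices whose driver
has registered p2pmem through the existing API; a provider does so
roughly as follows (a sketch; the BAR number and size are assumptions
for illustration):

    /* in the provider's probe(): expose part of a BAR as p2pmem */
    int rc = pci_p2pdma_add_resource(pdev, 4, SZ_1M, 0);
    if (rc)
            return rc;

    /* advertise the memory for use outside the owning driver */
    pci_p2pmem_publish(pdev, true);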

2022-08-25 15:56:46

by Logan Gunthorpe

Subject: [PATCH v9 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

GUP callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
allow obtaining P2PDMA pages. If GUP is called without the flag and a
P2PDMA page is found, it will return an error.

FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.

Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
include/linux/mm.h | 1 +
mm/gup.c | 22 +++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3bedc449c14d..37a3e91e6e77 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2891,6 +2891,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */

/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 732825157430..79aea452619e 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -566,6 +566,12 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
goto out;
}

+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+ is_pci_p2pdma_page(page))) {
+ page = ERR_PTR(-EREMOTEIO);
+ goto out;
+ }
+
VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
!PageAnonExclusive(page), page);

@@ -1015,6 +1021,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
return -EOPNOTSUPP;

+ if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
+ return -EOPNOTSUPP;
+
if (vma_is_secretmem(vma))
return -EFAULT;

@@ -2359,6 +2368,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
page = pte_page(pte);

+ if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
+ is_pci_p2pdma_page(page)))
+ goto pte_unmap;
+
folio = try_grab_folio(page, 1, flags);
if (!folio)
goto pte_unmap;
@@ -2438,6 +2451,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
undo_dev_pagemap(nr, nr_start, flags, pages);
break;
}
+
+ if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
+ undo_dev_pagemap(nr, nr_start, flags, pages);
+ break;
+ }
+
SetPageReferenced(page);
pages[*nr] = page;
if (unlikely(!try_grab_page(page, flags))) {
@@ -2926,7 +2945,8 @@ static int internal_get_user_pages_fast(unsigned long start,

if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
- FOLL_FAST_ONLY | FOLL_NOFAULT)))
+ FOLL_FAST_ONLY | FOLL_NOFAULT |
+ FOLL_PCI_P2PDMA)))
return -EINVAL;

if (gup_flags & FOLL_PIN)
--
2.30.2
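
For illustration, a fast-GUP caller that can cope with P2PDMA pages
opts in roughly as follows (a sketch, not code from the series; 'addr'
is assumed to point at a suitable user mapping, e.g. one backed by the
p2pmem allocate file):

    struct page *pages[16];
    int n;

    n = get_user_pages_fast(addr, 16, FOLL_WRITE | FOLL_PCI_P2PDMA,
                            pages);
    if (n < 0)
            return n;
    /* without FOLL_PCI_P2PDMA, hitting a P2PDMA page fails with
     * -EREMOTEIO instead */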

2022-08-25 16:01:49

by Logan Gunthorpe

Subject: [PATCH v9 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthru requests to
use P2PDMA pages.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-map.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-map.c b/block/blk-map.c
index 7196a6b64c80..1378f49ca5ca 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -236,6 +236,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
{
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
+ unsigned int flags = 0;
struct bio *bio;
int ret;
int j;
@@ -248,13 +249,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
return -ENOMEM;
bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));

+ if (blk_queue_pci_p2pdma(rq->q))
+ flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;

- bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
+ bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+ &offs, flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
--
2.30.2

2022-08-25 16:08:26

by Logan Gunthorpe

Subject: [PATCH v9 3/8] block: add check when merging zone device pages

Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment.

Add a helper to determine if zone device pages are mergeable and use
it in page_is_mergeable(). The helper returns true if both pages are
not zone device pages, or if both are zone device pages with the same
pgmap.

Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/bio.c | 2 ++
include/linux/mmzone.h | 24 ++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 3d3a2678fea2..969607bc1f4d 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -865,6 +865,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
return false;
if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
return false;
+ if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
+ return false;

*same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
if (*same_page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e24b40c52468..2c31915b057e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -794,6 +794,25 @@ static inline bool is_zone_device_page(const struct page *page)
{
return page_zonenum(page) == ZONE_DEVICE;
}
+
+/*
+ * Consecutive zone device pages should not be merged into the same sgl
+ * or bvec segment with other types of pages or if they belong to different
+ * pgmaps. Otherwise getting the pgmap of a given segment is not possible
+ * without scanning the entire segment. This helper returns true either if
+ * both pages are not zone device pages or both pages are zone device pages
+ * with the same pgmap.
+ */
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+ const struct page *b)
+{
+ if (is_zone_device_page(a) != is_zone_device_page(b))
+ return false;
+ if (!is_zone_device_page(a))
+ return true;
+ return a->pgmap == b->pgmap;
+}
+
extern void memmap_init_zone_device(struct zone *, unsigned long,
unsigned long, struct dev_pagemap *);
#else
@@ -801,6 +820,11 @@ static inline bool is_zone_device_page(const struct page *page)
{
return false;
}
+static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
+ const struct page *b)
+{
+ return true;
+}
#endif

static inline bool folio_is_zone_device(const struct folio *folio)
--
2.30.2
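
Patch 4 of the series (not shown in this excerpt) applies the same
helper when merging scatterlist segments; the shape of that check is
roughly the following (a sketch; 'pages_mergeable' is a hypothetical
name, 'prv' is the segment being extended):

    static bool pages_mergeable(struct scatterlist *prv, struct page *page)
    {
            /* never merge across pgmaps or mix device and normal pages */
            return zone_device_pages_have_same_pgmap(sg_page(prv), page);
    }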

2022-09-01 16:26:36

by Greg Kroah-Hartman

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Thu, Aug 25, 2022 at 09:24:24AM -0600, Logan Gunthorpe wrote:
> Create a sysfs bin attribute called "allocate" under the existing
> "p2pmem" group. The only allowable operation on this file is the mmap()
> call.
>
> When mmap() is called on this attribute, the kernel allocates a chunk of
> memory from the genalloc and inserts the pages into the VMA. The
> dev_pagemap .page_free callback will indicate when these pages are no
> longer used and they will be put back into the genalloc.
>
> On device unbind, remove the sysfs file before the memremap_pages are
> cleaned up. This ensures unmap_mapping_range() is called on the files
> inode and no new mappings can be created.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 124 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 4496a7c5c478..a6ed6bbca214 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RO(published);
>
> +static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> + struct bin_attribute *attr, struct vm_area_struct *vma)
> +{
> + struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
> + size_t len = vma->vm_end - vma->vm_start;
> + struct pci_p2pdma *p2pdma;
> + struct percpu_ref *ref;
> + unsigned long vaddr;
> + void *kaddr;
> + int ret;
> +
> + /* prevent private mappings from being established */
> + if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> + pci_info_ratelimited(pdev,
> + "%s: fail, attempted private mapping\n",
> + current->comm);
> + return -EINVAL;
> + }
> +
> + if (vma->vm_pgoff) {
> + pci_info_ratelimited(pdev,
> + "%s: fail, attempted mapping with non-zero offset\n",
> + current->comm);
> + return -EINVAL;
> + }
> +
> + rcu_read_lock();
> + p2pdma = rcu_dereference(pdev->p2pdma);
> + if (!p2pdma) {
> + ret = -ENODEV;
> + goto out;
> + }
> +
> + kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
> + if (!kaddr) {
> + ret = -ENOMEM;
> + goto out;
> + }
> +
> + /*
> + * vm_insert_page() can sleep, so a reference is taken to mapping
> + * such that rcu_read_unlock() can be done before inserting the
> + * pages
> + */
> + if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
> + ret = -ENODEV;
> + goto out_free_mem;
> + }
> + rcu_read_unlock();
> +
> + for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> + ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
> + if (ret) {
> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> + return ret;
> + }
> + percpu_ref_get(ref);
> + put_page(virt_to_page(kaddr));
> + kaddr += PAGE_SIZE;
> + len -= PAGE_SIZE;
> + }
> +
> + percpu_ref_put(ref);
> +
> + return 0;
> +out_free_mem:
> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> +out:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +static struct bin_attribute p2pmem_alloc_attr = {
> + .attr = { .name = "allocate", .mode = 0660 },
> + .mmap = p2pmem_alloc_mmap,
> + /*
> + * Some places where we want to call mmap (ie. python) will check
> + * that the file size is greater than the mmap size before allowing
> + * the mmap to continue. To work around this, just set the size
> + * to be very large.
> + */
> + .size = SZ_1T,
> +};
> +
> static struct attribute *p2pmem_attrs[] = {
> &dev_attr_size.attr,
> &dev_attr_available.attr,
> @@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
> NULL,
> };
>
> +static struct bin_attribute *p2pmem_bin_attrs[] = {
> + &p2pmem_alloc_attr,
> + NULL,
> +};
> +
> static const struct attribute_group p2pmem_group = {
> .attrs = p2pmem_attrs,
> + .bin_attrs = p2pmem_bin_attrs,
> .name = "p2pmem",
> };
>
> +static void p2pdma_page_free(struct page *page)
> +{
> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> + struct percpu_ref *ref;
> +
> + gen_pool_free_owner(pgmap->provider->p2pdma->pool,
> + (uintptr_t)page_to_virt(page), PAGE_SIZE,
> + (void **)&ref);
> + percpu_ref_put(ref);
> +}
> +
> +static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> + .page_free = p2pdma_page_free,
> +};
> +
> static void pci_p2pdma_release(void *data)
> {
> struct pci_dev *pdev = data;
> @@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
> return error;
> }
>
> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> + struct pci_dev *pdev = data;
> +
> + /*
> + * Removing the alloc attribute from sysfs will call
> + * unmap_mapping_range() on the inode, teardown any existing userspace
> + * mappings and prevent new ones from being created.
> + */
> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
> + p2pmem_group.name);

Wait, why are you manually removing the sysfs file here? It's part of
the group; if you do this then it is gone forever, right? Why do this
manually? The sysfs core should handle this for you if the device is
removed.

And worst case, just pass in the device, not the pci device.

thanks,

greg k-h

2022-09-01 17:07:09

by Greg Kroah-Hartman

Subject: Re: [PATCH v9 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate

On Thu, Aug 25, 2022 at 09:24:25AM -0600, Logan Gunthorpe wrote:
> Add documentation for the p2pmem/allocate binary file which allows
> for allocating p2pmem buffers in userspace for passing to drivers
> that support them. (Currently only O_DIRECT to NVMe devices.)
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 6fc2c2efe8ab..dca5e032b4fa 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -171,7 +171,7 @@ Description:
> binary file containing the Vital Product Data for the
> device. It should follow the VPD format defined in
> PCI Specification 2.1 or 2.2, but users should consider
> - that some devices may have incorrectly formatted data.
> + that some devices may have incorrectly formatted data.
> If the underlying VPD has a writable section then the
> corresponding section of this file will be writable.
>

Not relevant change :(

thanks,

greg k-h

2022-09-01 17:55:11

by Logan Gunthorpe

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs




On 2022-09-01 10:20, Greg Kroah-Hartman wrote:
> On Thu, Aug 25, 2022 at 09:24:24AM -0600, Logan Gunthorpe wrote:
>> Create a sysfs bin attribute called "allocate" under the existing
>> "p2pmem" group. The only allowable operation on this file is the mmap()
>> call.
>>
>> When mmap() is called on this attribute, the kernel allocates a chunk of
>> memory from the genalloc and inserts the pages into the VMA. The
>> dev_pagemap .page_free callback will indicate when these pages are no
>> longer used and they will be put back into the genalloc.
>>
>> On device unbind, remove the sysfs file before the memremap_pages are
>> cleaned up. This ensures unmap_mapping_range() is called on the files
>> inode and no new mappings can be created.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 124 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 4496a7c5c478..a6ed6bbca214 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
>> }
>> static DEVICE_ATTR_RO(published);
>>
>> +static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
>> + struct bin_attribute *attr, struct vm_area_struct *vma)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
>> + size_t len = vma->vm_end - vma->vm_start;
>> + struct pci_p2pdma *p2pdma;
>> + struct percpu_ref *ref;
>> + unsigned long vaddr;
>> + void *kaddr;
>> + int ret;
>> +
>> + /* prevent private mappings from being established */
>> + if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
>> + pci_info_ratelimited(pdev,
>> + "%s: fail, attempted private mapping\n",
>> + current->comm);
>> + return -EINVAL;
>> + }
>> +
>> + if (vma->vm_pgoff) {
>> + pci_info_ratelimited(pdev,
>> + "%s: fail, attempted mapping with non-zero offset\n",
>> + current->comm);
>> + return -EINVAL;
>> + }
>> +
>> + rcu_read_lock();
>> + p2pdma = rcu_dereference(pdev->p2pdma);
>> + if (!p2pdma) {
>> + ret = -ENODEV;
>> + goto out;
>> + }
>> +
>> + kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
>> + if (!kaddr) {
>> + ret = -ENOMEM;
>> + goto out;
>> + }
>> +
>> + /*
>> + * vm_insert_page() can sleep, so a reference is taken to mapping
>> + * such that rcu_read_unlock() can be done before inserting the
>> + * pages
>> + */
>> + if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
>> + ret = -ENODEV;
>> + goto out_free_mem;
>> + }
>> + rcu_read_unlock();
>> +
>> + for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
>> + ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
>> + if (ret) {
>> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
>> + return ret;
>> + }
>> + percpu_ref_get(ref);
>> + put_page(virt_to_page(kaddr));
>> + kaddr += PAGE_SIZE;
>> + len -= PAGE_SIZE;
>> + }
>> +
>> + percpu_ref_put(ref);
>> +
>> + return 0;
>> +out_free_mem:
>> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
>> +out:
>> + rcu_read_unlock();
>> + return ret;
>> +}
>> +
>> +static struct bin_attribute p2pmem_alloc_attr = {
>> + .attr = { .name = "allocate", .mode = 0660 },
>> + .mmap = p2pmem_alloc_mmap,
>> + /*
>> + * Some places where we want to call mmap (ie. python) will check
>> + * that the file size is greater than the mmap size before allowing
>> + * the mmap to continue. To work around this, just set the size
>> + * to be very large.
>> + */
>> + .size = SZ_1T,
>> +};
>> +
>> static struct attribute *p2pmem_attrs[] = {
>> &dev_attr_size.attr,
>> &dev_attr_available.attr,
>> @@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
>> NULL,
>> };
>>
>> +static struct bin_attribute *p2pmem_bin_attrs[] = {
>> + &p2pmem_alloc_attr,
>> + NULL,
>> +};
>> +
>> static const struct attribute_group p2pmem_group = {
>> .attrs = p2pmem_attrs,
>> + .bin_attrs = p2pmem_bin_attrs,
>> .name = "p2pmem",
>> };
>>
>> +static void p2pdma_page_free(struct page *page)
>> +{
>> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
>> + struct percpu_ref *ref;
>> +
>> + gen_pool_free_owner(pgmap->provider->p2pdma->pool,
>> + (uintptr_t)page_to_virt(page), PAGE_SIZE,
>> + (void **)&ref);
>> + percpu_ref_put(ref);
>> +}
>> +
>> +static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
>> + .page_free = p2pdma_page_free,
>> +};
>> +
>> static void pci_p2pdma_release(void *data)
>> {
>> struct pci_dev *pdev = data;
>> @@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
>> return error;
>> }
>>
>> +static void pci_p2pdma_unmap_mappings(void *data)
>> +{
>> + struct pci_dev *pdev = data;
>> +
>> + /*
>> + * Removing the alloc attribute from sysfs will call
>> + * unmap_mapping_range() on the inode, teardown any existing userspace
>> + * mappings and prevent new ones from being created.
>> + */
>> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
>> + p2pmem_group.name);
>
> Wait, why are you manually removing the sysfs file here? It's part of
> the group, if you do this then it is gone for forever, right? Why
> manually do this the sysfs core should handle this for you if the device
> is removed.

We have to make sure the mappings are all removed before the cleanup of
devm_memremap_pages(), which will wait for all the pages to be freed. If
we don't do this, any userspace mapping will hang the cleanup until those
mappings are torn down themselves.

> And worst case, just pass in the device, not the pci device.

Ok, I'll make that change for v10.

Logan

2022-09-01 17:57:44

by Greg Kroah-Hartman

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Thu, Sep 01, 2022 at 10:32:55AM -0600, Logan Gunthorpe wrote:
>
>
>
> On 2022-09-01 10:20, Greg Kroah-Hartman wrote:
> > On Thu, Aug 25, 2022 at 09:24:24AM -0600, Logan Gunthorpe wrote:
> >> Create a sysfs bin attribute called "allocate" under the existing
> >> "p2pmem" group. The only allowable operation on this file is the mmap()
> >> call.
> >>
> >> When mmap() is called on this attribute, the kernel allocates a chunk of
> >> memory from the genalloc and inserts the pages into the VMA. The
> >> dev_pagemap .page_free callback will indicate when these pages are no
> >> longer used and they will be put back into the genalloc.
> >>
> >> On device unbind, remove the sysfs file before the memremap_pages are
> >> cleaned up. This ensures unmap_mapping_range() is called on the files
> >> inode and no new mappings can be created.
> >>
> >> Signed-off-by: Logan Gunthorpe <[email protected]>
> >> ---
> >> drivers/pci/p2pdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 124 insertions(+)
> >>
> >> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >> index 4496a7c5c478..a6ed6bbca214 100644
> >> --- a/drivers/pci/p2pdma.c
> >> +++ b/drivers/pci/p2pdma.c
> >> @@ -89,6 +89,90 @@ static ssize_t published_show(struct device *dev, struct device_attribute *attr,
> >> }
> >> static DEVICE_ATTR_RO(published);
> >>
> >> +static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
> >> + struct bin_attribute *attr, struct vm_area_struct *vma)
> >> +{
> >> + struct pci_dev *pdev = to_pci_dev(kobj_to_dev(kobj));
> >> + size_t len = vma->vm_end - vma->vm_start;
> >> + struct pci_p2pdma *p2pdma;
> >> + struct percpu_ref *ref;
> >> + unsigned long vaddr;
> >> + void *kaddr;
> >> + int ret;
> >> +
> >> + /* prevent private mappings from being established */
> >> + if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
> >> + pci_info_ratelimited(pdev,
> >> + "%s: fail, attempted private mapping\n",
> >> + current->comm);
> >> + return -EINVAL;
> >> + }
> >> +
> >> + if (vma->vm_pgoff) {
> >> + pci_info_ratelimited(pdev,
> >> + "%s: fail, attempted mapping with non-zero offset\n",
> >> + current->comm);
> >> + return -EINVAL;
> >> + }
> >> +
> >> + rcu_read_lock();
> >> + p2pdma = rcu_dereference(pdev->p2pdma);
> >> + if (!p2pdma) {
> >> + ret = -ENODEV;
> >> + goto out;
> >> + }
> >> +
> >> + kaddr = (void *)gen_pool_alloc_owner(p2pdma->pool, len, (void **)&ref);
> >> + if (!kaddr) {
> >> + ret = -ENOMEM;
> >> + goto out;
> >> + }
> >> +
> >> + /*
> >> + * vm_insert_page() can sleep, so a reference is taken to mapping
> >> + * such that rcu_read_unlock() can be done before inserting the
> >> + * pages
> >> + */
> >> + if (unlikely(!percpu_ref_tryget_live_rcu(ref))) {
> >> + ret = -ENODEV;
> >> + goto out_free_mem;
> >> + }
> >> + rcu_read_unlock();
> >> +
> >> + for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
> >> + ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
> >> + if (ret) {
> >> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> >> + return ret;
> >> + }
> >> + percpu_ref_get(ref);
> >> + put_page(virt_to_page(kaddr));
> >> + kaddr += PAGE_SIZE;
> >> + len -= PAGE_SIZE;
> >> + }
> >> +
> >> + percpu_ref_put(ref);
> >> +
> >> + return 0;
> >> +out_free_mem:
> >> + gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
> >> +out:
> >> + rcu_read_unlock();
> >> + return ret;
> >> +}
> >> +
> >> +static struct bin_attribute p2pmem_alloc_attr = {
> >> + .attr = { .name = "allocate", .mode = 0660 },
> >> + .mmap = p2pmem_alloc_mmap,
> >> + /*
> >> + * Some places where we want to call mmap (ie. python) will check
> >> + * that the file size is greater than the mmap size before allowing
> >> + * the mmap to continue. To work around this, just set the size
> >> + * to be very large.
> >> + */
> >> + .size = SZ_1T,
> >> +};
> >> +
> >> static struct attribute *p2pmem_attrs[] = {
> >> &dev_attr_size.attr,
> >> &dev_attr_available.attr,
> >> @@ -96,11 +180,32 @@ static struct attribute *p2pmem_attrs[] = {
> >> NULL,
> >> };
> >>
> >> +static struct bin_attribute *p2pmem_bin_attrs[] = {
> >> + &p2pmem_alloc_attr,
> >> + NULL,
> >> +};
> >> +
> >> static const struct attribute_group p2pmem_group = {
> >> .attrs = p2pmem_attrs,
> >> + .bin_attrs = p2pmem_bin_attrs,
> >> .name = "p2pmem",
> >> };
> >>
> >> +static void p2pdma_page_free(struct page *page)
> >> +{
> >> + struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
> >> + struct percpu_ref *ref;
> >> +
> >> + gen_pool_free_owner(pgmap->provider->p2pdma->pool,
> >> + (uintptr_t)page_to_virt(page), PAGE_SIZE,
> >> + (void **)&ref);
> >> + percpu_ref_put(ref);
> >> +}
> >> +
> >> +static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
> >> + .page_free = p2pdma_page_free,
> >> +};
> >> +
> >> static void pci_p2pdma_release(void *data)
> >> {
> >> struct pci_dev *pdev = data;
> >> @@ -152,6 +257,19 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
> >> return error;
> >> }
> >>
> >> +static void pci_p2pdma_unmap_mappings(void *data)
> >> +{
> >> + struct pci_dev *pdev = data;
> >> +
> >> + /*
> >> + * Removing the alloc attribute from sysfs will call
> >> + * unmap_mapping_range() on the inode, teardown any existing userspace
> >> + * mappings and prevent new ones from being created.
> >> + */
> >> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
> >> + p2pmem_group.name);
> >
> > Wait, why are you manually removing the sysfs file here? It's part of
> > the group, if you do this then it is gone for forever, right? Why
> > manually do this the sysfs core should handle this for you if the device
> > is removed.
>
> We have to make sure the mappings are all removed before the cleanup of
> devm_memremap_pages() which will wait for all the pages to be freed.

Then don't use devm_ functions. Why not just use the manual functions
instead as you know when you want to tear this down.

> If
> we don't do this any userspace mapping will hang the cleanup until those
> uses are unmapped themselves.

Just do this in the remove call yourself and you should be fine.

thanks,

greg k-h

2022-09-01 18:08:03

by Logan Gunthorpe

Subject: Re: [PATCH v9 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate



On 2022-09-01 10:18, Greg Kroah-Hartman wrote:
> On Thu, Aug 25, 2022 at 09:24:25AM -0600, Logan Gunthorpe wrote:
>> Add documentation for the p2pmem/allocate binary file which allows
>> for allocating p2pmem buffers in userspace for passing to drivers
>> that support them. (Currently only O_DIRECT to NVMe devices.)
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> Documentation/ABI/testing/sysfs-bus-pci | 12 +++++++++++-
>> 1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
>> index 6fc2c2efe8ab..dca5e032b4fa 100644
>> --- a/Documentation/ABI/testing/sysfs-bus-pci
>> +++ b/Documentation/ABI/testing/sysfs-bus-pci
>> @@ -171,7 +171,7 @@ Description:
>> binary file containing the Vital Product Data for the
>> device. It should follow the VPD format defined in
>> PCI Specification 2.1 or 2.2, but users should consider
>> - that some devices may have incorrectly formatted data.
>> + that some devices may have incorrectly formatted data.
>> If the underlying VPD has a writable section then the
>> corresponding section of this file will be writable.
>>
>
> Not relevant change :(

Oops! Will fix.

Thanks,

Logan

2022-09-01 18:32:23

by Logan Gunthorpe

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs




On 2022-09-01 10:42, Greg Kroah-Hartman wrote:
> On Thu, Sep 01, 2022 at 10:32:55AM -0600, Logan Gunthorpe wrote:
>> On 2022-09-01 10:20, Greg Kroah-Hartman wrote:
>>> On Thu, Aug 25, 2022 at 09:24:24AM -0600, Logan Gunthorpe wrote:
>>>> + /*
>>>> + * Removing the alloc attribute from sysfs will call
>>>> + * unmap_mapping_range() on the inode, teardown any existing userspace
>>>> + * mappings and prevent new ones from being created.
>>>> + */
>>>> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
>>>> + p2pmem_group.name);
>>>
>>> Wait, why are you manually removing the sysfs file here? It's part of
>>> the group, if you do this then it is gone for forever, right? Why
>>> manually do this the sysfs core should handle this for you if the device
>>> is removed.
>>
>> We have to make sure the mappings are all removed before the cleanup of
>> devm_memremap_pages() which will wait for all the pages to be freed.
>
> Then don't use devm_ functions. Why not just use the manual functions
> instead as you know when you want to tear this down.

Well, we haven't plugged a remove call into p2pdma; that would be more
work and more interfaces touching the PCI code. Note: this code isn't a
driver but a set of PCI helpers available to other PCI drivers.
Everything that's setup is using the devm interfaces and gets torn down
with the same. So I don't really see the benefit of making the change
you propose.

Logan

2022-09-01 19:02:21

by Greg Kroah-Hartman

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Thu, Sep 01, 2022 at 12:14:25PM -0600, Logan Gunthorpe wrote:
>
>
>
> On 2022-09-01 10:42, Greg Kroah-Hartman wrote:
> > On Thu, Sep 01, 2022 at 10:32:55AM -0600, Logan Gunthorpe wrote:
> >> On 2022-09-01 10:20, Greg Kroah-Hartman wrote:
> >>> On Thu, Aug 25, 2022 at 09:24:24AM -0600, Logan Gunthorpe wrote:
> >>>> + /*
> >>>> + * Removing the alloc attribute from sysfs will call
> >>>> + * unmap_mapping_range() on the inode, teardown any existing userspace
> >>>> + * mappings and prevent new ones from being created.
> >>>> + */
> >>>> + sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
> >>>> + p2pmem_group.name);
> >>>
> >>> Wait, why are you manually removing the sysfs file here? It's part of
> >>> the group, if you do this then it is gone for forever, right? Why
> >>> manually do this the sysfs core should handle this for you if the device
> >>> is removed.
> >>
> >> We have to make sure the mappings are all removed before the cleanup of
> >> devm_memremap_pages() which will wait for all the pages to be freed.
> >
> > Then don't use devm_ functions. Why not just use the manual functions
> > instead as you know when you want to tear this down.
>
> Well we haven't plugged in a remove call into p2pdma, that would be more
> work and more interfaces touching the PCI code. Note: this code isn't a
> driver but a set of PCI helpers available to other PCI drivers.
> Everything that's setup is using the devm interfaces and gets torn down
> with the same. So I don't really see the benefit of making the change
> you propose.

The issue is the classic one with the devm helpers. They do not lend
themselves to resource management issues that require ordering or other
sort of dependencies. Please do not use them here, just put in a remove
callback as you eventually will need it anyway, as you have a strong
requirement for what gets freed when, and the devm api does not provide
for that well.

thanks,

greg k-h

2022-09-01 19:58:14

by Logan Gunthorpe

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs



On 2022-09-01 12:36, Greg Kroah-Hartman wrote:
> On Thu, Sep 01, 2022 at 12:14:25PM -0600, Logan Gunthorpe wrote:
>> Well we haven't plugged in a remove call into p2pdma, that would be more
>> work and more interfaces touching the PCI code. Note: this code isn't a
>> driver but a set of PCI helpers available to other PCI drivers.
>> Everything that's setup is using the devm interfaces and gets torn down
>> with the same. So I don't really see the benefit of making the change
>> you propose.
>
> The issue is the classic one with the devm helpers. They do not lend
> themselves to resource management issues that require ordering or other
> sort of dependencies. Please do not use them here, just put in a remove
> callback as you eventually will need it anyway, as you have a strong
> requirement for what gets freed when, and the devm api does not provide
> for that well.

This surprises me. Can you elaborate on this classic issue?

I've definitely seen uses of devm that expect the calls will be torn
down in the reverse order they are added. The existing p2pdma code will
certainly fail quite significantly if a devm_kzalloc() releases its
memory before devm_memremap_pages() cleans up. There's also already an
action that is used to cleanup before the last devm_kzalloc() call
happens. If ordering is not guaranteed, then devm seems fairly broken
and unusable and I'd have to drop all uses from this code and go back to
the error prone method. Also what's the point of
devm_add_action_or_reset() if it doesn't guarantee the ordering or the
release?
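
Concretely, the ordering being relied on looks like this (a sketch
using the names from this series; devm release callbacks run in the
reverse of registration order on unbind):

    /* registration order in pci_p2pdma_setup()/pci_p2pdma_add_resource() */
    p2pdma = devm_kzalloc(&pdev->dev, sizeof(*p2pdma), GFP_KERNEL);
    addr = devm_memremap_pages(&pdev->dev, pgmap);
    devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings, pdev);

    /*
     * On unbind the reverse runs: the unmap action first, then the
     * memremap_pages teardown, then the kzalloc'd struct is freed.
     */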

But if it's that important I can make the change to these patches for v10.

Logan

2022-09-02 06:34:07

by Greg Kroah-Hartman

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Thu, Sep 01, 2022 at 01:16:54PM -0600, Logan Gunthorpe wrote:
>
>
> On 2022-09-01 12:36, Greg Kroah-Hartman wrote:
> > On Thu, Sep 01, 2022 at 12:14:25PM -0600, Logan Gunthorpe wrote:
> >> Well we haven't plugged in a remove call into p2pdma, that would be more
> >> work and more interfaces touching the PCI code. Note: this code isn't a
> >> driver but a set of PCI helpers available to other PCI drivers.
> >> Everything that's setup is using the devm interfaces and gets torn down
> >> with the same. So I don't really see the benefit of making the change
> >> you propose.
> >
> > The issue is the classic one with the devm helpers. They do not lend
> > themselves to resource management issues that require ordering or other
> > sort of dependencies. Please do not use them here, just put in a remove
> > callback as you eventually will need it anyway, as you have a strong
> > requirement for what gets freed when, and the devm api does not provide
> > for that well.
>
> This surprises me. Can you elaborate on this classic issue?

There's long threads about it on the ksummit discuss mailing list and
other places.

> I've definitely seen uses of devm that expect the calls will be torn
> down in reverse order they are added.

Sorry, I didn't mean to imply the ordering of the devm code is
incorrect, that's fine.

It's when you have things in the devm "chain" that need to be freed in a
different order that stuff gets messy. Like irqs and clocks and other
types of resources that have "actions" associated with them.

> The existing p2pdma code will
> certainly fail quite significantly if a devm_kzalloc() releases its
> memory before the devm_memmap_pages() cleans up. There's also already an
> action that is used to cleanup before the last devm_kzalloc() call
> happens. If ordering is not guaranteed, then devm seems fairly broken
> and unusable and I'd have to drop all uses from this code and go back to
> the error prone method. Also what's the point of
> devm_add_action_or_reset() if it doesn't guarantee the ordering or the
> release?

I have never used devm_add_action_or_reset() so I can't say why it is
there. I am just pointing out that manually messing with a sysfs group
from a driver is a huge flag that something is wrong. A driver should
almost never be touching a raw kobject or calling any sysfs_* call if
all is normal, which is why I questioned this.

> But if it's that important I can make the change to these patches for v10.

Try it the way I suggest, with a remove() callback, and see if that
looks simpler and easier to follow and maintain over time.

thanks,

greg k-h

2022-09-02 19:50:47

by Logan Gunthorpe

Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs



On 2022-09-01 23:53, Greg Kroah-Hartman wrote:
> On Thu, Sep 01, 2022 at 01:16:54PM -0600, Logan Gunthorpe wrote:
>> This surprises me. Can you elaborate on this classic issue?
>
> There's long threads about it on the ksummit discuss mailing list and
> other places.

I've managed to find one such thread dealing with lifetime issues of
different objects and the bugs that commonly result from misusing devm.
I've dealt with similar issues in the past, but as best I can see
there are no lifetime issues in this code.

> I have never used devm_add_action_or_reset() so I can't say why it is
> there. I am just pointing out that manually messing with a sysfs group
> from a driver is a huge flag that something is wrong. A driver should
> almost never be touching a raw kobject or calling any sysfs_* call if
> all is normal, which is why I questioned this.

In this case we need to remove the specific sysfs file to tear down any
VMAs earlier in the remove sequence than would happen normally. Whether
we do that through devm or remove() doesn't change the fact that we need
to access the dev->kobj to do that early.

>> But if it's that important I can make the change to these patches for v10.
>
> Try it the way I suggest, with a remove() callback, and see if that
> looks simpler and easier to follow and maintain over time.

See the diff at the bottom of this email. I can apply it on top of this
patch, but IMO it is neither easier to follow nor maintain. Unless you
have a different suggestion...

Thanks,

Logan

--

diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index a6ed6bbca214..4e1211a2a6cd 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -206,6 +206,23 @@ static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
.page_free = p2pdma_page_free,
};

+void pci_p2pdma_remove(struct pci_dev *pdev)
+{
+ if (!rcu_access_pointer(pdev->p2pdma))
+ return;
+
+ /*
+ * Any userspace mappings must be unmapped before the
+ * devm_memremap_pages() release happens, otherwise a device remove
+ * will hang on any processes that have pages mapped. To avoid this,
+ * remove the alloc attribute from sysfs which will call
+ * unmap_mapping_range() on the inode and teardown any existing
+ * userspace mappings.
+ */
+ sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
+ p2pmem_group.name);
+}
+
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
@@ -257,19 +274,6 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
}

-static void pci_p2pdma_unmap_mappings(void *data)
-{
- struct pci_dev *pdev = data;
-
- /*
- * Removing the alloc attribute from sysfs will call
- * unmap_mapping_range() on the inode, teardown any existing userspace
- * mappings and prevent new ones from being created.
- */
- sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
- p2pmem_group.name);
-}
-
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
@@ -328,11 +332,6 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
goto pgmap_free;
}

- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
- pdev);
- if (error)
- goto pages_free;
-
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 49238ddd39ee..a096f2723eac 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -471,6 +471,8 @@ static void pci_device_remove(struct device *dev)
struct pci_dev *pci_dev = to_pci_dev(dev);
struct pci_driver *drv = pci_dev->driver;

+ pci_p2pdma_remove(pci_dev);
+
if (drv->remove) {
pm_runtime_get_sync(dev);
drv->remove(pci_dev);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 785f31086313..1c5c901a2fcc 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -774,4 +774,12 @@ static inline pci_power_t mid_pci_get_power_state(struct pci_dev *pdev)
}
#endif

+#ifdef CONFIG_PCI_P2PDMA
+void pci_p2pdma_remove(struct pci_dev *dev);
+#else
+static inline void pci_p2pdma_remove(struct pci_dev *dev)
+{
+}
+#endif
+
#endif /* DRIVERS_PCI_H */


2022-09-05 14:40:27

by Christoph Hellwig

Subject: Re: [PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

On Thu, Aug 25, 2022 at 09:24:19AM -0600, Logan Gunthorpe wrote:
> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
> which take a flags argument that is passed to get_user_pages_fast().
>
> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.

Any reason why iov_iter_get_pages2 doesn't call iov_iter_get_pages_flags
and same for the __alloc version?

Also with the "main" version now having the 2 postfix, we can just
use the iov_iter_get_pages name for the flags version, and maybe
eventually just convert all users over to it.
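
The first suggestion would look roughly like this (a sketch of the
proposed refactor, not code from the posted series):

    ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
                                size_t maxsize, unsigned maxpages,
                                size_t *start)
    {
            return iov_iter_get_pages_flags(i, pages, maxsize, maxpages,
                                            start, 0);
    }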

2022-09-05 15:17:20

by Christoph Hellwig

Subject: Re: [PATCH v9 3/8] block: add check when merging zone device pages

On Thu, Aug 25, 2022 at 09:24:20AM -0600, Logan Gunthorpe wrote:
> Consecutive zone device pages should not be merged into the same sgl
> or bvec segment with other types of pages or if they belong to different
> pgmaps. Otherwise getting the pgmap of a given segment is not possible
> without scanning the entire segment. This helper returns true either if
> both pages are not zone device pages or both pages are zone device
> pages with the same pgmap.
>
> Add a helper to determine if zone device pages are mergeable and use
> this helper in page_is_mergeable().
>
> Signed-off-by: Logan Gunthorpe <[email protected]>

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-09-05 15:19:28

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v9 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

On Thu, Aug 25, 2022 at 09:24:23AM -0600, Logan Gunthorpe wrote:
> When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
> iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
> passed from userspace and enables NVMe passthru requests to
> use P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-09-05 15:54:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

On Thu, Aug 25, 2022 at 09:24:22AM -0600, Logan Gunthorpe wrote:
> + if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
> + blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))

bdev->bd_disk is never NULL.
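
That is, the guard could presumably shrink to this (untested):

	if (bio->bi_bdev && blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
		flags |= FOLL_PCI_P2PDMA;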

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2022-09-05 22:48:56

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 1/8] mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages

On 8/25/22 08:24, Logan Gunthorpe wrote:
> GUP Callers that expect PCI P2PDMA pages can now set FOLL_PCI_P2PDMA to
> allow obtaining P2PDMA pages. If GUP is called without the flag and a
> P2PDMA page is found, it will return an error.
>
> FOLL_PCI_P2PDMA cannot be set if FOLL_LONGTERM is set.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> ---
> include/linux/mm.h | 1 +
> mm/gup.c | 22 +++++++++++++++++++++-
> 2 files changed, 22 insertions(+), 1 deletion(-)
>

Looks good. And I see that Dan Williams' upcoming "Fix the DAX-gup
mistake" series will remove the need for most (all?) of the
undo_dev_pagemap() calls that you have to make here, so the end result
will be even simpler.


Reviewed-by: John Hubbard <[email protected]>

thanks,

--
John Hubbard
NVIDIA

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 3bedc449c14d..37a3e91e6e77 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2891,6 +2891,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> #define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
>
> /*
> * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> diff --git a/mm/gup.c b/mm/gup.c
> index 732825157430..79aea452619e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -566,6 +566,12 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> goto out;
> }
>
> + if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
> + is_pci_p2pdma_page(page))) {
> + page = ERR_PTR(-EREMOTEIO);
> + goto out;
> + }
> +
> VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> !PageAnonExclusive(page), page);
>
> @@ -1015,6 +1021,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
> if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
> return -EOPNOTSUPP;
>
> + if ((gup_flags & FOLL_LONGTERM) && (gup_flags & FOLL_PCI_P2PDMA))
> + return -EOPNOTSUPP;
> +
> if (vma_is_secretmem(vma))
> return -EFAULT;
>
> @@ -2359,6 +2368,10 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
> VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
> page = pte_page(pte);
>
> + if (unlikely(!(flags & FOLL_PCI_P2PDMA) &&
> + is_pci_p2pdma_page(page)))
> + goto pte_unmap;
> +
> folio = try_grab_folio(page, 1, flags);
> if (!folio)
> goto pte_unmap;
> @@ -2438,6 +2451,12 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> undo_dev_pagemap(nr, nr_start, flags, pages);
> break;
> }
> +
> + if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
> + undo_dev_pagemap(nr, nr_start, flags, pages);
> + break;
> + }
> +
> SetPageReferenced(page);
> pages[*nr] = page;
> if (unlikely(!try_grab_page(page, flags))) {
> @@ -2926,7 +2945,8 @@ static int internal_get_user_pages_fast(unsigned long start,
>
> if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
> FOLL_FORCE | FOLL_PIN | FOLL_GET |
> - FOLL_FAST_ONLY | FOLL_NOFAULT)))
> + FOLL_FAST_ONLY | FOLL_NOFAULT |
> + FOLL_PCI_P2PDMA)))
> return -EINVAL;
>
> if (gup_flags & FOLL_PIN)


2022-09-05 23:35:45

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()

On 8/25/22 08:24, Logan Gunthorpe wrote:
> Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
> which take a flags argument that is passed to get_user_pages_fast().
>
> This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> include/linux/uio.h | 6 ++++++
> lib/iov_iter.c | 40 ++++++++++++++++++++++++++++++++++++----
> 2 files changed, 42 insertions(+), 4 deletions(-)
>

Yes.

I also don't object to Christoph's additional request to refactor and
rename around iov_iter_get_pages2() and such, but this version here is
perfectly good, so please feel free to add:

Reviewed-by: John Hubbard <[email protected]>

Looking ahead, interestingly enough, it turns out that this approach to
the wrappers is really pretty close to what I posted in patch 4/7 of my
"convert most filesystems to pin_user_pages_fast()" series [1]. Instead
of passing gup_flags through, I tried to avoid that (because FOLL_PIN
is, as a mild design preference, supposed to remain internal to gup.c),
but given that you need to do it, I think I can just build on top of
your approach, and pass in FOLL_PIN via your new gup_flags arg.

Along those lines, I've copied you in on "New topic branch for block +
gup work?", let's see what happens over there.

[1] https://lore.kernel.org/r/[email protected]


thanks,

--
John Hubbard
NVIDIA
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 5896af36199c..76ba69edb8c5 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -247,8 +247,14 @@ void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode
> void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
> void iov_iter_xarray(struct iov_iter *i, unsigned int direction, struct xarray *xarray,
> loff_t start, size_t count);
> +ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start,
> + unsigned int gup_flags);
> ssize_t iov_iter_get_pages2(struct iov_iter *i, struct page **pages,
> size_t maxsize, unsigned maxpages, size_t *start);
> +ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
> + struct page ***pages, size_t maxsize, size_t *start,
> + unsigned int gup_flags);
> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i, struct page ***pages,
> size_t maxsize, size_t *start);
> int iov_iter_npages(const struct iov_iter *i, int maxpages);
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 4b7fce72e3e5..dedb78f3c655 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1427,7 +1427,8 @@ static struct page *first_bvec_segment(const struct iov_iter *i,
>
> static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> - unsigned int maxpages, size_t *start)
> + unsigned int maxpages, size_t *start,
> + unsigned int gup_flags)
> {
> unsigned int n;
>
> @@ -1439,7 +1440,6 @@ static ssize_t __iov_iter_get_pages_alloc(struct iov_iter *i,
> maxsize = MAX_RW_COUNT;
>
> if (likely(user_backed_iter(i))) {
> - unsigned int gup_flags = 0;
> unsigned long addr;
> int res;
>
> @@ -1497,10 +1497,24 @@ ssize_t iov_iter_get_pages2(struct iov_iter *i,
> return 0;
> BUG_ON(!pages);
>
> - return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages, start);
> + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
> + start, 0);
> }
> EXPORT_SYMBOL(iov_iter_get_pages2);
>
> +ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
> + size_t maxsize, unsigned maxpages, size_t *start,
> + unsigned int gup_flags)
> +{
> + if (!maxpages)
> + return 0;
> + BUG_ON(!pages);
> +
> + return __iov_iter_get_pages_alloc(i, &pages, maxsize, maxpages,
> + start, gup_flags);
> +}
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages_flags);
> +
> ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> struct page ***pages, size_t maxsize,
> size_t *start)
> @@ -1509,7 +1523,7 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
>
> *pages = NULL;
>
> - len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start);
> + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start, 0);
> if (len <= 0) {
> kvfree(*pages);
> *pages = NULL;
> @@ -1518,6 +1532,24 @@ ssize_t iov_iter_get_pages_alloc2(struct iov_iter *i,
> }
> EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
>
> +ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
> + struct page ***pages, size_t maxsize,
> + size_t *start, unsigned int gup_flags)
> +{
> + ssize_t len;
> +
> + *pages = NULL;
> +
> + len = __iov_iter_get_pages_alloc(i, pages, maxsize, ~0U, start,
> + gup_flags);
> + if (len <= 0) {
> + kvfree(*pages);
> + *pages = NULL;
> + }
> + return len;
> +}
> +EXPORT_SYMBOL_GPL(iov_iter_get_pages_alloc_flags);
> +
> size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
> struct iov_iter *i)
> {


2022-09-06 00:09:13

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 3/8] block: add check when merging zone device pages

On 8/25/22 08:24, Logan Gunthorpe wrote:
> Consecutive zone device pages should not be merged into the same sgl
> or bvec segment with other types of pages or if they belong to different
> pgmaps. Otherwise getting the pgmap of a given segment is not possible
> without scanning the entire segment. This helper returns true if either
> both pages are not zone device pages or both are zone device pages
> with the same pgmap.
>
> Add a helper to determine if zone device pages are mergeable and use
> this helper in page_is_mergeable().
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> block/bio.c | 2 ++
> include/linux/mmzone.h | 24 ++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 3d3a2678fea2..969607bc1f4d 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -865,6 +865,8 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
> return false;
> if (xen_domain() && !xen_biovec_phys_mergeable(bv, page))
> return false;
> + if (!zone_device_pages_have_same_pgmap(bv->bv_page, page))
> + return false;
>
> *same_page = ((vec_end_addr & PAGE_MASK) == page_addr);
> if (*same_page)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e24b40c52468..2c31915b057e 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -794,6 +794,25 @@ static inline bool is_zone_device_page(const struct page *page)
> {
> return page_zonenum(page) == ZONE_DEVICE;
> }
> +
> +/*
> + * Consecutive zone device pages should not be merged into the same sgl
> + * or bvec segment with other types of pages or if they belong to different
> + * pgmaps. Otherwise getting the pgmap of a given segment is not possible
> + * without scanning the entire segment. This helper returns true if either
> + * both pages are not zone device pages or both are zone device pages
> + * with the same pgmap.
> + */

Nice work on the documentation here. Approach looks correct.


Reviewed-by: John Hubbard <[email protected]>

thanks,

--
John Hubbard
NVIDIA

> +static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
> + const struct page *b)
> +{
> + if (is_zone_device_page(a) != is_zone_device_page(b))
> + return false;
> + if (!is_zone_device_page(a))
> + return true;
> + return a->pgmap == b->pgmap;
> +}
> +
> extern void memmap_init_zone_device(struct zone *, unsigned long,
> unsigned long, struct dev_pagemap *);
> #else
> @@ -801,6 +820,11 @@ static inline bool is_zone_device_page(const struct page *page)
> {
> return false;
> }
> +static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
> + const struct page *b)
> +{
> + return true;
> +}
> #endif
>
> static inline bool folio_is_zone_device(const struct folio *folio)



2022-09-06 01:27:00

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 6/8] block: set FOLL_PCI_P2PDMA in bio_map_user_iov()

On 8/25/22 08:24, Logan Gunthorpe wrote:
> When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
> iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be
> passed from userspace and enables NVMe passthru requests to
> use P2PDMA pages.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> block/blk-map.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-map.c b/block/blk-map.c
> index 7196a6b64c80..1378f49ca5ca 100644
> --- a/block/blk-map.c
> +++ b/block/blk-map.c
> @@ -236,6 +236,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> {
> unsigned int max_sectors = queue_max_hw_sectors(rq->q);
> unsigned int nr_vecs = iov_iter_npages(iter, BIO_MAX_VECS);
> + unsigned int flags = 0;

A small thing, but I'd also like to name that one gup_flags, instead of flags.

> struct bio *bio;
> int ret;
> int j;
> @@ -248,13 +249,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
> return -ENOMEM;
> bio_init(bio, NULL, bio->bi_inline_vecs, nr_vecs, req_op(rq));
>
> + if (blk_queue_pci_p2pdma(rq->q))
> + flags |= FOLL_PCI_P2PDMA;
> +
> while (iov_iter_count(iter)) {
> struct page **pages;
> ssize_t bytes;
> size_t offs, added = 0;
> int npages;
>
> - bytes = iov_iter_get_pages_alloc2(iter, &pages, LONG_MAX, &offs);
> + bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
> + &offs, flags);
> if (unlikely(bytes <= 0)) {
> ret = bytes ? bytes : -EFAULT;
> goto out_unmap;

Looks good, please feel free to add:

Reviewed-by: John Hubbard <[email protected]>


thanks,

--
John Hubbard
NVIDIA

2022-09-06 01:30:45

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 5/8] block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()

On 8/25/22 08:24, Logan Gunthorpe wrote:
> When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
> iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
> from userspace and enables the O_DIRECT path in iomap-based filesystems
> and in direct I/O to block devices.

Oh great, more O_DIRECT code paths. :)

>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> block/bio.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index 969607bc1f4d..ca09d79a0683 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1200,6 +1200,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
> struct page **pages = (struct page **)bv;
> + unsigned int flags = 0;

It would be nice to name this gup_flags, instead.

> ssize_t size, left;
> unsigned len, i = 0;
> size_t offset, trim;
> @@ -1213,6 +1214,10 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
> pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
>
> + if (bio->bi_bdev && bio->bi_bdev->bd_disk &&
> + blk_queue_pci_p2pdma(bio->bi_bdev->bd_disk->queue))
> + flags |= FOLL_PCI_P2PDMA;
> +
> /*
> * Each segment in the iov is required to be a block size multiple.
> * However, we may not be able to get the entire segment if it spans
> @@ -1220,8 +1225,9 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
> * result to ensure the bio's total size is correct. The remainder of
> * the iov data will be picked up in the next bio iteration.
> */
> - size = iov_iter_get_pages2(iter, pages, UINT_MAX - bio->bi_iter.bi_size,
> - nr_pages, &offset);
> + size = iov_iter_get_pages_flags(iter, pages,
> + UINT_MAX - bio->bi_iter.bi_size,
> + nr_pages, &offset, flags);
> if (unlikely(size <= 0))
> return size ? size : -EFAULT;
>

Looks good. After applying Christoph's tweak, and optionally the gup_flags
rename as well, please feel free to add:

Reviewed-by: John Hubbard <[email protected]>


thanks,

--
John Hubbard
NVIDIA

2022-09-06 01:32:47

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v9 8/8] ABI: sysfs-bus-pci: add documentation for p2pmem allocate

On 8/25/22 08:24, Logan Gunthorpe wrote:
> Add documentation for the p2pmem/allocate binary file which allows
> allocating p2pmem buffers in userspace to pass to drivers
> that support them. (Currently only O_DIRECT to NVMe devices.)
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-bus-pci | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
> index 6fc2c2efe8ab..dca5e032b4fa 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci
> +++ b/Documentation/ABI/testing/sysfs-bus-pci
> @@ -171,7 +171,7 @@ Description:
> binary file containing the Vital Product Data for the
> device. It should follow the VPD format defined in
> PCI Specification 2.1 or 2.2, but users should consider
> - that some devices may have incorrectly formatted data.
> + that some devices may have incorrectly formatted data.
> If the underlying VPD has a writable section then the
> corresponding section of this file will be writable.
>
> @@ -407,6 +407,16 @@ Description:
> file contains a '1' if the memory has been published for
> use outside the driver that owns the device.
>
> +What: /sys/bus/pci/devices/.../p2pmem/allocate
> +Date: August 2022
> +Contact: Logan Gunthorpe <[email protected]>
> +Description:
> + Thes attribute allows mapping p2pmem into userspace. For

s/Thes/This/

Actually, the sentence goes on to say "file", which refers to the same
thing as "attribute". So maybe just saying "This attribute" would be
slightly clearer? So:

This file allows mapping p2pmem into userspace. For


With the typo fix and optionally the "attribute" change (up to you),
please feel free to add:

Reviewed-by: John Hubbard <[email protected]>


thanks,

--
John Hubbard
NVIDIA

> + each mmap() call on this file, the kernel will allocate
> + a chunk of Peer-to-Peer memory for use in Peer-to-Peer
> + transactions. This memory can be used in O_DIRECT calls
> + to NVMe backed files for Peer-to-Peer copies.
> +
> What: /sys/bus/pci/devices/.../link/clkpm
> /sys/bus/pci/devices/.../link/l0s_aspm
> /sys/bus/pci/devices/.../link/l1_aspm
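
For anyone wanting to try this out, here is a minimal userspace sketch.
The device path, file path and size below are made-up examples, and
error handling is omitted:

#define _GNU_SOURCE	/* for O_DIRECT */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* hypothetical CMB-capable NVMe device; substitute your own */
	int afd = open("/sys/bus/pci/devices/0000:03:00.0/p2pmem/allocate",
		       O_RDWR);
	/* each mmap() call allocates a chunk of p2pmem from the device */
	void *p2p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
			 afd, 0);

	/* the VMA may only be used for O_DIRECT I/O on NVMe-backed files */
	int fd = open("/mnt/nvme/data", O_WRONLY | O_DIRECT);
	ssize_t n = pwrite(fd, p2p, 4096, 0);

	munmap(p2p, 4096);
	close(fd);
	close(afd);
	return n == 4096 ? 0 : 1;
}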

2022-09-06 18:06:08

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v9 2/8] iov_iter: introduce iov_iter_get_pages_[alloc_]flags()



On 2022-09-05 17:21, John Hubbard wrote:
> Looking ahead, interestingly enough, it turns out that this approach to
> the wrappers is really pretty close to what I posted in patch 4/7 of my
> "convert most filesystems to pin_user_pages_fast()" series [1]. Instead
> of passing gup_flags through, I tried to avoid that (because FOLL_PIN
> is, as a mild design preference, supposed to remain internal to gup.c),
> but given that you need to do it, I think I can just build on top of
> your approach, and pass in FOLL_PIN via your new gup_flags arg.

Seems to me like we could just provide the one exported function with a
flag and then use a static inline helper for the pin version. That would
help keep the FOLL_PIN flag contained, as you'd like.
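
Roughly like this sketch (iov_iter_pin_pages is a made-up name, and it
assumes FOLL_PIN is visible wherever the helper ends up living):

static inline ssize_t iov_iter_pin_pages(struct iov_iter *i,
		struct page **pages, size_t maxsize, unsigned maxpages,
		size_t *start)
{
	/* the pin variant is just the exported flags function plus FOLL_PIN */
	return iov_iter_get_pages_flags(i, pages, maxsize, maxpages,
					start, FOLL_PIN);
}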

> Along those lines, I've copied you in on "New topic branch for block +
> gup work?", let's see what happens over there.
>
> [1] https://lore.kernel.org/r/[email protected]

Thanks for the review. All the feedback makes sense. I'll apply all the
changes from you and Christoph for the next version.

I'm happy to base the next version off of any other changes to the iov
functions. I suspect my patches can hold off a cycle or two more while
we do it. Let me know if you'd like some help making the conversions;
I'm happy to do some of that work myself.

Thanks,

Logan

2022-09-20 07:32:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Fri, Sep 02, 2022 at 12:46:54PM -0600, Logan Gunthorpe wrote:
> See the diff at the bottom of this email. I can apply it on top of this
> patch, but IMO it is neither easier to follow nor maintain. Unless you
> have a different suggestion...

Greg, can you chime in on this? Besides this item we just have a few
cosmetic bits left I think, and I'd really like to get the series into
this merge window.

2022-09-22 08:55:08

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs

On Tue, Sep 20, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
> On Fri, Sep 02, 2022 at 12:46:54PM -0600, Logan Gunthorpe wrote:
> > See the diff at the bottom of this email. I can apply it on top of this
> > patch, but IMO it is neither easier to follow nor maintain. Unless you
> > have a different suggestion...
>
> Greg, can you chime in on this? Besides this item we just have a few
> cosmetic bits left I think, and I'd really like to get the series into
> this merge window.
>

I don't seem to have this in my inbox at all anymore, sorry.

The original should be fine, Logan, thanks for trying to split it out a
bit more. So this can be taken as-is for 6.1-rc1.

thanks,

greg k-h

2022-09-22 15:36:34

by Logan Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v9 7/8] PCI/P2PDMA: Allow userspace VMA allocations through sysfs



On 2022-09-22 02:38, Greg Kroah-Hartman wrote:
> On Tue, Sep 20, 2022 at 08:46:13AM +0200, Christoph Hellwig wrote:
>> On Fri, Sep 02, 2022 at 12:46:54PM -0600, Logan Gunthorpe wrote:
>>> See the diff at the bottom of this email. I can apply it on top of this
>>> patch, but IMO it is neither easier to follow nor maintain. Unless you
>>> have a different suggestion...
>>
>> Greg, can you chime in on this? Besides this item we just have a few
>> cosmetic bits left I think, and I'd really like to get the series into
>> this merge window.
>>
>
> I don't seem to have this in my inbox at all anymore, sorry.
>
> The original should be fine, Logan, thanks for trying to split it out a
> bit more. So this can be taken as-is for 6.1-rc1.

Thanks Greg,

I'll send a v10 with changes from the other feedback later today.

Logan