Hi,
This patchset continues my work to add userspace P2PDMA access using
O_DIRECT NVMe devices. This posting contains some minor fixes and a
rebase onto v5.18-rc1 which contains cleanup from Christoph around
free_zone_device_page() that helps to enable this patchset. The
previous posting was here[1].
The patchset enables userspace P2PDMA by allowing userspace to mmap()
allocated chunks of the CMB. The resulting VMA can be passed only
to O_DIRECT IO on NVMe backed files or block devices. A flag is added
to GUP() in Patch <>, then Patches <> through <> wire this flag up based
on whether the block queue indicates P2PDMA support. Patches <>
through <> enable the CMB to be mapped into userspace by mmapping
the nvme char device.
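As a rough illustration, userspace consumption looks something like the
following minimal sketch (device paths and sizes are illustrative only,
not taken from this series):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2 * 1024 * 1024;
          void *buf;
          int cfd, dfd;

          /* the CMB is exposed through the controller's char device */
          cfd = open("/dev/nvme0", O_RDWR);
          if (cfd < 0)
                  return 1;

          /* the mapping must be MAP_SHARED with a zero offset */
          buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     cfd, 0);
          if (buf == MAP_FAILED)
                  return 1;

          /* the VMA may only be used for O_DIRECT IO on NVMe */
          dfd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
          if (dfd < 0)
                  return 1;

          /* data moves peer-to-peer between the drive and the CMB */
          if (pread(dfd, buf, len, 0) < 0)
                  perror("pread");

          munmap(buf, len);
          close(dfd);
          close(cfd);
          return 0;
  }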
This is relatively straightforward; however, the one significant
problem is that, presently, pci_p2pdma_map_sg() requires a homogeneous
SGL with all P2PDMA pages or all regular pages. Enhancing GUP to
support enforcing this rule would require a huge hack that I don't
expect would be all that palatable. So the first 13 patches add
support for P2PDMA pages to dma_map_sg[table]() to the dma-direct
and dma-iommu implementations. Thus systems without an IOMMU plus
Intel and AMD IOMMUs are supported. (Other IOMMU implementations would
then be unsupported, notably ARM and PowerPC but support would be added
when they convert to dma-iommu).
dma_map_sgtable() is preferred when dealing with P2PDMA memory as it
will return -EREMOTEIO when the DMA device cannot map specific P2PDMA
pages based on the existing rules in calc_map_type_and_dist().
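A sketch of what a consumer looks like (example_map_request() is a
made-up name; only dma_map_sgtable() and the -EREMOTEIO convention come
from this series):

  #include <linux/dma-mapping.h>

  static int example_map_request(struct device *dev, struct sg_table *sgt)
  {
          int ret;

          ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);
          if (ret) {
                  /*
                   * -EREMOTEIO means the sgtable contains P2PDMA pages
                   * this device cannot map; fail rather than retry.
                   */
                  if (ret == -EREMOTEIO)
                          dev_dbg(dev, "unmappable P2PDMA pages\n");
                  return ret;
          }

          /* ... issue the transfer, then dma_unmap_sgtable() ... */
          return 0;
  }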
The other issue is dma_unmap_sg() needs a flag to determine whether a
given dma_addr_t was mapped regularly or as a PCI bus address. To allow
this, a third flag is added to the page_link field in struct
scatterlist. This effectively means support for P2PDMA will now depend
on CONFIG_64BIT.
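For illustration, an unmap loop using the new flag helpers would look
like the sketch below, mirroring what the dma-direct unmap path in this
series does (example_unmap() is a made-up name):

  #include <linux/dma-mapping.h>
  #include <linux/scatterlist.h>

  static void example_unmap(struct device *dev, struct scatterlist *sgl,
                            int nents, enum dma_data_direction dir)
  {
          struct scatterlist *sg;
          int i;

          for_each_sg(sgl, sg, nents, i) {
                  if (sg_is_dma_bus_address(sg))
                          /* a PCI bus address: nothing was mapped */
                          sg_dma_unmark_bus_address(sg);
                  else
                          dma_unmap_page(dev, sg_dma_address(sg),
                                         sg_dma_len(sg), dir);
          }
  }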
Feedback welcome.
This series is based on v5.18-rc1. A git branch is available here:
https://github.com/sbates130272/linux-p2pmem/ p2pdma_user_cmb_v6
Thanks,
Logan
[1] lkml.kernel.org/r/[email protected]
--
Changes since v5:
- Rebased onto v5.18-rc1 which includes Christoph's cleanup to
free_zone_device_page() (similar to Ralph's patch).
- Fix bug with concurrent first calls to pci_p2pdma_vma_fault()
that caused a double allocation and lost p2p memory. Noticed
by Andrew Maier.
- Collected a Reviewed-by tag from Chaitanya.
- Numerous minor fixes to commit messages
Changes since v4:
- Rebase onto v5.17-rc1.
- Included Ralph Campbell's patches which remove the ZONE_DEVICE page
reference count offset. This is just to demonstrate that this
series is compatible with that direction.
- Added a comment in pci_p2pdma_map_sg_attrs(), per Chaitanya and
included his Reviewed-by tags.
- Patch 1 in the last series which cleaned up scatterlist.h
has been upstreamed.
- Dropped NEED_SG_DMA_BUS_ADDR_FLAG seeing 'depends on' doesn't
work with selected symbols, per Christoph.
- Switched iov_iter_get_pages_[alloc_]flags to be exported with
EXPORT_SYMBOL_GPL, per Christoph.
- Renamed zone_device_pages_are_mergeable() to
zone_device_pages_have_same_pgmap(), per Christoph.
- Renamed .mmap_file_open operation in nvme_ctrl_ops to
cdev_file_open(), per Christoph.
Changes since v3:
- Add some comment and commit message cleanups I had missed for v3,
also moved the prototypes for some of the p2pdma helpers to
dma-map-ops.h (which I missed in v3 and was suggested in v2).
- Add separate cleanup patch for scatterlist.h and change the macros
to functions. (Suggested by Chaitanya and Jason, respectively)
- Rename sg_dma_mark_pci_p2pdma() and sg_is_dma_pci_p2pdma() to
sg_dma_mark_bus_address() and sg_is_dma_bus_address() which
is a more generic name (As requested by Jason)
- Fixes to some comments and commit messages as suggested by Bjorn
and Jason.
- Ensure swiotlb is not used with P2PDMA pages. (Per Jason)
- The sgtable conversion in RDMA was split out and sent upstream
separately; the new patch is only the removal. (Per Jason)
- Moved the FOLL_PCI_P2PDMA check outside of get_dev_pagemap() as
Jason suggested this will be removed in the near term.
- Add two patches to ensure that zone device pages with different
pgmaps are never merged in the block layer or
sg_alloc_append_table_from_pages() (Per Jason)
- Ensure synchronize_rcu() or call_rcu() is used before returning
pages to the genalloc. (Jason pointed out that pages are not
guaranteed to be unused on all architectures until at least
after an RCU grace period, and that synchronize_rcu() was likely
too slow to use in the vma close operation.)
- Collected Acks and Reviews by Bjorn, Jason and Max.
Logan Gunthorpe (21):
lib/scatterlist: add flag for indicating P2PDMA segments in an SGL
PCI/P2PDMA: Attempt to set map_type if it has not been set
PCI/P2PDMA: Expose pci_p2pdma_map_type()
PCI/P2PDMA: Introduce helpers for dma_map_sg implementations
dma-mapping: allow EREMOTEIO return code for P2PDMA transfers
dma-direct: support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: add flags to dma_map_ops to indicate PCI P2PDMA support
iommu/dma: support PCI P2PDMA pages in dma-iommu map_sg
nvme-pci: check DMA ops when indicating support for PCI P2PDMA
nvme-pci: convert to using dma_map_sgtable()
RDMA/core: introduce ib_dma_pci_p2p_dma_supported()
RDMA/rw: drop pci_p2pdma_[un]map_sg()
PCI/P2PDMA: Remove pci_p2pdma_[un]map_sg()
mm: introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
iov_iter: introduce iov_iter_get_pages_[alloc_]flags()
block: add check when merging zone device pages
lib/scatterlist: add check when merging zone device pages
block: set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
block: set FOLL_PCI_P2PDMA in bio_map_user_iov()
PCI/P2PDMA: Introduce pci_mmap_p2pmem()
nvme-pci: allow mmaping the CMB in userspace
block/bio.c | 10 +-
block/blk-map.c | 7 +-
drivers/infiniband/core/rw.c | 45 +---
drivers/iommu/dma-iommu.c | 68 ++++-
drivers/nvme/host/core.c | 18 +-
drivers/nvme/host/nvme.h | 4 +-
drivers/nvme/host/pci.c | 97 ++++----
drivers/nvme/target/rdma.c | 2 +-
drivers/pci/Kconfig | 5 +
drivers/pci/p2pdma.c | 467 ++++++++++++++++++++++++++++++-----
include/linux/dma-map-ops.h | 76 ++++++
include/linux/dma-mapping.h | 5 +
include/linux/mm.h | 24 ++
include/linux/pci-p2pdma.h | 38 +--
include/linux/scatterlist.h | 44 +++-
include/linux/uio.h | 6 +
include/rdma/ib_verbs.h | 11 +
include/uapi/linux/magic.h | 1 +
kernel/dma/direct.c | 43 +++-
kernel/dma/direct.h | 8 +-
kernel/dma/mapping.c | 22 +-
lib/iov_iter.c | 25 +-
lib/scatterlist.c | 25 +-
mm/gup.c | 22 +-
24 files changed, 867 insertions(+), 206 deletions(-)
base-commit: 3123109284176b1532874591f7c81f3837bbdc17
--
2.30.2
Consecutive zone device pages should not be merged into the same sgl
or bvec segment with other types of pages or if they belong to different
pgmaps. Otherwise getting the pgmap of a given segment is not possible
without scanning the entire segment. The zone_device_pages_have_same_pgmap()
helper returns true if both pages are not zone device pages or if both
pages are zone device pages with the same pgmap.
Factor out the check for page mergeability into a pages_are_mergeable()
helper and add a check with zone_device_pages_have_same_pgmap().
Signed-off-by: Logan Gunthorpe <[email protected]>
---
lib/scatterlist.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index d5e82e4a57ad..af53a0984f76 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -410,6 +410,15 @@ static struct scatterlist *get_next_sg(struct sg_append_table *table,
return new_sg;
}
+static bool pages_are_mergeable(struct page *a, struct page *b)
+{
+ if (page_to_pfn(a) != page_to_pfn(b) + 1)
+ return false;
+ if (!zone_device_pages_have_same_pgmap(a, b))
+ return false;
+ return true;
+}
+
/**
* sg_alloc_append_table_from_pages - Allocate and initialize an append sg
* table from an array of pages
@@ -447,6 +456,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
unsigned int chunks, cur_page, seg_len, i, prv_len = 0;
unsigned int added_nents = 0;
struct scatterlist *s = sgt_append->prv;
+ struct page *last_pg;
/*
* The algorithm below requires max_segment to be aligned to PAGE_SIZE
@@ -460,21 +470,17 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
return -EOPNOTSUPP;
if (sgt_append->prv) {
- unsigned long paddr =
- (page_to_pfn(sg_page(sgt_append->prv)) * PAGE_SIZE +
- sgt_append->prv->offset + sgt_append->prv->length) /
- PAGE_SIZE;
-
if (WARN_ON(offset))
return -EINVAL;
/* Merge contiguous pages into the last SG */
prv_len = sgt_append->prv->length;
- while (n_pages && page_to_pfn(pages[0]) == paddr) {
+ last_pg = sg_page(sgt_append->prv);
+ while (n_pages && pages_are_mergeable(last_pg, pages[0])) {
if (sgt_append->prv->length + PAGE_SIZE > max_segment)
break;
sgt_append->prv->length += PAGE_SIZE;
- paddr++;
+ last_pg = pages[0];
pages++;
n_pages--;
}
@@ -488,7 +494,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
for (i = 1; i < n_pages; i++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
- page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1) {
+ !pages_are_mergeable(pages[i], pages[i - 1])) {
chunks++;
seg_len = 0;
}
@@ -504,8 +510,7 @@ int sg_alloc_append_table_from_pages(struct sg_append_table *sgt_append,
for (j = cur_page + 1; j < n_pages; j++) {
seg_len += PAGE_SIZE;
if (seg_len >= max_segment ||
- page_to_pfn(pages[j]) !=
- page_to_pfn(pages[j - 1]) + 1)
+ !pages_are_mergeable(pages[j], pages[j - 1]))
break;
}
--
2.30.2
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.
Pages are allocated from the genalloc in bulk with their reference
count set to one. They are returned to the genalloc when each page is
put through p2pdma_page_free() (the reference count is once again set
to one in free_zone_device_page()).
The VMA does not take a reference to the pages when they are inserted
with vmf_insert_mixed() (which is necessary for zone device pages), so
the backing P2P memory is stored in a structure pointed to by
vm_private_data.
A pseudo mount is used to allocate an inode for each PCI device. The
inode's address_space is used by the file doing the mmap() so that all
VMAs are collected and can be unmapped if the PCI device is unbound.
After unmapping, the VMAs are iterated through and their pages are
put so the device can continue to be unbound. An active flag is used
to signal to VMAs not to allocate any further P2P memory once the
removal process starts. The flag is synchronized against concurrent
access using RCU.
The VMAs and inode will survive after the unbind of the device, but no
pages will be present in the VMA and a subsequent access will result
in a SIGBUS error.
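For context, hooking this up in a driver's char device looks roughly
like the sketch below (the nvme patch later in this series does the
equivalent; example_to_pdev() is a made-up helper):

  #include <linux/fs.h>
  #include <linux/mm.h>
  #include <linux/pci-p2pdma.h>

  static int example_open(struct inode *inode, struct file *file)
  {
          struct pci_dev *pdev = example_to_pdev(inode);

          /* route mappings through the p2pdma inode so they can be
           * torn down on unbind */
          pci_p2pdma_file_open(pdev, file);
          return 0;
  }

  static int example_mmap(struct file *file, struct vm_area_struct *vma)
  {
          struct pci_dev *pdev = example_to_pdev(file->f_inode);

          return pci_mmap_p2pmem(pdev, vma);
  }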
Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Bjorn Helgaas <[email protected]>
---
drivers/pci/p2pdma.c | 340 ++++++++++++++++++++++++++++++++++++-
include/linux/pci-p2pdma.h | 11 ++
include/uapi/linux/magic.h | 1 +
3 files changed, 350 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 4d3cab9da748..cce4c7b6dd75 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -17,14 +17,19 @@
#include <linux/genalloc.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
+#include <linux/pfn_t.h>
+#include <linux/pseudo_fs.h>
#include <linux/random.h>
#include <linux/seq_buf.h>
#include <linux/xarray.h>
+#include <uapi/linux/magic.h>
struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
+ struct inode *inode;
+ bool active;
};
struct pci_p2pdma_pagemap {
@@ -33,6 +38,17 @@ struct pci_p2pdma_pagemap {
u64 bus_offset;
};
+struct pci_p2pdma_map {
+ struct kref ref;
+ struct rcu_head rcu;
+ struct pci_dev *pdev;
+ struct inode *inode;
+ size_t len;
+
+ spinlock_t kaddr_lock;
+ void *kaddr;
+};
+
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
{
return container_of(pgmap, struct pci_p2pdma_pagemap, pgmap);
@@ -101,6 +117,38 @@ static const struct attribute_group p2pmem_group = {
.name = "p2pmem",
};
+/*
+ * P2PDMA internal mount
+ * Fake an internal VFS mount-point in order to allocate struct address_space
+ * mappings to remove VMAs on unbind events.
+ */
+static int pci_p2pdma_fs_cnt;
+static struct vfsmount *pci_p2pdma_fs_mnt;
+
+static int pci_p2pdma_fs_init_fs_context(struct fs_context *fc)
+{
+ return init_pseudo(fc, P2PDMA_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type pci_p2pdma_fs_type = {
+ .name = "p2dma",
+ .owner = THIS_MODULE,
+ .init_fs_context = pci_p2pdma_fs_init_fs_context,
+ .kill_sb = kill_anon_super,
+};
+
+static void p2pdma_page_free(struct page *page)
+{
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+
+ gen_pool_free(pgmap->provider->p2pdma->pool,
+ (uintptr_t)page_to_virt(page), PAGE_SIZE);
+}
+
+static const struct dev_pagemap_ops p2pdma_pgmap_ops = {
+ .page_free = p2pdma_page_free,
+};
+
static void pci_p2pdma_release(void *data)
{
struct pci_dev *pdev = data;
@@ -117,6 +165,9 @@ static void pci_p2pdma_release(void *data)
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
+
+ iput(p2pdma->inode);
+ simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
}
static int pci_p2pdma_setup(struct pci_dev *pdev)
@@ -134,17 +185,32 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
if (!p2p->pool)
goto out;
- error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ error = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
if (error)
goto out_pool_destroy;
+ p2p->inode = alloc_anon_inode(pci_p2pdma_fs_mnt->mnt_sb);
+ if (IS_ERR(p2p->inode)) {
+ error = -ENOMEM;
+ goto out_unpin_fs;
+ }
+
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
+ if (error)
+ goto out_put_inode;
+
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
- goto out_pool_destroy;
+ goto out_put_inode;
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
+out_put_inode:
+ iput(p2p->inode);
+out_unpin_fs:
+ simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
out_pool_destroy:
gen_pool_destroy(p2p->pool);
out:
@@ -152,6 +218,54 @@ static int pci_p2pdma_setup(struct pci_dev *pdev)
return error;
}
+static void pci_p2pdma_map_free_pages(struct pci_p2pdma_map *pmap)
+{
+ int i;
+
+ if (!pmap->kaddr)
+ return;
+
+ for (i = 0; i < pmap->len; i += PAGE_SIZE)
+ put_page(virt_to_page(pmap->kaddr + i));
+
+ pmap->kaddr = NULL;
+}
+
+static void pci_p2pdma_free_mappings(struct address_space *mapping)
+{
+ struct vm_area_struct *vma;
+
+ i_mmap_lock_write(mapping);
+ if (RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
+ goto out;
+
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, -1)
+ pci_p2pdma_map_free_pages(vma->vm_private_data);
+
+out:
+ i_mmap_unlock_write(mapping);
+}
+
+static void pci_p2pdma_unmap_mappings(void *data)
+{
+ struct pci_dev *pdev = data;
+ struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
+
+ /* Ensure no new pages can be allocated in mappings */
+ p2pdma->active = false;
+ synchronize_rcu();
+
+ unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
+
+ /*
+ * On some architectures, TLB flushes are done with call_rcu()
+ * so to ensure GUP fast is done with the pages, call synchronize_rcu()
+ * before freeing them.
+ */
+ synchronize_rcu();
+ pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
+}
+
/**
* pci_p2pdma_add_resource - add memory for use as p2p memory
* @pdev: the device to add the memory to
@@ -198,6 +312,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->range.end = pgmap->range.start + size - 1;
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
+ pgmap->ops = &p2pdma_pgmap_ops;
p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
@@ -209,6 +324,11 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
goto pgmap_free;
}
+ error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
+ pdev);
+ if (error)
+ goto pages_free;
+
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
error = gen_pool_add_owner(p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
@@ -217,6 +337,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
if (error)
goto pages_free;
+ p2pdma->active = true;
pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
pgmap->range.start, pgmap->range.end);
@@ -1018,3 +1139,218 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
return sprintf(page, "%s\n", pci_name(p2p_dev));
}
EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+static struct pci_p2pdma_map *pci_p2pdma_map_alloc(struct pci_dev *pdev,
+ size_t len)
+{
+ struct pci_p2pdma_map *pmap;
+
+ pmap = kzalloc(sizeof(*pmap), GFP_KERNEL);
+ if (!pmap)
+ return NULL;
+
+ kref_init(&pmap->ref);
+ pmap->pdev = pci_dev_get(pdev);
+ pmap->len = len;
+
+ return pmap;
+}
+
+static void pci_p2pdma_map_free(struct rcu_head *rcu)
+{
+ struct pci_p2pdma_map *pmap =
+ container_of(rcu, struct pci_p2pdma_map, rcu);
+
+ pci_p2pdma_map_free_pages(pmap);
+ kfree(pmap);
+}
+
+static void pci_p2pdma_map_release(struct kref *ref)
+{
+ struct pci_p2pdma_map *pmap =
+ container_of(ref, struct pci_p2pdma_map, ref);
+
+ iput(pmap->inode);
+ simple_release_fs(&pci_p2pdma_fs_mnt, &pci_p2pdma_fs_cnt);
+ pci_dev_put(pmap->pdev);
+
+ if (pmap->kaddr) {
+ /*
+ * Make sure to wait for the TLB flush (which some
+ * architectures do using call_rcu()) before returning the
+ * pages to the genalloc. This ensures the pages are not reused
+ * before GUP-fast is finished with them. So the mapping is
+ * freed using call_rcu() seeing adding synchronize_rcu() to
+ * the munmap path can cause long delays on large systems
+ * during process cleanup.
+ */
+ call_rcu(&pmap->rcu, pci_p2pdma_map_free);
+ return;
+ }
+
+ /*
+ * If there are no pages, just free the object immediately. There
+ * are no more references to it so there is nothing that can race
+ * with adding the pages.
+ */
+ pci_p2pdma_map_free(&pmap->rcu);
+}
+
+static void pci_p2pdma_vma_open(struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+ kref_get(&pmap->ref);
+}
+
+static void pci_p2pdma_vma_close(struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+ kref_put(&pmap->ref, pci_p2pdma_map_release);
+}
+
+static bool pci_p2pdma_vma_alloc(struct pci_p2pdma_map *pmap)
+{
+ struct pci_p2pdma *p2pdma;
+ bool ret = true;
+ void *kaddr;
+
+ spin_lock(&pmap->kaddr_lock);
+ if (pmap->kaddr)
+ goto out_spin_unlock;
+
+ rcu_read_lock();
+ ret = false;
+ p2pdma = rcu_dereference(pmap->pdev->p2pdma);
+ if (!p2pdma)
+ goto out_rcu_unlock;
+
+ if (!p2pdma->active)
+ goto out_rcu_unlock;
+
+ kaddr = (void *)gen_pool_alloc(p2pdma->pool, pmap->len);
+ if (!kaddr) {
+ pci_err(pmap->pdev,
+ "No free P2PDMA memory for userspace mmap\n");
+ goto out_rcu_unlock;
+ }
+
+ WRITE_ONCE(pmap->kaddr, kaddr);
+ ret = true;
+
+out_rcu_unlock:
+ rcu_read_unlock();
+out_spin_unlock:
+ spin_unlock(&pmap->kaddr_lock);
+ return ret;
+}
+
+static vm_fault_t pci_p2pdma_vma_fault(struct vm_fault *vmf)
+{
+ struct pci_p2pdma_map *pmap = vmf->vma->vm_private_data;
+ void *vaddr;
+ pfn_t pfn;
+
+ if (!READ_ONCE(pmap->kaddr)) {
+ if (!pci_p2pdma_vma_alloc(pmap))
+ return VM_FAULT_SIGBUS;
+ }
+
+ vaddr = pmap->kaddr + vmf->address - vmf->vma->vm_start;
+ pfn = phys_to_pfn_t(virt_to_phys(vaddr), PFN_DEV | PFN_MAP);
+
+ return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+}
+static const struct vm_operations_struct pci_p2pdma_vmops = {
+ .open = pci_p2pdma_vma_open,
+ .close = pci_p2pdma_vma_close,
+ .fault = pci_p2pdma_vma_fault,
+};
+
+/**
+ * pci_p2pdma_file_open - setup file mapping to store P2PMEM VMAs
+ * @pdev: the device to allocate memory from
+ * @file: the file being opened
+ *
+ * Set f_mapping of the file to the p2pdma inode so that mappings
+ * can be torn down on device unbind.
+ */
+void pci_p2pdma_file_open(struct pci_dev *pdev, struct file *file)
+{
+ struct pci_p2pdma *p2pdma;
+
+ rcu_read_lock();
+ p2pdma = rcu_dereference(pdev->p2pdma);
+ if (p2pdma)
+ file->f_mapping = p2pdma->inode->i_mapping;
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_file_open);
+
+/**
+ * pci_mmap_p2pmem - setup an mmap region to be backed with P2PDMA memory
+ * that was registered with pci_p2pdma_add_resource()
+ * @pdev: the device to allocate memory from
+ * @vma: the userspace vma to map the memory to
+ *
+ * The file must call pci_p2pdma_file_open() in its open() operation.
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap;
+ struct pci_p2pdma *p2pdma;
+ int ret;
+
+ /* prevent private mappings from being established */
+ if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted private mapping\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ if (vma->vm_pgoff) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted mapping with non-zero offset\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
+ if (!pmap)
+ return -ENOMEM;
+
+ spin_lock_init(&pmap->kaddr_lock);
+ rcu_read_lock();
+ p2pdma = rcu_dereference(pdev->p2pdma);
+ if (!p2pdma) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ ret = simple_pin_fs(&pci_p2pdma_fs_type, &pci_p2pdma_fs_mnt,
+ &pci_p2pdma_fs_cnt);
+ if (ret)
+ goto out;
+
+ ihold(p2pdma->inode);
+ pmap->inode = p2pdma->inode;
+ rcu_read_unlock();
+
+ vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_private_data = pmap;
+ vma->vm_ops = &pci_p2pdma_vmops;
+
+ return 0;
+
+out:
+ rcu_read_unlock();
+ kfree(pmap);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 2c07aa6b7665..040d79126463 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -34,6 +34,8 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
+void pci_p2pdma_file_open(struct pci_dev *pdev, struct file *file);
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -90,6 +92,15 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
+static inline void pci_p2pdma_file_open(struct pci_dev *pdev,
+ struct file *file)
+{
+}
+static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
+ struct vm_area_struct *vma)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_PCI_P2PDMA */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f724129c0425..59ba2e60dc03 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -95,6 +95,7 @@
#define BPF_FS_MAGIC 0xcafe4a11
#define AAFS_MAGIC 0x5a3c69f0
#define ZONEFS_MAGIC 0x5a4f4653
+#define P2PDMA_MAGIC 0x70327064
/* Since UDF 2.01 is ISO 13346 based... */
#define UDF_SUPER_MAGIC 0x15013346
--
2.30.2
On 2022-05-16 16:31, Chaitanya Kulkarni wrote:
> Do you have any plans to re-spin this?
I didn't get any feedback this cycle, so there haven't been any changes.
I'll probably do a rebase and resend after the merge window.
Logan
On 4/7/22 08:46, Logan Gunthorpe wrote:
> [ full cover letter snipped ]
Do you have any plans to re-spin this?
-ck
On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
> +static void pci_p2pdma_unmap_mappings(void *data)
> +{
> + struct pci_dev *pdev = data;
> + struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> +
> + /* Ensure no new pages can be allocated in mappings */
> + p2pdma->active = false;
> + synchronize_rcu();
> +
> + unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
> +
> + /*
> + * On some architectures, TLB flushes are done with call_rcu()
> + * so to ensure GUP fast is done with the pages, call synchronize_rcu()
> + * before freeing them.
> + */
> + synchronize_rcu();
> + pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
With the series from Felix getting close this should get updated to
not set pte_devmap and use proper natural refcounting without any of
this stuff.
Jason
On 2022-05-27 06:55, Jason Gunthorpe wrote:
> On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
>> +static void pci_p2pdma_unmap_mappings(void *data)
>> +{
>> + struct pci_dev *pdev = data;
>> + struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>> +
>> + /* Ensure no new pages can be allocated in mappings */
>> + p2pdma->active = false;
>> + synchronize_rcu();
>> +
>> + unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>> +
>> + /*
>> + * On some architectures, TLB flushes are done with call_rcu()
>> + * so to ensure GUP fast is done with the pages, call synchronize_rcu()
>> + * before freeing them.
>> + */
>> + synchronize_rcu();
>> + pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
>
> With the series from Felix getting close this should get updated to
> not set pte_devmap and use proper natural refcounting without any of
> this stuff.
Can you send a link? I'm not sure what you are referring to.
Thanks,
Logan
On Fri, May 27, 2022 at 09:35:07AM -0600, Logan Gunthorpe wrote:
>
>
> On 2022-05-27 06:55, Jason Gunthorpe wrote:
> > On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
> >> +static void pci_p2pdma_unmap_mappings(void *data)
> >> +{
> >> + struct pci_dev *pdev = data;
> >> + struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
> >> +
> >> + /* Ensure no new pages can be allocated in mappings */
> >> + p2pdma->active = false;
> >> + synchronize_rcu();
> >> +
> >> + unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
> >> +
> >> + /*
> >> + * On some architectures, TLB flushes are done with call_rcu()
> >> + * so to ensure GUP fast is done with the pages, call synchronize_rcu()
> >> + * before freeing them.
> >> + */
> >> + synchronize_rcu();
> >> + pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
> >
> > With the series from Felix getting close this should get updated to
> > not set pte_devmap and use proper natural refcounting without any of
> > this stuff.
>
> Can you send a link? I'm not sure what you are referring to.
IIRC this is the last part:
https://lore.kernel.org/linux-mm/[email protected]/
And the earlier bit with Christoph's pieces looks like it might get
merged to v5.19..
The general idea is once pte_devmap is not set then all the
refcounting works the way it should. This is what all new ZONE_DEVICE
users should do..
Jason
On 2022-05-27 13:03, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 09:35:07AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-05-27 06:55, Jason Gunthorpe wrote:
>>> On Thu, Apr 07, 2022 at 09:47:16AM -0600, Logan Gunthorpe wrote:
>>>> +static void pci_p2pdma_unmap_mappings(void *data)
>>>> +{
>>>> + struct pci_dev *pdev = data;
>>>> + struct pci_p2pdma *p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
>>>> +
>>>> + /* Ensure no new pages can be allocated in mappings */
>>>> + p2pdma->active = false;
>>>> + synchronize_rcu();
>>>> +
>>>> + unmap_mapping_range(p2pdma->inode->i_mapping, 0, 0, 1);
>>>> +
>>>> + /*
>>>> + * On some architectures, TLB flushes are done with call_rcu()
>>>> + * so to ensure GUP fast is done with the pages, call synchronize_rcu()
>>>> + * before freeing them.
>>>> + */
>>>> + synchronize_rcu();
>>>> + pci_p2pdma_free_mappings(p2pdma->inode->i_mapping);
>>>
>>> With the series from Felix getting close this should get updated to
>>> not set pte_devmap and use proper natural refcounting without any of
>>> this stuff.
>>
>> Can you send a link? I'm not sure what you are referring to.
>
> IIRC this is the last part:
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> And the earlier bit with Christoph's pieces looks like it might get
> merged to v5.19..
>
> The general idea is once pte_devmap is not set then all the
> refcounting works the way it should. This is what all new ZONE_DEVICE
> users should do..
Ok, I don't actually follow how those patches relate to this.
Based on your description I guess I don't need to set PFN_DEV and
perhaps not use vmf_insert_mixed()? And then just use vm_normal_page()?
But the refcounting of the pages seemed like it was already sane to me,
unless you mean that the code no longer has to synchronize_rcu() before
returning the pages... that would be spectacular and clean things up a
lot (plus fix an annoying issue where, if you use and then free all the
memory, you can't allocate new memory for an indeterminate amount of time).
Logan
On Fri, May 27, 2022 at 04:41:08PM -0600, Logan Gunthorpe wrote:
> >
> > IIRC this is the last part:
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > And the earlier bit with Christoph's pieces looks like it might get
> > merged to v5.19..
> >
> > The general idea is once pte_devmap is not set then all the
> > refcounting works the way it should. This is what all new ZONE_DEVICE
> > users should do..
>
> Ok, I don't actually follow how those patches relate to this.
>
> Based on your description I guess I don't need to set PFN_DEV and
Yes
> perhaps not use vmf_insert_mixed()? And then just use vm_normal_page()?
I'm not sure ATM the best function to use, but yes, a function that
doesn't set PFN_DEV is needed here.
> But the refcounting of the pages seemed like it was already sane to me,
> unless you mean that the code no longer has to synchronize_rcu() before
> returning the pages...
Right. It also doesn't need to call unmap range or keep track of the
inode, or do any of that stuff unless it really needs mmap revocation
semantics (which I doubt this use case does)
unmap_mapping_range() was only necessary because the refcounting is
wrong: since the PTEs don't hold a ref on the page in PFN_DEV mode, it
is necessary to wipe all the PTEs explicitly before going ahead to
decrement the refcount on this path.
Just stuff the pages into the mmap, and your driver unprobe will
automatically block until all the mmaps are closed - no different than
having an open file descriptor or something.
Jason
On Thu, Jun 02, 2022 at 10:49:15AM -0600, Logan Gunthorpe wrote:
>
>
> On 2022-06-02 10:30, Jason Gunthorpe wrote:
> > On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
> >
> >>> Just stuff the pages into the mmap, and your driver unprobe will
> >>> automatically block until all the mmaps are closed - no different than
> >>> having an open file descriptor or something.
> >>
> >> Oh is that what we want?
> >
> > Yes, it is the typical case - eg if you have a sysfs file open unbind
> > hangs indefinitely. Many drivers can't unbind while they have open file
> > descriptors/etc.
> >
> > A couple drivers go out of their way to allow unbinding while a live
> > userspace exists but this can get complicated. Usually there should be
> > a good reason.
> >
> > The module will already be refcounted anyhow because the mmap points
> > to a char file which holds a module reference - meaning a simple rmmod
> > of the driver shouldn't work already..
>
> Also, I just tried it... If I open a sysfs file for an nvme device (ie.
> /sys/class/nvme/nvme4/cntlid) and unbind the device, it does not block.
> A subsequent read on that file descriptor returns ENODEV. Which is what
> I would have expected.
Oh interesting, this has changed since I last looked years ago;
kernfs_get_active() is now narrower in scope than it once was. So maybe
sysfs isn't the same concern it used to be!
Thanks,
Jason
On Thu, Jun 02, 2022 at 10:45:55AM -0600, Logan Gunthorpe wrote:
>
>
>
> On 2022-06-02 10:30, Jason Gunthorpe wrote:
> > On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
> >
> >>> Just stuff the pages into the mmap, and your driver unprobe will
> >>> automatically block until all the mmaps are closed - no different than
> >>> having an open file descriptor or something.
> >>
> >> Oh is that what we want?
> >
> > Yes, it is the typical case - eg if you have a sysfs file open unbind
> > hangs indefinitely. Many drivers can't unbind while they have open file
> > descriptors/etc.
> >
> > A couple drivers go out of their way to allow unbinding while a live
> > userspace exists but this can get complicated. Usually there should be
> > a good reason.
>
> This is not my experience. All the drivers I've worked with do not block
> unbind with open file descriptors (at least for char devices). I know,
> for example, that having a file descriptor open of /dev/nvmeX does not
> cause unbinding to block.
So there are lots of bugs in the kernel, and I've seen many drivers
that think calling cdev_device_del() is all they need to do - and then
happily allow cdev ioctls/etc. on a de-initialized driver struct.
Drivers that do take care of this usually have to put a lock around
all their fops to serialize against unbind. RDMA uses SRCU, iirc TPM
used a rwlock. But this is tricky and hurts fops performance.
I don't know what nvme did to protect against this, I didn't notice
an obvious lock.
> I figured this was the expectation as the userspace process doing
> the unbind won't be able to be interrupted seeing there's no way to
> fail on that path. Though, it certainly would make things a lot
> easier if the unbind can block indefinitely as it usually requires
> some complicated locking.
As I said, this is what sysfs does today and I don't see that ever
changing. If you userspace has a sysfs file open then the driver
unbind hangs until the file is closed.
So, doing as bad as sysfs seems like a reasonable baseline to me.
> Do you have an example of this? What mechanisms are developers using to
> block unbind with open file descriptors?
Sysfs maintains a refcount with a bias that is basically a fancy
rwlock. Most places use some kind of refcount triggering a
completion. Sleep on the completion until refcount is 0 on unbind kind
of thing.
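A minimal sketch of that pattern (all names are made up):

  #include <linux/completion.h>
  #include <linux/refcount.h>

  struct example_dev {
          refcount_t users;
          struct completion users_gone;
  };

  static void example_init(struct example_dev *edev)
  {
          refcount_set(&edev->users, 1);  /* bias held by the driver */
          init_completion(&edev->users_gone);
  }

  /* called at the start of every fop; fails once unbind has begun */
  static bool example_get(struct example_dev *edev)
  {
          return refcount_inc_not_zero(&edev->users);
  }

  static void example_put(struct example_dev *edev)
  {
          if (refcount_dec_and_test(&edev->users))
                  complete(&edev->users_gone);
  }

  static void example_unbind(struct example_dev *edev)
  {
          example_put(edev);              /* drop the bias */
          wait_for_completion(&edev->users_gone);
  }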
Jason
On 2022-06-02 10:30, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
>
>>> Just stuff the pages into the mmap, and your driver unprobe will
>>> automatically block until all the mmaps are closed - no different than
>>> having an open file descriptor or something.
>>
>> Oh is that what we want?
>
> Yes, it is the typical case - eg if you have a sysfs file open unbind
> hangs indefinitely. Many drivers can't unbind while they have open file
> descriptors/etc.
>
> A couple drivers go out of their way to allow unbinding while a live
> userspace exists but this can get complicated. Usually there should be
> a good reason.
This is not my experience. All the drivers I've worked with do not block
unbind with open file descriptors (at least for char devices). I know,
for example, that having a file descriptor open of /dev/nvmeX does not
cause unbinding to block. I figured this was the expectation as the
userspace process doing the unbind won't be able to be interrupted
seeing there's no way to fail on that path. Though, it certainly would
make things a lot easier if the unbind can block indefinitely as it
usually requires some complicated locking.
Do you have an example of this? What mechanisms are developers using to
block unbind with open file descriptors?
Logan
On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
> > Just stuff the pages into the mmap, and your driver unprobe will
> > automatically block until all the mmaps are closed - no different than
> > having an open file descriptor or something.
>
> Oh is that what we want?
Yes, it is the typical case - eg if you have a sysfs file open unbind
hangs indefinitely. Many drivers can't unbind while they have open file
descriptors/etc.
A couple drivers go out of their way to allow unbinding while a live
userspace exists but this can get complicated. Usually there should be
a good reason.
The module will already be refcounted anyhow because the mmap points
to a char file which holds a module reference - meaning a simple rmmod
of the driver shouldn't work already..
Jason
On 2022-06-01 18:00, Jason Gunthorpe wrote:
> On Fri, May 27, 2022 at 04:41:08PM -0600, Logan Gunthorpe wrote:
>>>
>>> IIRC this is the last part:
>>>
>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> And the earlier bit with Christoph's pieces looks like it might get
>>> merged to v5.19..
>>>
>>> The general idea is once pte_devmap is not set then all the
>>> refcounting works the way it should. This is what all new ZONE_DEVICE
>>> users should do..
>>
>> Ok, I don't actually follow how those patches relate to this.
>>
>> Based on your description I guess I don't need to set PFN_DEV and
>
> Yes
>
>> perhaps not use vmf_insert_mixed()? And then just use vm_normal_page()?
>
> I'm not sure ATM the best function to use, but yes, a function that
> doesn't set PFN_DEV is needed here.
>
>> But the refcounting of the pages seemed like it was already sane to me,
>> unless you mean that the code no longer has to synchronize_rcu() before
>> returning the pages...
>
> Right. It also doesn't need to call unmap range or keep track of the
> inode, or do any of that stuff unless it really needs mmap revokation
> semantics (which I doubt this use case does)
>
> unmap range was only necessary because the refcounting is wrong -
> since the pte's don't hold a ref on the page in PFN_DEV mode it is
> necessary to wipe all the PTE explicitly before going ahead to
> decrement the refcount on this path.
>
> Just stuff the pages into the mmap, and your driver unprobe will
> automatically block until all the mmaps are closed - no different than
> having an open file descriptor or something.
Oh is that what we want? With the current method the mmaps are unmapped
on unbind so that it doesn't block indefinitely. It seems more typical
for resources to be dropped quickly on unbind, with processes that are
using them getting an error on the next use.
Logan
On 2022-06-02 10:30, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
>
>>> Just stuff the pages into the mmap, and your driver unprobe will
>>> automatically block until all the mmaps are closed - no different than
>>> having an open file descriptor or something.
>>
>> Oh is that what we want?
>
> Yes, it is the typical case - eg if you have a sysfs file open unbind
> hangs indefinitely. Many drivers can't unbind while they have open file
> descriptors/etc.
>
> A couple drivers go out of their way to allow unbinding while a live
> userspace exists but this can get complicated. Usually there should be
> a good reason.
>
> The module will already be refcounted anyhow because the mmap points
> to a char file which holds a module reference - meaning a simple rmmod
> of the driver shouldn't work already..
Also, I just tried it... If I open a sysfs file for an nvme device (ie.
/sys/class/nvme/nvme4/cntlid) and unbind the device, it does not block.
A subsequent read on that file descriptor returns ENODEV. Which is what
I would have expected.
Logan
On 2022-06-02 11:28, Jason Gunthorpe wrote:
> On Thu, Jun 02, 2022 at 10:49:15AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2022-06-02 10:30, Jason Gunthorpe wrote:
>>> On Thu, Jun 02, 2022 at 10:16:10AM -0600, Logan Gunthorpe wrote:
>>>
>>>>> Just stuff the pages into the mmap, and your driver unprobe will
>>>>> automatically block until all the mmaps are closed - no different than
>>>>> having an open file descriptor or something.
>>>>
>>>> Oh is that what we want?
>>>
>>> Yes, it is the typical case - eg if you have a sysfs file open unbind
>>> hangs indefinitely. Many drivers can't unbind while they have open file
>>> descriptors/etc.
>>>
>>> A couple drivers go out of their way to allow unbinding while a live
>>> userspace exists but this can get complicated. Usually there should be
>>> a good reason.
>>>
>>> The module will already be refcounted anyhow because the mmap points
>>> to a char file which holds a module reference - meaning a simple rmmod
>>> of the driver shouldn't work already..
>>
>> Also, I just tried it... If I open a sysfs file for an nvme device (ie.
>> /sys/class/nvme/nvme4/cntlid) and unbind the device, it does not block.
>> A subsequent read on that file descriptor returns ENODEV. Which is what
>> I would have expected.
>
> Oh interesting, this has changed since I last looked years ago;
> kernfs_get_active() is now narrower in scope than it once was. So maybe
> sysfs isn't the same concern it used to be!
Yeah, so I really think *not* blocking unbind indefinitely is the better
approach here. It's what has always been done with device dax, etc.:
mmaps in userspace processes get unmapped and fault with SIGBUS on the
next access, and unbind actually releases the device relatively
promptly. Userspace processes can fail or try to handle the device going
away gracefully.
Logan