This RFC enables P2PDMA transfers in userspace between NVMe drives using
existing O_DIRECT operations or the NVMe passthrough IOCTL.
This is accomplished by allowing userspace to allocate chunks of any CMB
by mmaping the NVMe ctrl device (Patches 14 and 15). The resulting memory
will be backed by P2P pages and can be passed only to O_DIRECT
operations. A flag is added to GUP() in Patch 10 and Patches 11 through 13
wire this flag up based on whether the block queue indicates P2PDMA
support.
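For illustration, the intended userspace flow looks roughly like the
sketch below. The device paths, sizes and (lack of) error handling are
purely illustrative; the mmap semantics are those proposed in Patches
14 and 15:

/*
 * Hedged sketch: map CMB-backed P2P memory from one NVMe controller and
 * use it as the buffer for an O_DIRECT read from another drive.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2 * 1024 * 1024;
	void *p2pmem;
	int cfd, dfd;

	cfd = open("/dev/nvme0", O_RDWR);	/* controller char device */
	dfd = open("/dev/nvme1n1", O_RDONLY | O_DIRECT);
	if (cfd < 0 || dfd < 0)
		return 1;

	/* chunk of nvme0's CMB, backed by P2P pages */
	p2pmem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, cfd, 0);
	if (p2pmem == MAP_FAILED)
		return 1;

	/* the O_DIRECT read from nvme1 lands directly in nvme0's CMB */
	if (pread(dfd, p2pmem, len, 0) < 0)
		perror("pread");

	munmap(p2pmem, len);
	close(dfd);
	close(cfd);
	return 0;
}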
The above is pretty straightforward and (I hope) largely uncontroversial.
However, the one significant problem in all this is that, presently,
pci_p2pdma_map_sg() requires a homogeneous SGL with all P2PDMA pages or
none. Enhancing GUP to enforce this rule would require a huge hack that
I don't expect would be all that palatable. So this RFC takes the
approach of removing the requirement for a homogeneous SGL.
Building on the new common dma-iommu infrastructure, this patchset adds
support for P2PDMA pages to dma_map_sg(), which covers the AMD IOMMU,
(soon) the Intel IOMMU and dma-direct. (Other IOMMU implementations,
notably ARM and PowerPC, would remain unsupported.)
The other major blocker is that in order to implement support for
P2PDMA pages in dma_map_sg(), a flag is necessary to determine if a
given dma_addr_t points to P2PDMA memory or to an IOVA so that it can
be unmapped appropriately in dma_unmap_sg(). The (ugly) approach this
RFC takes is to use the top bit in the dma_length field and ensure
callers are prepared for it using a new DMA_ATTR_P2PDMA flag.
I suspect the ultimate solution to this blocker will be to implement
some kind of new dma_op that doesn't use the SGL. Ideas have been
thrown around in the past for one that maps some kind of novel dma_vec
directly from a bio_vec. This will become a lot easier to implement if
more dma_ops providers get converted to the new dma-iommu
implementation, but this will take time.
Alternative ideas or other feedback welcome.
This series is based on v5.10-rc2 with Lu Baolu's (and Tom Murphy's)
v4 patchset for converting the Intel IOMMU to dma-iommu[1]. A git
branch is available here:
https://github.com/sbates130272/linux-p2pmem/ p2pdma_user_cmb_rfc
Thanks,
Logan
[1] https://lkml.kernel.org/lkml/[email protected]/T/#u.
Logan Gunthorpe (15):
PCI/P2PDMA: Don't sleep in upstream_bridge_distance_warn()
PCI/P2PDMA: Attempt to set map_type if it has not been set
PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
pci_p2pdma_bus_offset()
lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
nvme-pci: Convert to using dma_map_sg for p2pdma pages
mm: Introduce FOLL_PCI_P2PDMA to gate getting PCI P2PDMA pages
iov_iter: Introduce iov_iter_get_pages_[alloc_]flags()
block: Set FOLL_PCI_P2PDMA in __bio_iov_iter_get_pages()
block: Set FOLL_PCI_P2PDMA in bio_map_user_iov()
PCI/P2PDMA: Introduce pci_mmap_p2pmem()
nvme-pci: Allow mmaping the CMB in userspace
block/bio.c | 7 +-
block/blk-map.c | 7 +-
drivers/dax/super.c | 7 +-
drivers/iommu/dma-iommu.c | 63 +++++++++++--
drivers/nvme/host/core.c | 14 ++-
drivers/nvme/host/nvme.h | 3 +-
drivers/nvme/host/pci.c | 50 ++++++----
drivers/pci/p2pdma.c | 178 +++++++++++++++++++++++++++++++++---
include/linux/dma-map-ops.h | 3 +
include/linux/dma-mapping.h | 16 ++++
include/linux/memremap.h | 4 +-
include/linux/mm.h | 1 +
include/linux/pci-p2pdma.h | 17 ++++
include/linux/scatterlist.h | 4 +
include/linux/uio.h | 21 ++++-
kernel/dma/direct.c | 33 ++++++-
kernel/dma/mapping.c | 8 ++
lib/iov_iter.c | 25 ++---
mm/gup.c | 28 +++---
mm/huge_memory.c | 8 +-
mm/memory-failure.c | 4 +-
mm/memremap.c | 14 ++-
22 files changed, 427 insertions(+), 88 deletions(-)
base-commit: 5ba8a2512e8c5f5cf9b7309dc895612f0a77a399
--
2.20.1
Add iov_iter_get_pages_flags() and iov_iter_get_pages_alloc_flags()
which take a flags argument that is passed to get_user_pages_fast().
This is so that FOLL_PCI_P2PDMA can be passed when appropriate.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/uio.h | 21 +++++++++++++++++----
lib/iov_iter.c | 25 ++++++++++++++-----------
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 72d88566694e..694caa868c05 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -221,10 +221,23 @@ void iov_iter_bvec(struct iov_iter *i, unsigned int direction, const struct bio_
void iov_iter_pipe(struct iov_iter *i, unsigned int direction, struct pipe_inode_info *pipe,
size_t count);
void iov_iter_discard(struct iov_iter *i, unsigned int direction, size_t count);
-ssize_t iov_iter_get_pages(struct iov_iter *i, struct page **pages,
- size_t maxsize, unsigned maxpages, size_t *start);
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i, struct page ***pages,
- size_t maxsize, size_t *start);
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i, struct page **pages,
+ size_t maxsize, unsigned maxpages, size_t *start,
+ unsigned int gup_flags);
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start,
+ unsigned int gup_flags);
+static inline ssize_t iov_iter_get_pages(struct iov_iter *i,
+ struct page **pages, size_t maxsize, unsigned maxpages,
+ size_t *start)
+{
+ return iov_iter_get_pages_flags(i, pages, maxsize, maxpages, start, 0);
+}
+static inline ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ struct page ***pages, size_t maxsize, size_t *start)
+{
+ return iov_iter_get_pages_alloc_flags(i, pages, maxsize, start, 0);
+}
int iov_iter_npages(const struct iov_iter *i, int maxpages);
const void *dup_iter(struct iov_iter *new, struct iov_iter *old, gfp_t flags);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 1635111c5bd2..700effe0bb86 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1313,9 +1313,9 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
return __pipe_get_pages(i, min(maxsize, capacity), pages, iter_head, start);
}
-ssize_t iov_iter_get_pages(struct iov_iter *i,
+ssize_t iov_iter_get_pages_flags(struct iov_iter *i,
struct page **pages, size_t maxsize, unsigned maxpages,
- size_t *start)
+ size_t *start, unsigned int gup_flags)
{
if (maxsize > i->count)
maxsize = i->count;
@@ -1325,6 +1325,9 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;
+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+
iterate_all_kinds(i, maxsize, v, ({
unsigned long addr = (unsigned long)v.iov_base;
size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1335,9 +1338,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
len = maxpages * PAGE_SIZE;
addr &= ~(PAGE_SIZE - 1);
n = DIV_ROUND_UP(len, PAGE_SIZE);
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0,
- pages);
+ res = get_user_pages_fast(addr, n, gup_flags, pages);
if (unlikely(res < 0))
return res;
return (res == n ? len : res * PAGE_SIZE) - *start;
@@ -1352,7 +1353,7 @@ ssize_t iov_iter_get_pages(struct iov_iter *i,
)
return 0;
}
-EXPORT_SYMBOL(iov_iter_get_pages);
+EXPORT_SYMBOL(iov_iter_get_pages_flags);
static struct page **get_pages_array(size_t n)
{
@@ -1392,9 +1393,9 @@ static ssize_t pipe_get_pages_alloc(struct iov_iter *i,
return n;
}
-ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
+ssize_t iov_iter_get_pages_alloc_flags(struct iov_iter *i,
struct page ***pages, size_t maxsize,
- size_t *start)
+ size_t *start, unsigned int gup_flags)
{
struct page **p;
@@ -1406,6 +1407,9 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
if (unlikely(iov_iter_is_discard(i)))
return -EFAULT;
+ if (iov_iter_rw(i) != WRITE)
+ gup_flags |= FOLL_WRITE;
+
iterate_all_kinds(i, maxsize, v, ({
unsigned long addr = (unsigned long)v.iov_base;
size_t len = v.iov_len + (*start = addr & (PAGE_SIZE - 1));
@@ -1417,8 +1421,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
p = get_pages_array(n);
if (!p)
return -ENOMEM;
- res = get_user_pages_fast(addr, n,
- iov_iter_rw(i) != WRITE ? FOLL_WRITE : 0, p);
+ res = get_user_pages_fast(addr, n, gup_flags, p);
if (unlikely(res < 0)) {
kvfree(p);
return res;
@@ -1439,7 +1442,7 @@ ssize_t iov_iter_get_pages_alloc(struct iov_iter *i,
)
return 0;
}
-EXPORT_SYMBOL(iov_iter_get_pages_alloc);
+EXPORT_SYMBOL(iov_iter_get_pages_alloc_flags);
size_t csum_and_copy_from_iter(void *addr, size_t bytes, __wsum *csum,
struct iov_iter *i)
--
2.20.1
Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.
The DMA_ATTR_P2PDMA flag is added to allow callers to indicate support
for P2PDMA pages. To support P2PDMA pages, a caller must ensure that no
segment is greater than 2GB so that the high bit of the DMA length can
be used as a flag to indicate a P2PDMA segment. Such code must then use
sg_dma_p2pdma_len() instead of sg_dma_len() to filter out the flag.
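For illustration, a minimal sketch of the caller contract (the driver
function below is hypothetical; the attribute and helpers are the ones
added by this series, and the caller is assumed to have already checked
that its DMA layer supports P2PDMA):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/*
 * Sketch only: a caller opting in to P2PDMA segments. All segments are
 * assumed to be well under 2GB, so the high bit of the length is free
 * to act as the P2PDMA flag.
 */
static int example_map_and_walk(struct device *dev, struct scatterlist *sgl,
				int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int i, nr_mapped;

	nr_mapped = dma_map_sg_attrs(dev, sgl, nents, dir, DMA_ATTR_P2PDMA);
	if (!nr_mapped)
		return -EIO;

	for_each_sg(sgl, sg, nr_mapped, i) {
		/*
		 * sg_dma_len() would include the flag bit for P2PDMA
		 * segments, so the _p2pdma_ variant must be used here.
		 */
		dev_dbg(dev, "seg %d: %pad len %u p2pdma %d\n", i,
			&sg_dma_address(sg), sg_dma_p2pdma_len(sg),
			(int)!!sg_dma_is_p2pdma(sg));
	}

	dma_unmap_sg_attrs(dev, sgl, nents, dir, DMA_ATTR_P2PDMA);
	return 0;
}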
Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/dma-mapping.h | 11 +++++++++++
kernel/dma/direct.c | 33 +++++++++++++++++++++++++++++++--
2 files changed, 42 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 956151052d45..8d028e15b531 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -61,6 +61,17 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)
+/*
+ * DMA_ATTR_P2PDMA: specifies that dma_map_sg() may return p2pdma
+ * bus addresses. Code that specifies this must use
+ * sg_dma_p2pdma_len() instead of sg_dma_len(), as the high
+ * bit of the length will indicate a P2PDMA bus address.
+ *
+ * If this attribute is not set and P2PDMA pages are encountered,
+ * dma_map_sg() will return an error.
+ */
+#define DMA_ATTR_P2PDMA (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 06c111544f61..2fcb31789436 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
#include <linux/vmalloc.h>
#include <linux/set_memory.h>
#include <linux/slab.h>
+#include <linux/pci-p2pdma.h>
#include "direct.h"
/*
@@ -387,19 +388,47 @@ void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
struct scatterlist *sg;
int i;
- for_each_sg(sgl, sg, nents, i)
+ for_each_sg(sgl, sg, nents, i) {
+ if (attrs & DMA_ATTR_P2PDMA && sg_dma_is_p2pdma(sg)) {
+ sg_dma_len(sg) &= ~SG_P2PDMA_FLAG;
+ continue;
+ }
+
dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
attrs);
+ }
}
#endif
int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
enum dma_data_direction dir, unsigned long attrs)
{
- int i;
+ struct dev_pagemap *pgmap = NULL;
+ int i, map = -1;
struct scatterlist *sg;
+ u64 bus_off;
for_each_sg(sgl, sg, nents, i) {
+ if (is_pci_p2pdma_page(sg_page(sg))) {
+ if (sg_page(sg)->pgmap != pgmap) {
+ pgmap = sg_page(sg)->pgmap;
+ map = pci_p2pdma_should_map_bus(dev, pgmap);
+ bus_off = pci_p2pdma_bus_offset(sg_page(sg));
+ }
+
+ if (map < 0 || !(attrs & DMA_ATTR_P2PDMA)) {
+ sg->dma_address = DMA_MAPPING_ERROR;
+ goto out_unmap;
+ }
+
+ if (map) {
+ sg->dma_address = sg_phys(sg) + sg->offset -
+ bus_off;
+ sg_dma_len(sg) = sg->length | SG_P2PDMA_FLAG;
+ continue;
+ }
+ }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR)
--
2.20.1
Callers that can handle PCI P2PDMA pages can now set FOLL_PCI_P2PDMA
to allow obtaining them. If a caller does not set this flag and tries
to pin P2PDMA pages, the operation will fail.
This is implemented by adding a flag and a check to get_dev_pagemap().
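For illustration, a driver that knows it can handle P2PDMA pages might
pin a user buffer like the sketch below (the wrapper function is
hypothetical; only the flag comes from this patch):

#include <linux/mm.h>

/*
 * Sketch only: pin user pages that may be backed by PCI P2PDMA memory.
 * Without FOLL_PCI_P2PDMA, get_user_pages_fast() will refuse such pages.
 */
static int example_pin_user_buf(unsigned long uaddr, int nr_pages,
				struct page **pages)
{
	int pinned;

	pinned = get_user_pages_fast(uaddr, nr_pages,
				     FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
	if (pinned < 0)
		return pinned;

	if (pinned != nr_pages) {
		/* partial pin: drop what was taken and bail */
		while (pinned > 0)
			put_page(pages[--pinned]);
		return -EFAULT;
	}

	return 0;
}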
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/dax/super.c | 7 ++++---
include/linux/memremap.h | 4 ++--
include/linux/mm.h | 1 +
mm/gup.c | 28 +++++++++++++++++-----------
mm/huge_memory.c | 8 ++++----
mm/memory-failure.c | 4 ++--
mm/memremap.c | 14 ++++++++++----
7 files changed, 40 insertions(+), 26 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index edc279be3e59..59c05839b3f8 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -132,9 +132,10 @@ bool __generic_fsdax_supported(struct dax_device *dax_dev,
} else if (pfn_t_devmap(pfn) && pfn_t_devmap(end_pfn)) {
struct dev_pagemap *pgmap, *end_pgmap;
- pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL);
- end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL);
- if (pgmap && pgmap == end_pgmap && pgmap->type == MEMORY_DEVICE_FS_DAX
+ pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL, false);
+ end_pgmap = get_dev_pagemap(pfn_t_to_pfn(end_pfn), NULL, false);
+ if (!IS_ERR_OR_NULL(pgmap) && pgmap == end_pgmap
+ && pgmap->type == MEMORY_DEVICE_FS_DAX
&& pfn_t_to_page(pfn)->pgmap == pgmap
&& pfn_t_to_page(end_pfn)->pgmap == pgmap
&& pfn_t_to_pfn(pfn) == PHYS_PFN(__pa(kaddr))
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 79c49e7f5c30..14f6d899c657 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -136,7 +136,7 @@ void memunmap_pages(struct dev_pagemap *pgmap);
void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
void devm_memunmap_pages(struct device *dev, struct dev_pagemap *pgmap);
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
- struct dev_pagemap *pgmap);
+ struct dev_pagemap *pgmap, bool allow_pci_p2pdma);
unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
@@ -160,7 +160,7 @@ static inline void devm_memunmap_pages(struct device *dev,
}
static inline struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
- struct dev_pagemap *pgmap)
+ struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
{
return NULL;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..c405af966a4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2790,6 +2790,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_PCI_P2PDMA 0x100000 /* allow returning PCI P2PDMA pages */
/*
* FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index 102877ed77a4..abbc0a06d241 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -449,11 +449,16 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
* case since they are only valid while holding the pgmap
* reference.
*/
- *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
- if (*pgmap)
+ *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap,
+ flags & FOLL_PCI_P2PDMA);
+ if (IS_ERR(*pgmap)) {
+ page = ERR_CAST(*pgmap);
+ goto out;
+ } else if (*pgmap) {
page = pte_page(pte);
- else
+ } else {
goto no_page;
+ }
} else if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
@@ -794,7 +799,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
struct page *page;
page = follow_page_mask(vma, address, foll_flags, &ctx);
- if (ctx.pgmap)
+ if (!IS_ERR_OR_NULL(ctx.pgmap))
put_dev_pagemap(ctx.pgmap);
return page;
}
@@ -1138,7 +1143,7 @@ static long __get_user_pages(struct mm_struct *mm,
nr_pages -= page_increm;
} while (nr_pages);
out:
- if (ctx.pgmap)
+ if (!IS_ERR_OR_NULL(ctx.pgmap))
put_dev_pagemap(ctx.pgmap);
return i ? i : ret;
}
@@ -2177,8 +2182,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
if (unlikely(flags & FOLL_LONGTERM))
goto pte_unmap;
- pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
- if (unlikely(!pgmap)) {
+ pgmap = get_dev_pagemap(pte_pfn(pte), pgmap,
+ flags & FOLL_PCI_P2PDMA);
+ if (IS_ERR_OR_NULL(pgmap)) {
undo_dev_pagemap(nr, nr_start, flags, pages);
goto pte_unmap;
}
@@ -2221,7 +2227,7 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
ret = 1;
pte_unmap:
- if (pgmap)
+ if (!IS_ERR_OR_NULL(pgmap))
put_dev_pagemap(pgmap);
pte_unmap(ptem);
return ret;
@@ -2255,8 +2261,8 @@ static int __gup_device_huge(unsigned long pfn, unsigned long addr,
do {
struct page *page = pfn_to_page(pfn);
- pgmap = get_dev_pagemap(pfn, pgmap);
- if (unlikely(!pgmap)) {
+ pgmap = get_dev_pagemap(pfn, pgmap, flags & FOLL_PCI_P2PDMA);
+ if (IS_ERR_OR_NULL(pgmap)) {
undo_dev_pagemap(nr, nr_start, flags, pages);
return 0;
}
@@ -2681,7 +2687,7 @@ static int internal_get_user_pages_fast(unsigned long start, int nr_pages,
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
- FOLL_FAST_ONLY)))
+ FOLL_FAST_ONLY | FOLL_PCI_P2PDMA)))
return -EINVAL;
if (gup_flags & FOLL_PIN)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..3030548dc007 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -985,8 +985,8 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
return ERR_PTR(-EEXIST);
pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
+ *pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+ if (IS_ERR_OR_NULL(*pgmap))
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
if (!try_grab_page(page, flags))
@@ -1159,8 +1159,8 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
return ERR_PTR(-EEXIST);
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
+ *pgmap = get_dev_pagemap(pfn, *pgmap, flags & FOLL_PCI_P2PDMA);
+ if (IS_ERR_OR_NULL(*pgmap))
return ERR_PTR(-EFAULT);
page = pfn_to_page(pfn);
if (!try_grab_page(page, flags))
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c0bb186bba62..44fdad77f06a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1328,8 +1328,8 @@ int memory_failure(unsigned long pfn, int flags)
p = pfn_to_online_page(pfn);
if (!p) {
if (pfn_valid(pfn)) {
- pgmap = get_dev_pagemap(pfn, NULL);
- if (pgmap)
+ pgmap = get_dev_pagemap(pfn, NULL, false);
+ if (!IS_ERR_OR_NULL(pgmap))
return memory_failure_dev_pagemap(pfn, flags,
pgmap);
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 73a206d0f645..5f0284c3a3c3 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -197,14 +197,14 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
"altmap not supported for multiple ranges\n"))
return -EINVAL;
- conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL, true);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
return -ENOMEM;
}
- conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL, true);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
@@ -454,19 +454,20 @@ void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
* get_dev_pagemap() - take a new live reference on the dev_pagemap for @pfn
* @pfn: page frame number to lookup page_map
* @pgmap: optional known pgmap that already has a reference
+ * @allow_pci_p2pdma: allow getting a pgmap with the PCI P2PDMA type
*
* If @pgmap is non-NULL and covers @pfn it will be returned as-is. If @pgmap
* is non-NULL but does not cover @pfn the reference to it will be released.
*/
struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
- struct dev_pagemap *pgmap)
+ struct dev_pagemap *pgmap, bool allow_pci_p2pdma)
{
resource_size_t phys = PFN_PHYS(pfn);
/*
* In the cached case we're already holding a live reference.
*/
- if (pgmap) {
+ if (!IS_ERR_OR_NULL(pgmap)) {
if (phys >= pgmap->range.start && phys <= pgmap->range.end)
return pgmap;
put_dev_pagemap(pgmap);
@@ -479,6 +480,11 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
pgmap = NULL;
rcu_read_unlock();
+ if (!allow_pci_p2pdma && pgmap->type == MEMORY_DEVICE_PCI_P2PDMA) {
+ put_dev_pagemap(pgmap);
+ return ERR_PTR(-EINVAL);
+ }
+
return pgmap;
}
EXPORT_SYMBOL_GPL(get_dev_pagemap);
--
2.20.1
Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
a hunk of p2pmem into userspace.
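For illustration, a PCI driver's char device could wire this up roughly
as follows (the file_operations and private data layout are hypothetical;
Patch 15 does the equivalent for the NVMe controller device):

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/pci-p2pdma.h>

/* Sketch only: expose p2pmem to userspace from a hypothetical driver. */
struct example_dev {
	struct pci_dev *pdev;
	/* ... */
};

static int example_p2pmem_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct example_dev *edev = file->private_data;

	return pci_mmap_p2pmem(edev->pdev, vma);
}

static const struct file_operations example_p2pmem_fops = {
	.owner	= THIS_MODULE,
	.mmap	= example_p2pmem_mmap,
};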
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
include/linux/pci-p2pdma.h | 6 +++
2 files changed, 110 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 9961e779f430..8eab53ac59ae 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -16,6 +16,7 @@
#include <linux/genalloc.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>
+#include <linux/pfn_t.h>
#include <linux/random.h>
#include <linux/seq_buf.h>
#include <linux/xarray.h>
@@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
return sprintf(page, "%s\n", pci_name(p2p_dev));
}
EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+struct pci_p2pdma_map {
+ struct kref ref;
+ struct pci_dev *pdev;
+ void *kaddr;
+ size_t len;
+};
+
+static struct pci_p2pdma_map *pci_p2pdma_map_alloc(struct pci_dev *pdev,
+ size_t len)
+{
+ struct pci_p2pdma_map *pmap;
+
+ pmap = kzalloc(sizeof(*pmap), GFP_KERNEL);
+ if (!pmap)
+ return NULL;
+
+ kref_init(&pmap->ref);
+ pmap->pdev = pdev;
+ pmap->len = len;
+
+ pmap->kaddr = pci_alloc_p2pmem(pdev, len);
+ if (!pmap->kaddr) {
+ kfree(pmap);
+ return NULL;
+ }
+
+ return pmap;
+}
+
+static void pci_p2pdma_map_free(struct kref *ref)
+{
+ struct pci_p2pdma_map *pmap =
+ container_of(ref, struct pci_p2pdma_map, ref);
+
+ pci_free_p2pmem(pmap->pdev, pmap->kaddr, pmap->len);
+ kfree(pmap);
+}
+
+static void pci_p2pdma_vma_open(struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+ kref_get(&pmap->ref);
+}
+
+static void pci_p2pdma_vma_close(struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap = vma->vm_private_data;
+
+ kref_put(&pmap->ref, pci_p2pdma_map_free);
+}
+
+const struct vm_operations_struct pci_p2pdma_vmops = {
+ .open = pci_p2pdma_vma_open,
+ .close = pci_p2pdma_vma_close,
+};
+
+/**
+ * pci_mmap_p2pmem - allocate peer-to-peer DMA memory
+ * @pdev: the device to allocate memory from
+ * @vma: the userspace vma to map the memory to
+ *
+ * Returns 0 on success, or a negative error code on failure
+ */
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma)
+{
+ struct pci_p2pdma_map *pmap;
+ unsigned long addr, pfn;
+ vm_fault_t ret;
+
+ /* prevent private mappings from being established */
+ if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+ pci_info_ratelimited(pdev,
+ "%s: fail, attempted private mapping\n",
+ current->comm);
+ return -EINVAL;
+ }
+
+ pmap = pci_p2pdma_map_alloc(pdev, vma->vm_end - vma->vm_start);
+ if (!pmap)
+ return -ENOMEM;
+
+ vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_private_data = pmap;
+ vma->vm_ops = &pci_p2pdma_vmops;
+
+ pfn = virt_to_phys(pmap->kaddr) >> PAGE_SHIFT;
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ ret = vmf_insert_mixed(vma, addr,
+ __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP));
+ if (ret & VM_FAULT_ERROR)
+ goto out_error;
+ pfn++;
+ }
+
+ return 0;
+
+out_error:
+ kref_put(&pmap->ref, pci_p2pdma_map_free);
+ return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(pci_mmap_p2pmem);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index fc5de47eeac4..26fe40363d1c 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -40,6 +40,7 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
+int pci_mmap_p2pmem(struct pci_dev *pdev, struct vm_area_struct *vma);
#else /* CONFIG_PCI_P2PDMA */
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
@@ -116,6 +117,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
+static inline int pci_mmap_p2pmem(struct pci_dev *pdev,
+ struct vm_area_struct *vma)
+{
+ return -EOPNOTSUPP;
+}
#endif /* CONFIG_PCI_P2PDMA */
--
2.20.1
We make use of the top bit of the dma_length to indicate a P2PDMA
segment. Code that uses this will need to use sg_dma_p2pdma_len() to
get the length and ensure no lengths exceed 2GB.
An sg_dma_is_p2pdma() helper is included to check whether a particular
segment is a P2PDMA segment.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/scatterlist.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 36c47e7e66a2..e738159d56f9 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -39,6 +39,10 @@ struct scatterlist {
#define sg_dma_len(sg) ((sg)->length)
#endif
+#define SG_P2PDMA_FLAG (1U << 31)
+#define sg_dma_p2pdma_len(sg) (sg_dma_len(sg) & ~SG_P2PDMA_FLAG)
+#define sg_dma_is_p2pdma(sg) (sg_dma_len(sg) & SG_P2PDMA_FLAG)
+
struct sg_table {
struct scatterlist *sgl; /* the list */
unsigned int nents; /* number of mapped entries */
--
2.20.1
Switch to using sg_dma_p2pdma_len() in places where sg_dma_len() is
used. Then replace the calls to pci_p2pdma_[un]map_sg() with calls
to dma_[un]map_sg() with DMA_ATTR_P2PDMA.
This should be equivalent, though coverage is somewhat reduced
(only dma-direct and dma-iommu are currently supported).
Using DMA_ATTR_P2PDMA is safe because the block layer restricts
requests to be much less than 2GB, thus there's no way for a
segment to be greater than 2GB.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/nvme/host/pci.c | 30 ++++++++++++------------------
1 file changed, 12 insertions(+), 18 deletions(-)
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index ef7ce464a48d..26976bdf4af0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -528,12 +528,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
WARN_ON_ONCE(!iod->nents);
- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req));
- else
- dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
-
+ dma_unmap_sg_attrs(dev->dev, iod->sg, iod->nents, rq_dma_dir(req),
+ DMA_ATTR_P2PDMA);
if (iod->npages == 0)
dma_pool_free(dev->prp_small_pool, nvme_pci_iod_list(req)[0],
@@ -570,7 +566,7 @@ static void nvme_print_sgl(struct scatterlist *sgl, int nents)
pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
"dma_address:%pad dma_length:%d\n",
i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
+ sg_dma_p2pdma_len(sg));
}
}
@@ -581,7 +577,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
struct dma_pool *pool;
int length = blk_rq_payload_bytes(req);
struct scatterlist *sg = iod->sg;
- int dma_len = sg_dma_len(sg);
+ int dma_len = sg_dma_p2pdma_len(sg);
u64 dma_addr = sg_dma_address(sg);
int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
__le64 *prp_list;
@@ -601,7 +597,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
} else {
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ dma_len = sg_dma_p2pdma_len(sg);
}
if (length <= NVME_CTRL_PAGE_SIZE) {
@@ -650,7 +646,7 @@ static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
goto bad_sgl;
sg = sg_next(sg);
dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
+ dma_len = sg_dma_p2pdma_len(sg);
}
done:
@@ -670,7 +666,7 @@ static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
struct scatterlist *sg)
{
sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->length = cpu_to_le32(sg_dma_p2pdma_len(sg));
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}
@@ -814,14 +810,12 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
if (!iod->nents)
goto out;
- if (is_pci_p2pdma_page(sg_page(iod->sg)))
- nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
- iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
- else
- nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
- rq_dma_dir(req), DMA_ATTR_NO_WARN);
- if (!nr_mapped)
+ nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
+ rq_dma_dir(req), DMA_ATTR_NO_WARN | DMA_ATTR_P2PDMA);
+ if (!nr_mapped) {
+ ret = BLK_STS_IOERR;
goto out;
+ }
iod->use_sgl = nvme_pci_use_sgls(dev, req);
if (iod->use_sgl)
--
2.20.1
Add a flags member to the dma_map_ops structure with one flag to
indicate support for PCI P2PDMA.
Also, add a helper to check if a device supports PCI P2PDMA.
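For illustration, a DMA ops provider that can handle P2PDMA pages would
advertise the new flag, and a driver would gate its P2PDMA usage on the
helper. The snippet below is only a sketch with hypothetical names;
patch 8 of the series ("nvme-pci: Check DMA ops when indicating support
for PCI P2PDMA") does the real driver-side check:

#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>

/* Sketch only: an ops provider advertising P2PDMA support ... */
static const struct dma_map_ops example_dma_ops = {
	.flags	= DMA_F_PCI_P2PDMA_SUPPORTED,
	/* .map_sg, .unmap_sg and the rest of the ops as usual */
};

/* ... and a driver checking the helper before offering P2PDMA */
static bool example_supports_p2pdma(struct device *dev)
{
	return IS_ENABLED(CONFIG_PCI_P2PDMA) && dma_pci_p2pdma_supported(dev);
}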
Signed-off-by: Logan Gunthorpe <[email protected]>
---
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 5 +++++
kernel/dma/mapping.c | 8 ++++++++
3 files changed, 16 insertions(+)
diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index a5f89fc4d6df..8c12dfa43ce1 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -12,6 +12,9 @@
struct cma;
struct dma_map_ops {
+ unsigned int flags;
+#define DMA_F_PCI_P2PDMA_SUPPORTED (1 << 0)
+
void *(*alloc)(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t gfp,
unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 8d028e15b531..3646c6ba6a00 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -149,6 +149,7 @@ int dma_mmap_attrs(struct device *dev, struct vm_area_struct *vma,
unsigned long attrs);
bool dma_can_mmap(struct device *dev);
int dma_supported(struct device *dev, u64 mask);
+int dma_pci_p2pdma_supported(struct device *dev);
int dma_set_mask(struct device *dev, u64 mask);
int dma_set_coherent_mask(struct device *dev, u64 mask);
u64 dma_get_required_mask(struct device *dev);
@@ -244,6 +245,10 @@ static inline int dma_supported(struct device *dev, u64 mask)
{
return 0;
}
+static inline int dma_pci_p2pdma_supported(struct device *dev)
+{
+ return 0;
+}
static inline int dma_set_mask(struct device *dev, u64 mask)
{
return -EIO;
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..192418ef43f8 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -567,6 +567,14 @@ int dma_supported(struct device *dev, u64 mask)
}
EXPORT_SYMBOL(dma_supported);
+int dma_pci_p2pdma_supported(struct device *dev)
+{
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ return !ops || ops->flags & DMA_F_PCI_P2PDMA_SUPPORTED;
+}
+EXPORT_SYMBOL(dma_pci_p2pdma_supported);
+
#ifdef CONFIG_ARCH_HAS_DMA_SET_MASK
void arch_dma_set_mask(struct device *dev, u64 mask);
#else
--
2.20.1
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_alloc_flags(). This allows PCI P2PDMA pages to be
passed from userspace and enables NVMe passthrough requests to use
P2PDMA pages.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/blk-map.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 21630dccac62..cad8fcfe4f17 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -245,6 +245,7 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
{
unsigned int max_sectors = queue_max_hw_sectors(rq->q);
struct bio *bio, *bounce_bio;
+ unsigned int flags = 0;
int ret;
int j;
@@ -256,13 +257,17 @@ static int bio_map_user_iov(struct request *rq, struct iov_iter *iter,
return -ENOMEM;
bio->bi_opf |= req_op(rq);
+ if (blk_queue_pci_p2pdma(rq->q))
+ flags |= FOLL_PCI_P2PDMA;
+
while (iov_iter_count(iter)) {
struct page **pages;
ssize_t bytes;
size_t offs, added = 0;
int npages;
- bytes = iov_iter_get_pages_alloc(iter, &pages, LONG_MAX, &offs);
+ bytes = iov_iter_get_pages_alloc_flags(iter, &pages, LONG_MAX,
+ &offs, flags);
if (unlikely(bytes <= 0)) {
ret = bytes ? bytes : -EFAULT;
goto out_unmap;
--
2.20.1
Introduce pci_p2pdma_should_map_bus() which is meant to be called by
dma map functions to determine how to map a given p2pdma page.
pci_p2pdma_bus_offset() is also added to allow callers to get the bus
offset if they need to map the bus address.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 46 ++++++++++++++++++++++++++++++++++++++
include/linux/pci-p2pdma.h | 11 +++++++++
2 files changed, 57 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index ea8472278b11..9961e779f430 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
}
EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
+/**
+ * pci_p2pdma_bus_offset - returns the bus offset for a given page
+ * @page: page to get the offset for
+ *
+ * Must be passed a pci p2pdma page.
+ */
+u64 pci_p2pdma_bus_offset(struct page *page)
+{
+ struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
+
+ WARN_ON(!is_pci_p2pdma_page(page));
+
+ return p2p_pgmap->bus_offset;
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
+
+/**
+ * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
+ * bus address
+ * @dev: device doing the DMA request
+ * @pgmap: dev_pagemap structure for the mapping
+ *
+ * Returns 1 if the page should be mapped with a bus address, 0 otherwise,
+ * and -1 if the device should not be mapping P2PDMA pages.
+ */
+int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap)
+{
+ struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
+ struct pci_dev *client;
+
+ if (!dev_is_pci(dev))
+ return -1;
+
+ client = to_pci_dev(dev);
+
+ switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
+ case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
+ return 0;
+ case PCI_P2PDMA_MAP_BUS_ADDR:
+ return 1;
+ default:
+ return -1;
+ }
+}
+EXPORT_SYMBOL_GPL(pci_p2pdma_should_map_bus);
+
/**
* pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
* to enable p2pdma
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index 8318a97c9c61..fc5de47eeac4 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -34,6 +34,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs);
+u64 pci_p2pdma_bus_offset(struct page *page);
+int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap);
int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
@@ -83,6 +85,15 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
{
}
+static inline u64 pci_p2pdma_bus_offset(struct page *page)
+{
+ return -1;
+}
+static inline int pci_p2pdma_should_map_bus(struct device *dev,
+ struct dev_pagemap *pgmap)
+{
+ return -1;
+}
static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
struct scatterlist *sg, int nents, enum dma_data_direction dir,
unsigned long attrs)
--
2.20.1
When a bio's queue supports PCI P2PDMA, set FOLL_PCI_P2PDMA for
iov_iter_get_pages_flags(). This allows PCI P2PDMA pages to be passed
from userspace and enables the O_DIRECT path for iomap-based
filesystems and raw block devices.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
block/bio.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/block/bio.c b/block/bio.c
index fa01bef35bb1..6b28256a5575 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -997,6 +997,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
struct page **pages = (struct page **)bv;
bool same_page = false;
+ unsigned int flags = 0;
ssize_t size, left;
unsigned len, i;
size_t offset;
@@ -1009,7 +1010,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
- size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+ if (bio->bi_disk && blk_queue_pci_p2pdma(bio->bi_disk->queue))
+ flags |= FOLL_PCI_P2PDMA;
+
+ size = iov_iter_get_pages_flags(iter, pages, LONG_MAX, nr_pages,
+ &offset, flags);
if (unlikely(size <= 0))
return size ? size : -EFAULT;
--
2.20.1
Attempt to find the mapping type for P2PDMA pages on the first
DMA map attempt if it has not been done ahead of time.
Previously, the mapping type was expected to be calculated ahead of
time, but if pages come from userspace there is no way to ensure the
path was checked beforehand.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
drivers/pci/p2pdma.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 94583779c36e..ea8472278b11 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -824,11 +824,17 @@ EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct pci_dev *provider,
struct pci_dev *client)
{
+ enum pci_p2pdma_map_type ret;
+
if (!provider->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
- return xa_to_value(xa_load(&provider->p2pdma->map_types,
- map_types_idx(client)));
+ ret = xa_to_value(xa_load(&provider->p2pdma->map_types,
+ map_types_idx(client)));
+ if (ret != PCI_P2PDMA_MAP_UNKNOWN)
+ return ret;
+
+ return upstream_bridge_distance_warn(provider, client, NULL);
}
static int __pci_p2pdma_map_sg(struct pci_p2pdma_pagemap *p2p_pgmap,
@@ -890,7 +896,6 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
case PCI_P2PDMA_MAP_BUS_ADDR:
return __pci_p2pdma_map_sg(p2p_pgmap, dev, sg, nents);
default:
- WARN_ON_ONCE(1);
return 0;
}
}
--
2.20.1
On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> a hunk of p2pmem into userspace.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 6 +++
> 2 files changed, 110 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index 9961e779f430..8eab53ac59ae 100644
> +++ b/drivers/pci/p2pdma.c
> @@ -16,6 +16,7 @@
> #include <linux/genalloc.h>
> #include <linux/memremap.h>
> #include <linux/percpu-refcount.h>
> +#include <linux/pfn_t.h>
> #include <linux/random.h>
> #include <linux/seq_buf.h>
> #include <linux/xarray.h>
> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> return sprintf(page, "%s\n", pci_name(p2p_dev));
> }
> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> +
> +struct pci_p2pdma_map {
> + struct kref ref;
> + struct pci_dev *pdev;
> + void *kaddr;
> + size_t len;
> +};
Why have this at all? Nothing uses it and no vm_operations ops are
implemented?
This is very inflexible, it would be better if this is designed like
io_remap_pfn where it just preps and fills the VMA, doesn't take
ownership of the entire VMA
Jason
On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
>> a hunk of p2pmem into userspace.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
>> include/linux/pci-p2pdma.h | 6 +++
>> 2 files changed, 110 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index 9961e779f430..8eab53ac59ae 100644
>> +++ b/drivers/pci/p2pdma.c
>> @@ -16,6 +16,7 @@
>> #include <linux/genalloc.h>
>> #include <linux/memremap.h>
>> #include <linux/percpu-refcount.h>
>> +#include <linux/pfn_t.h>
>> #include <linux/random.h>
>> #include <linux/seq_buf.h>
>> #include <linux/xarray.h>
>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>> return sprintf(page, "%s\n", pci_name(p2p_dev));
>> }
>> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
>> +
>> +struct pci_p2pdma_map {
>> + struct kref ref;
>> + struct pci_dev *pdev;
>> + void *kaddr;
>> + size_t len;
>> +};
>
> Why have this at all? Nothing uses it and no vm_operations ops are
> implemented?
It's necessary to free the allocated p2pmem when the mapping is torn down.
> This is very inflexible, it would be better if this is designed like
> io_remap_pfn where it just preps and fills the VMA, doesn't take
> ownership of the entire VMA
If someone wants to manage their own P2P memory and create their own
userspace mapping with vmf_insert_mixed they are free to do so. But this
helper is specifically for users of pci_p2pdma_map_alloc(). I know you
don't intend to use that but it doesn't make it any less valid.
Logan
On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
>
>
> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> >> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> >> a hunk of p2pmem into userspace.
> >>
> >> Signed-off-by: Logan Gunthorpe <[email protected]>
> >> drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
> >> include/linux/pci-p2pdma.h | 6 +++
> >> 2 files changed, 110 insertions(+)
> >>
> >> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >> index 9961e779f430..8eab53ac59ae 100644
> >> +++ b/drivers/pci/p2pdma.c
> >> @@ -16,6 +16,7 @@
> >> #include <linux/genalloc.h>
> >> #include <linux/memremap.h>
> >> #include <linux/percpu-refcount.h>
> >> +#include <linux/pfn_t.h>
> >> #include <linux/random.h>
> >> #include <linux/seq_buf.h>
> >> #include <linux/xarray.h>
> >> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> >> return sprintf(page, "%s\n", pci_name(p2p_dev));
> >> }
> >> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> >> +
> >> +struct pci_p2pdma_map {
> >> + struct kref ref;
> >> + struct pci_dev *pdev;
> >> + void *kaddr;
> >> + size_t len;
> >> +};
> >
> > Why have this at all? Nothing uses it and no vm_operations ops are
> > implemented?
>
> It's necessary to free the allocated p2pmem when the mapping is torn down.
That's suspicious.. Once in a VMA the lifetime of the page must be
controlled by the page refcount, it can't be put back into the genpool
just because the vma was destroyed.
Jason
On 2020-11-06 10:42 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
>>> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
>>>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
>>>> a hunk of p2pmem into userspace.
>>>>
>>>> Signed-off-by: Logan Gunthorpe <[email protected]>
>>>> drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
>>>> include/linux/pci-p2pdma.h | 6 +++
>>>> 2 files changed, 110 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>>>> index 9961e779f430..8eab53ac59ae 100644
>>>> +++ b/drivers/pci/p2pdma.c
>>>> @@ -16,6 +16,7 @@
>>>> #include <linux/genalloc.h>
>>>> #include <linux/memremap.h>
>>>> #include <linux/percpu-refcount.h>
>>>> +#include <linux/pfn_t.h>
>>>> #include <linux/random.h>
>>>> #include <linux/seq_buf.h>
>>>> #include <linux/xarray.h>
>>>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
>>>> return sprintf(page, "%s\n", pci_name(p2p_dev));
>>>> }
>>>> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
>>>> +
>>>> +struct pci_p2pdma_map {
>>>> + struct kref ref;
>>>> + struct pci_dev *pdev;
>>>> + void *kaddr;
>>>> + size_t len;
>>>> +};
>>>
>>> Why have this at all? Nothing uses it and no vm_operations ops are
>>> implemented?
>>
>> It's necessary to free the allocated p2pmem when the mapping is torn down.
>
> That's suspicious.. Once in a VMA the lifetime of the page must be
> controlled by the page refcount, it can't be put back into the genpool
> just because the vma was destroyed.
Ah, hmm, yes. I guess the pages have to be hooked and returned to the
genalloc through free_devmap_managed_page(). Seems like it might be
doable... but it will complicate things for users that don't want to use
the genpool (though no such users exist upstream).
Logan
On Fri, Nov 06, 2020 at 10:53:45AM -0700, Logan Gunthorpe wrote:
>
>
> On 2020-11-06 10:42 a.m., Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 10:28:00AM -0700, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2020-11-06 10:22 a.m., Jason Gunthorpe wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:35AM -0700, Logan Gunthorpe wrote:
> >>>> Introduce pci_mmap_p2pmem() which is a helper to allocate and mmap
> >>>> a hunk of p2pmem into userspace.
> >>>>
> >>>> Signed-off-by: Logan Gunthorpe <[email protected]>
> >>>> drivers/pci/p2pdma.c | 104 +++++++++++++++++++++++++++++++++++++
> >>>> include/linux/pci-p2pdma.h | 6 +++
> >>>> 2 files changed, 110 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> >>>> index 9961e779f430..8eab53ac59ae 100644
> >>>> +++ b/drivers/pci/p2pdma.c
> >>>> @@ -16,6 +16,7 @@
> >>>> #include <linux/genalloc.h>
> >>>> #include <linux/memremap.h>
> >>>> #include <linux/percpu-refcount.h>
> >>>> +#include <linux/pfn_t.h>
> >>>> #include <linux/random.h>
> >>>> #include <linux/seq_buf.h>
> >>>> #include <linux/xarray.h>
> >>>> @@ -1055,3 +1056,106 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> >>>> return sprintf(page, "%s\n", pci_name(p2p_dev));
> >>>> }
> >>>> EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
> >>>> +
> >>>> +struct pci_p2pdma_map {
> >>>> + struct kref ref;
> >>>> + struct pci_dev *pdev;
> >>>> + void *kaddr;
> >>>> + size_t len;
> >>>> +};
> >>>
> >>> Why have this at all? Nothing uses it and no vm_operations ops are
> >>> implemented?
> >>
> >> It's necessary to free the allocated p2pmem when the mapping is torn down.
> >
> > That's suspicious.. Once in a VMA the lifetime of the page must be
> > controlled by the page refcount, it can't be put back into the genpool
> > just because the vma was destroyed.
>
> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
> genalloc through free_devmap_managed_page().
That sounds about right, but in this case it doesn't need the VMA
operations.
> Seems like it might be doable... but it will complicate things for
> users that don't want to use the genpool (though no such users exist
> upstream).
I would like to use this stuff in RDMA pretty much immediately and the
genpool is harmful for those cases, so please don't make decisions
that are tying things to genpool
Jason
On 2020-11-06 11:09 a.m., Jason Gunthorpe wrote:
>> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
>> genalloc through free_devmap_managed_page().
>
> That sounds about right, but in this case it doesn't need the VMA
> operations.
>
>> Seems like it might be doable... but it will complicate things for
>> users that don't want to use the genpool (though no such users exist
>> upstream).
>
> I would like to use this stuff in RDMA pretty much immediately and the
> genpool is harmful for those cases, so please don't make decisions
> that are tying things to genpool
I certainly can't make decisions for code that isn't currently upstream.
So you will almost certainly have to make changes for the code you want
to add, as is the standard procedure. I can't and should not export APIs
that you might need that have no upstream users, but you are certainly
free to send patches that create them when you add the use case.
Ultimately, if you aren't using the genpool you will have to implement
your own mmap operation that somehow allocates the pages and your own
page_free hook. The latter can be accommodated by a patch that
splits off pci_p2pdma_add_resource() into a function that doesn't use
the genpool (I've already seen two independent developers create a
similar patch for this, but with no upstream user they couldn't be
taken upstream).
I also don't expect this to be going upstream in the near term so don't
get too excited about using it.
Logan
On Fri, Nov 06, 2020 at 11:20:05AM -0700, Logan Gunthorpe wrote:
>
>
> On 2020-11-06 11:09 a.m., Jason Gunthorpe wrote:
> >> Ah, hmm, yes. I guess the pages have to be hooked and returned to the
> >> genalloc through free_devmap_managed_page().
> >
> > That sounds about right, but in this case it doesn't need the VMA
> > operations.
> >
> >> Seems like it might be doable... but it will complicate things for
> >> users that don't want to use the genpool (though no such users exist
> >> upstream).
> >
> > I would like to use this stuff in RDMA pretty much immediately and the
> > genpool is harmful for those cases, so please don't make decisions
> > that are tying things to genpool
>
> I certainly can't make decisions for code that isn't currently
> upstream.
The rdma drivers are all upstream, what are you thinking about?
> Ultimately, if you aren't using the genpool you will have to implement
> your own mmap operation that somehow allocates the pages and your own
> page_free hook.
Sure, the mlx5 driver already has a specialized allocator for its BAR
pages.
> I also don't expect this to be going upstream in the near term so don't
> get too excited about using it.
I don't know, it is actually not that horrible, the GUP and IOMMU
related changes are simpler than I expected
Jason
On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
>> I certainly can't make decisions for code that isn't currently
>> upstream.
>
> The rdma drivers are all upstream, what are you thinking about?
Really? I feel like you should know what I mean here...
I mean upstream code that actually uses the APIs that I'd have to
introduce. I can't say here's an API feature that no code uses but the
already upstream rdma driver might use eventually. It's fairly easy to
send patches that make the necessary changes when someone adds a use of
those changes inside the rdma code.
>> Ultimately, if you aren't using the genpool you will have to implement
>> your own mmap operation that somehow allocates the pages and your own
>> page_free hook.
>
> Sure, the mlx5 driver already has a specialized allocator for its BAR
> pages.
So it *might* make sense to carve out a common helper to setup a VMA for
P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
there's no code that would be common to the two cases.
>> I also don't expect this to be going upstream in the near term so don't
>> get too excited about using it.
>
> I don't know, it is actually not that horrible, the GUP and IOMMU
> related changes are simpler than I expected
I think the deal breaker is the SGL hack and the fact that there are
important IOMMU implementations that won't have support.
Logan
On Fri, Nov 06, 2020 at 12:44:59PM -0700, Logan Gunthorpe wrote:
>
>
> On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
> >> I certainly can't make decisions for code that isn't currently
> >> upstream.
> >
> > The rdma drivers are all upstream, what are you thinking about?
>
> Really? I feel like you should know what I mean here...
>
> I mean upstream code that actually uses the APIs that I'd have to
> introduce. I can't say here's an API feature that no code uses but the
> already upstream rdma driver might use eventually. It's fairly easy to
> send patches that make the necessary changes when someone adds a use of
> those changes inside the rdma code.
Sure, but that doesn't mean you have to actively design things to be
unusable beyond this narrow case. The RDMA drivers are all there, all
upstream, if this work is accepted then the changes to insert P2P
pages into their existing mmap flows is a reasonable usecase to
consider at this point when building core code APIs.
You shouldn't add dead code, but at least have a design in mind for
what it needs to look like and some allowance.
> >> Ultimately, if you aren't using the genpool you will have to implement
> >> your own mmap operation that somehow allocates the pages and your own
> >> page_free hook.
> >
> > Sure, the mlx5 driver already has a specialized allocator for its BAR
> > pages.
>
> So it *might* make sense to carve out a common helper to setup a VMA for
> P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
> there's no code that would be common to the two cases.
I think the whole insertion of P2PDMA pages into a VMA should be
similar to io_remap_pfn() so all the VM flags, pgprot_decrypted and
other subtle details are all in one place. (also it needs a
pgprot_decrypted before doing vmf_insert, I just learned that this
month)
> >> I also don't expect this to be going upstream in the near term so don't
> >> get too excited about using it.
> >
> > I don't know, it is actually not that horrible, the GUP and IOMMU
> > related changes are simpler than I expected
>
> I think the deal breaker is the SGL hack and the fact that there are
> important IOMMU implementations that won't have support.
Yes, that is pretty hacky, maybe someone will have a better idea
Jason
On 2020-11-06 12:53 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 12:44:59PM -0700, Logan Gunthorpe wrote:
>>
>>
>> On 2020-11-06 12:30 p.m., Jason Gunthorpe wrote:
>>>> I certainly can't make decisions for code that isn't currently
>>>> upstream.
>>>
>>> The rdma drivers are all upstream, what are you thinking about?
>>
>> Really? I feel like you should know what I mean here...
>>
>> I mean upstream code that actually uses the APIs that I'd have to
>> introduce. I can't say here's an API feature that no code uses but the
>> already upstream rdma driver might use eventually. It's fairly easy to
>> send patches that make the necessary changes when someone adds a use of
>> those changes inside the rdma code.
>
> Sure, but that doesn't mean you have to actively design things to be
> unusable beyond this narrow case. The RDMA drivers are all there, all
> upstream, if this work is accepted then the changes to insert P2P
> pages into their existing mmap flows is a reasonable usecase to
> consider at this point when building core code APIs.
>
> You shouldn't add dead code, but at least have a design in mind for
> what it needs to look like and some allowance.
I don't see anything I've done that's at odds with that. You will still
need to make changes to the p2pdma code to implement your use case.
>>>> Ultimately, if you aren't using the genpool you will have to implement
>>>> your own mmap operation that somehow allocates the pages and your own
>>>> page_free hook.
>>>
>>> Sure, the mlx5 driver already has a specialized allocator for its BAR
>>> pages.
>>
>> So it *might* make sense to carve out a common helper to setup a VMA for
>> P2PDMA to do the vm_flags check and set VM_MIXEDMAP... but besides that,
>> there's no code that would be common to the two cases.
>
> I think the whole insertion of P2PDMA pages into a VMA should be
> similar to io_remap_pfn() so all the VM flags, pgprot_decrypted and
> other subtle details are all in one place. (also it needs a
> pgprot_decrypted before doing vmf_insert, I just learned that this
> month)
I don't think a function like that will work for the p2pmem use case. In
order to implement proper page freeing I expect I'll need a loop around
the allocator and vm_insert_mixed()... Something roughly like:
for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
ret = vmf_insert_mixed(vma, addr,
__pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
}
That way we can call pci_free_p2pmem() when a page's ref count goes to
zero. I suspect your use case will need to do something similar.
Logan
On Fri, Nov 06, 2020 at 01:03:26PM -0700, Logan Gunthorpe wrote:
> I don't think a function like that will work for the p2pmem use case. In
> order to implement proper page freeing I expect I'll need a loop around
> the allocator and vm_insert_mixed()... Something roughly like:
>
> for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
> vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
> ret = vmf_insert_mixed(vma, addr,
> __pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
> }
>
> That way we can call pci_free_p2pmem() when a page's ref count goes to
> zero. I suspect your use case will need to do something similar.
Yes, but I would say the pci_alloc_p2pmem() layer should be able to
free pages on a page-by-page basis so you don't have to do stuff like
the above.
There is often a lot of value in having physically contiguous addresses,
so allocating page by page as well seems poor.
Jason
On 2020-11-06 5:14 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 01:03:26PM -0700, Logan Gunthorpe wrote:
>> I don't think a function like that will work for the p2pmem use case. In
>> order to implement proper page freeing I expect I'll need a loop around
>> the allocator and vmf_insert_mixed()... Something roughly like:
>>
>> for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
>> 	vaddr = pci_alloc_p2pmem(pdev, PAGE_SIZE);
>> 	ret = vmf_insert_mixed(vma, addr,
>> 			__pfn_to_pfn_t(virt_to_pfn(vaddr), PFN_DEV | PFN_MAP));
>> }
>>
>> That way we can call pci_free_p2pmem() when a page's ref count goes to
>> zero. I suspect your use case will need to do something similar.
>
> Yes, but I would say the pci_alloc_p2pmem() layer should be able to
> free pages on a page-by-page basis so you don't have to do stuff like
> the above.
>
> There is often a lot of value in having physically contiguous addresses,
> so allocating page by page as well seems poor.
Agreed. But I'll have to dig to see if genalloc supports freeing blocks
in sizes different from the original allocations.
Logan
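
The pattern being discussed would look roughly like the sketch below,
assuming pci_alloc_p2pmem()/pci_free_p2pmem() (which sit on top of
genalloc) tolerate freeing a sub-range of a larger allocation; that
tolerance is exactly the open question above. Error handling is omitted.

	size_t len = vma->vm_end - vma->vm_start;
	unsigned long addr;
	void *vaddr;

	/* one physically contiguous chunk for the whole VMA */
	vaddr = pci_alloc_p2pmem(pdev, len);

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE)
		vmf_insert_mixed(vma, addr,
			__pfn_to_pfn_t(virt_to_pfn(vaddr + (addr - vma->vm_start)),
				       PFN_DEV | PFN_MAP));

	/* ...and later, from the page_free hook, return one page of it: */
	pci_free_p2pmem(pdev, page_to_virt(page), PAGE_SIZE);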
On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> We make use of the top bit of the dma_length to indicate a P2PDMA
> segment.
I don't think "we" can. There is nothing limiting the size of a SGL
segment.
On 2020-11-09 09:12, Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>> We make use of the top bit of the dma_length to indicate a P2PDMA
>> segment.
>
> I don't think "we" can. There is nothing limiting the size of a SGL
> segment.
Right, the story behind ab2cbeb0ed30 ("iommu/dma: Handle SG length
overflow better") comes immediately to mind, for one. If all the P2P
users can agree to be in on the game then by all means implement this in
the P2P code, but I don't think it belongs in the generic top-level
scatterlist API.
Robin.
On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>> We make use of the top bit of the dma_length to indicate a P2PDMA
>> segment.
>
> I don't think "we" can. There is nothing limiting the size of a SGL
> segment.
Yes, I expected this would be the unacceptable part. Any alternative ideas?
Logan
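
To make the objection concrete, the encoding being debated amounts to
something like the following (the helper names here are illustrative,
not necessarily the ones used in the series):

/*
 * The top bit of the mapped length doubles as "this segment is P2PDMA".
 * The problem Christoph and Robin point out: nothing limits an SGL
 * segment's DMA length to below 2 GB (see ab2cbeb0ed30), so a large
 * ordinary segment can collide with the flag bit.
 */
#define SG_P2PDMA_FLAG		(1U << 31)

static inline void sg_dma_mark_p2pdma(struct scatterlist *sg)
{
	sg_dma_len(sg) |= SG_P2PDMA_FLAG;
}

static inline bool sg_dma_is_p2pdma(struct scatterlist *sg)
{
	return sg_dma_len(sg) & SG_P2PDMA_FLAG;
}

static inline unsigned int sg_dma_real_len(struct scatterlist *sg)
{
	return sg_dma_len(sg) & ~SG_P2PDMA_FLAG;
}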
On Fri, Nov 06, 2020 at 10:00:24AM -0700, Logan Gunthorpe wrote:
> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
> dma map functions to determine how to map a given p2pdma page.
s/dma/DMA/ for consistency (also below in function comment)
> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
> offset if they need to map the bus address.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> drivers/pci/p2pdma.c | 46 ++++++++++++++++++++++++++++++++++++++
> include/linux/pci-p2pdma.h | 11 +++++++++
> 2 files changed, 57 insertions(+)
>
> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
> index ea8472278b11..9961e779f430 100644
> --- a/drivers/pci/p2pdma.c
> +++ b/drivers/pci/p2pdma.c
> @@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> }
> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>
> +/**
> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
> + * @page: page to get the offset for
> + *
> + * Must be passed a pci p2pdma page.
s/pci/PCI/
> + */
> +u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
> +
> + WARN_ON(!is_pci_p2pdma_page(page));
> +
> + return p2p_pgmap->bus_offset;
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
> +
> +/**
> + * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
> + * bus address
> + * @dev: device doing the DMA request
> + * @pgmap: dev_pagemap structure for the mapping
> + *
> + * Returns 1 if the page should be mapped with a bus address, 0 otherwise
> + * and -1 the device should not be mapping P2PDMA pages.
I think this is missing a word.
I'm not really sure how to interpret the "should" in
pci_p2pdma_should_map_bus(). If this returns -1, does that mean the
pages *cannot* be mapped? They *could* be mapped, but you really
*shouldn't*? Something else?
1 means page should be mapped with bus address. 0 means ... what,
exactly? It should be mapped with some different address?
Sorry these are naive questions because I don't know how all this
works.
> + */
> +int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap)
> +{
> + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(pgmap);
> + struct pci_dev *client;
> +
> + if (!dev_is_pci(dev))
> + return -1;
> +
> + client = to_pci_dev(dev);
> +
> + switch (pci_p2pdma_map_type(p2p_pgmap->provider, client)) {
> + case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
> + return 0;
> + case PCI_P2PDMA_MAP_BUS_ADDR:
> + return 1;
> + default:
> + return -1;
> + }
> +}
> +EXPORT_SYMBOL_GPL(pci_p2pdma_should_map_bus);
> +
> /**
> * pci_p2pdma_enable_store - parse a configfs/sysfs attribute store
> * to enable p2pdma
> diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
> index 8318a97c9c61..fc5de47eeac4 100644
> --- a/include/linux/pci-p2pdma.h
> +++ b/include/linux/pci-p2pdma.h
> @@ -34,6 +34,8 @@ int pci_p2pdma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
> int nents, enum dma_data_direction dir, unsigned long attrs);
> +u64 pci_p2pdma_bus_offset(struct page *page);
> +int pci_p2pdma_should_map_bus(struct device *dev, struct dev_pagemap *pgmap);
> int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
> bool *use_p2pdma);
> ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
> @@ -83,6 +85,15 @@ static inline void pci_p2pmem_free_sgl(struct pci_dev *pdev,
> static inline void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
> {
> }
> +static inline u64 pci_p2pdma_bus_offset(struct page *page)
> +{
> + return -1;
> +}
> +static inline int pci_p2pdma_should_map_bus(struct device *dev,
> + struct dev_pagemap *pgmap)
> +{
> + return -1;
> +}
> static inline int pci_p2pdma_map_sg_attrs(struct device *dev,
> struct scatterlist *sg, int nents, enum dma_data_direction dir,
> unsigned long attrs)
> --
> 2.20.1
>
On 2020-11-10 4:25 p.m., Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:00:24AM -0700, Logan Gunthorpe wrote:
>> Introduce pci_p2pdma_should_map_bus() which is meant to be called by
>> dma map functions to determine how to map a given p2pdma page.
>
> s/dma/DMA/ for consistency (also below in function comment)
>
>> pci_p2pdma_bus_offset() is also added to allow callers to get the bus
>> offset if they need to map the bus address.
>>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> drivers/pci/p2pdma.c | 46 ++++++++++++++++++++++++++++++++++++++
>> include/linux/pci-p2pdma.h | 11 +++++++++
>> 2 files changed, 57 insertions(+)
>>
>> diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
>> index ea8472278b11..9961e779f430 100644
>> --- a/drivers/pci/p2pdma.c
>> +++ b/drivers/pci/p2pdma.c
>> @@ -930,6 +930,52 @@ void pci_p2pdma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
>> }
>> EXPORT_SYMBOL_GPL(pci_p2pdma_unmap_sg_attrs);
>>
>> +/**
>> + * pci_p2pdma_bus_offset - returns the bus offset for a given page
>> + * @page: page to get the offset for
>> + *
>> + * Must be passed a pci p2pdma page.
>
> s/pci/PCI/
>
>> + */
>> +u64 pci_p2pdma_bus_offset(struct page *page)
>> +{
>> + struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page->pgmap);
>> +
>> + WARN_ON(!is_pci_p2pdma_page(page));
>> +
>> + return p2p_pgmap->bus_offset;
>> +}
>> +EXPORT_SYMBOL_GPL(pci_p2pdma_bus_offset);
>> +
>> +/**
>> + * pci_p2pdma_should_map_bus - determine if a dma mapping should use the
>> + * bus address
>> + * @dev: device doing the DMA request
>> + * @pgmap: dev_pagemap structure for the mapping
>> + *
>> + * Returns 1 if the page should be mapped with a bus address, 0 otherwise
>> + * and -1 the device should not be mapping P2PDMA pages.
>
> I think this is missing a word.
>
> I'm not really sure how to interpret the "should" in
> pci_p2pdma_should_map_bus(). If this returns -1, does that mean the
> pages *cannot* be mapped? They *could* be mapped, but you really
> *shouldn't*? Something else?
>
> 1 means page should be mapped with bus address. 0 means ... what,
> exactly? It should be mapped with some different address?
1 means it must be mapped with a bus address
0 means it may be mapped normally (through the IOMMU or just with a
direct physical address)
-1 means it cannot be mapped and should fail (i.e. if it must go through
the IOMMU, but the IOMMU is not in the whitelist).
> Sorry these are naive questions because I don't know how all this
> works.
Thanks for the review. Definitely points out some questionable language
that I used. I'll reword this if/when it goes further.
Logan
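
For illustration, a map_sg implementation would consume that tri-state
roughly as follows. This is a sketch of the intent, not the actual
dma-direct/dma-iommu patches; the helper name map_p2pdma_segment() is
made up here, and the bus_offset arithmetic follows the existing
bookkeeping in p2pdma.c.

/*
 * Returns 1 when the segment was mapped with a bus address, 0 when it
 * should take the normal IOMMU/dma-direct mapping path, and a negative
 * errno when the mapping must fail.
 */
static int map_p2pdma_segment(struct device *dev, struct scatterlist *sg)
{
	switch (pci_p2pdma_should_map_bus(dev, sg_page(sg)->pgmap)) {
	case 1:	/* peer-to-peer: program the PCI bus address directly */
		sg->dma_address = sg_phys(sg) +
			pci_p2pdma_bus_offset(sg_page(sg));
		sg_dma_len(sg) = sg->length;
		return 1;
	case 0:	/* route through the host bridge: map normally */
		return 0;
	default: /* -1: this device may not map P2PDMA pages at all */
		return -EREMOTEIO;
	}
}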
On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> > On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >> We make use of the top bit of the dma_length to indicate a P2PDMA
> >> segment.
> >
> > I don't think "we" can. There is nothing limiting the size of a SGL
> > segment.
>
> Yes, I expected this would be the unacceptable part. Any alternative ideas?
Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
segment-pages for is_pci_p2pdma_page()?
On 2020-12-09 6:22 p.m., Dan Williams wrote:
> On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <[email protected]> wrote:
>>
>>
>>
>> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
>>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>>>> We make use of the top bit of the dma_length to indicate a P2PDMA
>>>> segment.
>>>
>>> I don't think "we" can. There is nothing limiting the size of a SGL
>>> segment.
>>
>> Yes, I expected this would be the unacceptable part. Any alternative ideas?
>
> Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> segment-pages for is_pci_p2pdma_page()?
Because the DMA and page segments in the SGL aren't necessarily aligned...
The IOMMU implementations can coalesce multiple pages into fewer DMA
address ranges, so the page pointed to by sg->page_link may not be the
one that corresponds to the address in sg->dma_address for a given segment.
If that makes sense -- it's not the easiest thing to explain.
Logan
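
In other words (an illustration, not code from the series), given a
scatterlist sgl built from pages and a device dev:

	/*
	 * The IOMMU path may coalesce physically adjacent pages, so the
	 * number of mapped (DMA) entries can be smaller than the number
	 * of page entries that went in.
	 */
	int nents = sg_nents(sgl);			/* page-side count */
	int dma_nents = dma_map_sg(dev, sgl, nents, DMA_BIDIRECTIONAL);

	/*
	 * dma_nents <= nents: the i-th sg_dma_address()/sg_dma_len() pair
	 * no longer corresponds to the i-th sg_page(), which is why the
	 * P2PDMA marker has to travel with the DMA side of the entry
	 * rather than be derived from the page.
	 */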
On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2020-12-09 6:22 p.m., Dan Williams wrote:
> > On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <[email protected]> wrote:
> >>
> >>
> >>
> >> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
> >>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
> >>>> We make use of the top bit of the dma_length to indicate a P2PDMA
> >>>> segment.
> >>>
> >>> I don't think "we" can. There is nothing limiting the size of a SGL
> >>> segment.
> >>
> >> Yes, I expected this would be the unacceptable part. Any alternative ideas?
> >
> > Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
> > segment-pages for is_pci_p2pdma_page()?
>
> Because the DMA and page segments in the SGL aren't necessarily aligned...
>
> The IOMMU implementations can coalesce multiple pages into fewer DMA
> address ranges, so the page pointed to by sg->page_link may not be the
> one that corresponds to the address in sg->dma_address for a given segment.
>
> If that makes sense -- it's not the easiest thing to explain.
It does...
Did someone already grab, or did you already consider the 3rd
available bit in page_link? AFAICS only SG_CHAIN and SG_END are
reserved. However, if you have a CONFIG_64BIT dependency for
user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
(0x4) in page_link.
On 2020-12-09 9:04 p.m., Dan Williams wrote:
> On Wed, Dec 9, 2020 at 6:07 PM Logan Gunthorpe <[email protected]> wrote:
>>
>>
>>
>> On 2020-12-09 6:22 p.m., Dan Williams wrote:
>>> On Mon, Nov 9, 2020 at 8:47 AM Logan Gunthorpe <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 2020-11-09 2:12 a.m., Christoph Hellwig wrote:
>>>>> On Fri, Nov 06, 2020 at 10:00:25AM -0700, Logan Gunthorpe wrote:
>>>>>> We make use of the top bit of the dma_length to indicate a P2PDMA
>>>>>> segment.
>>>>>
>>>>> I don't think "we" can. There is nothing limiting the size of a SGL
>>>>> segment.
>>>>
>>>> Yes, I expected this would be the unacceptable part. Any alternative ideas?
>>>
>>> Why is the SG_P2PDMA_FLAG needed as compared to checking the SGL
>>> segment-pages for is_pci_p2pdma_page()?
>>
>> Because the DMA and page segments in the SGL aren't necessarily aligned...
>>
>> The IOMMU implementations can coalesce multiple pages into fewer DMA
>> address ranges, so the page pointed to by sg->page_link may not be the
>> one that corresponds to the address in sg->dma_address for a given segment.
>>
>> If that makes sense -- it's not the easiest thing to explain.
>
> It does...
>
> Did someone already grab, or did you already consider the 3rd
> available bit in page_link? AFAICS only SG_CHAIN and SG_END are
> reserved. However, if you have a CONFIG_64BIT dependency for
> user-directed p2pdma that would seem to allow SG_P2PDMA_FLAG to be
> (0x4) in page_link.
Hmm, I half considered that, but I had come to the conclusion that, given
the misalignment, I shouldn't be using the page side of the SGL.
However, reconsidering now, that might actually be a reasonable option.
However, the CONFIG_64BIT dependency would have to be on all P2PDMA,
because we'd need to replace pci_p2pdma_map_sg() in all cases. I'm not
sure if this would be a restriction people care about.
Thanks,
Logan
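
A sketch of what Dan's suggestion would amount to, assuming the
CONFIG_64BIT dependency discussed above (the 0x4 encoding and the helper
names are illustrative):

/*
 * On 64-bit, struct page is at least 8-byte aligned, so page_link has
 * three low bits to spare: SG_CHAIN (0x1), SG_END (0x2) and,
 * hypothetically, SG_P2PDMA_FLAG (0x4).  This only works with a
 * CONFIG_64BIT dependency, which is the restriction noted above.
 */
#define SG_P2PDMA_FLAG	0x04UL

static inline void sg_mark_p2pdma(struct scatterlist *sg)
{
	sg->page_link |= SG_P2PDMA_FLAG;
}

static inline bool sg_is_p2pdma(struct scatterlist *sg)
{
	return sg->page_link & SG_P2PDMA_FLAG;
}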