2024-03-05 11:19:13

by Leon Romanovsky

Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

This is the complementary part to the proposed LSF/MM topic.
https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

This is posted as an RFC to get feedback on the proposed split. The
RDMA, VFIO and DMA patches are ready for review and inclusion; the NVMe
patches are still in progress as they require agreement on the API
first.

Thanks

-------------------------------------------------------------------------------
The DMA mapping operation performs two steps at the same time: it
allocates IOVA space and actually maps DMA pages to that space. This
one-shot operation works perfectly for simple scenarios, where callers
use the DMA API in the control path while setting up their hardware.

However, in more complex scenarios, when DMA mapping is needed in the
data path, and especially when some specific datatype is involved, such
a one-shot approach has its drawbacks.

That approach pushes developers to introduce new DMA APIs for each
specific datatype: for example, the existing scatter-gather mapping
functions, Chuck's latest RFC series adding biovec-related DMA mapping
[1], and probably struct folio, which will need one too.

These advanced DMA mapping APIs are needed to calculate the IOVA size
so it can be allocated as one chunk, and to perform some sort of offset
calculation to know which part of the IOVA to map.

Instead of teaching the DMA layer about these specific datatypes, let's
separate the existing DMA mapping routine into two steps and give
advanced callers (subsystems) an option to perform all calculations
internally in advance and map pages later, when needed.
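
To make the split concrete, below is a minimal usage sketch of the new
API (dma_alloc_iova()/dma_link_range() and friends added in patches 2-3).
It assumes a caller-provided "dev", "pages[]" and "npages"; error
unwinding is elided:

    struct dma_iova_attrs iova = {
            .dev  = dev,
            .size = (size_t)npages << PAGE_SHIFT,
            .dir  = DMA_BIDIRECTIONAL,
    };
    dma_addr_t dma;
    int i, ret;

    /* Control path: reserve the whole IOVA range once */
    ret = dma_alloc_iova(&iova);
    if (ret)
            return ret;

    /* Fast path: link pages into the preallocated IOVA */
    for (i = 0; i < npages; i++) {
            dma = dma_link_range(pages[i], 0, &iova,
                                 (dma_addr_t)i * PAGE_SIZE);
            if (dma_mapping_error(dev, dma))
                    break;  /* unwind elided */
            /* program "dma" into the device */
    }

    /* Teardown: unlink every page, then release the IOVA range */
    for (i = 0; i < npages; i++)
            dma_unlink_range(&iova, (dma_addr_t)i * PAGE_SIZE);
    dma_free_iova(&iova);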

In this series, three users are converted, and each conversion brings
a different gain:
1. RDMA simplifies and speeds up its pagefault handling for
on-demand-paging (ODP) mode.
2. VFIO PCI live migration code saves a huge chunk of memory.
3. NVMe PCI avoids intermediate SG table manipulation and operates
directly on BIOs.

Thanks

[1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net

Chaitanya Kulkarni (2):
block: add dma_link_range() based API
nvme-pci: use blk_rq_dma_map() for NVMe SGL

Leon Romanovsky (14):
mm/hmm: let users to tag specific PFNs
dma-mapping: provide an interface to allocate IOVA
dma-mapping: provide callbacks to link/unlink pages to specific IOVA
iommu/dma: Provide an interface to allow preallocate IOVA
iommu/dma: Prepare map/unmap page functions to receive IOVA
iommu/dma: Implement link/unlink page callbacks
RDMA/umem: Preallocate and cache IOVA for UMEM ODP
RDMA/umem: Store ODP access mask information in PFN
RDMA/core: Separate DMA mapping to caching IOVA and page linkage
RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
vfio/mlx5: Explicitly use number of pages instead of allocated length
vfio/mlx5: Rewrite create mkey flow to allow better code reuse
vfio/mlx5: Explicitly store page list
vfio/mlx5: Convert vfio to use DMA link API

Documentation/core-api/dma-attributes.rst | 7 +
block/blk-merge.c | 156 ++++++++++++++
drivers/infiniband/core/umem_odp.c | 219 +++++++------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 59 +++--
drivers/iommu/dma-iommu.c | 129 ++++++++---
drivers/nvme/host/pci.c | 220 +++++--------------
drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
drivers/vfio/pci/mlx5/cmd.h | 22 +-
drivers/vfio/pci/mlx5/main.c | 136 +++++-------
include/linux/blk-mq.h | 9 +
include/linux/dma-map-ops.h | 13 ++
include/linux/dma-mapping.h | 39 ++++
include/linux/hmm.h | 3 +
include/rdma/ib_umem_odp.h | 22 +-
include/rdma/ib_verbs.h | 54 +++++
kernel/dma/debug.h | 2 +
kernel/dma/direct.h | 7 +-
kernel/dma/mapping.c | 91 ++++++++
mm/hmm.c | 34 +--
20 files changed, 870 insertions(+), 605 deletions(-)

--
2.44.0



2024-03-05 11:19:41

by Leon Romanovsky

Subject: [RFC RESEND 01/16] mm/hmm: let users to tag specific PFNs

From: Leon Romanovsky <[email protected]>

Introduce a new sticky flag which isn't overwritten by an HMM range
fault. Such a flag allows users to tag specific PFNs with extra data,
in addition to the bits already filled in by HMM.
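
As a rough sketch of the intended use (with "npfns", locking and the
device-programming step assumed to be caller context), a user keeps
range.hmm_pfns[] around between faults and tags the entries it has
consumed:

    int i, ret;

    ret = hmm_range_fault(&range);
    if (ret)
            return ret;

    for (i = 0; i < npfns; i++) {
            if (!(range.hmm_pfns[i] & HMM_PFN_VALID))
                    continue;
            /* consume the page, e.g. program it into device tables */

            /* tag it; the bit survives later hmm_range_fault() calls */
            range.hmm_pfns[i] |= HMM_PFN_STICKY;
    }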

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/hmm.h | 3 +++
mm/hmm.c | 34 +++++++++++++++++++++-------------
2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 126a36571667..b90902baa593 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -23,6 +23,7 @@ struct mmu_interval_notifier;
* HMM_PFN_WRITE - if the page memory can be written to (requires HMM_PFN_VALID)
* HMM_PFN_ERROR - accessing the pfn is impossible and the device should
* fail. ie poisoned memory, special pages, no vma, etc
+ * HMM_PFN_STICKY - Flag preserved on input-to-output transformation
*
* On input:
* 0 - Return the current state of the page, do not fault it.
@@ -36,6 +37,8 @@ enum hmm_pfn_flags {
HMM_PFN_VALID = 1UL << (BITS_PER_LONG - 1),
HMM_PFN_WRITE = 1UL << (BITS_PER_LONG - 2),
HMM_PFN_ERROR = 1UL << (BITS_PER_LONG - 3),
+ /* Sticky flag, carried from input to output */
+ HMM_PFN_STICKY = 1UL << (BITS_PER_LONG - 7),
HMM_PFN_ORDER_SHIFT = (BITS_PER_LONG - 8),

/* Input flags */
diff --git a/mm/hmm.c b/mm/hmm.c
index 277ddcab4947..9645a72beec0 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -44,8 +44,10 @@ static int hmm_pfns_fill(unsigned long addr, unsigned long end,
{
unsigned long i = (addr - range->start) >> PAGE_SHIFT;

- for (; addr < end; addr += PAGE_SIZE, i++)
- range->hmm_pfns[i] = cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++) {
+ range->hmm_pfns[i] &= HMM_PFN_STICKY;
+ range->hmm_pfns[i] |= cpu_flags;
+ }
return 0;
}

@@ -202,8 +204,10 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
return hmm_vma_fault(addr, end, required_fault, walk);

pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ hmm_pfns[i] &= HMM_PFN_STICKY;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
return 0;
}
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -236,7 +240,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
goto fault;
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
return 0;
}

@@ -253,14 +257,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | swp_offset_pfn(entry) | cpu_flags;
return 0;
}

required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (!required_fault) {
- *hmm_pfn = 0;
+ *hmm_pfn = *hmm_pfn & HMM_PFN_STICKY;
return 0;
}

@@ -304,11 +308,11 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
pte_unmap(ptep);
return -EFAULT;
}
- *hmm_pfn = HMM_PFN_ERROR;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | HMM_PFN_ERROR;
return 0;
}

- *hmm_pfn = pte_pfn(pte) | cpu_flags;
+ *hmm_pfn = (*hmm_pfn & HMM_PFN_STICKY) | pte_pfn(pte) | cpu_flags;
return 0;

fault:
@@ -453,8 +457,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
}

pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- for (i = 0; i < npages; ++i, ++pfn)
- hmm_pfns[i] = pfn | cpu_flags;
+ for (i = 0; i < npages; ++i, ++pfn) {
+ hmm_pfns[i] &= HMM_PFN_STICKY;
+ hmm_pfns[i] |= pfn | cpu_flags;
+ }
goto out_unlock;
}

@@ -512,8 +518,10 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
}

pfn = pte_pfn(entry) + ((start & ~hmask) >> PAGE_SHIFT);
- for (; addr < end; addr += PAGE_SIZE, i++, pfn++)
- range->hmm_pfns[i] = pfn | cpu_flags;
+ for (; addr < end; addr += PAGE_SIZE, i++, pfn++) {
+ range->hmm_pfns[i] &= HMM_PFN_STICKY;
+ range->hmm_pfns[i] |= pfn | cpu_flags;
+ }

spin_unlock(ptl);
return 0;
--
2.44.0


2024-03-05 11:20:11

by Leon Romanovsky

Subject: [RFC RESEND 02/16] dma-mapping: provide an interface to allocate IOVA

From: Leon Romanovsky <[email protected]>

The existing .map_page() callback provides two things at the same time:
it allocates the IOVA and links the DMA pages. That combination works
great for most callers, who use it in control paths, but it is less
effective in fast paths.

These advanced callers already manage their data in some sort of
database and can perform IOVA allocation in advance, leaving the range
linkage operation in the fast path.

Provide an interface to allocate/deallocate the IOVA; the next patch
adds the interface to link/unlink DMA ranges to that specific IOVA.
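
A minimal sketch of the allocation side (with "dev" and "total_len"
assumed to come from the caller). Note that for dma-direct devices, or
for dma_map_ops without ->alloc_iova, the call succeeds with
iova.addr == 0 and the later per-range linking falls back to the
regular direct-mapping path:

    struct dma_iova_attrs iova = {
            .dev  = dev,
            .size = total_len,
            .dir  = DMA_TO_DEVICE,
    };
    int ret;

    ret = dma_alloc_iova(&iova);
    if (ret)
            return ret;

    /* ... link/unlink ranges into the IOVA, see the next patch ... */

    dma_free_iova(&iova);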

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/dma-map-ops.h | 3 +++
include/linux/dma-mapping.h | 20 ++++++++++++++++++++
kernel/dma/mapping.c | 30 ++++++++++++++++++++++++++++++
3 files changed, 53 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 4abc60f04209..bd605b44bb57 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -83,6 +83,9 @@ struct dma_map_ops {
size_t (*max_mapping_size)(struct device *dev);
size_t (*opt_mapping_size)(void);
unsigned long (*get_merge_boundary)(struct device *dev);
+
+ dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
+ void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
};

#ifdef CONFIG_DMA_OPS
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 4a658de44ee9..176fb8a86d63 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -91,6 +91,16 @@ static inline void debug_dma_map_single(struct device *dev, const void *addr,
}
#endif /* CONFIG_DMA_API_DEBUG */

+struct dma_iova_attrs {
+ /* OUT field */
+ dma_addr_t addr;
+ /* IN fields */
+ struct device *dev;
+ size_t size;
+ enum dma_data_direction dir;
+ unsigned long attrs;
+};
+
#ifdef CONFIG_HAS_DMA
static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{
@@ -101,6 +111,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
return 0;
}

+int dma_alloc_iova(struct dma_iova_attrs *iova);
+void dma_free_iova(struct dma_iova_attrs *iova);
+
dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs);
@@ -159,6 +172,13 @@ void dma_vunmap_noncontiguous(struct device *dev, void *vaddr);
int dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
size_t size, struct sg_table *sgt);
#else /* CONFIG_HAS_DMA */
+static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ return -EOPNOTSUPP;
+}
+static inline void dma_free_iova(struct dma_iova_attrs *iova)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 58db8fd70471..b6b27bab90f3 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -183,6 +183,36 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
}
EXPORT_SYMBOL(dma_unmap_page_attrs);

+int dma_alloc_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
+ iova->addr = 0;
+ return 0;
+ }
+
+ iova->addr = ops->alloc_iova(dev, iova->size);
+ if (dma_mapping_error(dev, iova->addr))
+ return -ENOMEM;
+
+ return 0;
+}
+EXPORT_SYMBOL(dma_alloc_iova);
+
+void dma_free_iova(struct dma_iova_attrs *iova)
+{
+ struct device *dev = iova->dev;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) || !ops->free_iova)
+ return;
+
+ ops->free_iova(dev, iova->addr, iova->size);
+}
+EXPORT_SYMBOL(dma_free_iova);
+
static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
--
2.44.0


2024-03-05 11:20:39

by Leon Romanovsky

Subject: [RFC RESEND 03/16] dma-mapping: provide callbacks to link/unlink pages to specific IOVA

From: Leon Romanovsky <[email protected]>

Introduce a new DMA link/unlink API to provide a way for advanced users
to directly map/unmap pages without the need to allocate an IOVA on
every map call.
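
As a sketch of the calling convention (assuming a populated "iova"
obtained from dma_alloc_iova() and a caller-owned "pages[]" array): the
size, direction and attrs travel in struct dma_iova_attrs, so each call
only supplies the page, the offset within it and the offset inside the
IOVA range, and the unlink uses the same dma_offset:

    addr = dma_link_range(pages[i], 0, &iova, (dma_addr_t)i * PAGE_SIZE);
    if (dma_mapping_error(iova.dev, addr))
            return -ENOMEM;

    /* ... */

    dma_unlink_range(&iova, (dma_addr_t)i * PAGE_SIZE);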

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/dma-map-ops.h | 10 +++++++
include/linux/dma-mapping.h | 13 +++++++++
kernel/dma/debug.h | 2 ++
kernel/dma/direct.h | 3 ++
kernel/dma/mapping.c | 57 +++++++++++++++++++++++++++++++++++++
5 files changed, 85 insertions(+)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index bd605b44bb57..fd03a080df1e 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -86,6 +86,13 @@ struct dma_map_ops {

dma_addr_t (*alloc_iova)(struct device *dev, size_t size);
void (*free_iova)(struct device *dev, dma_addr_t dma_addr, size_t size);
+ dma_addr_t (*link_range)(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
+ void (*unlink_range)(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs);
};

#ifdef CONFIG_DMA_OPS
@@ -428,6 +435,9 @@ bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
#define arch_dma_unmap_sg_direct(d, s, n) (false)
#endif

+#define arch_dma_link_range_direct arch_dma_map_page_direct
+#define arch_dma_unlink_range_direct arch_dma_unmap_page_direct
+
#ifdef CONFIG_ARCH_HAS_SETUP_DMA_OPS
void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
bool coherent);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 176fb8a86d63..91cc084adb53 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -113,6 +113,9 @@ static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)

int dma_alloc_iova(struct dma_iova_attrs *iova);
void dma_free_iova(struct dma_iova_attrs *iova);
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova, dma_addr_t dma_offset);
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset);

dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
@@ -179,6 +182,16 @@ static inline int dma_alloc_iova(struct dma_iova_attrs *iova)
static inline void dma_free_iova(struct dma_iova_attrs *iova)
{
}
+static inline dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ return DMA_MAPPING_ERROR;
+}
+static inline void dma_unlink_range(struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+}
static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
enum dma_data_direction dir, unsigned long attrs)
diff --git a/kernel/dma/debug.h b/kernel/dma/debug.h
index f525197d3cae..3d529f355c6d 100644
--- a/kernel/dma/debug.h
+++ b/kernel/dma/debug.h
@@ -127,4 +127,6 @@ static inline void debug_dma_sync_sg_for_device(struct device *dev,
{
}
#endif /* CONFIG_DMA_API_DEBUG */
+#define debug_dma_link_range debug_dma_map_page
+#define debug_dma_unlink_range debug_dma_unmap_page
#endif /* _KERNEL_DMA_DEBUG_H */
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 18d346118fe8..1c30e1cd607a 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -125,4 +125,7 @@ static inline void dma_direct_unmap_page(struct device *dev, dma_addr_t addr,
swiotlb_tbl_unmap_single(dev, phys, size, dir,
attrs | DMA_ATTR_SKIP_CPU_SYNC);
}
+
+#define dma_direct_link_range dma_direct_map_page
+#define dma_direct_unlink_range dma_direct_unmap_page
#endif /* _KERNEL_DMA_DIRECT_H */
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6b27bab90f3..f989c64622c2 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -213,6 +213,63 @@ void dma_free_iova(struct dma_iova_attrs *iova)
}
EXPORT_SYMBOL(dma_free_iova);

+/**
+ * dma_link_range - Link a physical page to DMA address
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @iova: Preallocated IOVA attributes
+ * @dma_offset: DMA offset from which this page needs to be linked
+ *
+ * dma_alloc_iova() allocates IOVA based on the size specified by the user in
+ * iova->size. Call this function after IOVA allocation to link @page from
+ * @offset to get the DMA address. Note that the very first call to this
+ * function will have @dma_offset set to 0 in the IOVA space allocated from
+ * dma_alloc_iova(). For subsequent calls to this function on the same
+ * @iova, @dma_offset needs to be advanced by the caller by the accumulated
+ * size of the pages previously linked by this function (i.e. the offset of
+ * the next page inside the IOVA range).
+ */
+dma_addr_t dma_link_range(struct page *page, unsigned long offset,
+ struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ size_t size = iova->size;
+ enum dma_data_direction dir = iova->dir;
+ unsigned long attrs = iova->attrs;
+ dma_addr_t addr = iova->addr + dma_offset;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) ||
+ arch_dma_link_range_direct(dev, page_to_phys(page) + offset + size))
+ addr = dma_direct_link_range(dev, page, offset, size, dir, attrs);
+ else if (ops->link_range)
+ addr = ops->link_range(dev, page, offset, addr, size, dir, attrs);
+
+ kmsan_handle_dma(page, offset, size, dir);
+ debug_dma_link_range(dev, page, offset, size, dir, addr, attrs);
+ return addr;
+}
+EXPORT_SYMBOL(dma_link_range);
+
+void dma_unlink_range(struct dma_iova_attrs *iova, dma_addr_t dma_offset)
+{
+ struct device *dev = iova->dev;
+ size_t size = iova->size;
+ enum dma_data_direction dir = iova->dir;
+ unsigned long attrs = iova->attrs;
+ dma_addr_t addr = iova->addr + dma_offset;
+ const struct dma_map_ops *ops = get_dma_ops(dev);
+
+ if (dma_map_direct(dev, ops) ||
+ arch_dma_unlink_range_direct(dev, addr + size))
+ dma_direct_unlink_range(dev, addr, size, dir, attrs);
+ else if (ops->unlink_range)
+ ops->unlink_range(dev, addr, size, dir, attrs);
+
+ debug_dma_unlink_range(dev, addr, size, dir);
+}
+EXPORT_SYMBOL(dma_unlink_range);
+
static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
int nents, enum dma_data_direction dir, unsigned long attrs)
{
--
2.44.0


2024-03-05 11:21:47

by Leon Romanovsky

Subject: [RFC RESEND 05/16] iommu/dma: Prepare map/unmap page functions to receive IOVA

From: Leon Romanovsky <[email protected]>

Extend the existing map_page/unmap_page function implementations to
accept a preallocated IOVA. In that case the IOVA allocation needs to
be skipped, but the rest of the code stays the same.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 68 ++++++++++++++++++++++++++-------------
1 file changed, 45 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index e55726783501..dbdd373a609a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -824,7 +824,7 @@ static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
}

static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
- size_t size)
+ size_t size, bool free_iova)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -843,17 +843,19 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,

if (!iotlb_gather.queued)
iommu_iotlb_sync(domain, &iotlb_gather);
- __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+ if (free_iova)
+ __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
}

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
- size_t size, int prot, u64 dma_mask)
+ dma_addr_t iova, size_t size, int prot,
+ u64 dma_mask)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
size_t iova_off = iova_offset(iovad, phys);
- dma_addr_t iova;
+ bool no_iova = !iova;

if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
iommu_deferred_attach(dev, domain))
@@ -861,12 +863,14 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,

size = iova_align(iovad, size + iova_off);

- iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+ if (no_iova)
+ iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;

if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
- __iommu_dma_free_iova(cookie, iova, size, NULL);
+ if (no_iova)
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -1031,7 +1035,7 @@ static void *iommu_dma_alloc_remap(struct device *dev, size_t size,
return vaddr;

out_unmap:
- __iommu_dma_unmap(dev, *dma_handle, size);
+ __iommu_dma_unmap(dev, *dma_handle, size, true);
__iommu_dma_free_pages(pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
return NULL;
}
@@ -1060,7 +1064,7 @@ static void iommu_dma_free_noncontiguous(struct device *dev, size_t size,
{
struct dma_sgt_handle *sh = sgt_handle(sgt);

- __iommu_dma_unmap(dev, sgt->sgl->dma_address, size);
+ __iommu_dma_unmap(dev, sgt->sgl->dma_address, size, true);
__iommu_dma_free_pages(sh->pages, PAGE_ALIGN(size) >> PAGE_SHIFT);
sg_free_table(&sh->sgt);
kfree(sh);
@@ -1131,9 +1135,11 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
arch_sync_dma_for_device(sg_phys(sg), sg->length, dir);
}

-static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
- unsigned long offset, size_t size, enum dma_data_direction dir,
- unsigned long attrs)
+static dma_addr_t __iommu_dma_map_pages(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t iova,
+ size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
{
phys_addr_t phys = page_to_phys(page) + offset;
bool coherent = dev_is_dma_coherent(dev);
@@ -1141,7 +1147,7 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
struct iommu_domain *domain = iommu_get_dma_domain(dev);
struct iommu_dma_cookie *cookie = domain->iova_cookie;
struct iova_domain *iovad = &cookie->iovad;
- dma_addr_t iova, dma_mask = dma_get_mask(dev);
+ dma_addr_t addr, dma_mask = dma_get_mask(dev);

/*
* If both the physical buffer start address and size are
@@ -1182,14 +1188,23 @@ static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
if (!coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
arch_sync_dma_for_device(phys, size, dir);

- iova = __iommu_dma_map(dev, phys, size, prot, dma_mask);
- if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
+ addr = __iommu_dma_map(dev, phys, iova, size, prot, dma_mask);
+ if (addr == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
- return iova;
+ return addr;
}

-static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
- size_t size, enum dma_data_direction dir, unsigned long attrs)
+static dma_addr_t iommu_dma_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __iommu_dma_map_pages(dev, page, offset, 0, size, dir, attrs);
+}
+
+static void __iommu_dma_unmap_pages(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs, bool free_iova)
{
struct iommu_domain *domain = iommu_get_dma_domain(dev);
phys_addr_t phys;
@@ -1201,12 +1216,19 @@ static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && !dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);

- __iommu_dma_unmap(dev, dma_handle, size);
+ __iommu_dma_unmap(dev, dma_handle, size, free_iova);

if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
}

+static void iommu_dma_unmap_page(struct device *dev, dma_addr_t dma_handle,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ __iommu_dma_unmap_pages(dev, dma_handle, size, dir, attrs, true);
+}
+
/*
* Prepare a successfully-mapped scatterlist to give back to the caller.
*
@@ -1509,13 +1531,13 @@ static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
}

if (end)
- __iommu_dma_unmap(dev, start, end - start);
+ __iommu_dma_unmap(dev, start, end - start, true);
}

static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
- return __iommu_dma_map(dev, phys, size,
+ return __iommu_dma_map(dev, phys, 0, size,
dma_info_to_prot(dir, false, attrs) | IOMMU_MMIO,
dma_get_mask(dev));
}
@@ -1523,7 +1545,7 @@ static dma_addr_t iommu_dma_map_resource(struct device *dev, phys_addr_t phys,
static void iommu_dma_unmap_resource(struct device *dev, dma_addr_t handle,
size_t size, enum dma_data_direction dir, unsigned long attrs)
{
- __iommu_dma_unmap(dev, handle, size);
+ __iommu_dma_unmap(dev, handle, size, true);
}

static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
@@ -1560,7 +1582,7 @@ static void __iommu_dma_free(struct device *dev, size_t size, void *cpu_addr)
static void iommu_dma_free(struct device *dev, size_t size, void *cpu_addr,
dma_addr_t handle, unsigned long attrs)
{
- __iommu_dma_unmap(dev, handle, size);
+ __iommu_dma_unmap(dev, handle, size, true);
__iommu_dma_free(dev, size, cpu_addr);
}

@@ -1626,7 +1648,7 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
if (!cpu_addr)
return NULL;

- *handle = __iommu_dma_map(dev, page_to_phys(page), size, ioprot,
+ *handle = __iommu_dma_map(dev, page_to_phys(page), 0, size, ioprot,
dev->coherent_dma_mask);
if (*handle == DMA_MAPPING_ERROR) {
__iommu_dma_free(dev, size, cpu_addr);
--
2.44.0


2024-03-05 11:22:10

by Leon Romanovsky

Subject: [RFC RESEND 06/16] iommu/dma: Implement link/unlink page callbacks

From: Leon Romanovsky <[email protected]>

Add an implementation of the link/unlink interface to map/unmap pages
in the fast path using a pre-allocated IOVA.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index dbdd373a609a..b683c4a4e9f8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1752,6 +1752,21 @@ static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
__iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
}

+static dma_addr_t iommu_dma_link_range(struct device *dev, struct page *page,
+ unsigned long offset, dma_addr_t iova,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ return __iommu_dma_map_pages(dev, page, offset, iova, size, dir, attrs);
+}
+
+static void iommu_dma_unlink_range(struct device *dev, dma_addr_t addr,
+ size_t size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+ __iommu_dma_unmap_pages(dev, addr, size, dir, attrs, false);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1776,6 +1791,8 @@ static const struct dma_map_ops iommu_dma_ops = {
.opt_mapping_size = iommu_dma_opt_mapping_size,
.alloc_iova = iommu_dma_alloc_iova,
.free_iova = iommu_dma_free_iova,
+ .link_range = iommu_dma_link_range,
+ .unlink_range = iommu_dma_unlink_range,
};

/*
--
2.44.0


2024-03-05 11:23:25

by Leon Romanovsky

Subject: [RFC RESEND 09/16] RDMA/core: Separate DMA mapping to caching IOVA and page linkage

From: Leon Romanovsky <[email protected]>

Reuse the newly added DMA API to cache the IOVA and to only link/unlink
pages in the fast path.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 57 ++----------------------------
drivers/infiniband/hw/mlx5/odp.c | 22 +++++++++++-
include/rdma/ib_umem_odp.h | 8 +----
include/rdma/ib_verbs.h | 36 +++++++++++++++++++
4 files changed, 61 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 3619fb78f786..1301009a6b78 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -81,20 +81,13 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
if (!umem_odp->pfn_list)
return -ENOMEM;

- umem_odp->dma_list = kvcalloc(
- ndmas, sizeof(*umem_odp->dma_list), GFP_KERNEL);
- if (!umem_odp->dma_list) {
- ret = -ENOMEM;
- goto out_pfn_list;
- }

umem_odp->iova.dev = dev->dma_device;
umem_odp->iova.size = end - start;
umem_odp->iova.dir = DMA_BIDIRECTIONAL;
ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
if (ret)
- goto out_dma_list;
-
+ goto out_pfn_list;

ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
@@ -107,8 +100,6 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,

out_free_iova:
ib_dma_free_iova(dev, &umem_odp->iova);
-out_dma_list:
- kvfree(umem_odp->dma_list);
out_pfn_list:
kvfree(umem_odp->pfn_list);
return ret;
@@ -288,7 +279,6 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
ib_dma_free_iova(dev, &umem_odp->iova);
- kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
put_pid(umem_odp->tgid);
@@ -296,40 +286,10 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
}
EXPORT_SYMBOL(ib_umem_odp_release);

-/*
- * Map for DMA and insert a single page into the on-demand paging page tables.
- *
- * @umem: the umem to insert the page to.
- * @dma_index: index in the umem to add the dma to.
- * @page: the page struct to map and add.
- * @access_mask: access permissions needed for this page.
- *
- * The function returns -EFAULT if the DMA mapping operation fails.
- *
- */
-static int ib_umem_odp_map_dma_single_page(
- struct ib_umem_odp *umem_odp,
- unsigned int dma_index,
- struct page *page)
-{
- struct ib_device *dev = umem_odp->umem.ibdev;
- dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];
-
- *dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
- DMA_BIDIRECTIONAL);
- if (ib_dma_mapping_error(dev, *dma_addr)) {
- *dma_addr = 0;
- return -EFAULT;
- }
- umem_odp->npages++;
- return 0;
-}
-
/**
* ib_umem_odp_map_dma_and_lock - DMA map userspace memory in an ODP MR and lock it.
*
* Maps the range passed in the argument to DMA addresses.
- * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
* Upon success the ODP MR will be locked to let caller complete its device
* page table update.
*
@@ -437,15 +397,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
__func__, hmm_order, page_shift);
break;
}
-
- ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
- if (ret < 0) {
- ibdev_dbg(umem_odp->umem.ibdev,
- "ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
- break;
- }
- range.hmm_pfns[pfn_index] |= HMM_PFN_STICKY;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -465,7 +416,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma;
int idx;
u64 addr;
struct ib_device *dev = umem_odp->umem.ibdev;
@@ -479,15 +429,14 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);

idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
- dma = umem_odp->dma_list[idx];

if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
continue;
if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_STICKY))
continue;

- ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
+ ib_dma_unlink_range(dev, &umem_odp->iova,
+ idx * (1 << umem_odp->page_shift));
if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
struct page *head_page = compound_head(page);
/*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5713fe25f4de..13d61f1ab40b 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -149,6 +149,7 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ struct ib_device *dev = odp->umem.ibdev;
unsigned long pfn;
dma_addr_t pa;
size_t i;
@@ -162,12 +163,31 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
/* Initial ODP init */
continue;

- pa = odp->dma_list[idx + i];
+ if (pfn & HMM_PFN_STICKY && odp->iova.addr)
+ /*
+ * We are in this flow when there is a need to resync flags,
+ * for example when page was already linked in prefetch call
+ * with READ flag and now we need to add WRITE flag
+ *
+ * This page was already programmed to HW and we don't want/need
+ * to unlink and link it again just to resync flags.
+ *
+ * The DMA address calculation below is based on the fact that
+ * RDMA UMEM doesn't work with swiotlb.
+ */
+ pa = odp->iova.addr + (idx + i) * (1 << odp->page_shift);
+ else
+ pa = ib_dma_link_range(dev, hmm_pfn_to_page(pfn), 0, &odp->iova,
+ (idx + i) * (1 << odp->page_shift));
+ WARN_ON_ONCE(ib_dma_mapping_error(dev, pa));
+
pa |= MLX5_IB_MTT_READ;
if ((pfn & HMM_PFN_WRITE) && !downgrade)
pa |= MLX5_IB_MTT_WRITE;

pas[i] = cpu_to_be64(pa);
+ odp->pfn_list[idx + i] |= HMM_PFN_STICKY;
+ odp->npages++;
}
}

diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 095b1297cfb1..a786556c65f9 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -17,15 +17,9 @@ struct ib_umem_odp {
/* An array of the pfns included in the on-demand paging umem. */
unsigned long *pfn_list;

- /*
- * An array with DMA addresses mapped for pfns in pfn_list.
- * The lower two bits designate access permissions.
- * See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
- */
- dma_addr_t *dma_list;
struct dma_iova_attrs iova;
/*
- * The umem_mutex protects the page_list and dma_list fields of an ODP
+ * The umem_mutex protects the page_list field of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
* also protects access to the mmu notifier counters.
*/
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index e71fa19187cc..c9e2bcd5268a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4160,6 +4160,42 @@ static inline void ib_dma_unmap_page(struct ib_device *dev,
dma_unmap_page(dev->dma_device, addr, size, direction);
}

+/**
+ * ib_dma_link_range - Link a physical page to DMA address
+ * @dev: The device for which the dma_addr is to be created
+ * @page: The page to be mapped
+ * @offset: The offset within the page
+ * @iova: Preallocated IOVA attributes
+ * @dma_offset: DMA offset
+ */
+static inline dma_addr_t ib_dma_link_range(struct ib_device *dev,
+ struct page *page,
+ unsigned long offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ if (ib_uses_virt_dma(dev))
+ return (uintptr_t)(page_address(page) + offset);
+
+ return dma_link_range(page, offset, iova, dma_offset);
+}
+
+/**
+ * ib_dma_unlink_range - Unlink a mapping created by ib_dma_link_range()
+ * @dev: The device for which the DMA address was created
+ * @iova: DMA IOVA properties
+ * @dma_offset: DMA offset
+ */
+static inline void ib_dma_unlink_range(struct ib_device *dev,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ if (ib_uses_virt_dma(dev))
+ return;
+
+ dma_unlink_range(iova, dma_offset);
+}
+
int ib_dma_virt_map_sg(struct ib_device *dev, struct scatterlist *sg, int nents);
static inline int ib_dma_map_sg_attrs(struct ib_device *dev,
struct scatterlist *sg, int nents,
--
2.44.0


2024-03-05 11:23:48

by Leon Romanovsky

Subject: [RFC RESEND 10/16] RDMA/umem: Prevent UMEM ODP creation with SWIOTLB

From: Leon Romanovsky <[email protected]>

RDMA UMEM never supported DMA addresses returned from SWIOTLB, as these
addresses are programmed into the hardware, which is not aware that it
is dealing with bounce buffers and not real ones.

Instead of silently leaving a broken system for users who didn't know
about it, let's be explicit and return an error to them.

Signed-off-by: Leon Romanovsky <[email protected]>
---
Documentation/core-api/dma-attributes.rst | 7 +++
drivers/infiniband/core/umem_odp.c | 77 +++++++++++------------
include/linux/dma-mapping.h | 6 ++
kernel/dma/direct.h | 4 +-
kernel/dma/mapping.c | 4 ++
5 files changed, 58 insertions(+), 40 deletions(-)

diff --git a/Documentation/core-api/dma-attributes.rst b/Documentation/core-api/dma-attributes.rst
index 1887d92e8e92..b337ec65d506 100644
--- a/Documentation/core-api/dma-attributes.rst
+++ b/Documentation/core-api/dma-attributes.rst
@@ -130,3 +130,10 @@ accesses to DMA buffers in both privileged "supervisor" and unprivileged
subsystem that the buffer is fully accessible at the elevated privilege
level (and ideally inaccessible or at least read-only at the
lesser-privileged levels).
+
+DMA_ATTR_NO_TRANSLATION
+-----------------------
+
+This attribute is used to indicate to the DMA-mapping subsystem that the
+buffer is not subject to any address translation. This is used for devices
+that don't need buffer bouncing or fixing of DMA addresses.
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 1301009a6b78..57c56000f60e 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -50,51 +50,50 @@
static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
+ size_t page_size = 1UL << umem_odp->page_shift;
struct ib_device *dev = umem_odp->umem.ibdev;
+ size_t ndmas, npfns;
+ unsigned long start;
+ unsigned long end;
int ret;

umem_odp->umem.is_odp = 1;
mutex_init(&umem_odp->umem_mutex);

- if (!umem_odp->is_implicit_odp) {
- size_t page_size = 1UL << umem_odp->page_shift;
- unsigned long start;
- unsigned long end;
- size_t ndmas, npfns;
-
- start = ALIGN_DOWN(umem_odp->umem.address, page_size);
- if (check_add_overflow(umem_odp->umem.address,
- (unsigned long)umem_odp->umem.length,
- &end))
- return -EOVERFLOW;
- end = ALIGN(end, page_size);
- if (unlikely(end < page_size))
- return -EOVERFLOW;
-
- ndmas = (end - start) >> umem_odp->page_shift;
- if (!ndmas)
- return -EINVAL;
-
- npfns = (end - start) >> PAGE_SHIFT;
- umem_odp->pfn_list = kvcalloc(
- npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
- if (!umem_odp->pfn_list)
- return -ENOMEM;
-
-
- umem_odp->iova.dev = dev->dma_device;
- umem_odp->iova.size = end - start;
- umem_odp->iova.dir = DMA_BIDIRECTIONAL;
- ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
- if (ret)
- goto out_pfn_list;
-
- ret = mmu_interval_notifier_insert(&umem_odp->notifier,
- umem_odp->umem.owning_mm,
- start, end - start, ops);
- if (ret)
- goto out_free_iova;
- }
+ if (umem_odp->is_implicit_odp)
+ return 0;
+
+ start = ALIGN_DOWN(umem_odp->umem.address, page_size);
+ if (check_add_overflow(umem_odp->umem.address,
+ (unsigned long)umem_odp->umem.length, &end))
+ return -EOVERFLOW;
+ end = ALIGN(end, page_size);
+ if (unlikely(end < page_size))
+ return -EOVERFLOW;
+
+ ndmas = (end - start) >> umem_odp->page_shift;
+ if (!ndmas)
+ return -EINVAL;
+
+ npfns = (end - start) >> PAGE_SHIFT;
+ umem_odp->pfn_list =
+ kvcalloc(npfns, sizeof(*umem_odp->pfn_list), GFP_KERNEL);
+ if (!umem_odp->pfn_list)
+ return -ENOMEM;
+
+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ umem_odp->iova.attrs = DMA_ATTR_NO_TRANSLATION;
+ ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
+ if (ret)
+ goto out_pfn_list;
+
+ ret = mmu_interval_notifier_insert(&umem_odp->notifier,
+ umem_odp->umem.owning_mm, start,
+ end - start, ops);
+ if (ret)
+ goto out_free_iova;

return 0;

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 91cc084adb53..89945e707a9b 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -62,6 +62,12 @@
*/
#define DMA_ATTR_PRIVILEGED (1UL << 9)

+/*
+ * DMA_ATTR_NO_TRANSLATION: used to indicate that the buffer should not be mapped
+ * through address translation.
+ */
+#define DMA_ATTR_NO_TRANSLATION (1UL << 10)
+
/*
* A dma_addr_t can hold any valid DMA or bus address for the platform. It can
* be given to a device to use as a DMA source or target. It is specific to a
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 1c30e1cd607a..1c9ec204c999 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -92,6 +92,8 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
if (is_swiotlb_force_bounce(dev)) {
if (is_pci_p2pdma_page(page))
return DMA_MAPPING_ERROR;
+ if (attrs & DMA_ATTR_NO_TRANSLATION)
+ return DMA_MAPPING_ERROR;
return swiotlb_map(dev, phys, size, dir, attrs);
}

@@ -99,7 +101,7 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev,
dma_kmalloc_needs_bounce(dev, size, dir)) {
if (is_pci_p2pdma_page(page))
return DMA_MAPPING_ERROR;
- if (is_swiotlb_active(dev))
+ if (is_swiotlb_active(dev) && !(attrs & DMA_ATTR_NO_TRANSLATION))
return swiotlb_map(dev, phys, size, dir, attrs);

dev_WARN_ONCE(dev, 1,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index f989c64622c2..49b1fde510c5 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -188,6 +188,10 @@ int dma_alloc_iova(struct dma_iova_attrs *iova)
struct device *dev = iova->dev;
const struct dma_map_ops *ops = get_dma_ops(dev);

+ if (dma_map_direct(dev, ops) && is_swiotlb_force_bounce(dev) &&
+ iova->attrs & DMA_ATTR_NO_TRANSLATION)
+ return -EOPNOTSUPP;
+
if (dma_map_direct(dev, ops) || !ops->alloc_iova) {
iova->addr = 0;
return 0;
--
2.44.0


2024-03-05 11:24:44

by Leon Romanovsky

Subject: [RFC RESEND 12/16] vfio/mlx5: Rewrite create mkey flow to allow better code reuse

From: Leon Romanovsky <[email protected]>

Change the creation of the mkey to be performed in multiple steps:
data allocation, DMA setup and the actual call to HW to create that mkey.

In this new flow, the whole input to the MKEY command is saved, which
eliminates the need to keep an array of pointers for the DMA addresses
of the receive list, and in future patches for the send list too.

In addition to reducing memory size and eliminating unnecessary data
movement to set the MKEY input, the code is prepared for future reuse.
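
For the receive buffer the new flow roughly takes the following shape
(a condensed sketch of the functions added in this patch, not a
verbatim excerpt; error unwinding matches the patch):

    u32 *mkey_in;
    u32 mkey;
    int err;

    mkey_in = alloc_mkey_in(npages, pdn);           /* 1. data allocation */
    if (!mkey_in)
            return -ENOMEM;

    err = register_dma_pages(mdev, npages,          /* 2. DMA setup fills */
                             page_list, mkey_in);   /*    the MTT entries */
    if (err)
            goto err_register;

    err = create_mkey(mdev, npages, NULL, mkey_in, &mkey); /* 3. HW call */
    if (err)
            goto err_create;

    return 0;

    err_create:
            unregister_dma_pages(mdev, npages, mkey_in);
    err_register:
            kvfree(mkey_in);
            return err;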

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 149 +++++++++++++++++++++---------------
drivers/vfio/pci/mlx5/cmd.h | 3 +-
2 files changed, 88 insertions(+), 64 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 45104e47b7b2..44762980fcb9 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -300,39 +300,21 @@ static int mlx5vf_cmd_get_vhca_id(struct mlx5_core_dev *mdev, u16 function_id,
return ret;
}

-static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
- struct mlx5_vhca_data_buffer *buf,
- struct mlx5_vhca_recv_buf *recv_buf,
- u32 *mkey)
+static u32 *alloc_mkey_in(u32 npages, u32 pdn)
{
- size_t npages = buf ? buf->npages : recv_buf->npages;
- int err = 0, inlen;
- __be64 *mtt;
+ int inlen;
void *mkc;
u32 *in;

inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(*mtt) * round_up(npages, 2);
+ sizeof(__be64) * round_up(npages, 2);

- in = kvzalloc(inlen, GFP_KERNEL);
+ in = kvzalloc(inlen, GFP_KERNEL_ACCOUNT);
if (!in)
- return -ENOMEM;
+ return NULL;

MLX5_SET(create_mkey_in, in, translations_octword_actual_size,
DIV_ROUND_UP(npages, 2));
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, in, klm_pas_mtt);
-
- if (buf) {
- struct sg_dma_page_iter dma_iter;
-
- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
- } else {
- int i;
-
- for (i = 0; i < npages; i++)
- *mtt++ = cpu_to_be64(recv_buf->dma_addrs[i]);
- }

mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry);
MLX5_SET(mkc, mkc, access_mode_1_0, MLX5_MKC_ACCESS_MODE_MTT);
@@ -346,9 +328,30 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
MLX5_SET(mkc, mkc, log_page_size, PAGE_SHIFT);
MLX5_SET(mkc, mkc, translations_octword_size, DIV_ROUND_UP(npages, 2));
MLX5_SET64(mkc, mkc, len, npages * PAGE_SIZE);
- err = mlx5_core_create_mkey(mdev, mkey, in, inlen);
- kvfree(in);
- return err;
+
+ return in;
+}
+
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
+ struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+ u32 *mkey)
+{
+ __be64 *mtt;
+ int inlen;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ if (buf) {
+ struct sg_dma_page_iter dma_iter;
+
+ for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
+ *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ }
+
+ inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
}

static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -368,13 +371,22 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (ret)
return ret;

- ret = _create_mkey(mdev, buf->migf->pdn, buf, NULL, &buf->mkey);
- if (ret)
+ buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
+ if (!buf->mkey_in) {
+ ret = -ENOMEM;
goto err;
+ }
+
+ ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ if (ret)
+ goto err_create_mkey;

buf->dmaed = true;

return 0;
+
+err_create_mkey:
+ kvfree(buf->mkey_in);
err:
dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
@@ -390,6 +402,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)

if (buf->dmaed) {
mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ kvfree(buf->mkey_in);
dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
buf->dma_dir, 0);
}
@@ -1286,46 +1299,45 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
return -ENOMEM;
}

-static int register_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in)
{
- int i, j;
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i;

- recv_buf->dma_addrs = kvcalloc(recv_buf->npages,
- sizeof(*recv_buf->dma_addrs),
- GFP_KERNEL_ACCOUNT);
- if (!recv_buf->dma_addrs)
- return -ENOMEM;
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);

- for (i = 0; i < recv_buf->npages; i++) {
- recv_buf->dma_addrs[i] = dma_map_page(mdev->device,
- recv_buf->page_list[i],
- 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, recv_buf->dma_addrs[i]))
- goto error;
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unmap_single(mdev->device, addr, PAGE_SIZE,
+ DMA_FROM_DEVICE);
}
- return 0;
-
-error:
- for (j = 0; j < i; j++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[j],
- PAGE_SIZE, DMA_FROM_DEVICE);
-
- kvfree(recv_buf->dma_addrs);
- return -ENOMEM;
}

-static void unregister_dma_recv_pages(struct mlx5_core_dev *mdev,
- struct mlx5_vhca_recv_buf *recv_buf)
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in)
{
+ dma_addr_t addr;
+ __be64 *mtt;
int i;

- for (i = 0; i < recv_buf->npages; i++)
- dma_unmap_single(mdev->device, recv_buf->dma_addrs[i],
- PAGE_SIZE, DMA_FROM_DEVICE);
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ for (i = 0; i < npages; i++) {
+ addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ if (dma_mapping_error(mdev->device, addr))
+ goto error;
+
+ *mtt++ = cpu_to_be64(addr);
+ }
+
+ return 0;

- kvfree(recv_buf->dma_addrs);
+error:
+ unregister_dma_pages(mdev, i, mkey_in);
+ return -ENOMEM;
}

static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1334,7 +1346,8 @@ static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;

mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ kvfree(recv_buf->mkey_in);
free_recv_pages(&qp->recv_buf);
}

@@ -1350,18 +1363,28 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
if (err < 0)
return err;

- err = register_dma_recv_pages(mdev, recv_buf);
- if (err)
+ recv_buf->mkey_in = alloc_mkey_in(npages, pdn);
+ if (!recv_buf->mkey_in) {
+ err = -ENOMEM;
goto end;
+ }
+
+ err = register_dma_pages(mdev, npages, recv_buf->page_list,
+ recv_buf->mkey_in);
+ if (err)
+ goto err_register_dma;

- err = _create_mkey(mdev, pdn, NULL, recv_buf, &recv_buf->mkey);
+ err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
+ &recv_buf->mkey);
if (err)
goto err_create_mkey;

return 0;

err_create_mkey:
- unregister_dma_recv_pages(mdev, recv_buf);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+err_register_dma:
+ kvfree(recv_buf->mkey_in);
end:
free_recv_pages(recv_buf);
return err;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 887267ebbd8a..83728c0669e7 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -62,6 +62,7 @@ struct mlx5_vhca_data_buffer {
u64 length;
u32 npages;
u32 mkey;
+ u32 *mkey_in;
enum dma_data_direction dma_dir;
u8 dmaed:1;
u8 stop_copy_chunk_num;
@@ -137,8 +138,8 @@ struct mlx5_vhca_cq {
struct mlx5_vhca_recv_buf {
u32 npages;
struct page **page_list;
- dma_addr_t *dma_addrs;
u32 next_rq_offset;
+ u32 *mkey_in;
u32 mkey;
};

--
2.44.0


2024-03-05 11:25:57

by Leon Romanovsky

Subject: [RFC RESEND 14/16] vfio/mlx5: Convert vfio to use DMA link API

From: Leon Romanovsky <[email protected]>

Remove the intermediate scatter-gather table, as it is not needed when
the DMA link API is used. This conversion drastically reduces the
memory used to manage that table.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 177 ++++++++++++++++-------------------
drivers/vfio/pci/mlx5/cmd.h | 8 +-
drivers/vfio/pci/mlx5/main.c | 50 ++--------
3 files changed, 91 insertions(+), 144 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 5e2103042d9b..cfae03f7b7da 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -332,26 +332,60 @@ static u32 *alloc_mkey_in(u32 npages, u32 pdn)
return in;
}

-static int create_mkey(struct mlx5_core_dev *mdev, u32 npages,
- struct mlx5_vhca_data_buffer *buf, u32 *mkey_in,
+static int create_mkey(struct mlx5_core_dev *mdev, u32 npages, u32 *mkey_in,
u32 *mkey)
{
+ int inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
+ sizeof(__be64) * round_up(npages, 2);
+
+ return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+}
+
+static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ u32 *mkey_in, struct dma_iova_attrs *iova)
+{
+ dma_addr_t addr;
__be64 *mtt;
- int inlen;
+ int i;

mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);

- if (buf) {
- struct sg_dma_page_iter dma_iter;
+ for (i = npages - 1; i >= 0; i--) {
+ addr = be64_to_cpu(mtt[i]);
+ dma_unlink_range(iova, addr);
+ }
+ dma_free_iova(iova);
+}
+
+static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
+ struct page **page_list, u32 *mkey_in,
+ struct dma_iova_attrs *iova)
+{
+ dma_addr_t addr;
+ __be64 *mtt;
+ int i, err;
+
+ iova->dev = mdev->device;
+ iova->size = npages * PAGE_SIZE;
+ err = dma_alloc_iova(iova);
+ if (err)
+ return err;
+
+ mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
+
+ for (i = 0; i < npages; i++) {
+ addr = dma_link_range(page_list[i], 0, iova, i * PAGE_SIZE);
+ if (dma_mapping_error(mdev->device, addr))
+ goto error;

- for_each_sgtable_dma_page(&buf->table.sgt, &dma_iter, 0)
- *mtt++ = cpu_to_be64(sg_page_iter_dma_address(&dma_iter));
+ *mtt++ = cpu_to_be64(addr);
}

- inlen = MLX5_ST_SZ_BYTES(create_mkey_in) +
- sizeof(__be64) * round_up(npages, 2);
+ return 0;

- return mlx5_core_create_mkey(mdev, mkey, mkey_in, inlen);
+error:
+ unregister_dma_pages(mdev, i, mkey_in, iova);
+ return -ENOMEM;
}

static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
@@ -367,17 +401,16 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (buf->dmaed || !buf->npages)
return -EINVAL;

- ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
- if (ret)
- return ret;
-
buf->mkey_in = alloc_mkey_in(buf->npages, buf->migf->pdn);
- if (!buf->mkey_in) {
- ret = -ENOMEM;
- goto err;
- }
+ if (!buf->mkey_in)
+ return -ENOMEM;
+
+ ret = register_dma_pages(mdev, buf->npages, buf->page_list,
+ buf->mkey_in, &buf->iova);
+ if (ret)
+ goto err_register_dma;

- ret = create_mkey(mdev, buf->npages, buf, buf->mkey_in, &buf->mkey);
+ ret = create_mkey(mdev, buf->npages, buf->mkey_in, &buf->mkey);
if (ret)
goto err_create_mkey;

@@ -386,32 +419,39 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
return 0;

err_create_mkey:
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in, &buf->iova);
+err_register_dma:
kvfree(buf->mkey_in);
-err:
- dma_unmap_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
return ret;
}

+static void free_page_list(u32 npages, struct page **page_list)
+{
+ int i;
+
+ /* Undo alloc_pages_bulk_array() */
+ for (i = npages - 1; i >= 0; i--)
+ __free_page(page_list[i]);
+
+ kvfree(page_list);
+}
+
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
{
- struct mlx5_vf_migration_file *migf = buf->migf;
- struct sg_page_iter sg_iter;
+ struct mlx5vf_pci_core_device *mvdev = buf->migf->mvdev;
+ struct mlx5_core_dev *mdev = mvdev->mdev;

- lockdep_assert_held(&migf->mvdev->state_mutex);
- WARN_ON(migf->mvdev->mdev_detach);
+ lockdep_assert_held(&mvdev->state_mutex);
+ WARN_ON(mvdev->mdev_detach);

if (buf->dmaed) {
- mlx5_core_destroy_mkey(migf->mvdev->mdev, buf->mkey);
+ mlx5_core_destroy_mkey(mdev, buf->mkey);
+ unregister_dma_pages(mdev, buf->npages, buf->mkey_in,
+ &buf->iova);
kvfree(buf->mkey_in);
- dma_unmap_sgtable(migf->mvdev->mdev->device, &buf->table.sgt,
- buf->dma_dir, 0);
}

- /* Undo alloc_pages_bulk_array() */
- for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
- __free_page(sg_page_iter_page(&sg_iter));
- sg_free_append_table(&buf->table);
- kvfree(buf->page_list);
+ free_page_list(buf->npages, buf->page_list);
kfree(buf);
}

@@ -426,7 +466,7 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
if (!buf)
return ERR_PTR(-ENOMEM);

- buf->dma_dir = dma_dir;
+ buf->iova.dir = dma_dir;
buf->migf = migf;
if (npages) {
ret = mlx5vf_add_migration_pages(buf, npages);
@@ -469,7 +509,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,

spin_lock_irq(&migf->list_lock);
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
- if (buf->dma_dir == dma_dir) {
+ if (buf->iova.dir == dma_dir) {
list_del_init(&buf->buf_elm);
if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
@@ -1253,17 +1293,6 @@ static void mlx5vf_destroy_qp(struct mlx5_core_dev *mdev,
kfree(qp);
}

-static void free_recv_pages(struct mlx5_vhca_recv_buf *recv_buf)
-{
- int i;
-
- /* Undo alloc_pages_bulk_array() */
- for (i = 0; i < recv_buf->npages; i++)
- __free_page(recv_buf->page_list[i]);
-
- kvfree(recv_buf->page_list);
-}
-
static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
unsigned int npages)
{
@@ -1300,56 +1329,16 @@ static int alloc_recv_pages(struct mlx5_vhca_recv_buf *recv_buf,
return -ENOMEM;
}

-static void unregister_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
- for (i = npages - 1; i >= 0; i--) {
- addr = be64_to_cpu(mtt[i]);
- dma_unmap_single(mdev->device, addr, PAGE_SIZE,
- DMA_FROM_DEVICE);
- }
-}
-
-static int register_dma_pages(struct mlx5_core_dev *mdev, u32 npages,
- struct page **page_list, u32 *mkey_in)
-{
- dma_addr_t addr;
- __be64 *mtt;
- int i;
-
- mtt = (__be64 *)MLX5_ADDR_OF(create_mkey_in, mkey_in, klm_pas_mtt);
-
- for (i = 0; i < npages; i++) {
- addr = dma_map_page(mdev->device, page_list[i], 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (dma_mapping_error(mdev->device, addr))
- goto error;
-
- *mtt++ = cpu_to_be64(addr);
- }
-
- return 0;
-
-error:
- unregister_dma_pages(mdev, i, mkey_in);
- return -ENOMEM;
-}
-
static void mlx5vf_free_qp_recv_resources(struct mlx5_core_dev *mdev,
struct mlx5_vhca_qp *qp)
{
struct mlx5_vhca_recv_buf *recv_buf = &qp->recv_buf;

mlx5_core_destroy_mkey(mdev, recv_buf->mkey);
- unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, recv_buf->npages, recv_buf->mkey_in,
+ &recv_buf->iova);
kvfree(recv_buf->mkey_in);
- free_recv_pages(&qp->recv_buf);
+ free_page_list(recv_buf->npages, recv_buf->page_list);
}

static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
@@ -1370,24 +1359,24 @@ static int mlx5vf_alloc_qp_recv_resources(struct mlx5_core_dev *mdev,
goto end;
}

+ recv_buf->iova.dir = DMA_FROM_DEVICE;
err = register_dma_pages(mdev, npages, recv_buf->page_list,
- recv_buf->mkey_in);
+ recv_buf->mkey_in, &recv_buf->iova);
if (err)
goto err_register_dma;

- err = create_mkey(mdev, npages, NULL, recv_buf->mkey_in,
- &recv_buf->mkey);
+ err = create_mkey(mdev, npages, recv_buf->mkey_in, &recv_buf->mkey);
if (err)
goto err_create_mkey;

return 0;

err_create_mkey:
- unregister_dma_pages(mdev, npages, recv_buf->mkey_in);
+ unregister_dma_pages(mdev, npages, recv_buf->mkey_in, &recv_buf->iova);
err_register_dma:
kvfree(recv_buf->mkey_in);
end:
- free_recv_pages(recv_buf);
+ free_page_list(npages, recv_buf->page_list);
return err;
}

diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 815fcb54494d..3a046166d9f2 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -57,22 +57,17 @@ struct mlx5_vf_migration_header {
};

struct mlx5_vhca_data_buffer {
+ struct dma_iova_attrs iova;
struct page **page_list;
- struct sg_append_table table;
loff_t start_pos;
u64 length;
u32 npages;
u32 mkey;
u32 *mkey_in;
- enum dma_data_direction dma_dir;
u8 dmaed:1;
u8 stop_copy_chunk_num;
struct list_head buf_elm;
struct mlx5_vf_migration_file *migf;
- /* Optimize mlx5vf_get_migration_page() for sequential access */
- struct scatterlist *last_offset_sg;
- unsigned int sg_last_entry;
- unsigned long last_offset;
};

struct mlx5vf_async_data {
@@ -137,6 +132,7 @@ struct mlx5_vhca_cq {
};

struct mlx5_vhca_recv_buf {
+ struct dma_iova_attrs iova;
u32 npages;
struct page **page_list;
u32 next_rq_offset;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index 7ffe24693a55..668c28bc429c 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -34,35 +34,10 @@ static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
core_device);
}

-struct page *
-mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
- unsigned long offset)
+struct page *mlx5vf_get_migration_page(struct mlx5_vhca_data_buffer *buf,
+ unsigned long offset)
{
- unsigned long cur_offset = 0;
- struct scatterlist *sg;
- unsigned int i;
-
- /* All accesses are sequential */
- if (offset < buf->last_offset || !buf->last_offset_sg) {
- buf->last_offset = 0;
- buf->last_offset_sg = buf->table.sgt.sgl;
- buf->sg_last_entry = 0;
- }
-
- cur_offset = buf->last_offset;
-
- for_each_sg(buf->last_offset_sg, sg,
- buf->table.sgt.orig_nents - buf->sg_last_entry, i) {
- if (offset < sg->length + cur_offset) {
- buf->last_offset_sg = sg;
- buf->sg_last_entry += i;
- buf->last_offset = cur_offset;
- return nth_page(sg_page(sg),
- (offset - cur_offset) / PAGE_SIZE);
- }
- cur_offset += sg->length;
- }
- return NULL;
+ return buf->page_list[offset / PAGE_SIZE];
}

int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
@@ -72,13 +47,9 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
- unsigned int to_fill;
- int ret;

- to_fill = min_t(unsigned int, npages,
- PAGE_SIZE / sizeof(*buf->page_list));
old_size = buf->npages * sizeof(*buf->page_list);
- new_size = old_size + to_fill * sizeof(*buf->page_list);
+ new_size = old_size + to_alloc * sizeof(*buf->page_list);
page_list = kvrealloc(buf->page_list, old_size, new_size,
GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
@@ -87,22 +58,13 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
buf->page_list = page_list;

do {
- filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
+ filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_alloc,
buf->page_list + buf->npages);
if (!filled)
return -ENOMEM;

to_alloc -= filled;
- ret = sg_alloc_append_table_from_pages(
- &buf->table, buf->page_list + buf->npages, filled, 0,
- filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
- GFP_KERNEL_ACCOUNT);
- if (ret)
- return ret;
-
buf->npages += filled;
- to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);

return 0;
@@ -164,7 +126,7 @@ static void mlx5vf_buf_read_done(struct mlx5_vhca_data_buffer *vhca_buf)
struct mlx5_vf_migration_file *migf = vhca_buf->migf;

if (vhca_buf->stop_copy_chunk_num) {
- bool is_header = vhca_buf->dma_dir == DMA_NONE;
+ bool is_header = vhca_buf->iova.dir == DMA_NONE;
u8 chunk_num = vhca_buf->stop_copy_chunk_num;
size_t next_required_umem_size = 0;

--
2.44.0


2024-03-05 11:35:53

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 04/16] iommu/dma: Provide an interface to allow preallocate IOVA

From: Leon Romanovsky <[email protected]>

Separate IOVA allocation into a dedicated callback. This allows callers
to cache the IOVA and reuse it in fast paths for devices that support
the ODP (on-demand-paging) mechanism.
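
For context, a minimal caller-side sketch of the split flow this
callback enables, using the dma_alloc_iova()/dma_free_iova() interface
added earlier in this series (error handling trimmed; the surrounding
driver code and variable names are illustrative only):

	struct dma_iova_attrs iova = {};
	int ret;

	/* Control path: reserve IOVA space for the whole buffer once. */
	iova.dev  = dev;			/* device doing the DMA */
	iova.size = nr_pages << PAGE_SHIFT;
	iova.dir  = DMA_BIDIRECTIONAL;
	ret = dma_alloc_iova(&iova);
	if (ret)
		return ret;

	/*
	 * Data path (later patches in this series): link and unlink
	 * individual pages into the reserved range with
	 * dma_link_range()/dma_unlink_range().
	 */

	/* Teardown: release the cached IOVA range. */
	dma_free_iova(&iova);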

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/iommu/dma-iommu.c | 50 +++++++++++++++++++++++++++++----------
1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 50ccc4f1ef81..e55726783501 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -356,7 +356,7 @@ int iommu_dma_init_fq(struct iommu_domain *domain)
atomic_set(&cookie->fq_timer_on, 0);
/*
* Prevent incomplete fq state being observable. Pairs with path from
- * __iommu_dma_unmap() through iommu_dma_free_iova() to queue_iova()
+ * __iommu_dma_unmap() through __iommu_dma_free_iova() to queue_iova()
*/
smp_wmb();
WRITE_ONCE(cookie->fq_domain, domain);
@@ -760,7 +760,7 @@ static int dma_info_to_prot(enum dma_data_direction dir, bool coherent,
}
}

-static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
+static dma_addr_t __iommu_dma_alloc_iova(struct iommu_domain *domain,
size_t size, u64 dma_limit, struct device *dev)
{
struct iommu_dma_cookie *cookie = domain->iova_cookie;
@@ -806,7 +806,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
return (dma_addr_t)iova << shift;
}

-static void iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
+static void __iommu_dma_free_iova(struct iommu_dma_cookie *cookie,
dma_addr_t iova, size_t size, struct iommu_iotlb_gather *gather)
{
struct iova_domain *iovad = &cookie->iovad;
@@ -843,7 +843,7 @@ static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,

if (!iotlb_gather.queued)
iommu_iotlb_sync(domain, &iotlb_gather);
- iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
+ __iommu_dma_free_iova(cookie, dma_addr, size, &iotlb_gather);
}

static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,
@@ -861,12 +861,12 @@ static dma_addr_t __iommu_dma_map(struct device *dev, phys_addr_t phys,

size = iova_align(iovad, size + iova_off);

- iova = iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
if (!iova)
return DMA_MAPPING_ERROR;

if (iommu_map(domain, iova, phys - iova_off, size, prot, GFP_ATOMIC)) {
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
return DMA_MAPPING_ERROR;
}
return iova + iova_off;
@@ -970,7 +970,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
return NULL;

size = iova_align(iovad, size);
- iova = iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dev->coherent_dma_mask, dev);
if (!iova)
goto out_free_pages;

@@ -1004,7 +1004,7 @@ static struct page **__iommu_dma_alloc_noncontiguous(struct device *dev,
out_free_sg:
sg_free_table(sgt);
out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_pages:
__iommu_dma_free_pages(pages, count);
return NULL;
@@ -1436,7 +1436,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
if (!iova_len)
return __finalise_sg(dev, sg, nents, 0);

- iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
if (!iova) {
ret = -ENOMEM;
goto out_restore_sg;
@@ -1453,7 +1453,7 @@ static int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
return __finalise_sg(dev, sg, nents, iova);

out_free_iova:
- iommu_dma_free_iova(cookie, iova, iova_len, NULL);
+ __iommu_dma_free_iova(cookie, iova, iova_len, NULL);
out_restore_sg:
__invalidate_sg(sg, nents);
out:
@@ -1706,6 +1706,30 @@ static size_t iommu_dma_opt_mapping_size(void)
return iova_rcache_range();
}

+static dma_addr_t iommu_dma_alloc_iova(struct device *dev, size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ dma_addr_t dma_mask = dma_get_mask(dev);
+
+ size = iova_align(iovad, size);
+ return __iommu_dma_alloc_iova(domain, size, dma_mask, dev);
+}
+
+static void iommu_dma_free_iova(struct device *dev, dma_addr_t iova,
+ size_t size)
+{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
+ struct iommu_iotlb_gather iotlb_gather;
+
+ size = iova_align(iovad, size);
+ iommu_iotlb_gather_init(&iotlb_gather);
+ __iommu_dma_free_iova(cookie, iova, size, &iotlb_gather);
+}
+
static const struct dma_map_ops iommu_dma_ops = {
.flags = DMA_F_PCI_P2PDMA_SUPPORTED,
.alloc = iommu_dma_alloc,
@@ -1728,6 +1752,8 @@ static const struct dma_map_ops iommu_dma_ops = {
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
.opt_mapping_size = iommu_dma_opt_mapping_size,
+ .alloc_iova = iommu_dma_alloc_iova,
+ .free_iova = iommu_dma_free_iova,
};

/*
@@ -1776,7 +1802,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
if (!msi_page)
return NULL;

- iova = iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
+ iova = __iommu_dma_alloc_iova(domain, size, dma_get_mask(dev), dev);
if (!iova)
goto out_free_page;

@@ -1790,7 +1816,7 @@ static struct iommu_dma_msi_page *iommu_dma_get_msi_page(struct device *dev,
return msi_page;

out_free_iova:
- iommu_dma_free_iova(cookie, iova, size, NULL);
+ __iommu_dma_free_iova(cookie, iova, size, NULL);
out_free_page:
kfree(msi_page);
return NULL;
--
2.44.0


2024-03-05 11:37:23

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 07/16] RDMA/umem: Preallocate and cache IOVA for UMEM ODP

From: Leon Romanovsky <[email protected]>

As a preparation for providing a two-step interface to map pages,
preallocate the IOVA when the UMEM is initialized.
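
A rough sketch of the resulting IOVA lifecycle for an ODP umem (the
fault-path linking shown here anticipates the dma_link_range() API used
by later patches in this series and is illustrative only):

	/* ib_init_umem_odp(): reserve IOVA covering the whole umem. */
	umem_odp->iova.dev  = dev->dma_device;
	umem_odp->iova.size = end - start;
	umem_odp->iova.dir  = DMA_BIDIRECTIONAL;
	ret = ib_dma_alloc_iova(dev, &umem_odp->iova);

	/* Page-fault path (later patch): map a faulted page at its offset. */
	dma_addr = dma_link_range(page, 0, &umem_odp->iova,
				  (u64)idx << umem_odp->page_shift);

	/* ib_umem_odp_release(): return the cached range. */
	ib_dma_free_iova(dev, &umem_odp->iova);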

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 16 +++++++++++++++-
include/rdma/ib_umem_odp.h | 1 +
include/rdma/ib_verbs.h | 18 ++++++++++++++++++
3 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index e9fa22d31c23..f69d1233dc82 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -50,6 +50,7 @@
static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
const struct mmu_interval_notifier_ops *ops)
{
+ struct ib_device *dev = umem_odp->umem.ibdev;
int ret;

umem_odp->umem.is_odp = 1;
@@ -87,15 +88,25 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp,
goto out_pfn_list;
}

+ umem_odp->iova.dev = dev->dma_device;
+ umem_odp->iova.size = end - start;
+ umem_odp->iova.dir = DMA_BIDIRECTIONAL;
+ ret = ib_dma_alloc_iova(dev, &umem_odp->iova);
+ if (ret)
+ goto out_dma_list;
+
+
ret = mmu_interval_notifier_insert(&umem_odp->notifier,
umem_odp->umem.owning_mm,
start, end - start, ops);
if (ret)
- goto out_dma_list;
+ goto out_free_iova;
}

return 0;

+out_free_iova:
+ ib_dma_free_iova(dev, &umem_odp->iova);
out_dma_list:
kvfree(umem_odp->dma_list);
out_pfn_list:
@@ -262,6 +273,8 @@ EXPORT_SYMBOL(ib_umem_odp_get);

void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
{
+ struct ib_device *dev = umem_odp->umem.ibdev;
+
/*
* Ensure that no more pages are mapped in the umem.
*
@@ -274,6 +287,7 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
ib_umem_end(umem_odp));
mutex_unlock(&umem_odp->umem_mutex);
mmu_interval_notifier_remove(&umem_odp->notifier);
+ ib_dma_free_iova(dev, &umem_odp->iova);
kvfree(umem_odp->dma_list);
kvfree(umem_odp->pfn_list);
}
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 0844c1d05ac6..bb2d7f2a5b04 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -23,6 +23,7 @@ struct ib_umem_odp {
* See ODP_READ_ALLOWED_BIT and ODP_WRITE_ALLOWED_BIT.
*/
dma_addr_t *dma_list;
+ struct dma_iova_attrs iova;
/*
* The umem_mutex protects the page_list and dma_list fields of an ODP
* umem, allowing only a single thread to map/unmap pages. The mutex
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index b7b6b58dd348..e71fa19187cc 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -4077,6 +4077,24 @@ static inline int ib_dma_mapping_error(struct ib_device *dev, u64 dma_addr)
return dma_mapping_error(dev->dma_device, dma_addr);
}

+static inline int ib_dma_alloc_iova(struct ib_device *dev,
+ struct dma_iova_attrs *iova)
+{
+ if (ib_uses_virt_dma(dev))
+ return 0;
+
+ return dma_alloc_iova(iova);
+}
+
+static inline void ib_dma_free_iova(struct ib_device *dev,
+ struct dma_iova_attrs *iova)
+{
+ if (ib_uses_virt_dma(dev))
+ return;
+
+ dma_free_iova(iova);
+}
+
/**
* ib_dma_map_single - Map a kernel virtual address to DMA address
* @dev: The device for which the dma_addr is to be created
--
2.44.0


2024-03-05 11:39:29

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 11/16] vfio/mlx5: Explicitly use number of pages instead of allocated length

From: Leon Romanovsky <[email protected]>

allocated_length is always a multiple of the page size, so let's change
the functions to accept the number of pages instead. This opens an
avenue to combine the receive and send paths and improves code
readability.
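
At the call sites this simply moves the bytes-to-pages conversion to the
caller, so a byte length is converted once up front (pattern taken from
the hunks below):

	buf = mlx5vf_get_data_buffer(migf,
				     DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
				     DMA_FROM_DEVICE);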

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 31 ++++++++---------
drivers/vfio/pci/mlx5/cmd.h | 10 +++---
drivers/vfio/pci/mlx5/main.c | 65 +++++++++++++++++++++++-------------
3 files changed, 62 insertions(+), 44 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index efd1d252cdc9..45104e47b7b2 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -305,8 +305,7 @@ static int _create_mkey(struct mlx5_core_dev *mdev, u32 pdn,
struct mlx5_vhca_recv_buf *recv_buf,
u32 *mkey)
{
- size_t npages = buf ? DIV_ROUND_UP(buf->allocated_length, PAGE_SIZE) :
- recv_buf->npages;
+ size_t npages = buf ? buf->npages : recv_buf->npages;
int err = 0, inlen;
__be64 *mtt;
void *mkc;
@@ -362,7 +361,7 @@ static int mlx5vf_dma_data_buffer(struct mlx5_vhca_data_buffer *buf)
if (mvdev->mdev_detach)
return -ENOTCONN;

- if (buf->dmaed || !buf->allocated_length)
+ if (buf->dmaed || !buf->npages)
return -EINVAL;

ret = dma_map_sgtable(mdev->device, &buf->table.sgt, buf->dma_dir, 0);
@@ -403,8 +402,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
}

struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length,
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf;
@@ -416,9 +414,8 @@ mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,

buf->dma_dir = dma_dir;
buf->migf = migf;
- if (length) {
- ret = mlx5vf_add_migration_pages(buf,
- DIV_ROUND_UP_ULL(length, PAGE_SIZE));
+ if (npages) {
+ ret = mlx5vf_add_migration_pages(buf, npages);
if (ret)
goto end;

@@ -444,8 +441,8 @@ void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf)
}

struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir)
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir)
{
struct mlx5_vhca_data_buffer *buf, *temp_buf;
struct list_head free_list;
@@ -460,7 +457,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
list_for_each_entry_safe(buf, temp_buf, &migf->avail_list, buf_elm) {
if (buf->dma_dir == dma_dir) {
list_del_init(&buf->buf_elm);
- if (buf->allocated_length >= length) {
+ if (buf->npages >= npages) {
spin_unlock_irq(&migf->list_lock);
goto found;
}
@@ -474,7 +471,7 @@ mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
}
}
spin_unlock_irq(&migf->list_lock);
- buf = mlx5vf_alloc_data_buffer(migf, length, dma_dir);
+ buf = mlx5vf_alloc_data_buffer(migf, npages, dma_dir);

found:
while ((temp_buf = list_first_entry_or_null(&free_list,
@@ -645,7 +642,7 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
MLX5_SET(save_vhca_state_in, in, op_mod, 0);
MLX5_SET(save_vhca_state_in, in, vhca_id, mvdev->vhca_id);
MLX5_SET(save_vhca_state_in, in, mkey, buf->mkey);
- MLX5_SET(save_vhca_state_in, in, size, buf->allocated_length);
+ MLX5_SET(save_vhca_state_in, in, size, buf->npages * PAGE_SIZE);
MLX5_SET(save_vhca_state_in, in, incremental, inc);
MLX5_SET(save_vhca_state_in, in, set_track, track);

@@ -668,8 +665,12 @@ int mlx5vf_cmd_save_vhca_state(struct mlx5vf_pci_core_device *mvdev,
}

if (!header_buf) {
- header_buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ u32 npages = DIV_ROUND_UP(
+ sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE);
+
+ header_buf =
+ mlx5vf_get_data_buffer(migf, npages, DMA_NONE);
if (IS_ERR(header_buf)) {
err = PTR_ERR(header_buf);
goto err_free;
diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index f2c7227fa683..887267ebbd8a 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -60,7 +60,7 @@ struct mlx5_vhca_data_buffer {
struct sg_append_table table;
loff_t start_pos;
u64 length;
- u64 allocated_length;
+ u32 npages;
u32 mkey;
enum dma_data_direction dma_dir;
u8 dmaed:1;
@@ -219,12 +219,12 @@ int mlx5vf_cmd_alloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5vf_cmd_dealloc_pd(struct mlx5_vf_migration_file *migf);
void mlx5fv_cmd_clean_migf_resources(struct mlx5_vf_migration_file *migf);
struct mlx5_vhca_data_buffer *
-mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_alloc_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf);
struct mlx5_vhca_data_buffer *
-mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf,
- size_t length, enum dma_data_direction dma_dir);
+mlx5vf_get_data_buffer(struct mlx5_vf_migration_file *migf, u32 npages,
+ enum dma_data_direction dma_dir);
void mlx5vf_put_data_buffer(struct mlx5_vhca_data_buffer *buf);
int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
unsigned int npages);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index fe09a8c8af95..b11b1c27d284 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -94,7 +94,7 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,

if (ret)
goto err;
- buf->allocated_length += filled * PAGE_SIZE;
+ buf->npages += filled;
/* clean input for another bulk allocation */
memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
@@ -352,6 +352,7 @@ static struct mlx5_vhca_data_buffer *
mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
u8 index, size_t required_length)
{
+ u32 npages = DIV_ROUND_UP(required_length, PAGE_SIZE);
struct mlx5_vhca_data_buffer *buf = migf->buf[index];
u8 chunk_num;

@@ -359,12 +360,11 @@ mlx5vf_mig_file_get_stop_copy_buf(struct mlx5_vf_migration_file *migf,
chunk_num = buf->stop_copy_chunk_num;
buf->migf->buf[index] = NULL;
/* Checking whether the pre-allocated buffer can fit */
- if (buf->allocated_length >= required_length)
+ if (buf->npages >= npages)
return buf;

mlx5vf_put_data_buffer(buf);
- buf = mlx5vf_get_data_buffer(buf->migf, required_length,
- DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(buf->migf, npages, DMA_FROM_DEVICE);
if (IS_ERR(buf))
return buf;

@@ -417,7 +417,9 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
u8 *to_buff;
int ret;

- header_buf = mlx5vf_get_data_buffer(migf, size, DMA_NONE);
+ BUILD_BUG_ON(size > PAGE_SIZE);
+ header_buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(size, PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(header_buf))
return PTR_ERR(header_buf);

@@ -432,7 +434,7 @@ static int mlx5vf_add_stop_copy_header(struct mlx5_vf_migration_file *migf,
to_buff = kmap_local_page(page);
memcpy(to_buff, &header, sizeof(header));
header_buf->length = sizeof(header);
- data.stop_copy_size = cpu_to_le64(migf->buf[0]->allocated_length);
+ data.stop_copy_size = cpu_to_le64(migf->buf[0]->npages * PAGE_SIZE);
memcpy(to_buff + sizeof(header), &data, sizeof(data));
header_buf->length += sizeof(data);
kunmap_local(to_buff);
@@ -481,15 +483,22 @@ static int mlx5vf_prep_stop_copy(struct mlx5vf_pci_core_device *mvdev,

num_chunks = mvdev->chunk_mode ? MAX_NUM_CHUNKS : 1;
for (i = 0; i < num_chunks; i++) {
- buf = mlx5vf_get_data_buffer(migf, inc_state_size, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(
+ migf, DIV_ROUND_UP(inc_state_size, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
}

+ BUILD_BUG_ON(sizeof(struct mlx5_vf_migration_header) >
+ PAGE_SIZE);
migf->buf[i] = buf;
- buf = mlx5vf_get_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_get_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto err;
@@ -597,7 +606,8 @@ static long mlx5vf_precopy_ioctl(struct file *filp, unsigned int cmd,
* We finished transferring the current state and the device has a
* dirty state, save a new state to be ready for.
*/
- buf = mlx5vf_get_data_buffer(migf, inc_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_get_data_buffer(migf, DIV_ROUND_UP(inc_length, PAGE_SIZE),
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
mlx5vf_mark_err(migf);
@@ -718,8 +728,8 @@ mlx5vf_pci_save_device_data(struct mlx5vf_pci_core_device *mvdev, bool track)

if (track) {
/* leave the allocated buffer ready for the stop-copy phase */
- buf = mlx5vf_alloc_data_buffer(migf,
- migf->buf[0]->allocated_length, DMA_FROM_DEVICE);
+ buf = mlx5vf_alloc_data_buffer(migf, migf->buf[0]->npages,
+ DMA_FROM_DEVICE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_pd;
@@ -783,16 +793,15 @@ mlx5vf_resume_read_image_no_header(struct mlx5_vhca_data_buffer *vhca_buf,
const char __user **buf, size_t *len,
loff_t *pos, ssize_t *done)
{
+ u32 npages = DIV_ROUND_UP(requested_length, PAGE_SIZE);
int ret;

if (requested_length > MAX_LOAD_SIZE)
return -ENOMEM;

- if (vhca_buf->allocated_length < requested_length) {
- ret = mlx5vf_add_migration_pages(
- vhca_buf,
- DIV_ROUND_UP(requested_length - vhca_buf->allocated_length,
- PAGE_SIZE));
+ if (vhca_buf->npages < npages) {
+ ret = mlx5vf_add_migration_pages(vhca_buf,
+ npages - vhca_buf->npages);
if (ret)
return ret;
}
@@ -992,11 +1001,14 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
goto out_unlock;
break;
case MLX5_VF_LOAD_STATE_PREP_HEADER_DATA:
- if (vhca_buf_header->allocated_length < migf->record_size) {
+ {
+ u32 npages = DIV_ROUND_UP(migf->record_size, PAGE_SIZE);
+
+ if (vhca_buf_header->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf_header);

- migf->buf_header[0] = mlx5vf_alloc_data_buffer(migf,
- migf->record_size, DMA_NONE);
+ migf->buf_header[0] = mlx5vf_alloc_data_buffer(
+ migf, npages, DMA_NONE);
if (IS_ERR(migf->buf_header[0])) {
ret = PTR_ERR(migf->buf_header[0]);
migf->buf_header[0] = NULL;
@@ -1009,6 +1021,7 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
vhca_buf_header->start_pos = migf->max_pos;
migf->load_state = MLX5_VF_LOAD_STATE_READ_HEADER_DATA;
break;
+ }
case MLX5_VF_LOAD_STATE_READ_HEADER_DATA:
ret = mlx5vf_resume_read_header_data(migf, vhca_buf_header,
&buf, &len, pos, &done);
@@ -1019,12 +1032,13 @@ static ssize_t mlx5vf_resume_write(struct file *filp, const char __user *buf,
{
u64 size = max(migf->record_size,
migf->stop_copy_prep_size);
+ u32 npages = DIV_ROUND_UP(size, PAGE_SIZE);

- if (vhca_buf->allocated_length < size) {
+ if (vhca_buf->npages < npages) {
mlx5vf_free_data_buffer(vhca_buf);

migf->buf[0] = mlx5vf_alloc_data_buffer(migf,
- size, DMA_TO_DEVICE);
+ npages, DMA_TO_DEVICE);
if (IS_ERR(migf->buf[0])) {
ret = PTR_ERR(migf->buf[0]);
migf->buf[0] = NULL;
@@ -1115,8 +1129,11 @@ mlx5vf_pci_resume_device_data(struct mlx5vf_pci_core_device *mvdev)

migf->buf[0] = buf;
if (MLX5VF_PRE_COPY_SUPP(mvdev)) {
- buf = mlx5vf_alloc_data_buffer(migf,
- sizeof(struct mlx5_vf_migration_header), DMA_NONE);
+ buf = mlx5vf_alloc_data_buffer(
+ migf,
+ DIV_ROUND_UP(sizeof(struct mlx5_vf_migration_header),
+ PAGE_SIZE),
+ DMA_NONE);
if (IS_ERR(buf)) {
ret = PTR_ERR(buf);
goto out_buf;
--
2.44.0


2024-03-05 11:41:55

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 08/16] RDMA/umem: Store ODP access mask information in PFN

From: Leon Romanovsky <[email protected]>

As a preparation for removing dma_list, store the access mask in the
PFN entry instead of in the dma_addr_t.
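
With this change, consumers derive permissions from the hmm PFN entry
rather than from the low bits of the DMA address. A condensed sketch of
the new pattern used in the hunks below (HMM_PFN_STICKY is the per-PFN
tag introduced by the first patch of this series; the MTT flags are
mlx5-specific):

	unsigned long pfn = umem_odp->pfn_list[idx];
	u64 mtt;

	if (!(pfn & HMM_PFN_VALID))
		return;				/* never faulted in */

	mtt = umem_odp->dma_list[idx] | MLX5_IB_MTT_READ;
	if ((pfn & HMM_PFN_WRITE) && !downgrade)
		mtt |= MLX5_IB_MTT_WRITE;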

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 99 +++++++++++-----------------
drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
drivers/infiniband/hw/mlx5/odp.c | 37 ++++++-----
include/rdma/ib_umem_odp.h | 13 ----
4 files changed, 59 insertions(+), 91 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index f69d1233dc82..3619fb78f786 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -310,22 +310,11 @@ EXPORT_SYMBOL(ib_umem_odp_release);
static int ib_umem_odp_map_dma_single_page(
struct ib_umem_odp *umem_odp,
unsigned int dma_index,
- struct page *page,
- u64 access_mask)
+ struct page *page)
{
struct ib_device *dev = umem_odp->umem.ibdev;
dma_addr_t *dma_addr = &umem_odp->dma_list[dma_index];

- if (*dma_addr) {
- /*
- * If the page is already dma mapped it means it went through
- * a non-invalidating trasition, like read-only to writable.
- * Resync the flags.
- */
- *dma_addr = (*dma_addr & ODP_DMA_ADDR_MASK) | access_mask;
- return 0;
- }
-
*dma_addr = ib_dma_map_page(dev, page, 0, 1 << umem_odp->page_shift,
DMA_BIDIRECTIONAL);
if (ib_dma_mapping_error(dev, *dma_addr)) {
@@ -333,7 +322,6 @@ static int ib_umem_odp_map_dma_single_page(
return -EFAULT;
}
umem_odp->npages++;
- *dma_addr |= access_mask;
return 0;
}

@@ -369,9 +357,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
struct hmm_range range = {};
unsigned long timeout;

- if (access_mask == 0)
- return -EINVAL;
-
if (user_virt < ib_umem_start(umem_odp) ||
user_virt + bcnt > ib_umem_end(umem_odp))
return -EFAULT;
@@ -397,7 +382,7 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
if (fault) {
range.default_flags = HMM_PFN_REQ_FAULT;

- if (access_mask & ODP_WRITE_ALLOWED_BIT)
+ if (access_mask & HMM_PFN_WRITE)
range.default_flags |= HMM_PFN_REQ_WRITE;
}

@@ -429,22 +414,17 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
for (pfn_index = 0; pfn_index < num_pfns;
pfn_index += 1 << (page_shift - PAGE_SHIFT), dma_index++) {

- if (fault) {
- /*
- * Since we asked for hmm_range_fault() to populate
- * pages it shouldn't return an error entry on success.
- */
- WARN_ON(range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
- WARN_ON(!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
- } else {
- if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID)) {
- WARN_ON(umem_odp->dma_list[dma_index]);
- continue;
- }
- access_mask = ODP_READ_ALLOWED_BIT;
- if (range.hmm_pfns[pfn_index] & HMM_PFN_WRITE)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
- }
+ /*
+ * Since we asked for hmm_range_fault() to populate
+ * pages it shouldn't return an error entry on success.
+ */
+ WARN_ON(fault && range.hmm_pfns[pfn_index] & HMM_PFN_ERROR);
+ WARN_ON(fault && !(range.hmm_pfns[pfn_index] & HMM_PFN_VALID));
+ if (!(range.hmm_pfns[pfn_index] & HMM_PFN_VALID))
+ continue;
+
+ if (range.hmm_pfns[pfn_index] & HMM_PFN_STICKY)
+ continue;

hmm_order = hmm_pfn_to_map_order(range.hmm_pfns[pfn_index]);
/* If a hugepage was detected and ODP wasn't set for, the umem
@@ -459,13 +439,13 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
}

ret = ib_umem_odp_map_dma_single_page(
- umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]),
- access_mask);
+ umem_odp, dma_index, hmm_pfn_to_page(range.hmm_pfns[pfn_index]));
if (ret < 0) {
ibdev_dbg(umem_odp->umem.ibdev,
"ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
break;
}
+ range.hmm_pfns[pfn_index] |= HMM_PFN_STICKY;
}
/* upon success lock should stay on hold for the callee */
if (!ret)
@@ -485,7 +465,6 @@ EXPORT_SYMBOL(ib_umem_odp_map_dma_and_lock);
void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
u64 bound)
{
- dma_addr_t dma_addr;
dma_addr_t dma;
int idx;
u64 addr;
@@ -496,34 +475,34 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
virt = max_t(u64, virt, ib_umem_start(umem_odp));
bound = min_t(u64, bound, ib_umem_end(umem_odp));
for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
+ unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
+ struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
+
idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
dma = umem_odp->dma_list[idx];

- /* The access flags guaranteed a valid DMA address in case was NULL */
- if (dma) {
- unsigned long pfn_idx = (addr - ib_umem_start(umem_odp)) >> PAGE_SHIFT;
- struct page *page = hmm_pfn_to_page(umem_odp->pfn_list[pfn_idx]);
-
- dma_addr = dma & ODP_DMA_ADDR_MASK;
- ib_dma_unmap_page(dev, dma_addr,
- BIT(umem_odp->page_shift),
- DMA_BIDIRECTIONAL);
- if (dma & ODP_WRITE_ALLOWED_BIT) {
- struct page *head_page = compound_head(page);
- /*
- * set_page_dirty prefers being called with
- * the page lock. However, MMU notifiers are
- * called sometimes with and sometimes without
- * the lock. We rely on the umem_mutex instead
- * to prevent other mmu notifiers from
- * continuing and allowing the page mapping to
- * be removed.
- */
- set_page_dirty(head_page);
- }
- umem_odp->dma_list[idx] = 0;
- umem_odp->npages--;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_VALID))
+ continue;
+ if (!(umem_odp->pfn_list[pfn_idx] & HMM_PFN_STICKY))
+ continue;
+
+ ib_dma_unmap_page(dev, dma, BIT(umem_odp->page_shift),
+ DMA_BIDIRECTIONAL);
+ if (umem_odp->pfn_list[pfn_idx] & HMM_PFN_WRITE) {
+ struct page *head_page = compound_head(page);
+ /*
+ * set_page_dirty prefers being called with
+ * the page lock. However, MMU notifiers are
+ * called sometimes with and sometimes without
+ * the lock. We rely on the umem_mutex instead
+ * to prevent other mmu notifiers from
+ * continuing and allowing the page mapping to
+ * be removed.
+ */
+ set_page_dirty(head_page);
}
+ umem_odp->pfn_list[pfn_idx] &= ~HMM_PFN_STICKY;
+ umem_odp->npages--;
}
}
EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bbe79b86c717..4f368242680d 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -334,6 +334,7 @@ struct mlx5_ib_flow_db {
#define MLX5_IB_UPD_XLT_PD BIT(4)
#define MLX5_IB_UPD_XLT_ACCESS BIT(5)
#define MLX5_IB_UPD_XLT_INDIRECT BIT(6)
+#define MLX5_IB_UPD_XLT_DOWNGRADE BIT(7)

/* Private QP creation flags to be passed in ib_qp_init_attr.create_flags.
*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 4a04cbc5b78a..5713fe25f4de 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -34,6 +34,7 @@
#include <linux/kernel.h>
#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
+#include <linux/hmm.h>

#include "mlx5_ib.h"
#include "cmd.h"
@@ -143,22 +144,12 @@ static void populate_klm(struct mlx5_klm *pklm, size_t idx, size_t nentries,
}
}

-static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
-{
- u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
-
- if (umem_dma & ODP_READ_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_READ;
- if (umem_dma & ODP_WRITE_ALLOWED_BIT)
- mtt_entry |= MLX5_IB_MTT_WRITE;
-
- return mtt_entry;
-}
-
static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
struct mlx5_ib_mr *mr, int flags)
{
struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
+ bool downgrade = flags & MLX5_IB_UPD_XLT_DOWNGRADE;
+ unsigned long pfn;
dma_addr_t pa;
size_t i;

@@ -166,8 +157,17 @@ static void populate_mtt(__be64 *pas, size_t idx, size_t nentries,
return;

for (i = 0; i < nentries; i++) {
+ pfn = odp->pfn_list[idx + i];
+ if (!(pfn & HMM_PFN_VALID))
+ /* Initial ODP init */
+ continue;
+
pa = odp->dma_list[idx + i];
- pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+ pa |= MLX5_IB_MTT_READ;
+ if ((pfn & HMM_PFN_WRITE) && !downgrade)
+ pa |= MLX5_IB_MTT_WRITE;
+
+ pas[i] = cpu_to_be64(pa);
}
}

@@ -268,8 +268,7 @@ static bool mlx5_ib_invalidate_range(struct mmu_interval_notifier *mni,
* estimate the cost of another UMR vs. the cost of bigger
* UMR.
*/
- if (umem_odp->dma_list[idx] &
- (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT)) {
+ if (umem_odp->pfn_list[idx] & HMM_PFN_VALID) {
if (!in_block) {
blk_start_idx = idx;
in_block = 1;
@@ -555,7 +554,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
{
int page_shift, ret, np;
bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
- u64 access_mask;
+ u64 access_mask = 0;
u64 start_idx;
bool fault = !(flags & MLX5_PF_FLAGS_SNAPSHOT);
u32 xlt_flags = MLX5_IB_UPD_XLT_ATOMIC;
@@ -563,12 +562,14 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
if (flags & MLX5_PF_FLAGS_ENABLE)
xlt_flags |= MLX5_IB_UPD_XLT_ENABLE;

+ if (flags & MLX5_PF_FLAGS_DOWNGRADE)
+ xlt_flags |= MLX5_IB_UPD_XLT_DOWNGRADE;
+
page_shift = odp->page_shift;
start_idx = (user_va - ib_umem_start(odp)) >> page_shift;
- access_mask = ODP_READ_ALLOWED_BIT;

if (odp->umem.writable && !downgrade)
- access_mask |= ODP_WRITE_ALLOWED_BIT;
+ access_mask |= HMM_PFN_WRITE;

np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
if (np < 0)
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index bb2d7f2a5b04..095b1297cfb1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -68,19 +68,6 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
umem_odp->page_shift;
}

-/*
- * The lower 2 bits of the DMA address signal the R/W permissions for
- * the entry. To upgrade the permissions, provide the appropriate
- * bitmask to the map_dma_pages function.
- *
- * Be aware that upgrading a mapped address might result in change of
- * the DMA address for the page.
- */
-#define ODP_READ_ALLOWED_BIT (1<<0ULL)
-#define ODP_WRITE_ALLOWED_BIT (1<<1ULL)
-
-#define ODP_DMA_ADDR_MASK (~(ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT))
-
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING

struct ib_umem_odp *
--
2.44.0


2024-03-05 11:42:24

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 15/16] block: add dma_link_range() based API

From: Chaitanya Kulkarni <[email protected]>

Add two helper functions: blk_rq_get_dma_length() to calculate the
total DMA length of a request, and blk_rq_dma_map() to create the DMA
mapping for it.

blk_rq_get_dma_length() is used to get the total length of the request
when the driver is allocating IOVA space for this request with the call
to dma_alloc_iova(). This length is used to initialize iova->size and is
passed down the IOVA allocation call chain :-
dma_map_ops->alloc_iova()
 iommu_dma_alloc_iova()
  alloc_iova_fast()
   iova_rcache_get()
   OR
   alloc_iova()

blk_rq_dma_map() iterates through the bvec list and creates a DMA
mapping for each page using the iova parameter with the help of
dma_link_range(). Note that @iova is allocated & pre-initialized using
dma_alloc_iova() by the caller. After creating a mapping for each page,
it calls into the callback function @cb provided by the driver with the
mapped DMA address for this page, the offset into the iova space (needed
at the time of unlink), the length of the mapped page, and the page
number that is mapped in this request. The driver is responsible for
using this DMA address to complete the mapping of underlying
protocol-specific data structures, such as NVMe PRPs or NVMe SGLs. This
callback approach allows us to iterate the bvec list only once to create
the bvec-to-DMA mapping and to use that DMA address in the driver to
build the protocol-specific data structure, essentially mapping one bvec
page at a time to a DMA address and using that DMA address to create the
underlying protocol-specific data structures. Finally, it returns the
number of linked ranges.
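
A minimal sketch of the intended driver-side flow, assuming a
hypothetical my_drv_* driver and descriptor layout (the NVMe patch later
in this series is the real consumer of this API):

	static void my_drv_map_cb(void *cb_data, u32 cnt, dma_addr_t dma_addr,
				  dma_addr_t offset, u32 len)
	{
		struct my_drv_iod *iod = cb_data;	/* hypothetical per-request state */

		/* Record the mapping in the protocol-specific descriptor. */
		iod->desc[cnt].addr = cpu_to_le64(dma_addr);
		iod->desc[cnt].len  = cpu_to_le32(len);
		/* Remember the iova offset for unlink at completion time. */
		iod->dma_offset[cnt] = offset;
	}

	static blk_status_t my_drv_map_rq(struct device *dev, struct request *req,
					  struct my_drv_iod *iod)
	{
		int nr;

		iod->iova.dev  = dev;
		iod->iova.dir  = rq_dma_dir(req);
		iod->iova.size = blk_rq_get_dma_length(req);
		if (dma_alloc_iova(&iod->iova))
			return BLK_STS_RESOURCE;

		nr = blk_rq_dma_map(req, my_drv_map_cb, iod, &iod->iova);
		if (!nr) {
			dma_free_iova(&iod->iova);
			return BLK_STS_RESOURCE;
		}
		return BLK_STS_OK;
	}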

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
block/blk-merge.c | 156 +++++++++++++++++++++++++++++++++++++++++
include/linux/blk-mq.h | 9 +++
2 files changed, 165 insertions(+)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 2d470cf2173e..63effc8ac1db 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -583,6 +583,162 @@ int __blk_rq_map_sg(struct request_queue *q, struct request *rq,
}
EXPORT_SYMBOL(__blk_rq_map_sg);

+static dma_addr_t blk_dma_link_page(struct page *page, unsigned int page_offset,
+ struct dma_iova_attrs *iova,
+ dma_addr_t dma_offset)
+{
+ dma_addr_t dma_addr;
+ int ret;
+
+ dma_addr = dma_link_range(page, page_offset, iova, dma_offset);
+ ret = dma_mapping_error(iova->dev, dma_addr);
+ if (ret) {
+ pr_err("dma_mapping_err %d dma_addr 0x%llx dma_offset %llu\n",
+ ret, dma_addr, dma_offset);
+ /* better way ? */
+ dma_addr = 0;
+ }
+ return dma_addr;
+}
+
+/**
+ * blk_rq_dma_map: block layer request to DMA mapping helper.
+ *
+ * @req : [in] request to be mapped
+ * @cb : [in] callback to be called for each bvec mapped into the
+ * underlying driver.
+ * @cb_data : [in] callback data to be passed, private to the underlying
+ * driver.
+ * @iova : [in] iova to be used to create DMA mapping for this request's
+ * bvecs.
+ * Description:
+ * Iterates through the bvec list and creates a DMA mapping for each bvec
+ * page using @iova with dma_link_range(). Note that @iova needs to be
+ * allocated and pre-initialized using dma_alloc_iova() by the caller.
+ * After creating a mapping for each page, call into the callback function
+ * @cb provided by the driver with the mapped DMA address for this bvec,
+ * the offset into the iova space (needed at unlink time), the length of
+ * the mapped page, and the bvec number that is mapped in this request.
+ * The driver is responsible for using this DMA address to complete the
+ * mapping of the underlying protocol-specific data structure, such as
+ * NVMe PRPs or NVMe SGLs. This callback approach allows us to iterate the
+ * bvec list only once to create the bvec-to-DMA mapping and to use that
+ * DMA address in the driver to build the protocol-specific data
+ * structure, essentially mapping one bvec page at a time to a DMA address
+ * and using that DMA address to create the underlying protocol-specific
+ * data structure.
+ *
+ * The caller needs to ensure @iova is initialized and allocated with
+ * dma_alloc_iova().
+ */
+int blk_rq_dma_map(struct request *req, driver_map_cb cb, void *cb_data,
+ struct dma_iova_attrs *iova)
+{
+ dma_addr_t curr_dma_offset = 0;
+ dma_addr_t prev_dma_addr = 0;
+ dma_addr_t dma_addr;
+ size_t prev_dma_len = 0;
+ struct req_iterator iter;
+ struct bio_vec bv;
+ int linked_cnt = 0;
+
+ rq_for_each_bvec(bv, req, iter) {
+ if (bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+ curr_dma_offset = prev_dma_addr + prev_dma_len;
+
+ dma_addr = blk_dma_link_page(bv.bv_page, bv.bv_offset,
+ iova, curr_dma_offset);
+ if (!dma_addr)
+ break;
+
+ cb(cb_data, linked_cnt, dma_addr, curr_dma_offset,
+ bv.bv_len);
+
+ prev_dma_len = bv.bv_len;
+ prev_dma_addr = dma_addr;
+ linked_cnt++;
+ } else {
+ unsigned nbytes = bv.bv_len;
+ unsigned total = 0;
+ unsigned offset, len;
+
+ while (nbytes > 0) {
+ struct page *page = bv.bv_page;
+
+ offset = bv.bv_offset + total;
+ len = min(get_max_segment_size(&req->q->limits,
+ page, offset),
+ nbytes);
+
+ page += (offset >> PAGE_SHIFT);
+ offset &= ~PAGE_MASK;
+
+ curr_dma_offset = prev_dma_addr + prev_dma_len;
+
+ dma_addr = blk_dma_link_page(page, offset,
+ iova,
+ curr_dma_offset);
+ if (!dma_addr)
+ break;
+
+ cb(cb_data, linked_cnt, dma_addr,
+ curr_dma_offset, len);
+
+ total += len;
+ nbytes -= len;
+
+ prev_dma_len = len;
+ prev_dma_addr = dma_addr;
+ linked_cnt++;
+ }
+ }
+ }
+ return linked_cnt;
+}
+EXPORT_SYMBOL_GPL(blk_rq_dma_map);
+
+/*
+ * Calculate total DMA length needed to satisfy this request.
+ */
+size_t blk_rq_get_dma_length(struct request *rq)
+{
+ struct request_queue *q = rq->q;
+ struct bio *bio = rq->bio;
+ unsigned int offset, len;
+ struct bvec_iter iter;
+ size_t dma_length = 0;
+ struct bio_vec bvec;
+
+ if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
+ return rq->special_vec.bv_len;
+
+ if (!rq->bio)
+ return 0;
+
+ for_each_bio(bio) {
+ bio_for_each_bvec(bvec, bio, iter) {
+ unsigned int nbytes = bvec.bv_len;
+ unsigned int total = 0;
+
+ if (bvec.bv_offset + bvec.bv_len <= PAGE_SIZE) {
+ dma_length += bvec.bv_len;
+ continue;
+ }
+
+ while (nbytes > 0) {
+ offset = bvec.bv_offset + total;
+ len = min(get_max_segment_size(&q->limits,
+ bvec.bv_page,
+ offset), nbytes);
+ total += len;
+ nbytes -= len;
+ dma_length += len;
+ }
+ }
+ }
+
+ return dma_length;
+}
+EXPORT_SYMBOL(blk_rq_get_dma_length);
+
static inline unsigned int blk_rq_get_max_sectors(struct request *rq,
sector_t offset)
{
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7a8150a5f051..80b9c7f2c3a0 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -8,6 +8,7 @@
#include <linux/scatterlist.h>
#include <linux/prefetch.h>
#include <linux/srcu.h>
+#include <linux/dma-mapping.h>

struct blk_mq_tags;
struct blk_flush_queue;
@@ -1144,7 +1145,15 @@ static inline int blk_rq_map_sg(struct request_queue *q, struct request *rq,

return __blk_rq_map_sg(q, rq, sglist, &last_sg);
}
+
+typedef void (*driver_map_cb)(void *cb_data, u32 cnt, dma_addr_t dma_addr,
+ dma_addr_t offset, u32 len);
+
+int blk_rq_dma_map(struct request *req, driver_map_cb cb, void *cb_data,
+ struct dma_iova_attrs *iova);
+
void blk_dump_rq_flags(struct request *, char *);
+size_t blk_rq_get_dma_length(struct request *rq);

#ifdef CONFIG_BLK_DEV_ZONED
static inline unsigned int blk_rq_zone_no(struct request *rq)
--
2.44.0


2024-03-05 11:42:50

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 13/16] vfio/mlx5: Explicitly store page list

From: Leon Romanovsky <[email protected]>

As a preparation for removing the scatter-gather table and unifying the
receive and send lists, explicitly store the page list.

Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/vfio/pci/mlx5/cmd.c | 1 +
drivers/vfio/pci/mlx5/cmd.h | 1 +
drivers/vfio/pci/mlx5/main.c | 35 +++++++++++++++++------------------
3 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
index 44762980fcb9..5e2103042d9b 100644
--- a/drivers/vfio/pci/mlx5/cmd.c
+++ b/drivers/vfio/pci/mlx5/cmd.c
@@ -411,6 +411,7 @@ void mlx5vf_free_data_buffer(struct mlx5_vhca_data_buffer *buf)
for_each_sgtable_page(&buf->table.sgt, &sg_iter, 0)
__free_page(sg_page_iter_page(&sg_iter));
sg_free_append_table(&buf->table);
+ kvfree(buf->page_list);
kfree(buf);
}

diff --git a/drivers/vfio/pci/mlx5/cmd.h b/drivers/vfio/pci/mlx5/cmd.h
index 83728c0669e7..815fcb54494d 100644
--- a/drivers/vfio/pci/mlx5/cmd.h
+++ b/drivers/vfio/pci/mlx5/cmd.h
@@ -57,6 +57,7 @@ struct mlx5_vf_migration_header {
};

struct mlx5_vhca_data_buffer {
+ struct page **page_list;
struct sg_append_table table;
loff_t start_pos;
u64 length;
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index b11b1c27d284..7ffe24693a55 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -69,44 +69,43 @@ int mlx5vf_add_migration_pages(struct mlx5_vhca_data_buffer *buf,
unsigned int npages)
{
unsigned int to_alloc = npages;
+ size_t old_size, new_size;
struct page **page_list;
unsigned long filled;
unsigned int to_fill;
int ret;

- to_fill = min_t(unsigned int, npages, PAGE_SIZE / sizeof(*page_list));
- page_list = kvzalloc(to_fill * sizeof(*page_list), GFP_KERNEL_ACCOUNT);
+ to_fill = min_t(unsigned int, npages,
+ PAGE_SIZE / sizeof(*buf->page_list));
+ old_size = buf->npages * sizeof(*buf->page_list);
+ new_size = old_size + to_fill * sizeof(*buf->page_list);
+ page_list = kvrealloc(buf->page_list, old_size, new_size,
+ GFP_KERNEL_ACCOUNT | __GFP_ZERO);
if (!page_list)
return -ENOMEM;

+ buf->page_list = page_list;
+
do {
filled = alloc_pages_bulk_array(GFP_KERNEL_ACCOUNT, to_fill,
- page_list);
- if (!filled) {
- ret = -ENOMEM;
- goto err;
- }
+ buf->page_list + buf->npages);
+ if (!filled)
+ return -ENOMEM;
+
to_alloc -= filled;
ret = sg_alloc_append_table_from_pages(
- &buf->table, page_list, filled, 0,
+ &buf->table, buf->page_list + buf->npages, filled, 0,
filled << PAGE_SHIFT, UINT_MAX, SG_MAX_SINGLE_ALLOC,
GFP_KERNEL_ACCOUNT);
-
if (ret)
- goto err;
+ return ret;
+
buf->npages += filled;
- /* clean input for another bulk allocation */
- memset(page_list, 0, filled * sizeof(*page_list));
to_fill = min_t(unsigned int, to_alloc,
- PAGE_SIZE / sizeof(*page_list));
+ PAGE_SIZE / sizeof(*buf->page_list));
} while (to_alloc > 0);

- kvfree(page_list);
return 0;
-
-err:
- kvfree(page_list);
- return ret;
}

static void mlx5vf_disable_fd(struct mlx5_vf_migration_file *migf)
--
2.44.0


2024-03-05 11:42:57

by Leon Romanovsky

[permalink] [raw]
Subject: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

From: Chaitanya Kulkarni <[email protected]>

Update the nvme_iod structure to hold the iova, the list of DMA linked
addresses and the total linked count. The first is needed in the request
submission path to create the request-to-DMA mapping, and the last two
are needed in the request completion path to remove the DMA mapping. In
nvme_map_data(), initialize the iova with the device, direction and DMA
length with the help of blk_rq_get_dma_length(), allocate the iova using
dma_alloc_iova(), and then call nvme_pci_setup_sgls().

Call the newly added blk_rq_dma_map() to create the request-to-DMA
mapping and provide the callback function nvme_pci_sgl_map(). In the
callback, initialize the NVMe SGL DMA addresses.

Finally, in nvme_unmap_data(), unlink the DMA addresses and free the
iova.

Full disclosure:-
-----------------

This is an RFC to demonstrate that the newly added DMA APIs can be used
to map/unmap bvecs without the use of an sg list, hence I've modified
the PCI code to only handle SGLs for now. Once we have some agreement on
the structure of the new DMA API, I'll add support for PRPs along with
all the optimizations that I've removed from the code for this RFC, for
both NVMe SGLs and PRPs.

I was able to run fio verification job successfully :-

$ fio fio/verify.fio --ioengine=io_uring --filename=/dev/nvme0n1
--loops=10
write-and-verify: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B,
(T) 8192B-8192B, ioengine=io_uring, iodepth=16
fio-3.36
Starting 1 process
Jobs: 1 (f=1): [V(1)][81.6%][r=12.2MiB/s][r=1559 IOPS][eta 03m:00s]
write-and-verify: (groupid=0, jobs=1): err= 0: pid=4435: Mon Mar 4 20:54:48 2024
read: IOPS=2789, BW=21.8MiB/s (22.9MB/s)(6473MiB/297008msec)
slat (usec): min=4, max=5124, avg=356.51, stdev=604.30
clat (nsec): min=1593, max=23376k, avg=5377076.99, stdev=2039189.93
lat (usec): min=493, max=23407, avg=5733.58, stdev=2103.22
clat percentiles (usec):
| 1.00th=[ 1172], 5.00th=[ 2114], 10.00th=[ 2835], 20.00th=[ 3654],
| 30.00th=[ 4228], 40.00th=[ 4752], 50.00th=[ 5276], 60.00th=[ 5800],
| 70.00th=[ 6325], 80.00th=[ 7046], 90.00th=[ 8094], 95.00th=[ 8979],
| 99.00th=[10421], 99.50th=[11076], 99.90th=[12780], 99.95th=[14222],
| 99.99th=[16909]
write: IOPS=2608, BW=20.4MiB/s (21.4MB/s)(10.0GiB/502571msec); 0 zone resets
slat (usec): min=4, max=5787, avg=382.68, stdev=649.01
clat (nsec): min=521, max=23650k, avg=5751363.17, stdev=2676065.35
lat (usec): min=95, max=23674, avg=6134.04, stdev=2813.48
clat percentiles (usec):
| 1.00th=[ 709], 5.00th=[ 1270], 10.00th=[ 1958], 20.00th=[ 3261],
| 30.00th=[ 4228], 40.00th=[ 5014], 50.00th=[ 5800], 60.00th=[ 6521],
| 70.00th=[ 7373], 80.00th=[ 8225], 90.00th=[ 9241], 95.00th=[ 9896],
| 99.00th=[11469], 99.50th=[11863], 99.90th=[13960], 99.95th=[15270],
| 99.99th=[17695]
bw ( KiB/s): min= 1440, max=132496, per=99.28%, avg=20715.88, stdev=13123.13, samples=1013
iops : min= 180, max=16562, avg=2589.34, stdev=1640.39, samples=1013
lat (nsec) : 750=0.01%
lat (usec) : 2=0.01%, 4=0.01%, 100=0.01%, 250=0.01%, 500=0.07%
lat (usec) : 750=0.79%, 1000=1.22%
lat (msec) : 2=5.94%, 4=18.87%, 10=69.53%, 20=3.58%, 50=0.01%
cpu : usr=1.01%, sys=98.95%, ctx=1591, majf=0, minf=2286
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=828524,1310720,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
READ: bw=21.8MiB/s (22.9MB/s), 21.8MiB/s-21.8MiB/s (22.9MB/s-22.9MB/s),
io=6473MiB (6787MB), run=297008-297008msec
WRITE: bw=20.4MiB/s (21.4MB/s), 20.4MiB/s-20.4MiB/s (21.4MB/s-21.4MB/s),
io=10.0GiB (10.7GB), run=502571-502571msec

Disk stats (read/write):
nvme0n1: ios=829189/1310720, sectors=13293416/20971520, merge=0/0,
ticks=836561/1340351, in_queue=2176913, util=99.30%

Signed-off-by: Chaitanya Kulkarni <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/nvme/host/pci.c | 220 +++++++++-------------------------------
1 file changed, 49 insertions(+), 171 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e6267a6aa380..140939228409 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -236,7 +236,9 @@ struct nvme_iod {
unsigned int dma_len; /* length of single DMA segment mapping */
dma_addr_t first_dma;
dma_addr_t meta_dma;
- struct sg_table sgt;
+ struct dma_iova_attrs iova;
+ dma_addr_t dma_link_address[128];
+ u16 nr_dma_link_address;
union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
};

@@ -521,25 +523,10 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
return true;
}

-static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
-{
- const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- dma_addr_t dma_addr = iod->first_dma;
- int i;
-
- for (i = 0; i < iod->nr_allocations; i++) {
- __le64 *prp_list = iod->list[i].prp_list;
- dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
-
- dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
- dma_addr = next_dma_addr;
- }
-}
-
static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
{
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+ u16 i;

if (iod->dma_len) {
dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
@@ -547,9 +534,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
return;
}

- WARN_ON_ONCE(!iod->sgt.nents);
-
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
+ for (i = 0; i < iod->nr_dma_link_address; i++)
+ dma_unlink_range(&iod->iova, iod->dma_link_address[i]);

if (iod->nr_allocations == 0)
dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
@@ -557,120 +543,15 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
else if (iod->nr_allocations == 1)
dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
iod->first_dma);
- else
- nvme_free_prps(dev, req);
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
-}
-
-static void nvme_print_sgl(struct scatterlist *sgl, int nents)
-{
- int i;
- struct scatterlist *sg;
-
- for_each_sg(sgl, sg, nents, i) {
- dma_addr_t phys = sg_phys(sg);
- pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
- "dma_address:%pad dma_length:%d\n",
- i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
- sg_dma_len(sg));
- }
-}
-
-static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
- struct request *req, struct nvme_rw_command *cmnd)
-{
- struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
- int length = blk_rq_payload_bytes(req);
- struct scatterlist *sg = iod->sgt.sgl;
- int dma_len = sg_dma_len(sg);
- u64 dma_addr = sg_dma_address(sg);
- int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
- __le64 *prp_list;
- dma_addr_t prp_dma;
- int nprps, i;
-
- length -= (NVME_CTRL_PAGE_SIZE - offset);
- if (length <= 0) {
- iod->first_dma = 0;
- goto done;
- }
-
- dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
- if (dma_len) {
- dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
- } else {
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
- }
-
- if (length <= NVME_CTRL_PAGE_SIZE) {
- iod->first_dma = dma_addr;
- goto done;
- }
-
- nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
- if (nprps <= (256 / 8)) {
- pool = dev->prp_small_pool;
- iod->nr_allocations = 0;
- } else {
- pool = dev->prp_page_pool;
- iod->nr_allocations = 1;
- }
-
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
- if (!prp_list) {
- iod->nr_allocations = -1;
- return BLK_STS_RESOURCE;
- }
- iod->list[0].prp_list = prp_list;
- iod->first_dma = prp_dma;
- i = 0;
- for (;;) {
- if (i == NVME_CTRL_PAGE_SIZE >> 3) {
- __le64 *old_prp_list = prp_list;
- prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
- if (!prp_list)
- goto free_prps;
- iod->list[iod->nr_allocations++].prp_list = prp_list;
- prp_list[0] = old_prp_list[i - 1];
- old_prp_list[i - 1] = cpu_to_le64(prp_dma);
- i = 1;
- }
- prp_list[i++] = cpu_to_le64(dma_addr);
- dma_len -= NVME_CTRL_PAGE_SIZE;
- dma_addr += NVME_CTRL_PAGE_SIZE;
- length -= NVME_CTRL_PAGE_SIZE;
- if (length <= 0)
- break;
- if (dma_len > 0)
- continue;
- if (unlikely(dma_len < 0))
- goto bad_sgl;
- sg = sg_next(sg);
- dma_addr = sg_dma_address(sg);
- dma_len = sg_dma_len(sg);
- }
-done:
- cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
- cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
- return BLK_STS_OK;
-free_prps:
- nvme_free_prps(dev, req);
- return BLK_STS_RESOURCE;
-bad_sgl:
- WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
- "Invalid SGL for payload:%d nents:%d\n",
- blk_rq_payload_bytes(req), iod->sgt.nents);
- return BLK_STS_IOERR;
+ dma_free_iova(&iod->iova);
}

static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
- struct scatterlist *sg)
+ dma_addr_t dma_addr,
+ unsigned int dma_len)
{
- sge->addr = cpu_to_le64(sg_dma_address(sg));
- sge->length = cpu_to_le32(sg_dma_len(sg));
+ sge->addr = cpu_to_le64(dma_addr);
+ sge->length = cpu_to_le32(dma_len);
sge->type = NVME_SGL_FMT_DATA_DESC << 4;
}

@@ -682,25 +563,37 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
}

+struct nvme_pci_sgl_map_data {
+ struct nvme_iod *iod;
+ struct nvme_sgl_desc *sgl_list;
+};
+
+static void nvme_pci_sgl_map(void *data, u32 cnt, dma_addr_t dma_addr,
+ dma_addr_t offset, u32 len)
+{
+ struct nvme_pci_sgl_map_data *d = data;
+ struct nvme_sgl_desc *sgl_list = d->sgl_list;
+ struct nvme_iod *iod = d->iod;
+
+ nvme_pci_sgl_set_data(&sgl_list[cnt], dma_addr, len);
+ iod->dma_link_address[cnt] = offset;
+ iod->nr_dma_link_address++;
+}
+
static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
struct request *req, struct nvme_rw_command *cmd)
{
+ unsigned int entries = blk_rq_nr_phys_segments(req);
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
- struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
- struct scatterlist *sg = iod->sgt.sgl;
- unsigned int entries = iod->sgt.nents;
+ struct dma_pool *pool;
dma_addr_t sgl_dma;
- int i = 0;
+ int linked_count;
+ struct nvme_pci_sgl_map_data data;

/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;

- if (entries == 1) {
- nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
- return BLK_STS_OK;
- }
-
if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
pool = dev->prp_small_pool;
iod->nr_allocations = 0;
@@ -718,11 +611,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
iod->list[0].sg_list = sg_list;
iod->first_dma = sgl_dma;

- nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
- do {
- nvme_pci_sgl_set_data(&sg_list[i++], sg);
- sg = sg_next(sg);
- } while (--entries > 0);
+ data.iod = iod;
+ data.sgl_list = sg_list;
+
+ linked_count = blk_rq_dma_map(req, nvme_pci_sgl_map, &data,
+ &iod->iova);
+
+ nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, linked_count);

return BLK_STS_OK;
}
@@ -788,36 +683,20 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
&cmnd->rw, &bv);
}
}
-
- iod->dma_len = 0;
- iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
- if (!iod->sgt.sgl)
+ iod->iova.dev = dev->dev;
+ iod->iova.dir = rq_dma_dir(req);
+ iod->iova.attrs = DMA_ATTR_NO_WARN;
+ iod->iova.size = blk_rq_get_dma_length(req);
+ if (!iod->iova.size)
return BLK_STS_RESOURCE;
- sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
- iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
- if (!iod->sgt.orig_nents)
- goto out_free_sg;

- rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
- DMA_ATTR_NO_WARN);
- if (rc) {
- if (rc == -EREMOTEIO)
- ret = BLK_STS_TARGET;
- goto out_free_sg;
- }
+ rc = dma_alloc_iova(&iod->iova);
+ if (rc)
+ return BLK_STS_RESOURCE;

- if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
- ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
- else
- ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
- if (ret != BLK_STS_OK)
- goto out_unmap_sg;
- return BLK_STS_OK;
+ iod->dma_len = 0;

-out_unmap_sg:
- dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
-out_free_sg:
- mempool_free(iod->sgt.sgl, dev->iod_mempool);
+ ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
return ret;
}

@@ -841,7 +720,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)

iod->aborted = false;
iod->nr_allocations = -1;
- iod->sgt.nents = 0;

ret = nvme_setup_cmd(req->q->queuedata, req);
if (ret)
--
2.44.0


2024-03-05 12:05:51

by Robin Murphy

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> DMA patches are ready for review and inclusion, the NVMe patches are still in
> progress as they require agreement on API first.
>
> Thanks
>
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
>
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
>
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
>
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.

I don't follow this part at all - at *some* point, something must know a
range of memory addresses involved in a DMA transfer, so that's where it
should map that range for DMA. Even in a badly-designed system where the
point it's most practical to make the mapping is further out and only
knows that DMA will touch some subset of a buffer, but doesn't know
exactly what subset yet, you'd usually just map the whole buffer. I
don't see why the DMA API would ever need to know about anything other
than pages/PFNs and dma_addr_ts (yes, it does also accept them being
wrapped together in scatterlists; yes, scatterlists are awful and it
would be nice to replace them with a better general DMA descriptor; that
is a whole other subject of its own).

> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.

From a brief look, this is clearly an awkward reinvention of the IOMMU
API. If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU
address spaces then they can and should use the IOMMU API. Perhaps
there's room for some quality-of-life additions to the IOMMU API to help
with common usage patterns, but the generic DMA mapping API is
absolutely not the place for it.

Thanks,
Robin.


2024-03-05 12:31:29

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 05, 2024 at 12:05:23PM +0000, Robin Murphy wrote:
> On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> > This is complimentary part to the proposed LSF/MM topic.
> > https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
> >
> > This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO and
> > DMA patches are ready for review and inclusion, the NVMe patches are still in
> > progress as they require agreement on API first.
> >
> > Thanks
> >
> > -------------------------------------------------------------------------------
> > The DMA mapping operation performs two steps at one same time: allocates
> > IOVA space and actually maps DMA pages to that space. This one shot
> > operation works perfectly for non-complex scenarios, where callers use
> > that DMA API in control path when they setup hardware.
> >
> > However in more complex scenarios, when DMA mapping is needed in data
> > path and especially when some sort of specific datatype is involved,
> > such one shot approach has its drawbacks.
> >
> > That approach pushes developers to introduce new DMA APIs for specific
> > datatype. For example existing scatter-gather mapping functions, or
> > latest Chuck's RFC series to add biovec related DMA mapping [1] and
> > probably struct folio will need it too.
> >
> > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > allocate it as one chunk and some sort of offset calculations to know
> > which part of IOVA to map.
>
> I don't follow this part at all - at *some* point, something must know a
> range of memory addresses involved in a DMA transfer, so that's where it
> should map that range for DMA.

In all the cases presented in this series, the overall DMA size is known in
advance. In the RDMA case, it is known when the user registers the memory; in
VFIO, when live migration is happening; and in NVMe, when the BIO is created.

So once we have allocated the IOVA, we only need to link the ranges, which is
the same as map but without the IOVA allocation.
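For illustration, this is roughly what the NVMe conversion in this series ends
up doing (trimmed from the posted patch; the struct fields and function names
are taken from the RFC, but treat the exact signatures as illustrative):

        /* step 1: one IOVA allocation, sized for the whole request */
        iod->iova.dev   = dev->dev;
        iod->iova.dir   = rq_dma_dir(req);
        iod->iova.attrs = DMA_ATTR_NO_WARN;
        iod->iova.size  = blk_rq_get_dma_length(req);
        if (dma_alloc_iova(&iod->iova))
                return BLK_STS_RESOURCE;

        /* step 2: link every physical range of the request into that IOVA;
         * the callback receives (dma_addr, offset, len) per range and fills
         * the hardware SGL */
        linked = blk_rq_dma_map(req, nvme_pci_sgl_map, &data, &iod->iova);

        /* unmap side: unlink the ranges, then dma_free_iova(&iod->iova) */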

> Even in a badly-designed system where the
> point it's most practical to make the mapping is further out and only knows
> that DMA will touch some subset of a buffer, but doesn't know exactly what
> subset yet, you'd usually just map the whole buffer. I don't see why the DMA
> API would ever need to know about anything other than pages/PFNs and
> dma_addr_ts (yes, it does also accept them being wrapped together in
> scatterlists; yes, scatterlists are awful and it would be nice to replace
> them with a better general DMA descriptor; that is a whole other subject of
> its own).

This is exactly what was done here; we got rid of scatterlists.

>
> > Instead of teaching DMA to know these specific datatypes, let's separate
> > existing DMA mapping routine to two steps and give an option to advanced
> > callers (subsystems) perform all calculations internally in advance and
> > map pages later when it is needed.
>
> From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> spaces then they can and should use the IOMMU API. Perhaps there's room for
> some quality-of-life additions to the IOMMU API to help with common usage
> patterns, but the generic DMA mapping API is absolutely not the place for
> it.

The DMA mapping API gives a nice abstraction over the IOMMU and allows us to
have the same flow for IOMMU and non-IOMMU cases without duplicating code,
while you suggest teaching almost every part of the kernel about the IOMMU.

In this series, we changed RDMA, VFIO and NVMe, and in all cases we removed
more code than we added. From what I saw, VDPA and virtio-blk would benefit
from the proposed API too.

Even in this RFC, where Chaitanya did a partial job and didn't convert the
whole driver, the gain is pretty obvious:
https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u

drivers/nvme/host/pci.c | 220 ++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------------------------------------------------------------------------------------------------------------------
1 file changed, 49 insertions(+), 171 deletions(-)


Thanks


2024-03-05 15:52:18

by Keith Busch

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> @@ -236,7 +236,9 @@ struct nvme_iod {
> unsigned int dma_len; /* length of single DMA segment mapping */
> dma_addr_t first_dma;
> dma_addr_t meta_dma;
> - struct sg_table sgt;
> + struct dma_iova_attrs iova;
> + dma_addr_t dma_link_address[128];
> + u16 nr_dma_link_address;
> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> };

That's quite a lot of space to add to the iod. We preallocate one for
every request, and there could be millions of them.

2024-03-05 16:10:02

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 8:51 AM, Keith Busch wrote:
> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>> @@ -236,7 +236,9 @@ struct nvme_iod {
>> unsigned int dma_len; /* length of single DMA segment mapping */
>> dma_addr_t first_dma;
>> dma_addr_t meta_dma;
>> - struct sg_table sgt;
>> + struct dma_iova_attrs iova;
>> + dma_addr_t dma_link_address[128];
>> + u16 nr_dma_link_address;
>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>> };
>
> That's quite a lot of space to add to the iod. We preallocate one for
> every request, and there could be millions of them.

Yeah, that's just a complete non-starter. As far as I can tell, this
ends up adding 1052 bytes per request. Doing the quick math on my test
box (24 drives), that's just a smidge over 3GB of extra memory. That's
not going to work, not even close.
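(Back-of-envelope, with the per-drive tag count being an assumption inferred
from the totals above:

        dma_addr_t dma_link_address[128]        128 * 8 = 1024 bytes
        counter, dma_iova_attrs, padding,
        minus the removed sg_table              ~=   28 bytes
                                                ----------------
        net growth per preallocated iod         ~= 1052 bytes

        1052 bytes * ~128k preallocated tags/drive * 24 drives ~= 3GB)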

--
Jens Axboe


2024-03-05 16:40:30

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 08:08, Jens Axboe wrote:
> On 3/5/24 8:51 AM, Keith Busch wrote:
>> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>>> @@ -236,7 +236,9 @@ struct nvme_iod {
>>> unsigned int dma_len; /* length of single DMA segment mapping */
>>> dma_addr_t first_dma;
>>> dma_addr_t meta_dma;
>>> - struct sg_table sgt;
>>> + struct dma_iova_attrs iova;
>>> + dma_addr_t dma_link_address[128];
>>> + u16 nr_dma_link_address;
>>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>>> };
>> That's quite a lot of space to add to the iod. We preallocate one for
>> every request, and there could be millions of them.
> Yeah, that's just a complete non-starter. As far as I can tell, this
> ends up adding 1052 bytes per request. Doing the quick math on my test
> box (24 drives), that's just a smidge over 3GB of extra memory. That's
> not going to work, not even close.
>

I don't have any intent to use more space for the nvme_iod than it does now.
I'll trim down the iod structure and send out a patch with this fixed soon, so
we can continue the discussion here on this thread ...

-ck


2024-03-05 16:47:36

by Chaitanya Kulkarni

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 3/5/24 08:39, Chaitanya Kulkarni wrote:
> On 3/5/24 08:08, Jens Axboe wrote:
>> On 3/5/24 8:51 AM, Keith Busch wrote:
>>> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
>>>> @@ -236,7 +236,9 @@ struct nvme_iod {
>>>> unsigned int dma_len; /* length of single DMA segment mapping */
>>>> dma_addr_t first_dma;
>>>> dma_addr_t meta_dma;
>>>> - struct sg_table sgt;
>>>> + struct dma_iova_attrs iova;
>>>> + dma_addr_t dma_link_address[128];
>>>> + u16 nr_dma_link_address;
>>>> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
>>>> };
>>> That's quite a lot of space to add to the iod. We preallocate one for
>>> every request, and there could be millions of them.
>> Yeah, that's just a complete non-starter. As far as I can tell, this
>> ends up adding 1052 bytes per request. Doing the quick math on my test
>> box (24 drives), that's just a smidge over 3GB of extra memory. That's
>> not going to work, not even close.
>>
> I don't have any intent to use more space for the nvme_iod than what
> it is now. I'll trim down the iod structure and send out a patch soon with
> this fixed to continue the discussion here on this thread ...
>
> -ck
>
>

For the final version, once the DMA API discussion is concluded, I plan to use
the iod_mempool for the allocation of nvme_iod->dma_link_address; however, I
won't wait for that and will send out an updated version with a trimmed
nvme_iod size.

If you have any other comments please let me know, or we can continue the
discussion once I post the new version of this patch on this thread ...

Thanks a lot, Keith and Jens, for looking into it ...

-ck


2024-03-06 14:34:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Tue, Mar 05, 2024 at 08:51:56AM -0700, Keith Busch wrote:
> On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> > @@ -236,7 +236,9 @@ struct nvme_iod {
> > unsigned int dma_len; /* length of single DMA segment mapping */
> > dma_addr_t first_dma;
> > dma_addr_t meta_dma;
> > - struct sg_table sgt;
> > + struct dma_iova_attrs iova;
> > + dma_addr_t dma_link_address[128];
> > + u16 nr_dma_link_address;
> > union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> > };
>
> That's quite a lot of space to add to the iod. We preallocate one for
> every request, and there could be millions of them.

Yes. And this whole proposal also seems clearly confused (and not just
because of the gazillion reposts), because it mixes up the case where we can
coalesce CPU regions into a single dma_addr_t range (iommu, and maybe in the
future swiotlb) with the one where we need a dma_addr_t range per CPU range
(direct mapping and misc cruft).

2024-03-06 14:44:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 05, 2024 at 02:29:35PM +0200, Leon Romanovsky wrote:
> > > These advanced DMA mapping APIs are needed to calculate IOVA size to
> > > allocate it as one chunk and some sort of offset calculations to know
> > > which part of IOVA to map.
> >
> > I don't follow this part at all - at *some* point, something must know a
> > range of memory addresses involved in a DMA transfer, so that's where it
> > should map that range for DMA.
>
> In all presented cases in this series, the overall DMA size is known in
> advance. In RDMA case, it is known when user registers the memory, in
> VFIO, when live migration is happening and in NVMe, when BIO is created.
>
> So once we allocated IOVA, we will need to link ranges, which si the
> same as map but without IOVA allocation.

But above you say:

"These advanced DMA mapping APIs are needed to calculate IOVA size to
allocate it as one chunk and some sort of offset calculations to know
which part of IOVA to map."

this suggests you need helpers to calculate the len and offset. I
can't see where that would ever make sense. The total transfer
size should just be passed in by the callers and be known, and
there should be no offset.

> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > From a brief look, this is clearly an awkward reinvention of the IOMMU API.
> > If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU address
> > spaces then they can and should use the IOMMU API. Perhaps there's room for
> > some quality-of-life additions to the IOMMU API to help with common usage
> > patterns, but the generic DMA mapping API is absolutely not the place for
> > it.
>
> DMA mapping gives nice abstraction from IOMMU, and allows us to have
> same flow for IOMMU and non-IOMMU flows without duplicating code, while
> you suggest to teach almost every part in the kernel to know about IOMMU.

Except that the flows are fundamentally different for the "can coalesce"
vs "can't coalesce" case. In the former we have one dma_addr_t range,
and in the latter as many as there are input vectors (this is ignoring
the weird iommu merging case where we coalesce some but not all
segments, but I'd rather not have that in a new API).

So if we want to efficiently be able to handle these cases we need
two APIs in the driver and a good framework to switch between them.
Robin makes the point here that the iommu API handles the can-coalesce
case, and he has a point, as that's exactly how the IOMMU API works.
I'd still prefer to wrap it with dma callers to handle things like
swiotlb and maybe Xen grant tables, and to avoid the type confusion
between dma_addr_t and the untyped iova in the iommu layer, but
having this layer or not is probably worth a discussion.

>
> In this series, we changed RDMA, VFIO and NVMe, and in all cases we
> removed more code than added. From what I saw, VDPA and virito-blk will
> benefit from proposed API too.
>
> Even in this RFC, where Chaitanya did partial job and didn't convert
> whole driver, the gain is pretty obvious:
> https://lore.kernel.org/linux-rdma/016fc02cbfa9be3c156a6f74df38def1e09c08f1.1709635535.git.leon@kernel.org/T/#u
>

I have no idea how that nvme patch is even supposed to work. It removes
the PRP path in nvme-pci, which is not only the most common I/O path but
is actually required for the admin queue, as NVMe doesn't support SGLs
for the admin queue.


2024-03-06 15:07:13

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Wed, Mar 06, 2024 at 03:33:21PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 05, 2024 at 08:51:56AM -0700, Keith Busch wrote:
> > On Tue, Mar 05, 2024 at 01:18:47PM +0200, Leon Romanovsky wrote:
> > > @@ -236,7 +236,9 @@ struct nvme_iod {
> > > unsigned int dma_len; /* length of single DMA segment mapping */
> > > dma_addr_t first_dma;
> > > dma_addr_t meta_dma;
> > > - struct sg_table sgt;
> > > + struct dma_iova_attrs iova;
> > > + dma_addr_t dma_link_address[128];
> > > + u16 nr_dma_link_address;
> > > union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> > > };
> >
> > That's quite a lot of space to add to the iod. We preallocate one for
> > every request, and there could be millions of them.
>
> Yes. And this whole proposal also seems clearly confused (not just
> because of the gazillion reposts) but because it mixes up the case
> where we can coalesce CPU regions into a single dma_addr_t range
> (iommu and maybe in the future swiotlb) and one where we need a

I had the broad expectation that the DMA API user would already be
providing a place to store the dma_addr_t as it has to feed that into
the HW. That memory should simply last up until we do dma unmap and
the cases that need dma_addr_t during unmap can go get it from there.

If that is how things are organized, is there another reason to lean
further into single-range case optimization?

We can't do much on the map side, as a single range doesn't imply a
contiguous range; P2P and alignment create discontinuities in the
dma_addr_t space that still have to be dealt with.

Jason

2024-03-06 15:44:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 03:44:16PM +0100, Christoph Hellwig wrote:

> Except that the flows are fundamentally different for the "can coalesce"
> vs "can't coalesce" case. In the former we have one dma_addr_t range,
> and in the latter as many as there are input vectors (this is ignoring
> the weird iommu merging case where we we coalesce some but not all
> segments, but I'd rather not have that in a new API).

I don't think they are so fundamentally different; at least in our
past conversations I never came away with the idea that we should burden
the driver with two different flows based on what kind of alignment the
transfer happens to have.

Certainly, if we split the API so that one API focuses on doing only
page-aligned transfers, the aligned part does become a little simpler.

At least the RDMA drivers could productively use just a page-aligned
interface, but I didn't think this would make BIO users happy, so I never
even thought about it.

> The total transfer size should just be passed in by the callers and
> be known, and there should be no offset.

The API needs the caller to figure out the total number of IOVA pages
it needs, rounding up the CPU ranges to full aligned pages. That
becomes the IOVA allocation.

offset is something that arises to support non-aligned transfers.
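A minimal sketch of that sizing step, assuming the IOVA granule is
PAGE_SIZE and the ranges come in as bio_vecs (purely illustrative):

        static unsigned int total_iova_pages(const struct bio_vec *bv,
                                             unsigned int nr)
        {
                unsigned int i, pages = 0;

                /* round each CPU range up to full pages; the sum sizes
                 * the single IOVA allocation */
                for (i = 0; i < nr; i++)
                        pages += DIV_ROUND_UP(bv[i].bv_offset + bv[i].bv_len,
                                              PAGE_SIZE);
                return pages;
        }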

> So if we want to efficiently be able to handle these cases we need
> two APIs in the driver and a good framework to switch between them.

But, what does the non-page-aligned version look like? Doesn't it
still look basically like this?

And what is the actual difference if the input is aligned? The caller
can assume it doesn't need to provide a per-range dma_addr_t during
unmap.

It still can't assume the HW programming will be linear due to the P2P
!ACS support.

And it still has to call an API per CPU range to actually program the
IOMMU.

So are they really so different as to want different APIs? That strikes
me as a big driver cost.

> I'd still prefer to wrap it with dma callers to handle things like
> swiotlb and maybe Xen grant tables and to avoid the type confusion
> between dma_addr_t and then untyped iova in the iommu layer, but
> having this layer or not is probably worth a discussion.

I'm surprised by the idea of random drivers reaching past dma-iommu.c
and into the iommu layer to set up DMA directly on the DMA API's
iommu_domain?? That seems like completely giving up on the DMA API
abstraction to me. :(

IMHO it needs to be wrapped: the wrapper needs to do all the special
P2P stuff, at a minimum, and it should multiplex to all the
non-iommu cases for the driver too.

We still need to achieve some kind of abstraction here that doesn't
burden every driver with different code paths for each DMA back end!
Don't we??

Jason

2024-03-06 16:15:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Wed, Mar 06, 2024 at 11:05:18AM -0400, Jason Gunthorpe wrote:
> > Yes. And this whole proposal also seems clearly confused (not just
> > because of the gazillion reposts) but because it mixes up the case
> > where we can coalesce CPU regions into a single dma_addr_t range
> > (iommu and maybe in the future swiotlb) and one where we need a
>
> I had the broad expectation that the DMA API user would already be
> providing a place to store the dma_addr_t as it has to feed that into
> the HW. That memory should simply last up until we do dma unmap and
> the cases that need dma_addr_t during unmap can go get it from there.

Well, the dma_addr_t needs to be fed into the hardware somehow,
obviously. But for the coalesced case we only need one such
field, not N ranges.

> We can't do much on the map side as single range doesn't imply
> contiguous range, P2P and alignment create discontinuities in the
> dma_addr_t that still have to be delt with.

For alignment the right answer is almost always to require the
upper layers to align to the iommu granularity. We've been a bit
lax about that due to the way scatterlists are designed, but
requiring the proper alignment actually benefits everyone.

2024-03-06 16:20:54

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> I don't think they are so fundamentally different, at least in our
> past conversations I never came out with the idea we should burden the
> driver with two different flows based on what kind of alignment the
> transfer happens to have.

Then we talked past each other..

> At least the RDMA drivers could productively use just a page aligned
> interface. But I didn't think this would make BIO users happy so never
> even thought about it..

Page-aligned is generally the right thing for the block layer. NVMe,
for example, already requires that anyway due to PRPs.

> > The total transfer size should just be passed in by the callers and
> > be known, and there should be no offset.
>
> The API needs the caller to figure out the total number of IOVA pages
> it needs, rounding up the CPU ranges to full aligned pages. That
> becomes the IOVA allocation.

Yes, it's a basic align-up to the granularity, assuming we don't bother
with non-aligned transfers.

>
> > So if we want to efficiently be able to handle these cases we need
> > two APIs in the driver and a good framework to switch between them.
>
> But, what does the non-page-aligned version look like? Doesn't it
> still look basically like this?

I'd just rather have the non-aligned case, for those who really need
it, be the map-single-region loop that is needed for the direct
mapping anyway.

>
> And what is the actual difference if the input is aligned? The caller
> can assume it doesn't need to provide a per-range dma_addr_t during
> unmap.

A per-range dma_addr_t doesn't really make sense for the aligned and
coalesced case.

> It still can't assume the HW programming will be linear due to the P2P
> !ACS support.
>
> And it still has to call an API per-cpu range to actually program the
> IOMMU.
>
> So are they really so different to want different APIs? That strikes
> me as a big driver cost.

So as not to have to store a dma_addr_t range per CPU range that doesn't
actually get used at all.

2024-03-06 17:45:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 05:20:22PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 11:43:28AM -0400, Jason Gunthorpe wrote:
> > I don't think they are so fundamentally different, at least in our
> > past conversations I never came out with the idea we should burden the
> > driver with two different flows based on what kind of alignment the
> > transfer happens to have.
>
> Then we talked past each other..

Well, we never talked in such detail.

> > > So if we want to efficiently be able to handle these cases we need
> > > two APIs in the driver and a good framework to switch between them.
> >
> > But, what does the non-page-aligned version look like? Doesn't it
> > still look basically like this?
>
> I'd just rather have the non-aligned case for those who really need
> it be the loop over map single region that is needed for the direct
> mapping anyway.

There is a list of interesting cases this has to cover:

1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGL
3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
4. swiotlb single range. Only IOVA range at unmap, single HW SGL
5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs

I think we agree that 1 and 2 should be optimized highly, as they are
the common case. That mainly means no dma_addr_t storage in either.

5 is the slowest and has the most overhead.

4 is basically the same as 2 from the driver's viewpoint.

3 is quite similar to 1, but it has the IOVA range at unmap.

6 doesn't have to be optimal; from the driver perspective it can be
like 5.

That is three basic driver flows: 1/3, 2/4 and 5/6.

So are you thinking something more like a driver flow of:

.. extent IO and get # aligned pages and know if there is P2P ..
dma_init_io(state, num_pages, p2p_flag)
if (dma_io_single_range(state)) {
// #2, #4
for each io()
dma_link_aligned_pages(state, io range)
hw_sgl = (state->iova, state->len)
} else {
// #1, #3, #5, #6
hw_sgls = alloc_hw_sgls(num_ios)
if (dma_io_needs_dma_addr_unmap(state))
dma_addr_storage = alloc_num_ios(); // #5 only
for each io()
hw_sgl[i] = dma_map_single(state, io range)
if (dma_addr_storage)
dma_addr_storage[i] = hw_sgl[i]; // #5 only
}

?

This is not quite what you said; we split the driver flow based on
needing 1 HW SGL vs needing many HW SGLs.

> > So are they really so different to want different APIs? That strikes
> > me as a big driver cost.
>
> To not have to store a dma_address range per CPU range that doesn't
> actually get used at all.

Right, that is a nice optimization we should reach for.

Jason

2024-03-06 22:14:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> There is a list of interesting cases this has to cover:
>
> 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
> 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
> 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
> 4. swiotlb single range. Only IOVA range at unmap, single HW SGL
> 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
> 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
>
> I think we agree that 1 and 2 should be optimized highly as they are
> the common case. That mainly means no dma_addr_t storage in either

I don't think you can do without dma_addr_t storage. In most cases
you can just store the dma_addr_t in the LE/BE encoded hardware
SGL, though, so no extra storage should be needed.
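A minimal sketch of what that means at unmap time for the per-range case
(hw_sgl here is a stand-in for a little-endian device descriptor, not a
real structure; the coalesced case would instead free one IOVA range):

        for (i = 0; i < nents; i++)
                dma_unmap_page(dev, le64_to_cpu(hw_sgl[i].addr),
                               le32_to_cpu(hw_sgl[i].len), dir);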

> 3 is quite similar to 1, but it has the IOVA range at unmap.

Can you explain what P2P case you mean? The switch one with the
bus address is indeed basically the same, just with potentially a
different offset, while the through-host-bridge case is the same
as a normal iommu map.

>
> 4 is basically the same as 2 from the driver's viewpoint

I'd actually treat it the same as one.

> 5 is the slowest and has the most overhead.

And 5 could be broken into multiple 4s, at least for now. Or do you
have a different definition of range here?

> So are you thinking something more like a driver flow of:
>
> .. extent IO and get # aligned pages and know if there is P2P ..
> dma_init_io(state, num_pages, p2p_flag)
> if (dma_io_single_range(state)) {
> // #2, #4
> for each io()
> dma_link_aligned_pages(state, io range)
> hw_sgl = (state->iova, state->len)
> } else {

I think what you have as dma_io_single_range should come before
the dma_init_io. If we know we can't coalesce, it really is just a
dma_map_{single,page,bvec} loop, with no need for any extra state.
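A minimal sketch of that fallback loop with today's API (hw_sgl is a
stand-in for whatever descriptor format the device uses, and the unwind
path is omitted):

        for (i = 0; i < nr_pages; i++) {
                dma_addr_t addr = dma_map_page(dev, pages[i], 0,
                                               PAGE_SIZE, dir);

                if (dma_mapping_error(dev, addr))
                        goto unwind;
                hw_sgl[i].addr = addr;
                hw_sgl[i].len  = PAGE_SIZE;
        }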

And we're back to roughly the proposal I sent out years ago.

> This is not quite what you said, we split the driver flow based on
> needing 1 HW SGL vs need many HW SGL.

That's at least what I intended to say, and I'm a little curious as to how
it came across.


2024-03-07 00:01:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 11:14:00PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 01:44:56PM -0400, Jason Gunthorpe wrote:
> > There is a list of interesting cases this has to cover:
> >
> > 1. Direct map. No dma_addr_t at unmap, multiple HW SGLs
> > 2. IOMMU aligned map, no P2P. Only IOVA range at unmap, single HW SGLs
> > 3. IOMMU aligned map, P2P. Only IOVA range at unmap, multiple HW SGLs
> > 4. swiotlb single range. Only IOVA range at unmap, single HW SGL
> > 5. swiotlb multi-range. All dma_addr_t's at unmap, multiple HW SGLs.
> > 6. Unaligned IOMMU. Only IOVA range at unmap, multiple HW SGLs
> >
> > I think we agree that 1 and 2 should be optimized highly as they are
> > the common case. That mainly means no dma_addr_t storage in either
>
> I don't think you can do without dma_addr_t storage. In most cases
> your can just store the dma_addr_t in the LE/BE encoded hardware
> SGL, so no extra storage should be needed though.

RDMA (and often DRM too) generally doesn't work like that: the driver
copies the page table into the device, and then the only reason to have
dma_addr_t storage is to pass it to the DMA unmap API. Optionally
eliminating long-term dma_addr_t storage would be a worthwhile memory
saving for large, long-lived user space memory registrations.

> > 3 is quite similar to 1, but it has the IOVA range at unmap.
>
> Can you explain what P2P case you mean? The switch one with the
> bus address is indeed basically the same, just with potentioally a
> different offset, while the through host bridge case is the same
> as a normal iommu map.

Yes, the bus address case. The IOMMU is turned on, ACS on a local
switch is off.

All pages go through the IOMMU in the normal way except P2P pages
between devices on the same switch (i.e. the dma_addr_t is the CPU
physical address of the P2P memory plus an offset). RDMA must support a
mixture of IOVA and P2P addresses in the same IO operation.

I suppose it would make more sense to say it is similar to 6.

> > 5 is the slowest and has the most overhead.
>
> and 5 could be broken into multiple 4s at least for now. Or do you
> have a different dfinition of range here?

I wrote the list from a single IO operation's perspective, so all but
5 need to store only a single IOVA range, which can live in some
simple non-dynamic memory along with whatever HW SGLs etc. are needed.

The point of 5 being different is because the driver has to provide a
dynamically sized list of dma_addr_t's as storage until unmap. 5 is
the only case that requires that full list.

So yes, 5 could be broken up into multiple IOs, but then the
specialness of 5 is the driver must keep track of multiple IOs..

> > So are you thinking something more like a driver flow of:
> >
> > .. extent IO and get # aligned pages and know if there is P2P ..
> > dma_init_io(state, num_pages, p2p_flag)
> > if (dma_io_single_range(state)) {
> > // #2, #4
> > for each io()
> > dma_link_aligned_pages(state, io range)
> > hw_sgl = (state->iova, state->len)
> > } else {
>
> I think what you have a dma_io_single_range should become before
> the dma_init_io. If we know we can't coalesce it really just is a
> dma_map_{single,page,bvec} loop, no need for any extra state.

I imagine dma_io_single_range() to just check a flag in state.

I still want to call dma_init_io() for the non-coalescing cases
because all the flows, regardless of composition, should be about as
fast as dma_map_sg is today.

That means we need to always pre-allocate the IOVA in any case where
the IOMMU might be active - even on a non-coalescing flow.

IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
be used and we can't just call today's dma_map_page() in a loop on the
non-coalescing side and pay the overhead of Nx IOVA allocations.

In large part this is for RDMA, where a single P2P page in a large
multi-gigabyte user memory registration shouldn't drastically harm the
registration performance by falling down to doing dma_map_page, and an
IOVA allocation, on a 4k page-by-page basis.

The other thing that got hand waved here is how does dma_init_io()
know which of the 6 states we are looking at? I imagine we probably
want to do something like:

struct dma_io_summarize summary = {};
for each io()
dma_io_summarize_range(&summary, io range)
dma_init_io(dev, &state, &summary);
if (state->single_range) {
} else {
}
dma_io_done_mapping(&state); <-- flush IOTLB once

At least this way the DMA API still has some decent opportunity for
abstraction and future growth using state to pass bits of information
between the API family.

There is some swiotlb complexity that needs something like this: a
system with an iommu can still fail to coalesce if the pages are
encrypted and the device doesn't support DMA from encrypted pages. We
need to check for P2P pages, encrypted memory pages, and who knows
what else.

> And we're back to roughly the proposal I sent out years ago.

Well, all of this is roughly your original proposal, just with
different optimization choices and some enhancement to also cover
hmm_range_fault() users.

Enhancing the single-SGL case is not a big change, I think. It does
seem simpler for the driver not to have to coalesce SGLs to detect
the single-SGL fast path.

> > This is not quite what you said, we split the driver flow based on
> > needing 1 HW SGL vs need many HW SGL.
>
> That's at least what I intended to say, and I'm a little curious as what
> it came across.

Ok, I was reading the discussion as being more about alignment than about
a single HW SGL; I think you meant alignment as implying coalescing
behavior, which implies a single HW SGL.

Jason

2024-03-07 06:01:47

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On 2024/3/5 12:18, Leon Romanovsky wrote:
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057

I am interested in this topic and hope I can join the meeting to discuss it.

Zhu Yanjun



2024-03-07 15:06:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> >
> > I don't think you can do without dma_addr_t storage. In most cases
> > your can just store the dma_addr_t in the LE/BE encoded hardware
> > SGL, so no extra storage should be needed though.
>
> RDMA (and often DRM too) generally doesn't work like that, the driver
> copies the page table into the device and then the only reason to have
> a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> eliminating long term dma_addr_t storage would be a worthwhile memory
> savings for large long lived user space memory registrations.

It's just kinda hard to do. For aligned IOMMU mapping you'd only
have one dma_addr_t mapping (or maybe a few if P2P regions are
involved), so this probably doesn't matter. For direct mappings
you'd have a few, but maybe the better answer is to use THP
more aggressively and reduce the number of segments.

> I wrote the list as from a single IO operation perspective, so all but
> 5 need to store a single IOVA range that could be stored in some
> simple non-dynamic memory along with whatever HW SGLs/etc are needed.
>
> The point of 5 being different is because the driver has to provide a
> dynamically sized list of dma_addr_t's as storage until unmap. 5 is
> the only case that requires that full list.

No, all cases need to store one or more ranges.

> > > So are you thinking something more like a driver flow of:
> > >
> > > .. extent IO and get # aligned pages and know if there is P2P ..
> > > dma_init_io(state, num_pages, p2p_flag)
> > > if (dma_io_single_range(state)) {
> > > // #2, #4
> > > for each io()
> > > dma_link_aligned_pages(state, io range)
> > > hw_sgl = (state->iova, state->len)
> > > } else {
> >
> > I think what you have a dma_io_single_range should become before
> > the dma_init_io. If we know we can't coalesce it really just is a
> > dma_map_{single,page,bvec} loop, no need for any extra state.
>
> I imagine dma_io_single_range() to just check a flag in state.
>
> I still want to call dma_init_io() for the non-coalescing cases
> because all the flows, regardless of composition, should be about as
> fast as dma_map_sg is today.

If all flows include multiple non-coalesced regions, that just makes
things very complicated, and that's exactly what I'd want to avoid.

> That means we need to always pre-allocate the IOVA in any case where
> the IOMMU might be active - even on a non-coalescing flow.
>
> IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> be used and we can't just call today's dma_map_page() in a loop on the
> non-coalescing side and pay the overhead of Nx IOVA allocations.
>
> In large part this is for RDMA, were a single P2P page in a large
> multi-gigabyte user memory registration shouldn't drastically harm the
> registration performance by falling down to doing dma_map_page, and an
> IOVA allocation, on a 4k page by page basis.

But that P2P page needs to be handled very differently, as with it
we can't actually use a single iova range. So I'm not sure how that
is even supposed to work. If you have

+-------+-----+-------+
| local | P2P | local |
+-------+-----+-------+

you need at least 3 hw SGL entries, as the IOVA won't be contiguous.
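(Spelled out, the device-visible layout is then, illustratively:

        entry 0: IOVA range for the first local chunk
        entry 1: P2P bus address for the middle chunk
        entry 2: IOVA range for the second local chunk

so even with an IOMMU the DMA addresses of the whole buffer don't form one
contiguous range.)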

> The other thing that got hand waved here is how does dma_init_io()
> know which of the 6 states we are looking at? I imagine we probably
> want to do something like:
>
> struct dma_io_summarize summary = {};
> for each io()
> dma_io_summarize_range(&summary, io range)
> dma_init_io(dev, &state, &summary);
> if (state->single_range) {
> } else {
> }
> dma_io_done_mapping(&state); <-- flush IOTLB once

That's why I really just want 2 cases. If the caller guarantees the
range is coalescable and there is an IOMMU, use the iommu-API-like
API; else just iterate over map_single/page.

> Enhancing the single sgl case is not a big change, I think. It does
> seem simplifying for the driver to not have to coalesce SGLs to detect
> the single-SGL fast-path.
>
> > > This is not quite what you said, we split the driver flow based on
> > > needing 1 HW SGL vs need many HW SGL.
> >
> > That's at least what I intended to say, and I'm a little curious as what
> > it came across.
>
> Ok, I was reading the discussion more about as alignment than single
> HW SGL, I think you ment alignment as implying coalescing behavior
> implying single HW SGL..

Yes.

2024-03-07 21:01:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 07, 2024 at 04:05:05PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 06, 2024 at 08:00:36PM -0400, Jason Gunthorpe wrote:
> > >
> > > I don't think you can do without dma_addr_t storage. In most cases
> > > your can just store the dma_addr_t in the LE/BE encoded hardware
> > > SGL, so no extra storage should be needed though.
> >
> > RDMA (and often DRM too) generally doesn't work like that, the driver
> > copies the page table into the device and then the only reason to have
> > a dma_addr_t storage is to pass that to the dma unmap API. Optionally
> > eliminating long term dma_addr_t storage would be a worthwhile memory
> > savings for large long lived user space memory registrations.
>
> It's just kinda hard to do. For aligned IOMMU mapping you'd only
> have one dma_addr_t mappings (or maybe a few if P2P regions are
> involved), so this probably doesn't matter. For direct mappings
> you'd have a few, but maybe the better answer is to use THP
> more aggressively and reduce the number of segments.

Right, those things have all been done. 100GB of huge pages is still
using a fair amount of memory for storing dma_addr_t's.
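(For scale, pure arithmetic: 100GB tracked at 4KiB granularity is ~26
million dma_addr_t's, about 200MiB per registration; even at 2MiB
granularity it is still ~51,200 entries, about 400KiB.)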

It is hard to do perfectly, but I think it is not so bad if we focus
on the direct only case and simple systems that can exclude swiotlb
early on.

> > > > So are you thinking something more like a driver flow of:
> > > >
> > > > .. extent IO and get # aligned pages and know if there is P2P ..
> > > > dma_init_io(state, num_pages, p2p_flag)
> > > > if (dma_io_single_range(state)) {
> > > > // #2, #4
> > > > for each io()
> > > > dma_link_aligned_pages(state, io range)
> > > > hw_sgl = (state->iova, state->len)
> > > > } else {
> > >
> > > I think what you have a dma_io_single_range should become before
> > > the dma_init_io. If we know we can't coalesce it really just is a
> > > dma_map_{single,page,bvec} loop, no need for any extra state.
> >
> > I imagine dma_io_single_range() to just check a flag in state.
> >
> > I still want to call dma_init_io() for the non-coalescing cases
> > because all the flows, regardless of composition, should be about as
> > fast as dma_map_sg is today.
>
> If all flows includes multiple non-coalesced regions that just makes
> things very complicated, and that's exactly what I'd want to avoid.

I don't see how to avoid it unless we say RDMA shouldn't use this API,
which is kind of the whole point from my perspective...

I want an API that can handle all the same complexity as dma_map_sg()
without forcing the use of scatterlist. Instead, "bring your own
data structure". This is the essence of what we discussed.

An API that is inferior to dma_map_sg() is really problematic to use
with RDMA.

> > That means we need to always pre-allocate the IOVA in any case where
> > the IOMMU might be active - even on a non-coalescing flow.
> >
> > IOW, dma_init_io() always pre-allocates IOVA if the iommu is going to
> > be used and we can't just call today's dma_map_page() in a loop on the
> > non-coalescing side and pay the overhead of Nx IOVA allocations.
> >
> > In large part this is for RDMA, were a single P2P page in a large
> > multi-gigabyte user memory registration shouldn't drastically harm the
> > registration performance by falling down to doing dma_map_page, and an
> > IOVA allocation, on a 4k page by page basis.
>
> But that P2P page needs to be handled very differently, as with it
> we can't actually use a single iova range. So I'm not sure how that
> is even supposed to work. If you have
>
> +-------+-----+-------+
> | local | P2P | local |
> +-------+-----+-------+
>
> you need at least 3 hw SGL entries, as the IOVA won't be contigous.

Sure, 3 SGL entries is fine; that isn't what I'm pointing at.

I'm saying that today, if you give such a scatterlist to dma_map_sg(),
it scans it and computes the IOVA space needed, allocates one IOVA
space, then subdivides that single space up into the 3 HW SGLs you
show.

If you don't preserve that then we are calling, 4k at a time, a
dma_map_page() which is not anywhere close to the same outcome as what
dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
and we call into the IOVA allocator a huge number of times.

It needs to work following the same basic structure of dma_map_sg,
unfolding that logic into helpers so that the driver can provide
the data structure:

- Scan the io ranges and figure out how much IOVA needed
(dma_io_summarize_range)
- Allocate the IOVA (dma_init_io)
- Scan the io ranges again and generate the final HW SGL
(dma_io_link_page)
- Finish the iommu batch (dma_io_done_mapping)

And you can make that pattern work for all the other cases too.

So I don't see this as particularly worse: calling some other API
instead of dma_map_page is not really extra complexity for the
driver, and neither is calling dma_init_io every time. The
DMA API side is a bit more involved, but not substantively different
logic from what dma_map_sg already does.

Otherwise what is the alternative? How do I keep these complex things
working in RDMA and remove scatterlist?

> > The other thing that got hand waved here is how does dma_init_io()
> > know which of the 6 states we are looking at? I imagine we probably
> > want to do something like:
> >
> > struct dma_io_summarize summary = {};
> > for each io()
> > dma_io_summarize_range(&summary, io range)
> > dma_init_io(dev, &state, &summary);
> > if (state->single_range) {
> > } else {
> > }
> > dma_io_done_mapping(&state); <-- flush IOTLB once
>
> That's why I really just want 2 cases. If the caller guarantees the
> range is coalescable and there is an IOMMU use the iommu-API like
> API, else just iter over map_single/page.

But how does the caller even know if it is coalescable? Other than the
trivial case of a single CPU range, that is a complicated detail based
on what pages are inside the range combined with the capability of the
device doing DMA. I don't see a simple way for the caller to figure
this out. You need to sweep every page and collect some information on
it. The above is to abstract that detail.

It was simpler before the confidential compute stuff :(

Jason

2024-03-08 16:49:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> >
> > It's just kinda hard to do. For aligned IOMMU mapping you'd only
> > have one dma_addr_t mappings (or maybe a few if P2P regions are
> > involved), so this probably doesn't matter. For direct mappings
> > you'd have a few, but maybe the better answer is to use THP
> > more aggressively and reduce the number of segments.
>
> Right, those things have all been done. 100GB of huge pages is still
> using a fair amount of memory for storing dma_addr_t's.
>
> It is hard to do perfectly, but I think it is not so bad if we focus
> on the direct only case and simple systems that can exclude swiotlb
> early on.

Even with direct mappings only we still need to take care of
cache synchronization.

> > If all flows includes multiple non-coalesced regions that just makes
> > things very complicated, and that's exactly what I'd want to avoid.
>
> I don't see how to avoid it unless we say RDMA shouldn't use this API,
> which is kind of the whole point from my perspective..

The DMA API callers really need to know what is P2P or not for
various reasons. And they should generally have that information
available, either from pin_user_pages that needs to special case
it or from the in-kernel I/O submitter that builds it from P2P and
normal memory.

> Sure, 3 SGL entries is fine, that isn't what I'm pointing at
>
> I'm saying that today if you give such a scatterlist to dma_map_sg()
> it scans it and computes the IOVA space need, allocates one IOVA
> space, then subdivides that single space up into the 3 HW SGLs you
> show.
>
> If you don't preserve that then we are calling, 4k at a time, a
> dma_map_page() which is not anywhere close to the same outcome as what
> dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> and we call into the IOVA allocator a huge number of times.

Again, your callers must know what is a P2P region and what is not.
I don't think it is a hard burden to do mappings at that granularity,
and we can encapsulate this in nice helpers for, say, the block layer
and pin_user_pages callers to start.

>
> It needs to work following the same basic structure of dma_map_sg,
> unfolding that logic into helpers so that the driver can provide
> the data structure:
>
> - Scan the io ranges and figure out how much IOVA needed
> (dma_io_summarize_range)

That is in general a function of the upper layer and not the DMA code.

> - Allocate the IOVA (dma_init_io)

And this step is only needed for the iommu case.

> > That's why I really just want 2 cases. If the caller guarantees the
> > range is coalescable and there is an IOMMU use the iommu-API like
> > API, else just iter over map_single/page.
>
> But how does the caller even know if it is coalescable? Other than the
> trivial case of a single CPU range, that is a complicated detail based
> on what pages are inside the range combined with the capability of the
> device doing DMA. I don't see a simple way for the caller to figure
> this out. You need to sweep every page and collect some information on
> it. The above is to abstract that detail.

dma_get_merge_boundary already provides this information in terms
of the device capabilities. And given that the caller knows what
is P2P and what is not, we have all the information that is needed.
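
As a side note, a small sketch of how a caller can use
dma_get_merge_boundary() to test whether two adjacent physical ranges
may share one IOVA mapping (prev_phys_end/next_phys_start are made-up
names; the check mirrors the usual virt-boundary rule):

	unsigned long boundary = dma_get_merge_boundary(dev);

	/* 0 means this device/IOMMU setup cannot merge segments at all */
	if (boundary &&
	    !((prev_phys_end | next_phys_start) & boundary)) {
		/* both ranges can be covered by one contiguous IOVA mapping */
	}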


2024-03-08 20:24:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 08, 2024 at 05:49:20PM +0100, Christoph Hellwig wrote:
> On Thu, Mar 07, 2024 at 05:01:16PM -0400, Jason Gunthorpe wrote:
> > >
> > > It's just kinda hard to do. For aligned IOMMU mapping you'd only
> > > have one dma_addr_t mappings (or maybe a few if P2P regions are
> > > involved), so this probably doesn't matter. For direct mappings
> > > you'd have a few, but maybe the better answer is to use THP
> > > more aggressively and reduce the number of segments.
> >
> > Right, those things have all been done. 100GB of huge pages is still
> > using a fair amount of memory for storing dma_addr_t's.
> >
> > It is hard to do perfectly, but I think it is not so bad if we focus
> > on the direct only case and simple systems that can exclude swiotlb
> > early on.
>
> Even with direct mappings only we still need to take care of
> cache synchronization.

Yes, we still have to unmap, but the unmap for cache synchronization
doesn't need the dma_addr_t to flush the CPU cache.

> > > If all flows includes multiple non-coalesced regions that just makes
> > > things very complicated, and that's exactly what I'd want to avoid.
> >
> > I don't see how to avoid it unless we say RDMA shouldn't use this API,
> > which is kind of the whole point from my perspective..
>
> The DMA API callers really need to know what is P2P or not for
> various reasons. And they should generally have that information
> available, either from pin_user_pages that needs to special case
> it or from the in-kernel I/O submitter that build it from P2P and
> normal memory.

I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
shoves the resulting page list into a scatter table. It never checks
if any returned page is P2P - it has no reason to care. dma_map_sg()
does all the work.

That is the kind of abstraction I am coming to this problem with.

You are looking at BIO where you already needed to split things up for
other reasons, but I think that is a uniquely block thing that will
not be shared in other subsystems.

> > If you don't preserve that then we are calling, 4k at a time, a
> > dma_map_page() which is not anywhere close to the same outcome as what
> > dma_map_sg did. I may not get contiguous IOVA, I may not get 3 SGLs,
> > and we call into the IOVA allocator a huge number of times.
>
> Again, your callers must know what is a P2P region and what is not.

I don't see this at all. We don't do this today in RDMA. There is no
"P2P region".

> > > That's why I really just want 2 cases. If the caller guarantees the
> > > range is coalescable and there is an IOMMU use the iommu-API like
> > > API, else just iter over map_single/page.
> >
> > But how does the caller even know if it is coalescable? Other than the
> > trivial case of a single CPU range, that is a complicated detail based
> > on what pages are inside the range combined with the capability of the
> > device doing DMA. I don't see a simple way for the caller to figure
> > this out. You need to sweep every page and collect some information on
> > it. The above is to abstract that detail.
>
> dma_get_merge_boundary already provides this information in terms
> of the device capabilities. And given that the callers knows what
> is P2P and what is not we have all the information that is needed.

Encrypted memory too.

RDMA also doesn't call dma_get_merge_boundary(). It doesn't keep track
of P2P regions. It doesn't break out encrypted memory. It has no
purpose to do any of those things.

You fundamentally cannot subdivide a memory registration.

So we could artificially introduce the concept of limited coalescing
into RDMA, dmabuf and others just to drive this new API - but really
that feels much much worse than just making the DMA API still able to
do IOMMU coalescing in more cases.

Even if we did that, it will still be less efficient than today where
we just call dma_map_sg() on the jumble of pages.

Jason

2024-03-09 16:14:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > The DMA API callers really need to know what is P2P or not for
> > various reasons. And they should generally have that information
> > available, either from pin_user_pages that needs to special case
> > it or from the in-kernel I/O submitter that build it from P2P and
> > normal memory.
>
> I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> shoves the resulting page list into in a scattertable. It never checks
> if any returned page is P2P - it has no reason to care. dma_map_sg()
> does all the work.

Right now it does, but that's not really a good interface. If we have
a pin_user_pages variant that only pins until the next relevant P2P
boundary and tells you about it, we can significantly simplify the overall
interface.

2024-03-10 09:35:33

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons. And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> >
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into in a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
>
> Right now it does, but that's not really a good interface. If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about we can significantly simplify the overall
> interface.

And you will need to have a way to instruct that pin_user_pages() variant
to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
force, you will have !FOLL_PCI_P2PDMA behaviour.

When you say "simplify the overall interface", which interface do you mean?

Thanks

2024-03-12 21:29:24

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> And you will need to have a way to instruct that pin_user_pages() variant
> to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> force, you will have !FOLL_PCI_P2PDMA behaviour.

I don't understand what you mean.

> When you say "simplify the overall interface", which interface do you mean?

Primarily the dma mapping interface. Secondarily also everything around
it.

2024-03-13 07:46:58

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > And you will need to have a way to instruct that pin_user_pages() variant
> > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > force, you will have !FOLL_PCI_P2PDMA behaviour.
>
> I don't understand what you mean.

Jason talked about the need to call pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
but in your proposal this call won't be possible anymore.

>
> > When you say "simplify the overall interface", which interface do you mean?
>
> Primarily the dma mapping interface. Secondarily also everything around
> it.

OK, thanks

2024-03-13 21:44:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 13, 2024 at 09:46:36AM +0200, Leon Romanovsky wrote:
> On Tue, Mar 12, 2024 at 10:28:44PM +0100, Christoph Hellwig wrote:
> > On Sun, Mar 10, 2024 at 11:35:13AM +0200, Leon Romanovsky wrote:
> > > And you will need to have a way to instruct that pin_user_pages() variant
> > > to continue anyway, because you asked for FOLL_PCI_P2PDMA. Without that
> > > force, you will have !FOLL_PCI_P2PDMA behaviour.
> >
> > I don't understand what you mean.
>
> Jason talked about the need to call to pin_user_pages(..., gup_flags | FOLL_PCI_P2PDMA, ...),
> but in your proposal this call won't be possible anymore.

Why?


2024-03-19 17:54:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > The DMA API callers really need to know what is P2P or not for
> > > various reasons. And they should generally have that information
> > > available, either from pin_user_pages that needs to special case
> > > it or from the in-kernel I/O submitter that build it from P2P and
> > > normal memory.
> >
> > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > shoves the resulting page list into in a scattertable. It never checks
> > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > does all the work.
>
> Right now it does, but that's not really a good interface. If we have
> a pin_user_pages variant that only pins until the next relevant P2P
> boundary and tells you about we can significantly simplify the overall
> interface.

Sorry for the delay, I was away..

I kind of understand your thinking on the DMA side, but I don't see
how this is good for users of the API beyond BIO.

How will this make RDMA better? We have one MR, the MR has pages, the
HW doesn't care about the SW distinction of p2p, swiotlb, direct,
encrypted, iommu, etc. It needs to create one HW page list for
whatever user VA range was given.

Or worse, whatever thing is inside a DMABUF from a DRM
driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
AFAIK based on the GPU's migration behavior.

Or triple worse, ODP can dynamically change the type on a page-by-page
basis depending on what hmm_range_fault() returns.

So I take it as a requirement that RDMA MUST make single MR's out of a
hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
not a functional replacement for a single MR.

Go back to the start of what we are trying to do here:
1) Make a DMA API that can support hmm_range_fault() users in a
sensible and performant way
2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
user VA's without restriction
3) Allow to remove scatterlist from BIO paths
4) Provide a DMABUF API that is not scatterlist that can feed into
the new DMA API - again supporting DMABUF's hodgepodge of types.

I'd like to do all of these things. I know 3 is your highest priority,
but it is my lowest :)

So, if the new API can only do uniformity I immediately lose #1 -
hmm_range_fault() can't guarantee anything, so it loses the IOVA
optimization that Leon's patches illustrate.

For uniformity #2 probably needs multiple DMA API "transactions". This
is doable, but it is less performant than one "transaction".

#3 is perfectly happy because BIO already creates uniformity

#4 is like #2, there is no guaranteed uniformity inside DMABUF so
every DMABUF importer needs to take some complexity to deal with
it. There are many DMABUF importers so this feels like a poor API
abstraction if we force everyone there to take on complexity.

So I'm just not seeing why this would be better. I think Leon's series
shows the cost of non-uniformity support is actually pretty
small. Still, we could do better: if the caller can optionally
indicate it knows it has uniformity then that can be optimized further.

I'd like to find something that works well for all of the above, and I
think abstracting non-uniformity at the API level is important for the
above reasons.

Can we tweak what Leon has done to keep the hmm_range_fault support
and non-uniformity for RDMA but add a uniformity optimized flow for
BIO?

Jason

2024-03-20 08:59:32

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 09, 2024 at 05:14:18PM +0100, Christoph Hellwig wrote:
> > On Fri, Mar 08, 2024 at 04:23:42PM -0400, Jason Gunthorpe wrote:
> > > > The DMA API callers really need to know what is P2P or not for
> > > > various reasons. And they should generally have that information
> > > > available, either from pin_user_pages that needs to special case
> > > > it or from the in-kernel I/O submitter that build it from P2P and
> > > > normal memory.
> > >
> > > I think that is a BIO thing. RDMA just calls with FOLL_PCI_P2PDMA and
> > > shoves the resulting page list into in a scattertable. It never checks
> > > if any returned page is P2P - it has no reason to care. dma_map_sg()
> > > does all the work.
> >
> > Right now it does, but that's not really a good interface. If we have
> > a pin_user_pages variant that only pins until the next relevant P2P
> > boundary and tells you about we can significantly simplify the overall
> > interface.
>
> Sorry for the delay, I was away..

<...>

> Can we tweak what Leon has done to keep the hmm_range_fault support
> and non-uniformity for RDMA but add a uniformity optimized flow for
> BIO?

Something like this will do the trick.

From 45e739e7073fb04bc168624f77320130bb3f9267 Mon Sep 17 00:00:00 2001
Message-ID: <45e739e7073fb04bc168624f77320130bb3f9267.1710924764.git.leonro@nvidia.com>
From: Leon Romanovsky <[email protected]>
Date: Mon, 18 Mar 2024 11:16:41 +0200
Subject: [PATCH] mm/gup: add strict interface to pin user pages according to
FOLL flag

All pin_user_pages*() and get_user_pages*() callbacks allocate user
pages while only partially taking their p2p vs. non-p2p properties into
account.

If the user sets the FOLL_PCI_P2PDMA flag, the allocated pages will include
both p2p and "regular" pages, while if the flag is not provided, only
regular pages are returned.

In order to make sure that with the FOLL_PCI_P2PDMA flag only p2p pages are
returned, let's introduce a new internal FOLL_STRICT flag and provide a
special pin_user_pages_fast_strict() API call.

Signed-off-by: Leon Romanovsky <[email protected]>
---
include/linux/mm.h | 3 +++
mm/gup.c | 36 +++++++++++++++++++++++++++++++++++-
mm/internal.h | 4 +++-
3 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..910b65dde24a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,6 +2491,9 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
void folio_add_pin(struct folio *folio);

+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+ unsigned int gup_flags, struct page **pages);
+
int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
struct task_struct *task, bool bypass_rlim);
diff --git a/mm/gup.c b/mm/gup.c
index df83182ec72d..11b5c626a4ab 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -133,6 +133,10 @@ struct folio *try_grab_folio(struct page *page, int refs, unsigned int flags)
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return NULL;

+ if (flags & FOLL_STRICT)
+ if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+ return NULL;
+
if (flags & FOLL_GET)
return try_get_folio(page, refs);

@@ -232,6 +236,10 @@ int __must_check try_grab_page(struct page *page, unsigned int flags)
if (unlikely(!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)))
return -EREMOTEIO;

+ if (flags & FOLL_STRICT)
+ if (flags & FOLL_PCI_P2PDMA && !is_pci_p2pdma_page(page))
+ return -EREMOTEIO;
+
if (flags & FOLL_GET)
folio_ref_inc(folio);
else if (flags & FOLL_PIN) {
@@ -2243,6 +2251,8 @@ static bool is_valid_gup_args(struct page **pages, int *locked,
* - FOLL_TOUCH/FOLL_PIN/FOLL_TRIED/FOLL_FAST_ONLY are internal only
* - FOLL_REMOTE is internal only and used on follow_page()
* - FOLL_UNLOCKABLE is internal only and used if locked is !NULL
+ * - FOLL_STRICT is internal only and used to distinguish between p2p
+ * and "regular" pages.
*/
if (WARN_ON_ONCE(gup_flags & INTERNAL_GUP_FLAGS))
return false;
@@ -3187,7 +3197,8 @@ static int internal_get_user_pages_fast(unsigned long start,
if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
FOLL_FORCE | FOLL_PIN | FOLL_GET |
FOLL_FAST_ONLY | FOLL_NOFAULT |
- FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT)))
+ FOLL_PCI_P2PDMA | FOLL_HONOR_NUMA_FAULT |
+ FOLL_STRICT)))
return -EINVAL;

if (gup_flags & FOLL_PIN)
@@ -3322,6 +3333,29 @@ int pin_user_pages_fast(unsigned long start, int nr_pages,
}
EXPORT_SYMBOL_GPL(pin_user_pages_fast);

+/**
+ * pin_user_pages_fast_strict() - this is a pin_user_pages_fast() variant, which
+ * makes sure that only pages with same properties are pinned.
+ *
+ * @start: starting user address
+ * @nr_pages: number of pages from start to pin
+ * @gup_flags: flags modifying pin behaviour
+ * @pages: array that receives pointers to the pages pinned.
+ * Should be at least nr_pages long.
+ *
+ * Nearly the same as pin_user_pages_fast(), except that FOLL_STRICT is set.
+ *
+ * FOLL_STRICT means that the pages are allocated with specific FOLL_* properties.
+ */
+int pin_user_pages_fast_strict(unsigned long start, int nr_pages,
+ unsigned int gup_flags, struct page **pages)
+{
+ if (!is_valid_gup_args(pages, NULL, &gup_flags, FOLL_PIN | FOLL_STRICT))
+ return -EINVAL;
+ return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fast_strict);
+
/**
* pin_user_pages_remote() - pin pages of a remote process
*
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..7578837a0444 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1031,10 +1031,12 @@ enum {
FOLL_FAST_ONLY = 1 << 20,
/* allow unlocking the mmap lock */
FOLL_UNLOCKABLE = 1 << 21,
+ /* don't mix pages with different properties, e.g. p2p with "regular" ones */
+ FOLL_STRICT = 1 << 22,
};

#define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \
- FOLL_FAST_ONLY | FOLL_UNLOCKABLE)
+ FOLL_FAST_ONLY | FOLL_UNLOCKABLE | FOLL_STRICT)

/*
* Indicates for which pages that are write-protected in the page table,
--
2.44.0
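
For reference, a usage sketch assuming the semantics described in the
commit message above (start, nr_pages and the caller-allocated pages[]
array are made up for the example):

	int npinned;

	/* FOLL_STRICT is set internally, so only PCI P2PDMA pages come back */
	npinned = pin_user_pages_fast_strict(start, nr_pages,
					     FOLL_WRITE | FOLL_PCI_P2PDMA,
					     pages);
	if (npinned <= 0)
		return npinned ? npinned : -EFAULT;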

2024-03-21 22:39:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> I kind of understand your thinking on the DMA side, but I don't see
> how this is good for users of the API beyond BIO.
>
> How will this make RDMA better? We have one MR, the MR has pages, the
> HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> encrypted, iommu, etc. It needs to create one HW page list for
> whatever user VA range was given.

Well, the hardware (as in the PCIe card) never cares. But the setup
path for the IOMMU does, and something in the OS needs to know about
it. So unless we want to stash away a 'is this P2P' flag in every
page / SG entry / bvec, or a do a lookup to find that out for each
of them we need to manage chunks at these boundaries. And that's
what I'm proposing.

> Or worse, whatever thing is inside a DMABUF from a DRM
> driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> AFAIK based on the GPU's migration behavior.

And that's fine. We just need to track it efficiently.

>
> Or triple worse, ODP can dynamically change on a page by page basis
> the type depending on what hmm_range_fault() returns.

Same. If this changes all the time you need to track it. And we
should find a way to share the code if we have multiple users for it.

But most DMA API consumers will never see P2P, and when they see it
it will be static. So don't build the DMA API to automatically do
the (not exactly super cheap) checks and add complexity for it.

> So I take it as a requirement that RDMA MUST make single MR's out of a
> hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> not a functional replacement for a single MR.

But MRs consolidate multiple dma addresses anyway.

> Go back to the start of what are we trying to do here:
> 1) Make a DMA API that can support hmm_range_fault() users in a
> sensible and performant way
> 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> user VA's without restriction
> 3) Allow to remove scatterlist from BIO paths
> 4) Provide a DMABUF API that is not scatterlist that can feed into
> the new DMA API - again supporting DMABUF's hodgepodge of types.
>
> I'd like to do all of these things. I know 3 is your highest priority,
> but it is my lowest :)

Well, 3 and 4. And 3 is not just limited to bio, but all the other
pointless scatterlist uses.


2024-03-21 22:43:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> Something like this will do the trick.

As far as I can tell it totally misses the point, which is not to never
return non-P2P if the flag is set, but to return either all P2P or all
non-P2P and not create a boundary in the single call.


2024-03-22 17:46:33

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 21, 2024 at 11:40:13PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 20, 2024 at 10:55:36AM +0200, Leon Romanovsky wrote:
> > Something like this will do the trick.
>
> As far as I can tell it totally misses the point. Which is not to never
> return non-P2P if the flag is set, but to return either all P2P or non-P2
> P and not create a boundary in the single call.

You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
you set it only for p2p queues. I was under the impression that you want only
p2p pages in these queues.

Anyway, I can prepare another patch that will return either p2p or non-p2p pages in one shot.

Thanks

2024-03-22 18:44:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, Mar 21, 2024 at 11:39:10PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 19, 2024 at 12:36:20PM -0300, Jason Gunthorpe wrote:
> > I kind of understand your thinking on the DMA side, but I don't see
> > how this is good for users of the API beyond BIO.
> >
> > How will this make RDMA better? We have one MR, the MR has pages, the
> > HW doesn't care about the SW distinction of p2p, swiotlb, direct,
> > encrypted, iommu, etc. It needs to create one HW page list for
> > whatever user VA range was given.
>
> Well, the hardware (as in the PCIe card) never cares. But the setup
> path for the IOMMU does, and something in the OS needs to know about
> it. So unless we want to stash away a 'is this P2P' flag in every
> page / SG entry / bvec, or a do a lookup to find that out for each
> of them we need to manage chunks at these boundaries. And that's
> what I'm proposing.

Okay, if we look at the struct-page-less world (which we want for
DMABUF) then we need to keep track for sure. What I had drafted was to
keep track in the new "per-SG entry" because that seemed easiest to
migrate existing code into.

Though the data structure could also be written as a list of uniform
memory types and then a list of SG entries (more like how bio is
organized).

No idea right now which is better, and I'm happy make it go either
way.

But Leon's series is not quite getting to this, it is still struct
page based and struct page itself has all the metadata - though as you
say it is a bit expensive to access.

> > Or worse, whatever thing is inside a DMABUF from a DRM
> > driver. DMABUF's can have a (dynamic!) mixture of P2P and regular
> > AFAIK based on the GPU's migration behavior.
>
> And that's fine. We just need to track it efficiently.

Right, DMABUF/etc will return something that has a list of physical
addresses and some meta-data to indicate the "p2p memory provider" for
the P2P part.

Perhaps it could be as simple as 1 bit in the physical address/length
and a global "P2P memory provider" pointer for the entire DMABUF.
Unclear to me right now, but sure.
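
Purely as an illustration of that encoding (made-up types, not an
existing or proposed API):

	struct dmabuf_phys_vec {
		u64 addr;	/* CPU physical address */
		u64 len;	/* low bit reused as "this range is P2P" */
	};

	struct dmabuf_phys_list {
		void *p2p_provider;	/* one provider for the whole DMABUF */
		unsigned int nr;
		struct dmabuf_phys_vec vec[];
	};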

> > Or triple worse, ODP can dynamically change on a page by page basis
> > the type depending on what hmm_range_fault() returns.
>
> Same. If this changes all the time you need to track it. And we
> should find a way to shared the code if we have multiple users for it.

ODP (for at least the foreseeable future) is simpler because it is
always struct page based so we don't need more metadata if we pay the
cost to reach into the struct page. I suspect that is the right
trade-off for hmm_range_fault users.

> But most DMA API consumers will never see P2P, and when they see it
> it will be static. So don't build the DMA API to automically do
> the (not exactly super cheap) checks and add complexity for it.

Okay, I think I get what you'd like to see.

If we are going to make caller-provided uniformity a requirement, let's
imagine a formal memory type idea to help keep this a little
abstracted?

DMA_MEMORY_TYPE_NORMAL
DMA_MEMORY_TYPE_P2P_NOT_ACS
DMA_MEMORY_TYPE_ENCRYPTED
DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??

Then maybe the driver flow looks like:

if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
	struct dma_api_iommu_state state;

	dma_api_iommu_start(&state, transaction.num_pages);
	for_each_range(transaction, range)
		dma_api_iommu_map_range(&state, range.start_page, range.length);
	num_hwsgls = 1;
	hwsgl.addr = state.iova;
	hwsgl.length = transaction.length;
	dma_api_iommu_batch_done(&state);
} else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
	num_hwsgls = transaction.num_sgls;
	for_each_range(transaction, range) {
		hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
		hwsgl[i].len = range.size;
	}
} else {
	/* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
	num_hwsgls = transaction.num_sgls;
	for_each_range(transaction, range) {
		hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
		hwsgl[i].len = range.size;
	}
}

And the hmm_range_fault case is sort of like:

struct dma_api_iommu_state state;
dma_api_iommu_start(&state, mr.num_pages);

[..]
hmm_range_fault(...)
if (present)
	dma_link_page(&state, faulting_address_offset, page);
else
	dma_unlink_page(&state, faulting_address_offset, page);

Is this looking closer?

> > So I take it as a requirement that RDMA MUST make single MR's out of a
> > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > not a functional replacement for a single MR.
>
> But MRs consolidate multiple dma addresses anyway.

I'm not sure I understand this?

> > Go back to the start of what are we trying to do here:
> > 1) Make a DMA API that can support hmm_range_fault() users in a
> > sensible and performant way
> > 2) Make a DMA API that can support RDMA MR's backed by DMABUF's, and
> > user VA's without restriction
> > 3) Allow to remove scatterlist from BIO paths
> > 4) Provide a DMABUF API that is not scatterlist that can feed into
> > the new DMA API - again supporting DMABUF's hodgepodge of types.
> >
> > I'd like to do all of these things. I know 3 is your highest priority,
> > but it is my lowest :)
>
> Well, 3 an 4. And 3 is not just limited to bio, but all the other
> pointless scatterlist uses.

Well, I didn't write a '5) remove all the other pointless scatterlist
case' :)

Anyhow, I think we all agree on the high level objective, we just need
to get to an API that fuses all of these goals together.

To go back to my main thesis - I would like a high performance low
level DMA API that is capable enough that it could implement
scatterlist dma_map_sg() and thus also implement any future
scatterlist_v2, bio, hmm_range_fault or any other thing we come up
with on top of it. This is broadly what I thought we agreed to at LSF
last year.

Jason

2024-03-25 04:05:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 22, 2024 at 07:46:17PM +0200, Leon Romanovsky wrote:
> > As far as I can tell it totally misses the point. Which is not to never
> > return non-P2P if the flag is set, but to return either all P2P or non-P2
> > P and not create a boundary in the single call.
>
> You are treating FOLL_PCI_P2PDMA as a hint, but in iov_iter_extract_user_pages()
> you set it only for p2p queues. I was under impression that you want only p2p pages
> in these queues.

FOLL_PCI_P2PDMA is an indicator that the caller can cope with P2P
pages. Most callers simply can't.


2024-03-25 08:47:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> If we are going to make caller provided uniformity a requirement, lets
> imagine a formal memory type idea to help keep this a little
> abstracted?
>
> DMA_MEMORY_TYPE_NORMAL
> DMA_MEMORY_TYPE_P2P_NOT_ACS
> DMA_MEMORY_TYPE_ENCRYPTED
> DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??
>
> Then maybe the driver flow looks like:
>
> if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {

Add a nice helper to make this somewhat readable, but yes.

> } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> num_hwsgls = transcation.num_sgls;
> for_each_range(transaction, range) {
> hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> hwsgl[i].len = range.size;
> }
> } else {
> /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> num_hwsgls = transcation.num_sgls;
> for_each_range(transaction, range) {
> hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> hwsgl[i].len = range.size;
> }
>

And these two are really the same except that we call a different map
helper underneath. So I think as far as the driver is concerned
they should be the same; the DMA API just needs to key off the
memory type.

> And the hmm_range_fault case is sort of like:
>
> struct dma_api_iommu_state state;
> dma_api_iommu_start(&state, mr.num_pages);
>
> [..]
> hmm_range_fault(...)
> if (present)
> dma_link_page(&state, faulting_address_offset, page);
> else
> dma_unlink_page(&state, faulting_address_offset, page);
>
> Is this looking closer?

Yes.

> > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > not a functional replacement for a single MR.
> >
> > But MRs consolidate multiple dma addresses anyway.
>
> I'm not sure I understand this?

The RDMA MRs take a list of PFN-ish addresses (or SGLs with the
enhanced MRs from Mellanox) and give you back a single rkey/lkey.

> To go back to my main thesis - I would like a high performance low
> level DMA API that is capable enough that it could implement
> scatterlist dma_map_sg() and thus also implement any future
> scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> with on top of it. This is broadly what I thought we agreed to at LSF
> last year.

I think the biggest underlying problem of the scatterlist based
DMA implementation for IOMMUs is that it's trying to handle too much,
that is magic coalescing even if the segment boundaries don't align
with the IOMMU page size. If we can get rid of that misfeature I
think we'd greatly simplify the API and implementation.

2024-03-27 17:37:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Mar 25, 2024 at 12:22:15AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 22, 2024 at 03:43:30PM -0300, Jason Gunthorpe wrote:
> > If we are going to make caller provided uniformity a requirement, lets
> > imagine a formal memory type idea to help keep this a little
> > abstracted?
> >
> > DMA_MEMORY_TYPE_NORMAL
> > DMA_MEMORY_TYPE_P2P_NOT_ACS
> > DMA_MEMORY_TYPE_ENCRYPTED
> > DMA_MEMORY_TYPE_BOUNCE_BUFFER // ??
> >
> > Then maybe the driver flow looks like:
> >
> > if (transaction.memory_type == DMA_MEMORY_TYPE_NORMAL && dma_api_has_iommu(dev)) {
>
> Add a nice helper to make this somewhat readable, but yes.
>
> > } else if (transaction.memory_type == DMA_MEMORY_TYPE_P2P_NOT_ACS) {
> > num_hwsgls = transcation.num_sgls;
> > for_each_range(transaction, range) {
> > hwsgl[i].addr = dma_api_p2p_not_acs_map(range.start_physical, range.length, p2p_memory_provider);
> > hwsgl[i].len = range.size;
> > }
> > } else {
> > /* Must be DMA_MEMORY_TYPE_NORMAL, DMA_MEMORY_TYPE_ENCRYPTED, DMA_MEMORY_TYPE_BOUNCE_BUFFER? */
> > num_hwsgls = transcation.num_sgls;
> > for_each_range(transaction, range) {
> > hwsgl[i].addr = dma_api_map_cpu_page(range.start_page, range.length);
> > hwsgl[i].len = range.size;
> > }
> >
>
> And these two are really the same except that we call a different map
> helper underneath. So I think as far as the driver is concerned
> they should be the same, the DMA API just needs to key off the
> memory tap.

Yeah.. If the caller is going to have to compute the memory type of the
range then let's pass it to the helper

dma_api_map_memory_type(transaction.memory_type, range.start_page, range.length);

Then we can just hide all the differences under the API without doing
duplicated work.

Function names need some work ...
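
i.e. the two non-IOMMU branches from the earlier sketch collapse into
one loop, roughly (sketch only, same made-up names as above):

	for_each_range(transaction, range) {
		hwsgl[i].addr = dma_api_map_memory_type(transaction.memory_type,
							range.start_page, range.length);
		hwsgl[i].len = range.size;
		i++;
	}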

> > > > So I take it as a requirement that RDMA MUST make single MR's out of a
> > > > hodgepodge of page types. RDMA MRs cannot be split. Multiple MR's are
> > > > not a functional replacement for a single MR.
> > >
> > > But MRs consolidate multiple dma addresses anyway.
> >
> > I'm not sure I understand this?
>
> The RDMA MRs take a a list of PFNish address, (or SGLs with the
> enhanced MRs from Mellanox) and give you back a single rkey/lkey.

Yes, that is the desire.

> > To go back to my main thesis - I would like a high performance low
> > level DMA API that is capable enough that it could implement
> > scatterlist dma_map_sg() and thus also implement any future
> > scatterlist_v2, bio, hmm_range_fault or any other thing we come up
> > with on top of it. This is broadly what I thought we agreed to at LSF
> > last year.
>
> I think the biggest underlying problem of the scatterlist based
> DMA implementation for IOMMUs is that it's trying to handle to much,
> that is magic coalescing even if the segments boundaries don't align
> with the IOMMU page size. If we can get rid of that misfeature I
> think we'd greatly simply the API and implementation.

Yeah, that stuff is not easy at all and takes extra computation to
figure out. I always assumed it was there for block...

Leon & Chaitanya will make an RFC v2 along these lines, let's see how it
goes.

Thanks,
Jason

2024-04-09 20:39:47

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps



On 2024/3/7 7:01, Zhu Yanjun wrote:
> On 2024/3/5 12:18, Leon Romanovsky wrote:
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/[email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> I am interested in this topic. Hope I can join the meeting to discuss
> this topic.
>

Following the same idea, the dma_alloc_coherent function called in the
IDPF driver can be divided into the following two functions:

iommu_dma_alloc_pages

and

iommu_dma_map_page

So iommu_dma_alloc_pages allocates the pages, and iommu_dma_map_page
creates the mapping between those pages and the IOVA.

The above idea is now implemented in the NIC driver and currently
works well.

Next, the same idea will be applied to the block device. Hopefully this
will increase the performance of the block device.

Best Regards,
Zhu Yanjun

> Zhu Yanjun
>
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA,
>> VFIO and
>> DMA patches are ready for review and inclusion, the NVMe patches are
>> still in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at one same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> existing DMA mapping routine to two steps and give an option to advanced
>> callers (subsystems) perform all calculations internally in advance and
>> map pages later when it is needed.
>>
>> In this series, three users are converted and each of such conversion
>> presents different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>>     on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>>     directly on BIOs.
>>
>> Thanks
>>
>> [1]
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>>    block: add dma_link_range() based API
>>    nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>>    mm/hmm: let users to tag specific PFNs
>>    dma-mapping: provide an interface to allocate IOVA
>>    dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>>    iommu/dma: Provide an interface to allow preallocate IOVA
>>    iommu/dma: Prepare map/unmap page functions to receive IOVA
>>    iommu/dma: Implement link/unlink page callbacks
>>    RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>>    RDMA/umem: Store ODP access mask information in PFN
>>    RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>>    RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>>    vfio/mlx5: Explicitly use number of pages instead of allocated length
>>    vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>>    vfio/mlx5: Explicitly store page list
>>    vfio/mlx5: Convert vfio to use DMA link API
>>
>>   Documentation/core-api/dma-attributes.rst |   7 +
>>   block/blk-merge.c                         | 156 ++++++++++++++
>>   drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>>   drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>>   drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>>   drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>>   drivers/nvme/host/pci.c                   | 220 +++++--------------
>>   drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>>   drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>>   drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>>   include/linux/blk-mq.h                    |   9 +
>>   include/linux/dma-map-ops.h               |  13 ++
>>   include/linux/dma-mapping.h               |  39 ++++
>>   include/linux/hmm.h                       |   3 +
>>   include/rdma/ib_umem_odp.h                |  22 +-
>>   include/rdma/ib_verbs.h                   |  54 +++++
>>   kernel/dma/debug.h                        |   2 +
>>   kernel/dma/direct.h                       |   7 +-
>>   kernel/dma/mapping.c                      |  91 ++++++++
>>   mm/hmm.c                                  |  34 +--
>>   20 files changed, 870 insertions(+), 605 deletions(-)
>>
>

2024-05-02 23:33:47

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Hi Leon, Jason

> -----Original Message-----
> From: Leon Romanovsky <[email protected]>
> Sent: Tuesday, March 5, 2024 6:19 AM
> To: Christoph Hellwig <[email protected]>; Robin Murphy
> <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Jason Gunthorpe <[email protected]>; Chaitanya
> Kulkarni <[email protected]>
> Cc: Jonathan Corbet <[email protected]>; Jens Axboe <[email protected]>;
> Keith Busch <[email protected]>; Sagi Grimberg <[email protected]>;
> Yishai Hadas <[email protected]>; Shameer Kolothum
> <[email protected]>; Kevin Tian
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Dan Williams
> <[email protected]>; [email protected]; Leon Romanovsky
> <[email protected]>; Zhu Yanjun <[email protected]>
> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
> steps
>
> This is complimentary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-
> [email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO
> and
> DMA patches are ready for review and inclusion, the NVMe patches are still
> in
> progress as they require agreement on API first.
>
> Thanks
>
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at one same time: allocates
> IOVA space and actually maps DMA pages to that space. This one shot
> operation works perfectly for non-complex scenarios, where callers use
> that DMA API in control path when they setup hardware.
>
> However in more complex scenarios, when DMA mapping is needed in data
> path and especially when some sort of specific datatype is involved,
> such one shot approach has its drawbacks.
>
> That approach pushes developers to introduce new DMA APIs for specific
> datatype. For example existing scatter-gather mapping functions, or
> latest Chuck's RFC series to add biovec related DMA mapping [1] and
> probably struct folio will need it too.
>
> These advanced DMA mapping APIs are needed to calculate IOVA size to
> allocate it as one chunk and some sort of offset calculations to know
> which part of IOVA to map.
>
> Instead of teaching DMA to know these specific datatypes, let's separate
> existing DMA mapping routine to two steps and give an option to advanced
> callers (subsystems) perform all calculations internally in advance and
> map pages later when it is needed.

I looked into how this scheme can be applied to the DRM subsystem and GPU drivers.

I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, a user can register a memory region (the reg_user_mr vfunc) and the memory region's size is used to pre-allocate iova space. And in the RDMA use case, it seems the user-registered region can be very big, e.g., 512MiB or even GiBs.

In GPU drivers, we have a few use cases where we need dma-mapping. To name two:

1) userptr: it is user malloc'ed/mmap'ed memory that is registered to the gpu (in Intel's driver through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on the user's malloc size. Today we use dma-map-sg for this use case. The downside of our approach is that, during userptr invalidation, even if the user only munmaps part of a userptr, we invalidate the whole userptr from the gpu page table, because there is no way for us to partially dma-unmap the sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case.

We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might not be as efficient as the RDMA case, where I assume the user registers a few big memory regions.

2) system allocator: it is malloc'ed/mmap'ed memory used by a GPU program directly, without any extra driver API call. We call this use case system allocator.

For the system allocator, the driver has no knowledge of which virtual address range is valid in advance. So when the GPU accesses a malloc'ed/mmap'ed address, we get a page fault. We then look up the CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate iova space of the same size?

But there will be a true difficulty in applying your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for the driver to mark "this page/pfn has been populated, no need to re-populate again", roughly... Unlike the userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can change between system memory and the GPU's device private memory. Even worse, we have to assume the data migration between system and GPU is dynamic. When data is migrated to the GPU, we don't need dma-map. And when migration happens to a pfn with the STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.


Oak


>
> In this series, three users are converted and each of such conversion
> presents different positive gain:
> 1. RDMA simplifies and speeds up its pagefault handling for
> on-demand-paging (ODP) mode.
> 2. VFIO PCI live migration code saves huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
> directly on BIOs.
>
> Thanks
>
> [1]
> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@
> klimt.1015granger.net
>
> Chaitanya Kulkarni (2):
> block: add dma_link_range() based API
> nvme-pci: use blk_rq_dma_map() for NVMe SGL
>
> Leon Romanovsky (14):
> mm/hmm: let users to tag specific PFNs
> dma-mapping: provide an interface to allocate IOVA
> dma-mapping: provide callbacks to link/unlink pages to specific IOVA
> iommu/dma: Provide an interface to allow preallocate IOVA
> iommu/dma: Prepare map/unmap page functions to receive IOVA
> iommu/dma: Implement link/unlink page callbacks
> RDMA/umem: Preallocate and cache IOVA for UMEM ODP
> RDMA/umem: Store ODP access mask information in PFN
> RDMA/core: Separate DMA mapping to caching IOVA and page linkage
> RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
> vfio/mlx5: Explicitly use number of pages instead of allocated length
> vfio/mlx5: Rewrite create mkey flow to allow better code reuse
> vfio/mlx5: Explicitly store page list
> vfio/mlx5: Convert vfio to use DMA link API
>
> Documentation/core-api/dma-attributes.rst | 7 +
> block/blk-merge.c | 156 ++++++++++++++
> drivers/infiniband/core/umem_odp.c | 219 +++++++------------
> drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
> drivers/infiniband/hw/mlx5/odp.c | 59 +++--
> drivers/iommu/dma-iommu.c | 129 ++++++++---
> drivers/nvme/host/pci.c | 220 +++++--------------
> drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
> drivers/vfio/pci/mlx5/cmd.h | 22 +-
> drivers/vfio/pci/mlx5/main.c | 136 +++++-------
> include/linux/blk-mq.h | 9 +
> include/linux/dma-map-ops.h | 13 ++
> include/linux/dma-mapping.h | 39 ++++
> include/linux/hmm.h | 3 +
> include/rdma/ib_umem_odp.h | 22 +-
> include/rdma/ib_verbs.h | 54 +++++
> kernel/dma/debug.h | 2 +
> kernel/dma/direct.h | 7 +-
> kernel/dma/mapping.c | 91 ++++++++
> mm/hmm.c | 34 +--
> 20 files changed, 870 insertions(+), 605 deletions(-)
>
> --
> 2.44.0


2024-05-03 11:57:25

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps


On 03.05.24 01:32, Zeng, Oak wrote:
> Hi Leon, Jason
>
>> -----Original Message-----
>> From: Leon Romanovsky <[email protected]>
>> Sent: Tuesday, March 5, 2024 6:19 AM
>> To: Christoph Hellwig <[email protected]>; Robin Murphy
>> <[email protected]>; Marek Szyprowski
>> <[email protected]>; Joerg Roedel <[email protected]>; Will
>> Deacon <[email protected]>; Jason Gunthorpe <[email protected]>; Chaitanya
>> Kulkarni <[email protected]>
>> Cc: Jonathan Corbet <[email protected]>; Jens Axboe <[email protected]>;
>> Keith Busch <[email protected]>; Sagi Grimberg <[email protected]>;
>> Yishai Hadas <[email protected]>; Shameer Kolothum
>> <[email protected]>; Kevin Tian
>> <[email protected]>; Alex Williamson <[email protected]>;
>> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
>> foundation.org>; [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected]; Bart Van Assche
>> <[email protected]>; Damien Le Moal
>> <[email protected]>; Amir Goldstein
>> <[email protected]>; [email protected]; Martin K. Petersen
>> <[email protected]>; [email protected]; Dan Williams
>> <[email protected]>; [email protected]; Leon Romanovsky
>> <[email protected]>; Zhu Yanjun <[email protected]>
>> Subject: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two
>> steps
>>
>> This is complimentary part to the proposed LSF/MM topic.
>> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-
>> [email protected]/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>>
>> This is posted as RFC to get a feedback on proposed split, but RDMA, VFIO
>> and
>> DMA patches are ready for review and inclusion, the NVMe patches are still
>> in
>> progress as they require agreement on API first.
>>
>> Thanks
>>
>> -------------------------------------------------------------------------------
>> The DMA mapping operation performs two steps at one same time: allocates
>> IOVA space and actually maps DMA pages to that space. This one shot
>> operation works perfectly for non-complex scenarios, where callers use
>> that DMA API in control path when they setup hardware.
>>
>> However in more complex scenarios, when DMA mapping is needed in data
>> path and especially when some sort of specific datatype is involved,
>> such one shot approach has its drawbacks.
>>
>> That approach pushes developers to introduce new DMA APIs for specific
>> datatype. For example existing scatter-gather mapping functions, or
>> latest Chuck's RFC series to add biovec related DMA mapping [1] and
>> probably struct folio will need it too.
>>
>> These advanced DMA mapping APIs are needed to calculate IOVA size to
>> allocate it as one chunk and some sort of offset calculations to know
>> which part of IOVA to map.
>>
>> Instead of teaching DMA to know these specific datatypes, let's separate
>> existing DMA mapping routine to two steps and give an option to advanced
>> callers (subsystems) perform all calculations internally in advance and
>> map pages later when it is needed.
> I looked into how this scheme can be applied to DRM subsystem and GPU drivers.
>
> I figured RDMA can apply this scheme because RDMA can calculate the iova size. Per my limited knowledge of rdma, user can register a memory region (the reg_user_mr vfunc) and memory region's sized is used to pre-allocate iova space. And in the RDMA use case, it seems the user registered region can be very big, e.g., 512MiB or even GiB
>
> In GPU driver, we have a few use cases where we need dma-mapping. Just name two:
>
> 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu (in Intel's driver it is through a vm_bind api, similar to mmap). A userptr can be of any random size, depending on user malloc size. Today we use dma-map-sg for this use case. The down side of our approach is, during userptr invalidation, even if user only munmap partially of an userptr, we invalidate the whole userptr from gpu page table, because there is no way for us to partially dma-unmap the whole sg list. I think we can try your new API in this case. The main benefit of the new approach is the partial munmap case.
>
> We will have to pre-allocate iova for each userptr, and we have many userptrs of random size... So we might be not as efficient as RDMA case where I assume user register a few big memory regions.
>
> 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU program directly, without any other extra driver API call. We call this use case system allocator.
>
> For system allocator, driver have no knowledge of which virtual address range is valid in advance. So when GPU access a malloc'ed/mmap'ed address, we have a page fault. We then look up a CPU vma which contains the fault address. I guess we can use the CPU vma size to allocate the iova space of the same size?
>
> But there will be a true difficulty to apply your scheme to this use case. It is related to the STICKY flag. As I understand it, the sticky flag is designed for driver to mark "this page/pfn has been populated, no need to re-populate again", roughly...Unlike userptr and RDMA use cases where the backing store of a buffer is always in system memory, in the system allocator use case, the backing store can be changing b/t system memory and GPU's device private memory. Even worse, we have to assume the data migration b/t system and GPU is dynamic. When data is migrated to GPU, we don't need dma-map. And when migration happens to a pfn with STICKY flag, we still need to repopulate this pfn. So you can see, it is not easy to apply this scheme to this use case. At least I can't see an obvious way.

I am not sure whether GPU peer-to-peer DMA mapping of GPU memory can use this
scheme. If I remember correctly, Intel Gaudi supports peer-to-peer DMA mapping
in GPU Direct RDMA. I am not sure whether this scheme can be applied there
either.

Just my two cents.

Zhu Yanjun

>
>
> Oak
>
>
>> In this series, three users are converted and each of such conversion
>> presents different positive gain:
>> 1. RDMA simplifies and speeds up its pagefault handling for
>> on-demand-paging (ODP) mode.
>> 2. VFIO PCI live migration code saves huge chunk of memory.
>> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>> directly on BIOs.
>>
>> Thanks
>>
>> [1]
>> https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@
>> klimt.1015granger.net
>>
>> Chaitanya Kulkarni (2):
>> block: add dma_link_range() based API
>> nvme-pci: use blk_rq_dma_map() for NVMe SGL
>>
>> Leon Romanovsky (14):
>> mm/hmm: let users to tag specific PFNs
>> dma-mapping: provide an interface to allocate IOVA
>> dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>> iommu/dma: Provide an interface to allow preallocate IOVA
>> iommu/dma: Prepare map/unmap page functions to receive IOVA
>> iommu/dma: Implement link/unlink page callbacks
>> RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>> RDMA/umem: Store ODP access mask information in PFN
>> RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>> RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>> vfio/mlx5: Explicitly use number of pages instead of allocated length
>> vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>> vfio/mlx5: Explicitly store page list
>> vfio/mlx5: Convert vfio to use DMA link API
>>
>> Documentation/core-api/dma-attributes.rst | 7 +
>> block/blk-merge.c | 156 ++++++++++++++
>> drivers/infiniband/core/umem_odp.c | 219 +++++++------------
>> drivers/infiniband/hw/mlx5/mlx5_ib.h | 1 +
>> drivers/infiniband/hw/mlx5/odp.c | 59 +++--
>> drivers/iommu/dma-iommu.c | 129 ++++++++---
>> drivers/nvme/host/pci.c | 220 +++++--------------
>> drivers/vfio/pci/mlx5/cmd.c | 252 ++++++++++++----------
>> drivers/vfio/pci/mlx5/cmd.h | 22 +-
>> drivers/vfio/pci/mlx5/main.c | 136 +++++-------
>> include/linux/blk-mq.h | 9 +
>> include/linux/dma-map-ops.h | 13 ++
>> include/linux/dma-mapping.h | 39 ++++
>> include/linux/hmm.h | 3 +
>> include/rdma/ib_umem_odp.h | 22 +-
>> include/rdma/ib_verbs.h | 54 +++++
>> kernel/dma/debug.h | 2 +
>> kernel/dma/direct.h | 7 +-
>> kernel/dma/mapping.c | 91 ++++++++
>> mm/hmm.c | 34 +--
>> 20 files changed, 870 insertions(+), 605 deletions(-)
>>
>> --
>> 2.44.0

--
Best Regards,
Yanjun.Zhu


2024-05-03 14:41:45

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On 05.03.24 12:18, Leon Romanovsky wrote:
> From: Chaitanya Kulkarni <[email protected]>
>
> Update nvme_iod structure to hold iova, list of DMA linked addresses and
> total linked count, first one is needed in the request submission path
> to create a request to DMA mapping and last two are needed in the
> request completion path to remove the DMA mapping. In nvme_map_data()
> initialize iova with device, direction, and iova dma length with the
> help of blk_rq_get_dma_length(). Allocate iova using dma_alloc_iova().
> and call in nvme_pci_setup_sgls().
>
> Call newly added blk_rq_dma_map() to create request to DMA mapping and
> provide a callback function nvme_pci_sgl_map(). In the callback
> function initialize NVMe SGL dma addresses.
>
> Finally in nvme_unmap_data() unlink the dma address and free iova.
>
> Full disclosure:-
> -----------------
>
> This is an RFC to demonstrate the newly added DMA APIs can be used to
> map/unmap bvecs without the use of sg list, hence I've modified the pci
> code to only handle SGLs for now. Once we have some agreement on the
> structure of new DMA API I'll add support for PRPs along with all the
> optimization that I've removed from the code for this RFC for NVMe SGLs
> and PRPs.
>
> I was able to run fio verification job successfully :-
>
> $ fio fio/verify.fio --ioengine=io_uring --filename=/dev/nvme0n1
> --loops=10
> write-and-verify: (g=0): rw=randwrite, bs=(R) 8192B-8192B, (W) 8192B-8192B,
> (T) 8192B-8192B, ioengine=io_uring, iodepth=16
> fio-3.36
> Starting 1 process
> Jobs: 1 (f=1): [V(1)][81.6%][r=12.2MiB/s][r=1559 IOPS][eta 03m:00s]
> write-and-verify: (groupid=0, jobs=1): err= 0: pid=4435: Mon Mar 4 20:54:48 2024
> read: IOPS=2789, BW=21.8MiB/s (22.9MB/s)(6473MiB/297008msec)
> slat (usec): min=4, max=5124, avg=356.51, stdev=604.30
> clat (nsec): min=1593, max=23376k, avg=5377076.99, stdev=2039189.93
> lat (usec): min=493, max=23407, avg=5733.58, stdev=2103.22
> clat percentiles (usec):
> | 1.00th=[ 1172], 5.00th=[ 2114], 10.00th=[ 2835], 20.00th=[ 3654],
> | 30.00th=[ 4228], 40.00th=[ 4752], 50.00th=[ 5276], 60.00th=[ 5800],
> | 70.00th=[ 6325], 80.00th=[ 7046], 90.00th=[ 8094], 95.00th=[ 8979],
> | 99.00th=[10421], 99.50th=[11076], 99.90th=[12780], 99.95th=[14222],
> | 99.99th=[16909]
> write: IOPS=2608, BW=20.4MiB/s (21.4MB/s)(10.0GiB/502571msec); 0 zone resets
> slat (usec): min=4, max=5787, avg=382.68, stdev=649.01
> clat (nsec): min=521, max=23650k, avg=5751363.17, stdev=2676065.35
> lat (usec): min=95, max=23674, avg=6134.04, stdev=2813.48
> clat percentiles (usec):
> | 1.00th=[ 709], 5.00th=[ 1270], 10.00th=[ 1958], 20.00th=[ 3261],
> | 30.00th=[ 4228], 40.00th=[ 5014], 50.00th=[ 5800], 60.00th=[ 6521],
> | 70.00th=[ 7373], 80.00th=[ 8225], 90.00th=[ 9241], 95.00th=[ 9896],
> | 99.00th=[11469], 99.50th=[11863], 99.90th=[13960], 99.95th=[15270],
> | 99.99th=[17695]
> bw ( KiB/s): min= 1440, max=132496, per=99.28%, avg=20715.88, stdev=13123.13, samples=1013
> iops : min= 180, max=16562, avg=2589.34, stdev=1640.39, samples=1013
> lat (nsec) : 750=0.01%
> lat (usec) : 2=0.01%, 4=0.01%, 100=0.01%, 250=0.01%, 500=0.07%
> lat (usec) : 750=0.79%, 1000=1.22%
> lat (msec) : 2=5.94%, 4=18.87%, 10=69.53%, 20=3.58%, 50=0.01%
> cpu : usr=1.01%, sys=98.95%, ctx=1591, majf=0, minf=2286
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
> issued rwts: total=828524,1310720,0,0 short=0,0,0,0 dropped=0,0,0,0
> latency : target=0, window=0, percentile=100.00%, depth=16
>
> Run status group 0 (all jobs):
> READ: bw=21.8MiB/s (22.9MB/s), 21.8MiB/s-21.8MiB/s (22.9MB/s-22.9MB/s),
> io=6473MiB (6787MB), run=297008-297008msec
> WRITE: bw=20.4MiB/s (21.4MB/s), 20.4MiB/s-20.4MiB/s (21.4MB/s-21.4MB/s),
> io=10.0GiB (10.7GB), run=502571-502571msec
>
> Disk stats (read/write):
> nvme0n1: ios=829189/1310720, sectors=13293416/20971520, merge=0/0,
> ticks=836561/1340351, in_queue=2176913, util=99.30%
>
> Signed-off-by: Chaitanya Kulkarni <[email protected]>
> Signed-off-by: Leon Romanovsky <[email protected]>
> ---
> drivers/nvme/host/pci.c | 220 +++++++++-------------------------------
> 1 file changed, 49 insertions(+), 171 deletions(-)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index e6267a6aa380..140939228409 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -236,7 +236,9 @@ struct nvme_iod {
> unsigned int dma_len; /* length of single DMA segment mapping */
> dma_addr_t first_dma;
> dma_addr_t meta_dma;
> - struct sg_table sgt;
> + struct dma_iova_attrs iova;
> + dma_addr_t dma_link_address[128];

Why is the length of this array 128? Can we increase the length of the
array?

Thanks,
Zhu Yanjun

> + u16 nr_dma_link_address;
> union nvme_descriptor list[NVME_MAX_NR_ALLOCATIONS];
> };
>
> @@ -521,25 +523,10 @@ static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req,
> return true;
> }
>
> -static void nvme_free_prps(struct nvme_dev *dev, struct request *req)
> -{
> - const int last_prp = NVME_CTRL_PAGE_SIZE / sizeof(__le64) - 1;
> - struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> - dma_addr_t dma_addr = iod->first_dma;
> - int i;
> -
> - for (i = 0; i < iod->nr_allocations; i++) {
> - __le64 *prp_list = iod->list[i].prp_list;
> - dma_addr_t next_dma_addr = le64_to_cpu(prp_list[last_prp]);
> -
> - dma_pool_free(dev->prp_page_pool, prp_list, dma_addr);
> - dma_addr = next_dma_addr;
> - }
> -}
> -
> static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> + u16 i;
>
> if (iod->dma_len) {
> dma_unmap_page(dev->dev, iod->first_dma, iod->dma_len,
> @@ -547,9 +534,8 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
> return;
> }
>
> - WARN_ON_ONCE(!iod->sgt.nents);
> -
> - dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
> + for (i = 0; i < iod->nr_dma_link_address; i++)
> + dma_unlink_range(&iod->iova, iod->dma_link_address[i]);
>
> if (iod->nr_allocations == 0)
> dma_pool_free(dev->prp_small_pool, iod->list[0].sg_list,
> @@ -557,120 +543,15 @@ static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)
> else if (iod->nr_allocations == 1)
> dma_pool_free(dev->prp_page_pool, iod->list[0].sg_list,
> iod->first_dma);
> - else
> - nvme_free_prps(dev, req);
> - mempool_free(iod->sgt.sgl, dev->iod_mempool);
> -}
> -
> -static void nvme_print_sgl(struct scatterlist *sgl, int nents)
> -{
> - int i;
> - struct scatterlist *sg;
> -
> - for_each_sg(sgl, sg, nents, i) {
> - dma_addr_t phys = sg_phys(sg);
> - pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
> - "dma_address:%pad dma_length:%d\n",
> - i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
> - sg_dma_len(sg));
> - }
> -}
> -
> -static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> - struct request *req, struct nvme_rw_command *cmnd)
> -{
> - struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> - struct dma_pool *pool;
> - int length = blk_rq_payload_bytes(req);
> - struct scatterlist *sg = iod->sgt.sgl;
> - int dma_len = sg_dma_len(sg);
> - u64 dma_addr = sg_dma_address(sg);
> - int offset = dma_addr & (NVME_CTRL_PAGE_SIZE - 1);
> - __le64 *prp_list;
> - dma_addr_t prp_dma;
> - int nprps, i;
> -
> - length -= (NVME_CTRL_PAGE_SIZE - offset);
> - if (length <= 0) {
> - iod->first_dma = 0;
> - goto done;
> - }
> -
> - dma_len -= (NVME_CTRL_PAGE_SIZE - offset);
> - if (dma_len) {
> - dma_addr += (NVME_CTRL_PAGE_SIZE - offset);
> - } else {
> - sg = sg_next(sg);
> - dma_addr = sg_dma_address(sg);
> - dma_len = sg_dma_len(sg);
> - }
> -
> - if (length <= NVME_CTRL_PAGE_SIZE) {
> - iod->first_dma = dma_addr;
> - goto done;
> - }
> -
> - nprps = DIV_ROUND_UP(length, NVME_CTRL_PAGE_SIZE);
> - if (nprps <= (256 / 8)) {
> - pool = dev->prp_small_pool;
> - iod->nr_allocations = 0;
> - } else {
> - pool = dev->prp_page_pool;
> - iod->nr_allocations = 1;
> - }
> -
> - prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
> - if (!prp_list) {
> - iod->nr_allocations = -1;
> - return BLK_STS_RESOURCE;
> - }
> - iod->list[0].prp_list = prp_list;
> - iod->first_dma = prp_dma;
> - i = 0;
> - for (;;) {
> - if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> - __le64 *old_prp_list = prp_list;
> - prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
> - if (!prp_list)
> - goto free_prps;
> - iod->list[iod->nr_allocations++].prp_list = prp_list;
> - prp_list[0] = old_prp_list[i - 1];
> - old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> - i = 1;
> - }
> - prp_list[i++] = cpu_to_le64(dma_addr);
> - dma_len -= NVME_CTRL_PAGE_SIZE;
> - dma_addr += NVME_CTRL_PAGE_SIZE;
> - length -= NVME_CTRL_PAGE_SIZE;
> - if (length <= 0)
> - break;
> - if (dma_len > 0)
> - continue;
> - if (unlikely(dma_len < 0))
> - goto bad_sgl;
> - sg = sg_next(sg);
> - dma_addr = sg_dma_address(sg);
> - dma_len = sg_dma_len(sg);
> - }
> -done:
> - cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sgt.sgl));
> - cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> - return BLK_STS_OK;
> -free_prps:
> - nvme_free_prps(dev, req);
> - return BLK_STS_RESOURCE;
> -bad_sgl:
> - WARN(DO_ONCE(nvme_print_sgl, iod->sgt.sgl, iod->sgt.nents),
> - "Invalid SGL for payload:%d nents:%d\n",
> - blk_rq_payload_bytes(req), iod->sgt.nents);
> - return BLK_STS_IOERR;
> + dma_free_iova(&iod->iova);
> }
>
> static void nvme_pci_sgl_set_data(struct nvme_sgl_desc *sge,
> - struct scatterlist *sg)
> + dma_addr_t dma_addr,
> + unsigned int dma_len)
> {
> - sge->addr = cpu_to_le64(sg_dma_address(sg));
> - sge->length = cpu_to_le32(sg_dma_len(sg));
> + sge->addr = cpu_to_le64(dma_addr);
> + sge->length = cpu_to_le32(dma_len);
> sge->type = NVME_SGL_FMT_DATA_DESC << 4;
> }
>
> @@ -682,25 +563,37 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc *sge,
> sge->type = NVME_SGL_FMT_LAST_SEG_DESC << 4;
> }
>
> +struct nvme_pci_sgl_map_data {
> + struct nvme_iod *iod;
> + struct nvme_sgl_desc *sgl_list;
> +};
> +
> +static void nvme_pci_sgl_map(void *data, u32 cnt, dma_addr_t dma_addr,
> + dma_addr_t offset, u32 len)
> +{
> + struct nvme_pci_sgl_map_data *d = data;
> + struct nvme_sgl_desc *sgl_list = d->sgl_list;
> + struct nvme_iod *iod = d->iod;
> +
> + nvme_pci_sgl_set_data(&sgl_list[cnt], dma_addr, len);
> + iod->dma_link_address[cnt] = offset;
> + iod->nr_dma_link_address++;
> +}
> +
> static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
> struct request *req, struct nvme_rw_command *cmd)
> {
> + unsigned int entries = blk_rq_nr_phys_segments(req);
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> - struct dma_pool *pool;
> struct nvme_sgl_desc *sg_list;
> - struct scatterlist *sg = iod->sgt.sgl;
> - unsigned int entries = iod->sgt.nents;
> + struct dma_pool *pool;
> dma_addr_t sgl_dma;
> - int i = 0;
> + int linked_count;
> + struct nvme_pci_sgl_map_data data;
>
> /* setting the transfer type as SGL */
> cmd->flags = NVME_CMD_SGL_METABUF;
>
> - if (entries == 1) {
> - nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
> - return BLK_STS_OK;
> - }
> -
> if (entries <= (256 / sizeof(struct nvme_sgl_desc))) {
> pool = dev->prp_small_pool;
> iod->nr_allocations = 0;
> @@ -718,11 +611,13 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
> iod->list[0].sg_list = sg_list;
> iod->first_dma = sgl_dma;
>
> - nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, entries);
> - do {
> - nvme_pci_sgl_set_data(&sg_list[i++], sg);
> - sg = sg_next(sg);
> - } while (--entries > 0);
> + data.iod = iod;
> + data.sgl_list = sg_list;
> +
> + linked_count = blk_rq_dma_map(req, nvme_pci_sgl_map, &data,
> + &iod->iova);
> +
> + nvme_pci_sgl_set_seg(&cmd->dptr.sgl, sgl_dma, linked_count);
>
> return BLK_STS_OK;
> }
> @@ -788,36 +683,20 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> &cmnd->rw, &bv);
> }
> }
> -
> - iod->dma_len = 0;
> - iod->sgt.sgl = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
> - if (!iod->sgt.sgl)
> + iod->iova.dev = dev->dev;
> + iod->iova.dir = rq_dma_dir(req);
> + iod->iova.attrs = DMA_ATTR_NO_WARN;
> + iod->iova.size = blk_rq_get_dma_length(req);
> + if (!iod->iova.size)
> return BLK_STS_RESOURCE;
> - sg_init_table(iod->sgt.sgl, blk_rq_nr_phys_segments(req));
> - iod->sgt.orig_nents = blk_rq_map_sg(req->q, req, iod->sgt.sgl);
> - if (!iod->sgt.orig_nents)
> - goto out_free_sg;
>
> - rc = dma_map_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req),
> - DMA_ATTR_NO_WARN);
> - if (rc) {
> - if (rc == -EREMOTEIO)
> - ret = BLK_STS_TARGET;
> - goto out_free_sg;
> - }
> + rc = dma_alloc_iova(&iod->iova);
> + if (rc)
> + return BLK_STS_RESOURCE;
>
> - if (nvme_pci_use_sgls(dev, req, iod->sgt.nents))
> - ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
> - else
> - ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
> - if (ret != BLK_STS_OK)
> - goto out_unmap_sg;
> - return BLK_STS_OK;
> + iod->dma_len = 0;
>
> -out_unmap_sg:
> - dma_unmap_sgtable(dev->dev, &iod->sgt, rq_dma_dir(req), 0);
> -out_free_sg:
> - mempool_free(iod->sgt.sgl, dev->iod_mempool);
> + ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
> return ret;
> }
>
> @@ -841,7 +720,6 @@ static blk_status_t nvme_prep_rq(struct nvme_dev *dev, struct request *req)
>
> iod->aborted = false;
> iod->nr_allocations = -1;
> - iod->sgt.nents = 0;
>
> ret = nvme_setup_cmd(req->q->queuedata, req);
> if (ret)


2024-05-03 16:44:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:

> > Instead of teaching DMA to know these specific datatypes, let's separate
> > existing DMA mapping routine to two steps and give an option to advanced
> > callers (subsystems) perform all calculations internally in advance and
> > map pages later when it is needed.
>
> I looked into how this scheme can be applied to DRM subsystem and GPU drivers.
>
> I figured RDMA can apply this scheme because RDMA can calculate the
> iova size. Per my limited knowledge of rdma, user can register a
> memory region (the reg_user_mr vfunc) and memory region's sized is
> used to pre-allocate iova space. And in the RDMA use case, it seems
> the user registered region can be very big, e.g., 512MiB or even GiB

In RDMA the iova would be linked to the SVA granule we discussed
previously.

> In GPU driver, we have a few use cases where we need dma-mapping. Just name two:
>
> 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> (in Intel's driver it is through a vm_bind api, similar to mmap). A
> userptr can be of any random size, depending on user malloc
> size. Today we use dma-map-sg for this use case. The down side of
> our approach is, during userptr invalidation, even if user only
> munmap partially of an userptr, we invalidate the whole userptr from
> gpu page table, because there is no way for us to partially
> dma-unmap the whole sg list. I think we can try your new API in this
> case. The main benefit of the new approach is the partial munmap
> case.

Yes, this is one of the main things it will improve.

> We will have to pre-allocate iova for each userptr, and we have many
> userptrs of random size... So we might be not as efficient as RDMA
> case where I assume user register a few big memory regions.

You are already doing this. dma_map_sg() does exactly the same IOVA
allocation under the covers.

> 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> program directly, without any other extra driver API call. We call
> this use case system allocator.

> For system allocator, driver have no knowledge of which virtual
> address range is valid in advance. So when GPU access a
> malloc'ed/mmap'ed address, we have a page fault. We then look up a
> CPU vma which contains the fault address. I guess we can use the CPU
> vma size to allocate the iova space of the same size?

No. You'd follow what we discussed in the other thread.

If you do a full SVA then you'd split your MM space into granules and
when a fault hits a granule you'd allocate the IOVA for the whole
granule. RDMA ODP is using a 512M granule currently.

If you are doing sub ranges then you'd probably allocate the IOVA for
the well defined sub range (assuming the typical use case isn't huge)
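
To make this concrete, here is a minimal sketch of per-granule IOVA
preallocation, using the dma_iova_attrs / dma_alloc_iova() interface as it
appears in the NVMe patch of this series; the sva_granule structure, its
fields and the 512M size are illustrative assumptions, not part of the RFC:

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

/* Hypothetical per-granule state kept by a driver doing SVA-style faulting. */
struct sva_granule {
	struct dma_iova_attrs iova;	/* IOVA preallocated for the whole granule */
	bool iova_allocated;
};

/* Called on the first device fault that lands inside this granule. */
static int sva_granule_alloc_iova(struct device *dev, struct sva_granule *g,
				  enum dma_data_direction dir)
{
	int rc;

	if (g->iova_allocated)
		return 0;

	g->iova.dev = dev;
	g->iova.dir = dir;
	g->iova.attrs = 0;
	g->iova.size = SZ_512M;	/* one IOVA chunk covering the whole granule */

	/* Allocate the IOVA once; pages are linked into it later, on demand. */
	rc = dma_alloc_iova(&g->iova);
	if (rc)
		return rc;

	g->iova_allocated = true;
	return 0;
}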

> But there will be a true difficulty to apply your scheme to this use
> case. It is related to the STICKY flag. As I understand it, the
> sticky flag is designed for driver to mark "this page/pfn has been
> populated, no need to re-populate again", roughly...Unlike userptr
> and RDMA use cases where the backing store of a buffer is always in
> system memory, in the system allocator use case, the backing store
> can be changing b/t system memory and GPU's device private
> memory. Even worse, we have to assume the data migration b/t system
> and GPU is dynamic. When data is migrated to GPU, we don't need
> dma-map. And when migration happens to a pfn with STICKY flag, we
> still need to repopulate this pfn. So you can see, it is not easy to
> apply this scheme to this use case. At least I can't see an obvious
> way.

You are already doing this today, you are keeping the sg list around
until you unmap it.

Instead of keeping the sg list you'd keep a much smaller data structure
per-granule. The sticky bit is simply a convenient way for ODP to manage
the smaller data structure, you don't have to use it.

But you do need to keep track of which pages in the granule have been
DMA mapped - the sg list was doing this before. This could be a simple
bitmap array matching the granule size.
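
A rough, self-contained sketch of that per-granule bitmap, assuming 4K pages
and a 512M granule; the names and helpers here are made up for illustration:

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/mm.h>
#include <linux/sizes.h>

#define EXAMPLE_GRANULE_SIZE	SZ_512M
#define EXAMPLE_GRANULE_PAGES	(EXAMPLE_GRANULE_SIZE / PAGE_SIZE)

/* Tracks which pages of one granule are currently DMA mapped (16 KiB of
 * bitmap per granule with 4K pages). */
struct granule_dma_state {
	DECLARE_BITMAP(mapped, EXAMPLE_GRANULE_PAGES);
};

/* Returns true if the page was already mapped; otherwise marks it mapped. */
static bool granule_claim_page(struct granule_dma_state *s,
			       unsigned long page_idx)
{
	return test_and_set_bit(page_idx, s->mapped);
}

/* Called when a single page inside the granule is unmapped/invalidated. */
static void granule_release_page(struct granule_dma_state *s,
				 unsigned long page_idx)
{
	clear_bit(page_idx, s->mapped);
}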

Looking (far) forward we may be able to have a "replace" API that
allows installing a new page unconditionally regardless of what is
already there.

Jason

2024-05-03 21:00:23

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps



> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, May 3, 2024 12:43 PM
> To: Zeng, Oak <[email protected]>
> Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
> <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> Brost, Matthew <[email protected]>; Hellstrom, Thomas
> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> Shameer Kolothum <[email protected]>; Tian, Kevin
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Williams, Dan J
> <[email protected]>; [email protected]; Leon Romanovsky
> <[email protected]>; Zhu Yanjun <[email protected]>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
>
> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
>
> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to
> advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > I looked into how this scheme can be applied to DRM subsystem and GPU
> drivers.
> >
> > I figured RDMA can apply this scheme because RDMA can calculate the
> > iova size. Per my limited knowledge of rdma, user can register a
> > memory region (the reg_user_mr vfunc) and memory region's sized is
> > used to pre-allocate iova space. And in the RDMA use case, it seems
> > the user registered region can be very big, e.g., 512MiB or even GiB
>
> In RDMA the iova would be linked to the SVA granual we discussed
> previously.

I need to learn more about this scheme.

Let's say a 512MiB granule... On a machine with 57-bit virtual addresses, the user address space can be up to 56 bits (e.g., a half-and-half split between kernel and user).

So you would end up with 134,217,728 sub-regions (2 to the power of 27), which is huge...
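
Spelling out the arithmetic behind that figure (taking a 56-bit user address
space and 512 MiB granules, as above):

2^56 bytes of user VA / 2^29 bytes per granule = 2^(56-29) = 2^27 = 134,217,728 granules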

Does RDMA use a much smaller virtual address space?

With a 512MiB granule, do you fault in or map a 512MiB virtual address range into the RDMA page table? E.g., when a page fault happens at address A, do you fault in the whole 512MiB region into the RDMA page table? How do you make sure all addresses in this 512MiB region are valid virtual addresses?



>
> > In GPU driver, we have a few use cases where we need dma-mapping. Just
> name two:
> >
> > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > userptr can be of any random size, depending on user malloc
> > size. Today we use dma-map-sg for this use case. The down side of
> > our approach is, during userptr invalidation, even if user only
> > munmap partially of an userptr, we invalidate the whole userptr from
> > gpu page table, because there is no way for us to partially
> > dma-unmap the whole sg list. I think we can try your new API in this
> > case. The main benefit of the new approach is the partial munmap
> > case.
>
> Yes, this is one of the main things it will improve.
>
> > We will have to pre-allocate iova for each userptr, and we have many
> > userptrs of random size... So we might be not as efficient as RDMA
> > case where I assume user register a few big memory regions.
>
> You are already doing this. dma_map_sg() does exactly the same IOVA
> allocation under the covers.

Sure. Then we can replace our sg-based dma-mapping with your new DMA API once it is merged. We will gain the benefit with only a little more code.

>
> > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > program directly, without any other extra driver API call. We call
> > this use case system allocator.
>
> > For system allocator, driver have no knowledge of which virtual
> > address range is valid in advance. So when GPU access a
> > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > CPU vma which contains the fault address. I guess we can use the CPU
> > vma size to allocate the iova space of the same size?
>
> No. You'd follow what we discussed in the other thread.
>
> If you do a full SVA then you'd split your MM space into granuals and
> when a fault hits a granual you'd allocate the IOVA for the whole
> granual. RDMA ODP is using a 512M granual currently.

Per the system allocator requirement, we have to do full SVA (which means ANY valid CPU virtual address is a valid GPU virtual address).

Per my calculation above, with a 512M granule we will end up with a huge number of sub-regions....

>
> If you are doing sub ranges then you'd probably allocate the IOVA for
> the well defined sub range (assuming the typical use case isn't huge)

Can you explain what sub-ranges are? Does the device only mirror part of the CPU virtual address space?

How do we decide which part to mirror?


>
> > But there will be a true difficulty to apply your scheme to this use
> > case. It is related to the STICKY flag. As I understand it, the
> > sticky flag is designed for driver to mark "this page/pfn has been
> > populated, no need to re-populate again", roughly...Unlike userptr
> > and RDMA use cases where the backing store of a buffer is always in
> > system memory, in the system allocator use case, the backing store
> > can be changing b/t system memory and GPU's device private
> > memory. Even worse, we have to assume the data migration b/t system
> > and GPU is dynamic. When data is migrated to GPU, we don't need
> > dma-map. And when migration happens to a pfn with STICKY flag, we
> > still need to repopulate this pfn. So you can see, it is not easy to
> > apply this scheme to this use case. At least I can't see an obvious
> > way.
>
> You are already doing this today, you are keeping the sg list around
> until you unmap it.
>
> Instead of keeping the sg list you'd keep a much smaller datastructure
> per-granual. The sticky bit is simply a convient way for ODP to manage
> the smaller data structure, you don't have to use it.
>
> But you do need to keep track of what pages in the granual have been
> DMA mapped - sg list was doing this before. This could be a simple
> bitmap array matching the granual size.

Makes sense. We can try it once your API is ready.

I still haven't figured out the granule scheme. Please help with the questions above.

Thanks,
Oak


>
> Looking (far) forward we may be able to have a "replace" API that
> allows installing a new page unconditionally regardless of what is
> already there.
>
> Jason

2024-05-05 13:23:38

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL

On Fri, May 03, 2024 at 04:41:21PM +0200, Zhu Yanjun wrote:
> On 05.03.24 12:18, Leon Romanovsky wrote:
> > From: Chaitanya Kulkarni <[email protected]>

<...>

> > This is an RFC to demonstrate the newly added DMA APIs can be used to
> > map/unmap bvecs without the use of sg list, hence I've modified the pci
> > code to only handle SGLs for now. Once we have some agreement on the
> > structure of new DMA API I'll add support for PRPs along with all the
> > optimization that I've removed from the code for this RFC for NVMe SGLs
> > and PRPs.
> >

<...>

> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index e6267a6aa380..140939228409 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -236,7 +236,9 @@ struct nvme_iod {
> > unsigned int dma_len; /* length of single DMA segment mapping */
> > dma_addr_t first_dma;
> > dma_addr_t meta_dma;
> > - struct sg_table sgt;
> > + struct dma_iova_attrs iova;
> > + dma_addr_t dma_link_address[128];
>
> Why the length of this array is 128? Can we increase this length of the
> array?

It is a combination of two things:
* A good-enough value for this NVMe RFC to pass the simple test that Chaitanya ran.
* The output of various NVME_CTRL_* defines (see the sizing sketch below).
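
Purely as an illustration of how such a bound could fall out of those defines
(the 512 KiB maximum transfer size here is an assumption for the sketch, not
taken from the driver):

#include <linux/sizes.h>
/* NVME_CTRL_PAGE_SIZE comes from the NVMe driver headers. */

/* Hypothetical sizing sketch: if a request is capped at, say, 512 KiB and
 * NVME_CTRL_PAGE_SIZE is 4 KiB, it spans at most 512K / 4K = 128 pages,
 * so 128 link addresses are enough for that configuration. The array in
 * struct nvme_iod would then be sized as:
 *
 *	dma_addr_t dma_link_address[EXAMPLE_MAX_DMA_LINKS];
 */
#define EXAMPLE_MAX_TRANSFER_SIZE	SZ_512K
#define EXAMPLE_MAX_DMA_LINKS	\
	(EXAMPLE_MAX_TRANSFER_SIZE / NVME_CTRL_PAGE_SIZE)	/* = 128 */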

Thanks

2024-05-06 07:25:25

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 16/16] nvme-pci: use blk_rq_dma_map() for NVMe SGL


On 05.05.24 15:23, Leon Romanovsky wrote:
> On Fri, May 03, 2024 at 04:41:21PM +0200, Zhu Yanjun wrote:
>> On 05.03.24 12:18, Leon Romanovsky wrote:
>>> From: Chaitanya Kulkarni <[email protected]>
> <...>
>
>>> This is an RFC to demonstrate the newly added DMA APIs can be used to
>>> map/unmap bvecs without the use of sg list, hence I've modified the pci
>>> code to only handle SGLs for now. Once we have some agreement on the
>>> structure of new DMA API I'll add support for PRPs along with all the
>>> optimization that I've removed from the code for this RFC for NVMe SGLs
>>> and PRPs.
>>>
> <...>
>
>>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>> index e6267a6aa380..140939228409 100644
>>> --- a/drivers/nvme/host/pci.c
>>> +++ b/drivers/nvme/host/pci.c
>>> @@ -236,7 +236,9 @@ struct nvme_iod {
>>> unsigned int dma_len; /* length of single DMA segment mapping */
>>> dma_addr_t first_dma;
>>> dma_addr_t meta_dma;
>>> - struct sg_table sgt;
>>> + struct dma_iova_attrs iova;
>>> + dma_addr_t dma_link_address[128];
>> Why the length of this array is 128? Can we increase this length of the
>> array?
> It is combination of two things:
> * Good enough value for this nvme RFC to pass simple test, which Chaitanya did.
> * Output of various NVME_CTRL_* defines

Thanks a lot. I enlarged this number to 512 and it seems to work.
Hopefully this will increase the performance.

Best Regards,

Zhu Yanjun

>
> Thanks

--
Best Regards,
Yanjun.Zhu


2024-06-10 15:16:32

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Hi Jason, Leon,

I am coming back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) into the DRM subsystem. The new API seems to fit our purpose pretty well, better than scatter-gather dma-mapping, so we want to continue working with you to adopt this new API.

Did you test the new API in the RDMA subsystem? Or was this RFC series just untested code sent out to get design feedback? Do you have a refined version for us to try? I ask because we are seeing some issues but are not sure whether they are caused by the new API. We are debugging, but it is worth asking at the same time.

Cc'ing Himal/Krishna, who are also working on and testing the new API.

Thanks,
Oak

> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, May 3, 2024 12:43 PM
> To: Zeng, Oak <[email protected]>
> Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
> <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> Brost, Matthew <[email protected]>; Hellstrom, Thomas
> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> Shameer Kolothum <[email protected]>; Tian, Kevin
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Williams, Dan J
> <[email protected]>; [email protected]; Leon Romanovsky
> <[email protected]>; Zhu Yanjun <[email protected]>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
>
> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
>
> > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > existing DMA mapping routine to two steps and give an option to
> advanced
> > > callers (subsystems) perform all calculations internally in advance and
> > > map pages later when it is needed.
> >
> > I looked into how this scheme can be applied to DRM subsystem and GPU
> drivers.
> >
> > I figured RDMA can apply this scheme because RDMA can calculate the
> > iova size. Per my limited knowledge of rdma, user can register a
> > memory region (the reg_user_mr vfunc) and memory region's sized is
> > used to pre-allocate iova space. And in the RDMA use case, it seems
> > the user registered region can be very big, e.g., 512MiB or even GiB
>
> In RDMA the iova would be linked to the SVA granual we discussed
> previously.
>
> > In GPU driver, we have a few use cases where we need dma-mapping. Just
> name two:
> >
> > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > userptr can be of any random size, depending on user malloc
> > size. Today we use dma-map-sg for this use case. The down side of
> > our approach is, during userptr invalidation, even if user only
> > munmap partially of an userptr, we invalidate the whole userptr from
> > gpu page table, because there is no way for us to partially
> > dma-unmap the whole sg list. I think we can try your new API in this
> > case. The main benefit of the new approach is the partial munmap
> > case.
>
> Yes, this is one of the main things it will improve.
>
> > We will have to pre-allocate iova for each userptr, and we have many
> > userptrs of random size... So we might be not as efficient as RDMA
> > case where I assume user register a few big memory regions.
>
> You are already doing this. dma_map_sg() does exactly the same IOVA
> allocation under the covers.
>
> > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > program directly, without any other extra driver API call. We call
> > this use case system allocator.
>
> > For system allocator, driver have no knowledge of which virtual
> > address range is valid in advance. So when GPU access a
> > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > CPU vma which contains the fault address. I guess we can use the CPU
> > vma size to allocate the iova space of the same size?
>
> No. You'd follow what we discussed in the other thread.
>
> If you do a full SVA then you'd split your MM space into granuals and
> when a fault hits a granual you'd allocate the IOVA for the whole
> granual. RDMA ODP is using a 512M granual currently.
>
> If you are doing sub ranges then you'd probably allocate the IOVA for
> the well defined sub range (assuming the typical use case isn't huge)
>
> > But there will be a true difficulty to apply your scheme to this use
> > case. It is related to the STICKY flag. As I understand it, the
> > sticky flag is designed for driver to mark "this page/pfn has been
> > populated, no need to re-populate again", roughly...Unlike userptr
> > and RDMA use cases where the backing store of a buffer is always in
> > system memory, in the system allocator use case, the backing store
> > can be changing b/t system memory and GPU's device private
> > memory. Even worse, we have to assume the data migration b/t system
> > and GPU is dynamic. When data is migrated to GPU, we don't need
> > dma-map. And when migration happens to a pfn with STICKY flag, we
> > still need to repopulate this pfn. So you can see, it is not easy to
> > apply this scheme to this use case. At least I can't see an obvious
> > way.
>
> You are already doing this today, you are keeping the sg list around
> until you unmap it.
>
> Instead of keeping the sg list you'd keep a much smaller datastructure
> per-granual. The sticky bit is simply a convient way for ODP to manage
> the smaller data structure, you don't have to use it.
>
> But you do need to keep track of what pages in the granual have been
> DMA mapped - sg list was doing this before. This could be a simple
> bitmap array matching the granual size.
>
> Looking (far) forward we may be able to have a "replace" API that
> allows installing a new page unconditionally regardless of what is
> already there.
>
> Jason

2024-06-10 15:25:09

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps


On 10.06.24 17:12, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) to DRM subsystem. The new API seems fit pretty good to our purpose, better than scatter-gather dma-mapping. So we want to continue work with you to adopt this new API.
>
> Did you test the new API in RDMA subsystem? Or this RFC series was just some untested codes sending out to get people's design feedback? Do you have refined version for us to try? I ask because we are seeing some issues but not sure whether it is caused by the new API. We are debugging but it would be good to also ask at the same time.

Hi, Zeng

I have tested this patch series. The NVMe patch causes some call traces,
but if you revert that patch, the rest of the series works well. You can
develop your patches based on this patch series.

It seems that agreement has not yet been reached on the NVMe API, so the
NVMe patch does not work well yet. I did not delve into that patch.

Zhu Yanjun

>
> Cc Himal/Krishna who are also working/testing the new API.
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Friday, May 3, 2024 12:43 PM
>> To: Zeng, Oak <[email protected]>
>> Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
>> <[email protected]>; Marek Szyprowski
>> <[email protected]>; Joerg Roedel <[email protected]>; Will
>> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
>> Brost, Matthew <[email protected]>; Hellstrom, Thomas
>> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
>> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
>> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
>> Shameer Kolothum <[email protected]>; Tian, Kevin
>> <[email protected]>; Alex Williamson <[email protected]>;
>> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
>> foundation.org>; [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected]; Bart Van Assche
>> <[email protected]>; Damien Le Moal
>> <[email protected]>; Amir Goldstein
>> <[email protected]>; [email protected]; Martin K. Petersen
>> <[email protected]>; [email protected]; Williams, Dan J
>> <[email protected]>; [email protected]; Leon Romanovsky
>> <[email protected]>; Zhu Yanjun <[email protected]>
>> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
>> two steps
>>
>> On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
>>
>>>> Instead of teaching DMA to know these specific datatypes, let's separate
>>>> existing DMA mapping routine to two steps and give an option to
>> advanced
>>>> callers (subsystems) perform all calculations internally in advance and
>>>> map pages later when it is needed.
>>> I looked into how this scheme can be applied to DRM subsystem and GPU
>> drivers.
>>> I figured RDMA can apply this scheme because RDMA can calculate the
>>> iova size. Per my limited knowledge of rdma, user can register a
>>> memory region (the reg_user_mr vfunc) and memory region's sized is
>>> used to pre-allocate iova space. And in the RDMA use case, it seems
>>> the user registered region can be very big, e.g., 512MiB or even GiB
>> In RDMA the iova would be linked to the SVA granual we discussed
>> previously.
>>
>>> In GPU driver, we have a few use cases where we need dma-mapping. Just
>> name two:
>>> 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
>>> (in Intel's driver it is through a vm_bind api, similar to mmap). A
>>> userptr can be of any random size, depending on user malloc
>>> size. Today we use dma-map-sg for this use case. The down side of
>>> our approach is, during userptr invalidation, even if user only
>>> munmap partially of an userptr, we invalidate the whole userptr from
>>> gpu page table, because there is no way for us to partially
>>> dma-unmap the whole sg list. I think we can try your new API in this
>>> case. The main benefit of the new approach is the partial munmap
>>> case.
>> Yes, this is one of the main things it will improve.
>>
>>> We will have to pre-allocate iova for each userptr, and we have many
>>> userptrs of random size... So we might be not as efficient as RDMA
>>> case where I assume user register a few big memory regions.
>> You are already doing this. dma_map_sg() does exactly the same IOVA
>> allocation under the covers.
>>
>>> 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
>>> program directly, without any other extra driver API call. We call
>>> this use case system allocator.
>>> For system allocator, driver have no knowledge of which virtual
>>> address range is valid in advance. So when GPU access a
>>> malloc'ed/mmap'ed address, we have a page fault. We then look up a
>>> CPU vma which contains the fault address. I guess we can use the CPU
>>> vma size to allocate the iova space of the same size?
>> No. You'd follow what we discussed in the other thread.
>>
>> If you do a full SVA then you'd split your MM space into granuals and
>> when a fault hits a granual you'd allocate the IOVA for the whole
>> granual. RDMA ODP is using a 512M granual currently.
>>
>> If you are doing sub ranges then you'd probably allocate the IOVA for
>> the well defined sub range (assuming the typical use case isn't huge)
>>
>>> But there will be a true difficulty to apply your scheme to this use
>>> case. It is related to the STICKY flag. As I understand it, the
>>> sticky flag is designed for driver to mark "this page/pfn has been
>>> populated, no need to re-populate again", roughly...Unlike userptr
>>> and RDMA use cases where the backing store of a buffer is always in
>>> system memory, in the system allocator use case, the backing store
>>> can be changing b/t system memory and GPU's device private
>>> memory. Even worse, we have to assume the data migration b/t system
>>> and GPU is dynamic. When data is migrated to GPU, we don't need
>>> dma-map. And when migration happens to a pfn with STICKY flag, we
>>> still need to repopulate this pfn. So you can see, it is not easy to
>>> apply this scheme to this use case. At least I can't see an obvious
>>> way.
>> You are already doing this today, you are keeping the sg list around
>> until you unmap it.
>>
>> Instead of keeping the sg list you'd keep a much smaller datastructure
>> per-granual. The sticky bit is simply a convient way for ODP to manage
>> the smaller data structure, you don't have to use it.
>>
>> But you do need to keep track of what pages in the granual have been
>> DMA mapped - sg list was doing this before. This could be a simple
>> bitmap array matching the granual size.
>>
>> Looking (far) forward we may be able to have a "replace" API that
>> allows installing a new page unconditionally regardless of what is
>> already there.
>>
>> Jason

--
Best Regards,
Yanjun.Zhu


2024-06-10 16:18:49

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I come back to this thread to ask a question. Per the discussion in another thread, I have integrated the new dma-mapping API (the first 6 patches of this series) to DRM subsystem. The new API seems fit pretty good to our purpose, better than scatter-gather dma-mapping. So we want to continue work with you to adopt this new API.

Sounds great, thanks for the feedback.

>
> Did you test the new API in RDMA subsystem?

This version was tested in our regression tests, but there is a chance
that you are hitting flows that were not relevant for the RDMA case.

> Or this RFC series was just some untested codes sending out to get people's design feedback?

The RFC was fully tested in the VFIO and RDMA paths, but not the NVMe patch.

> Do you have refined version for us to try? I ask because we are seeing some issues but not sure whether it is caused by the new API. We are debugging but it would be good to also ask at the same time.

Yes, as an outcome of the feedback in this thread, I implemented a new
version. Unfortunately, some personal matters are preventing me from
sending it right away.
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=dma-split-v1

There are some differences in the API, but the main idea is the same.
This version is not fully tested yet.

Thanks

>
> Cc Himal/Krishna who are also working/testing the new API.
>
> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, May 3, 2024 12:43 PM
> > To: Zeng, Oak <[email protected]>
> > Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
> > <[email protected]>; Marek Szyprowski
> > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > Shameer Kolothum <[email protected]>; Tian, Kevin
> > <[email protected]>; Alex Williamson <[email protected]>;
> > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > foundation.org>; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Bart Van Assche
> > <[email protected]>; Damien Le Moal
> > <[email protected]>; Amir Goldstein
> > <[email protected]>; [email protected]; Martin K. Petersen
> > <[email protected]>; [email protected]; Williams, Dan J
> > <[email protected]>; [email protected]; Leon Romanovsky
> > <[email protected]>; Zhu Yanjun <[email protected]>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > two steps
> >
> > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> >
> > > > Instead of teaching DMA to know these specific datatypes, let's separate
> > > > existing DMA mapping routine to two steps and give an option to
> > advanced
> > > > callers (subsystems) perform all calculations internally in advance and
> > > > map pages later when it is needed.
> > >
> > > I looked into how this scheme can be applied to DRM subsystem and GPU
> > drivers.
> > >
> > > I figured RDMA can apply this scheme because RDMA can calculate the
> > > iova size. Per my limited knowledge of rdma, user can register a
> > > memory region (the reg_user_mr vfunc) and memory region's sized is
> > > used to pre-allocate iova space. And in the RDMA use case, it seems
> > > the user registered region can be very big, e.g., 512MiB or even GiB
> >
> > In RDMA the iova would be linked to the SVA granual we discussed
> > previously.
> >
> > > In GPU driver, we have a few use cases where we need dma-mapping. Just
> > name two:
> > >
> > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > > userptr can be of any random size, depending on user malloc
> > > size. Today we use dma-map-sg for this use case. The down side of
> > > our approach is, during userptr invalidation, even if user only
> > > munmap partially of an userptr, we invalidate the whole userptr from
> > > gpu page table, because there is no way for us to partially
> > > dma-unmap the whole sg list. I think we can try your new API in this
> > > case. The main benefit of the new approach is the partial munmap
> > > case.
> >
> > Yes, this is one of the main things it will improve.
> >
> > > We will have to pre-allocate iova for each userptr, and we have many
> > > userptrs of random size... So we might be not as efficient as RDMA
> > > case where I assume user register a few big memory regions.
> >
> > You are already doing this. dma_map_sg() does exactly the same IOVA
> > allocation under the covers.
> >
> > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > > program directly, without any other extra driver API call. We call
> > > this use case system allocator.
> >
> > > For system allocator, driver have no knowledge of which virtual
> > > address range is valid in advance. So when GPU access a
> > > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > > CPU vma which contains the fault address. I guess we can use the CPU
> > > vma size to allocate the iova space of the same size?
> >
> > No. You'd follow what we discussed in the other thread.
> >
> > If you do a full SVA then you'd split your MM space into granuals and
> > when a fault hits a granual you'd allocate the IOVA for the whole
> > granual. RDMA ODP is using a 512M granual currently.
> >
> > If you are doing sub ranges then you'd probably allocate the IOVA for
> > the well defined sub range (assuming the typical use case isn't huge)
> >
> > > But there will be a true difficulty to apply your scheme to this use
> > > case. It is related to the STICKY flag. As I understand it, the
> > > sticky flag is designed for driver to mark "this page/pfn has been
> > > populated, no need to re-populate again", roughly...Unlike userptr
> > > and RDMA use cases where the backing store of a buffer is always in
> > > system memory, in the system allocator use case, the backing store
> > > can be changing b/t system memory and GPU's device private
> > > memory. Even worse, we have to assume the data migration b/t system
> > > and GPU is dynamic. When data is migrated to GPU, we don't need
> > > dma-map. And when migration happens to a pfn with STICKY flag, we
> > > still need to repopulate this pfn. So you can see, it is not easy to
> > > apply this scheme to this use case. At least I can't see an obvious
> > > way.
> >
> > You are already doing this today, you are keeping the sg list around
> > until you unmap it.
> >
> > Instead of keeping the sg list you'd keep a much smaller datastructure
> > per-granual. The sticky bit is simply a convient way for ODP to manage
> > the smaller data structure, you don't have to use it.
> >
> > But you do need to keep track of what pages in the granual have been
> > DMA mapped - sg list was doing this before. This could be a simple
> > bitmap array matching the granual size.
> >
> > Looking (far) forward we may be able to have a "replace" API that
> > allows installing a new page unconditionally regardless of what is
> > already there.
> >
> > Jason

2024-06-10 16:40:47

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Thanks Leon and Yanjun for the reply!

Based on the reply, we will continue to use the current version for testing (as it is tested for VFIO and RDMA). We will switch to v1 once it is fully tested/reviewed.

Thanks,
Oak

> -----Original Message-----
> From: Leon Romanovsky <[email protected]>
> Sent: Monday, June 10, 2024 12:18 PM
> To: Zeng, Oak <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>; Christoph Hellwig <[email protected]>; Robin
> Murphy <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> Brost, Matthew <[email protected]>; Hellstrom, Thomas
> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> Shameer Kolothum <[email protected]>; Tian, Kevin
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Williams, Dan J
> <[email protected]>; [email protected]; Zhu Yanjun
> <[email protected]>; Bommu, Krishnaiah
> <[email protected]>; Ghimiray, Himal Prasad
> <[email protected]>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
>
> On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> > Hi Jason, Leon,
> >
> > I come back to this thread to ask a question. Per the discussion in another
> thread, I have integrated the new dma-mapping API (the first 6 patches of
> this series) to DRM subsystem. The new API seems fit pretty good to our
> purpose, better than scatter-gather dma-mapping. So we want to continue
> work with you to adopt this new API.
>
> Sounds great, thanks for the feedback.
>
> >
> > Did you test the new API in RDMA subsystem?
>
> This version was tested in our regression tests, but there is a chance
> that you are hitting flows that were not relevant for RDMA case.
>
> > Or this RFC series was just some untested codes sending out to get
> people's design feedback?
>
> RFC was fully tested in VFIO and RDMA paths, but not NVMe patch.
>
> > Do you have refined version for us to try? I ask because we are seeing
> some issues but not sure whether it is caused by the new API. We are
> debugging but it would be good to also ask at the same time.
>
> Yes, as an outcome of the feedback in this thread, I implemented a new
> version. Unfortunately, there are some personal matters that are preventing
> from me to send it right away.
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-
> rdma.git/log/?h=dma-split-v1
>
> There are some differences in the API, but the main idea is the same.
> This version is not fully tested yet.
>
> Thanks
>
> >
> > Cc Himal/Krishna who are also working/testing the new API.
> >
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Friday, May 3, 2024 12:43 PM
> > > To: Zeng, Oak <[email protected]>
> > > Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
> > > <[email protected]>; Marek Szyprowski
> > > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > > <[email protected]>; Jonathan Corbet <[email protected]>;
> Jens
> > > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > > Shameer Kolothum <[email protected]>; Tian,
> Kevin
> > > <[email protected]>; Alex Williamson <[email protected]>;
> > > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > > foundation.org>; [email protected]; linux-
> [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Bart Van Assche
> > > <[email protected]>; Damien Le Moal
> > > <[email protected]>; Amir Goldstein
> > > <[email protected]>; [email protected]; Martin K. Petersen
> > > <[email protected]>; [email protected]; Williams, Dan J
> > > <[email protected]>; [email protected]; Leon Romanovsky
> > > <[email protected]>; Zhu Yanjun <[email protected]>
> > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > two steps
> > >
> > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> > >
> > > > > Instead of teaching DMA to know these specific datatypes, let's
> separate
> > > > > existing DMA mapping routine to two steps and give an option to
> > > advanced
> > > > > callers (subsystems) perform all calculations internally in advance and
> > > > > map pages later when it is needed.
> > > >
> > > > I looked into how this scheme can be applied to DRM subsystem and
> GPU
> > > drivers.
> > > >
> > > > I figured RDMA can apply this scheme because RDMA can calculate the
> > > > iova size. Per my limited knowledge of rdma, user can register a
> > > > memory region (the reg_user_mr vfunc) and memory region's sized is
> > > > used to pre-allocate iova space. And in the RDMA use case, it seems
> > > > the user registered region can be very big, e.g., 512MiB or even GiB
> > >
> > > In RDMA the iova would be linked to the SVA granual we discussed
> > > previously.
> > >
> > > > In GPU driver, we have a few use cases where we need dma-mapping.
> Just
> > > name two:
> > > >
> > > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > > > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > > > userptr can be of any random size, depending on user malloc
> > > > size. Today we use dma-map-sg for this use case. The down side of
> > > > our approach is, during userptr invalidation, even if user only
> > > > munmap partially of an userptr, we invalidate the whole userptr from
> > > > gpu page table, because there is no way for us to partially
> > > > dma-unmap the whole sg list. I think we can try your new API in this
> > > > case. The main benefit of the new approach is the partial munmap
> > > > case.
> > >
> > > Yes, this is one of the main things it will improve.
> > >
> > > > We will have to pre-allocate iova for each userptr, and we have many
> > > > userptrs of random size... So we might be not as efficient as RDMA
> > > > case where I assume user register a few big memory regions.
> > >
> > > You are already doing this. dma_map_sg() does exactly the same IOVA
> > > allocation under the covers.
> > >
> > > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > > > program directly, without any other extra driver API call. We call
> > > > this use case system allocator.
> > >
> > > > For system allocator, driver have no knowledge of which virtual
> > > > address range is valid in advance. So when GPU access a
> > > > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > > > CPU vma which contains the fault address. I guess we can use the CPU
> > > > vma size to allocate the iova space of the same size?
> > >
> > > No. You'd follow what we discussed in the other thread.
> > >
> > > If you do a full SVA then you'd split your MM space into granuals and
> > > when a fault hits a granual you'd allocate the IOVA for the whole
> > > granual. RDMA ODP is using a 512M granual currently.
> > >
> > > If you are doing sub ranges then you'd probably allocate the IOVA for
> > > the well defined sub range (assuming the typical use case isn't huge)
> > >
> > > > But there will be a true difficulty to apply your scheme to this use
> > > > case. It is related to the STICKY flag. As I understand it, the
> > > > sticky flag is designed for driver to mark "this page/pfn has been
> > > > populated, no need to re-populate again", roughly...Unlike userptr
> > > > and RDMA use cases where the backing store of a buffer is always in
> > > > system memory, in the system allocator use case, the backing store
> > > > can be changing b/t system memory and GPU's device private
> > > > memory. Even worse, we have to assume the data migration b/t
> system
> > > > and GPU is dynamic. When data is migrated to GPU, we don't need
> > > > dma-map. And when migration happens to a pfn with STICKY flag, we
> > > > still need to repopulate this pfn. So you can see, it is not easy to
> > > > apply this scheme to this use case. At least I can't see an obvious
> > > > way.
> > >
> > > You are already doing this today, you are keeping the sg list around
> > > until you unmap it.
> > >
> > > Instead of keeping the sg list you'd keep a much smaller datastructure
> > > per-granual. The sticky bit is simply a convient way for ODP to manage
> > > the smaller data structure, you don't have to use it.
> > >
> > > But you do need to keep track of what pages in the granual have been
> > > DMA mapped - sg list was doing this before. This could be a simple
> > > bitmap array matching the granual size.
> > >
> > > Looking (far) forward we may be able to have a "replace" API that
> > > allows installing a new page unconditionally regardless of what is
> > > already there.
> > >
> > > Jason

2024-06-10 17:25:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
>
> Based on the reply, we will continue use the current version for
> test (as it is tested for vfio and rdma). We will switch to v1 once
> it is fully tested/reviewed.

I'm glad you are finding it useful; one of my interests with this work
is to improve all the HMM users.

Jason

2024-06-10 21:28:52

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Hi Jason, Leon,

I was able to fix the issue from my side. Things work fine now. I got two questions though:

1) The values returned from the dma_link_range function are not contiguous; see the print below. The "linked pa" is the function's return value.
I think the dma_map_sgtable API would return contiguous DMA addresses. Is the dma_map_sgtable API more efficient with respect to the IOMMU page table, i.e., does it try to use a bigger page size, such as 2M, when possible? Does your new API have the same consideration? I vaguely remember Jason mentioning such a thing, but my print below doesn't look like it. Maybe I need to test a bigger range (only a 16-page range was used in the test below). Comments?

[17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
[17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
[17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
[17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000

2) In the comment of the dma_link_range function it says: "@dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
Is this description correct? I don't understand the part "+ DMA address returned for the previous page".
In my code, let's say I call this function to link 10 pages: the first dma_offset is 0, the second 4k, the third 8k, and so on. This worked for me; I didn't add the previously returned DMA address.
Maybe I need more testing, but any comment?

Thanks,
Oak

> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, June 10, 2024 1:25 PM
> To: Zeng, Oak <[email protected]>
> Cc: Leon Romanovsky <[email protected]>; Christoph Hellwig <[email protected]>;
> Robin Murphy <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> Brost, Matthew <[email protected]>; Hellstrom, Thomas
> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> Shameer Kolothum <[email protected]>; Tian, Kevin
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Williams, Dan J
> <[email protected]>; [email protected]; Zhu Yanjun
> <[email protected]>; Bommu, Krishnaiah
> <[email protected]>; Ghimiray, Himal Prasad
> <[email protected]>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
>
> On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > Thanks Leon and Yanjun for the reply!
> >
> > Based on the reply, we will continue use the current version for
> > test (as it is tested for vfio and rdma). We will switch to v1 once
> > it is fully tested/reviewed.
>
> I'm glad you are finding it useful, one of my interests with this work
> is to improve all the HMM users.
>
> Jason

2024-06-11 08:13:34

by Zhu Yanjun

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps


On 10.06.24 23:28, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I was able to fix the issue from my side. Things work fine now.

Can you enlarge the DMA list and then run tests with fio? I am not sure whether
the performance is better or not.

Thanks,

Zhu Yanjun

> I got two questions though:
>
> 1) The value returned from dma_link_range function is not contiguous, see below print. The "linked pa" is the function return.
> I think dma_map_sgtable API would return some contiguous dma address. Is the dma-map_sgtable api is more efficient regarding the iommu page table? i.e., try to use bigger page size, such as use 2M page size when it is possible. With your new API, does it also have such consideration? I vaguely remembered Jason mentioned such thing, but my print below doesn't look like so. Maybe I need to test bigger range (only 16 pages range in the test of below printing). Comment?
>
> [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
>
> 2) in the comment of dma_link_range function, it is said: " @dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
> Is this description correct? I don't understand the part "+ DMA address returned for the previous page ".
> In my codes, let's say I call this function to link 10 pages, the first dma_offset is 0, second is 4k, third 8k. This worked for me. I didn't add the previously returned dma address.
> Maybe I need more test. But any comment?
>
> Thanks,
> Oak
>
>> -----Original Message-----
>> From: Jason Gunthorpe <[email protected]>
>> Sent: Monday, June 10, 2024 1:25 PM
>> To: Zeng, Oak <[email protected]>
>> Cc: Leon Romanovsky <[email protected]>; Christoph Hellwig <[email protected]>;
>> Robin Murphy <[email protected]>; Marek Szyprowski
>> <[email protected]>; Joerg Roedel <[email protected]>; Will
>> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
>> Brost, Matthew <[email protected]>; Hellstrom, Thomas
>> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
>> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
>> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
>> Shameer Kolothum <[email protected]>; Tian, Kevin
>> <[email protected]>; Alex Williamson <[email protected]>;
>> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
>> foundation.org>; [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; [email protected]; Bart Van Assche
>> <[email protected]>; Damien Le Moal
>> <[email protected]>; Amir Goldstein
>> <[email protected]>; [email protected]; Martin K. Petersen
>> <[email protected]>; [email protected]; Williams, Dan J
>> <[email protected]>; [email protected]; Zhu Yanjun
>> <[email protected]>; Bommu, Krishnaiah
>> <[email protected]>; Ghimiray, Himal Prasad
>> <[email protected]>
>> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
>> two steps
>>
>> On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
>>> Thanks Leon and Yanjun for the reply!
>>>
>>> Based on the reply, we will continue use the current version for
>>> test (as it is tested for vfio and rdma). We will switch to v1 once
>>> it is fully tested/reviewed.
>> I'm glad you are finding it useful, one of my interests with this work
>> is to improve all the HMM users.
>>
>> Jason

--
Best


2024-06-11 15:39:37

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> Thanks Leon and Yanjun for the reply!
>
> Based on the reply, we will continue use the current version for test (as it is tested for vfio and rdma). We will switch to v1 once it is fully tested/reviewed.

Sounds good; if v0 fits your needs, v1 will fit them too.

From the HMM perspective, the change between them is minimal.
In v0, I called dma_link_page() here, and now it is called
dma_hmm_link_page().

https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/diff/drivers/infiniband/hw/mlx5/odp.c?h=dma-split-v1&id=a0d719a406133cdc3ef2328dda3ef082a034c45e


>
> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Leon Romanovsky <[email protected]>
> > Sent: Monday, June 10, 2024 12:18 PM
> > To: Zeng, Oak <[email protected]>
> > Cc: Jason Gunthorpe <[email protected]>; Christoph Hellwig <[email protected]>; Robin
> > Murphy <[email protected]>; Marek Szyprowski
> > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > Shameer Kolothum <[email protected]>; Tian, Kevin
> > <[email protected]>; Alex Williamson <[email protected]>;
> > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > foundation.org>; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Bart Van Assche
> > <[email protected]>; Damien Le Moal
> > <[email protected]>; Amir Goldstein
> > <[email protected]>; [email protected]; Martin K. Petersen
> > <[email protected]>; [email protected]; Williams, Dan J
> > <[email protected]>; [email protected]; Zhu Yanjun
> > <[email protected]>; Bommu, Krishnaiah
> > <[email protected]>; Ghimiray, Himal Prasad
> > <[email protected]>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > two steps
> >
> > On Mon, Jun 10, 2024 at 03:12:25PM +0000, Zeng, Oak wrote:
> > > Hi Jason, Leon,
> > >
> > > I come back to this thread to ask a question. Per the discussion in another
> > thread, I have integrated the new dma-mapping API (the first 6 patches of
> > this series) to DRM subsystem. The new API seems fit pretty good to our
> > purpose, better than scatter-gather dma-mapping. So we want to continue
> > work with you to adopt this new API.
> >
> > Sounds great, thanks for the feedback.
> >
> > >
> > > Did you test the new API in RDMA subsystem?
> >
> > This version was tested in our regression tests, but there is a chance
> > that you are hitting flows that were not relevant for RDMA case.
> >
> > > Or this RFC series was just some untested codes sending out to get
> > people's design feedback?
> >
> > RFC was fully tested in VFIO and RDMA paths, but not NVMe patch.
> >
> > > Do you have refined version for us to try? I ask because we are seeing
> > some issues but not sure whether it is caused by the new API. We are
> > debugging but it would be good to also ask at the same time.
> >
> > Yes, as an outcome of the feedback in this thread, I implemented a new
> > version. Unfortunately, there are some personal matters that are preventing
> > from me to send it right away.
> > https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-
> > rdma.git/log/?h=dma-split-v1
> >
> > There are some differences in the API, but the main idea is the same.
> > This version is not fully tested yet.
> >
> > Thanks
> >
> > >
> > > Cc Himal/Krishna who are also working/testing the new API.
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Friday, May 3, 2024 12:43 PM
> > > > To: Zeng, Oak <[email protected]>
> > > > Cc: [email protected]; Christoph Hellwig <[email protected]>; Robin Murphy
> > > > <[email protected]>; Marek Szyprowski
> > > > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > > > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > > > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > > > <[email protected]>; Jonathan Corbet <[email protected]>;
> > Jens
> > > > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > > > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > > > Shameer Kolothum <[email protected]>; Tian,
> > Kevin
> > > > <[email protected]>; Alex Williamson <[email protected]>;
> > > > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > > > foundation.org>; [email protected]; linux-
> > [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Bart Van Assche
> > > > <[email protected]>; Damien Le Moal
> > > > <[email protected]>; Amir Goldstein
> > > > <[email protected]>; [email protected]; Martin K. Petersen
> > > > <[email protected]>; [email protected]; Williams, Dan J
> > > > <[email protected]>; [email protected]; Leon Romanovsky
> > > > <[email protected]>; Zhu Yanjun <[email protected]>
> > > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > > two steps
> > > >
> > > > On Thu, May 02, 2024 at 11:32:55PM +0000, Zeng, Oak wrote:
> > > >
> > > > > > Instead of teaching DMA to know these specific datatypes, let's
> > separate
> > > > > > existing DMA mapping routine to two steps and give an option to
> > > > advanced
> > > > > > callers (subsystems) perform all calculations internally in advance and
> > > > > > map pages later when it is needed.
> > > > >
> > > > > I looked into how this scheme can be applied to DRM subsystem and
> > GPU
> > > > drivers.
> > > > >
> > > > > I figured RDMA can apply this scheme because RDMA can calculate the
> > > > > iova size. Per my limited knowledge of rdma, user can register a
> > > > > memory region (the reg_user_mr vfunc) and memory region's sized is
> > > > > used to pre-allocate iova space. And in the RDMA use case, it seems
> > > > > the user registered region can be very big, e.g., 512MiB or even GiB
> > > >
> > > > In RDMA the iova would be linked to the SVA granual we discussed
> > > > previously.
> > > >
> > > > > In GPU driver, we have a few use cases where we need dma-mapping.
> > Just
> > > > name two:
> > > > >
> > > > > 1) userptr: it is user malloc'ed/mmap'ed memory and registers to gpu
> > > > > (in Intel's driver it is through a vm_bind api, similar to mmap). A
> > > > > userptr can be of any random size, depending on user malloc
> > > > > size. Today we use dma-map-sg for this use case. The down side of
> > > > > our approach is, during userptr invalidation, even if user only
> > > > > munmap partially of an userptr, we invalidate the whole userptr from
> > > > > gpu page table, because there is no way for us to partially
> > > > > dma-unmap the whole sg list. I think we can try your new API in this
> > > > > case. The main benefit of the new approach is the partial munmap
> > > > > case.
> > > >
> > > > Yes, this is one of the main things it will improve.
> > > >
> > > > > We will have to pre-allocate iova for each userptr, and we have many
> > > > > userptrs of random size... So we might be not as efficient as RDMA
> > > > > case where I assume user register a few big memory regions.
> > > >
> > > > You are already doing this. dma_map_sg() does exactly the same IOVA
> > > > allocation under the covers.
> > > >
> > > > > 2) system allocator: it is malloc'ed/mmap'ed memory be used for GPU
> > > > > program directly, without any other extra driver API call. We call
> > > > > this use case system allocator.
> > > >
> > > > > For system allocator, driver have no knowledge of which virtual
> > > > > address range is valid in advance. So when GPU access a
> > > > > malloc'ed/mmap'ed address, we have a page fault. We then look up a
> > > > > CPU vma which contains the fault address. I guess we can use the CPU
> > > > > vma size to allocate the iova space of the same size?
> > > >
> > > > No. You'd follow what we discussed in the other thread.
> > > >
> > > > If you do a full SVA then you'd split your MM space into granuals and
> > > > when a fault hits a granual you'd allocate the IOVA for the whole
> > > > granual. RDMA ODP is using a 512M granual currently.
> > > >
> > > > If you are doing sub ranges then you'd probably allocate the IOVA for
> > > > the well defined sub range (assuming the typical use case isn't huge)
> > > >
> > > > > But there will be a true difficulty to apply your scheme to this use
> > > > > case. It is related to the STICKY flag. As I understand it, the
> > > > > sticky flag is designed for driver to mark "this page/pfn has been
> > > > > populated, no need to re-populate again", roughly...Unlike userptr
> > > > > and RDMA use cases where the backing store of a buffer is always in
> > > > > system memory, in the system allocator use case, the backing store
> > > > > can be changing b/t system memory and GPU's device private
> > > > > memory. Even worse, we have to assume the data migration b/t
> > system
> > > > > and GPU is dynamic. When data is migrated to GPU, we don't need
> > > > > dma-map. And when migration happens to a pfn with STICKY flag, we
> > > > > still need to repopulate this pfn. So you can see, it is not easy to
> > > > > apply this scheme to this use case. At least I can't see an obvious
> > > > > way.
> > > >
> > > > You are already doing this today, you are keeping the sg list around
> > > > until you unmap it.
> > > >
> > > > Instead of keeping the sg list you'd keep a much smaller datastructure
> > > > per-granual. The sticky bit is simply a convient way for ODP to manage
> > > > the smaller data structure, you don't have to use it.
> > > >
> > > > But you do need to keep track of what pages in the granual have been
> > > > DMA mapped - sg list was doing this before. This could be a simple
> > > > bitmap array matching the granual size.
> > > >
> > > > Looking (far) forward we may be able to have a "replace" API that
> > > > allows installing a new page unconditionally regardless of what is
> > > > already there.
> > > >
> > > > Jason

2024-06-11 15:45:35

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> Hi Jason, Leon,
>
> I was able to fix the issue from my side. Things work fine now. I got two questions though:
>
> 1) The value returned from dma_link_range function is not contiguous, see below print. The "linked pa" is the function return.
> I think dma_map_sgtable API would return some contiguous dma address. Is the dma-map_sgtable api is more efficient regarding the iommu page table? i.e., try to use bigger page size, such as use 2M page size when it is possible. With your new API, does it also have such consideration? I vaguely remembered Jason mentioned such thing, but my print below doesn't look like so. Maybe I need to test bigger range (only 16 pages range in the test of below printing). Comment?

My API gives you the flexibility to use any page size you want. You can
use 2M pages instead of 4K pages. The API doesn't enforce any page size.

>
> [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 18ef3f000
> [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190d00000
> [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 190024000
> [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0, linked pa = 178e89000
>
> 2) in the comment of dma_link_range function, it is said: " @dma_offset needs to be advanced by the caller with the size of previous page that was linked + DMA address returned for the previous page".
> Is this description correct? I don't understand the part "+ DMA address returned for the previous page ".
> In my codes, let's say I call this function to link 10 pages, the first dma_offset is 0, second is 4k, third 8k. This worked for me. I didn't add the previously returned dma address.
> Maybe I need more test. But any comment?

You did it perfectly right. This is the correct way to advance dma_offset.

Thanks
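To make the advancement rule concrete, here is a minimal sketch of such a linking loop. The dma_link_range() prototype, struct dma_iova_attrs, and the dma_mapping_error() check below are assumptions pieced together from this thread (a caller-advanced dma_offset, and the DMA address of the linked page as the return value); the real signatures in the series or in v1 may differ.

/*
 * Sketch only.  The prototype below is an assumption based on this
 * thread, not the authoritative signature from the series.
 */
#include <linux/dma-mapping.h>
#include <linux/mm.h>

struct dma_iova_attrs;
dma_addr_t dma_link_range(struct page *page, unsigned long offset,
			  struct dma_iova_attrs *iova, dma_addr_t dma_offset);

static int link_pages_example(struct device *dev, struct dma_iova_attrs *iova,
			      struct page **pages, unsigned int npages)
{
	dma_addr_t dma_offset = 0;
	unsigned int i;

	for (i = 0; i < npages; i++) {
		dma_addr_t addr;

		/* Link one PAGE_SIZE page at the current offset inside the IOVA range. */
		addr = dma_link_range(pages[i], 0, iova, dma_offset);
		if (dma_mapping_error(dev, addr))	/* error convention assumed */
			return -ENOMEM;

		/*
		 * Advance only by the size of the page just linked: 0, 4k, 8k, ...
		 * for 4K pages, or 0, 2M, 4M, ... if 2M pages are linked instead.
		 * The previously returned DMA address is not added.
		 */
		dma_offset += PAGE_SIZE;
	}
	return 0;
}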

>
> Thanks,
> Oak
>
> > -----Original Message-----
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Monday, June 10, 2024 1:25 PM
> > To: Zeng, Oak <[email protected]>
> > Cc: Leon Romanovsky <[email protected]>; Christoph Hellwig <[email protected]>;
> > Robin Murphy <[email protected]>; Marek Szyprowski
> > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > Shameer Kolothum <[email protected]>; Tian, Kevin
> > <[email protected]>; Alex Williamson <[email protected]>;
> > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > foundation.org>; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Bart Van Assche
> > <[email protected]>; Damien Le Moal
> > <[email protected]>; Amir Goldstein
> > <[email protected]>; [email protected]; Martin K. Petersen
> > <[email protected]>; [email protected]; Williams, Dan J
> > <[email protected]>; [email protected]; Zhu Yanjun
> > <[email protected]>; Bommu, Krishnaiah
> > <[email protected]>; Ghimiray, Himal Prasad
> > <[email protected]>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > two steps
> >
> > On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > > Thanks Leon and Yanjun for the reply!
> > >
> > > Based on the reply, we will continue use the current version for
> > > test (as it is tested for vfio and rdma). We will switch to v1 once
> > > it is fully tested/reviewed.
> >
> > I'm glad you are finding it useful, one of my interests with this work
> > is to improve all the HMM users.
> >
> > Jason

2024-06-11 18:27:39

by Zeng, Oak

[permalink] [raw]
Subject: RE: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

Thank you Leon. That is helpful.

I also have another very naïve question. I don't understand what the iova address is. I previously thought the iova address space is the same as the dma_address space when an iommu is involved: dma_alloc_iova would allocate some contiguous iova address range, and later the dma_link_range function would link a physical page to the iova address and return that iova address. In other words, I thought the dma_address is the iova address, and the iommu page table translates a dma_address or iova address to the physical address.

But from my print below, my understanding above is obviously wrong: the iova.dma_addr is 0 and the dma_address returned from dma_link_range is non-zero... Can you help me understand what the iova address is? Is the iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma-address itself? Is the dma-address correlated with the iova address?

Oak

> -----Original Message-----
> From: Leon Romanovsky <[email protected]>
> Sent: Tuesday, June 11, 2024 11:45 AM
> To: Zeng, Oak <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>; Christoph Hellwig <[email protected]>; Robin
> Murphy <[email protected]>; Marek Szyprowski
> <[email protected]>; Joerg Roedel <[email protected]>; Will
> Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> Brost, Matthew <[email protected]>; Hellstrom, Thomas
> <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> Shameer Kolothum <[email protected]>; Tian, Kevin
> <[email protected]>; Alex Williamson <[email protected]>;
> Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Bart Van Assche
> <[email protected]>; Damien Le Moal
> <[email protected]>; Amir Goldstein
> <[email protected]>; [email protected]; Martin K. Petersen
> <[email protected]>; [email protected]; Williams, Dan J
> <[email protected]>; [email protected]; Zhu Yanjun
> <[email protected]>; Bommu, Krishnaiah
> <[email protected]>; Ghimiray, Himal Prasad
> <[email protected]>
> Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> two steps
>
> On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> > Hi Jason, Leon,
> >
> > I was able to fix the issue from my side. Things work fine now. I got two
> questions though:
> >
> > 1) The value returned from dma_link_range function is not contiguous, see
> below print. The "linked pa" is the function return.
> > I think dma_map_sgtable API would return some contiguous dma address.
> Is the dma-map_sgtable api is more efficient regarding the iommu page table?
> i.e., try to use bigger page size, such as use 2M page size when it is possible.
> With your new API, does it also have such consideration? I vaguely
> remembered Jason mentioned such thing, but my print below doesn't look
> like so. Maybe I need to test bigger range (only 16 pages range in the test of
> below printing). Comment?
>
> My API gives you the flexibility to use any page size you want. You can
> use 2M pages instead of 4K pages. The API doesn't enforce any page size.
>
> >
> > [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> linked pa = 18ef3f000
> > [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> linked pa = 190d00000
> > [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> linked pa = 190024000
> > [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> linked pa = 178e89000
> >
> > 2) in the comment of dma_link_range function, it is said: " @dma_offset
> needs to be advanced by the caller with the size of previous page that was
> linked + DMA address returned for the previous page".
> > Is this description correct? I don't understand the part "+ DMA address
> returned for the previous page ".
> > In my codes, let's say I call this function to link 10 pages, the first
> dma_offset is 0, second is 4k, third 8k. This worked for me. I didn't add the
> previously returned dma address.
> > Maybe I need more test. But any comment?
>
> You did it perfectly right. This is the correct way to advance dma_offset.
>
> Thanks
>
> >
> > Thanks,
> > Oak
> >
> > > -----Original Message-----
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Monday, June 10, 2024 1:25 PM
> > > To: Zeng, Oak <[email protected]>
> > > Cc: Leon Romanovsky <[email protected]>; Christoph Hellwig
> <[email protected]>;
> > > Robin Murphy <[email protected]>; Marek Szyprowski
> > > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > > <[email protected]>; Jonathan Corbet <[email protected]>;
> Jens
> > > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > > Shameer Kolothum <[email protected]>; Tian,
> Kevin
> > > <[email protected]>; Alex Williamson <[email protected]>;
> > > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > > foundation.org>; [email protected]; linux-
> [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected]; Bart Van Assche
> > > <[email protected]>; Damien Le Moal
> > > <[email protected]>; Amir Goldstein
> > > <[email protected]>; [email protected]; Martin K. Petersen
> > > <[email protected]>; [email protected]; Williams, Dan J
> > > <[email protected]>; [email protected]; Zhu Yanjun
> > > <[email protected]>; Bommu, Krishnaiah
> > > <[email protected]>; Ghimiray, Himal Prasad
> > > <[email protected]>
> > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > two steps
> > >
> > > On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > > > Thanks Leon and Yanjun for the reply!
> > > >
> > > > Based on the reply, we will continue use the current version for
> > > > test (as it is tested for vfio and rdma). We will switch to v1 once
> > > > it is fully tested/reviewed.
> > >
> > > I'm glad you are finding it useful, one of my interests with this work
> > > is to improve all the HMM users.
> > >
> > > Jason

2024-06-11 19:21:49

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps

On Tue, Jun 11, 2024 at 06:26:23PM +0000, Zeng, Oak wrote:
> Thank you Leon. That is helpful.
>
> I also have another very naïve question. I don't understand what is the iova address. I previously thought the iova address space is the same as the dma_address space when iommu is involved. I thought the dma_alloc_iova would allocate some contiguous iova address range and later dma_link_range function would link a physical page to the iova address and return the iova address. In other words, I thought the dma_address is iova address, and the iommu page table translate a dma_address or iova address to the physical address.

This is the right understanding.

>
> But from my print below, my above understanding is obviously wrong: the iova.dma_addr is 0 and the dma_address returned from dma_link_range is none zero... Can you help me what is iova address? Is iova address iommu related? Since dma_link_range returns a non-iova address, does this function allocate the dma-address itself? Is dma-address correlated with iova address?

This is a combination of two things:
1. The need to support HMM-specific logic.
2. The outcome of the v0 version, where I implemented dma_link_range() to fall back to DMA direct mode; see patches 2 and 3.
https://lore.kernel.org/all/54a3554639bfb963c9919c5d7c1f449021bebdb3.1709635535.git.leon@kernel.org/
https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/

So dma-iova == 0 means that you are working in direct mode and not with the IOMMU, i.e. you can translate from a physical address
to a DMA address with a simple call to phys_to_dma().
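As an illustration of that distinction (not something callers need to write themselves, since dma_link_range() handles the fallback internally in v0): with dma-iova == 0 the returned address is effectively the phys_to_dma() translation of the page's physical address, which is why the printed values above track the physical pages, while with an IOMMU it falls inside the preallocated IOVA range. The helper and variable names below are hypothetical.

#include <linux/dma-direct.h>	/* phys_to_dma() */
#include <linux/io.h>		/* page_to_phys() */
#include <linux/mm.h>

/*
 * Illustration only: "iova_base" stands for iova.dma_addr; the helper
 * itself is hypothetical and not part of the series.
 */
static dma_addr_t example_expected_dma_addr(struct device *dev,
					    struct page *page,
					    dma_addr_t iova_base,
					    dma_addr_t dma_offset)
{
	if (!iova_base)
		/* Direct mode, no IOMMU: the DMA address is derived from the physical address. */
		return phys_to_dma(dev, page_to_phys(page));

	/* IOMMU mode: the DMA address lies inside the preallocated IOVA range. */
	return iova_base + dma_offset;
}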

One of the review comments was that this is not the desired behaviour and that I need to
create separate functions that will be used only when the IOMMU is in use.

See the difference between v0 and v1 for dma_link_range() function.
v0: https://lore.kernel.org/all/f1049f0fc280288ae2f0c1e02388cde91b0f7876.1709635535.git.leon@kernel.org/
v1: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=5aa29f2620ef86ac58c17a0297929a0b9e8d7790

And the HMM variant of the dma_link_range() function, which saves you from
having to copy/paste the same HMM logic from RDMA to DRM.
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=dma-split-v1&id=4d8d8d4fbe7891b1412f03ecaff88bc492e2e4eb

Thanks

>
> Oak
>
> > -----Original Message-----
> > From: Leon Romanovsky <[email protected]>
> > Sent: Tuesday, June 11, 2024 11:45 AM
> > To: Zeng, Oak <[email protected]>
> > Cc: Jason Gunthorpe <[email protected]>; Christoph Hellwig <[email protected]>; Robin
> > Murphy <[email protected]>; Marek Szyprowski
> > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > <[email protected]>; Jonathan Corbet <[email protected]>; Jens
> > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > Shameer Kolothum <[email protected]>; Tian, Kevin
> > <[email protected]>; Alex Williamson <[email protected]>;
> > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > foundation.org>; [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; Bart Van Assche
> > <[email protected]>; Damien Le Moal
> > <[email protected]>; Amir Goldstein
> > <[email protected]>; [email protected]; Martin K. Petersen
> > <[email protected]>; [email protected]; Williams, Dan J
> > <[email protected]>; [email protected]; Zhu Yanjun
> > <[email protected]>; Bommu, Krishnaiah
> > <[email protected]>; Ghimiray, Himal Prasad
> > <[email protected]>
> > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > two steps
> >
> > On Mon, Jun 10, 2024 at 09:28:04PM +0000, Zeng, Oak wrote:
> > > Hi Jason, Leon,
> > >
> > > I was able to fix the issue from my side. Things work fine now. I got two
> > questions though:
> > >
> > > 1) The value returned from dma_link_range function is not contiguous, see
> > below print. The "linked pa" is the function return.
> > > I think dma_map_sgtable API would return some contiguous dma address.
> > Is the dma-map_sgtable api is more efficient regarding the iommu page table?
> > i.e., try to use bigger page size, such as use 2M page size when it is possible.
> > With your new API, does it also have such consideration? I vaguely
> > remembered Jason mentioned such thing, but my print below doesn't look
> > like so. Maybe I need to test bigger range (only 16 pages range in the test of
> > below printing). Comment?
> >
> > My API gives you the flexibility to use any page size you want. You can
> > use 2M pages instead of 4K pages. The API doesn't enforce any page size.
> >
> > >
> > > [17584.665126] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> > linked pa = 18ef3f000
> > > [17584.665146] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> > linked pa = 190d00000
> > > [17584.665150] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> > linked pa = 190024000
> > > [17584.665153] drm_svm_hmmptr_map_dma_pages iova.dma_addr = 0x0,
> > linked pa = 178e89000
> > >
> > > 2) in the comment of dma_link_range function, it is said: " @dma_offset
> > needs to be advanced by the caller with the size of previous page that was
> > linked + DMA address returned for the previous page".
> > > Is this description correct? I don't understand the part "+ DMA address
> > returned for the previous page ".
> > > In my codes, let's say I call this function to link 10 pages, the first
> > dma_offset is 0, second is 4k, third 8k. This worked for me. I didn't add the
> > previously returned dma address.
> > > Maybe I need more test. But any comment?
> >
> > You did it perfectly right. This is the correct way to advance dma_offset.
> >
> > Thanks
> >
> > >
> > > Thanks,
> > > Oak
> > >
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Monday, June 10, 2024 1:25 PM
> > > > To: Zeng, Oak <[email protected]>
> > > > Cc: Leon Romanovsky <[email protected]>; Christoph Hellwig
> > <[email protected]>;
> > > > Robin Murphy <[email protected]>; Marek Szyprowski
> > > > <[email protected]>; Joerg Roedel <[email protected]>; Will
> > > > Deacon <[email protected]>; Chaitanya Kulkarni <[email protected]>;
> > > > Brost, Matthew <[email protected]>; Hellstrom, Thomas
> > > > <[email protected]>; Jonathan Corbet <[email protected]>;
> > Jens
> > > > Axboe <[email protected]>; Keith Busch <[email protected]>; Sagi
> > > > Grimberg <[email protected]>; Yishai Hadas <[email protected]>;
> > > > Shameer Kolothum <[email protected]>; Tian,
> > Kevin
> > > > <[email protected]>; Alex Williamson <[email protected]>;
> > > > Jérôme Glisse <[email protected]>; Andrew Morton <akpm@linux-
> > > > foundation.org>; [email protected]; linux-
> > [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected]; Bart Van Assche
> > > > <[email protected]>; Damien Le Moal
> > > > <[email protected]>; Amir Goldstein
> > > > <[email protected]>; [email protected]; Martin K. Petersen
> > > > <[email protected]>; [email protected]; Williams, Dan J
> > > > <[email protected]>; [email protected]; Zhu Yanjun
> > > > <[email protected]>; Bommu, Krishnaiah
> > > > <[email protected]>; Ghimiray, Himal Prasad
> > > > <[email protected]>
> > > > Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to
> > > > two steps
> > > >
> > > > On Mon, Jun 10, 2024 at 04:40:19PM +0000, Zeng, Oak wrote:
> > > > > Thanks Leon and Yanjun for the reply!
> > > > >
> > > > > Based on the reply, we will continue use the current version for
> > > > > test (as it is tested for vfio and rdma). We will switch to v1 once
> > > > > it is fully tested/reviewed.
> > > >
> > > > I'm glad you are finding it useful, one of my interests with this work
> > > > is to improve all the HMM users.
> > > >
> > > > Jason
>