Posting just for comment; still waiting on HMM to be accepted before
this patchset can be considered for inclusion.
This patchset implements the on-demand paging (ODP) feature using HMM.
It depends on the HMM patchset v10 (previous posting (1)). The long-term
plan is to replace the current ODP implementation with HMM, allowing the
same code infrastructure to be shared across different classes of devices.
HMM (Heterogeneous Memory Management) is a helper layer for devices that
want to mirror a process address space into their own MMU. The main
target is GPUs, but other hardware, such as network devices, can also
use HMM.
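
For readers not familiar with the HMM API, the hook-up used throughout
this series looks roughly like the sketch below. It only shows the model
(one hmm_device per hardware device, one hmm_mirror per mirrored process
address space); hmm_device_register(), hmm_mirror_register() and struct
hmm_device_ops come from the HMM patchset above and their exact shape may
still change, and all example_* names are purely illustrative:

	#include <linux/hmm.h>

	static struct hmm_device example_hmm_dev;

	static int example_update(struct hmm_mirror *mirror,
				  struct hmm_event *event)
	{
		/* Invalidate or fault the device mappings covering
		 * [event->start, event->end) depending on event->etype.
		 */
		return 0;
	}

	static const struct hmm_device_ops example_ops = {
		.update = &example_update,
	};

	/* One hmm_device per hardware device: */
	static int example_init(struct device *dev)
	{
		example_hmm_dev.dev = dev;
		example_hmm_dev.ops = &example_ops;
		return hmm_device_register(&example_hmm_dev);
	}

	/* One hmm_mirror per mirrored process address space: */
	static int example_mirror(struct hmm_mirror *mirror)
	{
		mirror->device = &example_hmm_dev;
		return hmm_mirror_register(mirror);
	}
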
Tree with the patchset:
git://people.freedesktop.org/~glisse/linux hmm-v10 branch
(1) Previous patchset postings:
v1 http://lwn.net/Articles/597289/
v2 https://lkml.org/lkml/2014/6/12/559
v3 https://lkml.org/lkml/2014/6/13/633
v4 https://lkml.org/lkml/2014/8/29/423
v5 https://lkml.org/lkml/2014/11/3/759
v6 http://lwn.net/Articles/619737/
v7 http://lwn.net/Articles/627316/
v8 https://lwn.net/Articles/645515/
v9 https://lwn.net/Articles/651553/
Cheers,
Jérôme
To: <[email protected]>,
To: <[email protected]>,
Cc: "Kevin E Martin" <[email protected]>,
Cc: "Christophe Harle" <[email protected]>,
Cc: "Duncan Poole" <[email protected]>,
Cc: "Sherry Cheung" <[email protected]>,
Cc: "Subhash Gutti" <[email protected]>,
Cc: "John Hubbard" <[email protected]>,
Cc: "Mark Hairgrove" <[email protected]>,
Cc: "Lucien Dunning" <[email protected]>,
Cc: "Cameron Buschardt" <[email protected]>,
Cc: "Arvind Gopalakrishnan" <[email protected]>,
Cc: "Haggai Eran" <[email protected]>,
Cc: "Or Gerlitz" <[email protected]>,
Cc: "Sagi Grimberg" <[email protected]>
Cc: "Shachar Raindel" <[email protected]>,
Cc: "Liran Liss" <[email protected]>,
Cc: "Roland Dreier" <[email protected]>,
Cc: "Sander, Ben" <[email protected]>,
Cc: "Stoner, Greg" <[email protected]>,
Cc: "Bridgman, John" <[email protected]>,
Cc: "Mantor, Michael" <[email protected]>,
Cc: "Blinzer, Paul" <[email protected]>,
Cc: "Morichetti, Laurent" <[email protected]>,
Cc: "Deucher, Alexander" <[email protected]>,
Cc: "Leonid Shamis" <[email protected]>,
When using HMM for ODP it will be useful to pass the current mirror
page table iterator to the __mlx5_ib_populate_pas() function. Add a
void pointer parameter for this.
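
For context, later patches in this series end up calling it along these
lines (a sketch of the intended call pattern only, not something added by
this patch; the hmm_pt_iter_* helpers come from the HMM patchset and
"mirror" stands for the umem's HMM mirror):

	struct hmm_pt_iter iter;

	hmm_pt_iter_init(&iter, &mirror->pt);
	/* Reuse one mirror page table iterator across the whole update
	 * instead of re-walking the mirror page table for every call:
	 */
	__mlx5_ib_populate_pas(dev, umem, page_shift, offset, npages,
			       pas, access_flags, &iter);
	hmm_pt_iter_fini(&iter);

Callers that have no iterator keep passing NULL; a later patch makes
__mlx5_ib_populate_pas() fall back to a local iterator in that case.
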
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/hw/mlx5/mem.c | 8 +++++---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
drivers/infiniband/hw/mlx5/mr.c | 2 +-
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 40df2cc..df56b7d 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -145,11 +145,13 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
* num_pages - total number of pages to fill
* pas - bus addresses array to fill
* access_flags - access flags to set on all present pages.
- use enum mlx5_ib_mtt_access_flags for this.
+ * use enum mlx5_ib_mtt_access_flags for this.
+ * data - intended for odp with hmm, it should point to current mirror page
+ * table iterator.
*/
void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, size_t offset, size_t num_pages,
- __be64 *pas, int access_flags)
+ __be64 *pas, int access_flags, void *data)
{
unsigned long umem_page_shift = ilog2(umem->page_size);
int shift = page_shift - umem_page_shift;
@@ -201,7 +203,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
{
return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
ib_umem_num_pages(umem), pas,
- access_flags);
+ access_flags, NULL);
}
int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
{
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7cae098..d4dbd8e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -622,7 +622,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
int *ncont, int *order);
void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, size_t offset, size_t num_pages,
- __be64 *pas, int access_flags);
+ __be64 *pas, int access_flags, void *data);
void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int page_shift, __be64 *pas, int access_flags);
void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index bc9a0de..ef63e5f 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
if (!zap) {
__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
start_page_index, npages, pas,
- MLX5_IB_MTT_PRESENT);
+ MLX5_IB_MTT_PRESENT, NULL);
/* Clear padding after the pages brought from the
* umem. */
memset(pas + npages, 0, size - npages * sizeof(u64));
--
1.9.3
When using HMM for ODP it will be useful to pass the current mirror
page table iterator to the mlx5_ib_update_mtt() function. Add a void
pointer parameter for this.
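
The HMM page fault path added later in the series then threads its
iterator through this parameter, roughly as below (sketch only; see the
mlx5_hmm_pfault() hunk further down in the series, and note that the
hmm_pt_iter_* helpers come from the HMM patchset):

	struct hmm_pt_iter iter;

	hmm_pt_iter_init(&iter, &mirror->pt);
	/* ... mark the faulted range present in the mirror page table ... */
	ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0 /* !zap */, &iter);
	hmm_pt_iter_fini(&iter);
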
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 +-
drivers/infiniband/hw/mlx5/mr.c | 4 ++--
drivers/infiniband/hw/mlx5/odp.c | 8 +++++---
3 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index d4dbd8e..79d1e7c 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -571,7 +571,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
u64 virt_addr, int access_flags,
struct ib_udata *udata);
int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
- int npages, int zap);
+ int npages, int zap, void *data);
int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index ef63e5f..3ad371d 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -845,7 +845,7 @@ free_mr:
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
- int zap)
+ int zap, void *data)
{
struct mlx5_ib_dev *dev = mr->dev;
struct device *ddev = dev->ib_dev.dma_device;
@@ -912,7 +912,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
if (!zap) {
__mlx5_ib_populate_pas(dev, umem, PAGE_SHIFT,
start_page_index, npages, pas,
- MLX5_IB_MTT_PRESENT, NULL);
+ MLX5_IB_MTT_PRESENT, data);
/* Clear padding after the pages brought from the
* umem. */
memset(pas + npages, 0, size - npages * sizeof(u64));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index aa8391e..df86d05 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -91,14 +91,15 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
if (in_block && umr_offset == 0) {
mlx5_ib_update_mtt(mr, blk_start_idx,
- idx - blk_start_idx, 1);
+ idx - blk_start_idx, 1,
+ NULL);
in_block = 0;
}
}
}
if (in_block)
mlx5_ib_update_mtt(mr, blk_start_idx, idx - blk_start_idx + 1,
- 1);
+ 1, NULL);
/*
* We are now sure that the device will not access the
@@ -249,7 +250,8 @@ static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
* this MR, since ib_umem_odp_map_dma_pages already
* checks this.
*/
- ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+ ret = mlx5_ib_update_mtt(mr, start_idx,
+ npages, 0, NULL);
} else {
ret = -EAGAIN;
}
--
1.9.3
The mlx5 driver will need this function for its driver-specific part
of ODP (on-demand paging) on top of HMM (Heterogeneous Memory Management).
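
The intended use in the mlx5 HMM code is of this shape (condensed from a
later patch in this series; the callback type is the umem_call_back
typedef from ib_umem_odp.h, ib_mirror and invalidate_one() are only
illustrative here and the real structures are introduced further down):

	static int invalidate_one(struct ib_umem *umem, u64 start, u64 end,
				  void *cookie)
	{
		/* Zap the device mappings of this umem for [start, end). */
		return 0;
	}

	down_read(&ib_mirror->umem_rwsem);
	ret = rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree, start, end,
					    invalidate_one, NULL);
	up_read(&ib_mirror->umem_rwsem);
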
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/core/umem_rbtree.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/infiniband/core/umem_rbtree.c b/drivers/infiniband/core/umem_rbtree.c
index 727d788..f030ec0 100644
--- a/drivers/infiniband/core/umem_rbtree.c
+++ b/drivers/infiniband/core/umem_rbtree.c
@@ -92,3 +92,4 @@ int rbt_ib_umem_for_each_in_range(struct rb_root *root,
return ret_val;
}
+EXPORT_SYMBOL(rbt_ib_umem_for_each_in_range);
--
1.9.3
This is a preparatory patch for the HMM implementation of ODP (on-demand
paging). It shuffles around code that will be shared between the current
ODP implementation and the HMM code path. It also converts many #ifdef
CONFIG_* checks to #if IS_ENABLED().
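
The resulting guards are nested so that the HMM variant can be slotted in
by later patches, i.e. (illustration of the pattern used below):

	#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
	#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
	/* HMM based ODP, filled in by later patches. */
	#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	/* Current mmu_notifier based ODP. */
	#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
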
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 3 +
drivers/infiniband/core/uverbs_cmd.c | 24 ++++--
drivers/infiniband/hw/mlx5/main.c | 13 ++-
drivers/infiniband/hw/mlx5/mem.c | 11 ++-
drivers/infiniband/hw/mlx5/mlx5_ib.h | 14 ++--
drivers/infiniband/hw/mlx5/mr.c | 19 +++--
drivers/infiniband/hw/mlx5/odp.c | 118 ++++++++++++++-------------
drivers/infiniband/hw/mlx5/qp.c | 4 +-
drivers/net/ethernet/mellanox/mlx5/core/eq.c | 2 +-
drivers/net/ethernet/mellanox/mlx5/core/qp.c | 8 +-
include/rdma/ib_umem_odp.h | 51 +++++++-----
include/rdma/ib_verbs.h | 7 +-
12 files changed, 159 insertions(+), 115 deletions(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 0541761..d3b65d4 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,6 +41,8 @@
#include <rdma/ib_umem.h>
#include <rdma/ib_umem_odp.h>
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
static void ib_umem_notifier_start_account(struct ib_umem *item)
{
mutex_lock(&item->odp_data->umem_mutex);
@@ -667,3 +669,4 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 virt,
mutex_unlock(&umem->odp_data->umem_mutex);
}
EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index bbb02ff..53163aa 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -289,9 +289,12 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
struct ib_uverbs_get_context_resp resp;
struct ib_udata udata;
struct ib_device *ibdev = file->device->ib_dev;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
struct ib_device_attr dev_attr;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
struct ib_ucontext *ucontext;
struct file *filp;
int ret;
@@ -334,7 +337,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
rcu_read_unlock();
ucontext->closing = 0;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
ucontext->umem_tree = RB_ROOT;
init_rwsem(&ucontext->umem_rwsem);
ucontext->odp_mrs_count = 0;
@@ -345,8 +350,8 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
goto err_free;
if (!(dev_attr.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
ucontext->invalidate_range = NULL;
-
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
resp.num_comp_vectors = file->device->num_comp_vectors;
@@ -3438,7 +3443,9 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
if (ucore->outlen < resp.response_length + sizeof(resp.odp_caps))
goto end;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
resp.odp_caps.general_caps = attr.odp_caps.general_caps;
resp.odp_caps.per_transport_caps.rc_odp_caps =
attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3447,9 +3454,10 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
resp.odp_caps.per_transport_caps.ud_odp_caps =
attr.odp_caps.per_transport_caps.ud_odp_caps;
resp.odp_caps.reserved = 0;
-#else
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
resp.response_length += sizeof(resp.odp_caps);
if (ucore->outlen < resp.response_length + sizeof(resp.timestamp_mask))
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 085c24b..da31c70 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -293,11 +293,14 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
if (MLX5_CAP_GEN(mdev, pg))
props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
props->odp_caps = dev->odp_caps;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
return 0;
}
@@ -673,9 +676,11 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev,
goto out_count;
}
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM))
context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
INIT_LIST_HEAD(&context->db_page_list);
mutex_init(&context->db_page_mutex);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index df56b7d..19354b6 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -120,7 +120,7 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
*count = i;
}
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
{
u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
@@ -132,7 +132,7 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
return mtt_entry;
}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
/*
* Populate the given array with bus addresses from the umem.
@@ -162,7 +162,9 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int len;
struct scatterlist *sg;
int entry;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
const bool odp = umem->odp_data != NULL;
if (odp) {
@@ -176,7 +178,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
}
return;
}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
i = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 79d1e7c..28b500a 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -218,7 +218,7 @@ struct mlx5_ib_qp {
/* Store signature errors */
bool signature_en;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
/*
* A flag that is true for QP's that are in a state that doesn't
* allow page faults, and shouldn't schedule any more faults.
@@ -231,7 +231,7 @@ struct mlx5_ib_qp {
*/
spinlock_t disable_page_faults_lock;
struct mlx5_ib_pfault pagefaults[MLX5_IB_PAGEFAULT_CONTEXTS];
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
};
struct mlx5_ib_cq_buf {
@@ -434,14 +434,14 @@ struct mlx5_ib_dev {
struct mlx5_mr_cache cache;
struct timer_list delay_timer;
int fill_delay;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
struct ib_odp_caps odp_caps;
/*
* Sleepable RCU that prevents destruction of MRs while they are still
* being used by a page fault handler.
*/
struct srcu_struct mr_srcu;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
};
static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -634,7 +634,7 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context);
int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
extern struct workqueue_struct *mlx5_ib_page_fault_wq;
void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev);
@@ -647,8 +647,12 @@ int __init mlx5_ib_odp_init(void);
void mlx5_ib_odp_cleanup(void);
void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
unsigned long end);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3ad371d..18893611 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -46,7 +46,7 @@ enum {
};
#define MLX5_UMR_ALIGN 2048
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
static __be64 mlx5_ib_update_mtt_emergency_buffer[
MLX5_UMR_MTT_MIN_CHUNK_SIZE/sizeof(__be64)]
__aligned(MLX5_UMR_ALIGN);
@@ -59,10 +59,10 @@ static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
{
int err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr);
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
/* Wait until all page fault handlers using the mr complete. */
synchronize_srcu(&dev->mr_srcu);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
return err;
}
@@ -843,7 +843,7 @@ free_mr:
return mr;
}
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, int npages,
int zap, void *data)
{
@@ -1090,7 +1090,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
if (umem->odp_data) {
/*
* This barrier prevents the compiler from moving the
@@ -1113,7 +1113,7 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
*/
smp_wmb();
}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
return &mr->ibmr;
@@ -1202,15 +1202,18 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
int npages = mr->npages;
struct ib_umem *umem = mr->umem;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
if (umem && umem->odp_data) {
/* Prevent new page faults from succeeding */
mr->live = 0;
/* Wait for all running page-fault handlers to finish. */
synchronize_srcu(&dev->mr_srcu);
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
/* Destroy all page mappings */
mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
ib_umem_end(umem));
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
/*
* We kill the umem before the MR for ODP,
* so that there will not be any invalidations in
@@ -1222,7 +1225,7 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
/* Avoid double-freeing the umem. */
umem = NULL;
}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
clean_mr(mr);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index df86d05..7299542 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -37,12 +37,29 @@
#define MAX_PREFETCH_LEN (4*1024*1024U)
+struct workqueue_struct *mlx5_ib_page_fault_wq;
+
+static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
+ u32 key)
+{
+ u32 base_key = mlx5_base_mkey(key);
+ struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
+ struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
+
+ if (!mmr || mmr->key != key || !mr->live)
+ return NULL;
+
+ return container_of(mmr, struct mlx5_ib_mr, mmr);
+}
+
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
/* Timeout in ms to wait for an active mmu notifier to complete when handling
* a pagefault. */
#define MMU_NOTIFIER_TIMEOUT 1000
-struct workqueue_struct *mlx5_ib_page_fault_wq;
-
void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
unsigned long end)
{
@@ -110,60 +127,6 @@ void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
ib_umem_odp_unmap_dma_pages(umem, start, end);
}
-void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
-{
- struct ib_odp_caps *caps = &dev->odp_caps;
-
- memset(caps, 0, sizeof(*caps));
-
- if (!MLX5_CAP_GEN(dev->mdev, pg))
- return;
-
- caps->general_caps = IB_ODP_SUPPORT;
-
- if (MLX5_CAP_ODP(dev->mdev, ud_odp_caps.send))
- caps->per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
-
- if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.send))
- caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
-
- if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.receive))
- caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
-
- if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.write))
- caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
-
- if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.read))
- caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
-
- return;
-}
-
-static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
- u32 key)
-{
- u32 base_key = mlx5_base_mkey(key);
- struct mlx5_core_mr *mmr = __mlx5_mr_lookup(dev->mdev, base_key);
- struct mlx5_ib_mr *mr = container_of(mmr, struct mlx5_ib_mr, mmr);
-
- if (!mmr || mmr->key != key || !mr->live)
- return NULL;
-
- return container_of(mmr, struct mlx5_ib_mr, mmr);
-}
-
-static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
- struct mlx5_ib_pfault *pfault,
- int error) {
- struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
- int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
- pfault->mpfault.flags,
- error);
- if (ret)
- pr_err("Failed to resolve the page fault on QP 0x%x\n",
- qp->mqp.qpn);
-}
-
/*
* Handle a single data segment in a page-fault WQE.
*
@@ -291,6 +254,49 @@ srcu_unlock:
return ret ? ret : npages;
}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
+void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
+{
+ struct ib_odp_caps *caps = &dev->odp_caps;
+
+ memset(caps, 0, sizeof(*caps));
+
+ if (!MLX5_CAP_GEN(dev->mdev, pg))
+ return;
+
+ caps->general_caps = IB_ODP_SUPPORT;
+
+ if (MLX5_CAP_ODP(dev->mdev, ud_odp_caps.send))
+ caps->per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
+
+ if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.send))
+ caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
+
+ if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.receive))
+ caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
+
+ if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.write))
+ caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
+
+ if (MLX5_CAP_ODP(dev->mdev, rc_odp_caps.read))
+ caps->per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+}
+
+static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault,
+ int error) {
+ struct mlx5_ib_dev *dev = to_mdev(qp->ibqp.pd->device);
+ int ret = mlx5_core_page_fault_resume(dev->mdev, qp->mqp.qpn,
+ pfault->mpfault.flags,
+ error);
+ if (ret)
+ pr_err("Failed to resolve the page fault on QP 0x%x\n",
+ qp->mqp.qpn);
+}
+
/**
* Parse a series of data segments for page fault handling.
*
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 203c8a4..46ed2c9 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3035,13 +3035,13 @@ int mlx5_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr, int qp_attr
int mlx5_state;
int err = 0;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
/*
* Wait for any outstanding page faults, in case the user frees memory
* based upon this query's result.
*/
flush_workqueue(mlx5_ib_page_fault_wq);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
mutex_lock(&qp->mutex);
outb = kzalloc(sizeof(*outb), GFP_KERNEL);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index a40b96d..ec7ee90 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -281,7 +281,7 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
}
break;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
case MLX5_EVENT_TYPE_PAGE_FAULT:
mlx5_eq_pagefault(dev, eqe);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 8b494b5..d25b7be 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -88,7 +88,7 @@ void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type)
mlx5_core_put_rsc(common);
}
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
{
struct mlx5_eqe_page_fault *pf_eqe = &eqe->data.page_fault;
@@ -175,7 +175,7 @@ void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
mlx5_core_put_rsc(common);
}
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
int mlx5_core_create_qp(struct mlx5_core_dev *dev,
struct mlx5_core_qp *qp,
@@ -419,7 +419,7 @@ int mlx5_core_xrcd_dealloc(struct mlx5_core_dev *dev, u32 xrcdn)
}
EXPORT_SYMBOL_GPL(mlx5_core_xrcd_dealloc);
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
u8 flags, int error)
{
@@ -447,4 +447,4 @@ int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 qpn,
return err;
}
EXPORT_SYMBOL_GPL(mlx5_core_page_fault_resume);
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 3da0b16..313d7f1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -43,6 +43,8 @@ struct umem_odp_node {
};
struct ib_umem_odp {
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else
/*
* An array of the pages included in the on-demand paging umem.
* Indices of pages that are currently not mapped into the device will
@@ -62,8 +64,6 @@ struct ib_umem_odp {
* also protects access to the mmu notifier counters.
*/
struct mutex umem_mutex;
- void *private; /* for the HW driver to use. */
-
/* When false, use the notifier counter in the ucontext struct. */
bool mn_counters_active;
int notifiers_seq;
@@ -72,21 +72,43 @@ struct ib_umem_odp {
/* A linked list of umems that don't have private mmu notifier
* counters yet. */
struct list_head no_private_counters;
+ struct completion notifier_completion;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+ void *private; /* for the HW driver to use. */
struct ib_umem *umem;
/* Tree tracking */
struct umem_odp_node interval_tree;
-
- struct completion notifier_completion;
int dying;
};
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem);
void ib_umem_odp_release(struct ib_umem *umem);
+void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
+void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
+typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
+ void *cookie);
+/*
+ * Call the callback on each ib_umem in the range. Returns the logical or of
+ * the return values of the functions called.
+ */
+int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
+ umem_call_back cb, void *cookie);
+
+struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
+ u64 start, u64 last);
+struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
+ u64 start, u64 last);
+
+
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
+
/*
* The lower 2 bits of the DMA address signal the R/W permissions for
* the entry. To upgrade the permissions, provide the appropriate
@@ -106,22 +128,6 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 start_offset, u64 bcnt,
void ib_umem_odp_unmap_dma_pages(struct ib_umem *umem, u64 start_offset,
u64 bound);
-void rbt_ib_umem_insert(struct umem_odp_node *node, struct rb_root *root);
-void rbt_ib_umem_remove(struct umem_odp_node *node, struct rb_root *root);
-typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
- void *cookie);
-/*
- * Call the callback on each ib_umem in the range. Returns the logical or of
- * the return values of the functions called.
- */
-int rbt_ib_umem_for_each_in_range(struct rb_root *root, u64 start, u64 end,
- umem_call_back cb, void *cookie);
-
-struct umem_odp_node *rbt_ib_umem_iter_first(struct rb_root *root,
- u64 start, u64 last);
-struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
- u64 start, u64 last);
-
static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
unsigned long mmu_seq)
{
@@ -145,8 +151,11 @@ static inline int ib_umem_mmu_notifier_retry(struct ib_umem *item,
return 0;
}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
static inline int ib_umem_odp_get(struct ib_ucontext *context,
struct ib_umem *umem)
{
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index b0f898e..9d32df11 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1215,7 +1215,9 @@ struct ib_ucontext {
int closing;
struct pid *tgid;
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
struct rb_root umem_tree;
/*
* Protects .umem_rbroot and tree, as well as odp_mrs_count and
@@ -1230,7 +1232,8 @@ struct ib_ucontext {
/* A list of umems that don't have private mmu notifier counters yet. */
struct list_head no_private_counters;
int odp_mrs_count;
-#endif
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
};
struct ib_uobject {
--
1.9.3
This adds new core InfiniBand structures and helpers to implement ODP
(on-demand paging) on top of HMM. We need to retain the tree of ib_umem
because some hardware associates a unique identifier with each umem (or
MR) and only allows the hardware page table to be updated using this
unique id.
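
Concretely, registering an ODP umem then boils down to the following
(condensed from the patch below; error handling and the mirror allocation
path are omitted):

	struct ib_mirror *ib_mirror = NULL, *tmp;

	mutex_lock(&ib_device->hmm_mutex);
	/* Reuse the per process mirror if one already exists for this mm,
	 * otherwise allocate one and hmm_mirror_register() it:
	 */
	list_for_each_entry(tmp, &ib_device->ib_mirrors, list) {
		if (tmp->base.hmm->mm == mm) {
			ib_mirror = ib_mirror_ref(tmp);
			break;
		}
	}
	mutex_unlock(&ib_device->hmm_mutex);

	/* Each umem is indexed by virtual address in the mirror tree so
	 * that invalidation can find every MR covering a given range:
	 */
	down_write(&ib_mirror->umem_rwsem);
	rbt_ib_umem_insert(&umem->odp_data->interval_tree,
			   &ib_mirror->umem_tree);
	up_write(&ib_mirror->umem_rwsem);
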
Changed since v1:
- Adapt to new hmm_mirror lifetime rules.
- Fix scan of existing mirror in ib_umem_odp_get().
Changed since v2:
- Remove FIXME for empty umem as it is an invalid case.
- Fix HMM version of ib_umem_odp_release()
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
Signed-off-by: Haggai Eran <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 145 ++++++++++++++++++++++++++++++++++
drivers/infiniband/core/uverbs_cmd.c | 1 +
drivers/infiniband/core/uverbs_main.c | 6 ++
include/rdma/ib_umem_odp.h | 27 +++++++
include/rdma/ib_verbs.h | 12 +++
5 files changed, 191 insertions(+)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d3b65d4..bcbc2c2 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -42,7 +42,152 @@
#include <rdma/ib_umem_odp.h>
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+
+static void ib_mirror_destroy(struct kref *kref)
+{
+ struct ib_mirror *ib_mirror;
+ struct ib_device *ib_device;
+
+ ib_mirror = container_of(kref, struct ib_mirror, kref);
+
+ ib_device = ib_mirror->ib_device;
+ mutex_lock(&ib_device->hmm_mutex);
+ list_del_init(&ib_mirror->list);
+ mutex_unlock(&ib_device->hmm_mutex);
+
+ /* hmm_mirror_unregister() will free the structure. */
+ hmm_mirror_unregister(&ib_mirror->base);
+}
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror)
+{
+ if (ib_mirror == NULL)
+ return;
+
+ kref_put(&ib_mirror->kref, ib_mirror_destroy);
+}
+EXPORT_SYMBOL(ib_mirror_unref);
+
+static inline struct ib_mirror *ib_mirror_ref(struct ib_mirror *ib_mirror)
+{
+ if (!ib_mirror || !kref_get_unless_zero(&ib_mirror->kref))
+ return NULL;
+ return ib_mirror;
+}
+
+int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
+{
+ struct mm_struct *mm = get_task_mm(current);
+ struct ib_device *ib_device = context->device;
+ struct ib_mirror *ib_mirror;
+ struct pid *our_pid;
+ int ret;
+
+ if (!mm || !ib_device->hmm_ready)
+ return -EINVAL;
+
+ /* This can not happen ! */
+ if (unlikely(ib_umem_start(umem) == ib_umem_end(umem)))
+ return -EINVAL;
+
+ /* Prevent creating ODP MRs in child processes */
+ rcu_read_lock();
+ our_pid = get_task_pid(current->group_leader, PIDTYPE_PID);
+ rcu_read_unlock();
+ put_pid(our_pid);
+ if (context->tgid != our_pid) {
+ mmput(mm);
+ return -EINVAL;
+ }
+
+ umem->hugetlb = 0;
+ umem->odp_data = kmalloc(sizeof(*umem->odp_data), GFP_KERNEL);
+ if (umem->odp_data == NULL) {
+ mmput(mm);
+ return -ENOMEM;
+ }
+ umem->odp_data->private = NULL;
+ umem->odp_data->umem = umem;
+
+ mutex_lock(&ib_device->hmm_mutex);
+ /* Is there an existing mirror for this process mm ? */
+ ib_mirror = ib_mirror_ref(context->ib_mirror);
+ if (!ib_mirror) {
+ struct ib_mirror *tmp;
+
+ list_for_each_entry(tmp, &ib_device->ib_mirrors, list) {
+ if (tmp->base.hmm->mm != mm)
+ continue;
+ ib_mirror = ib_mirror_ref(tmp);
+ break;
+ }
+ }
+
+ if (!ib_mirror) {
+ /* We need to create a new mirror. */
+ ib_mirror = kmalloc(sizeof(*ib_mirror), GFP_KERNEL);
+ if (!ib_mirror) {
+ mutex_unlock(&ib_device->hmm_mutex);
+ mmput(mm);
+ return -ENOMEM;
+ }
+ kref_init(&ib_mirror->kref);
+ init_rwsem(&ib_mirror->hmm_mr_rwsem);
+ ib_mirror->umem_tree = RB_ROOT;
+ ib_mirror->ib_device = ib_device;
+
+ ib_mirror->base.device = &ib_device->hmm_dev;
+ ret = hmm_mirror_register(&ib_mirror->base);
+ if (ret) {
+ mutex_unlock(&ib_device->hmm_mutex);
+ kfree(ib_mirror);
+ mmput(mm);
+ return ret;
+ }
+
+ list_add(&ib_mirror->list, &ib_device->ib_mirrors);
+ context->ib_mirror = ib_mirror_ref(ib_mirror);
+ }
+ mutex_unlock(&ib_device->hmm_mutex);
+ umem->odp_data.ib_mirror = ib_mirror;
+
+ down_write(&ib_mirror->umem_rwsem);
+ rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+ up_write(&ib_mirror->umem_rwsem);
+
+ mmput(mm);
+ return 0;
+}
+
+void ib_umem_odp_release(struct ib_umem *umem)
+{
+ struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+
+ /*
+ * Ensure that no more pages are mapped in the umem.
+ *
+ * It is the driver's responsibility to ensure, before calling us,
+ * that the hardware will not attempt to access the MR any more.
+ */
+
+ /* One optimization to release resources early here would be to call :
+ * hmm_mirror_range_discard(&ib_mirror->base,
+ * ib_umem_start(umem),
+ * ib_umem_end(umem));
+ * But we can have overlapping umem so we would need to only discard
+ * range covered by one and only one umem while holding the umem rwsem.
+ */
+ down_write(&ib_mirror->umem_rwsem);
+ rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+ up_write(&ib_mirror->umem_rwsem);
+
+ ib_mirror_unref(ib_mirror);
+ kfree(umem->odp_data);
+ kfree(umem);
+}
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
static void ib_umem_notifier_start_account(struct ib_umem *item)
{
mutex_lock(&item->odp_data->umem_mutex);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 53163aa..1db6a17 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -339,6 +339,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ ucontext->ib_mirror = NULL;
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
ucontext->umem_tree = RB_ROOT;
init_rwsem(&ucontext->umem_rwsem);
diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index f6eef2d..201bde3 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -45,6 +45,7 @@
#include <linux/cdev.h>
#include <linux/anon_inodes.h>
#include <linux/slab.h>
+#include <rdma/ib_umem_odp.h>
#include <asm/uaccess.h>
@@ -298,6 +299,11 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
kfree(uobj);
}
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ ib_mirror_unref(context->ib_mirror);
+ context->ib_mirror = NULL;
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
put_pid(context->tgid);
return context->device->dealloc_ucontext(context);
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 313d7f1..36b72f0 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -37,6 +37,32 @@
#include <rdma/ib_verbs.h>
#include <linux/interval_tree.h>
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+/* struct ib_mirror - per process mirror structure for infiniband driver.
+ *
+ * @ib_device: Infiniband device this mirror is associated with.
+ * @base: The hmm base mirror struct.
+ * @kref: Refcount for the structure.
+ * @list: For the list of ib_mirror of a given ib_device.
+ * @umem_tree: Red black tree of ib_umem ordered by virtual address.
+ * @umem_rwsem: Semaphore protecting the reb black tree.
+ *
+ * Because ib_ucontext struct is tie to file descriptor there can be several of
+ * them for a same process, which violate HMM requirement. Hence we create only
+ * one ib_mirror struct per process and have each ib_umem struct reference it.
+ */
+struct ib_mirror {
+ struct ib_device *ib_device;
+ struct hmm_mirror base;
+ struct kref kref;
+ struct list_head list;
+ struct rb_root umem_tree;
+ struct rw_semaphore umem_rwsem;
+};
+
+void ib_mirror_unref(struct ib_mirror *ib_mirror);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
struct umem_odp_node {
u64 __subtree_last;
struct rb_node rb;
@@ -44,6 +70,7 @@ struct umem_odp_node {
struct ib_umem_odp {
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ struct ib_mirror *ib_mirror;
#else
/*
* An array of the pages included in the on-demand paging umem.
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9d32df11..987050b 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -49,6 +49,9 @@
#include <linux/scatterlist.h>
#include <linux/workqueue.h>
#include <uapi/linux/if_ether.h>
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+#include <linux/hmm.h>
+#endif
#include <linux/atomic.h>
#include <linux/mmu_notifier.h>
@@ -1217,6 +1220,7 @@ struct ib_ucontext {
struct pid *tgid;
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ struct ib_mirror *ib_mirror;
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
struct rb_root umem_tree;
/*
@@ -1730,6 +1734,14 @@ struct ib_device {
struct ib_dma_mapping_ops *dma_ops;
+#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ /* For ODP using HMM. */
+ struct hmm_device hmm_dev;
+ struct list_head ib_mirrors;
+ struct mutex hmm_mutex;
+ bool hmm_ready;
+#endif
+
struct module *owner;
struct device dev;
struct kobject *ports_parent;
--
1.9.3
This adds the core HMM callbacks for the mlx5 device driver and
initializes the HMM device for the mlx5 InfiniBand device driver.
Changed since v1:
- Adapt to new hmm_mirror lifetime rules.
- HMM_ISDIRTY no longer exists.
Changed since v2:
- Adapt to HMM page table changes.
Signed-off-by: Jérôme Glisse <[email protected]>
Signed-off-by: John Hubbard <[email protected]>
---
drivers/infiniband/core/umem_odp.c | 10 +-
drivers/infiniband/hw/mlx5/main.c | 5 +
drivers/infiniband/hw/mlx5/mem.c | 38 ++++++++
drivers/infiniband/hw/mlx5/mlx5_ib.h | 17 ++++
drivers/infiniband/hw/mlx5/mr.c | 7 ++
drivers/infiniband/hw/mlx5/odp.c | 174 +++++++++++++++++++++++++++++++++++
include/rdma/ib_umem_odp.h | 17 ++++
7 files changed, 264 insertions(+), 4 deletions(-)
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index bcbc2c2..b7dd8228 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -132,7 +132,7 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
return -ENOMEM;
}
kref_init(&ib_mirror->kref);
- init_rwsem(&ib_mirror->hmm_mr_rwsem);
+ init_rwsem(&ib_mirror->umem_rwsem);
ib_mirror->umem_tree = RB_ROOT;
ib_mirror->ib_device = ib_device;
@@ -149,10 +149,11 @@ int ib_umem_odp_get(struct ib_ucontext *context, struct ib_umem *umem)
context->ib_mirror = ib_mirror_ref(ib_mirror);
}
mutex_unlock(&ib_device->hmm_mutex);
- umem->odp_data.ib_mirror = ib_mirror;
+ umem->odp_data->ib_mirror = ib_mirror;
down_write(&ib_mirror->umem_rwsem);
- rbt_ib_umem_insert(&umem->odp_data->interval_tree, &mirror->umem_tree);
+ rbt_ib_umem_insert(&umem->odp_data->interval_tree,
+ &ib_mirror->umem_tree);
up_write(&ib_mirror->umem_rwsem);
mmput(mm);
@@ -178,7 +179,8 @@ void ib_umem_odp_release(struct ib_umem *umem)
* range covered by one and only one umem while holding the umem rwsem.
*/
down_write(&ib_mirror->umem_rwsem);
- rbt_ib_umem_remove(&umem->odp_data->interval_tree, &mirror->umem_tree);
+ rbt_ib_umem_remove(&umem->odp_data->interval_tree,
+ &ib_mirror->umem_tree);
up_write(&ib_mirror->umem_rwsem);
ib_mirror_unref(ib_mirror);
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index da31c70..32ed2f1 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1530,6 +1530,9 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
if (err)
goto err_rsrc;
+ /* If HMM initialization fails we just do not enable odp. */
+ mlx5_dev_init_odp_hmm(&dev->ib_dev, &mdev->pdev->dev);
+
err = ib_register_device(&dev->ib_dev, NULL);
if (err)
goto err_odp;
@@ -1554,6 +1557,7 @@ err_umrc:
err_dev:
ib_unregister_device(&dev->ib_dev);
+ mlx5_dev_fini_odp_hmm(&dev->ib_dev);
err_odp:
mlx5_ib_odp_remove_one(dev);
@@ -1573,6 +1577,7 @@ static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
ib_unregister_device(&dev->ib_dev);
destroy_umrc_res(dev);
+ mlx5_dev_fini_odp_hmm(&dev->ib_dev);
mlx5_ib_odp_remove_one(dev);
destroy_dev_resources(&dev->devr);
ib_dealloc_device(&dev->ib_dev);
diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 19354b6..0d74eac 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -154,6 +154,8 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
__be64 *pas, int access_flags, void *data)
{
unsigned long umem_page_shift = ilog2(umem->page_size);
+ unsigned long start = ib_umem_start(umem) + (offset << PAGE_SHIFT);
+ unsigned long end = start + (num_pages << PAGE_SHIFT);
int shift = page_shift - umem_page_shift;
int mask = (1 << shift) - 1;
int i, k;
@@ -164,6 +166,42 @@ void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
int entry;
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ if (umem->odp_data) {
+ struct ib_mirror *ib_mirror = umem->odp_data->ib_mirror;
+ struct hmm_mirror *mirror = &ib_mirror->base;
+ struct hmm_pt_iter *iter = data, local_iter;
+ unsigned long addr;
+
+ if (iter == NULL) {
+ iter = &local_iter;
+ hmm_pt_iter_init(iter, &mirror->pt);
+ }
+
+ for (i=0, addr=start; i < num_pages; ++i, addr+=PAGE_SIZE) {
+ unsigned long next = end;
+ dma_addr_t *ptep, pte;
+
+ /* Get and lock pointer to mirror page table. */
+ ptep = hmm_pt_iter_lookup(iter, addr, &next);
+ pte = ptep ? *ptep : 0;
+ /*
+ * HMM will not have any page tables set up, if this
+ * function is called before page faults have happened
+ * on the MR. In that case, we don't have PA's yet, so
+ * just set each one to zero and continue on. The hw
+ * will trigger a page fault.
+ */
+ if (hmm_pte_test_valid_dma(&pte))
+ pas[i] = cpu_to_be64(umem_dma_to_mtt(pte));
+ else
+ pas[i] = (__be64)0;
+ }
+
+ if (iter == &local_iter)
+ hmm_pt_iter_fini(iter);
+
+ return;
+ }
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
const bool odp = umem->odp_data != NULL;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 28b500a..ba2d46e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -635,6 +635,7 @@ int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
+
extern struct workqueue_struct *mlx5_ib_page_fault_wq;
void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev);
@@ -649,11 +650,16 @@ void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev, struct device *dev);
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev);
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+ u64 end, void *cookie);
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
unsigned long end);
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
{
@@ -690,4 +696,15 @@ static inline u8 convert_access(int acc)
#define MLX5_MAX_UMR_SHIFT 16
#define MLX5_MAX_UMR_PAGES (1 << MLX5_MAX_UMR_SHIFT)
+#ifndef CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM
+static inline void mlx5_dev_init_odp_hmm(struct ib_device *ib_dev,
+ struct device *dev)
+{
+}
+
+static inline void mlx5_dev_fini_odp_hmm(struct ib_device *ib_dev)
+{
+}
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
+
#endif /* MLX5_IB_H */
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 18893611..ae71b78 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1209,6 +1209,13 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
/* Wait for all running page-fault handlers to finish. */
synchronize_srcu(&dev->mr_srcu);
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ if (mlx5_ib_umem_invalidate(umem, ib_umem_start(umem),
+ ib_umem_end(umem), NULL))
+ /*
+ * FIXME do something to kill all mr and umem
+ * in use by this process.
+ */
+ pr_err("killing all mr with odp due to mtt update failure\n");
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
/* Destroy all page mappings */
mlx5_ib_invalidate_range(umem, ib_umem_start(umem),
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 7299542..5ef31da 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -53,6 +53,180 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
}
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+
+int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
+ u64 end, void *cookie)
+{
+ const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT / sizeof(u64)) - 1;
+ u64 idx = 0, blk_start_idx = 0;
+ struct hmm_pt_iter iter;
+ struct mlx5_ib_mr *mlx5_ib_mr;
+ struct hmm_mirror *mirror;
+ unsigned long addr;
+ int in_block = 0;
+ int ret = 0;
+
+ if (!umem || !umem->odp_data) {
+ pr_err("invalidation called on NULL umem or non-ODP umem\n");
+ return -EINVAL;
+ }
+
+ /* Is this ib_mr active and registered yet ? */
+ if (umem->odp_data->private == NULL)
+ return 0;
+
+ mlx5_ib_mr = umem->odp_data->private;
+ if (!mlx5_ib_mr->ibmr.pd)
+ return 0;
+
+ mirror = &umem->odp_data->ib_mirror->base;
+ start = max_t(u64, ib_umem_start(umem), start);
+ end = min_t(u64, ib_umem_end(umem), end);
+ hmm_pt_iter_init(&iter, &mirror->pt);
+
+ /*
+ * Iteration one - zap the HW's MTTs. HMM ensures that while we are
+ * doing the invalidation, no page fault will attempt to overwrite the
+ * same MTTs. Concurent invalidations might race us, but they will
+ * write 0s as well, so no difference in the end result.
+ */
+ for (addr = start; addr < end; addr += (u64)umem->page_size) {
+ unsigned long next = end;
+ dma_addr_t *ptep;
+
+ /* Get and lock pointer to mirror page table. */
+ ptep = hmm_pt_iter_walk(&iter, &addr, &next);
+ for (; ptep && addr < next; addr += PAGE_SIZE, ptep++) {
+ idx = (addr - ib_umem_start(umem)) / PAGE_SIZE;
+ /*
+ * Strive to write the MTTs in chunks, but avoid
+ * overwriting non-existing MTTs. The huristic here can
+ * be improved to estimate the cost of another UMR vs.
+ * the cost of bigger UMR.
+ */
+ if ((*ptep) & (ODP_READ_ALLOWED_BIT |
+ ODP_WRITE_ALLOWED_BIT)) {
+ if ((*ptep) & ODP_WRITE_ALLOWED_BIT)
+ hmm_pte_set_dirty(ptep);
+ /*
+ * Because there can not be concurrent overlapping
+ * munmap, page migrate, page write protect then it
+ * is safe here to clear those bits.
+ */
+ hmm_pte_clear_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+ hmm_pte_clear_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+ if (!in_block) {
+ blk_start_idx = idx;
+ in_block = 1;
+ }
+ } else {
+ u64 umr_offset = idx & umr_block_mask;
+
+ if (in_block && umr_offset == 0) {
+ ret = mlx5_ib_update_mtt(mlx5_ib_mr,
+ blk_start_idx,
+ idx - blk_start_idx,
+ 1, &iter) || ret;
+ in_block = 0;
+ }
+ }
+ }
+ }
+ if (in_block)
+ ret = mlx5_ib_update_mtt(mlx5_ib_mr, blk_start_idx,
+ idx - blk_start_idx + 1, 1,
+ &iter) || ret;
+ hmm_pt_iter_fini(&iter);
+ return ret;
+}
+
+static int mlx5_hmm_invalidate_range(struct hmm_mirror *mirror,
+ unsigned long start,
+ unsigned long end)
+{
+ struct ib_mirror *ib_mirror;
+ int ret;
+
+ ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+ /* Go over all memory region and invalidate them. */
+ down_read(&ib_mirror->umem_rwsem);
+ ret = rbt_ib_umem_for_each_in_range(&ib_mirror->umem_tree, start, end,
+ mlx5_ib_umem_invalidate, NULL);
+ up_read(&ib_mirror->umem_rwsem);
+ return ret;
+}
+
+static void mlx5_hmm_release(struct hmm_mirror *mirror)
+{
+ struct ib_mirror *ib_mirror;
+
+ ib_mirror = container_of(mirror, struct ib_mirror, base);
+
+ /* Go over all memory region and invalidate them. */
+ mlx5_hmm_invalidate_range(mirror, 0, ULLONG_MAX);
+}
+
+static void mlx5_hmm_free(struct hmm_mirror *mirror)
+{
+ struct ib_mirror *ib_mirror;
+
+ ib_mirror = container_of(mirror, struct ib_mirror, base);
+ kfree(ib_mirror);
+}
+
+static int mlx5_hmm_update(struct hmm_mirror *mirror,
+ struct hmm_event *event)
+{
+ struct device *device = mirror->device->dev;
+ int ret = 0;
+
+ switch (event->etype) {
+ case HMM_DEVICE_RFAULT:
+ case HMM_DEVICE_WFAULT:
+ /* FIXME implement. */
+ break;
+ case HMM_NONE:
+ default:
+ dev_warn(device, "Warning: unhandled HMM event (%d) defaulting to invalidation\n",
+ event->etype);
+ /* Fallthrough. */
+ /* For write protect and fork we could only invalidate writeable mr. */
+ case HMM_WRITE_PROTECT:
+ case HMM_MIGRATE:
+ case HMM_MUNMAP:
+ case HMM_FORK:
+ ret = mlx5_hmm_invalidate_range(mirror,
+ event->start,
+ event->end);
+ break;
+ }
+
+ return ret;
+}
+
+static const struct hmm_device_ops mlx5_hmm_ops = {
+ .release = &mlx5_hmm_release,
+ .free = &mlx5_hmm_free,
+ .update = &mlx5_hmm_update,
+};
+
+void mlx5_dev_init_odp_hmm(struct ib_device *ib_device, struct device *dev)
+{
+ INIT_LIST_HEAD(&ib_device->ib_mirrors);
+ ib_device->hmm_dev.dev = dev;
+ ib_device->hmm_dev.ops = &mlx5_hmm_ops;
+ ib_device->hmm_ready = !hmm_device_register(&ib_device->hmm_dev);
+ mutex_init(&ib_device->hmm_mutex);
+}
+
+void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
+{
+ if (!ib_device->hmm_ready)
+ return;
+ hmm_device_unregister(&ib_device->hmm_dev);
+}
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 36b72f0..fccbf36 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -133,6 +133,23 @@ struct umem_odp_node *rbt_ib_umem_iter_next(struct umem_odp_node *node,
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+
+/*
+ * HMM have few bits reserved for hardware specific bits inside the mirror page
+ * table. For IB we record the mapping protection per page there.
+ */
+#define ODP_READ_ALLOWED_SHIFT (HMM_PTE_HW_SHIFT + 0)
+#define ODP_WRITE_ALLOWED_SHIFT (HMM_PTE_HW_SHIFT + 1)
+#define ODP_READ_ALLOWED_BIT (1 << ODP_READ_ALLOWED_SHIFT)
+#define ODP_WRITE_ALLOWED_BIT (1 << ODP_WRITE_ALLOWED_SHIFT)
+
+/* Make sure we are not overwritting valid address bit on target arch. */
+#if (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#error (HMM_PTE_HW_SHIFT + 2) > PAGE_SHIFT
+#endif
+
+#define ODP_DMA_ADDR_MASK HMM_PTE_DMA_MASK
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
--
1.9.3
This patch adds HMM-specific support for hardware page faulting of
user memory regions.
Changed since v1:
- Adapt to HMM page table changes.
- Turn some sanity tests into BUG_ON().
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/hw/mlx5/odp.c | 144 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 143 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 5ef31da..658bfca 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -54,6 +54,52 @@ static struct mlx5_ib_mr *mlx5_ib_odp_find_mr_lkey(struct mlx5_ib_dev *dev,
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+struct mlx5_hmm_pfault {
+ struct mlx5_ib_mr *mlx5_ib_mr;
+ u64 start_idx;
+ dma_addr_t access_mask;
+ unsigned npages;
+ struct hmm_event event;
+};
+
+static int mlx5_hmm_pfault(struct mlx5_ib_dev *mlx5_ib_dev,
+ struct hmm_mirror *mirror,
+ const struct hmm_event *event)
+{
+ struct mlx5_hmm_pfault *pfault;
+ struct hmm_pt_iter iter;
+ unsigned long addr, cnt;
+ int ret;
+
+ pfault = container_of(event, struct mlx5_hmm_pfault, event);
+ hmm_pt_iter_init(&iter, &mirror->pt);
+
+ for (addr = event->start, cnt = 0; addr < event->end;
+ addr += PAGE_SIZE, ++cnt) {
+ unsigned long next = event->end;
+ dma_addr_t *ptep;
+
+ /* Get and lock pointer to mirror page table. */
+ ptep = hmm_pt_iter_lookup(&iter, addr, &next);
+ BUG_ON(!ptep);
+ for (; ptep && addr < next; addr += PAGE_SIZE, ptep++) {
+ /* This could be BUG_ON() as it can not happen. */
+ BUG_ON(!hmm_pte_test_valid_dma(ptep));
+ BUG_ON((pfault->access_mask & ODP_WRITE_ALLOWED_BIT) &&
+ !hmm_pte_test_write(ptep));
+ if (hmm_pte_test_write(ptep))
+ hmm_pte_set_bit(ptep, ODP_WRITE_ALLOWED_SHIFT);
+ hmm_pte_set_bit(ptep, ODP_READ_ALLOWED_SHIFT);
+ pfault->npages++;
+ }
+ }
+ ret = mlx5_ib_update_mtt(pfault->mlx5_ib_mr,
+ pfault->start_idx,
+ cnt, 0, &iter);
+ hmm_pt_iter_fini(&iter);
+ return ret;
+}
+
int mlx5_ib_umem_invalidate(struct ib_umem *umem, u64 start,
u64 end, void *cookie)
{
@@ -179,12 +225,19 @@ static int mlx5_hmm_update(struct hmm_mirror *mirror,
struct hmm_event *event)
{
struct device *device = mirror->device->dev;
+ struct mlx5_ib_dev *mlx5_ib_dev;
+ struct ib_device *ib_device;
int ret = 0;
+ ib_device = container_of(mirror->device, struct ib_device, hmm_dev);
+ mlx5_ib_dev = to_mdev(ib_device);
+
switch (event->etype) {
case HMM_DEVICE_RFAULT:
case HMM_DEVICE_WFAULT:
- /* FIXME implement. */
+ ret = mlx5_hmm_pfault(mlx5_ib_dev, mirror, event);
+ if (ret)
+ return ret;
break;
case HMM_NONE:
default:
@@ -227,6 +280,95 @@ void mlx5_dev_fini_odp_hmm(struct ib_device *ib_device)
hmm_device_unregister(&ib_device->hmm_dev);
}
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ * page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ * abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+ struct mlx5_ib_pfault *pfault,
+ u32 key, u64 io_virt, size_t bcnt,
+ u32 *bytes_mapped)
+{
+ struct mlx5_ib_dev *mlx5_ib_dev = to_mdev(qp->ibqp.pd->device);
+ struct ib_mirror *ib_mirror;
+ struct mlx5_hmm_pfault hmm_pfault;
+ int srcu_key;
+ int ret = 0;
+
+ srcu_key = srcu_read_lock(&mlx5_ib_dev->mr_srcu);
+ hmm_pfault.mlx5_ib_mr = mlx5_ib_odp_find_mr_lkey(mlx5_ib_dev, key);
+ /*
+ * If we didn't find the MR, it means the MR was closed while we were
+ * handling the ODP event. In this case we return -EFAULT so that the
+ * QP will be closed.
+ */
+ if (!hmm_pfault.mlx5_ib_mr || !hmm_pfault.mlx5_ib_mr->ibmr.pd) {
+ pr_err("Failed to find relevant mr for lkey=0x%06x, probably the MR was destroyed\n",
+ key);
+ ret = -EFAULT;
+ goto srcu_unlock;
+ }
+ if (!hmm_pfault.mlx5_ib_mr->umem->odp_data) {
+ pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault handler.\n",
+ key);
+ if (bytes_mapped)
+ *bytes_mapped +=
+ (bcnt - pfault->mpfault.bytes_committed);
+ goto srcu_unlock;
+ }
+ if (hmm_pfault.mlx5_ib_mr->ibmr.pd != qp->ibqp.pd) {
+ pr_err("Page-fault with different PDs for QP and MR.\n");
+ ret = -EFAULT;
+ goto srcu_unlock;
+ }
+
+ ib_mirror = hmm_pfault.mlx5_ib_mr->umem->odp_data->ib_mirror;
+ if (ib_mirror->base.hmm == NULL) {
+ /* Somehow the mirror was killed from under us. */
+ ret = -EFAULT;
+ goto srcu_unlock;
+ }
+
+ /*
+ * Avoid branches - this code will perform correctly
+ * in all iterations (in iteration 2 and above,
+ * bytes_committed == 0).
+ */
+ io_virt += pfault->mpfault.bytes_committed;
+ bcnt -= pfault->mpfault.bytes_committed;
+
+ hmm_pfault.npages = 0;
+ hmm_pfault.start_idx = (io_virt - (hmm_pfault.mlx5_ib_mr->mmr.iova &
+ PAGE_MASK)) >> PAGE_SHIFT;
+ hmm_pfault.access_mask = ODP_READ_ALLOWED_BIT;
+ hmm_pfault.access_mask |= hmm_pfault.mlx5_ib_mr->umem->writable ?
+ ODP_WRITE_ALLOWED_BIT : 0;
+ hmm_pfault.event.start = io_virt & PAGE_MASK;
+ hmm_pfault.event.end = PAGE_ALIGN(io_virt + bcnt);
+ hmm_pfault.event.etype = hmm_pfault.mlx5_ib_mr->umem->writable ?
+ HMM_DEVICE_WFAULT : HMM_DEVICE_RFAULT;
+ ret = hmm_mirror_fault(&ib_mirror->base, &hmm_pfault.event);
+
+ if (!ret && hmm_pfault.npages && bytes_mapped) {
+ u32 new_mappings = hmm_pfault.npages * PAGE_SIZE -
+ (io_virt - round_down(io_virt, PAGE_SIZE));
+ *bytes_mapped += min_t(u32, new_mappings, bcnt);
+ }
+
+srcu_unlock:
+ srcu_read_unlock(&mlx5_ib_dev->mr_srcu, srcu_key);
+ pfault->mpfault.bytes_committed = 0;
+ return ret ? ret : hmm_pfault.npages;
+}
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
--
1.9.3
All the pieces are now in place for ODP (on-demand paging) to work using
HMM. Add the kernel option and the final code to enable it.
Changed since v1:
- Added kernel option in this last patch of the series.
Signed-off-by: Jérôme Glisse <[email protected]>
---
drivers/infiniband/Kconfig | 10 ++++++++++
drivers/infiniband/core/uverbs_cmd.c | 3 ---
drivers/infiniband/hw/mlx5/main.c | 4 ++++
3 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index b899531..764f524 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -49,6 +49,16 @@ config INFINIBAND_ON_DEMAND_PAGING
memory regions without pinning their pages, fetching the
pages on demand instead.
+config INFINIBAND_ON_DEMAND_PAGING_HMM
+ bool "InfiniBand on-demand paging support using HMM."
+ depends on HMM
+ depends on INFINIBAND_ON_DEMAND_PAGING
+ default n
+ ---help---
+ Use the HMM (heterogeneous memory management) kernel API for
+ on-demand paging. There is no userspace-visible difference;
+ this is just an alternative implementation of the feature.
+
config INFINIBAND_ADDR_TRANS
bool
depends on INFINIBAND
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 1db6a17..c3e14a8 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -3445,8 +3445,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
goto end;
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
-#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
-#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
resp.odp_caps.general_caps = attr.odp_caps.general_caps;
resp.odp_caps.per_transport_caps.rc_odp_caps =
attr.odp_caps.per_transport_caps.rc_odp_caps;
@@ -3455,7 +3453,6 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file,
resp.odp_caps.per_transport_caps.ud_odp_caps =
attr.odp_caps.per_transport_caps.ud_odp_caps;
resp.odp_caps.reserved = 0;
-#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
memset(&resp.odp_caps, 0, sizeof(resp.odp_caps));
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 32ed2f1..c340c3a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -295,6 +295,10 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
+ if (MLX5_CAP_GEN(mdev, pg) && ibdev->hmm_ready) {
+ props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+ props->odp_caps = dev->odp_caps;
+ }
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
if (MLX5_CAP_GEN(mdev, pg))
props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
--
1.9.3
On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote:
> +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
> +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
Yuk, what is wrong with
#if !IS_ENABLED(...)
?
> -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
> +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
> +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
Double yuk
#if !(IS_ENABLED(..) && IS_ENABLED(..))
?
And the #ifdefs suck; as many as possible should be normal if
statements, and one should think carefully about whether we really need
to remove fields from structures.
Jason
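(For comparison, a minimal sketch of the two styles being contrasted here; the
config symbol comes from the patches above, while fill_odp_caps() is only a
hypothetical placeholder for whatever code sits in the branch.)

/* Pattern used in the patch: an intentionally empty #if branch. */
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	fill_odp_caps(&resp);	/* hypothetical placeholder for the non-HMM code */
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */

/* Style suggested above: negate the condition and drop the empty branch. */
#if !IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
	fill_odp_caps(&resp);	/* same hypothetical placeholder */
#endif /* !CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */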
On Thu, Aug 13, 2015 at 02:13:35PM -0600, Jason Gunthorpe wrote:
> On Thu, Aug 13, 2015 at 03:20:49PM -0400, Jérôme Glisse wrote:
>
> > +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
> > +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
>
> Yuk, what is wrong with
>
> #if !IS_ENABLED(...)
>
> ?
Just that later patches add code between the #if and #else, and that
originally it was a bigger patch which added the #if code and the #else
at the same time. That is why this patch looks like this.
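(A minimal before/after sketch of what is being described; the branch bodies
are taken from the mlx5_ib_query_device() hunk in the final patch above.)

/* Earlier patch: the HMM branch is left empty on purpose. */
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	if (MLX5_CAP_GEN(mdev, pg))
		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */

/* A later patch in the series fills the empty branch. */
#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
	if (MLX5_CAP_GEN(mdev, pg) && ibdev->hmm_ready) {
		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
		props->odp_caps = dev->odp_caps;
	}
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
	if (MLX5_CAP_GEN(mdev, pg))
		props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */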
>
> > -#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> > +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)
> > +#if IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM)
> > +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING_HMM */
>
> Double yuk
>
> #if !(IS_ENABLED(..) && IS_ENABLED(..))
>
> ?
Same reason as above.
> And the #ifdefs suck, as many as possible should be normal if
> statements, and one should think carefully if we really need to remove
> fields from structures..
My patch only adds #if; I am not responsible for the previous code
that used #ifdef. I was told to convert to #if, and that is what
I am doing.
Regarding the fields, yes, this is intentional. ODP is an infrastructure
that is private to InfiniBand and thus needs extra fields inside the ib
structs, while HMM is intended to be a common infrastructure, not only
for IB devices but for other kinds of devices too.
Cheers,
Jérôme