Subject: [PATCH for-next v5 0/7] On-Demand Paging on SoftRoCE

This patch series implements the On-Demand Paging (ODP) feature in the
SoftRoCE (rxe) driver, which has so far been available only in the mlx5
driver[1].

There had been an obstacle to this series, but it has finally been
resolved. Commit 9b4b7c1f9f54 ("RDMA/rxe: Add workqueue support for rxe
tasks") replaced the three tasklets with a workqueue, and the ODP patches
are now ready to be merged on top of it.

Some content, such as the motivation behind this series, is omitted from
this cover letter. Please see the cover letter of v3 for more details[2].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
the pages in the MR so that their physical addresses never change during
RDMA communication. This requires the MR to fit in physical memory and
inevitably leads to memory pressure. On-Demand Paging (ODP), in contrast,
allows applications to register MRs without pinning pages. Pages are paged
in when the driver needs them and paged out when the OS reclaims them. As a
result, a large MR that does not fit in physical memory can be registered
without consuming much of it.
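
For reference, registering such an MR from userspace looks roughly like the
sketch below. This is only an illustration and not part of this series;
'pd', 'buf', and 'len' are assumed to already exist, and error handling is
omitted.

#include <infiniband/verbs.h>

/* Minimal sketch: register an ODP-enabled MR on an existing PD. */
static struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
{
        int access = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                     IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_ON_DEMAND;

        /* With IBV_ACCESS_ON_DEMAND, the pages are not pinned here. */
        return ibv_reg_mr(pd, buf, len, access);
}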

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR on its registration. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table. DMA addresses and
PFNs are stored in the driver page table. They are updated on page-in and
page-out, both of which use the common interfaces in the ib_uverbs layer.

Page-in can occur when the requester, responder, or completer accesses an
MR to process RDMA operations. If they find that the pages being accessed
are not present in physical memory, or that the requisite permissions are
not set on them, they trigger a page fault to make the pages present with
the proper permissions and update the driver page table at the same time.
After confirming the presence of the pages, they perform the memory access,
i.e. read, write, or atomic operations.
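
Below is a simplified sketch of that page-in path, condensed from the
rxe_odp.c code added later in this series; the permission re-check and
retry loop of the real code are omitted, and the access itself is reduced
to a comment.

#include <rdma/ib_umem_odp.h>

/*
 * Simplified page-in sketch, condensed from rxe_odp_do_pagefault() and
 * rxe_odp_mr_copy() in this series; retries and most error handling are
 * omitted.
 */
static int odp_page_in_and_access(struct ib_umem_odp *umem_odp, u64 iova,
                                  u64 length, bool writable)
{
        u64 access_mask = ODP_READ_ALLOWED_BIT;
        int np;

        if (writable)
                access_mask |= ODP_WRITE_ALLOWED_BIT;

        /* Faults the pages in; returns with umem_mutex held on success. */
        np = ib_umem_odp_map_dma_and_lock(umem_odp, iova, length,
                                          access_mask, true);
        if (np < 0)
                return np;

        /*
         * The pages are now present with the requested permissions, and
         * the MMU notifier cannot invalidate them while umem_mutex is
         * held, so the read/write/atomic access is performed at this point.
         */

        mutex_unlock(&umem_odp->umem_mutex);
        return 0;
}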

Page-out is triggered by page reclaim or filesystem events (e.g. metadata
update of a file that is being used as an MR). When creating an ODP-enabled
MR, the driver registers an MMU notifier callback. When the kernel issues a
page invalidation notification, the callback is invoked to unmap DMA
addresses and update the driver page table. After that, the kernel releases
the pages.
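
The shape of that callback, condensed from the rxe_odp.c code added in
patch 4/7, is sketched below; the driver-specific part (clearing the cached
page pointers) is reduced to a comment.

#include <linux/hmm.h>
#include <rdma/ib_umem_odp.h>

/* Condensed from rxe_ib_invalidate_range() added in patch 4/7. */
static bool odp_invalidate_range(struct mmu_interval_notifier *mni,
                                 const struct mmu_notifier_range *range,
                                 unsigned long cur_seq)
{
        struct ib_umem_odp *umem_odp =
                container_of(mni, struct ib_umem_odp, notifier);
        unsigned long start, end;

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&umem_odp->umem_mutex);
        mmu_interval_set_seq(mni, cur_seq);

        start = max_t(u64, ib_umem_start(umem_odp), range->start);
        end = min_t(u64, ib_umem_end(umem_odp), range->end);

        /* Driver-specific: drop cached page pointers for [start, end) here. */

        /* Unmap the DMA addresses and update umem_odp->dma_list. */
        ib_umem_odp_unmap_dma_pages(umem_odp, start, end);

        mutex_unlock(&umem_odp->umem_mutex);
        return true;
}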

[Supported operations]
All traditional operations are supported on RC connections. The new Atomic
Write[3] and RDMA Flush[4] operations are not included in this patchset; I
will post them after this patchset is merged. On UD connections, Send,
Recv, and SRQ-Recv are supported.

[How to test ODP?]
There are only a few resources available for testing. The pyverbs testcases
in rdma-core and perftest[5] are the recommended ones. In addition, the
ibv_rc_pingpong command can also be used for testing. Note that you may
have to build perftest from the upstream repository because older versions
do not handle ODP capabilities correctly.
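
For reference, the capability check that a test tool is expected to perform
looks roughly like the sketch below; 'ctx' is assumed to be an open device
context. This is essentially what "handling ODP capabilities" refers to.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: check whether a device reports ODP support for RC RDMA Write. */
static int check_odp_caps(struct ibv_context *ctx)
{
        struct ibv_device_attr_ex attr;

        if (ibv_query_device_ex(ctx, NULL, &attr))
                return -1;

        if (!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
                fprintf(stderr, "ODP is not supported\n");
                return -1;
        }

        if (!(attr.odp_caps.per_transport_caps.rc_odp_caps &
              IBV_ODP_SUPPORT_WRITE)) {
                fprintf(stderr, "RC RDMA Write with ODP is not supported\n");
                return -1;
        }

        return 0;
}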

The ODP tree is available from github:
https://github.com/daimatsuda/linux/tree/odp_v5

[Future work]
My next task is to enable the new Atomic Write[3] and RDMA Flush[4]
operations with ODP. After that, I am going to implement the prefetch
feature, which allows applications to trigger page faults in advance using
ibv_advise_mr(3) to optimize performance. Some existing software, such as
librpma[6], uses this feature. Additionally, I think we can add the
implicit ODP feature in the future.
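
For illustration, an explicit prefetch request from userspace would look
roughly like the sketch below; 'pd' and 'mr' are assumed to already exist,
and the range must lie within the MR. rxe does not support this verb yet;
the sketch only shows the interface that the prefetch feature would serve.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: ask the driver to prefetch (fault in) part of an ODP MR. */
static int prefetch_range(struct ibv_pd *pd, struct ibv_mr *mr,
                          void *addr, uint32_t length)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)addr,
                .length = length,
                .lkey   = mr->lkey,
        };

        return ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                             IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);
}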

[1] [RFC 00/20] On demand paging
https://www.spinics.net/lists/linux-rdma/msg18906.html

[2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
https://lore.kernel.org/lkml/[email protected]/

[3] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/linux-rdma/[email protected]/

[4] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/[email protected]/

[5] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

[6] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

v4->v5:
1) Rebased to 6.4.0-rc2+
2) Changed responder and completer to schedule all their work to the workqueue

v3->v4:
1) Re-designed functions that access MRs to use the MR xarray.
2) Rebased onto the latest jgg-for-next tree.

v2->v3:
1) Removed a patch that changes the common ib_uverbs layer.
2) Re-implemented patches for conversion to workqueue.
3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
4) Fixed some functions that returned incorrect errors.
5) Temporarily disabled ODP for RDMA Flush and Atomic Write.

v1->v2:
1) Fixed a crash issue reported by Haris Iqbal.
2) Tried to make lock patterns clearer, as pointed out by Romanovsky.
3) Minor clean ups and fixes.

Daisuke Matsuda (7):
RDMA/rxe: Always defer tasks on responder and completer to workqueue
RDMA/rxe: Make MR functions accessible from other rxe source code
RDMA/rxe: Move resp_states definition to rxe_verbs.h
RDMA/rxe: Add page invalidation support
RDMA/rxe: Allow registering MRs for On-Demand Paging
RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
RDMA/rxe: Add support for the traditional Atomic operations with ODP

drivers/infiniband/sw/rxe/Makefile | 2 +
drivers/infiniband/sw/rxe/rxe.c | 18 ++
drivers/infiniband/sw/rxe/rxe.h | 37 ---
drivers/infiniband/sw/rxe/rxe_comp.c | 12 +-
drivers/infiniband/sw/rxe/rxe_hw_counters.c | 1 -
drivers/infiniband/sw/rxe/rxe_hw_counters.h | 1 -
drivers/infiniband/sw/rxe/rxe_loc.h | 45 +++
drivers/infiniband/sw/rxe/rxe_mr.c | 27 +-
drivers/infiniband/sw/rxe/rxe_odp.c | 311 ++++++++++++++++++++
drivers/infiniband/sw/rxe/rxe_resp.c | 31 +-
drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
drivers/infiniband/sw/rxe/rxe_verbs.h | 39 +++
12 files changed, 447 insertions(+), 82 deletions(-)
create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

base-commit: a7dae5daf4bf50de01ebdd192bf52c2e8cd80c75

--
2.39.1



Subject: [PATCH for-next v5 3/7] RDMA/rxe: Move resp_states definition to rxe_verbs.h

To use the resp_states values in rxe_loc.h, it is necessary to move the
definition to rxe_verbs.h, where other internal states of this driver are
defined.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe.h | 37 ---------------------------
drivers/infiniband/sw/rxe/rxe_verbs.h | 37 +++++++++++++++++++++++++++
2 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.h b/drivers/infiniband/sw/rxe/rxe.h
index d33dd6cf83d3..9b4d044a1264 100644
--- a/drivers/infiniband/sw/rxe/rxe.h
+++ b/drivers/infiniband/sw/rxe/rxe.h
@@ -100,43 +100,6 @@
#define rxe_info_mw(mw, fmt, ...) ibdev_info_ratelimited((mw)->ibmw.device, \
"mw#%d %s: " fmt, (mw)->elem.index, __func__, ##__VA_ARGS__)

-/* responder states */
-enum resp_states {
- RESPST_NONE,
- RESPST_GET_REQ,
- RESPST_CHK_PSN,
- RESPST_CHK_OP_SEQ,
- RESPST_CHK_OP_VALID,
- RESPST_CHK_RESOURCE,
- RESPST_CHK_LENGTH,
- RESPST_CHK_RKEY,
- RESPST_EXECUTE,
- RESPST_READ_REPLY,
- RESPST_ATOMIC_REPLY,
- RESPST_ATOMIC_WRITE_REPLY,
- RESPST_PROCESS_FLUSH,
- RESPST_COMPLETE,
- RESPST_ACKNOWLEDGE,
- RESPST_CLEANUP,
- RESPST_DUPLICATE_REQUEST,
- RESPST_ERR_MALFORMED_WQE,
- RESPST_ERR_UNSUPPORTED_OPCODE,
- RESPST_ERR_MISALIGNED_ATOMIC,
- RESPST_ERR_PSN_OUT_OF_SEQ,
- RESPST_ERR_MISSING_OPCODE_FIRST,
- RESPST_ERR_MISSING_OPCODE_LAST_C,
- RESPST_ERR_MISSING_OPCODE_LAST_D1E,
- RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
- RESPST_ERR_RNR,
- RESPST_ERR_RKEY_VIOLATION,
- RESPST_ERR_INVALIDATE_RKEY,
- RESPST_ERR_LENGTH,
- RESPST_ERR_CQ_OVERFLOW,
- RESPST_ERROR,
- RESPST_DONE,
- RESPST_EXIT,
-};
-
void rxe_set_mtu(struct rxe_dev *rxe, unsigned int dev_mtu);

int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 26a20f088692..b6fbd9b3d086 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -127,6 +127,43 @@ struct rxe_comp_info {
struct rxe_task task;
};

+/* responder states */
+enum resp_states {
+ RESPST_NONE,
+ RESPST_GET_REQ,
+ RESPST_CHK_PSN,
+ RESPST_CHK_OP_SEQ,
+ RESPST_CHK_OP_VALID,
+ RESPST_CHK_RESOURCE,
+ RESPST_CHK_LENGTH,
+ RESPST_CHK_RKEY,
+ RESPST_EXECUTE,
+ RESPST_READ_REPLY,
+ RESPST_ATOMIC_REPLY,
+ RESPST_ATOMIC_WRITE_REPLY,
+ RESPST_PROCESS_FLUSH,
+ RESPST_COMPLETE,
+ RESPST_ACKNOWLEDGE,
+ RESPST_CLEANUP,
+ RESPST_DUPLICATE_REQUEST,
+ RESPST_ERR_MALFORMED_WQE,
+ RESPST_ERR_UNSUPPORTED_OPCODE,
+ RESPST_ERR_MISALIGNED_ATOMIC,
+ RESPST_ERR_PSN_OUT_OF_SEQ,
+ RESPST_ERR_MISSING_OPCODE_FIRST,
+ RESPST_ERR_MISSING_OPCODE_LAST_C,
+ RESPST_ERR_MISSING_OPCODE_LAST_D1E,
+ RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
+ RESPST_ERR_RNR,
+ RESPST_ERR_RKEY_VIOLATION,
+ RESPST_ERR_INVALIDATE_RKEY,
+ RESPST_ERR_LENGTH,
+ RESPST_ERR_CQ_OVERFLOW,
+ RESPST_ERROR,
+ RESPST_DONE,
+ RESPST_EXIT,
+};
+
enum rdatm_res_state {
rdatm_res_state_next,
rdatm_res_state_new,
--
2.39.1


Subject: [PATCH for-next v5 1/7] RDMA/rxe: Always defer tasks on responder and completer to workqueue

Both the responder and the completer need to sleep to handle page faults
when used with ODP. This can happen when they access user MRs, so their
tasks must be executed in process context in such cases.

Additionally, the current implementation seldom defers tasks to the
workqueue, but instead runs them in a softirq context via do_task(). It
is called from rxe_resp_queue_pkt() and rxe_comp_queue_pkt() in
SOFTIRQ_NET_RX context and can loop for up to RXE_MAX_ITERATIONS (=1024)
iterations. The problem is that the task execution appears as anonymous
load in the system and that the loop can throttle other softirqs on the
same CPU.

This patch makes the responder and completer code run in process context
to support ODP and to address the problem described above.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe_comp.c | 12 +-----------
drivers/infiniband/sw/rxe/rxe_hw_counters.c | 1 -
drivers/infiniband/sw/rxe/rxe_hw_counters.h | 1 -
drivers/infiniband/sw/rxe/rxe_resp.c | 13 +------------
4 files changed, 2 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index db18ace74d2b..671fdb645030 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -128,18 +128,8 @@ void retransmit_timer(struct timer_list *t)

void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
{
- int must_sched;
-
skb_queue_tail(&qp->resp_pkts, skb);
-
- must_sched = skb_queue_len(&qp->resp_pkts) > 1;
- if (must_sched != 0)
- rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
-
- if (must_sched)
- rxe_sched_task(&qp->comp.task);
- else
- rxe_run_task(&qp->comp.task);
+ rxe_sched_task(&qp->comp.task);
}

static inline enum comp_state get_wqe(struct rxe_qp *qp,
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.c b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
index a012522b577a..dc23cf3a6967 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.c
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
@@ -14,7 +14,6 @@ static const struct rdma_stat_desc rxe_counter_descs[] = {
[RXE_CNT_RCV_RNR].name = "rcvd_rnr_err",
[RXE_CNT_SND_RNR].name = "send_rnr_err",
[RXE_CNT_RCV_SEQ_ERR].name = "rcvd_seq_err",
- [RXE_CNT_COMPLETER_SCHED].name = "ack_deferred",
[RXE_CNT_RETRY_EXCEEDED].name = "retry_exceeded_err",
[RXE_CNT_RNR_RETRY_EXCEEDED].name = "retry_rnr_exceeded_err",
[RXE_CNT_COMP_RETRY].name = "completer_retry_err",
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.h b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
index 71f4d4fa9dc8..303da0e3134a 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.h
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
@@ -18,7 +18,6 @@ enum rxe_counters {
RXE_CNT_RCV_RNR,
RXE_CNT_SND_RNR,
RXE_CNT_RCV_SEQ_ERR,
- RXE_CNT_COMPLETER_SCHED,
RXE_CNT_RETRY_EXCEEDED,
RXE_CNT_RNR_RETRY_EXCEEDED,
RXE_CNT_COMP_RETRY,
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 68f6cd188d8e..ba0222bfce9e 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -46,21 +46,10 @@ static char *resp_state_name[] = {
[RESPST_EXIT] = "EXIT",
};

-/* rxe_recv calls here to add a request packet to the input queue */
void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
{
- int must_sched;
- struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
-
skb_queue_tail(&qp->req_pkts, skb);
-
- must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
- (skb_queue_len(&qp->req_pkts) > 1);
-
- if (must_sched)
- rxe_sched_task(&qp->resp.task);
- else
- rxe_run_task(&qp->resp.task);
+ rxe_sched_task(&qp->resp.task);
}

static inline enum resp_states get_req(struct rxe_qp *qp,
--
2.39.1


Subject: [PATCH for-next v5 7/7] RDMA/rxe: Add support for the traditional Atomic operations with ODP

Enable the 'fetch and add' and 'compare and swap' operations to manipulate
data in an ODP-enabled MR. This consists of the following steps:
1. Check the driver page table (umem_odp->dma_list) to see if the target
   page is both readable and writable.
2. If not, trigger a page fault to map the page.
3. Update the entry in the MR xarray.
4. Execute the operation.

umem_mutex is used to ensure that dma_list (an array of DMA addresses of
the MR) is not changed while it is being checked and that the target page
is not invalidated before the data access completes.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe.c | 1 +
drivers/infiniband/sw/rxe/rxe_loc.h | 9 +++++++++
drivers/infiniband/sw/rxe/rxe_odp.c | 26 ++++++++++++++++++++++++++
drivers/infiniband/sw/rxe/rxe_resp.c | 5 ++++-
4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 207a022156f0..abd3267c2873 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -88,6 +88,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
}
}
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 4b95c8c46bdc..b9d2985774ee 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -208,6 +208,9 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
u64 iova, int access_flags, struct rxe_mr *mr);
int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
enum rxe_mr_copy_dir dir);
+int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+ u64 compare, u64 swap_add, u64 *orig_val);
+
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline int
rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -221,6 +224,12 @@ rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
{
return -EOPNOTSUPP;
}
+static inline int
+rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+ u64 compare, u64 swap_add, u64 *orig_val)
+{
+ return RESPST_ERR_UNSUPPORTED_OPCODE;
+}

#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */

diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index cbe5d0c3fcc4..194b1fab98b7 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -283,3 +283,29 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,

return err;
}
+
+int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+ u64 compare, u64 swap_add, u64 *orig_val)
+{
+ int err;
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+ /* If pagefault is not required, umem mutex will be held until the
+ * atomic operation completes. Otherwise, it is released and locked
+ * again in rxe_odp_map_range() to let invalidation handler do its
+ * work meanwhile.
+ */
+ mutex_lock(&umem_odp->umem_mutex);
+
+ /* Atomic operations manipulate a single char. */
+ err = rxe_odp_map_range(mr, iova, sizeof(char), 0);
+ if (err)
+ return err;
+
+ err = rxe_mr_do_atomic_op(mr, iova, opcode, compare,
+ swap_add, orig_val);
+
+ mutex_unlock(&umem_odp->umem_mutex);
+
+ return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 90c31c4f2944..0a918145dc07 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -684,7 +684,10 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
u64 iova = qp->resp.va + qp->resp.offset;

if (mr->odp_enabled)
- err = RESPST_ERR_UNSUPPORTED_OPCODE;
+ err = rxe_odp_mr_atomic_op(mr, iova, pkt->opcode,
+ atmeth_comp(pkt),
+ atmeth_swap_add(pkt),
+ &res->atomic.orig_val);
else
err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
atmeth_comp(pkt),
--
2.39.1


Subject: [PATCH for-next v5 2/7] RDMA/rxe: Make MR functions accessible from other rxe source code

Some functions in rxe_mr.c are going to be used in rxe_odp.c, which will
be created in a subsequent patch. Declare these functions in rxe_loc.h so
that they can be referenced from other rxe source files.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe_loc.h | 14 ++++++++++++++
drivers/infiniband/sw/rxe/rxe_mr.c | 18 ++++--------------
2 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 804b15e929dd..00fedd1a4980 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -60,7 +60,9 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);

/* rxe_mr.c */
u8 rxe_get_next_key(u32 last_key);
+void rxe_mr_init(int access, struct rxe_mr *mr);
void rxe_mr_init_dma(int access, struct rxe_mr *mr);
+int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt);
int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
int access, struct rxe_mr *mr);
int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
@@ -71,6 +73,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
void *addr, int length, enum rxe_mr_copy_dir dir);
int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
int sg_nents, unsigned int *sg_offset);
+int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+ unsigned int length, enum rxe_mr_copy_dir dir);
int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
u64 compare, u64 swap_add, u64 *orig_val);
int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
@@ -82,6 +86,16 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
void rxe_mr_cleanup(struct rxe_pool_elem *elem);

+static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
+{
+ return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
+}
+
+static inline unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
+{
+ return iova & (mr_page_size(mr) - 1);
+}
+
/* rxe_mw.c */
int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
int rxe_dealloc_mw(struct ib_mw *ibmw);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 0e538fafcc20..ffbac6f5e828 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -49,7 +49,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
| IB_ACCESS_REMOTE_WRITE \
| IB_ACCESS_REMOTE_ATOMIC)

-static void rxe_mr_init(int access, struct rxe_mr *mr)
+void rxe_mr_init(int access, struct rxe_mr *mr)
{
u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1);
u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
@@ -77,16 +77,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
mr->ibmr.type = IB_MR_TYPE_DMA;
}

-static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
-{
- return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
-}
-
-static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
-{
- return iova & (mr_page_size(mr) - 1);
-}
-
static bool is_pmem_page(struct page *pg)
{
unsigned long paddr = page_to_phys(pg);
@@ -96,7 +86,7 @@ static bool is_pmem_page(struct page *pg)
IORES_DESC_PERSISTENT_MEMORY);
}

-static int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
+int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
{
XA_STATE(xas, &mr->page_list, 0);
struct sg_page_iter sg_iter;
@@ -247,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
}

-static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
- unsigned int length, enum rxe_mr_copy_dir dir)
+int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+ unsigned int length, enum rxe_mr_copy_dir dir)
{
unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
unsigned long index = rxe_mr_iova_to_index(mr, iova);
--
2.39.1


Subject: [PATCH for-next v5 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

Allow applications to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA
operations are supported yet; they will be enabled in the subsequent two
patches.

rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. When
called with the RXE_PAGEFAULT_SNAPSHOT flag, it syncs the process address
space from the CPU page table to the driver page table (dma_list/pfn_list
in umem_odp). Additionally, it can be used to trigger a page fault when the
pages being accessed are not present or do not have the proper read/write
permissions, and possibly to prefetch pages in the future.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe.c | 7 ++
drivers/infiniband/sw/rxe/rxe_loc.h | 14 +++
drivers/infiniband/sw/rxe/rxe_mr.c | 9 +-
drivers/infiniband/sw/rxe/rxe_odp.c | 120 ++++++++++++++++++++++++++
drivers/infiniband/sw/rxe/rxe_resp.c | 15 +++-
drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
drivers/infiniband/sw/rxe/rxe_verbs.h | 2 +
7 files changed, 166 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 54c723a6edda..f2284d27229b 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
rxe->ndev->dev_addr);

rxe->max_ucontext = RXE_MAX_UCONTEXT;
+
+ if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
+ rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
+
+ /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
+ rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+ }
}

/* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 00fedd1a4980..93247d123642 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -202,4 +202,18 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
}

+/* rxe_odp.c */
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+ u64 iova, int access_flags, struct rxe_mr *mr);
+#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+static inline int
+rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
+ int access_flags, struct rxe_mr *mr)
+{
+ return -EOPNOTSUPP;
+}
+
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
#endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index ffbac6f5e828..cd368cd096c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -318,7 +318,10 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
return err;
}

- return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
+ if (mr->odp_enabled)
+ return -EOPNOTSUPP;
+ else
+ return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
}

/* copy data in or out of a wqe, i.e. sg list
@@ -527,6 +530,10 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
struct page *page;
u64 *va;

+ /* ODP is not supported right now. WIP. */
+ if (mr->odp_enabled)
+ return RESPST_ERR_UNSUPPORTED_OPCODE;
+
/* See IBA oA19-28 */
if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
rxe_dbg_mr(mr, "mr not in valid state");
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index b69b25e0fef6..e5497d09c399 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -24,6 +24,24 @@ static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
spin_unlock(&mr->page_list.xa_lock);
}

+static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
+ unsigned long end, unsigned long *pfn_list)
+{
+ unsigned long lower, upper, idx;
+ struct page *page;
+
+ lower = rxe_mr_iova_to_index(mr, start);
+ upper = rxe_mr_iova_to_index(mr, end);
+
+ /* make pages visible in xarray. no sleep while taking the lock */
+ spin_lock(&mr->page_list.xa_lock);
+ for (idx = lower; idx <= upper; idx++) {
+ page = hmm_pfn_to_page(pfn_list[idx]);
+ __xa_store(&mr->page_list, idx, page, GFP_ATOMIC);
+ }
+ spin_unlock(&mr->page_list.xa_lock);
+}
+
static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
const struct mmu_notifier_range *range,
unsigned long cur_seq)
@@ -54,3 +72,105 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
const struct mmu_interval_notifier_ops rxe_mn_ops = {
.invalidate = rxe_ib_invalidate_range,
};
+
+#define RXE_PAGEFAULT_RDONLY BIT(1)
+#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
+static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+{
+ int np;
+ u64 access_mask;
+ bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+ access_mask = ODP_READ_ALLOWED_BIT;
+ if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
+ access_mask |= ODP_WRITE_ALLOWED_BIT;
+
+ /*
+ * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
+ * Callers must release the lock later to let invalidation handler
+ * do its work again.
+ */
+ np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
+ access_mask, fault);
+ if (np < 0)
+ return np;
+
+ /* umem_mutex is still locked here, so we can use hmm_pfn_to_page()
+ * safely to fetch pages in the range.
+ */
+ rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
+
+ return np;
+}
+
+static int rxe_odp_init_pages(struct rxe_mr *mr)
+{
+ int ret;
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+ ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
+ RXE_PAGEFAULT_SNAPSHOT);
+
+ if (ret >= 0)
+ mutex_unlock(&umem_odp->umem_mutex);
+
+ return ret >= 0 ? 0 : ret;
+}
+
+int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
+ u64 iova, int access_flags, struct rxe_mr *mr)
+{
+ int err;
+ struct ib_umem_odp *umem_odp;
+
+ if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
+ return -EOPNOTSUPP;
+
+ rxe_mr_init(access_flags, mr);
+
+ xa_init(&mr->page_list);
+
+ if (!start && length == U64_MAX) {
+ if (iova != 0)
+ return -EINVAL;
+ if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
+ return -EINVAL;
+
+ /* Never reach here, for implicit ODP is not implemented. */
+ }
+
+ umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
+ &rxe_mn_ops);
+ if (IS_ERR(umem_odp)) {
+ rxe_dbg_mr(mr, "Unable to create umem_odp err = %d\n",
+ (int)PTR_ERR(umem_odp));
+ return PTR_ERR(umem_odp);
+ }
+
+ umem_odp->private = mr;
+
+ mr->odp_enabled = true;
+ mr->umem = &umem_odp->umem;
+ mr->access = access_flags;
+ mr->ibmr.length = length;
+ mr->ibmr.iova = iova;
+ mr->page_offset = ib_umem_offset(&umem_odp->umem);
+
+ err = rxe_odp_init_pages(mr);
+ if (err) {
+ ib_umem_odp_release(umem_odp);
+ return err;
+ }
+
+ err = rxe_mr_fill_pages_from_sgt(mr, &umem_odp->umem.sgt_append.sgt);
+ if (err) {
+ ib_umem_odp_release(umem_odp);
+ return err;
+ }
+
+ mr->state = RXE_MR_STATE_VALID;
+ mr->ibmr.type = IB_MR_TYPE_USER;
+
+ return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index ba0222bfce9e..90c31c4f2944 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -626,6 +626,10 @@ static enum resp_states process_flush(struct rxe_qp *qp,
struct rxe_mr *mr = qp->resp.mr;
struct resp_res *res = qp->resp.res;

+ /* ODP is not supported right now. WIP. */
+ if (mr->odp_enabled)
+ return RESPST_ERR_UNSUPPORTED_OPCODE;
+
/* oA19-14, oA19-15 */
if (res && res->replay)
return RESPST_ACKNOWLEDGE;
@@ -679,10 +683,13 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
if (!res->replay) {
u64 iova = qp->resp.va + qp->resp.offset;

- err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
- atmeth_comp(pkt),
- atmeth_swap_add(pkt),
- &res->atomic.orig_val);
+ if (mr->odp_enabled)
+ err = RESPST_ERR_UNSUPPORTED_OPCODE;
+ else
+ err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
+ atmeth_comp(pkt),
+ atmeth_swap_add(pkt),
+ &res->atomic.orig_val);
if (err)
return err;

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index dea605b7f683..9c23defdc7b5 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -1274,7 +1274,10 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
mr->ibmr.pd = ibpd;
mr->ibmr.device = ibpd->device;

- err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
+ if (access & IB_ACCESS_ON_DEMAND)
+ err = rxe_odp_mr_init_user(rxe, start, length, iova, access, mr);
+ else
+ err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
if (err) {
rxe_dbg_mr(mr, "reg_user_mr failed, err = %d", err);
goto err_cleanup;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index b6fbd9b3d086..de5a982c7c7e 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -333,6 +333,8 @@ struct rxe_mr {
u32 nbuf;

struct xarray page_list;
+
+ bool odp_enabled;
};

static inline unsigned int mr_page_size(struct rxe_mr *mr)
--
2.39.1


Subject: [PATCH for-next v5 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP

rxe_mr_copy() is used widely to copy data to/from a user MR. The requester
uses it to load the payloads of request packets; the responder uses it to
process Send, Write, and Read operations; the completer uses it to copy
data from response packets of Read and Atomic operations to a user MR.

Allow these operations to be used with ODP by adding a subordinate function
rxe_odp_mr_copy(). It consists of the following steps:
1. Check the driver page table (umem_odp->dma_list) to see if the pages
   being accessed are present with the appropriate permissions.
2. If necessary, trigger a page fault to map the pages.
3. Update the MR xarray using the PFNs in umem_odp->pfn_list.
4. Execute the data copy to/from the pages.

umem_mutex is used to ensure that dma_list (an array of DMA addresses of
the MR) is not changed while it is being checked and that the mapped pages
are not invalidated before the data copy completes.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/rxe.c | 10 +++
drivers/infiniband/sw/rxe/rxe_loc.h | 8 ++
drivers/infiniband/sw/rxe/rxe_mr.c | 2 +-
drivers/infiniband/sw/rxe/rxe_odp.c | 109 ++++++++++++++++++++++++++++
4 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index f2284d27229b..207a022156f0 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -79,6 +79,16 @@ static void rxe_init_device_param(struct rxe_dev *rxe)

/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+
+ rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
+ rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV;
+ rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
+
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+ rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
}
}

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 93247d123642..4b95c8c46bdc 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -206,6 +206,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
u64 iova, int access_flags, struct rxe_mr *mr);
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+ enum rxe_mr_copy_dir dir);
#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
static inline int
rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
@@ -213,6 +215,12 @@ rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
{
return -EOPNOTSUPP;
}
+static inline int
+rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+ int length, enum rxe_mr_copy_dir dir)
+{
+ return -EOPNOTSUPP;
+}

#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index cd368cd096c8..0e3cda59d702 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -319,7 +319,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
}

if (mr->odp_enabled)
- return -EOPNOTSUPP;
+ return rxe_odp_mr_copy(mr, iova, addr, length, dir);
else
return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
}
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index e5497d09c399..cbe5d0c3fcc4 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -174,3 +174,112 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,

return err;
}
+
+static inline bool rxe_is_pagefault_neccesary(struct ib_umem_odp *umem_odp,
+ u64 iova, int length, u32 perm)
+{
+ int idx;
+ u64 addr;
+ bool need_fault = false;
+
+ addr = iova & (~(BIT(umem_odp->page_shift) - 1));
+
+ /* Skim through all pages that are to be accessed. */
+ while (addr < iova + length) {
+ idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+
+ if (!(umem_odp->dma_list[idx] & perm)) {
+ need_fault = true;
+ break;
+ }
+
+ addr += BIT(umem_odp->page_shift);
+ }
+ return need_fault;
+}
+
+/* umem mutex must be locked before entering this function. */
+static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
+{
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ const int max_tries = 3;
+ int cnt = 0;
+
+ int err;
+ u64 perm;
+ bool need_fault;
+
+ if (unlikely(length < 1)) {
+ mutex_unlock(&umem_odp->umem_mutex);
+ return -EINVAL;
+ }
+
+ perm = ODP_READ_ALLOWED_BIT;
+ if (!(flags & RXE_PAGEFAULT_RDONLY))
+ perm |= ODP_WRITE_ALLOWED_BIT;
+
+ /*
+ * A successful return from rxe_odp_do_pagefault() does not guarantee
+ * that all pages in the range became present. Recheck the DMA address
+ * array, allowing max 3 tries for pagefault.
+ */
+ while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
+ iova, length, perm))) {
+ if (cnt >= max_tries)
+ break;
+
+ mutex_unlock(&umem_odp->umem_mutex);
+
+ /* umem_mutex is locked on success. */
+ err = rxe_odp_do_pagefault(mr, iova, length, flags);
+ if (err < 0)
+ return err;
+
+ cnt++;
+ }
+
+ if (need_fault)
+ return -EFAULT;
+
+ return 0;
+}
+
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+ enum rxe_mr_copy_dir dir)
+{
+ struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+ u32 flags = 0;
+ int err;
+
+ if (unlikely(!mr->odp_enabled))
+ return -EOPNOTSUPP;
+
+ switch (dir) {
+ case RXE_TO_MR_OBJ:
+ break;
+
+ case RXE_FROM_MR_OBJ:
+ flags = RXE_PAGEFAULT_RDONLY;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
+ /* If pagefault is not required, umem mutex will be held until data
+ * copy to the MR completes. Otherwise, it is released and locked
+ * again in rxe_odp_map_range() to let invalidation handler do its
+ * work meanwhile.
+ */
+ mutex_lock(&umem_odp->umem_mutex);
+
+ err = rxe_odp_map_range(mr, iova, length, flags);
+ if (err)
+ return err;
+
+ err = rxe_mr_copy_xarray(mr, iova, addr, length, dir);
+
+ mutex_unlock(&umem_odp->umem_mutex);
+
+ return err;
+}
--
2.39.1


Subject: [PATCH for-next v5 4/7] RDMA/rxe: Add page invalidation support

On page invalidation, an MMU notifier callback is invoked to unmap DMA
addresses and update the driver page table (umem_odp->dma_list). It also
sets the corresponding entries in the MR xarray to NULL to prevent any
access. The callback is registered when an ODP-enabled MR is created.

Signed-off-by: Daisuke Matsuda <[email protected]>
---
drivers/infiniband/sw/rxe/Makefile | 2 ++
drivers/infiniband/sw/rxe/rxe_odp.c | 56 +++++++++++++++++++++++++++++
2 files changed, 58 insertions(+)
create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 5395a581f4bb..93134f1d1d0c 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -23,3 +23,5 @@ rdma_rxe-y := \
rxe_task.o \
rxe_net.o \
rxe_hw_counters.o
+
+rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
new file mode 100644
index 000000000000..b69b25e0fef6
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
+ */
+
+#include <linux/hmm.h>
+
+#include <rdma/ib_umem_odp.h>
+
+#include "rxe.h"
+
+static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
+ unsigned long end)
+{
+ unsigned long lower, upper, idx;
+
+ lower = rxe_mr_iova_to_index(mr, start);
+ upper = rxe_mr_iova_to_index(mr, end);
+
+ /* make elements in xarray NULL */
+ spin_lock(&mr->page_list.xa_lock);
+ for (idx = lower; idx <= upper; idx++)
+ __xa_erase(&mr->page_list, idx);
+ spin_unlock(&mr->page_list.xa_lock);
+}
+
+static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
+ const struct mmu_notifier_range *range,
+ unsigned long cur_seq)
+{
+ struct ib_umem_odp *umem_odp =
+ container_of(mni, struct ib_umem_odp, notifier);
+ struct rxe_mr *mr = umem_odp->private;
+ unsigned long start, end;
+
+ if (!mmu_notifier_range_blockable(range))
+ return false;
+
+ mutex_lock(&umem_odp->umem_mutex);
+ mmu_interval_set_seq(mni, cur_seq);
+
+ start = max_t(u64, ib_umem_start(umem_odp), range->start);
+ end = min_t(u64, ib_umem_end(umem_odp), range->end);
+
+ rxe_mr_unset_xarray(mr, start, end);
+
+ /* update umem_odp->dma_list */
+ ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
+
+ mutex_unlock(&umem_odp->umem_mutex);
+ return true;
+}
+
+const struct mmu_interval_notifier_ops rxe_mn_ops = {
+ .invalidate = rxe_ib_invalidate_range,
+};
--
2.39.1


Subject: Re: [PATCH for-next v5 1/7] RDMA/rxe: Always defer tasks on responder and completer to workqueue

On Thu, May 18, 2023 5:22 PM Daisuke Matsuda wrote:
> Both the responder and the completer need to sleep to handle page faults
> when used with ODP. This can happen when they access user MRs, so their
> tasks must be executed in process context in such cases.
>
> Additionally, the current implementation seldom defers tasks to the
> workqueue, but instead runs them in a softirq context via do_task(). It
> is called from rxe_resp_queue_pkt() and rxe_comp_queue_pkt() in
> SOFTIRQ_NET_RX context and can loop for up to RXE_MAX_ITERATIONS (=1024)
> iterations. The problem is that the task execution appears as anonymous
> load in the system and that the loop can throttle other softirqs on the
> same CPU.
>
> This patch makes the responder and completer code run in process context
> to support ODP and to address the problem described above.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---

Some people may be concerned about performance change caused by this patch,
so I measured the performance using 2 VMs on the same host.

Here are the settings:
Host:
CPU cores: 32
MEM: 64 GiB
Hypervisor: KVM

Guest:
vCPUs: 8
Mem: 12 GiB
OS: CentOS 9 Stream
kernel vers: 6.4.0-rc2 (latest jgg-for-next as of 5/18, 2023)

Conditions:
Operations: send, write, read

I measured the performance of the latest jgg-for-next kernel. After that I applied all ODP
patches and measured the performance again. The results are shown below. The performance
often fluctuates, so it is difficult to say which one is better. However, it seems there
is no regression after the change.


## latest jgg-for-next (Current)
[send]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 258.32 0.004133
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 335.09 0.005361
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 240.24 0.003844
---------------------------------------------------------------------------------------

[write]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 281.40 0.004502
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 347.02 0.005552
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 280.36 0.004486
---------------------------------------------------------------------------------------

[read]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 271.30 0.004341
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 266.44 0.004263
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 226.94 0.003631
---------------------------------------------------------------------------------------


## After applying the ODP patches
[send]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 289.26 0.004628
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 268.85 0.004302
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 320.81 0.005133
---------------------------------------------------------------------------------------

[write]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 288.98 0.004624
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 261.98 0.004192
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 274.59 0.004393
---------------------------------------------------------------------------------------

[read]
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 324.56 0.005193
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 294.21 0.004707
---------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 200000 0.00 256.60 0.004106
---------------------------------------------------------------------------------------

2023-05-18 22:34:47

by Bob Pearson

Subject: Re: [PATCH for-next v5 1/7] RDMA/rxe: Always defer tasks on responder and completer to workqueue

On 5/18/23 03:21, Daisuke Matsuda wrote:
> Both the responder and the completer need to sleep to handle page faults
> when used with ODP. This can happen when they access user MRs, so their
> tasks must be executed in process context in such cases.
>
> Additionally, the current implementation seldom defers tasks to the
> workqueue, but instead runs them in a softirq context via do_task(). It
> is called from rxe_resp_queue_pkt() and rxe_comp_queue_pkt() in
> SOFTIRQ_NET_RX context and can loop for up to RXE_MAX_ITERATIONS (=1024)
> iterations. The problem is that the task execution appears as anonymous
> load in the system and that the loop can throttle other softirqs on the
> same CPU.
>
> This patch makes the responder and completer code run in process context
> to support ODP and to address the problem described above.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe_comp.c | 12 +-----------
> drivers/infiniband/sw/rxe/rxe_hw_counters.c | 1 -
> drivers/infiniband/sw/rxe/rxe_hw_counters.h | 1 -
> drivers/infiniband/sw/rxe/rxe_resp.c | 13 +------------
> 4 files changed, 2 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index db18ace74d2b..671fdb645030 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -128,18 +128,8 @@ void retransmit_timer(struct timer_list *t)
>
> void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
> {
> - int must_sched;
> -
> skb_queue_tail(&qp->resp_pkts, skb);
> -
> - must_sched = skb_queue_len(&qp->resp_pkts) > 1;
> - if (must_sched != 0)
> - rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
> -
> - if (must_sched)
> - rxe_sched_task(&qp->comp.task);
> - else
> - rxe_run_task(&qp->comp.task);
> + rxe_sched_task(&qp->comp.task);
> }
>
> static inline enum comp_state get_wqe(struct rxe_qp *qp,
> diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.c b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
> index a012522b577a..dc23cf3a6967 100644
> --- a/drivers/infiniband/sw/rxe/rxe_hw_counters.c
> +++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
> @@ -14,7 +14,6 @@ static const struct rdma_stat_desc rxe_counter_descs[] = {
> [RXE_CNT_RCV_RNR].name = "rcvd_rnr_err",
> [RXE_CNT_SND_RNR].name = "send_rnr_err",
> [RXE_CNT_RCV_SEQ_ERR].name = "rcvd_seq_err",
> - [RXE_CNT_COMPLETER_SCHED].name = "ack_deferred",
> [RXE_CNT_RETRY_EXCEEDED].name = "retry_exceeded_err",
> [RXE_CNT_RNR_RETRY_EXCEEDED].name = "retry_rnr_exceeded_err",
> [RXE_CNT_COMP_RETRY].name = "completer_retry_err",
> diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.h b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
> index 71f4d4fa9dc8..303da0e3134a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_hw_counters.h
> +++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
> @@ -18,7 +18,6 @@ enum rxe_counters {
> RXE_CNT_RCV_RNR,
> RXE_CNT_SND_RNR,
> RXE_CNT_RCV_SEQ_ERR,
> - RXE_CNT_COMPLETER_SCHED,
> RXE_CNT_RETRY_EXCEEDED,
> RXE_CNT_RNR_RETRY_EXCEEDED,
> RXE_CNT_COMP_RETRY,
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 68f6cd188d8e..ba0222bfce9e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -46,21 +46,10 @@ static char *resp_state_name[] = {
> [RESPST_EXIT] = "EXIT",
> };
>
> -/* rxe_recv calls here to add a request packet to the input queue */
> void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
> {
> - int must_sched;
> - struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> -
> skb_queue_tail(&qp->req_pkts, skb);
> -
> - must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
> - (skb_queue_len(&qp->req_pkts) > 1);
> -
> - if (must_sched)
> - rxe_sched_task(&qp->resp.task);
> - else
> - rxe_run_task(&qp->resp.task);
> + rxe_sched_task(&qp->resp.task);
> }
>
> static inline enum resp_states get_req(struct rxe_qp *qp,

Looks good.

Reviewed-by: Bob Pearson <[email protected]>



2023-05-18 22:35:49

by Bob Pearson

Subject: Re: [PATCH for-next v5 3/7] RDMA/rxe: Move resp_states definition to rxe_verbs.h

On 5/18/23 03:21, Daisuke Matsuda wrote:
> To use the resp_states values in rxe_loc.h, it is necessary to move the
> definition to rxe_verbs.h, where other internal states of this driver are
> defined.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe.h | 37 ---------------------------
> drivers/infiniband/sw/rxe/rxe_verbs.h | 37 +++++++++++++++++++++++++++
> 2 files changed, 37 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.h b/drivers/infiniband/sw/rxe/rxe.h
> index d33dd6cf83d3..9b4d044a1264 100644
> --- a/drivers/infiniband/sw/rxe/rxe.h
> +++ b/drivers/infiniband/sw/rxe/rxe.h
> @@ -100,43 +100,6 @@
> #define rxe_info_mw(mw, fmt, ...) ibdev_info_ratelimited((mw)->ibmw.device, \
> "mw#%d %s: " fmt, (mw)->elem.index, __func__, ##__VA_ARGS__)
>
> -/* responder states */
> -enum resp_states {
> - RESPST_NONE,
> - RESPST_GET_REQ,
> - RESPST_CHK_PSN,
> - RESPST_CHK_OP_SEQ,
> - RESPST_CHK_OP_VALID,
> - RESPST_CHK_RESOURCE,
> - RESPST_CHK_LENGTH,
> - RESPST_CHK_RKEY,
> - RESPST_EXECUTE,
> - RESPST_READ_REPLY,
> - RESPST_ATOMIC_REPLY,
> - RESPST_ATOMIC_WRITE_REPLY,
> - RESPST_PROCESS_FLUSH,
> - RESPST_COMPLETE,
> - RESPST_ACKNOWLEDGE,
> - RESPST_CLEANUP,
> - RESPST_DUPLICATE_REQUEST,
> - RESPST_ERR_MALFORMED_WQE,
> - RESPST_ERR_UNSUPPORTED_OPCODE,
> - RESPST_ERR_MISALIGNED_ATOMIC,
> - RESPST_ERR_PSN_OUT_OF_SEQ,
> - RESPST_ERR_MISSING_OPCODE_FIRST,
> - RESPST_ERR_MISSING_OPCODE_LAST_C,
> - RESPST_ERR_MISSING_OPCODE_LAST_D1E,
> - RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
> - RESPST_ERR_RNR,
> - RESPST_ERR_RKEY_VIOLATION,
> - RESPST_ERR_INVALIDATE_RKEY,
> - RESPST_ERR_LENGTH,
> - RESPST_ERR_CQ_OVERFLOW,
> - RESPST_ERROR,
> - RESPST_DONE,
> - RESPST_EXIT,
> -};
> -
> void rxe_set_mtu(struct rxe_dev *rxe, unsigned int dev_mtu);
>
> int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name);
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index 26a20f088692..b6fbd9b3d086 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -127,6 +127,43 @@ struct rxe_comp_info {
> struct rxe_task task;
> };
>
> +/* responder states */
> +enum resp_states {
> + RESPST_NONE,
> + RESPST_GET_REQ,
> + RESPST_CHK_PSN,
> + RESPST_CHK_OP_SEQ,
> + RESPST_CHK_OP_VALID,
> + RESPST_CHK_RESOURCE,
> + RESPST_CHK_LENGTH,
> + RESPST_CHK_RKEY,
> + RESPST_EXECUTE,
> + RESPST_READ_REPLY,
> + RESPST_ATOMIC_REPLY,
> + RESPST_ATOMIC_WRITE_REPLY,
> + RESPST_PROCESS_FLUSH,
> + RESPST_COMPLETE,
> + RESPST_ACKNOWLEDGE,
> + RESPST_CLEANUP,
> + RESPST_DUPLICATE_REQUEST,
> + RESPST_ERR_MALFORMED_WQE,
> + RESPST_ERR_UNSUPPORTED_OPCODE,
> + RESPST_ERR_MISALIGNED_ATOMIC,
> + RESPST_ERR_PSN_OUT_OF_SEQ,
> + RESPST_ERR_MISSING_OPCODE_FIRST,
> + RESPST_ERR_MISSING_OPCODE_LAST_C,
> + RESPST_ERR_MISSING_OPCODE_LAST_D1E,
> + RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
> + RESPST_ERR_RNR,
> + RESPST_ERR_RKEY_VIOLATION,
> + RESPST_ERR_INVALIDATE_RKEY,
> + RESPST_ERR_LENGTH,
> + RESPST_ERR_CQ_OVERFLOW,
> + RESPST_ERROR,
> + RESPST_DONE,
> + RESPST_EXIT,
> +};
> +
> enum rdatm_res_state {
> rdatm_res_state_next,
> rdatm_res_state_new,

Looks OK.

Reviewed-by: Bob Pearson <[email protected]>


2023-05-18 22:47:15

by Bob Pearson

Subject: Re: [PATCH for-next v5 2/7] RDMA/rxe: Make MR functions accessible from other rxe source code

On 5/18/23 03:21, Daisuke Matsuda wrote:
> Some functions in rxe_mr.c are going to be used in rxe_odp.c, which will
> be created in a subsequent patch. Declare these functions in rxe_loc.h so
> that they can be referenced from other rxe source files.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe_loc.h | 14 ++++++++++++++
> drivers/infiniband/sw/rxe/rxe_mr.c | 18 ++++--------------
> 2 files changed, 18 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 804b15e929dd..00fedd1a4980 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -60,7 +60,9 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>
> /* rxe_mr.c */
> u8 rxe_get_next_key(u32 last_key);
> +void rxe_mr_init(int access, struct rxe_mr *mr);
> void rxe_mr_init_dma(int access, struct rxe_mr *mr);
> +int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt);
> int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> int access, struct rxe_mr *mr);
> int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
> @@ -71,6 +73,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
> void *addr, int length, enum rxe_mr_copy_dir dir);
> int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
> int sg_nents, unsigned int *sg_offset);
> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> + unsigned int length, enum rxe_mr_copy_dir dir);
> int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> u64 compare, u64 swap_add, u64 *orig_val);
> int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
> @@ -82,6 +86,16 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 key);
> int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
> void rxe_mr_cleanup(struct rxe_pool_elem *elem);
>
> +static inline unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> +{
> + return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> +}
> +
> +static inline unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
> +{
> + return iova & (mr_page_size(mr) - 1);
> +}
> +
> /* rxe_mw.c */
> int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata);
> int rxe_dealloc_mw(struct ib_mw *ibmw);
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 0e538fafcc20..ffbac6f5e828 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -49,7 +49,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
> | IB_ACCESS_REMOTE_WRITE \
> | IB_ACCESS_REMOTE_ATOMIC)
>
> -static void rxe_mr_init(int access, struct rxe_mr *mr)
> +void rxe_mr_init(int access, struct rxe_mr *mr)
> {
> u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1);
> u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
> @@ -77,16 +77,6 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr)
> mr->ibmr.type = IB_MR_TYPE_DMA;
> }
>
> -static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
> -{
> - return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
> -}
> -
> -static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
> -{
> - return iova & (mr_page_size(mr) - 1);
> -}
> -
> static bool is_pmem_page(struct page *pg)
> {
> unsigned long paddr = page_to_phys(pg);
> @@ -96,7 +86,7 @@ static bool is_pmem_page(struct page *pg)
> IORES_DESC_PERSISTENT_MEMORY);
> }
>
> -static int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
> +int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
> {
> XA_STATE(xas, &mr->page_list, 0);
> struct sg_page_iter sg_iter;
> @@ -247,8 +237,8 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
> return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
> }
>
> -static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> - unsigned int length, enum rxe_mr_copy_dir dir)
> +int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
> + unsigned int length, enum rxe_mr_copy_dir dir)
> {
> unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
> unsigned long index = rxe_mr_iova_to_index(mr, iova);

Looks good.

Reviewed-by: Bob Pearson <[email protected]>


2023-05-19 06:55:10

by Guoqing Jiang

Subject: Re: [PATCH for-next v5 0/7] On-Demand Paging on SoftRoCE

Hello,

On 5/18/23 16:21, Daisuke Matsuda wrote:
> [2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> https://lore.kernel.org/lkml/[email protected]/

Quote from above link

"There is a problem that data on DAX-enabled filesystem cannot be
duplicated with
software RAID or other hardware methods."

Could you elaborate a bit more about the problem or any links about it?
Thank you.

Guoqing

Subject: RE: [PATCH for-next v5 0/7] On-Demand Paging on SoftRoCE

On Fri, May 19, 2023 3:42 PM Guoqing Jiang wrote:
>
> Hello,
>
> On 5/18/23 16:21, Daisuke Matsuda wrote:
> > [2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
> > https://lore.kernel.org/lkml/[email protected]/
>
> Quote from above link
>
> "There is a problem that data on DAX-enabled filesystem cannot be
> duplicated with
> software RAID or other hardware methods."
>
> Could you elaborate a bit more about the problem or any links about it?
> Thank you.

I am not an expert on Pmems, but my understanding is as follows:

Pmem (Persistent memory) is detected as a memory device during the boot process.
Physical addresses are allocated to them just like other memory in DIMM slots,
so the system has to treat them differently from traditional storage devices like HDDs/SSDs.

It may be technically possible to duplicate data across multiple Pmems, but the
duplication is of little practical use. Traditional storage devices can easily be
hot-removed and hot-added on failure. However, Pmems are not attached to
hot-pluggable slots, so you have to halt the system and open the cabinet to swap
out a Pmem. This means data duplication on the same host does not improve the
availability of the system.

Daisuke

>
> Guoqing

2023-05-19 10:28:48

by Guoqing Jiang

Subject: Re: [PATCH for-next v5 0/7] On-Demand Paging on SoftRoCE



On 5/19/23 17:57, Daisuke Matsuda (Fujitsu) wrote:
> On Fri, May 19, 2023 3:42 PM Guoqing Jiang wrote:
>> Hello,
>>
>> On 5/18/23 16:21, Daisuke Matsuda wrote:
>>> [2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
>>> https://lore.kernel.org/lkml/[email protected]/
>> Quote from above link
>>
>> "There is a problem that data on DAX-enabled filesystem cannot be
>> duplicated with
>> software RAID or other hardware methods."
>>
>> Could you elaborate a bit more about the problem or any links about it?
>> Thank you.
> I am not an expert on Pmems, but my understanding is as follows:
>
> Pmem (Persistent memory) is detected as a memory device during the boot process.
> Physical addresses are allocated to them just like other memory in DIMM slots,
> so the system has to treat them differently from traditional storage devices like HDDs/SSDs.
>
> It may be technically possible to duplicate data across multiple Pmems, but the
> duplication is of little practical use. Traditional storage devices can easily be
> hot-removed and hot-added on failure. However, Pmems are not attached to
> hot-pluggable slots, so you have to halt the system and open the cabinet to swap
> out a Pmem. This means data duplication on the same host does not improve the
> availability of the system.

I guess pmem with the block translation table (BTT) type would be fine since it
can be used like a normal storage device, but I am not a pmem expert either.

Thanks,
Guoqing

2023-05-19 17:11:48

by Bob Pearson

Subject: Re: [PATCH for-next v5 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

On 5/18/23 03:21, Daisuke Matsuda wrote:
> Allow applications to register an ODP-enabled MR, in which case the flag
> IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA
> operations are supported at this point. They will be enabled later in the
> two subsequent patches.
>
> rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR. It syncs
> process address space from the CPU page table to the driver page table
> (dma_list/pfn_list in umem_odp) when called with RXE_PAGEFAULT_SNAPSHOT
> flag. Additionally, it can be used to trigger a page fault when pages being
> accessed are not present or do not have proper read/write permissions, and
> possibly to prefetch pages in the future.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe.c | 7 ++
> drivers/infiniband/sw/rxe/rxe_loc.h | 14 +++
> drivers/infiniband/sw/rxe/rxe_mr.c | 9 +-
> drivers/infiniband/sw/rxe/rxe_odp.c | 120 ++++++++++++++++++++++++++
> drivers/infiniband/sw/rxe/rxe_resp.c | 15 +++-
> drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
> drivers/infiniband/sw/rxe/rxe_verbs.h | 2 +
> 7 files changed, 166 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 54c723a6edda..f2284d27229b 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
> rxe->ndev->dev_addr);
>
> rxe->max_ucontext = RXE_MAX_UCONTEXT;
> +
> + if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
> + rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
> +
> + /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
> + rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
> + }
> }
>
> /* initialize port attributes */
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 00fedd1a4980..93247d123642 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -202,4 +202,18 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
> return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
> }
>
> +/* rxe_odp.c */
> +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> +int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> + u64 iova, int access_flags, struct rxe_mr *mr);
> +#else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> +static inline int
> +rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> + int access_flags, struct rxe_mr *mr)
> +{
> + return -EOPNOTSUPP;
> +}
> +
> +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> +
> #endif /* RXE_LOC_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index ffbac6f5e828..cd368cd096c8 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -318,7 +318,10 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
> return err;
> }
>
> - return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
> + if (mr->odp_enabled)
> + return -EOPNOTSUPP;
> + else
> + return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
> }
>
> /* copy data in or out of a wqe, i.e. sg list
> @@ -527,6 +530,10 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> struct page *page;
> u64 *va;
>
> + /* ODP is not supported right now. WIP. */
> + if (mr->odp_enabled)
> + return RESPST_ERR_UNSUPPORTED_OPCODE;
> +
> /* See IBA oA19-28 */
> if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
> rxe_dbg_mr(mr, "mr not in valid state");
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index b69b25e0fef6..e5497d09c399 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -24,6 +24,24 @@ static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
> spin_unlock(&mr->page_list.xa_lock);
> }
>
> +static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
> + unsigned long end, unsigned long *pfn_list)
> +{
> + unsigned long lower, upper, idx;
> + struct page *page;
> +
> + lower = rxe_mr_iova_to_index(mr, start);
> + upper = rxe_mr_iova_to_index(mr, end);
> +
> + /* make pages visible in xarray. no sleep while taking the lock */
> + spin_lock(&mr->page_list.xa_lock);
> + for (idx = lower; idx <= upper; idx++) {
> + page = hmm_pfn_to_page(pfn_list[idx]);
> + __xa_store(&mr->page_list, idx, page, GFP_ATOMIC);
> + }
> + spin_unlock(&mr->page_list.xa_lock);
> +}
> +
> static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> const struct mmu_notifier_range *range,
> unsigned long cur_seq)
> @@ -54,3 +72,105 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> const struct mmu_interval_notifier_ops rxe_mn_ops = {
> .invalidate = rxe_ib_invalidate_range,
> };
> +
> +#define RXE_PAGEFAULT_RDONLY BIT(1)
> +#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
> +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
> +{
> + int np;
> + u64 access_mask;
> + bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> + access_mask = ODP_READ_ALLOWED_BIT;
> + if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
> + access_mask |= ODP_WRITE_ALLOWED_BIT;
> +
> + /*
> + * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
> + * Callers must release the lock later to let invalidation handler
> + * do its work again.
> + */
> + np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
> + access_mask, fault);
> + if (np < 0)
> + return np;
> +
> + /* umem_mutex is still locked here, so we can use hmm_pfn_to_page()
> + * safely to fetch pages in the range.
> + */
> + rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
> +
> + return np;
> +}
> +
> +static int rxe_odp_init_pages(struct rxe_mr *mr)
> +{
> + int ret;
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> + ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
> + RXE_PAGEFAULT_SNAPSHOT);
> +
> + if (ret >= 0)
> + mutex_unlock(&umem_odp->umem_mutex);
> +
> + return ret >= 0 ? 0 : ret;
> +}
> +
> +int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> + u64 iova, int access_flags, struct rxe_mr *mr)
> +{
> + int err;
> + struct ib_umem_odp *umem_odp;
> +
> + if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
> + return -EOPNOTSUPP;
> +
> + rxe_mr_init(access_flags, mr);
> +
> + xa_init(&mr->page_list);
> +
> + if (!start && length == U64_MAX) {
> + if (iova != 0)
> + return -EINVAL;
> + if (!(rxe->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
> + return -EINVAL;
> +
> + /* Never reach here, for implicit ODP is not implemented. */
> + }
> +
> + umem_odp = ib_umem_odp_get(&rxe->ib_dev, start, length, access_flags,
> + &rxe_mn_ops);
> + if (IS_ERR(umem_odp)) {
> + rxe_dbg_mr(mr, "Unable to create umem_odp err = %d\n",
> + (int)PTR_ERR(umem_odp));
> + return PTR_ERR(umem_odp);
> + }
> +
> + umem_odp->private = mr;
> +
> + mr->odp_enabled = true;
> + mr->umem = &umem_odp->umem;
> + mr->access = access_flags;
> + mr->ibmr.length = length;
> + mr->ibmr.iova = iova;
> + mr->page_offset = ib_umem_offset(&umem_odp->umem);
> +
> + err = rxe_odp_init_pages(mr);
> + if (err) {
> + ib_umem_odp_release(umem_odp);
> + return err;
> + }
> +
> + err = rxe_mr_fill_pages_from_sgt(mr, &umem_odp->umem.sgt_append.sgt);
> + if (err) {
> + ib_umem_odp_release(umem_odp);
> + return err;
> + }
> +
> + mr->state = RXE_MR_STATE_VALID;
> + mr->ibmr.type = IB_MR_TYPE_USER;
> +
> + return err;
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index ba0222bfce9e..90c31c4f2944 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -626,6 +626,10 @@ static enum resp_states process_flush(struct rxe_qp *qp,
> struct rxe_mr *mr = qp->resp.mr;
> struct resp_res *res = qp->resp.res;
>
> + /* ODP is not supported right now. WIP. */
> + if (mr->odp_enabled)
> + return RESPST_ERR_UNSUPPORTED_OPCODE;
> +
> /* oA19-14, oA19-15 */
> if (res && res->replay)
> return RESPST_ACKNOWLEDGE;
> @@ -679,10 +683,13 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
> if (!res->replay) {
> u64 iova = qp->resp.va + qp->resp.offset;
>
> - err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
> - atmeth_comp(pkt),
> - atmeth_swap_add(pkt),
> - &res->atomic.orig_val);
> + if (mr->odp_enabled)
> + err = RESPST_ERR_UNSUPPORTED_OPCODE;
> + else
> + err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
> + atmeth_comp(pkt),
> + atmeth_swap_add(pkt),
> + &res->atomic.orig_val);
> if (err)
> return err;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index dea605b7f683..9c23defdc7b5 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -1274,7 +1274,10 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
> mr->ibmr.pd = ibpd;
> mr->ibmr.device = ibpd->device;
>
> - err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
> + if (access & IB_ACCESS_ON_DEMAND)
> + err = rxe_odp_mr_init_user(rxe, start, length, iova, access, mr);
> + else
> + err = rxe_mr_init_user(rxe, start, length, iova, access, mr);
> if (err) {
> rxe_dbg_mr(mr, "reg_user_mr failed, err = %d", err);
> goto err_cleanup;
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index b6fbd9b3d086..de5a982c7c7e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -333,6 +333,8 @@ struct rxe_mr {
> u32 nbuf;
>
> struct xarray page_list;
> +
> + bool odp_enabled;
> };
>
> static inline unsigned int mr_page_size(struct rxe_mr *mr)

Reviewed-by: Bob Pearson <[email protected]>

2023-05-19 17:18:24

by Bob Pearson

Subject: Re: [PATCH for-next v5 4/7] RDMA/rxe: Add page invalidation support

On 5/18/23 23:07, Daisuke Matsuda wrote:
> On page invalidation, an MMU notifier callback is invoked to unmap DMA
> addresses and update the driver page table (umem_odp->dma_list). It also
> sets the corresponding entries in the MR xarray to NULL to prevent any access.
> The callback is registered when an ODP-enabled MR is created.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/Makefile | 2 ++
> drivers/infiniband/sw/rxe/rxe_odp.c | 56 +++++++++++++++++++++++++++++
> 2 files changed, 58 insertions(+)
> create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 5395a581f4bb..93134f1d1d0c 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -23,3 +23,5 @@ rdma_rxe-y := \
> rxe_task.o \
> rxe_net.o \
> rxe_hw_counters.o
> +
> +rdma_rxe-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += rxe_odp.o
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> new file mode 100644
> index 000000000000..b69b25e0fef6
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -0,0 +1,56 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2022-2023 Fujitsu Ltd. All rights reserved.
> + */
> +
> +#include <linux/hmm.h>
> +
> +#include <rdma/ib_umem_odp.h>
> +
> +#include "rxe.h"
> +
> +static void rxe_mr_unset_xarray(struct rxe_mr *mr, unsigned long start,
> + unsigned long end)
> +{
> + unsigned long lower, upper, idx;
> +
> + lower = rxe_mr_iova_to_index(mr, start);
> + upper = rxe_mr_iova_to_index(mr, end);
> +
> + /* make elements in xarray NULL */
> + spin_lock(&mr->page_list.xa_lock);
> + for (idx = lower; idx <= upper; idx++)
> + __xa_erase(&mr->page_list, idx);
> + spin_unlock(&mr->page_list.xa_lock);
> +}
> +
> +static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> + const struct mmu_notifier_range *range,
> + unsigned long cur_seq)
> +{
> + struct ib_umem_odp *umem_odp =
> + container_of(mni, struct ib_umem_odp, notifier);
> + struct rxe_mr *mr = umem_odp->private;
> + unsigned long start, end;
> +
> + if (!mmu_notifier_range_blockable(range))
> + return false;
> +
> + mutex_lock(&umem_odp->umem_mutex);
> + mmu_interval_set_seq(mni, cur_seq);
> +
> + start = max_t(u64, ib_umem_start(umem_odp), range->start);
> + end = min_t(u64, ib_umem_end(umem_odp), range->end);
> +
> + rxe_mr_unset_xarray(mr, start, end);
> +
> + /* update umem_odp->dma_list */
> + ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
> +
> + mutex_unlock(&umem_odp->umem_mutex);
> + return true;
> +}
> +
> +const struct mmu_interval_notifier_ops rxe_mn_ops = {
> + .invalidate = rxe_ib_invalidate_range,
> +};

Looks good.

Reviewed-by: Bob Pearson <[email protected]>


2023-05-19 17:21:24

by Bob Pearson

Subject: Re: [PATCH for-next v5 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP

On 5/18/23 03:21, Daisuke Matsuda wrote:
> rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
> it to load payloads of requesting packets; responder uses it to process
> Send, Write, and Read operations; completer uses it to copy data from
> response packets of Read and Atomic operations to a user MR.
>
> Allow these operations to be used with ODP by adding a subordinate function
> rxe_odp_mr_copy(). It is comprised of the following steps:
> 1. Check the driver page table(umem_odp->dma_list) to see if pages being
> accessed are present with appropriate permission.
> 2. If necessary, trigger page fault to map the pages.
> 3. Update the MR xarray using PFNs in umem_odp->pfn_list.
> 4. Execute data copy to/from the pages.
>
> umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> is not changed while it is being checked and that mapped pages are not
> invalidated before data copy completes.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe.c | 10 +++
> drivers/infiniband/sw/rxe/rxe_loc.h | 8 ++
> drivers/infiniband/sw/rxe/rxe_mr.c | 2 +-
> drivers/infiniband/sw/rxe/rxe_odp.c | 109 ++++++++++++++++++++++++++++
> 4 files changed, 128 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index f2284d27229b..207a022156f0 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -79,6 +79,16 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>
> /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
> rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
> +
> + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
> + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV;
> + rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
> +
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
> }
> }
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 93247d123642..4b95c8c46bdc 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -206,6 +206,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
> #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
> int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> u64 iova, int access_flags, struct rxe_mr *mr);
> +int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
> + enum rxe_mr_copy_dir dir);
> #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> static inline int
> rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> @@ -213,6 +215,12 @@ rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> {
> return -EOPNOTSUPP;
> }
> +static inline int
> +rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
> + int length, enum rxe_mr_copy_dir dir)
> +{
> + return -EOPNOTSUPP;
> +}
>
> #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index cd368cd096c8..0e3cda59d702 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -319,7 +319,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
> }
>
> if (mr->odp_enabled)
> - return -EOPNOTSUPP;
> + return rxe_odp_mr_copy(mr, iova, addr, length, dir);
> else
> return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
> }
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index e5497d09c399..cbe5d0c3fcc4 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -174,3 +174,112 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
>
> return err;
> }
> +
> +static inline bool rxe_is_pagefault_neccesary(struct ib_umem_odp *umem_odp,
> + u64 iova, int length, u32 perm)
> +{
> + int idx;
> + u64 addr;
> + bool need_fault = false;
> +
> + addr = iova & (~(BIT(umem_odp->page_shift) - 1));
> +
> + /* Skim through all pages that are to be accessed. */
> + while (addr < iova + length) {
> + idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
> +
> + if (!(umem_odp->dma_list[idx] & perm)) {
> + need_fault = true;
> + break;
> + }
> +
> + addr += BIT(umem_odp->page_shift);
> + }
> + return need_fault;
> +}
> +
> +/* umem mutex must be locked before entering this function. */
> +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> +{
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> + const int max_tries = 3;
> + int cnt = 0;
> +
> + int err;
> + u64 perm;
> + bool need_fault;
> +
> + if (unlikely(length < 1)) {
> + mutex_unlock(&umem_odp->umem_mutex);
> + return -EINVAL;
> + }
> +
> + perm = ODP_READ_ALLOWED_BIT;
> + if (!(flags & RXE_PAGEFAULT_RDONLY))
> + perm |= ODP_WRITE_ALLOWED_BIT;
> +
> + /*
> + * A successful return from rxe_odp_do_pagefault() does not guarantee
> + * that all pages in the range became present. Recheck the DMA address
> + * array, allowing max 3 tries for pagefault.
> + */
> + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> + iova, length, perm))) {
> + if (cnt >= max_tries)
> + break;
> +
> + mutex_unlock(&umem_odp->umem_mutex);
> +
> + /* umem_mutex is locked on success. */
> + err = rxe_odp_do_pagefault(mr, iova, length, flags);
> + if (err < 0)
> + return err;
> +
> + cnt++;
> + }
> +
> + if (need_fault)
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
> + enum rxe_mr_copy_dir dir)
> +{
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> + u32 flags = 0;
> + int err;
> +
> + if (unlikely(!mr->odp_enabled))
> + return -EOPNOTSUPP;
> +
> + switch (dir) {
> + case RXE_TO_MR_OBJ:
> + break;
> +
> + case RXE_FROM_MR_OBJ:
> + flags = RXE_PAGEFAULT_RDONLY;
> + break;
> +
> + default:
> + return -EINVAL;
> + }
> +
> + /* If pagefault is not required, umem mutex will be held until data
> + * copy to the MR completes. Otherwise, it is released and locked
> + * again in rxe_odp_map_range() to let invalidation handler do its
> + * work meanwhile.
> + */
> + mutex_lock(&umem_odp->umem_mutex);
> +
> + err = rxe_odp_map_range(mr, iova, length, flags);
> + if (err)
> + return err;
> +
> + err = rxe_mr_copy_xarray(mr, iova, addr, length, dir);
> +
> + mutex_unlock(&umem_odp->umem_mutex);
> +
> + return err;
> +}

Reviewed-by: Bob Pearson <[email protected]>

2023-05-22 19:06:55

by Bob Pearson

Subject: Re: [PATCH for-next v5 7/7] RDMA/rxe: Add support for the traditional Atomic operations with ODP

On 5/18/23 03:21, Daisuke Matsuda wrote:
> Enable 'fetch and add' and 'compare and swap' operations to manipulate
> data in an ODP-enabled MR. This is comprised of the following steps:
> 1. Check the driver page table(umem_odp->dma_list) to see if the target
> page is both readable and writable.
> 2. If not, then trigger page fault to map the page.
> 3. Update the entry in the MR xarray.
> 4. Execute the operation.
>
> umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> is not changed while it is being checked and that the target page is not
> invalidated before data access completes.
>
> Signed-off-by: Daisuke Matsuda <[email protected]>
> ---
> drivers/infiniband/sw/rxe/rxe.c | 1 +
> drivers/infiniband/sw/rxe/rxe_loc.h | 9 +++++++++
> drivers/infiniband/sw/rxe/rxe_odp.c | 26 ++++++++++++++++++++++++++
> drivers/infiniband/sw/rxe/rxe_resp.c | 5 ++++-
> 4 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 207a022156f0..abd3267c2873 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -88,6 +88,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
> rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
> rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
> rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
> + rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
> rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
> }
> }
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 4b95c8c46bdc..b9d2985774ee 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -208,6 +208,9 @@ int rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length,
> u64 iova, int access_flags, struct rxe_mr *mr);
> int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
> enum rxe_mr_copy_dir dir);
> +int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> + u64 compare, u64 swap_add, u64 *orig_val);
> +
> #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
> static inline int
> rxe_odp_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
> @@ -221,6 +224,12 @@ rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
> {
> return -EOPNOTSUPP;
> }
> +static inline int
> +rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> + u64 compare, u64 swap_add, u64 *orig_val)
> +{
> + return RESPST_ERR_UNSUPPORTED_OPCODE;
> +}
>
> #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index cbe5d0c3fcc4..194b1fab98b7 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -283,3 +283,29 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
>
> return err;
> }
> +
> +int rxe_odp_mr_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
> + u64 compare, u64 swap_add, u64 *orig_val)
> +{
> + int err;
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> + /* If pagefault is not required, umem mutex will be held until the
> + * atomic operation completes. Otherwise, it is released and locked
> + * again in rxe_odp_map_range() to let invalidation handler do its
> + * work meanwhile.
> + */
> + mutex_lock(&umem_odp->umem_mutex);
> +
> + /* Atomic operations manipulate a single char. */
> + err = rxe_odp_map_range(mr, iova, sizeof(char), 0);
> + if (err)
> + return err;
> +
> + err = rxe_mr_do_atomic_op(mr, iova, opcode, compare,
> + swap_add, orig_val);
> +
> + mutex_unlock(&umem_odp->umem_mutex);
> +
> + return err;
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 90c31c4f2944..0a918145dc07 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -684,7 +684,10 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
> u64 iova = qp->resp.va + qp->resp.offset;
>
> if (mr->odp_enabled)
> - err = RESPST_ERR_UNSUPPORTED_OPCODE;
> + err = rxe_odp_mr_atomic_op(mr, iova, pkt->opcode,
> + atmeth_comp(pkt),
> + atmeth_swap_add(pkt),
> + &res->atomic.orig_val);
> else
> err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
> atmeth_comp(pkt),

Reviewed-by: Bob Pearson <[email protected]>

2023-06-12 16:43:43

by Jason Gunthorpe

Subject: Re: [PATCH for-next v5 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

On Thu, May 18, 2023 at 05:21:50PM +0900, Daisuke Matsuda wrote:

> +static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
> + unsigned long end, unsigned long *pfn_list)
> +{
> + unsigned long lower, upper, idx;
> + struct page *page;
> +
> + lower = rxe_mr_iova_to_index(mr, start);
> + upper = rxe_mr_iova_to_index(mr, end);
> +
> + /* make pages visible in xarray. no sleep while taking the lock */
> + spin_lock(&mr->page_list.xa_lock);
> + for (idx = lower; idx <= upper; idx++) {
> + page = hmm_pfn_to_page(pfn_list[idx]);
> + __xa_store(&mr->page_list, idx, page, GFP_ATOMIC);

All of these loops can be performance improved a lot by using xas
loops

> unsigned long cur_seq)
> @@ -54,3 +72,105 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> const struct mmu_interval_notifier_ops rxe_mn_ops = {
> .invalidate = rxe_ib_invalidate_range,
> };
> +
> +#define RXE_PAGEFAULT_RDONLY BIT(1)
> +#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
> +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
> +{
> + int np;
> + u64 access_mask;
> + bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> + access_mask = ODP_READ_ALLOWED_BIT;
> + if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
> + access_mask |= ODP_WRITE_ALLOWED_BIT;
> +
> + /*
> + * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
> + * Callers must release the lock later to let invalidation handler
> + * do its work again.
> + */
> + np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
> + access_mask, fault);
> + if (np < 0)
> + return np;
> +
> + /* umem_mutex is still locked here, so we can use hmm_pfn_to_page()
> + * safely to fetch pages in the range.

All the comments should be in the style like the first one, not the
second

> + */
> + rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
> +
> + return np;
> +}
> +
> +static int rxe_odp_init_pages(struct rxe_mr *mr)
> +{
> + int ret;
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> + ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
> + RXE_PAGEFAULT_SNAPSHOT);

Probably suffix this with "and_lock"

> + mr->odp_enabled = true;
> + mr->umem = &umem_odp->umem;
> + mr->access = access_flags;
> + mr->ibmr.length = length;
> + mr->ibmr.iova = iova;
> + mr->page_offset = ib_umem_offset(&umem_odp->umem);
> +
> + err = rxe_odp_init_pages(mr);
> + if (err) {
> + ib_umem_odp_release(umem_odp);
> + return err;
> + }
> +
> + err = rxe_mr_fill_pages_from_sgt(mr, &umem_odp->umem.sgt_append.sgt);

Uh? What is this? The sgt is not used in the ODP mode?

> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index b6fbd9b3d086..de5a982c7c7e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -333,6 +333,8 @@ struct rxe_mr {
> u32 nbuf;
>
> struct xarray page_list;
> +
> + bool odp_enabled;

You can tell from the umem, don't need a flag

Jason
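
As a sketch of that last point (not code from the posted series), the
odp_enabled flag could be replaced by a small helper that checks the is_odp
bit already carried by struct ib_umem:

	/* hypothetical helper; mr->umem is NULL for DMA and fast-reg MRs */
	static inline bool rxe_mr_is_odp(struct rxe_mr *mr)
	{
		return mr->umem && mr->umem->is_odp;
	}

Call sites such as rxe_mr_copy() would then test rxe_mr_is_odp(mr) instead
of mr->odp_enabled.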

2023-06-12 16:51:32

by Jason Gunthorpe

Subject: Re: [PATCH for-next v5 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP

On Thu, May 18, 2023 at 05:21:51PM +0900, Daisuke Matsuda wrote:

> +/* umem mutex must be locked before entering this function. */
> +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> +{
> + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> + const int max_tries = 3;
> + int cnt = 0;
> +
> + int err;
> + u64 perm;
> + bool need_fault;
> +
> + if (unlikely(length < 1)) {
> + mutex_unlock(&umem_odp->umem_mutex);
> + return -EINVAL;
> + }
> +
> + perm = ODP_READ_ALLOWED_BIT;
> + if (!(flags & RXE_PAGEFAULT_RDONLY))
> + perm |= ODP_WRITE_ALLOWED_BIT;
> +
> + /*
> + * A successful return from rxe_odp_do_pagefault() does not guarantee
> + * that all pages in the range became present. Recheck the DMA address
> + * array, allowing max 3 tries for pagefault.
> + */
> + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> + iova, length, perm))) {
> + if (cnt >= max_tries)
> + break;

I don't think this makes sense..

You need to make this work more like mlx5 does, you take the spinlock
on the xarray, you search it for your index and whatever is there
tells what to do. Hold the spinlock while doing the copy. This is
enough locking for the fast path.

If there is no index present, or it is not writable and you need
write, then you unlock the spinlock and prefetch the missing entry and
try again, this time also holding the mutex so there isn't a race.

You shouldn't probe into parts of the umem to discover information
already stored in the xarray then do the same lookup into the xarray.

IIRC this also needs to keep track in the xarray on a per page basis
if the page is writable.

Jason
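
For illustration, the xarray-first flow described above might be sketched
roughly as follows (hypothetical shape and name, not code from the posted
series; per-page writability tracking and checking every page of the range
are omitted for brevity):

	static int rxe_odp_mr_copy_sketch(struct rxe_mr *mr, u64 iova, void *addr,
					  unsigned int length,
					  enum rxe_mr_copy_dir dir, u32 flags)
	{
		struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
		int err;

		/*
		 * Fast path: the page is already present in the xarray, so
		 * copy while holding the xarray spinlock; the invalidation
		 * handler cannot remove entries while the lock is held.
		 */
		xa_lock(&mr->page_list);
		if (xa_load(&mr->page_list, rxe_mr_iova_to_index(mr, iova))) {
			err = rxe_mr_copy_xarray(mr, iova, addr, length, dir);
			xa_unlock(&mr->page_list);
			return err;
		}
		xa_unlock(&mr->page_list);

		/*
		 * Slow path: fault the missing range in. On success umem_mutex
		 * is held, which keeps the invalidation handler out while the
		 * copy is retried.
		 */
		err = rxe_odp_do_pagefault(mr, iova, length, flags);
		if (err < 0)
			return err;

		err = rxe_mr_copy_xarray(mr, iova, addr, length, dir);
		mutex_unlock(&umem_odp->umem_mutex);

		return err;
	}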

Subject: RE: [PATCH for-next v5 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read with ODP

On Tue, June 13, 2023 1:23 AM Jason Gunthorpe wrote:
>
> On Thu, May 18, 2023 at 05:21:51PM +0900, Daisuke Matsuda wrote:
>
> > +/* umem mutex must be locked before entering this function. */
> > +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> > +{
> > + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > + const int max_tries = 3;
> > + int cnt = 0;
> > +
> > + int err;
> > + u64 perm;
> > + bool need_fault;
> > +
> > + if (unlikely(length < 1)) {
> > + mutex_unlock(&umem_odp->umem_mutex);
> > + return -EINVAL;
> > + }
> > +
> > + perm = ODP_READ_ALLOWED_BIT;
> > + if (!(flags & RXE_PAGEFAULT_RDONLY))
> > + perm |= ODP_WRITE_ALLOWED_BIT;
> > +
> > + /*
> > + * A successful return from rxe_odp_do_pagefault() does not guarantee
> > + * that all pages in the range became present. Recheck the DMA address
> > + * array, allowing max 3 tries for pagefault.
> > + */
> > + while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> > + iova, length, perm))) {
> > + if (cnt >= max_tries)
> > + break;
>
> I don't think this makes sense..
>
> You need to make this work more like mlx5 does, you take the spinlock
> on the xarray, you search it for your index and whatever is there
> tells what to do. Hold the spinlock while doing the copy. This is
> enough locking for the fast path.
>
> If there is no index present, or it is not writable and you need
> write, then you unlock the spinlock and prefetch the missing entry and
> try again, this time also holding the mutex so there isn't a race.
>
> You shouldn't probe into parts of the umem to discover information
> already stored in the xarray then do the same lookup into the xarray.
>
> IIRC this also needs to keep track in the xarray on a per page basis
> if the page is writable.

Thank you for the detailed explanation.
I agree the current implementation is inefficient.

I think I can fix the patches according to your comment.
I'll submit a new series once it is complete.

Thanks,
Daisuke

>
> Jason

Subject: RE: [PATCH for-next v5 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

On Tue, June 13, 2023 1:19 AM Jason Gunthorpe wrote:

Sorry for my long silence.
I have to spend most of my time on other work these days,
but I am still willing to complete this and the follow-up work.

>
> On Thu, May 18, 2023 at 05:21:50PM +0900, Daisuke Matsuda wrote:
>
> > +static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
> > + unsigned long end, unsigned long *pfn_list)
> > +{
> > + unsigned long lower, upper, idx;
> > + struct page *page;
> > +
> > + lower = rxe_mr_iova_to_index(mr, start);
> > + upper = rxe_mr_iova_to_index(mr, end);
> > +
> > + /* make pages visible in xarray. no sleep while taking the lock */
> > + spin_lock(&mr->page_list.xa_lock);
> > + for (idx = lower; idx <= upper; idx++) {
> > + page = hmm_pfn_to_page(pfn_list[idx]);
> > + __xa_store(&mr->page_list, idx, page, GFP_ATOMIC);
>
> All of these loops can be performance improved a lot by using xas
> loops

Well, do you mean I should use 'xas_for_each()' here?
That is still a 'for' loop after all, so performance may not improve.
Additionally, the 'idx' value above would have to be counted separately in that case.

>
> > unsigned long cur_seq)
> > @@ -54,3 +72,105 @@ static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
> > const struct mmu_interval_notifier_ops rxe_mn_ops = {
> > .invalidate = rxe_ib_invalidate_range,
> > };
> > +
> > +#define RXE_PAGEFAULT_RDONLY BIT(1)
> > +#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
> > +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
> > +{
> > + int np;
> > + u64 access_mask;
> > + bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
> > + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > +
> > + access_mask = ODP_READ_ALLOWED_BIT;
> > + if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
> > + access_mask |= ODP_WRITE_ALLOWED_BIT;
> > +
> > + /*
> > + * ib_umem_odp_map_dma_and_lock() locks umem_mutex on success.
> > + * Callers must release the lock later to let invalidation handler
> > + * do its work again.
> > + */
> > + np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
> > + access_mask, fault);
> > + if (np < 0)
> > + return np;
> > +
> > + /* umem_mutex is still locked here, so we can use hmm_pfn_to_page()
> > + * safely to fetch pages in the range.
>
> All the comments should be in the style like the first one, not the
> second

I got it.

>
> > + */
> > + rxe_mr_set_xarray(mr, user_va, user_va + bcnt, umem_odp->pfn_list);
> > +
> > + return np;
> > +}
> > +
> > +static int rxe_odp_init_pages(struct rxe_mr *mr)
> > +{
> > + int ret;
> > + struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > +
> > + ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
> > + RXE_PAGEFAULT_SNAPSHOT);
>
> Probably suffix this with "and_lock"

Agreed.

>
> > + mr->odp_enabled = true;
> > + mr->umem = &umem_odp->umem;
> > + mr->access = access_flags;
> > + mr->ibmr.length = length;
> > + mr->ibmr.iova = iova;
> > + mr->page_offset = ib_umem_offset(&umem_odp->umem);
> > +
> > + err = rxe_odp_init_pages(mr);
> > + if (err) {
> > + ib_umem_odp_release(umem_odp);
> > + return err;
> > + }
> > +
> > + err = rxe_mr_fill_pages_from_sgt(mr, &umem_odp->umem.sgt_append.sgt);
>
> Uh? What is this? The sgt is not used in the ODP mode?

This function fills the xarray in the non-ODP case.
You are correct that the sgt is not meant to be used with ODP; it happens to
work here, but it is not the right approach. I will fix this to fetch pages
from umem_odp->pfn_list instead.
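
For illustration, once the xarray is populated from umem_odp->pfn_list (which
rxe_odp_do_pagefault() already does via rxe_mr_set_xarray()), the tail of the
posted rxe_odp_mr_init_user() would reduce to roughly the following sketch:

	err = rxe_odp_init_pages(mr);
	if (err) {
		ib_umem_odp_release(umem_odp);
		return err;
	}

	/*
	 * No rxe_mr_fill_pages_from_sgt() here: the initial snapshot fault
	 * has already stored the pages from umem_odp->pfn_list into
	 * mr->page_list.
	 */
	mr->state = RXE_MR_STATE_VALID;
	mr->ibmr.type = IB_MR_TYPE_USER;

	return 0;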

>
> > diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> > index b6fbd9b3d086..de5a982c7c7e 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> > +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> > @@ -333,6 +333,8 @@ struct rxe_mr {
> > u32 nbuf;
> >
> > struct xarray page_list;
> > +
> > + bool odp_enabled;
>
> You can tell from the umem, don't need a flag

Certainly.

Daisuke

>
> Jason

2023-07-21 20:24:57

by Jason Gunthorpe

Subject: Re: [PATCH for-next v5 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging

On Wed, Jul 19, 2023 at 06:00:50AM +0000, Daisuke Matsuda (Fujitsu) wrote:
> On Tue, June 13, 2023 1:19 AM Jason Gunthorpe wrote:
>
> Sorry for my long silence.
> I must spend most of my time for other works these days,
> but I am still willing to complete this and subsequent works.
>
> >
> > On Thu, May 18, 2023 at 05:21:50PM +0900, Daisuke Matsuda wrote:
> >
> > > +static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
> > > + unsigned long end, unsigned long *pfn_list)
> > > +{
> > > + unsigned long lower, upper, idx;
> > > + struct page *page;
> > > +
> > > + lower = rxe_mr_iova_to_index(mr, start);
> > > + upper = rxe_mr_iova_to_index(mr, end);
> > > +
> > > + /* make pages visible in xarray. no sleep while taking the lock */
> > > + spin_lock(&mr->page_list.xa_lock);
> > > + for (idx = lower; idx <= upper; idx++) {
> > > + page = hmm_pfn_to_page(pfn_list[idx]);
> > > + __xa_store(&mr->page_list, idx, page, GFP_ATOMIC);
> >
> > All of these loops can be performance improved a lot by using xas
> > loops
>
> Well, do you mean I should use 'xas_for_each()' here?
> That is the same 'for-loop' after all, so performance may not be improved.
> Additionally, the 'idx' value above must be counted separately in that case.

xa_store is O(n log(n)), xas_store()/xas_next() more like O(n) per
store operation.

Jason
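
For reference, an xas-based version of the loop in question might look
roughly like the sketch below, using the same mr->page_list and pfn_list as
the posted patch (xas_nomem()/error handling omitted for brevity):

	static void rxe_mr_set_xarray(struct rxe_mr *mr, unsigned long start,
				      unsigned long end, unsigned long *pfn_list)
	{
		unsigned long lower = rxe_mr_iova_to_index(mr, start);
		unsigned long upper = rxe_mr_iova_to_index(mr, end);
		XA_STATE(xas, &mr->page_list, lower);

		/*
		 * The xa_state keeps its position in the tree between stores,
		 * so each iteration avoids a fresh top-down descent.
		 */
		xas_lock(&xas);
		while (xas.xa_index <= upper) {
			xas_store(&xas, hmm_pfn_to_page(pfn_list[xas.xa_index]));
			xas_next(&xas);
		}
		xas_unlock(&xas);
	}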