2022-11-16 08:53:30

by Li Zhijian

Subject: [for-next PATCH v6 00/10] RDMA/rxe: Add RDMA FLUSH operation

Hey folks,

Changes in V6:
- rebase to for-next (v6.1-rc1)
- add Yanjun's Reviewed-by to all patches except "Allow registering persistent flag for pmem MR only"
- minimize the side effects of pmem checking # Jason
- return EOPNOTSUPP if the HCA doesn't support the flush operation

These patches implement a *NEW* RDMA opcode, "RDMA FLUSH".
In IB Spec 1.5[1], two new opcodes, ATOMIC WRITE and RDMA FLUSH, were
added in the MEMORY PLACEMENT EXTENSIONS section.

This patchset makes SoftRoCE support the new RDMA FLUSH operation on
the RC service.

You can verify the patchset by building and running the rdma_flush example[2].
server:
$ ./rdma_flush_server -s [server_address] -p [port_number]
client:
$ ./rdma_flush_client -s [server_address] -p [port_number]

Corresponding pyverbs support and tests
(tests.test_qpex.QpExTestCase.test_qp_ex_rc_rdma_flush) have also been
added to rdma-core.
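
For reference, below is a minimal client-side sketch of posting a FLUSH
work request through the extended QP API. It assumes the rdma-core
helper and constant names from the rdma-flush branch[2]
(ibv_wr_flush(), IBV_FLUSH_PERSISTENT, IBV_FLUSH_RANGE); the QP must
have been created via ibv_create_qp_ex() with the matching send-ops
flag:

#include <infiniband/verbs.h>

/* Sketch only: flush a remote byte range to persistence. */
static int post_flush(struct ibv_qp_ex *qpx, uint32_t rkey,
		      uint64_t remote_addr, size_t len)
{
	ibv_wr_start(qpx);
	qpx->wr_id = 1;
	qpx->wr_flags = IBV_SEND_SIGNALED;
	ibv_wr_flush(qpx, rkey, remote_addr, len,
		     IBV_FLUSH_PERSISTENT, IBV_FLUSH_RANGE);
	return ibv_wr_complete(qpx);
}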

[1]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
[2]: https://github.com/zhijianli88/rdma-core/tree/rdma-flush-v5

CC: Xiao Yang <[email protected]>
CC: "Gotou, Yasunori" <[email protected]>
CC: Jason Gunthorpe <[email protected]>
CC: Zhu Yanjun <[email protected]>
CC: Leon Romanovsky <[email protected]>
CC: Bob Pearson <[email protected]>
CC: Mark Bloch <[email protected]>
CC: Tom Talpey <[email protected]>
CC: "Gromadzki, Tomasz" <[email protected]>
CC: Dan Williams <[email protected]>
CC: [email protected]
CC: [email protected]

The kernel source is also available at:
https://github.com/zhijianli88/linux/tree/rdma-flush-v6

Change log:
V5: https://lore.kernel.org/lkml/[email protected]/t/
V4:
- rework responder process
- rebase to v5.19+
- remove [7/7] "RDMA/rxe: Add RD FLUSH service support", since RD is not really supported

V3:
- Rebase only; commit log and comment updates
- delete patch-1: "RDMA: mr: Introduce is_pmem", which will be combined into "Allow registering persistent flag for pmem MR only"
- delete patch-7

V2:
RDMA: mr: Introduce is_pmem
- check the 1st byte to avoid crossing a page boundary
- new scheme to check is_pmem # Dan

RDMA: Allow registering MR with flush access flags
- combine [03/10] "RDMA/rxe: Allow registering FLUSH flags for supported device only" into this patch # Jason
- split RDMA_FLUSH into 2 capabilities

RDMA/rxe: Allow registering persistent flag for pmem MR only
- update the commit message; get rid of the confusing ib_check_flush_access_flags() # Tom

RDMA/rxe: Implement RC RDMA FLUSH service in requester side
- extend flush to include a length field # Tom and Tomasz

RDMA/rxe: Implement flush execution in responder side
- adjust start for the WHOLE MR level # Tom
- don't support DMA MR for flush # Tom
- check the flush return value

RDMA/rxe: Enable RDMA FLUSH capability for rxe device
- adjust the patch order; move it here from [04/10]

Li Zhijian (10):
RDMA: Extend RDMA user ABI to support flush
RDMA: Extend RDMA kernel verbs ABI to support flush
RDMA/rxe: Extend rxe user ABI to support flush
RDMA/rxe: Allow registering persistent flag for pmem MR only
RDMA/rxe: Extend rxe packet format to support flush
RDMA/rxe: Implement RC RDMA FLUSH service in requester side
RDMA/rxe: Implement flush execution in responder side
RDMA/rxe: Implement flush completion
RDMA/cm: Make QP FLUSHABLE
RDMA/rxe: Enable RDMA FLUSH capability for rxe device

drivers/infiniband/core/cm.c | 3 +-
drivers/infiniband/sw/rxe/rxe_comp.c | 4 +-
drivers/infiniband/sw/rxe/rxe_hdr.h | 47 +++++++
drivers/infiniband/sw/rxe/rxe_loc.h | 1 +
drivers/infiniband/sw/rxe/rxe_mr.c | 58 +++++++-
drivers/infiniband/sw/rxe/rxe_opcode.c | 17 +++
drivers/infiniband/sw/rxe/rxe_opcode.h | 16 ++-
drivers/infiniband/sw/rxe/rxe_param.h | 4 +-
drivers/infiniband/sw/rxe/rxe_req.c | 15 +-
drivers/infiniband/sw/rxe/rxe_resp.c | 176 +++++++++++++++++++++---
drivers/infiniband/sw/rxe/rxe_verbs.h | 6 +
include/rdma/ib_pack.h | 3 +
include/rdma/ib_verbs.h | 20 ++-
include/uapi/rdma/ib_user_ioctl_verbs.h | 2 +
include/uapi/rdma/ib_user_verbs.h | 16 +++
include/uapi/rdma/rdma_user_rxe.h | 7 +
16 files changed, 362 insertions(+), 33 deletions(-)

--
2.31.1



2022-11-16 08:54:38

by Li Zhijian

Subject: [for-next PATCH v6 01/10] RDMA: Extend RDMA user ABI to support flush

This commit extends the RDMA user ABI to support the flush
operation defined in IBA A19.4.1. These changes are
backwards compatible with the existing RDMA user ABI.
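
To illustrate how the new bits are meant to be combined on the user
side (a sketch using only the enums this patch adds; the variable
names are hypothetical):

	/* MR access flags: flushable to both placement types. */
	unsigned int access = IB_UVERBS_ACCESS_LOCAL_WRITE |
			      IB_UVERBS_ACCESS_FLUSH_GLOBAL |
			      IB_UVERBS_ACCESS_FLUSH_PERSISTENT;
	/* FLUSH WR: placement type (IBA A19.4.1.1)... */
	unsigned int plt = IB_FLUSH_GLOBAL | IB_FLUSH_PERSISTENT;
	/* ...and selectivity level (IBA A19.4.1.2). */
	unsigned int sel = IB_FLUSH_RANGE;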

Reviewed-by: Zhu Yanjun <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
---
V5: new names and new patch split scheme, suggested by Bob
---
include/uapi/rdma/ib_user_ioctl_verbs.h | 2 ++
include/uapi/rdma/ib_user_verbs.h | 16 ++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 7dd56210226f..07b105e22f6f 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -57,6 +57,8 @@ enum ib_uverbs_access_flags {
IB_UVERBS_ACCESS_ZERO_BASED = 1 << 5,
IB_UVERBS_ACCESS_ON_DEMAND = 1 << 6,
IB_UVERBS_ACCESS_HUGETLB = 1 << 7,
+ IB_UVERBS_ACCESS_FLUSH_GLOBAL = 1 << 8,
+ IB_UVERBS_ACCESS_FLUSH_PERSISTENT = 1 << 9,

IB_UVERBS_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_OPTIONAL_FIRST,
IB_UVERBS_ACCESS_OPTIONAL_RANGE =
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 43672cb1fd57..2d5f32d9d0d9 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -105,6 +105,18 @@ enum {
IB_USER_VERBS_EX_CMD_MODIFY_CQ
};

+/* see IBA A19.4.1.1 Placement Types */
+enum ib_placement_type {
+ IB_FLUSH_GLOBAL = 1U << 0,
+ IB_FLUSH_PERSISTENT = 1U << 1,
+};
+
+/* see IBA A19.4.1.2 Selectivity Level */
+enum ib_selectivity_level {
+ IB_FLUSH_RANGE = 0,
+ IB_FLUSH_MR,
+};
+
/*
* Make sure that all structs defined in this file remain laid out so
* that they pack the same way on 32-bit and 64-bit architectures (to
@@ -466,6 +478,7 @@ enum ib_uverbs_wc_opcode {
IB_UVERBS_WC_BIND_MW = 5,
IB_UVERBS_WC_LOCAL_INV = 6,
IB_UVERBS_WC_TSO = 7,
+ IB_UVERBS_WC_FLUSH = 8,
};

struct ib_uverbs_wc {
@@ -784,6 +797,7 @@ enum ib_uverbs_wr_opcode {
IB_UVERBS_WR_RDMA_READ_WITH_INV = 11,
IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP = 12,
IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD = 13,
+ IB_UVERBS_WR_FLUSH = 14,
/* Review enum ib_wr_opcode before modifying this */
};

@@ -1331,6 +1345,8 @@ enum ib_uverbs_device_cap_flags {
/* Deprecated. Please use IB_UVERBS_RAW_PACKET_CAP_SCATTER_FCS. */
IB_UVERBS_DEVICE_RAW_SCATTER_FCS = 1ULL << 34,
IB_UVERBS_DEVICE_PCI_WRITE_END_PADDING = 1ULL << 36,
+ IB_UVERBS_DEVICE_FLUSH_GLOBAL = 1ULL << 38,
+ IB_UVERBS_DEVICE_FLUSH_PERSISTENT = 1ULL << 39,
};

enum ib_uverbs_raw_packet_caps {
--
2.31.1


2022-11-16 08:56:00

by Li Zhijian

Subject: [for-next PATCH v6 02/10] RDMA: Extend RDMA kernel verbs ABI to support flush

This commit extends the RDMA kernel verbs ABI to support the flush
operation defined in IBA A19.4.1. These changes are
backwards compatible with the existing RDMA kernel verbs ABI.

It adds new FLUSH attributes/capabilities to the device/HCA, and new
FLUSH access flags to the memory region.

Users can use ibv_reg_mr(3) to register the flush access flags. Only
access flags that are also supported by the device's capabilities can
be registered successfully.

Once registered successfully, the MR is flushable. Like the
device/HCA, a flushable MR carries one or both of the
GLOBAL_VISIBILITY and PERSISTENT attributes/capabilities.
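
As a usage sketch of the registration path (assuming the rdma-core
flag names IBV_ACCESS_FLUSH_GLOBAL/IBV_ACCESS_FLUSH_PERSISTENT mirror
the kernel bits added here):

#include <infiniband/verbs.h>

/* Register a flushable MR; with the ib_check_mr_access() change
 * below, this fails with EOPNOTSUPP when the device lacks the
 * matching FLUSH capability.
 */
static struct ibv_mr *reg_flushable_mr(struct ibv_pd *pd,
				       void *buf, size_t len)
{
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_FLUSH_GLOBAL |
			  IBV_ACCESS_FLUSH_PERSISTENT);
}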

Reviewed-by: Zhu Yanjun <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
---
V5: new names and new patch split scheme, suggested by Bob
---
include/rdma/ib_pack.h | 3 +++
include/rdma/ib_verbs.h | 20 +++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index a9162f25beaf..56211d1cc9f9 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -84,6 +84,7 @@ enum {
/* opcode 0x15 is reserved */
IB_OPCODE_SEND_LAST_WITH_INVALIDATE = 0x16,
IB_OPCODE_SEND_ONLY_WITH_INVALIDATE = 0x17,
+ IB_OPCODE_FLUSH = 0x1C,

/* real constants follow -- see comment about above IB_OPCODE()
macro for more details */
@@ -112,6 +113,7 @@ enum {
IB_OPCODE(RC, FETCH_ADD),
IB_OPCODE(RC, SEND_LAST_WITH_INVALIDATE),
IB_OPCODE(RC, SEND_ONLY_WITH_INVALIDATE),
+ IB_OPCODE(RC, FLUSH),

/* UC */
IB_OPCODE(UC, SEND_FIRST),
@@ -149,6 +151,7 @@ enum {
IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE),
IB_OPCODE(RD, COMPARE_SWAP),
IB_OPCODE(RD, FETCH_ADD),
+ IB_OPCODE(RD, FLUSH),

/* UD */
IB_OPCODE(UD, SEND_ONLY),
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index a1f4d53a4bb6..bd436e0135ba 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -270,6 +270,9 @@ enum ib_device_cap_flags {
/* The device supports padding incoming writes to cacheline. */
IB_DEVICE_PCI_WRITE_END_PADDING =
IB_UVERBS_DEVICE_PCI_WRITE_END_PADDING,
+ /* Placement type attributes */
+ IB_DEVICE_FLUSH_GLOBAL = IB_UVERBS_DEVICE_FLUSH_GLOBAL,
+ IB_DEVICE_FLUSH_PERSISTENT = IB_UVERBS_DEVICE_FLUSH_PERSISTENT,
};

enum ib_kernel_cap_flags {
@@ -985,6 +988,7 @@ enum ib_wc_opcode {
IB_WC_REG_MR,
IB_WC_MASKED_COMP_SWAP,
IB_WC_MASKED_FETCH_ADD,
+ IB_WC_FLUSH = IB_UVERBS_WC_FLUSH,
/*
* Set value of IB_WC_RECV so consumers can test if a completion is a
* receive by testing (opcode & IB_WC_RECV).
@@ -1325,6 +1329,7 @@ enum ib_wr_opcode {
IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP,
IB_WR_MASKED_ATOMIC_FETCH_AND_ADD =
IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD,
+ IB_WR_FLUSH = IB_UVERBS_WR_FLUSH,

/* These are kernel only and can not be issued by userspace */
IB_WR_REG_MR = 0x20,
@@ -1458,10 +1463,14 @@ enum ib_access_flags {
IB_ACCESS_ON_DEMAND = IB_UVERBS_ACCESS_ON_DEMAND,
IB_ACCESS_HUGETLB = IB_UVERBS_ACCESS_HUGETLB,
IB_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_RELAXED_ORDERING,
+ IB_ACCESS_FLUSH_GLOBAL = IB_UVERBS_ACCESS_FLUSH_GLOBAL,
+ IB_ACCESS_FLUSH_PERSISTENT = IB_UVERBS_ACCESS_FLUSH_PERSISTENT,
+ IB_ACCESS_FLUSHABLE = IB_ACCESS_FLUSH_GLOBAL |
+ IB_ACCESS_FLUSH_PERSISTENT,

IB_ACCESS_OPTIONAL = IB_UVERBS_ACCESS_OPTIONAL_RANGE,
IB_ACCESS_SUPPORTED =
- ((IB_ACCESS_HUGETLB << 1) - 1) | IB_ACCESS_OPTIONAL,
+ ((IB_ACCESS_FLUSH_PERSISTENT << 1) - 1) | IB_ACCESS_OPTIONAL,
};

/*
@@ -4321,6 +4330,8 @@ int ib_dealloc_xrcd_user(struct ib_xrcd *xrcd, struct ib_udata *udata);
static inline int ib_check_mr_access(struct ib_device *ib_dev,
unsigned int flags)
{
+ u64 device_cap = ib_dev->attrs.device_cap_flags;
+
/*
* Local write permission is required if remote write or
* remote atomic permission is also requested.
@@ -4335,6 +4346,13 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
if (flags & IB_ACCESS_ON_DEMAND &&
!(ib_dev->attrs.kernel_cap_flags & IBK_ON_DEMAND_PAGING))
return -EOPNOTSUPP;
+
+ if ((flags & IB_ACCESS_FLUSH_GLOBAL &&
+ !(device_cap & IB_DEVICE_FLUSH_GLOBAL)) ||
+ (flags & IB_ACCESS_FLUSH_PERSISTENT &&
+ !(device_cap & IB_DEVICE_FLUSH_PERSISTENT)))
+ return -EOPNOTSUPP;
+
return 0;
}

--
2.31.1


2022-11-16 08:57:42

by Li Zhijian

Subject: [for-next PATCH v6 10/10] RDMA/rxe: Enable RDMA FLUSH capability for rxe device

Now we are ready to enable the RDMA FLUSH capability for RXE.
It supports the Global Visibility and Persistence placement types.
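
Userspace can then probe for the capability before posting FLUSH work
requests. A sketch (hypothetical helper, testing the raw bit values
this series assigns to IB_UVERBS_DEVICE_FLUSH_GLOBAL/_PERSISTENT):

#include <stdbool.h>
#include <infiniband/verbs.h>

static bool device_supports_flush(struct ibv_context *ctx)
{
	struct ibv_device_attr_ex attr = {};

	if (ibv_query_device_ex(ctx, NULL, &attr))
		return false;
	/* Bits 38/39 per ib_user_verbs.h in this series. */
	return attr.device_cap_flags_ex &
	       ((1ULL << 38) | (1ULL << 39));
}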

Reviewed-by: Zhu Yanjun <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
---
drivers/infiniband/sw/rxe/rxe_param.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 86c7a8bf3cbb..c7a82823a041 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -51,7 +51,9 @@ enum rxe_device_param {
| IB_DEVICE_SRQ_RESIZE
| IB_DEVICE_MEM_MGT_EXTENSIONS
| IB_DEVICE_MEM_WINDOW
- | IB_DEVICE_MEM_WINDOW_TYPE_2B,
+ | IB_DEVICE_MEM_WINDOW_TYPE_2B
+ | IB_DEVICE_FLUSH_GLOBAL
+ | IB_DEVICE_FLUSH_PERSISTENT,
RXE_MAX_SGE = 32,
RXE_MAX_WQE_SIZE = sizeof(struct rxe_send_wqe) +
sizeof(struct ib_sge) * RXE_MAX_SGE,
--
2.31.1


2022-11-16 09:13:56

by Li Zhijian

Subject: [for-next PATCH v6 07/10] RDMA/rxe: Implement flush execution in responder side

Only the requested placement types that are also registered in the
destination memory region are acceptable; otherwise, the responder
replies with a NAK "Remote Access Error" for the placement type
violation.

Data is persisted via arch_wb_cache_pmem(), which may be
architecture specific.

This commit also adds 2 helpers to update qp.resp from the incoming
packet.
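
The placement-type rule above can be restated as a small check
(hypothetical helper; the patch implements the equivalent test via
the MR access bits in check_rkey()):

/* Every placement type requested in the FETH must also be present
 * in the target MR's access flags, else NAK with Remote Access Error.
 */
static bool flush_type_allowed(u32 requested_plt, int mr_access)
{
	int wanted = 0;

	if (requested_plt & IB_FLUSH_GLOBAL)
		wanted |= IB_ACCESS_FLUSH_GLOBAL;
	if (requested_plt & IB_FLUSH_PERSISTENT)
		wanted |= IB_ACCESS_FLUSH_PERSISTENT;

	return (mr_access & wanted) == wanted;
}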

Reviewed-by: Zhu Yanjun <[email protected]>
Signed-off-by: Li Zhijian <[email protected]>
---
v6: call iova_to_vaddr to transform iova
v5: add QP attr check for flush access
rename flush_nvdimm_iova -> rxe_flush_pmem_iova()
v4: add send_read_response_ack and flush resource
---
drivers/infiniband/sw/rxe/rxe_loc.h | 1 +
drivers/infiniband/sw/rxe/rxe_mr.c | 36 ++++++
drivers/infiniband/sw/rxe/rxe_resp.c | 176 +++++++++++++++++++++++---
drivers/infiniband/sw/rxe/rxe_verbs.h | 6 +
4 files changed, 199 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index c2a5c8814a48..944d564a11cd 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -68,6 +68,7 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr);
int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
int access, struct rxe_mr *mr);
int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length);
int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
enum rxe_mr_copy_dir dir);
int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index fd423c015be0..592965ee89fa 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -4,6 +4,8 @@
* Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
*/

+#include <linux/libnvdimm.h>
+
#include "rxe.h"
#include "rxe_loc.h"

@@ -196,6 +198,7 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
mr->offset = ib_umem_offset(umem);
mr->state = RXE_MR_STATE_VALID;
mr->ibmr.type = IB_MR_TYPE_USER;
+ mr->ibmr.page_size = PAGE_SIZE;

return 0;

@@ -303,6 +306,39 @@ void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
return addr;
}

+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
+{
+ size_t offset;
+
+ if (length == 0)
+ return 0;
+
+ if (mr->ibmr.type == IB_MR_TYPE_DMA)
+ return -EFAULT;
+
+ offset = (iova - mr->ibmr.iova + mr->offset) & mr->page_mask;
+ while (length > 0) {
+ u8 *va;
+ int bytes;
+
+ bytes = mr->ibmr.page_size - offset;
+ if (bytes > length)
+ bytes = length;
+
+ va = iova_to_vaddr(mr, iova, length);
+ if (!va)
+ return -EFAULT;
+
+ arch_wb_cache_pmem(va, bytes);
+
+ length -= bytes;
+ iova += bytes;
+ offset = 0;
+ }
+
+ return 0;
+}
+
/* copy data from a range (vaddr, vaddr+length-1) to or from
* a mr object starting at iova.
*/
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 8caa9941e70e..43cf3fb04674 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -22,6 +22,7 @@ enum resp_states {
RESPST_EXECUTE,
RESPST_READ_REPLY,
RESPST_ATOMIC_REPLY,
+ RESPST_PROCESS_FLUSH,
RESPST_COMPLETE,
RESPST_ACKNOWLEDGE,
RESPST_CLEANUP,
@@ -57,6 +58,7 @@ static char *resp_state_name[] = {
[RESPST_EXECUTE] = "EXECUTE",
[RESPST_READ_REPLY] = "READ_REPLY",
[RESPST_ATOMIC_REPLY] = "ATOMIC_REPLY",
+ [RESPST_PROCESS_FLUSH] = "PROCESS_FLUSH",
[RESPST_COMPLETE] = "COMPLETE",
[RESPST_ACKNOWLEDGE] = "ACKNOWLEDGE",
[RESPST_CLEANUP] = "CLEANUP",
@@ -256,19 +258,38 @@ static enum resp_states check_op_seq(struct rxe_qp *qp,
}
}

+static bool check_qp_attr_access(struct rxe_qp *qp,
+ struct rxe_pkt_info *pkt)
+{
+ if (((pkt->mask & RXE_READ_MASK) &&
+ !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_READ)) ||
+ ((pkt->mask & RXE_WRITE_MASK) &&
+ !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_WRITE)) ||
+ ((pkt->mask & RXE_ATOMIC_MASK) &&
+ !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_ATOMIC))) {
+ return false;
+ }
+
+ if (pkt->mask & RXE_FLUSH_MASK) {
+ u32 flush_type = feth_plt(pkt);
+
+ if ((flush_type & IB_FLUSH_GLOBAL &&
+ !(qp->attr.qp_access_flags & IB_ACCESS_FLUSH_GLOBAL)) ||
+ (flush_type & IB_FLUSH_PERSISTENT &&
+ !(qp->attr.qp_access_flags & IB_ACCESS_FLUSH_PERSISTENT)))
+ return false;
+ }
+
+ return true;
+}
+
static enum resp_states check_op_valid(struct rxe_qp *qp,
struct rxe_pkt_info *pkt)
{
switch (qp_type(qp)) {
case IB_QPT_RC:
- if (((pkt->mask & RXE_READ_MASK) &&
- !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_READ)) ||
- ((pkt->mask & RXE_WRITE_MASK) &&
- !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_WRITE)) ||
- ((pkt->mask & RXE_ATOMIC_MASK) &&
- !(qp->attr.qp_access_flags & IB_ACCESS_REMOTE_ATOMIC))) {
+ if (!check_qp_attr_access(qp, pkt))
return RESPST_ERR_UNSUPPORTED_OPCODE;
- }

break;

@@ -425,6 +446,23 @@ static enum resp_states check_length(struct rxe_qp *qp,
return RESPST_CHK_RKEY;
}

+static void qp_resp_from_reth(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
+{
+ qp->resp.va = reth_va(pkt);
+ qp->resp.offset = 0;
+ qp->resp.rkey = reth_rkey(pkt);
+ qp->resp.resid = reth_len(pkt);
+ qp->resp.length = reth_len(pkt);
+}
+
+static void qp_resp_from_atmeth(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
+{
+ qp->resp.va = atmeth_va(pkt);
+ qp->resp.offset = 0;
+ qp->resp.rkey = atmeth_rkey(pkt);
+ qp->resp.resid = sizeof(u64);
+}
+
static enum resp_states check_rkey(struct rxe_qp *qp,
struct rxe_pkt_info *pkt)
{
@@ -436,23 +474,26 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
u32 pktlen;
int mtu = qp->mtu;
enum resp_states state;
- int access;
+ int access = 0;

if (pkt->mask & RXE_READ_OR_WRITE_MASK) {
- if (pkt->mask & RXE_RETH_MASK) {
- qp->resp.va = reth_va(pkt);
- qp->resp.offset = 0;
- qp->resp.rkey = reth_rkey(pkt);
- qp->resp.resid = reth_len(pkt);
- qp->resp.length = reth_len(pkt);
- }
+ if (pkt->mask & RXE_RETH_MASK)
+ qp_resp_from_reth(qp, pkt);
+
access = (pkt->mask & RXE_READ_MASK) ? IB_ACCESS_REMOTE_READ
: IB_ACCESS_REMOTE_WRITE;
+ } else if (pkt->mask & RXE_FLUSH_MASK) {
+ u32 flush_type = feth_plt(pkt);
+
+ if (pkt->mask & RXE_RETH_MASK)
+ qp_resp_from_reth(qp, pkt);
+
+ if (flush_type & IB_FLUSH_GLOBAL)
+ access |= IB_ACCESS_FLUSH_GLOBAL;
+ if (flush_type & IB_FLUSH_PERSISTENT)
+ access |= IB_ACCESS_FLUSH_PERSISTENT;
} else if (pkt->mask & RXE_ATOMIC_MASK) {
- qp->resp.va = atmeth_va(pkt);
- qp->resp.offset = 0;
- qp->resp.rkey = atmeth_rkey(pkt);
- qp->resp.resid = sizeof(u64);
+ qp_resp_from_atmeth(qp, pkt);
access = IB_ACCESS_REMOTE_ATOMIC;
} else {
return RESPST_EXECUTE;
@@ -501,12 +542,21 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
}
}

+ if (pkt->mask & RXE_FLUSH_MASK) {
+ /* FLUSH MR may not set va or resid
+ * no need to check range since we will flush whole mr
+ */
+ if (feth_sel(pkt) == IB_FLUSH_MR)
+ goto skip_check_range;
+ }
+
if (mr_check_range(mr, va + qp->resp.offset, resid)) {
state = RESPST_ERR_RKEY_VIOLATION;
goto err;
}

- if (pkt->mask & RXE_WRITE_MASK) {
+skip_check_range:
+ if (pkt->mask & RXE_WRITE_MASK) {
if (resid > mtu) {
if (pktlen != mtu || bth_pad(pkt)) {
state = RESPST_ERR_LENGTH;
@@ -610,11 +660,61 @@ static struct resp_res *rxe_prepare_res(struct rxe_qp *qp,
res->last_psn = pkt->psn;
res->cur_psn = pkt->psn;
break;
+ case RXE_FLUSH_MASK:
+ res->flush.va = qp->resp.va + qp->resp.offset;
+ res->flush.length = qp->resp.length;
+ res->flush.type = feth_plt(pkt);
+ res->flush.level = feth_sel(pkt);
}

return res;
}

+static enum resp_states process_flush(struct rxe_qp *qp,
+ struct rxe_pkt_info *pkt)
+{
+ u64 length, start;
+ struct rxe_mr *mr = qp->resp.mr;
+ struct resp_res *res = qp->resp.res;
+
+ /* oA19-14, oA19-15 */
+ if (res && res->replay)
+ return RESPST_ACKNOWLEDGE;
+ else if (!res) {
+ res = rxe_prepare_res(qp, pkt, RXE_FLUSH_MASK);
+ qp->resp.res = res;
+ }
+
+ if (res->flush.level == IB_FLUSH_RANGE) {
+ start = res->flush.va;
+ length = res->flush.length;
+ } else { /* level == IB_FLUSH_MR */
+ start = mr->ibmr.iova;
+ length = mr->ibmr.length;
+ }
+
+ if (res->flush.type & IB_FLUSH_PERSISTENT) {
+ if (rxe_flush_pmem_iova(mr, start, length))
+ return RESPST_ERR_RKEY_VIOLATION;
+ /* Make data persistent. */
+ wmb();
+ } else if (res->flush.type & IB_FLUSH_GLOBAL) {
+ /* Make data global visibility. */
+ wmb();
+ }
+
+ qp->resp.msn++;
+
+ /* next expected psn, read handles this separately */
+ qp->resp.psn = (pkt->psn + 1) & BTH_PSN_MASK;
+ qp->resp.ack_psn = qp->resp.psn;
+
+ qp->resp.opcode = pkt->opcode;
+ qp->resp.status = IB_WC_SUCCESS;
+
+ return RESPST_ACKNOWLEDGE;
+}
+
/* Guarantee atomicity of atomic operations at the machine level. */
static DEFINE_SPINLOCK(atomic_ops_lock);

@@ -916,6 +1016,8 @@ static enum resp_states execute(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
return RESPST_READ_REPLY;
} else if (pkt->mask & RXE_ATOMIC_MASK) {
return RESPST_ATOMIC_REPLY;
+ } else if (pkt->mask & RXE_FLUSH_MASK) {
+ return RESPST_PROCESS_FLUSH;
} else {
/* Unreachable */
WARN_ON_ONCE(1);
@@ -1089,6 +1191,19 @@ static int send_atomic_ack(struct rxe_qp *qp, u8 syndrome, u32 psn)
return ret;
}

+static int send_read_response_ack(struct rxe_qp *qp, u8 syndrome, u32 psn)
+{
+ int ret = send_common_ack(qp, syndrome, psn,
+ IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY,
+ "RDMA READ response of length zero ACK");
+
+ /* have to clear this since it is used to trigger
+ * long read replies
+ */
+ qp->resp.res = NULL;
+ return ret;
+}
+
static enum resp_states acknowledge(struct rxe_qp *qp,
struct rxe_pkt_info *pkt)
{
@@ -1099,6 +1214,8 @@ static enum resp_states acknowledge(struct rxe_qp *qp,
send_ack(qp, qp->resp.aeth_syndrome, pkt->psn);
else if (pkt->mask & RXE_ATOMIC_MASK)
send_atomic_ack(qp, AETH_ACK_UNLIMITED, pkt->psn);
+ else if (pkt->mask & RXE_FLUSH_MASK)
+ send_read_response_ack(qp, AETH_ACK_UNLIMITED, pkt->psn);
else if (bth_ack(pkt))
send_ack(qp, AETH_ACK_UNLIMITED, pkt->psn);

@@ -1155,6 +1272,22 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
/* SEND. Ack again and cleanup. C9-105. */
send_ack(qp, AETH_ACK_UNLIMITED, prev_psn);
return RESPST_CLEANUP;
+ } else if (pkt->mask & RXE_FLUSH_MASK) {
+ struct resp_res *res;
+
+ /* Find the operation in our list of responder resources. */
+ res = find_resource(qp, pkt->psn);
+ if (res) {
+ res->replay = 1;
+ res->cur_psn = pkt->psn;
+ qp->resp.res = res;
+ rc = RESPST_PROCESS_FLUSH;
+ goto out;
+ }
+
+ /* Resource not found. Class D error. Drop the request. */
+ rc = RESPST_CLEANUP;
+ goto out;
} else if (pkt->mask & RXE_READ_MASK) {
struct resp_res *res;

@@ -1348,6 +1481,9 @@ int rxe_responder(void *arg)
case RESPST_ATOMIC_REPLY:
state = atomic_reply(qp, pkt);
break;
+ case RESPST_PROCESS_FLUSH:
+ state = process_flush(qp, pkt);
+ break;
case RESPST_ACKNOWLEDGE:
state = acknowledge(qp, pkt);
break;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 22a299b0a9f0..19ddfa890480 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -165,6 +165,12 @@ struct resp_res {
u64 va;
u32 resid;
} read;
+ struct {
+ u32 length;
+ u64 va;
+ u8 type;
+ u8 level;
+ } flush;
};
};

--
2.31.1