From: Long Li <[email protected]>
This patchset implements an RDMA driver for the Microsoft Azure Network
Adapter (MANA). In MANA, the RDMA device is modeled as an auxiliary device
to the Ethernet device.
The first 11 patches modify the MANA Ethernet driver to support the RDMA
driver. The last patch implements the RDMA driver.
The user-mode part of the driver is being reviewed at:
https://github.com/linux-rdma/rdma-core/pull/1177
Ajay Sharma (3):
net: mana: Set the DMA device max segment size
net: mana: Define data structures for protection domain and memory
registration
net: mana: Define and process GDMA response code
GDMA_STATUS_MORE_ENTRIES
Long Li (9):
net: mana: Add support for auxiliary device
net: mana: Record the physical address for doorbell page region
net: mana: Handle vport sharing between devices
net: mana: Add functions for allocating doorbell page from GDMA
net: mana: Export Work Queue functions for use by RDMA driver
net: mana: Record port number in netdev
net: mana: Move header files to a common location
net: mana: Define max values for SGL entries
RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
MAINTAINERS | 4 +
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/hw/Makefile | 1 +
drivers/infiniband/hw/mana/Kconfig | 7 +
drivers/infiniband/hw/mana/Makefile | 4 +
drivers/infiniband/hw/mana/cq.c | 80 ++
drivers/infiniband/hw/mana/main.c | 681 ++++++++++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 145 ++++
drivers/infiniband/hw/mana/mr.c | 133 ++++
drivers/infiniband/hw/mana/qp.c | 501 +++++++++++++
drivers/infiniband/hw/mana/wq.c | 114 +++
.../net/ethernet/microsoft/mana/gdma_main.c | 96 ++-
.../net/ethernet/microsoft/mana/hw_channel.c | 6 +-
.../net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 162 ++++-
.../ethernet/microsoft/mana/mana_ethtool.c | 2 +-
.../net/ethernet/microsoft/mana/shm_channel.c | 2 +-
.../microsoft => include/net}/mana/gdma.h | 188 ++++-
.../net}/mana/hw_channel.h | 0
.../microsoft => include/net}/mana/mana.h | 29 +-
.../net}/mana/shm_channel.h | 0
include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
include/uapi/rdma/mana-abi.h | 66 ++
23 files changed, 2180 insertions(+), 45 deletions(-)
create mode 100644 drivers/infiniband/hw/mana/Kconfig
create mode 100644 drivers/infiniband/hw/mana/Makefile
create mode 100644 drivers/infiniband/hw/mana/cq.c
create mode 100644 drivers/infiniband/hw/mana/main.c
create mode 100644 drivers/infiniband/hw/mana/mana_ib.h
create mode 100644 drivers/infiniband/hw/mana/mr.c
create mode 100644 drivers/infiniband/hw/mana/qp.c
create mode 100644 drivers/infiniband/hw/mana/wq.c
rename {drivers/net/ethernet/microsoft => include/net}/mana/gdma.h (77%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/hw_channel.h (100%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/mana.h (93%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/shm_channel.h (100%)
create mode 100644 include/uapi/rdma/mana-abi.h
--
2.17.1
From: Long Li <[email protected]>
For outgoing packets, the PF requires the VF to configure the vport with
the corresponding protection domain and doorbell ID for the kernel or user
context. The vport can't be shared between different contexts.
Implement the logic to exclusively take over the vport by either the
Ethernet device or the RDMA device.
Signed-off-by: Long Li <[email protected]>
---
Change log:
v2: use refcount instead of directly using atomic variables
v4: change to mutex to avoid possible race with refcount
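Note: the ownership rule can be modeled outside the kernel. The following
is a minimal user-space sketch, assuming a pthread mutex in place of the
kernel mutex. The names vport_owner, vport_take and vport_release are
hypothetical and stand in for the vport fields of struct mana_port_context,
mana_cfg_vport() and mana_uncfg_vport(). It only illustrates the counting
logic, not the hardware request.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the fields added to struct mana_port_context. */
struct vport_owner {
	pthread_mutex_t lock;	/* plays the role of vport_mutex */
	int use_count;		/* plays the role of vport_use_count */
};

static void vport_owner_init(struct vport_owner *v)
{
	pthread_mutex_init(&v->lock, NULL);
	v->use_count = 0;
}

/* Take the vport; fails with -ENODEV if another user already holds it. */
static int vport_take(struct vport_owner *v)
{
	int ret = 0;

	pthread_mutex_lock(&v->lock);
	if (v->use_count > 0)
		ret = -ENODEV;
	else
		v->use_count++;
	pthread_mutex_unlock(&v->lock);

	return ret;
}

/* Release the vport so that the other driver may take it. */
static void vport_release(struct vport_owner *v)
{
	pthread_mutex_lock(&v->lock);
	v->use_count--;
	assert(v->use_count >= 0);
	pthread_mutex_unlock(&v->lock);
}
```

In this model a second vport_take() fails until the first holder calls
vport_release(), which is the behavior the patch intends between the
Ethernet and RDMA drivers.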
drivers/net/ethernet/microsoft/mana/mana.h | 7 ++++
drivers/net/ethernet/microsoft/mana/mana_en.c | 40 ++++++++++++++++++-
2 files changed, 45 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index 51bff91b63ee..8e58abdce906 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -376,6 +376,10 @@ struct mana_port_context {
mana_handle_t port_handle;
+ /* Mutex for sharing access to vport_use_count */
+ struct mutex vport_mutex;
+ int vport_use_count;
+
u16 port_idx;
bool port_is_up;
@@ -567,4 +571,7 @@ struct mana_adev {
struct gdma_dev *mdev;
};
+int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
+ u32 doorbell_pg_id);
+void mana_uncfg_vport(struct mana_port_context *apc);
#endif /* _MANA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 745a9783dd70..23e7e423a544 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -530,13 +530,31 @@ static int mana_query_vport_cfg(struct mana_port_context *apc, u32 vport_index,
return 0;
}
-static int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
- u32 doorbell_pg_id)
+void mana_uncfg_vport(struct mana_port_context *apc)
+{
+ mutex_lock(&apc->vport_mutex);
+ apc->vport_use_count--;
+ WARN_ON(apc->vport_use_count < 0);
+ mutex_unlock(&apc->vport_mutex);
+}
+EXPORT_SYMBOL_GPL(mana_uncfg_vport);
+
+int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
+ u32 doorbell_pg_id)
{
struct mana_config_vport_resp resp = {};
struct mana_config_vport_req req = {};
int err;
+ /* Ethernet driver and IB driver can't take the port at the same time */
+ mutex_lock(&apc->vport_mutex);
+ if (apc->vport_use_count > 0) {
+ mutex_unlock(&apc->vport_mutex);
+ return -ENODEV;
+ }
+ apc->vport_use_count++;
+ mutex_unlock(&apc->vport_mutex);
+
mana_gd_init_req_hdr(&req.hdr, MANA_CONFIG_VPORT_TX,
sizeof(req), sizeof(resp));
req.vport = apc->port_handle;
@@ -563,9 +581,19 @@ static int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
apc->tx_shortform_allowed = resp.short_form_allowed;
apc->tx_vp_offset = resp.tx_vport_offset;
+
+ netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
+ apc->port_handle, protection_dom_id, doorbell_pg_id);
out:
+ if (err) {
+ mutex_lock(&apc->vport_mutex);
+ apc->vport_use_count--;
+ mutex_unlock(&apc->vport_mutex);
+ }
+
return err;
}
+EXPORT_SYMBOL_GPL(mana_cfg_vport);
static int mana_cfg_vport_steering(struct mana_port_context *apc,
enum TRI_STATE rx,
@@ -626,6 +654,9 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
resp.hdr.status);
err = -EPROTO;
}
+
+ netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
+ apc->port_handle, num_entries);
out:
kfree(req);
return err;
@@ -1678,6 +1709,8 @@ static void mana_destroy_vport(struct mana_port_context *apc)
}
mana_destroy_txq(apc);
+
+ mana_uncfg_vport(apc);
}
static int mana_create_vport(struct mana_port_context *apc,
@@ -1929,6 +1962,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc->port_handle = INVALID_MANA_HANDLE;
apc->port_idx = port_idx;
+ mutex_init(&apc->vport_mutex);
+ apc->vport_use_count = 0;
+
ndev->netdev_ops = &mana_devops;
ndev->ethtool_ops = &mana_ethtool_ops;
ndev->mtu = ETH_DATA_LEN;
--
2.17.1
From: Long Li <[email protected]>
In preparation for adding the MANA RDMA driver, move all the required
header files to a common location for use by both the Ethernet and RDMA
drivers.
Signed-off-by: Long Li <[email protected]>
---
Change log:
v2: Move headers to include/net/mana, instead of include/linux/mana
MAINTAINERS | 1 +
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/hw_channel.c | 4 ++--
drivers/net/ethernet/microsoft/mana/mana_bpf.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_ethtool.c | 2 +-
drivers/net/ethernet/microsoft/mana/shm_channel.c | 2 +-
{drivers/net/ethernet/microsoft => include/net}/mana/gdma.h | 0
.../net/ethernet/microsoft => include/net}/mana/hw_channel.h | 0
{drivers/net/ethernet/microsoft => include/net}/mana/mana.h | 0
.../net/ethernet/microsoft => include/net}/mana/shm_channel.h | 0
11 files changed, 8 insertions(+), 7 deletions(-)
rename {drivers/net/ethernet/microsoft => include/net}/mana/gdma.h (100%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/hw_channel.h (100%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/mana.h (100%)
rename {drivers/net/ethernet/microsoft => include/net}/mana/shm_channel.h (100%)
diff --git a/MAINTAINERS b/MAINTAINERS
index 40fa1955ca3f..51bec6d5076d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9108,6 +9108,7 @@ F: include/asm-generic/hyperv-tlfs.h
F: include/asm-generic/mshyperv.h
F: include/clocksource/hyperv_timer.h
F: include/linux/hyperv.h
+F: include/net/mana
F: include/uapi/linux/hyperv.h
F: net/vmw_vsock/hyperv_transport.c
F: tools/hv/
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 272facd272a4..ab3d8e75fb69 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -6,7 +6,7 @@
#include <linux/utsname.h>
#include <linux/version.h>
-#include "mana.h"
+#include <net/mana/mana.h>
static u32 mana_gd_r32(struct gdma_context *g, u64 offset)
{
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 078d6a5a0768..e61cb3f6fbe1 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -1,8 +1,8 @@
// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
/* Copyright (c) 2021, Microsoft Corporation. */
-#include "gdma.h"
-#include "hw_channel.h"
+#include <net/mana/gdma.h>
+#include <net/mana/hw_channel.h>
static int mana_hwc_get_msg_index(struct hw_channel_context *hwc, u16 *msg_id)
{
diff --git a/drivers/net/ethernet/microsoft/mana/mana_bpf.c b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
index 1d2f948b5c00..97a79659af45 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_bpf.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_bpf.c
@@ -8,7 +8,7 @@
#include <linux/bpf_trace.h>
#include <net/xdp.h>
-#include "mana.h"
+#include <net/mana/mana.h>
void mana_xdp_tx(struct sk_buff *skb, struct net_device *ndev)
{
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 963a4dbb46ea..5aab7afc9143 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -11,7 +11,7 @@
#include <net/checksum.h>
#include <net/ip6_checksum.h>
-#include "mana.h"
+#include <net/mana/mana.h>
static DEFINE_IDA(mana_adev_ida);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index e13f2453eabb..ebc6595d02fe 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -5,7 +5,7 @@
#include <linux/etherdevice.h>
#include <linux/ethtool.h>
-#include "mana.h"
+#include <net/mana/mana.h>
static const struct {
char name[ETH_GSTRING_LEN];
diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c b/drivers/net/ethernet/microsoft/mana/shm_channel.c
index da255da62176..5553af9c8085 100644
--- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
@@ -6,7 +6,7 @@
#include <linux/io.h>
#include <linux/mm.h>
-#include "shm_channel.h"
+#include <net/mana/shm_channel.h>
#define PAGE_FRAME_L48_WIDTH_BYTES 6
#define PAGE_FRAME_L48_WIDTH_BITS (PAGE_FRAME_L48_WIDTH_BYTES * 8)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/include/net/mana/gdma.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/gdma.h
rename to include/net/mana/gdma.h
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.h b/include/net/mana/hw_channel.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/hw_channel.h
rename to include/net/mana/hw_channel.h
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/include/net/mana/mana.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/mana.h
rename to include/net/mana/mana.h
diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.h b/include/net/mana/shm_channel.h
similarity index 100%
rename from drivers/net/ethernet/microsoft/mana/shm_channel.h
rename to include/net/mana/shm_channel.h
--
2.17.1
From: Long Li <[email protected]>
The RDMA device may need to create Ethernet device queues for use by Queue
Pair type RAW. This allows a user-mode context to access Ethernet hardware
queues. Export the supporting functions for use by the RDMA driver.
Signed-off-by: Long Li <[email protected]>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 1 +
drivers/net/ethernet/microsoft/mana/mana.h | 9 +++++++++
drivers/net/ethernet/microsoft/mana/mana_en.c | 16 +++++++++-------
3 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 60cc1270b7d5..272facd272a4 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -125,6 +125,7 @@ int mana_gd_send_request(struct gdma_context *gc, u32 req_len, const void *req,
return mana_hwc_send_request(hwc, req_len, req, resp_len, resp);
}
+EXPORT_SYMBOL(mana_gd_send_request);
int mana_gd_alloc_memory(struct gdma_context *gc, unsigned int length,
struct gdma_mem_info *gmi)
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index 8e58abdce906..aca95c6ba8b3 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -571,6 +571,15 @@ struct mana_adev {
struct gdma_dev *mdev;
};
+int mana_create_wq_obj(struct mana_port_context *apc,
+ mana_handle_t vport,
+ u32 wq_type, struct mana_obj_spec *wq_spec,
+ struct mana_obj_spec *cq_spec,
+ mana_handle_t *wq_obj);
+
+void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
+ mana_handle_t wq_obj);
+
int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
u32 doorbell_pg_id);
void mana_uncfg_vport(struct mana_port_context *apc);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c18c358607a7..b769fccc076d 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -662,11 +662,11 @@ static int mana_cfg_vport_steering(struct mana_port_context *apc,
return err;
}
-static int mana_create_wq_obj(struct mana_port_context *apc,
- mana_handle_t vport,
- u32 wq_type, struct mana_obj_spec *wq_spec,
- struct mana_obj_spec *cq_spec,
- mana_handle_t *wq_obj)
+int mana_create_wq_obj(struct mana_port_context *apc,
+ mana_handle_t vport,
+ u32 wq_type, struct mana_obj_spec *wq_spec,
+ struct mana_obj_spec *cq_spec,
+ mana_handle_t *wq_obj)
{
struct mana_create_wqobj_resp resp = {};
struct mana_create_wqobj_req req = {};
@@ -715,9 +715,10 @@ static int mana_create_wq_obj(struct mana_port_context *apc,
out:
return err;
}
+EXPORT_SYMBOL_GPL(mana_create_wq_obj);
-static void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
- mana_handle_t wq_obj)
+void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
+ mana_handle_t wq_obj)
{
struct mana_destroy_wqobj_resp resp = {};
struct mana_destroy_wqobj_req req = {};
@@ -742,6 +743,7 @@ static void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
netdev_err(ndev, "Failed to destroy WQ object: %d, 0x%x\n", err,
resp.hdr.status);
}
+EXPORT_SYMBOL_GPL(mana_destroy_wq_obj);
static void mana_destroy_eq(struct mana_context *ac)
{
--
2.17.1
From: Long Li <[email protected]>
The port number is useful for a user-mode application to identify this
net device by its port index. Set dev_port in ndev to the port index.
Signed-off-by: Long Li <[email protected]>
---
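Note: a user-mode application would typically read this value from the
dev_port attribute under /sys/class/net/<ifname>/. The following is a
minimal sketch, assuming that attribute holds a decimal number. The helper
name read_dev_port is hypothetical and is not part of this series.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a decimal port index from a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_dev_port(const char *path, unsigned int *port)
{
	FILE *f;
	int parsed;

	assert(path && port);

	f = fopen(path, "r");
	if (!f)
		return -1;

	parsed = (fscanf(f, "%u", port) == 1);
	fclose(f);

	return parsed ? 0 : -1;
}
```

A caller would pass a path such as /sys/class/net/eth0/dev_port, where the
interface name is only an example, and compare the result with the port
index it is looking for.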
drivers/net/ethernet/microsoft/mana/mana_en.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b769fccc076d..963a4dbb46ea 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1975,6 +1975,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
ndev->max_mtu = ndev->mtu;
ndev->min_mtu = ndev->mtu;
ndev->needed_headroom = MANA_HEADROOM;
+ ndev->dev_port = port_idx;
SET_NETDEV_DEV(ndev, gc->dev);
netif_carrier_off(ndev);
--
2.17.1
From: Ajay Sharma <[email protected]>
The MANA hardware supports protection domains and memory registration for
use in an RDMA environment. Add those definitions and expose them for use
by the RDMA driver.
Signed-off-by: Ajay Sharma <[email protected]>
Signed-off-by: Long Li <[email protected]>
---
Change log:
v3: format/coding style changes
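Note: the two enums added to gdma.h can be exercised on their own. The
following is a minimal sketch that copies enum atb_page_size and enum
gdma_mr_access_flags from this patch. The helpers atb_page_size_bytes and
mr_allows_remote are hypothetical, and the byte sizes assume the enumerator
names mean what they say (index i is 4 KiB << i).

```c
#include <assert.h>
#include <stdint.h>

/* Copied from this patch. */
enum atb_page_size {
	ATB_PAGE_SIZE_4K,
	ATB_PAGE_SIZE_8K,
	ATB_PAGE_SIZE_16K,
	ATB_PAGE_SIZE_32K,
	ATB_PAGE_SIZE_64K,
	ATB_PAGE_SIZE_128K,
	ATB_PAGE_SIZE_256K,
	ATB_PAGE_SIZE_512K,
	ATB_PAGE_SIZE_1M,
	ATB_PAGE_SIZE_2M,
	ATB_PAGE_SIZE_MAX,
};

/* Copied from this patch. */
enum gdma_mr_access_flags {
	GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
	GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
	GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
	GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
	GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
};

/* Hypothetical helper: page size in bytes, or 0 if the index is invalid. */
static uint64_t atb_page_size_bytes(enum atb_page_size idx)
{
	if ((unsigned int)idx >= ATB_PAGE_SIZE_MAX)
		return 0;

	return (uint64_t)4096 << idx;
}

/* Hypothetical helper: non-zero if any remote access right is requested. */
static int mr_allows_remote(uint32_t flags)
{
	const uint32_t remote = GDMA_ACCESS_FLAG_REMOTE_READ |
				GDMA_ACCESS_FLAG_REMOTE_WRITE |
				GDMA_ACCESS_FLAG_REMOTE_ATOMIC;

	assert((flags >> 5) == 0);	/* only the five defined bits */

	return (flags & remote) != 0;
}
```

The flags are plain bit masks, so a memory region that is readable and
writable locally but only readable remotely would combine LOCAL_READ,
LOCAL_WRITE and REMOTE_READ with a bitwise OR.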
drivers/net/ethernet/microsoft/mana/gdma.h | 146 +++++++++++++++++-
.../net/ethernet/microsoft/mana/gdma_main.c | 27 ++--
drivers/net/ethernet/microsoft/mana/mana_en.c | 18 ++-
3 files changed, 168 insertions(+), 23 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index f945755760dc..b1bec8ab5695 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -27,6 +27,10 @@ enum gdma_request_type {
GDMA_CREATE_DMA_REGION = 25,
GDMA_DMA_REGION_ADD_PAGES = 26,
GDMA_DESTROY_DMA_REGION = 27,
+ GDMA_CREATE_PD = 29,
+ GDMA_DESTROY_PD = 30,
+ GDMA_CREATE_MR = 31,
+ GDMA_DESTROY_MR = 32,
};
#define GDMA_RESOURCE_DOORBELL_PAGE 27
@@ -59,6 +63,8 @@ enum {
GDMA_DEVICE_MANA = 2,
};
+typedef u64 gdma_obj_handle_t;
+
struct gdma_resource {
/* Protect the bitmap */
spinlock_t lock;
@@ -192,7 +198,7 @@ struct gdma_mem_info {
u64 length;
/* Allocated by the PF driver */
- u64 gdma_region;
+ gdma_obj_handle_t dma_region_handle;
};
#define REGISTER_ATB_MST_MKEY_LOWER_SIZE 8
@@ -599,7 +605,7 @@ struct gdma_create_queue_req {
u32 reserved1;
u32 pdid;
u32 doolbell_id;
- u64 gdma_region;
+ gdma_obj_handle_t gdma_region;
u32 reserved2;
u32 queue_size;
u32 log2_throttle_limit;
@@ -626,6 +632,28 @@ struct gdma_disable_queue_req {
u32 alloc_res_id_on_creation;
}; /* HW DATA */
+enum atb_page_size {
+ ATB_PAGE_SIZE_4K,
+ ATB_PAGE_SIZE_8K,
+ ATB_PAGE_SIZE_16K,
+ ATB_PAGE_SIZE_32K,
+ ATB_PAGE_SIZE_64K,
+ ATB_PAGE_SIZE_128K,
+ ATB_PAGE_SIZE_256K,
+ ATB_PAGE_SIZE_512K,
+ ATB_PAGE_SIZE_1M,
+ ATB_PAGE_SIZE_2M,
+ ATB_PAGE_SIZE_MAX,
+};
+
+enum gdma_mr_access_flags {
+ GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
+ GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
+ GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
+ GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
+ GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
+};
+
/* GDMA_CREATE_DMA_REGION */
struct gdma_create_dma_region_req {
struct gdma_req_hdr hdr;
@@ -652,14 +680,14 @@ struct gdma_create_dma_region_req {
struct gdma_create_dma_region_resp {
struct gdma_resp_hdr hdr;
- u64 gdma_region;
+ gdma_obj_handle_t dma_region_handle;
}; /* HW DATA */
/* GDMA_DMA_REGION_ADD_PAGES */
struct gdma_dma_region_add_pages_req {
struct gdma_req_hdr hdr;
- u64 gdma_region;
+ gdma_obj_handle_t dma_region_handle;
u32 page_addr_list_len;
u32 reserved3;
@@ -671,9 +699,114 @@ struct gdma_dma_region_add_pages_req {
struct gdma_destroy_dma_region_req {
struct gdma_req_hdr hdr;
- u64 gdma_region;
+ gdma_obj_handle_t dma_region_handle;
}; /* HW DATA */
+enum gdma_pd_flags {
+ GDMA_PD_FLAG_ALLOW_GPA_MR = (1 << 0),
+ GDMA_PD_FLAG_ALLOW_FMR_MR = (1 << 1),
+};
+
+struct gdma_create_pd_req {
+ struct gdma_req_hdr hdr;
+ enum gdma_pd_flags flags;
+ u32 reserved;
+};
+
+struct gdma_create_pd_resp {
+ struct gdma_resp_hdr hdr;
+ gdma_obj_handle_t pd_handle;
+ u32 pd_id;
+ u32 reserved;
+};
+
+struct gdma_destroy_pd_req {
+ struct gdma_req_hdr hdr;
+ gdma_obj_handle_t pd_handle;
+};
+
+struct gdma_destory_pd_resp {
+ struct gdma_resp_hdr hdr;
+};
+
+enum gdma_mr_type {
+ /* Guest Physical Address - MRs of this type allow access
+ * to any DMA-mapped memory using bus-logical address
+ */
+ GDMA_MR_TYPE_GPA = 1,
+
+ /* Guest Virtual Address - MRs of this type allow access
+ * to memory mapped by PTEs associated with this MR using a virtual
+ * address that is set up in the MST
+ */
+ GDMA_MR_TYPE_GVA,
+
+ /* Fast Memory Register - Like GVA but the MR is initially put in the
+ * FREE state (as opposed to Valid), and the specified number of
+ * PTEs are reserved for future fast memory reservations.
+ */
+ GDMA_MR_TYPE_FMR,
+};
+
+struct gdma_create_mr_params {
+ gdma_obj_handle_t pd_handle;
+ enum gdma_mr_type mr_type;
+ union {
+ struct {
+ gdma_obj_handle_t dma_region_handle;
+ u64 virtual_address;
+ enum gdma_mr_access_flags access_flags;
+ } gva;
+ struct {
+ enum gdma_mr_access_flags access_flags;
+ } gpa;
+ struct {
+ enum atb_page_size page_size;
+ u32 reserved_pte_count;
+ } fmr;
+ };
+};
+
+struct gdma_create_mr_request {
+ struct gdma_req_hdr hdr;
+ gdma_obj_handle_t pd_handle;
+ enum gdma_mr_type mr_type;
+ u32 reserved;
+
+ union {
+ struct {
+ enum gdma_mr_access_flags access_flags;
+ } gpa;
+
+ struct {
+ gdma_obj_handle_t dma_region_handle;
+ u64 virtual_address;
+ enum gdma_mr_access_flags access_flags;
+ } gva;
+
+ struct {
+ enum atb_page_size page_size;
+ u32 reserved_pte_count;
+ } fmr;
+ };
+};
+
+struct gdma_create_mr_response {
+ struct gdma_resp_hdr hdr;
+ gdma_obj_handle_t mr_handle;
+ u32 lkey;
+ u32 rkey;
+};
+
+struct gdma_destroy_mr_request {
+ struct gdma_req_hdr hdr;
+ gdma_obj_handle_t mr_handle;
+};
+
+struct gdma_destroy_mr_response {
+ struct gdma_resp_hdr hdr;
+};
+
int mana_gd_verify_vf_version(struct pci_dev *pdev);
int mana_gd_register_device(struct gdma_dev *gd);
@@ -705,4 +838,7 @@ int mana_gd_allocate_doorbell_page(struct gdma_context *gc, int *doorbell_page);
int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page);
+int mana_gd_destroy_dma_region(struct gdma_context *gc,
+ gdma_obj_handle_t dma_region_handle);
+
#endif /* _GDMA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 0c38c9a539f9..60cc1270b7d5 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -226,7 +226,7 @@ static int mana_gd_create_hw_eq(struct gdma_context *gc,
req.type = queue->type;
req.pdid = queue->gdma_dev->pdid;
req.doolbell_id = queue->gdma_dev->doorbell;
- req.gdma_region = queue->mem_info.gdma_region;
+ req.gdma_region = queue->mem_info.dma_region_handle;
req.queue_size = queue->queue_size;
req.log2_throttle_limit = queue->eq.log2_throttle_limit;
req.eq_pci_msix_index = queue->eq.msix_index;
@@ -240,7 +240,7 @@ static int mana_gd_create_hw_eq(struct gdma_context *gc,
queue->id = resp.queue_index;
queue->eq.disable_needed = true;
- queue->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+ queue->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
return 0;
}
@@ -694,24 +694,30 @@ int mana_gd_create_hwc_queue(struct gdma_dev *gd,
return err;
}
-static void mana_gd_destroy_dma_region(struct gdma_context *gc, u64 gdma_region)
+int mana_gd_destroy_dma_region(struct gdma_context *gc,
+ gdma_obj_handle_t dma_region_handle)
{
struct gdma_destroy_dma_region_req req = {};
struct gdma_general_resp resp = {};
int err;
- if (gdma_region == GDMA_INVALID_DMA_REGION)
- return;
+ if (dma_region_handle == GDMA_INVALID_DMA_REGION)
+ return 0;
mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_DMA_REGION, sizeof(req),
sizeof(resp));
- req.gdma_region = gdma_region;
+ req.dma_region_handle = dma_region_handle;
err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
- if (err || resp.hdr.status)
+ if (err || resp.hdr.status) {
dev_err(gc->dev, "Failed to destroy DMA region: %d, 0x%x\n",
err, resp.hdr.status);
+ return -EPROTO;
+ }
+
+ return 0;
}
+EXPORT_SYMBOL(mana_gd_destroy_dma_region);
static int mana_gd_create_dma_region(struct gdma_dev *gd,
struct gdma_mem_info *gmi)
@@ -756,14 +762,15 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
if (err)
goto out;
- if (resp.hdr.status || resp.gdma_region == GDMA_INVALID_DMA_REGION) {
+ if (resp.hdr.status ||
+ resp.dma_region_handle == GDMA_INVALID_DMA_REGION) {
dev_err(gc->dev, "Failed to create DMA region: 0x%x\n",
resp.hdr.status);
err = -EPROTO;
goto out;
}
- gmi->gdma_region = resp.gdma_region;
+ gmi->dma_region_handle = resp.dma_region_handle;
out:
kfree(req);
return err;
@@ -886,7 +893,7 @@ void mana_gd_destroy_queue(struct gdma_context *gc, struct gdma_queue *queue)
return;
}
- mana_gd_destroy_dma_region(gc, gmi->gdma_region);
+ mana_gd_destroy_dma_region(gc, gmi->dma_region_handle);
mana_gd_free_memory(gmi);
kfree(queue);
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 23e7e423a544..c18c358607a7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1382,10 +1382,10 @@ static int mana_create_txq(struct mana_port_context *apc,
memset(&wq_spec, 0, sizeof(wq_spec));
memset(&cq_spec, 0, sizeof(cq_spec));
- wq_spec.gdma_region = txq->gdma_sq->mem_info.gdma_region;
+ wq_spec.gdma_region = txq->gdma_sq->mem_info.dma_region_handle;
wq_spec.queue_size = txq->gdma_sq->queue_size;
- cq_spec.gdma_region = cq->gdma_cq->mem_info.gdma_region;
+ cq_spec.gdma_region = cq->gdma_cq->mem_info.dma_region_handle;
cq_spec.queue_size = cq->gdma_cq->queue_size;
cq_spec.modr_ctx_id = 0;
cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
@@ -1400,8 +1400,10 @@ static int mana_create_txq(struct mana_port_context *apc,
txq->gdma_sq->id = wq_spec.queue_index;
cq->gdma_cq->id = cq_spec.queue_index;
- txq->gdma_sq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
- cq->gdma_cq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+ txq->gdma_sq->mem_info.dma_region_handle =
+ GDMA_INVALID_DMA_REGION;
+ cq->gdma_cq->mem_info.dma_region_handle =
+ GDMA_INVALID_DMA_REGION;
txq->gdma_txq_id = txq->gdma_sq->id;
@@ -1612,10 +1614,10 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
memset(&wq_spec, 0, sizeof(wq_spec));
memset(&cq_spec, 0, sizeof(cq_spec));
- wq_spec.gdma_region = rxq->gdma_rq->mem_info.gdma_region;
+ wq_spec.gdma_region = rxq->gdma_rq->mem_info.dma_region_handle;
wq_spec.queue_size = rxq->gdma_rq->queue_size;
- cq_spec.gdma_region = cq->gdma_cq->mem_info.gdma_region;
+ cq_spec.gdma_region = cq->gdma_cq->mem_info.dma_region_handle;
cq_spec.queue_size = cq->gdma_cq->queue_size;
cq_spec.modr_ctx_id = 0;
cq_spec.attached_eq = cq->gdma_cq->cq.parent->id;
@@ -1628,8 +1630,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
rxq->gdma_rq->id = wq_spec.queue_index;
cq->gdma_cq->id = cq_spec.queue_index;
- rxq->gdma_rq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
- cq->gdma_cq->mem_info.gdma_region = GDMA_INVALID_DMA_REGION;
+ rxq->gdma_rq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
+ cq->gdma_cq->mem_info.dma_region_handle = GDMA_INVALID_DMA_REGION;
rxq->gdma_id = rxq->gdma_rq->id;
cq->gdma_id = cq->gdma_cq->id;
--
2.17.1
From: Long Li <[email protected]>
The RDMA device needs to allocate doorbell pages for each user context.
Implement those functions and expose them for use by the RDMA driver.
Signed-off-by: Long Li <[email protected]>
---
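Note: both doorbell functions report failure the same way: a transport
error is returned as is, and a non-zero status in the response header is
mapped to -EPROTO. The following is a minimal sketch of that mapping. The
helper name gdma_result is hypothetical; the patch open-codes the
expression instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the "err ? err : -EPROTO" pattern.
 * err is the (zero or negative) result of sending the request,
 * status is the status field of the response header.
 */
static int gdma_result(int err, uint32_t status)
{
	assert(err <= 0);

	if (err || status)
		return err ? err : -EPROTO;

	return 0;
}
```

With this mapping the caller can tell a failed send from a request that
was delivered but rejected by the device.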
drivers/net/ethernet/microsoft/mana/gdma.h | 29 ++++++++++
.../net/ethernet/microsoft/mana/gdma_main.c | 56 +++++++++++++++++++
2 files changed, 85 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index c724ca410fcb..f945755760dc 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -22,11 +22,15 @@ enum gdma_request_type {
GDMA_GENERATE_TEST_EQE = 10,
GDMA_CREATE_QUEUE = 12,
GDMA_DISABLE_QUEUE = 13,
+ GDMA_ALLOCATE_RESOURCE_RANGE = 22,
+ GDMA_DESTROY_RESOURCE_RANGE = 24,
GDMA_CREATE_DMA_REGION = 25,
GDMA_DMA_REGION_ADD_PAGES = 26,
GDMA_DESTROY_DMA_REGION = 27,
};
+#define GDMA_RESOURCE_DOORBELL_PAGE 27
+
enum gdma_queue_type {
GDMA_INVALID_QUEUE,
GDMA_SQ,
@@ -568,6 +572,26 @@ struct gdma_register_device_resp {
u32 db_id;
}; /* HW DATA */
+struct gdma_allocate_resource_range_req {
+ struct gdma_req_hdr hdr;
+ u32 resource_type;
+ u32 num_resources;
+ u32 alignment;
+ u32 allocated_resources;
+};
+
+struct gdma_allocate_resource_range_resp {
+ struct gdma_resp_hdr hdr;
+ u32 allocated_resources;
+};
+
+struct gdma_destroy_resource_range_req {
+ struct gdma_req_hdr hdr;
+ u32 resource_type;
+ u32 num_resources;
+ u32 allocated_resources;
+};
+
/* GDMA_CREATE_QUEUE */
struct gdma_create_queue_req {
struct gdma_req_hdr hdr;
@@ -676,4 +700,9 @@ void mana_gd_free_memory(struct gdma_mem_info *gmi);
int mana_gd_send_request(struct gdma_context *gc, u32 req_len, const void *req,
u32 resp_len, void *resp);
+
+int mana_gd_allocate_doorbell_page(struct gdma_context *gc, int *doorbell_page);
+
+int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page);
+
#endif /* _GDMA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 9fafaa0c8e76..7b42b78b7ddf 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -153,6 +153,62 @@ void mana_gd_free_memory(struct gdma_mem_info *gmi)
gmi->dma_handle);
}
+int mana_gd_destroy_doorbell_page(struct gdma_context *gc, int doorbell_page)
+{
+ struct gdma_destroy_resource_range_req req = {};
+ struct gdma_resp_hdr resp = {};
+ int err;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_RESOURCE_RANGE,
+ sizeof(req), sizeof(resp));
+
+ req.resource_type = GDMA_RESOURCE_DOORBELL_PAGE;
+ req.num_resources = 1;
+ req.allocated_resources = doorbell_page;
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+ if (err || resp.status) {
+ dev_err(gc->dev,
+ "Failed to destroy doorbell page: ret %d, 0x%x\n",
+ err, resp.status);
+ return err ? err : -EPROTO;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(mana_gd_destroy_doorbell_page);
+
+int mana_gd_allocate_doorbell_page(struct gdma_context *gc,
+ int *doorbell_page)
+{
+ struct gdma_allocate_resource_range_req req = {};
+ struct gdma_allocate_resource_range_resp resp = {};
+ int err;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_ALLOCATE_RESOURCE_RANGE,
+ sizeof(req), sizeof(resp));
+
+ req.resource_type = GDMA_RESOURCE_DOORBELL_PAGE;
+ req.num_resources = 1;
+ req.alignment = 1;
+
+ /* Have GDMA start searching from 0 */
+ req.allocated_resources = 0;
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+ if (err || resp.hdr.status) {
+ dev_err(gc->dev,
+ "Failed to allocate doorbell page: ret %d, 0x%x\n",
+ err, resp.hdr.status);
+ return err ? err : -EPROTO;
+ }
+
+ *doorbell_page = resp.allocated_resources;
+
+ return 0;
+}
+EXPORT_SYMBOL(mana_gd_allocate_doorbell_page);
+
static int mana_gd_create_hw_eq(struct gdma_context *gc,
struct gdma_queue *queue)
{
--
2.17.1
From: Long Li <[email protected]>
The maximum number of SGL entries should be computed from the maximum
WQE size for the intended queue type and the corresponding OOB data
size. This guarantees the hardware queue can successfully queue requests
up to the queue depth exposed to the upper layer.
Signed-off-by: Long Li <[email protected]>
---
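Note: the arithmetic behind the two new macros can be checked in user
space. The following is a minimal sketch. The values of GDMA_MAX_SQE_SIZE
(512), GDMA_MAX_RQE_SIZE (256) and INLINE_OOB_SMALL_SIZE (8), and the
16-byte layout of struct gdma_sge, are assumptions that are not shown in
this patch. The helper tx_sgl_fits is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values; they are not part of this patch. */
#define GDMA_MAX_SQE_SIZE	512
#define GDMA_MAX_RQE_SIZE	256
#define INLINE_OOB_SMALL_SIZE	8

/* Assumed layout of one scatter-gather element. */
struct gdma_sge {
	uint64_t address;
	uint32_t mem_key;
	uint32_t size;
};

static_assert(sizeof(struct gdma_sge) == 16, "SGE assumed to be 16 bytes");

/* Same expressions as the macros added to gdma.h by this patch. */
#define MAX_TX_WQE_SGL_ENTRIES	((GDMA_MAX_SQE_SIZE - \
			   sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
			   sizeof(struct gdma_sge))

#define MAX_RX_WQE_SGL_ENTRIES	((GDMA_MAX_RQE_SIZE - \
			   sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))

/* Hypothetical helper: non-zero if num_sge entries fit in one TX WQE. */
static int tx_sgl_fits(size_t num_sge)
{
	return num_sge <= MAX_TX_WQE_SGL_ENTRIES;
}
```

Under these assumptions the macros evaluate to 30 and 15, which match the
literal 30 and the GDMA_MAX_RQE_SGES value of 15 that this patch replaces.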
drivers/net/ethernet/microsoft/mana/mana_en.c | 2 +-
include/net/mana/gdma.h | 7 +++++++
include/net/mana/mana.h | 4 +---
3 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 5aab7afc9143..9a976d9b6b6f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -187,7 +187,7 @@ int mana_start_xmit(struct sk_buff *skb, struct net_device *ndev)
pkg.wqe_req.client_data_unit = 0;
pkg.wqe_req.num_sge = 1 + skb_shinfo(skb)->nr_frags;
- WARN_ON_ONCE(pkg.wqe_req.num_sge > 30);
+ WARN_ON_ONCE(pkg.wqe_req.num_sge > MAX_TX_WQE_SGL_ENTRIES);
if (pkg.wqe_req.num_sge <= ARRAY_SIZE(pkg.sgl_array)) {
pkg.wqe_req.sgl = pkg.sgl_array;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index b1bec8ab5695..ef5de92dd98d 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -436,6 +436,13 @@ struct gdma_wqe {
#define MAX_TX_WQE_SIZE 512
#define MAX_RX_WQE_SIZE 256
+#define MAX_TX_WQE_SGL_ENTRIES ((GDMA_MAX_SQE_SIZE - \
+ sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
+ sizeof(struct gdma_sge))
+
+#define MAX_RX_WQE_SGL_ENTRIES ((GDMA_MAX_RQE_SIZE - \
+ sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))
+
struct gdma_cqe {
u32 cqe_data[GDMA_COMP_DATA_SIZE / 4];
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index aca95c6ba8b3..3a0bc6e0b730 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -264,8 +264,6 @@ struct mana_cq {
int budget;
};
-#define GDMA_MAX_RQE_SGES 15
-
struct mana_recv_buf_oob {
/* A valid GDMA work request representing the data buffer. */
struct gdma_wqe_request wqe_req;
@@ -275,7 +273,7 @@ struct mana_recv_buf_oob {
/* SGL of the buffer going to be sent has part of the work request. */
u32 num_sge;
- struct gdma_sge sgl[GDMA_MAX_RQE_SGES];
+ struct gdma_sge sgl[MAX_RX_WQE_SGL_ENTRIES];
/* Required to store the result of mana_gd_post_work_request.
* gdma_posted_wqe_info.wqe_size_in_bu is required for progressing the
--
2.17.1
From: Long Li <[email protected]>
To support an RDMA device with multiple user contexts, each with its own
doorbell page, record the physical start address of the doorbell page
region for use by the RDMA driver when allocating user context doorbell IDs.
Signed-off-by: Long Li <[email protected]>
---
drivers/net/ethernet/microsoft/mana/gdma.h | 2 ++
drivers/net/ethernet/microsoft/mana/gdma_main.c | 4 ++++
2 files changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index d815d323be87..c724ca410fcb 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -350,9 +350,11 @@ struct gdma_context {
struct completion eq_test_event;
u32 test_event_eq_id;
+ phys_addr_t bar0_pa;
void __iomem *bar0_va;
void __iomem *shm_base;
void __iomem *db_page_base;
+ phys_addr_t phys_db_page_base;
u32 db_page_size;
/* Shared memory chanenl (used to bootstrap HWC) */
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 49b85ca578b0..9fafaa0c8e76 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -27,6 +27,9 @@ static void mana_gd_init_registers(struct pci_dev *pdev)
gc->db_page_base = gc->bar0_va +
mana_gd_r64(gc, GDMA_REG_DB_PAGE_OFFSET);
+ gc->phys_db_page_base = gc->bar0_pa +
+ mana_gd_r64(gc, GDMA_REG_DB_PAGE_OFFSET);
+
gc->shm_base = gc->bar0_va + mana_gd_r64(gc, GDMA_REG_SHM_OFFSET);
}
@@ -1335,6 +1338,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
mutex_init(&gc->eq_test_event_mutex);
pci_set_drvdata(pdev, gc);
+ gc->bar0_pa = pci_resource_start(pdev, 0);
bar0_va = pci_iomap(pdev, bar, 0);
if (!bar0_va)
--
2.17.1
From: Ajay Sharma <[email protected]>
MANA hardware doesn't have any restrictions on the DMA segment size, so set
it to the max allowed value.
Signed-off-by: Ajay Sharma <[email protected]>
Signed-off-by: Long Li <[email protected]>
---
Change log:
v2: Use the max allowed value as the hardware doesn't have any limit
drivers/net/ethernet/microsoft/mana/gdma_main.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 7b42b78b7ddf..0c38c9a539f9 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1387,6 +1387,12 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
if (err)
goto release_region;
+ err = dma_set_max_seg_size(&pdev->dev, UINT_MAX);
+ if (err) {
+ dev_err(&pdev->dev, "Failed to set dma device segment size\n");
+ goto release_region;
+ }
+
err = -ENOMEM;
gc = vzalloc(sizeof(*gc));
if (!gc)
--
2.17.1
From: Long Li <[email protected]>
In preparation for supporting MANA RDMA driver, add support for auxiliary
device in the Ethernet driver. The RDMA device is modeled as an auxiliary
device to the Ethernet device.
Signed-off-by: Long Li <[email protected]>
---
Change log:
v3: define mana_adev_idx_alloc and mana_adev_idx_free as static
drivers/net/ethernet/microsoft/mana/gdma.h | 2 +
drivers/net/ethernet/microsoft/mana/mana.h | 6 ++
drivers/net/ethernet/microsoft/mana/mana_en.c | 83 ++++++++++++++++++-
3 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h b/drivers/net/ethernet/microsoft/mana/gdma.h
index 41ecd156e95f..d815d323be87 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma.h
+++ b/drivers/net/ethernet/microsoft/mana/gdma.h
@@ -204,6 +204,8 @@ struct gdma_dev {
/* GDMA driver specific pointer */
void *driver_data;
+
+ struct auxiliary_device *adev;
};
#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
diff --git a/drivers/net/ethernet/microsoft/mana/mana.h b/drivers/net/ethernet/microsoft/mana/mana.h
index d36405af9432..51bff91b63ee 100644
--- a/drivers/net/ethernet/microsoft/mana/mana.h
+++ b/drivers/net/ethernet/microsoft/mana/mana.h
@@ -6,6 +6,7 @@
#include "gdma.h"
#include "hw_channel.h"
+#include <linux/auxiliary_bus.h>
/* Microsoft Azure Network Adapter (MANA)'s definitions
*
@@ -561,4 +562,9 @@ struct mana_tx_package {
struct gdma_posted_wqe_info wqe_info;
};
+struct mana_adev {
+ struct auxiliary_device adev;
+ struct gdma_dev *mdev;
+};
+
#endif /* _MANA_H */
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b7d3ba1b4d17..745a9783dd70 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -13,6 +13,18 @@
#include "mana.h"
+static DEFINE_IDA(mana_adev_ida);
+
+static int mana_adev_idx_alloc(void)
+{
+ return ida_alloc(&mana_adev_ida, GFP_KERNEL);
+}
+
+static void mana_adev_idx_free(int idx)
+{
+ ida_free(&mana_adev_ida, idx);
+}
+
/* Microsoft Azure Network Adapter (MANA) functions */
static int mana_open(struct net_device *ndev)
@@ -1960,6 +1972,70 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
return err;
}
+static void adev_release(struct device *dev)
+{
+ struct mana_adev *madev = container_of(dev, struct mana_adev, adev.dev);
+
+ kfree(madev);
+}
+
+static void remove_adev(struct gdma_dev *gd)
+{
+ struct auxiliary_device *adev = gd->adev;
+ int id = adev->id;
+
+ auxiliary_device_delete(adev);
+ auxiliary_device_uninit(adev);
+
+ mana_adev_idx_free(id);
+ gd->adev = NULL;
+}
+
+static int add_adev(struct gdma_dev *gd)
+{
+ int ret = 0;
+ struct mana_adev *madev;
+ struct auxiliary_device *adev;
+
+ madev = kzalloc(sizeof(*madev), GFP_KERNEL);
+ if (!madev)
+ return -ENOMEM;
+
+ adev = &madev->adev;
+ adev->id = mana_adev_idx_alloc();
+ if (adev->id < 0) {
+ ret = adev->id;
+ goto idx_fail;
+ }
+
+ adev->name = "rdma";
+ adev->dev.parent = gd->gdma_context->dev;
+ adev->dev.release = adev_release;
+ madev->mdev = gd;
+
+ ret = auxiliary_device_init(adev);
+ if (ret)
+ goto init_fail;
+
+ ret = auxiliary_device_add(adev);
+ if (ret)
+ goto add_fail;
+
+ gd->adev = adev;
+ return 0;
+
+add_fail:
+ auxiliary_device_uninit(adev);
+
+init_fail:
+ mana_adev_idx_free(adev->id);
+
+idx_fail:
+ kfree(madev);
+
+ return ret;
+}
+
int mana_probe(struct gdma_dev *gd, bool resuming)
{
struct gdma_context *gc = gd->gdma_context;
@@ -2027,6 +2103,8 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
break;
}
}
+
+ err = add_adev(gd);
out:
if (err)
mana_remove(gd, false);
@@ -2043,6 +2121,10 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
int err;
int i;
+ /* adev currently doesn't support suspending, always remove it */
+ if (gd->adev)
+ remove_adev(gd);
+
for (i = 0; i < ac->num_ports; i++) {
ndev = ac->ports[i];
if (!ndev) {
@@ -2075,7 +2157,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
}
mana_destroy_eq(ac);
-
out:
mana_gd_deregister_device(gd);
--
2.17.1
From: Ajay Sharma <[email protected]>
When doing memory registration, the PF may respond with
GDMA_STATUS_MORE_ENTRIES to indicate that a follow-up request is needed.
This is not an error and should be processed as expected.
Signed-off-by: Ajay Sharma <[email protected]>
Signed-off-by: Long Li <[email protected]>
---
drivers/net/ethernet/microsoft/mana/hw_channel.c | 2 +-
include/net/mana/gdma.h | 2 ++
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index e61cb3f6fbe1..24ecd193a185 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -820,7 +820,7 @@ int mana_hwc_send_request(struct hw_channel_context *hwc, u32 req_len,
goto out;
}
- if (ctx->status_code) {
+ if (ctx->status_code && ctx->status_code != GDMA_STATUS_MORE_ENTRIES) {
dev_err(hwc->dev, "HWC: Failed hw_channel req: 0x%x\n",
ctx->status_code);
err = -EPROTO;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index ef5de92dd98d..0485902d96c9 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -9,6 +9,8 @@
#include "shm_channel.h"
+#define GDMA_STATUS_MORE_ENTRIES 0x00000105
+
/* Structures labeled with "HW DATA" are exchanged with the hardware. All of
* them are naturally aligned and hence don't need __packed.
*/
--
2.17.1
From: Long Li <[email protected]>
Add an RDMA VF driver for Microsoft Azure Network Adapter (MANA).
Signed-off-by: Long Li <[email protected]>
---
Change log:
v2:
Changed coding styles/formats
Checked undersize for udata length
Changed all logging to use ibdev_xxx()
Avoided page array copy when doing MR
Sorted driver ops
Fixed warnings reported by kernel test robot <[email protected]>
v3:
More coding style/format changes
v4:
Process error on hardware vport configuration
MAINTAINERS | 3 +
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/hw/Makefile | 1 +
drivers/infiniband/hw/mana/Kconfig | 7 +
drivers/infiniband/hw/mana/Makefile | 4 +
drivers/infiniband/hw/mana/cq.c | 80 +++
drivers/infiniband/hw/mana/main.c | 681 ++++++++++++++++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 145 +++++
drivers/infiniband/hw/mana/mr.c | 133 +++++
drivers/infiniband/hw/mana/qp.c | 501 +++++++++++++++++
drivers/infiniband/hw/mana/wq.c | 114 ++++
include/net/mana/mana.h | 3 +
include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
include/uapi/rdma/mana-abi.h | 66 +++
14 files changed, 1740 insertions(+)
create mode 100644 drivers/infiniband/hw/mana/Kconfig
create mode 100644 drivers/infiniband/hw/mana/Makefile
create mode 100644 drivers/infiniband/hw/mana/cq.c
create mode 100644 drivers/infiniband/hw/mana/main.c
create mode 100644 drivers/infiniband/hw/mana/mana_ib.h
create mode 100644 drivers/infiniband/hw/mana/mr.c
create mode 100644 drivers/infiniband/hw/mana/qp.c
create mode 100644 drivers/infiniband/hw/mana/wq.c
create mode 100644 include/uapi/rdma/mana-abi.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 51bec6d5076d..1bed8444786d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9078,6 +9078,7 @@ M: Haiyang Zhang <[email protected]>
M: Stephen Hemminger <[email protected]>
M: Wei Liu <[email protected]>
M: Dexuan Cui <[email protected]>
+M: Long Li <[email protected]>
L: [email protected]
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
@@ -9095,6 +9096,7 @@ F: arch/x86/kernel/cpu/mshyperv.c
F: drivers/clocksource/hyperv_timer.c
F: drivers/hid/hid-hyperv.c
F: drivers/hv/
+F: drivers/infiniband/hw/mana/
F: drivers/input/serio/hyperv-keyboard.c
F: drivers/iommu/hyperv-iommu.c
F: drivers/net/ethernet/microsoft/
@@ -9110,6 +9112,7 @@ F: include/clocksource/hyperv_timer.h
F: include/linux/hyperv.h
F: include/net/mana
F: include/uapi/linux/hyperv.h
+F: include/uapi/rdma/mana-abi.h
F: net/vmw_vsock/hyperv_transport.c
F: tools/hv/
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 33d3ce9c888e..a062c662ecff 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -83,6 +83,7 @@ source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb4/Kconfig"
source "drivers/infiniband/hw/efa/Kconfig"
source "drivers/infiniband/hw/irdma/Kconfig"
+source "drivers/infiniband/hw/mana/Kconfig"
source "drivers/infiniband/hw/mlx4/Kconfig"
source "drivers/infiniband/hw/mlx5/Kconfig"
source "drivers/infiniband/hw/ocrdma/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index fba0b3be903e..f62e9e00c780 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_INFINIBAND_QIB) += qib/
obj-$(CONFIG_INFINIBAND_CXGB4) += cxgb4/
obj-$(CONFIG_INFINIBAND_EFA) += efa/
obj-$(CONFIG_INFINIBAND_IRDMA) += irdma/
+obj-$(CONFIG_MANA_INFINIBAND) += mana/
obj-$(CONFIG_MLX4_INFINIBAND) += mlx4/
obj-$(CONFIG_MLX5_INFINIBAND) += mlx5/
obj-$(CONFIG_INFINIBAND_OCRDMA) += ocrdma/
diff --git a/drivers/infiniband/hw/mana/Kconfig b/drivers/infiniband/hw/mana/Kconfig
new file mode 100644
index 000000000000..b3ff03a23257
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Kconfig
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MANA_INFINIBAND
+ tristate "Microsoft Azure Network Adapter support"
+ depends on NETDEVICES && ETHERNET && PCI && MICROSOFT_MANA
+ help
+ This driver provides low-level RDMA support for
+ Microsoft Azure Network Adapter (MANA).
diff --git a/drivers/infiniband/hw/mana/Makefile b/drivers/infiniband/hw/mana/Makefile
new file mode 100644
index 000000000000..a799fe264c5a
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MANA_INFINIBAND) += mana_ib.o
+
+mana_ib-y := main.o wq.o qp.o cq.o mr.o
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
new file mode 100644
index 000000000000..046fd290073d
--- /dev/null
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+ struct ib_device *ibdev = ibcq->device;
+ struct mana_ib_create_cq ucmd = {};
+ struct mana_ib_dev *mdev;
+ int err;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(ibdev,
+ "Failed to copy from udata for create cq, %d\n", err);
+ return -EFAULT;
+ }
+
+ if (attr->cqe > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(ibdev, "CQE %d exceeding limit\n", attr->cqe);
+ return -EINVAL;
+ }
+
+ cq->cqe = attr->cqe;
+ cq->umem = ib_umem_get(ibdev, ucmd.buf_addr, cq->cqe * COMP_ENTRY_SIZE,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(cq->umem)) {
+ err = PTR_ERR(cq->umem);
+ ibdev_dbg(ibdev, "Failed to get umem for create cq, err %d\n",
+ err);
+ return err;
+ }
+
+ err = mana_ib_gd_create_dma_region(mdev, cq->umem, &cq->gdma_region,
+ PAGE_SIZE);
+ if (err) {
+ ibdev_err(ibdev,
+ "Failed to create dma region for create cq, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(ibdev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, cq->gdma_region);
+
+ /* The CQ ID is not known at this time.
+ * The ID is generated at create_qp.
+ */
+
+ return 0;
+
+err_release_umem:
+ ib_umem_release(cq->umem);
+ return err;
+}
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
+{
+ struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+ struct ib_device *ibdev = ibcq->device;
+ struct mana_ib_dev *mdev;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+ ib_umem_release(cq->umem);
+
+ return 0;
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
new file mode 100644
index 000000000000..58254a0cf581
--- /dev/null
+++ b/drivers/infiniband/hw/mana/main.c
@@ -0,0 +1,681 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+MODULE_DESCRIPTION("Microsoft Azure Network Adapter IB driver");
+MODULE_LICENSE("Dual BSD/GPL");
+
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
+ u32 port)
+{
+ struct gdma_dev *gd = dev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct net_device *ndev;
+ struct mana_context *mc;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[port];
+ mpc = netdev_priv(ndev);
+
+ mutex_lock(&pd->vport_mutex);
+
+ pd->vport_use_count--;
+ WARN_ON(pd->vport_use_count < 0);
+
+ if (!pd->vport_use_count)
+ mana_uncfg_vport(mpc);
+
+ mutex_unlock(&pd->vport_mutex);
+}
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
+ u32 doorbell_id)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ int err;
+
+ mc = mdev->driver_data;
+ ndev = mc->ports[port];
+ mpc = netdev_priv(ndev);
+
+ mutex_lock(&pd->vport_mutex);
+
+ pd->vport_use_count++;
+ if (pd->vport_use_count > 1) {
+ ibdev_dbg(&dev->ib_dev,
+ "Skip as this PD already has a vport configured\n");
+ mutex_unlock(&pd->vport_mutex);
+ return 0;
+ }
+ mutex_unlock(&pd->vport_mutex);
+
+ err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
+ if (err) {
+ mutex_lock(&pd->vport_mutex);
+ pd->vport_use_count--;
+ mutex_unlock(&pd->vport_mutex);
+
+ ibdev_err(&dev->ib_dev, "Failed to configure vPort %d\n", err);
+ return err;
+ }
+
+ pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
+ pd->tx_vp_offset = mpc->tx_vp_offset;
+
+ ibdev_dbg(&dev->ib_dev,
+ "vport handle %llx pdid %x doorbell_id %x "
+ "tx_shortform_allowed %d tx_vp_offset %u\n",
+ mpc->port_handle, pd->pdn, doorbell_id,
+ pd->tx_shortform_allowed, pd->tx_vp_offset);
+
+ return 0;
+}
+
+static int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ enum gdma_pd_flags flags = 0;
+ struct mana_ib_dev *dev;
+ int ret;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ /* Set flags if this is a kernel request */
+ if (!ibpd->uobject)
+ flags = GDMA_PD_FLAG_ALLOW_GPA_MR | GDMA_PD_FLAG_ALLOW_FMR_MR;
+
+ ret = mana_ib_gd_create_pd(dev, &pd->pd_handle, &pd->pdn, flags);
+ if (ret) {
+ ibdev_err(ibdev, "Failed to get pd id, err %d\n", ret);
+ return ret;
+ }
+
+ mutex_init(&pd->vport_mutex);
+ pd->vport_use_count = 0;
+ return 0;
+}
+
+static int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ struct mana_ib_dev *dev;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ return mana_ib_gd_destroy_pd(dev, pd->pd_handle);
+}
+
+static int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
+ struct ib_udata *udata)
+{
+ struct mana_ib_ucontext *ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ struct gdma_dev *dev;
+ int doorbell_page;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ dev = mdev->gdma_dev;
+ gc = dev->gdma_context;
+
+ /* Allocate a doorbell page index */
+ ret = mana_gd_allocate_doorbell_page(gc, &doorbell_page);
+ if (ret) {
+ ibdev_err(ibdev, "Failed to allocate doorbell page %d\n", ret);
+ return ret;
+ }
+
+ ibdev_dbg(ibdev, "Doorbell page allocated %d\n", doorbell_page);
+
+ ucontext->doorbell = doorbell_page;
+
+ return 0;
+}
+
+static void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
+{
+ struct mana_ib_ucontext *mana_ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gc = mdev->gdma_dev->gdma_context;
+
+ ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
+ if (ret)
+ ibdev_err(ibdev, "Failed to destroy doorbell page %d\n", ret);
+}
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem,
+ mana_handle_t *gdma_region, u64 page_sz)
+{
+ size_t num_pages_total = ib_umem_num_dma_blocks(umem, page_sz);
+ struct gdma_dma_region_add_pages_req *add_req = NULL;
+ struct gdma_create_dma_region_resp create_resp = {};
+ struct gdma_create_dma_region_req *create_req;
+ size_t num_pages_cur, num_pages_to_handle;
+ unsigned int create_req_msg_size;
+ struct hw_channel_context *hwc;
+ struct ib_block_iter biter;
+ size_t max_pgs_create_cmd;
+ struct gdma_context *gc;
+ struct gdma_dev *mdev;
+ unsigned int i;
+ int err;
+
+ mdev = dev->gdma_dev;
+ gc = mdev->gdma_context;
+ hwc = gc->hwc.driver_data;
+ max_pgs_create_cmd =
+ (hwc->max_req_msg_size - sizeof(*create_req)) / sizeof(u64);
+
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total, max_pgs_create_cmd);
+ create_req_msg_size =
+ struct_size(create_req, page_addr_list, num_pages_to_handle);
+
+ create_req = kzalloc(create_req_msg_size, GFP_KERNEL);
+ if (!create_req)
+ return -ENOMEM;
+
+ mana_gd_init_req_hdr(&create_req->hdr, GDMA_CREATE_DMA_REGION,
+ create_req_msg_size, sizeof(create_resp));
+
+ create_req->length = umem->length;
+ create_req->offset_in_page = umem->address & (page_sz - 1);
+ create_req->gdma_page_type = order_base_2(page_sz) - PAGE_SHIFT;
+ create_req->page_count = num_pages_total;
+ create_req->page_addr_list_len = num_pages_to_handle;
+
+ ibdev_dbg(&dev->ib_dev,
+ "size_dma_region %zu num_pages_total %zu, "
+ "page_sz 0x%llx offset_in_page %u\n",
+ umem->length, num_pages_total, page_sz,
+ create_req->offset_in_page);
+
+ ibdev_dbg(&dev->ib_dev, "num_pages_to_handle %zu, gdma_page_type %u\n",
+ num_pages_to_handle, create_req->gdma_page_type);
+
+ __rdma_umem_block_iter_start(&biter, umem, page_sz);
+
+ for (i = 0; i < num_pages_to_handle; ++i) {
+ dma_addr_t cur_addr;
+
+ __rdma_block_iter_next(&biter);
+ cur_addr = rdma_block_iter_dma_address(&biter);
+
+ create_req->page_addr_list[i] = cur_addr;
+
+ ibdev_dbg(&dev->ib_dev, "page num %u cur_addr 0x%llx\n", i,
+ cur_addr);
+ }
+
+ err = mana_gd_send_request(gc, create_req_msg_size, create_req,
+ sizeof(create_resp), &create_resp);
+ kfree(create_req);
+
+ if (err || create_resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to create DMA region: %d, 0x%x\n", err,
+ create_resp.hdr.status);
+ return err ? err : -EPROTO;
+ }
+
+ *gdma_region = create_resp.dma_region_handle;
+ ibdev_dbg(&dev->ib_dev, "Created DMA region with handle 0x%llx\n",
+ *gdma_region);
+
+ num_pages_cur = num_pages_to_handle;
+
+ if (num_pages_cur < num_pages_total) {
+ unsigned int add_req_msg_size;
+ size_t max_pgs_add_cmd =
+ (hwc->max_req_msg_size - sizeof(*add_req)) /
+ sizeof(u64);
+
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total - num_pages_cur,
+ max_pgs_add_cmd);
+
+ /* Calculate the max num of pages that will be handled */
+ add_req_msg_size = struct_size(add_req, page_addr_list,
+ num_pages_to_handle);
+
+ add_req = kmalloc(add_req_msg_size, GFP_KERNEL);
+ if (!add_req) {
+ err = -ENOMEM;
+ goto error;
+ }
+
+ while (num_pages_cur < num_pages_total) {
+ struct gdma_general_resp add_resp = {};
+ u32 expected_status = 0;
+
+ if (num_pages_cur + num_pages_to_handle <
+ num_pages_total) {
+ /* Status indicating more pages are needed */
+ expected_status = GDMA_STATUS_MORE_ENTRIES;
+ }
+
+ memset(add_req, 0, add_req_msg_size);
+
+ mana_gd_init_req_hdr(&add_req->hdr,
+ GDMA_DMA_REGION_ADD_PAGES,
+ add_req_msg_size,
+ sizeof(add_resp));
+ add_req->dma_region_handle = *gdma_region;
+ add_req->page_addr_list_len = num_pages_to_handle;
+
+ for (i = 0; i < num_pages_to_handle; ++i) {
+ dma_addr_t cur_addr;
+
+ __rdma_block_iter_next(&biter);
+ cur_addr = rdma_block_iter_dma_address(&biter);
+ add_req->page_addr_list[i] = cur_addr;
+ ibdev_dbg(&dev->ib_dev,
+ "page_addr_list %zu addr 0x%llx\n",
+ num_pages_cur + i, cur_addr);
+ }
+
+ err = mana_gd_send_request(gc, add_req_msg_size,
+ add_req, sizeof(add_resp),
+ &add_resp);
+ if (err || add_resp.hdr.status != expected_status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed put DMA pages %u: %d,0x%x\n",
+ i, err, add_resp.hdr.status);
+ err = -EPROTO;
+ goto free_req;
+ }
+
+ num_pages_cur += num_pages_to_handle;
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total - num_pages_cur,
+ max_pgs_add_cmd);
+ add_req_msg_size = sizeof(*add_req) +
+ num_pages_to_handle * sizeof(u64);
+ }
+free_req:
+ kfree(add_req);
+ }
+
+error:
+ return err;
+}
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev, u64 gdma_region)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+
+ gc = mdev->gdma_context;
+ ibdev_dbg(&dev->ib_dev, "destroy dma region 0x%llx\n", gdma_region);
+
+ return mana_gd_destroy_dma_region(gc, gdma_region);
+}
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+ enum gdma_pd_flags flags)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_create_pd_resp resp = {};
+ struct gdma_create_pd_req req = {};
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_PD, sizeof(req),
+ sizeof(resp));
+
+ req.flags = flags;
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to get pd_id err %d status %u\n", err,
+ resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+
+ return err;
+ }
+
+ *pd_handle = resp.pd_handle;
+ *pd_id = resp.pd_id;
+ ibdev_dbg(&dev->ib_dev, "pd_handle 0x%llx pd_id %d\n", *pd_handle,
+ *pd_id);
+
+ return 0;
+}
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_destory_pd_resp resp = {};
+ struct gdma_destroy_pd_req req = {};
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_PD, sizeof(req),
+ sizeof(resp));
+
+ req.pd_handle = pd_handle;
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to destroy pd_handle 0x%llx err %d status %u\n",
+ pd_handle, err, resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+ }
+
+ return err;
+}
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+ struct gdma_create_mr_params *mr_params)
+{
+ struct gdma_create_mr_response resp = {};
+ struct gdma_create_mr_request req = {};
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_MR, sizeof(req),
+ sizeof(resp));
+ req.pd_handle = mr_params->pd_handle;
+
+ switch (mr_params->mr_type) {
+ case GDMA_MR_TYPE_GVA:
+ req.mr_type = GDMA_MR_TYPE_GVA;
+ req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
+ req.gva.virtual_address = mr_params->gva.virtual_address;
+ req.gva.access_flags = mr_params->gva.access_flags;
+ break;
+
+ case GDMA_MR_TYPE_GPA:
+ req.mr_type = GDMA_MR_TYPE_GPA;
+ req.gpa.access_flags = mr_params->gpa.access_flags;
+ break;
+
+ case GDMA_MR_TYPE_FMR:
+ req.mr_type = GDMA_MR_TYPE_FMR;
+ req.fmr.page_size = mr_params->fmr.page_size;
+ req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
+ break;
+
+ default:
+ ibdev_dbg(&dev->ib_dev,
+ "invalid param (GDMA_MR_TYPE) passed, type %d\n",
+ mr_params->mr_type);
+ err = -EINVAL;
+ goto error;
+ }
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev, "Failed to create mr %d, %u\n", err,
+ resp.hdr.status);
+ return err ? err : -EPROTO;
+ }
+
+ mr->ibmr.lkey = resp.lkey;
+ mr->ibmr.rkey = resp.rkey;
+ mr->mr_handle = resp.mr_handle;
+
+ return 0;
+error:
+ return err;
+}
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, gdma_obj_handle_t mr_handle)
+{
+ struct gdma_destroy_mr_response resp = {};
+ struct gdma_destroy_mr_request req = {};
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_MR, sizeof(req),
+ sizeof(resp));
+
+ req.mr_handle = mr_handle;
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+ if (err || resp.hdr.status) {
+ dev_err(gc->dev, "Failed to destroy MR: %d, 0x%x\n", err,
+ resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+ return err;
+ }
+
+ return 0;
+}
+
+static int mana_ib_mmap(struct ib_ucontext *ibcontext,
+ struct vm_area_struct *vma)
+{
+ struct mana_ib_ucontext *mana_ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ phys_addr_t pfn;
+ pgprot_t prot;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gc = mdev->gdma_dev->gdma_context;
+
+ if (vma->vm_pgoff != 0) {
+ ibdev_err(ibdev, "Unexpected vm_pgoff %lu\n", vma->vm_pgoff);
+ return -EINVAL;
+ }
+
+ /* Map to the page indexed by ucontext->doorbell */
+ pfn = (gc->phys_db_page_base +
+ gc->db_page_size * mana_ucontext->doorbell) >>
+ PAGE_SHIFT;
+ prot = pgprot_writecombine(vma->vm_page_prot);
+
+ ret = rdma_user_mmap_io(ibcontext, vma, pfn, gc->db_page_size, prot,
+ NULL);
+ if (ret)
+ ibdev_err(ibdev, "can't rdma_user_mmap_io ret %d\n", ret);
+ else
+ ibdev_dbg(ibdev, "mapped I/O pfn 0x%llx page_size %u, ret %d\n",
+ pfn, gc->db_page_size, ret);
+
+ return ret;
+}
+
+static int mana_ib_get_port_immutable(struct ib_device *ibdev, u32 port_num,
+ struct ib_port_immutable *immutable)
+{
+ /* This version only supports RAW_PACKET;
+ * other values need to be filled in for other types
+ */
+ immutable->core_cap_flags = RDMA_CORE_PORT_RAW_PACKET;
+
+ return 0;
+}
+
+static int mana_ib_query_device(struct ib_device *ibdev,
+ struct ib_device_attr *props,
+ struct ib_udata *uhw)
+{
+ props->max_qp = MANA_MAX_NUM_QUEUES;
+ props->max_qp_wr = MAX_SEND_BUFFERS_PER_QUEUE;
+
+ /* max_cqe could potentially be much bigger.
+ * As this version of the driver only supports RAW QP, set it to the
+ * same value as max_qp_wr
+ */
+ props->max_cqe = MAX_SEND_BUFFERS_PER_QUEUE;
+
+ props->max_mr_size = MANA_IB_MAX_MR_SIZE;
+ props->max_mr = INT_MAX;
+ props->max_send_sge = MAX_TX_WQE_SGL_ENTRIES;
+ props->max_recv_sge = MAX_RX_WQE_SGL_ENTRIES;
+
+ return 0;
+}
+
+static int mana_ib_query_port(struct ib_device *ibdev, u32 port,
+ struct ib_port_attr *props)
+{
+ /* This version doesn't return port properties */
+ return 0;
+}
+
+static int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
+ union ib_gid *gid)
+{
+ /* This version doesn't return GID properties */
+ return 0;
+}
+
+static void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+}
+
+static const struct ib_device_ops mana_ib_dev_ops = {
+ .owner = THIS_MODULE,
+ .driver_id = RDMA_DRIVER_MANA,
+ .uverbs_abi_ver = MANA_IB_UVERBS_ABI_VERSION,
+
+ .alloc_pd = mana_ib_alloc_pd,
+ .alloc_ucontext = mana_ib_alloc_ucontext,
+ .create_cq = mana_ib_create_cq,
+ .create_qp = mana_ib_create_qp,
+ .create_rwq_ind_table = mana_ib_create_rwq_ind_table,
+ .create_wq = mana_ib_create_wq,
+ .dealloc_pd = mana_ib_dealloc_pd,
+ .dealloc_ucontext = mana_ib_dealloc_ucontext,
+ .dereg_mr = mana_ib_dereg_mr,
+ .destroy_cq = mana_ib_destroy_cq,
+ .destroy_qp = mana_ib_destroy_qp,
+ .destroy_rwq_ind_table = mana_ib_destroy_rwq_ind_table,
+ .destroy_wq = mana_ib_destroy_wq,
+ .disassociate_ucontext = mana_ib_disassociate_ucontext,
+ .get_port_immutable = mana_ib_get_port_immutable,
+ .mmap = mana_ib_mmap,
+ .modify_qp = mana_ib_modify_qp,
+ .modify_wq = mana_ib_modify_wq,
+ .query_device = mana_ib_query_device,
+ .query_gid = mana_ib_query_gid,
+ .query_port = mana_ib_query_port,
+ .reg_user_mr = mana_ib_reg_user_mr,
+
+ INIT_RDMA_OBJ_SIZE(ib_cq, mana_ib_cq, ibcq),
+ INIT_RDMA_OBJ_SIZE(ib_pd, mana_ib_pd, ibpd),
+ INIT_RDMA_OBJ_SIZE(ib_qp, mana_ib_qp, ibqp),
+ INIT_RDMA_OBJ_SIZE(ib_ucontext, mana_ib_ucontext, ibucontext),
+ INIT_RDMA_OBJ_SIZE(ib_rwq_ind_table, mana_ib_rwq_ind_table,
+ ib_ind_table),
+};
+
+static int mana_ib_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+{
+ struct mana_adev *madev = container_of(adev, struct mana_adev, adev);
+ struct gdma_dev *mdev = madev->mdev;
+ struct mana_context *mc;
+ struct mana_ib_dev *dev;
+ int ret = 0;
+
+ mc = mdev->driver_data;
+
+ dev = ib_alloc_device(mana_ib_dev, ib_dev);
+ if (!dev)
+ return -ENOMEM;
+
+ ib_set_device_ops(&dev->ib_dev, &mana_ib_dev_ops);
+
+ dev->ib_dev.phys_port_cnt = mc->num_ports;
+
+ ibdev_dbg(&dev->ib_dev, "mdev=%p id=%d num_ports=%d\n", mdev,
+ mdev->dev_id.as_uint32, dev->ib_dev.phys_port_cnt);
+
+ dev->gdma_dev = mdev;
+ dev->ib_dev.node_type = RDMA_NODE_IB_CA;
+
+ /* num_comp_vectors needs to be set to the max MSIX index
+ * when interrupts and event queues are implemented
+ */
+ dev->ib_dev.num_comp_vectors = 1;
+ dev->ib_dev.dev.parent = mdev->gdma_context->dev;
+
+ ret = ib_register_device(&dev->ib_dev, "mana_%d",
+ mdev->gdma_context->dev);
+ if (ret) {
+ ib_dealloc_device(&dev->ib_dev);
+ return ret;
+ }
+
+ dev_set_drvdata(&adev->dev, dev);
+
+ return 0;
+}
+
+static void mana_ib_remove(struct auxiliary_device *adev)
+{
+ struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev);
+
+ ib_unregister_device(&dev->ib_dev);
+ ib_dealloc_device(&dev->ib_dev);
+}
+
+static const struct auxiliary_device_id mana_id_table[] = {
+ {
+ .name = "mana.rdma",
+ },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mana_id_table);
+
+static struct auxiliary_driver mana_driver = {
+ .name = "rdma",
+ .probe = mana_ib_probe,
+ .remove = mana_ib_remove,
+ .id_table = mana_id_table,
+};
+
+static int __init mana_ib_init(void)
+{
+ return auxiliary_driver_register(&mana_driver);
+}
+
+static void __exit mana_ib_cleanup(void)
+{
+ auxiliary_driver_unregister(&mana_driver);
+}
+
+module_init(mana_ib_init);
+module_exit(mana_ib_cleanup);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
new file mode 100644
index 000000000000..d3d42b11e95f
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2022 Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef _MANA_IB_H_
+#define _MANA_IB_H_
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_mad.h>
+#include <rdma/ib_umem.h>
+#include <linux/auxiliary_bus.h>
+#include <rdma/mana-abi.h>
+
+#include <net/mana/mana.h>
+
+#define PAGE_SZ_BM \
+ (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K | SZ_256K | \
+ SZ_512K | SZ_1M | SZ_2M)
+
+/* MANA doesn't have any limit for MR size */
+#define MANA_IB_MAX_MR_SIZE ((u64)(~(0ULL)))
+
+struct mana_ib_dev {
+ struct ib_device ib_dev;
+ struct gdma_dev *gdma_dev;
+};
+
+struct mana_ib_wq {
+ struct ib_wq ibwq;
+ struct ib_umem *umem;
+ int wqe;
+ u32 wq_buf_size;
+ u64 gdma_region;
+ u64 id;
+ mana_handle_t rx_object;
+};
+
+struct mana_ib_pd {
+ struct ib_pd ibpd;
+ u32 pdn;
+ mana_handle_t pd_handle;
+
+ /* Mutex for sharing access to vport_use_count */
+ struct mutex vport_mutex;
+ int vport_use_count;
+
+ bool tx_shortform_allowed;
+ u32 tx_vp_offset;
+};
+
+struct mana_ib_mr {
+ struct ib_mr ibmr;
+ struct ib_umem *umem;
+ mana_handle_t mr_handle;
+};
+
+struct mana_ib_cq {
+ struct ib_cq ibcq;
+ struct ib_umem *umem;
+ int cqe;
+ u64 gdma_region;
+ u64 id;
+};
+
+struct mana_ib_qp {
+ struct ib_qp ibqp;
+
+ /* Work queue info */
+ struct ib_umem *sq_umem;
+ int sqe;
+ u64 sq_gdma_region;
+ u64 sq_id;
+ mana_handle_t tx_object;
+
+ /* The port on the IB device, starting with 1 */
+ u32 port;
+};
+
+struct mana_ib_ucontext {
+ struct ib_ucontext ibucontext;
+ u32 doorbell;
+};
+
+struct mana_ib_rwq_ind_table {
+ struct ib_rwq_ind_table ib_ind_table;
+};
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem,
+ mana_handle_t *gdma_region, u64 page_sz);
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev,
+ mana_handle_t gdma_region);
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+ struct ib_wq_init_attr *init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+ u32 wq_attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata);
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+ struct ib_rwq_ind_table_init_attr *init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl);
+
+struct ib_mr *mana_ib_get_dma_mr(struct ib_pd *ibpd, int access_flags);
+
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+ u64 iova, int access_flags,
+ struct ib_udata *udata);
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
+
+int mana_ib_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *qp_init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port_id,
+ struct mana_ib_pd *pd, u32 doorbell_id);
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
+ u32 port);
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+ struct ib_udata *udata);
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+ enum gdma_pd_flags flags);
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle);
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+ struct gdma_create_mr_params *mr_params);
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, mana_handle_t mr_handle);
+#endif
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
new file mode 100644
index 000000000000..962e40f2de53
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+#define VALID_MR_FLAGS \
+ (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ)
+
+static enum gdma_mr_access_flags
+mana_ib_verbs_to_gdma_access_flags(int access_flags)
+{
+ enum gdma_mr_access_flags flags = GDMA_ACCESS_FLAG_LOCAL_READ;
+
+ if (access_flags & IB_ACCESS_LOCAL_WRITE)
+ flags |= GDMA_ACCESS_FLAG_LOCAL_WRITE;
+
+ if (access_flags & IB_ACCESS_REMOTE_WRITE)
+ flags |= GDMA_ACCESS_FLAG_REMOTE_WRITE;
+
+ if (access_flags & IB_ACCESS_REMOTE_READ)
+ flags |= GDMA_ACCESS_FLAG_REMOTE_READ;
+
+ return flags;
+}
+
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
+ u64 iova, int access_flags,
+ struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct gdma_create_mr_params mr_params = {};
+ struct ib_device *ibdev = ibpd->device;
+ gdma_obj_handle_t dma_region_handle;
+ struct mana_ib_dev *dev;
+ struct mana_ib_mr *mr;
+ u64 page_sz;
+ int err;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ ibdev_dbg(ibdev,
+ "start 0x%llx, iova 0x%llx, length 0x%llx, access_flags 0x%x\n",
+ start, iova, length, access_flags);
+
+ if (access_flags & ~VALID_MR_FLAGS)
+ return ERR_PTR(-EINVAL);
+
+ mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ mr->umem = ib_umem_get(ibdev, start, length, access_flags);
+ if (IS_ERR(mr->umem)) {
+ err = PTR_ERR(mr->umem);
+ ibdev_dbg(ibdev,
+ "Failed to get umem for register user-mr, %d\n", err);
+ goto err_free;
+ }
+
+ page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
+ if (unlikely(!page_sz)) {
+ ibdev_err(ibdev, "Failed to get best page size\n");
+ err = -EOPNOTSUPP;
+ goto err_umem;
+ }
+ ibdev_dbg(ibdev, "Page size chosen %llu\n", page_sz);
+
+ err = mana_ib_gd_create_dma_region(dev, mr->umem, &dma_region_handle,
+ page_sz);
+ if (err) {
+ ibdev_err(ibdev, "Failed to create dma region for user-mr, %d\n",
+ err);
+ goto err_umem;
+ }
+
+ ibdev_dbg(ibdev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n", err,
+ dma_region_handle);
+
+ mr_params.pd_handle = pd->pd_handle;
+ mr_params.mr_type = GDMA_MR_TYPE_GVA;
+ mr_params.gva.dma_region_handle = dma_region_handle;
+ mr_params.gva.virtual_address = iova;
+ mr_params.gva.access_flags =
+ mana_ib_verbs_to_gdma_access_flags(access_flags);
+
+ err = mana_ib_gd_create_mr(dev, mr, &mr_params);
+ if (err)
+ goto err_dma_region;
+
+ /* There is no need to keep track of dma_region_handle after MR is
+ * successfully created. The dma_region_handle is tracked in the PF
+ * as part of the lifecycle of this MR.
+ */
+
+ mr->ibmr.length = length;
+ mr->ibmr.page_size = page_sz;
+ return &mr->ibmr;
+
+err_dma_region:
+ mana_gd_destroy_dma_region(dev->gdma_dev->gdma_context,
+ dma_region_handle);
+
+err_umem:
+ ib_umem_release(mr->umem);
+
+err_free:
+ kfree(mr);
+ return ERR_PTR(err);
+}
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+ struct mana_ib_mr *mr = container_of(ibmr, struct mana_ib_mr, ibmr);
+ struct ib_device *ibdev = ibmr->device;
+ struct mana_ib_dev *dev;
+ int err;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
+ if (err)
+ return err;
+
+ if (mr->umem)
+ ib_umem_release(mr->umem);
+
+ kfree(mr);
+
+ return 0;
+}
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
new file mode 100644
index 000000000000..75100674f1cf
--- /dev/null
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
+ struct net_device *ndev,
+ mana_handle_t default_rxobj,
+ mana_handle_t ind_table[],
+ u32 log_ind_tbl_size, u32 rx_hash_key_len,
+ u8 *rx_hash_key)
+{
+ struct mana_port_context *mpc = netdev_priv(ndev);
+ struct mana_cfg_rx_steer_req *req = NULL;
+ struct mana_cfg_rx_steer_resp resp = {};
+ mana_handle_t *req_indir_tab;
+ struct gdma_context *gc;
+ struct gdma_dev *mdev;
+ u32 req_buf_size;
+ int i, err;
+
+ mdev = dev->gdma_dev;
+ gc = mdev->gdma_context;
+
+ req_buf_size =
+ sizeof(*req) + sizeof(mana_handle_t) * MANA_INDIRECT_TABLE_SIZE;
+ req = kzalloc(req_buf_size, GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+
+ mana_gd_init_req_hdr(&req->hdr, MANA_CONFIG_VPORT_RX, req_buf_size,
+ sizeof(resp));
+
+ req->vport = mpc->port_handle;
+ req->rx_enable = 1;
+ req->update_default_rxobj = 1;
+ req->default_rxobj = default_rxobj;
+ req->hdr.dev_id = mdev->dev_id;
+
+ /* If there is more than one entry in the indirection table, enable RSS */
+ if (log_ind_tbl_size)
+ req->rss_enable = true;
+
+ req->num_indir_entries = MANA_INDIRECT_TABLE_SIZE;
+ req->indir_tab_offset = sizeof(*req);
+ req->update_indir_tab = true;
+
+ req_indir_tab = (mana_handle_t *)(req + 1);
+ /* The ind table passed to the hardware must have
+ * MANA_INDIRECT_TABLE_SIZE entries. Repeat the verbs
+ * ind_table entries to fill MANA_INDIRECT_TABLE_SIZE if required
+ */
+ ibdev_dbg(&dev->ib_dev, "ind table size %u\n", 1 << log_ind_tbl_size);
+ for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+ req_indir_tab[i] = ind_table[i % (1 << log_ind_tbl_size)];
+ ibdev_dbg(&dev->ib_dev, "index %u handle 0x%llx\n", i,
+ req_indir_tab[i]);
+ }
+
+ req->update_hashkey = true;
+ if (rx_hash_key_len)
+ memcpy(req->hashkey, rx_hash_key, rx_hash_key_len);
+ else
+ netdev_rss_key_fill(req->hashkey, MANA_HASH_KEY_SIZE);
+
+ ibdev_dbg(&dev->ib_dev, "vport handle %llu default_rxobj 0x%llx\n",
+ req->vport, default_rxobj);
+
+ err = mana_gd_send_request(gc, req_buf_size, req, sizeof(resp), &resp);
+ if (err) {
+ netdev_err(ndev, "Failed to configure vPort RX: %d\n", err);
+ goto out;
+ }
+
+ if (resp.hdr.status) {
+ netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
+ resp.hdr.status);
+ err = -EPROTO;
+ goto out;
+ }
+
+ netdev_info(ndev, "Configured steering vPort %llu log_entries %u\n",
+ mpc->port_handle, log_ind_tbl_size);
+
+out:
+ kfree(req);
+ return err;
+}
+
+static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
+ struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ struct mana_ib_dev *mdev =
+ container_of(pd->device, struct mana_ib_dev, ib_dev);
+ struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
+ struct mana_ib_create_qp_rss_resp resp = {};
+ struct mana_ib_create_qp_rss ucmd = {};
+ struct gdma_dev *gd = mdev->gdma_dev;
+ mana_handle_t *mana_ind_table;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_cq *cq;
+ struct mana_ib_wq *wq;
+ struct ib_cq *ibcq;
+ struct ib_wq *ibwq;
+ int i = 0, ret;
+ u32 port;
+
+ mc = gd->driver_data;
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (ret) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy from udata for create rss-qp, err %d\n",
+ ret);
+ return -EFAULT;
+ }
+
+ if (attr->cap.max_recv_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_recv_wr %d exceeding limit.\n",
+ attr->cap.max_recv_wr);
+ return -EINVAL;
+ }
+
+ if (attr->cap.max_recv_sge > MAX_RX_WQE_SGL_ENTRIES) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_recv_sge %d exceeding limit.\n",
+ attr->cap.max_recv_sge);
+ return -EINVAL;
+ }
+
+ if (ucmd.rx_hash_function != MANA_IB_RX_HASH_FUNC_TOEPLITZ) {
+ ibdev_dbg(&mdev->ib_dev,
+ "RX Hash function is not supported, %d\n",
+ ucmd.rx_hash_function);
+ return -EINVAL;
+ }
+
+ /* IB ports start with 1, MANA Ethernet ports start with 0 */
+ port = ucmd.port;
+ if (port < 1 || port > mc->num_ports) {
+ ibdev_dbg(&mdev->ib_dev, "Invalid port %u in creating qp\n",
+ port);
+ return -EINVAL;
+ }
+ ndev = mc->ports[port - 1];
+ mpc = netdev_priv(ndev);
+
+ ibdev_dbg(&mdev->ib_dev, "rx_hash_function %d port %d\n",
+ ucmd.rx_hash_function, port);
+
+ mana_ind_table = kcalloc(1 << ind_tbl->log_ind_tbl_size,
+ sizeof(mana_handle_t), GFP_KERNEL);
+ if (!mana_ind_table) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ qp->port = port;
+
+ for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+ struct mana_obj_spec wq_spec = {};
+ struct mana_obj_spec cq_spec = {};
+
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+
+ ibcq = ibwq->cq;
+ cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+
+ wq_spec.gdma_region = wq->gdma_region;
+ wq_spec.queue_size = wq->wq_buf_size;
+
+ cq_spec.gdma_region = cq->gdma_region;
+ cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
+ cq_spec.modr_ctx_id = 0;
+ cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+ ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
+ &wq_spec, &cq_spec, &wq->rx_object);
+ if (ret)
+ goto fail;
+
+ /* The GDMA regions are now owned by the WQ object */
+ wq->gdma_region = GDMA_INVALID_DMA_REGION;
+ cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+ wq->id = wq_spec.queue_index;
+ cq->id = cq_spec.queue_index;
+
+ ibdev_dbg(&mdev->ib_dev,
+ "ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
+ ret, wq->rx_object, wq->id, cq->id);
+
+ resp.entries[i].cqid = cq->id;
+ resp.entries[i].wqid = wq->id;
+
+ mana_ind_table[i] = wq->rx_object;
+ }
+ resp.num_entries = i;
+
+ ret = mana_ib_cfg_vport_steering(mdev, ndev, wq->rx_object,
+ mana_ind_table,
+ ind_tbl->log_ind_tbl_size,
+ ucmd.rx_hash_key_len,
+ ucmd.rx_hash_key);
+ if (ret)
+ goto fail;
+
+ if (udata) {
+ ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ if (ret) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy to udata create rss-qp, %d\n",
+ ret);
+ goto fail;
+ }
+ }
+
+ /* Free the table only after the last "goto fail", which frees it too */
+ kfree(mana_ind_table);
+
+ return 0;
+
+fail:
+ while (i-- > 0) {
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+ }
+
+ kfree(mana_ind_table);
+
+ return ret;
+}
+
+static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
+ struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ struct mana_ib_dev *mdev =
+ container_of(ibpd->device, struct mana_ib_dev, ib_dev);
+ struct mana_ib_cq *send_cq =
+ container_of(attr->send_cq, struct mana_ib_cq, ibcq);
+ struct ib_ucontext *ib_ucontext = ibpd->uobject->context;
+ struct mana_ib_create_qp_resp resp = {};
+ struct mana_ib_ucontext *mana_ucontext;
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct mana_ib_create_qp ucmd = {};
+ struct mana_obj_spec wq_spec = {};
+ struct mana_obj_spec cq_spec = {};
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct ib_umem *umem;
+ int err;
+ u32 port;
+
+ mana_ucontext =
+ container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
+ mc = gd->driver_data;
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy from udata create qp-raw, %d\n", err);
+ return -EFAULT;
+ }
+
+ /* IB ports start with 1, MANA Ethernet ports start with 0 */
+ port = ucmd.port;
+ if (port < 1 || port > mc->num_ports)
+ return -EINVAL;
+
+ if (attr->cap.max_send_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_send_wr %d exceeding limit\n",
+ attr->cap.max_send_wr);
+ return -EINVAL;
+ }
+
+ if (attr->cap.max_send_sge > MAX_TX_WQE_SGL_ENTRIES) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_send_sge %d exceeding limit\n",
+ attr->cap.max_send_sge);
+ return -EINVAL;
+ }
+
+ ndev = mc->ports[port - 1];
+ mpc = netdev_priv(ndev);
+ ibdev_dbg(&mdev->ib_dev, "port %u ndev %p mpc %p\n", port, ndev, mpc);
+
+ err = mana_ib_cfg_vport(mdev, port - 1, pd, mana_ucontext->doorbell);
+ if (err)
+ return -ENODEV;
+
+ qp->port = port;
+
+ ibdev_dbg(&mdev->ib_dev, "ucmd sq_buf_addr 0x%llx port %u\n",
+ ucmd.sq_buf_addr, ucmd.port);
+
+ umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem)) {
+ err = PTR_ERR(umem);
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to get umem for create qp-raw, err %d\n",
+ err);
+ goto err_free_vport;
+ }
+ qp->sq_umem = umem;
+
+ err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem,
+ &qp->sq_gdma_region, PAGE_SIZE);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create dma region for create qp-raw, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(&mdev->ib_dev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, qp->sq_gdma_region);
+
+ /* Create a WQ on the same port handle used by the Ethernet driver */
+ wq_spec.gdma_region = qp->sq_gdma_region;
+ wq_spec.queue_size = ucmd.sq_buf_size;
+
+ cq_spec.gdma_region = send_cq->gdma_region;
+ cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
+ cq_spec.modr_ctx_id = 0;
+ cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+ err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
+ &cq_spec, &qp->tx_object);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create wq for create raw-qp, err %d\n",
+ err);
+ goto err_destroy_dma_region;
+ }
+
+ /* The GDMA regions are now owned by the WQ object */
+ qp->sq_gdma_region = GDMA_INVALID_DMA_REGION;
+ send_cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+ qp->sq_id = wq_spec.queue_index;
+ send_cq->id = cq_spec.queue_index;
+
+ ibdev_dbg(&mdev->ib_dev,
+ "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
+ qp->tx_object, qp->sq_id, send_cq->id);
+
+ resp.sqid = qp->sq_id;
+ resp.cqid = send_cq->id;
+ resp.tx_vp_offset = pd->tx_vp_offset;
+
+ if (udata) {
+ err = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy to udata for create qp-raw, %d\n",
+ err);
+ goto err_destroy_wq_obj;
+ }
+ }
+
+ return 0;
+
+err_destroy_wq_obj:
+ mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+err_destroy_dma_region:
+ mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+
+err_release_umem:
+ ib_umem_release(umem);
+
+err_free_vport:
+ mana_ib_uncfg_vport(mdev, pd, port - 1);
+
+ return err;
+}
+
+int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ switch (attr->qp_type) {
+ case IB_QPT_RAW_PACKET:
+ /* When rwq_ind_tbl is used, it's for creating WQs for RSS */
+ if (attr->rwq_ind_tbl)
+ return mana_ib_create_qp_rss(ibqp, ibqp->pd, attr,
+ udata);
+
+ return mana_ib_create_qp_raw(ibqp, ibqp->pd, attr, udata);
+ default:
+ /* Creating QPs other than IB_QPT_RAW_PACKET is not supported */
+ ibdev_dbg(ibqp->device, "Creating QP type %u not supported\n",
+ attr->qp_type);
+ }
+
+ return -EINVAL;
+}
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_udata *udata)
+{
+ /* modify_qp is not supported by this version of the driver */
+ return -EOPNOTSUPP;
+}
+
+static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
+ struct ib_rwq_ind_table *ind_tbl,
+ struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_wq *wq;
+ struct ib_wq *ibwq;
+ int i;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[qp->port - 1];
+ mpc = netdev_priv(ndev);
+
+ for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ ibdev_dbg(&mdev->ib_dev, "destroying wq->rx_object %llu\n",
+ wq->rx_object);
+ mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+ }
+
+ return 0;
+}
+
+static int mana_ib_destroy_qp_raw(struct mana_ib_qp *qp, struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct ib_pd *ibpd = qp->ibqp.pd;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_pd *pd;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[qp->port - 1];
+ mpc = netdev_priv(ndev);
+ pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+
+ mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+ if (qp->sq_umem) {
+ mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+ ib_umem_release(qp->sq_umem);
+ }
+
+ mana_ib_uncfg_vport(mdev, pd, qp->port - 1);
+
+ return 0;
+}
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+
+ switch (ibqp->qp_type) {
+ case IB_QPT_RAW_PACKET:
+ if (ibqp->rwq_ind_tbl)
+ return mana_ib_destroy_qp_rss(qp, ibqp->rwq_ind_tbl,
+ udata);
+
+ return mana_ib_destroy_qp_raw(qp, udata);
+
+ default:
+ ibdev_dbg(ibqp->device, "Unexpected QP type %u\n",
+ ibqp->qp_type);
+ }
+
+ return -ENOENT;
+}
diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
new file mode 100644
index 000000000000..a11d0ae35ff7
--- /dev/null
+++ b/drivers/infiniband/hw/mana/wq.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+ struct ib_wq_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(pd->device, struct mana_ib_dev, ib_dev);
+ struct mana_ib_create_wq ucmd = {};
+ struct mana_ib_wq *wq;
+ struct ib_umem *umem;
+ int err;
+
+ if (udata->inlen < sizeof(ucmd))
+ return ERR_PTR(-EINVAL);
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy from udata for create wq, %d\n", err);
+ return ERR_PTR(-EFAULT);
+ }
+
+ wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+ if (!wq)
+ return ERR_PTR(-ENOMEM);
+
+ ibdev_dbg(&mdev->ib_dev, "ucmd wq_buf_addr 0x%llx\n", ucmd.wq_buf_addr);
+
+ umem = ib_umem_get(pd->device, ucmd.wq_buf_addr, ucmd.wq_buf_size,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem)) {
+ err = PTR_ERR(umem);
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to get umem for create wq, err %d\n", err);
+ goto err_free_wq;
+ }
+
+ wq->umem = umem;
+ wq->wqe = init_attr->max_wr;
+ wq->wq_buf_size = ucmd.wq_buf_size;
+ wq->rx_object = INVALID_MANA_HANDLE;
+
+ err = mana_ib_gd_create_dma_region(mdev, wq->umem, &wq->gdma_region,
+ PAGE_SIZE);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create dma region for create wq, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(&mdev->ib_dev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, wq->gdma_region);
+
+ /* The WQ ID is returned when the WQ object is created; it is not known yet */
+
+ return &wq->ibwq;
+
+err_release_umem:
+ ib_umem_release(umem);
+
+err_free_wq:
+ kfree(wq);
+
+ return ERR_PTR(err);
+}
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+ u32 wq_attr_mask, struct ib_udata *udata)
+{
+ /* modify_wq is not supported by this version of the driver */
+ return -EOPNOTSUPP;
+}
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata)
+{
+ struct mana_ib_wq *wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ struct ib_device *ib_dev = ibwq->device;
+ struct mana_ib_dev *mdev;
+
+ mdev = container_of(ib_dev, struct mana_ib_dev, ib_dev);
+
+ mana_ib_gd_destroy_dma_region(mdev, wq->gdma_region);
+ ib_umem_release(wq->umem);
+
+ kfree(wq);
+
+ return 0;
+}
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+ struct ib_rwq_ind_table_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ /* There is no additional data in ind_table to be maintained by this
+ * driver, do nothing
+ */
+ return 0;
+}
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl)
+{
+ /* There is no additional data in ind_table to be maintained by this
+ * driver, do nothing
+ */
+ return 0;
+}
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3a0bc6e0b730..1ff6e0d07cfd 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -406,6 +406,9 @@ int mana_bpf(struct net_device *ndev, struct netdev_bpf *bpf);
extern const struct ethtool_ops mana_ethtool_ops;
+/* A CQ can be created without being associated with any EQ */
+#define GDMA_CQ_NO_EQ 0xffff
+
struct mana_obj_spec {
u32 queue_index;
u64 gdma_region;
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..081aabf536dc 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -250,6 +250,7 @@ enum rdma_driver_id {
RDMA_DRIVER_QIB,
RDMA_DRIVER_EFA,
RDMA_DRIVER_SIW,
+ RDMA_DRIVER_MANA,
};
enum ib_uverbs_gid_type {
diff --git a/include/uapi/rdma/mana-abi.h b/include/uapi/rdma/mana-abi.h
new file mode 100644
index 000000000000..559c49b72e0d
--- /dev/null
+++ b/include/uapi/rdma/mana-abi.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef MANA_ABI_USER_H
+#define MANA_ABI_USER_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_ioctl_verbs.h>
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+
+#define MANA_IB_UVERBS_ABI_VERSION 1
+
+struct mana_ib_create_cq {
+ __aligned_u64 buf_addr;
+};
+
+struct mana_ib_create_qp {
+ __aligned_u64 sq_buf_addr;
+ __u32 sq_buf_size;
+ __u32 port;
+};
+
+struct mana_ib_create_qp_resp {
+ __u32 sqid;
+ __u32 cqid;
+ __u32 tx_vp_offset;
+ __u32 reserved;
+};
+
+struct mana_ib_create_wq {
+ __aligned_u64 wq_buf_addr;
+ __u32 wq_buf_size;
+ __u32 reserved;
+};
+
+/* RX Hash function flags */
+enum mana_ib_rx_hash_function_flags {
+ MANA_IB_RX_HASH_FUNC_TOEPLITZ = 1 << 0,
+};
+
+struct mana_ib_create_qp_rss {
+ __aligned_u64 rx_hash_fields_mask;
+ __u8 rx_hash_function;
+ __u8 reserved[7];
+ __u32 rx_hash_key_len;
+ __u8 rx_hash_key[40];
+ __u32 port;
+};
+
+struct rss_resp_entry {
+ __u32 cqid;
+ __u32 wqid;
+};
+
+struct mana_ib_create_qp_rss_resp {
+ __aligned_u64 num_entries;
+ struct rss_resp_entry entries[64];
+};
+
+#endif
--
2.17.1
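
For readers following the RSS path: the indirection-table handling in
mana_ib_cfg_vport_steering() can be sketched in plain user-space C as
below. This is only an illustration, not driver code. The names
expand_ind_table and HW_INDIR_TABLE_SIZE are hypothetical, and the value
64 is an assumed stand-in for MANA_INDIRECT_TABLE_SIZE. The sketch repeats
a verbs table of 2^log_size handles until the fixed-size hardware table
is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-in for MANA_INDIRECT_TABLE_SIZE (hypothetical value). */
#define HW_INDIR_TABLE_SIZE 64

/*
 * Fill the fixed-size hardware table by cycling through the
 * 2^log_size entries of the verbs indirection table.
 */
static void expand_ind_table(uint64_t *hw_tab, const uint64_t *ind_table,
			     uint32_t log_size)
{
	/* Guard the shift below against undefined behaviour. */
	assert(log_size < 31);

	uint32_t n = 1u << log_size;

	for (size_t i = 0; i < HW_INDIR_TABLE_SIZE; i++)
		hw_tab[i] = ind_table[i % n];
}
```

With a four-entry verbs table (log_size = 2), hardware slot 5 receives
verbs entry 1 and slot 63 receives verbs entry 3.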
Hello Maintainers,
Any idea when these patches will make it into the next kernel release?
Ajay
-----Original Message-----
From: [email protected] <[email protected]>
Sent: Wednesday, June 15, 2022 9:07 PM
To: KY Srinivasan <[email protected]>; Haiyang Zhang <[email protected]>; Stephen Hemminger <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Jason Gunthorpe <[email protected]>; Leon Romanovsky <[email protected]>; [email protected]; [email protected]; Ajay Sharma <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]; Long Li <[email protected]>
Subject: [EXTERNAL] [Patch v4 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
From: Long Li <[email protected]>
Add an RDMA VF driver for Microsoft Azure Network Adapter (MANA).
Signed-off-by: Long Li <[email protected]>
---
Change log:
v2:
Changed coding styles/formats
Checked undersize for udata length
Changed all logging to use ibdev_xxx()
Avoided page array copy when doing MR
Sorted driver ops
Fixed warnings reported by kernel test robot <[email protected]>
v3:
More coding style/format changes
v4:
Process error on hardware vport configuration
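
As an illustration of the v2 item "Checked undersize for udata length",
the pattern amounts to rejecting a user buffer that is shorter than the
command structure before copying from it. The following is a minimal
user-space sketch, not the driver code. The names demo_create_wq and
demo_copy_cmd are hypothetical, and memcpy stands in for
ib_copy_from_udata().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a driver user command structure. */
struct demo_create_wq {
	uint64_t wq_buf_addr;
	uint32_t wq_buf_size;
	uint32_t reserved;
};

/*
 * Copy a command from a caller-supplied buffer of inlen bytes.
 * Returns 0 on success, or -EINVAL if the buffer is too short
 * to hold the whole command.
 */
static int demo_copy_cmd(struct demo_create_wq *cmd, const void *in,
			 size_t inlen)
{
	assert(cmd != NULL && in != NULL);

	if (inlen < sizeof(*cmd))
		return -EINVAL;

	memcpy(cmd, in, sizeof(*cmd));
	return 0;
}
```

Checking the length first means an undersized request fails cleanly
instead of leaving part of the command zero-filled.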
MAINTAINERS | 3 +
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/hw/Makefile | 1 +
drivers/infiniband/hw/mana/Kconfig | 7 +
drivers/infiniband/hw/mana/Makefile | 4 +
drivers/infiniband/hw/mana/cq.c | 80 +++
drivers/infiniband/hw/mana/main.c | 681 ++++++++++++++++++++++++
drivers/infiniband/hw/mana/mana_ib.h | 145 +++++
drivers/infiniband/hw/mana/mr.c | 133 +++++
drivers/infiniband/hw/mana/qp.c | 501 +++++++++++++++++
drivers/infiniband/hw/mana/wq.c | 114 ++++
include/net/mana/mana.h | 3 +
include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
include/uapi/rdma/mana-abi.h | 66 +++
14 files changed, 1740 insertions(+)
create mode 100644 drivers/infiniband/hw/mana/Kconfig
create mode 100644 drivers/infiniband/hw/mana/Makefile
create mode 100644 drivers/infiniband/hw/mana/cq.c
create mode 100644 drivers/infiniband/hw/mana/main.c
create mode 100644 drivers/infiniband/hw/mana/mana_ib.h
create mode 100644 drivers/infiniband/hw/mana/mr.c
create mode 100644 drivers/infiniband/hw/mana/qp.c
create mode 100644 drivers/infiniband/hw/mana/wq.c
create mode 100644 include/uapi/rdma/mana-abi.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 51bec6d5076d..1bed8444786d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9078,6 +9078,7 @@ M: Haiyang Zhang <[email protected]>
M: Stephen Hemminger <[email protected]>
M: Wei Liu <[email protected]>
M: Dexuan Cui <[email protected]>
+M: Long Li <[email protected]>
L: [email protected]
S: Supported
T: git git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git
@@ -9095,6 +9096,7 @@ F: arch/x86/kernel/cpu/mshyperv.c
F: drivers/clocksource/hyperv_timer.c
F: drivers/hid/hid-hyperv.c
F: drivers/hv/
+F: drivers/infiniband/hw/mana/
F: drivers/input/serio/hyperv-keyboard.c
F: drivers/iommu/hyperv-iommu.c
F: drivers/net/ethernet/microsoft/
@@ -9110,6 +9112,7 @@ F: include/clocksource/hyperv_timer.h
F: include/linux/hyperv.h
F: include/net/mana
F: include/uapi/linux/hyperv.h
+F: include/uapi/rdma/mana-abi.h
F: net/vmw_vsock/hyperv_transport.c
F: tools/hv/
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 33d3ce9c888e..a062c662ecff 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -83,6 +83,7 @@ source "drivers/infiniband/hw/qib/Kconfig"
source "drivers/infiniband/hw/cxgb4/Kconfig"
source "drivers/infiniband/hw/efa/Kconfig"
source "drivers/infiniband/hw/irdma/Kconfig"
+source "drivers/infiniband/hw/mana/Kconfig"
source "drivers/infiniband/hw/mlx4/Kconfig"
source "drivers/infiniband/hw/mlx5/Kconfig"
source "drivers/infiniband/hw/ocrdma/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index fba0b3be903e..f62e9e00c780 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -4,6 +4,7 @@ obj-$(CONFIG_INFINIBAND_QIB) += qib/
obj-$(CONFIG_INFINIBAND_CXGB4) += cxgb4/
obj-$(CONFIG_INFINIBAND_EFA) += efa/
obj-$(CONFIG_INFINIBAND_IRDMA) += irdma/
+obj-$(CONFIG_MANA_INFINIBAND) += mana/
obj-$(CONFIG_MLX4_INFINIBAND) += mlx4/
obj-$(CONFIG_MLX5_INFINIBAND) += mlx5/
obj-$(CONFIG_INFINIBAND_OCRDMA) += ocrdma/
diff --git a/drivers/infiniband/hw/mana/Kconfig b/drivers/infiniband/hw/mana/Kconfig
new file mode 100644
index 000000000000..b3ff03a23257
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Kconfig
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config MANA_INFINIBAND
+ tristate "Microsoft Azure Network Adapter support"
+ depends on NETDEVICES && ETHERNET && PCI && MICROSOFT_MANA
+ help
+ This driver provides low-level RDMA support for
+ Microsoft Azure Network Adapter (MANA).
diff --git a/drivers/infiniband/hw/mana/Makefile b/drivers/infiniband/hw/mana/Makefile
new file mode 100644
index 000000000000..a799fe264c5a
--- /dev/null
+++ b/drivers/infiniband/hw/mana/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_MANA_INFINIBAND) += mana_ib.o
+
+mana_ib-y := main.o wq.o qp.o cq.o mr.o
diff --git a/drivers/infiniband/hw/mana/cq.c b/drivers/infiniband/hw/mana/cq.c
new file mode 100644
index 000000000000..046fd290073d
--- /dev/null
+++ b/drivers/infiniband/hw/mana/cq.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+ struct ib_device *ibdev = ibcq->device;
+ struct mana_ib_create_cq ucmd = {};
+ struct mana_ib_dev *mdev;
+ int err;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(ibdev,
+ "Failed to copy from udata for create cq, %d\n", err);
+ return -EFAULT;
+ }
+
+ if (attr->cqe > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(ibdev, "CQE %d exceeding limit\n", attr->cqe);
+ return -EINVAL;
+ }
+
+ cq->cqe = attr->cqe;
+ cq->umem = ib_umem_get(ibdev, ucmd.buf_addr, cq->cqe * COMP_ENTRY_SIZE,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(cq->umem)) {
+ err = PTR_ERR(cq->umem);
+ ibdev_dbg(ibdev, "Failed to get umem for create cq, err %d\n",
+ err);
+ return err;
+ }
+
+ err = mana_ib_gd_create_dma_region(mdev, cq->umem, &cq->gdma_region,
+ PAGE_SIZE);
+ if (err) {
+ ibdev_err(ibdev,
+ "Failed to create dma region for create cq, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(ibdev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, cq->gdma_region);
+
+ /* The CQ ID is not known at this time
+ * The ID is generated at create_qp
+ */
+
+ return 0;
+
+err_release_umem:
+ ib_umem_release(cq->umem);
+ return err;
+}
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
+{
+ struct mana_ib_cq *cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+ struct ib_device *ibdev = ibcq->device;
+ struct mana_ib_dev *mdev;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ mana_ib_gd_destroy_dma_region(mdev, cq->gdma_region);
+ ib_umem_release(cq->umem);
+
+ return 0;
+}
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
new file mode 100644
index 000000000000..58254a0cf581
--- /dev/null
+++ b/drivers/infiniband/hw/mana/main.c
@@ -0,0 +1,681 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+MODULE_DESCRIPTION("Microsoft Azure Network Adapter IB driver");
+MODULE_LICENSE("Dual BSD/GPL");
+
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
+ u32 port)
+{
+ struct gdma_dev *gd = dev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct net_device *ndev;
+ struct mana_context *mc;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[port];
+ mpc = netdev_priv(ndev);
+
+ mutex_lock(&pd->vport_mutex);
+
+ pd->vport_use_count--;
+ WARN_ON(pd->vport_use_count < 0);
+
+ if (!pd->vport_use_count)
+ mana_uncfg_vport(mpc);
+
+ mutex_unlock(&pd->vport_mutex);
+}
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
+ u32 doorbell_id)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ int err;
+
+ mc = mdev->driver_data;
+ ndev = mc->ports[port];
+ mpc = netdev_priv(ndev);
+
+ mutex_lock(&pd->vport_mutex);
+
+ pd->vport_use_count++;
+ if (pd->vport_use_count > 1) {
+ ibdev_dbg(&dev->ib_dev,
+ "Skip as this PD is already configured vport\n");
+ mutex_unlock(&pd->vport_mutex);
+ return 0;
+ }
+
+ err = mana_cfg_vport(mpc, pd->pdn, doorbell_id);
+ if (err) {
+ pd->vport_use_count--;
+ mutex_unlock(&pd->vport_mutex);
+
+ ibdev_err(&dev->ib_dev, "Failed to configure vPort %d\n", err);
+ return err;
+ }
+
+ mutex_unlock(&pd->vport_mutex);
+
+ pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
+ pd->tx_vp_offset = mpc->tx_vp_offset;
+
+ ibdev_dbg(&dev->ib_dev,
+ "vport handle %llx pdid %x doorbell_id %x "
+ "tx_shortform_allowed %d tx_vp_offset %u\n",
+ mpc->port_handle, pd->pdn, doorbell_id,
+ pd->tx_shortform_allowed, pd->tx_vp_offset);
+
+ return 0;
+}
+
+static int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ enum gdma_pd_flags flags = 0;
+ struct mana_ib_dev *dev;
+ int ret;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ /* Set flags if this is a kernel request */
+ if (!ibpd->uobject)
+ flags = GDMA_PD_FLAG_ALLOW_GPA_MR | GDMA_PD_FLAG_ALLOW_FMR_MR;
+
+ ret = mana_ib_gd_create_pd(dev, &pd->pd_handle, &pd->pdn, flags);
+ if (ret) {
+ ibdev_err(ibdev, "Failed to get pd id, err %d\n", ret);
+ return ret;
+ }
+
+ mutex_init(&pd->vport_mutex);
+ pd->vport_use_count = 0;
+ return 0;
+}
+
+static int mana_ib_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct ib_device *ibdev = ibpd->device;
+ struct mana_ib_dev *dev;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ return mana_ib_gd_destroy_pd(dev, pd->pd_handle);
+}
+
+static int mana_ib_alloc_ucontext(struct ib_ucontext *ibcontext,
+ struct ib_udata *udata)
+{
+ struct mana_ib_ucontext *ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ struct gdma_dev *dev;
+ int doorbell_page;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ dev = mdev->gdma_dev;
+ gc = dev->gdma_context;
+
+ /* Allocate a doorbell page index */
+ ret = mana_gd_allocate_doorbell_page(gc, &doorbell_page);
+ if (ret) {
+ ibdev_err(ibdev, "Failed to allocate doorbell page %d\n", ret);
+ return -ENOMEM;
+ }
+
+ ibdev_dbg(ibdev, "Doorbell page allocated %d\n", doorbell_page);
+
+ ucontext->doorbell = doorbell_page;
+
+ return 0;
+}
+
+static void mana_ib_dealloc_ucontext(struct ib_ucontext *ibcontext)
+{
+ struct mana_ib_ucontext *mana_ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gc = mdev->gdma_dev->gdma_context;
+
+ ret = mana_gd_destroy_doorbell_page(gc, mana_ucontext->doorbell);
+ if (ret)
+ ibdev_err(ibdev, "Failed to destroy doorbell page %d\n", ret);
+}
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem,
+ mana_handle_t *gdma_region, u64 page_sz)
+{
+ size_t num_pages_total = ib_umem_num_dma_blocks(umem, page_sz);
+ struct gdma_dma_region_add_pages_req *add_req = NULL;
+ struct gdma_create_dma_region_resp create_resp = {};
+ struct gdma_create_dma_region_req *create_req;
+ size_t num_pages_cur, num_pages_to_handle;
+ unsigned int create_req_msg_size;
+ struct hw_channel_context *hwc;
+ struct ib_block_iter biter;
+ size_t max_pgs_create_cmd;
+ struct gdma_context *gc;
+ struct gdma_dev *mdev;
+ unsigned int i;
+ int err;
+
+ mdev = dev->gdma_dev;
+ gc = mdev->gdma_context;
+ hwc = gc->hwc.driver_data;
+ max_pgs_create_cmd =
+ (hwc->max_req_msg_size - sizeof(*create_req)) / sizeof(u64);
+
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total, max_pgs_create_cmd);
+ create_req_msg_size =
+ struct_size(create_req, page_addr_list, num_pages_to_handle);
+
+ create_req = kzalloc(create_req_msg_size, GFP_KERNEL);
+ if (!create_req)
+ return -ENOMEM;
+
+ mana_gd_init_req_hdr(&create_req->hdr, GDMA_CREATE_DMA_REGION,
+ create_req_msg_size, sizeof(create_resp));
+
+ create_req->length = umem->length;
+ create_req->offset_in_page = umem->address & (page_sz - 1);
+ create_req->gdma_page_type = order_base_2(page_sz) - PAGE_SHIFT;
+ create_req->page_count = num_pages_total;
+ create_req->page_addr_list_len = num_pages_to_handle;
+
+ ibdev_dbg(&dev->ib_dev,
+ "size_dma_region %zu num_pages_total %zu, "
+ "page_sz 0x%llx offset_in_page %u\n",
+ umem->length, num_pages_total, page_sz,
+ create_req->offset_in_page);
+
+ ibdev_dbg(&dev->ib_dev, "num_pages_to_handle %zu, gdma_page_type %u\n",
+ num_pages_to_handle, create_req->gdma_page_type);
+
+ __rdma_umem_block_iter_start(&biter, umem, page_sz);
+
+ for (i = 0; i < num_pages_to_handle; ++i) {
+ dma_addr_t cur_addr;
+
+ __rdma_block_iter_next(&biter);
+ cur_addr = rdma_block_iter_dma_address(&biter);
+
+ create_req->page_addr_list[i] = cur_addr;
+
+ ibdev_dbg(&dev->ib_dev, "page num %u cur_addr 0x%llx\n", i,
+ cur_addr);
+ }
+
+ err = mana_gd_send_request(gc, create_req_msg_size, create_req,
+ sizeof(create_resp), &create_resp);
+ kfree(create_req);
+ if (err || create_resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to create DMA region: %d, 0x%x\n", err,
+ create_resp.hdr.status);
+ err = err ? err : -EPROTO;
+ goto error;
+ }
+
+ *gdma_region = create_resp.dma_region_handle;
+ ibdev_dbg(&dev->ib_dev, "Created DMA region with handle 0x%llx\n",
+ *gdma_region);
+
+ num_pages_cur = num_pages_to_handle;
+
+ if (num_pages_cur < num_pages_total) {
+ unsigned int add_req_msg_size;
+ size_t max_pgs_add_cmd =
+ (hwc->max_req_msg_size - sizeof(*add_req)) /
+ sizeof(u64);
+
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total - num_pages_cur,
+ max_pgs_add_cmd);
+
+ /* Calculate the max num of pages that will be handled */
+ add_req_msg_size = struct_size(add_req, page_addr_list,
+ num_pages_to_handle);
+
+ add_req = kmalloc(add_req_msg_size, GFP_KERNEL);
+ if (!add_req) {
+ err = -ENOMEM;
+ goto error;
+ }
+
+ while (num_pages_cur < num_pages_total) {
+ struct gdma_general_resp add_resp = {};
+ u32 expected_status = 0;
+
+ if (num_pages_cur + num_pages_to_handle <
+ num_pages_total) {
+ /* Status indicating more pages are needed */
+ expected_status = GDMA_STATUS_MORE_ENTRIES;
+ }
+
+ memset(add_req, 0, add_req_msg_size);
+
+ mana_gd_init_req_hdr(&add_req->hdr,
+ GDMA_DMA_REGION_ADD_PAGES,
+ add_req_msg_size,
+ sizeof(add_resp));
+ add_req->dma_region_handle = *gdma_region;
+ add_req->page_addr_list_len = num_pages_to_handle;
+
+ for (i = 0; i < num_pages_to_handle; ++i) {
+ dma_addr_t cur_addr;
+
+ __rdma_block_iter_next(&biter);
+ cur_addr = rdma_block_iter_dma_address(&biter);
+ add_req->page_addr_list[i] = cur_addr;
+ ibdev_dbg(&dev->ib_dev,
+ "page_addr_list %lu addr 0x%llx\n",
+ num_pages_cur + i, cur_addr);
+ }
+
+ err = mana_gd_send_request(gc, add_req_msg_size,
+ add_req, sizeof(add_resp),
+ &add_resp);
+ if (err || add_resp.hdr.status != expected_status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed put DMA pages %u: %d,0x%x\n",
+ i, err, add_resp.hdr.status);
+ err = -EPROTO;
+ goto free_req;
+ }
+
+ num_pages_cur += num_pages_to_handle;
+ num_pages_to_handle =
+ min_t(size_t, num_pages_total - num_pages_cur,
+ max_pgs_add_cmd);
+ add_req_msg_size = sizeof(*add_req) +
+ num_pages_to_handle * sizeof(u64);
+ }
+free_req:
+ kfree(add_req);
+ }
+
+error:
+ return err;
+}
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev, u64 gdma_region)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+
+ gc = mdev->gdma_context;
+ ibdev_dbg(&dev->ib_dev, "destroy dma region 0x%llx\n", gdma_region);
+
+ return mana_gd_destroy_dma_region(gc, gdma_region);
+}
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+ enum gdma_pd_flags flags)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_create_pd_resp resp = {};
+ struct gdma_create_pd_req req = {};
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_PD, sizeof(req),
+ sizeof(resp));
+
+ req.flags = flags;
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to get pd_id err %d status %u\n", err,
+ resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+
+ return err;
+ }
+
+ *pd_handle = resp.pd_handle;
+ *pd_id = resp.pd_id;
+ ibdev_dbg(&dev->ib_dev, "pd_handle 0x%llx pd_id %d\n", *pd_handle,
+ *pd_id);
+
+ return 0;
+}
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle)
+{
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_destory_pd_resp resp = {};
+ struct gdma_destroy_pd_req req = {};
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_PD, sizeof(req),
+ sizeof(resp));
+
+ req.pd_handle = pd_handle;
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev,
+ "Failed to destroy pd_handle 0x%llx err %d status %u\n",
+ pd_handle, err, resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+ }
+
+ return err;
+}
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+ struct gdma_create_mr_params *mr_params)
+{
+ struct gdma_create_mr_response resp = {};
+ struct gdma_create_mr_request req = {};
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_MR, sizeof(req),
+ sizeof(resp));
+ req.pd_handle = mr_params->pd_handle;
+
+ switch (mr_params->mr_type) {
+ case GDMA_MR_TYPE_GVA:
+ req.mr_type = GDMA_MR_TYPE_GVA;
+ req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
+ req.gva.virtual_address = mr_params->gva.virtual_address;
+ req.gva.access_flags = mr_params->gva.access_flags;
+ break;
+
+ case GDMA_MR_TYPE_GPA:
+ req.mr_type = GDMA_MR_TYPE_GPA;
+ req.gpa.access_flags = mr_params->gpa.access_flags;
+ break;
+
+ case GDMA_MR_TYPE_FMR:
+ req.mr_type = GDMA_MR_TYPE_FMR;
+ req.fmr.page_size = mr_params->fmr.page_size;
+ req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
+ break;
+
+ default:
+ ibdev_dbg(&dev->ib_dev,
+ "invalid param (GDMA_MR_TYPE) passed, type %d\n",
+ mr_params->mr_type);
+ err = -EINVAL;
+ goto error;
+ }
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+ if (err || resp.hdr.status) {
+ ibdev_err(&dev->ib_dev, "Failed to create mr %d, %u\n", err,
+ resp.hdr.status);
+ err = err ? err : -EPROTO;
+ goto error;
+ }
+
+ mr->ibmr.lkey = resp.lkey;
+ mr->ibmr.rkey = resp.rkey;
+ mr->mr_handle = resp.mr_handle;
+
+ return 0;
+error:
+ return err;
+}
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, gdma_obj_handle_t mr_handle)
+{
+ struct gdma_destroy_mr_response resp = {};
+ struct gdma_destroy_mr_request req = {};
+ struct gdma_dev *mdev = dev->gdma_dev;
+ struct gdma_context *gc;
+ int err;
+
+ gc = mdev->gdma_context;
+
+ mana_gd_init_req_hdr(&req.hdr, GDMA_DESTROY_MR, sizeof(req),
+ sizeof(resp));
+
+ req.mr_handle = mr_handle;
+
+ err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
+ if (err || resp.hdr.status) {
+ dev_err(gc->dev, "Failed to destroy MR: %d, 0x%x\n", err,
+ resp.hdr.status);
+ if (!err)
+ err = -EPROTO;
+ return err;
+ }
+
+ return 0;
+}
+
+static int mana_ib_mmap(struct ib_ucontext *ibcontext,
+ struct vm_area_struct *vma)
+{
+ struct mana_ib_ucontext *mana_ucontext =
+ container_of(ibcontext, struct mana_ib_ucontext, ibucontext);
+ struct ib_device *ibdev = ibcontext->device;
+ struct mana_ib_dev *mdev;
+ struct gdma_context *gc;
+ phys_addr_t pfn;
+ pgprot_t prot;
+ int ret;
+
+ mdev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+ gc = mdev->gdma_dev->gdma_context;
+
+ if (vma->vm_pgoff != 0) {
+ ibdev_err(ibdev, "Unexpected vm_pgoff %lu\n", vma->vm_pgoff);
+ return -EINVAL;
+ }
+
+ /* Map to the page indexed by ucontext->doorbell */
+ pfn = (gc->phys_db_page_base +
+ gc->db_page_size * mana_ucontext->doorbell) >>
+ PAGE_SHIFT;
+ prot = pgprot_writecombine(vma->vm_page_prot);
+
+ ret = rdma_user_mmap_io(ibcontext, vma, pfn, gc->db_page_size, prot,
+ NULL);
+ if (ret)
+ ibdev_err(ibdev, "can't rdma_user_mmap_io ret %d\n", ret);
+ else
+ ibdev_dbg(ibdev, "mapped I/O pfn 0x%llx page_size %u, ret %d\n",
+ pfn, gc->db_page_size, ret);
+
+ return ret;
+}
+
+static int mana_ib_get_port_immutable(struct ib_device *ibdev, u32 port_num,
+ struct ib_port_immutable *immutable)
+{
+ /* This version only supports RAW_PACKET
+ * other values need to be filled for other types
+ */
+ immutable->core_cap_flags = RDMA_CORE_PORT_RAW_PACKET;
+
+ return 0;
+}
+
+static int mana_ib_query_device(struct ib_device *ibdev,
+ struct ib_device_attr *props,
+ struct ib_udata *uhw)
+{
+ props->max_qp = MANA_MAX_NUM_QUEUES;
+ props->max_qp_wr = MAX_SEND_BUFFERS_PER_QUEUE;
+
+ /* max_cqe could be potentially much bigger.
+ * As this version of driver only supports RAW QP, set it to the same
+ * value as max_qp_wr
+ */
+ props->max_cqe = MAX_SEND_BUFFERS_PER_QUEUE;
+
+ props->max_mr_size = MANA_IB_MAX_MR_SIZE;
+ props->max_mr = INT_MAX;
+ props->max_send_sge = MAX_TX_WQE_SGL_ENTRIES;
+ props->max_recv_sge = MAX_RX_WQE_SGL_ENTRIES;
+
+ return 0;
+}
+
+static int mana_ib_query_port(struct ib_device *ibdev, u32 port,
+ struct ib_port_attr *props)
+{
+ /* This version doesn't return port properties */
+ return 0;
+}
+
+static int mana_ib_query_gid(struct ib_device *ibdev, u32 port, int index,
+ union ib_gid *gid)
+{
+ /* This version doesn't return GID properties */
+ return 0;
+}
+
+static void mana_ib_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+}
+
+static const struct ib_device_ops mana_ib_dev_ops = {
+ .owner = THIS_MODULE,
+ .driver_id = RDMA_DRIVER_MANA,
+ .uverbs_abi_ver = MANA_IB_UVERBS_ABI_VERSION,
+
+ .alloc_pd = mana_ib_alloc_pd,
+ .alloc_ucontext = mana_ib_alloc_ucontext,
+ .create_cq = mana_ib_create_cq,
+ .create_qp = mana_ib_create_qp,
+ .create_rwq_ind_table = mana_ib_create_rwq_ind_table,
+ .create_wq = mana_ib_create_wq,
+ .dealloc_pd = mana_ib_dealloc_pd,
+ .dealloc_ucontext = mana_ib_dealloc_ucontext,
+ .dereg_mr = mana_ib_dereg_mr,
+ .destroy_cq = mana_ib_destroy_cq,
+ .destroy_qp = mana_ib_destroy_qp,
+ .destroy_rwq_ind_table = mana_ib_destroy_rwq_ind_table,
+ .destroy_wq = mana_ib_destroy_wq,
+ .disassociate_ucontext = mana_ib_disassociate_ucontext,
+ .get_port_immutable = mana_ib_get_port_immutable,
+ .mmap = mana_ib_mmap,
+ .modify_qp = mana_ib_modify_qp,
+ .modify_wq = mana_ib_modify_wq,
+ .query_device = mana_ib_query_device,
+ .query_gid = mana_ib_query_gid,
+ .query_port = mana_ib_query_port,
+ .reg_user_mr = mana_ib_reg_user_mr,
+
+ INIT_RDMA_OBJ_SIZE(ib_cq, mana_ib_cq, ibcq),
+ INIT_RDMA_OBJ_SIZE(ib_pd, mana_ib_pd, ibpd),
+ INIT_RDMA_OBJ_SIZE(ib_qp, mana_ib_qp, ibqp),
+ INIT_RDMA_OBJ_SIZE(ib_ucontext, mana_ib_ucontext, ibucontext),
+ INIT_RDMA_OBJ_SIZE(ib_rwq_ind_table, mana_ib_rwq_ind_table,
+ ib_ind_table),
+};
+
+static int mana_ib_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+{
+ struct mana_adev *madev = container_of(adev, struct mana_adev, adev);
+ struct gdma_dev *mdev = madev->mdev;
+ struct mana_context *mc;
+ struct mana_ib_dev *dev;
+ int ret = 0;
+
+ mc = mdev->driver_data;
+
+ dev = ib_alloc_device(mana_ib_dev, ib_dev);
+ if (!dev)
+ return -ENOMEM;
+
+ ib_set_device_ops(&dev->ib_dev, &mana_ib_dev_ops);
+
+ dev->ib_dev.phys_port_cnt = mc->num_ports;
+
+ ibdev_dbg(&dev->ib_dev, "mdev=%p id=%d num_ports=%d\n", mdev,
+ mdev->dev_id.as_uint32, dev->ib_dev.phys_port_cnt);
+
+ dev->gdma_dev = mdev;
+ dev->ib_dev.node_type = RDMA_NODE_IB_CA;
+
+ /* num_comp_vectors needs to be set to the max MSIX index
+ * when interrupts and event queues are implemented
+ */
+ dev->ib_dev.num_comp_vectors = 1;
+ dev->ib_dev.dev.parent = mdev->gdma_context->dev;
+
+ ret = ib_register_device(&dev->ib_dev, "mana_%d",
+ mdev->gdma_context->dev);
+ if (ret) {
+ ib_dealloc_device(&dev->ib_dev);
+ return ret;
+ }
+
+ dev_set_drvdata(&adev->dev, dev);
+
+ return 0;
+}
+
+static void mana_ib_remove(struct auxiliary_device *adev)
+{
+ struct mana_ib_dev *dev = dev_get_drvdata(&adev->dev);
+
+ ib_unregister_device(&dev->ib_dev);
+ ib_dealloc_device(&dev->ib_dev);
+}
+
+static const struct auxiliary_device_id mana_id_table[] = {
+ {
+ .name = "mana.rdma",
+ },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, mana_id_table);
+
+static struct auxiliary_driver mana_driver = {
+ .name = "rdma",
+ .probe = mana_ib_probe,
+ .remove = mana_ib_remove,
+ .id_table = mana_id_table,
+};
+
+static int __init mana_ib_init(void)
+{
+ int ret = auxiliary_driver_register(&mana_driver);
+
+ return ret;
+}
+
+static void __exit mana_ib_cleanup(void)
+{
+ auxiliary_driver_unregister(&mana_driver);
+}
+
+module_init(mana_ib_init);
+module_exit(mana_ib_cleanup);
diff --git a/drivers/infiniband/hw/mana/mana_ib.h b/drivers/infiniband/hw/mana/mana_ib.h
new file mode 100644
index 000000000000..d3d42b11e95f
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mana_ib.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2022 Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef _MANA_IB_H_
+#define _MANA_IB_H_
+
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_mad.h>
+#include <rdma/ib_umem.h>
+#include <linux/auxiliary_bus.h>
+#include <rdma/mana-abi.h>
+
+#include <net/mana/mana.h>
+
+#define PAGE_SZ_BM \
+ (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K | SZ_256K | \
+ SZ_512K | SZ_1M | SZ_2M)
+
+/* MANA doesn't have any limit for MR size */
+#define MANA_IB_MAX_MR_SIZE ((u64)(~(0ULL)))
+
+struct mana_ib_dev {
+ struct ib_device ib_dev;
+ struct gdma_dev *gdma_dev;
+};
+
+struct mana_ib_wq {
+ struct ib_wq ibwq;
+ struct ib_umem *umem;
+ int wqe;
+ u32 wq_buf_size;
+ u64 gdma_region;
+ u64 id;
+ mana_handle_t rx_object;
+};
+
+struct mana_ib_pd {
+ struct ib_pd ibpd;
+ u32 pdn;
+ mana_handle_t pd_handle;
+
+ /* Mutex for sharing access to vport_use_count */
+ struct mutex vport_mutex;
+ int vport_use_count;
+
+ bool tx_shortform_allowed;
+ u32 tx_vp_offset;
+};
+
+struct mana_ib_mr {
+ struct ib_mr ibmr;
+ struct ib_umem *umem;
+ mana_handle_t mr_handle;
+};
+
+struct mana_ib_cq {
+ struct ib_cq ibcq;
+ struct ib_umem *umem;
+ int cqe;
+ u64 gdma_region;
+ u64 id;
+};
+
+struct mana_ib_qp {
+ struct ib_qp ibqp;
+
+ /* Work queue info */
+ struct ib_umem *sq_umem;
+ int sqe;
+ u64 sq_gdma_region;
+ u64 sq_id;
+ mana_handle_t tx_object;
+
+ /* The port on the IB device, starting with 1 */
+ u32 port;
+};
+
+struct mana_ib_ucontext {
+ struct ib_ucontext ibucontext;
+ u32 doorbell;
+};
+
+struct mana_ib_rwq_ind_table {
+ struct ib_rwq_ind_table ib_ind_table;
+};
+
+int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct ib_umem *umem,
+ mana_handle_t *gdma_region, u64 page_sz);
+
+int mana_ib_gd_destroy_dma_region(struct mana_ib_dev *dev,
+ mana_handle_t gdma_region);
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+ struct ib_wq_init_attr *init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+ u32 wq_attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata);
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+ struct ib_rwq_ind_table_init_attr *init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl);
+
+struct ib_mr *mana_ib_get_dma_mr(struct ib_pd *ibpd, int access_flags);
+
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
+ u64 iova, int access_flags,
+ struct ib_udata *udata);
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
+
+int mana_ib_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *qp_init_attr,
+ struct ib_udata *udata);
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_udata *udata);
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
+
+int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port_id,
+ struct mana_ib_pd *pd, u32 doorbell_id);
+void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
+ u32 port);
+
+int mana_ib_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
+ struct ib_udata *udata);
+
+int mana_ib_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
+
+int mana_ib_gd_create_pd(struct mana_ib_dev *dev, u64 *pd_handle, u32 *pd_id,
+ enum gdma_pd_flags flags);
+
+int mana_ib_gd_destroy_pd(struct mana_ib_dev *dev, u64 pd_handle);
+
+int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr *mr,
+ struct gdma_create_mr_params *mr_params);
+
+int mana_ib_gd_destroy_mr(struct mana_ib_dev *dev, mana_handle_t mr_handle);
+#endif
diff --git a/drivers/infiniband/hw/mana/mr.c b/drivers/infiniband/hw/mana/mr.c
new file mode 100644
index 000000000000..962e40f2de53
--- /dev/null
+++ b/drivers/infiniband/hw/mana/mr.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+#define VALID_MR_FLAGS \
+ (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ)
+
+static enum gdma_mr_access_flags
+mana_ib_verbs_to_gdma_access_flags(int access_flags)
+{
+ enum gdma_mr_access_flags flags = GDMA_ACCESS_FLAG_LOCAL_READ;
+
+ if (access_flags & IB_ACCESS_LOCAL_WRITE)
+ flags |= GDMA_ACCESS_FLAG_LOCAL_WRITE;
+
+ if (access_flags & IB_ACCESS_REMOTE_WRITE)
+ flags |= GDMA_ACCESS_FLAG_REMOTE_WRITE;
+
+ if (access_flags & IB_ACCESS_REMOTE_READ)
+ flags |= GDMA_ACCESS_FLAG_REMOTE_READ;
+
+ return flags;
+}
+
+struct ib_mr *mana_ib_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 length,
+ u64 iova, int access_flags,
+ struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct gdma_create_mr_params mr_params = {};
+ struct ib_device *ibdev = ibpd->device;
+ gdma_obj_handle_t dma_region_handle;
+ struct mana_ib_dev *dev;
+ struct mana_ib_mr *mr;
+ u64 page_sz;
+ int err;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ ibdev_dbg(ibdev,
+ "start 0x%llx, iova 0x%llx length 0x%llx access_flags 0x%x\n",
+ start, iova, length, access_flags);
+
+ if (access_flags & ~VALID_MR_FLAGS)
+ return ERR_PTR(-EINVAL);
+
+ mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+ if (!mr)
+ return ERR_PTR(-ENOMEM);
+
+ mr->umem = ib_umem_get(ibdev, start, length, access_flags);
+ if (IS_ERR(mr->umem)) {
+ err = PTR_ERR(mr->umem);
+ ibdev_dbg(ibdev,
+ "Failed to get umem for register user-mr, %d\n", err);
+ goto err_free;
+ }
+
+ page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
+ if (unlikely(!page_sz)) {
+ ibdev_err(ibdev, "Failed to get best page size\n");
+ err = -EOPNOTSUPP;
+ goto err_umem;
+ }
+ ibdev_dbg(ibdev, "Page size chosen %llu\n", page_sz);
+
+ err = mana_ib_gd_create_dma_region(dev, mr->umem, &dma_region_handle,
+ page_sz);
+ if (err) {
+ ibdev_err(ibdev, "Failed create dma region for user-mr, %d\n",
+ err);
+ goto err_umem;
+ }
+
+ ibdev_dbg(ibdev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region %llx\n", err,
+ dma_region_handle);
+
+ mr_params.pd_handle = pd->pd_handle;
+ mr_params.mr_type = GDMA_MR_TYPE_GVA;
+ mr_params.gva.dma_region_handle = dma_region_handle;
+ mr_params.gva.virtual_address = iova;
+ mr_params.gva.access_flags =
+ mana_ib_verbs_to_gdma_access_flags(access_flags);
+
+ err = mana_ib_gd_create_mr(dev, mr, &mr_params);
+ if (err)
+ goto err_dma_region;
+
+ /* There is no need to keep track of dma_region_handle after MR is
+ * successfully created. The dma_region_handle is tracked in the PF
+ * as part of the lifecycle of this MR.
+ */
+
+ mr->ibmr.length = length;
+ mr->ibmr.page_size = page_sz;
+ return &mr->ibmr;
+
+err_dma_region:
+ mana_gd_destroy_dma_region(dev->gdma_dev->gdma_context,
+ dma_region_handle);
+
+err_umem:
+ ib_umem_release(mr->umem);
+
+err_free:
+ kfree(mr);
+ return ERR_PTR(err);
+}
+
+int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+ struct mana_ib_mr *mr = container_of(ibmr, struct mana_ib_mr, ibmr);
+ struct ib_device *ibdev = ibmr->device;
+ struct mana_ib_dev *dev;
+ int err;
+
+ dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
+
+ err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
+ if (err)
+ return err;
+
+ if (mr->umem)
+ ib_umem_release(mr->umem);
+
+ kfree(mr);
+
+ return 0;
+}
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
new file mode 100644
index 000000000000..75100674f1cf
--- /dev/null
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+static int mana_ib_cfg_vport_steering(struct mana_ib_dev *dev,
+ struct net_device *ndev,
+ mana_handle_t default_rxobj,
+ mana_handle_t ind_table[],
+ u32 log_ind_tbl_size, u32 rx_hash_key_len,
+ u8 *rx_hash_key)
+{
+ struct mana_port_context *mpc = netdev_priv(ndev);
+ struct mana_cfg_rx_steer_req *req = NULL;
+ struct mana_cfg_rx_steer_resp resp = {};
+ mana_handle_t *req_indir_tab;
+ struct gdma_context *gc;
+ struct gdma_dev *mdev;
+ u32 req_buf_size;
+ int i, err;
+
+ mdev = dev->gdma_dev;
+ gc = mdev->gdma_context;
+
+ req_buf_size =
+ sizeof(*req) + sizeof(mana_handle_t) * MANA_INDIRECT_TABLE_SIZE;
+ req = kzalloc(req_buf_size, GFP_KERNEL);
+ if (!req)
+ return -ENOMEM;
+
+ mana_gd_init_req_hdr(&req->hdr, MANA_CONFIG_VPORT_RX, req_buf_size,
+ sizeof(resp));
+
+ req->vport = mpc->port_handle;
+ req->rx_enable = 1;
+ req->update_default_rxobj = 1;
+ req->default_rxobj = default_rxobj;
+ req->hdr.dev_id = mdev->dev_id;
+
+ /* If the indirection table has more than one entry, enable RSS */
+ if (log_ind_tbl_size)
+ req->rss_enable = true;
+
+ req->num_indir_entries = MANA_INDIRECT_TABLE_SIZE;
+ req->indir_tab_offset = sizeof(*req);
+ req->update_indir_tab = true;
+
+ req_indir_tab = (mana_handle_t *)(req + 1);
+ /* The ind table passed to the hardware must have
+ * MANA_INDIRECT_TABLE_SIZE entries. Adjust the verb
+ * ind_table to MANA_INDIRECT_TABLE_SIZE if required
+ */
+ ibdev_dbg(&dev->ib_dev, "ind table size %u\n", 1 << log_ind_tbl_size);
+ for (i = 0; i < MANA_INDIRECT_TABLE_SIZE; i++) {
+ req_indir_tab[i] = ind_table[i % (1 << log_ind_tbl_size)];
+ ibdev_dbg(&dev->ib_dev, "index %u handle 0x%llx\n", i,
+ req_indir_tab[i]);
+ }
+
+ req->update_hashkey = true;
+ if (rx_hash_key_len)
+ memcpy(req->hashkey, rx_hash_key, rx_hash_key_len);
+ else
+ netdev_rss_key_fill(req->hashkey, MANA_HASH_KEY_SIZE);
+
+ ibdev_dbg(&dev->ib_dev, "vport handle %llu default_rxobj 0x%llx\n",
+ req->vport, default_rxobj);
+
+ err = mana_gd_send_request(gc, req_buf_size, req, sizeof(resp), &resp);
+ if (err) {
+ netdev_err(ndev, "Failed to configure vPort RX: %d\n", err);
+ goto out;
+ }
+
+ if (resp.hdr.status) {
+ netdev_err(ndev, "vPort RX configuration failed: 0x%x\n",
+ resp.hdr.status);
+ err = -EPROTO;
+ }
+
+ netdev_info(ndev, "Configured steering vPort %llu log_entries %u\n",
+ mpc->port_handle, log_ind_tbl_size);
+
+out:
+ kfree(req);
+ return err;
+}
+
+static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
+ struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ struct mana_ib_dev *mdev =
+ container_of(pd->device, struct mana_ib_dev, ib_dev);
+ struct ib_rwq_ind_table *ind_tbl = attr->rwq_ind_tbl;
+ struct mana_ib_create_qp_rss_resp resp = {};
+ struct mana_ib_create_qp_rss ucmd = {};
+ struct gdma_dev *gd = mdev->gdma_dev;
+ mana_handle_t *mana_ind_table;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_cq *cq;
+ struct mana_ib_wq *wq;
+ struct ib_cq *ibcq;
+ struct ib_wq *ibwq;
+ int i = 0, ret;
+ u32 port;
+
+ mc = gd->driver_data;
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ ret = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (ret) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed copy from udata for create rss-qp, err %d\n",
+ ret);
+ return -EFAULT;
+ }
+
+ if (attr->cap.max_recv_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_recv_wr %d exceeding limit.\n",
+ attr->cap.max_recv_wr);
+ return -EINVAL;
+ }
+
+ if (attr->cap.max_recv_sge > MAX_RX_WQE_SGL_ENTRIES) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_recv_sge %d exceeding limit.\n",
+ attr->cap.max_recv_sge);
+ return -EINVAL;
+ }
+
+ if (ucmd.rx_hash_function != MANA_IB_RX_HASH_FUNC_TOEPLITZ) {
+ ibdev_dbg(&mdev->ib_dev,
+ "RX Hash function is not supported, %d\n",
+ ucmd.rx_hash_function);
+ return -EINVAL;
+ }
+
+ /* IB ports start with 1, MANA Ethernet ports start with 0 */
+ port = ucmd.port;
+ if (port < 1 || port > mc->num_ports) {
+ ibdev_dbg(&mdev->ib_dev, "Invalid port %u in creating qp\n",
+ port);
+ return -EINVAL;
+ }
+ ndev = mc->ports[port - 1];
+ mpc = netdev_priv(ndev);
+
+ ibdev_dbg(&mdev->ib_dev, "rx_hash_function %d port %d\n",
+ ucmd.rx_hash_function, port);
+
+ mana_ind_table = kzalloc(sizeof(mana_handle_t) *
+ (1 << ind_tbl->log_ind_tbl_size),
+ GFP_KERNEL);
+ if (!mana_ind_table) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ qp->port = port;
+
+ for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+ struct mana_obj_spec wq_spec = {};
+ struct mana_obj_spec cq_spec = {};
+
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+
+ ibcq = ibwq->cq;
+ cq = container_of(ibcq, struct mana_ib_cq, ibcq);
+
+ wq_spec.gdma_region = wq->gdma_region;
+ wq_spec.queue_size = wq->wq_buf_size;
+
+ cq_spec.gdma_region = cq->gdma_region;
+ cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
+ cq_spec.modr_ctx_id = 0;
+ cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+ ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
+ &wq_spec, &cq_spec, &wq->rx_object);
+ if (ret)
+ goto fail;
+
+ /* The GDMA regions are now owned by the WQ object */
+ wq->gdma_region = GDMA_INVALID_DMA_REGION;
+ cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+ wq->id = wq_spec.queue_index;
+ cq->id = cq_spec.queue_index;
+
+ ibdev_dbg(&mdev->ib_dev,
+ "ret %d rx_object 0x%llx wq id %llu cq id %llu\n",
+ ret, wq->rx_object, wq->id, cq->id);
+
+ resp.entries[i].cqid = cq->id;
+ resp.entries[i].wqid = wq->id;
+
+ mana_ind_table[i] = wq->rx_object;
+ }
+ resp.num_entries = i;
+
+ ret = mana_ib_cfg_vport_steering(mdev, ndev, wq->rx_object,
+ mana_ind_table,
+ ind_tbl->log_ind_tbl_size,
+ ucmd.rx_hash_key_len,
+ ucmd.rx_hash_key);
+ if (ret)
+ goto fail;
+
+ kfree(mana_ind_table);
+
+ if (udata) {
+ ret = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ if (ret) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy to udata create rss-qp, %d\n",
+ ret);
+ goto fail;
+ }
+ }
+
+ return 0;
+
+fail:
+ while (i-- > 0) {
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+ }
+
+ kfree(mana_ind_table);
+
+ return ret;
+}
+
+static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
+ struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+ struct mana_ib_dev *mdev =
+ container_of(ibpd->device, struct mana_ib_dev, ib_dev);
+ struct mana_ib_cq *send_cq =
+ container_of(attr->send_cq, struct mana_ib_cq, ibcq);
+ struct ib_ucontext *ib_ucontext = ibpd->uobject->context;
+ struct mana_ib_create_qp_resp resp = {};
+ struct mana_ib_ucontext *mana_ucontext;
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct mana_ib_create_qp ucmd = {};
+ struct mana_obj_spec wq_spec = {};
+ struct mana_obj_spec cq_spec = {};
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct ib_umem *umem;
+ int err;
+ u32 port;
+
+ mana_ucontext =
+ container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
+ mc = gd->driver_data;
+
+ if (udata->inlen < sizeof(ucmd))
+ return -EINVAL;
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy from udata create qp-raw, %d\n", err);
+ return -EFAULT;
+ }
+
+ /* IB ports start with 1, MANA Ethernet ports start with 0 */
+ port = ucmd.port;
+ if (port < 1 || port > mc->num_ports)
+ return -EINVAL;
+
+ if (attr->cap.max_send_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_send_wr %d exceeding limit\n",
+ attr->cap.max_send_wr);
+ return -EINVAL;
+ }
+
+ if (attr->cap.max_send_sge > MAX_TX_WQE_SGL_ENTRIES) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Requested max_send_sge %d exceeding limit\n",
+ attr->cap.max_send_sge);
+ return -EINVAL;
+ }
+
+ ndev = mc->ports[port - 1];
+ mpc = netdev_priv(ndev);
+ ibdev_dbg(&mdev->ib_dev, "port %u ndev %p mpc %p\n", port, ndev, mpc);
+
+ err = mana_ib_cfg_vport(mdev, port - 1, pd, mana_ucontext->doorbell);
+ if (err)
+ return -ENODEV;
+
+ qp->port = port;
+
+ ibdev_dbg(&mdev->ib_dev, "ucmd sq_buf_addr 0x%llx port %u\n",
+ ucmd.sq_buf_addr, ucmd.port);
+
+ umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem)) {
+ err = PTR_ERR(umem);
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to get umem for create qp-raw, err %d\n",
+ err);
+ goto err_free_vport;
+ }
+ qp->sq_umem = umem;
+
+ err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem,
+ &qp->sq_gdma_region, PAGE_SIZE);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create dma region for create qp-raw, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(&mdev->ib_dev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, qp->sq_gdma_region);
+
+ /* Create a WQ on the same port handle used by the Ethernet */
+ wq_spec.gdma_region = qp->sq_gdma_region;
+ wq_spec.queue_size = ucmd.sq_buf_size;
+
+ cq_spec.gdma_region = send_cq->gdma_region;
+ cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
+ cq_spec.modr_ctx_id = 0;
+ cq_spec.attached_eq = GDMA_CQ_NO_EQ;
+
+ err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
+ &cq_spec, &qp->tx_object);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create wq for create raw-qp, err %d\n",
+ err);
+ goto err_destroy_dma_region;
+ }
+
+ /* The GDMA regions are now owned by the WQ object */
+ qp->sq_gdma_region = GDMA_INVALID_DMA_REGION;
+ send_cq->gdma_region = GDMA_INVALID_DMA_REGION;
+
+ qp->sq_id = wq_spec.queue_index;
+ send_cq->id = cq_spec.queue_index;
+
+ ibdev_dbg(&mdev->ib_dev,
+ "ret %d qp->tx_object 0x%llx sq id %llu cq id %llu\n", err,
+ qp->tx_object, qp->sq_id, send_cq->id);
+
+ resp.sqid = qp->sq_id;
+ resp.cqid = send_cq->id;
+ resp.tx_vp_offset = pd->tx_vp_offset;
+
+ if (udata) {
+ err = ib_copy_to_udata(udata, &resp, sizeof(resp));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed copy udata for create qp-raw, %d\n",
+ err);
+ goto err_destroy_wq_obj;
+ }
+ }
+
+ return 0;
+
+err_destroy_wq_obj:
+ mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+err_destroy_dma_region:
+ mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+
+err_release_umem:
+ ib_umem_release(umem);
+
+err_free_vport:
+ mana_ib_uncfg_vport(mdev, pd, port - 1);
+
+ return err;
+}
+
+int mana_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
+ struct ib_udata *udata)
+{
+ switch (attr->qp_type) {
+ case IB_QPT_RAW_PACKET:
+ /* When rwq_ind_tbl is used, it's for creating WQs for RSS */
+ if (attr->rwq_ind_tbl)
+ return mana_ib_create_qp_rss(ibqp, ibqp->pd, attr,
+ udata);
+
+ return mana_ib_create_qp_raw(ibqp, ibqp->pd, attr, udata);
+ default:
+ /* Creating QP other than IB_QPT_RAW_PACKET is not supported */
+ ibdev_dbg(ibqp->device, "Creating QP type %u not supported\n",
+ attr->qp_type);
+ }
+
+ return -EINVAL;
+}
+
+int mana_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+ int attr_mask, struct ib_udata *udata)
+{
+ /* modify_qp is not supported by this version of the driver */
+ return -EOPNOTSUPP;
+}
+
+static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
+ struct ib_rwq_ind_table *ind_tbl,
+ struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_wq *wq;
+ struct ib_wq *ibwq;
+ int i;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[qp->port - 1];
+ mpc = netdev_priv(ndev);
+
+ for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
+ ibwq = ind_tbl->ind_tbl[i];
+ wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ ibdev_dbg(&mdev->ib_dev, "destroying wq->rx_object %llu\n",
+ wq->rx_object);
+ mana_destroy_wq_obj(mpc, GDMA_RQ, wq->rx_object);
+ }
+
+ return 0;
+}
+
+static int mana_ib_destroy_qp_raw(struct mana_ib_qp *qp, struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(qp->ibqp.device, struct mana_ib_dev, ib_dev);
+ struct gdma_dev *gd = mdev->gdma_dev;
+ struct ib_pd *ibpd = qp->ibqp.pd;
+ struct mana_port_context *mpc;
+ struct mana_context *mc;
+ struct net_device *ndev;
+ struct mana_ib_pd *pd;
+
+ mc = gd->driver_data;
+ ndev = mc->ports[qp->port - 1];
+ mpc = netdev_priv(ndev);
+ pd = container_of(ibpd, struct mana_ib_pd, ibpd);
+
+ mana_destroy_wq_obj(mpc, GDMA_SQ, qp->tx_object);
+
+ if (qp->sq_umem) {
+ mana_ib_gd_destroy_dma_region(mdev, qp->sq_gdma_region);
+ ib_umem_release(qp->sq_umem);
+ }
+
+ mana_ib_uncfg_vport(mdev, pd, qp->port - 1);
+
+ return 0;
+}
+
+int mana_ib_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+ struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
+
+ switch (ibqp->qp_type) {
+ case IB_QPT_RAW_PACKET:
+ if (ibqp->rwq_ind_tbl)
+ return mana_ib_destroy_qp_rss(qp, ibqp->rwq_ind_tbl,
+ udata);
+
+ return mana_ib_destroy_qp_raw(qp, udata);
+
+ default:
+ ibdev_dbg(ibqp->device, "Unexpected QP type %u\n",
+ ibqp->qp_type);
+ }
+
+ return -ENOENT;
+}
diff --git a/drivers/infiniband/hw/mana/wq.c b/drivers/infiniband/hw/mana/wq.c
new file mode 100644
index 000000000000..a11d0ae35ff7
--- /dev/null
+++ b/drivers/infiniband/hw/mana/wq.c
@@ -0,0 +1,114 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#include "mana_ib.h"
+
+struct ib_wq *mana_ib_create_wq(struct ib_pd *pd,
+ struct ib_wq_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ struct mana_ib_dev *mdev =
+ container_of(pd->device, struct mana_ib_dev, ib_dev);
+ struct mana_ib_create_wq ucmd = {};
+ struct mana_ib_wq *wq;
+ struct ib_umem *umem;
+ int err;
+
+ if (udata->inlen < sizeof(ucmd))
+ return ERR_PTR(-EINVAL);
+
+ err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
+ if (err) {
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to copy from udata for create wq, %d\n", err);
+ return ERR_PTR(-EFAULT);
+ }
+
+ wq = kzalloc(sizeof(*wq), GFP_KERNEL);
+ if (!wq)
+ return ERR_PTR(-ENOMEM);
+
+ ibdev_dbg(&mdev->ib_dev, "ucmd wq_buf_addr 0x%llx\n", ucmd.wq_buf_addr);
+
+ umem = ib_umem_get(pd->device, ucmd.wq_buf_addr, ucmd.wq_buf_size,
+ IB_ACCESS_LOCAL_WRITE);
+ if (IS_ERR(umem)) {
+ err = PTR_ERR(umem);
+ ibdev_dbg(&mdev->ib_dev,
+ "Failed to get umem for create wq, err %d\n", err);
+ goto err_free_wq;
+ }
+
+ wq->umem = umem;
+ wq->wqe = init_attr->max_wr;
+ wq->wq_buf_size = ucmd.wq_buf_size;
+ wq->rx_object = INVALID_MANA_HANDLE;
+
+ err = mana_ib_gd_create_dma_region(mdev, wq->umem, &wq->gdma_region,
+ PAGE_SIZE);
+ if (err) {
+ ibdev_err(&mdev->ib_dev,
+ "Failed to create dma region for create wq, %d\n",
+ err);
+ goto err_release_umem;
+ }
+
+ ibdev_dbg(&mdev->ib_dev,
+ "mana_ib_gd_create_dma_region ret %d gdma_region 0x%llx\n",
+ err, wq->gdma_region);
+
+ /* The WQ ID is not known yet; it is returned when the WQ object is created */
+
+ return &wq->ibwq;
+
+err_release_umem:
+ ib_umem_release(umem);
+
+err_free_wq:
+ kfree(wq);
+
+ return ERR_PTR(err);
+}
+
+int mana_ib_modify_wq(struct ib_wq *wq, struct ib_wq_attr *wq_attr,
+ u32 wq_attr_mask, struct ib_udata *udata)
+{
+ /* modify_wq is not supported by this version of the driver */
+ return -EOPNOTSUPP;
+}
+
+int mana_ib_destroy_wq(struct ib_wq *ibwq, struct ib_udata *udata)
+{
+ struct mana_ib_wq *wq = container_of(ibwq, struct mana_ib_wq, ibwq);
+ struct ib_device *ib_dev = ibwq->device;
+ struct mana_ib_dev *mdev;
+
+ mdev = container_of(ib_dev, struct mana_ib_dev, ib_dev);
+
+ mana_ib_gd_destroy_dma_region(mdev, wq->gdma_region);
+ ib_umem_release(wq->umem);
+
+ kfree(wq);
+
+ return 0;
+}
+
+int mana_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
+ struct ib_rwq_ind_table_init_attr *init_attr,
+ struct ib_udata *udata)
+{
+ /* There is no additional data in ind_table to be maintained by this
+ * driver, do nothing
+ */
+ return 0;
+}
+
+int mana_ib_destroy_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_tbl)
+{
+ /* There is no additional data in ind_table to be maintained by this
+ * driver, do nothing
+ */
+ return 0;
+}
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3a0bc6e0b730..1ff6e0d07cfd 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -406,6 +406,9 @@ int mana_bpf(struct net_device *ndev, struct netdev_bpf *bpf);
extern const struct ethtool_ops mana_ethtool_ops;
+/* A CQ can be created not associated with any EQ */
+#define GDMA_CQ_NO_EQ 0xffff
+
struct mana_obj_spec {
u32 queue_index;
u64 gdma_region;
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..081aabf536dc 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -250,6 +250,7 @@ enum rdma_driver_id {
RDMA_DRIVER_QIB,
RDMA_DRIVER_EFA,
RDMA_DRIVER_SIW,
+ RDMA_DRIVER_MANA,
};
enum ib_uverbs_gid_type {
diff --git a/include/uapi/rdma/mana-abi.h b/include/uapi/rdma/mana-abi.h
new file mode 100644
index 000000000000..559c49b72e0d
--- /dev/null
+++ b/include/uapi/rdma/mana-abi.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
+/*
+ * Copyright (c) 2022, Microsoft Corporation. All rights reserved.
+ */
+
+#ifndef MANA_ABI_USER_H
+#define MANA_ABI_USER_H
+
+#include <linux/types.h>
+#include <rdma/ib_user_ioctl_verbs.h>
+
+/*
+ * Increment this value if any changes that break userspace ABI
+ * compatibility are made.
+ */
+
+#define MANA_IB_UVERBS_ABI_VERSION 1
+
+struct mana_ib_create_cq {
+ __aligned_u64 buf_addr;
+};
+
+struct mana_ib_create_qp {
+ __aligned_u64 sq_buf_addr;
+ __u32 sq_buf_size;
+ __u32 port;
+};
+
+struct mana_ib_create_qp_resp {
+ __u32 sqid;
+ __u32 cqid;
+ __u32 tx_vp_offset;
+ __u32 reserved;
+};
+
+struct mana_ib_create_wq {
+ __aligned_u64 wq_buf_addr;
+ __u32 wq_buf_size;
+ __u32 reserved;
+};
+
+/* RX Hash function flags */
+enum mana_ib_rx_hash_function_flags {
+ MANA_IB_RX_HASH_FUNC_TOEPLITZ = 1 << 0,
+};
+
+struct mana_ib_create_qp_rss {
+ __aligned_u64 rx_hash_fields_mask;
+ __u8 rx_hash_function;
+ __u8 reserved[7];
+ __u32 rx_hash_key_len;
+ __u8 rx_hash_key[40];
+ __u32 port;
+};
+
+struct rss_resp_entry {
+ __u32 cqid;
+ __u32 wqid;
+};
+
+struct mana_ib_create_qp_rss_resp {
+ __aligned_u64 num_entries;
+ struct rss_resp_entry entries[64];
+};
+
+#endif
--
2.17.1
On Sat, Jun 25, 2022 at 04:20:19AM +0000, Ajay Sharma wrote:
> Hello Maintainers,
> Any idea when these patches would make it into the next kernel release ?
New RDMA drivers typically take a long time to get merged because they tend
to be huge. Currently I'm working through ERDMA. Reviewing the
ERDMA submission would be helpful, I generally prefer it if people
proposing new drivers review other new drivers being submitted.
In this case it seems smaller, so you might make this cycle, though I
haven't even opened the userspace portion yet.
Jason
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
...
> +static int add_adev(struct gdma_dev *gd)
> +{
> + int ret = 0;
No need to initialize it to 0.
> + struct mana_adev *madev;
> + struct auxiliary_device *adev;
davem would require the reverse xmas tree order :-)
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> +void mana_uncfg_vport(struct mana_port_context *apc)
> +{
> + mutex_lock(&apc->vport_mutex);
> + apc->vport_use_count--;
> + WARN_ON(apc->vport_use_count < 0);
> + mutex_unlock(&apc->vport_mutex);
> +}
> +EXPORT_SYMBOL_GPL(mana_uncfg_vport);
> +
> +int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
> + u32 doorbell_pg_id)
> {
> struct mana_config_vport_resp resp = {};
> struct mana_config_vport_req req = {};
> int err;
>
> + /* Ethernet driver and IB driver can't take the port at the same time */
> + mutex_lock(&apc->vport_mutex);
> + if (apc->vport_use_count > 0) {
> + mutex_unlock(&apc->vport_mutex);
> + return -ENODEV;
Maybe -EBUSY is better?
> @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct mana_port_context
> *apc, u32 protection_dom_id,
>
> apc->tx_shortform_allowed = resp.short_form_allowed;
> apc->tx_vp_offset = resp.tx_vport_offset;
> +
> + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> + apc->port_handle, protection_dom_id, doorbell_pg_id);
Should this be netdev_dbg()?
The log buffer can be flooded if there are many vPorts per VF PCI device and
there are a lot of VFs.
> out:
> + if (err) {
> + mutex_lock(&apc->vport_mutex);
> + apc->vport_use_count--;
> + mutex_unlock(&apc->vport_mutex);
> + }
Change this to the below?
if (err)
mana_uncfg_vport(apc);
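To illustrate the two suggestions above (returning -EBUSY and reusing
mana_uncfg_vport() on the error path), here is a minimal userspace sketch.
The ex_* names are hypothetical stand-ins, the mutex is left out, and it
assumes the use count is incremented right after the ownership check, which
the quoted hunk does not show.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct mana_port_context; only the counter is modeled. */
struct ex_port {
	int vport_use_count;
};

/* Mirrors mana_uncfg_vport(): drop the claim taken by ex_cfg_vport(). */
static void ex_uncfg_vport(struct ex_port *p)
{
	/* The patch holds apc->vport_mutex here; locking is omitted in this sketch. */
	p->vport_use_count--;
	assert(p->vport_use_count >= 0);
}

/* hw_err simulates the result of the request sent to the hardware. */
static int ex_cfg_vport(struct ex_port *p, int hw_err)
{
	int err;

	if (p->vport_use_count > 0)
		return -EBUSY;	/* port is already owned by the other driver */
	p->vport_use_count++;	/* assumed placement of the increment */

	err = hw_err;
	if (err)
		ex_uncfg_vport(p);	/* single undo path instead of open-coded locking */
	return err;
}
```

With this shape the decrement and its sanity check live in one place, so the
error path cannot drift from the normal release path.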
> @@ -626,6 +654,9 @@ static int mana_cfg_vport_steering(struct
> mana_port_context *apc,
> resp.hdr.status);
> err = -EPROTO;
> }
> +
> + netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
> + apc->port_handle, num_entries);
netdev_dbg()?
In general, the patch looks good to me.
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> In preparation to add MANA RDMA driver, move all the required header files
> to a common location for use by both Ethernet and RDMA drivers.
>
> Signed-off-by: Long Li <[email protected]>
> ---
> Change log:
> v2: Move headers to include/net/mana, instead of include/linux/mana
> ...
> rename {drivers/net/ethernet/microsoft => include/net}/mana/gdma.h
> (100%)
> rename {drivers/net/ethernet/microsoft => include/net}/mana/hw_channel.h
> (100%)
> rename {drivers/net/ethernet/microsoft => include/net}/mana/mana.h
> (100%)
> rename {drivers/net/ethernet/microsoft =>
> include/net}/mana/shm_channel.h (100%)
While I'm giving my Reviewed-by, I hope someone with better judgement
can share thoughts as well.
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> The port number is useful for user-mode application to identify this
> net device based on port index. Set to the correct value in ndev.
>
> Signed-off-by: Long Li <[email protected]>
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> @@ -125,6 +125,7 @@ int mana_gd_send_request(struct gdma_context *gc,
> u32 req_len, const void *req,
>
> return mana_hwc_send_request(hwc, req_len, req, resp_len, resp);
> }
> +EXPORT_SYMBOL(mana_gd_send_request);
Can we use EXPORT_SYMBOL_GPL?
> @@ -715,9 +715,10 @@ static int mana_create_wq_obj(struct
> mana_port_context *apc,
> out:
> return err;
> }
> +EXPORT_SYMBOL_GPL(mana_create_wq_obj);
Well, here we use EXPORT_SYMBOL_GPL. If there is a rule to decide
which one should be used, please add a comment.
In general, the patch looks good to me.
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> +int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct
> ib_umem *umem,
> + mana_handle_t *gdma_region, u64 page_sz)
> +{
> + ...
> + err = mana_gd_send_request(gc, create_req_msg_size, create_req,
> + sizeof(create_resp), &create_resp);
> + kfree(create_req);
> +
> + if (err || create_resp.hdr.status) {
> + ibdev_err(&dev->ib_dev,
> + "Failed to create DMA region: %d, 0x%x\n", err,
> + create_resp.hdr.status);
if (!err)
err = -EPROTO;
> + goto error;
> + }
> + ...
> + err = mana_gd_send_request(gc, add_req_msg_size,
> + add_req, sizeof(add_resp),
> + &add_resp);
> + if (!err || add_resp.hdr.status != expected_status) {
> + ibdev_err(&dev->ib_dev,
> + "Failed put DMA pages %u: %d,0x%x\n",
> + i, err, add_resp.hdr.status);
> + err = -EPROTO;
Should we try to undo what has been done by calling GDMA_DESTROY_DMA_REGION?
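If the answer is yes, one possible shape of the unwind is sketched below as a
userspace mock. Everything named ex_* is hypothetical; the real code would
send GDMA_DESTROY_DMA_REGION through mana_ib_gd_destroy_dma_region() instead
of adjusting a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mock of the device: counts live regions, can fail one batch. */
struct ex_dev {
	int live_regions;
	int fail_on_batch;	/* -1 means never fail */
};

static int ex_create_region(struct ex_dev *d, uint64_t *handle)
{
	d->live_regions++;
	*handle = 0x1234;	/* arbitrary handle for the sketch */
	return 0;
}

static void ex_destroy_region(struct ex_dev *d, uint64_t handle)
{
	(void)handle;
	d->live_regions--;
}

static int ex_add_pages(struct ex_dev *d, int batch)
{
	return batch == d->fail_on_batch ? -EPROTO : 0;
}

static int ex_create_dma_region(struct ex_dev *d, int batches, uint64_t *handle)
{
	int i, err;

	err = ex_create_region(d, handle);
	if (err)
		return err;

	for (i = 0; i < batches; i++) {
		err = ex_add_pages(d, i);
		if (err)
			goto destroy;
	}
	return 0;

destroy:
	/* Undo the partially populated region so nothing is left on the device. */
	ex_destroy_region(d, *handle);
	return err;
}
```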
> + goto free_req;
> + }
> +
> + num_pages_cur += num_pages_to_handle;
> + num_pages_to_handle =
> + min_t(size_t, num_pages_total - num_pages_cur,
> + max_pgs_add_cmd);
> + add_req_msg_size = sizeof(*add_req) +
> + num_pages_to_handle * sizeof(u64);
> + }
> +free_req:
> + kfree(add_req);
> + }
> +
> +error:
> + return err;
> +}
> + ...
> +int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr
> *mr,
> + struct gdma_create_mr_params *mr_params)
> +{
> + struct gdma_create_mr_response resp = {};
> + struct gdma_create_mr_request req = {};
> + struct gdma_dev *mdev = dev->gdma_dev;
> + struct gdma_context *gc;
> + int err;
> +
> + gc = mdev->gdma_context;
> +
> + mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_MR, sizeof(req),
> + sizeof(resp));
> + req.pd_handle = mr_params->pd_handle;
> +
> + switch (mr_params->mr_type) {
> + case GDMA_MR_TYPE_GVA:
> + req.mr_type = GDMA_MR_TYPE_GVA;
> + req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
> + req.gva.virtual_address = mr_params->gva.virtual_address;
> + req.gva.access_flags = mr_params->gva.access_flags;
> + break;
> +
> + case GDMA_MR_TYPE_GPA:
> + req.mr_type = GDMA_MR_TYPE_GPA;
> + req.gpa.access_flags = mr_params->gpa.access_flags;
> + break;
> +
> + case GDMA_MR_TYPE_FMR:
> + req.mr_type = GDMA_MR_TYPE_FMR;
> + req.fmr.page_size = mr_params->fmr.page_size;
> + req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
> + break;
> +
> + default:
> + ibdev_dbg(&dev->ib_dev,
> + "invalid param (GDMA_MR_TYPE) passed, type %d\n",
> + req.mr_type);
Here req.mr_type is always 0, because the default case is reached before
any of the assignments. We should remove the 3 lines of "req.mr_type = ..."
above, and add a line "req.mr_type = mr_params->mr_type;" before the
"switch" line.
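A minimal userspace sketch of that change, with hypothetical ex_* types
trimmed down from the quoted structures (not the actual driver code):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, trimmed-down stand-ins for the structures quoted above. */
enum ex_mr_type { EX_MR_TYPE_GPA = 1, EX_MR_TYPE_GVA, EX_MR_TYPE_FMR };

struct ex_mr_params { int mr_type; unsigned int access_flags; };
struct ex_mr_req { int mr_type; unsigned int access_flags; };

static int ex_fill_mr_req(struct ex_mr_req *req, const struct ex_mr_params *p)
{
	/* Assign once, before the switch, so every path sees the real type. */
	req->mr_type = p->mr_type;

	switch (p->mr_type) {
	case EX_MR_TYPE_GVA:
	case EX_MR_TYPE_GPA:
		req->access_flags = p->access_flags;
		break;
	case EX_MR_TYPE_FMR:
		break;
	default:
		/* req->mr_type holds the rejected value, so logging it is meaningful. */
		return -EINVAL;
	}
	return 0;
}
```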
> + err = -EINVAL;
> + goto error;
> + }
> +
> + err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp), &resp);
> +
> + if (err || resp.hdr.status) {
> + ibdev_err(&dev->ib_dev, "Failed to create mr %d, %u", err,
> + resp.hdr.status);
if (!err)
err = -EPROTO;
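The same "if (!err) err = -EPROTO;" fix applies to both call sites flagged
in this review. A small helper could keep them consistent; the sketch below
is a hypothetical userspace version (ex_resp_to_errno() does not exist in
the driver):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical helper: fold a transport error and a response status into one errno. */
static int ex_resp_to_errno(int err, uint32_t status)
{
	if (err)
		return err;		/* sending the request itself failed */
	if (status)
		return -EPROTO;		/* request was sent but the device rejected it */
	return 0;
}
```

Without this, a non-zero hdr.status with err == 0 jumps to the error label
and the function returns 0, so the caller treats the failure as success.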
> + goto error;
> + }
> +
> + mr->ibmr.lkey = resp.lkey;
> + mr->ibmr.rkey = resp.rkey;
> + mr->mr_handle = resp.mr_handle;
> +
> + return 0;
> +error:
> + return err;
> +}
> + ...
> +static int mana_ib_probe(struct auxiliary_device *adev,
> + const struct auxiliary_device_id *id)
> +{
> + struct mana_adev *madev = container_of(adev, struct mana_adev, adev);
> + struct gdma_dev *mdev = madev->mdev;
> + struct mana_context *mc;
> + struct mana_ib_dev *dev;
> + int ret = 0;
No need to initialize 'ret' to 0.
> +int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
> +{
> + struct mana_ib_mr *mr = container_of(ibmr, struct mana_ib_mr, ibmr);
> + struct ib_device *ibdev = ibmr->device;
> + struct mana_ib_dev *dev;
> + int err;
> +
> + dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> +
> + err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
> + if (err)
Should we return here without calling ib_umem_release() and kfree(mr)?
> + return err;
> +
> + if (mr->umem)
> + ib_umem_release(mr->umem);
> +
> + kfree(mr);
> +
> + return 0;
> +}
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> +EXPORT_SYMBOL(mana_gd_destroy_doorbell_page);
Can this be EXPORT_SYMBOL_GPL?
> +EXPORT_SYMBOL(mana_gd_allocate_doorbell_page);
EXPORT_SYMBOL_GPL?
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
>
> The MANA hardware support protection domain and memory registration for
s/support/supports
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h
> b/drivers/net/ethernet/microsoft/mana/gdma.h
> index f945755760dc..b1bec8ab5695 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma.h
> +++ b/drivers/net/ethernet/microsoft/mana/gdma.h
> @@ -27,6 +27,10 @@ enum gdma_request_type {
> GDMA_CREATE_DMA_REGION = 25,
> GDMA_DMA_REGION_ADD_PAGES = 26,
> GDMA_DESTROY_DMA_REGION = 27,
> + GDMA_CREATE_PD = 29,
> + GDMA_DESTROY_PD = 30,
> + GDMA_CREATE_MR = 31,
> + GDMA_DESTROY_MR = 32,
These are not used in this patch. They're used in the 12th
patch for the first time. Can we move these to that patch?
> #define GDMA_RESOURCE_DOORBELL_PAGE 27
> @@ -59,6 +63,8 @@ enum {
> GDMA_DEVICE_MANA = 2,
> };
>
> +typedef u64 gdma_obj_handle_t;
> +
> struct gdma_resource {
> /* Protect the bitmap */
> spinlock_t lock;
> @@ -192,7 +198,7 @@ struct gdma_mem_info {
> u64 length;
>
> /* Allocated by the PF driver */
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
The old name "gdma_region" is shorter and it has "gdma"
rather than "dma".
The new name is longer. When one starts to read the code for
the first time, I feel that "dma_region_handle" might be confusing
as it's similar to "dma_handle" (which is the DMA address returned
by dma_alloc_coherent()). "dma_region_handle" is an integer
rather than a memory address.
You probably use the new name because there is a "mr_handle"
in the 12th patch. I prefer the old name, though the new name is
also ok to me. If you decide to use the new name, it would be
great if this patch could be split into two patches: one for the
renaming only, and the other for the real changes.
> #define REGISTER_ATB_MST_MKEY_LOWER_SIZE 8
> @@ -599,7 +605,7 @@ struct gdma_create_queue_req {
> u32 reserved1;
> u32 pdid;
> u32 doolbell_id;
> - u64 gdma_region;
> + gdma_obj_handle_t gdma_region;
If we decide to use the new name "dma_region_handle", should
we change the field/param names in the below structs and
functions as well (this may not be a complete list)?
struct mana_ib_wq
struct mana_ib_cq
mana_ib_gd_create_dma_region
mana_ib_gd_destroy_dma_region
> u32 reserved2;
> u32 queue_size;
> u32 log2_throttle_limit;
> @@ -626,6 +632,28 @@ struct gdma_disable_queue_req {
> u32 alloc_res_id_on_creation;
> }; /* HW DATA */
>
> +enum atb_page_size {
> + ATB_PAGE_SIZE_4K,
> + ATB_PAGE_SIZE_8K,
> + ATB_PAGE_SIZE_16K,
> + ATB_PAGE_SIZE_32K,
> + ATB_PAGE_SIZE_64K,
> + ATB_PAGE_SIZE_128K,
> + ATB_PAGE_SIZE_256K,
> + ATB_PAGE_SIZE_512K,
> + ATB_PAGE_SIZE_1M,
> + ATB_PAGE_SIZE_2M,
> + ATB_PAGE_SIZE_MAX,
> +};
> +
> +enum gdma_mr_access_flags {
> + GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
> + GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
> + GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
> + GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
> + GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
> +};
It would be better to use BIT_ULL(0), BIT_ULL(1), etc.
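For reference, a userspace sketch of how the flags would read with a
BIT_ULL-style macro. EX_BIT_ULL is a stand-in for the kernel's BIT_ULL(),
and the EX_* enumerators are hypothetical copies of the quoted flags; the
values are unchanged, only the spelling differs.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT_ULL() macro. */
#define EX_BIT_ULL(nr) (1ULL << (nr))

/* Hypothetical copies of the quoted flags, written as bit positions. */
enum ex_mr_access_flags {
	EX_ACCESS_FLAG_LOCAL_READ    = EX_BIT_ULL(0),
	EX_ACCESS_FLAG_LOCAL_WRITE   = EX_BIT_ULL(1),
	EX_ACCESS_FLAG_REMOTE_READ   = EX_BIT_ULL(2),
	EX_ACCESS_FLAG_REMOTE_WRITE  = EX_BIT_ULL(3),
	EX_ACCESS_FLAG_REMOTE_ATOMIC = EX_BIT_ULL(4),
};
```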
> /* GDMA_CREATE_DMA_REGION */
> struct gdma_create_dma_region_req {
> struct gdma_req_hdr hdr;
> @@ -652,14 +680,14 @@ struct gdma_create_dma_region_req {
>
> struct gdma_create_dma_region_resp {
> struct gdma_resp_hdr hdr;
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
> }; /* HW DATA */
>
> /* GDMA_DMA_REGION_ADD_PAGES */
> struct gdma_dma_region_add_pages_req {
> struct gdma_req_hdr hdr;
>
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
>
> u32 page_addr_list_len;
> u32 reserved3;
> @@ -671,9 +699,114 @@ struct gdma_dma_region_add_pages_req {
> struct gdma_destroy_dma_region_req {
> struct gdma_req_hdr hdr;
>
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
> }; /* HW DATA */
>
> +enum gdma_pd_flags {
> + GDMA_PD_FLAG_ALLOW_GPA_MR = (1 << 0),
> + GDMA_PD_FLAG_ALLOW_FMR_MR = (1 << 1),
> +};
Use BIT_ULL(0), BIT_ULL(1) ?
> +struct gdma_create_pd_req {
> + struct gdma_req_hdr hdr;
> + enum gdma_pd_flags flags;
> + u32 reserved;
> +};
> +
> +struct gdma_create_pd_resp {
> + struct gdma_resp_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> + u32 pd_id;
> + u32 reserved;
> +};
> +
> +struct gdma_destroy_pd_req {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> +};
> +
> +struct gdma_destory_pd_resp {
> + struct gdma_resp_hdr hdr;
> +};
> +
> +enum gdma_mr_type {
> + /* Guest Physical Address - MRs of this type allow access
> + * to any DMA-mapped memory using bus-logical address
> + */
> + GDMA_MR_TYPE_GPA = 1,
> +
> + /* Guest Virtual Address - MRs of this type allow access
> + * to memory mapped by PTEs associated with this MR using a virtual
> + * address that is set up in the MST
> + */
> + GDMA_MR_TYPE_GVA,
> +
> + /* Fast Memory Register - Like GVA but the MR is initially put in the
> + * FREE state (as opposed to Valid), and the specified number of
> + * PTEs are reserved for future fast memory reservations.
> + */
> + GDMA_MR_TYPE_FMR,
> +};
> +
> +struct gdma_create_mr_params {
> + gdma_obj_handle_t pd_handle;
> + enum gdma_mr_type mr_type;
> + union {
> + struct {
> + gdma_obj_handle_t dma_region_handle;
> + u64 virtual_address;
> + enum gdma_mr_access_flags access_flags;
> + } gva;
Add an empty line to make it more readable?
> + struct {
> + enum gdma_mr_access_flags access_flags;
> + } gpa;
Add an empty line?
> + struct {
> + enum atb_page_size page_size;
> + u32 reserved_pte_count;
> + } fmr;
> + };
> +};
The definition of struct gdma_create_mr_params is not naturally aligned.
This can potentially cause issues.
According to my test, sizeof(struct gdma_create_mr_params) is 40 bytes,
meaning the compiler adds two "hidden" fields:
struct gdma_create_mr_params {
gdma_obj_handle_t pd_handle; // offset = 0
enum gdma_mr_type mr_type; // offset = 8
+ u32 hidden_field_a;
union { // offset = 0x10
struct {
gdma_obj_handle_t dma_region_handle; // offset =0x10
u64 virtual_address; // offset =0x18
enum gdma_mr_access_flags access_flags; // offset =0x20
+ u32 hidden_field_b;
} gva;
We'll run into trouble some day if the Linux VF driver or the host PF
driver adds something like __attribute__((packed)).
Can we work with the host team to improve the definition? If it's
hard/impossible to change the PF driver side definition, both sides
should at least explicitly define the two hidden fields as reserved fields.
BTW, can we assume the size of "enum" is 4 bytes? I prefer using u32
explicitly when a struct is used to talk to the PF driver or the device.
If we decide to use "enum", I suggest we add
BUILD_BUG_ON(sizeof(struct gdma_create_mr_params) != 40)
to make sure the assumption is true.
BTW, Haiyang added "/* HW DATA */ " to other definitions,
e.g. gdma_create_queue_resp. Can you please add the same comment
for consistency?
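The padding described above can be checked at compile time. The following is a minimal userspace sketch, not the driver's code: the `ex_` names are hypothetical, `uint32_t` stands in for the enums, and it assumes an ABI where a 64-bit integer is 8-byte aligned (e.g. x86-64).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct gdma_create_mr_params. */
struct ex_mr_params {
	uint64_t pd_handle;			/* offset 0 */
	uint32_t mr_type;			/* offset 8 */
	/* 4 bytes of implicit padding: the union needs 8-byte alignment */
	union {					/* offset 16 */
		struct {
			uint64_t dma_region_handle;	/* offset 16 */
			uint64_t virtual_address;	/* offset 24 */
			uint32_t access_flags;		/* offset 32 */
			/* 4 bytes of implicit tail padding */
		} gva;
		struct {
			uint32_t access_flags;
		} gpa;
		struct {
			uint32_t page_size;
			uint32_t reserved_pte_count;
		} fmr;
	};
};

/* Userspace analogue of BUILD_BUG_ON(): the build fails if the layout drifts. */
static_assert(sizeof(struct ex_mr_params) == 40, "unexpected struct size");
static_assert(offsetof(struct ex_mr_params, gva) == 16, "unexpected union offset");
```

If either assertion fires, the compiler has laid the struct out differently from what the reviewer measured, which is exactly the situation the suggested BUILD_BUG_ON() is meant to catch.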
> +struct gdma_create_mr_request {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> + enum gdma_mr_type mr_type;
> + u32 reserved;
> +
> + union {
> + struct {
> + enum gdma_mr_access_flags access_flags;
> + } gpa;
> +
> + struct {
> + gdma_obj_handle_t dma_region_handle;
> + u64 virtual_address;
> + enum gdma_mr_access_flags access_flags;
Similarly, there is a hidden u32 field here. We should explicitly define it.
> + } gva;
Can we use the same order of "gva; gpa" used in
struct gdma_create_mr_params?
> + struct {
> + enum atb_page_size page_size;
> + u32 reserved_pte_count;
> + } fmr;
> + };
> +};
Add BUILD_BUG_ON(sizeof(struct gdma_create_mr_request) != 80) ?
Add /* HW DATA */ ?
> +struct gdma_create_mr_response {
> + struct gdma_resp_hdr hdr;
> + gdma_obj_handle_t mr_handle;
> + u32 lkey;
> + u32 rkey;
> +};
> +
> +struct gdma_destroy_mr_request {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t mr_handle;
> +};
> +
> +struct gdma_destroy_mr_response {
> + struct gdma_resp_hdr hdr;
> +};
> +
None of the new defines are really used in this patch:
+enum atb_page_size {
+enum gdma_mr_access_flags {
+enum gdma_pd_flags {
+struct gdma_create_pd_req {
+struct gdma_create_pd_resp {
+struct gdma_destroy_pd_req {
+struct gdma_destory_pd_resp {
+enum gdma_mr_type {
+struct gdma_create_mr_params {
+struct gdma_create_mr_request {
+struct gdma_create_mr_response {
+struct gdma_destroy_mr_request {
+struct gdma_destroy_mr_response
The new defines are used in the 12th patch for the first time.
Can we move these to that patch or at least move these defines
to before the 12th patch?
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
>
> When doing memory registration, the PF may respond with
> GDMA_STATUS_MORE_ENTRIES to indicate a follow request is needed. This is
> not an error and should be processed as expected.
>
> Signed-off-by: Ajay Sharma <[email protected]>
> Signed-off-by: Long Li <[email protected]>
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> The number of maximum SGl entries should be computed from the maximum
s/SGl/SGL
> @@ -436,6 +436,13 @@ struct gdma_wqe {
> #define MAX_TX_WQE_SIZE 512
> #define MAX_RX_WQE_SIZE 256
>
> +#define MAX_TX_WQE_SGL_ENTRIES ((GDMA_MAX_SQE_SIZE - \
> + sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
> + sizeof(struct gdma_sge))
> +
> +#define MAX_RX_WQE_SGL_ENTRIES ((GDMA_MAX_RQE_SIZE - \
> + sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))
Can we make these '\' chars aligned? :-)
Please refer to the definition of "lock_requestor" in include/linux/hyperv.h.
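For illustration, a sketch of the same computation with the continuation backslashes aligned in one column. The `EX_` names, the three size constants and the 16-byte `struct ex_sge` layout are assumptions made for this example, not values taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define EX_MAX_SQE_SIZE        512
#define EX_MAX_RQE_SIZE        256
#define EX_INLINE_OOB_SMALL    8

/* Hypothetical scatter-gather entry: 16 bytes. */
struct ex_sge {
	uint64_t address;
	uint32_t mem_key;
	uint32_t size;
};

/* Every line continuation sits in the same column. */
#define EX_MAX_TX_SGL_ENTRIES                                      \
	((EX_MAX_SQE_SIZE - sizeof(struct ex_sge) -                \
	  EX_INLINE_OOB_SMALL) / sizeof(struct ex_sge))

#define EX_MAX_RX_SGL_ENTRIES                                      \
	((EX_MAX_RQE_SIZE - sizeof(struct ex_sge)) /               \
	 sizeof(struct ex_sge))
```

With these assumed sizes the TX limit is (512 - 16 - 8) / 16 = 30 entries and the RX limit is (256 - 16) / 16 = 15 entries.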
Reviewed-by: Dexuan Cui <[email protected]>
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> From: Ajay Sharma <[email protected]>
>
> MANA hardware doesn't have any restrictions on the DMA segment size, set it
> to the max allowed value.
>
> Signed-off-by: Ajay Sharma <[email protected]>
> Signed-off-by: Long Li <[email protected]>
Reviewed-by: Dexuan Cui <[email protected]>
> Subject: RE: [Patch v4 01/12] net: mana: Add support for auxiliary device
>
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM
> ...
> > +static int add_adev(struct gdma_dev *gd) {
> > + int ret = 0;
> No need to initialize it to 0.
>
> > + struct mana_adev *madev;
> > + struct auxiliary_device *adev;
>
> davem would require the reverse xmas tree order :-)
Thank you, will send v5 to fix this.
>
> Reviewed-by: Dexuan Cui <[email protected]>
> Subject: RE: [Patch v4 04/12] net: mana: Add functions for allocating doorbell
> page from GDMA
>
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM ...
> > +EXPORT_SYMBOL(mana_gd_destroy_doorbell_page);
> Can this be EXPORT_SYMBOL_GPL?
>
> > +EXPORT_SYMBOL(mana_gd_allocate_doorbell_page);
> EXPORT_SYMBOL_GPL?
Will fix in v5.
>
> Reviewed-by: Dexuan Cui <[email protected]>
> Subject: RE: [Patch v4 10/12] net: mana: Define max values for SGL entries
>
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM ...
> > The number of maximum SGl entries should be computed from the maximum
> s/SGl/SGL
>
> > @@ -436,6 +436,13 @@ struct gdma_wqe { #define MAX_TX_WQE_SIZE 512
> > #define MAX_RX_WQE_SIZE 256
> >
> > +#define MAX_TX_WQE_SGL_ENTRIES ((GDMA_MAX_SQE_SIZE - \
> > + sizeof(struct gdma_sge) - INLINE_OOB_SMALL_SIZE) / \
> > + sizeof(struct gdma_sge))
> > +
> > +#define MAX_RX_WQE_SGL_ENTRIES ((GDMA_MAX_RQE_SIZE - \
> > + sizeof(struct gdma_sge)) / sizeof(struct gdma_sge))
>
> Can we make these '\' chars aligned? :-) Please refer to the definition of
> "lock_requestor" in include/linux/hyperv.h.
Will fix this.
>
> Reviewed-by: Dexuan Cui <[email protected]>
> Subject: RE: [Patch v4 07/12] net: mana: Export Work Queue functions for use
> by RDMA driver
>
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM @@ -125,6 +125,7 @@ int
> > mana_gd_send_request(struct gdma_context *gc,
> > u32 req_len, const void *req,
> >
> > return mana_hwc_send_request(hwc, req_len, req, resp_len, resp); }
> > +EXPORT_SYMBOL(mana_gd_send_request);
> Can we use EXPORT_SYMBOL_GPL?
>
> > @@ -715,9 +715,10 @@ static int mana_create_wq_obj(struct
> > mana_port_context *apc,
> > out:
> > return err;
> > }
> > +EXPORT_SYMBOL_GPL(mana_create_wq_obj);
>
Will fix this in v5.
> Well, here we use EXPORT_SYMBOL_GPL. If there is a rule to decide which one
> should be used, please add a comment.
>
> In general, the patch looks good to me.
>
> Reviewed-by: Dexuan Cui <[email protected]>
On Wed, Jun 15, 2022 at 07:07:20PM -0700, [email protected] wrote:
> --- /dev/null
> +++ b/drivers/infiniband/hw/mana/cq.c
> @@ -0,0 +1,80 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
Why perpetuate this mess that the OpenIB people created? I thought that
no new drivers were going to be added with this, why does this one need
to have it as well if it is new?
thanks,
greg k-h
> Subject: RE: [Patch v4 03/12] net: mana: Handle vport sharing between devices
>
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM
> > +void mana_uncfg_vport(struct mana_port_context *apc) {
> > + mutex_lock(&apc->vport_mutex);
> > + apc->vport_use_count--;
> > + WARN_ON(apc->vport_use_count < 0);
> > + mutex_unlock(&apc->vport_mutex);
> > +}
> > +EXPORT_SYMBOL_GPL(mana_uncfg_vport);
> > +
> > +int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
> > + u32 doorbell_pg_id)
> > {
> > struct mana_config_vport_resp resp = {};
> > struct mana_config_vport_req req = {};
> > int err;
> >
> > + /* Ethernet driver and IB driver can't take the port at the same time */
> > + mutex_lock(&apc->vport_mutex);
> > + if (apc->vport_use_count > 0) {
> > + mutex_unlock(&apc->vport_mutex);
> > + return -ENODEV;
> Maybe -EBUSY is better?
I agree with you, EBUSY is a better value. Will change this in v5.
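A minimal userspace sketch of the exclusive-use pattern being discussed, assuming a pthread mutex as a stand-in for the kernel mutex. The `ex_` names are hypothetical and the hardware configuration step is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the vport fields of struct mana_port_context. */
struct ex_port {
	pthread_mutex_t vport_mutex;
	int vport_use_count;
};

/* Take the port; a second caller gets -EBUSY instead of -ENODEV. */
static int ex_cfg_vport(struct ex_port *p)
{
	int err = 0;

	pthread_mutex_lock(&p->vport_mutex);
	if (p->vport_use_count > 0)
		err = -EBUSY;	/* the device exists, it is just in use */
	else
		p->vport_use_count++;
	pthread_mutex_unlock(&p->vport_mutex);

	return err;
}

/* Release the port so another user can take it. */
static void ex_uncfg_vport(struct ex_port *p)
{
	pthread_mutex_lock(&p->vport_mutex);
	p->vport_use_count--;
	assert(p->vport_use_count >= 0);
	pthread_mutex_unlock(&p->vport_mutex);
}
```

-EBUSY tells the caller that the port is present but currently held by the other driver, whereas -ENODEV would suggest that the device is missing.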
>
> > @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct
> > mana_port_context *apc, u32 protection_dom_id,
> >
> > apc->tx_shortform_allowed = resp.short_form_allowed;
> > apc->tx_vp_offset = resp.tx_vport_offset;
> > +
> > + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> > + apc->port_handle, protection_dom_id, doorbell_pg_id);
> Should this be netdev_dbg()?
> The log buffer can be flooded if there are many vPorts per VF PCI device and
> there are a lot of VFs.
The reason netdev_info() is used is that this message is important for troubleshooting initial setup issues with the Ethernet driver. We rely on the user to get this configured right to share the same hardware port between the Ethernet and RDMA drivers. As far as I know, there is no easy way for a driver to "take over" an exclusive hardware resource from another driver.
If it is acceptable to have one such message for each opened Ethernet port on the system, I suggest we keep it this way.
>
> > out:
> > + if (err) {
> > + mutex_lock(&apc->vport_mutex);
> > + apc->vport_use_count--;
> > + mutex_unlock(&apc->vport_mutex);
> > + }
>
> Change this to the below?
> if (err)
> mana_uncfg_vport(apc);
>
> > @@ -626,6 +654,9 @@ static int mana_cfg_vport_steering(struct
> > mana_port_context *apc,
> > resp.hdr.status);
> > err = -EPROTO;
> > }
> > +
> > + netdev_info(ndev, "Configured steering vPort %llu entries %u\n",
> > + apc->port_handle, num_entries);
>
> netdev_dbg()?
>
> In general, the patch looks good to me.
> Reviewed-by: Dexuan Cui <[email protected]>
> Subject: Re: [Patch v4 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure
> Network Adapter
>
> On Wed, Jun 15, 2022 at 07:07:20PM -0700, [email protected] wrote:
> > --- /dev/null
> > +++ b/drivers/infiniband/hw/mana/cq.c
> > @@ -0,0 +1,80 @@
> > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
>
> Why perpetuate this mess that the OpenIB people created? I thought that no
> new drivers were going to be added with this, why does this one need to have it
> as well if it is new?
I apologize for the incorrect license language. I followed other RDMA drivers' license terms but didn't realize their licensing language is not up to the standard.
The newly introduced EFA RDMA driver used the following license terms:
(drivers/infiniband/hw/efa/efa_main.c)
// SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
Is it acceptable that we use the same license terms?
Thanks,
Long
Please see comments inline
-----Original Message-----
From: Dexuan Cui <[email protected]>
Sent: Sunday, July 10, 2022 8:43 PM
To: Long Li <[email protected]>; KY Srinivasan <[email protected]>; Haiyang Zhang <[email protected]>; Stephen Hemminger <[email protected]>; Wei Liu <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Jason Gunthorpe <[email protected]>; Leon Romanovsky <[email protected]>; [email protected]; [email protected]; Ajay Sharma <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]
Subject: RE: [Patch v4 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM ...
> +int mana_ib_gd_create_dma_region(struct mana_ib_dev *dev, struct
> ib_umem *umem,
> + mana_handle_t *gdma_region, u64 page_sz) { ...
> + err = mana_gd_send_request(gc, create_req_msg_size, create_req,
> + sizeof(create_resp), &create_resp);
> + kfree(create_req);
> +
> + if (err || create_resp.hdr.status) {
> + ibdev_err(&dev->ib_dev,
> + "Failed to create DMA region: %d, 0x%x\n", err,
> + create_resp.hdr.status);
if (!err)
err = -EPROTO;
> + goto error;
> + }
> + ...
> + err = mana_gd_send_request(gc, add_req_msg_size,
> + add_req, sizeof(add_resp),
> + &add_resp);
> + if (!err || add_resp.hdr.status != expected_status) {
> + ibdev_err(&dev->ib_dev,
> + "Failed put DMA pages %u: %d,0x%x\n",
> + i, err, add_resp.hdr.status);
> + err = -EPROTO;
Should we try to undo what has been done by calling GDMA_DESTROY_DMA_REGION?
Yes, I updated the patch.
> + goto free_req;
> + }
> +
> + num_pages_cur += num_pages_to_handle;
> + num_pages_to_handle =
> + min_t(size_t, num_pages_total - num_pages_cur,
> + max_pgs_add_cmd);
> + add_req_msg_size = sizeof(*add_req) +
> + num_pages_to_handle * sizeof(u64);
> + }
> +free_req:
> + kfree(add_req);
> + }
> +
> +error:
> + return err;
> +}
> + ...
> +int mana_ib_gd_create_mr(struct mana_ib_dev *dev, struct mana_ib_mr
> *mr,
> + struct gdma_create_mr_params *mr_params) {
> + struct gdma_create_mr_response resp = {};
> + struct gdma_create_mr_request req = {};
> + struct gdma_dev *mdev = dev->gdma_dev;
> + struct gdma_context *gc;
> + int err;
> +
> + gc = mdev->gdma_context;
> +
> + mana_gd_init_req_hdr(&req.hdr, GDMA_CREATE_MR, sizeof(req),
> + sizeof(resp));
> + req.pd_handle = mr_params->pd_handle;
> +
> + switch (mr_params->mr_type) {
> + case GDMA_MR_TYPE_GVA:
> + req.mr_type = GDMA_MR_TYPE_GVA;
> + req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
> + req.gva.virtual_address = mr_params->gva.virtual_address;
> + req.gva.access_flags = mr_params->gva.access_flags;
> + break;
> +
> + case GDMA_MR_TYPE_GPA:
> + req.mr_type = GDMA_MR_TYPE_GPA;
> + req.gpa.access_flags = mr_params->gpa.access_flags;
> + break;
> +
> + case GDMA_MR_TYPE_FMR:
> + req.mr_type = GDMA_MR_TYPE_FMR;
> + req.fmr.page_size = mr_params->fmr.page_size;
> + req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
> + break;
> +
> + default:
> + ibdev_dbg(&dev->ib_dev,
> + "invalid param (GDMA_MR_TYPE) passed, type %d\n",
> + req.mr_type);
Here req.mr_type is always 0.
We should remove the 3 above lines of "req.mr_type = ...", and add a line "req.mr_type = mr_params->mr_type;" before the "switch" line.
No, that's incorrect. The mr_type is being explicitly set here to control what regions get exposed to the user and kernel. GPA and FMR are never exposed to the user, so we cannot assign req.mr_type = mr_params->mr_type.
> + err = -EINVAL;
> + goto error;
> + }
> +
> + err = mana_gd_send_request(gc, sizeof(req), &req, sizeof(resp),
> +&resp);
> +
> + if (err || resp.hdr.status) {
> + ibdev_err(&dev->ib_dev, "Failed to create mr %d, %u", err,
> + resp.hdr.status);
if (!err)
err = -EPROTO;
> + goto error;
> + }
> +
> + mr->ibmr.lkey = resp.lkey;
> + mr->ibmr.rkey = resp.rkey;
> + mr->mr_handle = resp.mr_handle;
> +
> + return 0;
> +error:
> + return err;
> +}
> + ...
> +static int mana_ib_probe(struct auxiliary_device *adev,
> + const struct auxiliary_device_id *id) {
> + struct mana_adev *madev = container_of(adev, struct mana_adev, adev);
> + struct gdma_dev *mdev = madev->mdev;
> + struct mana_context *mc;
> + struct mana_ib_dev *dev;
> + int ret = 0;
No need to initialize 'ret' to 0.
Agreed. Updated the patch.
> +int mana_ib_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata) {
> + struct mana_ib_mr *mr = container_of(ibmr, struct mana_ib_mr, ibmr);
> + struct ib_device *ibdev = ibmr->device;
> + struct mana_ib_dev *dev;
> + int err;
> +
> + dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> +
> + err = mana_ib_gd_destroy_mr(dev, mr->mr_handle);
> + if (err)
Should we return here without calling ib_umem_release() and kfree(mr)?
Yes. If the device fails to deallocate the resources and we release them back to the kernel, it will lead to unexpected results.
> + return err;
> +
> + if (mr->umem)
> + ib_umem_release(mr->umem);
> +
> + kfree(mr);
> +
> + return 0;
> +}
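The ordering discussed above (release host memory only after the device has confirmed the teardown) can be sketched in userspace as follows. The `ex_` names and the failing-handle convention are hypothetical, and `free()` stands in for ib_umem_release() and kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mana_ib_mr. */
struct ex_mr {
	int handle;
	void *umem;
};

/* Hypothetical device call: pretend negative handles fail to destroy. */
static int ex_destroy_mr(int handle)
{
	return handle < 0 ? -EIO : 0;
}

static int ex_dereg_mr(struct ex_mr *mr)
{
	int err = ex_destroy_mr(mr->handle);

	/*
	 * On failure the device may still reference the pages, so the
	 * memory is deliberately left allocated and the error is returned.
	 */
	if (err)
		return err;

	free(mr->umem);
	free(mr);
	return 0;
}
```

The cost of this choice is a leak when the device call fails; the alternative would hand pages back to the kernel while the device could still access them.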
Please see inline.
-----Original Message-----
From: Dexuan Cui <[email protected]>
Sent: Sunday, July 10, 2022 8:29 PM
To: Long Li <[email protected]>; KY Srinivasan <[email protected]>; Haiyang Zhang <[email protected]>; Stephen Hemminger <[email protected]>; Wei Liu <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Jason Gunthorpe <[email protected]>; Leon Romanovsky <[email protected]>; [email protected]; [email protected]; Ajay Sharma <[email protected]>
Cc: [email protected]; [email protected]; [email protected]; [email protected]
Subject: RE: [Patch v4 06/12] net: mana: Define data structures for protection domain and memory registration
> From: [email protected] <[email protected]>
> Sent: Wednesday, June 15, 2022 7:07 PM
>
> The MANA hardware support protection domain and memory registration
> for
s/support/supports
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h
> b/drivers/net/ethernet/microsoft/mana/gdma.h
> index f945755760dc..b1bec8ab5695 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma.h
> +++ b/drivers/net/ethernet/microsoft/mana/gdma.h
> @@ -27,6 +27,10 @@ enum gdma_request_type {
> GDMA_CREATE_DMA_REGION = 25,
> GDMA_DMA_REGION_ADD_PAGES = 26,
> GDMA_DESTROY_DMA_REGION = 27,
> + GDMA_CREATE_PD = 29,
> + GDMA_DESTROY_PD = 30,
> + GDMA_CREATE_MR = 31,
> + GDMA_DESTROY_MR = 32,
These are not used in this patch. They're used in the 12th patch for the first time. Can we move these to that patch?
> #define GDMA_RESOURCE_DOORBELL_PAGE 27
> @@ -59,6 +63,8 @@ enum {
> GDMA_DEVICE_MANA = 2,
> };
>
> +typedef u64 gdma_obj_handle_t;
> +
> struct gdma_resource {
> /* Protect the bitmap */
> spinlock_t lock;
> @@ -192,7 +198,7 @@ struct gdma_mem_info {
> u64 length;
>
> /* Allocated by the PF driver */
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
The old name "gdma_region" is shorter and it has "gdma"
rather than "dma".
The new name is longer. When one starts to read the code for the first time, I feel that "dma_region_handle" might be confusing as it's similar to "dma_handle" (which is the DMA address returned by dma_alloc_coherent()). "dma_region_handle" is an integer rather than a memory address.
You use the new name probably because there is a "mr_handle"
in the 12th patch. I prefer the old name, though the new name is also ok to me. If you decide to use the new name, it would be great if this patch could be split into two patches: one for the renaming only, and the other for the real changes.
> #define REGISTER_ATB_MST_MKEY_LOWER_SIZE 8
> @@ -599,7 +605,7 @@ struct gdma_create_queue_req {
> u32 reserved1;
> u32 pdid;
> u32 doolbell_id;
> - u64 gdma_region;
> + gdma_obj_handle_t gdma_region;
If we decide to use the new name "dma_region_handle", should we change the field/param names in the below structs and functions as well (this may not be a complete list)?
struct mana_ib_wq
struct mana_ib_cq
mana_ib_gd_create_dma_region
mana_ib_gd_destroy_dma_region
> u32 reserved2;
> u32 queue_size;
> u32 log2_throttle_limit;
> @@ -626,6 +632,28 @@ struct gdma_disable_queue_req {
> u32 alloc_res_id_on_creation;
> }; /* HW DATA */
>
> +enum atb_page_size {
> + ATB_PAGE_SIZE_4K,
> + ATB_PAGE_SIZE_8K,
> + ATB_PAGE_SIZE_16K,
> + ATB_PAGE_SIZE_32K,
> + ATB_PAGE_SIZE_64K,
> + ATB_PAGE_SIZE_128K,
> + ATB_PAGE_SIZE_256K,
> + ATB_PAGE_SIZE_512K,
> + ATB_PAGE_SIZE_1M,
> + ATB_PAGE_SIZE_2M,
> + ATB_PAGE_SIZE_MAX,
> +};
> +
> +enum gdma_mr_access_flags {
> + GDMA_ACCESS_FLAG_LOCAL_READ = (1 << 0),
> + GDMA_ACCESS_FLAG_LOCAL_WRITE = (1 << 1),
> + GDMA_ACCESS_FLAG_REMOTE_READ = (1 << 2),
> + GDMA_ACCESS_FLAG_REMOTE_WRITE = (1 << 3),
> + GDMA_ACCESS_FLAG_REMOTE_ATOMIC = (1 << 4),
> +};
It would be better to use BIT_ULL(0), BIT_ULL(1), etc.
Agreed, updated in the new patch.
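As a rough illustration of the suggestion, the sketch below defines the flags with a local `EX_BIT_ULL()` macro that mimics the kernel's BIT_ULL() helper. The macro and the `EX_` flag names are stand-ins for this example so that it compiles in userspace.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT_ULL() helper. */
#define EX_BIT_ULL(nr)	(1ULL << (nr))

/* Hypothetical copy of the access flags, one bit per permission. */
enum ex_mr_access_flags {
	EX_ACCESS_FLAG_LOCAL_READ    = EX_BIT_ULL(0),
	EX_ACCESS_FLAG_LOCAL_WRITE   = EX_BIT_ULL(1),
	EX_ACCESS_FLAG_REMOTE_READ   = EX_BIT_ULL(2),
	EX_ACCESS_FLAG_REMOTE_WRITE  = EX_BIT_ULL(3),
	EX_ACCESS_FLAG_REMOTE_ATOMIC = EX_BIT_ULL(4),
};
```

The values are the same as with the open-coded shifts; the macro only makes the bit position explicit.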
> /* GDMA_CREATE_DMA_REGION */
> struct gdma_create_dma_region_req {
> struct gdma_req_hdr hdr;
> @@ -652,14 +680,14 @@ struct gdma_create_dma_region_req {
>
> struct gdma_create_dma_region_resp {
> struct gdma_resp_hdr hdr;
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
> }; /* HW DATA */
>
> /* GDMA_DMA_REGION_ADD_PAGES */
> struct gdma_dma_region_add_pages_req {
> struct gdma_req_hdr hdr;
>
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
>
> u32 page_addr_list_len;
> u32 reserved3;
> @@ -671,9 +699,114 @@ struct gdma_dma_region_add_pages_req {
> struct gdma_destroy_dma_region_req {
> struct gdma_req_hdr hdr;
>
> - u64 gdma_region;
> + gdma_obj_handle_t dma_region_handle;
> }; /* HW DATA */
>
> +enum gdma_pd_flags {
> + GDMA_PD_FLAG_ALLOW_GPA_MR = (1 << 0),
> + GDMA_PD_FLAG_ALLOW_FMR_MR = (1 << 1),
> +};
Use BIT_ULL(0), BIT_ULL(1) ?
Agreed, and updated the patch.
> +struct gdma_create_pd_req {
> + struct gdma_req_hdr hdr;
> + enum gdma_pd_flags flags;
> + u32 reserved;
> +};
> +
> +struct gdma_create_pd_resp {
> + struct gdma_resp_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> + u32 pd_id;
> + u32 reserved;
> +};
> +
> +struct gdma_destroy_pd_req {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> +};
> +
> +struct gdma_destory_pd_resp {
> + struct gdma_resp_hdr hdr;
> +};
> +
> +enum gdma_mr_type {
> + /* Guest Physical Address - MRs of this type allow access
> + * to any DMA-mapped memory using bus-logical address
> + */
> + GDMA_MR_TYPE_GPA = 1,
> +
> + /* Guest Virtual Address - MRs of this type allow access
> + * to memory mapped by PTEs associated with this MR using a virtual
> + * address that is set up in the MST
> + */
> + GDMA_MR_TYPE_GVA,
> +
> + /* Fast Memory Register - Like GVA but the MR is initially put in the
> + * FREE state (as opposed to Valid), and the specified number of
> + * PTEs are reserved for future fast memory reservations.
> + */
> + GDMA_MR_TYPE_FMR,
> +};
> +
> +struct gdma_create_mr_params {
> + gdma_obj_handle_t pd_handle;
> + enum gdma_mr_type mr_type;
> + union {
> + struct {
> + gdma_obj_handle_t dma_region_handle;
> + u64 virtual_address;
> + enum gdma_mr_access_flags access_flags;
> + } gva;
Add an empty line to make it more readable?
Done.
> + struct {
> + enum gdma_mr_access_flags access_flags;
> + } gpa;
Add an empty line?
> + struct {
> + enum atb_page_size page_size;
> + u32 reserved_pte_count;
> + } fmr;
> + };
> +};
The definition of struct gdma_create_mr_params is not naturally aligned.
This can potentially cause issues.
This is a union, so its biggest element is word-aligned. Since this is not passed to the HW, I feel it should be fine.
According to my test, sizeof(struct gdma_create_mr_params) is 40 bytes, meaning the compiler adds two "hidden" fields:
struct gdma_create_mr_params {
gdma_obj_handle_t pd_handle; // offset = 0
enum gdma_mr_type mr_type; // offset = 8
+ u32 hidden_field_a;
union { // offset = 0x10
struct {
gdma_obj_handle_t dma_region_handle; // offset =0x10
u64 virtual_address; // offset =0x18
enum gdma_mr_access_flags access_flags; // offset =0x20
+ u32 hidden_field_b;
} gva;
We'll run into trouble some day if the Linux VF driver or the host PF driver adds something like __attribute__((packed)).
Can we work with the host team to improve the definition? If it's hard/impossible to change the PF driver side definition, both sides should at least explicitly define the two hidden fields as reserved fields.
BTW, can we assume the size of "enum" is 4 bytes? I prefer using u32 explicitly when a struct is used to talk to the PF driver or the device.
If we decide to use "enum", I suggest we add BUILD_BUG_ON(sizeof(struct gdma_create_mr_params) != 40) to make sure the assumption is true.
BTW, Haiyang added "/* HW DATA */ " to other definitions, e.g. gdma_create_queue_resp. Can you please add the same comment for consistency?
> +struct gdma_create_mr_request {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t pd_handle;
> + enum gdma_mr_type mr_type;
> + u32 reserved;
> +
> + union {
> + struct {
> + enum gdma_mr_access_flags access_flags;
> + } gpa;
> +
> + struct {
> + gdma_obj_handle_t dma_region_handle;
> + u64 virtual_address;
> + enum gdma_mr_access_flags access_flags;
Similarly, there is a hidden u32 field here. We should explicitly define it.
> + } gva;
Can we use the same order of "gva; gpa" used in struct gdma_create_mr_params?
Done, although it shouldn't matter in the union case.
> + struct {
> + enum atb_page_size page_size;
> + u32 reserved_pte_count;
> + } fmr;
> + };
> +};
Add BUILD_BUG_ON(sizeof(struct gdma_create_mr_request) != 80) ?
Add /* HW DATA */ ?
> +struct gdma_create_mr_response {
> + struct gdma_resp_hdr hdr;
> + gdma_obj_handle_t mr_handle;
> + u32 lkey;
> + u32 rkey;
> +};
> +
> +struct gdma_destroy_mr_request {
> + struct gdma_req_hdr hdr;
> + gdma_obj_handle_t mr_handle;
> +};
> +
> +struct gdma_destroy_mr_response {
> + struct gdma_resp_hdr hdr;
> +};
> +
None of the new defines are really used in this patch:
+enum atb_page_size {
+enum gdma_mr_access_flags {
+enum gdma_pd_flags {
+struct gdma_create_pd_req {
+struct gdma_create_pd_resp {
+struct gdma_destroy_pd_req {
+struct gdma_destory_pd_resp {
+enum gdma_mr_type {
+struct gdma_create_mr_params {
+struct gdma_create_mr_request {
+struct gdma_create_mr_response {
+struct gdma_destroy_mr_request {
+struct gdma_destroy_mr_response
The new defines are used in the 12th patch for the first time.
Can we move these to that patch or at least move these defines to before the 12th patch?
On Tue, Jul 12, 2022 at 11:46:36PM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 12/12] RDMA/mana_ib: Add a driver for Microsoft Azure
> > Network Adapter
> >
> > On Wed, Jun 15, 2022 at 07:07:20PM -0700, [email protected] wrote:
> > > --- /dev/null
> > > +++ b/drivers/infiniband/hw/mana/cq.c
> > > @@ -0,0 +1,80 @@
> > > +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> >
> > Why perpetuate this mess that the OpenIB people created? I thought that no
> > new drivers were going to be added with this, why does this one need to have it
> > as well if it is new?
>
> I apologize for the incorrect license language. I followed other RDMA drivers' license terms but didn't realize their licensing language is not up to the standard.
You need to follow the license rules of your employer, please consult
with them as they know what to do here.
> The newly introduced EFA RDMA driver used the following license terms:
> (drivers/infiniband/hw/efa/efa_main.c)
> // SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause
>
> Is it acceptable that we use the same license terms?
Again, discuss this with the lawyers at your company. But if you are
going to use a dual-license, you must be prepared to defend why you are
doing so.
thanks,
greg k-h
> From: Ajay Sharma <[email protected]>
> Sent: Tuesday, July 12, 2022 9:39 PM
> To: Dexuan Cui <[email protected]>; Long Li <[email protected]>; KY
> ...
> > The definition of struct gdma_create_mr_params is not naturally aligned.
> > This can potentially cause issues.
> This is union and so the biggest element is aligned to word. I feel since this is
> not passed to the hw it should be fine.
Ajay, you're right. I didn't realize struct gdma_create_mr_params is not really
passed to the PF driver or the device. Please ignore my comments on
struct gdma_create_mr_params. Sorry for the confusion!
> > BTW, Haiyang added "/* HW DATA */ " to other definitions, e.g.
> > gdma_create_queue_resp. Can you please add the same comment for
> > consistency?
It's still recommended that we add the tag "/* HW DATA */ " to new definitions
that are passed to the PF driver or the device.
> > +struct gdma_create_mr_request {
> > + struct gdma_req_hdr hdr;
> > + gdma_obj_handle_t pd_handle;
> > + enum gdma_mr_type mr_type;
> > + u32 reserved;
> > +
> > + union {
> > + struct {
> > + enum gdma_mr_access_flags access_flags;
> > + } gpa;
> > +
> > + struct {
> > + gdma_obj_handle_t dma_region_handle;
> > + u64 virtual_address;
> > + enum gdma_mr_access_flags access_flags;
>
> Similarly, there is a hidden u32 field here. We should explicitly define it.
The issue with struct gdma_create_mr_request is valid, since it's
passed to the PF driver. We should explicitly define the hidden field.
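One way to make the hidden field explicit is sketched below. This is a userspace approximation only: the request header is left out, `uint32_t` replaces the enums, and the `ex_` and `reserved` names are hypothetical. It assumes an ABI where a 64-bit integer is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical body of the create-MR request, without the gdma_req_hdr. */
struct ex_create_mr_req_body {
	uint64_t pd_handle;
	uint32_t mr_type;
	uint32_t reserved1;
	union {
		struct {
			uint64_t dma_region_handle;
			uint64_t virtual_address;
			uint32_t access_flags;
			uint32_t reserved2;	/* was implicit tail padding */
		} gva;
		struct {
			uint32_t access_flags;
		} gpa;
		struct {
			uint32_t page_size;
			uint32_t reserved_pte_count;
		} fmr;
	};
};

/* With every byte named, the layout no longer depends on compiler padding. */
static_assert(sizeof(struct ex_create_mr_req_body) == 40, "unexpected size");
static_assert(offsetof(struct ex_create_mr_req_body, gva.reserved2) == 36,
	      "unexpected offset of the explicit reserved field");
```

Because the reserved field occupies the bytes the compiler would otherwise pad, the size stays at 40 bytes and both sides of the interface can agree on what those bytes contain.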
> From: Ajay Sharma <[email protected]>
> Sent: Tuesday, July 12, 2022 9:33 PM
> > ...
> > + switch (mr_params->mr_type) {
> > + case GDMA_MR_TYPE_GVA:
> > + req.mr_type = GDMA_MR_TYPE_GVA;
> > + req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
> > + req.gva.virtual_address = mr_params->gva.virtual_address;
> > + req.gva.access_flags = mr_params->gva.access_flags;
> > + break;
> > +
> > + case GDMA_MR_TYPE_GPA:
> > + req.mr_type = GDMA_MR_TYPE_GPA;
> > + req.gpa.access_flags = mr_params->gpa.access_flags;
> > + break;
> > +
> > + case GDMA_MR_TYPE_FMR:
> > + req.mr_type = GDMA_MR_TYPE_FMR;
> > + req.fmr.page_size = mr_params->fmr.page_size;
> > + req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
> > + break;
> > +
> > + default:
> > + ibdev_dbg(&dev->ib_dev,
> > + "invalid param (GDMA_MR_TYPE) passed, type %d\n",
> > + req.mr_type);
>
> Here req.mr_type is always 0.
> We should remove the 3 above lines of "req.mr_type = ...", and add a line
> "req.mr_type = mr_params->mr_type;" before the "switch" line..
>
> No, That's incorrect. The mr_type is being explicitly set here to control what
> regions get exposed to the user and kernel. GPA and FMR are never exposed to
> user. So we cannot assign req.mr_type = mr_params->mr_type.
I'm not following you. I meant the below change, which should have no
functional change, right? In the "default:" branch, we just "goto error;", so
there is no functional change either.
--- drivers/infiniband/hw/mana/main.c.orig
+++ drivers/infiniband/hw/mana/main.c
@@ -394,21 +394,19 @@
sizeof(resp));
req.pd_handle = mr_params->pd_handle;
+ req.mr_type = mr_params->mr_type;
switch (mr_params->mr_type) {
case GDMA_MR_TYPE_GVA:
- req.mr_type = GDMA_MR_TYPE_GVA;
req.gva.dma_region_handle = mr_params->gva.dma_region_handle;
req.gva.virtual_address = mr_params->gva.virtual_address;
req.gva.access_flags = mr_params->gva.access_flags;
break;
case GDMA_MR_TYPE_GPA:
- req.mr_type = GDMA_MR_TYPE_GPA;
req.gpa.access_flags = mr_params->gpa.access_flags;
break;
case GDMA_MR_TYPE_FMR:
- req.mr_type = GDMA_MR_TYPE_FMR;
req.fmr.page_size = mr_params->fmr.page_size;
req.fmr.reserved_pte_count = mr_params->fmr.reserved_pte_count;
break;
On Wed, Jun 15, 2022 at 07:07:20PM -0700, [email protected] wrote:
> +static int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
> +{
> + struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
> + struct ib_device *ibdev = ibpd->device;
> + enum gdma_pd_flags flags = 0;
> + struct mana_ib_dev *dev;
> + int ret;
> +
> + dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> +
> + /* Set flags if this is a kernel request */
> + if (!ibpd->uobject)
> + flags = GDMA_PD_FLAG_ALLOW_GPA_MR | GDMA_PD_FLAG_ALLOW_FMR_MR;
I'm confused, this driver doesn't seem to support kverbs:
> +static const struct ib_device_ops mana_ib_dev_ops = {
> + .owner = THIS_MODULE,
> + .driver_id = RDMA_DRIVER_MANA,
> + .uverbs_abi_ver = MANA_IB_UVERBS_ABI_VERSION,
> +
> + .alloc_pd = mana_ib_alloc_pd,
> + .alloc_ucontext = mana_ib_alloc_ucontext,
> + .create_cq = mana_ib_create_cq,
> + .create_qp = mana_ib_create_qp,
> + .create_rwq_ind_table = mana_ib_create_rwq_ind_table,
> + .create_wq = mana_ib_create_wq,
> + .dealloc_pd = mana_ib_dealloc_pd,
> + .dealloc_ucontext = mana_ib_dealloc_ucontext,
> + .dereg_mr = mana_ib_dereg_mr,
> + .destroy_cq = mana_ib_destroy_cq,
> + .destroy_qp = mana_ib_destroy_qp,
> + .destroy_rwq_ind_table = mana_ib_destroy_rwq_ind_table,
> + .destroy_wq = mana_ib_destroy_wq,
> + .disassociate_ucontext = mana_ib_disassociate_ucontext,
> + .get_port_immutable = mana_ib_get_port_immutable,
> + .mmap = mana_ib_mmap,
> + .modify_qp = mana_ib_modify_qp,
> + .modify_wq = mana_ib_modify_wq,
> + .query_device = mana_ib_query_device,
> + .query_gid = mana_ib_query_gid,
> + .query_port = mana_ib_query_port,
> + .reg_user_mr = mana_ib_reg_user_mr,
e.g. there is no way to create a kernel MR.
So, why do I see so many kverbs-like things - and why are things like
FMR in this driver that can never be used?
Jason
On Mon, Jul 11, 2022 at 01:29:08AM +0000, Dexuan Cui wrote:
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM
> >
> > The MANA hardware support protection domain and memory registration for
> s/support/supports
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h
> > b/drivers/net/ethernet/microsoft/mana/gdma.h
> > index f945755760dc..b1bec8ab5695 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma.h
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma.h
> > @@ -27,6 +27,10 @@ enum gdma_request_type {
> > GDMA_CREATE_DMA_REGION = 25,
> > GDMA_DMA_REGION_ADD_PAGES = 26,
> > GDMA_DESTROY_DMA_REGION = 27,
> > + GDMA_CREATE_PD = 29,
> > + GDMA_DESTROY_PD = 30,
> > + GDMA_CREATE_MR = 31,
> > + GDMA_DESTROY_MR = 32,
> These are not used in this patch. They're used in the 12th
> patch for the first time. Can we move these to that patch?
This looks like RDMA code anyhow, why is it under net/ethernet?
Jason
On Tue, Jul 12, 2022 at 06:48:09PM +0000, Long Li wrote:
> > > @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct
> > > mana_port_context *apc, u32 protection_dom_id,
> > >
> > > apc->tx_shortform_allowed = resp.short_form_allowed;
> > > apc->tx_vp_offset = resp.tx_vport_offset;
> > > +
> > > + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> > > + apc->port_handle, protection_dom_id, doorbell_pg_id);
> > Should this be netdev_dbg()?
> > The log buffer can be flooded if there are many vPorts per VF PCI device and
> > there are a lot of VFs.
>
> The reason netdev_info () is used is that this message is important
> for troubleshooting initial setup issues with Ethernet driver. We
> rely on user to get this configured right to share the same hardware
> port between Ethernet and RDMA driver. As far as I know, there is no
> easy way for a driver to "take over" an exclusive hardware resource
> from another driver.
This seems like a really strange statement.
Exactly how does all of this work?
Jason
On Mon, Jul 11, 2022 at 01:13:50AM +0000, Dexuan Cui wrote:
> > From: [email protected] <[email protected]>
> > Sent: Wednesday, June 15, 2022 7:07 PM
> > ...
> > +EXPORT_SYMBOL(mana_gd_destroy_doorbell_page);
> Can this be EXPORT_SYMBOL_GPL?
>
> > +EXPORT_SYMBOL(mana_gd_allocate_doorbell_page);
> EXPORT_SYMBOL_GPL?
Can you think about using the symbol namespaces here?
Nobody else has done it yet, but I think we should be...
Jason
> Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between
> devices
>
> On Tue, Jul 12, 2022 at 06:48:09PM +0000, Long Li wrote:
> > > > @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct
> > > > mana_port_context *apc, u32 protection_dom_id,
> > > >
> > > > apc->tx_shortform_allowed = resp.short_form_allowed;
> > > > apc->tx_vp_offset = resp.tx_vport_offset;
> > > > +
> > > > + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> > > > + apc->port_handle, protection_dom_id, doorbell_pg_id);
> > > Should this be netdev_dbg()?
> > > The log buffer can be flooded if there are many vPorts per VF PCI
> > > device and there are a lot of VFs.
> >
> > The reason netdev_info () is used is that this message is important
> > for troubleshooting initial setup issues with Ethernet driver. We rely
> > on user to get this configured right to share the same hardware port
> > between Ethernet and RDMA driver. As far as I know, there is no easy
> > way for a driver to "take over" an exclusive hardware resource from
> > another driver.
>
> This seems like a really strange statement.
>
> Exactly how does all of this work?
>
> Jason
"vport" is a hardware resource that can either be used by an Ethernet device, or an RDMA device. But it can't be used by both at the same time. The "vport" is associated with a protection domain and doorbell, it's programmed in the hardware. Outgoing traffic is enforced on this vport based on how it is programmed.
The hardware is not responsible for tracking which one is using this "vport"; it's up to the software to make sure the vport is correctly configured for the device that is using it.
Long
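The rule described above, one user per vport with software doing the bookkeeping, can be modelled outside the kernel. What follows is only a userspace sketch under stated assumptions: `struct vport`, `vport_claim()` and `vport_release()` are hypothetical names, not functions from the patch, and a C11 atomic stands in for whatever locking the driver would actually use.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical model: who currently owns the vport. */
enum vport_owner {
	VPORT_FREE = 0,
	VPORT_ETH,          /* the Ethernet netdev */
	VPORT_RDMA_RAW_QP   /* an RDMA RAW QP */
};

struct vport {
	atomic_int owner;   /* holds an enum vport_owner value */
};

void vport_init(struct vport *vp)
{
	atomic_init(&vp->owner, VPORT_FREE);
}

/* Claim the vport; fails with -EBUSY if anyone already owns it. */
int vport_claim(struct vport *vp, enum vport_owner who)
{
	int expected = VPORT_FREE;

	if (who == VPORT_FREE)
		return -EINVAL;
	if (!atomic_compare_exchange_strong(&vp->owner, &expected, who))
		return -EBUSY;
	return 0;
}

/* Release the vport; only the current owner may do so. */
int vport_release(struct vport *vp, enum vport_owner who)
{
	int expected = who;

	if (who == VPORT_FREE)
		return -EINVAL;
	if (!atomic_compare_exchange_strong(&vp->owner, &expected, VPORT_FREE))
		return -EPERM;
	return 0;
}
```

With this shape a second claimant gets `-EBUSY` instead of silently reprogramming the vport, so the conflict is reported to the caller rather than showing up only in the log.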
> Subject: Re: [Patch v4 06/12] net: mana: Define data structures for protection
> domain and memory registration
>
> On Mon, Jul 11, 2022 at 01:29:08AM +0000, Dexuan Cui wrote:
> > > From: [email protected] <[email protected]>
> > > Sent: Wednesday, June 15, 2022 7:07 PM
> > >
> > > The MANA hardware support protection domain and memory registration for
> > s/support/supports
> >
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma.h
> > > b/drivers/net/ethernet/microsoft/mana/gdma.h
> > > index f945755760dc..b1bec8ab5695 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma.h
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma.h
> > > @@ -27,6 +27,10 @@ enum gdma_request_type {
> > > GDMA_CREATE_DMA_REGION = 25,
> > > GDMA_DMA_REGION_ADD_PAGES = 26,
> > > GDMA_DESTROY_DMA_REGION = 27,
> > > + GDMA_CREATE_PD = 29,
> > > + GDMA_DESTROY_PD = 30,
> > > + GDMA_CREATE_MR = 31,
> > > + GDMA_DESTROY_MR = 32,
> > These are not used in this patch. They're used in the 12th patch for
> > the first time. Can we move these to that patch?
>
> This looks like RDMA code anyhow, why is it under net/ethernet?
>
> Jason
This header file belongs to the GDMA layer (as its filename implies). GDMA is the hardware communication layer that both the Ethernet and RDMA drivers use to talk to the hardware.
Some of the RDMA functionality is implemented at the GDMA layer in the PF running on the host, so the message definitions are also there.
Long
On 6/16/22 10:07 AM, [email protected] wrote:
> From: Long Li <[email protected]>
>
<...>
> +
> +static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
> + struct ib_qp_init_attr *attr,
> + struct ib_udata *udata)
> +{
> + struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
> + struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
> + struct mana_ib_dev *mdev =
> + container_of(ibpd->device, struct mana_ib_dev, ib_dev);
> + struct mana_ib_cq *send_cq =
> + container_of(attr->send_cq, struct mana_ib_cq, ibcq);
> + struct ib_ucontext *ib_ucontext = ibpd->uobject->context;
> + struct mana_ib_create_qp_resp resp = {};
> + struct mana_ib_ucontext *mana_ucontext;
> + struct gdma_dev *gd = mdev->gdma_dev;
> + struct mana_ib_create_qp ucmd = {};
> + struct mana_obj_spec wq_spec = {};
> + struct mana_obj_spec cq_spec = {};
> + struct mana_port_context *mpc;
> + struct mana_context *mc;
> + struct net_device *ndev;
> + struct ib_umem *umem;
> + int err;
> + u32 port;
> +
> + mana_ucontext =
> + container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
> + mc = gd->driver_data;
> +
> + if (udata->inlen < sizeof(ucmd))
> + return -EINVAL;
> +
> + err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> + if (err) {
> + ibdev_dbg(&mdev->ib_dev,
> + "Failed to copy from udata create qp-raw, %d\n", err);
> + return -EFAULT;
> + }
> +
> + /* IB ports start with 1, MANA Ethernet ports start with 0 */
> + port = ucmd.port;
> + if (ucmd.port > mc->num_ports)
> + return -EINVAL;
> +
> + if (attr->cap.max_send_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
> + ibdev_dbg(&mdev->ib_dev,
> + "Requested max_send_wr %d exceeding limit\n",
> + attr->cap.max_send_wr);
> + return -EINVAL;
> + }
> +
> + if (attr->cap.max_send_sge > MAX_TX_WQE_SGL_ENTRIES) {
> + ibdev_dbg(&mdev->ib_dev,
> + "Requested max_send_sge %d exceeding limit\n",
> + attr->cap.max_send_sge);
> + return -EINVAL;
> + }
> +
> + ndev = mc->ports[port - 1];
> + mpc = netdev_priv(ndev);
> + ibdev_dbg(&mdev->ib_dev, "port %u ndev %p mpc %p\n", port, ndev, mpc);
> +
> + err = mana_ib_cfg_vport(mdev, port - 1, pd, mana_ucontext->doorbell);
> + if (err)
> + return -ENODEV;
> +
> + qp->port = port;
> +
> + ibdev_dbg(&mdev->ib_dev, "ucmd sq_buf_addr 0x%llx port %u\n",
> + ucmd.sq_buf_addr, ucmd.port);
> +
> + umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size,
> + IB_ACCESS_LOCAL_WRITE);
> + if (IS_ERR(umem)) {
> + err = PTR_ERR(umem);
> + ibdev_dbg(&mdev->ib_dev,
> + "Failed to get umem for create qp-raw, err %d\n",
> + err);
> + goto err_free_vport;
> + }
> + qp->sq_umem = umem;
> +
> + err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem,
> + &qp->sq_gdma_region, PAGE_SIZE);
> + if (err) {
> + ibdev_err(&mdev->ib_dev,
> + "Failed to create dma region for create qp-raw, %d\n",
> + err);
It is better not to print in userspace-triggered paths.
The same issue also exists in other paths.
Thanks,
Cheng Xu
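Separately from the logging point, the quoted hunk checks only `ucmd.port > mc->num_ports` before indexing `mc->ports[port - 1]`, so a port value of 0 passes the check and `port - 1` wraps around for a `u32`. A minimal standalone sketch of a two-sided check for the 1-based to 0-based translation follows; `ib_port_to_mana_index()` is a hypothetical helper name used for illustration, not a function in the patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Translate a 1-based IB port number into a 0-based MANA port index.
 * Hypothetical helper: rejects 0 as well as values above num_ports,
 * so the caller never computes "port - 1" on an invalid port.
 */
int ib_port_to_mana_index(uint32_t ib_port, uint32_t num_ports,
			  uint32_t *index)
{
	if (ib_port < 1 || ib_port > num_ports)
		return -EINVAL;

	*index = ib_port - 1;
	return 0;
}
```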
On Thu, Jul 21, 2022 at 12:06:12AM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between
> > devices
> >
> > On Tue, Jul 12, 2022 at 06:48:09PM +0000, Long Li wrote:
> > > > > @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct
> > > > > mana_port_context *apc, u32 protection_dom_id,
> > > > >
> > > > > apc->tx_shortform_allowed = resp.short_form_allowed;
> > > > > apc->tx_vp_offset = resp.tx_vport_offset;
> > > > > +
> > > > > + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> > > > > + apc->port_handle, protection_dom_id, doorbell_pg_id);
> > > > Should this be netdev_dbg()?
> > > > The log buffer can be flooded if there are many vPorts per VF PCI
> > > > device and there are a lot of VFs.
> > >
> > > The reason netdev_info () is used is that this message is important
> > > for troubleshooting initial setup issues with Ethernet driver. We rely
> > > on user to get this configured right to share the same hardware port
> > > between Ethernet and RDMA driver. As far as I know, there is no easy
> > > way for a driver to "take over" an exclusive hardware resource from
> > > another driver.
> >
> > This seems like a really strange statement.
> >
> > Exactly how does all of this work?
> >
> > Jason
>
> "vport" is a hardware resource that can either be used by an
> Ethernet device, or an RDMA device. But it can't be used by both at
> the same time. The "vport" is associated with a protection domain
> and doorbell, it's programmed in the hardware. Outgoing traffic is
> enforced on this vport based on how it is programmed.
Sure, but how is it the user's problem to "get this configured right", and
what exactly is the user supposed to do?
I would expect the allocation of HW resources to be completely
transparent to the user. Why is it not?
Jason
> Subject: Re: [Patch v4 12/12] RDMA/mana_ib: Add a driver for Microsoft
> Azure Network Adapter
>
>
>
> On 6/16/22 10:07 AM, [email protected] wrote:
> > From: Long Li <[email protected]>
> >
>
> <...>
>
> > +
> > +static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
> > + struct ib_qp_init_attr *attr,
> > + struct ib_udata *udata)
> > +{
> > + struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
> > + struct mana_ib_qp *qp = container_of(ibqp, struct mana_ib_qp, ibqp);
> > + struct mana_ib_dev *mdev =
> > + container_of(ibpd->device, struct mana_ib_dev, ib_dev);
> > + struct mana_ib_cq *send_cq =
> > + container_of(attr->send_cq, struct mana_ib_cq, ibcq);
> > + struct ib_ucontext *ib_ucontext = ibpd->uobject->context;
> > + struct mana_ib_create_qp_resp resp = {};
> > + struct mana_ib_ucontext *mana_ucontext;
> > + struct gdma_dev *gd = mdev->gdma_dev;
> > + struct mana_ib_create_qp ucmd = {};
> > + struct mana_obj_spec wq_spec = {};
> > + struct mana_obj_spec cq_spec = {};
> > + struct mana_port_context *mpc;
> > + struct mana_context *mc;
> > + struct net_device *ndev;
> > + struct ib_umem *umem;
> > + int err;
> > + u32 port;
> > +
> > + mana_ucontext =
> > + container_of(ib_ucontext, struct mana_ib_ucontext, ibucontext);
> > + mc = gd->driver_data;
> > +
> > + if (udata->inlen < sizeof(ucmd))
> > + return -EINVAL;
> > +
> > + err = ib_copy_from_udata(&ucmd, udata, min(sizeof(ucmd), udata->inlen));
> > + if (err) {
> > + ibdev_dbg(&mdev->ib_dev,
> > + "Failed to copy from udata create qp-raw, %d\n", err);
> > + return -EFAULT;
> > + }
> > +
> > + /* IB ports start with 1, MANA Ethernet ports start with 0 */
> > + port = ucmd.port;
> > + if (ucmd.port > mc->num_ports)
> > + return -EINVAL;
> > +
> > + if (attr->cap.max_send_wr > MAX_SEND_BUFFERS_PER_QUEUE) {
> > + ibdev_dbg(&mdev->ib_dev,
> > + "Requested max_send_wr %d exceeding limit\n",
> > + attr->cap.max_send_wr);
> > + return -EINVAL;
> > + }
> > +
> > + if (attr->cap.max_send_sge > MAX_TX_WQE_SGL_ENTRIES) {
> > + ibdev_dbg(&mdev->ib_dev,
> > + "Requested max_send_sge %d exceeding limit\n",
> > + attr->cap.max_send_sge);
> > + return -EINVAL;
> > + }
> > +
> > + ndev = mc->ports[port - 1];
> > + mpc = netdev_priv(ndev);
> > + ibdev_dbg(&mdev->ib_dev, "port %u ndev %p mpc %p\n", port, ndev, mpc);
> > +
> > + err = mana_ib_cfg_vport(mdev, port - 1, pd, mana_ucontext->doorbell);
> > + if (err)
> > + return -ENODEV;
> > +
> > + qp->port = port;
> > +
> > + ibdev_dbg(&mdev->ib_dev, "ucmd sq_buf_addr 0x%llx port %u\n",
> > + ucmd.sq_buf_addr, ucmd.port);
> > +
> > + umem = ib_umem_get(ibpd->device, ucmd.sq_buf_addr, ucmd.sq_buf_size,
> > + IB_ACCESS_LOCAL_WRITE);
> > + if (IS_ERR(umem)) {
> > + err = PTR_ERR(umem);
> > + ibdev_dbg(&mdev->ib_dev,
> > + "Failed to get umem for create qp-raw, err %d\n",
> > + err);
> > + goto err_free_vport;
> > + }
> > + qp->sq_umem = umem;
> > +
> > + err = mana_ib_gd_create_dma_region(mdev, qp->sq_umem,
> > + &qp->sq_gdma_region, PAGE_SIZE);
> > + if (err) {
> > + ibdev_err(&mdev->ib_dev,
> > + "Failed to create dma region for create qp-raw, %d\n",
> > + err);
>
> It is better not print in userspace-triggered paths.
>
> There are also same issues in other paths.
>
> Thanks,
> Cheng Xu
Thank you, I will scan the code and make sure user-mode can't flood error messages.
This particular error is a hardware/PF error: it means the hardware channel has faulted, which is why it is logged at error level.
Long
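One common way to keep a user-triggerable path from flooding the log, besides demoting the message to debug level, is to rate-limit it (the kernel offers helpers such as `printk_ratelimited()` for this). The sketch below is a small userspace model of the idea only, assuming a fixed window and burst; `struct log_ratelimit` and `log_ratelimit_ok()` are hypothetical names and this is not the kernel's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: allow at most "burst" messages per "interval". */
struct log_ratelimit {
	uint64_t interval;     /* window length, in caller-defined ticks */
	unsigned int burst;    /* messages allowed per window */
	uint64_t begin;        /* start of the current window */
	unsigned int printed;  /* messages emitted in the current window */
	unsigned int missed;   /* messages suppressed so far */
};

/*
 * Returns true if a message may be printed at time "now".
 * "now" must be monotonically non-decreasing across calls.
 */
bool log_ratelimit_ok(struct log_ratelimit *rl, uint64_t now)
{
	/* Start a fresh window once the current one has elapsed. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

A caller would wrap the error print in `if (log_ratelimit_ok(&rl, now))`, so a genuine hardware fault is still reported while a userspace loop can emit only a bounded number of lines per window.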
> Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between
> devices
>
> On Thu, Jul 21, 2022 at 12:06:12AM +0000, Long Li wrote:
> > > Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing
> > > between devices
> > >
> > > On Tue, Jul 12, 2022 at 06:48:09PM +0000, Long Li wrote:
> > > > > > @@ -563,9 +581,19 @@ static int mana_cfg_vport(struct
> > > > > > mana_port_context *apc, u32 protection_dom_id,
> > > > > >
> > > > > > apc->tx_shortform_allowed = resp.short_form_allowed;
> > > > > > apc->tx_vp_offset = resp.tx_vport_offset;
> > > > > > +
> > > > > > + netdev_info(apc->ndev, "Configured vPort %llu PD %u DB %u\n",
> > > > > > + apc->port_handle, protection_dom_id, doorbell_pg_id);
> > > > > Should this be netdev_dbg()?
> > > > > The log buffer can be flooded if there are many vPorts per VF
> > > > > PCI device and there are a lot of VFs.
> > > >
> > > > The reason netdev_info () is used is that this message is
> > > > important for troubleshooting initial setup issues with Ethernet
> > > > driver. We rely on user to get this configured right to share the
> > > > same hardware port between Ethernet and RDMA driver. As far as I
> > > > know, there is no easy way for a driver to "take over" an
> > > > exclusive hardware resource from another driver.
> > >
> > > This seems like a really strange statement.
> > >
> > > Exactly how does all of this work?
> > >
> > > Jason
> >
> > "vport" is a hardware resource that can either be used by an Ethernet
> > device, or an RDMA device. But it can't be used by both at the same
> > time. The "vport" is associated with a protection domain and doorbell,
> > it's programmed in the hardware. Outgoing traffic is enforced on this
> > vport based on how it is programmed.
>
> Sure, but how is the users problem to "get this configured right" and what
> exactly is the user supposed to do?
>
> I would expect the allocation of HW resources to be completely transparent
> to the user. Why is it not?
>
> Jason
In the hardware, RDMA RAW_QP shares the same hardware resource (in this case, the vPort in the hardware table) with the Ethernet NIC. When an RDMA user creates a RAW_QP, we can't just shut down the Ethernet device. The user is required to make sure the Ethernet device is not in use when creating this QP type.
On Thu, Jul 21, 2022 at 05:58:39PM +0000, Long Li wrote:
> > > "vport" is a hardware resource that can either be used by an Ethernet
> > > device, or an RDMA device. But it can't be used by both at the same
> > > time. The "vport" is associated with a protection domain and doorbell,
> > > it's programmed in the hardware. Outgoing traffic is enforced on this
> > > vport based on how it is programmed.
> >
> > Sure, but how is the users problem to "get this configured right" and what
> > exactly is the user supposed to do?
> >
> > I would expect the allocation of HW resources to be completely transparent
> > to the user. Why is it not?
> >
>
> In the hardware, RDMA RAW_QP shares the same hardware resource (in
> this case, the vPort in hardware table) with the ethernet NIC. When
> an RDMA user creates a RAW_QP, we can't just shut down the
> ethernet. The user is required to make sure the ethernet is not in
> used when he creates this QP type.
You haven't answered my question - how is the user supposed to achieve
this?
And now I also want to know why the ethernet device and rdma device
can even be loaded together if they cannot share the physical port?
Exclusivity is not a sharing model that any driver today implements.
Jason
> -----Original Message-----
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 20, 2022 6:49 PM
> To: Long Li <[email protected]>
> Cc: KY Srinivasan <[email protected]>; Haiyang Zhang
> <[email protected]>; Stephen Hemminger
> <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui
> <[email protected]>; David S. Miller <[email protected]>; Jakub
> Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Leon
> Romanovsky <[email protected]>; [email protected];
> [email protected]; Ajay Sharma <[email protected]>; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: [EXTERNAL] Re: [Patch v4 12/12] RDMA/mana_ib: Add a driver for
> Microsoft Azure Network Adapter
>
> On Wed, Jun 15, 2022 at 07:07:20PM -0700, [email protected]
> wrote:
>
> > +static int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
> > +{
> > + struct mana_ib_pd *pd = container_of(ibpd, struct mana_ib_pd, ibpd);
> > + struct ib_device *ibdev = ibpd->device;
> > + enum gdma_pd_flags flags = 0;
> > + struct mana_ib_dev *dev;
> > + int ret;
> > +
> > + dev = container_of(ibdev, struct mana_ib_dev, ib_dev);
> > +
> > + /* Set flags if this is a kernel request */
> > + if (!ibpd->uobject)
> > + flags = GDMA_PD_FLAG_ALLOW_GPA_MR | GDMA_PD_FLAG_ALLOW_FMR_MR;
>
> I'm confused, this driver doesn't seem to support kverbs:
>
> > +static const struct ib_device_ops mana_ib_dev_ops = {
> > + .owner = THIS_MODULE,
> > + .driver_id = RDMA_DRIVER_MANA,
> > + .uverbs_abi_ver = MANA_IB_UVERBS_ABI_VERSION,
> > +
> > + .alloc_pd = mana_ib_alloc_pd,
> > + .alloc_ucontext = mana_ib_alloc_ucontext,
> > + .create_cq = mana_ib_create_cq,
> > + .create_qp = mana_ib_create_qp,
> > + .create_rwq_ind_table = mana_ib_create_rwq_ind_table,
> > + .create_wq = mana_ib_create_wq,
> > + .dealloc_pd = mana_ib_dealloc_pd,
> > + .dealloc_ucontext = mana_ib_dealloc_ucontext,
> > + .dereg_mr = mana_ib_dereg_mr,
> > + .destroy_cq = mana_ib_destroy_cq,
> > + .destroy_qp = mana_ib_destroy_qp,
> > + .destroy_rwq_ind_table = mana_ib_destroy_rwq_ind_table,
> > + .destroy_wq = mana_ib_destroy_wq,
> > + .disassociate_ucontext = mana_ib_disassociate_ucontext,
> > + .get_port_immutable = mana_ib_get_port_immutable,
> > + .mmap = mana_ib_mmap,
> > + .modify_qp = mana_ib_modify_qp,
> > + .modify_wq = mana_ib_modify_wq,
> > + .query_device = mana_ib_query_device,
> > + .query_gid = mana_ib_query_gid,
> > + .query_port = mana_ib_query_port,
> > + .reg_user_mr = mana_ib_reg_user_mr,
>
> eg there is no way to create a kernel MR..
>
> So, why do I see so many kverbs like things - and why are things like FMR in
> this driver that can never be used?
>
> Jason
The idea was to introduce kernel support in the future. I will remove it from the code and upload an updated patch.
> Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between devices
>
> On Thu, Jul 21, 2022 at 05:58:39PM +0000, Long Li wrote:
> > > > "vport" is a hardware resource that can either be used by an
> > > > Ethernet device, or an RDMA device. But it can't be used by both
> > > > at the same time. The "vport" is associated with a protection
> > > > domain and doorbell, it's programmed in the hardware. Outgoing
> > > > traffic is enforced on this vport based on how it is programmed.
> > >
> > > Sure, but how is the users problem to "get this configured right"
> > > and what exactly is the user supposed to do?
> > >
> > > I would expect the allocation of HW resources to be completely
> > > transparent to the user. Why is it not?
> > >
> >
> > In the hardware, RDMA RAW_QP shares the same hardware resource (in
> > this case, the vPort in hardware table) with the ethernet NIC. When an
> > RDMA user creates a RAW_QP, we can't just shut down the ethernet. The
> > user is required to make sure the ethernet is not in used when he
> > creates this QP type.
>
> You haven't answered my question - how is the user supposed to achieve this?
The user needs to configure the network interface so the kernel will not use it when the user creates a RAW QP on this port.
This can be done via system configuration so that the interface is not brought online at boot, or equivalently by running "ifconfig xxx down" to bring the interface down before creating a RAW QP on this port.
>
> And now I also want to know why the ethernet device and rdma device can even
> be loaded together if they cannot share the physical port?
> Exclusivity is not a sharing model that any driver today implements.
>
This physical port limitation only applies to the RAW QP. For RC QP, the hardware doesn't have this limitation. The user can create RC QPs on a physical port up to the hardware limits independent of the Ethernet usage on the same port.
For Ethernet usage, the hardware supports only one active user on a physical port. The driver checks the port usage before programming the hardware when creating the RAW QP. Because the RDMA driver doesn't know in advance which QP type the user will create, it exposes the device with all its ports. The user may not be able to create a RAW QP on a port if this port is already in use by the kernel.
As a comparison, Mellanox NICs can expose both Ethernet and RDMA RAW_QP on the same physical port to software. They can work at the same time, but with some "quirks". The RDMA RAW_QP can preempt or interfere with Ethernet traffic under certain conditions commonly used by DPDK (a heavy user of RAW_QP).
Here are two scenarios in which a Mellanox NIC port is used for both Ethernet and RAW_QP.
Scenario 1: The Ethernet loses TCP connection.
1. User A runs a program listening on a TCP port, accepts an incoming TCP connection and is communicating with the remote peer over this TCP connection.
2. User B creates an RDMA RAW_QP on the same port on the device.
3. As soon as the RAW_QP is created, the program in step 1 can't send/receive data over this TCP connection. After some period of inactivity, the TCP connection terminates.
Please note that this may also pose a security risk. User B with RAW_QP can potentially hijack this TCP connection from the kernel by framing the correct Ethernet packets and sending them over this QP to trick the remote peer into believing it's User A.
Scenario 2: The Ethernet port state changes after RDMA RAW_QP is used on the port.
1. User runs "ifconfig ethx down" on the NIC, intending to take it offline.
2. User creates an RDMA RAW_QP on the same port on the device.
3. User destroys this RAW_QP.
4. The ethx device from step 1 reports carrier state in step 2; in many Linux distributions this brings it online without user interaction. "ifconfig ethx" shows its state changed to "up".
The two activities, on Ethernet and on RDMA RAW_QP, should not happen concurrently: the user either gets unexpected behavior (Scenario 1) or needs to explicitly serialize the use (Scenario 2). In this sense, I think MANA is not materially different from how the Mellanox NICs implement the RAW_QP. IMHO, it's better to have the user explicitly decide whether to use Ethernet or RDMA RAW_QP on a specific port.
Long
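The per-QP-type rule described above (a RAW QP needs the port to itself, while RC QPs are bounded only by a hardware limit) can be written down as a small admission check. This is a userspace model under assumptions, not driver code: `struct port_state`, `port_admit_qp()` and the `max_rc_qps` field are hypothetical, and the error codes are chosen for illustration.

```c
#include <errno.h>
#include <stdbool.h>

enum qp_type { QP_TYPE_RAW, QP_TYPE_RC };

/* Hypothetical per-port bookkeeping kept by software. */
struct port_state {
	bool eth_active;          /* the netdev is up and owns the vport */
	bool raw_qp_active;       /* a RAW QP owns the vport */
	unsigned int rc_qps;      /* RC QPs currently open on the port */
	unsigned int max_rc_qps;  /* stand-in for the hardware limit */
};

/* Decide whether a QP of the given type may be created on the port. */
int port_admit_qp(struct port_state *p, enum qp_type type)
{
	switch (type) {
	case QP_TYPE_RAW:
		/* RAW QP and Ethernet share the vport: one user at a time. */
		if (p->eth_active || p->raw_qp_active)
			return -EBUSY;
		p->raw_qp_active = true;
		return 0;
	case QP_TYPE_RC:
		/* RC QPs do not depend on who owns the vport. */
		if (p->rc_qps >= p->max_rc_qps)
			return -ENOSPC;
		p->rc_qps++;
		return 0;
	}
	return -EINVAL;
}
```

In this model an RC QP is admitted while Ethernet is up, whereas a RAW QP is refused with `-EBUSY` until the interface has been brought down.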
On Fri, Jul 29, 2022 at 06:44:22PM +0000, Long Li wrote:
> > Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between devices
> >
> > On Thu, Jul 21, 2022 at 05:58:39PM +0000, Long Li wrote:
> > > > > "vport" is a hardware resource that can either be used by an
> > > > > Ethernet device, or an RDMA device. But it can't be used by both
> > > > > at the same time. The "vport" is associated with a protection
> > > > > domain and doorbell, it's programmed in the hardware. Outgoing
> > > > > traffic is enforced on this vport based on how it is programmed.
> > > >
> > > > Sure, but how is the users problem to "get this configured right"
> > > > and what exactly is the user supposed to do?
> > > >
> > > > I would expect the allocation of HW resources to be completely
> > > > transparent to the user. Why is it not?
> > > >
> > >
> > > In the hardware, RDMA RAW_QP shares the same hardware resource (in
> > > this case, the vPort in hardware table) with the ethernet NIC. When an
> > > RDMA user creates a RAW_QP, we can't just shut down the ethernet. The
> > > user is required to make sure the ethernet is not in used when he
> > > creates this QP type.
> >
> > You haven't answered my question - how is the user supposed to achieve this?
>
> The user needs to configure the network interface so the kernel will not use it when the user creates a RAW QP on this port.
>
> This can be done via system configuration to not bring this
> interface online on system boot, or equivalently doing "ifconfig xxx
> down" to make the interface down when creating a RAW QP on this
> port.
That sounds horrible, why allow the user to even bind two drivers if
the two drivers can't be used together?
> > And now I also want to know why the ethernet device and rdma device can even
> > be loaded together if they cannot share the physical port?
> > Exclusivity is not a sharing model that any driver today implements.
>
> This physical port limitation only applies to the RAW QP. For RC QP,
> the hardware doesn't have this limitation. The user can create RC
> QPs on a physical port up to the hardware limits independent of the
> Ethernet usage on the same port.
.. and it is because you support sharing models in other cases :\
> Scenario 1: The Ethernet loses TCP connection.
> 1. User A runs a program listing on a TCP port, accepts an incoming
> TCP connection and is communicating with the remote peer over this
> TCP connection.
> 2. User B creates an RDMA RAW_QP on the same port on the device.
> 3. As soon as the RAW_QP is created, the program in 1 can't
> send/receive data over this TCP connection. After some period of
> inactivity, the TCP connection terminates.
It is a little more complicated than that, but yes, that could
possibly happen if the userspace captures the right traffic.
> Please note that this may also pose a security risk. User B with
> RAW_QP can potentially hijack this TCP connection from the kernel by
> framing the correct Ethernet packets and send over this QP to trick
> the remote peer, making it believe it's User A.
Any root user can do this with the netstack using eg tcpdump, bpf,
XDP, raw sockets, etc. This is why the capability is guarded by
CAP_NET_RAW. It is nothing unusual.
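As an illustration of the same capability gate on the netstack side, a Linux process can probe whether it is allowed to open an `AF_PACKET` raw socket, which requires CAP_NET_RAW. This is a minimal Linux-only sketch; `raw_packet_socket_probe()` is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <linux/if_ether.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Try to open a raw packet socket that can see all Ethernet frames.
 * Returns 0 on success or a negative errno value on failure; a process
 * without CAP_NET_RAW is expected to get -EPERM.
 */
int raw_packet_socket_probe(void)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0)
		return -errno;

	close(fd);
	return 0;
}
```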
> Scenario 2: The Ethernet port state changes after RDMA RAW_QP is used on the port.
> 1. User uses "ifconfig ethx down" on the NIC, intending to make it offline
> 2. User creates a RDMA RAW_QP on the same port on the device.
> 3. User destroys this RAW_QP.
> 4. The ethx device in 1 reports carrier state in step 2, in many
> Linux distributions this makes it online without user
> interaction. "ifconfig ethx" shows its state changes to "up".
This I'm not familiar with; it actually sounds like a bug that the
RAW_QPs interfere with the netdev carrier state.
> the Mellanox NICs implement the RAW_QP. IMHO, it's better to have
> the user explicitly decide whether to use Ethernet or RDMA RAW_QP on
> a specific port.
It should all be carefully documented someplace.
Jason
> > the Mellanox NICs implement the RAW_QP. IMHO, it's better to have the
> > user explicitly decide whether to use Ethernet or RDMA RAW_QP on a
> > specific port.
>
> It should all be carefully documented someplace.
The use case for RAW_QP is from user-mode. Is it acceptable that we document the detailed usage in rdma-core?
Long
On Fri, Jul 29, 2022 at 09:20:05PM +0000, Long Li wrote:
> > > the Mellanox NICs implement the RAW_QP. IMHO, it's better to have the
> > > user explicitly decide whether to use Ethernet or RDMA RAW_QP on a
> > > specific port.
> >
> > It should all be carefully documented someplace.
>
> The use case for RAW_QP is from user-mode. Is it acceptable that we
> document the detailed usage in rdma-core?
Yes, but add a suitable comment someplace in the kernel too
Jason
> Subject: Re: [Patch v4 03/12] net: mana: Handle vport sharing between devices
>
> On Fri, Jul 29, 2022 at 09:20:05PM +0000, Long Li wrote:
> > > > the Mellanox NICs implement the RAW_QP. IMHO, it's better to have
> > > > the user explicitly decide whether to use Ethernet or RDMA RAW_QP
> > > > on a specific port.
> > >
> > > It should all be carefully documented someplace.
> >
> > The use case for RAW_QP is from user-mode. Is it acceptable that we
> > document the detailed usage in rdma-core?
>
> Yes, but add a suitable comment someplace in the kernel too
Thanks. I will add detailed comments.
Long