2015-04-07 12:26:08

by Michael Wang

Subject: [PATCH v2 00/17] IB/Verbs: IB Management Helpers


Since v1:
* Apply suggestions from Doug, Ira, Jason, Tom, thanks for the comments :-)
  and please remind me if I missed anything :-P
* Adopt new callback query_transport() to directly get the transport type
  of a device
* Rework much of the code to adopt the new management helpers, clean up the
  old helpers

There is plenty of lengthy code that checks the transport type of an IB
device, or the link layer type of its port, but what we are really doing is
guessing whether a particular management capability/feature is supported by
the device/port.

Thus, instead of inferring, we should have our own mechanism for checking IB
management capabilities/protocols/features; several proposals are listed below.

This patch set reworks the method of getting the transport type: we will
now use query_transport() instead of inferring it from the transport and link
layer respectively. We also define a new transport type to make the
concept more reasonable.

Mapping List:
driver    node-type  link-layer  old-transport  new-transport
nes       RNIC       ETH         IWARP          IWARP
amso1100  RNIC       ETH         IWARP          IWARP
cxgb3     RNIC       ETH         IWARP          IWARP
cxgb4     RNIC       ETH         IWARP          IWARP
usnic     USNIC_UDP  ETH         USNIC_UDP      USNIC_UDP
ocrdma    IB_CA      ETH         IB             IBOE
mlx4      IB_CA      IB/ETH      IB             IB/IBOE
mlx5      IB_CA      IB          IB             IB
ehca      IB_CA      IB          IB             IB
ipath     IB_CA      IB          IB             IB
mthca     IB_CA      IB          IB             IB
qib       IB_CA      IB          IB             IB

For example:
	if (transport == IB && link-layer == ETH)
will now become:
	if (query_transport() == IBOE)

Thus we will be able to get rid of the separate transport and link-layer
checks, which will make it easier to add new protocols/technologies (like
OPA). With the newly introduced management helpers, the IB management logic
will also be clearer and easier to extend.
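
To make the conversion above concrete, here is a minimal userspace sketch
(not kernel code). The names mock_port, TP_* and LL_* are hypothetical
stand-ins invented for illustration; only the three sample rows are taken
from the mapping list, and the check is assumed to be per port.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel enums; values are illustrative. */
enum link_layer { LL_IB, LL_ETH };
enum transport { TP_IB, TP_IWARP, TP_USNIC, TP_USNIC_UDP, TP_IBOE };

/* Mock port carrying only the fields this sketch needs. */
struct mock_port {
	enum transport old_transport;	/* inferred from the node type */
	enum link_layer link_layer;	/* queried separately */
	enum transport new_transport;	/* what query_transport() would report */
};

/* Old style: combine two separately inferred properties. */
static bool is_iboe_old(const struct mock_port *p)
{
	return p->old_transport == TP_IB && p->link_layer == LL_ETH;
}

/* New style: a single comparison against the reported transport. */
static bool is_iboe_new(const struct mock_port *p)
{
	return p->new_transport == TP_IBOE;
}

/* Sanity check that both styles agree for a given port. */
static void check_equivalent(const struct mock_port *p)
{
	assert(is_iboe_old(p) == is_iboe_new(p));
}

/* Sample rows from the mapping list (ocrdma, mlx5, cxgb4). */
static const struct mock_port ocrdma_port = {
	.old_transport = TP_IB, .link_layer = LL_ETH, .new_transport = TP_IBOE,
};
static const struct mock_port mlx5_port = {
	.old_transport = TP_IB, .link_layer = LL_IB, .new_transport = TP_IB,
};
static const struct mock_port cxgb4_port = {
	.old_transport = TP_IWARP, .link_layer = LL_ETH, .new_transport = TP_IWARP,
};
```

Under these assumptions both forms give the same answer for every row; the
new form simply needs one query instead of two.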

TODO:
The patch set covers a wide range of IB code, so for those who are
familiar with a particular part, your suggestions would be invaluable ;-)

The patches haven't been tested yet; we would appreciate it if anyone who
has this HW would be willing to provide a Tested-by :-)

Proposals:
Sean:
https://www.mail-archive.com/[email protected]/msg23339.html
Doug:
https://www.mail-archive.com/[email protected]/msg23418.html
Jason:
https://www.mail-archive.com/[email protected]/msg23425.html

Michael Wang (17):
[PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW
[PATCH v2 02/17] IB/Verbs: Implement raw management helpers
[PATCH v2 03/17] IB/Verbs: Use management helper cap_ib_mad() for mad-check
[PATCH v2 04/17] IB/Verbs: Use management helper cap_ib_smi() for smi-check
[PATCH v2 05/17] IB/Verbs: Use management helper cap_ib_cm() for cm-check
[PATCH v2 06/17] IB/Verbs: Use management helper cap_ib_sa() for sa-check
[PATCH v2 07/17] IB/Verbs: Use management helper cap_ib_mcast() for mcast-check
[PATCH v2 08/17] IB/Verbs: Use management helper cap_ipoib() for ipoib-check
[PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()
[PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers
[PATCH v2 11/17] IB/Verbs: Reform link_layer_show() and ib_uverbs_query_port()
[PATCH v2 12/17] IB/Verbs: Use management helper cap_ib_cm_dev() for cm-device-check
[PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers
[PATCH v2 14/17] IB/Verbs: Reserve legacy transport type for 'struct rdma_dev_addr'
[PATCH v2 15/17] IB/Verbs: Reform cma_acquire_dev() with management helpers
[PATCH v2 16/17] IB/Verbs: Cleanup rdma_node_get_transport()
[PATCH v2 17/17] IB/Verbs: Move rdma_port_get_link_layer() to mlx4 head file

---
drivers/infiniband/core/agent.c | 2
drivers/infiniband/core/cm.c | 22 +-
drivers/infiniband/core/cma.c | 281 ++++++++++++---------------
drivers/infiniband/core/device.c | 1
drivers/infiniband/core/mad.c | 20 -
drivers/infiniband/core/multicast.c | 12 -
drivers/infiniband/core/sa_query.c | 29 +-
drivers/infiniband/core/sysfs.c | 8
drivers/infiniband/core/ucm.c | 3
drivers/infiniband/core/ucma.c | 25 --
drivers/infiniband/core/user_mad.c | 26 +-
drivers/infiniband/core/uverbs_cmd.c | 6
drivers/infiniband/core/verbs.c | 51 ----
drivers/infiniband/hw/amso1100/c2_provider.c | 7
drivers/infiniband/hw/cxgb3/iwch_provider.c | 7
drivers/infiniband/hw/cxgb4/provider.c | 7
drivers/infiniband/hw/ehca/ehca_hca.c | 6
drivers/infiniband/hw/ehca/ehca_iverbs.h | 3
drivers/infiniband/hw/ehca/ehca_main.c | 1
drivers/infiniband/hw/ipath/ipath_verbs.c | 7
drivers/infiniband/hw/mlx4/main.c | 10
drivers/infiniband/hw/mlx4/mlx4_ib.h | 8
drivers/infiniband/hw/mlx5/main.c | 7
drivers/infiniband/hw/mthca/mthca_provider.c | 7
drivers/infiniband/hw/nes/nes_verbs.c | 6
drivers/infiniband/hw/ocrdma/ocrdma_main.c | 1
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 6
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3
drivers/infiniband/hw/qib/qib_verbs.c | 7
drivers/infiniband/hw/usnic/usnic_ib_main.c | 1
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 6
drivers/infiniband/hw/usnic/usnic_ib_verbs.h | 2
drivers/infiniband/ulp/ipoib/ipoib_main.c | 17 -
include/rdma/ib_verbs.h | 163 ++++++++++++++-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 4
net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 -
36 files changed, 490 insertions(+), 294 deletions(-)


2015-04-07 12:28:27

by Michael Wang

Subject: [PATCH 01/17] IB/Verbs: Implement new callback query_transport() for each HW


Add a new callback query_transport() and implement it for each HW.

Mapping List:
driver    node-type  link-layer  old-transport  new-transport
nes       RNIC       ETH         IWARP          IWARP
amso1100  RNIC       ETH         IWARP          IWARP
cxgb3     RNIC       ETH         IWARP          IWARP
cxgb4     RNIC       ETH         IWARP          IWARP
usnic     USNIC_UDP  ETH         USNIC_UDP      USNIC_UDP
ocrdma    IB_CA      ETH         IB             IBOE
mlx4      IB_CA      IB/ETH      IB             IB/IBOE
mlx5      IB_CA      IB          IB             IB
ehca      IB_CA      IB          IB             IB
ipath     IB_CA      IB          IB             IB
mthca     IB_CA      IB          IB             IB
qib       IB_CA      IB          IB             IB

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/device.c | 1 +
drivers/infiniband/core/verbs.c | 4 +++-
drivers/infiniband/hw/amso1100/c2_provider.c | 7 +++++++
drivers/infiniband/hw/cxgb3/iwch_provider.c | 7 +++++++
drivers/infiniband/hw/cxgb4/provider.c | 7 +++++++
drivers/infiniband/hw/ehca/ehca_hca.c | 6 ++++++
drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +++
drivers/infiniband/hw/ehca/ehca_main.c | 1 +
drivers/infiniband/hw/ipath/ipath_verbs.c | 7 +++++++
drivers/infiniband/hw/mlx4/main.c | 10 ++++++++++
drivers/infiniband/hw/mlx5/main.c | 7 +++++++
drivers/infiniband/hw/mthca/mthca_provider.c | 7 +++++++
drivers/infiniband/hw/nes/nes_verbs.c | 6 ++++++
drivers/infiniband/hw/ocrdma/ocrdma_main.c | 1 +
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 6 ++++++
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3 +++
drivers/infiniband/hw/qib/qib_verbs.c | 7 +++++++
drivers/infiniband/hw/usnic/usnic_ib_main.c | 1 +
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 6 ++++++
drivers/infiniband/hw/usnic/usnic_ib_verbs.h | 2 ++
include/rdma/ib_verbs.h | 7 ++++++-
21 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..a9587c4 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
} mandatory_table[] = {
IB_MANDATORY_FUNC(query_device),
IB_MANDATORY_FUNC(query_port),
+ IB_MANDATORY_FUNC(query_transport),
IB_MANDATORY_FUNC(query_pkey),
IB_MANDATORY_FUNC(query_gid),
IB_MANDATORY_FUNC(alloc_pd),
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index f93eb8d..83370de 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
if (device->get_link_layer)
return device->get_link_layer(device, port_num);

- switch (rdma_node_get_transport(device->node_type)) {
+ switch (device->query_transport(device, port_num)) {
case RDMA_TRANSPORT_IB:
+ case RDMA_TRANSPORT_IBOE:
return IB_LINK_LAYER_INFINIBAND;
case RDMA_TRANSPORT_IWARP:
case RDMA_TRANSPORT_USNIC:
case RDMA_TRANSPORT_USNIC_UDP:
return IB_LINK_LAYER_ETHERNET;
default:
+ BUG();
return IB_LINK_LAYER_UNSPECIFIED;
}
}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index bdf3507..d46bbb0 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -99,6 +99,12 @@ static int c2_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+c2_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static int c2_query_pkey(struct ib_device *ibdev,
u8 port, u16 index, u16 * pkey)
{
@@ -801,6 +807,7 @@ int c2_register_device(struct c2_dev *dev)
dev->ibdev.dma_device = &dev->pcidev->dev;
dev->ibdev.query_device = c2_query_device;
dev->ibdev.query_port = c2_query_port;
+ dev->ibdev.query_transport = c2_query_transport;
dev->ibdev.query_pkey = c2_query_pkey;
dev->ibdev.query_gid = c2_query_gid;
dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 811b24a..09682e9e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1232,6 +1232,12 @@ static int iwch_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+iwch_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -1385,6 +1391,7 @@ int iwch_register_device(struct iwch_dev *dev)
dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
dev->ibdev.query_device = iwch_query_device;
dev->ibdev.query_port = iwch_query_port;
+ dev->ibdev.query_transport = iwch_query_transport;
dev->ibdev.query_pkey = iwch_query_pkey;
dev->ibdev.query_gid = iwch_query_gid;
dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 66bd6a2..a445e0d 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -390,6 +390,12 @@ static int c4iw_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+static enum rdma_transport_type
+c4iw_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -506,6 +512,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
dev->ibdev.dma_device = &(dev->rdev.lldi.pdev->dev);
dev->ibdev.query_device = c4iw_query_device;
dev->ibdev.query_port = c4iw_query_port;
+ dev->ibdev.query_transport = c4iw_query_transport;
dev->ibdev.query_pkey = c4iw_query_pkey;
dev->ibdev.query_gid = c4iw_query_gid;
dev->ibdev.alloc_ucontext = c4iw_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 9ed4d25..d5a34a6 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -242,6 +242,12 @@ query_port1:
return ret;
}

+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
int ehca_query_sma_attr(struct ehca_shca *shca,
u8 port, struct ehca_sma_attr *attr)
{
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 22f79af..cec945f 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -49,6 +49,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props);
int ehca_query_port(struct ib_device *ibdev, u8 port,
struct ib_port_attr *props);

+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num);
+
int ehca_query_sma_attr(struct ehca_shca *shca, u8 port,
struct ehca_sma_attr *attr);

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index cd8d290..60e0a09 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -467,6 +467,7 @@ static int ehca_init_device(struct ehca_shca *shca)
shca->ib_device.dma_device = &shca->ofdev->dev;
shca->ib_device.query_device = ehca_query_device;
shca->ib_device.query_port = ehca_query_port;
+ shca->ib_device.query_transport = ehca_query_transport;
shca->ib_device.query_gid = ehca_query_gid;
shca->ib_device.query_pkey = ehca_query_pkey;
/* shca->in_device.modify_device = ehca_modify_device */
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 44ea939..58d36e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1638,6 +1638,12 @@ static int ipath_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+ipath_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int ipath_modify_device(struct ib_device *device,
int device_modify_mask,
struct ib_device_modify *device_modify)
@@ -2140,6 +2146,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
dev->query_device = ipath_query_device;
dev->modify_device = ipath_modify_device;
dev->query_port = ipath_query_port;
+ dev->query_transport = ipath_query_transport;
dev->modify_port = ipath_modify_port;
dev->query_pkey = ipath_query_pkey;
dev->query_gid = ipath_query_gid;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 0b280b1..28100bd 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -413,6 +413,15 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
return __mlx4_ib_query_port(ibdev, port, props, 0);
}

+static enum rdma_transport_type
+mlx4_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ struct mlx4_dev *dev = to_mdev(device)->dev;
+
+ return dev->caps.port_mask[port_num] == MLX4_PORT_TYPE_IB ?
+ RDMA_TRANSPORT_IB : RDMA_TRANSPORT_IBOE;
+}
+
int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
union ib_gid *gid, int netw_view)
{
@@ -2121,6 +2130,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)

ibdev->ib_dev.query_device = mlx4_ib_query_device;
ibdev->ib_dev.query_port = mlx4_ib_query_port;
+ ibdev->ib_dev.query_transport = mlx4_ib_query_transport;
ibdev->ib_dev.get_link_layer = mlx4_ib_port_link_layer;
ibdev->ib_dev.query_gid = mlx4_ib_query_gid;
ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index cc4ac1e..209c796 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -351,6 +351,12 @@ out:
return err;
}

+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int mlx5_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
union ib_gid *gid)
{
@@ -1336,6 +1342,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)

dev->ib_dev.query_device = mlx5_ib_query_device;
dev->ib_dev.query_port = mlx5_ib_query_port;
+ dev->ib_dev.query_transport = mlx5_ib_query_transport;
dev->ib_dev.query_gid = mlx5_ib_query_gid;
dev->ib_dev.query_pkey = mlx5_ib_query_pkey;
dev->ib_dev.modify_device = mlx5_ib_modify_device;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 415f8e1..67ac6a4 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -179,6 +179,12 @@ static int mthca_query_port(struct ib_device *ibdev,
return err;
}

+static enum rdma_transport_type
+mthca_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int mthca_modify_device(struct ib_device *ibdev,
int mask,
struct ib_device_modify *props)
@@ -1281,6 +1287,7 @@ int mthca_register_device(struct mthca_dev *dev)
dev->ib_dev.dma_device = &dev->pdev->dev;
dev->ib_dev.query_device = mthca_query_device;
dev->ib_dev.query_port = mthca_query_port;
+ dev->ib_dev.query_transport = mthca_query_transport;
dev->ib_dev.modify_device = mthca_modify_device;
dev->ib_dev.modify_port = mthca_modify_port;
dev->ib_dev.query_pkey = mthca_query_pkey;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index c0d0296..8df5b61 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -606,6 +606,11 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
return 0;
}

+static enum rdma_transport_type
+nes_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}

/**
* nes_query_pkey
@@ -3879,6 +3884,7 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
nesibdev->ibdev.dev.parent = &nesdev->pcidev->dev;
nesibdev->ibdev.query_device = nes_query_device;
nesibdev->ibdev.query_port = nes_query_port;
+ nesibdev->ibdev.query_transport = nes_query_transport;
nesibdev->ibdev.query_pkey = nes_query_pkey;
nesibdev->ibdev.query_gid = nes_query_gid;
nesibdev->ibdev.alloc_ucontext = nes_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 7a2b59a..9f4d182 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -244,6 +244,7 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
/* mandatory verbs. */
dev->ibdev.query_device = ocrdma_query_device;
dev->ibdev.query_port = ocrdma_query_port;
+ dev->ibdev.query_transport = ocrdma_query_transport;
dev->ibdev.modify_port = ocrdma_modify_port;
dev->ibdev.query_gid = ocrdma_query_gid;
dev->ibdev.get_link_layer = ocrdma_link_layer;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 8771755..73bace4 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -187,6 +187,12 @@ int ocrdma_query_port(struct ib_device *ibdev,
return 0;
}

+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IBOE;
+}
+
int ocrdma_modify_port(struct ib_device *ibdev, u8 port, int mask,
struct ib_port_modify *props)
{
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index b8f7853..4a81b63 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -41,6 +41,9 @@ int ocrdma_query_port(struct ib_device *, u8 port, struct ib_port_attr *props);
int ocrdma_modify_port(struct ib_device *, u8 port, int mask,
struct ib_port_modify *props);

+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num);
+
void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
int ocrdma_query_gid(struct ib_device *, u8 port,
int index, union ib_gid *gid);
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index 4a35998..caad665 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -1650,6 +1650,12 @@ static int qib_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+static enum rdma_transport_type
+qib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int qib_modify_device(struct ib_device *device,
int device_modify_mask,
struct ib_device_modify *device_modify)
@@ -2184,6 +2190,7 @@ int qib_register_ib_device(struct qib_devdata *dd)
ibdev->query_device = qib_query_device;
ibdev->modify_device = qib_modify_device;
ibdev->query_port = qib_query_port;
+ ibdev->query_transport = qib_query_transport;
ibdev->modify_port = qib_modify_port;
ibdev->query_pkey = qib_query_pkey;
ibdev->query_gid = qib_query_gid;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c b/drivers/infiniband/hw/usnic/usnic_ib_main.c
index 0d0f986..03ea9f3 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
@@ -360,6 +360,7 @@ static void *usnic_ib_device_add(struct pci_dev *dev)

us_ibdev->ib_dev.query_device = usnic_ib_query_device;
us_ibdev->ib_dev.query_port = usnic_ib_query_port;
+ us_ibdev->ib_dev.query_transport = usnic_ib_query_transport;
us_ibdev->ib_dev.query_pkey = usnic_ib_query_pkey;
us_ibdev->ib_dev.query_gid = usnic_ib_query_gid;
us_ibdev->ib_dev.get_link_layer = usnic_ib_port_link_layer;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 53bd6a2..ff9a5f7 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -348,6 +348,12 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_USNIC_UDP;
+}
+
int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr)
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
index bb864f5..0b1633b 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
@@ -27,6 +27,8 @@ int usnic_ib_query_device(struct ib_device *ibdev,
struct ib_device_attr *props);
int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
struct ib_port_attr *props);
+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num);
int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d54f91e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,10 +75,13 @@ enum rdma_node_type {
};

enum rdma_transport_type {
+ /* legacy for users */
RDMA_TRANSPORT_IB,
RDMA_TRANSPORT_IWARP,
RDMA_TRANSPORT_USNIC,
- RDMA_TRANSPORT_USNIC_UDP
+ RDMA_TRANSPORT_USNIC_UDP,
+ /* new transport */
+ RDMA_TRANSPORT_IBOE,
};

__attribute_const__ enum rdma_transport_type
@@ -1501,6 +1504,8 @@ struct ib_device {
int (*query_port)(struct ib_device *device,
u8 port_num,
struct ib_port_attr *port_attr);
+ enum rdma_transport_type (*query_transport)(struct ib_device *device,
+ u8 port_num);
enum rdma_link_layer (*get_link_layer)(struct ib_device *device,
u8 port_num);
int (*query_gid)(struct ib_device *device,
--
2.1.0

2015-04-07 12:29:38

by Michael Wang

Subject: [PATCH v2 02/17] IB/Verbs: Implement raw management helpers


Add raw helpers:
	rdma_transport_ib
	rdma_transport_iboe
	rdma_transport_iwarp
	rdma_ib_mgmt
to help us check the transport type.
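
As a rough illustration of how these raw helpers behave, here is a userspace
sketch (not kernel code). Every mock_* and MOCK_* name is hypothetical, and
the dual-port device with port 1 as IB and port 2 as IBoE is an assumption
chosen to resemble the mlx4 row of the mapping list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; not the kernel definitions. */
enum mock_transport { MOCK_TP_IB, MOCK_TP_IWARP, MOCK_TP_IBOE };

struct mock_device {
	/* Mirrors the shape of the per-port query_transport() callback. */
	enum mock_transport (*query_transport)(struct mock_device *dev,
					       unsigned char port_num);
};

static bool mock_transport_ib(struct mock_device *dev, unsigned char port_num)
{
	return dev->query_transport(dev, port_num) == MOCK_TP_IB;
}

static bool mock_transport_iboe(struct mock_device *dev, unsigned char port_num)
{
	return dev->query_transport(dev, port_num) == MOCK_TP_IBOE;
}

static bool mock_transport_iwarp(struct mock_device *dev, unsigned char port_num)
{
	return dev->query_transport(dev, port_num) == MOCK_TP_IWARP;
}

/* IB management applies to both IB and IBoE ports. */
static bool mock_ib_mgmt(struct mock_device *dev, unsigned char port_num)
{
	enum mock_transport tp = dev->query_transport(dev, port_num);

	return tp == MOCK_TP_IB || tp == MOCK_TP_IBOE;
}

/* Assumed dual-port device: port 1 is IB, port 2 is IBoE. */
static enum mock_transport dual_query(struct mock_device *dev,
				      unsigned char port_num)
{
	(void)dev;
	assert(port_num == 1 || port_num == 2);
	return port_num == 1 ? MOCK_TP_IB : MOCK_TP_IBOE;
}

/* Assumed iWARP-only device: every port reports iWARP. */
static enum mock_transport iwarp_query(struct mock_device *dev,
				       unsigned char port_num)
{
	(void)dev;
	(void)port_num;
	return MOCK_TP_IWARP;
}

static struct mock_device dual_dev = { .query_transport = dual_query };
static struct mock_device iwarp_dev = { .query_transport = iwarp_query };
```

With this setup the answer can differ between ports of the same device,
which is the reason the helpers take a port number.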

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/rdma/ib_verbs.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index d54f91e..780b3b7 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1748,6 +1748,31 @@ int ib_query_port(struct ib_device *device,
enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device,
u8 port_num);

+static inline int rdma_transport_ib(struct ib_device *device, u8 port_num)
+{
+ return device->query_transport(device, port_num)
+ == RDMA_TRANSPORT_IB;
+}
+
+static inline int rdma_transport_iboe(struct ib_device *device, u8 port_num)
+{
+ return device->query_transport(device, port_num)
+ == RDMA_TRANSPORT_IBOE;
+}
+
+static inline int rdma_transport_iwarp(struct ib_device *device, u8 port_num)
+{
+ return device->query_transport(device, port_num)
+ == RDMA_TRANSPORT_IWARP;
+}
+
+static inline int rdma_ib_mgmt(struct ib_device *device, u8 port_num)
+{
+ enum rdma_transport_type tp = device->query_transport(device, port_num);
+
+ return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:30:30

by Michael Wang

Subject: [PATCH v2 03/17] IB/Verbs: Use management helper cap_ib_mad() for mad-check


Introduce helper cap_ib_mad() to help us check whether a port of an
IB device supports InfiniBand Management Datagrams.

Rework ib_umad_add_one() to better fit the per-port check method.
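
The per-port pattern can be sketched in userspace as follows (not kernel
code). All mock_* names, the has_mad[] table standing in for cap_ib_mad(),
and the fail_port knob are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_PORTS 4

/* Hypothetical mock device; ports are numbered 1..num_ports. */
struct mock_dev {
	int num_ports;
	bool has_mad[MOCK_MAX_PORTS + 1];	/* stand-in for cap_ib_mad() */
	bool opened[MOCK_MAX_PORTS + 1];	/* which ports were set up */
	int fail_port;				/* port whose open fails, 0 = none */
};

static bool mock_cap_ib_mad(const struct mock_dev *d, int port)
{
	assert(port >= 1 && port <= d->num_ports);
	return d->has_mad[port];
}

static int mock_port_open(struct mock_dev *d, int port)
{
	if (port == d->fail_port)
		return -1;
	d->opened[port] = true;
	return 0;
}

static void mock_port_close(struct mock_dev *d, int port)
{
	d->opened[port] = false;
}

/*
 * Set up every MAD-capable port. Returns the number of ports set up
 * (possibly 0), or -1 after undoing the ports opened before a failure.
 */
static int mock_add_one(struct mock_dev *d)
{
	int i, count = 0;

	for (i = 1; i <= d->num_ports; i++) {
		if (!mock_cap_ib_mad(d, i))
			continue;	/* skip ports without the capability */
		if (mock_port_open(d, i))
			goto err;
		count++;
	}
	return count;

err:
	/* Unwind only the ports that were actually opened. */
	while (--i >= 1) {
		if (!mock_cap_ib_mad(d, i))
			continue;
		mock_port_close(d, i);
	}
	return -1;
}
```

A return value of 0 corresponds to the case where no port of the device has
the capability, so the caller can release whatever it allocated up front.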

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/mad.c | 18 +++++++++---------
drivers/infiniband/core/user_mad.c | 26 ++++++++++++++++++++------
include/rdma/ib_verbs.h | 15 +++++++++++++++
3 files changed, 44 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 74c30f4..ef0c0c5 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -3057,9 +3057,6 @@ static void ib_mad_init_device(struct ib_device *device)
{
int start, end, i;

- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
-
if (device->node_type == RDMA_NODE_IB_SWITCH) {
start = 0;
end = 0;
@@ -3069,6 +3066,9 @@ static void ib_mad_init_device(struct ib_device *device)
}

for (i = start; i <= end; i++) {
+ if (!cap_ib_mad(device, i))
+ continue;
+
if (ib_mad_port_open(device, i)) {
dev_err(&device->dev, "Couldn't open port %d\n", i);
goto error;
@@ -3086,15 +3086,15 @@ error_agent:
dev_err(&device->dev, "Couldn't close port %d\n", i);

error:
- i--;
+ while (--i >= start) {
+ if (!cap_ib_mad(device, i))
+ continue;

- while (i >= start) {
if (ib_agent_port_close(device, i))
dev_err(&device->dev,
"Couldn't close port %d for agents\n", i);
if (ib_mad_port_close(device, i))
dev_err(&device->dev, "Couldn't close port %d\n", i);
- i--;
}
}

@@ -3102,9 +3102,6 @@ static void ib_mad_remove_device(struct ib_device *device)
{
int i, num_ports, cur_port;

- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
-
if (device->node_type == RDMA_NODE_IB_SWITCH) {
num_ports = 1;
cur_port = 0;
@@ -3113,6 +3110,9 @@ static void ib_mad_remove_device(struct ib_device *device)
cur_port = 1;
}
for (i = 0; i < num_ports; i++, cur_port++) {
+ if (!cap_ib_mad(device, cur_port))
+ continue;
+
if (ib_agent_port_close(device, cur_port))
dev_err(&device->dev,
"Couldn't close port %d for agents\n",
diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index 928cdd2..b52884b 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1273,9 +1273,7 @@ static void ib_umad_add_one(struct ib_device *device)
{
struct ib_umad_device *umad_dev;
int s, e, i;
-
- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
+ int count = 0;

if (device->node_type == RDMA_NODE_IB_SWITCH)
s = e = 0;
@@ -1296,11 +1294,21 @@ static void ib_umad_add_one(struct ib_device *device)
umad_dev->end_port = e;

for (i = s; i <= e; ++i) {
+ if (!cap_ib_mad(device, i))
+ continue;
+
umad_dev->port[i - s].umad_dev = umad_dev;

if (ib_umad_init_port(device, i, umad_dev,
&umad_dev->port[i - s]))
goto err;
+
+ count++;
+ }
+
+ if (!count) {
+ kobject_put(&umad_dev->kobj);
+ return;
}

ib_set_client_data(device, &umad_client, umad_dev);
@@ -1308,8 +1316,12 @@ static void ib_umad_add_one(struct ib_device *device)
return;

err:
- while (--i >= s)
+ while (--i >= s) {
+ if (!cap_ib_mad(device, i))
+ continue;
+
ib_umad_kill_port(&umad_dev->port[i - s]);
+ }

kobject_put(&umad_dev->kobj);
}
@@ -1322,8 +1334,10 @@ static void ib_umad_remove_one(struct ib_device *device)
if (!umad_dev)
return;

- for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i)
- ib_umad_kill_port(&umad_dev->port[i]);
+ for (i = 0; i <= umad_dev->end_port - umad_dev->start_port; ++i) {
+ if (cap_ib_mad(device, i + umad_dev->start_port))
+ ib_umad_kill_port(&umad_dev->port[i]);
+ }

kobject_put(&umad_dev->kobj);
}
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 780b3b7..4013933 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1773,6 +1773,21 @@ static inline int rdma_ib_mgmt(struct ib_device *device, u8 port_num)
return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE);
}

+/**
+ * cap_ib_mad - Check if the port of device has the capability Infiniband
+ * Management Datagrams.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support InfiniBand
+ * Management Datagrams.
+ */
+static inline int cap_ib_mad(struct ib_device *device, u8 port_num)
+{
+ return rdma_ib_mgmt(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:31:38

by Michael Wang

Subject: [PATCH v2 04/17] IB/Verbs: Use management helper cap_ib_smi() for smi-check


Introduce helper cap_ib_smi() to help us check whether a port of an
IB device supports the InfiniBand Subnet Management Interface.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/agent.c | 2 +-
drivers/infiniband/core/mad.c | 2 +-
include/rdma/ib_verbs.h | 15 +++++++++++++++
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index f6d2961..61471ee 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -156,7 +156,7 @@ int ib_agent_port_open(struct ib_device *device, int port_num)
goto error1;
}

- if (rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_INFINIBAND) {
+ if (cap_ib_smi(device, port_num)) {
/* Obtain send only MAD agent for SMI QP */
port_priv->agent[0] = ib_register_mad_agent(device, port_num,
IB_QPT_SMI, NULL, 0,
diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index ef0c0c5..2668d4e 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -2938,7 +2938,7 @@ static int ib_mad_port_open(struct ib_device *device,
init_mad_qp(port_priv, &port_priv->qp_info[1]);

cq_size = mad_sendq_size + mad_recvq_size;
- has_smi = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_INFINIBAND;
+ has_smi = cap_ib_smi(device, port_num);
if (has_smi)
cq_size *= 2;

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4013933..ee76010 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1788,6 +1788,21 @@ static inline int cap_ib_mad(struct ib_device *device, u8 port_num)
return rdma_ib_mgmt(device, port_num);
}

+/**
+ * cap_ib_smi - Check if the port of device has the capability Infiniband
+ * Subnet Management Interface.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support the InfiniBand
+ * Subnet Management Interface.
+ */
+static inline int cap_ib_smi(struct ib_device *device, u8 port_num)
+{
+ return rdma_transport_ib(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:32:14

by Michael Wang

Subject: [PATCH v2 05/17] IB/Verbs: Use management helper cap_ib_cm() for cm-check


Introduce helper cap_ib_cm() to help us check whether a port of an
IB device supports the InfiniBand Communication Manager.

Rework cm_add_one() to better fit the per-port check method.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/cm.c | 22 +++++++++++++++++++---
include/rdma/ib_verbs.h | 15 +++++++++++++++
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index e28a494..63418ee 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3761,9 +3761,7 @@ static void cm_add_one(struct ib_device *ib_device)
unsigned long flags;
int ret;
u8 i;
-
- if (rdma_node_get_transport(ib_device->node_type) != RDMA_TRANSPORT_IB)
- return;
+ int count = 0;

cm_dev = kzalloc(sizeof(*cm_dev) + sizeof(*port) *
ib_device->phys_port_cnt, GFP_KERNEL);
@@ -3783,6 +3781,9 @@ static void cm_add_one(struct ib_device *ib_device)

set_bit(IB_MGMT_METHOD_SEND, reg_req.method_mask);
for (i = 1; i <= ib_device->phys_port_cnt; i++) {
+ if (!cap_ib_cm(ib_device, i))
+ continue;
+
port = kzalloc(sizeof *port, GFP_KERNEL);
if (!port)
goto error1;
@@ -3809,7 +3810,16 @@ static void cm_add_one(struct ib_device *ib_device)
ret = ib_modify_port(ib_device, i, 0, &port_modify);
if (ret)
goto error3;
+
+ count++;
}
+
+ if (!count) {
+ device_unregister(cm_dev->device);
+ kfree(cm_dev);
+ return;
+ }
+
ib_set_client_data(ib_device, &cm_client, cm_dev);

write_lock_irqsave(&cm.device_lock, flags);
@@ -3825,6 +3835,9 @@ error1:
port_modify.set_port_cap_mask = 0;
port_modify.clr_port_cap_mask = IB_PORT_CM_SUP;
while (--i) {
+ if (!cap_ib_cm(ib_device, i))
+ continue;
+
port = cm_dev->port[i-1];
ib_modify_port(ib_device, port->port_num, 0, &port_modify);
ib_unregister_mad_agent(port->mad_agent);
@@ -3853,6 +3866,9 @@ static void cm_remove_one(struct ib_device *ib_device)
write_unlock_irqrestore(&cm.device_lock, flags);

for (i = 1; i <= ib_device->phys_port_cnt; i++) {
+ if (!cap_ib_cm(ib_device, i))
+ continue;
+
port = cm_dev->port[i-1];
ib_modify_port(ib_device, port->port_num, 0, &port_modify);
ib_unregister_mad_agent(port->mad_agent);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index ee76010..3ba963f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1803,6 +1803,21 @@ static inline int cap_ib_smi(struct ib_device *device, u8 port_num)
return rdma_transport_ib(device, port_num);
}

+/**
+ * cap_ib_cm - Check if the port of device has the capability Infiniband
+ * Communication Manager.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support Infiniband
+ * Communication Manager.
+ */
+static inline int cap_ib_cm(struct ib_device *device, u8 port_num)
+{
+ return rdma_ib_mgmt(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:32:57

by Michael Wang

Subject: [PATCH v2 06/17] IB/Verbs: Use management helper cap_ib_sa() for sa-check


Introduce helper cap_ib_sa() to check whether a port of an IB device
supports the InfiniBand Subnet Administrator.

Rework ib_sa_add_one() to fit the per-port check better.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/sa_query.c | 27 +++++++++++++++++----------
include/rdma/ib_verbs.h | 15 +++++++++++++++
2 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index c38f030..f704254 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -450,7 +450,7 @@ static void ib_sa_event(struct ib_event_handler *handler, struct ib_event *event
struct ib_sa_port *port =
&sa_dev->port[event->element.port_num - sa_dev->start_port];

- if (rdma_port_get_link_layer(handler->device, port->port_num) != IB_LINK_LAYER_INFINIBAND)
+ if (WARN_ON(!cap_ib_sa(handler->device, port->port_num)))
return;

spin_lock_irqsave(&port->ah_lock, flags);
@@ -1153,9 +1153,7 @@ static void ib_sa_add_one(struct ib_device *device)
{
struct ib_sa_device *sa_dev;
int s, e, i;
-
- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
+ int count = 0;

if (device->node_type == RDMA_NODE_IB_SWITCH)
s = e = 0;
@@ -1175,7 +1173,7 @@ static void ib_sa_add_one(struct ib_device *device)

for (i = 0; i <= e - s; ++i) {
spin_lock_init(&sa_dev->port[i].ah_lock);
- if (rdma_port_get_link_layer(device, i + 1) != IB_LINK_LAYER_INFINIBAND)
+ if (!cap_ib_sa(device, i + 1))
continue;

sa_dev->port[i].sm_ah = NULL;
@@ -1189,6 +1187,13 @@ static void ib_sa_add_one(struct ib_device *device)
goto err;

INIT_WORK(&sa_dev->port[i].update_task, update_sm_ah);
+
+ count++;
+ }
+
+ if (!count) {
+ kfree(sa_dev);
+ return;
}

ib_set_client_data(device, &sa_client, sa_dev);
@@ -1204,16 +1209,18 @@ static void ib_sa_add_one(struct ib_device *device)
if (ib_register_event_handler(&sa_dev->event_handler))
goto err;

- for (i = 0; i <= e - s; ++i)
- if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND)
+ for (i = 0; i <= e - s; ++i) {
+ if (cap_ib_sa(device, i + 1))
update_sm_ah(&sa_dev->port[i].update_task);
+ }

return;

err:
- while (--i >= 0)
- if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND)
+ while (--i >= 0) {
+ if (cap_ib_sa(device, i + 1))
ib_unregister_mad_agent(sa_dev->port[i].agent);
+ }

kfree(sa_dev);

@@ -1233,7 +1240,7 @@ static void ib_sa_remove_one(struct ib_device *device)
flush_workqueue(ib_wq);

for (i = 0; i <= sa_dev->end_port - sa_dev->start_port; ++i) {
- if (rdma_port_get_link_layer(device, i + 1) == IB_LINK_LAYER_INFINIBAND) {
+ if (cap_ib_sa(device, i + 1)) {
ib_unregister_mad_agent(sa_dev->port[i].agent);
if (sa_dev->port[i].sm_ah)
kref_put(&sa_dev->port[i].sm_ah->ref, free_sm_ah);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 3ba963f..c405e45 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1818,6 +1818,21 @@ static inline int cap_ib_cm(struct ib_device *device, u8 port_num)
return rdma_ib_mgmt(device, port_num);
}

+/**
+ * cap_ib_sa - Check if the port of device has the capability Infiniband
+ * Subnet Administrator.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support Infiniband
+ * Subnet Administrator.
+ */
+static inline int cap_ib_sa(struct ib_device *device, u8 port_num)
+{
+ return rdma_transport_ib(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:33:33

by Michael Wang

Subject: [PATCH v2 07/17] IB/Verbs: Use management helper cap_ib_mcast() for mcast-check


Introduce helper cap_ib_mcast() to check whether a port of an IB device
supports InfiniBand Multicast.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/multicast.c | 12 +++---------
include/rdma/ib_verbs.h | 15 +++++++++++++++
2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/multicast.c b/drivers/infiniband/core/multicast.c
index fa17b55..bdc1880 100644
--- a/drivers/infiniband/core/multicast.c
+++ b/drivers/infiniband/core/multicast.c
@@ -780,8 +780,7 @@ static void mcast_event_handler(struct ib_event_handler *handler,
int index;

dev = container_of(handler, struct mcast_device, event_handler);
- if (rdma_port_get_link_layer(dev->device, event->element.port_num) !=
- IB_LINK_LAYER_INFINIBAND)
+ if (WARN_ON(!cap_ib_mcast(dev->device, event->element.port_num)))
return;

index = event->element.port_num - dev->start_port;
@@ -808,9 +807,6 @@ static void mcast_add_one(struct ib_device *device)
int i;
int count = 0;

- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
-
dev = kmalloc(sizeof *dev + device->phys_port_cnt * sizeof *port,
GFP_KERNEL);
if (!dev)
@@ -824,8 +820,7 @@ static void mcast_add_one(struct ib_device *device)
}

for (i = 0; i <= dev->end_port - dev->start_port; i++) {
- if (rdma_port_get_link_layer(device, dev->start_port + i) !=
- IB_LINK_LAYER_INFINIBAND)
+ if (!cap_ib_mcast(device, dev->start_port + i))
continue;
port = &dev->port[i];
port->dev = dev;
@@ -863,8 +858,7 @@ static void mcast_remove_one(struct ib_device *device)
flush_workqueue(mcast_wq);

for (i = 0; i <= dev->end_port - dev->start_port; i++) {
- if (rdma_port_get_link_layer(device, dev->start_port + i) ==
- IB_LINK_LAYER_INFINIBAND) {
+ if (cap_ib_mcast(device, dev->start_port + i)) {
port = &dev->port[i];
deref_port(port);
wait_for_completion(&port->comp);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index c405e45..5a5f6d5 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1833,6 +1833,21 @@ static inline int cap_ib_sa(struct ib_device *device, u8 port_num)
return rdma_transport_ib(device, port_num);
}

+/**
+ * cap_ib_mcast - Check if the port of device has the capability Infiniband
+ * Multicast.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support Infiniband
+ * Multicast.
+ */
+static inline int cap_ib_mcast(struct ib_device *device, u8 port_num)
+{
+ return cap_ib_sa(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:34:12

by Michael Wang

Subject: [PATCH v2 08/17] IB/Verbs: Use management helper cap_ipoib() for ipoib-check


Introduce helper cap_ipoib() to check whether a port of an IB device
supports IP over InfiniBand.

Rework ipoib_add_one() to fit the per-port check better.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/ulp/ipoib/ipoib_main.c | 17 ++++++++++-------
include/rdma/ib_verbs.h | 15 +++++++++++++++
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3..e36a926 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1654,9 +1654,7 @@ static void ipoib_add_one(struct ib_device *device)
struct net_device *dev;
struct ipoib_dev_priv *priv;
int s, e, p;
-
- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
+ int count = 0;

dev_list = kmalloc(sizeof *dev_list, GFP_KERNEL);
if (!dev_list)
@@ -1673,13 +1671,21 @@ static void ipoib_add_one(struct ib_device *device)
}

for (p = s; p <= e; ++p) {
- if (rdma_port_get_link_layer(device, p) != IB_LINK_LAYER_INFINIBAND)
+ if (!cap_ipoib(device, p))
continue;
+
dev = ipoib_add_port("ib%d", device, p);
if (!IS_ERR(dev)) {
priv = netdev_priv(dev);
list_add_tail(&priv->list, dev_list);
}
+
+ count++;
+ }
+
+ if (!count) {
+ kfree(dev_list);
+ return;
}

ib_set_client_data(device, &ipoib_client, dev_list);
@@ -1690,9 +1696,6 @@ static void ipoib_remove_one(struct ib_device *device)
struct ipoib_dev_priv *priv, *tmp;
struct list_head *dev_list;

- if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
- return;
-
dev_list = ib_get_client_data(device, &ipoib_client);
if (!dev_list)
return;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 5a5f6d5..9db8966 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1848,6 +1848,21 @@ static inline int cap_ib_mcast(struct ib_device *device, u8 port_num)
return cap_ib_sa(device, port_num);
}

+/**
+ * cap_ipoib - Check if the port of device has the capability
+ * IP over Infiniband.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support
+ * IP over Infiniband.
+ */
+static inline int cap_ipoib(struct ib_device *device, u8 port_num)
+{
+ return rdma_transport_ib(device, port_num);
+}
+
int ib_query_gid(struct ib_device *device,
u8 port_num, int index, union ib_gid *gid);

--
2.1.0

2015-04-07 12:34:52

by Michael Wang

Subject: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()


Introduce helper cap_read_multi_sge() to check whether a port of an IB
device supports RDMA Read with multiple Scatter-Gather Entries.

Rework svc_rdma_accept() to adopt the management helpers.
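
A minimal sketch of how the capability feeds into the SGE limit, assuming simplified stand-ins: the enum and the sge_* names are hypothetical, and only the "not iWARP" rule and the min() clamp are taken from the diff below.

```c
/* Hypothetical transport tags for illustration; the kernel has its own enum. */
enum sge_transport { SGE_IB, SGE_IBOE, SGE_IWARP };

/* Stand-in for cap_read_multi_sge(): the patch defines it as "not iWARP". */
static int sge_cap_read_multi(enum sge_transport t)
{
	return t != SGE_IWARP;
}

/*
 * Same shape as rdma_read_max_sge(): a port without the capability is
 * limited to one SGE per RDMA Read, any other port to the smaller of the
 * requested count and the transport's maximum.
 */
static int sge_read_max(enum sge_transport t, int sge_count, int max_sge)
{
	if (!sge_cap_read_multi(t))
		return 1;
	return sge_count < max_sge ? sge_count : max_sge;
}
```

The caller no longer needs to know which transport imposes the limit; it only asks whether the port has the capability.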

Cc: Tom Talpey <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
include/rdma/ib_verbs.h | 15 +++++++++++++++
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 4 ++--
net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 +++++-------
3 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9db8966..cae6f2d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1849,6 +1849,21 @@ static inline int cap_ib_mcast(struct ib_device *device, u8 port_num)
}

/**
+ * cap_read_multi_sge - Check if the port of device has the capability
+ * RDMA Read Multiple Scatter-Gather Entries.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support
+ * RDMA Read Multiple Scatter-Gather Entries.
+ */
+static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
+{
+ return !rdma_transport_iwarp(device, port_num);
+}
+
+/**
* cap_ipoib - Check if the port of device has the capability
* IP over Infiniband.
*
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e011027..604d035 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -118,8 +118,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,

static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
{
- if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
- RDMA_TRANSPORT_IWARP)
+ if (!cap_read_multi_sge(xprt->sc_cm_id->device,
+ xprt->sc_cm_id->port_num))
return 1;
else
return min_t(int, sge_count, xprt->sc_max_sge);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 4e61880..e75175d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -979,8 +979,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
/*
* Determine if a DMA MR is required and if so, what privs are required
*/
- switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
- case RDMA_TRANSPORT_IWARP:
+ if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
+ newxprt->sc_cm_id->port_num)) {
newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
need_dma_mr = 1;
@@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
} else
need_dma_mr = 0;
- break;
- case RDMA_TRANSPORT_IB:
+ } else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
+ newxprt->sc_cm_id->port_num)) {
if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
need_dma_mr = 1;
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
@@ -1003,10 +1003,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
} else
need_dma_mr = 0;
- break;
- default:
+ } else
goto errout;
- }

/* Create the DMA MR if needed, otherwise, use the DMA LKEY */
if (need_dma_mr) {
--
2.1.0

2015-04-07 12:35:32

by Michael Wang

Subject: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers


Adopt management helpers for:
ib_init_ah_from_path()
ib_init_ah_from_wc()
ib_resolve_eth_l2_attrs()

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/sa_query.c | 2 +-
drivers/infiniband/core/verbs.c | 6 ++----
2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index f704254..4e61104 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
ah_attr->port_num = port_num;
ah_attr->static_rate = rec->rate;

- force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
+ force_grh = !rdma_transport_ib(device, port_num);

if (rec->hop_limit > 1 || force_grh) {
ah_attr->ah_flags = IB_AH_GRH;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 83370de..ca06f76 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -200,11 +200,9 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num, struct ib_wc *wc,
u32 flow_class;
u16 gid_index;
int ret;
- int is_eth = (rdma_port_get_link_layer(device, port_num) ==
- IB_LINK_LAYER_ETHERNET);

memset(ah_attr, 0, sizeof *ah_attr);
- if (is_eth) {
+ if (!rdma_transport_ib(device, port_num)) {
if (!(wc->wc_flags & IB_WC_GRH))
return -EPROTOTYPE;

@@ -873,7 +871,7 @@ int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
union ib_gid sgid;

if ((*qp_attr_mask & IB_QP_AV) &&
- (rdma_port_get_link_layer(qp->device, qp_attr->ah_attr.port_num) == IB_LINK_LAYER_ETHERNET)) {
+ (!rdma_transport_ib(qp->device, qp_attr->ah_attr.port_num))) {
ret = ib_query_gid(qp->device, qp_attr->ah_attr.port_num,
qp_attr->ah_attr.grh.sgid_index, &sgid);
if (ret)
--
2.1.0

2015-04-07 12:36:09

by Michael Wang

Subject: [PATCH v2 11/17] IB/Verbs: Reform link_layer_show() and ib_uverbs_query_port()


Reform link_layer_show() and ib_uverbs_query_port() with management helpers.
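
To make the resulting behaviour concrete, here is a small sketch of the two-way mapping that link_layer_show() ends up with. The enum and ll_name() are hypothetical names for illustration; the transport list follows the mapping list in the cover letter.

```c
/* Hypothetical transport tags, following the cover letter's mapping list. */
enum ll_transport { LL_IB, LL_IBOE, LL_IWARP, LL_USNIC_UDP };

/*
 * With the reform, only the IB transport reports "InfiniBand"; every other
 * transport reports "Ethernet", and the old "Unknown" branch is gone.
 */
static const char *ll_name(enum ll_transport t)
{
	return t == LL_IB ? "InfiniBand" : "Ethernet";
}
```

One consequence worth keeping in mind when reviewing: a future transport that is neither IB nor Ethernet-based would also be reported as "Ethernet" under this scheme.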

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/sysfs.c | 8 ++------
drivers/infiniband/core/uverbs_cmd.c | 6 ++++--
2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index cbd0383..aa53e40 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -248,14 +248,10 @@ static ssize_t phys_state_show(struct ib_port *p, struct port_attribute *unused,
static ssize_t link_layer_show(struct ib_port *p, struct port_attribute *unused,
char *buf)
{
- switch (rdma_port_get_link_layer(p->ibdev, p->port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
+ if (rdma_transport_ib(p->ibdev, p->port_num))
return sprintf(buf, "%s\n", "InfiniBand");
- case IB_LINK_LAYER_ETHERNET:
+ else
return sprintf(buf, "%s\n", "Ethernet");
- default:
- return sprintf(buf, "%s\n", "Unknown");
- }
}

static PORT_ATTR_RO(state);
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index a9f0489..3eb6eb5 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -515,8 +515,10 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file *file,
resp.active_width = attr.active_width;
resp.active_speed = attr.active_speed;
resp.phys_state = attr.phys_state;
- resp.link_layer = rdma_port_get_link_layer(file->device->ib_dev,
- cmd.port_num);
+ resp.link_layer = rdma_transport_ib(file->device->ib_dev,
+ cmd.port_num) ?
+ IB_LINK_LAYER_INFINIBAND :
+ IB_LINK_LAYER_ETHERNET;

if (copy_to_user((void __user *) (unsigned long) cmd.response,
&resp, sizeof resp))
--
2.1.0

2015-04-07 12:36:47

by Michael Wang

Subject: [PATCH v2 12/17] IB/Verbs: Use management helper cap_ib_cm_dev() for cm-device-check


Introduce helper cap_ib_cm_dev() to check whether any port of a device
supports the InfiniBand Communication Manager.

Cc: Tom Talpey <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/cma.c | 5 ++---
drivers/infiniband/core/ucm.c | 3 +--
include/rdma/ib_verbs.h | 20 ++++++++++++++++++++
3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d570030..d8a8ea7 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1625,8 +1625,7 @@ static void cma_listen_on_dev(struct rdma_id_private *id_priv,
struct rdma_cm_id *id;
int ret;

- if (cma_family(id_priv) == AF_IB &&
- rdma_node_get_transport(cma_dev->device->node_type) != RDMA_TRANSPORT_IB)
+ if (cma_family(id_priv) == AF_IB && !cap_ib_cm_dev(cma_dev->device))
return;

id = rdma_create_id(cma_listen_handler, id_priv, id_priv->id.ps,
@@ -2028,7 +2027,7 @@ static int cma_bind_loopback(struct rdma_id_private *id_priv)
mutex_lock(&lock);
list_for_each_entry(cur_dev, &dev_list, list) {
if (cma_family(id_priv) == AF_IB &&
- rdma_node_get_transport(cur_dev->device->node_type) != RDMA_TRANSPORT_IB)
+ !cap_ib_cm_dev(cur_dev->device))
continue;

if (!cma_dev)
diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index f2f6393..065405e 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -1253,8 +1253,7 @@ static void ib_ucm_add_one(struct ib_device *device)
dev_t base;
struct ib_ucm_device *ucm_dev;

- if (!device->alloc_ucontext ||
- rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
+ if (!device->alloc_ucontext || !cap_ib_cm_dev(device))
return;

ucm_dev = kzalloc(sizeof *ucm_dev, GFP_KERNEL);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index cae6f2d..2767a91 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1819,6 +1819,26 @@ static inline int cap_ib_cm(struct ib_device *device, u8 port_num)
}

/**
+ * cap_ib_cm_dev - Check if any port of device has the capability Infiniband
+ * Communication Manager.
+ *
+ * @device: Device to be checked
+ *
+ * Return 0 when no port of the device supports Infiniband
+ * Communication Manager.
+ */
+static inline int cap_ib_cm_dev(struct ib_device *device)
+{
+ int i;
+
+ for (i = 1; i <= device->phys_port_cnt; i++) {
+ if (cap_ib_cm(device, i))
+ return 1;
+ }
+ return 0;
+}
+
+/**
* cap_ib_sa - Check if the port of device has the capability Infiniband
* Subnet Administrator.
*
--
2.1.0

2015-04-07 12:37:21

by Michael Wang

Subject: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers


Reform cma/ucma with management helpers.
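
The main effect on cma is that the nested switch on node transport and link layer collapses into a flat chain of per-port transport checks. A sketch of that dispatch, modelled on rdma_resolve_route() and using hypothetical stand-in names (the enums and route_resolve() are not kernel symbols):

```c
#include <errno.h>

/* Hypothetical transport tags, following the cover letter's mapping list. */
enum route_transport { ROUTE_IB, ROUTE_IBOE, ROUTE_IWARP, ROUTE_USNIC_UDP };

/* Which resolver would be picked; stands in for the cma_resolve_*_route() calls. */
enum route_kind { ROUTE_KIND_IB = 1, ROUTE_KIND_IBOE, ROUTE_KIND_IW };

/*
 * Flat dispatch on the port's transport: one branch per transport, with
 * -ENOSYS for anything the CMA does not handle.
 */
static int route_resolve(enum route_transport t)
{
	if (t == ROUTE_IB)
		return ROUTE_KIND_IB;
	else if (t == ROUTE_IBOE)
		return ROUTE_KIND_IBOE;
	else if (t == ROUTE_IWARP)
		return ROUTE_KIND_IW;
	return -ENOSYS;
}
```

Because IB and IBOE are now distinct transports, the old inner switch on link layer is no longer needed to tell them apart.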

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/cma.c | 182 +++++++++++++----------------------------
drivers/infiniband/core/ucma.c | 25 ++----
2 files changed, 65 insertions(+), 142 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index d8a8ea7..c23f483 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -435,10 +435,10 @@ static int cma_resolve_ib_dev(struct rdma_id_private *id_priv)
pkey = ntohs(addr->sib_pkey);

list_for_each_entry(cur_dev, &dev_list, list) {
- if (rdma_node_get_transport(cur_dev->device->node_type) != RDMA_TRANSPORT_IB)
- continue;
-
for (p = 1; p <= cur_dev->device->phys_port_cnt; ++p) {
+ if (!rdma_ib_mgmt(cur_dev->device, p))
+ continue;
+
if (ib_find_cached_pkey(cur_dev->device, p, pkey, &index))
continue;

@@ -633,10 +633,10 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
if (ret)
goto out;

- if (rdma_node_get_transport(id_priv->cma_dev->device->node_type)
- == RDMA_TRANSPORT_IB &&
- rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)
- == IB_LINK_LAYER_ETHERNET) {
+ /* Will this happen? */
+ BUG_ON(id_priv->cma_dev->device != id_priv->id.device);
+
+ if (rdma_transport_iboe(id_priv->id.device, id_priv->id.port_num)) {
ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL);

if (ret)
@@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
int ret;
u16 pkey;

- if (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num) ==
- IB_LINK_LAYER_INFINIBAND)
+ if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num))
pkey = ib_addr_get_pkey(dev_addr);
else
pkey = 0xffff;
@@ -735,8 +734,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
int ret = 0;

id_priv = container_of(id, struct rdma_id_private, id);
- switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
if (!id_priv->cm_id.ib || (id_priv->id.qp_type == IB_QPT_UD))
ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
else
@@ -745,19 +743,16 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,

if (qp_attr->qp_state == IB_QPS_RTR)
qp_attr->rq_psn = id_priv->seq_num;
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id_priv->id.device,
+ id_priv->id.port_num)) {
if (!id_priv->cm_id.iw) {
qp_attr->qp_access_flags = 0;
*qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
} else
ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
qp_attr_mask);
- break;
- default:
+ } else
ret = -ENOSYS;
- break;
- }

return ret;
}
@@ -928,13 +923,9 @@ static inline int cma_user_data_offset(struct rdma_id_private *id_priv)

static void cma_cancel_route(struct rdma_id_private *id_priv)
{
- switch (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
+ if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num)) {
if (id_priv->query)
ib_sa_cancel_query(id_priv->query_id, id_priv->query);
- break;
- default:
- break;
}
}

@@ -1006,17 +997,14 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
mc = container_of(id_priv->mc_list.next,
struct cma_multicast, list);
list_del(&mc->list);
- switch (rdma_port_get_link_layer(id_priv->cma_dev->device, id_priv->id.port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
+ if (rdma_transport_ib(id_priv->cma_dev->device,
+ id_priv->id.port_num)) {
ib_sa_free_multicast(mc->multicast.ib);
kfree(mc);
break;
- case IB_LINK_LAYER_ETHERNET:
+ } else if (rdma_transport_iboe(id_priv->cma_dev->device,
+ id_priv->id.port_num))
kref_put(&mc->mcref, release_mc);
- break;
- default:
- break;
- }
}
}

@@ -1037,17 +1025,13 @@ void rdma_destroy_id(struct rdma_cm_id *id)
mutex_unlock(&id_priv->handler_mutex);

if (id_priv->cma_dev) {
- switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
if (id_priv->cm_id.ib)
ib_destroy_cm_id(id_priv->cm_id.ib);
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id_priv->id.device,
+ id_priv->id.port_num)) {
if (id_priv->cm_id.iw)
iw_destroy_cm_id(id_priv->cm_id.iw);
- break;
- default:
- break;
}
cma_leave_mc_groups(id_priv);
cma_release_dev(id_priv);
@@ -1966,26 +1950,14 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
return -EINVAL;

atomic_inc(&id_priv->refcount);
- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
- switch (rdma_port_get_link_layer(id->device, id->port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
- ret = cma_resolve_ib_route(id_priv, timeout_ms);
- break;
- case IB_LINK_LAYER_ETHERNET:
- ret = cma_resolve_iboe_route(id_priv);
- break;
- default:
- ret = -ENOSYS;
- }
- break;
- case RDMA_TRANSPORT_IWARP:
+ if (rdma_transport_ib(id->device, id->port_num))
+ ret = cma_resolve_ib_route(id_priv, timeout_ms);
+ else if (rdma_transport_iboe(id->device, id->port_num))
+ ret = cma_resolve_iboe_route(id_priv);
+ else if (rdma_transport_iwarp(id->device, id->port_num))
ret = cma_resolve_iw_route(id_priv, timeout_ms);
- break;
- default:
+ else
ret = -ENOSYS;
- break;
- }
if (ret)
goto err;

@@ -2059,7 +2031,7 @@ port_found:
goto out;

id_priv->id.route.addr.dev_addr.dev_type =
- (rdma_port_get_link_layer(cma_dev->device, p) == IB_LINK_LAYER_INFINIBAND) ?
+ (rdma_transport_ib(cma_dev->device, p)) ?
ARPHRD_INFINIBAND : ARPHRD_ETHER;

rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
@@ -2536,18 +2508,15 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)

id_priv->backlog = backlog;
if (id->device) {
- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id->device, id->port_num)) {
ret = cma_ib_listen(id_priv);
if (ret)
goto err;
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id->device, id->port_num)) {
ret = cma_iw_listen(id_priv, backlog);
if (ret)
goto err;
- break;
- default:
+ } else {
ret = -ENOSYS;
goto err;
}
@@ -2883,20 +2852,15 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
id_priv->srq = conn_param->srq;
}

- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id->device, id->port_num)) {
if (id->qp_type == IB_QPT_UD)
ret = cma_resolve_ib_udp(id_priv, conn_param);
else
ret = cma_connect_ib(id_priv, conn_param);
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id->device, id->port_num))
ret = cma_connect_iw(id_priv, conn_param);
- break;
- default:
+ else
ret = -ENOSYS;
- break;
- }
if (ret)
goto err;

@@ -2999,8 +2963,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
id_priv->srq = conn_param->srq;
}

- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id->device, id->port_num)) {
if (id->qp_type == IB_QPT_UD) {
if (conn_param)
ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
@@ -3016,14 +2979,10 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
else
ret = cma_rep_recv(id_priv);
}
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id->device, id->port_num))
ret = cma_accept_iw(id_priv, conn_param);
- break;
- default:
+ else
ret = -ENOSYS;
- break;
- }

if (ret)
goto reject;
@@ -3067,8 +3026,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
if (!id_priv->cm_id.ib)
return -EINVAL;

- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id->device, id->port_num)) {
if (id->qp_type == IB_QPT_UD)
ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, 0,
private_data, private_data_len);
@@ -3076,15 +3034,11 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
ret = ib_send_cm_rej(id_priv->cm_id.ib,
IB_CM_REJ_CONSUMER_DEFINED, NULL,
0, private_data, private_data_len);
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id->device, id->port_num)) {
ret = iw_cm_reject(id_priv->cm_id.iw,
private_data, private_data_len);
- break;
- default:
+ } else
ret = -ENOSYS;
- break;
- }
return ret;
}
EXPORT_SYMBOL(rdma_reject);
@@ -3098,22 +3052,17 @@ int rdma_disconnect(struct rdma_cm_id *id)
if (!id_priv->cm_id.ib)
return -EINVAL;

- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
+ if (rdma_ib_mgmt(id->device, id->port_num)) {
ret = cma_modify_qp_err(id_priv);
if (ret)
goto out;
/* Initiate or respond to a disconnect. */
if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
- break;
- case RDMA_TRANSPORT_IWARP:
+ } else if (rdma_transport_iwarp(id->device, id->port_num)) {
ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
- break;
- default:
+ } else
ret = -EINVAL;
- break;
- }
out:
return ret;
}
@@ -3359,24 +3308,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
list_add(&mc->list, &id_priv->mc_list);
spin_unlock(&id_priv->lock);

- switch (rdma_node_get_transport(id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
- switch (rdma_port_get_link_layer(id->device, id->port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
- ret = cma_join_ib_multicast(id_priv, mc);
- break;
- case IB_LINK_LAYER_ETHERNET:
- kref_init(&mc->mcref);
- ret = cma_iboe_join_multicast(id_priv, mc);
- break;
- default:
- ret = -EINVAL;
- }
- break;
- default:
+ if (rdma_transport_iboe(id->device, id->port_num)) {
+ kref_init(&mc->mcref);
+ ret = cma_iboe_join_multicast(id_priv, mc);
+ } else if (rdma_transport_ib(id->device, id->port_num))
+ ret = cma_join_ib_multicast(id_priv, mc);
+ else
ret = -ENOSYS;
- break;
- }

if (ret) {
spin_lock_irq(&id_priv->lock);
@@ -3404,19 +3342,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
ib_detach_mcast(id->qp,
&mc->multicast.ib->rec.mgid,
be16_to_cpu(mc->multicast.ib->rec.mlid));
- if (rdma_node_get_transport(id_priv->cma_dev->device->node_type) == RDMA_TRANSPORT_IB) {
- switch (rdma_port_get_link_layer(id->device, id->port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
- ib_sa_free_multicast(mc->multicast.ib);
- kfree(mc);
- break;
- case IB_LINK_LAYER_ETHERNET:
- kref_put(&mc->mcref, release_mc);
- break;
- default:
- break;
- }
- }
+
+ /* Will this happen? */
+ BUG_ON(id_priv->cma_dev->device != id->device);
+
+ if (rdma_transport_ib(id->device, id->port_num)) {
+ ib_sa_free_multicast(mc->multicast.ib);
+ kfree(mc);
+ } else if (rdma_transport_iboe(id->device,
+ id->port_num))
+ kref_put(&mc->mcref, release_mc);
+
return;
}
}
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 45d67e9..42c9bf6 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -722,26 +722,13 @@ static ssize_t ucma_query_route(struct ucma_file *file,

resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
resp.port_num = ctx->cm_id->port_num;
- switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
- case RDMA_TRANSPORT_IB:
- switch (rdma_port_get_link_layer(ctx->cm_id->device,
- ctx->cm_id->port_num)) {
- case IB_LINK_LAYER_INFINIBAND:
- ucma_copy_ib_route(&resp, &ctx->cm_id->route);
- break;
- case IB_LINK_LAYER_ETHERNET:
- ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
- break;
- default:
- break;
- }
- break;
- case RDMA_TRANSPORT_IWARP:
+
+ if (rdma_transport_ib(ctx->cm_id->device, ctx->cm_id->port_num))
+ ucma_copy_ib_route(&resp, &ctx->cm_id->route);
+ else if (rdma_transport_iboe(ctx->cm_id->device, ctx->cm_id->port_num))
+ ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
+ else if (rdma_transport_iwarp(ctx->cm_id->device, ctx->cm_id->port_num))
ucma_copy_iw_route(&resp, &ctx->cm_id->route);
- break;
- default:
- break;
- }

out:
if (copy_to_user((void __user *)(unsigned long)cmd.response,
--
2.1.0

2015-04-07 12:38:14

by Michael Wang

Subject: [PATCH v2 14/17] IB/Verbs: Reserve legacy transport type for 'struct rdma_dev_addr'


Reserve the legacy transport type for the 'transport' member
of 'struct rdma_dev_addr' until we make sure this is no
longer needed.
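
As a rough illustration of the intent (a user-space sketch only; the
model_* names are stand-ins invented here, not kernel symbols): the legacy
value depends on the node type alone, so an IB_CA device on an Ethernet
port would still report IB in this field, even where query_transport()
reports IBOE.

```c
/* User-space sketch; the model_* names are stand-ins, not kernel symbols. */
enum model_node_type {
	MODEL_NODE_IB_CA,
	MODEL_NODE_IB_SWITCH,
	MODEL_NODE_IB_ROUTER,
	MODEL_NODE_RNIC,
	MODEL_NODE_USNIC,
	MODEL_NODE_USNIC_UDP,
};

enum model_transport {
	MODEL_TRANSPORT_IB,
	MODEL_TRANSPORT_IWARP,
	MODEL_TRANSPORT_USNIC,
	MODEL_TRANSPORT_USNIC_UDP,
	MODEL_TRANSPORT_IBOE,	/* new value, never stored in the legacy field */
};

/* Return the legacy transport for a node type, or -1 if it is unknown. */
static int model_legacy_transport(enum model_node_type type)
{
	switch (type) {
	case MODEL_NODE_IB_CA:
	case MODEL_NODE_IB_SWITCH:
	case MODEL_NODE_IB_ROUTER:
		return MODEL_TRANSPORT_IB;
	case MODEL_NODE_RNIC:
		return MODEL_TRANSPORT_IWARP;
	case MODEL_NODE_USNIC:
		return MODEL_TRANSPORT_USNIC;
	case MODEL_NODE_USNIC_UDP:
		return MODEL_TRANSPORT_USNIC_UDP;
	}
	return -1;
}
```

The sketch returns -1 for an unknown node type so that it can be exercised
outside the kernel; the patch itself calls BUG() in that case.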

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/cma.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c23f483..e26b42e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -244,14 +244,35 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver)
hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF);
}

+static inline void cma_set_legacy_transport(struct rdma_cm_id *id)
+{
+ switch (id->device->node_type) {
+ case RDMA_NODE_IB_CA:
+ case RDMA_NODE_IB_SWITCH:
+ case RDMA_NODE_IB_ROUTER:
+ id->route.addr.dev_addr.transport = RDMA_TRANSPORT_IB;
+ break;
+ case RDMA_NODE_RNIC:
+ id->route.addr.dev_addr.transport = RDMA_TRANSPORT_IWARP;
+ break;
+ case RDMA_NODE_USNIC:
+ id->route.addr.dev_addr.transport = RDMA_TRANSPORT_USNIC;
+ break;
+ case RDMA_NODE_USNIC_UDP:
+ id->route.addr.dev_addr.transport = RDMA_TRANSPORT_USNIC_UDP;
+ break;
+ default:
+ BUG();
+ }
+}
+
static void cma_attach_to_dev(struct rdma_id_private *id_priv,
struct cma_device *cma_dev)
{
atomic_inc(&cma_dev->refcount);
id_priv->cma_dev = cma_dev;
id_priv->id.device = cma_dev->device;
- id_priv->id.route.addr.dev_addr.transport =
- rdma_node_get_transport(cma_dev->device->node_type);
+ cma_set_legacy_transport(&id_priv->id);
list_add_tail(&id_priv->list, &cma_dev->id_list);
}

--
2.1.0

2015-04-07 12:38:54

by Michael Wang

Subject: [PATCH v2 15/17] IB/Verbs: Reform cma_acquire_dev() with management helpers


Reform cma_acquire_dev() with the management helpers, and introduce
cma_validate_port() to make the code cleaner.
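
The rule cma_validate_port() is meant to enforce can be sketched in user
space as below. This is only a model under stated assumptions: struct
model_port, the MODEL_* constants and the -ENOENT result for a missing GID
are invented for the illustration and stand in for rdma_transport_ib() and
the cached GID lookup.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_ARPHRD_INFINIBAND	32	/* stand-in for ARPHRD_INFINIBAND */
#define MODEL_ARPHRD_ETHER	1	/* stand-in for any non-IB device type */

struct model_port {
	bool transport_is_ib;	/* what rdma_transport_ib() would report */
	int gid_found_on;	/* port holding the GID, 0 if it is absent */
};

/*
 * Accept a port only when its transport matches the netdev type and the
 * GID was found on that very port. Ports are numbered from 1.
 */
static int model_validate_port(const struct model_port *p, int port,
			       int dev_type)
{
	bool want_ib = (dev_type == MODEL_ARPHRD_INFINIBAND);

	if (want_ib != p->transport_is_ib)
		return -ENODEV;		/* IB netdev on non-IB port, or reverse */
	if (p->gid_found_on == 0)
		return -ENOENT;		/* assumed error for "GID not cached" */
	if (p->gid_found_on != port)
		return -ENODEV;		/* GID lives on a different port */
	return 0;
}
```

Note that a GID found on a different port is treated as a failure here, so
the caller only ever sees 0 for the port it actually asked about.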

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/cma.c | 69 +++++++++++++++++++++++++------------------
1 file changed, 41 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index e26b42e..dc05cd0 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -370,18 +370,36 @@ static int cma_translate_addr(struct sockaddr *addr, struct rdma_dev_addr *dev_a
return ret;
}

+static inline int cma_validate_port(struct ib_device *device, u8 port,
+ union ib_gid *gid, int dev_type)
+{
+ u8 found_port;
+ int ret = -ENODEV;
+
+ if ((dev_type == ARPHRD_INFINIBAND) && !rdma_transport_ib(device, port))
+ return ret;
+
+ if ((dev_type != ARPHRD_INFINIBAND) && rdma_transport_ib(device, port))
+ return ret;
+
+ ret = ib_find_cached_gid(device, gid, &found_port, NULL);
+
+ if (!ret && (port != found_port))
+ return -ENODEV;
+
+ return ret;
+}
+
static int cma_acquire_dev(struct rdma_id_private *id_priv,
struct rdma_id_private *listen_id_priv)
{
struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
struct cma_device *cma_dev;
- union ib_gid gid, iboe_gid;
+ union ib_gid gid, iboe_gid, *gidp;
int ret = -ENODEV;
- u8 port, found_port;
- enum rdma_link_layer dev_ll = dev_addr->dev_type == ARPHRD_INFINIBAND ?
- IB_LINK_LAYER_INFINIBAND : IB_LINK_LAYER_ETHERNET;
+ u8 port;

- if (dev_ll != IB_LINK_LAYER_INFINIBAND &&
+ if (dev_addr->dev_type != ARPHRD_INFINIBAND &&
id_priv->id.ps == RDMA_PS_IPOIB)
return -EINVAL;

@@ -391,41 +409,36 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv,

memcpy(&gid, dev_addr->src_dev_addr +
rdma_addr_gid_offset(dev_addr), sizeof gid);
- if (listen_id_priv &&
- rdma_port_get_link_layer(listen_id_priv->id.device,
- listen_id_priv->id.port_num) == dev_ll) {
+
+ if (listen_id_priv) {
cma_dev = listen_id_priv->cma_dev;
port = listen_id_priv->id.port_num;
- if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
- rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
- ret = ib_find_cached_gid(cma_dev->device, &iboe_gid,
- &found_port, NULL);
- else
- ret = ib_find_cached_gid(cma_dev->device, &gid,
- &found_port, NULL);
+ gidp = rdma_transport_iboe(cma_dev->device, port) ?
+ &iboe_gid : &gid;

- if (!ret && (port == found_port)) {
- id_priv->id.port_num = found_port;
+ ret = cma_validate_port(cma_dev->device, port, gidp,
+ dev_addr->dev_type);
+ if (!ret) {
+ id_priv->id.port_num = port;
goto out;
}
}
+
list_for_each_entry(cma_dev, &dev_list, list) {
for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) {
if (listen_id_priv &&
listen_id_priv->cma_dev == cma_dev &&
listen_id_priv->id.port_num == port)
continue;
- if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) {
- if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
- rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
- ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, &found_port, NULL);
- else
- ret = ib_find_cached_gid(cma_dev->device, &gid, &found_port, NULL);
-
- if (!ret && (port == found_port)) {
- id_priv->id.port_num = found_port;
- goto out;
- }
+
+ gidp = rdma_transport_iboe(cma_dev->device, port) ?
+ &iboe_gid : &gid;
+
+ ret = cma_validate_port(cma_dev->device, port, gidp,
+ dev_addr->dev_type);
+ if (!ret) {
+ id_priv->id.port_num = port;
+ goto out;
}
}
}
--
2.1.0

2015-04-07 12:39:29

by Michael Wang

Subject: [PATCH v2 16/17] IB/Verbs: Cleanup rdma_node_get_transport()


We have got rid of every user of rdma_node_get_transport(), so remove
it now.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/verbs.c | 21 ---------------------
include/rdma/ib_verbs.h | 3 ---
2 files changed, 24 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index ca06f76..49acdbc 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -107,27 +107,6 @@ __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate)
}
EXPORT_SYMBOL(ib_rate_to_mbps);

-__attribute_const__ enum rdma_transport_type
-rdma_node_get_transport(enum rdma_node_type node_type)
-{
- switch (node_type) {
- case RDMA_NODE_IB_CA:
- case RDMA_NODE_IB_SWITCH:
- case RDMA_NODE_IB_ROUTER:
- return RDMA_TRANSPORT_IB;
- case RDMA_NODE_RNIC:
- return RDMA_TRANSPORT_IWARP;
- case RDMA_NODE_USNIC:
- return RDMA_TRANSPORT_USNIC;
- case RDMA_NODE_USNIC_UDP:
- return RDMA_TRANSPORT_USNIC_UDP;
- default:
- BUG();
- return 0;
- }
-}
-EXPORT_SYMBOL(rdma_node_get_transport);
-
enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_num)
{
if (device->get_link_layer)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 2767a91..f033f824 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -84,9 +84,6 @@ enum rdma_transport_type {
RDMA_TRANSPORT_IBOE,
};

-__attribute_const__ enum rdma_transport_type
-rdma_node_get_transport(enum rdma_node_type node_type);
-
enum rdma_link_layer {
IB_LINK_LAYER_UNSPECIFIED,
IB_LINK_LAYER_INFINIBAND,
--
2.1.0

2015-04-07 12:40:03

by Michael Wang

Subject: [PATCH v2 17/17] IB/Verbs: Move rdma_port_get_link_layer() to mlx4 header file


Now that only mlx4 still uses rdma_port_get_link_layer(), move it
to its private header file.

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/verbs.c | 20 --------------------
drivers/infiniband/hw/mlx4/mlx4_ib.h | 8 ++++++++
include/rdma/ib_verbs.h | 3 ---
3 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 49acdbc..f1eac93 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -107,26 +107,6 @@ __attribute_const__ int ib_rate_to_mbps(enum ib_rate rate)
}
EXPORT_SYMBOL(ib_rate_to_mbps);

-enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_num)
-{
- if (device->get_link_layer)
- return device->get_link_layer(device, port_num);
-
- switch (device->query_transport(device, port_num)) {
- case RDMA_TRANSPORT_IB:
- case RDMA_TRANSPORT_IBOE:
- return IB_LINK_LAYER_INFINIBAND;
- case RDMA_TRANSPORT_IWARP:
- case RDMA_TRANSPORT_USNIC:
- case RDMA_TRANSPORT_USNIC_UDP:
- return IB_LINK_LAYER_ETHERNET;
- default:
- BUG();
- return IB_LINK_LAYER_UNSPECIFIED;
- }
-}
-EXPORT_SYMBOL(rdma_port_get_link_layer);
-
/* Protection domains */

struct ib_pd *ib_alloc_pd(struct ib_device *device)
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 6eb743f..8e86ecc 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -708,6 +708,14 @@ int __mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
union ib_gid *gid, int netw_view);

+static inline enum rdma_link_layer
+rdma_port_get_link_layer(struct ib_device *device, u8 port_num)
+{
+ /* Will this happen? */
+ BUG_ON(!device->get_link_layer);
+ return device->get_link_layer(device, port_num);
+}
+
static inline bool mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah)
{
u8 port = be32_to_cpu(ah->av.ib.port_pd) >> 24 & 3;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f033f824..67b3e71 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1742,9 +1742,6 @@ int ib_query_device(struct ib_device *device,
int ib_query_port(struct ib_device *device,
u8 port_num, struct ib_port_attr *port_attr);

-enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device,
- u8 port_num);
-
static inline int rdma_transport_ib(struct ib_device *device, u8 port_num)
{
return device->query_transport(device, port_num)
--
2.1.0

2015-04-07 12:42:12

by Michael Wang

Subject: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW


Add a new callback, query_transport(), and implement it for each HW driver.

Mapping List:
driver    node-type   link-layer  old-transport  new-transport
nes       RNIC        ETH         IWARP          IWARP
amso1100  RNIC        ETH         IWARP          IWARP
cxgb3     RNIC        ETH         IWARP          IWARP
cxgb4     RNIC        ETH         IWARP          IWARP
usnic     USNIC_UDP   ETH         USNIC_UDP      USNIC_UDP
ocrdma    IB_CA       ETH         IB             IBOE
mlx4      IB_CA       IB/ETH      IB             IB/IBOE
mlx5      IB_CA       IB          IB             IB
ehca      IB_CA       IB          IB             IB
ipath     IB_CA       IB          IB             IB
mthca     IB_CA       IB          IB             IB
qib       IB_CA       IB          IB             IB
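
The per-port nature of the callback can be sketched in user space as
follows. Everything named sketch_* is invented for this illustration,
including the eth_port_mask field; it only mimics the shape of
query_transport() and of the rdma_transport_*() helpers and is not the
mlx4 implementation.

```c
#include <stdbool.h>

/* Stand-ins for the rdma_transport_type values used in this series. */
enum sketch_transport {
	SKETCH_IB,
	SKETCH_IWARP,
	SKETCH_USNIC,
	SKETCH_USNIC_UDP,
	SKETCH_IBOE,
};

struct sketch_device {
	/* Mandatory per-port callback, modelled on query_transport(). */
	enum sketch_transport (*query_transport)(const struct sketch_device *dev,
						 unsigned int port_num);
	unsigned int eth_port_mask;	/* hypothetical: bit N set = port N is Ethernet */
};

/* A device whose ports all share one transport (the iWARP rows above). */
static enum sketch_transport
sketch_fixed_iwarp(const struct sketch_device *dev, unsigned int port_num)
{
	(void)dev;
	(void)port_num;
	return SKETCH_IWARP;
}

/* A device whose ports may differ (the IB/IBOE row above). */
static enum sketch_transport
sketch_mixed_ports(const struct sketch_device *dev, unsigned int port_num)
{
	return (dev->eth_port_mask & (1u << port_num)) ? SKETCH_IBOE : SKETCH_IB;
}

/* One comparison replaces the old transport plus link-layer pair of checks. */
static bool sketch_transport_is(const struct sketch_device *dev,
				unsigned int port_num,
				enum sketch_transport want)
{
	return dev->query_transport(dev, port_num) == want;
}
```

With this shape a caller asks a single question per port, for example
sketch_transport_is(dev, port, SKETCH_IBOE), instead of combining a
device-wide transport with a per-port link layer.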

Cc: Jason Gunthorpe <[email protected]>
Cc: Doug Ledford <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Sean Hefty <[email protected]>
Signed-off-by: Michael Wang <[email protected]>
---
drivers/infiniband/core/device.c | 1 +
drivers/infiniband/core/verbs.c | 4 +++-
drivers/infiniband/hw/amso1100/c2_provider.c | 7 +++++++
drivers/infiniband/hw/cxgb3/iwch_provider.c | 7 +++++++
drivers/infiniband/hw/cxgb4/provider.c | 7 +++++++
drivers/infiniband/hw/ehca/ehca_hca.c | 6 ++++++
drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +++
drivers/infiniband/hw/ehca/ehca_main.c | 1 +
drivers/infiniband/hw/ipath/ipath_verbs.c | 7 +++++++
drivers/infiniband/hw/mlx4/main.c | 10 ++++++++++
drivers/infiniband/hw/mlx5/main.c | 7 +++++++
drivers/infiniband/hw/mthca/mthca_provider.c | 7 +++++++
drivers/infiniband/hw/nes/nes_verbs.c | 6 ++++++
drivers/infiniband/hw/ocrdma/ocrdma_main.c | 1 +
drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 6 ++++++
drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3 +++
drivers/infiniband/hw/qib/qib_verbs.c | 7 +++++++
drivers/infiniband/hw/usnic/usnic_ib_main.c | 1 +
drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 6 ++++++
drivers/infiniband/hw/usnic/usnic_ib_verbs.h | 2 ++
include/rdma/ib_verbs.h | 7 ++++++-
21 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..a9587c4 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
} mandatory_table[] = {
IB_MANDATORY_FUNC(query_device),
IB_MANDATORY_FUNC(query_port),
+ IB_MANDATORY_FUNC(query_transport),
IB_MANDATORY_FUNC(query_pkey),
IB_MANDATORY_FUNC(query_gid),
IB_MANDATORY_FUNC(alloc_pd),
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index f93eb8d..83370de 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
if (device->get_link_layer)
return device->get_link_layer(device, port_num);

- switch (rdma_node_get_transport(device->node_type)) {
+ switch (device->query_transport(device, port_num)) {
case RDMA_TRANSPORT_IB:
+ case RDMA_TRANSPORT_IBOE:
return IB_LINK_LAYER_INFINIBAND;
case RDMA_TRANSPORT_IWARP:
case RDMA_TRANSPORT_USNIC:
case RDMA_TRANSPORT_USNIC_UDP:
return IB_LINK_LAYER_ETHERNET;
default:
+ BUG();
return IB_LINK_LAYER_UNSPECIFIED;
}
}
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index bdf3507..d46bbb0 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -99,6 +99,12 @@ static int c2_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+c2_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static int c2_query_pkey(struct ib_device *ibdev,
u8 port, u16 index, u16 * pkey)
{
@@ -801,6 +807,7 @@ int c2_register_device(struct c2_dev *dev)
dev->ibdev.dma_device = &dev->pcidev->dev;
dev->ibdev.query_device = c2_query_device;
dev->ibdev.query_port = c2_query_port;
+ dev->ibdev.query_transport = c2_query_transport;
dev->ibdev.query_pkey = c2_query_pkey;
dev->ibdev.query_gid = c2_query_gid;
dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 811b24a..09682e9e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -1232,6 +1232,12 @@ static int iwch_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+iwch_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -1385,6 +1391,7 @@ int iwch_register_device(struct iwch_dev *dev)
dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
dev->ibdev.query_device = iwch_query_device;
dev->ibdev.query_port = iwch_query_port;
+ dev->ibdev.query_transport = iwch_query_transport;
dev->ibdev.query_pkey = iwch_query_pkey;
dev->ibdev.query_gid = iwch_query_gid;
dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 66bd6a2..a445e0d 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -390,6 +390,12 @@ static int c4iw_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+static enum rdma_transport_type
+c4iw_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}
+
static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
char *buf)
{
@@ -506,6 +512,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
dev->ibdev.dma_device = &(dev->rdev.lldi.pdev->dev);
dev->ibdev.query_device = c4iw_query_device;
dev->ibdev.query_port = c4iw_query_port;
+ dev->ibdev.query_transport = c4iw_query_transport;
dev->ibdev.query_pkey = c4iw_query_pkey;
dev->ibdev.query_gid = c4iw_query_gid;
dev->ibdev.alloc_ucontext = c4iw_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
index 9ed4d25..d5a34a6 100644
--- a/drivers/infiniband/hw/ehca/ehca_hca.c
+++ b/drivers/infiniband/hw/ehca/ehca_hca.c
@@ -242,6 +242,12 @@ query_port1:
return ret;
}

+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
int ehca_query_sma_attr(struct ehca_shca *shca,
u8 port, struct ehca_sma_attr *attr)
{
diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
index 22f79af..cec945f 100644
--- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
+++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
@@ -49,6 +49,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props);
int ehca_query_port(struct ib_device *ibdev, u8 port,
struct ib_port_attr *props);

+enum rdma_transport_type
+ehca_query_transport(struct ib_device *device, u8 port_num);
+
int ehca_query_sma_attr(struct ehca_shca *shca, u8 port,
struct ehca_sma_attr *attr);

diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
index cd8d290..60e0a09 100644
--- a/drivers/infiniband/hw/ehca/ehca_main.c
+++ b/drivers/infiniband/hw/ehca/ehca_main.c
@@ -467,6 +467,7 @@ static int ehca_init_device(struct ehca_shca *shca)
shca->ib_device.dma_device = &shca->ofdev->dev;
shca->ib_device.query_device = ehca_query_device;
shca->ib_device.query_port = ehca_query_port;
+ shca->ib_device.query_transport = ehca_query_transport;
shca->ib_device.query_gid = ehca_query_gid;
shca->ib_device.query_pkey = ehca_query_pkey;
/* shca->in_device.modify_device = ehca_modify_device */
diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
index 44ea939..58d36e3 100644
--- a/drivers/infiniband/hw/ipath/ipath_verbs.c
+++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
@@ -1638,6 +1638,12 @@ static int ipath_query_port(struct ib_device *ibdev,
return 0;
}

+static enum rdma_transport_type
+ipath_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int ipath_modify_device(struct ib_device *device,
int device_modify_mask,
struct ib_device_modify *device_modify)
@@ -2140,6 +2146,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
dev->query_device = ipath_query_device;
dev->modify_device = ipath_modify_device;
dev->query_port = ipath_query_port;
+ dev->query_transport = ipath_query_transport;
dev->modify_port = ipath_modify_port;
dev->query_pkey = ipath_query_pkey;
dev->query_gid = ipath_query_gid;
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 0b280b1..28100bd 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -413,6 +413,15 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
return __mlx4_ib_query_port(ibdev, port, props, 0);
}

+static enum rdma_transport_type
+mlx4_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ struct mlx4_dev *dev = to_mdev(device)->dev;
+
+ return dev->caps.port_mask[port_num] == MLX4_PORT_TYPE_IB ?
+ RDMA_TRANSPORT_IB : RDMA_TRANSPORT_IBOE;
+}
+
int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
union ib_gid *gid, int netw_view)
{
@@ -2121,6 +2130,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)

ibdev->ib_dev.query_device = mlx4_ib_query_device;
ibdev->ib_dev.query_port = mlx4_ib_query_port;
+ ibdev->ib_dev.query_transport = mlx4_ib_query_transport;
ibdev->ib_dev.get_link_layer = mlx4_ib_port_link_layer;
ibdev->ib_dev.query_gid = mlx4_ib_query_gid;
ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey;
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index cc4ac1e..209c796 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -351,6 +351,12 @@ out:
return err;
}

+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int mlx5_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
union ib_gid *gid)
{
@@ -1336,6 +1342,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)

dev->ib_dev.query_device = mlx5_ib_query_device;
dev->ib_dev.query_port = mlx5_ib_query_port;
+ dev->ib_dev.query_transport = mlx5_ib_query_transport;
dev->ib_dev.query_gid = mlx5_ib_query_gid;
dev->ib_dev.query_pkey = mlx5_ib_query_pkey;
dev->ib_dev.modify_device = mlx5_ib_modify_device;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 415f8e1..67ac6a4 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -179,6 +179,12 @@ static int mthca_query_port(struct ib_device *ibdev,
return err;
}

+static enum rdma_transport_type
+mthca_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int mthca_modify_device(struct ib_device *ibdev,
int mask,
struct ib_device_modify *props)
@@ -1281,6 +1287,7 @@ int mthca_register_device(struct mthca_dev *dev)
dev->ib_dev.dma_device = &dev->pdev->dev;
dev->ib_dev.query_device = mthca_query_device;
dev->ib_dev.query_port = mthca_query_port;
+ dev->ib_dev.query_transport = mthca_query_transport;
dev->ib_dev.modify_device = mthca_modify_device;
dev->ib_dev.modify_port = mthca_modify_port;
dev->ib_dev.query_pkey = mthca_query_pkey;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index c0d0296..8df5b61 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -606,6 +606,11 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
return 0;
}

+static enum rdma_transport_type
+nes_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IWARP;
+}

/**
* nes_query_pkey
@@ -3879,6 +3884,7 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
nesibdev->ibdev.dev.parent = &nesdev->pcidev->dev;
nesibdev->ibdev.query_device = nes_query_device;
nesibdev->ibdev.query_port = nes_query_port;
+ nesibdev->ibdev.query_transport = nes_query_transport;
nesibdev->ibdev.query_pkey = nes_query_pkey;
nesibdev->ibdev.query_gid = nes_query_gid;
nesibdev->ibdev.alloc_ucontext = nes_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 7a2b59a..9f4d182 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -244,6 +244,7 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
/* mandatory verbs. */
dev->ibdev.query_device = ocrdma_query_device;
dev->ibdev.query_port = ocrdma_query_port;
+ dev->ibdev.query_transport = ocrdma_query_transport;
dev->ibdev.modify_port = ocrdma_modify_port;
dev->ibdev.query_gid = ocrdma_query_gid;
dev->ibdev.get_link_layer = ocrdma_link_layer;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 8771755..73bace4 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -187,6 +187,12 @@ int ocrdma_query_port(struct ib_device *ibdev,
return 0;
}

+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IBOE;
+}
+
int ocrdma_modify_port(struct ib_device *ibdev, u8 port, int mask,
struct ib_port_modify *props)
{
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
index b8f7853..4a81b63 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
@@ -41,6 +41,9 @@ int ocrdma_query_port(struct ib_device *, u8 port, struct ib_port_attr *props);
int ocrdma_modify_port(struct ib_device *, u8 port, int mask,
struct ib_port_modify *props);

+enum rdma_transport_type
+ocrdma_query_transport(struct ib_device *device, u8 port_num);
+
void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
int ocrdma_query_gid(struct ib_device *, u8 port,
int index, union ib_gid *gid);
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index 4a35998..caad665 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -1650,6 +1650,12 @@ static int qib_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+static enum rdma_transport_type
+qib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_IB;
+}
+
static int qib_modify_device(struct ib_device *device,
int device_modify_mask,
struct ib_device_modify *device_modify)
@@ -2184,6 +2190,7 @@ int qib_register_ib_device(struct qib_devdata *dd)
ibdev->query_device = qib_query_device;
ibdev->modify_device = qib_modify_device;
ibdev->query_port = qib_query_port;
+ ibdev->query_transport = qib_query_transport;
ibdev->modify_port = qib_modify_port;
ibdev->query_pkey = qib_query_pkey;
ibdev->query_gid = qib_query_gid;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c b/drivers/infiniband/hw/usnic/usnic_ib_main.c
index 0d0f986..03ea9f3 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
@@ -360,6 +360,7 @@ static void *usnic_ib_device_add(struct pci_dev *dev)

us_ibdev->ib_dev.query_device = usnic_ib_query_device;
us_ibdev->ib_dev.query_port = usnic_ib_query_port;
+ us_ibdev->ib_dev.query_transport = usnic_ib_query_transport;
us_ibdev->ib_dev.query_pkey = usnic_ib_query_pkey;
us_ibdev->ib_dev.query_gid = usnic_ib_query_gid;
us_ibdev->ib_dev.get_link_layer = usnic_ib_port_link_layer;
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index 53bd6a2..ff9a5f7 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -348,6 +348,12 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
return 0;
}

+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+ return RDMA_TRANSPORT_USNIC_UDP;
+}
+
int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr)
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
index bb864f5..0b1633b 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
@@ -27,6 +27,8 @@ int usnic_ib_query_device(struct ib_device *ibdev,
struct ib_device_attr *props);
int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
struct ib_port_attr *props);
+enum rdma_transport_type
+usnic_ib_query_transport(struct ib_device *device, u8 port_num);
int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
int qp_attr_mask,
struct ib_qp_init_attr *qp_init_attr);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d54f91e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,10 +75,13 @@ enum rdma_node_type {
};

enum rdma_transport_type {
+ /* legacy for users */
RDMA_TRANSPORT_IB,
RDMA_TRANSPORT_IWARP,
RDMA_TRANSPORT_USNIC,
- RDMA_TRANSPORT_USNIC_UDP
+ RDMA_TRANSPORT_USNIC_UDP,
+ /* new transport */
+ RDMA_TRANSPORT_IBOE,
};

__attribute_const__ enum rdma_transport_type
@@ -1501,6 +1504,8 @@ struct ib_device {
int (*query_port)(struct ib_device *device,
u8 port_num,
struct ib_port_attr *port_attr);
+ enum rdma_transport_type (*query_transport)(struct ib_device *device,
+ u8 port_num);
enum rdma_link_layer (*get_link_layer)(struct ib_device *device,
u8 port_num);
int (*query_gid)(struct ib_device *device,
--
2.1.0

2015-04-07 12:44:23

by Michael Wang

Subject: Re: [PATCH 01/17] IB/Verbs: Implement new callback query_transport() for each HW

V2 sent out, please ignore this one, my apologies.

Regards,
Michael Wang

On 04/07/2015 02:28 PM, Michael Wang wrote:
>
> Add new callback query_transport() and implement for each HW.
>
> Mapping List:
> driver    node-type   link-layer  old-transport  new-transport
> nes       RNIC        ETH         IWARP          IWARP
> amso1100  RNIC        ETH         IWARP          IWARP
> cxgb3     RNIC        ETH         IWARP          IWARP
> cxgb4     RNIC        ETH         IWARP          IWARP
> usnic     USNIC_UDP   ETH         USNIC_UDP      USNIC_UDP
> ocrdma    IB_CA       ETH         IB             IBOE
> mlx4      IB_CA       IB/ETH      IB             IB/IBOE
> mlx5      IB_CA       IB          IB             IB
> ehca      IB_CA       IB          IB             IB
> ipath     IB_CA       IB          IB             IB
> mthca     IB_CA       IB          IB             IB
> qib       IB_CA       IB          IB             IB
>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Doug Ledford <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Sean Hefty <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> drivers/infiniband/core/device.c | 1 +
> drivers/infiniband/core/verbs.c | 4 +++-
> drivers/infiniband/hw/amso1100/c2_provider.c | 7 +++++++
> drivers/infiniband/hw/cxgb3/iwch_provider.c | 7 +++++++
> drivers/infiniband/hw/cxgb4/provider.c | 7 +++++++
> drivers/infiniband/hw/ehca/ehca_hca.c | 6 ++++++
> drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +++
> drivers/infiniband/hw/ehca/ehca_main.c | 1 +
> drivers/infiniband/hw/ipath/ipath_verbs.c | 7 +++++++
> drivers/infiniband/hw/mlx4/main.c | 10 ++++++++++
> drivers/infiniband/hw/mlx5/main.c | 7 +++++++
> drivers/infiniband/hw/mthca/mthca_provider.c | 7 +++++++
> drivers/infiniband/hw/nes/nes_verbs.c | 6 ++++++
> drivers/infiniband/hw/ocrdma/ocrdma_main.c | 1 +
> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 6 ++++++
> drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3 +++
> drivers/infiniband/hw/qib/qib_verbs.c | 7 +++++++
> drivers/infiniband/hw/usnic/usnic_ib_main.c | 1 +
> drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 6 ++++++
> drivers/infiniband/hw/usnic/usnic_ib_verbs.h | 2 ++
> include/rdma/ib_verbs.h | 7 ++++++-
> 21 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 18c1ece..a9587c4 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
> } mandatory_table[] = {
> IB_MANDATORY_FUNC(query_device),
> IB_MANDATORY_FUNC(query_port),
> + IB_MANDATORY_FUNC(query_transport),
> IB_MANDATORY_FUNC(query_pkey),
> IB_MANDATORY_FUNC(query_gid),
> IB_MANDATORY_FUNC(alloc_pd),
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index f93eb8d..83370de 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
> if (device->get_link_layer)
> return device->get_link_layer(device, port_num);
>
> - switch (rdma_node_get_transport(device->node_type)) {
> + switch (device->query_transport(device, port_num)) {
> case RDMA_TRANSPORT_IB:
> + case RDMA_TRANSPORT_IBOE:
> return IB_LINK_LAYER_INFINIBAND;
> case RDMA_TRANSPORT_IWARP:
> case RDMA_TRANSPORT_USNIC:
> case RDMA_TRANSPORT_USNIC_UDP:
> return IB_LINK_LAYER_ETHERNET;
> default:
> + BUG();
> return IB_LINK_LAYER_UNSPECIFIED;
> }
> }
> diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
> index bdf3507..d46bbb0 100644
> --- a/drivers/infiniband/hw/amso1100/c2_provider.c
> +++ b/drivers/infiniband/hw/amso1100/c2_provider.c
> @@ -99,6 +99,12 @@ static int c2_query_port(struct ib_device *ibdev,
> return 0;
> }
>
> +static enum rdma_transport_type
> +c2_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IWARP;
> +}
> +
> static int c2_query_pkey(struct ib_device *ibdev,
> u8 port, u16 index, u16 * pkey)
> {
> @@ -801,6 +807,7 @@ int c2_register_device(struct c2_dev *dev)
> dev->ibdev.dma_device = &dev->pcidev->dev;
> dev->ibdev.query_device = c2_query_device;
> dev->ibdev.query_port = c2_query_port;
> + dev->ibdev.query_transport = c2_query_transport;
> dev->ibdev.query_pkey = c2_query_pkey;
> dev->ibdev.query_gid = c2_query_gid;
> dev->ibdev.alloc_ucontext = c2_alloc_ucontext;
> diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
> index 811b24a..09682e9e 100644
> --- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
> +++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
> @@ -1232,6 +1232,12 @@ static int iwch_query_port(struct ib_device *ibdev,
> return 0;
> }
>
> +static enum rdma_transport_type
> +iwch_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IWARP;
> +}
> +
> static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -1385,6 +1391,7 @@ int iwch_register_device(struct iwch_dev *dev)
> dev->ibdev.dma_device = &(dev->rdev.rnic_info.pdev->dev);
> dev->ibdev.query_device = iwch_query_device;
> dev->ibdev.query_port = iwch_query_port;
> + dev->ibdev.query_transport = iwch_query_transport;
> dev->ibdev.query_pkey = iwch_query_pkey;
> dev->ibdev.query_gid = iwch_query_gid;
> dev->ibdev.alloc_ucontext = iwch_alloc_ucontext;
> diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
> index 66bd6a2..a445e0d 100644
> --- a/drivers/infiniband/hw/cxgb4/provider.c
> +++ b/drivers/infiniband/hw/cxgb4/provider.c
> @@ -390,6 +390,12 @@ static int c4iw_query_port(struct ib_device *ibdev, u8 port,
> return 0;
> }
>
> +static enum rdma_transport_type
> +c4iw_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IWARP;
> +}
> +
> static ssize_t show_rev(struct device *dev, struct device_attribute *attr,
> char *buf)
> {
> @@ -506,6 +512,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
> dev->ibdev.dma_device = &(dev->rdev.lldi.pdev->dev);
> dev->ibdev.query_device = c4iw_query_device;
> dev->ibdev.query_port = c4iw_query_port;
> + dev->ibdev.query_transport = c4iw_query_transport;
> dev->ibdev.query_pkey = c4iw_query_pkey;
> dev->ibdev.query_gid = c4iw_query_gid;
> dev->ibdev.alloc_ucontext = c4iw_alloc_ucontext;
> diff --git a/drivers/infiniband/hw/ehca/ehca_hca.c b/drivers/infiniband/hw/ehca/ehca_hca.c
> index 9ed4d25..d5a34a6 100644
> --- a/drivers/infiniband/hw/ehca/ehca_hca.c
> +++ b/drivers/infiniband/hw/ehca/ehca_hca.c
> @@ -242,6 +242,12 @@ query_port1:
> return ret;
> }
>
> +enum rdma_transport_type
> +ehca_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IB;
> +}
> +
> int ehca_query_sma_attr(struct ehca_shca *shca,
> u8 port, struct ehca_sma_attr *attr)
> {
> diff --git a/drivers/infiniband/hw/ehca/ehca_iverbs.h b/drivers/infiniband/hw/ehca/ehca_iverbs.h
> index 22f79af..cec945f 100644
> --- a/drivers/infiniband/hw/ehca/ehca_iverbs.h
> +++ b/drivers/infiniband/hw/ehca/ehca_iverbs.h
> @@ -49,6 +49,9 @@ int ehca_query_device(struct ib_device *ibdev, struct ib_device_attr *props);
> int ehca_query_port(struct ib_device *ibdev, u8 port,
> struct ib_port_attr *props);
>
> +enum rdma_transport_type
> +ehca_query_transport(struct ib_device *device, u8 port_num);
> +
> int ehca_query_sma_attr(struct ehca_shca *shca, u8 port,
> struct ehca_sma_attr *attr);
>
> diff --git a/drivers/infiniband/hw/ehca/ehca_main.c b/drivers/infiniband/hw/ehca/ehca_main.c
> index cd8d290..60e0a09 100644
> --- a/drivers/infiniband/hw/ehca/ehca_main.c
> +++ b/drivers/infiniband/hw/ehca/ehca_main.c
> @@ -467,6 +467,7 @@ static int ehca_init_device(struct ehca_shca *shca)
> shca->ib_device.dma_device = &shca->ofdev->dev;
> shca->ib_device.query_device = ehca_query_device;
> shca->ib_device.query_port = ehca_query_port;
> + shca->ib_device.query_transport = ehca_query_transport;
> shca->ib_device.query_gid = ehca_query_gid;
> shca->ib_device.query_pkey = ehca_query_pkey;
> /* shca->in_device.modify_device = ehca_modify_device */
> diff --git a/drivers/infiniband/hw/ipath/ipath_verbs.c b/drivers/infiniband/hw/ipath/ipath_verbs.c
> index 44ea939..58d36e3 100644
> --- a/drivers/infiniband/hw/ipath/ipath_verbs.c
> +++ b/drivers/infiniband/hw/ipath/ipath_verbs.c
> @@ -1638,6 +1638,12 @@ static int ipath_query_port(struct ib_device *ibdev,
> return 0;
> }
>
> +static enum rdma_transport_type
> +ipath_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IB;
> +}
> +
> static int ipath_modify_device(struct ib_device *device,
> int device_modify_mask,
> struct ib_device_modify *device_modify)
> @@ -2140,6 +2146,7 @@ int ipath_register_ib_device(struct ipath_devdata *dd)
> dev->query_device = ipath_query_device;
> dev->modify_device = ipath_modify_device;
> dev->query_port = ipath_query_port;
> + dev->query_transport = ipath_query_transport;
> dev->modify_port = ipath_modify_port;
> dev->query_pkey = ipath_query_pkey;
> dev->query_gid = ipath_query_gid;
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 0b280b1..28100bd 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -413,6 +413,15 @@ static int mlx4_ib_query_port(struct ib_device *ibdev, u8 port,
> return __mlx4_ib_query_port(ibdev, port, props, 0);
> }
>
> +static enum rdma_transport_type
> +mlx4_ib_query_transport(struct ib_device *device, u8 port_num)
> +{
> + struct mlx4_dev *dev = to_mdev(device)->dev;
> +
> + return dev->caps.port_mask[port_num] == MLX4_PORT_TYPE_IB ?
> + RDMA_TRANSPORT_IB : RDMA_TRANSPORT_IBOE;
> +}
> +
> int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
> union ib_gid *gid, int netw_view)
> {
> @@ -2121,6 +2130,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
>
> ibdev->ib_dev.query_device = mlx4_ib_query_device;
> ibdev->ib_dev.query_port = mlx4_ib_query_port;
> + ibdev->ib_dev.query_transport = mlx4_ib_query_transport;
> ibdev->ib_dev.get_link_layer = mlx4_ib_port_link_layer;
> ibdev->ib_dev.query_gid = mlx4_ib_query_gid;
> ibdev->ib_dev.query_pkey = mlx4_ib_query_pkey;
> diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
> index cc4ac1e..209c796 100644
> --- a/drivers/infiniband/hw/mlx5/main.c
> +++ b/drivers/infiniband/hw/mlx5/main.c
> @@ -351,6 +351,12 @@ out:
> return err;
> }
>
> +static enum rdma_transport_type
> +mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IB;
> +}
> +
> static int mlx5_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
> union ib_gid *gid)
> {
> @@ -1336,6 +1342,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
>
> dev->ib_dev.query_device = mlx5_ib_query_device;
> dev->ib_dev.query_port = mlx5_ib_query_port;
> + dev->ib_dev.query_transport = mlx5_ib_query_transport;
> dev->ib_dev.query_gid = mlx5_ib_query_gid;
> dev->ib_dev.query_pkey = mlx5_ib_query_pkey;
> dev->ib_dev.modify_device = mlx5_ib_modify_device;
> diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
> index 415f8e1..67ac6a4 100644
> --- a/drivers/infiniband/hw/mthca/mthca_provider.c
> +++ b/drivers/infiniband/hw/mthca/mthca_provider.c
> @@ -179,6 +179,12 @@ static int mthca_query_port(struct ib_device *ibdev,
> return err;
> }
>
> +static enum rdma_transport_type
> +mthca_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IB;
> +}
> +
> static int mthca_modify_device(struct ib_device *ibdev,
> int mask,
> struct ib_device_modify *props)
> @@ -1281,6 +1287,7 @@ int mthca_register_device(struct mthca_dev *dev)
> dev->ib_dev.dma_device = &dev->pdev->dev;
> dev->ib_dev.query_device = mthca_query_device;
> dev->ib_dev.query_port = mthca_query_port;
> + dev->ib_dev.query_transport = mthca_query_transport;
> dev->ib_dev.modify_device = mthca_modify_device;
> dev->ib_dev.modify_port = mthca_modify_port;
> dev->ib_dev.query_pkey = mthca_query_pkey;
> diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
> index c0d0296..8df5b61 100644
> --- a/drivers/infiniband/hw/nes/nes_verbs.c
> +++ b/drivers/infiniband/hw/nes/nes_verbs.c
> @@ -606,6 +606,11 @@ static int nes_query_port(struct ib_device *ibdev, u8 port, struct ib_port_attr
> return 0;
> }
>
> +static enum rdma_transport_type
> +nes_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IWARP;
> +}
>
> /**
> * nes_query_pkey
> @@ -3879,6 +3884,7 @@ struct nes_ib_device *nes_init_ofa_device(struct net_device *netdev)
> nesibdev->ibdev.dev.parent = &nesdev->pcidev->dev;
> nesibdev->ibdev.query_device = nes_query_device;
> nesibdev->ibdev.query_port = nes_query_port;
> + nesibdev->ibdev.query_transport = nes_query_transport;
> nesibdev->ibdev.query_pkey = nes_query_pkey;
> nesibdev->ibdev.query_gid = nes_query_gid;
> nesibdev->ibdev.alloc_ucontext = nes_alloc_ucontext;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
> index 7a2b59a..9f4d182 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
> @@ -244,6 +244,7 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
> /* mandatory verbs. */
> dev->ibdev.query_device = ocrdma_query_device;
> dev->ibdev.query_port = ocrdma_query_port;
> + dev->ibdev.query_transport = ocrdma_query_transport;
> dev->ibdev.modify_port = ocrdma_modify_port;
> dev->ibdev.query_gid = ocrdma_query_gid;
> dev->ibdev.get_link_layer = ocrdma_link_layer;
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> index 8771755..73bace4 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
> @@ -187,6 +187,12 @@ int ocrdma_query_port(struct ib_device *ibdev,
> return 0;
> }
>
> +enum rdma_transport_type
> +ocrdma_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IBOE;
> +}
> +
> int ocrdma_modify_port(struct ib_device *ibdev, u8 port, int mask,
> struct ib_port_modify *props)
> {
> diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> index b8f7853..4a81b63 100644
> --- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> +++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.h
> @@ -41,6 +41,9 @@ int ocrdma_query_port(struct ib_device *, u8 port, struct ib_port_attr *props);
> int ocrdma_modify_port(struct ib_device *, u8 port, int mask,
> struct ib_port_modify *props);
>
> +enum rdma_transport_type
> +ocrdma_query_transport(struct ib_device *device, u8 port_num);
> +
> void ocrdma_get_guid(struct ocrdma_dev *, u8 *guid);
> int ocrdma_query_gid(struct ib_device *, u8 port,
> int index, union ib_gid *gid);
> diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
> index 4a35998..caad665 100644
> --- a/drivers/infiniband/hw/qib/qib_verbs.c
> +++ b/drivers/infiniband/hw/qib/qib_verbs.c
> @@ -1650,6 +1650,12 @@ static int qib_query_port(struct ib_device *ibdev, u8 port,
> return 0;
> }
>
> +static enum rdma_transport_type
> +qib_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_IB;
> +}
> +
> static int qib_modify_device(struct ib_device *device,
> int device_modify_mask,
> struct ib_device_modify *device_modify)
> @@ -2184,6 +2190,7 @@ int qib_register_ib_device(struct qib_devdata *dd)
> ibdev->query_device = qib_query_device;
> ibdev->modify_device = qib_modify_device;
> ibdev->query_port = qib_query_port;
> + ibdev->query_transport = qib_query_transport;
> ibdev->modify_port = qib_modify_port;
> ibdev->query_pkey = qib_query_pkey;
> ibdev->query_gid = qib_query_gid;
> diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c b/drivers/infiniband/hw/usnic/usnic_ib_main.c
> index 0d0f986..03ea9f3 100644
> --- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
> +++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
> @@ -360,6 +360,7 @@ static void *usnic_ib_device_add(struct pci_dev *dev)
>
> us_ibdev->ib_dev.query_device = usnic_ib_query_device;
> us_ibdev->ib_dev.query_port = usnic_ib_query_port;
> + us_ibdev->ib_dev.query_transport = usnic_ib_query_transport;
> us_ibdev->ib_dev.query_pkey = usnic_ib_query_pkey;
> us_ibdev->ib_dev.query_gid = usnic_ib_query_gid;
> us_ibdev->ib_dev.get_link_layer = usnic_ib_port_link_layer;
> diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> index 53bd6a2..ff9a5f7 100644
> --- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> +++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
> @@ -348,6 +348,12 @@ int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
> return 0;
> }
>
> +enum rdma_transport_type
> +usnic_ib_query_transport(struct ib_device *device, u8 port_num)
> +{
> + return RDMA_TRANSPORT_USNIC_UDP;
> +}
> +
> int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> int qp_attr_mask,
> struct ib_qp_init_attr *qp_init_attr)
> diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> index bb864f5..0b1633b 100644
> --- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> +++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.h
> @@ -27,6 +27,8 @@ int usnic_ib_query_device(struct ib_device *ibdev,
> struct ib_device_attr *props);
> int usnic_ib_query_port(struct ib_device *ibdev, u8 port,
> struct ib_port_attr *props);
> +enum rdma_transport_type
> +usnic_ib_query_transport(struct ib_device *device, u8 port_num);
> int usnic_ib_query_qp(struct ib_qp *qp, struct ib_qp_attr *qp_attr,
> int qp_attr_mask,
> struct ib_qp_init_attr *qp_init_attr);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..d54f91e 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -75,10 +75,13 @@ enum rdma_node_type {
> };
>
> enum rdma_transport_type {
> + /* legacy for users */
> RDMA_TRANSPORT_IB,
> RDMA_TRANSPORT_IWARP,
> RDMA_TRANSPORT_USNIC,
> - RDMA_TRANSPORT_USNIC_UDP
> + RDMA_TRANSPORT_USNIC_UDP,
> + /* new transport */
> + RDMA_TRANSPORT_IBOE,
> };
>
> __attribute_const__ enum rdma_transport_type
> @@ -1501,6 +1504,8 @@ struct ib_device {
> int (*query_port)(struct ib_device *device,
> u8 port_num,
> struct ib_port_attr *port_attr);
> + enum rdma_transport_type (*query_transport)(struct ib_device *device,
> + u8 port_num);
> enum rdma_link_layer (*get_link_layer)(struct ib_device *device,
> u8 port_num);
> int (*query_gid)(struct ib_device *device,
>

2015-04-07 15:54:27

by Tom Talpey

Subject: Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()

On 4/7/2015 8:34 AM, Michael Wang wrote:
> /**
> + * cap_read_multi_sge - Check if the port of device has the capability
> + * RDMA Read Multiple Scatter-Gather Entries.
> + *
> + * @device: Device to be checked
> + * @port_num: Port number of the device
> + *
> + * Return 0 when port of the device don't support
> + * RDMA Read Multiple Scatter-Gather Entries.
> + */
> +static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
> +{
> + return !rdma_transport_iwarp(device, port_num);
> +}

This just papers over the issue we discussed earlier. How *many*
entries does the device support? If a device supports one, or two,
is that enough? How does the upper layer know the limit?

This needs an explicit device attribute, to be fixed properly.
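
To make the idea of an explicit attribute concrete, here is a minimal sketch in plain C. Everything in it is hypothetical: struct sge_dev_attr, its max_sge_rd field and sge_read_limit() are illustrative stand-ins and not existing verbs API. The sketch assumes each driver would report its own RDMA Read SGE limit.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in; not the kernel's device attributes. */
struct sge_dev_attr {
	int max_sge;	/* SGE limit for SEND/WRITE work requests */
	int max_sge_rd;	/* assumed new field: SGE limit for RDMA READ */
};

/*
 * Clamp the number of SGEs the upper layer wants to use for one
 * RDMA Read to what the device says it can handle.
 */
int sge_read_limit(const struct sge_dev_attr *attr, int sge_count)
{
	/* A device that can post reads at all supports at least one SGE. */
	assert(attr->max_sge_rd >= 1);

	return sge_count < attr->max_sge_rd ? sge_count : attr->max_sge_rd;
}
```

With max_sge_rd = 1 this reproduces the single-SGE read case, and with max_sge_rd equal to max_sge it reproduces the case where reads share the general limit. A device that supports two or three read SGEs could be described as well, which a yes/no helper cannot express.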

> +
> +/**
> * cap_ipoib - Check if the port of device has the capability
> * IP over Infiniband.
> *
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index e011027..604d035 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -118,8 +118,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>
> static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
> {
> - if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
> - RDMA_TRANSPORT_IWARP)
> + if (!cap_read_multi_sge(xprt->sc_cm_id->device,
> + xprt->sc_cm_id->port_num))
> return 1;
> else
> return min_t(int, sge_count, xprt->sc_max_sge);

This is incorrect. The RDMA Read max is not at all the same as the
max_sge. It is a different operation, with a different set of work
request parameters.

In other words, the above same comment applies.


> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 4e61880..e75175d 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -979,8 +979,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> /*
> * Determine if a DMA MR is required and if so, what privs are required
> */
> - switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
> - case RDMA_TRANSPORT_IWARP:
> + if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
> + newxprt->sc_cm_id->port_num)) {
> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;

Do I read this correctly that it is forcing the "read with invalidate"
capability to "on" for all iWARP devices? I don't think that is correct,
for the legacy devices you're also supporting.


> @@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
> } else
> need_dma_mr = 0;
> - break;
> - case RDMA_TRANSPORT_IB:
> + } else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
> + newxprt->sc_cm_id->port_num)) {
> if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
> need_dma_mr = 1;
> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;

Now I'm even more confused. How is the presence of IB management
related to needing a privileged lmr?

2015-04-07 16:05:27

by Michael Wang

Subject: Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()

Hi, Tom

Thanks for the comments :-)

On 04/07/2015 05:46 PM, Tom Talpey wrote:
> On 4/7/2015 8:34 AM, Michael Wang wrote:
>> /**
>> + * cap_read_multi_sge - Check if the port of device has the capability
>> + * RDMA Read Multiple Scatter-Gather Entries.
>> + *
>> + * @device: Device to be checked
>> + * @port_num: Port number of the device
>> + *
>> + * Return 0 when port of the device don't support
>> + * RDMA Read Multiple Scatter-Gather Entries.
>> + */
>> +static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
>> +{
>> + return !rdma_transport_iwarp(device, port_num);
>> +}
>
> This just papers over the issue we discussed earlier. How *many*
> entries does the device support? If a device supports one, or two,
> is that enough? How does the upper layer know the limit?
>
> This needs an explicit device attribute, to be fixed properly.

This is a prototype meant to expose the problem we have here. I
would prefer that someone familiar with this part extend the API in
the future, based on the right logic.

For now this just inherits the legacy behavior; it is implemented
this way in order to stay compatible with the current code.

>
>> +
>> +/**
>> * cap_ipoib - Check if the port of device has the capability
>> * IP over Infiniband.
>> *
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index e011027..604d035 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -118,8 +118,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>>
>> static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
>> {
>> - if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
>> - RDMA_TRANSPORT_IWARP)
>> + if (!cap_read_multi_sge(xprt->sc_cm_id->device,
>> + xprt->sc_cm_id->port_num))
>> return 1;
>> else
>> return min_t(int, sge_count, xprt->sc_max_sge);
>
> This is incorrect. The RDMA Read max is not at all the same as the
> max_sge. It is a different operation, with a different set of work
> request parameters.
>
> In other words, the above same comment applies.

Any idea on how to improve this part?

Again, all these helpers just inherit the old logic; if
it's wrong, let's correct it ;-)

And if we don't know how to correct it, we can leave this as a
signpost and wait for someone familiar with this particular part
to fix it.

>
>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 4e61880..e75175d 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -979,8 +979,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> /*
>> * Determine if a DMA MR is required and if so, what privs are required
>> */
>> - switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
>> - case RDMA_TRANSPORT_IWARP:
>> + if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
>> + newxprt->sc_cm_id->port_num)) {
>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
>
> Do I read this correctly that it is forcing the "read with invalidate"
> capability to "on" for all iWARP devices? I don't think that is correct,
> for the legacy devices you're also supporting.

Hmm.. but that's exactly the same as the old logic, correct?
Or do you mean the old logic is wrong?

>
>
>> @@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>> } else
>> need_dma_mr = 0;
>> - break;
>> - case RDMA_TRANSPORT_IB:
>> + } else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
>> + newxprt->sc_cm_id->port_num)) {
>> if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
>> need_dma_mr = 1;
>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>
> Now I'm even more confused. How is the presence of IB management
> related to needing a privileged lmr?

I think you actually mean we need one more wrapper here
with the right name, correct?

I'm not good at this part, any suggestions?

Regards,
Michael Wang

>
>

2015-04-07 17:26:34

by Jason Gunthorpe

Subject: Re: [PATCH v2 03/17] IB/Verbs: Use management helper cap_ib_mad() for mad-check

On Tue, Apr 07, 2015 at 02:30:22PM +0200, Michael Wang wrote:

> - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
> - return;
> -
> if (device->node_type == RDMA_NODE_IB_SWITCH) {
> start = 0;
> end = 0;
> @@ -3069,6 +3066,9 @@ static void ib_mad_init_device(struct ib_device *device)
> }
>
> for (i = start; i <= end; i++) {
> + if (!cap_ib_mad(device, i))
> + continue;
> +

I would prefer to see these changes in control flow as dedicated
patches, at the top of your patch stack.

For this kind of work a patch should contain mechanical changes only;
it is easier to review that way.

Same comment applies throughout.
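
A small sketch of why the hunk above is more than a mechanical substitution. The types and the mad_* names are hypothetical stand-ins for illustration only, and the per-port predicate is assumed to accept IB and IBoE ports; it is not the series' actual cap_ib_mad().

```c
#include <assert.h>

enum mad_transport { MAD_T_IB, MAD_T_IBOE, MAD_T_IWARP };

#define MAD_MAX_PORTS 2

/* Hypothetical device model: one device-wide answer, one answer per port. */
struct mad_device {
	enum mad_transport node_transport;		/* implied by the node type */
	enum mad_transport port_transport[MAD_MAX_PORTS]; /* ports are 1-based */
	int nports;
};

/* Assumed per-port predicate: the port speaks IB or IBoE. */
int mad_port_capable(const struct mad_device *dev, int port)
{
	assert(port >= 1 && port <= dev->nports);
	return dev->port_transport[port - 1] != MAD_T_IWARP;
}

/* Old shape: one device-wide test, then every port is initialised. */
int mad_init_old(const struct mad_device *dev)
{
	int port, opened = 0;

	if (dev->node_transport != MAD_T_IB)
		return 0;
	for (port = 1; port <= dev->nports; port++)
		opened++;
	return opened;
}

/* New shape: the test moves into the loop and is asked per port. */
int mad_init_new(const struct mad_device *dev)
{
	int port, opened = 0;

	for (port = 1; port <= dev->nports; port++) {
		if (!mad_port_capable(dev, port))
			continue;
		opened++;
	}
	return opened;
}
```

Both shapes agree as long as all ports of a device give the same answer. They differ for a (hypothetical) device whose ports disagree, which is why a change like this is easier to review as its own patch.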

Jason

2015-04-07 17:42:42

by Jason Gunthorpe

Subject: Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()

On Tue, Apr 07, 2015 at 11:46:57AM -0400, Tom Talpey wrote:
> On 4/7/2015 8:34 AM, Michael Wang wrote:
> > /**
> >+ * cap_read_multi_sge - Check if the port of device has the capability
> >+ * RDMA Read Multiple Scatter-Gather Entries.
> >+ *
> >+ * @device: Device to be checked
> >+ * @port_num: Port number of the device
> >+ *
> >+ * Return 0 when port of the device don't support
> >+ * RDMA Read Multiple Scatter-Gather Entries.
> >+ */
> >+static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
> >+{
> >+ return !rdma_transport_iwarp(device, port_num);
> >+}
>
> This just papers over the issue we discussed earlier. How *many*
> entries does the device support? If a device supports one, or two,
> is that enough? How does the upper layer know the limit?

I think Michael is fine to just make this one mechanical change.

The kernel only supports two kinds of devices today, ones with 1 read
SGE and ones where READ SGE == WRITE SGE == SEND SGE.

If someone makes another variation then it is up to them to propose a
better fix.


> > static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
> > {
> >- if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
> >- RDMA_TRANSPORT_IWARP)
> >+ if (!cap_read_multi_sge(xprt->sc_cm_id->device,
> >+ xprt->sc_cm_id->port_num))
> > return 1;
> > else
> > return min_t(int, sge_count, xprt->sc_max_sge);
>
> This is incorrect. The RDMA Read max is not at all the same as the
> max_sge. It is a different operation, with a different set of work
> request parameters.

The algorithm looks OK to me,

newxprt->sc_max_sge = min((size_t)devattr.max_sge,
(size_t)RPCSVC_MAXPAGES);

So it returns 1 or the number of sge entries per WR, and max_sge is
for READ/WRITE/SEND in every case except when cap_read_multi_sge == 0

> > /*
> > * Determine if a DMA MR is required and if so, what privs are required
> > */
> >- switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
> >- case RDMA_TRANSPORT_IWARP:
> >+ if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
> >+ newxprt->sc_cm_id->port_num)) {
> > newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
>
> Do I read this correctly that it is forcing the "read with invalidate"
> capability to "on" for all iWARP devices? I don't think that is correct,
> for the legacy devices you're also supporting.

No idea here, this logic was added in:

commit 3a5c63803d0552a3ad93b85c262f12cd86471443
Author: Tom Tucker <[email protected]>
Date: Tue Sep 30 13:46:13 2008 -0500

svcrdma: Query device for Fast Reg support during connection setup

Query the device capabilities in the svc_rdma_accept function to determine
what advanced memory management capabilities are supported by the device.
Based on the query, select the most secure model available given the
requirements of the transport and capabilities of the adapter.

Signed-off-by: Tom Tucker <[email protected]>

> >@@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> > dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
> > } else
> > need_dma_mr = 0;
> >- break;
> >- case RDMA_TRANSPORT_IB:
> >+ } else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
> >+ newxprt->sc_cm_id->port_num)) {
> > if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
> > need_dma_mr = 1;
> > dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>
> Now I'm even more confused. How is the presence of IB management
> related to needing a privileged lmr?

Agreed, this needs to be something else.

I think the test is probably based on this comment:

* NB: iWARP requires remote write access for the data sink
* of an RDMA_READ. IB does not.

So the if should be:

if (cap_rdma_read_needs_write(..) &&
!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
need_dma_mr = 1;
dma_mr_acc =
(IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_WRITE);

And the identical if blocks merged.

Plus the
if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
newxprt->sc_cm_id->port_num))
newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV

Jason

2015-04-07 18:40:24

by Hefty, Sean

Subject: RE: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index f704254..4e61104 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> ah_attr->port_num = port_num;
> ah_attr->static_rate = rec->rate;
>
> - force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
> + force_grh = !rdma_transport_ib(device, port_num);
>
> if (rec->hop_limit > 1 || force_grh) {
> ah_attr->ah_flags = IB_AH_GRH;
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 83370de..ca06f76 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -200,11 +200,9 @@ int ib_init_ah_from_wc(struct ib_device *device, u8 port_num, struct ib_wc *wc,
> u32 flow_class;
> u16 gid_index;
> int ret;
> - int is_eth = (rdma_port_get_link_layer(device, port_num) ==
> - IB_LINK_LAYER_ETHERNET);
>
> memset(ah_attr, 0, sizeof *ah_attr);
> - if (is_eth) {
> + if (!rdma_transport_ib(device, port_num)) {
> if (!(wc->wc_flags & IB_WC_GRH))
> return -EPROTOTYPE;
>
> @@ -873,7 +871,7 @@ int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
> union ib_gid sgid;
>
> if ((*qp_attr_mask & IB_QP_AV) &&
> - (rdma_port_get_link_layer(qp->device, qp_attr->ah_attr.port_num) == IB_LINK_LAYER_ETHERNET)) {
> + (!rdma_transport_ib(qp->device, qp_attr->ah_attr.port_num))) {
> ret = ib_query_gid(qp->device, qp_attr->ah_attr.port_num,
> qp_attr->ah_attr.grh.sgid_index, &sgid);
> if (ret)

The above checks would be better as:

force_grh = rdma_transport_iboe(...)

They are RoCE/IBoE specific checks.
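
To show where the three candidate tests differ, here is a small self-contained sketch. The grh_* names and the simplified enums are hypothetical; the transport-to-link-layer mapping follows the mapping list in the cover letter (only IB ports have an InfiniBand link layer).

```c
#include <assert.h>

enum grh_transport { GRH_T_IB, GRH_T_IWARP, GRH_T_USNIC_UDP, GRH_T_IBOE };
enum grh_link_layer { GRH_LL_INFINIBAND, GRH_LL_ETHERNET };

/* Per the mapping list: every transport except IB runs over Ethernet. */
enum grh_link_layer grh_link_layer_of(enum grh_transport t)
{
	return t == GRH_T_IB ? GRH_LL_INFINIBAND : GRH_LL_ETHERNET;
}

/* Original test: link layer is Ethernet. */
int grh_by_link_layer(enum grh_transport t)
{
	return grh_link_layer_of(t) == GRH_LL_ETHERNET;
}

/* Test used in the patch: anything that is not IB. */
int grh_by_not_ib(enum grh_transport t)
{
	return t != GRH_T_IB;
}

/* Suggested test: IBoE (RoCE) only. */
int grh_by_iboe(enum grh_transport t)
{
	return t == GRH_T_IBOE;
}

/* The patch's test agrees with the link-layer test for every transport. */
void grh_check_patch_is_mechanical(void)
{
	int t;

	for (t = GRH_T_IB; t <= GRH_T_IBOE; t++)
		assert(grh_by_not_ib(t) == grh_by_link_layer(t));
}
```

For IB and IBoE ports all three tests agree. They diverge only for iWARP and usNIC, where the link-layer test and the patch's test force a GRH and the IBoE-specific test does not.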

2015-04-07 18:49:58

by Hefty, Sean

Subject: RE: [PATCH v2 11/17] IB/Verbs: Reform link_layer_show() and ib_uverbs_query_port()

> diff --git a/drivers/infiniband/core/sysfs.c
> b/drivers/infiniband/core/sysfs.c
> index cbd0383..aa53e40 100644
> --- a/drivers/infiniband/core/sysfs.c
> +++ b/drivers/infiniband/core/sysfs.c
> @@ -248,14 +248,10 @@ static ssize_t phys_state_show(struct ib_port *p,
> struct port_attribute *unused,
> static ssize_t link_layer_show(struct ib_port *p, struct port_attribute
> *unused,
> char *buf)
> {
> - switch (rdma_port_get_link_layer(p->ibdev, p->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> + if (rdma_transport_ib(p->ibdev, p->port_num))
> return sprintf(buf, "%s\n", "InfiniBand");
> - case IB_LINK_LAYER_ETHERNET:
> + else
> return sprintf(buf, "%s\n", "Ethernet");
> - default:
> - return sprintf(buf, "%s\n", "Unknown");
> - }
> }
>
> static PORT_ATTR_RO(state);
> diff --git a/drivers/infiniband/core/uverbs_cmd.c
> b/drivers/infiniband/core/uverbs_cmd.c
> index a9f0489..3eb6eb5 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -515,8 +515,10 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file
> *file,
> resp.active_width = attr.active_width;
> resp.active_speed = attr.active_speed;
> resp.phys_state = attr.phys_state;
> - resp.link_layer = rdma_port_get_link_layer(file->device-
> >ib_dev,
> - cmd.port_num);
> + resp.link_layer = rdma_transport_ib(file->device->ib_dev,
> + cmd.port_num) ?
> + IB_LINK_LAYER_INFINIBAND :
> + IB_LINK_LAYER_ETHERNET;
>
> if (copy_to_user((void __user *) (unsigned long) cmd.response,
> &resp, sizeof resp))

Both of the above check the transport in order to determine the link layer.

These values are exposed to user space. Does anyone know what link layer iWarp returns to user space?
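
For reference, the mapping list in the cover letter gives ETH as the link layer for the iWARP and usnic drivers. A rough userspace sketch of deriving the link layer from the new transport type with an explicit switch is below; the enum and function names are hypothetical, and the point is only that a switch keeps the "Unknown" fallback that the if/else in the patch drops.

```c
#include <assert.h>

/* Hypothetical stand-ins; names do not match the kernel enums. */
enum transport { T_IB, T_IBOE, T_IWARP, T_USNIC_UDP };
enum link_layer { LL_UNSPECIFIED, LL_INFINIBAND, LL_ETHERNET };

/* Follows the mapping list in the cover letter: only native IB is IB link. */
static enum link_layer link_layer_of(enum transport t)
{
	switch (t) {
	case T_IB:
		return LL_INFINIBAND;
	case T_IBOE:
	case T_IWARP:
	case T_USNIC_UDP:
		return LL_ETHERNET;
	}
	return LL_UNSPECIFIED;	/* a transport this table does not know */
}

/* Same strings that link_layer_show() printed before the patch. */
static const char *link_layer_name(enum link_layer ll)
{
	switch (ll) {
	case LL_INFINIBAND:
		return "InfiniBand";
	case LL_ETHERNET:
		return "Ethernet";
	default:
		return "Unknown";
	}
}

/* Small self-check of the table above. */
void link_layer_self_check(void)
{
	assert(link_layer_of(T_IB) == LL_INFINIBAND);
	assert(link_layer_of(T_IWARP) == LL_ETHERNET);
	assert(link_layer_name(LL_UNSPECIFIED)[0] == 'U');
}
```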

2015-04-07 18:56:50

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v2 11/17] IB/Verbs: Reform link_layer_show() and ib_uverbs_query_port()

>
> > diff --git a/drivers/infiniband/core/sysfs.c
> > b/drivers/infiniband/core/sysfs.c
> > index cbd0383..aa53e40 100644
> > --- a/drivers/infiniband/core/sysfs.c
> > +++ b/drivers/infiniband/core/sysfs.c
> > @@ -248,14 +248,10 @@ static ssize_t phys_state_show(struct ib_port *p,
> > struct port_attribute *unused,
> > static ssize_t link_layer_show(struct ib_port *p, struct port_attribute
> > *unused,
> > char *buf)
> > {
> > - switch (rdma_port_get_link_layer(p->ibdev, p->port_num)) {
> > - case IB_LINK_LAYER_INFINIBAND:
> > + if (rdma_transport_ib(p->ibdev, p->port_num))
> > return sprintf(buf, "%s\n", "InfiniBand");
> > - case IB_LINK_LAYER_ETHERNET:
> > + else
> > return sprintf(buf, "%s\n", "Ethernet");
> > - default:
> > - return sprintf(buf, "%s\n", "Unknown");
> > - }
> > }
> >
> > static PORT_ATTR_RO(state);
> > diff --git a/drivers/infiniband/core/uverbs_cmd.c
> > b/drivers/infiniband/core/uverbs_cmd.c
> > index a9f0489..3eb6eb5 100644
> > --- a/drivers/infiniband/core/uverbs_cmd.c
> > +++ b/drivers/infiniband/core/uverbs_cmd.c
> > @@ -515,8 +515,10 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file
> > *file,
> > resp.active_width = attr.active_width;
> > resp.active_speed = attr.active_speed;
> > resp.phys_state = attr.phys_state;
> > - resp.link_layer = rdma_port_get_link_layer(file->device-
> > >ib_dev,
> > - cmd.port_num);
> > + resp.link_layer = rdma_transport_ib(file->device->ib_dev,
> > + cmd.port_num) ?
> > + IB_LINK_LAYER_INFINIBAND :
> > + IB_LINK_LAYER_ETHERNET;
> >
> > if (copy_to_user((void __user *) (unsigned long) cmd.response,
> > &resp, sizeof resp))
>
> Both of the above check the transport in order to determine the link layer.
>
> These values are exposed to user space. Does anyone know what link layer iWarp returns to user space?

Ethernet:

t4:~ # ibv_devinfo -d cxgb4_0|grep link_layer
link_layer: Ethernet
link_layer: Ethernet

Steve.

2015-04-07 20:13:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

On Tue, Apr 07, 2015 at 02:35:22PM +0200, Michael Wang wrote:
> index f704254..4e61104 100644
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> ah_attr->port_num = port_num;
> ah_attr->static_rate = rec->rate;
>
> - force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
> + force_grh = !rdma_transport_ib(device, port_num);

Maybe these tests should be called cap_mandatory_grh - but I'm not
really sure how iWarp uses the GRH fields in the AH...

Jason

2015-04-07 20:16:30

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers



> -----Original Message-----
> From: Jason Gunthorpe [mailto:[email protected]]
> Sent: Tuesday, April 07, 2015 3:13 PM
> To: Michael Wang
> Cc: Roland Dreier; Sean Hefty; [email protected]; [email protected]; [email protected];
> [email protected]; Hal Rosenstock; Tom Tucker; Steve Wise; Hoang-Nam Nguyen; Christoph Raisch; Mike Marciniszyn; Eli Cohen;
> Faisal Latif; Upinder Malhi; Trond Myklebust; J. Bruce Fields; David S. Miller; Ira Weiny; PJ Waskiewicz; Tatyana Nikolova; Or Gerlitz; Jack
> Morgenstein; Haggai Eran; Ilya Nelkenbaum; Yann Droneaud; Bart Van Assche; Shachar Raindel; Sagi Grimberg; Devesh Sharma; Matan
> Barak; Moni Shoua; Jiri Kosina; Selvin Xavier; Mitesh Ahuja; Li RongQing; Rasmus Villemoes; Alex Estrin; Doug Ledford; Eric Dumazet; Erez
> Shitrit; Tom Gundersen; Chuck Lever
> Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers
>
> On Tue, Apr 07, 2015 at 02:35:22PM +0200, Michael Wang wrote:
> > index f704254..4e61104 100644
> > +++ b/drivers/infiniband/core/sa_query.c
> > @@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> > ah_attr->port_num = port_num;
> > ah_attr->static_rate = rec->rate;
> >
> > - force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
> > + force_grh = !rdma_transport_ib(device, port_num);
>
> Maybe these tests should be called cap_mandatory_grh - but I'm not
> really sure how iWarp uses the GRH fields in the AH...
>

iWARP runs on top of TCP...this SA code is all IB-specific. The reason it was checking for ETHERNET, I think, is for RoCE. So
this change is totally incorrect, I think, because RoCE is an IB transport, but it runs on ETHERNET.

Steve.


2015-04-07 20:18:06

by Hefty, Sean

[permalink] [raw]
Subject: RE: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

> > index f704254..4e61104 100644
> > +++ b/drivers/infiniband/core/sa_query.c
> > @@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device,
> u8 port_num,
> > ah_attr->port_num = port_num;
> > ah_attr->static_rate = rec->rate;
> >
> > - force_grh = rdma_port_get_link_layer(device, port_num) ==
> IB_LINK_LAYER_ETHERNET;
> > + force_grh = !rdma_transport_ib(device, port_num);
>
> Maybe these tests should be called cap_mandatory_grh - but I'm not
> really sure how iWarp uses the GRH fields in the AH...

AHs are used with unconnected endpoints, which iWarp doesn't currently support.

2015-04-07 21:11:23

by Steve Wise

[permalink] [raw]
Subject: RE: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers



> -----Original Message-----
> From: Michael Wang [mailto:[email protected]]
> Sent: Tuesday, April 07, 2015 7:37 AM
> To: Roland Dreier; Sean Hefty; [email protected]; [email protected]; [email protected];
> [email protected]
> Cc: Hal Rosenstock; Tom Tucker; Steve Wise; Hoang-Nam Nguyen; Christoph Raisch; Mike Marciniszyn; Eli Cohen; Faisal Latif; Upinder
> Malhi; Trond Myklebust; J. Bruce Fields; David S. Miller; Ira Weiny; PJ Waskiewicz; Tatyana Nikolova; Or Gerlitz; Jack Morgenstein; Haggai
> Eran; Ilya Nelkenbaum; Yann Droneaud; Bart Van Assche; Shachar Raindel; Sagi Grimberg; Devesh Sharma; Matan Barak; Moni Shoua; Jiri
> Kosina; Selvin Xavier; Mitesh Ahuja; Li RongQing; Rasmus Villemoes; Alex Estrin; Doug Ledford; Eric Dumazet; Erez Shitrit; Tom
> Gundersen; Chuck Lever; Michael Wang
> Subject: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers
>
>
> Reform cma/ucma with management helpers.
>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Doug Ledford <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Sean Hefty <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> drivers/infiniband/core/cma.c | 182 +++++++++++++----------------------------
> drivers/infiniband/core/ucma.c | 25 ++----
> 2 files changed, 65 insertions(+), 142 deletions(-)
>
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index d8a8ea7..c23f483 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -435,10 +435,10 @@ static int cma_resolve_ib_dev(struct rdma_id_private *id_priv)
> pkey = ntohs(addr->sib_pkey);
>
> list_for_each_entry(cur_dev, &dev_list, list) {
> - if (rdma_node_get_transport(cur_dev->device->node_type) != RDMA_TRANSPORT_IB)
> - continue;
> -
> for (p = 1; p <= cur_dev->device->phys_port_cnt; ++p) {
> + if (!rdma_ib_mgmt(cur_dev->device, p))
> + continue;
> +
> if (ib_find_cached_pkey(cur_dev->device, p, pkey, &index))
> continue;
>
> @@ -633,10 +633,10 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
> if (ret)
> goto out;
>
> - if (rdma_node_get_transport(id_priv->cma_dev->device->node_type)
> - == RDMA_TRANSPORT_IB &&
> - rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)
> - == IB_LINK_LAYER_ETHERNET) {
> + /* Will this happen? */
> + BUG_ON(id_priv->cma_dev->device != id_priv->id.device);
> +
> + if (rdma_transport_iboe(id_priv->id.device, id_priv->id.port_num)) {
> ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL);
>
> if (ret)
> @@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct rdma_id_private *id_priv,
> int ret;
> u16 pkey;
>
> - if (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num) ==
> - IB_LINK_LAYER_INFINIBAND)
> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num))
> pkey = ib_addr_get_pkey(dev_addr);
> else
> pkey = 0xffff;
> @@ -735,8 +734,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
> int ret = 0;
>
> id_priv = container_of(id, struct rdma_id_private, id);
> - switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
> if (!id_priv->cm_id.ib || (id_priv->id.qp_type == IB_QPT_UD))
> ret = cma_ib_init_qp_attr(id_priv, qp_attr, qp_attr_mask);
> else
> @@ -745,19 +743,16 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct ib_qp_attr *qp_attr,
>
> if (qp_attr->qp_state == IB_QPS_RTR)
> qp_attr->rq_psn = id_priv->seq_num;
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id_priv->id.device,
> + id_priv->id.port_num)) {
> if (!id_priv->cm_id.iw) {
> qp_attr->qp_access_flags = 0;
> *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
> } else
> ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
> qp_attr_mask);
> - break;
> - default:
> + } else
> ret = -ENOSYS;
> - break;
> - }
>
> return ret;
> }
> @@ -928,13 +923,9 @@ static inline int cma_user_data_offset(struct rdma_id_private *id_priv)
>
> static void cma_cancel_route(struct rdma_id_private *id_priv)
> {
> - switch (rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num)) {
> if (id_priv->query)
> ib_sa_cancel_query(id_priv->query_id, id_priv->query);
> - break;
> - default:
> - break;
> }
> }
>
> @@ -1006,17 +997,14 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
> mc = container_of(id_priv->mc_list.next,
> struct cma_multicast, list);
> list_del(&mc->list);
> - switch (rdma_port_get_link_layer(id_priv->cma_dev->device, id_priv->id.port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> + if (rdma_transport_ib(id_priv->cma_dev->device,
> + id_priv->id.port_num)) {
> ib_sa_free_multicast(mc->multicast.ib);
> kfree(mc);
> break;
> - case IB_LINK_LAYER_ETHERNET:
> + } else if (rdma_transport_ib(id_priv->cma_dev->device,
> + id_priv->id.port_num))
> kref_put(&mc->mcref, release_mc);
> - break;
> - default:
> - break;
> - }
> }
> }
>

Doesn't the above change result in:

if (rdma_transport_ib()) {
} else if (rdma_transport_ib()) {
}

????
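
A small userspace sketch of why that matters: with the same predicate in both arms, the second arm can never run, so the kref_put() path is lost for Ethernet ports. The enums and function names are hypothetical; the "presumably intended" variant is an assumption based on the rdma_leave_multicast() hunk later in this same patch, which tests rdma_transport_iboe() in its second arm.

```c
#include <assert.h>

/* Hypothetical stand-ins for the transport types and the cleanup paths. */
enum transport { T_IB, T_IBOE, T_IWARP };
enum mc_action { MC_NONE, MC_SA_FREE, MC_KREF_PUT };

/* As posted: both arms test for IB, so the second arm is dead code. */
static enum mc_action leave_as_posted(enum transport t)
{
	if (t == T_IB)
		return MC_SA_FREE;
	else if (t == T_IB)	/* never reached */
		return MC_KREF_PUT;
	return MC_NONE;
}

/* Presumably intended (assumption): second arm tests for IBoE. */
static enum mc_action leave_presumably_intended(enum transport t)
{
	if (t == T_IB)
		return MC_SA_FREE;
	else if (t == T_IBOE)
		return MC_KREF_PUT;
	return MC_NONE;
}

/* Small self-check: the variants differ only for an IBoE port. */
void leave_mc_self_check(void)
{
	assert(leave_as_posted(T_IB) == leave_presumably_intended(T_IB));
	assert(leave_as_posted(T_IBOE) == MC_NONE);
	assert(leave_presumably_intended(T_IBOE) == MC_KREF_PUT);
}
```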

> @@ -1037,17 +1025,13 @@ void rdma_destroy_id(struct rdma_cm_id *id)
> mutex_unlock(&id_priv->handler_mutex);
>
> if (id_priv->cma_dev) {
> - switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
> if (id_priv->cm_id.ib)
> ib_destroy_cm_id(id_priv->cm_id.ib);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id_priv->id.device,
> + id_priv->id.port_num)) {
> if (id_priv->cm_id.iw)
> iw_destroy_cm_id(id_priv->cm_id.iw);
> - break;
> - default:
> - break;
> }
> cma_leave_mc_groups(id_priv);
> cma_release_dev(id_priv);
> @@ -1966,26 +1950,14 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
> return -EINVAL;
>
> atomic_inc(&id_priv->refcount);
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ret = cma_resolve_ib_route(id_priv, timeout_ms);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - ret = cma_resolve_iboe_route(id_priv);
> - break;
> - default:
> - ret = -ENOSYS;
> - }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + if (rdma_transport_ib(id->device, id->port_num))
> + ret = cma_resolve_ib_route(id_priv, timeout_ms);
> + else if (rdma_transport_iboe(id->device, id->port_num))
> + ret = cma_resolve_iboe_route(id_priv);
> + else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_resolve_iw_route(id_priv, timeout_ms);
> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
> if (ret)
> goto err;
>
> @@ -2059,7 +2031,7 @@ port_found:
> goto out;
>
> id_priv->id.route.addr.dev_addr.dev_type =
> - (rdma_port_get_link_layer(cma_dev->device, p) == IB_LINK_LAYER_INFINIBAND) ?
> + (rdma_transport_ib(cma_dev->device, p)) ?
> ARPHRD_INFINIBAND : ARPHRD_ETHER;
>
> rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
> @@ -2536,18 +2508,15 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
>
> id_priv->backlog = backlog;
> if (id->device) {
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> ret = cma_ib_listen(id_priv);
> if (ret)
> goto err;
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = cma_iw_listen(id_priv, backlog);
> if (ret)
> goto err;
> - break;
> - default:
> + } else {
> ret = -ENOSYS;
> goto err;
> }
> @@ -2883,20 +2852,15 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
> id_priv->srq = conn_param->srq;
> }
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> if (id->qp_type == IB_QPT_UD)
> ret = cma_resolve_ib_udp(id_priv, conn_param);
> else
> ret = cma_connect_ib(id_priv, conn_param);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_connect_iw(id_priv, conn_param);
> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
> if (ret)
> goto err;
>
> @@ -2999,8 +2963,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
> id_priv->srq = conn_param->srq;
> }
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> if (id->qp_type == IB_QPT_UD) {
> if (conn_param)
> ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
> @@ -3016,14 +2979,10 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
> else
> ret = cma_rep_recv(id_priv);
> }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_accept_iw(id_priv, conn_param);
> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
>
> if (ret)
> goto reject;
> @@ -3067,8 +3026,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
> if (!id_priv->cm_id.ib)
> return -EINVAL;
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> if (id->qp_type == IB_QPT_UD)
> ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, 0,
> private_data, private_data_len);
> @@ -3076,15 +3034,11 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
> ret = ib_send_cm_rej(id_priv->cm_id.ib,
> IB_CM_REJ_CONSUMER_DEFINED, NULL,
> 0, private_data, private_data_len);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = iw_cm_reject(id_priv->cm_id.iw,
> private_data, private_data_len);
> - break;
> - default:
> + } else
> ret = -ENOSYS;
> - break;
> - }
> return ret;
> }
> EXPORT_SYMBOL(rdma_reject);
> @@ -3098,22 +3052,17 @@ int rdma_disconnect(struct rdma_cm_id *id)
> if (!id_priv->cm_id.ib)
> return -EINVAL;
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> ret = cma_modify_qp_err(id_priv);
> if (ret)
> goto out;
> /* Initiate or respond to a disconnect. */
> if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
> ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
> - break;
> - default:
> + } else
> ret = -EINVAL;
> - break;
> - }
> out:
> return ret;
> }
> @@ -3359,24 +3308,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
> list_add(&mc->list, &id_priv->mc_list);
> spin_unlock(&id_priv->lock);
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ret = cma_join_ib_multicast(id_priv, mc);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - kref_init(&mc->mcref);
> - ret = cma_iboe_join_multicast(id_priv, mc);
> - break;
> - default:
> - ret = -EINVAL;
> - }
> - break;
> - default:
> + if (rdma_transport_iboe(id->device, id->port_num)) {
> + kref_init(&mc->mcref);
> + ret = cma_iboe_join_multicast(id_priv, mc);
> + } else if (rdma_transport_ib(id->device, id->port_num))
> + ret = cma_join_ib_multicast(id_priv, mc);
> + else
> ret = -ENOSYS;
> - break;
> - }
>
> if (ret) {
> spin_lock_irq(&id_priv->lock);
> @@ -3404,19 +3342,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
> ib_detach_mcast(id->qp,
> &mc->multicast.ib->rec.mgid,
> be16_to_cpu(mc->multicast.ib->rec.mlid));
> - if (rdma_node_get_transport(id_priv->cma_dev->device->node_type) == RDMA_TRANSPORT_IB) {
> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ib_sa_free_multicast(mc->multicast.ib);
> - kfree(mc);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - kref_put(&mc->mcref, release_mc);
> - break;
> - default:
> - break;
> - }
> - }
> +
> + /* Will this happen? */
> + BUG_ON(id_priv->cma_dev->device != id->device);
> +
> + if (rdma_transport_ib(id->device, id->port_num)) {
> + ib_sa_free_multicast(mc->multicast.ib);
> + kfree(mc);
> + } else if (rdma_transport_iboe(id->device,
> + id->port_num))
> + kref_put(&mc->mcref, release_mc);
> +
> return;
> }
> }
> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
> index 45d67e9..42c9bf6 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -722,26 +722,13 @@ static ssize_t ucma_query_route(struct ucma_file *file,
>
> resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
> resp.port_num = ctx->cm_id->port_num;
> - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(ctx->cm_id->device,
> - ctx->cm_id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ucma_copy_ib_route(&resp, &ctx->cm_id->route);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
> - break;
> - default:
> - break;
> - }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> +
> + if (rdma_transport_ib(ctx->cm_id->device, ctx->cm_id->port_num))
> + ucma_copy_ib_route(&resp, &ctx->cm_id->route);
> + else if (rdma_transport_iboe(ctx->cm_id->device, ctx->cm_id->port_num))
> + ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
> + else if (rdma_transport_iwarp(ctx->cm_id->device, ctx->cm_id->port_num))
> ucma_copy_iw_route(&resp, &ctx->cm_id->route);
> - break;
> - default:
> - break;
> - }
>
> out:
> if (copy_to_user((void __user *)(unsigned long)cmd.response,
> --
> 2.1.0

2015-04-07 21:25:09

by Hefty, Sean

[permalink] [raw]
Subject: RE: [PATCH v2 02/17] IB/Verbs: Implement raw management helpers

> +static inline int rdma_transport_ib(struct ib_device *device, u8
> port_num)
> +{
> + return device->query_transport(device, port_num)
> + == RDMA_TRANSPORT_IB;
> +}
> +
> +static inline int rdma_transport_iboe(struct ib_device *device, u8
> port_num)
> +{
> + return device->query_transport(device, port_num)
> + == RDMA_TRANSPORT_IBOE;
> +}

We need to do something with the function names to make their use more obvious. Both IB and IBoE have transport IB. I think Jason suggested rdma_tech_ib / rdma_tech_iboe.

Regarding transport types, I believe that usnic supports 2 different transports. Although usnic isn't used by anything else in the core layer, we should probably be able to handle a device that supports multiple protocols. I'm not sure what the 'transport' should be for iWarp, since iWarp is layered over TCP. But that may just mean that the term transport isn't great.
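
One possible shape for this, purely as a sketch and not something proposed in the series: report a per-port bitmask of supported protocols instead of a single transport value, so a port that speaks more than one protocol can be described. Every name below is hypothetical except rdma_tech_ib / rdma_tech_iboe, which follow the naming suggested above, and even those have an invented signature here (they take the mask directly rather than a device and port).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port protocol bits; a port may set more than one. */
enum {
	PROTO_IB	= 1u << 0,
	PROTO_IBOE	= 1u << 1,
	PROTO_IWARP	= 1u << 2,
	PROTO_USNIC	= 1u << 3,
	PROTO_USNIC_UDP	= 1u << 4,
};

/* Invented signature: real helpers would take (device, port_num). */
static bool rdma_tech_ib(unsigned int protos)
{
	return (protos & PROTO_IB) != 0;
}

static bool rdma_tech_iboe(unsigned int protos)
{
	return (protos & PROTO_IBOE) != 0;
}

/* Both IB and IBoE carry the IB transport, whatever the link layer is. */
static bool rdma_uses_ib_transport(unsigned int protos)
{
	return rdma_tech_ib(protos) || rdma_tech_iboe(protos);
}

/* Small self-check, including a port that reports two protocols. */
void proto_mask_self_check(void)
{
	unsigned int two = PROTO_USNIC | PROTO_USNIC_UDP;

	assert(!rdma_uses_ib_transport(two));
	assert(rdma_uses_ib_transport(PROTO_IBOE));
	assert(!rdma_tech_ib(PROTO_IBOE));
}
```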

- Sean

2015-04-07 21:36:34

by Hefty, Sean

[permalink] [raw]
Subject: RE: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers

> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index d8a8ea7..c23f483 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -435,10 +435,10 @@ static int cma_resolve_ib_dev(struct rdma_id_private
> *id_priv)
> pkey = ntohs(addr->sib_pkey);
>
> list_for_each_entry(cur_dev, &dev_list, list) {
> - if (rdma_node_get_transport(cur_dev->device->node_type) !=
> RDMA_TRANSPORT_IB)
> - continue;
> -
> for (p = 1; p <= cur_dev->device->phys_port_cnt; ++p) {
> + if (!rdma_ib_mgmt(cur_dev->device, p))
> + continue;

This check wants to be something like is_af_ib_supported(). Checking for IB transport may actually be better than checking for IB management. I don't know if IBoE/RoCE devices support AF_IB.


> +
> if (ib_find_cached_pkey(cur_dev->device, p, pkey,
> &index))
> continue;
>
> @@ -633,10 +633,10 @@ static int cma_modify_qp_rtr(struct rdma_id_private
> *id_priv,
> if (ret)
> goto out;
>
> - if (rdma_node_get_transport(id_priv->cma_dev->device->node_type)
> - == RDMA_TRANSPORT_IB &&
> - rdma_port_get_link_layer(id_priv->id.device, id_priv-
> >id.port_num)
> - == IB_LINK_LAYER_ETHERNET) {
> + /* Will this happen? */
> + BUG_ON(id_priv->cma_dev->device != id_priv->id.device);

This shouldn't happen. The BUG_ON looks okay.


> + if (rdma_transport_iboe(id_priv->id.device, id_priv->id.port_num)) {
> ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL);
>
> if (ret)
> @@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct rdma_id_private
> *id_priv,
> int ret;
> u16 pkey;
>
> - if (rdma_port_get_link_layer(id_priv->id.device, id_priv-
> >id.port_num) ==
> - IB_LINK_LAYER_INFINIBAND)
> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num))
> pkey = ib_addr_get_pkey(dev_addr);
> else
> pkey = 0xffff;

Check here should be against the link layer, not transport.


> @@ -735,8 +734,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct
> ib_qp_attr *qp_attr,
> int ret = 0;
>
> id_priv = container_of(id, struct rdma_id_private, id);
> - switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
> if (!id_priv->cm_id.ib || (id_priv->id.qp_type == IB_QPT_UD))
> ret = cma_ib_init_qp_attr(id_priv, qp_attr,
> qp_attr_mask);
> else
> @@ -745,19 +743,16 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct
> ib_qp_attr *qp_attr,
>
> if (qp_attr->qp_state == IB_QPS_RTR)
> qp_attr->rq_psn = id_priv->seq_num;
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id_priv->id.device,
> + id_priv->id.port_num)) {
> if (!id_priv->cm_id.iw) {
> qp_attr->qp_access_flags = 0;
> *qp_attr_mask = IB_QP_STATE | IB_QP_ACCESS_FLAGS;
> } else
> ret = iw_cm_init_qp_attr(id_priv->cm_id.iw, qp_attr,
> qp_attr_mask);
> - break;
> - default:
> + } else
> ret = -ENOSYS;
> - break;
> - }
>
> return ret;
> }
> @@ -928,13 +923,9 @@ static inline int cma_user_data_offset(struct
> rdma_id_private *id_priv)
>
> static void cma_cancel_route(struct rdma_id_private *id_priv)
> {
> - switch (rdma_port_get_link_layer(id_priv->id.device, id_priv-
> >id.port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num)) {

The check should be cap_ib_sa()


> if (id_priv->query)
> ib_sa_cancel_query(id_priv->query_id, id_priv->query);
> - break;
> - default:
> - break;
> }
> }
>
> @@ -1006,17 +997,14 @@ static void cma_leave_mc_groups(struct
> rdma_id_private *id_priv)
> mc = container_of(id_priv->mc_list.next,
> struct cma_multicast, list);
> list_del(&mc->list);
> - switch (rdma_port_get_link_layer(id_priv->cma_dev->device,
> id_priv->id.port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> + if (rdma_transport_ib(id_priv->cma_dev->device,
> + id_priv->id.port_num)) {
> ib_sa_free_multicast(mc->multicast.ib);
> kfree(mc);
> break;

Want cap_ib_mcast()


> - case IB_LINK_LAYER_ETHERNET:
> + } else if (rdma_transport_ib(id_priv->cma_dev->device,
> + id_priv->id.port_num))
> kref_put(&mc->mcref, release_mc);
> - break;
> - default:
> - break;

Just want else /* !cap_ib_mcast */


> - }
> }
> }
>
> @@ -1037,17 +1025,13 @@ void rdma_destroy_id(struct rdma_cm_id *id)
> mutex_unlock(&id_priv->handler_mutex);
>
> if (id_priv->cma_dev) {
> - switch (rdma_node_get_transport(id_priv->id.device-
> >node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
> if (id_priv->cm_id.ib)
> ib_destroy_cm_id(id_priv->cm_id.ib);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id_priv->id.device,
> + id_priv->id.port_num)) {
> if (id_priv->cm_id.iw)
> iw_destroy_cm_id(id_priv->cm_id.iw);
> - break;
> - default:
> - break;
> }
> cma_leave_mc_groups(id_priv);
> cma_release_dev(id_priv);
> @@ -1966,26 +1950,14 @@ int rdma_resolve_route(struct rdma_cm_id *id, int
> timeout_ms)
> return -EINVAL;
>
> atomic_inc(&id_priv->refcount);
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ret = cma_resolve_ib_route(id_priv, timeout_ms);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - ret = cma_resolve_iboe_route(id_priv);
> - break;
> - default:
> - ret = -ENOSYS;
> - }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + if (rdma_transport_ib(id->device, id->port_num))
> + ret = cma_resolve_ib_route(id_priv, timeout_ms);

Best fit would be cap_ib_sa()


> + else if (rdma_transport_iboe(id->device, id->port_num))
> + ret = cma_resolve_iboe_route(id_priv);
> + else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_resolve_iw_route(id_priv, timeout_ms);
> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
> if (ret)
> goto err;
>
> @@ -2059,7 +2031,7 @@ port_found:
> goto out;
>
> id_priv->id.route.addr.dev_addr.dev_type =
> - (rdma_port_get_link_layer(cma_dev->device, p) ==
> IB_LINK_LAYER_INFINIBAND) ?
> + (rdma_transport_ib(cma_dev->device, p)) ?
> ARPHRD_INFINIBAND : ARPHRD_ETHER;

This wants the link layer, or maybe use cap_ipoib.


>
> rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
> @@ -2536,18 +2508,15 @@ int rdma_listen(struct rdma_cm_id *id, int
> backlog)
>
> id_priv->backlog = backlog;
> if (id->device) {
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {

Want cap_ib_cm()


> ret = cma_ib_listen(id_priv);
> if (ret)
> goto err;
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = cma_iw_listen(id_priv, backlog);
> if (ret)
> goto err;
> - break;
> - default:
> + } else {
> ret = -ENOSYS;
> goto err;
> }
> @@ -2883,20 +2852,15 @@ int rdma_connect(struct rdma_cm_id *id, struct
> rdma_conn_param *conn_param)
> id_priv->srq = conn_param->srq;
> }
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {

cap_ib_cm()


> if (id->qp_type == IB_QPT_UD)
> ret = cma_resolve_ib_udp(id_priv, conn_param);
> else
> ret = cma_connect_ib(id_priv, conn_param);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_connect_iw(id_priv, conn_param);
> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
> if (ret)
> goto err;
>
> @@ -2999,8 +2963,7 @@ int rdma_accept(struct rdma_cm_id *id, struct
> rdma_conn_param *conn_param)
> id_priv->srq = conn_param->srq;
> }
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {

cap_ib_cm()


> if (id->qp_type == IB_QPT_UD) {
> if (conn_param)
> ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
> @@ -3016,14 +2979,10 @@ int rdma_accept(struct rdma_cm_id *id, struct
> rdma_conn_param *conn_param)
> else
> ret = cma_rep_recv(id_priv);
> }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num))
> ret = cma_accept_iw(id_priv, conn_param);

If cap_ib_cm() is used in the places marked above, maybe add a cap_iw_cm() for the else conditions.
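
As a rough userspace sketch of that idea (hypothetical types and names throughout; cap_ib_cm and cap_iw_cm are the names proposed in this review, here with an invented signature), the connect/accept/reject/listen paths would all reduce to the same three-way choice:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port capability flags. */
struct port_caps {
	bool ib_cm;	/* port uses the IB CM (native IB and IBoE) */
	bool iw_cm;	/* port uses the iWARP CM */
};

enum cm_path { CM_NONE, CM_IB, CM_IW };

/* Invented signatures: real helpers would take (device, port_num). */
static bool cap_ib_cm(const struct port_caps *caps)
{
	return caps->ib_cm;
}

static bool cap_iw_cm(const struct port_caps *caps)
{
	return caps->iw_cm;
}

/* The dispatch shared by the CMA entry points; CM_NONE maps to -ENOSYS. */
static enum cm_path pick_cm(const struct port_caps *caps)
{
	if (cap_ib_cm(caps))
		return CM_IB;
	else if (cap_iw_cm(caps))
		return CM_IW;
	return CM_NONE;
}

/* Small self-check with one port of each kind. */
void pick_cm_self_check(void)
{
	const struct port_caps iboe = { .ib_cm = true, .iw_cm = false };
	const struct port_caps iwarp = { .ib_cm = false, .iw_cm = true };
	const struct port_caps other = { .ib_cm = false, .iw_cm = false };

	assert(pick_cm(&iboe) == CM_IB);
	assert(pick_cm(&iwarp) == CM_IW);
	assert(pick_cm(&other) == CM_NONE);
}
```

Naming the check after the capability keeps the caller independent of which transports happen to provide that CM today.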


> - break;
> - default:
> + else
> ret = -ENOSYS;
> - break;
> - }
>
> if (ret)
> goto reject;
> @@ -3067,8 +3026,7 @@ int rdma_reject(struct rdma_cm_id *id, const void
> *private_data,
> if (!id_priv->cm_id.ib)
> return -EINVAL;
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {

cap_ib_cm()


> if (id->qp_type == IB_QPT_UD)
> ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, 0,
> private_data, private_data_len);
> @@ -3076,15 +3034,11 @@ int rdma_reject(struct rdma_cm_id *id, const void
> *private_data,
> ret = ib_send_cm_rej(id_priv->cm_id.ib,
> IB_CM_REJ_CONSUMER_DEFINED, NULL,
> 0, private_data, private_data_len);
> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = iw_cm_reject(id_priv->cm_id.iw,
> private_data, private_data_len);
> - break;
> - default:
> + } else
> ret = -ENOSYS;
> - break;
> - }
> return ret;
> }
> EXPORT_SYMBOL(rdma_reject);
> @@ -3098,22 +3052,17 @@ int rdma_disconnect(struct rdma_cm_id *id)
> if (!id_priv->cm_id.ib)
> return -EINVAL;
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> + if (rdma_ib_mgmt(id->device, id->port_num)) {
> ret = cma_modify_qp_err(id_priv);
> if (ret)
> goto out;
> /* Initiate or respond to a disconnect. */
> if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
> ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);

cap_ib_cm()


> - break;
> - case RDMA_TRANSPORT_IWARP:
> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
> ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
> - break;
> - default:
> + } else
> ret = -EINVAL;
> - break;
> - }
> out:
> return ret;
> }
> @@ -3359,24 +3308,13 @@ int rdma_join_multicast(struct rdma_cm_id *id,
> struct sockaddr *addr,
> list_add(&mc->list, &id_priv->mc_list);
> spin_unlock(&id_priv->lock);
>
> - switch (rdma_node_get_transport(id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ret = cma_join_ib_multicast(id_priv, mc);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - kref_init(&mc->mcref);
> - ret = cma_iboe_join_multicast(id_priv, mc);
> - break;
> - default:
> - ret = -EINVAL;
> - }
> - break;
> - default:
> + if (rdma_transport_iboe(id->device, id->port_num)) {
> + kref_init(&mc->mcref);
> + ret = cma_iboe_join_multicast(id_priv, mc);
> + } else if (rdma_transport_ib(id->device, id->port_num))
> + ret = cma_join_ib_multicast(id_priv, mc);

cap_ib_mcast()


> + else
> ret = -ENOSYS;
> - break;
> - }
>
> if (ret) {
> spin_lock_irq(&id_priv->lock);
> @@ -3404,19 +3342,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id,
> struct sockaddr *addr)
> ib_detach_mcast(id->qp,
> &mc->multicast.ib->rec.mgid,
> be16_to_cpu(mc->multicast.ib-
> >rec.mlid));
> - if (rdma_node_get_transport(id_priv->cma_dev->device-
> >node_type) == RDMA_TRANSPORT_IB) {
> - switch (rdma_port_get_link_layer(id->device, id-
> >port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ib_sa_free_multicast(mc->multicast.ib);
> - kfree(mc);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - kref_put(&mc->mcref, release_mc);
> - break;
> - default:
> - break;
> - }
> - }
> +
> + /* Will this happen? */
> + BUG_ON(id_priv->cma_dev->device != id->device);

Should not happen

> +
> + if (rdma_transport_ib(id->device, id->port_num)) {
> + ib_sa_free_multicast(mc->multicast.ib);
> + kfree(mc);

cap_ib_mcast()


> + } else if (rdma_transport_iboe(id->device,
> + id->port_num))
> + kref_put(&mc->mcref, release_mc);
> +
> return;
> }
> }
> diff --git a/drivers/infiniband/core/ucma.c
> b/drivers/infiniband/core/ucma.c
> index 45d67e9..42c9bf6 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -722,26 +722,13 @@ static ssize_t ucma_query_route(struct ucma_file
> *file,
>
> resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
> resp.port_num = ctx->cm_id->port_num;
> - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
> - case RDMA_TRANSPORT_IB:
> - switch (rdma_port_get_link_layer(ctx->cm_id->device,
> - ctx->cm_id->port_num)) {
> - case IB_LINK_LAYER_INFINIBAND:
> - ucma_copy_ib_route(&resp, &ctx->cm_id->route);
> - break;
> - case IB_LINK_LAYER_ETHERNET:
> - ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
> - break;
> - default:
> - break;
> - }
> - break;
> - case RDMA_TRANSPORT_IWARP:
> +
> + if (rdma_transport_ib(ctx->cm_id->device, ctx->cm_id->port_num))
> + ucma_copy_ib_route(&resp, &ctx->cm_id->route);

cap_ib_sa()


> + else if (rdma_transport_iboe(ctx->cm_id->device, ctx->cm_id-
> >port_num))
> + ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
> + else if (rdma_transport_iwarp(ctx->cm_id->device, ctx->cm_id-
> >port_num))
> ucma_copy_iw_route(&resp, &ctx->cm_id->route);
> - break;
> - default:
> - break;
> - }
>
> out:
> if (copy_to_user((void __user *)(unsigned long)cmd.response,


- Sean

2015-04-08 08:13:59

by Michael Wang

Subject: Re: [PATCH v2 03/17] IB/Verbs: Use management helper cap_ib_mad() for mad-check

On 04/07/2015 07:26 PM, Jason Gunthorpe wrote:
> On Tue, Apr 07, 2015 at 02:30:22PM +0200, Michael Wang wrote:
>
>> - if (rdma_node_get_transport(device->node_type) != RDMA_TRANSPORT_IB)
>> - return;
>> -
>> if (device->node_type == RDMA_NODE_IB_SWITCH) {
>> start = 0;
>> end = 0;
>> @@ -3069,6 +3066,9 @@ static void ib_mad_init_device(struct ib_device *device)
>> }
>>
>> for (i = start; i <= end; i++) {
>> + if (!cap_ib_mad(device, i))
>> + continue;
>> +
>
> I would prefer to see these changes in control flow as dedicated
> patches, at the top of your patch stack.
>
> For this kind of work a patch should be mechanical changes only, it is
> easier to review that way.
>
> Same comment applies throughout.

Makes sense :-) I will reorganize the series and put them last.

Regards,
Michael Wang

>
> Jason
>

2015-04-08 08:24:20

by Michael Wang

Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

On 04/07/2015 08:40 PM, Hefty, Sean wrote:
[snip]
>> @@ -200,11 +200,9 @@ int ib_init_ah_from_wc(struct ib_device *device, u8
>> port_num, struct ib_wc *wc,
>> u32 flow_class;
>> u16 gid_index;
>> int ret;
>> - int is_eth = (rdma_port_get_link_layer(device, port_num) ==
>> - IB_LINK_LAYER_ETHERNET);
>>
>> memset(ah_attr, 0, sizeof *ah_attr);
>> - if (is_eth) {
>> + if (!rdma_transport_ib(device, port_num)) {
>> if (!(wc->wc_flags & IB_WC_GRH))
>> return -EPROTOTYPE;
>>
>> @@ -873,7 +871,7 @@ int ib_resolve_eth_l2_attrs(struct ib_qp *qp,
>> union ib_gid sgid;
>>
>> if ((*qp_attr_mask & IB_QP_AV) &&
>> - (rdma_port_get_link_layer(qp->device, qp_attr->ah_attr.port_num)
>> == IB_LINK_LAYER_ETHERNET)) {
>> + (!rdma_transport_ib(qp->device, qp_attr->ah_attr.port_num))) {
>> ret = ib_query_gid(qp->device, qp_attr->ah_attr.port_num,
>> qp_attr->ah_attr.grh.sgid_index, &sgid);
>> if (ret)
>
> The above checks would be better as:
>
> force_grh = rdma_transport_iboe(...)
>
> They are RoCE/IBoE specific checks.

Got it, will be in next version :-)

Regards,
Michael Wang

>

2015-04-08 08:28:25

by Michael Wang

Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

Hi, Steve

Thanks for the comment :-)

On 04/07/2015 10:16 PM, Steve Wise wrote:
[snip]
>>>
>>> - force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
>>> + force_grh = !rdma_transport_ib(device, port_num);
>>
>> Maybe these tests should be called cap_mandatory_grh - but I'm not
>> really sure how iWarp uses the GRH fields in the AH...
>>
>
> iWARP runs on top of TCP...this SA code is all IB-specific. The reason it was checking for ETHERNET, I think, is for RoCE. So
> this change is totally incorrect, I think, because RoCE is an IB transport, but it runs on ETHERNET.

I guess it's the name 'transport' that is confusing folks... actually (!rdma_transport_ib)
includes RoCE/IBoE, but yes, here it's for IBoE only, so let's change it to
rdma_transport_iboe ;-)

Regards,
Michael Wang

>
> Steve.
>
>
>
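
The difference between the two checks discussed in this message can be illustrated with a small stand-alone sketch. It is not kernel code: `struct mock_device`, the `mock_*` helpers, the `MOCK_TRANSPORT_*` values and the two `force_grh_*` functions are assumptions that stand in for `struct ib_device`, the `query_transport()` callback and the helpers from this series. It only shows that `!rdma_transport_ib()` also matches iWARP ports, while `rdma_transport_iboe()` matches IBoE/RoCE ports alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the transport types proposed in the series. */
enum mock_transport {
	MOCK_TRANSPORT_IB,
	MOCK_TRANSPORT_IBOE,
	MOCK_TRANSPORT_IWARP,
	MOCK_TRANSPORT_USNIC_UDP,
};

/* Minimal mock device: every port reports the same transport. */
struct mock_device {
	enum mock_transport transport;
};

/* Mock of the per-port query_transport() callback. */
enum mock_transport mock_query_transport(const struct mock_device *dev,
					 unsigned char port_num)
{
	assert(dev != NULL);
	(void)port_num;
	return dev->transport;
}

bool mock_transport_ib(const struct mock_device *dev, unsigned char port_num)
{
	return mock_query_transport(dev, port_num) == MOCK_TRANSPORT_IB;
}

bool mock_transport_iboe(const struct mock_device *dev, unsigned char port_num)
{
	return mock_query_transport(dev, port_num) == MOCK_TRANSPORT_IBOE;
}

/* Check as posted in v2: true for every port that is not plain IB. */
bool force_grh_as_posted(const struct mock_device *dev, unsigned char port_num)
{
	return !mock_transport_ib(dev, port_num);
}

/* Check as suggested in review: true for IBoE (RoCE) ports only. */
bool force_grh_as_suggested(const struct mock_device *dev,
			    unsigned char port_num)
{
	return mock_transport_iboe(dev, port_num);
}
```

With these mocks the two variants agree for IB and IBoE ports and differ only for iWARP and usNIC ports, which is the case Steve raised.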

2015-04-08 08:30:17

by Michael Wang

Subject: Re: [PATCH v2 11/17] IB/Verbs: Reform link_layer_show() and ib_uverbs_query_port()



On 04/07/2015 08:49 PM, Hefty, Sean wrote:
[snip]
>> @@ -515,8 +515,10 @@ ssize_t ib_uverbs_query_port(struct ib_uverbs_file
>> *file,
>> resp.active_width = attr.active_width;
>> resp.active_speed = attr.active_speed;
>> resp.phys_state = attr.phys_state;
>> - resp.link_layer = rdma_port_get_link_layer(file->device-
>>> ib_dev,
>> - cmd.port_num);
>> + resp.link_layer = rdma_transport_ib(file->device->ib_dev,
>> + cmd.port_num) ?
>> + IB_LINK_LAYER_INFINIBAND :
>> + IB_LINK_LAYER_ETHERNET;
>>
>> if (copy_to_user((void __user *) (unsigned long) cmd.response,
>> &resp, sizeof resp))
>
> Both of the above check the transport in order to determine the link layer.
>
> These values are exposed to user space. Does anyone know what link layer iWarp returns to user space?

It should be ETH for IWARP according to the old logic :-)

Regards,
Michael Wang

>
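
As a rough sketch of the question in this message, the value reported to user space under the posted patch can be modelled as a function of the new transport type. The `MOCK_*` enums and `mock_link_layer_for_user()` are assumptions made for illustration, not kernel definitions; the expected results in `mock_link_layer_demo()` come from the link-layer column of the mapping list in the cover letter.

```c
#include <assert.h>

/* Hypothetical stand-ins for the new transport types. */
enum mock_transport {
	MOCK_TRANSPORT_IB,
	MOCK_TRANSPORT_IBOE,
	MOCK_TRANSPORT_IWARP,
	MOCK_TRANSPORT_USNIC_UDP,
};

/* Hypothetical stand-ins for the IB_LINK_LAYER_* values. */
enum mock_link_layer {
	MOCK_LINK_LAYER_INFINIBAND,
	MOCK_LINK_LAYER_ETHERNET,
};

/*
 * Same shape as the ternary in the patch: plain IB reports the
 * InfiniBand link layer, everything else reports Ethernet.
 */
enum mock_link_layer mock_link_layer_for_user(enum mock_transport t)
{
	return t == MOCK_TRANSPORT_IB ? MOCK_LINK_LAYER_INFINIBAND
				      : MOCK_LINK_LAYER_ETHERNET;
}

/* Usage: every row of the cover-letter mapping list is covered. */
void mock_link_layer_demo(void)
{
	assert(mock_link_layer_for_user(MOCK_TRANSPORT_IB) ==
	       MOCK_LINK_LAYER_INFINIBAND);
	assert(mock_link_layer_for_user(MOCK_TRANSPORT_IBOE) ==
	       MOCK_LINK_LAYER_ETHERNET);
	assert(mock_link_layer_for_user(MOCK_TRANSPORT_IWARP) ==
	       MOCK_LINK_LAYER_ETHERNET);
	assert(mock_link_layer_for_user(MOCK_TRANSPORT_USNIC_UDP) ==
	       MOCK_LINK_LAYER_ETHERNET);
}
```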

2015-04-08 08:39:23

by Michael Wang

Subject: Re: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers


On 04/07/2015 11:11 PM, Steve Wise wrote:
[snip]
>> @@ -1006,17 +997,14 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv)
>> mc = container_of(id_priv->mc_list.next,
>> struct cma_multicast, list);
>> list_del(&mc->list);
>> - switch (rdma_port_get_link_layer(id_priv->cma_dev->device, id_priv->id.port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> + if (rdma_transport_ib(id_priv->cma_dev->device,
>> + id_priv->id.port_num)) {
>> ib_sa_free_multicast(mc->multicast.ib);
>> kfree(mc);
>> break;
>> - case IB_LINK_LAYER_ETHERNET:
>> + } else if (rdma_transport_ib(id_priv->cma_dev->device,
>> + id_priv->id.port_num))
>> kref_put(&mc->mcref, release_mc);
>> - break;
>> - default:
>> - break;
>> - }
>> }
>> }
>>
>
> Doesn't the above change result in:
>
> if (rdma_transport_ib()) {
> } else if (rdma_transport_ib()) {
> }
>

My bad here.. I guess 'else' is enough.

Regards,
Michael Wang
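
The duplicated test Steve points out can be reduced to a tiny model. This is only a sketch under assumptions: `enum mock_transport`, `enum mc_action` and both functions are invented here to mirror the branch structure of `cma_leave_mc_groups()`, and nothing in it touches real multicast state. It shows that the posted form never reaches the `kref_put()` path, while a plain `else` sends every non-IB port there, much like the old `IB_LINK_LAYER_ETHERNET` case.

```c
#include <assert.h>

/* Hypothetical stand-ins for the two transports that matter here. */
enum mock_transport {
	MOCK_TRANSPORT_IB,
	MOCK_TRANSPORT_IBOE,
};

/* What the cleanup path would do with one multicast entry. */
enum mc_action {
	MC_ACTION_NONE,     /* entry is never released */
	MC_ACTION_SA_FREE,  /* ib_sa_free_multicast() + kfree() path */
	MC_ACTION_KREF_PUT, /* kref_put(&mc->mcref, release_mc) path */
};

/* Branch structure as posted: the second test repeats the first. */
enum mc_action leave_mc_as_posted(enum mock_transport t)
{
	if (t == MOCK_TRANSPORT_IB)
		return MC_ACTION_SA_FREE;
	else if (t == MOCK_TRANSPORT_IB) /* can never be true here */
		return MC_ACTION_KREF_PUT;
	return MC_ACTION_NONE;
}

/* Branch structure with a plain else, as proposed in the reply. */
enum mc_action leave_mc_with_else(enum mock_transport t)
{
	if (t == MOCK_TRANSPORT_IB)
		return MC_ACTION_SA_FREE;
	else
		return MC_ACTION_KREF_PUT;
}

/* Usage: only the IBoE case differs between the two variants. */
void leave_mc_demo(void)
{
	assert(leave_mc_as_posted(MOCK_TRANSPORT_IBOE) == MC_ACTION_NONE);
	assert(leave_mc_with_else(MOCK_TRANSPORT_IBOE) == MC_ACTION_KREF_PUT);
}
```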

> ????
>
>> @@ -1037,17 +1025,13 @@ void rdma_destroy_id(struct rdma_cm_id *id)
>> mutex_unlock(&id_priv->handler_mutex);
>>
>> if (id_priv->cma_dev) {
>> - switch (rdma_node_get_transport(id_priv->id.device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id_priv->id.device, id_priv->id.port_num)) {
>> if (id_priv->cm_id.ib)
>> ib_destroy_cm_id(id_priv->cm_id.ib);
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id_priv->id.device,
>> + id_priv->id.port_num)) {
>> if (id_priv->cm_id.iw)
>> iw_destroy_cm_id(id_priv->cm_id.iw);
>> - break;
>> - default:
>> - break;
>> }
>> cma_leave_mc_groups(id_priv);
>> cma_release_dev(id_priv);
>> @@ -1966,26 +1950,14 @@ int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)
>> return -EINVAL;
>>
>> atomic_inc(&id_priv->refcount);
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ret = cma_resolve_ib_route(id_priv, timeout_ms);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - ret = cma_resolve_iboe_route(id_priv);
>> - break;
>> - default:
>> - ret = -ENOSYS;
>> - }
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + if (rdma_transport_ib(id->device, id->port_num))
>> + ret = cma_resolve_ib_route(id_priv, timeout_ms);
>> + else if (rdma_transport_iboe(id->device, id->port_num))
>> + ret = cma_resolve_iboe_route(id_priv);
>> + else if (rdma_transport_iwarp(id->device, id->port_num))
>> ret = cma_resolve_iw_route(id_priv, timeout_ms);
>> - break;
>> - default:
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>> if (ret)
>> goto err;
>>
>> @@ -2059,7 +2031,7 @@ port_found:
>> goto out;
>>
>> id_priv->id.route.addr.dev_addr.dev_type =
>> - (rdma_port_get_link_layer(cma_dev->device, p) == IB_LINK_LAYER_INFINIBAND) ?
>> + (rdma_transport_ib(cma_dev->device, p)) ?
>> ARPHRD_INFINIBAND : ARPHRD_ETHER;
>>
>> rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
>> @@ -2536,18 +2508,15 @@ int rdma_listen(struct rdma_cm_id *id, int backlog)
>>
>> id_priv->backlog = backlog;
>> if (id->device) {
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> ret = cma_ib_listen(id_priv);
>> if (ret)
>> goto err;
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
>> ret = cma_iw_listen(id_priv, backlog);
>> if (ret)
>> goto err;
>> - break;
>> - default:
>> + } else {
>> ret = -ENOSYS;
>> goto err;
>> }
>> @@ -2883,20 +2852,15 @@ int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
>> id_priv->srq = conn_param->srq;
>> }
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> if (id->qp_type == IB_QPT_UD)
>> ret = cma_resolve_ib_udp(id_priv, conn_param);
>> else
>> ret = cma_connect_ib(id_priv, conn_param);
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num))
>> ret = cma_connect_iw(id_priv, conn_param);
>> - break;
>> - default:
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>> if (ret)
>> goto err;
>>
>> @@ -2999,8 +2963,7 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
>> id_priv->srq = conn_param->srq;
>> }
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> if (id->qp_type == IB_QPT_UD) {
>> if (conn_param)
>> ret = cma_send_sidr_rep(id_priv, IB_SIDR_SUCCESS,
>> @@ -3016,14 +2979,10 @@ int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
>> else
>> ret = cma_rep_recv(id_priv);
>> }
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num))
>> ret = cma_accept_iw(id_priv, conn_param);
>> - break;
>> - default:
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>>
>> if (ret)
>> goto reject;
>> @@ -3067,8 +3026,7 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
>> if (!id_priv->cm_id.ib)
>> return -EINVAL;
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> if (id->qp_type == IB_QPT_UD)
>> ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, 0,
>> private_data, private_data_len);
>> @@ -3076,15 +3034,11 @@ int rdma_reject(struct rdma_cm_id *id, const void *private_data,
>> ret = ib_send_cm_rej(id_priv->cm_id.ib,
>> IB_CM_REJ_CONSUMER_DEFINED, NULL,
>> 0, private_data, private_data_len);
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
>> ret = iw_cm_reject(id_priv->cm_id.iw,
>> private_data, private_data_len);
>> - break;
>> - default:
>> + } else
>> ret = -ENOSYS;
>> - break;
>> - }
>> return ret;
>> }
>> EXPORT_SYMBOL(rdma_reject);
>> @@ -3098,22 +3052,17 @@ int rdma_disconnect(struct rdma_cm_id *id)
>> if (!id_priv->cm_id.ib)
>> return -EINVAL;
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> ret = cma_modify_qp_err(id_priv);
>> if (ret)
>> goto out;
>> /* Initiate or respond to a disconnect. */
>> if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
>> ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
>> ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
>> - break;
>> - default:
>> + } else
>> ret = -EINVAL;
>> - break;
>> - }
>> out:
>> return ret;
>> }
>> @@ -3359,24 +3308,13 @@ int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr,
>> list_add(&mc->list, &id_priv->mc_list);
>> spin_unlock(&id_priv->lock);
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ret = cma_join_ib_multicast(id_priv, mc);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - kref_init(&mc->mcref);
>> - ret = cma_iboe_join_multicast(id_priv, mc);
>> - break;
>> - default:
>> - ret = -EINVAL;
>> - }
>> - break;
>> - default:
>> + if (rdma_transport_iboe(id->device, id->port_num)) {
>> + kref_init(&mc->mcref);
>> + ret = cma_iboe_join_multicast(id_priv, mc);
>> + } else if (rdma_transport_ib(id->device, id->port_num))
>> + ret = cma_join_ib_multicast(id_priv, mc);
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>>
>> if (ret) {
>> spin_lock_irq(&id_priv->lock);
>> @@ -3404,19 +3342,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)
>> ib_detach_mcast(id->qp,
>> &mc->multicast.ib->rec.mgid,
>> be16_to_cpu(mc->multicast.ib->rec.mlid));
>> - if (rdma_node_get_transport(id_priv->cma_dev->device->node_type) == RDMA_TRANSPORT_IB) {
>> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ib_sa_free_multicast(mc->multicast.ib);
>> - kfree(mc);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - kref_put(&mc->mcref, release_mc);
>> - break;
>> - default:
>> - break;
>> - }
>> - }
>> +
>> + /* Will this happen? */
>> + BUG_ON(id_priv->cma_dev->device != id->device);
>> +
>> + if (rdma_transport_ib(id->device, id->port_num)) {
>> + ib_sa_free_multicast(mc->multicast.ib);
>> + kfree(mc);
>> + } else if (rdma_transport_iboe(id->device,
>> + id->port_num))
>> + kref_put(&mc->mcref, release_mc);
>> +
>> return;
>> }
>> }
>> diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
>> index 45d67e9..42c9bf6 100644
>> --- a/drivers/infiniband/core/ucma.c
>> +++ b/drivers/infiniband/core/ucma.c
>> @@ -722,26 +722,13 @@ static ssize_t ucma_query_route(struct ucma_file *file,
>>
>> resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
>> resp.port_num = ctx->cm_id->port_num;
>> - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> - switch (rdma_port_get_link_layer(ctx->cm_id->device,
>> - ctx->cm_id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ucma_copy_ib_route(&resp, &ctx->cm_id->route);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
>> - break;
>> - default:
>> - break;
>> - }
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> +
>> + if (rdma_transport_ib(ctx->cm_id->device, ctx->cm_id->port_num))
>> + ucma_copy_ib_route(&resp, &ctx->cm_id->route);
>> + else if (rdma_transport_iboe(ctx->cm_id->device, ctx->cm_id->port_num))
>> + ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
>> + else if (rdma_transport_iwarp(ctx->cm_id->device, ctx->cm_id->port_num))
>> ucma_copy_iw_route(&resp, &ctx->cm_id->route);
>> - break;
>> - default:
>> - break;
>> - }
>>
>> out:
>> if (copy_to_user((void __user *)(unsigned long)cmd.response,
>> --
>> 2.1.0
>

2015-04-08 08:41:37

by Michael Wang

Subject: Re: [PATCH v2 02/17] IB/Verbs: Implement raw management helpers


On 04/07/2015 11:25 PM, Hefty, Sean wrote:
>> +static inline int rdma_transport_ib(struct ib_device *device, u8
>> port_num)
>> +{
>> + return device->query_transport(device, port_num)
>> + == RDMA_TRANSPORT_IB;
>> +}
>> +
>> +static inline int rdma_transport_iboe(struct ib_device *device, u8
>> port_num)
>> +{
>> + return device->query_transport(device, port_num)
>> + == RDMA_TRANSPORT_IBOE;
>> +}
>
> We need to do something with the function names to make their use more obvious. Both IB and IBoE have transport IB. I think Jason suggested rdma_tech_ib / rdma_tech_iboe.
>
> Regarding transport types, I believe that usnic supports 2 different transports. Although usnic isn't used by anything else in the core layer, we should probably be able to handle a device that supports multiple protocols. I'm not sure what the 'transport' should be for iWarp, since iWarp is layered over TCP. But that may just mean that the term transport isn't great.

Agreed, it does confuse folks; I will use 'tech' instead in the next version :-)

Regards,
Michael Wang

>
> - Sean
>

2015-04-08 08:51:40

by Michael Wang

Subject: Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()

On 04/07/2015 07:42 PM, Jason Gunthorpe wrote:
[snip]
>>> @@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>>> } else
>>> need_dma_mr = 0;
>>> - break;
>>> - case RDMA_TRANSPORT_IB:
>>> + } else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
>>> + newxprt->sc_cm_id->port_num)) {
>>> if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
>>> need_dma_mr = 1;
>>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>>
>> Now I'm even more confused. How is the presence of IB management
>> related to needing a privileged lmr?
>
> Agree, this needs to be someone else.
>
> I think the test is probably based on this comment:
>
> * NB: iWARP requires remote write access for the data sink
> * of an RDMA_READ. IB does not.
>
> So the if should be:
>
> if (cap_rdma_read_needs_write(..) &&
> !(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
> need_dma_mr = 1;
> dma_mr_acc =
> (IB_ACCESS_LOCAL_WRITE |
> IB_ACCESS_REMOTE_WRITE);
>
> And the identical if blocks merged.
>
> Plus the
> if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
> newxprt->sc_cm_id->port_num))
> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV

Sounds good :-) I'll rework this part in the next version.

Regards,
Michael Wang

>
> Jason
>
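
One way to read Jason's suggestion is that the access mask depends on a capability, not on whether IB management is present. The sketch below covers only that access-mask choice, not the `SVCRDMA_DEVCAP_FAST_REG` test or the `need_dma_mr` bookkeeping. Everything in it is an assumption made for illustration: the `MOCK_ACCESS_*` values, `enum mock_transport`, and `mock_cap_rdma_read_needs_write()`, which is named after the helper Jason proposes and follows the quoted comment that iWARP needs remote write access for the data sink of an RDMA_READ while IB does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock flag values; the kernel defines IB_ACCESS_* in its own headers. */
#define MOCK_ACCESS_LOCAL_WRITE  (1u << 0)
#define MOCK_ACCESS_REMOTE_WRITE (1u << 1)

/* Hypothetical stand-ins for the two transports named in the comment. */
enum mock_transport {
	MOCK_TRANSPORT_IB,
	MOCK_TRANSPORT_IWARP,
};

/* Hypothetical capability helper, modelled on the name used in review. */
bool mock_cap_rdma_read_needs_write(enum mock_transport t)
{
	return t == MOCK_TRANSPORT_IWARP;
}

/* Pick the DMA MR access mask from the capability, not the transport. */
unsigned int mock_dma_mr_access(enum mock_transport t)
{
	unsigned int acc = MOCK_ACCESS_LOCAL_WRITE;

	if (mock_cap_rdma_read_needs_write(t))
		acc |= MOCK_ACCESS_REMOTE_WRITE;
	return acc;
}

/* Usage: IB gets local write only, iWARP gets local and remote write. */
void mock_dma_mr_demo(void)
{
	assert(mock_dma_mr_access(MOCK_TRANSPORT_IB) ==
	       MOCK_ACCESS_LOCAL_WRITE);
	assert(mock_dma_mr_access(MOCK_TRANSPORT_IWARP) ==
	       (MOCK_ACCESS_LOCAL_WRITE | MOCK_ACCESS_REMOTE_WRITE));
}
```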

2015-04-08 09:37:12

by Michael Wang

Subject: Re: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers

Hi, Sean

Thanks for the review :-) cma is the toughest part of the
rework, and I really need some guidance here.


On 04/07/2015 11:36 PM, Hefty, Sean wrote:
>> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
>> index d8a8ea7..c23f483 100644
>> --- a/drivers/infiniband/core/cma.c
>> +++ b/drivers/infiniband/core/cma.c
>> @@ -435,10 +435,10 @@ static int cma_resolve_ib_dev(struct rdma_id_private
>> *id_priv)
>> pkey = ntohs(addr->sib_pkey);
>>
>> list_for_each_entry(cur_dev, &dev_list, list) {
>> - if (rdma_node_get_transport(cur_dev->device->node_type) !=
>> RDMA_TRANSPORT_IB)
>> - continue;
>> -
>> for (p = 1; p <= cur_dev->device->phys_port_cnt; ++p) {
>> + if (!rdma_ib_mgmt(cur_dev->device, p))
>> + continue;
>
> This check wants to be something like is_af_ib_supported(). Checking for IB transport may actually be better than checking for IB management. I don't know if IBoE/RoCE devices support AF_IB.

The wrapper makes sense, but do we have a guarantee that an IBoE port won't
be used for an AF_IB address? I just can't locate the place where we filter it out...

>
[snip]
>> - == IB_LINK_LAYER_ETHERNET) {
>> + /* Will this happen? */
>> + BUG_ON(id_priv->cma_dev->device != id_priv->id.device);
>
> This shouldn't happen. The BUG_ON looks okay.

Got it :-)

>
>
>> + if (rdma_transport_iboe(id_priv->id.device, id_priv->id.port_num)) {
>> ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL);
>>
>> if (ret)
>> @@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct rdma_id_private
>> *id_priv,
>> int ret;
>> u16 pkey;
>>
>> - if (rdma_port_get_link_layer(id_priv->id.device, id_priv-
>>> id.port_num) ==
>> - IB_LINK_LAYER_INFINIBAND)
>> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num))
>> pkey = ib_addr_get_pkey(dev_addr);
>> else
>> pkey = 0xffff;
>
> Check here should be against the link layer, not transport.

I guess the name is confusing us again... what if we use rdma_tech_ib() here?
It's the only tech using the IB link layer; the others are all ETH.

>
>
>> @@ -735,8 +734,7 @@ int rdma_init_qp_attr(struct rdma_cm_id *id, struct
[snip]
>>
>> static void cma_cancel_route(struct rdma_id_private *id_priv)
>> {
>> - switch (rdma_port_get_link_layer(id_priv->id.device, id_priv-
>>> id.port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num)) {
>
> The check should be cap_ib_sa()

Got it, will be in next version :-)

All the mcast/sa suggestions below will be applied too.

>
[snip]
>>
>> id_priv->id.route.addr.dev_addr.dev_type =
>> - (rdma_port_get_link_layer(cma_dev->device, p) ==
>> IB_LINK_LAYER_INFINIBAND) ?
>> + (rdma_transport_ib(cma_dev->device, p)) ?
>> ARPHRD_INFINIBAND : ARPHRD_ETHER;
>
> This wants the link layer, or maybe use cap_ipoib.

Is this related to ipoib only?

>
>
>>
>> rdma_addr_set_sgid(&id_priv->id.route.addr.dev_addr, &gid);
>> @@ -2536,18 +2508,15 @@ int rdma_listen(struct rdma_cm_id *id, int
>> backlog)
>>
>> id_priv->backlog = backlog;
>> if (id->device) {
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>
> Want cap_ib_cm()

Will be in the next version :-) and the other cap_ib_cm() suggestions too.

>
>
>> ret = cma_ib_listen(id_priv);
[snip]
>> @@ -3016,14 +2979,10 @@ int rdma_accept(struct rdma_cm_id *id, struct
>> rdma_conn_param *conn_param)
>> else
>> ret = cma_rep_recv(id_priv);
>> }
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num))
>> ret = cma_accept_iw(id_priv, conn_param);
>
> If cap_ib_cm() is used in the places marked above, maybe add a cap_iw_cm() for the else conditions.

Sounds good, will be in next version :-)

Regards,
Michael Wang

>
>
>> - break;
>> - default:
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>>
>> if (ret)
>> goto reject;
>> @@ -3067,8 +3026,7 @@ int rdma_reject(struct rdma_cm_id *id, const void
>> *private_data,
>> if (!id_priv->cm_id.ib)
>> return -EINVAL;
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>
> cap_ib_cm()
>
>
>> if (id->qp_type == IB_QPT_UD)
>> ret = cma_send_sidr_rep(id_priv, IB_SIDR_REJECT, 0,
>> private_data, private_data_len);
>> @@ -3076,15 +3034,11 @@ int rdma_reject(struct rdma_cm_id *id, const void
>> *private_data,
>> ret = ib_send_cm_rej(id_priv->cm_id.ib,
>> IB_CM_REJ_CONSUMER_DEFINED, NULL,
>> 0, private_data, private_data_len);
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
>> ret = iw_cm_reject(id_priv->cm_id.iw,
>> private_data, private_data_len);
>> - break;
>> - default:
>> + } else
>> ret = -ENOSYS;
>> - break;
>> - }
>> return ret;
>> }
>> EXPORT_SYMBOL(rdma_reject);
>> @@ -3098,22 +3052,17 @@ int rdma_disconnect(struct rdma_cm_id *id)
>> if (!id_priv->cm_id.ib)
>> return -EINVAL;
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> + if (rdma_ib_mgmt(id->device, id->port_num)) {
>> ret = cma_modify_qp_err(id_priv);
>> if (ret)
>> goto out;
>> /* Initiate or respond to a disconnect. */
>> if (ib_send_cm_dreq(id_priv->cm_id.ib, NULL, 0))
>> ib_send_cm_drep(id_priv->cm_id.ib, NULL, 0);
>
> cap_ib_cm()
>
>
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> + } else if (rdma_transport_iwarp(id->device, id->port_num)) {
>> ret = iw_cm_disconnect(id_priv->cm_id.iw, 0);
>> - break;
>> - default:
>> + } else
>> ret = -EINVAL;
>> - break;
>> - }
>> out:
>> return ret;
>> }
>> @@ -3359,24 +3308,13 @@ int rdma_join_multicast(struct rdma_cm_id *id,
>> struct sockaddr *addr,
>> list_add(&mc->list, &id_priv->mc_list);
>> spin_unlock(&id_priv->lock);
>>
>> - switch (rdma_node_get_transport(id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> - switch (rdma_port_get_link_layer(id->device, id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ret = cma_join_ib_multicast(id_priv, mc);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - kref_init(&mc->mcref);
>> - ret = cma_iboe_join_multicast(id_priv, mc);
>> - break;
>> - default:
>> - ret = -EINVAL;
>> - }
>> - break;
>> - default:
>> + if (rdma_transport_iboe(id->device, id->port_num)) {
>> + kref_init(&mc->mcref);
>> + ret = cma_iboe_join_multicast(id_priv, mc);
>> + } else if (rdma_transport_ib(id->device, id->port_num))
>> + ret = cma_join_ib_multicast(id_priv, mc);
>
> cap_ib_mcast()
>
>
>> + else
>> ret = -ENOSYS;
>> - break;
>> - }
>>
>> if (ret) {
>> spin_lock_irq(&id_priv->lock);
>> @@ -3404,19 +3342,17 @@ void rdma_leave_multicast(struct rdma_cm_id *id,
>> struct sockaddr *addr)
>> ib_detach_mcast(id->qp,
>> &mc->multicast.ib->rec.mgid,
>> be16_to_cpu(mc->multicast.ib-
>>> rec.mlid));
>> - if (rdma_node_get_transport(id_priv->cma_dev->device-
>>> node_type) == RDMA_TRANSPORT_IB) {
>> - switch (rdma_port_get_link_layer(id->device, id-
>>> port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ib_sa_free_multicast(mc->multicast.ib);
>> - kfree(mc);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - kref_put(&mc->mcref, release_mc);
>> - break;
>> - default:
>> - break;
>> - }
>> - }
>> +
>> + /* Will this happen? */
>> + BUG_ON(id_priv->cma_dev->device != id->device);
>
> Should not happen
>
>> +
>> + if (rdma_transport_ib(id->device, id->port_num)) {
>> + ib_sa_free_multicast(mc->multicast.ib);
>> + kfree(mc);
>
> cap_ib_mcast()
>
>
>> + } else if (rdma_transport_iboe(id->device,
>> + id->port_num))
>> + kref_put(&mc->mcref, release_mc);
>> +
>> return;
>> }
>> }
>> diff --git a/drivers/infiniband/core/ucma.c
>> b/drivers/infiniband/core/ucma.c
>> index 45d67e9..42c9bf6 100644
>> --- a/drivers/infiniband/core/ucma.c
>> +++ b/drivers/infiniband/core/ucma.c
>> @@ -722,26 +722,13 @@ static ssize_t ucma_query_route(struct ucma_file
>> *file,
>>
>> resp.node_guid = (__force __u64) ctx->cm_id->device->node_guid;
>> resp.port_num = ctx->cm_id->port_num;
>> - switch (rdma_node_get_transport(ctx->cm_id->device->node_type)) {
>> - case RDMA_TRANSPORT_IB:
>> - switch (rdma_port_get_link_layer(ctx->cm_id->device,
>> - ctx->cm_id->port_num)) {
>> - case IB_LINK_LAYER_INFINIBAND:
>> - ucma_copy_ib_route(&resp, &ctx->cm_id->route);
>> - break;
>> - case IB_LINK_LAYER_ETHERNET:
>> - ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
>> - break;
>> - default:
>> - break;
>> - }
>> - break;
>> - case RDMA_TRANSPORT_IWARP:
>> +
>> + if (rdma_transport_ib(ctx->cm_id->device, ctx->cm_id->port_num))
>> + ucma_copy_ib_route(&resp, &ctx->cm_id->route);
>
> cap_ib_sa()
>
>
>> + else if (rdma_transport_iboe(ctx->cm_id->device, ctx->cm_id-
>>> port_num))
>> + ucma_copy_iboe_route(&resp, &ctx->cm_id->route);
>> + else if (rdma_transport_iwarp(ctx->cm_id->device, ctx->cm_id-
>>> port_num))
>> ucma_copy_iw_route(&resp, &ctx->cm_id->route);
>> - break;
>> - default:
>> - break;
>> - }
>>
>> out:
>> if (copy_to_user((void __user *)(unsigned long)cmd.response,
>
>
> - Sean
>

2015-04-08 11:38:35

by Tom Talpey

Subject: Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers

On 4/7/2015 8:25 AM, Michael Wang wrote:
> Mapping List:
> node-type link-layer old-transport new-transport
> nes RNIC ETH IWARP IWARP
> amso1100 RNIC ETH IWARP IWARP
> cxgb3 RNIC ETH IWARP IWARP
> cxgb4 RNIC ETH IWARP IWARP
> usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP
> ocrdma IB_CA ETH IB IBOE
> mlx4 IB_CA IB/ETH IB IB/IBOE
> mlx5 IB_CA IB IB IB
> ehca IB_CA IB IB IB
> ipath IB_CA IB IB IB
> mthca IB_CA IB IB IB
> qib IB_CA IB IB IB

Can I rewind to ask a high-level question - what's the testing
plan for all of this? Do you have folks lined up for verifying
each of these adapters/networks, and what tests will they run?


2015-04-08 12:41:27

by Michael Wang

Subject: Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers



On 04/08/2015 01:38 PM, Tom Talpey wrote:
> On 4/7/2015 8:25 AM, Michael Wang wrote:
>> Mapping List:
>> node-type link-layer old-transport new-transport
>> nes RNIC ETH IWARP IWARP
>> amso1100 RNIC ETH IWARP IWARP
>> cxgb3 RNIC ETH IWARP IWARP
>> cxgb4 RNIC ETH IWARP IWARP
>> usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP
>> ocrdma IB_CA ETH IB IBOE
>> mlx4 IB_CA IB/ETH IB IB/IBOE
>> mlx5 IB_CA IB IB IB
>> ehca IB_CA IB IB IB
>> ipath IB_CA IB IB IB
>> mthca IB_CA IB IB IB
>> qib IB_CA IB IB IB
>
> Can I rewind to ask a high-level question - what's the testing
> plan for all of this? Do you have folks lined up for verifying
> each of these adapters/networks, and what tests will they run?

I think no one has access to all this hardware, so we can
only depend on those who happen to have some of it to help with the
testing, but we're still far from that stage..

Besides, the logic is not very complex in this part, and every
mapping has corresponding code as proof, so reviewing carefully and
then doing public testing on some tree could be a plan too?

But yes, I won't be able to do exhaustive testing by myself
and no one is backing me on that currently :-P I too am waiting
for answers on how to assure the quality of a patch set like this...

Regards,
Michael Wang

>
>
>

2015-04-08 15:52:14

by Jason Gunthorpe

Subject: Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers

On Wed, Apr 08, 2015 at 02:41:18PM +0200, Michael Wang wrote:

> I think no one can have the access to all these hardware, so we can
> only depends on those who accidentally have one to help the testing,
> but it's still far from that stage..

I have seen other patches in this style use the compiler to do the
check, if the patch doesn't change the compiled output then it is
obviously OK.

Some careful use of macros might make that possible, but it is a fair
amount of work.

However, that may be the only way to get something this invasive
applied, especially since we've already seen mistakes in the manual
transforms :|

Jason

2015-04-08 16:05:52

by Michael Wang

Subject: Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers

On 04/08/2015 05:51 PM, Jason Gunthorpe wrote:
> On Wed, Apr 08, 2015 at 02:41:18PM +0200, Michael Wang wrote:
>
>> I think no one can have the access to all these hardware, so we can
>> only depends on those who accidentally have one to help the testing,
>> but it's still far from that stage..
>
> I have seen other patches in this style use the compiler to do the
> check, if the patch doesn't change the compiled output then it is
> obviously OK.
>
> Some careful use of macros might make that possible, but it is a fair
> amount of work.
>
> However, that may be the only way to get something this invasive
> applied, especially since we've already seen mistakes in the manual
> transforms :|

Makes sense. I may be able to test with mlx4 in our lab, but IMHO
careful review may be more reliable than incomplete testing in
this case. If we have some tree or branch for next staging, that
could be a good place for public testing, but it hasn't reached
that stage yet ;-)

Regards,
Michael Wang


>
> Jason
>

2015-04-08 17:02:26

by Hefty, Sean

Subject: RE: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers

> On 04/07/2015 11:36 PM, Hefty, Sean wrote:
> >> diff --git a/drivers/infiniband/core/cma.c
> b/drivers/infiniband/core/cma.c
> >> index d8a8ea7..c23f483 100644
> >> --- a/drivers/infiniband/core/cma.c
> >> +++ b/drivers/infiniband/core/cma.c
> >> @@ -435,10 +435,10 @@ static int cma_resolve_ib_dev(struct
> rdma_id_private
> >> *id_priv)
> >> pkey = ntohs(addr->sib_pkey);
> >>
> >> list_for_each_entry(cur_dev, &dev_list, list) {
> >> - if (rdma_node_get_transport(cur_dev->device->node_type) !=
> >> RDMA_TRANSPORT_IB)
> >> - continue;
> >> -
> >> for (p = 1; p <= cur_dev->device->phys_port_cnt; ++p) {
> >> + if (!rdma_ib_mgmt(cur_dev->device, p))
> >> + continue;
> >
> > This check wants to be something like is_af_ib_supported(). Checking
> for IB transport may actually be better than checking for IB management.
> I don't know if IBoE/RoCE devices support AF_IB.
>
> The wrapper make sense, but do we have the guarantee that IBoE port won't
> be used for AF_IB address? I just can't locate the place we filtered it
> out...

I can't think of a reason why IBoE wouldn't work with AF_IB, but I'm not sure if anyone has tested it. The original check would have let IBoE through. When I suggested checking for IB transport, I meant the actual transport protocol, which would have included both IB and IBoE.

> >> @@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct
> rdma_id_private
> >> *id_priv,
> >> int ret;
> >> u16 pkey;
> >>
> >> - if (rdma_port_get_link_layer(id_priv->id.device, id_priv-
> >>> id.port_num) ==
> >> - IB_LINK_LAYER_INFINIBAND)
> >> + if (rdma_transport_ib(id_priv->id.device, id_priv->id.port_num))
> >> pkey = ib_addr_get_pkey(dev_addr);
> >> else
> >> pkey = 0xffff;
> >
> > Check here should be against the link layer, not transport.
>
> I guess the name confusing us again... what if use rdma_tech_ib() here?
> it's the only tech using IB link layers, others are all ETH.

Yes, that would work.

> >> id_priv->id.route.addr.dev_addr.dev_type =
> >> - (rdma_port_get_link_layer(cma_dev->device, p) ==
> >> IB_LINK_LAYER_INFINIBAND) ?
> >> + (rdma_transport_ib(cma_dev->device, p)) ?
> >> ARPHRD_INFINIBAND : ARPHRD_ETHER;
> >
> > This wants the link layer, or maybe use cap_ipoib.
>
> Is this related with ipoib only?

ARPHRD_INFINIBAND is related to ipoib. In your next update, maybe go with tech_ib. I don't know the status of ipoib over iboe.


2015-04-08 18:30:49

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Tue, 2015-04-07 at 14:42 +0200, Michael Wang wrote:
> Add new callback query_transport() and implement for each HW.

My response here is going to be a long email, but that's because it's
easier to respond to the various patches all in one response in order to
preserve context. So, while I'm responding to patch 1 of 17, my
response will cover all 17 patches in whole.

> Mapping List:
> node-type link-layer old-transport new-transport
> nes RNIC ETH IWARP IWARP
> amso1100 RNIC ETH IWARP IWARP
> cxgb3 RNIC ETH IWARP IWARP
> cxgb4 RNIC ETH IWARP IWARP
> usnic USNIC_UDP ETH USNIC_UDP USNIC_UDP
> ocrdma IB_CA ETH IB IBOE
> mlx4 IB_CA IB/ETH IB IB/IBOE
> mlx5 IB_CA IB IB IB
> ehca IB_CA IB IB IB
> ipath IB_CA IB IB IB
> mthca IB_CA IB IB IB
> qib IB_CA IB IB IB
>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: Doug Ledford <[email protected]>
> Cc: Ira Weiny <[email protected]>
> Cc: Sean Hefty <[email protected]>
> Signed-off-by: Michael Wang <[email protected]>
> ---
> drivers/infiniband/core/device.c | 1 +
> drivers/infiniband/core/verbs.c | 4 +++-
> drivers/infiniband/hw/amso1100/c2_provider.c | 7 +++++++
> drivers/infiniband/hw/cxgb3/iwch_provider.c | 7 +++++++
> drivers/infiniband/hw/cxgb4/provider.c | 7 +++++++
> drivers/infiniband/hw/ehca/ehca_hca.c | 6 ++++++
> drivers/infiniband/hw/ehca/ehca_iverbs.h | 3 +++
> drivers/infiniband/hw/ehca/ehca_main.c | 1 +
> drivers/infiniband/hw/ipath/ipath_verbs.c | 7 +++++++
> drivers/infiniband/hw/mlx4/main.c | 10 ++++++++++
> drivers/infiniband/hw/mlx5/main.c | 7 +++++++
> drivers/infiniband/hw/mthca/mthca_provider.c | 7 +++++++
> drivers/infiniband/hw/nes/nes_verbs.c | 6 ++++++
> drivers/infiniband/hw/ocrdma/ocrdma_main.c | 1 +
> drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 6 ++++++
> drivers/infiniband/hw/ocrdma/ocrdma_verbs.h | 3 +++
> drivers/infiniband/hw/qib/qib_verbs.c | 7 +++++++
> drivers/infiniband/hw/usnic/usnic_ib_main.c | 1 +
> drivers/infiniband/hw/usnic/usnic_ib_verbs.c | 6 ++++++
> drivers/infiniband/hw/usnic/usnic_ib_verbs.h | 2 ++
> include/rdma/ib_verbs.h | 7 ++++++-
> 21 files changed, 104 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> index 18c1ece..a9587c4 100644
> --- a/drivers/infiniband/core/device.c
> +++ b/drivers/infiniband/core/device.c
> @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
> } mandatory_table[] = {
> IB_MANDATORY_FUNC(query_device),
> IB_MANDATORY_FUNC(query_port),
> + IB_MANDATORY_FUNC(query_transport),
> IB_MANDATORY_FUNC(query_pkey),
> IB_MANDATORY_FUNC(query_gid),
> IB_MANDATORY_FUNC(alloc_pd),

I'm concerned about the performance implications of this. The size of
this patchset already points out just how many places in the code we
have to check for various aspects of the device transport in order to do
the right thing. Without going through the entire list to see how many
are on critical hot paths, I'm sure some of them are on at least
partially critical hot paths (like creation of new connections). I
would prefer to see this change be implemented via a device attribute,
not a functional call query. That adds a needless function call in
these paths.
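
The trade-off described here can be illustrated with a minimal userspace sketch. Everything in it is hypothetical: `struct demo_device` merely stands in for `struct ib_device`, and the per-port `transport[]` cache is an assumption about how a "device attribute" might be stored, not something the patch set defines. The callback is run once per port at registration time, so the hot-path check becomes a plain load instead of an indirect call.

```c
#include <assert.h>

#define DEMO_MAX_PORTS 2

enum demo_transport { DEMO_IB, DEMO_IWARP, DEMO_IBOE };

struct demo_device {
	unsigned char phys_port_cnt;
	/* Callback style: one indirect call per check. */
	enum demo_transport (*query_transport)(struct demo_device *dev,
					       unsigned char port);
	/* Attribute style: filled once at registration (ports are 1-based). */
	enum demo_transport transport[DEMO_MAX_PORTS + 1];
};

/* Hypothetical registration step: run the callback once per port and cache. */
static void demo_register(struct demo_device *dev)
{
	unsigned char p;

	assert(dev->phys_port_cnt <= DEMO_MAX_PORTS);
	for (p = 1; p <= dev->phys_port_cnt; ++p)
		dev->transport[p] = dev->query_transport(dev, p);
}

/* Hot-path check: no function pointer is dereferenced here. */
static inline int demo_port_is_iboe(const struct demo_device *dev,
				    unsigned char port)
{
	return dev->transport[port] == DEMO_IBOE;
}

/* Example driver callback for a mixed device: port 1 IB, port 2 IBoE. */
static enum demo_transport demo_mixed_query(struct demo_device *dev,
					    unsigned char port)
{
	(void)dev;
	return port == 2 ? DEMO_IBOE : DEMO_IB;
}
```

Whether the saved indirect call is measurable on real connection-setup paths is not established anywhere in this thread; the sketch only shows the structural difference.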

> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index f93eb8d..83370de 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
> if (device->get_link_layer)
> return device->get_link_layer(device, port_num);
>
> - switch (rdma_node_get_transport(device->node_type)) {
> + switch (device->query_transport(device, port_num)) {
> case RDMA_TRANSPORT_IB:
> + case RDMA_TRANSPORT_IBOE:
> return IB_LINK_LAYER_INFINIBAND;

If we are preserving ABI, then this looks wrong. Currently, IBOE
returns transport IB and link layer Ethernet. It should not return
link layer IB; it does not support IB link layer operations (such as MAD
access).

> case RDMA_TRANSPORT_IWARP:
> case RDMA_TRANSPORT_USNIC:
> case RDMA_TRANSPORT_USNIC_UDP:
> return IB_LINK_LAYER_ETHERNET;
> default:
> + BUG();
> return IB_LINK_LAYER_UNSPECIFIED;
> }
> }

[ snip ]

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 65994a1..d54f91e 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -75,10 +75,13 @@ enum rdma_node_type {
> };
>
> enum rdma_transport_type {
> + /* legacy for users */
> RDMA_TRANSPORT_IB,
> RDMA_TRANSPORT_IWARP,
> RDMA_TRANSPORT_USNIC,
> - RDMA_TRANSPORT_USNIC_UDP
> + RDMA_TRANSPORT_USNIC_UDP,
> + /* new transport */
> + RDMA_TRANSPORT_IBOE,
> };

I'm also concerned about this. I would like to see this enum
essentially turned into a bitmap. One that is constructed in such a way
that we can always get the specific test we need with only one compare
against the overall value. In order to do so, we need to break it down
into the essential elements that are part of each of the transports.
So, for instance, we can define the two link layers we have so far, plus
reserve one for OPA which we know is coming:

RDMA_LINK_LAYER_IB = 0x00000001,
RDMA_LINK_LAYER_ETH = 0x00000002,
RDMA_LINK_LAYER_OPA = 0x00000004,
RDMA_LINK_LAYER_MASK = 0x0000000f,

We can then define the currently known high level transport types:

RDMA_TRANSPORT_IB = 0x00000010,
RDMA_TRANSPORT_IWARP = 0x00000020,
RDMA_TRANSPORT_USNIC = 0x00000040,
RDMA_TRANSPORT_USNIC_UDP = 0x00000080,
RDMA_TRANSPORT_MASK = 0x000000f0,

We could then define bits for the IB management types:

RDMA_MGMT_IB = 0x00000100,
RDMA_MGMT_OPA = 0x00000200,
RDMA_MGMT_MASK = 0x00000f00,

Then we have space to define specific quirks:

RDMA_SEPARATE_READ_SGE = 0x00001000,
RDMA_QUIRKS_MASK = 0xfffff000

Once those are defined, a few definitions for device drivers to use when
they initialize a device to set the bitmap to the right values:

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)
#define IS_OPA (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB | \
		RDMA_MGMT_OPA)

Then you need to define the tests:

static inline bool
rdma_transport_is_iwarp(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_TRANSPORT_IWARP;
}

/* Note: this intentionally covers IB, IBOE, and OPA...use
rdma_port_is_ib if you want to only get physical IB devices */
static inline bool
rdma_transport_is_ib(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_TRANSPORT_IB;
}

rdma_port_is_ib(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_LINK_LAYER_IB;
}

rdma_port_is_iboe(struct ib_device *dev, u8 port)
{
return (dev->port[port]->transport & IS_IBOE) == IS_IBOE;
}

rdma_port_is_usnic(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_TRANSPORT_USNIC;
}

rdma_port_is_opa(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_LINK_LAYER_OPA;
}

rdma_port_is_iwarp(struct ib_device *dev, u8 port)
{
return rdma_transport_is_iwarp(dev, port);
}

rdma_port_ib_fabric_mgmt(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_MGMT_IB;
}

rdma_port_opa_mgmt(struct ib_device *dev, u8 port)
{
return dev->port[port]->transport & RDMA_MGMT_OPA;
}
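
For readers who want to experiment with the bitmap, here is a minimal userspace sketch. The bit values and the `IS_*` composites are copied from the proposal; the `demo_*` helpers are hypothetical and take the bitmap directly rather than a `(dev, port)` pair, since the `dev->port[port]->transport` field does not exist yet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the proposal. */
#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_LINK_LAYER_OPA	0x00000004u
#define RDMA_LINK_LAYER_MASK	0x0000000fu
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_TRANSPORT_IWARP	0x00000020u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_MGMT_OPA		0x00000200u
#define RDMA_SEPARATE_READ_SGE	0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	 (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_OPA	 (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB | \
		  RDMA_MGMT_OPA)

/* True for IB, IBoE and OPA: all three carry the IB transport bit. */
static inline bool demo_transport_is_ib(uint32_t transport)
{
	return (transport & RDMA_TRANSPORT_IB) != 0;
}

/* Multi-bit test: every bit of IS_IBOE must be set. */
static inline bool demo_port_is_iboe(uint32_t transport)
{
	/* Parentheses matter: == binds tighter than & in C. */
	return (transport & IS_IBOE) == IS_IBOE;
}

/* Extract only the link-layer bits. */
static inline uint32_t demo_link_layer(uint32_t transport)
{
	return transport & RDMA_LINK_LAYER_MASK;
}
```

Single-bit tests need one AND; the IBoE test needs an AND plus a compare because it matches two bits at once.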

Other things can be changed too. Like rdma_port_get_link_layer can
become this:

{
return dev->transport & RDMA_LINK_LAYER_MASK;
}

From patch 2/17:


> +static inline int rdma_ib_mgmt(struct ib_device *device, u8 port_num)
> +{
> + enum rdma_transport_type tp = device->query_transport(device,
> port_num);
> +
> + return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE);
> +}

This looks wrong. IBOE doesn't have IB management. At least it doesn't
have subnet management.

Actually, reading through the remainder of the patches, there is some
serious confusion taking place here. In later patches, you use this as
a surrogate for cap_cm, which implies you are talking about connection
management. This is very different than the rdma_port_ib_fabric_mgmt()
test that I create above, which specifically refers to IB management tasks
unique to IB/OPA: MAD, SM, multicast.

The kernel connection management code is not really limited. It
supports IB, IBOE, iWARP, and in the future it will support OPA. There
are some places in the CM code where we test for just IB/IBOE currently,
but that's only because we split iWARP out higher up in the abstraction
hierarchy. So, calling something rdma_ib_mgmt and meaning a rather
specialized test in the CM is probably misleading.

To straighten all this out, let's break management out into the two
distinct types:

rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
multicast. The proper test for this with my bitmap above is a simple
transport & RDMA_MGMT_IB test. It will be true for IB and OPA fabrics.

rdma_port_conn_mgmt() <- connection management, which we currently
support on everything except USNIC (correct Sean?), so a test would be
something like !(transport & RDMA_TRANSPORT_USNIC). This is then split
out into two subgroups, IB style and iWARP style connection management
(aka, rdma_port_iw_conn_mgmt() and rdma_port_ib_conn_mgmt()). In my
above bitmap, since I didn't give IBOE its own transport type, these
subgroups still boil down to the simple tests transport & iWARP and
transport & IB like they do today.
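
The split between fabric management and connection management can be sketched as follows. The helper names follow the ones proposed in this mail, but these userspace versions are hypothetical: they take the bitmap directly, and excluding `RDMA_TRANSPORT_USNIC_UDP` alongside `RDMA_TRANSPORT_USNIC` is an assumption, since the mail only says "something like".

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	 0x00000001u
#define RDMA_LINK_LAYER_ETH	 0x00000002u
#define RDMA_TRANSPORT_IB	 0x00000010u
#define RDMA_TRANSPORT_IWARP	 0x00000020u
#define RDMA_TRANSPORT_USNIC	 0x00000040u
#define RDMA_TRANSPORT_USNIC_UDP 0x00000080u
#define RDMA_MGMT_IB		 0x00000100u
#define RDMA_SEPARATE_READ_SGE	 0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	 (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)

/* Fabric management (MAD, SM, multicast). */
static inline bool rdma_port_ib_fabric_mgmt(uint32_t transport)
{
	return (transport & RDMA_MGMT_IB) != 0;
}

/* Connection management: assumed here to exclude both usNIC flavours. */
static inline bool rdma_port_conn_mgmt(uint32_t transport)
{
	return !(transport &
		 (RDMA_TRANSPORT_USNIC | RDMA_TRANSPORT_USNIC_UDP));
}

/* iWARP-style connection management. */
static inline bool rdma_port_iw_conn_mgmt(uint32_t transport)
{
	return (transport & RDMA_TRANSPORT_IWARP) != 0;
}

/* IB-style connection management; IBoE shares the IB transport bit. */
static inline bool rdma_port_ib_conn_mgmt(uint32_t transport)
{
	return (transport & RDMA_TRANSPORT_IB) != 0;
}
```

With these definitions an IBoE port reports IB-style connection management but no fabric management, which is the distinction drawn above.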

From patch 3/17:


> +/**
> + * cap_ib_mad - Check if the port of device has the capability
> Infiniband
> + * Management Datagrams.
> + *
> + * @device: Device to be checked
> + * @port_num: Port number of the device
> + *
> + * Return 0 when port of the device don't support Infiniband
> + * Management Datagrams.
> + */
> +static inline int cap_ib_mad(struct ib_device *device, u8 port_num)
> +{
> + return rdma_ib_mgmt(device, port_num);
> +}
> +

Why add cap_ib_mad? It's nothing more than rdma_port_ib_fabric_mgmt
with a new name. Just use rdma_port_ib_fabric_mgmt() everywhere you
have cap_ib_mad.

From patch 4/17:


> +/**
> + * cap_ib_smi - Check if the port of device has the capability
> Infiniband
> + * Subnet Management Interface.
> + *
> + * @device: Device to be checked
> + * @port_num: Port number of the device
> + *
> + * Return 0 when port of the device don't support Infiniband
> + * Subnet Management Interface.
> + */
> +static inline int cap_ib_smi(struct ib_device *device, u8 port_num)
> +{
> + return rdma_transport_ib(device, port_num);
> +}
> +

Same as the previous patch. This is needless indirection. Just use
rdma_port_ib_fabric_mgmt directly.

Patch 5/17:

Again, just use rdma_port_ib_conn_mgmt() directly.

Patch 6/17:

Again, just use rdma_port_ib_fabric_mgmt() directly.

Patch 7/17:

Again, just use rdma_port_ib_fabric_mgmt() directly. It's perfectly
applicable to the IB mcast registration requirements.

Patch 8/17:

Here we can create a new test if we are using the bitmap I created
above:

rdma_port_ipoib(struct ib_device *dev, u8 port)
{
return !(dev->port[port]->transport & RDMA_LINK_LAYER_ETH);
}

This is presuming that OPA will need ipoib devices. This will cause all
non-Ethernet link layer devices to return true, and right now, that is
all IB and all OPA devices.
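
One property of the negated test is worth seeing in isolation: a bitmap of 0 (for example, a port whose driver never filled it in) also passes. The sketch below contrasts the test as written with a stricter, hypothetical variant that names the accepted link layers explicitly; `demo_port_ipoib_strict` is my own illustration and not part of the proposal.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_LINK_LAYER_OPA	0x00000004u
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_MGMT_OPA		0x00000200u

#define IS_IB	(RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	(RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_OPA	(RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB | \
		 RDMA_MGMT_OPA)

/* The test as proposed: anything that is not Ethernet. */
static inline bool demo_port_ipoib(uint32_t transport)
{
	return !(transport & RDMA_LINK_LAYER_ETH);
}

/* Hypothetical stricter variant: the link layer must be IB or OPA. */
static inline bool demo_port_ipoib_strict(uint32_t transport)
{
	return (transport &
		(RDMA_LINK_LAYER_IB | RDMA_LINK_LAYER_OPA)) != 0;
}
```

Both variants agree on IB, OPA and IBoE ports; they differ only for an empty bitmap.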

Patch 9/17:

Most of the other comments on this patch stand as they are. I would add
the test:

rdma_port_separate_read_sge(dev, port)
{
return dev->port[port]->transport & RDMA_SEPARATE_READ_SGE;
}

and add the helper function:

rdma_port_get_read_sge(dev, port)
{
if (rdma_transport_is_iwarp(dev, port))
return 1;
return dev->port[port]->max_sge;
}

Then, as Jason points out, if at some point in the future the kernel is
modified to support devices with asymmetrical read/write SGE sizes, this
function can be modified to support those devices.
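
A compilable userspace sketch of the two helpers, under stated assumptions: `struct demo_port` is a hypothetical stand-in for the per-port data, and the read-SGE helper keys off the quirk bit rather than the iWARP transport test, which is a deliberate swap so that a future non-iWARP device with the same restriction would need no code change.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_TRANSPORT_IWARP	0x00000020u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_SEPARATE_READ_SGE	0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)

/* Hypothetical per-port record; not a kernel structure. */
struct demo_port {
	uint32_t transport;
	int max_sge;
};

/* True when the port carries the separate-read-SGE quirk bit. */
static inline bool demo_port_separate_read_sge(const struct demo_port *p)
{
	return (p->transport & RDMA_SEPARATE_READ_SGE) != 0;
}

/* Ports with the quirk get a single read SGE, others the device maximum. */
static inline int demo_port_get_read_sge(const struct demo_port *p)
{
	if (demo_port_separate_read_sge(p))
		return 1;
	return p->max_sge;
}
```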

Patch 10/17:

As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm
handles iw devices, but you notice all of the functions you modify here
start with ib_. The iwarp connections are funneled through iw_ specific
function variants, and so even though the cm handles iwarp, ib, and roce
devices, you never see anything other than ib/iboe (and opa in the
future) get to the ib_ variants of the functions. So, they wrote the
original tests as tests against the link layer being ethernet and used
that to differentiate between ib and iboe devices. It works, but can
confuse people. So, every place that has !rdma_transport_ib should
really be rdma_dev_is_iboe instead. If we ever merge the iw_ and ib_
functions in the future, having this right will help avoid problems.

Patch 11/17:

I wouldn't reform the link_layer_show except to make it compile with the
new defines I used above.

Patch 12/17:

Go ahead and add a helper to check all ports on a dev, just make it
rdma_hca_ib_conn_mgmt() and have it loop through
rdma_port_ib_conn_mgmt() return codes.
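
A minimal sketch of such a device-wide helper, assuming it should return true only when every port passes the per-port test (the mail does not say whether "all" or "any" is intended). `struct demo_device` and the `demo_*` names are hypothetical stand-ins; ports are 1-based as in the existing loops.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_TRANSPORT_IWARP	0x00000020u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_SEPARATE_READ_SGE	0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	 (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)

#define DEMO_MAX_PORTS 2

/* Hypothetical device record; index 0 of transport[] is unused. */
struct demo_device {
	unsigned char phys_port_cnt;
	uint32_t transport[DEMO_MAX_PORTS + 1];
};

static inline bool demo_port_ib_conn_mgmt(const struct demo_device *dev,
					  unsigned char port)
{
	return (dev->transport[port] & RDMA_TRANSPORT_IB) != 0;
}

/* Assumed semantics: true only if all ports use IB-style CM. */
static inline bool demo_hca_ib_conn_mgmt(const struct demo_device *dev)
{
	unsigned char p;

	for (p = 1; p <= dev->phys_port_cnt; ++p)
		if (!demo_port_ib_conn_mgmt(dev, p))
			return false;
	return true;
}
```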

Patch 13/17:

This patch is largely unneeded if we reworked the bitmap like I have
above. A lot of the changes you made to switch from case statements to
multiple if statements can go back to being case statements because in
the bitmap IB and IBOE are still both transport IB, so you just do the
case on the transport bits and not on the link layer bits.
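
The point about keeping case statements can be sketched like this: masking with `RDMA_TRANSPORT_MASK` leaves exactly one transport bit, so IB and IBoE land in the same case. The `enum demo_cm_style` and `demo_pick_cm()` names are hypothetical and only illustrate the dispatch shape.

```c
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_TRANSPORT_IWARP	0x00000020u
#define RDMA_TRANSPORT_USNIC	0x00000040u
#define RDMA_TRANSPORT_MASK	0x000000f0u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_SEPARATE_READ_SGE	0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	 (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)

enum demo_cm_style { DEMO_CM_NONE, DEMO_CM_IB, DEMO_CM_IW };

/* Dispatch on the transport bits only; link-layer bits are masked off. */
static inline enum demo_cm_style demo_pick_cm(uint32_t transport)
{
	switch (transport & RDMA_TRANSPORT_MASK) {
	case RDMA_TRANSPORT_IB:		/* covers IB and IBoE */
		return DEMO_CM_IB;
	case RDMA_TRANSPORT_IWARP:
		return DEMO_CM_IW;
	default:
		return DEMO_CM_NONE;
	}
}
```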

Patch 14/17:

Seems ok.

Patch 15/17:

If you implement the bitmap like I list above, then this code will need
to be fixed up to use the bitmap. Otherwise it looks OK.

Patch 16/17:

OK.

Patch 17/17:

I would drop this patch. In the future, the mlx5 driver will support
both Ethernet and IB like mlx4 does, and we would just need to pull this
code back to core instead of only in mlx4.


--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-08 18:41:30

by Hefty, Sean

Subject: RE: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

> I'm concerned about the performance implications of this. The size of
> this patchset already points out just how many places in the code we
> have to check for various aspects of the device transport in order to do
> the right thing. Without going through the entire list to see how many
> are on critical hot paths, I'm sure some of them are on at least
> partially critical hot paths (like creation of new connections). I
> would prefer to see this change be implemented via a device attribute,
> not a functional call query. That adds a needless function call in
> these paths.

My impression of these changes was that they would eventually lead to the mechanism that you outlined:


> I'm also concerned about this. I would like to see this enum
> essentially turned into a bitmap. One that is constructed in such a way
> that we can always get the specific test we need with only one compare
> against the overall value. In order to do so, we need to break it down
> into the essential elements that are part of each of the transports.
> So, for instance, we can define the two link layers we have so far, plus
> reserve one for OPA which we know is coming:
>
> RDMA_LINK_LAYER_IB = 0x00000001,
> RDMA_LINK_LAYER_ETH = 0x00000002,
> RDMA_LINK_LAYER_OPA = 0x00000004,
> RDMA_LINK_LAYER_MASK = 0x0000000f,
>
> We can then define the currently known high level transport types:
>
> RDMA_TRANSPORT_IB = 0x00000010,
> RDMA_TRANSPORT_IWARP = 0x00000020,
> RDMA_TRANSPORT_USNIC = 0x00000040,
> RDMA_TRANSPORT_USNIC_UDP = 0x00000080,
> RDMA_TRANSPORT_MASK = 0x000000f0,
>
> We could then define bits for the IB management types:
>
> RDMA_MGMT_IB = 0x00000100,
> RDMA_MGMT_OPA = 0x00000200,
> RDMA_MGMT_MASK = 0x00000f00,
>
> Then we have space to define specific quirks:
>
> RDMA_SEPARATE_READ_SGE = 0x00001000,
> RDMA_QUIRKS_MASK = 0xfffff000

I too would like to see this as the end result, but I think it's possible to stage the changes by having the static inline calls that are being added converted to use these sorts of attributes.

- Sean

2015-04-08 19:36:26

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Wed, Apr 08, 2015 at 06:41:22PM +0000, Hefty, Sean wrote:

> I too would like to see this as the end result, but I think it's
> possible to stage the changes by having the static inline calls
> being added convert to using these sort of attributes.

I agree as well, this patch set is already so big.

But Doug may be right, this conversion may need to be part of the
series that is applied in one go for performance reasons. But that is
just a patch at the end to optimize the inline calls.

Jason

2015-04-08 20:10:31

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote:

> To straighten all this out, lets break management out into the two
> distinct types:
>
> rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
> multicast. The proper test for this with my bitmap above is a simple
> transport & RDMA_MGMT_IB test. If will be true for IB and OPA fabrics.

> rdma_port_conn_mgmt() <- connection management, which we currently
> support everything except USNIC (correct Sean?), so a test would be
> something like !(transport & RDMA_TRANSPORT_USNIC). This is then split
> out into two subgroups, IB style and iWARP stype connection management
> (aka, rdma_port_iw_conn_mgmt() and rdma_port_ib_conn_mgmt()). In my
> above bitmap, since I didn't give IBOE its own transport type, these
> subgroups still boil down to the simple tests transport & iWARP and
> transport & IB like they do today.

There is a lot more variation here than just these two tests, and those
two tests won't scale to include OPA.

               IB   ROCEE  OPA
SMI            Y    N      Y    (though the OPA smi looked a bit different)
IB SMP         Y    N      N
OPA SMP        N    N      Y
GMP            Y    Y      Y
SA             Y    N      Y
PM             Y    Y      Y    (? guessing for OPA)
CM             Y    Y      Y
GMP needs GRH  N    Y      N

It may be unrealistic, but I was hoping we could largely scrub the
opaque 'is spec iWARP, is spec ROCEE' kinds of tests because they
don't tell anyone what it is the code cares about.

Maybe what is needed is a more precise language for the functions:

> > + * cap_ib_mad - Check if the port of device has the capability
> > Infiniband
> > + * Management Datagrams.

As used this seems to mean:

True if the port can do IB/OPA SMP, or GMP management packets on QP0 or
QP1. (Y Y Y) ie: Do we need the MAD layer at all.

ib_smi seems to be true if QP0 is supported (Y N Y)

Maybe the above set would make a lot more sense as:
cap_ib_qp0
cap_ib_qp1
cap_opa_qp0

ib_cm seems to mean that the CM protocol from the IBA is used on the
port (Y Y Y)

ib_sa means the IBA SA protocol is supported (Y Y Y)

ib_mcast true if the IBA SA protocol is used for multicast GIDs (Y N
Y)

ipoib means the port supports the ipoib protocol (Y N ?)

These seem reasonable and understandable, even if they currently
overlap a bit.
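
To make the per-capability idea concrete, here is a table-driven sketch that encodes the IB/ROCEE/OPA matrix from earlier in this mail as one bitmask per technology. All `demo_*` names are hypothetical, and the OPA column inherits the guesses marked in the matrix.

```c
#include <stdbool.h>

/* One bit per row of the matrix. */
enum demo_cap {
	DEMO_CAP_SMI	 = 1 << 0,
	DEMO_CAP_IB_SMP	 = 1 << 1,
	DEMO_CAP_OPA_SMP = 1 << 2,
	DEMO_CAP_GMP	 = 1 << 3,
	DEMO_CAP_SA	 = 1 << 4,
	DEMO_CAP_PM	 = 1 << 5,
	DEMO_CAP_CM	 = 1 << 6,
	DEMO_CAP_GRH	 = 1 << 7,	/* "GMP needs GRH" */
};

enum demo_tech { DEMO_TECH_IB, DEMO_TECH_ROCEE, DEMO_TECH_OPA,
		 DEMO_TECH_COUNT };

/* One column of the matrix per entry. */
static const unsigned int demo_caps[DEMO_TECH_COUNT] = {
	[DEMO_TECH_IB]	  = DEMO_CAP_SMI | DEMO_CAP_IB_SMP | DEMO_CAP_GMP |
			    DEMO_CAP_SA | DEMO_CAP_PM | DEMO_CAP_CM,
	[DEMO_TECH_ROCEE] = DEMO_CAP_GMP | DEMO_CAP_PM | DEMO_CAP_CM |
			    DEMO_CAP_GRH,
	[DEMO_TECH_OPA]	  = DEMO_CAP_SMI | DEMO_CAP_OPA_SMP | DEMO_CAP_GMP |
			    DEMO_CAP_SA | DEMO_CAP_PM | DEMO_CAP_CM,
};

/* True when the technology has every capability bit in cap. */
static inline bool demo_has_cap(enum demo_tech tech, unsigned int cap)
{
	return (demo_caps[tech] & cap) == cap;
}
```

A caller then asks what it actually cares about, for example `demo_has_cap(tech, DEMO_CAP_GRH)`, instead of asking which specification the port implements.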

> Patch 9/17:
>
> Most of the other comments on this patch stand as they are. I would add
> the test:
>
> rdma_port_separate_read_sge(dev, port)
> {
> return dev->port[port]->transport & RDMA_SEPERATE_READ_SGE;
> }
>
> and add the helper function:
>
> rdma_port_get_read_sge(dev, port)
> {
> if (rdma_transport_is_iwarp)
> return 1;
> return dev->port[port]->max_sge;
> }

Hum, that is nice, but it doesn't quite fit with how the ULP needs to
work. The max limit when creating a WR is the value passed into the
qp_cap, not the device maximum limit.

To do this properly we need to extend the qp_cap, and that is just too
big a change. A one bit iWarp quirk is OK for now.

> As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm

I actually really prefer cap_mandatory_grh - that is what is going on
here. ie based on that name (as a reviewer) I'd expect to see the mad
layer check that the mandatory GRH is always present, or blow up.

Some of the other checks in this file revolve around pkey, I'm not
sure what rocee does there? cap_pkey_supported ?

Jason

2015-04-08 20:55:44

by Tom Talpey

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 4/8/2015 4:10 PM, Jason Gunthorpe wrote:
> On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote:
>
>...
>>
>> rdma_port_get_read_sge(dev, port)
>> {
>> if (rdma_transport_is_iwarp)
>> return 1;
>> return dev->port[port]->max_sge;
>> }
>
> Hum, that is nice, but it doesn't quite fit with how the ULP needs to
> work. The max limit when creating a WR is the value passed into the
> qp_cap, not the device maximum limit.

Agreed, and I will again say that not all devices necessarily support
the same max_sge for all WR types. The current one-size-fits-all API
may make the upper layer think so, but it's possibly being lied to.

> To do this properly we need to extend the qp_cap, and that is just too
> big a change. A one bit iWarp quirk is OK for now.

Yes, it would be a large-ish change, and I like Doug's choice of word
"quirk" to capture these as exceptions, until the means for addressing
them is decided.

Overall, I like Doug's proposals, especially from an upper layer
perspective. I might suggest further refining them into categories,
perhaps "management", primarily of interest to kernel and the
plumbing of connections; and actual "RDMA semantics", of interest
to RDMA consumers.

2015-04-09 05:37:11

by Ira Weiny

Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers

On Tue, Apr 07, 2015 at 03:16:30PM -0500, Steve Wise wrote:
>
>
> > -----Original Message-----
> > From: Jason Gunthorpe [mailto:[email protected]]
> > Sent: Tuesday, April 07, 2015 3:13 PM
> > To: Michael Wang
> > Cc: Roland Dreier; Sean Hefty; [email protected]; [email protected]; [email protected];
> > [email protected]; Hal Rosenstock; Tom Tucker; Steve Wise; Hoang-Nam Nguyen; Christoph Raisch; Mike Marciniszyn; Eli Cohen;
> > Faisal Latif; Upinder Malhi; Trond Myklebust; J. Bruce Fields; David S. Miller; Ira Weiny; PJ Waskiewicz; Tatyana Nikolova; Or
> Gerlitz; Jack
> > Morgenstein; Haggai Eran; Ilya Nelkenbaum; Yann Droneaud; Bart Van Assche; Shachar Raindel; Sagi Grimberg; Devesh Sharma; Matan
> > Barak; Moni Shoua; Jiri Kosina; Selvin Xavier; Mitesh Ahuja; Li RongQing; Rasmus Villemoes; Alex Estrin; Doug Ledford; Eric
> Dumazet; Erez
> > Shitrit; Tom Gundersen; Chuck Lever
> > Subject: Re: [PATCH v2 10/17] IB/Verbs: Adopt management helpers for IB helpers
> >
> > On Tue, Apr 07, 2015 at 02:35:22PM +0200, Michael Wang wrote:
> > > index f704254..4e61104 100644
> > > +++ b/drivers/infiniband/core/sa_query.c
> > > @@ -540,7 +540,7 @@ int ib_init_ah_from_path(struct ib_device *device, u8 port_num,
> > > ah_attr->port_num = port_num;
> > > ah_attr->static_rate = rec->rate;
> > >
> > > - force_grh = rdma_port_get_link_layer(device, port_num) == IB_LINK_LAYER_ETHERNET;
> > > + force_grh = !rdma_transport_ib(device, port_num);
> >
> > Maybe these tests should be called cap_mandatory_grh - but I'm not
> > really sure how iWarp uses the GRH fields in the AH...
> >
>
> iWARP runs on top of TCP...this SA code is all IB-specific. The reason it was checking for ETHERNET, I think, is for RoCE. So
> this change is totally incorrect, I think, because RoCE is an IB transport, but it runs on ETHERNET.

But RoCE does not have an SA?

Looks like ib_init_ah_from_path was overloaded to handle non-standard "path
records".

It seems like the correct functionality would be to use ib_init_ah_from_path()
for true SA PathRecords and have another call iboe_init_ah() wrap
ib_init_ah_from_path() when RoCE address information is needed in the AH.

For Michael's patches I think

force_grh = rdma_device_is_iboe(...)

is the logic we need here.

Ira


>
> Steve.
>
>
>

2015-04-09 08:05:35

by Michael Wang

Subject: Re: [PATCH v2 13/17] IB/Verbs: Reform cma/ucma with management helpers

On 04/08/2015 07:02 PM, Hefty, Sean wrote:
[snip]
>>
>> The wrapper make sense, but do we have the guarantee that IBoE port won't
>> be used for AF_IB address? I just can't locate the place we filtered it
>> out...
>
> I can't think of a reason why IBoE wouldn't work with AF_IB, but I'm not sure if anyone has tested it. The original check would have let IBoE through. When I suggested checking for IB transport, I meant the actual transport protocol, which would have included both IB and IBoE.

Got it :-)

>
>>>> @@ -700,8 +700,7 @@ static int cma_ib_init_qp_attr(struct
[snip]
>
>>>> id_priv->id.route.addr.dev_addr.dev_type =
>>>> - (rdma_port_get_link_layer(cma_dev->device, p) ==
>>>> IB_LINK_LAYER_INFINIBAND) ?
>>>> + (rdma_transport_ib(cma_dev->device, p)) ?
>>>> ARPHRD_INFINIBAND : ARPHRD_ETHER;
>>>
>>> This wants the link layer, or maybe use cap_ipoib.
>>
>> Is this related with ipoib only?
>
> ARPHDR_INFINIBAND is related to ipoib. In your next update, maybe go with tech_ib. I don't know the status of ipoib over iboe.

Will be in next version :-)

Regards,
Michael Wang

>

2015-04-09 09:34:31

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/08/2015 08:29 PM, Doug Ledford wrote:
> On Tue, 2015-04-07 at 14:42 +0200, Michael Wang wrote:
>> Add new callback query_transport() and implement for each HW.
>
> My response here is going to be a long email, but that's because it's
> easier to respond to the various patches all in one response in order to
> preserve context. So, while I'm responding to patch 1 of 17, my
> response will cover all 17 patches in whole.

Thanks for the review :-)

>
>> Mapping List:
[snip]
>>
>> diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
>> index 18c1ece..a9587c4 100644
>> --- a/drivers/infiniband/core/device.c
>> +++ b/drivers/infiniband/core/device.c
>> @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
>> } mandatory_table[] = {
>> IB_MANDATORY_FUNC(query_device),
>> IB_MANDATORY_FUNC(query_port),
>> + IB_MANDATORY_FUNC(query_transport),
>> IB_MANDATORY_FUNC(query_pkey),
>> IB_MANDATORY_FUNC(query_gid),
>> IB_MANDATORY_FUNC(alloc_pd),
>
> I'm concerned about the performance implications of this. The size of
> this patchset already points out just how many places in the code we
> have to check for various aspects of the device transport in order to do
> the right thing. Without going through the entire list to see how many
> are on critical hot paths, I'm sure some of them are on at least
> partially critical hot paths (like creation of new connections). I
> would prefer to see this change be implemented via a device attribute,
> not a functional call query. That adds a needless function call in
> these paths.

That's exactly the first issue come into my mind while working on this.

Mostly I was influenced by the current device callback mechanism: we have
plenty of query callbacks and they are widely used in hot paths, so I
finally decided to use query_transport() to reuse the existing mechanism.

Actually, I had learned that bitmask operations can be somewhat expensive
too, while the callback may cost only two registers, one instruction and
two jumps, so I guess we need a benchmark to tell the difference in
performance. I just picked the easier way as a first step :-P
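
To make the trade-off concrete, here is a minimal sketch (not taken from the patch set) of the two lookup styles under discussion: asking the driver through a callback on every check, versus reading a value cached once at registration. All `demo_` names and structures are hypothetical stand-ins for `struct ib_device` and its callbacks; which style is actually cheaper would still need the benchmark mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_PORTS 2	/* hypothetical; ports are numbered from 1 */

enum demo_transport {
	DEMO_TRANSPORT_IB,
	DEMO_TRANSPORT_IWARP,
	DEMO_TRANSPORT_IBOE,
};

struct demo_device {
	/* style 1: ask the driver every time */
	enum demo_transport (*query_transport)(struct demo_device *dev,
					       uint8_t port);
	/* style 2: per-port attribute, filled in once at registration */
	enum demo_transport cached_transport[DEMO_MAX_PORTS + 1];
};

/* A dual-personality driver: port 1 is IB, port 2 is IBoE */
static enum demo_transport demo_mlx4_like_query(struct demo_device *dev,
						uint8_t port)
{
	(void)dev;
	return port == 1 ? DEMO_TRANSPORT_IB : DEMO_TRANSPORT_IBOE;
}

/* Run the callback once per port and remember the answer */
static void demo_register(struct demo_device *dev)
{
	for (uint8_t port = 1; port <= DEMO_MAX_PORTS; port++)
		dev->cached_transport[port] = dev->query_transport(dev, port);
}

/* Hot-path lookup: a plain array read, no indirect call */
static inline enum demo_transport
demo_port_transport(const struct demo_device *dev, uint8_t port)
{
	assert(port >= 1 && port <= DEMO_MAX_PORTS);
	return dev->cached_transport[port];
}
```

Both styles return the same answer; the cached form only moves the indirect call out of the hot path and assumes the transport of a port does not change after registration.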

>
>> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
>> index f93eb8d..83370de 100644
>> --- a/drivers/infiniband/core/verbs.c
>> +++ b/drivers/infiniband/core/verbs.c
>> @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
>> if (device->get_link_layer)
>> return device->get_link_layer(device, port_num);
>>
>> - switch (rdma_node_get_transport(device->node_type)) {
>> + switch (device->query_transport(device, port_num)) {
>> case RDMA_TRANSPORT_IB:
>> + case RDMA_TRANSPORT_IBOE:
>> return IB_LINK_LAYER_INFINIBAND;
>
> If we are preserving ABI, then this looks wrong. Currently, IBOE
> returns transport IB and link layer Ethernet. It should not return
> link layer IB, it does not support IB link layer operations (such as MAD
> access).

That's my bad, IBOE is ETH link layer.
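
A minimal sketch of what the corrected fallback mapping might look like, with IBOE grouped under the Ethernet link layer. The `demo_` enums are hypothetical stand-ins for `enum rdma_transport_type` and `enum rdma_link_layer`, so this illustrates the intent rather than the actual next revision of the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for enum rdma_transport_type */
enum demo_transport {
	DEMO_TRANSPORT_IB,
	DEMO_TRANSPORT_IWARP,
	DEMO_TRANSPORT_USNIC,
	DEMO_TRANSPORT_USNIC_UDP,
	DEMO_TRANSPORT_IBOE,
};

/* Hypothetical stand-in for enum rdma_link_layer */
enum demo_link_layer {
	DEMO_LINK_LAYER_UNSPECIFIED,
	DEMO_LINK_LAYER_INFINIBAND,
	DEMO_LINK_LAYER_ETHERNET,
};

/* Fallback used when a driver provides no get_link_layer() callback */
static inline enum demo_link_layer
demo_default_link_layer(enum demo_transport tp)
{
	switch (tp) {
	case DEMO_TRANSPORT_IB:
		return DEMO_LINK_LAYER_INFINIBAND;
	case DEMO_TRANSPORT_IBOE:	/* IB transport carried over Ethernet */
	case DEMO_TRANSPORT_IWARP:
	case DEMO_TRANSPORT_USNIC:
	case DEMO_TRANSPORT_USNIC_UDP:
		return DEMO_LINK_LAYER_ETHERNET;
	}
	return DEMO_LINK_LAYER_UNSPECIFIED;
}

/* Small self-check of the two cases discussed above */
static inline void demo_check_link_layers(void)
{
	assert(demo_default_link_layer(DEMO_TRANSPORT_IB) ==
	       DEMO_LINK_LAYER_INFINIBAND);
	assert(demo_default_link_layer(DEMO_TRANSPORT_IBOE) ==
	       DEMO_LINK_LAYER_ETHERNET);
}
```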

>
[snip]
>> };
>
> I'm also concerned about this. I would like to see this enum
> essentially turned into a bitmap. One that is constructed in such a way
> that we can always get the specific test we need with only one compare
> against the overall value. In order to do so, we need to break it down
> into the essential elements that are part of each of the transports.
> So, for instance, we can define the two link layers we have so far, plus
> reserve one for OPA which we know is coming:

The idea sounds interesting, but frankly speaking I'm already starting to
worry about the size of this patch set...

I would really prefer to move optimizing/reforming work like this into the next
stage, after this pioneer patch set settles down and works stably. After all, we
have already gotten rid of the old transport helpers, so reforming based on
that should be far easier and clearer.

The next version will be reorganized to separate the implementation from the
wrapper replacement, which makes the patch set even bigger. Fortunately, since
the logic is not very complex, we are still able to handle it. I would really
prefer that we focus on performance and conciseness after the infrastructure is
built up.

>
> RDMA_LINK_LAYER_IB = 0x00000001,
> RDMA_LINK_LAYER_ETH = 0x00000002,
> RDMA_LINK_LAYER_OPA = 0x00000004,
> RDMA_LINK_LAYER_MASK = 0x0000000f,
[snip]
>
> From patch 2/17:
>
>
>> +static inline int rdma_ib_mgmt(struct ib_device *device, u8 port_num)
>> +{
>> + enum rdma_transport_type tp = device->query_transport(device,
>> port_num);
>> +
>> + return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE);
>> +}
>
> This looks wrong. IBOE doesn't have IB management. At least it doesn't
> have subnet management.

This helper could actually be erased in the end :-) After Sean's suggestion on
the cma stuff, nowhere needs this raw helper anymore; cap_ib_cm(), cap_iw_cm()
and cap_ib_mad() are enough.

>
> Actually, reading through the remainder of the patches, there is some
> serious confusion taking place here. In later patches, you use this as
> a surrogate for cap_cm, which implies you are talking about connection
> management. This is very different than the rdma_dev_ib_mgmt() test
> that I create above, which specifically refers to IB management tasks
> unique to IB/OPA: MAD, SM, multicast.
[snip]
>> +static inline int cap_ib_mad(struct ib_device *device, u8 port_num)
>> +{
>> + return rdma_ib_mgmt(device, port_num);
>> +}
>> +
>
> Why add cap_ib_mad? It's nothing more than rdma_port_ib_fabric_mgmt
> with a new name. Just use rdma_port_ib_fabric_mgmt() everywhere you
> have cap_ib_mad.

It would be excellent to use more concise semantics to address the
requirement, but I really want to leave this for the next stage since it sounds
like not a small topic...

At this stage I suggest we focus on:
1. erase all the places using the old transport/link-layer helpers
2. classify helpers for each management branch somewhat accurately
3. make sure it's stable and works well (most important!)

So we can do further reforming based on that milestone in the future ;-)

>
[snip]
>
> rdma_port_get_read_sge(dev, port)
> {
> if (rdma_transport_is_iwarp)
> return 1;
> return dev->port[port]->max_sge;
> }
>
> Then, as Jason points out, if at some point in the future the kernel is
> modified to support devices with asymmetrical read/write SGE sizes, this
> function can be modified to support those devices.

This part is actually a big topic too... frankly speaking I would prefer that
some expert in that area reform the stuff in the future and give it good testing :-)

>
> Patch 10/17:
>
> As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm
> handles iw devices, but you notice all of the functions you modify here
> start with ib_. The iwarp connections are funneled through iw_ specific
> function variants, and so even though the cm handles iwarp, ib, and roce
> devices, you never see anything other than ib/iboe (and opa in the
> future) get to the ib_ variants of the functions. So, they wrote the
> original tests as tests against the link layer being ethernet and used
> that to differentiate between ib and iboe devices. It works, but can
> confuse people. So, everyplace that has !rdma_transport_ib should
> really be rdma_dev_is_iboe instead. If we ever merge the iw_ and ib_
> functions in the future, having this right will help avoid problems.

Exactly, we noticed that the name transport does confuse people. The next
version will use rdma_tech_iboe() to distinguish it from the transport stuff;
I guess that will make things clearer :-)

>
> Patch 11/17:
>
> I wouldn't reform the link_layer_show except to make it compile with the
> new defines I used above.

This tries to erase the old transport/link-layer helpers, so we could have
a clean stage for further reforming ;-)

>
[snip]
>
> OK.
>
> Patch 17/17:
>
> I would drop this patch. In the future, the mlx5 driver will support
> both Ethernet and IB like mlx4 does, and we would just need to pull this
> code back to core instead of only in mlx4.

Actually we don't need that helper anymore; mlx4 can directly use its own
implementation of get_link_layer(). I just left it there as a reminder.

It doesn't make sense to put it at the core level if only mlx4/5 use it. mlx5
would have its own get_link_layer() implementation too if it's going to support
ETH ports; they just need to use that new one :-)


Regards,
Michael Wang

>
>

2015-04-09 09:46:04

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/08/2015 10:10 PM, Jason Gunthorpe wrote:
[snip]
>
>> As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm
>
> I actually really prefer cap_mandatory_grh - that is what is going on
> here. ie based on that name (as a reviewer) I'd expect to see the mad
> layer check that the mandatory GRH is always present, or blow up.

Sounds good, will be in next version :-)

Regards,
Michael Wang

>
> Some of the other checks in this file revolve around pkey, I'm not
> sure what rocee does there? cap_pkey_supported ?
>
> Jason
>

2015-04-09 12:42:34

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/08/2015 10:10 PM, Jason Gunthorpe wrote:
[snip]
>
> Some of the other checks in this file revolve around pkey, I'm not
> sure what rocee does there? cap_pkey_supported ?

I'm not sure if this counts as a capability... how shall we describe it?

Regards,
Michael Wang

>
> Jason
>

2015-04-09 14:35:54

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Wed, 2015-04-08 at 14:10 -0600, Jason Gunthorpe wrote:
> On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote:
>
> > To straighten all this out, lets break management out into the two
> > distinct types:
> >
> > rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
> > multicast. The proper test for this with my bitmap above is a simple
> > transport & RDMA_MGMT_IB test. It will be true for IB and OPA fabrics.
>
> > rdma_port_conn_mgmt() <- connection management, which we currently
> > support everything except USNIC (correct Sean?), so a test would be
> > something like !(transport & RDMA_TRANSPORT_USNIC). This is then split
> > out into two subgroups, IB style and iWARP style connection management
> > (aka, rdma_port_iw_conn_mgmt() and rdma_port_ib_conn_mgmt()). In my
> > above bitmap, since I didn't give IBOE its own transport type, these
> > subgroups still boil down to the simple tests transport & iWARP and
> > transport & IB like they do today.
>
> There is a lot more variation here than just these two tests, and those
> two tests won't scale to include OPA.
>
>                 IB  ROCEE  OPA
> SMI             Y   N      Y    (though the OPA smi looked a bit different)
> IB SMP          Y   N      N
> OPA SMP         N   N      Y
> GMP             Y   Y      Y
> SA              Y   N      Y
> PM              Y   Y      Y    (? guessing for OPA)
> CM              Y   Y      Y
> GMP needs GRH   N   Y      N
>

You can still break this down to a manageable bitmap.

SMI, SMP, and SA are all essentially the same and can be combined to one
bitmap that is

IB_SM 0x1
OPA_SM 0x2

and the defines are such that IB devices define IB_SM, and OPA devices
define IB_SM and OPA_SM. Any minor differences between OPA and IB can
be handled by testing just the OPA_SM bit. This will exclude all IBOE
devices and iWARP devices.

GMP, PM, and CM are all the same, and are all identical to transport ==
INFINIBAND.

GMP needs GRH happens to be precisely the same as ib_dev_is_iboe.

These are exactly the tests I proposed Jason. I'm not sure I see your
point here. I guess my point is that although the scenario of all the
different items seems complex, it really does boil down to needing only
exactly what I proposed earlier to fulfill the entire test matrix.
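
As a rough illustration of the two-bit scheme described above, the following sketch encodes the rule that IB ports set `IB_SM`, OPA ports set both `IB_SM` and `OPA_SM`, and IBOE and iWARP ports set neither. The `DEMO_` prefixes and helper names are hypothetical; only the bit values 0x1 and 0x2 come from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_IB_SM	0x1u
#define DEMO_OPA_SM	0x2u

/* Per-fabric values following the rule stated above */
#define DEMO_SM_BITS_IB		(DEMO_IB_SM)
#define DEMO_SM_BITS_OPA	(DEMO_IB_SM | DEMO_OPA_SM)
#define DEMO_SM_BITS_IBOE	0x0u
#define DEMO_SM_BITS_IWARP	0x0u

/* True for any fabric that has subnet management (IB and OPA) */
static inline bool demo_has_ib_sm(uint32_t bits)
{
	return (bits & DEMO_IB_SM) != 0;
}

/* True only where the OPA-specific differences apply */
static inline bool demo_has_opa_sm(uint32_t bits)
{
	return (bits & DEMO_OPA_SM) != 0;
}

/* Small self-check: OPA is a superset of IB, IBOE has neither bit */
static inline void demo_check_sm_bits(void)
{
	assert(demo_has_ib_sm(DEMO_SM_BITS_OPA));
	assert(demo_has_opa_sm(DEMO_SM_BITS_OPA));
	assert(!demo_has_ib_sm(DEMO_SM_BITS_IBOE));
}
```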


--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-09 16:01:32

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Thu, Apr 09, 2015 at 02:42:24PM +0200, Michael Wang wrote:
> On 04/08/2015 10:10 PM, Jason Gunthorpe wrote:
> [snip]
> >
> > Some of the other checks in this file revolve around pkey, I'm not
> > sure what rocee does there? cap_pkey_supported ?
>
> I'm not sure if this counts as a capability... how shall we describe it?

I'm not sure how rocee uses pkey, but maybe the GRH and pkey thing
would work well together under a single 'cap_ethernet_ah' ?

Jason

2015-04-09 16:01:38

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Thu, Apr 09, 2015 at 10:34:30AM -0400, Doug Ledford wrote:

> These are exactly the tests I proposed Jason. I'm not sure I see your
> point here. I guess my point is that although the scenario of all the
> different items seems complex, it really does boil down to needing only
> exactly what I proposed earlier to fulfill the entire test matrix.

I have no problem with minimizing a bitmap, but I want the accessors
to make sense first.

My specific problem with your suggestion was combining cap_ib_mad,
cap_ib_sa, and cap_ib_smi into rdma_port_ib_fabric_mgmt.

Not only do the three cap things not return the same value for all
situations, the documentary knowledge is lost by the reduction.

I'd prefer we look at this from a 'what do the call sites need' view,
not a 'how do we minimize' view.

I've written this before: The mess here is that it is too hard to know
what the call sites are actually checking for when it is some baroque
conditional.

Jason

2015-04-09 21:19:24

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Thu, 2015-04-09 at 10:01 -0600, Jason Gunthorpe wrote:
> On Thu, Apr 09, 2015 at 10:34:30AM -0400, Doug Ledford wrote:
>
> > These are exactly the tests I proposed Jason. I'm not sure I see your
> > point here. I guess my point is that although the scenario of all the
> > different items seems complex, it really does boil down to needing only
> > exactly what I proposed earlier to fulfill the entire test matrix.
>
> I have no problem with minimizing a bitmap, but I want the accessors
> to make sense first.
>
> My specific problem with your suggestion was combining cap_ib_mad,
> cap_ib_sa, and cap_ib_smi into rdma_port_ib_fabric_mgmt.
>
> Not only do the three cap things not return the same value for all
> situations, the documentary knowledge is lost by the reduction.
>
> I'd prefer we look at this from a 'what do the call sites need' view,
> not a 'how do we minimize' view.
>
> I've written this before: The mess here is that it is too hard to know
> what the call sites are actually checking for when it is some baroque
> conditional.

The two goals: being specific about what the test is returning and
minimizing the bitmap footprint; are not necessarily opposed. One can
do both at the same time.

--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-09 21:37:14

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Thu, Apr 09, 2015 at 05:19:08PM -0400, Doug Ledford wrote:

> The two goals: being specific about what the test is returning and
> minimizing the bitmap footprint; are not necessarily opposed. One can
> do both at the same time.

Agree

Jason

2015-04-10 06:16:34

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

First off there are 2 separate issues here:

1) We need to communicate if a port supports or requires various management
support from the ib_mad, ib_cm, and/or ib_sa modules.

2) We need to communicate how addresses are formatted and resolved for a
particular port


In general I don't think we need to remove all uses of the Transport
or Link Layer.

Although we may be able to remove most of the transport uses.

On Wed, Apr 08, 2015 at 02:10:15PM -0600, Jason Gunthorpe wrote:
> On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote:
>
> > To straighten all this out, lets break management out into the two
> > distinct types:
> >
> > rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
> > multicast. The proper test for this with my bitmap above is a simple
> > transport & RDMA_MGMT_IB test. It will be true for IB and OPA fabrics.
>
> > rdma_port_conn_mgmt() <- connection management, which we currently
> > support everything except USNIC (correct Sean?), so a test would be
> > something like !(transport & RDMA_TRANSPORT_USNIC). This is then split
> > out into two subgroups, IB style and iWARP style connection management
> > (aka, rdma_port_iw_conn_mgmt() and rdma_port_ib_conn_mgmt()). In my
> > above bitmap, since I didn't give IBOE its own transport type, these
> > subgroups still boil down to the simple tests transport & iWARP and
> > transport & IB like they do today.
>
> There is a lot more variation here than just these two tests, and those
> two tests won't scale to include OPA.
>
>                 IB  ROCEE  OPA
> SMI             Y   N      Y    (though the OPA smi looked a bit different)

Yes OPA is different but it is based on the class version of the individual
MADs not any particular device/port support.

> IB SMP          Y   N      N

Correction:

  IB SMP          Y   N      Y    (OPA supports the IB NodeInfo query)

> OPA SMP         N   N      Y

How is this different from the SMI?

> GMP             Y   Y      Y
> SA              Y   N      Y
> PM              Y   Y      Y    (? guessing for OPA)
                                  ^^^
                                  Yes

> CM              Y   Y      Y
> GMP needs GRH   N   Y      N
>
> It may be unrealistic, but I was hoping we could largely scrub the
> opaque 'is spec iWARP, is spec ROCEE' kinds of tests because they
> don't tell anyone what it is the code cares about.

I somewhat agree except for things like addressing. In the area of addressing
I think we are likely to need to define something like "cap_addr_ib",
"cap_addr_iboe", "cap_addr_iwarp". See below for more details.

>
> Maybe what is needed is a more precise language for the functions:
>
> > > + * cap_ib_mad - Check if the port of device has the capability
> > > Infiniband
> > > + * Management Datagrams.
>
> As used this seems to mean:
>
> True if the port can do IB/OPA SMP, or GMP management packets on QP0 or
> QP1. (Y Y Y) ie: Do we need the MAD layer at all.
>
> ib_smi seems to be true if QP0 is supported (Y N Y)
>
> Maybe the above set would make a lot more sense as:
> cap_ib_qp0
> cap_ib_qp1
> cap_opa_qp0

I disagree.

All we need right now is cap_qp0. All devices currently support QP1.

Then after all this is settled I can add:

                  IB  ROCEE  OPA
  OPA MAD Space   N   N      Y    Port is OPA MAD space.

>
> ib_cm seems to mean that the CM protocol from the IBA is used on the
> port (Y Y Y)

Agree.

>
> ib_sa means the IBA SA protocol is supported (Y Y Y)

I think this should be (Y N Y)

IBoE has no SA. The IBoE code "fabricates" a Path Record it does not need to
interact with the SA.

>
> ib_mcast true if the IBA SA protocol is used for multicast GIDs (Y N
> Y)

Given the above why can't we just have the "ib_sa" flag?

>
> ipoib means the port supports the ipoib protocol (Y N ?)


OPA does IPoIB so... (Y N Y)


However, I think checking the link layer is more appropriate here. It does not
make sense to do IP over IB over Eth, even though IBoE can do the "IB"
protocol.


Making flags in the driver to indicate which ULPs they support is a _bad_
_idea_.

FWIW: I don't consider the MAD and SA (multicast) modules ULPs. Rather they
are helper modules which are built to share code amongst the drivers to process
things on behalf of the drivers themselves. As such, advertising a need for, or
particular support within, those modules makes sense.

So although strictly speaking we could do IPoIBoEth, I think having IPoIB check
the LL and limiting itself to ports which are IB LL is appropriate.

>
> This seem reasonable and understandable, even if they are currently a
> bit duplicating.
>
> > Patch 9/17:
> >
> > Most of the other comments on this patch stand as they are. I would add
> > the test:
> >
> > rdma_port_separate_read_sge(dev, port)
> > {
> > return dev->port[port]->transport & RDMA_SEPARATE_READ_SGE;
> > }
> >
> > and add the helper function:
> >
> > rdma_port_get_read_sge(dev, port)
> > {
> > if (rdma_transport_is_iwarp)
> > return 1;
> > return dev->port[port]->max_sge;
> > }
>
> Hum, that is nice, but it doesn't quite fit with how the ULP needs to
> work. The max limit when creating a WR is the value passed into the
> qp_cap, not the device maximum limit.
>
> To do this properly we need to extend the qp_cap, and that is just too
> big a change. A one bit iWarp quirk is OK for now.

I agree. This is the one place we probably want to just keep the "Transport"
check.

>
> > As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm
>
> I actually really prefer cap_mandatory_grh - that is what is going on
> here. ie based on that name (as a reviewer) I'd expect to see the mad
> layer check that the mandatory GRH is always present, or blow up.

While a mandatory GRH (for the GMP) is what this is, the function
ib_init_ah_from_path is, generically, really handling an "IBoE address" to send
to, and therefore we need to force the GRH in the AH.

There is a whole slew of LL and Transport checks which are used to
format/resolve addresses.

Functions which need to know that the address format is "IBoE"

-- cma.c: cma_acquire_dev
-- cma.c: cma_modify_qp_rtr
-- sa_query.c: ib_init_ah_from_path
-- verbs.c: ib_resolve_eth_l2_attrs

What about a check rdma_port_req_iboe_addr()?

Functions where AF_IB is checked against LL because it does not make sense to
use AF_IB on anything but an IB LL

-- cma.c: cma_listen_on_dev
-- cma.c: cma_bind_loopback

These checks should be directly against the LL

Functions where the ARP type is checked against the link layer.

-- cma.c: cma_acquire_dev
-- cma.c: cma_bind_loopback

These checks should be directly against the LL

>
> Some of the other checks in this file revolve around pkey, I'm not
> sure what rocee does there? cap_pkey_supported ?

It seems IBoE just hardcodes the pkey to 0xffff. I don't see it used anywhere.

Function where port requires "real" PKey

-- cma.c: cma_ib_init_qp_attr

Check rdma_port_req_pkey()?
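
A minimal sketch of how the two proposed checks could look, assuming a per-port record that carries both the transport and the link layer. The `demo_` types and the exact conditions are assumptions made for illustration: "requires an IBoE address" is taken to mean IB transport on an Ethernet link layer, and "requires a real PKey" is taken to mean IB transport on an IB link layer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port description; not the kernel's structures */
enum demo_transport { DEMO_TP_IB, DEMO_TP_IWARP };
enum demo_link_layer { DEMO_LL_IB, DEMO_LL_ETH };

struct demo_port {
	enum demo_transport transport;
	enum demo_link_layer link_layer;
};

/*
 * IBoE addressing: GRH forced in the AH, path record fabricated
 * instead of being queried from the SA.
 */
static inline bool demo_port_req_iboe_addr(const struct demo_port *p)
{
	assert(p);
	return p->transport == DEMO_TP_IB && p->link_layer == DEMO_LL_ETH;
}

/*
 * "Real" PKey: assumed here to apply only to IB transport on an IB
 * link layer, since IBoE hardcodes the pkey to 0xffff.
 */
static inline bool demo_port_req_pkey(const struct demo_port *p)
{
	assert(p);
	return p->transport == DEMO_TP_IB && p->link_layer == DEMO_LL_IB;
}
```

Named this way, call sites such as ib_init_ah_from_path or cma_ib_init_qp_attr would state what they need rather than which technology they happen to be running on.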


Overall, for the addressing choices:

The "Transport" (or protocol, or whatever) is Verbs. The Layer below Verbs
(OPA/IB/Eth/TCP) defines how addressing, route, and connection information is
generated, communicated, and used.

As Jason and Doug have been saying sometimes we want to know when that requires
SA interaction or the use of the CM protocol (or neither). Other times we just
need to know what the Address format or Link Layer is.


Ira

2015-04-10 07:46:57

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/09/2015 11:19 PM, Doug Ledford wrote:
[snip]
>>
>> I've written this before: The mess here is that it is too hard to know
>> what the call sites are actually checking for when it is some baroque
>> conditional.
>
> The two goals: being specific about what the test is returning and
> minimizing the bitmap footprint; are not necessarily opposed. One can
> do both at the same time.

This could be an internal reform after the cap_XX() stuff works for the core
layer. At that time we won't need to touch the core layer anymore; we would just
introduce this bitmap stuff in the verbs layer, replacing the implementation
of these helpers with the bitmap check while following the same semantics
(description).

Regards,
Michael Wang

>

2015-04-10 07:48:21

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

> > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > index 18c1ece..a9587c4 100644
> > --- a/drivers/infiniband/core/device.c
> > +++ b/drivers/infiniband/core/device.c
> > @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
> > } mandatory_table[] = {
> > IB_MANDATORY_FUNC(query_device),
> > IB_MANDATORY_FUNC(query_port),
> > + IB_MANDATORY_FUNC(query_transport),
> > IB_MANDATORY_FUNC(query_pkey),
> > IB_MANDATORY_FUNC(query_gid),
> > IB_MANDATORY_FUNC(alloc_pd),
>
> I'm concerned about the performance implications of this. The size of
> this patchset already points out just how many places in the code we
> have to check for various aspects of the device transport in order to do
> the right thing. Without going through the entire list to see how many
> are on critical hot paths, I'm sure some of them are on at least
> partially critical hot paths (like creation of new connections). I
> would prefer to see this change be implemented via a device attribute,
> not a functional call query. That adds a needless function call in
> these paths.

I like the idea of a query_transport but at the same time would like to see the
use of "transport" reduced. A reduction in the use of this call could
eliminate most performance concerns.

So can we keep this abstraction if at the end of the series we limit its use?

>
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index f93eb8d..83370de 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
> > if (device->get_link_layer)
> > return device->get_link_layer(device, port_num);
> >
> > - switch (rdma_node_get_transport(device->node_type)) {
> > + switch (device->query_transport(device, port_num)) {
> > case RDMA_TRANSPORT_IB:
> > + case RDMA_TRANSPORT_IBOE:
> > return IB_LINK_LAYER_INFINIBAND;
>
> > If we are preserving ABI, then this looks wrong. Currently, IBOE
> > returns transport IB and link layer Ethernet. It should not return
> link layer IB, it does not support IB link layer operations (such as MAD
> access).

I think the original code has the bug.

IBoE devices currently return a transport of IB but they probably never get
here because they support the get_link_layer callback used a few lines above.
So this "bug" was probably never hit.
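
The point about the callback hiding the fallback can be sketched as follows. This is a simplified, hypothetical model of rdma_port_get_link_layer() (the `demo_` names and the boolean transport field are assumptions): a driver-supplied callback is consulted first, so the legacy fallback is only reached by drivers that do not implement it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum demo_link_layer {
	DEMO_LL_UNSPECIFIED,
	DEMO_LL_INFINIBAND,
	DEMO_LL_ETHERNET,
};

struct demo_device {
	int transport_is_ib;	/* legacy view: IB and IBoE both report "IB" */
	enum demo_link_layer (*get_link_layer)(struct demo_device *dev,
					       uint8_t port);
};

/* What an IBoE-capable driver would provide */
static enum demo_link_layer demo_iboe_get_link_layer(struct demo_device *dev,
						     uint8_t port)
{
	(void)dev;
	(void)port;
	return DEMO_LL_ETHERNET;
}

static enum demo_link_layer demo_port_get_link_layer(struct demo_device *dev,
						     uint8_t port)
{
	assert(dev != NULL);
	/* The driver callback wins whenever it exists */
	if (dev->get_link_layer)
		return dev->get_link_layer(dev, port);
	/* Legacy fallback: would misreport an IBoE port as InfiniBand */
	return dev->transport_is_ib ? DEMO_LL_INFINIBAND : DEMO_LL_ETHERNET;
}
```

With the callback set, an IBoE device reports Ethernet; with it cleared, the same device would fall through and report InfiniBand, which is the latent problem described above.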

>
> > case RDMA_TRANSPORT_IWARP:
> > case RDMA_TRANSPORT_USNIC:
> > case RDMA_TRANSPORT_USNIC_UDP:
> > return IB_LINK_LAYER_ETHERNET;
> > default:
> > + BUG();
> > return IB_LINK_LAYER_UNSPECIFIED;
> > }
> > }
>
> [ snip ]
>
> > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > index 65994a1..d54f91e 100644
> > --- a/include/rdma/ib_verbs.h
> > +++ b/include/rdma/ib_verbs.h
> > @@ -75,10 +75,13 @@ enum rdma_node_type {
> > };
> >
> > enum rdma_transport_type {
> > + /* legacy for users */
> > RDMA_TRANSPORT_IB,
> > RDMA_TRANSPORT_IWARP,
> > RDMA_TRANSPORT_USNIC,
> > - RDMA_TRANSPORT_USNIC_UDP
> > + RDMA_TRANSPORT_USNIC_UDP,
> > + /* new transport */
> > + RDMA_TRANSPORT_IBOE,
> > };
>
> I'm also concerned about this. I would like to see this enum
> essentially turned into a bitmap. One that is constructed in such a way
> that we can always get the specific test we need with only one compare
> against the overall value. In order to do so, we need to break it down
> into the essential elements that are part of each of the transports.
> So, for instance, we can define the two link layers we have so far, plus
> reserve one for OPA which we know is coming:
>
> RDMA_LINK_LAYER_IB = 0x00000001,
> RDMA_LINK_LAYER_ETH = 0x00000002,
> RDMA_LINK_LAYER_OPA = 0x00000004,
> RDMA_LINK_LAYER_MASK = 0x0000000f,

I would reserve more bits here.

>
> We can then define the currently known high level transport types:
>
> RDMA_TRANSPORT_IB = 0x00000010,
> RDMA_TRANSPORT_IWARP = 0x00000020,
> RDMA_TRANSPORT_USNIC = 0x00000040,
> RDMA_TRANSPORT_USNIC_UDP = 0x00000080,
> RDMA_TRANSPORT_MASK = 0x000000f0,

I would reserve more bits here.

>
> We could then define bits for the IB management types:
>
> RDMA_MGMT_IB = 0x00000100,
> RDMA_MGMT_OPA = 0x00000200,
> RDMA_MGMT_MASK = 0x00000f00,

We at least need bits for SA / CM support.

I said previously that all device types support QP1; I was wrong... I forgot
about USNIC devices. So the full management bit mask is:


RDMA_MGMT_IB_MAD = 0x00000100,
RDMA_MGMT_QP0 = 0x00000200,
RDMA_MGMT_SA = 0x00000400,
RDMA_MGMT_CM = 0x00000800,
RDMA_MGMT_OPA_MAD = 0x00001000,
RDMA_MGMT_MASK = 0x000fff00,

With a couple of spares.

The MAD stack is pretty agnostic to the types of MADs passing through it so we
don't really need PM flags etc.
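
To show how these flags could back named accessors, here is a minimal sketch. The `RDMA_MGMT_*` values are the ones listed above; the per-fabric combinations (`DEMO_MGMT_*`) and the `demo_cap_*` helpers are assumptions derived from the Y/N tables earlier in this thread, not definitions from any patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as listed above */
#define RDMA_MGMT_IB_MAD	0x00000100u
#define RDMA_MGMT_QP0		0x00000200u
#define RDMA_MGMT_SA		0x00000400u
#define RDMA_MGMT_CM		0x00000800u
#define RDMA_MGMT_OPA_MAD	0x00001000u
#define RDMA_MGMT_MASK		0x000fff00u

/* Assumed per-fabric sets, following the tables in this thread */
#define DEMO_MGMT_IB	(RDMA_MGMT_IB_MAD | RDMA_MGMT_QP0 | \
			 RDMA_MGMT_SA | RDMA_MGMT_CM)
#define DEMO_MGMT_IBOE	(RDMA_MGMT_IB_MAD | RDMA_MGMT_CM)
#define DEMO_MGMT_OPA	(DEMO_MGMT_IB | RDMA_MGMT_OPA_MAD)
#define DEMO_MGMT_USNIC	0u	/* no MAD support at all */

/* One named accessor per question a call site may ask */
static inline bool demo_cap_ib_mad(uint32_t caps)
{
	return (caps & RDMA_MGMT_IB_MAD) != 0;
}

static inline bool demo_cap_ib_smi(uint32_t caps)
{
	return (caps & RDMA_MGMT_QP0) != 0;
}

static inline bool demo_cap_ib_sa(uint32_t caps)
{
	return (caps & RDMA_MGMT_SA) != 0;
}

static inline bool demo_cap_ib_cm(uint32_t caps)
{
	return (caps & RDMA_MGMT_CM) != 0;
}

/* Every management flag must fit inside the reserved mask */
static inline void demo_check_mgmt_bits(void)
{
	assert((DEMO_MGMT_OPA & ~RDMA_MGMT_MASK) == 0);
}
```

Each accessor is still a single mask test, yet the call site says which capability it depends on, which is the property both sides of the discussion asked for.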

>
> Then we have space to define specific quirks:
>
> RDMA_SEPARATE_READ_SGE = 0x00001000,
> RDMA_QUIRKS_MASK = 0xfffff000

shift for spares...

RDMA_SEPARATE_READ_SGE = 0x00100000,
RDMA_QUIRKS_MASK = 0xfff00000

>
> Once those are defined, a few definitions for device drivers to use when
> they initialize a device to set the bitmap to the right values:
>
> #define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP |
> RDMA_SEPARATE_READ_SGE)
> #define IS_IB (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
> #define IS_IBOE (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
> #define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)
> #define IS_OPA (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB |
> RDMA_MGMT_OPA)
>
> Then you need to define the tests:
>
> static inline bool
> rdma_transport_is_iwarp(struct ib_device *dev, u8 port)
> {
> return dev->port[port]->transport & RDMA_TRANSPORT_IWARP;
> }
>
> /* Note: this intentionally covers IB, IBOE, and OPA...use
> rdma_dev_is_ib if you want to only get physical IB devices */
> static inline bool
> rdma_transport_is_ib(struct ib_device *dev, u8 port)
> {
> return dev->port[port]->transport & RDMA_TRANSPORT_IB;
> }
>
> rdma_port_is_ib(struct ib_device *dev, u8 port)

I prefer

rdma_port_link_layer_is_ib
rdma_port_link_layer_is_eth


> {
> return dev->port[port]->transport & RDMA_LINK_LAYER_IB;
> }
>
> rdma_port_is_iboe(struct ib_device *dev, u8 port)

I'm not sure what this means.

rdma_port_req_iboe_addr seems more appropriate because what we really need to
know is that this device requires an IBoE address format. (PKey is "fake",
PathRecord is fabricated rather than queried, GRH is required in the AH
conversion.)

> {
> return (dev->port[port]->transport & IS_IBOE) == IS_IBOE;
> }
>
> rdma_port_is_usnic(struct ib_device *dev, u8 port)

rdma_transport_is_usnic

> {
> return dev->port[port]->transport & RDMA_TRANSPORT_USNIC;
> }
>
> rdma_port_is_opa(struct ib_device *dev, u8 port)

rdma_port_link_layer_is_opa

> {
> return dev->port[port]->transport & RDMA_LINK_LAYER_OPA;
> }
>
> rdma_port_is_iwarp(struct ib_device *dev, u8 port)
> {
> return rdma_transport_is_iwarp(dev, port);

Why not call rdma_transport_is_iwarp?

> }
>
> rdma_port_ib_fabric_mgmt(struct ibdev *dev, u8 port)
> {
> return dev->port[port]->transport & RDMA_MGMT_IB;

I agree with Jason that this does not adequately describe the functionality we
are looking for.

> }
>
> rdma_port_opa_mgmt(struct ibdev *dev, u8 port)

Agree.

> {
> return dev->port[port]->transport & RDMA_MGMT_OPA;
> }
>
> Other things can be changed too. Like rdma_port_get_link_layer can
> become this:
>
> {
> return dev->transport & RDMA_LINK_LAYER_MASK;
> }
>
> From patch 2/17:
>
>
> > +static inline int rdma_ib_mgmt(struct ib_device *device, u8 port_num)
> > +{
> > + enum rdma_transport_type tp = device->query_transport(device,
> > port_num);
> > +
> > + return (tp == RDMA_TRANSPORT_IB || tp == RDMA_TRANSPORT_IBOE);
> > +}
>
> This looks wrong. IBOE doesn't have IB management. At least it doesn't
> have subnet management.

Right that is why we need a bit for CM vs SA capability.

>
> Actually, reading through the remainder of the patches, there is some
> serious confusion taking place here. In later patches, you use this as
> a surrogate for cap_cm, which implies you are talking about connection
> management. This is very different than the rdma_dev_ib_mgmt() test
> that I create above, which specifically refers to IB management tasks
> unique to IB/OPA: MAD, SM, multicast.

multicast is part of SA so should be covered by the SA capability.

>
> The kernel connection management code is not really limited. It
> supports IB, IBOE, iWARP, and in the future it will support OPA. There
> are some places in the CM code were we test for just IB/IBOE currently,
> but that's only because we split iWARP out higher up in the abstraction
> hierarchy. So, calling something rdma_ib_mgmt and meaning a rather
> specialized test in the CM is probably misleading.

Right! So we should have a CM capability.

>
> To straighten all this out, lets break management out into the two
> distinct types:
>
> rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
> multicast. The proper test for this with my bitmap above is a simple
> transport & RDMA_MGMT_IB test. It will be true for IB and OPA fabrics.

General management is covered by

RDMA_MGMT_IB_MAD = 0x00000100,
RDMA_MGMT_QP0 = 0x00000200,
...
RDMA_MGMT_OPA_MAD = 0x00001000,


>
> rdma_port_conn_mgmt() <- connection management, which we currently
> support everything except USNIC (correct Sean?), so a test would be
> something like !(transport & RDMA_TRANSPORT_USNIC). This is then split
> out into two subgroups, IB style and iWARP style connection management
> (aka, rdma_port_iw_conn_mgmt() and rdma_port_ib_conn_mgmt()). In my
> above bitmap, since I didn't give IBOE its own transport type, these
> subgroups still boil down to the simple tests transport & iWARP and
> transport & IB like they do today.

Specific management features CM, route resolution (SA), and special Multicast
management requirements (SA) are covered by:

RDMA_MGMT_SA = 0x00000400,
RDMA_MGMT_CM = 0x00000800,


>
> From patch 3/17:
>
>
> > +/**
> > + * cap_ib_mad - Check if the port of device has the capability
> > Infiniband
> > + * Management Datagrams.
> > + *
> > + * @device: Device to be checked
> > + * @port_num: Port number of the device
> > + *
> > + * Return 0 when port of the device don't support Infiniband
> > + * Management Datagrams.
> > + */
> > +static inline int cap_ib_mad(struct ib_device *device, u8 port_num)
> > +{
> > + return rdma_ib_mgmt(device, port_num);
> > +}
> > +
>
> Why add cap_ib_mad? It's nothing more than rdma_port_ib_fabric_mgmt
> with a new name. Just use rdma_port_ib_fabric_mgmt() everywhere you
> have cap_ib_mad.

Because USNIC apparently does not support MADs at all. So we end up needing a
"big flag" to turn on/off ib_mad.

RDMA_MGMT_IB_MAD = 0x00000100,

>
> From patch 4/17:
>
>
> > +/**
> > + * cap_ib_smi - Check if the port of device has the capability
> > Infiniband
> > + * Subnet Management Interface.
> > + *
> > + * @device: Device to be checked
> > + * @port_num: Port number of the device
> > + *
> > + * Return 0 when port of the device don't support Infiniband
> > + * Subnet Management Interface.
> > + */
> > +static inline int cap_ib_smi(struct ib_device *device, u8 port_num)
> > +{
> > + return rdma_transport_ib(device, port_num);
> > +}
> > +
>
> Same as the previous patch. This is needless indirection. Just use
> rdma_port_ib_fabric_mgmt directly.

No this is not the same... You said:

<quote>
rdma_port_ib_fabric_mgmt() <- fabric specific management tasks: MAD, SM,
multicast. The proper test for this with my bitmap above is a simple
transport & RDMA_MGMT_IB test. It will be true for IB and OPA fabrics.
</quote>

But what we are looking for here is "does the port support QP0?", which was
previously flagged as "smi". Cover it with this flag:

RDMA_MGMT_QP0 = 0x00000200,

So the optimized version of the above is:

static inline int cap_ib_smi(struct ib_device *device, u8 port_num)
{
	return [<device> <port>]->flags & RDMA_MGMT_QP0;
}
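
A minimal, compilable sketch of the flag test above. The sketch_port and
sketch_device types and the zero-based port array are hypothetical stand-ins
for wherever the per-port flags would end up living; only the bit values are
taken from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Management capability bits, with the values proposed in this thread. */
enum rdma_mgmt_flags {
	RDMA_MGMT_IB_MAD  = 0x00000100,
	RDMA_MGMT_QP0     = 0x00000200,
	RDMA_MGMT_SA      = 0x00000400,
	RDMA_MGMT_CM      = 0x00000800,
	RDMA_MGMT_OPA_MAD = 0x00001000,
};

/* Hypothetical per-port state; not the real struct ib_device layout. */
struct sketch_port {
	uint32_t flags;
};

struct sketch_device {
	struct sketch_port port[2];	/* indexed from 0 in this sketch */
};

/* "Does this port support MADs at all?" */
static inline bool cap_ib_mad(const struct sketch_device *device,
			      uint8_t port_num)
{
	return (device->port[port_num].flags & RDMA_MGMT_IB_MAD) != 0;
}

/* "Does this port support QP0?" (previously flagged as "smi") */
static inline bool cap_ib_smi(const struct sketch_device *device,
			      uint8_t port_num)
{
	return (device->port[port_num].flags & RDMA_MGMT_QP0) != 0;
}
```

With this shape each helper is a single mask test, and a port that carries
MADs but has no QP0 (as described for IBoE in this thread) simply leaves the
QP0 bit clear.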


>
> Patch 5/17:
>
> Again, just use rdma_port_ib_conn_mgmt() directly.

Agreed.

>
> Patch 6/17:
>
> Again, just use rdma_port_ib_fabric_mgmt() directly.

No this needs to be

rdma_port_requires_sa()

or something...

>
> Patch 7/17:
>
> Again, just use rdma_port_ib_fabric_mgmt() directly. It's perfectly
> applicable to the IB mcast registration requirements.

No this needs to be

rdma_port_requires_sa()

or something...

>
> Patch 8/17:
>
> Here we can create a new test if we are using the bitmap I created
> above:
>
> rdma_port_ipoib(struct ib_device *dev, u8 port)
> {
> return !(dev->port[port]->transport & RDMA_LINK_LAYER_ETH);

I would prefer a link layer (or generic) bit mask rather than '&' of
"transport" and "link layer". That just seems wrong.

> }
>
> This is presuming that OPA will need ipoib devices. This will cause all
> non-Ethernet link layer devices to return true, and right now, that is
> all IB and all OPA devices.

Agreed.

>
> Patch 9/17:
>
> Most of the other comments on this patch stand as they are. I would add
> the test:
>
> rdma_port_separate_read_sge(dev, port)
> {
> return dev->port[port]->transport & RDMA_SEPARATE_READ_SGE;
> }
>
> and add the helper function:
>
> rdma_port_get_read_sge(dev, port)
> {
> if (rdma_transport_is_iwarp(dev, port))
> return 1;
> return dev->port[port]->max_sge;
> }
>
> Then, as Jason points out, if at some point in the future the kernel is
> modified to support devices with asymmetrical read/write SGE sizes, this
> function can be modified to support those devices.
>
> Patch 10/17:
>
> As Sean pointed out, force_grh should be rdma_dev_is_iboe(). The cm
> handles iw devices, but you notice all of the functions you modify here
> start with ib_. The iwarp connections are funneled through iw_ specific
> function variants, and so even though the cm handles iwarp, ib, and roce
> devices, you never see anything other than ib/iboe (and opa in the
> future) get to the ib_ variants of the functions. So, they wrote the
> original tests as tests against the link layer being ethernet and used
> that to differentiate between ib and iboe devices. It works, but can
> confuse people. So, every place that has !rdma_transport_ib should
> really be rdma_dev_is_iboe instead. If we ever merge the iw_ and ib_
> functions in the future, having this right will help avoid problems.

I guess rdma_dev_is_iboe is ok. But it seems like we are keying off the
addresses not necessarily the devices.

>
> Patch 11/17:
>
> I wouldn't reform the link_layer_show except to make it compile with the
> new defines I used above.
>
> Patch 12/17:
>
> Go ahead and add a helper to check all ports on a dev, just make it
> rdma_hca_ib_conn_mgmt() and have it loop through
> rdma_port_ib_conn_mgmt() return codes.
>
> Patch 13/17:
>
> This patch is largely unneeded if we reworked the bitmap like I have
> above. A lot of the changes you made to switch from case statements to
> multiple if statements can go back to being case statements because in
> the bitmap IB and IBOE are still both transport IB, so you just do the
> case on the transport bits and not on the link layer bits.
>
> Patch 14/17:
>
> Seems ok.
>
> Patch 15/17:
>
> If you implement the bitmap like I list above, then this code will need
> fixed up to use the bitmap. Otherwise it looks OK.
>
> Patch 16/17:
>
> OK.
>
> Patch 17/17:
>
> I would drop this patch. In the future, the mlx5 driver will support
> both Ethernet and IB like mlx4 does, and we would just need to pull this
> code back to core instead of only in mlx4.

Agreed.

Ira

2015-04-10 08:19:12

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW



On 04/09/2015 06:00 PM, Jason Gunthorpe wrote:
> On Thu, Apr 09, 2015 at 02:42:24PM +0200, Michael Wang wrote:
>> On 04/08/2015 10:10 PM, Jason Gunthorpe wrote:
>> [snip]
>>>
>>> Some of the other checks in this file revolve around pkey, I'm not
>>> sure what rocee does there? cap_pkey_supported ?
>>
>> I'm not sure if this count in capability... how shall we describe it?
>
> I'm not sure how rocee uses pkey, but maybe the GRH and pkey thing
> would work well together under a single 'cap_ethernet_ah' ?

Sounds better; we can use this in all the cases that handle addressing
for the Ethernet link layer :-)

Regards,
Michael Wang

>
> Jason
>

2015-04-10 08:25:36

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/10/2015 08:16 AM, ira.weiny wrote:
> First off there are 2 separate issues here:
>
> 1) We need to communicate if a port supports or requires various management
> support from the ib_mad, ib_cm, and/or ib_sa modules.
>
> 2) We need to communicate how addresses are formatted and resolved for a
> particular port
>
>
> In general I don't think we need to remove all uses of the Transport
> or Link Layer.
>
> Although we may be able to remove most of the transport uses.
>
> On Wed, Apr 08, 2015 at 02:10:15PM -0600, Jason Gunthorpe wrote:
[snip]
>
>>
>> Some of the other checks in this file revolve around pkey, I'm not
>> sure what rocee does there? cap_pkey_supported ?
>
> It seems IBoE just hardcodes the pkey to 0xffff. I don't see it used anywhere.
>
> Function where port requires "real" PKey
>
> -- cma.c: cma_ib_init_qp_attr
>
> Check rdma_port_req_pkey()?

What about cap_eth_ah() for all the cases that need Ethernet address handling?

>
>
> Overall for the addressing choices:
>
> The "Transport" (or protocol, or whatever) is Verbs. The Layer below Verbs
> (OPA/IB/Eth/TCP) defines how addressing, route, and connection information is
> generated, communicated, and used.
>
> As Jason and Doug have been saying sometimes we want to know when that requires
> SA interaction or the use of the CM protocol (or neither). Other times we just
> need to know what the Address format or Link Layer is.

So far it looks like we can eliminate the link layer helper in the core
layer, but I'll keep that helper in the next version; if we later find we no
longer need it, let's remove it then ;-)

Regards,
Michael Wang

>
>
> Ira
>

2015-04-10 14:57:24

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 10:25:21AM +0200, Michael Wang wrote:
> On 04/10/2015 08:16 AM, ira.weiny wrote:
> > First off there are 2 separate issues here:
> >
> > 1) We need to communicate if a port supports or requires various management
> > support from the ib_mad, ib_cm, and/or ib_sa modules.
> >
> > 2) We need to communicate how addresses are formatted and resolved for a
> > particular port
> >
> >
> > In general I don't think we need to remove all uses of the Transport
> > or Link Layer.
> >
> > Although we may be able to remove most of the transport uses.
> >
> > On Wed, Apr 08, 2015 at 02:10:15PM -0600, Jason Gunthorpe wrote:
> [snip]
> >
> >>
> >> Some of the other checks in this file revolve around pkey, I'm not
> >> sure what rocee does there? cap_pkey_supported ?
> >
> > It seems IBoE just hardcodes the pkey to 0xffff. I don't see it used anywhere.
> >
> > Function where port requires "real" PKey
> >
> > -- cma.c: cma_ib_init_qp_attr
> >
> > Check rdma_port_req_pkey()?
>
> What about cap_eth_ah() for all the cases that need Ethernet address handling?

That works.

>
> >
> >
> > Overall for the addressing choices:
> >
> > The "Transport" (or protocol, or whatever) is Verbs. The Layer below Verbs
> > (OPA/IB/Eth/TCP) defines how addressing, route, and connection information is
> > generated, communicated, and used.
> >
> > As Jason and Doug have been saying sometimes we want to know when that requires
> > SA interaction or the use of the CM protocol (or neither). Other times we just
> > need to know what the Address format or Link Layer is.
>
> So far it looks like we can eliminate the link layer helper in the core
> layer, but I'll keep that helper in the next version; if we later find we no
> longer need it, let's remove it then ;-)

Eliminating Link Layer is fine if we can do it, but I still think that
something like IPoIB should check the link layer.

After sleeping on it the driver exporting cap_ipoib() does not seem _so_ bad
but I still see a distinction between ULPs like IPoIB and the other modules we
have been discussing.

Ira

> Regards,
> Michael Wang
>
> >
> >
> > Ira
> >

2015-04-10 16:16:20

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 02:16:11AM -0400, ira.weiny wrote:

> > IB ROCEE OPA
> > SMI Y N Y (though the OPA smi looked a bit different)
>
> Yes OPA is different but it is based on the class version of the individual
> MADs not any particular device/port support.

> > OPA SMP N N Y
>
> How is this different from the SMI?

Any code that generates SMPs and SMIs is going to need to know what
format to generate them in. It seems we have a sort of weird world
where IB SMPs are supported on OPA but not the IB SMI.

Not sure any users exist though..

> > Maybe the above set would make a lot more sense as:
> > cap_ib_qp0
> > cap_ib_qp1
> > cap_opa_qp0
>
> I disagree.
>
> All we need right now is cap_qp0. All devices currently support QP1.

I didn't list iWarp in the table because everything is no, but it
doesn't support QP1.

> > ib_sa means the IBA SA protocol is supported (Y Y Y)
>
> I think this should be (Y N Y)
>
> IBoE has no SA. The IBoE code "fabricates" a Path Record it does not need to
> interact with the SA.

I was wondering why there are so many checks in the SA code, I know
RoCEE doesn't use it, but why are they there?

> > ib_mcast true if the IBA SA protocol is used for multicast GIDs (Y N
> > Y)
>
> Given the above why can't we just have the "ib_sa" flag?

Maybe I got it wrong, but yes, if it really means 'IBA SA protocol for
multicast' then it can just be cap_sa.

But there is also the idea that some devices can't do multicast at all
(iWarp), we must care about that at some point?

> However, I think checking the link layer is more appropriate here.
> It does not make sense to do IP over IB over Eth. Even though the
> IBoE can do the "IB" protocol.

Yes, it is ugly.

I think if we look closely we'll find that IPoIB today has a hard
requirement on cap_sa being true, so let's use that?

In fact any ULP that unconditionally uses the SA can use that.

> > I actually really prefer cap_mandatory_grh - that is what is going on
> > here. ie based on that name (as a reviewer) I'd expect to see the mad
> > layer check that the mandatory GRH is always present, or blow up.
>
> While GRH mandatory (for the GMP) is what this is. The function
> ib_init_ah_from_path generically is really handling an "IBoE address" to send
> to and therefore we need to force the GRH in the AH.

This makes sense to me.

It appears we have at least rocee, rocee v2 (udp?), tcp, ib and opa
address and AH formats? opa would support ib addresses too I guess.

A
bool rdma_port_addr_is_XXX()

along with a

enum AddrType rdma_port_addr_type()

Might be the thing? The latter should only be used with switch()
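
The pair of helpers suggested here could look roughly like the sketch below.
Everything in it is an assumption for illustration: the enum members, the
sketch_port type, and the sketch_addr_forces_grh() consumer are hypothetical
names, and RoCEE v2 is left out because its status is an open question in
this thread.

```c
#include <stdbool.h>

/* Hypothetical address formats, following the ones named above. */
enum rdma_addr_type {
	RDMA_ADDR_IB,	/* OPA would share this, per the discussion */
	RDMA_ADDR_IBOE,
	RDMA_ADDR_TCP,	/* iWARP */
};

/* Hypothetical per-port state holding the address format. */
struct sketch_port {
	enum rdma_addr_type addr_type;
};

/* The enum form: intended to be consumed by switch() only. */
static inline enum rdma_addr_type
rdma_port_addr_type(const struct sketch_port *port)
{
	return port->addr_type;
}

/* The boolean form: one rdma_port_addr_is_XXX() per address format. */
static inline bool rdma_port_addr_is_iboe(const struct sketch_port *port)
{
	return port->addr_type == RDMA_ADDR_IBOE;
}

/* Example consumer: must the AH always carry a GRH for this port? */
static bool sketch_addr_forces_grh(const struct sketch_port *port)
{
	switch (rdma_port_addr_type(port)) {
	case RDMA_ADDR_IBOE:
		return true;
	case RDMA_ADDR_IB:
	case RDMA_ADDR_TCP:
		return false;
	}
	return false;
}
```

Restricting the enum form to switch() means that adding a new address type
makes the compiler point at every switch that does not handle it yet.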

Jason

2015-04-10 16:48:41

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, 2015-04-10 at 09:46 +0200, Michael Wang wrote:
> On 04/09/2015 11:19 PM, Doug Ledford wrote:
> [snip]
> >>
> >> I've written this before: The mess here is that it is too hard to know
> >> what the call sites are actually checking for when it is some baroque
> >> conditional.
> >
> > The two goals: being specific about what the test is returning and
> > minimizing the bitmap footprint; are not necessarily opposed. One can
> > do both at the same time.
>
> This could be an internal rework once the cap_XX() helpers are in place for
> the core layer. At that point we won't need to touch the core layer anymore;
> we just introduce the bitmap in the verbs layer and replace the implementation
> of these helpers with bitmap checks, keeping their semantics (descriptions).

Agreed.

--
Doug Ledford
GPG KeyID: 0E572FDD




2015-04-10 17:10:54

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, 2015-04-10 at 03:48 -0400, ira.weiny wrote:
> > > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > > index 18c1ece..a9587c4 100644
> > > --- a/drivers/infiniband/core/device.c
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
> > > } mandatory_table[] = {
> > > IB_MANDATORY_FUNC(query_device),
> > > IB_MANDATORY_FUNC(query_port),
> > > + IB_MANDATORY_FUNC(query_transport),
> > > IB_MANDATORY_FUNC(query_pkey),
> > > IB_MANDATORY_FUNC(query_gid),
> > > IB_MANDATORY_FUNC(alloc_pd),
> >
> > I'm concerned about the performance implications of this. The size of
> > this patchset already points out just how many places in the code we
> > have to check for various aspects of the device transport in order to do
> > the right thing. Without going through the entire list to see how many
> > are on critical hot paths, I'm sure some of them are on at least
> > partially critical hot paths (like creation of new connections). I
> > would prefer to see this change be implemented via a device attribute,
> > not a functional call query. That adds a needless function call in
> > these paths.
>
> I like the idea of a query_transport but at the same time would like to see the
> use of "transport" reduced. A reduction in the use of this call could
> eliminate most performance concerns.
>
> So can we keep this abstraction if at the end of the series we limit its use?

The reason I don't like a query is because the transport type isn't
changing. It's a static device attribute. The only devices that *can*
change their transport are mlx4 or mlx5 devices, and they tear down and
deregister their current device and bring up a new one when they need to
change transports or link layers. So, this really isn't something we
should query, this should be part of our static device attributes.
Every other query in the list above is for something that changes. This
is not.
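
As a rough illustration of the static-attribute approach: the driver fills
in a per-port transport value before registering, the core validates it once,
and every later check is a plain field read. All names below (sketch_device,
sketch_register_device, sketch_port_transport, the SKETCH_TRANSPORT_* values)
are hypothetical and not taken from the patch set.

```c
#include <stddef.h>

#define SKETCH_MAX_PORTS 2

enum sketch_transport {
	SKETCH_TRANSPORT_UNSET = 0,	/* driver did not fill it in */
	SKETCH_TRANSPORT_IB,
	SKETCH_TRANSPORT_IWARP,
	SKETCH_TRANSPORT_USNIC,
	SKETCH_TRANSPORT_USNIC_UDP,
	SKETCH_TRANSPORT_IBOE,
};

struct sketch_device {
	size_t num_ports;
	/* Set by the driver before registration, never changed afterwards. */
	enum sketch_transport port_transport[SKETCH_MAX_PORTS];
};

/* Validate the static attributes once, at registration time. */
static int sketch_register_device(const struct sketch_device *dev)
{
	size_t i;

	if (dev->num_ports == 0 || dev->num_ports > SKETCH_MAX_PORTS)
		return -1;
	for (i = 0; i < dev->num_ports; i++)
		if (dev->port_transport[i] == SKETCH_TRANSPORT_UNSET)
			return -1;
	return 0;
}

/* Hot-path check: a field read, no indirect call into the driver. */
static inline enum sketch_transport
sketch_port_transport(const struct sketch_device *dev, size_t port)
{
	return dev->port_transport[port];
}
```

A device that needs to change transport would, as described above, deregister
and register again with new values, so the field never changes under a
registered device.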

> >
> > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > > index f93eb8d..83370de 100644
> > > --- a/drivers/infiniband/core/verbs.c
> > > +++ b/drivers/infiniband/core/verbs.c
> > > @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
> > > if (device->get_link_layer)
> > > return device->get_link_layer(device, port_num);
> > >
> > > - switch (rdma_node_get_transport(device->node_type)) {
> > > + switch (device->query_transport(device, port_num)) {
> > > case RDMA_TRANSPORT_IB:
> > > + case RDMA_TRANSPORT_IBOE:
> > > return IB_LINK_LAYER_INFINIBAND;
> >
> > > If we are preserving ABI, then this looks wrong. Currently, IBOE
> > > returns transport IB and link layer Ethernet. It should not return
> > link layer IB, it does not support IB link layer operations (such as MAD
> > access).
>
> I think the original code has the bug.
>
> IBoE devices currently return a transport of IB but they probably never get
> here because they support the get_link_layer callback used a few lines above.
> So this "bug" was probably never hit.
>
> >
> > > case RDMA_TRANSPORT_IWARP:
> > > case RDMA_TRANSPORT_USNIC:
> > > case RDMA_TRANSPORT_USNIC_UDP:
> > > return IB_LINK_LAYER_ETHERNET;
> > > default:
> > > + BUG();
> > > return IB_LINK_LAYER_UNSPECIFIED;
> > > }
> > > }
> >
> > [ snip ]
> >
> > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > > index 65994a1..d54f91e 100644
> > > --- a/include/rdma/ib_verbs.h
> > > +++ b/include/rdma/ib_verbs.h
> > > @@ -75,10 +75,13 @@ enum rdma_node_type {
> > > };
> > >
> > > enum rdma_transport_type {
> > > + /* legacy for users */
> > > RDMA_TRANSPORT_IB,
> > > RDMA_TRANSPORT_IWARP,
> > > RDMA_TRANSPORT_USNIC,
> > > - RDMA_TRANSPORT_USNIC_UDP
> > > + RDMA_TRANSPORT_USNIC_UDP,
> > > + /* new transport */
> > > + RDMA_TRANSPORT_IBOE,
> > > };
> >
> > I'm also concerned about this. I would like to see this enum
> > essentially turned into a bitmap. One that is constructed in such a way
> > that we can always get the specific test we need with only one compare
> > against the overall value. In order to do so, we need to break it down
> > into the essential elements that are part of each of the transports.
> > So, for instance, we can define the two link layers we have so far, plus
> > reserve one for OPA which we know is coming:
> >
> > RDMA_LINK_LAYER_IB = 0x00000001,
> > RDMA_LINK_LAYER_ETH = 0x00000002,
> > RDMA_LINK_LAYER_OPA = 0x00000004,
> > RDMA_LINK_LAYER_MASK = 0x0000000f,
>
> I would reserve more bits here.

Sure. I didn't mean to imply that this was as large as these bit fields
would ever need to be, I just typed it up quickly at the time.

> >
> > We can then define the currently known high level transport types:
> >
> > RDMA_TRANSPORT_IB = 0x00000010,
> > RDMA_TRANSPORT_IWARP = 0x00000020,
> > RDMA_TRANSPORT_USNIC = 0x00000040,
> > RDMA_TRANSPORT_USNIC_UDP = 0x00000080,
> > RDMA_TRANSPORT_MASK = 0x000000f0,
>
> I would reserve more bits here.
>
> >
> > We could then define bits for the IB management types:
> >
> > RDMA_MGMT_IB = 0x00000100,
> > RDMA_MGMT_OPA = 0x00000200,
> > RDMA_MGMT_MASK = 0x00000f00,
>
> We at least need bits for SA / CM support.
>
> I said previously all device types support QP1 I was wrong... I forgot about
> USNIC devices. So the full management bit mask is.
>
>
> RDMA_MGMT_IB_MAD = 0x00000100,
> RDMA_MGMT_QP0 = 0x00000200,
> RDMA_MGMT_SA = 0x00000400,
> RDMA_MGMT_CM = 0x00000800,
> RDMA_MGMT_OPA_MAD = 0x00001000,
> RDMA_MGMT_MASK = 0x000fff00,
>
> With a couple of spares.
>
> The MAD stack is pretty agnostic to the types of MADs passing through it so we
> don't really need PM flags etc.
>
> >
> > Then we have space to define specific quirks:
> >
> > RDMA_SEPARATE_READ_SGE = 0x00001000,
> > RDMA_QUIRKS_MASK = 0xfffff000
>
> shift for spares...
>
> RDMA_SEPARATE_READ_SGE = 0x00100000,
> RDMA_QUIRKS_MASK = 0xfff00000
>
> >
> > Once those are defined, a few definitions for device drivers to use when
> > they initialize a device to set the bitmap to the right values:
> >
> > #define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
> > 		RDMA_SEPARATE_READ_SGE)
> > #define IS_IB (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
> > #define IS_IBOE (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
> > #define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)
> > #define IS_OPA (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB | \
> > 		RDMA_MGMT_OPA)
> >
> > Then you need to define the tests:
> >
> > static inline bool
> > rdma_transport_is_iwarp(struct ib_device *dev, u8 port)
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_IWARP;
> > }
> >
> > /* Note: this intentionally covers IB, IBOE, and OPA...use
> > rdma_dev_is_ib if you want to only get physical IB devices */
> > static inline bool
> > rdma_transport_is_ib(struct ib_device *dev, u8 port)
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_IB;
> > }
> >
> > rdma_port_is_ib(struct ib_device *dev, u8 port)
>
> I prefer
>
> rdma_port_link_layer_is_ib
> rdma_port_link_layer_is_eth

In my quick example, anything that started with rdma_transport* was
testing the high level transport attribute of any given port of any
given device, and anything that started with rdma_port* was testing the
link layer on a specific port of any given device.

>
> > {
> > return dev->port[port]->transport & RDMA_LINK_LAYER_IB;
> > }
> >
> > rdma_port_is_iboe(struct ib_device *dev, u8 port)
>
> I'm not sure what this means.
>
> rdma_port_req_iboe_addr seems more appropriate because what we really need to
> know is that this device requires an IBoE address format. (PKey is "fake",
> PathRecord is fabricated rather than queried, GRH is required in the AH
> conversion.)

TomAto, Tomato... port_is_iboe implicitly means port_req_iboe_addr.
>
> > {
> > return (dev->port[port]->transport & IS_IBOE) == IS_IBOE;
> > }
> >
> > rdma_port_is_usnic(struct ib_device *dev, u8 port)
>
> rdma_transport_is_usnic
>
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_USNIC;
> > }
> >
> > rdma_port_is_opa(struct ib_device *dev, u8 port)
>
> rdma_port_link_layer_is_opa
>
> > {
> > return dev->port[port]->transport & RDMA_LINK_LAYER_OPA;
> > }
> >
> > rdma_port_is_iwarp(struct ib_device *dev, u8 port)
> > {
> > return rdma_transport_is_iwarp(dev, port);
>
> Why not call rdma_transport_is_iwarp?

As per my above statement, rdma_transport* tests were testing the high
level transport type, rdma_port* types were testing link layers. iWARP
has an Eth link layer, so technically port_is_iwarp makes no sense. But
since all the other types had a check too, I included port_is_iwarp just
to be complete, and if you are going to ask if a specific port is iwarp
as a link layer, it makes sense to say yes if the transport is iwarp,
not if the link layer is eth.
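
To make the naming convention concrete, here is a compilable sketch using the
bit values from the quoted proposal. The sketch_port type and its attributes
field are hypothetical; the point is only that rdma_transport_* tests look at
the transport bits, rdma_port_* tests look at the link layer bits, and a
composite test such as IBoE needs the mask applied before the comparison
(in C, == binds tighter than &, so the parentheses are required).

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_LINK_LAYER_OPA	0x00000004u
#define RDMA_TRANSPORT_IB	0x00000010u
#define RDMA_TRANSPORT_IWARP	0x00000020u
#define RDMA_TRANSPORT_USNIC	0x00000040u
#define RDMA_MGMT_IB		0x00000100u
#define RDMA_MGMT_OPA		0x00000200u
#define RDMA_SEPARATE_READ_SGE	0x00001000u

#define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP | \
		  RDMA_SEPARATE_READ_SGE)
#define IS_IB	 (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
#define IS_IBOE	 (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
#define IS_OPA	 (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB | \
		  RDMA_MGMT_OPA)

/* Hypothetical per-port attribute word. */
struct sketch_port {
	uint32_t attributes;
};

/* Transport test: true for IB, IBOE and OPA ports alike. */
static inline bool rdma_transport_is_ib(const struct sketch_port *p)
{
	return (p->attributes & RDMA_TRANSPORT_IB) != 0;
}

/* Link layer test: true only for a physical IB link. */
static inline bool rdma_port_is_ib(const struct sketch_port *p)
{
	return (p->attributes & RDMA_LINK_LAYER_IB) != 0;
}

/* Composite test: every bit of IS_IBOE must be set. */
static inline bool rdma_port_is_iboe(const struct sketch_port *p)
{
	return (p->attributes & IS_IBOE) == IS_IBOE;
}
```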

[ snip lots of stuff that is all correct ]

> > Patch 8/17:
> >
> > Here we can create a new test if we are using the bitmap I created
> > above:
> >
> > rdma_port_ipoib(struct ib_device *dev, u8 port)
> > {
> > return !(dev->port[port]->transport & RDMA_LINK_LAYER_ETH);
>
> I would prefer a link layer (or generic) bit mask rather than '&' of
> "transport" and "link layer". That just seems wrong.

I kept the name transport, but it's really a device attribute bitmap.
And of all the link layer types, eth is the only one for which IPoIB
makes no sense (even if it's possible to do). So, as long as the ETH
bit isn't set, we're good to go. But, calling it
dev->port[port]->attributes instead of transport would make it more
clear what it is.
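
A small sketch of that test with the field renamed to attributes. The
sketch_port type is a hypothetical stand-in; the link layer bit values are the
ones from the quoted proposal.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_LINK_LAYER_IB	0x00000001u
#define RDMA_LINK_LAYER_ETH	0x00000002u
#define RDMA_LINK_LAYER_OPA	0x00000004u

/* Hypothetical per-port attribute bitmap (formerly called "transport"). */
struct sketch_port {
	uint32_t attributes;
};

/* IPoIB makes sense on every link layer except Ethernet. */
static inline bool rdma_port_ipoib(const struct sketch_port *port)
{
	return !(port->attributes & RDMA_LINK_LAYER_ETH);
}
```

Written this way, IB and OPA link layers both pass, while any Ethernet-based
port (IBoE, iWARP, USNIC) is rejected.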

> I guess rdma_dev_is_iboe is ok. But it seems like we are keying off the
> addresses not necessarily the devices.

They're one and the same. The addresses go with the device, the device
goes with the addresses. You never have one without the other. The
name of the check is not really important, just as long as it's clearly
documented. I get why you like the address variant, because it pops out
all the things that are special about IBoE addressing and calls out that
the issues need to be handled. However, saying requires_iboe_addr(),
while foreshadowing the work that needs done, doesn't actually document
the work that needs done. Whether we call it dev_is_iboe() or
requires_iboe_addr(), it would be good if the documentation spelled out
those specific requirements for reference sake.


--
Doug Ledford
GPG KeyID: 0E572FDD




2015-04-10 17:37:15

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 01:10:43PM -0400, Doug Ledford wrote:

> documented. I get why you like the address variant, because it pops out
> all the things that are special about IBoE addressing and calls out that
> the issues need to be handled. However, saying requires_iboe_addr(),
> while foreshadowing the work that needs done, doesn't actually document
> the work that needs done. Whether we call it dev_is_iboe() or
> requires_iboe_addr(), it would be good if the documentation spelled out
> those specific requirements for reference sake.

My deep hope for this, was that the test 'requires_iboe_addr' or
whatever we call it would have a *really good* kdoc.

List all the ways iboe_addr's work, how they differ from IB addresses,
refer to the specs people should read to understand it, etc.

The patches don't do this, and maybe Michael is the wrong person to
fill that in, but we can get it done..

Jason

BTW: Michael, next time you post the series, please trim the CC
list...

2015-04-10 17:38:57

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 10:15:51AM -0600, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2015 at 02:16:11AM -0400, ira.weiny wrote:
>
> > > IB ROCEE OPA
> > > SMI Y N Y (though the OPA smi looked a bit different)
> >
> > Yes OPA is different but it is based on the class version of the individual
> > MADs not any particular device/port support.
>
> > > OPA SMP N N Y
> >
> > How is this different from the SMI?
>
> Any code that generates SMPs and SMIs is going to need to know what
> format to generate them in. It seems we have a sort of weird world
> where IB SMPs are supported on OPA but not the IB SMI.

My mistake, it was late. The MAD stack needs to know if it should implement the
IB SMI or the IB & OPA SMI. That also implies the use of QP0 or vice versa.

In my email to Doug I suggested "OPA MAD" which covers the need to implement
the OPA SMI on that device for MADs which have that class version.

>
> Not sure any users exist though..

I think all the "users" are either userspace or the drivers themselves. It is
really just the common SMI processing which needs to be turned on/off. Again I
think this can be covered with the QP0 flag.

RDMA_MGMT_QP0 = 0x00000200,

>
> > > Maybe the above set would make a lot more sense as:
> > > cap_ib_qp0
> > > cap_ib_qp1
> > > cap_opa_qp0
> >
> > I disagree.
> >
> > All we need right now is cap_qp0. All devices currently support QP1.
>
> I didn't list iWarp in the table because everything is no, but it
> doesn't support QP1.

Isn't ocrdma an iWarp device?

int ocrdma_process_mad(struct ib_device *ibdev,
		       int process_mad_flags,
		       u8 port_num,
		       struct ib_wc *in_wc,
		       struct ib_grh *in_grh,
		       struct ib_mad *in_mad, struct ib_mad *out_mad)
{
	int status;
	struct ocrdma_dev *dev;

	switch (in_mad->mad_hdr.mgmt_class) {
	case IB_MGMT_CLASS_PERF_MGMT:
		dev = get_ocrdma_dev(ibdev);
		if (!ocrdma_pma_counters(dev, out_mad))
			status = IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
		else
			status = IB_MAD_RESULT_SUCCESS;
		break;
	default:
		status = IB_MAD_RESULT_SUCCESS;
		break;
	}
	return status;
}


Regardless, I was wrong: USNIC devices don't support MADs at all. So we do
need the "supports MADs at all" flag:

RDMA_MGMT_IB_MAD = 0x00000100,

>
> > > ib_sa means the IBA SA protocol is supported (Y Y Y)
> >
> > I think this should be (Y N Y)
> >
> > IBoE has no SA. The IBoE code "fabricates" a Path Record it does not need to
> > interact with the SA.
>
> I was wondering why there are so many checks in the SA code, I know
> RoCEE doesn't use it, but why are they there?

Which checks are you referring to? I think there are separate calls to query
the SA when running on IB for both the Route resolution and the Multicast join
operations. The choice of those calls could be made by a "cap_sa()" helper.
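
A sketch of how such a helper might steer route resolution. Only the
RDMA_MGMT_SA bit value comes from this thread; sketch_port, the
sketch_path_source enum, and sketch_resolve_route() are hypothetical names
used for illustration, and the real code paths are more involved than a
two-way choice.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDMA_MGMT_SA 0x00000400u	/* value proposed in this thread */

/* Hypothetical per-port management flags. */
struct sketch_port {
	uint32_t flags;
};

/* Where the path record for a connection comes from. */
enum sketch_path_source {
	SKETCH_PATH_FROM_SA_QUERY,	/* ask the SA (IB, OPA) */
	SKETCH_PATH_FABRICATED,		/* build it locally (IBoE) */
};

static inline bool cap_sa(const struct sketch_port *port)
{
	return (port->flags & RDMA_MGMT_SA) != 0;
}

/* Pick the route resolution method from the SA capability alone. */
static enum sketch_path_source
sketch_resolve_route(const struct sketch_port *port)
{
	return cap_sa(port) ? SKETCH_PATH_FROM_SA_QUERY
			    : SKETCH_PATH_FABRICATED;
}
```

The same cap_sa() test could gate the multicast join path, which is why a
single SA flag may be enough for both uses.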

>
> > > ib_mcast true if the IBA SA protocol is used for multicast GIDs (Y N
> > > Y)
> >
> > Given the above why can't we just have the "ib_sa" flag?
>
> Maybe I got it wrong, but yes, if it really means 'IBA SA protocol for
> multicast' then it can just be cap_sa.
>
> But there is also the idea that some devices can't do multicast at all
> (iWarp), we must care about that at some point?

I was not sure how to handle this either... I guess you need a
cap_multicast()???

>
> > However, I think checking the link layer is more appropriate here.
> > It does not make sense to do IP over IB over Eth. Even though the
> > IBoE can do the "IB" protocol.
>
> Yes, it is ugly.
>
> I think if we look closely we'll find that IPoIB today has a hard
> requirement on cap_sa being true, so let's use that?

I don't think that is appropriate. You have been advocating that the checks
be clear as to what support we need. While currently the IPoIB layer does (for
IB and OPA) require an SA I think those checks are only appropriate when it is
attempting an SA query.

The choice to run IPoIB at all is a different matter.

>
> In fact any ULP that unconditionally uses the SA can use that.

They _can_ use that but the point of this exercise (and additional checks going
forward) is that we don't "hide" meaning like this.

IPoIB should restrict itself to running on IB link layers. Should additional
link layers be added which IPoIB works on then we add that check.

>
> > > I actually really prefer cap_mandatory_grh - that is what is going on
> > > here. ie based on that name (as a reviewer) I'd expect to see the mad
> > > layer check that the mandatory GRH is always present, or blow up.
> >
> > While GRH mandatory (for the GMP) is what this is. The function
> > ib_init_ah_from_path generically is really handling an "IBoE address" to send
> > to and therefore we need to force the GRH in the AH.
>
> This makes sense to me.
>
> It appears we have at least rocee, rocee v2 (udp?), tcp, ib and opa
> address and AH formats?

Seems that way. But has the rocee v2 been accepted?

> opa would support ib addresses too I guess.

Yes opa address == ib addresses. So there is no need to distinguish them.

>
> A
> bool rdma_port_addr_is_XXX()
>
> along with a
>
> enum AddrType rdma_port_addr_type()
>
> Might be the thing? The latter should only be used with switch()

Sounds good to me. But Doug has a point that the address type and the "port"
type go together. So this could probably be the same call for both of those.

Ira

2015-04-10 17:49:42

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, 2015-04-10 at 13:38 -0400, ira.weiny wrote:

> Isn't ocrdma an iWarp device?

No, it's roce. It and mlx4 roce devices currently interoperate.

> > I think if we look closely we'll find that IPoIB today has a hard
> > requirement on cap_sa being true, so let's use that?
>
> I don't think that is appropriate. You have been advocating that the checks
> be clear as to what support we need. While currently the IPoIB layer does (for
> IB and OPA) require an SA I think those checks are only appropriate when it is
> attempting an SA query.
>
> The choice to run IPoIB at all is a different matter.

Appropriately named or not, Jason's choice of words "has a hard
requirement" is correct ;-) For IPoIB, the broadcast group of the fake
Ethernet fabric is a very specific IB multicast group per the IPoIB
spec.

> >
> > In fact any ULP that unconditionally uses the SA can use that.
>
> They _can_ use that but the point of this exercise (and additional checks going
> forward) is that we don't "hide" meaning like this.
>
> IPoIB should restrict itself to running on IB link layers. Should additional
> link layers be added which IPoIB works on then we add that check.

I think you're right that checking the link layer is the right thing, and
for now, there is no need to check cap_sa because the link layer check
enforces it. In the future, if there is a new link layer we want to use
this on, and it doesn't have an SA, then we have to enable SA checks and
alternate methods at that time.

--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-10 17:50:22

by Tom Talpey

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 4/10/2015 1:10 PM, Doug Ledford wrote:
> As per my above statement, rdma_transport* tests were testing the high
> level transport type, rdma_port* types were testing link layers. iWARP
> has an Eth link layer, so technically port_is_iwarp makes no sense. But
> since all the other types had a check too, I included port_is_iwarp just
> to be complete, and if you are going to ask if a specific port is iwarp
> as a link layer, it makes sense to say yes if the transport is iwarp,
> not if the link layer is eth.

Not wanting to split hairs, but I would not rule out the possibility
of a future device supporting iWARP on one port and another RDMA
protocol on another. One could also imagine softiWARP and softROCE
co-existing atop a single ethernet NIC.

So, I disagree that port_is_iwarp() is a non sequitur.

Tom.

2015-04-10 18:05:16

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 01:38:38PM -0400, ira.weiny wrote:

> Isn't ocrdma an iWarp device?

No, it is RoCEE

> > I was wondering why there are so many checks in the SA code, I know
> > RoCEE doesn't use it, but why are there there?
>
> Which checks are you referring to? I think there are separate calls to query
> the SA when running on IB for both the Route resolution and the Multicast join
> operations. The choice of those calls could be made by a "cap_sa()" helper.

I will point them out next time I look through the patches

> > I think if we look closely we'll find that IPoIB today has a hard
> > requirement on cap_sa being true, so lets use that?
>
> I don't think that is appropriate. You have been advocating that the checks
> be clear as to what support we need.

Right, but this is narrow, and we are not hiding meaning.

Look at the IPoIB ULP, and look at the hard requirements of the code,
then translate those back to our new cap scheme. We see today's IPoIB
will not run without:
- UD support
- IB addressing
- IB multicast
- IB SA
- CM (optional)

It seems perfectly correct for a ULP to say at the very start, I need
all these caps, or I will not run (how could it run?). This is true of
any ULP that has a hard need to use those APIs.

That would seem to be the very essence of the cap scheme. Declare what
you need, not what standard you think you need.
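
As a minimal sketch of "declare what you need", assuming the five requirements listed above were each exposed as one capability bit: the CAP_* names, the bit positions and both helpers are made up for illustration and are not the helpers from the patch set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability bits, one per requirement in the list above. */
#define CAP_UD		(1u << 0)	/* UD support */
#define CAP_IB_ADDR	(1u << 1)	/* IB addressing */
#define CAP_IB_MCAST	(1u << 2)	/* IB multicast */
#define CAP_IB_SA	(1u << 3)	/* IB SA */
#define CAP_IB_CM	(1u << 4)	/* CM, optional for IPoIB */

/* Everything IPoIB cannot run without, declared in one place. */
#define IPOIB_REQUIRED_CAPS \
	(CAP_UD | CAP_IB_ADDR | CAP_IB_MCAST | CAP_IB_SA)

/* True only if every hard requirement is present on the port. */
static bool ipoib_port_usable(uint32_t port_caps)
{
	return (port_caps & IPOIB_REQUIRED_CAPS) == IPOIB_REQUIRED_CAPS;
}

/* The optional CM cap only matters once the port is usable at all. */
static bool ipoib_connected_mode_possible(uint32_t port_caps)
{
	return ipoib_port_usable(port_caps) && (port_caps & CAP_IB_CM) != 0;
}
```

The check names no standard ("IB", "OPA") anywhere; a port qualifies purely by what it can do.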

Hiding meaning is to say 'only run on IB or OPA': WHY are we limited
to those two?

> While currently the IPoIB layer does (for IB and OPA) require an SA
> I think those checks are only appropriate when it is attempting an
> SA query.

That doesn't make any sense unless someone also adds support for
handling the !SA case.

> > It appears we have at least rocee, rocee v2 (udp?), tcp, ib and opa
> > address and AH formats?
>
> Seems that way. But has the rocee v2 been accepted?

Don't know much about it yet, patches exist, it seems to have a
slightly different addressing format.

> > opa would support ib addresses too I guess.
>
> Yes opa address == ib addresses. So there is no need to distinguish them.

The patches you sent showed a different LRH format for OPA (e.g. 32 bit
LID), so someday we will need to know that the full 32 bit LID is
available.

We can see how this might work in future: let's say OPAv2 *requires* the
32 bit LID; for that case cap_ib_address = 0 and cap_opa_address = 1. If
we don't update IPoIB and it uses the tests from above then it
immediately, and correctly, stops running on those OPAv2 devices.

Once patched to support cap_opa_address, it will begin working
again. That seems very sane.
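
The hypothetical can be written down in a few lines. This is only an illustration: struct port_caps and both functions are invented names, and "OPAv2" is the made-up device from the paragraph above.

```c
#include <stdbool.h>

/* Hypothetical per-port address capabilities. */
struct port_caps {
	bool cap_ib_address;	/* port accepts IB-format addresses */
	bool cap_opa_address;	/* port accepts the full 32 bit LID format */
};

/* IPoIB as it is today: it only knows how to build IB addresses. */
static bool ipoib_unpatched_can_run(const struct port_caps *caps)
{
	return caps->cap_ib_address;
}

/* IPoIB after someone teaches it the OPA address format as well. */
static bool ipoib_patched_can_run(const struct port_caps *caps)
{
	return caps->cap_ib_address || caps->cap_opa_address;
}
```

On a port with cap_ib_address = 0 and cap_opa_address = 1 the first test fails and the second passes, with no device-name check anywhere.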

Jason

2015-04-10 18:12:05

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 01:49:32PM -0400, Doug Ledford wrote:
> On Fri, 2015-04-10 at 13:38 -0400, ira.weiny wrote:
>
>
> > > I think if we look closely we'll find that IPoIB today has a hard
> > > requirement on cap_sa being true, so lets use that?
> >
> > I don't think that is appropriate. You have been advocating that the checks
> > be clear as to what support we need. While currently the IPoIB layer does (for
> > IB and OPA) require an SA I think those checks are only appropriate when it is
> > attempting an SA query.
> >
> > The choice to run IPoIB at all is a different matter.
>
> Appropriately named or not, Jason's choice of words "has a hard
> requirement" is correct ;-)

Agreed. I meant that using "cap_sa" is not appropriate, not that IPoIB did
not have a hard requirement... :-D

I actually think that _both_ the check for IB link layer and the "cap_sa" check
are required. Perhaps not at start up...

Ira

2015-04-10 18:18:13

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, 2015-04-10 at 13:50 -0400, Tom Talpey wrote:
> On 4/10/2015 1:10 PM, Doug Ledford wrote:
> > As per my above statement, rdma_transport* tests were testing the high
> > level transport type, rdma_port* types were testing link layers. iWARP
> > has an Eth link layer, so technically port_is_iwarp makes no sense. But
> > since all the other types had a check too, I included port_is_iwarp just
> > to be complete, and if you are going to ask if a specific port is iwarp
> > as a link layer, it makes sense to say yes if the transport is iwarp,
> > not if the link layer is eth.
>
> Not wanting to split hairs, but I would not rule out the possibility
> of a future device supporting iWARP on one port and another RDMA
> protocol on another. One could also imagine softiWARP and softROCE
> co-existing atop a single ethernet NIC.
>
> So, I disagree that port_is_iwarp() is a nonsequitur.

Agreed, but that wasn't what I was calling nonsense. I was referring
to the fact that in my quick little write-up, the rdma_port* functions
were all intended to test link layers, not high level transports. There
is no such thing as an iWARP link layer. It was still a port-specific
test, and would work in all the situations you described. It's just that
asking if a port's link layer is iWARP makes no sense, so I returned
true if the transport was iWARP regardless of what the link layer
actually was.
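
To make that concrete, a small sketch of the distinction. The enums, struct port_info and port_ll_is_eth() are stand-ins invented here; only the behaviour of port_is_iwarp() follows the description above, and this is not the code from the write-up.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the real link layer and transport enums. */
enum link_layer { LINK_LAYER_IB, LINK_LAYER_ETH };
enum transport { TRANSPORT_IB, TRANSPORT_IWARP };

/* Per-port view, so two ports of one device may differ. */
struct port_info {
	enum transport transport;
	enum link_layer link_layer;
};

/* A real link layer test: answers from the link layer alone. */
static bool port_ll_is_eth(const struct port_info *port)
{
	return port->link_layer == LINK_LAYER_ETH;
}

/*
 * There is no iWARP link layer, so this port-level test answers from
 * the transport instead, regardless of what the link layer is.
 */
static bool port_is_iwarp(const struct port_info *port)
{
	return port->transport == TRANSPORT_IWARP;
}
```

An iWARP port and a RoCE port both report an Ethernet link layer here, but only the first one answers yes to port_is_iwarp().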


--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-10 18:25:27

by Doug Ledford

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, 2015-04-10 at 12:04 -0600, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2015 at 01:38:38PM -0400, ira.weiny wrote:
>
> Hiding meaning is to say 'only run on IB or OPA': WHY are we limited
> to those two?

For something else, I might agree with this. But for the specific case
of IPoIB, it's pretty fair. IPoIB is more than just a ULP. It's a
spec. And it's very IB specific. It will only work with OPA because
OPA is imitating IB. To run it on another fabric, you would need more
than just to make it work. If the new fabric doesn't have a broadcast
group, or has multicast registration like IB does, you need buy-in from
the equivalent of the IBTA (whatever that may be for this new fabric) on
the pre-defined multicast groups, and you might need firmware support in
the switches.

> We can see how this might work in future, lets say OPAv2 *requires* the
> 32 bit LID, for that case cap_ib_address = 0 cap_opa_address = 1. If
> we don't update IPoIB and it uses the tests from above then it
> immediately, and correctly, stops running on those OPAv2 devices.
>
> Once patched to support cap_op_address then it will begin working
> again. That seems very sane..

It is very sane from an implementation standpoint, but from the larger
interoperability standpoint, you need that spec to be extended to the
new fabric simultaneously.


--
Doug Ledford <[email protected]>
GPG KeyID: 0E572FDD




2015-04-10 19:17:46

by Jason Gunthorpe

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 02:24:26PM -0400, Doug Ledford wrote:

> IPoIB is more than just an ULP. It's a spec. And it's very IB
> specific. It will only work with OPA because OPA is imitating IB.
> To run it on another fabric, you would need more than just to make
> it work. If the new fabric doesn't have a broadcast group, or has
> multicast registration like IB does, you need the equivalent of
> IBTA, whatever that may be for this new fabric, buy in on the
> pre-defined multicast groups and you might need firmware support in
> the switches.

It feels like the 'cap_ib_addressing' or whatever we call it captures
this very well. The IPoIB RFC is very much concerned with GIDs and
MGIDs and broadly requires the IBA addressing scheme.
cap_ib_addressing asserts the port uses that scheme.

We wouldn't accept patches to IPoIB to add a new addressing scheme
without seeing proper diligence to the standards work.

Looking away from the standards, using cap_XX seems very sane: we are
building a well-defined system of invariants. You can't call into the
SA functions if cap_sa is not set, you can't call into the mcast
functions if cap_mcast is not set, you can't form an AH from IB
GIDs/MGID/LID without cap_ib_addressing.

It makes so much sense for the ULP to directly require the needed caps
for the kernel APIs it intends to call, or not use the RDMA port at
all.

> > We can see how this might work in future, lets say OPAv2 *requires* the
> > 32 bit LID, for that case cap_ib_address = 0 cap_opa_address = 1. If
> > we don't update IPoIB and it uses the tests from above then it
> > immediately, and correctly, stops running on those OPAv2 devices.
> >
> > Once patched to support cap_op_address then it will begin working
> > again. That seems very sane..
>
> It is very sane from an implementation standpoint, but from the larger
> interoperability standpoint, you need that spec to be extended to the
> new fabric simultaneously.

I liked the OPAv2 hypothetical because it doesn't actually touch the
IPoIB spec. The IPoIB spec has little to say about LIDs or LRHs; it works
entirely at the GID/MGID/GRH level.

Jason

2015-04-10 20:38:19

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 12:04:55PM -0600, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2015 at 01:38:38PM -0400, ira.weiny wrote:
>
> >
> > I don't think that is appropriate. You have been advocating that the checks
> > be clear as to what support we need.
>
> Right, but this is narrow, and we are not hiding meaning.
>
> Look at the IPoIB ULP, and look at the hard requiments of the code,
> then translate those back to our new cap scheme. We see today's IPoIB
> will not run without:
> - UD support
> - IB addressing
> - IB multicast
> - IB SA
> - CM (optional)
>
> It seems perfectly correct for a ULP to say at the very start, I need
> all these caps, or I will not run (how could it run?). This is true of
> any ULP that has a hard need to use those APIs.

Having IPoIB check for all of these and fail to start if not supported is good.
But the suggestion before was to have "cap_ipoib". I don't think we want that.

>
> That would seem to be the very essance of the cap scheme. Declare what
> you need, not what standard you think you need.

Right.

>
> Hiding meaning is to say 'only run on IB or OPA': WHY are we limited
> to those two?

Because only those 2 support the list of capabilities above.

>
> > While currently the IPoIB layer does (for IB and OPA) require an SA
> > I think those checks are only appropriate when it is attempting an
> > SA query.
>
> That doesn't make any sense unless someone also adds support for
> handling the !SA case.

Fair enough, but IPoIB does need to check for SA support now as well as IB
addressing, which is currently the IB link layer (although, as Doug said, the
port and address format go hand in hand, so I'm happy calling it whatever).

>
> > > It appears we have at least rocee, rocee v2 (udp?), tcp, ib and opa
> > > address and AH formats?
> >
> > Seems that way. But has the rocee v2 been accepted?
>
> Don't know much about it yet, patches exist, it seems to have a
> slightly different addressing format.
>
> > > opa would support ib addresses too I guess.
> >
> > Yes opa address == ib addresses. So there is no need to distinguish them.
>
> The patches you sent showed a different LRH format for OPA (eg 32 bit
> LID), so someday we will need to know that the full 32 bit LID is
> available.
>
> We can see how this might work in future, lets say OPAv2 *requires* the
> 32 bit LID, for that case cap_ib_address = 0 cap_opa_address = 1. If
> we don't update IPoIB and it uses the tests from above then it
> immediately, and correctly, stops running on those OPAv2 devices.

For your hypothetical case, agreed. But I want to make it clear for those who
may be casually reading this thread that OPA addresses are IB addresses right
now.

>
> Once patched to support cap_op_address then it will begin working
> again. That seems very sane..

Agreed.

Ira

2015-04-10 21:06:59

by Ira Weiny

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On Fri, Apr 10, 2015 at 01:17:23PM -0600, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2015 at 02:24:26PM -0400, Doug Ledford wrote:
>
> > IPoIB is more than just an ULP. It's a spec. And it's very IB
> > specific. It will only work with OPA because OPA is imitating IB.
> > To run it on another fabric, you would need more than just to make
> > it work. If the new fabric doesn't have a broadcast group, or has
> > multicast registration like IB does, you need the equivalent of
> > IBTA, whatever that may be for this new fabric, buy in on the
> > pre-defined multicast groups and you might need firmware support in
> > the switches.
>
> It feels like the 'cap_ib_addressing' or whatever we call it captures
> this very well. The IPoIB RFC is very much concerned with GID's and
> MGID's and broadly requires the IBA addressing
> scheme. cap_ib_addressing asserts the port uses that scheme.
>
> We wouldn't accept patches to IPoIB to add a new addressing scheme
> without seeing proper diligence to the standards work.
>
> Looking away from the stadards, using cap_XX seems very sane: We are
> building a well defined system of invarients, You can't call into the
> sa functions if cap_sa is not set, you can't call into the mcast
> functions if cap_mcast is not set, you can't form a AH from IB
> GIDs/MGID/LID without cap_ib_addressing.

Yep.

>
> I makes so much sense for the ULP to directly require the needed cap's
> for the kernel APIs it intends to call, or not use the RDMA port at
> all.

Yes.

So trying to sum up.

Have we settled on the following "capabilities"? Helper function names aside.

/* legacy to communicate to userspace */
RDMA_LINK_LAYER_IB = 0x0000000000000001,
RDMA_LINK_LAYER_ETH = 0x0000000000000002,
RDMA_LINK_LAYER_MASK = 0x000000000000000f, /* more bits? */
/* I'm hoping we don't need more bits here */


/* legacy to communicate to userspace */
RDMA_TRANSPORT_IB = 0x0000000000000010,
RDMA_TRANSPORT_IWARP = 0x0000000000000020,
RDMA_TRANSPORT_USNIC = 0x0000000000000040,
RDMA_TRANSPORT_USNIC_UDP = 0x0000000000000080,
RDMA_TRANSPORT_MASK = 0x00000000000000f0, /* more bits? */
/* I'm hoping we don't need more bits here */


/* New flags */

RDMA_MGMT_IB_MAD = 0x0000000000000100, /* ib_mad module support */
RDMA_MGMT_QP0 = 0x0000000000000200, /* ib_mad QP0 support */
RDMA_MGMT_IB_SA = 0x0000000000000400, /* ib_sa module support */
/* NOTE includes IB Mcast */
RDMA_MGMT_IB_CM = 0x0000000000000800, /* ib_cm module support */
RDMA_MGMT_OPA_MAD = 0x0000000000001000, /* ib_mad OPA MAD support */
RDMA_MGMT_MASK = 0x00000000000fff00,

RDMA_ADDR_IB = 0x0000000000100000, /* Port does IB AH, PR, Pkey */
RDMA_ADDR_IBoE = 0x0000000000200000, /* Port does IBoE AH, PR, Pkey */
/* Do we need iWarp (TCP) here? */
RDMA_ADDR_IB_MASK = 0x000000000ff00000,


RDMA_SEPARATE_READ_SGE = 0x0000000010000000,
RDMA_QUIRKS_MASK = 0x000000fff0000000
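
To show how such a layout could be consumed, here is a small sketch in plain C. The constant values are copied from the list above (a subset only); the three helpers and the idea of a single 64 bit per-port flags word are assumptions for illustration, not settled API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subset of the proposed flags, values copied from the list above. */
#define RDMA_LINK_LAYER_IB	0x0000000000000001ULL
#define RDMA_LINK_LAYER_ETH	0x0000000000000002ULL
#define RDMA_LINK_LAYER_MASK	0x000000000000000fULL
#define RDMA_TRANSPORT_IB	0x0000000000000010ULL
#define RDMA_TRANSPORT_MASK	0x00000000000000f0ULL
#define RDMA_MGMT_IB_MAD	0x0000000000000100ULL
#define RDMA_MGMT_IB_SA		0x0000000000000400ULL
#define RDMA_MGMT_IB_CM		0x0000000000000800ULL
#define RDMA_ADDR_IB		0x0000000000100000ULL
#define RDMA_ADDR_IBoE		0x0000000000200000ULL

/* Hypothetical helper: isolate the legacy link layer field. */
static uint64_t port_link_layer(uint64_t flags)
{
	return flags & RDMA_LINK_LAYER_MASK;
}

/* Hypothetical helper: isolate the legacy transport field. */
static uint64_t port_transport(uint64_t flags)
{
	return flags & RDMA_TRANSPORT_MASK;
}

/* Hypothetical helper: true only if every bit in 'required' is set. */
static bool port_has_all(uint64_t flags, uint64_t required)
{
	return (flags & required) == required;
}
```

A ULP would then make one port_has_all() call with the OR of everything it needs, while the two mask helpers recover the legacy values reported to userspace.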


>
> > > We can see how this might work in future, lets say OPAv2 *requires* the
> > > 32 bit LID, for that case cap_ib_address = 0 cap_opa_address = 1. If
> > > we don't update IPoIB and it uses the tests from above then it
> > > immediately, and correctly, stops running on those OPAv2 devices.
> > >
> > > Once patched to support cap_op_address then it will begin working
> > > again. That seems very sane..
> >
> > It is very sane from an implementation standpoint, but from the larger
> > interoperability standpoint, you need that spec to be extended to the
> > new fabric simultaneously.
>
> I liked the OPAv2 hypothetical because it doesn't actually touch the
> IPoIB spec. IPoIB spec has little to say about LIDs or LRHs it works
> entirely at the GID/MGID/GRH level.

Agreed.

Ira

2015-04-11 00:01:22

by Tom Talpey

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 4/10/2015 5:06 PM, ira.weiny wrote:
> On Fri, Apr 10, 2015 at 01:17:23PM -0600, Jason Gunthorpe wrote:
>...
> So trying to sum up.
>
> Have we settled on the following "capabilities"? Helper function names aside.
>
> /* legacy to communicate to userspace */
> RDMA_LINK_LAYER_IB = 0x0000000000000001,
> RDMA_LINK_LAYER_ETH = 0x0000000000000002,
> RDMA_LINK_LAYER_MASK = 0x000000000000000f, /* more bits? */
> /* I'm hoping we don't need more bits here */
>
>
> /* legacy to communicate to userspace */
> RDMA_TRANSPORT_IB = 0x0000000000000010,
> RDMA_TRANSPORT_IWARP = 0x0000000000000020,
> RDMA_TRANSPORT_USNIC = 0x0000000000000040,
> RDMA_TRANSPORT_USNIC_UDP = 0x0000000000000080,
> RDMA_TRANSPORT_MASK = 0x00000000000000f0, /* more bits? */
> /* I'm hoping we don't need more bits here */
>
>
> /* New flags */
>
> RDMA_MGMT_IB_MAD = 0x0000000000000100, /* ib_mad module support */
> RDMA_MGMT_QP0 = 0x0000000000000200, /* ib_mad QP0 support */
> RDMA_MGMT_IB_SA = 0x0000000000000400, /* ib_sa module support */
> /* NOTE includes IB Mcast */
> RDMA_MGMT_IB_CM = 0x0000000000000800, /* ib_cm module support */
> RDMA_MGMT_OPA_MAD = 0x0000000000001000, /* ib_mad OPA MAD support */
> RDMA_MGMT_MASK = 0x00000000000fff00,

You explicitly say "userspace" - why would an upper layer need to
know the link, transport and management details? These seem to be
mid-layer matters.

> RDMA_ADDR_IB = 0x0000000000100000, /* Port does IB AH, PR, Pkey */
> RDMA_ADDR_IBoE = 0x0000000000200000, /* Port does IBoE AH, PR, Pkey */
> /* Do we need iWarp (TCP) here? */
> RDMA_ADDR_IB_MASK = 0x000000000ff00000,

I do see a ULP needing to know the address family needed to pass to
rdma_connect and rdma_listen, so I would add "IP", but not "iWARP".

> RDMA_SEPARATE_READ_SGE = 0x0000000010000000,
> RDMA_QUIRKS_MASK = 0x000000fff0000000

This is good, but it also needs an attribute to signal the need for a
remote-writable RDMA Read sink buffer, for today's iWARP.

Tom.

2015-04-13 07:40:32

by Michael Wang

Subject: Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

On 04/10/2015 07:36 PM, Jason Gunthorpe wrote:
> On Fri, Apr 10, 2015 at 01:10:43PM -0400, Doug Ledford wrote:
>
>> documented. I get why you link the address variant, because it pops out
>> all the things that are special about IBoE addressing and calls out that
>> the issues need to be handled. However, saying requires_iboe_addr(),
>> while foreshadowing the work that needs done, doesn't actually document
>> the work that needs done. Whether we call is dev_is_iboe() or
>> requires_iboe_addr(), it would be good if the documentation spelled out
>> those specific requirements for reference sake.
>
> My deep hope for this, was that the test 'requires_iboe_addr' or
> whatever we call it would have a *really good* kdoc.
>
> List all the ways iboe_addr's work, how they differ from IB addresses,
> refer to the specs people should read to understand it, etc.
>
> The patches don't do this, and maybe Michael is the wrong person to
> fill that in, but we can get it done..

That's exactly what I'm thinking ;-)

At first I was just trying to save us some code, but now it's becoming
a topic far beyond that purpose. I'd like to help commit whatever we have
already settled and pass the internal reworking, such as implementing the
bitmask stuff, to experts like you guys ;-)

I can still help with review, and maybe with testing on mlx4 if I
get access later.

>
> Jason
>
> BTW: Michael, next time you post the series, please trim the CC
> list...

Thanks for the reminder, I'll trim it in v3 :-)

Regards,
Michael Wang

>