Systems using IB partitions might be exposed to excessive pkey
violation traps which are sent to the OpenSM. This can be close to
a DoS attack, and in addition, the OpenSM logs are flooded with these
messages, hiding potential other log messages deemed important in
order to investigate important issues.
This series prohibit RDMA CM to establish connections between two
limited partition members. This avoids pkey violation traps stemming
from unicast messages to be sent to the OpenSM.
[If this patch series get accepted by the community, I ask if
the maintainer can update the reference to the first commit in the
second commit message with a correct 12 chars SHA]
Håkon Bugge (2):
IB/core: A full pkey is required to match a limited one
IB/cm: Send authentic pkey in REQ msg and check eligibility of the
pkeys
drivers/infiniband/core/cache.c | 32 +++++++++++++++++++++++++++-----
drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
include/rdma/ib_cache.h | 18 ++++++++++++++++++
include/rdma/ib_cm.h | 4 +++-
4 files changed, 80 insertions(+), 13 deletions(-)
--
2.13.6
In ib_find_cached_pkey(), the preference is to return the pkey-index
of the full pkey, if both a limited and the full exists in the cache.
However, if both pkeys are limited, the function return success.
To detect if two pkeys are eligible for communication,
ib_find_matched_cached_pkey() is introduced, which requires at least
one of the pkeys to be a full member, in order to return success.
Signed-off-by: Håkon Bugge <[email protected]>
---
drivers/infiniband/core/cache.c | 32 +++++++++++++++++++++++++++-----
include/rdma/ib_cache.h | 18 ++++++++++++++++++
2 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index fb2d347f760f..cb7e9818a226 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -983,10 +983,11 @@ int ib_get_cached_subnet_prefix(struct ib_device *device,
}
EXPORT_SYMBOL(ib_get_cached_subnet_prefix);
-int ib_find_cached_pkey(struct ib_device *device,
- u8 port_num,
- u16 pkey,
- u16 *index)
+static int __ib_find_cached_pkey(struct ib_device *device,
+ u8 port_num,
+ u16 pkey,
+ u16 *index,
+ bool allow_lim_to_lim_match)
{
struct ib_pkey_cache *cache;
unsigned long flags;
@@ -1013,7 +1014,11 @@ int ib_find_cached_pkey(struct ib_device *device,
partial_ix = i;
}
- if (ret && partial_ix >= 0) {
+ /*
+ * If table content is limited, pkey must be full to match,
+ * unless allow_lim_to_lim_match is set
+ */
+ if (ret && partial_ix >= 0 && (pkey & 0x8000 || allow_lim_to_lim_match)) {
*index = partial_ix;
ret = 0;
}
@@ -1022,6 +1027,23 @@ int ib_find_cached_pkey(struct ib_device *device,
return ret;
}
+
+int ib_find_matched_cached_pkey(struct ib_device *device,
+ u8 port_num,
+ u16 pkey,
+ u16 *index)
+{
+ return __ib_find_cached_pkey(device, port_num, pkey, index, false);
+}
+EXPORT_SYMBOL(ib_find_matched_cached_pkey);
+
+int ib_find_cached_pkey(struct ib_device *device,
+ u8 port_num,
+ u16 pkey,
+ u16 *index)
+{
+ return __ib_find_cached_pkey(device, port_num, pkey, index, true);
+}
EXPORT_SYMBOL(ib_find_cached_pkey);
int ib_find_exact_cached_pkey(struct ib_device *device,
diff --git a/include/rdma/ib_cache.h b/include/rdma/ib_cache.h
index eb49cc8d1f95..748bd27559b6 100644
--- a/include/rdma/ib_cache.h
+++ b/include/rdma/ib_cache.h
@@ -92,6 +92,24 @@ int ib_get_cached_pkey(struct ib_device *device_handle,
u16 *pkey);
/**
+ * ib_find_matched_cached_pkey - Returns the PKey table index where a
+ * specified PKey value occurs. Success requires at least one of the
+ * PKeys to be a full-member.
+ * @device: The device to query.
+ * @port_num: The port number of the device to search for the PKey.
+ * @pkey: The PKey value to search for.
+ * @index: The index into the cached PKey table where the PKey was found.
+ *
+ * ib_find_matched_cached_pkey() searches the specified PKey table in
+ * the local software cache and return success unless both PKeys are
+ * limited.
+ */
+int ib_find_matched_cached_pkey(struct ib_device *device,
+ u8 port_num,
+ u16 pkey,
+ u16 *index);
+
+/**
* ib_find_cached_pkey - Returns the PKey table index where a specified
* PKey value occurs.
* @device: The device to query.
--
2.13.6
There is no point in using RDMA CM to establish a connection between
two QPs that cannot possible communicate. Particularly, if both the
active and passive side use limited pkeys, they are not able to
communicate.
In order to detect this situation, the authentic pkey is used in the
CM REQ message. The authentic pkey is the one that the HCA inserts
into the BTH in the IB packets.
When the passive side receives the REQ, commit ("ib_core: A full pkey
is required to match a limited one") ensures that
ib_find_matched_cached_pkey() fails unless at least one of the pkeys
compared has the full-member bit.
In the limited-to-limited case, this will prohibit the connection to
be formed, and thus, Pkey Violation Traps will not be sent to the SM.
Signed-off-by: Håkon Bugge <[email protected]>
---
drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
include/rdma/ib_cm.h | 4 +++-
2 files changed, 35 insertions(+), 8 deletions(-)
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index a92e1a5c202b..52ed51d5bd2a 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -3,6 +3,7 @@
* Copyright (c) 2004 Topspin Corporation. All rights reserved.
* Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
* Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
@@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
[IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
[IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
[IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
+ [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
};
const char *__attribute_const__ ibcm_reject_msg(int reason)
@@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
return -EINVAL;
cm_dev = port->cm_dev;
- ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
- be16_to_cpu(path->pkey), &av->pkey_index);
+ ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
+ be16_to_cpu(path->pkey), &av->pkey_index);
if (ret)
return ret;
@@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
cm_req_set_local_resp_timeout(req_msg,
param->local_cm_response_timeout);
- req_msg->pkey = param->primary_path->pkey;
+ req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
@@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
cm_id_priv->responder_resources = param->responder_resources;
cm_id_priv->retry_count = param->retry_count;
cm_id_priv->path_mtu = param->primary_path->mtu;
- cm_id_priv->pkey = param->primary_path->pkey;
+
+ /*
+ * We want to send the pkey used in the BTH in packets
+ * sent. This, in order for the passive side to determine if
+ * communication is permitted by the respective pkeys.
+ *
+ * The pkey in the paths are derived from the MGID, which has
+ * the full membership bit set. Hence, we retrieve the pkey by
+ * using the address vector's pkey_index.
+ */
+ ret = ib_get_cached_pkey(cm_id_priv->id.device,
+ cm_id_priv->av.port->port_num,
+ cm_id_priv->av.pkey_index,
+ &cm_id_priv->pkey);
+ if (ret)
+ goto error1;
+
cm_id_priv->qp_type = param->qp_type;
ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
@@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
cm_id_priv);
if (ret) {
int err;
+ int rej_reason = (ret == -ENOENT ?
+ IB_CM_REJ_INVALID_PKEY :
+ IB_CM_REJ_INVALID_GID);
err = ib_get_cached_gid(work->port->cm_dev->ib_device,
work->port->port_num, 0,
&work->path[0].sgid,
NULL);
if (err)
- ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
+ ib_send_cm_rej(cm_id, rej_reason,
NULL, 0, NULL, 0);
else
- ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
+ ib_send_cm_rej(cm_id, rej_reason,
&work->path[0].sgid,
sizeof(work->path[0].sgid),
NULL, 0);
@@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
cm_id_priv);
if (ret) {
- ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
+ int rej_reason = (ret == -ENOENT ?
+ IB_CM_REJ_INVALID_PKEY :
+ IB_CM_REJ_INVALID_ALT_GID);
+
+ ib_send_cm_rej(cm_id, rej_reason,
&work->path[0].sgid,
sizeof(work->path[0].sgid), NULL, 0);
goto rejected;
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 7979cb04f529..56b62303946a 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -3,6 +3,7 @@
* Copyright (c) 2004 Topspin Corporation. All rights reserved.
* Copyright (c) 2004 Voltaire Corporation. All rights reserved.
* Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
+ * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
*
* This software is available to you under a choice of one of two
* licenses. You may choose to be licensed under the terms of the GNU
@@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
IB_CM_REJ_INVALID_CLASS_VERSION = 31,
IB_CM_REJ_INVALID_FLOW_LABEL = 32,
- IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
+ IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
+ IB_CM_REJ_INVALID_PKEY = 34,
};
struct ib_cm_rej_event_param {
--
2.13.6
On 5/9/2018 5:30 AM, Håkon Bugge wrote:
> There is no point in using RDMA CM to establish a connection between
> two QPs that cannot possible communicate. Particularly, if both the
> active and passive side use limited pkeys, they are not able to
> communicate.
>
> In order to detect this situation, the authentic pkey is used in the
> CM REQ message. The authentic pkey is the one that the HCA inserts
> into the BTH in the IB packets.
>
> When the passive side receives the REQ, commit ("ib_core: A full pkey
> is required to match a limited one") ensures that
> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
> compared has the full-member bit.
>
> In the limited-to-limited case, this will prohibit the connection to
> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>
> Signed-off-by: Håkon Bugge <[email protected]>
> ---
> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
> include/rdma/ib_cm.h | 4 +++-
> 2 files changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
> index a92e1a5c202b..52ed51d5bd2a 100644
> --- a/drivers/infiniband/core/cm.c
> +++ b/drivers/infiniband/core/cm.c
> @@ -3,6 +3,7 @@
> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> * licenses. You may choose to be licensed under the terms of the GNU
> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
If this patch goes ahead, IBA spec for CM should be updated to include this.
> };
>
> const char *__attribute_const__ ibcm_reject_msg(int reason)
> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
> return -EINVAL;
> cm_dev = port->cm_dev;
>
> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
> - be16_to_cpu(path->pkey), &av->pkey_index);
> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
> + be16_to_cpu(path->pkey), &av->pkey_index);
> if (ret)
> return ret;
>
> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
> cm_req_set_local_resp_timeout(req_msg,
> param->local_cm_response_timeout);
> - req_msg->pkey = param->primary_path->pkey;
> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>
> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
> cm_id_priv->responder_resources = param->responder_resources;
> cm_id_priv->retry_count = param->retry_count;
> cm_id_priv->path_mtu = param->primary_path->mtu;
> - cm_id_priv->pkey = param->primary_path->pkey;
> +
> + /*
> + * We want to send the pkey used in the BTH in packets
> + * sent. This, in order for the passive side to determine if
> + * communication is permitted by the respective pkeys.
> + *
> + * The pkey in the paths are derived from the MGID, which has
> + * the full membership bit set. Hence, we retrieve the pkey by
> + * using the address vector's pkey_index.
The paths usually come from the SM and I don't expect SM to provide path
between ports of only limited members of partition. Default ACM provider
forms path from multicast group parameters including pkey. Is that the
scenario of concern ? If so, I still don't fully understand the scenario
because limited members are not supposed to be part of a multicast
group. There was some work started to extend this for client/server
model but it was never completed. However, there may be hole(s) in
various components of implementation which open(s) this door.
-- Hal
> + */
> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
> + cm_id_priv->av.port->port_num,
> + cm_id_priv->av.pkey_index,
> + &cm_id_priv->pkey);
> + if (ret)
> + goto error1;
> +
> cm_id_priv->qp_type = param->qp_type;
>
> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
> cm_id_priv);
> if (ret) {
> int err;
> + int rej_reason = (ret == -ENOENT ?
> + IB_CM_REJ_INVALID_PKEY :
> + IB_CM_REJ_INVALID_GID);
>
> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
> work->port->port_num, 0,
> &work->path[0].sgid,
> NULL);
> if (err)
> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
> + ib_send_cm_rej(cm_id, rej_reason,
> NULL, 0, NULL, 0);
> else
> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
> + ib_send_cm_rej(cm_id, rej_reason,
> &work->path[0].sgid,
> sizeof(work->path[0].sgid),
> NULL, 0);
> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
> cm_id_priv);
> if (ret) {
> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
> + int rej_reason = (ret == -ENOENT ?
> + IB_CM_REJ_INVALID_PKEY :
> + IB_CM_REJ_INVALID_ALT_GID);
> +
> + ib_send_cm_rej(cm_id, rej_reason,
> &work->path[0].sgid,
> sizeof(work->path[0].sgid), NULL, 0);
> goto rejected;
> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
> index 7979cb04f529..56b62303946a 100644
> --- a/include/rdma/ib_cm.h
> +++ b/include/rdma/ib_cm.h
> @@ -3,6 +3,7 @@
> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
> *
> * This software is available to you under a choice of one of two
> * licenses. You may choose to be licensed under the terms of the GNU
> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
> + IB_CM_REJ_INVALID_PKEY = 34,
> };
>
> struct ib_cm_rej_event_param {
>
Hi H?kon,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc4 next-20180509]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/H-kon-Bugge/Do-not-form-IB-connections-between-limited-partition-members/20180509-224727
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
drivers/infiniband/core/cm.c:880:16: sparse: expression using sizeof(void)
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
>> drivers/infiniband/core/cm.c:1246:25: sparse: cast from restricted __be16
>> drivers/infiniband/core/cm.c:1246:25: sparse: incorrect type in argument 1 (different base types) @@ expected unsigned short [unsigned] [usertype] val @@ got short [unsigned] [usertype] val @@
drivers/infiniband/core/cm.c:1246:25: expected unsigned short [unsigned] [usertype] val
drivers/infiniband/core/cm.c:1246:25: got restricted __be16 [usertype] pkey
>> drivers/infiniband/core/cm.c:1246:25: sparse: cast from restricted __be16
>> drivers/infiniband/core/cm.c:1246:25: sparse: cast from restricted __be16
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
>> drivers/infiniband/core/cm.c:1414:35: sparse: incorrect type in argument 4 (different base types) @@ expected unsigned short [usertype] *pkey @@ got short [usertype] *pkey @@
drivers/infiniband/core/cm.c:1414:35: expected unsigned short [usertype] *pkey
drivers/infiniband/core/cm.c:1414:35: got restricted __be16 *<noident>
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
drivers/infiniband/core/cm.c:861:21: sparse: expression using sizeof(void)
vim +1246 drivers/infiniband/core/cm.c
1218
1219 static void cm_format_req(struct cm_req_msg *req_msg,
1220 struct cm_id_private *cm_id_priv,
1221 struct ib_cm_req_param *param)
1222 {
1223 struct sa_path_rec *pri_path = param->primary_path;
1224 struct sa_path_rec *alt_path = param->alternate_path;
1225 bool pri_ext = false;
1226
1227 if (pri_path->rec_type == SA_PATH_REC_TYPE_OPA)
1228 pri_ext = opa_is_extended_lid(pri_path->opa.dlid,
1229 pri_path->opa.slid);
1230
1231 cm_format_mad_hdr(&req_msg->hdr, CM_REQ_ATTR_ID,
1232 cm_form_tid(cm_id_priv, CM_MSG_SEQUENCE_REQ));
1233
1234 req_msg->local_comm_id = cm_id_priv->id.local_id;
1235 req_msg->service_id = param->service_id;
1236 req_msg->local_ca_guid = cm_id_priv->id.device->node_guid;
1237 cm_req_set_local_qpn(req_msg, cpu_to_be32(param->qp_num));
1238 cm_req_set_init_depth(req_msg, param->initiator_depth);
1239 cm_req_set_remote_resp_timeout(req_msg,
1240 param->remote_cm_response_timeout);
1241 cm_req_set_qp_type(req_msg, param->qp_type);
1242 cm_req_set_flow_ctrl(req_msg, param->flow_control);
1243 cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
1244 cm_req_set_local_resp_timeout(req_msg,
1245 param->local_cm_response_timeout);
> 1246 req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
1247 cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
1248 cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
1249
1250 if (param->qp_type != IB_QPT_XRC_INI) {
1251 cm_req_set_resp_res(req_msg, param->responder_resources);
1252 cm_req_set_retry_count(req_msg, param->retry_count);
1253 cm_req_set_rnr_retry_count(req_msg, param->rnr_retry_count);
1254 cm_req_set_srq(req_msg, param->srq);
1255 }
1256
1257 req_msg->primary_local_gid = pri_path->sgid;
1258 req_msg->primary_remote_gid = pri_path->dgid;
1259 if (pri_ext) {
1260 req_msg->primary_local_gid.global.interface_id
1261 = OPA_MAKE_ID(be32_to_cpu(pri_path->opa.slid));
1262 req_msg->primary_remote_gid.global.interface_id
1263 = OPA_MAKE_ID(be32_to_cpu(pri_path->opa.dlid));
1264 }
1265 if (pri_path->hop_limit <= 1) {
1266 req_msg->primary_local_lid = pri_ext ? 0 :
1267 htons(ntohl(sa_path_get_slid(pri_path)));
1268 req_msg->primary_remote_lid = pri_ext ? 0 :
1269 htons(ntohl(sa_path_get_dlid(pri_path)));
1270 } else {
1271 /* Work-around until there's a way to obtain remote LID info */
1272 req_msg->primary_local_lid = IB_LID_PERMISSIVE;
1273 req_msg->primary_remote_lid = IB_LID_PERMISSIVE;
1274 }
1275 cm_req_set_primary_flow_label(req_msg, pri_path->flow_label);
1276 cm_req_set_primary_packet_rate(req_msg, pri_path->rate);
1277 req_msg->primary_traffic_class = pri_path->traffic_class;
1278 req_msg->primary_hop_limit = pri_path->hop_limit;
1279 cm_req_set_primary_sl(req_msg, pri_path->sl);
1280 cm_req_set_primary_subnet_local(req_msg, (pri_path->hop_limit <= 1));
1281 cm_req_set_primary_local_ack_timeout(req_msg,
1282 cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay,
1283 pri_path->packet_life_time));
1284
1285 if (alt_path) {
1286 bool alt_ext = false;
1287
1288 if (alt_path->rec_type == SA_PATH_REC_TYPE_OPA)
1289 alt_ext = opa_is_extended_lid(alt_path->opa.dlid,
1290 alt_path->opa.slid);
1291
1292 req_msg->alt_local_gid = alt_path->sgid;
1293 req_msg->alt_remote_gid = alt_path->dgid;
1294 if (alt_ext) {
1295 req_msg->alt_local_gid.global.interface_id
1296 = OPA_MAKE_ID(be32_to_cpu(alt_path->opa.slid));
1297 req_msg->alt_remote_gid.global.interface_id
1298 = OPA_MAKE_ID(be32_to_cpu(alt_path->opa.dlid));
1299 }
1300 if (alt_path->hop_limit <= 1) {
1301 req_msg->alt_local_lid = alt_ext ? 0 :
1302 htons(ntohl(sa_path_get_slid(alt_path)));
1303 req_msg->alt_remote_lid = alt_ext ? 0 :
1304 htons(ntohl(sa_path_get_dlid(alt_path)));
1305 } else {
1306 req_msg->alt_local_lid = IB_LID_PERMISSIVE;
1307 req_msg->alt_remote_lid = IB_LID_PERMISSIVE;
1308 }
1309 cm_req_set_alt_flow_label(req_msg,
1310 alt_path->flow_label);
1311 cm_req_set_alt_packet_rate(req_msg, alt_path->rate);
1312 req_msg->alt_traffic_class = alt_path->traffic_class;
1313 req_msg->alt_hop_limit = alt_path->hop_limit;
1314 cm_req_set_alt_sl(req_msg, alt_path->sl);
1315 cm_req_set_alt_subnet_local(req_msg, (alt_path->hop_limit <= 1));
1316 cm_req_set_alt_local_ack_timeout(req_msg,
1317 cm_ack_timeout(cm_id_priv->av.port->cm_dev->ack_delay,
1318 alt_path->packet_life_time));
1319 }
1320
1321 if (param->private_data && param->private_data_len)
1322 memcpy(req_msg->private_data, param->private_data,
1323 param->private_data_len);
1324 }
1325
1326 static int cm_validate_req_param(struct ib_cm_req_param *param)
1327 {
1328 /* peer-to-peer not supported */
1329 if (param->peer_to_peer)
1330 return -EINVAL;
1331
1332 if (!param->primary_path)
1333 return -EINVAL;
1334
1335 if (param->qp_type != IB_QPT_RC && param->qp_type != IB_QPT_UC &&
1336 param->qp_type != IB_QPT_XRC_INI)
1337 return -EINVAL;
1338
1339 if (param->private_data &&
1340 param->private_data_len > IB_CM_REQ_PRIVATE_DATA_SIZE)
1341 return -EINVAL;
1342
1343 if (param->alternate_path &&
1344 (param->alternate_path->pkey != param->primary_path->pkey ||
1345 param->alternate_path->mtu != param->primary_path->mtu))
1346 return -EINVAL;
1347
1348 return 0;
1349 }
1350
1351 int ib_send_cm_req(struct ib_cm_id *cm_id,
1352 struct ib_cm_req_param *param)
1353 {
1354 struct cm_id_private *cm_id_priv;
1355 struct cm_req_msg *req_msg;
1356 unsigned long flags;
1357 int ret;
1358
1359 ret = cm_validate_req_param(param);
1360 if (ret)
1361 return ret;
1362
1363 /* Verify that we're not in timewait. */
1364 cm_id_priv = container_of(cm_id, struct cm_id_private, id);
1365 spin_lock_irqsave(&cm_id_priv->lock, flags);
1366 if (cm_id->state != IB_CM_IDLE) {
1367 spin_unlock_irqrestore(&cm_id_priv->lock, flags);
1368 ret = -EINVAL;
1369 goto out;
1370 }
1371 spin_unlock_irqrestore(&cm_id_priv->lock, flags);
1372
1373 cm_id_priv->timewait_info = cm_create_timewait_info(cm_id_priv->
1374 id.local_id);
1375 if (IS_ERR(cm_id_priv->timewait_info)) {
1376 ret = PTR_ERR(cm_id_priv->timewait_info);
1377 goto out;
1378 }
1379
1380 ret = cm_init_av_by_path(param->primary_path, &cm_id_priv->av,
1381 cm_id_priv);
1382 if (ret)
1383 goto error1;
1384 if (param->alternate_path) {
1385 ret = cm_init_av_by_path(param->alternate_path,
1386 &cm_id_priv->alt_av, cm_id_priv);
1387 if (ret)
1388 goto error1;
1389 }
1390 cm_id->service_id = param->service_id;
1391 cm_id->service_mask = ~cpu_to_be64(0);
1392 cm_id_priv->timeout_ms = cm_convert_to_ms(
1393 param->primary_path->packet_life_time) * 2 +
1394 cm_convert_to_ms(
1395 param->remote_cm_response_timeout);
1396 cm_id_priv->max_cm_retries = param->max_cm_retries;
1397 cm_id_priv->initiator_depth = param->initiator_depth;
1398 cm_id_priv->responder_resources = param->responder_resources;
1399 cm_id_priv->retry_count = param->retry_count;
1400 cm_id_priv->path_mtu = param->primary_path->mtu;
1401
1402 /*
1403 * We want to send the pkey used in the BTH in packets
1404 * sent. This, in order for the passive side to determine if
1405 * communication is permitted by the respective pkeys.
1406 *
1407 * The pkey in the paths are derived from the MGID, which has
1408 * the full membership bit set. Hence, we retrieve the pkey by
1409 * using the address vector's pkey_index.
1410 */
1411 ret = ib_get_cached_pkey(cm_id_priv->id.device,
1412 cm_id_priv->av.port->port_num,
1413 cm_id_priv->av.pkey_index,
> 1414 &cm_id_priv->pkey);
1415 if (ret)
1416 goto error1;
1417
1418 cm_id_priv->qp_type = param->qp_type;
1419
1420 ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
1421 if (ret)
1422 goto error1;
1423
1424 req_msg = (struct cm_req_msg *) cm_id_priv->msg->mad;
1425 cm_format_req(req_msg, cm_id_priv, param);
1426 cm_id_priv->tid = req_msg->hdr.tid;
1427 cm_id_priv->msg->timeout_ms = cm_id_priv->timeout_ms;
1428 cm_id_priv->msg->context[1] = (void *) (unsigned long) IB_CM_REQ_SENT;
1429
1430 cm_id_priv->local_qpn = cm_req_get_local_qpn(req_msg);
1431 cm_id_priv->rq_psn = cm_req_get_starting_psn(req_msg);
1432
1433 spin_lock_irqsave(&cm_id_priv->lock, flags);
1434 ret = ib_post_send_mad(cm_id_priv->msg, NULL);
1435 if (ret) {
1436 spin_unlock_irqrestore(&cm_id_priv->lock, flags);
1437 goto error2;
1438 }
1439 BUG_ON(cm_id->state != IB_CM_IDLE);
1440 cm_id->state = IB_CM_REQ_SENT;
1441 spin_unlock_irqrestore(&cm_id_priv->lock, flags);
1442 return 0;
1443
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>
> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>> There is no point in using RDMA CM to establish a connection between
>> two QPs that cannot possible communicate. Particularly, if both the
>> active and passive side use limited pkeys, they are not able to
>> communicate.
>>
>> In order to detect this situation, the authentic pkey is used in the
>> CM REQ message. The authentic pkey is the one that the HCA inserts
>> into the BTH in the IB packets.
>>
>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>> is required to match a limited one") ensures that
>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>> compared has the full-member bit.
>>
>> In the limited-to-limited case, this will prohibit the connection to
>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>
>> Signed-off-by: Håkon Bugge <[email protected]>
>> ---
>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>> include/rdma/ib_cm.h | 4 +++-
>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>> index a92e1a5c202b..52ed51d5bd2a 100644
>> --- a/drivers/infiniband/core/cm.c
>> +++ b/drivers/infiniband/core/cm.c
>> @@ -3,6 +3,7 @@
>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>> *
>> * This software is available to you under a choice of one of two
>> * licenses. You may choose to be licensed under the terms of the GNU
>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>
> If this patch goes ahead, IBA spec for CM should be updated to include this.
Sure, I see:
33 Invalid Alternate Flow Label
as the latest in the spec.
>
>> };
>>
>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>> return -EINVAL;
>> cm_dev = port->cm_dev;
>>
>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>> - be16_to_cpu(path->pkey), &av->pkey_index);
>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>> + be16_to_cpu(path->pkey), &av->pkey_index);
>> if (ret)
>> return ret;
>>
>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>> cm_req_set_local_resp_timeout(req_msg,
>> param->local_cm_response_timeout);
>> - req_msg->pkey = param->primary_path->pkey;
>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>
>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>> cm_id_priv->responder_resources = param->responder_resources;
>> cm_id_priv->retry_count = param->retry_count;
>> cm_id_priv->path_mtu = param->primary_path->mtu;
>> - cm_id_priv->pkey = param->primary_path->pkey;
>> +
>> + /*
>> + * We want to send the pkey used in the BTH in packets
>> + * sent. This, in order for the passive side to determine if
>> + * communication is permitted by the respective pkeys.
>> + *
>> + * The pkey in the paths are derived from the MGID, which has
>> + * the full membership bit set. Hence, we retrieve the pkey by
>> + * using the address vector's pkey_index.
>
> The paths usually come from the SM and I don't expect SM to provide path
> between ports of only limited members of partition.
In my case, it does.
> Default ACM provider
> forms path from multicast group parameters including pkey. Is that the
> scenario of concern ?
Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
> If so, I still don't fully understand the scenario
> because limited members are not supposed to be part of a multicast
> group. There was some work started to extend this for client/server
> model but it was never completed. However, there may be hole(s) in
> various components of implementation which open(s) this door.
I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
I think the CM REQ message should contain the correct PKey (fixed by this patch series).
And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
Thxs, Håkon
>
> -- Hal
>
>> + */
>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>> + cm_id_priv->av.port->port_num,
>> + cm_id_priv->av.pkey_index,
>> + &cm_id_priv->pkey);
>> + if (ret)
>> + goto error1;
>> +
>> cm_id_priv->qp_type = param->qp_type;
>>
>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>> cm_id_priv);
>> if (ret) {
>> int err;
>> + int rej_reason = (ret == -ENOENT ?
>> + IB_CM_REJ_INVALID_PKEY :
>> + IB_CM_REJ_INVALID_GID);
>>
>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>> work->port->port_num, 0,
>> &work->path[0].sgid,
>> NULL);
>> if (err)
>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>> + ib_send_cm_rej(cm_id, rej_reason,
>> NULL, 0, NULL, 0);
>> else
>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>> + ib_send_cm_rej(cm_id, rej_reason,
>> &work->path[0].sgid,
>> sizeof(work->path[0].sgid),
>> NULL, 0);
>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>> cm_id_priv);
>> if (ret) {
>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>> + int rej_reason = (ret == -ENOENT ?
>> + IB_CM_REJ_INVALID_PKEY :
>> + IB_CM_REJ_INVALID_ALT_GID);
>> +
>> + ib_send_cm_rej(cm_id, rej_reason,
>> &work->path[0].sgid,
>> sizeof(work->path[0].sgid), NULL, 0);
>> goto rejected;
>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>> index 7979cb04f529..56b62303946a 100644
>> --- a/include/rdma/ib_cm.h
>> +++ b/include/rdma/ib_cm.h
>> @@ -3,6 +3,7 @@
>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>> *
>> * This software is available to you under a choice of one of two
>> * licenses. You may choose to be licensed under the terms of the GNU
>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>> + IB_CM_REJ_INVALID_PKEY = 34,
>> };
>>
>> struct ib_cm_rej_event_param {
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/10/2018 5:16 AM, Håkon Bugge wrote:
>
>
>> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>>
>> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>>> There is no point in using RDMA CM to establish a connection between
>>> two QPs that cannot possible communicate. Particularly, if both the
>>> active and passive side use limited pkeys, they are not able to
>>> communicate.
>>>
>>> In order to detect this situation, the authentic pkey is used in the
>>> CM REQ message. The authentic pkey is the one that the HCA inserts
>>> into the BTH in the IB packets.
>>>
>>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>>> is required to match a limited one") ensures that
>>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>>> compared has the full-member bit.
>>>
>>> In the limited-to-limited case, this will prohibit the connection to
>>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>>
>>> Signed-off-by: Håkon Bugge <[email protected]>
>>> ---
>>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>>> include/rdma/ib_cm.h | 4 +++-
>>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>> index a92e1a5c202b..52ed51d5bd2a 100644
>>> --- a/drivers/infiniband/core/cm.c
>>> +++ b/drivers/infiniband/core/cm.c
>>> @@ -3,6 +3,7 @@
>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>> *
>>> * This software is available to you under a choice of one of two
>>> * licenses. You may choose to be licensed under the terms of the GNU
>>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>>
>> If this patch goes ahead, IBA spec for CM should be updated to include this.
>
> Sure, I see:
>
> 33 Invalid Alternate Flow Label
>
> as the latest in the spec.
This appears to be an implementation rather than architectural issue. If
there is limited <-> limited CM, then the MAD would never get up to the
CM layer as it would be filtered by end port pkey enforcement. This is
only needed to protect against current code which can send on the full
rather than limited pkey (in BTH), right ?
>>
>>> };
>>>
>>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>>> return -EINVAL;
>>> cm_dev = port->cm_dev;
>>>
>>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>>> - be16_to_cpu(path->pkey), &av->pkey_index);
>>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>>> + be16_to_cpu(path->pkey), &av->pkey_index);
>>> if (ret)
>>> return ret;
>>>
>>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>>> cm_req_set_local_resp_timeout(req_msg,
>>> param->local_cm_response_timeout);
>>> - req_msg->pkey = param->primary_path->pkey;
>>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>>
>>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>>> cm_id_priv->responder_resources = param->responder_resources;
>>> cm_id_priv->retry_count = param->retry_count;
>>> cm_id_priv->path_mtu = param->primary_path->mtu;
>>> - cm_id_priv->pkey = param->primary_path->pkey;
>>> +
>>> + /*
>>> + * We want to send the pkey used in the BTH in packets
>>> + * sent. This, in order for the passive side to determine if
>>> + * communication is permitted by the respective pkeys.
>>> + *
>>> + * The pkey in the paths are derived from the MGID, which has
>>> + * the full membership bit set. Hence, we retrieve the pkey by
>>> + * using the address vector's pkey_index.
>>
>> The paths usually come from the SM and I don't expect SM to provide path
>> between ports of only limited members of partition.
>
> In my case, it does.
Which OpenSM flavor are you talking about (upstream, MLNX, or other) ?
What is OpenSM version string ? Is allow_both_pkeys TRUE or FALSE ?
>> Default ACM provider
>> forms path from multicast group parameters including pkey. Is that the
>> scenario of concern ?
>
> Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
Not exactly sure what you mean by RDMA CM does this as there are a
number of different ways to use RDMA CM. Are you referring to using
IPoIB for address resolution ? Another way is for a path record to be
passed in from user space and there is no control over it's contents.
At a minimum, I think the comment mentioning MGID should be clarified.
>> If so, I still don't fully understand the scenario
>> because limited members are not supposed to be part of a multicast
>> group. There was some work started to extend this for client/server
>> model but it was never completed. However, there may be hole(s) in
>> various components of implementation which open(s) this door.
>
> I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
Agreed. I think that there are other components too...
> I think the CM REQ message should contain the correct PKey (fixed by this patch series).
>
> And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
This is the backward compatibility for the current behavior on the
active side so that REQ gets rejected rather than accepted, right ?
In general, the approach used when there is pkey mismatch is to silently
drop packet rather than respond which causes a timeout on the requester
side. In this specific case, I'm not sure which approach is better.
Also, is there a similar issue with SIDR_REQ ?
> Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
Understood and agreed.
-- Hal
>
> Thxs, Håkon
>
>
>>
>> -- Hal
>>
>>> + */
>>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>>> + cm_id_priv->av.port->port_num,
>>> + cm_id_priv->av.pkey_index,
>>> + &cm_id_priv->pkey);
>>> + if (ret)
>>> + goto error1;
>>> +
>>> cm_id_priv->qp_type = param->qp_type;
>>>
>>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>>> cm_id_priv);
>>> if (ret) {
>>> int err;
>>> + int rej_reason = (ret == -ENOENT ?
>>> + IB_CM_REJ_INVALID_PKEY :
>>> + IB_CM_REJ_INVALID_GID);
>>>
>>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>>> work->port->port_num, 0,
>>> &work->path[0].sgid,
>>> NULL);
>>> if (err)
>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>> + ib_send_cm_rej(cm_id, rej_reason,
>>> NULL, 0, NULL, 0);
>>> else
>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>> + ib_send_cm_rej(cm_id, rej_reason,
>>> &work->path[0].sgid,
>>> sizeof(work->path[0].sgid),
>>> NULL, 0);
>>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>>> cm_id_priv);
>>> if (ret) {
>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>>> + int rej_reason = (ret == -ENOENT ?
>>> + IB_CM_REJ_INVALID_PKEY :
>>> + IB_CM_REJ_INVALID_ALT_GID);
>>> +
>>> + ib_send_cm_rej(cm_id, rej_reason,
>>> &work->path[0].sgid,
>>> sizeof(work->path[0].sgid), NULL, 0);
>>> goto rejected;
>>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>>> index 7979cb04f529..56b62303946a 100644
>>> --- a/include/rdma/ib_cm.h
>>> +++ b/include/rdma/ib_cm.h
>>> @@ -3,6 +3,7 @@
>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>> *
>>> * This software is available to you under a choice of one of two
>>> * licenses. You may choose to be licensed under the terms of the GNU
>>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>>> + IB_CM_REJ_INVALID_PKEY = 34,
>>> };
>>>
>>> struct ib_cm_rej_event_param {
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> On 10 May 2018, at 16:01, Hal Rosenstock <[email protected]> wrote:
>
> On 5/10/2018 5:16 AM, Håkon Bugge wrote:
>>
>>
>>> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>>>
>>> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>>>> There is no point in using RDMA CM to establish a connection between
>>>> two QPs that cannot possible communicate. Particularly, if both the
>>>> active and passive side use limited pkeys, they are not able to
>>>> communicate.
>>>>
>>>> In order to detect this situation, the authentic pkey is used in the
>>>> CM REQ message. The authentic pkey is the one that the HCA inserts
>>>> into the BTH in the IB packets.
>>>>
>>>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>>>> is required to match a limited one") ensures that
>>>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>>>> compared has the full-member bit.
>>>>
>>>> In the limited-to-limited case, this will prohibit the connection to
>>>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>>>
>>>> Signed-off-by: Håkon Bugge <[email protected]>
>>>> ---
>>>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>>>> include/rdma/ib_cm.h | 4 +++-
>>>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>>> index a92e1a5c202b..52ed51d5bd2a 100644
>>>> --- a/drivers/infiniband/core/cm.c
>>>> +++ b/drivers/infiniband/core/cm.c
>>>> @@ -3,6 +3,7 @@
>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>> *
>>>> * This software is available to you under a choice of one of two
>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>>>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>>>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>>>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>>>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>>>
>>> If this patch goes ahead, IBA spec for CM should be updated to include this.
>>
>> Sure, I see:
>>
>> 33 Invalid Alternate Flow Label
>>
>> as the latest in the spec.
>
> This appears to be an implementation rather than architectural issue. If
> there is limited <-> limited CM, then the MAD would never get up to the
> CM layer as it would be filtered by end port pkey enforcement. This is
> only needed to protect against current code which can send on the full
> rather than limited pkey (in BTH), right ?
We are talking about two things here. The PKey in the BTH and the PKey in the CM REQ payload. They differ.
I am out of office, but if my memory serves me correct, the PKey in the BTH in the MAD packet will be the default PKey. Further, we have per IBTA:
C10-132: Packets sent to the General Service Interface QP (QP1) shall be accepted if the P_Key in the packet matches any valid P_Key in the P_Key Table of the port on which the packet arrived. Matching is defined in 10.9.3 Partition Key Matching.
So, by architecture, the MAD arrives.
>
>
>>>
>>>> };
>>>>
>>>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>>>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>>>> return -EINVAL;
>>>> cm_dev = port->cm_dev;
>>>>
>>>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>>>> - be16_to_cpu(path->pkey), &av->pkey_index);
>>>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>>>> + be16_to_cpu(path->pkey), &av->pkey_index);
>>>> if (ret)
>>>> return ret;
>>>>
>>>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>>>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>>>> cm_req_set_local_resp_timeout(req_msg,
>>>> param->local_cm_response_timeout);
>>>> - req_msg->pkey = param->primary_path->pkey;
>>>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>>>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>>>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>>>
>>>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>>>> cm_id_priv->responder_resources = param->responder_resources;
>>>> cm_id_priv->retry_count = param->retry_count;
>>>> cm_id_priv->path_mtu = param->primary_path->mtu;
>>>> - cm_id_priv->pkey = param->primary_path->pkey;
>>>> +
>>>> + /*
>>>> + * We want to send the pkey used in the BTH in packets
>>>> + * sent. This, in order for the passive side to determine if
>>>> + * communication is permitted by the respective pkeys.
>>>> + *
>>>> + * The pkey in the paths are derived from the MGID, which has
>>>> + * the full membership bit set. Hence, we retrieve the pkey by
>>>> + * using the address vector's pkey_index.
>>>
>>> The paths usually come from the SM and I don't expect SM to provide path
>>> between ports of only limited members of partition.
>>
>> In my case, it does.
>
> Which OpenSM flavor are you talking about (upstream, MLNX, or other) ?
I guess other is the right term, as its Oracle’s flavour, 3.2.6_20170609.
> What is OpenSM version string ? Is allow_both_pkeys TRUE or FALSE ?
The ports do have both limited and full memberships, and the partitions.conf file contains both “limited”, “both”, and the “full” keywords.
>>> Default ACM provider
>>> forms path from multicast group parameters including pkey. Is that the
>>> scenario of concern ?
>>
>> Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
>
> Not exactly sure what you mean by RDMA CM does this as there are a
> number of different ways to use RDMA CM.
I am talking specifically about the CM REQ message. For example, if you run an IB-UV test program and specifies that RDMA CM should be used for connection management. Or it could be kernel ULPs. My point is that the pkey field in the CM REQ message (i.e. not the one in the BTH) is always the full one, even though the port uses the limited one.
> Are you referring to using
> IPoIB for address resolution ? Another way is for a path record to be
> passed in from user space and there is no control over it’s contents.
I did not bring up the path record or address resolution issue ;-)
But I guess you are alluding to the limited-to-limited case should 1) not have been able to perform address resolution and/or 2) the path record should not have been returned. I agree to both 1) and 2).
But if both 1) and 2) succeeds, this patch series is a last resort to stop the connection to be established and avoid pkey trap violation traps being sent.
> At a minimum, I think the comment mentioning MGID should be clarified.
Is the question if an MC group attached by a limited port will include other limited ports or only full member ports?
With the OpenSM I am running, the MC group contains limited members.
>>> If so, I still don't fully understand the scenario
>>> because limited members are not supposed to be part of a multicast
>>> group. There was some work started to extend this for client/server
>>> model but it was never completed. However, there may be hole(s) in
>>> various components of implementation which open(s) this door.
>>
>> I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
>
> Agreed. I think that there are other components too...
>
>> I think the CM REQ message should contain the correct PKey (fixed by this patch series).
>>
>> And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
> This is the backward compatibility for the current behavior on the
> active side so that REQ gets rejected rather than accepted, right ?
Not sure about the backwards compatibility part, but I certainly agree to the rest. Put it another way, when the MAD payload’s pkey is limited in the CM REQ, and there is only a limited matching key in the pkey cache, a REJ is sent back.
> In general, the approach used when there is pkey mismatch is to silently
> drop packet rather than respond which causes a timeout on the requester
> side. In this specific case, I’m not sure which approach is better.
It is not silently dropped when the port supports pkey trap messages. Then a trap message is sent to the OpenSM.
> Also, is there a similar issue with SIDR_REQ ?
Good point, most probaly!
> Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
>
> Understood and agreed.
Thxs, Håkon
>
> -- Hal
>
>>
>> Thxs, Håkon
>>
>>
>>>
>>> -- Hal
>>>
>>>> + */
>>>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>>>> + cm_id_priv->av.port->port_num,
>>>> + cm_id_priv->av.pkey_index,
>>>> + &cm_id_priv->pkey);
>>>> + if (ret)
>>>> + goto error1;
>>>> +
>>>> cm_id_priv->qp_type = param->qp_type;
>>>>
>>>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>>>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>>>> cm_id_priv);
>>>> if (ret) {
>>>> int err;
>>>> + int rej_reason = (ret == -ENOENT ?
>>>> + IB_CM_REJ_INVALID_PKEY :
>>>> + IB_CM_REJ_INVALID_GID);
>>>>
>>>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>>>> work->port->port_num, 0,
>>>> &work->path[0].sgid,
>>>> NULL);
>>>> if (err)
>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>> NULL, 0, NULL, 0);
>>>> else
>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>> &work->path[0].sgid,
>>>> sizeof(work->path[0].sgid),
>>>> NULL, 0);
>>>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>>>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>>>> cm_id_priv);
>>>> if (ret) {
>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>>>> + int rej_reason = (ret == -ENOENT ?
>>>> + IB_CM_REJ_INVALID_PKEY :
>>>> + IB_CM_REJ_INVALID_ALT_GID);
>>>> +
>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>> &work->path[0].sgid,
>>>> sizeof(work->path[0].sgid), NULL, 0);
>>>> goto rejected;
>>>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>>>> index 7979cb04f529..56b62303946a 100644
>>>> --- a/include/rdma/ib_cm.h
>>>> +++ b/include/rdma/ib_cm.h
>>>> @@ -3,6 +3,7 @@
>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>> *
>>>> * This software is available to you under a choice of one of two
>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>>>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>>>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>>>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>>>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>>>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>>>> + IB_CM_REJ_INVALID_PKEY = 34,
>>>> };
>>>>
>>>> struct ib_cm_rej_event_param {
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/10/2018 11:16 AM, Håkon Bugge wrote:
>
>
>> On 10 May 2018, at 16:01, Hal Rosenstock <[email protected]> wrote:
>>
>> On 5/10/2018 5:16 AM, Håkon Bugge wrote:
>>>
>>>
>>>> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>>>>
>>>> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>>>>> There is no point in using RDMA CM to establish a connection between
>>>>> two QPs that cannot possible communicate. Particularly, if both the
>>>>> active and passive side use limited pkeys, they are not able to
>>>>> communicate.
>>>>>
>>>>> In order to detect this situation, the authentic pkey is used in the
>>>>> CM REQ message. The authentic pkey is the one that the HCA inserts
>>>>> into the BTH in the IB packets.
>>>>>
>>>>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>>>>> is required to match a limited one") ensures that
>>>>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>>>>> compared has the full-member bit.
>>>>>
>>>>> In the limited-to-limited case, this will prohibit the connection to
>>>>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>>>>
>>>>> Signed-off-by: Håkon Bugge <[email protected]>
>>>>> ---
>>>>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>>>>> include/rdma/ib_cm.h | 4 +++-
>>>>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>>>> index a92e1a5c202b..52ed51d5bd2a 100644
>>>>> --- a/drivers/infiniband/core/cm.c
>>>>> +++ b/drivers/infiniband/core/cm.c
>>>>> @@ -3,6 +3,7 @@
>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>> *
>>>>> * This software is available to you under a choice of one of two
>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>>>>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>>>>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>>>>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>>>>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>>>>
>>>> If this patch goes ahead, IBA spec for CM should be updated to include this.
>>>
>>> Sure, I see:
>>>
>>> 33 Invalid Alternate Flow Label
>>>
>>> as the latest in the spec.
>>
>> This appears to be an implementation rather than architectural issue. If
>> there is limited <-> limited CM, then the MAD would never get up to the
>> CM layer as it would be filtered by end port pkey enforcement. This is
>> only needed to protect against current code which can send on the full
>> rather than limited pkey (in BTH), right ?
>
> We are talking about two things here. The PKey in the BTH and the PKey in the CM REQ payload. They differ.
Yes, I understand the difference between the 2 potentially different PKeys.
> I am out of office, but if my memory serves me correct, the PKey in the BTH in the MAD packet will be the default PKey.
MADs are sent on reversible paths (13.5.4.1) where the response can be
constructed from only header info in the request. BTH:pkey doesn't
__have to be__ the default pkey. It just needs to be a common/shared
pkey between the source/destination pair.
For example, there are some configurations where the port is only a
limited member of the default partition (and full member of other non
default partition(s)).
There are also configurations where the port is both full and limited
member on one or more partition(s). This is the allow_both_pkeys TRUE
case for OpenSM which also uses the 'both' syntax in the partitions
config file. This appears to be the case to which you are referring.
CM MAD could be sent on default partition but that may not work if both
source and dest are only limited members of the default partition.
CM MAD could be sent on any common/shared pkey between the
source/destination ports.
> Further, we have per IBTA:
>
> C10-132: Packets sent to the General Service Interface QP (QP1) shall be accepted if the P_Key in the packet matches any valid P_Key in the P_Key Table of the port on which the packet arrived. Matching is defined in 10.9.3 Partition Key Matching.
>
> So, by architecture, the MAD arrives
In the case of port being both full and limited member of the same
partition on which CM MAD comes in on (as defined by pkey in BTH).
I was talking about limited <-> limited CM and didn't realize you were
talking about the 'both' case.
>>
>>
>>>>
>>>>> };
>>>>>
>>>>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>>>>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>>>>> return -EINVAL;
>>>>> cm_dev = port->cm_dev;
>>>>>
>>>>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>> - be16_to_cpu(path->pkey), &av->pkey_index);
>>>>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>> + be16_to_cpu(path->pkey), &av->pkey_index);
>>>>> if (ret)
>>>>> return ret;
>>>>>
>>>>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>>>>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>>>>> cm_req_set_local_resp_timeout(req_msg,
>>>>> param->local_cm_response_timeout);
>>>>> - req_msg->pkey = param->primary_path->pkey;
>>>>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>>>>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>>>>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>>>>
>>>>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>>>>> cm_id_priv->responder_resources = param->responder_resources;
>>>>> cm_id_priv->retry_count = param->retry_count;
>>>>> cm_id_priv->path_mtu = param->primary_path->mtu;
>>>>> - cm_id_priv->pkey = param->primary_path->pkey;
>>>>> +
>>>>> + /*
>>>>> + * We want to send the pkey used in the BTH in packets
>>>>> + * sent. This, in order for the passive side to determine if
>>>>> + * communication is permitted by the respective pkeys.
>>>>> + *
>>>>> + * The pkey in the paths are derived from the MGID, which has
>>>>> + * the full membership bit set. Hence, we retrieve the pkey by
>>>>> + * using the address vector's pkey_index.
>>>>
>>>> The paths usually come from the SM and I don't expect SM to provide path
>>>> between ports of only limited members of partition.
>>>
>>> In my case, it does.
>>
>> Which OpenSM flavor are you talking about (upstream, MLNX, or other) ?
>
> I guess other is the right term, as its Oracle’s flavour, 3.2.6_20170609.
I'll check upstream OpenSM when I get a chance to see if same problem
exists. Oracle OpenSM forked from upstream long time ago and not sure
how similar the path record code is. There were various impacts for the
shared port virtualization (for CX-3) in terms of adding the ability to
allow both pkeys.
>> What is OpenSM version string ? Is allow_both_pkeys TRUE or FALSE ?
>
> The ports do have both limited and full memberships, and the partitions.conf file contains both “limited”, “both”, and the “full” keywords.
>
>>>> Default ACM provider
>>>> forms path from multicast group parameters including pkey. Is that the
>>>> scenario of concern ?
>>>
>>> Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
>>
>> Not exactly sure what you mean by RDMA CM does this as there are a
>> number of different ways to use RDMA CM.
>
> I am talking specifically about the CM REQ message. For example, if you run an IB-UV test program and specifies that RDMA CM should be used for connection management. Or it could be kernel ULPs. My point is that the pkey field in the CM REQ message (i.e. not the one in the BTH) is always the full one, even though the port uses the limited one.
When you write "port uses limited one", you mean BTH:pkey is limited
member but REQ:pkey is full member, right ?
>
>> Are you referring to using
>> IPoIB for address resolution ? Another way is for a path record to be
>> passed in from user space and there is no control over it’s contents.
>
> I did not bring up the path record or address resolution issue ;-)
You mentioned that "The pkey in the paths are derived from the MGID" and
brought this issue up in my thinking about this issue.
> But I guess you are alluding to the limited-to-limited case should 1) not have been able to perform address resolution and/or 2) the path record should not have been returned. I agree to both 1) and 2).
>
> But if both 1) and 2) succeeds, this patch series is a last resort to stop the connection to be established and avoid pkey trap violation traps being sent.
>
>> At a minimum, I think the comment mentioning MGID should be clarified.
>
> Is the question if an MC group attached by a limited port will include other limited ports or only full member ports?
>
> With the OpenSM I am running, the MC group contains limited members.
That is problematic as any sending to MC group by limited members will
cause pkey violation traps to SM.
>>>> If so, I still don't fully understand the scenario
>>>> because limited members are not supposed to be part of a multicast
>>>> group. There was some work started to extend this for client/server
>>>> model but it was never completed. However, there may be hole(s) in
>>>> various components of implementation which open(s) this door.
>>>
>>> I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
>>
>> Agreed. I think that there are other components too...
>>
>>> I think the CM REQ message should contain the correct PKey (fixed by this patch series).
>>>
>>> And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
>> This is the backward compatibility for the current behavior on the
>> active side so that REQ gets rejected rather than accepted, right ?
>
> Not sure about the backwards compatibility part, but I certainly agree to the rest. Put it another way, when the MAD payload’s pkey is limited in the CM REQ, and there is only a limited matching key in the pkey cache, a REJ is sent back.
Understood.
>> In general, the approach used when there is pkey mismatch is to silently
>> drop packet rather than respond which causes a timeout on the requester
>> side. In this specific case, I’m not sure which approach is better.
>
> It is not silently dropped when the port supports pkey trap messages. Then a trap message is sent to the OpenSM.
I was only talking about the requester-responder part. Yes, pkey
security trap would go to SM.
As I wrote, I'm not sure which is preferable: to explictly reject as
this patch proposes or just silently drop and penalize requester.
-- Hal
>> Also, is there a similar issue with SIDR_REQ ?
>
> Good point, most probaly!
>
>> Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
>>
>> Understood and agreed.
>
>
> Thxs, Håkon
>
>>
>> -- Hal
>>
>>>
>>> Thxs, Håkon
>>>
>>>
>>>>
>>>> -- Hal
>>>>
>>>>> + */
>>>>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>>>>> + cm_id_priv->av.port->port_num,
>>>>> + cm_id_priv->av.pkey_index,
>>>>> + &cm_id_priv->pkey);
>>>>> + if (ret)
>>>>> + goto error1;
>>>>> +
>>>>> cm_id_priv->qp_type = param->qp_type;
>>>>>
>>>>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>>>>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>>>>> cm_id_priv);
>>>>> if (ret) {
>>>>> int err;
>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>> + IB_CM_REJ_INVALID_GID);
>>>>>
>>>>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>>>>> work->port->port_num, 0,
>>>>> &work->path[0].sgid,
>>>>> NULL);
>>>>> if (err)
>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>> NULL, 0, NULL, 0);
>>>>> else
>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>> &work->path[0].sgid,
>>>>> sizeof(work->path[0].sgid),
>>>>> NULL, 0);
>>>>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>>>>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>>>>> cm_id_priv);
>>>>> if (ret) {
>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>> + IB_CM_REJ_INVALID_ALT_GID);
>>>>> +
>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>> &work->path[0].sgid,
>>>>> sizeof(work->path[0].sgid), NULL, 0);
>>>>> goto rejected;
>>>>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>>>>> index 7979cb04f529..56b62303946a 100644
>>>>> --- a/include/rdma/ib_cm.h
>>>>> +++ b/include/rdma/ib_cm.h
>>>>> @@ -3,6 +3,7 @@
>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>> *
>>>>> * This software is available to you under a choice of one of two
>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>>>>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>>>>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>>>>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>>>>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>>>>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>>>>> + IB_CM_REJ_INVALID_PKEY = 34,
>>>>> };
>>>>>
>>>>> struct ib_cm_rej_event_param {
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> On 10 May 2018, at 18:54, Hal Rosenstock <[email protected]> wrote:
>
> On 5/10/2018 11:16 AM, Håkon Bugge wrote:
>>
>>
>>> On 10 May 2018, at 16:01, Hal Rosenstock <[email protected]> wrote:
>>>
>>> On 5/10/2018 5:16 AM, Håkon Bugge wrote:
>>>>
>>>>
>>>>> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>>>>>
>>>>> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>>>>>> There is no point in using RDMA CM to establish a connection between
>>>>>> two QPs that cannot possible communicate. Particularly, if both the
>>>>>> active and passive side use limited pkeys, they are not able to
>>>>>> communicate.
>>>>>>
>>>>>> In order to detect this situation, the authentic pkey is used in the
>>>>>> CM REQ message. The authentic pkey is the one that the HCA inserts
>>>>>> into the BTH in the IB packets.
>>>>>>
>>>>>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>>>>>> is required to match a limited one") ensures that
>>>>>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>>>>>> compared has the full-member bit.
>>>>>>
>>>>>> In the limited-to-limited case, this will prohibit the connection to
>>>>>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>>>>>
>>>>>> Signed-off-by: Håkon Bugge <[email protected]>
>>>>>> ---
>>>>>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>>>>>> include/rdma/ib_cm.h | 4 +++-
>>>>>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>>>>> index a92e1a5c202b..52ed51d5bd2a 100644
>>>>>> --- a/drivers/infiniband/core/cm.c
>>>>>> +++ b/drivers/infiniband/core/cm.c
>>>>>> @@ -3,6 +3,7 @@
>>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>>> *
>>>>>> * This software is available to you under a choice of one of two
>>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>>>>>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>>>>>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>>>>>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>>>>>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>>>>>
>>>>> If this patch goes ahead, IBA spec for CM should be updated to include this.
>>>>
>>>> Sure, I see:
>>>>
>>>> 33 Invalid Alternate Flow Label
>>>>
>>>> as the latest in the spec.
>>>
>>> This appears to be an implementation rather than architectural issue. If
>>> there is limited <-> limited CM, then the MAD would never get up to the
>>> CM layer as it would be filtered by end port pkey enforcement. This is
>>> only needed to protect against current code which can send on the full
>>> rather than limited pkey (in BTH), right ?
>>
>> We are talking about two things here. The PKey in the BTH and the PKey in the CM REQ payload. They differ.
>
> Yes, I understand the difference between the 2 potentially different PKeys.
>
>> I am out of office, but if my memory serves me correct, the PKey in the BTH in the MAD packet will be the default PKey.
>
> MADs are sent on reversible paths (13.5.4.1) where the response can be
> constructed from only header info in the request. BTH:pkey doesn't
> __have to be__ the default pkey. It just needs to be a common/shared
> pkey between the source/destination pair.
>
> For example, there are some configurations where the port is only a
> limited member of the default partition (and full member of other non
> default partition(s)).
>
> There are also configurations where the port is both full and limited
> member on one or more partition(s). This is the allow_both_pkeys TRUE
> case for OpenSM which also uses the 'both' syntax in the partitions
> config file. This appears to be the case to which you are referring.
>
> CM MAD could be sent on default partition but that may not work if both
> source and dest are only limited members of the default partition.
>
> CM MAD could be sent on any common/shared pkey between the
> source/destination ports.
>
>> Further, we have per IBTA:
>>
>> C10-132: Packets sent to the General Service Interface QP (QP1) shall be accepted if the P_Key in the packet matches any valid P_Key in the P_Key Table of the port on which the packet arrived. Matching is defined in 10.9.3 Partition Key Matching.
>>
>> So, by architecture, the MAD arrives
> In the case of port being both full and limited member of the same
> partition on which CM MAD comes in on (as defined by pkey in BTH).
>
> I was talking about limited <-> limited CM and didn't realize you were
> talking about the ‘both' case.
OK. So we agree that for the most typical cases, where the BTH.PKey is the default one, the MAD will be received and not trapped by the PKey check.
>>>
>>>
>>>>>
>>>>>> };
>>>>>>
>>>>>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>>>>>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>>>>>> return -EINVAL;
>>>>>> cm_dev = port->cm_dev;
>>>>>>
>>>>>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>>> - be16_to_cpu(path->pkey), &av->pkey_index);
>>>>>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>>> + be16_to_cpu(path->pkey), &av->pkey_index);
>>>>>> if (ret)
>>>>>> return ret;
>>>>>>
>>>>>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>>>>>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>>>>>> cm_req_set_local_resp_timeout(req_msg,
>>>>>> param->local_cm_response_timeout);
>>>>>> - req_msg->pkey = param->primary_path->pkey;
>>>>>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>>>>>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>>>>>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>>>>>
>>>>>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>>>>>> cm_id_priv->responder_resources = param->responder_resources;
>>>>>> cm_id_priv->retry_count = param->retry_count;
>>>>>> cm_id_priv->path_mtu = param->primary_path->mtu;
>>>>>> - cm_id_priv->pkey = param->primary_path->pkey;
>>>>>> +
>>>>>> + /*
>>>>>> + * We want to send the pkey used in the BTH in packets
>>>>>> + * sent. This, in order for the passive side to determine if
>>>>>> + * communication is permitted by the respective pkeys.
>>>>>> + *
>>>>>> + * The pkey in the paths are derived from the MGID, which has
>>>>>> + * the full membership bit set. Hence, we retrieve the pkey by
>>>>>> + * using the address vector's pkey_index.
>>>>>
>>>>> The paths usually come from the SM and I don't expect SM to provide path
>>>>> between ports of only limited members of partition.
>>>>
>>>> In my case, it does.
>>>
>>> Which OpenSM flavor are you talking about (upstream, MLNX, or other) ?
>>
>> I guess other is the right term, as its Oracle’s flavour, 3.2.6_20170609.
>
> I'll check upstream OpenSM when I get a chance to see if same problem
> exists. Oracle OpenSM forked from upstream long time ago and not sure
> how similar the path record code is. There were various impacts for the
> shared port virtualization (for CX-3) in terms of adding the ability to
> allow both pkeys.
I just had a chat with Line H. here in Oracle. She confirms that Oracle’s SM and the upstream one act similar in this matter.
Further, using shared port virtualization, OpenSM will adhere to the physical ports only, and doesn’t even know the PKey assignments to the VFs assigned the VMs.
>>> What is OpenSM version string ? Is allow_both_pkeys TRUE or FALSE ?
>>
>> The ports do have both limited and full memberships, and the partitions.conf file contains both “limited”, “both”, and the “full” keywords.
>>
>>>>> Default ACM provider
>>>>> forms path from multicast group parameters including pkey. Is that the
>>>>> scenario of concern ?
>>>>
>>>> Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
>>>
>>> Not exactly sure what you mean by RDMA CM does this as there are a
>>> number of different ways to use RDMA CM.
>>
>> I am talking specifically about the CM REQ message. For example, if you run an IB-UV test program and specifies that RDMA CM should be used for connection management. Or it could be kernel ULPs. My point is that the pkey field in the CM REQ message (i.e. not the one in the BTH) is always the full one, even though the port uses the limited one.
>
> When you write "port uses limited one", you mean BTH:pkey is limited
> member but REQ:pkey is full member, right ?
Yes, but the BTH:pkey is, as stated above most often the default pkey (limited or full), whilst REQ:pkey is a designated, limited pkey.
>>> Are you referring to using
>>> IPoIB for address resolution ? Another way is for a path record to be
>>> passed in from user space and there is no control over it’s contents.
>>
>> I did not bring up the path record or address resolution issue ;-)
>
> You mentioned that "The pkey in the paths are derived from the MGID" and
> brought this issue up in my thinking about this issue.
Yes, just to inform why it is full - when it shouldn’t be.
>> But I guess you are alluding to the limited-to-limited case should 1) not have been able to perform address resolution and/or 2) the path record should not have been returned. I agree to both 1) and 2).
>>
>> But if both 1) and 2) succeeds, this patch series is a last resort to stop the connection to be established and avoid pkey trap violation traps being sent.
>>
>>> At a minimum, I think the comment mentioning MGID should be clarified.
>>
>> Is the question if an MC group attached by a limited port will include other limited ports or only full member ports?
>>
>> With the OpenSM I am running, the MC group contains limited members.
>
> That is problematic as any sending to MC group by limited members will
> cause pkey violation traps to SM.
Sure, agree, but a separate issue from this series which handles the unicast case.
>>>>> If so, I still don't fully understand the scenario
>>>>> because limited members are not supposed to be part of a multicast
>>>>> group. There was some work started to extend this for client/server
>>>>> model but it was never completed. However, there may be hole(s) in
>>>>> various components of implementation which open(s) this door.
>>>>
>>>> I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
>>>
>>> Agreed. I think that there are other components too...
>>>
>>>> I think the CM REQ message should contain the correct PKey (fixed by this patch series).
>>>>
>>>> And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
>>> This is the backward compatibility for the current behavior on the
>>> active side so that REQ gets rejected rather than accepted, right ?
>>
>> Not sure about the backwards compatibility part, but I certainly agree to the rest. Put it another way, when the MAD payload’s pkey is limited in the CM REQ, and there is only a limited matching key in the pkey cache, a REJ is sent back.
>
> Understood.
>
>>> In general, the approach used when there is pkey mismatch is to silently
>>> drop packet rather than respond which causes a timeout on the requester
>>> side. In this specific case, I’m not sure which approach is better.
>>
>> It is not silently dropped when the port supports pkey trap messages. Then a trap message is sent to the OpenSM.
>
> I was only talking about the requester-responder part. Yes, pkey
> security trap would go to SM.
>
> As I wrote, I'm not sure which is preferable: to explictly reject as
> this patch proposes or just silently drop and penalize requester.
Since the infrastructure is in place, i.e., the REJ reasons, my view is that an explicit REJ with a distinct reason is the most obvious.
Thxs, Håkon
>
> -- Hal
>
>>> Also, is there a similar issue with SIDR_REQ ?
>>
>> Good point, most probaly!
>>
>>> Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
>>>
>>> Understood and agreed.
>>
>>
>> Thxs, Håkon
>>
>>>
>>> -- Hal
>>>
>>>>
>>>> Thxs, Håkon
>>>>
>>>>
>>>>>
>>>>> -- Hal
>>>>>
>>>>>> + */
>>>>>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>>>>>> + cm_id_priv->av.port->port_num,
>>>>>> + cm_id_priv->av.pkey_index,
>>>>>> + &cm_id_priv->pkey);
>>>>>> + if (ret)
>>>>>> + goto error1;
>>>>>> +
>>>>>> cm_id_priv->qp_type = param->qp_type;
>>>>>>
>>>>>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>>>>>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>>>>>> cm_id_priv);
>>>>>> if (ret) {
>>>>>> int err;
>>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>>> + IB_CM_REJ_INVALID_GID);
>>>>>>
>>>>>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>>>>>> work->port->port_num, 0,
>>>>>> &work->path[0].sgid,
>>>>>> NULL);
>>>>>> if (err)
>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>> NULL, 0, NULL, 0);
>>>>>> else
>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>> &work->path[0].sgid,
>>>>>> sizeof(work->path[0].sgid),
>>>>>> NULL, 0);
>>>>>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>>>>>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>>>>>> cm_id_priv);
>>>>>> if (ret) {
>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>>> + IB_CM_REJ_INVALID_ALT_GID);
>>>>>> +
>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>> &work->path[0].sgid,
>>>>>> sizeof(work->path[0].sgid), NULL, 0);
>>>>>> goto rejected;
>>>>>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>>>>>> index 7979cb04f529..56b62303946a 100644
>>>>>> --- a/include/rdma/ib_cm.h
>>>>>> +++ b/include/rdma/ib_cm.h
>>>>>> @@ -3,6 +3,7 @@
>>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>>> *
>>>>>> * This software is available to you under a choice of one of two
>>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>>>>>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>>>>>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>>>>>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>>>>>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>>>>>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>>>>>> + IB_CM_REJ_INVALID_PKEY = 34,
>>>>>> };
>>>>>>
>>>>>> struct ib_cm_rej_event_param {
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 5/11/2018 6:55 AM, Håkon Bugge wrote:
>
>
>> On 10 May 2018, at 18:54, Hal Rosenstock <[email protected]> wrote:
>>
>> On 5/10/2018 11:16 AM, Håkon Bugge wrote:
>>>
>>>
>>>> On 10 May 2018, at 16:01, Hal Rosenstock <[email protected]> wrote:
>>>>
>>>> On 5/10/2018 5:16 AM, Håkon Bugge wrote:
>>>>>
>>>>>
>>>>>> On 9 May 2018, at 13:28, Hal Rosenstock <[email protected]> wrote:
>>>>>>
>>>>>> On 5/9/2018 5:30 AM, Håkon Bugge wrote:
>>>>>>> There is no point in using RDMA CM to establish a connection between
>>>>>>> two QPs that cannot possible communicate. Particularly, if both the
>>>>>>> active and passive side use limited pkeys, they are not able to
>>>>>>> communicate.
>>>>>>>
>>>>>>> In order to detect this situation, the authentic pkey is used in the
>>>>>>> CM REQ message. The authentic pkey is the one that the HCA inserts
>>>>>>> into the BTH in the IB packets.
>>>>>>>
>>>>>>> When the passive side receives the REQ, commit ("ib_core: A full pkey
>>>>>>> is required to match a limited one") ensures that
>>>>>>> ib_find_matched_cached_pkey() fails unless at least one of the pkeys
>>>>>>> compared has the full-member bit.
>>>>>>>
>>>>>>> In the limited-to-limited case, this will prohibit the connection to
>>>>>>> be formed, and thus, Pkey Violation Traps will not be sent to the SM.
>>>>>>>
>>>>>>> Signed-off-by: Håkon Bugge <[email protected]>
>>>>>>> ---
>>>>>>> drivers/infiniband/core/cm.c | 39 ++++++++++++++++++++++++++++++++-------
>>>>>>> include/rdma/ib_cm.h | 4 +++-
>>>>>>> 2 files changed, 35 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
>>>>>>> index a92e1a5c202b..52ed51d5bd2a 100644
>>>>>>> --- a/drivers/infiniband/core/cm.c
>>>>>>> +++ b/drivers/infiniband/core/cm.c
>>>>>>> @@ -3,6 +3,7 @@
>>>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>>>> * Copyright (c) 2004, 2005 Voltaire Corporation. All rights reserved.
>>>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>>>> *
>>>>>>> * This software is available to you under a choice of one of two
>>>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>>>> @@ -91,6 +92,7 @@ static const char * const ibcm_rej_reason_strs[] = {
>>>>>>> [IB_CM_REJ_INVALID_CLASS_VERSION] = "invalid class version",
>>>>>>> [IB_CM_REJ_INVALID_FLOW_LABEL] = "invalid flow label",
>>>>>>> [IB_CM_REJ_INVALID_ALT_FLOW_LABEL] = "invalid alt flow label",
>>>>>>> + [IB_CM_REJ_INVALID_PKEY] = "invalid PKey",
>>>>>>
>>>>>> If this patch goes ahead, IBA spec for CM should be updated to include this.
>>>>>
>>>>> Sure, I see:
>>>>>
>>>>> 33 Invalid Alternate Flow Label
>>>>>
>>>>> as the latest in the spec.
>>>>
>>>> This appears to be an implementation rather than architectural issue. If
>>>> there is limited <-> limited CM, then the MAD would never get up to the
>>>> CM layer as it would be filtered by end port pkey enforcement. This is
>>>> only needed to protect against current code which can send on the full
>>>> rather than limited pkey (in BTH), right ?
>>>
>>> We are talking about two things here. The PKey in the BTH and the PKey in the CM REQ payload. They differ.
>>
>> Yes, I understand the difference between the 2 potentially different PKeys.
>>
>>> I am out of office, but if my memory serves me correct, the PKey in the BTH in the MAD packet will be the default PKey.
>>
>> MADs are sent on reversible paths (13.5.4.1) where the response can be
>> constructed from only header info in the request. BTH:pkey doesn't
>> __have to be__ the default pkey. It just needs to be a common/shared
>> pkey between the source/destination pair.
>>
>> For example, there are some configurations where the port is only a
>> limited member of the default partition (and full member of other non
>> default partition(s)).
>>
>> There are also configurations where the port is both full and limited
>> member on one or more partition(s). This is the allow_both_pkeys TRUE
>> case for OpenSM which also uses the 'both' syntax in the partitions
>> config file. This appears to be the case to which you are referring.
>>
>> CM MAD could be sent on default partition but that may not work if both
>> source and dest are only limited members of the default partition.
>>
>> CM MAD could be sent on any common/shared pkey between the
>> source/destination ports.
>>
>>> Further, we have per IBTA:
>>>
>>> C10-132: Packets sent to the General Service Interface QP (QP1) shall be accepted if the P_Key in the packet matches any valid P_Key in the P_Key Table of the port on which the packet arrived. Matching is defined in 10.9.3 Partition Key Matching.
>>>
>>> So, by architecture, the MAD arrives
>> In the case of port being both full and limited member of the same
>> partition on which CM MAD comes in on (as defined by pkey in BTH).
>>
>> I was talking about limited <-> limited CM and didn't realize you were
>> talking about the ‘both' case.
>
> OK. So we agree that for the most typical cases, where the BTH.PKey is the default one, the MAD will be received and not trapped by the PKey check.
>
>>>>
>>>>
>>>>>>
>>>>>>> };
>>>>>>>
>>>>>>> const char *__attribute_const__ ibcm_reject_msg(int reason)
>>>>>>> @@ -518,8 +520,8 @@ static int cm_init_av_by_path(struct sa_path_rec *path, struct cm_av *av,
>>>>>>> return -EINVAL;
>>>>>>> cm_dev = port->cm_dev;
>>>>>>>
>>>>>>> - ret = ib_find_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>>>> - be16_to_cpu(path->pkey), &av->pkey_index);
>>>>>>> + ret = ib_find_matched_cached_pkey(cm_dev->ib_device, port->port_num,
>>>>>>> + be16_to_cpu(path->pkey), &av->pkey_index);
This routine is shared by LAP send/rcvd and SIDR send. Is this change
the correct thing to do there as well ? Were LAP and SIDR tested ?
>>>>>>> if (ret)
>>>>>>> return ret;
>>>>>>>
>>>>>>> @@ -1241,7 +1243,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
>>>>>>> cm_req_set_starting_psn(req_msg, cpu_to_be32(param->starting_psn));
>>>>>>> cm_req_set_local_resp_timeout(req_msg,
>>>>>>> param->local_cm_response_timeout);
>>>>>>> - req_msg->pkey = param->primary_path->pkey;
>>>>>>> + req_msg->pkey = cpu_to_be16(cm_id_priv->pkey);
>>>>>>> cm_req_set_path_mtu(req_msg, param->primary_path->mtu);
>>>>>>> cm_req_set_max_cm_retries(req_msg, param->max_cm_retries);
>>>>>>>
>>>>>>> @@ -1396,7 +1398,23 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
>>>>>>> cm_id_priv->responder_resources = param->responder_resources;
>>>>>>> cm_id_priv->retry_count = param->retry_count;
>>>>>>> cm_id_priv->path_mtu = param->primary_path->mtu;
>>>>>>> - cm_id_priv->pkey = param->primary_path->pkey;
>>>>>>> +
>>>>>>> + /*
>>>>>>> + * We want to send the pkey used in the BTH in packets
>>>>>>> + * sent. This, in order for the passive side to determine if
>>>>>>> + * communication is permitted by the respective pkeys.
>>>>>>> + *
>>>>>>> + * The pkey in the paths are derived from the MGID, which has
>>>>>>> + * the full membership bit set. Hence, we retrieve the pkey by
>>>>>>> + * using the address vector's pkey_index.
>>>>>>
>>>>>> The paths usually come from the SM and I don't expect SM to provide path
>>>>>> between ports of only limited members of partition.
>>>>>
>>>>> In my case, it does.
>>>>
>>>> Which OpenSM flavor are you talking about (upstream, MLNX, or other) ?
>>>
>>> I guess other is the right term, as its Oracle’s flavour, 3.2.6_20170609.
>>
>> I'll check upstream OpenSM when I get a chance to see if same problem
>> exists. Oracle OpenSM forked from upstream long time ago and not sure
>> how similar the path record code is. There were various impacts for the
>> shared port virtualization (for CX-3) in terms of adding the ability to
>> allow both pkeys.
>
> I just had a chat with Line H. here in Oracle. She confirms that Oracle’s SM and the upstream one act similar in this matter.
To be sure, I will double check this with upstream when I get a few
cycles to do this.
> Further, using shared port virtualization, OpenSM will adhere to the physical ports only, and doesn’t even know the PKey assignments to the VFs assigned the VMs.
Right; with the shared port virtualization, the SM only knows about the
pport (physical port) partitioning whereas with the newer virtualization
annex there is also the vport (VF) partitioning in addition to the pport
partitioning. This has an impact on the paths that can be provided.
>
>>>> What is OpenSM version string ? Is allow_both_pkeys TRUE or FALSE ?
>>>
>>> The ports do have both limited and full memberships, and the partitions.conf file contains both “limited”, “both”, and the “full” keywords.
>>>
>>>>>> Default ACM provider
>>>>>> forms path from multicast group parameters including pkey. Is that the
>>>>>> scenario of concern ?
>>>>>
>>>>> Also RDMA CM does that. Do an ibdump of a CM REQ message sent from a limited port, and you will see the PKey is the full member in the CM REQ msg.
>>>>
>>>> Not exactly sure what you mean by RDMA CM does this as there are a
>>>> number of different ways to use RDMA CM.
>>>
>>> I am talking specifically about the CM REQ message. For example, if you run an IB-UV test program and specifies that RDMA CM should be used for connection management. Or it could be kernel ULPs. My point is that the pkey field in the CM REQ message (i.e. not the one in the BTH) is always the full one, even though the port uses the limited one.
>>
>> When you write "port uses limited one", you mean BTH:pkey is limited
>> member but REQ:pkey is full member, right ?
>
> Yes, but the BTH:pkey is, as stated above most often the default pkey (limited or full), whilst REQ:pkey is a designated, limited pkey.
FWIW I think the comment "The pkey in the paths are derived from the
MGID" can be improved as this is one use case and is not the most common
one.
>>>> Are you referring to using
>>>> IPoIB for address resolution ? Another way is for a path record to be
>>>> passed in from user space and there is no control over it’s contents.
>>>
>>> I did not bring up the path record or address resolution issue ;-)
>>
>> You mentioned that "The pkey in the paths are derived from the MGID" and
>> brought this issue up in my thinking about this issue.
>
> Yes, just to inform why it is full - when it shouldn’t be.
>
>>> But I guess you are alluding to the limited-to-limited case should 1) not have been able to perform address resolution and/or 2) the path record should not have been returned. I agree to both 1) and 2).
>>>
>>> But if both 1) and 2) succeeds, this patch series is a last resort to stop the connection to be established and avoid pkey trap violation traps being sent.
>>>
>>>> At a minimum, I think the comment mentioning MGID should be clarified.
>>>
>>> Is the question if an MC group attached by a limited port will include other limited ports or only full member ports?
>>>
>>> With the OpenSM I am running, the MC group contains limited members.
>>
>> That is problematic as any sending to MC group by limited members will
>> cause pkey violation traps to SM.
>
> Sure, agree, but a separate issue from this series which handles the unicast case.
>
>>>>>> If so, I still don't fully understand the scenario
>>>>>> because limited members are not supposed to be part of a multicast
>>>>>> group. There was some work started to extend this for client/server
>>>>>> model but it was never completed. However, there may be hole(s) in
>>>>>> various components of implementation which open(s) this door.
>>>>>
>>>>> I view OpenSM not returning a valid path between two limited members an orthogonal issue, as OpenSM is another component.
>>>>
>>>> Agreed. I think that there are other components too...
>>>>
>>>>> I think the CM REQ message should contain the correct PKey (fixed by this patch series).
>>>>>
>>>>> And in the event the passive side being a limited member and receives a CM REQ with a limited PKey, that connection should not be formed (fixed by this patch series).
>>>> This is the backward compatibility for the current behavior on the
>>>> active side so that REQ gets rejected rather than accepted, right ?
>>>
>>> Not sure about the backwards compatibility part, but I certainly agree to the rest. Put it another way, when the MAD payload’s pkey is limited in the CM REQ, and there is only a limited matching key in the pkey cache, a REJ is sent back.
>>
>> Understood.
>>
>>>> In general, the approach used when there is pkey mismatch is to silently
>>>> drop packet rather than respond which causes a timeout on the requester
>>>> side. In this specific case, I’m not sure which approach is better.
>>>
>>> It is not silently dropped when the port supports pkey trap messages. Then a trap message is sent to the OpenSM.
>>
>> I was only talking about the requester-responder part. Yes, pkey
>> security trap would go to SM.
>>
>> As I wrote, I'm not sure which is preferable: to explictly reject as
>> this patch proposes or just silently drop and penalize requester.
>
> Since the infrastructure is in place, i.e., the REJ reasons, my view is that an explicit REJ with a distinct reason is the most obvious.
The downsides are that it reveals info that perhaps should not be
revealed and allows the requestor to rerequest sooner. In some
scenarios, that could cause DoS attack on the responder (server).
-- Hal
>
> Thxs, Håkon
>
>>
>> -- Hal
>>
>>>> Also, is there a similar issue with SIDR_REQ ?
>>>
>>> Good point, most probaly!
>>>
>>>> Read me correct, I am also in favour of fixing the OpenSM to not return a valid (but useless) path record in this case.
>>>>
>>>> Understood and agreed.
>>>
>>>
>>> Thxs, Håkon
>>>
>>>>
>>>> -- Hal
>>>>
>>>>>
>>>>> Thxs, Håkon
>>>>>
>>>>>
>>>>>>
>>>>>> -- Hal
>>>>>>
>>>>>>> + */
>>>>>>> + ret = ib_get_cached_pkey(cm_id_priv->id.device,
>>>>>>> + cm_id_priv->av.port->port_num,
>>>>>>> + cm_id_priv->av.pkey_index,
>>>>>>> + &cm_id_priv->pkey);
>>>>>>> + if (ret)
>>>>>>> + goto error1;
>>>>>>> +
>>>>>>> cm_id_priv->qp_type = param->qp_type;
>>>>>>>
>>>>>>> ret = cm_alloc_msg(cm_id_priv, &cm_id_priv->msg);
>>>>>>> @@ -1956,16 +1974,19 @@ static int cm_req_handler(struct cm_work *work)
>>>>>>> cm_id_priv);
>>>>>>> if (ret) {
>>>>>>> int err;
>>>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>>>> + IB_CM_REJ_INVALID_GID);
>>>>>>>
>>>>>>> err = ib_get_cached_gid(work->port->cm_dev->ib_device,
>>>>>>> work->port->port_num, 0,
>>>>>>> &work->path[0].sgid,
>>>>>>> NULL);
>>>>>>> if (err)
>>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>>> NULL, 0, NULL, 0);
>>>>>>> else
>>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
>>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>>> &work->path[0].sgid,
>>>>>>> sizeof(work->path[0].sgid),
>>>>>>> NULL, 0);
>>>>>>> @@ -1975,7 +1996,11 @@ static int cm_req_handler(struct cm_work *work)
>>>>>>> ret = cm_init_av_by_path(&work->path[1], &cm_id_priv->alt_av,
>>>>>>> cm_id_priv);
>>>>>>> if (ret) {
>>>>>>> - ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_ALT_GID,
>>>>>>> + int rej_reason = (ret == -ENOENT ?
>>>>>>> + IB_CM_REJ_INVALID_PKEY :
>>>>>>> + IB_CM_REJ_INVALID_ALT_GID);
>>>>>>> +
>>>>>>> + ib_send_cm_rej(cm_id, rej_reason,
>>>>>>> &work->path[0].sgid,
>>>>>>> sizeof(work->path[0].sgid), NULL, 0);
>>>>>>> goto rejected;
>>>>>>> diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
>>>>>>> index 7979cb04f529..56b62303946a 100644
>>>>>>> --- a/include/rdma/ib_cm.h
>>>>>>> +++ b/include/rdma/ib_cm.h
>>>>>>> @@ -3,6 +3,7 @@
>>>>>>> * Copyright (c) 2004 Topspin Corporation. All rights reserved.
>>>>>>> * Copyright (c) 2004 Voltaire Corporation. All rights reserved.
>>>>>>> * Copyright (c) 2005 Sun Microsystems, Inc. All rights reserved.
>>>>>>> + * Copyright (c) 2018 Oracle and/or its affiliates. All rights reserved.
>>>>>>> *
>>>>>>> * This software is available to you under a choice of one of two
>>>>>>> * licenses. You may choose to be licensed under the terms of the GNU
>>>>>>> @@ -183,7 +184,8 @@ enum ib_cm_rej_reason {
>>>>>>> IB_CM_REJ_DUPLICATE_LOCAL_COMM_ID = 30,
>>>>>>> IB_CM_REJ_INVALID_CLASS_VERSION = 31,
>>>>>>> IB_CM_REJ_INVALID_FLOW_LABEL = 32,
>>>>>>> - IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33
>>>>>>> + IB_CM_REJ_INVALID_ALT_FLOW_LABEL = 33,
>>>>>>> + IB_CM_REJ_INVALID_PKEY = 34,
>>>>>>> };
>>>>>>>
>>>>>>> struct ib_cm_rej_event_param {
>>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>> the body of a message to [email protected]
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
> We are talking about two things here. The PKey in the BTH and the
> PKey in the CM REQ payload. They differ.
>
> I am out of office, but if my memory serves me correct, the PKey in
> the BTH in the MAD packet will be the default PKey. Further, we have
> per IBTA:
This sounds like a Linux bug.
Linux does not do a PR to get a reversible path dedicated to the GMP,
so it always uses the data flow path, thus the GMP path paramenters
and those in the REQ should always exactly match.
Where is Linux setting the BTH.PKey and how did it choose to use the
default pkey? Lets fix that at least for sure.
Once that is fixed the rest of the series makes no sense since a REQ
with invalid PKey will never arrive.
However...
This series seems inconsistent with the spec.
IIRC the spec doesn't say if a full or limited pkey should be placed
in the REQ (Hal?). It is designed so that the requestor can get a
single reversible path and put that results into the REQ without
additional processing, however the PR returns only one PKey and again,
itis not really specified if it should be the full or limited pkey
(Hal?).
Basically this means that any pkey in the REQ could randomly be the
full or limited value, and that in-of-itself has not bearing on the
connection.
So it is quite wrong to insist that the pkey be limited or full when
processing the REQ. The end port is expected to match against the
local table.
The real answer to your trap problem is to fix the SM to not create
paths that are non-functional, that is just flat out broken SM
behavior.
Jason
On Fri, May 11, 2018 at 12:55:00PM +0200, Håkon Bugge wrote:
> > I'll check upstream OpenSM when I get a chance to see if same problem
> > exists. Oracle OpenSM forked from upstream long time ago and not sure
> > how similar the path record code is. There were various impacts for the
> > shared port virtualization (for CX-3) in terms of adding the ability to
> > allow both pkeys.
>
> I just had a chat with Line H. here in Oracle. She confirms that Oracle’s SM and the upstream one act similar in this matter.
>
> Further, using shared port virtualization, OpenSM will adhere to the
> physical ports only, and doesn’t even know the PKey assignments to
> the VFs assigned the VMs.
Yes, but that isn't the point. If the VM receives a PR with a pkey
that is not found in the VM's table then it should not even send a CM
REQ in the first place.
Jason
On 5/14/2018 5:02 PM, Jason Gunthorpe wrote:
> On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
>
>> We are talking about two things here. The PKey in the BTH and the
>> PKey in the CM REQ payload. They differ.
>>
>> I am out of office, but if my memory serves me correct, the PKey in
>> the BTH in the MAD packet will be the default PKey. Further, we have
>> per IBTA:
>
> This sounds like a Linux bug.
>
> Linux does not do a PR to get a reversible path dedicated to the GMP> so it always uses the data flow path, thus the GMP path paramenters
> and those in the REQ should always exactly match.
>
> Where is Linux setting the BTH.PKey and how did it choose to use the
> default pkey? Lets fix that at least for sure.
>
> Once that is fixed the rest of the series makes no sense since a REQ
> with invalid PKey will never arrive.
>
> However...
>
> This series seems inconsistent with the spec.
>
> IIRC the spec doesn't say if a full or limited pkey should be placed
> in the REQ (Hal?).
CM spec for REQ just says partition key without indicating whether this
means P_Key or just the partition (15 bits) so my read is that either
full or limited pkey is allowed in REQ.
> It is designed so that the requestor can get a
> single reversible path and put that results into the REQ without
> additional processing, however the PR returns only one PKey and again,
> it is not really specified if it should be the full or limited pkey
> (Hal?).
Correct; it's not specified.
> Basically this means that any pkey in the REQ could randomly be the
> full or limited value, and that in-of-itself has not bearing on the
> connection.
>
> So it is quite wrong to insist that the pkey be limited or full when
> processing the REQ. The end port is expected to match against the
> local table.
Note that there is thorny issue with shared (physical) port
virtualization. In shared port virtualization, the VF pkey assignment is
a local matter. Only thing SM knows is the physical port pkeys where
both full and limited membership of same partition is possible. It is
conceivable that CM REQ contains limited partition key between 2 limited
VFs and for that a new REJ reason code appears to be needed.
-- Hal
> The real answer to your trap problem is to fix the SM to not create
> paths that are non-functional, that is just flat out broken SM
> behavior.
>
> Jason
>
> On 15 May 2018, at 02:38, Hal Rosenstock <[email protected]> wrote:
>
> On 5/14/2018 5:02 PM, Jason Gunthorpe wrote:
>> On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
>>
>>> We are talking about two things here. The PKey in the BTH and the
>>> PKey in the CM REQ payload. They differ.
>>>
>>> I am out of office, but if my memory serves me correct, the PKey in
>>> the BTH in the MAD packet will be the default PKey. Further, we have
>>> per IBTA:
>>
>> This sounds like a Linux bug.
>>
>> Linux does not do a PR to get a reversible path dedicated to the GMP> so it always uses the data flow path, thus the GMP path paramenters
>> and those in the REQ should always exactly match.
>>
>> Where is Linux setting the BTH.PKey and how did it choose to use the
>> default pkey? Lets fix that at least for sure.
Linux isn’t. The BTH.PKey is inserted by the HCA (hw or fw) coming from the P_Key table (10.9.2 in IBTA), selected by a pkey_index associated with the QP.
As per C10-133: Packets sent from the Send Queue of a GSI QP shall attach a P_Key associated with that QP, just as a P_Key is associated with nonmanagement QPs
>> Once that is fixed the rest of the series makes no sense since a REQ
>> with invalid PKey will never arrive.
>>
>> However...
>>
>> This series seems inconsistent with the spec.
>>
>> IIRC the spec doesn't say if a full or limited pkey should be placed
>> in the REQ (Hal?).
>
> CM spec for REQ just says partition key without indicating whether this
> means P_Key or just the partition (15 bits) so my read is that either
> full or limited pkey is allowed in REQ.
>
>> It is designed so that the requestor can get a
>> single reversible path and put that results into the REQ without
>> additional processing, however the PR returns only one PKey and again,
>> it is not really specified if it should be the full or limited pkey
>> (Hal?).
>
> Correct; it's not specified.
>
>> Basically this means that any pkey in the REQ could randomly be the
>> full or limited value, and that in-of-itself has not bearing on the
>> connection.
>>
>> So it is quite wrong to insist that the pkey be limited or full when
>> processing the REQ. The end port is expected to match against the
>> local table.
>
> Note that there is thorny issue with shared (physical) port
> virtualization. In shared port virtualization, the VF pkey assignment is
> a local matter. Only thing SM knows is the physical port pkeys where
> both full and limited membership of same partition is possible. It is
> conceivable that CM REQ contains limited partition key between 2 limited
> VFs and for that a new REJ reason code appears to be needed.
+1
Håkon
>
> -- Hal
>
>> The real answer to your trap problem is to fix the SM to not create
>> paths that are non-functional, that is just flat out broken SM
>> behavior.
>>
>> Jason
>>
On Tue, May 15, 2018 at 08:11:09PM +0200, Håkon Bugge wrote:
> > On 15 May 2018, at 02:38, Hal Rosenstock <[email protected]> wrote:
> >
> > On 5/14/2018 5:02 PM, Jason Gunthorpe wrote:
> >> On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
> >>
> >>> We are talking about two things here. The PKey in the BTH and the
> >>> PKey in the CM REQ payload. They differ.
> >>>
> >>> I am out of office, but if my memory serves me correct, the PKey in
> >>> the BTH in the MAD packet will be the default PKey. Further, we have
> >>> per IBTA:
> >>
> >> This sounds like a Linux bug.
> >>
> >> Linux does not do a PR to get a reversible path dedicated to the GMP> so it always uses the data flow path, thus the GMP path paramenters
> >> and those in the REQ should always exactly match.
> >>
> >> Where is Linux setting the BTH.PKey and how did it choose to use the
> >> default pkey? Lets fix that at least for sure.
>
> Linux isn’t. The BTH.PKey is inserted by the HCA (hw or fw) coming from the P_Key table (10.9.2 in IBTA), selected by a pkey_index associated with the QP.
>
> As per C10-133: Packets sent from the Send Queue of a GSI QP shall
> attach a P_Key associated with that QP, just as a P_Key is
> associated with nonmanagement QPs
That language doesn't mean the PKey is forced to the default, it says
the pkey is programmable.
Linux implemented programmable PKeys for the GSI by using the work
request, so it deviates from what the spec imagined (which was
probably one GSI QP per PKey).
See ib_create_send_mad for instance:
mad_send_wr->send_wr.wr.wr_cqe = &mad_send_wr->mad_list.cqe;
mad_send_wr->send_wr.wr.sg_list = mad_send_wr->sg_list;
mad_send_wr->send_wr.wr.num_sge = 2;
mad_send_wr->send_wr.wr.opcode = IB_WR_SEND;
mad_send_wr->send_wr.wr.send_flags = IB_SEND_SIGNALED;
mad_send_wr->send_wr.remote_qpn = remote_qpn;
mad_send_wr->send_wr.remote_qkey = IB_QP_SET_QKEY;
mad_send_wr->send_wr.pkey_index = pkey_index;
The pkey_index associated with the QP1 is ignored:
/*
* PKey index for QP1 is irrelevant but
* one is needed for the Reset to Init transition
*/
attr->qp_state = IB_QPS_INIT;
attr->pkey_index = pkey_index;
attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
ret = ib_modify_qp(qp, attr, IB_QP_STATE |
IB_QP_PKEY_INDEX | IB_QP_QKEY);
Since is is present in every WR.
> >> Basically this means that any pkey in the REQ could randomly be the
> >> full or limited value, and that in-of-itself has not bearing on the
> >> connection.
> >>
> >> So it is quite wrong to insist that the pkey be limited or full when
> >> processing the REQ. The end port is expected to match against the
> >> local table.
> >
> > Note that there is thorny issue with shared (physical) port
> > virtualization. In shared port virtualization, the VF pkey assignment is
> > a local matter. Only thing SM knows is the physical port pkeys where
> > both full and limited membership of same partition is possible. It is
> > conceivable that CM REQ contains limited partition key between 2 limited
> > VFs and for that a new REJ reason code appears to be needed.
>
> +1
This is not a difficult issue.
If the GMP is properly tagged with the right PKey then it will never
be delivered to the VM if the VM does not have the PKey in the
table. It is up to the hypervisor to block GMPs that have Pkeys not in
the virtual PKey table of the VF.
The only time you could need a new REJ code is if the GMP is using a
PKey different from the REQ - which is a pretty goofy thing to do
considering this VM case.
Remember the SM doesn't know what Pkeys are in the VM, so it is
basically impossible for the REQ side to reliably select two different
pkeys and know that they will bothmake it to the VM.
One pkey could be done reliably if it matched ipoib, for instance, but
two really cannot in the general case.
So again - the bug here is that the GMP is being sent with the wrong
pkey. Fix that, then see what is left to fix..
If I recall there were bugs here in mlx drivers, where the driver sent
with the wrong Pkey. I think this has actually been fixed now, so
please check the upstream kernel to be sure the Pkey is not what it is
supposed to be.
Jason
> On 15 May 2018, at 21:04, Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, May 15, 2018 at 08:11:09PM +0200, Håkon Bugge wrote:
>>> On 15 May 2018, at 02:38, Hal Rosenstock <[email protected]> wrote:
>>>
>>> On 5/14/2018 5:02 PM, Jason Gunthorpe wrote:
>>>> On Thu, May 10, 2018 at 05:16:28PM +0200, Håkon Bugge wrote:
>>>>
>>>>> We are talking about two things here. The PKey in the BTH and the
>>>>> PKey in the CM REQ payload. They differ.
>>>>>
>>>>> I am out of office, but if my memory serves me correct, the PKey in
>>>>> the BTH in the MAD packet will be the default PKey. Further, we have
>>>>> per IBTA:
>>>>
>>>> This sounds like a Linux bug.
>>>>
>>>> Linux does not do a PR to get a reversible path dedicated to the GMP> so it always uses the data flow path, thus the GMP path paramenters
>>>> and those in the REQ should always exactly match.
>>>>
>>>> Where is Linux setting the BTH.PKey and how did it choose to use the
>>>> default pkey? Lets fix that at least for sure.
>>
>> Linux isn’t. The BTH.PKey is inserted by the HCA (hw or fw) coming from the P_Key table (10.9.2 in IBTA), selected by a pkey_index associated with the QP.
>>
>> As per C10-133: Packets sent from the Send Queue of a GSI QP shall
>> attach a P_Key associated with that QP, just as a P_Key is
>> associated with nonmanagement QPs
>
> That language doesn't mean the PKey is forced to the default, it says
> the pkey is programmable.
>
> Linux implemented programmable PKeys for the GSI by using the work
> request, so it deviates from what the spec imagined (which was
> probably one GSI QP per PKey).
>
> See ib_create_send_mad for instance:
>
> mad_send_wr->send_wr.wr.wr_cqe = &mad_send_wr->mad_list.cqe;
> mad_send_wr->send_wr.wr.sg_list = mad_send_wr->sg_list;
> mad_send_wr->send_wr.wr.num_sge = 2;
> mad_send_wr->send_wr.wr.opcode = IB_WR_SEND;
> mad_send_wr->send_wr.wr.send_flags = IB_SEND_SIGNALED;
> mad_send_wr->send_wr.remote_qpn = remote_qpn;
> mad_send_wr->send_wr.remote_qkey = IB_QP_SET_QKEY;
> mad_send_wr->send_wr.pkey_index = pkey_index;
>
> The pkey_index associated with the QP1 is ignored:
>
> /*
> * PKey index for QP1 is irrelevant but
> * one is needed for the Reset to Init transition
> */
> attr->qp_state = IB_QPS_INIT;
> attr->pkey_index = pkey_index;
> attr->qkey = (qp->qp_num == 0) ? 0 : IB_QP1_QKEY;
> ret = ib_modify_qp(qp, attr, IB_QP_STATE |
> IB_QP_PKEY_INDEX | IB_QP_QKEY);
>
> Since is is present in every WR.
>
>>>> Basically this means that any pkey in the REQ could randomly be the
>>>> full or limited value, and that in-of-itself has not bearing on the
>>>> connection.
>>>>
>>>> So it is quite wrong to insist that the pkey be limited or full when
>>>> processing the REQ. The end port is expected to match against the
>>>> local table.
>>>
>>> Note that there is thorny issue with shared (physical) port
>>> virtualization. In shared port virtualization, the VF pkey assignment is
>>> a local matter. Only thing SM knows is the physical port pkeys where
>>> both full and limited membership of same partition is possible. It is
>>> conceivable that CM REQ contains limited partition key between 2 limited
>>> VFs and for that a new REJ reason code appears to be needed.
>>
>> +1
>
> This is not a difficult issue.
>
> If the GMP is properly tagged with the right PKey then it will never
> be delivered to the VM if the VM does not have the PKey in the
> table.
Not quite right. For the shared port model, a GMP will (most probably) be accepted by the physical port, due to:
IBTA 3.9.5: QP1 is a member of all of the port’s partitions (i.e., can accept any packet specifying a P_Key contained in the port’s P_Key table).
and
C9-42: If the destination QP is QP1, the BTH:P_Key shall be compared to the set of P_Keys associated with the port on which the packet arrived. If the P_Key matches any of the keys associated with the port, it shall be considered valid.
So, if the GMP is destined to VM1, which is a limited member, but we have another VM2 which is a full member of the same partition, the GMP will pass the HCA’s PKey check.
> It is up to the hypervisor to block GMPs that have Pkeys not in
> the virtual PKey table of the VF.
The packet is received by the HCA and stripped from IB headers, in particular the BTH. How can the "hypervisor" block it when its doesn’t have access to the GMP’s BTH.PKey?
> The only time you could need a new REJ code is if the GMP is using a
> PKey different from the REQ - which is a pretty goofy thing to do
> considering this VM case.
Its goofy. In the CX-3 shared port model, the BTH.PKey is the default one and the REQ.PKey is the full one even if the sending VM’s port only is a limited member. This patch series fixes the last issue.
> Remember the SM doesn't know what Pkeys are in the VM, so it is
> basically impossible for the REQ side to reliably select two different
> pkeys and know that they will bothmake it to the VM.
The active side should use the "authentic" PKey in the REQ message. That is the one that would be used in BTH.PKey when communication has been established. This is implemented by this patch series.
Not sure what you mean by "reliably select two different pkeys". The CM REQ message contains one PKey.
> One pkey could be done reliably if it matched ipoib, for instance, but
> two really cannot in the general case.
>
> So again - the bug here is that the GMP is being sent with the wrong
> pkey. Fix that, then see what is left to fix..
See above, not sure how that could be implemented. And if it is solved by the HCA trapping the GMP due to the PKey check, it doesn’t help me, as the purpose of the series is to avoid (excessive) PKey traps sent to the SM.
> If I recall there were bugs here in mlx drivers, where the driver sent
> with the wrong Pkey. I think this has actually been fixed now, so
> please check the upstream kernel to be sure the Pkey is not what it is
> supposed to be.
Let me get back to you with some ibdumps here.
Håkon
On Wed, May 16, 2018 at 08:47:21AM +0200, Håkon Bugge wrote:
> > This is not a difficult issue.
> >
> > If the GMP is properly tagged with the right PKey then it will never
> > be delivered to the VM if the VM does not have the PKey in the
> > table.
>
> Not quite right. For the shared port model, a GMP will (most
> probably) be accepted by the physical port, due to:
Sure, but I am talking about the VM's 'virtual port'.
> So, if the GMP is destined to VM1, which is a limited member, but we
> have another VM2 which is a full member of the same partition, the
> GMP will pass the HCA’s PKey check.
Sure.
> > It is up to the hypervisor to block GMPs that have Pkeys not in
> > the virtual PKey table of the VF.
>
> The packet is received by the HCA and stripped from IB headers, in
> particular the BTH. How can the "hypervisor" block it when its
> doesn’t have access to the GMP’s BTH.PKey?
This is wrong too, in Linux the WC's for GMP packets include the pkey
index that was used to RX the packet. This is a Linux extension
matching the one on the sending side.
The hypervisor *must* block these GMPs, it is a hard requirement of
the pkey security model for VMs.
AFAIK the Mellanox drivers do this right - do you know differently?
> > The only time you could need a new REJ code is if the GMP is using a
> > PKey different from the REQ - which is a pretty goofy thing to do
> > considering this VM case.
>
> Its goofy. In the CX-3 shared port model, the BTH.PKey is the
> default one and the REQ.PKey is the full one even if the sending
> VM’s port only is a limited member. This patch series fixes the last
> issue.
Again, this is wrong, the BTH.Pkey and REQ.Pkey should be the same -
please fix that first before trying to change anything else.
> > Remember the SM doesn't know what Pkeys are in the VM, so it is
> > basically impossible for the REQ side to reliably select two different
> > pkeys and know that they will bothmake it to the VM.
>
> The active side should use the "authentic" PKey in the REQ
> message. That is the one that would be used in BTH.PKey when
> communication has been established. This is implemented by this
> patch series.
>
> Not sure what you mean by "reliably select two different pkeys". The
> CM REQ message contains one PKey.
If BTH.Pkey != REQ.PKey then the requestor side has to obviously
select two PKeys, which is basically impossible.
The VM should not be part of the default partition, for instance.
> See above, not sure how that could be implemented. And if it is
> solved by the HCA trapping the GMP due to the PKey check, it doesn’t
> help me, as the purpose of the series is to avoid (excessive) PKey
> traps sent to the SM.
It might not help you, but is shows this fix for the pkey trap issue
is wrong. You must fix the pkey traps on the sending side, not on the
responder side..
> > If I recall there were bugs here in mlx drivers, where the driver
> > sent with the wrong Pkey. I think this has actually been fixed
> > now, so please check the upstream kernel to be sure the Pkey is
> > not what it is supposed to be.
>
> Let me get back to you with some ibdumps here.
Upstream kernel (eg 4.16) and newish Mellanox firmware please.
And it would be fantastic if you could ID why this is happening, AFAIK
it should be OK in the kernel.
Jason
On 5/16/2018 11:12 AM, Jason Gunthorpe wrote:
> On Wed, May 16, 2018 at 08:47:21AM +0200, Håkon Bugge wrote:
>
>>> This is not a difficult issue.
>>>
>>> If the GMP is properly tagged with the right PKey then it will never
>>> be delivered to the VM if the VM does not have the PKey in the
>>> table.
>>
>> Not quite right. For the shared port model, a GMP will (most
>> probably) be accepted by the physical port, due to:
>
> Sure, but I am talking about the VM's 'virtual port'.
>
>> So, if the GMP is destined to VM1, which is a limited member, but we
>> have another VM2 which is a full member of the same partition, the
>> GMP will pass the HCA’s PKey check.
>
> Sure.
>
>>> It is up to the hypervisor to block GMPs that have Pkeys not in
>>> the virtual PKey table of the VF.
>>
>> The packet is received by the HCA and stripped from IB headers, in
>> particular the BTH. How can the "hypervisor" block it when its
>> doesn’t have access to the GMP’s BTH.PKey?
>
> This is wrong too, in Linux the WC's for GMP packets include the pkey
> index that was used to RX the packet. This is a Linux extension
> matching the one on the sending side.
>
> The hypervisor *must* block these GMPs, it is a hard requirement of
> the pkey security model for VMs.
Aside from the pkey enforcement at the VF, there is also the issue of
the pkey trap - if physical port model is followed for shared virtual
ports, trap should be generated which could look confusing to SM as well
as counting in some error counter (for physical port, it's either
PortCounter:Port[Xmit Rcv]ConstraintErrors.
> AFAIK the Mellanox drivers do this right - do you know differently?
>
>>> The only time you could need a new REJ code is if the GMP is using a
>>> PKey different from the REQ - which is a pretty goofy thing to do
>>> considering this VM case.
>>
>> Its goofy. In the CX-3 shared port model, the BTH.PKey is the
>> default one and the REQ.PKey is the full one even if the sending
>> VM’s port only is a limited member. This patch series fixes the last
>> issue.
>
> Again, this is wrong, the BTH.Pkey and REQ.Pkey should be the same -
I do not believe there is anything in the spec that requires this. I
agree it's the simplest use model though.
> please fix that first before trying to change anything else.
>
>>> Remember the SM doesn't know what Pkeys are in the VM, so it is
>>> basically impossible for the REQ side to reliably select two different
>>> pkeys and know that they will bothmake it to the VM.
>>
>> The active side should use the "authentic" PKey in the REQ
>> message. That is the one that would be used in BTH.PKey when
>> communication has been established. This is implemented by this
>> patch series.
>>
>> Not sure what you mean by "reliably select two different pkeys". The
>> CM REQ message contains one PKey.
>
> If BTH.Pkey != REQ.PKey then the requestor side has to obviously
> select two PKeys, which is basically impossible.
>
> The VM should not be part of the default partition, for instance.
I think that the VM is at least a limited member of the default partition.
-- Hal
>> See above, not sure how that could be implemented. And if it is
>> solved by the HCA trapping the GMP due to the PKey check, it doesn’t
>> help me, as the purpose of the series is to avoid (excessive) PKey
>> traps sent to the SM.
>
> It might not help you, but is shows this fix for the pkey trap issue
> is wrong. You must fix the pkey traps on the sending side, not on the
> responder side..
>
>>> If I recall there were bugs here in mlx drivers, where the driver
>>> sent with the wrong Pkey. I think this has actually been fixed
>>> now, so please check the upstream kernel to be sure the Pkey is
>>> not what it is supposed to be.
>>
>> Let me get back to you with some ibdumps here.
>
> Upstream kernel (eg 4.16) and newish Mellanox firmware please.
>
> And it would be fantastic if you could ID why this is happening, AFAIK
> it should be OK in the kernel.
>
> Jason
>
On Wed, May 16, 2018 at 12:42:37PM -0400, Hal Rosenstock wrote:
> >>> The only time you could need a new REJ code is if the GMP is using a
> >>> PKey different from the REQ - which is a pretty goofy thing to do
> >>> considering this VM case.
> >>
> >> Its goofy. In the CX-3 shared port model, the BTH.PKey is the
> >> default one and the REQ.PKey is the full one even if the sending
> >> VM’s port only is a limited member. This patch series fixes the last
> >> issue.
> >
> > Again, this is wrong, the BTH.Pkey and REQ.Pkey should be the same -
>
> I do not believe there is anything in the spec that requires this. I
> agree it's the simplest use model though.
The spec doesn't require it, but the design of the Linux CM certainly
does.
> > If BTH.Pkey != REQ.PKey then the requestor side has to obviously
> > select two PKeys, which is basically impossible.
> >
> > The VM should not be part of the default partition, for instance.
>
> I think that the VM is at least a limited member of the default partition.
Well, being a limited member still means the default pkey cannot be
used for CM GMPs.
I actually can't think of why you'd want to do this, better to put the
SM nodes in all the pkeys and reserve the default pkey completely for
the network management control plane.
Jason
On Wed, May 16, 2018 at 07:46:10PM +0200, Håkon Bugge wrote:
> OK. Lets take one example. The pkey table contains 0xFFFF, 0x8001,
> 0x0001.
>
> The wce.pkey_index is 1 (i.e., pointing to 0x8001). Now, tell me, was
> BTH.PKey 0x8001 (matches 0x8001) or was it 0x0001 (also matching
> 0x8001) ?
As far as the Linux core is concerned, it must have been 0x8001,
because the only way the pkey_index feature works properly is if
exact-match takes precedence over in-exact match.
Jason
> On 16 May 2018, at 20:01, Jason Gunthorpe <[email protected]> wrote:
>
> On Wed, May 16, 2018 at 07:46:10PM +0200, Håkon Bugge wrote:
>
>> OK. Lets take one example. The pkey table contains 0xFFFF, 0x8001,
>> 0x0001.
>>
>> The wce.pkey_index is 1 (i.e., pointing to 0x8001). Now, tell me, was
>> BTH.PKey 0x8001 (matches 0x8001) or was it 0x0001 (also matching
>> 0x8001) ?
>
> As far as the Linux core is concerned, it must have been 0x8001,
> because the only way the pkey_index feature works properly is if
> exact-match takes precedence over in-exact match.
And now if the table only contains 0xFFFF, 0x8001, how do you tell?
Håkon
On Wed, May 16, 2018 at 08:14:50PM +0200, Håkon Bugge wrote:
>
>
> > On 16 May 2018, at 20:01, Jason Gunthorpe <[email protected]> wrote:
> >
> > On Wed, May 16, 2018 at 07:46:10PM +0200, Håkon Bugge wrote:
> >
> >> OK. Lets take one example. The pkey table contains 0xFFFF, 0x8001,
> >> 0x0001.
> >>
> >> The wce.pkey_index is 1 (i.e., pointing to 0x8001). Now, tell me, was
> >> BTH.PKey 0x8001 (matches 0x8001) or was it 0x0001 (also matching
> >> 0x8001) ?
> >
> > As far as the Linux core is concerned, it must have been 0x8001,
> > because the only way the pkey_index feature works properly is if
> > exact-match takes precedence over in-exact match.
>
> And now if the table only contains 0xFFFF, 0x8001, how do you tell?
It doesn't matter.
The delgation of Pkeys to VMs are on a pkey-table-index basis, so if
it matches table entry 1 and entry 1 is passed to the VM, then the
packet can be passed to the VM.
Jason
On 5/16/2018 3:22 PM, Håkon Bugge wrote:
> But, do we need an update to IBTA (that the BTH.PKey shall be that of the VM's Port)?
Nothing in spec mentions shared (port) virtualization so that is an
exercise completely left to the reader...
Annex A19 is silent on this specific point but the virtual port model
has virtual port partition table so there's no ambiguity there.