2016-03-02 22:02:25

by James Simmons

Subject: [PATCH 00/27] Third batch of LNet fixes

This patch set merges all the fixes for the klnd drivers, socklnd and
o2iblnd, bringing them in line with what is currently used in production
environments. Several more fixes for the LNet core are also included in
this patch set.

Alyona Romanenko (1):
staging: lustre: issue in the offset in lnet match hash table

Amir Shehata (3):
staging: lustre: change ibh_mrs from array to pointer
staging: lustre: make ko2iblnd connect parameters persistent
staging: lustre: Ignore hops if not explicitly set

Dmitry Eremin (4):
staging: lustre: fix socklnd issues found by Klocwork Insight tool
staging: lustre: fix api-ni.c issues found by Klocwork Insight tool
staging: lustre: fix conctl.c issues found by Klocwork Insight tool
staging: lustre: fix framework.c issues found by Klocwork Insight tool

Doug Oucharek (1):
staging: lustre: Change connect peer failed cleanup order

Frank Zago (3):
staging: lustre: make o2iblnd local functions static
staging: lustre: make o2iblnd_cb.c local functions static
staging: lustre: corrected some typos and grammar errors

James Simmons (3):
staging: lustre: return proper error code for LNet core
staging: lustre: bind socklnd peers to a specific CPT
staging: lustre: reverse LNet and infinband header order

Jeremy Filizetti (1):
staging: lustre: Support different ko2iblnd configs between systems

Jian Yu (1):
staging: lustre: replace direct LNet HZ access with kernel APIs

John L. Hammond (1):
staging: lustre: set task state before scheduling in lnet_sock_accept

Li Xi (1):
staging: lustre: remove annoying message in parse_nidrange

Liang Zhen (6):
staging: lustre: set downis to 1 if there's no NI for remote net
staging: lustre: recv could access freed message
staging: lustre: take extra refcount in kiblnd_connreq_done
staging: lustre: check wr_id returned by ib_poll_cq
staging: lustre: avoid intensive reconnecting for ko2iblnd
staging: lustre: do less intense allocating retry for ko2iblnd

Olaf Weber (1):
staging: lustre: Use after free in lnet_ptl_match_delay()

Sebastien Buisson (1):
staging: lustre: fix 'copy into fixed size buffer' errors

.../staging/lustre/include/linux/lnet/lib-lnet.h | 2 +-
.../staging/lustre/include/linux/lnet/lib-types.h | 2 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 230 ++++------
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 135 ++++---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 475 ++++++++++++++------
.../staging/lustre/lnet/klnds/socklnd/socklnd.c | 7 +-
drivers/staging/lustre/lnet/lnet/api-ni.c | 8 +-
drivers/staging/lustre/lnet/lnet/config.c | 14 +-
drivers/staging/lustre/lnet/lnet/lib-eq.c | 2 +-
drivers/staging/lustre/lnet/lnet/lib-move.c | 51 ++-
drivers/staging/lustre/lnet/lnet/lib-ptl.c | 93 +++--
drivers/staging/lustre/lnet/lnet/lib-socket.c | 45 +-
drivers/staging/lustre/lnet/lnet/nidstrings.c | 3 +-
drivers/staging/lustre/lnet/lnet/router.c | 22 +-
drivers/staging/lustre/lnet/lnet/router_proc.c | 2 +-
drivers/staging/lustre/lnet/selftest/conctl.c | 9 +-
drivers/staging/lustre/lnet/selftest/conrpc.c | 2 +-
drivers/staging/lustre/lnet/selftest/console.c | 23 +-
drivers/staging/lustre/lnet/selftest/framework.c | 14 +-
drivers/staging/lustre/lustre/libcfs/workitem.c | 6 +-
drivers/staging/lustre/lustre/llite/dir.c | 6 +-
drivers/staging/lustre/lustre/obdclass/lu_object.c | 2 +-
drivers/staging/lustre/lustre/ptlrpc/import.c | 2 +-
drivers/staging/lustre/lustre/ptlrpc/nrs.c | 8 +-
drivers/staging/lustre/lustre/ptlrpc/sec_config.c | 7 +-
25 files changed, 715 insertions(+), 455 deletions(-)


2016-03-02 22:02:28

by James Simmons

Subject: [PATCH 01/27] staging: lustre: set downis to 1 if there's no NI for remote net

From: Liang Zhen <[email protected]>

lnet_route_t::lr_downis is marked as zero even if there is no NI to the
target network; this is wrong and breaks the ARF logic. This patch
fixes the problem.

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6060
Reviewed-on: http://review.whamcloud.com/13417
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/router.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 5e8b0ba..2eae8f6 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -718,6 +718,13 @@ lnet_parse_rc_info(lnet_rc_data_t *rcd)
rte->lr_downis = 0;
continue;
}
+ /**
+ * if @down is zero and this route is single-hop, it means
+ * we can't find NI for target network
+ */
+ if (!down && rte->lr_hops == 1)
+ down = 1;
+
rte->lr_downis = down;
}
}
--
1.7.1

2016-03-02 22:02:33

by James Simmons

Subject: [PATCH 04/27] staging: lustre: return proper error code for LNet core

It is considered bad style in the Linux kernel to return -1 or a
positive number for an error. Instead, return the appropriate
negative errno codes.
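
As a minimal illustration of the convention (hypothetical helper, not
code from this patch), a bare -1 becomes a named negative errno that
callers can check and propagate directly:

#include <linux/errno.h>

/* Hypothetical example: validate a count and report failure properly. */
static int example_check_count(int count)
{
        if (count <= 0)
                return -EINVAL;         /* instead of "return -1;" */

        return 0;                       /* success */
}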

Signed-off-by: James Simmons <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6142
Reviewed-on: http://review.whamcloud.com/17626
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/api-ni.c | 4 ++--
drivers/staging/lustre/lnet/lnet/config.c | 6 +++---
drivers/staging/lustre/lnet/lnet/lib-eq.c | 2 +-
drivers/staging/lustre/lnet/lnet/lib-move.c | 22 +++++++++++-----------
drivers/staging/lustre/lnet/lnet/nidstrings.c | 2 +-
drivers/staging/lustre/lnet/lnet/router.c | 6 +++---
6 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index aa24489..6c42786 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -1409,7 +1409,7 @@ int lnet_lib_init(void)
/* we are under risk of consuming all lh_cookie */
CERROR("Can't have %d CPTs for LNet (max allowed is %d), please change setting of CPT-table and retry\n",
the_lnet.ln_cpt_number, LNET_CPT_MAX);
- return -1;
+ return -E2BIG;
}

while ((1 << the_lnet.ln_cpt_bits) < the_lnet.ln_cpt_number)
@@ -1418,7 +1418,7 @@ int lnet_lib_init(void)
rc = lnet_create_locks();
if (rc) {
CERROR("Can't create LNet global locks: %d\n", rc);
- return -1;
+ return rc;
}

the_lnet.ln_refcount = 0;
diff --git a/drivers/staging/lustre/lnet/lnet/config.c b/drivers/staging/lustre/lnet/lnet/config.c
index 4c40acb..e71eea7 100644
--- a/drivers/staging/lustre/lnet/lnet/config.c
+++ b/drivers/staging/lustre/lnet/lnet/config.c
@@ -467,7 +467,7 @@ lnet_str2tbs_sep(struct list_head *tbs, char *str)
ltb = lnet_new_text_buf(nob);
if (!ltb) {
lnet_free_text_bufs(&pending);
- return -1;
+ return -ENOMEM;
}

for (i = 0; i < nob; i++)
@@ -598,7 +598,7 @@ lnet_str2tbs_expand(struct list_head *tbs, char *str)

failed:
lnet_free_text_bufs(&pending);
- return -1;
+ return -EINVAL;
}

static int
@@ -634,7 +634,7 @@ lnet_parse_priority(char *str, unsigned int *priority, char **token)
* priority as the token to report in the error message.
*/
*token += sep - str + 1;
- return -1;
+ return -EINVAL;
}

CDEBUG(D_NET, "gateway %s, priority %d, nob %d\n", str, *priority, nob);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-eq.c b/drivers/staging/lustre/lnet/lnet/lib-eq.c
index 042e974..adbcadb 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-eq.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-eq.c
@@ -320,7 +320,7 @@ __must_hold(&the_lnet.ln_eq_wait_lock)
unsigned long now;

if (!tms)
- return -1; /* don't want to wait and no new event */
+ return -ENXIO; /* don't want to wait and no new event */

init_waitqueue_entry(&wl, current);
set_current_state(TASK_INTERRUPTIBLE);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 8a21198..2d187e4 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1167,30 +1167,30 @@ lnet_compare_routes(lnet_route_t *r1, lnet_route_t *r2)
return 1;

if (r1->lr_priority > r2->lr_priority)
- return -1;
+ return -ERANGE;

if (r1_hops < r2_hops)
return 1;

if (r1_hops > r2_hops)
- return -1;
+ return -ERANGE;

if (p1->lp_txqnob < p2->lp_txqnob)
return 1;

if (p1->lp_txqnob > p2->lp_txqnob)
- return -1;
+ return -ERANGE;

if (p1->lp_txcredits > p2->lp_txcredits)
return 1;

if (p1->lp_txcredits < p2->lp_txcredits)
- return -1;
+ return -ERANGE;

if (r1->lr_seq - r2->lr_seq <= 0)
return 1;

- return -1;
+ return -ERANGE;
}

static lnet_peer_t *
@@ -1517,7 +1517,7 @@ lnet_parse_put(lnet_ni_t *ni, lnet_msg_t *msg)
libcfs_id2str(info.mi_id), info.mi_portal,
info.mi_mbits, info.mi_roffset, info.mi_rlength, rc);

- return ENOENT; /* +ve: OK but no match */
+ return -ENOENT; /* -ve: OK but no match */
}
}

@@ -1548,7 +1548,7 @@ lnet_parse_get(lnet_ni_t *ni, lnet_msg_t *msg, int rdma_get)
CNETERR("Dropping GET from %s portal %d match %llu offset %d length %d\n",
libcfs_id2str(info.mi_id), info.mi_portal,
info.mi_mbits, info.mi_roffset, info.mi_rlength);
- return ENOENT; /* +ve: OK but no match */
+ return -ENOENT; /* -ve: OK but no match */
}

LASSERT(rc == LNET_MATCHMD_OK);
@@ -1615,7 +1615,7 @@ lnet_parse_reply(lnet_ni_t *ni, lnet_msg_t *msg)
md->md_me->me_portal);

lnet_res_unlock(cpt);
- return ENOENT; /* +ve: OK but no match */
+ return -ENOENT; /* -ve: OK but no match */
}

LASSERT(!md->md_offset);
@@ -1630,7 +1630,7 @@ lnet_parse_reply(lnet_ni_t *ni, lnet_msg_t *msg)
rlength, hdr->msg.reply.dst_wmd.wh_object_cookie,
mlength);
lnet_res_unlock(cpt);
- return ENOENT; /* +ve: OK but no match */
+ return -ENOENT; /* -ve: OK but no match */
}

CDEBUG(D_NET, "%s: Reply from %s of length %d/%d into md %#llx\n",
@@ -1683,7 +1683,7 @@ lnet_parse_ack(lnet_ni_t *ni, lnet_msg_t *msg)
md->md_me->me_portal);

lnet_res_unlock(cpt);
- return ENOENT; /* +ve! */
+ return -ENOENT; /* -ve! */
}

CDEBUG(D_NET, "%s: ACK from %s into md %#llx\n",
@@ -2030,7 +2030,7 @@ lnet_parse(lnet_ni_t *ni, lnet_hdr_t *hdr, lnet_nid_t from_nid,
if (!rc)
return 0;

- LASSERT(rc == ENOENT);
+ LASSERT(rc == -ENOENT);

free_drop:
LASSERT(!msg->msg_md);
diff --git a/drivers/staging/lustre/lnet/lnet/nidstrings.c b/drivers/staging/lustre/lnet/lnet/nidstrings.c
index 00010f3..9d1c7fd 100644
--- a/drivers/staging/lustre/lnet/lnet/nidstrings.c
+++ b/drivers/staging/lustre/lnet/lnet/nidstrings.c
@@ -1085,7 +1085,7 @@ libcfs_str2lnd(const char *str)
if (nf)
return nf->nf_type;

- return -1;
+ return -ENXIO;
}
EXPORT_SYMBOL(libcfs_str2lnd);

diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 51a831e..9543ba3 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -1503,7 +1503,7 @@ lnet_nrb_tiny_calculate(void)
LCONSOLE_ERROR_MSG(0x10c,
"tiny_router_buffers=%d invalid when routing enabled\n",
tiny_router_buffers);
- return -1;
+ return -EINVAL;
}

if (tiny_router_buffers > 0)
@@ -1522,7 +1522,7 @@ lnet_nrb_small_calculate(void)
LCONSOLE_ERROR_MSG(0x10c,
"small_router_buffers=%d invalid when routing enabled\n",
small_router_buffers);
- return -1;
+ return -EINVAL;
}

if (small_router_buffers > 0)
@@ -1541,7 +1541,7 @@ lnet_nrb_large_calculate(void)
LCONSOLE_ERROR_MSG(0x10c,
"large_router_buffers=%d invalid when routing enabled\n",
large_router_buffers);
- return -1;
+ return -EINVAL;
}

if (large_router_buffers > 0)
--
1.7.1

2016-03-02 22:02:46

by James Simmons

Subject: [PATCH 13/27] staging: lustre: fix api-ni.c issues found by Klocwork Insight tool

From: Dmitry Eremin <[email protected]>

Pointer 'ni', checked for NULL at line 1569, may be passed as
argument 1 to function 'lnet_ni_notify_locked' at line 1621 and
dereferenced there.

Signed-off-by: Dmitry Eremin <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4629
Reviewed-on: http://review.whamcloud.com/9386
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Isaac Huang <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/router.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 9543ba3..6a44601 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -1800,7 +1800,8 @@ lnet_notify(lnet_ni_t *ni, lnet_nid_t nid, int alive, unsigned long when)

lnet_notify_locked(lp, !ni, alive, when);

- lnet_ni_notify_locked(ni, lp);
+ if (ni)
+ lnet_ni_notify_locked(ni, lp);

lnet_peer_decref_locked(lp);

--
1.7.1

2016-03-02 22:02:39

by James Simmons

Subject: [PATCH 10/27] staging: lustre: replace direct LNet HZ access with kernel APIs

From: Jian Yu <[email protected]>

On some customers' systems, the kernel was compiled with HZ defined
as 100 instead of 1000, which improves performance for HPC
applications. However, to use these systems with Lustre, customers
have to rebuild Lustre for that kernel because Lustre uses the
compile-time constant HZ directly.

Since kernel 2.6.21, some non-HZ-dependent timing APIs became
non-inline functions; these can be used in the Lustre code to replace
direct HZ access.

These kernel APIs include:
jiffies_to_msecs()
jiffies_to_usecs()
jiffies_to_timespec()
msecs_to_jiffies()
usecs_to_jiffies()
timespec_to_jiffies()

And here are some samples of the replacement:
HZ -> msecs_to_jiffies(MSEC_PER_SEC)
n * HZ -> msecs_to_jiffies(n * MSEC_PER_SEC)
HZ / n -> msecs_to_jiffies(MSEC_PER_SEC / n)
n / HZ -> jiffies_to_msecs(n) / MSEC_PER_SEC
n / HZ * 1000 -> jiffies_to_msecs(n)

This patch replaces the direct HZ access in the lnet module.
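
As an illustration only (not part of the patch), a minimal sketch of
the conversion pattern using these HZ-independent helpers:

#include <linux/jiffies.h>
#include <linux/time.h>

/* Hypothetical helper: deadline from a timeout given in seconds. */
static unsigned long example_deadline(unsigned int timeout_secs)
{
        /* old style: jiffies + timeout_secs * HZ */
        return jiffies + msecs_to_jiffies(timeout_secs * MSEC_PER_SEC);
}

/* Hypothetical helper: elapsed jiffies converted back to seconds. */
static unsigned int example_elapsed_secs(unsigned long then)
{
        /* old style: (jiffies - then) / HZ */
        return jiffies_to_msecs(jiffies - then) / MSEC_PER_SEC;
}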

Signed-off-by: Jian Yu <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5443
Reviewed-by: Nathaniel Clark <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 3 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 6 +++-
drivers/staging/lustre/lnet/lnet/api-ni.c | 4 ++-
drivers/staging/lustre/lnet/lnet/lib-socket.c | 24 +++++++------------
drivers/staging/lustre/lnet/selftest/conrpc.c | 2 +-
5 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 2abb574..fb7079c 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -691,7 +691,8 @@ kiblnd_send_keepalive(kib_conn_t *conn)
{
return (*kiblnd_tunables.kib_keepalive > 0) &&
cfs_time_after(jiffies, conn->ibc_last_send +
- *kiblnd_tunables.kib_keepalive * HZ);
+ msecs_to_jiffies(*kiblnd_tunables.kib_keepalive *
+ MSEC_PER_SEC));
}

static inline int
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 0608431..d955a66 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1147,7 +1147,9 @@ kiblnd_queue_tx_locked(kib_tx_t *tx, kib_conn_t *conn)
LASSERT(conn->ibc_state >= IBLND_CONN_ESTABLISHED);

tx->tx_queued = 1;
- tx->tx_deadline = jiffies + (*kiblnd_tunables.kib_timeout * HZ);
+ tx->tx_deadline = jiffies +
+ msecs_to_jiffies(*kiblnd_tunables.kib_timeout *
+ MSEC_PER_SEC);

if (!tx->tx_conn) {
kiblnd_conn_addref(conn);
@@ -3188,7 +3190,7 @@ kiblnd_connd(void *arg)
kiblnd_data.kib_peer_hash_size;
}

- deadline += p * HZ;
+ deadline += msecs_to_jiffies(p * MSEC_PER_SEC);
spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
}

diff --git a/drivers/staging/lustre/lnet/lnet/api-ni.c b/drivers/staging/lustre/lnet/lnet/api-ni.c
index 6c42786..f6f6cd8 100644
--- a/drivers/staging/lustre/lnet/lnet/api-ni.c
+++ b/drivers/staging/lustre/lnet/lnet/api-ni.c
@@ -2011,8 +2011,10 @@ LNetCtl(unsigned int cmd, void *arg)

case IOC_LIBCFS_NOTIFY_ROUTER:
secs_passed = (ktime_get_real_seconds() - data->ioc_u64[0]);
+ secs_passed *= msecs_to_jiffies(MSEC_PER_SEC);
+
return lnet_notify(NULL, data->ioc_nid, data->ioc_flags,
- jiffies - secs_passed * HZ);
+ jiffies - secs_passed);

case IOC_LIBCFS_LNET_DIST:
rc = LNetDist(data->ioc_nid, &data->ioc_nid, &data->ioc_u32[1]);
diff --git a/drivers/staging/lustre/lnet/lnet/lib-socket.c b/drivers/staging/lustre/lnet/lnet/lib-socket.c
index 269a6d8..cc0c275 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-socket.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-socket.c
@@ -262,7 +262,7 @@ int
lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout)
{
int rc;
- long ticks = timeout * HZ;
+ long jiffies_left = timeout * msecs_to_jiffies(MSEC_PER_SEC);
unsigned long then;
struct timeval tv;

@@ -282,10 +282,7 @@ lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout)

if (timeout) {
/* Set send timeout to remaining time */
- tv = (struct timeval) {
- .tv_sec = ticks / HZ,
- .tv_usec = ((ticks % HZ) * 1000000) / HZ
- };
+ jiffies_to_timeval(jiffies_left, &tv);
rc = kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO,
(char *)&tv, sizeof(tv));
if (rc) {
@@ -297,7 +294,7 @@ lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout)

then = jiffies;
rc = kernel_sendmsg(sock, &msg, &iov, 1, nob);
- ticks -= jiffies - then;
+ jiffies_left -= jiffies - then;

if (rc == nob)
return 0;
@@ -310,7 +307,7 @@ lnet_sock_write(struct socket *sock, void *buffer, int nob, int timeout)
return -ECONNABORTED;
}

- if (ticks <= 0)
+ if (jiffies_left <= 0)
return -EAGAIN;

buffer = ((char *)buffer) + rc;
@@ -324,12 +321,12 @@ int
lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout)
{
int rc;
- long ticks = timeout * HZ;
+ long jiffies_left = timeout * msecs_to_jiffies(MSEC_PER_SEC);
unsigned long then;
struct timeval tv;

LASSERT(nob > 0);
- LASSERT(ticks > 0);
+ LASSERT(jiffies_left > 0);

for (;;) {
struct kvec iov = {
@@ -341,10 +338,7 @@ lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout)
};

/* Set receive timeout to remaining time */
- tv = (struct timeval) {
- .tv_sec = ticks / HZ,
- .tv_usec = ((ticks % HZ) * 1000000) / HZ
- };
+ jiffies_to_timeval(jiffies_left, &tv);
rc = kernel_setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO,
(char *)&tv, sizeof(tv));
if (rc) {
@@ -355,7 +349,7 @@ lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout)

then = jiffies;
rc = kernel_recvmsg(sock, &msg, &iov, 1, nob, 0);
- ticks -= jiffies - then;
+ jiffies_left -= jiffies - then;

if (rc < 0)
return rc;
@@ -369,7 +363,7 @@ lnet_sock_read(struct socket *sock, void *buffer, int nob, int timeout)
if (!nob)
return 0;

- if (ticks <= 0)
+ if (jiffies_left <= 0)
return -ETIMEDOUT;
}
}
diff --git a/drivers/staging/lustre/lnet/selftest/conrpc.c b/drivers/staging/lustre/lnet/selftest/conrpc.c
index b02a140..736efce 100644
--- a/drivers/staging/lustre/lnet/selftest/conrpc.c
+++ b/drivers/staging/lustre/lnet/selftest/conrpc.c
@@ -1252,7 +1252,7 @@ lstcon_rpc_pinger(void *arg)
if (nd->nd_state != LST_NODE_ACTIVE)
continue;

- intv = (jiffies - nd->nd_stamp) / HZ;
+ intv = (jiffies - nd->nd_stamp) / msecs_to_jiffies(MSEC_PER_SEC);
if (intv < nd->nd_timeout / 2)
continue;

--
1.7.1

2016-03-02 22:02:51

by James Simmons

[permalink] [raw]
Subject: [PATCH 18/27] staging: lustre: make o2iblnd_cb.c local functions static

From: Frank Zago <[email protected]>

This reduces the code size by about 1KiB.

Signed-off-by: Frank Zago <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5396
Reviewed-on: http://review.whamcloud.com/11256
Reviewed-by: Patrick Farrell <[email protected]>
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 10 --------
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 25 +++++++++++++------
2 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 4f09776..4fe38cb 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -939,30 +939,20 @@ int kiblnd_create_peer(lnet_ni_t *ni, kib_peer_t **peerp, lnet_nid_t nid);
void kiblnd_destroy_peer(kib_peer_t *peer);
void kiblnd_destroy_dev(kib_dev_t *dev);
void kiblnd_unlink_peer_locked(kib_peer_t *peer);
-void kiblnd_peer_alive(kib_peer_t *peer);
kib_peer_t *kiblnd_find_peer_locked(lnet_nid_t nid);
-void kiblnd_peer_connect_failed(kib_peer_t *peer, int active, int error);
int kiblnd_close_stale_conns_locked(kib_peer_t *peer,
int version, __u64 incarnation);
int kiblnd_close_peer_conns_locked(kib_peer_t *peer, int why);

-void kiblnd_connreq_done(kib_conn_t *conn, int status);
kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
int state, int version);
void kiblnd_destroy_conn(kib_conn_t *conn);
void kiblnd_close_conn(kib_conn_t *conn, int error);
void kiblnd_close_conn_locked(kib_conn_t *conn, int error);

-int kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
- int nob, kib_rdma_desc_t *dstrd, __u64 dstcookie);
-
void kiblnd_launch_tx(lnet_ni_t *ni, kib_tx_t *tx, lnet_nid_t nid);
-void kiblnd_queue_tx_locked(kib_tx_t *tx, kib_conn_t *conn);
-void kiblnd_queue_tx(kib_tx_t *tx, kib_conn_t *conn);
-void kiblnd_init_tx_msg(lnet_ni_t *ni, kib_tx_t *tx, int type, int body_nob);
void kiblnd_txlist_done(lnet_ni_t *ni, struct list_head *txlist,
int status);
-void kiblnd_check_sends(kib_conn_t *conn);

void kiblnd_qp_event(struct ib_event *event, void *arg);
void kiblnd_cq_event(struct ib_event *event, void *arg);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index d955a66..11c0b49 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -40,6 +40,15 @@

#include "o2iblnd.h"

+static void kiblnd_peer_alive(kib_peer_t *peer);
+static void kiblnd_peer_connect_failed(kib_peer_t *peer, int active, int error);
+static void kiblnd_check_sends(kib_conn_t *conn);
+static void kiblnd_init_tx_msg(lnet_ni_t *ni, kib_tx_t *tx,
+ int type, int body_nob);
+static int kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
+ int resid, kib_rdma_desc_t *dstrd, __u64 dstcookie);
+static void kiblnd_queue_tx_locked(kib_tx_t *tx, kib_conn_t *conn);
+static void kiblnd_queue_tx(kib_tx_t *tx, kib_conn_t *conn);
static void kiblnd_unmap_tx(lnet_ni_t *ni, kib_tx_t *tx);

static void
@@ -891,7 +900,7 @@ kiblnd_post_tx_locked(kib_conn_t *conn, kib_tx_t *tx, int credit)
return -EIO;
}

-void
+static void
kiblnd_check_sends(kib_conn_t *conn)
{
int ver = conn->ibc_version;
@@ -1019,7 +1028,7 @@ kiblnd_tx_complete(kib_tx_t *tx, int status)
kiblnd_conn_decref(conn); /* ...until here */
}

-void
+static void
kiblnd_init_tx_msg(lnet_ni_t *ni, kib_tx_t *tx, int type, int body_nob)
{
kib_hca_dev_t *hdev = tx->tx_pool->tpo_hdev;
@@ -1053,7 +1062,7 @@ kiblnd_init_tx_msg(lnet_ni_t *ni, kib_tx_t *tx, int type, int body_nob)
tx->tx_nwrq++;
}

-int
+static int
kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
int resid, kib_rdma_desc_t *dstrd, __u64 dstcookie)
{
@@ -1137,7 +1146,7 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
return rc;
}

-void
+static void
kiblnd_queue_tx_locked(kib_tx_t *tx, kib_conn_t *conn)
{
struct list_head *q;
@@ -1192,7 +1201,7 @@ kiblnd_queue_tx_locked(kib_tx_t *tx, kib_conn_t *conn)
list_add_tail(&tx->tx_list, q);
}

-void
+static void
kiblnd_queue_tx(kib_tx_t *tx, kib_conn_t *conn)
{
spin_lock(&conn->ibc_lock);
@@ -1792,7 +1801,7 @@ kiblnd_thread_fini(void)
atomic_dec(&kiblnd_data.kib_nthreads);
}

-void
+static void
kiblnd_peer_alive(kib_peer_t *peer)
{
/* This is racy, but everyone's only writing cfs_time_current() */
@@ -1993,7 +2002,7 @@ kiblnd_finalise_conn(kib_conn_t *conn)
kiblnd_handle_early_rxs(conn);
}

-void
+static void
kiblnd_peer_connect_failed(kib_peer_t *peer, int active, int error)
{
LIST_HEAD(zombies);
@@ -2047,7 +2056,7 @@ kiblnd_peer_connect_failed(kib_peer_t *peer, int active, int error)
kiblnd_txlist_done(peer->ibp_ni, &zombies, -EHOSTUNREACH);
}

-void
+static void
kiblnd_connreq_done(kib_conn_t *conn, int status)
{
kib_peer_t *peer = conn->ibc_peer;
--
1.7.1

2016-03-02 22:03:07

by James Simmons

Subject: [PATCH 26/27] staging: lustre: avoid intensive reconnecting for ko2iblnd

From: Liang Zhen <[email protected]>

When there is a connection race between two nodes and one side of the
connection is rejected by the other side, o2iblnd will reconnect
immediately. This generates a lot of connection churn if:

- the race winner is slow and can't send out its connection request
  in a short time.
- the remote side leaves a cmid in TIMEWAIT state, which will reject
  future connection requests.

To resolve this problem, this patch changes the reconnection
behaviour: a reconnection is submitted by connd only when a zombie
connection is being destroyed and there is a pending reconnection
request for the corresponding peer.

Also, after a few rejections, reconnection attempts are spaced out by
a time interval.
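
In outline (a simplified sketch with hypothetical helpers, not the
driver's real API; the actual implementation is in the diff below):

#include <linux/types.h>

#define RECONN_HIGH_RACE        10      /* analogue of KIB_RECONN_HIGH_RACE */

void reconnect_now(void *peer);         /* hypothetical */
void reconnect_later(void *peer);       /* hypothetical: retried later */

/* Called by connd once a zombie connection has been torn down. */
static void on_zombie_destroyed(void *peer, bool reconnect_pending,
                                unsigned int reconnected)
{
        if (!reconnect_pending)
                return;                 /* nobody asked for a reconnect */

        if (reconnected < RECONN_HIGH_RACE)
                reconnect_now(peer);    /* few races: retry right away */
        else
                reconnect_later(peer);  /* repeated races: space attempts out */
}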

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7569
Reviewed-on: http://review.whamcloud.com/17892
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Tested-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 40 +--
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 54 +++-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 280 ++++++++++++++------
3 files changed, 258 insertions(+), 116 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 56c221b..135ccf1 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -364,9 +364,7 @@ void kiblnd_destroy_peer(kib_peer_t *peer)
LASSERT(net);
LASSERT(!atomic_read(&peer->ibp_refcount));
LASSERT(!kiblnd_peer_active(peer));
- LASSERT(!peer->ibp_connecting);
- LASSERT(!peer->ibp_accepting);
- LASSERT(list_empty(&peer->ibp_conns));
+ LASSERT(kiblnd_peer_idle(peer));
LASSERT(list_empty(&peer->ibp_tx_queue));

LIBCFS_FREE(peer, sizeof(*peer));
@@ -392,10 +390,7 @@ kib_peer_t *kiblnd_find_peer_locked(lnet_nid_t nid)

list_for_each(tmp, peer_list) {
peer = list_entry(tmp, kib_peer_t, ibp_list);
-
- LASSERT(peer->ibp_connecting > 0 || /* creating conns */
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns)); /* active conn */
+ LASSERT(!kiblnd_peer_idle(peer));

if (peer->ibp_nid != nid)
continue;
@@ -432,9 +427,7 @@ static int kiblnd_get_peer_info(lnet_ni_t *ni, int index,
for (i = 0; i < kiblnd_data.kib_peer_hash_size; i++) {
list_for_each(ptmp, &kiblnd_data.kib_peers[i]) {
peer = list_entry(ptmp, kib_peer_t, ibp_list);
- LASSERT(peer->ibp_connecting > 0 ||
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns));
+ LASSERT(!kiblnd_peer_idle(peer));

if (peer->ibp_ni != ni)
continue;
@@ -502,9 +495,7 @@ static int kiblnd_del_peer(lnet_ni_t *ni, lnet_nid_t nid)
for (i = lo; i <= hi; i++) {
list_for_each_safe(ptmp, pnxt, &kiblnd_data.kib_peers[i]) {
peer = list_entry(ptmp, kib_peer_t, ibp_list);
- LASSERT(peer->ibp_connecting > 0 ||
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns));
+ LASSERT(!kiblnd_peer_idle(peer));

if (peer->ibp_ni != ni)
continue;
@@ -545,9 +536,7 @@ static kib_conn_t *kiblnd_get_conn_by_idx(lnet_ni_t *ni, int index)
for (i = 0; i < kiblnd_data.kib_peer_hash_size; i++) {
list_for_each(ptmp, &kiblnd_data.kib_peers[i]) {
peer = list_entry(ptmp, kib_peer_t, ibp_list);
- LASSERT(peer->ibp_connecting > 0 ||
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns));
+ LASSERT(!kiblnd_peer_idle(peer));

if (peer->ibp_ni != ni)
continue;
@@ -837,14 +826,14 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
return conn;

failed_2:
- kiblnd_destroy_conn(conn);
+ kiblnd_destroy_conn(conn, true);
failed_1:
LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));
failed_0:
return NULL;
}

-void kiblnd_destroy_conn(kib_conn_t *conn)
+void kiblnd_destroy_conn(kib_conn_t *conn, bool free_conn)
{
struct rdma_cm_id *cmid = conn->ibc_cmid;
kib_peer_t *peer = conn->ibc_peer;
@@ -984,9 +973,7 @@ static int kiblnd_close_matching_conns(lnet_ni_t *ni, lnet_nid_t nid)
for (i = lo; i <= hi; i++) {
list_for_each_safe(ptmp, pnxt, &kiblnd_data.kib_peers[i]) {
peer = list_entry(ptmp, kib_peer_t, ibp_list);
- LASSERT(peer->ibp_connecting > 0 ||
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns));
+ LASSERT(!kiblnd_peer_idle(peer));

if (peer->ibp_ni != ni)
continue;
@@ -1071,12 +1058,8 @@ static void kiblnd_query(lnet_ni_t *ni, lnet_nid_t nid, unsigned long *when)
read_lock_irqsave(glock, flags);

peer = kiblnd_find_peer_locked(nid);
- if (peer) {
- LASSERT(peer->ibp_connecting > 0 || /* creating conns */
- peer->ibp_accepting > 0 ||
- !list_empty(&peer->ibp_conns)); /* active conn */
+ if (peer)
last_alive = peer->ibp_last_alive;
- }

read_unlock_irqrestore(glock, flags);

@@ -2368,6 +2351,8 @@ static void kiblnd_base_shutdown(void)
LASSERT(list_empty(&kiblnd_data.kib_peers[i]));
LASSERT(list_empty(&kiblnd_data.kib_connd_zombies));
LASSERT(list_empty(&kiblnd_data.kib_connd_conns));
+ LASSERT(list_empty(&kiblnd_data.kib_reconn_list));
+ LASSERT(list_empty(&kiblnd_data.kib_reconn_wait));

/* flag threads to terminate; wake and wait for them to die */
kiblnd_data.kib_shutdown = 1;
@@ -2506,6 +2491,9 @@ static int kiblnd_base_startup(void)
spin_lock_init(&kiblnd_data.kib_connd_lock);
INIT_LIST_HEAD(&kiblnd_data.kib_connd_conns);
INIT_LIST_HEAD(&kiblnd_data.kib_connd_zombies);
+ INIT_LIST_HEAD(&kiblnd_data.kib_reconn_list);
+ INIT_LIST_HEAD(&kiblnd_data.kib_reconn_wait);
+
init_waitqueue_head(&kiblnd_data.kib_connd_waitq);
init_waitqueue_head(&kiblnd_data.kib_failover_waitq);

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 6a4c4ac..bfcbdd1 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -348,6 +348,16 @@ typedef struct {
void *kib_connd; /* the connd task (serialisation assertions) */
struct list_head kib_connd_conns; /* connections to setup/teardown */
struct list_head kib_connd_zombies; /* connections with zero refcount */
+ /* connections to reconnect */
+ struct list_head kib_reconn_list;
+ /* peers wait for reconnection */
+ struct list_head kib_reconn_wait;
+ /**
+ * The second that peers are pulled out from \a kib_reconn_wait
+ * for reconnection.
+ */
+ time64_t kib_reconn_sec;
+
wait_queue_head_t kib_connd_waitq; /* connection daemon sleeps here */
spinlock_t kib_connd_lock; /* serialise */
struct ib_qp_attr kib_error_qpa; /* QP->ERROR */
@@ -525,6 +535,8 @@ typedef struct kib_conn {
struct list_head ibc_list; /* stash on peer's conn list */
struct list_head ibc_sched_list; /* schedule for attention */
__u16 ibc_version; /* version of connection */
+ /* reconnect later */
+ __u16 ibc_reconnect:1;
__u64 ibc_incarnation; /* which instance of the peer */
atomic_t ibc_refcount; /* # users */
int ibc_state; /* what's happening */
@@ -574,18 +586,25 @@ typedef struct kib_peer {
struct list_head ibp_list; /* stash on global peer list */
lnet_nid_t ibp_nid; /* who's on the other end(s) */
lnet_ni_t *ibp_ni; /* LNet interface */
- atomic_t ibp_refcount; /* # users */
struct list_head ibp_conns; /* all active connections */
struct list_head ibp_tx_queue; /* msgs waiting for a conn */
- __u16 ibp_version; /* version of peer */
__u64 ibp_incarnation; /* incarnation of peer */
- int ibp_connecting; /* current active connection attempts
- */
- int ibp_accepting; /* current passive connection attempts
- */
- int ibp_error; /* errno on closing this peer */
- unsigned long ibp_last_alive; /* when (in jiffies) I was last alive
- */
+ /* when (in jiffies) I was last alive */
+ unsigned long ibp_last_alive;
+ /* # users */
+ atomic_t ibp_refcount;
+ /* version of peer */
+ __u16 ibp_version;
+ /* current passive connection attempts */
+ unsigned short ibp_accepting;
+ /* current active connection attempts */
+ unsigned short ibp_connecting;
+ /* reconnect this peer later */
+ unsigned short ibp_reconnecting:1;
+ /* # consecutive reconnection attempts to this peer */
+ unsigned int ibp_reconnected;
+ /* errno on closing this peer */
+ int ibp_error;
/* max map_on_demand */
__u16 ibp_max_frags;
/* max_peer_credits */
@@ -667,6 +686,20 @@ do { \
kiblnd_destroy_peer(peer); \
} while (0)

+static inline bool
+kiblnd_peer_connecting(kib_peer_t *peer)
+{
+ return peer->ibp_connecting ||
+ peer->ibp_reconnecting ||
+ peer->ibp_accepting;
+}
+
+static inline bool
+kiblnd_peer_idle(kib_peer_t *peer)
+{
+ return !kiblnd_peer_connecting(peer) && list_empty(&peer->ibp_conns);
+}
+
static inline struct list_head *
kiblnd_nid2peerlist(lnet_nid_t nid)
{
@@ -943,6 +976,7 @@ int kiblnd_translate_mtu(int value);
int kiblnd_dev_failover(kib_dev_t *dev);
int kiblnd_create_peer(lnet_ni_t *ni, kib_peer_t **peerp, lnet_nid_t nid);
void kiblnd_destroy_peer(kib_peer_t *peer);
+bool kiblnd_reconnect_peer(kib_peer_t *peer);
void kiblnd_destroy_dev(kib_dev_t *dev);
void kiblnd_unlink_peer_locked(kib_peer_t *peer);
kib_peer_t *kiblnd_find_peer_locked(lnet_nid_t nid);
@@ -952,7 +986,7 @@ int kiblnd_close_peer_conns_locked(kib_peer_t *peer, int why);

kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
int state, int version);
-void kiblnd_destroy_conn(kib_conn_t *conn);
+void kiblnd_destroy_conn(kib_conn_t *conn, bool free_conn);
void kiblnd_close_conn(kib_conn_t *conn, int error);
void kiblnd_close_conn_locked(kib_conn_t *conn, int error);

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 1c62875..9e7efc8 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1258,6 +1258,7 @@ kiblnd_connect_peer(kib_peer_t *peer)

LASSERT(net);
LASSERT(peer->ibp_connecting > 0);
+ LASSERT(!peer->ibp_reconnecting);

cmid = kiblnd_rdma_create_id(kiblnd_cm_callback, peer, RDMA_PS_TCP,
IB_QPT_RC);
@@ -1313,6 +1314,56 @@ kiblnd_connect_peer(kib_peer_t *peer)
kiblnd_peer_connect_failed(peer, 1, rc);
}

+bool
+kiblnd_reconnect_peer(kib_peer_t *peer)
+{
+ rwlock_t *glock = &kiblnd_data.kib_global_lock;
+ char *reason = NULL;
+ struct list_head txs;
+ unsigned long flags;
+
+ INIT_LIST_HEAD(&txs);
+
+ write_lock_irqsave(glock, flags);
+ if (!peer->ibp_reconnecting) {
+ if (peer->ibp_accepting)
+ reason = "accepting";
+ else if (peer->ibp_connecting)
+ reason = "connecting";
+ else if (!list_empty(&peer->ibp_conns))
+ reason = "connected";
+ else /* connected then closed */
+ reason = "closed";
+
+ goto no_reconnect;
+ }
+
+ LASSERT(!peer->ibp_accepting && !peer->ibp_connecting &&
+ list_empty(&peer->ibp_conns));
+ peer->ibp_reconnecting = 0;
+
+ if (!kiblnd_peer_active(peer)) {
+ list_splice_init(&peer->ibp_tx_queue, &txs);
+ reason = "unlinked";
+ goto no_reconnect;
+ }
+
+ peer->ibp_connecting++;
+ peer->ibp_reconnected++;
+ write_unlock_irqrestore(glock, flags);
+
+ kiblnd_connect_peer(peer);
+ return true;
+
+no_reconnect:
+ write_unlock_irqrestore(glock, flags);
+
+ CWARN("Abort reconnection of %s: %s\n",
+ libcfs_nid2str(peer->ibp_nid), reason);
+ kiblnd_txlist_done(peer->ibp_ni, &txs, -ECONNABORTED);
+ return false;
+}
+
void
kiblnd_launch_tx(lnet_ni_t *ni, kib_tx_t *tx, lnet_nid_t nid)
{
@@ -1358,8 +1409,7 @@ kiblnd_launch_tx(lnet_ni_t *ni, kib_tx_t *tx, lnet_nid_t nid)
if (peer) {
if (list_empty(&peer->ibp_conns)) {
/* found a peer, but it's still connecting... */
- LASSERT(peer->ibp_connecting ||
- peer->ibp_accepting);
+ LASSERT(kiblnd_peer_connecting(peer));
if (tx)
list_add_tail(&tx->tx_list,
&peer->ibp_tx_queue);
@@ -1397,8 +1447,7 @@ kiblnd_launch_tx(lnet_ni_t *ni, kib_tx_t *tx, lnet_nid_t nid)
if (peer2) {
if (list_empty(&peer2->ibp_conns)) {
/* found a peer, but it's still connecting... */
- LASSERT(peer2->ibp_connecting ||
- peer2->ibp_accepting);
+ LASSERT(kiblnd_peer_connecting(peer2));
if (tx)
list_add_tail(&tx->tx_list,
&peer2->ibp_tx_queue);
@@ -1818,10 +1867,7 @@ kiblnd_peer_notify(kib_peer_t *peer)

read_lock_irqsave(&kiblnd_data.kib_global_lock, flags);

- if (list_empty(&peer->ibp_conns) &&
- !peer->ibp_accepting &&
- !peer->ibp_connecting &&
- peer->ibp_error) {
+ if (kiblnd_peer_idle(peer) && peer->ibp_error) {
error = peer->ibp_error;
peer->ibp_error = 0;

@@ -2021,14 +2067,14 @@ kiblnd_peer_connect_failed(kib_peer_t *peer, int active, int error)
peer->ibp_accepting--;
}

- if (peer->ibp_connecting ||
- peer->ibp_accepting) {
+ if (kiblnd_peer_connecting(peer)) {
/* another connection attempt under way... */
write_unlock_irqrestore(&kiblnd_data.kib_global_lock,
flags);
return;
}

+ peer->ibp_reconnected = 0;
if (list_empty(&peer->ibp_conns)) {
/* Take peer's blocked transmits to complete with error */
list_add(&zombies, &peer->ibp_tx_queue);
@@ -2101,6 +2147,7 @@ kiblnd_connreq_done(kib_conn_t *conn, int status)
*/
kiblnd_conn_addref(conn); /* +1 ref for ibc_list */
list_add(&conn->ibc_list, &peer->ibp_conns);
+ peer->ibp_reconnected = 0;
if (active)
peer->ibp_connecting--;
else
@@ -2356,10 +2403,16 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
if (peer2->ibp_incarnation != reqmsg->ibm_srcstamp ||
peer2->ibp_version != version) {
kiblnd_close_peer_conns_locked(peer2, -ESTALE);
+
+ if (kiblnd_peer_active(peer2)) {
+ peer2->ibp_incarnation = reqmsg->ibm_srcstamp;
+ peer2->ibp_version = version;
+ }
write_unlock_irqrestore(g_lock, flags);

- CWARN("Conn stale %s [old ver: %x, new ver: %x]\n",
- libcfs_nid2str(nid), peer2->ibp_version, version);
+ CWARN("Conn stale %s version %x/%x incarnation %llu/%llu\n",
+ libcfs_nid2str(nid), peer2->ibp_version, version,
+ peer2->ibp_incarnation, reqmsg->ibm_srcstamp);

kiblnd_peer_decref(peer);
rej.ibr_why = IBLND_REJECT_CONN_STALE;
@@ -2378,6 +2431,11 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
goto failed;
}

+ /**
+ * passive connection is allowed even this peer is waiting for
+ * reconnection.
+ */
+ peer2->ibp_reconnecting = 0;
peer2->ibp_accepting++;
kiblnd_peer_addref(peer2);

@@ -2479,75 +2537,79 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
}

static void
-kiblnd_reconnect(kib_conn_t *conn, int version,
- __u64 incarnation, int why, kib_connparams_t *cp)
+kiblnd_check_reconnect(kib_conn_t *conn, int version,
+ __u64 incarnation, int why, kib_connparams_t *cp)
{
+ rwlock_t *glock = &kiblnd_data.kib_global_lock;
kib_peer_t *peer = conn->ibc_peer;
char *reason;
- int retry = 0;
+ int msg_size = IBLND_MSG_SIZE;
+ int frag_num = -1;
+ int queue_dep = -1;
+ bool reconnect;
unsigned long flags;

LASSERT(conn->ibc_state == IBLND_CONN_ACTIVE_CONNECT);
LASSERT(peer->ibp_connecting > 0); /* 'conn' at least */
+ LASSERT(!peer->ibp_reconnecting);

- write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
+ if (cp) {
+ msg_size = cp->ibcp_max_msg_size;
+ frag_num = cp->ibcp_max_frags;
+ queue_dep = cp->ibcp_queue_depth;
+ }

- /*
+ write_lock_irqsave(glock, flags);
+ /**
* retry connection if it's still needed and no other connection
* attempts (active or passive) are in progress
* NB: reconnect is still needed even when ibp_tx_queue is
* empty if ibp_version != version because reconnect may be
* initiated by kiblnd_query()
*/
- if ((!list_empty(&peer->ibp_tx_queue) ||
- peer->ibp_version != version) &&
- peer->ibp_connecting == 1 &&
- !peer->ibp_accepting) {
- retry = 1;
- peer->ibp_connecting++;
-
- peer->ibp_version = version;
- peer->ibp_incarnation = incarnation;
+ reconnect = (!list_empty(&peer->ibp_tx_queue) ||
+ peer->ibp_version != version) &&
+ peer->ibp_connecting == 1 &&
+ !peer->ibp_accepting;
+ if (!reconnect) {
+ reason = "no need";
+ goto out;
}

- write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
-
- if (!retry)
- return;
-
switch (why) {
default:
reason = "Unknown";
break;

case IBLND_REJECT_RDMA_FRAGS:
- if (!cp)
- goto failed;
-
- if (conn->ibc_max_frags <= cp->ibcp_max_frags) {
- CNETERR("Unsupported max frags, peer supports %d\n",
- cp->ibcp_max_frags);
- goto failed;
- } else if (!*kiblnd_tunables.kib_map_on_demand) {
- CNETERR("map_on_demand must be enabled to support map_on_demand peers\n");
- goto failed;
+ if (!cp) {
+ reason = "can't negotiate max frags";
+ goto out;
+ }
+ if (!*kiblnd_tunables.kib_map_on_demand) {
+ reason = "map_on_demand must be enabled";
+ goto out;
+ }
+ if (conn->ibc_max_frags <= frag_num) {
+ reason = "unsupported max frags";
+ goto out;
}

- peer->ibp_max_frags = cp->ibcp_max_frags;
+ peer->ibp_max_frags = frag_num;
reason = "rdma fragments";
break;

case IBLND_REJECT_MSG_QUEUE_SIZE:
- if (!cp)
- goto failed;
-
- if (conn->ibc_queue_depth <= cp->ibcp_queue_depth) {
- CNETERR("Unsupported queue depth, peer supports %d\n",
- cp->ibcp_queue_depth);
- goto failed;
+ if (!cp) {
+ reason = "can't negotiate queue depth";
+ goto out;
+ }
+ if (conn->ibc_queue_depth <= queue_dep) {
+ reason = "unsupported queue depth";
+ goto out;
}

- peer->ibp_queue_depth = cp->ibcp_queue_depth;
+ peer->ibp_queue_depth = queue_dep;
reason = "queue depth";
break;

@@ -2564,20 +2626,24 @@ kiblnd_reconnect(kib_conn_t *conn, int version,
break;
}

- CNETERR("%s: retrying (%s), %x, %x, queue_dep: %d, max_frag: %d, msg_size: %d\n",
- libcfs_nid2str(peer->ibp_nid),
- reason, IBLND_MSG_VERSION, version,
- conn->ibc_queue_depth, conn->ibc_max_frags,
- cp ? cp->ibcp_max_msg_size : IBLND_MSG_SIZE);
-
- kiblnd_connect_peer(peer);
- return;
-failed:
- write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
- peer->ibp_connecting--;
- write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
+ conn->ibc_reconnect = 1;
+ peer->ibp_reconnecting = 1;
+ peer->ibp_version = version;
+ if (incarnation)
+ peer->ibp_incarnation = incarnation;
+out:
+ write_unlock_irqrestore(glock, flags);

- return;
+ CNETERR("%s: %s (%s), %x, %x, msg_size: %d, queue_depth: %d/%d, max_frags: %d/%d\n",
+ libcfs_nid2str(peer->ibp_nid),
+ reconnect ? "reconnect" : "don't reconnect",
+ reason, IBLND_MSG_VERSION, version, msg_size,
+ conn->ibc_queue_depth, queue_dep,
+ conn->ibc_max_frags, frag_num);
+ /**
+ * if conn::ibc_reconnect is TRUE, connd will reconnect to the peer
+ * while destroying the zombie
+ */
}

static void
@@ -2590,8 +2656,8 @@ kiblnd_rejected(kib_conn_t *conn, int reason, void *priv, int priv_nob)

switch (reason) {
case IB_CM_REJ_STALE_CONN:
- kiblnd_reconnect(conn, IBLND_MSG_VERSION, 0,
- IBLND_REJECT_CONN_STALE, NULL);
+ kiblnd_check_reconnect(conn, IBLND_MSG_VERSION, 0,
+ IBLND_REJECT_CONN_STALE, NULL);
break;

case IB_CM_REJ_INVALID_SERVICE_ID:
@@ -2675,8 +2741,9 @@ kiblnd_rejected(kib_conn_t *conn, int reason, void *priv, int priv_nob)
case IBLND_REJECT_CONN_UNCOMPAT:
case IBLND_REJECT_MSG_QUEUE_SIZE:
case IBLND_REJECT_RDMA_FRAGS:
- kiblnd_reconnect(conn, rej->ibr_version,
- incarnation, rej->ibr_why, cp);
+ kiblnd_check_reconnect(conn, rej->ibr_version,
+ incarnation,
+ rej->ibr_why, cp);
break;

case IBLND_REJECT_NO_RESOURCES:
@@ -3180,9 +3247,21 @@ kiblnd_disconnect_conn(kib_conn_t *conn)
kiblnd_peer_notify(conn->ibc_peer);
}

+/**
+ * High-water for reconnection to the same peer, reconnection attempt should
+ * be delayed after trying more than KIB_RECONN_HIGH_RACE.
+ */
+#define KIB_RECONN_HIGH_RACE 10
+/**
+ * Allow connd to take a break and handle other things after consecutive
+ * reconnection attemps.
+ */
+#define KIB_RECONN_BREAK 100
+
int
kiblnd_connd(void *arg)
{
+ spinlock_t *lock= &kiblnd_data.kib_connd_lock;
wait_queue_t wait;
unsigned long flags;
kib_conn_t *conn;
@@ -3197,23 +3276,40 @@ kiblnd_connd(void *arg)
init_waitqueue_entry(&wait, current);
kiblnd_data.kib_connd = current;

- spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
+ spin_lock_irqsave(lock, flags);

while (!kiblnd_data.kib_shutdown) {
+ int reconn = 0;
+
dropped_lock = 0;

if (!list_empty(&kiblnd_data.kib_connd_zombies)) {
+ kib_peer_t *peer = NULL;
+
conn = list_entry(kiblnd_data.kib_connd_zombies.next,
kib_conn_t, ibc_list);
list_del(&conn->ibc_list);
+ if (conn->ibc_reconnect) {
+ peer = conn->ibc_peer;
+ kiblnd_peer_addref(peer);
+ }

- spin_unlock_irqrestore(&kiblnd_data.kib_connd_lock,
- flags);
+ spin_unlock_irqrestore(lock, flags);
dropped_lock = 1;

- kiblnd_destroy_conn(conn);
+ kiblnd_destroy_conn(conn, !peer);

- spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
+ spin_lock_irqsave(lock, flags);
+ if (!peer)
+ continue;
+
+ conn->ibc_peer = peer;
+ if (peer->ibp_reconnected < KIB_RECONN_HIGH_RACE)
+ list_add_tail(&conn->ibc_list,
+ &kiblnd_data.kib_reconn_list);
+ else
+ list_add_tail(&conn->ibc_list,
+ &kiblnd_data.kib_reconn_wait);
}

if (!list_empty(&kiblnd_data.kib_connd_conns)) {
@@ -3221,14 +3317,38 @@ kiblnd_connd(void *arg)
kib_conn_t, ibc_list);
list_del(&conn->ibc_list);

- spin_unlock_irqrestore(&kiblnd_data.kib_connd_lock,
- flags);
+ spin_unlock_irqrestore(lock, flags);
dropped_lock = 1;

kiblnd_disconnect_conn(conn);
kiblnd_conn_decref(conn);

- spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
+ spin_lock_irqsave(lock, flags);
+ }
+
+ while (reconn < KIB_RECONN_BREAK) {
+ if (kiblnd_data.kib_reconn_sec !=
+ ktime_get_real_seconds()) {
+ kiblnd_data.kib_reconn_sec = ktime_get_real_seconds();
+ list_splice_init(&kiblnd_data.kib_reconn_wait,
+ &kiblnd_data.kib_reconn_list);
+ }
+
+ if (list_empty(&kiblnd_data.kib_reconn_list))
+ break;
+
+ conn = list_entry(kiblnd_data.kib_reconn_list.next,
+ kib_conn_t, ibc_list);
+ list_del(&conn->ibc_list);
+
+ spin_unlock_irqrestore(lock, flags);
+ dropped_lock = 1;
+
+ reconn += kiblnd_reconnect_peer(conn->ibc_peer);
+ kiblnd_peer_decref(conn->ibc_peer);
+ LIBCFS_FREE(conn, sizeof(*conn));
+
+ spin_lock_irqsave(lock, flags);
}

/* careful with the jiffy wrap... */
@@ -3238,7 +3358,7 @@ kiblnd_connd(void *arg)
const int p = 1;
int chunk = kiblnd_data.kib_peer_hash_size;

- spin_unlock_irqrestore(&kiblnd_data.kib_connd_lock, flags);
+ spin_unlock_irqrestore(lock, flags);
dropped_lock = 1;

/*
@@ -3263,7 +3383,7 @@ kiblnd_connd(void *arg)
}

deadline += msecs_to_jiffies(p * MSEC_PER_SEC);
- spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
+ spin_lock_irqsave(lock, flags);
}

if (dropped_lock)
@@ -3272,15 +3392,15 @@ kiblnd_connd(void *arg)
/* Nothing to do for 'timeout' */
set_current_state(TASK_INTERRUPTIBLE);
add_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
- spin_unlock_irqrestore(&kiblnd_data.kib_connd_lock, flags);
+ spin_unlock_irqrestore(lock, flags);

schedule_timeout(timeout);

remove_wait_queue(&kiblnd_data.kib_connd_waitq, &wait);
- spin_lock_irqsave(&kiblnd_data.kib_connd_lock, flags);
+ spin_lock_irqsave(lock, flags);
}

- spin_unlock_irqrestore(&kiblnd_data.kib_connd_lock, flags);
+ spin_unlock_irqrestore(lock, flags);

kiblnd_thread_fini();
return 0;
--
1.7.1

2016-03-02 22:02:54

by James Simmons

Subject: [PATCH] staging: lustre: Support different ko2iblnd configs between systems

From: Jeremy Filizetti <[email protected]>

This patch adds support for ko2iblnd to allow different values for
peer_credits and map_on_demand between systems.

Signed-off-by: Jeremy Filizetti <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-3322
Reviewed-on: http://review.whamcloud.com/11794
Reviewed-by: Amir Shehata <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 51 ++++---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 36 +++--
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 156 ++++++++++++--------
3 files changed, 146 insertions(+), 97 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 1dc18d7..0b1ffbe 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -631,7 +631,7 @@ static int kiblnd_get_completion_vector(kib_conn_t *conn, int cpt)
}

kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
- int state, int version)
+ int state, int version, kib_connparams_t *cp)
{
/*
* CAVEAT EMPTOR:
@@ -686,6 +686,14 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
cmid->context = conn; /* for future CM callbacks */
conn->ibc_cmid = cmid;

+ if (!cp) {
+ conn->ibc_max_frags = IBLND_CFG_RDMA_FRAGS;
+ conn->ibc_queue_depth = *kiblnd_tunables.kib_peertxcredits;
+ } else {
+ conn->ibc_max_frags = cp->ibcp_max_frags;
+ conn->ibc_queue_depth = cp->ibcp_queue_depth;
+ }
+
INIT_LIST_HEAD(&conn->ibc_early_rxs);
INIT_LIST_HEAD(&conn->ibc_tx_noops);
INIT_LIST_HEAD(&conn->ibc_tx_queue);
@@ -730,27 +738,27 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
write_unlock_irqrestore(glock, flags);

LIBCFS_CPT_ALLOC(conn->ibc_rxs, lnet_cpt_table(), cpt,
- IBLND_RX_MSGS(version) * sizeof(kib_rx_t));
+ IBLND_RX_MSGS(conn) * sizeof(kib_rx_t));
if (!conn->ibc_rxs) {
CERROR("Cannot allocate RX buffers\n");
goto failed_2;
}

rc = kiblnd_alloc_pages(&conn->ibc_rx_pages, cpt,
- IBLND_RX_MSG_PAGES(version));
+ IBLND_RX_MSG_PAGES(conn));
if (rc)
goto failed_2;

kiblnd_map_rx_descs(conn);

- cq_attr.cqe = IBLND_CQ_ENTRIES(version);
+ cq_attr.cqe = IBLND_CQ_ENTRIES(conn);
cq_attr.comp_vector = kiblnd_get_completion_vector(conn, cpt);
cq = ib_create_cq(cmid->device,
kiblnd_cq_completion, kiblnd_cq_event, conn,
&cq_attr);
if (IS_ERR(cq)) {
- CERROR("Can't create CQ: %ld, cqe: %d\n",
- PTR_ERR(cq), IBLND_CQ_ENTRIES(version));
+ CERROR("Failed to create CQ with %d CQEs: %ld\n",
+ IBLND_CQ_ENTRIES(conn), PTR_ERR(cq));
goto failed_2;
}

@@ -764,8 +772,8 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,

init_qp_attr->event_handler = kiblnd_qp_event;
init_qp_attr->qp_context = conn;
- init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(version);
- init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(version);
+ init_qp_attr->cap.max_send_wr = IBLND_SEND_WRS(conn);
+ init_qp_attr->cap.max_recv_wr = IBLND_RECV_WRS(conn);
init_qp_attr->cap.max_send_sge = 1;
init_qp_attr->cap.max_recv_sge = 1;
init_qp_attr->sq_sig_type = IB_SIGNAL_REQ_WR;
@@ -786,11 +794,11 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
LIBCFS_FREE(init_qp_attr, sizeof(*init_qp_attr));

/* 1 ref for caller and each rxmsg */
- atomic_set(&conn->ibc_refcount, 1 + IBLND_RX_MSGS(version));
- conn->ibc_nrx = IBLND_RX_MSGS(version);
+ atomic_set(&conn->ibc_refcount, 1 + IBLND_RX_MSGS(conn));
+ conn->ibc_nrx = IBLND_RX_MSGS(conn);

/* post receives */
- for (i = 0; i < IBLND_RX_MSGS(version); i++) {
+ for (i = 0; i < IBLND_RX_MSGS(conn); i++) {
rc = kiblnd_post_rx(&conn->ibc_rxs[i],
IBLND_POSTRX_NO_CREDIT);
if (rc) {
@@ -804,7 +812,7 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
* NB locking needed now I'm racing with completion
*/
spin_lock_irqsave(&sched->ibs_lock, flags);
- conn->ibc_nrx -= IBLND_RX_MSGS(version) - i;
+ conn->ibc_nrx -= IBLND_RX_MSGS(conn) - i;
spin_unlock_irqrestore(&sched->ibs_lock, flags);

/*
@@ -816,7 +824,7 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
conn->ibc_cmid = NULL;

/* Drop my own and unused rxbuffer refcounts */
- while (i++ <= IBLND_RX_MSGS(version))
+ while (i++ <= IBLND_RX_MSGS(conn))
kiblnd_conn_decref(conn);

return NULL;
@@ -886,8 +894,7 @@ void kiblnd_destroy_conn(kib_conn_t *conn)

if (conn->ibc_rxs) {
LIBCFS_FREE(conn->ibc_rxs,
- IBLND_RX_MSGS(conn->ibc_version)
- * sizeof(kib_rx_t));
+ IBLND_RX_MSGS(conn) * sizeof(kib_rx_t));
}

if (conn->ibc_connvars)
@@ -1143,7 +1150,7 @@ void kiblnd_unmap_rx_descs(kib_conn_t *conn)
LASSERT(conn->ibc_rxs);
LASSERT(conn->ibc_hdev);

- for (i = 0; i < IBLND_RX_MSGS(conn->ibc_version); i++) {
+ for (i = 0; i < IBLND_RX_MSGS(conn); i++) {
rx = &conn->ibc_rxs[i];

LASSERT(rx->rx_nob >= 0); /* not posted */
@@ -1167,7 +1174,7 @@ void kiblnd_map_rx_descs(kib_conn_t *conn)
int ipg;
int i;

- for (pg_off = ipg = i = 0; i < IBLND_RX_MSGS(conn->ibc_version); i++) {
+ for (pg_off = ipg = i = 0; i < IBLND_RX_MSGS(conn); i++) {
pg = conn->ibc_rx_pages->ibp_pages[ipg];
rx = &conn->ibc_rxs[i];

@@ -1192,7 +1199,7 @@ void kiblnd_map_rx_descs(kib_conn_t *conn)
if (pg_off == PAGE_SIZE) {
pg_off = 0;
ipg++;
- LASSERT(ipg <= IBLND_RX_MSG_PAGES(conn->ibc_version));
+ LASSERT(ipg <= IBLND_RX_MSG_PAGES(conn));
}
}
}
@@ -1296,12 +1303,16 @@ static void kiblnd_map_tx_pool(kib_tx_pool_t *tpo)
}
}

-struct ib_mr *kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd)
+struct ib_mr *kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd,
+ int negotiated_nfrags)
{
+ __u16 nfrags = (negotiated_nfrags != -1) ?
+ negotiated_nfrags : *kiblnd_tunables.kib_map_on_demand;
+
LASSERT(hdev->ibh_mrs);

if (*kiblnd_tunables.kib_map_on_demand > 0 &&
- *kiblnd_tunables.kib_map_on_demand <= rd->rd_nfrags)
+ nfrags <= rd->rd_nfrags)
return NULL;

return hdev->ibh_mrs;
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 0c88e8b..59a26c4 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -162,18 +162,17 @@ kiblnd_concurrent_sends_v1(void)
#define IBLND_FMR_POOL 256
#define IBLND_FMR_POOL_FLUSH 192

-/* TX messages (shared by all connections) */
-#define IBLND_TX_MSGS() (*kiblnd_tunables.kib_ntx)
-
-/* RX messages (per connection) */
-#define IBLND_RX_MSGS(v) (IBLND_MSG_QUEUE_SIZE(v) * 2 + IBLND_OOB_MSGS(v))
-#define IBLND_RX_MSG_BYTES(v) (IBLND_RX_MSGS(v) * IBLND_MSG_SIZE)
-#define IBLND_RX_MSG_PAGES(v) ((IBLND_RX_MSG_BYTES(v) + PAGE_SIZE - 1) / PAGE_SIZE)
+#define IBLND_RX_MSGS(c) \
+ ((c->ibc_queue_depth) * 2 + IBLND_OOB_MSGS(c->ibc_version))
+#define IBLND_RX_MSG_BYTES(c) (IBLND_RX_MSGS(c) * IBLND_MSG_SIZE)
+#define IBLND_RX_MSG_PAGES(c) \
+ ((IBLND_RX_MSG_BYTES(c) + PAGE_SIZE - 1) / PAGE_SIZE)

/* WRs and CQEs (per connection) */
-#define IBLND_RECV_WRS(v) IBLND_RX_MSGS(v)
-#define IBLND_SEND_WRS(v) ((IBLND_RDMA_FRAGS(v) + 1) * IBLND_CONCURRENT_SENDS(v))
-#define IBLND_CQ_ENTRIES(v) (IBLND_RECV_WRS(v) + IBLND_SEND_WRS(v))
+#define IBLND_RECV_WRS(c) IBLND_RX_MSGS(c)
+#define IBLND_SEND_WRS(c) \
+ ((c->ibc_max_frags + 1) * IBLND_CONCURRENT_SENDS(c->ibc_version))
+#define IBLND_CQ_ENTRIES(c) (IBLND_RECV_WRS(c) + IBLND_SEND_WRS(c))

struct kib_hca_dev;

@@ -464,10 +463,10 @@ typedef struct {
#define IBLND_REJECT_FATAL 3 /* Anything else */
#define IBLND_REJECT_CONN_UNCOMPAT 4 /* incompatible version peer */
#define IBLND_REJECT_CONN_STALE 5 /* stale peer */
-#define IBLND_REJECT_RDMA_FRAGS 6 /* Fatal: peer's rdma frags can't match */
- /* mine */
-#define IBLND_REJECT_MSG_QUEUE_SIZE 7 /* Fatal: peer's msg queue size can't */
- /* match mine */
+/* peer's rdma frags doesn't match mine */
+#define IBLND_REJECT_RDMA_FRAGS 6
+/* peer's msg queue size doesn't match mine */
+#define IBLND_REJECT_MSG_QUEUE_SIZE 7

/***********************************************************************/

@@ -535,6 +534,10 @@ typedef struct kib_conn {
int ibc_outstanding_credits; /* # credits to return */
int ibc_reserved_credits; /* # ACK/DONE msg credits */
int ibc_comms_error; /* set on comms error */
+ /* connections queue depth */
+ __u16 ibc_queue_depth;
+ /* connections max frags */
+ __u16 ibc_max_frags;
unsigned int ibc_nrx:16; /* receive buffers owned */
unsigned int ibc_scheduled:1; /* scheduled for attention */
unsigned int ibc_ready:1; /* CQ callback fired */
@@ -907,7 +910,8 @@ static inline unsigned int kiblnd_sg_dma_len(struct ib_device *dev,
#define KIBLND_CONN_PARAM_LEN(e) ((e)->param.conn.private_data_len)

struct ib_mr *kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev,
- kib_rdma_desc_t *rd);
+ kib_rdma_desc_t *rd,
+ int negotiated_nfrags);
void kiblnd_map_rx_descs(kib_conn_t *conn);
void kiblnd_unmap_rx_descs(kib_conn_t *conn);
void kiblnd_pool_free_node(kib_pool_t *pool, struct list_head *node);
@@ -942,7 +946,7 @@ int kiblnd_close_stale_conns_locked(kib_peer_t *peer,
int kiblnd_close_peer_conns_locked(kib_peer_t *peer, int why);

kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
- int state, int version);
+ int state, int version, kib_connparams_t *cp);
void kiblnd_destroy_conn(kib_conn_t *conn);
void kiblnd_close_conn(kib_conn_t *conn, int error);
void kiblnd_close_conn_locked(kib_conn_t *conn, int error);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index fc788b0..0c5b689 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -328,14 +328,13 @@ kiblnd_handle_rx(kib_rx_t *rx)
spin_lock(&conn->ibc_lock);

if (conn->ibc_credits + credits >
- IBLND_MSG_QUEUE_SIZE(conn->ibc_version)) {
+ conn->ibc_queue_depth) {
rc2 = conn->ibc_credits;
spin_unlock(&conn->ibc_lock);

CERROR("Bad credits from %s: %d + %d > %d\n",
libcfs_nid2str(conn->ibc_peer->ibp_nid),
- rc2, credits,
- IBLND_MSG_QUEUE_SIZE(conn->ibc_version));
+ rc2, credits, conn->ibc_queue_depth);

kiblnd_close_conn(conn, -EPROTO);
kiblnd_post_rx(rx, IBLND_POSTRX_NO_CREDIT);
@@ -653,8 +652,8 @@ static int kiblnd_map_tx(lnet_ni_t *ni, kib_tx_t *tx, kib_rdma_desc_t *rd,
nob += rd->rd_frags[i].rf_nob;
}

- /* looking for pre-mapping MR */
- mr = kiblnd_find_rd_dma_mr(hdev, rd);
+ mr = kiblnd_find_rd_dma_mr(hdev, rd, tx->tx_conn ?
+ tx->tx_conn->ibc_max_frags : -1);
if (mr) {
/* found pre-mapping MR */
rd->rd_key = (rd != tx->tx_rd) ? mr->rkey : mr->lkey;
@@ -774,13 +773,13 @@ kiblnd_post_tx_locked(kib_conn_t *conn, kib_tx_t *tx, int credit)
LASSERT(tx->tx_queued);
/* We rely on this for QP sizing */
LASSERT(tx->tx_nwrq > 0);
- LASSERT(tx->tx_nwrq <= 1 + IBLND_RDMA_FRAGS(ver));
+ LASSERT(tx->tx_nwrq <= 1 + conn->ibc_max_frags);

LASSERT(!credit || credit == 1);
LASSERT(conn->ibc_outstanding_credits >= 0);
- LASSERT(conn->ibc_outstanding_credits <= IBLND_MSG_QUEUE_SIZE(ver));
+ LASSERT(conn->ibc_outstanding_credits <= conn->ibc_queue_depth);
LASSERT(conn->ibc_credits >= 0);
- LASSERT(conn->ibc_credits <= IBLND_MSG_QUEUE_SIZE(ver));
+ LASSERT(conn->ibc_credits <= conn->ibc_queue_depth);

if (conn->ibc_nsends_posted == IBLND_CONCURRENT_SENDS(ver)) {
/* tx completions outstanding... */
@@ -1089,10 +1088,10 @@ kiblnd_init_rdma(kib_conn_t *conn, kib_tx_t *tx, int type,
break;
}

- if (tx->tx_nwrq == IBLND_RDMA_FRAGS(conn->ibc_version)) {
- CERROR("RDMA too fragmented for %s (%d): %d/%d src %d/%d dst frags\n",
+ if (tx->tx_nwrq >= conn->ibc_max_frags) {
+ CERROR("RDMA has too many fragments for peer %s (%d), src idx/frags: %d/%d dst idx/frags: %d/%d\n",
libcfs_nid2str(conn->ibc_peer->ibp_nid),
- IBLND_RDMA_FRAGS(conn->ibc_version),
+ conn->ibc_max_frags,
srcidx, srcrd->rd_nfrags,
dstidx, dstrd->rd_nfrags);
rc = -EMSGSIZE;
@@ -2243,7 +2242,7 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
if (!ni || /* no matching net */
ni->ni_nid != reqmsg->ibm_dstnid || /* right NET, wrong NID! */
net->ibn_dev != ibdev) { /* wrong device */
- CERROR("Can't accept %s on %s (%s:%d:%pI4h): bad dst nid %s\n",
+ CERROR("Can't accept conn from %s on %s (%s:%d:%pI4h): bad dst nid %s\n",
libcfs_nid2str(nid),
!ni ? "NA" : libcfs_nid2str(ni->ni_nid),
ibdev->ibd_ifname, ibdev->ibd_nnets,
@@ -2270,10 +2269,11 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
goto failed;
}

- if (reqmsg->ibm_u.connparams.ibcp_queue_depth !=
+ if (reqmsg->ibm_u.connparams.ibcp_queue_depth >
IBLND_MSG_QUEUE_SIZE(version)) {
- CERROR("Can't accept %s: incompatible queue depth %d (%d wanted)\n",
- libcfs_nid2str(nid), reqmsg->ibm_u.connparams.ibcp_queue_depth,
+ CERROR("Can't accept conn from %s, queue depth too large: %d (<=%d wanted)\n",
+ libcfs_nid2str(nid),
+ reqmsg->ibm_u.connparams.ibcp_queue_depth,
IBLND_MSG_QUEUE_SIZE(version));

if (version == IBLND_MSG_VERSION)
@@ -2282,14 +2282,25 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
goto failed;
}

- if (reqmsg->ibm_u.connparams.ibcp_max_frags !=
+ if (reqmsg->ibm_u.connparams.ibcp_max_frags >
IBLND_RDMA_FRAGS(version)) {
- CERROR("Can't accept %s(version %x): incompatible max_frags %d (%d wanted)\n",
- libcfs_nid2str(nid), version,
- reqmsg->ibm_u.connparams.ibcp_max_frags,
- IBLND_RDMA_FRAGS(version));
+ CWARN("Can't accept conn from %s (version %x): max_frags %d too large (%d wanted)\n",
+ libcfs_nid2str(nid), version,
+ reqmsg->ibm_u.connparams.ibcp_max_frags,
+ IBLND_RDMA_FRAGS(version));

- if (version == IBLND_MSG_VERSION)
+ if (version >= IBLND_MSG_VERSION)
+ rej.ibr_why = IBLND_REJECT_RDMA_FRAGS;
+
+ goto failed;
+ } else if (reqmsg->ibm_u.connparams.ibcp_max_frags <
+ IBLND_RDMA_FRAGS(version) && !net->ibn_fmr_ps) {
+ CWARN("Can't accept conn from %s (version %x): max_frags %d incompatible without FMR pool (%d wanted)\n",
+ libcfs_nid2str(nid), version,
+ reqmsg->ibm_u.connparams.ibcp_max_frags,
+ IBLND_RDMA_FRAGS(version));
+
+ if (version >= IBLND_MSG_VERSION)
rej.ibr_why = IBLND_REJECT_RDMA_FRAGS;

goto failed;
@@ -2371,7 +2382,8 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
write_unlock_irqrestore(g_lock, flags);
}

- conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_PASSIVE_WAIT, version);
+ conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_PASSIVE_WAIT, version,
+ &reqmsg->ibm_u.connparams);
if (!conn) {
kiblnd_peer_connect_failed(peer, 0, -ENOMEM);
kiblnd_peer_decref(peer);
@@ -2384,19 +2396,21 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
* CM callback doesn't destroy cmid.
*/
conn->ibc_incarnation = reqmsg->ibm_srcstamp;
- conn->ibc_credits = IBLND_MSG_QUEUE_SIZE(version);
- conn->ibc_reserved_credits = IBLND_MSG_QUEUE_SIZE(version);
- LASSERT(conn->ibc_credits + conn->ibc_reserved_credits + IBLND_OOB_MSGS(version)
- <= IBLND_RX_MSGS(version));
+ conn->ibc_credits = reqmsg->ibm_u.connparams.ibcp_queue_depth;
+ conn->ibc_reserved_credits = reqmsg->ibm_u.connparams.ibcp_queue_depth;
+ LASSERT(conn->ibc_credits + conn->ibc_reserved_credits +
+ IBLND_OOB_MSGS(version) <= IBLND_RX_MSGS(conn));

ackmsg = &conn->ibc_connvars->cv_msg;
memset(ackmsg, 0, sizeof(*ackmsg));

kiblnd_init_msg(ackmsg, IBLND_MSG_CONNACK,
sizeof(ackmsg->ibm_u.connparams));
- ackmsg->ibm_u.connparams.ibcp_queue_depth = IBLND_MSG_QUEUE_SIZE(version);
+ ackmsg->ibm_u.connparams.ibcp_queue_depth =
+ reqmsg->ibm_u.connparams.ibcp_queue_depth;
+ ackmsg->ibm_u.connparams.ibcp_max_frags =
+ reqmsg->ibm_u.connparams.ibcp_max_frags;
ackmsg->ibm_u.connparams.ibcp_max_msg_size = IBLND_MSG_SIZE;
- ackmsg->ibm_u.connparams.ibcp_max_frags = IBLND_RDMA_FRAGS(version);

kiblnd_pack_msg(ni, ackmsg, version, 0, nid, reqmsg->ibm_srcstamp);

@@ -2479,6 +2493,31 @@ kiblnd_reconnect(kib_conn_t *conn, int version,
reason = "Unknown";
break;

+ case IBLND_REJECT_RDMA_FRAGS:
+ if (conn->ibc_max_frags <= cp->ibcp_max_frags) {
+ CNETERR("Unsupported max frags, peer supports %d\n",
+ cp->ibcp_max_frags);
+ goto failed;
+ } else if (!*kiblnd_tunables.kib_map_on_demand) {
+ CNETERR("map_on_demand must be enabled to support map_on_demand peers\n");
+ goto failed;
+ }
+
+ conn->ibc_max_frags = cp->ibcp_max_frags;
+ reason = "rdma fragments";
+ break;
+
+ case IBLND_REJECT_MSG_QUEUE_SIZE:
+ if (conn->ibc_queue_depth <= cp->ibcp_queue_depth) {
+ CNETERR("Unsupported queue depth, peer supports %d\n",
+ cp->ibcp_queue_depth);
+ goto failed;
+ }
+
+ conn->ibc_queue_depth = cp->ibcp_queue_depth;
+ reason = "queue depth";
+ break;
+
case IBLND_REJECT_CONN_STALE:
reason = "stale";
break;
@@ -2495,11 +2534,17 @@ kiblnd_reconnect(kib_conn_t *conn, int version,
CNETERR("%s: retrying (%s), %x, %x, queue_dep: %d, max_frag: %d, msg_size: %d\n",
libcfs_nid2str(peer->ibp_nid),
reason, IBLND_MSG_VERSION, version,
- cp ? cp->ibcp_queue_depth : IBLND_MSG_QUEUE_SIZE(version),
- cp ? cp->ibcp_max_frags : IBLND_RDMA_FRAGS(version),
+ conn->ibc_queue_depth, conn->ibc_max_frags,
cp ? cp->ibcp_max_msg_size : IBLND_MSG_SIZE);

kiblnd_connect_peer(peer);
+ return;
+failed:
+ write_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
+ peer->ibp_connecting--;
+ write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
+
+ return;
}

static void
@@ -2595,24 +2640,10 @@ kiblnd_rejected(kib_conn_t *conn, int reason, void *priv, int priv_nob)
case IBLND_REJECT_CONN_RACE:
case IBLND_REJECT_CONN_STALE:
case IBLND_REJECT_CONN_UNCOMPAT:
- kiblnd_reconnect(conn, rej->ibr_version,
- incarnation, rej->ibr_why, cp);
- break;
-
case IBLND_REJECT_MSG_QUEUE_SIZE:
- CERROR("%s rejected: incompatible message queue depth %d, %d\n",
- libcfs_nid2str(peer->ibp_nid),
- cp ? cp->ibcp_queue_depth :
- IBLND_MSG_QUEUE_SIZE(rej->ibr_version),
- IBLND_MSG_QUEUE_SIZE(conn->ibc_version));
- break;
-
case IBLND_REJECT_RDMA_FRAGS:
- CERROR("%s rejected: incompatible # of RDMA fragments %d, %d\n",
- libcfs_nid2str(peer->ibp_nid),
- cp ? cp->ibcp_max_frags :
- IBLND_RDMA_FRAGS(rej->ibr_version),
- IBLND_RDMA_FRAGS(conn->ibc_version));
+ kiblnd_reconnect(conn, rej->ibr_version,
+ incarnation, rej->ibr_why, cp);
break;

case IBLND_REJECT_NO_RESOURCES:
@@ -2676,22 +2707,22 @@ kiblnd_check_connreply(kib_conn_t *conn, void *priv, int priv_nob)
goto failed;
}

- if (msg->ibm_u.connparams.ibcp_queue_depth !=
- IBLND_MSG_QUEUE_SIZE(ver)) {
- CERROR("%s has incompatible queue depth %d(%d wanted)\n",
+ if (msg->ibm_u.connparams.ibcp_queue_depth >
+ conn->ibc_queue_depth) {
+ CERROR("%s has incompatible queue depth %d (<=%d wanted)\n",
libcfs_nid2str(peer->ibp_nid),
msg->ibm_u.connparams.ibcp_queue_depth,
- IBLND_MSG_QUEUE_SIZE(ver));
+ conn->ibc_queue_depth);
rc = -EPROTO;
goto failed;
}

- if (msg->ibm_u.connparams.ibcp_max_frags !=
- IBLND_RDMA_FRAGS(ver)) {
- CERROR("%s has incompatible max_frags %d (%d wanted)\n",
+ if (msg->ibm_u.connparams.ibcp_max_frags >
+ conn->ibc_max_frags) {
+ CERROR("%s has incompatible max_frags %d (<=%d wanted)\n",
libcfs_nid2str(peer->ibp_nid),
msg->ibm_u.connparams.ibcp_max_frags,
- IBLND_RDMA_FRAGS(ver));
+ conn->ibc_max_frags);
rc = -EPROTO;
goto failed;
}
@@ -2721,10 +2752,12 @@ kiblnd_check_connreply(kib_conn_t *conn, void *priv, int priv_nob)
}

conn->ibc_incarnation = msg->ibm_srcstamp;
- conn->ibc_credits =
- conn->ibc_reserved_credits = IBLND_MSG_QUEUE_SIZE(ver);
- LASSERT(conn->ibc_credits + conn->ibc_reserved_credits + IBLND_OOB_MSGS(ver)
- <= IBLND_RX_MSGS(ver));
+ conn->ibc_credits = msg->ibm_u.connparams.ibcp_queue_depth;
+ conn->ibc_reserved_credits = msg->ibm_u.connparams.ibcp_queue_depth;
+ conn->ibc_queue_depth = msg->ibm_u.connparams.ibcp_queue_depth;
+ conn->ibc_max_frags = msg->ibm_u.connparams.ibcp_max_frags;
+ LASSERT(conn->ibc_credits + conn->ibc_reserved_credits +
+ IBLND_OOB_MSGS(ver) <= IBLND_RX_MSGS(conn));

kiblnd_connreq_done(conn, 0);
return;
@@ -2761,7 +2794,8 @@ kiblnd_active_connect(struct rdma_cm_id *cmid)

read_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

- conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_ACTIVE_CONNECT, version);
+ conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_ACTIVE_CONNECT,
+ version, NULL);
if (!conn) {
kiblnd_peer_connect_failed(peer, 1, -ENOMEM);
kiblnd_peer_decref(peer); /* lose cmid's ref */
@@ -2777,8 +2811,8 @@ kiblnd_active_connect(struct rdma_cm_id *cmid)

memset(msg, 0, sizeof(*msg));
kiblnd_init_msg(msg, IBLND_MSG_CONNREQ, sizeof(msg->ibm_u.connparams));
- msg->ibm_u.connparams.ibcp_queue_depth = IBLND_MSG_QUEUE_SIZE(version);
- msg->ibm_u.connparams.ibcp_max_frags = IBLND_RDMA_FRAGS(version);
+ msg->ibm_u.connparams.ibcp_queue_depth = conn->ibc_queue_depth;
+ msg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags;
msg->ibm_u.connparams.ibcp_max_msg_size = IBLND_MSG_SIZE;

kiblnd_pack_msg(peer->ibp_ni, msg, version,
--
1.7.1

2016-03-02 22:02:59

by James Simmons

[permalink] [raw]
Subject: [PATCH 24/27] staging: lustre: Change connect peer failed cleanup order

From: Doug Oucharek <[email protected]>

A race condition has been found where connd is cleaning up failed
connections: the peer ref counter goes to zero, but we still have
a connecting counter > 0.

One possible race is when we are retrying a connection:
kiblnd_connect_peer() itself fails, decrements the peer ref
counter, and gets swapped out before it can decrement the
connecting counter. connd then swaps in and cleans up the
connection, where it sees a peer ref counter of 1 and a
connecting counter of 1. This triggers the assert seen in
LU-7210 when it decrements the peer counter.

The solution: be sure to decrement the connecting counter
before decrementing the peer counter in the peer connect
failure path.
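
A rough sketch of the corrected order (helper names taken from the diff
below): the connecting counter is dropped via kiblnd_peer_connect_failed()
while the caller still holds its peer reference, and only then is that
reference released.

failed2:
	kiblnd_peer_connect_failed(peer, 1, rc);  /* drops ibp_connecting first */
	kiblnd_peer_decref(peer);                 /* cmid's ref; may free the peer */
	rdma_destroy_id(cmid);
	return;
failed:
	kiblnd_peer_connect_failed(peer, 1, rc);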

Signed-off-by: Doug Oucharek <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7210
Reviewed-on: http://review.whamcloud.com/17004
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Amir Shehata <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 11e12ae..9428166 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -1299,8 +1299,10 @@ kiblnd_connect_peer(kib_peer_t *peer)
return;

failed2:
+ kiblnd_peer_connect_failed(peer, 1, rc);
kiblnd_peer_decref(peer); /* cmid's ref */
rdma_destroy_id(cmid);
+ return;
failed:
kiblnd_peer_connect_failed(peer, 1, rc);
}
--
1.7.1

2016-03-02 22:03:05

by James Simmons

[permalink] [raw]
Subject: [PATCH 27/27] staging: lustre: do less intense allocating retry for ko2iblnd

From: Liang Zhen <[email protected]>

ko2iblnd may retry too frequently when growing pools: all schedulers
spin if another thread is in the middle of allocating a new pool and
cannot finish right away because of high system load.
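
A minimal sketch of the backoff this patch introduces (simplified from the
diff below; another_thread_is_growing_pool() is a hypothetical stand-in for
the ps_increasing check): instead of calling schedule() in a tight loop, the
waiter sleeps for an interval that doubles up to one second.

	unsigned int interval = 1;	/* in jiffies */

	while (another_thread_is_growing_pool(ps)) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule_timeout(interval);	/* sleep instead of spinning */
		if (interval < cfs_time_seconds(1))
			interval *= 2;		/* back off exponentially */
	}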

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7054
Reviewed-on: http://review.whamcloud.com/16470
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 135ccf1..0d32e65 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1218,6 +1218,7 @@ static kib_hca_dev_t *kiblnd_current_hdev(kib_dev_t *dev)
if (!(i++ % 50))
CDEBUG(D_NET, "%s: Wait for failover\n",
dev->ibd_ifname);
+ set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(cfs_time_seconds(1) / 100);

read_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
@@ -1684,6 +1685,9 @@ struct list_head *kiblnd_pool_alloc_node(kib_poolset_t *ps)
{
struct list_head *node;
kib_pool_t *pool;
+ unsigned int interval = 1;
+ unsigned long time_before;
+ unsigned int trips = 0;
int rc;

again:
@@ -1709,9 +1713,15 @@ struct list_head *kiblnd_pool_alloc_node(kib_poolset_t *ps)
if (ps->ps_increasing) {
/* another thread is allocating a new pool */
spin_unlock(&ps->ps_lock);
- CDEBUG(D_NET, "Another thread is allocating new %s pool, waiting for her to complete\n",
- ps->ps_name);
- schedule();
+ trips++;
+ CDEBUG(D_NET, "Another thread is allocating new %s pool, waiting %d HZs for her to complete. trips = %d\n",
+ ps->ps_name, interval, trips);
+
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(interval);
+ if (interval < cfs_time_seconds(1))
+ interval *= 2;
+
goto again;
}

@@ -1725,8 +1735,10 @@ struct list_head *kiblnd_pool_alloc_node(kib_poolset_t *ps)
spin_unlock(&ps->ps_lock);

CDEBUG(D_NET, "%s pool exhausted, allocate new pool\n", ps->ps_name);
-
+ time_before = cfs_time_current();
rc = ps->ps_pool_create(ps, ps->ps_pool_size, &pool);
+ CDEBUG(D_NET, "ps_pool_create took %lu HZ to complete",
+ cfs_time_current() - time_before);

spin_lock(&ps->ps_lock);
ps->ps_increasing = 0;
--
1.7.1

2016-03-02 22:03:47

by James Simmons

[permalink] [raw]
Subject: [PATCH 25/27] staging: lustre: check wr_id returned by ib_poll_cq

From: Liang Zhen <[email protected]>

If ib_poll_cq() returns a positive value without initialising
ib_wc::wr_id (a bug in the driver), then o2iblnd will run into an
unpredictable situation because ib_wc::wr_id may refer to a stale
tx/rx pointer on the stack.

If this happens it indicates a bug in the HCA driver; ko2iblnd
should output a console error and then close the current connection.

This patch could also be helpful for LU-5271
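
The defensive pattern, as a rough sketch (using the IBLND_WID_INVAL
sentinel added by the diff below): poison wr_id before polling, then treat
an unchanged sentinel as a fatal error on the connection.

	wc.wr_id = IBLND_WID_INVAL;		/* poison before polling */
	rc = ib_poll_cq(conn->ibc_cq, 1, &wc);
	if (rc > 0 && wc.wr_id == IBLND_WID_INVAL) {
		/* HCA driver returned a completion without filling in wr_id:
		 * report it and close this connection rather than chasing a
		 * stale tx/rx pointer. */
		rc = -EINVAL;
	}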

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-519
Reviewed-on: http://review.whamcloud.com/12747
Reviewed-by: Isaac Huang <[email protected]>
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 9 ++++---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 21 ++++++++++++++++++-
2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 3db1413..6a4c4ac 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -762,10 +762,11 @@ kiblnd_queue2str(kib_conn_t *conn, struct list_head *q)
/* CAVEAT EMPTOR: We rely on descriptor alignment to allow us to use the */
/* lowest bits of the work request id to stash the work item type. */

-#define IBLND_WID_TX 0
-#define IBLND_WID_RDMA 1
-#define IBLND_WID_RX 2
-#define IBLND_WID_MASK 3UL
+#define IBLND_WID_INVAL 0
+#define IBLND_WID_TX 1
+#define IBLND_WID_RX 2
+#define IBLND_WID_RDMA 3
+#define IBLND_WID_MASK 3UL

static inline __u64
kiblnd_ptr2wreqid(void *ptr, int type)
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 9428166..1c62875 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -768,7 +768,6 @@ kiblnd_post_tx_locked(kib_conn_t *conn, kib_tx_t *tx, int credit)
int ver = conn->ibc_version;
int rc;
int done;
- struct ib_send_wr *bad_wrq;

LASSERT(tx->tx_queued);
/* We rely on this for QP sizing */
@@ -852,7 +851,14 @@ kiblnd_post_tx_locked(kib_conn_t *conn, kib_tx_t *tx, int credit)
/* close_conn will launch failover */
rc = -ENETDOWN;
} else {
- rc = ib_post_send(conn->ibc_cmid->qp, &tx->tx_wrq->wr, &bad_wrq);
+ struct ib_send_wr *wrq = &tx->tx_wrq[tx->tx_nwrq - 1].wr;
+
+ LASSERTF(wrq->wr_id == kiblnd_ptr2wreqid(tx, IBLND_WID_TX),
+ "bad wr_id %llx, opc %d, flags %d, peer: %s\n",
+ wrq->wr_id, wrq->opcode, wrq->send_flags,
+ libcfs_nid2str(conn->ibc_peer->ibp_nid));
+ wrq = NULL;
+ rc = ib_post_send(conn->ibc_cmid->qp, &tx->tx_wrq->wr, &wrq);
}

conn->ibc_last_send = jiffies;
@@ -3421,6 +3427,8 @@ kiblnd_scheduler(void *arg)

spin_unlock_irqrestore(&sched->ibs_lock, flags);

+ wc.wr_id = IBLND_WID_INVAL;
+
rc = ib_poll_cq(conn->ibc_cq, 1, &wc);
if (!rc) {
rc = ib_req_notify_cq(conn->ibc_cq,
@@ -3438,6 +3446,15 @@ kiblnd_scheduler(void *arg)
rc = ib_poll_cq(conn->ibc_cq, 1, &wc);
}

+ if (unlikely(rc > 0 && wc.wr_id == IBLND_WID_INVAL)) {
+ LCONSOLE_ERROR("ib_poll_cq (rc: %d) returned invalid wr_id, opcode %d, status: %d, vendor_err: %d, conn: %s status: %d\nplease upgrade firmware and OFED or contact vendor.\n",
+ rc, wc.opcode, wc.status,
+ wc.vendor_err,
+ libcfs_nid2str(conn->ibc_peer->ibp_nid),
+ conn->ibc_state);
+ rc = -EINVAL;
+ }
+
if (rc < 0) {
CWARN("%s: ib_poll_cq failed: %d, closing connection\n",
libcfs_nid2str(conn->ibc_peer->ibp_nid),
--
1.7.1

2016-03-02 22:04:00

by James Simmons

[permalink] [raw]
Subject: [PATCH 23/27] staging: lustre: take extra refcount in kiblnd_connreq_done

From: Liang Zhen <[email protected]>

The refcount taken by cmid is not reliable after kiblnd_connreq_done()
releases the glock, because the connection is then visible to other
threads. Another thread can find and close this connection right after
kiblnd_connreq_done() releases the glock, and if kiblnd_cm_callback()
for RDMA_CM_EVENT_DISCONNECTED is called, it can release the connection
refcount taken by cmid. This means the connection could be destroyed
before kiblnd_connreq_done() finishes its operations on it.
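
In outline, the fix brackets the post-unlock work with its own reference
(a sketch, not the full function; names taken from the diff below):

	kiblnd_conn_addref(conn);	/* keep conn alive across the unlock */
	write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

	/* ... schedule blocked txs, handle early rxs ... */

	kiblnd_conn_decref(conn);	/* drop our temporary reference */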

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7210
Reviewed-on: http://review.whamcloud.com/17527
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Tested-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index fb3873a..11e12ae 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -939,8 +939,6 @@ kiblnd_check_sends(kib_conn_t *conn)
kiblnd_queue_tx_locked(tx, conn);
}

- kiblnd_conn_addref(conn); /* 1 ref for me.... (see b21911) */
-
for (;;) {
int credit;

@@ -966,8 +964,6 @@ kiblnd_check_sends(kib_conn_t *conn)
}

spin_unlock(&conn->ibc_lock);
-
- kiblnd_conn_decref(conn); /* ...until here */
}

static void
@@ -2132,6 +2128,16 @@ kiblnd_connreq_done(kib_conn_t *conn, int status)
return;
}

+ /**
+ * refcount taken by cmid is not reliable after I released the glock
+ * because this connection is visible to other threads now, another
+ * thread can find and close this connection right after I released
+ * the glock, if kiblnd_cm_callback for RDMA_CM_EVENT_DISCONNECTED is
+ * called, it can release the connection refcount taken by cmid.
+ * It means the connection could be destroyed before I finish my
+ * operations on it.
+ */
+ kiblnd_conn_addref(conn);
write_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

/* Schedule blocked txs */
@@ -2147,6 +2153,8 @@ kiblnd_connreq_done(kib_conn_t *conn, int status)

/* schedule blocked rxs */
kiblnd_handle_early_rxs(conn);
+
+ kiblnd_conn_decref(conn);
}

static void
--
1.7.1

2016-03-02 22:04:27

by James Simmons

[permalink] [raw]
Subject: [PATCH 22/27] staging: lustre: make ko2iblnd connect parameters persistent

From: Amir Shehata <[email protected]>

Store map-on-demand and peertx credits in the peer, since the peer
is persistent. Also make sure that when assigning the parameters
received on the connection to the peer structure through create,
if another peer is added before grabbing the lock we assign these
parameters to it as well.
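
Roughly, the connection now inherits its limits from the persistent peer
rather than from the (possibly missing) connparams; a sketch built from the
assignments in the diff below:

	/* at peer creation: start from the configured defaults */
	peer->ibp_max_frags = IBLND_CFG_RDMA_FRAGS;
	peer->ibp_queue_depth = *kiblnd_tunables.kib_peertxcredits;

	/* at connection creation: copy the peer's (negotiated) limits */
	conn->ibc_max_frags = peer->ibp_max_frags;
	conn->ibc_queue_depth = peer->ibp_queue_depth;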

Signed-off-by: Amir Shehata <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-3322
Reviewed-on: http://review.whamcloud.com/17074
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 14 +++-----
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 6 +++-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 38 ++++++++++++++------
3 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 0b1ffbe..56c221b 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -335,6 +335,8 @@ int kiblnd_create_peer(lnet_ni_t *ni, kib_peer_t **peerp, lnet_nid_t nid)
peer->ibp_nid = nid;
peer->ibp_error = 0;
peer->ibp_last_alive = 0;
+ peer->ibp_max_frags = IBLND_CFG_RDMA_FRAGS;
+ peer->ibp_queue_depth = *kiblnd_tunables.kib_peertxcredits;
atomic_set(&peer->ibp_refcount, 1); /* 1 ref for caller */

INIT_LIST_HEAD(&peer->ibp_list); /* not in the peer table yet */
@@ -631,7 +633,7 @@ static int kiblnd_get_completion_vector(kib_conn_t *conn, int cpt)
}

kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
- int state, int version, kib_connparams_t *cp)
+ int state, int version)
{
/*
* CAVEAT EMPTOR:
@@ -685,14 +687,8 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
conn->ibc_peer = peer; /* I take the caller's ref */
cmid->context = conn; /* for future CM callbacks */
conn->ibc_cmid = cmid;
-
- if (!cp) {
- conn->ibc_max_frags = IBLND_CFG_RDMA_FRAGS;
- conn->ibc_queue_depth = *kiblnd_tunables.kib_peertxcredits;
- } else {
- conn->ibc_max_frags = cp->ibcp_max_frags;
- conn->ibc_queue_depth = cp->ibcp_queue_depth;
- }
+ conn->ibc_max_frags = peer->ibp_max_frags;
+ conn->ibc_queue_depth = peer->ibp_queue_depth;

INIT_LIST_HEAD(&conn->ibc_early_rxs);
INIT_LIST_HEAD(&conn->ibc_tx_noops);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 59a26c4..3db1413 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -586,6 +586,10 @@ typedef struct kib_peer {
int ibp_error; /* errno on closing this peer */
unsigned long ibp_last_alive; /* when (in jiffies) I was last alive
*/
+ /* max map_on_demand */
+ __u16 ibp_max_frags;
+ /* max_peer_credits */
+ __u16 ibp_queue_depth;
} kib_peer_t;

extern kib_data_t kiblnd_data;
@@ -946,7 +950,7 @@ int kiblnd_close_stale_conns_locked(kib_peer_t *peer,
int kiblnd_close_peer_conns_locked(kib_peer_t *peer, int why);

kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,
- int state, int version, kib_connparams_t *cp);
+ int state, int version);
void kiblnd_destroy_conn(kib_conn_t *conn);
void kiblnd_close_conn(kib_conn_t *conn, int error);
void kiblnd_close_conn_locked(kib_conn_t *conn, int error);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 22420c0..fb3873a 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -2323,6 +2323,10 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
goto failed;
}

+ /* We have validated the peer's parameters so use those */
+ peer->ibp_max_frags = reqmsg->ibm_u.connparams.ibcp_max_frags;
+ peer->ibp_queue_depth = reqmsg->ibm_u.connparams.ibcp_queue_depth;
+
write_lock_irqsave(g_lock, flags);

peer2 = kiblnd_find_peer_locked(nid);
@@ -2361,6 +2365,14 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
peer2->ibp_accepting++;
kiblnd_peer_addref(peer2);

+ /**
+ * Race with kiblnd_launch_tx (active connect) to create peer
+ * so copy validated parameters since we now know what the
+ * peer's limits are
+ */
+ peer2->ibp_max_frags = peer->ibp_max_frags;
+ peer2->ibp_queue_depth = peer->ibp_queue_depth;
+
write_unlock_irqrestore(g_lock, flags);
kiblnd_peer_decref(peer);
peer = peer2;
@@ -2383,8 +2395,8 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
write_unlock_irqrestore(g_lock, flags);
}

- conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_PASSIVE_WAIT, version,
- &reqmsg->ibm_u.connparams);
+ conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_PASSIVE_WAIT,
+ version);
if (!conn) {
kiblnd_peer_connect_failed(peer, 0, -ENOMEM);
kiblnd_peer_decref(peer);
@@ -2397,8 +2409,8 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)
* CM callback doesn't destroy cmid.
*/
conn->ibc_incarnation = reqmsg->ibm_srcstamp;
- conn->ibc_credits = reqmsg->ibm_u.connparams.ibcp_queue_depth;
- conn->ibc_reserved_credits = reqmsg->ibm_u.connparams.ibcp_queue_depth;
+ conn->ibc_credits = conn->ibc_queue_depth;
+ conn->ibc_reserved_credits = conn->ibc_queue_depth;
LASSERT(conn->ibc_credits + conn->ibc_reserved_credits +
IBLND_OOB_MSGS(version) <= IBLND_RX_MSGS(conn));

@@ -2407,10 +2419,8 @@ kiblnd_passive_connect(struct rdma_cm_id *cmid, void *priv, int priv_nob)

kiblnd_init_msg(ackmsg, IBLND_MSG_CONNACK,
sizeof(ackmsg->ibm_u.connparams));
- ackmsg->ibm_u.connparams.ibcp_queue_depth =
- reqmsg->ibm_u.connparams.ibcp_queue_depth;
- ackmsg->ibm_u.connparams.ibcp_max_frags =
- reqmsg->ibm_u.connparams.ibcp_max_frags;
+ ackmsg->ibm_u.connparams.ibcp_queue_depth = conn->ibc_queue_depth;
+ ackmsg->ibm_u.connparams.ibcp_max_frags = conn->ibc_max_frags;
ackmsg->ibm_u.connparams.ibcp_max_msg_size = IBLND_MSG_SIZE;

kiblnd_pack_msg(ni, ackmsg, version, 0, nid, reqmsg->ibm_srcstamp);
@@ -2495,6 +2505,9 @@ kiblnd_reconnect(kib_conn_t *conn, int version,
break;

case IBLND_REJECT_RDMA_FRAGS:
+ if (!cp)
+ goto failed;
+
if (conn->ibc_max_frags <= cp->ibcp_max_frags) {
CNETERR("Unsupported max frags, peer supports %d\n",
cp->ibcp_max_frags);
@@ -2504,18 +2517,21 @@ kiblnd_reconnect(kib_conn_t *conn, int version,
goto failed;
}

- conn->ibc_max_frags = cp->ibcp_max_frags;
+ peer->ibp_max_frags = cp->ibcp_max_frags;
reason = "rdma fragments";
break;

case IBLND_REJECT_MSG_QUEUE_SIZE:
+ if (!cp)
+ goto failed;
+
if (conn->ibc_queue_depth <= cp->ibcp_queue_depth) {
CNETERR("Unsupported queue depth, peer supports %d\n",
cp->ibcp_queue_depth);
goto failed;
}

- conn->ibc_queue_depth = cp->ibcp_queue_depth;
+ peer->ibp_queue_depth = cp->ibcp_queue_depth;
reason = "queue depth";
break;

@@ -2796,7 +2812,7 @@ kiblnd_active_connect(struct rdma_cm_id *cmid)
read_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);

conn = kiblnd_create_conn(peer, cmid, IBLND_CONN_ACTIVE_CONNECT,
- version, NULL);
+ version);
if (!conn) {
kiblnd_peer_connect_failed(peer, 1, -ENOMEM);
kiblnd_peer_decref(peer); /* lose cmid's ref */
--
1.7.1

2016-03-02 22:04:44

by James Simmons

[permalink] [raw]
Subject: [PATCH 20/27] staging: lustre: change ibh_mrs from array to pointer

From: Amir Shehata <[email protected]>

With the removal of PMR we no longer require the ibh_mrs field
to be an array, so change it to a simple pointer.

Signed-off-by: Amir Shehata <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6850
Reviewed-on: http://review.whamcloud.com/15788
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Frank Zago <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 95 ++------------------
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 5 +-
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c | 10 +--
3 files changed, 13 insertions(+), 97 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index c1ca8dc..1dc18d7 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -1296,55 +1296,15 @@ static void kiblnd_map_tx_pool(kib_tx_pool_t *tpo)
}
}

-struct ib_mr *kiblnd_find_dma_mr(kib_hca_dev_t *hdev, __u64 addr, __u64 size)
-{
- __u64 index;
-
- LASSERT(hdev->ibh_mrs[0]);
-
- if (hdev->ibh_nmrs == 1)
- return hdev->ibh_mrs[0];
-
- index = addr >> hdev->ibh_mr_shift;
-
- if (index < hdev->ibh_nmrs &&
- index == ((addr + size - 1) >> hdev->ibh_mr_shift))
- return hdev->ibh_mrs[index];
-
- return NULL;
-}
-
struct ib_mr *kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev, kib_rdma_desc_t *rd)
{
- struct ib_mr *prev_mr;
- struct ib_mr *mr;
- int i;
-
- LASSERT(hdev->ibh_mrs[0]);
+ LASSERT(hdev->ibh_mrs);

if (*kiblnd_tunables.kib_map_on_demand > 0 &&
*kiblnd_tunables.kib_map_on_demand <= rd->rd_nfrags)
return NULL;

- if (hdev->ibh_nmrs == 1)
- return hdev->ibh_mrs[0];
-
- for (i = 0, mr = prev_mr = NULL;
- i < rd->rd_nfrags; i++) {
- mr = kiblnd_find_dma_mr(hdev,
- rd->rd_frags[i].rf_addr,
- rd->rd_frags[i].rf_nob);
- if (!prev_mr)
- prev_mr = mr;
-
- if (!mr || prev_mr != mr) {
- /* Can't covered by one single MR */
- mr = NULL;
- break;
- }
- }
-
- return mr;
+ return hdev->ibh_mrs;
}

static void kiblnd_destroy_fmr_pool(kib_fmr_pool_t *pool)
@@ -1978,8 +1938,7 @@ static int kiblnd_net_init_pools(kib_net_t *net, __u32 *cpts, int ncpts)
int i;

read_lock_irqsave(&kiblnd_data.kib_global_lock, flags);
- if (!*kiblnd_tunables.kib_map_on_demand &&
- net->ibn_dev->ibd_hdev->ibh_nmrs == 1) {
+ if (!*kiblnd_tunables.kib_map_on_demand) {
read_unlock_irqrestore(&kiblnd_data.kib_global_lock, flags);
goto create_tx_pool;
}
@@ -2019,26 +1978,15 @@ static int kiblnd_net_init_pools(kib_net_t *net, __u32 *cpts, int ncpts)
rc = kiblnd_init_fmr_poolset(net->ibn_fmr_ps[cpt], cpt, net,
kiblnd_fmr_pool_size(ncpts),
kiblnd_fmr_flush_trigger(ncpts));
- if (rc == -ENOSYS && !i) /* no FMR */
- break;
-
- if (rc) { /* a real error */
+ if (rc) {
CERROR("Can't initialize FMR pool for CPT %d: %d\n",
cpt, rc);
goto failed;
}
}

- if (i > 0) {
+ if (i > 0)
LASSERT(i == ncpts);
- goto create_tx_pool;
- }
-
- cfs_percpt_free(net->ibn_fmr_ps);
- net->ibn_fmr_ps = NULL;
-
- CWARN("Device does not support FMR\n");
- goto failed;

create_tx_pool:
net->ibn_tx_ps = cfs_percpt_alloc(lnet_cpt_table(),
@@ -2087,34 +2035,18 @@ static int kiblnd_hdev_get_attr(kib_hca_dev_t *hdev)
return 0;
}

- for (hdev->ibh_mr_shift = 0;
- hdev->ibh_mr_shift < 64; hdev->ibh_mr_shift++) {
- if (hdev->ibh_mr_size == (1ULL << hdev->ibh_mr_shift) ||
- hdev->ibh_mr_size == (1ULL << hdev->ibh_mr_shift) - 1)
- return 0;
- }
-
CERROR("Invalid mr size: %#llx\n", hdev->ibh_mr_size);
return -EINVAL;
}

static void kiblnd_hdev_cleanup_mrs(kib_hca_dev_t *hdev)
{
- int i;
-
- if (!hdev->ibh_nmrs || !hdev->ibh_mrs)
+ if (!hdev->ibh_mrs)
return;

- for (i = 0; i < hdev->ibh_nmrs; i++) {
- if (!hdev->ibh_mrs[i])
- break;
-
- ib_dereg_mr(hdev->ibh_mrs[i]);
- }
+ ib_dereg_mr(hdev->ibh_mrs);

- LIBCFS_FREE(hdev->ibh_mrs, sizeof(*hdev->ibh_mrs) * hdev->ibh_nmrs);
- hdev->ibh_mrs = NULL;
- hdev->ibh_nmrs = 0;
+ hdev->ibh_mrs = NULL;
}

void kiblnd_hdev_destroy(kib_hca_dev_t *hdev)
@@ -2140,15 +2072,6 @@ static int kiblnd_hdev_setup_mrs(kib_hca_dev_t *hdev)
if (rc)
return rc;

- LIBCFS_ALLOC(hdev->ibh_mrs, 1 * sizeof(*hdev->ibh_mrs));
- if (!hdev->ibh_mrs) {
- CERROR("Failed to allocate MRs table\n");
- return -ENOMEM;
- }
-
- hdev->ibh_mrs[0] = NULL;
- hdev->ibh_nmrs = 1;
-
mr = ib_get_dma_mr(hdev->ibh_pd, acflags);
if (IS_ERR(mr)) {
CERROR("Failed ib_get_dma_mr : %ld\n", PTR_ERR(mr));
@@ -2156,7 +2079,7 @@ static int kiblnd_hdev_setup_mrs(kib_hca_dev_t *hdev)
return PTR_ERR(mr);
}

- hdev->ibh_mrs[0] = mr;
+ hdev->ibh_mrs = mr;

return 0;
}
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 4fe38cb..0c88e8b 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -209,8 +209,7 @@ typedef struct kib_hca_dev {
__u64 ibh_page_mask; /* page mask of current HCA */
int ibh_mr_shift; /* bits shift of max MR size */
__u64 ibh_mr_size; /* size of MR */
- int ibh_nmrs; /* # of global MRs */
- struct ib_mr **ibh_mrs; /* global MR */
+ struct ib_mr *ibh_mrs; /* global MR */
struct ib_pd *ibh_pd; /* PD */
kib_dev_t *ibh_dev; /* owner */
atomic_t ibh_ref; /* refcount */
@@ -909,8 +908,6 @@ static inline unsigned int kiblnd_sg_dma_len(struct ib_device *dev,

struct ib_mr *kiblnd_find_rd_dma_mr(kib_hca_dev_t *hdev,
kib_rdma_desc_t *rd);
-struct ib_mr *kiblnd_find_dma_mr(kib_hca_dev_t *hdev,
- __u64 addr, __u64 size);
void kiblnd_map_rx_descs(kib_conn_t *conn);
void kiblnd_unmap_rx_descs(kib_conn_t *conn);
void kiblnd_pool_free_node(kib_pool_t *pool, struct list_head *node);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
index 11c0b49..fc788b0 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -158,7 +158,7 @@ kiblnd_post_rx(kib_rx_t *rx, int credit)
kib_conn_t *conn = rx->rx_conn;
kib_net_t *net = conn->ibc_peer->ibp_ni->ni_data;
struct ib_recv_wr *bad_wrq = NULL;
- struct ib_mr *mr;
+ struct ib_mr *mr = conn->ibc_hdev->ibh_mrs;
int rc;

LASSERT(net);
@@ -166,8 +166,6 @@ kiblnd_post_rx(kib_rx_t *rx, int credit)
LASSERT(credit == IBLND_POSTRX_NO_CREDIT ||
credit == IBLND_POSTRX_PEER_CREDIT ||
credit == IBLND_POSTRX_RSRVD_CREDIT);
-
- mr = kiblnd_find_dma_mr(conn->ibc_hdev, rx->rx_msgaddr, IBLND_MSG_SIZE);
LASSERT(mr);

rx->rx_sge.lkey = mr->lkey;
@@ -1035,17 +1033,15 @@ kiblnd_init_tx_msg(lnet_ni_t *ni, kib_tx_t *tx, int type, int body_nob)
struct ib_sge *sge = &tx->tx_sge[tx->tx_nwrq];
struct ib_rdma_wr *wrq = &tx->tx_wrq[tx->tx_nwrq];
int nob = offsetof(kib_msg_t, ibm_u) + body_nob;
- struct ib_mr *mr;
+ struct ib_mr *mr = hdev->ibh_mrs;

LASSERT(tx->tx_nwrq >= 0);
LASSERT(tx->tx_nwrq < IBLND_MAX_RDMA_FRAGS + 1);
LASSERT(nob <= IBLND_MSG_SIZE);
+ LASSERT(mr);

kiblnd_init_msg(tx->tx_msg, type, body_nob);

- mr = kiblnd_find_dma_mr(hdev, tx->tx_msgaddr, nob);
- LASSERT(mr);
-
sge->lkey = mr->lkey;
sge->addr = tx->tx_msgaddr;
sge->length = nob;
--
1.7.1

2016-03-02 22:05:07

by James Simmons

[permalink] [raw]
Subject: [PATCH 19/27] staging: lustre: corrected some typos and grammar errors

From: Frank Zago <[email protected]>

Clean up various typos and grammar errors.

Signed-off-by: Frank Zago <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5710
Reviewed-on: http://review.whamcloud.com/12201
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Bob Glossman <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 2 +-
drivers/staging/lustre/lustre/llite/dir.c | 6 +++---
drivers/staging/lustre/lustre/obdclass/lu_object.c | 2 +-
drivers/staging/lustre/lustre/ptlrpc/import.c | 2 +-
4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index bd3552e..c1ca8dc 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -758,7 +758,7 @@ kib_conn_t *kiblnd_create_conn(kib_peer_t *peer, struct rdma_cm_id *cmid,

rc = ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
if (rc) {
- CERROR("Can't request completion notificiation: %d\n", rc);
+ CERROR("Can't request completion notification: %d\n", rc);
goto failed_2;
}

diff --git a/drivers/staging/lustre/lustre/llite/dir.c b/drivers/staging/lustre/lustre/llite/dir.c
index bd88a3b..b3368d7 100644
--- a/drivers/staging/lustre/lustre/llite/dir.c
+++ b/drivers/staging/lustre/lustre/llite/dir.c
@@ -878,7 +878,7 @@ int ll_get_mdt_idx(struct inode *inode)
/**
* Generic handler to do any pre-copy work.
*
- * It send a first hsm_progress (with extent length == 0) to coordinator as a
+ * It sends a first hsm_progress (with extent length == 0) to coordinator as a
* first information for it that real work has started.
*
* Moreover, for a ARCHIVE request, it will sample the file data version and
@@ -930,7 +930,7 @@ static int ll_ioc_copy_start(struct super_block *sb, struct hsm_copy *copy)
goto progress;
}

- /* Store it the hsm_copy for later copytool use.
+ /* Store in the hsm_copy for later copytool use.
* Always modified even if no lsm.
*/
copy->hc_data_version = data_version;
@@ -1008,7 +1008,7 @@ static int ll_ioc_copy_end(struct super_block *sb, struct hsm_copy *copy)
goto progress;
}

- /* Store it the hsm_copy for later copytool use.
+ /* Store in the hsm_copy for later copytool use.
* Always modified even if no lsm.
*/
hpk.hpk_data_version = data_version;
diff --git a/drivers/staging/lustre/lustre/obdclass/lu_object.c b/drivers/staging/lustre/lustre/obdclass/lu_object.c
index 0fa4bac..200d6e8 100644
--- a/drivers/staging/lustre/lustre/obdclass/lu_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/lu_object.c
@@ -134,7 +134,7 @@ void lu_object_put(const struct lu_env *env, struct lu_object *o)
}

/*
- * If object is dying (will not be cached), removed it
+ * If object is dying (will not be cached), then removed it
* from hash table and LRU.
*
* This is done with hash table and LRU lists locked. As the only
diff --git a/drivers/staging/lustre/lustre/ptlrpc/import.c b/drivers/staging/lustre/lustre/ptlrpc/import.c
index 70fcac1..5b33dd5 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/import.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/import.c
@@ -1360,7 +1360,7 @@ int ptlrpc_import_recovery_state_machine(struct obd_import *imp)
{
struct task_struct *task;
/* bug 17802: XXX client_disconnect_export vs connect request
- * race. if client will evicted at this time, we start
+ * race. if client is evicted at this time, we start
* invalidate thread without reference to import and import can
* be freed at same time.
*/
--
1.7.1

2016-03-02 22:05:37

by James Simmons

[permalink] [raw]
Subject: [PATCH 17/27] staging: lustre: make o2iblnd local functions static

From: Frank Zago <[email protected]>

This fixes sparse warnings such as:
.../o2iblnd.c:424:1: warning: symbol 'kiblnd_get_peer_info' was not declared.
Should it be static?

This reduces the code size by 400 bytes.

The body of "the_o2iblnd" was moved to the end of the file
to avoid having to declare some static prototypes.
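
This relies on the usual C idiom of a tentative definition near the top of
the file with the initializer placed after the static functions it
references; a sketch (member names taken from the diff below):

static lnd_t the_o2iblnd;		/* tentative definition near the top */

/* ... static kiblnd_startup(), kiblnd_shutdown(), etc. defined here ... */

static lnd_t the_o2iblnd = {		/* real initializer at end of file */
	.lnd_type     = O2IBLND,
	.lnd_startup  = kiblnd_startup,
	.lnd_shutdown = kiblnd_shutdown,
};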

Signed-off-by: Frank Zago <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5396
Reviewed-on: http://review.whamcloud.com/11255
Reviewed-by: Patrick Farrell <[email protected]>
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c | 30 ++++++++++---------
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 6 ----
2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
index 56e5784..bd3552e 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c
@@ -42,15 +42,7 @@
#include <asm/page.h>
#include "o2iblnd.h"

-static lnd_t the_o2iblnd = {
- .lnd_type = O2IBLND,
- .lnd_startup = kiblnd_startup,
- .lnd_shutdown = kiblnd_shutdown,
- .lnd_ctl = kiblnd_ctl,
- .lnd_query = kiblnd_query,
- .lnd_send = kiblnd_send,
- .lnd_recv = kiblnd_recv,
-};
+static lnd_t the_o2iblnd;

kib_data_t kiblnd_data;

@@ -1012,7 +1004,7 @@ static int kiblnd_close_matching_conns(lnet_ni_t *ni, lnet_nid_t nid)
return !count ? -ENOENT : 0;
}

-int kiblnd_ctl(lnet_ni_t *ni, unsigned int cmd, void *arg)
+static int kiblnd_ctl(lnet_ni_t *ni, unsigned int cmd, void *arg)
{
struct libcfs_ioctl_data *data = arg;
int rc = -EINVAL;
@@ -1065,7 +1057,7 @@ int kiblnd_ctl(lnet_ni_t *ni, unsigned int cmd, void *arg)
return rc;
}

-void kiblnd_query(lnet_ni_t *ni, lnet_nid_t nid, unsigned long *when)
+static void kiblnd_query(lnet_ni_t *ni, lnet_nid_t nid, unsigned long *when)
{
unsigned long last_alive = 0;
unsigned long now = cfs_time_current();
@@ -1100,7 +1092,7 @@ void kiblnd_query(lnet_ni_t *ni, lnet_nid_t nid, unsigned long *when)
last_alive ? cfs_duration_sec(now - last_alive) : -1);
}

-void kiblnd_free_pages(kib_pages_t *p)
+static void kiblnd_free_pages(kib_pages_t *p)
{
int npages = p->ibp_npages;
int i;
@@ -2491,7 +2483,7 @@ static void kiblnd_base_shutdown(void)
module_put(THIS_MODULE);
}

-void kiblnd_shutdown(lnet_ni_t *ni)
+static void kiblnd_shutdown(lnet_ni_t *ni)
{
kib_net_t *net = ni->ni_data;
rwlock_t *g_lock = &kiblnd_data.kib_global_lock;
@@ -2745,7 +2737,7 @@ static kib_dev_t *kiblnd_dev_search(char *ifname)
return alias;
}

-int kiblnd_startup(lnet_ni_t *ni)
+static int kiblnd_startup(lnet_ni_t *ni)
{
char *ifname;
kib_dev_t *ibdev = NULL;
@@ -2840,6 +2832,16 @@ net_failed:
return -ENETDOWN;
}

+static lnd_t the_o2iblnd = {
+ .lnd_type = O2IBLND,
+ .lnd_startup = kiblnd_startup,
+ .lnd_shutdown = kiblnd_shutdown,
+ .lnd_ctl = kiblnd_ctl,
+ .lnd_query = kiblnd_query,
+ .lnd_send = kiblnd_send,
+ .lnd_recv = kiblnd_recv,
+};
+
static void __exit ko2iblnd_exit(void)
{
lnet_unregister_lnd(&the_o2iblnd);
diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index 8e79e09..4f09776 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -920,11 +920,6 @@ int kiblnd_fmr_pool_map(kib_fmr_poolset_t *fps, __u64 *pages,
int npages, __u64 iov, kib_fmr_t *fmr);
void kiblnd_fmr_pool_unmap(kib_fmr_t *fmr, int status);

-int kiblnd_startup(lnet_ni_t *ni);
-void kiblnd_shutdown(lnet_ni_t *ni);
-int kiblnd_ctl(lnet_ni_t *ni, unsigned int cmd, void *arg);
-void kiblnd_query(struct lnet_ni *ni, lnet_nid_t nid, unsigned long *when);
-
int kiblnd_tunables_init(void);
void kiblnd_tunables_fini(void);

@@ -934,7 +929,6 @@ int kiblnd_thread_start(int (*fn)(void *arg), void *arg, char *name);
int kiblnd_failover_thread(void *arg);

int kiblnd_alloc_pages(kib_pages_t **pp, int cpt, int npages);
-void kiblnd_free_pages(kib_pages_t *p);

int kiblnd_cm_callback(struct rdma_cm_id *cmid,
struct rdma_cm_event *event);
--
1.7.1

2016-03-02 22:02:43

by James Simmons

[permalink] [raw]
Subject: [PATCH 14/27] staging: lustre: fix conctl.c issues found by Klocwork Insight tool

From: Dmitry Eremin <[email protected]>

The function lst_test_add_ioctl() always copies lstio_tes_param
from userland even if the user doesn't send this data to LNet
selftest. Only consider the lstio_tes_param data if
lstio_tes_param_len is not zero.
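
A minimal sketch of the intended behaviour (simplified from the diff below):
the copy_from_user() of the parameter block only happens inside the branch
that knows a parameter block was supplied.

	if (args->lstio_tes_param_len) {	/* optional parameter block */
		LIBCFS_ALLOC(param, args->lstio_tes_param_len);
		if (!param)
			goto out;
		if (copy_from_user(param, args->lstio_tes_param,
				   args->lstio_tes_param_len)) {
			rc = -EFAULT;
			goto out;
		}
	}
	/* the unconditional copy_from_user() of lstio_tes_param is dropped */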

Signed-off-by: Dmitry Eremin <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4629
Reviewed-on: http://review.whamcloud.com/9386
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Isaac Huang <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/selftest/conctl.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lnet/selftest/conctl.c b/drivers/staging/lustre/lnet/selftest/conctl.c
index 90b7771..714d14b 100644
--- a/drivers/staging/lustre/lnet/selftest/conctl.c
+++ b/drivers/staging/lustre/lnet/selftest/conctl.c
@@ -761,6 +761,11 @@ static int lst_test_add_ioctl(lstio_test_args_t *args)
LIBCFS_ALLOC(param, args->lstio_tes_param_len);
if (!param)
goto out;
+ if (copy_from_user(param, args->lstio_tes_param,
+ args->lstio_tes_param_len)) {
+ rc = -EFAULT;
+ goto out;
+ }
}

rc = -EFAULT;
@@ -769,9 +774,7 @@ static int lst_test_add_ioctl(lstio_test_args_t *args)
copy_from_user(src_name, args->lstio_tes_sgrp_name,
args->lstio_tes_sgrp_nmlen) ||
copy_from_user(dst_name, args->lstio_tes_dgrp_name,
- args->lstio_tes_dgrp_nmlen) ||
- copy_from_user(param, args->lstio_tes_param,
- args->lstio_tes_param_len))
+ args->lstio_tes_dgrp_nmlen))
goto out;

rc = lstcon_test_add(batch_name, args->lstio_tes_type,
--
1.7.1

2016-03-02 22:05:58

by James Simmons

[permalink] [raw]
Subject: [PATCH 16/27] staging: lustre: reverse LNet and infinband header order

LNet is an abstraction built on top of network APIs such
as infiniband or the TCP/IP layer. Since LNet depends
on these other layers, we should ensure the LNet headers
always come after the infiniband headers, since the
infiniband headers can influence the LNet header
definitions.

Signed-off-by: James Simmons <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-5140
Reviewed-on: http://review.whamcloud.com/10571
Reviewed-by: Bob Glossman <[email protected]>
Reviewed-by: Liang Zhen <[email protected]>
Reviewed-by: Shuichi Ihara <[email protected]>
Reviewed-by: Patrick Farrell <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h | 10 +++++-----
1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
index fb7079c..8e79e09 100644
--- a/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
+++ b/drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
@@ -60,17 +60,17 @@
#include <net/sock.h>
#include <linux/in.h>

+#include <rdma/rdma_cm.h>
+#include <rdma/ib_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_fmr_pool.h>
+
#define DEBUG_SUBSYSTEM S_LND

#include "../../../include/linux/libcfs/libcfs.h"
#include "../../../include/linux/lnet/lnet.h"
#include "../../../include/linux/lnet/lib-lnet.h"

-#include <rdma/rdma_cm.h>
-#include <rdma/ib_cm.h>
-#include <rdma/ib_verbs.h>
-#include <rdma/ib_fmr_pool.h>
-
#define IBLND_PEER_HASH_SIZE 101 /* # peer lists */
/* # scheduler loops before reschedule */
#define IBLND_RESCHED 100
--
1.7.1

2016-03-02 22:06:22

by James Simmons

[permalink] [raw]
Subject: [PATCH 15/27] staging: lustre: fix framework.c issues found by Klocwork Insight tool

From: Dmitry Eremin <[email protected]>

The functions sfw_test_buffers() and sfw_unload_test() in LNet
selftest both assume the sfw_test_instance_t being passed in is
never NULL. This is corrected here.

Signed-off-by: Dmitry Eremin <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4629
Reviewed-on: http://review.whamcloud.com/9386
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Isaac Huang <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/selftest/framework.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/lustre/lnet/selftest/framework.c b/drivers/staging/lustre/lnet/selftest/framework.c
index 3bbc720..e4a9724 100644
--- a/drivers/staging/lustre/lnet/selftest/framework.c
+++ b/drivers/staging/lustre/lnet/selftest/framework.c
@@ -541,10 +541,16 @@ sfw_test_rpc_fini(srpc_client_rpc_t *rpc)
static inline int
sfw_test_buffers(sfw_test_instance_t *tsi)
{
- struct sfw_test_case *tsc = sfw_find_test_case(tsi->tsi_service);
- struct srpc_service *svc = tsc->tsc_srv_service;
+ struct sfw_test_case *tsc;
+ struct srpc_service *svc;
int nbuf;

+ LASSERT(tsi);
+ tsc = sfw_find_test_case(tsi->tsi_service);
+ LASSERT(tsc);
+ svc = tsc->tsc_srv_service;
+ LASSERT(svc);
+
nbuf = min(svc->sv_wi_total, tsi->tsi_loop) / svc->sv_ncpts;
return max(SFW_TEST_WI_MIN, nbuf + SFW_TEST_WI_EXTRA);
}
@@ -591,8 +597,10 @@ sfw_load_test(struct sfw_test_instance *tsi)
static void
sfw_unload_test(struct sfw_test_instance *tsi)
{
- struct sfw_test_case *tsc = sfw_find_test_case(tsi->tsi_service);
+ struct sfw_test_case *tsc;

+ LASSERT(tsi);
+ tsc = sfw_find_test_case(tsi->tsi_service);
LASSERT(tsc);

if (tsi->tsi_is_client)
--
1.7.1

2016-03-02 22:06:40

by James Simmons

[permalink] [raw]
Subject: [PATCH 12/27] staging: lustre: fix socklnd issues found by Klocwork Insight tool

From: Dmitry Eremin <[email protected]>

Null pointer 'best_iface' that comes from line 802 may be
dereferenced at line 832.

Signed-off-by: Dmitry Eremin <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-4629
Reviewed-on: http://review.whamcloud.com/9386
Reviewed-by: John L. Hammond <[email protected]>
Reviewed-by: Isaac Huang <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/socklnd/socklnd.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
index 2c83b95..a710541 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
@@ -804,6 +804,8 @@ ksocknal_select_ips(ksock_peer_t *peer, __u32 *peerips, int n_peerips)
ip = peer->ksnp_passive_ips[i];
best_iface = ksocknal_ip2iface(peer->ksnp_ni, ip);

+ /* peer passive ips are kept up to date */
+ LASSERT(best_iface);
} else {
/* choose a new interface */
LASSERT(i == peer->ksnp_n_passive_ips);
@@ -838,6 +840,8 @@ ksocknal_select_ips(ksock_peer_t *peer, __u32 *peerips, int n_peerips)
best_npeers = iface->ksni_npeers;
}

+ LASSERT(best_iface);
+
best_iface->ksni_npeers++;
ip = best_iface->ksni_ipaddr;
peer->ksnp_passive_ips[i] = ip;
--
1.7.1

2016-03-02 22:02:36

by James Simmons

[permalink] [raw]
Subject: [PATCH 07/27] staging: lustre: issue in the offset in lnet match hash table

From: Alyona Romanenko <[email protected]>

The offset into the hash table overflows for non-wildcard portals.
The non-wildcard offset is now corrected in the same way as the
wildcard case was in LU-1622.
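
The point is that the unique (non-wildcard) hash must be masked into the
table bounds just like the wildcard case; a sketch of the corrected lookup
(taken from the diff below):

	unsigned long hash = mbits;

	if (!lnet_ptl_is_wildcard(ptl)) {
		hash += id.nid + id.pid;
		hash = hash_long(hash, LNET_MT_HASH_BITS);
	}
	return &mtable->mt_mhash[hash & LNET_MT_HASH_MASK]; /* always in range */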

Signed-off-by: Alyona Romanenko <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7774
Reviewed-on: http://review.whamcloud.com/18422
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/lib-ptl.c | 9 ++++-----
1 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/lib-ptl.c b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
index 2b41205..3947e8b 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-ptl.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
@@ -360,16 +360,15 @@ lnet_mt_match_head(struct lnet_match_table *mtable,
lnet_process_id_t id, __u64 mbits)
{
struct lnet_portal *ptl = the_lnet.ln_portals[mtable->mt_portal];
+ unsigned long hash = mbits;

- if (lnet_ptl_is_wildcard(ptl)) {
- return &mtable->mt_mhash[mbits & LNET_MT_HASH_MASK];
- } else {
- unsigned long hash = mbits + id.nid + id.pid;
+ if (!lnet_ptl_is_wildcard(ptl)) {
+ hash += id.nid + id.pid;

LASSERT(lnet_ptl_is_unique(ptl));
hash = hash_long(hash, LNET_MT_HASH_BITS);
- return &mtable->mt_mhash[hash];
}
+ return &mtable->mt_mhash[hash & LNET_MT_HASH_MASK];
}

int
--
1.7.1

2016-03-02 22:06:57

by James Simmons

[permalink] [raw]
Subject: [PATCH 11/27] staging: lustre: bind socklnd peers to a specific CPT

Currently the socklnd driver doesn't support
CPT affinity for its peers. Binding peers to
a specific CPT, with memory allocated from the
NUMA node belonging to that CPT, should give a
performance boost.
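
In outline (names taken from the diff below), the peer is now allocated on
the CPT that its NID hashes to, so its memory comes from the matching NUMA
node:

	int cpt = lnet_cpt_of_nid(id.nid);	/* CPT this peer belongs to */

	/* allocate from the NUMA node backing that CPT */
	LIBCFS_CPT_ALLOC(peer, lnet_cpt_table(), cpt, sizeof(*peer));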

Signed-off-by: James Simmons <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7245
Reviewed-on: http://review.whamcloud.com/16710
Reviewed-by: Olaf Weber <[email protected]>
Reviewed-by: Amir Shehata <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/lnet/klnds/socklnd/socklnd.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
index d843082..2c83b95 100644
--- a/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
+++ b/drivers/staging/lustre/lnet/klnds/socklnd/socklnd.c
@@ -102,6 +102,7 @@ ksocknal_destroy_route(ksock_route_t *route)
static int
ksocknal_create_peer(ksock_peer_t **peerp, lnet_ni_t *ni, lnet_process_id_t id)
{
+ int cpt = lnet_cpt_of_nid(id.nid);
ksock_net_t *net = ni->ni_data;
ksock_peer_t *peer;

@@ -109,7 +110,7 @@ ksocknal_create_peer(ksock_peer_t **peerp, lnet_ni_t *ni, lnet_process_id_t id)
LASSERT(id.pid != LNET_PID_ANY);
LASSERT(!in_interrupt());

- LIBCFS_ALLOC(peer, sizeof(*peer));
+ LIBCFS_CPT_ALLOC(peer, lnet_cpt_table(), cpt, sizeof(*peer));
if (!peer)
return -ENOMEM;

--
1.7.1

2016-03-02 22:07:26

by James Simmons

[permalink] [raw]
Subject: [PATCH 09/27] staging: lustre: set task state before scheduling in lnet_sock_accept

From: John L. Hammond <[email protected]>

When libcfs_sock_accept() was reworked into lnet_sock_accept(),
a call to set_current_state(TASK_INTERRUPTIBLE) was dropped; this
patch restores it. For upstream this also adds an optimization of
calling init_waitqueue_entry() only if accept() returns -EAGAIN.
We can also drop setting the task back to TASK_RUNNING, which is
not needed.
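
The canonical ordering, as a rough sketch (matching the diff below): queue
the waiter and mark the task sleeping before calling schedule(), so a wakeup
arriving in between is not lost.

	if (rc == -EAGAIN) {
		/* Nothing ready, so wait for activity */
		init_waitqueue_entry(&wait, current);	/* only when we must wait */
		add_wait_queue(sk_sleep(sock->sk), &wait);
		set_current_state(TASK_INTERRUPTIBLE);	/* before schedule() */
		schedule();
		remove_wait_queue(sk_sleep(sock->sk), &wait);
		rc = sock->ops->accept(sock, newsock, O_NONBLOCK);
	}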

Signed-off-by: John L. Hammond <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6407
Reviewed-on: http://review.whamcloud.com/14265
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Amir Shehata <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/lib-socket.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/lib-socket.c b/drivers/staging/lustre/lnet/lnet/lib-socket.c
index 5d77049..269a6d8 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-socket.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-socket.c
@@ -531,8 +531,6 @@ lnet_sock_accept(struct socket **newsockp, struct socket *sock)
struct socket *newsock;
int rc;

- init_waitqueue_entry(&wait, current);
-
/*
* XXX this should add a ref to sock->ops->owner, if
* TCP could be a module
@@ -548,11 +546,11 @@ lnet_sock_accept(struct socket **newsockp, struct socket *sock)
rc = sock->ops->accept(sock, newsock, O_NONBLOCK);
if (rc == -EAGAIN) {
/* Nothing ready, so wait for activity */
- set_current_state(TASK_INTERRUPTIBLE);
+ init_waitqueue_entry(&wait, current);
add_wait_queue(sk_sleep(sock->sk), &wait);
+ set_current_state(TASK_INTERRUPTIBLE);
schedule();
remove_wait_queue(sk_sleep(sock->sk), &wait);
- set_current_state(TASK_RUNNING);
rc = sock->ops->accept(sock, newsock, O_NONBLOCK);
}

--
1.7.1

2016-03-02 22:07:52

by James Simmons

[permalink] [raw]
Subject: [PATCH 08/27] staging: lustre: fix 'copy into fixed size buffer' errors

From: Sebastien Buisson <[email protected]>

Fix 'copy into fixed size buffer' defects found by Coverity
version 6.0.3:
Copy into fixed size buffer (STRING_OVERFLOW)
The fixed-size string might be overrun by copying without
checking the length.
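
The pattern applied at each call site (a sketch; the exact cleanup on
the error path differs per caller, as the diffs below show) is to
reject an over-long name rather than silently truncate it:

	if (strlen(name) > sizeof(ifr.ifr_name) - 1)
		return -E2BIG;	/* name cannot fit in the fixed buffer */
	strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name));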

Signed-off-by: Sebastien Buisson <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-2074
Reviewed-on: http://review.whamcloud.com/4154
Reviewed-by: Dmitry Eremin <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/lib-socket.c | 15 +++++++++++--
drivers/staging/lustre/lnet/selftest/console.c | 23 +++++++++++++++++---
drivers/staging/lustre/lustre/libcfs/workitem.c | 6 ++++-
drivers/staging/lustre/lustre/ptlrpc/nrs.c | 8 ++++++-
drivers/staging/lustre/lustre/ptlrpc/sec_config.c | 7 +++++-
5 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/lib-socket.c b/drivers/staging/lustre/lnet/lnet/lib-socket.c
index 88905d5..5d77049 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-socket.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-socket.c
@@ -99,7 +99,10 @@ lnet_ipif_query(char *name, int *up, __u32 *ip, __u32 *mask)

CLASSERT(sizeof(ifr.ifr_name) >= IFNAMSIZ);

- strcpy(ifr.ifr_name, name);
+ if (strlen(name) > sizeof(ifr.ifr_name) - 1)
+ return -E2BIG;
+ strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name));
+
rc = lnet_sock_ioctl(SIOCGIFFLAGS, (unsigned long)&ifr);
if (rc) {
CERROR("Can't get flags for interface %s\n", name);
@@ -114,7 +117,10 @@ lnet_ipif_query(char *name, int *up, __u32 *ip, __u32 *mask)
}
*up = 1;

- strcpy(ifr.ifr_name, name);
+ if (strlen(name) > sizeof(ifr.ifr_name) - 1)
+ return -E2BIG;
+ strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name));
+
ifr.ifr_addr.sa_family = AF_INET;
rc = lnet_sock_ioctl(SIOCGIFADDR, (unsigned long)&ifr);
if (rc) {
@@ -125,7 +131,10 @@ lnet_ipif_query(char *name, int *up, __u32 *ip, __u32 *mask)
val = ((struct sockaddr_in *)&ifr.ifr_addr)->sin_addr.s_addr;
*ip = ntohl(val);

- strcpy(ifr.ifr_name, name);
+ if (strlen(name) > sizeof(ifr.ifr_name) - 1)
+ return -E2BIG;
+ strncpy(ifr.ifr_name, name, sizeof(ifr.ifr_name));
+
ifr.ifr_addr.sa_family = AF_INET;
rc = lnet_sock_ioctl(SIOCGIFNETMASK, (unsigned long)&ifr);
if (rc) {
diff --git a/drivers/staging/lustre/lnet/selftest/console.c b/drivers/staging/lustre/lnet/selftest/console.c
index e8ca1bf..0e3da44 100644
--- a/drivers/staging/lustre/lnet/selftest/console.c
+++ b/drivers/staging/lustre/lnet/selftest/console.c
@@ -206,8 +206,14 @@ lstcon_group_alloc(char *name, lstcon_group_t **grpp)
return -ENOMEM;

grp->grp_ref = 1;
- if (name)
- strcpy(grp->grp_name, name);
+ if (name) {
+ if (strlen(name) > sizeof(grp->grp_name)-1) {
+ LIBCFS_FREE(grp, offsetof(lstcon_group_t,
+ grp_ndl_hash[LST_NODE_HASHSIZE]));
+ return -E2BIG;
+ }
+ strncpy(grp->grp_name, name, sizeof(grp->grp_name));
+ }

INIT_LIST_HEAD(&grp->grp_link);
INIT_LIST_HEAD(&grp->grp_ndl_list);
@@ -873,7 +879,13 @@ lstcon_batch_add(char *name)
return -ENOMEM;
}

- strcpy(bat->bat_name, name);
+ if (strlen(name) > sizeof(bat->bat_name) - 1) {
+ LIBCFS_FREE(bat->bat_srv_hash, LST_NODE_HASHSIZE);
+ LIBCFS_FREE(bat->bat_cli_hash, LST_NODE_HASHSIZE);
+ LIBCFS_FREE(bat, sizeof(lstcon_batch_t));
+ return -E2BIG;
+ }
+ strncpy(bat->bat_name, name, sizeof(bat->bat_name));
bat->bat_hdr.tsb_index = 0;
bat->bat_hdr.tsb_id.bat_id = ++console_session.ses_id_cookie;

@@ -1733,7 +1745,10 @@ lstcon_session_new(char *name, int key, unsigned feats,
console_session.ses_feats_updated = 0;
console_session.ses_timeout = (timeout <= 0) ?
LST_CONSOLE_TIMEOUT : timeout;
- strlcpy(console_session.ses_name, name,
+
+ if (strlen(name) > sizeof(console_session.ses_name)-1)
+ return -E2BIG;
+ strncpy(console_session.ses_name, name,
sizeof(console_session.ses_name));

rc = lstcon_batch_add(LST_DEFAULT_BATCH);
diff --git a/drivers/staging/lustre/lustre/libcfs/workitem.c b/drivers/staging/lustre/lustre/libcfs/workitem.c
index 136bc13..f2ebed8 100644
--- a/drivers/staging/lustre/lustre/libcfs/workitem.c
+++ b/drivers/staging/lustre/lustre/libcfs/workitem.c
@@ -351,7 +351,11 @@ cfs_wi_sched_create(char *name, struct cfs_cpt_table *cptab,
if (!sched)
return -ENOMEM;

- strlcpy(sched->ws_name, name, CFS_WS_NAME_LEN);
+ if (strlen(name) > sizeof(sched->ws_name) - 1) {
+ LIBCFS_FREE(sched, sizeof(*sched));
+ return -E2BIG;
+ }
+ strncpy(sched->ws_name, name, sizeof(sched->ws_name));

sched->ws_cptab = cptab;
sched->ws_cpt = cpt;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/nrs.c b/drivers/staging/lustre/lustre/ptlrpc/nrs.c
index 58e5d86..cc7909c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/nrs.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/nrs.c
@@ -1095,6 +1095,7 @@ static int ptlrpc_nrs_policy_register(struct ptlrpc_nrs_pol_conf *conf)
{
struct ptlrpc_service *svc;
struct ptlrpc_nrs_pol_desc *desc;
+ size_t len;
int rc = 0;

LASSERT(conf->nc_ops);
@@ -1138,7 +1139,12 @@ static int ptlrpc_nrs_policy_register(struct ptlrpc_nrs_pol_conf *conf)
goto fail;
}

- strncpy(desc->pd_name, conf->nc_name, NRS_POL_NAME_MAX);
+ len = strlcpy(desc->pd_name, conf->nc_name, sizeof(desc->pd_name));
+ if (len >= sizeof(desc->pd_name)) {
+ kfree(desc);
+ rc = -E2BIG;
+ goto fail;
+ }
desc->pd_ops = conf->nc_ops;
desc->pd_compat = conf->nc_compat;
desc->pd_compat_svc_name = conf->nc_compat_svc_name;
diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_config.c b/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
index 93b91bf..31d3be7 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
@@ -517,6 +517,7 @@ struct sptlrpc_conf *sptlrpc_conf_get(const char *fsname,
int create)
{
struct sptlrpc_conf *conf;
+ size_t len;

list_for_each_entry(conf, &sptlrpc_confs, sc_list) {
if (strcmp(conf->sc_fsname, fsname) == 0)
@@ -530,7 +531,11 @@ struct sptlrpc_conf *sptlrpc_conf_get(const char *fsname,
if (!conf)
return NULL;

- strcpy(conf->sc_fsname, fsname);
+ len = strlcpy(conf->sc_fsname, fsname, sizeof(conf->sc_fsname));
+ if (len >= sizeof(conf->sc_fsname)) {
+ kfree(conf);
+ return NULL;
+ }
sptlrpc_rule_set_init(&conf->sc_rset);
INIT_LIST_HEAD(&conf->sc_tgts);
list_add(&conf->sc_list, &sptlrpc_confs);
--
1.7.1

2016-03-02 22:08:38

by James Simmons

[permalink] [raw]
Subject: [PATCH 06/27] staging: lustre: Use after free in lnet_ptl_match_delay()

From: Olaf Weber <[email protected]>

In lnet_ptl_match_delay() we check msg->msg_rx_delayed to see whether
the message has been added to the delay queue. But this check is done
after lnet_ptl_unlock() and lnet_res_unlock(), and the message can be
processed and freed before the check.

Replace the check with a comparison of rc against LNET_MATCHMD_NONE,
which is how the callers of lnet_ptl_match_delay() know whether the message
was added to the delay queue. To make this work we reset rc in the
loop when there was no match and the message hasn't been delayed. In
addition reorganize the code and add comments to clarify the logic.

In lnet_ptl_match_md() a similar check of msg->msg_rx_delayed is
replaced for the same reason.
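
In caller terms the fix boils down to the following pattern (a sketch;
the complete reorganization is in the diff below):

	lnet_ptl_unlock(ptl);
	lnet_res_unlock(cpt);

	/* Once the locks are dropped the message may have been matched
	 * and freed by another thread, so only the local return code
	 * can be inspected here.  LNET_MATCHMD_NONE means the message
	 * was queued on the delayed list. */
	if (rc & (LNET_MATCHMD_FINISH | LNET_MATCHMD_NONE))
		break;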

Signed-off-by: Olaf Weber <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7324
Reviewed-on: http://review.whamcloud.com/17840
Reviewed-by: Faccini Bruno <[email protected]>
Reviewed-by: Liang Zhen <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/lib-ptl.c | 84 +++++++++++++++++----------
1 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/lib-ptl.c b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
index 0281c6a..2b41205 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-ptl.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-ptl.c
@@ -472,10 +472,12 @@ lnet_ptl_match_delay(struct lnet_portal *ptl,
int rc = 0;
int i;

- /*
- * steal buffer from other CPTs, and delay it if nothing to steal,
- * this function is more expensive than a regular match, but we
- * don't expect it can happen a lot
+ /**
+ * Steal buffer from other CPTs, and delay msg if nothing to
+ * steal. This function is more expensive than a regular
+ * match, but we don't expect it can happen a lot. The return
+ * code contains one of LNET_MATCHMD_OK, LNET_MATCHMD_DROP, or
+ * LNET_MATCHMD_NONE.
*/
LASSERT(lnet_ptl_is_wildcard(ptl));

@@ -491,52 +493,71 @@ lnet_ptl_match_delay(struct lnet_portal *ptl,
lnet_res_lock(cpt);
lnet_ptl_lock(ptl);

- if (!i) { /* the first try, attach on stealing list */
+ if (!i) {
+ /* The first try, add to stealing list. */
list_add_tail(&msg->msg_list,
&ptl->ptl_msg_stealing);
}

- if (!list_empty(&msg->msg_list)) { /* on stealing list */
+ if (!list_empty(&msg->msg_list)) {
+ /* On stealing list. */
rc = lnet_mt_match_md(mtable, info, msg);

if ((rc & LNET_MATCHMD_EXHAUSTED) &&
mtable->mt_enabled)
lnet_ptl_disable_mt(ptl, cpt);

- if (rc & LNET_MATCHMD_FINISH)
+ if (rc & LNET_MATCHMD_FINISH) {
+ /* Match found, remove from stealing list. */
+ list_del_init(&msg->msg_list);
+ } else if (i == LNET_CPT_NUMBER - 1 || /* (1) */
+ !ptl->ptl_mt_nmaps || /* (2) */
+ (ptl->ptl_mt_nmaps == 1 && /* (3) */
+ ptl->ptl_mt_maps[0] == cpt)) {
+ /**
+ * No match found, and this is either
+ * (1) the last cpt to check, or
+ * (2) there is no active cpt, or
+ * (3) this is the only active cpt.
+ * There is nothing to steal: delay or
+ * drop the message.
+ */
list_del_init(&msg->msg_list);

+ if (lnet_ptl_is_lazy(ptl)) {
+ msg->msg_rx_delayed = 1;
+ list_add_tail(&msg->msg_list,
+ &ptl->ptl_msg_delayed);
+ rc = LNET_MATCHMD_NONE;
+ } else {
+ rc = LNET_MATCHMD_DROP;
+ }
+ } else {
+ /* Do another iteration. */
+ rc = 0;
+ }
} else {
- /*
- * could be matched by lnet_ptl_attach_md()
- * which is called by another thread
+ /**
+ * No longer on stealing list: another thread
+ * matched the message in lnet_ptl_attach_md().
+ * We are now expected to handle the message.
*/
rc = !msg->msg_md ?
LNET_MATCHMD_DROP : LNET_MATCHMD_OK;
}

- if (!list_empty(&msg->msg_list) && /* not matched yet */
- (i == LNET_CPT_NUMBER - 1 || /* the last CPT */
- !ptl->ptl_mt_nmaps || /* no active CPT */
- (ptl->ptl_mt_nmaps == 1 && /* the only active CPT */
- ptl->ptl_mt_maps[0] == cpt))) {
- /* nothing to steal, delay or drop */
- list_del_init(&msg->msg_list);
-
- if (lnet_ptl_is_lazy(ptl)) {
- msg->msg_rx_delayed = 1;
- list_add_tail(&msg->msg_list,
- &ptl->ptl_msg_delayed);
- rc = LNET_MATCHMD_NONE;
- } else {
- rc = LNET_MATCHMD_DROP;
- }
- }
-
lnet_ptl_unlock(ptl);
lnet_res_unlock(cpt);

- if ((rc & LNET_MATCHMD_FINISH) || msg->msg_rx_delayed)
+ /**
+ * Note that test (1) above ensures that we always
+ * exit the loop through this break statement.
+ *
+ * LNET_MATCHMD_NONE means msg was added to the
+ * delayed queue, and we may no longer reference it
+ * after lnet_ptl_unlock() and lnet_res_unlock().
+ */
+ if (rc & (LNET_MATCHMD_FINISH | LNET_MATCHMD_NONE))
break;
}

@@ -598,13 +619,14 @@ lnet_ptl_match_md(struct lnet_match_info *info, struct lnet_msg *msg)

lnet_ptl_unlock(ptl);
lnet_res_unlock(mtable->mt_cpt);
-
+ rc = LNET_MATCHMD_NONE;
} else {
lnet_res_unlock(mtable->mt_cpt);
rc = lnet_ptl_match_delay(ptl, info, msg);
}

- if (msg->msg_rx_delayed) {
+ /* LNET_MATCHMD_NONE means msg was added to the delay queue */
+ if (rc & LNET_MATCHMD_NONE) {
CDEBUG(D_NET,
"Delaying %s from %s ptl %d MB %#llx off %d len %d\n",
info->mi_opc == LNET_MD_OP_PUT ? "PUT" : "GET",
--
1.7.1

2016-03-02 22:08:57

by James Simmons

[permalink] [raw]
Subject: [PATCH 05/27] staging: lustre: remove annoying message in parse_nidrange

From: Li Xi <[email protected]>

When setting TBF rules based on jobid, parse_nidrange() prints warning
messages. This is unnecessary and annoying since parsing
a TBF rule always tries to parse the jobid as if it were a NID.

Signed-off-by: Li Xi <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7647
Reviewed-on: http://review.whamcloud.com/17916
Reviewed-by: Emoly Liu <[email protected]>
Reviewed-by: Bobi Jam <[email protected]>
Reviewed-by: Lai Siyao <[email protected]>
Reviewed-by: James Simmons <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/nidstrings.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/nidstrings.c b/drivers/staging/lustre/lnet/lnet/nidstrings.c
index 9d1c7fd..ebf468f 100644
--- a/drivers/staging/lustre/lnet/lnet/nidstrings.c
+++ b/drivers/staging/lustre/lnet/lnet/nidstrings.c
@@ -270,7 +270,6 @@ parse_nidrange(struct cfs_lstr *src, struct list_head *nidlist)

return 1;
failed:
- CWARN("can't parse nidrange: \"%.*s\"\n", tmp.ls_len, tmp.ls_str);
return 0;
}

--
1.7.1

2016-03-02 22:09:25

by James Simmons

[permalink] [raw]
Subject: [PATCH 03/27] staging: lustre: Ignore hops if not explicitly set

From: Amir Shehata <[email protected]>

Since the number of hops is not a mandatory parameter, the LU-6060
patch causes problems for existing systems because it changes the
behavior by which a route is determined to be down.

To fix this, the number of hops now defaults to LNET_UNDEFINED_HOPS
if no hop count is specified.

LNET_UNDEFINED_HOPS is defined as ((__u32)-1). When it is printed
with %d, it displays as -1.

__u32 is used throughout the call stack for the hop count, both to
explicitly define its size and to avoid any sizing issues when
passing data to and from the kernel.

To keep existing behavior both lnet_compare_routes() and LNetDist()
will treat undefined hop count as hop count 1.

When executing the logic in lnet_parse_rc_info() there is no
longer an assumption that the default hop count is 1. If
the hop count is 1 then it must have been explicitly set by
the user.
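
A sketch of the convention (the macro value is quoted from the
description above; the normalization mirrors lnet_compare_routes()
and LNetDist() in the diff below):

	#define LNET_UNDEFINED_HOPS	((__u32)-1)	/* hop count never set */

	/* Wherever routes are compared, an undefined hop count is
	 * treated as a hop count of 1 to preserve the old behaviour. */
	__u32 hops = (route->lr_hops == LNET_UNDEFINED_HOPS) ?
		     1 : route->lr_hops;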

Signed-off-by: Amir Shehata <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-6851
Reviewed-on: http://review.whamcloud.com/15719
Reviewed-by: Olaf Weber <[email protected]>
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
.../staging/lustre/include/linux/lnet/lib-lnet.h | 2 +-
.../staging/lustre/include/linux/lnet/lib-types.h | 2 +-
drivers/staging/lustre/lnet/lnet/config.c | 8 ++++++--
drivers/staging/lustre/lnet/lnet/lib-move.c | 17 +++++++++++++----
drivers/staging/lustre/lnet/lnet/router.c | 6 +++---
drivers/staging/lustre/lnet/lnet/router_proc.c | 2 +-
6 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
index d78360d..84642dc 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-lnet.h
@@ -456,7 +456,7 @@ void lnet_lib_exit(void);
int lnet_notify(lnet_ni_t *ni, lnet_nid_t peer, int alive, unsigned long when);
void lnet_notify_locked(lnet_peer_t *lp, int notifylnd, int alive,
unsigned long when);
-int lnet_add_route(__u32 net, unsigned int hops, lnet_nid_t gateway_nid,
+int lnet_add_route(__u32 net, __u32 hops, lnet_nid_t gateway_nid,
unsigned int priority);
int lnet_check_routes(void);
int lnet_del_route(__u32 net, lnet_nid_t gw_nid);
diff --git a/drivers/staging/lustre/include/linux/lnet/lib-types.h b/drivers/staging/lustre/include/linux/lnet/lib-types.h
index 07b8db1..d2513db 100644
--- a/drivers/staging/lustre/include/linux/lnet/lib-types.h
+++ b/drivers/staging/lustre/include/linux/lnet/lib-types.h
@@ -371,7 +371,7 @@ typedef struct {
__u32 lr_net; /* remote network number */
int lr_seq; /* sequence for round-robin */
unsigned int lr_downis; /* number of down NIs */
- unsigned int lr_hops; /* how far I am */
+ __u32 lr_hops; /* how far I am */
unsigned int lr_priority; /* route priority */
} lnet_route_t;

diff --git a/drivers/staging/lustre/lnet/lnet/config.c b/drivers/staging/lustre/lnet/lnet/config.c
index 8c80625..4c40acb 100644
--- a/drivers/staging/lustre/lnet/lnet/config.c
+++ b/drivers/staging/lustre/lnet/lnet/config.c
@@ -664,7 +664,7 @@ lnet_parse_route(char *str, int *im_a_router)
char *token = str;
int ntokens = 0;
int myrc = -1;
- unsigned int hops;
+ __u32 hops;
int got_hops = 0;
unsigned int priority = 0;

@@ -747,8 +747,12 @@ lnet_parse_route(char *str, int *im_a_router)
}
}

+ /**
+ * if there are no hops set then we want to flag this value as
+ * unset since hops is an optional parameter
+ */
if (!got_hops)
- hops = 1;
+ hops = LNET_UNDEFINED_HOPS;

LASSERT(!list_empty(&nets));
LASSERT(!list_empty(&gateways));
diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index fa5b7cd..8a21198 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1160,6 +1160,8 @@ lnet_compare_routes(lnet_route_t *r1, lnet_route_t *r2)
{
lnet_peer_t *p1 = r1->lr_gateway;
lnet_peer_t *p2 = r2->lr_gateway;
+ int r1_hops = (r1->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r1->lr_hops;
+ int r2_hops = (r2->lr_hops == LNET_UNDEFINED_HOPS) ? 1 : r2->lr_hops;

if (r1->lr_priority < r2->lr_priority)
return 1;
@@ -1167,10 +1169,10 @@ lnet_compare_routes(lnet_route_t *r1, lnet_route_t *r2)
if (r1->lr_priority > r2->lr_priority)
return -1;

- if (r1->lr_hops < r2->lr_hops)
+ if (r1_hops < r2_hops)
return 1;

- if (r1->lr_hops > r2->lr_hops)
+ if (r1_hops > r2_hops)
return -1;

if (p1->lp_txqnob < p2->lp_txqnob)
@@ -2512,18 +2514,25 @@ LNetDist(lnet_nid_t dstnid, lnet_nid_t *srcnidp, __u32 *orderp)
if (rnet->lrn_net == dstnet) {
lnet_route_t *route;
lnet_route_t *shortest = NULL;
+ __u32 shortest_hops = LNET_UNDEFINED_HOPS;
+ __u32 route_hops;

LASSERT(!list_empty(&rnet->lrn_routes));

list_for_each_entry(route, &rnet->lrn_routes,
lr_list) {
+ route_hops = route->lr_hops;
+ if (route_hops == LNET_UNDEFINED_HOPS)
+ route_hops = 1;
if (!shortest ||
- route->lr_hops < shortest->lr_hops)
+ route_hops < shortest_hops) {
shortest = route;
+ shortest_hops = route_hops;
+ }
}

LASSERT(shortest);
- hops = shortest->lr_hops;
+ hops = shortest_hops;
if (srcnidp)
*srcnidp = shortest->lr_gateway->lp_ni->ni_nid;
if (orderp)
diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 2eae8f6..51a831e 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -294,7 +294,7 @@ lnet_add_route_to_rnet(lnet_remotenet_t *rnet, lnet_route_t *route)
}

int
-lnet_add_route(__u32 net, unsigned int hops, lnet_nid_t gateway,
+lnet_add_route(__u32 net, __u32 hops, lnet_nid_t gateway,
unsigned int priority)
{
struct list_head *e;
@@ -305,7 +305,7 @@ lnet_add_route(__u32 net, unsigned int hops, lnet_nid_t gateway,
int add_route;
int rc;

- CDEBUG(D_NET, "Add route: net %s hops %u priority %u gw %s\n",
+ CDEBUG(D_NET, "Add route: net %s hops %d priority %u gw %s\n",
libcfs_net2str(net), hops, priority, libcfs_nid2str(gateway));

if (gateway == LNET_NID_ANY ||
@@ -313,7 +313,7 @@ lnet_add_route(__u32 net, unsigned int hops, lnet_nid_t gateway,
net == LNET_NIDNET(LNET_NID_ANY) ||
LNET_NETTYP(net) == LOLND ||
LNET_NIDNET(gateway) == net ||
- hops < 1 || hops > 255)
+ (hops != LNET_UNDEFINED_HOPS && (hops < 1 || hops > 255)))
return -EINVAL;

if (lnet_islocalnet(net)) /* it's a local network */
diff --git a/drivers/staging/lustre/lnet/lnet/router_proc.c b/drivers/staging/lustre/lnet/lnet/router_proc.c
index fc643df..ce4331e 100644
--- a/drivers/staging/lustre/lnet/lnet/router_proc.c
+++ b/drivers/staging/lustre/lnet/lnet/router_proc.c
@@ -235,7 +235,7 @@ static int proc_lnet_routes(struct ctl_table *table, int write,

if (route) {
__u32 net = rnet->lrn_net;
- unsigned int hops = route->lr_hops;
+ __u32 hops = route->lr_hops;
unsigned int priority = route->lr_priority;
lnet_nid_t nid = route->lr_gateway->lp_nid;
int alive = lnet_is_route_alive(route);
--
1.7.1

2016-03-02 22:09:46

by James Simmons

[permalink] [raw]
Subject: [PATCH 02/27] staging: lustre: recv could access freed message

From: Liang Zhen <[email protected]>

When lnet_parse_put() calls lnet_ptl_match_md(), that function can
attach the current message to the delayed list if there is no match.
The message can then be taken over and freed by another thread that
is posting a new MD, so it is not safe for the caller of
lnet_parse_put() to examine the message again.

This patch fixes the issue by adding a local variable "ready_delay"
to remember the corresponding status of the lnet_msg, so LNet does
not need to look at the message again when lnet_ptl_match_md()
returns LNET_MATCHMD_NONE for it.
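
The shape of the fix (a sketch; the complete change is in the diff
below) is to latch the status in a local variable before the match
call, and never reread the message afterwards:

	bool ready_delay = msg->msg_rx_ready_delay;

again:
	rc = lnet_ptl_match_md(&info, msg);
	if (rc == LNET_MATCHMD_NONE) {
		/* The message may now be on the delayed list and owned
		 * by another thread, so only the local flag is consulted. */
		if (ready_delay)
			return 0;
		if (!lnet_ni_eager_recv(ni, msg)) {
			ready_delay = true;
			goto again;
		}
	}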

Signed-off-by: Liang Zhen <[email protected]>
Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-7324
Reviewed-on: http://review.whamcloud.com/17065
Reviewed-by: Doug Oucharek <[email protected]>
Reviewed-by: Faccini Bruno <[email protected]>
Reviewed-by: Oleg Drokin <[email protected]>
---
drivers/staging/lustre/lnet/lnet/lib-move.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/lib-move.c b/drivers/staging/lustre/lnet/lnet/lib-move.c
index 7bc3e91..fa5b7cd 100644
--- a/drivers/staging/lustre/lnet/lnet/lib-move.c
+++ b/drivers/staging/lustre/lnet/lnet/lib-move.c
@@ -1466,6 +1466,7 @@ lnet_parse_put(lnet_ni_t *ni, lnet_msg_t *msg)
{
lnet_hdr_t *hdr = &msg->msg_hdr;
struct lnet_match_info info;
+ bool ready_delay;
int rc;

/* Convert put fields to host byte order */
@@ -1482,6 +1483,7 @@ lnet_parse_put(lnet_ni_t *ni, lnet_msg_t *msg)
info.mi_mbits = hdr->msg.put.match_bits;

msg->msg_rx_ready_delay = !ni->ni_lnd->lnd_eager_recv;
+ ready_delay = msg->msg_rx_ready_delay;

again:
rc = lnet_ptl_match_md(&info, msg);
@@ -1494,12 +1496,18 @@ lnet_parse_put(lnet_ni_t *ni, lnet_msg_t *msg)
return 0;

case LNET_MATCHMD_NONE:
- if (msg->msg_rx_delayed) /* attached on delayed list */
+ /**
+ * no eager_recv or has already called it, should
+ * have been attached on delayed list
+ */
+ if (ready_delay)
return 0;

rc = lnet_ni_eager_recv(ni, msg);
- if (!rc)
+ if (!rc) {
+ ready_delay = true;
goto again;
+ }
/* fall through */

case LNET_MATCHMD_DROP:
--
1.7.1

2016-03-02 23:22:53

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] staging: lustre: Support different ko2iblnd configs between systems

On Wed, Mar 02, 2016 at 05:02:04PM -0500, James Simmons wrote:
> From: Jeremy Filizetti <[email protected]>
>
> This patch adds support for ko2iblnd to have different values for
> peer_credits and map_on_demand between systems.

Your subject has no number for this patch, is it really patch 21 in the
series?

thanks,

greg k-h

2016-03-02 23:24:42

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Third batch of LNet fixes

On Wed, Mar 02, 2016 at 05:01:43PM -0500, James Simmons wrote:
> This patch set merges all the fixes for the klnd drivers, socklnd and
> o2iblnd, to what is currently used in production environments. Several
> more fixes for the LNet core are also included with this patch set.

I've applied the first 20, I never got patch 21 in the series, so please
rebase and resend the remaining ones, properly numbered :)

thanks,

greg k-h

2016-03-02 23:35:37

by Simmons, James A.

[permalink] [raw]
Subject: RE: [lustre-devel] [PATCH] staging: lustre: Support different ko2iblnd configs between systems

>On Wed, Mar 02, 2016 at 05:02:04PM -0500, James Simmons wrote:
>> From: Jeremy Filizetti <[email protected]>
>>
>> This patch adds support for ko2iblnd to have different values for
>> peer_credits and map_on_demand between systems.
>
>Your subject has no number for this patch, is it really patch 21 in the
>series?

Yes it is really patch 21. The patch needed to be updated at the last
minute before I sent it out. Sorry about that mistake.

2016-03-02 23:52:44

by Simmons, James A.

[permalink] [raw]
Subject: RE: [lustre-devel] [PATCH 00/27] Third batch of LNet fixes

>On Wed, Mar 02, 2016 at 05:01:43PM -0500, James Simmons wrote:
>> This patch set merges all the fixes for the klnd drivers, socklnd and
>> o2iblnd, to what is currently used in production environments. Several
>> more fixes for the LNet core are also included with this patch set.
>
>I've applied the first 20, I never got patch 21 in the series, so please
>rebase and resend the remaining ones, properly numbered :)
>
>thanks,

Okay. I will send the remaining patches as a new patch series.