From: Long Li <[email protected]>
Starting with SMB2 dialect 3.0, Microsoft introduced the SMB Direct transport
protocol for transferring upper layer (SMB2) payload over RDMA via InfiniBand,
RoCE or iWARP. The protocol is published in [MS-SMBD]
(https://msdn.microsoft.com/en-us/library/hh536346.aspx).
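For reference, every SMB Direct connection starts with a negotiate exchange;
a rough sketch of the negotiate request layout, with field names following
[MS-SMBD] (the authoritative definitions are added by the "Add SMB Direct
protocol initial values and constants" patch in this series), looks like:

	/* SMB Direct negotiate request, little-endian on the wire ([MS-SMBD]) */
	struct smbd_negotiate_req {
		__le16 min_version;		/* lowest version supported, SMBD_V1 = 0x0100 */
		__le16 max_version;		/* highest version supported */
		__le16 reserved;
		__le16 credits_requested;	/* send credits requested from the peer */
		__le32 preferred_send_size;
		__le32 max_receive_size;
		__le32 max_fragmented_size;	/* largest reassembled upper layer payload */
	} __packed;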
Patch v2 added RDMA read/write via memory registration, and addressed
feedback on v1.
Patch v3 improved performance by introducing an additional queue for handling
empty packets and reducing lock contention on the IRQ path. It also added
lightweight profiling by reading the TSC, and addressed feedback on v2.
Patch v4 fixed connectivity issues with iWARP devices and addressed comments.
Patch v5 fixed compile errors on ia64, i386 and when INFINIBAND is not
configured, and addressed comments. Profiling is removed and will be
introduced in a separate patch.
Patch v6 addressed comments.
Long Li (22):
CIFS: SMBD: Add parameter rdata to smb2_new_read_req
CIFS: SMBD: Introduce kernel config option CONFIG_CIFS_SMB_DIRECT
CIFS: SMBD: Add rdma mount option
CIFS: SMBD: Add SMB Direct protocol initial values and constants
CIFS: SMBD: Establish SMB Direct connection
CIFS: SMBD: export protocol initial values
CIFS: SMBD: Implement function to create a SMB Direct connection
CIFS: SMBD: Upper layer connects to SMBDirect session
CIFS: SMBD: Implement function to reconnect to a SMB Direct transport
CIFS: SMBD: Upper layer reconnects to SMB Direct session
CIFS: SMBD: Implement function to destroy a SMB Direct connection
CIFS: SMBD: Upper layer destroys SMB Direct session on shutdown or
umount
CIFS: SMBD: Set SMB Direct maximum read or write size for I/O
CIFS: SMBD: Implement function to receive data via RDMA receive
CIFS: SMBD: Upper layer receives data via RDMA receive
CIFS: SMBD: Implement function to send data via RDMA send
CIFS: SMBD: Upper layer sends data via RDMA send
CIFS: SMBD: Implement RDMA memory registration
CIFS: SMBD: Upper layer performs SMB write via RDMA read through
memory registration
CIFS: SMBD: Read correct returned data length for RDMA write (SMB
read) I/O
CIFS: SMBD: Upper layer performs SMB read via RDMA write through
memory registration
CIFS: SMBD: Add SMB Direct debug counters
fs/cifs/Kconfig | 8 +
fs/cifs/Makefile | 2 +
fs/cifs/cifs_debug.c | 147 +++
fs/cifs/cifsfs.c | 2 +
fs/cifs/cifsglob.h | 23 +-
fs/cifs/cifssmb.c | 16 +-
fs/cifs/connect.c | 64 +-
fs/cifs/file.c | 19 +-
fs/cifs/smb1ops.c | 4 +-
fs/cifs/smb2ops.c | 26 +-
fs/cifs/smb2pdu.c | 129 ++-
fs/cifs/smbdirect.c | 2618 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/smbdirect.h | 315 ++++++
fs/cifs/transport.c | 14 +-
14 files changed, 3359 insertions(+), 28 deletions(-)
create mode 100644 fs/cifs/smbdirect.c
create mode 100644 fs/cifs/smbdirect.h
--
2.7.4
From: Long Li <[email protected]>
Add a function to reconnect to SMB Direct. This involves tearing down the
current connection and establishing/negotiating a new connection.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 47d999f..f3ae3dc 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1393,6 +1393,42 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
}
+/*
+ * Reconnect this SMBD connection, called from upper layer
+ * return value: 0 on success, or actual error code
+ */
+int smbd_reconnect(struct TCP_Server_Info *server)
+{
+ log_rdma_event(INFO, "reconnecting rdma session\n");
+
+ if (!server->smbd_conn) {
+ log_rdma_event(ERR, "rdma session already destroyed\n");
+ return -EINVAL;
+ }
+
+ /*
+ * This is possible if transport is disconnected and we haven't received
+ * notification from RDMA, but upper layer has detected timeout
+ */
+ if (server->smbd_conn->transport_status == SMBD_CONNECTED) {
+ log_rdma_event(INFO, "disconnecting transport\n");
+ smbd_disconnect_rdma_connection(server->smbd_conn);
+ }
+
+ /* wait until the transport is destroyed */
+ wait_event(server->smbd_conn->wait_destroy,
+ server->smbd_conn->transport_status == SMBD_DESTROYED);
+
+ destroy_workqueue(server->smbd_conn->workqueue);
+ kfree(server->smbd_conn);
+
+ log_rdma_event(INFO, "creating rdma session\n");
+ server->smbd_conn = smbd_get_connection(
+ server, (struct sockaddr *) &server->dstaddr);
+
+ return server->smbd_conn ? 0 : -ENOENT;
+}
+
static void destroy_caches_and_workqueue(struct smbd_connection *info)
{
destroy_receive_buffers(info);
--
2.7.4
From: Long Li <[email protected]>
SMB Direct is a protocol for transferring SMB packets over RDMA. It was
introduced with Windows Server 2012 and SMB 3.0.
With CONFIG_CIFS_SMB_DIRECT=y, SMB Direct code is built as part of CIFS.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/Kconfig | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/fs/cifs/Kconfig b/fs/cifs/Kconfig
index f724361..8d05fff 100644
--- a/fs/cifs/Kconfig
+++ b/fs/cifs/Kconfig
@@ -191,6 +191,14 @@ config CIFS_SMB311
This dialect includes improved security negotiation features.
If unsure, say N
+config CIFS_SMB_DIRECT
+ bool "SMB Direct support (Experimental)"
+ depends on CIFS && INFINIBAND
+ help
+ Enables SMB Direct experimental support for SMB 3.0, 3.02 and 3.1.1.
+ SMB Direct allows transferring SMB packets over RDMA. If unsure,
+ say N.
+
config CIFS_FSCACHE
bool "Provide CIFS client caching support"
depends on CIFS=m && FSCACHE || CIFS=y && FSCACHE=y
--
2.7.4
From: Long Li <[email protected]>
Add "rdma" to CIFS mount options to connect to SMB Direct.
Add checks to validate this is used on SMB 3.X dialects.
To connect to SMBDirect, use "mount.cifs -o rdma,vers=3.x".
At the time of this patch, 3.x can be 3.0, 3.02 or 3.1.1.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifs_debug.c | 2 ++
fs/cifs/cifsfs.c | 2 ++
fs/cifs/cifsglob.h | 7 +++++++
fs/cifs/connect.c | 15 ++++++++++++++-
4 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9727e1d..ba0870d 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
}
+ if (server->rdma)
+ seq_printf(m, "RDMA\n\t");
seq_printf(m, "TCP status: %d\n\tLocal Users To "
"Server: %d SecMode: 0x%x Req On Wire: %d",
server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 180b335..e15fbf1 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -327,6 +327,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
default:
seq_puts(s, "(unknown)");
}
+ if (server->rdma)
+ seq_puts(s, ",rdma");
}
static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 808486c..09f9a71 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -530,6 +530,7 @@ struct smb_vol {
bool nopersistent:1;
bool resilient:1; /* noresilient not required since not fored for CA */
bool domainauto:1;
+ bool rdma:1;
unsigned int rsize;
unsigned int wsize;
bool sockopt_tcp_nodelay:1;
@@ -646,6 +647,12 @@ struct TCP_Server_Info {
bool sec_kerberos; /* supports plain Kerberos */
bool sec_mskerberos; /* supports legacy MS Kerberos */
bool large_buf; /* is current buffer large? */
+ /* use SMBD connection instead of socket */
+ bool rdma;
+#ifdef CONFIG_CIFS_SMB_DIRECT
+ /* point to the SMBD connection if RDMA is used instead of socket */
+ struct smbd_connection *smbd_conn;
+#endif
struct delayed_work echo; /* echo ping workqueue job */
char *smallbuf; /* pointer to current "small" buffer */
char *bigbuf; /* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 59647eb..b5a575f 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -92,7 +92,7 @@ enum {
Opt_multiuser, Opt_sloppy, Opt_nosharesock,
Opt_persistent, Opt_nopersistent,
Opt_resilient, Opt_noresilient,
- Opt_domainauto,
+ Opt_domainauto, Opt_rdma,
/* Mount options which take numeric value */
Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -183,6 +183,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_resilient, "resilienthandles"},
{ Opt_noresilient, "noresilienthandles"},
{ Opt_domainauto, "domainauto"},
+ { Opt_rdma, "rdma"},
{ Opt_backupuid, "backupuid=%s" },
{ Opt_backupgid, "backupgid=%s" },
@@ -1538,6 +1539,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
case Opt_domainauto:
vol->domainauto = true;
break;
+ case Opt_rdma:
+ vol->rdma = true;
+ break;
/* Numeric Values */
case Opt_backupuid:
@@ -1928,6 +1932,11 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
goto cifs_parse_mount_err;
}
+ if (vol->rdma && vol->vals->protocol_id < SMB30_PROT_ID) {
+ cifs_dbg(VFS, "SMB Direct requires Version >=3.0\n");
+ goto cifs_parse_mount_err;
+ }
+
#ifndef CONFIG_KEYS
/* Muliuser mounts require CONFIG_KEYS support */
if (vol->multiuser) {
@@ -2131,6 +2140,9 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol)
if (server->echo_interval != vol->echo_interval * HZ)
return 0;
+ if (server->rdma != vol->rdma)
+ return 0;
+
return 1;
}
@@ -2229,6 +2241,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->noblocksnd = volume_info->noblocksnd;
tcp_ses->noautotune = volume_info->noautotune;
tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+ tcp_ses->rdma = volume_info->rdma;
tcp_ses->in_flight = 0;
tcp_ses->credits = 1;
init_waitqueue_head(&tcp_ses->response_q);
--
2.7.4
From: Long Li <[email protected]>
When "rdma" is specified in the mount option, make CIFS connect to
SMB Direct.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 27 ++++++++++++++++++++++++---
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index b5a575f..2c0b34a 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -44,7 +44,6 @@
#include <net/ipv6.h>
#include <linux/parser.h>
#include <linux/bvec.h>
-
#include "cifspdu.h"
#include "cifsglob.h"
#include "cifsproto.h"
@@ -56,6 +55,9 @@
#include "rfc1002pdu.h"
#include "fscache.h"
#include "smb2proto.h"
+#ifdef CONFIG_CIFS_SMB_DIRECT
+#include "smbdirect.h"
+#endif
#define CIFS_PORT 445
#define RFC1001_PORT 139
@@ -2279,13 +2281,32 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->echo_interval = volume_info->echo_interval * HZ;
else
tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ;
-
+ if (tcp_ses->rdma) {
+#ifdef CONFIG_CIFS_SMB_DIRECT
+ tcp_ses->smbd_conn = smbd_get_connection(
+ tcp_ses, (struct sockaddr *)&volume_info->dstaddr);
+ if (tcp_ses->smbd_conn) {
+ cifs_dbg(VFS, "RDMA transport established\n");
+ rc = 0;
+ goto smbd_connected;
+ } else {
+ rc = -ENOENT;
+ goto out_err_crypto_release;
+ }
+#else
+ cifs_dbg(VFS, "CONFIG_CIFS_SMB_DIRECT is not enabled\n");
+ rc = -ENOENT;
+ goto out_err_crypto_release;
+#endif
+ }
rc = ip_connect(tcp_ses);
if (rc < 0) {
cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
goto out_err_crypto_release;
}
-
+#ifdef CONFIG_CIFS_SMB_DIRECT
+smbd_connected:
+#endif
/*
* since we're in a cifs function already, we know that
* this will succeed. No need for try_module_get().
--
2.7.4
From: Long Li <[email protected]>
The transport doesn't maintain send buffers or a send queue for transferring
payload via RDMA send. There is no data copy in the transport on the send path.
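Each smb_rqst handed to smbd_send() is carved into SMB Direct data transfer
messages no larger than the negotiated max_send_size; the remaining_data_length
field of each message tells the peer how many bytes of the same payload are
still to follow, so it can reassemble the full SMB2 PDU. A rough sketch of
that per-message header, with field names following [MS-SMBD] (the in-tree
definition lives in smbdirect.h, added earlier in the series), is:

	/* SMB Direct data transfer message, little-endian on the wire ([MS-SMBD]) */
	struct smbd_data_transfer {
		__le16 credits_requested;
		__le16 credits_granted;
		__le16 flags;
		__le16 reserved;
		__le32 remaining_data_length;	/* bytes of this payload still to come */
		__le32 data_offset;		/* offset of buffer[] from start of header */
		__le32 data_length;		/* payload bytes carried in this message */
		__le32 padding;			/* aligns buffer[] to an 8-byte boundary */
		__u8   buffer[];		/* SMB2 payload fragment */
	} __packed;

This is also why max_iov_size below is computed as max_send_size minus
sizeof(struct smbd_data_transfer).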
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 246 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/smbdirect.h | 3 +
2 files changed, 249 insertions(+)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 1e7f5df..6089ae7 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -42,6 +42,12 @@ static int smbd_post_recv(
struct smbd_response *response);
static int smbd_post_send_empty(struct smbd_connection *info);
+static int smbd_post_send_data(
+ struct smbd_connection *info,
+ struct kvec *iov, int n_vec, int remaining_data_length);
+static int smbd_post_send_page(struct smbd_connection *info,
+ struct page *page, unsigned long offset,
+ size_t size, int remaining_data_length);
/* SMBD version number */
#define SMBD_V1 0x0100
@@ -178,6 +184,10 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
log_rdma_event(INFO, "cancelling send immediate work\n");
cancel_delayed_work_sync(&info->send_immediate_work);
+ log_rdma_event(INFO, "wait for all send to finish\n");
+ wait_event(info->wait_smbd_send_pending,
+ info->smbd_send_pending == 0);
+
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
wait_event(info->wait_smbd_recv_pending,
@@ -1081,6 +1091,24 @@ static int smbd_post_send_sgl(struct smbd_connection *info,
}
/*
+ * Send a page
+ * page: the page to send
+ * offset: offset in the page to send
+ * size: length in the page to send
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int smbd_post_send_page(struct smbd_connection *info, struct page *page,
+ unsigned long offset, size_t size, int remaining_data_length)
+{
+ struct scatterlist sgl;
+
+ sg_init_table(&sgl, 1);
+ sg_set_page(&sgl, page, size, offset);
+
+ return smbd_post_send_sgl(info, &sgl, size, remaining_data_length);
+}
+
+/*
* Send an empty message
* Empty message is used to extend credits to peer to for keep live
* while there is no upper layer payload to send at the time
@@ -1092,6 +1120,35 @@ static int smbd_post_send_empty(struct smbd_connection *info)
}
/*
+ * Send a data buffer
+ * iov: the iov array describing the data buffers
+ * n_vec: number of iov array
+ * remaining_data_length: remaining data to send following this packet
+ * in segmented SMBD packet
+ */
+static int smbd_post_send_data(
+ struct smbd_connection *info, struct kvec *iov, int n_vec,
+ int remaining_data_length)
+{
+ int i;
+ u32 data_length = 0;
+ struct scatterlist sgl[SMBDIRECT_MAX_SGE];
+
+ if (n_vec > SMBDIRECT_MAX_SGE) {
+ cifs_dbg(VFS, "Can't fit data to SGL, n_vec=%d\n", n_vec);
+ return -ENOMEM;
+ }
+
+ sg_init_table(sgl, n_vec);
+ for (i = 0; i < n_vec; i++) {
+ data_length += iov[i].iov_len;
+ sg_set_buf(&sgl[i], iov[i].iov_base, iov[i].iov_len);
+ }
+
+ return smbd_post_send_sgl(info, sgl, data_length, remaining_data_length);
+}
+
+/*
* Post a receive request to the transport
* The remote peer can only send data when a receive request is posted
* The interaction is controlled by send/receive credit system
@@ -1658,6 +1715,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
+ init_waitqueue_head(&info->wait_smbd_send_pending);
+ info->smbd_send_pending = 0;
+
init_waitqueue_head(&info->wait_smbd_recv_pending);
info->smbd_recv_pending = 0;
@@ -1949,3 +2009,189 @@ int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
msg->msg_iter.count = 0;
return rc;
}
+
+/*
+ * Send data to transport
+ * Each rqst is transported as a SMBDirect payload
+ * rqst: the data to write
+ * return value: 0 if successfully write, otherwise error code
+ */
+int smbd_send(struct smbd_connection *info, struct smb_rqst *rqst)
+{
+ struct kvec vec;
+ int nvecs;
+ int size;
+ int buflen = 0, remaining_data_length;
+ int start, i, j;
+ int max_iov_size =
+ info->max_send_size - sizeof(struct smbd_data_transfer);
+ struct kvec iov[SMBDIRECT_MAX_SGE];
+ int rc;
+
+ info->smbd_send_pending++;
+ if (info->transport_status != SMBD_CONNECTED) {
+ rc = -ENODEV;
+ goto done;
+ }
+
+ /*
+ * This usually means a configuration error
+ * We use RDMA read/write for packet size > rdma_readwrite_threshold
+ * as long as it's properly configured we should never get into this
+ * situation
+ */
+ if (rqst->rq_nvec + rqst->rq_npages > SMBDIRECT_MAX_SGE) {
+ log_write(ERR, "maximum send segment %x exceeding %x\n",
+ rqst->rq_nvec + rqst->rq_npages, SMBDIRECT_MAX_SGE);
+ rc = -EINVAL;
+ goto done;
+ }
+
+ /*
+ * Remove the RFC1002 length defined in MS-SMB2 section 2.1
+ * It is used only for TCP transport
+ * In future we may want to add a transport layer under protocol
+ * layer so this will only be issued to TCP transport
+ */
+ iov[0].iov_base = (char *)rqst->rq_iov[0].iov_base + 4;
+ iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
+ buflen += iov[0].iov_len;
+
+ /* total up iov array first */
+ for (i = 1; i < rqst->rq_nvec; i++) {
+ iov[i].iov_base = rqst->rq_iov[i].iov_base;
+ iov[i].iov_len = rqst->rq_iov[i].iov_len;
+ buflen += iov[i].iov_len;
+ }
+
+ /* add in the page array if there is one */
+ if (rqst->rq_npages) {
+ buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
+ buflen += rqst->rq_tailsz;
+ }
+
+ if (buflen + sizeof(struct smbd_data_transfer) >
+ info->max_fragmented_send_size) {
+ log_write(ERR, "payload size %d > max size %d\n",
+ buflen, info->max_fragmented_send_size);
+ rc = -EINVAL;
+ goto done;
+ }
+
+ remaining_data_length = buflen;
+
+ log_write(INFO, "rqst->rq_nvec=%d rqst->rq_npages=%d rq_pagesz=%d "
+ "rq_tailsz=%d buflen=%d\n",
+ rqst->rq_nvec, rqst->rq_npages, rqst->rq_pagesz,
+ rqst->rq_tailsz, buflen);
+
+ start = i = iov[0].iov_len ? 0 : 1;
+ buflen = 0;
+ while (true) {
+ buflen += iov[i].iov_len;
+ if (buflen > max_iov_size) {
+ if (i > start) {
+ remaining_data_length -=
+ (buflen-iov[i].iov_len);
+ log_write(INFO, "sending iov[] from start=%d "
+ "i=%d nvecs=%d "
+ "remaining_data_length=%d\n",
+ start, i, i-start,
+ remaining_data_length);
+ rc = smbd_post_send_data(
+ info, &iov[start], i-start,
+ remaining_data_length);
+ if (rc)
+ goto done;
+ } else {
+ /* iov[start] is too big, break it */
+ nvecs = (buflen+max_iov_size-1)/max_iov_size;
+ log_write(INFO, "iov[%d] iov_base=%p buflen=%d"
+ " break to %d vectors\n",
+ start, iov[start].iov_base,
+ buflen, nvecs);
+ for (j = 0; j < nvecs; j++) {
+ vec.iov_base =
+ (char *)iov[start].iov_base +
+ j*max_iov_size;
+ vec.iov_len = max_iov_size;
+ if (j == nvecs-1)
+ vec.iov_len =
+ buflen -
+ max_iov_size*(nvecs-1);
+ remaining_data_length -= vec.iov_len;
+ log_write(INFO,
+ "sending vec j=%d iov_base=%p"
+ " iov_len=%zu "
+ "remaining_data_length=%d\n",
+ j, vec.iov_base, vec.iov_len,
+ remaining_data_length);
+ rc = smbd_post_send_data(
+ info, &vec, 1,
+ remaining_data_length);
+ if (rc)
+ goto done;
+ }
+ i++;
+ }
+ start = i;
+ buflen = 0;
+ } else {
+ i++;
+ if (i == rqst->rq_nvec) {
+ /* send out all remaining vecs */
+ remaining_data_length -= buflen;
+ log_write(INFO,
+ "sending iov[] from start=%d i=%d "
+ "nvecs=%d remaining_data_length=%d\n",
+ start, i, i-start,
+ remaining_data_length);
+ rc = smbd_post_send_data(info, &iov[start],
+ i-start, remaining_data_length);
+ if (rc)
+ goto done;
+ break;
+ }
+ }
+ log_write(INFO, "looping i=%d buflen=%d\n", i, buflen);
+ }
+
+ /* now sending pages if there are any */
+ for (i = 0; i < rqst->rq_npages; i++) {
+ buflen = (i == rqst->rq_npages-1) ?
+ rqst->rq_tailsz : rqst->rq_pagesz;
+ nvecs = (buflen + max_iov_size - 1) / max_iov_size;
+ log_write(INFO, "sending pages buflen=%d nvecs=%d\n",
+ buflen, nvecs);
+ for (j = 0; j < nvecs; j++) {
+ size = max_iov_size;
+ if (j == nvecs-1)
+ size = buflen - j*max_iov_size;
+ remaining_data_length -= size;
+ log_write(INFO, "sending pages i=%d offset=%d size=%d"
+ " remaining_data_length=%d\n",
+ i, j*max_iov_size, size, remaining_data_length);
+ rc = smbd_post_send_page(
+ info, rqst->rq_pages[i], j*max_iov_size,
+ size, remaining_data_length);
+ if (rc)
+ goto done;
+ }
+ }
+
+done:
+ /*
+ * As an optimization, we don't wait for individual I/O to finish
+ * before sending the next one.
+ * Send them all and wait for the pending send count to reach 0,
+ * which means all the I/Os have been sent out and we are good to return
+ */
+
+ wait_event(info->wait_send_payload_pending,
+ atomic_read(&info->send_payload_pending) == 0);
+
+ info->smbd_send_pending--;
+ wake_up(&info->wait_smbd_send_pending);
+
+ return rc;
+}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index c1e2d3d..1d23b81 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -89,6 +89,9 @@ struct smbd_connection {
/* Activity accoutning */
/* Pending reqeusts issued from upper layer */
+ int smbd_send_pending;
+ wait_queue_head_t wait_smbd_send_pending;
+
int smbd_recv_pending;
wait_queue_head_t wait_smbd_recv_pending;
--
2.7.4
From: Long Li <[email protected]>
For debugging and troubleshooting, export SMBDirect debug counters to
/proc/fs/cifs/DebugData.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifs_debug.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 7025d8d..cd65759 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -155,6 +155,72 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
list_for_each(tmp1, &cifs_tcp_ses_list) {
server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
+
+#ifdef CONFIG_CIFS_SMB_DIRECT
+ if (!server->rdma)
+ goto skip_rdma;
+
+ seq_printf(m, "\nSMBDirect (in hex) protocol version: %x "
+ "transport status: %x",
+ server->smbd_conn->protocol,
+ server->smbd_conn->transport_status);
+ seq_printf(m, "\nConn receive_credit_max: %x "
+ "send_credit_target: %x max_send_size: %x",
+ server->smbd_conn->receive_credit_max,
+ server->smbd_conn->send_credit_target,
+ server->smbd_conn->max_send_size);
+ seq_printf(m, "\nConn max_fragmented_recv_size: %x "
+ "max_fragmented_send_size: %x max_receive_size:%x",
+ server->smbd_conn->max_fragmented_recv_size,
+ server->smbd_conn->max_fragmented_send_size,
+ server->smbd_conn->max_receive_size);
+ seq_printf(m, "\nConn keep_alive_interval: %x "
+ "max_readwrite_size: %x rdma_readwrite_threshold: %x",
+ server->smbd_conn->keep_alive_interval,
+ server->smbd_conn->max_readwrite_size,
+ server->smbd_conn->rdma_readwrite_threshold);
+ seq_printf(m, "\nDebug count_get_receive_buffer: %x "
+ "count_put_receive_buffer: %x count_send_empty: %x",
+ server->smbd_conn->count_get_receive_buffer,
+ server->smbd_conn->count_put_receive_buffer,
+ server->smbd_conn->count_send_empty);
+ seq_printf(m, "\nRead Queue count_reassembly_queue: %x "
+ "count_enqueue_reassembly_queue: %x "
+ "count_dequeue_reassembly_queue: %x "
+ "fragment_reassembly_remaining: %x "
+ "reassembly_data_length: %x "
+ "reassembly_queue_length: %x",
+ server->smbd_conn->count_reassembly_queue,
+ server->smbd_conn->count_enqueue_reassembly_queue,
+ server->smbd_conn->count_dequeue_reassembly_queue,
+ server->smbd_conn->fragment_reassembly_remaining,
+ server->smbd_conn->reassembly_data_length,
+ server->smbd_conn->reassembly_queue_length);
+ seq_printf(m, "\nCurrent Credits send_credits: %x "
+ "receive_credits: %x receive_credit_target: %x",
+ atomic_read(&server->smbd_conn->send_credits),
+ atomic_read(&server->smbd_conn->receive_credits),
+ server->smbd_conn->receive_credit_target);
+ seq_printf(m, "\nPending send_pending: %x send_payload_pending:"
+ " %x smbd_send_pending: %x smbd_recv_pending: %x",
+ atomic_read(&server->smbd_conn->send_pending),
+ atomic_read(&server->smbd_conn->send_payload_pending),
+ server->smbd_conn->smbd_send_pending,
+ server->smbd_conn->smbd_recv_pending);
+ seq_printf(m, "\nReceive buffers count_receive_queue: %x "
+ "count_empty_packet_queue: %x",
+ server->smbd_conn->count_receive_queue,
+ server->smbd_conn->count_empty_packet_queue);
+ seq_printf(m, "\nMR responder_resources: %x "
+ "max_frmr_depth: %x mr_type: %x",
+ server->smbd_conn->responder_resources,
+ server->smbd_conn->max_frmr_depth,
+ server->smbd_conn->mr_type);
+ seq_printf(m, "\nMR mr_ready_count: %x mr_used_count: %x",
+ atomic_read(&server->smbd_conn->mr_ready_count),
+ atomic_read(&server->smbd_conn->mr_used_count));
+skip_rdma:
+#endif
seq_printf(m, "\nNumber of credits: %d", server->credits);
i++;
list_for_each(tmp2, &server->smb_ses_list) {
--
2.7.4
From: Long Li <[email protected]>
Add a function to tear down an SMB Direct connection. This is used by the
upper layer to free all SMB Direct connection and transport resources.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index f3ae3dc..5952276 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -1393,6 +1393,22 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
}
+/* Destroy this SMBD connection, called from upper layer */
+void smbd_destroy(struct smbd_connection *info)
+{
+ log_rdma_event(INFO, "destroying rdma session\n");
+
+ /* Kick off the disconnection process */
+ smbd_disconnect_rdma_connection(info);
+
+ log_rdma_event(INFO, "wait for transport being destroyed\n");
+ wait_event(info->wait_destroy,
+ info->transport_status == SMBD_DESTROYED);
+
+ destroy_workqueue(info->workqueue);
+ kfree(info);
+}
+
/*
* Reconnect this SMBD connection, called from upper layer
* return value: 0 on success, or actual error code
--
2.7.4
From: Long Li <[email protected]>
On the receive path, the transport maintains receive buffers and a
reassembly queue for transferring payload via RDMA receive. There is a data
copy in the transport on receive when it copies the payload to the upper layer.
The transport recognizes the RFC1002 header length used in the SMB upper
layer payloads in CIFS. Because this length is mainly used for TCP and is
not applicable to RDMA, it is handled as out-of-band information and is
never sent over the wire; the transport behaves like TCP to the upper layer
by processing and exposing the length correctly on data payloads.
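To make the out-of-band length handling concrete, here is a sketch with
made-up numbers (the actual sizes depend on the negotiated parameters):

	/*
	 * Illustrative values only. A 12000-byte SMB2 response, received on a
	 * connection that allows 8168 payload bytes per message, arrives as
	 * two SMB Direct data transfer messages:
	 *
	 *   message 1: data_length = 8168, remaining_data_length = 3832
	 *   message 2: data_length = 3832, remaining_data_length = 0
	 *
	 * Both are placed on the reassembly queue. When the upper layer issues
	 * its initial 4-byte read for the RFC1002 header, smbd_recv_buf()
	 * returns cpu_to_be32(8168 + 3832) = 12000, computed from the first
	 * message; subsequent reads copy payload bytes out of the queue.
	 */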
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/smbdirect.c | 228 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/smbdirect.h | 3 +
2 files changed, 231 insertions(+)
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 5952276..1e7f5df 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -14,6 +14,7 @@
* the GNU General Public License for more details.
*/
#include <linux/module.h>
+#include <linux/highmem.h>
#include "smbdirect.h"
#include "cifs_debug.h"
@@ -179,6 +180,8 @@ static void smbd_destroy_rdma_work(struct work_struct *work)
log_rdma_event(INFO, "wait for all recv to finish\n");
wake_up_interruptible(&info->wait_reassembly_queue);
+ wait_event(info->wait_smbd_recv_pending,
+ info->smbd_recv_pending == 0);
log_rdma_event(INFO, "wait for all send posted to IB to finish\n");
wait_event(info->wait_send_pending,
@@ -1655,6 +1658,9 @@ struct smbd_connection *_smbd_get_connection(
queue_delayed_work(info->workqueue, &info->idle_timer_work,
info->keep_alive_interval*HZ);
+ init_waitqueue_head(&info->wait_smbd_recv_pending);
+ info->smbd_recv_pending = 0;
+
init_waitqueue_head(&info->wait_send_pending);
atomic_set(&info->send_pending, 0);
@@ -1721,3 +1727,225 @@ struct smbd_connection *smbd_get_connection(
}
return ret;
}
+
+/*
+ * Receive data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ * Note: this implementation copies the data from reassembly queue to receive
+ * buffers used by upper layer. This is not the optimal code path. A better way
+ * to do it is to not have upper layer allocate its receive buffers but rather
+ * borrow the buffer from reassembly queue, and return it after data is
+ * consumed. But this will require more changes to upper layer code, and also
+ * need to consider packet boundaries while they still being reassembled.
+ */
+int smbd_recv_buf(struct smbd_connection *info, char *buf, unsigned int size)
+{
+ struct smbd_response *response;
+ struct smbd_data_transfer *data_transfer;
+ int to_copy, to_read, data_read, offset;
+ u32 data_length, remaining_data_length, data_offset;
+ int rc;
+ unsigned long flags;
+
+again:
+ if (info->transport_status != SMBD_CONNECTED) {
+ log_read(ERR, "disconnected\n");
+ return -ENODEV;
+ }
+
+ /*
+ * No need to hold the reassembly queue lock all the time as we are
+ * the only one reading from the front of the queue. The transport
+ * may add more entries to the back of the queue at the same time
+ */
+ log_read(INFO, "size=%d info->reassembly_data_length=%d\n", size,
+ info->reassembly_data_length);
+ if (info->reassembly_data_length >= size) {
+ int queue_length;
+ int queue_removed = 0;
+
+ /*
+ * Need to make sure reassembly_data_length is read before
+ * reading reassembly_queue_length and calling
+ * _get_first_reassembly. This call is lock free
+ * as we never read at the end of the queue which are being
+ * updated in SOFTIRQ as more data is received
+ */
+ virt_rmb();
+ queue_length = info->reassembly_queue_length;
+ data_read = 0;
+ to_read = size;
+ offset = info->first_entry_offset;
+ while (data_read < size) {
+ response = _get_first_reassembly(info);
+ data_transfer = smbd_response_payload(response);
+ data_length = le32_to_cpu(data_transfer->data_length);
+ remaining_data_length =
+ le32_to_cpu(
+ data_transfer->remaining_data_length);
+ data_offset = le32_to_cpu(data_transfer->data_offset);
+
+ /*
+ * The upper layer expects RFC1002 length at the
+ * beginning of the payload. Return it to indicate
+ * the total length of the packet. This minimizes the
+ * change to upper layer packet processing logic. This
+ * will eventually be removed when an intermediate
+ * transport layer is added
+ */
+ if (response->first_segment && size == 4) {
+ unsigned int rfc1002_len =
+ data_length + remaining_data_length;
+ *((__be32 *)buf) = cpu_to_be32(rfc1002_len);
+ data_read = 4;
+ response->first_segment = false;
+ log_read(INFO, "returning rfc1002 length %d\n",
+ rfc1002_len);
+ goto read_rfc1002_done;
+ }
+
+ to_copy = min_t(int, data_length - offset, to_read);
+ memcpy(
+ buf + data_read,
+ (char *)data_transfer + data_offset + offset,
+ to_copy);
+
+ /* move on to the next buffer? */
+ if (to_copy == data_length - offset) {
+ queue_length--;
+ /*
+ * No need to lock if we are not at the
+ * end of the queue
+ */
+ if (!queue_length)
+ spin_lock_irqsave(
+ &info->reassembly_queue_lock,
+ flags);
+ list_del(&response->list);
+ queue_removed++;
+ if (!queue_length)
+ spin_unlock_irqrestore(
+ &info->reassembly_queue_lock,
+ flags);
+
+ info->count_reassembly_queue--;
+ info->count_dequeue_reassembly_queue++;
+ put_receive_buffer(info, response, true);
+ offset = 0;
+ log_read(INFO, "put_receive_buffer offset=0\n");
+ } else
+ offset += to_copy;
+
+ to_read -= to_copy;
+ data_read += to_copy;
+
+ log_read(INFO, "_get_first_reassembly memcpy %d bytes "
+ "data_transfer_length-offset=%d after that "
+ "to_read=%d data_read=%d offset=%d\n",
+ to_copy, data_length - offset,
+ to_read, data_read, offset);
+ }
+
+ spin_lock_irqsave(&info->reassembly_queue_lock, flags);
+ info->reassembly_data_length -= data_read;
+ info->reassembly_queue_length -= queue_removed;
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+
+ info->first_entry_offset = offset;
+ log_read(INFO, "returning to thread data_read=%d "
+ "reassembly_data_length=%d first_entry_offset=%d\n",
+ data_read, info->reassembly_data_length,
+ info->first_entry_offset);
+read_rfc1002_done:
+ return data_read;
+ }
+
+ log_read(INFO, "wait_event on more data\n");
+ rc = wait_event_interruptible(
+ info->wait_reassembly_queue,
+ info->reassembly_data_length >= size ||
+ info->transport_status != SMBD_CONNECTED);
+ /* Don't return any data if interrupted */
+ if (rc)
+ return -ENODEV;
+
+ goto again;
+}
+
+/*
+ * Receive a page from receive reassembly queue
+ * page: the page to read data into
+ * to_read: the length of data to read
+ * return value: actual data read
+ */
+int smbd_recv_page(struct smbd_connection *info,
+ struct page *page, unsigned int to_read)
+{
+ int ret;
+ char *to_address;
+
+ /* make sure we have the page ready for read */
+ ret = wait_event_interruptible(
+ info->wait_reassembly_queue,
+ info->reassembly_data_length >= to_read ||
+ info->transport_status != SMBD_CONNECTED);
+ if (ret)
+ return 0;
+
+ /* now we can read from reassembly queue and not sleep */
+ to_address = kmap_atomic(page);
+
+ log_read(INFO, "reading from page=%p address=%p to_read=%d\n",
+ page, to_address, to_read);
+
+ ret = smbd_recv_buf(info, to_address, to_read);
+ kunmap_atomic(to_address);
+
+ return ret;
+}
+
+/*
+ * Receive data from transport
+ * msg: a msghdr point to the buffer, can be ITER_KVEC or ITER_BVEC
+ * return: total bytes read, or 0. SMB Direct will not do partial read.
+ */
+int smbd_recv(struct smbd_connection *info, struct msghdr *msg)
+{
+ char *buf;
+ struct page *page;
+ unsigned int to_read;
+ int rc;
+
+ info->smbd_recv_pending++;
+
+ switch (msg->msg_iter.type) {
+ case READ | ITER_KVEC:
+ buf = msg->msg_iter.kvec->iov_base;
+ to_read = msg->msg_iter.kvec->iov_len;
+ rc = smbd_recv_buf(info, buf, to_read);
+ break;
+
+ case READ | ITER_BVEC:
+ page = msg->msg_iter.bvec->bv_page;
+ to_read = msg->msg_iter.bvec->bv_len;
+ rc = smbd_recv_page(info, page, to_read);
+ break;
+
+ default:
+ /* It's a bug in upper layer to get there */
+ cifs_dbg(VFS, "CIFS: invalid msg type %d\n",
+ msg->msg_iter.type);
+ rc = -EIO;
+ }
+
+ info->smbd_recv_pending--;
+ wake_up(&info->wait_smbd_recv_pending);
+
+ /* SMBDirect will read it all or nothing */
+ if (rc > 0)
+ msg->msg_iter.count = 0;
+ return rc;
+}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index 1707bbc..c1e2d3d 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -88,6 +88,9 @@ struct smbd_connection {
int fragment_reassembly_remaining;
/* Activity accoutning */
+ /* Pending reqeusts issued from upper layer */
+ int smbd_recv_pending;
+ wait_queue_head_t wait_smbd_recv_pending;
atomic_t send_pending;
wait_queue_head_t wait_send_pending;
--
2.7.4
From: Long Li <[email protected]>
When the upper layer wants to umount, make it call shutdown on the transport
when SMB Direct is used.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 8ca3c13..23f10d1 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -707,7 +707,12 @@ static void clean_demultiplex_info(struct TCP_Server_Info *server)
wake_up_all(&server->request_q);
/* give those requests time to exit */
msleep(125);
-
+#ifdef CONFIG_CIFS_SMB_DIRECT
+ if (server->smbd_conn) {
+ smbd_destroy(server->smbd_conn);
+ server->smbd_conn = NULL;
+ }
+#endif
if (server->ssocket) {
sock_release(server->ssocket);
server->ssocket = NULL;
--
2.7.4
From: Long Li <[email protected]>
Do a reconnect on SMB Direct when it is used as the connection. Reconnect
can happen for many reasons, and it is mostly the decision of the SMB2 upper
layer.
Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 2c0b34a..8ca3c13 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -406,7 +406,14 @@ cifs_reconnect(struct TCP_Server_Info *server)
/* we should try only the port we connected to before */
mutex_lock(&server->srv_mutex);
+#ifdef CONFIG_CIFS_SMB_DIRECT
+ if (server->rdma)
+ rc = smbd_reconnect(server);
+ else
+ rc = generic_ip_connect(server);
+#else
rc = generic_ip_connect(server);
+#endif
if (rc) {
cifs_dbg(FYI, "reconnect error %d\n", rc);
mutex_unlock(&server->srv_mutex);
--
2.7.4