2017-08-02 20:12:09

by Long Li

Subject: [[PATCH v1] 00/37] Implement SMBD protocol: Series 1

From: Long Li <[email protected]>

SMB3 defines a protocol for transferring data over RDMA transports such as Infiniband, RoCE and iWARP. The protocol is published in [MS-SMBD] (https://msdn.microsoft.com/en-us/library/hh536346.aspx).

This is series 1 of two patch sets. It implements the SMBD transport for doing RDMA send/recv.

This patch set is the foundation for the series 2 patch set, which implements upper layer RDMA read/write via memory registration.

Long Li (37):
[CIFS] SMBD: Add parsing for new rdma mount option
[CIFS] SMBD: Add structure for SMBD transport
[CIFS] SMBD: Add logging functions for debug
[CIFS] SMBD: Define per-channel SMBD transport parameters and default
values
[CIFS] SMBD: Implement API for upper layer to create SMBD transport
and establish RDMA connection
[CIFS] SMBD: Add definition and cache for SMBD response
[CIFS] SMBD: Implement receive buffer for handling SMBD response
[CIFS] SMBD: Define packet format for SMBD data transfer message
[CIFS] SMBD: Add SMBD request and cache
[CIFS] SMBD: Introduce wait queue when sending SMBD request
[CIFS] SMBD: Post a receive request
[CIFS] SMBD: Handle send completion from CQ
[CIFS] SMBD: Implement SMBD protocol negotiation
[CIFS] SMBD: Post a SMBD data transfer message with page payload
[CIFS] SMBD: Post a SMBD data transfer message with data payload
[CIFS] SMBD: Post a SMBD message with no payload
[CIFS] SMBD: Track status for transport
[CIFS] SMBD: Implement API for upper layer to send data
[CIFS] SMBD: Manage credits on SMBD client and server
[CIFS] SMBD: Implement reassembly queue for receiving data
[CIFS] SMBD: Implement API for upper layer to receive data
[CIFS] SMBD: Implement API for upper layer to receive data to page
[CIFS] SMBD: Implement API for upper layer to reconnect transport
[CIFS] SMBD: Support for SMBD keep alive protocol
[CIFS] SMBD: Support SMBD idle connection timer
[CIFS] SMBD: Send an immediate packet when it's needed
[CIFS] SMBD: Destroy transport when RDMA channel is disconnected
[CIFS] SMBD: Implement API for upper layer to destroy the transport
[CIFS] SMBD: Disconnect RDMA connection on QP errors
[CIFS] SMBD: Add SMBDirect transport to Makefile
[CIFS] Add SMBD transport to SMB session context
[CIFS] Add SMBD debug counters to CIFS debug exports
[CIFS] Connect to SMBD transport when specified in mount option
[CIFS] Reconnect to SMBD transport when it's used
[CIFS] Destroy SMBD transport on exit
[CIFS] Read from SMBD transport when it's used
[CIFS] Write to SMBD transport when it's used

fs/cifs/Makefile | 2 +-
fs/cifs/cifs_debug.c | 25 +
fs/cifs/cifsfs.c | 2 +
fs/cifs/cifsglob.h | 3 +
fs/cifs/cifsrdma.c | 1833 ++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 243 +++++++
fs/cifs/connect.c | 56 +-
fs/cifs/transport.c | 7 +
8 files changed, 2164 insertions(+), 7 deletions(-)
create mode 100644 fs/cifs/cifsrdma.c
create mode 100644 fs/cifs/cifsrdma.h

--
2.7.4


2017-08-02 20:12:16

by Long Li

Subject: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma mount option

From: Long Li <[email protected]>

Mounting with "-o rdma" lets the user request that the connection to the server use an SMBD session.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifs_debug.c | 2 ++
fs/cifs/cifsfs.c | 2 ++
fs/cifs/cifsglob.h | 2 ++
fs/cifs/connect.c | 10 +++++++++-
4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index 9727e1d..ba0870d 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -171,6 +171,8 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
ses->ses_count, ses->serverOS, ses->serverNOS,
ses->capabilities, ses->status);
}
+ if (server->rdma)
+ seq_printf(m, "RDMA\n\t");
seq_printf(m, "TCP status: %d\n\tLocal Users To "
"Server: %d SecMode: 0x%x Req On Wire: %d",
server->tcpStatus, server->srv_count,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index fe0c8dc..a628800 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -330,6 +330,8 @@ cifs_show_address(struct seq_file *s, struct TCP_Server_Info *server)
default:
seq_puts(s, "(unknown)");
}
+ if (server->rdma)
+ seq_puts(s, ",rdma");
}

static void
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 8289f95..20af553 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -531,6 +531,7 @@ struct smb_vol {
bool nopersistent:1;
bool resilient:1; /* noresilient not required since not fored for CA */
bool domainauto:1;
+ bool rdma:1;
unsigned int rsize;
unsigned int wsize;
bool sockopt_tcp_nodelay:1;
@@ -649,6 +650,7 @@ struct TCP_Server_Info {
bool sec_kerberos; /* supports plain Kerberos */
bool sec_mskerberos; /* supports legacy MS Kerberos */
bool large_buf; /* is current buffer large? */
+ bool rdma; /* use rdma wrapper instead of socket */
struct delayed_work echo; /* echo ping workqueue job */
char *smallbuf; /* pointer to current "small" buffer */
char *bigbuf; /* pointer to current "big" buffer */
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 2eeaac6..0dc942c 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -94,7 +94,7 @@ enum {
Opt_multiuser, Opt_sloppy, Opt_nosharesock,
Opt_persistent, Opt_nopersistent,
Opt_resilient, Opt_noresilient,
- Opt_domainauto,
+ Opt_domainauto, Opt_rdma,

/* Mount options which take numeric value */
Opt_backupuid, Opt_backupgid, Opt_uid,
@@ -185,6 +185,7 @@ static const match_table_t cifs_mount_option_tokens = {
{ Opt_resilient, "resilienthandles"},
{ Opt_noresilient, "noresilienthandles"},
{ Opt_domainauto, "domainauto"},
+ { Opt_rdma, "rdma"},

{ Opt_backupuid, "backupuid=%s" },
{ Opt_backupgid, "backupgid=%s" },
@@ -1541,6 +1542,9 @@ cifs_parse_mount_options(const char *mountdata, const char *devname,
case Opt_domainauto:
vol->domainauto = true;
break;
+ case Opt_rdma:
+ vol->rdma = true;
+ break;

/* Numeric Values */
case Opt_backupuid:
@@ -2134,6 +2138,9 @@ static int match_server(struct TCP_Server_Info *server, struct smb_vol *vol)
if (server->echo_interval != vol->echo_interval * HZ)
return 0;

+ if (server->rdma != vol->rdma)
+ return 0;
+
return 1;
}

@@ -2234,6 +2241,7 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
tcp_ses->noblocksnd = volume_info->noblocksnd;
tcp_ses->noautotune = volume_info->noautotune;
tcp_ses->tcp_nodelay = volume_info->sockopt_tcp_nodelay;
+ tcp_ses->rdma = volume_info->rdma;
tcp_ses->in_flight = 0;
tcp_ses->credits = 1;
init_waitqueue_head(&tcp_ses->response_q);
--
2.7.4
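
For readers following along outside the kernel tree, here is a minimal userspace sketch of the idea in this patch: a boolean mount option is recognised in a comma-separated option string and recorded as a flag. The names demo_vol and demo_parse_options are hypothetical and stand in for struct smb_vol and cifs_parse_mount_options(); the real parser uses match_table_t and match_token(), which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the boolean fields of struct smb_vol. */
struct demo_vol {
	bool domainauto;
	bool rdma;
};

/*
 * Walk a comma-separated option string and set the matching flags.
 * Returns 0 on success, -1 if an unknown option is seen.
 */
static int demo_parse_options(const char *opts, struct demo_vol *vol)
{
	while (*opts) {
		/* length of the current option, up to the next comma or end */
		size_t len = strcspn(opts, ",");

		if (len == 4 && strncmp(opts, "rdma", 4) == 0)
			vol->rdma = true;
		else if (len == 10 && strncmp(opts, "domainauto", 10) == 0)
			vol->domainauto = true;
		else if (len != 0)
			return -1;

		opts += len;
		if (*opts == ',')
			opts++;
	}
	return 0;
}
```

In the patch itself the flag is later copied into the server structure and compared in match_server(), so a mount with "rdma" does not share a connection with one without it.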

2017-08-02 20:12:21

by Long Li

Subject: [[PATCH v1] 06/37] [CIFS] SMBD: Add definition and cache for SMBD response

From: Long Li <[email protected]>

Define the data structure for receiving an SMBD response. SMBD responses are allocated from a per-channel mempool. For each server-initiated response message, the SMB client must post an SMBD response to the local hardware.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 13 +++++++++++++
fs/cifs/cifsrdma.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index b18fb79..1636304 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -287,6 +287,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
struct cifs_rdma_info *info;
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
+ char cache_name[80];
int max_pending = receive_credit_max + send_credit_target;

info = kzalloc(sizeof(struct cifs_rdma_info), GFP_KERNEL);
@@ -370,6 +371,18 @@ struct cifs_rdma_info* cifs_create_rdma_session(
goto out2;

log_rdma_event("rdma_connect connected\n");
+
+ sprintf(cache_name, "cifs_smbd_response_%p", info);
+ info->response_cache =
+ kmem_cache_create(
+ cache_name,
+ sizeof(struct cifs_rdma_response) +
+ info->max_receive_size,
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+
+ info->response_mempool =
+ mempool_create(info->receive_credit_max, mempool_alloc_slab,
+ mempool_free_slab, info->response_cache);
out2:
rdma_destroy_id(info->id);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 71ea380..41ae61a 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -59,6 +59,10 @@ struct cifs_rdma_info {
atomic_t receive_credits;
atomic_t receive_credit_target;

+ // response pool for RDMA receive
+ struct kmem_cache *response_cache;
+ mempool_t *response_mempool;
+
// for debug purposes
unsigned int count_receive_buffer;
unsigned int count_get_receive_buffer;
@@ -66,6 +70,33 @@ struct cifs_rdma_info {
unsigned int count_send_empty;
};

+enum smbd_message_type {
+ SMBD_NEGOTIATE_RESP,
+ SMBD_TRANSFER_DATA,
+};
+
+// The context for a SMBD response
+struct cifs_rdma_response {
+ struct cifs_rdma_info *info;
+
+ // completion queue entry
+ struct ib_cqe cqe;
+
+ // the SGE entry for the packet
+ struct ib_sge sge;
+
+ enum smbd_message_type type;
+
+ // link to receive queue or reassembly queue
+ struct list_head list;
+
+ // indicate if this is the 1st packet of a payload
+ bool first_segment;
+
+ // SMBD packet header and payload follows this structure
+ char packet[0];
+};
+
// Create a SMBDirect session
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);
--
2.7.4
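
The response structure above ends in a zero-length array so that the packet header and payload live in the same allocation as the bookkeeping fields. A minimal userspace sketch of that layout follows, assuming a plain calloc() in place of the slab cache and mempool; demo_response and demo_response_alloc are hypothetical names, and the capacity field is added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced version of struct cifs_rdma_response. */
struct demo_response {
	int type;
	size_t capacity;	/* bytes available in packet[] (illustration only) */
	char packet[];		/* packet header and payload follow the struct */
};

/*
 * Allocate one response with room for max_receive_size bytes of packet
 * data directly behind the structure. Returns NULL on allocation failure.
 */
static struct demo_response *demo_response_alloc(size_t max_receive_size)
{
	struct demo_response *r = calloc(1, sizeof(*r) + max_receive_size);

	if (r)
		r->capacity = max_receive_size;
	return r;
}
```

The object size passed to kmem_cache_create() in the patch follows the same arithmetic: the size of the structure plus the maximum receive size.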

2017-08-02 20:12:32

by Long Li

Subject: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client and server

From: Long Li <[email protected]>

SMB client and server maintain a credit system on the SMBD transport. Credits are used to tell when the client or server can send a packet to the peer, based on current memory or resource usage.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index eb48651..97cde3f 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -575,6 +575,46 @@ static int cifs_rdma_post_send_negotiate_req(struct cifs_rdma_info *info)
}

/*
+ * Extend the credits to remote peer
+ * This implements [MS-SMBD] 3.1.5.9
+ * The idea is that we should extend credits to remote peer as quickly as
+ * it's allowed, to maintain data flow. We allocate as many receive
+ * buffers as possible, and extend the receive credits to remote peer
+ * return value: the new credits being granted.
+ */
+static int manage_credits_prior_sending(struct cifs_rdma_info *info)
+{
+ int ret = 0;
+ struct cifs_rdma_response *response;
+ int rc;
+
+ if (atomic_read(&info->receive_credit_target) >
+ atomic_read(&info->receive_credits)) {
+ while (true) {
+ response = get_receive_buffer(info);
+ if (!response)
+ break;
+
+ response->type = SMBD_TRANSFER_DATA;
+ response->first_segment = false;
+ rc = cifs_rdma_post_recv(info, response);
+ if (rc) {
+ log_rdma_recv("post_recv failed rc=%d\n", rc);
+ put_receive_buffer(info, response);
+ break;
+ }
+
+ ret++;
+ }
+ }
+
+ atomic_add(ret, &info->receive_credits);
+ log_transport_credit(info);
+
+ return ret;
+}
+
+/*
* Send a page
* page: the page to send
* offset: offset in the page to send
@@ -607,6 +647,8 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa

packet = (struct smbd_data_transfer *) request->packet;
packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->credits_granted =
+ cpu_to_le16(manage_credits_prior_sending(info));
packet->flags = cpu_to_le16(0);

packet->reserved = cpu_to_le16(0);
@@ -718,6 +760,8 @@ static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info)
request->info = info;
packet = (struct smbd_data_transfer_no_data *) request->packet;

+ credits_granted = manage_credits_prior_sending(info);
+
/* nothing to do? */
if (credits_granted==0 && flags==0) {
mempool_free(request, info->request_mempool);
@@ -827,6 +871,7 @@ static int cifs_rdma_post_send_data(

packet = (struct smbd_data_transfer *) request->packet;
packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->credits_granted = cpu_to_le16(manage_credits_prior_sending(info));
packet->flags = cpu_to_le16(0);
packet->reserved = cpu_to_le16(0);

--
2.7.4
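
The credit accounting can be illustrated without any RDMA calls. The sketch below is a userspace model under stated assumptions: demo_transport and demo_grant_credits are hypothetical names, decrementing free_buffers stands in for posting one receive, and, unlike the patch (which posts every available buffer once the target exceeds the current credits), this version stops as soon as the target is reached.

```c
#include <assert.h>

/* Hypothetical model of the receive-side credit state. */
struct demo_transport {
	int receive_credits;		/* credits already granted to the peer */
	int receive_credit_target;	/* credits the peer has asked for */
	int free_buffers;		/* receive buffers not yet posted */
};

/*
 * Post receives until the target is met or buffers run out.
 * Returns the number of new credits to advertise in the next
 * outgoing packet.
 */
static int demo_grant_credits(struct demo_transport *t)
{
	int granted = 0;

	while (t->receive_credits + granted < t->receive_credit_target &&
	       t->free_buffers > 0) {
		t->free_buffers--;	/* stands in for posting one receive */
		granted++;
	}

	t->receive_credits += granted;
	return granted;
}
```

Each send path in the patch places the returned count in the credits_granted field of the outgoing packet, so credits are only ever advertised for receives that have already been posted.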

2017-08-02 20:12:31

by Long Li

Subject: [[PATCH v1] 22/37] [CIFS] SMBD: Implement API for upper layer to receive data to page

From: Long Li <[email protected]>

The upper layer may also want to read data directly into a page. Add an API for this.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 30 ++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 2 ++
2 files changed, 32 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index e5f6300..67a11d9 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -1316,6 +1316,36 @@ struct cifs_rdma_info* cifs_create_rdma_session(
}

/*
+ * Read a page from receive reassembly queue
+ * page: the page to read data into
+ * to_read: the length of data to read
+ * return value: actual data read
+ */
+int cifs_rdma_read_page(struct cifs_rdma_info *info,
+ struct page *page, unsigned int to_read)
+{
+ int ret;
+ char *to_address;
+
+ // make sure we have the page ready for read
+ wait_event(
+ info->wait_reassembly_queue,
+ atomic_read(&info->reassembly_data_length) >= to_read ||
+ info->transport_status != CIFS_RDMA_CONNECTED);
+
+ // now we can read from reassembly queue and not sleep
+ to_address = kmap_atomic(page);
+
+ log_cifs_read("reading from page=%p address=%p to_read=%d\n",
+ page, to_address, to_read);
+
+ ret = cifs_rdma_read(info, to_address, to_read);
+ kunmap_atomic(to_address);
+
+ return ret;
+}
+
+/*
* Read data from receive reassembly queue
* All the incoming data packets are placed in reassembly queue
* buf: the buffer to read data into
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 8891e21..36f3e4c 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -215,5 +215,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
// SMBDirect interface for carrying upper layer CIFS I/O
int cifs_rdma_read(
struct cifs_rdma_info *rdma, char *buf, unsigned int to_read);
+int cifs_rdma_read_page(
+ struct cifs_rdma_info *rdma, struct page *page, unsigned int to_read);
int cifs_rdma_write(struct cifs_rdma_info *rdma, struct smb_rqst *rqst);
#endif
--
2.7.4

2017-08-02 20:12:30

by Long Li

Subject: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for handling SMBD response

From: Long Li <[email protected]>

Create receive buffers for handling SMBD response messages. Each SMBD response message is associated with a receive buffer, where the message payload is received. The number of receive buffers is determined after negotiating the SMBD transport with the server.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 3 ++
2 files changed, 92 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 1636304..b3e43b1 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -54,6 +54,14 @@

#include "cifsrdma.h"

+static struct cifs_rdma_response* get_receive_buffer(
+ struct cifs_rdma_info *info);
+static void put_receive_buffer(
+ struct cifs_rdma_info *info,
+ struct cifs_rdma_response *response);
+static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf);
+static void destroy_receive_buffers(struct cifs_rdma_info *info);
+
/*
* Per RDMA transport connection parameters
* as defined in [MS-SMBD] 3.1.1.1
@@ -280,6 +288,85 @@ static int cifs_rdma_ia_open(
return rc;
}

+/*
+ * Receive buffer operations.
+ * For each remote send, we need to post a receive. The receive buffers are
+ * pre-allocated in advance.
+ */
+static struct cifs_rdma_response* get_receive_buffer(struct cifs_rdma_info *info)
+{
+ struct cifs_rdma_response *ret = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&info->receive_queue_lock, flags);
+ if (!list_empty(&info->receive_queue)) {
+ ret = list_first_entry(
+ &info->receive_queue,
+ struct cifs_rdma_response, list);
+ list_del(&ret->list);
+ info->count_receive_buffer--;
+ info->count_get_receive_buffer++;
+ }
+ spin_unlock_irqrestore(&info->receive_queue_lock, flags);
+
+ return ret;
+}
+
+static void put_receive_buffer(
+ struct cifs_rdma_info *info, struct cifs_rdma_response *response)
+{
+ unsigned long flags;
+
+ ib_dma_unmap_single(info->id->device, response->sge.addr,
+ response->sge.length, DMA_FROM_DEVICE);
+
+ spin_lock_irqsave(&info->receive_queue_lock, flags);
+ list_add_tail(&response->list, &info->receive_queue);
+ info->count_receive_buffer++;
+ info->count_put_receive_buffer++;
+ spin_unlock_irqrestore(&info->receive_queue_lock, flags);
+}
+
+static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf)
+{
+ int i;
+ struct cifs_rdma_response *response;
+
+ INIT_LIST_HEAD(&info->receive_queue);
+ spin_lock_init(&info->receive_queue_lock);
+
+ for (i=0; i<num_buf; i++) {
+ response = mempool_alloc(info->response_mempool, GFP_KERNEL);
+ if (!response)
+ goto allocate_failed;
+
+ response->info = info;
+ list_add_tail(&response->list, &info->receive_queue);
+ info->count_receive_buffer++;
+ }
+
+ return 0;
+
+allocate_failed:
+ while (!list_empty(&info->receive_queue)) {
+ response = list_first_entry(
+ &info->receive_queue,
+ struct cifs_rdma_response, list);
+ list_del(&response->list);
+ info->count_receive_buffer--;
+
+ mempool_free(response, info->response_mempool);
+ }
+ return -ENOMEM;
+}
+
+static void destroy_receive_buffers(struct cifs_rdma_info *info)
+{
+ struct cifs_rdma_response *response;
+ while ((response = get_receive_buffer(info)))
+ mempool_free(response, info->response_mempool);
+}
+
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr)
{
@@ -383,6 +470,8 @@ struct cifs_rdma_info* cifs_create_rdma_session(
info->response_mempool =
mempool_create(info->receive_credit_max, mempool_alloc_slab,
mempool_free_slab, info->response_cache);
+
+ allocate_receive_buffers(info, info->receive_credit_max);
out2:
rdma_destroy_id(info->id);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 41ae61a..78ce2bf 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -59,6 +59,9 @@ struct cifs_rdma_info {
atomic_t receive_credits;
atomic_t receive_credit_target;

+ struct list_head receive_queue;
+ spinlock_t receive_queue_lock;
+
// response pool for RDMA receive
struct kmem_cache *response_cache;
mempool_t *response_mempool;
--
2.7.4
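
The get/put pair above is a FIFO free list with a counter. Here is a minimal single-threaded userspace sketch of the same pattern, assuming a hand-rolled singly linked queue in place of struct list_head and omitting the spinlock and the DMA unmap; demo_buf, demo_queue, demo_put and demo_get are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical receive buffer; only the link and an id for the demo. */
struct demo_buf {
	struct demo_buf *next;
	int id;
};

/* FIFO of free buffers plus a count, like receive_queue in the patch. */
struct demo_queue {
	struct demo_buf *head;
	struct demo_buf *tail;
	unsigned int count;
};

/* Return a buffer to the tail of the free queue. */
static void demo_put(struct demo_queue *q, struct demo_buf *b)
{
	b->next = NULL;
	if (q->tail)
		q->tail->next = b;
	else
		q->head = b;
	q->tail = b;
	q->count++;
}

/* Take the oldest free buffer, or NULL if none are left. */
static struct demo_buf *demo_get(struct demo_queue *q)
{
	struct demo_buf *b = q->head;

	if (!b)
		return NULL;
	q->head = b->next;
	if (!q->head)
		q->tail = NULL;
	q->count--;
	return b;
}
```

A caller that gets NULL back simply stops posting receives, which is how the credit-granting loop in patch 19 terminates.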

2017-08-02 20:13:05

by Long Li

Subject: [[PATCH v1] 37/37] [CIFS] Write to SMBD transport when it's used

From: Long Li <[email protected]>

When sending data, send it over SMBD if that transport is currently in use.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/transport.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index ba62aaf..10b9d15 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -37,6 +37,7 @@
#include "cifsglob.h"
#include "cifsproto.h"
#include "cifs_debug.h"
+#include "cifsrdma.h"

void
cifs_wake_up_task(struct mid_q_entry *mid)
@@ -230,6 +231,11 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
struct msghdr smb_msg;
int val = 1;

+ if(server->rdma_ses) {
+ rc = cifs_rdma_write(server->rdma_ses, rqst);
+ goto done;
+ }
+
if (ssocket == NULL)
return -ENOTSOCK;

@@ -299,6 +305,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct smb_rqst *rqst)
server->tcpStatus = CifsNeedReconnect;
}

+done:
if (rc < 0 && rc != -EINTR)
cifs_dbg(VFS, "Error %d sending data on socket to server\n",
rc);
--
2.7.4

2017-08-02 20:13:28

by Long Li

Subject: [[PATCH v1] 33/37] [CIFS] Connect to SMBD transport when specified in mount option

From: Long Li <[email protected]>

When "-o rdma" is specified, attempt to connect to SMB server via SMBD.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 0dc942c..1ba5b92 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -45,6 +45,7 @@
#include <linux/parser.h>
#include <linux/bvec.h>

+#include "cifsrdma.h"
#include "cifspdu.h"
#include "cifsglob.h"
#include "cifsproto.h"
@@ -2284,12 +2285,26 @@ cifs_get_tcp_session(struct smb_vol *volume_info)
else
tcp_ses->echo_interval = SMB_ECHO_INTERVAL_DEFAULT * HZ;

- rc = ip_connect(tcp_ses);
- if (rc < 0) {
- cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
- goto out_err_crypto_release;
+ if (tcp_ses->rdma) {
+ tcp_ses->rdma_ses = cifs_create_rdma_session(tcp_ses, (struct sockaddr *)&volume_info->dstaddr);
+ if (tcp_ses->rdma_ses) {
+ cifs_dbg(VFS, "%s: RDMA transport established\n", __func__);
+ rc = 0;
+ goto connected;
+ }
+ else {
+ rc = -ENOENT;
+ goto out_err_crypto_release;
+ }
+ }
+
+ rc = ip_connect(tcp_ses);
+ if (rc < 0) {
+ cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
+ goto out_err_crypto_release;
}

+connected:
/*
* since we're in a cifs function already, we know that
* this will succeed. No need for try_module_get().
--
2.7.4

2017-08-02 20:13:25

by Long Li

Subject: [[PATCH v1] 35/37] [CIFS] Destroy SMBD transport on exit

From: Long Li <[email protected]>

When SMBD is used in the SMB session, destroy it on exit.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 54c1f7c..cc58cd8 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -708,6 +708,11 @@ static void clean_demultiplex_info(struct TCP_Server_Info *server)
/* give those requests time to exit */
msleep(125);

+ if (server->rdma && server->rdma_ses) {
+ cifs_destroy_rdma_session(server->rdma_ses);
+ server->rdma_ses = NULL;
+ }
+
if (server->ssocket) {
sock_release(server->ssocket);
server->ssocket = NULL;
@@ -2179,6 +2184,10 @@ cifs_put_tcp_session(struct TCP_Server_Info *server, int from_reconnect)
return;
}

+ if (server->rdma && server->rdma_ses) {
+ cifs_destroy_rdma_session(server->rdma_ses);
+ }
+
put_net(cifs_net_ns(server));

list_del_init(&server->tcp_ses_list);
--
2.7.4

2017-08-02 20:13:24

by Long Li

Subject: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to Makefile

From: Long Li <[email protected]>

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/Makefile b/fs/cifs/Makefile
index eed7eb0..7175e25 100644
--- a/fs/cifs/Makefile
+++ b/fs/cifs/Makefile
@@ -18,4 +18,4 @@ cifs-$(CONFIG_CIFS_DFS_UPCALL) += dns_resolve.o cifs_dfs_ref.o
cifs-$(CONFIG_CIFS_FSCACHE) += fscache.o cache.o

cifs-$(CONFIG_CIFS_SMB2) += smb2ops.o smb2maperror.o smb2transport.o \
- smb2misc.o smb2pdu.o smb2inode.o smb2file.o
+ smb2misc.o smb2pdu.o smb2inode.o smb2file.o cifsrdma.o
--
2.7.4

2017-08-02 20:14:33

by Long Li

Subject: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive protocol

From: Long Li <[email protected]>

SMBD uses a keep alive protocol to help peers detect whether the remote side is dead. When the peer requests a keep alive, the transport needs to respond accordingly.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 30 ++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 7 +++++++
2 files changed, 37 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 4681cda..e275834 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -372,6 +372,12 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
if (atomic_read(&info->send_credits))
wake_up(&info->wait_send_queue);

+ // send an empty response right away if requested
+ if (le16_to_cpu(data_transfer->flags) &
+ SMB_DIRECT_RESPONSE_REQUESTED) {
+ info->keep_alive_requested = KEEP_ALIVE_PENDING;
+ }
+
// process receive queue
if (le32_to_cpu(data_transfer->data_length)) {
if (info->full_packet_received) {
@@ -627,6 +633,24 @@ static int manage_credits_prior_sending(struct cifs_rdma_info *info)
}

/*
+ * Check if we need to send a KEEP_ALIVE message
+ * The idle connection timer triggers a KEEP_ALIVE message when it expires
+ * In this case, we need to set SMB_DIRECT_RESPONSE_REQUESTED in the message
+ * flag to have peer send back a KEEP_ALIVE
+ * return value:
+ * 1 if SMB_DIRECT_RESPONSE_REQUESTED needs to be set
+ * 0: otherwise
+ */
+static int manage_keep_alive_before_sending(struct cifs_rdma_info *info)
+{
+ if (info->keep_alive_requested == KEEP_ALIVE_PENDING) {
+ info->keep_alive_requested = KEEP_ALIVE_SENT;
+ return 1;
+ }
+ return 0;
+}
+
+/*
* Send a page
* page: the page to send
* offset: offset in the page to send
@@ -662,6 +686,8 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa
packet->credits_granted =
cpu_to_le16(manage_credits_prior_sending(info));
packet->flags = cpu_to_le16(0);
+ if (manage_keep_alive_before_sending(info))
+ packet->flags |= cpu_to_le16(SMB_DIRECT_RESPONSE_REQUESTED);

packet->reserved = cpu_to_le16(0);
packet->data_offset = cpu_to_le32(24);
@@ -773,6 +799,8 @@ static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info)
packet = (struct smbd_data_transfer_no_data *) request->packet;

credits_granted = manage_credits_prior_sending(info);
+ if (manage_keep_alive_before_sending(info))
+ flags = SMB_DIRECT_RESPONSE_REQUESTED;

/* nothing to do? */
if (credits_granted==0 && flags==0) {
@@ -885,6 +913,8 @@ static int cifs_rdma_post_send_data(
packet->credits_requested = cpu_to_le16(info->send_credit_target);
packet->credits_granted = cpu_to_le16(manage_credits_prior_sending(info));
packet->flags = cpu_to_le16(0);
+ if (manage_keep_alive_before_sending(info))
+ packet->flags |= cpu_to_le16(SMB_DIRECT_RESPONSE_REQUESTED);
packet->reserved = cpu_to_le16(0);

packet->data_offset = cpu_to_le32(24);
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index c27db6f..62d0bb8 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -25,6 +25,12 @@
#include <rdma/rdma_cm.h>
#include <linux/mempool.h>

+enum keep_alive_status {
+ KEEP_ALIVE_NONE,
+ KEEP_ALIVE_PENDING,
+ KEEP_ALIVE_SENT,
+};
+
enum cifs_rdma_transport_status {
CIFS_RDMA_CREATED,
CIFS_RDMA_CONNECTING,
@@ -68,6 +74,7 @@ struct cifs_rdma_info {
int max_fragmented_send_size;
int max_receive_size;
int max_readwrite_size;
+ enum keep_alive_status keep_alive_requested;
int protocol;
atomic_t send_credits;
atomic_t receive_credits;
--
2.7.4
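
The keep alive handling is a three-state machine driven by one flag bit. The userspace sketch below mirrors the state transitions in this patch under stated assumptions: the names are hypothetical, the flag value 0x0001 is assumed rather than taken from this thread, and the flags argument is taken to be already converted to host byte order. The flag is tested with a bitwise AND so that only packets carrying the bit mark a keep alive as pending.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the response-requested bit; not taken from this thread. */
#define DEMO_RESPONSE_REQUESTED 0x0001

enum demo_keep_alive {
	DEMO_KA_NONE,
	DEMO_KA_PENDING,
	DEMO_KA_SENT,
};

/* Receive side: note a pending keep alive only when the bit is set. */
static void demo_on_receive(enum demo_keep_alive *state, uint16_t flags)
{
	if (flags & DEMO_RESPONSE_REQUESTED)
		*state = DEMO_KA_PENDING;
}

/*
 * Send side: returns 1 exactly once per pending keep alive, telling the
 * caller to set the flag in the outgoing packet; returns 0 otherwise.
 */
static int demo_before_send(enum demo_keep_alive *state)
{
	if (*state == DEMO_KA_PENDING) {
		*state = DEMO_KA_SENT;
		return 1;
	}
	return 0;
}
```

Moving from PENDING to SENT inside the send-side helper is what keeps the flag from being set on every subsequent packet.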

2017-08-02 20:14:32

by Long Li

Subject: [[PATCH v1] 29/37] [CIFS] SMBD: Disconnect RDMA connection on QP errors

From: Long Li <[email protected]>

A Queue Pair may report errors when the hardware has connectivity problems. In this case, SMBD should terminate the RDMA connection.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 18 ++++++++++++++++++
fs/cifs/cifsrdma.h | 1 +
2 files changed, 19 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index fb94975..01cb006 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -202,6 +202,22 @@ static int cifs_rdma_process_disconnected(struct cifs_rdma_info *info)
return 0;
}

+static void cifs_disconnect_rdma_work(struct work_struct *work)
+{
+ struct cifs_rdma_info *info =
+ container_of(work, struct cifs_rdma_info, disconnect_work);
+
+ if (info->transport_status == CIFS_RDMA_CONNECTED) {
+ info->transport_status = CIFS_RDMA_DISCONNECTING;
+ rdma_disconnect(info->id);
+ }
+}
+
+static void cifs_disconnect_rdma_session(struct cifs_rdma_info *info)
+{
+ schedule_work(&info->disconnect_work);
+}
+
/* Upcall from RDMA CM */
static int cifs_rdma_conn_upcall(
struct rdma_cm_id *id, struct rdma_cm_event *event)
@@ -264,6 +280,7 @@ cifs_rdma_qp_async_error_upcall(struct ib_event *event, void *context)
case IB_EVENT_QP_FATAL:
case IB_EVENT_QP_REQ_ERR:
case IB_EVENT_QP_ACCESS_ERR:
+ cifs_disconnect_rdma_session(info);

default:
break;
@@ -1494,6 +1511,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
init_waitqueue_head(&info->wait_recv_pending);
atomic_set(&info->recv_pending, 0);

+ INIT_WORK(&info->disconnect_work, cifs_disconnect_rdma_work);
INIT_WORK(&info->destroy_work, cifs_destroy_rdma_work);

rc = cifs_rdma_negotiate(info);
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 424ea8e..9306622 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -68,6 +68,7 @@ struct cifs_rdma_info {
bool negotiate_done;

struct work_struct destroy_work;
+ struct work_struct disconnect_work;

//connection paramters
int receive_credit_max;
--
2.7.4

2017-08-02 20:14:31

by Long Li

Subject: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer to reconnect transport

From: Long Li <[email protected]>

The upper layer may request a reconnect to the server for multiple reasons. A disconnect can happen at the SMB layer or at the SMBD layer. Add a function to support reconnecting when the transport is disconnected.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 24 ++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 3 +++
2 files changed, 27 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 67a11d9..4681cda 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -1170,6 +1170,30 @@ static void destroy_receive_buffers(struct cifs_rdma_info *info)
mempool_free(response, info->response_mempool);
}

+int cifs_reconnect_rdma_session(struct TCP_Server_Info *server)
+{
+ log_rdma_event("reconnecting rdma session\n");
+
+ // why reconnect while it is still connected?
+ if (server->rdma_ses->transport_status == CIFS_RDMA_CONNECTED) {
+ log_rdma_event("still connected, not reconnecting\n");
+ return -EINVAL;
+ }
+
+ // wait until the transport is destroyed
+ while (server->rdma_ses->transport_status != CIFS_RDMA_DESTROYED)
+ msleep(1);
+
+ if (server->rdma_ses)
+ kfree(server->rdma_ses);
+
+ log_rdma_event("creating rdma session\n");
+ server->rdma_ses = cifs_create_rdma_session(
+ server, (struct sockaddr *) &server->dstaddr);
+
+ return server->rdma_ses ? 0 : -ENOENT;
+}
+
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr)
{
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 36f3e4c..c27db6f 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -212,6 +212,9 @@ struct cifs_rdma_response {
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);

+// Reconnect SMBDirect session
+int cifs_reconnect_rdma_session(struct TCP_Server_Info *server);
+
// SMBDirect interface for carrying upper layer CIFS I/O
int cifs_rdma_read(
struct cifs_rdma_info *rdma, char *buf, unsigned int to_read);
--
2.7.4

2017-08-02 20:14:29

by Long Li

Subject: [[PATCH v1] 28/37] [CIFS] SMBD: Implement API for upper layer to destroy the transport

From: Long Li <[email protected]>

The upper layer calls to destroy the transport when the SMB share is unmounted or the system is shutting down. SMBD is responsible for disconnecting from the remote peer and freeing all resources after it has disconnected.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 20 ++++++++++++++++++++
fs/cifs/cifsrdma.h | 4 ++++
2 files changed, 24 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index a1ef7f6..fb94975 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -193,6 +193,7 @@ static void cifs_destroy_rdma_work(struct work_struct *work)
kmem_cache_destroy(info->response_cache);

info->transport_status = CIFS_RDMA_DESTROYED;
+ wake_up_all(&info->wait_destroy);
}

static int cifs_rdma_process_disconnected(struct cifs_rdma_info *info)
@@ -1315,6 +1316,22 @@ static void idle_connection_timer(struct work_struct *work)
info->keep_alive_interval*HZ);
}

+void cifs_destroy_rdma_session(struct cifs_rdma_info *info)
+{
+ log_rdma_event("destroying rdma session\n");
+
+ // kick off the disconnection process
+ if (info->transport_status == CIFS_RDMA_CONNECTED)
+ rdma_disconnect(info->id);
+
+ info->server_info->tcpStatus = CifsExiting;
+
+ log_rdma_event("wait for transport being destroyed\n");
+ // wait until the transport is destroyed
+ wait_event(info->wait_destroy,
+ info->transport_status == CIFS_RDMA_DESTROYED);
+}
+
int cifs_reconnect_rdma_session(struct TCP_Server_Info *server)
{
log_rdma_event("reconnecting rdma session\n");
@@ -1423,6 +1440,8 @@ struct cifs_rdma_info* cifs_create_rdma_session(
conn_param.rnr_retry_count = 6;
conn_param.flow_control = 0;
init_waitqueue_head(&info->conn_wait);
+ init_waitqueue_head(&info->wait_destroy);
+
rc = rdma_connect(info->id, &conn_param);
if (rc) {
log_rdma_event("rdma_connect() failed with %i\n", rc);
@@ -1483,6 +1502,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(

// negotiation failed
log_rdma_event("cifs_rdma_negotiate rc=%d\n", rc);
+ cifs_destroy_rdma_session(info);

out2:
rdma_destroy_id(info->id);
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 5d0d86d..424ea8e 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -62,6 +62,7 @@ struct cifs_rdma_info {
int ri_rc;
struct completion ri_done;
wait_queue_head_t conn_wait;
+ wait_queue_head_t wait_destroy;

struct completion negotiate_completion;
bool negotiate_done;
@@ -229,6 +230,9 @@ struct cifs_rdma_info* cifs_create_rdma_session(
// Reconnect SMBDirect session
int cifs_reconnect_rdma_session(struct TCP_Server_Info *server);

+// Destroy SMBDirect session
+void cifs_destroy_rdma_session(struct cifs_rdma_info *info);
+
// SMBDirect interface for carrying upper layer CIFS I/O
int cifs_rdma_read(
struct cifs_rdma_info *rdma, char *buf, unsigned int to_read);
--
2.7.4

2017-08-02 20:14:28

by Long Li

Subject: [[PATCH v1] 27/37] [CIFS] SMBD: Destroy transport when RDMA channel is disconnected

From: Long Li <[email protected]>

When the RDMA connection is terminated, SMBD needs to destroy the transport and free its resources. At this point, we no longer have an RDMA connection to the server.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 2 ++
2 files changed, 58 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index fe7d1f8d..a1ef7f6 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -148,6 +148,59 @@ do { \
info->send_credit_target); \
} while (0)

+static void cifs_destroy_rdma_work(struct work_struct *work)
+{
+ unsigned long flags;
+ struct cifs_rdma_response *response;
+ struct cifs_rdma_info *info =
+ container_of(work, struct cifs_rdma_info, destroy_work);
+
+ log_rdma_event("cancelling all pending works\n");
+
+ cancel_delayed_work_sync(&info->idle_timer_work);
+ cancel_delayed_work_sync(&info->send_immediate_work);
+
+ ib_drain_qp(info->id->qp);
+ rdma_destroy_qp(info->id);
+
+ log_rdma_event("wait for all send or recv finish\n");
+ wait_event(info->wait_send_pending,
+ atomic_read(&info->send_pending) == 0);
+ wait_event(info->wait_recv_pending,
+ atomic_read(&info->recv_pending) == 0);
+
+ ib_free_cq(info->cq);
+ ib_dealloc_pd(info->pd);
+ rdma_destroy_id(info->id);
+
+ log_rdma_event("drain the reassembly queue\n");
+ spin_lock_irqsave(&info->reassembly_queue_lock, flags);
+ while ((response = _get_first_reassembly(info))) {
+ put_receive_buffer(info, response);
+ }
+ atomic_set(&info->reassembly_data_length, 0);
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+ wake_up(&info->wait_reassembly_queue);
+
+ log_rdma_event("free buffers\n");
+ destroy_receive_buffers(info);
+
+ // free mempools
+ mempool_destroy(info->request_mempool);
+ kmem_cache_destroy(info->request_cache);
+
+ mempool_destroy(info->response_mempool);
+ kmem_cache_destroy(info->response_cache);
+
+ info->transport_status = CIFS_RDMA_DESTROYED;
+}
+
+static int cifs_rdma_process_disconnected(struct cifs_rdma_info *info)
+{
+ schedule_work(&info->destroy_work);
+ return 0;
+}
+
/* Upcall from RDMA CM */
static int cifs_rdma_conn_upcall(
struct rdma_cm_id *id, struct rdma_cm_event *event)
@@ -186,6 +239,7 @@ static int cifs_rdma_conn_upcall(

case RDMA_CM_EVENT_DISCONNECTED:
info->transport_status = CIFS_RDMA_DISCONNECTED;
+ cifs_rdma_process_disconnected(info);
break;

default:
@@ -1421,6 +1475,8 @@ struct cifs_rdma_info* cifs_create_rdma_session(
init_waitqueue_head(&info->wait_recv_pending);
atomic_set(&info->recv_pending, 0);

+ INIT_WORK(&info->destroy_work, cifs_destroy_rdma_work);
+
rc = cifs_rdma_negotiate(info);
if (!rc)
return info;
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 025457c..5d0d86d 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -66,6 +66,8 @@ struct cifs_rdma_info {
struct completion negotiate_completion;
bool negotiate_done;

+ struct work_struct destroy_work;
+
 //connection parameters
int receive_credit_max;
int send_credit_target;
--
2.7.4
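
Note on patch 27/37: the ordering constraint in the destroy work (record the disconnect first, then free resources only after all pending sends and receives have completed) can be sketched in plain user-space C. This is a single-threaded model under stated assumptions, not the kernel code: the `transport_model` type, the `T_*` names, and both helpers are hypothetical stand-ins for `struct cifs_rdma_info`, the `CIFS_RDMA_*` status values, and the `wait_event()` calls in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the CIFS_RDMA_* status values in the patch. */
enum transport_status { T_CONNECTED, T_DISCONNECTED, T_DESTROYED };

struct transport_model {
	enum transport_status status;
	int send_pending;	/* sends posted but not yet completed */
	int recv_pending;	/* receives not yet completed */
};

/* Mirrors the RDMA_CM_EVENT_DISCONNECTED case: only record the new state. */
static void on_disconnected(struct transport_model *t)
{
	t->status = T_DISCONNECTED;
}

/*
 * Mirrors the destroy work: resources may be freed only after the
 * connection is down and every pending send and receive has finished.
 * Returns false where the kernel code would keep sleeping in wait_event().
 */
static bool try_destroy(struct transport_model *t)
{
	if (t->status != T_DISCONNECTED)
		return false;
	if (t->send_pending != 0 || t->recv_pending != 0)
		return false;
	t->status = T_DESTROYED;
	return true;
}
```

In the patch itself the waiting is done by the work item scheduled from the CM upcall, so the upcall never blocks; the model only captures the conditions that must hold before the final status change.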

2017-08-02 20:14:26

by Long Li

Subject: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer

From: Long Li <[email protected]>

This is part of the keep alive protocol. Each peer maintains an idle connection timer and sends a message to the remote peer when the timer expires.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 25 +++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 6 ++++--
2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index e275834..523c80f 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -89,6 +89,7 @@ static int send_credit_target = 512;
static int max_send_size = 8192;
static int max_fragmented_recv_size = 1024*1024;
static int max_receive_size = 8192;
+static int keep_alive_interval = 120;

// maximum number of SGEs in a RDMA I/O
static int max_send_sge = 16;
@@ -1200,6 +1201,25 @@ static void destroy_receive_buffers(struct cifs_rdma_info *info)
mempool_free(response, info->response_mempool);
}

+// Implement idle connection timer [MS-SMBD] 3.1.6.2
+static void idle_connection_timer(struct work_struct *work)
+{
+ struct cifs_rdma_info *info = container_of(
+ work, struct cifs_rdma_info,
+ idle_timer_work.work);
+
+ if (info->keep_alive_requested != KEEP_ALIVE_NONE)
+ log_keep_alive("error status info->keep_alive_requested=%d\n",
+ info->keep_alive_requested);
+
+ log_keep_alive("about to send an empty idle message\n");
+ cifs_rdma_post_send_empty(info);
+
+ // setup the next idle timeout work
+ schedule_delayed_work(&info->idle_timer_work,
+ info->keep_alive_interval*HZ);
+}
+
int cifs_reconnect_rdma_session(struct TCP_Server_Info *server)
{
log_rdma_event("reconnecting rdma session\n");
@@ -1262,6 +1282,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
info->max_send_size = max_send_size;
info->max_fragmented_recv_size = max_fragmented_recv_size;
info->max_receive_size = max_receive_size;
+ info->keep_alive_interval = keep_alive_interval;

max_send_sge = min_t(int, max_send_sge,
info->id->device->attrs.max_sge);
@@ -1348,6 +1369,10 @@ struct cifs_rdma_info* cifs_create_rdma_session(
init_waitqueue_head(&info->wait_send_queue);
init_waitqueue_head(&info->wait_reassembly_queue);

+ INIT_DELAYED_WORK(&info->idle_timer_work, idle_connection_timer);
+ schedule_delayed_work(&info->idle_timer_work,
+ info->keep_alive_interval*HZ);
+
init_waitqueue_head(&info->wait_send_pending);
atomic_set(&info->send_pending, 0);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 62d0bb8..dd497ce 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -73,6 +73,7 @@ struct cifs_rdma_info {
int max_fragmented_recv_size;
int max_fragmented_send_size;
int max_receive_size;
+ int keep_alive_interval;
int max_readwrite_size;
enum keep_alive_status keep_alive_requested;
int protocol;
@@ -101,12 +102,13 @@ struct cifs_rdma_info {

wait_queue_head_t wait_send_queue;

+ bool full_packet_received;
+ struct delayed_work idle_timer_work;
+
// request pool for RDMA send
struct kmem_cache *request_cache;
mempool_t *request_mempool;

- bool full_packet_received;
-
// response pool for RDMA receive
struct kmem_cache *response_cache;
mempool_t *response_mempool;
--
2.7.4
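
Note on patch 25/37: the idle connection timer re-arms itself every `keep_alive_interval * HZ` ticks and sends an empty message each time it fires. The following user-space sketch models only that arithmetic and the re-arm step. The `idle_timer` struct and its helpers are hypothetical, and the tick rate is passed in as a parameter because the real value of `HZ` depends on the kernel configuration.

```c
/* Delay until the next idle timeout, in ticks (interval in seconds). */
static unsigned long idle_delay_ticks(int keep_alive_interval, unsigned long hz)
{
	return (unsigned long)keep_alive_interval * hz;
}

/* Hypothetical model of the delayed work item. */
struct idle_timer {
	unsigned long next_fire;	/* tick at which the timer expires */
	unsigned int empty_sent;	/* number of empty messages sent */
};

static void idle_timer_arm(struct idle_timer *t, unsigned long now,
			   int keep_alive_interval, unsigned long hz)
{
	t->next_fire = now + idle_delay_ticks(keep_alive_interval, hz);
}

/* On expiry: send one empty message, then schedule the next timeout. */
static void idle_timer_tick(struct idle_timer *t, unsigned long now,
			    int keep_alive_interval, unsigned long hz)
{
	if (now < t->next_fire)
		return;
	t->empty_sent++;
	idle_timer_arm(t, now, keep_alive_interval, hz);
}
```

With the default interval of 120 seconds from the patch and an assumed tick rate of 250, the timer fires every 30000 ticks.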

2017-08-02 20:14:26

by Long Li

Subject: [[PATCH v1] 26/37] [CIFS] SMBD: Send an immediate packet when it's needed

From: Long Li <[email protected]>

When receive credits are exhausted or nearly exhausted, a peer needs to promptly extend new credits to the remote peer after freeing local resources for RDMA operations. When there is no data to send and we want to extend credits to the server, an empty packet is sent to carry the credits.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 42 ++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 3 +++
2 files changed, 45 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 523c80f..fe7d1f8d 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -318,6 +318,20 @@ static bool process_negotiation_response(struct cifs_rdma_response *response, in
return true;
}

+/*
+ * Check and schedule to send an immediate packet
+ * This is used to extend credits to remote peer to keep the transport busy
+ */
+static void check_and_send_immediate(struct cifs_rdma_info *info)
+{
+ info->send_immediate = true;
+
+ // promptly send a packet if running low on receive credits
+ if (atomic_read(&info->receive_credits) <
+ atomic_read(&info->receive_credit_target) - 1)
+ schedule_delayed_work(&info->send_immediate_work, 0);
+}
+
/* Called from softirq, when recv is done */
static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
{
@@ -379,6 +393,8 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
info->keep_alive_requested = KEEP_ALIVE_PENDING;
}

+ check_and_send_immediate(info);
+
// process receive queue
if (le32_to_cpu(data_transfer->data_length)) {
if (info->full_packet_received) {
@@ -630,6 +646,9 @@ static int manage_credits_prior_sending(struct cifs_rdma_info *info)
atomic_add(ret, &info->receive_credits);
log_transport_credit(info);

+ if (ret)
+ info->send_immediate = false;
+
return ret;
}

@@ -1155,6 +1174,9 @@ static void put_receive_buffer(
info->count_receive_buffer++;
info->count_put_receive_buffer++;
spin_unlock_irqrestore(&info->receive_queue_lock, flags);
+
+ // now we can post new receive credits
+ check_and_send_immediate(info);
}

static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf)
@@ -1201,6 +1223,25 @@ static void destroy_receive_buffers(struct cifs_rdma_info *info)
mempool_free(response, info->response_mempool);
}

+/*
+ * Check and send an immediate or keep alive packet
+ * The conditions to send those packets are defined in [MS-SMBD] 3.1.1.1
+ * Connection.KeepaliveRequested and Connection.SendImmediate
+ * The idea is to extend credits to server as soon as they become available
+ */
+static void send_immediate_work(struct work_struct *work)
+{
+ struct cifs_rdma_info *info = container_of(
+ work, struct cifs_rdma_info,
+ send_immediate_work.work);
+
+ if (info->keep_alive_requested == KEEP_ALIVE_PENDING ||
+ info->send_immediate) {
+ log_keep_alive("send an empty message\n");
+ cifs_rdma_post_send_empty(info);
+ }
+}
+
// Implement idle connection timer [MS-SMBD] 3.1.6.2
static void idle_connection_timer(struct work_struct *work)
{
@@ -1370,6 +1411,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
init_waitqueue_head(&info->wait_reassembly_queue);

INIT_DELAYED_WORK(&info->idle_timer_work, idle_connection_timer);
+ INIT_DELAYED_WORK(&info->send_immediate_work, send_immediate_work);
schedule_delayed_work(&info->idle_timer_work,
info->keep_alive_interval*HZ);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index dd497ce..025457c 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -100,10 +100,13 @@ struct cifs_rdma_info {
// the offset to first buffer in reassembly queue
int first_entry_offset;

+ bool send_immediate;
+
wait_queue_head_t wait_send_queue;

bool full_packet_received;
 struct delayed_work idle_timer_work;
+ struct delayed_work send_immediate_work;

// request pool for RDMA send
struct kmem_cache *request_cache;
--
2.7.4
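
Note on patch 26/37: the two decisions made in this patch (when to schedule the immediate work, and whether the work actually sends an empty message) reduce to two small predicates. The sketch below restates them in user-space C; the function names are hypothetical, and only the two enum values that appear in the patch are modeled.

```c
#include <stdbool.h>

/* Only the two states referenced in the patch are modeled here. */
enum keep_alive_status { KEEP_ALIVE_NONE, KEEP_ALIVE_PENDING };

/*
 * Same comparison as check_and_send_immediate(): schedule the work when
 * the credits currently granted fall more than one short of the target.
 */
static bool should_schedule_immediate(int receive_credits,
				      int receive_credit_target)
{
	return receive_credits < receive_credit_target - 1;
}

/* Same condition as send_immediate_work(): send an empty message or not. */
static bool should_send_empty(enum keep_alive_status keep_alive_requested,
			      bool send_immediate)
{
	return keep_alive_requested == KEEP_ALIVE_PENDING || send_immediate;
}
```

For example, with a target of 10 the work is scheduled at 8 granted credits but not at 9, which leaves a one-credit margin before an extra empty packet is generated.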

2017-08-02 20:14:25

by Long Li

Subject: [[PATCH v1] 31/37] [CIFS] Add SMBD transport to SMB session context

From: Long Li <[email protected]>

With the SMBD transport implemented, add it to the SMB session context so that the upper layer can use the SMBD transport in an SMB session.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsglob.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 20af553..413b011 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -581,6 +581,7 @@ inc_rfc1001_len(void *buf, int count)
}

struct TCP_Server_Info {
+ struct cifs_rdma_info *rdma_ses;
struct list_head tcp_ses_list;
struct list_head smb_ses_list;
int srv_count; /* reference counter */
--
2.7.4

2017-08-02 20:14:23

by Long Li

Subject: [[PATCH v1] 34/37] [CIFS] Reconnect to SMBD transport when it's used

From: Long Li <[email protected]>

When CIFS wants to reconnect an SMB session, reconnect through SMBD if the session is using the SMBD transport.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 1ba5b92..54c1f7c 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -409,7 +409,11 @@ cifs_reconnect(struct TCP_Server_Info *server)

/* we should try only the port we connected to before */
mutex_lock(&server->srv_mutex);
- rc = generic_ip_connect(server);
+ if (server->rdma)
+ rc = cifs_reconnect_rdma_session(server);
+ else
+ rc = generic_ip_connect(server);
+
if (rc) {
cifs_dbg(FYI, "reconnect error %d\n", rc);
mutex_unlock(&server->srv_mutex);
--
2.7.4

2017-08-02 20:14:22

by Long Li

Subject: [[PATCH v1] 36/37] [CIFS] Read from SMBD transport when it's used

From: Long Li <[email protected]>

When receiving data, upper layer looks at which transport is being used. If SMBD is used, read from SMBD.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/connect.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index cc58cd8..5ac8af0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -584,6 +584,10 @@ cifs_read_from_socket(struct TCP_Server_Info *server, char *buf,
{
struct msghdr smb_msg;
struct kvec iov = {.iov_base = buf, .iov_len = to_read};
+
+ if (server->rdma_ses)
+ return cifs_rdma_read(server->rdma_ses, buf, to_read);
+
iov_iter_kvec(&smb_msg.msg_iter, READ | ITER_KVEC, &iov, 1, to_read);

return cifs_readv_from_socket(server, &smb_msg);
@@ -595,6 +599,10 @@ cifs_read_page_from_socket(struct TCP_Server_Info *server, struct page *page,
{
struct msghdr smb_msg;
struct bio_vec bv = {.bv_page = page, .bv_len = to_read};
+
+ if (server->rdma_ses)
+ return cifs_rdma_read_page(server->rdma_ses, page, to_read);
+
iov_iter_bvec(&smb_msg.msg_iter, READ | ITER_BVEC, &bv, 1, to_read);
return cifs_readv_from_socket(server, &smb_msg);
}
--
2.7.4

2017-08-02 20:14:21

by Long Li

Subject: [[PATCH v1] 32/37] [CIFS] Add SMBD debug counters to CIFS debug exports

From: Long Li <[email protected]>

The SMBD transport defines several debug counters. Add them to the common CIFS debug code so that they are exported under /proc.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifs_debug.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

diff --git a/fs/cifs/cifs_debug.c b/fs/cifs/cifs_debug.c
index ba0870d..aed17ee 100644
--- a/fs/cifs/cifs_debug.c
+++ b/fs/cifs/cifs_debug.c
@@ -30,6 +30,7 @@
#include "cifsproto.h"
#include "cifs_debug.h"
#include "cifsfs.h"
+#include "cifsrdma.h"

void
cifs_dump_mem(char *label, void *data, int length)
@@ -152,6 +153,28 @@ static int cifs_debug_data_proc_show(struct seq_file *m, void *v)
list_for_each(tmp1, &cifs_tcp_ses_list) {
server = list_entry(tmp1, struct TCP_Server_Info,
tcp_ses_list);
+
+ if (!server->rdma_ses)
+ goto skip_rdma;
+
+ //RDMA counters
+ seq_printf(m, "\nrdma receive buffer: count, get, put send_empty: %u %u %u %u",
+ server->rdma_ses->count_receive_buffer,
+ server->rdma_ses->count_get_receive_buffer,
+ server->rdma_ses->count_put_receive_buffer,
+ server->rdma_ses->count_send_empty);
+
+ seq_printf(m, "\nrdma reassembly queue: count, enqueue, dequeue: %u %u %u",
+ server->rdma_ses->count_reassembly_queue,
+ server->rdma_ses->count_enqueue_reassembly_queue,
+ server->rdma_ses->count_dequeue_reassembly_queue);
+
+ seq_printf(m, "\nrdma credits: send receive receive_target: %u %u %u",
+ atomic_read(&server->rdma_ses->send_credits),
+ atomic_read(&server->rdma_ses->receive_credits),
+ atomic_read(&server->rdma_ses->receive_credit_target));
+
+skip_rdma:
seq_printf(m, "\nNumber of credits: %d", server->credits);
i++;
list_for_each(tmp2, &server->smb_ses_list) {
--
2.7.4

2017-08-02 20:17:08

by Long Li

Subject: [[PATCH v1] 10/37] [CIFS] SMBD: Introduce wait queue when sending SMBD request

From: Long Li <[email protected]>

Define a wait queue in preparation for implementing SMBD send. SMBD uses a credit-based flow control system: if the client doesn't have enough credits, it must wait for credits.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 1 +
fs/cifs/cifsrdma.h | 2 ++
2 files changed, 3 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 9610897..8aa8a47 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -484,6 +484,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
mempool_free_slab, info->response_cache);

allocate_receive_buffers(info, info->receive_credit_max);
+ init_waitqueue_head(&info->wait_send_queue);
out2:
rdma_destroy_id(info->id);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index e925aa4..287b5b1 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -62,6 +62,8 @@ struct cifs_rdma_info {
struct list_head receive_queue;
spinlock_t receive_queue_lock;

+ wait_queue_head_t wait_send_queue;
+
// request pool for RDMA send
struct kmem_cache *request_cache;
mempool_t *request_mempool;
--
2.7.4
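
Note on patch 10/37: the credit-based flow control that this wait queue supports can be illustrated with a small single-threaded model. It is a sketch under assumptions: `credit_state` and both helpers are hypothetical, and where the kernel code would sleep on `wait_send_queue` the model simply reports that the sender has to wait.

```c
#include <stdbool.h>

/* Hypothetical model of the sender-side credit counter. */
struct credit_state {
	int send_credits;	/* credits granted by the peer, not yet used */
};

/*
 * Consume one credit for one outgoing message.
 * Returns false when no credit is available, i.e. the sender must wait.
 */
static bool try_take_send_credit(struct credit_state *s)
{
	if (s->send_credits <= 0)
		return false;
	s->send_credits--;
	return true;
}

/* Called when an incoming message grants more credits to the sender. */
static void grant_send_credits(struct credit_state *s, int granted)
{
	s->send_credits += granted;
}
```

Each message sent costs exactly one credit, so a sender that has run out stays blocked until the peer grants more.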

2017-08-02 20:17:11

by Long Li

Subject: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload

From: Long Li <[email protected]>

Implement the function to send an SMBD message with no payload. This is required at times when we want to extend credits to the server so that it can continue to send data, without sending any actual data payload.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 11 +++++++
2 files changed, 107 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 989cad8..cf71bb1 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -66,6 +66,7 @@ static int cifs_rdma_post_recv(
struct cifs_rdma_info *info,
struct cifs_rdma_response *response);

+static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info);
static int cifs_rdma_post_send_data(
struct cifs_rdma_info *info,
struct kvec *iov, int n_vec, int remaining_data_length);
@@ -674,6 +675,101 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa
}

/*
+ * Send an empty message
+ * Empty message is used to extend credits to peer or for keep alive
+ */
+static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info)
+{
+ struct cifs_rdma_request *request;
+ struct smbd_data_transfer_no_data *packet;
+ struct ib_send_wr send_wr, *send_wr_fail;
+ int rc;
+ u16 credits_granted = 0, flags = 0;
+
+ request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+ if (!request) {
+ log_rdma_send("failed to get send buffer for empty packet\n");
+ return -ENOMEM;
+ }
+
+ request->info = info;
+ packet = (struct smbd_data_transfer_no_data *) request->packet;
+
+ /* nothing to do? */
+ if (credits_granted==0 && flags==0) {
+ mempool_free(request, info->request_mempool);
+ log_keep_alive("nothing to do, not sending anything\n");
+ return 0;
+ }
+
+ packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->credits_granted = cpu_to_le16(credits_granted);
+ packet->flags = cpu_to_le16(flags);
+ packet->reserved = cpu_to_le16(0);
+ packet->remaining_data_length = cpu_to_le32(0);
+ packet->data_offset = cpu_to_le32(0);
+ packet->data_length = cpu_to_le32(0);
+
+ log_outgoing("credits_requested=%d credits_granted=%d data_offset=%d "
+ "data_length=%d remaining_data_length=%d\n",
+ le16_to_cpu(packet->credits_requested),
+ le16_to_cpu(packet->credits_granted),
+ le32_to_cpu(packet->data_offset),
+ le32_to_cpu(packet->data_length),
+ le32_to_cpu(packet->remaining_data_length));
+
+ request->num_sge = 1;
+ request->sge = kzalloc(sizeof(struct ib_sge), GFP_KERNEL);
+ if (!request->sge) {
+ rc = -ENOMEM;
+ goto allocate_sge_failed;
+ }
+
+ request->sge[0].addr = ib_dma_map_single(info->id->device,
+ (void *)packet, sizeof(*packet), DMA_TO_DEVICE);
+ if(ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failure;
+ }
+
+ request->sge[0].length = sizeof(*packet);
+ request->sge[0].lkey = info->pd->local_dma_lkey;
+ ib_dma_sync_single_for_device(info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+ wait_event(info->wait_send_queue, atomic_read(&info->send_credits) > 0);
+ atomic_dec(&info->send_credits);
+ info->count_send_empty++;
+ log_rdma_send("rdma_request sge addr=%llu length=%u lkey=%u\n",
+ request->sge[0].addr, request->sge[0].length,
+ request->sge[0].lkey);
+
+ request->cqe.done = send_done;
+
+ send_wr.next = NULL;
+ send_wr.wr_cqe = &request->cqe;
+ send_wr.sg_list = request->sge;
+ send_wr.num_sge = request->num_sge;
+ send_wr.opcode = IB_WR_SEND;
+ send_wr.send_flags = IB_SEND_SIGNALED;
+
+ rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+ if (!rc)
+ return 0;
+
+ log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ ib_dma_unmap_single(info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+dma_mapping_failure:
+ kfree(request->sge);
+
+allocate_sge_failed:
+ mempool_free(request, info->request_mempool);
+ return rc;
+}
+
+/*
* Send a data buffer
* iov: the iov array describing the data buffers
* n_vec: number of iov array
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index a766cbf..4a4c191 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -120,6 +120,17 @@ struct smbd_negotiate_resp {
__le32 max_fragmented_size;
} __packed;

+// SMBD data transfer packet with no payload [MS-SMBD] 2.2.3
+struct smbd_data_transfer_no_data {
+ __le16 credits_requested;
+ __le16 credits_granted;
+ __le16 flags;
+ __le16 reserved;
+ __le32 remaining_data_length;
+ __le32 data_offset;
+ __le32 data_length;
+} __packed;
+
// SMBD data transfer packet with payload [MS-SMBD] 2.2.3
struct smbd_data_transfer {
__le16 credits_requested;
--
2.7.4
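
Note on patch 16/37: the empty data transfer message is a fixed 20-byte header with all three data fields set to zero. The sketch below builds the same wire layout in user-space C by writing little-endian fields byte by byte, so it does not depend on host endianness or struct packing. The helper names and `SMBD_NO_DATA_LEN` are hypothetical; the field order and widths are taken from `struct smbd_data_transfer_no_data` in the patch.

```c
#include <stdint.h>
#include <string.h>

/* 3 x 16-bit fields + 16-bit reserved + 3 x 32-bit fields = 20 bytes. */
#define SMBD_NO_DATA_LEN 20

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Fill `out` with an empty data transfer message.
 * Offsets: 0 credits_requested, 2 credits_granted, 4 flags, 6 reserved,
 * 8 remaining_data_length, 12 data_offset, 16 data_length.
 */
static void build_empty_message(uint8_t out[SMBD_NO_DATA_LEN],
				uint16_t credits_requested,
				uint16_t credits_granted,
				uint16_t flags)
{
	memset(out, 0, SMBD_NO_DATA_LEN);	/* reserved and data fields stay 0 */
	put_le16(out + 0, credits_requested);
	put_le16(out + 2, credits_granted);
	put_le16(out + 4, flags);
}
```

Because no payload follows, the receiver learns only the credit and flag values from such a message.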

2017-08-02 20:17:10

by Long Li

Subject: [[PATCH v1] 03/37] [CIFS] SMBD: Add logging functions for debug

From: Long Li <[email protected]>

SMBD transport code will use those logging functions for debug.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index a2c0478..3c9f478 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -54,3 +54,55 @@

#include "cifsrdma.h"

+/* Logging functions
+ * Log messages are defined as classes. They can be ORed to define the actual
+ * logging level via module parameter rdma_logging_class
+ * e.g. cifs.rdma_logging_class=0x500 will log all log_rdma_recv() and
+ * log_rdma_event()
+ */
+#define LOG_CREDIT 0x1
+#define LOG_OUTGOING 0x2
+#define LOG_INCOMING 0x4
+#define LOG_RECEIVE_QUEUE 0x8
+#define LOG_REASSEMBLY_QUEUE 0x10
+#define LOG_CIFS_READ 0x20
+#define LOG_CIFS_WRITE 0x40
+#define LOG_RDMA_SEND 0x80
+#define LOG_RDMA_RECV 0x100
+#define LOG_KEEP_ALIVE 0x200
+#define LOG_RDMA_EVENT 0x400
+
+static unsigned int rdma_logging_class = 0;
+module_param(rdma_logging_class, uint, 0644);
+MODULE_PARM_DESC(rdma_logging_class,
+ "Logging class for SMBD transport 0x0 to 0x7ff");
+
+#define log_rdma(class, fmt, args...) \
+do { \
+ if (class & rdma_logging_class) \
+ cifs_dbg(VFS, "%s:%d " fmt, __func__, __LINE__, ##args);\
+} while (0)
+
+#define log_rdma_credit(fmt, args...) log_rdma(LOG_CREDIT, fmt, ##args)
+#define log_outgoing(fmt, args...) log_rdma(LOG_OUTGOING, fmt, ##args)
+#define log_incoming(fmt, args...) log_rdma(LOG_INCOMING, fmt, ##args)
+#define log_receive_queue(fmt, args...) \
+ log_rdma(LOG_RECEIVE_QUEUE, fmt, ##args)
+#define log_reassembly_queue(fmt, args...) \
+ log_rdma(LOG_REASSEMBLY_QUEUE, fmt, ##args)
+#define log_cifs_read(fmt, args...) log_rdma(LOG_CIFS_READ, fmt, ##args)
+#define log_cifs_write(fmt, args...) log_rdma(LOG_CIFS_WRITE, fmt, ##args)
+#define log_rdma_send(fmt, args...) log_rdma(LOG_RDMA_SEND, fmt, ##args)
+#define log_rdma_recv(fmt, args...) log_rdma(LOG_RDMA_RECV, fmt, ##args)
+#define log_keep_alive(fmt, args...) log_rdma(LOG_KEEP_ALIVE, fmt, ##args)
+#define log_rdma_event(fmt, args...) log_rdma(LOG_RDMA_EVENT, fmt, ##args)
+
+#define log_transport_credit(info) \
+do { \
+ log_rdma_credit("receive_credits %d receive_credit_target %d " \
+ "send_credits %d send_credit_target %d\n", \
+ atomic_read(&info->receive_credits), \
+ atomic_read(&info->receive_credit_target), \
+ atomic_read(&info->send_credits), \
+ info->send_credit_target); \
+} while (0)
--
2.7.4
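
Note on patch 03/37: each logging class is a single bit, so the module parameter is a bit mask and a message is printed only when its class bit is set. The sketch below checks the example from the patch comment (0x500 enables `log_rdma_recv()` and `log_rdma_event()`) in user-space C. The `class_enabled()` helper is hypothetical; the three class values are copied from the patch.

```c
#include <stdbool.h>

/* Class bits copied from the patch. */
#define LOG_RDMA_RECV	0x100
#define LOG_KEEP_ALIVE	0x200
#define LOG_RDMA_EVENT	0x400

/* True when a message of class `cls` would pass the log_rdma() test. */
static bool class_enabled(unsigned int logging_class, unsigned int cls)
{
	return (logging_class & cls) != 0;
}
```

Since 0x500 is `LOG_RDMA_RECV | LOG_RDMA_EVENT`, keep alive messages stay silent with that setting, and ORing all eleven classes defined in the patch gives 0x7ff.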

2017-08-02 20:17:07

by Long Li

Subject: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to receive data

From: Long Li <[email protected]>

With the reassembly queue in place, implement the API for the upper layer to receive data. This call may sleep if there is not enough data in the reassembly queue.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 5 +++
2 files changed, 115 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 1d3fd26..e5f6300 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -1316,6 +1316,116 @@ struct cifs_rdma_info* cifs_create_rdma_session(
}

/*
+ * Read data from receive reassembly queue
+ * All the incoming data packets are placed in reassembly queue
+ * buf: the buffer to read data into
+ * size: the length of data to read
+ * return value: actual data read
+ */
+int cifs_rdma_read(struct cifs_rdma_info *info, char *buf, unsigned int size)
+{
+ struct cifs_rdma_response *response;
+ struct smbd_data_transfer *data_transfer;
+ unsigned long flags;
+ int to_copy, to_read, data_read, offset;
+ u32 data_length, remaining_data_length, data_offset;
+
+again:
+ // the transport is disconnected?
+ if (info->transport_status != CIFS_RDMA_CONNECTED) {
+ log_cifs_read("disconnected\n");
+
+ /*
+ * If upper layer code is reading SMB packet length
+ * return 0 to indicate transport is disconnected and
+ * trigger a reconnect.
+ */
+ spin_lock_irqsave(&info->reassembly_queue_lock, flags);
+ response = _get_first_reassembly(info);
+ if (response && response->first_segment && size==4) {
+ memset(buf, 0, size);
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+ return size;
+ }
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+ return 0;
+ }
+
+ spin_lock_irqsave(&info->reassembly_queue_lock, flags);
+ log_cifs_read("size=%d info->reassembly_data_length=%d\n", size,
+ atomic_read(&info->reassembly_data_length));
+ if (atomic_read(&info->reassembly_data_length) >= size) {
+ data_read = 0;
+ to_read = size;
+ offset = info->first_entry_offset;
+ while(data_read < size) {
+ response = _get_first_reassembly(info);
+ data_transfer = (struct smbd_data_transfer *) response->packet;
+
+ data_length = le32_to_cpu(data_transfer->data_length);
+ remaining_data_length =
+ le32_to_cpu(data_transfer->remaining_data_length);
+ data_offset = le32_to_cpu(data_transfer->data_offset);
+
+ // this is for reading rfc1002 length
+ if (response->first_segment && size==4) {
+ unsigned int rfc1002_len =
+ data_length + remaining_data_length;
+ *((__be32*)buf) = cpu_to_be32(rfc1002_len);
+ data_read = 4;
+ response->first_segment = false;
+ log_cifs_read("returning rfc1002 length %d\n",
+ rfc1002_len);
+ goto read_rfc1002_done;
+ }
+
+ to_copy = min_t(int, data_length - offset, to_read);
+ memcpy(
+ buf + data_read,
+ (char*)data_transfer + data_offset + offset,
+ to_copy);
+
+ // move on to the next buffer?
+ if (to_copy == data_length - offset) {
+ list_del(&response->list);
+ info->count_reassembly_queue--;
+ info->count_dequeue_reassembly_queue++;
+ put_receive_buffer(info, response);
+ offset = 0;
+ log_cifs_read("put_receive_buffer offset=0\n");
+ } else
+ offset += to_copy;
+
+ to_read -= to_copy;
+ data_read += to_copy;
+
+ log_cifs_read("_get_first_reassembly memcpy %d bytes "
+ "data_transfer_length-offset=%d after that "
+ "to_read=%d data_read=%d offset=%d\n",
+ to_copy, data_length - offset,
+ to_read, data_read, offset);
+ }
+ atomic_sub(data_read, &info->reassembly_data_length);
+ info->first_entry_offset = offset;
+ log_cifs_read("returning to thread data_read=%d "
+ "reassembly_data_length=%d first_entry_offset=%d\n",
+ data_read, atomic_read(&info->reassembly_data_length),
+ info->first_entry_offset);
+read_rfc1002_done:
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+ return data_read;
+ }
+
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+ log_cifs_read("wait_event on more data\n");
+ wait_event(
+ info->wait_reassembly_queue,
+ atomic_read(&info->reassembly_data_length) >= size ||
+ info->transport_status != CIFS_RDMA_CONNECTED);
+ goto again;
+}
+
+/*
* Write data to transport
* Each rqst is transported as a SMBDirect payload
* rqst: the data to write
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index b26e3b7..8891e21 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -89,6 +89,8 @@ struct cifs_rdma_info {

// total data length of reassembly queue
atomic_t reassembly_data_length;
+ // the offset to first buffer in reassembly queue
+ int first_entry_offset;

wait_queue_head_t wait_send_queue;

@@ -210,5 +212,8 @@ struct cifs_rdma_response {
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);

+// SMBDirect interface for carrying upper layer CIFS I/O
+int cifs_rdma_read(
+ struct cifs_rdma_info *rdma, char *buf, unsigned int to_read);
int cifs_rdma_write(struct cifs_rdma_info *rdma, struct smb_rqst *rqst);
#endif
--
2.7.4
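
Note on patch 21/37: `cifs_rdma_read()` does two things that are easy to lose in the diff. It synthesizes the 4-byte RFC 1002 length from the first segment's `data_length + remaining_data_length`, and it copies a read that may span several queued segments while remembering the offset into the first unconsumed one. The user-space sketch below models both under stated assumptions: an array of `segment` entries stands in for the reassembly list, and all names are hypothetical. Locking, waiting, and buffer recycling are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Big-endian total length of the SMB packet, as the upper layer expects. */
static void rfc1002_length_bytes(uint32_t data_length,
				 uint32_t remaining_data_length,
				 uint8_t out[4])
{
	uint32_t len = data_length + remaining_data_length;

	out[0] = (uint8_t)(len >> 24);
	out[1] = (uint8_t)(len >> 16);
	out[2] = (uint8_t)(len >> 8);
	out[3] = (uint8_t)len;
}

/* Hypothetical stand-in for one queued response payload. */
struct segment {
	const char *data;
	size_t len;
};

/*
 * Copy up to `size` bytes starting at segs[*first] + *offset.
 * A fully consumed segment advances *first and resets *offset to 0;
 * a partially consumed one only advances *offset.
 */
static size_t read_segments(const struct segment *segs, size_t nseg,
			    size_t *first, size_t *offset,
			    char *buf, size_t size)
{
	size_t done = 0;

	while (done < size && *first < nseg) {
		const struct segment *s = &segs[*first];
		size_t avail = s->len - *offset;
		size_t n = (size - done < avail) ? size - done : avail;

		memcpy(buf + done, s->data + *offset, n);
		done += n;
		if (n == avail) {
			(*first)++;
			*offset = 0;
		} else {
			*offset += n;
		}
	}
	return done;
}
```

The saved offset plays the role of `first_entry_offset` in the patch: a later read resumes in the middle of a segment instead of re-reading it from the start.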

2017-08-02 20:18:06

by Long Li

Subject: [[PATCH v1] 20/37] [CIFS] SMBD: Implement reassembly queue for receiving data

From: Long Li <[email protected]>

All the data received via SMBD is placed in the reassembly queue. When the upper layer reads data, it looks into the reassembly queue. The SMBD layer is responsible for passing a complete SMB packet to the upper layer.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 11 +++++++++++
2 files changed, 63 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 97cde3f..1d3fd26 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -62,6 +62,12 @@ static void put_receive_buffer(
static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf);
static void destroy_receive_buffers(struct cifs_rdma_info *info);

+static void enqueue_reassembly(
+ struct cifs_rdma_info *info,
+ struct cifs_rdma_response *response, int data_length);
+static struct cifs_rdma_response* _get_first_reassembly(
+ struct cifs_rdma_info *info);
+
static int cifs_rdma_post_recv(
struct cifs_rdma_info *info,
struct cifs_rdma_response *response);
@@ -377,6 +383,12 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
else
info->full_packet_received = true;

+ enqueue_reassembly(
+ info,
+ response,
+ le32_to_cpu(data_transfer->data_length));
+
+ wake_up(&info->wait_reassembly_queue);
goto queue_done;
}

@@ -1041,6 +1053,41 @@ static int cifs_rdma_negotiate(struct cifs_rdma_info *info)
}

/*
+ * Implement Connection.FragmentReassemblyBuffer defined in [MS-SMBD] 3.1.1.1
+ * This is a queue for reassembling upper layer payload and presenting it to
+ * the upper layer. All incoming payloads go to the reassembly queue,
+ * regardless of whether reassembly is required. The upper layer code reads
+ * from the queue for any incoming payloads.
+ */
+static void enqueue_reassembly(
+ struct cifs_rdma_info *info,
+ struct cifs_rdma_response *response,
+ int data_length)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&info->reassembly_queue_lock, flags);
+ list_add_tail(&response->list, &info->reassembly_queue);
+ atomic_add(data_length, &info->reassembly_data_length);
+ log_reassembly_queue("info->reassembly_data_length=%d\n",
+ atomic_read(&info->reassembly_data_length));
+ info->count_reassembly_queue++;
+ info->count_enqueue_reassembly_queue++;
+ spin_unlock_irqrestore(&info->reassembly_queue_lock, flags);
+}
+
+static struct cifs_rdma_response * _get_first_reassembly(struct cifs_rdma_info *info)
+{
+ struct cifs_rdma_response *ret = NULL;
+
+ if (!list_empty(&info->reassembly_queue)) {
+ ret = list_first_entry(
+ &info->reassembly_queue,
+ struct cifs_rdma_response, list);
+ }
+ return ret;
+}
+
+/*
* Receive buffer operations.
* For each remote send, we need to post a receive. The receive buffers are
* pre-allocated in advance.
@@ -1084,6 +1131,10 @@ static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf)
int i;
struct cifs_rdma_response *response;

+ INIT_LIST_HEAD(&info->reassembly_queue);
+ spin_lock_init(&info->reassembly_queue_lock);
+ atomic_set(&info->reassembly_data_length, 0);
+
INIT_LIST_HEAD(&info->receive_queue);
spin_lock_init(&info->receive_queue_lock);

@@ -1241,6 +1292,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(

allocate_receive_buffers(info, info->receive_credit_max);
init_waitqueue_head(&info->wait_send_queue);
+ init_waitqueue_head(&info->wait_reassembly_queue);

init_waitqueue_head(&info->wait_send_pending);
atomic_set(&info->send_pending, 0);
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 90746a4..b26e3b7 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -82,6 +82,14 @@ struct cifs_rdma_info {
struct list_head receive_queue;
spinlock_t receive_queue_lock;

+ /* reassembly queue */
+ struct list_head reassembly_queue;
+ spinlock_t reassembly_queue_lock;
+ wait_queue_head_t wait_reassembly_queue;
+
+ // total data length of reassembly queue
+ atomic_t reassembly_data_length;
+
wait_queue_head_t wait_send_queue;

// request pool for RDMA send
@@ -98,6 +106,9 @@ struct cifs_rdma_info {
unsigned int count_receive_buffer;
unsigned int count_get_receive_buffer;
unsigned int count_put_receive_buffer;
+ unsigned int count_reassembly_queue;
+ unsigned int count_enqueue_reassembly_queue;
+ unsigned int count_dequeue_reassembly_queue;
unsigned int count_send_empty;
};

--
2.7.4
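
The reassembly queue added in patch 20/37 can be illustrated with a minimal userspace sketch. It assumes a plain singly linked FIFO in place of the kernel's `struct list_head`, and it omits the spinlock and wait queue that the patch uses. The names `struct response`, `struct reassembly_queue`, `queue_init`, `queue_enqueue` and `queue_first` are hypothetical stand-ins, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct cifs_rdma_response. */
struct response {
	struct response *next;
	int data_length;
};

/* Hypothetical stand-in for the reassembly fields of struct cifs_rdma_info. */
struct reassembly_queue {
	struct response *head;
	struct response *tail;
	int data_length;	/* total payload bytes currently queued */
	unsigned int count;	/* number of queued responses */
};

static void queue_init(struct reassembly_queue *q)
{
	q->head = NULL;
	q->tail = NULL;
	q->data_length = 0;
	q->count = 0;
}

/* Append at the tail, as enqueue_reassembly() does with list_add_tail(). */
static void queue_enqueue(struct reassembly_queue *q, struct response *r,
			  int data_length)
{
	assert(q && r);
	r->next = NULL;
	r->data_length = data_length;
	if (q->tail)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
	q->data_length += data_length;
	q->count++;
}

/* Peek at the oldest entry without removing it; NULL when the queue is empty. */
static struct response *queue_first(const struct reassembly_queue *q)
{
	return q->head;
}
```

Keeping a running total of queued bytes lets a reader decide whether enough data has arrived without walking the list, which appears to be the purpose of `reassembly_data_length` in the patch.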

2017-08-02 20:18:14

by Long Li

Subject: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer message with page payload

From: Long Li <[email protected]>

Add a function to send an SMBD data transfer message to the server, with the payload taken from a page passed down by the upper layer.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 113 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index aa3d1a5..b3ec109 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -66,6 +66,10 @@ static int cifs_rdma_post_recv(
struct cifs_rdma_info *info,
struct cifs_rdma_response *response);

+static int cifs_rdma_post_send_page(struct cifs_rdma_info *info,
+ struct page *page, unsigned long offset,
+ size_t size, int remaining_data_length);
+
/*
* Per RDMA transport connection parameters
* as defined in [MS-SMBD] 3.1.1.1
@@ -558,6 +562,115 @@ static int cifs_rdma_post_send_negotiate_req(struct cifs_rdma_info *info)
}

/*
+ * Send a page
+ * page: the page to send
+ * offset: offset in the page to send
+ * size: length in the page to send
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *page,
+ unsigned long offset, size_t size, int remaining_data_length)
+{
+ struct cifs_rdma_request *request;
+ struct smbd_data_transfer *packet;
+ struct ib_send_wr send_wr, *send_wr_fail;
+ int rc = -ENOMEM;
+ int i;
+
+ request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+ if (!request)
+ return rc;
+
+ request->info = info;
+
+ wait_event(info->wait_send_queue, atomic_read(&info->send_credits) > 0);
+ atomic_dec(&info->send_credits);
+
+ packet = (struct smbd_data_transfer *) request->packet;
+ packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->flags = cpu_to_le16(0);
+
+ packet->reserved = cpu_to_le16(0);
+ packet->data_offset = cpu_to_le32(24);
+ packet->data_length = cpu_to_le32(size);
+ packet->remaining_data_length = cpu_to_le32(remaining_data_length);
+
+ packet->padding = cpu_to_le32(0);
+
+ log_outgoing("credits_requested=%d credits_granted=%d data_offset=%d "
+ "data_length=%d remaining_data_length=%d\n",
+ le16_to_cpu(packet->credits_requested),
+ le16_to_cpu(packet->credits_granted),
+ le32_to_cpu(packet->data_offset),
+ le32_to_cpu(packet->data_length),
+ le32_to_cpu(packet->remaining_data_length));
+
+ request->sge = kzalloc(sizeof(struct ib_sge)*2, GFP_KERNEL);
+ if (!request->sge)
+ goto allocate_sge_failed;
+ request->num_sge = 2;
+
+ request->sge[0].addr = ib_dma_map_single(info->id->device,
+ (void *)packet,
+ sizeof(*packet),
+ DMA_BIDIRECTIONAL);
+ if(ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failed;
+ }
+ request->sge[0].length = sizeof(*packet);
+ request->sge[0].lkey = info->pd->local_dma_lkey;
+ ib_dma_sync_single_for_device(info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+ request->sge[1].addr = ib_dma_map_page(info->id->device, page,
+ offset, size, DMA_BIDIRECTIONAL);
+ if(ib_dma_mapping_error(info->id->device, request->sge[1].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failed;
+ }
+ request->sge[1].length = size;
+ request->sge[1].lkey = info->pd->local_dma_lkey;
+ ib_dma_sync_single_for_device(info->id->device, request->sge[1].addr,
+ request->sge[1].length, DMA_TO_DEVICE);
+
+ log_rdma_send("rdma_request sge[0] addr=%llu length=%u lkey=%u sge[1] "
+ "addr=%llu length=%u lkey=%u\n",
+ request->sge[0].addr, request->sge[0].length,
+ request->sge[0].lkey, request->sge[1].addr,
+ request->sge[1].length, request->sge[1].lkey);
+
+ request->cqe.done = send_done;
+
+ send_wr.next = NULL;
+ send_wr.wr_cqe = &request->cqe;
+ send_wr.sg_list = request->sge;
+ send_wr.num_sge = request->num_sge;
+ send_wr.opcode = IB_WR_SEND;
+ send_wr.send_flags = IB_SEND_SIGNALED;
+
+ rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+ if (!rc)
+ return 0;
+
+ // post send failed
+ log_rdma_send("ib_post_send failed rc=%d\n", rc);
+
+dma_mapping_failed:
+ for (i=0; i<2; i++)
+ if (request->sge[i].addr)
+ ib_dma_unmap_single(info->id->device,
+ request->sge[i].addr,
+ request->sge[i].length,
+ DMA_TO_DEVICE);
+ kfree(request->sge);
+
+allocate_sge_failed:
+ mempool_free(request, info->request_mempool);
+ return rc;
+}
+
+/*
* Post a receive request to the transport
* The remote peer can only send data when a receive is posted
* The interaction is controlled by send/recieve credit system
--
2.7.4
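
The header fields that patch 14/37 fills in can be shown as explicit little-endian byte stores. This is a userspace sketch, not the kernel code: the field order is assumed from [MS-SMBD] 2.2.3, the 24-byte size matches the `data_offset` value in the patch, and the helper names `put_le16`, `put_le32` and `smbd_encode_header` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMBD_HDR_SIZE 24	/* matches data_offset = 24 in the patch */

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)((v >> 8) & 0xff);
	p[2] = (uint8_t)((v >> 16) & 0xff);
	p[3] = (uint8_t)(v >> 24);
}

/*
 * Fill a 24-byte data transfer header. The offsets are an assumption
 * based on [MS-SMBD] 2.2.3; credits_granted and flags stay zero here.
 */
static void smbd_encode_header(uint8_t hdr[SMBD_HDR_SIZE],
			       uint16_t credits_requested,
			       uint32_t data_length,
			       uint32_t remaining_data_length)
{
	assert(hdr != NULL);
	memset(hdr, 0, SMBD_HDR_SIZE);
	put_le16(hdr + 0, credits_requested);
	/* hdr + 2: credits_granted, hdr + 4: flags, hdr + 6: reserved */
	put_le32(hdr + 8, remaining_data_length);
	put_le32(hdr + 12, SMBD_HDR_SIZE);	/* data_offset */
	put_le32(hdr + 16, data_length);
	/* hdr + 20: padding, already zero */
}
```

In the kernel the same effect comes from assigning `cpu_to_le16()` and `cpu_to_le32()` results to the `__le16` and `__le32` fields of a packed struct; byte stores are used here only so the sketch does not depend on host endianness.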

2017-08-02 20:18:12

by Long Li

Subject: [[PATCH v1] 13/37] [CIFS] SMBD: Implement SMBD protocol negotiation

From: Long Li <[email protected]>

With all the supporting code in place, implement SMBD protocol negotiation with the SMB server. SMBD negotiation is defined in [MS-SMBD] 3.1.5. After negotiation, the client and server are connected through SMBD and can use it to transfer data payloads.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 29 ++++++++
2 files changed, 237 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index ecbc832..aa3d1a5 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -222,6 +222,80 @@ static void send_done(struct ib_cq *cq, struct ib_wc *wc)
mempool_free(request, request->info->request_mempool);
}

+static void dump_smbd_negotiate_resp(struct smbd_negotiate_resp *resp)
+{
+ log_rdma_event("resp message min_version %u max_version %u "
+ "negotiated_version %u credits_requested %u "
+ "credits_granted %u status %u max_readwrite_size %u "
+ "preferred_send_size %u max_receive_size %u "
+ "max_fragmented_size %u\n",
+ resp->min_version, resp->max_version, resp->negotiated_version,
+ resp->credits_requested, resp->credits_granted, resp->status,
+ resp->max_readwrite_size, resp->preferred_send_size,
+ resp->max_receive_size, resp->max_fragmented_size);
+}
+
+/* Process a negotiation response message, according to [MS-SMBD] 3.1.5.7 */
+static bool process_negotiation_response(struct cifs_rdma_response *response, int packet_length)
+{
+ struct cifs_rdma_info *info = response->info;
+ struct smbd_negotiate_resp *packet =
+ (struct smbd_negotiate_resp *) response->packet;
+
+ if (packet_length < sizeof (struct smbd_negotiate_resp)) {
+ log_rdma_event("error: packet_length=%d\n", packet_length);
+ return false;
+ }
+
+ if (le16_to_cpu(packet->negotiated_version) != 0x100) {
+ log_rdma_event("error: negotiated_version=%x\n",
+ le16_to_cpu(packet->negotiated_version));
+ return false;
+ }
+ info->protocol = le16_to_cpu(packet->negotiated_version);
+
+ if (packet->credits_requested == 0) {
+ log_rdma_event("error: credits_requested==0\n");
+ return false;
+ }
+ atomic_set(&info->receive_credit_target,
+ le16_to_cpu(packet->credits_requested));
+
+ if (packet->credits_granted == 0) {
+ log_rdma_event("error: credits_granted==0\n");
+ return false;
+ }
+ atomic_set(&info->send_credits, le16_to_cpu(packet->credits_granted));
+
+ atomic_set(&info->receive_credits, 0);
+
+ if (le32_to_cpu(packet->preferred_send_size) > info->max_receive_size) {
+ log_rdma_event("error: preferred_send_size=%d\n",
+ le32_to_cpu(packet->preferred_send_size));
+ return false;
+ }
+ info->max_receive_size = le32_to_cpu(packet->preferred_send_size);
+
+ if (le32_to_cpu(packet->max_receive_size) < 128) {
+ log_rdma_event("error: max_receive_size=%d\n",
+ le32_to_cpu(packet->max_receive_size));
+ return false;
+ }
+ info->max_send_size = min_t(int, info->max_send_size,
+ le32_to_cpu(packet->max_receive_size));
+
+ if (le32_to_cpu(packet->max_fragmented_size) < 131072) {
+ log_rdma_event("error: max_fragmented_size=%d\n",
+ le32_to_cpu(packet->max_fragmented_size));
+ return false;
+ }
+ info->max_fragmented_send_size = le32_to_cpu(packet->max_fragmented_size);
+
+ info->max_readwrite_size = le32_to_cpu(packet->max_readwrite_size);
+
+ return true;
+}
+
/* Called from softirq, when recv is done */
static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
{
@@ -248,6 +322,14 @@ static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
DMA_FROM_DEVICE);

switch(response->type) {
+ case SMBD_NEGOTIATE_RESP:
+ dump_smbd_negotiate_resp(
+ (struct smbd_negotiate_resp *) response->packet);
+ info->full_packet_received = true;
+ info->negotiate_done = process_negotiation_response(response, wc->byte_len);
+ complete(&info->negotiate_completion);
+ break;
+
case SMBD_TRANSFER_DATA:
data_transfer = (struct smbd_data_transfer *) response->packet;
atomic_dec(&info->receive_credits);
@@ -397,6 +479,85 @@ static int cifs_rdma_ia_open(
}

/*
+ * Send a negotiation request message to the peer
+ * The negotiation procedure is in [MS-SMBD] 3.1.5.2 and 3.1.5.3
+ * After negotiation, the transport is connected and ready for
+ * carrying upper layer SMB payload
+ */
+static int cifs_rdma_post_send_negotiate_req(struct cifs_rdma_info *info)
+{
+ struct ib_send_wr send_wr, *send_wr_fail;
+ int rc = -ENOMEM;
+ struct cifs_rdma_request *request;
+ struct smbd_negotiate_req *packet;
+
+ request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+ if (!request)
+ return rc;
+
+ request->info = info;
+
+ packet = (struct smbd_negotiate_req *) request->packet;
+ packet->min_version = cpu_to_le16(0x100);
+ packet->max_version = cpu_to_le16(0x100);
+ packet->reserved = cpu_to_le16(0);
+ packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->preferred_send_size = cpu_to_le32(info->max_send_size);
+ packet->max_receive_size = cpu_to_le32(info->max_receive_size);
+ packet->max_fragmented_size =
+ cpu_to_le32(info->max_fragmented_recv_size);
+
+ request->sge = kzalloc(sizeof(struct ib_sge), GFP_KERNEL);
+ if (!request->sge)
+ goto allocate_sge_failed;
+
+ request->num_sge = 1;
+ request->sge[0].addr = ib_dma_map_single(
+ info->id->device, (void *)packet,
+ sizeof(*packet), DMA_TO_DEVICE);
+ if(ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failed;
+ }
+
+ request->sge[0].length = sizeof(*packet);
+ request->sge[0].lkey = info->pd->local_dma_lkey;
+
+ ib_dma_sync_single_for_device(
+ info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+ request->cqe.done = send_done;
+
+ send_wr.next = NULL;
+ send_wr.wr_cqe = &request->cqe;
+ send_wr.sg_list = request->sge;
+ send_wr.num_sge = request->num_sge;
+ send_wr.opcode = IB_WR_SEND;
+ send_wr.send_flags = IB_SEND_SIGNALED;
+
+ log_rdma_send("sge addr=%llx length=%x lkey=%x\n",
+ request->sge[0].addr,
+ request->sge[0].length, request->sge[0].lkey);
+
+ rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+ if (!rc)
+ return 0;
+
+ // if we reach here, post send failed
+ log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ ib_dma_unmap_single(info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+dma_mapping_failed:
+ kfree(request->sge);
+
+allocate_sge_failed:
+ mempool_free(request, info->request_mempool);
+ return rc;
+}
+
+/*
* Post a receive request to the transport
* The remote peer can only send data when a receive is posted
* The interaction is controlled by send/recieve credit system
@@ -434,6 +595,45 @@ static int cifs_rdma_post_recv(struct cifs_rdma_info *info, struct cifs_rdma_res
return rc;
}

+// Perform SMBD negotiate according to [MS-SMBD] 3.1.5.2
+static int cifs_rdma_negotiate(struct cifs_rdma_info *info)
+{
+ int rc;
+ struct cifs_rdma_response* response = get_receive_buffer(info);
+ response->type = SMBD_NEGOTIATE_RESP;
+
+ rc = cifs_rdma_post_recv(info, response);
+
+ log_rdma_event("cifs_rdma_post_recv rc=%d iov.addr=%llx iov.length=%x "
+ "iov.lkey=%x\n",
+ rc, response->sge.addr,
+ response->sge.length, response->sge.lkey);
+ if (rc)
+ return rc;
+
+ init_completion(&info->negotiate_completion);
+ info->negotiate_done = false;
+ rc = cifs_rdma_post_send_negotiate_req(info);
+ if (rc)
+ return rc;
+
+ rc = wait_for_completion_interruptible_timeout(
+ &info->negotiate_completion, 60 * HZ);
+ log_rdma_event("wait_for_completion_timeout rc=%d\n", rc);
+
+ if (info->negotiate_done)
+ return 0;
+
+ if (rc == 0)
+ rc = -ETIMEDOUT;
+ else if (rc == -ERESTARTSYS)
+ rc = -EINTR;
+ else
+ rc = -ENOTCONN;
+
+ return rc;
+}
+
/*
* Receive buffer operations.
* For each remote send, we need to post a receive. The receive buffers are
@@ -634,6 +834,14 @@ struct cifs_rdma_info* cifs_create_rdma_session(

init_waitqueue_head(&info->wait_recv_pending);
atomic_set(&info->recv_pending, 0);
+
+ rc = cifs_rdma_negotiate(info);
+ if (!rc)
+ return info;
+
+ // negotiation failed
+ log_rdma_event("cifs_rdma_negotiate rc=%d\n", rc);
+
out2:
rdma_destroy_id(info->id);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 8702a2b..a766cbf 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -46,6 +46,9 @@ struct cifs_rdma_info {
int ri_rc;
struct completion ri_done;

+ struct completion negotiate_completion;
+ bool negotiate_done;
+
//connection paramters
int receive_credit_max;
int send_credit_target;
@@ -91,6 +94,32 @@ enum smbd_message_type {

#define SMB_DIRECT_RESPONSE_REQUESTED 0x0001

+// SMBD negotiation request packet [MS-SMBD] 2.2.1
+struct smbd_negotiate_req {
+ __le16 min_version;
+ __le16 max_version;
+ __le16 reserved;
+ __le16 credits_requested;
+ __le32 preferred_send_size;
+ __le32 max_receive_size;
+ __le32 max_fragmented_size;
+} __packed;
+
+// SMBD negotiation response packet [MS-SMBD] 2.2.2
+struct smbd_negotiate_resp {
+ __le16 min_version;
+ __le16 max_version;
+ __le16 negotiated_version;
+ __le16 reserved;
+ __le16 credits_requested;
+ __le16 credits_granted;
+ __le32 status;
+ __le32 max_readwrite_size;
+ __le32 preferred_send_size;
+ __le32 max_receive_size;
+ __le32 max_fragmented_size;
+} __packed;
+
// SMBD data transfer packet with payload [MS-SMBD] 2.2.3
struct smbd_data_transfer {
__le16 credits_requested;
--
2.7.4
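
The checks that process_negotiation_response() applies in patch 13/37 can be condensed into a small userspace sketch. It assumes the response fields have already been converted to host byte order, and it validates everything before updating any state, whereas the patch updates `info` as it goes. The types `struct negotiate_resp` and `struct conn_params` and the function `validate_negotiate_resp` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Host-order copy of the response fields checked in the patch (hypothetical). */
struct negotiate_resp {
	uint16_t negotiated_version;
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint32_t preferred_send_size;
	uint32_t max_receive_size;
	uint32_t max_fragmented_size;
};

/* Local parameters updated by a successful negotiation (hypothetical). */
struct conn_params {
	uint32_t max_send_size;
	uint32_t max_receive_size;
	uint32_t max_fragmented_send_size;
	uint16_t send_credits;
	uint16_t receive_credit_target;
};

static bool validate_negotiate_resp(struct conn_params *c,
				    const struct negotiate_resp *r)
{
	assert(c && r);

	if (r->negotiated_version != 0x100)
		return false;
	if (r->credits_requested == 0 || r->credits_granted == 0)
		return false;
	/* the peer must not send more than we offered to receive */
	if (r->preferred_send_size > c->max_receive_size)
		return false;
	if (r->max_receive_size < 128)
		return false;
	if (r->max_fragmented_size < 131072)
		return false;

	c->receive_credit_target = r->credits_requested;
	c->send_credits = r->credits_granted;
	c->max_receive_size = r->preferred_send_size;
	/* never send more than the peer can receive */
	if (r->max_receive_size < c->max_send_size)
		c->max_send_size = r->max_receive_size;
	c->max_fragmented_send_size = r->max_fragmented_size;
	return true;
}
```

Validating first and committing afterwards leaves the local parameters untouched when a response is rejected; that is a design choice of this sketch rather than something the patch does.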

2017-08-02 20:18:09

by Long Li

Subject: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

From: Long Li <[email protected]>

Similar to sending a transfer message with page payload, this function creates an SMBD data packet and sends it over RDMA, using the iov passed from the upper layer.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index b3ec109..989cad8 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -66,6 +66,9 @@ static int cifs_rdma_post_recv(
struct cifs_rdma_info *info,
struct cifs_rdma_response *response);

+static int cifs_rdma_post_send_data(
+ struct cifs_rdma_info *info,
+ struct kvec *iov, int n_vec, int remaining_data_length);
static int cifs_rdma_post_send_page(struct cifs_rdma_info *info,
struct page *page, unsigned long offset,
size_t size, int remaining_data_length);
@@ -671,6 +674,122 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa
}

/*
+ * Send a data buffer
+ * iov: the iov array describing the data buffers
+ * n_vec: number of entries in the iov array
+ * remaining_data_length: remaining data to send in this payload
+ */
+static int cifs_rdma_post_send_data(
+ struct cifs_rdma_info *info, struct kvec *iov, int n_vec,
+ int remaining_data_length)
+{
+ struct cifs_rdma_request *request;
+ struct smbd_data_transfer *packet;
+ struct ib_send_wr send_wr, *send_wr_fail;
+ int rc = -ENOMEM, i;
+ u32 data_length;
+
+ request = mempool_alloc(info->request_mempool, GFP_KERNEL);
+ if (!request)
+ return rc;
+
+ request->info = info;
+
+ wait_event(info->wait_send_queue, atomic_read(&info->send_credits) > 0);
+ atomic_dec(&info->send_credits);
+
+ packet = (struct smbd_data_transfer *) request->packet;
+ packet->credits_requested = cpu_to_le16(info->send_credit_target);
+ packet->flags = cpu_to_le16(0);
+ packet->reserved = cpu_to_le16(0);
+
+ packet->data_offset = cpu_to_le32(24);
+
+ data_length = 0;
+ for (i=0; i<n_vec; i++)
+ data_length += iov[i].iov_len;
+ packet->data_length = cpu_to_le32(data_length);
+
+ packet->remaining_data_length = cpu_to_le32(remaining_data_length);
+ packet->padding = cpu_to_le32(0);
+
+ log_rdma_send("credits_requested=%d credits_granted=%d data_offset=%d "
+ "data_length=%d remaining_data_length=%d\n",
+ le16_to_cpu(packet->credits_requested),
+ le16_to_cpu(packet->credits_granted),
+ le32_to_cpu(packet->data_offset),
+ le32_to_cpu(packet->data_length),
+ le32_to_cpu(packet->remaining_data_length));
+
+ request->sge = kzalloc(sizeof(struct ib_sge)*(n_vec+1), GFP_KERNEL);
+ if (!request->sge)
+ goto allocate_sge_failed;
+
+ request->num_sge = n_vec+1;
+
+ request->sge[0].addr = ib_dma_map_single(
+ info->id->device, (void *)packet,
+ sizeof(*packet), DMA_BIDIRECTIONAL);
+ if(ib_dma_mapping_error(info->id->device, request->sge[0].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failure;
+ }
+ request->sge[0].length = sizeof(*packet);
+ request->sge[0].lkey = info->pd->local_dma_lkey;
+ ib_dma_sync_single_for_device(info->id->device, request->sge[0].addr,
+ request->sge[0].length, DMA_TO_DEVICE);
+
+ for (i=0; i<n_vec; i++) {
+ request->sge[i+1].addr = ib_dma_map_single(info->id->device, iov[i].iov_base,
+ iov[i].iov_len, DMA_BIDIRECTIONAL);
+ if(ib_dma_mapping_error(info->id->device, request->sge[i+1].addr)) {
+ rc = -EIO;
+ goto dma_mapping_failure;
+ }
+ request->sge[i+1].length = iov[i].iov_len;
+ request->sge[i+1].lkey = info->pd->local_dma_lkey;
+ ib_dma_sync_single_for_device(info->id->device, request->sge[i+1].addr,
+ request->sge[i+1].length, DMA_TO_DEVICE);
+ }
+
+ log_rdma_send("rdma_request sge[0] addr=%llu length=%u lkey=%u\n",
+ request->sge[0].addr, request->sge[0].length, request->sge[0].lkey);
+ for (i=0; i<n_vec; i++)
+ log_rdma_send("rdma_request sge[%d] addr=%llu length=%u lkey=%u\n",
+ i+1, request->sge[i+1].addr,
+ request->sge[i+1].length, request->sge[i+1].lkey);
+
+ request->cqe.done = send_done;
+
+ send_wr.next = NULL;
+ send_wr.wr_cqe = &request->cqe;
+ send_wr.sg_list = request->sge;
+ send_wr.num_sge = request->num_sge;
+ send_wr.opcode = IB_WR_SEND;
+ send_wr.send_flags = IB_SEND_SIGNALED;
+
+ rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
+ if (!rc)
+ return 0;
+
+ // post send failed
+ log_rdma_send("ib_post_send failed rc=%d\n", rc);
+
+dma_mapping_failure:
+ for (i=0; i<n_vec+1; i++)
+ if (request->sge[i].addr)
+ ib_dma_unmap_single(info->id->device,
+ request->sge[i].addr,
+ request->sge[i].length,
+ DMA_TO_DEVICE);
+ kfree(request->sge);
+
+allocate_sge_failed:
+ mempool_free(request, info->request_mempool);
+ return rc;
+}
+
+/*
* Post a receive request to the transport
* The remote peer can only send data when a receive is posted
* The interaction is controlled by send/recieve credit system
--
2.7.4
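
The scatter/gather layout used in patch 15/37, one header entry followed by one entry per iov element, can be sketched in userspace as follows. `struct sge` is a hypothetical stand-in for `struct ib_sge`, POSIX `struct iovec` replaces the kernel's `struct kvec`, and no DMA mapping is performed; only the indexing is illustrated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical userspace stand-in for struct ib_sge. */
struct sge {
	const void *addr;
	uint32_t length;
};

/*
 * Fill sge[0] with the header and sge[1..n_vec] with the iov entries.
 * The caller provides room for n_vec + 1 entries.
 * Returns the total payload length, excluding the header.
 */
static uint32_t build_sge_list(struct sge *sge,
			       const void *hdr, uint32_t hdr_len,
			       const struct iovec *iov, int n_vec)
{
	uint32_t data_length = 0;
	int i;

	assert(sge && hdr && n_vec >= 0);

	sge[0].addr = hdr;
	sge[0].length = hdr_len;

	for (i = 0; i < n_vec; i++) {
		/* payload entry i lands at index i + 1, after the header */
		sge[i + 1].addr = iov[i].iov_base;
		sge[i + 1].length = (uint32_t)iov[i].iov_len;
		data_length += sge[i + 1].length;
	}
	return data_length;
}
```

Every access to a payload entry uses the index `i + 1`; mixing in any other offset would touch the wrong entry or run past the end of the array.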

2017-08-02 20:18:07

by Long Li

Subject: [[PATCH v1] 09/37] [CIFS] SMBD: Add SMBD request and cache

From: Long Li <[email protected]>

Define the SMBD request. Unlike an SMBD response, an SMBD request doesn't use pre-allocated buffers: the data payload comes from data buffers provided by the upper layer. SMBD requests are allocated from a per-channel mempool.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 12 ++++++++++++
fs/cifs/cifsrdma.h | 19 +++++++++++++++++++
2 files changed, 31 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index b3e43b1..9610897 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -459,6 +459,18 @@ struct cifs_rdma_info* cifs_create_rdma_session(

log_rdma_event("rdma_connect connected\n");

+ sprintf(cache_name, "cifs_smbd_request_%p", info);
+ info->request_cache =
+ kmem_cache_create(
+ cache_name,
+ sizeof(struct cifs_rdma_request) +
+ sizeof(struct smbd_data_transfer),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+
+ info->request_mempool =
+ mempool_create(info->send_credit_target, mempool_alloc_slab,
+ mempool_free_slab, info->request_cache);
+
sprintf(cache_name, "cifs_smbd_response_%p", info);
info->response_cache =
kmem_cache_create(
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index ed0ff54..e925aa4 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -62,6 +62,10 @@ struct cifs_rdma_info {
struct list_head receive_queue;
spinlock_t receive_queue_lock;

+ // request pool for RDMA send
+ struct kmem_cache *request_cache;
+ mempool_t *request_mempool;
+
// response pool for RDMA receive
struct kmem_cache *response_cache;
mempool_t *response_mempool;
@@ -93,6 +97,21 @@ struct smbd_data_transfer {
char buffer[0];
} __packed;

+// The context for a SMBD request
+struct cifs_rdma_request {
+ struct cifs_rdma_info *info;
+
+ // completion queue entry
+ struct ib_cqe cqe;
+
+ // the SGE entries for this packet
+ struct ib_sge *sge;
+ int num_sge;
+
+ // SMBD packet header follows this structure
+ char packet[0];
+};
+
// The context for a SMBD response
struct cifs_rdma_response {
struct cifs_rdma_info *info;
--
2.7.4
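
The request layout from patch 09/37, a context structure with the SMBD header stored directly behind it, can be sketched in userspace like this. The sketch uses a C99 flexible array member where the patch uses the GNU `char packet[0]` form, and `calloc()` where the patch uses a slab-backed mempool. `struct request`, `request_alloc` and `SMBD_HDR_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SMBD_HDR_SIZE 24	/* assumed size of struct smbd_data_transfer */

/* Hypothetical userspace stand-in for struct cifs_rdma_request. */
struct request {
	void *info;		/* back pointer to the owning transport */
	int num_sge;
	unsigned char packet[];	/* SMBD packet header follows the context */
};

/*
 * Allocate the context and the header in one block, mirroring the
 * object size given to kmem_cache_create() in the patch:
 * sizeof(context) + sizeof(header).
 */
static struct request *request_alloc(void *info)
{
	struct request *req = calloc(1, sizeof(*req) + SMBD_HDR_SIZE);

	if (!req)
		return NULL;
	req->info = info;
	return req;
}
```

A single allocation keeps the header next to its bookkeeping data, so one free releases both.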

2017-08-02 20:18:05

by Long Li

Subject: [[PATCH v1] 17/37] [CIFS] SMBD: Track status for transport

From: Long Li <[email protected]>

Introduce a status field for tracking the state of the SMBD transport. It is used in transport reconnect and shutdown.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 25 +++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 11 +++++++++++
2 files changed, 36 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index cf71bb1..ef21f1c 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -173,9 +173,12 @@ static int cifs_rdma_conn_upcall(
case RDMA_CM_EVENT_DEVICE_REMOVAL:
log_rdma_event("connected event=%d\n", event->event);
info->connect_state = event->event;
+ info->transport_status = CIFS_RDMA_CONNECTED;
+ wake_up_all(&info->conn_wait);
break;

case RDMA_CM_EVENT_DISCONNECTED:
+ info->transport_status = CIFS_RDMA_DISCONNECTED;
break;

default:
@@ -581,6 +584,12 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa
int rc = -ENOMEM;
int i;

+ // disconnected?
+ if (info->transport_status != CIFS_RDMA_CONNECTED) {
+ log_outgoing("disconnected not sending\n");
+ return -ENOENT;
+ }
+
request = mempool_alloc(info->request_mempool, GFP_KERNEL);
if (!request)
return rc;
@@ -686,6 +695,12 @@ static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info)
int rc;
u16 credits_granted, flags=0;

+ // disconnected?
+ if (info->transport_status != CIFS_RDMA_CONNECTED) {
+ log_outgoing("disconnected not sending\n");
+ return -ENOENT;
+ }
+
request = mempool_alloc(info->request_mempool, GFP_KERNEL);
if (!request) {
log_rdma_send("failed to get send buffer for empty packet\n");
@@ -785,6 +800,12 @@ static int cifs_rdma_post_send_data(
int rc = -ENOMEM, i;
u32 data_length;

+ // disconnected?
+ if (info->transport_status != CIFS_RDMA_CONNECTED) {
+ log_outgoing("disconnected not sending\n");
+ return -ENOENT;
+ }
+
request = mempool_alloc(info->request_mempool, GFP_KERNEL);
if (!request)
return rc;
@@ -1056,6 +1077,7 @@ struct cifs_rdma_info* cifs_create_rdma_session(
return NULL;

info->server_info = server;
+ info->transport_status = CIFS_RDMA_CONNECTING;

rc = cifs_rdma_ia_open(info, dstaddr);
if (rc) {
@@ -1122,12 +1144,15 @@ struct cifs_rdma_info* cifs_create_rdma_session(
conn_param.retry_count = 6;
conn_param.rnr_retry_count = 6;
conn_param.flow_control = 0;
+ init_waitqueue_head(&info->conn_wait);
rc = rdma_connect(info->id, &conn_param);
if (rc) {
log_rdma_event("rdma_connect() failed with %i\n", rc);
goto out2;
}

+ wait_event_interruptible(
+ info->conn_wait, info->transport_status == CIFS_RDMA_CONNECTED);
if (info->connect_state != RDMA_CM_EVENT_ESTABLISHED)
goto out2;

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 4a4c191..9618e0b 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -25,6 +25,15 @@
#include <rdma/rdma_cm.h>
#include <linux/mempool.h>

+enum cifs_rdma_transport_status {
+ CIFS_RDMA_CREATED,
+ CIFS_RDMA_CONNECTING,
+ CIFS_RDMA_CONNECTED,
+ CIFS_RDMA_DISCONNECTING,
+ CIFS_RDMA_DISCONNECTED,
+ CIFS_RDMA_DESTROYED
+};
+
/*
* The context for the SMBDirect transport
* Everything related to the transport is here. It has several logical parts
@@ -35,6 +44,7 @@
*/
struct cifs_rdma_info {
struct TCP_Server_Info *server_info;
+ enum cifs_rdma_transport_status transport_status;

// RDMA related
struct rdma_cm_id *id;
@@ -45,6 +55,7 @@ struct cifs_rdma_info {
int connect_state;
int ri_rc;
struct completion ri_done;
+ wait_queue_head_t conn_wait;

struct completion negotiate_completion;
bool negotiate_done;
--
2.7.4
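
The send-path guard from patch 17/37 can be illustrated with a small userspace state sketch. The enum mirrors `enum cifs_rdma_transport_status` with shortened names; `enum cm_event`, `on_cm_event` and `check_can_send` are hypothetical, and the real connection manager delivers many more event types than the two modelled here.

```c
#include <assert.h>
#include <errno.h>

/* Mirrors enum cifs_rdma_transport_status from the patch, names shortened. */
enum transport_status {
	T_CREATED,
	T_CONNECTING,
	T_CONNECTED,
	T_DISCONNECTING,
	T_DISCONNECTED,
	T_DESTROYED
};

/* Hypothetical events; the kernel uses RDMA_CM_EVENT_* values instead. */
enum cm_event {
	EV_ESTABLISHED,
	EV_DISCONNECTED
};

/* Return the status that follows from a connection manager event. */
static enum transport_status on_cm_event(enum transport_status cur,
					 enum cm_event ev)
{
	switch (ev) {
	case EV_ESTABLISHED:
		return T_CONNECTED;
	case EV_DISCONNECTED:
		return T_DISCONNECTED;
	}
	return cur;
}

/* Sending is allowed only while connected, as in the post_send guards. */
static int check_can_send(enum transport_status s)
{
	return s == T_CONNECTED ? 0 : -ENOENT;
}
```

Checking the status before allocating a request means a disconnected transport fails fast without consuming a mempool entry or a send credit.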

2017-08-02 20:18:02

by Long Li

Subject: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer to send data

From: Long Li <[email protected]>

Implement cifs_rdma_write for sending upper layer data. The upper layer uses this function to do an RDMA send. This function is also used to pass SMB packets for doing an RDMA read/write via memory registration.
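
When a single buffer is larger than one SMBD message can carry, cifs_rdma_write splits it into pieces and tracks how much is still to come. A minimal userspace sketch of that arithmetic is shown here, assuming a single contiguous buffer; the helper names `fragment_count`, `fragment_length` and `remaining_after` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of pieces needed to send len bytes, at most max_piece at a time. */
static size_t fragment_count(size_t len, size_t max_piece)
{
	assert(max_piece > 0);
	return (len + max_piece - 1) / max_piece;	/* round up */
}

/*
 * Length of piece j (0-based). Every piece is full-sized except
 * possibly the last one, which carries whatever is left.
 */
static size_t fragment_length(size_t len, size_t max_piece, size_t j)
{
	size_t n = fragment_count(len, max_piece);

	assert(j < n);
	if (j == n - 1)
		return len - max_piece * (n - 1);
	return max_piece;
}

/* Bytes still to be sent once pieces 0..j have gone out. */
static size_t remaining_after(size_t len, size_t max_piece, size_t j)
{
	size_t sent = 0;
	size_t k;

	for (k = 0; k <= j; k++)
		sent += fragment_length(len, max_piece, k);
	return len - sent;
}
```

For a 3000-byte buffer and a 1340-byte limit per message, this yields pieces of 1340, 1340 and 320 bytes, with 1660, 320 and 0 bytes remaining after each.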

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 5 ++
2 files changed, 182 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index ef21f1c..eb48651 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -229,6 +229,10 @@ static void send_done(struct ib_cq *cq, struct ib_wc *wc)
request->sge[i].length,
DMA_TO_DEVICE);

+ if (atomic_dec_and_test(&request->info->send_pending)) {
+ wake_up(&request->info->wait_send_pending);
+ }
+
kfree(request->sge);
mempool_free(request, request->info->request_mempool);
}
@@ -551,12 +555,14 @@ static int cifs_rdma_post_send_negotiate_req(struct cifs_rdma_info *info)
request->sge[0].addr,
request->sge[0].length, request->sge[0].lkey);

+ atomic_inc(&info->send_pending);
rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
if (!rc)
return 0;

// if we reach here, post send failed
log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ atomic_dec(&info->send_pending);
ib_dma_unmap_single(info->id->device, request->sge[0].addr,
request->sge[0].length, DMA_TO_DEVICE);

@@ -662,12 +668,14 @@ static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page *pa
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;

+ atomic_inc(&info->send_pending);
rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
if (!rc)
return 0;

// post send failed
log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ atomic_dec(&info->send_pending);

dma_mapping_failed:
for (i=0; i<2; i++)
@@ -768,11 +776,13 @@ static int cifs_rdma_post_send_empty(struct cifs_rdma_info *info)
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;

+ atomic_inc(&info->send_pending);
rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
if (!rc)
return 0;

log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ atomic_dec(&info->send_pending);
ib_dma_unmap_single(info->id->device, request->sge[0].addr,
request->sge[0].length, DMA_TO_DEVICE);

@@ -885,12 +895,14 @@ static int cifs_rdma_post_send_data(
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;

+ atomic_inc(&info->send_pending);
rc = ib_post_send(info->id->qp, &send_wr, &send_wr_fail);
if (!rc)
return 0;

// post send failed
log_rdma_send("ib_post_send failed rc=%d\n", rc);
+ atomic_dec(&info->send_pending);

dma_mapping_failure:
for (i=0; i<n_vec+1; i++)
@@ -1185,6 +1197,9 @@ struct cifs_rdma_info* cifs_create_rdma_session(
allocate_receive_buffers(info, info->receive_credit_max);
init_waitqueue_head(&info->wait_send_queue);

+ init_waitqueue_head(&info->wait_send_pending);
+ atomic_set(&info->send_pending, 0);
+
init_waitqueue_head(&info->wait_recv_pending);
atomic_set(&info->recv_pending, 0);

@@ -1202,3 +1217,165 @@ struct cifs_rdma_info* cifs_create_rdma_session(
kfree(info);
return NULL;
}
+
+/*
+ * Write data to transport
+ * Each rqst is transported as a SMBDirect payload
+ * rqst: the data to write
+ * return value: 0 on success, otherwise an error code
+ */
+int cifs_rdma_write(struct cifs_rdma_info *info, struct smb_rqst *rqst)
+{
+ struct kvec vec;
+ int nvecs;
+ int size;
+ int buflen=0, remaining_data_length;
+ int start, i, j;
+ int max_iov_size = info->max_send_size - sizeof(struct smbd_data_transfer);
+ struct kvec *iov;
+ int rc;
+
+ if (info->transport_status != CIFS_RDMA_CONNECTED) {
+ log_cifs_write("disconnected returning -EIO\n");
+ return -EIO;
+ }
+
+ iov = kzalloc(sizeof(struct kvec)*rqst->rq_nvec, GFP_KERNEL);
+ if (!iov) {
+ log_cifs_write("failed to allocate iov returning -ENOMEM\n");
+ return -ENOMEM;
+ }
+
+ /* Strip the first 4 bytes MS-SMB2 section 2.1
+ * they are used only for TCP transport */
+ iov[0].iov_base = (char*)rqst->rq_iov[0].iov_base + 4;
+ iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
+ buflen += iov[0].iov_len;
+
+ /* total up iov array first */
+ for (i = 1; i < rqst->rq_nvec; i++) {
+ iov[i].iov_base = rqst->rq_iov[i].iov_base;
+ iov[i].iov_len = rqst->rq_iov[i].iov_len;
+ buflen += iov[i].iov_len;
+ }
+
+ /* add in the page array if there is one */
+ if (rqst->rq_npages) {
+ buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
+ buflen += rqst->rq_tailsz;
+ }
+
+ if (buflen + sizeof(struct smbd_data_transfer) >
+ info->max_fragmented_send_size) {
+ log_cifs_write("payload size %d > max size %d\n",
+ buflen, info->max_fragmented_send_size);
+ rc = -EINVAL;
+ goto done;
+ }
+
+ remaining_data_length = buflen;
+
+ log_cifs_write("rqst->rq_nvec=%d rqst->rq_npages=%d rq_pagesz=%d "
+ "rq_tailsz=%d buflen=%d\n",
+ rqst->rq_nvec, rqst->rq_npages, rqst->rq_pagesz,
+ rqst->rq_tailsz, buflen);
+
+ start = i = iov[0].iov_len ? 0 : 1;
+ buflen = 0;
+ while (true){
+ buflen += iov[i].iov_len;
+ if (buflen > max_iov_size) {
+ if (i > start) {
+ remaining_data_length -=
+ (buflen-iov[i].iov_len);
+ log_cifs_write("sending iov[] from start=%d "
+ "i=%d nvecs=%d "
+ "remaining_data_length=%d\n",
+ start, i, i-start,
+ remaining_data_length);
+ rc = cifs_rdma_post_send_data(
+ info, &iov[start], i-start,
+ remaining_data_length);
+ if (rc)
+ goto done;
+ } else {
+ // iov[start] is too big, break it to nvecs pieces
+ nvecs = (buflen+max_iov_size-1)/max_iov_size;
+ log_cifs_write("iov[%d] iov_base=%p buflen=%d"
+ " break to %d vectors\n",
+ start, iov[start].iov_base,
+ buflen, nvecs);
+ for (j=0; j<nvecs; j++) {
+ vec.iov_base =
+ (char *)iov[start].iov_base +
+ j*max_iov_size;
+ vec.iov_len = max_iov_size;
+ if (j == nvecs-1)
+ vec.iov_len =
+ buflen -
+ max_iov_size*(nvecs-1);
+ remaining_data_length -= vec.iov_len;
+ log_cifs_write(
+ "sending vec j=%d iov_base=%p"
+ " iov_len=%lu "
+ "remaining_data_length=%d\n",
+ j, vec.iov_base, vec.iov_len,
+ remaining_data_length);
+ rc = cifs_rdma_post_send_data(
+ info, &vec, 1,
+ remaining_data_length);
+ if (rc)
+ goto done;
+ }
+ i++;
+ }
+ start = i;
+ buflen = 0;
+ } else {
+ i++;
+ if (i == rqst->rq_nvec) {
+ // send out all remaining vecs and we are done
+ remaining_data_length -= buflen;
+ log_cifs_write(
+ "sending iov[] from start=%d i=%d "
+ "nvecs=%d remaining_data_length=%d\n",
+ start, i, i-start,
+ remaining_data_length);
+ rc = cifs_rdma_post_send_data(info, &iov[start],
+ i-start, remaining_data_length);
+ if (rc)
+ goto done;
+ break;
+ }
+ }
+ log_cifs_write("looping i=%d buflen=%d\n", i, buflen);
+ }
+
+ // now sending pages
+ for (i = 0; i < rqst->rq_npages; i++) {
+ buflen = (i == rqst->rq_npages-1) ?
+ rqst->rq_tailsz : rqst->rq_pagesz;
+ nvecs = (buflen+max_iov_size-1)/max_iov_size;
+ log_cifs_write("sending pages buflen=%d nvecs=%d\n",
+ buflen, nvecs);
+ for (j=0; j<nvecs; j++) {
+ size = max_iov_size;
+ if (j == nvecs-1)
+ size = buflen - j*max_iov_size;
+ remaining_data_length -= size;
+ log_cifs_write("sending pages i=%d offset=%d size=%d"
+ " remaining_data_length=%d\n",
+ i, j*max_iov_size, size, remaining_data_length);
+ rc = cifs_rdma_post_send_page(
+ info, rqst->rq_pages[i], j*max_iov_size,
+ size, remaining_data_length);
+ if (rc)
+ goto done;
+ }
+ }
+
+done:
+ kfree(iov);
+ wait_event(info->wait_send_pending, atomic_read(&info->send_pending) == 0);
+ return rc;
+}
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 9618e0b..90746a4 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -73,6 +73,9 @@ struct cifs_rdma_info {
atomic_t receive_credits;
atomic_t receive_credit_target;

+ atomic_t send_pending;
+ wait_queue_head_t wait_send_pending;
+
atomic_t recv_pending;
wait_queue_head_t wait_recv_pending;

@@ -195,4 +198,6 @@ struct cifs_rdma_response {
// Create a SMBDirect session
struct cifs_rdma_info* cifs_create_rdma_session(
struct TCP_Server_Info *server, struct sockaddr *dstaddr);
+
+int cifs_rdma_write(struct cifs_rdma_info *rdma, struct smb_rqst *rqst);
#endif
--
2.7.4
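
For readers following the splitting logic in cifs_rdma_write() above, here is a minimal user-space sketch of the same idea: consecutive segments are batched while they fit into one SMBD payload, and a single segment that is larger than the payload limit is cut into pieces. This is a simplified model, not the patch code. The function name plan_sends, the plain size_t arrays standing in for struct kvec, and the out/out_cap parameters are all assumptions made for illustration. The sketch also stops as soon as the last segment has been consumed, including when that last segment had to be split.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the iov loop: given n segment lengths and a
 * per-send payload limit, record the payload size of every send that
 * would be issued. Returns the number of sends; at most out_cap sizes
 * are written to out[].
 */
size_t plan_sends(const size_t *seg_len, size_t n, size_t max_payload,
		  size_t *out, size_t out_cap)
{
	size_t sends = 0, batch_bytes = 0, batch_segs = 0, i;

	assert(max_payload > 0);

	for (i = 0; i < n; i++) {
		size_t len = seg_len[i];

		/* Flush the current batch if this segment does not fit. */
		if (batch_segs > 0 && batch_bytes + len > max_payload) {
			if (sends < out_cap)
				out[sends] = batch_bytes;
			sends++;
			batch_bytes = 0;
			batch_segs = 0;
		}

		if (len > max_payload) {
			/* Oversized segment: cut it into max_payload pieces. */
			size_t off;

			for (off = 0; off < len; off += max_payload) {
				size_t piece = len - off;

				if (piece > max_payload)
					piece = max_payload;
				if (sends < out_cap)
					out[sends] = piece;
				sends++;
			}
		} else {
			/* Segment fits: add it to the current batch. */
			batch_bytes += len;
			batch_segs++;
		}
	}

	/* Send whatever is still batched after the last segment. */
	if (batch_segs > 0) {
		if (sends < out_cap)
			out[sends] = batch_bytes;
		sends++;
	}
	return sends;
}
```

The remaining_data_length value carried in each send can be derived from this plan by subtracting each payload size from the running total, which is how the patch updates it before every post.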

2017-08-02 20:22:16

by Long Li

Subject: [[PATCH v1] 11/37] [CIFS] SMBD: Post a receive request

From: Long Li <[email protected]>

Add code to post a receive request to the RDMA layer. Before the SMB server can send a packet to the SMB client via SMBD, a receive request must be posted to the local RDMA layer.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 5 +++
2 files changed, 129 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 8aa8a47..20237b7 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -62,6 +62,10 @@ static void put_receive_buffer(
static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf);
static void destroy_receive_buffers(struct cifs_rdma_info *info);

+static int cifs_rdma_post_recv(
+ struct cifs_rdma_info *info,
+ struct cifs_rdma_response *response);
+
/*
* Per RDMA transport connection parameters
* as defined in [MS-SMBD] 3.1.1.1
@@ -193,6 +197,85 @@ cifs_rdma_qp_async_error_upcall(struct ib_event *event, void *context)
}
}

+/* Called from softirq, when recv is done */
+static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+ struct smbd_data_transfer *data_transfer;
+ struct cifs_rdma_response *response =
+ container_of(wc->wr_cqe, struct cifs_rdma_response, cqe);
+ struct cifs_rdma_info *info = response->info;
+
+ log_rdma_recv("response=%p type=%d wc status=%d wc opcode %d "
+ "byte_len=%d pkey_index=%x\n",
+ response, response->type, wc->status, wc->opcode,
+ wc->byte_len, wc->pkey_index);
+
+ if (wc->status != IB_WC_SUCCESS || wc->opcode != IB_WC_RECV) {
+ log_rdma_recv("wc->status=%d opcode=%d\n",
+ wc->status, wc->opcode);
+ goto error;
+ }
+
+ ib_dma_sync_single_for_cpu(
+ wc->qp->device,
+ response->sge.addr,
+ response->sge.length,
+ DMA_FROM_DEVICE);
+
+ switch(response->type) {
+ case SMBD_TRANSFER_DATA:
+ data_transfer = (struct smbd_data_transfer *) response->packet;
+ atomic_dec(&info->receive_credits);
+ atomic_set(&info->receive_credit_target,
+ le16_to_cpu(data_transfer->credits_requested));
+ atomic_add(le16_to_cpu(data_transfer->credits_granted),
+ &info->send_credits);
+
+ log_incoming("data flags %d data_offset %d data_length %d "
+ "remaining_data_length %d\n",
+ le16_to_cpu(data_transfer->flags),
+ le32_to_cpu(data_transfer->data_offset),
+ le32_to_cpu(data_transfer->data_length),
+ le32_to_cpu(data_transfer->remaining_data_length));
+
+ log_transport_credit(info);
+
+ // process sending queue on new credits
+ if (atomic_read(&info->send_credits))
+ wake_up(&info->wait_send_queue);
+
+ // process receive queue
+ if (le32_to_cpu(data_transfer->data_length)) {
+ if (info->full_packet_received) {
+ response->first_segment = true;
+ }
+
+ if (le32_to_cpu(data_transfer->remaining_data_length))
+ info->full_packet_received = false;
+ else
+ info->full_packet_received = true;
+
+ goto queue_done;
+ }
+
+ // if we reach here, this is an empty packet, finish it
+ break;
+
+ default:
+ log_rdma_recv("unexpected response type=%d\n", response->type);
+ }
+
+error:
+ put_receive_buffer(info, response);
+
+queue_done:
+ if (atomic_dec_and_test(&info->recv_pending)) {
+ wake_up(&info->wait_recv_pending);
+ }
+
+ return;
+}
+
static struct rdma_cm_id* cifs_rdma_create_id(
struct cifs_rdma_info *info, struct sockaddr *dstaddr)
{
@@ -289,6 +372,44 @@ static int cifs_rdma_ia_open(
}

/*
+ * Post a receive request to the transport
+ * The remote peer can only send data when a receive is posted
+ * The interaction is controlled by send/receive credit system
+ */
+static int cifs_rdma_post_recv(struct cifs_rdma_info *info, struct cifs_rdma_response *response)
+{
+ struct ib_recv_wr recv_wr, *recv_wr_fail=NULL;
+ int rc = -EIO;
+
+ response->sge.addr = ib_dma_map_single(info->id->device, response->packet,
+ info->max_receive_size, DMA_FROM_DEVICE);
+ if (ib_dma_mapping_error(info->id->device, response->sge.addr))
+ return rc;
+
+ response->sge.length = info->max_receive_size;
+ response->sge.lkey = info->pd->local_dma_lkey;
+
+ response->cqe.done = recv_done;
+
+ recv_wr.wr_cqe = &response->cqe;
+ recv_wr.next = NULL;
+ recv_wr.sg_list = &response->sge;
+ recv_wr.num_sge = 1;
+
+ atomic_inc(&info->recv_pending);
+ rc = ib_post_recv(info->id->qp, &recv_wr, &recv_wr_fail);
+ if (rc) {
+ ib_dma_unmap_single(info->id->device, response->sge.addr,
+ response->sge.length, DMA_FROM_DEVICE);
+
+ log_rdma_recv("ib_post_recv failed rc=%d\n", rc);
+ atomic_dec(&info->recv_pending);
+ }
+
+ return rc;
+}
+
+/*
* Receive buffer operations.
* For each remote send, we need to post a receive. The receive buffers are
* pre-allocated in advance.
@@ -485,6 +606,9 @@ struct cifs_rdma_info* cifs_create_rdma_session(

allocate_receive_buffers(info, info->receive_credit_max);
init_waitqueue_head(&info->wait_send_queue);
+
+ init_waitqueue_head(&info->wait_recv_pending);
+ atomic_set(&info->recv_pending, 0);
out2:
rdma_destroy_id(info->id);

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 287b5b1..8702a2b 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -59,6 +59,9 @@ struct cifs_rdma_info {
atomic_t receive_credits;
atomic_t receive_credit_target;

+ atomic_t recv_pending;
+ wait_queue_head_t wait_recv_pending;
+
struct list_head receive_queue;
spinlock_t receive_queue_lock;

@@ -68,6 +71,8 @@ struct cifs_rdma_info {
struct kmem_cache *request_cache;
mempool_t *request_mempool;

+ bool full_packet_received;
+
// response pool for RDMA receive
struct kmem_cache *response_cache;
mempool_t *response_mempool;
--
2.7.4
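
The bookkeeping that recv_done() does for an SMBD_TRANSFER_DATA packet can be modelled outside the kernel. The sketch below is an assumption-laden illustration, not the patch code: plain ints replace the atomic_t counters, the struct credit_state and the function on_data_transfer are hypothetical names, and the header fields are passed in already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, non-atomic stand-in for the credit fields of cifs_rdma_info. */
struct credit_state {
	int receive_credits;
	int receive_credit_target;
	int send_credits;
	bool full_packet_received;
};

/*
 * Apply one received data transfer header to the state.
 * Returns true when the packet is the first segment of an upper-layer
 * message, i.e. it carries data and the previous message was complete.
 */
bool on_data_transfer(struct credit_state *st,
		      uint16_t credits_requested, uint16_t credits_granted,
		      uint32_t data_length, uint32_t remaining_data_length)
{
	bool first_segment = false;

	assert(st);

	st->receive_credits--;                          /* one posted receive was consumed */
	st->receive_credit_target = credits_requested;  /* peer asks for this many */
	st->send_credits += credits_granted;            /* peer lets us send this many more */

	if (data_length) {
		first_segment = st->full_packet_received;
		/* The message is complete once nothing remains after this segment. */
		st->full_packet_received = (remaining_data_length == 0);
	}
	return first_segment;
}
```

An empty packet (data_length of zero) only moves credits around and leaves the reassembly flag untouched, which matches the "empty packet, finish it" branch in the patch.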

2017-08-02 20:22:15

by Long Li

Subject: [[PATCH v1] 12/37] [CIFS] SMBD: Handle send completion from CQ

From: Long Li <[email protected]>

In preparation for sending SMBD requests, add code to handle send completion. On send completion, the SMBD transport is responsible for freeing the resources used by the send.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 20237b7..ecbc832 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -197,6 +197,31 @@ cifs_rdma_qp_async_error_upcall(struct ib_event *event, void *context)
}
}

+/* Called in softirq, when an RDMA send is done */
+static void send_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+ int i;
+ struct cifs_rdma_request *request =
+ container_of(wc->wr_cqe, struct cifs_rdma_request, cqe);
+
+ log_rdma_send("cifs_rdma_request %p completed wc->status=%d\n",
+ request, wc->status);
+
+ if (wc->status != IB_WC_SUCCESS || wc->opcode != IB_WC_SEND) {
+ log_rdma_send("wc->status=%d wc->opcode=%d\n",
+ wc->status, wc->opcode);
+ }
+
+ for (i=0; i<request->num_sge; i++)
+ ib_dma_unmap_single(request->info->id->device,
+ request->sge[i].addr,
+ request->sge[i].length,
+ DMA_TO_DEVICE);
+
+ kfree(request->sge);
+ mempool_free(request, request->info->request_mempool);
+}
+
/* Called from softirq, when recv is done */
static void recv_done(struct ib_cq *cq, struct ib_wc *wc)
{
--
2.7.4

2017-08-02 20:12:20

by Long Li

Subject: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer to create SMBD transport and establish RDMA connection

From: Long Li <[email protected]>

Implement the code for connecting to an SMBD server. The client and server are connected using an RC Queue Pair over the RDMA API, which supports Infiniband, RoCE and iWARP. Upper layer code can call cifs_create_rdma_session to establish an SMBD RDMA connection.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 257 +++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 14 +++
2 files changed, 271 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 7c4c178..b18fb79 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -120,3 +120,260 @@ do { \
atomic_read(&info->send_credits), \
info->send_credit_target); \
} while (0)
+
+/* Upcall from RDMA CM */
+static int cifs_rdma_conn_upcall(
+ struct rdma_cm_id *id, struct rdma_cm_event *event)
+{
+ struct cifs_rdma_info *info = id->context;
+
+ log_rdma_event("event=%d status=%d\n", event->event, event->status);
+
+ switch (event->event) {
+ case RDMA_CM_EVENT_ADDR_RESOLVED:
+ case RDMA_CM_EVENT_ROUTE_RESOLVED:
+ info->ri_rc = 0;
+ complete(&info->ri_done);
+ break;
+
+ case RDMA_CM_EVENT_ADDR_ERROR:
+ info->ri_rc = -EHOSTUNREACH;
+ complete(&info->ri_done);
+ break;
+
+ case RDMA_CM_EVENT_ROUTE_ERROR:
+ info->ri_rc = -ENETUNREACH;
+ complete(&info->ri_done);
+ break;
+
+ case RDMA_CM_EVENT_ESTABLISHED:
+ case RDMA_CM_EVENT_CONNECT_ERROR:
+ case RDMA_CM_EVENT_UNREACHABLE:
+ case RDMA_CM_EVENT_REJECTED:
+ case RDMA_CM_EVENT_DEVICE_REMOVAL:
+ log_rdma_event("connected event=%d\n", event->event);
+ info->connect_state = event->event;
+ break;
+
+ case RDMA_CM_EVENT_DISCONNECTED:
+ break;
+
+ default:
+ break;
+ }
+
+ return 0;
+}
+
+/* Upcall from RDMA QP */
+static void
+cifs_rdma_qp_async_error_upcall(struct ib_event *event, void *context)
+{
+ struct cifs_rdma_info *info = context;
+ log_rdma_event("%s on device %s info %p\n",
+ ib_event_msg(event->event), event->device->name, info);
+
+ switch (event->event)
+ {
+ case IB_EVENT_CQ_ERR:
+ case IB_EVENT_QP_FATAL:
+ case IB_EVENT_QP_REQ_ERR:
+ case IB_EVENT_QP_ACCESS_ERR:
+
+ default:
+ break;
+ }
+}
+
+static struct rdma_cm_id* cifs_rdma_create_id(
+ struct cifs_rdma_info *info, struct sockaddr *dstaddr)
+{
+ struct rdma_cm_id *id;
+ int rc;
+ struct sockaddr_in *addr_in = (struct sockaddr_in*) dstaddr;
+ __be16 *sport;
+
+ log_rdma_event("connecting to IP %pI4 port %d\n",
+ &addr_in->sin_addr, ntohs(addr_in->sin_port));
+
+ id = rdma_create_id(&init_net, cifs_rdma_conn_upcall, info,
+ RDMA_PS_TCP, IB_QPT_RC);
+ if (IS_ERR(id)) {
+ rc = PTR_ERR(id);
+ log_rdma_event("rdma_create_id() failed %i\n", rc);
+ return id;
+ }
+
+ if (dstaddr->sa_family == AF_INET6)
+ sport = &((struct sockaddr_in6 *)dstaddr)->sin6_port;
+ else
+ sport = &((struct sockaddr_in *)dstaddr)->sin_port;
+
+ *sport = htons(445);
+try_again:
+ init_completion(&info->ri_done);
+ info->ri_rc = -ETIMEDOUT;
+ rc = rdma_resolve_addr(id, NULL, (struct sockaddr*)dstaddr, 5000);
+ if (rc) {
+ log_rdma_event("rdma_resolve_addr() failed %i\n", rc);
+ goto out;
+ }
+ wait_for_completion_interruptible_timeout(
+ &info->ri_done, msecs_to_jiffies(8000));
+ rc = info->ri_rc;
+ if (rc) {
+ log_rdma_event("rdma_resolve_addr() completed %i\n", rc);
+ goto out;
+ }
+
+ info->ri_rc = -ETIMEDOUT;
+ rc = rdma_resolve_route(id, 5000);
+ if (rc) {
+ log_rdma_event("rdma_resolve_route() failed %i\n", rc);
+ goto out;
+ }
+ wait_for_completion_interruptible_timeout(
+ &info->ri_done, msecs_to_jiffies(8000));
+ rc = info->ri_rc;
+ if (rc) {
+ log_rdma_event("rdma_resolve_route() completed %i\n", rc);
+ goto out;
+ }
+
+ return id;
+
+out:
+ // try port number 5445 if port 445 doesn't work
+ if (*sport == htons(445)) {
+ *sport = htons(5445);
+ goto try_again;
+ }
+ rdma_destroy_id(id);
+ return ERR_PTR(rc);
+}
+
+static int cifs_rdma_ia_open(
+ struct cifs_rdma_info *info, struct sockaddr *dstaddr)
+{
+ int rc;
+
+ info->id = cifs_rdma_create_id(info, dstaddr);
+ if (IS_ERR(info->id)) {
+ rc = PTR_ERR(info->id);
+ goto out1;
+ }
+
+ info->pd = ib_alloc_pd(info->id->device, 0);
+ if (IS_ERR(info->pd)) {
+ rc = PTR_ERR(info->pd);
+ log_rdma_event("ib_alloc_pd() returned %d\n", rc);
+ goto out2;
+ }
+
+ return 0;
+
+out2:
+ rdma_destroy_id(info->id);
+ info->id = NULL;
+
+out1:
+ return rc;
+}
+
+struct cifs_rdma_info* cifs_create_rdma_session(
+ struct TCP_Server_Info *server, struct sockaddr *dstaddr)
+{
+ int rc;
+ struct cifs_rdma_info *info;
+ struct rdma_conn_param conn_param;
+ struct ib_qp_init_attr qp_attr;
+ int max_pending = receive_credit_max + send_credit_target;
+
+ info = kzalloc(sizeof(struct cifs_rdma_info), GFP_KERNEL);
+ if (!info)
+ return NULL;
+
+ info->server_info = server;
+
+ rc = cifs_rdma_ia_open(info, dstaddr);
+ if (rc) {
+ log_rdma_event("cifs_rdma_ia_open rc=%d\n", rc);
+ goto out1;
+ }
+
+ if (max_pending > info->id->device->attrs.max_cqe ||
+ max_pending > info->id->device->attrs.max_qp_wr) {
+ log_rdma_event("consider lowering receive_credit_max and "
+ "send_credit_target. Possible CQE overrun, device "
+ "reporting max_cqe %d max_qp_wr %d\n",
+ info->id->device->attrs.max_cqe,
+ info->id->device->attrs.max_qp_wr);
+ goto out2;
+ }
+
+ info->receive_credit_max = receive_credit_max;
+ info->send_credit_target = send_credit_target;
+ info->max_send_size = max_send_size;
+ info->max_fragmented_recv_size = max_fragmented_recv_size;
+ info->max_receive_size = max_receive_size;
+
+ max_send_sge = min_t(int, max_send_sge,
+ info->id->device->attrs.max_sge);
+ max_recv_sge = min_t(int, max_recv_sge,
+ info->id->device->attrs.max_sge_rd);
+
+ info->cq = ib_alloc_cq(info->id->device, info,
+ info->receive_credit_max + info->send_credit_target,
+ 0, IB_POLL_SOFTIRQ);
+ if (IS_ERR(info->cq))
+ goto out2;
+
+ memset(&qp_attr, 0, sizeof qp_attr);
+ qp_attr.event_handler = cifs_rdma_qp_async_error_upcall;
+ qp_attr.qp_context = info;
+ qp_attr.cap.max_send_wr = info->send_credit_target;
+ qp_attr.cap.max_recv_wr = info->receive_credit_max;
+ qp_attr.cap.max_send_sge = max_send_sge;
+ qp_attr.cap.max_recv_sge = max_recv_sge;
+ qp_attr.cap.max_inline_data = 0;
+ qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
+ qp_attr.qp_type = IB_QPT_RC;
+ qp_attr.send_cq = info->cq;
+ qp_attr.recv_cq = info->cq;
+ qp_attr.port_num = ~0;
+
+ rc = rdma_create_qp(info->id, info->pd, &qp_attr);
+ if (rc) {
+ log_rdma_event("rdma_create_qp failed %i\n", rc);
+ rc = -ENETUNREACH;
+ goto out2;
+ }
+
+ memset(&conn_param, 0, sizeof(conn_param));
+ conn_param.private_data = NULL;
+ conn_param.private_data_len = 0;
+ conn_param.initiator_depth = 0;
+ conn_param.responder_resources = 32;
+ if (info->id->device->attrs.max_qp_rd_atom < 32)
+ conn_param.responder_resources =
+ info->id->device->attrs.max_qp_rd_atom;
+ conn_param.retry_count = 6;
+ conn_param.rnr_retry_count = 6;
+ conn_param.flow_control = 0;
+ rc = rdma_connect(info->id, &conn_param);
+ if (rc) {
+ log_rdma_event("rdma_connect() failed with %i\n", rc);
+ goto out2;
+ }
+
+ if (info->connect_state != RDMA_CM_EVENT_ESTABLISHED)
+ goto out2;
+
+ log_rdma_event("rdma_connect connected\n");
+out2:
+ rdma_destroy_id(info->id);
+
+out1:
+ kfree(info);
+ return NULL;
+}
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 9979fd4..71ea380 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -36,6 +36,16 @@
struct cifs_rdma_info {
struct TCP_Server_Info *server_info;

+ // RDMA related
+ struct rdma_cm_id *id;
+ struct ib_qp_init_attr qp_attr;
+ struct ib_pd *pd;
+ struct ib_cq *cq;
+ struct ib_device_attr dev_attr;
+ int connect_state;
+ int ri_rc;
+ struct completion ri_done;
+
//connection parameters
int receive_credit_max;
int send_credit_target;
@@ -55,4 +65,8 @@ struct cifs_rdma_info {
unsigned int count_put_receive_buffer;
unsigned int count_send_empty;
};
+
+// Create a SMBDirect session
+struct cifs_rdma_info* cifs_create_rdma_session(
+ struct TCP_Server_Info *server, struct sockaddr *dstaddr);
#endif
--
2.7.4
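
The port fallback in cifs_rdma_create_id() (try 445 first, then 5445 if address or route resolution fails) can be sketched in user space as follows. This is only an illustration under assumptions: resolve_fn, connect_with_fallback and resolve_only are hypothetical names, and the callback stands in for the rdma_resolve_addr()/rdma_resolve_route() pair plus the wait on ri_done.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical callback: returns 0 on success, a negative errno-style value on failure. */
typedef int (*resolve_fn)(uint16_t port, void *ctx);

/*
 * Try the SMB port 445 first and fall back to 5445, as the patch does.
 * On success the port that worked is stored in *used_port.
 * On failure the error of the last attempt is returned.
 */
int connect_with_fallback(resolve_fn resolve, void *ctx, uint16_t *used_port)
{
	static const uint16_t ports[] = { 445, 5445 };
	int rc = -1;
	unsigned int i;

	assert(resolve && used_port);

	for (i = 0; i < sizeof(ports) / sizeof(ports[0]); i++) {
		rc = resolve(ports[i], ctx);
		if (rc == 0) {
			*used_port = ports[i];
			return 0;
		}
	}
	return rc;
}

/* Demonstration resolver: only the port stored in *ctx answers. */
int resolve_only(uint16_t port, void *ctx)
{
	const uint16_t *open_port = ctx;

	return port == *open_port ? 0 : -110; /* -110 mirrors -ETIMEDOUT on Linux */
}
```

Writing the fallback as a loop over a small table avoids the backward goto and makes it easy to see that each port is tried at most once.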

2017-08-02 20:23:07

by Long Li

Subject: [[PATCH v1] 08/37] [CIFS] SMBD: Define packet format for SMBD data transfer message

From: Long Li <[email protected]>

Define the packet format for an SMBD data transfer message with payload.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index 78ce2bf..ed0ff54 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -78,6 +78,21 @@ enum smbd_message_type {
SMBD_TRANSFER_DATA,
};

+#define SMB_DIRECT_RESPONSE_REQUESTED 0x0001
+
+// SMBD data transfer packet with payload [MS-SMBD] 2.2.3
+struct smbd_data_transfer {
+ __le16 credits_requested;
+ __le16 credits_granted;
+ __le16 flags;
+ __le16 reserved;
+ __le32 remaining_data_length;
+ __le32 data_offset;
+ __le32 data_length;
+ __le32 padding;
+ char buffer[0];
+} __packed;
+
// The context for a SMBD response
struct cifs_rdma_response {
struct cifs_rdma_info *info;
--
2.7.4
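
To make the layout of struct smbd_data_transfer concrete, here is a small user-space sketch that decodes the 24-byte fixed header from a raw little-endian buffer. It follows the field order of the struct in the patch; everything else is an assumption for illustration: the names smbd_hdr, smbd_parse_hdr and SMBD_HDR_SIZE are hypothetical, and the bounds checks on data_offset and data_length are additions that the patch itself does not perform at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMBD_HDR_SIZE 24 /* 4 x 16-bit + 4 x 32-bit fields, as in the patch */

/* Host-endian copy of the header fields (hypothetical helper type). */
struct smbd_hdr {
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint16_t flags;
	uint32_t remaining_data_length;
	uint32_t data_offset;
	uint32_t data_length;
};

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too short or the payload is out of range. */
int smbd_parse_hdr(const uint8_t *buf, size_t len, struct smbd_hdr *h)
{
	assert(buf && h);

	if (len < SMBD_HDR_SIZE)
		return -1;

	h->credits_requested     = get_le16(buf + 0);
	h->credits_granted       = get_le16(buf + 2);
	h->flags                 = get_le16(buf + 4);
	/* bytes 6..7: reserved */
	h->remaining_data_length = get_le32(buf + 8);
	h->data_offset           = get_le32(buf + 12);
	h->data_length           = get_le32(buf + 16);
	/* bytes 20..23: padding */

	/* Assumed sanity check: a non-empty payload must lie inside the buffer. */
	if (h->data_length &&
	    (h->data_offset < SMBD_HDR_SIZE || h->data_offset > len ||
	     h->data_length > len - h->data_offset))
		return -1;

	return 0;
}
```

Decoding byte by byte keeps the sketch independent of host endianness and struct packing, which is what __le16/__le32 and __packed take care of in the kernel version.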

2017-08-02 20:23:28

by Long Li

Subject: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD transport parameters and default values

From: Long Li <[email protected]>

For each channel, SMBD defines per-channel parameters. Those can be negotiated with the server, and are restricted by RDMA hardware limitations.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 14 ++++++++++++++
fs/cifs/cifsrdma.h | 13 +++++++++++++
2 files changed, 27 insertions(+)

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
index 3c9f478..7c4c178 100644
--- a/fs/cifs/cifsrdma.c
+++ b/fs/cifs/cifsrdma.c
@@ -54,6 +54,20 @@

#include "cifsrdma.h"

+/*
+ * Per RDMA transport connection parameters
+ * as defined in [MS-SMBD] 3.1.1.1
+ */
+static int receive_credit_max = 512;
+static int send_credit_target = 512;
+static int max_send_size = 8192;
+static int max_fragmented_recv_size = 1024*1024;
+static int max_receive_size = 8192;
+
+// maximum number of SGEs in a RDMA I/O
+static int max_send_sge = 16;
+static int max_recv_sge = 16;
+
/* Logging functions
* Logging are defined as classes. They can be ORed to define the actual
* logging level via module parameter rdma_logging_class
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
index ec6aa61..9979fd4 100644
--- a/fs/cifs/cifsrdma.h
+++ b/fs/cifs/cifsrdma.h
@@ -36,6 +36,19 @@
struct cifs_rdma_info {
struct TCP_Server_Info *server_info;

+ //connection parameters
+ int receive_credit_max;
+ int send_credit_target;
+ int max_send_size;
+ int max_fragmented_recv_size;
+ int max_fragmented_send_size;
+ int max_receive_size;
+ int max_readwrite_size;
+ int protocol;
+ atomic_t send_credits;
+ atomic_t receive_credits;
+ atomic_t receive_credit_target;
+
// for debug purposes
unsigned int count_receive_buffer;
unsigned int count_get_receive_buffer;
--
2.7.4

2017-08-02 20:23:44

by Long Li

Subject: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

From: Long Li <[email protected]>

Define a new structure for SMBD transport. This structure will have all the
information on the transport, and it will be stored in the current SMB session.

Signed-off-by: Long Li <[email protected]>
---
fs/cifs/cifsrdma.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/cifs/cifsrdma.h | 45 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 101 insertions(+)
create mode 100644 fs/cifs/cifsrdma.c
create mode 100644 fs/cifs/cifsrdma.h

diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
new file mode 100644
index 0000000..a2c0478
--- /dev/null
+++ b/fs/cifs/cifsrdma.c
@@ -0,0 +1,56 @@
+/*
+ * Copyright (C) 2017, Microsoft Corporation.
+ *
+ * Author(s): Long Li <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#include <linux/fs.h>
+#include <linux/net.h>
+#include <linux/string.h>
+#include <linux/list.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/ctype.h>
+#include <linux/utsname.h>
+#include <linux/mempool.h>
+#include <linux/delay.h>
+#include <linux/completion.h>
+#include <linux/kthread.h>
+#include <linux/pagevec.h>
+#include <linux/freezer.h>
+#include <linux/namei.h>
+#include <asm/uaccess.h>
+#include <asm/processor.h>
+#include <linux/inet.h>
+#include <linux/module.h>
+#include <keys/user-type.h>
+#include <net/ipv6.h>
+#include <linux/parser.h>
+
+#include "cifspdu.h"
+#include "cifsglob.h"
+#include "cifsproto.h"
+#include "cifs_unicode.h"
+#include "cifs_debug.h"
+#include "cifs_fs_sb.h"
+#include "ntlmssp.h"
+#include "nterr.h"
+#include "rfc1002pdu.h"
+#include "fscache.h"
+
+#include "cifsrdma.h"
+
diff --git a/fs/cifs/cifsrdma.h b/fs/cifs/cifsrdma.h
new file mode 100644
index 0000000..ec6aa61
--- /dev/null
+++ b/fs/cifs/cifsrdma.h
@@ -0,0 +1,45 @@
+/*
+ * Copyright (C) 2017, Microsoft Corporation.
+ *
+ * Author(s): Long Li <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef _CIFS_RDMA_H
+#define _CIFS_RDMA_H
+
+#include "cifsglob.h"
+#include <rdma/ib_verbs.h>
+#include <rdma/rdma_cm.h>
+#include <linux/mempool.h>
+
+/*
+ * The context for the SMBDirect transport
+ * Everything related to the transport is here. It has several logical parts
+ * 1. RDMA related structures
+ * 2. SMBDirect connection parameters
+ * 3. Reassembly queue for data receive path
+ * 4. mempools for allocating packets
+ */
+struct cifs_rdma_info {
+ struct TCP_Server_Info *server_info;
+
+ // for debug purposes
+ unsigned int count_receive_buffer;
+ unsigned int count_get_receive_buffer;
+ unsigned int count_put_receive_buffer;
+ unsigned int count_send_empty;
+};
+#endif
--
2.7.4

2017-08-08 07:04:24

by Stefan Metzmacher

Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

Hi Li,

thanks for providing this patchset, I guess it will be a huge win to
have SMBDirect support for the kernel client!

> Define a new structure for SMBD transport. This stucture will have all the
> information on the transport, and it will be stored in the current SMB session.
...
> +/*
> + * The context for the SMBDirect transport
> + * Everything related to the transport is here. It has several
logical parts
> + * 1. RDMA related structures
> + * 2. SMBDirect connection parameters
> + * 3. Reassembly queue for data receive path
> + * 4. mempools for allocating packets
> + */
> +struct cifs_rdma_info {
> + struct TCP_Server_Info *server_info;
> +
> + // for debug purposes
> + unsigned int count_receive_buffer;
> + unsigned int count_get_receive_buffer;
> + unsigned int count_put_receive_buffer;
> + unsigned int count_send_empty;
> +};
> +#endif
>

It seems that the new transport is tied to its caller
regarding structures and naming conventions.

I think it would be better to strictly separate them,
as I'd like to use the SMBDirect transport also from the
userspace for the client side e.g. in Samba's '[lib]smbclient',
but also in Samba's server side code 'smbd'.

Would it be possible to isolate this in
smb_direct.c and smb_direct.h while using
smb_direct_* prefixes for structures and
functions? Also avoiding the usage of other headers
from fs/cifs/*.h, except for something generic like
nterr.h.

I guess 'struct cifs_rdma_info' would then be
'struct smb_direct_connection'. And it won't
have a reference to struct TCP_Server_Info.

If the strict layering is too much change,
I'd at least like to have the name changes.

This should be relatively easy to do by using something like

git format-patch --stdout -37 > before

cat before | sed \
-e 's!struct cifs_rdma_info!struct smb_direct_connection!g' \
-e 's!cifsrdma\.h!smb_direct.h!g' \
-e 's!cifsrdma\.c!smb_direct.c!g' \
> after

git reset --hard HEAD~37
git am after

metze

2017-08-12 08:32:53

by Long Li

Subject: RE: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

> -----Original Message-----
> From: Stefan Metzmacher [mailto:[email protected]]
> Sent: Monday, August 7, 2017 11:58 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD
> transport
>
> Hi Li,
>
> thanks for providing this patchset, I guess it will be a huge win to have
> SMBDirect support for the kernel client!
>
> > Define a new structure for SMBD transport. This stucture will have all
> > the information on the transport, and it will be stored in the current SMB
> session.
> ...
> > +/*
> > + * The context for the SMBDirect transport
> > + * Everything related to the transport is here. It has several
> logical parts
> > + * 1. RDMA related structures
> > + * 2. SMBDirect connection parameters
> > + * 3. Reassembly queue for data receive path
> > + * 4. mempools for allocating packets */ struct cifs_rdma_info {
> > + struct TCP_Server_Info *server_info;
> > +
> > + // for debug purposes
> > + unsigned int count_receive_buffer;
> > + unsigned int count_get_receive_buffer;
> > + unsigned int count_put_receive_buffer;
> > + unsigned int count_send_empty;
> > +};
> > +#endif
> >
>
> It seems that the new transport is tied to it's caller regarding structures and
> naming conventions.
>
> I think it would be better to strictly separate them, as I'd like to use the
> SMBDirect transport also from the userspace for the client side e.g. in
> Samba's '[lib]smbclient', but also in Samba's server side code 'smbd'.

Thank you for reviewing the patch set.

I think it is possible to separate the common code that implements the SMBDirect transport. There are some challenges in reusing the same code for both kernel and user space:
1. Kernel-mode RDMA verbs are similar to, but different from, the user-mode ones.
2. Some RDMA features (e.g. Fast Registration Work Request) are not available in user mode.
3. Locking and synchronization mechanisms are different.
4. Memory management is different.
5. Process creation/scheduling and data sharing between processes are different, and there is no user-mode code running in interrupt/softirq context.

Those need to be abstracted through a layer; the rest of the code can be shared. I can work on this after the patch set is reviewed.

>
> Would it be possible to isolate this in
> smb_direct.c and smb_direct.h while using
> smb_direct_* prefixes for structures and functions? Also avoiding the usage
> of other headers from fs/cifs/*.h, expect for something generic like nterr.h.

Sure, I will make the naming changes and clean up the header files.
>
> I guess 'struct cifs_rdma_info' would then be 'struct smb_direct_connection'.
> And it won't have a reference to struct TCP_Server_Info.

I will look for ways to remove the reference to struct TCP_Server_Info. It is there because TCP_Server_Info represents a transport connection, although it also carries a lot of other TCP-related state. SMBD needs to reach this connection's TCP_Server_Info to set the transport status on shutdown (and maybe in other situations).


Long

>
> It the strict layering is too much change, I'd at least like to have the name
> changes.
>
> This should relatively easy to do by using somthing like
>
> git format-patch --stdout -37 > before
>
> cat before | sed \
> -e 's!struct cifs_rdma_info!struct smb_direct_connection!g' \ -e
> 's!cifsrdma\.h!smb_direct.h!g' \ -e 's!cifsrdma\.c!smb_direct.c!g' \
> > after
>
> git reset --hard HEAD~37
> git am after
>
> metze

2017-08-12 18:49:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

On Sat, Aug 12, 2017 at 08:32:48AM +0000, Long Li via samba-technical wrote:
> I think it is possible to separate the common code that implements the SMBDirect transport. There are some challenges to reuse the same code for both kernel and user spaces.
> 1. Kernel mode RDMA verbs are similar but different to user-mode ones.
> 2. Some RDMA features (e.g Fast Registration Work Request) are not available in user-mode.
> 3. Locking and synchronization mechanism is different
> 4. Memory management is different.
> 5. Process creation/scheduling and data sharing between processes are different, and there is no user-mode code running in interrupt/softirq.
>
> Those needs to be abstracted through a layer, the rest of the code can be shared. I can work on this after patch set is reviewed.

NAK - code with that sort of obfuscation layer will be rejected
for kernel inclusion - don't add it.

2017-08-13 10:10:53

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

On Wed, Aug 02, 2017 at 01:10:13PM -0700, Long Li wrote:
> From: Long Li <[email protected]>
>
> Define a new structure for SMBD transport. This stucture will have all the
> information on the transport, and it will be stored in the current SMB session.
>
> Signed-off-by: Long Li <[email protected]>
> ---
> fs/cifs/cifsrdma.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> fs/cifs/cifsrdma.h | 45 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 101 insertions(+)
> create mode 100644 fs/cifs/cifsrdma.c
> create mode 100644 fs/cifs/cifsrdma.h
>
> diff --git a/fs/cifs/cifsrdma.c b/fs/cifs/cifsrdma.c
> new file mode 100644
> index 0000000..a2c0478
> --- /dev/null
> +++ b/fs/cifs/cifsrdma.c
> @@ -0,0 +1,56 @@
> +/*
> + * Copyright (C) 2017, Microsoft Corporation.
> + *
> + * Author(s): Long Li <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
> + * the GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +#include <linux/fs.h>
> +#include <linux/net.h>
> +#include <linux/string.h>
> +#include <linux/list.h>
> +#include <linux/wait.h>
> +#include <linux/slab.h>
> +#include <linux/pagemap.h>
> +#include <linux/ctype.h>
> +#include <linux/utsname.h>
> +#include <linux/mempool.h>
> +#include <linux/delay.h>
> +#include <linux/completion.h>
> +#include <linux/kthread.h>
> +#include <linux/pagevec.h>
> +#include <linux/freezer.h>
> +#include <linux/namei.h>
> +#include <asm/uaccess.h>
> +#include <asm/processor.h>
> +#include <linux/inet.h>
> +#include <linux/module.h>
> +#include <keys/user-type.h>
> +#include <net/ipv6.h>
> +#include <linux/parser.h>

Where do all these includes come from? It seems like most of them
are not actually used in the code.

2017-08-13 10:12:01

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD transport parameters and default values

> +/*
> + * Per RDMA transport connection parameters
> + * as defined in [MS-SMBD] 3.1.1.1
> + */
> +static int receive_credit_max = 512;
> +static int send_credit_target = 512;
> +static int max_send_size = 8192;
> +static int max_fragmented_recv_size = 1024*1024;
> +static int max_receive_size = 8192;

Are these protocol constants? If so please use either #defines
or enums with upper case names for them.

> +// maximum number of SGEs in a RDMA I/O

Please always use /* ... */ style comments in the kernel.
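
A minimal sketch of what the two comments above could look like when applied together, assuming the values really are fixed defaults rather than tunables. The SMBD_DEFAULT_* names are hypothetical and not taken from the patch; the values are the ones quoted above.

```c
#include <assert.h>

/*
 * Default per-connection SMBD parameters, see [MS-SMBD] 3.1.1.1.
 * The SMBD_DEFAULT_* names are hypothetical; the values are the
 * defaults from the patch under review.
 */
enum {
	SMBD_DEFAULT_RECEIVE_CREDIT_MAX		= 512,
	SMBD_DEFAULT_SEND_CREDIT_TARGET		= 512,
	SMBD_DEFAULT_MAX_SEND_SIZE		= 8192,
	SMBD_DEFAULT_MAX_FRAGMENTED_RECV_SIZE	= 1024 * 1024,
	SMBD_DEFAULT_MAX_RECEIVE_SIZE		= 8192,
};

/* A reassembled message must be able to hold at least one receive. */
static_assert(SMBD_DEFAULT_MAX_FRAGMENTED_RECV_SIZE >=
	      SMBD_DEFAULT_MAX_RECEIVE_SIZE,
	      "fragmented receive size smaller than a single receive");
```

If the values are meant to be negotiated or tuned at runtime, plain variables initialized from such defaults would be the more natural form.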

2017-08-13 10:15:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 08/37] [CIFS] SMBD: Define packet format for SMBD data transfer message

> +// SMBD data transfer packet with payload [MS-SMBD] 2.2.3
> +struct smbd_data_transfer {
> + __le16 credits_requested;
> + __le16 credits_granted;
> + __le16 flags;
> + __le16 reserved;
> + __le32 remaining_data_length;
> + __le32 data_offset;
> + __le32 data_length;
> + __le32 padding;
> + char buffer[0];

Please use the actually standardized [] syntax for variable sized
arrays. Also normally this would be a __u8 to fit with the other
types, but I haven't seen the usage yet.

> +} __packed;

The structure is naturally packed already, no need to add the
attribute.
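
For illustration, a userspace sketch of the same header with a C99 flexible array member and compile-time layout checks. It assumes uint16_t/uint32_t as stand-ins for the kernel's __le16/__le32 types so it can be built outside the kernel; the checks are one way to verify that no padding is inserted on a given ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for the quoted wire header. The fixed-width
 * integer types replace __le16/__le32 (an assumption of this sketch).
 */
struct smbd_data_transfer {
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint16_t flags;
	uint16_t reserved;
	uint32_t remaining_data_length;
	uint32_t data_offset;
	uint32_t data_length;
	uint32_t padding;
	uint8_t buffer[];	/* C99 flexible array member */
};

/* Four 16-bit fields plus four 32-bit fields: 24 bytes, no holes. */
static_assert(sizeof(struct smbd_data_transfer) == 24,
	      "unexpected padding in smbd_data_transfer");
static_assert(offsetof(struct smbd_data_transfer, buffer) == 24,
	      "payload does not start right after the header");
```

A build-time assertion of this kind catches an unexpected layout without relying on the packed attribute.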

2017-08-13 10:18:34

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 11/37] [CIFS] SMBD: Post a receive request

> + switch(response->type) {
> + case SMBD_TRANSFER_DATA:
> + data_transfer = (struct smbd_data_transfer *) response->packet;

Maybe add a little helper for the packet data to hide these casts, e.g.

static inline void *smbd_payload(struct cifs_rdma_response *resp)
{
	return (void *)resp->packet;
}


> + atomic_dec(&info->receive_credits);
> + atomic_set(&info->receive_credit_target,
> + le16_to_cpu(data_transfer->credits_requested));
> + atomic_add(le16_to_cpu(data_transfer->credits_granted),
> + &info->send_credits);

That's a lot of atomic ops in the fast path handler. Also remember
that atomic_set isn't really atomic vs other callers.

2017-08-13 10:19:50

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 12/37] [CIFS] SMBD: Handle send completion from CQ

You seem to be doing memory allocations and frees for every packet on
the write. At least for other RDMA protocols that would have been
a major performance issue.

Do you have any performance numbers and/or profiles of the code?

2017-08-13 10:22:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 13/37] [CIFS] SMBD: Implement SMBD protocol negotiation

> + request = mempool_alloc(info->request_mempool, GFP_KERNEL);
> + if (!request)
> + return rc;

Here you do a mempool allocation to guarantee forward progress..

> + request->sge = kzalloc(sizeof(struct ib_sge), GFP_KERNEL);
> + if (!request->sge)
> + goto allocate_sge_failed;

... and here it's a plain malloc in the same path, which renders the
above useless. Also I would just embed the sge into the containing
structure to avoid a memory allocation.

> + request->info = info;
> +
> + packet = (struct smbd_negotiate_req *) request->packet;
> + packet->min_version = cpu_to_le16(0x100);
> + packet->max_version = cpu_to_le16(0x100);

Can you add a constant for the version code?

> + packet->reserved = cpu_to_le16(0);

no need to byte swap 0 - you can always assign it directly.
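
The three suggestions above (embed the SGE, name the version code, assign zero directly) could be combined roughly as follows. This is a userspace sketch only: struct sge is a simplified stand-in for struct ib_sge, SMBD_V1 is a hypothetical name for the 0x100 version code, calloc() stands in for the mempool allocation, and the cpu_to_le16() conversions are left out to keep the example portable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SMBD_V1 0x0100	/* hypothetical name for the 0x100 version code */

/* Simplified stand-in for the kernel's struct ib_sge. */
struct sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Only the negotiate request fields discussed above. */
struct negotiate_req {
	uint16_t min_version;
	uint16_t max_version;
	uint16_t reserved;
};

static_assert(sizeof(struct negotiate_req) == 6,
	      "unexpected padding in negotiate_req");

struct request {
	struct sge sge;			/* embedded: no second allocation */
	struct negotiate_req packet;
};

/* One allocation covers the request, its SGE and the packet. */
static struct request *alloc_negotiate_request(void)
{
	struct request *req = calloc(1, sizeof(*req));

	if (!req)
		return NULL;
	req->packet.min_version = SMBD_V1;
	req->packet.max_version = SMBD_V1;
	req->packet.reserved = 0;	/* zero is the same in any byte order */
	req->sge.length = sizeof(req->packet);
	return req;
}
```

With the SGE embedded, the only allocation left in this path is the one for the request itself.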

2017-08-13 10:23:48

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

You can always get the struct page for kernel allocations using
virt_to_page (or vmalloc_to_page, but this code would not handle the
vmalloc case either), so I don't think you need this helper and can
always use the one added in the previous patch.

2017-08-13 10:24:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload

On Wed, Aug 02, 2017 at 01:10:27PM -0700, Long Li wrote:
> From: Long Li <[email protected]>
>
> Implement the function to send a SMBD message with no payload. This is required at times when we want to extend credtis to server to have it continue to send data, without sending any actual data payload.

Shouldn't this just be implemented as a special case in the version
that posts data?

2017-08-13 10:27:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 00/37] Implement SMBD protocol: Series 1

Hi Long,

a few meta-comments:

first, the split into lots of tiny patches makes the series really hard
to read, I'd split it into just a few:

Patch 1: add the protocol constants
Patch 2-n: core cifs modifications required to implement SMBD
Patch n+1: add the actual SMBD code
Patch n+2..m: wire up the core CIFS code to use it.

second, a lot of the code doesn't seem to follow the usual Linux style.
It might be a good idea to run it through scripts/checkpatch.pl and
at least fix the errors it generates. The warnings are much more
opinionated so feel free to ignore them if they sound odd for now.

2017-08-13 10:31:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [[PATCH v1] 00/37] Implement SMBD protocol: Series 1

... and a third one:

please include the linux-rdma mailing list on your next post, to make
sure we get a good review pool of people familiar with the rdma code.

2017-08-14 10:24:44

by Jeff Layton

[permalink] [raw]
Subject: Re: [[PATCH v1] 08/37] [CIFS] SMBD: Define packet format for SMBD data transfer message

On Sun, 2017-08-13 at 03:15 -0700, Christoph Hellwig wrote:
> > +// SMBD data transfer packet with payload [MS-SMBD] 2.2.3
> > +struct smbd_data_transfer {
> > + __le16 credits_requested;
> > + __le16 credits_granted;
> > + __le16 flags;
> > + __le16 reserved;
> > + __le32 remaining_data_length;
> > + __le32 data_offset;
> > + __le32 data_length;
> > + __le32 padding;
> > + char buffer[0];
>
> Please use the actually standardized [] syntax for variable sized
> arrays. Also normally this would be a __u8 to fit with the other
> types, but I haven't seen the usage yet.
>

Yes, having a single-element array makes it harder to handle the
indexes, etc. Flexible arrays are better.

> > +} __packed;
>
> The structure is natually packed already, no need to add the
> attribute.

I think this should remain on structs that are intended to go across the
wire. Could we ever end up with some exotic arch that stuffs some
padding in there? Maybe I'm just paranoid, but I don't see any harm in
leaving that here.

--
Jeff Layton <[email protected]>

2017-08-14 14:17:13

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport

Hi Long,

>> It seems that the new transport is tied to it's caller regarding structures and
>> naming conventions.
>>
>> I think it would be better to strictly separate them, as I'd like to use the
>> SMBDirect transport also from the userspace for the client side e.g. in
>> Samba's '[lib]smbclient', but also in Samba's server side code 'smbd'.
>
> Thank you for reviewing the patch set.
>
> I think it is possible to separate the common code that implements the SMBDirect transport. There are some challenges to reuse the same code for both kernel and user spaces.
> 1. Kernel mode RDMA verbs are similar but different to user-mode ones.
> 2. Some RDMA features (e.g Fast Registration Work Request) are not available in user-mode.
> 3. Locking and synchronization mechanism is different
> 4. Memory management is different.
> 5. Process creation/scheduling and data sharing between processes are different, and there is no user-mode code running in interrupt/softirq.
>
> Those needs to be abstracted through a layer, the rest of the code can be shared. I can work on this after patch set is reviewed.

I guess this is a misunderstanding...

I don't want to use that code and run it in userspace,
I have a userspace prototype more or less working here, see
https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
and
https://git.samba.org/?p=metze/samba/wip.git;a=blob;f=libcli/smb/smb_direct.c;h=9cc0d861ccfcbb4df9ef6ad85a7fe3d262e549c0;hb=85d46de6fdbba041d3e8004af46865a72d2b8405

My goal is that we'll have an API that allows userspace
code to use the kernel SMBDirect code. This
userspace code would get a file descriptor from the kernel
and would be able to use it similar to a TCP socket.
If the kernel would simulate the 4 byte length header,
it's trivial to get to a stage where smbclient and smbd
are able to support SMBDirect without many changes.
We only need to replace connect(), listen(), accept() and a few more
by SMBDirect versions.
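
As a rough illustration of the 4 byte length header mentioned above: on SMB over TCP each message is preceded by a zero byte followed by a 24-bit big-endian length. The helper names below are hypothetical; this only sketches the framing a kernel shim would have to simulate and is not code from the prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMB_TCP_HDR_LEN	4
#define SMB_TCP_MAX_LEN	0x00FFFFFFu	/* length field is 24 bits wide */

/* Write the stream header for a message of len bytes; -1 if too large. */
static int smb_tcp_put_length(uint8_t hdr[SMB_TCP_HDR_LEN], uint32_t len)
{
	assert(hdr != NULL);
	if (len > SMB_TCP_MAX_LEN)
		return -1;
	hdr[0] = 0;			/* first byte is always zero */
	hdr[1] = (uint8_t)(len >> 16);
	hdr[2] = (uint8_t)(len >> 8);
	hdr[3] = (uint8_t)len;
	return 0;
}

/* Read the message length back from a stream header. */
static uint32_t smb_tcp_get_length(const uint8_t hdr[SMB_TCP_HDR_LEN])
{
	assert(hdr != NULL);
	return ((uint32_t)hdr[1] << 16) | ((uint32_t)hdr[2] << 8) | hdr[3];
}
```

An fd-based SMBDirect transport could prepend such a header on receive and strip it on send, so that existing stream-oriented code keeps working.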

For the real data transfer we might be able to use memfd_create()
or something similar to share the buffers between userspace and kernel.

I guess this is a long way, but having the basic SMBDirect code
independently in the kernel would help a lot.

>> Would it be possible to isolate this in
>> smb_direct.c and smb_direct.h while using
>> smb_direct_* prefixes for structures and functions? Also avoiding the usage
>> of other headers from fs/cifs/*.h, expect for something generic like nterr.h.
>
> Sure I will make naming changes and clean up the header files.

Thanks!

>> I guess 'struct cifs_rdma_info' would then be 'struct smb_direct_connection'.
>> And it won't have a reference to struct TCP_Server_Info.
>
> I will look for ways to remove reference to struct TCP_Server_Info . The reason why it has a reference to TCP_Server_Info is that: TCP_Server_Info represents a transport connection, although it also has many other TCP related code. SMBD needs to get to this connection TCP_Server_Info and set the transport status on shutdown (and maybe other situations).
>

Wouldn't it be better to provide a way to ask for the connection state
and let the caller ask for the state instead of changing the callers
structure?
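
A minimal sketch of that idea, with entirely hypothetical names (smb_direct_get_state(), struct upper_layer and so on): the transport keeps its own state and exposes an accessor, and the caller updates its own structure based on what it reads.

```c
#include <assert.h>
#include <stddef.h>

enum smb_direct_state {
	SMB_DIRECT_CONNECTING,
	SMB_DIRECT_CONNECTED,
	SMB_DIRECT_DISCONNECTED,
};

/* Transport side: knows nothing about the caller's structures. */
struct smb_direct_connection {
	enum smb_direct_state state;
};

static enum smb_direct_state
smb_direct_get_state(const struct smb_direct_connection *conn)
{
	assert(conn != NULL);
	return conn->state;
}

/* Called by the transport itself, e.g. on a disconnect event. */
static void smb_direct_shutdown(struct smb_direct_connection *conn)
{
	conn->state = SMB_DIRECT_DISCONNECTED;
}

/* Caller side: owns and updates its own status field. */
struct upper_layer {
	int needs_reconnect;
};

static void upper_layer_poll(struct upper_layer *ul,
			     const struct smb_direct_connection *conn)
{
	ul->needs_reconnect =
		smb_direct_get_state(conn) == SMB_DIRECT_DISCONNECTED;
}
```

In a real implementation the state would need proper locking or a notification mechanism; the sketch only shows the direction of the dependency.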

metze



2017-08-14 17:04:05

by Long Li

[permalink] [raw]
Subject: RE: [[PATCH v1] 00/37] Implement SMBD protocol: Series 1



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Sunday, August 13, 2017 3:28 AM
> To: Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]; Long Li
> <[email protected]>
> Subject: Re: [[PATCH v1] 00/37] Implement SMBD protocol: Series 1
>
> Hi Long,
>
> a few meta-comments:
>
> first the split into lots of tiny lists makes the series really hard to read, I'd split
> it into just a few:
>
> Patch 1: add the protocol constants
> Patch 2-n: core cifs modifications required to implement SMBD
> Patch n+1: add the actual SMBD code
> Patch n+2..m: wire up the core CIFS code to use it.
>
> second, a lot of the code doesn't seem to follow the usual Linux style.
> It might be a good idea to run it through scripts/checkpatch.pl and at least fix
> the errors it generates. The warnings are much more opinionated so feel free
> to ignore them if they sound odd for now.

Hi Christoph

Thank you for reviewing the patch set.

I'll address those comments and update the patches.

Long

2017-08-14 18:11:01

by Long Li

[permalink] [raw]
Subject: RE: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD transport



> -----Original Message-----
> From: Stefan Metzmacher [mailto:[email protected]]
> Sent: Monday, August 14, 2017 6:41 AM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Christoph Hellwig
> <[email protected]>
> Subject: Re: [[PATCH v1] 02/37] [CIFS] SMBD: Add structure for SMBD
> transport
>
> Hi Long,
>
> >> It seems that the new transport is tied to it's caller regarding
> >> structures and naming conventions.
> >>
> >> I think it would be better to strictly separate them, as I'd like to
> >> use the SMBDirect transport also from the userspace for the client
> >> side e.g. in Samba's '[lib]smbclient', but also in Samba's server side code
> 'smbd'.
> >
> > Thank you for reviewing the patch set.
> >
> > I think it is possible to separate the common code that implements the
> SMBDirect transport. There are some challenges to reuse the same code for
> both kernel and user spaces.
> > 1. Kernel mode RDMA verbs are similar but different to user-mode ones.
> > 2. Some RDMA features (e.g Fast Registration Work Request) are not
> available in user-mode.
> > > 3. Locking and synchronization mechanism is different
> > > 4. Memory management is different.
> > 5. Process creation/scheduling and data sharing between processes are
> different, and there is no user-mode code running in interrupt/softirq.
> >
> > Those needs to be abstracted through a layer, the rest of the code can be
> shared. I can work on this after patch set is reviewed.
>
> I guess this is a misunderstanding...
>
> I don't want to use that code and run it in userspace, I have a userspace
> prototype more or less working here, see
> https://git.samba.org/?p=metze/samba/wip.git;a=shortlog;h=refs/heads/master3-rdma
> and
> https://git.samba.org/?p=metze/samba/wip.git;a=blob;f=libcli/smb/smb_direct.c;h=9cc0d861ccfcbb4df9ef6ad85a7fe3d262e549c0;hb=85d46de6fdbba041d3e8004af46865a72d2b8405
>
> I goal is that we'll have an api that allows userspace code to use the kernel
> code SMBDirect code. This userspace code would get a file descriptor from
> the kernel and would be able to use it similar to a tcp socket.
> If the kernel would simulate the 4 byte length header, it's trivial to get to a
> stage were smbclient and smbd are able to support SMBDirect without much
> changes.
> We only need to replace connect(), listen(), accept() and a few more by
> SMBDirect versions.

This is possible. The SMBDirect code can handle the first 4 bytes for the upper layer.
>
> For the real data transfer we might be able to use memfd_create() or
> something similar to share the buffers between userspace and kernel.

You'll need to post RDMA read/write through the same QP created by SMBDirect in the kernel. I think this needs some more work but it's doable.

>
> I guess this is a long way, but having the basic SMBDirect code in dependently
> in the kernel would help a lot.
>
> >> Would it be possible to isolate this in smb_direct.c and smb_direct.h
> >> while using
> >> smb_direct_* prefixes for structures and functions? Also avoiding the
> >> usage of other headers from fs/cifs/*.h, expect for something generic like
> nterr.h.
> >
> > Sure I will make naming changes and clean up the header files.
>
> Thanks!
>
> >> I guess 'struct cifs_rdma_info' would then be 'struct
> smb_direct_connection'.
> >> And it won't have a reference to struct TCP_Server_Info.
> >
> > I will look for ways to remove reference to struct TCP_Server_Info . The
> reason why it has a reference to TCP_Server_Info is that: TCP_Server_Info
> represents a transport connection, although it also has many other TCP
> related code. SMBD needs to get to this connection TCP_Server_Info and set
> the transport status on shutdown (and maybe other situations).
> >
>
> Wouldn't it be better to provide a way to ask for the connection state and let
> the caller ask for the state instead of changing the callers structure?
>
> metze


2017-08-14 18:16:33

by Long Li

[permalink] [raw]
Subject: RE: [[PATCH v1] 12/37] [CIFS] SMBD: Handle send completion from CQ



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Sunday, August 13, 2017 3:20 AM
> To: Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]; Long Li
> <[email protected]>
> Subject: Re: [[PATCH v1] 12/37] [CIFS] SMBD: Handle send completion from
> CQ
>
> You seem to be doing memory allocations and frees for every packet on the
> write. At least for other RDMA protocols that would have been a major
> performance issue.

The size of the SGE array passed to IB is not known in advance, so I don't know how much to pre-allocate. But the array does not seem to be large when passed down from CIFS. I will look at pre-allocating the buffer if this is an issue.

>
> Do you have any performance numbers and/or profiles of the code?

I will look into profiling.

2017-08-14 18:20:10

by Long Li

[permalink] [raw]
Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload



> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Sunday, August 13, 2017 3:24 AM
> To: Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]; Long Li
> <[email protected]>
> Subject: Re: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no
> payload
>
> On Wed, Aug 02, 2017 at 01:10:27PM -0700, Long Li wrote:
> > From: Long Li <[email protected]>
> >
> > Implement the function to send a SMBD message with no payload. This is
> required at times when we want to extend credtis to server to have it
> continue to send data, without sending any actual data payload.
>
> Shouldn't this just be implemented as a special case in the version that posts
> data?

It uses a different packet format "struct smbd_data_transfer_no_data". I can restructure some common code to share between packet sending functions.

2017-08-14 19:01:15

by Tom Talpey

[permalink] [raw]
Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Monday, August 14, 2017 2:20 PM
> To: Christoph Hellwig <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no
> payload
>
> > > Implement the function to send a SMBD message with no payload. This is
> > required at times when we want to extend credtis to server to have it
> > continue to send data, without sending any actual data payload.
> >
> > Shouldn't this just be implemented as a special case in the version that posts
> > data?
>
> It uses a different packet format "struct smbd_data_transfer_no_data". I can
> restructure some common code to share between packet sending functions.

The SMB Direct keepalive is just a Data Transfer Message with no payload
(MS-SMBD section 2.2.3) and the SMB_DIRECT_RESPONSE_REQUESTED flag
possibly set. I don't see any need to define a special structure to describe this?
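
A sketch of how one header-filling routine could cover both cases, assuming the single Data Transfer layout from MS-SMBD 2.2.3. The function name and parameters are hypothetical, the flag value 0x0001 should be checked against the spec, and little-endian conversion is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SMB_DIRECT_RESPONSE_REQUESTED 0x0001	/* verify against MS-SMBD */

/* Fixed part of the Data Transfer message; payload follows if present. */
struct smbd_data_transfer_hdr {
	uint16_t credits_requested;
	uint16_t credits_granted;
	uint16_t flags;
	uint16_t reserved;
	uint32_t remaining_data_length;
	uint32_t data_offset;
	uint32_t data_length;
	uint32_t padding;
};

static_assert(sizeof(struct smbd_data_transfer_hdr) == 24,
	      "unexpected header size");

/*
 * Fill a header for a message with data_length payload bytes.
 * data_length == 0 gives the empty (credit grant / keepalive) message.
 */
static void smbd_fill_header(struct smbd_data_transfer_hdr *hdr,
			     uint16_t credits_requested,
			     uint16_t credits_granted,
			     uint32_t data_length, int want_response)
{
	memset(hdr, 0, sizeof(*hdr));
	hdr->credits_requested = credits_requested;
	hdr->credits_granted = credits_granted;
	if (want_response)
		hdr->flags |= SMB_DIRECT_RESPONSE_REQUESTED;
	hdr->data_length = data_length;
	/* The payload, when present, starts right after the header. */
	hdr->data_offset = data_length ? sizeof(*hdr) : 0;
}
```

With this shape the no-payload sender becomes a call with data_length set to 0 rather than a separate structure and code path.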

Tom.


2017-08-14 19:10:07

by Tom Talpey

[permalink] [raw]
Subject: RE: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma mount option

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma mount
> option
>
> From: Long Li <[email protected]>
>
> When doing mount with "-o rdma", user can specify this is for connecting to a
> SMBD session.

Nit: it's an "SMB" session. SMBD (SMB Direct) is the transport connection.

The use of SMB Direct is only applicable when an SMB3.x dialect is negotiated.
Is there any restriction in the use of the mount option to preclude also specifying
SMB1 and SMB2.x?

Tom.

2017-08-14 19:28:50

by Tom Talpey

[permalink] [raw]
Subject: RE: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD transport parameters and default values

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Christoph Hellwig
> Sent: Sunday, August 13, 2017 6:12 AM
> To: Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]; Long Li
> <[email protected]>
> Subject: Re: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD
> transport parameters and default values
>
> > +/*
> > + * Per RDMA transport connection parameters
> > + * as defined in [MS-SMBD] 3.1.1.1
> > + */
> > +static int receive_credit_max = 512;
> > +static int send_credit_target = 512;
> > +static int max_send_size = 8192;
> > +static int max_fragmented_recv_size = 1024*1024;
> > +static int max_receive_size = 8192;
>
> Are these protocol constants? If so please use either #defines
> or enums with upper case names for them.

These are not defined constants, but the values beg for some explanatory text
why they are chosen. Windows uses, and negotiates by default, a 1364-byte
maximum send size, and caps credits to 255. The other values match.

BTW, the parameters are defined in MS-SMBD 3.1.1.1 but the chosen values
are in behavior notes 2 and 7.

Tom.


2017-08-14 19:54:44

by Tom Talpey

[permalink] [raw]
Subject: RE: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer to create SMBD transport and establish RDMA connection

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer to
> create SMBD transport and establish RDMA connection
>
> From: Long Li <[email protected]>
>
> Implement the code for connecting to SMBD server. The client and server are
> connected using RC Queue Pair over RDMA API, which suppports Infiniband,
> RoCE and iWARP. Upper layer code can call cifs_create_rdma_session to
> establish a SMBD RDMA connection.
>
> +/* Upcall from RDMA CM */
> +static int cifs_rdma_conn_upcall(
> + struct rdma_cm_id *id, struct rdma_cm_event *event)
> +{
> + struct cifs_rdma_info *info = id->context;
> +
> + log_rdma_event("event=%d status=%d\n", event->event, event->status);
> +
> + switch (event->event) {
> + case RDMA_CM_EVENT_ADDR_RESOLVED:
> + case RDMA_CM_EVENT_ROUTE_RESOLVED:
> + info->ri_rc = 0;
> + complete(&info->ri_done);
> + break;
> +
> + case RDMA_CM_EVENT_ADDR_ERROR:
> + info->ri_rc = -EHOSTUNREACH;
> + complete(&info->ri_done);
> + break;
> +
> + case RDMA_CM_EVENT_ROUTE_ERROR:
> + info->ri_rc = -ENETUNREACH;
> + complete(&info->ri_done);
> + break;
> +
> + case RDMA_CM_EVENT_ESTABLISHED:
> + case RDMA_CM_EVENT_CONNECT_ERROR:
> + case RDMA_CM_EVENT_UNREACHABLE:
> + case RDMA_CM_EVENT_REJECTED:
> + case RDMA_CM_EVENT_DEVICE_REMOVAL:
> + log_rdma_event("connected event=%d\n", event->event);
> + info->connect_state = event->event;
> + break;
> +
> + case RDMA_CM_EVENT_DISCONNECTED:
> + break;
> +
> + default:
> + break;
> + }
> +
> + return 0;
> +}

This code looks a lot like the connection stuff in the NFS/RDMA RPC transport.
Does your code have the same needs? If so, you might consider moving this to
a common RDMA handler.

> +/* Upcall from RDMA QP */
> +static void
> +cifs_rdma_qp_async_error_upcall(struct ib_event *event, void *context)
> +{
> + struct cifs_rdma_info *info = context;
> + log_rdma_event("%s on device %s info %p\n",
> + ib_event_msg(event->event), event->device->name, info);
> +
> + switch (event->event)
> + {
> + case IB_EVENT_CQ_ERR:
> + case IB_EVENT_QP_FATAL:
> + case IB_EVENT_QP_REQ_ERR:
> + case IB_EVENT_QP_ACCESS_ERR:
> +
> + default:
> + break;
> + }
> +}

Ditto. But, what's up with the empty switch(event->event) processing?

> +static struct rdma_cm_id* cifs_rdma_create_id(
> + struct cifs_rdma_info *info, struct sockaddr *dstaddr)
> +{
...
> + log_rdma_event("connecting to IP %pI4 port %d\n",
> + &addr_in->sin_addr, ntohs(addr_in->sin_port));
>... and then...
> + if (dstaddr->sa_family == AF_INET6)
> + sport = &((struct sockaddr_in6 *)dstaddr)->sin6_port;
> + else
> + sport = &((struct sockaddr_in *)dstaddr)->sin_port;
> +
> + *sport = htons(445);
...and
> +out:
> + // try port number 5445 if port 445 doesn't work
> + if (*sport == htons(445)) {
> + *sport = htons(5445);
> + goto try_again;
> + }

Suggest rearranging the log_rdma_event() call to reflect reality.

The IANA-assigned port for SMB Direct is 5445, and port 445 will be
listening on TCP. Should you really be probing that port before 5445?
I suggest not doing so unconditionally.

> +struct cifs_rdma_info* cifs_create_rdma_session(
> + struct TCP_Server_Info *server, struct sockaddr *dstaddr)
> +{
> ...
> + int max_pending = receive_credit_max + send_credit_target;
>...
> + if (max_pending > info->id->device->attrs.max_cqe ||
> + max_pending > info->id->device->attrs.max_qp_wr) {
> + log_rdma_event("consider lowering receive_credit_max and "
> + "send_credit_target. Possible CQE overrun, device "
> + "reporting max_cpe %d max_qp_wr %d\n",
> + info->id->device->attrs.max_cqe,
> + info->id->device->attrs.max_qp_wr);
> + goto out2;
> + }

I don't understand this. Why are you directing both Receive and Send completions
to the same CQ, won't that make it very hard to manage completions and their
interrupts? Also, what device(s) have you seen trigger this log? CQ's are generally
allowed to be quite large.

> + conn_param.responder_resources = 32;
> + if (info->id->device->attrs.max_qp_rd_atom < 32)
> + conn_param.responder_resources =
> + info->id->device->attrs.max_qp_rd_atom;
> + conn_param.retry_count = 6;
> + conn_param.rnr_retry_count = 6;
> + conn_param.flow_control = 0;

These choices warrant explanation. 32 responder resources is on the large side for
most datacenter networks. And because of SMB Direct's credits, why is RNR_retry not
simply zero?



2017-08-14 20:09:11

by Tom Talpey

[permalink] [raw]
Subject: RE: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for handling SMBD response

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for
> handling SMBD response
>
> +/*
> + * Receive buffer operations.
> + * For each remote send, we need to post a receive. The receive buffers are
> + * pre-allocated in advance.
> + */

This approach appears to have been derived from the NFS/RDMA one.
The SMB protocol operates very differently! It is not a strict request/
response protocol. Many operations can become asynchronous by the
server choosing to make a STATUS_PENDING reply. A second reply then
comes later. The SMB2_CANCEL operation normally has no reply at all.
And callbacks for oplocks can occur at any time.

Even within a single request, many replies can be received. For example,
an SMB2_READ response which exceeds your negotiated receive size of
8192. These will be fragmented by SMB Direct into a "train" of multiple
messages, which will be logically reassembled by the receiver. Each of
them will consume a credit.

Thanks to SMB Direct crediting, the connection is not failing, but you are
undoubtedly spending a lot of time and ping-ponging to re-post receives
and allow the message trains to flow. And, because it's never one-to-one,
there are also unneeded receives posted before and after such exchanges.

You need to use SMB Direct crediting to post a more traffic-sensitive pool
of receives, and simply manage its depth when posting client requests.
As a start, I'd suggest simply choosing a constant number, approximately
whatever credit value you actually negotiate with the peer. Then, just
replenish (re-post) receive buffers as they are completed by the adapter.
You can get more sophisticated about this strategy later.
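
The simple starting strategy described above can be modeled as a fixed-depth pool that is topped up on every completion. This is only a userspace model with hypothetical names; the counters stand in for actual ib_post_recv() calls.

```c
#include <assert.h>

struct recv_pool {
	unsigned int depth;	/* target number of posted receives */
	unsigned int posted;	/* receives currently posted */
	unsigned int reposts;	/* receives re-posted after completions */
};

/* Post receives until the pool is back at its target depth. */
static unsigned int recv_pool_fill(struct recv_pool *p)
{
	unsigned int n = 0;

	while (p->posted < p->depth) {
		p->posted++;	/* stands in for posting one receive */
		n++;
	}
	return n;
}

/* Size the pool from the negotiated credit count and fill it once. */
static void recv_pool_init(struct recv_pool *p, unsigned int credits)
{
	p->depth = credits;
	p->posted = 0;
	p->reposts = 0;
	recv_pool_fill(p);
}

/* Completion handler: one buffer was consumed, replenish right away. */
static void recv_pool_complete(struct recv_pool *p)
{
	assert(p->posted > 0);
	p->posted--;
	p->reposts += recv_pool_fill(p);
}
```

Because the depth does not depend on how many requests are outstanding, fragmented replies and unsolicited messages find receives already posted.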

Tom.

> +static struct cifs_rdma_response* get_receive_buffer(struct cifs_rdma_info
> *info)
> +{
> + struct cifs_rdma_response *ret = NULL;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&info->receive_queue_lock, flags);
> + if (!list_empty(&info->receive_queue)) {
> + ret = list_first_entry(
> + &info->receive_queue,
> + struct cifs_rdma_response, list);
> + list_del(&ret->list);
> + info->count_receive_buffer--;
> + info->count_get_receive_buffer++;
> + }
> + spin_unlock_irqrestore(&info->receive_queue_lock, flags);
> +
> + return ret;
> +}
> +
> +static void put_receive_buffer(
> + struct cifs_rdma_info *info, struct cifs_rdma_response *response)
> +{
> + unsigned long flags;
> +
> + ib_dma_unmap_single(info->id->device, response->sge.addr,
> + response->sge.length, DMA_FROM_DEVICE);
> +
> + spin_lock_irqsave(&info->receive_queue_lock, flags);
> + list_add_tail(&response->list, &info->receive_queue);
> + info->count_receive_buffer++;
> + info->count_put_receive_buffer++;
> + spin_unlock_irqrestore(&info->receive_queue_lock, flags);
> +}
> +
> +static int allocate_receive_buffers(struct cifs_rdma_info *info, int num_buf)
> +{
> + int i;
> + struct cifs_rdma_response *response;
> +
> + INIT_LIST_HEAD(&info->receive_queue);
> + spin_lock_init(&info->receive_queue_lock);
> +
> + for (i=0; i<num_buf; i++) {
> + response = mempool_alloc(info->response_mempool, GFP_KERNEL);
> + if (!response)
> + goto allocate_failed;
> +
> + response->info = info;
> + list_add_tail(&response->list, &info->receive_queue);
> + info->count_receive_buffer++;
> + }
> +
> + return 0;
> +
> +allocate_failed:
> + while (!list_empty(&info->receive_queue)) {
> + response = list_first_entry(
> + &info->receive_queue,
> + struct cifs_rdma_response, list);
> + list_del(&response->list);
> + info->count_receive_buffer--;
> +
> + mempool_free(response, info->response_mempool);
> + }
> + return -ENOMEM;
> +}
> +
> +static void destroy_receive_buffers(struct cifs_rdma_info *info)
> +{
> + struct cifs_rdma_response *response;
> + while ((response = get_receive_buffer(info)))
> + mempool_free(response, info->response_mempool);
> +}
> +


2017-08-14 20:23:17

by Tom Talpey

Subject: RE: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer message with page payload

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer message
> with page payload
>
> /*
> + * Send a page
> + * page: the page to send
> + * offset: offset in the page to send
> + * size: length in the page to send
> + * remaining_data_length: remaining data to send in this payload
> + */
> +static int cifs_rdma_post_send_page(struct cifs_rdma_info *info, struct page
> *page,
> + unsigned long offset, size_t size, int remaining_data_length)
> +{
>...
> + wait_event(info->wait_send_queue, atomic_read(&info->send_credits) >
> 0);

This is an uninterruptible wait, correct? What's to guarantee the event will
ever fire? Also, if the count is zero, there should be a check that an SMB Direct
credit request is outstanding. If not, it's wasteful to sleep until the keepalive
timer sends one.

Tom.

2017-08-14 20:26:59

by Tom Talpey

Subject: RE: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message
> with data payload
>

> Similar to sending transfer message with page payload, this function creates a
> SMBD data packet and send it over to RDMA, from iov passed from upper layer.

The following routine is heavily redundant with 14/37 cifs_rdma_post_send_page().
Because they share quite a bit of protocol and DMA mapping logic, I strongly suggest
they be merged.

Tom.

> +static int cifs_rdma_post_send_data(
> + struct cifs_rdma_info *info,
> + struct kvec *iov, int n_vec, int remaining_data_length);
> static int cifs_rdma_post_send_page(struct cifs_rdma_info *info,
> struct page *page, unsigned long offset,
> size_t size, int remaining_data_length);
> @@ -671,6 +674,122 @@ static int cifs_rdma_post_send_page(struct
> cifs_rdma_info *info, struct page *pa
> }


2017-08-14 20:44:40

by Tom Talpey

Subject: RE: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer to send data

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:10 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer to
> send data
>
> +/*
> + * Write data to transport
> + * Each rqst is transported as a SMBDirect payload
> + * rqst: the data to write
> + * return value: 0 if successfully write, otherwise error code
> + */
> +int cifs_rdma_write(struct cifs_rdma_info *info, struct smb_rqst *rqst)
> +{

!!!
This is a VERY confusing name. It is not sending an RDMA Write, which will
confuse any RDMA-enlightened reader. It's performing an RDMA Send, so
that name is perhaps one possibility.

> + if (info->transport_status != CIFS_RDMA_CONNECTED) {
> + log_cifs_write("disconnected returning -EIO\n");
> + return -EIO;
> + }

Isn't this optimizing the error case? There's no guarantee it's still connected by
the time the following request construction occurs. Why not just proceed without
the check?

> + /* Strip the first 4 bytes MS-SMB2 section 2.1
> + * they are used only for TCP transport */
> + iov[0].iov_base = (char*)rqst->rq_iov[0].iov_base + 4;
> + iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
> + buflen += iov[0].iov_len;

Ok, that layering choice in the cifs.ko client code needs to be corrected. After all,
it will need to be RDMA-aware to build the SMB3 read/write channel structures.
And, the code in cifs_rdma_post_send_data() is allocating and building a structure that
could have been accounted for much earlier, avoiding the extra overhead.

That change could happen later, the hack is mostly ok for now. But something
needs to be said in a comment.

Tom.

2017-08-14 20:47:34

by Tom Talpey

Subject: RE: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client and server

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client and
> server
>
> /*
> + * Extend the credits to remote peer
> + * This implements [MS-SMBD] 3.1.5.9
> + * The idea is that we should extend credits to remote peer as quickly as
> + * it's allowed, to maintain data flow. We allocate as much as receive
> + * buffer as possible, and extend the receive credits to remote peer
> + * return value: the new credtis being granted.
> + */
> +static int manage_credits_prior_sending(struct cifs_rdma_info *info)
> +{
> + int ret = 0;
> + struct cifs_rdma_response *response;
> + int rc;
> +
> + if (atomic_read(&info->receive_credit_target) >

When does the receive_credit_target value change? It seems wasteful to
perform an atomic_read() on this local value each time.

Tom.

2017-08-14 20:57:14

by Tom Talpey

Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to receive data

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to
> receive data
>
> /*
> + * Read data from receive reassembly queue
> + * All the incoming data packets are placed in reassembly queue
> + * buf: the buffer to read data into
> + * size: the length of data to read
> + * return value: actual data read
> + */
> +int cifs_rdma_read(struct cifs_rdma_info *info, char *buf, unsigned int size)
> +{
>...
> + spin_lock_irqsave(&info->reassembly_queue_lock, flags);
> + log_cifs_read("size=%d info->reassembly_data_length=%d\n", size,
> + atomic_read(&info->reassembly_data_length));
> + if (atomic_read(&info->reassembly_data_length) >= size) {

If the reassembly queue is protected by a lock, why is an atomic_read() of
its length needed?

> + // this is for reading rfc1002 length
> + if (response->first_segment && size==4) {
> + unsigned int rfc1002_len =
> + data_length + remaining_data_length;
> + *((__be32*)buf) = cpu_to_be32(rfc1002_len);
> + data_read = 4;
> + response->first_segment = false;
> + log_cifs_read("returning rfc1002 length %d\n",
> + rfc1002_len);
> + goto read_rfc1002_done;
> + }

I am totally confused. What does RFC1002 framing have to do with
receiving an SMB Direct packet???

> +
> + to_copy = min_t(int, data_length - offset, to_read);
> + memcpy(
> + buf + data_read,
> + (char*)data_transfer + data_offset + offset,
> + to_copy);

Is it really necessary to perform all these data copies, especially under the
reassembly_queue spinlock? This seems quite inefficient. Can the receive
buffers not be loaned out and chained logically?

Tom.

2017-08-14 20:59:59

by Tom Talpey

Subject: RE: [[PATCH v1] 22/37] [CIFS] SMBD: Implement API for upper layer to receive data to page

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 22/37] [CIFS] SMBD: Implement API for upper layer to
> receive data to page
>
> /*
> + * Read a page from receive reassembly queue
> + * page: the page to read data into
> + * to_read: the length of data to read
> + * return value: actual data read
> + */
> +int cifs_rdma_read_page(struct cifs_rdma_info *info,
> + struct page *page, unsigned int to_read)
> +{

Same comment as for cifs_rdma_write() - this name is confusing as it
does not perform an RDMA Read. Needs to be changed.

Tom.

2017-08-14 21:02:36

by Tom Talpey

Subject: RE: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer to reconnect transport

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer to
> reconnect transport
>
> +int cifs_reconnect_rdma_session(struct TCP_Server_Info *server)
> +{
> + log_rdma_event("reconnecting rdma session\n");
> +
> + // why reconnect while it is still connected?
> + if (server->rdma_ses->transport_status == CIFS_RDMA_CONNECTED) {
> + log_rdma_event("still connected, not reconnecting\n");
> + return -EINVAL;
> + }

Why is this check needed?

> +
> + // wait until the transport is destroyed
> + while (server->rdma_ses->transport_status != CIFS_RDMA_DESTROYED)
> + msleep(1);

Polling!? Please plan to implement a proper handshake for connection logic.


2017-08-14 21:06:57

by Tom Talpey

Subject: RE: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive protocol

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive
> protocol
>
> SMBD uses a keep alive protocol to help peers detect if the remote is dead.
> When peer request keep alive, the transport needs to respond accordingly.

The keepalive exchange is also used to replenish credits in certain
pathological conditions.

> + // send an emtpy response right away if requested
> + if (le16_to_cpu(data_transfer->flags) |
> + le16_to_cpu(SMB_DIRECT_RESPONSE_REQUESTED)) {
> + info->keep_alive_requested = KEEP_ALIVE_PENDING;
> + }

This is clearly a typo, the condition is always true. "&"??
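
For the record, the difference between the two operators can be shown in a few lines of user-space C. The flag value 0x0001 is an assumption here, and the function names are hypothetical; the flags argument is taken to be already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag value for this sketch. */
#define SMB_DIRECT_RESPONSE_REQUESTED 0x0001

/* The test as posted: OR-ing in a non-zero constant can never yield
 * zero, so this returns true for every possible flags value. */
bool response_requested_buggy(uint16_t flags)
{
    return (flags | SMB_DIRECT_RESPONSE_REQUESTED) != 0;
}

/* The intended test: true only when the bit is actually set. */
bool response_requested(uint16_t flags)
{
    return (flags & SMB_DIRECT_RESPONSE_REQUESTED) != 0;
}
```

With "|", a message that carries no flags at all still marks a keepalive response as pending; with "&", only a message with the bit set does.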

Tom.

2017-08-14 21:12:33

by Tom Talpey

Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer
>
> +static int keep_alive_interval = 120;

This is the recommended value, but not the only possibility.

> @@ -1348,6 +1369,10 @@ struct cifs_rdma_info* cifs_create_rdma_session(
> init_waitqueue_head(&info->wait_send_queue);
> init_waitqueue_head(&info->wait_reassembly_queue);
>
> + INIT_DELAYED_WORK(&info->idle_timer_work, idle_connection_timer);
> + schedule_delayed_work(&info->idle_timer_work,
> + info->keep_alive_interval*HZ);
> +

This initialization is ok, but the timer should be rescheduled (extended) any time
any packet is sent. There is no need to perform keepalives on an active SMB Direct
connection.
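
One possible way to get this effect without touching the timer on every send is to record only the time of the last send and let the timer handler decide when it fires. The following is a user-space sketch with hypothetical names; times are plain seconds here, not jiffies, and the caller is assumed to re-arm the timer with the returned delay.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical idle-tracking state for one connection. */
struct idle_tracker {
    unsigned long interval;   /* keepalive interval, in seconds */
    unsigned long last_send;  /* time of the most recent send, in seconds */
};

/* Called on every send: a single store, no timer manipulation. */
void idle_note_send(struct idle_tracker *t, unsigned long now)
{
    t->last_send = now;
}

/*
 * Called when the timer fires. Returns true if the connection has been
 * idle for a full interval and a keepalive should be sent. In both
 * cases *next_delay is set to how long to wait before the next check.
 */
bool idle_timer_fired(const struct idle_tracker *t, unsigned long now,
                      unsigned long *next_delay)
{
    unsigned long idle = now - t->last_send;

    if (idle >= t->interval) {
        *next_delay = t->interval;
        return true;
    }

    /* Recent traffic: sleep only for the remainder of the interval. */
    *next_delay = t->interval - idle;
    return false;
}
```

On an active connection the handler then just re-arms itself, and the per-send cost is one assignment.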

Tom.

2017-08-14 21:15:38

by Tom Talpey

Subject: RE: [[PATCH v1] 26/37] [CIFS] SMBD: Send an immediate packet when it's needed

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 26/37] [CIFS] SMBD: Send an immediate packet when it's
> needed
>
> +/*
> + * Check and schedule to send an immediate packet
> + * This is used to extend credtis to remote peer to keep the transport busy
> + */
> +static void check_and_send_immediate(struct cifs_rdma_info *info)
> +{
> + info->send_immediate = true;
> +
> + // promptly send a packet if running low on receive credits

...if *our peer* is running low on credits.

> + if (atomic_read(&info->receive_credits) <
> + atomic_read(&info->receive_credit_target) -1 )

Why read the receive_credit_target atomically? It's a mostly unchanging local value?

Tom.

2017-08-14 21:20:06

by Tom Talpey

Subject: RE: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to Makefile

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Wednesday, August 2, 2017 4:11 PM
> To: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Cc: Long Li <[email protected]>
> Subject: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to Makefile
>
> cifs-$(CONFIG_CIFS_SMB2) += smb2ops.o smb2maperror.o smb2transport.o \
> - smb2misc.o smb2pdu.o smb2inode.o smb2file.o
> + smb2misc.o smb2pdu.o smb2inode.o smb2file.o cifsrdma.o

"cifsrdma.o" is a really confusing choice of names. SMB Direct is only
possible when the negotiated SMB dialect is SMB3. "CIFS" historically
means a dialect of SMB1, which many of us would like to see be in
a separate, and therefore omittable, module. Please use a name
more indicative of the function, perhaps "smbdirect.o".

Tom.

2017-08-14 22:51:07

by Long Li

Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 12:00 PM
> To: Long Li <[email protected]>; Christoph Hellwig <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no
> payload
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Monday, August 14, 2017 2:20 PM
> > To: Christoph Hellwig <[email protected]>
> > Cc: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with
> > no payload
> >
> > > > Implement the function to send a SMBD message with no payload.
> > > > This is
> > > required at times when we want to extend credtis to server to have
> > > it continue to send data, without sending any actual data payload.
> > >
> > > Shouldn't this just be implemented as a special case in the version
> > > that posts data?
> >
> > It uses a different packet format "struct smbd_data_transfer_no_data".
> > I can restructure some common code to share between packet sending
> functions.
>
> The SMB Direct keepalive is just a Data Transfer Message with no payload
> (MS-SMBD section 2.2.3) and the SMB_DIRECT_RESPONSE_REQUESTED flag
> possibly set. I don't see any need to define a special structure to describe
> this?

Data Transfer Message has the following extra fields at the end of an empty packet.

__le32 padding;
char buffer[0];

I agree with you that those can be merged as a special case of the regular structure. Will make the change.

>
> Tom.


2017-08-14 22:53:56

by Long Li

Subject: RE: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma mount option



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 12:10 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma
> mount option
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 01/37] [CIFS] SMBD: Add parsing for new rdma
> > mount option
> >
> > From: Long Li <[email protected]>
> >
> > When doing mount with "-o rdma", user can specify this is for
> > connecting to a SMBD session.
>
> Nit: it's an "SMB" session. SMBD (SMB Direct) is the transport connection.
>
> The use of SMB Direct is only applicable when an SMB3.x dialect is negotiated.
> Is there any restriction in the use of the mount option to preclude also
> specifying
> SMB1 and SMB2.x?

The SMB version is specified with the mount option "-o vers=XXX". For example, "mount -o vers=3.02" will try to negotiate SMB 3.02. I will make the change to return an error when rdma is combined with vers<3.
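
The check I have in mind is roughly the following. This is a user-space sketch, not the actual option-parsing code; the enum, its ordering and the function name are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical dialect ordering, for this sketch only. */
enum smb_dialect {
    SMB_VERS_1_0,
    SMB_VERS_2_0,
    SMB_VERS_2_1,
    SMB_VERS_3_0,
    SMB_VERS_3_02
};

/* Reject "-o rdma" unless an SMB3.x dialect was requested. */
int check_rdma_mount_options(bool rdma, enum smb_dialect vers)
{
    if (rdma && vers < SMB_VERS_3_0)
        return -EINVAL;
    return 0;
}
```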

>
> Tom.

2017-08-14 22:57:27

by Long Li

Subject: RE: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD transport parameters and default values



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 12:29 PM
> To: Christoph Hellwig <[email protected]>; Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD
> transport parameters and default values
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Christoph Hellwig
> > Sent: Sunday, August 13, 2017 6:12 AM
> > To: Long Li <[email protected]>
> > Cc: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]; Long
> > Li <[email protected]>
> > Subject: Re: [[PATCH v1] 04/37] [CIFS] SMBD: Define per-channel SMBD
> > transport parameters and default values
> >
> > > +/*
> > > + * Per RDMA transport connection parameters
> > > + * as defined in [MS-SMBD] 3.1.1.1
> > > + */
> > > +static int receive_credit_max = 512; static int send_credit_target
> > > += 512; static int max_send_size = 8192; static int
> > > +max_fragmented_recv_size = 1024*1024; static int max_receive_size =
> > > +8192;
> >
> > Are these protocol constants? If so please use either #defines or
> > enums with upper case names for them.
>
> These are not defined constants, but the values beg for some explanatory
> text why they are chosen. Windows uses, and negotiates by default, a 1364-
> byte maximum send size, and caps credits to 255. The other values match.
>
> BTW, the parameters are defined in MS-SMBD 3.1.1.1 but the chosen values
> are in behavior notes 2 and 7.

I will change those values to be more in line with what Windows chooses. The different values don't have a visible impact on performance when RDMA read/write is used.

>
> Tom.


2017-08-14 22:58:14

by Long Li

Subject: RE: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer message with page payload



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:23 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer
> message with page payload
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 14/37] [CIFS] SMBD: Post a SMBD data transfer
> > message with page payload
> >
> > /*
> > + * Send a page
> > + * page: the page to send
> > + * offset: offset in the page to send
> > + * size: length in the page to send
> > + * remaining_data_length: remaining data to send in this payload */
> > +static int cifs_rdma_post_send_page(struct cifs_rdma_info *info,
> > +struct page
> > *page,
> > + unsigned long offset, size_t size, int
> > +remaining_data_length) {
> >...
> > + wait_event(info->wait_send_queue,
> > + atomic_read(&info->send_credits) >
> > 0);
>
> This is an uninterruptible wait, correct? What's to guarantee the event will
> ever fire? Also, if the count is zero, there should be a check that an SMB
> Direct credit request is outstanding. If not, it's wasteful to sleep for the
> keepalive timer to do so.

Will fix it.

>
> Tom.

2017-08-14 23:04:11

by Long Li

Subject: RE: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client and server



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:47 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client
> and server
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 19/37] [CIFS] SMBD: Manage credits on SMBD client
> > and server
> >
> > /*
> > + * Extend the credits to remote peer
> > + * This implements [MS-SMBD] 3.1.5.9
> > + * The idea is that we should extend credits to remote peer as
> > +quickly as
> > + * it's allowed, to maintain data flow. We allocate as much as
> > +receive
> > + * buffer as possible, and extend the receive credits to remote peer
> > + * return value: the new credtis being granted.
> > + */
> > +static int manage_credits_prior_sending(struct cifs_rdma_info *info)
> > +{
> > + int ret = 0;
> > + struct cifs_rdma_response *response;
> > + int rc;
> > +
> > + if (atomic_read(&info->receive_credit_target) >
>
> When does the receive_credit_target value change? It seems wasteful to
> perform an atomic_read() on this local value each time.

It can potentially change when an SMBD packet is received, as specified in MS-SMBD 3.1.5.8.

I agree with you that there is no need to use an atomic, since this value is never incremented or decremented, only set. Will change it.

>
> Tom.

2017-08-14 23:12:40

by Tom Talpey

Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no payload

> -----Original Message-----
> From: Long Li
> Sent: Monday, August 14, 2017 6:51 PM
> To: Tom Talpey <[email protected]>; Christoph Hellwig
> <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no
> payload
>
>
>
> > -----Original Message-----
> > From: Tom Talpey
> > Sent: Monday, August 14, 2017 12:00 PM
> > To: Long Li <[email protected]>; Christoph Hellwig <[email protected]>
> > Cc: Steve French <[email protected]>; [email protected]; samba-
> > [email protected]; [email protected]
> > Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with no
> > payload
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-cifs-
> > > [email protected]] On Behalf Of Long Li
> > > Sent: Monday, August 14, 2017 2:20 PM
> > > To: Christoph Hellwig <[email protected]>
> > > Cc: Steve French <[email protected]>; [email protected];
> > > samba- [email protected]; [email protected]
> > > Subject: RE: [[PATCH v1] 16/37] [CIFS] SMBD: Post a SMBD message with
> > > no payload
> > >
> > > > > Implement the function to send a SMBD message with no payload.
> > > > > This is
> > > > required at times when we want to extend credtis to server to have
> > > > it continue to send data, without sending any actual data payload.
> > > >
> > > > Shouldn't this just be implemented as a special case in the version
> > > > that posts data?
> > >
> > > It uses a different packet format "struct smbd_data_transfer_no_data".
> > > I can restructure some common code to share between packet sending
> > functions.
> >
> > The SMB Direct keepalive is just a Data Transfer Message with no payload
> > (MS-SMBD section 2.2.3) and the SMB_DIRECT_RESPONSE_REQUESTED flag
> > possibly set. I don't see any need to define a special structure to describe
> > this?
>
> Data Transfer Message has the following extra fields at the end of an empty
> packet.
>
> __le32 padding;

No need to omit the padding, it's ignored anyway and the DataLength is zero
so there's no other payload to consume. You can send a packet of pretty much
any length. Just send the regular struct.

BTW, the "padding" field is defined as a variable-length array of bytes, meaning semantically
you might want to code it as u8 padding[0] as well. However, in practice it is
either 4 bytes or not present at all, so you'd probably end up writing extra code
for that choice.
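
As a rough illustration of "just send the regular struct", here is a user-space sketch that builds an empty Data Transfer Message in a byte buffer. The field offsets are my reading of MS-SMBD 2.2.3 and should be treated as assumptions, as should the helper names; the 4-byte padding is included and everything past the flags is left zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed size: 20-byte header plus 4 bytes of padding. */
#define SMBD_DATA_TRANSFER_SIZE 24

/* Store a 16-bit value in little-endian wire order. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

/*
 * Hypothetical builder for a message with no payload.
 * Assumed offsets: CreditsRequested 0, CreditsGranted 2, Flags 4,
 * Reserved 6, RemainingDataLength 8, DataOffset 12, DataLength 16,
 * Padding 20.
 * Returns the number of bytes to send.
 */
size_t smbd_build_empty_message(uint8_t buf[SMBD_DATA_TRANSFER_SIZE],
                                uint16_t credits_requested,
                                uint16_t credits_granted,
                                uint16_t flags)
{
    /* Zero everything: lengths, offset and padding all stay 0. */
    memset(buf, 0, SMBD_DATA_TRANSFER_SIZE);

    put_le16(buf + 0, credits_requested);
    put_le16(buf + 2, credits_granted);
    put_le16(buf + 4, flags);

    return SMBD_DATA_TRANSFER_SIZE;
}
```

The same buffer layout then serves keepalives, credit grants and data messages; only DataLength and the flags differ.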

> char buffer[0];
>
> I agree with you those can be merged to a special structure case. Will make the
> change.

Tom.

2017-08-14 23:24:35

by Long Li

Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to receive data



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:57 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer
> to receive data
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer
> > to receive data
> >
> > /*
> > + * Read data from receive reassembly queue
> > + * All the incoming data packets are placed in reassembly queue
> > + * buf: the buffer to read data into
> > + * size: the length of data to read
> > + * return value: actual data read
> > + */
> > +int cifs_rdma_read(struct cifs_rdma_info *info, char *buf, unsigned
> > +int size) {
> >...
> > + spin_lock_irqsave(&info->reassembly_queue_lock, flags);
> > + log_cifs_read("size=%d info->reassembly_data_length=%d\n", size,
> > + atomic_read(&info->reassembly_data_length));
> > + if (atomic_read(&info->reassembly_data_length) >= size) {
>
> If the reassembly queue is protected by a lock, why is an atomic_read() of its
> length needed?

Will change this to non-atomic.

>
> > + // this is for reading rfc1002 length
> > + if (response->first_segment && size==4) {
> > + unsigned int rfc1002_len =
> > + data_length + remaining_data_length;
> > + *((__be32*)buf) = cpu_to_be32(rfc1002_len);
> > + data_read = 4;
> > + response->first_segment = false;
> > + log_cifs_read("returning rfc1002 length %d\n",
> > + rfc1002_len);
> > + goto read_rfc1002_done;
> > + }
>
> I am totally confused. What does RFC1002 framing have to do with receiving
> an SMB Direct packet???

The upper layer expects an RFC1002 length at the beginning of the payload. A lot of protocol processing logic checks and acts on this value. Returning this value avoids changes to lots of other upper layer code.

This will eventually be fixed when a transport layer is added to the upper layer code. I recommend we do it in another patch.
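
In isolation, the workaround amounts to the following: the total payload length of the message train is computed from the first segment and handed to the upper layer as a 4-byte big-endian value, as if it had arrived over TCP. This is a user-space sketch with a hypothetical helper name, not the code in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: synthesize the 4-byte length prefix that the
 * upper layer expects, from the DataLength and RemainingDataLength
 * fields of the first SMB Direct segment. The result is written in
 * big-endian (network) byte order.
 */
void put_rfc1002_length(uint8_t out[4], uint32_t data_length,
                        uint32_t remaining_data_length)
{
    uint32_t len = data_length + remaining_data_length;

    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)((len >> 16) & 0xff);
    out[2] = (uint8_t)((len >> 8) & 0xff);
    out[3] = (uint8_t)(len & 0xff);
}
```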

>
> > +
> > + to_copy = min_t(int, data_length - offset, to_read);
> > + memcpy(
> > + buf + data_read,
> > + (char*)data_transfer + data_offset + offset,
> > + to_copy);
>
> Is it really necessary to perform all these data copies, especially under the
> reassembly_queue spinlock? This seems quite inefficient. Can the receive
> buffers not be loaned out and chained logically?

This would require upper layer code changes to use buffers allocated/loaned this way, and to deal with packet boundaries.

This code path does not actually carry file data, which is normally transferred through RDMA read/write.

If we want to do it, I suggest doing it in another patch, since changes beyond the transport are involved.

>
> Tom.

2017-08-14 23:27:15

by Long Li

Subject: RE: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive protocol



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 2:07 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive
> protocol
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 24/37] [CIFS] SMBD: Support for SMBD keep alive
> > protocol
> >
> > SMBD uses a keep alive protocol to help peers detect if the remote is dead.
> > When peer request keep alive, the transport needs to respond accordingly.
>
> The keepalive exchange is also used to replenish credits in certain
> pathological conditions.
>
> > + // send an emtpy response right away if requested
> > + if (le16_to_cpu(data_transfer->flags) |
> > + le16_to_cpu(SMB_DIRECT_RESPONSE_REQUESTED)) {
> > + info->keep_alive_requested = KEEP_ALIVE_PENDING;
> > + }
>
> This is clearly a typo, the condition is always true. "&"??

Sorry, it's a typo.

>
> Tom.

2017-08-14 23:29:44

by Long Li

Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 2:12 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Cc: Long Li <[email protected]>
> Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> timer
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> > timer
> >
> > +static int keep_alive_interval = 120;
>
> This is the recommended value, but not the only possibility.
>
> > @@ -1348,6 +1369,10 @@ struct cifs_rdma_info*
> cifs_create_rdma_session(
> > init_waitqueue_head(&info->wait_send_queue);
> > init_waitqueue_head(&info->wait_reassembly_queue);
> >
> > + INIT_DELAYED_WORK(&info->idle_timer_work,
> idle_connection_timer);
> > + schedule_delayed_work(&info->idle_timer_work,
> > + info->keep_alive_interval*HZ);
> > +
>
> This initialization is ok, but the timer should be rescheduled (extended) any
> time any packet is sent. There is no need to perform keepalives on an active
> SMB Direct connection.

My feeling is that rescheduling the work item for every packet sent is not efficient, especially under heavy load.

Firing it every 120 seconds doesn't seem to be a big waste and may actually save some CPU.

>
> Tom.

2017-08-14 23:31:01

by Long Li

Subject: RE: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to Makefile



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 2:20 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to
> Makefile
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 30/37] [CIFS] SMBD: Add SMBDirect transport to
> > Makefile
> >
> > cifs-$(CONFIG_CIFS_SMB2) += smb2ops.o smb2maperror.o
> smb2transport.o \
> > - smb2misc.o smb2pdu.o smb2inode.o smb2file.o
> > + smb2misc.o smb2pdu.o smb2inode.o
> > + smb2file.o cifsrdma.o
>
> "cifsrdma.o" is a really confusing choice of names. SMB Direct is only possible
> when the negotiated SMB dialect is SMB3. "CIFS" historically means a dialect
> of SMB1, which many of us would like to see be in a separate, and therefore
> omittable, module. Please use a name more indicative of the function,
> perhaps "smbdirect.o".

Yes, I will make the naming changes.

>
> Tom.

2017-08-14 23:35:38

by Tom Talpey

Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to receive data

> -----Original Message-----
> From: Long Li
> Sent: Monday, August 14, 2017 7:25 PM
> To: Tom Talpey <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer to
> receive data
>
>
>
> > -----Original Message-----
> > From: Tom Talpey
> > Sent: Monday, August 14, 2017 1:57 PM
> > To: Long Li <[email protected]>; Steve French <[email protected]>;
> > [email protected]; [email protected]; linux-
> > [email protected]
> > Subject: RE: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer
> > to receive data
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-cifs-
> > > [email protected]] On Behalf Of Long Li
> > > Sent: Wednesday, August 2, 2017 4:11 PM
> > > To: Steve French <[email protected]>; [email protected];
> > > samba- [email protected]; [email protected]
> > > Cc: Long Li <[email protected]>
> > > Subject: [[PATCH v1] 21/37] [CIFS] SMBD: Implement API for upper layer
> > > to receive data
> > >
> > > /*
> > > + * Read data from receive reassembly queue
> > > + * All the incoming data packets are placed in reassembly queue
> > > + * buf: the buffer to read data into
> > > + * size: the length of data to read
> > > + * return value: actual data read
> > > + */
> > > +int cifs_rdma_read(struct cifs_rdma_info *info, char *buf, unsigned
> > > +int size) {
> > >...
> > > + spin_lock_irqsave(&info->reassembly_queue_lock, flags);
> > > + log_cifs_read("size=%d info->reassembly_data_length=%d\n", size,
> > > + atomic_read(&info->reassembly_data_length));
> > > + if (atomic_read(&info->reassembly_data_length) >= size) {
> >
> > If the reassembly queue is protected by a lock, why is an atomic_read() of its
> > length needed?
>
> Will change this to non-atomic.
>
> >
> > > + // this is for reading rfc1002 length
> > > + if (response->first_segment && size==4) {
> > > + unsigned int rfc1002_len =
> > > + data_length + remaining_data_length;
> > > + *((__be32*)buf) = cpu_to_be32(rfc1002_len);
> > > + data_read = 4;
> > > + response->first_segment = false;
> > > + log_cifs_read("returning rfc1002 length %d\n",
> > > + rfc1002_len);
> > > + goto read_rfc1002_done;
> > > + }
> >
> > I am totally confused. What does RFC1002 framing have to do with receiving
> > an SMB Direct packet???
>
> The upper layer expects RFC1002 length at the beginning of the payload. A lot
> of protocol processing logic check and act on this value. Returning this value
> will avoid changes to lots of other upper layer code.
>
> This will be eventually fixed when a transport layer is added to upper layer
> code. I recommend we do it in another patch.

It's totally non-obvious that you are *inserting* an RFC1002 length into the received
message. Need to state that, and include the above explanation.

OK on deferring that work. So far so good!
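
To make the inserted length explicit, here is a small user-space sketch of the computation described above: the total message length (data_length plus remaining_data_length of the first segment) is written as a 4-byte big-endian prefix. The function name is hypothetical, and the kernel code would use cpu_to_be32() instead of manual shifts.

```c
#include <stdint.h>

/*
 * Build the 4-byte big-endian length prefix that the upper layer
 * expects at the start of a message. Illustrative helper only.
 */
void put_rfc1002_len(unsigned char buf[4], uint32_t data_length,
		     uint32_t remaining_data_length)
{
	uint32_t len = data_length + remaining_data_length;

	buf[0] = (unsigned char)(len >> 24);
	buf[1] = (unsigned char)(len >> 16);
	buf[2] = (unsigned char)(len >> 8);
	buf[3] = (unsigned char)len;
}
```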

> > > + to_copy = min_t(int, data_length - offset, to_read);
> > > + memcpy(
> > > + buf + data_read,
> > > + (char*)data_transfer + data_offset + offset,
> > > + to_copy);
> >
> > Is it really necessary to perform all these data copies, especially under the
> > reassembly_queue spinlock? This seems quite inefficient. Can the receive
> > buffers not be loaned out and chained logically?
>
> This will require upper layer code changes to move to use new buffers
> allocated/loaned this way, and also deal with packet boundaries.
>
> This code is not used to actually carry file data, which are normally done
> through RDMA read/write.

Disagree - RDMA will only be used when the size exceeds a threshold, perhaps 4KB or
even 8KB. That means you'll be performing several memcpy()s for such workloads.
RDMA is only used for bulk data.

At the very least, try to rearrange the code to avoid holding the reassembly lock for
so long, definitely not when memcpy'ing. It will significantly single-thread all your receives.
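
A rough user-space sketch of that rearrangement: detach the segment while holding the lock, then copy after the lock is dropped. All names are hypothetical, the atomic_flag spin loop only stands in for the kernel spinlock, and the sketch ignores partially consumed segments.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

struct segment {
	struct segment *next;
	size_t len;
	char data[64];
};

struct reassembly_queue {
	atomic_flag lock;	/* stand-in for the kernel spinlock */
	struct segment *head;
	struct segment *tail;
};

static void queue_lock(struct reassembly_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin */
}

static void queue_unlock(struct reassembly_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

void queue_init(struct reassembly_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
}

void queue_push(struct reassembly_queue *q, struct segment *s)
{
	s->next = NULL;
	queue_lock(q);
	if (q->tail)
		q->tail->next = s;
	else
		q->head = s;
	q->tail = s;
	queue_unlock(q);
}

/* Returns the number of bytes copied, or 0 if the queue is empty. */
size_t queue_read_one(struct reassembly_queue *q, char *buf, size_t size)
{
	struct segment *s;
	size_t n;

	/* Only the list manipulation happens under the lock. */
	queue_lock(q);
	s = q->head;
	if (s) {
		q->head = s->next;
		if (!q->head)
			q->tail = NULL;
	}
	queue_unlock(q);

	if (!s)
		return 0;

	/* The copy runs without the lock, so other receives can proceed. */
	n = s->len < size ? s->len : size;
	memcpy(buf, s->data, n);
	return n;
}
```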

> If we want to do it, I suggest do another patch since more changes other than
> transport are involved.

Sure, that's fine but 1) we will definitely want to do it and 2) this needs a comment.

Tom.

2017-08-14 23:37:21

by Long Li

Subject: RE: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer to reconnect transport



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 2:03 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer
> to reconnect transport
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:11 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 23/37] [CIFS] SMBD: Implement API for upper layer
> > to reconnect transport
> >
> > +int cifs_reconnect_rdma_session(struct TCP_Server_Info *server) {
> > + log_rdma_event("reconnecting rdma session\n");
> > +
> > + // why reconnect while it is still connected?
> > + if (server->rdma_ses->transport_status == CIFS_RDMA_CONNECTED)
> {
> > + log_rdma_event("still connected, not reconnecting\n");
> > + return -EINVAL;
> > + }
>
> Why is this check needed?

This was used in an early stage of development. It's probably not needed anymore. I will look into this.

>
> > +
> > + // wait until the transport is destroyed
> > + while (server->rdma_ses->transport_status !=
> CIFS_RDMA_DESTROYED)
> > + msleep(1);
>
> Polling!? Please plan to implement a proper handshake for connection logic.

Will look into using a wait queue.

2017-08-14 23:42:32

by Tom Talpey

Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer

> -----Original Message-----
> From: [email protected] [mailto:linux-cifs-
> [email protected]] On Behalf Of Long Li
> Sent: Monday, August 14, 2017 7:30 PM
> To: Tom Talpey <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> timer
>
> > -----Original Message-----
> > From: Tom Talpey
> > Sent: Monday, August 14, 2017 2:12 PM
> > To: Long Li <[email protected]>; Steve French <[email protected]>;
> > [email protected]; [email protected]; linux-
> > [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> > timer
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-cifs-
> > > [email protected]] On Behalf Of Long Li
> > > Sent: Wednesday, August 2, 2017 4:11 PM
> > > To: Steve French <[email protected]>; [email protected];
> > > samba- [email protected]; [email protected]
> > > Cc: Long Li <[email protected]>
> > > Subject: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> > > timer
> > >
> > > +static int keep_alive_interval = 120;
> >
> > This is the recommended value, but not the only possibility.
> >
> > > @@ -1348,6 +1369,10 @@ struct cifs_rdma_info*
> > cifs_create_rdma_session(
> > > init_waitqueue_head(&info->wait_send_queue);
> > > init_waitqueue_head(&info->wait_reassembly_queue);
> > >
> > > + INIT_DELAYED_WORK(&info->idle_timer_work,
> > idle_connection_timer);
> > > + schedule_delayed_work(&info->idle_timer_work,
> > > + info->keep_alive_interval*HZ);
> > > +
> >
> > This initialization is ok, but the timer should be rescheduled (extended) any
> > time any packet is sent. There is no need to perform keepalives on an active
> > SMB Direct connection.
>
> My feeling is that rescheduling on a work queue for every packet is sent is not
> efficient, especially under heavy conditions.

That's not what I was suggesting. Can't the timer simply be re-extended to the
120-second interval? I.e. on an active connection, it will never fire because it's
always advancing.

As defined here, it will go off and send a keepalive every 120 seconds. The
idle_connection_timer() routine unconditionally sends it.

>
> Firing it every 120 seconds doesn't seem to be big waste and may actually save
> some CPU.

Firing the timer, no big deal. Sending the packets and requiring the peer to process
them too, disagree.

Tom.

2017-08-15 00:11:01

by Long Li

Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection timer



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 4:42 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]
> Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle connection
> timer
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Monday, August 14, 2017 7:30 PM
> > To: Tom Talpey <[email protected]>; Steve French
> > <[email protected]>; [email protected];
> > [email protected]; linux- [email protected]
> > Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle
> > connection timer
> >
> > > -----Original Message-----
> > > From: Tom Talpey
> > > Sent: Monday, August 14, 2017 2:12 PM
> > > To: Long Li <[email protected]>; Steve French
> > > <[email protected]>; [email protected];
> > > [email protected]; linux- [email protected]
> > > Cc: Long Li <[email protected]>
> > > Subject: RE: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle
> > > connection timer
> > >
> > > > -----Original Message-----
> > > > From: [email protected] [mailto:linux-cifs-
> > > > [email protected]] On Behalf Of Long Li
> > > > Sent: Wednesday, August 2, 2017 4:11 PM
> > > > To: Steve French <[email protected]>; [email protected];
> > > > samba- [email protected]; [email protected]
> > > > Cc: Long Li <[email protected]>
> > > > Subject: [[PATCH v1] 25/37] [CIFS] SMBD: Support SMBD idle
> > > > connection timer
> > > >
> > > > +static int keep_alive_interval = 120;
> > >
> > > This is the recommended value, but not the only possibility.
> > >
> > > > @@ -1348,6 +1369,10 @@ struct cifs_rdma_info*
> > > cifs_create_rdma_session(
> > > > init_waitqueue_head(&info->wait_send_queue);
> > > > init_waitqueue_head(&info->wait_reassembly_queue);
> > > >
> > > > + INIT_DELAYED_WORK(&info->idle_timer_work,
> > > idle_connection_timer);
> > > > + schedule_delayed_work(&info->idle_timer_work,
> > > > + info->keep_alive_interval*HZ);
> > > > +
> > >
> > > This initialization is ok, but the timer should be rescheduled
> > > (extended) any time any packet is sent. There is no need to perform
> > > keepalives on an active SMB Direct connection.
> >
> > My feeling is that rescheduling on a work queue for every packet is
> > sent is not efficient, especially under heavy conditions.
>
> That's not what I was suggesting. Can't the timer simply be re-extended to
> the 120-second interval? I.e. on an active connection, it will never fire
> because it's always advancing.
>
> As defined here, it will go off and send a keepalive every 120 seconds. The
> idle_connection_timer() routine unconditionally sends it.
>
> >
> > Firing it every 120 seconds doesn't seem to be big waste and may
> > actually save some CPU.
>
> Firing the timer, no big deal. Sending the packets and requiring the peer to
> process them too, disagree.

Fair enough. I will fix the code to re-arm the delayed work on activity instead of firing every 120 seconds.
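
As a sketch of the behaviour agreed on here, the following user-space model pushes the deadline out on every send, so a keepalive only becomes due after a full idle interval. The struct and function names are hypothetical; in the kernel the same effect would presumably come from calling mod_delayed_work() on the idle timer work item.

```c
#include <stdbool.h>

/* Hypothetical idle timer state; all times are in seconds. */
struct idle_timer {
	long interval;	/* keepalive interval, e.g. 120 */
	long deadline;	/* absolute time at which the timer fires */
};

void idle_timer_init(struct idle_timer *t, long now, long interval)
{
	t->interval = interval;
	t->deadline = now + interval;
}

/* Called whenever a packet is sent: extend the deadline again. */
void idle_timer_touch(struct idle_timer *t, long now)
{
	t->deadline = now + t->interval;
}

/*
 * Returns true if the connection has been idle for a full interval,
 * meaning a keepalive should be sent; re-arms the timer in that case.
 */
bool idle_timer_expired(struct idle_timer *t, long now)
{
	if (now < t->deadline)
		return false;
	t->deadline = now + t->interval;
	return true;
}
```

On an active connection idle_timer_touch() keeps moving the deadline, so idle_timer_expired() never reports true and no keepalive is sent.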

>
> Tom.

2017-08-19 23:41:28

by Long Li

Subject: RE: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for handling SMBD response



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:09 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for
> handling SMBD response
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 07/37] [CIFS] SMBD: Implement receive buffer for
> > handling SMBD response
> >
> > +/*
> > + * Receive buffer operations.
> > + * For each remote send, we need to post a receive. The receive
> > +buffers are
> > + * pre-allocated in advance.
> > + */
>
> This approach appears to have been derived from the NFS/RDMA one.
> The SMB protocol operates very differently! It is not a strict request/
> response protocol. Many operations can become asynchronous by the
> server choosing to make a STATUS_PENDING reply. A second reply then
> comes later. The SMB2_CANCEL operation normally has no reply at all.
> And callbacks for oplocks can occur at any time.

I think you misunderstood the receive buffers. They are posted so that the remote peer can post a send. The remote peer's receive credit is calculated from the number of receive buffers that have been posted. The code doesn't assume that each post_send needs one corresponding post_recv. In practice, receive buffers are posted as soon as possible to extend receive credits to the remote peer.

>
> Even within a single request, many replies can be received. For example, an
> SMB2_READ response which exceeds your negotiated receive size of 8192.
> These will be fragmented by SMB Direct into a "train" of multiple messages,
> which will be logically reassembled by the receiver. Each of them will
> consume a credit.
>
> Thanks to SMB Direct crediting, the connection is not failing, but you are
> undoubtedly spending a lot of time and ping-ponging to re-post receives and
> allow the message trains to flow. And, because it's never one-to-one, there
> are also unneeded receives posted before and after such exchanges.
>
> You need to use SMB Direct crediting to post a more traffic-sensitive pool of
> receives, and simply manage its depth when posting client requests.
> As a start, I'd suggest simply choosing a constant number, approximately
> whatever credit value you actually negotiate with the peer. Then, just
> replenish (re-post) receive buffers as they are completed by the adapter.
> You can get more sophisticated about this strategy later.

The code behaves exactly as you described. It uses a constant to decide how many receive buffers to post. It's not very smart and can be improved.
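
The constant-depth strategy discussed here can be modelled roughly as below. This is only a sketch under assumptions: the struct, its field names, and the idea that the returned count is what gets advertised to the peer in the next outgoing message are illustrative, not the patch's actual bookkeeping.

```c
/* Hypothetical accounting for a fixed-depth pool of posted receives. */
struct recv_pool {
	int target;	/* constant number of receives to keep posted */
	int posted;	/* receives currently posted to the adapter */
	int granted;	/* credits the peer currently holds */
};

/* A receive completed: the peer used one credit and one buffer. */
void recv_pool_completed(struct recv_pool *p)
{
	p->posted--;
	p->granted--;
}

/*
 * Re-post buffers up to the target depth and return the number of
 * new credits that can be granted to the peer.
 */
int recv_pool_replenish(struct recv_pool *p)
{
	int new_credits;

	p->posted = p->target;
	new_credits = p->posted - p->granted;
	p->granted = p->posted;
	return new_credits;
}
```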

>
> Tom.
>
> > +static struct cifs_rdma_response* get_receive_buffer(struct
> > +cifs_rdma_info
> > *info)
> > +{
> > + struct cifs_rdma_response *ret = NULL;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&info->receive_queue_lock, flags);
> > + if (!list_empty(&info->receive_queue)) {
> > + ret = list_first_entry(
> > + &info->receive_queue,
> > + struct cifs_rdma_response, list);
> > + list_del(&ret->list);
> > + info->count_receive_buffer--;
> > + info->count_get_receive_buffer++;
> > + }
> > + spin_unlock_irqrestore(&info->receive_queue_lock, flags);
> > +
> > + return ret;
> > +}
> > +
> > +static void put_receive_buffer(
> > + struct cifs_rdma_info *info, struct cifs_rdma_response
> > +*response) {
> > + unsigned long flags;
> > +
> > + ib_dma_unmap_single(info->id->device, response->sge.addr,
> > + response->sge.length, DMA_FROM_DEVICE);
> > +
> > + spin_lock_irqsave(&info->receive_queue_lock, flags);
> > + list_add_tail(&response->list, &info->receive_queue);
> > + info->count_receive_buffer++;
> > + info->count_put_receive_buffer++;
> > + spin_unlock_irqrestore(&info->receive_queue_lock, flags); }
> > +
> > +static int allocate_receive_buffers(struct cifs_rdma_info *info, int
> > +num_buf) {
> > + int i;
> > + struct cifs_rdma_response *response;
> > +
> > + INIT_LIST_HEAD(&info->receive_queue);
> > + spin_lock_init(&info->receive_queue_lock);
> > +
> > + for (i=0; i<num_buf; i++) {
> > + response = mempool_alloc(info->response_mempool,
> GFP_KERNEL);
> > + if (!response)
> > + goto allocate_failed;
> > +
> > + response->info = info;
> > + list_add_tail(&response->list, &info->receive_queue);
> > + info->count_receive_buffer++;
> > + }
> > +
> > + return 0;
> > +
> > +allocate_failed:
> > + while (!list_empty(&info->receive_queue)) {
> > + response = list_first_entry(
> > + &info->receive_queue,
> > + struct cifs_rdma_response, list);
> > + list_del(&response->list);
> > + info->count_receive_buffer--;
> > +
> > + mempool_free(response, info->response_mempool);
> > + }
> > + return -ENOMEM;
> > +}
> > +
> > +static void destroy_receive_buffers(struct cifs_rdma_info *info) {
> > + struct cifs_rdma_response *response;
> > + while ((response = get_receive_buffer(info)))
> > + mempool_free(response, info->response_mempool); }
> > +


2017-08-19 23:41:53

by Long Li

Subject: RE: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer to send data



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:44 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer
> to send data
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer
> > to send data
> >
> > +/*
> > + * Write data to transport
> > + * Each rqst is transported as a SMBDirect payload
> > + * rqst: the data to write
> > + * return value: 0 if successfully write, otherwise error code */
> > +int cifs_rdma_write(struct cifs_rdma_info *info, struct smb_rqst
> > +*rqst) {
>
> !!!
> This is a VERY confusing name. It is not sending an RDMA Write, which will
> confuse any RDMA-enlightened reader. It's performing an RDMA Send, so
> that name is perhaps one possibility.
>
> > + if (info->transport_status != CIFS_RDMA_CONNECTED) {
> > + log_cifs_write("disconnected returning -EIO\n");
> > + return -EIO;
> > + }
>
> Isn't this optimizing the error case? There's no guarantee it's still connected
> by the time the following request construction occurs. Why not just proceed
> without the check?
>
> > + /* Strip the first 4 bytes MS-SMB2 section 2.1
> > + * they are used only for TCP transport */
> > + iov[0].iov_base = (char*)rqst->rq_iov[0].iov_base + 4;
> > + iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
> > + buflen += iov[0].iov_len;
>
> Ok, that layering choice in the cifs.ko client code needs to be corrected. After
> all, it will need to be RDMA-aware to build the SMB3 read/write channel
> structures.
> And, the code in cifs_post_send_data() is allocating and building a structure
> that could have been accounted for much earlier, avoiding the extra
> overhead.
>
> That change could happen later, the hack is mostly ok for now. But
> something needs to be said in a comment.

Will address those in v2.

>
> Tom.

2017-08-30 02:18:01

by Long Li

Subject: RE: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Sunday, August 13, 2017 3:24 AM
> To: Long Li <[email protected]>
> Cc: Steve French <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]; Long Li
> <[email protected]>
> Subject: Re: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer
> message with data payload
>
> You can always get the struct page for kernel allocations using virt_to_page
> (or vmalloc_to_page, but this code would not handle the vmalloc case either),
> so I don't think you need this helper and can always use the one added in the
> previous patch.

I partially addressed this issue in the V3 patch. Most of the duplicate code on the sending path has been merged.

The difficulty with translating the buffer to pages is that I don't know how many pages the buffer will translate to, or how many struct page pointers I need to allocate in advance to hold them. I try to avoid memory allocation in the I/O path as much as possible, so I keep two functions for sending data: one for buffers and one for pages.

2017-08-30 02:30:10

by Long Li

Subject: RE: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer to send data



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 1:44 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer
> to send data
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 18/37] [CIFS] SMBD: Implement API for upper layer
> > to send data
> >
> > +/*
> > + * Write data to transport
> > + * Each rqst is transported as a SMBDirect payload
> > + * rqst: the data to write
> > + * return value: 0 if successfully write, otherwise error code */
> > +int cifs_rdma_write(struct cifs_rdma_info *info, struct smb_rqst
> > +*rqst) {
>
> !!!
> This is a VERY confusing name. It is not sending an RDMA Write, which will
> confuse any RDMA-enlightened reader. It's performing an RDMA Send, so
> that name is perhaps one possibility.

I have fixed that in v3.

>
> > + if (info->transport_status != CIFS_RDMA_CONNECTED) {
> > + log_cifs_write("disconnected returning -EIO\n");
> > + return -EIO;
> > + }
>
> Isn't this optimizing the error case? There's no guarantee it's still connected
> by the time the following request construction occurs. Why not just proceed
> without the check?

I rearranged the shutdown logic in v3. Checking the transport status is still needed, but the check now comes after the counters for pending activity are updated.

For example, in the sending code:

info->smbd_send_pending++;
if (info->transport_status != SMBD_CONNECTED) {
	info->smbd_send_pending--;
	wake_up(&info->wait_smbd_send_pending);
}

In the transport shutdown code:

info->transport_status = SMBD_DISCONNECTING;
.......
.......
.......
log_rdma_event(INFO, "wait for all send to finish\n");
wait_event(info->wait_smbd_send_pending,
info->smbd_send_pending == 0);

This guarantees that no sending code can enter the transport after shutdown has finished. Shutdown runs on a separate work queue, so the check is needed.
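
The ordering above can be modelled in user space as follows. This is a sketch under assumptions: the names are hypothetical, C11 atomics stand in for whatever synchronization the v3 code uses, and the wake_up()/wait_event() pair is reduced to a predicate that the shutdown path would wait on.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { EX_CONNECTED, EX_DISCONNECTING };

struct transport {
	atomic_int status;
	atomic_int send_pending;
};

/*
 * Count the sender first, then check the status. A sender that sees
 * a non-connected transport backs out and must not touch it.
 */
bool send_enter(struct transport *t)
{
	atomic_fetch_add(&t->send_pending, 1);
	if (atomic_load(&t->status) != EX_CONNECTED) {
		atomic_fetch_sub(&t->send_pending, 1);
		/* kernel code would wake_up() the shutdown waiter here */
		return false;
	}
	return true;
}

/* Called when a send that passed send_enter() has finished. */
void send_exit(struct transport *t)
{
	atomic_fetch_sub(&t->send_pending, 1);
}

void shutdown_begin(struct transport *t)
{
	atomic_store(&t->status, EX_DISCONNECTING);
}

/* Condition the shutdown path waits for before tearing down. */
bool shutdown_may_finish(struct transport *t)
{
	return atomic_load(&t->send_pending) == 0;
}
```

Because the counter is incremented before the status is read, a sender either sees the disconnecting status and backs out, or is counted and makes shutdown wait for it.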

>
> > + /* Strip the first 4 bytes MS-SMB2 section 2.1
> > + * they are used only for TCP transport */
> > + iov[0].iov_base = (char*)rqst->rq_iov[0].iov_base + 4;
> > + iov[0].iov_len = rqst->rq_iov[0].iov_len - 4;
> > + buflen += iov[0].iov_len;
>
> Ok, that layering choice in the cifs.ko client code needs to be corrected. After
> all, it will need to be RDMA-aware to build the SMB3 read/write channel
> structures.
> And, the code in cifs_post_send_data() is allocating and building a structure
> that could have been accounted for much earlier, avoiding the extra
> overhead.
>
> That change could happen later, the hack is mostly ok for now. But
> something needs to be said in a comment.
>
> Tom.

2017-08-30 02:35:59

by Long Li

Subject: RE: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer to create SMBD transport and establish RDMA connection



> -----Original Message-----
> From: Tom Talpey
> Sent: Monday, August 14, 2017 12:55 PM
> To: Long Li <[email protected]>; Steve French <[email protected]>;
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]
> Subject: RE: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer
> to create SMBD transport and establish RDMA connection
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-cifs-
> > [email protected]] On Behalf Of Long Li
> > Sent: Wednesday, August 2, 2017 4:10 PM
> > To: Steve French <[email protected]>; [email protected];
> > samba- [email protected]; [email protected]
> > Cc: Long Li <[email protected]>
> > Subject: [[PATCH v1] 05/37] [CIFS] SMBD: Implement API for upper layer
> > to create SMBD transport and establish RDMA connection
> >
> > From: Long Li <[email protected]>
> >
> > Implement the code for connecting to SMBD server. The client and
> > server are connected using RC Queue Pair over RDMA API, which
> > supports Infiniband, RoCE and iWARP. Upper layer code can call
> > cifs_create_rdma_session to establish a SMBD RDMA connection.
> >
> > +/* Upcall from RDMA CM */
> > +static int cifs_rdma_conn_upcall(
> > + struct rdma_cm_id *id, struct rdma_cm_event *event) {
> > + struct cifs_rdma_info *info = id->context;
> > +
> > + log_rdma_event("event=%d status=%d\n", event->event,
> > + event->status);
> > +
> > + switch (event->event) {
> > + case RDMA_CM_EVENT_ADDR_RESOLVED:
> > + case RDMA_CM_EVENT_ROUTE_RESOLVED:
> > + info->ri_rc = 0;
> > + complete(&info->ri_done);
> > + break;
> > +
> > + case RDMA_CM_EVENT_ADDR_ERROR:
> > + info->ri_rc = -EHOSTUNREACH;
> > + complete(&info->ri_done);
> > + break;
> > +
> > + case RDMA_CM_EVENT_ROUTE_ERROR:
> > + info->ri_rc = -ENETUNREACH;
> > + complete(&info->ri_done);
> > + break;
> > +
> > + case RDMA_CM_EVENT_ESTABLISHED:
> > + case RDMA_CM_EVENT_CONNECT_ERROR:
> > + case RDMA_CM_EVENT_UNREACHABLE:
> > + case RDMA_CM_EVENT_REJECTED:
> > + case RDMA_CM_EVENT_DEVICE_REMOVAL:
> > + log_rdma_event("connected event=%d\n", event->event);
> > + info->connect_state = event->event;
> > + break;
> > +
> > + case RDMA_CM_EVENT_DISCONNECTED:
> > + break;
> > +
> > + default:
> > + break;
> > + }
> > +
> > + return 0;
> > +}
>
> This code looks a lot like the connection stuff in the NFS/RDMA RPC transport.
> Does your code have the same needs? If so, you might consider moving this
> to a common RDMA handler.

>
> > +/* Upcall from RDMA QP */
> > +static void
> > +cifs_rdma_qp_async_error_upcall(struct ib_event *event, void
> > +*context) {
> > + struct cifs_rdma_info *info = context;
> > + log_rdma_event("%s on device %s info %p\n",
> > + ib_event_msg(event->event), event->device->name,
> > +info);
> > +
> > + switch (event->event)
> > + {
> > + case IB_EVENT_CQ_ERR:
> > + case IB_EVENT_QP_FATAL:
> > + case IB_EVENT_QP_REQ_ERR:
> > + case IB_EVENT_QP_ACCESS_ERR:
> > +
> > + default:
> > + break;
> > + }
> > +}
>
> Ditto. But, what's up with the empty switch(event->event) processing?

I have changed the code to disconnect the RDMA connection on QP errors.


>
> > +static struct rdma_cm_id* cifs_rdma_create_id(
> > + struct cifs_rdma_info *info, struct sockaddr *dstaddr)
> > +{
> ...
> > + log_rdma_event("connecting to IP %pI4 port %d\n",
> > + &addr_in->sin_addr, ntohs(addr_in->sin_port));
> >... and then...
> > + if (dstaddr->sa_family == AF_INET6)
> > + sport = &((struct sockaddr_in6 *)dstaddr)->sin6_port;
> > + else
> > + sport = &((struct sockaddr_in *)dstaddr)->sin_port;
> > +
> > + *sport = htons(445);
> ...and
> > +out:
> > + // try port number 5445 if port 445 doesn't work
> > + if (*sport == htons(445)) {
> > + *sport = htons(5445);
> > + goto try_again;
> > + }
>
> Suggest rearranging the log_rdma_event() call to reflect reality.
>
> The IANA-assigned port for SMB Direct is 5445, and port 445 will be listening
> on TCP. Should you really be probing that port before 5445?
> I suggest not doing so unconditionally.

This part is reworked in V3 to behave as you suggested.
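
One possible probing order consistent with the suggestion, shown as a small sketch: try the IANA-assigned SMB Direct port first and only then fall back to 445. The function name and the convention of passing 0 for "nothing tried yet" are assumptions for illustration, and whether V3 falls back to 445 at all is not stated in this message.

```c
#include <stdint.h>

#define EX_SMBD_PORT 5445	/* IANA-assigned SMB Direct port */
#define EX_SMB_PORT   445	/* fallback, normally the SMB over TCP port */

/*
 * Returns the next port to try after last_tried, or 0 when there is
 * nothing left to try. Pass 0 for the first attempt.
 */
uint16_t smbd_next_port(uint16_t last_tried)
{
	switch (last_tried) {
	case 0:
		return EX_SMBD_PORT;
	case EX_SMBD_PORT:
		return EX_SMB_PORT;
	default:
		return 0;
	}
}
```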

>
> > +struct cifs_rdma_info* cifs_create_rdma_session(
> > + struct TCP_Server_Info *server, struct sockaddr *dstaddr) {
> > ...
> > + int max_pending = receive_credit_max + send_credit_target;
> >...
> > + if (max_pending > info->id->device->attrs.max_cqe ||
> > + max_pending > info->id->device->attrs.max_qp_wr) {
> > + log_rdma_event("consider lowering receive_credit_max and "
> > + "send_credit_target. Possible CQE overrun, device "
> > + "reporting max_cqe %d max_qp_wr %d\n",
> > + info->id->device->attrs.max_cqe,
> > + info->id->device->attrs.max_qp_wr);
> > + goto out2;
> > + }
>
> I don't understand this. Why are you directing both Receive and Send
> completions to the same CQ, won't that make it very hard to manage
> completions and their interrupts? Also, what device(s) have you seen trigger
> this log? CQ's are generally allowed to be quite large.

I have moved them to separate completion queues in V3.

>
> > + conn_param.responder_resources = 32;
> > + if (info->id->device->attrs.max_qp_rd_atom < 32)
> > + conn_param.responder_resources =
> > + info->id->device->attrs.max_qp_rd_atom;
> > + conn_param.retry_count = 6;
> > + conn_param.rnr_retry_count = 6;
> > + conn_param.flow_control = 0;
>
> These choices warrant explanation. 32 responder resources is on the large
> side for most datacenter networks. And because of SMB Direct's credits, why
> is RNR_retry not simply zero?
>

Those have been fixed. On Mellanox ConnectX3, attrs.max_qp_rd_atom is 16, so I chose a default value of 32 to work with hardware that can potentially support more responder resources.
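
The clamping described above (a default of 32, lowered to whatever the device reports) can be sketched as a small helper. This mirrors the logic in the quoted v1 code rather than whatever V3 ended up doing; the function and macro names are hypothetical, and the `uint8_t` return type is chosen to match the 8-bit `responder_resources` field of `struct rdma_conn_param`.

```c
#include <stdint.h>

#define SKETCH_DEFAULT_RESPONDER_RESOURCES 32

/*
 * Return the default responder resources, capped by the device limit
 * (attrs.max_qp_rd_atom in the quoted patch). Negative limits are
 * treated as "no resources available".
 */
static uint8_t sketch_responder_resources(int max_qp_rd_atom)
{
	if (max_qp_rd_atom < 0)
		return 0;
	if (max_qp_rd_atom < SKETCH_DEFAULT_RESPONDER_RESOURCES)
		return (uint8_t)max_qp_rd_atom;
	return SKETCH_DEFAULT_RESPONDER_RESOURCES;
}
```

With the ConnectX3 value of 16 mentioned above, the helper yields 16; a device reporting 64 would get the default of 32.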

2017-08-30 08:51:43

by Christoph Hellwig

Subject: Re: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

On Wed, Aug 30, 2017 at 02:17:56AM +0000, Long Li wrote:
> I partially addressed this issue in the V3 patch. Most of the duplicate
> code on sending path is merged.
>
> The difficulty with translating the buffer to pages is that: I don't
> know how many pages will be translated, and how many struct page I need
> to allocate in advance to hold them. I try to avoid memory allocation
> in the I/O path as much as possible. So I keep two functions of
> sending data: one for buffer and one for pages.

You do: you'll always need space for (len + PAGE_SIZE - 1) >> PAGE_SHIFT
pages.
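
A minimal sketch of that bound, assuming 4 KiB pages (a page shift of 12): the first helper rounds a length up to whole pages, and the second covers a buffer that does not start on a page boundary, which can span one extra page. The `sketch_*` names are hypothetical; kernel code would use `PAGE_SIZE` and `PAGE_SHIFT` directly.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Pages needed for len bytes starting at a page boundary. */
static size_t sketch_pages_for_len(size_t len)
{
	return (len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}

/*
 * Pages touched by len bytes starting offset_in_page bytes into a page.
 * Assumes offset_in_page < SKETCH_PAGE_SIZE.
 */
static size_t sketch_pages_spanned(size_t offset_in_page, size_t len)
{
	if (len == 0)
		return 0;
	return (offset_in_page + len + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
}
```

Because the count depends only on the length and starting offset, the caller can size an array of page pointers up front without allocating in the I/O path.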

That being said: what callers even send you buffers? In general we
should aim to work with pages for all allocations that aren't tiny.

2017-08-30 18:17:08

by Long Li

Subject: RE: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer message with data payload

> -----Original Message-----
> From: Christoph Hellwig [mailto:[email protected]]
> Sent: Wednesday, August 30, 2017 1:52 AM
> To: Long Li <[email protected]>
> Cc: Christoph Hellwig <[email protected]>; Steve French
> <[email protected]>; [email protected]; samba-
> [email protected]; [email protected]
> Subject: Re: [[PATCH v1] 15/37] [CIFS] SMBD: Post a SMBD data transfer
> message with data payload
>
> On Wed, Aug 30, 2017 at 02:17:56AM +0000, Long Li wrote:
> > I partially addressed this issue in the V3 patch. Most of the
> > duplicate code on sending path is merged.
> >
> > The difficulty with translating the buffer to pages is that: I don't
> > know how many pages will be translated, and how many struct page I
> > need to allocate in advance to hold them. I try to avoid memory
> > allocation in the I/O path as much as possible. So I keep two
> > functions of sending data: one for buffer and one for pages.
>
> You do: you'll always need space for (len + PAGE_SIZE - 1) >> PAGE_SHIFT
> pages.
>
> That being said: what callers even send you buffers? In general we should
> aim to work with pages for all allocations that aren't tiny.

I'll look through the code that allocates the buffers; they can probably fit into one page. Will fix this.