This patchset is still preliminary and is just an RFC...
First, some background. When I was at Connectathon this year, Trond
mentioned an interesting idea to me. He said (paraphrasing):
"Why doesn't CIFS just use the RPC layer for transport? It's very
efficient at shoveling bits out onto the wire. You'd just need to
abstract out the XDR/RPC specific bits."
The idea has been floating around in the back of my head until recently,
when I decided to give it a go. This patchset represents a first, rough
draft at a patchset to do this. There's also a proof of concept module
that demonstrates that it works as expected.
The patchset is still very rough. Reconnect behavior is still very
"RPC-ish", for instance. There are doubtless other changes that will
need to be made before I had anything merge-worthy.
At this point, I'm just interested in feedback on the general approach.
Should I continue to pursue this or is it a non-starter for some
reason?
The next step here, obviously is to start building a fs on top of it.
I'd particularly be interested in using this as a foundation of a
new smb2fs.
I've also got this patchset up on my public git tree:
http://git.samba.org/?p=jlayton/cifs.git;a=summary
Here are some questions I anticipate on this, and their answers:
------------------------------------------------------------------------
Q: Are you insane? Why would you attempt to do this?
A: Maybe...but in the last couple of years, I've spent a substantial
amount of time working on the CIFS code. Much of that time has been
spent fixing bugs. Many of those bugs exist in the low-level transport
code which has been hacked on, kludged around and hand tweaked to
where it is today. Unfortunately that's made it a very hard to work
on mess. This drives away potential developers.
CIFS in particular is also designed around synchronous ops, which
seriously limits throughput. Retrofitting it for asynchronous operation
will be adding even more kludges. The sunrpc code is already
fundamentally asynchronous.
------------------------------------------------------------------------
Q: Okay, what's the benefit of hooking it up to sunrpc rather than
building a new transport layer (or fixing the transport in the other two
filesystems)?
A: Using sunrpc allows us to share a lot of the rpc scheduler code with
sunrpc. At a high level, NFS/RPC and SMB aren't really very different.
Sure, they have different formats, and one is big endian on the wire and
the other isn't...still there are significant similarities.
We also get access to the great upcall mechanisms that sunrpc has, and
the possibility to share code like the gssapi upcalls. The sunrpc layer
has a credential and authentication management framework that we can
build on to make a truly multiuser CIFS/SMB filesystem.
I've heard it claimed before that Linux's sunrpc layer is
over-engineered, but in this case that works in our favor...
------------------------------------------------------------------------
Q: can we hook up cifs or smbfs to use this as a transport?
A: Not trivially. CIFS in particular is not designed with each call
having discrete encode and decode functions. They're sort of mashed
together. smbfs might be possible...I'm a little less familiar with it,
but it does have a transport layer that more closely resembles the
sunrpc one. Still though, it'd take significant work to make that
happen. I'm not opposed to the idea however.
In the end though, I think we'll probably need to design something new
to sit on top of this. We will probably be able to borrow code and
concepts from the other filesystems however.
------------------------------------------------------------------------
Q: could we use this as a transport layer for a smb2fs ?
A: Yes, I think so. This particular prototype is build around SMB1, but
SMB2 could be supported with only minor modifications. One of the
reasons for sending this patchset now before I've built a filesystem on
top of it is because I know that SMB2 work is in progress. I'd like to
see it based around a more asynchronous transport model, or at least
built with cleaner layering so that we can eventually bolt on a different
transport layer if we so choose.
Jeff Layton (9):
sunrpc: abstract out encoding function at rpc_clnt level
sunrpc: move check for too small reply size into rpc_verify_header
sunrpc: abstract our call decoding routine
sunrpc: move rpc_xdr_buf_init to clnt.h
sunrpc: make call_bind non-static
sunrpc: add new SMB transport class for sunrpc
sunrpc: add encoding and decoding routines for SMB
sunrpc: add Kconfig option for CONFIG_SUNRPC_SMB
smbtest: simple module for testing SMB/RPC code
fs/Makefile | 2 +
fs/lockd/host.c | 4 +
fs/lockd/mon.c | 4 +
fs/nfs/client.c | 4 +
fs/nfs/mount_clnt.c | 4 +
fs/nfsd/nfs4callback.c | 4 +
fs/smbtest/Makefile | 1 +
fs/smbtest/smbtest.c | 204 +++++
include/linux/sunrpc/clnt.h | 24 +-
include/linux/sunrpc/smb.h | 42 +
include/linux/sunrpc/xprtsmb.h | 59 ++
net/sunrpc/Kconfig | 11 +
net/sunrpc/Makefile | 1 +
net/sunrpc/clnt.c | 98 ++-
net/sunrpc/rpcb_clnt.c | 8 +
net/sunrpc/smb.c | 120 +++
net/sunrpc/sunrpc_syms.c | 3 +
net/sunrpc/xprtsmb.c | 1723 ++++++++++++++++++++++++++++++++++++++++
18 files changed, 2272 insertions(+), 44 deletions(-)
create mode 100644 fs/smbtest/Makefile
create mode 100644 fs/smbtest/smbtest.c
create mode 100644 include/linux/sunrpc/smb.h
create mode 100644 include/linux/sunrpc/xprtsmb.h
create mode 100644 net/sunrpc/smb.c
create mode 100644 net/sunrpc/xprtsmb.c
In order to add the ability to do an SMB call with the sunrpc layer, we
need to abstract out the call encoding. Add a function pointer that hangs
off of the rpc_clnt to do the encoding. For now, all of the existing
callers will simply set it to rpc_xdr_encode.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/host.c | 3 +++
fs/lockd/mon.c | 3 +++
fs/nfs/client.c | 3 +++
fs/nfs/mount_clnt.c | 3 +++
fs/nfsd/nfs4callback.c | 3 +++
include/linux/sunrpc/clnt.h | 7 +++++++
net/sunrpc/clnt.c | 14 ++++++++++----
net/sunrpc/rpcb_clnt.c | 6 ++++++
8 files changed, 38 insertions(+), 4 deletions(-)
diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index 4600c20..b7189ce 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -362,6 +362,9 @@ nlm_bind_host(struct nlm_host *host)
.program = &nlm_program,
.version = host->h_version,
.authflavor = RPC_AUTH_UNIX,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.flags = (RPC_CLNT_CREATE_NOPING |
RPC_CLNT_CREATE_AUTOBIND),
};
diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index f956651..ea24301 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -75,6 +75,9 @@ static struct rpc_clnt *nsm_create(void)
.program = &nsm_program,
.version = NSM_VERSION,
.authflavor = RPC_AUTH_NULL,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.flags = RPC_CLNT_CREATE_NOPING,
};
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 63976c0..5505ddf 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -606,6 +606,9 @@ static int nfs_create_rpc_client(struct nfs_client *clp,
.servername = clp->cl_hostname,
.program = &nfs_program,
.version = clp->rpc_ops->version,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.authflavor = flavor,
};
diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index 0adefc4..14f79e5 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -159,6 +159,9 @@ int nfs_mount(struct nfs_mount_request *info)
.servername = info->hostname,
.program = &mnt_program,
.version = info->version,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.authflavor = RPC_AUTH_UNIX,
};
struct rpc_clnt *mnt_clnt;
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index 24e8d78..dbfd91c 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -494,6 +494,9 @@ int setup_callback_client(struct nfs4_client *clp)
.authflavor = clp->cl_flavor,
.flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET),
.client_name = clp->cl_principal,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
};
struct rpc_clnt *client;
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 8ed9642..6209c39 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -63,6 +63,9 @@ struct rpc_clnt {
struct rpc_program * cl_program;
char cl_inline_name[32];
char *cl_principal; /* target to authenticate to */
+ void (*cl_encode) (struct rpc_task *task);
+ size_t cl_callhdr_sz; /* in quadwords */
+ size_t cl_replhdr_sz; /* in quadwords */
};
/*
@@ -115,6 +118,9 @@ struct rpc_create_args {
unsigned long flags;
char *client_name;
struct svc_xprt *bc_xprt; /* NFSv4.1 backchannel */
+ size_t callhdr_size; /* in quadwords */
+ size_t replhdr_size; /* in quadwords */
+ void (*encode) (struct rpc_task *task);
};
/* Values for "flags" field */
@@ -150,6 +156,7 @@ struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
int flags);
void rpc_restart_call_prepare(struct rpc_task *);
void rpc_restart_call(struct rpc_task *);
+void rpc_xdr_encode(struct rpc_task *);
void rpc_setbufsize(struct rpc_clnt *, unsigned int, unsigned int);
size_t rpc_max_payload(struct rpc_clnt *);
void rpc_force_rebind(struct rpc_clnt *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 38829e2..f675416 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -200,6 +200,10 @@ static struct rpc_clnt * rpc_new_client(const struct rpc_create_args *args, stru
clnt->cl_vers = version->number;
clnt->cl_stats = program->stats;
clnt->cl_metrics = rpc_alloc_iostats(clnt);
+ clnt->cl_encode = args->encode;
+ clnt->cl_callhdr_sz = args->callhdr_size;
+ clnt->cl_replhdr_sz = args->replhdr_size;
+
err = -ENOMEM;
if (clnt->cl_metrics == NULL)
goto out_no_stats;
@@ -905,6 +909,7 @@ call_allocate(struct rpc_task *task)
unsigned int slack = task->tk_msg.rpc_cred->cr_auth->au_cslack;
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_xprt *xprt = task->tk_xprt;
+ struct rpc_clnt *clnt = task->tk_client;
struct rpc_procinfo *proc = task->tk_msg.rpc_proc;
dprint_status(task);
@@ -926,9 +931,9 @@ call_allocate(struct rpc_task *task)
* and reply headers, and convert both values
* to byte sizes.
*/
- req->rq_callsize = RPC_CALLHDRSIZE + (slack << 1) + proc->p_arglen;
+ req->rq_callsize = clnt->cl_callhdr_sz + (slack << 1) + proc->p_arglen;
req->rq_callsize <<= 2;
- req->rq_rcvsize = RPC_REPHDRSIZE + slack + proc->p_replen;
+ req->rq_rcvsize = clnt->cl_replhdr_sz + slack + proc->p_replen;
req->rq_rcvsize <<= 2;
req->rq_buffer = xprt->ops->buf_alloc(task,
@@ -975,7 +980,7 @@ rpc_xdr_buf_init(struct xdr_buf *buf, void *start, size_t len)
/*
* 3. Encode arguments of an RPC call
*/
-static void
+void
rpc_xdr_encode(struct rpc_task *task)
{
struct rpc_rqst *req = task->tk_rqstp;
@@ -1005,6 +1010,7 @@ rpc_xdr_encode(struct rpc_task *task)
task->tk_status = rpcauth_wrap_req(task, encode, req, p,
task->tk_msg.rpc_argp);
}
+EXPORT_SYMBOL_GPL(rpc_xdr_encode);
/*
* 4. Get the server port number if not yet set
@@ -1148,7 +1154,7 @@ call_transmit(struct rpc_task *task)
/* Encode here so that rpcsec_gss can use correct sequence number. */
if (rpc_task_need_encode(task)) {
BUG_ON(task->tk_rqstp->rq_bytes_sent != 0);
- rpc_xdr_encode(task);
+ task->tk_client->cl_encode(task);
/* Did the encode result in an error condition? */
if (task->tk_status != 0) {
/* Was the error nonfatal? */
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index 830faf4..65764ba 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -174,6 +174,9 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
.program = &rpcb_program,
.version = version,
.authflavor = RPC_AUTH_UNIX,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.flags = RPC_CLNT_CREATE_NOPING,
};
@@ -191,6 +194,9 @@ static struct rpc_clnt *rpcb_create(char *hostname, struct sockaddr *srvaddr,
.program = &rpcb_program,
.version = version,
.authflavor = RPC_AUTH_UNIX,
+ .encode = rpc_xdr_encode,
+ .callhdr_size = RPC_CALLHDRSIZE,
+ .replhdr_size = RPC_REPHDRSIZE,
.flags = (RPC_CLNT_CREATE_NOPING |
RPC_CLNT_CREATE_NONPRIVPORT),
};
--
1.6.0.6
Proof of concept module. This just sends a NEGOTIATE_PROTOCOL request to
a server and verifies the response.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/Makefile | 2 +
fs/smbtest/Makefile | 1 +
fs/smbtest/smbtest.c | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 207 insertions(+), 0 deletions(-)
create mode 100644 fs/smbtest/Makefile
create mode 100644 fs/smbtest/smbtest.c
diff --git a/fs/Makefile b/fs/Makefile
index af6d047..0e9d1df 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -124,3 +124,5 @@ obj-$(CONFIG_OCFS2_FS) += ocfs2/
obj-$(CONFIG_BTRFS_FS) += btrfs/
obj-$(CONFIG_GFS2_FS) += gfs2/
obj-$(CONFIG_EXOFS_FS) += exofs/
+
+obj-m += smbtest/
diff --git a/fs/smbtest/Makefile b/fs/smbtest/Makefile
new file mode 100644
index 0000000..d861f82
--- /dev/null
+++ b/fs/smbtest/Makefile
@@ -0,0 +1 @@
+obj-m += smbtest.o
diff --git a/fs/smbtest/smbtest.c b/fs/smbtest/smbtest.c
new file mode 100644
index 0000000..cfcb672
--- /dev/null
+++ b/fs/smbtest/smbtest.c
@@ -0,0 +1,204 @@
+/*
+ * fs/smbtest/smbtest.c -- proof of concept test for SMB code in sunrpc
+ *
+ * Copyright (C) 2009 Red Hat, Inc -- Jeff Layton <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/inet.h>
+#include <linux/sched.h>
+#include <linux/smbno.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/xprt.h>
+#include <linux/sunrpc/xprtsmb.h>
+#include <linux/sunrpc/smb.h>
+
+#define CIFS_NEGOTIATE 0
+#define SMB_COM_NEGOTIATE 0x72
+
+static char server_address[INET6_ADDRSTRLEN];
+module_param_string(server_address, server_address, INET6_ADDRSTRLEN, 0);
+MODULE_PARM_DESC(server_address, "IPv4 address of server");
+
+typedef struct negotiate_req {
+ __u8 WordCount;
+ __le16 ByteCount;
+ unsigned char DialectsArray[1];
+} __attribute__((packed)) NEGOTIATE_REQ;
+
+typedef struct negotiate_rsp {
+ __u8 WordCount;
+ __le16 DialectIndex; /* 0xFFFF = no dialect acceptable */
+ __u8 SecurityMode;
+ __le16 MaxMpxCount;
+ __le16 MaxNumberVcs;
+ __le32 MaxBufferSize;
+ __le32 MaxRawSize;
+ __le32 SessionKey;
+ __le32 Capabilities; /* see below */
+ __le32 SystemTimeLow;
+ __le32 SystemTimeHigh;
+ __le16 ServerTimeZone;
+ __u8 EncryptionKeyLength;
+ __u16 ByteCount;
+ union {
+ unsigned char EncryptionKey[1]; /* cap extended security off */
+ /* followed by Domain name - if extended security is off */
+ /* followed by 16 bytes of server GUID */
+ /* then security blob if cap_extended_security negotiated */
+ struct {
+ unsigned char GUID[16];
+ unsigned char SecurityBlob[1];
+ } __attribute__((packed)) extended_response;
+ } __attribute__((packed)) u;
+} __attribute__((packed)) NEGOTIATE_RSP;
+
+/*
+ * SMB doesn't mandate that buffers are always quadword aligned. XDR/RPC
+ * does however and that concept is pretty pervasive in the sunrpc code.
+ * Eventually, it might be better to specify sizes in bytes, but it doesn't
+ * really matter much.
+ */
+#define SMB_HDR_QUADS XDR_QUADLEN(sizeof(struct smb_header))
+
+static int
+smb_enc_negotiate(struct rpc_rqst *req, NEGOTIATE_REQ *buf, void *obj)
+{
+ char *dialects = buf->DialectsArray;
+
+ buf->WordCount = 0;
+ buf->ByteCount = 12;
+
+ dialects[0] = 0x02; /* buffer format byte */
+ strncpy(&dialects[1], "NT LM 0.12", 1000);
+
+ req->rq_slen = xdr_adjust_iovec(&req->rq_svec[0],
+ (__be32 *) &dialects[12]);
+
+ return 0;
+}
+
+static int
+smb_dec_negotiate(void *rqstp, char *data, void *obj)
+{
+ NEGOTIATE_RSP *buf = (NEGOTIATE_RSP *) (data + sizeof(struct smb_header));
+
+ printk("smbtest: Server wants dialect index %u\n",
+ le16_to_cpu(buf->DialectIndex));
+ return 0;
+}
+
+static struct rpc_procinfo smbtest_procedures[] = {
+[CIFS_NEGOTIATE] = {
+ .p_proc = SMB_COM_NEGOTIATE,
+ .p_encode = (kxdrproc_t) smb_enc_negotiate,
+ .p_decode = (kxdrproc_t) smb_dec_negotiate,
+ .p_arglen = XDR_QUADLEN(sizeof(NEGOTIATE_REQ) + 1024),
+ .p_replen = XDR_QUADLEN(sizeof(NEGOTIATE_RSP) + 1024),
+ .p_timer = 0,
+ .p_statidx = CIFS_NEGOTIATE,
+ .p_name = "NEGOTIATE",
+ },
+};
+
+static struct rpc_version smbtest_version1 = {
+ .number = 1,
+ .nrprocs = ARRAY_SIZE(smbtest_procedures),
+ .procs = smbtest_procedures,
+};
+
+static struct rpc_version * smbtest_version[] = {
+ [1] = &smbtest_version1,
+};
+
+static struct rpc_stat smbtest_stats;
+
+static struct rpc_program cifs_rpc_prog = {
+ .name = "smbtest",
+ .number = 0,
+ .nrvers = ARRAY_SIZE(smbtest_version),
+ .version = smbtest_version,
+ .stats = &smbtest_stats,
+};
+
+static struct rpc_clnt *smbtest_clnt;
+
+static int smbtest_init(void)
+{
+ struct sockaddr_in sin = {
+ .sin_family = AF_INET,
+ .sin_port = htons(445),
+ };
+ u8 *addr = (u8 *) &sin.sin_addr.s_addr;
+ int status;
+ struct rpc_create_args create_args = {
+ .protocol = XPRT_TRANSPORT_SMB,
+ .address = (struct sockaddr *) &sin,
+ .addrsize = sizeof(sin),
+ .servername = "cifs",
+ .program = &cifs_rpc_prog,
+ .version = 1,
+ .authflavor = RPC_AUTH_NULL,
+ .encode = smb_encode,
+ .decode = smb_decode,
+ .callhdr_size = SMB_HDR_QUADS,
+ .replhdr_size = SMB_HDR_QUADS,
+ .flags = RPC_CLNT_CREATE_NOPING |
+ RPC_CLNT_CREATE_NONPRIVPORT,
+ };
+
+ NEGOTIATE_REQ req = { };
+ NEGOTIATE_RSP rsp = { };
+
+ struct rpc_message msg = {
+ .rpc_argp = &req,
+ .rpc_resp = &rsp,
+ };
+
+ if (!in4_pton(server_address, -1, addr, '\0', NULL)) {
+ printk("smbtest: in4_pton failed\n");
+ return -EINVAL;
+ }
+
+ smbtest_clnt = rpc_create(&create_args);
+ if (IS_ERR(smbtest_clnt)) {
+ printk("smbtest: rpc client creation failed\n");
+ return PTR_ERR(smbtest_clnt);
+ }
+ printk("smbtest: rpc client create succeeded\n");
+
+ msg.rpc_proc = &smbtest_clnt->cl_procinfo[CIFS_NEGOTIATE];
+ status = rpc_call_sync(smbtest_clnt, &msg, 0);
+ printk("smbtest: rpc_call_sync returned %d\n", status);
+
+ return status;
+}
+
+static void smbtest_exit(void)
+{
+ printk("%s\n", __func__);
+ rpc_shutdown_client(smbtest_clnt);
+}
+
+module_init(smbtest_init);
+module_exit(smbtest_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jeff Layton <[email protected]>");
--
1.6.0.6
...this should introduce no behavioral changes.
Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/clnt.c | 24 ++++++++++++------------
1 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index f675416..e504b59 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1411,18 +1411,6 @@ call_decode(struct rpc_task *task)
WARN_ON(memcmp(&req->rq_rcv_buf, &req->rq_private_buf,
sizeof(req->rq_rcv_buf)) != 0);
- if (req->rq_rcv_buf.len < 12) {
- if (!RPC_IS_SOFT(task)) {
- task->tk_action = call_bind;
- clnt->cl_stats->rpcretrans++;
- goto out_retry;
- }
- dprintk("RPC: %s: too small RPC reply size (%d bytes)\n",
- clnt->cl_protname, task->tk_status);
- task->tk_action = call_timeout;
- goto out_retry;
- }
-
p = rpc_verify_header(task);
if (IS_ERR(p)) {
if (p == ERR_PTR(-EAGAIN))
@@ -1518,6 +1506,18 @@ rpc_verify_header(struct rpc_task *task)
u32 n;
int error = -EACCES;
+ if (task->tk_rqstp->rq_rcv_buf.len < 12) {
+ if (!RPC_IS_SOFT(task)) {
+ task->tk_action = call_bind;
+ task->tk_client->cl_stats->rpcretrans++;
+ goto out_retry;
+ }
+ dprintk("RPC: %s: too small RPC reply size (%d bytes)\n",
+ task->tk_client->cl_protname, task->tk_status);
+ task->tk_action = call_timeout;
+ goto out_retry;
+ }
+
if ((task->tk_rqstp->rq_rcv_buf.len & 3) != 0) {
/* RFC-1014 says that the representation of XDR data must be a
* multiple of four bytes
--
1.6.0.6
...so that it can be used elsewhere.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/clnt.h | 13 ++++++++++++-
net/sunrpc/clnt.c | 12 ------------
2 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 00afedc..307a3ec 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -165,7 +165,6 @@ size_t rpc_max_payload(struct rpc_clnt *);
void rpc_force_rebind(struct rpc_clnt *);
size_t rpc_peeraddr(struct rpc_clnt *, struct sockaddr *, size_t);
const char *rpc_peeraddr2str(struct rpc_clnt *, enum rpc_display_format_t);
-
size_t rpc_ntop(const struct sockaddr *, char *, const size_t);
size_t rpc_pton(const char *, const size_t,
struct sockaddr *, const size_t);
@@ -197,6 +196,18 @@ static inline void rpc_set_port(struct sockaddr *sap,
}
}
+static inline void
+rpc_xdr_buf_init(struct xdr_buf *buf, void *start, size_t len)
+{
+ buf->head[0].iov_base = start;
+ buf->head[0].iov_len = len;
+ buf->tail[0].iov_len = 0;
+ buf->page_len = 0;
+ buf->flags = 0;
+ buf->len = 0;
+ buf->buflen = len;
+}
+
#define IPV6_SCOPE_DELIMITER '%'
#define IPV6_SCOPE_ID_LEN sizeof("%nnnnnnnnnn")
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index f05d289..42688af 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -966,18 +966,6 @@ rpc_task_force_reencode(struct rpc_task *task)
task->tk_rqstp->rq_bytes_sent = 0;
}
-static inline void
-rpc_xdr_buf_init(struct xdr_buf *buf, void *start, size_t len)
-{
- buf->head[0].iov_base = start;
- buf->head[0].iov_len = len;
- buf->tail[0].iov_len = 0;
- buf->page_len = 0;
- buf->flags = 0;
- buf->len = 0;
- buf->buflen = len;
-}
-
/*
* 3. Encode arguments of an RPC call
*/
--
1.6.0.6
Add a new smb.c file to hold encode and decode routines for SMB packets.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/smb.h | 42 ++++++++++++++
include/linux/sunrpc/xprtsmb.h | 6 +-
net/sunrpc/smb.c | 120 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 165 insertions(+), 3 deletions(-)
create mode 100644 include/linux/sunrpc/smb.h
create mode 100644 net/sunrpc/smb.c
diff --git a/include/linux/sunrpc/smb.h b/include/linux/sunrpc/smb.h
new file mode 100644
index 0000000..304ab8c
--- /dev/null
+++ b/include/linux/sunrpc/smb.h
@@ -0,0 +1,42 @@
+/*
+ * net/sunrpc/smb.h -- SMB transport for sunrpc
+ *
+ * Copyright (c) 2009 Red Hat, Inc.
+ * Author(s): Jeff Layton ([email protected])
+ *
+ * This library is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; either version 2.1 of the License, or
+ * (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/*
+ * xprtsmb.h
+ *
+ * This file contains the public interfaces for the SMB transport for
+ * the sunrpc layer.
+ */
+
+#ifndef _LINUX_SUNRPC_SMB_H
+#define _LINUX_SUNRPC_SMB_H
+
+/*
+ * This is the generic SMB function. rqstp is either a rpc_rqst (client
+ * side) or svc_rqst pointer (server side).
+ * Encode functions always assume there's enough room in the buffer.
+ */
+typedef int (*ksmbproc_t)(void *rqstp, __le32 *data, void *obj);
+
+void smb_encode(struct rpc_task *task);
+int smb_decode(struct rpc_task *task);
+
+#endif /* _LINUX_SUNRPC_SMB_H */
diff --git a/include/linux/sunrpc/xprtsmb.h b/include/linux/sunrpc/xprtsmb.h
index d55e85b..731cce2 100644
--- a/include/linux/sunrpc/xprtsmb.h
+++ b/include/linux/sunrpc/xprtsmb.h
@@ -34,9 +34,6 @@
*/
#define XPRT_TRANSPORT_SMB 1024
-int init_smb_xprt(void);
-void cleanup_smb_xprt(void);
-
/* standard SMB header */
struct smb_header {
__u8 protocol[4];
@@ -56,4 +53,7 @@ struct smb_header {
/* SMB Header Flags of interest */
#define SMBFLG_RESPONSE 0x80 /* response from server */
+int init_smb_xprt(void);
+void cleanup_smb_xprt(void);
+
#endif /* _LINUX_SUNRPC_XPRTSMB_H */
diff --git a/net/sunrpc/smb.c b/net/sunrpc/smb.c
new file mode 100644
index 0000000..5511bb2
--- /dev/null
+++ b/net/sunrpc/smb.c
@@ -0,0 +1,120 @@
+/*
+ * net/sunrpc/smb.c -- smb encode and decode routines
+ *
+ * Copyright (C) 2009 Red Hat, Inc -- Jeff Layton <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor,
+ * Boston, MA 02110-1301, USA.
+ */
+
+#include <linux/types.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/xprtsmb.h>
+#include <linux/sunrpc/smb.h>
+#include <linux/smbno.h>
+
+static __le32 *
+smb_verify_header(struct rpc_task *task)
+{
+ struct kvec *iov = &task->tk_rqstp->rq_rcv_buf.head[0];
+ __le32 *p = iov->iov_base;
+
+ if (task->tk_rqstp->rq_rcv_buf.len < sizeof(struct smb_header)) {
+ task->tk_action = call_bind;
+ task->tk_client->cl_stats->rpcretrans++;
+ return ERR_PTR(-EAGAIN);
+ }
+
+ /* FIXME: check for errors and return them */
+
+ /*
+ * with SMB, we occasionally need to refer back to the SMB header for
+ * info (flags, etc). The header is fixed-length however, so just
+ * return a pointer to the start of the SMB header.
+ */
+ return p;
+}
+
+static __le32 *
+smb_encode_header(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ u32 *p = (u32 *) req->rq_svec[0].iov_base;
+ struct smb_header *h;
+
+ /* transport header is always 32 bits */
+ ++p;
+ memset(p, 0, sizeof(*h));
+
+ h = (struct smb_header *) p;
+
+ h->protocol[0] = 0xFF;
+ h->protocol[1] = 'S';
+ h->protocol[2] = 'M';
+ h->protocol[3] = 'B';
+
+ h->command = task->tk_msg.rpc_proc->p_proc;
+ h->flags2 = SMB_FLAGS2_LONG_PATH_COMPONENTS;
+ h->pid = cpu_to_le16((u16) current->tgid);
+ h->pidhigh = cpu_to_le16((u16) (current->tgid >> 16));
+
+ /*
+ * SMB MID's are similar to XID's in RPC, but they're only 16 bits.
+ * For now we just use the xid field and mask off the upper bits.
+ */
+ req->rq_xid &= 0x0000ffff;
+ h->mid = cpu_to_le16((u16) req->rq_xid);
+
+ req->rq_slen = xdr_adjust_iovec(&req->rq_svec[0], (__be32 *) (h + 1));
+
+ return (__le32 *) (h + 1);
+}
+
+void
+smb_encode(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ __le32 *p;
+ ksmbproc_t encode = (ksmbproc_t) task->tk_msg.rpc_proc->p_encode;
+
+ rpc_xdr_buf_init(&req->rq_snd_buf, req->rq_buffer, req->rq_callsize);
+ rpc_xdr_buf_init(&req->rq_rcv_buf,
+ (char *)req->rq_buffer + req->rq_callsize,
+ req->rq_rcvsize);
+
+ p = smb_encode_header(task);
+
+ encode(req, p, NULL);
+ return;
+}
+EXPORT_SYMBOL_GPL(smb_encode);
+
+int
+smb_decode(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ kxdrproc_t decode = task->tk_msg.rpc_proc->p_decode;
+ __le32 *p;
+
+ p = smb_verify_header(task);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ if (decode)
+ task->tk_status = decode(req, p, task->tk_msg.rpc_resp);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(smb_decode);
--
1.6.0.6
...and the necessary Makefile bits so to compile support into the sunrpc
module.
Signed-off-by: Jeff Layton <[email protected]>
---
net/sunrpc/Kconfig | 11 +++++++++++
net/sunrpc/Makefile | 1 +
net/sunrpc/sunrpc_syms.c | 3 +++
net/sunrpc/xprtsmb.c | 7 +++++--
4 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 443c161..63f5ee7 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -54,3 +54,14 @@ config RPCSEC_GSS_SPKM3
available from http://linux-nfs.org/.
If unsure, say N.
+
+config SUNRPC_SMB
+ bool "SMB/CIFS Transport for the SUNRPC Layer (EXPERIMENTAL)"
+ depends on SUNRPC && EXPERIMENTAL
+ help
+ This option adds the ability for the kernels SUNRPC Layer to
+ send and receive Sever Message Block (SMB) traffic. This
+ protocol is widely used by Microsoft Windows servers for
+ filesharing.
+
+ If unsure, say N.
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 9d2fca5..691e62f 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -16,3 +16,4 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
sunrpc-$(CONFIG_NFS_V4_1) += backchannel_rqst.o bc_svc.o
sunrpc-$(CONFIG_PROC_FS) += stats.o
sunrpc-$(CONFIG_SYSCTL) += sysctl.o
+sunrpc-$(CONFIG_SUNRPC_SMB) += xprtsmb.o smb.o
diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 8cce921..65b4dec 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -21,6 +21,7 @@
#include <linux/workqueue.h>
#include <linux/sunrpc/rpc_pipe_fs.h>
#include <linux/sunrpc/xprtsock.h>
+#include <linux/sunrpc/xprtsmb.h>
extern struct cache_detail ip_map_cache, unix_gid_cache;
@@ -45,6 +46,7 @@ init_sunrpc(void)
cache_register(&unix_gid_cache);
svc_init_xprt_sock(); /* svc sock transport */
init_socket_xprt(); /* clnt sock transport */
+ init_smb_xprt();
rpcauth_init_module();
out:
return err;
@@ -54,6 +56,7 @@ static void __exit
cleanup_sunrpc(void)
{
rpcauth_remove_module();
+ cleanup_smb_xprt();
cleanup_socket_xprt();
svc_cleanup_xprt_sock();
unregister_rpc_pipefs();
diff --git a/net/sunrpc/xprtsmb.c b/net/sunrpc/xprtsmb.c
index 585cf37..362600a 100644
--- a/net/sunrpc/xprtsmb.c
+++ b/net/sunrpc/xprtsmb.c
@@ -1691,7 +1691,8 @@ static struct xprt_class xs_smb_transport = {
* init_socket_xprt - set up xprtsock's sysctls, register with RPC client
*
*/
-int init_smb_xprt(void)
+int
+init_smb_xprt(void)
{
#ifdef RPC_DEBUG
if (!sunrpc_table_header)
@@ -1707,7 +1708,8 @@ int init_smb_xprt(void)
* cleanup_socket_xprt - remove xprtsock's sysctls, unregister
*
*/
-void cleanup_smb_xprt(void)
+void
+cleanup_smb_xprt(void)
{
#ifdef RPC_DEBUG
if (sunrpc_table_header) {
@@ -1718,3 +1720,4 @@ void cleanup_smb_xprt(void)
xprt_unregister_transport(&xs_smb_transport);
}
+
--
1.6.0.6
...all of the existing rpc users will use the new rpc_xdr_decode function.
Signed-off-by: Jeff Layton <[email protected]>
---
fs/lockd/host.c | 1 +
fs/lockd/mon.c | 1 +
fs/nfs/client.c | 1 +
fs/nfs/mount_clnt.c | 1 +
fs/nfsd/nfs4callback.c | 1 +
include/linux/sunrpc/clnt.h | 3 ++
net/sunrpc/clnt.c | 46 +++++++++++++++++++++++++++++-------------
net/sunrpc/rpcb_clnt.c | 2 +
8 files changed, 42 insertions(+), 14 deletions(-)
diff --git a/fs/lockd/host.c b/fs/lockd/host.c
index b7189ce..2159ee2 100644
--- a/fs/lockd/host.c
+++ b/fs/lockd/host.c
@@ -363,6 +363,7 @@ nlm_bind_host(struct nlm_host *host)
.version = host->h_version,
.authflavor = RPC_AUTH_UNIX,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.flags = (RPC_CLNT_CREATE_NOPING |
diff --git a/fs/lockd/mon.c b/fs/lockd/mon.c
index ea24301..bb4e1ad 100644
--- a/fs/lockd/mon.c
+++ b/fs/lockd/mon.c
@@ -76,6 +76,7 @@ static struct rpc_clnt *nsm_create(void)
.version = NSM_VERSION,
.authflavor = RPC_AUTH_NULL,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.flags = RPC_CLNT_CREATE_NOPING,
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 5505ddf..5ec2a45 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -607,6 +607,7 @@ static int nfs_create_rpc_client(struct nfs_client *clp,
.program = &nfs_program,
.version = clp->rpc_ops->version,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.authflavor = flavor,
diff --git a/fs/nfs/mount_clnt.c b/fs/nfs/mount_clnt.c
index 14f79e5..86a3b9e 100644
--- a/fs/nfs/mount_clnt.c
+++ b/fs/nfs/mount_clnt.c
@@ -160,6 +160,7 @@ int nfs_mount(struct nfs_mount_request *info)
.program = &mnt_program,
.version = info->version,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.authflavor = RPC_AUTH_UNIX,
diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
index dbfd91c..bb655a2 100644
--- a/fs/nfsd/nfs4callback.c
+++ b/fs/nfsd/nfs4callback.c
@@ -495,6 +495,7 @@ int setup_callback_client(struct nfs4_client *clp)
.flags = (RPC_CLNT_CREATE_NOPING | RPC_CLNT_CREATE_QUIET),
.client_name = clp->cl_principal,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
};
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 6209c39..00afedc 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -64,6 +64,7 @@ struct rpc_clnt {
char cl_inline_name[32];
char *cl_principal; /* target to authenticate to */
void (*cl_encode) (struct rpc_task *task);
+ int (*cl_decode) (struct rpc_task *task);
size_t cl_callhdr_sz; /* in quadwords */
size_t cl_replhdr_sz; /* in quadwords */
};
@@ -121,6 +122,7 @@ struct rpc_create_args {
size_t callhdr_size; /* in quadwords */
size_t replhdr_size; /* in quadwords */
void (*encode) (struct rpc_task *task);
+ int (*decode) (struct rpc_task *task);
};
/* Values for "flags" field */
@@ -157,6 +159,7 @@ struct rpc_task *rpc_call_null(struct rpc_clnt *clnt, struct rpc_cred *cred,
void rpc_restart_call_prepare(struct rpc_task *);
void rpc_restart_call(struct rpc_task *);
void rpc_xdr_encode(struct rpc_task *);
+int rpc_xdr_decode(struct rpc_task *);
void rpc_setbufsize(struct rpc_clnt *, unsigned int, unsigned int);
size_t rpc_max_payload(struct rpc_clnt *);
void rpc_force_rebind(struct rpc_clnt *);
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index e504b59..f05d289 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -201,6 +201,7 @@ static struct rpc_clnt * rpc_new_client(const struct rpc_create_args *args, stru
clnt->cl_stats = program->stats;
clnt->cl_metrics = rpc_alloc_iostats(clnt);
clnt->cl_encode = args->encode;
+ clnt->cl_decode = args->decode;
clnt->cl_callhdr_sz = args->callhdr_size;
clnt->cl_replhdr_sz = args->replhdr_size;
@@ -1379,6 +1380,27 @@ retry:
task->tk_status = 0;
}
+int
+rpc_xdr_decode(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ kxdrproc_t decode = task->tk_msg.rpc_proc->p_decode;
+ __be32 *p;
+
+ p = rpc_verify_header(task);
+ if (IS_ERR(p))
+ return PTR_ERR(p);
+
+ if (decode)
+ task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
+ task->tk_msg.rpc_resp);
+
+ dprintk("RPC: %5u %s result %d\n", task->tk_pid, __func__,
+ task->tk_status);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(rpc_xdr_decode);
+
/*
* 7. Decode the RPC reply
*/
@@ -1387,8 +1409,7 @@ call_decode(struct rpc_task *task)
{
struct rpc_clnt *clnt = task->tk_client;
struct rpc_rqst *req = task->tk_rqstp;
- kxdrproc_t decode = task->tk_msg.rpc_proc->p_decode;
- __be32 *p;
+ int status;
dprintk("RPC: %5u call_decode (status %d)\n",
task->tk_pid, task->tk_status);
@@ -1411,22 +1432,19 @@ call_decode(struct rpc_task *task)
WARN_ON(memcmp(&req->rq_rcv_buf, &req->rq_private_buf,
sizeof(req->rq_rcv_buf)) != 0);
- p = rpc_verify_header(task);
- if (IS_ERR(p)) {
- if (p == ERR_PTR(-EAGAIN))
- goto out_retry;
+ if (!clnt->cl_decode)
+ BUG();
+
+ status = clnt->cl_decode(task);
+
+ if (status == -EAGAIN)
+ goto out_retry;
+ else if (status)
return;
- }
task->tk_action = rpc_exit_task;
-
- if (decode) {
- task->tk_status = rpcauth_unwrap_resp(task, decode, req, p,
- task->tk_msg.rpc_resp);
- }
- dprintk("RPC: %5u call_decode result %d\n", task->tk_pid,
- task->tk_status);
return;
+
out_retry:
task->tk_status = 0;
/* Note: rpc_verify_header() may have freed the RPC slot */
diff --git a/net/sunrpc/rpcb_clnt.c b/net/sunrpc/rpcb_clnt.c
index 65764ba..db053bf 100644
--- a/net/sunrpc/rpcb_clnt.c
+++ b/net/sunrpc/rpcb_clnt.c
@@ -175,6 +175,7 @@ static struct rpc_clnt *rpcb_create_local(struct sockaddr *addr,
.version = version,
.authflavor = RPC_AUTH_UNIX,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.flags = RPC_CLNT_CREATE_NOPING,
@@ -195,6 +196,7 @@ static struct rpc_clnt *rpcb_create(char *hostname, struct sockaddr *srvaddr,
.version = version,
.authflavor = RPC_AUTH_UNIX,
.encode = rpc_xdr_encode,
+ .decode = rpc_xdr_decode,
.callhdr_size = RPC_CALLHDRSIZE,
.replhdr_size = RPC_REPHDRSIZE,
.flags = (RPC_CLNT_CREATE_NOPING |
--
1.6.0.6
Add a new transport class for SMB traffic that lives in xprtsmb.c and
a corresponding header file for making use of it.
With this, it's possible to use the sunrpc layer to send and receive
SMB traffic.
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/xprtsmb.h | 59 ++
net/sunrpc/xprtsmb.c | 1720 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 1779 insertions(+), 0 deletions(-)
create mode 100644 include/linux/sunrpc/xprtsmb.h
create mode 100644 net/sunrpc/xprtsmb.c
diff --git a/include/linux/sunrpc/xprtsmb.h b/include/linux/sunrpc/xprtsmb.h
new file mode 100644
index 0000000..d55e85b
--- /dev/null
+++ b/include/linux/sunrpc/xprtsmb.h
@@ -0,0 +1,59 @@
+/*
+ * net/sunrpc/xprtsmb.h -- SMB transport for sunrpc
+ *
+ * Copyright (c) 2009 Red Hat, Inc.
+ * Author(s): Jeff Layton ([email protected])
+ *
+ * This library is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; either version 2.1 of the License, or
+ * (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
+ * the GNU Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public License
+ * along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/*
+ * xprtsmb.h
+ *
+ * This file contains the public interfaces for the SMB transport for
+ * the sunrpc layer.
+ */
+
+#ifndef _LINUX_SUNRPC_XPRTSMB_H
+#define _LINUX_SUNRPC_XPRTSMB_H
+
+/*
+ * RPC transport identifier for SMB
+ */
+#define XPRT_TRANSPORT_SMB 1024
+
+int init_smb_xprt(void);
+void cleanup_smb_xprt(void);
+
+/* standard SMB header */
+struct smb_header {
+ __u8 protocol[4];
+ __u8 command;
+ __le32 status;
+ __u8 flags;
+ __le16 flags2;
+ __le16 pidhigh;
+ __u8 signature[8];
+ __u8 pad[2];
+ __le16 tid;
+ __le16 pid;
+ __le16 uid;
+ __le16 mid;
+} __attribute__((packed));
+
+/* SMB Header Flags of interest */
+#define SMBFLG_RESPONSE 0x80 /* response from server */
+
+#endif /* _LINUX_SUNRPC_XPRTSMB_H */
diff --git a/net/sunrpc/xprtsmb.c b/net/sunrpc/xprtsmb.c
new file mode 100644
index 0000000..585cf37
--- /dev/null
+++ b/net/sunrpc/xprtsmb.c
@@ -0,0 +1,1720 @@
+/*
+ * linux/net/sunrpc/xprtsmb.c
+ *
+ * Client-side transport implementation for SMB sockets.
+ *
+ * (C) 1998 Red Hat
+ * Initial revision by Jeff Layton <[email protected]>
+ *
+ * Heavily based on net/sunrpc/xprtsock.c -- copyrights from that file below:
+ *
+ * TCP callback races fixes (C) 1998 Red Hat
+ * TCP send fixes (C) 1998 Red Hat
+ * TCP NFS related read + write fixes
+ * (C) 1999 Dave Airlie, University of Limerick, Ireland <[email protected]>
+ *
+ * Rewrite of larges part of the code in order to stabilize TCP stuff.
+ * Fix behaviour when socket buffer is full.
+ * (C) 1999 Trond Myklebust <[email protected]>
+ *
+ * IP socket transport implementation, (C) 2005 Chuck Lever <[email protected]>
+ *
+ * IPv6 support contributed by Gilles Quillard, Bull Open Source, 2005.
+ * <[email protected]>
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+#include <linux/capability.h>
+#include <linux/pagemap.h>
+#include <linux/errno.h>
+#include <linux/socket.h>
+#include <linux/in.h>
+#include <linux/net.h>
+#include <linux/mm.h>
+#include <linux/tcp.h>
+#include <linux/sunrpc/clnt.h>
+#include <linux/sunrpc/sched.h>
+#include <linux/sunrpc/xprtsmb.h>
+#include <linux/sunrpc/xprtsock.h>
+#include <linux/file.h>
+#ifdef CONFIG_NFS_V4_1
+#include <linux/sunrpc/bc_xprt.h>
+#endif
+
+#include <net/sock.h>
+#include <net/checksum.h>
+#include <net/tcp.h>
+
+/* CIFS currently defaults to allowing 50 outstanding calls at a time */
+#define SMB_DEF_SLOT_TABLE 50
+
+/*
+ * top byte of the 32-bit SMB "fragment header" is always 0. That gives a max
+ * size for a SMB packet of 2^24-1
+ */
+#define SMB_FRAGMENT_SIZE_MASK 0x00ffffff
+#define SMB_MAX_FRAGMENT_SIZE SMB_FRAGMENT_SIZE_MASK
+
+/*
+ * xprtsock tunables
+ */
+unsigned int xprt_smb_slot_table_entries = SMB_DEF_SLOT_TABLE;
+
+#define XS_TCP_LINGER_TO (15U * HZ)
+static unsigned int xs_smb_fin_timeout __read_mostly = XS_TCP_LINGER_TO;
+
+/*
+ * We can register our own files under /proc/sys/sunrpc by
+ * calling register_sysctl_table() again. The files in that
+ * directory become the union of all files registered there.
+ *
+ * We simply need to make sure that we don't collide with
+ * someone else's file names!
+ */
+
+#ifdef RPC_DEBUG
+
+static unsigned int min_slot_table_size = RPC_MIN_SLOT_TABLE;
+static unsigned int max_slot_table_size = RPC_MAX_SLOT_TABLE;
+
+static struct ctl_table_header *sunrpc_table_header;
+
+/*
+ * FIXME: changing the UDP slot table size should also resize the UDP
+ * socket buffers for existing UDP transports
+ */
+static ctl_table xs_tunables_table[] = {
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "smb_slot_table_entries",
+ .data = &xprt_smb_slot_table_entries,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .strategy = &sysctl_intvec,
+ .extra1 = &min_slot_table_size,
+ .extra2 = &max_slot_table_size
+ },
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "smb_fin_timeout",
+ .data = &xs_smb_fin_timeout,
+ .maxlen = sizeof(xs_smb_fin_timeout),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_jiffies,
+ .strategy = sysctl_jiffies
+ },
+ {
+ .ctl_name = 0,
+ },
+};
+
+static ctl_table sunrpc_table[] = {
+ {
+ .ctl_name = CTL_SUNRPC,
+ .procname = "sunrpc",
+ .mode = 0555,
+ .child = xs_tunables_table
+ },
+ {
+ .ctl_name = 0,
+ },
+};
+
+#endif
+
+/*
+ * Wait duration for an RPC TCP connection to be established. Solaris
+ * NFS over TCP uses 60 seconds, for example, which is in line with how
+ * long a server takes to reboot.
+ */
+#define XS_TCP_CONN_TO (60U * HZ)
+
+/*
+ * Wait duration for a reply from the RPC portmapper.
+ */
+#define XS_BIND_TO (60U * HZ)
+
+/*
+ * The reestablish timeout allows clients to delay for a bit before attempting
+ * to reconnect to a server that just dropped our connection.
+ *
+ * We implement an exponential backoff when trying to reestablish a TCP
+ * transport connection with the server. Some servers like to drop a TCP
+ * connection when they are overworked, so we start with a short timeout and
+ * increase over time if the server is down or not responding.
+ */
+#define XS_TCP_INIT_REEST_TO (3U * HZ)
+#define XS_TCP_MAX_REEST_TO (5U * 60 * HZ)
+
+/*
+ * reconnecting a SMB socket is expensive, so we don't do idle timeouts if
+ * we can help it.
+ */
+#define XS_SMB_IDLE_DISC_TO (~0UL)
+
+
+#ifdef RPC_DEBUG
+# undef RPC_DEBUG_DATA
+# define RPCDBG_FACILITY RPCDBG_TRANS
+#endif
+
+#ifdef RPC_DEBUG_DATA
+static void xs_pktdump(char *msg, u32 *packet, unsigned int count)
+{
+ u8 *buf = (u8 *) packet;
+ int j;
+
+ dprintk("RPC: %s\n", msg);
+ for (j = 0; j < count && j < 128; j += 4) {
+ if (!(j & 31)) {
+ if (j)
+ dprintk("\n");
+ dprintk("0x%04x ", j);
+ }
+ dprintk("%02x%02x%02x%02x ",
+ buf[j], buf[j+1], buf[j+2], buf[j+3]);
+ }
+ dprintk("\n");
+}
+#else
+static inline void xs_pktdump(char *msg, u32 *packet, unsigned int count)
+{
+ /* NOP */
+}
+#endif
+
+struct smb_xprt {
+ struct rpc_xprt xprt;
+
+ /*
+ * Network layer
+ */
+ struct socket * sock;
+ struct sock * inet;
+
+ /*
+ * State of TCP reply receive
+ */
+ __be32 tcp_fraghdr;
+ struct smb_header smb_header;
+
+ u32 tcp_offset,
+ tcp_reclen;
+
+ unsigned long tcp_copied,
+ tcp_flags;
+
+ /*
+ * Connection of transports
+ */
+ struct delayed_work connect_worker;
+ struct sockaddr_storage addr;
+ unsigned short port;
+
+ /*
+ * Saved socket callback addresses
+ */
+ void (*old_data_ready)(struct sock *, int);
+ void (*old_state_change)(struct sock *);
+ void (*old_write_space)(struct sock *);
+ void (*old_error_report)(struct sock *);
+};
+
+/*
+ * TCP receive state flags
+ */
+#define TCP_RCV_COPY_FRAGHDR (1UL << 0)
+#define TCP_RCV_READ_SMBHDR (1UL << 1)
+#define TCP_RCV_COPY_SMBHDR (1UL << 2)
+#define TCP_RCV_COPY_DATA (1UL << 3)
+
+static inline struct sockaddr *xs_addr(struct rpc_xprt *xprt)
+{
+ return (struct sockaddr *) &xprt->addr;
+}
+
+static inline struct sockaddr_in *xs_addr_in(struct rpc_xprt *xprt)
+{
+ return (struct sockaddr_in *) &xprt->addr;
+}
+
+static inline struct sockaddr_in6 *xs_addr_in6(struct rpc_xprt *xprt)
+{
+ return (struct sockaddr_in6 *) &xprt->addr;
+}
+
+static void xs_format_common_peer_addresses(struct rpc_xprt *xprt)
+{
+ struct sockaddr *sap = xs_addr(xprt);
+ struct sockaddr_in6 *sin6;
+ struct sockaddr_in *sin;
+ char buf[128];
+
+ (void)rpc_ntop(sap, buf, sizeof(buf));
+ xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+ switch (sap->sa_family) {
+ case AF_INET:
+ sin = xs_addr_in(xprt);
+ (void)snprintf(buf, sizeof(buf), "%02x%02x%02x%02x",
+ NIPQUAD(sin->sin_addr.s_addr));
+ break;
+ case AF_INET6:
+ sin6 = xs_addr_in6(xprt);
+ (void)snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
+ break;
+ default:
+ BUG();
+ }
+ xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+}
+
+static void xs_format_common_peer_ports(struct rpc_xprt *xprt)
+{
+ struct sockaddr *sap = xs_addr(xprt);
+ char buf[128];
+
+ (void)snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
+ xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
+
+ (void)snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
+ xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
+}
+
+static void xs_format_peer_addresses(struct rpc_xprt *xprt,
+ const char *protocol,
+ const char *netid)
+{
+ xprt->address_strings[RPC_DISPLAY_PROTO] = protocol;
+ xprt->address_strings[RPC_DISPLAY_NETID] = netid;
+ xs_format_common_peer_addresses(xprt);
+ xs_format_common_peer_ports(xprt);
+}
+
+static void xs_free_peer_addresses(struct rpc_xprt *xprt)
+{
+ unsigned int i;
+
+ for (i = 0; i < RPC_DISPLAY_MAX; i++)
+ switch (i) {
+ case RPC_DISPLAY_PROTO:
+ case RPC_DISPLAY_NETID:
+ continue;
+ default:
+ kfree(xprt->address_strings[i]);
+ }
+}
+
+#define XS_SENDMSG_FLAGS (MSG_DONTWAIT | MSG_NOSIGNAL)
+
+static int xs_send_kvec(struct socket *sock, struct sockaddr *addr, int addrlen, struct kvec *vec, unsigned int base, int more)
+{
+ struct msghdr msg = {
+ .msg_name = addr,
+ .msg_namelen = addrlen,
+ .msg_flags = XS_SENDMSG_FLAGS | (more ? MSG_MORE : 0),
+ };
+ struct kvec iov = {
+ .iov_base = vec->iov_base + base,
+ .iov_len = vec->iov_len - base,
+ };
+
+ if (iov.iov_len != 0)
+ return kernel_sendmsg(sock, &msg, &iov, 1, iov.iov_len);
+ return kernel_sendmsg(sock, &msg, NULL, 0, 0);
+}
+
+static int xs_send_pagedata(struct socket *sock, struct xdr_buf *xdr, unsigned int base, int more)
+{
+ struct page **ppage;
+ unsigned int remainder;
+ int err, sent = 0;
+
+ remainder = xdr->page_len - base;
+ base += xdr->page_base;
+ ppage = xdr->pages + (base >> PAGE_SHIFT);
+ base &= ~PAGE_MASK;
+ for(;;) {
+ unsigned int len = min_t(unsigned int, PAGE_SIZE - base, remainder);
+ int flags = XS_SENDMSG_FLAGS;
+
+ remainder -= len;
+ if (remainder != 0 || more)
+ flags |= MSG_MORE;
+ err = sock->ops->sendpage(sock, *ppage, base, len, flags);
+ if (remainder == 0 || err != len)
+ break;
+ sent += err;
+ ppage++;
+ base = 0;
+ }
+ if (sent == 0)
+ return err;
+ if (err > 0)
+ sent += err;
+ return sent;
+}
+
+/**
+ * xs_sendpages - write pages directly to a socket
+ * @sock: socket to send on
+ * @addr: UDP only -- address of destination
+ * @addrlen: UDP only -- length of destination address
+ * @xdr: buffer containing this request
+ * @base: starting position in the buffer
+ *
+ */
+static int xs_sendpages(struct socket *sock, struct sockaddr *addr, int addrlen, struct xdr_buf *xdr, unsigned int base)
+{
+ unsigned int remainder = xdr->len - base;
+ int err, sent = 0;
+
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+
+ clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+ if (base != 0) {
+ addr = NULL;
+ addrlen = 0;
+ }
+
+ if (base < xdr->head[0].iov_len || addr != NULL) {
+ unsigned int len = xdr->head[0].iov_len - base;
+ remainder -= len;
+ err = xs_send_kvec(sock, addr, addrlen, &xdr->head[0], base, remainder != 0);
+ if (remainder == 0 || err != len)
+ goto out;
+ sent += err;
+ base = 0;
+ } else
+ base -= xdr->head[0].iov_len;
+
+ if (base < xdr->page_len) {
+ unsigned int len = xdr->page_len - base;
+ remainder -= len;
+ err = xs_send_pagedata(sock, xdr, base, remainder != 0);
+ if (remainder == 0 || err != len)
+ goto out;
+ sent += err;
+ base = 0;
+ } else
+ base -= xdr->page_len;
+
+ if (base >= xdr->tail[0].iov_len)
+ return sent;
+ err = xs_send_kvec(sock, NULL, 0, &xdr->tail[0], base, 0);
+out:
+ if (sent == 0)
+ return err;
+ if (err > 0)
+ sent += err;
+ return sent;
+}
+
+static void xs_nospace_callback(struct rpc_task *task)
+{
+ struct smb_xprt *transport = container_of(task->tk_rqstp->rq_xprt, struct smb_xprt, xprt);
+
+ transport->inet->sk_write_pending--;
+ clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
+}
+
+/**
+ * xs_nospace - place task on wait queue if transmit was incomplete
+ * @task: task to put to sleep
+ *
+ */
+static int xs_nospace(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ struct rpc_xprt *xprt = req->rq_xprt;
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ int ret = 0;
+
+ dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
+ task->tk_pid, req->rq_slen - req->rq_bytes_sent,
+ req->rq_slen);
+
+ /* Protect against races with write_space */
+ spin_lock_bh(&xprt->transport_lock);
+
+ /* Don't race with disconnect */
+ if (xprt_connected(xprt)) {
+ if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
+ ret = -EAGAIN;
+ /*
+ * Notify TCP that we're limited by the application
+ * window size
+ */
+ set_bit(SOCK_NOSPACE, &transport->sock->flags);
+ transport->inet->sk_write_pending++;
+ /* ...and wait for more buffer space */
+ xprt_wait_for_buffer_space(task, xs_nospace_callback);
+ }
+ } else {
+ clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
+ ret = -ENOTCONN;
+ }
+
+ spin_unlock_bh(&xprt->transport_lock);
+ return ret;
+}
+
+/**
+ * xs_tcp_shutdown - gracefully shut down a TCP socket
+ * @xprt: transport
+ *
+ * Initiates a graceful shutdown of the TCP socket by calling the
+ * equivalent of shutdown(SHUT_WR);
+ */
+static void xs_tcp_shutdown(struct rpc_xprt *xprt)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ struct socket *sock = transport->sock;
+
+ if (sock != NULL)
+ kernel_sock_shutdown(sock, SHUT_WR);
+}
+
+static inline void xs_encode_smb_record_marker(struct xdr_buf *buf)
+{
+ u32 reclen = buf->len - sizeof(rpc_fraghdr);
+ rpc_fraghdr *base = buf->head[0].iov_base;
+ *base = htonl(SMB_FRAGMENT_SIZE_MASK & reclen);
+}
+
+/**
+ * xs_smb_send_request - write an RPC request to a TCP socket
+ * @task: address of RPC task that manages the state of an RPC request
+ *
+ * Return values:
+ * 0: The request has been sent
+ * EAGAIN: The socket was blocked, please call again later to
+ * complete the request
+ * ENOTCONN: Caller needs to invoke connect logic then call again
+ * other: Some other error occured, the request was not sent
+ *
+ * XXX: In the case of soft timeouts, should we eventually give up
+ * if sendmsg is not able to make progress?
+ */
+static int xs_smb_send_request(struct rpc_task *task)
+{
+ struct rpc_rqst *req = task->tk_rqstp;
+ struct rpc_xprt *xprt = req->rq_xprt;
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ struct xdr_buf *xdr = &req->rq_snd_buf;
+ int status;
+
+ xs_encode_smb_record_marker(&req->rq_snd_buf);
+
+ xs_pktdump("packet data:",
+ req->rq_svec->iov_base,
+ req->rq_svec->iov_len);
+
+ /* Continue transmitting the packet/record. We must be careful
+ * to cope with writespace callbacks arriving _after_ we have
+ * called sendmsg(). */
+ while (1) {
+ status = xs_sendpages(transport->sock,
+ NULL, 0, xdr, req->rq_bytes_sent);
+
+ dprintk("RPC: %s(%u) = %d\n", __func__,
+ xdr->len - req->rq_bytes_sent, status);
+
+ if (unlikely(status < 0))
+ break;
+
+ /* If we've sent the entire packet, immediately
+ * reset the count of bytes sent. */
+ req->rq_bytes_sent += status;
+ task->tk_bytes_sent += status;
+ if (likely(req->rq_bytes_sent >= req->rq_slen)) {
+ req->rq_bytes_sent = 0;
+ return 0;
+ }
+
+ if (status != 0)
+ continue;
+ status = -EAGAIN;
+ break;
+ }
+ if (!transport->sock)
+ goto out;
+
+ switch (status) {
+ case -ENOTSOCK:
+ status = -ENOTCONN;
+ /* Should we call xs_close() here? */
+ break;
+ case -EAGAIN:
+ status = xs_nospace(task);
+ break;
+ default:
+ dprintk("RPC: sendmsg returned unrecognized error %d\n",
+ -status);
+ case -ECONNRESET:
+ case -EPIPE:
+ xs_tcp_shutdown(xprt);
+ case -ECONNREFUSED:
+ case -ENOTCONN:
+ clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
+ }
+out:
+ return status;
+}
+
+/**
+ * xs_tcp_release_xprt - clean up after a tcp transmission
+ * @xprt: transport
+ * @task: rpc task
+ *
+ * This cleans up if an error causes us to abort the transmission of a request.
+ * In this case, the socket may need to be reset in order to avoid confusing
+ * the server.
+ */
+static void xs_tcp_release_xprt(struct rpc_xprt *xprt, struct rpc_task *task)
+{
+ struct rpc_rqst *req;
+
+ if (task != xprt->snd_task)
+ return;
+ if (task == NULL)
+ goto out_release;
+ req = task->tk_rqstp;
+ if (req->rq_bytes_sent == 0)
+ goto out_release;
+ if (req->rq_bytes_sent == req->rq_snd_buf.len)
+ goto out_release;
+ set_bit(XPRT_CLOSE_WAIT, &task->tk_xprt->state);
+out_release:
+ xprt_release_xprt(xprt, task);
+}
+
+static void xs_save_old_callbacks(struct smb_xprt *transport, struct sock *sk)
+{
+ transport->old_data_ready = sk->sk_data_ready;
+ transport->old_state_change = sk->sk_state_change;
+ transport->old_write_space = sk->sk_write_space;
+ transport->old_error_report = sk->sk_error_report;
+}
+
+static void xs_restore_old_callbacks(struct smb_xprt *transport, struct sock *sk)
+{
+ sk->sk_data_ready = transport->old_data_ready;
+ sk->sk_state_change = transport->old_state_change;
+ sk->sk_write_space = transport->old_write_space;
+ sk->sk_error_report = transport->old_error_report;
+}
+
+static void xs_reset_transport(struct smb_xprt *transport)
+{
+ struct socket *sock = transport->sock;
+ struct sock *sk = transport->inet;
+
+ if (sk == NULL)
+ return;
+
+ write_lock_bh(&sk->sk_callback_lock);
+ transport->inet = NULL;
+ transport->sock = NULL;
+
+ sk->sk_user_data = NULL;
+
+ xs_restore_old_callbacks(transport, sk);
+ write_unlock_bh(&sk->sk_callback_lock);
+
+ sk->sk_no_check = 0;
+
+ sock_release(sock);
+}
+
+/**
+ * xs_close - close a socket
+ * @xprt: transport
+ *
+ * This is used when all requests are complete; ie, no DRC state remains
+ * on the server we want to save.
+ *
+ * The caller _must_ be holding XPRT_LOCKED in order to avoid issues with
+ * xs_reset_transport() zeroing the socket from underneath a writer.
+ */
+static void xs_close(struct rpc_xprt *xprt)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+
+ dprintk("RPC: xs_close xprt %p\n", xprt);
+
+ xs_reset_transport(transport);
+
+ smp_mb__before_clear_bit();
+ clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
+ clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ clear_bit(XPRT_CLOSING, &xprt->state);
+ smp_mb__after_clear_bit();
+ xprt_disconnect_done(xprt);
+}
+
+static void xs_tcp_close(struct rpc_xprt *xprt)
+{
+ if (test_and_clear_bit(XPRT_CONNECTION_CLOSE, &xprt->state))
+ xs_close(xprt);
+ else
+ xs_tcp_shutdown(xprt);
+}
+
+/**
+ * xs_destroy - prepare to shutdown a transport
+ * @xprt: doomed transport
+ *
+ */
+static void xs_destroy(struct rpc_xprt *xprt)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+
+ dprintk("RPC: xs_destroy xprt %p\n", xprt);
+
+ cancel_rearming_delayed_work(&transport->connect_worker);
+
+ xs_close(xprt);
+ xs_free_peer_addresses(xprt);
+ kfree(xprt->slot);
+ kfree(xprt);
+ module_put(THIS_MODULE);
+}
+
+static inline struct rpc_xprt *xprt_from_sock(struct sock *sk)
+{
+ return (struct rpc_xprt *) sk->sk_user_data;
+}
+
+static inline void xs_smb_read_fraghdr(struct rpc_xprt *xprt, struct xdr_skb_reader *desc)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ size_t len, used;
+ char *p;
+
+ p = ((char *) &transport->tcp_fraghdr) + transport->tcp_offset;
+ len = sizeof(transport->tcp_fraghdr) - transport->tcp_offset;
+ used = xdr_skb_read_bits(desc, p, len);
+ transport->tcp_offset += used;
+ if (used != len)
+ return;
+
+ transport->tcp_reclen = ntohl(transport->tcp_fraghdr);
+ if (transport->tcp_reclen & ~SMB_FRAGMENT_SIZE_MASK) {
+ dprintk("SMB: top byte of fragheader invalid\n");
+ xprt_force_disconnect(xprt);
+ return;
+ }
+
+ transport->tcp_reclen &= SMB_FRAGMENT_SIZE_MASK;
+
+ transport->tcp_flags = TCP_RCV_READ_SMBHDR;
+ transport->tcp_offset = 0;
+
+ /* Sanity check of the record length */
+ if (unlikely(transport->tcp_reclen < sizeof(struct smb_header))) {
+ dprintk("RPC: invalid SMB record fragment length\n");
+ xprt_force_disconnect(xprt);
+ return;
+ }
+ dprintk("RPC: reading SMB record fragment of length %d\n",
+ transport->tcp_reclen);
+}
+
+static void xs_smb_check_fraghdr(struct smb_xprt *transport)
+{
+ if (transport->tcp_offset == transport->tcp_reclen) {
+ transport->tcp_flags = TCP_RCV_COPY_FRAGHDR;
+ transport->tcp_offset = 0;
+ transport->tcp_copied = 0;
+ }
+}
+
+static inline void xs_smb_read_smbhdr(struct smb_xprt *transport, struct xdr_skb_reader *desc)
+{
+ size_t len, used;
+ char *p;
+
+ len = sizeof(transport->smb_header) - transport->tcp_offset;
+ dprintk("SMB: reading SMB header (%Zu bytes)\n", len);
+ p = ((char *) &transport->smb_header) + transport->tcp_offset;
+ used = xdr_skb_read_bits(desc, p, len);
+ transport->tcp_offset += used;
+ if (used != len)
+ return;
+ transport->tcp_flags = TCP_RCV_COPY_SMBHDR | TCP_RCV_COPY_DATA;
+
+ if (transport->smb_header.flags & SMBFLG_RESPONSE)
+ dprintk("SMB: reading %s MID 0x%04x\n",
+ (transport->smb_header.flags & SMBFLG_RESPONSE) ?
+ "reply for" : "request with",
+ le16_to_cpu(transport->smb_header.mid));
+
+ xs_smb_check_fraghdr(transport);
+}
+
+static inline void xs_smb_read_common(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc,
+ struct rpc_rqst *req)
+{
+ struct smb_xprt *transport =
+ container_of(xprt, struct smb_xprt, xprt);
+ struct xdr_buf *rcvbuf;
+ size_t len;
+ ssize_t r;
+
+ rcvbuf = &req->rq_private_buf;
+
+ /* copy SMB header into buffer */
+ if (transport->tcp_flags & TCP_RCV_COPY_SMBHDR) {
+ memcpy(rcvbuf->head[0].iov_base, &transport->smb_header,
+ sizeof(transport->smb_header));
+ transport->tcp_copied += sizeof(transport->smb_header);
+ transport->tcp_flags &= ~TCP_RCV_COPY_SMBHDR;
+ }
+
+ len = desc->count;
+ if (len > transport->tcp_reclen - transport->tcp_offset) {
+ struct xdr_skb_reader my_desc;
+
+ len = transport->tcp_reclen - transport->tcp_offset;
+ memcpy(&my_desc, desc, sizeof(my_desc));
+ my_desc.count = len;
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ &my_desc, xdr_skb_read_bits);
+ desc->count -= r;
+ desc->offset += r;
+ } else
+ r = xdr_partial_copy_from_skb(rcvbuf, transport->tcp_copied,
+ desc, xdr_skb_read_bits);
+
+ if (r > 0) {
+ transport->tcp_copied += r;
+ transport->tcp_offset += r;
+ }
+ if (r != len) {
+ /* Error when copying to the receive buffer,
+ * usually because we weren't able to allocate
+ * additional buffer pages. All we can do now
+ * is turn off TCP_RCV_COPY_DATA, so the request
+ * will not receive any additional updates,
+ * and time out.
+ * Any remaining data from this record will
+ * be discarded.
+ */
+ transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ dprintk("RPC: XID %08x truncated request\n",
+ ntohl(transport->smb_header.mid));
+ dprintk("RPC: xprt = %p, tcp_copied = %lu, "
+ "tcp_offset = %u, tcp_reclen = %u\n",
+ xprt, transport->tcp_copied,
+ transport->tcp_offset, transport->tcp_reclen);
+ return;
+ }
+
+ dprintk("RPC: XID %08x read %Zd bytes\n",
+ ntohl(transport->smb_header.mid), r);
+ dprintk("RPC: xprt = %p, tcp_copied = %lu, tcp_offset = %u, "
+ "tcp_reclen = %u\n", xprt, transport->tcp_copied,
+ transport->tcp_offset, transport->tcp_reclen);
+
+ if (transport->tcp_copied == req->rq_private_buf.buflen ||
+ transport->tcp_offset == transport->tcp_reclen)
+ transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+
+ return;
+}
+
+/*
+ * Finds the request corresponding to the RPC xid and invokes the common
+ * tcp read code to read the data.
+ */
+static inline int xs_smb_read_reply(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
+{
+ struct smb_xprt *transport =
+ container_of(xprt, struct smb_xprt, xprt);
+ struct rpc_rqst *req;
+
+ dprintk("RPC: read reply XID %08x\n", ntohl(transport->smb_header.mid));
+
+ /* Find and lock the request corresponding to this xid */
+ spin_lock(&xprt->transport_lock);
+ req = xprt_lookup_rqst(xprt, transport->smb_header.mid);
+ if (!req) {
+ dprintk("RPC: XID %08x request not found!\n",
+ ntohl(transport->smb_header.mid));
+ spin_unlock(&xprt->transport_lock);
+ return -1;
+ }
+
+ xs_smb_read_common(xprt, desc, req);
+
+ if (!(transport->tcp_flags & TCP_RCV_COPY_DATA))
+ xprt_complete_rqst(req->rq_task, transport->tcp_copied);
+
+ spin_unlock(&xprt->transport_lock);
+ return 0;
+}
+
+#if defined(CONFIG_NFS_V4_1)
+/*
+ * Obtains an rpc_rqst previously allocated and invokes the common
+ * tcp read code to read the data. The result is placed in the callback
+ * queue.
+ * If we're unable to obtain the rpc_rqst we schedule the closing of the
+ * connection and return -1.
+ */
+static inline int xs_smb_read_callback(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
+{
+ struct smb_xprt *transport =
+ container_of(xprt, struct smb_xprt, xprt);
+ struct rpc_rqst *req;
+
+ req = xprt_alloc_bc_request(xprt);
+ if (req == NULL) {
+ printk(KERN_WARNING "Callback slot table overflowed\n");
+ xprt_force_disconnect(xprt);
+ return -1;
+ }
+
+ req->rq_xid = transport->smb_header.mid;
+ dprintk("RPC: read callback XID %08x\n", ntohl(req->rq_xid));
+ xs_smb_read_common(xprt, desc, req);
+
+ if (!(transport->tcp_flags & TCP_RCV_COPY_DATA)) {
+ struct svc_serv *bc_serv = xprt->bc_serv;
+
+ /*
+ * Add callback request to callback list. The callback
+ * service sleeps on the sv_cb_waitq waiting for new
+ * requests. Wake it up after adding enqueing the
+ * request.
+ */
+ dprintk("RPC: add callback request to list\n");
+ spin_lock(&bc_serv->sv_cb_lock);
+ list_add(&req->rq_bc_list, &bc_serv->sv_cb_list);
+ spin_unlock(&bc_serv->sv_cb_lock);
+ wake_up(&bc_serv->sv_cb_waitq);
+ }
+
+ req->rq_private_buf.len = transport->tcp_copied;
+
+ return 0;
+}
+
+static inline int _xs_smb_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
+{
+ struct smb_xprt *transport =
+ container_of(xprt, struct smb_xprt, xprt);
+
+ return (transport->smb_header.flags & SMBFLG_RESPONSE) ?
+ xs_smb_read_reply(xprt, desc) :
+ xs_smb_read_callback(xprt, desc);
+}
+#else
+static inline int _xs_smb_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
+{
+ return xs_smb_read_reply(xprt, desc);
+}
+#endif /* CONFIG_NFS_V4_1 */
+
+/*
+ * Read data off the transport. This can be either an RPC_CALL or an
+ * RPC_REPLY. Relay the processing to helper functions.
+ */
+static void xs_smb_read_data(struct rpc_xprt *xprt,
+ struct xdr_skb_reader *desc)
+{
+ struct smb_xprt *transport =
+ container_of(xprt, struct smb_xprt, xprt);
+
+ if (_xs_smb_read_data(xprt, desc) == 0)
+ xs_smb_check_fraghdr(transport);
+ else {
+ /*
+ * The transport_lock protects the request handling.
+ * There's no need to hold it to update the tcp_flags.
+ */
+ transport->tcp_flags &= ~TCP_RCV_COPY_DATA;
+ }
+}
+
+static inline void xs_smb_read_discard(struct smb_xprt *transport, struct xdr_skb_reader *desc)
+{
+ size_t len;
+
+ len = transport->tcp_reclen - transport->tcp_offset;
+ if (len > desc->count)
+ len = desc->count;
+ desc->count -= len;
+ desc->offset += len;
+ transport->tcp_offset += len;
+ dprintk("RPC: discarded %Zu bytes\n", len);
+ xs_smb_check_fraghdr(transport);
+}
+
+static int xs_smb_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len)
+{
+ struct rpc_xprt *xprt = rd_desc->arg.data;
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ struct xdr_skb_reader desc = {
+ .skb = skb,
+ .offset = offset,
+ .count = len,
+ };
+
+ dprintk("RPC: xs_tcp_data_recv started\n");
+ do {
+ dprintk("SMB: %s tcp_flags=0x%lx\n", __func__,
+ transport->tcp_flags);
+ /* Read in a new fragment marker if necessary */
+ /* Can we ever really expect to get completely empty fragments? */
+ if (transport->tcp_flags & TCP_RCV_COPY_FRAGHDR) {
+ xs_smb_read_fraghdr(xprt, &desc);
+ continue;
+ }
+ /* Read in the xid if necessary */
+ if (transport->tcp_flags & TCP_RCV_READ_SMBHDR) {
+ xs_smb_read_smbhdr(transport, &desc);
+ continue;
+ }
+ /* Read in the request data */
+ if (transport->tcp_flags & TCP_RCV_COPY_DATA) {
+ xs_smb_read_data(xprt, &desc);
+ continue;
+ }
+ /* Skip over any trailing bytes on short reads */
+ xs_smb_read_discard(transport, &desc);
+ } while (desc.count);
+ dprintk("RPC: xs_tcp_data_recv done\n");
+ return len - desc.count;
+}
+
+/**
+ * xs_smb_data_ready - "data ready" callback for TCP sockets
+ * @sk: socket with data to read
+ * @bytes: how much data to read
+ *
+ */
+static void xs_smb_data_ready(struct sock *sk, int bytes)
+{
+ struct rpc_xprt *xprt;
+ read_descriptor_t rd_desc;
+ int read;
+
+ dprintk("SMB: xs_smb_data_ready...\n");
+
+ read_lock(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+ if (xprt->shutdown)
+ goto out;
+
+ /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
+ rd_desc.arg.data = xprt;
+ do {
+ rd_desc.count = 65536;
+ read = tcp_read_sock(sk, &rd_desc, xs_smb_data_recv);
+ } while (read > 0);
+out:
+ read_unlock(&sk->sk_callback_lock);
+}
+
+/*
+ * Do the equivalent of linger/linger2 handling for dealing with
+ * broken servers that don't close the socket in a timely
+ * fashion
+ */
+static void xs_tcp_schedule_linger_timeout(struct rpc_xprt *xprt,
+ unsigned long timeout)
+{
+ struct smb_xprt *transport;
+
+ if (xprt_test_and_set_connecting(xprt))
+ return;
+ set_bit(XPRT_CONNECTION_ABORT, &xprt->state);
+ transport = container_of(xprt, struct smb_xprt, xprt);
+ queue_delayed_work(rpciod_workqueue, &transport->connect_worker,
+ timeout);
+}
+
+static void xs_tcp_cancel_linger_timeout(struct rpc_xprt *xprt)
+{
+ struct smb_xprt *transport;
+
+ transport = container_of(xprt, struct smb_xprt, xprt);
+
+ if (!test_bit(XPRT_CONNECTION_ABORT, &xprt->state) ||
+ !cancel_delayed_work(&transport->connect_worker))
+ return;
+ clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
+ xprt_clear_connecting(xprt);
+}
+
+static void xs_sock_mark_closed(struct rpc_xprt *xprt)
+{
+ smp_mb__before_clear_bit();
+ clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ clear_bit(XPRT_CLOSING, &xprt->state);
+ smp_mb__after_clear_bit();
+ /* Mark transport as closed and wake up all pending tasks */
+ xprt_disconnect_done(xprt);
+}
+
+/**
+ * xs_smb_state_change - callback to handle TCP socket state changes
+ * @sk: socket whose state has changed
+ *
+ */
+static void xs_smb_state_change(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+
+ read_lock(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+ dprintk("RPC: xs_smb_state_change client %p...\n", xprt);
+ dprintk("RPC: state %x conn %d dead %d zapped %d\n",
+ sk->sk_state, xprt_connected(xprt),
+ sock_flag(sk, SOCK_DEAD),
+ sock_flag(sk, SOCK_ZAPPED));
+
+ switch (sk->sk_state) {
+ case TCP_ESTABLISHED:
+ spin_lock_bh(&xprt->transport_lock);
+ if (!xprt_test_and_set_connected(xprt)) {
+ struct smb_xprt *transport = container_of(xprt,
+ struct smb_xprt, xprt);
+
+ /* Reset TCP record info */
+ transport->tcp_offset = 0;
+ transport->tcp_reclen = 0;
+ transport->tcp_copied = 0;
+ transport->tcp_flags = TCP_RCV_COPY_FRAGHDR;
+
+ xprt_wake_pending_tasks(xprt, -EAGAIN);
+ }
+ spin_unlock_bh(&xprt->transport_lock);
+ break;
+ case TCP_FIN_WAIT1:
+ /* The client initiated a shutdown of the socket */
+ xprt->connect_cookie++;
+ xprt->reestablish_timeout = 0;
+ set_bit(XPRT_CLOSING, &xprt->state);
+ smp_mb__before_clear_bit();
+ clear_bit(XPRT_CONNECTED, &xprt->state);
+ clear_bit(XPRT_CLOSE_WAIT, &xprt->state);
+ smp_mb__after_clear_bit();
+ xs_tcp_schedule_linger_timeout(xprt, xs_smb_fin_timeout);
+ break;
+ case TCP_CLOSE_WAIT:
+ /* The server initiated a shutdown of the socket */
+ xprt_force_disconnect(xprt);
+ case TCP_SYN_SENT:
+ xprt->connect_cookie++;
+ case TCP_CLOSING:
+ /*
+ * If the server closed down the connection, make sure that
+ * we back off before reconnecting
+ */
+ if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ break;
+ case TCP_LAST_ACK:
+ set_bit(XPRT_CLOSING, &xprt->state);
+ xs_tcp_schedule_linger_timeout(xprt, xs_smb_fin_timeout);
+ smp_mb__before_clear_bit();
+ clear_bit(XPRT_CONNECTED, &xprt->state);
+ smp_mb__after_clear_bit();
+ break;
+ case TCP_CLOSE:
+ xs_tcp_cancel_linger_timeout(xprt);
+ xs_sock_mark_closed(xprt);
+ }
+ out:
+ read_unlock(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_error_report - callback mainly for catching socket errors
+ * @sk: socket
+ */
+static void xs_error_report(struct sock *sk)
+{
+ struct rpc_xprt *xprt;
+
+ read_lock(&sk->sk_callback_lock);
+ if (!(xprt = xprt_from_sock(sk)))
+ goto out;
+ dprintk("RPC: %s client %p...\n"
+ "RPC: error %d\n",
+ __func__, xprt, sk->sk_err);
+ xprt_wake_pending_tasks(xprt, -EAGAIN);
+out:
+ read_unlock(&sk->sk_callback_lock);
+}
+
+static void xs_write_space(struct sock *sk)
+{
+ struct socket *sock;
+ struct rpc_xprt *xprt;
+
+ if (unlikely(!(sock = sk->sk_socket)))
+ return;
+ clear_bit(SOCK_NOSPACE, &sock->flags);
+
+ if (unlikely(!(xprt = xprt_from_sock(sk))))
+ return;
+ if (test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags) == 0)
+ return;
+
+ xprt_write_space(xprt);
+}
+
+/**
+ * xs_tcp_write_space - callback invoked when socket buffer space
+ * becomes available
+ * @sk: socket whose state has changed
+ *
+ * Called when more output buffer space is available for this socket.
+ * We try not to wake our writers until they can make "significant"
+ * progress, otherwise we'll waste resources thrashing kernel_sendmsg
+ * with a bunch of small requests.
+ */
+static void xs_tcp_write_space(struct sock *sk)
+{
+ read_lock(&sk->sk_callback_lock);
+
+ /* from net/core/stream.c:sk_stream_write_space */
+ if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
+ xs_write_space(sk);
+
+ read_unlock(&sk->sk_callback_lock);
+}
+
+/**
+ * xs_smb_set_port - reset the port number in the remote endpoint address
+ * @xprt: generic transport
+ * @port: new port number
+ *
+ * Should never be needed for SMB transport so just BUG() here.
+ */
+static void
+xs_smb_set_port(struct rpc_xprt *xprt, unsigned short port)
+{
+ dprintk("RPC: setting port for xprt %p to %u\n", xprt, port);
+ BUG();
+}
+
+/**
+ * xs_smb_rpcbind - attempt to rebind the transport
+ * @task: task
+ *
+ * Should never be needed for SMB transport so just BUG() here.
+ */
+static void
+xs_smb_rpcbind(struct rpc_task *task)
+{
+ dprintk("RPC: rpcbind called on SMB task %p", task);
+ BUG();
+}
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+static struct lock_class_key xs_key[2];
+static struct lock_class_key xs_slock_key[2];
+
+static inline void xs_reclassify_socket4(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ BUG_ON(sock_owned_by_user(sk));
+ sock_lock_init_class_and_name(sk, "slock-AF_INET-RPC",
+ &xs_slock_key[0], "sk_lock-AF_INET-RPC", &xs_key[0]);
+}
+
+static inline void xs_reclassify_socket6(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ BUG_ON(sock_owned_by_user(sk));
+ sock_lock_init_class_and_name(sk, "slock-AF_INET6-RPC",
+ &xs_slock_key[1], "sk_lock-AF_INET6-RPC", &xs_key[1]);
+}
+#else
+static inline void xs_reclassify_socket4(struct socket *sock)
+{
+}
+
+static inline void xs_reclassify_socket6(struct socket *sock)
+{
+}
+#endif
+
+/*
+ * We need to preserve the port number so the reply cache on the server can
+ * find our cached RPC replies when we get around to reconnecting.
+ */
+static void xs_abort_connection(struct rpc_xprt *xprt, struct smb_xprt *transport)
+{
+ int result;
+ struct sockaddr any;
+
+ dprintk("RPC: disconnecting xprt %p to reuse port\n", xprt);
+
+ /*
+ * Disconnect the transport socket by doing a connect operation
+ * with AF_UNSPEC. This should return immediately...
+ */
+ memset(&any, 0, sizeof(any));
+ any.sa_family = AF_UNSPEC;
+ result = kernel_connect(transport->sock, &any, sizeof(any), 0);
+ if (!result)
+ xs_sock_mark_closed(xprt);
+ else
+ dprintk("RPC: AF_UNSPEC connect return code %d\n",
+ result);
+}
+
+static void xs_tcp_reuse_connection(struct rpc_xprt *xprt, struct smb_xprt *transport)
+{
+ unsigned int state = transport->inet->sk_state;
+
+ if (state == TCP_CLOSE && transport->sock->state == SS_UNCONNECTED)
+ return;
+ if ((1 << state) & (TCPF_ESTABLISHED|TCPF_SYN_SENT))
+ return;
+ xs_abort_connection(xprt, transport);
+}
+
+static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+
+ if (!transport->inet) {
+ struct sock *sk = sock->sk;
+
+ write_lock_bh(&sk->sk_callback_lock);
+
+ xs_save_old_callbacks(transport, sk);
+
+ sk->sk_user_data = xprt;
+ sk->sk_data_ready = xs_smb_data_ready;
+ sk->sk_state_change = xs_smb_state_change;
+ sk->sk_write_space = xs_tcp_write_space;
+ sk->sk_error_report = xs_error_report;
+ sk->sk_allocation = GFP_ATOMIC;
+
+ /* socket options */
+ sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
+ sock_reset_flag(sk, SOCK_LINGER);
+ tcp_sk(sk)->linger2 = 0;
+ tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
+
+ xprt_clear_connected(xprt);
+
+ /* Reset to new socket */
+ transport->sock = sock;
+ transport->inet = sk;
+
+ write_unlock_bh(&sk->sk_callback_lock);
+ }
+
+ if (!xprt_bound(xprt))
+ return -ENOTCONN;
+
+ /* Tell the socket layer to start connecting... */
+ xprt->stat.connect_count++;
+ xprt->stat.connect_start = jiffies;
+ return kernel_connect(sock, xs_addr(xprt), xprt->addrlen, O_NONBLOCK);
+}
+
+/**
+ * xs_tcp_setup_socket - create a TCP socket and connect to a remote endpoint
+ * @xprt: RPC transport to connect
+ * @transport: socket transport to connect
+ * @create_sock: function to create a socket of the correct type
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_tcp_setup_socket(struct rpc_xprt *xprt,
+ struct smb_xprt *transport,
+ struct socket *(*create_sock)(struct rpc_xprt *,
+ struct smb_xprt *))
+{
+ struct socket *sock = transport->sock;
+ int status = -EIO;
+
+ if (xprt->shutdown)
+ goto out;
+
+ if (!sock) {
+ clear_bit(XPRT_CONNECTION_ABORT, &xprt->state);
+ sock = create_sock(xprt, transport);
+ if (IS_ERR(sock)) {
+ status = PTR_ERR(sock);
+ goto out;
+ }
+ } else {
+ int abort_and_exit;
+
+ abort_and_exit = test_and_clear_bit(XPRT_CONNECTION_ABORT,
+ &xprt->state);
+ /* "close" the socket, preserving the local port */
+ xs_tcp_reuse_connection(xprt, transport);
+
+ if (abort_and_exit)
+ goto out_eagain;
+ }
+
+ dprintk("SMB: worker connecting xprt %p via %s to "
+ "%s (port %s)\n", xprt,
+ xprt->address_strings[RPC_DISPLAY_PROTO],
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT]);
+
+ status = xs_tcp_finish_connecting(xprt, sock);
+ dprintk("SMB: %p connect status %d connected %d sock state %d\n",
+ xprt, -status, xprt_connected(xprt),
+ sock->sk->sk_state);
+ switch (status) {
+ default:
+ printk("%s: connect returned unhandled error %d\n",
+ __func__, status);
+ case -EADDRNOTAVAIL:
+ /* We're probably in TIME_WAIT. Get rid of existing socket,
+ * and retry
+ */
+ set_bit(XPRT_CONNECTION_CLOSE, &xprt->state);
+ xprt_force_disconnect(xprt);
+ break;
+ case -ECONNREFUSED:
+ case -ECONNRESET:
+ case -ENETUNREACH:
+ /* retry with existing socket, after a delay */
+ case 0:
+ case -EINPROGRESS:
+ case -EALREADY:
+ xprt_clear_connecting(xprt);
+ return;
+ }
+out_eagain:
+ status = -EAGAIN;
+out:
+ xprt_clear_connecting(xprt);
+ xprt_wake_pending_tasks(xprt, status);
+}
+
+static struct socket *xs_create_tcp_sock4(struct rpc_xprt *xprt,
+ struct smb_xprt *transport)
+{
+ struct socket *sock;
+ int err;
+
+ /* start from scratch */
+ err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+ if (err < 0) {
+ dprintk("RPC: can't create TCP transport socket (%d).\n",
+ -err);
+ goto out_err;
+ }
+ xs_reclassify_socket4(sock);
+ return sock;
+out_err:
+ return ERR_PTR(-EIO);
+}
+
+/**
+ * xs_tcp_connect_worker4 - connect a TCP socket to a remote endpoint
+ * @work: RPC transport to connect
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_tcp_connect_worker4(struct work_struct *work)
+{
+ struct smb_xprt *transport =
+ container_of(work, struct smb_xprt, connect_worker.work);
+ struct rpc_xprt *xprt = &transport->xprt;
+
+ xs_tcp_setup_socket(xprt, transport, xs_create_tcp_sock4);
+}
+
+static struct socket *xs_create_tcp_sock6(struct rpc_xprt *xprt,
+ struct smb_xprt *transport)
+{
+ struct socket *sock;
+ int err;
+
+ /* start from scratch */
+ err = sock_create_kern(PF_INET6, SOCK_STREAM, IPPROTO_TCP, &sock);
+ if (err < 0) {
+ dprintk("RPC: can't create TCP transport socket (%d).\n",
+ -err);
+ goto out_err;
+ }
+ xs_reclassify_socket6(sock);
+ return sock;
+out_err:
+ return ERR_PTR(-EIO);
+}
+
+/**
+ * xs_tcp_connect_worker6 - connect a TCP socket to a remote endpoint
+ * @work: RPC transport to connect
+ *
+ * Invoked by a work queue tasklet.
+ */
+static void xs_tcp_connect_worker6(struct work_struct *work)
+{
+ struct smb_xprt *transport =
+ container_of(work, struct smb_xprt, connect_worker.work);
+ struct rpc_xprt *xprt = &transport->xprt;
+
+ xs_tcp_setup_socket(xprt, transport, xs_create_tcp_sock6);
+}
+
+/**
+ * xs_connect - connect a socket to a remote endpoint
+ * @task: address of RPC task that manages state of connect request
+ *
+ * TCP: If the remote end dropped the connection, delay reconnecting.
+ *
+ * UDP socket connects are synchronous, but we use a work queue anyway
+ * to guarantee that even unprivileged user processes can set up a
+ * socket on a privileged port.
+ *
+ * If a UDP socket connect fails, the delay behavior here prevents
+ * retry floods (hard mounts).
+ */
+static void xs_connect(struct rpc_task *task)
+{
+ struct rpc_xprt *xprt = task->tk_xprt;
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+
+ if (xprt_test_and_set_connecting(xprt))
+ return;
+
+ if (transport->sock != NULL) {
+ dprintk("RPC: xs_connect delayed xprt %p for %lu "
+ "seconds\n",
+ xprt, xprt->reestablish_timeout / HZ);
+ queue_delayed_work(rpciod_workqueue,
+ &transport->connect_worker,
+ xprt->reestablish_timeout);
+ xprt->reestablish_timeout <<= 1;
+ if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
+ xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;
+ } else {
+ dprintk("RPC: xs_connect scheduled xprt %p\n", xprt);
+ queue_delayed_work(rpciod_workqueue,
+ &transport->connect_worker, 0);
+ }
+}
+
+static void xs_tcp_connect(struct rpc_task *task)
+{
+ struct rpc_xprt *xprt = task->tk_xprt;
+
+ /* Exit if we need to wait for socket shutdown to complete */
+ if (test_bit(XPRT_CLOSING, &xprt->state))
+ return;
+ xs_connect(task);
+}
+
+/**
+ * xs_tcp_print_stats - display TCP socket-specifc stats
+ * @xprt: rpc_xprt struct containing statistics
+ * @seq: output file
+ *
+ */
+static void xs_tcp_print_stats(struct rpc_xprt *xprt, struct seq_file *seq)
+{
+ struct smb_xprt *transport = container_of(xprt, struct smb_xprt, xprt);
+ long idle_time = 0;
+
+ if (xprt_connected(xprt))
+ idle_time = (long)(jiffies - xprt->last_used) / HZ;
+
+ seq_printf(seq, "\txprt:\ttcp %u %lu %lu %lu %ld %lu %lu %lu %Lu %Lu\n",
+ transport->port,
+ xprt->stat.bind_count,
+ xprt->stat.connect_count,
+ xprt->stat.connect_time,
+ idle_time,
+ xprt->stat.sends,
+ xprt->stat.recvs,
+ xprt->stat.bad_xids,
+ xprt->stat.req_u,
+ xprt->stat.bklog_u);
+}
+
+static struct rpc_xprt_ops xs_smb_ops = {
+ .reserve_xprt = xprt_reserve_xprt,
+ .release_xprt = xs_tcp_release_xprt,
+ .rpcbind = xs_smb_rpcbind,
+ .set_port = xs_smb_set_port,
+ .connect = xs_tcp_connect,
+ .buf_alloc = rpc_malloc,
+ .buf_free = rpc_free,
+ .send_request = xs_smb_send_request,
+ .set_retrans_timeout = xprt_set_retrans_timeout_def,
+#if defined(CONFIG_NFS_V4_1)
+ .release_request = bc_release_request,
+#endif /* CONFIG_NFS_V4_1 */
+ .close = xs_tcp_close,
+ .destroy = xs_destroy,
+ .print_stats = xs_tcp_print_stats,
+};
+
+static struct rpc_xprt *xs_setup_xprt(struct xprt_create *args,
+ unsigned int slot_table_size)
+{
+ struct rpc_xprt *xprt;
+ struct smb_xprt *new;
+
+ if (args->addrlen > sizeof(xprt->addr)) {
+ dprintk("RPC: xs_setup_xprt: address too large\n");
+ return ERR_PTR(-EBADF);
+ }
+
+ new = kzalloc(sizeof(*new), GFP_KERNEL);
+ if (new == NULL) {
+ dprintk("RPC: xs_setup_xprt: couldn't allocate "
+ "rpc_xprt\n");
+ return ERR_PTR(-ENOMEM);
+ }
+ xprt = &new->xprt;
+
+ xprt->max_reqs = slot_table_size;
+ xprt->slot = kcalloc(xprt->max_reqs, sizeof(struct rpc_rqst), GFP_KERNEL);
+ if (xprt->slot == NULL) {
+ kfree(xprt);
+ dprintk("RPC: xs_setup_xprt: couldn't allocate slot "
+ "table\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ memcpy(&xprt->addr, args->dstaddr, args->addrlen);
+ xprt->addrlen = args->addrlen;
+ if (args->srcaddr)
+ memcpy(&new->addr, args->srcaddr, args->addrlen);
+
+ return xprt;
+}
+
+static const struct rpc_timeout xs_tcp_default_timeout = {
+ .to_initval = 60 * HZ,
+ .to_maxval = 60 * HZ,
+ .to_retries = 2,
+};
+
+/**
+ * xs_setup_tcp - Set up transport to use a TCP socket
+ * @args: rpc transport creation arguments
+ *
+ */
+static struct rpc_xprt *xs_setup_smb(struct xprt_create *args)
+{
+ struct sockaddr *addr = args->dstaddr;
+ struct rpc_xprt *xprt;
+ struct smb_xprt *transport;
+
+ xprt = xs_setup_xprt(args, xprt_smb_slot_table_entries);
+ if (IS_ERR(xprt))
+ return xprt;
+ transport = container_of(xprt, struct smb_xprt, xprt);
+
+ xprt->prot = IPPROTO_TCP;
+ xprt->tsh_size = sizeof(rpc_fraghdr) / sizeof(u32);
+ xprt->max_payload = SMB_MAX_FRAGMENT_SIZE;
+
+ xprt->bind_timeout = XS_BIND_TO;
+ xprt->connect_timeout = XS_TCP_CONN_TO;
+ xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+ xprt->idle_timeout = XS_SMB_IDLE_DISC_TO;
+
+ xprt->ops = &xs_smb_ops;
+ xprt->timeout = &xs_tcp_default_timeout;
+
+ switch (addr->sa_family) {
+ case AF_INET:
+ if (((struct sockaddr_in *)addr)->sin_port == htons(0))
+ return ERR_PTR(-EINVAL);
+
+ xprt_set_bound(xprt);
+ INIT_DELAYED_WORK(&transport->connect_worker, xs_tcp_connect_worker4);
+ xs_format_peer_addresses(xprt, "tcp", RPCBIND_NETID_TCP);
+ break;
+ case AF_INET6:
+ if (((struct sockaddr_in6 *)addr)->sin6_port == htons(0))
+ return ERR_PTR(-EINVAL);
+
+ xprt_set_bound(xprt);
+ INIT_DELAYED_WORK(&transport->connect_worker, xs_tcp_connect_worker6);
+ xs_format_peer_addresses(xprt, "tcp", RPCBIND_NETID_TCP6);
+ break;
+ default:
+ kfree(xprt);
+ return ERR_PTR(-EAFNOSUPPORT);
+ }
+
+ dprintk("RPC: set up xprt to %s (port %s) via %s\n",
+ xprt->address_strings[RPC_DISPLAY_ADDR],
+ xprt->address_strings[RPC_DISPLAY_PORT],
+ xprt->address_strings[RPC_DISPLAY_PROTO]);
+
+ if (try_module_get(THIS_MODULE))
+ return xprt;
+
+ kfree(xprt->slot);
+ kfree(xprt);
+ return ERR_PTR(-EINVAL);
+}
+
+static struct xprt_class xs_smb_transport = {
+ .list = LIST_HEAD_INIT(xs_smb_transport.list),
+ .name = "smb",
+ .owner = THIS_MODULE,
+ .ident = XPRT_TRANSPORT_SMB,
+ .setup = xs_setup_smb,
+};
+
+/**
+ * init_socket_xprt - set up xprtsock's sysctls, register with RPC client
+ *
+ */
+int init_smb_xprt(void)
+{
+#ifdef RPC_DEBUG
+ if (!sunrpc_table_header)
+ sunrpc_table_header = register_sysctl_table(sunrpc_table);
+#endif
+
+ xprt_register_transport(&xs_smb_transport);
+
+ return 0;
+}
+
+/**
+ * cleanup_socket_xprt - remove xprtsock's sysctls, unregister
+ *
+ */
+void cleanup_smb_xprt(void)
+{
+#ifdef RPC_DEBUG
+ if (sunrpc_table_header) {
+ unregister_sysctl_table(sunrpc_table_header);
+ sunrpc_table_header = NULL;
+ }
+#endif
+
+ xprt_unregister_transport(&xs_smb_transport);
+}
--
1.6.0.6
...is there a better way to do this?
Signed-off-by: Jeff Layton <[email protected]>
---
include/linux/sunrpc/clnt.h | 1 +
net/sunrpc/clnt.c | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/linux/sunrpc/clnt.h b/include/linux/sunrpc/clnt.h
index 307a3ec..92ff85d 100644
--- a/include/linux/sunrpc/clnt.h
+++ b/include/linux/sunrpc/clnt.h
@@ -171,6 +171,7 @@ size_t rpc_pton(const char *, const size_t,
char * rpc_sockaddr2uaddr(const struct sockaddr *);
size_t rpc_uaddr2sockaddr(const char *, const size_t,
struct sockaddr *, const size_t);
+void call_bind(struct rpc_task *task);
static inline unsigned short rpc_get_port(const struct sockaddr *sap)
{
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 42688af..f2b0815 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -63,7 +63,6 @@ static void call_reserve(struct rpc_task *task);
static void call_reserveresult(struct rpc_task *task);
static void call_allocate(struct rpc_task *task);
static void call_decode(struct rpc_task *task);
-static void call_bind(struct rpc_task *task);
static void call_bind_status(struct rpc_task *task);
static void call_transmit(struct rpc_task *task);
#if defined(CONFIG_NFS_V4_1)
@@ -1004,7 +1003,7 @@ EXPORT_SYMBOL_GPL(rpc_xdr_encode);
/*
* 4. Get the server port number if not yet set
*/
-static void
+void
call_bind(struct rpc_task *task)
{
struct rpc_xprt *xprt = task->tk_xprt;
@@ -1018,6 +1017,7 @@ call_bind(struct rpc_task *task)
xprt->ops->rpcbind(task);
}
}
+EXPORT_SYMBOL_GPL(call_bind);
/*
* 4a. Sort out bind result
--
1.6.0.6
On Sun, Sep 27, 2009 at 11:50 AM, Jeff Layton <[email protected]> wrote:
> This patchset is still preliminary and is just an RFC...
>
> First, some background. When I was at Connectathon this year, Trond
> mentioned an interesting idea to me. He said (paraphrasing):
>
> "Why doesn't CIFS just use the RPC layer for transport? It's very
> efficient at shoveling bits out onto the wire. You'd just need to
> abstract out the XDR/RPC specific bits."
>
> The idea has been floating around in the back of my head until recently,
> when I decided to give it a go. This patchset represents a first, rough
> draft at a patchset to do this. There's also a proof of concept module
> that demonstrates that it works as expected.
>
> The patchset is still very rough. Reconnect behavior is still very
> "RPC-ish", for instance. There are doubtless other changes that will
> need to be made before I had anything merge-worthy.
>
> At this point, I'm just interested in feedback on the general approach.
> Should I continue to pursue this or is it a non-starter for some
> reason?
>
> The next step here, obviously is to start building a fs on top of it.
> I'd particularly be interested in using this as a foundation of a
> new smb2fs.
>
> I've also got this patchset up on my public git tree:
>
> http://git.samba.org/?p=jlayton/cifs.git;a=summary
>
> Here are some questions I anticipate on this, and their answers:
>
> ------------------------------------------------------------------------
> Q: Are you insane? Why would you attempt to do this?
>
> A: Maybe...but in the last couple of years, I've spent a substantial
> amount of time working on the CIFS code. Much of that time has been
> spent fixing bugs. Many of those bugs exist in the low-level transport
> code which has been hacked on, kludged around and hand tweaked to
> where it is today. Unfortunately that's made it a very hard to work
> on mess. This drives away potential developers.
>
> CIFS in particular is also designed around synchronous ops, which
> seriously limits throughput. Retrofitting it for asynchronous operation
> will be adding even more kludges. The sunrpc code is already
> fundamentally asynchronous.
> ------------------------------------------------------------------------
> Q: Okay, what's the benefit of hooking it up to sunrpc rather than
> building a new transport layer (or fixing the transport in the other two
> filesystems)?
>
> A: Using sunrpc allows us to share a lot of the rpc scheduler code with
> sunrpc. At a high level, NFS/RPC and SMB aren't really very different.
> Sure, they have different formats, and one is big endian on the wire and
> the other isn't...still there are significant similarities.
>
> We also get access to the great upcall mechanisms that sunrpc has, and
> the possibility to share code like the gssapi upcalls. The sunrpc layer
> has a credential and authentication management framework that we can
> build on to make a truly multiuser CIFS/SMB filesystem.
>
> I've heard it claimed before that Linux's sunrpc layer is
> over-engineered, but in this case that works in our favor...
> ------------------------------------------------------------------------
> Q: can we hook up cifs or smbfs to use this as a transport?
>
> A: Not trivially. CIFS in particular is not designed with each call
> having discrete encode and decode functions. They're sort of mashed
> together. smbfs might be possible...I'm a little less familiar with it,
> but it does have a transport layer that more closely resembles the
> sunrpc one. Still though, it'd take significant work to make that
> happen. I'm not opposed to the idea however.
>
> In the end though, I think we'll probably need to design something new
> to sit on top of this. We will probably be able to borrow code and
> concepts from the other filesystems however.
> ------------------------------------------------------------------------
> Q: could we use this as a transport layer for a smb2fs ?
>
> A: Yes, I think so. This particular prototype is build around SMB1, but
> SMB2 could be supported with only minor modifications. One of the
> reasons for sending this patchset now before I've built a filesystem on
> top of it is because I know that SMB2 work is in progress. I'd like to
> see it based around a more asynchronous transport model, or at least
> built with cleaner layering so that we can eventually bolt on a different
> transport layer if we so choose.
>
> Jeff Layton (9):
> sunrpc: abstract out encoding function at rpc_clnt level
> sunrpc: move check for too small reply size into rpc_verify_header
> sunrpc: abstract our call decoding routine
> sunrpc: move rpc_xdr_buf_init to clnt.h
> sunrpc: make call_bind non-static
> sunrpc: add new SMB transport class for sunrpc
> sunrpc: add encoding and decoding routines for SMB
> sunrpc: add Kconfig option for CONFIG_SUNRPC_SMB
> smbtest: simple module for testing SMB/RPC code
>
> fs/Makefile | 2 +
> fs/lockd/host.c | 4 +
> fs/lockd/mon.c | 4 +
> fs/nfs/client.c | 4 +
> fs/nfs/mount_clnt.c | 4 +
> fs/nfsd/nfs4callback.c | 4 +
> fs/smbtest/Makefile | 1 +
> fs/smbtest/smbtest.c | 204 +++++
> include/linux/sunrpc/clnt.h | 24 +-
> include/linux/sunrpc/smb.h | 42 +
> include/linux/sunrpc/xprtsmb.h | 59 ++
> net/sunrpc/Kconfig | 11 +
> net/sunrpc/Makefile | 1 +
> net/sunrpc/clnt.c | 98 ++-
> net/sunrpc/rpcb_clnt.c | 8 +
> net/sunrpc/smb.c | 120 +++
> net/sunrpc/sunrpc_syms.c | 3 +
> net/sunrpc/xprtsmb.c | 1723
> ++++++++++++++++++++++++++++++++++++++++
> 18 files changed, 2272 insertions(+), 44 deletions(-)
> create mode 100644 fs/smbtest/Makefile
> create mode 100644 fs/smbtest/smbtest.c
> create mode 100644 include/linux/sunrpc/smb.h
> create mode 100644 include/linux/sunrpc/xprtsmb.h
> create mode 100644 net/sunrpc/smb.c
> create mode 100644 net/sunrpc/xprtsmb.c
>
> _______________________________________________
> linux-cifs-client mailing list
> [email protected]
> https://lists.samba.org/mailman/listinfo/linux-cifs-client
>
Jeff,
So servers need to speak SUNRPC as well right?
Are_threre/how_many servers are out there that speak CIFS/SMB over SUNRPC or
SMB2 over SUNRPC?
Jeff Layton wrote:
> On Mon, 28 Sep 2009 09:41:08 -0500
> "Steve French (smfltc)" <[email protected]> wrote:
>
>
>>>> This patchset is still preliminary and is just an RFC...
>>>>
>>>> First, some background. When I was at Connectathon this year, Trond
>>>> mentioned an interesting idea to me. He said (paraphrasing):
>>>>
>>>> "Why doesn't CIFS just use the RPC layer for transport? It's very
>>>> efficient at shoveling bits out onto the wire. You'd just need to
>>>> abstract out the XDR/RPC specific bits."
>>>>
>>>>
>>>>
>> My first reaction is that if you abstract out XDR/RPC specific parts of
>> SunRPC it isn't SunRPC,
>> just a scheduler on top of tcp (not a bad thing in theory). Pulling
>> out the two key pieces from
>> SunRPC:
>> - asynchronous event handling and scheduling
>> - upcall for credentials
>> could be useful, but does add a lot of complexity. If there is a way
>> to use just the async
>> scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as
>> long as it
>> can skip the encoding/decoding and just pass in a raw kvec containing
>> the SMB
>> header and data.
>>
>>
>
> Well, the sunrpc layer currently contains a lot of pieces:
>
> 1) client side call/response handling (clnt routines)
> 2) server side call/response handling (svc routines)
> 3) XDR encoding and decoding routines (including crypto signatures, etc)
>
> ...the idea is to hook up new encoding and decoding routines and to add
> a new "transport class" which will make the client-side scheduler handle
> SMB/SMB2 properly.
>
> We'll also eventually have to add new authentication/credential
> "classes" too. I haven't researched that yet in any real depth, so I
> can't state much about how difficult it'll be.
>
>
>>>> CIFS in particular is also designed around synchronous ops, which
>>>> seriously limits throughput. Retrofitting it for asynchronous operation
>>>> will be adding even more kludges.
>>>>
>>>>
>> There are only three operations that we can send asynchronous today, all
>> of which require
>> special case handling in the VFS already:
>> - readpages
>> - writepages
>> - blocking locks
>> (and also directory change notification which we and nfs don't do). I
>> think the "slow_work"
>> mechanism is probably sufficient for these cases already.
>>
>>
>
> The problem is that rolling a mechanism to handle asynchronous ops is
> difficult to get right. I think it makes a lot of sense to reuse a
> proven engine here. It also makes a lot of sense to implement
> synchronous ops on top of an asynchronous infrastructure. RPC does this
> under the hood, and so did smbfs.
>
> What you're proposing, in effect, is to do this in reverse -- implement
> an asynchronous transport engine using synchronous ops and offloading
> the background parts onto threads. That's possible I suppose, but it
> means you have a lot of tasks sleeping in the kernel and waiting for
> stuff to happen.
>
>
I don't think it changes the number of tasks sleeping. For sync
operations it doesn't
change how many tasks that are sleeping. For async operations, you no
longer have a
task sleeping in cifs_writepages or cifs_readpages, but do have the
ability to dispatch
a decode routine when the SMB Write or Read response is returned (may or
may not have a pool to do this). Seems like about the same number of
task (possibly
less). Moving to all asynchronous operations under open, unlink, close
doesn't reduce
the number of sleeping tasks (it may increase it or stay the same).
>>>> works in our favor...
>>>> ------------------------------------------------------------------------
>>>> Q: can we hook up cifs or smbfs to use this as a transport?
>>>>
>>>> A: Not trivially. CIFS in particular is not designed with each call
>>>> having discrete encode and decode functions. They're sort of mashed
>>>>
>>>>
>> We certainly don't want to move to an abstract encoding mechanism,
>> especially for SMB2
>> where there is only one encoding of wire operations, and no duplicate
>> requests due
>> to 20 years of dialects. I can see an argument for abstract encoding
>> for requests
>> like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
>> to abstract and has to be done case by case anyway due to differences in
>> field length, missing fields, different compensations. It is not
>> like the simpler NFS case where encoding involves endian conversion etc.
>>
>>
>
> I'm not sure what you mean by this. Assembling an SMB header and call
> is very similar to assembling an RPC header and call. There are
> differences of course, but they aren't that substantial.
>
> SMB does introduce some more interesting wrinkles. For instance, since
> state is tied up with the actual socket connection, we'll probably need
> callbacks into the fs for socket state changes. That doesn't have much
> to do with how you abstract out the encoding and decoding though.
>
>
>>>> ------------------------------------------------------------------------
>>>> Q: could we use this as a transport layer for a smb2fs ?
>>>>
>>>> A: Yes, I think so. This particular prototype is build around SMB1, but
>>>> SMB2 could be supported with only minor modifications. One of the
>>>> reasons for sending this patchset now before I've built a filesystem on
>>>> top of it is because I know that SMB2 work is in progress. I'd like to
>>>> see it based around a more asynchronous transport model, or at least
>>>> built with cleaner layering so that we can eventually bolt on a different
>>>> transport layer if we so choose.
>>>>
>>>>
>> Amost all the ops use "send_receive" already - so there is no need to
>> change the code much above
>> that if you want to experiment with changing the transport. I like the
>> idea of the
>> abtraction of async operations, and creating completion routines (and an
>> async send
>> abstraction) for readpages, writepages and directory change
>> notification would make sense.
>> but in both cifs and smb2, the 95% of the operations that must be
>> synchronous in
>> the VFS (open, lookup, unlink, create etc.) can already be hooked up to
>> any transport
>> as long as it can send a kvec contain fs data and return a response
>> (like the "send_receive"
>> and equivalent).
>>
>>
>
> The problem with the send_receive interface is that it assumes that the
> encoding, send and decoding will be done by the same task. I think that
> assumption will greatly limit this code later and force you to rely on
> workarounds (like slow_work) to get asynchronous behavior.
>
> At the very least, I suggest splitting off the decode portions into
> separate functions. That at least should allow you the ability later to
> offload that part to other tasks (similar to how async tasks get
> offloaded to rpciod).
>
>
That (splitting decode into a distinct helper) makes sense (at least for
async capable ops,
in particular write, read and directory change notification and byte
range locks). The
"smb_decode_write_response" case is a particularly simple and useful one
to do and would
be an easy one to do as a test. I think the prototype patches for
async write that someone did
for cifs a few years ago did that.
>> The idea of doing abstract translation and encoding of SMB protocol frames
>> does seem overengineered and probably would make it harder to
>> read/understand
>> the setup of certain complex request frames which are quite different from
>> Samba to Windows. As another example, generalized, abstract SMB frame
>> conversion isn't being done in Samba 3 for example, and with only
>> 19 requests in SMB2 it makes even less sense. On the client, since
>> we have control over which types of requests we send, our case
>> is simpler than for the server for sending requests, but in
>> response processing since we have to work around server bugs, xdr like
>> decoding of SMB responses could get harder still.
>>
>>
>
> Again, I don't see SMB as being that different from NFS in this regard.
> You have a transport header (similar to the fraghdr with NFS TCP
> transport), then a protocol header (the SMB/SMB2 header), and then
> call-specific information. RPC/NFS works exactly the same way.
>
>
NFS and CIFS/SMB2 seem pretty different to me in this regard. CIFS and
SMB2
are much simpler - for these once protocols unlike nfs, after you assemble
the CIFS or SMB2 header (possibly with a data area as in SMB Write)
you simply add a 4 byte length (actually 3 byte length, 0 byte 0) and
you send it - no endian conversions, xdr, no adding 80 bytes or so rpc
prefix.
SunRPC adds a whole layer (not just a length field) with credentials.
SunRPC
typically adds 80 bytes (or more depending on auth flavor) before the
nfs frame
(cifs and smb2 frame don't need this, so the net/sunrpc code which
handles the
80 bytes or so of rpc header before the network fs frame
is not used for the cifs code)
In SMB2 the pacing is part of the SMB packet, not the transport packet,
and in
SMB2 any reconnection code which would be built into the proposed modified
SunRPC transport would probably have to be aware of the new handle types
and locking operations and state that have been added in SMB2.1 which may
break the abstraction between network fs and SunRPC transport
> The code I've proposed more or less has abstractions along those lines.
> There's:
>
> transport class -- (net/sunrpc/xprtsmb.c)
> header encoding/decoding -- (net/sunrpc/smb.c)
>
> ...the other parts will be implemented by the filesystem (similar to
> how fs/nfs/nfs?xdr.c work).
>
>
>> I like the idea of the way SunRPC keeps task information, and it may
>> make it easier
>> to carry credentials around (although I think using Dave Howell's key
>> management code
>> might be ok instead to access Winbind). I am not sure how easy it
>> would be to tie
>> SunRPC credential mapping to Winbind but that could probably be done. I
>> like the
>> async scheduling capability of SunRPC although I suspect that it is a
>> factor in
>> a number of the (nfs client) performance problems we have seen so may
>> need more work.
>> I don't like adding (in effect) an extra transport and "encoding layer"
>> though to
>> protocols (cifs and smb2). NFS since it is built on SunRPC on the
>> wire, required
>> such a layer, and it makes sense for NFS to layer the code, like their
>> protocol,
>> over SunRPC. CIFS and SMB2 don't require (or even allow) XDR translation,
>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
>> encoding of something that has a single defined encoding seems wrong.
>>
>
> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> not that different in this regard. Either way you have to marshal up
> the buffer correctly before you send it, and decode it properly.
>
>
As an example of one of the more complicated cases for cifs (setting a
time field).
The code looks something like this and is very straightforward to see where
the "linux field" is being put into the packet - and to catch errors in
size or
endianness. Most cases are simpler for cifs.
struct file_basic_info buf; /* the wire format of the file basic info
SMB info level */
buf->LastAccessTime = 0; /* we can't set access time to Windows */
buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
(followed by a few similar assignment statements for the remaining fields)
then send the SMB header followed by the buffer. There is no marshalling
or xdr conversion needed. One thing I like about this approach is that
"sparse" (make modules C=1) immediately catches any mismatch
between the wire format (size or endianness) and the vfs data
structure element being put in the frame.
Looking at nfs4xdr.c as an example for comparison, it is much harder to
see the actual line
where a bigendian (or littlenedian) 64 bit quantity is being put into
the request frame.
Splitting encode and decode routines probably would make code more
reasonable - but converting from a "linux file system thing" to an
abstract structure
and then to one of many wire possibilities for a frame - doesn't make
much sense
for the case where there is only one wire encoding (SMB2) or for cases
where a single operation maps to more than one frame in some dialects
but not others (as in SMB) and so can't be encoded abstractly.
On Mon, 28 Sep 2009 13:40:23 -0500
"Steve French (smfltc)" <[email protected]> wrote:
> Jeff Layton wrote:
> > On Mon, 28 Sep 2009 09:41:08 -0500
> > "Steve French (smfltc)" <[email protected]> wrote:
> >
> >
> >>>> This patchset is still preliminary and is just an RFC...
> >>>>
> >>>> First, some background. When I was at Connectathon this year, Trond
> >>>> mentioned an interesting idea to me. He said (paraphrasing):
> >>>>
> >>>> "Why doesn't CIFS just use the RPC layer for transport? It's very
> >>>> efficient at shoveling bits out onto the wire. You'd just need to
> >>>> abstract out the XDR/RPC specific bits."
> >>>>
> >>>>
> >>>>
> >> My first reaction is that if you abstract out XDR/RPC specific parts of
> >> SunRPC it isn't SunRPC,
> >> just a scheduler on top of tcp (not a bad thing in theory). Pulling
> >> out the two key pieces from
> >> SunRPC:
> >> - asynchronous event handling and scheduling
> >> - upcall for credentials
> >> could be useful, but does add a lot of complexity. If there is a way
> >> to use just the async
> >> scheduling (and perhaps upcall) out of SunRPC, that part sounds fine as
> >> long as it
> >> can skip the encoding/decoding and just pass in a raw kvec containing
> >> the SMB
> >> header and data.
> >>
> >>
> >
> > Well, the sunrpc layer currently contains a lot of pieces:
> >
> > 1) client side call/response handling (clnt routines)
> > 2) server side call/response handling (svc routines)
> > 3) XDR encoding and decoding routines (including crypto signatures, etc)
> >
> > ...the idea is to hook up new encoding and decoding routines and to add
> > a new "transport class" which will make the client-side scheduler handle
> > SMB/SMB2 properly.
> >
> > We'll also eventually have to add new authentication/credential
> > "classes" too. I haven't researched that yet in any real depth, so I
> > can't state much about how difficult it'll be.
> >
> >
> >>>> CIFS in particular is also designed around synchronous ops, which
> >>>> seriously limits throughput. Retrofitting it for asynchronous operation
> >>>> will be adding even more kludges.
> >>>>
> >>>>
> >> There are only three operations that we can send asynchronous today, all
> >> of which require
> >> special case handling in the VFS already:
> >> - readpages
> >> - writepages
> >> - blocking locks
> >> (and also directory change notification which we and nfs don't do). I
> >> think the "slow_work"
> >> mechanism is probably sufficient for these cases already.
> >>
> >>
> >
> > The problem is that rolling a mechanism to handle asynchronous ops is
> > difficult to get right. I think it makes a lot of sense to reuse a
> > proven engine here. It also makes a lot of sense to implement
> > synchronous ops on top of an asynchronous infrastructure. RPC does this
> > under the hood, and so did smbfs.
> >
> > What you're proposing, in effect, is to do this in reverse -- implement
> > an asynchronous transport engine using synchronous ops and offloading
> > the background parts onto threads. That's possible I suppose, but it
> > means you have a lot of tasks sleeping in the kernel and waiting for
> > stuff to happen.
> >
> >
> I don't think it changes the number of tasks sleeping. For sync
> operations it doesn't
> change how many tasks that are sleeping. For async operations, you no
> longer have a
> task sleeping in cifs_writepages or cifs_readpages, but do have the
> ability to dispatch
> a decode routine when the SMB Write or Read response is returned (may or
> may not have a pool to do this). Seems like about the same number of
> task (possibly
> less). Moving to all asynchronous operations under open, unlink, close
> doesn't reduce
> the number of sleeping tasks (it may increase it or stay the same).
I think it does. Take the example of asynchronous writepages requests.
You want to send a bunch of write calls in quick succession and then
process the replies as they come in. If you rely on a SendReceive-type
interface, you have no choice but to spawn a separate slow_work task
for each call on the wire. No guarantee they'll run in parallel though
(depends on the rules for spawning new kslowd threads).
Now, you could hack up a routine that just sends the calls, sprinkle in
a callback routine that's run by cifsd or something when the writes
come back (though you'll need to be wary of deadlock, of course).
...or you could just start with a transport layer that's designed for
this from the beginning. That's what I'm suggesting that we do.
Heck, samba does the exact same thing. It's transport layer is
fundamentally asynchronous too (built around the tevent lib).
> >>>> works in our favor...
> >>>> ------------------------------------------------------------------------
> >>>> Q: can we hook up cifs or smbfs to use this as a transport?
> >>>>
> >>>> A: Not trivially. CIFS in particular is not designed with each call
> >>>> having discrete encode and decode functions. They're sort of mashed
> >>>>
> >>>>
> >> We certainly don't want to move to an abstract encoding mechanism,
> >> especially for SMB2
> >> where there is only one encoding of wire operations, and no duplicate
> >> requests due
> >> to 20 years of dialects. I can see an argument for abstract encoding
> >> for requests
> >> like SMB open, vs. SMB OpenX vs. SMB NTCreateX but this would be harder or
> >> to abstract and has to be done case by case anyway due to differences in
> >> field length, missing fields, different compensations. It is not
> >> like the simpler NFS case where encoding involves endian conversion etc.
> >>
> >>
> >
> > I'm not sure what you mean by this. Assembling an SMB header and call
> > is very similar to assembling an RPC header and call. There are
> > differences of course, but they aren't that substantial.
> >
> > SMB does introduce some more interesting wrinkles. For instance, since
> > state is tied up with the actual socket connection, we'll probably need
> > callbacks into the fs for socket state changes. That doesn't have much
> > to do with how you abstract out the encoding and decoding though.
> >
> >
> >>>> ------------------------------------------------------------------------
> >>>> Q: could we use this as a transport layer for a smb2fs ?
> >>>>
> >>>> A: Yes, I think so. This particular prototype is build around SMB1, but
> >>>> SMB2 could be supported with only minor modifications. One of the
> >>>> reasons for sending this patchset now before I've built a filesystem on
> >>>> top of it is because I know that SMB2 work is in progress. I'd like to
> >>>> see it based around a more asynchronous transport model, or at least
> >>>> built with cleaner layering so that we can eventually bolt on a different
> >>>> transport layer if we so choose.
> >>>>
> >>>>
> >> Amost all the ops use "send_receive" already - so there is no need to
> >> change the code much above
> >> that if you want to experiment with changing the transport. I like the
> >> idea of the
> >> abtraction of async operations, and creating completion routines (and an
> >> async send
> >> abstraction) for readpages, writepages and directory change
> >> notification would make sense.
> >> but in both cifs and smb2, the 95% of the operations that must be
> >> synchronous in
> >> the VFS (open, lookup, unlink, create etc.) can already be hooked up to
> >> any transport
> >> as long as it can send a kvec contain fs data and return a response
> >> (like the "send_receive"
> >> and equivalent).
> >>
> >>
> >
> > The problem with the send_receive interface is that it assumes that the
> > encoding, send and decoding will be done by the same task. I think that
> > assumption will greatly limit this code later and force you to rely on
> > workarounds (like slow_work) to get asynchronous behavior.
> >
> > At the very least, I suggest splitting off the decode portions into
> > separate functions. That at least should allow you the ability later to
> > offload that part to other tasks (similar to how async tasks get
> > offloaded to rpciod).
> >
> >
> That (splitting decode into a distinct helper) makes sense (at least for
> async capable ops,
> in particular write, read and directory change notification and byte
> range locks). The
> "smb_decode_write_response" case is a particularly simple and useful one
> to do and would
> be an easy one to do as a test. I think the prototype patches for
> async write that someone did
> for cifs a few years ago did that.
Even for ops that are fundamentally synchronous, it makes sense to
split the encoding and decoding into separate routines. Maybe it's just
my personal bias, but I think it encourages a cleaner, more flexible
design.
> >> The idea of doing abstract translation and encoding of SMB protocol frames
> >> does seem overengineered and probably would make it harder to
> >> read/understand
> >> the setup of certain complex request frames which are quite different from
> >> Samba to Windows. As another example, generalized, abstract SMB frame
> >> conversion isn't being done in Samba 3 for example, and with only
> >> 19 requests in SMB2 it makes even less sense. On the client, since
> >> we have control over which types of requests we send, our case
> >> is simpler than for the server for sending requests, but in
> >> response processing since we have to work around server bugs, xdr like
> >> decoding of SMB responses could get harder still.
> >>
> >>
> >
> > Again, I don't see SMB as being that different from NFS in this regard.
> > You have a transport header (similar to the fraghdr with NFS TCP
> > transport), then a protocol header (the SMB/SMB2 header), and then
> > call-specific information. RPC/NFS works exactly the same way.
> >
> >
> NFS and CIFS/SMB2 seem pretty different to me in this regard. CIFS and
> SMB2
> are much simpler - for these once protocols unlike nfs, after you assemble
> the CIFS or SMB2 header (possibly with a data area as in SMB Write)
> you simply add a 4 byte length (actually 3 byte length, 0 byte 0) and
> you send it - no endian conversions, xdr, no adding 80 bytes or so rpc
> prefix.
> SunRPC adds a whole layer (not just a length field) with credentials.
> SunRPC
> typically adds 80 bytes (or more depending on auth flavor) before the
> nfs frame
> (cifs and smb2 frame don't need this, so the net/sunrpc code which
> handles the
> 80 bytes or so of rpc header before the network fs frame
> is not used for the cifs code)
>
> In SMB2 the pacing is part of the SMB packet, not the transport packet,
> and in
> SMB2 any reconnection code which would be built into the proposed modified
> SunRPC transport would probably have to be aware of the new handle types
> and locking operations and state that have been added in SMB2.1 which may
> break the abstraction between network fs and SunRPC transport
>
CIFS/SMB most definitely need to do endianness conversions -- BE arches
have to convert nearly everything.
Though honestly, you're splitting hairs here. All of these details are
meaningless. In NFS there are areas that are simpler to handle than SMB
(like the header parsing of the response)...and there are parts that
are more difficult (fragheader handling, authentication).
The fact of that matter is that they are just not that different. There
are distinct parallels to many of the important pieces:
SMB transport header is 32 bits and contains a length with a zeroed out
upper byte. The RPC over TCP transport header is 32 bits and contains a
31 bit length + a "last fragment" flag.
SMB has MID's, RPC has XID's
RPC has the RPC auth info, SMB has the session UID and TID
...the parallels go on...
We'll of course need a different set of arguments to build a SMB header
vs. a RPC one. That information comes from the filesystem anyway and
it'll know what sort of arguments it needs to pass in.
> > The code I've proposed more or less has abstractions along those lines.
> > There's:
> >
> > transport class -- (net/sunrpc/xprtsmb.c)
> > header encoding/decoding -- (net/sunrpc/smb.c)
> >
> > ...the other parts will be implemented by the filesystem (similar to
> > how fs/nfs/nfs?xdr.c work).
> >
> >
> >> I like the idea of the way SunRPC keeps task information, and it may
> >> make it easier
> >> to carry credentials around (although I think using Dave Howell's key
> >> management code
> >> might be ok instead to access Winbind). I am not sure how easy it
> >> would be to tie
> >> SunRPC credential mapping to Winbind but that could probably be done. I
> >> like the
> >> async scheduling capability of SunRPC although I suspect that it is a
> >> factor in
> >> a number of the (nfs client) performance problems we have seen so may
> >> need more work.
> >> I don't like adding (in effect) an extra transport and "encoding layer"
> >> though to
> >> protocols (cifs and smb2). NFS since it is built on SunRPC on the
> >> wire, required
> >> such a layer, and it makes sense for NFS to layer the code, like their
> >> protocol,
> >> over SunRPC. CIFS and SMB2 don't require (or even allow) XDR translation,
> >> variable encodings, and SunRPC encapsulation so the idea of abstracting the
> >> encoding of something that has a single defined encoding seems wrong.
> >>
> >
> > I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> > not that different in this regard. Either way you have to marshal up
> > the buffer correctly before you send it, and decode it properly.
> >
> >
>
> As an example of one of the more complicated cases for cifs (setting a
> time field).
> The code looks something like this and is very straightforward to see where
> the "linux field" is being put into the packet - and to catch errors in
> size or
> endianness. Most cases are simpler for cifs.
>
> struct file_basic_info buf; /* the wire format of the file basic info
> SMB info level */
>
> buf->LastAccessTime = 0; /* we can't set access time to Windows */
> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
> (followed by a few similar assignment statements for the remaining fields)
>
> then send the SMB header followed by the buffer. There is no marshalling
> or xdr conversion needed. One thing I like about this approach is that
> "sparse" (make modules C=1) immediately catches any mismatch
> between the wire format (size or endianness) and the vfs data
> structure element being put in the frame.
>
> Looking at nfs4xdr.c as an example for comparison, it is much harder to
> see the actual line
> where a bigendian (or littlenedian) 64 bit quantity is being put into
> the request frame.
>
Woah...I think we just have a difference in terminology here. When I
say "encoding" the packet, I just mean that we're stuffing the data
into the buffer in the format that the wire expects.
Now, whether you do that with packed structures (like CIFS does and
which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
macros in the nfs?xdr.c files) is simply an implementation detail.
In fact, if you take a look at the sample "smbtest" module, you'll
notice that the negotiate protocol packets are "encoded" and "decoded"
using packed structures.
In truth, we could probably utilize packed structures in most of the
NFS xdr code too, it just hasn't traditionally been done that way...
> Splitting encode and decode routines probably would make code more
> reasonable - but converting from a "linux file system thing" to an
> abstract structure
> and then to one of many wire possibilities for a frame - doesn't make
> much sense
> for the case where there is only one wire encoding (SMB2) or for cases
> where a single operation maps to more than one frame in some dialects
> but not others (as in SMB) and so can't be encoded abstractly.
The transport engine doesn't (and shouldn't) care about any of this.
All it needs to know is how to encode the packet given a set of
arguments, how to send it, and what to do with the reply. The rest is
all the job of the actual filesystem.
The only difference here from what CIFS does today is that this is
handled via a discrete set of interfaces to the transport engine rather
than just calling SendReceive with an already layed-out buffer.
--
Jeff Layton <[email protected]>
Jeff Layton wrote:
> On Mon, 28 Sep 2009 13:40:23 -0500
> "Steve French (smfltc)" <[email protected]> wrote:
>
>
>> Jeff Layton wrote:
>>
>>> On Mon, 28 Sep 2009 09:41:08 -0500
>>> "Steve French (smfltc)" <[email protected]> wrote:
>>>
>>>
>>>
>>>
>>>
>> I don't think it changes the number of tasks sleeping. For sync
>> operations it doesn't
>> change how many tasks that are sleeping. For async operations, you no
>> longer have a
>> task sleeping in cifs_writepages or cifs_readpages, but do have the
>> ability to dispatch
>> a decode routine when the SMB Write or Read response is returned (may or
>> may not have a pool to do this). Seems like about the same number of
>> task (possibly
>> less). Moving to all asynchronous operations under open, unlink, close
>> doesn't reduce
>> the number of sleeping tasks (it may increase it or stay the same).
>>
>
> I think it does. Take the example of asynchronous writepages requests.
> You want to send a bunch of write calls in quick succession and then
> process the replies as they come in. If you rely on a SendReceive-type
> interface, you have no choice but to spawn a separate slow_work task
> for each call on the wire. No guarantee they'll run in parallel though
> (depends on the rules for spawning new kslowd threads).
>
> Now, you could hack up a routine that just sends the calls, sprinkle in
> a callback routine that's run by cifsd or something when the writes
> come back (though you'll need to be wary of deadlock, of course).
>
>
I was assuming the latter ... ie that cifsd processes the completion (or
spawns a slow
work task to process the response). For the case of write (actually
writepages)
there is not processing to be done (checking the rc, updating the status of
the corresponding pages in the page cache). For read there are various
choices
(launching a slow work thread in readpages - but you only need one for what
is potentially a very large number of reads from that readpages call) or
processing
in cifs or launching a slow work thread to process (my current
preference since
this is just cache updates/readahead anyway).
>>
>> That (splitting decode into a distinct helper) makes sense (at least for
>> async capable ops,
>> in particular write, read and directory change notification and byte
>> range locks). The
>> "smb_decode_write_response" case is a particularly simple and useful one
>> to do and would
>> be an easy one to do as a test. I think the prototype patches for
>> async write that someone did
>> for cifs a few years ago did that.
>>
>
> Even for ops that are fundamentally synchronous, it makes sense to
> split the encoding and decoding into separate routines. Maybe it's just
> my personal bias, but I think it encourages a cleaner, more flexible
> design.
>
I don't have a strong preference either way. There are a large number
of SMBs which
are basically "responseless" (we just look at the rc, but they don't
include parameters)
so they would not have interesting decode routines. Even for SMB2
although we
probably have to look at the credits, the generic SMB response
processing can
probably deal with those. For those SMBs which have response processing,
if it makes for smaller more readable functions to break out the decode
routines, that seems
fine.
> SMB transport header is 32 bits and contains a length with a zeroed out
> upper byte. The RPC over TCP transport header is 32 bits and contains a
> 31 bit length + a "last fragment" flag.
>
> SMB has MID's, RPC has XID's
>
> RPC has the RPC auth info, SMB has the session UID and TID
>
>
>
There are some similar concepts but at different levels of the network
stack. Having
net/sunrpc assemble SMB headers (uid, mids) would be needed if you
really want
to leverage the similar concepts, and it seems like a strange idea to
move much of fs/cifs
into net/sunrpc helpers
>>> The code I've proposed more or less has abstractions along those lines.
>>> There's:
>>>
>>> transport class -- (net/sunrpc/xprtsmb.c)
>>> header encoding/decoding -- (net/sunrpc/smb.c)
>>>
>>> ...the other parts will be implemented by the filesystem (similar to
>>> how fs/nfs/nfs?xdr.c work).
>>>
>>>
>>>
>>>> I like the idea of the way SunRPC keeps task information, and it may
>>>> make it easier
>>>> to carry credentials around (although I think using Dave Howell's key
>>>> management code
>>>> might be ok instead to access Winbind). I am not sure how easy it
>>>> would be to tie
>>>> SunRPC credential mapping to Winbind but that could probably be done. I
>>>> like the
>>>> async scheduling capability of SunRPC although I suspect that it is a
>>>> factor in
>>>> a number of the (nfs client) performance problems we have seen so may
>>>> need more work.
>>>> I don't like adding (in effect) an extra transport and "encoding layer"
>>>> though to
>>>> protocols (cifs and smb2). NFS since it is built on SunRPC on the
>>>> wire, required
>>>> such a layer, and it makes sense for NFS to layer the code, like their
>>>> protocol,
>>>> over SunRPC. CIFS and SMB2 don't require (or even allow) XDR translation,
>>>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
>>>> encoding of something that has a single defined encoding seems wrong.
>>>>
>>>>
>>> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
>>> not that different in this regard. Either way you have to marshal up
>>> the buffer correctly before you send it, and decode it properly.
>>>
>>>
>>>
>> As an example of one of the more complicated cases for cifs (setting a
>> time field).
>> The code looks something like this and is very straightforward to see where
>> the "linux field" is being put into the packet - and to catch errors in
>> size or
>> endianness. Most cases are simpler for cifs.
>>
>> struct file_basic_info buf; /* the wire format of the file basic info
>> SMB info level */
>>
>> buf->LastAccessTime = 0; /* we can't set access time to Windows */
>> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
>> (followed by a few similar assignment statements for the remaining fields)
>>
>> then send the SMB header followed by the buffer. There is no marshalling
>> or xdr conversion needed. One thing I like about this approach is that
>> "sparse" (make modules C=1) immediately catches any mismatch
>> between the wire format (size or endianness) and the vfs data
>> structure element being put in the frame.
>>
>> Looking at nfs4xdr.c as an example for comparison, it is much harder to
>> see the actual line
>> where a bigendian (or littlenedian) 64 bit quantity is being put into
>> the request frame.
>>
>>
>
> Woah...I think we just have a difference in terminology here. When I
> say "encoding" the packet, I just mean that we're stuffing the data
> into the buffer in the format that the wire expects.
>
> Now, whether you do that with packed structures (like CIFS does and
> which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
> macros in the nfs?xdr.c files) is simply an implementation detail.
>
If we are stuffing the data into the buffer in the format the wire
expects ie a series of assignment
statements more or less as we do today (but broken into distinct encode
and decode helpers)
pSMB->SomeField = cpu_to_le64(inode->some_attribute_field);
and simply using SunRPC as a convenient wrapper around the socket API for
async dispatch that is probably fine (not sure if it is slower or faster
... and
RPC has various performance limitations like the RPC_SLOT_COUNT)
although not sure it is worth the trouble.
On Mon, 28 Sep 2009 16:40:35 -0500
"Steve French (smfltc)" <[email protected]> wrote:
> Jeff Layton wrote:
> > On Mon, 28 Sep 2009 13:40:23 -0500
> > "Steve French (smfltc)" <[email protected]> wrote:
> >
> >
> >> Jeff Layton wrote:
> >>
> >>> On Mon, 28 Sep 2009 09:41:08 -0500
> >>> "Steve French (smfltc)" <[email protected]> wrote:
> >>>
> >>>
> >>>
> >>>
> >>>
> >> I don't think it changes the number of tasks sleeping. For sync
> >> operations it doesn't
> >> change how many tasks that are sleeping. For async operations, you no
> >> longer have a
> >> task sleeping in cifs_writepages or cifs_readpages, but do have the
> >> ability to dispatch
> >> a decode routine when the SMB Write or Read response is returned (may or
> >> may not have a pool to do this). Seems like about the same number of
> >> task (possibly
> >> less). Moving to all asynchronous operations under open, unlink, close
> >> doesn't reduce
> >> the number of sleeping tasks (it may increase it or stay the same).
> >>
> >
> > I think it does. Take the example of asynchronous writepages requests.
> > You want to send a bunch of write calls in quick succession and then
> > process the replies as they come in. If you rely on a SendReceive-type
> > interface, you have no choice but to spawn a separate slow_work task
> > for each call on the wire. No guarantee they'll run in parallel though
> > (depends on the rules for spawning new kslowd threads).
> >
> > Now, you could hack up a routine that just sends the calls, sprinkle in
> > a callback routine that's run by cifsd or something when the writes
> > come back (though you'll need to be wary of deadlock, of course).
> >
> >
> I was assuming the latter ... ie that cifsd processes the completion (or
> spawns a slow
> work task to process the response). For the case of write (actually
> writepages)
> there is not processing to be done (checking the rc, updating the status of
> the corresponding pages in the page cache). For read there are various
> choices
> (launching a slow work thread in readpages - but you only need one for what
> is potentially a very large number of reads from that readpages call) or
> processing
> in cifs or launching a slow work thread to process (my current
> preference since
> this is just cache updates/readahead anyway).
>
Fair enough. In fact, I have a patch in my for-2.6.33 branch that
generalizes the callback routine when cifsd matches a response to a
MID. That may help get that moving.
Still, the fact of the matter is that cifs carries a thread (cifsd)
that's wouldn't truly be needed were the responses to be processed in
softirq context as sunrpc does.
> >>
> >> That (splitting decode into a distinct helper) makes sense (at least for
> >> async capable ops,
> >> in particular write, read and directory change notification and byte
> >> range locks). The
> >> "smb_decode_write_response" case is a particularly simple and useful one
> >> to do and would
> >> be an easy one to do as a test. I think the prototype patches for
> >> async write that someone did
> >> for cifs a few years ago did that.
> >>
> >
> > Even for ops that are fundamentally synchronous, it makes sense to
> > split the encoding and decoding into separate routines. Maybe it's just
> > my personal bias, but I think it encourages a cleaner, more flexible
> > design.
> >
> I don't have a strong preference either way. There are a large number
> of SMBs which
> are basically "responseless" (we just look at the rc, but they don't
> include parameters)
> so they would not have interesting decode routines. Even for SMB2
> although we
> probably have to look at the credits, the generic SMB response
> processing can
> probably deal with those. For those SMBs which have response processing,
> if it makes for smaller more readable functions to break out the decode
> routines, that seems
> fine.
>
Sure, it's the same way with NFS too. For those we can probably just
use a standard "decode" routine that just looks at the error code. Code
consolidation is a good thing.
> > SMB transport header is 32 bits and contains a length with a zeroed out
> > upper byte. The RPC over TCP transport header is 32 bits and contains a
> > 31 bit length + a "last fragment" flag.
> >
> > SMB has MID's, RPC has XID's
> >
> > RPC has the RPC auth info, SMB has the session UID and TID
> >
> >
> >
> There are some similar concepts but at different levels of the network
> stack. Having
> net/sunrpc assemble SMB headers (uid, mids) would be needed if you
> really want
> to leverage the similar concepts, and it seems like a strange idea to
> move much of fs/cifs
> into net/sunrpc helpers
>
Are they really at different levels of the network stack? SMB/SMB2
carries a little more info within the protocol header than we need at
the transport layer, but it's not really that big a deal. The code I
have just copies off the data until it gets to the MID.
Assembling UID's/TID's etc for a packet I see as more a function for
the credential/auth management code in the RPC layer. Essentially I
think I'll have to write a "RPC_AUTH_SMB" rpc_auth plugin for it and
probably also a corresponding credential piece.
It'll be quite a bit different from how it's handled with "real" RPC
protocols, but should still fit within the basic framework.
> >>> The code I've proposed more or less has abstractions along those lines.
> >>> There's:
> >>>
> >>> transport class -- (net/sunrpc/xprtsmb.c)
> >>> header encoding/decoding -- (net/sunrpc/smb.c)
> >>>
> >>> ...the other parts will be implemented by the filesystem (similar to
> >>> how fs/nfs/nfs?xdr.c work).
> >>>
> >>>
> >>>
> >>>> I like the idea of the way SunRPC keeps task information, and it may
> >>>> make it easier
> >>>> to carry credentials around (although I think using Dave Howell's key
> >>>> management code
> >>>> might be ok instead to access Winbind). I am not sure how easy it
> >>>> would be to tie
> >>>> SunRPC credential mapping to Winbind but that could probably be done. I
> >>>> like the
> >>>> async scheduling capability of SunRPC although I suspect that it is a
> >>>> factor in
> >>>> a number of the (nfs client) performance problems we have seen so may
> >>>> need more work.
> >>>> I don't like adding (in effect) an extra transport and "encoding layer"
> >>>> though to
> >>>> protocols (cifs and smb2). NFS since it is built on SunRPC on the
> >>>> wire, required
> >>>> such a layer, and it makes sense for NFS to layer the code, like their
> >>>> protocol,
> >>>> over SunRPC. CIFS and SMB2 don't require (or even allow) XDR translation,
> >>>> variable encodings, and SunRPC encapsulation so the idea of abstracting the
> >>>> encoding of something that has a single defined encoding seems wrong.
> >>>>
> >>>>
> >>> I'm not sure I understand this last comment. CIFS/SMB2 and NFS are just
> >>> not that different in this regard. Either way you have to marshal up
> >>> the buffer correctly before you send it, and decode it properly.
> >>>
> >>>
> >>>
> >> As an example of one of the more complicated cases for cifs (setting a
> >> time field).
> >> The code looks something like this and is very straightforward to see where
> >> the "linux field" is being put into the packet - and to catch errors in
> >> size or
> >> endianness. Most cases are simpler for cifs.
> >>
> >> struct file_basic_info buf; /* the wire format of the file basic info
> >> SMB info level */
> >>
> >> buf->LastAccessTime = 0; /* we can't set access time to Windows */
> >> buf->LastWriteTime = cpu_to_le64(cifs_UnixTimeToNT(attrs->ia_atime));
> >> (followed by a few similar assignment statements for the remaining fields)
> >>
> >> then send the SMB header followed by the buffer. There is no marshalling
> >> or xdr conversion needed. One thing I like about this approach is that
> >> "sparse" (make modules C=1) immediately catches any mismatch
> >> between the wire format (size or endianness) and the vfs data
> >> structure element being put in the frame.
> >>
> >> Looking at nfs4xdr.c as an example for comparison, it is much harder to
> >> see the actual line
> >> where a bigendian (or littlenedian) 64 bit quantity is being put into
> >> the request frame.
> >>
> >>
> >
> > Woah...I think we just have a difference in terminology here. When I
> > say "encoding" the packet, I just mean that we're stuffing the data
> > into the buffer in the format that the wire expects.
> >
> > Now, whether you do that with packed structures (like CIFS does and
> > which I tend to prefer), or via "hand-marshaling" (like the WRITE/READ
> > macros in the nfs?xdr.c files) is simply an implementation detail.
> >
> If we are stuffing the data into the buffer in the format the wire
> expects ie a series of assignment
> statements more or less as we do today (but broken into distinct encode
> and decode helpers)
>
> pSMB->SomeField = cpu_to_le64(inode->some_attribute_field);
>
> and simply using SunRPC as a convenient wrapper around the socket API for
> async dispatch that is probably fine (not sure if it is slower or faster
> ... and
> RPC has various performance limitations like the RPC_SLOT_COUNT)
> although not sure it is worth the trouble.
Why would it be any extra trouble when you're designing something new
in the first place? As I said above, trying to bolt CIFS onto this is
probably going to be more trouble than its worth. SMB2 is another
matter though.
When it comes to encoding, all of the header and call encoding routines
have a defined set of args and return values. How you actually do the
encoding is up to the person writing the implementation. I too think
that packed structures are easier to deal with and would probably
prefer those over hand-marshaling.
CIFS has a "slot count" limitation too (we default to 50 outstanding
calls). In fact, for the prototype here, I made the slot count default
to 50 because of that precedent.
--
Jeff Layton <[email protected]>