2013-07-07 23:58:24

by NeilBrown

Subject: [PATCH - RFC] new "nosharetransport" option for NFS mounts.



This patch adds a "nosharetransport" option to allow two different
mounts from the same server to use different transports.
If the mounts use NFSv4, or are of the same filesystem, then
"nosharecache" must be used as well.

There are at least two circumstances where it might be desirable
to use separate transports:

1/ If the NFS server can get into a state where it will ignore
   requests for one filesystem while servicing requests for another,
   then using separate connections for the separate filesystems can
   stop problems with one affecting access to the other.

   This is particularly relevant for NetApp filers where one filesystem
   has been "suspended".  Requests to that filesystem will be dropped
   (rather than the more correct NFS3ERR_JUKEBOX).  This currently
   interferes with other filesystems.

2/ If a very fast network is used with a many-processor client, a
   single TCP connection can present a bottleneck which reduces total
   throughput.  Using multiple TCP connections (one per mount) removes
   the bottleneck.
   An alternate workaround is to configure multiple virtual IP
   addresses on the server and mount each filesystem from a different
   IP.  This is effective (throughput goes up) but imposes an
   unnecessary administrative burden.

Signed-off-by: NeilBrown <[email protected]>

---
Is this a good idea? Bad idea? Have I missed something important?

NeilBrown
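
In case it helps review, the intended usage is something like the
following (the server name and export paths here are invented, purely
for illustration):

   mount -o nosharetransport server:/vol/fs1 /mnt/fs1
   mount -o nosharetransport server:/vol/fs2 /mnt/fs2

Each mount carrying the option gets its own nfs_client and hence its
own TCP connection; mounts without it keep the current sharing
behaviour.  For NFSv4, or for two mounts of the same filesystem,
"nosharecache" would be added alongside it, as noted above.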


diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c513b0c..64e3f39 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -403,8 +403,13 @@ static struct nfs_client *nfs_match_client(const struct nfs_client_initdata *dat
const struct sockaddr *sap = data->addr;
struct nfs_net *nn = net_generic(data->net, nfs_net_id);

+ if (test_bit(NFS_CS_NO_SHARE, &data->init_flags))
+ return NULL;
+
list_for_each_entry(clp, &nn->nfs_client_list, cl_share_link) {
const struct sockaddr *clap = (struct sockaddr *)&clp->cl_addr;
+ if (test_bit(NFS_CS_NO_SHARE,&clp->cl_flags))
+ continue;
/* Don't match clients that failed to initialise properly */
if (clp->cl_cons_state < 0)
continue;
@@ -753,6 +758,8 @@ static int nfs_init_server(struct nfs_server *server,
data->timeo, data->retrans);
if (data->flags & NFS_MOUNT_NORESVPORT)
set_bit(NFS_CS_NORESVPORT, &cl_init.init_flags);
+ if (data->flags & NFS_MOUNT_NOSHARE_XPRT)
+ set_bit(NFS_CS_NO_SHARE, &cl_init.init_flags);
if (server->options & NFS_OPTION_MIGRATION)
set_bit(NFS_CS_MIGRATION, &cl_init.init_flags);

diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 2d7525f..d9141d8 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -88,6 +88,7 @@ enum {
Opt_acl, Opt_noacl,
Opt_rdirplus, Opt_nordirplus,
Opt_sharecache, Opt_nosharecache,
+ Opt_sharetransport, Opt_nosharetransport,
Opt_resvport, Opt_noresvport,
Opt_fscache, Opt_nofscache,
Opt_migration, Opt_nomigration,
@@ -146,6 +147,8 @@ static const match_table_t nfs_mount_option_tokens = {
{ Opt_nordirplus, "nordirplus" },
{ Opt_sharecache, "sharecache" },
{ Opt_nosharecache, "nosharecache" },
+ { Opt_sharetransport, "sharetransport"},
+ { Opt_nosharetransport, "nosharetransport"},
{ Opt_resvport, "resvport" },
{ Opt_noresvport, "noresvport" },
{ Opt_fscache, "fsc" },
@@ -634,6 +637,7 @@ static void nfs_show_mount_options(struct seq_file *m, struct nfs_server *nfss,
{ NFS_MOUNT_NOACL, ",noacl", "" },
{ NFS_MOUNT_NORDIRPLUS, ",nordirplus", "" },
{ NFS_MOUNT_UNSHARED, ",nosharecache", "" },
+ { NFS_MOUNT_NOSHARE_XPRT, ",nosharetransport", ""},
{ NFS_MOUNT_NORESVPORT, ",noresvport", "" },
{ 0, NULL, NULL }
};
@@ -1239,6 +1243,12 @@ static int nfs_parse_mount_options(char *raw,
case Opt_nosharecache:
mnt->flags |= NFS_MOUNT_UNSHARED;
break;
+ case Opt_sharetransport:
+ mnt->flags &= ~NFS_MOUNT_NOSHARE_XPRT;
+ break;
+ case Opt_nosharetransport:
+ mnt->flags |= NFS_MOUNT_NOSHARE_XPRT;
+ break;
case Opt_resvport:
mnt->flags &= ~NFS_MOUNT_NORESVPORT;
break;
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 3b7fa2a..9e9d7d3 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -41,6 +41,7 @@ struct nfs_client {
#define NFS_CS_DISCRTRY 1 /* - disconnect on RPC retry */
#define NFS_CS_MIGRATION 2 /* - transparent state migr */
#define NFS_CS_INFINITE_SLOTS 3 /* - don't limit TCP slots */
+#define NFS_CS_NO_SHARE 4 /* - don't share across mounts */
struct sockaddr_storage cl_addr; /* server identifier */
size_t cl_addrlen;
char * cl_hostname; /* hostname of server */
diff --git a/include/uapi/linux/nfs_mount.h b/include/uapi/linux/nfs_mount.h
index 576bddd..81c49ff 100644
--- a/include/uapi/linux/nfs_mount.h
+++ b/include/uapi/linux/nfs_mount.h
@@ -73,5 +73,6 @@ struct nfs_mount_data {

#define NFS_MOUNT_LOCAL_FLOCK 0x100000
#define NFS_MOUNT_LOCAL_FCNTL 0x200000
+#define NFS_MOUNT_NOSHARE_XPRT 0x400000

#endif



2013-07-09 14:46:33

by Myklebust, Trond

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.

On Tue, 2013-07-09 at 13:22 +1000, NeilBrown wrote:
> On Mon, 8 Jul 2013 18:51:40 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
> 
> > On Mon, 2013-07-08 at 09:58 +1000, NeilBrown wrote:
> > > 
> > > This patch adds a "nosharetransport" option to allow two different
> > > mounts from the same server to use different transports.
> > > If the mounts use NFSv4, or are of the same filesystem, then
> > > "nosharecache" must be used as well.
> > 
> > Won't this interfere with the recently added NFSv4 trunking detection?
> 
> Will it?  I googled around a bit but couldn't find anything that tells me
> what trunking really was in this context.  Then I found commit 05f4c350ee02
> which makes it quite clear (thanks Chuck!).
> 
> Probably the code I wrote could interfere.
> 
> > 
> > Also, how will it work with NFSv4.1 sessions? The server will usually
> > require a BIND_CONN_TO_SESSION when new TCP connections attempt to
> > attach to an existing session.
> 
> Why would it attempt to attach to an existing session?  I would hope there
> the two different mounts with separate TCP connections would look completely
> separate - different transport, different cache, different session.
> ??

Currently we map sessions and leases 1-1. You'd have quite some work to
do to change that, and it is very unclear to me that there is any
benefit to doing so.

> > 
> > > 2/ If a very fast network is used with a many-processor client, a
> > >   single TCP connection can present a bottle neck which reduces total
> > >   throughput.  Using multiple TCP connections (one per mount) removes
> > >   the bottleneck.
> > >   An alternate workaround is to configure multiple virtual IP
> > >   addresses on the server and mount each filesystem from a different
> > >   IP.  This is effective (throughput goes up) but an unnecessary
> > >   administrative burden.
> > 
> > As I understand it, using multiple simultaneous TCP connections between
> > the same endpoints also adds a risk that the congestion windows will
> > interfere. Do you have numbers to back up the claim of a performance
> > improvement?
> 
> A customer upgraded from SLES10 (2.6.16 based) to SLES11 (3.0 based) and saw
> a slowdown on some large DB jobs of between 1.5 and 2 times (i.e. total time
> 150% to 200% of what is was before).
> After some analysis they created multiple virtual IPs on the server and
> mounted the several filesystem each from different IPs and got the
> performance back (they see this as a work-around rather than a genuine
> solution).
> Numbers are like "500MB/s on a single connection, 850MB/sec peaking to
> 1000MB/sec on multiple connections".
> 
> If I can get something more concrete I'll let you know.
> 
> As this worked well in 2.6.16 (which doesn't try to share connections) this
> is seen as a regression.
> 
> On links that are easy to saturate, congestion windows are important and
> having a single connection is probably a good idea - so the current default
> is certainly correct.
> On a 10G ethernet or infiniband connection (where the issue has been
> measured) congestion just doesn't seem to be an issue.

It would help if we can understand where the actual bottleneck is. If
this really is about lock contention, then solving that problem might
help the single mount case too...

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-07-09 15:04:36

by J. Bruce Fields

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.

On Tue, Jul 09, 2013 at 01:22:53PM +1000, NeilBrown wrote:
> On Mon, 8 Jul 2013 18:51:40 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
> > On Mon, 2013-07-08 at 09:58 +1000, NeilBrown wrote:
> > >
> > > This patch adds a "nosharetransport" option to allow two different
> > > mounts from the same server to use different transports.
> > > If the mounts use NFSv4, or are of the same filesystem, then
> > > "nosharecache" must be used as well.
> >
> > Won't this interfere with the recently added NFSv4 trunking detection?
>
> Will it? I googled around a bit but couldn't find anything that tells me
> what trunking really was in this context. Then I found commit 05f4c350ee02
> which makes it quite clear (thanks Chuck!).
>
> Probably the code I wrote could interfere.
>
> >
> > Also, how will it work with NFSv4.1 sessions? The server will usually
> > require a BIND_CONN_TO_SESSION when new TCP connections attempt to
> > attach to an existing session.

Since the current client only requests SP4_NONE state protection, the
BIND_CONN_TO_SESSION is implicit:

    If, when the client ID was created, the client opted for
    SP4_NONE state protection, the client is not required to use
    BIND_CONN_TO_SESSION to associate the connection with the
    session, unless the client wishes to associate the connection
    with the backchannel. When SP4_NONE protection is used, simply
    sending a COMPOUND request with a SEQUENCE operation is
    sufficient to associate the connection with the session
    specified in SEQUENCE.

But Neil would need to make sure he's using all the state associated
with the existing client.

> Why would it attempt to attach to an existing session? I would hope there
> the two different mounts with separate TCP connections would look completely
> separate - different transport, different cache, different session.
> ??

Sounds to me like sharing these things shouldn't be a problem in your
case, but I don't know.

--b.

2013-07-08 18:52:29

by Myklebust, Trond

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.

On Mon, 2013-07-08 at 09:58 +1000, NeilBrown wrote:
> 
> This patch adds a "nosharetransport" option to allow two different
> mounts from the same server to use different transports.
> If the mounts use NFSv4, or are of the same filesystem, then
> "nosharecache" must be used as well.

Won't this interfere with the recently added NFSv4 trunking detection?

Also, how will it work with NFSv4.1 sessions? The server will usually
require a BIND_CONN_TO_SESSION when new TCP connections attempt to
attach to an existing session.

> There are at least two circumstances where it might be desirable
> to use separate transports:
> 
> 1/ If the NFS server can get into a state where it will ignore
>    requests for one filesystem while servicing request for another,
>    then using separate connections for the separate filesystems can
>    stop problems with one affecting access to the other.
> 
>    This is particularly relevant for NetApp filers where one filesystem
>    has been "suspended".  Requests to that filesystem will be dropped
>    (rather than the more correct NFS3ERR_JUKEBOX).  This currently
>    interferes with other filesystems.

This is a known issue that really needs to be fixed on the server, not
on the client. As far as I know, work is already underway to fix this.

> 2/ If a very fast network is used with a many-processor client, a
>    single TCP connection can present a bottle neck which reduces total
>    throughput.  Using multiple TCP connections (one per mount) removes
>    the bottleneck.
>    An alternate workaround is to configure multiple virtual IP
>    addresses on the server and mount each filesystem from a different
>    IP.  This is effective (throughput goes up) but an unnecessary
>    administrative burden.

As I understand it, using multiple simultaneous TCP connections between
the same endpoints also adds a risk that the congestion windows will
interfere. Do you have numbers to back up the claim of a performance
improvement?

The other issue I can think of is that for NFS versions < 4.1, this may
cause the server to allocate more resources per client in the form of
replay caches etc.

> Signed-off-by: NeilBrown <[email protected]>
> 
> ---
> Is this a good idea?  Bad idea?  Have I missed something important?
> 
> NeilBrown

[quoted patch snipped]

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
www.netapp.com

2013-07-09 14:21:29

by Chuck Lever III

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.


On Jul 8, 2013, at 11:22 PM, NeilBrown <[email protected]> wrote:

> On Mon, 8 Jul 2013 18:51:40 +0000 "Myklebust, Trond"
> <[email protected]> wrote:
>
>> On Mon, 2013-07-08 at 09:58 +1000, NeilBrown wrote:
>>>
>>> This patch adds a "nosharetransport" option to allow two different
>>> mounts from the same server to use different transports.
>>> If the mounts use NFSv4, or are of the same filesystem, then
>>> "nosharecache" must be used as well.
>>
>> Won't this interfere with the recently added NFSv4 trunking detection?
>
> Will it? I googled around a bit but couldn't find anything that tells me
> what trunking really was in this context. Then I found commit 05f4c350ee02
> which makes it quite clear (thanks Chuck!).
>
> Probably the code I wrote could interfere.
>
>>
>> Also, how will it work with NFSv4.1 sessions? The server will usually
>> require a BIND_CONN_TO_SESSION when new TCP connections attempt to
>> attach to an existing session.
>
> Why would it attempt to attach to an existing session? I would hope there
> the two different mounts with separate TCP connections would look completely
> separate - different transport, different cache, different session.
> ??
>
>>
>>> There are at least two circumstances where it might be desirable
>>> to use separate transports:
>>>
>>> 1/ If the NFS server can get into a state where it will ignore
>>> requests for one filesystem while servicing request for another,
>>> then using separate connections for the separate filesystems can
>>> stop problems with one affecting access to the other.
>>>
>>> This is particularly relevant for NetApp filers where one filesystem
>>> has been "suspended". Requests to that filesystem will be dropped
>>> (rather than the more correct NFS3ERR_JUKEBOX). This currently
>>> interferes with other filesystems.
>>
>> This is a known issue that really needs to be fixed on the server, not
>> on the client. As far as I know, work is already underway to fix this.
>
> I wasn't aware of this, nor were our support people. I've passed it on so
> maybe they can bug Netapp....
>
>>
>>> 2/ If a very fast network is used with a many-processor client, a
>>> single TCP connection can present a bottle neck which reduces total
>>> throughput. Using multiple TCP connections (one per mount) removes
>>> the bottleneck.
>>> An alternate workaround is to configure multiple virtual IP
>>> addresses on the server and mount each filesystem from a different
>>> IP. This is effective (throughput goes up) but an unnecessary
>>> administrative burden.
>>
>> As I understand it, using multiple simultaneous TCP connections between
>> the same endpoints also adds a risk that the congestion windows will
>> interfere. Do you have numbers to back up the claim of a performance
>> improvement?
>
> A customer upgraded from SLES10 (2.6.16 based) to SLES11 (3.0 based) and saw
> a slowdown on some large DB jobs of between 1.5 and 2 times (i.e. total time
> 150% to 200% of what is was before).
> After some analysis they created multiple virtual IPs on the server and
> mounted the several filesystem each from different IPs and got the
> performance back (they see this as a work-around rather than a genuine
> solution).
> Numbers are like "500MB/s on a single connection, 850MB/sec peaking to
> 1000MB/sec on multiple connections".
>
> If I can get something more concrete I'll let you know.
>
> As this worked well in 2.6.16 (which doesn't try to share connections) this
> is seen as a regression.
>
> On links that are easy to saturate, congestion windows are important and
> having a single connection is probably a good idea - so the current default
> is certainly correct.
> On a 10G ethernet or infiniband connection (where the issue has been
> measured) congestion just doesn't seem to be an issue.

We've toyed with the idea of using multiple TCP connections per mount for years. The choice was made to stick with one connection (and one session on NFSv4.1) for each server.

The main limitation has been having a single RPC slot table for the transport, allowing only 16 concurrent RPC requests per server at a time. Andy and Trond did some good work making the slot table widen itself dynamically as the TCP window opens.

A secondary concern is head-of-queue blocking. The server end can certainly stall a client by not taking the top request off the socket queue, and thereby delay any requests that are behind that one in the queue. I think the preferred solution there is to build out support for RPC over SCTP, and use SCTP's multi-stream feature. Alternately we might choose to try out M-TCP. Server implementations can also be made sensitive to this issue to help prevent delays.

A tertiary issue is contention for the transport on multi-socket systems. For a long while I've suspected it may occur, but I've never measured it in practice.

Re: the problem at hand: You've definitely measured a performance regression. However, I don't think those numbers explain _why_ it is occurring.

The first thing to check is whether SuSE11 has the dynamic RPC slot table logic I mentioned above. I think it starts with upstream commit d9ba131d, but someone should correct me if I'm wrong.
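(If a kernel git tree is handy, one quick way to check whether a given
tree has that commit is something like:

   git describe --contains d9ba131d   # names a tag that contains the commit

though a distro kernel such as SLES11 may instead carry it as a
backported patch, in which case inspecting the SUNRPC sources directly
is more reliable.)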

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2013-07-09 10:01:20

by NeilBrown

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.

On Tue, 9 Jul 2013 13:22:53 +1000 NeilBrown <[email protected]> wrote:


> A customer upgraded from SLES10 (2.6.16 based) to SLES11 (3.0 based) and saw
> a slowdown on some large DB jobs of between 1.5 and 2 times (i.e. total time
> 150% to 200% of what is was before).
> After some analysis they created multiple virtual IPs on the server and
> mounted the several filesystem each from different IPs and got the
> performance back (they see this as a work-around rather than a genuine
> solution).
> Numbers are like "500MB/s on a single connection, 850MB/sec peaking to
> 1000MB/sec on multiple connections".
>
> If I can get something more concrete I'll let you know.

Slightly more concrete:

4 mounts from the one server, 10 threads of fio on each mount.
All over a 10G Ethernet.

1 IP address without "nosharetransport": ~700MB/s
4 IP addresses without "nosharetransport": ~1100MB/s
1 IP address with "nosharetransport": ~1100MB/s
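
(The exact fio job isn't shown here.  Purely as an illustration of the
shape of the test, with all parameters guessed rather than taken from
the actual runs, each mount would get something like:

   fio --name=test --directory=/mnt/fs1 --rw=read --bs=1M --size=4g \
       --numjobs=10 --group_reporting    # 10 threads against one mount

run concurrently for each of the four mount points.)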

This is all NFSv3. NFSv4 is much slower with nosharecache (32MB/s!), but I
might have botched the backport to 3.0 (or didn't address the v4 specific
issues you raised). I haven't looked into that yet.


NeilBrown



2013-07-09 03:23:06

by NeilBrown

Subject: Re: [PATCH - RFC] new "nosharetransport" option for NFS mounts.

On Mon, 8 Jul 2013 18:51:40 +0000 "Myklebust, Trond"
<[email protected]> wrote:

> On Mon, 2013-07-08 at 09:58 +1000, NeilBrown wrote:
> >
> > This patch adds a "nosharetransport" option to allow two different
> > mounts from the same server to use different transports.
> > If the mounts use NFSv4, or are of the same filesystem, then
> > "nosharecache" must be used as well.
>
> Won't this interfere with the recently added NFSv4 trunking detection?

Will it? I googled around a bit but couldn't find anything that tells me
what trunking really was in this context. Then I found commit 05f4c350ee02
which makes it quite clear (thanks Chuck!).

Probably the code I wrote could interfere.

>
> Also, how will it work with NFSv4.1 sessions? The server will usually
> require a BIND_CONN_TO_SESSION when new TCP connections attempt to
> attach to an existing session.

Why would it attempt to attach to an existing session? I would hope there
the two different mounts with separate TCP connections would look completely
separate - different transport, different cache, different session.
??

>
> > There are at least two circumstances where it might be desirable
> > to use separate transports:
> >
> > 1/ If the NFS server can get into a state where it will ignore
> > requests for one filesystem while servicing request for another,
> > then using separate connections for the separate filesystems can
> > stop problems with one affecting access to the other.
> >
> > This is particularly relevant for NetApp filers where one filesystem
> > has been "suspended". Requests to that filesystem will be dropped
> > (rather than the more correct NFS3ERR_JUKEBOX). This currently
> > interferes with other filesystems.
>
> This is a known issue that really needs to be fixed on the server, not
> on the client. As far as I know, work is already underway to fix this.

I wasn't aware of this, nor were our support people. I've passed it on so
maybe they can bug Netapp....

>
> > 2/ If a very fast network is used with a many-processor client, a
> > single TCP connection can present a bottle neck which reduces total
> > throughput. Using multiple TCP connections (one per mount) removes
> > the bottleneck.
> > An alternate workaround is to configure multiple virtual IP
> > addresses on the server and mount each filesystem from a different
> > IP. This is effective (throughput goes up) but an unnecessary
> > administrative burden.
>
> As I understand it, using multiple simultaneous TCP connections between
> the same endpoints also adds a risk that the congestion windows will
> interfere. Do you have numbers to back up the claim of a performance
> improvement?

A customer upgraded from SLES10 (2.6.16 based) to SLES11 (3.0 based) and saw
a slowdown on some large DB jobs of between 1.5 and 2 times (i.e. total time
150% to 200% of what it was before).
After some analysis they created multiple virtual IPs on the server and
mounted the several filesystems each from a different IP and got the
performance back (they see this as a work-around rather than a genuine
solution).
Numbers are like "500MB/s on a single connection, 850MB/sec peaking to
1000MB/sec on multiple connections".

If I can get something more concrete I'll let you know.

As this worked well in 2.6.16 (which doesn't try to share connections) this
is seen as a regression.

On links that are easy to saturate, congestion windows are important and
having a single connection is probably a good idea - so the current default
is certainly correct.
On a 10G ethernet or infiniband connection (where the issue has been
measured) congestion just doesn't seem to be an issue.

Thanks,
NeilBrown

