After seeing Trond’s patches for NFS multipathing on NFSv4.1, we
decided to try using the same concept for NFSv3/4. The primary issue
we identified was XID collision in the duplicate request cache (replay
cache) for NFSv3/4. In NFSv3/4, entries are hashed based on XID
instead of the slot ID and sequence ID that NFSv4.1 uses. Since the
XIDs are generated by the RPC transports, and Trond’s patches create
multiple transports for multipathing, different transports can end up
using an overlapping set of XIDs.
To fix this, we apply a mask to XIDs. Each transport is constrained to
its own segment of the total XID range, and they can never overlap.
As for the loss of entropy: because we mask out only as many bits of
the XID as are needed, we believe the probability of XID wraparound or
collision on NFS client restart has not increased to a problematic
level (so long as the RPCs are distributed round-robin, as in Trond’s
patches).
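The masking idea can be sketched as follows. This is only an illustration, not the code from the patch: which bits are reserved (here the high-order ones), and the names `prefix_bits`, `segment_size` and `masked_xid`, are assumptions made for the example.

```python
XID_BITS = 32  # RPC XIDs are 32-bit values


def prefix_bits(num_transports):
    """Fewest bits that give every transport a distinct prefix."""
    return (num_transports - 1).bit_length()


def segment_size(num_transports):
    """Number of distinct XIDs left to each transport after masking."""
    return 1 << (XID_BITS - prefix_bits(num_transports))


def masked_xid(raw_xid, transport_index, num_transports):
    """Confine raw_xid to the segment owned by transport_index.

    Assumption: the high-order bits carry the transport index and the
    remaining low-order bits come from the transport's own XID counter.
    """
    if not 0 <= transport_index < num_transports:
        raise ValueError("transport_index out of range")
    low_bits = XID_BITS - prefix_bits(num_transports)
    low_mask = (1 << low_bits) - 1
    return (transport_index << low_bits) | (raw_xid & low_mask)
```

With four transports, two bits are reserved, so each transport keeps 2^30 XIDs before it wraps, and two transports can never produce the same value.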
We tested multipathing out and discovered that it enables NFS to get
more bandwidth on a bonded interface (instead of using only one
physical link, it can use multiple). Specifically, we tested on a
setup where the client was connected to the server via 4 bonded 10Gb/s
links. Without multipathing, the client could only achieve 10Gb/s
(using one physical link). With multipathing, the client was able to
achieve a maximum of close to 40Gb/s.
However, although the maximum performance was close to 40Gb/s,
achieving an average throughput of even 30Gb/s required many
connections. The performance of individual trials had a high
variance. We traced this uneven performance to colliding network
paths. With round-robin distribution of RPCs, no single TCP
connection can exceed the performance of the slowest one. If the
connections are distributed unevenly across network paths, some
connections can bottleneck others. To solve this problem, we are
currently working on patches to provide load-balancing as an
alternative to round-robin for distributing RPCs.
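A toy model of the round-robin bottleneck, to make the reasoning concrete. The assumptions are ours and deliberately crude: connections that land on the same physical link split it evenly, every placement of connections onto links is equally likely, and under round-robin every connection runs at the rate of the slowest one. The function names are made up for this sketch.

```python
from collections import Counter
from itertools import product

LINK_GBPS = 10.0  # capacity of one physical link in the bond


def round_robin_throughput(placement, link_gbps=LINK_GBPS):
    """Aggregate rate when placement[i] is the link used by connection i.

    Round-robin gives every connection the same share of RPCs, so all
    connections are held to the rate of the one on the busiest link.
    """
    load = Counter(placement)
    busiest = max(load.values())
    slowest_rate = link_gbps / busiest
    return len(placement) * slowest_rate


def expected_throughput(num_connections, num_links, link_gbps=LINK_GBPS):
    """Average aggregate rate over all equally likely placements."""
    placements = product(range(num_links), repeat=num_connections)
    total = sum(round_robin_throughput(p, link_gbps) for p in placements)
    return total / num_links ** num_connections
```

In this model four connections spread perfectly over four links reach 40Gb/s, but `round_robin_throughput((0, 0, 1, 2))` is only 20Gb/s because two connections share link 0. Averaged over placements, the result stays well below the peak and rises only slowly as connections are added, which is the same shape as what we measured.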
To use these patches, you first have to apply Trond's 5 patches
(available at https://www.spinics.net/lists/linux-nfs/msg63368.html).
Let us know what you think or if you have any ideas for improving
this.
Jui-Yu Chang (1):
  NFS: Allow multiple connections to NFSv3 and NFSv4.0 servers

Bennett Amodio (1):
  SUNRPC: Mask XIDs to prevent replay cache collision

 fs/nfs/client.c             |  3 +++
 fs/nfs/nfs4client.c         |  2 +-
 include/linux/sunrpc/xprt.h |  5 +++++
 net/sunrpc/clnt.c           |  8 ++++++++
 net/sunrpc/xprt.c           | 14 ++++++--------
 5 files changed, 23 insertions(+), 9 deletions(-)
--
1.9.1
On Tue, 2017-08-15 at 17:46 -0700, Bennett Amodio wrote:
> After seeing Trond’s patches for NFS multipathing on NFSv4.1, we
> decided to try using the same concept for NFSv3/4. The primary issue
> we identified was XID collision in the duplicate request cache
> (replay cache) for NFSv3/4. In NFSv3/4, entries are hashed based on XID
> instead of the slot ID and sequence ID that NFSv4.1 uses. Since the
> XIDs are generated by the RPC transports, and Trond’s patches create
> multiple transports for multipathing, different transports can end up
> using an overlapping set of XIDs.

Why is that a problem? You should end up with connections that show
different combinations of source IP+port and/or destination IP+port. It
should be trivial to distinguish between XIDs.

Quite frankly, I do not want to start carving up the XID space, since a
32-bit number is really not that big in these days of 100GigE networks.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
On Fri, Aug 18, 2017 at 7:57 AM, Trond Myklebust wrote:
> On Tue, 2017-08-15 at 17:46 -0700, Bennett Amodio wrote:
>> After seeing Trond’s patches for NFS multipathing on NFSv4.1, we
>> decided to try using the same concept for NFSv3/4. The primary issue
>> we identified was XID collision in the duplicate request cache
>> (replay cache) for NFSv3/4. In NFSv3/4, entries are hashed based on XID
>> instead of the slot ID and sequence ID that NFSv4.1 uses. Since the
>> XIDs are generated by the RPC transports, and Trond’s patches create
>> multiple transports for multipathing, different transports can end up
>> using an overlapping set of XIDs.
>
> Why is that a problem? You should end up with connections that show
> different combinations of source IP+port and/or destination IP+port. It
> should be trivial to distinguish between XIDs.
Although the Linux NFS server hashes cache entries based on source IP
and source port as well as XID, this is not a requirement of the
NFSv3/v4 specification, so NFS server implementations may exist which
hash only based on source IP and XID. In practice, is this uncommon
enough that it's not worth addressing?
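To make the concern concrete, here is a stripped-down replay cache with a pluggable key function. It is a sketch only: real duplicate request caches also compare things such as the procedure and a checksum, and `Request`, `ReplayCache` and the two key functions are names invented for this example.

```python
from collections import namedtuple

Request = namedtuple("Request", "src_ip src_port xid payload")


def key_ip_port_xid(req):
    """Key that includes the source port, as the Linux server does."""
    return (req.src_ip, req.src_port, req.xid)


def key_ip_xid(req):
    """Hypothetical key that ignores the source port."""
    return (req.src_ip, req.xid)


class ReplayCache:
    """Remember replies so a retransmitted request is not re-executed."""

    def __init__(self, key_fn):
        self._key_fn = key_fn
        self._replies = {}

    def handle(self, req, execute):
        """Return (reply, was_replay) for req."""
        key = self._key_fn(req)
        if key in self._replies:
            return self._replies[key], True
        reply = execute(req)
        self._replies[key] = reply
        return reply, False
```

Two different requests sent from the same address on ports 700 and 701 with the same XID are both executed when the cache is built with `key_ip_port_xid`. With `key_ip_xid` the second one is mistaken for a retransmission and receives the first request's reply.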
> Quite frankly, I do not want to start carving up the XID space, since a
> 32-bit number is really not that big in these days of 100GigE networks.
This is a good point, and we also think that carving up the XID space
is not a great solution. If XID collision is a problem, another
solution could be an atomic XID shared between transports which belong
to the same client.
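Roughly what we have in mind for the shared counter, written as a user-space sketch rather than kernel code. A lock stands in for the atomic increment, and the class name `SharedXidAllocator` is made up for the example.

```python
import threading


class SharedXidAllocator:
    """One XID counter shared by all transports of a client."""

    def __init__(self, start=0):
        self._next = start & 0xFFFFFFFF
        self._lock = threading.Lock()

    def alloc(self):
        """Return the next XID, wrapping at 32 bits."""
        with self._lock:
            xid = self._next
            self._next = (self._next + 1) & 0xFFFFFFFF
            return xid
```

Every transport would call `alloc()` on the same object, so no two outstanding requests from one client share an XID until the full 32-bit space has been used, and no bits are given up to a per-transport prefix.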
If there's no problem in the first place, that's even better. We
thought when you said "I don't feel comfortable subjecting NFSv3/v4
replay caches to this treatment yet" that you were referring to XID
collision. Is there another potential issue with multipathing and
replay caches?
Cheers!
Bennett Amodio
On Fri, 2017-08-18 at 13:15 -0700, Bennett Amodio wrote:
> On Fri, Aug 18, 2017 at 7:57 AM, Trond Myklebust wrote:
> > On Tue, 2017-08-15 at 17:46 -0700, Bennett Amodio wrote:
> > > After seeing Trond’s patches for NFS multipathing on NFSv4.1, we
> > > decided to try using the same concept for NFSv3/4. The primary
> > > issue we identified was XID collision in the duplicate request
> > > cache (replay cache) for NFSv3/4. In NFSv3/4, entries are hashed
> > > based on XID instead of the slot ID and sequence ID that NFSv4.1
> > > uses. Since the XIDs are generated by the RPC transports, and
> > > Trond’s patches create multiple transports for multipathing,
> > > different transports can end up using an overlapping set of XIDs.
> >
> > Why is that a problem? You should end up with connections that show
> > different combinations of source IP+port and/or destination
> > IP+port. It should be trivial to distinguish between XIDs.
>
> Although the Linux NFS server hashes cache entries based on source IP
> and source port as well as XID, this is not a requirement of the
> NFSv3/v4 specification, so NFS server implementations may exist which
> hash only based on source IP and XID. In practice, is this uncommon
> enough that it's not worth addressing?

There is nothing in RFC1813 that gives any direction on how to set up a
duplicate replay cache (DRC). However established practice dictates
that the server should be prepared for duplicate XIDs that originate
from the same IP address.
In particular, if the linux client connects more than once to your
server (e.g. through 2 different IP addresses) it will assume the XIDs
are per connection. Ditto if using UDP.

> > Quite frankly, I do not want to start carving up the XID space,
> > since a 32-bit number is really not that big in these days of
> > 100GigE networks.
>
> This is a good point, and we also think that carving up the XID space
> is not a great solution. If XID collision is a problem, another
> solution could be an atomic XID shared between transports which
> belong to the same client.
>
> If there's no problem in the first place, that's even better. We
> thought when you said "I don't feel comfortable subjecting NFSv3/v4
> replay caches to this treatment yet" that you were referring to XID
> collision. Is there another potential issue with multipathing and
> replay caches?

There is the question of what to do when a NIC goes down. Do we fail
over to a different connection or not? The existing practices w.r.t.
DRCs suggest that we cannot do so; for instance the Linux server DRC
would break in that case, leading potentially to issues with non-
idempotent operations that need to be replayed.

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
On Fri, Aug 18, 2017 at 4:31 PM, Trond Myklebust wrote:
> On Fri, 2017-08-18 at 13:15 -0700, Bennett Amodio wrote:
>> On Fri, Aug 18, 2017 at 7:57 AM, Trond Myklebust wrote:
>> > On Tue, 2017-08-15 at 17:46 -0700, Bennett Amodio wrote:
>> > > After seeing Trond’s patches for NFS multipathing on NFSv4.1, we
>> > > decided to try using the same concept for NFSv3/4. The primary
>> > > issue we identified was XID collision in the duplicate request
>> > > cache (replay cache) for NFSv3/4. In NFSv3/4, entries are hashed
>> > > based on XID instead of the slot ID and sequence ID that NFSv4.1
>> > > uses. Since the XIDs are generated by the RPC transports, and
>> > > Trond’s patches create multiple transports for multipathing,
>> > > different transports can end up using an overlapping set of XIDs.
>> >
>> > Why is that a problem? You should end up with connections that show
>> > different combinations of source IP+port and/or destination
>> > IP+port. It
>> > should be trivial to distinguish between XIDs.
>>
>> Although the Linux NFS server hashes cache entries based on source IP
>> and source port as well as XID, this is not a requirement of the
>> NFSv3/v4 specification, so NFS server implementations may exist which
>> hash only based on source IP and XID. In practice, is this uncommon
>> enough that it's not worth addressing?
>
> There is nothing in RFC1813 that gives any direction on how to set up a
> duplicate replay cache (DRC). However established practice dictates
> that the server should be prepared for duplicate XIDs that originate
> from the same IP address.
> In particular, if the linux client connects more than once to your
> server (e.g. through 2 different IP addresses) it will assume the XIDs
> are per connection. Ditto if using UDP.
Understood, thanks for the clarification!
>> > Quite frankly, I do not want to start carving up the XID space,
>> > since a
>> > 32-bit number is really not that big in these days of 100GigE
>> > networks.
>>
>> This is a good point, and we also think that carving up the XID space
>> is not a great solution. If XID collision is a problem, another
>> solution could be an atomic XID shared between transports which
>> belong
>> to the same client.
>>
>> If there's no problem in the first place, that's even better. We
>> thought when you said "I don't feel comfortable subjecting NFSv3/v4
>> replay caches to this treatment yet" that you were referring to XID
>> collision. Is there another potential issue with multipathing and
>> replay caches?
>
> There is the question of what to do when a NIC goes down. Do we fail
> over to a different connection or not? The existing practices w.r.t.
> DRCs suggest that we cannot do so; for instance the Linux server DRC
> would break in that case, leading potentially to issues with non-
> idempotent operations that need to be replayed.
If I'm understanding correctly, what you're suggesting is that a
failover could cause a duplicate request with a different source IP
(from the new interface). This would lead to the server not
recognizing the request as a duplicate and re-running a non-idempotent
operation. I don't see how this is a new problem, though. If you
have one connection, going through one interface, and that interface
goes down, don't you still have to choose between (potentially bad)
failover and hanging?
If the interfaces are bonded, I don't think this is an issue at all, is it?
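Here is the failover scenario as I understand it, reduced to a toy server. The DRC is keyed on (source IP, XID) purely for illustration, the `Server` class is made up for this example, and the replies are just strings named after the NFSv3 status codes.

```python
class Server:
    """Toy server with a replay cache keyed on (source IP, XID)."""

    def __init__(self, files):
        self.files = set(files)
        self.drc = {}

    def remove(self, src_ip, xid, name):
        """Non-idempotent REMOVE: a second execution fails."""
        key = (src_ip, xid)
        if key in self.drc:
            # Recognized as a retransmission: resend the cached reply.
            return self.drc[key]
        if name in self.files:
            self.files.remove(name)
            reply = "NFS3_OK"
        else:
            reply = "NFS3ERR_NOENT"
        self.drc[key] = reply
        return reply
```

If the first reply is lost and the client retransmits XID 7 from the same address, the server resends `NFS3_OK`. If the retransmission instead arrives from a second address after a failover, the cache lookup misses, the REMOVE runs again, and the client sees `NFS3ERR_NOENT` for an operation that actually succeeded.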
Cheers!
Bennett Amodio