2017-08-31 17:34:22

by Kjetil Joergensen

Subject: linux>=4.10: PUTFH|GETATTR|CLOSE, GETATTR fails, CLOSE not re-issued

Hi,

(Now - I do not actually know the specification(s) all that well, so
it may be that I've accidentally cherry-picked the bits that partially
turn this into a linux-nfs-client bug, and I'd be more than happy
with responses that'd be useful to yell at NetApp with.)

After d8d849835eb2082ea17655538a83fa467633927f (NFSv4: Place the
GETATTR operation before the CLOSE), if GETATTR actually fails, CLOSE
will never be processed by the server, and it seems the Linux NFS
client never tries to re-issue CLOSE.

We have client A holding file F open. Client B goes ahead and unlinks
F, and at some point client A does PUTFH,GETATTR, to which the server
responds NFS4ERR_STALE.

Now client A goes ahead and tries to clean up its internal state,
and sends the server the compound PUTFH,GETATTR,CLOSE, to which the
server responds with PUTFH(NFS4_OK),GETATTR(NFS4ERR_STALE).

This seems correct in the eyes of RFC 7530, section 14.2, which says
the server should stop processing the compound when a subop fails.

The server has not processed the CLOSE op, and in the case of netapp
it appears it keeps holding on to the stateid, waiting for the client
to CLOSE it.

Judging from tcpdump, the client never attempts to re-issue the CLOSE
op that wasn't processed.

On the server side, the stateid sticks around until we tear down the
client completely (umount or re-boot). Over time, this leads the
netapp to bleed stateids.

Compare this to pre-d8d849835eb2082ea17655538a83fa467633927f, where the
client issues PUTFH,CLOSE,GETATTR. Both PUTFH and CLOSE succeed;
GETATTR, as expected, still gets NFS4ERR_STALE. The server did, however,
process CLOSE and retire its stateid.
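
The difference between the two orderings can be illustrated with a small simulation. This is only a sketch under stated assumptions: `ToyServer`, its methods and the stateid name `stateid-1` are hypothetical and model nothing more than the "stop at the first failing op" rule described above. It is not server or kernel code.

```python
from typing import List, Tuple

NFS4_OK = "NFS4_OK"
NFS4ERR_STALE = "NFS4ERR_STALE"


class ToyServer:
    """Hypothetical stand-in for an NFSv4 server (illustration only)."""

    def __init__(self) -> None:
        self.file_unlinked = True            # client B already unlinked F
        self.open_stateids = {"stateid-1"}   # client A still holds F open

    def putfh(self) -> str:
        # In the traces described above, PUTFH succeeds.
        return NFS4_OK

    def getattr(self) -> str:
        # GETATTR on the unlinked file fails with NFS4ERR_STALE.
        return NFS4ERR_STALE if self.file_unlinked else NFS4_OK

    def close(self) -> str:
        # CLOSE releases the open stateid.
        self.open_stateids.discard("stateid-1")
        return NFS4_OK

    def compound(self, ops: List[str]) -> List[Tuple[str, str]]:
        """Run ops in order and stop at the first one that fails."""
        handlers = {"PUTFH": self.putfh, "GETATTR": self.getattr, "CLOSE": self.close}
        results = []
        for op in ops:
            status = handlers[op]()
            results.append((op, status))
            if status != NFS4_OK:
                break  # later ops in the compound are never processed
        return results


new_order = ToyServer()
print(new_order.compound(["PUTFH", "GETATTR", "CLOSE"]), new_order.open_stateids)

old_order = ToyServer()
print(old_order.compound(["PUTFH", "CLOSE", "GETATTR"]), old_order.open_stateids)
```

With the new ordering the toy server never reaches CLOSE, so the stateid stays in its set; with the old ordering CLOSE runs before GETATTR fails and the set ends up empty.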

Cheers,

--
Kjetil Joergensen <[email protected]>
Phone: +1 (650) 739-6580


2017-09-01 18:44:14

by Weston Andros Adamson

Subject: Re: linux>=4.10: PUTFH|GETATTR|CLOSE, GETATTR fails, CLOSE not re-issued

Nice analysis! I think post d8d849835eb2082ea17655538a83fa467633927f, we
need to retry with a [PUTFH, CLOSE] if the GETATTR fails.

The problem as I see it is the GETATTR is tied to the CURRENT_FH, which is
stale for new operations since the file was unlinked, but the CLOSE is tied
to the (CURRENT_FH, open stateid) pair and is not stale because the stateid
is still valid.
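
A minimal sketch of the retry idea suggested here, assuming a hypothetical `send_compound` callable that returns the list of `(op, status)` pairs the server actually processed. Neither `close_with_retry` nor `fake_server` exists in the Linux client; they only illustrate the proposed fallback to [PUTFH, CLOSE].

```python
from typing import Callable, List, Tuple

Result = List[Tuple[str, str]]


def close_with_retry(send_compound: Callable[[List[str]], Result]) -> Result:
    """Try PUTFH,GETATTR,CLOSE; if GETATTR fails, retry with PUTFH,CLOSE."""
    results = send_compound(["PUTFH", "GETATTR", "CLOSE"])
    if any(op == "CLOSE" for op, _ in results):
        return results  # CLOSE was processed, nothing more to do
    failed_op, _status = results[-1]
    if failed_op == "GETATTR":
        # Attributes are unavailable, but the open state may still need
        # to be released, so drop GETATTR and send CLOSE on its own.
        return send_compound(["PUTFH", "CLOSE"])
    return results


def fake_server(ops: List[str]) -> Result:
    """Hypothetical server: GETATTR is always stale, everything else succeeds."""
    results = []
    for op in ops:
        status = "NFS4ERR_STALE" if op == "GETATTR" else "NFS4_OK"
        results.append((op, status))
        if status != "NFS4_OK":
            break  # stop at the first failing op
    return results


print(close_with_retry(fake_server))
```

Whether the client should do this at all is a separate question, and it is the one the rest of the thread argues about.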

Trond is out on PTO, should be back on or before next Tuesday. The recent
change was his and he might have a better idea how to handle this.

-dros




2017-09-05 17:51:08

by Weston Andros Adamson

Subject: Re: linux>=4.10: PUTFH|GETATTR|CLOSE, GETATTR fails, CLOSE not re-issued

I chatted with Trond about this and he says it's a server bug if an unlinked
file keeps stateids around - the client doesn't need to issue a close in this
case.

What version of ONTAP are you running?

-dros




2017-09-05 22:31:42

by Kjetil Joergensen

Subject: Re: linux>=4.10: PUTFH|GETATTR|CLOSE, GETATTR fails, CLOSE not re-issued

Hi,

On Tue, Sep 5, 2017 at 10:51 AM, Weston Andros Adamson <[email protected]> wrote:
>
> I chatted with Trond about this and he says it's a server bug if an unlinked file
> keeps stateids around - the client doesn't need to issue a close in this case.

We don't disagree that this is a bug with the server; it is, after all,
a rather efficient denial-of-service attack against it (especially if
you don't dismantle your clients all that often).

Although, not calling CLOSE under certain circumstances doesn't seem correct.

Continuing to cherrypick from RFCs:

RFC 5661 - 8.2.4. Stateid Lifetime and Validation
  Stateids must remain valid until either a client restart or a server
  restart or until the client returns all of the locks associated with
  the stateid by means of an operation such as CLOSE or DELEGRETURN.
  If the locks are lost due to revocation, as long as the client ID is
  valid, the stateid remains a valid designation of that revoked state
  until the client frees it by using FREE_STATEID.

> What version of ONTAP are you running?

Version: NetApp Release 8.2.4P6 7-Mode: Wed Jan 11 01:07:08 PST 2017





--
Kjetil Joergensen <[email protected]>
SRE, Medallia Inc

2017-09-06 00:05:43

by Trond Myklebust

Subject: Re: linux>=4.10: PUTFH|GETATTR|CLOSE, GETATTR fails, CLOSE not re-issued

On Tue, 2017-09-05 at 22:44 +0000, Trond Myklebust wrote:
> We’re not fixing any server bugs on the client, and this is
> definitely a server bug. You can’t have state associated with a
> non-existent or completely inaccessible file.

Concerning your quote there about FREE_STATEID, that has nothing to do
with deleted files. It is a mechanism to allow the server to safely
cache open state in the particular case where a network partition
prevents the client from renewing its lease. There is nothing that says
it applies to deleted files, and nor is there any reason why we would
want to cache open or lock state in a case where the filehandle is
stale.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]
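
The position taken in this thread, that a server should not keep open state for a file whose filehandle it reports as stale, can be sketched as follows. This is a toy model under stated assumptions: `ToyStateTable`, `mark_stale` and the filehandle and stateid strings are hypothetical names, not ONTAP or Linux server code.

```python
from typing import Dict, Set


class ToyStateTable:
    """Hypothetical per-filehandle table of open stateids (illustration only)."""

    def __init__(self) -> None:
        self.stateids_by_fh: Dict[str, Set[str]] = {}

    def open(self, fh: str, stateid: str) -> None:
        self.stateids_by_fh.setdefault(fh, set()).add(stateid)

    def close(self, fh: str, stateid: str) -> None:
        self.stateids_by_fh.get(fh, set()).discard(stateid)

    def mark_stale(self, fh: str) -> Set[str]:
        """Drop all state tied to a filehandle that will now return STALE."""
        return self.stateids_by_fh.pop(fh, set())

    def total(self) -> int:
        return sum(len(ids) for ids in self.stateids_by_fh.values())


table = ToyStateTable()
table.open("fh-F", "stateid-1")    # client A opens F
dropped = table.mark_stale("fh-F") # F is unlinked and its filehandle goes stale
print(dropped, table.total())
```

In this model nothing is left for the client to CLOSE once the filehandle is stale, so no stateid can accumulate on the server regardless of the order of operations in the client's compound.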