2012-04-10 16:03:48

by Sachin Prabhu

[permalink] [raw]
Subject: ESTALE errors: What can be done to handle this problem.

The change introduced by upstream patch
4f48af45842c6e78ab958c90344d3c940db4da15
means that the lookup call for a file will return a inode as being valid
as long as
1) The directory containing the file was revalidated ie. the cached
attributes are valid and it was found that there was no change in the
directory since the attributes of the file were cached.
2) We aren't doing a OPEN call or the LOOKUP_REVAL flag is not set.

We no longer check to see if the file attributes themselves are valid.
This patch was introduced in 2.6.23.

This increases the chances of ESTALE errors on GETATTR calls if
1) The attributes of the directory containing the file is refreshed on
the client and the values stored in the attribute cache are still valid
when the subsequent steps are run.
2) The file is deleted on the server.
3) The cached attributes of the file on the client have expired.
4) We stat the file.

For a GETATTR call
1) The lookup will return the inode as valid.
2) The subsequent getattr call will notice that the attributes are no
longer valid and will attempt a on the wire call which fails with an
ESTALE.
We cannot recover at this point and return an ESTALE error to userland.

We can easily reproduce this with the following reproducer.

On the server
# while true; do date >b ; rm -f a ; mv b a;sleep 3; done

On the client
#while true; do stat a 2>&1 | grep Stale ; done

We do not see this problem in kernels not containing this patch since we
revalidate the file inode before we return the inode as valid in the
lookup call. Here we notice that the attribute cache is no longer valid
and do an over the wire GETATTR call to fetch the latest file
attributes. An ESTALE error in the lookup phase is handled by redoing
the lookup call. This still has a potential of failing with an ESTALE if
the attribute cache expires between the lookup and the getattr calls.
But chances of this happening are relatively low.

A workaround is to disable attribute caching which is accompanied by a
huge performance hit. A proper fix would be to just redo the lookup call
in case we ever encounter an ESTALE error with a limit to the number of
retries which can be made. However this approach will involve a bit of
work in the VFS layer. We have a user provided patch which does just
this but is incomplete.

Is this approach feasible? If not, what else can be done to avoid this
problem.

Sachin Prabhu



2012-04-10 20:03:49

by Peter Staubach

[permalink] [raw]
Subject: RE: ESTALE errors: What can be done to handle this problem.

Hi.

In the real world, practically speaking, the infinite loop will not happen. Eventually, the client will get a lookup and getattr done and return to the application. Whatever counter value is chosen is wrong. By definition.

We can all imagine situations and scenarios, but they simply won't happen. The Solaris kernel is proof of this. It has contained support to address this situation for many years now and without having a huge hue and cry from its users and at one point, it actually was used by a lot of people. :-)

I tried to address this situation while I was at Red Hat and finally gave up because no one wanted to solve the problem for real. This has to be solved in the vfs and system call layers.

ps


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of J. Bruce Fields
Sent: Tuesday, April 10, 2012 3:01 PM
To: Myklebust, Trond
Cc: Sachin Prabhu; Linux NFS mailing list; [email protected]; steved
Subject: Re: ESTALE errors: What can be done to handle this problem.

On Tue, Apr 10, 2012 at 04:57:42PM +0000, Myklebust, Trond wrote:
> Handling ESTALE still doesn't guarantee that you can make progress.
> Remove the 'sleep 3' above, and you can theoretically find yourself
> replaying lookups until the cows come home while that 'stat()' call
> continues to return ESTALE.
> The bottom line is that NFS is not safe in situations such as the
> above,

By the way, what precisely are the "situations such as the above"? Or put in another way, what are the rules users need to know?

("Operations on paths affected by a directory-modifying operation should not be attempted until after sufficient time has passed for the lookup cache to time out"? No, it should be a weaker rule than that.)

> since we don't have the kind of locking required to guarantee that
> LOOKUP + GETATTR can be done atomically.

Though all we actually need is a way to hold a temporary reference on the looked up filehandle. In theory that wouldn't be hard to add?:

- Clarify that current and saved filehandles shouldn't go stale
during the lifetime of a compound.
- If we also need to hold filehandle references across
compounds, introduce new operations for that (and make the
references part of the client's leased state)

--b.

>
> > Is this approach feasible? If not, what else can be done to avoid
> > this problem.
>
> If you have workloads such as the above, then I suggest "mount
> -olookupcache=none". It still won't prevent ESTALE, but at least you
> ensure that the dentry revalidation always does a full lookup.
>
> Otherwise, you can do as Jeff suggested: handle the ESTALE at the VFS
> level, but ensure that you break out of the loop after a limited
> number of attempts has been made.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>

2012-04-10 19:01:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: ESTALE errors: What can be done to handle this problem.

On Tue, Apr 10, 2012 at 04:57:42PM +0000, Myklebust, Trond wrote:
> Handling ESTALE still doesn't guarantee that you can make progress.
> Remove the 'sleep 3' above, and you can theoretically find yourself
> replaying lookups until the cows come home while that 'stat()' call
> continues to return ESTALE.
> The bottom line is that NFS is not safe in situations such as the above,

By the way, what precisely are the "situations such as the above"? Or
put in another way, what are the rules users need to know?

("Operations on paths affected by a directory-modifying operation should
not be attempted until after sufficient time has passed for the lookup
cache to time out"? No, it should be a weaker rule than that.)

> since we don't have the kind of locking required to guarantee that
> LOOKUP + GETATTR can be done atomically.

Though all we actually need is a way to hold a temporary reference on
the looked up filehandle. In theory that wouldn't be hard to add?:

- Clarify that current and saved filehandles shouldn't go stale
during the lifetime of a compound.
- If we also need to hold filehandle references across
compounds, introduce new operations for that (and make the
references part of the client's leased state)

--b.

>
> > Is this approach feasible? If not, what else can be done to avoid this
> > problem.
>
> If you have workloads such as the above, then I suggest "mount
> -olookupcache=none". It still won't prevent ESTALE, but at least you
> ensure that the dentry revalidation always does a full lookup.
>
> Otherwise, you can do as Jeff suggested: handle the ESTALE at the VFS
> level, but ensure that you break out of the loop after a limited number
> of attempts has been made.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>

2012-04-10 16:57:44

by Myklebust, Trond

[permalink] [raw]
Subject: Re: ESTALE errors: What can be done to handle this problem.

T24gVHVlLCAyMDEyLTA0LTEwIGF0IDE3OjAzICswMTAwLCBTYWNoaW4gUHJhYmh1IHdyb3RlOg0K
PiBUaGUgY2hhbmdlIGludHJvZHVjZWQgYnkgdXBzdHJlYW0gcGF0Y2gNCj4gNGY0OGFmNDU4NDJj
NmU3OGFiOTU4YzkwMzQ0ZDNjOTQwZGI0ZGExNQ0KPiBtZWFucyB0aGF0IHRoZSBsb29rdXAgY2Fs
bCBmb3IgYSBmaWxlIHdpbGwgcmV0dXJuIGEgaW5vZGUgYXMgYmVpbmcgdmFsaWQNCj4gYXMgbG9u
ZyBhcyANCj4gMSkgVGhlIGRpcmVjdG9yeSBjb250YWluaW5nIHRoZSBmaWxlIHdhcyByZXZhbGlk
YXRlZCBpZS4gdGhlIGNhY2hlZA0KPiBhdHRyaWJ1dGVzIGFyZSB2YWxpZCBhbmQgaXQgd2FzIGZv
dW5kIHRoYXQgdGhlcmUgd2FzIG5vIGNoYW5nZSBpbiB0aGUNCj4gZGlyZWN0b3J5IHNpbmNlIHRo
ZSBhdHRyaWJ1dGVzIG9mIHRoZSBmaWxlIHdlcmUgY2FjaGVkLg0KPiAyKSBXZSBhcmVuJ3QgZG9p
bmcgYSBPUEVOIGNhbGwgb3IgdGhlIExPT0tVUF9SRVZBTCBmbGFnIGlzIG5vdCBzZXQuDQo+IA0K
PiBXZSBubyBsb25nZXIgY2hlY2sgdG8gc2VlIGlmIHRoZSBmaWxlIGF0dHJpYnV0ZXMgdGhlbXNl
bHZlcyBhcmUgdmFsaWQuDQo+IFRoaXMgcGF0Y2ggd2FzIGludHJvZHVjZWQgaW4gMi42LjIzLg0K
PiANCj4gVGhpcyBpbmNyZWFzZXMgdGhlIGNoYW5jZXMgb2YgRVNUQUxFIGVycm9ycyBvbiBHRVRB
VFRSIGNhbGxzIGlmDQo+IDEpIFRoZSBhdHRyaWJ1dGVzIG9mIHRoZSBkaXJlY3RvcnkgY29udGFp
bmluZyB0aGUgZmlsZSBpcyByZWZyZXNoZWQgb24NCj4gdGhlIGNsaWVudCBhbmQgdGhlIHZhbHVl
cyBzdG9yZWQgaW4gdGhlIGF0dHJpYnV0ZSBjYWNoZSBhcmUgc3RpbGwgdmFsaWQNCj4gd2hlbiB0
aGUgc3Vic2VxdWVudCBzdGVwcyBhcmUgcnVuLg0KPiAyKSBUaGUgZmlsZSBpcyBkZWxldGVkIG9u
IHRoZSBzZXJ2ZXIuDQo+IDMpIFRoZSBjYWNoZWQgYXR0cmlidXRlcyBvZiB0aGUgZmlsZSBvbiB0
aGUgY2xpZW50IGhhdmUgZXhwaXJlZC4NCj4gNCkgV2Ugc3RhdCB0aGUgZmlsZS4NCj4gDQo+IEZv
ciBhIEdFVEFUVFIgY2FsbA0KPiAxKSBUaGUgbG9va3VwIHdpbGwgcmV0dXJuIHRoZSBpbm9kZSBh
cyB2YWxpZC4NCj4gMikgVGhlIHN1YnNlcXVlbnQgZ2V0YXR0ciBjYWxsIHdpbGwgbm90aWNlIHRo
YXQgdGhlIGF0dHJpYnV0ZXMgYXJlIG5vDQo+IGxvbmdlciB2YWxpZCBhbmQgd2lsbCBhdHRlbXB0
IGEgb24gdGhlIHdpcmUgY2FsbCB3aGljaCBmYWlscyB3aXRoIGFuDQo+IEVTVEFMRS4gDQo+IFdl
IGNhbm5vdCByZWNvdmVyIGF0IHRoaXMgcG9pbnQgYW5kIHJldHVybiBhbiBFU1RBTEUgZXJyb3Ig
dG8gdXNlcmxhbmQuDQo+IA0KPiBXZSBjYW4gZWFzaWx5IHJlcHJvZHVjZSB0aGlzIHdpdGggdGhl
IGZvbGxvd2luZyByZXByb2R1Y2VyLg0KPiANCj4gT24gdGhlIHNlcnZlcg0KPiAjIHdoaWxlIHRy
dWU7IGRvIGRhdGUgPmIgOyBybSAtZiBhIDsgbXYgYiBhO3NsZWVwIDM7IGRvbmUNCj4gDQo+IE9u
IHRoZSBjbGllbnQNCj4gI3doaWxlIHRydWU7IGRvIHN0YXQgYSAyPiYxIHwgZ3JlcCBTdGFsZSA7
IGRvbmUNCj4gDQo+IFdlIGRvIG5vdCBzZWUgdGhpcyBwcm9ibGVtIGluIGtlcm5lbHMgbm90IGNv
bnRhaW5pbmcgdGhpcyBwYXRjaCBzaW5jZSB3ZQ0KPiByZXZhbGlkYXRlIHRoZSBmaWxlIGlub2Rl
IGJlZm9yZSB3ZSByZXR1cm4gdGhlIGlub2RlIGFzIHZhbGlkIGluIHRoZQ0KPiBsb29rdXAgY2Fs
bC4gSGVyZSAgd2Ugbm90aWNlIHRoYXQgdGhlIGF0dHJpYnV0ZSBjYWNoZSBpcyBubyBsb25nZXIg
dmFsaWQNCj4gYW5kIGRvIGFuIG92ZXIgdGhlIHdpcmUgR0VUQVRUUiBjYWxsIHRvIGZldGNoIHRo
ZSBsYXRlc3QgZmlsZQ0KPiBhdHRyaWJ1dGVzLiBBbiBFU1RBTEUgZXJyb3IgaW4gdGhlIGxvb2t1
cCBwaGFzZSBpcyBoYW5kbGVkIGJ5IHJlZG9pbmcNCj4gdGhlIGxvb2t1cCBjYWxsLiBUaGlzIHN0
aWxsIGhhcyBhIHBvdGVudGlhbCBvZiBmYWlsaW5nIHdpdGggYW4gRVNUQUxFIGlmDQo+IHRoZSBh
dHRyaWJ1dGUgY2FjaGUgZXhwaXJlcyBiZXR3ZWVuIHRoZSBsb29rdXAgYW5kIHRoZSBnZXRhdHRy
IGNhbGxzLg0KPiBCdXQgY2hhbmNlcyBvZiB0aGlzIGhhcHBlbmluZyBhcmUgcmVsYXRpdmVseSBs
b3cuDQo+DQo+IEEgd29ya2Fyb3VuZCBpcyB0byBkaXNhYmxlIGF0dHJpYnV0ZSBjYWNoaW5nIHdo
aWNoIGlzIGFjY29tcGFuaWVkIGJ5IGENCj4gaHVnZSBwZXJmb3JtYW5jZSBoaXQuIEEgcHJvcGVy
IGZpeCB3b3VsZCBiZSB0byBqdXN0IHJlZG8gdGhlIGxvb2t1cCBjYWxsDQo+IGluIGNhc2Ugd2Ug
ZXZlciBlbmNvdW50ZXIgYW4gRVNUQUxFIGVycm9yIHdpdGggYSBsaW1pdCB0byB0aGUgbnVtYmVy
IG9mDQo+IHJldHJpZXMgd2hpY2ggY2FuIGJlIG1hZGUuIEhvd2V2ZXIgdGhpcyBhcHByb2FjaCB3
aWxsIGludm9sdmUgYSBiaXQgb2YNCj4gd29yayBpbiB0aGUgVkZTIGxheWVyLiBXZSBoYXZlIGEg
dXNlciBwcm92aWRlZCBwYXRjaCB3aGljaCBkb2VzIGp1c3QNCj4gdGhpcyBidXQgaXMgaW5jb21w
bGV0ZS4NCg0KSGFuZGxpbmcgRVNUQUxFIHN0aWxsIGRvZXNuJ3QgZ3VhcmFudGVlIHRoYXQgeW91
IGNhbiBtYWtlIHByb2dyZXNzLg0KUmVtb3ZlIHRoZSAnc2xlZXAgMycgYWJvdmUsIGFuZCB5b3Ug
Y2FuIHRoZW9yZXRpY2FsbHkgZmluZCB5b3Vyc2VsZg0KcmVwbGF5aW5nIGxvb2t1cHMgdW50aWwg
dGhlIGNvd3MgY29tZSBob21lIHdoaWxlIHRoYXQgJ3N0YXQoKScgY2FsbA0KY29udGludWVzIHRv
IHJldHVybiBFU1RBTEUuDQpUaGUgYm90dG9tIGxpbmUgaXMgdGhhdCBORlMgaXMgbm90IHNhZmUg
aW4gc2l0dWF0aW9ucyBzdWNoIGFzIHRoZSBhYm92ZSwNCnNpbmNlIHdlIGRvbid0IGhhdmUgdGhl
IGtpbmQgb2YgbG9ja2luZyByZXF1aXJlZCB0byBndWFyYW50ZWUgdGhhdA0KTE9PS1VQICsgR0VU
QVRUUiBjYW4gYmUgZG9uZSBhdG9taWNhbGx5Lg0KDQo+IElzIHRoaXMgYXBwcm9hY2ggZmVhc2li
bGU/IElmIG5vdCwgd2hhdCBlbHNlIGNhbiBiZSBkb25lIHRvIGF2b2lkIHRoaXMNCj4gcHJvYmxl
bS4NCg0KSWYgeW91IGhhdmUgd29ya2xvYWRzIHN1Y2ggYXMgdGhlIGFib3ZlLCB0aGVuIEkgc3Vn
Z2VzdCAibW91bnQNCi1vbG9va3VwY2FjaGU9bm9uZSIuIEl0IHN0aWxsIHdvbid0IHByZXZlbnQg
RVNUQUxFLCBidXQgYXQgbGVhc3QgeW91DQplbnN1cmUgdGhhdCB0aGUgZGVudHJ5IHJldmFsaWRh
dGlvbiBhbHdheXMgZG9lcyBhIGZ1bGwgbG9va3VwLg0KDQpPdGhlcndpc2UsIHlvdSBjYW4gZG8g
YXMgSmVmZiBzdWdnZXN0ZWQ6IGhhbmRsZSB0aGUgRVNUQUxFIGF0IHRoZSBWRlMNCmxldmVsLCBi
dXQgZW5zdXJlIHRoYXQgeW91IGJyZWFrIG91dCBvZiB0aGUgbG9vcCBhZnRlciBhIGxpbWl0ZWQg
bnVtYmVyDQpvZiBhdHRlbXB0cyBoYXMgYmVlbiBtYWRlLg0KDQotLSANClRyb25kIE15a2xlYnVz
dA0KTGludXggTkZTIGNsaWVudCBtYWludGFpbmVyDQoNCk5ldEFwcA0KVHJvbmQuTXlrbGVidXN0
QG5ldGFwcC5jb20NCnd3dy5uZXRhcHAuY29tDQoNCg==

2012-04-10 16:10:56

by Jeff Layton

[permalink] [raw]
Subject: Re: ESTALE errors: What can be done to handle this problem.

On Tue, 10 Apr 2012 17:03:45 +0100
Sachin Prabhu <[email protected]> wrote:

> The change introduced by upstream patch
> 4f48af45842c6e78ab958c90344d3c940db4da15
> means that the lookup call for a file will return a inode as being valid
> as long as
> 1) The directory containing the file was revalidated ie. the cached
> attributes are valid and it was found that there was no change in the
> directory since the attributes of the file were cached.
> 2) We aren't doing a OPEN call or the LOOKUP_REVAL flag is not set.
>
> We no longer check to see if the file attributes themselves are valid.
> This patch was introduced in 2.6.23.
>
> This increases the chances of ESTALE errors on GETATTR calls if
> 1) The attributes of the directory containing the file is refreshed on
> the client and the values stored in the attribute cache are still valid
> when the subsequent steps are run.
> 2) The file is deleted on the server.
> 3) The cached attributes of the file on the client have expired.
> 4) We stat the file.
>
> For a GETATTR call
> 1) The lookup will return the inode as valid.
> 2) The subsequent getattr call will notice that the attributes are no
> longer valid and will attempt a on the wire call which fails with an
> ESTALE.
> We cannot recover at this point and return an ESTALE error to userland.
>
> We can easily reproduce this with the following reproducer.
>
> On the server
> # while true; do date >b ; rm -f a ; mv b a;sleep 3; done
>
> On the client
> #while true; do stat a 2>&1 | grep Stale ; done
>
> We do not see this problem in kernels not containing this patch since we
> revalidate the file inode before we return the inode as valid in the
> lookup call. Here we notice that the attribute cache is no longer valid
> and do an over the wire GETATTR call to fetch the latest file
> attributes. An ESTALE error in the lookup phase is handled by redoing
> the lookup call. This still has a potential of failing with an ESTALE if
> the attribute cache expires between the lookup and the getattr calls.
> But chances of this happening are relatively low.
>
> A workaround is to disable attribute caching which is accompanied by a
> huge performance hit. A proper fix would be to just redo the lookup call
> in case we ever encounter an ESTALE error with a limit to the number of
> retries which can be made. However this approach will involve a bit of
> work in the VFS layer. We have a user provided patch which does just
> this but is incomplete.
>
> Is this approach feasible? If not, what else can be done to avoid this
> problem.
>
> Sachin Prabhu
>

(cc'ing linux-fsdevel too)

While it sounds distinctly not-fun to code up, I think the only real
fix is to add code to all of the path-based syscalls to redo the
lookup with LOOKUP_REVAL set and retry the operation when you get back
an ESTALE.

We should retry more than once too since it's possible that you could
race more than once. That should be bounded at some finite
number of retries though so we don't end up in an infinite loop.

I don't think it would be a terribly difficult patchset, but it would
have to be done in the VFS layer and would touch a lot of code there.

--
Jeff Layton <[email protected]>