2013-03-11 16:13:43

by Bodo Stroesser

Subject: sunrpc/cache.c: races while updating cache entries

Hi,

AFAICS, there is one more race in RPC cache.

The problem showed up for the "auth.unix.ip" cache, that
has a reader.

If a server thread tries to find a cache entry, it first
does a lookup (or calls ip_map_cached_get() in this specific
case). Then, it calls cache_check() for this entry.

If the reader updates the cache entry just between the
thread's lookup and cache_check() execution, cache_check()
will do an upcall for this cache entry. This is because
sunrpc_cache_update() calls cache_fresh_locked(old, 0),
which sets expiry_time to 0.

Unfortunately, the reply to the upcall will not dequeue
the queued cache_request, as the reply will be assigned to
the cache entry newly created by the above cache update.

The result is a growing queue of orphaned cache_request
structures --> memory leak.

I found this on a SLES11 SP1 with a backport of the latest
patches that fix the other RPC races. On this old kernel,
the problem also leads to svc_drop() being called for the
affected RPC request (after svc_defer()).

Best Regards
Bodo
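
The interleaving described above can be condensed into a small userspace C
model. This is only an illustrative sketch, not kernel code: the struct and
function names merely mimic the roles of struct cache_head,
sunrpc_cache_update()/cache_fresh_locked() and cache_check() in
net/sunrpc/cache.c.

#include <stdio.h>

struct entry {
        long expiry_time;       /* 0 means "replaced by a newer entry" */
        int  upcall_queued;     /* models a queued struct cache_request */
};

static struct entry *hash_chain;        /* what a lookup currently returns */
static long now = 1000;                 /* fake clock */

/* models sunrpc_cache_update(): the reader's answer replaces "old", and
 * cache_fresh_locked(old, 0) marks the old entry as expired */
static void reader_update(struct entry *old, struct entry *new)
{
        new->expiry_time = now + 3600;
        old->expiry_time = 0;
        hash_chain = new;
}

/* models the upcall decision in cache_check(): an entry that looks
 * expired gets a cache_request queued for that very entry */
static void cache_check_model(struct entry *h)
{
        if (h->expiry_time <= now)
                h->upcall_queued = 1;
}

int main(void)
{
        struct entry old = { .expiry_time = now + 3600 };
        struct entry new = { 0 };

        hash_chain = &old;

        struct entry *h = hash_chain;   /* server thread: lookup / ip_map_cached_get() */
        reader_update(&old, &new);      /* reader races in before cache_check() */
        cache_check_model(h);           /* h still points at the replaced entry */

        /* the reader's next reply is matched against "new", so the request
         * queued for "old" is never dequeued: it is orphaned (leaked) */
        printf("upcall queued for replaced entry: %d\n", old.upcall_queued);
        return 0;
}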




2013-04-05 12:40:30

by J. Bruce Fields

Subject: Re: sunrpc/cache.c: races while updating cache entries

On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
> There is no reason for apologies. The thread meanwhile seems to be a bit
> confusing :-)
>
> Current state is:
>
> - Neil Brown has created two series of patches. One for SLES11-SP1 and a
> second one for -SP2
>
> - AFAICS, the series for -SP2 will match with mainline also.
>
> - Today I found and fixed the (hopefully) last problem in the -SP1 series.
> My test using this patchset will run until Monday.
>
> - Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
> the patches for SP2 (and mainline). If it runs fine, we'll have a tested
> patchset not later than Mon 15th.

OK, great, as long as it hasn't just been forgotten!

I'd also be curious to understand why we aren't getting a lot of
complaints about this from elsewhere.... Is there something unique
about your setup? Do the bugs that remain upstream take a long time to
reproduce?

--b.

2013-04-05 15:33:51

by Bodo Stroesser

Subject: Re: sunrpc/cache.c: races while updating cache entries

On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <[email protected]> wrote:
> On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
> > There is no reason for apologies. The thread meanwhile seems to be a bit
> > confusing :-)
> >
> > Current state is:
> >
> > - Neil Brown has created two series of patches. One for SLES11-SP1 and a
> >   second one for -SP2
> >
> > - AFAICS, the series for -SP2 will match with mainline also.
> >
> > - Today I found and fixed the (hopefully) last problem in the -SP1 series.
> >   My test using this patchset will run until Monday.
> >
> > - Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
> >   the patches for SP2 (and mainline). If it runs fine, we'll have a tested
> >   patchset not later than Mon 15th.
>
> OK, great, as long as it hasn't just been forgotten!
>
> I'd also be curious to understand why we aren't getting a lot of
> complaints about this from elsewhere....  Is there something unique
> about your setup?  Do the bugs that remain upstream take a long time to
> reproduce?
>
> --b.
>

It's no secret, what we are doing. So let me try to explain:

We build appliances for storage purposes. Each appliance mainly consists of
a cluster of servers and a bunch of FibreChannel RAID systems. The servers
of the appliance run SLES11.

One or more of the servers in the cluster can act as a NFS server.

Each NFS server is connected to the RAID systems and has two 10 GBit/s Ethernet
controllers for the link to the clients.

The appliance not only offers NFS access for clients, but also has some other
types of interfaces to be used by the clients.

For QA of the appliances we use a special test system, that runs the entire
appliance with all its interfaces under heavy load.

For the test of the NFS interfaces of the appliance, we connect the Ethernet
links one by one to 10 GBit/s Ethernet controllers on a linux machine of the
test system.

The SW on the test system for each Ethernet link uses 32 TCP connections to the
NFS server in parallel.

So between NFS server of the appliance and linux machine of the test system we
have two 10 GBit/s links with 32 TCP/RPC/NFS_V3 connections each. Each link
is running at up to 1 GByte/s throughput (per second and per link a total of
32k NFS3_READ or NFS3_WRITE RPCs of 32k data each.)

Normal Linux-NFS-Clients open only one single connection to a specific NFS
server, even if there are multiple mounts. We do not use the linux builtin
client, but create a RPC client by clnttcp_create() and do the NFS handling
directly. Thus we can have multiple connections and we immediately can
see if something goes wrong (e.g. if a RPC request is dropped), while the
builtin linux client probably would do a silent retry. (But probably one
could see single connections hang for a few minutes sporadically. Maybe
someone hit by this would complain about the network ...)

As a side effect of this test setup all 64 connections to the NFS server
use the same uid/gid and all 32 connections on one link come from the same
ip address. This - as we know now - maximizes the stress for a single entry
of the caches.

With our test setup at the beginning we had more than two dropped RPC requests
per hour and per NFS server. (Of course, this rate varied widely.) With each
single change in cache.c the rate went down. The latest drop caused by a
missing detail in the latest patchset for -SP1 occurred after more than 2 days
of testing!

Thus, to verify the patches I schedule a test for at least 4 days.

HTH
Bodo
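
As an aside for readers who have not used the classic Sun RPC client API
mentioned above: the sketch below shows roughly how one of those parallel
TCP connections could be opened with clnttcp_create(). It is only an
illustration, not the real test software; the server address is a
placeholder and error handling is minimal.

#include <rpc/rpc.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <stdio.h>

#define NFS_PROGRAM 100003      /* well-known RPC program number for NFS */
#define NFS_V3      3

int main(void)
{
        struct sockaddr_in srv;
        int sock = RPC_ANYSOCK;         /* let the library open a fresh socket */
        CLIENT *clnt;

        memset(&srv, 0, sizeof(srv));
        srv.sin_family = AF_INET;
        srv.sin_port = 0;               /* 0 = ask the server's portmapper for the port */
        inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* placeholder address */

        /* Each call with sock == RPC_ANYSOCK opens its own TCP connection,
         * which is how 32 parallel connections per link can be created. */
        clnt = clnttcp_create(&srv, NFS_PROGRAM, NFS_V3, &sock, 0, 0);
        if (clnt == NULL) {
                clnt_pcreateerror("clnttcp_create");
                return 1;
        }

        /* NFSv3 procedures would then be issued with clnt_call(clnt, ...). */
        clnt_destroy(clnt);
        return 0;
}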



2013-04-04 17:59:36

by Bodo Stroesser

Subject: Re: sunrpc/cache.c: races while updating cache entries

On 03 Apr 2013 20:36:00 +0100 J. Bruce Fields <[email protected]> wrote:
> On Thu, Mar 21, 2013 at 05:41:35PM +0100, Bodo Stroesser wrote:
> > On 21 Mar 2013 00:34:00 +0100 NeilBrown <[email protected]> wrote:
> > > This applies only to SLES11-SP1 (2.6.32+) right?
> > > In mainline the request won't be dropped because the cache item isn't assumed
> > > to still be valid.
> >
> > I agree, there will be no drop in mainline.
> >
> > But in mainline there will be a useless upcall. That upcall might even be
> > processed by the reader, if cache_clean isn't faster than the reader.
> > The answer from the reader will again replace the current cache entry
> > (which is just a few microseconds old) by a new one. The new one must be
> > allocated, the replaced one must later be cleaned. The single item cache
> > is invalidated again.
> >
> > In worst case, the unnecessary replacement also could trigger the next
> > round of the game ...
> >
> > So in my opinion it would be better to add the patch you suggested below
> > to mainline also.
> >
> > >
> > > So we need to make sure that sunrpc_cache_pipe_upcall doesn't make an upcall
> > > on a cache item that has been replaced.  I'd rather not use the CACHE_CLEAN
> > > bit (whether renamed or not) as it has a well defined meaning "has been
> > > removed from cache" and I'd rather not blur that meaning.
> > > We already have a state that means "this has been replaced" - ->expiry_time is
> > > 0.
> > > So how about adding
> > >    if (h->expiry_time == 0)
> > >          return -EAGAIN;
> > > to sunrpc_cache_pipe_upcall() in SLES11-SP1.
> > >
> > > Does that work for you?
> >
> > Yes, that looks good. My test with this fix is running successfully
> > since 5 hours. I'll let it run until Monday.
>
> Apologies, I've completely lost track of this thread: do we know what
> mainline needs now?
>
> --b.
>

There is no reason for apologies. The thread meanwhile seems to be a bit
confusing :-)

Current state is:

- Neil Brown has created two series of patches. One for SLES11-SP1 and a
  second one for -SP2

- AFAICS, the series for -SP2 will match with mainline also.

- Today I found and fixed the (hopefully) last problem in the -SP1 series.
  My test using this patchset will run until Monday.

- Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
  the patches for SP2 (and mainline). If it runs fine, we'll have a tested
  patchset not later than Mon 15th.

Bodo
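
Neil's two-line suggestion quoted above is easiest to see in isolation. The
fragment below is only a model of the idea using a stand-in type; it is not
the SLES11-SP1 sunrpc_cache_pipe_upcall(), which of course does much more.

#include <errno.h>
#include <stdio.h>

struct cache_head_model {
        long expiry_time;       /* 0 == "already replaced by a newer entry" */
};

/* models the top of the upcall path with the suggested guard: refuse to
 * queue an upcall for an entry that has already been replaced, since the
 * eventual reply would be bound to the replacement and the queued request
 * would be orphaned */
static int pipe_upcall_model(struct cache_head_model *h)
{
        if (h->expiry_time == 0)
                return -EAGAIN;
        /* ... normal path: format the key and queue a cache_request ... */
        return 0;
}

int main(void)
{
        struct cache_head_model replaced = { .expiry_time = 0 };
        struct cache_head_model fresh = { .expiry_time = 12345 };

        printf("replaced entry -> %d (expect -EAGAIN)\n", pipe_upcall_model(&replaced));
        printf("fresh entry    -> %d\n", pipe_upcall_model(&fresh));
        return 0;
}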



2013-04-05 21:08:34

by J. Bruce Fields

Subject: Re: sunrpc/cache.c: races while updating cache entries

On Fri, Apr 05, 2013 at 05:33:49PM +0200, Bodo Stroesser wrote:
> On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <[email protected]> wrote:
> > On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
> > > There is no reason for apologies. The thread meanwhile seems to be a bit
> > > confusing :-)
> > >
> > > Current state is:
> > >
> > > - Neil Brown has created two series of patches. One for SLES11-SP1 and a
> > > second one for -SP2
> > >
> > > - AFAICS, the series for -SP2 will match with mainline also.
> > >
> > > - Today I found and fixed the (hopefully) last problem in the -SP1 series.
> > > My test using this patchset will run until Monday.
> > >
> > > - Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
> > > the patches for SP2 (and mainline). If it runs fine, we'll have a tested
> > > patchset not later than Mon 15th.
> >
> > OK, great, as long as it hasn't just been forgotten!
> >
> > I'd also be curious to understand why we aren't getting a lot of
> > complaints about this from elsewhere.... Is there something unique
> > about your setup? Do the bugs that remain upstream take a long time to
> > reproduce?
> >
> > --b.
> >
>
> It's no secret, what we are doing. So let me try to explain:

Thanks for the detailed explanation! I'll look forward to the patches.

--b.

>
> We build appliances for storage purposes. Each appliance mainly consists of
> a cluster of servers and a bunch of FibreChannel RAID systems. The servers
> of the appliance run SLES11.
>
> One or more of the servers in the cluster can act as a NFS server.
>
> Each NFS server is connected to the RAID systems and has two 10 GBit/s Ethernet
> controllers for the link to the clients.
>
> The appliance not only offers NFS access for clients, but also has some other
> types of interfaces to be used by the clients.
>
> For QA of the appliances we use a special test system, that runs the entire
> appliance with all its interfaces under heavy load.
>
> For the test of the NFS interfaces of the appliance, we connect the Ethernet
> links one by one to 10 GBit/s Ethernet controllers on a linux machine of the
> test system.
>
> The SW on the test system for each Ethernet link uses 32 TCP connections to the
> NFS server in parallel.
>
> So between NFS server of the appliance and linux machine of the test system we
> have two 10 GBit/s links with 32 TCP/RPC/NFS_V3 connections each. Each link
> is running at up to 1 GByte/s throughput (per second and per link a total of
> 32k NFS3_READ or NFS3_WRITE RPCs of 32k data each.)
>
> Normal Linux-NFS-Clients open only one single connection to a specific NFS
> server, even if there are multiple mounts. We do not use the linux builtin
> client, but create a RPC client by clnttcp_create() and do the NFS handling
> directly. Thus we can have multiple connections and we immediately can
> see if something goes wrong (e.g. if a RPC request is dropped), while the
> builtin linux client probably would do a silent retry. (But probably one
> could see single connections hang for a few minutes sporadically. Maybe
> someone hit by this would complain about the network ...)
>
> As a side effect of this test setup all 64 connections to the NFS server
> use the same uid/gid and all 32 connections on one link come from the same
> ip address. This - as we know now - maximizes the stress for a single entry
> of the caches.
>
> With our test setup at the beginning we had more than two dropped RPC requests
> per hour and per NFS server. (Of course, this rate varied widely.) With each
> single change in cache.c the rate went down. The latest drop caused by a
> missing detail in the latest patchset for -SP1 occurred after more than 2 days
> of testing!
>
> Thus, to verify the patches I schedule a test for at least 4 days.
>
> HTH
> Bodo

2013-04-03 18:36:19

by J. Bruce Fields

Subject: Re: sunrpc/cache.c: races while updating cache entries

On Thu, Mar 21, 2013 at 05:41:35PM +0100, Bodo Stroesser wrote:
> On 21 Mar 2013 00:34:00 +0100 NeilBrown <[email protected]> wrote:
> > This applies only to SLES11-SP1 (2.6.32+) right?
> > In mainline the request won't be dropped because the cache item isn't assumed
> > to still be valid.
>
> I agree, there will be no drop in mainline.
>
> But in mainline there will be a useless upcall. That upcall might even be
> > processed by the reader, if cache_clean isn't faster than the reader.
> The answer from the reader will again replace the current cache entry
> (which is just a few microseconds old) by a new one. The new one must be
> allocated, the replaced one must later be cleaned. The single item cache
> is invalidated again.
>
> In worst case, the unnecessary replacement also could trigger the next
> round of the game ...
>
> So in my opinion it would be better to add the patch you suggested below
> to mainline also.
>
> >
> > So we need to make sure that sunrpc_cache_pipe_upcall doesn't make an upcall
> > on a cache item that has been replaced. I'd rather not use the CACHE_CLEAN
> > bit (whether renamed or not) as it has a well defined meaning "has been
> > removed from cache" and I'd rather not blur that meaning.
> > We already have a state that means "this has been replaced" - ->expiry_time is
> > 0.
> > So how about adding
> >    if (h->expiry_time == 0)
> >          return -EAGAIN;
> > to sunrpc_cache_pipe_upcall() in SLES11-SP1.
> >
> > Does that work for you?
>
> Yes, that looks good. My test with this fix is running successfully
> since 5 hours. I'll let it run until Monday.

Apologies, I've completely lost track of this thread: do we know what
mainline needs now?

--b.

2013-04-19 16:55:54

by Bodo Stroesser

Subject: Re: sunrpc/cache.c: races while updating cache entries

On 05 Apr 2013 23:09:00 +0100 J. Bruce Fields <[email protected]> wrote:
> On Fri, Apr 05, 2013 at 05:33:49PM +0200, Bodo Stroesser wrote:
> > On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <[email protected]> wrote:
> > > On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
> > > > There is no reason for apologies. The thread meanwhile seems to be a bit
> > > > confusing :-)
> > > >
> > > > Current state is:
> > > >
> > > > - Neil Brown has created two series of patches. One for SLES11-SP1 and a
> > > >   second one for -SP2
> > > >
> > > > - AFAICS, the series for -SP2 will match with mainline also.
> > > >
> > > > - Today I found and fixed the (hopefully) last problem in the -SP1 series.
> > > >   My test using this patchset will run until Monday.
> > > >
> > > > - Provided the test on SP1 succeeds, probably on Tuesday I'll start to test
> > > >   the patches for SP2 (and mainline). If it runs fine, we'll have a tested
> > > >   patchset not later than Mon 15th.
> > >
> > > OK, great, as long as it hasn't just been forgotten!
> > >
> > > I'd also be curious to understand why we aren't getting a lot of
> > > complaints about this from elsewhere....  Is there something unique
> > > about your setup?  Do the bugs that remain upstream take a long time to
> > > reproduce?
> > >
> > > --b.
> > >
> >
> > It's no secret, what we are doing. So let me try to explain:
>
> Thanks for the detailed explanation!  I'll look forward to the patches.
>
> --b.
>

Let me give an intermediate result:

The test of the -SP1 patch series succeeded.

We started the test of the -SP2 (and mainline) series on Tue, 9th, but had no
success.
We did _not_ find a problem with the patches, but under -SP2 our test scenario
has less than 40% of the throughput we saw under -SP1. With that low
performance, we had a 4 day run without any dropped RPC request. But we don't
know the error rate without the patches under these conditions. So we can't
give an o.k. for the patches yet.

Currently we try to find the reason for the different behavior of SP1 and SP2

Bodo



2013-05-13 04:08:46

by Namjae Jeon

Subject: Re: sunrpc/cache.c: races while updating cache entries

Hi.

Sorry for the interruption.
I fixed my issue using this patch (nfsd4: fix hang on fast-booting nfs
servers). It was a different issue from the one discussed in this thread.

Thanks.

2013/5/10, Namjae Jeon <[email protected]>:
> Hi. Bodo.
>
> We are facing issues with respect to the SUNRPC cache.
> In our case we have two targets connected back-to-back
> NFS Server: Kernel version, 2.6.35
>
> At times, when the Client tries to connect to the Server, it gets stuck for a
> very long duration and keeps on trying to mount.
>
> When we tried to figure this out using logs, we found that the client was not
> getting a response to the FSINFO request.
>
> Further, by debugging we found that the request was getting dropped at
> the SERVER, so this request was not being served.
>
> In the code we reached this point:
> svcauth_unix_set_client()->
>         gi = unix_gid_find(cred->cr_uid, rqstp);
>         switch (PTR_ERR(gi)) {
>         case -EAGAIN:
>                 return SVC_DROP;
>
> This path is related with the SUNRPC cache management.
>
> When we remove this UNIX_GID_FIND path from our code, there is no problem.
>
> When we tried to figure out the possible related problems in our
> scenario, we found that you have faced a similar race issue in the
> cache.
> Can you please suggest what the problem could be, so that we can check
> further?
>
> Or, based on the solution you used when you encountered a similar situation,
> can you please suggest possible patches for 2.6.35 which we
> can try in our environment?
>
> We will be highly grateful.
>
> Thanks
>
>
> 2013/4/20, Bodo Stroesser <[email protected]>:
>> On 05 Apr 2013 23:09:00 +0100 J. Bruce Fields <[email protected]>
>> wrote:
>>> On Fri, Apr 05, 2013 at 05:33:49PM +0200, Bodo Stroesser wrote:
>>> > On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <[email protected]>
>>> > wrote:
>>> > > On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
>>> > > > There is no reason for apologies. The thread meanwhile seems to be
>>> > > > a
>>> > > > bit
>>> > > > confusing :-)
>>> > > >
>>> > > > Current state is:
>>> > > >
>>> > > > - Neil Brown has created two series of patches. One for SLES11-SP1
>>> > > > and a
>>> > > > second one for -SP2
>>> > > >
>>> > > > - AFAICS, the series for -SP2 will match with mainline also.
>>> > > >
>>> > > > - Today I found and fixed the (hopefully) last problem in the -SP1
>>> > > > series.
>>> > > > My test using this patchset will run until Monday.
>>> > > >
>>> > > > - Provided the test on SP1 succeeds, probably on Tuesday I'll
>>> > > > start
>>> > > > to test
>>> > > > the patches for SP2 (and mainline). If it runs fine, we'll have
>>> > > > a
>>> > > > tested
>>> > > > patchset not later than Mon 15th.
>>> > >
>>> > > OK, great, as long as it hasn't just been forgotten!
>>> > >
>>> > > I'd also be curious to understand why we aren't getting a lot of
>>> > > complaints about this from elsewhere.... Is there something unique
>>> > > about your setup? Do the bugs that remain upstream take a long time
>>> > > to
>>> > > reproduce?
>>> > >
>>> > > --b.
>>> > >
>>> >
>>> > It's no secret, what we are doing. So let me try to explain:
>>>
>>> Thanks for the detailed explanation! I'll look forward to the patches.
>>>
>>> --b.
>>>
>>
>> Let me give an intermediate result:
>>
>> The test of the -SP1 patch series succeeded.
>>
>> We started the test of the -SP2 (and mainline) series on Tue, 9th, but
>> had
>> no
>> success.
>> We did _not_ find a problem with the patches, but under -SP2 our test
>> scenario
>> has less than 40% of the throughput we saw under -SP1. With that low
>> performance, we had a 4 day run without any dropped RPC request. But we
>> don't
>> know the error rate without the patches under these conditions. So we
>> can't
>> give an o.k. for the patches yet.
>>
>> Currently we try to find the reason for the different behavior of SP1 and
>> SP2
>>
>> Bodo
>>
>

2013-05-10 07:51:39

by Namjae Jeon

Subject: Re: sunrpc/cache.c: races while updating cache entries

Hi. Bodo.

We are facing issues with respect to the SUNRPC cache.
In our case we have two targets connected back-to-back
NFS Server: Kernel version, 2.6.35

At times, when the Client tries to connect to the Server, it gets stuck for a
very long duration and keeps on trying to mount.

When we tried to figure this out using logs, we found that the client was not
getting a response to the FSINFO request.

Further, by debugging we found that the request was getting dropped at
the SERVER, so this request was not being served.

In the code we reached this point:
svcauth_unix_set_client()->
        gi = unix_gid_find(cred->cr_uid, rqstp);
        switch (PTR_ERR(gi)) {
        case -EAGAIN:
                return SVC_DROP;

This path is related with the SUNRPC cache management.

When we remove this UNIX_GID_FIND path from our code, there is no problem.

When we tried to figure out the possible related problems in our
scenario, we found that you have faced a similar race issue in the
cache.
Can you please suggest what the problem could be, so that we can check further?

Or, based on the solution you used when you encountered a similar situation,
can you please suggest possible patches for 2.6.35 which we
can try in our environment?

We will be highly grateful.

Thanks


2013/4/20, Bodo Stroesser <[email protected]>:
> On 05 Apr 2013 23:09:00 +0100 J. Bruce Fields <[email protected]> wrote:
>> On Fri, Apr 05, 2013 at 05:33:49PM +0200, Bodo Stroesser wrote:
>> > On 05 Apr 2013 14:40:00 +0100 J. Bruce Fields <[email protected]>
>> > wrote:
>> > > On Thu, Apr 04, 2013 at 07:59:35PM +0200, Bodo Stroesser wrote:
>> > > > There is no reason for apologies. The thread meanwhile seems to be a
>> > > > bit
>> > > > confusing :-)
>> > > >
>> > > > Current state is:
>> > > >
>> > > > - Neil Brown has created two series of patches. One for SLES11-SP1
>> > > > and a
>> > > > second one for -SP2
>> > > >
>> > > > - AFAICS, the series for -SP2 will match with mainline also.
>> > > >
>> > > > - Today I found and fixed the (hopefully) last problem in the -SP1
>> > > > series.
>> > > > My test using this patchset will run until Monday.
>> > > >
>> > > > - Provided the test on SP1 succeeds, probably on Tuesday I'll start
>> > > > to test
>> > > > the patches for SP2 (and mainline). If it runs fine, we'll have a
>> > > > tested
>> > > > patchset not later than Mon 15th.
>> > >
>> > > OK, great, as long as it hasn't just been forgotten!
>> > >
>> > > I'd also be curious to understand why we aren't getting a lot of
>> > > complaints about this from elsewhere.... Is there something unique
>> > > about your setup? Do the bugs that remain upstream take a long time
>> > > to
>> > > reproduce?
>> > >
>> > > --b.
>> > >
>> >
>> > It's no secret, what we are doing. So let me try to explain:
>>
>> Thanks for the detailed explanation! I'll look forward to the patches.
>>
>> --b.
>>
>
> Let me give an intermediate result:
>
> The test of the -SP1 patch series succeeded.
>
> We started the test of the -SP2 (and mainline) series on Tue, 9th, but had
> no
> success.
> We did _not_ find a problem with the patches, but under -SP2 our test
> scenario
> has less than 40% of the throughput we saw under -SP1. With that low
> performance, we had a 4 day run without any dropped RPC request. But we
> don't
> know the error rate without the patches under these conditions. So we can't
> give an o.k. for the patches yet.
>
> Currently we try to find the reason for the different behavior of SP1 and
> SP2
>
> Bodo
>

2013-06-03 14:27:08

by Bodo Stroesser

Subject: Re: sunrpc/cache.c: races while updating cache entries

On Fri, Apr 19, 2013 at 06:56:00PM +0200, Bodo Stroesser wrote:
>
> We started the test of the -SP2 (and mainline) series on Tue, 9th, but had no
> success.
> We did _not_ find a problem with the patches, but under -SP2 our test scenario
> has less than 40% of the throughput we saw under -SP1. With that low
> performance, we had a 4 day run without any dropped RPC request. But we don't
> know the error rate without the patches under these conditions. So we can't
> give an o.k. for the patches yet.
>
> Currently we try to find the reason for the different behavior of SP1 and SP2
>

Hi,

sorry for the delay. Meanwhile we found the reason for the small throughput
with -SP2. The problem resulted from a change in our own software.

Thus I could fix this and started a test on last Tuesday. I stopped the test
today after 6 days without any lost RPC. Without the patches I saw the first
dropped RPC after 3 hours. Thus, I think the patches for -SP2 are fine.

@Neil: would patch 0006 of the -SP1 patchset be a good additional change for
mainline?

Bodo



2013-06-13 01:55:07

by NeilBrown

Subject: Re: sunrpc/cache.c: races while updating cache entries

On 03 Jun 2013 16:27:06 +0200 Bodo Stroesser <[email protected]>
wrote:

> On Fri, Apr 19, 2013 at 06:56:00PM +0200, Bodo Stroesser wrote:
> >
> > We started the test of the -SP2 (and mainline) series on Tue, 9th, but had no
> > success.
> > We did _not_ find a problem with the patches, but under -SP2 our test scenario
> > has less than 40% of the throughput we saw under -SP1. With that low
> > performance, we had a 4 day run without any dropped RPC request. But we don't
> > know the error rate without the patches under these conditions. So we can't
> > give an o.k. for the patches yet.
> >
> > Currently we try to find the reason for the different behavior of SP1 and SP2
> >
>
> Hi,
>
> sorry for the delay. Meanwhile we found the reason for the small throughput
> with -SP2. The problem resulted from a change in our own software.
>
> Thus I could fix this and started a test on last Tuesday. I stopped the test
> today after 6 days without any lost RPC. Without the patches I saw the first
> dropped RPC after 3 hours. Thus, I think the patches for -SP2 are fine.
>
> @Neil: would patch 0006 of the -SP1 patchset be a good additional change for
> mainline?
>
> Bodo

Thanks for all the testing.

Bruce: where are you at with these? Are you holding on to some that I sent
previously, or should I resend them all?


Bodo: no, I don't think that patch is appropriate for mainline. It causes
sunrpc_cache_pipe_upcall to abort if ->expiry_time is zero. There is
certainly no point in doing an upcall in that case, but the code in mainline
is quite different to the code in -SP1 against which that patch made sense.

For mainline an equivalent optimisation which probably makes the interesting
case more obvious would be:

diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
index d01eb07..291cc47 100644
--- a/net/sunrpc/cache.c
+++ b/net/sunrpc/cache.c
@@ -262,7 +262,8 @@ int cache_check(struct cache_detail *detail,
 	if (rqstp == NULL) {
 		if (rv == -EAGAIN)
 			rv = -ENOENT;
-	} else if (rv == -EAGAIN || age > refresh_age/2) {
+	} else if (rv == -EAGAIN ||
+		   (refresh_age > 0 && age > refresh_age/2)) {
 		dprintk("RPC: Want update, refage=%ld, age=%ld\n",
 				refresh_age, age);
 		if (!test_and_set_bit(CACHE_PENDING, &h->flags)) {


i.e. trap that case in cache_check.

NeilBrown



2013-06-13 02:04:55

by J. Bruce Fields

Subject: Re: sunrpc/cache.c: races while updating cache entries

On Thu, Jun 13, 2013 at 11:54:56AM +1000, NeilBrown wrote:
> On 03 Jun 2013 16:27:06 +0200 Bodo Stroesser <[email protected]>
> wrote:
>
> > On Fri, Apr 19, 2013 at 06:56:00PM +0200, Bodo Stroesser wrote:
> > >
> > > We started the test of the -SP2 (and mainline) series on Tue, 9th, but had no
> > > success.
> > > We did _not_ find a problem with the patches, but under -SP2 our test scenario
> > > has less than 40% of the throughput we saw under -SP1. With that low
> > > performance, we had a 4 day run without any dropped RPC request. But we don't
> > > know the error rate without the patches under these conditions. So we can't
> > > give an o.k. for the patches yet.
> > >
> > > Currently we try to find the reason for the different behavior of SP1 and SP2
> > >
> >
> > Hi,
> >
> > sorry for the delay. Meanwhile we found the reason for the small throughput
> > with -SP2. The problem resulted from a change in our own software.
> >
> > Thus I could fix this and started a test on last Tuesday. I stopped the test
> > today after 6 days without any lost RPC. Without the patches I saw the first
> > dropped RPC after 3 hours. Thus, I think the patches for -SP2 are fine.
> >
> > @Neil: would patch 0006 of the -SP1 patchset be a good additional change for
> > mainline?
> >
> > Bodo
>
> Thanks for all the testing.
>
> Bruce: where are you at with these? Are you holding on to some that I sent
> previously, or should I resend them all?

No, I'm not holding on to any--if you could resend them all that would
be great.

--b.

>
>
> Bodo: no, I don't think that patch is appropriate for mainline. It causes
> sunrpc_cache_pipe_upcall to abort if ->expiry_time is zero. There is
> certainly no point in doing an upcall in that case, but the code in mainline
> is quite different to the code in -SP1 against which that patch made sense.
>
> For mainline an equivalent optimisation which probably makes the interesting
> case more obvious would be:
>
> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
> index d01eb07..291cc47 100644
> --- a/net/sunrpc/cache.c
> +++ b/net/sunrpc/cache.c
> @@ -262,7 +262,8 @@ int cache_check(struct cache_detail *detail,
>  	if (rqstp == NULL) {
>  		if (rv == -EAGAIN)
>  			rv = -ENOENT;
> -	} else if (rv == -EAGAIN || age > refresh_age/2) {
> +	} else if (rv == -EAGAIN ||
> +		   (refresh_age > 0 && age > refresh_age/2)) {
>  		dprintk("RPC: Want update, refage=%ld, age=%ld\n",
>  				refresh_age, age);
>  		if (!test_and_set_bit(CACHE_PENDING, &h->flags)) {
>
>
> i.e. trap that case in cache_check.
>
> NeilBrown