2016-09-23 17:40:13

by Olga Kornievskaia

[permalink] [raw]
Subject: reuse of slot and seq# when RPC was interrupted

Hi folks,

I'd like to raise an issue with regards to nfs41_sequence_done()
slot->interrupted case. There is a comment there saying the if the RPC
was interrupted then we don't know if the server has processed the
slot or not so mark the slot as interrupted. In that case the sequence
is not bumped. Then later there is logic that if we received
SEQ_MISORDERED and the slot was marked interrupted then bump the
sequence.

The problem comes when the sequence number is not increment the reply
is not necessarily a SEQ_MISORDERED. Instead, the reply is a "cached"
reply of the operation that was interrupted. That leads to the xdr
returning "Remote EIO" (unrecoverable in some cases).

If we bump the sequence number always then we should get the
SEQ_MISORDERED error from which we can recover.

A reproducer to see an operation reuse a seq# and getting cached reply
is as follows:
1. on the shell do "rm <file in nfs>"
2. at the nfs_proxy delay the reply from the server enough to send a
ctrl-c to the shell.
3. do something else on nfs.

If we instead bump the sequence number in the case of interrupted and do:

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index a1a3b4c..b78dac5 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -728,6 +728,7 @@ int nfs41_sequence_done(struct rpc_task *task,
struct nfs4_sequence_res *res)
* operation..
* Mark the slot as having hosted an interrupted RPC call.
*/
+ ++slot->seq_nr;
slot->interrupted = 1;
goto out;
case -NFS4ERR_DELAY:

@@ -748,14 +749,6 @@ int nfs41_sequence_done(struct rpc_task *task,
struct nfs4_sequence_res *res)
goto retry_nowait;
case -NFS4ERR_SEQ_MISORDERED:
/*
- * Was the last operation on this sequence interrupted?
- * If so, retry after bumping the sequence number.
- */
- if (interrupted) {
- ++slot->seq_nr;
- goto retry_nowait;
- }
- /*
* Could this slot have been previously retired?
* If so, then the server may be expecting seq_nr = 1!
*/

1. if the server received it, then we bump and next operation has correct number
2. if the server didn't received and we bump, then next operation
received SEQ_MISORDERED, it'll reset the slot/session?


2016-09-23 17:45:09

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted


> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wrote:
>=20
> If we instead bump the sequence number in the case of interrupted and do:

You have no guarantees that the server has seen and processed the operation=
.


2016-09-23 17:59:06

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wrote:
>>
>> If we instead bump the sequence number in the case of interrupted and do:
>
> You have no guarantees that the server has seen and processed the operation.

That is correct, i have tested the patch and made server never to
receive the operation and client have an interrupted slot. On the next
operation the server will complain back with SEQ_MISORDERED. Client
can recover from this operation. Client can not recover from "Remote
EIO".

2016-09-23 18:08:46

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6NTksIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCAxOjQ1IFBNLCBU
cm9uZCBNeWtsZWJ1c3QNCj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToNCj4+IA0K
Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6NDAsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+Pj4gDQo+Pj4gSWYgd2UgaW5zdGVhZCBidW1wIHRoZSBzZXF1ZW5j
ZSBudW1iZXIgaW4gdGhlIGNhc2Ugb2YgaW50ZXJydXB0ZWQgYW5kIGRvOg0KPj4gDQo+PiBZb3Ug
aGF2ZSBubyBndWFyYW50ZWVzIHRoYXQgdGhlIHNlcnZlciBoYXMgc2VlbiBhbmQgcHJvY2Vzc2Vk
IHRoZSBvcGVyYXRpb24uDQo+IA0KPiBUaGF0IGlzIGNvcnJlY3QsIGkgaGF2ZSB0ZXN0ZWQgdGhl
IHBhdGNoIGFuZCBtYWRlIHNlcnZlciBuZXZlciB0bw0KPiByZWNlaXZlIHRoZSBvcGVyYXRpb24g
YW5kIGNsaWVudCBoYXZlIGFuIGludGVycnVwdGVkIHNsb3QuIE9uIHRoZSBuZXh0DQo+IG9wZXJh
dGlvbiB0aGUgc2VydmVyIHdpbGwgY29tcGxhaW4gYmFjayB3aXRoIFNFUV9NSVNPUkRFUkVELiBD
bGllbnQNCj4gY2FuIHJlY292ZXIgZnJvbSB0aGlzIG9wZXJhdGlvbi4gQ2xpZW50IGNhbiBub3Qg
cmVjb3ZlciBmcm9tICJSZW1vdGUNCj4gRUlP4oCdLg0KPiANCg0KV2h5IG5vdD8=


2016-09-23 18:25:04

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> wrote:
>>
>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>> <[email protected]> wrote:
>>>
>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> If we instead bump the sequence number in the case of interrupted and =
do:
>>>
>>> You have no guarantees that the server has seen and processed the opera=
tion.
>>
>> That is correct, i have tested the patch and made server never to
>> receive the operation and client have an interrupted slot. On the next
>> operation the server will complain back with SEQ_MISORDERED. Client
>> can recover from this operation. Client can not recover from "Remote
>> EIO=E2=80=9D.
>>
>
> Why not?

When XDR layer returns EREMOTEIO it's not handled by the NFS error
recovery (are you suggesting we should?) and returns that to the
application.

2016-09-23 18:34:33

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTQ6MjUsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCAyOjA4IFBNLCBU
cm9uZCBNeWtsZWJ1c3QNCj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToNCj4+IA0K
Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6NTksIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+Pj4gDQo+Pj4gT24gRnJpLCBTZXAgMjMsIDIwMTYgYXQgMTo0NSBQ
TSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToN
Cj4+Pj4gDQo+Pj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDEzOjQwLCBPbGdhIEtvcm5pZXZza2Fp
YSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+Pj4gDQo+Pj4+PiBJZiB3ZSBpbnN0ZWFkIGJ1
bXAgdGhlIHNlcXVlbmNlIG51bWJlciBpbiB0aGUgY2FzZSBvZiBpbnRlcnJ1cHRlZCBhbmQgZG86
DQo+Pj4+IA0KPj4+PiBZb3UgaGF2ZSBubyBndWFyYW50ZWVzIHRoYXQgdGhlIHNlcnZlciBoYXMg
c2VlbiBhbmQgcHJvY2Vzc2VkIHRoZSBvcGVyYXRpb24uDQo+Pj4gDQo+Pj4gVGhhdCBpcyBjb3Jy
ZWN0LCBpIGhhdmUgdGVzdGVkIHRoZSBwYXRjaCBhbmQgbWFkZSBzZXJ2ZXIgbmV2ZXIgdG8NCj4+
PiByZWNlaXZlIHRoZSBvcGVyYXRpb24gYW5kIGNsaWVudCBoYXZlIGFuIGludGVycnVwdGVkIHNs
b3QuIE9uIHRoZSBuZXh0DQo+Pj4gb3BlcmF0aW9uIHRoZSBzZXJ2ZXIgd2lsbCBjb21wbGFpbiBi
YWNrIHdpdGggU0VRX01JU09SREVSRUQuIENsaWVudA0KPj4+IGNhbiByZWNvdmVyIGZyb20gdGhp
cyBvcGVyYXRpb24uIENsaWVudCBjYW4gbm90IHJlY292ZXIgZnJvbSAiUmVtb3RlDQo+Pj4gRUlP
4oCdLg0KPj4+IA0KPj4gDQo+PiBXaHkgbm90Pw0KPiANCj4gV2hlbiBYRFIgbGF5ZXIgcmV0dXJu
cyBFUkVNT1RFSU8gaXQncyBub3QgaGFuZGxlZCBieSB0aGUgTkZTIGVycm9yDQo+IHJlY292ZXJ5
IChhcmUgeW91IHN1Z2dlc3Rpbmcgd2Ugc2hvdWxkPykgIGFuZCByZXR1cm5zIHRoYXQgdG8gdGhl
DQo+IGFwcGxpY2F0aW9uLg0KPiANCg0KSeKAmW0gc2F5aW5nIHRoYXQgaWYgd2UgZ2V0IGEgU0VR
X01JU09SREVSRUQgZHVlIHRvIGEgcHJldmlvdXMgaW50ZXJydXB0IG9uIHRoYXQgc2xvdCwgdGhl
biB3ZSBzaG91bGQgaWdub3JlIHRoZSBlcnJvciBpbiB0YXNrLT50a19zdGF0dXMsIGFuZCBqdXN0
IHJldHJ5IGFmdGVyIGJ1bXBpbmcgdGhlIHNsb3Qgc2VxaWQuDQoNCg==


2016-09-23 18:41:38

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <[email protected]> wrote:
>>
>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>> <[email protected]> wrote:
>>>
>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wrote:
>>>>>>
>>>>>> If we instead bump the sequence number in the case of interrupted an=
d do:
>>>>>
>>>>> You have no guarantees that the server has seen and processed the ope=
ration.
>>>>
>>>> That is correct, i have tested the patch and made server never to
>>>> receive the operation and client have an interrupted slot. On the next
>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>> can recover from this operation. Client can not recover from "Remote
>>>> EIO=E2=80=9D.
>>>>
>>>
>>> Why not?
>>
>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>> recovery (are you suggesting we should?) and returns that to the
>> application.
>>
>
> I=E2=80=99m saying that if we get a SEQ_MISORDERED due to a previous inte=
rrupt on that slot, then we should ignore the error in task->tk_status, and=
just retry after bumping the slot seqid.
>

I'm confused where your objection lies. Are you ok with bumping the
sequence # when task->tk_status =3D 1 and saying that we should still
keep the code that I deleted in the 2nd chunk of the patch that bumped
the seqid on getting SEQ_MISORDERED due to a previously interrupted
slot?
Wouldn't that create a difference of 2 slots for the server that has
received the original request?

2016-09-23 19:08:07

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTQ6NDEsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCAyOjM0IFBNLCBU
cm9uZCBNeWtsZWJ1c3QNCj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToNCj4+IA0K
Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTQ6MjUsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+Pj4gDQo+Pj4gT24gRnJpLCBTZXAgMjMsIDIwMTYgYXQgMjowOCBQ
TSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToN
Cj4+Pj4gDQo+Pj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDEzOjU5LCBPbGdhIEtvcm5pZXZza2Fp
YSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+Pj4gDQo+Pj4+PiBPbiBGcmksIFNlcCAyMywg
MjAxNiBhdCAxOjQ1IFBNLCBUcm9uZCBNeWtsZWJ1c3QNCj4+Pj4+IDx0cm9uZG15QHByaW1hcnlk
YXRhLmNvbT4gd3JvdGU6DQo+Pj4+Pj4gDQo+Pj4+Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6
NDAsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVtaWNoLmVkdT4gd3JvdGU6DQo+Pj4+Pj4+IA0K
Pj4+Pj4+PiBJZiB3ZSBpbnN0ZWFkIGJ1bXAgdGhlIHNlcXVlbmNlIG51bWJlciBpbiB0aGUgY2Fz
ZSBvZiBpbnRlcnJ1cHRlZCBhbmQgZG86DQo+Pj4+Pj4gDQo+Pj4+Pj4gWW91IGhhdmUgbm8gZ3Vh
cmFudGVlcyB0aGF0IHRoZSBzZXJ2ZXIgaGFzIHNlZW4gYW5kIHByb2Nlc3NlZCB0aGUgb3BlcmF0
aW9uLg0KPj4+Pj4gDQo+Pj4+PiBUaGF0IGlzIGNvcnJlY3QsIGkgaGF2ZSB0ZXN0ZWQgdGhlIHBh
dGNoIGFuZCBtYWRlIHNlcnZlciBuZXZlciB0bw0KPj4+Pj4gcmVjZWl2ZSB0aGUgb3BlcmF0aW9u
IGFuZCBjbGllbnQgaGF2ZSBhbiBpbnRlcnJ1cHRlZCBzbG90LiBPbiB0aGUgbmV4dA0KPj4+Pj4g
b3BlcmF0aW9uIHRoZSBzZXJ2ZXIgd2lsbCBjb21wbGFpbiBiYWNrIHdpdGggU0VRX01JU09SREVS
RUQuIENsaWVudA0KPj4+Pj4gY2FuIHJlY292ZXIgZnJvbSB0aGlzIG9wZXJhdGlvbi4gQ2xpZW50
IGNhbiBub3QgcmVjb3ZlciBmcm9tICJSZW1vdGUNCj4+Pj4+IEVJT+KAnS4NCj4+Pj4+IA0KPj4+
PiANCj4+Pj4gV2h5IG5vdD8NCj4+PiANCj4+PiBXaGVuIFhEUiBsYXllciByZXR1cm5zIEVSRU1P
VEVJTyBpdCdzIG5vdCBoYW5kbGVkIGJ5IHRoZSBORlMgZXJyb3INCj4+PiByZWNvdmVyeSAoYXJl
IHlvdSBzdWdnZXN0aW5nIHdlIHNob3VsZD8pICBhbmQgcmV0dXJucyB0aGF0IHRvIHRoZQ0KPj4+
IGFwcGxpY2F0aW9uLg0KPj4+IA0KPj4gDQo+PiBJ4oCZbSBzYXlpbmcgdGhhdCBpZiB3ZSBnZXQg
YSBTRVFfTUlTT1JERVJFRCBkdWUgdG8gYSBwcmV2aW91cyBpbnRlcnJ1cHQgb24gdGhhdCBzbG90
LCB0aGVuIHdlIHNob3VsZCBpZ25vcmUgdGhlIGVycm9yIGluIHRhc2stPnRrX3N0YXR1cywgYW5k
IGp1c3QgcmV0cnkgYWZ0ZXIgYnVtcGluZyB0aGUgc2xvdCBzZXFpZC4NCj4+IA0KPiANCj4gSSdt
IGNvbmZ1c2VkIHdoZXJlIHlvdXIgb2JqZWN0aW9uIGxpZXMuIEFyZSB5b3Ugb2sgd2l0aCBidW1w
aW5nIHRoZQ0KPiBzZXF1ZW5jZSAjIHdoZW4gdGFzay0+dGtfc3RhdHVzID0gMSBhbmQgc2F5aW5n
IHRoYXQgd2Ugc2hvdWxkIHN0aWxsDQo+IGtlZXAgdGhlIGNvZGUgdGhhdCBJIGRlbGV0ZWQgaW4g
dGhlIDJuZCBjaHVuayBvZiB0aGUgcGF0Y2ggdGhhdCBidW1wZWQNCj4gdGhlIHNlcWlkIG9uIGdl
dHRpbmcgU0VRX01JU09SREVSRUQgZHVlIHRvIGEgcHJldmlvdXNseSBpbnRlcnJ1cHRlZA0KPiBz
bG90Pw0KPiBXb3VsZG4ndCB0aGF0IGNyZWF0ZSBhIGRpZmZlcmVuY2Ugb2YgMiBzbG90cyBmb3Ig
dGhlIHNlcnZlciB0aGF0IGhhcw0KPiByZWNlaXZlZCB0aGUgb3JpZ2luYWwgcmVxdWVzdD8NCj4g
DQoNCknigJltIHNheWluZyBJ4oCZZCBwcmVmZXIgdG8ga2VlcCB0aGUgY3VycmVudCBjb2RlLCBi
dXQgZml4IHRoZSByZXRyeSB0aGF0IGlzIGFwcGFyZW50bHkgYnJva2VuLiBJZiB3ZeKAmXJlIG5v
dCBpZ25vcmluZyB0aGUgdGFzay0+dGtfZXJyb3Igd2hlbiB3ZSBkZWNpZGUgdG8gcmV0cnksIHRo
ZW4gdGhhdOKAmXMgYSBidWcgaW4gbXkgb3Bpbmlvbi4=


2016-09-23 19:27:47

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <[email protected]> wrote:
>>
>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>> <[email protected]> wrote:
>>>
>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wrot=
e:
>>>>>>>>
>>>>>>>> If we instead bump the sequence number in the case of interrupted =
and do:
>>>>>>>
>>>>>>> You have no guarantees that the server has seen and processed the o=
peration.
>>>>>>
>>>>>> That is correct, i have tested the patch and made server never to
>>>>>> receive the operation and client have an interrupted slot. On the ne=
xt
>>>>>> operation the server will complain back with SEQ_MISORDERED. Client
>>>>>> can recover from this operation. Client can not recover from "Remote
>>>>>> EIO=E2=80=9D.
>>>>>>
>>>>>
>>>>> Why not?
>>>>
>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>> recovery (are you suggesting we should?) and returns that to the
>>>> application.
>>>>
>>>
>>> I=E2=80=99m saying that if we get a SEQ_MISORDERED due to a previous in=
terrupt on that slot, then we should ignore the error in task->tk_status, a=
nd just retry after bumping the slot seqid.
>>>
>>
>> I'm confused where your objection lies. Are you ok with bumping the
>> sequence # when task->tk_status =3D 1 and saying that we should still
>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>> slot?
>> Wouldn't that create a difference of 2 slots for the server that has
>> received the original request?
>>
>
> I=E2=80=99m saying I=E2=80=99d prefer to keep the current code, but fix t=
he retry that is apparently broken. If we=E2=80=99re not ignoring the task-=
>tk_error when we decide to retry, then that=E2=80=99s a bug in my opinion.

I'm not understand what you are suggestion. I do better with example
so allow me:

REMOVE used slot 0 seq=3D00000036 received ctrl-c
nfs41_sequence_done() gets called task->tk_status =3D 1:
slot->interrupted is set to 1. slot is freed.

next operation comes in, in my case it's ACCESS. initialization of the
sequence uses slot 0 seq=3D00000036
server replies with REMOVE

client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
returns EREMOTEIO. handle error just returns that error.

where do we retry?

2016-09-23 19:57:45

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTU6MjcsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCAzOjA3IFBNLCBU
cm9uZCBNeWtsZWJ1c3QNCj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToNCj4+IA0K
Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTQ6NDEsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+Pj4gDQo+Pj4gT24gRnJpLCBTZXAgMjMsIDIwMTYgYXQgMjozNCBQ
TSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToN
Cj4+Pj4gDQo+Pj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDE0OjI1LCBPbGdhIEtvcm5pZXZza2Fp
YSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+Pj4gDQo+Pj4+PiBPbiBGcmksIFNlcCAyMywg
MjAxNiBhdCAyOjA4IFBNLCBUcm9uZCBNeWtsZWJ1c3QNCj4+Pj4+IDx0cm9uZG15QHByaW1hcnlk
YXRhLmNvbT4gd3JvdGU6DQo+Pj4+Pj4gDQo+Pj4+Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6
NTksIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVtaWNoLmVkdT4gd3JvdGU6DQo+Pj4+Pj4+IA0K
Pj4+Pj4+PiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCAxOjQ1IFBNLCBUcm9uZCBNeWtsZWJ1c3QN
Cj4+Pj4+Pj4gPHRyb25kbXlAcHJpbWFyeWRhdGEuY29tPiB3cm90ZToNCj4+Pj4+Pj4+IA0KPj4+
Pj4+Pj4+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTM6NDAsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xv
QHVtaWNoLmVkdT4gd3JvdGU6DQo+Pj4+Pj4+Pj4gDQo+Pj4+Pj4+Pj4gSWYgd2UgaW5zdGVhZCBi
dW1wIHRoZSBzZXF1ZW5jZSBudW1iZXIgaW4gdGhlIGNhc2Ugb2YgaW50ZXJydXB0ZWQgYW5kIGRv
Og0KPj4+Pj4+Pj4gDQo+Pj4+Pj4+PiBZb3UgaGF2ZSBubyBndWFyYW50ZWVzIHRoYXQgdGhlIHNl
cnZlciBoYXMgc2VlbiBhbmQgcHJvY2Vzc2VkIHRoZSBvcGVyYXRpb24uDQo+Pj4+Pj4+IA0KPj4+
Pj4+PiBUaGF0IGlzIGNvcnJlY3QsIGkgaGF2ZSB0ZXN0ZWQgdGhlIHBhdGNoIGFuZCBtYWRlIHNl
cnZlciBuZXZlciB0bw0KPj4+Pj4+PiByZWNlaXZlIHRoZSBvcGVyYXRpb24gYW5kIGNsaWVudCBo
YXZlIGFuIGludGVycnVwdGVkIHNsb3QuIE9uIHRoZSBuZXh0DQo+Pj4+Pj4+IG9wZXJhdGlvbiB0
aGUgc2VydmVyIHdpbGwgY29tcGxhaW4gYmFjayB3aXRoIFNFUV9NSVNPUkRFUkVELiBDbGllbnQN
Cj4+Pj4+Pj4gY2FuIHJlY292ZXIgZnJvbSB0aGlzIG9wZXJhdGlvbi4gQ2xpZW50IGNhbiBub3Qg
cmVjb3ZlciBmcm9tICJSZW1vdGUNCj4+Pj4+Pj4gRUlP4oCdLg0KPj4+Pj4+PiANCj4+Pj4+PiAN
Cj4+Pj4+PiBXaHkgbm90Pw0KPj4+Pj4gDQo+Pj4+PiBXaGVuIFhEUiBsYXllciByZXR1cm5zIEVS
RU1PVEVJTyBpdCdzIG5vdCBoYW5kbGVkIGJ5IHRoZSBORlMgZXJyb3INCj4+Pj4+IHJlY292ZXJ5
IChhcmUgeW91IHN1Z2dlc3Rpbmcgd2Ugc2hvdWxkPykgIGFuZCByZXR1cm5zIHRoYXQgdG8gdGhl
DQo+Pj4+PiBhcHBsaWNhdGlvbi4NCj4+Pj4+IA0KPj4+PiANCj4+Pj4gSeKAmW0gc2F5aW5nIHRo
YXQgaWYgd2UgZ2V0IGEgU0VRX01JU09SREVSRUQgZHVlIHRvIGEgcHJldmlvdXMgaW50ZXJydXB0
IG9uIHRoYXQgc2xvdCwgdGhlbiB3ZSBzaG91bGQgaWdub3JlIHRoZSBlcnJvciBpbiB0YXNrLT50
a19zdGF0dXMsIGFuZCBqdXN0IHJldHJ5IGFmdGVyIGJ1bXBpbmcgdGhlIHNsb3Qgc2VxaWQuDQo+
Pj4+IA0KPj4+IA0KPj4+IEknbSBjb25mdXNlZCB3aGVyZSB5b3VyIG9iamVjdGlvbiBsaWVzLiBB
cmUgeW91IG9rIHdpdGggYnVtcGluZyB0aGUNCj4+PiBzZXF1ZW5jZSAjIHdoZW4gdGFzay0+dGtf
c3RhdHVzID0gMSBhbmQgc2F5aW5nIHRoYXQgd2Ugc2hvdWxkIHN0aWxsDQo+Pj4ga2VlcCB0aGUg
Y29kZSB0aGF0IEkgZGVsZXRlZCBpbiB0aGUgMm5kIGNodW5rIG9mIHRoZSBwYXRjaCB0aGF0IGJ1
bXBlZA0KPj4+IHRoZSBzZXFpZCBvbiBnZXR0aW5nIFNFUV9NSVNPUkRFUkVEIGR1ZSB0byBhIHBy
ZXZpb3VzbHkgaW50ZXJydXB0ZWQNCj4+PiBzbG90Pw0KPj4+IFdvdWxkbid0IHRoYXQgY3JlYXRl
IGEgZGlmZmVyZW5jZSBvZiAyIHNsb3RzIGZvciB0aGUgc2VydmVyIHRoYXQgaGFzDQo+Pj4gcmVj
ZWl2ZWQgdGhlIG9yaWdpbmFsIHJlcXVlc3Q/DQo+Pj4gDQo+PiANCj4+IEnigJltIHNheWluZyBJ
4oCZZCBwcmVmZXIgdG8ga2VlcCB0aGUgY3VycmVudCBjb2RlLCBidXQgZml4IHRoZSByZXRyeSB0
aGF0IGlzIGFwcGFyZW50bHkgYnJva2VuLiBJZiB3ZeKAmXJlIG5vdCBpZ25vcmluZyB0aGUgdGFz
ay0+dGtfZXJyb3Igd2hlbiB3ZSBkZWNpZGUgdG8gcmV0cnksIHRoZW4gdGhhdOKAmXMgYSBidWcg
aW4gbXkgb3Bpbmlvbi4NCj4gDQo+IEknbSBub3QgdW5kZXJzdGFuZCB3aGF0IHlvdSBhcmUgc3Vn
Z2VzdGlvbi4gSSBkbyBiZXR0ZXIgd2l0aCBleGFtcGxlDQo+IHNvIGFsbG93IG1lOg0KPiANCj4g
UkVNT1ZFIHVzZWQgc2xvdCAwIHNlcT0wMDAwMDAzNiByZWNlaXZlZCBjdHJsLWMNCj4gbmZzNDFf
c2VxdWVuY2VfZG9uZSgpIGdldHMgY2FsbGVkIHRhc2stPnRrX3N0YXR1cyA9IDE6DQo+IHNsb3Qt
PmludGVycnVwdGVkIGlzIHNldCB0byAxLiBzbG90IGlzIGZyZWVkLg0KPiANCj4gbmV4dCBvcGVy
YXRpb24gY29tZXMgaW4sIGluIG15IGNhc2UgaXQncyBBQ0NFU1MuIGluaXRpYWxpemF0aW9uIG9m
IHRoZQ0KPiBzZXF1ZW5jZSB1c2VzIHNsb3QgMCBzZXE9MDAwMDAwMzYNCj4gc2VydmVyIHJlcGxp
ZXMgd2l0aCBSRU1PVkUNCj4gDQo+IGNsaWVudCBjb2RlIHhkciBpbiBkZWNvZGVfb3BfaHJzKCkg
cmV0dXJucyBFUkVNT1RFSU8uIGRlY29kZV9hY2Nlc3MoKQ0KPiByZXR1cm5zIEVSRU1PVEVJTy4g
aGFuZGxlIGVycm9yIGp1c3QgcmV0dXJucyB0aGF0IGVycm9yLg0KPiANCj4gd2hlcmUgZG8gd2Ug
cmV0cnk/DQo+IA0KDQpUaGUgcmV0cnkgc2hvdWxkIGJlIGhhcHBlbmluZyB3aGVuIHdlIGV4aXQg
ZnJvbSBuZnM0MV9zZXF1ZW5jZV9kb25lKCkgYnkgcmVzdGFydGluZyB0aGUgUlBDLg==


2016-09-23 20:07:54

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <[email protected]> wrote:
>>
>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>> <[email protected]> wrote:
>>>
>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <[email protected]> wrote:
>>>>
>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>> <[email protected]> wrote:
>>>>>
>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <[email protected]> wrote:
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> wrot=
e:
>>>>>>>>
>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> wr=
ote:
>>>>>>>>>>
>>>>>>>>>> If we instead bump the sequence number in the case of interrupte=
d and do:
>>>>>>>>>
>>>>>>>>> You have no guarantees that the server has seen and processed the=
operation.
>>>>>>>>
>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>> receive the operation and client have an interrupted slot. On the =
next
>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Clien=
t
>>>>>>>> can recover from this operation. Client can not recover from "Remo=
te
>>>>>>>> EIO=E2=80=9D.
>>>>>>>>
>>>>>>>
>>>>>>> Why not?
>>>>>>
>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>> recovery (are you suggesting we should?) and returns that to the
>>>>>> application.
>>>>>>
>>>>>
>>>>> I=E2=80=99m saying that if we get a SEQ_MISORDERED due to a previous =
interrupt on that slot, then we should ignore the error in task->tk_status,=
and just retry after bumping the slot seqid.
>>>>>
>>>>
>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>> sequence # when task->tk_status =3D 1 and saying that we should still
>>>> keep the code that I deleted in the 2nd chunk of the patch that bumped
>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>> slot?
>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>> received the original request?
>>>>
>>>
>>> I=E2=80=99m saying I=E2=80=99d prefer to keep the current code, but fix=
the retry that is apparently broken. If we=E2=80=99re not ignoring the tas=
k->tk_error when we decide to retry, then that=E2=80=99s a bug in my opinio=
n.
>>
>> I'm not understand what you are suggestion. I do better with example
>> so allow me:
>>
>> REMOVE used slot 0 seq=3D00000036 received ctrl-c
>> nfs41_sequence_done() gets called task->tk_status =3D 1:
>> slot->interrupted is set to 1. slot is freed.
>>
>> next operation comes in, in my case it's ACCESS. initialization of the
>> sequence uses slot 0 seq=3D00000036
>> server replies with REMOVE
>>
>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>> returns EREMOTEIO. handle error just returns that error.
>>
>> where do we retry?
>>
>
> The retry should be happening when we exit from nfs41_sequence_done() by =
restarting the RPC.

Are you suggestion that REMOVE is retried? Ok I can see that (though
I'm not sure why a killed task suppose to be retried. Wasn't it killed
for a reason?). But if you are saying ACCESS should be retried then I
don't see how it can fit into the code flow.

2016-09-23 20:25:04

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 4:07 PM, Olga Kornievskaia <[email protected]> wrote:
> On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
> <[email protected]> wrote:
>>
>>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <[email protected]> wrote:
>>>
>>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>>> <[email protected]> wrote:
>>>>
>>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <[email protected]> wrote:
>>>>>
>>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <[email protected]> wrote=
:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> wro=
te:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]> w=
rote:
>>>>>>>>>>>
>>>>>>>>>>> If we instead bump the sequence number in the case of interrupt=
ed and do:
>>>>>>>>>>
>>>>>>>>>> You have no guarantees that the server has seen and processed th=
e operation.
>>>>>>>>>
>>>>>>>>> That is correct, i have tested the patch and made server never to
>>>>>>>>> receive the operation and client have an interrupted slot. On the=
next
>>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Clie=
nt
>>>>>>>>> can recover from this operation. Client can not recover from "Rem=
ote
>>>>>>>>> EIO=E2=80=9D.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Why not?
>>>>>>>
>>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS error
>>>>>>> recovery (are you suggesting we should?) and returns that to the
>>>>>>> application.
>>>>>>>
>>>>>>
>>>>>> I=E2=80=99m saying that if we get a SEQ_MISORDERED due to a previous=
interrupt on that slot, then we should ignore the error in task->tk_status=
, and just retry after bumping the slot seqid.
>>>>>>
>>>>>
>>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>>> sequence # when task->tk_status =3D 1 and saying that we should still
>>>>> keep the code that I deleted in the 2nd chunk of the patch that bumpe=
d
>>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>>> slot?
>>>>> Wouldn't that create a difference of 2 slots for the server that has
>>>>> received the original request?
>>>>>
>>>>
>>>> I=E2=80=99m saying I=E2=80=99d prefer to keep the current code, but fi=
x the retry that is apparently broken. If we=E2=80=99re not ignoring the ta=
sk->tk_error when we decide to retry, then that=E2=80=99s a bug in my opini=
on.
>>>
>>> I'm not understand what you are suggestion. I do better with example
>>> so allow me:
>>>
>>> REMOVE used slot 0 seq=3D00000036 received ctrl-c
>>> nfs41_sequence_done() gets called task->tk_status =3D 1:
>>> slot->interrupted is set to 1. slot is freed.
>>>
>>> next operation comes in, in my case it's ACCESS. initialization of the
>>> sequence uses slot 0 seq=3D00000036
>>> server replies with REMOVE
>>>
>>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>>> returns EREMOTEIO. handle error just returns that error.
>>>
>>> where do we retry?
>>>
>>
>> The retry should be happening when we exit from nfs41_sequence_done() by=
restarting the RPC.
>
> Are you suggestion that REMOVE is retried? Ok I can see that (though
> I'm not sure why a killed task suppose to be retried. Wasn't it killed
> for a reason?). But if you are saying ACCESS should be retried then I
> don't see how it can fit into the code flow.

I'm still hung up on the fact your suggestion of "retry". There is no
retry. You wrote "if we get a SEQ_MISORDERED" we never get
"SEQ_MISORDERED".

I can see if you want to add to error_handling that we check if error
is EREMOTEIO and check that slot->interrupted is set, then we try?

2016-09-23 20:34:55

by Trond Myklebust

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

DQo+IE9uIFNlcCAyMywgMjAxNiwgYXQgMTY6MjUsIE9sZ2EgS29ybmlldnNrYWlhIDxhZ2xvQHVt
aWNoLmVkdT4gd3JvdGU6DQo+IA0KPiBPbiBGcmksIFNlcCAyMywgMjAxNiBhdCA0OjA3IFBNLCBP
bGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4gT24gRnJpLCBTZXAg
MjMsIDIwMTYgYXQgMzo1NyBQTSwgVHJvbmQgTXlrbGVidXN0DQo+PiA8dHJvbmRteUBwcmltYXJ5
ZGF0YS5jb20+IHdyb3RlOg0KPj4+IA0KPj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDE1OjI3LCBP
bGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+PiANCj4+Pj4gT24g
RnJpLCBTZXAgMjMsIDIwMTYgYXQgMzowNyBQTSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+IDx0cm9u
ZG15QHByaW1hcnlkYXRhLmNvbT4gd3JvdGU6DQo+Pj4+PiANCj4+Pj4+PiBPbiBTZXAgMjMsIDIw
MTYsIGF0IDE0OjQxLCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0K
Pj4+Pj4+IA0KPj4+Pj4+IE9uIEZyaSwgU2VwIDIzLCAyMDE2IGF0IDI6MzQgUE0sIFRyb25kIE15
a2xlYnVzdA0KPj4+Pj4+IDx0cm9uZG15QHByaW1hcnlkYXRhLmNvbT4gd3JvdGU6DQo+Pj4+Pj4+
IA0KPj4+Pj4+Pj4gT24gU2VwIDIzLCAyMDE2LCBhdCAxNDoyNSwgT2xnYSBLb3JuaWV2c2thaWEg
PGFnbG9AdW1pY2guZWR1PiB3cm90ZToNCj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4gT24gRnJpLCBTZXAg
MjMsIDIwMTYgYXQgMjowOCBQTSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+Pj4+PiA8dHJvbmRteUBw
cmltYXJ5ZGF0YS5jb20+IHdyb3RlOg0KPj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4+PiBPbiBTZXAgMjMs
IDIwMTYsIGF0IDEzOjU5LCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3Rl
Og0KPj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4gT24gRnJpLCBTZXAgMjMsIDIwMTYgYXQgMTo0NSBQ
TSwgVHJvbmQgTXlrbGVidXN0DQo+Pj4+Pj4+Pj4+IDx0cm9uZG15QHByaW1hcnlkYXRhLmNvbT4g
d3JvdGU6DQo+Pj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4+PiBPbiBTZXAgMjMsIDIwMTYsIGF0IDEz
OjQwLCBPbGdhIEtvcm5pZXZza2FpYSA8YWdsb0B1bWljaC5lZHU+IHdyb3RlOg0KPj4+Pj4+Pj4+
Pj4+IA0KPj4+Pj4+Pj4+Pj4+IElmIHdlIGluc3RlYWQgYnVtcCB0aGUgc2VxdWVuY2UgbnVtYmVy
IGluIHRoZSBjYXNlIG9mIGludGVycnVwdGVkIGFuZCBkbzoNCj4+Pj4+Pj4+Pj4+IA0KPj4+Pj4+
Pj4+Pj4gWW91IGhhdmUgbm8gZ3VhcmFudGVlcyB0aGF0IHRoZSBzZXJ2ZXIgaGFzIHNlZW4gYW5k
IHByb2Nlc3NlZCB0aGUgb3BlcmF0aW9uLg0KPj4+Pj4+Pj4+PiANCj4+Pj4+Pj4+Pj4gVGhhdCBp
cyBjb3JyZWN0LCBpIGhhdmUgdGVzdGVkIHRoZSBwYXRjaCBhbmQgbWFkZSBzZXJ2ZXIgbmV2ZXIg
dG8NCj4+Pj4+Pj4+Pj4gcmVjZWl2ZSB0aGUgb3BlcmF0aW9uIGFuZCBjbGllbnQgaGF2ZSBhbiBp
bnRlcnJ1cHRlZCBzbG90LiBPbiB0aGUgbmV4dA0KPj4+Pj4+Pj4+PiBvcGVyYXRpb24gdGhlIHNl
cnZlciB3aWxsIGNvbXBsYWluIGJhY2sgd2l0aCBTRVFfTUlTT1JERVJFRC4gQ2xpZW50DQo+Pj4+
Pj4+Pj4+IGNhbiByZWNvdmVyIGZyb20gdGhpcyBvcGVyYXRpb24uIENsaWVudCBjYW4gbm90IHJl
Y292ZXIgZnJvbSAiUmVtb3RlDQo+Pj4+Pj4+Pj4+IEVJT+KAnS4NCj4+Pj4+Pj4+Pj4gDQo+Pj4+
Pj4+Pj4gDQo+Pj4+Pj4+Pj4gV2h5IG5vdD8NCj4+Pj4+Pj4+IA0KPj4+Pj4+Pj4gV2hlbiBYRFIg
bGF5ZXIgcmV0dXJucyBFUkVNT1RFSU8gaXQncyBub3QgaGFuZGxlZCBieSB0aGUgTkZTIGVycm9y
DQo+Pj4+Pj4+PiByZWNvdmVyeSAoYXJlIHlvdSBzdWdnZXN0aW5nIHdlIHNob3VsZD8pICBhbmQg
cmV0dXJucyB0aGF0IHRvIHRoZQ0KPj4+Pj4+Pj4gYXBwbGljYXRpb24uDQo+Pj4+Pj4+PiANCj4+
Pj4+Pj4gDQo+Pj4+Pj4+IEnigJltIHNheWluZyB0aGF0IGlmIHdlIGdldCBhIFNFUV9NSVNPUkRF
UkVEIGR1ZSB0byBhIHByZXZpb3VzIGludGVycnVwdCBvbiB0aGF0IHNsb3QsIHRoZW4gd2Ugc2hv
dWxkIGlnbm9yZSB0aGUgZXJyb3IgaW4gdGFzay0+dGtfc3RhdHVzLCBhbmQganVzdCByZXRyeSBh
ZnRlciBidW1waW5nIHRoZSBzbG90IHNlcWlkLg0KPj4+Pj4+PiANCj4+Pj4+PiANCj4+Pj4+PiBJ
J20gY29uZnVzZWQgd2hlcmUgeW91ciBvYmplY3Rpb24gbGllcy4gQXJlIHlvdSBvayB3aXRoIGJ1
bXBpbmcgdGhlDQo+Pj4+Pj4gc2VxdWVuY2UgIyB3aGVuIHRhc2stPnRrX3N0YXR1cyA9IDEgYW5k
IHNheWluZyB0aGF0IHdlIHNob3VsZCBzdGlsbA0KPj4+Pj4+IGtlZXAgdGhlIGNvZGUgdGhhdCBJ
IGRlbGV0ZWQgaW4gdGhlIDJuZCBjaHVuayBvZiB0aGUgcGF0Y2ggdGhhdCBidW1wZWQNCj4+Pj4+
PiB0aGUgc2VxaWQgb24gZ2V0dGluZyBTRVFfTUlTT1JERVJFRCBkdWUgdG8gYSBwcmV2aW91c2x5
IGludGVycnVwdGVkDQo+Pj4+Pj4gc2xvdD8NCj4+Pj4+PiBXb3VsZG4ndCB0aGF0IGNyZWF0ZSBh
IGRpZmZlcmVuY2Ugb2YgMiBzbG90cyBmb3IgdGhlIHNlcnZlciB0aGF0IGhhcw0KPj4+Pj4+IHJl
Y2VpdmVkIHRoZSBvcmlnaW5hbCByZXF1ZXN0Pw0KPj4+Pj4+IA0KPj4+Pj4gDQo+Pj4+PiBJ4oCZ
bSBzYXlpbmcgSeKAmWQgcHJlZmVyIHRvIGtlZXAgdGhlIGN1cnJlbnQgY29kZSwgYnV0IGZpeCB0
aGUgcmV0cnkgdGhhdCBpcyBhcHBhcmVudGx5IGJyb2tlbi4gSWYgd2XigJlyZSBub3QgaWdub3Jp
bmcgdGhlIHRhc2stPnRrX2Vycm9yIHdoZW4gd2UgZGVjaWRlIHRvIHJldHJ5LCB0aGVuIHRoYXTi
gJlzIGEgYnVnIGluIG15IG9waW5pb24uDQo+Pj4+IA0KPj4+PiBJJ20gbm90IHVuZGVyc3RhbmQg
d2hhdCB5b3UgYXJlIHN1Z2dlc3Rpb24uIEkgZG8gYmV0dGVyIHdpdGggZXhhbXBsZQ0KPj4+PiBz
byBhbGxvdyBtZToNCj4+Pj4gDQo+Pj4+IFJFTU9WRSB1c2VkIHNsb3QgMCBzZXE9MDAwMDAwMzYg
cmVjZWl2ZWQgY3RybC1jDQo+Pj4+IG5mczQxX3NlcXVlbmNlX2RvbmUoKSBnZXRzIGNhbGxlZCB0
YXNrLT50a19zdGF0dXMgPSAxOg0KPj4+PiBzbG90LT5pbnRlcnJ1cHRlZCBpcyBzZXQgdG8gMS4g
c2xvdCBpcyBmcmVlZC4NCj4+Pj4gDQo+Pj4+IG5leHQgb3BlcmF0aW9uIGNvbWVzIGluLCBpbiBt
eSBjYXNlIGl0J3MgQUNDRVNTLiBpbml0aWFsaXphdGlvbiBvZiB0aGUNCj4+Pj4gc2VxdWVuY2Ug
dXNlcyBzbG90IDAgc2VxPTAwMDAwMDM2DQo+Pj4+IHNlcnZlciByZXBsaWVzIHdpdGggUkVNT1ZF
DQo+Pj4+IA0KPj4+PiBjbGllbnQgY29kZSB4ZHIgaW4gZGVjb2RlX29wX2hycygpIHJldHVybnMg
RVJFTU9URUlPLiBkZWNvZGVfYWNjZXNzKCkNCj4+Pj4gcmV0dXJucyBFUkVNT1RFSU8uIGhhbmRs
ZSBlcnJvciBqdXN0IHJldHVybnMgdGhhdCBlcnJvci4NCj4+Pj4gDQo+Pj4+IHdoZXJlIGRvIHdl
IHJldHJ5Pw0KPj4+PiANCj4+PiANCj4+PiBUaGUgcmV0cnkgc2hvdWxkIGJlIGhhcHBlbmluZyB3
aGVuIHdlIGV4aXQgZnJvbSBuZnM0MV9zZXF1ZW5jZV9kb25lKCkgYnkgcmVzdGFydGluZyB0aGUg
UlBDLg0KPj4gDQo+PiBBcmUgeW91IHN1Z2dlc3Rpb24gdGhhdCBSRU1PVkUgaXMgcmV0cmllZD8g
T2sgSSBjYW4gc2VlIHRoYXQgKHRob3VnaA0KPj4gSSdtIG5vdCBzdXJlIHdoeSBhIGtpbGxlZCB0
YXNrIHN1cHBvc2UgdG8gYmUgcmV0cmllZC4gV2Fzbid0IGl0IGtpbGxlZA0KPj4gZm9yIGEgcmVh
c29uPykuIEJ1dCBpZiB5b3UgYXJlIHNheWluZyBBQ0NFU1Mgc2hvdWxkIGJlIHJldHJpZWQgdGhl
biBJDQo+PiBkb24ndCBzZWUgaG93IGl0IGNhbiBmaXQgaW50byB0aGUgY29kZSBmbG93Lg0KPiAN
Cj4gSSdtIHN0aWxsIGh1bmcgdXAgb24gdGhlIGZhY3QgeW91ciBzdWdnZXN0aW9uIG9mICJyZXRy
eSIuIFRoZXJlIGlzIG5vDQo+IHJldHJ5LiBZb3Ugd3JvdGUgImlmIHdlIGdldCBhIFNFUV9NSVNP
UkRFUkVEIiB3ZSBuZXZlciBnZXQNCj4gIlNFUV9NSVNPUkRFUkVEIi4NCj4gDQo+IEkgY2FuIHNl
ZSBpZiB5b3Ugd2FudCB0byBhZGQgdG8gZXJyb3JfaGFuZGxpbmcgdGhhdCB3ZSBjaGVjayBpZiBl
cnJvcg0KPiBpcyBFUkVNT1RFSU8gYW5kIGNoZWNrIHRoYXQgc2xvdC0+aW50ZXJydXB0ZWQgaXMg
c2V0LCB0aGVuIHdlIHRyeT8NCj4gDQoNClllcywgdGhhdOKAmXMgd2hhdCBJ4oCZZCBob3BlIHRv
IHNlZS4NCg0K


2016-09-23 20:39:00

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

On Fri, Sep 23, 2016 at 4:34 PM, Trond Myklebust
<[email protected]> wrote:
>
>> On Sep 23, 2016, at 16:25, Olga Kornievskaia <[email protected]> wrote:
>>
>> On Fri, Sep 23, 2016 at 4:07 PM, Olga Kornievskaia <[email protected]> wrot=
e:
>>> On Fri, Sep 23, 2016 at 3:57 PM, Trond Myklebust
>>> <[email protected]> wrote:
>>>>
>>>>> On Sep 23, 2016, at 15:27, Olga Kornievskaia <[email protected]> wrote:
>>>>>
>>>>> On Fri, Sep 23, 2016 at 3:07 PM, Trond Myklebust
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> On Sep 23, 2016, at 14:41, Olga Kornievskaia <[email protected]> wrote=
:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2016 at 2:34 PM, Trond Myklebust
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On Sep 23, 2016, at 14:25, Olga Kornievskaia <[email protected]> wro=
te:
>>>>>>>>>
>>>>>>>>> On Fri, Sep 23, 2016 at 2:08 PM, Trond Myklebust
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On Sep 23, 2016, at 13:59, Olga Kornievskaia <[email protected]> w=
rote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 23, 2016 at 1:45 PM, Trond Myklebust
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 23, 2016, at 13:40, Olga Kornievskaia <[email protected]>=
wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we instead bump the sequence number in the case of interru=
pted and do:
>>>>>>>>>>>>
>>>>>>>>>>>> You have no guarantees that the server has seen and processed =
the operation.
>>>>>>>>>>>
>>>>>>>>>>> That is correct, i have tested the patch and made server never =
to
>>>>>>>>>>> receive the operation and client have an interrupted slot. On t=
he next
>>>>>>>>>>> operation the server will complain back with SEQ_MISORDERED. Cl=
ient
>>>>>>>>>>> can recover from this operation. Client can not recover from "R=
emote
>>>>>>>>>>> EIO=E2=80=9D.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Why not?
>>>>>>>>>
>>>>>>>>> When XDR layer returns EREMOTEIO it's not handled by the NFS erro=
r
>>>>>>>>> recovery (are you suggesting we should?) and returns that to the
>>>>>>>>> application.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I=E2=80=99m saying that if we get a SEQ_MISORDERED due to a previo=
us interrupt on that slot, then we should ignore the error in task->tk_stat=
us, and just retry after bumping the slot seqid.
>>>>>>>>
>>>>>>>
>>>>>>> I'm confused where your objection lies. Are you ok with bumping the
>>>>>>> sequence # when task->tk_status =3D 1 and saying that we should sti=
ll
>>>>>>> keep the code that I deleted in the 2nd chunk of the patch that bum=
ped
>>>>>>> the seqid on getting SEQ_MISORDERED due to a previously interrupted
>>>>>>> slot?
>>>>>>> Wouldn't that create a difference of 2 slots for the server that ha=
s
>>>>>>> received the original request?
>>>>>>>
>>>>>>
>>>>>> I=E2=80=99m saying I=E2=80=99d prefer to keep the current code, but =
fix the retry that is apparently broken. If we=E2=80=99re not ignoring the =
task->tk_error when we decide to retry, then that=E2=80=99s a bug in my opi=
nion.
>>>>>
>>>>> I'm not understand what you are suggestion. I do better with example
>>>>> so allow me:
>>>>>
>>>>> REMOVE used slot 0 seq=3D00000036 received ctrl-c
>>>>> nfs41_sequence_done() gets called task->tk_status =3D 1:
>>>>> slot->interrupted is set to 1. slot is freed.
>>>>>
>>>>> next operation comes in, in my case it's ACCESS. initialization of th=
e
>>>>> sequence uses slot 0 seq=3D00000036
>>>>> server replies with REMOVE
>>>>>
>>>>> client code xdr in decode_op_hrs() returns EREMOTEIO. decode_access()
>>>>> returns EREMOTEIO. handle error just returns that error.
>>>>>
>>>>> where do we retry?
>>>>>
>>>>
>>>> The retry should be happening when we exit from nfs41_sequence_done() =
by restarting the RPC.
>>>
>>> Are you suggestion that REMOVE is retried? Ok I can see that (though
>>> I'm not sure why a killed task suppose to be retried. Wasn't it killed
>>> for a reason?). But if you are saying ACCESS should be retried then I
>>> don't see how it can fit into the code flow.
>>
>> I'm still hung up on the fact your suggestion of "retry". There is no
>> retry. You wrote "if we get a SEQ_MISORDERED" we never get
>> "SEQ_MISORDERED".
>>
>> I can see if you want to add to error_handling that we check if error
>> is EREMOTEIO and check that slot->interrupted is set, then we try?
>>
>
> Yes, that=E2=80=99s what I=E2=80=99d hope to see.

Ok I understand now and would try that.

2017-04-22 16:26:28

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: reuse of slot and seq# when RPC was interrupted

Hi Trond,

I'm still having issues with interrupted slots.

1. Client sent out an operation (X with cacheme=3Dtrue). It's waiting on
the reply by got a ctrl-c.
2. Client sends out the operation which happens to be CLOSE. It uses
the same slot and sequence id as operation X.
3. Server replies back to CLOSE with ERR_DELAY
4. Server replies to operation X
5. Client does the delay and resends the CLOSE using the same slot and
sequence id as before.
6. Server reply with X (out of the replay cache).
7. Client decoding returns back EREMOTEIO and error handling doesn't
do anything about it.

Now the problem is open state is left on the server.

If the server were to not send an ERR_DELAY first but just reply with
X then the code handles that and retries (commit a865880e20).