2012-09-18 09:50:20

by William Dauchy

Subject: unhandled error -10026

Hello,

I'm getting a trace following an unhandled error on a Linux NFS client
running kernel 3.4.7 (x86_64).

Any hint?

NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
------------[ cut here ]------------
kernel BUG at net/sunrpc/sched.c:699!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU 19
Pid: 17671, comm: kworker/19:12 Not tainted 3.4.7 /0D61XP
RIP: 0010:[<ffffffff81444ea9>] [<ffffffff81444ea9>] __rpc_execute+0x1a9/0x1b0
RSP: 0018:ffff880bb8913db0 EFLAGS: 00010202
RAX: 0000000000000007 RBX: ffff880bec6e2080 RCX: ffff880c3fcec188
RDX: ffff880c3fcec188 RSI: 0000000000000000 RDI: ffff880bec6e2080
RBP: ffff880bec6e20f0 R08: ffff880bb64bd4c0 R09: 0000000000029440
R10: 0000000000000000 R11: 000000000000000c R12: 0000000000000001
R13: ffff880c3fcf1305 R14: 0000000000000000 R15: ffff880bec6e2108
FS: 0000000000000000(0000) GS:ffff880c3fce0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000078080280000 CR3: 000000000149d000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/19:12 (pid: 17671, threadinfo ffff880bb64bd4c0, task ffff880bb64bd070)
Stack:
000000000000df00 ffff880bb64bd4c0 ffff880bb64bd4c8 ffff880bb64bd070
ffff880c3fcec180 ffff880c3fcf1300 ffff880c3fcf1305 ffffffff81444efd
ffff8803e327df00 ffffffff810503b8 000000000000df00 ffffffff81444ee0
Call Trace:
[<ffffffff81444efd>] ? rpc_async_schedule+0x1d/0x30
[<ffffffff810503b8>] ? process_one_work+0x108/0x3a0
[<ffffffff81444ee0>] ? rpc_execute+0x30/0x30
[<ffffffff81050aa1>] ? worker_thread+0x151/0x420
[<ffffffff81050950>] ? rescuer_thread+0x300/0x300
[<ffffffff81050950>] ? rescuer_thread+0x300/0x300
[<ffffffff81054ebe>] ? kthread+0x9e/0xb0
[<ffffffff8147bbb4>] ? kernel_thread_helper+0x4/0x10
[<ffffffff81479e78>] ? retint_restore_args+0x6/0x6
[<ffffffff81054e20>] ? kthread_freezable_should_stop+0x60/0x60
[<ffffffff8147bbb0>] ? gs_change+0xb/0xb
Code: e8 3d fe ff ff eb dd f0 ff 0b 71 05 f0 ff 03 cd 04 0f 94 c0 84 c0 0f 85 7b ff ff ff 48 83 c4 18 5b 5d 41 5c 41 5d c3 0f 0b eb fe <0f> 0b eb fe 0f 1f 00 53 48 89 fb f0 80 4f 70 04 e8 72 f7 ff ff
RIP [<ffffffff81444ea9>] __rpc_execute+0x1a9/0x1b0
RSP <ffff880bb8913db0>
---[ end trace 27110931730a34ea ]---

--
William


2012-09-20 17:47:30

by Andy Adamson

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 12:17 PM, J. Bruce Fields <[email protected]> wrote:
> On Thu, Sep 20, 2012 at 12:06:48PM -0400, Andy Adamson wrote:
>> On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
>> > On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
>> >> I'm getting a trace following an unhandled error on a linux nfs client
>> >> 3.4.7 x86_64.
>> >> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
>> >
>> > For the moment I don't know if the error is coming from a bad server
>> > implementation or if it's on client side. Should I assume that this is an
>> > error that should never hit the client?
>>
>> Yes.
>>
>> The client only sends OPEN reclaims after noting the server has
>> rebooted due to previously receiving an NFS4ERR_STALE_CLIENTID or
>> NFS4ERR_STALE_STATEID error from a stateful operation (RENEW, OPEN,
>> OPEN_DOWNGRADE, OPEN_CONFIRM, CLOSE, LOCK, LOCKU) which triggers the
>> client to establish a new clientid via
>> SETCLIENTID/SETCLIENTID_CONFIRM.
>>
>> Upon server reboot, all state that the previous server instance had is
>> invalid - including OPEN seqid's. So, the server returning
>> NFS4ERR_BAD_SEQID (10026) on an OPEN reclaim is illegal.
>
> Wait, but couldn't there be multiple reclaims using the same open owner,
> in which case later reclaims could in theory hit BAD_SEQID?

Nope.

3530 section 9.1.6. Sequencing of Lock Requests

Note that for requests that contain a sequence number, for each
state-owner, there should be no more than one outstanding request.

-->Andy
>
> --b.

2012-09-20 14:35:06

by William Dauchy

Subject: Re: unhandled error -10026

On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
> I'm getting a trace following an unhandled error on a linux nfs client
> 3.4.7 x86_64.
> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state

For the moment I don't know if the error is coming from a bad server
implementation or if it's on client side. Should I assume that this is an
error that should never hit the client?

Regards,
--
William

2012-09-23 13:50:37

by William Dauchy

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 9:33 PM, J. Bruce Fields <[email protected]> wrote:
> William, is this easy to reproduce? Would it be possible to get a
> network trace covering the problem?

Unfortunately not. The issue rarely reproduces, so it is difficult for
me to debug; but if I find a way to reproduce it, I will come back with
more information.
I will also dig into the server-side code and check the cases in which
it could return this error.

Thank you for all this information,
--
William

2012-09-20 16:17:20

by J. Bruce Fields

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 12:06:48PM -0400, Andy Adamson wrote:
> On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
> > On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
> >> I'm getting a trace following an unhandled error on a linux nfs client
> >> 3.4.7 x86_64.
> >> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
> >
> > For the moment I don't know if the error is coming from a bad server
> > implementation or if it's on client side. Should I assume that this is an
> > error that should never hit the client?
>
> Yes.
>
> The client only sends OPEN reclaims after noting the server has
> rebooted due to previously receiving an NFS4ERR_STALE_CLIENTID or
> NFS4ERR_STALE_STATEID error from a stateful operation (RENEW, OPEN,
> OPEN_DOWNGRADE, OPEN_CONFIRM, CLOSE, LOCK, LOCKU) which triggers the
> client to establish a new clientid via
> SETCLIENTID/SETCLIENTID_CONFIRM.
>
> Upon server reboot, all state that the previous server instance had is
> invalid - including OPEN seqid's. So, the server returning
> NFS4ERR_BAD_SEQID (10026) on an OPEN reclaim is illegal.

Wait, but couldn't there be multiple reclaims using the same open owner,
in which case later reclaims could in theory hit BAD_SEQID?

--b.

2012-09-20 19:33:51

by J. Bruce Fields

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 01:53:44PM -0400, Andy Adamson wrote:
> On Thu, Sep 20, 2012 at 1:47 PM, Andy Adamson <[email protected]> wrote:
> > On Thu, Sep 20, 2012 at 12:17 PM, J. Bruce Fields <[email protected]> wrote:
> >> On Thu, Sep 20, 2012 at 12:06:48PM -0400, Andy Adamson wrote:
> >>> On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
> >>> > On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
> >>> >> I'm getting a trace following an unhandled error on a linux nfs client
> >>> >> 3.4.7 x86_64.
> >>> >> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
> >>> >
> >>> > For the moment I don't know if the error is coming from a bad server
> >>> > implementation or if it's on client side. Should I assume that this is an
> >>> > error that should never hit the client?
> >>>
> >>> Yes.
> >>>
> >>> The client only sends OPEN reclaims after noting the server has
> >>> rebooted due to previously receiving an NFS4ERR_STALE_CLIENTID or
> >>> NFS4ERR_STALE_STATEID error from a stateful operation (RENEW, OPEN,
> >>> OPEN_DOWNGRADE, OPEN_CONFIRM, CLOSE, LOCK, LOCKU) which triggers the
> >>> client to establish a new clientid via
> >>> SETCLIENTID/SETCLIENTID_CONFIRM.
> >>>
> >>> Upon server reboot, all state that the previous server instance had is
> >>> invalid - including OPEN seqid's. So, the server returning
> >>> NFS4ERR_BAD_SEQID (10026) on an OPEN reclaim is illegal.
> >>
> >> Wait, but couldn't there be multiple reclaims using the same open owner,
> >> in which case later reclaims could in theory hit BAD_SEQID?
> >
> > Nope.
> >
> > 3530 section 9.1.6. Sequencing of Lock Requests
> >
> > Note that for requests that contain a sequence number, for each
> > state-owner, there should be no more than one outstanding request.
>
> Well - I sent this too soon :) . Yes, a buggy client could send
> (serialized) reclaims with a bad seqid, and get NFS4ERR_BAD_SEQID.
> Tough to do with the above constraint, but possible.

William, is this easy to reproduce? Would it be possible to get a
network trace covering the problem?

(Run tcpdump -s0 -w tmp.pcap, then send us tmp.pcap. Also feel free to
take a look at tmp.pcap with wireshark yourself; you may be able to find
the call that's returning BAD_SEQID. What we'll be curious about is
the sequence id sent on that call, and the sequence id on any preceding
operations using the same open owner.)

--b.
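
As a rough illustration of the check described above, here is a minimal
sketch that walks a list of already-decoded calls and, for each call
that failed with BAD_SEQID, reports the seqids of the earlier calls from
the same open owner. The record layout (frame, op, owner, seqid, status)
and the sample values are made up for the example; they are not taken
from a real capture, and getting such records out of tmp.pcap is left to
whatever decoder you prefer.

```python
NFS4ERR_BAD_SEQID = 10026


def seqid_history(calls, status=NFS4ERR_BAD_SEQID):
    """Return (failed_call, earlier_calls_from_same_owner) pairs.

    `calls` is an iterable of dicts in capture order. The dict keys are
    hypothetical names chosen for this sketch.
    """
    seen = {}   # open owner -> calls seen so far, in order
    hits = []
    for call in calls:
        prior = seen.setdefault(call["owner"], [])
        if call["status"] == status:
            # Copy the list so later appends do not change the report.
            hits.append((call, list(prior)))
        prior.append(call)
    return hits


# Made-up records, for illustration only.
calls = [
    {"frame": 10, "op": "OPEN", "owner": "A", "seqid": 0, "status": 0},
    {"frame": 14, "op": "OPEN", "owner": "B", "seqid": 0, "status": 0},
    {"frame": 18, "op": "OPEN", "owner": "A", "seqid": 3, "status": 10026},
]

for failed, earlier in seqid_history(calls):
    print(failed["frame"], failed["seqid"], [c["seqid"] for c in earlier])
```

With the sample records, the failing call for owner "A" is paired with
that owner's single earlier call, which makes the jump in seqid easy to
spot.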

2012-09-20 18:09:17

by Andy Adamson

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
> On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
>> I'm getting a trace following an unhandled error on a linux nfs client
>> 3.4.7 x86_64.
>> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
>
> For the moment I don't know if the error is coming from a bad server
> implementation or if it's on client side. Should I assume that this is an
> error that should never hit the client?

Do you have a tcpdump trace you could share?

-->Andy

>
> Regards,
> --
> William
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-09-20 16:06:49

by Andy Adamson

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
> On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
>> I'm getting a trace following an unhandled error on a linux nfs client
>> 3.4.7 x86_64.
>> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
>
> For the moment I don't know if the error is coming from a bad server
> implementation or if it's on client side. Should I assume that this is an
> error that should never hit the client?

Yes.

The client only sends OPEN reclaims after noting the server has
rebooted due to previously receiving an NFS4ERR_STALE_CLIENTID or
NFS4ERR_STALE_STATEID error from a stateful operation (RENEW, OPEN,
OPEN_DOWNGRADE, OPEN_CONFIRM, CLOSE, LOCK, LOCKU) which triggers the
client to establish a new clientid via
SETCLIENTID/SETCLIENTID_CONFIRM.

Upon server reboot, all state that the previous server instance had is
invalid - including OPEN seqid's. So, the server returning
NFS4ERR_BAD_SEQID (10026) on an OPEN reclaim is illegal.

-->Andy
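
The seqid rule behind this argument can be sketched as a toy model. This
is a simplified illustration and not the code of any real server: the
class and method names are invented for the example, seqid wraparound is
ignored, and it assumes that a server with no record of an open owner
(as after a reboot) accepts whatever seqid that owner sends first.

```python
NFS4_OK = 0
NFS4ERR_BAD_SEQID = 10026  # the Linux client logs this as -10026


class OpenOwner:
    """Toy per-open-owner seqid tracking (hypothetical, for illustration)."""

    def __init__(self):
        # None means the server holds no state for this owner,
        # e.g. because it has just rebooted.
        self.last_seqid = None

    def handle(self, seqid):
        """Return (status, is_replay) for a seqid-bearing request."""
        if self.last_seqid is None:
            # Assumption: no prior state, so the first seqid is accepted.
            self.last_seqid = seqid
            return NFS4_OK, False
        if seqid == self.last_seqid + 1:
            # The next request in sequence: process it.
            self.last_seqid = seqid
            return NFS4_OK, False
        if seqid == self.last_seqid:
            # Retransmission of the last request. A real server would
            # return its cached reply; the model just flags the replay.
            return NFS4_OK, True
        # Anything else is out of sequence.
        return NFS4ERR_BAD_SEQID, False


owner = OpenOwner()
print(owner.handle(5))   # first request after "reboot"
print(owner.handle(6))   # in sequence
print(owner.handle(9))   # out of sequence
```

In this model the first reclaim for an owner cannot fail the seqid
check, because there is nothing to compare it against; only a later
request from the same owner can.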



>
> Regards,
> --
> William

2012-09-20 17:53:45

by Andy Adamson

Subject: Re: unhandled error -10026

On Thu, Sep 20, 2012 at 1:47 PM, Andy Adamson <[email protected]> wrote:
> On Thu, Sep 20, 2012 at 12:17 PM, J. Bruce Fields <[email protected]> wrote:
>> On Thu, Sep 20, 2012 at 12:06:48PM -0400, Andy Adamson wrote:
>>> On Thu, Sep 20, 2012 at 10:34 AM, William Dauchy <[email protected]> wrote:
>>> > On Tue, Sep 18, 2012 at 11:49 AM, William Dauchy <[email protected]> wrote:
>>> >> I'm getting a trace following an unhandled error on a linux nfs client
>>> >> 3.4.7 x86_64.
>>> >> NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
>>> >
>>> > For the moment I don't know if the error is coming from a bad server
>>> > implementation or if it's on client side. Should I assume that this is an
>>> > error that should never hit the client?
>>>
>>> Yes.
>>>
>>> The client only sends OPEN reclaims after noting the server has
>>> rebooted due to previously receiving an NFS4ERR_STALE_CLIENTID or
>>> NFS4ERR_STALE_STATEID error from a stateful operation (RENEW, OPEN,
>>> OPEN_DOWNGRADE, OPEN_CONFIRM, CLOSE, LOCK, LOCKU) which triggers the
>>> client to establish a new clientid via
>>> SETCLIENTID/SETCLIENTID_CONFIRM.
>>>
>>> Upon server reboot, all state that the previous server instance had is
>>> invalid - including OPEN seqid's. So, the server returning
>>> NFS4ERR_BAD_SEQID (10026) on an OPEN reclaim is illegal.
>>
>> Wait, but couldn't there be multiple reclaims using the same open owner,
>> in which case later reclaims could in theory hit BAD_SEQID?
>
> Nope.
>
> 3530 section 9.1.6. Sequencing of Lock Requests
>
> Note that for requests that contain a sequence number, for each
> state-owner, there should be no more than one outstanding request.

Well - I sent this too soon :) . Yes, a buggy client could send
(serialized) reclaims with a bad seqid, and get NFS4ERR_BAD_SEQID.
Tough to do with the above constraint, but possible.

-->Andy

>
> -->Andy
>>
>> --b.