2011-10-24 10:40:47

by David Flynn

[permalink] [raw]
Subject: NFS4 BAD_STATEID loop (kernel 3.0)

Dear All,

On a system running kernel 3.0, mounting a Solaris NFS4 export, we
observe a continuous 20Mbit/sec exchange between client and server that had
been occurring for 10 days.

From /proc/mounts:
home:/home/ /home nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.29.190.21,minorversion=0,local_lock=none,addr=172.29.120.140 0 0

Reconciling logs, we find that around the time the exchange started,
the kernel reports a process having blocked:

[787321.680029] INFO: task bash:17799 blocked for more than 120 seconds.
[787321.680067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[787321.680104] bash D 0000000000000002 0 17799 1 0x00000000
[787321.680131] ffff880173ad9ca8 0000000000000086 ffff88017216b008 0000000000012a40
[787321.680171] ffff880173ad9fd8 0000000000012a40 ffff880173ad8000 0000000000012a40
[787321.680211] 0000000000012a40 0000000000012a40 ffff880173ad9fd8 0000000000012a40
[787321.680251] Call Trace:
[787321.680275] [<ffffffff8110e900>] ? __lock_page+0x70/0x70
[787321.680299] [<ffffffff815fea2c>] io_schedule+0x8c/0xd0
[787321.680321] [<ffffffff8110e90e>] sleep_on_page+0xe/0x20
[787321.680344] [<ffffffff815ff2af>] __wait_on_bit+0x5f/0x90
[787321.680366] [<ffffffff8110ead3>] wait_on_page_bit+0x73/0x80
[787321.680390] [<ffffffff81085980>] ? autoremove_wake_function+0x40/0x40
[787321.680416] [<ffffffff8111aeb5>] ? pagevec_lookup_tag+0x25/0x40
[787321.680439] [<ffffffff8110ed06>] filemap_fdatawait_range+0xf6/0x1a0
[787321.680483] [<ffffffffa022e740>] ? nfs_destroy_directcache+0x20/0x20 [nfs]
[787321.680507] [<ffffffff8111a3b1>] ? do_writepages+0x21/0x40
[787321.680530] [<ffffffff8110ff8b>] ? __filemap_fdatawrite_range+0x5b/0x60
[787321.680555] [<ffffffff81110000>] filemap_write_and_wait_range+0x70/0x80
[787321.680580] [<ffffffff8119b45a>] vfs_fsync_range+0x5a/0x90
[787321.680602] [<ffffffff8119b4fc>] vfs_fsync+0x1c/0x20
[787321.680628] [<ffffffffa0222be4>] nfs_file_flush+0x54/0x80 [nfs]
[787321.680653] [<ffffffff8116d66f>] filp_close+0x3f/0x90
[787321.680675] [<ffffffff8116e097>] sys_close+0xb7/0x120
[787321.680698] [<ffffffff816090c2>] system_call_fastpath+0x16/0x1b

A network trace showed the exchange below being continually
repeated. Unfortunately this system too has since been rebooted.

Any thoughts on the matter would be most welcome,

Kind regards,

..david


No. Time Source Destination Protocol Size Info
9880 11:40:12.833617 172.29.190.21 172.29.120.140 NFS 1122 V4 COMPOUND Call (Reply In 9881) <EMPTY> PUTFH;WRITE;GETATTR

Frame 9880: 1122 bytes on wire (8976 bits), 1122 bytes captured (8976 bits)
Arrival Time: Oct 17, 2011 11:40:12.833617000 BST
Frame Length: 1122 bytes (8976 bits)
Capture Length: 1122 bytes (8976 bits)
Ethernet II, Src: ChelsioC_06:68:f9 (00:07:43:06:68:f9), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
Internet Protocol, Src: 172.29.190.21 (172.29.190.21), Dst: 172.29.120.140 (172.29.120.140)
Transmission Control Protocol, Src Port: 816 (816), Dst Port: nfs (2049), Seq: 5199745, Ack: 275801, Len: 1056
Remote Procedure Call, Type:Call XID:0x5daa6e93
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Tag: <EMPTY>
length: 0
contents: <EMPTY>
minorversion: 0
Operations (count: 3)
Opcode: PUTFH (22)
filehandle
length: 36
[hash (CRC-32): 0x6e4b15f3]
decode type as: unknown
filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
Opcode: WRITE (38)
stateid
seqid: 0x00000000
Data: 4e06f15b800f82e300000000
offset: 11392
stable: FILE_SYNC4 (2)
Write length: 814
Data: <DATA>
length: 814
contents: <DATA>
fill bytes: opaque data
Opcode: GETATTR (9)
GETATTR4args
attr_request
bitmap[0] = 0x00000018
[2 attributes requested]
mand_attr: FATTR4_CHANGE (3)
mand_attr: FATTR4_SIZE (4)
bitmap[1] = 0x00300000
[2 attributes requested]
recc_attr: FATTR4_TIME_METADATA (52)
recc_attr: FATTR4_TIME_MODIFY (53)

No. Time Source Destination Protocol Size Info
9881 11:40:12.833956 172.29.120.140 172.29.190.21 NFS 122 V4 COMPOUND Reply (Call In 9880) <EMPTY> PUTFH;WRITE

Frame 9881: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
Arrival Time: Oct 17, 2011 11:40:12.833956000 BST
[Time delta from previous captured frame: 0.000339000 seconds]
Frame Length: 122 bytes (976 bits)
Capture Length: 122 bytes (976 bits)
Ethernet II, Src: Cisco_1e:f7:80 (00:13:5f:1e:f7:80), Dst: ChelsioC_06:68:f9 (00:07:43:06:68:f9)
Internet Protocol, Src: 172.29.120.140 (172.29.120.140), Dst: 172.29.190.21 (172.29.190.21)
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 816 (816), Seq: 275801, Ack: 5200801, Len: 56
Remote Procedure Call, Type:Reply XID:0x5daa6e93
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Status: NFS4ERR_BAD_STATEID (10025)
Tag: <EMPTY>
length: 0
contents: <EMPTY>
Operations (count: 2)
Opcode: PUTFH (22)
Status: NFS4_OK (0)
Opcode: WRITE (38)
Status: NFS4ERR_BAD_STATEID (10025)
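
As a back-of-envelope sanity check, the observed ~20Mbit/sec is roughly
consistent with this single call/reply pair being retried in a tight loop.
The retry interval used below is an assumption (a figure of about 600us
between retries is mentioned later in this thread):

```python
# Rough throughput of the retry loop seen in the capture above.
# Frame sizes are taken from frames 9880/9881; the retry interval is an
# ASSUMPTION (~600us, a figure of that order appears later in this
# thread), not something measurable from these two frames alone.
CALL_BYTES = 1122          # frame 9880: V4 COMPOUND Call
REPLY_BYTES = 122          # frame 9881: V4 COMPOUND Reply
RETRY_INTERVAL_S = 600e-6  # assumed gap between successive retries

bits_per_retry = (CALL_BYTES + REPLY_BYTES) * 8
mbit_per_s = bits_per_retry / RETRY_INTERVAL_S / 1e6
print(f"~{mbit_per_s:.0f} Mbit/s sustained")  # same order as the observed 20 Mbit/s
```

The same loop with much larger writes scales up accordingly, which is
consistent with the 1.6Gbit/sec reported later in the thread from a
machine with a 10GbE interface.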


2011-10-24 15:56:19

by David Flynn

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0.4)

* Trond Myklebust ([email protected]) wrote:
> Date: Mon, 24 Oct 2011 17:31:42 +0200
> From: Trond Myklebust <[email protected]>
> To: David Flynn <[email protected]>
> Cc: [email protected], Chuck Lever <[email protected]>
> X-Mailer: Evolution 3.0.3 (3.0.3-1.fc15)
> Organization: NetApp Inc
> Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0.4)
>
> On Mon, 2011-10-24 at 14:50 +0000, David Flynn wrote:
> > No. Time Source Destination Protocol Size Info
> > 39 15:33:59.077143 172.29.190.28 172.29.120.140 NFS 370 V4 COMPOUND Call (Reply In 40) <EMPTY> PUTFH;WRITE;GETATTR
> >
> > Frame 39: 370 bytes on wire (2960 bits), 370 bytes captured (2960 bits)
> > Ethernet II, Src: ChelsioC_07:49:6f (00:07:43:07:49:6f), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
> > Internet Protocol, Src: 172.29.190.28 (172.29.190.28), Dst: 172.29.120.140 (172.29.120.140)
> > Transmission Control Protocol, Src Port: omginitialrefs (900), Dst Port: nfs (2049), Seq: 40433, Ack: 7449, Len: 304
> > Remote Procedure Call, Type:Call XID:0x43ce4e16
> > Network File System
> > [Program Version: 4]
> > [V4 Procedure: COMPOUND (1)]
> > Tag: <EMPTY>
> > length: 0
> > contents: <EMPTY>
> > minorversion: 0
> > Operations (count: 3)
> > Opcode: PUTFH (22)
> > filehandle
> > length: 36
> > [hash (CRC-32): 0x6e4b15f3]
> > decode type as: unknown
> > filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
> > Opcode: WRITE (38)
> > stateid
>
> Do you have an example of the stateid argument's value? Does it change
> at all between separate WRITE attempts?

Looking at two captures done an hour apart, no, i see no change.
(a packet capture has been supplied privately)

Regards,

..david

2011-10-29 18:29:30

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* Myklebust, Trond ([email protected]) wrote:
> No. Not in this case: the state manager was doing RENEW, then exiting
> because it was told all is A_OK by the server.
> The open() system call would be the one that was looping.

Sometimes it worked, sometimes not. From memory:
- kill -9 worked in situations that suffered from NFS4ERR_STALE_CLIENTID
- kill -9 failed in situations that suffered from BAD_STATEID

..david

2011-10-29 00:25:25

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* David Flynn ([email protected]) wrote:
> * Trond Myklebust ([email protected]) wrote:
> > Do you have an example of the stateid argument's value? Does it change
> > at all between separate WRITE attempts?
>
> Further to all this, i've just had a similar fault on another machine,

Using the same kernel, same mountpoint as before, we're currently
experiencing a loop involving NFS4ERR_STALE_CLIENTID.
Trace:
ftp://ftp.kw.bbc.co.uk/davidf/priv/saesheil.pcap

Unfortunately, this is resulting in about 40 nodes doing their best to
kill the poor Solaris server, generating a combined total of
250Mbit/sec towards the NFS server (and collecting a little under
200Mbit/sec of replies).

Have we not heard of exponential backoff?

This seems to require major attention, given that this amounted to a
site wide DoS: going round all the machines and killing the processes
that were having major problems brought the situation back under
control. Frankly i'd rather that you panicked the kernel than this.
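
The kind of client-side throttling being asked for here could be as simple
as capped exponential backoff with jitter. A minimal sketch (not actual
NFS client code; every parameter is an arbitrary assumption):

```python
import random

def backoff_delays(initial=0.1, cap=60.0, factor=2.0, attempts=10):
    """Yield capped exponential backoff delays with full jitter.

    A sketch of the retry throttling discussed above, not actual
    NFS client code; all parameters here are arbitrary assumptions.
    """
    delay = initial
    for _ in range(attempts):
        # Full jitter desynchronises the 40-odd clients so their
        # retries do not arrive at the server as one sustained flood.
        yield random.uniform(0, delay)
        delay = min(delay * factor, cap)

delays = list(backoff_delays())
```

Even a fixed one-second delay would cut the aggregate load by orders of
magnitude; the jitter additionally stops many nodes retrying in lockstep.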

Regards,
..david

2011-10-29 18:26:21

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: Chuck Lever [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 8:23 PM
> To: Myklebust, Trond
> Cc: David Flynn; [email protected]
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
>
> On Oct 29, 2011, at 2:22 PM, Myklebust, Trond wrote:
>
> >> -----Original Message-----
> >> From: David Flynn [mailto:[email protected]]
> >> Sent: Saturday, October 29, 2011 8:02 PM
> >> To: Myklebust, Trond
> >> Cc: David Flynn; [email protected]; Chuck Lever
> >> Subject: Re: NFS4ERR_STALE_CLIENTID loop
> >>
> >> * Trond Myklebust ([email protected]) wrote:
> >>>> Using the same kernel, same mountpoint as before, we're currently
> >>>> experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> >> ...
> >>> The problem seems like a split-brain issue on the server... On the one
> >>> hand, it is happily telling us that our lease is OK when we RENEW.
> >>> Then when we try to use said lease in an OPEN, it is replying with
> >>> STALE_CLIENTID.
> >>
> >> Thank you for the quick update, especially at the weekend. I'm
> >> wondering if it is possible that the STALE_CLIENTID issue is a
> >> by-product of the BAD_STATEID issue from earlier. We have observed
> >> the BAD_STATEID loop several times, but the CLIENTID problem only
> >> seemed to occur when all 40+ nodes were showing problems.
> >>
> >> After killing off sufficient processes, some of the machines then
> >> recovered of their own accord. So your conclusion that there is a
> >> server issue sounds reasonable.
> >>
> >> On any such possible backoff, the previous case was with quite small
> >> requests in quite a tight loop that seemed to cause the server grief.
> >> This morning, a machine with a 10GbE interface had a BAD_STATEID issue
> >> but involving some much larger writes[1], resulting in 1.6Gbit/sec from
> >> that machine alone. Thankfully there was only a second machine with
> >> 1GbE interfaces bringing the total up to 2.5Gbit/sec.
> >>
> >> It is this ability for a group of clients to make matters worse that
> >> is just as bad as any fault with Solaris.
> >
> > Sure, but gone are the days when NFS had "reference implementations"
> > that everyone had to interoperate with. NFSv4 is a fully documented
> > protocol which describes how both clients and servers are supposed to
> > work. If either the client or server fail to work according to that
> > documentation, then bad things will happen.
> >
> > While I could litter the client code with lots of little tricks to be
> > "defensive" in the face of buggy servers, that isn't going to solve
> > anything: the server will still be buggy, and the client will still be
> > faced with a situation that it cannot resolve.
> >
> >> (In a similar vein, it can be just as frustrating trying to get a
> >> client to stop looping like this - it is often impossible to kill
> >> the process that triggered the problem; for these, we had to resort
> >> to deleting the files using NFSv3 (which was working quite happily))
> >
> > 'kill -9' should in principle work to kill off the process. Was that
> > failing to work?
>
> The trick is knowing which process to kill. Generally you have to
> kill the state manager thread in this case.
>

No. Not in this case: the state manager was doing RENEW, then exiting
because it was told all is A_OK by the server.
The open() system call would be the one that was looping.

Trond


2011-10-29 18:48:03

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

On Sat, Oct 29, 2011 at 11:21:13AM -0700, Myklebust, Trond wrote:
> > From: J. Bruce Fields [mailto:[email protected]]
> > On Sat, Oct 29, 2011 at 07:29:36PM +0200, Trond Myklebust wrote:
> > > OK. This is the first time I've seen this tcpdump.
> > >
> > > The problem seems like a split-brain issue on the server... On the one
> > > hand, it is happily telling us that our lease is OK when we RENEW.
> > > Then when we try to use said lease in an OPEN, it is replying with
> > > STALE_CLIENTID.
> > >
> > > IOW: This isn't a problem I can fix on the client whether or not I add
> > > exponential backoff. The problem needs to be addressed on the server
> > > by the Solaris folks....
> >
> > Is there any simple thing we could do on the client to reduce the
> > impact of these sorts of loops?
>
> WHY? Those loops aren't supposed to happen if the server works according
> to spec.

Yes, and it's not something I care that strongly about, really, my only
observation is that this sort of failure (an implementation bug on one
side or another resulting in a loop) seems to have been common (based on
no hard data, just my vague memories of list threads), and the results
fairly obnoxious (possibly even for unrelated hosts on the network).
So if there's some simple way to fail more gracefully it might be
helpful.

--b.

2011-10-29 17:29:38

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

On Sat, 2011-10-29 at 00:25 +0000, David Flynn wrote:
> * David Flynn ([email protected]) wrote:
> > * Trond Myklebust ([email protected]) wrote:
> > > Do you have an example of the stateid argument's value? Does it change
> > > at all between separate WRITE attempts?
> >
> > Further to all this, i've just had a similar fault on another machine,
>
> Using the same kernel, same mountpoint as before, we're currently
> experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> Trace:
> ftp://ftp.kw.bbc.co.uk/davidf/priv/saesheil.pcap
>
> Unfortunately, this is resulting in about 40 nodes doing their best to
> kill the poor Solaris server, generating a combined total of
> 250Mbit/sec towards the NFS server (and collecting a little under
> 200Mbit/sec of replies).
>
> Have we not heard of exponential backoff?
>
> This seems to require major attention, given that this amounted to a
> site wide DoS: going round all the machines and killing the processes
> that were having major problems brought the situation back under
> control. Frankly i'd rather that you panicked the kernel than this.

OK. This is the first time I've seen this tcpdump.

The problem seems like a split-brain issue on the server... On the one
hand, it is happily telling us that our lease is OK when we RENEW. Then
when we try to use said lease in an OPEN, it is replying with
STALE_CLIENTID.

IOW: This isn't a problem I can fix on the client whether or not I add
exponential backoff. The problem needs to be addressed on the server by
the Solaris folks....

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-10-24 13:33:03

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0)

On Mon, 2011-10-24 at 13:17 +0000, David Flynn wrote:
> * Trond Myklebust ([email protected]) wrote:
> > We should in principle be able to recover a BAD_STATEID error by running
> > the state recovery thread. It's a shame that the machine was rebooted,
> > but does your syslog trace perhaps show any state recovery thread
> > errors?
>
> There were no other nfs related messages reported between the initial
> blocked task and rebooting the machine later. Additionally, there were
> no nfs related messages from bootup of the machine until the blocked
> task.
>
> One thing that may be of interest is that the user in question with the
> blocked task had hit their quota hard limit. (It was the same user that
> had the issue i reported earlier too on the same filesystem).
>
> Kind regards,
> ..david

I'm assuming then that your network trace showed no sign of any OPEN
calls of that particular file, just retries of the WRITE?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-10-27 22:18:32

by David Flynn

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0.4)

* Trond Myklebust ([email protected]) wrote:
> Do you have an example of the stateid argument's value? Does it change
> at all between separate WRITE attempts?

Further to all this, i've just had a similar fault on another machine,
producing huge amounts of:

[463795.630702] nfs4_reclaim_open_state: Lock reclaim failed!
[463795.637446] nfs4_reclaim_open_state: Lock reclaim failed!
[463795.643113] nfs4_reclaim_open_state: Lock reclaim failed!

A network capture is available:
ftp://ftp.kw.bbc.co.uk/davidf/priv/uekahrae.pcap

$ echo 0 | sudo tee /proc/sys/sunrpc/rpc_debug
[468024.010036] -pid- flgs status -client- --rqstp- -timeout ---ops--
[468024.010051] 6289 0801 0 ffff8801f3e37e00 (null) 0 ffffffffa0229d40 nfsv4 WRITE a:call_start q:NFS client
[468024.010057] 6290 0801 0 ffff8801f3e37e00 (null) 0 ffffffffa0229d40 nfsv4 WRITE a:call_start q:NFS client

blocked task:

[464304.799306] INFO: task rrdtool:28506 blocked for more than 120 seconds.
[464304.799309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464304.799311] rrdtool D 0000000000000001 0 28506 4189 0x00000000
[464304.799315] ffff880073bd5ca8 0000000000000082 ffff8804232c5408 0000000000012a40
[464304.799318] ffff880073bd5fd8 0000000000012a40 ffff880073bd4000 0000000000012a40
[464304.799320] 0000000000012a40 0000000000012a40 ffff880073bd5fd8 0000000000012a40
[464304.799322] Call Trace:
[464304.799332] [<ffffffff81110030>] ? __lock_page+0x70/0x70
[464304.799335] [<ffffffff8160007c>] io_schedule+0x8c/0xd0
[464304.799337] [<ffffffff8111003e>] sleep_on_page+0xe/0x20
[464304.799339] [<ffffffff816008ff>] __wait_on_bit+0x5f/0x90
[464304.799341] [<ffffffff81110203>] wait_on_page_bit+0x73/0x80
[464304.799345] [<ffffffff81085bf0>] ? autoremove_wake_function+0x40/0x40
[464304.799347] [<ffffffff8111c5e5>] ? pagevec_lookup_tag+0x25/0x40
[464304.799349] [<ffffffff81110436>] filemap_fdatawait_range+0xf6/0x1a0
[464304.799363] [<ffffffffa02037d0>] ? nfs_destroy_directcache+0x20/0x20 [nfs]
[464304.799365] [<ffffffff8111bae1>] ? do_writepages+0x21/0x40
[464304.799367] [<ffffffff811116bb>] ? __filemap_fdatawrite_range+0x5b/0x60
[464304.799368] [<ffffffff81111730>] filemap_write_and_wait_range+0x70/0x80
[464304.799371] [<ffffffff8119cc6a>] vfs_fsync_range+0x5a/0x90
[464304.799373] [<ffffffff8119cd0c>] vfs_fsync+0x1c/0x20
[464304.799377] [<ffffffffa01f7c74>] nfs_file_flush+0x54/0x80 [nfs]
[464304.799380] [<ffffffff8116ee7f>] filp_close+0x3f/0x90
[464304.799382] [<ffffffff8116f8a7>] sys_close+0xb7/0x120
[464304.799384] [<ffffffff8160a702>] system_call_fastpath+0x16/0x1b

Regards,
..david

2011-10-24 13:43:30

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0)


On Oct 24, 2011, at 6:40 AM, David Flynn wrote:

> Dear All,
>
> On a system running kernel 3.0, mounting a Solaris NFS4 export, we
> observe a continuous 20Mbit/sec exchange between client and server that had
> been occurring for 10 days.

Can you tell us a little more about the server? Which release of Solaris? What hardware?

> From /proc/mounts:
> home:/home/ /home nfs4 rw,relatime,vers=4,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=172.29.190.21,minorversion=0,local_lock=none,addr=172.29.120.140 0 0
>
> Reconciling logs, we find that around the time the exchange started,
> the kernel reports a process having blocked:
>
> [787321.680029] INFO: task bash:17799 blocked for more than 120 seconds.
> [787321.680067] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [787321.680104] bash D 0000000000000002 0 17799 1 0x00000000
> [787321.680131] ffff880173ad9ca8 0000000000000086 ffff88017216b008 0000000000012a40
> [787321.680171] ffff880173ad9fd8 0000000000012a40 ffff880173ad8000 0000000000012a40
> [787321.680211] 0000000000012a40 0000000000012a40 ffff880173ad9fd8 0000000000012a40
> [787321.680251] Call Trace:
> [787321.680275] [<ffffffff8110e900>] ? __lock_page+0x70/0x70
> [787321.680299] [<ffffffff815fea2c>] io_schedule+0x8c/0xd0
> [787321.680321] [<ffffffff8110e90e>] sleep_on_page+0xe/0x20
> [787321.680344] [<ffffffff815ff2af>] __wait_on_bit+0x5f/0x90
> [787321.680366] [<ffffffff8110ead3>] wait_on_page_bit+0x73/0x80
> [787321.680390] [<ffffffff81085980>] ? autoremove_wake_function+0x40/0x40
> [787321.680416] [<ffffffff8111aeb5>] ? pagevec_lookup_tag+0x25/0x40
> [787321.680439] [<ffffffff8110ed06>] filemap_fdatawait_range+0xf6/0x1a0
> [787321.680483] [<ffffffffa022e740>] ? nfs_destroy_directcache+0x20/0x20 [nfs]
> [787321.680507] [<ffffffff8111a3b1>] ? do_writepages+0x21/0x40
> [787321.680530] [<ffffffff8110ff8b>] ? __filemap_fdatawrite_range+0x5b/0x60
> [787321.680555] [<ffffffff81110000>] filemap_write_and_wait_range+0x70/0x80
> [787321.680580] [<ffffffff8119b45a>] vfs_fsync_range+0x5a/0x90
> [787321.680602] [<ffffffff8119b4fc>] vfs_fsync+0x1c/0x20
> [787321.680628] [<ffffffffa0222be4>] nfs_file_flush+0x54/0x80 [nfs]
> [787321.680653] [<ffffffff8116d66f>] filp_close+0x3f/0x90
> [787321.680675] [<ffffffff8116e097>] sys_close+0xb7/0x120
> [787321.680698] [<ffffffff816090c2>] system_call_fastpath+0x16/0x1b
>
> A network trace showed the exchange below being continually
> repeated. Unfortunately this system too has since been rebooted.
>
> Any thoughts on the matter would be most welcome,
>
> Kind regards,
>
> ..david
>
>
> No. Time Source Destination Protocol Size Info
> 9880 11:40:12.833617 172.29.190.21 172.29.120.140 NFS 1122 V4 COMPOUND Call (Reply In 9881) <EMPTY> PUTFH;WRITE;GETATTR
>
> Frame 9880: 1122 bytes on wire (8976 bits), 1122 bytes captured (8976 bits)
> Arrival Time: Oct 17, 2011 11:40:12.833617000 BST
> Frame Length: 1122 bytes (8976 bits)
> Capture Length: 1122 bytes (8976 bits)
> Ethernet II, Src: ChelsioC_06:68:f9 (00:07:43:06:68:f9), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
> Internet Protocol, Src: 172.29.190.21 (172.29.190.21), Dst: 172.29.120.140 (172.29.120.140)
> Transmission Control Protocol, Src Port: 816 (816), Dst Port: nfs (2049), Seq: 5199745, Ack: 275801, Len: 1056
> Remote Procedure Call, Type:Call XID:0x5daa6e93
> Network File System
> [Program Version: 4]
> [V4 Procedure: COMPOUND (1)]
> Tag: <EMPTY>
> length: 0
> contents: <EMPTY>
> minorversion: 0
> Operations (count: 3)
> Opcode: PUTFH (22)
> filehandle
> length: 36
> [hash (CRC-32): 0x6e4b15f3]
> decode type as: unknown
> filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
> Opcode: WRITE (38)
> stateid
> seqid: 0x00000000
> Data: 4e06f15b800f82e300000000
> offset: 11392
> stable: FILE_SYNC4 (2)
> Write length: 814
> Data: <DATA>
> length: 814
> contents: <DATA>
> fill bytes: opaque data
> Opcode: GETATTR (9)
> GETATTR4args
> attr_request
> bitmap[0] = 0x00000018
> [2 attributes requested]
> mand_attr: FATTR4_CHANGE (3)
> mand_attr: FATTR4_SIZE (4)
> bitmap[1] = 0x00300000
> [2 attributes requested]
> recc_attr: FATTR4_TIME_METADATA (52)
> recc_attr: FATTR4_TIME_MODIFY (53)
>
> No. Time Source Destination Protocol Size Info
> 9881 11:40:12.833956 172.29.120.140 172.29.190.21 NFS 122 V4 COMPOUND Reply (Call In 9880) <EMPTY> PUTFH;WRITE
>
> Frame 9881: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
> Arrival Time: Oct 17, 2011 11:40:12.833956000 BST
> [Time delta from previous captured frame: 0.000339000 seconds]
> Frame Length: 122 bytes (976 bits)
> Capture Length: 122 bytes (976 bits)
> Ethernet II, Src: Cisco_1e:f7:80 (00:13:5f:1e:f7:80), Dst: ChelsioC_06:68:f9 (00:07:43:06:68:f9)
> Internet Protocol, Src: 172.29.120.140 (172.29.120.140), Dst: 172.29.190.21 (172.29.190.21)
> Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 816 (816), Seq: 275801, Ack: 5200801, Len: 56
> Remote Procedure Call, Type:Reply XID:0x5daa6e93
> Network File System
> [Program Version: 4]
> [V4 Procedure: COMPOUND (1)]
> Status: NFS4ERR_BAD_STATEID (10025)
> Tag: <EMPTY>
> length: 0
> contents: <EMPTY>
> Operations (count: 2)
> Opcode: PUTFH (22)
> Status: NFS4_OK (0)
> Opcode: WRITE (38)
> Status: NFS4ERR_BAD_STATEID (10025)

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2011-10-31 13:08:07

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop


On Oct 29, 2011, at 3:52 PM, David Flynn wrote:

> * Myklebust, Trond ([email protected]) wrote:
>>> -----Original Message-----
>>> From: Chuck Lever [mailto:[email protected]]
>>> On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
>>>> Yes, and it's not something I care that strongly about, really, my
>>>> only observation is that this sort of failure (an implementation
>>>> bug on one side or another resulting in a loop) seems to have been
>>>> common (based on no hard data, just my vague memories of list
>>>> threads), and the results fairly obnoxious (possibly even for
>>>> unrelated hosts on the network).
>>>> So if there's some simple way to fail more gracefully it might be
>>>> helpful.
>>>
>>> For what it's worth, I agree that client implementations should
>>> attempt to behave more gracefully in the face of server problems, be
>>> they the result of bugs or the result of other issues specific to
>>> that server. Problems like this make NFSv4 as a protocol look bad.
>>
>> I can't see what a client can do in this situation except possibly just
>> give up after a while and throw a SERVER_BROKEN error (which means data
>> loss). That still won't make NFSv4 look good...
>
> Indeed, it is quite the dilemma.
>
> I agree that giving up and guaranteeing unattended data loss is bad
> (data loss at the behest of an operator is OK; after all, they can
> always fence a broken machine).

David, what would help immensely is if you can find a reliable way of reproducing this. So far we have been unable to find a reproducer.

> Looking at some of the logs again, even going back to the very original
> case, it appears to be about 600us between retries (RTT=400us). Is
> there any way to make that less aggressive?, eg 1s? -- that'd reduce the
> impact by three orders of magnitude. What would be the down-side? How
> often do you expect to get a BAD_STATEID error?
>
> ..david

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-10-29 18:51:17

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop


On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:

> On Sat, Oct 29, 2011 at 11:21:13AM -0700, Myklebust, Trond wrote:
>>> From: J. Bruce Fields [mailto:[email protected]]
>>> On Sat, Oct 29, 2011 at 07:29:36PM +0200, Trond Myklebust wrote:
>>>> OK. This is the first time I've seen this tcpdump.
>>>>
> >>>> The problem seems like a split-brain issue on the server... On the one
> >>>> hand, it is happily telling us that our lease is OK when we RENEW.
> >>>> Then when we try to use said lease in an OPEN, it is replying with
> >>>> STALE_CLIENTID.
> >>>>
> >>>> IOW: This isn't a problem I can fix on the client whether or not I add
> >>>> exponential backoff. The problem needs to be addressed on the server
> >>>> by the Solaris folks....
> >>>
> >>> Is there any simple thing we could do on the client to reduce the
> >>> impact of these sorts of loops?
>>
>> WHY? Those loops aren't supposed to happen if the server works according
>> to spec.
>
> Yes, and it's not something I care that strongly about, really, my only
> observation is that this sort of failure (an implementation bug on one
> side or another resulting in a loop) seems to have been common (based on
> no hard data, just my vague memories of list threads), and the results
> fairly obnoxious (possibly even for unrelated hosts on the network).
> So if there's some simple way to fail more gracefully it might be
> helpful.

For what it's worth, I agree that client implementations should attempt to behave more gracefully in the face of server problems, be they the result of bugs or the result of other issues specific to that server. Problems like this make NFSv4 as a protocol look bad.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2011-10-24 11:22:48

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0)

On Mon, 2011-10-24 at 10:40 +0000, David Flynn wrote:
> Dear All,
>
> On a system running kernel 3.0, mounting a Solaris NFS4 export, we
> observe a continuous 20Mbit/sec exchange between client and server that had
> been occurring for 10 days.
<snip>
> No. Time Source Destination Protocol Size Info
> 9880 11:40:12.833617 172.29.190.21 172.29.120.140 NFS 1122 V4 COMPOUND Call (Reply In 9881) <EMPTY> PUTFH;WRITE;GETATTR
>
> Frame 9880: 1122 bytes on wire (8976 bits), 1122 bytes captured (8976 bits)
> Arrival Time: Oct 17, 2011 11:40:12.833617000 BST
> Frame Length: 1122 bytes (8976 bits)
> Capture Length: 1122 bytes (8976 bits)
> Ethernet II, Src: ChelsioC_06:68:f9 (00:07:43:06:68:f9), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
> Internet Protocol, Src: 172.29.190.21 (172.29.190.21), Dst: 172.29.120.140 (172.29.120.140)
> Transmission Control Protocol, Src Port: 816 (816), Dst Port: nfs (2049), Seq: 5199745, Ack: 275801, Len: 1056
> Remote Procedure Call, Type:Call XID:0x5daa6e93
> Network File System
> [Program Version: 4]
> [V4 Procedure: COMPOUND (1)]
> Tag: <EMPTY>
> length: 0
> contents: <EMPTY>
> minorversion: 0
> Operations (count: 3)
> Opcode: PUTFH (22)
> filehandle
> length: 36
> [hash (CRC-32): 0x6e4b15f3]
> decode type as: unknown
> filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
> Opcode: WRITE (38)
> stateid
> seqid: 0x00000000
> Data: 4e06f15b800f82e300000000
> offset: 11392
> stable: FILE_SYNC4 (2)
> Write length: 814
> Data: <DATA>
> length: 814
> contents: <DATA>
> fill bytes: opaque data
> Opcode: GETATTR (9)
> GETATTR4args
> attr_request
> bitmap[0] = 0x00000018
> [2 attributes requested]
> mand_attr: FATTR4_CHANGE (3)
> mand_attr: FATTR4_SIZE (4)
> bitmap[1] = 0x00300000
> [2 attributes requested]
> recc_attr: FATTR4_TIME_METADATA (52)
> recc_attr: FATTR4_TIME_MODIFY (53)
>
> No. Time Source Destination Protocol Size Info
> 9881 11:40:12.833956 172.29.120.140 172.29.190.21 NFS 122 V4 COMPOUND Reply (Call In 9880) <EMPTY> PUTFH;WRITE
>
> Frame 9881: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
> Arrival Time: Oct 17, 2011 11:40:12.833956000 BST
> [Time delta from previous captured frame: 0.000339000 seconds]
> Frame Length: 122 bytes (976 bits)
> Capture Length: 122 bytes (976 bits)
> Ethernet II, Src: Cisco_1e:f7:80 (00:13:5f:1e:f7:80), Dst: ChelsioC_06:68:f9 (00:07:43:06:68:f9)
> Internet Protocol, Src: 172.29.120.140 (172.29.120.140), Dst: 172.29.190.21 (172.29.190.21)
> Transmission Control Protocol, Src Port: nfs (2049), Dst Port: 816 (816), Seq: 275801, Ack: 5200801, Len: 56
> Remote Procedure Call, Type:Reply XID:0x5daa6e93
> Network File System
> [Program Version: 4]
> [V4 Procedure: COMPOUND (1)]
> Status: NFS4ERR_BAD_STATEID (10025)
> Tag: <EMPTY>
> length: 0
> contents: <EMPTY>
> Operations (count: 2)
> Opcode: PUTFH (22)
> Status: NFS4_OK (0)
> Opcode: WRITE (38)
> Status: NFS4ERR_BAD_STATEID (10025)

We should in principle be able to recover a BAD_STATEID error by running
the state recovery thread. It's a shame that the machine was rebooted,
but does your syslog trace perhaps show any state recovery thread
errors?
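
The recovery described above - fail the WRITE with BAD_STATEID, have the
state recovery thread re-OPEN to obtain a fresh stateid, then retry the
WRITE - can be sketched as a toy model. This illustrates the protocol
logic only, not the Linux client implementation; all names here are
invented, and the "broken server" case models recovery failing to converge:

```python
# Toy model of NFSv4 stateid recovery after NFS4ERR_BAD_STATEID.
# Illustration of the protocol logic only, NOT the Linux client code;
# class and function names are invented for this sketch.
NFS4_OK = 0
NFS4ERR_BAD_STATEID = 10025

class Server:
    """Server that considers only stateid "s2" valid."""
    def write(self, stateid):
        return NFS4_OK if stateid == "s2" else NFS4ERR_BAD_STATEID

    def open(self):
        # Re-OPEN hands out the currently valid stateid.
        return "s2"

class BrokenServer(Server):
    """Models the fault in this thread: recovery keeps yielding state
    that the server then rejects, so retries never converge."""
    def open(self):
        return "s1"  # still invalid

def write_with_recovery(server, stateid, max_recoveries=1):
    """On BAD_STATEID, recover (re-OPEN for a fresh stateid) and retry."""
    status = server.write(stateid)
    for _ in range(max_recoveries):
        if status != NFS4ERR_BAD_STATEID:
            break
        stateid = server.open()        # state recovery step
        status = server.write(stateid)  # retry with the fresh stateid
    return status

# Normal recovery succeeds; against the broken server the retry loop
# makes no progress and we eventually give up.
assert write_with_recovery(Server(), "s1") == NFS4_OK
assert write_with_recovery(BrokenServer(), "s1") == NFS4ERR_BAD_STATEID
```

The endless WRITE retries in the captures correspond to the broken-server
case, except that the real client never gives up on a hard mount.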

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-10-29 19:19:30

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: Chuck Lever [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 8:51 PM
> To: J. Bruce Fields
> Cc: Myklebust, Trond; David Flynn; [email protected]
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
>
> On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
>
> > On Sat, Oct 29, 2011 at 11:21:13AM -0700, Myklebust, Trond wrote:
> >>> From: J. Bruce Fields [mailto:[email protected]]
> >>> On Sat, Oct 29, 2011 at 07:29:36PM +0200, Trond Myklebust wrote:
> >>>> OK. This is the first time I've seen this tcpdump.
> >>>>
> >>>> The problem seems like a split-brain issue on the server... On the
> >>>> one hand, it is happily telling us that our lease is OK when we
> >>>> RENEW. Then when we try to use said lease in an OPEN, it is
> >>>> replying with STALE_CLIENTID.
> >>>>
> >>>> IOW: This isn't a problem I can fix on the client whether or not
> >>>> I add exponential backoff. The problem needs to be addressed on
> >>>> the server by the Solaris folks....
> >>>
> >>> Is there any simple thing we could do on the client to reduce the
> >>> impact of these sorts of loops?
> >>
> >> WHY? Those loops aren't supposed to happen if the server works
> >> according to spec.
> >
> > Yes, and it's not something I care that strongly about, really, my
> > only observation is that this sort of failure (an implementation bug
> > on one side or another resulting in a loop) seems to have been
> > common (based on no hard data, just my vague memories of list
> > threads), and the results fairly obnoxious (possibly even for
> > unrelated hosts on the network).
> > So if there's some simple way to fail more gracefully it might be
> > helpful.
>
> For what it's worth, I agree that client implementations should
> attempt to behave more gracefully in the face of server problems, be
> they the result of bugs or the result of other issues specific to
> that server. Problems like this make NFSv4 as a protocol look bad.

I can't see what a client can do in this situation except possibly just
give up after a while and throw a SERVER_BROKEN error (which means data
loss). That still won't make NFSv4 look good...

Trond


2011-10-29 21:08:20

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* Myklebust, Trond ([email protected]) wrote:
> BAD_STATEID is a different matter, and is one that we should have
> resolved in the NFS client in the upstream kernel. At least on newer
> clients, we should be trying to reopen the file and re-establish all
> locks when we get a BAD_STATEID. Can you please remind us which kernel
> you are using?

Ah, i see. This was all on 3.0.0 and 3.0.4 (a quick check didn't reveal
any relevant changes between the two).

Are there any stable patches that can be applied to 3.0.y?

> That said... Even on new clients, the recovery attempt may fail due to
> the STALE_CLIENTID bug. That will still hit us when we call OPEN in
> order to get a new stateid.

The interval between retries on that was ~1-1.5ms, could this be made
slower? -- same questions as before really.

Regards,

..david

2011-10-31 13:40:02

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop


On Oct 31, 2011, at 9:21 AM, David Flynn wrote:

> * Chuck Lever ([email protected]) wrote:
>> David, what would help immensely is if you can find a reliable way of
>> reproducing this. So far we have been unable to find a reproducer.
>
> While i've managed to have problems with individual machines, which
> seem to fail at some random point of their own choosing, the most
> reliable way to produce the problem for us is to have a number of
> nodes updating various RRD files frequently.
>
> Given that i haven't found a reliable and short method for reproducing
> it either, if we were to re-run the known case and capture all network
> traffic, would you be able to extract the relevant detail to generate a
> simulation?

A reproducer would be better for us [*], but I understand the arbitrary nature of the problem. A network trace would be an excellent start.

Now, it would be interesting if in fact the problem occurs only when multiple clients interact with a server. In that case, capture a full network trace with snoop on your server. We'll worry about pruning the size of the trace once you have a clean capture.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

* - A reproducer allows us to perform internal-only tests at will, and it also can confirm we've got the problem properly fixed.



2011-10-31 13:21:46

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* Chuck Lever ([email protected]) wrote:
> David, what would help immensely is if you can find a reliable way of
> reproducing this. So far we have been unable to find a reproducer.

While i've managed to have problems with individual machines, which
seem to fail at some random point of their own choosing, the most
reliable way to produce the problem for us is to have a number of
nodes updating various RRD files frequently.

Given that i haven't found a reliable and short method for reproducing
it either, if we were to re-run the known case and capture all network
traffic, would you be able to extract the relevant detail to generate a
simulation?
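
For reference, the workload can be approximated with something like the
sketch below. This is purely a hypothetical stand-in for our setup, not
our actual tooling: several workers doing frequent small synchronous
appends to their own files, the way rrdtool updates behave. The BASE
path is a placeholder; point it at the NFSv4 mount under test.

```python
#!/usr/bin/env python3
"""Hypothetical reproducer sketch: many workers issuing frequent small
synchronous writes (mimicking rrdtool updates) against an NFSv4 mount.
BASE defaults to /tmp only so the script runs anywhere; substitute the
NFS mount being tested."""
import os
import multiprocessing

BASE = "/tmp/nfs-stress"   # assumption: replace with your NFSv4 mount point
WORKERS = 4                # we saw the problem with 40+ nodes; scale to taste
UPDATES = 100              # updates per worker; run far longer for real tests
RECORD = b"x" * 64         # small payload, like one RRD step

def worker(n: int) -> None:
    path = os.path.join(BASE, f"node{n}.rrd")
    # O_SYNC forces every tiny write to be committed (FILE_SYNC4 over the
    # wire), which matches the WRITE pattern seen in the captures.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_APPEND | os.O_SYNC, 0o644)
    try:
        for _ in range(UPDATES):
            os.write(fd, RECORD)
    finally:
        os.close(fd)

if __name__ == "__main__":
    os.makedirs(BASE, exist_ok=True)
    procs = [multiprocessing.Process(target=worker, args=(n,))
             for n in range(WORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```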

Regards,

..david

2011-10-29 19:53:05

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* Myklebust, Trond ([email protected]) wrote:
> > -----Original Message-----
> > From: Chuck Lever [mailto:[email protected]]
> > On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
> > > Yes, and it's not something I care that strongly about, really, my
> > > only observation is that this sort of failure (an implementation
> > > bug on one side or another resulting in a loop) seems to have been
> > > common (based on no hard data, just my vague memories of list
> > > threads), and the results fairly obnoxious (possibly even for
> > > unrelated hosts on the network).
> > > So if there's some simple way to fail more gracefully it might be
> > > helpful.
> >
> > For what it's worth, I agree that client implementations should
> > attempt to behave more gracefully in the face of server problems, be
> > they the result of bugs or the result of other issues specific to
> > that server. Problems like this make NFSv4 as a protocol look bad.
>
> I can't see what a client can do in this situation except possibly just
> give up after a while and throw a SERVER_BROKEN error (which means data
> loss). That still won't make NFSv4 look good...

Indeed, it is quite the dilemma.

I agree that giving up and guaranteeing unattended data loss is bad
(data loss at the behest of an operator is ok, after all they can
always fence a broken machine).

Looking at some of the logs again, even going back to the very original
case, it appears to be about 600us between retries (RTT=400us). Is
there any way to make that less aggressive? e.g. 1s? -- that'd reduce
the impact by three orders of magnitude. What would be the down-side?
How often do you expect to get a BAD_STATEID error?
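
To put rough numbers on that, here is an illustrative sketch of a
capped exponential backoff (hypothetical, not the kernel's actual retry
code): even a modest schedule spreads ten failed retries over about a
minute instead of the few milliseconds the 600us loop takes.

```python
"""Illustrative sketch only -- not the Linux NFS client's retry logic.
Shows how a capped exponential backoff would space out retries of an
operation that keeps failing with BAD_STATEID."""

def backoff_schedule(first_ms, cap_ms, retries):
    """Return the delay (in ms) before each retry: doubling each time,
    never exceeding cap_ms."""
    delays = []
    delay = first_ms
    for _ in range(retries):
        delays.append(delay)
        delay = min(delay * 2, cap_ms)
    return delays

# A 100ms initial delay capped at 15s: ten retries spread over ~55s,
# versus ~6ms at the observed 600us interval.
print(backoff_schedule(100, 15000, 10))
# [100, 200, 400, 800, 1600, 3200, 6400, 12800, 15000, 15000]
```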

..david

2011-10-29 18:03:06

by David Flynn

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

* Trond Myklebust ([email protected]) wrote:
> > Using the same kernel, same mountpoint as before, we're currently
> > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
...
> The problem seems like a split-brain issue on the server... On the one
> hand, it is happily telling us that our lease is OK when we RENEW. Then
> when we try to use said lease in an OPEN, it is replying with
> STALE_CLIENTID.

Thank you for the quick update, especially at the weekend. I'm
wondering if it is possible that the STALE_CLIENTID issue is a by-product
of the BAD_STATEID issue from earlier. We have observed several times
the BAD_STATEID loop, but the CLIENTID problem only seemed to occur when
all 40+ nodes were all showing problems.

After killing off sufficient processes, some of the machines then
recovered of their own accord. So your conclusion that there is a
server issue sounds reasonable.

On any such possible backoff, the previous case was with quite small
requests in quite a tight loop that seemed to cause the server grief.
This morning, a machine with a 10GbE interface had a BAD_STATEID issue
but involving some much larger writes[1], resulting in 1.6Gbit/sec from
that machine alone. Thankfully there was only a second machine with
1GbE interfaces bringing the total up to 2.5Gbit/sec.

It is this ability for a group of clients to make matters worse that
is just as bad as any fault with Solaris.

(In a similar vein, it can be just as frustrating trying to get a client
to stop looping like this - it is often impossible to kill the process
that triggered the problem; for these, we had to resort to deleting
the files using NFSv3 (which was working quite happily))

Thank you again,
..david

[1] Capture: ftp://ftp.kw.bbc.co.uk/davidf/priv/waquahso.pcap

2011-10-29 20:42:37

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: David Flynn [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 9:53 PM
> To: Myklebust, Trond
> Cc: Chuck Lever; J. Bruce Fields; David Flynn; [email protected]
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> * Myklebust, Trond ([email protected]) wrote:
> > > -----Original Message-----
> > > From: Chuck Lever [mailto:[email protected]]
> > > On Oct 29, 2011, at 2:47 PM, J. Bruce Fields wrote:
> > > > Yes, and it's not something I care that strongly about, really,
> > > > my only observation is that this sort of failure (an
> > > > implementation bug on one side or another resulting in a loop)
> > > > seems to have been common (based on no hard data, just my vague
> > > > memories of list threads), and the results fairly obnoxious
> > > > (possibly even for unrelated hosts on the network).
> > > > So if there's some simple way to fail more gracefully it might
> > > > be helpful.
> > >
> > > For what it's worth, I agree that client implementations should
> > > attempt to behave more gracefully in the face of server problems,
> > > be they the result of bugs or the result of other issues specific
> > > to that server. Problems like this make NFSv4 as a protocol look
> > > bad.
> >
> > I can't see what a client can do in this situation except possibly
> > just give up after a while and throw a SERVER_BROKEN error (which
> > means data loss). That still won't make NFSv4 look good...
>
> Indeed, it is quite the dilemma.
>
> I agree that giving up and guaranteeing unattended data loss is bad
> (data loss at the behest of an operator is ok, after all they can
> always fence a broken machine).
>
> Looking at some of the logs again, even going back to the very
> original case, it appears to be about 600us between retries
> (RTT=400us). Is there any way to make that less aggressive? e.g. 1s?
> -- that'd reduce the impact by three orders of magnitude. What would
> be the down-side? How often do you expect to get a BAD_STATEID error?

BAD_STATEID is a different matter, and is one that we should have
resolved in the NFS client in the upstream kernel. At least on newer
clients, we should be trying to reopen the file and re-establish all
locks when we get a BAD_STATEID. Can you please remind us which kernel
you are using?

That said... Even on new clients, the recovery attempt may fail due to
the STALE_CLIENTID bug. That will still hit us when we call OPEN in
order to get a new stateid.

Cheers
Trond

2011-10-24 14:50:57

by David Flynn

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0.4)

* Chuck Lever ([email protected]) wrote:
> Can you tell us a little more about the server? Which release of
> Solaris? What hardware?

SunOS 5.10 Generic_141444-09
(sparc)

* Trond Myklebust ([email protected]) wrote:
> I'm assuming then that your network trace showed no sign of any OPEN
> calls of that particular file, just retries of the WRITE?

Correct.

However, the good news is that it has just happened again (certainly
not quota related)

The blocked task:
[179068.773206] INFO: task bash:3293 blocked for more than 120 seconds.
[179068.779660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[179068.787701] bash D 0000000000000004 0 3293 1 0x00000000
[179068.795173] ffff88001f97fca8 0000000000000086 ffff880426876008 0000000000012a40
[179068.802992] ffff88001f97ffd8 0000000000012a40 ffff88001f97e000 0000000000012a40
[179068.810745] 0000000000012a40 0000000000012a40 ffff88001f97ffd8 0000000000012a40
[179068.818810] Call Trace:
[179068.821496] [<ffffffff81110030>] ? __lock_page+0x70/0x70
[179068.827204] [<ffffffff8160007c>] io_schedule+0x8c/0xd0
[179068.832952] [<ffffffff8111003e>] sleep_on_page+0xe/0x20
[179068.838823] [<ffffffff816008ff>] __wait_on_bit+0x5f/0x90
[179068.844734] [<ffffffff81110203>] wait_on_page_bit+0x73/0x80
[179068.850798] [<ffffffff81085bf0>] ? autoremove_wake_function+0x40/0x40
[179068.857879] [<ffffffff8111c5e5>] ? pagevec_lookup_tag+0x25/0x40
[179068.864173] [<ffffffff81110436>] filemap_fdatawait_range+0xf6/0x1a0
[179068.870721] [<ffffffffa02167d0>] ? nfs_destroy_directcache+0x20/0x20 [nfs]
[179068.877963] [<ffffffff8111bae1>] ? do_writepages+0x21/0x40
[179068.883744] [<ffffffff811116bb>] ? __filemap_fdatawrite_range+0x5b/0x60
[179068.890867] [<ffffffff81111730>] filemap_write_and_wait_range+0x70/0x80
[179068.898025] [<ffffffff8119cc6a>] vfs_fsync_range+0x5a/0x90
[179068.904197] [<ffffffff8119cd0c>] vfs_fsync+0x1c/0x20
[179068.909721] [<ffffffffa020ac74>] nfs_file_flush+0x54/0x80 [nfs]
[179068.916069] [<ffffffff8116ee7f>] filp_close+0x3f/0x90
[179068.921611] [<ffffffff8116f8a7>] sys_close+0xb7/0x120
[179068.927328] [<ffffffff8160a702>] system_call_fastpath+0x16/0x1b

$ echo 0 >/proc/sys/sunrpc/rpc_debug
[180179.009328] -pid- flgs status -client- --rqstp- -timeout ---ops--
[180179.015540] 40304 0801 0 ffff8804241ae800 (null) 0 ffffffffa023cd40 nfsv4 WRITE a:call_start q:NFS client

and our pingpong (more details at end):
14:07:07.307191 IP vc-fs1.rd.bbc.co.uk.1837702678 > home.rd.bbc.co.uk.nfs: 300 getattr fh 0,0/22
14:07:07.307471 IP home.rd.bbc.co.uk.nfs > vc-fs1.rd.bbc.co.uk.1837702678: reply ok 52 getattr ERROR: unk 10025

This system is up at the moment, if there is further detail you require
i can provide that.

NB, the system this occurred on is running kernel 3.0.4
Mount options as per earlier.

Kind regards,

..david

No. Time Source Destination Protocol Size Info
39 15:33:59.077143 172.29.190.28 172.29.120.140 NFS 370 V4 COMPOUND Call (Reply In 40) <EMPTY> PUTFH;WRITE;GETATTR

Frame 39: 370 bytes on wire (2960 bits), 370 bytes captured (2960 bits)
Ethernet II, Src: ChelsioC_07:49:6f (00:07:43:07:49:6f), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
Internet Protocol, Src: 172.29.190.28 (172.29.190.28), Dst: 172.29.120.140 (172.29.120.140)
Transmission Control Protocol, Src Port: omginitialrefs (900), Dst Port: nfs (2049), Seq: 40433, Ack: 7449, Len: 304
Remote Procedure Call, Type:Call XID:0x43ce4e16
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Tag: <EMPTY>
length: 0
contents: <EMPTY>
minorversion: 0
Operations (count: 3)
Opcode: PUTFH (22)
filehandle
length: 36
[hash (CRC-32): 0x6e4b15f3]
decode type as: unknown
filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
Opcode: WRITE (38)
stateid
offset: 11474
stable: FILE_SYNC4 (2)
Write length: 68
Data: <DATA>
length: 68
contents: <DATA>
Opcode: GETATTR (9)
GETATTR4args
attr_request
bitmap[0] = 0x00000018
[2 attributes requested]
mand_attr: FATTR4_CHANGE (3)
mand_attr: FATTR4_SIZE (4)
bitmap[1] = 0x00300000
[2 attributes requested]
recc_attr: FATTR4_TIME_METADATA (52)
recc_attr: FATTR4_TIME_MODIFY (53)

No. Time Source Destination Protocol Size Info
40 15:33:59.077433 172.29.120.140 172.29.190.28 NFS 122 V4 COMPOUND Reply (Call In 39) <EMPTY> PUTFH;WRITE

Frame 40: 122 bytes on wire (976 bits), 122 bytes captured (976 bits)
Ethernet II, Src: Cisco_1e:f7:80 (00:13:5f:1e:f7:80), Dst: ChelsioC_07:49:6f (00:07:43:07:49:6f)
Internet Protocol, Src: 172.29.120.140 (172.29.120.140), Dst: 172.29.190.28 (172.29.190.28)
Transmission Control Protocol, Src Port: nfs (2049), Dst Port: omginitialrefs (900), Seq: 7449, Ack: 40737, Len: 56
Remote Procedure Call, Type:Reply XID:0x43ce4e16
Network File System
[Program Version: 4]
[V4 Procedure: COMPOUND (1)]
Status: NFS4ERR_BAD_STATEID (10025)
Tag: <EMPTY>
length: 0
contents: <EMPTY>
Operations (count: 2)
Opcode: PUTFH (22)
Status: NFS4_OK (0)
Opcode: WRITE (38)
Status: NFS4ERR_BAD_STATEID (10025)


2011-10-24 15:31:44

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0.4)

On Mon, 2011-10-24 at 14:50 +0000, David Flynn wrote:
> No. Time Source Destination Protocol Size Info
> 39 15:33:59.077143 172.29.190.28 172.29.120.140 NFS 370 V4 COMPOUND Call (Reply In 40) <EMPTY> PUTFH;WRITE;GETATTR
>
> Frame 39: 370 bytes on wire (2960 bits), 370 bytes captured (2960 bits)
> Ethernet II, Src: ChelsioC_07:49:6f (00:07:43:07:49:6f), Dst: All-HSRP-routers_be (00:00:0c:07:ac:be)
> Internet Protocol, Src: 172.29.190.28 (172.29.190.28), Dst: 172.29.120.140 (172.29.120.140)
> Transmission Control Protocol, Src Port: omginitialrefs (900), Dst Port: nfs (2049), Seq: 40433, Ack: 7449, Len: 304
> Remote Procedure Call, Type:Call XID:0x43ce4e16
> Network File System
> [Program Version: 4]
> [V4 Procedure: COMPOUND (1)]
> Tag: <EMPTY>
> length: 0
> contents: <EMPTY>
> minorversion: 0
> Operations (count: 3)
> Opcode: PUTFH (22)
> filehandle
> length: 36
> [hash (CRC-32): 0x6e4b15f3]
> decode type as: unknown
> filehandle: 7df3a75d5e1cd908000ab44c5b000000efc80200000a0300...
> Opcode: WRITE (38)
> stateid

Do you have an example of the stateid argument's value? Does it change
at all between separate WRITE attempts?

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-10-29 18:15:12

by J. Bruce Fields

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop

On Sat, Oct 29, 2011 at 07:29:36PM +0200, Trond Myklebust wrote:
> OK. This is the first time I've seen this tcpdump.
>
> The problem seems like a split-brain issue on the server... On the one
> hand, it is happily telling us that our lease is OK when we RENEW. Then
> when we try to use said lease in an OPEN, it is replying with
> STALE_CLIENTID.
>
> IOW: This isn't a problem I can fix on the client whether or not I add
> exponential backoff. The problem needs to be addressed on the server by
> the Solaris folks....

Is there any simple thing we could do on the client to reduce the impact
of these sorts of loops?

Given that we know there are bad servers out there it might be nice to
do if it's not complicated.

(Though as a server implementer my purely selfish impulse is to leave
things as they are since it ensures I'll get bug reports if I screw
up....)

--b.

2011-10-29 18:22:03

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: David Flynn [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 8:02 PM
> To: Myklebust, Trond
> Cc: David Flynn; [email protected]; Chuck Lever
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> * Trond Myklebust ([email protected]) wrote:
> > > Using the same kernel, same mountpoint as before, we're currently
> > > experiencing a loop involving NFS4ERR_STALE_CLIENTID.
> ...
> > The problem seems like a split-brain issue on the server... On the
> > one hand, it is happily telling us that our lease is OK when we
> > RENEW. Then when we try to use said lease in an OPEN, it is replying
> > with STALE_CLIENTID.
>
> Thank you for the quick update, especially at the weekend. I'm
> wondering if it is possible that the STALE_CLIENTID issue is a
> by-product of the BAD_STATEID issue from earlier. We have observed
> several times the BAD_STATEID loop, but the CLIENTID problem only
> seemed to occur when all 40+ nodes were all showing problems.
>
> After killing off sufficient processes, some of the machines then
> recovered of their own accord. So your conclusion that there is a
> server issue sounds reasonable.
>
> On any such possible backoff, the previous case was with quite small
> requests in quite a tight loop that seemed to cause the server grief.
> This morning, a machine with a 10GbE interface had a BAD_STATEID
> issue but involving some much larger writes[1], resulting in
> 1.6Gbit/sec from that machine alone. Thankfully there was only a
> second machine with 1GbE interfaces bringing the total up to
> 2.5Gbit/sec.
>
> It is this ability for a group of clients to make matters worse that
> is just as bad as any fault with Solaris.

Sure, but gone are the days when NFS had "reference implementations"
that everyone had to interoperate with. NFSv4 is a fully documented
protocol which describes how both clients and servers are supposed to
work. If either the client or server fail to work according to that
documentation, then bad things will happen.

While I could litter the client code with lots of little tricks to be
"defensive" in the face of buggy servers, that isn't going to solve
anything: the server will still be buggy, and the client will still be
faced with a situation that it cannot resolve.

> (In a similar vein, it can be just as frustrating trying to get a
client to stop
> looping like this - it is often impossible to kill the process that
triggered the
> problem; for these, we had to resort to deleting the files using NFSv3
(which
> was working quite happily))

'kill -9' should in principle work to kill off the process. Was that
failing to work?

Cheers
Trond

2011-10-29 21:12:05

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: David Flynn [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 11:08 PM
> To: Myklebust, Trond
> Cc: David Flynn; Chuck Lever; J. Bruce Fields; [email protected]
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> * Myklebust, Trond ([email protected]) wrote:
> > BAD_STATEID is a different matter, and is one that we should have
> > resolved in the NFS client in the upstream kernel. At least on newer
> > clients, we should be trying to reopen the file and re-establish all
> > locks when we get a BAD_STATEID. Can you please remind us which
> > kernel you are using?
>
> Ah, i see. This was all on 3.0.0 and 3.0.4 (a quick check didn't
> reveal any relevant changes between the two).
>
> Are there any stable patches that can be applied to 3.0.y?
>
> > That said... Even on new clients, the recovery attempt may fail due
> > to the STALE_CLIENTID bug. That will still hit us when we call OPEN
> > in order to get a new stateid.
>
> The interval between retries on that was ~1-1.5ms, could this be made
> slower? -- same questions as before really.

Given that things such as reboot recovery are subject to a recovery
window (grace period), I'm, again, very reluctant to add an artificial
backoff as that may have consequences for behaviour in the bug-free
case.

If the server wants us to back off, it can say so itself by using the
NFS4ERR_DELAY error mechanism.
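
By way of illustration only (a hypothetical sketch, not the kernel's
actual NFSv4 error handling; the function name is made up), the
division of labour is roughly:

```python
"""Illustrative dispatch sketch showing the division of labour:
NFS4ERR_DELAY is the server-driven backoff mechanism, while
BAD_STATEID and STALE_CLIENTID trigger state recovery rather than an
artificial client-side pause.  Error code values are from RFC 3530."""
import time

NFS4ERR_DELAY = 10008
NFS4ERR_STALE_CLIENTID = 10022
NFS4ERR_BAD_STATEID = 10025

def handle_error(err: int) -> str:
    if err == NFS4ERR_DELAY:
        # The server explicitly asked us to slow down: wait, then retry.
        time.sleep(0.1)            # a real client would scale this wait up
        return "retry-after-delay"
    if err == NFS4ERR_BAD_STATEID:
        # The stateid is bad: reopen the file and re-establish locks,
        # then retry with the fresh stateid -- no artificial pause.
        return "recover-state"
    if err == NFS4ERR_STALE_CLIENTID:
        # The lease is gone: re-run SETCLIENTID/SETCLIENTID_CONFIRM.
        return "recover-lease"
    return "fail"

print(handle_error(NFS4ERR_BAD_STATEID))   # recover-state
```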

Cheers
Trond


2011-10-24 13:18:09

by David Flynn

[permalink] [raw]
Subject: Re: NFS4 BAD_STATEID loop (kernel 3.0)

* Trond Myklebust ([email protected]) wrote:
> We should in principle be able to recover a BAD_STATEID error by running
> the state recovery thread. It's a shame that the machine was rebooted,
> but does your syslog trace perhaps show any state recovery thread
> errors?

There were no other nfs related messages reported between the initial
blocked task and rebooting the machine later. Additionally, there were
no nfs related messages from bootup of the machine until the blocked
task.

One thing that may be of interest is that the user in question with the
blocked task had hit their quota hard limit. (It was the same user that
had the issue i reported earlier too on the same filesystem).

Kind regards,
..david

2011-10-29 18:21:15

by Myklebust, Trond

[permalink] [raw]
Subject: RE: NFS4ERR_STALE_CLIENTID loop

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Saturday, October 29, 2011 8:15 PM
> To: Myklebust, Trond
> Cc: David Flynn; [email protected]; Chuck Lever
> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>
> On Sat, Oct 29, 2011 at 07:29:36PM +0200, Trond Myklebust wrote:
> > OK. This is the first time I've seen this tcpdump.
> >
> > The problem seems like a split-brain issue on the server... On the
> > one hand, it is happily telling us that our lease is OK when we
> > RENEW. Then when we try to use said lease in an OPEN, it is replying
> > with STALE_CLIENTID.
> >
> > IOW: This isn't a problem I can fix on the client whether or not I
> > add exponential backoff. The problem needs to be addressed on the
> > server by the Solaris folks....
>
> Is there any simple thing we could do on the client to reduce the
> impact of these sorts of loops?

WHY? Those loops aren't supposed to happen if the server works according
to spec.

> Given that we know there are bad servers out there it might be nice
> to do if it's not complicated.

Again WHY? We can't fix a broken server by backing off correctly formed
requests. If a server wants us to back off, it will send us
NFS4ERR_DELAY.

> (Though as a server implementer my purely selfish impulse is to leave
> things as they are since it ensures I'll get bug reports if I screw
> up....)

You would have gotten the same bug report. That said, I refuse to change
the client to accommodate buggy servers. The solution to the problem of
buggy servers is to fix those servers.

Trond


2011-10-29 18:23:28

by Chuck Lever

[permalink] [raw]
Subject: Re: NFS4ERR_STALE_CLIENTID loop


On Oct 29, 2011, at 2:22 PM, Myklebust, Trond wrote:

>> -----Original Message-----
>> From: David Flynn [mailto:[email protected]]
>> Sent: Saturday, October 29, 2011 8:02 PM
>> To: Myklebust, Trond
>> Cc: David Flynn; [email protected]; Chuck Lever
>> Subject: Re: NFS4ERR_STALE_CLIENTID loop
>>
>> * Trond Myklebust ([email protected]) wrote:
>>>> Using the same kernel, same mountpoint as before, we're currently
>>>> experiencing a loop involving NFS4ERR_STALE_CLIENTID.
>> ...
>>> The problem seems like a split-brain issue on the server... On the
>>> one hand, it is happily telling us that our lease is OK when we
>>> RENEW. Then when we try to use said lease in an OPEN, it is
>>> replying with STALE_CLIENTID.
>>
>> Thank you for the quick update, especially at the weekend. I'm
>> wondering if it is possible that the STALE_CLIENTID issue is a
>> by-product of the BAD_STATEID issue from earlier. We have observed
>> several times the BAD_STATEID loop, but the CLIENTID problem only
>> seemed to occur when all 40+ nodes were all showing problems.
>>
>> After killing off sufficient processes, some of the machines then
>> recovered of their own accord. So your conclusion that there is a
>> server issue sounds reasonable.
>>
>> On any such possible backoff, the previous case was with quite small
>> requests in quite a tight loop that seemed to cause the server grief.
>> This morning, a machine with a 10GbE interface had a BAD_STATEID
>> issue but involving some much larger writes[1], resulting in
>> 1.6Gbit/sec from that machine alone. Thankfully there was only a
>> second machine with 1GbE interfaces bringing the total up to
>> 2.5Gbit/sec.
>>
>> It is this ability for a group of clients to make matters worse that
>> is just as bad as any fault with Solaris.
>
> Sure, but gone are the days when NFS had "reference implementations"
> that everyone had to interoperate with. NFSv4 is a fully documented
> protocol which describes how both clients and servers are supposed to
> work. If either the client or server fail to work according to that
> documentation, then bad things will happen.
>
> While I could litter the client code with lots of little tricks to be
> "defensive" in the face of buggy servers, that isn't going to solve
> anything: the server will still be buggy, and the client will still be
> faced with a situation that it cannot resolve.
>
>> (In a similar vein, it can be just as frustrating trying to get a
>> client to stop looping like this - it is often impossible to kill
>> the process that triggered the problem; for these, we had to resort
>> to deleting the files using NFSv3 (which was working quite happily))
>
> 'kill -9' should in principle work to kill off the process. Was that
> failing to work?

The trick is knowing which process to kill. Generally you have to kill the state manager thread in this case.


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com