From: Trond Myklebust Subject: Re: session hangs with 2.6.33rc7 & nfs-utils 1.2.2rc9 Date: Thu, 06 May 2010 12:57:41 -0400 Message-ID: <1273165061.7699.39.camel@localhost.localdomain> References: <1265897747.478.1.camel@localhost> <1265898673.478.3.camel@localhost> <20100211102322.1581a385@barsoom.rdu.redhat.com> <20100211152822.622dcc70@barsoom.rdu.redhat.com> <20100212143249.GA17928@fieldses.org> <20100212111343.101fd0ce@tlielax.poochiereds.net> <1265998794.2832.23.camel@localhost> Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Cc: linux-nfs@vger.kernel.org, Jeff Layton To: Timo Aaltonen Return-path: Received: from mail-out1.uio.no ([129.240.10.57]:60290 "EHLO mail-out1.uio.no" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751174Ab0EFQ5r (ORCPT ); Thu, 6 May 2010 12:57:47 -0400 In-Reply-To: Sender: linux-nfs-owner@vger.kernel.org List-ID: On Thu, 2010-05-06 at 18:40 +0300, Timo Aaltonen wrote: > Reviving the old thread, since it's still happening. Maybe I should file a > proper bug about this? Please do. Can you see if you can get a trace of the recovery thread? Also, please check your dmesg log for any 'unhandled error' messages. If there is nothing there, then we may have to get a 'wireshark' trace too... Cheers Trond > On Fri, 12 Feb 2010, Trond Myklebust wrote: > > > On Fri, 2010-02-12 at 11:13 -0500, Jeff Layton wrote: > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] 130.233.227.1 D 00000000ffffffff 0 29902 2 0x00000080 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] ffff8800c158dc50 0000000000000046 0000000000015ac0 0000000000015ac0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] ffff880116251ab0 ffff8800c158dfd8 0000000000015ac0 ffff8801162516f0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] 0000000000015ac0 ffff8800c158dfd8 0000000000015ac0 ffff880116251ab0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] Call Trace: > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] rpc_wait_bit_killable+0x24/0x40 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] __wait_on_bit+0x5f/0x90 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? rpc_wait_bit_killable+0x0/0x40 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] out_of_line_wait_on_bit+0x78/0x90 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? wake_bit_function+0x0/0x40 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] __rpc_execute+0x105/0x220 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] rpc_execute+0x26/0x30 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] rpc_run_task+0x3a/0x90 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] rpc_call_sync+0x42/0x70 [sunrpc] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? finish_task_switch+0x50/0xd0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] nfs4_proc_renew+0x4d/0xa0 [nfs] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] nfs4_check_lease+0x8c/0xc0 [nfs] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? nfs4_run_state_manager+0x0/0x40 [nfs] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] nfs4_state_manager+0xf8/0x200 [nfs] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? allow_signal+0x9d/0xb0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] nfs4_run_state_manager+0x23/0x40 [nfs] > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] kthread+0x96/0xa0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] child_rip+0xa/0x20 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? kthread+0x0/0xa0 > >> Feb 12 16:11:38 nexus6 kernel: [10629.733513] [] ? child_rip+0x0/0x20 > >> > >> Here, the mount appears to be attempting to renew the state. > >> > >> > >> Another thread (not shown in the sysrq-t) is attempting to do writeback > >> (maybe the bdev flush thread?) and that's in turn blocked by the state > >> recovery code? > >> > >> > >> Looking back over the capture that you sent originally, I see that the > >> client is attempting a WRITE, getting a NFS4ERR_BAD_STATEID back from > >> the server, issuing a RENEW which succeeds and then reattempting the > >> WRITE. > >> > >> Trond, shouldn't it be trying to recover the state before reissuing the > >> write? > >> > > > > Yes. Both nfs4_read_done() and nfs4_write_done() will call > > nfs4_async_handle_error(), which again should call > > nfs4_state_mark_reclaim_nograce() on the stateid. The state manager > > should then end up recovering that stateid before the write is > > reattempted. > > > > We did previously have an issue if the server were to return > > NFS4ERR_STALE_STATEID, followed by NFS4ERR_OK to our RENEW probe > > (==seriously buggy server), but 2.6.33-rc7 has a fix for that. > > The offending app seems to be the new "indicator-applet" on the Ubuntu > GNOME desktop, so maybe that's why others haven't seen this bug yet? > Here's the trace from dmesg: > > [482400.730245] INFO: task indicator-apple:28782 blocked for more than 120 seconds. > [482400.730247] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [482400.730249] indicator-app D 0000000000000000 0 28782 1 0x00000080 > [482400.730253] ffff880010579ac8 0000000000000082 0000000000015bc0 0000000000015bc0 > [482400.730257] ffff880109bb1ab0 ffff880010579fd8 0000000000015bc0 ffff880109bb16f0 > [482400.730261] 0000000000015bc0 ffff880010579fd8 0000000000015bc0 ffff880109bb1ab0 > [482400.730265] Call Trace: > [482400.730275] [] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs] > [482400.730279] [] io_schedule+0x47/0x70 > [482400.730288] [] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs] > [482400.730292] [] __wait_on_bit+0x5f/0x90 > [482400.730302] [] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs] > [482400.730305] [] out_of_line_wait_on_bit+0x78/0x90 > [482400.730309] [] ? wake_bit_function+0x0/0x40 > [482400.730318] [] nfs_wait_on_request+0x2f/0x40 [nfs] > [482400.730329] [] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs] > [482400.730340] [] nfs_sync_mapping_wait+0x9e/0x1a0 [nfs] > [482400.730350] [] nfs_write_mapping+0x79/0xb0 [nfs] > [482400.730361] [] nfs_wb_all+0x17/0x20 [nfs] > [482400.730369] [] nfs_do_fsync+0x2a/0x60 [nfs] > [482400.730378] [] nfs_file_write+0x106/0x1e0 [nfs] > [482400.730381] [] do_sync_write+0xfa/0x140 > [482400.730385] [] ? autoremove_wake_function+0x0/0x40 > [482400.730389] [] ? security_file_permission+0x16/0x20 > [482400.730392] [] vfs_write+0xb8/0x1a0 > [482400.730396] [] ? audit_syscall_entry+0x242/0x270 > [482400.730399] [] sys_write+0x51/0x80 > [482400.730403] [] system_call_fastpath+0x16/0x1b > > I haven't tried the 2.6.34-series yet.. >