From: Christopher Hawkins Subject: Re: nfs4 hang Date: Mon, 24 May 2010 13:17:01 -0400 (EDT) Message-ID: <18251062.111274721421458.JavaMail.javamailuser@localhost> References: <2212784.91274721147769.JavaMail.javamailuser@localhost> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Cc: linux-nfs To: "J. Bruce Fields" Return-path: Received: from vms173007pub.verizon.net ([206.46.173.7]:24914 "EHLO vms173007pub.verizon.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758032Ab0EXRXt (ORCPT ); Mon, 24 May 2010 13:23:49 -0400 Received: from backoffice.bplinux.com ([unknown] [98.118.141.73]) by vms173007.mailsrvcs.net (Sun Java(tm) System Messaging Server 7u2-7.02 32bit (built Apr 16 2009)) with ESMTPA id <0L2X00C7WPN5GVV3-q7IJPNDOu5koUfzYv49kGXL67TkHkek2@public.gmane.org> for linux-nfs@vger.kernel.org; Mon, 24 May 2010 12:23:32 -0500 (CDT) In-reply-to: <2212784.91274721147769.JavaMail.javamailuser@localhost> Sender: linux-nfs-owner@vger.kernel.org List-ID: Thanks Bruce, it does appear that idmapd is somehow related to this issue. However, I'm stumped because it does exist on the client, along with all the libraries it requires (according to ldd, anyway). Is there an easy way to disable idmapd while still running nfs4 to determine if it is causing the hang? Chris ----- "J. Bruce Fields" wrote: > On Sat, May 22, 2010 at 11:11:31AM -0400, Christopher Hawkins wrote: > > Actually this issue seems to be on the server side. I changed the > export to read only so the clients couldn't accidentally overwrite > anything... and the issue persists! Will keep testing. > > Probably rpc.idmapd is trying to access a file on NFSv4. > > (Which doesn't work because the NFSv4 client needs idmapd to respond > before it can continue, creating a deadlock.) > > --b. > > > > > ----- "Christopher Hawkins" wrote: > > > > > I managed to make it go a little further and now the nodes will > finish > > > booting, though they hang when I start other nodes and start > using > > > files. And this was captured in the syslog from one of the nodes > where > > > nfs4 goes unresponsive. Not familiar enough with nfs4 to read it, > > > however. Does this point to a culprit? > > > > > > It's as if the nodes are writing all over each others > connections, > > > like maybe I have a shared directory where each node must have its > own > > > private directory. They each have their own /var/lib and > > > /var/lib/nfs/rpc_pipefs mounted. Anything else that nfs clients > write > > > to (aside from proc and dev)? > > > > > > May 21 17:13:45 192.168.1.243 kernel: INFO: task rpciod/0:1297 > blocked > > > for more than 120 seconds. > > > May 21 17:13:45 192.168.1.243 kernel: "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > May 21 17:13:45 192.168.1.243 kernel: rpciod/0 D 00000097 > 2580 > > > 1297 7 1306 340 (L-TLB) > > > May 21 17:13:45 192.168.1.243 kernel: de89de40 00000046 > > > 1dfa3ac0 00000097 de8ca3b4 c1984c78 00000001 0000000a > > > May 21 17:13:45 192.168.1.243 kernel: de8ce000 1dfa3ac0 > > > 00000097 00000000 00000000 de8ce10c c1787100 f54ebac0 > > > May 21 17:13:45 192.168.1.243 kernel: c041ebcc 00000000 > > > 00000000 00000003 00000296 c1984b00 c17875bc de89de68 > > > May 21 17:13:45 192.168.1.243 kernel: Call Trace: > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __wake_up+0x2a/0x3d > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpc_queue_upcall+0x92/0x99 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_idmap_id+0x18b/0x1f6 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > default_wake_function+0x0/0xc > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_map_name_to_uid+0x1b/0x1f [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_getfattr+0x611/0xbbd [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_open+0xf3/0x25e [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_getfh+0x61/0xa3 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_open+0x0/0xa2 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_open+0x75/0xa2 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpcauth_unwrap_resp+0x5b/0x60 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > call_decode+0x437/0x47d [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_open+0x0/0xa2 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __rpc_execute+0x7a/0x1f8 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > run_workqueue+0x78/0xb5 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpc_async_schedule+0x0/0x5 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > worker_thread+0xd9/0x10b > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > default_wake_function+0x0/0xc > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > worker_thread+0x0/0x10b > > > May 21 17:13:45 192.168.1.243 kernel: [] > kthread+0xc0/0xed > > > May 21 17:13:45 192.168.1.243 kernel: [] > kthread+0x0/0xed > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > kernel_thread_helper+0x7/0x10 > > > May 21 17:13:45 192.168.1.243 kernel: ======================= > > > May 21 17:13:45 192.168.1.243 kernel: INFO: task java:13420 > blocked > > > for more than 120 seconds. > > > May 21 17:13:45 192.168.1.243 kernel: "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > May 21 17:13:45 192.168.1.243 kernel: java D 00000098 > 2068 > > > 13420 1 13421 13386 (NOTLB) > > > May 21 17:13:45 192.168.1.243 kernel: f6bbfab0 00000086 > > > ada06900 00000098 00000000 c06ba340 dfc6d280 0000000a > > > May 21 17:13:45 192.168.1.243 kernel: f5781000 ada06900 > > > 00000098 00000000 00000000 f578110c c1787100 f649d3c0 > > > May 21 17:13:45 192.168.1.243 kernel: 00000098 00000000 > > > de7c6804 c1787100 de7c6800 f6bbfb1c c04ec500 ffffffff > > > May 21 17:13:45 192.168.1.243 kernel: Call Trace: > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __next_cpu+0x12/0x21 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __mutex_lock_slowpath+0x4d/0x7c > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > .text.lock.mutex+0xf/0x14 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_idmap_id+0x79/0x1f6 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > default_wake_function+0x0/0xc > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_map_name_to_uid+0x1b/0x1f [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_getfattr+0x611/0xbbd [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > lock_timer_base+0x15/0x2f > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_getattr+0x0/0x45 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_getattr+0x3e/0x45 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpcauth_unwrap_resp+0x5b/0x60 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > call_decode+0x437/0x47d [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_getattr+0x0/0x45 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __rpc_execute+0x7a/0x1f8 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpc_call_sync+0x6b/0x91 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_proc_getattr+0x6f/0x8b [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __nfs_revalidate_inode+0x120/0x249 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > schedule+0xa22/0xa55 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_lookup_revalidate+0x1e8/0x3d7 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > lock_timer_base+0x15/0x2f > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_compound_hdr+0x35/0x72 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > decode_op_hdr+0xd/0x58 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs4_xdr_dec_access+0x54/0x8d [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > lock_timer_base+0x15/0x2f > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __mod_timer+0xda/0xe4 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpc_wake_up_next+0x11f/0x126 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > recalc_sigpending+0xe/0x20 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > sigprocmask+0xb0/0xce > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > rpc_call_sync+0x89/0x91 [sunrpc] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > _nfs4_proc_access+0xa7/0xdb [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_access_add_cache+0xad/0x135 [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > dput+0x22/0xed > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > nfs_open_revalidate+0x186/0x1da [nfs] > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > do_lookup+0x135/0x174 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __link_path_walk+0x314/0xd33 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > link_path_walk+0x38/0x95 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > do_path_lookup+0x219/0x27f > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __link_path_walk+0x314/0xd33 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > link_path_walk+0x38/0x95 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > do_path_lookup+0x219/0x27f > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > __user_walk_fd+0x29/0x3a > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > vfs_stat_fd+0x15/0x3c > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > default_wake_function+0x0/0xc > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > sys_stat64+0xf/0x23 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > getnstimeofday+0x30/0xb6 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > copy_to_user+0x31/0x48 > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > sys_clock_gettime+0x58/0x7e > > > May 21 17:13:45 192.168.1.243 kernel: [] > > > syscall_call+0x7/0xb > > > May 21 17:13:45 192.168.1.243 kernel: ======================= > > > > > > > > > Christopher Hawkins > > > > > > Cell: 315-882-2551 > > > Fax: 716-662-0788 > > > > > > ----- "J. Bruce Fields" wrote: > > > > > > > On Thu, May 20, 2010 at 03:54:28PM -0400, Christopher Hawkins > > > wrote: > > > > > Hello, > > > > > > > > > > I have a bash script that sets up a root filesystem over NFS. > The > > > > method works fine over NFS3 but hangs when using NFS4 as a > > > transport. > > > > Hoping someone can help me figure out why? > > > > > > > > > > In a nutshell: > > > > > > > > > > 1. The node boots into a ramdisk that has a few thousand files > in > > > it > > > > already > > > > > 2. connects to an NFS server exporting / and mounts that in a > tmp > > > > dir > > > > > 3. bash (on the booting node) loops through the local and > remote > > > > filesystem and creates a local symlink to anything in the nfs > root > > > > that it doesn't already have. > > > > > > > > Wild. > > > > > > > > > Export options are: / > > > > 192.168.1.0/24(rw,no_root_squash,no_subtree_check,fsid=0) > > > > > Mount options are: rw,sync,noatime,rsize=1024,wsize=1024 > > > > > > > > > > rpcidmapd, portmap, statd are running on server and client, > and > > > > rpc_pipefs is mounted on both as well. The script currently > makes > > > > about 6200 links, total. Nfs4 hangs about 90% of the way > through. I > > > > removed the directory it was hanging on (a simple, empty dir) > and > > > then > > > > it just hung on the next one. > > > > > > > > Is it possible there's some idmap deadlock? (If say an upcall > to > > > > idmapd > > > > is hanging because the mapping requires access to some file > that > > > > isn't > > > > there yet?) > > > > > > > > A sysrq-T trace might help. > > > > > > > > --b. > > > > > > > > > Really appreciate any help or pointers on where to look for > more > > > > info. Thanks! > > > > > > > > > > Chris > > > > > > > > > > > > > > > > > > > > -- > > > > > To unsubscribe from this list: send the line "unsubscribe > > > linux-nfs" > > > > in > > > > > the body of a message to majordomo@vger.kernel.org > > > > > More majordomo info at > > > http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe > linux-nfs" > > > in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at > http://vger.kernel.org/majordomo-info.html > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" > in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html