Return-Path: linux-nfs-owner@vger.kernel.org Received: from userp1040.oracle.com ([156.151.31.81]:44670 "EHLO userp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751554Ab3KLQX3 convert rfc822-to-8bit (ORCPT ); Tue, 12 Nov 2013 11:23:29 -0500 Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 6.6 \(1510\)) Subject: Re: Thread overran stack, or stack corrupted BUG on mount From: Chuck Lever In-Reply-To: <20131112112021.1a0a60ca@tlielax.poochiereds.net> Date: Tue, 12 Nov 2013 11:23:13 -0500 Cc: Weston Andros Adamson , linux-nfs list Message-Id: References: <2C73011F-0939-434C-9E4D-13A1EB1403D7@netapp.com> <20131112105539.4f804fc6@tlielax.poochiereds.net> <20131112112021.1a0a60ca@tlielax.poochiereds.net> To: Jeff Layton Sender: linux-nfs-owner@vger.kernel.org List-ID: On Nov 12, 2013, at 11:20 AM, Jeff Layton wrote: > On Tue, 12 Nov 2013 10:55:39 -0500 > Jeff Layton wrote: > >> On Tue, 12 Nov 2013 15:31:34 +0000 >> Weston Andros Adamson wrote: >> >>> I got this oops yesterday running the ?test_sec_options.sh? script I recently posted as a patch to Anna?s nfs-ordeal repo (tons of mount/umount). >>> >>> At this point GSSD had died (I was tracking down a fd leak). I haven?t been able to reproduce this yet. >>> >>> Any idea if I should trust the stack trace? Could this be related to the issue Jeff just posted? >>> >>> -dros >>> >>> BUG: unable to handle kernel paging request at ffff88017a604030 >>> IP: [] __wake_up+0x22/0x4d >>> PGD 2651067 PUD 0 >>> Thread overran stack, or stack corrupted >>> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC >>> Modules linked in: nfsv4 cts rpcsec_gss_krb5 nfsv3 nfs fscache crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ppdev ablk_helper cryptd serio_raw i2c_piix4 i2c_core e1000 nfsd parport_pc parport shpchp auth_rpcgss oid_registry exportfs nfs_acl lockd floppy freq_table sunrpc autofs4 mptspi scsi_transport_spi mptscsih mptbase ata_generic >>> CPU: 0 PID: 10547 Comm: mount.nfs Not tainted 3.12.0-rc3-branch-dros_testing+ #1 >>> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 >>> task: ffff8800798f2100 ti: ffff88007a604000 task.ti: ffff88007a604000 >>> RIP: 0010:[] [] __wake_up+0x22/0x4d >>> RSP: 0018:ffff88007a604028 EFLAGS: 00010092 >>> RAX: 0000000000000296 RBX: ffffffffa006a980 RCX: 000000009a519a50 >>> RDX: 000000009a509a50 RSI: 000000000000038a RDI: ffffffffa006a980 >>> RBP: ffff88017a604058 R08: 0000000000000003 R09: 0000000000000001 >>> R10: ffff88006d41d7c0 R11: ffff88007f20b000 R12: ffff88007a6058e0 >>> R13: ffff8800645d8018 R14: ffff88007a6058f8 R15: ffff8800798f2100 >>> FS: 00007fb2765b3880(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: ffff88017a604030 CR3: 000000007a6ba000 CR4: 00000000001407f0 >>> Stack: >>> ffff88007a604038 0000000000000000 ffff880000000001 0000000000000003 >>> ffff88006453a0d0 ffff88007a6058e0 ffff88007a604078 ffffffffa0045de2 >>> ffff88006453dbe0 ffff88006453dbe0 ffff88007a604098 ffffffffa0045d27 >>> Call Trace: >>> [] ? rpc_release_client+0x4a/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_free_client+0x56/0xc7 [sunrpc] >>> [] ? rpc_release_client+0x68/0x9a [sunrpc] >>> [] ? rpc_shutdown_client+0x107/0x116 [sunrpc] >>> [] ? __fscache_cookie_put+0x43/0x4f [fscache] >>> [] ? __fscache_relinquish_cookie+0x168/0x16d [fscache] >>> [] ? nfs_free_client+0x4c/0xaf [nfs] >>> [] ? nfs4_free_client+0x97/0x9b [nfsv4] >>> [] ? nfs_put_client+0xe8/0xed [nfs] >>> [] ? nfs4_init_client+0x22e/0x29d [nfsv4] >>> [] ? nfs_probe_fsinfo+0x2c7/0x2c7 [nfs] >>> [] ? nfs_get_client+0x8a/0x2bf [nfs] >>> [] ? nfs_get_client+0x25a/0x2bf [nfs] >>> [] ? nfs4_set_client+0x9f/0xf1 [nfsv4] >>> [] ? __rpc_init_priority_wait_queue+0x98/0xcf [sunrpc] >>> [] ? nfs4_create_server+0xfe/0x264 [nfsv4] >>> [] ? nfs4_remote_mount+0x2f/0x57 [nfsv4] >>> [] ? mount_fs+0x69/0x157 >>> [] ? __alloc_percpu+0x10/0x12 >>> [] ? vfs_kern_mount+0x62/0xd9 >>> [] ? nfs_do_root_mount+0x8c/0xb4 [nfsv4] >>> [] ? nfs4_try_mount+0x60/0xbb [nfsv4] >>> [] ? nfs_fs_mount+0x88f/0x97a [nfs] >>> [] ? nfs_clone_super+0x6b/0x6b [nfs] >>> [] ? nfs_set_super+0x53/0x53 [nfs] >>> [] ? mount_fs+0x69/0x157 >>> [] ? __alloc_percpu+0x10/0x12 >>> [] ? vfs_kern_mount+0x62/0xd9 >>> [] ? do_mount+0x6ce/0x871 >>> [] ? copy_mount_options+0xc2/0x12f >>> [] ? SyS_mount+0x85/0xbe >>> [] ? system_call_fastpath+0x16/0x1b >>> Code: 89 e5 e8 98 ff ff ff 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 20 89 55 e0 89 75 e8 48 89 4d d8 e8 0c 99 43 00 <4c> 8b 45 d8 48 89 df 8b 55 e0 49 89 c4 31 c9 8b 75 e8 e8 b4 d0 >>> RIP [] __wake_up+0x22/0x4d >>> RSP >>> CR2: ffff88017a604030 >>> ---[ end trace 1122f3f8cf98e4c2 ]--- >>> >> >> Yep, I think this is the same problem I reported earlier. I ran the >> reproducer with rpc_debug turned up and ended up seeing a very similar >> stack trace. Basically the server is returning NFS4ERR_CLID_IN_USE but >> the client keeps retrying the call over and over. >> >> I suspect that leads to some sort of recursion, but I haven't quite >> spotted it yet. >> > > (cc'ing Chuck since I think the problem is in the new detect_trunking code) > > Ok, I think I see the problem. The looping comes from this block in > nfs4_discover_server_trunking: > > -----------------[snip]----------------- > case -NFS4ERR_CLID_INUSE: > case -NFS4ERR_WRONGSEC: > clnt = rpc_clone_client_set_auth(clnt, RPC_AUTH_UNIX); > if (IS_ERR(clnt)) { > status = PTR_ERR(clnt); > break; > } > /* Note: this is safe because we haven't yet marked the > * client as ready, so we are the only user of > * clp->cl_rpcclient > */ > clnt = xchg(&clp->cl_rpcclient, clnt); > rpc_shutdown_client(clnt); > clnt = clp->cl_rpcclient; > goto again; > -----------------[snip]----------------- > > ...so in the case of the reproducer, we get back -NFS4ERR_CLID_IN_USE, > at that point we call rpc_clone_client_set_auth(), which creates a new > rpc_clnt, but it's created as a child of the original. > > When rpc_shutdown_client is called, the original clnt is not destroyed > because the child still holds a reference to it. So, we go and try the > call again and it fails with the same error over and over again, and we > end up with a long chain of rpc_clnt's. > > How that ends up smashing the stack, I'm not sure though. I'm also not > sure of the remedy. It seems like we might ought to have some upper > bound on the number of SETCLIENTID attempts? CLID_INUSE is supposed to be a permanent error now. I think one retry, if any, is all that is appropriate. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com