Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754500Ab2FKMju (ORCPT ); Mon, 11 Jun 2012 08:39:50 -0400 Received: from mx1.redhat.com ([209.132.183.28]:17976 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754110Ab2FKMjt (ORCPT ); Mon, 11 Jun 2012 08:39:49 -0400 Date: Mon, 11 Jun 2012 08:39:32 -0400 From: Jeff Layton To: bfields Cc: "Myklebust, Trond" , Joerg Platte , "linux-kernel@vger.kernel.org" , "linux-nfs@vger.kernel.org" , Hans de Bruin Subject: Re: Kernel 3.4.X NFS server regression Message-ID: <20120611083932.24e27e39@corrin.poochiereds.net> In-Reply-To: <20120611121634.GB7654@fieldses.org> References: <4FD47D4E.9070307@naasa.net> <1339340441.4751.1.camel@lade.trondhjem.org> <20120611121634.GB7654@fieldses.org> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4480 Lines: 92 On Mon, 11 Jun 2012 08:16:34 -0400 bfields wrote: > On Sun, Jun 10, 2012 at 03:00:42PM +0000, Myklebust, Trond wrote: > > Cc: linux-nfs@vger.kernel.org + bfields and changing title to label it > > as a server regression since that is what the trace appears to imply. > > > > On Sun, 2012-06-10 at 12:56 +0200, Joerg Platte wrote: > > > All 3.4 kernels I tried so far (3.4, 3.4.1 and 3.4.2) suffer from the > > > same NFS related problem: > > > > > > Jun 10 09:23:36 coco kernel: INFO: task kworker/u:1:8 blocked for more > > > than 120 seconds. > > > Jun 10 09:23:36 coco kernel: "echo 0 > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > Jun 10 09:23:36 coco kernel: kworker/u:1 D 002ba28c 0 8 > > > 2 0x00000000 > > > Jun 10 09:23:36 coco kernel: df465ec0 00000046 00000005 002ba28c > > > 00000000 0000000a 00000282 df465e70 > > > Jun 10 09:23:36 coco kernel: df465ec0 df44d2b0 ffff6b60 df465e84 > > > df44d2b0 e33fa6b3 00000282 de764ae0 > > > Jun 10 09:23:36 coco kernel: ffffffff d78bcfb8 df465e8c c012e0f6 > > > df465ea4 c013610c 00000000 d78bcf80 > > > Jun 10 09:23:36 coco kernel: Call Trace: > > > Jun 10 09:23:36 coco kernel: [] ? add_timer+0x11/0x17 > > > Jun 10 09:23:36 coco kernel: [] ? queue_delayed_work_on+0x74/0xf0 > > > Jun 10 09:23:36 coco kernel: [] ? queue_delayed_work+0x1b/0x28 > > > Jun 10 09:23:36 coco kernel: [] schedule+0x1d/0x4c > > > Jun 10 09:23:36 coco kernel: [] cld_pipe_upcall+0x4e/0x75 [nfsd] > > > Jun 10 09:23:36 coco kernel: [] > > > nfsd4_cld_grace_done+0x60/0x99 [nfsd] > > > Jun 10 09:23:36 coco kernel: [] > > > nfsd4_record_grace_done+0x10/0x12 [nfsd] > > > Jun 10 09:23:36 coco kernel: [] laundromat_main+0x291/0x2d8 > > > [nfsd] > > > Jun 10 09:23:36 coco kernel: [] process_one_work+0xff/0x325 > > > Jun 10 09:23:36 coco kernel: [] ? start_worker+0x20/0x23 > > > Jun 10 09:23:36 coco kernel: [] ? > > > nfsd4_process_open1+0x32b/0x32b [nfsd] > > > Jun 10 09:23:36 coco kernel: [] worker_thread+0xf4/0x39a > > > Jun 10 09:23:36 coco kernel: [] ? rescuer_thread+0x231/0x231 > > > Jun 10 09:23:36 coco kernel: [] kthread+0x6c/0x6e > > > Jun 10 09:23:36 coco kernel: [] ? kthreadd+0xe8/0xe8 > > > Jun 10 09:23:36 coco kernel: [] kernel_thread_helper+0x6/0xd > > > > > > A kworker task is stuck in D state and nfs mounts from other clients do > > > not work at all. This happens only on one machine, another one with the > > > same kernel (same self compiled Debian package) works. All previous 3.3 > > > kernels work as well. > > > > > > Since this machine is remote it is not that easy to bisect to find the > > > bad commit. Are there any other things I can try? > > If you create a directory named /var/lib/nfs/v4recovery/, does the > problem go away? > > My guess would be that it's trying to upcall to the new reboot-recovery > state daemon, and you don't have that running. > > Before the addition of that upcall state was kept in > /var/lib/nfs/v4recovery. So we decide whether to use the old method or > the new one by checking for the existance of that path. > > But I'm guessing we were wrong to assume that existing setups that > people perceived as working would have that path, because the failures > in the absence of that path were probably less obvious. > > --b. This sounds like the same problem that Hans reported as well. I've not been able to reproduce that so far. Here's what I get when I start nfsd with no v4recoverdir and nfsdcld isn't running: [ 109.715080] NFSD: starting 90-second grace period [ 229.984220] NFSD: Unable to end grace period: -110 What I don't quite understand is why the queue_timeout job isn't getting run here. What should happen is that 30s after upcall, rpc_timeout_upcall_queue should run. The message will be found to be still sitting on the , so it should set its status to -ETIMEDOUT and wake up the caller. I can only assume that the queue_timeout job isn't getting run for some reason, but I'm unclear on why that would be. -- Jeff Layton -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/