Date: Fri, 1 May 2015 16:56:00 -0400 (EDT)
From: Benjamin Coddington
To: Shawn Bohrer
cc: linux-nfs@vger.kernel.org, linux-pm@vger.kernel.org,
    linux-kernel@vger.kernel.org, mayoff@rgmadvisors.com, Jeff Layton,
    fsorenso@redhat.com
Subject: Re: NFS Freezer and stuck tasks
In-Reply-To: <20150304220027.GB20242@sbohrermbp13-local.rgmadvisors.com>
References: <20150304220027.GB20242@sbohrermbp13-local.rgmadvisors.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 4 Mar 2015, Shawn Bohrer wrote:

> Hello,
>
> We're using the Linux cgroup Freezer on some machines that use NFS and
> have run into what appears to be a bug where frozen tasks are blocking
> running tasks and preventing them from completing. On one of our
> machines which happens to be running an older 3.10.46 kernel we have
> frozen some of the tasks on the system using the cgroup Freezer. We
> also have a separate set of tasks which are NOT frozen which are stuck
> trying to open some files on NFS.
>
> Looking at the frozen tasks there are several that have the following
> stack:
>
> [] rpc_wait_bit_killable+0x35/0x80
> [] __rpc_wait_for_completion_task+0x2d/0x30
> [] nfs4_run_open_task+0x11d/0x170
> [] _nfs4_open_and_get_state+0x53/0x260
> [] nfs4_do_open+0x121/0x400
> [] nfs4_atomic_open+0x31/0x50
> [] nfs4_file_open+0xac/0x180
> [] do_dentry_open.isra.19+0x1ee/0x280
> [] finish_open+0x1e/0x30
> [] do_last.isra.64+0x2c2/0xc40
> [] path_openat.isra.65+0x2c9/0x490
> [] do_filp_open+0x38/0x80
> [] do_sys_open+0xe4/0x1c0
> [] SyS_open+0x1e/0x20
> [] system_call_fastpath+0x16/0x1b
> [] 0xffffffffffffffff
>
> Here it looks like we are waiting in a wait queue inside
> rpc_wait_bit_killable() for RPC_TASK_ACTIVE.
>
> And there is a single task with a stack that looks like the following:
>
> [] __refrigerator+0x55/0x150
> [] rpc_wait_bit_killable+0x66/0x80
> [] __rpc_wait_for_completion_task+0x2d/0x30
> [] nfs4_run_open_task+0x11d/0x170
> [] _nfs4_open_and_get_state+0x53/0x260
> [] nfs4_do_open+0x121/0x400
> [] nfs4_atomic_open+0x31/0x50
> [] nfs4_file_open+0xac/0x180
> [] do_dentry_open.isra.19+0x1ee/0x280
> [] finish_open+0x1e/0x30
> [] do_last.isra.64+0x2c2/0xc40
> [] path_openat.isra.65+0x2c9/0x490
> [] do_filp_open+0x38/0x80
> [] do_sys_open+0xe4/0x1c0
> [] SyS_open+0x1e/0x20
> [] system_call_fastpath+0x16/0x1b
> [] 0xffffffffffffffff
>
> This looks similar but the different offset into
> rpc_wait_bit_killable() shows that we have returned from the
> schedule() call in freezable_schedule() and are now blocked in
> __refrigerator() inside freezer_count().
>
> Similarly if you look at the tasks that are NOT frozen but are stuck
> opening an NFS file, they also have the following stack showing they are
> waiting in the wait queue for RPC_TASK_ACTIVE.
>
> [] rpc_wait_bit_killable+0x35/0x80
> [] __rpc_wait_for_completion_task+0x2d/0x30
> [] nfs4_run_open_task+0x11d/0x170
> [] _nfs4_open_and_get_state+0x53/0x260
> [] nfs4_do_open+0x121/0x400
> [] nfs4_atomic_open+0x31/0x50
> [] nfs4_file_open+0xac/0x180
> [] do_dentry_open.isra.19+0x1ee/0x280
> [] finish_open+0x1e/0x30
> [] do_last.isra.64+0x2c2/0xc40
> [] path_openat.isra.65+0x2c9/0x490
> [] do_filp_open+0x38/0x80
> [] do_sys_open+0xe4/0x1c0
> [] SyS_open+0x1e/0x20
> [] system_call_fastpath+0x16/0x1b
> [] 0xffffffffffffffff
>
> We have hit this a couple of times now and know that if we THAW all of
> the frozen tasks that running tasks will unwedge and finish.
>
> Additionally we have also tried thawing the single task that is frozen
> in __refrigerator() inside rpc_wait_bit_killable(). This usually
> results in a different frozen task entering the __refrigerator() state
> inside rpc_wait_bit_killable(). It looks like each one of those tasks
> must wake up another letting it progress. Again if you thaw enough of
> the frozen tasks eventually everything unwedges and everything
> completes.
>
> I've looked through the 3.10 stable patches since 3.10.46 and don't
> see anything that looks like it addresses this. Does anyone have any
> idea what might be going on here, and what the fix might be?
>
> Thanks,
> Shawn

Hi Shawn,

I just started looking at this myself. As Frank Sorensen points out in

  https://bugzilla.redhat.com/show_bug.cgi?id=1209143

the problem is that a task takes the xprt lock and then ends up in the
refrigerator, effectively blocking other tasks from proceeding.

Jeff, any suggestions on how to proceed here?

Ben
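
To illustrate the failure mode described in this thread, here is a
minimal user-space sketch in Python. It is not kernel code and does not
reproduce the actual SUNRPC or freezer implementation: the names
simulate, xprt_lock, thawed and frozen_task are hypothetical stand-ins,
a plain mutex plays the role of the xprt lock, and an event plays the
role of the refrigerator. It only shows the general pattern, assuming
the diagnosis above is right: a task that parks itself while holding a
shared lock stalls every other task that needs the lock, until it is
thawed.

```python
import threading


def simulate(wait_seconds=0.2):
    """Return (acquired_while_frozen, acquired_after_thaw)."""
    xprt_lock = threading.Lock()     # hypothetical stand-in for the xprt lock
    thawed = threading.Event()       # set when the frozen task is thawed
    lock_taken = threading.Event()   # set once the frozen task owns the lock

    def frozen_task():
        with xprt_lock:
            lock_taken.set()
            # Parked in the "refrigerator" with the lock still held.
            thawed.wait()

    holder = threading.Thread(target=frozen_task)
    holder.start()
    lock_taken.wait()

    # A task that is NOT frozen needs the same lock and times out waiting.
    acquired_while_frozen = xprt_lock.acquire(timeout=wait_seconds)
    if acquired_while_frozen:
        xprt_lock.release()

    # Thaw the holder: it leaves the refrigerator and drops the lock.
    thawed.set()
    holder.join()

    acquired_after_thaw = xprt_lock.acquire(timeout=wait_seconds)
    if acquired_after_thaw:
        xprt_lock.release()
    return acquired_while_frozen, acquired_after_thaw


if __name__ == "__main__":
    print(simulate())  # (False, True)
```

The first value stays False for as long as the holder is parked, which
matches the observation that the running tasks only unwedge once the
frozen tasks are thawed.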