Date: Tue, 9 Dec 2008 12:57:13 +0900
From: KAMEZAWA Hiroyuki
To: balbir@linux.vnet.ibm.com
Cc: Daisuke Nishimura, linux-mm@kvack.org, YAMAMOTO Takashi, Paul Menage,
    lizf@cn.fujitsu.com, linux-kernel@vger.kernel.org, Nick Piggin,
    David Rientjes, Pavel Emelianov, Dhaval Giani, Andrew Morton
Subject: Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v4)
Message-Id: <20081209125713.e868c43a.kamezawa.hiroyu@jp.fujitsu.com>
In-Reply-To: <20081209034828.GU13333@balbir.in.ibm.com>

On Tue, 9 Dec 2008 09:18:28 +0530 Balbir Singh wrote:
> * KAMEZAWA Hiroyuki [2008-12-09 11:59:43]:
>
> > On Wed, 26 Nov 2008 11:14:47 +0900 Daisuke Nishimura wrote:
> >
> > > On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh wrote:
> > > > Daisuke Nishimura wrote:
> > > > > Hi.
> > > > >
> > > > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > > > >
> > > > > For example, when attaching a task to a cpuset directory (memory_migrate=on):
> > > > >
> > > > > cgroup_tasks_write          (holds cgroup_mutex)
> > > > >   attach_task_by_pid
> > > > >     cgroup_attach_task
> > > > >       cpuset_attach
> > > > >         cpuset_migrate_mm
> > > > >           :
> > > > >           unmap_and_move
> > > > >             mem_cgroup_prepare_migration
> > > > >               mem_cgroup_try_charge
> > > > >                 mem_cgroup_hierarchical_reclaim
> > > > >
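(Aside: restated outside the kernel, the cycle above is just a non-recursive
mutex being re-acquired on its own call path. A minimal userspace sketch, with
a pthread mutex standing in for cgroup_mutex and function names that merely
mirror the chain above (the bodies are illustrative only, not the real code),
blocks forever at the innermost lock:

/* build: gcc -pthread deadlock.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;

static void mem_cgroup_hierarchical_reclaim(void)
{
	/* reclaim tries to take the lock the writer already holds */
	pthread_mutex_lock(&cgroup_mutex);	/* never returns */
	pthread_mutex_unlock(&cgroup_mutex);
}

static void mem_cgroup_try_charge(void)        { mem_cgroup_hierarchical_reclaim(); }
static void mem_cgroup_prepare_migration(void) { mem_cgroup_try_charge(); }
static void cpuset_migrate_mm(void)            { mem_cgroup_prepare_migration(); }
static void cpuset_attach(void)                { cpuset_migrate_mm(); }
static void cgroup_attach_task(void)           { cpuset_attach(); }

static void cgroup_tasks_write(void)
{
	pthread_mutex_lock(&cgroup_mutex);	/* cgroup_lock() in the kernel */
	cgroup_attach_task();
	pthread_mutex_unlock(&cgroup_mutex);
}

int main(void)
{
	cgroup_tasks_write();
	printf("never reached\n");
	return 0;
}

On a default (non-recursive) glibc mutex this hangs immediately, which is the
same picture as the hung task report quoted below.)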
> > > > Did lockdep complain about it?
> > > >
> > > I haven't understood lockdep so well, but I got logs like this:
> > >
> > > ===
> > > INFO: task move.sh:17710 blocked for more than 480 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > move.sh       D ffff88010e1c76c0     0 17710  17597
> > >  ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> > >  ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> > >  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> > > Call Trace:
> > >  [] mem_cgroup_get_first_node+0x29/0x8a
> > >  [] mutex_lock_nested+0x180/0x2a2
> > >  [] mem_cgroup_get_first_node+0x29/0x8a
> > >  [] mem_cgroup_get_first_node+0x29/0x8a
> > >  [] __mem_cgroup_try_charge+0x27a/0x2de
> > >  [] mem_cgroup_prepare_migration+0x6c/0xa5
> > >  [] migrate_pages+0x10c/0x4a0
> > >  [] migrate_pages+0x155/0x4a0
> > >  [] new_node_page+0x0/0x2f
> > >  [] check_range+0x300/0x325
> > >  [] do_migrate_pages+0x1a5/0x1f1
> > >  [] cpuset_migrate_mm+0x30/0x93
> > >  [] cpuset_migrate_mm+0x5a/0x93
> > >  [] cpuset_attach+0x93/0xa6
> > >  [] cgroup_attach_task+0x395/0x3e1
> > >  [] cgroup_tasks_write+0xfa/0x11d
> > >  [] cgroup_tasks_write+0x39/0x11d
> > >  [] cgroup_file_write+0xef/0x216
> > >  [] vfs_write+0xad/0x136
> > >  [] sys_write+0x45/0x6e
> > >  [] system_call_fastpath+0x16/0x1b
> > > INFO: lockdep is turned off.
> > > ===
> > >
> > > And other processes trying to hold cgroup_mutex are also stuck.
> > >
> > > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> > > >    mutex.
> > > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> > > >    hold the cgroup lock for long and holding it during reclaim will definitely be
> > > >    visible to users trying to create/delete nodes.
> > > >
> > > > I prefer to do (2); I'll look at the code more closely.
> > > >
> > > I basically agree, but I think we should also consider mpol_rebind_mm.
> > >
> > > mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
> > > which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
> > > OTOH, the page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
> > > which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
> > >
> > What's the status of this problem? Fixed or not yet?
> > Sorry for failing to track the patches.
> >
> Kamezawa-San,
>
> We are looking at two approaches that I had mentioned earlier:
>
> 1) rely on the new cgroup_tasklist mutex introduced to close the race

Hmm? What you're talking about is

  ==memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
  +static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */

this?

I think there are no Acks from Paul Menage (i.e. the cgroup maintainer) for
this, and I am afraid that it will add new complexity.

Hmm... it seems that my cgroup-ID scan patch will need more time to fix this
race. But I'd like to push fixes for this kind of race out to the cgroup layer
rather than adding more special cases... we already have tons of special
operations which cannot be handled in the cgroup layer and which break its
assumptions.

-Kame

> 2) Removing the cgroup lock dependency from cgroup_tasks_write. I worry
>    that it can lead to long latencies with cgroup_lock held.
>
> I can send a patch for (1) today. I also want to fix (2), but I have spent
> a lot of time staring at that code and could not find any easy way to do it.
>
> --
> 	Balbir
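(Aside: the mmap_sem vs. cgroup_mutex problem Nishimura-san describes above is
a plain AB-BA ordering inversion. A minimal userspace sketch, with pthread
primitives standing in for the kernel locks and purely illustrative function
names:

/* build: gcc -pthread abba.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t  cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_rwlock_t mmap_sem     = PTHREAD_RWLOCK_INITIALIZER;

/* cgroup_tasks_write -> cpuset_attach -> mpol_rebind_mm */
static void *attach_path(void *arg)
{
	pthread_mutex_lock(&cgroup_mutex);	/* cgroup_lock() */
	sleep(1);				/* widen the race window */
	pthread_rwlock_wrlock(&mmap_sem);	/* mpol_rebind_mm() */
	puts("attach path finished");
	pthread_rwlock_unlock(&mmap_sem);
	pthread_mutex_unlock(&cgroup_mutex);
	return NULL;
}

/* page fault -> mem_cgroup_try_charge -> hierarchical reclaim */
static void *fault_path(void *arg)
{
	pthread_rwlock_rdlock(&mmap_sem);	/* down_read(mmap_sem) */
	sleep(1);
	pthread_mutex_lock(&cgroup_mutex);	/* reclaim under cgroup_mutex */
	puts("fault path finished");
	pthread_mutex_unlock(&cgroup_mutex);
	pthread_rwlock_unlock(&mmap_sem);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, attach_path, NULL);
	pthread_create(&b, NULL, fault_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

With both threads started, A holds "cgroup_mutex" and waits for mmap_sem while
B holds mmap_sem for read and waits for "cgroup_mutex", so the program never
exits. Any fix has to break one of the two orderings: either the attach path
drops cgroup_mutex before it touches mmap_sem, or reclaim stops taking
cgroup_mutex altogether, which is roughly what the approaches discussed above
amount to.)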