From: Balbir Singh <balbir@linux.vnet.ibm.com>
To: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura, linux-mm@kvack.org, YAMAMOTO Takashi, Paul Menage, lizf@cn.fujitsu.com, linux-kernel@vger.kernel.org, Nick Piggin, David Rientjes, Pavel Emelianov, Dhaval Giani, Andrew Morton
Date: Tue, 9 Dec 2008 09:18:28 +0530
Subject: Re: [mm] [PATCH 3/4] Memory cgroup hierarchical reclaim (v4)
Message-ID: <20081209034828.GU13333@balbir.in.ibm.com>
In-Reply-To: <20081209115943.7d6a0ea3.kamezawa.hiroyu@jp.fujitsu.com>

* KAMEZAWA Hiroyuki [2008-12-09 11:59:43]:

> On Wed, 26 Nov 2008 11:14:47 +0900
> Daisuke Nishimura wrote:
>
> > On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh wrote:
> > > Daisuke Nishimura wrote:
> > > > Hi.
> > > >
> > > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > > >
> > > > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> > > >
> > > > cgroup_tasks_write    (holds cgroup_mutex)
> > > >   attach_task_by_pid
> > > >     cgroup_attach_task
> > > >       cpuset_attach
> > > >         cpuset_migrate_mm
> > > >           :
> > > >           unmap_and_move
> > > >             mem_cgroup_prepare_migration
> > > >               mem_cgroup_try_charge
> > > >                 mem_cgroup_hierarchical_reclaim
> > > >
> > > Did lockdep complain about it?
> > >
> > I haven't understood lockdep so well, but I got logs like this:
> >
> > ===
> > INFO: task move.sh:17710 blocked for more than 480 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > move.sh       D ffff88010e1c76c0     0 17710  17597
> >  ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> >  ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> >  ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> > Call Trace:
> >  [] mem_cgroup_get_first_node+0x29/0x8a
> >  [] mutex_lock_nested+0x180/0x2a2
> >  [] mem_cgroup_get_first_node+0x29/0x8a
> >  [] mem_cgroup_get_first_node+0x29/0x8a
> >  [] __mem_cgroup_try_charge+0x27a/0x2de
> >  [] mem_cgroup_prepare_migration+0x6c/0xa5
> >  [] migrate_pages+0x10c/0x4a0
> >  [] migrate_pages+0x155/0x4a0
> >  [] new_node_page+0x0/0x2f
> >  [] check_range+0x300/0x325
> >  [] do_migrate_pages+0x1a5/0x1f1
> >  [] cpuset_migrate_mm+0x30/0x93
> >  [] cpuset_migrate_mm+0x5a/0x93
> >  [] cpuset_attach+0x93/0xa6
> >  [] cgroup_attach_task+0x395/0x3e1
> >  [] cgroup_tasks_write+0xfa/0x11d
> >  [] cgroup_tasks_write+0x39/0x11d
> >  [] cgroup_file_write+0xef/0x216
> >  [] vfs_write+0xad/0x136
> >  [] sys_write+0x45/0x6e
> >  [] system_call_fastpath+0x16/0x1b
> > INFO: lockdep is turned off.
> > ===
> >
> > And other processes trying to hold cgroup_mutex are also stuck.
> >
> > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific mutex.
> > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll hold the cgroup lock for long, and holding it during reclaim will definitely be visible to users trying to create/delete nodes.
> > >
> > > I prefer to do (2); I'll look at the code more closely.
> > >
> > I basically agree, but I think we should also consider mpol_rebind_mm.
> >
> > mpol_rebind_mm, which can be called from cpuset_attach, does
> > down_write(mm->mmap_sem), which means down_write(mm->mmap_sem) can be
> > called under cgroup_mutex. OTOH, the page fault path does
> > down_read(mm->mmap_sem) and can call mem_cgroup_try_charge, which means
> > mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
> >
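[The inversion Nishimura-San describes above is an ABBA ordering: the attach path takes cgroup_mutex and then mm->mmap_sem, while the fault path takes mm->mmap_sem and then cgroup_mutex. Below is a minimal, hypothetical userspace sketch of that pattern, not kernel code; plain pthread mutexes stand in for both kernel locks, and all names in it (attach_path, fault_path, lock_a, lock_b) are invented for illustration.]

/* Minimal illustration of the ABBA inversion described above; NOT
 * kernel code.  lock_a stands in for cgroup_mutex, lock_b for
 * mm->mmap_sem (simplified to a plain mutex).  Build with:
 *     cc -pthread abba.c
 * The program never exits: each thread blocks on the lock the other
 * one holds. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* "cgroup_mutex" */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* "mmap_sem"     */

/* Mimics cgroup_tasks_write -> cpuset_attach -> mpol_rebind_mm:
 * cgroup_mutex first, then mmap_sem. */
static void *attach_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);
	sleep(1);                     /* widen the race window */
	pthread_mutex_lock(&lock_b);  /* blocks: fault_path holds lock_b */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

/* Mimics a page fault charging memory via mem_cgroup_try_charge:
 * mmap_sem first, then cgroup_mutex. */
static void *fault_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);
	sleep(1);
	pthread_mutex_lock(&lock_a);  /* blocks: attach_path holds lock_a */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	pthread_create(&t1, NULL, attach_path, NULL);
	pthread_create(&t2, NULL, fault_path, NULL);
	pthread_join(t1, NULL);       /* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	puts("unreachable");
	return 0;
}

[In the real kernel the second lock is a rwsem rather than a mutex, but the inversion deadlocks the same way: the down_write in mpol_rebind_mm waits behind the faulting reader, which in turn waits on cgroup_mutex held by the writer's thread.]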
> What's the status of this problem? Fixed, or not yet?
> Sorry for failing to track patches.
>

Kamezawa-San,

We are looking at the two approaches that I had mentioned earlier:

1) Rely on the new cgroup_tasklist mutex, which was introduced to close
   the race.
2) Remove the cgroup lock dependency from cgroup_tasks_write; I worry
   that it can lead to long latencies with cgroup_lock held. [A sketch
   of this lock-dropping pattern follows at the end of this message.]

I can send a patch for (1) today. I want to fix (2) as well, but I have
spent a lot of time staring at that code and could not find any easy
way to do it.

-- 
	Balbir
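[For reference, a sketch of the shape approach (2) could take, in the same toy model as the pthread example earlier in this message: give up the outer lock before the long-running step, then re-take it and revalidate afterwards. The function below is a drop-in replacement for attach_path in that sketch; all names are invented for illustration, and this shows the locking pattern only, not the actual kernel patch.]

/* Drop-in replacement for attach_path in the earlier sketch.  The
 * outer lock ("cgroup_mutex") is released before the long migration
 * step, so the fault path's b-then-a ordering can no longer deadlock
 * against us. */
static void *attach_path_fixed(void *arg)
{
	(void)arg;

	pthread_mutex_lock(&lock_a);    /* short bookkeeping that truly   */
	/* ... needs "cgroup_mutex" goes here ...                         */
	pthread_mutex_unlock(&lock_a);  /* give it up before migrating    */

	pthread_mutex_lock(&lock_b);    /* long work under "mmap_sem"     */
	/* ... migrate/reclaim; this may now take lock_a itself without   */
	/* inverting the order, since we no longer hold it ...            */
	pthread_mutex_unlock(&lock_b);

	pthread_mutex_lock(&lock_a);    /* re-take and revalidate:        */
	/* ... state observed earlier may have changed while unlocked ... */
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

[The cost of this pattern is the revalidation step: anything checked under the first critical section can change while the lock is dropped, which is the kind of race the cgroup_tasklist mutex in option (1) is apparently meant to close.]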