Hi,
Here is a proposed fix for the reported memory controller cgroup_mutex
deadlock. It is lightly tested and reviewed; I need help with further
review and testing. Is the reported deadlock still reproducible after
this patch? A careful review of the cpuset impact would also be highly
appreciated.
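As a rough illustration of the problem and the fix below, here is a userspace C model (not kernel code): each mutex is reduced to a held-flag for a single task, and re-taking a non-recursive mutex the task already holds counts as a deadlock. All names and the simplified lock() helper are illustrative.

```c
/* Userspace model, not kernel code: each mutex is a "held" flag for
 * one task; re-taking a held (non-recursive) mutex means deadlock.
 */
static int cgroup_mutex_held;
static int callback_mutex_held;

static int lock(int *held)    { if (*held) return -1; *held = 1; return 0; }
static void unlock(int *held) { *held = 0; }

/* memcg reclaim re-enters cgroup_lock() */
static int memcg_reclaim(void)
{
	return lock(&cgroup_mutex_held);
}

/* before the patch: cgroup_mutex stays held across do_migrate_pages() */
static int migrate_mm_old(void)
{
	int ret = memcg_reclaim();	/* charge fails -> reclaim */
	if (ret == 0)
		unlock(&cgroup_mutex_held);
	return ret;			/* -1: self-deadlock */
}

/* after the patch: drop cgroup_mutex, hold callback_mutex instead */
static int migrate_mm_new(void)
{
	int ret;

	unlock(&cgroup_mutex_held);	/* cgroup_unlock() */
	lock(&callback_mutex_held);
	ret = memcg_reclaim();		/* reclaim can now take cgroup_mutex */
	if (ret == 0)
		unlock(&cgroup_mutex_held);
	unlock(&callback_mutex_held);
	lock(&cgroup_mutex_held);	/* cgroup_lock() before returning */
	return ret;
}

int demo(void)
{
	int old_ret, new_ret;

	lock(&cgroup_mutex_held);	/* cgroup_attach_task() */
	old_ret = migrate_mm_old();	/* would deadlock: -1 */
	new_ret = migrate_mm_new();	/* succeeds: 0 */
	unlock(&cgroup_mutex_held);
	return (old_ret == -1 && new_ret == 0) ? 0 : 1;
}
```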
From: Balbir Singh <[email protected]>
cpuset_migrate_mm() holds cgroup_mutex throughout the duration of
do_migrate_pages(). The issues with that are:
1. It can lead to a deadlock with memcg, as do_migrate_pages()
enters reclaim
2. It can lead to long latencies, preventing users from creating or
destroying other cgroups anywhere else
The patch holds callback_mutex through the duration of cpuset_migrate_mm() and
gives up cgroup_mutex while doing so.
Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/cpuset.h | 13 ++++++++++++-
kernel/cpuset.c | 23 ++++++++++++-----------
2 files changed, 24 insertions(+), 12 deletions(-)
diff -puN kernel/cgroup.c~cpuset-remove-cgroup-mutex-from-update-path kernel/cgroup.c
diff -puN kernel/cpuset.c~cpuset-remove-cgroup-mutex-from-update-path kernel/cpuset.c
--- a/kernel/cpuset.c~cpuset-remove-cgroup-mutex-from-update-path
+++ a/kernel/cpuset.c
@@ -369,7 +369,7 @@ static void guarantee_online_mems(const
* task has been modifying its cpuset.
*/
-void cpuset_update_task_memory_state(void)
+void __cpuset_update_task_memory_state(bool held)
{
int my_cpusets_mem_gen;
struct task_struct *tsk = current;
@@ -380,7 +380,8 @@ void cpuset_update_task_memory_state(voi
rcu_read_unlock();
if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
- mutex_lock(&callback_mutex);
+ if (!held)
+ mutex_lock(&callback_mutex);
task_lock(tsk);
cs = task_cs(tsk); /* Maybe changed when task not locked */
guarantee_online_mems(cs, &tsk->mems_allowed);
@@ -394,7 +395,8 @@ void cpuset_update_task_memory_state(voi
else
tsk->flags &= ~PF_SPREAD_SLAB;
task_unlock(tsk);
- mutex_unlock(&callback_mutex);
+ if (!held)
+ mutex_unlock(&callback_mutex);
mpol_rebind_task(tsk, &tsk->mems_allowed);
}
}
@@ -949,13 +951,15 @@ static int update_cpumask(struct cpuset
* so that the migration code can allocate pages on these nodes.
*
* Call holding cgroup_mutex, so current's cpuset won't change
- * during this call, as manage_mutex holds off any cpuset_attach()
+ * during this call, as callback_mutex holds off any cpuset_attach()
* calls. Therefore we don't need to take task_lock around the
* call to guarantee_online_mems(), as we know no one is changing
* our task's cpuset.
*
* Hold callback_mutex around the two modifications of our tasks
- * mems_allowed to synchronize with cpuset_mems_allowed().
+ * mems_allowed to synchronize with cpuset_mems_allowed(). Give
+ * up cgroup_mutex to avoid deadlocking with other subsystems
+ * as we enter reclaim from do_migrate_pages().
*
* While the mm_struct we are migrating is typically from some
* other task, the task_struct mems_allowed that we are hacking
@@ -976,17 +980,14 @@ static void cpuset_migrate_mm(struct mm_
{
struct task_struct *tsk = current;
- cpuset_update_task_memory_state();
-
+ cgroup_unlock();
mutex_lock(&callback_mutex);
+ cpuset_update_task_memory_state_locked();
tsk->mems_allowed = *to;
- mutex_unlock(&callback_mutex);
-
do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
-
- mutex_lock(&callback_mutex);
guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
mutex_unlock(&callback_mutex);
+ cgroup_lock();
}
static void *cpuset_being_rebound;
diff -puN include/linux/cpuset.h~cpuset-remove-cgroup-mutex-from-update-path include/linux/cpuset.h
--- a/include/linux/cpuset.h~cpuset-remove-cgroup-mutex-from-update-path
+++ a/include/linux/cpuset.h
@@ -25,7 +25,18 @@ extern void cpuset_cpus_allowed_locked(s
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
-void cpuset_update_task_memory_state(void);
+extern void __cpuset_update_task_memory_state(bool held);
+
+static inline void cpuset_update_task_memory_state(void)
+{
+ __cpuset_update_task_memory_state(false);
+}
+
+static inline void cpuset_update_task_memory_state_locked(void)
+{
+ __cpuset_update_task_memory_state(true);
+}
+
int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
_
--
Balbir
On Wed, 10 Dec 2008 10:49:47 +0530, Balbir Singh <[email protected]> wrote:
> [...]
>
> The patch holds callback_mutex through the duration of cpuset_migrate_mm() and
> gives up cgroup_mutex while doing so.
>
I agree that changing cpuset_migrate_mm() not to hold cgroup_mutex is one way
to fix the deadlock, and it looks good to me at first glance.
But I'm not sure it's good to change cpuset (another subsystem's) code because of memcg.
Anyway, I'll test this patch and report the result tomorrow.
(Sorry, I don't have enough time today.)
Thanks,
Daisuke Nishimura.
* Daisuke Nishimura <[email protected]> [2008-12-10 15:19:48]:
> On Wed, 10 Dec 2008 10:49:47 +0530, Balbir Singh <[email protected]> wrote:
> > [...]
> >
> I agree changing cpuset_migrate_mm not to hold cgroup_mutex to fix the dead lock
> is one choice, and it looks good to me at the first impression.
>
> But I'm not sure it's good to change cpuset(other subsystem) code because of memcg.
>
> Anyway, I'll test this patch and report the result tomorrow.
> (Sorry, I don't have enough time today.)
Thanks for helping, Daisuke-san!
I'll look forward to your test results. I'll continue to pound on my
system meanwhile.
--
Balbir
On Wed, 10 Dec 2008 15:19:48 +0900, Daisuke Nishimura <[email protected]> wrote:
> On Wed, 10 Dec 2008 10:49:47 +0530, Balbir Singh <[email protected]> wrote:
> > [...]
> >
> I agree changing cpuset_migrate_mm not to hold cgroup_mutex to fix the dead lock
> is one choice, and it looks good to me at the first impression.
>
> But I'm not sure it's good to change cpuset(other subsystem) code because of memcg.
>
> Anyway, I'll test this patch and report the result tomorrow.
> (Sorry, I don't have enough time today.)
>
Unfortunately, this patch doesn't seem to be enough.
This patch can fix the deadlock caused by the circular locking of cgroup_mutex,
but it cannot fix the one caused by the race between page reclaim and
cpuset_attach() (mpol_rebind_mm).
(The deadlock I fixed in memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
was caused by the race between memcg's OOM handling and mpol_rebind_mm, and was
independent of hierarchy.)
I attach the logs I got while testing this patch.
Thanks,
Daisuke Nishimura.
===
INFO: task automount:23438 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
automount D ffff88010ae963c0 0 23438 1
ffff8803ab8f9300 0000000000000046 0000000000000000 0000000000000000
ffff88010fb72600 ffff8803ab8f9670 0000000d00000000 0000000100026d3a
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b3234>] fget_light+0x49/0xe1
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task automount:24873 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
automount D ffff88010ae963c0 0 24873 1
ffff8803ab8fcc00 0000000000000046 0000000000000000 0000000000000000
ffff8803afbe4c00 ffff8803ab8fcf70 0000000f00000000 0000000100029028
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b3234>] fget_light+0x49/0xe1
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task mmapstress10:21307 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmapstress10 D ffff88010acb84c0 0 21307 14494
ffff88010ad8df00 0000000000000046 0000000000000000 0000000000000000
ffff88010fada600 ffff88010ad8e270 0000000700000000 000000010002983e
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802ae4f0>] mem_cgroup_hierarchical_reclaim+0x4c/0xc6
[<ffffffff802ae9f4>] __mem_cgroup_try_charge+0x151/0x1d1
[<ffffffff802ae8e3>] __mem_cgroup_try_charge+0x40/0x1d1
[<ffffffff802af2d8>] mem_cgroup_charge_common+0x46/0x72
[<ffffffff80291440>] do_wp_page+0x45a/0x646
[<ffffffff80292d85>] handle_mm_fault+0x6a8/0x737
[<ffffffff80292db6>] handle_mm_fault+0x6d9/0x737
[<ffffffff804d2371>] do_page_fault+0x3ab/0x753
[<ffffffff804d034f>] page_fault+0x1f/0x30
INFO: lockdep is turned off.
INFO: task shmem_test_02:22746 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
shmem_test_02 D ffff88010ac81c80 0 22746 14216
ffff8800bf1fa600 0000000000000046 0000000000000000 0000000000000000
ffff88010fa9a600 ffff8800bf1fa970 0000000400000000 000000010002593a
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802ae443>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802ae4f0>] mem_cgroup_hierarchical_reclaim+0x4c/0xc6
[<ffffffff802ae9f4>] __mem_cgroup_try_charge+0x151/0x1d1
[<ffffffff802ae8e3>] __mem_cgroup_try_charge+0x40/0x1d1
[<ffffffff802af2d8>] mem_cgroup_charge_common+0x46/0x72
[<ffffffff8028c9fe>] shmem_getpage+0x6ae/0x851
[<ffffffff8022e1f0>] task_rq_lock+0x44/0x78
[<ffffffff804cf73d>] trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff804cf73d>] trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8020e304>] do_IRQ+0x139/0x15d
[<ffffffff802c7b8d>] mnt_want_write+0x6e/0x76
[<ffffffff802c604c>] mnt_drop_write+0x25/0xec
[<ffffffff8028cc3b>] shmem_fault+0x3a/0x5f
[<ffffffff802909be>] __do_fault+0x51/0x402
[<ffffffff8029289e>] handle_mm_fault+0x1c1/0x737
[<ffffffff804d2371>] do_page_fault+0x3ab/0x753
[<ffffffff804d034f>] page_fault+0x1f/0x30
INFO: lockdep is turned off.
INFO: task move.sh:22750 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
move.sh D ffff88010b4ab900 0 22750 19661
ffff8803ab8fdf00 0000000000000046 0000000000000000 0000000000000002
ffff8803afbe4c00 ffff8803ab8fe270 0000000f00000000 0000000100025624
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff804cf64e>] __down_write_nested+0x7e/0x96
[<ffffffff804ceb34>] down_write+0x64/0x75
[<ffffffff802a43c0>] mpol_rebind_mm+0x16/0x3f
[<ffffffff802a43c0>] mpol_rebind_mm+0x16/0x3f
[<ffffffff8026e06a>] cpuset_attach+0x7d/0xa6
[<ffffffff8026b05d>] cgroup_attach_task+0x33d/0x397
[<ffffffff8026b1b1>] cgroup_tasks_write+0xfa/0x11e
[<ffffffff8026b0f0>] cgroup_tasks_write+0x39/0x11e
[<ffffffff8026b66e>] cgroup_file_write+0xed/0x20b
[<ffffffff802b25f0>] vfs_write+0xad/0x136
[<ffffffff802b2a86>] sys_write+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task udev_run_devd:22758 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
udev_run_devd D ffff88010eb39c80 0 22758 1
ffff8801091e8000 0000000000000046 0000000000000000 0000000000000000
ffff88010fb4cc00 ffff8801091e8370 0000000c00000000 0000000100024ffd
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task ls:22850 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ls D ffff88010e8897c0 0 22850 8012
ffff8801091edf00 0000000000000046 0000000000000000 0000000000000000
ffff88010fb22600 ffff8801091ee270 0000000a00000000 0000000100025a97
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task multipath:27599 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
multipath D ffff88010b4af200 0 27599 1
ffff8800ba971300 0000000000000046 0000000000000000 0000000000000000
ffff8803afbe4c00 ffff8800ba971670 0000000f00000000 0000000100051678
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
INFO: task udev_run_hotplu:27601 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
udev_run_hotp D ffff88010e476880 0 27601 1
ffff8801091ea600 0000000000000046 0000000000000000 0000000000000000
ffff88010fb48000 ffff8801091ea970 0000000b00000000 0000000100051867
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff804ce493>] mutex_lock_nested+0x188/0x2b2
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c6490>] mntput_no_expire+0x1e/0x139
[<ffffffff802c85ab>] seq_escape+0x3a/0xb8
[<ffffffff802699c3>] cgroup_show_options+0x20/0xa3
[<ffffffff802c80d6>] show_vfsmnt+0xd7/0xf5
[<ffffffff802c8d2a>] seq_read+0x20c/0x2e5
[<ffffffff802b2723>] vfs_read+0xaa/0x133
[<ffffffff802b2a18>] sys_read+0x45/0x6e
[<ffffffff8020bedb>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
===
* Daisuke Nishimura <[email protected]> [2008-12-10 16:41:26]:
> On Wed, 10 Dec 2008 15:19:48 +0900, Daisuke Nishimura <[email protected]> wrote:
> > [...]
>
> Unfortunately, this patch doesn't seem enough.
>
> This patch can fix dead lock caused by "circular lock of cgroup_mutex",
> but cannot that of caused by "race between page reclaim and cpuset_attach(mpol_rebind_mm)".
>
> (The dead lock I fixed in memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
> was caused by "race between memcg's oom and mpol_rebind_mm, and was independent of hierarchy.)
>
Yes, I agree; my point was to fix the deadlock caused in the hierarchy
by cpuset_migrate_mm(). If I understand correctly:
1. This patch introduces no new bug and fixes that deadlock, but the
old bug remains
2. We need this patch plus your fix to completely solve the problem?
Could you also share how to reproduce the issue? I'll test on my end
as well.
Thanks for your help!
--
Balbir
On Wed, 10 Dec 2008 16:41:26 +0900
Daisuke Nishimura <[email protected]> wrote:
> On Wed, 10 Dec 2008 15:19:48 +0900, Daisuke Nishimura <[email protected]> wrote:
> > [...]
>
> Unfortunately, this patch doesn't seem enough.
>
> This patch can fix dead lock caused by "circular lock of cgroup_mutex",
> but cannot that of caused by "race between page reclaim and cpuset_attach(mpol_rebind_mm)".
>
> (The dead lock I fixed in memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
> was caused by "race between memcg's oom and mpol_rebind_mm, and was independent of hierarchy.)
>
> I attach logs I got in testing this patch.
>
Hmm, ok then, what you are referring to is this race:
--
cgroup_lock()
-> cpuset_attach()
-> down_write(&mm->mmap_sem);
down_read()
-> page fault
-> reclaim in memcg
-> cgroup_lock().
--
What this patch tries to fix is this recursive locking:
--
cgroup_lock()
-> cpuset_attach()
-> cpuset_migrate_mm()
-> charge to migration
-> go to reclaim and meet cgroup_lock.
--
Right ?
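The first race above involves two tasks and forms an AB-BA cycle between cgroup_mutex and mm->mmap_sem; a single-threaded userspace sketch that tracks lock owners shows the cycle (not kernel code; the rwsem is modeled as exclusive, which is valid for a read/write pair, and all names are illustrative):

```c
/* Userspace sketch of the first race: two tasks, two locks.
 * acquire() returns 0 on success, or the current owner the
 * caller would block on.
 */
enum { NOBODY = 0, TASK_A = 1, TASK_B = 2 };

static int cgroup_mutex_owner;
static int mmap_sem_owner;

static int acquire(int *owner, int task)
{
	if (*owner != NOBODY && *owner != task)
		return *owner;		/* would block on this owner */
	*owner = task;
	return 0;
}

int demo_abba(void)
{
	int a_blocks_on, b_blocks_on;

	acquire(&cgroup_mutex_owner, TASK_A); /* A: cgroup_lock() */
	acquire(&mmap_sem_owner, TASK_B);     /* B: down_read(&mm->mmap_sem) */

	/* A: cpuset_attach() -> mpol_rebind_mm() -> down_write(&mm->mmap_sem) */
	a_blocks_on = acquire(&mmap_sem_owner, TASK_A);
	/* B: page fault -> reclaim in memcg -> cgroup_lock() */
	b_blocks_on = acquire(&cgroup_mutex_owner, TASK_B);

	/* each task blocks on a lock the other holds: a cycle, i.e. deadlock */
	return (a_blocks_on == TASK_B && b_blocks_on == TASK_A) ? 0 : 1;
}
```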
BTW, is it really safe to release cgroup_lock() while attach() is going on ?
If not, can this lock for attach be replaced with a (new) cgroup-private mutex ?
A new mutex like this ?
--
struct cgroup {
	.....
	struct mutex attach_mutex; /* for serializing attach() ops;
				    * while attach() is going on,
				    * rmdir() will fail */
};
--
In the end, do we need the big lock of cgroup_lock for attach() at all ?
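A userspace sketch of that per-cgroup mutex idea (the mutex is reduced to a busy flag; the function names and the exact rmdir() behavior are assumptions for illustration, not existing kernel API):

```c
#include <errno.h>

/* Userspace sketch, not kernel code: a per-cgroup attach lock lets
 * rmdir() fail with -EBUSY while attach() is in flight, instead of
 * serializing on the global cgroup_mutex.  The busy flag stands in
 * for mutex_lock(&cgrp->attach_mutex); all names are illustrative.
 */
struct cgroup_model {
	int attach_busy;
	int removed;
};

static int cgroup_rmdir(struct cgroup_model *cgrp)
{
	if (cgrp->attach_busy)
		return -EBUSY;		/* attach() still in progress */
	cgrp->removed = 1;
	return 0;
}

static int attach_begin(struct cgroup_model *cgrp)
{
	if (cgrp->removed)
		return -1;		/* cgroup already gone */
	cgrp->attach_busy = 1;		/* mutex_lock(&cgrp->attach_mutex) */
	return 0;
}

static void attach_end(struct cgroup_model *cgrp)
{
	cgrp->attach_busy = 0;		/* mutex_unlock() */
}

int demo_attach(void)
{
	struct cgroup_model z = { 0, 0 };
	int ret;

	attach_begin(&z);
	ret = cgroup_rmdir(&z);		/* -EBUSY while attach runs */
	attach_end(&z);
	if (ret != -EBUSY)
		return 1;
	return cgroup_rmdir(&z);	/* removal succeeds afterwards */
}
```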
-Kame
On Wed, 10 Dec 2008 10:49:47 +0530
Balbir Singh <[email protected]> wrote:
> Hi,
>
> [...]
>
> The patch holds callback_mutex through the duration of cpuset_migrate_mm() and
> gives up cgroup_mutex while doing so.
>
> [...]
>
> @@ -976,17 +980,14 @@ static void cpuset_migrate_mm(struct mm_
> {
> struct task_struct *tsk = current;
>
> - cpuset_update_task_memory_state();
> -
> + cgroup_unlock();
> mutex_lock(&callback_mutex);
> + cpuset_update_task_memory_state_locked();
> tsk->mems_allowed = *to;
> - mutex_unlock(&callback_mutex);
> -
> do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
> -
> - mutex_lock(&callback_mutex);
> guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
> mutex_unlock(&callback_mutex);
> + cgroup_lock();
> }
>
Hmm...can't this happen ?
Assume there is a task X and cgroups Z1 and Z2. Z1 and Z2 don't need to be in
the same hierarchy.
==
CPU A attach task X to cgroup Z1
cgroup_lock()
for_each_subsys_state()
=> attach(X, Z1)
=> migrate_mm()
=> cgroup_unlock()
migration
CPU B attach task X to cgroup Z2 at the same time
cgroup_lock()
replace css_set.
==
Can't the work on CPU B break for_each_subsys_state() on CPU A ?
Sorry if I misunderstand.
Thanks,
-Kame
* KAMEZAWA Hiroyuki <[email protected]> [2008-12-10 17:49:06]:
> On Wed, 10 Dec 2008 10:49:47 +0530
> Balbir Singh <[email protected]> wrote:
>
> > [...]
> > *
> > * Hold callback_mutex around the two modifications of our tasks
> > - * mems_allowed to synchronize with cpuset_mems_allowed().
> > + * mems_allowed to synchronize with cpuset_mems_allowed(). Give
> > + * up cgroup_mutex to avoid deadlocking with other subsystems
> > + * as we enter reclaim from do_migrate_pages().
> > *
> > * While the mm_struct we are migrating is typically from some
> > * other task, the task_struct mems_allowed that we are hacking
> > @@ -976,17 +980,14 @@ static void cpuset_migrate_mm(struct mm_
> > {
> > struct task_struct *tsk = current;
> >
> > - cpuset_update_task_memory_state();
> > -
> > + cgroup_unlock();
> > mutex_lock(&callback_mutex);
> > + cpuset_update_task_memory_state_locked();
> > tsk->mems_allowed = *to;
> > - mutex_unlock(&callback_mutex);
> > -
> > do_migrate_pages(mm, from, to, MPOL_MF_MOVE_ALL);
> > -
> > - mutex_lock(&callback_mutex);
> > guarantee_online_mems(task_cs(tsk),&tsk->mems_allowed);
> > mutex_unlock(&callback_mutex);
> > + cgroup_lock();
> > }
> >
>
> Hmm...can't this happen ?
>
> Assume there is a task X and cgroups Z1 and Z2. Z1 and Z2 don't need to be in
> the same hierarchy.
> ==
> CPU A attach task X to cgroup Z1
> cgroup_lock()
> for_each_subsys_state()
You mean for_each_subsys() right?
> => attach(X,Z)
> => migrate_mm()
> => cgroup_unlock()
> migration
>
> CPU B attach task X to cgroup Z2 at the same time
> cgroup_lock()
> replace css_set.
> ==
>
> Can't the work on CPU B break for_each_subsys_state() on CPU A ?
>
for_each_subsys() is hierarchy-aware, so if we try to add the same task
to different hierarchies, it should not be a problem, right?
> Sorry if I misunderstand.
I hope I understood your scenario correctly.
--
Balbir
Balbir Singh said:
> * KAMEZAWA Hiroyuki <[email protected]> [2008-12-10
> 17:49:06]:
>
>> Hmm...can't this happen ?
>>
>> Assume there is a task X and cgroups Z1 and Z2. Z1 and Z2 don't need to
>> be in the same hierarchy.
>> ==
>> CPU A attach task X to cgroup Z1
>> cgroup_lock()
>> for_each_subsys_state()
>
> You mean for_each_subsys() right?
>
>> => attach(X,Z)
>> => migrate_mm()
>> => cgroup_unlock()
>> migration
>>
>> CPU B attach task X to cgroup Z2 at the same time
>> cgroup_lock()
>> replace css_set.
>> ==
>>
>> Can't the work on CPU B break for_each_subsys_state() on CPU A ?
>>
>
> for_each_subsys() is hierarchy-aware, so if we try to add the same task
> to different hierarchies, it should not be a problem, right?
>
Ah, maybe. But what happens when Z1 and Z2 are in the same hierarchy ?
Are there some locks ?
-Kame
On Wed, 10 Dec 2008 17:18:36 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 10 Dec 2008 16:41:26 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Wed, 10 Dec 2008 15:19:48 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > I agree changing cpuset_migrate_mm not to hold cgroup_mutex to fix the dead lock
> > > is one choice, and it looks good to me at the first impression.
> > >
> > > But I'm not sure it's good to change cpuset(other subsystem) code because of memcg.
> > >
> > > Anyway, I'll test this patch and report the result tomorrow.
> > > (Sorry, I don't have enough time today.)
> > >
> > Unfortunately, this patch doesn't seem enough.
> >
> > This patch can fix the deadlock caused by the "circular lock of cgroup_mutex",
> > but not the one caused by the "race between page reclaim and cpuset_attach(mpol_rebind_mm)".
> >
> > (The deadlock I fixed in memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
> > was caused by the "race between memcg's oom and mpol_rebind_mm", and was independent of hierarchy.)
> >
> > I attach logs I got in testing this patch.
> >
> Hmm, ok then, what you mention is this race.
> --
> cgroup_lock()
> -> cpuset_attach()
> -> down_write(&mm->mmap_sem);
>
> down_read()
> -> page fault
> -> reclaim in memcg
> -> cgroup_lock().
> --
> What this patch tries to fix is this recursive locking
> --
> cgroup_lock()
> -> cpuset_attach()
> -> cpuset_migrate_mm()
> -> charge to migration
> -> go to reclaim and meet cgroup_lock.
> --
>
>
> Right ?
>
Yes.
Thank you for explaining in detail.
Daisuke Nishimura.
> BTW, is releasing cgroup_lock() while attach() is going on really safe ?
> If not, can this lock for attach be replaced with a (new) cgroup-private mutex ?
>
> a new mutex like this ?
> --
> struct cgroup {
> .....
> mutex_t attach_mutex; /* for serializing attach() ops.
> while attach() is going on, rmdir() will fail */
> }
> --
> Do we need the big lock of cgroup_lock for attach(), at last ?
>
> -Kame
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
* Daisuke Nishimura <[email protected]> [2008-12-10 20:53:37]:
> On Wed, 10 Dec 2008 17:18:36 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > Hmm, ok then, what you mention to is this race.
> > --
> > cgroup_lock()
> > -> cpuset_attach()
> > -> down_write(&mm->mmap_sem);
> >
> > down_read()
> > -> page fault
> > -> reclaim in memcg
> > -> cgroup_lock().
> > --
> > What this patch tries to fix is this recursive locks
> > --
> > cgroup_lock()
> > -> cpuset_attach()
> > -> cpuset_migrate_mm()
> > -> charge to migration
> > -> go to reclaim and meet cgroup_lock.
> > --
> >
> >
> > Right ?
> >
> Yes.
> Thank you for explaining in detail.
>
Sorry, I don't understand the context; I am unable to figure out
1. How to reproduce the problem that Daisuke-san reported
2. Whether the patch is correct, causes more problems, or needs more
work to completely fix the race.
>
> Daisuke Nishimura.
>
> > BTW, releasing cgroup_lock() while attach() is going on is finally safe ?
> > If not, can this lock for attach be replaced with (new) cgroup private mutex ?
> >
> > a new mutex like this ?
> > --
> > struct cgroup {
> > .....
> > mutex_t attach_mutex; /* for serializing attach() ops.
> > while attach() is going on, rmdir() will fail */
> > }
> > --
> > Do we need the big lock of cgroup_lock for attach(), at last ?
> >
> > -Kame
> >
--
Balbir
* KAMEZAWA Hiroyuki <[email protected]> [2008-12-10 20:32:03]:
> Balbir Singh said:
> > * KAMEZAWA Hiroyuki <[email protected]> [2008-12-10
> > 17:49:06]:
> >
> >> Hmm...can't this happen ?
> >>
> >> Assume there is a task X and cgroups Z1 and Z2. Z1 and Z2 don't need to
> >> be in the same hierarchy.
> >> ==
> >> CPU A attach task X to cgroup Z1
> >> cgroup_lock()
> >> for_each_subsys_state()
> >
> > You mean for_each_subsys() right?
> >
> >> => attach(X,Z)
> >> => migrate_mm()
> >> => cgroup_unlock()
> >> migration
> >>
> >> CPU B attach task X to cgroup Z2 at the same time
> >> cgroup_lock()
> >> replace css_set.
> >> ==
> >>
> >> Can't the work on CPU B break for_each_subsys_state() on CPU A ?
> >>
> >
> > for_each_subsys() is hierarchy-aware, so if we try to add the same task
> > to different hierarchies, it should not be a problem, right?
> >
> Ah, maybe. But what happens when Z1 and Z2 are in the same hierarchy ?
> Are there some locks ?
>
If they are in the same hierarchy, tsk->cgroups and tsk->cg_list are
updated atomically, and for_each_subsys() should not be affected.
Needs more thought and coffee, though.
--
Balbir
On Wed, 10 Dec 2008 18:36:07 +0530
Balbir Singh <[email protected]> wrote:
> Sorry, I don't understand the context, I am unable to figure out
>
> 1. How to reproduce the problem that Daisuke-San reported
Ah.. sorry.
1) mount memory cgroup and cpuset.
(I mount them on different mount points, but I think this can also happen
even when mounting on the same hierarchy.)
2) make directories
2-1) memory
- make a directory(/cgroup/memory/01)
- set memory.limit_in_bytes(no need to set memsw.limit_in_bytes).
- enable hierarchy(no need to make a child).
2-2) cpuset
- make 2(at least) directories(/cgroup/cpuset/01,02)
- set different "mems".
- set memory_migrate on.
3) attach shell to /cgroup/*/01
4) run some programs enough to cause swap out/in
5) trigger page migration by cpuset between 01 and 02 repeatedly.
I think Documentation/controllers/memcg_test.txt would help.
feel free to ask me if you need additional information.
> 2. Whether the patch is correct or causing more problems or needs more
> stuff to completely fix the race.
>
I need to think more before saying whether it's all right to release cgroup_mutex
under attach_task, but at least some more work is needed.
Thanks,
Daisuke Nishimura.
* Daisuke Nishimura <[email protected]> [2008-12-10 23:08:24]:
> On Wed, 10 Dec 2008 18:36:07 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Sorry, I don't understand the context, I am unable to figure out
> >
> > 1. How to reproduce the problem that Daisuke-San reported
> Ah.. sorry.
>
> 1) mount memory cgroup and cpuset.
> (I mount them on different mount points, but I think this can also happen
> even when mounting on the same hierarchy.)
> 2) make directories
> 2-1) memory
> - make a directory(/cgroup/memory/01)
> - set memory.limit_in_bytes(no need to set memsw.limit_in_bytes).
> - enable hierarchy(no need to make a child).
> 2-2) cpuset
> - make 2(at least) directories(/cgroup/cpuset/01,02)
> - set different "mems".
> - set memory_migrate on.
> 3) attach shell to /cgroup/*/01
> 4) run some programs enough to cause swap out/in
> 5) trigger page migration by cpuset between 01 and 02 repeatedly.
> I think Documentation/controllers/memcg_test.txt would help.
>
Yes, I've seen that. Thanks, I'll follow it.
> feel free to ask me if you need additional information.
>
> > 2. Whether the patch is correct or causing more problems or needs more
> > stuff to completely fix the race.
> >
> I should consider more to tell whether it's all right to release cgroup_mutex
> under attach_task, but some more stuff is needed at least.
>
>
> Thanks,
> Daisuke Nishimura.
>
--
Balbir