2016-04-13 09:42:23

by Petr Mladek

Subject: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hi,

Cyril reported a system lock up when running
memcg_move_charge_at_immigrate_test.sh test[*] repeatedly.

I have reproduced it also with the plain 4.6-rc3.

There seems to be a deadlock involving 4 processes. It makes
the system unable to fork any new processes. I had to use the alt-sysrq
commands to get the debugging information.

Please, find the entire log at http://pastebin.com/pL9eKsFb

Interesting is the lockdep output:

[ 409.669856] Showing all locks held in the system:
[ 409.669856] 2 locks held by systemd/1:
[ 409.669856] #0: (&p->lock){+.+.+.}, at: [<ffffffff8124e54d>] seq_read+0x3d/0x3a0
[ 409.669856] #1: (cgroup_mutex){+.+.+.}, at: [<ffffffff8112379e>] proc_cgroup_show+0x4e/0x300
[ 409.669856] 1 lock held by kthreadd/2:
[ 409.669856] #0: (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff8106b235>] copy_process.part.30+0x555/0x1d30
[ 409.669856] 3 locks held by kworker/0:2/114:
[ 409.669856] #0: ("cgroup_destroy"){.+.+..}, at: [<ffffffff8108c971>] process_one_work+0x151/0x6b0
[ 409.669856] #1: ((&css->destroy_work)){+.+...}, at: [<ffffffff8108c971>] process_one_work+0x151/0x6b0
[ 409.669856] #2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8111a80d>] css_release_work_fn+0x2d/0xf0
[ 409.669856] 1 lock held by kworker/0:3/588:
[ 409.669856] #0: (&pool->manager_arb){+.+.+.}, at: [<ffffffff8108d1bc>] worker_thread+0x2ec/0x490
[ 409.669856] 1 lock held by in:imklog/816:
[ 409.669856] #0: (&f->f_pos_lock){+.+.+.}, at: [<ffffffff81248013>] __fdget_pos+0x43/0x50
[ 409.669856] 6 locks held by memcg_move_char/2860:
[ 409.669856] #0: (sb_writers#6){.+.+.+}, at: [<ffffffff812299e1>] __sb_start_write+0xd1/0xf0
[ 409.669856] #1: (&of->mutex){+.+.+.}, at: [<ffffffff812ae4d6>] kernfs_fop_write+0x66/0x190
[ 409.669856] #2: (cgroup_mutex){+.+.+.}, at: [<ffffffff8111ef7c>] cgroup_kn_lock_live+0x5c/0x1f0
[ 409.669856] #3: (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff810c6a64>] percpu_down_write+0x24/0xd0
[ 409.669856] #4: (lock#4){+.+...}, at: [<ffffffff811aa1d4>] lru_add_drain_all+0x34/0x1a0
[ 409.669856] #5: (cpu_hotplug.lock){++++++}, at: [<ffffffff8106e097>] get_online_cpus+0x17/0x70


The problematic processes seem to be:

1. memcg_move_char/2860:
========================

It is the process from the test suite. It is waiting in lru_add_drain_all()
when calling flush_work() on CPU0[**]. Note that it holds many locks
including 'cgroup_mutex' and 'cgroup_threadgroup_rwsem'.
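
For reference, lru_add_drain_all() in this kernel has roughly the following
shape (a simplified sketch of mm/swap.c; the per-cpu pagevec checks are
collapsed into the placeholder lru_pvecs_pending()). The final flush_work()
is where "memcg_move_char/2860" sits in the backtrace below:

void lru_add_drain_all(void)
{
	static DEFINE_MUTEX(lock);
	static struct cpumask has_work;
	int cpu;

	mutex_lock(&lock);
	get_online_cpus();
	cpumask_clear(&has_work);

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* lru_pvecs_pending() stands in for the real checks of the
		 * per-cpu pagevecs and need_activate_page_drain() */
		if (lru_pvecs_pending(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			/* queued on system_wq, which has no rescuer */
			schedule_work_on(cpu, work);
			cpumask_set_cpu(cpu, &has_work);
		}
	}

	/* blocks until each queued work item has run on its CPU */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));

	put_online_cpus();
	mutex_unlock(&lock);
}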

[ 409.669856] memcg_move_char D ffff8800ab82b978 0 2860 1438 0x00000000
[ 409.669856] ffff8800ab82b978 ffffffff8244c908 ffff880036d44800 ffff880036a0d1c0
[ 409.669856] ffff8800ab82c000 ffff8800ab82bb00 ffff8800ab82baf8 ffff880036a0d1c0
[ 409.669856] 0000000000000004 ffff8800ab82b990 ffffffff8195c77c 7fffffffffffffff
[ 409.669856] Call Trace:
[ 409.669856] [<ffffffff8195c77c>] schedule+0x3c/0x90
[ 409.669856] [<ffffffff81960e74>] schedule_timeout+0x294/0x480
[ 409.669856] [<ffffffff810c9fd6>] ? mark_held_locks+0x66/0x90
[ 409.669856] [<ffffffff81961c1c>] ? _raw_spin_unlock_irq+0x2c/0x50
[ 409.669856] [<ffffffff810ca0f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [<ffffffff8195d9d6>] wait_for_completion+0x96/0x100
[ 409.669856] [<ffffffff810a1a30>] ? wake_up_q+0x80/0x80
[ 409.669856] [<ffffffff8108b7ca>] flush_work+0x1ca/0x270
[ 409.669856] [<ffffffff8108b82e>] ? flush_work+0x22e/0x270
[ 409.669856] [<ffffffff81088f20>] ? destroy_worker+0xd0/0xd0
[ 409.669856] [<ffffffff811aa2fb>] lru_add_drain_all+0x15b/0x1a0
[ 409.669856] [<ffffffff81211c8b>] mem_cgroup_move_task+0x5b/0xe0
[ 409.669856] [<ffffffff81214640>] ? mem_cgroup_move_account+0x290/0x290
[ 409.669856] [<ffffffff8111fa9e>] cgroup_taskset_migrate+0x1ee/0x360
[ 409.669856] [<ffffffff8111fd05>] cgroup_migrate+0xf5/0x190
[ 409.669856] [<ffffffff8111fc15>] ? cgroup_migrate+0x5/0x190
[ 409.669856] [<ffffffff8111ffa5>] cgroup_attach_task+0x205/0x230
[ 409.669856] [<ffffffff8111fde2>] ? cgroup_attach_task+0x42/0x230
[ 409.669856] [<ffffffff811203dd>] __cgroup_procs_write.isra.30+0x2bd/0x470
[ 409.669856] [<ffffffff8112017f>] ? __cgroup_procs_write.isra.30+0x5f/0x470
[ 409.669856] [<ffffffff811205a0>] cgroup_tasks_write+0x10/0x20
[ 409.669856] [<ffffffff8111bf6e>] cgroup_file_write+0x3e/0x1b0
[ 409.669856] [<ffffffff812ae5b4>] kernfs_fop_write+0x144/0x190
[ 409.669856] [<ffffffff81225838>] __vfs_write+0x28/0xe0
[ 409.669856] [<ffffffff810c69a4>] ? percpu_down_read+0x44/0x80
[ 409.669856] [<ffffffff812299e1>] ? __sb_start_write+0xd1/0xf0
[ 409.669856] [<ffffffff812299e1>] ? __sb_start_write+0xd1/0xf0
[ 409.669856] [<ffffffff81226742>] vfs_write+0xa2/0x1a0
[ 409.669856] [<ffffffff81247296>] ? __fget_light+0x66/0x90
[ 409.669856] [<ffffffff81227ab9>] SyS_write+0x49/0xa0
[ 409.669856] [<ffffffff819623fc>] entry_SYSCALL_64_fastpath+0x1f/0xbd


2. kworker/0:2/114:
===================

It is a worker on CPU0 that is blocked in css_release_work_fn().
It is unable to get 'cgroup_mutex' that is already owned by
'memcg_move_char/2860'.


[ 409.669856] kworker/0:2 D ffff880036d4bcd8 0 114 2 0x00000000
[ 409.669856] Workqueue: cgroup_destroy css_release_work_fn
[ 409.669856] ffff880036d4bcd8 00000000ffffffff ffff8800acd40100 ffff880036d44800
[ 409.669856] ffff880036d4c000 ffffffff81e94280 0000000000000246 ffff880036d44800
[ 409.669856] 00000000ffffffff ffff880036d4bcf0 ffffffff8195c77c ffffffff81e94288
[ 409.669856] Call Trace:
[ 409.669856] [<ffffffff8195c77c>] schedule+0x3c/0x90
[ 409.669856] [<ffffffff8195cb35>] schedule_preempt_disabled+0x15/0x20
[ 409.669856] [<ffffffff8195ec49>] mutex_lock_nested+0x169/0x3f0
[ 409.669856] [<ffffffff8111a80d>] ? css_release_work_fn+0x2d/0xf0
[ 409.669856] [<ffffffff8111a80d>] css_release_work_fn+0x2d/0xf0
[ 409.669856] [<ffffffff8108c9ec>] process_one_work+0x1cc/0x6b0
[ 409.669856] [<ffffffff8108c971>] ? process_one_work+0x151/0x6b0
[ 409.669856] [<ffffffff8108cffb>] worker_thread+0x12b/0x490
[ 409.669856] [<ffffffff8108ced0>] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [<ffffffff81093dd4>] kthread+0xe4/0x100
[ 409.669856] [<ffffffff81962632>] ret_from_fork+0x22/0x50
[ 409.669856] [<ffffffff81093cf0>] ? kthread_create_on_node+0x200/0x200


3. kworker/0:3/588:
===================

It tries to create another worker on CPU0 to process the pending
work items. It waits until the kthread is created by kthreadd.
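
For context, kthread creation is asynchronous: the caller queues a request
for kthreadd and then sleeps until kthreadd has done the fork. A rough
sketch of the hand-off (simplified from kernel/kthread.c, error handling
dropped):

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
					   void *data, int node,
					   const char namefmt[], ...)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct kthread_create_info *create = kmalloc(sizeof(*create),
						     GFP_KERNEL);

	create->threadfn = threadfn;
	create->data = data;
	create->node = node;
	create->done = &done;

	/* hand the request over to kthreadd ... */
	spin_lock(&kthread_create_lock);
	list_add_tail(&create->list, &kthread_create_list);
	spin_unlock(&kthread_create_lock);
	wake_up_process(kthreadd_task);

	/* ... and sleep until kthreadd has forked the new thread; this is
	 * the wait_for_completion_killable() in the backtrace below */
	wait_for_completion_killable(&done);
	return create->result;
}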


[ 409.669856] kworker/0:3 D ffff8800ad8ebb68 0 588 2 0x00000000
[ 409.669856] ffff8800ad8ebb68 ffffffff82435ae8 ffff88012a648040 ffff8800acd40100
[ 409.669856] ffff8800ad8ec000 ffff8800ad8ebcc8 ffff8800ad8ebcc0 ffff8800acd40100
[ 409.669856] ffff8800acd40100 ffff8800ad8ebb80 ffffffff8195c77c 7fffffffffffffff
[ 409.669856] Call Trace:
[ 409.669856] [<ffffffff8195c77c>] schedule+0x3c/0x90
[ 409.669856] [<ffffffff81960e74>] schedule_timeout+0x294/0x480
[ 409.669856] [<ffffffff810c9fd6>] ? mark_held_locks+0x66/0x90
[ 409.669856] [<ffffffff81961c1c>] ? _raw_spin_unlock_irq+0x2c/0x50
[ 409.669856] [<ffffffff810ca0f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [<ffffffff8195d72b>] wait_for_completion_killable+0xab/0x170
[ 409.669856] [<ffffffff810a1a30>] ? wake_up_q+0x80/0x80
[ 409.669856] [<ffffffff8108ced0>] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [<ffffffff81093c4a>] kthread_create_on_node+0x15a/0x200
[ 409.669856] [<ffffffff8121d6d8>] ? create_object+0x238/0x2e0
[ 409.669856] [<ffffffff814505a9>] ? snprintf+0x39/0x40
[ 409.669856] [<ffffffff8108b9db>] create_worker+0xbb/0x190
[ 409.669856] [<ffffffff8108d205>] worker_thread+0x335/0x490
[ 409.669856] [<ffffffff8108ced0>] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [<ffffffff81093dd4>] kthread+0xe4/0x100
[ 409.669856] [<ffffffff81962632>] ret_from_fork+0x22/0x50
[ 409.669856] [<ffffffff81093cf0>] ? kthread_create_on_node+0x200/0x200


4. kthreadd/2:
==============

It is a kthread that is responsible for forking other kthreads.
It is blocked in copy_process() when calling threadgroup_change_begin()
because 'cgroup_threadgroup_rwsem' is held by 'memcg_move_char/2860'.
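
The fork-side read lock comes from threadgroup_change_begin(); roughly
(a simplified sketch of the helpers in this kernel):

/* include/linux/sched.h, simplified */
static inline void threadgroup_change_begin(struct task_struct *tsk)
{
	might_sleep();
	cgroup_threadgroup_change_begin(tsk);
}

/* include/linux/cgroup-defs.h, simplified */
static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
{
	/* blocks while a writer (here: the cgroup migration path, which
	 * uses percpu_down_write()) holds the rwsem */
	percpu_down_read(&cgroup_threadgroup_rwsem);
}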

[ 409.669856] kthreadd D ffff88012a64fc20 0 2 0 0x00000000
[ 409.669856] ffff88012a64fc20 ffffffff830aedd0 ffffffff81e0b540 ffff88012a648040
[ 409.669856] ffff88012a650000 ffffffff830aedb8 ffffffff830aedd0 ffff88012a648040
[ 409.669856] 0000000000000000 ffff88012a64fc38 ffffffff8195c77c ffff88012a648040
[ 409.669856] Call Trace:
[ 409.669856] [<ffffffff8195c77c>] schedule+0x3c/0x90
[ 409.669856] [<ffffffff8196070b>] rwsem_down_read_failed+0xcb/0x120
[ 409.669856] [<ffffffff81452e38>] call_rwsem_down_read_failed+0x18/0x30
[ 409.669856] [<ffffffff810c69be>] percpu_down_read+0x5e/0x80
[ 409.669856] [<ffffffff8106b235>] ? copy_process.part.30+0x555/0x1d30
[ 409.669856] [<ffffffff8106b235>] copy_process.part.30+0x555/0x1d30
[ 409.669856] [<ffffffff810cafa0>] ? __lock_acquire+0xab0/0x1ad0
[ 409.669856] [<ffffffff81093cf0>] ? kthread_create_on_node+0x200/0x200
[ 409.669856] [<ffffffff810ca0f9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [<ffffffff8106cbdc>] _do_fork+0xdc/0x6c0
[ 409.669856] [<ffffffff81094aa5>] ? kthreadd+0x2c5/0x380
[ 409.669856] [<ffffffff8106d1e9>] kernel_thread+0x29/0x30
[ 409.669856] [<ffffffff81094b01>] kthreadd+0x321/0x380
[ 409.669856] [<ffffffff81962632>] ? ret_from_fork+0x22/0x50
[ 409.669856] [<ffffffff81962632>] ret_from_fork+0x22/0x50
[ 409.669856] [<ffffffff810947e0>] ? kthread_create_on_cpu+0x60/0x60


In other words, "memcg_move_char/2860" flushes a work item. But the work
item cannot get flushed because one worker is blocked and another one
cannot get created. All these operations are blocked by the very same
"memcg_move_char/2860".

Note that "systemd/1" is also waiting for "cgroup_mutex" in
proc_cgroup_show(). But it seems that it is not part of the main
cycle causing the deadlock.

I am able to reproduce this problem quite easily (within a few minutes).
There are often even more tasks waiting for the cgroups-related locks,
but they are not causing the deadlock.


The question is how to solve this problem. I see several possibilities:

+ avoid using workqueues in lru_add_drain_all()

+ make lru_add_drain_all() killable and restartable

+ do not block fork() when lru_add_drain_all() is running,
e.g. using some lazy techniques like RCU, workqueues

+ at least do not block forking of workers; AFAIK, they have limited
cgroups usage anyway because they are marked with PF_NO_SETAFFINITY


I am willing to test any potential fix or even work on the fix.
But I do not have deep enough insight into the problem, so I would
need some pointers.


[*] It comes from the LTP test suite. The relevant sources might
be found at
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_move_charge_at_immigrate_test.sh
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_process.c

[**] Some of the details are not visible in the provided log. We first
reproduced this problem with an internal kernel based on upstream 4.4,
and I checked more details using a crash dump. Then I reproduced the
same deadlock with plain 4.6-rc3 and used the logs from that kernel
in this mail.


Best Regards,
Petr


2016-04-13 18:33:14

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hello, Petr.

(cc'ing Johannes)

On Wed, Apr 13, 2016 at 11:42:16AM +0200, Petr Mladek wrote:
...
> By other words, "memcg_move_char/2860" flushes a work. But it cannot
> get flushed because one worker is blocked and another one could not
> get created. All these operations are blocked by the very same
> "memcg_move_char/2860".
>
> Note that also "systemd/1" is waiting for "cgroup_mutex" in
> proc_cgroup_show(). But it seems that it is not in the main
> cycle causing the deadlock.
>
> I am able to reproduce this problem quite easily (within few minutes).
> There are often even more tasks waiting for the cgroups-related locks
> but they are not causing the deadlock.
>
>
> The question is how to solve this problem. I see several possibilities:
>
> + avoid using workqueues in lru_add_drain_all()
>
> + make lru_add_drain_all() killable and restartable
>
> + do not block fork() when lru_add_drain_all() is running,
> e.g. using some lazy techniques like RCU, workqueues
>
> + at least do not block fork of workers; AFAIK, they have a limited
> cgroups usage anyway because they are marked with PF_NO_SETAFFINITY
>
>
> I am willing to test any potential fix or even work on the fix.
> But I do not have that big insight into the problem, so I would
> need some pointers.

An easy solution would be to make lru_add_drain_all() use a
WQ_MEM_RECLAIM workqueue. A better way would be making charge moving
asynchronous similar to cpuset node migration but I don't know whether
that's realistic. Will prep a patch to add a rescuer to
lru_add_drain_all().

Thanks.

--
tejun

2016-04-13 18:57:59

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed, Apr 13, 2016 at 02:33:09PM -0400, Tejun Heo wrote:
> An easy solution would be to make lru_add_drain_all() use a
> WQ_MEM_RECLAIM workqueue. A better way would be making charge moving
> asynchronous similar to cpuset node migration but I don't know whether
> that's realistic. Will prep a patch to add a rescuer to
> lru_add_drain_all().

So, something like the following. Can you please see whether the
deadlock goes away with the patch?

diff --git a/mm/swap.c b/mm/swap.c
index a0bc206..7022872 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -664,8 +664,16 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy)
lru_add_drain();
}

+static struct workqueue_struct *lru_add_drain_wq;
static DEFINE_PER_CPU(struct work_struct, lru_add_drain_work);

+static int __init lru_add_drain_wq_init(void)
+{
+ lru_add_drain_wq = alloc_workqueue("lru_add_drain", WQ_MEM_RECLAIM, 0);
+ return lru_add_drain_wq ? 0 : -ENOMEM;
+}
+core_initcall(lru_add_drain_wq_init);
+
void lru_add_drain_all(void)
{
static DEFINE_MUTEX(lock);
@@ -685,13 +693,12 @@ void lru_add_drain_all(void)
pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
- schedule_work_on(cpu, work);
+ queue_work_on(cpu, lru_add_drain_wq, work);
cpumask_set_cpu(cpu, &has_work);
}
}

- for_each_cpu(cpu, &has_work)
- flush_work(&per_cpu(lru_add_drain_work, cpu));
+ flush_workqueue(lru_add_drain_wq);

put_online_cpus();
mutex_unlock(&lock);

2016-04-13 19:23:18

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed 13-04-16 14:33:09, Tejun Heo wrote:
> Hello, Petr.
>
> (cc'ing Johannes)
>
> On Wed, Apr 13, 2016 at 11:42:16AM +0200, Petr Mladek wrote:
> ...
> > By other words, "memcg_move_char/2860" flushes a work. But it cannot
> > get flushed because one worker is blocked and another one could not
> > get created. All these operations are blocked by the very same
> > "memcg_move_char/2860".
> >
> > Note that also "systemd/1" is waiting for "cgroup_mutex" in
> > proc_cgroup_show(). But it seems that it is not in the main
> > cycle causing the deadlock.
> >
> > I am able to reproduce this problem quite easily (within few minutes).
> > There are often even more tasks waiting for the cgroups-related locks
> > but they are not causing the deadlock.
> >
> >
> > The question is how to solve this problem. I see several possibilities:
> >
> > + avoid using workqueues in lru_add_drain_all()
> >
> > + make lru_add_drain_all() killable and restartable
> >
> > + do not block fork() when lru_add_drain_all() is running,
> > e.g. using some lazy techniques like RCU, workqueues
> >
> > + at least do not block fork of workers; AFAIK, they have a limited
> > cgroups usage anyway because they are marked with PF_NO_SETAFFINITY
> >
> >
> > I am willing to test any potential fix or even work on the fix.
> > But I do not have that big insight into the problem, so I would
> > need some pointers.
>
> An easy solution would be to make lru_add_drain_all() use a
> WQ_MEM_RECLAIM workqueue.

I think we can live without lru_add_drain_all() in the migration path.
We are talking about 4 pagevecs so 56 pages. The charge migration is
racy anyway. What concerns me more is how all this is fragile. It sounds
just too easy to add a dependency on per-cpu sync work later and
reintroduce this issue which is quite hard to detect.

Cannot we come up with something more robust? Or at least warn when we
try to use per-cpu workers with problematic locks held?

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-13 19:28:22

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed 13-04-16 21:23:13, Michal Hocko wrote:
> On Wed 13-04-16 14:33:09, Tejun Heo wrote:
> > Hello, Petr.
> >
> > (cc'ing Johannes)
> >
> > On Wed, Apr 13, 2016 at 11:42:16AM +0200, Petr Mladek wrote:
> > ...
> > > By other words, "memcg_move_char/2860" flushes a work. But it cannot
> > > get flushed because one worker is blocked and another one could not
> > > get created. All these operations are blocked by the very same
> > > "memcg_move_char/2860".
> > >
> > > Note that also "systemd/1" is waiting for "cgroup_mutex" in
> > > proc_cgroup_show(). But it seems that it is not in the main
> > > cycle causing the deadlock.
> > >
> > > I am able to reproduce this problem quite easily (within few minutes).
> > > There are often even more tasks waiting for the cgroups-related locks
> > > but they are not causing the deadlock.
> > >
> > >
> > > The question is how to solve this problem. I see several possibilities:
> > >
> > > + avoid using workqueues in lru_add_drain_all()
> > >
> > > + make lru_add_drain_all() killable and restartable
> > >
> > > + do not block fork() when lru_add_drain_all() is running,
> > > e.g. using some lazy techniques like RCU, workqueues
> > >
> > > + at least do not block fork of workers; AFAIK, they have a limited
> > > cgroups usage anyway because they are marked with PF_NO_SETAFFINITY
> > >
> > >
> > > I am willing to test any potential fix or even work on the fix.
> > > But I do not have that big insight into the problem, so I would
> > > need some pointers.
> >
> > An easy solution would be to make lru_add_drain_all() use a
> > WQ_MEM_RECLAIM workqueue.
>
> I think we can live without lru_add_drain_all() in the migration path.
> We are talking about 4 pagevecs so 56 pages. The charge migration is

wanted to say 56 * num_cpus of course.

> racy anyway. What concerns me more is how all this is fragile. It sounds
> just too easy to add a dependency on per-cpu sync work later and
> reintroduce this issue which is quite hard to detect.
>
> Cannot we come up with something more robust? Or at least warn when we
> try to use per-cpu workers with problematic locks held?
>
> Thanks!
--
Michal Hocko
SUSE Labs

2016-04-13 19:48:14

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hello, Michal.

On Wed, Apr 13, 2016 at 09:23:14PM +0200, Michal Hocko wrote:
> I think we can live without lru_add_drain_all() in the migration path.
> We are talking about 4 pagevecs so 56 pages. The charge migration is

Ah, nice.

> racy anyway. What concerns me more is how all this is fragile. It sounds
> just too easy to add a dependency on per-cpu sync work later and
> reintroduce this issue which is quite hard to detect.
> Cannot we come up with something more robust? Or at least warn when we
> try to use per-cpu workers with problematic locks held?

Yeah, workqueue does limited lockdep annotation but it doesn't
integrate fully with the rest of dependency tracking. It'd be nice to
have that. Don't know how to tho.
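
For reference, the existing annotation is essentially a fake acquire/release
of the work item's lockdep map in flush_work() (and around the callback in
process_one_work()). A simplified sketch from kernel/workqueue.c; it can
catch "flushing a work item while holding a lock that the work item also
takes", but it knows nothing about the worker-creation dependency that
bites here:

bool flush_work(struct work_struct *work)
{
	struct wq_barrier barr;

	/* fake lock ops: lockdep records that the flusher waits on @work */
	lock_map_acquire(&work->lockdep_map);
	lock_map_release(&work->lockdep_map);

	if (start_flush_work(work, &barr)) {
		wait_for_completion(&barr.done);
		destroy_work_on_stack(&barr.work);
		return true;
	}
	return false;
}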

Thanks.

--
tejun

2016-04-13 19:48:29

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed 13-04-16 15:37:34, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Apr 13, 2016 at 09:23:14PM +0200, Michal Hocko wrote:
> > I think we can live without lru_add_drain_all() in the migration path.
> > We are talking about 4 pagevecs so 56 pages. The charge migration is
>
> Ah, nice.
>
> > racy anyway. What concerns me more is how all this is fragile. It sounds
> > just too easy to add a dependency on per-cpu sync work later and
> > reintroduce this issue which is quite hard to detect.
> > Cannot we come up with something more robust? Or at least warn when we
> > try to use per-cpu workers with problematic locks held?
>
> Yeah, workqueue does limited lockdep annotation but it doesn't
> integrate fully with the rest of dependency tracking. It'd be nice to
> have that. Don't know how to tho.

I was thinking about something like flush_per_cpu_work() which would
assert on cgroup_threadgroup_rwsem being held for write. But this still sounds
suboptimal to me. Do we really have to do
cgroup_threadgroup_change_begin for kworker threads? They are
PF_NO_SETAFFINITY so they cannot be moved AFAIR. Or am I missing
something?
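
Purely as an illustration of the idea (flush_per_cpu_work() and the
held-for-write predicate below are hypothetical, not existing kernel
interfaces):

/*
 * Hypothetical sketch: complain when a work item is flushed while
 * cgroup_threadgroup_rwsem is held for write, because completing the
 * flush may require forking a new kworker, and fork blocks on that
 * rwsem.  cgroup_threadgroup_rwsem_write_held() is made up here;
 * percpu_rw_semaphore does not currently export such a query.
 */
static inline void flush_per_cpu_work(struct work_struct *work)
{
	WARN_ON_ONCE(cgroup_threadgroup_rwsem_write_held());
	flush_work(work);
}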

--
Michal Hocko
SUSE Labs

2016-04-14 07:06:26

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed 13-04-16 21:48:20, Michal Hocko wrote:
[...]
> I was thinking about something like flush_per_cpu_work() which would
> assert on group_threadgroup_rwsem held for write.

I have thought about this some more and I guess this is not limited to
per-cpu workers. Basically any flush_work() with cgroup_threadgroup_rwsem
held for write is dangerous, right?
--
Michal Hocko
SUSE Labs

2016-04-14 15:32:33

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hello,

On Thu, Apr 14, 2016 at 09:06:23AM +0200, Michal Hocko wrote:
> On Wed 13-04-16 21:48:20, Michal Hocko wrote:
> [...]
> > I was thinking about something like flush_per_cpu_work() which would
> > assert on group_threadgroup_rwsem held for write.
>
> I have thought about this some more and I guess this is not limitted to
> per cpu workers. Basically any flush_work with group_threadgroup_rwsem
> held for write is dangerous, right?

Whether per-cpu or not doesn't matter. What matters is whether the
workqueue has WQ_MEM_RECLAIM or not. That said, I think what we want
to do is avoid performing heavy operations in the migration path. It's
where the core and all controllers have to synchronize, so performing
operations with many external dependencies is bound to get messy. I
wonder whether memory charge moving can be restructured in a similar
fashion to how cpuset node migration is made async. However, given
that charge moving has always been a best effort thing, for now, I
think it'd be best to drop lru_add_drain.

Thanks.

--
tejun

2016-04-14 17:51:09

by Johannes Weiner

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Wed, Apr 13, 2016 at 09:23:14PM +0200, Michal Hocko wrote:
> I think we can live without lru_add_drain_all() in the migration path.

Agreed. Michal, would you care to send a patch to remove it?

2016-04-15 07:06:07

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Thu 14-04-16 13:50:55, Johannes Weiner wrote:
> On Wed, Apr 13, 2016 at 09:23:14PM +0200, Michal Hocko wrote:
> > I think we can live without lru_add_drain_all() in the migration path.
>
> Agreed. Michal, would you care to send a patch to remove it?

Now that I am looking closer I am not sure this would help though.
mem_cgroup_move_charge needs to take mmap_sem for read and keeps looping
until it gets it. What if the mmap_sem holder for write depends on the
workqueue code in the same way lru_add_drain_all() does? I mean all of this
is really fragile.
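
The loop in question looks roughly like this (a simplified sketch of
mem_cgroup_move_charge() in mm/memcontrol.c; the actual page table walk
is omitted):

static void mem_cgroup_move_charge(struct mm_struct *mm)
{
	lru_add_drain_all();	/* the call under discussion */
retry:
	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
		/*
		 * The mmap_sem holder may be waiting on us (e.g. in the
		 * move_charge waitq), so drop our state and retry instead
		 * of blocking on mmap_sem directly.
		 */
		__mem_cgroup_clear_mc();
		cond_resched();
		goto retry;
	}
	/* ... walk the page tables and move the charges ... */
	up_read(&mm->mmap_sem);
}
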
Tejun was proposing to do the migration async (move the whole
mem_cgroup_move_charge into the work item). This would solve the problem
of course. I haven't checked whether this would be safe but it at least
sounds doable (albeit far from trivial). It would also be a user visible
change because the new memcg will not contain the moved charges after we
return to user space. I think this would be acceptable but if somebody
really relies on the previous behavior I guess we can solve it with a
post_move cgroup callback which would be called from a lockless context.

Anyway, before we go that way, can we at least consider the possibility
of removing the kworker creation dependency on the global rwsem? AFAIU
this locking was added because of the pid controller. Do we even care
about something as volatile as kworkers in the pid controller?

Anyway one way or another I will be travelling until next Friday and
will have only limited time to look into this.
--
Michal Hocko
SUSE Labs

2016-04-15 14:38:20

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hello, Michal.

On Fri, Apr 15, 2016 at 09:06:01AM +0200, Michal Hocko wrote:
> Tejun was proposing to do the migration async (move the whole
> mem_cgroup_move_charge into the work item). This would solve the problem
> of course. I haven't checked whether this would be safe but it at least
> sounds doable (albeit far from trivial). It would also be a user visible
> change because the new memcg will not contain the moved charges after we
> return to user space. I think this would be acceptable but if somebody

Not necessarily. The only thing necessary is flushing the work item
after releasing locks but before returning to user.
cpuset_post_attach_flush() does exactly the same thing.
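
For reference, the cpuset precedent looks roughly like this (simplified
from kernel/cpuset.c and kernel/cgroup.c in this kernel): the mm migration
is queued on a dedicated workqueue from the attach path and only flushed
after both cgroup_threadgroup_rwsem and cgroup_mutex have been dropped:

/* kernel/cpuset.c, simplified */
void cpuset_post_attach_flush(void)
{
	flush_workqueue(cpuset_migrate_mm_wq);
}

/* tail of __cgroup_procs_write() in kernel/cgroup.c, simplified */
	percpu_up_write(&cgroup_threadgroup_rwsem);
	cgroup_kn_unlock(of->kn);
	cpuset_post_attach_flush();
	return ret ?: nbytes;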

> really relies on the previous behavior I guess we can solve it with a
> post_move cgroup callback which would be called from a lockless context.
>
> Anyway, before we go that way, can we at least consider the possibility
> of removing the kworker creation dependency on the global rwsem? AFAIU
> this locking was added because of the pid controller. Do we even care
> about something as volatile as kworkers in the pid controller?

It's not just pid controller and the global percpu locking has lower
hotpath overhead. We can try to exclude kworkers out of the locking
but that can get really nasty and there are already attempts to add
cgroup support to workqueue. Will think more about it. For now tho,
do you think making charge moving async would be difficult?

Thanks.

--
tejun

2016-04-15 15:08:20

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Fri 15-04-16 10:38:15, Tejun Heo wrote:
> Hello, Michal.
>
> On Fri, Apr 15, 2016 at 09:06:01AM +0200, Michal Hocko wrote:
> > Tejun was proposing to do the migration async (move the whole
> > mem_cgroup_move_charge into the work item). This would solve the problem
> > of course. I haven't checked whether this would be safe but it at least
> > sounds doable (albeit far from trivial). It would also be a user visible
> > change because the new memcg will not contain the moved charges after we
> > return to user space. I think this would be acceptable but if somebody
>
> Not necessarily. The only thing necessary is flushing the work item
> after releasing locks but before returning to user.
> cpuset_post_attach_flush() does exactly the same thing.

Ahh, ok, I didn't know that __cgroup_procs_write() is doing something
controller-specific. Yes, then we wouldn't need a generic callback if
code like the above is acceptable.

> > really relies on the previous behavior I guess we can solve it with a
> > post_move cgroup callback which would be called from a lockless context.
> >
> > Anyway, before we go that way, can we at least consider the possibility
> > of removing the kworker creation dependency on the global rwsem? AFAIU
> > this locking was added because of the pid controller. Do we even care
> > about something as volatile as kworkers in the pid controller?
>
> It's not just pid controller and the global percpu locking has lower

Where else would the locking matter? I have only checked the git history
to build my picture, so I might be missing something, of course.

> hotpath overhead. We can try to exclude kworkers out of the locking
> but that can get really nasty and there are already attempts to add
> cgroup support to workqueue. Will think more about it. For now tho,
> do you think making charge moving async would be difficult?

Well it certainly is not that trivial because it relies on being
exclusive with global context. I will have to look closer of course but
I cannot guarantee I will get to it before I get back from LSF. We can
certainly discuss that at the conference. Johannes will be there as
well.

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-15 15:25:36

by Tejun Heo

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

Hello,

On Fri, Apr 15, 2016 at 05:08:15PM +0200, Michal Hocko wrote:
> On Fri 15-04-16 10:38:15, Tejun Heo wrote:
> > Not necessarily. The only thing necessary is flushing the work item
> > after releasing locks but before returning to user.
> > cpuset_post_attach_flush() does exactly the same thing.
>
> Ahh, ok, didn't know that __cgroup_procs_write is doing something
> controller specific. Yes then we wouldn't need a generic callback if
> another code like above would be acceptable.

Yeah, I thought it'd be a one-off thing so didn't make it a generic
callback. Making it a generic callback isn't a problem tho.

> > > really relies on the previous behavior I guess we can solve it with a
> > > post_move cgroup callback which would be called from a lockless context.
> > >
> > > Anyway, before we go that way, can we at least consider the possibility
> > > of removing the kworker creation dependency on the global rwsem? AFAIU
> > > this locking was added because of the pid controller. Do we even care
> > > about something as volatile as kworkers in the pid controller?
> >
> > It's not just pid controller and the global percpu locking has lower
>
> where else would the locking matter? I have only checked the git history
> to build my picture so I might be missing something of course.

IIRC, there were followup patches which fixed and/or simplified
locking paths. It's just generally a lot simpler to deal with. The
downside obviously is that cgroup core operations can't depend on task
creation. I didn't expect memcg to trigger it too tho. I don't think
we wanna be doing heavy-lifting operations like node migration or page
relabeling while holding cgroup lock anyway, so would much prefer
making them async.

> > hotpath overhead. We can try to exclude kworkers out of the locking
> > but that can get really nasty and there are already attempts to add
> > cgroup support to workqueue. Will think more about it. For now tho,
> > do you think making charge moving async would be difficult?
>
> Well it certainly is not that trivial because it relies on being
> exclusive with global context. I will have to look closer of course but
> I cannot guarantee I will get to it before I get back from LSF. We can
> certainly discuss that at the conference. Johannes will be there as
> well.

I see. For cpuset, it didn't really matter but what we can do is
create a mechanism on the cgroup core side which is called after a
migration operation is done, after the usual locks are dropped, and which
guarantees that no new migration will be started before the callbacks
finish. If we have that, relocating charge moving outside the attach
path should be pretty trivial, right?

Thanks.

--
tejun

2016-04-15 19:26:51

by Tejun Heo

Subject: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

mem_cgroup_move_charge() invokes lru_add_drain_all() so that the pvec
pages can be moved too. lru_add_drain_all() schedules and flushes
work items on system_wq which depends on being able to create new
kworkers to make forward progress. Since 1ed1328792ff ("sched,
cgroup: replace signal_struct->group_rwsem with a global
percpu_rwsem"), a new task can't be created while in the cgroup
migration path and the described lru_add_drain_all() invocation can
easily lead to a deadlock.

Charge moving is best-effort and whether the pvec pages are migrated
or not doesn't really matter. Don't call it during charge moving.
Eventually, we want to move the actual charge moving outside the
migration path.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Cc: [email protected]
---
Hello,

So, this deadlock seems pretty easy to trigger. We'll make the charge
moving asynchronous eventually but let's not hold off fixing an
immediate problem.

Thanks.

mm/memcontrol.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36db05f..56060c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4859,7 +4859,6 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
.mm = mm,
};

- lru_add_drain_all();
/*
* Signal lock_page_memcg() to take the memcg's move_lock
* while we're moving its pages to another memcg. Then wait

2016-04-17 12:01:02

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Fri 15-04-16 11:25:26, Tejun Heo wrote:
> Hello,

Hi,

> On Fri, Apr 15, 2016 at 05:08:15PM +0200, Michal Hocko wrote:
[...]
> > Well it certainly is not that trivial because it relies on being
> > exclusive with global context. I will have to look closer of course but
> > I cannot guarantee I will get to it before I get back from LSF. We can
> > certainly discuss that at the conference. Johannes will be there as
> > well.
>
> I see. For cpuset, it didn't really matter but what we can do is
> creating a mechanism on cgroup core side which is called after a
> migration operation is done after dropping the usual locks and
> guarantees that no new migration will be started before the callbacks
> finish. If we have that, relocating charge moving outside the attach
> path should be pretty trivial, right?

Yeah, I guess this should work as well. I would have to double-check the
code; I cannot say it would be my favorite, so I might have forgotten some
subtle details.
--
Michal Hocko
SUSE Labs

2016-04-17 12:07:54

by Michal Hocko

Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

On Fri 15-04-16 15:17:19, Tejun Heo wrote:
> mem_cgroup_move_charge() invokes lru_add_drain_all() so that the pvec
> pages can be moved too. lru_add_drain_all() schedules and flushes
> work items on system_wq which depends on being able to create new
> kworkers to make forward progress. Since 1ed1328792ff ("sched,
> cgroup: replace signal_struct->group_rwsem with a global
> percpu_rwsem"), a new task can't be created while in the cgroup
> migration path and the described lru_add_drain_all() invocation can
> easily lead to a deadlock.
>
> Charge moving is best-effort and whether the pvec pages are migrated
> or not doesn't really matter. Don't call it during charge moving.
> Eventually, we want to move the actual charge moving outside the
> migration path.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Johannes Weiner <[email protected]>

I guess
Debugged-by: Petr Mladek <[email protected]>
Reported-by: Cyril Hrubis <[email protected]>

would be due

> Suggested-by: Michal Hocko <[email protected]>
> Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
> Cc: [email protected]
> ---
> Hello,
>
> So, this deadlock seems pretty easy to trigger. We'll make the charge
> moving asynchronous eventually but let's not hold off fixing an
> immediate problem.

Although this looks rather straightforward and it fixes the immediate
problem, I am a little bit nervous about it. As already pointed out in
another email, mem_cgroup_move_charge still depends on mmap_sem for
read and we might hit an even more subtle lockup if the current holder
of the mmap_sem for write depends on task creation (e.g. some of the
direct reclaim paths use WQ, which is really hard to rule out, and I even
think that some shrinkers do this).

I liked your proposal, where mem_cgroup_move_charge would be called from
a context which doesn't hold the problematic rwsem, much more. Would that
be too intrusive for the stable backport?

> Thanks.
>
> mm/memcontrol.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36db05f..56060c7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4859,7 +4859,6 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
> .mm = mm,
> };
>
> - lru_add_drain_all();
> /*
> * Signal lock_page_memcg() to take the memcg's move_lock
> * while we're moving its pages to another memcg. Then wait

--
Michal Hocko
SUSE Labs

2016-04-18 14:40:32

by Petr Mladek

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Fri 2016-04-15 10:38:15, Tejun Heo wrote:
> > Anyway, before we go that way, can we at least consider the possibility
> > of removing the kworker creation dependency on the global rwsem? AFAIU
> > this locking was added because of the pid controller. Do we even care
> > about something as volatile as kworkers in the pid controller?
>
> It's not just pid controller and the global percpu locking has lower
> hotpath overhead. We can try to exclude kworkers out of the locking
> but that can get really nasty and there are already attempts to add
> cgroup support to workqueue. Will think more about it.

I have played with this idea on Friday. Please find a POC below.
The worker detection works and the deadlock is removed. But workers
do not appear in the root cgroups. I am not familiar with the cgroups
stuff, so this part is much more difficult for me.

I am sending it because it might give you an idea when discussing it
at LSF. Please let me know whether I should continue this way or
whether it already looks too crazy.


From ca1420926f990892a914d64046ee8d273b876f30 Mon Sep 17 00:00:00 2001
From: Petr Mladek <[email protected]>
Date: Mon, 18 Apr 2016 14:17:17 +0200
Subject: [POC PATCH] cgroups/workqueues: Do not block forking workqueues by cgroups
lock

This is a POC of how to delay cgroups operations when forking workqueue
workers. Workqueues are used by some cgroups operations, for example,
lru_add_drain_all(). Creating a new worker might help to avoid a deadlock.

This patch adds a way to detect whether a workqueue worker is being
forked, see detect_wq_worker(). For this it needs to make struct
kthread_create_info and the main worker_thread public. As a consequence,
it renames worker_thread to wq_worker_thread.

Next, cgroup_fork() just initializes the cgroups fields in task_struct.
It does not really need to be guarded by cgroup_threadgroup_rwsem.

cgroup_threadgroup_rwsem is taken later when we check if the fork
is allowed and when we copy the cgroups setting. But these two
operations are skipped for workers.

The result is that workers are not blocked by any cgroups operation
but they do not appear in the root cgroups.

There is a preparation of cgroup_delayed_post_fork() that might put
the workers into the root cgroups. It is just a copy of
cgroup_post_fork() where "kthreadd_task" is hardcoded. It is not yet
called. Also it is not protected against other cgroups operations.
---
include/linux/kthread.h | 14 +++++++++++++
include/linux/workqueue.h | 1 +
kernel/cgroup.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 36 +++++++++++++++++++++++---------
kernel/kthread.c | 14 -------------
kernel/workqueue.c | 9 ++++----
6 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index e691b6a23f72..23e9dc494696 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -49,6 +49,20 @@ int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);

+struct kthread_create_info
+{
+ /* Information passed to kthread() from kthreadd. */
+ int (*threadfn)(void *data);
+ void *data;
+ int node;
+
+ /* Result passed back to kthread_create() from kthreadd. */
+ struct task_struct *result;
+ struct completion *done;
+
+ struct list_head list;
+};
+
int kthreadd(void *unused);
extern struct task_struct *kthreadd_task;
extern int tsk_fork_get_node(struct task_struct *tsk);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index ca73c503b92a..fc713e88dfb1 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -18,6 +18,7 @@ struct workqueue_struct;
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);
void delayed_work_timer_fn(unsigned long __data);
+int wq_worker_thread(void *__worker);

/*
* The first word is the work queue pointer and the flags rolled into
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 671dc05c0b0f..971ea46f352d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5876,6 +5876,59 @@ void cgroup_post_fork(struct task_struct *child)
} while_each_subsys_mask();
}

+/* TODO: Call this from a work item for kthread workers to move them
+ * into the root cgroups (inherited from kthreadd_task). They could not
+ * be moved anywhere else anyway. */
+
+void cgroup_delayed_post_fork(struct task_struct *child)
+{
+ struct cgroup_subsys *ss;
+ int i;
+
+ /*
+ * This may race against cgroup_enable_task_cg_lists(). As that
+ * function sets use_task_css_set_links before grabbing
+ * tasklist_lock and we just went through tasklist_lock to add
+ * @child, it's guaranteed that either we see the set
+ * use_task_css_set_links or cgroup_enable_task_cg_lists() sees
+ * @child during its iteration.
+ *
+ * If we won the race, @child is associated with %current's
+ * css_set. Grabbing css_set_lock guarantees both that the
+ * association is stable, and, on completion of the parent's
+ * migration, @child is visible in the source of migration or
+ * already in the destination cgroup. This guarantee is necessary
+ * when implementing operations which need to migrate all tasks of
+ * a cgroup to another.
+ *
+ * Note that if we lose to cgroup_enable_task_cg_lists(), @child
+ * will remain in init_css_set. This is safe because all tasks are
+ * in the init_css_set before cg_links is enabled and there's no
+ * operation which transfers all tasks out of init_css_set.
+ */
+ if (use_task_css_set_links) {
+ struct css_set *cset;
+
+ spin_lock_bh(&css_set_lock);
+ cset = task_css_set(kthreadd_task);
+ if (list_empty(&child->cg_list)) {
+ get_css_set(cset);
+ css_set_move_task(child, NULL, cset, false);
+ }
+ spin_unlock_bh(&css_set_lock);
+ }
+
+ /*
+ * Call ss->fork(). This must happen after @child is linked on
+ * css_set; otherwise, @child might change state between ->fork()
+ * and addition to css_set.
+ */
+ do_each_subsys_mask(ss, i, have_fork_callback) {
+ ss->fork(child);
+ } while_each_subsys_mask();
+}
+
+
/**
* cgroup_exit - detach cgroup from exiting task
* @tsk: pointer to task_struct of exiting process
diff --git a/kernel/fork.c b/kernel/fork.c
index d277e83ed3e0..ab4ea4d8a9ad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1242,6 +1242,13 @@ init_task_pid(struct task_struct *task, enum pid_type type, struct pid *pid)
task->pids[type].pid = pid;
}

+static bool
+detect_wq_worker(struct task_struct *p, struct kthread_create_info *create)
+{
+ return ((p->flags & PF_KTHREAD) && create &&
+ create->threadfn == wq_worker_thread);
+}
+
/*
* This creates a new process as a copy of the old one,
* but does not actually start it yet.
@@ -1259,6 +1266,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
unsigned long tls)
{
int retval;
+ bool preparing_wq_worker;
struct task_struct *p;

if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
@@ -1378,14 +1386,13 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->real_start_time = ktime_get_boot_ns();
p->io_context = NULL;
p->audit_context = NULL;
- threadgroup_change_begin(current);
cgroup_fork(p);
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
if (IS_ERR(p->mempolicy)) {
retval = PTR_ERR(p->mempolicy);
p->mempolicy = NULL;
- goto bad_fork_cleanup_threadgroup_lock;
+ goto bad_fork_cleanup_delayacct_tsk;
}
#endif
#ifdef CONFIG_CPUSETS
@@ -1536,9 +1543,14 @@ static struct task_struct *copy_process(unsigned long clone_flags,
* between here and cgroup_post_fork() if an organisation operation is in
* progress.
*/
- retval = cgroup_can_fork(p);
- if (retval)
- goto bad_fork_free_pid;
+ preparing_wq_worker =
+ detect_wq_worker(p, (struct kthread_create_info *)stack_size);
+ if (!preparing_wq_worker) {
+ threadgroup_change_begin(current);
+ retval = cgroup_can_fork(p);
+ if (retval)
+ goto bad_fork_free_pid;
+ }

/*
* Make it visible to the rest of the system, but dont wake it up yet.
@@ -1618,8 +1630,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
write_unlock_irq(&tasklist_lock);

proc_fork_connector(p);
- cgroup_post_fork(p);
- threadgroup_change_end(current);
+ if (!preparing_wq_worker) {
+ cgroup_post_fork(p);
+ threadgroup_change_end(current);
+ }
perf_event_fork(p);

trace_task_newtask(p, clone_flags);
@@ -1628,8 +1642,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
return p;

bad_fork_cancel_cgroup:
- cgroup_cancel_fork(p);
+ if (!preparing_wq_worker)
+ cgroup_cancel_fork(p);
bad_fork_free_pid:
+ if (!preparing_wq_worker)
+ threadgroup_change_end(current);
if (pid != &init_struct_pid)
free_pid(pid);
bad_fork_cleanup_io:
@@ -1658,9 +1675,8 @@ bad_fork_cleanup_perf:
bad_fork_cleanup_policy:
#ifdef CONFIG_NUMA
mpol_put(p->mempolicy);
-bad_fork_cleanup_threadgroup_lock:
+bad_fork_cleanup_delayacct_tsk:
#endif
- threadgroup_change_end(current);
delayacct_tsk_free(p);
bad_fork_cleanup_count:
atomic_dec(&p->cred->user->processes);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 9ff173dca1ae..53dbff29c798 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -24,20 +24,6 @@ static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;

-struct kthread_create_info
-{
- /* Information passed to kthread() from kthreadd. */
- int (*threadfn)(void *data);
- void *data;
- int node;
-
- /* Result passed back to kthread_create() from kthreadd. */
- struct task_struct *result;
- struct completion *done;
-
- struct list_head list;
-};
-
struct kthread {
unsigned long flags;
unsigned int cpu;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2232ae3e3ad6..e8cf3db849a5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -348,7 +348,6 @@ EXPORT_SYMBOL_GPL(system_power_efficient_wq);
struct workqueue_struct *system_freezable_power_efficient_wq __read_mostly;
EXPORT_SYMBOL_GPL(system_freezable_power_efficient_wq);

-static int worker_thread(void *__worker);
static void workqueue_sysfs_unregister(struct workqueue_struct *wq);

#define CREATE_TRACE_POINTS
@@ -1771,8 +1770,8 @@ static struct worker *create_worker(struct worker_pool *pool)
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

- worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
- "kworker/%s", id_buf);
+ worker->task = kthread_create_on_node(wq_worker_thread, worker,
+ pool->node, "kworker/%s", id_buf);
if (IS_ERR(worker->task))
goto fail;

@@ -2155,7 +2154,7 @@ static void process_scheduled_works(struct worker *worker)
}

/**
- * worker_thread - the worker thread function
+ * wq_worker_thread - the worker thread function
* @__worker: self
*
* The worker thread function. All workers belong to a worker_pool -
@@ -2166,7 +2165,7 @@ static void process_scheduled_works(struct worker *worker)
*
* Return: 0
*/
-static int worker_thread(void *__worker)
+int wq_worker_thread(void *__worker)
{
struct worker *worker = __worker;
struct worker_pool *pool = worker->pool;
--
1.8.5.6

2016-04-19 14:01:26

by Michal Hocko

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Mon 18-04-16 16:40:23, Petr Mladek wrote:
> On Fri 2016-04-15 10:38:15, Tejun Heo wrote:
> > > Anyway, before we go that way, can we at least consider the possibility
> > > of removing the kworker creation dependency on the global rwsem? AFAIU
> > > this locking was added because of the pid controller. Do we even care
> > > about something as volatile as kworkers in the pid controller?
> >
> > It's not just pid controller and the global percpu locking has lower
> > hotpath overhead. We can try to exclude kworkers out of the locking
> > but that can get really nasty and there are already attempts to add
> > cgroup support to workqueue. Will think more about it.
>
> I have played with this idea on Friday. Please, find below a POC.
> The worker detection works and the deadlock is removed. But workers
> do not appear in the root cgroups. I am not familiar with the cgroups
> stuff, so this part is much more difficult for me.
>
> I send it because it might give you an idea when discussing it
> on LSF. Please, let me know if I should continue on this way or
> if it looks too crazy already now.
>
>
> >From ca1420926f990892a914d64046ee8d273b876f30 Mon Sep 17 00:00:00 2001
> From: Petr Mladek <[email protected]>
> Date: Mon, 18 Apr 2016 14:17:17 +0200
> Subject: [POC PATCH] cgroups/workqueus: Do not block forking workqueues by cgroups
> lock
>
> This is a POC how to delay cgroups operations when forking workqueue
> workers. Workqueues are used by some cgroups operations, for example,
> lru_add_drain_all(). Creating new worker might help to avoid a deadlock.
>
> This patch adds a way to detect whether a workqueue worker is being
> forked, see detect_wq_worker(). For this it needs to make struct
> kthread_create_info and the main worker_thread public. As a consequence,
> it renames worker_thread to wq_worker_thread.
>
> Next, cgroups_fork() just initializes the cgroups fields in task_struct.
> It does not really need to be guarded by cgroup_threadgroup_rwsem.
>
> cgroup_threadgroup_rwsem is taken later when we check if the fork
> is allowed and when we copy the cgroups setting. But these two
> operations are skipped for workers.
>
> The result is that workers are not blocked by any cgroups operation
> but they do not appear in the root cgroups.
>
> There is a preparation of cgroup_delayed_post_fork() that might put
> the workers into the root cgroups. It is just a copy of
> cgroup_post_fork() where "kthreadd_task" is hardcoded. It is not yet
> called. Also it is not protected against other cgroups operations.
> ---
> include/linux/kthread.h | 14 +++++++++++++
> include/linux/workqueue.h | 1 +
> kernel/cgroup.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 36 +++++++++++++++++++++++---------
> kernel/kthread.c | 14 -------------
> kernel/workqueue.c | 9 ++++----
> 6 files changed, 98 insertions(+), 29 deletions(-)

This feels too overcomplicated. Can we simply drop the locking in
copy_process() if current == kthreadd? Would something actually
break?
--
Michal Hocko
SUSE Labs

2016-04-19 15:39:10

by Petr Mladek

Subject: Re: [BUG] cgroup/workques/fork: deadlock when moving cgroups

On Tue 2016-04-19 10:01:21, Michal Hocko wrote:
> On Mon 18-04-16 16:40:23, Petr Mladek wrote:
> > On Fri 2016-04-15 10:38:15, Tejun Heo wrote:
> > > > Anyway, before we go that way, can we at least consider the possibility
> > > > of removing the kworker creation dependency on the global rwsem? AFAIU
> > > > this locking was added because of the pid controller. Do we even care
> > > > about something as volatile as kworkers in the pid controller?
> > >
> > > It's not just pid controller and the global percpu locking has lower
> > > hotpath overhead. We can try to exclude kworkers out of the locking
> > > but that can get really nasty and there are already attempts to add
> > > cgroup support to workqueue. Will think more about it.
> >
> > I have played with this idea on Friday. Please, find below a POC.
> > The worker detection works and the deadlock is removed. But workers
> > do not appear in the root cgroups. I am not familiar with the cgroups
> > stuff, so this part is much more difficult for me.
> >
> > I send it because it might give you an idea when discussing it
> > on LSF. Please, let me know if I should continue on this way or
> > if it looks too crazy already now.
> >
> > >From ca1420926f990892a914d64046ee8d273b876f30 Mon Sep 17 00:00:00 2001
> > From: Petr Mladek <[email protected]>
> > Date: Mon, 18 Apr 2016 14:17:17 +0200
> > Subject: [POC PATCH] cgroups/workqueus: Do not block forking workqueues by cgroups
> > lock
> >
> > This is a POC how to delay cgroups operations when forking workqueue
> > workers.
>>
> > include/linux/kthread.h | 14 +++++++++++++
> > include/linux/workqueue.h | 1 +
> > kernel/cgroup.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++
> > kernel/fork.c | 36 +++++++++++++++++++++++---------
> > kernel/kthread.c | 14 -------------
> > kernel/workqueue.c | 9 ++++----
> > 6 files changed, 98 insertions(+), 29 deletions(-)
>
> This feels too overcomplicated. Can we simply drop the locking in
> copy_process if the current == ktreadadd? Would something actually
> break?

This would affect all kthreads. But there are kthreads that might be moved
into cgroups and where it makes sense. We will need to synchronize the
delayed cgroups initialization with the moving operation. But then we
could use the same solution for all processes.

Best Regards,
Petr

2016-04-20 21:29:27

by Tejun Heo

Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

Hello, Michal.

On Sun, Apr 17, 2016 at 08:07:48AM -0400, Michal Hocko wrote:
> On Fri 15-04-16 15:17:19, Tejun Heo wrote:
> > mem_cgroup_move_charge() invokes lru_add_drain_all() so that the pvec
> > pages can be moved too. lru_add_drain_all() schedules and flushes
> > work items on system_wq which depends on being able to create new
> > kworkers to make forward progress. Since 1ed1328792ff ("sched,
> > cgroup: replace signal_struct->group_rwsem with a global
> > percpu_rwsem"), a new task can't be created while in the cgroup
> > migration path and the described lru_add_drain_all() invocation can
> > easily lead to a deadlock.
> >
> > Charge moving is best-effort and whether the pvec pages are migrated
> > or not doesn't really matter. Don't call it during charge moving.
> > Eventually, we want to move the actual charge moving outside the
> > migration path.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reported-by: Johannes Weiner <[email protected]>
>
> I guess
> Debugged-by: Petr Mladek <[email protected]>
> Reported-by: Cyril Hrubis <[email protected]>

Yeah, definitely. Sorry about missing them.

> > Suggested-by: Michal Hocko <[email protected]>
> > Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
> > Cc: [email protected]
> > ---
> > Hello,
> >
> > So, this deadlock seems pretty easy to trigger. We'll make the charge
> > moving asynchronous eventually but let's not hold off fixing an
> > immediate problem.
>
> Although this looks rather straightforward and it fixes the immediate
> problem I am little bit nervous about it. As already pointed out in
> other email mem_cgroup_move_charge still depends on mmap_sem for
> read and we might hit an even more subtle lockup if the current holder
> of the mmap_sem for write depends on the task creation (e.g. some of the
> direct reclaim path uses WQ which is really hard to rule out and I even
> think that some shrinkers do this).
>
> I liked your proposal when mem_cgroup_move_charge would be called from a
> context which doesn't hold the problematic rwsem much more. Would that
> be too intrusive for the stable backport?

Yeah, I'm working on the fix but let's plug this one first as it seems
really easy to trigger. I got a couple of off-list reports (in and
outside fb) of this triggering.

Thanks.

--
tejun

2016-04-21 03:27:11

by Michal Hocko

Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

On Wed 20-04-16 17:29:22, Tejun Heo wrote:
> Hello, Michal.
>
> On Sun, Apr 17, 2016 at 08:07:48AM -0400, Michal Hocko wrote:
[...]
> > I liked your proposal when mem_cgroup_move_charge would be called from a
> > context which doesn't hold the problematic rwsem much more. Would that
> > be too intrusive for the stable backport?
>
> Yeah, I'm working on the fix but let's plug this one first as it seems
> really easy to trigger. I got a couple off-list reports (in and
> outside fb) of this triggering.

Sure, I think we can use this as an immediate workaround. Nevertheless
we want to have a full fix and not rely on this being the only
problem. I would eventually like to reintroduce the draining later, when
we have a better fix, because even though this is not a correctness
problem I think we should try hard not to leave anything behind.

That being said:
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs

2016-04-21 15:00:28

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

On Wed 2016-04-20 23:27:05, Michal Hocko wrote:
> On Wed 20-04-16 17:29:22, Tejun Heo wrote:
> > Hello, Michal.
> >
> > On Sun, Apr 17, 2016 at 08:07:48AM -0400, Michal Hocko wrote:
> [...]
> > > I liked your proposal when mem_cgroup_move_charge would be called from a
> > > context which doesn't hold the problematic rwsem much more. Would that
> > > be too intrusive for the stable backport?
> >
> > Yeah, I'm working on the fix but let's plug this one first as it seems
> > really easy to trigger. I got a couple off-list reports (in and
> > outside fb) of this triggering.
>
> Sure, I think we can use this as an immediate workaround. Nevertheless
> we want to have a full fix and not rely on this being the only
> problem. I would eventually like to reintroduce the draining later, when
> we have a better fix, because even though this is not a correctness
> problem I think we should try hard not to leave anything behind.
>
> That being said:
> Acked-by: Michal Hocko <[email protected]>

Just for completeness: the problematic LTP test has been running for hours
with this patch. Feel free to add:

Tested-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2016-04-21 15:51:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

On Thu, Apr 21, 2016 at 05:00:24PM +0200, Petr Mladek wrote:
> > Acked-by: Michal Hocko <[email protected]>
>
> Just for completeness: the problematic LTP test has been running for hours
> with this patch. Feel free to add:
>
> Tested-by: Petr Mladek <[email protected]>

Thanks, I'm gonna repost w/ Andrew cc'd and the tags added.

--
tejun

2016-04-21 15:56:56

by Tejun Heo

[permalink] [raw]
Subject: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move_charge()

mem_cgroup_move_charge() invokes lru_add_drain_all() so that the pvec
pages can be moved too. lru_add_drain_all() schedules and flushes
work items on system_wq which depends on being able to create new
kworkers to make forward progress. Since 1ed1328792ff ("sched,
cgroup: replace signal_struct->group_rwsem with a global
percpu_rwsem"), a new task can't be created while in the cgroup
migration path and the described lru_add_drain_all() invocation can
easily lead to a deadlock.
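
To make the dependency concrete, below is a simplified sketch of the
pattern behind lru_add_drain_all() (the real code lives in mm/swap.c and
is more involved; drain_work and lru_drain_sketch are illustrative names
only, not actual kernel symbols):

#include <linux/cpu.h>
#include <linux/percpu.h>
#include <linux/workqueue.h>

/* assumed to be set up with INIT_WORK() during init */
static DEFINE_PER_CPU(struct work_struct, drain_work);

static void lru_drain_sketch(void)
{
	int cpu;

	get_online_cpus();
	/* queue one work item per online CPU on system_wq */
	for_each_online_cpu(cpu)
		schedule_work_on(cpu, &per_cpu(drain_work, cpu));
	/*
	 * Wait for every item.  flush_work() returns only after a kworker
	 * has executed the item; if the pool has to create a new kworker,
	 * that fork blocks on cgroup_threadgroup_rwsem, which the
	 * migration path is holding -- hence the deadlock.
	 */
	for_each_online_cpu(cpu)
		flush_work(&per_cpu(drain_work, cpu));
	put_online_cpus();
}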

Charge moving is best-effort and whether the pvec pages are migrated
or not doesn't really matter. Don't call it during charge moving.
Eventually, we want to move the actual charge moving outside the
migration path.

Signed-off-by: Tejun Heo <[email protected]>
Debugged-and-tested-by: Petr Mladek <[email protected]>
Reported-by: Cyril Hrubis <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Cc: [email protected] # v4.4+
---
Hello, Andrew.

This fixes an easily triggered deadlock when using memcg charge
moving. While the issue requires more invasive changes, which are
being worked on, to plug completely, this works around the immediate
issue. Can you please route this through -mm?

Thanks.

mm/memcontrol.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36db05f..56060c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4859,7 +4859,6 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
.mm = mm,
};

- lru_add_drain_all();
/*
* Signal lock_page_memcg() to take the memcg's move_lock
* while we're moving its pages to another memcg. Then wait

2016-04-21 23:06:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/2] cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback

Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write(). This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function. This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex. This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished. cgroup_mutex is one of the outermost mutexes in the system
and has never been, and shouldn't be, a problem. We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: <[email protected]> # 4.4+ prerequisite for the next patch
---
include/linux/cgroup-defs.h | 1 +
include/linux/cpuset.h | 6 ------
kernel/cgroup.c | 7 +++++--
kernel/cpuset.c | 4 ++--
4 files changed, 8 insertions(+), 10 deletions(-)

--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -444,6 +444,7 @@ struct cgroup_subsys {
int (*can_attach)(struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup_taskset *tset);
void (*attach)(struct cgroup_taskset *tset);
+ void (*post_attach)(void);
int (*can_fork)(struct task_struct *task);
void (*cancel_fork)(struct task_struct *task);
void (*fork)(struct task_struct *task);
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -137,8 +137,6 @@ static inline void set_mems_allowed(node
task_unlock(current);
}

-extern void cpuset_post_attach_flush(void);
-
#else /* !CONFIG_CPUSETS */

static inline bool cpusets_enabled(void) { return false; }
@@ -245,10 +243,6 @@ static inline bool read_mems_allowed_ret
return false;
}

-static inline void cpuset_post_attach_flush(void)
-{
-}
-
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2825,9 +2825,10 @@ static ssize_t __cgroup_procs_write(stru
size_t nbytes, loff_t off, bool threadgroup)
{
struct task_struct *tsk;
+ struct cgroup_subsys *ss;
struct cgroup *cgrp;
pid_t pid;
- int ret;
+ int ssid, ret;

if (kstrtoint(strstrip(buf), 0, &pid) || pid < 0)
return -EINVAL;
@@ -2875,8 +2876,10 @@ out_unlock_rcu:
rcu_read_unlock();
out_unlock_threadgroup:
percpu_up_write(&cgroup_threadgroup_rwsem);
+ for_each_subsys(ss, ssid)
+ if (ss->post_attach)
+ ss->post_attach();
cgroup_kn_unlock(of->kn);
- cpuset_post_attach_flush();
return ret ?: nbytes;
}

--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -58,7 +58,6 @@
#include <asm/uaccess.h>
#include <linux/atomic.h>
#include <linux/mutex.h>
-#include <linux/workqueue.h>
#include <linux/cgroup.h>
#include <linux/wait.h>

@@ -1016,7 +1015,7 @@ static void cpuset_migrate_mm(struct mm_
}
}

-void cpuset_post_attach_flush(void)
+static void cpuset_post_attach(void)
{
flush_workqueue(cpuset_migrate_mm_wq);
}
@@ -2087,6 +2086,7 @@ struct cgroup_subsys cpuset_cgrp_subsys
.can_attach = cpuset_can_attach,
.cancel_attach = cpuset_cancel_attach,
.attach = cpuset_attach,
+ .post_attach = cpuset_post_attach,
.bind = cpuset_bind,
.legacy_cftypes = files,
.early_init = true,

2016-04-21 23:09:08

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/2] memcg: relocate charge moving from ->attach to ->post_attach

Hello,

So, this ended up a lot simpler than I originally expected. I tested
it lightly and it seems to work fine. Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?

Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback, which is called under
cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.

There's no reason to perform charge moving from ->attach which is deep
in the task migration path. Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.

* move_charge_struct->mm is added and ->can_attach is now responsible
for pinning and recording the target mm. mem_cgroup_clear_mc() is
updated accordingly. This also simplifies mem_cgroup_move_task().

* mem_cgroup_move_task() is now called from ->post_attach instead of
->attach.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Debugged-by: Petr Mladek <[email protected]>
Reported-by: Cyril Hrubis <[email protected]>
Reported-by: Johannes Weiner <[email protected]>
Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
Cc: <[email protected]> # 4.4+
---
mm/memcontrol.c | 37 +++++++++++++++++++------------------
1 file changed, 19 insertions(+), 18 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -207,6 +207,7 @@ static void mem_cgroup_oom_notify(struct
/* "mc" and its members are protected by cgroup_mutex */
static struct move_charge_struct {
spinlock_t lock; /* for from, to */
+ struct mm_struct *mm;
struct mem_cgroup *from;
struct mem_cgroup *to;
unsigned long flags;
@@ -4667,6 +4668,8 @@ static void __mem_cgroup_clear_mc(void)

static void mem_cgroup_clear_mc(void)
{
+ struct mm_struct *mm = mc.mm;
+
/*
* we must clear moving_task before waking up waiters at the end of
* task migration.
@@ -4676,7 +4679,10 @@ static void mem_cgroup_clear_mc(void)
spin_lock(&mc.lock);
mc.from = NULL;
mc.to = NULL;
+ mc.mm = NULL;
spin_unlock(&mc.lock);
+
+ mmput(mm);
}

static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
@@ -4733,6 +4739,7 @@ static int mem_cgroup_can_attach(struct
VM_BUG_ON(mc.moved_swap);

spin_lock(&mc.lock);
+ mc.mm = mm;
mc.from = from;
mc.to = memcg;
mc.flags = move_flags;
@@ -4742,8 +4749,9 @@ static int mem_cgroup_can_attach(struct
ret = mem_cgroup_precharge_mc(mm);
if (ret)
mem_cgroup_clear_mc();
+ } else {
+ mmput(mm);
}
- mmput(mm);
return ret;
}

@@ -4852,11 +4860,11 @@ put: /* get_mctgt_type() gets the page
return ret;
}

-static void mem_cgroup_move_charge(struct mm_struct *mm)
+static void mem_cgroup_move_charge(void)
{
struct mm_walk mem_cgroup_move_charge_walk = {
.pmd_entry = mem_cgroup_move_charge_pte_range,
- .mm = mm,
+ .mm = mc.mm,
};

lru_add_drain_all();
@@ -4868,7 +4876,7 @@ static void mem_cgroup_move_charge(struc
atomic_inc(&mc.from->moving_account);
synchronize_rcu();
retry:
- if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+ if (unlikely(!down_read_trylock(&mc.mm->mmap_sem))) {
/*
* Someone who are holding the mmap_sem might be waiting in
* waitq. So we cancel all extra charges, wake up all waiters,
@@ -4885,23 +4893,16 @@ retry:
* additional charge, the page walk just aborts.
*/
walk_page_range(0, ~0UL, &mem_cgroup_move_charge_walk);
- up_read(&mm->mmap_sem);
+ up_read(&mc.mm->mmap_sem);
atomic_dec(&mc.from->moving_account);
}

-static void mem_cgroup_move_task(struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(void)
{
- struct cgroup_subsys_state *css;
- struct task_struct *p = cgroup_taskset_first(tset, &css);
- struct mm_struct *mm = get_task_mm(p);
-
- if (mm) {
- if (mc.to)
- mem_cgroup_move_charge(mm);
- mmput(mm);
- }
- if (mc.to)
+ if (mc.to) {
+ mem_cgroup_move_charge();
mem_cgroup_clear_mc();
+ }
}
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
@@ -4911,7 +4912,7 @@ static int mem_cgroup_can_attach(struct
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
}
-static void mem_cgroup_move_task(struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(void)
{
}
#endif
@@ -5195,7 +5196,7 @@ struct cgroup_subsys memory_cgrp_subsys
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
- .attach = mem_cgroup_move_task,
+ .post_attach = mem_cgroup_move_task,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,

2016-04-21 23:11:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2] cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback

BTW, I just got a permanent delivery failure. The mail was sent out
through gmail.

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the server for the recipient domain suse.com by prv3-mx.novell.com. [130.57.1.17].

The error that the other server returned was:
550 Message was blocked by server (RBL hit)

--
tejun

2016-04-22 13:57:22

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: relocate charge moving from ->attach to ->post_attach

On Thu 2016-04-21 19:09:02, Tejun Heo wrote:
> Hello,
>
> So, this ended up a lot simpler than I originally expected. I tested
> it lightly and it seems to work fine. Petr, can you please test these
> two patches w/o the lru drain drop patch and see whether the problem
> is gone?

The LTP test has been running for several hours without any problem with
these two patches. I double-checked that I did not have the lru drain drop
patch. Therefore, these two patches solve the problem as well, so:

Tested-by: Petr Mladek <[email protected]>

Best Regards,
Petr

2016-04-25 08:26:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: relocate charge moving from ->attach to ->post_attach

On Thu 21-04-16 19:09:02, Tejun Heo wrote:
> Hello,
>
> So, this ended up a lot simpler than I originally expected. I tested
> it lightly and it seems to work fine. Petr, can you please test these
> two patches w/o the lru drain drop patch and see whether the problem
> is gone?
>
> Thanks.
> ------ 8< ------
> If charge moving is used, memcg performs relabeling of the affected
> pages from its ->attach callback, which is called under
> cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
> fragile as various operations may depend on workqueues making forward
> progress which relies on the ability to create new kthreads.
>
> There's no reason to perform charge moving from ->attach which is deep
> in the task migration path. Move it to ->post_attach which is called
> after the actual migration is finished and cgroup_threadgroup_rwsem is
> dropped.
>
> * move_charge_struct->mm is added and ->can_attach is now responsible
> for pinning and recording the target mm. mem_cgroup_clear_mc() is
> updated accordingly. This also simplifies mem_cgroup_move_task().
>
> * mem_cgroup_move_task() is now called from ->post_attach instead of
> ->attach.

This looks much better than the previous quick&dirty workaround. Thanks
for taking an extra step!

Sorry for being so pushy, but shouldn't we at least document that those
callbacks which run under cgroup_threadgroup_rwsem must never block
on WQ, directly or indirectly? Maybe even enforce that they have to be
non-blocking. This would be out of the scope of this particular patch, of
course, but it would fit nicely into the series.

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Debugged-by: Petr Mladek <[email protected]>
> Reported-by: Cyril Hrubis <[email protected]>
> Reported-by: Johannes Weiner <[email protected]>
> Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
> Cc: <[email protected]> # 4.4+

Acked-by: Michal Hocko <[email protected]>

Thanks!

> ---
> mm/memcontrol.c | 37 +++++++++++++++++++------------------
> 1 file changed, 19 insertions(+), 18 deletions(-)
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -207,6 +207,7 @@ static void mem_cgroup_oom_notify(struct
> /* "mc" and its members are protected by cgroup_mutex */
> static struct move_charge_struct {
> spinlock_t lock; /* for from, to */
> + struct mm_struct *mm;
> struct mem_cgroup *from;
> struct mem_cgroup *to;
> unsigned long flags;
> @@ -4667,6 +4668,8 @@ static void __mem_cgroup_clear_mc(void)
>
> static void mem_cgroup_clear_mc(void)
> {
> + struct mm_struct *mm = mc.mm;
> +
> /*
> * we must clear moving_task before waking up waiters at the end of
> * task migration.
> @@ -4676,7 +4679,10 @@ static void mem_cgroup_clear_mc(void)
> spin_lock(&mc.lock);
> mc.from = NULL;
> mc.to = NULL;
> + mc.mm = NULL;
> spin_unlock(&mc.lock);
> +
> + mmput(mm);
> }
>
> static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> @@ -4733,6 +4739,7 @@ static int mem_cgroup_can_attach(struct
> VM_BUG_ON(mc.moved_swap);
>
> spin_lock(&mc.lock);
> + mc.mm = mm;
> mc.from = from;
> mc.to = memcg;
> mc.flags = move_flags;
> @@ -4742,8 +4749,9 @@ static int mem_cgroup_can_attach(struct
> ret = mem_cgroup_precharge_mc(mm);
> if (ret)
> mem_cgroup_clear_mc();
> + } else {
> + mmput(mm);
> }
> - mmput(mm);
> return ret;
> }
>
> @@ -4852,11 +4860,11 @@ put: /* get_mctgt_type() gets the page
> return ret;
> }
>
> -static void mem_cgroup_move_charge(struct mm_struct *mm)
> +static void mem_cgroup_move_charge(void)
> {
> struct mm_walk mem_cgroup_move_charge_walk = {
> .pmd_entry = mem_cgroup_move_charge_pte_range,
> - .mm = mm,
> + .mm = mc.mm,
> };
>
> lru_add_drain_all();
> @@ -4868,7 +4876,7 @@ static void mem_cgroup_move_charge(struc
> atomic_inc(&mc.from->moving_account);
> synchronize_rcu();
> retry:
> - if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> + if (unlikely(!down_read_trylock(&mc.mm->mmap_sem))) {
> /*
> * Someone who are holding the mmap_sem might be waiting in
> * waitq. So we cancel all extra charges, wake up all waiters,
> @@ -4885,23 +4893,16 @@ retry:
> * additional charge, the page walk just aborts.
> */
> walk_page_range(0, ~0UL, &mem_cgroup_move_charge_walk);
> - up_read(&mm->mmap_sem);
> + up_read(&mc.mm->mmap_sem);
> atomic_dec(&mc.from->moving_account);
> }
>
> -static void mem_cgroup_move_task(struct cgroup_taskset *tset)
> +static void mem_cgroup_move_task(void)
> {
> - struct cgroup_subsys_state *css;
> - struct task_struct *p = cgroup_taskset_first(tset, &css);
> - struct mm_struct *mm = get_task_mm(p);
> -
> - if (mm) {
> - if (mc.to)
> - mem_cgroup_move_charge(mm);
> - mmput(mm);
> - }
> - if (mc.to)
> + if (mc.to) {
> + mem_cgroup_move_charge();
> mem_cgroup_clear_mc();
> + }
> }
> #else /* !CONFIG_MMU */
> static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> @@ -4911,7 +4912,7 @@ static int mem_cgroup_can_attach(struct
> static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> {
> }
> -static void mem_cgroup_move_task(struct cgroup_taskset *tset)
> +static void mem_cgroup_move_task(void)
> {
> }
> #endif
> @@ -5195,7 +5196,7 @@ struct cgroup_subsys memory_cgrp_subsys
> .css_reset = mem_cgroup_css_reset,
> .can_attach = mem_cgroup_can_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> - .attach = mem_cgroup_move_task,
> + .post_attach = mem_cgroup_move_task,
> .bind = mem_cgroup_bind,
> .dfl_cftypes = memory_files,
> .legacy_cftypes = mem_cgroup_legacy_files,

--
Michal Hocko
SUSE Labs

2016-04-25 19:42:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: relocate charge moving from ->attach to ->post_attach

Hello, Michal.

On Mon, Apr 25, 2016 at 10:25:42AM +0200, Michal Hocko wrote:
> > * move_charge_struct->mm is added and ->can_attach is now responsible
> > for pinning and recording the target mm. mem_cgroup_clear_mc() is
> > updated accordingly. This also simplifies mem_cgroup_move_task().
> >
> > * mem_cgroup_move_task() is now called from ->post_attach instead of
> > ->attach.
>
> This looks much better than the previous quick&dirty workaround. Thanks
> for taking an extra step!
>
> Sorry for being so pushy, but shouldn't we at least document that those
> callbacks which run under cgroup_threadgroup_rwsem must never block
> on WQ, directly or indirectly? Maybe even enforce that they have to be
> non-blocking. This would be out of the scope of this particular patch, of
> course, but it would fit nicely into the series.

Yeah, I'll look into whether I can annotate this dependency with
lockdep. If that isn't doable, I'll document the requirements.
cgroup methods have changed a lot and need to be redocumented anyway.
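
For the record, one way such a lockdep annotation could look is a fake
lockdep map "acquired" on both sides of the dependency, similar to how
workqueue flush dependencies are annotated. The sketch below is only an
illustration of the idiom; the map name, the helpers and their placement
are assumptions, not what was eventually implemented:

#include <linux/lockdep.h>
#include <linux/sched.h>	/* threadgroup_change_begin/end() */

/* pseudo-lock standing in for "a new kthread/kworker can be forked" */
static struct lockdep_map fork_dep_map =
	STATIC_LOCKDEP_MAP_INIT("fork_dep", &fork_dep_map);

/*
 * Hypothetical fork-side hook: hold the pseudo-lock around the point
 * where copy_process() takes cgroup_threadgroup_rwsem for read, so
 * lockdep records the ordering fork_dep -> cgroup_threadgroup_rwsem.
 */
static void fork_side_annotate(struct task_struct *tsk)
{
	lock_map_acquire(&fork_dep_map);
	threadgroup_change_begin(tsk);
	threadgroup_change_end(tsk);
	lock_map_release(&fork_dep_map);
}

/*
 * Hypothetical waiter-side hook: touch the pseudo-lock before any wait
 * that needs workqueue forward progress (e.g. flush_work()) while the
 * rwsem is write-held.  Lockdep then sees the reverse ordering
 * cgroup_threadgroup_rwsem -> fork_dep and can warn about the
 * potential deadlock instead of the machine having to hit it.
 */
static void waiter_side_annotate(void)
{
	lock_map_acquire(&fork_dep_map);
	lock_map_release(&fork_dep_map);
}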

Thanks.

--
tejun

2016-04-25 19:45:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: relocate charge moving from ->attach to ->post_attach

Applied 1-2 to cgroup/for-4.6-fixes.

Thanks.

--
tejun