Date: Wed, 13 Apr 2016 11:42:16 +0200
From: Petr Mladek
To: cgroups@vger.kernel.org
Cc: Tejun Heo, Michal Hocko, Cyril Hrubis, linux-kernel@vger.kernel.org
Subject: [BUG] cgroup/workqueues/fork: deadlock when moving cgroups
Message-ID: <20160413094216.GC5774@pathway.suse.cz>

Hi,

Cyril reported a system lockup when running the
memcg_move_charge_at_immigrate_test.sh test[*] repeatedly. I have also
reproduced it with plain 4.6-rc3.

There seems to be a deadlock involving four processes. It makes the
system unable to fork any new processes. I had to use the Alt-SysRq
commands to get the debugging information.

Please find the entire log at http://pastebin.com/pL9eKsFb

The interesting part is the lockdep output:

[ 409.669856] Showing all locks held in the system:
[ 409.669856] 2 locks held by systemd/1:
[ 409.669856] #0: (&p->lock){+.+.+.}, at: [] seq_read+0x3d/0x3a0
[ 409.669856] #1: (cgroup_mutex){+.+.+.}, at: [] proc_cgroup_show+0x4e/0x300
[ 409.669856] 1 lock held by kthreadd/2:
[ 409.669856] #0: (&cgroup_threadgroup_rwsem){++++++}, at: [] copy_process.part.30+0x555/0x1d30
[ 409.669856] 3 locks held by kworker/0:2/114:
[ 409.669856] #0: ("cgroup_destroy"){.+.+..}, at: [] process_one_work+0x151/0x6b0
[ 409.669856] #1: ((&css->destroy_work)){+.+...}, at: [] process_one_work+0x151/0x6b0
[ 409.669856] #2: (cgroup_mutex){+.+.+.}, at: [] css_release_work_fn+0x2d/0xf0
[ 409.669856] 1 lock held by kworker/0:3/588:
[ 409.669856] #0: (&pool->manager_arb){+.+.+.}, at: [] worker_thread+0x2ec/0x490
[ 409.669856] 1 lock held by in:imklog/816:
[ 409.669856] #0: (&f->f_pos_lock){+.+.+.}, at: [] __fdget_pos+0x43/0x50
[ 409.669856] 6 locks held by memcg_move_char/2860:
[ 409.669856] #0: (sb_writers#6){.+.+.+}, at: [] __sb_start_write+0xd1/0xf0
[ 409.669856] #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0x66/0x190
[ 409.669856] #2: (cgroup_mutex){+.+.+.}, at: [] cgroup_kn_lock_live+0x5c/0x1f0
[ 409.669856] #3: (&cgroup_threadgroup_rwsem){++++++}, at: [] percpu_down_write+0x24/0xd0
[ 409.669856] #4: (lock#4){+.+...}, at: [] lru_add_drain_all+0x34/0x1a0
[ 409.669856] #5: (cpu_hotplug.lock){++++++}, at: [] get_online_cpus+0x17/0x70

The problematic processes seem to be:

1. memcg_move_char/2860:
========================

It is the process from the test suite. It is waiting in
lru_add_drain_all() when calling flush_work() on CPU0[**]. Note that it
holds many locks, including 'cgroup_mutex' and 'cgroup_threadgroup_rwsem'.

[ 409.669856] memcg_move_char D ffff8800ab82b978 0 2860 1438 0x00000000
[ 409.669856] ffff8800ab82b978 ffffffff8244c908 ffff880036d44800 ffff880036a0d1c0
[ 409.669856] ffff8800ab82c000 ffff8800ab82bb00 ffff8800ab82baf8 ffff880036a0d1c0
[ 409.669856] 0000000000000004 ffff8800ab82b990 ffffffff8195c77c 7fffffffffffffff
[ 409.669856] Call Trace:
[ 409.669856] [] schedule+0x3c/0x90
[ 409.669856] [] schedule_timeout+0x294/0x480
[ 409.669856] [] ? mark_held_locks+0x66/0x90
[ 409.669856] [] ? _raw_spin_unlock_irq+0x2c/0x50
[ 409.669856] [] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [] wait_for_completion+0x96/0x100
[ 409.669856] [] ? wake_up_q+0x80/0x80
[ 409.669856] [] flush_work+0x1ca/0x270
[ 409.669856] [] ? flush_work+0x22e/0x270
[ 409.669856] [] ? destroy_worker+0xd0/0xd0
[ 409.669856] [] lru_add_drain_all+0x15b/0x1a0
[ 409.669856] [] mem_cgroup_move_task+0x5b/0xe0
[ 409.669856] [] ? mem_cgroup_move_account+0x290/0x290
[ 409.669856] [] cgroup_taskset_migrate+0x1ee/0x360
[ 409.669856] [] cgroup_migrate+0xf5/0x190
[ 409.669856] [] ? cgroup_migrate+0x5/0x190
[ 409.669856] [] cgroup_attach_task+0x205/0x230
[ 409.669856] [] ? cgroup_attach_task+0x42/0x230
[ 409.669856] [] __cgroup_procs_write.isra.30+0x2bd/0x470
[ 409.669856] [] ? __cgroup_procs_write.isra.30+0x5f/0x470
[ 409.669856] [] cgroup_tasks_write+0x10/0x20
[ 409.669856] [] cgroup_file_write+0x3e/0x1b0
[ 409.669856] [] kernfs_fop_write+0x144/0x190
[ 409.669856] [] __vfs_write+0x28/0xe0
[ 409.669856] [] ? percpu_down_read+0x44/0x80
[ 409.669856] [] ? __sb_start_write+0xd1/0xf0
[ 409.669856] [] ? __sb_start_write+0xd1/0xf0
[ 409.669856] [] vfs_write+0xa2/0x1a0
[ 409.669856] [] ? __fget_light+0x66/0x90
[ 409.669856] [] SyS_write+0x49/0xa0
[ 409.669856] [] entry_SYSCALL_64_fastpath+0x1f/0xbd
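Just to make the dependency on the workqueues obvious: lru_add_drain_all()
queues a drain work item on every CPU that has something in its per-cpu
pagevecs and then waits for all of them. A minimal sketch of the pattern,
as I read it in mm/swap.c (simplified; cpu_needs_drain() is only my
shorthand for the real pagevec_count() checks, it does not exist under
this name):

void lru_add_drain_all(void)
{
	static DEFINE_MUTEX(lock);		/* "lock#4" in the lockdep output */
	static struct cpumask has_work;
	int cpu;

	mutex_lock(&lock);
	get_online_cpus();			/* "cpu_hotplug.lock" above */
	cpumask_clear(&has_work);

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		/* stands for the real checks of the per-cpu pagevecs */
		if (cpu_needs_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			schedule_work_on(cpu, work);
			cpumask_set_cpu(cpu, &has_work);
		}
	}

	/*
	 * This is where memcg_move_char/2860 sleeps: the work queued on
	 * CPU0 never runs because the only running worker there is
	 * blocked on cgroup_mutex and a new worker cannot be forked.
	 */
	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));

	put_online_cpus();
	mutex_unlock(&lock);
}

So the flush depends on a free worker on every CPU that has pending
pagevecs, and that is exactly what is missing on CPU0 here.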
2. kworker/0:2/114:
===================

It is a worker on CPU0 that is blocked in css_release_work_fn(). It is
unable to get 'cgroup_mutex', which is already owned by
'memcg_move_char/2860'.

[ 409.669856] kworker/0:2 D ffff880036d4bcd8 0 114 2 0x00000000
[ 409.669856] Workqueue: cgroup_destroy css_release_work_fn
[ 409.669856] ffff880036d4bcd8 00000000ffffffff ffff8800acd40100 ffff880036d44800
[ 409.669856] ffff880036d4c000 ffffffff81e94280 0000000000000246 ffff880036d44800
[ 409.669856] 00000000ffffffff ffff880036d4bcf0 ffffffff8195c77c ffffffff81e94288
[ 409.669856] Call Trace:
[ 409.669856] [] schedule+0x3c/0x90
[ 409.669856] [] schedule_preempt_disabled+0x15/0x20
[ 409.669856] [] mutex_lock_nested+0x169/0x3f0
[ 409.669856] [] ? css_release_work_fn+0x2d/0xf0
[ 409.669856] [] css_release_work_fn+0x2d/0xf0
[ 409.669856] [] process_one_work+0x1cc/0x6b0
[ 409.669856] [] ? process_one_work+0x151/0x6b0
[ 409.669856] [] worker_thread+0x12b/0x490
[ 409.669856] [] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [] kthread+0xe4/0x100
[ 409.669856] [] ret_from_fork+0x22/0x50
[ 409.669856] [] ? kthread_create_on_node+0x200/0x200

3. kworker/0:3/588:
===================

It tries to create another worker on CPU0 to process the pending work
items. It waits until the kthread is created by kthreadd.

[ 409.669856] kworker/0:3 D ffff8800ad8ebb68 0 588 2 0x00000000
[ 409.669856] ffff8800ad8ebb68 ffffffff82435ae8 ffff88012a648040 ffff8800acd40100
[ 409.669856] ffff8800ad8ec000 ffff8800ad8ebcc8 ffff8800ad8ebcc0 ffff8800acd40100
[ 409.669856] ffff8800acd40100 ffff8800ad8ebb80 ffffffff8195c77c 7fffffffffffffff
[ 409.669856] Call Trace:
[ 409.669856] [] schedule+0x3c/0x90
[ 409.669856] [] schedule_timeout+0x294/0x480
[ 409.669856] [] ? mark_held_locks+0x66/0x90
[ 409.669856] [] ? _raw_spin_unlock_irq+0x2c/0x50
[ 409.669856] [] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [] wait_for_completion_killable+0xab/0x170
[ 409.669856] [] ? wake_up_q+0x80/0x80
[ 409.669856] [] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [] kthread_create_on_node+0x15a/0x200
[ 409.669856] [] ? create_object+0x238/0x2e0
[ 409.669856] [] ? snprintf+0x39/0x40
[ 409.669856] [] create_worker+0xbb/0x190
[ 409.669856] [] worker_thread+0x335/0x490
[ 409.669856] [] ? process_one_work+0x6b0/0x6b0
[ 409.669856] [] kthread+0xe4/0x100
[ 409.669856] [] ret_from_fork+0x22/0x50
[ 409.669856] [] ? kthread_create_on_node+0x200/0x200

4. kthreadd/2:
==============

It is the kthread that is responsible for forking other kthreads. It is
blocked in copy_process() when calling threadgroup_change_begin()
because 'cgroup_threadgroup_rwsem' is held by 'memcg_move_char/2860'.

[ 409.669856] kthreadd D ffff88012a64fc20 0 2 0 0x00000000
[ 409.669856] ffff88012a64fc20 ffffffff830aedd0 ffffffff81e0b540 ffff88012a648040
[ 409.669856] ffff88012a650000 ffffffff830aedb8 ffffffff830aedd0 ffff88012a648040
[ 409.669856] 0000000000000000 ffff88012a64fc38 ffffffff8195c77c ffff88012a648040
[ 409.669856] Call Trace:
[ 409.669856] [] schedule+0x3c/0x90
[ 409.669856] [] rwsem_down_read_failed+0xcb/0x120
[ 409.669856] [] call_rwsem_down_read_failed+0x18/0x30
[ 409.669856] [] percpu_down_read+0x5e/0x80
[ 409.669856] [] ? copy_process.part.30+0x555/0x1d30
[ 409.669856] [] copy_process.part.30+0x555/0x1d30
[ 409.669856] [] ? __lock_acquire+0xab0/0x1ad0
[ 409.669856] [] ? kthread_create_on_node+0x200/0x200
[ 409.669856] [] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 409.669856] [] _do_fork+0xdc/0x6c0
[ 409.669856] [] ? kthreadd+0x2c5/0x380
[ 409.669856] [] kernel_thread+0x29/0x30
[ 409.669856] [] kthreadd+0x321/0x380
[ 409.669856] [] ? ret_from_fork+0x22/0x50
[ 409.669856] [] ret_from_fork+0x22/0x50
[ 409.669856] [] ? kthread_create_on_cpu+0x60/0x60
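To make the cycle easier to see, here are the relevant lock calls from
the backtraces above squeezed into one place (not exact code, just a
compressed illustration):

	/* memcg_move_char/2860 in __cgroup_procs_write() and below: */
	mutex_lock(&cgroup_mutex);			/* lock #2 above */
	percpu_down_write(&cgroup_threadgroup_rwsem);	/* lock #3 above */
	lru_add_drain_all();				/* sleeps in flush_work() */

	/* kworker/0:2/114, the only running worker on CPU0,
	 * in css_release_work_fn(): */
	mutex_lock(&cgroup_mutex);			/* blocks on lock #2 */

	/* kthreadd/2, forking the new worker requested by kworker/0:3,
	 * in copy_process(): */
	threadgroup_change_begin(current);
	/* == percpu_down_read(&cgroup_threadgroup_rwsem); blocks on
	 *    lock #3 => the new worker is never created => the drain
	 *    work on CPU0 never runs => flush_work() never returns */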
In other words, "memcg_move_char/2860" flushes a work item, but the work
cannot get flushed because one worker is blocked and another one cannot
be created. All of these operations are blocked by the very same
"memcg_move_char/2860".

Note that "systemd/1" is also waiting for "cgroup_mutex" in
proc_cgroup_show(), but it does not seem to be part of the main cycle
causing the deadlock.

I am able to reproduce this problem quite easily (within a few minutes).
There are often even more tasks waiting for the cgroup-related locks,
but they are not causing the deadlock.

The question is how to solve this problem. I see several possibilities:

  + avoid using workqueues in lru_add_drain_all()

  + make lru_add_drain_all() killable and restartable

  + do not block fork() while lru_add_drain_all() is running,
    e.g. using some lazy techniques like RCU, workqueues

  + at least do not block forking of workers; AFAIK, they have only
    limited cgroups usage anyway because they are marked with
    PF_NO_SETAFFINITY (a completely untested sketch of this idea is
    below my signature)

I am willing to test any potential fix or even work on the fix. But I do
not have deep insight into the problem, so I would need some pointers.

[*] It comes from the LTP test suite. The relevant sources can be found at

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_move_charge_at_immigrate_test.sh
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_lib.sh
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/controllers/memcg/functional/memcg_process.c

[**] Some of the details are not visible in the provided log. We first
reproduced this problem with an internal kernel based on upstream 4.4,
and I checked the details using a crash dump. Then I reproduced the same
deadlock with plain 4.6-rc3 and used the logs from that kernel in this
mail.

Best Regards,
Petr
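PS: To illustrate what I mean by the last option above, a completely
untested and probably too naive sketch (the check is surely not the
final one, and the matching threadgroup_change_end() would need the same
treatment):

	/* in copy_process(): let kthreadd (it has PF_KTHREAD set) fork
	 * new kernel threads without waiting for
	 * cgroup_threadgroup_rwsem, so that creating workqueue workers
	 * does not depend on a cgroup migration in progress */
	if (!(current->flags & PF_KTHREAD))
		threadgroup_change_begin(current);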