Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932700AbaFLH3W (ORCPT ); Thu, 12 Jun 2014 03:29:22 -0400 Received: from cn.fujitsu.com ([59.151.112.132]:45072 "EHLO heian.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-FAIL) by vger.kernel.org with ESMTP id S932161AbaFLH3T (ORCPT ); Thu, 12 Jun 2014 03:29:19 -0400 X-IronPort-AV: E=Sophos;i="5.00,694,1396972800"; d="scan'208";a="31801661" Message-ID: <53995435.1000402@cn.fujitsu.com> Date: Thu, 12 Jun 2014 15:18:13 +0800 From: Gu Zheng User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20110930 Thunderbird/7.0.1 MIME-Version: 1.0 To: Johannes Weiner CC: Tejun Heo , , Cgroups , linux-kernel , Yasuaki Ishimatsu , tangchen , Andrew Morton , Subject: Re: [stable-3.10.y] possible unsafe locking warning References: <5385B52A.7050106@cn.fujitsu.com> <20140528154856.GD1419@htj.dyndns.org> <539003C6.9060308@cn.fujitsu.com> <20140605132400.GW2878@cmpxchg.org> In-Reply-To: <20140605132400.GW2878@cmpxchg.org> Content-Type: text/plain; charset="ISO-8859-1" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.167.226.100] Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Johannes, Sorry for late. The patch works well with stable kernel-3.10.y, the warning is gone. Thank you and Tejun very much.:) Regards, Gu On 06/05/2014 09:24 PM, Johannes Weiner wrote: > Hi, > > [cc'ing Andrew and linux-mm for patch review and inclusion] > > On Thu, Jun 05, 2014 at 01:44:38PM +0800, Gu Zheng wrote: >> Hi Tejun, >> Sorry for late replay. >> On 05/28/2014 11:48 PM, Tejun Heo wrote: >> >>> (cc'ing Johannes for mm-foo) >>> >>> Hello, >>> >>> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: >>>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. >>>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: >>>> [ 2457.824102] (&sig->group_rwsem){+++++?}, at: [] exit_signals+0x24/0x130 >>>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at: >>>> [ 2457.985055] [] mark_held_locks+0xb9/0x140 >>>> [ 2458.053976] [] lockdep_trace_alloc+0x7a/0xe0 >>>> [ 2458.126015] [] kmem_cache_alloc_trace+0x37/0x240 >>>> [ 2458.202214] [] flex_array_alloc+0x99/0x1a0 >>>> [ 2458.272175] [] cgroup_attach_task+0x63/0x430 >>>> [ 2458.344214] [] attach_task_by_pid+0x210/0x280 >>>> [ 2458.417294] [] cgroup_procs_write+0x16/0x20 >>>> [ 2458.488287] [] cgroup_file_write+0x120/0x2c0 >>>> [ 2458.560320] [] vfs_write+0xc0/0x1f0 >>>> [ 2458.622994] [] SyS_write+0x4c/0xa0 >>>> [ 2458.684618] [] tracesys+0xdd/0xe2 >>>> [ 2458.745214] irq event stamp: 49 >>>> [ 2458.782794] hardirqs last enabled at (49): [] _raw_spin_unlock_irqrestore+0x36/0x70 >>>> [ 2458.894388] hardirqs last disabled at (48): [] _raw_spin_lock_irqsave+0x2b/0xa0 >>>> [ 2459.000771] softirqs last enabled at (0): [] copy_process.part.24+0x627/0x15f0 >>>> [ 2459.107161] softirqs last disabled at (0): [< (null)>] (null) >>>> [ 2459.195852] >>>> [ 2459.195852] other info that might help us debug this: >>>> [ 2459.274024] Possible unsafe locking scenario: >>>> [ 2459.274024] >>>> [ 2459.344911] CPU0 >>>> [ 2459.374161] ---- >>>> [ 2459.403408] lock(&sig->group_rwsem); >>>> [ 2459.448490] >>>> [ 2459.479825] lock(&sig->group_rwsem); >>>> [ 2459.526979] >>>> [ 2459.526979] *** DEADLOCK *** >>>> [ 2459.526979] >>>> [ 2459.597866] no locks held by kswapd2/1151. >>>> [ 2459.646896] >>>> [ 2459.646896] stack backtrace: >>>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4 >>>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014 >>>> [ 2459.895983] ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48 >>>> [ 2459.985003] ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001 >>>> [ 2460.074024] 000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009 >>>> [ 2460.163087] Call Trace: >>>> [ 2460.192345] [] dump_stack+0x19/0x1b >>>> [ 2460.253874] [] print_usage_bug+0x1f7/0x208 >>>> [ 2460.399807] [] mark_lock+0x21d/0x2a0 >>>> [ 2460.462369] [] __lock_acquire+0x52a/0xb60 >>>> [ 2460.735516] [] lock_acquire+0xa2/0x140 >>>> [ 2460.935691] [] down_read+0x51/0xa0 >>>> [ 2461.062888] [] exit_signals+0x24/0x130 >>>> [ 2461.127536] [] do_exit+0xb5/0xa50 >>>> [ 2461.320433] [] kthread+0xdb/0x100 >>>> [ 2461.532049] [] ret_from_fork+0x7c/0xb0 >>> >>> The lockdep warning is about threadgroup_lock being grabbed by kswapd >>> which is depended upon during memory reclaim when the lock may be held >>> by tasks which may wait on memory reclaim. From the backtrace, it >>> looks like the right thing to do is marking the kswapd that it's no >>> longer a memory reclaimer once before it starts exiting. > > Yeah, that makes sense. In fact, we can reset *all* the > reclaim-specific per-task states the second it stops performing > reclaim work. > >>>> And when reference to the related code(kernel-3.10.y), it seems that cgroup_attach_task(thread-2, >>>> attach kswapd) trigger kswapd(reclaim memory?) when trying to alloc memory(flex_array_alloc) under >>>> the protection of sig->group_rwsem, but meanwhile the kswapd(thread-1) is in the exit routine >>>> (because it was marked SHOULD STOP when offline pages completed), which needs to acquire >>>> sig->group_rwsem in exit_signals(), so the deadlock occurs. >>>> >>>> thread-1 | thread-2 >>>> | >>>> __offline_pages(): | system_call_fastpath() >>>> |-> kswapd_stop(node); | |-> ...... >>>> |-> kthread_stop(kswapd) | |-> cgroup_file_write() >>>> |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags); | |-> ...... >>>> |-> wake_up_process(k) | |-> attach_task_by_pid() >>>> | | |-> threadgroup_lock(tsk) >>>> |<----------| | // Here, got the lock. >>>> |-> kswapd() | |-> ... >>>> |-> if (kthread_should_stop()) | |-> cgroup_attach_task() >>>> return; | |-> flex_array_alloc() >>>> | | |-> kzalloc() >>>> |<----------| | |-> wait for kswapd to reclaim memory >>>> |-> kthread() | >>>> |-> do_exit(ret) | >>>> |-> exit_signals() | >>>> |-> threadgroup_change_begin(tsk) | >>>> |-> down_read(&tsk->signal->group_rwsem) | >>>> // Here, acquire the lock. >>>> >>>> If my analysis is correct, the latest kernel may have the same issue, though the flex_array was replaced >>>> by list, but we still need to alloc memory(e.g. in find_css_set()), so the race may still occur. >>>> Any comments about this? If I missed something, please correct me.:) >>> >>> Not sure whether this can actually happen but if so the right fix >>> would be making thread-2 not wait for kswapd which is exiting and can >>> no longer serve as memory reclaimer. > > There is never a direct wait for a specific kswapd thread in the > waitqueue sense. The allocator wakes up the kswapds for all nodes > allowed in the allocation, then retries the allocation a few times in > the hope that kswapd does something before entering reclaim itself. > > How far back do we need this in stable? > > --- > >>From c3d76e3c208bc90b64b804ffefa114b920cab47e Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Thu, 5 Jun 2014 08:37:01 -0400 > Subject: [patch] mm: vmscan: clear kswapd's special reclaim powers before > exiting > > When kswapd exits, it can end up taking locks that were previously > held by allocating tasks while they waited for reclaim. Lockdep > currently warns about this: > > On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote: >> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage. >> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes: >> [ 2457.824102] (&sig->group_rwsem){+++++?}, at: [] exit_signals+0x24/0x130 >> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at: >> [ 2457.985055] [] mark_held_locks+0xb9/0x140 >> [ 2458.053976] [] lockdep_trace_alloc+0x7a/0xe0 >> [ 2458.126015] [] kmem_cache_alloc_trace+0x37/0x240 >> [ 2458.202214] [] flex_array_alloc+0x99/0x1a0 >> [ 2458.272175] [] cgroup_attach_task+0x63/0x430 >> [ 2458.344214] [] attach_task_by_pid+0x210/0x280 >> [ 2458.417294] [] cgroup_procs_write+0x16/0x20 >> [ 2458.488287] [] cgroup_file_write+0x120/0x2c0 >> [ 2458.560320] [] vfs_write+0xc0/0x1f0 >> [ 2458.622994] [] SyS_write+0x4c/0xa0 >> [ 2458.684618] [] tracesys+0xdd/0xe2 >> [ 2458.745214] irq event stamp: 49 >> [ 2458.782794] hardirqs last enabled at (49): [] _raw_spin_unlock_irqrestore+0x36/0x70 >> [ 2458.894388] hardirqs last disabled at (48): [] _raw_spin_lock_irqsave+0x2b/0xa0 >> [ 2459.000771] softirqs last enabled at (0): [] copy_process.part.24+0x627/0x15f0 >> [ 2459.107161] softirqs last disabled at (0): [< (null)>] (null) >> [ 2459.195852] >> [ 2459.195852] other info that might help us debug this: >> [ 2459.274024] Possible unsafe locking scenario: >> [ 2459.274024] >> [ 2459.344911] CPU0 >> [ 2459.374161] ---- >> [ 2459.403408] lock(&sig->group_rwsem); >> [ 2459.448490] >> [ 2459.479825] lock(&sig->group_rwsem); >> [ 2459.526979] >> [ 2459.526979] *** DEADLOCK *** >> [ 2459.526979] >> [ 2459.597866] no locks held by kswapd2/1151. >> [ 2459.646896] >> [ 2459.646896] stack backtrace: >> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4 >> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014 >> [ 2459.895983] ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48 >> [ 2459.985003] ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001 >> [ 2460.074024] 000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009 >> [ 2460.163087] Call Trace: >> [ 2460.192345] [] dump_stack+0x19/0x1b >> [ 2460.253874] [] print_usage_bug+0x1f7/0x208 >> [ 2460.399807] [] mark_lock+0x21d/0x2a0 >> [ 2460.462369] [] __lock_acquire+0x52a/0xb60 >> [ 2460.735516] [] lock_acquire+0xa2/0x140 >> [ 2460.935691] [] down_read+0x51/0xa0 >> [ 2461.062888] [] exit_signals+0x24/0x130 >> [ 2461.127536] [] do_exit+0xb5/0xa50 >> [ 2461.320433] [] kthread+0xdb/0x100 >> [ 2461.532049] [] ret_from_fork+0x7c/0xb0 > > This is because the kswapd thread is still marked as a reclaimer at > the time of exit. But because it is exiting, nobody is actually > waiting on it to make reclaim progress anymore, and it's nothing but a > regular thread at this point. Be tidy and strip it of all its powers > (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state) > before returning from the thread function. > > Signed-off-by: Johannes Weiner > --- > mm/vmscan.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9a63d13739a6..4ac2eab860d2 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -3425,7 +3425,10 @@ static int kswapd(void *p) > } > } > > + tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD); > current->reclaim_state = NULL; > + lockdep_clear_current_reclaim_state(); > + > return 0; > } > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/