Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751239Ab3HSSxS (ORCPT ); Mon, 19 Aug 2013 14:53:18 -0400 Received: from mail-pa0-f47.google.com ([209.85.220.47]:65129 "EHLO mail-pa0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751019Ab3HSSxP (ORCPT ); Mon, 19 Aug 2013 14:53:15 -0400 Date: Mon, 19 Aug 2013 11:52:54 -0700 (PDT) From: Hugh Dickins X-X-Sender: hugh@eggly.anvils To: Johannes Berg cc: Ding Tianhong , Linus Torvalds , Greg KH , "David S. Miller" , "Otcheretianski, Andrei" , "linux-kernel@vger.kernel.org" , "netdev@vger.kernel.org" , "stable@vger.kernel.org" , Pravin B Shelar , Thomas Graf Subject: Re: 3.11-rc6 genetlink locking fix offends lockdep In-Reply-To: <1376911362.14734.11.camel@jlt4.sipsolutions.net> Message-ID: References: <1376899214.14734.6.camel@jlt4.sipsolutions.net> <5211FAB3.7080400@huawei.com> <1376911362.14734.11.camel@jlt4.sipsolutions.net> User-Agent: Alpine 2.00 (LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7932 Lines: 152 On Mon, 19 Aug 2013, Johannes Berg wrote: > On Mon, 2013-08-19 at 19:00 +0800, Ding Tianhong wrote: > > On 2013/8/19 16:00, Johannes Berg wrote: > > > > > >> 3.11-rc6's commit 58ad436fcf49 ("genetlink: fix family dump race") > > >> gives me the lockdep trace below at startup. > > > > > > Hmm. Yes, I see now how this happens, not sure why I didn't run into it. > > > > > > The problem is that genl_family_rcv_msg() is called with the genl_lock > > > held, and then calls netlink_dump_start() with it held, creating a > > > genl_lock->cb_mutex dependency, but obviously the dump continuation is > > > the other way around. > > > > > > We could use the semaphore instead, I believe, but I don't really > > > understand the mutex vs. semaphore well enough to be sure that's > > > correct. > > > > > > johannes > > > > > it is useless, the logic need to modify or otherwise it will still call lockdep trace. > > I don't believe so, the semaphore and cb_mutex don't have a dependency > yet, afaict. The down_read(&cb_lock) patch you suggested gives the lockdep trace below. > > > maybe i could send a patch for it, if you wish. > > What do you mean? > > johannes [ 4.027797] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.129749] e1000e 0000:00:19.0: irq 43 for MSI/MSI-X [ 4.130179] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready [ 4.134629] [ 4.134646] ====================================================== [ 4.134680] [ INFO: possible circular locking dependency detected ] [ 4.134714] 3.11.0-rc6 #3 Not tainted [ 4.134735] ------------------------------------------------------- [ 4.134767] NetworkManager/357 is trying to acquire lock: [ 4.134797] (cb_lock){++++++}, at: [] ctrl_dumpfamily+0x38/0x108 [ 4.134853] [ 4.134853] but task is already holding lock: [ 4.134883] (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x1c/0x1d7 [ 4.134938] [ 4.134938] which lock already depends on the new lock. [ 4.134938] [ 4.134981] [ 4.134981] the existing dependency chain (in reverse order) is: [ 4.135020] [ 4.135020] -> #2 (nlk->cb_mutex){+.+.+.}: [ 4.135056] [] __lock_acquire+0x865/0x956 [ 4.135094] [] lock_acquire+0x57/0x6d [ 4.135129] [] mutex_lock_nested+0x5e/0x345 [ 4.135167] [] __netlink_dump_start+0xae/0x14e [ 4.135205] [] genl_rcv_msg+0xf4/0x252 [ 4.135239] [] netlink_rcv_skb+0x3e/0x8c [ 4.135275] [] genl_rcv+0x24/0x34 [ 4.135307] [] netlink_unicast+0xed/0x17a [ 4.135342] [] netlink_sendmsg+0x2fb/0x345 [ 4.135378] [] sock_sendmsg+0x79/0x8e [ 4.135412] [] ___sys_sendmsg+0x231/0x2be [ 4.135448] [] __sys_sendmsg+0x3d/0x5e [ 4.135483] [] SyS_sendmsg+0xd/0x19 [ 4.135517] [] system_call_fastpath+0x16/0x1b [ 4.135555] [ 4.135555] -> #1 (genl_mutex){+.+.+.}: [ 4.135589] [] __lock_acquire+0x865/0x956 [ 4.135626] [] lock_acquire+0x57/0x6d [ 4.135661] [] mutex_lock_nested+0x5e/0x345 [ 4.135697] [] genl_lock+0x12/0x14 [ 4.135730] [] genl_lock_all+0x15/0x17 [ 4.135763] [] genl_register_family+0x51/0x142 [ 4.135801] [] genl_register_family_with_ops+0x23/0x70 [ 4.135842] [] genl_init+0x41/0x80 [ 4.135876] [] do_one_initcall+0x7f/0x108 [ 4.135912] [] kernel_init_freeable+0x106/0x195 [ 4.135951] [] kernel_init+0x9/0xd1 [ 4.135985] [] ret_from_fork+0x7c/0xb0 [ 4.136019] [ 4.136019] -> #0 (cb_lock){++++++}: [ 4.136052] [] validate_chain.isra.21+0x836/0xe8e [ 4.136091] [] __lock_acquire+0x865/0x956 [ 4.136127] [] lock_acquire+0x57/0x6d [ 4.136161] [] down_read+0x42/0x57 [ 4.136194] [] ctrl_dumpfamily+0x38/0x108 [ 4.136230] [] netlink_dump+0x88/0x1d7 [ 4.136264] [] netlink_recvmsg+0x1b1/0x2d1 [ 4.137410] [] sock_recvmsg+0x83/0x98 [ 4.138459] [] ___sys_recvmsg+0x15d/0x207 [ 4.139692] [] __sys_recvmsg+0x3d/0x5e [ 4.140918] [] SyS_recvmsg+0xd/0x19 [ 4.141975] [] system_call_fastpath+0x16/0x1b [ 4.143042] [ 4.143042] other info that might help us debug this: [ 4.143042] [ 4.146007] Chain exists of: [ 4.146007] cb_lock --> genl_mutex --> nlk->cb_mutex [ 4.146007] [ 4.148919] Possible unsafe locking scenario: [ 4.148919] [ 4.150955] CPU0 CPU1 [ 4.152060] ---- ---- [ 4.153030] lock(nlk->cb_mutex); [ 4.153950] lock(genl_mutex); [ 4.154912] lock(nlk->cb_mutex); [ 4.155832] lock(cb_lock); [ 4.156732] [ 4.156732] *** DEADLOCK *** [ 4.156732] [ 4.159477] 1 lock held by NetworkManager/357: [ 4.160354] #0: (nlk->cb_mutex){+.+.+.}, at: [] netlink_dump+0x1c/0x1d7 [ 4.161411] [ 4.161411] stack backtrace: [ 4.163489] CPU: 0 PID: 357 Comm: NetworkManager Not tainted 3.11.0-rc6 #3 [ 4.164527] Hardware name: LENOVO 4174EH1/4174EH1, BIOS 8CET51WW (1.31 ) 11/29/2011 [ 4.165491] ffffffff81d0a450 ffff88022bd61938 ffffffff8157cf90 0000000000000006 [ 4.166520] ffffffff81cc85a0 ffff88022bd61988 ffffffff8157a8a8 ffff880200000001 [ 4.167486] ffff8802313ae080 ffff8802313ae080 ffff8802313ae750 ffff8802313ae080 [ 4.168532] Call Trace: [ 4.169491] [] dump_stack+0x4f/0x84 [ 4.170532] [] print_circular_bug+0x2ad/0x2be [ 4.171507] [] validate_chain.isra.21+0x836/0xe8e [ 4.172548] [] __lock_acquire+0x865/0x956 [ 4.173525] [] lock_acquire+0x57/0x6d [ 4.174536] [] ? ctrl_dumpfamily+0x38/0x108 [ 4.175535] [] down_read+0x42/0x57 [ 4.176523] [] ? ctrl_dumpfamily+0x38/0x108 [ 4.177564] [] ctrl_dumpfamily+0x38/0x108 [ 4.178556] [] ? __alloc_skb+0x97/0x1a0 [ 4.179611] [] netlink_dump+0x88/0x1d7 [ 4.180594] [] netlink_recvmsg+0x1b1/0x2d1 [ 4.181645] [] sock_recvmsg+0x83/0x98 [ 4.182628] [] ? might_fault+0x52/0xa2 [ 4.183683] [] ___sys_recvmsg+0x15d/0x207 [ 4.184668] [] ? __lock_acquire+0x865/0x956 [ 4.185731] [] ? fget_light+0x35c/0x377 [ 4.186707] [] ? fget_light+0x164/0x377 [ 4.187744] [] __sys_recvmsg+0x3d/0x5e [ 4.188709] [] ? sock_def_write_space+0x1b5/0x1b5 [ 4.189755] [] SyS_recvmsg+0xd/0x19 [ 4.190729] [] system_call_fastpath+0x16/0x1b [ 4.192674] iwlwifi 0000:03:00.0: L1 Enabled; Disabling L0S [ 4.192887] iwlwifi 0000:03:00.0: Radio type=0x0-0x3-0x1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/