Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1422728Ab2KNMGk (ORCPT ); Wed, 14 Nov 2012 07:06:40 -0500
Received: from mx2.parallels.com ([64.131.90.16]:48150
        "EHLO mx2.parallels.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S1755341Ab2KNMGi (ORCPT );
        Wed, 14 Nov 2012 07:06:38 -0500
Message-ID: <50A38936.2010406@parallels.com>
Date: Wed, 14 Nov 2012 16:06:14 +0400
From: Glauber Costa
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:16.0) Gecko/20121016 Thunderbird/16.0.1
MIME-Version: 1.0
To: Sasha Levin
CC: Sasha Levin , , , Andrew Morton , , Johannes Weiner , Tejun Heo ,
        Michal Hocko , Christoph Lameter , Pekka Enberg , David Rientjes ,
        Pekka Enberg , Suleiman Souhlal , Dave Jones
Subject: Re: [PATCH v6 28/29] slub: slub-specific propagation changes.
References: <1351771665-11076-1-git-send-email-glommer@parallels.com>
        <1351771665-11076-29-git-send-email-glommer@parallels.com>
        <509A83F8.6040402@oracle.com> <509B5673.8020801@parallels.com>
        <509C7A77.3020206@gmail.com>
In-Reply-To: <509C7A77.3020206@gmail.com>
Content-Type: text/plain; charset="ISO-8859-1"
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 6262
Lines: 122

On 11/09/2012 07:37 AM, Sasha Levin wrote:
> On 11/08/2012 01:51 AM, Glauber Costa wrote:
>> On 11/07/2012 04:53 PM, Sasha Levin wrote:
>>> On 11/01/2012 08:07 AM, Glauber Costa wrote:
>>>> SLUB allows us to tune a particular cache behavior with sysfs-based
>>>> tunables. When creating a new memcg cache copy, we'd like to preserve
>>>> any tunables the parent cache already had.
>>>>
>>>> This can be done by tapping into the store attribute function provided
>>>> by the allocator. We of course don't need to mess with read-only
>>>> fields.
>>>> Since the attributes can have multiple types and are stored
>>>> internally by sysfs, the best strategy is to issue a ->show() in the
>>>> root cache, and then ->store() in the memcg cache.
>>>>
>>>> The drawback of that, is that sysfs can allocate up to a page in
>>>> buffering for show(), that we are likely not to need, but also can't
>>>> guarantee. To avoid always allocating a page for that, we can update the
>>>> caches at store time with the maximum attribute size ever stored to the
>>>> root cache. We will then get a buffer big enough to hold it. The
>>>> corolary to this, is that if no stores happened, nothing will be
>>>> propagated.
>>>>
>>>> It can also happen that a root cache has its tunables updated during
>>>> normal system operation. In this case, we will propagate the change to
>>>> all caches that are already active.
>>>>
>>>> Signed-off-by: Glauber Costa
>>>> CC: Christoph Lameter
>>>> CC: Pekka Enberg
>>>> CC: Michal Hocko
>>>> CC: Kamezawa Hiroyuki
>>>> CC: Johannes Weiner
>>>> CC: Suleiman Souhlal
>>>> CC: Tejun Heo
>>>> ---
>>>
>>> Hi guys,
>>>
>>> This patch is making lockdep angry! *bark bark*
>>>
>>> [ 351.935003] ======================================================
>>> [ 351.937693] [ INFO: possible circular locking dependency detected ]
>>> [ 351.939720] 3.7.0-rc4-next-20121106-sasha-00008-g353b62f #117 Tainted: G W
>>> [ 351.942444] -------------------------------------------------------
>>> [ 351.943528] trinity-child13/6961 is trying to acquire lock:
>>> [ 351.943528] (s_active#43){++++.+}, at: [] sysfs_addrm_finish+0x31/0x60
>>> [ 351.943528]
>>> [ 351.943528] but task is already holding lock:
>>> [ 351.943528] (slab_mutex){+.+.+.}, at: [] kmem_cache_destroy+0x22/0xe0
>>> [ 351.943528]
>>> [ 351.943528] which lock already depends on the new lock.
>>> [ 351.943528]
>>> [ 351.943528]
>>> [ 351.943528] the existing dependency chain (in reverse order) is:
>>> [ 351.943528]
>>> -> #1 (slab_mutex){+.+.+.}:
>>> [ 351.960334] [] lock_acquire+0x1aa/0x240
>>> [ 351.960334] [] __mutex_lock_common+0x59/0x5a0
>>> [ 351.960334] [] mutex_lock_nested+0x3f/0x50
>>> [ 351.960334] [] slab_attr_store+0xde/0x110
>>> [ 351.960334] [] sysfs_write_file+0xfa/0x150
>>> [ 351.960334] [] vfs_write+0xb0/0x180
>>> [ 351.960334] [] sys_pwrite64+0x60/0xb0
>>> [ 351.960334] [] tracesys+0xe1/0xe6
>>> [ 351.960334]
>>> -> #0 (s_active#43){++++.+}:
>>> [ 351.960334] [] __lock_acquire+0x14df/0x1ca0
>>> [ 351.960334] [] lock_acquire+0x1aa/0x240
>>> [ 351.960334] [] sysfs_deactivate+0x122/0x1a0
>>> [ 351.960334] [] sysfs_addrm_finish+0x31/0x60
>>> [ 351.960334] [] sysfs_remove_dir+0x89/0xd0
>>> [ 351.960334] [] kobject_del+0x16/0x40
>>> [ 351.960334] [] __kmem_cache_shutdown+0x40/0x60
>>> [ 351.960334] [] kmem_cache_destroy+0x40/0xe0
>>> [ 351.960334] [] mon_text_release+0x78/0xe0
>>> [ 351.960334] [] __fput+0x122/0x2d0
>>> [ 351.960334] [] ____fput+0x9/0x10
>>> [ 351.960334] [] task_work_run+0xbe/0x100
>>> [ 351.960334] [] do_exit+0x432/0xbd0
>>> [ 351.960334] [] do_group_exit+0x84/0xd0
>>> [ 351.960334] [] get_signal_to_deliver+0x81d/0x930
>>> [ 351.960334] [] do_signal+0x3a/0x950
>>> [ 351.960334] [] do_notify_resume+0x3e/0x90
>>> [ 351.960334] [] int_signal+0x12/0x17
>>> [ 351.960334]

First: Sorry I took so long, I had some problems on my way back from
Spain...

I just managed to reproduce it by following the callchain.
In summary:

1) When we store an attribute, we call sysfs_get_active(), which takes
   the sd->dep_map lock, where 'sd' is the specific dirent.
2) ->store() is called with that held.
3) ->store() takes the slab_mutex.
4) While destroying the cache, with the slab_mutex held, we eventually
   get to kobject_put(), which deep down in the callchain ends up in
   sysfs_addrm_finish(), which can take that lock again.
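
To make the ordering in steps 1-4 concrete, here is a minimal userspace
sketch of the inversion. It is not kernel code and not how lockdep is
implemented: the toy_* functions, the two-entry enum and the order_seen
table are all hypothetical names made up for illustration, and the
checker only notices direct A->B / B->A inversions between the two lock
classes named in the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical lock classes standing in for the two locks in the report. */
enum lock_class { S_ACTIVE, SLAB_MUTEX, NR_CLASSES };

static const char *const class_name[NR_CLASSES] = { "s_active", "slab_mutex" };

/* order_seen[a][b] is true once some path has taken b while holding a. */
static bool order_seen[NR_CLASSES][NR_CLASSES];

/* Stack of classes held by the single simulated task. */
static enum lock_class held[NR_CLASSES];
static int nr_held;

/* Record an acquisition; return false if it inverts an order seen earlier. */
bool toy_acquire(enum lock_class next)
{
    bool ok = true;

    assert(nr_held < NR_CLASSES);
    for (int i = 0; i < nr_held; i++) {
        if (order_seen[next][held[i]]) {
            printf("possible circular locking dependency: %s -> %s\n",
                   class_name[held[i]], class_name[next]);
            ok = false;
        }
        order_seen[held[i]][next] = true;
    }
    held[nr_held++] = next;
    return ok;
}

/* Drop the most recently acquired class. */
void toy_release(void)
{
    assert(nr_held > 0);
    nr_held--;
}

/* Steps 1-3: sysfs pins the dirent, then ->store() takes slab_mutex. */
bool toy_store_path(void)
{
    bool ok = toy_acquire(S_ACTIVE);

    ok = toy_acquire(SLAB_MUTEX) && ok;
    toy_release();
    toy_release();
    return ok;
}

/* Step 4: destroy holds slab_mutex, then sysfs removal needs s_active. */
bool toy_destroy_path(void)
{
    bool ok = toy_acquire(SLAB_MUTEX);

    ok = toy_acquire(S_ACTIVE) && ok;
    toy_release();
    toy_release();
    return ok;
}
```

Calling toy_store_path() first records the s_active -> slab_mutex order
and returns true; a following toy_destroy_path() takes the two in the
opposite order, so it prints the warning and returns false. Neither path
has to block for the inversion to be noticed, which is the same reason
lockdep can complain before a real deadlock happens.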

In short, creating a kmem-limited memcg, storing an attribute in the
global cache, and then deleting the memcg should trigger this. The funny
thing is that I had a test exactly like this in which it didn't trigger,
and now I know why: I was storing attributes for "dentry", which can
stay around for longer until it completely runs out of objects, which
will depend on the vmscan shrinkers kicking in. Storing to a more
short-lived cache will easily trigger this. Thanks!

During __kmem_cache_create, we drop the slab_mutex around
sysfs_slab_add. Although the justification for that is a bit different,
I think this is generally sane and the same could be done here.

I will send a patch for this - and other issues - shortly.

Thanks again, Sasha.