Date: Wed, 12 Feb 2014 16:33:17 +0000
From: Russell King - ARM Linux
To: Marek Szyprowski
Cc: linux-arm-kernel@lists.infradead.org, David Airlie,
    dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [BUG] Circular locking dependency - DRM/CMA/MM/hotplug/...
Message-ID: <20140212163317.GQ26684@n2100.arm.linux.org.uk>
References: <20140211183543.GK26684@n2100.arm.linux.org.uk> <52FB9602.1000805@samsung.com>
In-Reply-To: <52FB9602.1000805@samsung.com>

On Wed, Feb 12, 2014 at 04:40:50PM +0100, Marek Szyprowski wrote:
> Hello,
>
> On 2014-02-11 19:35, Russell King - ARM Linux wrote:
>> The cubox-i4 just hit a new lockdep problem - not quite sure what to
>> make of this - it looks like an interaction between quite a lot of
>> locks - I suspect more than the lockdep code is reporting in its
>> "Possible unsafe locking scenario" report.
>>
>> I'm hoping I've sent this to appropriate people... if anyone thinks
>> this needs to go to someone else, please forward it.  Thanks.
>
> From the attached log it looks like an issue (AB-BA deadlock) between
> device mutex (&dev->struct_mutex) and mm semaphore (&mm->mmap_sem).
> Similar issue has been discussed quite a long time ago in v4l2
> subsystem:

I think there are more locks involved than just those two.
> https://www.mail-archive.com/linux-media@vger.kernel.org/msg38599.html
> http://www.spinics.net/lists/linux-media/msg40225.html
>
> Solving it probably requires some changes in DRM core. I see no direct
> relation between this issue and CMA itself.

I don't think so - the locking in DRM is pretty sane.  Let's take a look:

>> the existing dependency chain (in reverse order) is:
>>
>> -> #5 (&dev->struct_mutex){+.+...}:
>>        [] __lock_acquire+0x151c/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] mutex_lock_nested+0x5c/0x3ac
>>        [] drm_gem_mmap+0x40/0xdc
>>        [] drm_gem_cma_mmap+0x14/0x2c
>>        [] mmap_region+0x3ac/0x59c
>>        [] do_mmap_pgoff+0x2c8/0x370
>>        [] vm_mmap_pgoff+0x6c/0x9c
>>        [] SyS_mmap_pgoff+0x54/0x98
>>        [] ret_fast_syscall+0x0/0x48

vm_mmap_pgoff() takes mm->mmap_sem before calling do_mmap_pgoff().  So,
this results in the following locking order:

	mm->mmap_sem
	dev->struct_mutex

>> -> #4 (&mm->mmap_sem){++++++}:
>>        [] __lock_acquire+0x151c/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] might_fault+0x6c/0x94
>>        [] con_set_unimap+0x158/0x27c
>>        [] vt_ioctl+0x1298/0x1388
>>        [] tty_ioctl+0x168/0xbf4
>>        [] do_vfs_ioctl+0x84/0x664
>>        [] SyS_ioctl+0x44/0x64
>>        [] ret_fast_syscall+0x0/0x48

vt_ioctl() takes the console lock, so this results in:

	console_lock
	mm->mmap_sem

>> -> #3 (console_lock){+.+.+.}:
>>        [] __lock_acquire+0x151c/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] console_lock+0x60/0x74
>>        [] console_cpu_notify+0x28/0x34
>>        [] notifier_call_chain+0x4c/0x8c
>>        [] __raw_notifier_call_chain+0x1c/0x24
>>        [] __cpu_notify+0x34/0x50
>>        [] cpu_notify_nofail+0x18/0x24
>>        [] _cpu_down+0x100/0x244
>>        [] cpu_down+0x30/0x44
>>        [] cpu_subsys_offline+0x14/0x18
>>        [] device_offline+0x94/0xbc
>>        [] online_store+0x4c/0x74
>>        [] dev_attr_store+0x20/0x2c
>>        [] sysfs_kf_write+0x54/0x58
>>        [] kernfs_fop_write+0xc4/0x160
>>        [] vfs_write+0xbc/0x184
>>        [] SyS_write+0x48/0x70
>>        [] ret_fast_syscall+0x0/0x48

cpu_down() takes cpu_hotplug.lock, so here we have:

	cpu_hotplug.lock
	console_lock

>> -> #2
>>    (cpu_hotplug.lock){+.+.+.}:
>>        [] __lock_acquire+0x151c/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] mutex_lock_nested+0x5c/0x3ac
>>        [] get_online_cpus+0x3c/0x58
>>        [] lru_add_drain_all+0x24/0x190
>>        [] migrate_prep+0x10/0x18
>>        [] alloc_contig_range+0xf4/0x30c
>>        [] dma_alloc_from_contiguous+0x7c/0x130
>>        [] __alloc_from_contiguous+0x38/0x12c
>>        [] atomic_pool_init+0x74/0x128
>>        [] do_one_initcall+0x3c/0x164
>>        [] kernel_init_freeable+0x104/0x1d0
>>        [] kernel_init+0x10/0xec
>>        [] ret_from_fork+0x14/0x2c

dma_alloc_from_contiguous() takes the cma_mutex, so here we end up with:

	cma_mutex
	cpu_hotplug.lock

>> -> #1 (lock){+.+...}:
>>        [] __lock_acquire+0x151c/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] mutex_lock_nested+0x5c/0x3ac
>>        [] lru_add_drain_all+0x1c/0x190
>>        [] migrate_prep+0x10/0x18
>>        [] alloc_contig_range+0xf4/0x30c
>>        [] dma_alloc_from_contiguous+0x7c/0x130
>>        [] __alloc_from_contiguous+0x38/0x12c
>>        [] atomic_pool_init+0x74/0x128
>>        [] do_one_initcall+0x3c/0x164
>>        [] kernel_init_freeable+0x104/0x1d0
>>        [] kernel_init+0x10/0xec
>>        [] ret_from_fork+0x14/0x2c

Ditto - here we have:

	cma_mutex
	lock

where "lock" is nicely named...  This is a lock inside lru_add_drain_all(),
and under this lock we call get_online_cpus() and put_online_cpus().
get_online_cpus() takes cpu_hotplug.lock, so here we also have:

	cma_mutex
	lock
	cpu_hotplug.lock

>> -> #0 (cma_mutex){+.+.+.}:
>>        [] print_circular_bug+0x70/0x2f0
>>        [] __lock_acquire+0x1580/0x1ca0
>>        [] lock_acquire+0xa0/0x130
>>        [] mutex_lock_nested+0x5c/0x3ac
>>        [] dma_release_from_contiguous+0xb8/0xf8
>>        [] __arm_dma_free.isra.11+0x194/0x218
>>        [] arm_dma_free+0x1c/0x24
>>        [] drm_gem_cma_free_object+0x68/0xb8
>>        [] drm_gem_object_free+0x30/0x38
>>        [] drm_gem_object_handle_unreference_unlocked+0x108/0x148
>>        [] drm_gem_handle_delete+0xb0/0x10c
>>        [] drm_gem_dumb_destroy+0x14/0x18
>>        [] drm_mode_destroy_dumb_ioctl+0x34/0x40
>>        [] drm_ioctl+0x3f4/0x498
>>        [] do_vfs_ioctl+0x84/0x664
>>        [] SyS_ioctl+0x44/0x64
>>        [] ret_fast_syscall+0x0/0x48

drm_gem_object_unreference_unlocked() takes dev->struct_mutex, so:

	dev->struct_mutex
	cma_mutex

So, the full locking dependency tree is this:

	CPU0			CPU1		CPU2		CPU3		CPU4
	dev->struct_mutex (from #0)
				mm->mmap_sem
				dev->struct_mutex (from #5)
						console_lock (from #4)
						mm->mmap_sem
								cpu_hotplug.lock (from #3)
								console_lock
										cma_mutex (from #2, but also from #1)
										cpu_hotplug.lock
	cma_mutex

Which is pretty sick - and I don't think that blaming this solely on V4L2
or DRM is particularly fair.  I believe the onus is on every author of one
of the locks involved in that chain to re-analyse whether their locking is
sane.

For instance, what is cma_mutex protecting?  Is it protecting the CMA
bitmap?  What if we did these changes:

struct page *dma_alloc_from_contiguous(struct device *dev, int count,
				       unsigned int align)
{
	...
	mutex_lock(&cma_mutex);
	...
	for (;;) {
		pageno = bitmap_find_next_zero_area(cma->bitmap, cma->count,
						    start, count, mask);
		if (pageno >= cma->count)
			break;

		pfn = cma->base_pfn + pageno;
+		bitmap_set(cma->bitmap, pageno, count);
+		mutex_unlock(&cma_mutex);
		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA);
+		mutex_lock(&cma_mutex);
		if (ret == 0) {
-			bitmap_set(cma->bitmap, pageno, count);
			page = pfn_to_page(pfn);
			break;
-		} else if (ret != -EBUSY) {
+		}
+		bitmap_clear(cma->bitmap, pageno, count);
+		if (ret != -EBUSY) {
			break;
		}
	...
	mutex_unlock(&cma_mutex);

	pr_debug("%s(): returned %p\n", __func__, page);
	return page;
}

bool dma_release_from_contiguous(struct device *dev, struct page *pages,
				 int count)
{
	...
+	free_contig_range(pfn, count);
	mutex_lock(&cma_mutex);
	bitmap_clear(cma->bitmap, pfn - cma->base_pfn, count);
-	free_contig_range(pfn, count);
	mutex_unlock(&cma_mutex);
	...
}

which avoids the dependency between cma_mutex and cpu_hotplug.lock?

-- 
FTTC broadband for 0.8mile line: 5.8Mbps down 500kbps up.  Estimation
in database were 13.1 to 19Mbit for a good line, about 7.5+ for a bad.
Estimate before purchase was "up to 13.2Mbit".