Subject: Re: [PATCH] prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.
From: Adam Litke
Organization: IBM
To: Mel Gorman
Cc: Andrew Morton, Larry Woodman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Lee Schermerhorn, Hugh Dickins, David Gibson
Date: Tue, 17 Nov 2009 10:43:48 -0600
Message-ID: <1258476228.2820.13.camel@aglitke>
In-Reply-To: <20091117160922.GB29804@csn.ul.ie>
References: <1257872456.3227.2.camel@dhcp-100-19-198.bos.redhat.com>
            <20091116121253.d86920a0.akpm@linux-foundation.org>
            <20091117160922.GB29804@csn.ul.ie>

This looks good to me.  The venerable hugetlb_instantiation_mutex already
prevents allocation/instantiation races in hugetlb_cow(), so the extra
spinlock dropping should have no adverse effects with respect to that.

Acked-by: Adam Litke

On Tue, 2009-11-17 at 16:09 +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2009 at 12:12:53PM -0800, Andrew Morton wrote:
> > On Tue, 10 Nov 2009 12:00:56 -0500 Larry Woodman wrote:
> > >
> > > hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> > > hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to
> > > an insufficient huge page pool, it calls unmap_ref_private() with the
> > > mm->page_table_lock held.  unmap_ref_private() then calls
> > > unmap_hugepage_range(), which tries to acquire the mm->page_table_lock.
> > >
> > > [] print_circular_bug_tail+0x80/0x9f
> > > [] ? check_noncircular+0xb0/0xe8
> > > [] __lock_acquire+0x956/0xc0e
> > > [] lock_acquire+0xee/0x12e
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] _spin_lock+0x40/0x89
> > > [] ? unmap_hugepage_range+0x3e/0x84
> > > [] ? alloc_huge_page+0x218/0x318
> > > [] unmap_hugepage_range+0x3e/0x84
> > > [] hugetlb_cow+0x1e2/0x3f4
> > > [] ? hugetlb_fault+0x453/0x4f6
> > > [] hugetlb_fault+0x480/0x4f6
> > > [] follow_hugetlb_page+0x116/0x2d9
> > > [] ? _spin_unlock_irq+0x3a/0x5c
> > > [] __get_user_pages+0x2a3/0x427
> > > [] get_user_pages+0x3e/0x54
> > > [] get_user_pages_fast+0x170/0x1b5
> > > [] dio_get_page+0x64/0x14a
> > > [] __blockdev_direct_IO+0x4b7/0xb31
> > > [] blkdev_direct_IO+0x58/0x6e
> > > [] ? blkdev_get_blocks+0x0/0xb8
> > > [] generic_file_aio_read+0xdd/0x528
> > > [] ? avc_has_perm+0x66/0x8c
> > > [] do_sync_read+0xf5/0x146
> > > [] ? autoremove_wake_function+0x0/0x5a
> > > [] ? security_file_permission+0x24/0x3a
> > > [] vfs_read+0xb5/0x126
> > > [] ? fget_light+0x5e/0xf8
> > > [] sys_read+0x54/0x8c
> > > [] system_call_fastpath+0x16/0x1b
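To make the report above easier to follow, here is a minimal toy model of the
lock recursion.  This is my own illustrative sketch, written as a user-space
program with a fake spin_lock(); the function names mirror mm/hugetlb.c, but
the bodies are not the real kernel code.

#include <stdio.h>
#include <stdlib.h>

struct mm_struct {
	int page_table_lock;			/* stand-in for the real spinlock_t */
};

/*
 * A real non-recursive spinlock would simply spin forever on a lock that
 * is already held on this path; lockdep/PROVE_LOCKING is what turns that
 * into the circular-dependency report quoted above.
 */
static void spin_lock(int *lock)
{
	if (*lock) {
		fprintf(stderr, "possible recursive locking detected\n");
		abort();
	}
	*lock = 1;
}

static void spin_unlock(int *lock)
{
	*lock = 0;
}

static void unmap_hugepage_range(struct mm_struct *mm)
{
	spin_lock(&mm->page_table_lock);	/* second acquisition: deadlock */
	/* ... tear down the child's private mapping ... */
	spin_unlock(&mm->page_table_lock);
}

static void unmap_ref_private(struct mm_struct *mm)
{
	unmap_hugepage_range(mm);
}

static void hugetlb_cow(struct mm_struct *mm, int alloc_failed)
{
	/* entered with mm->page_table_lock already held by hugetlb_fault() */
	if (alloc_failed)
		unmap_ref_private(mm);		/* the path shown in the trace */
}

static void hugetlb_fault(struct mm_struct *mm)
{
	spin_lock(&mm->page_table_lock);	/* first acquisition */
	hugetlb_cow(mm, 1);			/* pretend alloc_huge_page() failed */
	spin_unlock(&mm->page_table_lock);
}

int main(void)
{
	struct mm_struct mm = { 0 };

	hugetlb_fault(&mm);
	return 0;
}

Because mm->page_table_lock is an ordinary non-recursive spinlock, the second
spin_lock() on the failure path should never succeed; that is the circular
dependency PROVE_LOCKING is reporting above.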
> >
> > Confused.
> >
> > That code is very old, alloc_huge_page() failures are surely common and
> > afaict the bug will lead to an instantly dead box.  So why haven't
> > people hit this bug before now?
> >
>
> Failures like that can happen when the workload is calling fork() with
> MAP_PRIVATE and the child is writing with a small hugepage pool.  I reran
> the tests used at the time the code was written and they don't trigger
> warnings or problems in the normal case, other than the note that the
> child got killed.  It required PROVE_LOCKING to be set, which is not set
> on the default configs I normally use when I don't suspect locking
> problems.
>
> I can confirm that with PROVE_LOCKING set the warnings do trigger, but
> the machine doesn't lock up, and I see this in dmesg.  Bit of a surprise
> really; you'd think double-taking a spinlock would result in damage.
>
> [ 111.031795] PID 2876 killed due to inadequate hugepage pool
>
> Applying the patch does fix that problem, but there is a very similar
> problem remaining in the same path.  More on this later.
>
> > > This can be fixed by dropping the mm->page_table_lock around the call
> > > to unmap_ref_private() if alloc_huge_page() fails; it's dropped right
> > > below in the normal path anyway:
> > >
> >
> > Why is that safe?  What is the page_table_lock protecting in here?
> >
> The lock has been there since 2005 and it was to protect against
> concurrent PTE updates.  However, it is probably unnecessary protection,
> as this whole path should be protected by the hugetlb_instantiation_mutex.
> There was locking left behind that may be unnecessary after that mutex
> was introduced, but it is left in place for the day someone decides to
> tackle that mutex.
>
> I don't think this patch solves everything with the page_table_lock in
> that path, because it's also possible from the COW path to enter the
> buddy allocator with the spinlock still held, with setups like:
>
> 1. create a mapping of 3 huge pages
> 2. allow the system to dynamically allocate 6 huge pages
> 3. fork() and fault in the child
>
> With preempt and DEBUG_SPINLOCK_SLEEP set, a different warning can still
> trigger.
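In case anyone wants to poke at this, the setup in steps 1-3 above boils down
to something like the program below.  This is a rough sketch of my own: the
2MB huge page size, the /mnt/huge hugetlbfs mount point, the file name and
the sysctl values in the comment are assumptions for illustration, not
details taken from Mel's test rig.

/*
 * Rough reproducer sketch (illustrative only).  Assumes a hugetlbfs mount
 * at /mnt/huge and 2MB huge pages, with the pool sized beforehand, e.g.:
 *
 *     echo 3 > /proc/sys/vm/nr_hugepages              (static pool, step 1)
 *     echo 6 > /proc/sys/vm/nr_overcommit_hugepages   (dynamic pool, step 2)
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)
#define NR_PAGES	3

int main(void)
{
	char *map;
	unsigned long i;
	int fd = open("/mnt/huge/repro", O_CREAT | O_RDWR, 0600);

	if (fd < 0) {
		perror("open hugetlbfs file");
		return 1;
	}

	/* MAP_PRIVATE so that writes in the child go through hugetlb_cow() */
	map = mmap(NULL, NR_PAGES * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Create and fault in a private mapping of three huge pages (step 1). */
	for (i = 0; i < NR_PAGES; i++)
		map[i * HPAGE_SIZE] = 1;

	/* fork() and fault in the child (step 3). */
	if (fork() == 0) {
		/*
		 * Each write needs a fresh huge page for the private COW
		 * copy; with an overcommitted pool this is where
		 * alloc_huge_page() ends up in the buddy allocator, and
		 * where it can fail and get the child killed.
		 */
		for (i = 0; i < NR_PAGES; i++)
			map[i * HPAGE_SIZE] = 2;
		_exit(0);
	}

	wait(NULL);
	munmap(map, NR_PAGES * HPAGE_SIZE);
	unlink("/mnt/huge/repro");
	return 0;
}

Run with the pool deliberately too small for the child's copy-on-write
copies, this is the sort of workload that produces the "killed due to
inadequate hugepage pool" message and, with PROVE_LOCKING or
DEBUG_SPINLOCK_SLEEP set, the warnings discussed above.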
> Here is an alternative patch below which extends Larry's patch to drop
> the spinlock earlier and retake it as required.  Because it involves page
> table updates, I've added Hugh to the cc because he knows all the rules
> and gotchas backwards.
>
> ==== CUT HERE ====
> hugetlb: prevent deadlock in __unmap_hugepage_range() when alloc_huge_page() fails.
>
> hugetlb_fault() takes the mm->page_table_lock spinlock then calls
> hugetlb_cow().  If the alloc_huge_page() in hugetlb_cow() fails due to an
> insufficient huge page pool, it calls unmap_ref_private() with the
> mm->page_table_lock held.  unmap_ref_private() then calls
> unmap_hugepage_range(), which tries to acquire the mm->page_table_lock.
>
> [] print_circular_bug_tail+0x80/0x9f
> [] ? check_noncircular+0xb0/0xe8
> [] __lock_acquire+0x956/0xc0e
> [] lock_acquire+0xee/0x12e
> [] ? unmap_hugepage_range+0x3e/0x84
> [] ? unmap_hugepage_range+0x3e/0x84
> [] _spin_lock+0x40/0x89
> [] ? unmap_hugepage_range+0x3e/0x84
> [] ? alloc_huge_page+0x218/0x318
> [] unmap_hugepage_range+0x3e/0x84
> [] hugetlb_cow+0x1e2/0x3f4
> [] ? hugetlb_fault+0x453/0x4f6
> [] hugetlb_fault+0x480/0x4f6
> [] follow_hugetlb_page+0x116/0x2d9
> [] ? _spin_unlock_irq+0x3a/0x5c
> [] __get_user_pages+0x2a3/0x427
> [] get_user_pages+0x3e/0x54
> [] get_user_pages_fast+0x170/0x1b5
> [] dio_get_page+0x64/0x14a
> [] __blockdev_direct_IO+0x4b7/0xb31
> [] blkdev_direct_IO+0x58/0x6e
> [] ? blkdev_get_blocks+0x0/0xb8
> [] generic_file_aio_read+0xdd/0x528
> [] ? avc_has_perm+0x66/0x8c
> [] do_sync_read+0xf5/0x146
> [] ? autoremove_wake_function+0x0/0x5a
> [] ? security_file_permission+0x24/0x3a
> [] vfs_read+0xb5/0x126
> [] ? fget_light+0x5e/0xf8
> [] sys_read+0x54/0x8c
> [] system_call_fastpath+0x16/0x1b
>
> This can be fixed by dropping the mm->page_table_lock around the call to
> unmap_ref_private() if alloc_huge_page() fails; it's dropped right below
> in the normal path anyway.  However, earlier in that function, it's also
> possible to call into the page allocator with the same spinlock held.
>
> What this patch does is drop the spinlock before the page allocator is
> potentially entered.  The check for page allocation failure can be made
> without the page_table_lock, as can the copy of the huge page.  Even if
> the PTE changed while the spinlock was dropped, the consequence is that
> a huge page is copied unnecessarily.  This resolves both the double
> taking of the lock and sleeping with the spinlock held.
>
> [mel@csn.ul.ie: also cover the case where the process can sleep with the spinlock held]
> Signed-off-by: Larry Woodman
> Signed-off-by: Mel Gorman
> Cc: Adam Litke
> Cc: Andy Whitcroft
> Cc: Lee Schermerhorn
> Cc: Hugh Dickins
> Cc: David Gibson
> ---
>  mm/hugetlb.c |   13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 450493d..2ef66a2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2293,6 +2293,9 @@ retry_avoidcopy:
>  		outside_reserve = 1;
>
>  	page_cache_get(old_page);
> +
> +	/* Drop page_table_lock as buddy allocator may be called */
> +	spin_unlock(&mm->page_table_lock);
>  	new_page = alloc_huge_page(vma, address, outside_reserve);
>
>  	if (IS_ERR(new_page)) {
> @@ -2310,19 +2313,25 @@ retry_avoidcopy:
>  			if (unmap_ref_private(mm, vma, old_page, address)) {
>  				BUG_ON(page_count(old_page) != 1);
>  				BUG_ON(huge_pte_none(pte));
> +				spin_lock(&mm->page_table_lock);
>  				goto retry_avoidcopy;
>  			}
>  			WARN_ON_ONCE(1);
>  		}
>
> +		/* Caller expects lock to be held */
> +		spin_lock(&mm->page_table_lock);
>  		return -PTR_ERR(new_page);
>  	}
>
> -	spin_unlock(&mm->page_table_lock);
>  	copy_huge_page(new_page, old_page, address, vma);
>  	__SetPageUptodate(new_page);
> -	spin_lock(&mm->page_table_lock);
>
> +	/*
> +	 * Retake the page_table_lock to check for racing updates
> +	 * before the page tables are altered
> +	 */
> +	spin_lock(&mm->page_table_lock);
>  	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
>  	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
>  		/* Break COW */

--
Thanks,
Adam