Subject: Deadlocks with transparent huge pages and userspace fs daemons
From: Dave Hansen
To: Miklos Szeredi
Cc: Andrea Arcangeli, linux-fsdevel, linux-mm,
    "linux-kernel@vger.kernel.org", Lin Feng Shen, Yuri L Volobuev,
    Mel Gorman, dingc@cn.ibm.com, lnxninja
Date: Wed, 03 Nov 2010 13:43:25 -0700
Message-ID: <1288817005.4235.11393.camel@nimitz>

Hey Miklos,

When testing with a transparent huge page kernel:

	http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/andrea/aa.git;a=summary

some IBM testers ran into some deadlocks.  It appears that the
khugepaged process is trying to migrate one of a filesystem daemon's
pages while khugepaged holds the daemon's mmap_sem for write.

I think I've reproduced this issue in a slightly different form with
FUSE.  In my case, I think the FUSE process actually deadlocks on
itself instead of with khugepaged, as in the IBM tester example that
got me looking at this.

Andrea put it this way:

> As long as page faults are needed to execute the I/O I doubt it's
> safe.  But I'll definitely change khugepaged not to allocate memory.
> If nothing else because I don't want khugepaged to make it easier to
> trigger issues like this.  But it's hard for me to consider this a
> bug of khugepaged from a theoretical standpoint.

I tend to agree.  khugepaged makes the likelihood of these things
happening much higher, but I don't think it fundamentally creates the
issue.
Should we do something like make page compaction always non-blocking
on lock_page()?  Should we teach the VM about FUSE daemons somehow?

INFO: task unionfs:3527 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
unionfs       D ffff88007d356ec0     0  3527   3478 0x00000000
 ffff88007b0db9a8 0000000000000082 ffffea00000650c8 ffff88007d356c70
 ffff88007d1286a0 000000000000000d 0000000000000000 0000000000000301
 ffff88007b0db978 ffffffff81098f70 ffff88007b0dba58 ffff880001db1f40
Call Trace:
 [] ? vma_prio_tree_next+0x3c/0x52
 [] io_schedule+0x38/0x4d
 [] sync_page+0x44/0x48
 [] __wait_on_bit_lock+0x42/0x8a
 [] ? sync_page+0x0/0x48
 [] __lock_page+0x64/0x6b
 [] ? wake_bit_function+0x0/0x2a
 [] migrate_pages+0x1df/0x66b
 [] ? compaction_alloc+0x0/0x2b9
 [] ? ____pagevec_lru_add+0x13c/0x14f
 [] compact_zone+0x331/0x54d
 [] compact_zone_order+0xaa/0xb9
 [] try_to_compact_pages+0xda/0x140
 [] __alloc_pages_nodemask+0x3a6/0x74b
 [] alloc_pages_vma+0x110/0x13d
 [] do_huge_pmd_anonymous_page+0xc0/0x287
 [] handle_mm_fault+0x15c/0x201
 [] do_page_fault+0x304/0x422
 [] ? do_brk+0x282/0x2c8
 [] page_fault+0x1f/0x30

I had to make some changes to the transparent huge page code to get
this to happen.  First, I made the scanning *REALLY* aggressive:

echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
echo 1 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 65536 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

Then, I hacked migrate_pages()'s call of unmap_and_move() to always
'force', so that it tries to lock_page() unconditionally.  That's just
to make this race more common.  I also created some large malloc()'d
memory areas in the unionfs daemon and touched them constantly to
cause lots of page faults.

Other relevant tasks:

INFO: task mmap-and-touch:3584 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmap-and-touc D ffff88007bd71510     0  3584   3542 0x00000000
 ffff88007a591b88 0000000000000086 ffff88007bd57400 ffff88007bd712c0
 ffff88007d01cd70 ffffffff00000004 ffff88007d22e578 ffff88005e5b7440
 ffff88007a591b58 0000000181182a8c ffff88007a591b88 ffff880001c91f40
Call Trace:
 [] io_schedule+0x38/0x4d
 [] sync_page+0x44/0x48
 [] __wait_on_bit_lock+0x42/0x8a
 [] ? sync_page+0x0/0x48
 [] __lock_page+0x64/0x6b
 [] ? wake_bit_function+0x0/0x2a
 [] find_lock_page+0x39/0x5d
 [] filemap_fault+0x1a6/0x30e
 [] __do_fault+0x50/0x432
 [] handle_pte_fault+0x2db/0x717
 [] ? __free_pages+0x1b/0x24
 [] ? __pte_alloc+0x112/0x121
 [] handle_mm_fault+0x1e9/0x201
 [] do_page_fault+0x304/0x422
 [] ? sys_newfstat+0x29/0x34
 [] page_fault+0x1f/0x30

INFO: task memknobs:3599 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
memknobs      D ffff88007d305b20     0  3599   3573 0x00000000
 ffff88005e4539a8 0000000000000086 ffff88005e453978 ffff88007d3058d0
 ffff88007dbb60d0 ffffea0000000002 000000003963d000 ffff88007a4c11e8
 ffffea000033aa10 000000017b1e69e0 ffff88005e453988 ffff880001c51f40
Call Trace:
 [] io_schedule+0x38/0x4d
 [] sync_page+0x44/0x48
 [] __wait_on_bit_lock+0x42/0x8a
 [] ? sync_page+0x0/0x48
 [] __lock_page+0x64/0x6b
 [] ? wake_bit_function+0x0/0x2a
 [] migrate_pages+0x1df/0x66b
 [] ? compaction_alloc+0x0/0x2b9
 [] ? ____pagevec_lru_add+0x13c/0x14f
 [] compact_zone+0x331/0x54d
 [] compact_zone_order+0xaa/0xb9
 [] try_to_compact_pages+0xda/0x140
 [] __alloc_pages_nodemask+0x3a6/0x74b
 [] alloc_pages_vma+0x110/0x13d
 [] do_huge_pmd_anonymous_page+0xc0/0x287
 [] handle_mm_fault+0x15c/0x201
 [] do_page_fault+0x304/0x422
 [] ? __dequeue_entity+0x2e/0x33
 [] ? __switch_to+0x22a/0x23c
 [] ? set_next_entity+0x18/0x36
 [] ? finish_task_switch+0x3c/0x81
 [] ? schedule+0x6f4/0x79a
 [] page_fault+0x1f/0x30

INFO: task khugepaged:515 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged    D ffff88007d1e8360     0   515      2 0x00000000
 ffff88007cad5d00 0000000000000046 ffff88007cad5cc0 ffff88007d1e8110
 ffff88007d0986e0 0000000000000008 ffff88007cad5ce0 ffffffff81037e33
 00000000ffffffff 000000017cad5d50 00000001000dd090 0000000000000002
Call Trace:
 [] ? lock_timer_base+0x26/0x4a
 [] rwsem_down_failed_common+0xcc/0xfe
 [] rwsem_down_write_failed+0x13/0x15
 [] call_rwsem_down_write_failed+0x13/0x20
 [] ? down_write+0x20/0x22
 [] khugepaged+0xee0/0xf5f
 [] ? autoremove_wake_function+0x0/0x38
 [] ? khugepaged+0x0/0xf5f
 [] kthread+0x81/0x89
 [] kernel_thread_helper+0x4/0x10
 [] ? kthread+0x0/0x89
 [] ? kernel_thread_helper+0x0/0x10

Original stack trace from GPFS deadlock:

> khugepaged    D ffff88007c823080     0    52      2 0x00000000
>  ffff8800378c98f0 0000000000000046 0000000000000000 001a7949f3208ca4
>  ffffffffffffff10 ffff880079efc670 000000002b6c79c0 00000001169be651
>  ffff88003780c638 ffff8800378c9fd8 0000000000010518 ffff88003780c638
> Call Trace:
>  [] ? sync_page+0x0/0x50
>  [] io_schedule+0x73/0xc0
>  [] sync_page+0x3d/0x50
>  [] __wait_on_bit_lock+0x5a/0xc0
>  [] __lock_page+0x67/0x70
>  [] ? wake_bit_function+0x0/0x50
>  [] ? lru_cache_add_lru+0x21/0x40
>  [] lock_page+0x30/0x40
>  [] migrate_pages+0x59d/0x5d0
>  [] ? compaction_alloc+0x0/0x370
>  [] compact_zone+0x4ac/0x5e0
>  [] ? get_page_from_freelist+0x15c/0x820
>  [] compact_zone_order+0x7e/0xb0
>  [] try_to_compact_pages+0x109/0x170
>  [] __alloc_pages_nodemask+0x55c/0x810
>  [] alloc_pages_vma+0x84/0x110
>  [] khugepaged+0xa4f/0x1190
>  [] ? autoremove_wake_function+0x0/0x40
>  [] ? khugepaged+0x0/0x1190
>  [] kthread+0x96/0xa0
>  [] child_rip+0xa/0x20
>  [] ? kthread+0x0/0xa0
>  [] ? child_rip+0x0/0x20
>
>
> mmfsd         D ffff88007c823680     0  4453   4118 0x00000080
>  ffff88001ad1ddf0 0000000000000082 0000000000000000 0000000000000000
>  0000000000000000 ffff880037fcee40 ffff880079d40ab0 00000001169be9c1
>  ffff8800782b7ad8 ffff88001ad1dfd8 0000000000010518 ffff8800782b7ad8
> Call Trace:
>  [] ? thread_return+0x4e/0x778
>  [] ? __hrtimer_start_range_ns+0x1a3/0x430
>  [] rwsem_down_failed_common+0x95/0x1d0
>  [] rwsem_down_read_failed+0x26/0x30
>  [] call_rwsem_down_read_failed+0x14/0x30
>  [] ? down_read+0x24/0x30
>  [] do_page_fault+0x34a/0x3a0
>  [] page_fault+0x25/0x30

-- Dave