Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755834AbdLOUE2 (ORCPT ); Fri, 15 Dec 2017 15:04:28 -0500 Received: from out0-213.mail.aliyun.com ([140.205.0.213]:39523 "EHLO out0-213.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755406AbdLOUE0 (ORCPT ); Fri, 15 Dec 2017 15:04:26 -0500 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R841e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e02c03311;MF=yang.s@alibaba-inc.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---.9jQb5Z7_1513368252; Subject: Re: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block To: Michal Hocko Cc: kirill.shutemov@linux.intel.com, hughd@google.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com> <20171215102753.GY16951@dhcp22.suse.cz> From: "Yang Shi" Message-ID: <13f935a9-42af-98f4-1813-456a25200d9d@alibaba-inc.com> Date: Sat, 16 Dec 2017 04:04:10 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.2.1 MIME-Version: 1.0 In-Reply-To: <20171215102753.GY16951@dhcp22.suse.cz> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6326 Lines: 152 Hi Kirill & Michal, Since both of you raised the same question about who holds the semaphore for that long time, I just reply here to both of you. The backtrace shows vm-scalability is running with 300G memory and it is doing munmap as below: [188995.241865] CPU: 15 PID: 8063 Comm: usemem Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1 [188995.242252] Hardware name: Huawei Technologies Co., Ltd. Tecal RH2288H V2-12L/BC11SRSG1, BIOS RMIBV368 11/01/2013 [188995.242637] task: ffff883f610a5b00 task.stack: ffffc90037280000 [188995.242838] RIP: 0010:[] .c [] unmap_page_range+0x619/0x940 [188995.243231] RSP: 0018:ffffc90037283c98 EFLAGS: 00000282 [188995.243429] RAX: 00002b760ac57000 RBX: 00002b760ac56000 RCX: 0000000003eb13ca [188995.243820] RDX: ffffea003971e420 RSI: 00002b760ac56000 RDI: ffff8837cb832e80 [188995.244211] RBP: ffffc90037283d78 R08: ffff883ebf8fc3c0 R09: 0000000000008000 [188995.244600] R10: 00000000826b7e00 R11: 0000000000000000 R12: ffff8821e70f72b0 [188995.244993] R13: ffffea00fac4f280 R14: ffffc90037283e00 R15: 00002b760ac57000 [188995.245390] FS: 00002b34b4861700(0000) GS:ffff883f7d3c0000(0000) knlGS:0000000000000000 [188995.245788] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [188995.245990] CR2: 00002b7092160fed CR3: 0000000977850000 CR4: 00000000001406e0 [188995.246388] Stack: [188995.246581] 00002b92f71edfff.c 00002b7fffffffff.c 00002b92f71ee000.c ffff8809778502b0.c [188995.246981] 00002b763fffffff.c ffff8802e1895ec0.c ffffc90037283d48.c ffff883f610a5b00.c [188995.247365] ffffc90037283d70.c 00002b8000000000.c ffffc00000000fff.c ffffea00879c3df0.c [188995.247759] Call Trace: [188995.247957] [] unmap_single_vma+0x7d/0xe0 [188995.248161] [] unmap_vmas+0x51/0xa0 [188995.248367] [] unmap_region+0xbd/0x130 [188995.248571] [] ? rwsem_down_write_failed_killable+0x31c/0x3f0 [188995.248961] [] do_munmap+0x26c/0x420 [188995.249162] [] SyS_munmap+0x50/0x70 [188995.249361] [] entry_SYSCALL_64_fastpath+0x1a/0xa9 By analyzing vmcore, khugepaged is waiting for vm-scalability process's mmap_sem. unmap_vmas will unmap every vma in the memory space, it sounds the test generated huge amount of vmas. Shall we add "cond_resched()" in unmap_vmas(), i.e for every 100 vmas? It may improve the responsiveness a little bit for non-preempt kernel, although it still can't release the semaphore. Thanks, Yang On 12/15/17 2:27 AM, Michal Hocko wrote: > On Fri 15-12-17 03:53:23, Yang Shi wrote: >> In the current design, khugepaged need acquire mmap_sem before scanning >> mm, but in some corner case, khugepaged may scan the current running >> process which might be modifying memory mapping, so khugepaged might >> block in uninterruptible state. But, the process might hold the mmap_sem >> for long time when modifying a huge memory space, then it may trigger >> the below khugepaged hung issue: >> >> INFO: task khugepaged:270 blocked for more than 120 seconds. >> Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> khugepaged D 0 270 2 0x00000000 >> ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440 >> ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64 >> 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0 >> Call Trace: >> [] ? __schedule+0x235/0x6e0 >> [] schedule+0x36/0x80 >> [] rwsem_down_read_failed+0xf0/0x150 >> [] call_rwsem_down_read_failed+0x18/0x30 >> [] down_read+0x20/0x40 >> [] khugepaged+0x476/0x11d0 >> [] ? idle_balance+0x1ce/0x300 >> [] ? prepare_to_wait_event+0x100/0x100 >> [] ? collapse_shmem+0xbf0/0xbf0 >> [] kthread+0xe6/0x100 >> [] ? kthread_park+0x60/0x60 >> [] ret_from_fork+0x25/0x30 > > I am definitely interested in what the holder of the write lock does > here for such a long time. > >> So, it sounds pointless to just block for waiting for the semaphore for >> khugepaged, here replace down_read() to down_read_trylock() to move to >> scan next mm quickly instead of just blocking on the semaphore so that >> other processes can get more chances to install THP. >> Then khugepaged can come back to scan the skipped mm when finish the >> current round full_scan. >> >> And, it soudns the change can improve khugepaged efficiency a little >> bit. >> >> The below is the test result with running LTP on a 24 cores 4GB memory 2 >> nodes NUMA VM: >> >> pristine w/ trylock >> full_scan 197 187 >> pages_collapsed 21 26 >> thp_fault_alloc 40818 44466 >> thp_fault_fallback 18413 16679 >> thp_collapse_alloc 21 150 >> thp_collapse_alloc_failed 14 16 >> thp_file_alloc 369 369 >> >> Signed-off-by: Yang Shi >> Cc: Kirill A. Shutemov >> Cc: Michal Hocko >> Cc: Hugh Dickins >> Cc: Andrea Arcangeli >> Cc: Andrew Morton > > The patch makes sense to me > Acked-by: Michal Hocko > >> --- >> mm/khugepaged.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index ea4ff25..ecc2b68 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1674,7 +1674,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, >> spin_unlock(&khugepaged_mm_lock); >> >> mm = mm_slot->mm; >> - down_read(&mm->mmap_sem); >> + /* >> + * Not wait for semaphore to avoid long time waiting, just move >> + * to the next mm on the list. >> + */ >> + if (unlikely(!down_read_trylock(&mm->mmap_sem))) >> + goto breakouterloop_mmap_sem; >> if (unlikely(khugepaged_test_exit(mm))) >> vma = NULL; >> else >> -- >> 1.8.3.1 >> >