From: "Yang Shi"
To: kirill.shutemov@linux.intel.com, mhocko@suse.com, hughd@google.com, aarcange@redhat.com, akpm@linux-foundation.org
Cc: "Yang Shi"
Subject: [PATCH] mm: thp: use down_read_trylock in khugepaged to avoid long block
Date: Fri, 15 Dec 2017 03:53:23 +0800
Message-Id: <1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com>

In the current design, khugepaged needs to acquire mmap_sem before
scanning an mm. In some corner cases, khugepaged may scan a running
process that is modifying its memory mappings at that moment, so
khugepaged blocks in uninterruptible state. If the process holds
mmap_sem for a long time while modifying a huge memory space, this can
trigger the khugepaged hung task report below:

INFO: task khugepaged:270 blocked for more than 120 seconds.
      Tainted: G            E   4.9.65-006.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
khugepaged      D    0   270      2 0x00000000
 ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
 ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
 0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
Call Trace:
 [] ? __schedule+0x235/0x6e0
 [] schedule+0x36/0x80
 [] rwsem_down_read_failed+0xf0/0x150
 [] call_rwsem_down_read_failed+0x18/0x30
 [] down_read+0x20/0x40
 [] khugepaged+0x476/0x11d0
 [] ? idle_balance+0x1ce/0x300
 [] ? prepare_to_wait_event+0x100/0x100
 [] ? collapse_shmem+0xbf0/0xbf0
 [] kthread+0xe6/0x100
 [] ? kthread_park+0x60/0x60
 [] ret_from_fork+0x25/0x30

Blocking khugepaged just to wait for the semaphore is pointless.
Replace down_read() with down_read_trylock() so that khugepaged moves
on to scan the next mm right away instead of blocking on the semaphore,
which gives other processes more chances to install THP. khugepaged can
come back to the skipped mm after it finishes the current full_scan
round.

The change also appears to improve khugepaged efficiency a little. Below
are the test results from running LTP on a 2-node NUMA VM with 24 cores
and 4GB of memory:

                                pristine        w/ trylock
full_scan                       197             187
pages_collapsed                 21              26
thp_fault_alloc                 40818           44466
thp_fault_fallback              18413           16679
thp_collapse_alloc              21              150
thp_collapse_alloc_failed       14              16
thp_file_alloc                  369             369

Signed-off-by: Yang Shi
Cc: Kirill A. Shutemov
Cc: Michal Hocko
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: Andrew Morton
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea4ff25..ecc2b68 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1674,7 +1674,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
-	down_read(&mm->mmap_sem);
+	/*
+	 * Do not wait for the semaphore, to avoid blocking for a long
+	 * time; just move on to the next mm on the list.
+	 */
+	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+		goto breakouterloop_mmap_sem;
 	if (unlikely(khugepaged_test_exit(mm)))
 		vma = NULL;
 	else
-- 
1.8.3.1