From: Alex Thorlton
To: linux-kernel@vger.kernel.org
Cc: Alex Thorlton, Ingo Molnar, Peter Zijlstra, Andrew Morton, "Kirill A. Shutemov", Benjamin Herrenschmidt, Rik van Riel, Naoya Horiguchi, Oleg Nesterov, "Eric W. Biederman", Andy Lutomirski, Al Viro, Kees Cook, Andrea Arcangeli
Subject: [RFC PATCHv2 1/2] Add mm flag to control THP
Date: Thu, 16 Jan 2014 15:01:43 -0600
Message-Id: <1bc8f911363af956b37d8ea415d734f3191f1c78.1389905087.git.athorlton@sgi.com>
X-Mailer: git-send-email 1.7.12.4

This patch adds an mm flag (MMF_THP_DISABLE) to disable transparent
hugepages using prctl.

Changes for v2:

* Pulled the code from the prctl helper functions into prctl itself to
  make things more concise.
* Changed PR_SET_THP_DISABLE to accept an argument to set/clear the
  THP_DISABLE bit, instead of having two separate prctls for this.
* Removed ifdef in prctl.h that defined MMF_THP_DISABLE based on whether
  or not CONFIG_TRANSPARENT_HUGEPAGE was set.
* Added code to get khugepaged to ignore mm_structs with THP disabled.

The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified and using a malloc hook with
madvise is not an option (e.g. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.
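A minimal sketch of how a wrapper for an unmodifiable job could use the new prctl, assuming the flag survives exec() as described later in this mail. The names parse_thp_flag, set_thp_disable and thp_wrapper are hypothetical, and the fallback value 41 for PR_SET_THP_DISABLE is an assumption taken from mainline headers, not from this RFC.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <unistd.h>

/* Assumption: 41 is the mainline value; this RFC's prctl.h may differ. */
#ifndef PR_SET_THP_DISABLE
#define PR_SET_THP_DISABLE 41
#endif

/* Parse the wrapper's flag argument: "0" clears, "1" sets, else -1. */
int parse_thp_flag(const char *s)
{
    if (s == NULL || s[0] == '\0' || s[1] != '\0')
        return -1;
    if (s[0] == '0')
        return 0;
    if (s[0] == '1')
        return 1;
    return -1;
}

/* Set (non-zero) or clear (zero) MMF_THP_DISABLE on the caller's mm. */
int set_thp_disable(int disable)
{
    return prctl(PR_SET_THP_DISABLE, (unsigned long)disable, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical wrapper body: argv = { wrapper, "0"|"1", program, args... }.
 * Only returns on error; on success the job replaces this process.
 */
int thp_wrapper(int argc, char **argv)
{
    int flag;

    if (argc < 3 || (flag = parse_thp_flag(argv[1])) < 0) {
        fprintf(stderr, "usage: %s <0|1> <program> [args...]\n",
                argc > 0 ? argv[0] : "thp_wrapper");
        return 2;
    }
    if (set_thp_disable(flag) != 0) {
        perror("prctl(PR_SET_THP_DISABLE)");
        return 1;
    }
    execvp(argv[2], &argv[2]);
    perror("execvp");
    return 127;
}
```

Calling thp_wrapper(argc, argv) from main() gives an invocation of the form "wrapper 1 ./job args...", so the job itself needs no changes.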

We need to do this sort of thing for jobs where THP hurts performance,
due to the possibility of increased remote memory accesses that can be
created by situations such as the following:

When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
be handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on that
same chunk, they'll be making remote accesses. With THP disabled, 4K
pages can be handed out to separate nodes as they're needed, greatly
reducing the number of remote accesses to memory.

Here are some results showing the improvement that my test case gets
when the MMF_THP_DISABLE flag is clear vs. set:

MMF_THP_DISABLE clear:

# perf stat -a -r 3 ./prctl_wrapper_mmv2 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g

 Performance counter stats for './prctl_wrapper_mmv2 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):

   267537198.932548 task-clock                #  641.115 CPUs utilized            ( +-  0.03% ) [100.00%]
            909,086 context-switches          #    0.000 M/sec                    ( +-  0.07% ) [100.00%]
              1,004 CPU-migrations            #    0.000 M/sec                    ( +-  1.49% ) [100.00%]
            137,942 page-faults               #    0.000 M/sec                    ( +-  1.70% )
350,607,742,932,846 cycles                    #    1.311 GHz                      ( +-  0.03% ) [100.00%]
523,280,989,487,579 stalled-cycles-frontend   #  149.25% frontend cycles idle     ( +-  0.04% ) [100.00%]
395,143,659,263,350 stalled-cycles-backend    #  112.70% backend  cycles idle     ( +-  0.24% ) [100.00%]
147,359,655,811,699 instructions              #    0.42  insns per cycle
                                              #    3.55  stalled cycles per insn  ( +-  0.05% ) [100.00%]
 26,897,860,986,646 branches                  #  100.539 M/sec                    ( +-  0.10% ) [100.00%]
      1,264,232,340 branch-misses             #    0.00% of all branches          ( +-  0.65% )

      417.299580464 seconds time elapsed                                          ( +-  0.03% )

MMF_THP_DISABLE set:

# perf stat -a -r 3 ./prctl_wrapper_mmv2 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g

 Performance counter stats for './prctl_wrapper_mmv2 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g' (3 runs):

   142442476.218751 task-clock                #  642.085 CPUs utilized            ( +-  0.74% ) [100.00%]
            520,084 context-switches          #    0.000 M/sec                    ( +-  0.79% ) [100.00%]
                853 CPU-migrations            #    0.000 M/sec                    ( +- 14.53% ) [100.00%]
         62,396,741 page-faults               #    0.000 M/sec                    ( +-  0.01% )
155,509,431,078,100 cycles                    #    1.092 GHz                      ( +-  0.75% ) [100.00%]
213,552,817,573,474 stalled-cycles-frontend   #  137.32% frontend cycles idle     ( +-  1.23% ) [100.00%]
117,337,842,556,506 stalled-cycles-backend    #   75.45% backend  cycles idle     ( +-  2.09% ) [100.00%]
178,809,541,860,114 instructions              #    1.15  insns per cycle
                                              #    1.19  stalled cycles per insn  ( +-  0.18% ) [100.00%]
 26,295,305,012,722 branches                  #  184.603 M/sec                    ( +-  0.42% ) [100.00%]
        754,391,541 branch-misses             #    0.00% of all branches          ( +-  0.50% )

      221.843813599 seconds time elapsed                                          ( +-  0.75% )

As you can see, this particular test runs about 1.9x faster when THP is
turned off (221.8 vs. 417.3 seconds elapsed).

Here's a link to the test, along with the wrapper that I used:

http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv2.tar.gz

There are still a few things that might need tweaking here, but I wanted
to get the patch out there to get a discussion started. Two things I
noted from the old patch discussion that will likely need to be
addressed are:

* Patch doesn't currently account for get_user_pages allocations. I'm
  actually not sure if this needs to be addressed. From what I know,
  get_user_pages calls down to handle_mm_fault, which should prevent
  THPs from being handed out where necessary. If anybody can confirm
  that, it would be appreciated.
* Current behavior is to have fork()/exec()'d processes inherit the
  flag. Andrew Morton pointed out some possible issues with this, so we
  may need to rethink this behavior.
  - If the parent process has THP disabled and forks off a child, the
    child will also have THP disabled. This may not be the desired
    behavior.

Signed-off-by: Alex Thorlton
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Andrew Morton
Cc: "Kirill A. Shutemov"
Cc: Benjamin Herrenschmidt
Cc: Rik van Riel
Cc: Naoya Horiguchi
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Andy Lutomirski
Cc: Al Viro
Cc: Kees Cook
Cc: Andrea Arcangeli
Cc: linux-kernel@vger.kernel.org
---
 include/linux/huge_mm.h    |  6 ++++--
 include/linux/sched.h      |  6 +++++-
 include/uapi/linux/prctl.h |  3 +++
 kernel/sys.c               | 11 +++++++++++
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 91672e2..475f59f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
 #ifndef _LINUX_HUGE_MM_H
 #define _LINUX_HUGE_MM_H
 
+#include
+
 extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
 				      struct vm_area_struct *vma,
 				      unsigned long address, pmd_t *pmd,
@@ -74,7 +76,8 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
 	  (1<vm_flags & VM_HUGEPAGE))) &&			\
 	 !((__vma)->vm_flags & VM_NOHUGEPAGE) &&			\
-	 !is_vma_temporary_stack(__vma))
+	 !is_vma_temporary_stack(__vma) &&				\
+	 !test_bit(MMF_THP_DISABLE, &(__vma)->vm_mm->flags))
 #define transparent_hugepage_defrag(__vma)				\
 	((transparent_hugepage_flags &					\
 	  (1<no_new_privs ? 1 : 0;
+	case PR_SET_THP_DISABLE:
+		if (arg2)
+			set_bit(MMF_THP_DISABLE, &me->mm->flags);
+		else
+			clear_bit(MMF_THP_DISABLE, &me->mm->flags);
+		break;
+	case PR_GET_THP_DISABLE:
+		error = put_user(test_bit(MMF_THP_DISABLE,
+					  &me->mm->flags),
+				 (int __user *) arg2);
+		break;
 	default:
 		error = -EINVAL;
 		break;
-- 
1.7.12.4
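
The PR_GET_THP_DISABLE case in the kernel/sys.c hunk writes the flag through the user pointer passed in arg2. A minimal userspace sketch of that calling convention follows. The names get_thp_disable_rfc and thp_state_name are hypothetical, and the fallback value 42 is an assumption taken from mainline headers rather than from this RFC. As far as I know, mainline kernels instead return the flag as the prctl return value and reject a non-zero arg2, so there this call is expected to fail with EINVAL.

```c
#include <stdio.h>
#include <sys/prctl.h>

/* Assumption: 42 is the mainline value; this RFC's prctl.h may differ. */
#ifndef PR_GET_THP_DISABLE
#define PR_GET_THP_DISABLE 42
#endif

/*
 * Query the flag using the convention in this RFC: the kernel stores
 * 0 or 1 in *disabled. Returns 0 on success, -1 with errno set on error.
 */
int get_thp_disable_rfc(int *disabled)
{
    return prctl(PR_GET_THP_DISABLE, (unsigned long)disabled, 0UL, 0UL, 0UL);
}

/* Human-readable name for the flag value. */
const char *thp_state_name(int disabled)
{
    return disabled ? "disabled" : "enabled";
}

/* Print the current state, or the error if the query is not supported. */
void report_thp_state(void)
{
    int disabled = 0;

    if (get_thp_disable_rfc(&disabled) == 0)
        printf("THP is %s for this process\n", thp_state_name(disabled));
    else
        perror("prctl(PR_GET_THP_DISABLE)");
}
```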