This patch improves THP collapse rates by allowing zero pages.

Currently THP can collapse 4kB pages into a THP when there
are up to khugepaged_max_ptes_none pte_none ptes in a 2MB
range. This patch counts pte_none ptes and mapped zero pages
with the same variable.

The patch was tested with a program that allocates 800MB of
memory, and performs interleaved reads and writes, in a pattern
that causes some 2MB areas to first see read accesses, resulting
in the zero pfn being mapped there.

To simulate memory fragmentation at allocation time, I modified
do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read
faults.

Without the patch, only 50% of the program was collapsed into
THP and the percentage did not increase over time.

With this patch, after 10 minutes of waiting, khugepaged had
collapsed 99% of the program's memory.

Signed-off-by: Ebru Akagunduz <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
Changes in v2:
- Check zero pfn in release_pte_pages() (Andrea Arcangeli)
mm/huge_memory.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e08e37a..a87a691 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2139,7 +2139,7 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte)
 {
 	while (--_pte >= pte) {
 		pte_t pteval = *_pte;
-		if (!pte_none(pteval))
+		if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)))
 			release_pte_page(pte_page(pteval));
 	}
 }
@@ -2150,13 +2150,13 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 {
 	struct page *page;
 	pte_t *_pte;
-	int none = 0;
+	int none_or_zero = 0;
 	bool referenced = false, writable = false;
 	for (_pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
-		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			if (++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out;
@@ -2237,7 +2237,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 		pte_t pteval = *_pte;
 		struct page *src_page;
 
-		if (pte_none(pteval)) {
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
 		} else {
@@ -2573,7 +2573,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
-	int ret = 0, none = 0;
+	int ret = 0, none_or_zero = 0;
 	struct page *page;
 	unsigned long _address;
 	spinlock_t *ptl;
@@ -2591,8 +2591,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = *_pte;
-		if (pte_none(pteval)) {
-			if (++none <= khugepaged_max_ptes_none)
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			if (++none_or_zero <= khugepaged_max_ptes_none)
 				continue;
 			else
 				goto out_unmap;
--
1.9.1
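
The test program itself is not part of this posting. A minimal sketch of the kind of access pattern the description mentions could look like the following. It is only an illustration: the function name, the 4kB/2MB constants and the "write every other page" choice are assumptions, and unlike the described test it read-faults every 2MB area rather than only some of them.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define PAGE_SZ   4096UL               /* assumed base page size */
#define HPAGE_SZ  (2UL * 1024 * 1024)  /* assumed huge page size */

/*
 * Hypothetical helper: map 'len' bytes of anonymous memory, read every
 * page of each 2MB area first (a read fault on untouched anonymous
 * memory normally maps the shared zero page), then write to every other
 * page. Returns the number of pages written, or 0 if mmap fails.
 */
static size_t touch_read_then_write(size_t len)
{
	volatile char *buf;
	size_t area, off, written = 0;
	char sink = 0;

	assert(len > 0 && len % HPAGE_SZ == 0);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	for (area = 0; area < len; area += HPAGE_SZ) {
		/* Read pass: leaves zero-page mappings behind. */
		for (off = 0; off < HPAGE_SZ; off += PAGE_SZ)
			sink ^= buf[area + off];
		/* Write pass: half of the pages get real anonymous pages. */
		for (off = 0; off < HPAGE_SZ; off += 2 * PAGE_SZ) {
			buf[area + off] = 1;
			written++;
		}
	}
	(void)sink;

	munmap((void *)buf, len);
	return written;
}
```

A real test would keep the mapping alive and sleep instead of calling munmap(), so that khugepaged has time to scan the range.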
On Wed, Feb 11, 2015 at 11:03:55PM +0200, Ebru Akagunduz wrote:
> Changes in v2:
> - Check zero pfn in release_pte_pages() (Andrea Arcangeli)
.. and in:
> @@ -2237,7 +2237,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> pte_t pteval = *_pte;
> struct page *src_page;
>
> - if (pte_none(pteval)) {
> + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> clear_user_highpage(page, address);
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> } else {
__collapse_huge_page_copy, both were needed as far as I can tell.
I haven't tested it but it's looking good.
Reviewed-by: Andrea Arcangeli <[email protected]>
On Wed, Feb 11, 2015 at 11:16:00PM +0100, Andrea Arcangeli wrote:
> On Wed, Feb 11, 2015 at 11:03:55PM +0200, Ebru Akagunduz wrote:
> > Changes in v2:
> > - Check zero pfn in release_pte_pages() (Andrea Arcangeli)
>
> .. and in:
>
> > @@ -2237,7 +2237,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > pte_t pteval = *_pte;
> > struct page *src_page;
> >
> > - if (pte_none(pteval)) {
> > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > clear_user_highpage(page, address);
> > add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > } else {
>
> __collapse_huge_page_copy, both were needed as far as I can tell.
The is_zero_pfn(pte_pfn(pteval)) check in __collapse_huge_page_copy() was
already in the original patch.
> I haven't tested it but it's looking good.
>
> Reviewed-by: Andrea Arcangeli <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Thu, Feb 12, 2015 at 12:21:40AM +0200, Kirill A. Shutemov wrote:
> On Wed, Feb 11, 2015 at 11:16:00PM +0100, Andrea Arcangeli wrote:
> > On Wed, Feb 11, 2015 at 11:03:55PM +0200, Ebru Akagunduz wrote:
> > > Changes in v2:
> > > - Check zero pfn in release_pte_pages() (Andrea Arcangeli)
> >
> > .. and in:
> >
> > > @@ -2237,7 +2237,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
> > > pte_t pteval = *_pte;
> > > struct page *src_page;
> > >
> > > - if (pte_none(pteval)) {
> > > + if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > > clear_user_highpage(page, address);
> > > add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> > > } else {
> >
> > __collapse_huge_page_copy, both were needed as far as I can tell.
>
> The is_zero_pfn(pte_pfn(pteval)) check in __collapse_huge_page_copy() was
> already in the original patch.
OK, that clarifies things.
On 02/11/2015 10:03 PM, Ebru Akagunduz wrote:
> This patch improves THP collapse rates, by allowing zero pages.
>
> Currently THP can collapse 4kB pages into a THP when there
> are up to khugepaged_max_ptes_none pte_none ptes in a 2MB
> range. This patch counts pte none and mapped zero pages
> with the same variable.
>
> The patch was tested with a program that allocates 800MB of
> memory, and performs interleaved reads and writes, in a pattern
> that causes some 2MB areas to first see read accesses, resulting
> in the zero pfn being mapped there.
>
> To simulate memory fragmentation at allocation time, I modified
> do_huge_pmd_anonymous_page to return VM_FAULT_FALLBACK for read
> faults.
>
> Without the patch, only 50% of the program was collapsed into
> THP and the percentage did not increase over time.
>
> With this patch, after 10 minutes of waiting, khugepaged had
> collapsed 99% of the program's memory.
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On Wed, 11 Feb 2015 23:03:55 +0200 Ebru Akagunduz <[email protected]> wrote:
> This patch improves THP collapse rates, by allowing zero pages.
>
> Currently THP can collapse 4kB pages into a THP when there
> are up to khugepaged_max_ptes_none pte_none ptes in a 2MB
> range. This patch counts pte none and mapped zero pages
> with the same variable.
So if I'm understanding this correctly, with the default value of
khugepaged_max_ptes_none (HPAGE_PMD_NR-1), if an application creates a
2MB area which contains 511 mappings of the zero page and one real
page, the kernel will proceed to turn that area into a real, physical
huge page. So it consumes 2MB of memory which would not have
previously been allocated?
If so, this might be rather undesirable behaviour in some situations
(and ditto the current behaviour for pte_none ptes)?
This can be tuned by adjusting khugepaged_max_ptes_none, but not many
people are likely to do that because we didn't document the damn thing.
At all. Can we please rectify this, and update it for the is_zero_pfn
feature? The documentation should include an explanation telling
people how to decide what setting to use, how to observe its effects,
etc.
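
As a starting point for observing the setting, the tunable is exposed as max_ptes_none in the khugepaged sysfs directory, next to the pages_collapsed counter. The helpers below are a hedged sketch for reading such single-value files from userspace; the helper names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define KHUGEPAGED_DIR "/sys/kernel/mm/transparent_hugepage/khugepaged/"

/* Parse a decimal number such as "511\n". Returns 0 on success, -1 on error. */
static int parse_long(const char *s, long *out)
{
	char *end;
	long v;

	assert(s != NULL && out != NULL);

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno != 0 || end == s)
		return -1;
	*out = v;
	return 0;
}

/* Read one numeric value from a sysfs-style file. Returns 0 or -1. */
static int read_long_file(const char *path, long *out)
{
	char buf[64];
	FILE *f = fopen(path, "r");
	int rc = -1;

	if (f == NULL)
		return -1;
	if (fgets(buf, sizeof(buf), f) != NULL)
		rc = parse_long(buf, out);
	fclose(f);
	return rc;
}
```

For example, read_long_file(KHUGEPAGED_DIR "max_ptes_none", &v) fetches the current limit, and sampling KHUGEPAGED_DIR "pages_collapsed" before and after a workload gives a rough view of how much khugepaged collapsed in between.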
On 02/18/2015 06:31 PM, Andrew Morton wrote:
> On Wed, 11 Feb 2015 23:03:55 +0200 Ebru Akagunduz
> <[email protected]> wrote:
>
>> This patch improves THP collapse rates, by allowing zero pages.
>>
>> Currently THP can collapse 4kB pages into a THP when there are up
>> to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This
>> patch counts pte none and mapped zero pages with the same
>> variable.
>
> So if I'm understanding this correctly, with the default value of
> khugepaged_max_ptes_none (HPAGE_PMD_NR-1), if an application
> creates a 2MB area which contains 511 mappings of the zero page and
> one real page, the kernel will proceed to turn that area into a
> real, physical huge page. So it consumes 2MB of memory which would
> not have previously been allocated?
This is equivalent to an application doing a write fault
to a 2MB area that was previously untouched, going into
do_huge_pmd_anonymous_page() and receiving a 2MB page.
> If so, this might be rather undesirable behaviour in some
> situations (and ditto the current behaviour for pte_none ptes)?
>
> This can be tuned by adjusting khugepaged_max_ptes_none,
The example of directly going into do_huge_pmd_anonymous_page()
is not influenced by the tunable.
It may indeed be undesirable in some situations, but I am
not sure how to detect those...
--
All rights reversed
On 02/19/2015 01:08 AM, Rik van Riel wrote:
> On 02/18/2015 06:31 PM, Andrew Morton wrote:
>> On Wed, 11 Feb 2015 23:03:55 +0200 Ebru Akagunduz
>> <[email protected]> wrote:
>
>>> This patch improves THP collapse rates, by allowing zero pages.
>>>
>>> Currently THP can collapse 4kB pages into a THP when there are up
>>> to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This
>>> patch counts pte none and mapped zero pages with the same
>>> variable.
>
>> So if I'm understanding this correctly, with the default value of
>> khugepaged_max_ptes_none (HPAGE_PMD_NR-1), if an application
>> creates a 2MB area which contains 511 mappings of the zero page and
>> one real page, the kernel will proceed to turn that area into a
>> real, physical huge page. So it consumes 2MB of memory which would
>> not have previously been allocated?
>
> This is equivalent to an application doing a write fault
> to a 2MB area that was previously untouched, going into
> do_huge_pmd_anonymous_page() and receiving a 2MB page.
>
>> If so, this might be rather undesirable behaviour in some
>> situations (and ditto the current behaviour for pte_none ptes)?
>
>> This can be tuned by adjusting khugepaged_max_ptes_none,
>
> The example of directly going into do_huge_pmd_anonymous_page()
> is not influenced by the tunable.
>
> It may indeed be undesirable in some situations, but I am
> not sure how to detect those...
Well, yeah. We seem to lack a setting to restrict page fault THP allocations to
e.g. madvise, while still letting khugepaged collapse them later, taking
khugepaged_max_ptes_none into account.
On Wed, Feb 18, 2015 at 03:31:19PM -0800, Andrew Morton wrote:
> On Wed, 11 Feb 2015 23:03:55 +0200 Ebru Akagunduz <[email protected]> wrote:
>
> > This patch improves THP collapse rates, by allowing zero pages.
> >
> > Currently THP can collapse 4kB pages into a THP when there
> > are up to khugepaged_max_ptes_none pte_none ptes in a 2MB
> > range. This patch counts pte none and mapped zero pages
> > with the same variable.
>
> So if I'm understanding this correctly, with the default value of
> khugepaged_max_ptes_none (HPAGE_PMD_NR-1), if an application creates a
> 2MB area which contains 511 mappings of the zero page and one real
> page, the kernel will proceed to turn that area into a real, physical
> huge page. So it consumes 2MB of memory which would not have
> previously been allocated?
Correct.
>
> If so, this might be rather undesirable behaviour in some situations
> (and ditto the current behaviour for pte_none ptes)?
>
> This can be tuned by adjusting khugepaged_max_ptes_none, but not many
> people are likely to do that because we didn't document the damn thing.
khugepaged checks !hugepage_vma_check, so those apps that don't want
it can opt out with MADV_NOHUGEPAGE. The sysfs control allows tuning
the default behavior.
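
For reference, the opt-out mentioned above is a single madvise() call on the range. A minimal sketch, with a made-up wrapper name:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical wrapper: ask the kernel not to back [addr, addr + len)
 * with transparent huge pages. 'addr' must be page aligned.
 * Returns 0 on success or a negative errno value (for example -EINVAL
 * on a kernel built without THP support).
 */
static int opt_out_of_thp(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);

	if (madvise(addr, len, MADV_NOHUGEPAGE) != 0)
		return -errno;
	return 0;
}
```

The call marks the affected VMAs, so it covers both huge page allocation at fault time and later collapse by khugepaged.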
> At all. Can we please rectify this, and update it for the is_zero_pfn
> feature? The documentation should include an explanation telling
> people how to decide what setting to use, how to observe its effects,
> etc.
Agreed, documentation for the sysfs control would be good to have
indeed.
In the meantime I've got a more urgent issue, for which the fix is
appended below.
Thanks,
Andrea
==
From aaa03f8c142c9a486e3e49de80f52d01a930ba3d Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Fri, 20 Feb 2015 18:08:57 +0100
Subject: [PATCH] mm: incorporate zero pages into transparent huge pages fix
After applying the "incorporate zero pages into transparent huge
pages" feature, I've got an oops on an overnight stress test:
------------[ cut here ]------------
kernel BUG at mm/huge_memory.c:1920!
invalid opcode: 0000 [#1] SMP
Modules linked in: tun usbhid snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic kvm_intel kvm snd_hda_intel crc32c_intel snd_hda_controller ghash_clmulni_intel xhci_pci snd_hda_codec xhci_hcd ehci_pci ehci_hcd snd_pcm usbcore psmouse sr_mod snd_timer snd cdrom pcspkr usb_common [last unloaded: microcode]
CPU: 4 PID: 4250 Comm: Analysis Helper Not tainted 3.19.0+ #5
Hardware name: /DH61BE, BIOS BEH6110H.86A.0120.2013.1112.1412 11/12/2013
task: ffff88040c520840 ti: ffff880406070000 task.ti: ffff880406070000
RIP: 0010:[<ffffffff811ad362>] [<ffffffff811ad362>] split_huge_page_to_list+0x6a2/0x7c0
RSP: 0018:ffff880406073c58 EFLAGS: 00010282
RAX: 8000000163b2f067 RBX: ffff880406ac5f90 RCX: ffff880404f70978
RDX: ffffea0000000000 RSI: ffff880404f70000 RDI: 8000000163b2f047
RBP: ffff880408de25c0 R08: 00000000058ecbc0 R09: 00007f2dfe52f000
R10: 0000000000000000 R11: 00007f2dfe600000 R12: 00007f2dfe400000
R13: 0000000404f70067 R14: ffffc00000000fff R15: ffff8800d04da2e0
FS: 00007f2e12168700(0000) GS:ffff88041f300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f03b9871000 CR3: 0000000407c9d000 CR4: 00000000000427e0
Stack:
ffff880406073d08 00000007f2dfe400 00007f2d00000001 ffff880407f2dcd0
ffffea00058e8000 ffff880400000000 00000000058e8000 ffffffff81b8bf40
0000000000000004 ffffea00101ab170 ffff880408d273e8 ffff880406ac5f90
Call Trace:
[<ffffffff811add6a>] ? __split_huge_page_pmd+0xfa/0x2a0
[<ffffffff811838b2>] ? unmap_single_vma+0x2b2/0x810
[<ffffffff810c658b>] ? try_to_wake_up+0xbb/0x2d0
[<ffffffff81104df8>] ? get_futex_key+0x1c8/0x2c0
[<ffffffff81184969>] ? zap_page_range+0x89/0xe0
[<ffffffff81187150>] ? handle_mm_fault+0xe70/0x1110
[<ffffffff810cdd2e>] ? set_next_entity+0x4e/0x60
[<ffffffff811899bc>] ? find_vma+0x5c/0x70
[<ffffffff81195be3>] ? SyS_madvise+0x4f3/0x760
[<ffffffff8177d192>] ? system_call_fastpath+0x12/0x17
Code: ff ff 0f 1f 80 00 00 00 00 c7 44 24 10 00 00 00 00 e9 ef fa ff ff b8 01 00 00 00 e9 3f fb ff ff 48 83 c8 40 e9 f7 fe ff ff 0f 0b <0f> 0b 48 c7 c6 28 b8 a0 81 4c 89 f7 e8 8d 3c fd ff 0f 0b 48 8b
RIP [<ffffffff811ad362>] split_huge_page_to_list+0x6a2/0x7c0
RSP <ffff880406073c58>
---[ end trace 6ca92529e1de43ba ]---
The oops happens here:
BUG_ON(!pte_none(*pte));
In short, when we do the split_huge_page_map we withdraw a pgtable from
the pgtable deposit of the MM and we find that it isn't fully zero.
That is most certainly because we didn't clear the pte if it mapped a
zero page before putting the pgtable in the deposit. This adds the
pte_clear to fix the bug.
The PT lock doesn't actually need to be taken, as the pte is already
private to us and not visible to any other CPU (we'll be adding it to
the deposit later), but because it's private the lock can't create any
contention. Considering that the paravirt calls (which should also be
superfluous) may end up being invoked and may make assumptions, I thought
it was safer to keep the locking protocol the same, even if the
pgtable is already private. In order to drop the lock, however, we
should drop it from the other path too. If we want to optimize away the
lock from both branches, it's better to do it in a separate patch.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a63da02..f0207cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2205,6 +2205,18 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
 			clear_user_highpage(page, address);
 			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
+			if (is_zero_pfn(pte_pfn(pteval))) {
+				/*
+				 * ptl mostly unnecessary.
+				 */
+				spin_lock(ptl);
+				/*
+				 * paravirt calls inside pte_clear here are
+				 * superfluous.
+				 */
+				pte_clear(vma->vm_mm, address, _pte);
+				spin_unlock(ptl);
+			}
 		} else {
 			src_page = pte_page(pteval);
 			copy_user_highpage(page, src_page, address, vma);
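
The failure mode described in the fix can be illustrated with a toy userspace model. This is a sketch only: the array of `unsigned long` and the TOY_* encodings are invented stand-ins for a pte page, and no locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PTES      512
#define TOY_PTE_NONE 0UL
#define TOY_PTE_ZERO 1UL  /* hypothetical encoding of a zero-page mapping */

/*
 * Toy version of the copy step: regular entries are always cleared
 * (as the existing else branch does), zero-page entries only when
 * 'clear_zero' is set, which corresponds to the pte_clear() added by
 * the fix.
 */
static void toy_collapse_copy(unsigned long *pte, bool clear_zero)
{
	size_t i;

	assert(pte != NULL);

	for (i = 0; i < NR_PTES; i++) {
		if (pte[i] == TOY_PTE_NONE)
			continue;
		if (pte[i] == TOY_PTE_ZERO && !clear_zero)
			continue;  /* pre-fix behaviour: stale entry stays */
		pte[i] = TOY_PTE_NONE;
	}
}

/* Stand-in for the BUG_ON(!pte_none(*pte)) check done at split time. */
static bool toy_pgtable_is_empty(const unsigned long *pte)
{
	size_t i;

	for (i = 0; i < NR_PTES; i++)
		if (pte[i] != TOY_PTE_NONE)
			return false;
	return true;
}
```

Without the clear, a single zero-page entry is enough to leave the deposited table non-empty, which is what the split path later trips over.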
On Wed, 18 Feb 2015 19:08:12 -0500 Rik van Riel <[email protected]> wrote:
> On 02/18/2015 06:31 PM, Andrew Morton wrote:
> > On Wed, 11 Feb 2015 23:03:55 +0200 Ebru Akagunduz
> > <[email protected]> wrote:
> >
> >> This patch improves THP collapse rates, by allowing zero pages.
> >>
> >> Currently THP can collapse 4kB pages into a THP when there are up
> >> to khugepaged_max_ptes_none pte_none ptes in a 2MB range. This
> >> patch counts pte none and mapped zero pages with the same
> >> variable.
> >
> > So if I'm understanding this correctly, with the default value of
> > khugepaged_max_ptes_none (HPAGE_PMD_NR-1), if an application
> > creates a 2MB area which contains 511 mappings of the zero page and
> > one real page, the kernel will proceed to turn that area into a
> > real, physical huge page. So it consumes 2MB of memory which would
> > not have previously been allocated?
>
> This is equivalent to an application doing a write fault
> to a 2MB area that was previously untouched, going into
> do_huge_pmd_anonymous_page() and receiving a 2MB page.
>
> > If so, this might be rather undesirable behaviour in some
> > situations (and ditto the current behaviour for pte_none ptes)?
> >
> > This can be tuned by adjusting khugepaged_max_ptes_none,
>
> The example of directly going into do_huge_pmd_anonymous_page()
> is not influenced by the tunable.
>
> It may indeed be undesirable in some situations, but I am
> not sure how to detect those...
Here's a live one: https://bugzilla.kernel.org/show_bug.cgi?id=93111
Application does MADV_DONTNEED to free up a load of memory and then
khugepaged comes along and pages that memory back in again. It seems a
bit silly to do this after userspace has deliberately discarded those
pages!
Presumably MADV_NOHUGEPAGE can be used to prevent this, but it's a bit
of a hand-grenade. I guess the MADV_DONTNEED manpage should be updated
to explain all this?
On 02/23/2015 02:16 PM, Andrew Morton wrote:
> On Wed, 18 Feb 2015 19:08:12 -0500 Rik van Riel <[email protected]>
> wrote:
>>> If so, this might be rather undesirable behaviour in some
>>> situations (and ditto the current behaviour for pte_none
>>> ptes)?
>>>
>>> This can be tuned by adjusting khugepaged_max_ptes_none,
> Here's a live one:
> https://bugzilla.kernel.org/show_bug.cgi?id=93111
>
> Application does MADV_DONTNEED to free up a load of memory and
> then khugepaged comes along and pages that memory back in again.
> It seems a bit silly to do this after userspace has deliberately
> discarded those pages!
>
> Presumably MADV_NOHUGEPAGE can be used to prevent this, but it's a
> bit of a hand-grenade. I guess the MADV_DONTNEED manpage should be
> updated to explain all this?
That makes me wonder what a good value for khugepaged_max_ptes_none
would be.
Doubling the amount of memory a program uses seems quite unreasonable.
Increasing the amount of memory a program uses by 512x seems totally
unreasonable.
Increasing the amount of memory a program uses by 20% might be
reasonable, if that much memory is available, since that seems to
be about how much performance improvement we have ever seen from
THP.
Andrew, Andrea, do you have any ideas on this?
Is this something to just set, or should we ask Ebru to run
a few different tests with this?
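
To put numbers on the percentages above: if a 2MB range with p present pages is collapsed, its footprint grows by a factor of 512/p in the worst case. The sketch below turns an allowed worst-case growth into a max_ptes_none value; the function name and the per-range worst-case interpretation are assumptions made for illustration.

```c
#include <assert.h>

#define HPAGE_PMD_NR 512  /* 2MB / 4kB */

/*
 * Largest max_ptes_none such that collapsing a single 2MB range grows
 * its memory use by at most 'overhead_pct' percent, assuming every
 * counted pte in the range is a hole or a zero page.
 */
static int max_ptes_none_for_overhead(unsigned long overhead_pct)
{
	unsigned long total = 100UL * HPAGE_PMD_NR;
	unsigned long denom = 100UL + overhead_pct;
	unsigned long min_present;

	/* 51100% is the 512x case: one present page out of 512. */
	assert(overhead_pct <= 100UL * (HPAGE_PMD_NR - 1));

	/* ceil(512 * 100 / (100 + overhead_pct)) pages must be present. */
	min_present = (total + denom - 1) / denom;
	return (int)(HPAGE_PMD_NR - min_present);
}
```

Under these assumptions a 20% bound works out to max_ptes_none = 85, doubling corresponds to 256, and the current default of 511 is the 512x case.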
--
All rights reversed
On 23.2.2015 20:43, Rik van Riel wrote:
> On 02/23/2015 02:16 PM, Andrew Morton wrote:
>> On Wed, 18 Feb 2015 19:08:12 -0500 Rik van Riel <[email protected]>
>> wrote:
>>>> If so, this might be rather undesirable behaviour in some
>>>> situations (and ditto the current behaviour for pte_none
>>>> ptes)?
>>>>
>>>> This can be tuned by adjusting khugepaged_max_ptes_none,
>> Here's a live one:
>> https://bugzilla.kernel.org/show_bug.cgi?id=93111
>>
>> Application does MADV_DONTNEED to free up a load of memory and
>> then khugepaged comes along and pages that memory back in again.
>> It seems a bit silly to do this after userspace has deliberately
>> discarded those pages!
OK, that's a nice example of how a more conservative default for
max_ptes_none would make sense even with the current aggressive
THP faulting.
>> Presumably MADV_NOHUGEPAGE can be used to prevent this, but it's a
>> bit of a hand-grenade. I guess the MADV_DONTNEED manpage should be
>> updated to explain all this?
Probably, together with the tunable documentation. Seems like we
didn't add enough details to the madvise manpage in the recent round :)
> That makes me wonder what a good value for khugepaged_max_ptes_none
> would be.
>
> Doubling the amount of memory a program uses seems quite unreasonable.
>
> Increasing the amount of memory a program uses by 512x seems totally
> unreasonable.
>
> Increasing the amount of memory a program uses by 20% might be
> reasonable, if that much memory is available, since that seems to
> be about how much performance improvement we have ever seen from
> THP.
>
> Andrew, Andrea, do you have any ideas on this?
>
> Is this something to just set, or should we ask Ebru to run
> a few different tests with this?
If there is a good test for this, sure.