LinuxLists.cc - [PATCH] mm: set khugepaged_max_ptes_none by 1/8 of HPAGE_PMD

2015-02-27 18:27:10

Subject: [PATCH] mm: set khugepaged_max_ptes_none by 1/8 of HPAGE_PMD_NR

Using THP, programs can access memory faster, by having the
kernel collapse small pages into large pages. The parameter
max_ptes_none specifies how many extra small pages (that are
not already mapped) can be allocated when collapsing a group
of small pages into one large page.

A larger value of max_ptes_none can cause the kernel
to collapse more incomplete areas into THPs, speeding
up memory access at the cost of increased memory use.
A smaller value of max_ptes_none will reduce memory
waste, at the expense of collapsing fewer areas into
THPs.

The problem was reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=93111

Signed-off-by: Ebru Akagunduz <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/huge_memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e08e37a..497fb5a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -59,11 +59,10 @@ static DEFINE_MUTEX(khugepaged_mutex);
 static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 /*
- * default collapse hugepages if there is at least one pte mapped like
- * it would have happened if the vma was large enough during page
- * fault.
+ * The default value should be a compromise between memory use and THP speedup.
+ * To collapse hugepages, unmapped ptes should not exceed 1/8 of HPAGE_PMD_NR.
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
+static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR/8;

 static int khugepaged(void *none);
 static int khugepaged_slab_init(void);
--
1.9.1

2015-02-27 20:53:42

by David Rientjes

Subject: Re: [PATCH] mm: set khugepaged_max_ptes_none by 1/8 of HPAGE_PMD_NR

On Fri, 27 Feb 2015, Ebru Akagunduz wrote:

> Using THP, programs can access memory faster, by having the
> kernel collapse small pages into large pages. The parameter
> max_ptes_none specifies how many extra small pages (that are
> not already mapped) can be allocated when collapsing a group
> of small pages into one large page.
>

Not exactly, khugepaged isn't "allocating" small pages to collapse into a
hugepage, rather it is allocating a hugepage and then remapping the
pageblock's mapped pages.

> A larger value of max_ptes_none can cause the kernel
> to collapse more incomplete areas into THPs, speeding
> up memory access at the cost of increased memory use.
> A smaller value of max_ptes_none will reduce memory
> waste, at the expense of collapsing fewer areas into
> THPs.
>

This changelog only describes what max_ptes_none does, it doesn't state
why you want to change it from HPAGE_PMD_NR-1, which is 511 on x86_64
(largest value, more thp), to HPAGE_PMD_NR/8, which is 64 (smaller value,
less thp, less rss as a result of collapsing).

This has particular performance implications on users who already have thp
enabled, so it's difficult to change the default. This is a tunable that
you could easily set in an initscript, so I don't think we need to change
the value for everybody.

> The problem was reported here:
> https://bugzilla.kernel.org/show_bug.cgi?id=93111
>
> Signed-off-by: Ebru Akagunduz <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> ---
> mm/huge_memory.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e08e37a..497fb5a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -59,11 +59,10 @@ static DEFINE_MUTEX(khugepaged_mutex);
> static DEFINE_SPINLOCK(khugepaged_mm_lock);
> static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
> /*
> - * default collapse hugepages if there is at least one pte mapped like
> - * it would have happened if the vma was large enough during page
> - * fault.
> + * The default value should be a compromise between memory use and THP speedup.
> + * To collapse hugepages, unmapped ptes should not exceed 1/8 of HPAGE_PMD_NR.
> */
> -static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
> +static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR/8;
>
> static int khugepaged(void *none);
> static int khugepaged_slab_init(void);

2015-02-27 20:57:30

by Rik van Riel

Subject: Re: [PATCH] mm: set khugepaged_max_ptes_none by 1/8 of HPAGE_PMD_NR

On 02/27/2015 03:53 PM, David Rientjes wrote:
> On Fri, 27 Feb 2015, Ebru Akagunduz wrote:
>
>> Using THP, programs can access memory faster, by having the
>> kernel collapse small pages into large pages. The parameter
>> max_ptes_none specifies how many extra small pages (that are
>> not already mapped) can be allocated when collapsing a group
>> of small pages into one large page.
>>
>
> Not exactly, khugepaged isn't "allocating" small pages to collapse into a
> hugepage, rather it is allocating a hugepage and then remapping the
> pageblock's mapped pages.

How would you describe the amount of extra memory
allocated, as a result of converting a partially
mapped 2MB area into a THP?

It is not physically allocating 4kB pages, but
I would like to keep the text understandable to
people who do not know the THP internals.

>> A larger value of max_ptes_none can cause the kernel
>> to collapse more incomplete areas into THPs, speeding
>> up memory access at the cost of increased memory use.
>> A smaller value of max_ptes_none will reduce memory
>> waste, at the expense of collapsing fewer areas into
>> THPs.
>>
>
> This changelog only describes what max_ptes_none does, it doesn't state
> why you want to change it from HPAGE_PMD_NR-1, which is 511 on x86_64
> (largest value, more thp), to HPAGE_PMD_NR/8, which is 64 (smaller value,
> less thp, less rss as a result of collapsing).
>
> This has particular performance implications on users who already have thp
> enabled, so it's difficult to change the default. This is a tunable that
> you could easily set in an initscript, so I don't think we need to change
> the value for everybody.

I think we do need to change the default.

Why? See this bug:

>> The problem was reported here:
>> https://bugzilla.kernel.org/show_bug.cgi?id=93111

Now, there may be a better value than HPAGE_PMD_NR/8, but
I am not sure what it would be, or why.

I do know that HPAGE_PMD_NR-1 results in undesired behaviour,
as seen in the bug above...

--
All rights reversed

2015-02-27 21:12:58

by David Rientjes

Subject: Re: [PATCH] mm: set khugepaged_max_ptes_none by 1/8 of HPAGE_PMD_NR

On Fri, 27 Feb 2015, Rik van Riel wrote:

> >> Using THP, programs can access memory faster, by having the
> >> kernel collapse small pages into large pages. The parameter
> >> max_ptes_none specifies how many extra small pages (that are
> >> not already mapped) can be allocated when collapsing a group
> >> of small pages into one large page.
> >>
> >
> > Not exactly, khugepaged isn't "allocating" small pages to collapse into a
> > hugepage, rather it is allocating a hugepage and then remapping the
> > pageblock's mapped pages.
>
> How would you describe the amount of extra memory
> allocated, as a result of converting a partially
> mapped 2MB area into a THP?
>
> It is not physically allocating 4kB pages, but
> I would like to keep the text understandable to
> people who do not know the THP internals.
>

I would say it specifies how much unmapped memory can become mapped by a
hugepage.

> I think we do need to change the default.
>
> Why? See this bug:
>
> >> The problem was reported here:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=93111
>
> Now, there may be a better value than HPAGE_PMD_NR/8, but
> I am not sure what it would be, or why.
>
> I do know that HPAGE_PMD_NR-1 results in undesired behaviour,
> as seen in the bug above...
>

I know that the value of 64 would also be undesirable for Google since we
tightly constrain memory usage; we have used max_ptes_none == 0 since it
was introduced. We can get away with that because our malloc() is
modified to periodically give large contiguous ranges of memory back to
the system, also using madvise(MADV_DONTNEED), and tries to avoid
splitting thp memory.

The value is determined by how the system will be used: do you tightly
constrain memory usage and not allow any unmapped memory to be collapsed
into a hugepage, or do you have an abundance of memory and really want an
aggressive value like HPAGE_PMD_NR-1? Depending on the properties of the
system, you can tune this to anything you want just like we do in
initscripts.

I'm only concerned here about changing a default that has been around for
four years and the possibly negative implications that will have on users
who never touch this value. They undoubtedly get less memory backed by
thp, and that can lead to a performance regression. So if this patch is
merged and we get a bug report for the 4.1 kernel, do we tell that user
that we changed behavior out from under them and to adjust the tunable
back to HPAGE_PMD_NR-1?

Meanwhile, the bug report you cite has a workaround that has always been
available for thp kernels:
# echo 64 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none