2014-07-15 01:09:42

by David Rientjes

Subject: [patch] mm, thp: only collapse hugepages to nodes with affinity

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic, which allocated a
transparent hugepage from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory, which may
suffer from additional access latency. With the current policy, up to 255 of
the 512 base pages spanned by a PMD-sized hugepage (with PAGE_SHIFT == 12)
can be collapsed remotely if the majority are allocated from the target node.

Introduce a strict requirement that pages can only be collapsed to nodes
at or below RECLAIM_DISTANCE to ensure the access latency of the pages
scanned does not regress.

Signed-off-by: David Rientjes <[email protected]>
---
mm/huge_memory.c | 54 ++++++++++++------------------------------------------
1 file changed, 12 insertions(+), 42 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2231,34 +2231,7 @@ static void khugepaged_alloc_sleep(void)
msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
}

-static int khugepaged_node_load[MAX_NUMNODES];
-
#ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
-{
- static int last_khugepaged_target_node = NUMA_NO_NODE;
- int nid, target_node = 0, max_value = 0;
-
- /* find first node with max normal pages hit */
- for (nid = 0; nid < MAX_NUMNODES; nid++)
- if (khugepaged_node_load[nid] > max_value) {
- max_value = khugepaged_node_load[nid];
- target_node = nid;
- }
-
- /* do some balance if several nodes have the same hit record */
- if (target_node <= last_khugepaged_target_node)
- for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
- nid++)
- if (max_value == khugepaged_node_load[nid]) {
- target_node = nid;
- break;
- }
-
- last_khugepaged_target_node = target_node;
- return target_node;
-}
-
static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
{
if (IS_ERR(*hpage)) {
@@ -2309,11 +2282,6 @@ static struct page
return *hpage;
}
#else
-static int khugepaged_find_target_node(void)
-{
- return 0;
-}
-
static inline struct page *alloc_hugepage(int defrag)
{
return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
@@ -2522,7 +2490,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
if (!pmd)
goto out;

- memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
@@ -2538,14 +2505,18 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
page = vm_normal_page(vma, _address, pteval);
if (unlikely(!page))
goto out_unmap;
- /*
- * Record which node the original page is from and save this
- * information to khugepaged_node_load[].
- * Khupaged will allocate hugepage from the node has the max
- * hit record.
- */
- node = page_to_nid(page);
- khugepaged_node_load[node]++;
+ if (node == NUMA_NO_NODE) {
+ node = page_to_nid(page);
+ } else {
+ int distance = node_distance(page_to_nid(page), node);
+
+ /*
+ * Do not migrate to memory that would not be reclaimed
+ * from.
+ */
+ if (distance > RECLAIM_DISTANCE)
+ goto out_unmap;
+ }
VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
goto out_unmap;
@@ -2561,7 +2532,6 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret) {
- node = khugepaged_find_target_node();
/* collapse_huge_page will return with the mmap_sem released */
collapse_huge_page(mm, address, hpage, vma, node);
}


2014-07-15 04:47:19

by Dave Hansen

Subject: Re: [patch] mm, thp: only collapse hugepages to nodes with affinity

On 07/14/2014 06:09 PM, David Rientjes wrote:
> + if (node == NUMA_NO_NODE) {
> + node = page_to_nid(page);
> + } else {
> + int distance = node_distance(page_to_nid(page), node);
> +
> + /*
> + * Do not migrate to memory that would not be reclaimed
> + * from.
> + */
> + if (distance > RECLAIM_DISTANCE)
> + goto out_unmap;
> + }

Isn't the reclaim behavior based on zone_reclaim_mode and not
RECLAIM_DISTANCE directly? And isn't that reclaim behavior disabled by
default?

I think you should at least be consulting zone_reclaim_mode.

2014-07-15 23:17:38

by David Rientjes

Subject: Re: [patch] mm, thp: only collapse hugepages to nodes with affinity

On Mon, 14 Jul 2014, Dave Hansen wrote:

> > + if (node == NUMA_NO_NODE) {
> > + node = page_to_nid(page);
> > + } else {
> > + int distance = node_distance(page_to_nid(page), node);
> > +
> > + /*
> > + * Do not migrate to memory that would not be reclaimed
> > + * from.
> > + */
> > + if (distance > RECLAIM_DISTANCE)
> > + goto out_unmap;
> > + }
>
> Isn't the reclaim behavior based on zone_reclaim_mode and not
> RECLAIM_DISTANCE directly? And isn't that reclaim behavior disabled by
> default?
>

It seems that RECLAIM_DISTANCE has taken on a life of its own as a heuristic
independent of zone_reclaim_mode, such as its use in creating sched domains,
which is unrelated to reclaim.

> I think you should at least be consulting zone_reclaim_mode.
>

Good point, and it matches what the comment is saying about whether we'd
actually reclaim from the remote node to allocate thp on fault or not.
I'll add it.

After this change, we'll also need to consider the behavior of thp at fault
and whether, when local memory is low or fragmented, remote HPAGE_PMD_SIZE
memory is better than local PAGE_SIZE memory. My page fault latency testing
on true NUMA machines convincingly shows that it is not.

This makes me believe that, somewhat similar to this patch, when we
allocate thp memory at fault and zone_reclaim_mode is non-zero, we should
restrict the allowed nodemask to nodes within RECLAIM_DISTANCE of
numa_node_id() and otherwise fall back to the PAGE_SIZE fault path.
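
A rough sketch of what I have in mind (the function name and the exact
allocation call are illustrative only; nothing like this has been posted as
a patch):

static struct page *alloc_hugepage_near(gfp_t gfp)
{
	nodemask_t allowed = NODE_MASK_NONE;
	int nid, local = numa_node_id();

	/* Only allow nodes close enough that zone reclaim would target them. */
	for_each_online_node(nid)
		if (node_distance(local, nid) <= RECLAIM_DISTANCE)
			node_set(nid, allowed);

	/*
	 * On failure the caller would fall back to the PAGE_SIZE fault path
	 * rather than trying more distant nodes.
	 */
	return __alloc_pages_nodemask(gfp, HPAGE_PMD_ORDER,
				      node_zonelist(local, gfp), &allowed);
}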

I've been hesitant to make that exact change, though, because it's a
systemwide setting and I really hope to avoid a prctl() that controls
zone reclaim for a particular process. Perhaps the NUMA balancing work
makes this more dependable.

2014-07-16 00:14:07

by David Rientjes

Subject: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic, which allocated a
transparent hugepage from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may
suffer from additional access latency. With the current policy, it is
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation. In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency. Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <[email protected]>
---
v2: only change behavior for zone_reclaim_mode per Dave Hansen

mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
static int khugepaged_node_load[MAX_NUMNODES];

#ifdef CONFIG_NUMA
+static bool khugepaged_scan_abort(int nid)
+{
+ int i;
+
+ /*
+ * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * allocate memory locally.
+ */
+ if (!zone_reclaim_mode)
+ return false;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!khugepaged_node_load[i])
+ continue;
+ if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ return true;
+ }
+ return false;
+}
+
static int khugepaged_find_target_node(void)
{
static int last_khugepaged_target_node = NUMA_NO_NODE;
@@ -2309,6 +2329,11 @@ static struct page
return *hpage;
}
#else
+static bool khugepaged_scan_abort(int nid)
+{
+ return false;
+}
+
static int khugepaged_find_target_node(void)
{
return 0;
@@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
unsigned long _address;
spinlock_t *ptl;
int node = NUMA_NO_NODE;
+ int last_node = node;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = page_to_nid(page);
+ if (node != last_node) {
+ if (khugepaged_scan_abort(node))
+ goto out_unmap;
+ last_node = node;
+ }
khugepaged_node_load[node]++;
VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

2014-07-16 01:22:50

by Bob Liu

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode


On 07/16/2014 08:13 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency. With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation. In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency. Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> v2: only change behavior for zone_reclaim_mode per Dave Hansen
>
> mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
> static int khugepaged_node_load[MAX_NUMNODES];
>
> #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> + int i;
> +
> + /*
> + * If zone_reclaim_mode is disabled, then no extra effort is made to
> + * allocate memory locally.
> + */
> + if (!zone_reclaim_mode)
> + return false;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + if (!khugepaged_node_load[i])
> + continue;
> + if (node_distance(nid, i) > RECLAIM_DISTANCE)
> + return true;
> + }
> + return false;
> +}
> +
> static int khugepaged_find_target_node(void)
> {
> static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2329,11 @@ static struct page
> return *hpage;
> }
> #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> + return false;
> +}
> +
> static int khugepaged_find_target_node(void)
> {
> return 0;
> @@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> unsigned long _address;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE;
> + int last_node = node;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> * hit record.
> */
> node = page_to_nid(page);
> + if (node != last_node) {
> + if (khugepaged_scan_abort(node))
> + goto out_unmap;

Nitpick: How about not break the loop but only reset the related
khugepaged_node_load[] to zero. E.g. modify khugepaged_scan_abort() like
this:
if (node_distance(nid, i) > RECLAIM_DISTANCE)
khugepaged_node_load[i] = 0;

By this way, we may have a chance to find a more suitable node.

> + last_node = node;
> + }
> khugepaged_node_load[node]++;
> VM_BUG_ON_PAGE(PageCompound(page), page);
> if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>

--
Regards,
-Bob

2014-07-16 15:38:42

by Vlastimil Babka

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On 07/16/2014 02:13 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency. With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation. In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency. Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> v2: only change behavior for zone_reclaim_mode per Dave Hansen
>
> mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,26 @@ static void khugepaged_alloc_sleep(void)
> static int khugepaged_node_load[MAX_NUMNODES];
>
> #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> + int i;
> +
> + /*
> + * If zone_reclaim_mode is disabled, then no extra effort is made to
> + * allocate memory locally.
> + */
> + if (!zone_reclaim_mode)
> + return false;
> +

I wonder if you could do this here?

if (khugepaged_node_load[nid])
return false;

If the condition is true, it means you already checked the 'nid' node against
all other nodes present in the pmd in a previous khugepaged_scan_pmd iteration.
And if it passed then, it would also pass now. If meanwhile a new node was found
and recorded, it was also checked against everything present at that point,
including 'nid'. So it should be safe?

The worst case (perfect interleaving page per page, so that "node != last_node"
is true in each iteration) complexity then reduces from O(HPAGE_PMD_NR *
MAX_NUMNODES) to O(HPAGE_PMD_NR + MAX_NUMNODES) iterations.

> + for (i = 0; i < MAX_NUMNODES; i++) {
> + if (!khugepaged_node_load[i])
> + continue;
> + if (node_distance(nid, i) > RECLAIM_DISTANCE)
> + return true;
> + }
> + return false;
> +}
> +
> static int khugepaged_find_target_node(void)
> {
> static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2329,11 @@ static struct page
> return *hpage;
> }
> #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> + return false;
> +}
> +
> static int khugepaged_find_target_node(void)
> {
> return 0;
> @@ -2515,6 +2540,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> unsigned long _address;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE;
> + int last_node = node;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> * hit record.
> */
> node = page_to_nid(page);
> + if (node != last_node) {
> + if (khugepaged_scan_abort(node))
> + goto out_unmap;
> + last_node = node;
> + }
> khugepaged_node_load[node]++;
> VM_BUG_ON_PAGE(PageCompound(page), page);
> if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>

2014-07-16 15:47:49

by Vlastimil Babka

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On 07/16/2014 03:22 AM, Bob Liu wrote:
>> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>> * hit record.
>> */
>> node = page_to_nid(page);
>> + if (node != last_node) {
>> + if (khugepaged_scan_abort(node))
>> + goto out_unmap;
>
> Nitpick: How about not break the loop but only reset the related
> khugepaged_node_load[] to zero. E.g. modify khugepaged_scan_abort() like
> this:
> if (node_distance(nid, i) > RECLAIM_DISTANCE)
> khugepaged_node_load[i] = 0;
>
> By this way, we may have a chance to find a more suitable node.

Hm, theoretically there might be a suitable node, but this approach wouldn't
work. By resetting the load to zero you forget that node 'i' was ever seen. If
no more base pages from node 'i' show up, the load stays zero and the next call
with 'nid' will wrongly conclude that 'nid' is OK.

So the correct way would be more complex but I wonder if it's worth the trouble...

>> + last_node = node;
>> + }
>> khugepaged_node_load[node]++;
>> VM_BUG_ON_PAGE(PageCompound(page), page);
>> if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
>>
>

2014-07-16 19:38:51

by Hugh Dickins

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

I shall worry less if you change the Subject from "mm, tmp:" to "mm, thp:"

Hugh :)

2014-07-17 00:49:42

by David Rientjes

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On Wed, 16 Jul 2014, Vlastimil Babka wrote:

> >> @@ -2545,6 +2571,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >> * hit record.
> >> */
> >> node = page_to_nid(page);
> >> + if (node != last_node) {
> >> + if (khugepaged_scan_abort(node))
> >> + goto out_unmap;
> >
> > Nitpick: How about not break the loop but only reset the related
> > khugepaged_node_load[] to zero. E.g. modify khugepaged_scan_abort() like
> > this:
> > if (node_distance(nid, i) > RECLAIM_DISTANCE)
> > khugepaged_node_load[i] = 0;
> >
> > By this way, we may have a chance to find a more suitable node.
>
> Hm theoretically there might be a suitable node, but this approach wouldn't
> work. By resetting it to zero you forget that there ever was node 'i'. If there
> is no more base page from node 'i', the load remains zero and the next call with
> 'nid' will think that 'nid' is OK.
>

Right, the suggestion is wrong because we never want to collapse to a node
whose distance from the source page is > RECLAIM_DISTANCE; that's the entire
point of the patch.

2014-07-17 00:54:30

by David Rientjes

Subject: Re: [patch v2] mm, tmp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On Wed, 16 Jul 2014, Vlastimil Babka wrote:

> I wonder if you could do this here?
>
> if (khugepaged_node_load[nid])
> return false;
>
> If the condition is true, it means you already checked the 'nid' node against
> all other nodes present in the pmd in a previous khugepaged_scan_pmd iteration.
> And if it passed then, it would also pass now. If meanwhile a new node was found
> and recorded, it was also checked against everything present at that point,
> including 'nid'. So it should be safe?
>
> The worst case (perfect interleaving page per page, so that "node != last_node"
> is true in each iteration) complexity then reduces from O(HPAGE_PMD_NR *
> MAX_NUMNODES) to O(HPAGE_PMD_NR + MAX_NUMNODES) iterations.
>

Excellent suggestion, thanks Vlastimil!

2014-07-17 00:59:29

by David Rientjes

Subject: [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic, which allocated a
transparent hugepage from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may
suffer from additional access latency. With the current policy, it is
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation. In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency. Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <[email protected]>
---
v2: only change behavior for zone_reclaim_mode per Dave Hansen
v3: optimization based on previous node counts per Vlastimil Babka

mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2234,6 +2234,30 @@ static void khugepaged_alloc_sleep(void)
static int khugepaged_node_load[MAX_NUMNODES];

#ifdef CONFIG_NUMA
+static bool khugepaged_scan_abort(int nid)
+{
+ int i;
+
+ /*
+ * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * allocate memory locally.
+ */
+ if (!zone_reclaim_mode)
+ return false;
+
+ /* If there is a count for this node already, it must be acceptable */
+ if (khugepaged_node_load[nid])
+ return false;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!khugepaged_node_load[i])
+ continue;
+ if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ return true;
+ }
+ return false;
+}
+
static int khugepaged_find_target_node(void)
{
static int last_khugepaged_target_node = NUMA_NO_NODE;
@@ -2309,6 +2333,11 @@ static struct page
return *hpage;
}
#else
+static bool khugepaged_scan_abort(int nid)
+{
+ return false;
+}
+
static int khugepaged_find_target_node(void)
{
return 0;
@@ -2545,6 +2574,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = page_to_nid(page);
+ if (khugepaged_scan_abort(node))
+ goto out_unmap;
khugepaged_node_load[node]++;
VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

2014-07-17 16:28:48

by Dave Hansen

Subject: Re: [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On 07/16/2014 05:59 PM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency. With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation. In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency. Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> v2: only change behavior for zone_reclaim_mode per Dave Hansen
> v3: optimization based on previous node counts per Vlastimil Babka
>
> mm/huge_memory.c | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2234,6 +2234,30 @@ static void khugepaged_alloc_sleep(void)
> static int khugepaged_node_load[MAX_NUMNODES];
>
> #ifdef CONFIG_NUMA
> +static bool khugepaged_scan_abort(int nid)
> +{
> + int i;
> +
> + /*
> + * If zone_reclaim_mode is disabled, then no extra effort is made to
> + * allocate memory locally.
> + */
> + if (!zone_reclaim_mode)
> + return false;
> +
> + /* If there is a count for this node already, it must be acceptable */
> + if (khugepaged_node_load[nid])
> + return false;
> +
> + for (i = 0; i < MAX_NUMNODES; i++) {
> + if (!khugepaged_node_load[i])
> + continue;
> + if (node_distance(nid, i) > RECLAIM_DISTANCE)
> + return true;
> + }
> + return false;
> +}
> +
> static int khugepaged_find_target_node(void)
> {
> static int last_khugepaged_target_node = NUMA_NO_NODE;
> @@ -2309,6 +2333,11 @@ static struct page
> return *hpage;
> }
> #else
> +static bool khugepaged_scan_abort(int nid)
> +{
> + return false;
> +}

Minor nit: I guess this makes it more explicit, but this #ifdef is
unnecessary in practice because we define zone_reclaim_mode this way:

#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
#else
#define zone_reclaim_mode 0
#endif

Looks fine to me otherwise, though. Definitely addresses the concerns I
had about RECLAIM_DISTANCE being consulted directly.

2014-07-17 21:48:12

by David Rientjes

Subject: [patch v4] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") improved the previous khugepaged logic, which allocated a
transparent hugepage from the node of the first page being collapsed.

However, it is still possible to collapse pages to remote memory which may
suffer from additional access latency. With the current policy, it is
possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
if the majority are allocated from that node.

When zone_reclaim_mode is enabled, it means the VM should make every attempt
to allocate locally to prevent NUMA performance degradation. In this case,
we do not want to collapse hugepages to remote nodes that would suffer from
increased access latency. Thus, when zone_reclaim_mode is enabled, only
allow collapsing to nodes with RECLAIM_DISTANCE or less.

There is no functional change for systems that disable zone_reclaim_mode.

Signed-off-by: David Rientjes <[email protected]>
---
v2: only change behavior for zone_reclaim_mode per Dave Hansen
v3: optimization based on previous node counts per Vlastimil Babka
v4: unconditionally define khugepaged_scan_abort per Dave Hansen
no increase in .text size for CONFIG_NUMA=n

mm/huge_memory.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2233,6 +2233,30 @@ static void khugepaged_alloc_sleep(void)

static int khugepaged_node_load[MAX_NUMNODES];

+static bool khugepaged_scan_abort(int nid)
+{
+ int i;
+
+ /*
+ * If zone_reclaim_mode is disabled, then no extra effort is made to
+ * allocate memory locally.
+ */
+ if (!zone_reclaim_mode)
+ return false;
+
+ /* If there is a count for this node already, it must be acceptable */
+ if (khugepaged_node_load[nid])
+ return false;
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!khugepaged_node_load[i])
+ continue;
+ if (node_distance(nid, i) > RECLAIM_DISTANCE)
+ return true;
+ }
+ return false;
+}
+
#ifdef CONFIG_NUMA
static int khugepaged_find_target_node(void)
{
@@ -2545,6 +2569,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = page_to_nid(page);
+ if (khugepaged_scan_abort(node))
+ goto out_unmap;
khugepaged_node_load[node]++;
VM_BUG_ON_PAGE(PageCompound(page), page);
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))

2014-07-25 15:34:22

by Mel Gorman

Subject: Re: [patch v4] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On Thu, Jul 17, 2014 at 02:48:07PM -0700, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency. With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation. In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency. Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <[email protected]>

The patch looks ok for what it is intended to do so

Acked-by: Mel Gorman <[email protected]>

However, I would consider it likely that pages allocated on different nodes
within a hugepage boundary indicate that multiple threads on different nodes
are accessing those pages. I would be skeptical that the reduced TLB misses
offset the remote access penalties. Should we simply refuse to collapse huge
pages when the 4K pages are allocated from different nodes? If automatic
NUMA balancing is enabled and the accesses are really coming from one node,
then the 4K pages will eventually be migrated to a local node and khugepaged
can collapse them there.
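
If we went that route, the check could be as blunt as the following
(a hypothetical sketch reusing khugepaged_node_load[] from this patch; it
has not been posted or tested):

static bool khugepaged_scan_abort_single_node(int nid)
{
	int i;

	/* Refuse to collapse as soon as a base page from any other node is seen. */
	for (i = 0; i < MAX_NUMNODES; i++)
		if (khugepaged_node_load[i] && i != nid)
			return true;
	return false;
}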

--
Mel Gorman
SUSE Labs

2014-07-28 08:42:14

by Vlastimil Babka

Subject: Re: [patch v3] mm, thp: only collapse hugepages to nodes with affinity for zone_reclaim_mode

On 07/17/2014 02:59 AM, David Rientjes wrote:
> Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
> node") improved the previous khugepaged logic which allocated a
> transparent hugepages from the node of the first page being collapsed.
>
> However, it is still possible to collapse pages to remote memory which may
> suffer from additional access latency. With the current policy, it is
> possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed remotely
> if the majority are allocated from that node.
>
> When zone_reclaim_mode is enabled, it means the VM should make every attempt
> to allocate locally to prevent NUMA performance degradation. In this case,
> we do not want to collapse hugepages to remote nodes that would suffer from
> increased access latency. Thus, when zone_reclaim_mode is enabled, only
> allow collapsing to nodes with RECLAIM_DISTANCE or less.
>
> There is no functional change for systems that disable zone_reclaim_mode.
>
> Signed-off-by: David Rientjes <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>