Workloads consisting of a large number of processes running the same program
with a very large shared data section may experience performance problems
when numa balancing attempts to migrate the shared copy-on-write (COW) pages.
This manifests itself as many processes or tasks in TASK_UNINTERRUPTIBLE
state waiting for the shared pages to be migrated.
This patch changes change_pte_range() to skip shared copy-on-write pages when
called from change_prot_numa().
The program listed below simulates the conditions with these results when
run with 288 processes on a 144 core/8 socket machine.
Average throughput      Average throughput      Average throughput
with numa_balancing=0   with numa_balancing=1   with numa_balancing=1
                        without the patch       with the patch
---------------------   ---------------------   ---------------------
2118782                 2021534                 2107979
Complex production environments show less variability and fewer poorly
performing outliers, accompanied by a smaller number of processes waiting
on NUMA page migration. In some cases, %iowait drops from 16-26% to 0.
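On an affected machine the stalled tasks show up in uninterruptible sleep;
a quick way to spot them (standard procps ps, nothing patch-specific):

```shell
# List tasks currently in D (TASK_UNINTERRUPTIBLE) state.
ps -eo state=,pid=,comm= | awk '$1 == "D"'
```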
// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <sys/mman.h>

int a[1000000] = {13};

int main(int argc, const char **argv)
{
	int n = 0;
	int i;
	pid_t pid;
	int stat;
	int *count_array;
	int cpu_count = 288;
	long total = 0;
	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};

	if (argc > 2)
		cpu_count = atoi(argv[2]);

	/* Shared array in which each child reports its iteration count. */
	count_array = mmap(NULL, cpu_count * sizeof(int),
			   (PROT_READ|PROT_WRITE),
			   (MAP_SHARED|MAP_ANONYMOUS), -1, 0);
	if (count_array == MAP_FAILED) {
		perror("mmap:");
		return 0;
	}

	for (i = 0; i < cpu_count; ++i) {
		pid = fork();
		if (pid <= 0)
			break;
		if ((i & 0xf) == 0)
			usleep(2);
	}

	if (pid != 0) {
		if (i == 0) {
			perror("fork:");
			return 0;
		}

		for (;;) {
			pid = wait(&stat);
			if (pid < 0)
				break;
		}

		for (i = 0; i < cpu_count; ++i)
			total += count_array[i];
		printf("Total %ld\n", total);
		munmap(count_array, cpu_count * sizeof(int));
		return 0;
	}

	/* Child: repeatedly read the shared COW data section until
	 * the requested duration expires, counting the passes. */
	gettimeofday(&t1, 0);
	timeradd(&t1, &t2, &t1);
	while (timercmp(&t2, &t1, <)) {
		int b = 0;
		int j;

		for (j = 0; j < 1000000; j++)
			b += a[j];
		gettimeofday(&t2, 0);
		n++;
	}
	count_array[i] = n;
	return 0;
}
mm/mprotect.c | 5 +++++
1 file changed, 5 insertions(+)
--
1.8.3.1
Workloads consisting of a large number of processes running the same program
with a large shared data section may suffer from excessive numa balancing
page migration of the pages in the shared data section. This shows up as
high I/O wait time and degraded performance on machines with higher socket
or node counts.
This patch skips shared copy-on-write pages in change_pte_range() for the
numa balancing case.
Signed-off-by: Henry Willard <[email protected]>
Reviewed-by: Håkon Bugge <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
mm/mprotect.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ec39f730a0bf..fbbb3ab70818 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -84,6 +84,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (!page || PageKsm(page))
 				continue;
 
+			/* Also skip shared copy-on-write pages */
+			if (is_cow_mapping(vma->vm_flags) &&
+			    page_mapcount(page) != 1)
+				continue;
+
 			/* Avoid TLB flush if possible */
 			if (pte_protnone(oldpte))
 				continue;
--
1.8.3.1
On Tue, Jan 16, 2018 at 11:28:44AM -0800, Henry Willard wrote:
> Workloads consisting of a large number processes running the same program
> with a large shared data section may suffer from excessive numa balancing
> page migration of the pages in the shared data section. This shows up as
> high I/O wait time and degraded performance on machines with higher socket
> or node counts.
>
> This patch skips shared copy-on-write pages in change_pte_range() for the
> numa balancing case.
>
> Signed-off-by: Henry Willard <[email protected]>
> Reviewed-by: Håkon Bugge <[email protected]>
> Reviewed-by: Steve Sistare [email protected]
Merge the leader and this mail together. It would have been nice to see
data on other realistic workloads as well.
My main source of discomfort is the fact that this is permanent as two
processes perfectly isolated but with a suitably shared COW mapping
will never migrate the data. A potential improvement to get the reported
bandwidth up in the test program would be to skip the rest of the VMA if
page_mapcount != 1 in a COW mapping as it would be reasonable to assume
the remaining pages in the VMA are also affected and the scan is wasteful.
There are counter-examples to this but I suspect that the full VMA being
shared is the common case. Whether you do that or not;
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
> On Jan 16, 2018, at 1:26 PM, Mel Gorman <[email protected]> wrote:
>
> On Tue, Jan 16, 2018 at 11:28:44AM -0800, Henry Willard wrote:
>> Workloads consisting of a large number processes running the same program
>> with a large shared data section may suffer from excessive numa balancing
>> page migration of the pages in the shared data section. This shows up as
>> high I/O wait time and degraded performance on machines with higher socket
>> or node counts.
>>
>> This patch skips shared copy-on-write pages in change_pte_range() for the
>> numa balancing case.
>>
>> Signed-off-by: Henry Willard <[email protected]>
>> Reviewed-by: Håkon Bugge <[email protected]>
>> Reviewed-by: Steve Sistare [email protected]
>
> Merge the leader and this mail together. It would have been nice to see
> data on other realistic workloads as well.
>
> My main source of discomfort is the fact that this is permanent as two
> processes perfectly isolated but with a suitably shared COW mapping
> will never migrate the data. A potential improvement to get the reported
> bandwidth up in the test program would be to skip the rest of the VMA if
> page_mapcount != 1 in a COW mapping as it would be reasonable to assume
> the remaining pages in the VMA are also affected and the scan is wasteful.
> There are counter-examples to this but I suspect that the full VMA being
> shared is the common case. Whether you do that or not;
>
> Acked-by: Mel Gorman <[email protected]>
Thanks. The real customer cases where this was observed involved large, 1TB or more, eight socket machines running very active RDBMS workloads. These customers saw high iowait times and a loss in performance when numa balancing was enabled. Previously there was no reported iowait time. The extent of the loss of performance was variable depending on the activity and never quantified.

The little test program is a distillation of what was observed. In the real workload, a large part of the VMA is shared, but not all of it, so this seemed the simplest and most reliable patch.
Henry
>
> --
> Mel Gorman
> SUSE Labs
On Tue, 16 Jan 2018, Mel Gorman wrote:
> My main source of discomfort is the fact that this is permanent as two
> processes perfectly isolated but with a suitably shared COW mapping
> will never migrate the data. A potential improvement to get the reported
> bandwidth up in the test program would be to skip the rest of the VMA if
> page_mapcount != 1 in a COW mapping as it would be reasonable to assume
> the remaining pages in the VMA are also affected and the scan is wasteful.
> There are counter-examples to this but I suspect that the full VMA being
> shared is the common case. Whether you do that or not;
Same concern here. Typically CAP_SYS_NICE will bypass the check that the
page is only mapped to a single process and the check looks exactly like
the ones for manual migration. Using CAP_SYS_NICE would be surprising
here since autonuma is not triggered by the currently running process.
Can we configure this somehow via sysfs?
> On Jan 17, 2018, at 10:23 AM, Christopher Lameter <[email protected]> wrote:
>
> On Tue, 16 Jan 2018, Mel Gorman wrote:
>
>> My main source of discomfort is the fact that this is permanent as two
>> processes perfectly isolated but with a suitably shared COW mapping
>> will never migrate the data. A potential improvement to get the reported
>> bandwidth up in the test program would be to skip the rest of the VMA if
>> page_mapcount != 1 in a COW mapping as it would be reasonable to assume
>> the remaining pages in the VMA are also affected and the scan is wasteful.
>> There are counter-examples to this but I suspect that the full VMA being
>> shared is the common case. Whether you do that or not;
>
> Same concern here. Typically CAP_SYS_NICE will bypass the check that the
> page is only mapped to a single process and the check looks exactly like
> the ones for manual migration. Using CAP_SYS_NICE would be surprising
> here since autonuma is not triggered by the currently running process.
>
> Can we configure this somehow via sysfs?
If I understand the code correctly, CAP_SYS_NICE allows MPOL_MF_MOVE_ALL to be set with mbind() or used with move_pages(). CAP_SYS_NICE also causes migrate_pages() to behave as if MPOL_MF_MOVE_ALL were specified. There are checks requiring either MPOL_MF_MOVE_ALL or page_mapcount(page) == 1.

The normal case does not call change_prot_numa(). change_prot_numa() is only called when MPOL_MF_LAZY is specified, and at the moment MPOL_MF_LAZY is not recognized as a valid flag. It looks to me that as things stand now, change_prot_numa() is only called from task_numa_work().
If MPOL_MF_LAZY were allowed and specified things would not work correctly. change_pte_range() is unaware of and can’t honor the difference between MPOL_MF_MOVE_ALL and MPOL_MF_MOVE.
For the case of auto numa balancing, it may be undesirable for shared pages to be migrated whether they are also copy-on-write or not. The copy-on-write test was added to restrict the effect of the patch to the specific situation we observed. Perhaps I should remove it, I don’t understand why it would be desirable to modify the behavior via sysfs.
Thanks,
Henry
>
>
On Thu, 18 Jan 2018, Henry Willard wrote:
> If MPOL_MF_LAZY were allowed and specified things would not work
> correctly. change_pte_range() is unaware of and can’t honor the
> difference between MPOL_MF_MOVE_ALL and MPOL_MF_MOVE.
Not sure how that relates to what I said earlier... Sorry.
>
> For the case of auto numa balancing, it may be undesirable for shared
> pages to be migrated whether they are also copy-on-write or not. The
> copy-on-write test was added to restrict the effect of the patch to the
> specific situation we observed. Perhaps I should remove it, I don’t
> understand why it would be desirable to modify the behavior via sysfs.
I think the most common case of shared pages occurs for pages that contain
code. In that case a page may be mapped into hundreds if not thousands of
processes. In particular that is often the case for basic system libraries
like the c library which may actually be mapped into every binary that is
running.
It is very difficult and expensive to unmap these pages from all the
processes in order to migrate them. So some sort of limit would be useful
to avoid unnecessary migration attempts. One example would be to forbid
migrating pages that are mapped in more than 5 processes. Some sysctl knob
would be useful here to set the boundary.
Your patch addresses a special case here by forbidding migration of any
page mapped by more than a single process (mapcount != 1).
That would mean f.e. that the complete migration of a set of processes
that rely on sharing data via a memory segment is impossible because those
shared pages can never be moved.
By setting the limit higher that migration would still be possible.
Maybe we can set that limit by default at 5 and allow a higher setting
if users have applications that require a higher mapcount? F.e. a
common construct is a shepherd task and N worker threads. If those
tasks each have their own address space and only communicate via
a shared data segment then one may want to set the limit higher than N
in order to allow the migration of the group of processes.
> On Jan 19, 2018, at 6:12 PM, Christopher Lameter <[email protected]> wrote:
>
> On Thu, 18 Jan 2018, Henry Willard wrote:
>
>> If MPOL_MF_LAZY were allowed and specified things would not work
>> correctly. change_pte_range() is unaware of and can’t honor the
>> difference between MPOL_MF_MOVE_ALL and MPOL_MF_MOVE.
>
> Not sure how that relates to what I said earlier... Sorry.
Only that CAP_SYS_NICE is not relevant to this patch.
>
>>
>> For the case of auto numa balancing, it may be undesirable for shared
>> pages to be migrated whether they are also copy-on-write or not. The
>> copy-on-write test was added to restrict the effect of the patch to the
>> specific situation we observed. Perhaps I should remove it, I don’t
>> understand why it would be desirable to modify the behavior via sysfs.
>
> I think the most common case of shared pages occurs for pages that contain
> code. In that case a page may be mapped into hundreds if not thousands of
> processes. In particular that is often the case for basic system libraries
> like the c library which may actually be mapped into every binary that is
> running.
That is true, but auto numa balancing skips these and similar pages before it calls change_prot_numa(). They don’t even have to be actually shared to be skipped.
>
> It is very difficult and expensive to unmap these pages from all the
> processes in order to migrate them. So some sort of limit would be useful
> to avoid unnecessary migration attempts. One example would be to forbid
> migrating pages that are mapped in more than 5 processes. Some sysctl know
> would be useful here to set the boundary.
>
> Your patch addresses a special case here by forbidding migration of any
> page mapped by more than a single process (mapcount !=1).
The current patch skips pages that are in copy-on-write VMAs and still shared. These include pages in the program’s data segment that are writable but have not been written to. Once the pages are modified they are no longer shared and can be migrated. The problem is that in some cases, the pages are never modified and remain shared.
Prior to commit 4b10e7d562c90d0a72f324832c26653947a07381, change_prot_numa() called change_prot_numa_range(), which tested for (page_mapcount(page) != 1) and bailed out for any shared pages. This patch is more selective. A simple test for shared or not seems to be common.
>
> That would mean f.e. that the complete migration of a set of processes
> that rely on sharing data via a memory segment is impossible because those
> shared pages can never be moved.
>
> By setting the limit higher that migration would still be possible.
>
> Maybe we can set that limit by default at 5 and allow a higher setting
> if users have applications that require a higher mapcoun? F.e. a
> common construct is a shepherd task and N worker threads. If those
> tasks each have their own address space and only communicate via
> a shared data segment then one may want to set the limit higher than N
> in order to allow the migration of the group of processes.
This example would be unaffected by this patch, because the patch does not affect explicitly shared memory. A process with the necessary capabilities is still able to migrate all pages.
Henry