2024-06-14 05:00:21

by Chen Yu

Subject: [RFC PATCH] sched/numa: scan the vma if it has not been scanned for a while

From: Yujie Liu <[email protected]>

Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
NUMA vma scan overhead has been reduced a lot. However, this is a
double-edged sword: scanning fewer vmas also generates less NUMA page
fault information, and the insufficient information makes it harder for
the NUMA balancer to make decisions. Later,
commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
mm numa_scan_seq based unconditional scan") were found to bring back part
of the performance.

Recently, when running SPECcpu on a 320 CPUs/2 sockets system, long
durations of remote NUMA node reads were observed via PMU events. They
cause high core-to-core variance and a performance penalty. Investigation
showed that many vmas are skipped due to the active PID check. According
to the trace events, in most cases vma_is_accessed() returns false because
both pids_active[0] and pids_active[1] have been cleared.

As an experiment, if vma_is_accessed() is hacked to always return true,
the long-duration remote NUMA access is gone.

Proposal:
The main idea is to adjust vma_is_accessed() so that it returns true more
easily.

Solution 1 is to extend pids_active[] from 2 to N, which has already
been proposed by Peter[1]. How to decide N needs investigation.

Solution 2 is to compare the difference between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the difference exceeds a threshold,
scan the vma.

Solution 2 can be used to cover process-based workloads (e.g. SPECcpu).
The reason is that there is only 1 thread within such a process. If the
process accesses the vma at the beginning and then sleeps for a long time,
the pids_active array will be cleared. When the process is woken up, it
will never get a chance to set prot_none again, because only the first 2
scans are unconditionally regarded as accessed:
(current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
and no other threads can help set prot_none.
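
To make the reasoning above concrete, here is a minimal user-space sketch
of the decision. It is not the kernel code: struct vma_model,
model_vma_is_accessed() and the use_threshold switch are hypothetical
names for illustration, the PID hash is reduced to a caller-supplied bit
mask, and the partial-scan (numa_scan_offset) case is left out.

```c
#include <stdbool.h>

/* Hypothetical model; field names mirror the kernel's, but this is not kernel code. */
struct vma_model {
	unsigned long start_scan_seq;	/* mm scan sequence when the vma was first tracked */
	unsigned long prev_scan_seq;	/* mm scan sequence when the vma was last scanned */
	unsigned long pids_active[2];	/* hashed PIDs that faulted in the last two windows */
};

/* Simplified check; the partial-scan (numa_scan_offset) case is omitted. */
static bool model_vma_is_accessed(const struct vma_model *vma,
				  unsigned long scan_seq,
				  unsigned long pid_bit,
				  unsigned long nr_threads,
				  bool use_threshold)
{
	/* The first two scan passes are always allowed. */
	if (scan_seq - vma->start_scan_seq < 2)
		return true;

	/* Otherwise the task must have faulted on this vma recently. */
	if ((vma->pids_active[0] | vma->pids_active[1]) & pid_bit)
		return true;

	/* Solution 2: scan anyway if the vma has not been scanned for a while. */
	if (use_threshold && scan_seq > vma->prev_scan_seq + nr_threads)
		return true;

	return false;
}
```

With both pids_active[] words cleared and nr_threads == 1, the model
returns false at any scan_seq >= start_scan_seq + 2 unless use_threshold
is set, which matches the single-threaded case described above.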

This patch is mainly meant to raise this question and to seek suggestions
from the community on how to handle it properly. Thanks in advance for any
suggestions.

Link: https://lore.kernel.org/lkml/[email protected]/ #1
Reported-by: Xiaoping Zhou <[email protected]>
Co-developed-by: Chen Yu <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
Signed-off-by: Yujie Liu <[email protected]>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..2b74fc06fb95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3188,6 +3188,14 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
return true;
}

+ /*
+ * This vma has not been accessed for a while, and only a limited number of
+ * threads within the current task can help.
+ */
+ if (READ_ONCE(mm->numa_scan_seq) >
+ (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
+ return true;
+
return false;
}

--
2.25.1