In the current AutoNUMA implementation, the page tables of the processes
are scanned periodically to trigger NUMA hint page faults. The
scanning runs in the context of the processes themselves, so it delays
their execution. In a test running the pmbench memory-access benchmark
with 64 threads on a 2-socket server machine with 104 logical CPUs
and 256 GB of memory, there were more than 20000 latency outliers
above 1 ms in a 3600 s run. These latency outliers are almost all
caused by the AutoNUMA page table scanning: they almost all
disappear after applying this patch to scan the page tables
asynchronously.
When there are idle CPUs in the system, the asynchronous page table
scanning can run on those idle CPUs instead of on the CPUs the
workload is running on.
So on a system with enough idle CPU time, it is better to scan the page
tables asynchronously to take full advantage of that idle time.
Another scenario that can benefit from this is scanning the page
tables on dedicated service CPUs of the socket, so that the real workload
can run on the isolated CPUs without the latency outliers caused by
the page table scanning.
But scanning the page tables asynchronously is not perfect either. For
example, on a system without enough idle CPU time, CPU time is not
scheduled fairly, because the page table scanning is charged to the
workqueue thread instead of the process/thread it works for. And
although the page tables are scanned on behalf of the target process,
the work may run on a CPU that is not in the cpuset of the target
process.
One possible solution is to let the system administrator choose the
better behavior for the system via a sysctl knob (implemented in this
patch). But that is not perfect either, because every user space knob
adds maintenance overhead.
A better solution may be to back-charge the CPU time spent scanning the
page tables to the process/thread it is done for, and to find a way to
run the work in the proper cpuset. There has been some discussion about
this in the following thread:

https://lkml.org/lkml/2019/6/13/1321
So this patch may not be ready to be merged upstream yet. It
quantifies the latency outliers caused by the page table scanning in
AutoNUMA, it provides a possible way to resolve the issue for
users who care about it, and it is a potential customer of the work on
cgroup-aware workqueues or other asynchronous execution
mechanisms.
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Daniel Jordan <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Aubrey Li <[email protected]>
---
include/linux/sched.h | 1 +
include/linux/sched/sysctl.h | 1 +
kernel/exit.c | 1 +
kernel/sched/fair.c | 55 +++++++++++++++++++++++++++---------
kernel/sysctl.c | 9 ++++++
5 files changed, 53 insertions(+), 14 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 04278493bf15..433f77436ed4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
u64 last_task_numa_placement;
u64 last_sum_exec_runtime;
struct callback_head numa_work;
+ struct work_struct numa_async_work;
/*
* This pointer is only modified for current in syscall and
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index d4f6215ee03f..ea595ae9e573 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -37,6 +37,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_scan_async;
#ifdef CONFIG_SCHED_DEBUG
extern __read_mostly unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/exit.c b/kernel/exit.c
index 0b81b26a872a..f5e84ff4b788 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -798,6 +798,7 @@ void __noreturn do_exit(long code)
if (group_dead)
disassociate_ctty(1);
exit_task_namespaces(tsk);
+ cancel_work_sync(&tsk->numa_async_work);
exit_task_work(tsk);
exit_thread(tsk);
exit_umh(tsk);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1217bfe5e81..7d82c6826708 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1064,6 +1064,9 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+/* Scan asynchronously via work queue */
+unsigned int sysctl_numa_balancing_scan_async;
+
struct numa_group {
refcount_t refcount;
@@ -2481,24 +2484,17 @@ static void reset_ptenuma_scan(struct task_struct *p)
p->mm->numa_scan_offset = 0;
}
-/*
- * The expensive part of numa migration is done from task_work context.
- * Triggered from task_tick_numa().
- */
-static void task_numa_work(struct callback_head *work)
+static void numa_scan(struct task_struct *p)
{
unsigned long migrate, next_scan, now = jiffies;
- struct task_struct *p = current;
+ struct task_struct *pc = current;
struct mm_struct *mm = p->mm;
- u64 runtime = p->se.sum_exec_runtime;
+ u64 runtime = pc->se.sum_exec_runtime;
struct vm_area_struct *vma;
unsigned long start, end;
unsigned long nr_pte_updates = 0;
long pages, virtpages;
- SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
-
- work->next = work;
/*
* Who cares about NUMA placement when they're dying.
*
@@ -2621,12 +2617,36 @@ static void task_numa_work(struct callback_head *work)
* Usually update_task_scan_period slows down scanning enough; on an
* overloaded system we need to limit overhead on a per task basis.
*/
- if (unlikely(p->se.sum_exec_runtime != runtime)) {
- u64 diff = p->se.sum_exec_runtime - runtime;
+ if (unlikely(pc->se.sum_exec_runtime != runtime)) {
+ u64 diff = pc->se.sum_exec_runtime - runtime;
p->node_stamp += 32 * diff;
}
}
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+static void task_numa_work(struct callback_head *work)
+{
+ struct task_struct *p = current;
+
+ SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work;
+
+ numa_scan(p);
+}
+
+static void numa_async_work(struct work_struct *work)
+{
+ struct task_struct *p;
+
+ p = container_of(work, struct task_struct, numa_async_work);
+
+ numa_scan(p);
+}
+
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
@@ -2650,6 +2670,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
p->last_sum_exec_runtime = 0;
init_task_work(&p->numa_work, task_numa_work);
+ INIT_WORK(&p->numa_async_work, numa_async_work);
/* New address space, reset the preferred nid */
if (!(clone_flags & CLONE_VM)) {
@@ -2699,8 +2720,14 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
curr->numa_scan_period = task_scan_start(curr);
curr->node_stamp += period;
- if (!time_before(jiffies, curr->mm->numa_next_scan))
- task_work_add(curr, work, true);
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ if (sysctl_numa_balancing_scan_async)
+ queue_work_node(numa_node_id(),
+ system_unbound_wq,
+ &curr->numa_async_work);
+ else
+ task_work_add(curr, work, true);
+ }
}
}
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ad5b88a53c5a..31a987230c58 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -418,6 +418,15 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = SYSCTL_ONE,
},
+ {
+ .procname = "numa_balancing_scan_async",
+ .data = &sysctl_numa_balancing_scan_async,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
+ },
{
.procname = "numa_balancing",
.data = NULL, /* filled in by handler */
--
2.25.1
On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
The caveats you list are the important ones and the reason why it was
not done asynchronously. In an earlier implementation all the work was
done by a dedicated thread and ultimately abandoned.
There is no guarantee there is an idle CPU available and one that is
local to the thread that should be doing the scanning. Even if there is,
it potentially prevents another task from scheduling on an idle CPU and
similarly, other workqueue tasks may be delayed waiting on the scanner.
Hiding the cost is also problematic because the CPU time is mixed in
with other unrelated workqueues. It also has the potential
to mask bugs. Let's say, for example, there is a bug whereby a task
scans excessively; that can easily be missed when the work is done by
a workqueue.
While it's just an opinion, my preference would be to focus on reducing
the cost and amount of scanning done -- particularly for threads. For
example, all threads operate on the same address space but there can be
significant overlap where all threads are potentially scanning the same
areas or regions that the thread has no interest in. One option would be
to track the highest and lowest pages accessed and only scan within
those regions for example. The tricky part is that library pages may
create very wide windows that render the tracking useless but it could
at least be investigated.
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
> The caveats you list are the important ones and the reason why it was
> not done asynchronously. In an earlier implementation all the work was
> done by a dedicated thread and ultimately abandoned.
>
> There is no guarantee there is an idle CPU available and one that is
> local to the thread that should be doing the scanning. Even if there is,
> it potentially prevents another task from scheduling on an idle CPU and
> similarly other workqueue tasks may be delayed waiting on the scanner. The
> hiding of the cost is also problematic because the CPU cost is hidden
> and mixed with other unrelated workqueues. It also has the potential
> to mask bugs. Lets say for example there is a bug whereby a task is
> scanning excessively, that can be easily missed when the work is done by
> a workqueue.
Do you think something like a cgroup-aware workqueue is a solution that
deserves to be tried when it becomes available? It will not hide the
scanning cost, because the CPU time will be charged to the original
cgroup or task. Although other tasks may still be disturbed, cgroups
can provide some kind of management via cpusets.
> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads. For
> example, all threads operate on the same address space but there can be
> significant overlap where all threads are potentially scanning the same
> areas or regions that the thread has no interest in. One option would be
> to track the highest and lowest pages accessed and only scan within
> those regions for example. The tricky part is that library pages may
> create very wide windows that render the tracking useless but it could
> at least be investigated.
In general, I think it's good to reduce the scanning cost.

Why do you think there will be overlap between the threads of a process?
If my understanding is correct, the threads scan one after another
instead of simultaneously. And how do we determine whether a VMA needs
to be scanned or not? For example, only a small portion of the pages in
a VMA may be accessed, but they may be accessed remotely and consume
quite some inter-node bandwidth, so they need to be migrated.
Best Regards,
Huang, Ying
On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
> While it's just an opinion, my preference would be to focus on reducing
> the cost and amount of scanning done -- particularly for threads.
This; I really don't believe in those back-charging things, esp. since
not having cgroups or having multiple applications in a single cgroup is
a valid setup.
Another way to reduce latency spikes is to decrease both
sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
Then you do more, smaller scans. By scanning more often you reduce the
contrast; by reducing the size you lower the max latency.
And this is all assuming you actually want numa balancing for this
process.
Peter Zijlstra <[email protected]> writes:
> On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> While it's just an opinion, my preference would be to focus on reducing
>> the cost and amount of scanning done -- particularly for threads.
>
> This; I really don't believe in those back-charging things, esp. since
> not having cgroups or having multiple applications in a single cgroup is
> a valid setup.
Technically, it appears possible to back-charge the CPU time to the
process/thread directly (not the cgroup).
> Another way to reduce latency spikes is to decrease both
> sysctl_numa_balancing_scan_delay and sysctl_numa_balancing_scan_size.
> Then you do more smaller scans. By scanning more often you reduce the
> contrast, by reducing the size you lower the max latency.
Yes. This can reduce latency spikes.
Best Regards,
Huang, Ying
> And this is all assuming you actually want numa balancing for this
> process.
On Wed, 15 Apr 2020 16:14:38 +0800 "Huang\, Ying" <[email protected]> wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote:
> >> In current AutoNUMA implementation, the page tables of the processes
> >> are scanned periodically to trigger the NUMA hint page faults. The
> >> scanning runs in the context of the processes, so will delay the
> >> running of the processes. In a test with 64 threads pmbench memory
> >> accessing benchmark on a 2-socket server machine with 104 logical CPUs
> >> and 256 GB memory, there are more than 20000 latency outliers that are
> >> > 1 ms in 3600s run time. These latency outliers are almost all
> >> caused by the AutoNUMA page table scanning. Because they almost all
> >> disappear after applying this patch to scan the page tables
> >> asynchronously.
> >>
> >> Because there are idle CPUs in system, the asynchronous running page
> >> table scanning code can run on these idle CPUs instead of the CPUs the
> >> workload is running on.
> >>
> >> So on system with enough idle CPU time, it's better to scan the page
> >> tables asynchronously to take full advantages of these idle CPU time.
> >> Another scenario which can benefit from this is to scan the page
> >> tables on some service CPUs of the socket, so that the real workload
> >> can run on the isolated CPUs without the latency outliers caused by
> >> the page table scanning.
> >>
> >> But it's not perfect to scan page tables asynchronously too. For
> >> example, on system without enough idle CPU time, the CPU time isn't
> >> scheduled fairly because the page table scanning is charged to the
> >> workqueue thread instead of the process/thread it works for. And
> >> although the page tables are scanned for the target process, it may
> >> run on a CPU that is not in the cpuset of the target process.
> >>
> >> One possible solution is to let the system administrator to choose the
> >> better behavior for the system via a sysctl knob (implemented in the
> >> patch). But it's not perfect too. Because every user space knob adds
> >> maintenance overhead.
> >>
> >> A better solution may be to back-charge the CPU time to scan the page
> >> tables to the process/thread, and find a way to run the work on the
> >> proper cpuset. After some googling, I found there's some discussion
> >> about this as in the following thread,
> >>
> >> https://lkml.org/lkml/2019/6/13/1321
> >>
> >> So this patch may be not ready to be merged by upstream yet. It
> >> quantizes the latency outliers caused by the page table scanning in
> >> AutoNUMA. And it provides a possible way to resolve the issue for
> >> users who cares about it. And it is a potential customer of the work
> >> related to the cgroup-aware workqueue or other asynchronous execution
> >> mechanisms.
> >>
> >
> > The caveats you list are the important ones and the reason why it was
> > not done asynchronously. In an earlier implementation all the work was
> > done by a dedicated thread and ultimately abandoned.
> >
> > There is no guarantee there is an idle CPU available and one that is
> > local to the thread that should be doing the scanning. Even if there is,
> > it potentially prevents another task from scheduling on an idle CPU and
> > similarly other workqueue tasks may be delayed waiting on the scanner. The
> > hiding of the cost is also problematic because the CPU cost is hidden
> > and mixed with other unrelated workqueues. It also has the potential
> > to mask bugs. Lets say for example there is a bug whereby a task is
> > scanning excessively, that can be easily missed when the work is done by
> > a workqueue.
>
> Do you think something like cgroup-aware workqueue is a solution deserve
> to be tried when it's available? It will not hide the scanning cost,
> because the CPU time will be charged to the original cgroup or task.
> Although the other tasks may be disturbed, cgroup can provide some kind
> of management via cpusets.
>
> > While it's just an opinion, my preference would be to focus on reducing
> > the cost and amount of scanning done -- particularly for threads. For
> > example, all threads operate on the same address space but there can be
> > significant overlap where all threads are potentially scanning the same
> > areas or regions that the thread has no interest in. One option would be
> > to track the highest and lowest pages accessed and only scan within
> > those regions for example. The tricky part is that library pages may
> > create very wide windows that render the tracking useless but it could
> > at least be investigated.
>
> In general, I think it's good to reduce the scanning cost.
I think the main idea of DAMON[1] might be applicable here. Have you
considered it?
[1] https://lore.kernel.org/linux-mm/[email protected]/
Thanks,
SeongJae Park
>
> Why do you think there will be overlap between the threads of a process?
> If my understanding were correctly, the threads will scan one by one
> instead of simultaneously. And how to determine whether a vma need to
> be scanned or not? For example, there may be only a small portion of
> pages been accessed in a vma, but they may be accessed remotely and
> consumes quite some inter-node bandwidth, so need to be migrated.
>
> Best Regards,
> Huang, Ying
>
On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
> Technically, it appears possible to back-charge the CPU time to the
> process/thread directly (not the cgroup).
I've yet to see a sane proposal there. What we're not going to do is
make regular task accounting more expensive than it already is.
On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> I think the main idea of DAMON[1] might be able to applied here. Have you
> considered it?
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
I've ignored that entire thing after you said the information it
provides was already available through the PMU.
On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <[email protected]> wrote:
> I've ignored that entire thing after you said the information it
> provides was already available through the PMU.
Sorry if my answer confused you. What I wanted to say was that the
fundamental access-checking mechanism DAMON depends on is the PTE
Accessed bit for now, but it could be modified to use the PMU or other
features instead. In other words, on some architectures the PMU could
provide the fundamental, low-level information for DAMON, as the PTE
Accessed bit does. What DAMON adds is efficient control of those
fundamental access-checking features and the production of elaborated
final information, which the PMU itself doesn't provide.
Thanks,
SeongJae Park
On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> Sorry if my answer made you confused. What I wanted to say was that the
> fundamental access checking mechanism that DAMON depends on is PTE Accessed bit
> for now, but it could be modified to use PMU or other features instead.
I would not be inclined to lean towards either approach for NUMA
balancing. Fiddling with the accessed bit can have consequences for page
aging and residency -- fine for debugging a problem, not so fine for
normal usage. I would expect the PMU approach to have high overhead
as well as taking over a PMU counter that userspace debugging may expect
to be available.
--
Mel Gorman
SUSE Labs
On Fri, Apr 17, 2020 at 01:16:29PM +0100, Mel Gorman wrote:
> I would not be inclined to lean towards either approach for NUMA
> balancing. Fiddling with the accessed bit can have consequences for page
> aging and residency -- fine for debugging a problem, not to fine for
> normal usage. I would expect the PMU approach would have high overhead
> as well as taking over a PMU counter that userspace debugging may expect
> to be available.
Oh, quite agreed; I was just saying I never saw the use of that whole
DAMON thing. AFAICT it's not actually solving a problem, just making
more.
On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> Good point. But, isn't it ok to use Accessed bit as long as PG_Idle and
> PG_Young is adjusted properly? Current DAMON implementation does so, as
> idle_page_tracking also does.
>
PG_idle and PG_young are page flags used by idle tracking. The
accessed bit I'm thinking of is the PTE bit, and it's the PTE that
is changed by page_referenced() during reclaim. So it's not clear how
adjusting the page flags helps avoid page aging problems during reclaim.
Maybe your suggestion was to move NUMA balancing to use the PG_idle and
PG_young bits from idle tracking, but I'm not sure what that gains us.
This may be because I did not look too closely at DAMON as for debugging
and analysis, the PMU sounded more suitable.
It's not clear to me how looking at the page or pte bit handling of
DAMON helps us reduce the scanning cost of numa balancing. There may
be some benefit in splitting the address space and rescanning sections
that were referenced but it has similar pitfalls to simply tracking the
highest/lowest address that trapped a NUMA hinting fault.
--
Mel Gorman
SUSE Labs
On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <[email protected]> wrote:
> I would not be inclined to lean towards either approach for NUMA
> balancing. Fiddling with the accessed bit can have consequences for page
> aging and residency -- fine for debugging a problem, not to fine for
> normal usage. I would expect the PMU approach would have high overhead
> as well as taking over a PMU counter that userspace debugging may expect
> to be available.
Good point. But isn't it ok to use the Accessed bit as long as PG_idle and
PG_young are adjusted properly? The current DAMON implementation does so, as
idle_page_tracking also does.
That said, the core logic of DAMON and the underlying access check primitive
are logically separated. I am planning[1] to separate them entirely and let
users choose the right access check primitive for their needs.
If I'm missing something, please let me know.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Thanks,
SeongJae Park
>
> --
> Mel Gorman
> SUSE Labs
On Fri, 17 Apr 2020 15:46:26 +0100 Mel Gorman <[email protected]> wrote:
> On Fri, Apr 17, 2020 at 02:44:37PM +0200, SeongJae Park wrote:
> > On Fri, 17 Apr 2020 13:16:29 +0100 Mel Gorman <[email protected]> wrote:
> >
> > > On Fri, Apr 17, 2020 at 12:21:29PM +0200, SeongJae Park wrote:
> > > > On Fri, 17 Apr 2020 12:04:17 +0200 Peter Zijlstra <[email protected]> wrote:
> > > >
> > > > > On Fri, Apr 17, 2020 at 09:05:08AM +0200, SeongJae Park wrote:
> > > > > > I think the main idea of DAMON[1] might be applicable here. Have you
> > > > > > considered it?
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > > > >
> > > > > I've ignored that entire thing after you said the information it
> > > > > provides was already available through the PMU.
> > > >
> > > > Sorry if my answer confused you. What I wanted to say was that the
> > > > fundamental access checking mechanism DAMON depends on is the PTE Accessed
> > > > bit for now, but it could be modified to use the PMU or other features instead.
> > >
> > > I would not be inclined to lean towards either approach for NUMA
> > > balancing. Fiddling with the accessed bit can have consequences for page
> > > aging and residency -- fine for debugging a problem, not so fine for
> > > normal usage. I would expect the PMU approach would have high overhead
> > > as well as taking over a PMU counter that userspace debugging may expect
> > > to be available.
> >
> > Good point. But isn't it ok to use the Accessed bit as long as PG_idle and
> > PG_young are adjusted properly? The current DAMON implementation does so, as
> > idle_page_tracking also does.
> >
>
> PG_idle and PG_young are page flags that are used by idle tracking. The
> Accessed bit I'm thinking of is in the PTE, and it's the PTE that
> is changed by page_referenced() during reclaim. So it's not clear how
> adjusting the page bits helps avoid page aging problems during reclaim.
Idle tracking reads and updates not only PG_idle and PG_young, but also the
Accessed bit. So idle tracking also had to deal with interference from the
reclaimer logic, and thus it introduced PG_young. The commit
33c3fc71c8cf ("mm: introduce idle page tracking") explains this as below:
    The Young page flag is used to avoid interference with the memory
    reclaimer. A page's Young flag is set whenever the Access bit of a page
    table entry pointing to the page is cleared by writing to the bitmap file.
    If page_referenced() is called on a Young page, it will add 1 to its
    return value, therefore concealing the fact that the Access bit was
    cleared.
DAMON adjusts PG_young in a similar way to avoid the interference. DAMON
further adjusts PG_idle so that it does not interfere with the page idle
tracking mechanism. I think I confused you by unnecessarily mentioning
PG_idle, sorry.
>
> Maybe your suggestion was to move NUMA balancing to use the PG_idle and
> PG_young bits from idle tracking but I'm not sure what that gains us.
> This may be because I did not look too closely at DAMON as for debugging
> and analysis, the PMU sounded more suitable.
>
> It's not clear to me how looking at the page or pte bit handling of
> DAMON helps us reduce the scanning cost of numa balancing. There may
> be some benefit in splitting the address space and rescanning sections
> that were referenced but it has similar pitfalls to simply tracking the
> highest/lowest address that trapped a NUMA hinting fault.
DAMON allows users to know which address ranges have which access frequency.
Thus, I think DAMON could be used to detect hot pages, which would be good
migration candidates, instead of the NUMA hinting fault based detection of
referenced pages.
The benefits I expect from this change are better accuracy and less overhead.
As we can know not only referenced pages but also hot pages, migration will be
more effective.
In terms of overhead, DAMON allows users to set an upper bound on the
monitoring overhead regardless of the size of the target address space, unlike
page scanning based mechanisms. Thus, the overhead could be reduced,
especially if the target process has a large memory footprint.
As I don't deeply understand AutoNUMA, I might be saying something unrealistic
or missing some important points.
Thanks,
SeongJae Park
>
> --
> Mel Gorman
> SUSE Labs
SeongJae Park <[email protected]> writes:
>
> DAMON allows users to know which address ranges have which access frequency.
> Thus, I think DAMON could be used to detect hot pages, which would be good
> migration candidates, instead of the NUMA hinting fault based detection of
> referenced pages.
>
> The benefits I expect from this change are better accuracy and less overhead.
> As we can know not only referenced pages but also hot pages, migration will
> be more effective.
The purpose of AutoNUMA scanning isn't to find hot pages, but pages that are
accessed across NUMA nodes. The PTE Accessed bit cannot identify the accessing
CPU, so it's not enough for AutoNUMA.
Best Regards,
Huang, Ying
Peter Zijlstra <[email protected]> writes:
> On Thu, Apr 16, 2020 at 09:24:35AM +0800, Huang, Ying wrote:
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Tue, Apr 14, 2020 at 01:06:46PM +0100, Mel Gorman wrote:
>> >> While it's just an opinion, my preference would be to focus on reducing
>> >> the cost and amount of scanning done -- particularly for threads.
>> >
>> > This; I really don't believe in those back-charging things, esp. since
>> > not having cgroups or having multiple applications in a single cgroup is
>> > a valid setup.
>>
>> Technically, it appears possible to back-charge the CPU time to the
>> process/thread directly (not the cgroup).
>
> I've yet to see a sane proposal there. What we're not going to do is
> make regular task accounting more expensive than it already is.
Yes, back-charging has overhead. To reduce it, instead of back-charging
immediately, we can:
- Add one field to task_struct, say backcharge_time, to track the
  delayed back-charged CPU time.
- When the work item completes its work, add the CPU time it spent to
  task_struct->backcharge_time atomically.
- When the task accounts its CPU time regularly, e.g. in scheduler_tick(),
  task_struct->backcharge_time is considered too.
Although this cannot eliminate the overhead, it can reduce it. Do you
think this is acceptable or not?
Best Regards,
Huang, Ying