In today's cloud computing scenarios, servers with the NUMA (non-uniform
memory access) architecture are deployed on a large scale. DAMON can identify
hot and cold memory easily and with little overhead, but it cannot show
whether a region is accessed from the local or a remote NUMA node.
The purpose of this patch series is to identify NUMA access behavior in
combination with DAMON, especially remote NUMA accesses to hot memory.
We hope to detect this situation in the data center and use page migration or
multi-backup-page techniques to optimize memory access behavior.
Next, we will further improve the DAMON NUMA function:
1. Support hugetlbfs NUMA access statistics.
2. Add support to the DAMO tool for parsing the NUMA local & remote counts in "damon_region".
3. For hot memory with remote NUMA accesses, support page migration or multi backup pages.
About the correctness of DAMON NUMA access statistics
We wrote a test case that allocates about 1G of memory with numa_alloc(), placing
512M on NUMA node 0 and 512M on NUMA node 1; the test case then alternately
accesses that 1G of memory. A sketch of such a test is shown below.
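For reference, a minimal sketch of such a test case could look as follows
(this is only an assumption of how the test might be written, using libnuma's
numa_alloc_onnode(); the actual test case may differ):

#include <numa.h>
#include <stdlib.h>

#define HALF_SZ (512UL << 20)	/* 512M per node */

int main(void)
{
	char *p0, *p1;
	unsigned long i, iter;

	if (numa_available() < 0)
		return EXIT_FAILURE;

	p0 = numa_alloc_onnode(HALF_SZ, 0);	/* 512M on NUMA node 0 */
	p1 = numa_alloc_onnode(HALF_SZ, 1);	/* 512M on NUMA node 1 */
	if (!p0 || !p1)
		return EXIT_FAILURE;

	/* Alternately touch both halves so each node sees accesses. */
	for (iter = 0; iter < 100; iter++) {
		for (i = 0; i < HALF_SZ; i += 4096)
			p0[i]++;
		for (i = 0; i < HALF_SZ; i += 4096)
			p1[i]++;
	}

	numa_free(p0, HALF_SZ);
	numa_free(p1, HALF_SZ);
	return EXIT_SUCCESS;
}

(Build with "-lnuma".)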
We used "perf record -e damon:damon_aggregated" and "perf script"
cmd to obtain data, like this:
kdamond.0 target_id=0 nr_regions=10 281473056325632-281473127964672: 12 0 5243 5513
kdamond.0 target_id=0 nr_regions=10 281473127964672-281473238028288: 8 1 5427 5399
...
kdamond.0 target_id=0 nr_regions=10 281473056325632-281473127964672: 9 3 7669 7632
kdamond.0 target_id=0 nr_regions=10 281473127964672-281473238028288: 7 2 7913 7892
We then compared the results with numastat output:
Per-node process memory usage (in MBs) for PID 111676 (lt-numademo)
Node 0 Node 1 Node 2
--------------- --------------- ---------------
Huge 0.00 0.00 0.00
Heap 0.02 0.00 0.00
Stack 0.01 0.00 0.00
Private 565.24 564.00 0.00
---------------- --------------- --------------- ---------------
Total 565.27 564.00 0.00
This comparison lets us verify the accuracy of the DAMON NUMA memory access statistics.
About the impact of DAMON NUMA access statistics on performance
During benchmark testing, we found that the memcpy test item of the MBW
benchmark shows about 3% performance degradation; there is no performance
degradation in the other benchmarks.
So we added a "numa_stat" switch to the DAMON dbgfs interface; turn this
switch on only when NUMA access statistics are required.
Xin Hao (5):
mm/damon: Add NUMA local and remote variables in 'damon_region'
mm/damon: Add 'damon_region' NUMA fault simulation support
mm/damon: Add 'damon_region' NUMA access statistics core
implementation
mm/damon/dbgfs: Add numa simulate switch
mm/damon/tracepoint: Add 'damon_region' NUMA access statistics support
include/linux/damon.h | 25 ++++++++++
include/trace/events/damon.h | 9 +++-
mm/damon/core.c | 94 +++++++++++++++++++++++++++++++++++-
mm/damon/dbgfs.c | 70 ++++++++++++++++++++++++---
mm/damon/paddr.c | 25 ++++++++--
mm/damon/prmtv-common.c | 44 +++++++++++++++++
mm/damon/prmtv-common.h | 3 ++
mm/damon/vaddr.c | 45 ++++++++++-------
mm/huge_memory.c | 5 ++
mm/memory.c | 5 ++
10 files changed, 292 insertions(+), 33 deletions(-)
--
2.27.0
The purpose of adding the two fields 'local' and 'remote' is to record
the NUMA access status of each struct 'damon_region'.
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
include/linux/damon.h | 4 ++++
mm/damon/core.c | 6 ++++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5e1e3a128b77..77d0937dcab5 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -41,6 +41,8 @@ struct damon_addr_range {
* @nr_accesses: Access frequency of this region.
* @list: List head for siblings.
* @age: Age of this region.
+ * @local: Local numa node accesses.
+ * @remote: Remote numa node accesses.
*
* @age is initially zero, increased for each aggregation interval, and reset
* to zero again if the access frequency is significantly changed. If two
@@ -56,6 +58,8 @@ struct damon_region {
unsigned int age;
/* private: Internal value for age calculation. */
unsigned int last_nr_accesses;
+ unsigned long local;
+ unsigned long remote;
};
/**
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 1dd153c31c9e..933ef51afa71 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -45,6 +45,8 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end)
region->age = 0;
region->last_nr_accesses = 0;
+ region->local = 0;
+ region->remote = 0;
return region;
}
@@ -740,6 +742,8 @@ static void damon_merge_two_regions(struct damon_target *t,
l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) /
(sz_l + sz_r);
+ l->remote = (l->remote * sz_l + r->remote * sz_r) / (sz_l + sz_r);
+ l->local = (l->local * sz_l + r->local * sz_r) / (sz_l + sz_r);
l->age = (l->age * sz_l + r->age * sz_r) / (sz_l + sz_r);
l->ar.end = r->ar.end;
damon_destroy_region(r, t);
@@ -812,6 +816,8 @@ static void damon_split_region_at(struct damon_ctx *ctx,
new->age = r->age;
new->last_nr_accesses = r->last_nr_accesses;
+ new->local = r->local;
+ new->remote = r->remote;
damon_insert_region(new, r, damon_next_region(r), t);
}
--
2.27.0
The development here refers to the NUMA balancing code: the sampled page
table entries are set to protnone, which causes a page fault on the next
access, and in do_numa_page() we will count the 'damon_region' NUMA local
and remote values.
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/damon/paddr.c | 23 +++++++++++++++++----
mm/damon/prmtv-common.c | 44 +++++++++++++++++++++++++++++++++++++++++
mm/damon/prmtv-common.h | 3 +++
mm/damon/vaddr.c | 32 +++++++++++++++++++++---------
4 files changed, 89 insertions(+), 13 deletions(-)
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index 5e8244f65a1a..b8feacf15592 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -16,9 +16,10 @@
#include "../internal.h"
#include "prmtv-common.h"
-static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma,
+static bool __damon_pa_mk_set(struct page *page, struct vm_area_struct *vma,
unsigned long addr, void *arg)
{
+ bool result = false;
struct page_vma_mapped_walk pvmw = {
.page = page,
.vma = vma,
@@ -27,10 +28,24 @@ static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma,
while (page_vma_mapped_walk(&pvmw)) {
addr = pvmw.address;
- if (pvmw.pte)
+ if (pvmw.pte) {
damon_ptep_mkold(pvmw.pte, vma->vm_mm, addr);
- else
+ if (nr_online_nodes > 1) {
+ result = damon_ptep_mknone(pvmw.pte, vma, addr);
+ if (result)
+ flush_tlb_page(vma, addr);
+ }
+ } else {
damon_pmdp_mkold(pvmw.pmd, vma->vm_mm, addr);
+ if (nr_online_nodes > 1) {
+ result = damon_pmdp_mknone(pvmw.pmd, vma, addr);
+ if (result) {
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ flush_tlb_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+ }
+ }
+ }
}
return true;
}
@@ -39,7 +54,7 @@ static void damon_pa_mkold(unsigned long paddr)
{
struct page *page = damon_get_page(PHYS_PFN(paddr));
struct rmap_walk_control rwc = {
- .rmap_one = __damon_pa_mkold,
+ .rmap_one = __damon_pa_mk_set,
.anon_lock = page_lock_anon_vma_read,
};
bool need_lock;
diff --git a/mm/damon/prmtv-common.c b/mm/damon/prmtv-common.c
index 92a04f5831d6..35ac50fdf7b6 100644
--- a/mm/damon/prmtv-common.c
+++ b/mm/damon/prmtv-common.c
@@ -12,6 +12,50 @@
#include "prmtv-common.h"
+bool damon_ptep_mknone(pte_t *pte, struct vm_area_struct *vma, unsigned long addr)
+{
+ pte_t oldpte, ptent;
+ bool preserve_write;
+
+ oldpte = *pte;
+ if (pte_protnone(oldpte))
+ return false;
+
+ if (pte_present(oldpte)) {
+ preserve_write = pte_write(oldpte);
+ oldpte = ptep_modify_prot_start(vma, addr, pte);
+ ptent = pte_modify(oldpte, PAGE_NONE);
+
+ if (preserve_write)
+ ptent = pte_mk_savedwrite(ptent);
+
+ ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ return true;
+ }
+ return false;
+}
+
+bool damon_pmdp_mknone(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr)
+{
+ bool preserve_write;
+ pmd_t entry = *pmd;
+
+ if (is_huge_zero_pmd(entry) || pmd_protnone(entry))
+ return false;
+
+ if (pmd_present(entry)) {
+ preserve_write = pmd_write(entry);
+ entry = pmdp_invalidate(vma, addr, pmd);
+ entry = pmd_modify(entry, PAGE_NONE);
+ if (preserve_write)
+ entry = pmd_mk_savedwrite(entry);
+
+ set_pmd_at(vma->vm_mm, addr, pmd, entry);
+ return true;
+ }
+ return false;
+}
+
/*
* Get an online page for a pfn if it's in the LRU list. Otherwise, returns
* NULL.
diff --git a/mm/damon/prmtv-common.h b/mm/damon/prmtv-common.h
index e790cb5f8fe0..002a308facd0 100644
--- a/mm/damon/prmtv-common.h
+++ b/mm/damon/prmtv-common.h
@@ -7,6 +7,9 @@
#include <linux/damon.h>
+bool damon_ptep_mknone(pte_t *pte, struct vm_area_struct *vma, unsigned long addr);
+bool damon_pmdp_mknone(pmd_t *pmd, struct vm_area_struct *vma, unsigned long addr);
+
struct page *damon_get_page(unsigned long pfn);
void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr);
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 89b6468da2b9..732b41ed134c 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -367,9 +367,10 @@ static void damon_va_update(struct damon_ctx *ctx)
}
}
-static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
+static int damon_va_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{
+ bool result = false;
pte_t *pte;
spinlock_t *ptl;
@@ -377,7 +378,14 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
ptl = pmd_lock(walk->mm, pmd);
if (pmd_huge(*pmd)) {
damon_pmdp_mkold(pmd, walk->mm, addr);
+ if (nr_online_nodes > 1)
+ result = damon_pmdp_mknone(pmd, walk->vma, addr);
spin_unlock(ptl);
+ if (result) {
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ flush_tlb_range(walk->vma, haddr, haddr + HPAGE_PMD_SIZE);
+ }
return 0;
}
spin_unlock(ptl);
@@ -386,11 +394,17 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
return 0;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- if (!pte_present(*pte))
- goto out;
+ if (!pte_present(*pte)) {
+ pte_unmap_unlock(pte, ptl);
+ return 0;
+ }
damon_ptep_mkold(pte, walk->mm, addr);
-out:
+ if (nr_online_nodes > 1)
+ result = damon_ptep_mknone(pte, walk->vma, addr);
pte_unmap_unlock(pte, ptl);
+ if (result)
+ flush_tlb_page(walk->vma, addr);
+
return 0;
}
@@ -450,15 +464,15 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
#define damon_mkold_hugetlb_entry NULL
#endif /* CONFIG_HUGETLB_PAGE */
-static const struct mm_walk_ops damon_mkold_ops = {
- .pmd_entry = damon_mkold_pmd_entry,
+static const struct mm_walk_ops damon_va_ops = {
+ .pmd_entry = damon_va_pmd_entry,
.hugetlb_entry = damon_mkold_hugetlb_entry,
};
-static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
+static void damon_va_check(struct mm_struct *mm, unsigned long addr)
{
mmap_read_lock(mm);
- walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
+ walk_page_range(mm, addr, addr + 1, &damon_va_ops, NULL);
mmap_read_unlock(mm);
}
@@ -471,7 +485,7 @@ static void __damon_va_prepare_access_check(struct damon_ctx *ctx,
{
r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
- damon_va_mkold(mm, r->sampling_addr);
+ damon_va_check(mm, r->sampling_addr);
}
static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
--
2.27.0
After DAMON sets a PTE or PMD to none (protnone), the NUMA accesses of a
"damon_region" are counted in the page fault handler if the current pid
matches a pid that DAMON is monitoring.
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
include/linux/damon.h | 18 ++++++++++
mm/damon/core.c | 80 +++++++++++++++++++++++++++++++++++++++++--
mm/damon/dbgfs.c | 18 +++++++---
mm/damon/vaddr.c | 11 ++----
mm/huge_memory.c | 5 +++
mm/memory.c | 5 +++
6 files changed, 121 insertions(+), 16 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 77d0937dcab5..5bf1eb92584b 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -12,12 +12,16 @@
#include <linux/time64.h>
#include <linux/types.h>
#include <linux/random.h>
+#include <linux/mm.h>
/* Minimal region size. Every damon_region is aligned by this. */
#define DAMON_MIN_REGION PAGE_SIZE
/* Max priority score for DAMON-based operation schemes */
#define DAMOS_MAX_SCORE (99)
+extern struct damon_ctx **dbgfs_ctxs;
+extern int dbgfs_nr_ctxs;
+
/* Get a random number in [l, r) */
static inline unsigned long damon_rand(unsigned long l, unsigned long r)
{
@@ -68,6 +72,7 @@ struct damon_region {
* @nr_regions: Number of monitoring target regions of this target.
* @regions_list: Head of the monitoring target regions of this target.
* @list: List head for siblings.
+ * @target_lock: Use damon_region lock to avoid race.
*
* Each monitoring context could have multiple targets. For example, a context
* for virtual memory address spaces could have multiple target processes. The
@@ -80,6 +85,7 @@ struct damon_target {
unsigned int nr_regions;
struct list_head regions_list;
struct list_head list;
+ spinlock_t target_lock;
};
/**
@@ -503,8 +509,20 @@ int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
#endif /* CONFIG_DAMON */
#ifdef CONFIG_DAMON_VADDR
+
+/*
+ * 't->id' should be the pointer to the relevant 'struct pid' having reference
+ * count. Caller must put the returned task, unless it is NULL.
+ */
+static inline struct task_struct *damon_get_task_struct(struct damon_target *t)
+{
+ return get_pid_task((struct pid *)t->id, PIDTYPE_PID);
+}
bool damon_va_target_valid(void *t);
void damon_va_set_primitives(struct damon_ctx *ctx);
+void damon_numa_fault(int page_nid, int node_id, struct vm_fault *vmf);
+#else
+static inline void damon_numa_fault(int page_nid, int node_id, struct vm_fault *vmf) { }
#endif /* CONFIG_DAMON_VADDR */
#ifdef CONFIG_DAMON_PADDR
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 933ef51afa71..970fc02abeba 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -157,6 +157,7 @@ struct damon_target *damon_new_target(unsigned long id)
t->id = id;
t->nr_regions = 0;
INIT_LIST_HEAD(&t->regions_list);
+ spin_lock_init(&t->target_lock);
return t;
}
@@ -792,8 +793,11 @@ static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold,
{
struct damon_target *t;
- damon_for_each_target(t, c)
+ damon_for_each_target(t, c) {
+ spin_lock(&t->target_lock);
damon_merge_regions_of(t, threshold, sz_limit);
+ spin_unlock(&t->target_lock);
+ }
}
/*
@@ -879,8 +883,11 @@ static void kdamond_split_regions(struct damon_ctx *ctx)
nr_regions < ctx->max_nr_regions / 3)
nr_subregions = 3;
- damon_for_each_target(t, ctx)
+ damon_for_each_target(t, ctx) {
+ spin_lock(&t->target_lock);
damon_split_regions_of(ctx, t, nr_subregions);
+ spin_unlock(&t->target_lock);
+ }
last_nr_regions = nr_regions;
}
@@ -1000,6 +1007,73 @@ static int kdamond_wait_activation(struct damon_ctx *ctx)
return -EBUSY;
}
+static struct damon_target *get_damon_target(struct task_struct *task)
+{
+ int i;
+ unsigned long id1, id2;
+ struct damon_target *t;
+
+ rcu_read_lock();
+ for (i = 0; i < READ_ONCE(dbgfs_nr_ctxs); i++) {
+ struct damon_ctx *ctx = rcu_dereference(dbgfs_ctxs[i]);
+
+ if (!ctx || !ctx->kdamond)
+ continue;
+ damon_for_each_target(t, dbgfs_ctxs[i]) {
+ struct task_struct *ts = damon_get_task_struct(t);
+
+ if (ts) {
+ id1 = (unsigned long)pid_vnr((struct pid *)t->id);
+ id2 = (unsigned long)pid_vnr(get_task_pid(task, PIDTYPE_PID));
+ put_task_struct(ts);
+ if (id1 == id2)
+ return t;
+ }
+ }
+ }
+ rcu_read_unlock();
+
+ return NULL;
+}
+
+static struct damon_region *get_damon_region(struct damon_target *t, unsigned long addr)
+{
+ struct damon_region *r, *next;
+
+ if (!t || !addr)
+ return NULL;
+
+ spin_lock(&t->target_lock);
+ damon_for_each_region_safe(r, next, t) {
+ if (r->ar.start <= addr && r->ar.end >= addr) {
+ spin_unlock(&t->target_lock);
+ return r;
+ }
+ }
+ spin_unlock(&t->target_lock);
+
+ return NULL;
+}
+
+void damon_numa_fault(int page_nid, int node_id, struct vm_fault *vmf)
+{
+ struct damon_target *t;
+ struct damon_region *r;
+
+ if (nr_online_nodes > 1) {
+ t = get_damon_target(current);
+ if (!t)
+ return;
+ r = get_damon_region(t, vmf->address);
+ if (r) {
+ if (page_nid == node_id)
+ r->local++;
+ else
+ r->remote++;
+ }
+ }
+}
+
/*
* The monitoring daemon that runs as a kernel thread
*/
@@ -1057,8 +1131,10 @@ static int kdamond_fn(void *data)
}
}
damon_for_each_target(t, ctx) {
+ spin_lock(&t->target_lock);
damon_for_each_region_safe(r, next, t)
damon_destroy_region(r, t);
+ spin_unlock(&t->target_lock);
}
if (ctx->callback.before_terminate)
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index 5b899601e56c..c7f4e95abc14 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -15,11 +15,12 @@
#include <linux/page_idle.h>
#include <linux/slab.h>
-static struct damon_ctx **dbgfs_ctxs;
-static int dbgfs_nr_ctxs;
+struct damon_ctx **dbgfs_ctxs;
+int dbgfs_nr_ctxs;
static struct dentry **dbgfs_dirs;
static DEFINE_MUTEX(damon_dbgfs_lock);
+
/*
* Returns non-empty string on success, negative error code otherwise.
*/
@@ -808,10 +809,18 @@ static int dbgfs_rm_context(char *name)
return -ENOMEM;
}
- for (i = 0, j = 0; i < dbgfs_nr_ctxs; i++) {
+ dbgfs_nr_ctxs--;
+ /* Prevent the NUMA fault path from getting a wrong value */
+ smp_mb();
+
+ for (i = 0, j = 0; i < dbgfs_nr_ctxs + 1; i++) {
if (dbgfs_dirs[i] == dir) {
+ struct damon_ctx *tmp_ctx = dbgfs_ctxs[i];
+
+ rcu_assign_pointer(dbgfs_ctxs[i], NULL);
+ synchronize_rcu();
debugfs_remove(dbgfs_dirs[i]);
- dbgfs_destroy_ctx(dbgfs_ctxs[i]);
+ dbgfs_destroy_ctx(tmp_ctx);
continue;
}
new_dirs[j] = dbgfs_dirs[i];
@@ -823,7 +832,6 @@ static int dbgfs_rm_context(char *name)
dbgfs_dirs = new_dirs;
dbgfs_ctxs = new_ctxs;
- dbgfs_nr_ctxs--;
return 0;
}
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 732b41ed134c..78b90972d171 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -22,15 +22,6 @@
#define DAMON_MIN_REGION 1
#endif
-/*
- * 't->id' should be the pointer to the relevant 'struct pid' having reference
- * count. Caller must put the returned task, unless it is NULL.
- */
-static inline struct task_struct *damon_get_task_struct(struct damon_target *t)
-{
- return get_pid_task((struct pid *)t->id, PIDTYPE_PID);
-}
-
/*
* Get the mm_struct of the given target
*
@@ -363,7 +354,9 @@ static void damon_va_update(struct damon_ctx *ctx)
damon_for_each_target(t, ctx) {
if (damon_va_three_regions(t, three_regions))
continue;
+ spin_lock(&t->target_lock);
damon_va_apply_three_regions(t, three_regions);
+ spin_unlock(&t->target_lock);
}
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..9cb413a8cd4a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -34,6 +34,7 @@
#include <linux/oom.h>
#include <linux/numa.h>
#include <linux/page_owner.h>
+#include <linux/damon.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1450,6 +1451,10 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
flags |= TNF_NO_GROUP;
page_nid = page_to_nid(page);
+
+ /* Get the NUMA accesses of monitored processes by DAMON */
+ damon_numa_fault(page_nid, numa_node_id(), vmf);
+
last_cpupid = page_cpupid_last(page);
target_nid = numa_migrate_prep(page, vma, haddr, page_nid,
&flags);
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..fb55264f36af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -74,6 +74,7 @@
#include <linux/perf_event.h>
#include <linux/ptrace.h>
#include <linux/vmalloc.h>
+#include <linux/damon.h>
#include <trace/events/kmem.h>
@@ -4392,6 +4393,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
+
+ /* Get the NUMA accesses of monitored processes by DAMON */
+ damon_numa_fault(page_nid, numa_node_id(), vmf);
+
target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid,
&flags);
if (target_nid == NUMA_NO_NODE) {
--
2.27.0
For applications that frequently access their memory, the NUMA simulation
causes a lot of page faults and TLB misses, which leads to a performance
regression for those applications.
So this patch adds a switch that keeps the NUMA simulation off by default;
if you want to turn it on, do the following:
# cd /sys/kernel/debug/damon/
# echo on > numa_stat
# cat numa_stat
on
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
include/linux/damon.h | 3 +++
mm/damon/core.c | 10 ++++++++-
mm/damon/dbgfs.c | 52 +++++++++++++++++++++++++++++++++++++++++--
mm/damon/paddr.c | 6 +++--
mm/damon/vaddr.c | 6 +++--
5 files changed, 70 insertions(+), 7 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 5bf1eb92584b..c7d7613e1a17 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -19,6 +19,9 @@
/* Max priority score for DAMON-based operation schemes */
#define DAMOS_MAX_SCORE (99)
+/* Switch for NUMA fault */
+DECLARE_STATIC_KEY_FALSE(numa_stat_enabled_key);
+
extern struct damon_ctx **dbgfs_ctxs;
extern int dbgfs_nr_ctxs;
diff --git a/mm/damon/core.c b/mm/damon/core.c
index 970fc02abeba..4aa3c2d3895c 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -1060,7 +1060,8 @@ void damon_numa_fault(int page_nid, int node_id, struct vm_fault *vmf)
struct damon_target *t;
struct damon_region *r;
- if (nr_online_nodes > 1) {
+ if (static_branch_unlikely(&numa_stat_enabled_key)
+ && nr_online_nodes > 1) {
t = get_damon_target(current);
if (!t)
return;
@@ -1151,6 +1152,13 @@ static int kdamond_fn(void *data)
nr_running_ctxs--;
mutex_unlock(&damon_lock);
+ /*
+ * When no kdamond threads are running, the
+ * 'numa_stat_enabled_key' keeps its default value.
+ */
+ if (!nr_running_ctxs)
+ static_branch_disable(&numa_stat_enabled_key);
+
return 0;
}
diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c
index c7f4e95abc14..0ef35dbfda39 100644
--- a/mm/damon/dbgfs.c
+++ b/mm/damon/dbgfs.c
@@ -609,6 +609,49 @@ static ssize_t dbgfs_kdamond_pid_read(struct file *file,
return len;
}
+DEFINE_STATIC_KEY_FALSE(numa_stat_enabled_key);
+
+static ssize_t dbgfs_numa_stat_read(struct file *file,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ char numa_on_buf[5];
+ bool enable = static_branch_unlikely(&numa_stat_enabled_key);
+ int len;
+
+ len = scnprintf(numa_on_buf, 5, enable ? "on\n" : "off\n");
+
+ return simple_read_from_buffer(buf, count, ppos, numa_on_buf, len);
+}
+
+static ssize_t dbgfs_numa_stat_write(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ ssize_t ret = 0;
+ char *kbuf;
+
+ kbuf = user_input_str(buf, count, ppos);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ /* Remove white space */
+ if (sscanf(kbuf, "%s", kbuf) != 1) {
+ kfree(kbuf);
+ return -EINVAL;
+ }
+
+ if (!strncmp(kbuf, "on", count))
+ static_branch_enable(&numa_stat_enabled_key);
+ else if (!strncmp(kbuf, "off", count))
+ static_branch_disable(&numa_stat_enabled_key);
+ else
+ ret = -EINVAL;
+
+ if (!ret)
+ ret = count;
+ kfree(kbuf);
+ return ret;
+}
+
static int damon_dbgfs_open(struct inode *inode, struct file *file)
{
file->private_data = inode->i_private;
@@ -645,12 +688,17 @@ static const struct file_operations kdamond_pid_fops = {
.read = dbgfs_kdamond_pid_read,
};
+static const struct file_operations numa_stat_ops = {
+ .write = dbgfs_numa_stat_write,
+ .read = dbgfs_numa_stat_read,
+};
+
static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx)
{
const char * const file_names[] = {"attrs", "schemes", "target_ids",
- "init_regions", "kdamond_pid"};
+ "init_regions", "kdamond_pid", "numa_stat"};
const struct file_operations *fops[] = {&attrs_fops, &schemes_fops,
- &target_ids_fops, &init_regions_fops, &kdamond_pid_fops};
+ &target_ids_fops, &init_regions_fops, &kdamond_pid_fops, &numa_stat_ops};
int i;
for (i = 0; i < ARRAY_SIZE(file_names); i++)
diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c
index b8feacf15592..9b9920784f22 100644
--- a/mm/damon/paddr.c
+++ b/mm/damon/paddr.c
@@ -30,14 +30,16 @@ static bool __damon_pa_mk_set(struct page *page, struct vm_area_struct *vma,
addr = pvmw.address;
if (pvmw.pte) {
damon_ptep_mkold(pvmw.pte, vma->vm_mm, addr);
- if (nr_online_nodes > 1) {
+ if (static_branch_unlikely(&numa_stat_enabled_key) &&
+ nr_online_nodes > 1) {
result = damon_ptep_mknone(pvmw.pte, vma, addr);
if (result)
flush_tlb_page(vma, addr);
}
} else {
damon_pmdp_mkold(pvmw.pmd, vma->vm_mm, addr);
- if (nr_online_nodes > 1) {
+ if (static_branch_unlikely(&numa_stat_enabled_key) &&
+ nr_online_nodes > 1) {
result = damon_pmdp_mknone(pvmw.pmd, vma, addr);
if (result) {
unsigned long haddr = addr & HPAGE_PMD_MASK;
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 78b90972d171..5c2e2c2e29bb 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -371,7 +371,8 @@ static int damon_va_pmd_entry(pmd_t *pmd, unsigned long addr,
ptl = pmd_lock(walk->mm, pmd);
if (pmd_huge(*pmd)) {
damon_pmdp_mkold(pmd, walk->mm, addr);
- if (nr_online_nodes > 1)
+ if (static_branch_unlikely(&numa_stat_enabled_key) &&
+ nr_online_nodes > 1)
result = damon_pmdp_mknone(pmd, walk->vma, addr);
spin_unlock(ptl);
if (result) {
@@ -392,7 +393,8 @@ static int damon_va_pmd_entry(pmd_t *pmd, unsigned long addr,
return 0;
}
damon_ptep_mkold(pte, walk->mm, addr);
- if (nr_online_nodes > 1)
+ if (static_branch_unlikely(&numa_stat_enabled_key) &&
+ nr_online_nodes > 1)
result = damon_ptep_mknone(pte, walk->vma, addr);
pte_unmap_unlock(pte, ptl);
if (result)
--
2.27.0
This patch adds 'damon_region' NUMA access statistics to the tracepoint,
so that users can obtain the NUMA access status of a 'damon_region'
through perf or the damo tool.
Signed-off-by: Xin Hao <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
include/trace/events/damon.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
index c79f1d4c39af..687b7aba751e 100644
--- a/include/trace/events/damon.h
+++ b/include/trace/events/damon.h
@@ -23,6 +23,8 @@ TRACE_EVENT(damon_aggregated,
__field(unsigned long, end)
__field(unsigned int, nr_accesses)
__field(unsigned int, age)
+ __field(unsigned long, local)
+ __field(unsigned long, remote)
),
TP_fast_assign(
@@ -32,12 +34,15 @@ TRACE_EVENT(damon_aggregated,
__entry->end = r->ar.end;
__entry->nr_accesses = r->nr_accesses;
__entry->age = r->age;
+ __entry->local = r->local;
+ __entry->remote = r->remote;
),
- TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u",
+ TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u %u %lu %lu",
__entry->target_id, __entry->nr_regions,
__entry->start, __entry->end,
- __entry->nr_accesses, __entry->age)
+ __entry->nr_accesses, __entry->age,
+ __entry->local, __entry->remote)
);
#endif /* _TRACE_DAMON_H */
--
2.27.0
+ David Rientjes, who has shown interest[1] in this topic.
[1] https://lore.kernel.org/linux-mm/[email protected]/
---
Hi Xin,
Thank you always for great patches!
On Wed, 16 Feb 2022 16:30:36 +0800 Xin Hao <[email protected]> wrote:
[...]
I'd like to comment on the high level design at the moment. To my
understanding, this patchset extends DAMON core and the monitoring operations
for virtual address spaces (vaddr) and the physical address space (paddr) to
monitor NUMA-local/remote accesses via the PROT_NONE and page fault mechanism.
The underlying mechanism for NUMA-local/remote accesses (PROT_NONE and page
fault) looks ok to me. But, the changes to the core and vaddr/paddr operations
look unnecessary to me. That's also not for general use cases.
I think it would be simpler to implement more monitoring operations for the NUMA
monitoring use case (one for NUMA-local accesses accounting and another one for
NUMA-remote accesses accounting), alongside vaddr and paddr. Then, users could
configure DAMON to have three monitoring contexts (one with vaddr ops, a second
one with numa-local ops, and a third one with numa-remote ops), run those
concurrently, then show the three results and make some decisions like
migrations.
One additional advantage of this approach is that the accuracy for
NUMA-local/remote accesses could be better, because the contexts configured to
use NUMA-local/remote monitoring ops will do the regions adjustment with
NUMA-local/remote accesses (to my understanding, this patchset lets regions have
NUMA-local/remote access counters in addition to the total one, but still uses
only the total one for the regions adjustment).
If I'm missing something, please let me know.
Thanks,
SJ
>
> --
> 2.27.0
Hi SeongJae:
On 2/17/22 4:29 PM, SeongJae Park wrote:
> + David Rientjes, who has shown interest[1] in this topic.
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> ---
>
> Hi Xin,
>
>
> Thank you always for great patches!
>
> On Wed, 16 Feb 2022 16:30:36 +0800 Xin Hao <[email protected]> wrote:
>
[...]
> I'd like to comment on the high level design at the moment. To my
> understanding, this patchset extends DAMON core and monitoring operations for
> virtual address spaces (vaddr) and the physical address space (paddr) to
> monitor NUMA-local/remote accesses via PROT_NONE and page faults mechanism.
>
> The underlying mechanism for NUMA-local/remote accesses (PROT_NONE and page
> fault) looks ok to me. But, changes to the core and vaddr/paddr operations
> looks unnecessary, to me. That's also not for general use cases.
You are right, adding the NUMA access statistics does make the PA & VA code
look confusing.
>
> I think it would be simpler to implment more monitoring operations for NUMA
> monitoring use case (one for NUMA-local accesses accounting and another one for
> NUMA-remote accesses accounting), alongside vaddr and paddr. Then, users could
> configure DAMON to have three monitoring contexts (one with vaddr ops, second
> one with numa-local ops, and third one with numa-remote ops), run those
> concurrently, then show the three results and make some decisions like
> migrations.
Thanks for your advice, I will implement this in the next version. But from
my understanding (or maybe I didn't get what you were thinking), only one
monitoring context is needed for NUMA local & remote; there is no need for
separate implementations like "numa_local_ops" and "numa_remote_ops", a
single "numa_access_ops" should be enough.
>
> One additional advantage of this approach is that the accuracy for
> NUMA-local/remote accessed could be better, because the contexts configured to
> use NUMA-local/remote monitoring ops will do the regions adjustment with
> NUMA-local/remote accesses (to my understanding, this patchset let regions have
> NUMA-local/remote accesses counter in addition to the total one, but still use
> only the total one for the regions adjustment).
>
> If I'm missing something, please let me know.
>
>
> Thanks,
> SJ
>
>> --
>> 2.27.0
--
Best Regards!
Xin Hao
On Fri, 18 Feb 2022 10:21:27 +0800 Xin Hao <[email protected]> wrote:
> Hi SeongJae:
>
> On 2/17/22 4:29 PM, SeongJae Park wrote:
> > + David Rientjes, who has shown interest[1] in this topic.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > ---
> >
> > Hi Xin,
> >
> >
> > Thank you always for great patches!
> >
> > On Wed, 16 Feb 2022 16:30:36 +0800 Xin Hao <[email protected]> wrote:
> >
[...]
> > I'd like to comment on the high level design at the moment. To my
> > understanding, this patchset extends DAMON core and monitoring operations for
> > virtual address spaces (vaddr) and the physical address space (paddr) to
> > monitor NUMA-local/remote accesses via PROT_NONE and page faults mechanism.
> >
> > The underlying mechanism for NUMA-local/remote accesses (PROT_NONE and page
> > fault) looks ok to me. But, changes to the core and vaddr/paddr operations
> > looks unnecessary, to me. That's also not for general use cases.
> You are right, adding NUMA access statistics does make the PA & VA codes
> look confusing。
> >
> > I think it would be simpler to implment more monitoring operations for NUMA
> > monitoring use case (one for NUMA-local accesses accounting and another one for
> > NUMA-remote accesses accounting), alongside vaddr and paddr. Then, users could
> > configure DAMON to have three monitoring contexts (one with vaddr ops, second
> > one with numa-local ops, and third one with numa-remote ops), run those
> > concurrently, then show the three results and make some decisions like
> > migrations.
>
> Thanks for your advice, I will implement these in the next version, But
> from my understanding or maybe
>
> I didn't get what you were thinking, I think only one monitor context is
> needed for NUMA Local & Remote,
>
> Do not need a separate implementation like "numa_local_ops" and
> "numa_remote_ops", just set "numa_access_ops" is ok.
Sorry for the insufficient explanation of my concern. In short, I'm concerned
about the regions adjustment.
You may do so by storing the NUMA-local access count and the NUMA-remote access
count in the nr_accesses field of each region, e.g., saving the NUMA-local access
count in the upper-half bits of nr_accesses and the NUMA-remote access count in
the lower-half bits. However, DAMON will then do the regions adjustment based
on that mixed value, so the accuracy would be degraded. So I think we need to
implement a monitoring operations set for each kind of access that we want to
monitor. A rough sketch of the packing I mean is below.
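For illustration only (this is a sketch of the hypothetical packing, not code
from the patchset; the names are made up):

/*
 * Hypothetical packing of NUMA-local/remote counts into a single
 * nr_accesses value (upper half: local, lower half: remote).
 */
#define DAMON_NUMA_SHIFT	(sizeof(unsigned int) * 4)
#define DAMON_NUMA_MASK		((1U << DAMON_NUMA_SHIFT) - 1)

static inline unsigned int damon_numa_pack(unsigned int local,
					   unsigned int remote)
{
	return (local << DAMON_NUMA_SHIFT) | (remote & DAMON_NUMA_MASK);
}

The problem is that the regions adjustment (e.g., the merge decision, which
compares the nr_accesses values of adjacent regions) would then operate on this
mixed value, so two regions with very different local/remote splits could be
treated as similar or different for the wrong reason.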
>
> >
> > One additional advantage of this approach is that the accuracy for
> > NUMA-local/remote accessed could be better, because the contexts configured to
> > use NUMA-local/remote monitoring ops will do the regions adjustment with
> > NUMA-local/remote accesses (to my understanding, this patchset let regions have
> > NUMA-local/remote accesses counter in addition to the total one, but still use
> > only the total one for the regions adjustment).
My previous comment above might help clarify my concern.
If I'm missing something, please let me know.
Thanks,
SJ
> >
> > If I'm missing something, please let me know.
> >
> >
> > Thanks,
> > SJ
> >
> >> --
> >> 2.27.0
>
> --
> Best Regards!
> Xin Hao
>