2023-12-15 07:48:47

by Prasad, Aravinda

Subject: mm/DAMON: Profiling enhancements for DAMON

DAMON randomly samples one or more pages in every region and tracks
accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
When the region size is large (e.g., several GBs), which is common
for large footprint applications, detecting whether the region is
accessed or not completely depends on whether the pages that are
actively accessed in the region are picked during random sampling.
If such pages are not picked for sampling, DAMON fails to identify
the region as accessed. However, increasing the sampling rate or the
number of regions increases the CPU overhead of kdamond.
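
To make the risk concrete, a rough back-of-the-envelope model (a
simplification, assuming uniformly random and independent samples): if a
fraction h of a region is hot and k samples are taken per aggregation
window, the probability that every sample misses the hot data is
(1 - h)^k. With, e.g., h = 0.2% (10GB hot data in a 5TB region) and
k = 40 (200ms aggregation / 5ms sampling), that is roughly
0.998^40 ~= 0.92, i.e. the region is usually reported as not accessed
at all.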

This patch proposes profiling different levels of the application’s
page table tree to detect whether a region is accessed or not. This
patch is based on the observation that, when the accessed bit for a
page is set, the accessed bits at the higher levels of the page table
tree (PMD/PUD/PGD) corresponding to the path of the page table walk
are also set. Hence, it is efficient to check the accessed bits at
the higher levels of the page table tree to detect whether a region
is accessed or not. For example, if the accessed bit for a PUD entry
is set, then one or more pages in the 1GB range covered by that PUD
entry have been accessed, as each PUD entry maps 1GB. Hence, instead
of sampling thousands of 4K/2M pages to detect accesses in a large
region, sampling at a higher level of the page table tree is faster
and more efficient.
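
For concreteness, the address range covered by a single accessed bit at
each level, assuming x86-64 with 4-level paging and 4KB base pages
(illustrative values only, not part of the patch):

        /* coverage of one accessed bit per page table level */
        #define PTE_COVERAGE    (4UL << 10)     /*   4KB               */
        #define PMD_COVERAGE    (2UL << 20)     /*   2MB = 512 PTEs    */
        #define PUD_COVERAGE    (1UL << 30)     /*   1GB = 512 PMDs    */
        #define PGD_COVERAGE    (512UL << 30)   /* 512GB = 512 PUDs    */

So a single pud_young() test covers the same address range as 262144
individual 4K PTE samples.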

This patch is based on 6.6.3 kernel.

TODO: Support 5-level page table tree

Evaluation:

- MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
and 5TB with 10GB hot data.
- DAMON: 5ms sampling, 200ms aggregation interval. All other
parameters set to their default values.
- DAMON+PTP: Page table profiling applied to DAMON with the above
parameters.

Profiling efficiency in detecting hot data [1]:

Footprint     1GB     10GB    100GB   5TB
---------------------------------------------
DAMON         >90%    <50%    ~0%     0%
DAMON+PTP     >90%    >90%    >90%    >90%

CPU overheads (in billion cycles) for kdamond:

Footprint     1GB     10GB    100GB   5TB
---------------------------------------------
DAMON         1.15    19.53   3.52    9.55
DAMON+PTP     0.83    3.20    1.27    2.55

A detailed explanation and evaluation can be found in the arXiv paper:
[1] https://arxiv.org/pdf/2311.10275.pdf

Regards,
Aravinda

Signed-off-by: Alan Nair <[email protected]>
Signed-off-by: Sandeep Kumar <[email protected]>
Signed-off-by: Aravinda Prasad <[email protected]>
---
arch/x86/include/asm/pgtable.h | 17 +++++
include/linux/damon.h | 13 ++++
include/linux/pgtable.h | 31 ++++++++
mm/damon/core.c | 28 ++++++++
mm/damon/vaddr.c | 146 +++++++++++++++++++++++++++++++++++++---
5 files changed, 223 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e02b179ec..accdabb95 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -169,6 +169,11 @@ static inline int pud_young(pud_t pud)
return pud_flags(pud) & _PAGE_ACCESSED;
}

+static inline int pgd_young(pgd_t pgd)
+{
+ return pgd_flags(pgd) & _PAGE_ACCESSED;
+}
+
static inline int pte_write(pte_t pte)
{
/*
@@ -681,6 +686,18 @@ static inline pud_t pud_mkwrite(pud_t pud)
return pud_clear_saveddirty(pud);
}

+static inline pgd_t pgd_clear_flags(pgd_t pgd, pgdval_t clear)
+{
+ pgdval_t v = native_pgd_val(pgd);
+
+ return native_make_pgd(v & ~clear);
+}
+
+static inline pgd_t pgd_mkold(pgd_t pgd)
+{
+ return pgd_clear_flags(pgd, _PAGE_ACCESSED);
+}
+
#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
static inline int pte_soft_dirty(pte_t pte)
{
diff --git a/include/linux/damon.h b/include/linux/damon.h
index c70cca8a8..8521a62ec 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -19,6 +19,14 @@
/* Max priority score for DAMON-based operation schemes */
#define DAMOS_MAX_SCORE (99)

+/* DAMON profiling levels */
+enum damon_profile_level {
+ PTE_LEVEL,
+ PMD_LEVEL,
+ PUD_LEVEL,
+ PGD_LEVEL,
+};
+
/* Get a random number in [l, r) */
static inline unsigned long damon_rand(unsigned long l, unsigned long r)
{
@@ -57,6 +65,8 @@ struct damon_region {
unsigned int age;
/* private: Internal value for age calculation. */
unsigned int last_nr_accesses;
+ /* Page table profiling level */
+ enum damon_profile_level profile_level;
};

/**
@@ -656,6 +666,9 @@ int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
int damon_set_region_biggest_system_ram_default(struct damon_target *t,
unsigned long *start, unsigned long *end);

+enum damon_profile_level pick_profile_level(unsigned long start,
+ unsigned long end, unsigned long addr);
+
#endif /* CONFIG_DAMON */

#endif /* _DAMON_H */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index af7639c3b..82d5f67ea 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -935,6 +935,37 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
#endif

+/*
+ * When walking page tables, get the address of the current/passed boundary,
+ * or the start address of the range if that comes earlier.
+ */
+
+#define pgd_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PGDIR_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+
+#ifndef p4d_addr_start
+#define p4d_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & P4D_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
+#ifndef pud_addr_start
+#define pud_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PUD_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
+#ifndef pmd_addr_start
+#define pmd_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PMD_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
/*
* When walking page tables, get the address of the next boundary,
* or the end address of the range if that comes earlier. Although no
diff --git a/mm/damon/core.c b/mm/damon/core.c
index fd5be73f6..2a7d5c041 100644
--- a/mm/damon/core.c
+++ b/mm/damon/core.c
@@ -31,6 +31,33 @@ static struct damon_operations damon_registered_ops[NR_DAMON_OPS];

static struct kmem_cache *damon_region_cache __ro_after_init;

+/* Pick the highest possible page table profiling level for the
+ * region corresponding to addr
+ */
+enum damon_profile_level pick_profile_level(unsigned long start,
+ unsigned long end, unsigned long addr)
+{
+ enum damon_profile_level level = PTE_LEVEL;
+
+ if (pmd_addr_start(addr, (start) - 1) < start
+ || pmd_addr_end(addr, (end) + 1) > end)
+ goto out;
+ level = PMD_LEVEL;
+
+ if (pud_addr_start(addr, (start) - 1) < start
+ || pud_addr_end(addr, (end) + 1) > end)
+ goto out;
+ level = PUD_LEVEL;
+
+ if (pgd_addr_start(addr, (start) - 1) < start
+ || pgd_addr_end(addr, (end) + 1) > end)
+ goto out;
+ level = PGD_LEVEL;
+
+out:
+ return level;
+}
+
/* Should be called under damon_ops_lock with id smaller than NR_DAMON_OPS */
static bool __damon_is_registered_ops(enum damon_ops_id id)
{
@@ -132,6 +159,7 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end)

region->age = 0;
region->last_nr_accesses = 0;
+ region->profile_level = PTE_LEVEL;

return region;
}
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index cf8a9fc5c..b71221b3e 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -387,16 +387,76 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
#define damon_mkold_hugetlb_entry NULL
#endif /* CONFIG_HUGETLB_PAGE */

-static const struct mm_walk_ops damon_mkold_ops = {
- .pmd_entry = damon_mkold_pmd_entry,
+static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+
+ if (!pmd_present(*pmd) || pmd_none(*pmd))
+ goto out;
+
+ ptl = pmd_lock(walk->mm, pmd);
+ if (pmd_young(*pmd))
+ *pmd = pmd_mkold(*pmd);
+
+ spin_unlock(ptl);
+
+out:
+ return 0;
+}
+
+static int damon_mkold_pud(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+
+ if (!pud_present(*pud) || pud_none(*pud))
+ goto out;
+
+ ptl = pud_lock(walk->mm, pud);
+ if (pud_young(*pud))
+ *pud = pud_mkold(*pud);
+
+ spin_unlock(ptl);
+
+out:
+ return 0;
+}
+
+static int damon_mkold_pgd(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+
+ if (!pgd_present(*pgd) || pgd_none(*pgd))
+ goto out;
+
+ spin_lock(&pgd_lock);
+ if (pgd_young(*pgd))
+ *pgd = pgd_mkold(*pgd);
+
+ spin_unlock(&pgd_lock);
+
+out:
+ return 0;
+}
+
+static const struct mm_walk_ops damon_mkold_ops[] = {
+ {.pmd_entry = damon_mkold_pmd_entry,
.hugetlb_entry = damon_mkold_hugetlb_entry,
- .walk_lock = PGWALK_RDLOCK,
+ .walk_lock = PGWALK_RDLOCK},
+ {.pmd_entry = damon_mkold_pmd},
+ {.pud_entry = damon_mkold_pud},
+ {.pgd_entry = damon_mkold_pgd},
};

-static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
+static void damon_va_mkold(struct mm_struct *mm, struct damon_region *r)
{
+ unsigned long addr = r->sampling_addr;
+
+ r->profile_level = pick_profile_level(r->ar.start, r->ar.end, addr);
+
mmap_read_lock(mm);
- walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
+ walk_page_range(mm, addr, addr + 1, damon_mkold_ops + r->profile_level, NULL);
mmap_read_unlock(mm);
}

@@ -409,7 +469,7 @@ static void __damon_va_prepare_access_check(struct mm_struct *mm,
{
r->sampling_addr = damon_rand(r->ar.start, r->ar.end);

- damon_va_mkold(mm, r->sampling_addr);
+ damon_va_mkold(mm, r);
}

static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
@@ -531,22 +591,84 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
#define damon_young_hugetlb_entry NULL
#endif /* CONFIG_HUGETLB_PAGE */

-static const struct mm_walk_ops damon_young_ops = {
- .pmd_entry = damon_young_pmd_entry,
+static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pmd_present(*pmd) || pmd_none(*pmd))
+ goto out;
+
+ ptl = pmd_lock(walk->mm, pmd);
+ if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PMD_SHIFT);
+ spin_unlock(ptl);
+out:
+ return 0;
+}
+
+static int damon_young_pud(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pud_present(*pud) || pud_none(*pud))
+ goto out;
+
+ ptl = pud_lock(walk->mm, pud);
+ if (pud_young(*pud) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PUD_SHIFT);
+
+ spin_unlock(ptl);
+out:
+ return 0;
+}
+
+static int damon_young_pgd(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pgd_present(*pgd) || pgd_none(*pgd))
+ goto out;
+
+ spin_lock(&pgd_lock);
+ if (pgd_young(*pgd) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PGDIR_SHIFT);
+
+ spin_unlock(&pgd_lock);
+out:
+ return 0;
+}
+
+static const struct mm_walk_ops damon_young_ops[] = {
+ {.pmd_entry = damon_young_pmd_entry,
.hugetlb_entry = damon_young_hugetlb_entry,
- .walk_lock = PGWALK_RDLOCK,
+ .walk_lock = PGWALK_RDLOCK},
+ {.pmd_entry = damon_young_pmd},
+ {.pud_entry = damon_young_pud},
+ {.pgd_entry = damon_young_pgd},
};

-static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
+static bool damon_va_young(struct mm_struct *mm, struct damon_region *r,
unsigned long *folio_sz)
{
+ unsigned long addr = r->sampling_addr;
struct damon_young_walk_private arg = {
.folio_sz = folio_sz,
.young = false,
};

mmap_read_lock(mm);
- walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
+ walk_page_range(mm, addr, addr + 1, damon_young_ops + r->profile_level, &arg);
mmap_read_unlock(mm);
return arg.young;
}
@@ -572,7 +694,7 @@ static void __damon_va_check_access(struct mm_struct *mm,
return;
}

- last_accessed = damon_va_young(mm, r->sampling_addr, &last_folio_sz);
+ last_accessed = damon_va_young(mm, r, &last_folio_sz);
if (last_accessed)
r->nr_accesses++;



2023-12-15 08:34:10

by Yu Zhao

Subject: Re: mm/DAMON: Profiling enhancements for DAMON

On Fri, Dec 15, 2023 at 12:42 AM Aravinda Prasad
<[email protected]> wrote:
...

> This patch proposes profiling different levels of the application’s
> page table tree to detect whether a region is accessed or not. This
> patch is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not.

This patch can crash on Xen. See commit 4aaf269c768d("mm: introduce
arch_has_hw_nonleaf_pmd_young()")

MGLRU already does this in the correct way. See mm/vmscan.c.

This patch also can cause USER DATA CORRUPTION. See commit
c11d34fa139e ("mm/damon/ops-common: atomically test and clear young on
ptes and pmds").

The quality of your patch makes me very much doubt the quality of your
paper, especially your results on Google's kstaled and MGLRU in table
6.2.

2023-12-15 10:08:09

by Prasad, Aravinda

Subject: RE: mm/DAMON: Profiling enhancements for DAMON



> -----Original Message-----
> From: Yu Zhao <[email protected]>
> Sent: Friday, December 15, 2023 2:03 PM
> To: Prasad, Aravinda <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kumar, Sandeep4
> <[email protected]>; Huang, Ying <[email protected]>;
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>; Subramoney, Sreenivas
> <[email protected]>; Kervinen, Antti
> <[email protected]>; Kanevskiy, Alexander
> <[email protected]>; Alan Nair <[email protected]>; Juergen
> Gross <[email protected]>; Ryan Roberts <[email protected]>
> Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> On Fri, Dec 15, 2023 at 12:42 AM Aravinda Prasad
> <[email protected]> wrote:
> ...
>
> > This patch proposes profiling different levels of the application’s
> > page table tree to detect whether a region is accessed or not. This
> > patch is based on the observation that, when the accessed bit for a
> > page is set, the accessed bits at the higher levels of the page table
> > tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> > are also set. Hence, it is efficient to check the accessed bits at
> > the higher levels of the page table tree to detect whether a region is
> > accessed or not.
>
> This patch can crash on Xen. See commit 4aaf269c768d("mm: introduce
> arch_has_hw_nonleaf_pmd_young()")

Will fix as suggested in the commit.

>
> MGLRU already does this in the correct way. See mm/vmscan.c.

I don't see access bits at the PUD or PGD level checked for the 4K page size.
Can you point me to the code where access bits are checked at the PUD and PGD
levels?

>
> This patch also can cause USER DATA CORRUPTION. See commit
> c11d34fa139e ("mm/damon/ops-common: atomically test and clear young
> on ptes and pmds").

Ok. Will atomically test and clear the access bits.
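
Something along these lines for the PMD case (untested sketch; config
guards simplified and the PUD/PGD variants omitted):

        static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
                        unsigned long next, struct mm_walk *walk)
        {
                spinlock_t *ptl;

                /* skip if the non-leaf accessed bit cannot be used (e.g. Xen PV) */
                if (!arch_has_hw_nonleaf_pmd_young())
                        return 0;

                ptl = pmd_lock(walk->mm, pmd);
                if (!pmd_none(*pmd))
                        /* atomic test-and-clear so concurrent hardware updates are not lost */
                        pmdp_test_and_clear_young(walk->vma, addr, pmd);
                spin_unlock(ptl);

                return 0;
        }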

>
> The quality of your patch makes me very much doubt the quality of your
> paper, especially your results on Google's kstaled and MGLRU in table 6.2.

The results are very much reproducible. We have not used kstaled/MGLRU for
the data in Figure 3; instead, we linearly scan pages, similar to kstaled,
using a kernel thread we implemented for scanning.

Our argument for kstaled/MGLRU is that scanning individual pages at 4K
granularity may not be efficient for large-footprint applications. Instead,
the accessed bits at higher levels of the page table tree can be used. In the
paper we have demonstrated this with DAMON, but the concept can be
applied to kstaled/MGLRU as well.

Regards,
Aravinda

2023-12-15 18:12:10

by Dave Hansen

Subject: Re: mm/DAMON: Profiling enhancements for DAMON

On 12/14/23 23:46, Aravinda Prasad wrote:
> +static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + spinlock_t *ptl;
> + struct damon_young_walk_private *priv = walk->private;
> +
> + if (!pmd_present(*pmd) || pmd_none(*pmd))
> + goto out;
> +
> + ptl = pmd_lock(walk->mm, pmd);
> + if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
> + priv->young = true;
> +
> + *priv->folio_sz = (1UL << PMD_SHIFT);
> + spin_unlock(ptl);
> +out:
> + return 0;
> +}

There are a number of paired p*_present() and p_*none() checks in this
patch. What is their function (especially pmd_none())? For instance,
damon_young_pmd() gets called via the pagewalk infrastructure:

> static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> struct mm_walk *walk)
> {
...
> next = pmd_addr_end(addr, end);
> if (pmd_none(*pmd)) {
...
> continue;
> }
...
> if (ops->pmd_entry)
> err = ops->pmd_entry(pmd, addr, next, walk);

I'd suggest taking a closer look at the code you are replacing and other
mm_walk users to make sure the handlers are doing appropriate checks.

2023-12-15 20:12:16

by SeongJae Park

Subject: Re: mm/DAMON: Profiling enhancements for DAMON

Hello Aravinda,

> Subject: Re: mm/DAMON: Profiling enhancements for DAMON

Nit. I'd suggest using the 'mm/damon: ' prefix for consistency. Also, adding a
[PATCH] prefix before that would help people easily classify this nice mail.

And if you are posting this patch aiming not to be merged as is but to share
the early implementation and get some comments, you could further make your
intention clear using [RFC PATCH] prefix instead.

On 2023-12-15T13:16:19+05:30 Aravinda Prasad <[email protected]> wrote:

> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.
>
> This patch proposes profiling different levels of the application's
> page table tree to detect whether a region is accessed or not. This
> patch is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region,
> sampling at the higher level of page table tree is faster and efficient.

The high level problem and solution ideas make sense to me, thank you for
sharing these nice finding and patch!

>
> This patch is based on 6.6.3 kernel.

6.6.3 would be a nice baseline for RFC-level patches, since it would let more
people test it. But for non-RFC patches, I'd suggest basing DAMON patches on
the latest mm-unstable[1] if possible.

[1] https://git.kernel.org/akpm/mm/h/mm-unstable

>
> TODO: Support 5-level page table tree

Looking forward to that! :)

>
> Evaluation:
>
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> parameters set to default value.

The default setup is 5ms sampling and 100ms aggregation intervals. Was there a
reason to use a 200ms aggregation interval instead of the default one?

> - DAMON+PTP: Page table profiling applied to DAMON with the above
> parameters.
>
> Profiling efficiency in detecting hot data [1]:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON >90% <50% ~0% 0%
> DAMON+PTP >90% >90% >90% >90%

DAMON provides both frequency and a sort of recency information via
'nr_accesses' and 'age', respectively. Using those together, we could minimize
DAMON's accuracy risk. For example, if we define hot pages with only the
frequency information, say, pages that DAMON reported as having >=40
'nr_accesses' (100% access rate), I have no doubt about the above result. But
if we define hot pages with both kinds of information, say, pages that DAMON
did not report as having 0 'nr_accesses' for more than 10 'age' (not accessed
for more than 2 seconds), I think the number might be higher. So I'm wondering
how you defined a hot page for this evaluation, and whether you also measured
using the age information.
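
To be clear about what I mean by using the two together, a trivial
illustration (not something in the tree, just an example predicate):

        /* illustrative only: a region that looks cold by both metrics */
        static bool damon_region_looks_cold(struct damon_region *r,
                        unsigned int min_age)
        {
                return r->nr_accesses == 0 && r->age >= min_age;
        }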

Note that this doesn't mean DAMON+PTP is not needed. I think DAMON+PTP would
anyway be better even when we use the 'age' together. I'm just curious whether
you have considered such approaches and already have such test results. I'm
also not asking you to do the tests and add those here. But mentioning your
thoughts about such an approach could also be helpful for readers of this
patch, imho.

>
> CPU overheads (in billion cycles) for kdamond:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON 1.15 19.53 3.52 9.55
> DAMON+PTP 0.83 3.20 1.27 2.55
>
> A detailed explanation and evaluation can be found in the arXiv paper:
> [1] https://arxiv.org/pdf/2311.10275.pdf
>
> Regards,
> Aravinda
>
> Signed-off-by: Alan Nair <[email protected]>
> Signed-off-by: Sandeep Kumar <[email protected]>
> Signed-off-by: Aravinda Prasad <[email protected]>
> ---

Though it's not clearly mentioned, I left only high level comments below
assuming this is an RFC patch.

> arch/x86/include/asm/pgtable.h | 17 +++++
> include/linux/damon.h | 13 ++++
> include/linux/pgtable.h | 31 ++++++++
> mm/damon/core.c | 28 ++++++++
> mm/damon/vaddr.c | 146 +++++++++++++++++++++++++++++++++++++---

Seems this improvement is only for vaddr. Could this great improvement also
be applied to paddr.c in the future?

> 5 files changed, 223 insertions(+), 12 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index e02b179ec..accdabb95 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -169,6 +169,11 @@ static inline int pud_young(pud_t pud)
> return pud_flags(pud) & _PAGE_ACCESSED;
> }
>
> +static inline int pgd_young(pgd_t pgd)
> +{
> + return pgd_flags(pgd) & _PAGE_ACCESSED;
> +}
> +
> static inline int pte_write(pte_t pte)
> {
> /*
> @@ -681,6 +686,18 @@ static inline pud_t pud_mkwrite(pud_t pud)
> return pud_clear_saveddirty(pud);
> }
>
> +static inline pgd_t pgd_clear_flags(pgd_t pgd, pgdval_t clear)
> +{
> + pgdval_t v = native_pgd_val(pgd);
> +
> + return native_make_pgd(v & ~clear);
> +}
> +
> +static inline pgd_t pgd_mkold(pgd_t pgd)
> +{
> + return pgd_clear_flags(pgd, _PAGE_ACCESSED);
> +}
> +
> #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> static inline int pte_soft_dirty(pte_t pte)
> {
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index c70cca8a8..8521a62ec 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -19,6 +19,14 @@
> /* Max priority score for DAMON-based operation schemes */
> #define DAMOS_MAX_SCORE (99)
>
> +/* DAMON profiling levels */
> +enum damon_profile_level {
> + PTE_LEVEL,
> + PMD_LEVEL,
> + PUD_LEVEL,
> + PGD_LEVEL,
> +};

I'd prefer more detailed kerneldoc comments. Also, I think this new enum is
not really needed. Please read my comment for damon_va_mkold() below for my
detailed opinion.

> +
> /* Get a random number in [l, r) */
> static inline unsigned long damon_rand(unsigned long l, unsigned long r)
> {
> @@ -57,6 +65,8 @@ struct damon_region {
> unsigned int age;
> /* private: Internal value for age calculation. */
> unsigned int last_nr_accesses;
> + /* Page table profiling level */
> + enum damon_profile_level profile_level;

Seems this field is set on every damon_va_mkold() call, and used only in that
call and the following damon_va_young() call. That allows the following
damon_va_young() to skip the unnecessary recalculation of this field. Unless
the profile_level calculation is a big deal, I think letting damon_va_young()
do the calculation once more and keeping this struct simple may be better.

> };
>
> /**
> @@ -656,6 +666,9 @@ int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
> int damon_set_region_biggest_system_ram_default(struct damon_target *t,
> unsigned long *start, unsigned long *end);
>
> +enum damon_profile_level pick_profile_level(unsigned long start,
> + unsigned long end, unsigned long addr);
> +

Seems damon_va_mkold() is the only caller of this function. I think defining
this as static in the same source file will make the change simpler.

> #endif /* CONFIG_DAMON */
>
> #endif /* _DAMON_H */
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index af7639c3b..82d5f67ea 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -935,6 +935,37 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> #define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
> #endif
>
> +/*
> + * When walking page tables, get the address of the current/passed boundary,
> + * or the start address of the range if that comes earlier.
> + */
> +
> +#define pgd_addr_start(addr, start) \
> +({ unsigned long __boundary = (addr) & PGDIR_MASK; \
> + (__boundary > start) ? __boundary : (start); \
> +})
> +
> +#ifndef p4d_addr_start
> +#define p4d_addr_start(addr, start) \
> +({ unsigned long __boundary = (addr) & P4D_MASK; \
> + (__boundary > start) ? __boundary : (start); \
> +})
> +#endif
> +
> +#ifndef pud_addr_start
> +#define pud_addr_start(addr, start) \
> +({ unsigned long __boundary = (addr) & PUD_MASK; \
> + (__boundary > start) ? __boundary : (start); \
> +})
> +#endif
> +
> +#ifndef pmd_addr_start
> +#define pmd_addr_start(addr, start) \
> +({ unsigned long __boundary = (addr) & PMD_MASK; \
> + (__boundary > start) ? __boundary : (start); \
> +})
> +#endif

Seems these are only used from mm/damon/vaddr.c. If so, I think moving these
definitions into that file may be better for readability.

> +
> /*
> * When walking page tables, get the address of the next boundary,
> * or the end address of the range if that comes earlier. Although no
> diff --git a/mm/damon/core.c b/mm/damon/core.c
> index fd5be73f6..2a7d5c041 100644
> --- a/mm/damon/core.c
> +++ b/mm/damon/core.c
> @@ -31,6 +31,33 @@ static struct damon_operations damon_registered_ops[NR_DAMON_OPS];
>
> static struct kmem_cache *damon_region_cache __ro_after_init;
>
> +/* Pick the highest possible page table profiling level for the
> + * region corresponding to addr
> + */

Again, I'd prefer having a nice kerneldoc comment.

> +enum damon_profile_level pick_profile_level(unsigned long start,
> + unsigned long end, unsigned long addr)

As noted above, seems this function is used from vaddr.c only. Moving this
definition into the file may make this change simpler in my opinion.

> +{
> + enum damon_profile_level level = PTE_LEVEL;
> +
> + if (pmd_addr_start(addr, (start) - 1) < start
> + || pmd_addr_end(addr, (end) + 1) > end)
> + goto out;

How about simply 'return PTE_LEVEL' here, and similarly for below parts?

> + level = PMD_LEVEL;
> +
> + if (pud_addr_start(addr, (start) - 1) < start
> + || pud_addr_end(addr, (end) + 1) > end)
> + goto out;
> + level = PUD_LEVEL;
> +
> + if (pgd_addr_start(addr, (start) - 1) < start
> + || pgd_addr_end(addr, (end) + 1) > end)
> + goto out;
> + level = PGD_LEVEL;
> +
> +out:
> + return level;
> +}
> +
> /* Should be called under damon_ops_lock with id smaller than NR_DAMON_OPS */
> static bool __damon_is_registered_ops(enum damon_ops_id id)
> {
> @@ -132,6 +159,7 @@ struct damon_region *damon_new_region(unsigned long start, unsigned long end)
>
> region->age = 0;
> region->last_nr_accesses = 0;
> + region->profile_level = PTE_LEVEL;
>
> return region;
> }
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index cf8a9fc5c..b71221b3e 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -387,16 +387,76 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
> #define damon_mkold_hugetlb_entry NULL
> #endif /* CONFIG_HUGETLB_PAGE */
>
> -static const struct mm_walk_ops damon_mkold_ops = {
> - .pmd_entry = damon_mkold_pmd_entry,
> +static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + spinlock_t *ptl;
> +
> + if (!pmd_present(*pmd) || pmd_none(*pmd))
> + goto out;
> +
> + ptl = pmd_lock(walk->mm, pmd);
> + if (pmd_young(*pmd))
> + *pmd = pmd_mkold(*pmd);

Maybe we need to do the test and write at once? Same for similar parts below.

> +
> + spin_unlock(ptl);
> +
> +out:
> + return 0;
> +}
> +
> +static int damon_mkold_pud(pud_t *pud, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + spinlock_t *ptl;
> +
> + if (!pud_present(*pud) || pud_none(*pud))
> + goto out;
> +
> + ptl = pud_lock(walk->mm, pud);
> + if (pud_young(*pud))
> + *pud = pud_mkold(*pud);
> +
> + spin_unlock(ptl);
> +
> +out:
> + return 0;
> +}
> +
> +static int damon_mkold_pgd(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> +
> + if (!pgd_present(*pgd) || pgd_none(*pgd))
> + goto out;
> +
> + spin_lock(&pgd_lock);
> + if (pgd_young(*pgd))
> + *pgd = pgd_mkold(*pgd);
> +
> + spin_unlock(&pgd_lock);
> +
> +out:
> + return 0;
> +}
> +
> +static const struct mm_walk_ops damon_mkold_ops[] = {
> + {.pmd_entry = damon_mkold_pmd_entry,
> .hugetlb_entry = damon_mkold_hugetlb_entry,
> - .walk_lock = PGWALK_RDLOCK,
> + .walk_lock = PGWALK_RDLOCK},
> + {.pmd_entry = damon_mkold_pmd},
> + {.pud_entry = damon_mkold_pud},
> + {.pgd_entry = damon_mkold_pgd},
> };
>
> -static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
> +static void damon_va_mkold(struct mm_struct *mm, struct damon_region *r)
> {
> + unsigned long addr = r->sampling_addr;
> +
> + r->profile_level = pick_profile_level(r->ar.start, r->ar.end, addr);
> +
> mmap_read_lock(mm);
> - walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
> + walk_page_range(mm, addr, addr + 1, damon_mkold_ops + r->profile_level, NULL);

Maybe &damon_mkold_ops[r->profile_level] would be easier to read?

> mmap_read_unlock(mm);
> }
>
> @@ -409,7 +469,7 @@ static void __damon_va_prepare_access_check(struct mm_struct *mm,
> {
> r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
>
> - damon_va_mkold(mm, r->sampling_addr);
> + damon_va_mkold(mm, r);
> }
>
> static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
> @@ -531,22 +591,84 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
> #define damon_young_hugetlb_entry NULL
> #endif /* CONFIG_HUGETLB_PAGE */
>
> -static const struct mm_walk_ops damon_young_ops = {
> - .pmd_entry = damon_young_pmd_entry,
> +static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + spinlock_t *ptl;
> + struct damon_young_walk_private *priv = walk->private;
> +
> + if (!pmd_present(*pmd) || pmd_none(*pmd))
> + goto out;
> +
> + ptl = pmd_lock(walk->mm, pmd);
> + if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
> + priv->young = true;
> +
> + *priv->folio_sz = (1UL << PMD_SHIFT);
> + spin_unlock(ptl);
> +out:
> + return 0;
> +}
> +
> +static int damon_young_pud(pud_t *pud, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + spinlock_t *ptl;
> + struct damon_young_walk_private *priv = walk->private;
> +
> + if (!pud_present(*pud) || pud_none(*pud))
> + goto out;
> +
> + ptl = pud_lock(walk->mm, pud);
> + if (pud_young(*pud) || mmu_notifier_test_young(walk->mm, addr))
> + priv->young = true;
> +
> + *priv->folio_sz = (1UL << PUD_SHIFT);
> +
> + spin_unlock(ptl);
> +out:
> + return 0;
> +}
> +
> +static int damon_young_pgd(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + struct damon_young_walk_private *priv = walk->private;
> +
> + if (!pgd_present(*pgd) || pgd_none(*pgd))
> + goto out;
> +
> + spin_lock(&pgd_lock);
> + if (pgd_young(*pgd) || mmu_notifier_test_young(walk->mm, addr))
> + priv->young = true;
> +
> + *priv->folio_sz = (1UL << PGDIR_SHIFT);
> +
> + spin_unlock(&pgd_lock);
> +out:
> + return 0;
> +}
> +
> +static const struct mm_walk_ops damon_young_ops[] = {
> + {.pmd_entry = damon_young_pmd_entry,
> .hugetlb_entry = damon_young_hugetlb_entry,
> - .walk_lock = PGWALK_RDLOCK,
> + .walk_lock = PGWALK_RDLOCK},
> + {.pmd_entry = damon_young_pmd},
> + {.pud_entry = damon_young_pud},
> + {.pgd_entry = damon_young_pgd},
> };
>
> -static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
> +static bool damon_va_young(struct mm_struct *mm, struct damon_region *r,
> unsigned long *folio_sz)
> {
> + unsigned long addr = r->sampling_addr;
> struct damon_young_walk_private arg = {
> .folio_sz = folio_sz,
> .young = false,
> };
>
> mmap_read_lock(mm);
> - walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
> + walk_page_range(mm, addr, addr + 1, damon_young_ops + r->profile_level, &arg);

Again, '&damon_young_ops[r->profile_level]' may be easier for me to read.

Also, I still prefer the 80-column limit. Could we wrap this line?

Seems pick_profile_level() is only for calculating ->profile_level which will
be stored into the region for each damon_va_mkold(). And ->profile_level is
used to find proper mm_walk_ops from damon_mkold_ops[] and damon_young_ops[].
I understand the index is saved in ->profile_level to be reused in the followup
damon_va_young() call, to avoid the unnecessary recalculation overhead.

If the recalculation overhead is not a big deal, I think having functions for
getting the proper pointers to mm_walk_ops could make the change simpler?
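
For example, something of roughly this shape (just a sketch; the name is
only for illustration):

        static const struct mm_walk_ops *damon_va_young_ops(struct damon_region *r)
        {
                enum damon_profile_level level = pick_profile_level(r->ar.start,
                                r->ar.end, r->sampling_addr);

                return &damon_young_ops[level];
        }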

> mmap_read_unlock(mm);
> return arg.young;
> }
> @@ -572,7 +694,7 @@ static void __damon_va_check_access(struct mm_struct *mm,
> return;
> }
>
> - last_accessed = damon_va_young(mm, r->sampling_addr, &last_folio_sz);
> + last_accessed = damon_va_young(mm, r, &last_folio_sz);
> if (last_accessed)
> r->nr_accesses++;


Thanks,
SJ

2023-12-16 05:42:32

by Yu Zhao

Subject: Re: mm/DAMON: Profiling enhancements for DAMON

On Fri, Dec 15, 2023 at 3:08 AM Prasad, Aravinda
<[email protected]> wrote:
>
> > On Fri, Dec 15, 2023 at 12:42 AM Aravinda Prasad
> > <[email protected]> wrote:
> > ...
> >
> > > This patch proposes profiling different levels of the application’s
> > > page table tree to detect whether a region is accessed or not. This
> > > patch is based on the observation that, when the accessed bit for a
> > > page is set, the accessed bits at the higher levels of the page table
> > > tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> > > are also set. Hence, it is efficient to check the accessed bits at
> > > the higher levels of the page table tree to detect whether a region is
> > > accessed or not.
> >
> > This patch can crash on Xen. See commit 4aaf269c768d("mm: introduce
> > arch_has_hw_nonleaf_pmd_young()")
>
> Will fix as suggested in the commit.
>
> >
> > MGLRU already does this in the correct way. See mm/vmscan.c.
>
> I don't see access bits at PUD or PGD checked for 4K page size. Can you
> point me to the code where access bits are checked at PUD and PGD level?

There isn't any, because *the system* bottlenecks at the PTE level and
at moving memory between tiers. Optimizing at the PUD/PGD levels has
insignificant ROI for the system.

And food for thought:
1. Can a PUD/PGD cover memory from different tiers?
2. Can the A-bit in non-leaf entries work for EPT?

> > This patch also can cause USER DATA CORRUPTION. See commit
> > c11d34fa139e ("mm/damon/ops-common: atomically test and clear young
> > on ptes and pmds").
>
> Ok. Will atomically test and clear the access bits.
>
> >
> > The quality of your patch makes me very much doubt the quality of your
> > paper, especially your results on Google's kstaled and MGLRU in table 6.2.
>
> The results are very much reproducible. We have not used kstaled/MGLRU for
> the data in Figure 3, but we linearly scan pages similar to kstaled by implementing
> a kernel thread for scanning.

You have not used MGLRU, and yet your results are very much reproducible.

> Our argument for kstaled/MGLRU is that, scanning individual pages at 4K
> granularity may not be efficient for large footprint applications.

Your argument for MGLRU is based on a wrong assumption, as I have
already pointed out.

> Instead,
> access bits at the higher level of the page table tree can be used. In the
> paper we have demonstrated this with DAMON but the concept can be
> applied to kstaled/MGLRU as well.

You got it backward: MGLRU introduced the concept; you fabricated a
comparison table.

2023-12-18 11:32:56

by Prasad, Aravinda

Subject: RE: mm/DAMON: Profiling enhancements for DAMON



> -----Original Message-----
> From: Yu Zhao <[email protected]>
> Sent: Saturday, December 16, 2023 11:12 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kumar, Sandeep4
> <[email protected]>; Huang, Ying <[email protected]>; Hansen,
> Dave <[email protected]>; Williams, Dan J <[email protected]>;
> Subramoney, Sreenivas <[email protected]>; Kervinen, Antti
> <[email protected]>; Kanevskiy, Alexander
> <[email protected]>; Alan Nair <[email protected]>; Juergen
> Gross <[email protected]>; Ryan Roberts <[email protected]>
> Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> On Fri, Dec 15, 2023 at 3:08 AM Prasad, Aravinda <[email protected]>
> wrote:
> >
> > > On Fri, Dec 15, 2023 at 12:42 AM Aravinda Prasad
> > > <[email protected]> wrote:
> > > ...
> > >
> > > > This patch proposes profiling different levels of the
> > > > application’s page table tree to detect whether a region is
> > > > accessed or not. This patch is based on the observation that, when
> > > > the accessed bit for a page is set, the accessed bits at the
> > > > higher levels of the page table tree (PMD/PUD/PGD) corresponding
> > > > to the path of the page table walk are also set. Hence, it is
> > > > efficient to check the accessed bits at the higher levels of the
> > > > page table tree to detect whether a region is accessed or not.
> > >
> > > This patch can crash on Xen. See commit 4aaf269c768d("mm: introduce
> > > arch_has_hw_nonleaf_pmd_young()")
> >
> > Will fix as suggested in the commit.
> >
> > >
> > > MGLRU already does this in the correct way. See mm/vmscan.c.

noted

> >
> > I don't see access bits at PUD or PGD checked for 4K page size. Can
> > you point me to the code where access bits are checked at PUD and PGD level?
>
> There isn't any, because *the system* bottlenecks at the PTE level and at moving
> memory between tiers. Optimizing at the PUD/PGD levels has insignificant ROI
> for the system.

Optimization at the PUD/PGD levels can be used for large-footprint
applications, especially for DAMON, to find whether any pages in a region are
accessed or not.

>
> And food for thought:
> 1. Can a PUD/PGD cover memory from different tiers?

Yes, it can.

> 2. Can the A-bit in non-leaf entries work for EPT?

Need to check.

>
> > > This patch also can cause USER DATA CORRUPTION. See commit
> > > c11d34fa139e ("mm/damon/ops-common: atomically test and clear young
> > > on ptes and pmds").
> >
> > Ok. Will atomically test and clear the access bits.
> >
> > >
> > > The quality of your patch makes me very much doubt the quality of
> > > your paper, especially your results on Google's kstaled and MGLRU in table
> 6.2.
> >
> > The results are very much reproducible. We have not used kstaled/MGLRU
> > for the data in Figure 3, but we linearly scan pages similar to
> > kstaled by implementing a kernel thread for scanning.
>
> You have not used MGLRU, and yet your results are very much reproducible.

As we have mentioned in the paper, the results are for checking/scanning
accessed bits for pages at the leaf level (PTE for 4K and PMD for 2M pages).
In general this is applicable to any technique using leaf-level scanning,
where the scanning time drastically increases for large-footprint
applications.

MGLRU also scans leaf-level accessed bits and hence falls into this category.

Similar observations on scanning were also made by HeMem [2] in Figure 3.

[2] "HeMem: Scalable Tiered Memory Management for Big Data Applications
and Real NVM", https://dl.acm.org/doi/pdf/10.1145/3477132.3483550

>
> > Our argument for kstaled/MGLRU is that, scanning individual pages at
> > 4K granularity may not be efficient for large footprint applications.
>
> Your argument for MGLRU is based on a wrong assumption, as I have already
> pointed out.

Our argument in the paper is for any technique that scans leaf-level
accessed bits, be it kstaled or MGLRU.

>
> > Instead,
> > access bits at the higher level of the page table tree can be used. In
> > the paper we have demonstrated this with DAMON but the concept can be
> > applied to kstaled/MGLRU as well.
>
> You got it backward: MGLRU introduced the concept; you fabricated a comparison
> table.

Not convinced. I see the documentation mentioning "clearing the accessed
bit in non-leaf page table entries" with 0x0004 in /sys/kernel/mm/lru_gen/enabled.

But the code is restricted to the PMD level only.

static bool should_clear_pmd_young(void)
{
        return arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG);
}

Regards,
Aravinda



2023-12-18 11:55:04

by Prasad, Aravinda

Subject: RE: mm/DAMON: Profiling enhancements for DAMON



> -----Original Message-----
> From: Hansen, Dave <[email protected]>
> Sent: Friday, December 15, 2023 11:40 PM
> To: Prasad, Aravinda <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected]
> Cc: [email protected]; Kumar, Sandeep4 <[email protected]>;
> Huang, Ying <[email protected]>; Williams, Dan J
> <[email protected]>; Subramoney, Sreenivas
> <[email protected]>; Kervinen, Antti <[email protected]>;
> Kanevskiy, Alexander <[email protected]>; Alan Nair
> <[email protected]>
> Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> On 12/14/23 23:46, Aravinda Prasad wrote:
> > +static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + spinlock_t *ptl;
> > + struct damon_young_walk_private *priv = walk->private;
> > +
> > + if (!pmd_present(*pmd) || pmd_none(*pmd))
> > + goto out;
> > +
> > + ptl = pmd_lock(walk->mm, pmd);
> > + if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
> > + priv->young = true;
> > +
> > + *priv->folio_sz = (1UL << PMD_SHIFT);
> > + spin_unlock(ptl);
> > +out:
> > + return 0;
> > +}
>
> There are a number of paired p*_present() and p_*none() checks in this patch.
> What is their function (especially pmd_none())? For instance,
> damon_young_pmd() gets called via the pagewalk infrastructure:

Hmm, I think they can be removed. It looks like the pagewalk infrastructure
should take care of this. But I will double-check.

>
> > static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> > struct mm_walk *walk) {
> ...
> > next = pmd_addr_end(addr, end);
> > if (pmd_none(*pmd)) {
> ...
> > continue;
> > }
> ...
> > if (ops->pmd_entry)
> > err = ops->pmd_entry(pmd, addr, next, walk);
>
> I'd suggest taking a closer look at the code you are replacing and other mm_walk
> users to make sure the handlers are doing appropriate checks.

Sure.

2023-12-18 13:13:00

by Prasad, Aravinda

Subject: RE: mm/DAMON: Profiling enhancements for DAMON



> -----Original Message-----
> From: SeongJae Park <[email protected]>
> Sent: Saturday, December 16, 2023 1:42 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kumar, Sandeep4
> <[email protected]>; Huang, Ying <[email protected]>; Hansen,
> Dave <[email protected]>; Williams, Dan J <[email protected]>;
> Subramoney, Sreenivas <[email protected]>; Kervinen, Antti
> <[email protected]>; Kanevskiy, Alexander
> <[email protected]>; Alan Nair <[email protected]>
> Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> Hello Aravinda,
>
> > Subject: Re: mm/DAMON: Profiling enhancements for DAMON
>
> Nit. I'd suggest to use 'mm/damon: ' prefix for consistency. Also adding [PATCH]
> prefix before that would help people easily classifying this nice mail.
>

ok

> And if you are posting this patch aiming not to be merged as is but to share the
> early implementation and get some comments, you could further make your
> intention clear using [RFC PATCH] prefix instead.

noted

>
> On 2023-12-15T13:16:19+05:30 Aravinda Prasad <[email protected]>
> wrote:
>
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
> >
> > This patch proposes profiling different levels of the
> > application's page table tree to detect whether a region is
> > accessed or not. This patch is based on the observation that, when the
> > accessed bit for a page is set, the accessed bits at the higher levels
> > of the page table tree (PMD/PUD/PGD) corresponding to the path of the
> > page table walk are also set. Hence, it is efficient to check the
> > accessed bits at the higher levels of the page table tree to detect
> > whether a region is accessed or not. For example, if the access bit
> > for a PUD entry is set, then one or more pages in the 1GB PUD subtree
> > is accessed as each PUD entry covers 1GB mapping. Hence, instead of
> > sampling thousands of 4K/2M pages to detect accesses in a large
> > region, sampling at the higher level of page table tree is faster and efficient.
>
> The high level problem and solution ideas make sense to me, thank you for
> sharing these nice finding and patch!
>
> >
> > This patch is based on 6.6.3 kernel.
>
> 6.6.3 would be a nice baseline for RFC level patches, since it would let more
> people do test. But for non-RFC, I'd suggest to base patches for DAMON on latest
> mm-unstable[1] if possible.
>
> [1] https://git.kernel.org/akpm/mm/h/mm-unstable

Sure. Will do.

>
> >
> > TODO: Support 5-level page table tree
>
> Looking forward to that! :)
>
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > parameters set to default value.
>
> The default setup is 5ms sampling and 100ms aggregation intervals. Was there a
> reason to use 200ms aggregation interval instead of the default one?

We have experimented with another sampling interval (1ms) and other
aggregation intervals (50, 100, 200ms). We provide data for 5ms/200ms as that
gives 40 samples before a region is split or merged. We think that the 20
samples obtained with 5ms/100ms are too few for baseline DAMON on large
regions (>100GB).

>
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > parameters.
> >
> > Profiling efficiency in detecting hot data [1]:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON >90% <50% ~0% 0%
> > DAMON+PTP >90% >90% >90% >90%
>
> DAMON provides both frequency and a sort of recency information via
> 'nr_accesses' and 'age', respectively. Using those together, we could minimize
> DAMON's accuracy risk. For example, if we define hot pages with only the
> frequency information, say, pages that DAMON reported as having >=40
> 'nr_accesses' (100% access rate), I have no doubt at above result. But if we
> define hot page with both information, say, pages that DAMON didn't reported as
> having 0 'nr_accesses' for more than 10 'age' (not accessed for more than 2
> seconds), I think the number might be higher. So I'm wondering how you defined
> the hot page for this evaluation, and if you also measured with the age
> information usage.
>
> Note that this doesn't mean DAMON+PTP is not needed. I think DAMON+PTP
> would be anyway better even when we use the 'age' together. I'm just curious if
> you have considered such approaches and already have such test results. I'm also
> not asking to do the tests and add those here. But mentioning your thought
> about such approach could also be helpful for readers of this patch, imho.

Will share what we observed for 5TB application.

When DAMON starts with the default 10 regions, each region is 400+GB in size.
What we observed was that, with 20 or 40 samples per window, none of the hot
pages were sampled in almost all windows. So 'nr_accesses' was 0 for all
regions throughout the run. Occasionally a region had nr_accesses of 1, but
those regions were split and immediately merged back in the next window.

For hot page detection, any region with an access count >5 is considered hot
(all pages in the region are considered hot). We experimented by aggregating
the nr_accesses over different time intervals, from 1 second (used to compute
the precision/recall plots) to 100 seconds for real-world benchmarks. We did
not use 'age'.

>
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON 1.15 19.53 3.52 9.55
> > DAMON+PTP 0.83 3.20 1.27 2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > [1] https://arxiv.org/pdf/2311.10275.pdf
> >
> > Regards,
> > Aravinda
> >
> > Signed-off-by: Alan Nair <[email protected]>
> > Signed-off-by: Sandeep Kumar <[email protected]>
> > Signed-off-by: Aravinda Prasad <[email protected]>
> > ---
>
> Though it's not clearly mentioned, I left only high level comments below assuming
> this is an RFC patch.

ok

>
> > arch/x86/include/asm/pgtable.h | 17 +++++
> > include/linux/damon.h | 13 ++++
> > include/linux/pgtable.h | 31 ++++++++
> > mm/damon/core.c | 28 ++++++++
> > mm/damon/vaddr.c | 146
> +++++++++++++++++++++++++++++++++++++---
>
> Seems this improvement is only for vaddr. Would this great improvement be able
> to be applied to paddr.c in future?

Applying to paddr.c requires some h/w support. Not sure if we can directly extend it.

>
> > 5 files changed, 223 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/pgtable.h
> > b/arch/x86/include/asm/pgtable.h index e02b179ec..accdabb95 100644
> > --- a/arch/x86/include/asm/pgtable.h
> > +++ b/arch/x86/include/asm/pgtable.h
> > @@ -169,6 +169,11 @@ static inline int pud_young(pud_t pud)
> > return pud_flags(pud) & _PAGE_ACCESSED; }
> >
> > +static inline int pgd_young(pgd_t pgd) {
> > + return pgd_flags(pgd) & _PAGE_ACCESSED; }
> > +
> > static inline int pte_write(pte_t pte) {
> > /*
> > @@ -681,6 +686,18 @@ static inline pud_t pud_mkwrite(pud_t pud)
> > return pud_clear_saveddirty(pud);
> > }
> >
> > +static inline pgd_t pgd_clear_flags(pgd_t pgd, pgdval_t clear) {
> > + pgdval_t v = native_pgd_val(pgd);
> > +
> > + return native_make_pgd(v & ~clear);
> > +}
> > +
> > +static inline pgd_t pgd_mkold(pgd_t pgd) {
> > + return pgd_clear_flags(pgd, _PAGE_ACCESSED); }
> > +
> > #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> > static inline int pte_soft_dirty(pte_t pte) { diff --git
> > a/include/linux/damon.h b/include/linux/damon.h index
> > c70cca8a8..8521a62ec 100644
> > --- a/include/linux/damon.h
> > +++ b/include/linux/damon.h
> > @@ -19,6 +19,14 @@
> > /* Max priority score for DAMON-based operation schemes */
> > #define DAMOS_MAX_SCORE (99)
> >
> > +/* DAMON profiling levels */
> > +enum damon_profile_level {
> > + PTE_LEVEL,
> > + PMD_LEVEL,
> > + PUD_LEVEL,
> > + PGD_LEVEL,
> > +};
>
> I'd prefer more detailed kerneldoc comments. Also, I think this new enum is not
> really needed. Please read my comment for damon_va_mkold() below for my
> detailed opinion.

ok

>
> > +
> > /* Get a random number in [l, r) */
> > static inline unsigned long damon_rand(unsigned long l, unsigned long
> > r) { @@ -57,6 +65,8 @@ struct damon_region {
> > unsigned int age;
> > /* private: Internal value for age calculation. */
> > unsigned int last_nr_accesses;
> > + /* Page table profiling level */
> > + enum damon_profile_level profile_level;
>
> Seems this field is set for every damon_va_mkold(), and used on only the function
> call, and following damon_va_young() call. That allows the following
> damon_va_young() skips the unnecessary recalculation of this field. Unless the
> profile_level calculation is a big deal, I think letting damon_va_young() do the
> calculation once more and keeping this struct simple may be better.

Yes. We can let damon_va_young do the recalculation of the profile_level.

We did it this way because we experimented with higher page table levels that
go beyond the address space covered by the region to improve convergence (see
the "flex" variant in the paper).

>
> > };
> >
> > /**
> > @@ -656,6 +666,9 @@ int damon_stop(struct damon_ctx **ctxs, int
> > nr_ctxs); int damon_set_region_biggest_system_ram_default(struct
> damon_target *t,
> > unsigned long *start, unsigned long *end);
> >
> > +enum damon_profile_level pick_profile_level(unsigned long start,
> > + unsigned long end, unsigned long addr);
> > +
>
> Seems damon_va_mkold() is the only caller of this function. I think defining this
> as static on the same source file will make change simpler.

Sure

>
> > #endif /* CONFIG_DAMON */
> >
> > #endif /* _DAMON_H */
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index
> > af7639c3b..82d5f67ea 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -935,6 +935,37 @@ static inline void arch_swap_restore(swp_entry_t
> > entry, struct folio *folio) #define flush_tlb_fix_spurious_fault(vma,
> > address, ptep) flush_tlb_page(vma, address) #endif
> >
> > +/*
> > + * When walking page tables, get the address of the current/passed
> > +boundary,
> > + * or the start address of the range if that comes earlier.
> > + */
> > +
> > +#define pgd_addr_start(addr, start) \
> > +({ unsigned long __boundary = (addr) & PGDIR_MASK; \
> > + (__boundary > start) ? __boundary : (start); \
> > +})
> > +
> > +#ifndef p4d_addr_start
> > +#define p4d_addr_start(addr, start) \
> > +({ unsigned long __boundary = (addr) & P4D_MASK; \
> > + (__boundary > start) ? __boundary : (start); \
> > +})
> > +#endif
> > +
> > +#ifndef pud_addr_start
> > +#define pud_addr_start(addr, start) \
> > +({ unsigned long __boundary = (addr) & PUD_MASK; \
> > + (__boundary > start) ? __boundary : (start); \
> > +})
> > +#endif
> > +
> > +#ifndef pmd_addr_start
> > +#define pmd_addr_start(addr, start) \
> > +({ unsigned long __boundary = (addr) & PMD_MASK; \
> > + (__boundary > start) ? __boundary : (start); \
> > +})
> > +#endif
>
> Seems these are only used from mm/damon/vaddr.c. If so, I think moving these
> definitions into the file may better to read.

ok.

>
> > +
> > /*
> > * When walking page tables, get the address of the next boundary,
> > * or the end address of the range if that comes earlier. Although
> > no diff --git a/mm/damon/core.c b/mm/damon/core.c index
> > fd5be73f6..2a7d5c041 100644
> > --- a/mm/damon/core.c
> > +++ b/mm/damon/core.c
> > @@ -31,6 +31,33 @@ static struct damon_operations
> > damon_registered_ops[NR_DAMON_OPS];
> >
> > static struct kmem_cache *damon_region_cache __ro_after_init;
> >
> > +/* Pick the highest possible page table profiling level for the
> > + * region corresponding to addr
> > + */
>
> Again, I'd prefer having a nice kerneldoc comment.

noted

>
> > +enum damon_profile_level pick_profile_level(unsigned long start,
> > + unsigned long end, unsigned long addr)
>
> As noted above, seems this function is used from vaddr.c only. Moving this
> definition into the file may make this change simpler in my opinion.

ok

>
> > +{
> > + enum damon_profile_level level = PTE_LEVEL;
> > +
> > + if (pmd_addr_start(addr, (start) - 1) < start
> > + || pmd_addr_end(addr, (end) + 1) > end)
> > + goto out;
>
> How about simply 'return PTE_LEVEL' here, and similarly for below parts?

Yes. Will do.

The code is structured this way because, while experimenting with DAMON+PTP,
we tried doing a level++ or level += 2 before the "out" label, depending on
the region size.
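
For example, the early-return form, also made static in mm/damon/vaddr.c per
the other comments, could look roughly like the sketch below:

static enum damon_profile_level pick_profile_level(unsigned long start,
		unsigned long end, unsigned long addr)
{
	/*
	 * Return the coarsest level whose entry covering addr lies entirely
	 * within [start, end); fall back to PTE_LEVEL otherwise.
	 */
	if (pmd_addr_start(addr, start - 1) < start ||
			pmd_addr_end(addr, end + 1) > end)
		return PTE_LEVEL;
	if (pud_addr_start(addr, start - 1) < start ||
			pud_addr_end(addr, end + 1) > end)
		return PMD_LEVEL;
	if (pgd_addr_start(addr, start - 1) < start ||
			pgd_addr_end(addr, end + 1) > end)
		return PUD_LEVEL;
	return PGD_LEVEL;
}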

>
> > + level = PMD_LEVEL;
> > +
> > + if (pud_addr_start(addr, (start) - 1) < start
> > + || pud_addr_end(addr, (end) + 1) > end)
> > + goto out;
> > + level = PUD_LEVEL;
> > +
> > + if (pgd_addr_start(addr, (start) - 1) < start
> > + || pgd_addr_end(addr, (end) + 1) > end)
> > + goto out;
> > + level = PGD_LEVEL;
> > +
> > +out:
> > + return level;
> > +}
> > +
> > /* Should be called under damon_ops_lock with id smaller than
> > NR_DAMON_OPS */ static bool __damon_is_registered_ops(enum
> > damon_ops_id id) { @@ -132,6 +159,7 @@ struct damon_region
> > *damon_new_region(unsigned long start, unsigned long end)
> >
> > region->age = 0;
> > region->last_nr_accesses = 0;
> > + region->profile_level = PTE_LEVEL;
> >
> > return region;
> > }
> > diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c index
> > cf8a9fc5c..b71221b3e 100644
> > --- a/mm/damon/vaddr.c
> > +++ b/mm/damon/vaddr.c
> > @@ -387,16 +387,76 @@ static int damon_mkold_hugetlb_entry(pte_t *pte,
> > unsigned long hmask, #define damon_mkold_hugetlb_entry NULL #endif
> > /* CONFIG_HUGETLB_PAGE */
> >
> > -static const struct mm_walk_ops damon_mkold_ops = {
> > - .pmd_entry = damon_mkold_pmd_entry,
> > +static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + spinlock_t *ptl;
> > +
> > + if (!pmd_present(*pmd) || pmd_none(*pmd))
> > + goto out;
> > +
> > + ptl = pmd_lock(walk->mm, pmd);
> > + if (pmd_young(*pmd))
> > + *pmd = pmd_mkold(*pmd);
>
> Maybe we need to do the test and write at once? Same for similar parts below.

Yes, as mentioned by Yu Zhao.
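
For the PMD case, a minimal sketch of that change could look as below,
assuming walk->vma is valid at this point and pmdp_test_and_clear_young()
is available in the configurations we target; the PUD and PGD cases would
need equivalent helpers:

static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
		unsigned long next, struct mm_walk *walk)
{
	spinlock_t *ptl;

	if (!pmd_present(*pmd))
		return 0;

	ptl = pmd_lock(walk->mm, pmd);
	/* Test and clear the accessed bit in a single step. */
	pmdp_test_and_clear_young(walk->vma, addr, pmd);
	spin_unlock(ptl);

	return 0;
}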

>
> > +
> > + spin_unlock(ptl);
> > +
> > +out:
> > + return 0;
> > +}
> > +
> > +static int damon_mkold_pud(pud_t *pud, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + spinlock_t *ptl;
> > +
> > + if (!pud_present(*pud) || pud_none(*pud))
> > + goto out;
> > +
> > + ptl = pud_lock(walk->mm, pud);
> > + if (pud_young(*pud))
> > + *pud = pud_mkold(*pud);
> > +
> > + spin_unlock(ptl);
> > +
> > +out:
> > + return 0;
> > +}
> > +
> > +static int damon_mkold_pgd(pgd_t *pgd, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > +
> > + if (!pgd_present(*pgd) || pgd_none(*pgd))
> > + goto out;
> > +
> > + spin_lock(&pgd_lock);
> > + if (pgd_young(*pgd))
> > + *pgd = pgd_mkold(*pgd);
> > +
> > + spin_unlock(&pgd_lock);
> > +
> > +out:
> > + return 0;
> > +}
> > +
> > +static const struct mm_walk_ops damon_mkold_ops[] = {
> > + {.pmd_entry = damon_mkold_pmd_entry,
> > .hugetlb_entry = damon_mkold_hugetlb_entry,
> > - .walk_lock = PGWALK_RDLOCK,
> > + .walk_lock = PGWALK_RDLOCK},
> > + {.pmd_entry = damon_mkold_pmd},
> > + {.pud_entry = damon_mkold_pud},
> > + {.pgd_entry = damon_mkold_pgd},
> > };
> >
> > -static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
> > +static void damon_va_mkold(struct mm_struct *mm, struct damon_region
> > +*r)
> > {
> > + unsigned long addr = r->sampling_addr;
> > +
> > + r->profile_level = pick_profile_level(r->ar.start, r->ar.end, addr);
> > +
> > mmap_read_lock(mm);
> > - walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
> > + walk_page_range(mm, addr, addr + 1, damon_mkold_ops +
> > +r->profile_level, NULL);
>
> Maybe &damon_mkold_ops[r->profile_level] would easier to read?

Yes.

>
> > mmap_read_unlock(mm);
> > }
> >
> > @@ -409,7 +469,7 @@ static void __damon_va_prepare_access_check(struct
> > mm_struct *mm, {
> > r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
> >
> > - damon_va_mkold(mm, r->sampling_addr);
> > + damon_va_mkold(mm, r);
> > }
> >
> > static void damon_va_prepare_access_checks(struct damon_ctx *ctx) @@
> > -531,22 +591,84 @@ static int damon_young_hugetlb_entry(pte_t *pte,
> > unsigned long hmask, #define damon_young_hugetlb_entry NULL #endif
> > /* CONFIG_HUGETLB_PAGE */
> >
> > -static const struct mm_walk_ops damon_young_ops = {
> > - .pmd_entry = damon_young_pmd_entry,
> > +static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + spinlock_t *ptl;
> > + struct damon_young_walk_private *priv = walk->private;
> > +
> > + if (!pmd_present(*pmd) || pmd_none(*pmd))
> > + goto out;
> > +
> > + ptl = pmd_lock(walk->mm, pmd);
> > + if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
> > + priv->young = true;
> > +
> > + *priv->folio_sz = (1UL << PMD_SHIFT);
> > + spin_unlock(ptl);
> > +out:
> > + return 0;
> > +}
> > +
> > +static int damon_young_pud(pud_t *pud, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + spinlock_t *ptl;
> > + struct damon_young_walk_private *priv = walk->private;
> > +
> > + if (!pud_present(*pud) || pud_none(*pud))
> > + goto out;
> > +
> > + ptl = pud_lock(walk->mm, pud);
> > + if (pud_young(*pud) || mmu_notifier_test_young(walk->mm, addr))
> > + priv->young = true;
> > +
> > + *priv->folio_sz = (1UL << PUD_SHIFT);
> > +
> > + spin_unlock(ptl);
> > +out:
> > + return 0;
> > +}
> > +
> > +static int damon_young_pgd(pgd_t *pgd, unsigned long addr,
> > + unsigned long next, struct mm_walk *walk) {
> > + struct damon_young_walk_private *priv = walk->private;
> > +
> > + if (!pgd_present(*pgd) || pgd_none(*pgd))
> > + goto out;
> > +
> > + spin_lock(&pgd_lock);
> > + if (pgd_young(*pgd) || mmu_notifier_test_young(walk->mm, addr))
> > + priv->young = true;
> > +
> > + *priv->folio_sz = (1UL << PGDIR_SHIFT);
> > +
> > + spin_unlock(&pgd_lock);
> > +out:
> > + return 0;
> > +}
> > +
> > +static const struct mm_walk_ops damon_young_ops[] = {
> > + {.pmd_entry = damon_young_pmd_entry,
> > .hugetlb_entry = damon_young_hugetlb_entry,
> > - .walk_lock = PGWALK_RDLOCK,
> > + .walk_lock = PGWALK_RDLOCK},
> > + {.pmd_entry = damon_young_pmd},
> > + {.pud_entry = damon_young_pud},
> > + {.pgd_entry = damon_young_pgd},
> > };
> >
> > -static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
> > +static bool damon_va_young(struct mm_struct *mm, struct damon_region
> > +*r,
> > unsigned long *folio_sz)
> > {
> > + unsigned long addr = r->sampling_addr;
> > struct damon_young_walk_private arg = {
> > .folio_sz = folio_sz,
> > .young = false,
> > };
> >
> > mmap_read_lock(mm);
> > - walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
> > + walk_page_range(mm, addr, addr + 1, damon_young_ops +
> > +r->profile_level, &arg);
>
> Again, '&damon_young_ops[r->profile_level]' may be easier for me to read.
>
> Also, I still prefer 80 columns limit. Could we wrap this line?

Sure

>
> Seems pick_profile_level() is only for calculating ->profile_level which will be
> stored into the region for each damon_va_mkold(). And ->profile_level is used to
> find proper mm_walk_ops from damon_mkold_ops[] and damon_young_ops[].
> I understand the index is saved in ->profile_level to be reused in the followup
> damon_va_young() call, to avoid the unnecessary recalculation overhead.
>
> If the recalculation overhead is not a big deal, I think having functions for getting
> the proper pointers to mm_walk_ops could make the change simpler?

Yes, we can do the recalculation in damon_va_young(). Will incorporate.
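
A small helper along these lines (a sketch; the name is only a placeholder)
could then pick the proper ops for both the mkold and young paths, which also
keeps the walk_page_range() calls within 80 columns:

static const struct mm_walk_ops *damon_va_walk_ops(
		const struct mm_walk_ops ops[], struct damon_region *r)
{
	enum damon_profile_level level;

	/* Recompute the level from the region; nothing is cached. */
	level = pick_profile_level(r->ar.start, r->ar.end, r->sampling_addr);
	return &ops[level];
}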

>
> > mmap_read_unlock(mm);
> > return arg.young;
> > }
> > @@ -572,7 +694,7 @@ static void __damon_va_check_access(struct
> mm_struct *mm,
> > return;
> > }
> >
> > - last_accessed = damon_va_young(mm, r->sampling_addr,
> &last_folio_sz);
> > + last_accessed = damon_va_young(mm, r, &last_folio_sz);
> > if (last_accessed)
> > r->nr_accesses++;
>
>
> Thanks,
> SJ

2024-03-04 06:26:20

by Prasad, Aravinda

[permalink] [raw]
Subject: RE: mm/DAMON: Profiling enhancements for DAMON

Resending as a plain-text message.

The profiling enhancements can be applied to Arm64 provided Arm updates the access bits at the higher levels of the page table tree.


From: Prasad, Aravinda
Sent: Monday, March 4, 2024 11:54 AM
To: 罗午阳 <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>
Cc: '[email protected]' <[email protected]>; Kumar, Sandeep4 <[email protected]>; Huang, Ying <[email protected]>; Hansen, Dave <[email protected]>; Williams, Dan J <[email protected]>; Subramoney, Sreenivas <[email protected]>; Kervinen, Antti <[email protected]>; Kanevskiy, Alexander <[email protected]>
Subject: RE: mm/DAMON: Profiling enhancements for DAMON

The profiling enhancements can be applied to Arm64 provided Arm updates the access bits at the higher levels of the page table tree.

From: 罗午阳 <[email protected]>
Sent: Monday, March 4, 2024 7:25 AM
To: Prasad, Aravinda <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]' <[email protected]>
Cc: '[email protected]' <[email protected]>; Kumar, Sandeep4 <[email protected]>; Huang, Ying <[email protected]>; Hansen, Dave <[email protected]>; Williams, Dan J <[email protected]>; Subramoney, Sreenivas <[email protected]>; Kervinen, Antti <[email protected]>; Kanevskiy, Alexander <[email protected]>; '[email protected]' <[email protected]>
Subject: mm/DAMON: Profiling enhancements for DAMON

Hi Aravinda,

Recently, we have been researching this patch. What do you think of applying it to Arm64? Do you have any suggestions?

Thanks!