2024-03-18 13:24:26

by Prasad, Aravinda

[permalink] [raw]
Subject: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

DAMON randomly samples one or more pages in every region and tracks
accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
When the region size is large (e.g., several GBs), which is common
for large footprint applications, detecting whether the region is
accessed or not completely depends on whether the pages that are
actively accessed in the region are picked during random sampling.
If such pages are not picked for sampling, DAMON fails to identify
the region as accessed. However, increasing the sampling rate or
increasing the number of regions increases CPU overheads of kdamond.

This patch proposes profiling different levels of the application’s
page table tree to detect whether a region is accessed or not. This
patch set is based on the observation that, when the accessed bit for a
page is set, the accessed bits at the higher levels of the page table
tree (PMD/PUD/PGD) corresponding to the path of the page table walk
are also set. Hence, it is efficient to check the accessed bits at
the higher levels of the page table tree to detect whether a region
is accessed or not. For example, if the access bit for a PUD entry
is set, then one or more pages in the 1GB PUD subtree is accessed as
each PUD entry covers 1GB mapping. Hence, instead of sampling
thousands of 4K/2M pages to detect accesses in a large region,
sampling at the higher level of page table tree is faster and efficient.

This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
tree)

Changes since v1 [1]
====================

- Added support for 5-level page table tree
- Split the patch to mm infrastructure changes and DAMON enhancements
- Code changes as per comments on v1
- Added kerneldoc comments

[1] https://lkml.org/lkml/2023/12/15/272

Evaluation:

- MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
and 5TB with 10GB hot data.
- DAMON: 5ms sampling, 200ms aggregation interval. Rest all
parameters set to default value.
- DAMON+PTP: Page table profiling applied to DAMON with the above
parameters.

Profiling efficiency in detecting hot data:

Footprint 1GB 10GB 100GB 5TB
---------------------------------------------
DAMON >90% <50% ~0% 0%
DAMON+PTP >90% >90% >90% >90%

CPU overheads (in billion cycles) for kdamond:

Footprint 1GB 10GB 100GB 5TB
---------------------------------------------
DAMON 1.15 19.53 3.52 9.55
DAMON+PTP 0.83 3.20 1.27 2.55

A detailed explanation and evaluation can be found in the arXiv paper:
https://arxiv.org/pdf/2311.10275.pdf


Aravinda Prasad (3):
mm/damon: mm infrastructure support
mm/damon: profiling enhancement
mm/damon: documentation updates

Documentation/mm/damon/design.rst | 42 ++++++
arch/x86/include/asm/pgtable.h | 20 +++
arch/x86/mm/pgtable.c | 28 +++-
include/linux/mmu_notifier.h | 36 +++++
include/linux/pgtable.h | 79 ++++++++++
mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
6 files changed, 424 insertions(+), 14 deletions(-)

--
2.21.3



2024-03-18 13:24:59

by Prasad, Aravinda

[permalink] [raw]
Subject: [PATCH v2 1/3] mm/damon: mm infrastructure support

This patch adds mm infrastructure support to set and test
access bits at different levels of the page table tree.

The patch also adds support to check if a give address
is in the PMD/PUD/PGD address range.

Signed-off-by: Alan Nair <[email protected]>
Signed-off-by: Aravinda Prasad <[email protected]>
---
arch/x86/include/asm/pgtable.h | 20 +++++++++
arch/x86/mm/pgtable.c | 28 +++++++++++-
include/linux/mmu_notifier.h | 36 ++++++++++++++++
include/linux/pgtable.h | 79 ++++++++++++++++++++++++++++++++++
4 files changed, 161 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7621a5acb13e..b8d505194282 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -164,11 +164,24 @@ static inline bool pud_dirty(pud_t pud)
return pud_flags(pud) & _PAGE_DIRTY_BITS;
}

+#define pud_young pud_young
static inline int pud_young(pud_t pud)
{
return pud_flags(pud) & _PAGE_ACCESSED;
}

+#define p4d_young p4d_young
+static inline int p4d_young(p4d_t p4d)
+{
+ return p4d_flags(p4d) & _PAGE_ACCESSED;
+}
+
+#define pgd_young pgd_young
+static inline int pgd_young(pgd_t pgd)
+{
+ return pgd_flags(pgd) & _PAGE_ACCESSED;
+}
+
static inline int pte_write(pte_t pte)
{
/*
@@ -1329,10 +1342,17 @@ extern int pudp_set_access_flags(struct vm_area_struct *vma,
pud_t entry, int dirty);

#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
+#define pudp_test_and_clear_young pudp_test_and_clear_young
+#define p4dp_test_and_clear_young p4dp_test_and_clear_young
+#define pgdp_test_and_clear_young pgdp_test_and_clear_young
extern int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp);
extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp);
+extern int p4dp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, p4d_t *p4dp);
+extern int pgdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pgd_t *pgdp);

#define __HAVE_ARCH_PMDP_CLEAR_YOUNG_FLUSH
extern int pmdp_clear_flush_young(struct vm_area_struct *vma,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index ff690ddc2334..9f8e08326b43 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -578,9 +578,7 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,

return ret;
}
-#endif

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp)
{
@@ -594,6 +592,32 @@ int pudp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif

+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+int p4dp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, p4d_t *p4dp)
+{
+ int ret = 0;
+
+ if (p4d_young(*p4dp))
+ ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+ (unsigned long *)p4dp);
+
+ return ret;
+}
+
+int pgdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long addr, pgd_t *pgdp)
+{
+ int ret = 0;
+
+ if (pgd_young(*pgdp))
+ ret = test_and_clear_bit(_PAGE_BIT_ACCESSED,
+ (unsigned long *)pgdp);
+
+ return ret;
+}
+#endif
+
int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f349e08a9dfe..ec7fc170882e 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -581,6 +581,39 @@ static inline void mmu_notifier_range_init_owner(
__young; \
})

+#define pudp_clear_young_notify(__vma, __address, __pudp) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = pudp_test_and_clear_young(___vma, ___address, __pudp);\
+ __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
+ ___address + PUD_SIZE); \
+ __young; \
+})
+
+#define p4dp_clear_young_notify(__vma, __address, __p4dp) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = p4dp_test_and_clear_young(___vma, ___address, __p4dp);\
+ __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
+ ___address + P4D_SIZE); \
+ __young; \
+})
+
+#define pgdp_clear_young_notify(__vma, __address, __pgdp) \
+({ \
+ int __young; \
+ struct vm_area_struct *___vma = __vma; \
+ unsigned long ___address = __address; \
+ __young = pgdp_test_and_clear_young(___vma, ___address, __pgdp);\
+ __young |= mmu_notifier_clear_young(___vma->vm_mm, ___address, \
+ ___address + PGDIR_SIZE); \
+ __young; \
+})
+
/*
* set_pte_at_notify() sets the pte _after_ running the notifier.
* This is safe to start by updating the secondary MMUs, because the primary MMU
@@ -690,6 +723,9 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
#define ptep_clear_young_notify ptep_test_and_clear_young
#define pmdp_clear_young_notify pmdp_test_and_clear_young
+#define pudp_clear_young_notify pudp_test_and_clear_young
+#define p4dp_clear_young_notify p4dp_test_and_clear_young
+#define pgdp_clear_young_notify pgdp_test_and_clear_young
#define ptep_clear_flush_notify ptep_clear_flush
#define pmdp_huge_clear_flush_notify pmdp_huge_clear_flush
#define pudp_huge_clear_flush_notify pudp_huge_clear_flush
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..09c3e8bb11bf 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -184,6 +184,27 @@ static inline int pmd_young(pmd_t pmd)
}
#endif

+#ifndef pud_young
+static inline int pud_young(pud_t pud)
+{
+ return 0;
+}
+#endif
+
+#ifndef p4d_young
+static inline int p4d_young(p4d_t p4d)
+{
+ return 0;
+}
+#endif
+
+#ifndef pgd_young
+static inline int pgd_young(pgd_t pgd)
+{
+ return 0;
+}
+#endif
+
#ifndef pmd_dirty
static inline int pmd_dirty(pmd_t pmd)
{
@@ -386,6 +407,33 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif

+#ifndef pudp_test_and_clear_young
+static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pud_t *pudp)
+{
+ return 0;
+}
+#endif
+
+#ifndef p4dp_test_and_clear_young
+static inline int p4dp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ p4d_t *p4dp)
+{
+ return 0;
+}
+#endif
+
+#ifndef pgdp_test_and_clear_young
+static inline int pgdp_test_and_clear_young(struct vm_area_struct *vma,
+ unsigned long address,
+ pgd_t *pgdp)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep);
@@ -1090,6 +1138,37 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#define flush_tlb_fix_spurious_fault(vma, address, ptep) flush_tlb_page(vma, address)
#endif

+/*
+ * When walking page tables, get the address of the current boundary,
+ * or the start address of the range if that comes earlier.
+ */
+
+#define pgd_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PGDIR_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+
+#ifndef p4d_addr_start
+#define p4d_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & P4D_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
+#ifndef pud_addr_start
+#define pud_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PUD_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
+#ifndef pmd_addr_start
+#define pmd_addr_start(addr, start) \
+({ unsigned long __boundary = (addr) & PMD_MASK; \
+ (__boundary > start) ? __boundary : (start); \
+})
+#endif
+
/*
* When walking page tables, get the address of the next boundary,
* or the end address of the range if that comes earlier. Although no
--
2.21.3


2024-03-18 13:25:03

by Prasad, Aravinda

[permalink] [raw]
Subject: [PATCH v2 2/3] mm/damon: profiling enhancement

This patch adds profiling enhancement for DAMON.
Given the sampling_addr and its region bounds,
this patch picks the highest possible page table
tree level such that the address range covered by
the picked page table level (P*D) is within the
region's bounds. Once a page table level is picked,
access bit setting and checking is done at that level.
As the higher levels of the page table tree covers
a larger address space, any accessed bit set implies
one or more pages in the given region is accessed.
This helps in quickly identifying hot regions when
the region size is large (e.g., several GBs), which
is common for large footprint applications.

Signed-off-by: Alan Nair <[email protected]>
Signed-off-by: Sandeep Kumar <[email protected]>
Signed-off-by: Aravinda Prasad <[email protected]>
---
mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 221 insertions(+), 12 deletions(-)

diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 381559e4a1fa..daa1a2aedab6 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -52,6 +52,53 @@ static struct mm_struct *damon_get_mm(struct damon_target *t)
return mm;
}

+/* Pick the highest possible page table profiling level for addr
+ * in the region defined by start and end
+ */
+static int pick_profile_level(unsigned long start, unsigned long end,
+ unsigned long addr)
+{
+ /* Start with PTE and check if higher levels can be picked */
+ int level = 0;
+
+ if (!arch_has_hw_nonleaf_pmd_young())
+ return level;
+
+ /* Check if PMD or higher can be picked, else use PTE */
+ if (pmd_addr_start(addr, (start) - 1) < start
+ || pmd_addr_end(addr, (end) + 1) > end)
+ return level;
+
+ level++;
+ /* Check if PUD or higher can be picked, else use PMD */
+ if (pud_addr_start(addr, (start) - 1) < start
+ || pud_addr_end(addr, (end) + 1) > end)
+ return level;
+
+ if (pgtable_l5_enabled()) {
+ level++;
+ /* Check if P4D or higher can be picked, else use PUD */
+ if (p4d_addr_start(addr, (start) - 1) < start
+ || p4d_addr_end(addr, (end) + 1) > end)
+ return level;
+ }
+
+ level++;
+ /* Check if PGD can be picked, else return PUD level */
+ if (pgd_addr_start(addr, (start) - 1) < start
+ || pgd_addr_end(addr, (end) + 1) > end)
+ return level;
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+ /* Do not pick PGD level if PTI is enabled */
+ if (static_cpu_has(X86_FEATURE_PTI))
+ return level;
+#endif
+
+ /* Return PGD level */
+ return ++level;
+}
+
/*
* Functions for the initial monitoring target regions construction
*/
@@ -387,16 +434,90 @@ static int damon_mkold_hugetlb_entry(pte_t *pte, unsigned long hmask,
#define damon_mkold_hugetlb_entry NULL
#endif /* CONFIG_HUGETLB_PAGE */

-static const struct mm_walk_ops damon_mkold_ops = {
- .pmd_entry = damon_mkold_pmd_entry,
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+static int damon_mkold_pmd(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+
+ if (!pmd_present(*pmd))
+ return 0;
+
+ ptl = pmd_lock(walk->mm, pmd);
+ pmdp_clear_young_notify(walk->vma, addr, pmd);
+ spin_unlock(ptl);
+
+ return 0;
+}
+
+static int damon_mkold_pud(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+
+ if (!pud_present(*pud))
+ return 0;
+
+ ptl = pud_lock(walk->mm, pud);
+ pudp_clear_young_notify(walk->vma, addr, pud);
+ spin_unlock(ptl);
+
+ return 0;
+}
+
+static int damon_mkold_p4d(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct mm_struct *mm = walk->mm;
+
+ if (!p4d_present(*p4d))
+ return 0;
+
+ spin_lock(&mm->page_table_lock);
+ p4dp_clear_young_notify(walk->vma, addr, p4d);
+ spin_unlock(&mm->page_table_lock);
+
+ return 0;
+}
+
+static int damon_mkold_pgd(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct mm_struct *mm = walk->mm;
+
+ if (!pgd_present(*pgd))
+ return 0;
+
+ spin_lock(&mm->page_table_lock);
+ pgdp_clear_young_notify(walk->vma, addr, pgd);
+ spin_unlock(&mm->page_table_lock);
+
+ return 0;
+}
+#endif
+
+static const struct mm_walk_ops damon_mkold_ops[] = {
+ {.pmd_entry = damon_mkold_pmd_entry,
.hugetlb_entry = damon_mkold_hugetlb_entry,
- .walk_lock = PGWALK_RDLOCK,
+ .walk_lock = PGWALK_RDLOCK},
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+ {.pmd_entry = damon_mkold_pmd},
+ {.pud_entry = damon_mkold_pud},
+ {.p4d_entry = damon_mkold_p4d},
+ {.pgd_entry = damon_mkold_pgd},
+#endif
};

-static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
+static void damon_va_mkold(struct mm_struct *mm, struct damon_region *r)
{
+ unsigned long addr = r->sampling_addr;
+ int profile_level;
+
+ profile_level = pick_profile_level(r->ar.start, r->ar.end, addr);
+
mmap_read_lock(mm);
- walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
+ walk_page_range(mm, addr, addr + 1, &damon_mkold_ops[profile_level], NULL);
mmap_read_unlock(mm);
}

@@ -409,7 +530,7 @@ static void __damon_va_prepare_access_check(struct mm_struct *mm,
{
r->sampling_addr = damon_rand(r->ar.start, r->ar.end);

- damon_va_mkold(mm, r->sampling_addr);
+ damon_va_mkold(mm, r);
}

static void damon_va_prepare_access_checks(struct damon_ctx *ctx)
@@ -531,22 +652,110 @@ static int damon_young_hugetlb_entry(pte_t *pte, unsigned long hmask,
#define damon_young_hugetlb_entry NULL
#endif /* CONFIG_HUGETLB_PAGE */

-static const struct mm_walk_ops damon_young_ops = {
- .pmd_entry = damon_young_pmd_entry,
+
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+static int damon_young_pmd(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pmd_present(*pmd))
+ return 0;
+
+ ptl = pmd_lock(walk->mm, pmd);
+ if (pmd_young(*pmd) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PMD_SHIFT);
+ spin_unlock(ptl);
+
+ return 0;
+}
+
+static int damon_young_pud(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ spinlock_t *ptl;
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pud_present(*pud))
+ return 0;
+
+ ptl = pud_lock(walk->mm, pud);
+ if (pud_young(*pud) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PUD_SHIFT);
+ spin_unlock(ptl);
+
+ return 0;
+}
+
+static int damon_young_p4d(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct mm_struct *mm = walk->mm;
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!p4d_present(*p4d))
+ return 0;
+
+ spin_lock(&mm->page_table_lock);
+ if (p4d_young(*p4d) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << P4D_SHIFT);
+ spin_unlock(&mm->page_table_lock);
+
+ return 0;
+}
+
+static int damon_young_pgd(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ struct damon_young_walk_private *priv = walk->private;
+
+ if (!pgd_present(*pgd))
+ return 0;
+
+ spin_lock(&pgd_lock);
+ if (pgd_young(*pgd) || mmu_notifier_test_young(walk->mm, addr))
+ priv->young = true;
+
+ *priv->folio_sz = (1UL << PGDIR_SHIFT);
+ spin_unlock(&pgd_lock);
+
+ return 0;
+}
+#endif
+
+static const struct mm_walk_ops damon_young_ops[] = {
+ {.pmd_entry = damon_young_pmd_entry,
.hugetlb_entry = damon_young_hugetlb_entry,
- .walk_lock = PGWALK_RDLOCK,
+ .walk_lock = PGWALK_RDLOCK},
+#ifdef CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+ {.pmd_entry = damon_young_pmd},
+ {.pud_entry = damon_young_pud},
+ {.p4d_entry = damon_young_p4d},
+ {.pgd_entry = damon_young_pgd},
+#endif
};

-static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
+static bool damon_va_young(struct mm_struct *mm, struct damon_region *r,
unsigned long *folio_sz)
{
+ unsigned long addr = r->sampling_addr;
+ int profile_level;
struct damon_young_walk_private arg = {
.folio_sz = folio_sz,
.young = false,
};

+ profile_level = pick_profile_level(r->ar.start, r->ar.end, addr);
+
mmap_read_lock(mm);
- walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
+ walk_page_range(mm, addr, addr + 1, &damon_young_ops[profile_level], &arg);
mmap_read_unlock(mm);
return arg.young;
}
@@ -577,7 +786,7 @@ static void __damon_va_check_access(struct mm_struct *mm,
return;
}

- last_accessed = damon_va_young(mm, r->sampling_addr, &last_folio_sz);
+ last_accessed = damon_va_young(mm, r, &last_folio_sz);
damon_update_region_access_rate(r, last_accessed, attrs);

last_addr = r->sampling_addr;
--
2.21.3


2024-03-18 13:25:18

by Prasad, Aravinda

[permalink] [raw]
Subject: [PATCH v2 3/3] mm/damon: documentation updates

This patch updates the kernel documentation.

Signed-off-by: Aravinda Prasad <[email protected]>
---
Documentation/mm/damon/design.rst | 42 +++++++++++++++++++++++++++++++
1 file changed, 42 insertions(+)

diff --git a/Documentation/mm/damon/design.rst b/Documentation/mm/damon/design.rst
index 5620aab9b385..59014ecbb551 100644
--- a/Documentation/mm/damon/design.rst
+++ b/Documentation/mm/damon/design.rst
@@ -139,6 +139,48 @@ the interference is the responsibility of sysadmins. However, it solves the
conflict with the reclaim logic using ``PG_idle`` and ``PG_young`` page flags,
as Idle page tracking does.

+Profiling enhancement for virtual address space
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For virtual address space tracking, relying on checking Accessed bit(s) only
+at the leaf level of the page table is inefficient. Hardware architectures
+have supported Accessed bit(s) at all levels of the page table tree by
+updating them during the page table walk. Hence, DAMON dynamically
+profiles different levels (PMD/PUD/P4D) of a multi-level page table tree.
+
+DAMON leverages the following key insight: a data page that is accessed
+should also have the Accessed bit set at PMD, PUD, P4D, and PGD entry.
+Similarly, if the Accessed bit in a PGD entry (or a PUD/PMD entry) is
+not set, then none of the data pages under the PGD entry (or PUD/PMD
+entry) subtree are accessed. DAMON profiles Accessed bits at the highest
+possible level of the page table tree to identify the regions that are
+accessed.
+
+For example, consider a region and the sampling address (SA) in the below
+figure. The address range of a PUD entry corresponding to SA is within
+region bounds and hence PUD is picked for checking and setting the
+Accessed bits. However, this not true if P4D is picked for profiling.
+Hence in this case PUD is the highest possible level that can be picked
+for profiling.
+ .......
+ + P4D +
+ .......
+ / \
+ / \
+ / \
+ / \
+ / \
+ / ....... \
+ / + PUD + \
+ / ....... \
+ / / \ \
+- - - - - +-----*---*--+====+-*------+- -*- - -
+ + # SA # +
+ + # # +
+- - - - - +------------+====+--------+- - - - -
+
+ | ----- DAMON region ------|
+

Core Logics
===========
--
2.21.3


2024-03-18 18:25:06

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm/damon: profiling enhancement

Hi Aravinda,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Aravinda-Prasad/mm-damon-mm-infrastructure-support/20240318-212723
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240318132848.82686-3-aravinda.prasad%40intel.com
patch subject: [PATCH v2 2/3] mm/damon: profiling enhancement
config: m68k-allmodconfig (https://download.01.org/0day-ci/archive/20240319/[email protected]/config)
compiler: m68k-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240319/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

mm/damon/vaddr.c: In function 'pick_profile_level':
>> mm/damon/vaddr.c:78:13: error: implicit declaration of function 'pgtable_l5_enabled' [-Werror=implicit-function-declaration]
78 | if (pgtable_l5_enabled()) {
| ^~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors


vim +/pgtable_l5_enabled +78 mm/damon/vaddr.c

54
55 /* Pick the highest possible page table profiling level for addr
56 * in the region defined by start and end
57 */
58 static int pick_profile_level(unsigned long start, unsigned long end,
59 unsigned long addr)
60 {
61 /* Start with PTE and check if higher levels can be picked */
62 int level = 0;
63
64 if (!arch_has_hw_nonleaf_pmd_young())
65 return level;
66
67 /* Check if PMD or higher can be picked, else use PTE */
68 if (pmd_addr_start(addr, (start) - 1) < start
69 || pmd_addr_end(addr, (end) + 1) > end)
70 return level;
71
72 level++;
73 /* Check if PUD or higher can be picked, else use PMD */
74 if (pud_addr_start(addr, (start) - 1) < start
75 || pud_addr_end(addr, (end) + 1) > end)
76 return level;
77
> 78 if (pgtable_l5_enabled()) {
79 level++;
80 /* Check if P4D or higher can be picked, else use PUD */
81 if (p4d_addr_start(addr, (start) - 1) < start
82 || p4d_addr_end(addr, (end) + 1) > end)
83 return level;
84 }
85
86 level++;
87 /* Check if PGD can be picked, else return PUD level */
88 if (pgd_addr_start(addr, (start) - 1) < start
89 || pgd_addr_end(addr, (end) + 1) > end)
90 return level;
91

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-03-18 20:28:27

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] mm/damon: mm infrastructure support

Hi Aravinda,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Aravinda-Prasad/mm-damon-mm-infrastructure-support/20240318-212723
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240318132848.82686-2-aravinda.prasad%40intel.com
patch subject: [PATCH v2 1/3] mm/damon: mm infrastructure support
config: powerpc-allmodconfig (https://download.01.org/0day-ci/archive/20240319/[email protected]/config)
compiler: powerpc64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240319/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

In file included from arch/powerpc/include/asm/kup.h:43,
from arch/powerpc/include/asm/uaccess.h:8,
from include/linux/uaccess.h:11,
from include/linux/sched/task.h:13,
from include/linux/sched/signal.h:9,
from include/linux/rcuwait.h:6,
from include/linux/percpu-rwsem.h:7,
from include/linux/fs.h:33,
from include/linux/compat.h:17,
from arch/powerpc/kernel/asm-offsets.c:12:
>> include/linux/pgtable.h:411:19: error: static declaration of 'pudp_test_and_clear_young' follows non-static declaration
411 | static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
| ^~~~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/powerpc/include/asm/book3s/64/mmu-hash.h:20,
from arch/powerpc/include/asm/book3s/64/mmu.h:32,
from arch/powerpc/include/asm/mmu.h:391,
from arch/powerpc/include/asm/paca.h:18,
from arch/powerpc/include/asm/current.h:13,
from include/linux/mutex.h:14,
from include/linux/rhashtable-types.h:14,
from include/linux/ipc.h:7,
from include/uapi/linux/sem.h:5,
from include/linux/sem.h:5,
from include/linux/compat.h:14:
arch/powerpc/include/asm/book3s/64/pgtable.h:1306:12: note: previous declaration of 'pudp_test_and_clear_young' with type 'int(struct vm_area_struct *, long unsigned int, pud_t *)'
1306 | extern int pudp_test_and_clear_young(struct vm_area_struct *vma,
| ^~~~~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:116: arch/powerpc/kernel/asm-offsets.s] Error 1
make[3]: Target 'prepare' not remade because of errors.
make[2]: *** [Makefile:1191: prepare0] Error 2
make[2]: Target 'prepare' not remade because of errors.
make[1]: *** [Makefile:240: __sub-make] Error 2
make[1]: Target 'prepare' not remade because of errors.
make: *** [Makefile:240: __sub-make] Error 2
make: Target 'prepare' not remade because of errors.


vim +/pudp_test_and_clear_young +411 include/linux/pgtable.h

409
410 #ifndef pudp_test_and_clear_young
> 411 static inline int pudp_test_and_clear_young(struct vm_area_struct *vma,
412 unsigned long address,
413 pud_t *pudp)
414 {
415 return 0;
416 }
417 #endif
418

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-03-18 22:00:39

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] mm/damon: profiling enhancement

Hi Aravinda,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Aravinda-Prasad/mm-damon-mm-infrastructure-support/20240318-212723
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240318132848.82686-3-aravinda.prasad%40intel.com
patch subject: [PATCH v2 2/3] mm/damon: profiling enhancement
config: hexagon-allmodconfig (https://download.01.org/0day-ci/archive/20240319/[email protected]/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project 8f68022f8e6e54d1aeae4ed301f5a015963089b7)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240319/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

In file included from mm/damon/vaddr.c:10:
In file included from include/linux/highmem.h:10:
In file included from include/linux/mm.h:2194:
include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
522 | return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
| ~~~~~~~~~~~ ^ ~~~
In file included from mm/damon/vaddr.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:13:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
547 | val = __raw_readb(PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
560 | val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
| ^
In file included from mm/damon/vaddr.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:13:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
573 | val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
| ~~~~~~~~~~ ^
include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
| ^
In file included from mm/damon/vaddr.c:10:
In file included from include/linux/highmem.h:12:
In file included from include/linux/hardirq.h:11:
In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
In file included from include/asm-generic/hardirq.h:17:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:13:
In file included from arch/hexagon/include/asm/io.h:328:
include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
584 | __raw_writeb(value, PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
594 | __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
604 | __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
| ~~~~~~~~~~ ^
>> mm/damon/vaddr.c:78:6: error: call to undeclared function 'pgtable_l5_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
78 | if (pgtable_l5_enabled()) {
| ^
7 warnings and 1 error generated.


vim +/pgtable_l5_enabled +78 mm/damon/vaddr.c

54
55 /* Pick the highest possible page table profiling level for addr
56 * in the region defined by start and end
57 */
58 static int pick_profile_level(unsigned long start, unsigned long end,
59 unsigned long addr)
60 {
61 /* Start with PTE and check if higher levels can be picked */
62 int level = 0;
63
64 if (!arch_has_hw_nonleaf_pmd_young())
65 return level;
66
67 /* Check if PMD or higher can be picked, else use PTE */
68 if (pmd_addr_start(addr, (start) - 1) < start
69 || pmd_addr_end(addr, (end) + 1) > end)
70 return level;
71
72 level++;
73 /* Check if PUD or higher can be picked, else use PMD */
74 if (pud_addr_start(addr, (start) - 1) < start
75 || pud_addr_end(addr, (end) + 1) > end)
76 return level;
77
> 78 if (pgtable_l5_enabled()) {
79 level++;
80 /* Check if P4D or higher can be picked, else use PUD */
81 if (p4d_addr_start(addr, (start) - 1) < start
82 || p4d_addr_end(addr, (end) + 1) > end)
83 return level;
84 }
85
86 level++;
87 /* Check if PGD can be picked, else return PUD level */
88 if (pgd_addr_start(addr, (start) - 1) < start
89 || pgd_addr_end(addr, (end) + 1) > end)
90 return level;
91

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-03-19 00:51:54

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

On Mon, Mar 18, 2024 at 9:24 AM Aravinda Prasad
<[email protected]> wrote:
>
> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.
>
> This patch proposes profiling different levels of the application’s
> page table tree to detect whether a region is accessed or not. This
> patch set is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region,
> sampling at the higher level of page table tree is faster and efficient.
>
> This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
> tree)
>
> Changes since v1 [1]
> ====================
>
> - Added support for 5-level page table tree
> - Split the patch to mm infrastructure changes and DAMON enhancements
> - Code changes as per comments on v1
> - Added kerneldoc comments
>
> [1] https://lkml.org/lkml/2023/12/15/272
>
> Evaluation:
>
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> parameters set to default value.
> - DAMON+PTP: Page table profiling applied to DAMON with the above
> parameters.
>
> Profiling efficiency in detecting hot data:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON >90% <50% ~0% 0%
> DAMON+PTP >90% >90% >90% >90%
>
> CPU overheads (in billion cycles) for kdamond:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON 1.15 19.53 3.52 9.55
> DAMON+PTP 0.83 3.20 1.27 2.55
>
> A detailed explanation and evaluation can be found in the arXiv paper:
> https://arxiv.org/pdf/2311.10275.pdf

NAK, on the ground of citing the nonfactual source above and
misrespenting the existing idea as your own invention [1].

The existing idea was purposely not patented so that all CPU vendors
are free to use it. Not sure what kind of peer review that source had,
but it's not getting around the reviewers here easily. Please do feel
free to ask any 3rd party that has no conflict of interest to override
my NAK though.

[1] https://lore.kernel.org/CAOUHufbDzy5dMcLR9ex25VdB_QBmSrW_We-2+KftZVYKNn4s9g@mail.gmail.com/
[2] https://lore.kernel.org/[email protected]/

2024-03-19 05:21:06

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

Hi Aravinda,


Thank you for posting this new revision!

I remember I told you that I don't see a high level significant problems on on
the reply to the previous revision of this patch[1], but I show a concern now.
Sorry for not raising this earlier, but let me explain my humble concerns
before being even more late.

On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad <[email protected]> wrote:

> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.

DAMON uses sampling because it considers a region as accessed if a portion of
the region that big enough to be detected via sampling is all accessed. If a
region is having some pages that really accessed but the proportion is too
small to be found via sampling, I think DAMON could say the overall access to
the region is only modest and could even be ignored. In my humble opinion,
this fits with the definition of DAMON region: A memory address range that
constructed with pages having similar access frequency.

>
> This patch proposes profiling different levels of the application\u2019s
> page table tree to detect whether a region is accessed or not. This
> patch set is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region,
> sampling at the higher level of page table tree is faster and efficient.

Due to the above reason, I concern this could result in making DAMON monitoring
results be inaccurately biased to report more than real accesses.

>
> This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
> tree)
>
> Changes since v1 [1]
> ====================
>
> - Added support for 5-level page table tree
> - Split the patch to mm infrastructure changes and DAMON enhancements
> - Code changes as per comments on v1
> - Added kerneldoc comments
>
> [1] https://lkml.org/lkml/2023/12/15/272
>
> Evaluation:
>
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> parameters set to default value.
> - DAMON+PTP: Page table profiling applied to DAMON with the above
> parameters.
>
> Profiling efficiency in detecting hot data:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON >90% <50% ~0% 0%
> DAMON+PTP >90% >90% >90% >90%

Sampling interval is the time interval that assumed to be large enough for the
workload to make meaningful amount of accesses within the interval. Hence,
meaningful amount of sampling interval depends on the workload's characteristic
and system's memory bandwidth.

Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB for
the four cases, respectively. And you set the sampling interval as 5ms. Let's
assume the system can access, say, 50 GB per second, and hence it could be able
to access only up to 250 MB per 5ms. So, in case of 1GB and footprint, all hot
memory region would be accessed while DAMON is waiting for next sampling
interval. Hence, DAMON would be able to see most accesses via sampling. But
for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
region would be accessed between the sampling interval. DAMON cannot see whole
accesses, and hence the precision could be low.

I don't know exact memory bandwith of the system, but to detect the 10 GB hot
region with 5ms sampling interval, the system should be able to access 2GB
memory per millisecond, or about 2TB memory per second. I think systems of
such memory bandwidth is not that common.

I show you also explored a configuration setting the aggregation interval
higher. But because each sampling checks only access between the sampling
interval, that might not help in this setup. I'm wondering if you also
explored increasing sampling interval.

Sorry again for finding this concern not early enough. But I think we may need
to discuss about this first.

[1] https://lkml.kernel.org/r/[email protected]


Thanks,
SJ


>
> CPU overheads (in billion cycles) for kdamond:
>
> Footprint 1GB 10GB 100GB 5TB
> ---------------------------------------------
> DAMON 1.15 19.53 3.52 9.55
> DAMON+PTP 0.83 3.20 1.27 2.55
>
> A detailed explanation and evaluation can be found in the arXiv paper:
> https://arxiv.org/pdf/2311.10275.pdf
>
>
> Aravinda Prasad (3):
> mm/damon: mm infrastructure support
> mm/damon: profiling enhancement
> mm/damon: documentation updates
>
> Documentation/mm/damon/design.rst | 42 ++++++
> arch/x86/include/asm/pgtable.h | 20 +++
> arch/x86/mm/pgtable.c | 28 +++-
> include/linux/mmu_notifier.h | 36 +++++
> include/linux/pgtable.h | 79 ++++++++++
> mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> 6 files changed, 424 insertions(+), 14 deletions(-)
>
> --
> 2.21.3

2024-03-19 10:58:16

by Prasad, Aravinda

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON



> -----Original Message-----
> From: SeongJae Park <[email protected]>
> Sent: Tuesday, March 19, 2024 10:51 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kumar, Sandeep4
> <[email protected]>; Huang, Ying <[email protected]>;
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>; Subramoney, Sreenivas
> <[email protected]>; Kervinen, Antti
> <[email protected]>; Kanevskiy, Alexander
> <[email protected]>
> Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> Hi Aravinda,
>
>
> Thank you for posting this new revision!
>
> I remember I told you that I don't see a high level significant problems on on
> the reply to the previous revision of this patch[1], but I show a concern now.
> Sorry for not raising this earlier, but let me explain my humble concerns before
> being even more late.

Sure, no problem. We can discuss. I will get back to you with a detailed note.

Regards,
Aravinda

>
> On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> <[email protected]> wrote:
>
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
>
> DAMON uses sampling because it considers a region as accessed if a portion of
> the region that big enough to be detected via sampling is all accessed. If a
> region is having some pages that really accessed but the proportion is too
> small to be found via sampling, I think DAMON could say the overall access to
> the region is only modest and could even be ignored. In my humble opinion,
> this fits with the definition of DAMON region: A memory address range that
> constructed with pages having similar access frequency.


>
> >
> > This patch proposes profiling different levels of the
> > application\u2019s page table tree to detect whether a region is
> > accessed or not. This patch set is based on the observation that, when
> > the accessed bit for a page is set, the accessed bits at the higher
> > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > of the page table walk are also set. Hence, it is efficient to check
> > the accessed bits at the higher levels of the page table tree to
> > detect whether a region is accessed or not. For example, if the access
> > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > large region, sampling at the higher level of page table tree is faster and
> efficient.
>
> Due to the above reason, I concern this could result in making DAMON
> monitoring results be inaccurately biased to report more than real accesses.
>
> >
> > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > mm-unstable
> > tree)
> >
> > Changes since v1 [1]
> > ====================
> >
> > - Added support for 5-level page table tree
> > - Split the patch to mm infrastructure changes and DAMON enhancements
> > - Code changes as per comments on v1
> > - Added kerneldoc comments
> >
> > [1] https://lkml.org/lkml/2023/12/15/272
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > parameters set to default value.
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > parameters.
> >
> > Profiling efficiency in detecting hot data:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON >90% <50% ~0% 0%
> > DAMON+PTP >90% >90% >90% >90%
>
> Sampling interval is the time interval that assumed to be large enough for the
> workload to make meaningful amount of accesses within the interval. Hence,
> meaningful amount of sampling interval depends on the workload's
> characteristic and system's memory bandwidth.
>
> Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and
> 10GB for the four cases, respectively. And you set the sampling interval as
> 5ms. Let's assume the system can access, say, 50 GB per second, and hence it
> could be able to access only up to 250 MB per 5ms. So, in case of 1GB and
> footprint, all hot memory region would be accessed while DAMON is waiting
> for next sampling interval. Hence, DAMON would be able to see most
> accesses via sampling. But for 100GB footprint case, only 250MB / 10GB =
> about 2.5% of the hot memory region would be accessed between the
> sampling interval. DAMON cannot see whole accesses, and hence the
> precision could be low.
>
> I don't know exact memory bandwith of the system, but to detect the 10 GB
> hot region with 5ms sampling interval, the system should be able to access
> 2GB memory per millisecond, or about 2TB memory per second. I think
> systems of such memory bandwidth is not that common.
>
> I show you also explored a configuration setting the aggregation interval
> higher. But because each sampling checks only access between the sampling
> interval, that might not help in this setup. I'm wondering if you also explored
> increasing sampling interval.
>
> Sorry again for finding this concern not early enough. But I think we may need
> to discuss about this first.
>
> [1] https://lkml.kernel.org/r/[email protected]
>
>
> Thanks,
> SJ
>
>
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON 1.15 19.53 3.52 9.55
> > DAMON+PTP 0.83 3.20 1.27 2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > https://arxiv.org/pdf/2311.10275.pdf
> >
> >
> > Aravinda Prasad (3):
> > mm/damon: mm infrastructure support
> > mm/damon: profiling enhancement
> > mm/damon: documentation updates
> >
> > Documentation/mm/damon/design.rst | 42 ++++++
> > arch/x86/include/asm/pgtable.h | 20 +++
> > arch/x86/mm/pgtable.c | 28 +++-
> > include/linux/mmu_notifier.h | 36 +++++
> > include/linux/pgtable.h | 79 ++++++++++
> > mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> > 6 files changed, 424 insertions(+), 14 deletions(-)
> >
> > --
> > 2.21.3

2024-03-20 12:31:39

by Prasad, Aravinda

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON



> -----Original Message-----
> From: SeongJae Park <[email protected]>
> Sent: Tuesday, March 19, 2024 10:51 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kumar, Sandeep4
> <[email protected]>; Huang, Ying <[email protected]>; Hansen,
> Dave <[email protected]>; Williams, Dan J <[email protected]>;
> Subramoney, Sreenivas <[email protected]>; Kervinen, Antti
> <[email protected]>; Kanevskiy, Alexander
> <[email protected]>
> Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> Hi Aravinda,
>
>
> Thank you for posting this new revision!
>
> I remember I told you that I don't see a high level significant problems on on the
> reply to the previous revision of this patch[1], but I show a concern now.
> Sorry for not raising this earlier, but let me explain my humble concerns before
> being even more late.

Please find my comments below:

>
> On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> <[email protected]> wrote:
>
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
>
> DAMON uses sampling because it considers a region as accessed if a portion of
> the region that big enough to be detected via sampling is all accessed. If a region
> is having some pages that really accessed but the proportion is too small to be
> found via sampling, I think DAMON could say the overall access to the region is
> only modest and could even be ignored. In my humble opinion, this fits with the
> definition of DAMON region: A memory address range that constructed with
> pages having similar access frequency.

Agree that DAMON considers a region as accessed if a good portion of the region
is accessed. But few points I would like to discuss:

For large regions (say 10GB, that has 2,621,440 4K pages), sampling at PTE level
will not cover a good portion of the region. For example, default 5ms sampling
and 100ms aggregation samples only 20 4K pages in an aggregation interval.
Increasing sampling to 1 ms and aggregation to 1 second can only cover
1000 4K pages, but results in higher CPU overheads due to frequent sampling
Even increasing the aggregation interval to 60 seconds but sampling at 5ms can
only cover 12000 samples, but region splitting and merging happens once
in 60 seconds.

In addition, this worsens when region sizes are say 100GB+. We observe that
sampling at PTE level does not help for large regions as more samples are
are required. So, decreasing/increasing the sampling or aggressions intervals
proportional to the region size is not practical as all regions are of not equal
size, we can have 100GB regions as well as many small regions (e.g., 16KB
to 1MB). So tuning sampling rate and aggregation interval did not help
for large regions.

It can also be observed that large regions cannot be avoided. Large regions
are created by merging adjacent smaller regions or at the beginning of
profiling (depending on min regions parameter which defaults to 10).
Increasing min region reduces the size of regions but increases kdamond
overheads, hence, not preferable.

So, sampling at PTE level cannot precisely detect accesses to large regions
resulting in inaccuracies, even though it works for small regions.
From our experiments, we found that with 10% hot data in a large region
(80+GB regions in a 1TB+ footprint application), DAMON was not able to
detect a single access to that region in 95+% cases with different sample
and aggregation interval combinations. But DAMON works good for
applications with footprint <50GB where regions are typically small.

Now consider the scenario with the proposed enhancement. With a
100GB region, if we sample a PUD entry that covers 1GB address
space, then the default 5ms sampling and 100ms aggregation samples
20 PUD entries that is 20 GB portion of the region. This gives a good
estimate of the portion of the region that is accessed. But the downside
is that as PUD accessed bit is set even if a small set of pages are accessed
under its subtree this can report more access as real as you noted.

But such large regions are split into two in the next aggregation interval.
As the splitting of the regions continues, in next few aggregation intervals
many smaller regions are formed. Once smaller regions are formed,
the proposed enhancement cannot pick higher levels of the page table
tree and behaves as good as default DAMON. So, with the proposed
enhancement, larger regions are quickly split into smaller regions if they
have only small set of pages accessed.

To avoid misinterpreting region access count, I feel that the "age" of the
region is of real help and should be considered by both DAMON and the
proposed enhancement. If the age of a region is small (<10) then that
region should not be considered stable and hence should not be
considered for any memory tiering decisions. For regions with age,
say >10, can be considered as stable as they reflect the actual access
frequency.

>
> >
> > This patch proposes profiling different levels of the
> > application\u2019s page table tree to detect whether a region is
> > accessed or not. This patch set is based on the observation that, when
> > the accessed bit for a page is set, the accessed bits at the higher
> > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > of the page table walk are also set. Hence, it is efficient to check
> > the accessed bits at the higher levels of the page table tree to
> > detect whether a region is accessed or not. For example, if the access
> > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > large region, sampling at the higher level of page table tree is faster and
> efficient.
>
> Due to the above reason, I concern this could result in making DAMON monitoring
> results be inaccurately biased to report more than real accesses.

DAMON, even without the proposed enhancement, can result in inaccuracies
for large regions, (see examples above).

>
> >
> > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > mm-unstable
> > tree)
> >
> > Changes since v1 [1]
> > ====================
> >
> > - Added support for 5-level page table tree
> > - Split the patch to mm infrastructure changes and DAMON enhancements
> > - Code changes as per comments on v1
> > - Added kerneldoc comments
> >
> > [1] https://lkml.org/lkml/2023/12/15/272
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > parameters set to default value.
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > parameters.
> >
> > Profiling efficiency in detecting hot data:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON >90% <50% ~0% 0%
> > DAMON+PTP >90% >90% >90% >90%
>
> Sampling interval is the time interval that assumed to be large enough for the
> workload to make meaningful amount of accesses within the interval. Hence,
> meaningful amount of sampling interval depends on the workload's characteristic
> and system's memory bandwidth.
>
> Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> for the four cases, respectively. And you set the sampling interval as 5ms. Let's
> assume the system can access, say, 50 GB per second, and hence it could be able
> to access only up to 250 MB per 5ms. So, in case of 1GB and footprint, all hot
> memory region would be accessed while DAMON is waiting for next sampling
> interval. Hence, DAMON would be able to see most accesses via sampling. But
> for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> region would be accessed between the sampling interval. DAMON cannot see
> whole accesses, and hence the precision could be low.
>
> I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> region with 5ms sampling interval, the system should be able to access 2GB
> memory per millisecond, or about 2TB memory per second. I think systems of
> such memory bandwidth is not that common.
>
> I show you also explored a configuration setting the aggregation interval higher.
> But because each sampling checks only access between the sampling interval,
> that might not help in this setup. I'm wondering if you also explored increasing
> sampling interval.
>

What we have observed that many real-world benchmarks we experimented
with do not saturate the memory bandwidth. We also experimented with
masim microbenchmark to understand the impact on memory access rate
(we inserted delay between memory access operations in do_rnd_ro() and
other functions). We see decrease in the precision as access intensity is
reduced. We have experimented with different sampling and aggregation
intervals, but that did not help much in improving precision.

So, what I think is it that most of the cases the precision depends on the page
(hot or cold) that is randomly picked for sampling than the sampling rate. Most
of the time only cold 4K pages are picked in a large region as they typically
account for 90% of the pages in the region and hence DAMON does not
detect any accesses at all. By profiling higher levels of the page table tree
this can be improved.

> Sorry again for finding this concern not early enough. But I think we may need to
> discuss about this first.

Absolutely no problem. Please let me know your thoughts.

Regards,
Aravinda

>
> [1] https://lkml.kernel.org/r/[email protected]
>
>
> Thanks,
> SJ
>
>
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint 1GB 10GB 100GB 5TB
> > ---------------------------------------------
> > DAMON 1.15 19.53 3.52 9.55
> > DAMON+PTP 0.83 3.20 1.27 2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > https://arxiv.org/pdf/2311.10275.pdf
> >
> >
> > Aravinda Prasad (3):
> > mm/damon: mm infrastructure support
> > mm/damon: profiling enhancement
> > mm/damon: documentation updates
> >
> > Documentation/mm/damon/design.rst | 42 ++++++
> > arch/x86/include/asm/pgtable.h | 20 +++
> > arch/x86/mm/pgtable.c | 28 +++-
> > include/linux/mmu_notifier.h | 36 +++++
> > include/linux/pgtable.h | 79 ++++++++++
> > mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> > 6 files changed, 424 insertions(+), 14 deletions(-)
> >
> > --
> > 2.21.3

2024-03-21 23:10:30

by SeongJae Park

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

On Wed, 20 Mar 2024 12:31:17 +0000 "Prasad, Aravinda" <[email protected]> wrote:

>
>
> > -----Original Message-----
> > From: SeongJae Park <[email protected]>
> > Sent: Tuesday, March 19, 2024 10:51 AM
> > To: Prasad, Aravinda <[email protected]>
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Kumar, Sandeep4
> > <[email protected]>; Huang, Ying <[email protected]>; Hansen,
> > Dave <[email protected]>; Williams, Dan J <[email protected]>;
> > Subramoney, Sreenivas <[email protected]>; Kervinen, Antti
> > <[email protected]>; Kanevskiy, Alexander
> > <[email protected]>
> > Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> >
> > Hi Aravinda,
> >
> >
> > Thank you for posting this new revision!
> >
> > I remember I told you that I don't see a high level significant problems on on the
> > reply to the previous revision of this patch[1], but I show a concern now.
> > Sorry for not raising this earlier, but let me explain my humble concerns before
> > being even more late.
>
> Please find my comments below:
>
> >
> > On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> > <[email protected]> wrote:
> >
> > > DAMON randomly samples one or more pages in every region and tracks
> > > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > > When the region size is large (e.g., several GBs), which is common for
> > > large footprint applications, detecting whether the region is accessed
> > > or not completely depends on whether the pages that are actively
> > > accessed in the region are picked during random sampling.
> > > If such pages are not picked for sampling, DAMON fails to identify the
> > > region as accessed. However, increasing the sampling rate or
> > > increasing the number of regions increases CPU overheads of kdamond.
> >
> > DAMON uses sampling because it considers a region as accessed if a portion of
> > the region that big enough to be detected via sampling is all accessed. If a region
> > is having some pages that really accessed but the proportion is too small to be
> > found via sampling, I think DAMON could say the overall access to the region is
> > only modest and could even be ignored. In my humble opinion, this fits with the
> > definition of DAMON region: A memory address range that constructed with
> > pages having similar access frequency.
>
> Agree that DAMON considers a region as accessed if a good portion of the region
> is accessed. But few points I would like to discuss:
>
> For large regions (say 10GB, that has 2,621,440 4K pages), sampling at PTE level
> will not cover a good portion of the region. For example, default 5ms sampling
> and 100ms aggregation samples only 20 4K pages in an aggregation interval.

If the 20 attempts all failed at finding any single accessed 4K page, I think
it roughly means less than 5% of the region is accessed within the
user-specified time (aggregation interval). I would translate that as only
tiny portion of the region is accessed within the user-specified time, and
hence DAMON is ok to say the region is nearly not accessed.

> Increasing sampling to 1 ms and aggregation to 1 second can only cover
> 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> Even increasing the aggregation interval to 60 seconds but sampling at 5ms can
> only cover 12000 samples, but region splitting and merging happens once
> in 60 seconds.

At the beginning of each sampling interval, DAMON randomly picks one page per
region, clear their accessed bits, wait until the sampling interval is
finished, and check the accessed bits again. In other words, DAMON shows only
accesses that made in last sampling interval.

Increasing number of samples per aggregation interval can help DAMON knows the
access frequency of regions in finer granularity, but doesn't allow DAMON see
more accesses. Rather than that, if the aggregation interval is fixed
(reducing sampling interval), DAMON can show even less amount of accesses.

What we need here is giving the workload longer sampling time so that the
workload can make access to a size of memory regions that large enough to be
found by DAMON.

>
> In addition, this worsens when region sizes are say 100GB+. We observe that
> sampling at PTE level does not help for large regions as more samples are
> are required. So, decreasing/increasing the sampling or aggressions intervals
> proportional to the region size is not practical as all regions are of not equal
> size, we can have 100GB regions as well as many small regions (e.g., 16KB
> to 1MB).

IMO, it becomes worse because the minimum size of accessed memory regions that
can be found by DAMON via sampling has increased together, while you didn't
give more sampling time (a.k.a the time to let the workload make accesses that
DAMON can show).

> So tuning sampling rate and aggregation interval did not help for large
> regions.

Due to the mechanism of the DAMON's sampling I mentioned above, I think this is
what expected. We need to increase sampling interval.

>
> It can also be observed that large regions cannot be avoided. Large regions
> are created by merging adjacent smaller regions or at the beginning of
> profiling (depending on min regions parameter which defaults to 10).
> Increasing min region reduces the size of regions but increases kdamond
> overheads, hence, not preferable.
>
> So, sampling at PTE level cannot precisely detect accesses to large regions
> resulting in inaccuracies, even though it works for small regions.
> From our experiments, we found that with 10% hot data in a large region
> (80+GB regions in a 1TB+ footprint application), DAMON was not able to
> detect a single access to that region in 95+% cases with different sample
> and aggregation interval combinations. But DAMON works good for
> applications with footprint <50GB where regions are typically small.
>
> Now consider the scenario with the proposed enhancement. With a
> 100GB region, if we sample a PUD entry that covers 1GB address
> space, then the default 5ms sampling and 100ms aggregation samples
> 20 PUD entries that is 20 GB portion of the region. This gives a good
> estimate of the portion of the region that is accessed. But the downside
> is that as PUD accessed bit is set even if a small set of pages are accessed
> under its subtree this can report more access as real as you noted.
>
> But such large regions are split into two in the next aggregation interval.
> As the splitting of the regions continues, in next few aggregation intervals
> many smaller regions are formed. Once smaller regions are formed,
> the proposed enhancement cannot pick higher levels of the page table
> tree and behaves as good as default DAMON. So, with the proposed
> enhancement, larger regions are quickly split into smaller regions if they
> have only small set of pages accessed.

I fully agree. This is what could be a real and important benefits.

>
> To avoid misinterpreting region access count, I feel that the "age" of the
> region is of real help and should be considered by both DAMON and the
> proposed enhancement. If the age of a region is small (<10) then that
> region should not be considered stable and hence should not be
> considered for any memory tiering decisions. For regions with age,
> say >10, can be considered as stable as they reflect the actual access
> frequency.

I think this is a good approach, but difficult to be used by default. I think
we might be able to get the benefit without making problem at the
over-reporting accesses by using the high level accessed bit check results as a
hint for better quality of region split?

Also, if we can allow large enough age, the random region split will eventually
find the small hot regions even without high level accessed bit hint. Of
course the hint could help finding it earlier. I think that was one of my
comment on the first version of this patch.

>
> >
> > >
> > > This patch proposes profiling different levels of the
> > > application\u2019s page table tree to detect whether a region is
> > > accessed or not. This patch set is based on the observation that, when
> > > the accessed bit for a page is set, the accessed bits at the higher
> > > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > > of the page table walk are also set. Hence, it is efficient to check
> > > the accessed bits at the higher levels of the page table tree to
> > > detect whether a region is accessed or not. For example, if the access
> > > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > > large region, sampling at the higher level of page table tree is faster and
> > efficient.
> >
> > Due to the above reason, I concern this could result in making DAMON monitoring
> > results be inaccurately biased to report more than real accesses.
>
> DAMON, even without the proposed enhancement, can result in inaccuracies
> for large regions, (see examples above).

I think temporarily missing such tiny portion of accesses is not a critical
problem. If this is a problem, the user should increase the sampling interval
in my opinion. That said, as mentioned above, DAMON would better to improve
its regions split mechanism.

>
> >
> > >
> > > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > > mm-unstable
> > > tree)
> > >
> > > Changes since v1 [1]
> > > ====================
> > >
> > > - Added support for 5-level page table tree
> > > - Split the patch to mm infrastructure changes and DAMON enhancements
> > > - Code changes as per comments on v1
> > > - Added kerneldoc comments
> > >
> > > [1] https://lkml.org/lkml/2023/12/15/272
> > >
> > > Evaluation:
> > >
> > > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > > and 5TB with 10GB hot data.
> > > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > > parameters set to default value.
> > > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > > parameters.
> > >
> > > Profiling efficiency in detecting hot data:
> > >
> > > Footprint 1GB 10GB 100GB 5TB
> > > ---------------------------------------------
> > > DAMON >90% <50% ~0% 0%
> > > DAMON+PTP >90% >90% >90% >90%
> >
> > Sampling interval is the time interval that assumed to be large enough for the
> > workload to make meaningful amount of accesses within the interval. Hence,
> > meaningful amount of sampling interval depends on the workload's characteristic
> > and system's memory bandwidth.
> >
> > Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> > for the four cases, respectively. And you set the sampling interval as 5ms. Let's
> > assume the system can access, say, 50 GB per second, and hence it could be able
> > to access only up to 250 MB per 5ms. So, in case of 1GB and footprint, all hot
> > memory region would be accessed while DAMON is waiting for next sampling
> > interval. Hence, DAMON would be able to see most accesses via sampling. But
> > for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> > region would be accessed between the sampling interval. DAMON cannot see
> > whole accesses, and hence the precision could be low.
> >
> > I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> > region with 5ms sampling interval, the system should be able to access 2GB
> > memory per millisecond, or about 2TB memory per second. I think systems of
> > such memory bandwidth is not that common.
> >
> > I show you also explored a configuration setting the aggregation interval higher.
> > But because each sampling checks only access between the sampling interval,
> > that might not help in this setup. I'm wondering if you also explored increasing
> > sampling interval.
> >
>
> What we have observed that many real-world benchmarks we experimented
> with do not saturate the memory bandwidth. We also experimented with
> masim microbenchmark to understand the impact on memory access rate
> (we inserted delay between memory access operations in do_rnd_ro() and
> other functions). We see decrease in the precision as access intensity is
> reduced. We have experimented with different sampling and aggregation
> intervals, but that did not help much in improving precision.

Again, please note that DAMON can show only accesses made between each sampling
interval at a time. The important factor for expectation of DAMON's accuracy
is, the balance between the memory access intensity of the workload, and the
length of the sampling interval. The workload should be access intensive
enough to make sufficient amount of accesses between sampling interval. The
sampling interval should be long enough to allow the workload makes sufficient
amount of accesses within the time interval.

The fact that the workloads were not saturating the memory bandwidth is not
enough to know if that means the workload was memory intensive enough, and the
sampling interval was long enough.

I was mentioning the memory bandwidth as only the maximum memory intensity of
the system that could be achieved.

>
> So, what I think is it that most of the cases the precision depends on the page
> (hot or cold) that is randomly picked for sampling than the sampling rate. Most
> of the time only cold 4K pages are picked in a large region as they typically
> account for 90% of the pages in the region and hence DAMON does not
> detect any accesses at all. By profiling higher levels of the page table tree
> this can be improved.

Again, agreed. This is an important and grateful finding. Thank you. And
again as mentioned above, I don't think we can merge this patch as is, but we
could think about using the high level access bit check results as a hint to
better split the regions.

Indeed, DAMON's monitoring mechanism has many rooms for improvements. I also
have some ideas but my time was more spent on more capabilities of DAMON/DAMOS
so far. It was a bit intentional proiority setting since I got no real DAMON
accuracy problem report from the production usage, and improving the accuracy
will deliver the benefit to all DAMON/DAMOS features.

Since an important milestone of DAMOS, namely auto-tuning, has merged into the
mainline, I think I may better to spend more time on monitoring accuracy
improvement. I have some immature ideas in my head. I will try to summarize
and share the ideas in near future.

>
> > Sorry again for finding this concern not early enough. But I think we may need to
> > discuss about this first.
>
> Absolutely no problem. Please let me know your thoughts.

Thank you for patiently walking with me :)


Thanks,
SJ

>
> Regards,
> Aravinda
>
> >
> > [1] https://lkml.kernel.org/r/[email protected]
> >
> >
> > Thanks,
> > SJ
> >
> >
> > >
> > > CPU overheads (in billion cycles) for kdamond:
> > >
> > > Footprint 1GB 10GB 100GB 5TB
> > > ---------------------------------------------
> > > DAMON 1.15 19.53 3.52 9.55
> > > DAMON+PTP 0.83 3.20 1.27 2.55
> > >
> > > A detailed explanation and evaluation can be found in the arXiv paper:
> > > https://arxiv.org/pdf/2311.10275.pdf
> > >
> > >
> > > Aravinda Prasad (3):
> > > mm/damon: mm infrastructure support
> > > mm/damon: profiling enhancement
> > > mm/damon: documentation updates
> > >
> > > Documentation/mm/damon/design.rst | 42 ++++++
> > > arch/x86/include/asm/pgtable.h | 20 +++
> > > arch/x86/mm/pgtable.c | 28 +++-
> > > include/linux/mmu_notifier.h | 36 +++++
> > > include/linux/pgtable.h | 79 ++++++++++
> > > mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> > > 6 files changed, 424 insertions(+), 14 deletions(-)
> > >
> > > --
> > > 2.21.3

2024-03-22 12:18:42

by Prasad, Aravinda

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON



> -----Original Message-----
> From: SeongJae Park <[email protected]>
> Sent: Friday, March 22, 2024 4:40 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: SeongJae Park <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected]; Kumar,
> Sandeep4 <[email protected]>; Huang, Ying <[email protected]>;
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>; Subramoney, Sreenivas
> <[email protected]>; Kervinen, Antti <[email protected]>;
> Kanevskiy, Alexander <[email protected]>
> Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> On Wed, 20 Mar 2024 12:31:17 +0000 "Prasad, Aravinda"
> <[email protected]> wrote:
>
> >
> >
> > > -----Original Message-----
> > > From: SeongJae Park <[email protected]>
> > > Sent: Tuesday, March 19, 2024 10:51 AM
> > > To: Prasad, Aravinda <[email protected]>
> > > Cc: [email protected]; [email protected]; [email protected]; linux-
> > > [email protected]; [email protected]; Kumar, Sandeep4
> > > <[email protected]>; Huang, Ying <[email protected]>;
> > > Hansen, Dave <[email protected]>; Williams, Dan J
> > > <[email protected]>; Subramoney, Sreenivas
> > > <[email protected]>; Kervinen, Antti
> > > <[email protected]>; Kanevskiy, Alexander
> > > <[email protected]>
> > > Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for
> > > DAMON
> > >
> > > Hi Aravinda,
> > >
> > >
> > > Thank you for posting this new revision!
> > >
> > > I remember I told you that I don't see a high level significant
> > > problems on on the reply to the previous revision of this patch[1], but I show a
> concern now.
> > > Sorry for not raising this earlier, but let me explain my humble
> > > concerns before being even more late.
> >
> > Please find my comments below:
> >
> > >
> > > On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> > > <[email protected]> wrote:
> > >
> > > > DAMON randomly samples one or more pages in every region and
> > > > tracks accesses to them using the ACCESSED bit in PTE (or PMD for 2MB
> pages).
> > > > When the region size is large (e.g., several GBs), which is common
> > > > for large footprint applications, detecting whether the region is
> > > > accessed or not completely depends on whether the pages that are
> > > > actively accessed in the region are picked during random sampling.
> > > > If such pages are not picked for sampling, DAMON fails to identify
> > > > the region as accessed. However, increasing the sampling rate or
> > > > increasing the number of regions increases CPU overheads of kdamond.
> > >
> > > DAMON uses sampling because it considers a region as accessed if a
> > > portion of the region that big enough to be detected via sampling is
> > > all accessed. If a region is having some pages that really accessed
> > > but the proportion is too small to be found via sampling, I think
> > > DAMON could say the overall access to the region is only modest and
> > > could even be ignored. In my humble opinion, this fits with the
> > > definition of DAMON region: A memory address range that constructed with
> pages having similar access frequency.
> >
> > Agree that DAMON considers a region as accessed if a good portion of
> > the region is accessed. But few points I would like to discuss:
> >
> > For large regions (say 10GB, that has 2,621,440 4K pages), sampling at
> > PTE level will not cover a good portion of the region. For example,
> > default 5ms sampling and 100ms aggregation samples only 20 4K pages in an
> aggregation interval.
>
> If the 20 attempts all failed at finding any single accessed 4K page, I think it
> roughly means less than 5% of the region is accessed within the user-specified
> time (aggregation interval). I would translate that as only tiny portion of the
> region is accessed within the user-specified time, and hence DAMON is ok to say
> the region is nearly not accessed.

I am looking at it from the other way:

To detect if a region is hot or cold at least 1% of the pages in the region should
be sampled. For a 10GB region (with 2,621,440 4K pages) this requires sampling
at least 26,214 pages. For a 100GB region this will require sampling at least
262,144 pages.

If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB and 1310.72
seconds to cover 100GB.

DAMON shows that the selected page as accessed if that page was accessed
during the 5ms sampling window. Now if we increase the sampling to 20ms to
improve access detection, then covering 1% of the region takes even longer.

>
> > Increasing sampling to 1 ms and aggregation to 1 second can only cover
> > 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> > Even increasing the aggregation interval to 60 seconds but sampling at
> > 5ms can only cover 12000 samples, but region splitting and merging
> > happens once in 60 seconds.
>
> At the beginning of each sampling interval, DAMON randomly picks one page per
> region, clear their accessed bits, wait until the sampling interval is finished, and
> check the accessed bits again. In other words, DAMON shows only accesses that
> made in last sampling interval.

Yes, I see this in the code:

while(time < aggregation_interval)
{
clear_access_bit
sleep(sampling_time)
check_access_bit
}

I would suggest this logic instead.

while(time < aggregation_interval)
{
Number_of_samples = aggregation_interval / sampling_time;

for (i = 0, I < number_of_samples; i++)
{
clear_access_bit
}

sleep(aggregation_time)

for (i = 0, I < number_of_samples; i++)
{
check_access_bit
}
}

This can help in better access detection. I am sure you would
have already explored it.

>
> Increasing number of samples per aggregation interval can help DAMON knows
> the access frequency of regions in finer granularity, but doesn't allow DAMON see
> more accesses. Rather than that, if the aggregation interval is fixed (reducing
> sampling interval), DAMON can show even less amount of accesses.
>
> What we need here is giving the workload longer sampling time so that the
> workload can make access to a size of memory regions that large enough to be
> found by DAMON.

But even with longer sampling time, we may miss the access. For example,
consider all the pages in the region are accessed sequentially. Now if DAMON samples
a different page other than the page that is being accessed it will miss. Now even if we
have longer sampling time it is possible that none of the accesses are detected.

>
> >
> > In addition, this worsens when region sizes are say 100GB+. We observe
> > that sampling at PTE level does not help for large regions as more
> > samples are are required. So, decreasing/increasing the sampling or
> > aggressions intervals proportional to the region size is not practical
> > as all regions are of not equal size, we can have 100GB regions as
> > well as many small regions (e.g., 16KB to 1MB).
>
> IMO, it becomes worse because the minimum size of accessed memory regions
> that can be found by DAMON via sampling has increased together, while you
> didn't give more sampling time (a.k.a the time to let the workload make accesses
> that DAMON can show).
>
> > So tuning sampling rate and aggregation interval did not help for
> > large regions.
>
> Due to the mechanism of the DAMON's sampling I mentioned above, I think this
> is what expected. We need to increase sampling interval.
>
> >
> > It can also be observed that large regions cannot be avoided. Large
> > regions are created by merging adjacent smaller regions or at the
> > beginning of profiling (depending on min regions parameter which defaults to
> 10).
> > Increasing min region reduces the size of regions but increases
> > kdamond overheads, hence, not preferable.
> >
> > So, sampling at PTE level cannot precisely detect accesses to large
> > regions resulting in inaccuracies, even though it works for small regions.
> > From our experiments, we found that with 10% hot data in a large
> > region (80+GB regions in a 1TB+ footprint application), DAMON was not
> > able to detect a single access to that region in 95+% cases with
> > different sample and aggregation interval combinations. But DAMON
> > works good for applications with footprint <50GB where regions are typically
> small.
> >
> > Now consider the scenario with the proposed enhancement. With a 100GB
> > region, if we sample a PUD entry that covers 1GB address space, then
> > the default 5ms sampling and 100ms aggregation samples
> > 20 PUD entries that is 20 GB portion of the region. This gives a good
> > estimate of the portion of the region that is accessed. But the
> > downside is that as PUD accessed bit is set even if a small set of
> > pages are accessed under its subtree this can report more access as real as you
> noted.
> >
> > But such large regions are split into two in the next aggregation interval.
> > As the splitting of the regions continues, in next few aggregation
> > intervals many smaller regions are formed. Once smaller regions are
> > formed, the proposed enhancement cannot pick higher levels of the page
> > table tree and behaves as good as default DAMON. So, with the proposed
> > enhancement, larger regions are quickly split into smaller regions if
> > they have only small set of pages accessed.
>
> I fully agree. This is what could be a real and important benefits.
>
> >
> > To avoid misinterpreting region access count, I feel that the "age" of
> > the region is of real help and should be considered by both DAMON and
> > the proposed enhancement. If the age of a region is small (<10) then
> > that region should not be considered stable and hence should not be
> > considered for any memory tiering decisions. For regions with age, say
> > >10, can be considered as stable as they reflect the actual access
> > frequency.
>
> I think this is a good approach, but difficult to be used by default. I think we
> might be able to get the benefit without making problem at the over-reporting
> accesses by using the high level accessed bit check results as a hint for better
> quality of region split?

I agree, high level page table profiling can give hints to split the region instead of
using it to detect accesses to the region.

>
> Also, if we can allow large enough age, the random region split will eventually find
> the small hot regions even without high level accessed bit hint. Of course the hint
> could help finding it earlier. I think that was one of my comment on the first
> version of this patch.

The problem is that a large region that is split is immediately merged as the split
regions have access count zero.

We observe that large regions are never getting split at all due to this.

Regards,
Aravinda

>
> >
> > >
> > > >
> > > > This patch proposes profiling different levels of the
> > > > application\u2019s page table tree to detect whether a region is
> > > > accessed or not. This patch set is based on the observation that,
> > > > when the accessed bit for a page is set, the accessed bits at the
> > > > higher levels of the page table tree (PMD/PUD/PGD) corresponding
> > > > to the path of the page table walk are also set. Hence, it is
> > > > efficient to check the accessed bits at the higher levels of the
> > > > page table tree to detect whether a region is accessed or not. For
> > > > example, if the access bit for a PUD entry is set, then one or
> > > > more pages in the 1GB PUD subtree is accessed as each PUD entry
> > > > covers 1GB mapping. Hence, instead of sampling thousands of 4K/2M
> > > > pages to detect accesses in a large region, sampling at the higher
> > > > level of page table tree is faster and
> > > efficient.
> > >
> > > Due to the above reason, I concern this could result in making DAMON
> > > monitoring results be inaccurately biased to report more than real accesses.
> >
> > DAMON, even without the proposed enhancement, can result in
> > inaccuracies for large regions, (see examples above).
>
> I think temporarily missing such tiny portion of accesses is not a critical problem.
> If this is a problem, the user should increase the sampling interval in my opinion.
> That said, as mentioned above, DAMON would better to improve its regions split
> mechanism.
>
> >
> > >
> > > >
> > > > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > > > mm-unstable
> > > > tree)
> > > >
> > > > Changes since v1 [1]
> > > > ====================
> > > >
> > > > - Added support for 5-level page table tree
> > > > - Split the patch to mm infrastructure changes and DAMON
> > > > enhancements
> > > > - Code changes as per comments on v1
> > > > - Added kerneldoc comments
> > > >
> > > > [1] https://lkml.org/lkml/2023/12/15/272
> > > >
> > > > Evaluation:
> > > >
> > > > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > > > and 5TB with 10GB hot data.
> > > > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > > > parameters set to default value.
> > > > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > > > parameters.
> > > >
> > > > Profiling efficiency in detecting hot data:
> > > >
> > > > Footprint 1GB 10GB 100GB 5TB
> > > > ---------------------------------------------
> > > > DAMON >90% <50% ~0% 0%
> > > > DAMON+PTP >90% >90% >90% >90%
> > >
> > > Sampling interval is the time interval that assumed to be large
> > > enough for the workload to make meaningful amount of accesses within
> > > the interval. Hence, meaningful amount of sampling interval depends
> > > on the workload's characteristic and system's memory bandwidth.
> > >
> > > Here, the size of the hot memory region is about 100MB, 1GB, 10GB,
> > > and 10GB for the four cases, respectively. And you set the sampling
> > > interval as 5ms. Let's assume the system can access, say, 50 GB per
> > > second, and hence it could be able to access only up to 250 MB per
> > > 5ms. So, in case of 1GB and footprint, all hot memory region would
> > > be accessed while DAMON is waiting for next sampling interval.
> > > Hence, DAMON would be able to see most accesses via sampling. But
> > > for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot
> > > memory region would be accessed between the sampling interval. DAMON
> cannot see whole accesses, and hence the precision could be low.
> > >
> > > I don't know exact memory bandwith of the system, but to detect the
> > > 10 GB hot region with 5ms sampling interval, the system should be
> > > able to access 2GB memory per millisecond, or about 2TB memory per
> > > second. I think systems of such memory bandwidth is not that common.
> > >
> > > I show you also explored a configuration setting the aggregation interval
> higher.
> > > But because each sampling checks only access between the sampling
> > > interval, that might not help in this setup. I'm wondering if you
> > > also explored increasing sampling interval.
> > >
> >
> > What we have observed that many real-world benchmarks we experimented
> > with do not saturate the memory bandwidth. We also experimented with
> > masim microbenchmark to understand the impact on memory access rate
> > (we inserted delay between memory access operations in do_rnd_ro() and
> > other functions). We see decrease in the precision as access intensity
> > is reduced. We have experimented with different sampling and
> > aggregation intervals, but that did not help much in improving precision.
>
> Again, please note that DAMON can show only accesses made between each
> sampling interval at a time. The important factor for expectation of DAMON's
> accuracy is, the balance between the memory access intensity of the workload,
> and the length of the sampling interval. The workload should be access intensive
> enough to make sufficient amount of accesses between sampling interval. The
> sampling interval should be long enough to allow the workload makes sufficient
> amount of accesses within the time interval.
>
> The fact that the workloads were not saturating the memory bandwidth is not
> enough to know if that means the workload was memory intensive enough, and
> the sampling interval was long enough.
>
> I was mentioning the memory bandwidth as only the maximum memory intensity
> of the system that could be achieved.
>
> >
> > So, what I think is it that most of the cases the precision depends on
> > the page (hot or cold) that is randomly picked for sampling than the
> > sampling rate. Most of the time only cold 4K pages are picked in a
> > large region as they typically account for 90% of the pages in the
> > region and hence DAMON does not detect any accesses at all. By
> > profiling higher levels of the page table tree this can be improved.
>
> Again, agreed. This is an important and grateful finding. Thank you. And again as
> mentioned above, I don't think we can merge this patch as is, but we could think
> about using the high level access bit check results as a hint to better split the
> regions.
>
> Indeed, DAMON's monitoring mechanism has many rooms for improvements. I
> also have some ideas but my time was more spent on more capabilities of
> DAMON/DAMOS so far. It was a bit intentional proiority setting since I got no real
> DAMON accuracy problem report from the production usage, and improving the
> accuracy will deliver the benefit to all DAMON/DAMOS features.
>
> Since an important milestone of DAMOS, namely auto-tuning, has merged into
> the mainline, I think I may better to spend more time on monitoring accuracy
> improvement. I have some immature ideas in my head. I will try to summarize
> and share the ideas in near future.
>
> >
> > > Sorry again for finding this concern not early enough. But I think
> > > we may need to discuss about this first.
> >
> > Absolutely no problem. Please let me know your thoughts.
>
> Thank you for patiently walking with me :)
>
>
> Thanks,
> SJ
>
> >
> > Regards,
> > Aravinda
> >
> > >
> > > [1] https://lkml.kernel.org/r/[email protected]
> > >
> > >
> > > Thanks,
> > > SJ
> > >
> > >
> > > >
> > > > CPU overheads (in billion cycles) for kdamond:
> > > >
> > > > Footprint 1GB 10GB 100GB 5TB
> > > > ---------------------------------------------
> > > > DAMON 1.15 19.53 3.52 9.55
> > > > DAMON+PTP 0.83 3.20 1.27 2.55
> > > >
> > > > A detailed explanation and evaluation can be found in the arXiv paper:
> > > > https://arxiv.org/pdf/2311.10275.pdf
> > > >
> > > >
> > > > Aravinda Prasad (3):
> > > > mm/damon: mm infrastructure support
> > > > mm/damon: profiling enhancement
> > > > mm/damon: documentation updates
> > > >
> > > > Documentation/mm/damon/design.rst | 42 ++++++
> > > > arch/x86/include/asm/pgtable.h | 20 +++
> > > > arch/x86/mm/pgtable.c | 28 +++-
> > > > include/linux/mmu_notifier.h | 36 +++++
> > > > include/linux/pgtable.h | 79 ++++++++++
> > > > mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> > > > 6 files changed, 424 insertions(+), 14 deletions(-)
> > > >
> > > > --
> > > > 2.21.3

2024-03-22 18:42:38

by SeongJae Park

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda" <[email protected]> wrote:

[...]
> > > For large regions (say 10GB, that has 2,621,440 4K pages), sampling at
> > > PTE level will not cover a good portion of the region. For example,
> > > default 5ms sampling and 100ms aggregation samples only 20 4K pages in an
> > > aggregation interval.
> >
> > If the 20 attempts all failed at finding any single accessed 4K page, I think it
> > roughly means less than 5% of the region is accessed within the user-specified
> > time (aggregation interval). I would translate that as only tiny portion of the

I now find the above sentence is not correct. Sorry, my bad. Let me re-write.

I think it roughly means the workload is not accessing the region in a
frequency that high enough for DAMON to observe within the user-specified time
(sampling interval).

> > region is accessed within the user-specified time, and hence DAMON is ok to say
> > the region is nearly not accessed.
>
> I am looking at it from the other way:
>
> To detect if a region is hot or cold at least 1% of the pages in the region should
> be sampled. For a 10GB region (with 2,621,440 4K pages) this requires sampling
> at least 26,214 pages. For a 100GB region this will require sampling at least
> 262,144 pages.

Why do you think 1% of the pages should be sampled?

DAMON defines the region as an address range that contains pages having similar
access frequency. Hence if we see a page was accessed within a given time
interval, we can assume all pages in the page is also accessed within the
interval, and vice versa. That's why we sample only single page per region,
and how DAMON's monitoring overhead can be controlled regardless of the size of
the monitoring target memory.

To detect if the region is hot or cold, DAMON continues sampling multiple times
and use number of sampling intervals that seen the access to the region
(nr_accesses) as the relative hotness of the region. Hence, how many sampling
is required depends on what precision of the relative hotness the user wants.
The size of the region doesn't matter here.

Am I missing something?

>
> If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB and 1310.72
> seconds to cover 100GB.
>
> DAMON shows that the selected page as accessed if that page was accessed
> during the 5ms sampling window. Now if we increase the sampling to 20ms to
> improve access detection, then covering 1% of the region takes even longer.
>
> >
> > > Increasing sampling to 1 ms and aggregation to 1 second can only cover
> > > 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> > > Even increasing the aggregation interval to 60 seconds but sampling at
> > > 5ms can only cover 12000 samples, but region splitting and merging
> > > happens once in 60 seconds.
> >
> > At the beginning of each sampling interval, DAMON randomly picks one page per
> > region, clear their accessed bits, wait until the sampling interval is finished, and
> > check the accessed bits again. In other words, DAMON shows only accesses that
> > made in last sampling interval.
>
> Yes, I see this in the code:
>
> while(time < aggregation_interval)
> {
> clear_access_bit
> sleep(sampling_time)
> check_access_bit
> }
>
> I would suggest this logic instead.
>
> while(time < aggregation_interval)
> {
> Number_of_samples = aggregation_interval / sampling_time;
>
> for (i = 0, I < number_of_samples; i++)
> {
> clear_access_bit
> }
>
> sleep(aggregation_time)
>
> for (i = 0, I < number_of_samples; i++)
> {
> check_access_bit
> }
> }
>
> This can help in better access detection. I am sure you would
> have already explored it.

The way to detect the access in the region is implemented by each monitoring
operations set (vaddr, fvaddr, and paddr). We could implement yet another
monitoring operations set with a new access detection method. Nonetheless, I
think changing existing monitoring operations sets to use this suggestion while
keeping their concepts would be not easy.

>
> >
> > Increasing number of samples per aggregation interval can help DAMON knows
> > the access frequency of regions in finer granularity, but doesn't allow DAMON see
> > more accesses. Rather than that, if the aggregation interval is fixed (reducing
> > sampling interval), DAMON can show even less amount of accesses.
> >
> > What we need here is giving the workload longer sampling time so that the
> > workload can make access to a size of memory regions that large enough to be
> > found by DAMON.
>
> But even with longer sampling time, we may miss the access. For example,
> consider all the pages in the region are accessed sequentially. Now if DAMON samples
> a different page other than the page that is being accessed it will miss. Now even if we
> have longer sampling time it is possible that none of the accesses are detected.

If there was accesses to some pages of the region but unaccessed page has
picked as the sampling target, someone could say only a tiny portion of the
region is accessed, so we can treat the region as not accessed at all. That's
at least what the monitoring operations set you use here ('vaddr') thinks.

[...]
> > Also, if we can allow large enough age, the random region split will eventually find
> > the small hot regions even without high level accessed bit hint. Of course the hint
> > could help finding it earlier. I think that was one of my comment on the first
> > version of this patch.
>
> The problem is that a large region that is split is immediately merged as the split
> regions have access count zero.
>
> We observe that large regions are never getting split at all due to this.

I understand this is a valid concern. Especially because we currently split
each region into two sub-regions, finding small hot memory region in the middle
of a huge region could be challenging. This concern has raised before DAMON
has merged into the mainline by Jonathan Cameron. There was also a research
from my previous colleague saying incresing the sub-regions for each split
improves the accuracy. Nonetheless, it increases overall number of regions and
hence increased the overhead. And we didn't get real issue due to this from
the production so far, so we still keeping the old behavior. I'm thinking
about a way to make this better.

That said, the real system would have more than the single region, and the
access pattern will be more dynamic. It will cause the region to be merged and
split in more random and chaotic way. Hence I think there is still a chance to
find the small hot portion eventually. Also, the sampling regions are picked
randomly. A page of the small hot portion will eventually picked as sampling
target, even multiple times, and at least reset the 'age' of the region.

I sometimes turn on DAMON to monitor entire physical address space (about 128
GiB) of my machine and run no active workload but just a few background
deamons. So the system would have only small amount of accesses. At the
beginning, the monitoring output shows all regions as not accessed (nr_accesses
0) and having same 'age'. But as time goes by, the regions are still showing
no access (nr_accesses 0), but different ages and sizes.

Again, I'm not saying existing monitoring mechanism is perfect and optimum. We
should continue optimizing it. Nonetheless, the current accuracy is not
perfectly proved to be too awful to be used in real world. I know at least a
few unnamed production usages of DAMON, and they didn't complained about
DAMON's accuracy so far.


Thanks,
SJ

>
> Regards,
> Aravinda
[...]

2024-03-25 13:37:18

by Prasad, Aravinda

[permalink] [raw]
Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON



> -----Original Message-----
> From: SeongJae Park <[email protected]>
> Sent: Saturday, March 23, 2024 12:03 AM
> To: Prasad, Aravinda <[email protected]>
> Cc: SeongJae Park <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected]; Kumar,
> Sandeep4 <[email protected]>; Huang, Ying <[email protected]>;
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>; Subramoney, Sreenivas
> <[email protected]>; Kervinen, Antti <[email protected]>;
> Kanevskiy, Alexander <[email protected]>
> Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda"
> <[email protected]> wrote:
>
> [...]
> > > > For large regions (say 10GB, that has 2,621,440 4K pages),
> > > > sampling at PTE level will not cover a good portion of the region.
> > > > For example, default 5ms sampling and 100ms aggregation samples
> > > > only 20 4K pages in an aggregation interval.
> > >
> > > If the 20 attempts all failed at finding any single accessed 4K
> > > page, I think it roughly means less than 5% of the region is
> > > accessed within the user-specified time (aggregation interval). I
> > > would translate that as only tiny portion of the
>
> I now find the above sentence is not correct. Sorry, my bad. Let me re-write.
>
> I think it roughly means the workload is not accessing the region in a frequency
> that high enough for DAMON to observe within the user-specified time (sampling
> interval).
>
> > > region is accessed within the user-specified time, and hence DAMON
> > > is ok to say the region is nearly not accessed.
> >
> > I am looking at it from the other way:
> >
> > To detect if a region is hot or cold at least 1% of the pages in the
> > region should be sampled. For a 10GB region (with 2,621,440 4K pages)
> > this requires sampling at least 26,214 pages. For a 100GB region this
> > will require sampling at least
> > 262,144 pages.
>
> Why do you think 1% of the pages should be sampled?

1% is just an example.

>
> DAMON defines the region as an address range that contains pages having similar
> access frequency. Hence if we see a page was accessed within a given time
> interval, we can assume all pages in the page is also accessed within the interval,
> and vice versa. That's why we sample only single page per region, and how
> DAMON's monitoring overhead can be controlled regardless of the size of the
> monitoring target memory.

Initially when DAMON creates "min" regions, it does not consider access frequency.
They are created by diving the address space. So, at the beginning, these regions
need not have pages with similar access frequency. But eventually, as regions are
split and merged then regions are formed that have similar access frequency.

We observe that hot sets are spread across the address space of the application
and many times, only a portion of the DAMON created regions contain a hot data
as per the application's access pattern. In such cases a single sample per
region is not enough.

For small memory footprint applications with small region size, I agree there are
absolutely no issues (also confirmed by our experiments). But for large footprint
applications (1TB+) that can have large regions (50GB+) we see these issues.

>
> To detect if the region is hot or cold, DAMON continues sampling multiple times
> and use number of sampling intervals that seen the access to the region
> (nr_accesses) as the relative hotness of the region. Hence, how many sampling is
> required depends on what precision of the relative hotness the user wants.
> The size of the region doesn't matter here.
>
> Am I missing something?

As mentioned before all these are working fine for small footprint applications (<100GB).
But as we go beyond 1TB footprint we start seeing issues. I can show you a demo
on 1TB+ footprint applications.

>
> >
> > If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB
> > and 1310.72 seconds to cover 100GB.
> >
> > DAMON shows that the selected page as accessed if that page was
> > accessed during the 5ms sampling window. Now if we increase the
> > sampling to 20ms to improve access detection, then covering 1% of the region
> takes even longer.
> >
> > >
> > > > Increasing sampling to 1 ms and aggregation to 1 second can only
> > > > cover
> > > > 1000 4K pages, but results in higher CPU overheads due to frequent
> sampling.
> > > > Even increasing the aggregation interval to 60 seconds but
> > > > sampling at 5ms can only cover 12000 samples, but region splitting
> > > > and merging happens once in 60 seconds.
> > >
> > > At the beginning of each sampling interval, DAMON randomly picks one
> > > page per region, clear their accessed bits, wait until the sampling
> > > interval is finished, and check the accessed bits again. In other
> > > words, DAMON shows only accesses that made in last sampling interval.
> >
> > Yes, I see this in the code:
> >
> > while(time < aggregation_interval)
> > {
> > clear_access_bit
> > sleep(sampling_time)
> > check_access_bit
> > }
> >
> > I would suggest this logic instead.
> >
> > while(time < aggregation_interval)
> > {
> > Number_of_samples = aggregation_interval / sampling_time;
> >
> > for (i = 0, I < number_of_samples; i++)
> > {
> > clear_access_bit
> > }
> >
> > sleep(aggregation_time)
> >
> > for (i = 0, I < number_of_samples; i++)
> > {
> > check_access_bit
> > }
> > }
> >
> > This can help in better access detection. I am sure you would
> > have already explored it.
>
> The way to detect the access in the region is implemented by each monitoring
> operations set (vaddr, fvaddr, and paddr). We could implement yet another
> monitoring operations set with a new access detection method. Nonetheless, I
> think changing existing monitoring operations sets to use this suggestion while
> keeping their concepts would be not easy.

Agree.

>
> >
> > >
> > > Increasing number of samples per aggregation interval can help DAMON
> > > knows the access frequency of regions in finer granularity, but
> > > doesn't allow DAMON see more accesses. Rather than that, if the
> > > aggregation interval is fixed (reducing sampling interval), DAMON can show
> even less amount of accesses.
> > >
> > > What we need here is giving the workload longer sampling time so
> > > that the workload can make access to a size of memory regions that
> > > large enough to be found by DAMON.
> >
> > But even with longer sampling time, we may miss the access. For
> > example, consider all the pages in the region are accessed
> > sequentially. Now if DAMON samples a different page other than the
> > page that is being accessed it will miss. Now even if we have longer sampling
> time it is possible that none of the accesses are detected.
>
> If there was accesses to some pages of the region but unaccessed page has
> picked as the sampling target, someone could say only a tiny portion of the region
> is accessed, so we can treat the region as not accessed at all. That's at least what
> the monitoring operations set you use here ('vaddr') thinks.
>
> [...]
> > > Also, if we can allow large enough age, the random region split will
> > > eventually find the small hot regions even without high level
> > > accessed bit hint. Of course the hint could help finding it
> > > earlier. I think that was one of my comment on the first version of this patch.
> >
> > The problem is that a large region that is split is immediately merged
> > as the split regions have access count zero.
> >
> > We observe that large regions are never getting split at all due to this.
>
> I understand this is a valid concern. Especially because we currently split each
> region into two sub-regions, finding small hot memory region in the middle of a
> huge region could be challenging. This concern has raised before DAMON has
> merged into the mainline by Jonathan Cameron. There was also a research from
> my previous colleague saying incresing the sub-regions for each split improves the
> accuracy. Nonetheless, it increases overall number of regions and hence
> increased the overhead. And we didn't get real issue due to this from the
> production so far, so we still keeping the old behavior. I'm thinking about a way to
> make this better.

These issues are observed only when memory footprint is large enough (1TB+).
Production systems may not be using such large footprint applications yet.

>
> That said, the real system would have more than the single region, and the access
> pattern will be more dynamic. It will cause the region to be merged and split in
> more random and chaotic way. Hence I think there is still a chance to find the
> small hot portion eventually. Also, the sampling regions are picked randomly. A
> page of the small hot portion will eventually picked as sampling target, even
> multiple times, and at least reset the 'age' of the region.
>
> I sometimes turn on DAMON to monitor entire physical address space (about 128
> GiB) of my machine and run no active workload but just a few background
> deamons. So the system would have only small amount of accesses. At the
> beginning, the monitoring output shows all regions as not accessed (nr_accesses
> 0) and having same 'age'. But as time goes by, the regions are still showing no
> access (nr_accesses 0), but different ages and sizes.

Have not tired with physical address space monitoring. But for "vaddr", we see DAMON
working good up to 100GB footprint.

>
> Again, I'm not saying existing monitoring mechanism is perfect and optimum. We
> should continue optimizing it. Nonetheless, the current accuracy is not perfectly
> proved to be too awful to be used in real world. I know at least a few unnamed
> production usages of DAMON, and they didn't complained about DAMON's
> accuracy so far.

We see this problem very consistently on large footprint applications, so could be
working fine for small footprint applications in production.

Regards,
Aravinda


>
>
> Thanks,
> SJ
>
> >
> > Regards,
> > Aravinda
> [...]