2020-08-17 11:30:57

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

From: SeongJae Park <[email protected]>

Changes from Previous Version
=============================

- Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
- Support large record file (Alkaid)
- Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
- Avoid conflict between concurrent DAMON users
- Update evaluation result document

Introduction
============

DAMON is a data access monitoring framework subsystem for the Linux kernel.
The core mechanisms of DAMON called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for the detail) make it

- accurate (The monitored information is useful for DRAM level memory
management. It might not appropriate for Cache-level accuracy, though.),
- light-weight (The monitoring overhead is low enough to be applied online
while making no impact on the performance of the target workloads.), and
- scalable (the upper-bound of the instrumentation overhead is controllable
regardless of the size of target workloads.).

Using this framework, therefore, the kernel's core memory management mechanisms
such as reclamation and THP can be optimized for better memory management. The
experimental memory management optimization works that incurring high
instrumentation overhead will be able to have another try. In user space,
meanwhile, users who have some special workloads will be able to write
personalized tools or applications for deeper understanding and specialized
optimizations of their systems.

Evaluations
===========

We evaluated DAMON's overhead, monitoring quality and usefulness using 25
realistic workloads on my QEMU/KVM based virtual machine running a kernel that
v20 DAMON patchset is applied.

DAMON is lightweight. It increases system memory usage by 0.12% and slows
target workloads down by 1.39%.

DAMON is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, 'ethp', removes 88.16% of
THP memory overheads while preserving 88.73% of THP speedup. Another
experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
reduces 91.34% of residential sets and 25.59% of system memory footprint while
incurring only 1.58% runtime overhead in the best case (parsec3/freqmine).

NOTE that the experimentail THP optimization and proactive reclamation are not
for production but just only for proof of concepts.

Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
a document for DAMON" patch in this patchset for detailed evaluation setup and
results.

[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html

More Information
================

We prepared a showcase web site[1] that you can get more information. There
are

- the official documentations[2],
- the heatmap format dynamic access pattern of various realistic workloads for
heap area[3], mmap()-ed area[4], and stack[5] area,
- the dynamic working set size distribution[6] and chronological working set
size changes[7], and
- the latest performance test results[8].

[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html

Baseline and Complete Git Trees
===============================

The patches are based on the v5.8. You can also clone the complete git
tree:

$ git clone git://github.com/sjp38/linux -b damon/patches/v20

The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v20

There are a couple of trees for entire DAMON patchset series. It includes
future features. The first one[1] contains the changes for latest release,
while the other one[2] contains the changes for next release.

[1] https://github.com/sjp38/linux/tree/damon/master
[2] https://github.com/sjp38/linux/tree/damon/next

Sequence Of Patches
===================

First four patches implement the target address space independent core logics
of DAMON and it's programming interface. The 1st patch introduces DAMON
subsystem, it's data structures, and the data structure related basic
manipulation functions. Following three patches (2nd to 4th) implements the
core mechanisms of DAMON, namely regions based sampling (patch 2), adaptive
regions adjustment (patch 3), and dynamic memory mapping change adoption (patch
4).

Now the essential parts of DAMON is complete but require low level primitives
to be implemented and configured with DAMON to just work. The following two
patches makes it just work for virtual address spaces monitoring. The 5th
patch makes 'PG_idle' could be used by DAMON and the 6th patch implements the
virtual memory address space specific low primitives using page table Accessed
bits and the 'PG_idle' page flag.

Now DAMON just works for virtual address space monitoring via the kernel space
api. Following six patches adds interfaces for the users in the user space.
The 7th patch implements recording of access patterns in DAMON. Each of next
two patches (8th and 9th) respectively adds a tracepoint for other tracepoints
supporting tracers such as perf, and a debugfs interface for privileged people
and/or programs in user space. 10th patch makes the debugfs interface further
support pidfd. And, the 11th patch implements an user space tool to provide a
minimal reference to the debugfs interface and for high level use/tests of the
DAMON.

Three patches for maintainability follows. The 12th patch adds documentations
for both the user space and the kernel space. The 13th patch provides unit
tests (based on the kunit) while the 14th patch adds user space tests (based on
the kselftest).

Finally, the last patch (15th) updates the MAINTAINERS file.

Patch History
=============

Changes from v19
(https://lore.kernel.org/linux-mm/[email protected]/)
- Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
- Support large record file (Alkaid)
- Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
- Avoid conflict between concurrent DAMON users
- Update evaluation result document

Changes from v18
(https://lore.kernel.org/linux-mm/[email protected]/)
- Drop loadable module support (Mike Rapoport)
- Select PAGE_EXTENSION if !64BIT for 'set_page_young()'
- Take care of the MMU notification subscribers (Shakeel Butt)
- Substitute 'struct damon_task' with 'struct damon_target' for better abstract
- Use 'struct pid' instead of 'pid_t' as the target (Shakeel Butt)
- Support pidfd from the debugfs interface (Shakeel Butt)
- Fix typos (Greg Thelen)
- Properly isolate DAMON from other pmd/pte Accessed bit users (Greg Thelen)
- Rebase on v5.8

Changes from v17
(https://lore.kernel.org/linux-mm/[email protected]/)
- Reorganize the doc and remove png blobs (Mike Rapoport)
- Wordsmith mechnisms doc and commit messages
- tools/wss: Set default working set access frequency threshold
- Avoid race in damon deamon start

Changes from v16
(https://lore.kernel.org/linux-mm/[email protected]/)
- Wordsmith/cleanup the documentations and the code
- user space tool: Simplify the code and add wss option for reuse histogram
- recording: Check disablement condition properly
- recording: Force minimal recording buffer size (1KB)

Changes from v15
(https://lore.kernel.org/linux-mm/[email protected]/)
- Refine commit messages (David Hildenbrand)
- Optimizes three vma regions search (Varad Gautam)
- Support static granularity monitoring (Shakeel Butt)
- Cleanup code and re-organize the sequence of patches

Changes from v14
(https://lore.kernel.org/linux-mm/[email protected]/)
- Directly pass region and task to tracepoint (Steven Rostedt)
- Refine comments for better read
- Add more 'Reviewed-by's (Leonard Foerster, Brendan Higgins)

Changes from v13
(https://lore.kernel.org/linux-mm/[email protected]/)
- Fix a typo (Leonard Foerster)
- Fix wring condition of three sub ranges split (Leonard Foerster)
- Rebase on v5.7

Please refer to the v13 patchset to get older history.

SeongJae Park (15):
mm: Introduce Data Access MONitor (DAMON)
mm/damon: Implement region based sampling
mm/damon: Adaptively adjust regions
mm/damon: Track dynamic monitoring target regions update
mm/idle_page_tracking: Make PG_(idle|young) reusable
mm/damon: Implement callbacks for the virtual memory address spaces
mm/damon: Implement access pattern recording
mm/damon: Add a tracepoint
mm/damon: Implement a debugfs interface
damon/debugfs: Support pidfd target id
tools: Introduce a minimal user-space tool for DAMON
Documentation: Add documents for DAMON
mm/damon: Add kunit tests
mm/damon: Add user space selftests
MAINTAINERS: Update for DAMON

Documentation/admin-guide/mm/damon/guide.rst | 157 ++
Documentation/admin-guide/mm/damon/index.rst | 15 +
Documentation/admin-guide/mm/damon/plans.rst | 29 +
Documentation/admin-guide/mm/damon/start.rst | 96 +
Documentation/admin-guide/mm/damon/usage.rst | 302 +++
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/vm/damon/api.rst | 20 +
Documentation/vm/damon/design.rst | 166 ++
Documentation/vm/damon/eval.rst | 225 ++
Documentation/vm/damon/faq.rst | 58 +
Documentation/vm/damon/index.rst | 31 +
Documentation/vm/index.rst | 1 +
MAINTAINERS | 13 +
include/linux/damon.h | 193 ++
include/linux/page-flags.h | 4 +-
include/linux/page_ext.h | 2 +-
include/linux/page_idle.h | 6 +-
include/trace/events/damon.h | 43 +
include/trace/events/mmflags.h | 2 +-
mm/Kconfig | 33 +
mm/Makefile | 1 +
mm/damon-test.h | 671 ++++++
mm/damon.c | 1805 +++++++++++++++++
mm/page_ext.c | 12 +-
mm/page_idle.c | 10 -
tools/damon/.gitignore | 1 +
tools/damon/_damon.py | 130 ++
tools/damon/_dist.py | 36 +
tools/damon/_recfile.py | 23 +
tools/damon/bin2txt.py | 67 +
tools/damon/damo | 37 +
tools/damon/heats.py | 362 ++++
tools/damon/nr_regions.py | 91 +
tools/damon/record.py | 135 ++
tools/damon/report.py | 45 +
tools/damon/wss.py | 100 +
tools/testing/selftests/damon/Makefile | 7 +
.../selftests/damon/_chk_dependency.sh | 28 +
tools/testing/selftests/damon/_chk_record.py | 109 +
.../testing/selftests/damon/debugfs_attrs.sh | 161 ++
.../testing/selftests/damon/debugfs_record.sh | 50 +
41 files changed, 5260 insertions(+), 18 deletions(-)
create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
create mode 100644 Documentation/admin-guide/mm/damon/index.rst
create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
create mode 100644 Documentation/admin-guide/mm/damon/start.rst
create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
create mode 100644 Documentation/vm/damon/api.rst
create mode 100644 Documentation/vm/damon/design.rst
create mode 100644 Documentation/vm/damon/eval.rst
create mode 100644 Documentation/vm/damon/faq.rst
create mode 100644 Documentation/vm/damon/index.rst
create mode 100644 include/linux/damon.h
create mode 100644 include/trace/events/damon.h
create mode 100644 mm/damon-test.h
create mode 100644 mm/damon.c
create mode 100644 tools/damon/.gitignore
create mode 100644 tools/damon/_damon.py
create mode 100644 tools/damon/_dist.py
create mode 100644 tools/damon/_recfile.py
create mode 100644 tools/damon/bin2txt.py
create mode 100755 tools/damon/damo
create mode 100644 tools/damon/heats.py
create mode 100644 tools/damon/nr_regions.py
create mode 100644 tools/damon/record.py
create mode 100644 tools/damon/report.py
create mode 100644 tools/damon/wss.py
create mode 100644 tools/testing/selftests/damon/Makefile
create mode 100644 tools/testing/selftests/damon/_chk_dependency.sh
create mode 100644 tools/testing/selftests/damon/_chk_record.py
create mode 100755 tools/testing/selftests/damon/debugfs_attrs.sh
create mode 100755 tools/testing/selftests/damon/debugfs_record.sh

--
2.17.1


2020-08-17 11:31:03

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 01/15] mm: Introduce Data Access MONitor (DAMON)

From: SeongJae Park <[email protected]>

DAMON is a data access monitoring framework subsystem for the Linux
kernel. The core mechanisms of DAMON make it

- accurate (the monitoring output is useful enough for DRAM level
memory management; It might not appropriate for CPU Cache levels,
though),
- light-weight (the monitoring overhead is low enough to be applied
online), and
- scalable (the upper-bound of the overhead is in constant range
regardless of the size of target workloads).

Using this framework, therefore, the kernel's memory management
mechanisms can make advanced decisions. Experimental memory management
optimization works that incurring high data accesses monitoring overhead
could implemented again. In user space, meanwhile, users who have some
special workloads can write personalized applications for better
understanding and optimizations of their workloads and systems.

This commit is implementing only the stub for the initialization, basic
data structures, and simple manipulation functions of the structures.
The core mechanisms of DAMON will be implemented by following commits.

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Varad Gautam <[email protected]>
---
include/linux/damon.h | 66 ++++++++++++++++
mm/Kconfig | 11 +++
mm/Makefile | 1 +
mm/damon.c | 176 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 254 insertions(+)
create mode 100644 include/linux/damon.h
create mode 100644 mm/damon.c

diff --git a/include/linux/damon.h b/include/linux/damon.h
new file mode 100644
index 000000000000..a6e839a236f4
--- /dev/null
+++ b/include/linux/damon.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON api
+ *
+ * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
+ *
+ * Author: SeongJae Park <[email protected]>
+ */
+
+#ifndef _DAMON_H_
+#define _DAMON_H_
+
+#include <linux/random.h>
+#include <linux/types.h>
+
+/**
+ * struct damon_addr_range - Represents an address region of [@start, @end).
+ * @start: Start address of the region (inclusive).
+ * @end: End address of the region (exclusive).
+ */
+struct damon_addr_range {
+ unsigned long start;
+ unsigned long end;
+};
+
+/**
+ * struct damon_region - Represents a monitoring target region.
+ * @ar: The address range of the region.
+ * @sampling_addr: Address of the sample for the next access check.
+ * @nr_accesses: Access frequency of this region.
+ * @list: List head for siblings.
+ */
+struct damon_region {
+ struct damon_addr_range ar;
+ unsigned long sampling_addr;
+ unsigned int nr_accesses;
+ struct list_head list;
+};
+
+/**
+ * struct damon_target - Represents a monitoring target.
+ * @id: Unique identifier for this target.
+ * @regions_list: Head of the monitoring target regions of this target.
+ * @list: List head for siblings.
+ *
+ * Each monitoring context could have multiple targets. For example, a context
+ * for virtual memory address spaces could have multiple target processes. The
+ * @id of each target should be unique among the targets of the context. For
+ * example, in the virtual address monitoring context, it could be a pidfd or
+ * an address of an mm_struct.
+ */
+struct damon_target {
+ unsigned long id;
+ struct list_head regions_list;
+ struct list_head list;
+};
+
+/**
+ * struct damon_ctx - Represents a context for each monitoring.
+ * @targets_list: Head of monitoring targets (&damon_target) list.
+ */
+struct damon_ctx {
+ struct list_head targets_list; /* 'damon_target' objects */
+};
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index f2104cc0d35c..a99d755d67d3 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,15 @@ config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
bool

+config DAMON
+ bool "Data Access Monitor"
+ help
+ This feature allows to monitor access frequency of each memory
+ region. The information can be useful for performance-centric DRAM
+ level memory management.
+
+ See https://damonitor.github.io/doc/html/latest-damon/index.html for
+ more information.
+ If unsure, say N.
+
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6e9d46b2efc9..30c5dba52fb2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_DAMON) += damon.o
diff --git a/mm/damon.c b/mm/damon.c
new file mode 100644
index 000000000000..d446ba4bfb0a
--- /dev/null
+++ b/mm/damon.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Access Monitor
+ *
+ * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
+ *
+ * Author: SeongJae Park <[email protected]>
+ *
+ * This file is constructed in below parts.
+ *
+ * - Functions and macros for DAMON data structures
+ * - Functions for the initialization
+ *
+ * The core parts are not implemented yet.
+ */
+
+#define pr_fmt(fmt) "damon: " fmt
+
+#include <linux/damon.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+/*
+ * Functions and macros for DAMON data structures
+ */
+
+#define damon_next_region(r) \
+ (container_of(r->list.next, struct damon_region, list))
+
+#define damon_prev_region(r) \
+ (container_of(r->list.prev, struct damon_region, list))
+
+#define damon_for_each_region(r, t) \
+ list_for_each_entry(r, &t->regions_list, list)
+
+#define damon_for_each_region_safe(r, next, t) \
+ list_for_each_entry_safe(r, next, &t->regions_list, list)
+
+#define damon_for_each_target(t, ctx) \
+ list_for_each_entry(t, &(ctx)->targets_list, list)
+
+#define damon_for_each_target_safe(t, next, ctx) \
+ list_for_each_entry_safe(t, next, &(ctx)->targets_list, list)
+
+/* Get a random number in [l, r) */
+#define damon_rand(l, r) (l + prandom_u32() % (r - l))
+
+/*
+ * Construct a damon_region struct
+ *
+ * Returns the pointer to the new struct if success, or NULL otherwise
+ */
+static struct damon_region *damon_new_region(unsigned long start,
+ unsigned long end)
+{
+ struct damon_region *region;
+
+ region = kmalloc(sizeof(*region), GFP_KERNEL);
+ if (!region)
+ return NULL;
+
+ region->ar.start = start;
+ region->ar.end = end;
+ region->nr_accesses = 0;
+ INIT_LIST_HEAD(&region->list);
+
+ return region;
+}
+
+/*
+ * Add a region between two other regions
+ */
+static inline void damon_insert_region(struct damon_region *r,
+ struct damon_region *prev, struct damon_region *next)
+{
+ __list_add(&r->list, &prev->list, &next->list);
+}
+
+static void damon_add_region(struct damon_region *r, struct damon_target *t)
+{
+ list_add_tail(&r->list, &t->regions_list);
+}
+
+static void damon_del_region(struct damon_region *r)
+{
+ list_del(&r->list);
+}
+
+static void damon_free_region(struct damon_region *r)
+{
+ kfree(r);
+}
+
+static void damon_destroy_region(struct damon_region *r)
+{
+ damon_del_region(r);
+ damon_free_region(r);
+}
+
+/*
+ * Construct a damon_target struct
+ *
+ * Returns the pointer to the new struct if success, or NULL otherwise
+ */
+static struct damon_target *damon_new_target(unsigned long id)
+{
+ struct damon_target *t;
+
+ t = kmalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return NULL;
+
+ t->id = id;
+ INIT_LIST_HEAD(&t->regions_list);
+
+ return t;
+}
+
+static void damon_add_target(struct damon_ctx *ctx, struct damon_target *t)
+{
+ list_add_tail(&t->list, &ctx->targets_list);
+}
+
+static void damon_del_target(struct damon_target *t)
+{
+ list_del(&t->list);
+}
+
+static void damon_free_target(struct damon_target *t)
+{
+ struct damon_region *r, *next;
+
+ damon_for_each_region_safe(r, next, t)
+ damon_free_region(r);
+ kfree(t);
+}
+
+static void damon_destroy_target(struct damon_target *t)
+{
+ damon_del_target(t);
+ damon_free_target(t);
+}
+
+static unsigned int nr_damon_targets(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ unsigned int nr_targets = 0;
+
+ damon_for_each_target(t, ctx)
+ nr_targets++;
+
+ return nr_targets;
+}
+
+static unsigned int nr_damon_regions(struct damon_target *t)
+{
+ struct damon_region *r;
+ unsigned int nr_regions = 0;
+
+ damon_for_each_region(r, t)
+ nr_regions++;
+
+ return nr_regions;
+}
+
+/*
+ * Functions for the initialization
+ */
+
+static int __init damon_init(void)
+{
+ return 0;
+}
+
+module_init(damon_init);
--
2.17.1

2020-08-17 11:31:09

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 03/15] mm/damon: Adaptively adjust regions

From: SeongJae Park <[email protected]>

Even somehow the initial monitoring target regions are well constructed
to fulfill the assumption (pages in same region have similar access
frequencies), the data access pattern can be dynamically changed. This
will result in low monitoring quality. To keep the assumption as much
as possible, DAMON adaptively merges and splits each region based on
their access frequency.

For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small.
Then, after it reports and clears the aggregated access frequency of
each region, it splits each region into two or three regions if the
total number of regions will not exceed the user-specified maximum
number of regions after the split.

In this way, DAMON provides its best-effort quality and minimal overhead
while keeping the upper-bound overhead that users set.

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 11 ++-
mm/damon.c | 191 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 189 insertions(+), 13 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index e0c8ea8f4091..708775a36be3 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -64,7 +64,8 @@ struct damon_target {
*
* @sample_interval: The time between access samplings.
* @aggr_interval: The time between monitor results aggregations.
- * @nr_regions: The number of monitoring regions.
+ * @min_nr_regions: The minimum number of monitoring regions.
+ * @max_nr_regions: The maximum number of monitoring regions.
*
* For each @sample_interval, DAMON checks whether each region is accessed or
* not. It aggregates and keeps the access information (number of accesses to
@@ -127,7 +128,8 @@ struct damon_target {
struct damon_ctx {
unsigned long sample_interval;
unsigned long aggr_interval;
- unsigned long nr_regions;
+ unsigned long min_nr_regions;
+ unsigned long max_nr_regions;

struct timespec64 last_aggregation;

@@ -149,8 +151,9 @@ struct damon_ctx {

int damon_set_targets(struct damon_ctx *ctx,
unsigned long *ids, ssize_t nr_ids);
-int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
- unsigned long aggr_int, unsigned long min_nr_reg);
+int damon_set_attrs(struct damon_ctx *ctx,
+ unsigned long sample_int, unsigned long aggr_int,
+ unsigned long min_nr_reg, unsigned long max_nr_reg);
int damon_start(struct damon_ctx *ctxs, int nr_ctxs);
int damon_stop(struct damon_ctx *ctxs, int nr_ctxs);

diff --git a/mm/damon.c b/mm/damon.c
index 408f3a66a8d9..13b17074d411 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -176,6 +176,26 @@ static unsigned int nr_damon_regions(struct damon_target *t)
return nr_regions;
}

+/* Returns the size upper limit for each monitoring region */
+static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ struct damon_region *r;
+ unsigned long sz = 0;
+
+ damon_for_each_target(t, ctx) {
+ damon_for_each_region(r, t)
+ sz += r->ar.end - r->ar.start;
+ }
+
+ if (ctx->min_nr_regions)
+ sz /= ctx->min_nr_regions;
+ if (sz < MIN_REGION)
+ sz = MIN_REGION;
+
+ return sz;
+}
+
/*
* Functions for DAMON core logics and features
*/
@@ -226,6 +246,145 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
}
}

+#define sz_damon_region(r) (r->ar.end - r->ar.start)
+
+/*
+ * Merge two adjacent regions into one region
+ */
+static void damon_merge_two_regions(struct damon_region *l,
+ struct damon_region *r)
+{
+ l->nr_accesses = (l->nr_accesses * sz_damon_region(l) +
+ r->nr_accesses * sz_damon_region(r)) /
+ (sz_damon_region(l) + sz_damon_region(r));
+ l->ar.end = r->ar.end;
+ damon_destroy_region(r);
+}
+
+#define diff_of(a, b) (a > b ? a - b : b - a)
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * t target affected by this merge operation
+ * thres '->nr_accesses' diff threshold for the merge
+ * sz_limit size upper limit of each region
+ */
+static void damon_merge_regions_of(struct damon_target *t, unsigned int thres,
+ unsigned long sz_limit)
+{
+ struct damon_region *r, *prev = NULL, *next;
+
+ damon_for_each_region_safe(r, next, t) {
+ if (prev && prev->ar.end == r->ar.start &&
+ diff_of(prev->nr_accesses, r->nr_accesses) <= thres &&
+ sz_damon_region(prev) + sz_damon_region(r) <= sz_limit)
+ damon_merge_two_regions(prev, r);
+ else
+ prev = r;
+ }
+}
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * threshold '->nr_accesses' diff threshold for the merge
+ * sz_limit size upper limit of each region
+ *
+ * This function merges monitoring target regions which are adjacent and their
+ * access frequencies are similar. This is for minimizing the monitoring
+ * overhead under the dynamically changeable access pattern. If a merge was
+ * unnecessarily made, later 'kdamond_split_regions()' will revert it.
+ */
+static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold,
+ unsigned long sz_limit)
+{
+ struct damon_target *t;
+
+ damon_for_each_target(t, c)
+ damon_merge_regions_of(t, threshold, sz_limit);
+}
+
+/*
+ * Split a region in two
+ *
+ * r the region to be split
+ * sz_r size of the first sub-region that will be made
+ */
+static void damon_split_region_at(struct damon_ctx *ctx,
+ struct damon_region *r, unsigned long sz_r)
+{
+ struct damon_region *new;
+
+ new = damon_new_region(r->ar.start + sz_r, r->ar.end);
+ r->ar.end = new->ar.start;
+
+ damon_insert_region(new, r, damon_next_region(r));
+}
+
+/* Split every region in the given target into 'nr_subs' regions */
+static void damon_split_regions_of(struct damon_ctx *ctx,
+ struct damon_target *t, int nr_subs)
+{
+ struct damon_region *r, *next;
+ unsigned long sz_region, sz_sub = 0;
+ int i;
+
+ damon_for_each_region_safe(r, next, t) {
+ sz_region = r->ar.end - r->ar.start;
+
+ for (i = 0; i < nr_subs - 1 &&
+ sz_region > 2 * MIN_REGION; i++) {
+ /*
+ * Randomly select size of left sub-region to be at
+ * least 10 percent and at most 90% of original region
+ */
+ sz_sub = ALIGN_DOWN(damon_rand(1, 10) *
+ sz_region / 10, MIN_REGION);
+ /* Do not allow blank region */
+ if (sz_sub == 0 || sz_sub >= sz_region)
+ continue;
+
+ damon_split_region_at(ctx, r, sz_sub);
+ sz_region = sz_sub;
+ }
+ }
+}
+
+/*
+ * Split every target region into randomly-sized small regions
+ *
+ * This function splits every target region into random-sized small regions if
+ * current total number of the regions is equal or smaller than half of the
+ * user-specified maximum number of regions. This is for maximizing the
+ * monitoring accuracy under the dynamically changeable access patterns. If a
+ * split was unnecessarily made, later 'kdamond_merge_regions()' will revert
+ * it.
+ */
+static void kdamond_split_regions(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ unsigned int nr_regions = 0;
+ static unsigned int last_nr_regions;
+ int nr_subregions = 2;
+
+ damon_for_each_target(t, ctx)
+ nr_regions += nr_damon_regions(t);
+
+ if (nr_regions > ctx->max_nr_regions / 2)
+ return;
+
+ /* Maybe the middle of the region has different access frequency */
+ if (last_nr_regions == nr_regions &&
+ nr_regions < ctx->max_nr_regions / 3)
+ nr_subregions = 3;
+
+ damon_for_each_target(t, ctx)
+ damon_split_regions_of(ctx, t, nr_subregions);
+
+ last_nr_regions = nr_regions;
+}
+
/*
* Check whether current monitoring should be stopped
*
@@ -264,10 +423,14 @@ static int kdamond_fn(void *data)
struct damon_ctx *ctx = (struct damon_ctx *)data;
struct damon_target *t;
struct damon_region *r, *next;
+ unsigned int max_nr_accesses = 0;
+ unsigned long sz_limit = 0;

pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
if (ctx->init_target_regions)
ctx->init_target_regions(ctx);
+ sz_limit = damon_region_sz_limit(ctx);
+
while (!kdamond_need_stop(ctx)) {
if (ctx->prepare_access_checks)
ctx->prepare_access_checks(ctx);
@@ -277,14 +440,16 @@ static int kdamond_fn(void *data)
usleep_range(ctx->sample_interval, ctx->sample_interval + 1);

if (ctx->check_accesses)
- ctx->check_accesses(ctx);
+ max_nr_accesses = ctx->check_accesses(ctx);

if (kdamond_aggregate_interval_passed(ctx)) {
if (ctx->aggregate_cb)
ctx->aggregate_cb(ctx);
+ kdamond_merge_regions(ctx, max_nr_accesses / 10,
+ sz_limit);
kdamond_reset_aggregated(ctx);
+ kdamond_split_regions(ctx);
}
-
}
damon_for_each_target(t, ctx) {
damon_for_each_region_safe(r, next, t)
@@ -460,25 +625,33 @@ int damon_set_targets(struct damon_ctx *ctx,
* @ctx: monitoring context
* @sample_int: time interval between samplings
* @aggr_int: time interval between aggregations
- * @nr_reg: number of regions
+ * @min_nr_reg: minimal number of regions
+ * @max_nr_reg: maximum number of regions
*
* This function should not be called while the kdamond is running.
* Every time interval is in micro-seconds.
*
* Return: 0 on success, negative error code otherwise.
*/
-int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
- unsigned long aggr_int, unsigned long nr_reg)
+int damon_set_attrs(struct damon_ctx *ctx,
+ unsigned long sample_int, unsigned long aggr_int,
+ unsigned long min_nr_reg, unsigned long max_nr_reg)
{
- if (nr_reg < 3) {
- pr_err("nr_regions (%lu) must be at least 3\n",
- nr_reg);
+ if (min_nr_reg < 3) {
+ pr_err("min_nr_regions (%lu) must be at least 3\n",
+ min_nr_reg);
+ return -EINVAL;
+ }
+ if (min_nr_reg > max_nr_reg) {
+ pr_err("invalid nr_regions. min (%lu) > max (%lu)\n",
+ min_nr_reg, max_nr_reg);
return -EINVAL;
}

ctx->sample_interval = sample_int;
ctx->aggr_interval = aggr_int;
- ctx->nr_regions = nr_reg;
+ ctx->min_nr_regions = min_nr_reg;
+ ctx->max_nr_regions = max_nr_reg;

return 0;
}
--
2.17.1

2020-08-17 11:31:14

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 04/15] mm/damon: Track dynamic monitoring target regions update

From: SeongJae Park <[email protected]>

The monitoring target address range can be dynamically changed. For
example, virtual memory could be dynamically mapped and unmapped.
Physical memory could be hot-plugged.

As the changes could be quite frequent in some cases, DAMON checks the
dynamic memory mapping changes and applies it to the abstracted target
area only for each of a user-specified time interval, ``regions update
interval``.

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 20 +++++++++++++++-----
mm/damon.c | 23 +++++++++++++++++++++--
2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 708775a36be3..65533be91735 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -64,13 +64,16 @@ struct damon_target {
*
* @sample_interval: The time between access samplings.
* @aggr_interval: The time between monitor results aggregations.
+ * @regions_update_interval: The time between monitor regions updates.
* @min_nr_regions: The minimum number of monitoring regions.
* @max_nr_regions: The maximum number of monitoring regions.
*
* For each @sample_interval, DAMON checks whether each region is accessed or
* not. It aggregates and keeps the access information (number of accesses to
- * each region) for @aggr_interval time. All time intervals are in
- * micro-seconds.
+ * each region) for @aggr_interval time. DAMON also checks whether the target
+ * memory regions need update (e.g., by ``mmap()`` calls from the application,
+ * in case of virtual memory monitoring) and applies the changes for each
+ * @regions_update_interval. All time intervals are in micro-seconds.
*
* @kdamond: Kernel thread who does the monitoring.
* @kdamond_stop: Notifies whether kdamond should stop.
@@ -94,6 +97,7 @@ struct damon_target {
* @targets_list: Head of monitoring targets (&damon_target) list.
*
* @init_target_regions: Constructs initial monitoring target regions.
+ * @update_target_regions: Updates monitoring target regions.
* @prepare_access_checks: Prepares next access check of target regions.
* @check_accesses: Checks the access of target regions.
* @target_valid: Determine if the target is valid.
@@ -104,12 +108,15 @@ struct damon_target {
* DAMON can be extended for various address spaces by users. For this, users
* can register the target address space dependent low level functions for
* their usecases via the callback pointers of the context. The monitoring
- * thread calls @init_target_regions before starting the monitoring, and
+ * thread calls @init_target_regions before starting the monitoring,
+ * @update_target_regions for each @regions_update_interval, and
* @prepare_access_checks, @check_accesses, and @target_valid for each
* @sample_interval.
*
* @init_target_regions should construct proper monitoring target regions and
* link those to the DAMON context struct.
+ * @update_target_regions should update the monitoring target regions for
+ * current status.
* @prepare_access_checks should manipulate the monitoring regions to be
* prepare for the next access check.
* @check_accesses should check the accesses to each region that made after the
@@ -128,10 +135,12 @@ struct damon_target {
struct damon_ctx {
unsigned long sample_interval;
unsigned long aggr_interval;
+ unsigned long regions_update_interval;
unsigned long min_nr_regions;
unsigned long max_nr_regions;

struct timespec64 last_aggregation;
+ struct timespec64 last_regions_update;

struct task_struct *kdamond;
bool kdamond_stop;
@@ -141,6 +150,7 @@ struct damon_ctx {

/* callbacks */
void (*init_target_regions)(struct damon_ctx *context);
+ void (*update_target_regions)(struct damon_ctx *context);
void (*prepare_access_checks)(struct damon_ctx *context);
unsigned int (*check_accesses)(struct damon_ctx *context);
bool (*target_valid)(struct damon_target *target);
@@ -151,8 +161,8 @@ struct damon_ctx {

int damon_set_targets(struct damon_ctx *ctx,
unsigned long *ids, ssize_t nr_ids);
-int damon_set_attrs(struct damon_ctx *ctx,
- unsigned long sample_int, unsigned long aggr_int,
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg);
int damon_start(struct damon_ctx *ctxs, int nr_ctxs);
int damon_stop(struct damon_ctx *ctxs, int nr_ctxs);
diff --git a/mm/damon.c b/mm/damon.c
index 13b17074d411..59d0b59858ae 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -385,6 +385,17 @@ static void kdamond_split_regions(struct damon_ctx *ctx)
last_nr_regions = nr_regions;
}

+/*
+ * Check whether it is time to check and apply the target monitoring regions
+ *
+ * Returns true if it is.
+ */
+static bool kdamond_need_update_regions(struct damon_ctx *ctx)
+{
+ return damon_check_reset_time_interval(&ctx->last_regions_update,
+ ctx->regions_update_interval);
+}
+
/*
* Check whether current monitoring should be stopped
*
@@ -450,6 +461,12 @@ static int kdamond_fn(void *data)
kdamond_reset_aggregated(ctx);
kdamond_split_regions(ctx);
}
+
+ if (kdamond_need_update_regions(ctx)) {
+ if (ctx->update_target_regions)
+ ctx->update_target_regions(ctx);
+ sz_limit = damon_region_sz_limit(ctx);
+ }
}
damon_for_each_target(t, ctx) {
damon_for_each_region_safe(r, next, t)
@@ -624,6 +641,7 @@ int damon_set_targets(struct damon_ctx *ctx,
* damon_set_attrs() - Set attributes for the monitoring.
* @ctx: monitoring context
* @sample_int: time interval between samplings
+ * @regions_update_int: time interval between target regions update
* @aggr_int: time interval between aggregations
* @min_nr_reg: minimal number of regions
* @max_nr_reg: maximum number of regions
@@ -633,8 +651,8 @@ int damon_set_targets(struct damon_ctx *ctx,
*
* Return: 0 on success, negative error code otherwise.
*/
-int damon_set_attrs(struct damon_ctx *ctx,
- unsigned long sample_int, unsigned long aggr_int,
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg)
{
if (min_nr_reg < 3) {
@@ -650,6 +668,7 @@ int damon_set_attrs(struct damon_ctx *ctx,

ctx->sample_interval = sample_int;
ctx->aggr_interval = aggr_int;
+ ctx->regions_update_interval = regions_update_int;
ctx->min_nr_regions = min_nr_reg;
ctx->max_nr_regions = max_nr_reg;

--
2.17.1

2020-08-17 11:31:18

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 05/15] mm/idle_page_tracking: Make PG_(idle|young) reusable

From: SeongJae Park <[email protected]>

PG_idle and PG_young allows the two PTE Accessed bit users,
IDLE_PAGE_TRACKING and the reclaim logic concurrently work while don't
interfere each other. That is, when they need to clear the Accessed
bit, they set PG_young and PG_idle to represent the previous state of
the bit, respectively. And when they need to read the bit, if the bit
is cleared, they further read the PG_young and PG_idle, respectively, to
know whether the other has cleared the bit meanwhile or not.

We could add another page flag and extend the mechanism to use the flag
if we need to add another concurrent PTE Accessed bit user subsystem.
However, it would be only waste the space. Instead, if the new
subsystem is mutually exclusive with IDLE_PAGE_TRACKING, it could simply
reuse the PG_idle flag. However, it's impossible because the flags are
dependent on IDLE_PAGE_TRACKING.

To allow such reuse of the flags, this commit separates the PG_young and
PG_idle flag logic from IDLE_PAGE_TRACKING and introduces new kernel
config, 'PAGE_IDLE_FLAG'. Hence, if !IDLE_PAGE_TRACKING and
IDLE_PAGE_FLAG, a new subsystem would be able to reuse PG_idle.

In the next commit, DAMON's reference implementation of the virtual
memory address space monitoring primitives will use it.

Signed-off-by: SeongJae Park <[email protected]>
---
include/linux/page-flags.h | 4 ++--
include/linux/page_ext.h | 2 +-
include/linux/page_idle.h | 6 +++---
include/trace/events/mmflags.h | 2 +-
mm/Kconfig | 8 ++++++++
mm/page_ext.c | 12 +++++++++++-
mm/page_idle.c | 10 ----------
7 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6be1aa559b1e..7736d290bb61 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -132,7 +132,7 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
#endif
@@ -432,7 +432,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
#define __PG_HWPOISON 0
#endif

-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
TESTCLEARFLAG(Young, young, PF_ANY)
diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
index cfce186f0c4e..c9cbc9756011 100644
--- a/include/linux/page_ext.h
+++ b/include/linux/page_ext.h
@@ -19,7 +19,7 @@ struct page_ext_operations {
enum page_ext_flags {
PAGE_EXT_OWNER,
PAGE_EXT_OWNER_ALLOCATED,
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
PAGE_EXT_YOUNG,
PAGE_EXT_IDLE,
#endif
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..d8a6aecf99cb 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -6,7 +6,7 @@
#include <linux/page-flags.h>
#include <linux/page_ext.h>

-#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_PAGE_IDLE_FLAG

#ifdef CONFIG_64BIT
static inline bool page_is_young(struct page *page)
@@ -106,7 +106,7 @@ static inline void clear_page_idle(struct page *page)
}
#endif /* CONFIG_64BIT */

-#else /* !CONFIG_IDLE_PAGE_TRACKING */
+#else /* !CONFIG_PAGE_IDLE_FLAG */

static inline bool page_is_young(struct page *page)
{
@@ -135,6 +135,6 @@ static inline void clear_page_idle(struct page *page)
{
}

-#endif /* CONFIG_IDLE_PAGE_TRACKING */
+#endif /* CONFIG_PAGE_IDLE_FLAG */

#endif /* _LINUX_MM_PAGE_IDLE_H */
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 5fb752034386..4d182c32071b 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -73,7 +73,7 @@
#define IF_HAVE_PG_HWPOISON(flag,string)
#endif

-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
#define IF_HAVE_PG_IDLE(flag,string) ,{1UL << flag, string}
#else
#define IF_HAVE_PG_IDLE(flag,string)
diff --git a/mm/Kconfig b/mm/Kconfig
index a99d755d67d3..8b1dacc60a8e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -765,10 +765,18 @@ config DEFERRED_STRUCT_PAGE_INIT
lifetime of the system until these kthreads finish the
initialisation.

+config PAGE_IDLE_FLAG
+ bool "Add PG_idle and PG_young flags"
+ help
+ This feature adds PG_idle and PG_young flags in 'struct page'. PTE
+ Accessed bit writers can set the state of the bit in the flags to let
+ other PTE Accessed bit readers don't disturbed.
+
config IDLE_PAGE_TRACKING
bool "Enable idle page tracking"
depends on SYSFS && MMU
select PAGE_EXTENSION if !64BIT
+ select PAGE_IDLE_FLAG
help
This feature allows to estimate the amount of user pages that have
not been touched during a given period of time. This information can
diff --git a/mm/page_ext.c b/mm/page_ext.c
index a3616f7a0e9e..f9a6ff65ac0a 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -58,11 +58,21 @@
* can utilize this callback to initialize the state of it correctly.
*/

+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
+static bool need_page_idle(void)
+{
+ return true;
+}
+struct page_ext_operations page_idle_ops = {
+ .need = need_page_idle,
+};
+#endif
+
static struct page_ext_operations *page_ext_ops[] = {
#ifdef CONFIG_PAGE_OWNER
&page_owner_ops,
#endif
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
&page_idle_ops,
#endif
};
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 057c61df12db..144fb4ed961d 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -211,16 +211,6 @@ static const struct attribute_group page_idle_attr_group = {
.name = "page_idle",
};

-#ifndef CONFIG_64BIT
-static bool need_page_idle(void)
-{
- return true;
-}
-struct page_ext_operations page_idle_ops = {
- .need = need_page_idle,
-};
-#endif
-
static int __init page_idle_init(void)
{
int err;
--
2.17.1

2020-08-17 11:31:25

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 07/15] mm/damon: Implement access pattern recording

From: SeongJae Park <[email protected]>

This commit implements the recording feature of DAMON. If this feature
is enabled, DAMON writes the monitored access patterns in its binary
format into a file which specified by the user. This is already able to
be implemented by each user using the callbacks. However, as the
recording is expected to be widely used, this commit implements the
feature in the DAMON, for more convenience.

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 15 +++++
mm/damon.c | 142 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 154 insertions(+), 3 deletions(-)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 945eb9259acf..54cdc1366f59 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -75,6 +75,14 @@ struct damon_target {
* in case of virtual memory monitoring) and applies the changes for each
* @regions_update_interval. All time intervals are in micro-seconds.
*
+ * @rbuf: In-memory buffer for monitoring result recording.
+ * @rbuf_len: The length of @rbuf.
+ * @rbuf_offset: The offset for next write to @rbuf.
+ * @rfile_path: Record file path.
+ *
+ * If @rbuf, @rbuf_len, and @rfile_path are set, the monitored results are
+ * automatically stored in @rfile_path file.
+ *
* @kdamond: Kernel thread who does the monitoring.
* @kdamond_stop: Notifies whether kdamond should stop.
* @kdamond_lock: Mutex for the synchronizations with @kdamond.
@@ -142,6 +150,11 @@ struct damon_ctx {
struct timespec64 last_aggregation;
struct timespec64 last_regions_update;

+ unsigned char *rbuf;
+ unsigned int rbuf_len;
+ unsigned int rbuf_offset;
+ char *rfile_path;
+
struct task_struct *kdamond;
bool kdamond_stop;
struct mutex kdamond_lock;
@@ -172,6 +185,8 @@ int damon_set_targets(struct damon_ctx *ctx,
int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg);
+int damon_set_recording(struct damon_ctx *ctx,
+ unsigned int rbuf_len, char *rfile_path);
int damon_start(struct damon_ctx *ctxs, int nr_ctxs);
int damon_stop(struct damon_ctx *ctxs, int nr_ctxs);

diff --git a/mm/damon.c b/mm/damon.c
index cec151d60755..fed35f7b3879 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -57,6 +57,10 @@
#define damon_for_each_target_safe(t, next, ctx) \
list_for_each_entry_safe(t, next, &(ctx)->targets_list, list)

+#define MIN_RECORD_BUFFER_LEN 1024
+#define MAX_RECORD_BUFFER_LEN (4 * 1024 * 1024)
+#define MAX_RFILE_PATH_LEN 256
+
/* Get a random number in [l, r) */
#define damon_rand(l, r) (l + prandom_u32() % (r - l))

@@ -785,16 +789,89 @@ static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
}

/*
- * Reset the aggregated monitoring results
+ * Flush the content in the result buffer to the result file
+ */
+static void damon_flush_rbuffer(struct damon_ctx *ctx)
+{
+ ssize_t sz;
+ loff_t pos = 0;
+ struct file *rfile;
+
+ if (!ctx->rbuf_offset)
+ return;
+
+ rfile = filp_open(ctx->rfile_path,
+ O_CREAT | O_RDWR | O_APPEND | O_LARGEFILE, 0644);
+ if (IS_ERR(rfile)) {
+ pr_err("Cannot open the result file %s\n",
+ ctx->rfile_path);
+ return;
+ }
+
+ while (ctx->rbuf_offset) {
+ sz = kernel_write(rfile, ctx->rbuf, ctx->rbuf_offset, &pos);
+ if (sz < 0)
+ break;
+ ctx->rbuf_offset -= sz;
+ }
+ filp_close(rfile, NULL);
+}
+
+/*
+ * Write a data into the result buffer
+ */
+static void damon_write_rbuf(struct damon_ctx *ctx, void *data, ssize_t size)
+{
+ if (!ctx->rbuf_len || !ctx->rbuf || !ctx->rfile_path)
+ return;
+ if (ctx->rbuf_offset + size > ctx->rbuf_len)
+ damon_flush_rbuffer(ctx);
+ if (ctx->rbuf_offset + size > ctx->rbuf_len) {
+ pr_warn("%s: flush failed, or wrong size given(%u, %zu)\n",
+ __func__, ctx->rbuf_offset, size);
+ return;
+ }
+
+ memcpy(&ctx->rbuf[ctx->rbuf_offset], data, size);
+ ctx->rbuf_offset += size;
+}
+
+/*
+ * Flush the aggregated monitoring results to the result buffer
+ *
+ * Stores current tracking results to the result buffer and reset 'nr_accesses'
+ * of each region. The format for the result buffer is as below:
+ *
+ * <time> <number of targets> <array of target infos>
+ *
+ * target info: <id> <number of regions> <array of region infos>
+ * region info: <start address> <end address> <nr_accesses>
*/
static void kdamond_reset_aggregated(struct damon_ctx *c)
{
struct damon_target *t;
- struct damon_region *r;
+ struct timespec64 now;
+ unsigned int nr;
+
+ ktime_get_coarse_ts64(&now);
+
+ damon_write_rbuf(c, &now, sizeof(now));
+ nr = nr_damon_targets(c);
+ damon_write_rbuf(c, &nr, sizeof(nr));

damon_for_each_target(t, c) {
- damon_for_each_region(r, t)
+ struct damon_region *r;
+
+ damon_write_rbuf(c, &t->id, sizeof(t->id));
+ nr = nr_damon_regions(t);
+ damon_write_rbuf(c, &nr, sizeof(nr));
+ damon_for_each_region(r, t) {
+ damon_write_rbuf(c, &r->ar.start, sizeof(r->ar.start));
+ damon_write_rbuf(c, &r->ar.end, sizeof(r->ar.end));
+ damon_write_rbuf(c, &r->nr_accesses,
+ sizeof(r->nr_accesses));
r->nr_accesses = 0;
+ }
}
}

@@ -978,6 +1055,14 @@ static bool kdamond_need_stop(struct damon_ctx *ctx)
return true;
}

+static void kdamond_write_record_header(struct damon_ctx *ctx)
+{
+ int recfmt_ver = 2;
+
+ damon_write_rbuf(ctx, "damon_recfmt_ver", 16);
+ damon_write_rbuf(ctx, &recfmt_ver, sizeof(recfmt_ver));
+}
+
/*
* The monitoring daemon that runs as a kernel thread
*/
@@ -994,6 +1079,8 @@ static int kdamond_fn(void *data)
ctx->init_target_regions(ctx);
sz_limit = damon_region_sz_limit(ctx);

+ kdamond_write_record_header(ctx);
+
while (!kdamond_need_stop(ctx)) {
if (ctx->prepare_access_checks)
ctx->prepare_access_checks(ctx);
@@ -1020,6 +1107,7 @@ static int kdamond_fn(void *data)
sz_limit = damon_region_sz_limit(ctx);
}
}
+ damon_flush_rbuffer(ctx);
damon_for_each_target(t, ctx) {
damon_for_each_region_safe(r, next, t)
damon_destroy_region(r);
@@ -1189,6 +1277,54 @@ int damon_set_targets(struct damon_ctx *ctx,
return 0;
}

+/**
+ * damon_set_recording() - Set attributes for the recording.
+ * @ctx: target kdamond context
+ * @rbuf_len: length of the result buffer
+ * @rfile_path: path to the monitor result files
+ *
+ * Setting 'rbuf_len' 0 disables recording.
+ *
+ * This function should not be called while the kdamond is running.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_recording(struct damon_ctx *ctx,
+ unsigned int rbuf_len, char *rfile_path)
+{
+ size_t rfile_path_len;
+
+ if (rbuf_len && (rbuf_len > MAX_RECORD_BUFFER_LEN ||
+ rbuf_len < MIN_RECORD_BUFFER_LEN)) {
+ pr_err("result buffer size (%u) is out of [%d,%d]\n",
+ rbuf_len, MIN_RECORD_BUFFER_LEN,
+ MAX_RECORD_BUFFER_LEN);
+ return -EINVAL;
+ }
+ rfile_path_len = strnlen(rfile_path, MAX_RFILE_PATH_LEN);
+ if (rfile_path_len >= MAX_RFILE_PATH_LEN) {
+ pr_err("too long (>%d) result file path %s\n",
+ MAX_RFILE_PATH_LEN, rfile_path);
+ return -EINVAL;
+ }
+ ctx->rbuf_len = rbuf_len;
+ kfree(ctx->rbuf);
+ ctx->rbuf = NULL;
+ kfree(ctx->rfile_path);
+ ctx->rfile_path = NULL;
+
+ if (rbuf_len) {
+ ctx->rbuf = kvmalloc(rbuf_len, GFP_KERNEL);
+ if (!ctx->rbuf)
+ return -ENOMEM;
+ }
+ ctx->rfile_path = kmalloc(rfile_path_len + 1, GFP_KERNEL);
+ if (!ctx->rfile_path)
+ return -ENOMEM;
+ strncpy(ctx->rfile_path, rfile_path, rfile_path_len + 1);
+ return 0;
+}
+
/**
* damon_set_attrs() - Set attributes for the monitoring.
* @ctx: monitoring context
--
2.17.1

2020-08-17 11:31:32

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 08/15] mm/damon: Add a tracepoint

From: SeongJae Park <[email protected]>

This commit adds a tracepoint for DAMON. It traces the monitoring
results of each region for each aggregation interval. Using this, DAMON
can easily integrated with tracepoints supporting tools such as perf.

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Steven Rostedt (VMware) <[email protected]>
---
include/trace/events/damon.h | 43 ++++++++++++++++++++++++++++++++++++
mm/damon.c | 4 ++++
2 files changed, 47 insertions(+)
create mode 100644 include/trace/events/damon.h

diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
new file mode 100644
index 000000000000..2f422f4f1fb9
--- /dev/null
+++ b/include/trace/events/damon.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM damon
+
+#if !defined(_TRACE_DAMON_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DAMON_H
+
+#include <linux/damon.h>
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(damon_aggregated,
+
+ TP_PROTO(struct damon_target *t, struct damon_region *r,
+ unsigned int nr_regions),
+
+ TP_ARGS(t, r, nr_regions),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, target_id)
+ __field(unsigned int, nr_regions)
+ __field(unsigned long, start)
+ __field(unsigned long, end)
+ __field(unsigned int, nr_accesses)
+ ),
+
+ TP_fast_assign(
+ __entry->target_id = t->id;
+ __entry->nr_regions = nr_regions;
+ __entry->start = r->ar.start;
+ __entry->end = r->ar.end;
+ __entry->nr_accesses = r->nr_accesses;
+ ),
+
+ TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u",
+ __entry->target_id, __entry->nr_regions,
+ __entry->start, __entry->end, __entry->nr_accesses)
+);
+
+#endif /* _TRACE_DAMON_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/damon.c b/mm/damon.c
index fed35f7b3879..e52e107c90fd 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -32,6 +32,9 @@
#include <linux/sched/task.h>
#include <linux/slab.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/damon.h>
+
/* Minimal region size. Every damon_region is aligned by this. */
#define MIN_REGION PAGE_SIZE

@@ -870,6 +873,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
damon_write_rbuf(c, &r->ar.end, sizeof(r->ar.end));
damon_write_rbuf(c, &r->nr_accesses,
sizeof(r->nr_accesses));
+ trace_damon_aggregated(t, r, nr);
r->nr_accesses = 0;
}
}
--
2.17.1

2020-08-17 11:31:36

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 11/15] tools: Introduce a minimal user-space tool for DAMON

From: SeongJae Park <[email protected]>

This commit imtroduces a shallow wrapper python script,
``/tools/damon/damo`` that provides more convenient interface. Note
that it is only aimed to be used for minimal reference of the DAMON's
debugfs interfaces and for debugging of the DAMON itself.

Signed-off-by: SeongJae Park <[email protected]>
---
tools/damon/.gitignore | 1 +
tools/damon/_damon.py | 130 ++++++++++++++
tools/damon/_dist.py | 36 ++++
tools/damon/_recfile.py | 23 +++
tools/damon/bin2txt.py | 67 +++++++
tools/damon/damo | 37 ++++
tools/damon/heats.py | 362 ++++++++++++++++++++++++++++++++++++++
tools/damon/nr_regions.py | 91 ++++++++++
tools/damon/record.py | 135 ++++++++++++++
tools/damon/report.py | 45 +++++
tools/damon/wss.py | 100 +++++++++++
11 files changed, 1027 insertions(+)
create mode 100644 tools/damon/.gitignore
create mode 100644 tools/damon/_damon.py
create mode 100644 tools/damon/_dist.py
create mode 100644 tools/damon/_recfile.py
create mode 100644 tools/damon/bin2txt.py
create mode 100755 tools/damon/damo
create mode 100644 tools/damon/heats.py
create mode 100644 tools/damon/nr_regions.py
create mode 100644 tools/damon/record.py
create mode 100644 tools/damon/report.py
create mode 100644 tools/damon/wss.py

diff --git a/tools/damon/.gitignore b/tools/damon/.gitignore
new file mode 100644
index 000000000000..96403d36ff93
--- /dev/null
+++ b/tools/damon/.gitignore
@@ -0,0 +1 @@
+__pycache__/*
diff --git a/tools/damon/_damon.py b/tools/damon/_damon.py
new file mode 100644
index 000000000000..1f6a292e8c25
--- /dev/null
+++ b/tools/damon/_damon.py
@@ -0,0 +1,130 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Contains core functions for DAMON debugfs control.
+"""
+
+import os
+import subprocess
+
+debugfs_attrs = None
+debugfs_record = None
+debugfs_target_ids = None
+debugfs_monitor_on = None
+
+def set_target_id(tid):
+ with open(debugfs_target_ids, 'w') as f:
+ f.write('%s\n' % tid)
+
+def turn_damon(on_off):
+ return subprocess.call("echo %s > %s" % (on_off, debugfs_monitor_on),
+ shell=True, executable="/bin/bash")
+
+def is_damon_running():
+ with open(debugfs_monitor_on, 'r') as f:
+ return f.read().strip() == 'on'
+
+class Attrs:
+ sample_interval = None
+ aggr_interval = None
+ regions_update_interval = None
+ min_nr_regions = None
+ max_nr_regions = None
+ rbuf_len = None
+ rfile_path = None
+
+ def __init__(self, s, a, r, n, x, l, f):
+ self.sample_interval = s
+ self.aggr_interval = a
+ self.regions_update_interval = r
+ self.min_nr_regions = n
+ self.max_nr_regions = x
+ self.rbuf_len = l
+ self.rfile_path = f
+
+ def __str__(self):
+ return "%s %s %s %s %s %s %s" % (self.sample_interval,
+ self.aggr_interval, self.regions_update_interval,
+ self.min_nr_regions, self.max_nr_regions, self.rbuf_len,
+ self.rfile_path)
+
+ def attr_str(self):
+ return "%s %s %s %s %s " % (self.sample_interval, self.aggr_interval,
+ self.regions_update_interval, self.min_nr_regions,
+ self.max_nr_regions)
+
+ def record_str(self):
+ return '%s %s ' % (self.rbuf_len, self.rfile_path)
+
+ def apply(self):
+ ret = subprocess.call('echo %s > %s' % (self.attr_str(), debugfs_attrs),
+ shell=True, executable='/bin/bash')
+ if ret:
+ return ret
+ ret = subprocess.call('echo %s > %s' % (self.record_str(),
+ debugfs_record), shell=True, executable='/bin/bash')
+ if ret:
+ return ret
+
+def current_attrs():
+ with open(debugfs_attrs, 'r') as f:
+ attrs = f.read().split()
+ attrs = [int(x) for x in attrs]
+
+ with open(debugfs_record, 'r') as f:
+ rattrs = f.read().split()
+ attrs.append(int(rattrs[0]))
+ attrs.append(rattrs[1])
+
+ return Attrs(*attrs)
+
+def chk_update_debugfs(debugfs):
+ global debugfs_attrs
+ global debugfs_record
+ global debugfs_target_ids
+ global debugfs_monitor_on
+
+ debugfs_damon = os.path.join(debugfs, 'damon')
+ debugfs_attrs = os.path.join(debugfs_damon, 'attrs')
+ debugfs_record = os.path.join(debugfs_damon, 'record')
+ debugfs_target_ids = os.path.join(debugfs_damon, 'target_ids')
+ debugfs_monitor_on = os.path.join(debugfs_damon, 'monitor_on')
+
+ if not os.path.isdir(debugfs_damon):
+ print("damon debugfs dir (%s) not found", debugfs_damon)
+ exit(1)
+
+ for f in [debugfs_attrs, debugfs_record, debugfs_target_ids,
+ debugfs_monitor_on]:
+ if not os.path.isfile(f):
+ print("damon debugfs file (%s) not found" % f)
+ exit(1)
+
+def cmd_args_to_attrs(args):
+ "Generate attributes with specified arguments"
+ sample_interval = args.sample
+ aggr_interval = args.aggr
+ regions_update_interval = args.updr
+ min_nr_regions = args.minr
+ max_nr_regions = args.maxr
+ rbuf_len = args.rbuf
+ if not os.path.isabs(args.out):
+ args.out = os.path.join(os.getcwd(), args.out)
+ rfile_path = args.out
+ return Attrs(sample_interval, aggr_interval, regions_update_interval,
+ min_nr_regions, max_nr_regions, rbuf_len, rfile_path)
+
+def set_attrs_argparser(parser):
+ parser.add_argument('-d', '--debugfs', metavar='<debugfs>', type=str,
+ default='/sys/kernel/debug', help='debugfs mounted path')
+ parser.add_argument('-s', '--sample', metavar='<interval>', type=int,
+ default=5000, help='sampling interval')
+ parser.add_argument('-a', '--aggr', metavar='<interval>', type=int,
+ default=100000, help='aggregate interval')
+ parser.add_argument('-u', '--updr', metavar='<interval>', type=int,
+ default=1000000, help='regions update interval')
+ parser.add_argument('-n', '--minr', metavar='<# regions>', type=int,
+ default=10, help='minimal number of regions')
+ parser.add_argument('-m', '--maxr', metavar='<# regions>', type=int,
+ default=1000, help='maximum number of regions')
diff --git a/tools/damon/_dist.py b/tools/damon/_dist.py
new file mode 100644
index 000000000000..9851ec964e5c
--- /dev/null
+++ b/tools/damon/_dist.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import os
+import struct
+import subprocess
+
+def access_patterns(f):
+ nr_regions = struct.unpack('I', f.read(4))[0]
+
+ patterns = []
+ for r in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ nr_accesses = struct.unpack('I', f.read(4))[0]
+ patterns.append([eaddr - saddr, nr_accesses])
+ return patterns
+
+def plot_dist(data_file, output_file, xlabel, ylabel):
+ terminal = output_file.split('.')[-1]
+ if not terminal in ['pdf', 'jpeg', 'png', 'svg']:
+ os.remove(data_file)
+ print("Unsupported plot output type.")
+ exit(-1)
+
+ gnuplot_cmd = """
+ set term %s;
+ set output '%s';
+ set key off;
+ set xlabel '%s';
+ set ylabel '%s';
+ plot '%s' with linespoints;""" % (terminal, output_file, xlabel, ylabel,
+ data_file)
+ subprocess.call(['gnuplot', '-e', gnuplot_cmd])
+ os.remove(data_file)
+
diff --git a/tools/damon/_recfile.py b/tools/damon/_recfile.py
new file mode 100644
index 000000000000..45dd8ffdb5ae
--- /dev/null
+++ b/tools/damon/_recfile.py
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import struct
+
+fmt_version = 0
+
+def set_fmt_version(f):
+ global fmt_version
+
+ mark = f.read(16)
+ if mark == b'damon_recfmt_ver':
+ fmt_version = struct.unpack('i', f.read(4))[0]
+ else:
+ fmt_version = 0
+ f.seek(0)
+ return fmt_version
+
+def target_id(f):
+ if fmt_version == 1:
+ return struct.unpack('i', f.read(4))[0]
+ else:
+ return struct.unpack('L', f.read(8))[0]
diff --git a/tools/damon/bin2txt.py b/tools/damon/bin2txt.py
new file mode 100644
index 000000000000..79516c72f449
--- /dev/null
+++ b/tools/damon/bin2txt.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import os
+import struct
+import sys
+
+import _recfile
+
+def parse_time(bindat):
+ "bindat should be 16 bytes"
+ sec = struct.unpack('l', bindat[0:8])[0]
+ nsec = struct.unpack('l', bindat[8:16])[0]
+ return sec * 1000000000 + nsec;
+
+def pr_region(f):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ nr_accesses = struct.unpack('I', f.read(4))[0]
+ print("%012x-%012x(%10d):\t%d" %
+ (saddr, eaddr, eaddr - saddr, nr_accesses))
+
+def pr_task_info(f):
+ target_id = _recfile.target_id(f)
+ print("target_id: ", target_id)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ print("nr_regions: ", nr_regions)
+ for r in range(nr_regions):
+ pr_region(f)
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ file_path = args.input
+
+ if not os.path.isfile(file_path):
+ print('input file (%s) is not exist' % file_path)
+ exit(1)
+
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ time = parse_time(timebin)
+ if not start_time:
+ start_time = time
+ print("start_time: ", start_time)
+ print("rel time: %16d" % (time - start_time))
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ print("nr_tasks: ", nr_tasks)
+ for t in range(nr_tasks):
+ pr_task_info(f)
+ print("")
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/damo b/tools/damon/damo
new file mode 100755
index 000000000000..58e1099ae5fc
--- /dev/null
+++ b/tools/damon/damo
@@ -0,0 +1,37 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+
+import record
+import report
+
+class SubCmdHelpFormatter(argparse.RawDescriptionHelpFormatter):
+ def _format_action(self, action):
+ parts = super(argparse.RawDescriptionHelpFormatter,
+ self)._format_action(action)
+ # skip sub parsers help
+ if action.nargs == argparse.PARSER:
+ parts = '\n'.join(parts.split('\n')[1:])
+ return parts
+
+parser = argparse.ArgumentParser(formatter_class=SubCmdHelpFormatter)
+
+subparser = parser.add_subparsers(title='command', dest='command',
+ metavar='<command>')
+subparser.required = True
+
+parser_record = subparser.add_parser('record',
+ help='record data accesses of the given target processes')
+record.set_argparser(parser_record)
+
+parser_report = subparser.add_parser('report',
+ help='report the recorded data accesses in the specified form')
+report.set_argparser(parser_report)
+
+args = parser.parse_args()
+
+if args.command == 'record':
+ record.main(args)
+elif args.command == 'report':
+ report.main(args)
diff --git a/tools/damon/heats.py b/tools/damon/heats.py
new file mode 100644
index 000000000000..78a2a793f50e
--- /dev/null
+++ b/tools/damon/heats.py
@@ -0,0 +1,362 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Transform binary trace data into human readable text that can be used for
+heatmap drawing, or directly plot the data in a heatmap format.
+
+Format of the text is:
+
+ <time> <space> <heat>
+ ...
+
+"""
+
+import argparse
+import os
+import struct
+import subprocess
+import sys
+import tempfile
+
+import _recfile
+
+class HeatSample:
+ space_idx = None
+ sz_time_space = None
+ heat = None
+
+ def __init__(self, space_idx, sz_time_space, heat):
+ if sz_time_space < 0:
+ raise RuntimeError()
+ self.space_idx = space_idx
+ self.sz_time_space = sz_time_space
+ self.heat = heat
+
+ def total_heat(self):
+ return self.heat * self.sz_time_space
+
+ def merge(self, sample):
+ "sample must have a space idx that same to self"
+ heat_sum = self.total_heat() + sample.total_heat()
+ self.heat = heat_sum / (self.sz_time_space + sample.sz_time_space)
+ self.sz_time_space += sample.sz_time_space
+
+def pr_samples(samples, time_idx, time_unit, region_unit):
+ display_time = time_idx * time_unit
+ for idx, sample in enumerate(samples):
+ display_addr = idx * region_unit
+ if not sample:
+ print("%s\t%s\t%s" % (display_time, display_addr, 0.0))
+ continue
+ print("%s\t%s\t%s" % (display_time, display_addr, sample.total_heat() /
+ time_unit / region_unit))
+
+def to_idx(value, min_, unit):
+ return (value - min_) // unit
+
+def read_task_heats(f, tid, aunit, amin, amax):
+ tid_ = _recfile.target_id(f)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ if tid_ != tid:
+ f.read(20 * nr_regions)
+ return None
+ samples = []
+ for i in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ eaddr = min(eaddr, amax - 1)
+ heat = struct.unpack('I', f.read(4))[0]
+
+ if eaddr <= amin:
+ continue
+ if saddr >= amax:
+ continue
+ saddr = max(amin, saddr)
+ eaddr = min(amax, eaddr)
+
+ sidx = to_idx(saddr, amin, aunit)
+ eidx = to_idx(eaddr - 1, amin, aunit)
+ for idx in range(sidx, eidx + 1):
+ sa = max(amin + idx * aunit, saddr)
+ ea = min(amin + (idx + 1) * aunit, eaddr)
+ sample = HeatSample(idx, (ea - sa), heat)
+ samples.append(sample)
+ return samples
+
+def parse_time(bindat):
+ sec = struct.unpack('l', bindat[0:8])[0]
+ nsec = struct.unpack('l', bindat[8:16])[0]
+ return sec * 1000000000 + nsec
+
+def apply_samples(target_samples, samples, start_time, end_time, aunit, amin):
+ for s in samples:
+ sample = HeatSample(s.space_idx,
+ s.sz_time_space * (end_time - start_time), s.heat)
+ idx = sample.space_idx
+ if not target_samples[idx]:
+ target_samples[idx] = sample
+ else:
+ target_samples[idx].merge(sample)
+
+def __pr_heats(f, tid, tunit, tmin, tmax, aunit, amin, amax):
+ heat_samples = [None] * ((amax - amin) // aunit)
+
+ start_time = 0
+ end_time = 0
+ last_flushed = -1
+ while True:
+ start_time = end_time
+ timebin = f.read(16)
+ if (len(timebin)) != 16:
+ break
+ end_time = parse_time(timebin)
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ samples_set = {}
+ for t in range(nr_tasks):
+ samples = read_task_heats(f, tid, aunit, amin, amax)
+ if samples:
+ samples_set[tid] = samples
+ if not tid in samples_set:
+ continue
+ if start_time >= tmax:
+ continue
+ if end_time <= tmin:
+ continue
+ start_time = max(start_time, tmin)
+ end_time = min(end_time, tmax)
+
+ sidx = to_idx(start_time, tmin, tunit)
+ eidx = to_idx(end_time - 1, tmin, tunit)
+ for idx in range(sidx, eidx + 1):
+ if idx != last_flushed:
+ pr_samples(heat_samples, idx, tunit, aunit)
+ heat_samples = [None] * ((amax - amin) // aunit)
+ last_flushed = idx
+ st = max(start_time, tmin + idx * tunit)
+ et = min(end_time, tmin + (idx + 1) * tunit)
+ apply_samples(heat_samples, samples_set[tid], st, et, aunit, amin)
+
+def pr_heats(args):
+ binfile = args.input
+ tid = args.tid
+ tres = args.tres
+ tmin = args.tmin
+ ares = args.ares
+ amin = args.amin
+
+ tunit = (args.tmax - tmin) // tres
+ aunit = (args.amax - amin) // ares
+
+ # Compensate the values so that those fit with the resolution
+ tmax = tmin + tunit * tres
+ amax = amin + aunit * ares
+
+ with open(binfile, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ __pr_heats(f, tid, tunit, tmin, tmax, aunit, amin, amax)
+
+class GuideInfo:
+ tid = None
+ start_time = None
+ end_time = None
+ lowest_addr = None
+ highest_addr = None
+ gaps = None
+
+ def __init__(self, tid, start_time):
+ self.tid = tid
+ self.start_time = start_time
+ self.gaps = []
+
+ def regions(self):
+ regions = []
+ region = [self.lowest_addr]
+ for gap in self.gaps:
+ for idx, point in enumerate(gap):
+ if idx == 0:
+ region.append(point)
+ regions.append(region)
+ else:
+ region = [point]
+ region.append(self.highest_addr)
+ regions.append(region)
+ return regions
+
+ def total_space(self):
+ ret = 0
+ for r in self.regions():
+ ret += r[1] - r[0]
+ return ret
+
+ def __str__(self):
+ lines = ['target_id:%d' % self.tid]
+ lines.append('time: %d-%d (%d)' % (self.start_time, self.end_time,
+ self.end_time - self.start_time))
+ for idx, region in enumerate(self.regions()):
+ lines.append('region\t%2d: %020d-%020d (%d)' %
+ (idx, region[0], region[1], region[1] - region[0]))
+ return '\n'.join(lines)
+
+def is_overlap(region1, region2):
+ if region1[1] < region2[0]:
+ return False
+ if region2[1] < region1[0]:
+ return False
+ return True
+
+def overlap_region_of(region1, region2):
+ return [max(region1[0], region2[0]), min(region1[1], region2[1])]
+
+def overlapping_regions(regions1, regions2):
+ overlap_regions = []
+ for r1 in regions1:
+ for r2 in regions2:
+ if is_overlap(r1, r2):
+ r1 = overlap_region_of(r1, r2)
+ if r1:
+ overlap_regions.append(r1)
+ return overlap_regions
+
+def get_guide_info(binfile):
+ "Read file, return the set of guide information objects of the data"
+ guides = {}
+ with open(binfile, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ monitor_time = parse_time(timebin)
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ tid = _recfile.target_id(f)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ if not tid in guides:
+ guides[tid] = GuideInfo(tid, monitor_time)
+ guide = guides[tid]
+ guide.end_time = monitor_time
+
+ last_addr = None
+ gaps = []
+ for r in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ f.read(4)
+
+ if not guide.lowest_addr or saddr < guide.lowest_addr:
+ guide.lowest_addr = saddr
+ if not guide.highest_addr or eaddr > guide.highest_addr:
+ guide.highest_addr = eaddr
+
+ if not last_addr:
+ last_addr = eaddr
+ continue
+ if last_addr != saddr:
+ gaps.append([last_addr, saddr])
+ last_addr = eaddr
+
+ if not guide.gaps:
+ guide.gaps = gaps
+ else:
+ guide.gaps = overlapping_regions(guide.gaps, gaps)
+ return sorted(list(guides.values()), key=lambda x: x.total_space(),
+ reverse=True)
+
+def pr_guide(binfile):
+ for guide in get_guide_info(binfile):
+ print(guide)
+
+def region_sort_key(region):
+ return region[1] - region[0]
+
+def set_missed_args(args):
+ if args.tid and args.tmin and args.tmax and args.amin and args.amax:
+ return
+ guides = get_guide_info(args.input)
+ guide = guides[0]
+ if not args.tid:
+ args.tid = guide.tid
+ for g in guides:
+ if g.tid == args.tid:
+ guide = g
+ break
+
+ if not args.tmin:
+ args.tmin = guide.start_time
+ if not args.tmax:
+ args.tmax = guide.end_time
+
+ if not args.amin or not args.amax:
+ region = sorted(guide.regions(), key=lambda x: x[1] - x[0],
+ reverse=True)[0]
+ args.amin = region[0]
+ args.amax = region[1]
+
+def plot_heatmap(data_file, output_file):
+ terminal = output_file.split('.')[-1]
+ if not terminal in ['pdf', 'jpeg', 'png', 'svg']:
+ os.remove(data_file)
+ print("Unsupported plot output type.")
+ exit(-1)
+
+ gnuplot_cmd = """
+ set term %s;
+ set output '%s';
+ set key off;
+ set xrange [0:];
+ set yrange [0:];
+ set xlabel 'Time (ns)';
+ set ylabel 'Address (bytes)';
+ plot '%s' using 1:2:3 with image;""" % (terminal, output_file, data_file)
+ subprocess.call(['gnuplot', '-e', gnuplot_cmd])
+ os.remove(data_file)
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--tid', metavar='<id>', type=int,
+ help='target id')
+ parser.add_argument('--tres', metavar='<resolution>', type=int,
+ default=500, help='time resolution of the output')
+ parser.add_argument('--tmin', metavar='<time>', type=lambda x: int(x,0),
+ help='minimal time of the output')
+ parser.add_argument('--tmax', metavar='<time>', type=lambda x: int(x,0),
+ help='maximum time of the output')
+ parser.add_argument('--ares', metavar='<resolution>', type=int, default=500,
+ help='space address resolution of the output')
+ parser.add_argument('--amin', metavar='<address>', type=lambda x: int(x,0),
+ help='minimal space address of the output')
+ parser.add_argument('--amax', metavar='<address>', type=lambda x: int(x,0),
+ help='maximum space address of the output')
+ parser.add_argument('--guide', action='store_true',
+ help='print a guidance for the min/max/resolution settings')
+ parser.add_argument('--heatmap', metavar='<file>', type=str,
+ help='heatmap image file to create')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ if args.guide:
+ pr_guide(args.input)
+ else:
+ set_missed_args(args)
+ orig_stdout = sys.stdout
+ if args.heatmap:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ pr_heats(args)
+
+ if args.heatmap:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ plot_heatmap(tmp_path, args.heatmap)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/nr_regions.py b/tools/damon/nr_regions.py
new file mode 100644
index 000000000000..960dd4362472
--- /dev/null
+++ b/tools/damon/nr_regions.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"Print out distribution of the number of regions in the given record"
+
+import argparse
+import struct
+import sys
+import tempfile
+
+import _dist
+import _recfile
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--range', '-r', type=int, nargs=3,
+ metavar=('<start>', '<stop>', '<step>'),
+ help='range of percentiles to print')
+ parser.add_argument('--sortby', '-s', choices=['time', 'size'],
+ help='the metric to be used for sorting the number of regions')
+ parser.add_argument('--plot', '-p', type=str, metavar='<file>',
+ help='plot the distribution to an image file')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ percentiles = [0, 25, 50, 75, 100]
+
+ file_path = args.input
+ if args.range:
+ percentiles = range(args.range[0], args.range[1], args.range[2])
+ nr_regions_sort = True
+ if args.sortby == 'time':
+ nr_regions_sort = False
+
+ tid_pattern_map = {}
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ tid = _recfile.target_id(f)
+ if not tid in tid_pattern_map:
+ tid_pattern_map[tid] = []
+ tid_pattern_map[tid].append(_dist.access_patterns(f))
+
+ orig_stdout = sys.stdout
+ if args.plot:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ print('# <percentile> <# regions>')
+ for tid in tid_pattern_map.keys():
+ # Skip firs 20 regions as those would not adaptively adjusted
+ snapshots = tid_pattern_map[tid][20:]
+ nr_regions_dist = []
+ for snapshot in snapshots:
+ nr_regions_dist.append(len(snapshot))
+ if nr_regions_sort:
+ nr_regions_dist.sort(reverse=False)
+
+ print('# target_id\t%s' % tid)
+ print('# avr:\t%d' % (sum(nr_regions_dist) / len(nr_regions_dist)))
+ for percentile in percentiles:
+ thres_idx = int(percentile / 100.0 * len(nr_regions_dist))
+ if thres_idx == len(nr_regions_dist):
+ thres_idx -= 1
+ threshold = nr_regions_dist[thres_idx]
+ print('%d\t%d' % (percentile, nr_regions_dist[thres_idx]))
+
+ if args.plot:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ xlabel = 'runtime (percent)'
+ if nr_regions_sort:
+ xlabel = 'percentile'
+ _dist.plot_dist(tmp_path, args.plot, xlabel,
+ 'number of monitoring target regions')
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/record.py b/tools/damon/record.py
new file mode 100644
index 000000000000..6d1cbe593b94
--- /dev/null
+++ b/tools/damon/record.py
@@ -0,0 +1,135 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Record data access patterns of the target process.
+"""
+
+import argparse
+import os
+import signal
+import subprocess
+import time
+
+import _damon
+
+def pidfd_open(pid):
+ import ctypes
+ libc = ctypes.CDLL(None)
+ syscall = libc.syscall
+ syscall.argtypes = [ctypes.c_int, ctypes.c_int, ctypes.c_uint]
+ syscall.restype = ctypes.c_long
+
+ NR_pidfd_open = 434
+
+ return syscall(NR_pidfd_open, pid, 0)
+
+def do_record(target, is_target_cmd, attrs, old_attrs, pidfd):
+ if os.path.isfile(attrs.rfile_path):
+ os.rename(attrs.rfile_path, attrs.rfile_path + '.old')
+
+ if attrs.apply():
+ print('attributes (%s) failed to be applied' % attrs)
+ cleanup_exit(old_attrs, -1)
+ print('# damon attrs: %s %s' % (attrs.attr_str(), attrs.record_str()))
+ if is_target_cmd:
+ p = subprocess.Popen(target, shell=True, executable='/bin/bash')
+ target = p.pid
+
+ if pidfd:
+ fd = pidfd_open(int(target))
+ if fd < 0:
+ print('failed getting pidfd of %s: %s' % (target, fd))
+ cleanup_exit(old_attrs, -1)
+
+ # NOTE: The race is still possible because the pid might be already
+ # reused before above pidfd_open() returned. Eliminating the race is
+ # impossible unless we drop the pid support. This option handling is
+ # only for reference of the pidfd usage.
+ target = 'pidfd %s' % fd
+
+ if _damon.set_target_id(target):
+ print('target id setting (%s) failed' % target)
+ cleanup_exit(old_attrs, -2)
+ if _damon.turn_damon('on'):
+ print('could not turn on damon' % target)
+ cleanup_exit(old_attrs, -3)
+ while not _damon.is_damon_running():
+ time.sleep(1)
+ print('Press Ctrl+C to stop')
+ if is_target_cmd:
+ p.wait()
+ while True:
+ # damon will turn it off by itself if the target tasks are terminated.
+ if not _damon.is_damon_running():
+ break
+ time.sleep(1)
+
+ if pidfd:
+ os.close(fd)
+ cleanup_exit(old_attrs, 0)
+
+def cleanup_exit(orig_attrs, exit_code):
+ if _damon.is_damon_running():
+ if _damon.turn_damon('off'):
+ print('failed to turn damon off!')
+ while _damon.is_damon_running():
+ time.sleep(1)
+ if orig_attrs:
+ if orig_attrs.apply():
+ print('original attributes (%s) restoration failed!' % orig_attrs)
+ exit(exit_code)
+
+def sighandler(signum, frame):
+ print('\nsignal %s received' % signum)
+ cleanup_exit(orig_attrs, signum)
+
+def chk_permission():
+ if os.geteuid() != 0:
+ print("Run as root")
+ exit(1)
+
+def set_argparser(parser):
+ _damon.set_attrs_argparser(parser)
+ parser.add_argument('target', type=str, metavar='<target>',
+ help='the target command or the pid to record')
+ parser.add_argument('--pidfd', action='store_true',
+ help='use pidfd type target id')
+ parser.add_argument('-l', '--rbuf', metavar='<len>', type=int,
+ default=1024*1024, help='length of record result buffer')
+ parser.add_argument('-o', '--out', metavar='<file path>', type=str,
+ default='damon.data', help='output file path')
+
+def main(args=None):
+ global orig_attrs
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ chk_permission()
+ _damon.chk_update_debugfs(args.debugfs)
+
+ signal.signal(signal.SIGINT, sighandler)
+ signal.signal(signal.SIGTERM, sighandler)
+ orig_attrs = _damon.current_attrs()
+
+ args.schemes = ''
+ pidfd = args.pidfd
+ new_attrs = _damon.cmd_args_to_attrs(args)
+ target = args.target
+
+ target_fields = target.split()
+ if not subprocess.call('which %s &> /dev/null' % target_fields[0],
+ shell=True, executable='/bin/bash'):
+ do_record(target, True, new_attrs, orig_attrs, pidfd)
+ else:
+ try:
+ pid = int(target)
+ except:
+ print('target \'%s\' is neither a command, nor a pid' % target)
+ exit(1)
+ do_record(target, False, new_attrs, orig_attrs, pidfd)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/report.py b/tools/damon/report.py
new file mode 100644
index 000000000000..c661c7b2f1af
--- /dev/null
+++ b/tools/damon/report.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+
+import bin2txt
+import heats
+import nr_regions
+import wss
+
+def set_argparser(parser):
+ subparsers = parser.add_subparsers(title='report type', dest='report_type',
+ metavar='<report type>', help='the type of the report to generate')
+ subparsers.required = True
+
+ parser_raw = subparsers.add_parser('raw', help='human readable raw data')
+ bin2txt.set_argparser(parser_raw)
+
+ parser_heats = subparsers.add_parser('heats', help='heats of regions')
+ heats.set_argparser(parser_heats)
+
+ parser_wss = subparsers.add_parser('wss', help='working set size')
+ wss.set_argparser(parser_wss)
+
+ parser_nr_regions = subparsers.add_parser('nr_regions',
+ help='number of regions')
+ nr_regions.set_argparser(parser_nr_regions)
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ if args.report_type == 'raw':
+ bin2txt.main(args)
+ elif args.report_type == 'heats':
+ heats.main(args)
+ elif args.report_type == 'wss':
+ wss.main(args)
+ elif args.report_type == 'nr_regions':
+ nr_regions.main(args)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/wss.py b/tools/damon/wss.py
new file mode 100644
index 000000000000..54e2ac7cf83b
--- /dev/null
+++ b/tools/damon/wss.py
@@ -0,0 +1,100 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"Print out the distribution of the working set sizes of the given trace"
+
+import argparse
+import struct
+import sys
+import tempfile
+
+import _dist
+import _recfile
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--range', '-r', type=int, nargs=3,
+ metavar=('<start>', '<stop>', '<step>'),
+ help='range of wss percentiles to print')
+ parser.add_argument('--thres', '-t', type=int, default=1,
+ metavar='<# accesses>',
+ help='minimal number of accesses for treated as working set')
+ parser.add_argument('--sortby', '-s', choices=['time', 'size'],
+ help='the metric to be used for the sort of the working set sizes')
+ parser.add_argument('--plot', '-p', type=str, metavar='<file>',
+ help='plot the distribution to an image file')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ percentiles = [0, 25, 50, 75, 100]
+
+ file_path = args.input
+ if args.range:
+ percentiles = range(args.range[0], args.range[1], args.range[2])
+ wss_sort = True
+ if args.sortby == 'time':
+ wss_sort = False
+
+ tid_pattern_map = {}
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ tid = _recfile.target_id(f)
+ if not tid in tid_pattern_map:
+ tid_pattern_map[tid] = []
+ tid_pattern_map[tid].append(_dist.access_patterns(f))
+
+ orig_stdout = sys.stdout
+ if args.plot:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ print('# <percentile> <wss>')
+ for tid in tid_pattern_map.keys():
+ # Skip first 20 snapshots as regions may not adjusted yet.
+ snapshots = tid_pattern_map[tid][20:]
+ wss_dist = []
+ for snapshot in snapshots:
+ wss = 0
+ for p in snapshot:
+ # Ignore regions not accessed
+ if p[1] < args.thres:
+ continue
+ wss += p[0]
+ wss_dist.append(wss)
+ if wss_sort:
+ wss_dist.sort(reverse=False)
+
+ print('# target_id\t%s' % tid)
+ print('# avr:\t%d' % (sum(wss_dist) / len(wss_dist)))
+ for percentile in percentiles:
+ thres_idx = int(percentile / 100.0 * len(wss_dist))
+ if thres_idx == len(wss_dist):
+ thres_idx -= 1
+ threshold = wss_dist[thres_idx]
+ print('%d\t%d' % (percentile, wss_dist[thres_idx]))
+
+ if args.plot:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ xlabel = 'runtime (percent)'
+ if wss_sort:
+ xlabel = 'percentile'
+ _dist.plot_dist(tmp_path, args.plot, xlabel,
+ 'working set size (bytes)')
+
+if __name__ == '__main__':
+ main()
--
2.17.1

2020-08-17 11:31:41

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 12/15] Documentation: Add documents for DAMON

From: SeongJae Park <[email protected]>

This commit adds documents for DAMON under
`Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.

Signed-off-by: SeongJae Park <[email protected]>
---
Documentation/admin-guide/mm/damon/guide.rst | 157 ++++++++++
Documentation/admin-guide/mm/damon/index.rst | 15 +
Documentation/admin-guide/mm/damon/plans.rst | 29 ++
Documentation/admin-guide/mm/damon/start.rst | 96 ++++++
Documentation/admin-guide/mm/damon/usage.rst | 302 +++++++++++++++++++
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/vm/damon/api.rst | 20 ++
Documentation/vm/damon/design.rst | 166 ++++++++++
Documentation/vm/damon/eval.rst | 225 ++++++++++++++
Documentation/vm/damon/faq.rst | 58 ++++
Documentation/vm/damon/index.rst | 31 ++
Documentation/vm/index.rst | 1 +
12 files changed, 1101 insertions(+)
create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
create mode 100644 Documentation/admin-guide/mm/damon/index.rst
create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
create mode 100644 Documentation/admin-guide/mm/damon/start.rst
create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
create mode 100644 Documentation/vm/damon/api.rst
create mode 100644 Documentation/vm/damon/design.rst
create mode 100644 Documentation/vm/damon/eval.rst
create mode 100644 Documentation/vm/damon/faq.rst
create mode 100644 Documentation/vm/damon/index.rst

diff --git a/Documentation/admin-guide/mm/damon/guide.rst b/Documentation/admin-guide/mm/damon/guide.rst
new file mode 100644
index 000000000000..c51fb843efaa
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/guide.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Optimization Guide
+==================
+
+This document helps you estimating the amount of benefit that you could get
+from DAMON-based optimizations, and describes how you could achieve it. You
+are assumed to already read :doc:`start`.
+
+
+Check The Signs
+===============
+
+No optimization can provide same extent of benefit to every case. Therefore
+you should first guess how much improvements you could get using DAMON. If
+some of below conditions match your situation, you could consider using DAMON.
+
+- *Low IPC and High Cache Miss Ratios.* Low IPC means most of the CPU time is
+ spent waiting for the completion of time-consuming operations such as memory
+ access, while high cache miss ratios mean the caches don't help it well.
+ DAMON is not for cache level optimization, but DRAM level. However,
+ improving DRAM management will also help this case by reducing the memory
+ operation latency.
+- *Memory Over-commitment and Unknown Users.* If you are doing memory
+ overcommitment and you cannot control every user of your system, a memory
+ bank run could happen at any time. You can estimate when it will happen
+ based on DAMON's monitoring results and act earlier to avoid or deal better
+ with the crisis.
+- *Frequent Memory Pressure.* Frequent memory pressure means your system has
+ wrong configurations or memory hogs. DAMON will help you find the right
+ configuration and/or the criminals.
+- *Heterogeneous Memory System.* If your system is utilizing memory devices
+ that placed between DRAM and traditional hard disks, such as non-volatile
+ memory or fast SSDs, DAMON could help you utilizing the devices more
+ efficiently.
+
+
+Profile
+=======
+
+If you found some positive signals, you could start by profiling your workloads
+using DAMON. Find major workloads on your systems and analyze their data
+access pattern to find something wrong or can be improved. The DAMON user
+space tool (``damo``) will be useful for this.
+
+We recommend you to start from working set size distribution check using ``damo
+report wss``. If the distribution is ununiform or quite different from what
+you estimated, you could consider `Memory Configuration`_ optimization.
+
+Then, review the overall access pattern in heatmap form using ``damo report
+heats``. If it shows a simple pattern consists of a small number of memory
+regions having high contrast of access temperature, you could consider manual
+`Program Modification`_.
+
+If you still want to absorb more benefits, you should develop `Personalized
+DAMON Application`_ for your special case.
+
+You don't need to take only one approach among the above plans, but you could
+use multiple of the above approaches to maximize the benefit.
+
+
+Optimize
+========
+
+If the profiling result also says it's worth trying some optimization, you
+could consider below approaches. Note that some of the below approaches assume
+that your systems are configured with swap devices or other types of auxiliary
+memory so that you don't strictly required to accommodate the whole working set
+in the main memory. Most of the detailed optimization should be made on your
+concrete understanding of your memory devices.
+
+
+Memory Configuration
+--------------------
+
+No more no less, DRAM should be large enough to accommodate only important
+working sets, because DRAM is highly performance critical but expensive and
+heavily consumes the power. However, knowing the size of the real important
+working sets is difficult. As a consequence, people usually equips
+unnecessarily large or too small DRAM. Many problems stem from such wrong
+configurations.
+
+Using the working set size distribution report provided by ``damo report wss``,
+you can know the appropriate DRAM size for you. For example, roughly speaking,
+if you worry about only 95 percentile latency, you don't need to equip DRAM of
+a size larger than 95 percentile working set size.
+
+Let's see a real example. This `page
+<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#memory-configuration>`_
+shows the heatmap and the working set size distributions/changes of
+``freqmine`` workload in PARSEC3 benchmark suite. The working set size spikes
+up to 180 MiB, but keeps smaller than 50 MiB for more than 95% of the time.
+Even though you give only 50 MiB of memory space to the workload, it will work
+well for 95% of the time. Meanwhile, you can save the 130 MiB of memory space.
+
+
+Program Modification
+--------------------
+
+If the data access pattern heatmap plotted by ``damo report heats`` is quite
+simple so that you can understand how the things are going in the workload with
+your human eye, you could manually optimize the memory management.
+
+For example, suppose that the workload has two big memory object but only one
+object is frequently accessed while the other one is only occasionally
+accessed. Then, you could modify the program source code to keep the hot
+object in the main memory by invoking ``mlock()`` or ``madvise()`` with
+``MADV_WILLNEED``. Or, you could proactively evict the cold object using
+``madvise()`` with ``MADV_COLD`` or ``MADV_PAGEOUT``. Using both together
+would be also worthy.
+
+A research work [1]_ using the ``mlock()`` achieved up to 2.55x performance
+speedup.
+
+Let's see another realistic example access pattern for this kind of
+optimizations. This `page
+<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#program-modification>`_
+shows the visualized access patterns of streamcluster workload in PARSEC3
+benchmark suite. We can easily identify the 100 MiB sized hot object.
+
+
+Personalized DAMON Application
+------------------------------
+
+Above approaches will work well for many general cases, but would not enough
+for some special cases.
+
+If this is the case, it might be the time to forget the comfortable use of the
+user space tool and dive into the debugfs interface (refer to :doc:`usage` for
+the detail) of DAMON. Using the interface, you can control the DAMON more
+flexibly. Therefore, you can write your personalized DAMON application that
+controls the monitoring via the debugfs interface, analyzes the result, and
+applies complex optimizations itself. Using this, you can make more creative
+and wise optimizations.
+
+If you are a kernel space programmer, writing kernel space DAMON applications
+using the API (refer to the :doc:`/vm/damon/api` for more detail) would be an
+option.
+
+
+Reference Practices
+===================
+
+Referencing previously done successful practices could help you getting the
+sense for this kind of optimizations. There is an academic paper [1]_
+reporting the visualized access pattern and manual `Program
+Modification`_ results for a number of realistic workloads. You can also get
+the visualized access patterns [3]_ [4]_ [5]_ and automated DAMON-based memory
+operations results for other realistic workloads that collected with latest
+version of DAMON [2]_ .
+
+.. [1] https://dl.acm.org/doi/10.1145/3366626.3368125
+.. [2] https://damonitor.github.io/test/result/perf/latest/html/
+.. [3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+.. [5] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
new file mode 100644
index 000000000000..0baae7a5402b
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -0,0 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Monitoring Data Accesses
+========================
+
+:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+Using this, users can analyze and optimize their systems.
+
+.. toctree::
+ :maxdepth: 2
+
+ start
+ guide
+ usage
diff --git a/Documentation/admin-guide/mm/damon/plans.rst b/Documentation/admin-guide/mm/damon/plans.rst
new file mode 100644
index 000000000000..e3aa5ab96c29
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/plans.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Future Plans
+============
+
+DAMON is still on its first stage. Below plans are still under development.
+
+
+Automate Data Access Monitoring-based Memory Operation Schemes Execution
+========================================================================
+
+The ultimate goal of DAMON is to be used as a building block for the data
+access pattern aware kernel memory management optimization. It will make
+system just works efficiently. However, some users having very special
+workloads will want to further do their own optimization. DAMON will automate
+most of the tasks for such manual optimizations in near future. Users will be
+required to only describe what kind of data access pattern-based operation
+schemes they want in a simple form.
+
+By applying a very simple scheme for THP promotion/demotion with a prototype
+implementation, DAMON reduced 60% of THP memory footprint overhead while
+preserving 50% of the THP performance benefit. The detailed results can be
+seen on an external web page [1]_.
+
+Several RFC patchsets for this plan are available [2]_.
+
+.. [1] https://damonitor.github.io/test/result/perf/latest/html/
+.. [2] https://lore.kernel.org/linux-mm/[email protected]/
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
new file mode 100644
index 000000000000..deed2ea2321e
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Getting Started
+===============
+
+This document briefly describes how you can use DAMON by demonstrating its
+default user space tool. Please note that this document describes only a part
+of its features for brevity. Please refer to :doc:`usage` for more details.
+
+
+TL; DR
+======
+
+Follow below 5 commands to monitor and visualize the access pattern of your
+workload. ::
+
+ $ git clone https://github.com/sjp38/linux -b damon/master
+ /* build the kernel with CONFIG_DAMON=y, install, reboot */
+ $ mount -t debugfs none /sys/kernel/debug/
+ $ cd linux/tools/damon
+ $ ./damo record $(pidof <your workload>)
+ $ ./damo report heats --heatmap access_pattern.png
+
+
+Prerequisites
+=============
+
+Kernel
+------
+
+You should first ensure your system is running on a kernel built with
+``CONFIG_DAMON=y``.
+
+
+User Space Tool
+---------------
+
+For the demonstration, we will use the default user space tool for DAMON,
+called DAMON Operator (DAMO). It is located at ``tools/damon/damo`` of the
+kernel source tree. For brevity, below examples assume you set ``$PATH`` to
+point it. It's not mandatory, though.
+
+Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
+below::
+
+ # mount -t debugfs none /sys/kernel/debug/
+
+or append below line to your ``/etc/fstab`` file so that your system can
+automatically mount debugfs from next booting::
+
+ debugfs /sys/kernel/debug debugfs defaults 0 0
+
+
+Recording Data Access Patterns
+==============================
+
+Below commands record memory access pattern of a program and save the
+monitoring results in a file. ::
+
+ $ git clone https://github.com/sjp38/masim
+ $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ sudo damo record -o damon.data $(pidof masim)
+
+The first two lines of the commands get an artificial memory access generator
+program and runs it in the background. It will repeatedly access two 100 MiB
+sized memory regions one by one. You can substitute this with your real
+workload. The last line asks ``damo`` to record the access pattern in
+``damon.data`` file.
+
+
+Visualizing Recorded Patterns
+=============================
+
+Below three commands visualize the recorded access patterns into three
+image files. ::
+
+ $ damo report heats --heatmap access_pattern_heatmap.png
+ $ damo report wss --range 0 101 1 --plot wss_dist.png
+ $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
+
+- ``access_pattern_heatmap.png`` will show the data access pattern in a
+ heatmap, which shows when (x-axis) what memory region (y-axis) is how
+ frequently accessed (color).
+- ``wss_dist.png`` will show the distribution of the working set size.
+- ``wss_chron_change.png`` will show how the working set size has
+ chronologically changed.
+
+You can show the images in a web page [1]_ . Those made with other realistic
+workloads are also available [2]_ [3]_ [4]_.
+
+.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
+.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
new file mode 100644
index 000000000000..a6606d27a559
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,302 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Detailed Usages
+===============
+
+DAMON provides below three interfaces for different users.
+
+- *DAMON user space tool.*
+ This is for privileged people such as system administrators who want a
+ just-working human-friendly interface. Using this, users can use the DAMON’s
+ major features in a human-friendly way. It may not be highly tuned for
+ special cases, though. It supports only virtual address spaces monitoring.
+- *debugfs interface.*
+ This is for privileged user space programmers who want more optimized use of
+ DAMON. Using this, users can use DAMON’s major features by reading
+ from and writing to special debugfs files. Therefore, you can write and use
+ your personalized DAMON debugfs wrapper programs that reads/writes the
+ debugfs files instead of you. The DAMON user space tool is also a reference
+ implementation of such programs. It supports only virtual address spaces
+ monitoring.
+- *Kernel Space Programming Interface.*
+ This is for kernel space programmers. Using this, users can utilize every
+ feature of DAMON most flexibly and efficiently by writing kernel space
+ DAMON application programs for you. You can even extend DAMON for various
+ address spaces.
+
+This document does not describe the kernel space programming interface in
+detail. For that, please refer to the :doc:`/vm/damon/api`.
+
+
+DAMON User Space Tool
+=====================
+
+A reference implementation of the DAMON user space tools which provides a
+convenient user interface is in the kernel source tree. It is located at
+``tools/damon/damo`` of the tree.
+
+The tool provides a subcommands based interface. Every subcommand provides
+``-h`` option, which provides the minimal usage of it. Currently, the tool
+supports two subcommands, ``record`` and ``report``.
+
+Below example commands assume you set ``$PATH`` to point ``tools/damon/`` for
+brevity. It is not mandatory for use of ``damo``, though.
+
+
+Recording Data Access Pattern
+-----------------------------
+
+The ``record`` subcommand records the data access pattern of target workloads
+in a file (``./damon.data`` by default). You can specify the target with 1)
+the command for execution of the monitoring target process, or 2) pid of
+running target process. Below example shows a command target usage::
+
+ # cd <kernel>/tools/damon/
+ # damo record "sleep 5"
+
+The tool will execute ``sleep 5`` by itself and record the data access patterns
+of the process. Below example shows a pid target usage::
+
+ # sleep 5 &
+ # damo record `pidof sleep`
+
+The location of the recorded file can be explicitly set using ``-o`` option.
+You can further tune this by setting the monitoring attributes. To know about
+the monitoring attributes in detail, please refer to the
+:doc:`/vm/damon/design`.
+
+
+Analyzing Data Access Pattern
+-----------------------------
+
+The ``report`` subcommand reads a data access pattern record file (if not
+explicitly specified using ``-i`` option, reads ``./damon.data`` file by
+default) and generates human-readable reports. You can specify what type of
+report you want using a sub-subcommand to ``report`` subcommand. ``raw``,
+``heats``, and ``wss`` report types are supported for now.
+
+
+raw
+~~~
+
+``raw`` sub-subcommand simply transforms the binary record into a
+human-readable text. For example::
+
+ $ damo report raw
+ start_time: 193485829398
+ rel time: 0
+ nr_tasks: 1
+ target_id: 1348
+ nr_regions: 4
+ 560189609000-56018abce000( 22827008): 0
+ 7fbdff59a000-7fbdffaf1a00( 5601792): 0
+ 7fbdffaf1a00-7fbdffbb5000( 800256): 1
+ 7ffea0dc0000-7ffea0dfd000( 249856): 0
+
+ rel time: 100000731
+ nr_tasks: 1
+ target_id: 1348
+ nr_regions: 6
+ 560189609000-56018abce000( 22827008): 0
+ 7fbdff59a000-7fbdff8ce933( 3361075): 0
+ 7fbdff8ce933-7fbdffaf1a00( 2240717): 1
+ 7fbdffaf1a00-7fbdffb66d99( 480153): 0
+ 7fbdffb66d99-7fbdffbb5000( 320103): 1
+ 7ffea0dc0000-7ffea0dfd000( 249856): 0
+
+The first line shows the recording started timestamp (nanosecond). Records of
+data access patterns follows. Each record is separated by a blank line. Each
+record first specifies the recorded time (``rel time``) in relative to the
+start time, the number of monitored tasks in this record (``nr_tasks``).
+Recorded data access patterns of each task follow. Each data access pattern
+for each task shows the target's pid (``target_id``) and a number of monitored
+address regions in this access pattern (``nr_regions``) first. After that,
+each line shows the start/end address, size, and the number of observed
+accesses of each region.
+
+
+heats
+~~~~~
+
+The ``raw`` output is very detailed but hard to manually read. ``heats``
+sub-subcommand plots the data in 3-dimensional form, which represents the time
+in x-axis, address of regions in y-axis, and the access frequency in z-axis.
+Users can set the resolution of the map (``--tres`` and ``--ares``) and
+start/end point of each axis (``--tmin``, ``--tmax``, ``--amin``, and
+``--amax``) via optional arguments. For example::
+
+ $ damo report heats --tres 3 --ares 3
+ 0 0 0.0
+ 0 7609002 0.0
+ 0 15218004 0.0
+ 66112620851 0 0.0
+ 66112620851 7609002 0.0
+ 66112620851 15218004 0.0
+ 132225241702 0 0.0
+ 132225241702 7609002 0.0
+ 132225241702 15218004 0.0
+
+This command shows a recorded access pattern in heatmap of 3x3 resolution.
+Therefore it shows 9 data points in total. Each line shows each of the data
+points. The three numbers in each line represent time in nanosecond, address,
+and the observed access frequency.
+
+Users will be able to convert this text output into a heatmap image (represents
+z-axis values with colors) or other 3D representations using various tools such
+as 'gnuplot'. For more convenience, ``heats`` sub-subcommand provides the
+'gnuplot' based heatmap image creation. For this, you can use ``--heatmap``
+option. Also, note that because it uses 'gnuplot' internally, it will fail if
+'gnuplot' is not installed on your system. For example::
+
+ $ ./damo report heats --heatmap heatmap.png
+
+Creates the heatmap image in ``heatmap.png`` file. It supports ``pdf``,
+``png``, ``jpeg``, and ``svg``.
+
+If the target address space is virtual memory address space and you plot the
+entire address space, the huge unmapped regions will make the picture looks
+only black. Therefore you should do proper zoom in / zoom out using the
+resolution and axis boundary-setting arguments. To make this effort minimal,
+you can use ``--guide`` option as below::
+
+ $ ./damo report heats --guide
+ target_id:1348
+ time: 193485829398-198337863555 (4852034157)
+ region 0: 00000094564599762944-00000094564622589952 (22827008)
+ region 1: 00000140454009610240-00000140454016012288 (6402048)
+ region 2: 00000140731597193216-00000140731597443072 (249856)
+
+The output shows unions of monitored regions (start and end addresses in byte)
+and the union of monitored time duration (start and end time in nanoseconds) of
+each target task. Therefore, it would be wise to plot the data points in each
+union. If no axis boundary option is given, it will automatically find the
+biggest union in ``--guide`` output and set the boundary in it.
+
+
+wss
+~~~
+
+The ``wss`` type extracts the distribution and chronological working set size
+changes from the records. For example::
+
+ $ ./damo report wss
+ # <percentile> <wss>
+ # target_id 1348
+ # avr: 66228
+ 0 0
+ 25 0
+ 50 0
+ 75 0
+ 100 1920615
+
+Without any option, it shows the distribution of the working set sizes as
+above. It shows 0th, 25th, 50th, 75th, and 100th percentile and the average of
+the measured working set sizes in the access pattern records. In this case,
+the working set size was zero for 75th percentile but 1,920,615 bytes in max
+and 66,228 bytes on average.
+
+By setting the sort key of the percentile using '--sortby', you can show how
+the working set size has chronologically changed. For example::
+
+ $ ./damo report wss --sortby time
+ # <percentile> <wss>
+ # target_id 1348
+ # avr: 66228
+ 0 0
+ 25 0
+ 50 0
+ 75 0
+ 100 0
+
+The average is still 66,228. And, because the access was spiked in very short
+duration and this command plots only 4 data points, we cannot show when the
+access spikes made. Users can specify the resolution of the distribution
+(``--range``). By giving more fine resolution, the short duration spikes could
+be found.
+
+Similar to that of ``heats --heatmap``, it also supports 'gnuplot' based simple
+visualization of the distribution via ``--plot`` option.
+
+
+debugfs Interface
+=================
+
+DAMON exports four files, ``attrs``, ``target_ids``, ``record``, and
+``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
+
+
+Attributes
+----------
+
+Users can get and set the ``sampling interval``, ``aggregation interval``,
+``regions update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file. To know about the monitoring
+attributes in detail, please refer to the :doc:`/vm/damon/design`. For
+example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
+1000, and then check it again::
+
+ # cd <debugfs>/damon
+ # echo 5000 100000 1000000 10 1000 > attrs
+ # cat attrs
+ 5000 100000 1000000 10 1000
+
+
+Target IDs
+----------
+
+Some types of address spaces supports multiple monitoring target. For example,
+the virtual memory address spaces monitoring can have multiple processes as the
+monitoring targets. Users can set the targets by writing relevant id values of
+the targets to, and get the ids of the current targets by reading from the
+``target_ids`` file. In case of the virtual address spaces monitoring, the
+values should be pids of the monitoring target processes. For example, below
+commands set processes having pids 42 and 4242 as the monitoring targets and
+check it again::
+
+ # cd <debugfs>/damon
+ # echo 42 4242 > target_ids
+ # cat target_ids
+ 42 4242
+
+Note that setting the target ids doesn't start the monitoring.
+
+
+Record
+------
+
+This debugfs file allows you to record monitored access patterns in a regular
+binary file. The recorded results are first written in an in-memory buffer and
+flushed to a file in batch. Users can get and set the size of the buffer and
+the path to the result file by reading from and writing to the ``record`` file.
+For example, below commands set the buffer to be 4 KiB and the result to be
+saved in ``/damon.data``. ::
+
+ # cd <debugfs>/damon
+ # echo "4096 /damon.data" > record
+ # cat record
+ 4096 /damon.data
+
+The recording can be disabled by setting the buffer size zero.
+
+
+Turning On/Off
+--------------
+
+Setting the files as described above doesn't incur effect unless you explicitly
+start the monitoring. You can start, stop, and check the current status of the
+monitoring by writing to and reading from the ``monitor_on`` file. Writing
+``on`` to the file starts the monitoring of the targets with the attributes.
+Writing ``off`` to the file stops those. DAMON also stops if every target
+process is terminated. Below example commands turn on, off, and check the
+status of DAMON::
+
+ # cd <debugfs>/damon
+ # echo on > monitor_on
+ # echo off > monitor_on
+ # cat monitor_on
+ off
+
+Please note that you cannot write to the above-mentioned debugfs files while
+the monitoring is turned on. If you write to the files while DAMON is running,
+an error code such as ``-EBUSY`` will be returned.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 11db46448354..e6de5cd41945 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -27,6 +27,7 @@ the Linux memory management.

concepts
cma_debugfs
+ damon/index
hugetlbpage
idle_page_tracking
ksm
diff --git a/Documentation/vm/damon/api.rst b/Documentation/vm/damon/api.rst
new file mode 100644
index 000000000000..649409828eab
--- /dev/null
+++ b/Documentation/vm/damon/api.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+API Reference
+=============
+
+Kernel space programs can use every feature of DAMON using below APIs. All you
+need to do is including ``damon.h``, which is located in ``include/linux/`` of
+the source tree.
+
+Structures
+==========
+
+.. kernel-doc:: include/linux/damon.h
+
+
+Functions
+=========
+
+.. kernel-doc:: mm/damon.c
diff --git a/Documentation/vm/damon/design.rst b/Documentation/vm/damon/design.rst
new file mode 100644
index 000000000000..727d72093f8f
--- /dev/null
+++ b/Documentation/vm/damon/design.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+Design
+======
+
+Configurable Layers
+===================
+
+DAMON provides data access monitoring functionality while making the accuracy
+and the overhead controllable. The fundamental access monitorings require
+primitives that dependent on and optimized for the target address space. On
+the other hand, the accuracy and overhead tradeoff mechanism, which is the core
+of DAMON, is in the pure logic space. DAMON separates the two parts in
+different layers and defines its interface to allow various low level
+primitives implementations configurable with the core logic.
+
+Due to this separated design and the configurable interface, users can extend
+DAMON for any address space by configuring the core logics with appropriate low
+level primitive implementations. If appropriate one is not provided, users can
+implement the primitives on their own.
+
+For example, physical memory, virtual memory, swap space, those for specific
+processes, NUMA nodes, files, and backing memory devices would be supportable.
+Also, if some architectures or devices support special optimized access check
+primitives, those will be easily configurable.
+
+
+Reference Implementations of Address Space Specific Primitives
+==============================================================
+
+The low level primitives for the fundamental access monitoring are defined in
+two parts:
+
+1. Identification of the monitoring target address range for the address space.
+2. Access check of specific address range in the target space.
+
+DAMON currently provides the implementation of the primitives for only the
+virtual address spaces. Below two subsections describe how it works.
+
+
+PTE Accessed-bit Based Access Check
+-----------------------------------
+
+The implementation for the virtual address space uses PTE Accessed-bit for
+basic access checks. It finds the relevant PTE Accessed bit from the address
+by walking the page table for the target task of the address. In this way, the
+implementation finds and clears the bit for next sampling target address and
+checks whether the bit set again after one sampling period. This could disturb
+other kernel subsystems using the Accessed bits, namely Idle page tracking and
+the reclaim logic. To avoid such disturbances, DAMON makes it mutually
+exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page
+flags to solve the conflict with the reclaim logic, as Idle page tracking does.
+
+
+VMA-based Target Address Range Construction
+-------------------------------------------
+
+Only small parts in the super-huge virtual address space of the processes are
+mapped to the physical memory and accessed. Thus, tracking the unmapped
+address regions is just wasteful. However, because DAMON can deal with some
+level of noise using the adaptive regions adjustment mechanism, tracking every
+mapping is not strictly required but could even incur a high overhead in some
+cases. That said, too huge unmapped areas inside the monitoring target should
+be removed to not take the time for the adaptive mechanism.
+
+For the reason, this implementation converts the complex mappings to three
+distinct regions that cover every mapped area of the address space. The two
+gaps between the three regions are the two biggest unmapped areas in the given
+address space. The two biggest unmapped areas would be the gap between the
+heap and the uppermost mmap()-ed region, and the gap between the lowermost
+mmap()-ed region and the stack in most of the cases. Because these gaps are
+exceptionally huge in usual address spaces, excluding these will be sufficient
+to make a reasonable trade-off. Below shows this in detail::
+
+ <heap>
+ <BIG UNMAPPED REGION 1>
+ <uppermost mmap()-ed region>
+ (small mmap()-ed regions and munmap()-ed regions)
+ <lowermost mmap()-ed region>
+ <BIG UNMAPPED REGION 2>
+ <stack>
+
+
+Address Space Independent Core Mechanisms
+=========================================
+
+Below four sections describe each of the DAMON core mechanisms and the five
+monitoring attributes, ``sampling interval``, ``aggregation interval``,
+``regions update interval``, ``minimum number of regions``, and ``maximum
+number of regions``.
+
+
+Access Frequency Monitoring
+---------------------------
+
+The output of DAMON says what pages are how frequently accessed for a given
+duration. The resolution of the access frequency is controlled by setting
+``sampling interval`` and ``aggregation interval``. In detail, DAMON checks
+access to each page per ``sampling interval`` and aggregates the results. In
+other words, counts the number of the accesses to each page. After each
+``aggregation interval`` passes, DAMON calls callback functions that previously
+registered by users so that users can read the aggregated results and then
+clears the results. This can be described in below simple pseudo-code::
+
+ while monitoring_on:
+ for page in monitoring_target:
+ if accessed(page):
+ nr_accesses[page] += 1
+ if time() % aggregation_interval == 0:
+ for callback in user_registered_callbacks:
+ callback(monitoring_target, nr_accesses)
+ for page in monitoring_target:
+ nr_accesses[page] = 0
+ sleep(sampling interval)
+
+The monitoring overhead of this mechanism will arbitrarily increase as the
+size of the target workload grows.
+
+
+Region Based Sampling
+---------------------
+
+To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
+that assumed to have the same access frequencies into a region. As long as the
+assumption (pages in a region have the same access frequencies) is kept, only
+one page in the region is required to be checked. Thus, for each ``sampling
+interval``, DAMON randomly picks one page in each region, waits for one
+``sampling interval``, checks whether the page is accessed meanwhile, and
+increases the access frequency of the region if so. Therefore, the monitoring
+overhead is controllable by setting the number of regions. DAMON allows users
+to set the minimum and the maximum number of regions for the trade-off.
+
+This scheme, however, cannot preserve the quality of the output if the
+assumption is not guaranteed.
+
+
+Adaptive Regions Adjustment
+---------------------------
+
+Even somehow the initial monitoring target regions are well constructed to
+fulfill the assumption (pages in same region have similar access frequencies),
+the data access pattern can be dynamically changed. This will result in low
+monitoring quality. To keep the assumption as much as possible, DAMON
+adaptively merges and splits each region based on their access frequency.
+
+For each ``aggregation interval``, it compares the access frequencies of
+adjacent regions and merges those if the frequency difference is small. Then,
+after it reports and clears the aggregated access frequency of each region, it
+splits each region into two or three regions if the total number of regions
+will not exceed the user-specified maximum number of regions after the split.
+
+In this way, DAMON provides its best-effort quality and minimal overhead while
+keeping the bounds users set for their trade-off.
+
+
+Dynamic Target Space Updates Handling
+-------------------------------------
+
+The monitoring target address range could dynamically changed. For example,
+virtual memory could be dynamically mapped and unmapped. Physical memory could
+be hot-plugged.
+
+As the changes could be quite frequent in some cases, DAMON checks the dynamic
+memory mapping changes and applies it to the abstracted target area only for
+each of a user-specified time interval (``regions update interval``).
diff --git a/Documentation/vm/damon/eval.rst b/Documentation/vm/damon/eval.rst
new file mode 100644
index 000000000000..cb80c63c3ed2
--- /dev/null
+++ b/Documentation/vm/damon/eval.rst
@@ -0,0 +1,225 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Evaluation
+==========
+
+DAMON is lightweight. It increases system memory usage by 0.12% and slows
+target workloads down by 1.39%.
+
+DAMON is accurate and useful for memory management optimizations. An
+experimental DAMON-based operation scheme for THP, 'ethp', removes 88.16% of
+THP memory overheads while preserving 88.73% of THP speedup. Another
+experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
+reduces 91.34% of residential sets and 25.59% of system memory footprint while
+incurring only 1.58% runtime overhead in the best case (parsec3/freqmine).
+
+
+Setup
+=====
+
+On QEMU/KVM based virtual machines utilizing 130GB of RAM and 36 vCPUs hosted
+by AWS EC2 i3.metal instances that running a kernel that v20 DAMON patchset is
+applied, I measure runtime and consumed system memory while running various
+realistic workloads with several configurations. I use 13 and 12 workloads in
+PARSEC3 [3]_ and SPLASH-2X [4]_ benchmark suites, respectively. I use another
+wrapper scripts [5]_ for convenient setup and run of the workloads.
+
+
+Measurement
+-----------
+
+For the measurement of the amount of consumed memory in system global scope, I
+drop caches before starting each of the workloads and monitor 'MemFree' in the
+'/proc/meminfo' file. To make results more stable, I repeat the runs 5 times
+and average results.
+
+
+Configurations
+--------------
+
+The configurations I use are as below.
+
+- orig: Linux v5.8 with 'madvise' THP policy
+- rec: 'orig' plus DAMON running with virtual memory access recording
+- prec: 'orig' plus DAMON running with physical memory access recording
+- thp: same with 'orig', but use 'always' THP policy
+- ethp: 'orig' plus a DAMON operation scheme, 'efficient THP'
+- prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim [6]_'
+
+I use 'rec' for measurement of DAMON overheads to target workloads and system
+memory. 'prec' is for physical memory monitroing and recording. It monitors
+17GB sized 'System RAM' region. The remaining configs including 'thp', 'ethp',
+and 'prcl' are for measurement of DAMON monitoring accuracy.
+
+'ethp' and 'prcl' are simple DAMON-based operation schemes developed for
+proof of concepts of DAMON. 'ethp' reduces memory space waste of THP by using
+DAMON for the decision of promotions and demotion for huge pages, while 'prcl'
+is as similar as the original work. Those are implemented as below::
+
+ # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
+ # ethp: Use huge pages if a region shows >=5% access rate, use regular
+ # pages if a region >=2MB shows 0 access rate for >=7 seconds
+ min max 5 max min max hugepage
+ 2M max min min 7s max nohugepage
+
+ # prcl: If a region >=4KB shows 0 access rate for >=10 seconds, page out.
+ 4K max 0 0 10s max pageout
+
+Note that both 'ethp' and 'prcl' are designed with my only straightforward
+intuition because those are for only proof of concepts and monitoring accuracy
+of DAMON. In other words, those are not for production. For production use,
+those should be more tuned.
+
+.. [1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
+.. [2] "Disable Transparent Huge Pages (THP)",
+ https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
+.. [3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm
+.. [4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
+.. [5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu
+.. [6] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/
+
+
+Results
+=======
+
+Below two tables show the measurement results. The runtimes are in seconds
+while the memory usages are in KiB. Each configuration except 'orig' shows
+its overhead relative to 'orig' in percent within parenthesizes.::
+
+ runtime orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
+ parsec3/blackscholes 137.688 139.910 (1.61) 138.226 (0.39) 138.524 (0.61) 138.548 (0.62) 150.562 (9.35)
+ parsec3/bodytrack 124.496 123.294 (-0.97) 124.482 (-0.01) 124.874 (0.30) 123.514 (-0.79) 126.380 (1.51)
+ parsec3/canneal 196.513 209.465 (6.59) 223.213 (13.59) 189.302 (-3.67) 199.453 (1.50) 242.217 (23.26)
+ parsec3/dedup 18.060 18.128 (0.38) 18.378 (1.76) 18.210 (0.83) 18.397 (1.87) 20.545 (13.76)
+ parsec3/facesim 343.697 344.917 (0.36) 341.367 (-0.68) 337.696 (-1.75) 344.805 (0.32) 361.169 (5.08)
+ parsec3/ferret 288.868 286.110 (-0.95) 292.308 (1.19) 287.814 (-0.36) 284.243 (-1.60) 284.200 (-1.62)
+ parsec3/fluidanimate 342.267 337.743 (-1.32) 330.680 (-3.39) 337.356 (-1.43) 340.604 (-0.49) 343.565 (0.38)
+ parsec3/freqmine 437.385 436.854 (-0.12) 437.641 (0.06) 435.008 (-0.54) 436.998 (-0.09) 444.276 (1.58)
+ parsec3/raytrace 183.036 182.039 (-0.54) 184.859 (1.00) 187.330 (2.35) 185.660 (1.43) 209.707 (14.57)
+ parsec3/streamcluster 611.075 675.108 (10.48) 656.373 (7.41) 541.711 (-11.35) 473.679 (-22.48) 815.450 (33.45)
+ parsec3/swaptions 220.338 220.948 (0.28) 220.891 (0.25) 220.387 (0.02) 219.986 (-0.16) -100.000 (0.00)
+ parsec3/vips 87.710 88.581 (0.99) 88.423 (0.81) 88.460 (0.86) 88.471 (0.87) 89.661 (2.22)
+ parsec3/x264 114.927 117.774 (2.48) 116.630 (1.48) 112.237 (-2.34) 110.709 (-3.67) 124.560 (8.38)
+ splash2x/barnes 131.034 130.895 (-0.11) 129.088 (-1.48) 118.213 (-9.78) 124.497 (-4.99) 167.966 (28.19)
+ splash2x/fft 59.805 60.237 (0.72) 59.895 (0.15) 47.008 (-21.40) 57.962 (-3.08) 87.183 (45.78)
+ splash2x/lu_cb 132.353 132.157 (-0.15) 132.473 (0.09) 131.561 (-0.60) 135.541 (2.41) 141.720 (7.08)
+ splash2x/lu_ncb 149.050 150.496 (0.97) 151.912 (1.92) 150.974 (1.29) 148.329 (-0.48) 152.227 (2.13)
+ splash2x/ocean_cp 82.189 77.735 (-5.42) 84.466 (2.77) 77.498 (-5.71) 82.586 (0.48) 113.737 (38.38)
+ splash2x/ocean_ncp 154.934 154.656 (-0.18) 164.204 (5.98) 101.861 (-34.26) 142.600 (-7.96) 281.650 (81.79)
+ splash2x/radiosity 142.710 141.643 (-0.75) 143.940 (0.86) 141.982 (-0.51) 142.017 (-0.49) 152.116 (6.59)
+ splash2x/radix 50.357 50.331 (-0.05) 50.717 (0.72) 45.664 (-9.32) 50.222 (-0.27) 73.981 (46.91)
+ splash2x/raytrace 134.039 132.650 (-1.04) 134.583 (0.41) 131.570 (-1.84) 133.050 (-0.74) 141.463 (5.54)
+ splash2x/volrend 120.769 120.220 (-0.45) 119.895 (-0.72) 120.159 (-0.50) 119.311 (-1.21) 119.581 (-0.98)
+ splash2x/water_nsquared 376.599 373.411 (-0.85) 382.601 (1.59) 348.701 (-7.41) 357.033 (-5.20) 397.427 (5.53)
+ splash2x/water_spatial 132.619 133.432 (0.61) 135.505 (2.18) 134.865 (1.69) 133.940 (1.00) 148.196 (11.75)
+ total 4772.510 4838.740 (1.39) 4862.740 (1.89) 4568.970 (-4.26) 4592.160 (-3.78) 5189.560 (8.74)
+
+
+ memused.avg orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
+ parsec3/blackscholes 1825022.800 1863815.200 (2.13) 1830082.000 (0.28) 1800999.800 (-1.32) 1807743.800 (-0.95) 1580027.800 (-13.42)
+ parsec3/bodytrack 1425506.800 1438323.400 (0.90) 1439260.600 (0.96) 1400505.600 (-1.75) 1412295.200 (-0.93) 1412759.600 (-0.89)
+ parsec3/canneal 1040902.600 1050404.000 (0.91) 1053535.200 (1.21) 1027175.800 (-1.32) 1035229.400 (-0.55) 1039159.400 (-0.17)
+ parsec3/dedup 2526700.400 2540671.600 (0.55) 2503689.800 (-0.91) 2544440.200 (0.70) 2510519.000 (-0.64) 2503148.200 (-0.93)
+ parsec3/facesim 545844.600 550680.000 (0.89) 543658.600 (-0.40) 532320.200 (-2.48) 539429.600 (-1.18) 470836.800 (-13.74)
+ parsec3/ferret 352118.600 326782.600 (-7.20) 322645.600 (-8.37) 304054.800 (-13.65) 317259.000 (-9.90) 313532.400 (-10.96)
+ parsec3/fluidanimate 651597.600 580045.200 (-10.98) 578297.400 (-11.25) 569431.600 (-12.61) 577322.800 (-11.40) 482061.600 (-26.02)
+ parsec3/freqmine 989212.000 996291.200 (0.72) 989405.000 (0.02) 970891.000 (-1.85) 981122.000 (-0.82) 736030.000 (-25.59)
+ parsec3/raytrace 1749470.400 1751183.200 (0.10) 1740937.600 (-0.49) 1717138.800 (-1.85) 1731298.200 (-1.04) 1528069.000 (-12.66)
+ parsec3/streamcluster 123425.400 151548.200 (22.79) 144024.800 (16.69) 118379.000 (-4.09) 124845.400 (1.15) 118629.800 (-3.89)
+ parsec3/swaptions 4150.600 25679.200 (518.69) 19914.800 (379.80) 8577.000 (106.64) 17348.200 (317.97) -100.000 (0.00)
+ parsec3/vips 2989801.200 3003285.400 (0.45) 3012055.400 (0.74) 2958369.000 (-1.05) 2970897.800 (-0.63) 2962063.000 (-0.93)
+ parsec3/x264 3242663.400 3256091.000 (0.41) 3248949.400 (0.19) 3195605.400 (-1.45) 3206571.600 (-1.11) 3219046.333 (-0.73)
+ splash2x/barnes 1208017.600 1212702.600 (0.39) 1194143.600 (-1.15) 1208450.200 (0.04) 1212607.600 (0.38) 878554.667 (-27.27)
+ splash2x/fft 9786259.000 9705563.600 (-0.82) 9391006.800 (-4.04) 9967230.600 (1.85) 9657639.400 (-1.31) 10215759.333 (4.39)
+ splash2x/lu_cb 512130.400 521431.800 (1.82) 513051.400 (0.18) 508534.200 (-0.70) 512643.600 (0.10) 328017.333 (-35.95)
+ splash2x/lu_ncb 511156.200 526566.400 (3.01) 513230.400 (0.41) 509823.800 (-0.26) 516302.000 (1.01) 418078.333 (-18.21)
+ splash2x/ocean_cp 3353269.200 3319496.000 (-1.01) 3251575.000 (-3.03) 3379639.800 (0.79) 3326416.600 (-0.80) 3143859.667 (-6.24)
+ splash2x/ocean_ncp 3905538.200 3914929.600 (0.24) 3877493.200 (-0.72) 7053949.400 (80.61) 4633035.000 (18.63) 3527482.667 (-9.68)
+ splash2x/radiosity 1462030.400 1468050.000 (0.41) 1454997.600 (-0.48) 1466985.400 (0.34) 1461777.400 (-0.02) 441332.000 (-69.81)
+ splash2x/radix 2367200.800 2363995.000 (-0.14) 2251124.600 (-4.90) 2417603.800 (2.13) 2317804.000 (-2.09) 2495581.667 (5.42)
+ splash2x/raytrace 42356.200 56270.200 (32.85) 49419.000 (16.67) 86408.400 (104.00) 50547.600 (19.34) 40341.000 (-4.76)
+ splash2x/volrend 148631.600 162954.600 (9.64) 153305.200 (3.14) 140089.200 (-5.75) 149831.200 (0.81) 150232.000 (1.08)
+ splash2x/water_nsquared 39835.800 54268.000 (36.23) 53659.400 (34.70) 41073.600 (3.11) 85322.600 (114.19) 49463.667 (24.17)
+ splash2x/water_spatial 669746.600 679634.200 (1.48) 667518.600 (-0.33) 664383.800 (-0.80) 684470.200 (2.20) 401946.000 (-39.99)
+ total 41472600.000 41520700.000 (0.12) 40796900.000 (-1.63) 44592000.000 (7.52) 41840100.000 (0.89) 38456146.000 (-7.27)
+
+
+DAMON Overheads
+---------------
+
+In total, DAMON virtual memory access recording feature ('rec') incurs 1.39%
+runtime overhead and 0.12% memory space overhead. Even though the size of the
+monitoring target region becomes much larger with the physical memory access
+recording ('prec'), it still shows only modest amount of overhead (1.89% for
+runtime and -1.63% for memory footprint).
+
+For a convenient test run of 'rec' and 'prec', I use a Python wrapper. The
+wrapper constantly consumes about 10-15MB of memory. This becomes a high
+memory overhead if the target workload has a small memory footprint.
+Nonetheless, the overheads are not from DAMON, but from the wrapper, and thus
+should be ignored. This fake memory overhead continues in 'ethp' and 'prcl',
+as those configurations are also using the Python wrapper.
+
+
+Efficient THP
+-------------
+
+THP 'always' enabled policy achieves 4.26% speedup but incurs 7.52% memory
+overhead. It achieves 34.26% speedup in the best case, but 80.61% memory
+overhead in the worst case. Interestingly, both the best and worst-case are
+with 'splash2x/ocean_ncp').
+
+The 2-lines implementation of data access monitoring based THP version ('ethp')
+shows 3.78% speedup and 0.89% memory overhead. In other words, 'ethp' removes
+88.16% of THP memory waste while preserving 88.73% of THP speedup in total. In
+the case of the 'splash2x/ocean_ncp', 'ethp' removes 76.90% of THP memory waste
+while preserving 23.23% of THP speedup.
+
+
+Proactive Reclamation
+---------------------
+
+As similar to the original work, I use 4G 'zram' swap device for this
+configuration.
+
+In total, our 1 line implementation of Proactive Reclamation, 'prcl', incurred
+8.74% runtime overhead in total while achieving 7.27% system memory footprint
+reduction.
+
+Nonetheless, as the memory usage is calculated with 'MemFree' in
+'/proc/meminfo', it contains the SwapCached pages. As the swapcached pages can
+be easily evicted, I also measured the residential set size of the workloads::
+
+ rss.avg orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
+ parsec3/blackscholes 587078.800 586930.400 (-0.03) 586355.200 (-0.12) 586147.400 (-0.16) 585203.400 (-0.32) 243110.800 (-58.59)
+ parsec3/bodytrack 32470.800 32488.400 (0.05) 32351.000 (-0.37) 32433.400 (-0.12) 32429.000 (-0.13) 18804.800 (-42.09)
+ parsec3/canneal 842418.600 842442.800 (0.00) 844396.000 (0.23) 840756.400 (-0.20) 841242.000 (-0.14) 825296.200 (-2.03)
+ parsec3/dedup 1180100.000 1179309.200 (-0.07) 1160477.800 (-1.66) 1198789.200 (1.58) 1171802.600 (-0.70) 595531.600 (-49.54)
+ parsec3/facesim 312056.000 312109.200 (0.02) 312044.400 (-0.00) 318102.200 (1.94) 316239.600 (1.34) 192002.600 (-38.47)
+ parsec3/ferret 99792.200 99641.800 (-0.15) 99044.800 (-0.75) 102041.800 (2.25) 100854.000 (1.06) 83628.200 (-16.20)
+ parsec3/fluidanimate 530735.400 530759.000 (0.00) 530865.200 (0.02) 532440.800 (0.32) 522778.600 (-1.50) 433547.400 (-18.31)
+ parsec3/freqmine 552951.000 552788.000 (-0.03) 552761.800 (-0.03) 556004.400 (0.55) 554001.200 (0.19) 47881.200 (-91.34)
+ parsec3/raytrace 883966.600 880061.400 (-0.44) 883144.800 (-0.09) 871786.400 (-1.38) 881000.200 (-0.34) 267210.800 (-69.77)
+ parsec3/streamcluster 110901.600 110863.400 (-0.03) 110893.600 (-0.01) 115612.600 (4.25) 114976.800 (3.67) 109728.600 (-1.06)
+ parsec3/swaptions 5708.800 5712.400 (0.06) 5681.400 (-0.48) 5720.400 (0.20) 5726.000 (0.30) -100.000 (0.00)
+ parsec3/vips 32272.200 32427.400 (0.48) 31959.800 (-0.97) 34177.800 (5.90) 33306.400 (3.20) 28869.000 (-10.55)
+ parsec3/x264 81878.000 81914.200 (0.04) 81823.600 (-0.07) 83579.400 (2.08) 83236.800 (1.66) 81220.667 (-0.80)
+ splash2x/barnes 1211917.400 1211328.200 (-0.05) 1212450.400 (0.04) 1221951.000 (0.83) 1218924.600 (0.58) 489430.333 (-59.62)
+ splash2x/fft 9874359.000 9934912.400 (0.61) 9843789.600 (-0.31) 10204484.600 (3.34) 9980640.400 (1.08) 7003881.000 (-29.07)
+ splash2x/lu_cb 509066.200 509222.600 (0.03) 509059.600 (-0.00) 509594.600 (0.10) 509479.000 (0.08) 315538.667 (-38.02)
+ splash2x/lu_ncb 509192.200 508437.000 (-0.15) 509331.000 (0.03) 509606.000 (0.08) 509578.200 (0.08) 412065.667 (-19.07)
+ splash2x/ocean_cp 3380283.800 3380301.000 (0.00) 3377617.200 (-0.08) 3416531.200 (1.07) 3389845.200 (0.28) 2398084.000 (-29.06)
+ splash2x/ocean_ncp 3917913.600 3924529.200 (0.17) 3934911.800 (0.43) 7123907.400 (81.83) 4703623.600 (20.05) 2428288.000 (-38.02)
+ splash2x/radiosity 1467978.600 1468655.400 (0.05) 1467534.000 (-0.03) 1477722.600 (0.66) 1471036.000 (0.21) 148573.333 (-89.88)
+ splash2x/radix 2413933.400 2408367.600 (-0.23) 2381122.400 (-1.36) 2480169.400 (2.74) 2367118.800 (-1.94) 1848857.000 (-23.41)
+ splash2x/raytrace 23280.000 23272.800 (-0.03) 23259.000 (-0.09) 28715.600 (23.35) 28354.400 (21.80) 13302.333 (-42.86)
+ splash2x/volrend 44079.400 44091.600 (0.03) 44022.200 (-0.13) 44547.200 (1.06) 44615.600 (1.22) 29833.000 (-32.32)
+ splash2x/water_nsquared 29392.800 29425.600 (0.11) 29422.400 (0.10) 30317.800 (3.15) 30602.200 (4.11) 21769.000 (-25.94)
+ splash2x/water_spatial 658604.400 660276.800 (0.25) 660334.000 (0.26) 660491.000 (0.29) 660636.400 (0.31) 304246.667 (-53.80)
+ total 29292400.000 29350400.000 (0.20) 29224634.000 (-0.23) 32985491.000 (12.61) 30157300.000 (2.95) 18340700.000 (-37.39)
+
+In total, 37.39% of residential sets were reduced.
+
+With parsec3/freqmine, 'prcl' reduced 91.34% of residential sets and 25.59% of
+system memory usage while incurring only 1.58% runtime overhead.
diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst
new file mode 100644
index 000000000000..088128bbf22b
--- /dev/null
+++ b/Documentation/vm/damon/faq.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Frequently Asked Questions
+==========================
+
+Why a new subsystem, instead of extending perf or other user space tools?
+=========================================================================
+
+First, because it needs to be lightweight as much as possible so that it can be
+used online, any unnecessary overhead such as kernel - user space context
+switching cost should be avoided. Second, DAMON aims to be used by other
+programs including the kernel. Therefore, having a dependency on specific
+tools like perf is not desirable. These are the two biggest reasons why DAMON
+is implemented in the kernel space.
+
+
+Can 'idle pages tracking' or 'perf mem' substitute DAMON?
+=========================================================
+
+Idle page tracking is a low level primitive for access check of the physical
+address space. 'perf mem' is similar, though it can use sampling to minimize
+the overhead. On the other hand, DAMON is a higher-level framework for the
+monitoring of various address spaces. It is focused on memory management
+optimization and provides sophisticated accuracy/overhead handling mechanisms.
+Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
+DAMON's output, but cannot substitute DAMON.
+
+
+How can I optimize my system's memory management using DAMON?
+=============================================================
+
+Because there are several ways for the DAMON-based optimizations, we wrote a
+separate document, :doc:`/admin-guide/mm/damon/guide`. Please refer to that.
+
+
+Does DAMON support virtual memory only?
+=======================================
+
+No. The core of the DAMON is address space independent. The address space
+specific low level primitive parts including monitoring target regions
+constructions and actual access checks can be implemented and configured on the
+DAMON core by the users. In this way, DAMON users can monitor any address
+space with any access check technique.
+
+Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
+implementations of the address space dependent functions for the virtual memory
+by default, for a reference and convenient use. In near future, we will
+provide those for physical memory address space.
+
+
+Can I simply monitor page granularity?
+======================================
+
+Yes. You can do so by setting the ``min_nr_regions`` attribute higher than the
+working set size divided by the page size. Because the monitoring target
+regions size is forced to be ``>=page size``, the region split will make no
+effect.
diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst
new file mode 100644
index 000000000000..17dca3c12aad
--- /dev/null
+++ b/Documentation/vm/damon/index.rst
@@ -0,0 +1,31 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+DAMON is a data access monitoring framework subsystem for the Linux kernel.
+The core mechanisms of DAMON (refer to :doc:`design` for the detail) make it
+
+ - *accurate* (the monitoring output is useful enough for DRAM level memory
+ management; It might not appropriate for CPU Cache levels, though),
+ - *light-weight* (the monitoring overhead is low enough to be applied online),
+ and
+ - *scalable* (the upper-bound of the overhead is in constant range regardless
+ of the size of target workloads).
+
+Using this framework, therefore, the kernel's memory management mechanisms can
+make advanced decisions. Experimental memory management optimization works
+that incurring high data accesses monitoring overhead could implemented again.
+In user space, meanwhile, users who have some special workloads can write
+personalized applications for better understanding and optimizations of their
+workloads and systems.
+
+.. toctree::
+ :maxdepth: 2
+
+ faq
+ design
+ eval
+ api
+ plans
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 611140ffef7e..8d8d088bc7af 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
active_mm
balance
cleancache
+ damon/index
free_page_reporting
frontswap
highmem
--
2.17.1

2020-08-17 11:31:45

by SeongJae Park

[permalink] [raw]
Subject: [PATCH v20 06/15] mm/damon: Implement callbacks for the virtual memory address spaces

From: SeongJae Park <[email protected]>

This commit introduces a reference implementation of the address space
specific low level primitives for the virtual address space, so that
users of DAMON can easily monitor the data accesses on virtual address
spaces of specific processes by simply configuring the implementation to
be used by DAMON.

The low level primitives for the fundamental access monitoring are
defined in two parts:
1. Identification of the monitoring target address range for the address
space.
2. Access check of specific address range in the target space.

The reference implementation for the virtual address space provided by
this commit is designed as below.

PTE Accessed-bit Based Access Check
-----------------------------------

The implementation uses PTE Accessed-bit for basic access checks. That
is, it clears the bit for next sampling target page and checks whether
it set again after one sampling period. This could disturb other kernel
subsystems using the Accessed bits, namely Idle page tracking and the
reclaim logic. To avoid such disturbances, DAMON makes it mutually
exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young``
page flags to solve the conflict with the reclaim logics, as Idle page
tracking does.

VMA-based Target Address Range Construction
-------------------------------------------

Only small parts in the super-huge virtual address space of the
processes are mapped to physical memory and accessed. Thus, tracking
the unmapped address regions is just wasteful. However, because DAMON
can deal with some level of noise using the adaptive regions adjustment
mechanism, tracking every mapping is not strictly required but could
even incur a high overhead in some cases. That said, too huge unmapped
areas inside the monitoring target should be removed to not take the
time for the adaptive mechanism.

For the reason, this implementation converts the complex mappings to
three distinct regions that cover every mapped area of the address
space. Also, the two gaps between the three regions are the two biggest
unmapped areas in the given address space. The two biggest unmapped
areas would be the gap between the heap and the uppermost mmap()-ed
region, and the gap between the lowermost mmap()-ed region and the stack
in most of the cases. Because these gaps are exceptionally huge in
usual address spacees, excluding these will be sufficient to make a
reasonable trade-off. Below shows this in detail::

<heap>
<BIG UNMAPPED REGION 1>
<uppermost mmap()-ed region>
(small mmap()-ed regions and munmap()-ed regions)
<lowermost mmap()-ed region>
<BIG UNMAPPED REGION 2>
<stack>

Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 8 +
mm/Kconfig | 3 +
mm/damon.c | 552 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 563 insertions(+)

diff --git a/include/linux/damon.h b/include/linux/damon.h
index 65533be91735..945eb9259acf 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -159,6 +159,14 @@ struct damon_ctx {
void (*aggregate_cb)(struct damon_ctx *context);
};

+/* Reference callback implementations for virtual memory */
+void kdamond_init_vm_regions(struct damon_ctx *ctx);
+void kdamond_update_vm_regions(struct damon_ctx *ctx);
+void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
+unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
+bool kdamond_vm_target_valid(struct damon_target *t);
+void kdamond_vm_cleanup(struct damon_ctx *ctx);
+
int damon_set_targets(struct damon_ctx *ctx,
unsigned long *ids, ssize_t nr_ids);
int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
diff --git a/mm/Kconfig b/mm/Kconfig
index 8b1dacc60a8e..21cbd394bc78 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -882,6 +882,9 @@ config MAPPING_DIRTY_HELPERS

config DAMON
bool "Data Access Monitor"
+ depends on MMU && !IDLE_PAGE_TRACKING
+ select PAGE_EXTENSION if !64BIT
+ select PAGE_IDLE_FLAG
help
This feature allows to monitor access frequency of each memory
region. The information can be useful for performance-centric DRAM
diff --git a/mm/damon.c b/mm/damon.c
index 59d0b59858ae..cec151d60755 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -9,6 +9,10 @@
* This file is constructed in below parts.
*
* - Functions and macros for DAMON data structures
+ * - Functions for the initial monitoring target regions construction
+ * - Functions for the dynamic monitoring target regions update
+ * - Functions for the access checking of the regions
+ * - Functions for the target validity check and cleanup
* - Functions for DAMON core logics and features
* - Functions for the DAMON programming interface
* - Functions for the initialization
@@ -20,6 +24,7 @@
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
#include <linux/module.h>
#include <linux/page_idle.h>
#include <linux/random.h>
@@ -196,6 +201,553 @@ static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
return sz;
}

+/*
+ * Functions for the initial monitoring target regions construction
+ */
+
+/*
+ * 't->id' should be the pointer to the relevant 'struct pid' having reference
+ * count. Caller must put the returned task, unless it is NULL.
+ */
+#define damon_get_task_struct(t) \
+ (get_pid_task((struct pid *)t->id, PIDTYPE_PID))
+
+/*
+ * Get the mm_struct of the given target
+ *
+ * Caller _must_ put the mm_struct after use, unless it is NULL.
+ *
+ * Returns the mm_struct of the target on success, NULL on failure
+ */
+static struct mm_struct *damon_get_mm(struct damon_target *t)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ task = damon_get_task_struct(t);
+ if (!task)
+ return NULL;
+
+ mm = get_task_mm(task);
+ put_task_struct(task);
+ return mm;
+}
+
+/*
+ * Size-evenly split a region into 'nr_pieces' small regions
+ *
+ * Returns 0 on success, or negative error code otherwise.
+ */
+static int damon_split_region_evenly(struct damon_ctx *ctx,
+ struct damon_region *r, unsigned int nr_pieces)
+{
+ unsigned long sz_orig, sz_piece, orig_end;
+ struct damon_region *n = NULL, *next;
+ unsigned long start;
+
+ if (!r || !nr_pieces)
+ return -EINVAL;
+
+ orig_end = r->ar.end;
+ sz_orig = r->ar.end - r->ar.start;
+ sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, MIN_REGION);
+
+ if (!sz_piece)
+ return -EINVAL;
+
+ r->ar.end = r->ar.start + sz_piece;
+ next = damon_next_region(r);
+ for (start = r->ar.end; start + sz_piece <= orig_end;
+ start += sz_piece) {
+ n = damon_new_region(start, start + sz_piece);
+ if (!n)
+ return -ENOMEM;
+ damon_insert_region(n, r, next);
+ r = n;
+ }
+ /* complement last region for possible rounding error */
+ if (n)
+ n->ar.end = orig_end;
+
+ return 0;
+}
+
+static unsigned long sz_range(struct damon_addr_range *r)
+{
+ return r->end - r->start;
+}
+
+static void swap_ranges(struct damon_addr_range *r1,
+ struct damon_addr_range *r2)
+{
+ struct damon_addr_range tmp;
+
+ tmp = *r1;
+ *r1 = *r2;
+ *r2 = tmp;
+}
+
+/*
+ * Find three regions separated by two biggest unmapped regions
+ *
+ * vma the head vma of the target address space
+ * regions an array of three address ranges that results will be saved
+ *
+ * This function receives an address space and finds three regions in it which
+ * separated by the two biggest unmapped regions in the space. Please refer to
+ * below comments of 'damon_init_vm_regions_of()' function to know why this is
+ * necessary.
+ *
+ * Returns 0 if success, or negative error code otherwise.
+ */
+static int damon_three_regions_in_vmas(struct vm_area_struct *vma,
+ struct damon_addr_range regions[3])
+{
+ struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0};
+ struct vm_area_struct *last_vma = NULL;
+ unsigned long start = 0;
+ struct rb_root rbroot;
+
+ /* Find two biggest gaps so that first_gap > second_gap > others */
+ for (; vma; vma = vma->vm_next) {
+ if (!last_vma) {
+ start = vma->vm_start;
+ goto next;
+ }
+
+ if (vma->rb_subtree_gap <= sz_range(&second_gap)) {
+ rbroot.rb_node = &vma->vm_rb;
+ vma = rb_entry(rb_last(&rbroot),
+ struct vm_area_struct, vm_rb);
+ goto next;
+ }
+
+ gap.start = last_vma->vm_end;
+ gap.end = vma->vm_start;
+ if (sz_range(&gap) > sz_range(&second_gap)) {
+ swap_ranges(&gap, &second_gap);
+ if (sz_range(&second_gap) > sz_range(&first_gap))
+ swap_ranges(&second_gap, &first_gap);
+ }
+next:
+ last_vma = vma;
+ }
+
+ if (!sz_range(&second_gap) || !sz_range(&first_gap))
+ return -EINVAL;
+
+ /* Sort the two biggest gaps by address */
+ if (first_gap.start > second_gap.start)
+ swap_ranges(&first_gap, &second_gap);
+
+ /* Store the result */
+ regions[0].start = ALIGN(start, MIN_REGION);
+ regions[0].end = ALIGN(first_gap.start, MIN_REGION);
+ regions[1].start = ALIGN(first_gap.end, MIN_REGION);
+ regions[1].end = ALIGN(second_gap.start, MIN_REGION);
+ regions[2].start = ALIGN(second_gap.end, MIN_REGION);
+ regions[2].end = ALIGN(last_vma->vm_end, MIN_REGION);
+
+ return 0;
+}
+
+/*
+ * Get the three regions in the given target (task)
+ *
+ * Returns 0 on success, negative error code otherwise.
+ */
+static int damon_three_regions_of(struct damon_target *t,
+ struct damon_addr_range regions[3])
+{
+ struct mm_struct *mm;
+ int rc;
+
+ mm = damon_get_mm(t);
+ if (!mm)
+ return -EINVAL;
+
+ mmap_read_lock(mm);
+ rc = damon_three_regions_in_vmas(mm->mmap, regions);
+ mmap_read_unlock(mm);
+
+ mmput(mm);
+ return rc;
+}
+
+/*
+ * Initialize the monitoring target regions for the given target (task)
+ *
+ * t the given target
+ *
+ * Because only a number of small portions of the entire address space
+ * is actually mapped to the memory and accessed, monitoring the unmapped
+ * regions is wasteful. That said, because we can deal with small noises,
+ * tracking every mapping is not strictly required but could even incur a high
+ * overhead if the mapping frequently changes or the number of mappings is
+ * high. The adaptive regions adjustment mechanism will further help to deal
+ * with the noise by simply identifying the unmapped areas as a region that
+ * has no access. Moreover, applying the real mappings that would have many
+ * unmapped areas inside will make the adaptive mechanism quite complex. That
+ * said, too huge unmapped areas inside the monitoring target should be removed
+ * to not take the time for the adaptive mechanism.
+ *
+ * For the reason, we convert the complex mappings to three distinct regions
+ * that cover every mapped area of the address space. Also the two gaps
+ * between the three regions are the two biggest unmapped areas in the given
+ * address space. In detail, this function first identifies the start and the
+ * end of the mappings and the two biggest unmapped areas of the address space.
+ * Then, it constructs the three regions as below:
+ *
+ * [mappings[0]->start, big_two_unmapped_areas[0]->start)
+ * [big_two_unmapped_areas[0]->end, big_two_unmapped_areas[1]->start)
+ * [big_two_unmapped_areas[1]->end, mappings[nr_mappings - 1]->end)
+ *
+ * As usual memory map of processes is as below, the gap between the heap and
+ * the uppermost mmap()-ed region, and the gap between the lowermost mmap()-ed
+ * region and the stack will be two biggest unmapped regions. Because these
+ * gaps are exceptionally huge areas in usual address space, excluding these
+ * two biggest unmapped regions will be sufficient to make a trade-off.
+ *
+ * <heap>
+ * <BIG UNMAPPED REGION 1>
+ * <uppermost mmap()-ed region>
+ * (other mmap()-ed regions and small unmapped regions)
+ * <lowermost mmap()-ed region>
+ * <BIG UNMAPPED REGION 2>
+ * <stack>
+ */
+static void damon_init_vm_regions_of(struct damon_ctx *c,
+ struct damon_target *t)
+{
+ struct damon_region *r;
+ struct damon_addr_range regions[3];
+ unsigned long sz = 0, nr_pieces;
+ int i;
+
+ if (damon_three_regions_of(t, regions)) {
+ pr_err("Failed to get three regions of target %lu\n", t->id);
+ return;
+ }
+
+ for (i = 0; i < 3; i++)
+ sz += regions[i].end - regions[i].start;
+ if (c->min_nr_regions)
+ sz /= c->min_nr_regions;
+ if (sz < MIN_REGION)
+ sz = MIN_REGION;
+
+ /* Set the initial three regions of the target */
+ for (i = 0; i < 3; i++) {
+ r = damon_new_region(regions[i].start, regions[i].end);
+ if (!r) {
+ pr_err("%d'th init region creation failed\n", i);
+ return;
+ }
+ damon_add_region(r, t);
+
+ nr_pieces = (regions[i].end - regions[i].start) / sz;
+ damon_split_region_evenly(c, r, nr_pieces);
+ }
+}
+
+/* Initialize '->regions_list' of every target (task) */
+void kdamond_init_vm_regions(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+
+ damon_for_each_target(t, ctx) {
+ /* the user may set the target regions as they want */
+ if (!nr_damon_regions(t))
+ damon_init_vm_regions_of(ctx, t);
+ }
+}
+
+/*
+ * Functions for the dynamic monitoring target regions update
+ */
+
+/*
+ * Check whether a region is intersecting an address range
+ *
+ * Returns true if it is.
+ */
+static bool damon_intersect(struct damon_region *r, struct damon_addr_range *re)
+{
+ return !(r->ar.end <= re->start || re->end <= r->ar.start);
+}
+
+/*
+ * Update damon regions for the three big regions of the given target
+ *
+ * t the given target
+ * bregions the three big regions of the target
+ */
+static void damon_apply_three_regions(struct damon_ctx *ctx,
+ struct damon_target *t, struct damon_addr_range bregions[3])
+{
+ struct damon_region *r, *next;
+ unsigned int i = 0;
+
+ /* Remove regions which are not in the three big regions now */
+ damon_for_each_region_safe(r, next, t) {
+ for (i = 0; i < 3; i++) {
+ if (damon_intersect(r, &bregions[i]))
+ break;
+ }
+ if (i == 3)
+ damon_destroy_region(r);
+ }
+
+ /* Adjust intersecting regions to fit with the three big regions */
+ for (i = 0; i < 3; i++) {
+ struct damon_region *first = NULL, *last;
+ struct damon_region *newr;
+ struct damon_addr_range *br;
+
+ br = &bregions[i];
+ /* Get the first and last regions which intersects with br */
+ damon_for_each_region(r, t) {
+ if (damon_intersect(r, br)) {
+ if (!first)
+ first = r;
+ last = r;
+ }
+ if (r->ar.start >= br->end)
+ break;
+ }
+ if (!first) {
+ /* no damon_region intersects with this big region */
+ newr = damon_new_region(
+ ALIGN_DOWN(br->start, MIN_REGION),
+ ALIGN(br->end, MIN_REGION));
+ if (!newr)
+ continue;
+ damon_insert_region(newr, damon_prev_region(r), r);
+ } else {
+ first->ar.start = ALIGN_DOWN(br->start, MIN_REGION);
+ last->ar.end = ALIGN(br->end, MIN_REGION);
+ }
+ }
+}
+
+/*
+ * Update regions for current memory mappings
+ */
+void kdamond_update_vm_regions(struct damon_ctx *ctx)
+{
+ struct damon_addr_range three_regions[3];
+ struct damon_target *t;
+
+ damon_for_each_target(t, ctx) {
+ if (damon_three_regions_of(t, three_regions))
+ continue;
+ damon_apply_three_regions(ctx, t, three_regions);
+ }
+}
+
+/*
+ * Functions for the access checking of the regions
+ */
+
+static void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm,
+ unsigned long addr)
+{
+ bool referenced = false;
+ struct page *page = pte_page(*pte);
+
+ if (pte_young(*pte)) {
+ referenced = true;
+ *pte = pte_mkold(*pte);
+ }
+
+#ifdef CONFIG_MMU_NOTIFIER
+ if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE))
+ referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+ if (referenced)
+ set_page_young(page);
+
+ set_page_idle(page);
+}
+
+static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm,
+ unsigned long addr)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ bool referenced = false;
+ struct page *page = pmd_page(*pmd);
+
+ if (pmd_young(*pmd)) {
+ referenced = true;
+ *pmd = pmd_mkold(*pmd);
+ }
+
+#ifdef CONFIG_MMU_NOTIFIER
+ if (mmu_notifier_clear_young(mm, addr,
+ addr + ((1UL) << HPAGE_PMD_SHIFT)))
+ referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+ if (referenced)
+ set_page_young(page);
+
+ set_page_idle(page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
+
+static void damon_mkold(struct mm_struct *mm, unsigned long addr)
+{
+ pte_t *pte = NULL;
+ pmd_t *pmd = NULL;
+ spinlock_t *ptl;
+
+ if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
+ return;
+
+ if (pte) {
+ damon_ptep_mkold(pte, mm, addr);
+ pte_unmap_unlock(pte, ptl);
+ } else {
+ damon_pmdp_mkold(pmd, mm, addr);
+ spin_unlock(ptl);
+ }
+}
+
+static void damon_prepare_vm_access_check(struct damon_ctx *ctx,
+ struct mm_struct *mm, struct damon_region *r)
+{
+ r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+
+ damon_mkold(mm, r->sampling_addr);
+}
+
+void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ struct mm_struct *mm;
+ struct damon_region *r;
+
+ damon_for_each_target(t, ctx) {
+ mm = damon_get_mm(t);
+ if (!mm)
+ continue;
+ damon_for_each_region(r, t)
+ damon_prepare_vm_access_check(ctx, mm, r);
+ mmput(mm);
+ }
+}
+
+static bool damon_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *page_sz)
+{
+ pte_t *pte = NULL;
+ pmd_t *pmd = NULL;
+ spinlock_t *ptl;
+ bool young = false;
+
+ if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
+ return false;
+
+ *page_sz = PAGE_SIZE;
+ if (pte) {
+ young = pte_young(*pte);
+ if (!young)
+ young = !page_is_idle(pte_page(*pte));
+ pte_unmap_unlock(pte, ptl);
+ return young;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ young = pmd_young(*pmd);
+ if (!young)
+ young = !page_is_idle(pmd_page(*pmd));
+ spin_unlock(ptl);
+ *page_sz = ((1UL) << HPAGE_PMD_SHIFT);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+ return young;
+}
+
+/*
+ * Check whether the region was accessed after the last preparation
+ *
+ * mm 'mm_struct' for the given virtual address space
+ * r the region to be checked
+ */
+static void damon_check_vm_access(struct damon_ctx *ctx,
+ struct mm_struct *mm, struct damon_region *r)
+{
+ static struct mm_struct *last_mm;
+ static unsigned long last_addr;
+ static unsigned long last_page_sz = PAGE_SIZE;
+ static bool last_accessed;
+
+ /* If the region is in the last checked page, reuse the result */
+ if (mm == last_mm && (ALIGN_DOWN(last_addr, last_page_sz) ==
+ ALIGN_DOWN(r->sampling_addr, last_page_sz))) {
+ if (last_accessed)
+ r->nr_accesses++;
+ return;
+ }
+
+ last_accessed = damon_young(mm, r->sampling_addr, &last_page_sz);
+ if (last_accessed)
+ r->nr_accesses++;
+
+ last_mm = mm;
+ last_addr = r->sampling_addr;
+}
+
+unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx)
+{
+ struct damon_target *t;
+ struct mm_struct *mm;
+ struct damon_region *r;
+ unsigned int max_nr_accesses = 0;
+
+ damon_for_each_target(t, ctx) {
+ mm = damon_get_mm(t);
+ if (!mm)
+ continue;
+ damon_for_each_region(r, t) {
+ damon_check_vm_access(ctx, mm, r);
+ max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
+ }
+ mmput(mm);
+ }
+
+ return max_nr_accesses;
+}
+
+
+/*
+ * Functions for the target validity check and cleanup
+ */
+
+bool kdamond_vm_target_valid(struct damon_target *t)
+{
+ struct task_struct *task;
+
+ task = damon_get_task_struct(t);
+ if (task) {
+ put_task_struct(task);
+ return true;
+ }
+
+ return false;
+}
+
+void kdamond_vm_cleanup(struct damon_ctx *ctx)
+{
+ struct damon_target *t, *next;
+
+ damon_for_each_target_safe(t, next, ctx) {
+ put_pid((struct pid *)t->id);
+ damon_destroy_target(t);
+ }
+}
+
/*
* Functions for DAMON core logics and features
*/
--
2.17.1

2020-08-20 07:31:42

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

On Mon, 17 Aug 2020 12:51:22 +0200 SeongJae Park <[email protected]> wrote:

> From: SeongJae Park <[email protected]>
>
> Changes from Previous Version
> =============================
>
> - Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
> - Support large record file (Alkaid)
> - Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
> - Avoid conflict between concurrent DAMON users
> - Update evaluation result document
>
> Introduction
> ============
>
> DAMON is a data access monitoring framework subsystem for the Linux kernel.
> The core mechanisms of DAMON called 'region based sampling' and 'adaptive
> regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
> patchset for the detail) make it
>
> - accurate (The monitored information is useful for DRAM level memory
> management. It might not appropriate for Cache-level accuracy, though.),
> - light-weight (The monitoring overhead is low enough to be applied online
> while making no impact on the performance of the target workloads.), and
> - scalable (the upper-bound of the instrumentation overhead is controllable
> regardless of the size of target workloads.).
>
> Using this framework, therefore, the kernel's core memory management mechanisms
> such as reclamation and THP can be optimized for better memory management. The
> experimental memory management optimization works that incurring high
> instrumentation overhead will be able to have another try. In user space,
> meanwhile, users who have some special workloads will be able to write
> personalized tools or applications for deeper understanding and specialized
> optimizations of their systems.

DAMON will be presented in the next week LPC[1]. To be prepared for a screen
sharing error (if I get no such error, I will do a live-demo), I recorded a
simple demo video. I would like to share it here to help your easier
understanding of DAMON.

https://youtu.be/l63eqbVBZRY

[1] https://linuxplumbersconf.org/event/7/contributions/659/


Thanks,
SeongJae Park

2020-08-31 11:26:09

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

On Thu, 20 Aug 2020 09:27:38 +0200 SeongJae Park <[email protected]> wrote:

> On Mon, 17 Aug 2020 12:51:22 +0200 SeongJae Park <[email protected]> wrote:
>
> > From: SeongJae Park <[email protected]>
> >
> > Changes from Previous Version
> > =============================
> >
> > - Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
> > - Support large record file (Alkaid)
> > - Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
> > - Avoid conflict between concurrent DAMON users
> > - Update evaluation result document
> >
> > Introduction
> > ============
> >
> > DAMON is a data access monitoring framework subsystem for the Linux kernel.
> > The core mechanisms of DAMON called 'region based sampling' and 'adaptive
> > regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
> > patchset for the detail) make it
> >
> > - accurate (The monitored information is useful for DRAM level memory
> > management. It might not appropriate for Cache-level accuracy, though.),
> > - light-weight (The monitoring overhead is low enough to be applied online
> > while making no impact on the performance of the target workloads.), and
> > - scalable (the upper-bound of the instrumentation overhead is controllable
> > regardless of the size of target workloads.).
> >
> > Using this framework, therefore, the kernel's core memory management mechanisms
> > such as reclamation and THP can be optimized for better memory management. The
> > experimental memory management optimization works that incurring high
> > instrumentation overhead will be able to have another try. In user space,
> > meanwhile, users who have some special workloads will be able to write
> > personalized tools or applications for deeper understanding and specialized
> > optimizations of their systems.
>
> DAMON will be presented in the next week LPC[1]. To be prepared for a screen
> sharing error (if I get no such error, I will do a live-demo), I recorded a
> simple demo video. I would like to share it here to help your easier
> understanding of DAMON.
>
> https://youtu.be/l63eqbVBZRY
>
> [1] https://linuxplumbersconf.org/event/7/contributions/659/

During the session, I introduced the list of future works and asked the
audiences to vote for the priority of the tasks:
https://youtu.be/jOBkKMA0uF0?t=13253

To summarize here, the tasks are (highest priority first):

1. Make current DAMON patchset series merged in the mainline (6 votes)
2. User space interface improvement (4 votes)
- Multiple monitoring contexts
- Charging of the monitoring threads' CPU usage
3. Support more address spaces (2 votes)
- Cgroups, cached pages, specific file-backed pages, swap slots, ...
3. DAMON-based MM optimizations (2 votes)
- Page reclaim, THP, compaction, NUMA balancing, ...
4. Optimize for special use-cases (1 vote)
- Page granularity monitoring, accessed-or-not monitoring, ...

So, I'd like to focus on polishing current patchset so that it could be merged
in. For that, I'd like to ask your more reviews.

While waiting for the reviews, I will start implementing other future features
that received many votes. The support of multiple monitoring contexts for the
user space would be the first one. Once the implementation is finished, I will
post it as separated RFC patchset (the user space interface will be compatible
with current one).

Any comment is welcome.


Thanks,
SeongJae Park

2020-09-09 08:30:27

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

This is just a reminder ping to let you know I'm still waiting more reviews!
Any comments will be appreciated!


Thanks,
SeongJae Park

2020-09-23 17:06:37

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

On Mon, Aug 17, 2020 at 3:52 AM SeongJae Park <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> Changes from Previous Version
> =============================
>
> - Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
> - Support large record file (Alkaid)
> - Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
> - Avoid conflict between concurrent DAMON users
> - Update evaluation result document
>
> Introduction
> ============
>
> DAMON is a data access monitoring framework subsystem for the Linux kernel.
> The core mechanisms of DAMON called 'region based sampling' and 'adaptive
> regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
> patchset for the detail) make it
>
> - accurate (The monitored information is useful for DRAM level memory
> management. It might not appropriate for Cache-level accuracy, though.),
> - light-weight (The monitoring overhead is low enough to be applied online
> while making no impact on the performance of the target workloads.), and
> - scalable (the upper-bound of the instrumentation overhead is controllable
> regardless of the size of target workloads.).
>
> Using this framework, therefore, the kernel's core memory management mechanisms
> such as reclamation and THP can be optimized for better memory management. The
> experimental memory management optimization works that incurring high
> instrumentation overhead will be able to have another try. In user space,
> meanwhile, users who have some special workloads will be able to write
> personalized tools or applications for deeper understanding and specialized
> optimizations of their systems.
>
> Evaluations
> ===========
>
> We evaluated DAMON's overhead, monitoring quality and usefulness using 25
> realistic workloads on my QEMU/KVM based virtual machine running a kernel that
> v20 DAMON patchset is applied.
>
> DAMON is lightweight. It increases system memory usage by 0.12% and slows
> target workloads down by 1.39%.
>
> DAMON is accurate and useful for memory management optimizations. An
> experimental DAMON-based operation scheme for THP, 'ethp', removes 88.16% of
> THP memory overheads while preserving 88.73% of THP speedup. Another
> experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
> reduces 91.34% of residential sets and 25.59% of system memory footprint while
> incurring only 1.58% runtime overhead in the best case (parsec3/freqmine).
>
> NOTE that the experimentail THP optimization and proactive reclamation are not
> for production but just only for proof of concepts.
>
> Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
> a document for DAMON" patch in this patchset for detailed evaluation setup and
> results.
>
> [1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
>


Hi SeongJae,

Sorry for the late response. I will start looking at this series in
more detail in the next couple of weeks. I have a couple of high level
comments for now.

1) Please explain in the cover letter why someone should prefer to use
DAMON instead of Page Idle Tracking.

2) Also add what features Page Idle Tracking provides which the first
version of DAMON does not provide (like page level tracking, physical
or unmapped memory tracking e.t.c) and tell if you plan to add such
features to DAMON in future. Basically giving reasons to not block the
current version of DAMON until it is feature-rich.

3) I think in the first mergeable version of DAMON, I would prefer to
have support to control (create/delete/account) the DAMON context. You
already have a RFC series on it. I would like to have that series part
of this one.

I will go through individual patches to provide more detailed
feedback, so, you don't need to post the next version until then.

thanks,
Shakeel

2020-09-24 06:43:21

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

On Wed, 23 Sep 2020 10:04:57 -0700 Shakeel Butt <[email protected]> wrote:

> On Mon, Aug 17, 2020 at 3:52 AM SeongJae Park <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > Changes from Previous Version
> > =============================
> >
> > - Place 'CREATE_TRACE_POINTS' after '#include' statements (Steven Rostedt)
> > - Support large record file (Alkaid)
> > - Place 'put_pid()' of virtual monitoring targets in 'cleanup' callback
> > - Avoid conflict between concurrent DAMON users
> > - Update evaluation result document
> >
> > Introduction
> > ============
> >
> > DAMON is a data access monitoring framework subsystem for the Linux kernel.
> > The core mechanisms of DAMON called 'region based sampling' and 'adaptive
> > regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
> > patchset for the detail) make it
> >
> > - accurate (The monitored information is useful for DRAM level memory
> > management. It might not appropriate for Cache-level accuracy, though.),
> > - light-weight (The monitoring overhead is low enough to be applied online
> > while making no impact on the performance of the target workloads.), and
> > - scalable (the upper-bound of the instrumentation overhead is controllable
> > regardless of the size of target workloads.).
> >
> > Using this framework, therefore, the kernel's core memory management mechanisms
> > such as reclamation and THP can be optimized for better memory management. The
> > experimental memory management optimization works that incurring high
> > instrumentation overhead will be able to have another try. In user space,
> > meanwhile, users who have some special workloads will be able to write
> > personalized tools or applications for deeper understanding and specialized
> > optimizations of their systems.
> >
> > Evaluations
> > ===========
> >
> > We evaluated DAMON's overhead, monitoring quality and usefulness using 25
> > realistic workloads on my QEMU/KVM based virtual machine running a kernel that
> > v20 DAMON patchset is applied.
> >
> > DAMON is lightweight. It increases system memory usage by 0.12% and slows
> > target workloads down by 1.39%.
> >
> > DAMON is accurate and useful for memory management optimizations. An
> > experimental DAMON-based operation scheme for THP, 'ethp', removes 88.16% of
> > THP memory overheads while preserving 88.73% of THP speedup. Another
> > experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
> > reduces 91.34% of residential sets and 25.59% of system memory footprint while
> > incurring only 1.58% runtime overhead in the best case (parsec3/freqmine).
> >
> > NOTE that the experimentail THP optimization and proactive reclamation are not
> > for production but just only for proof of concepts.
> >
> > Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
> > a document for DAMON" patch in this patchset for detailed evaluation setup and
> > results.
> >
> > [1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
> >
>
>
> Hi SeongJae,
>
> Sorry for the late response. I will start looking at this series in
> more detail in the next couple of weeks.

Thank you so much!

> I have a couple of high level comments for now.
>
> 1) Please explain in the cover letter why someone should prefer to use
> DAMON instead of Page Idle Tracking.

In short, because DAMON provides overhead-quality tradeoff and allow use of
variable monitoring primitives other than only PG_Idle and PTE Accessed bits.
I will explain this in detail in the cover letter of the next version of this
patchset.

>
> 2) Also add what features Page Idle Tracking provides which the first
> version of DAMON does not provide (like page level tracking, physical
> or unmapped memory tracking e.t.c) and tell if you plan to add such
> features to DAMON in future. Basically giving reasons to not block the
> current version of DAMON until it is feature-rich.

In short, DAMON will provide only virtual address space monitoring by default
but I believe the lack of features because DAMON is expandable for those.
Also, I will make DAMON co-exists with Idle Page Tracking again. I will post
another RFC patchset for this soon. Again, I will describe this in detail in
the next version of the cover letter.

>
> 3) I think in the first mergeable version of DAMON, I would prefer to
> have support to control (create/delete/account) the DAMON context. You
> already have a RFC series on it. I would like to have that series part
> of this one.

Ok, I will apply it here.


Thanks,
SeongJae Park

2020-09-25 15:04:01

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH v20 00/15] Introduce Data Access MONitor (DAMON)

On Mon, 31 Aug 2020 13:22:35 +0200 SeongJae Park <[email protected]> wrote:

> On Thu, 20 Aug 2020 09:27:38 +0200 SeongJae Park <[email protected]> wrote:
>
> > On Mon, 17 Aug 2020 12:51:22 +0200 SeongJae Park <[email protected]> wrote:
> >
> > > From: SeongJae Park <[email protected]>
> > >
[...]
> > > Introduction
> > > ============
> > >
> > > DAMON is a data access monitoring framework subsystem for the Linux kernel.
> > > The core mechanisms of DAMON called 'region based sampling' and 'adaptive
> > > regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
> > > patchset for the detail) make it
> > >
> > > - accurate (The monitored information is useful for DRAM level memory
> > > management. It might not appropriate for Cache-level accuracy, though.),
> > > - light-weight (The monitoring overhead is low enough to be applied online
> > > while making no impact on the performance of the target workloads.), and
> > > - scalable (the upper-bound of the instrumentation overhead is controllable
> > > regardless of the size of target workloads.).
> > >
> > > Using this framework, therefore, the kernel's core memory management mechanisms
> > > such as reclamation and THP can be optimized for better memory management. The
> > > experimental memory management optimization works that incurring high
> > > instrumentation overhead will be able to have another try. In user space,
> > > meanwhile, users who have some special workloads will be able to write
> > > personalized tools or applications for deeper understanding and specialized
> > > optimizations of their systems.
> >
> > DAMON will be presented in the next week LPC[1]. To be prepared for a screen
> > sharing error (if I get no such error, I will do a live-demo), I recorded a
> > simple demo video. I would like to share it here to help your easier
> > understanding of DAMON.
> >
> > https://youtu.be/l63eqbVBZRY
> >
> > [1] https://linuxplumbersconf.org/event/7/contributions/659/
>
> During the session, I introduced the list of future works and asked the
> audiences to vote for the priority of the tasks:
> https://youtu.be/jOBkKMA0uF0?t=13253

I also promised to make my automated tests for DAMON available as open source.
I'm happy to announce that it is not available at Github[1] under GPL v2
license. Using that, you can easily test how well DAMON works on your machine.
Hopefully, it could be used as a getting started guide for both users and
developers of DAMON.

[1] https://github.com/awslabs/damon-tests


Thanks,
SeongJae Park

>
> To summarize here, the tasks are (highest priority first):
>
> 1. Make current DAMON patchset series merged in the mainline (6 votes)
> 2. User space interface improvement (4 votes)
> - Multiple monitoring contexts
> - Charging of the monitoring threads' CPU usage
> 3. Support more address spaces (2 votes)
> - Cgroups, cached pages, specific file-backed pages, swap slots, ...
> 3. DAMON-based MM optimizations (2 votes)
> - Page reclaim, THP, compaction, NUMA balancing, ...
> 4. Optimize for special use-cases (1 vote)
> - Page granularity monitoring, accessed-or-not monitoring, ...
>
> So, I'd like to focus on polishing current patchset so that it could be merged
> in. For that, I'd like to ask your more reviews.
>
> While waiting for the reviews, I will start implementing other future features
> that received many votes. The support of multiple monitoring contexts for the
> user space would be the first one. Once the implementation is finished, I will
> post it as separated RFC patchset (the user space interface will be compatible
> with current one).
>
> Any comment is welcome.
>
>
> Thanks,
> SeongJae Park