From: SeongJae Park <[email protected]>
Changes from Previous Version
=============================
- Reorganize the doc and remove png blobs (Mike Rapoport)
- Wordsmith mechanisms doc and commit messages
- tools/wss: Set default working set access frequency threshold
- Avoid race in damon daemon start
Introduction
============
DAMON is a data access monitoring framework subsystem for the Linux kernel.
The core mechanisms of DAMON, called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for details), make it
- accurate (the monitored information is useful for DRAM level memory
management; it might not be appropriate for cache-level accuracy, though),
- light-weight (the monitoring overhead is low enough to be applied online
with minimal impact on the performance of the target workloads), and
- scalable (the upper bound of the instrumentation overhead is controllable
regardless of the size of the target workloads).
Using this framework, therefore, the kernel's core memory management mechanisms
such as reclamation and THP can be optimized for better memory management.
Experimental memory management optimization works that incurred high
instrumentation overhead will be able to have another try. In user space,
meanwhile, users who have special workloads will be able to write
personalized tools or applications for deeper understanding and specialized
optimizations of their systems.
Evaluations
===========
We evaluated DAMON's overhead, monitoring quality, and usefulness using 25
realistic workloads on my QEMU/KVM based virtual machine running a kernel to
which the v16 DAMON patchset is applied.
DAMON is lightweight. It changes system memory usage by only -0.25% and
consumes less than 1% CPU time in most cases. It slows target workloads down
by only 0.94%.
DAMON is accurate and useful for memory management optimizations. An
experimental DAMON-based operation scheme for THP, 'ethp', removes 31.29% of
THP memory overheads while preserving 60.64% of THP speedup. Another
experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
reduces 87.95% of resident sets and 29.52% of system memory footprint while
incurring only 2.15% runtime overhead in the best case (parsec3/freqmine).
NOTE that the experimental THP optimization and proactive reclamation are not
for production; they are only proofs of concept.
Please refer to the official document[1] or "Documentation/admin-guide/mm: Add
a document for DAMON" patch in this patchset for detailed evaluation setup and
results.
[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html
More Information
================
We prepared a showcase web site[1] where you can get more information. There
are
- the official documentation[2],
- the heatmap format dynamic access patterns of various realistic workloads for
the heap area[3], mmap()-ed area[4], and stack area[5],
- the dynamic working set size distribution[6] and chronological working set
size changes[7], and
- the latest performance test results[8].
[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html
Baseline and Complete Git Trees
===============================
The patches are based on v5.7. You can also clone the complete git
tree:
$ git clone git://github.com/sjp38/linux -b damon/patches/v18
A web view is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v18
There are also a couple of trees for the entire DAMON patchset series,
including future features. The first one[1] contains the changes for the
latest release, while the other one[2] contains the changes for the next
release.
[1] https://github.com/sjp38/linux/tree/damon/master
[2] https://github.com/sjp38/linux/tree/damon/next
Sequence Of Patches
===================
The 1st patch exports 'lookup_page_ext()' to GPL modules so that it can be used
by DAMON even when DAMON is built as a loadable module.
The next four patches implement the target address space independent core
logic of DAMON and its programming interface. The 2nd patch introduces the
DAMON module, its data structures, and data structure related common
functions. The following three patches (3rd to 5th) implement the core
mechanisms of DAMON, namely region based sampling (patch 3), adaptive regions
adjustment (patch 4), and dynamic memory mapping change adoption (patch 5).
The following one (patch 6) implements the virtual memory address space
specific functions.
The following four patches are for more user friendly interfaces. The 7th
patch implements recording of access patterns in DAMON. The next two patches
(8th and 9th) respectively add a tracepoint for tracepoint-supporting tracers
such as perf, and a debugfs interface for privileged users and/or programs in
user space.
Two patches for high level users of DAMON follow. To provide a minimal
reference to the debugfs interface and for high level use/tests of DAMON,
the next patch (10th) implements a user space tool. The 11th patch adds a
document for administrators of DAMON.
Next two patches are for tests. The 12th patch provides unit tests (based on
the kunit) while the 13th patch adds user space tests (based on the kselftest).
Finally, the last patch (14th) updates the MAINTAINERS file.
Patch History
=============
Changes from v17
(https://lore.kernel.org/linux-mm/[email protected]/)
- Reorganize the doc and remove png blobs (Mike Rapoport)
- Wordsmith mechanisms doc and commit messages
- tools/wss: Set default working set access frequency threshold
- Avoid race in damon daemon start
Changes from v16
(https://lore.kernel.org/linux-mm/[email protected]/)
- Wordsmith/cleanup the documentations and the code
- user space tool: Simplify the code and add wss option for reuse histogram
- recording: Check disablement condition properly
- recording: Force minimal recording buffer size (1KB)
Changes from v15
(https://lore.kernel.org/linux-mm/[email protected]/)
- Refine commit messages (David Hildenbrand)
- Optimize three vma regions search (Varad Gautam)
- Support static granularity monitoring (Shakeel Butt)
- Cleanup code and re-organize the sequence of patches
Changes from v14
(https://lore.kernel.org/linux-mm/[email protected]/)
- Directly pass region and task to tracepoint (Steven Rostedt)
- Refine comments for better readability
- Add more 'Reviewed-by's (Leonard Foerster, Brendan Higgins)
Changes from v13
(https://lore.kernel.org/linux-mm/[email protected]/)
- Fix a typo (Leonard Foerster)
- Fix wrong condition of three sub ranges split (Leonard Foerster)
- Rebase on v5.7
Changes from v12
(https://lore.kernel.org/linux-mm/[email protected]/)
- Avoid races between debugfs readers and writers
- Add kernel-doc comments in damon.h
Changes from v11
(https://lore.kernel.org/linux-mm/[email protected]/)
- Rewrite the document (Stefan Nuernberger)
- Make 'damon_for_each_*' argument order consistent (Leonard Foerster)
- Fix wrong comment in 'kdamond_merge_regions()' (Leonard Foerster)
Changes from v10
(https://lore.kernel.org/linux-mm/[email protected]/)
- Reduce aggressive split overhead by doing it only if required
Changes from v9
(https://lore.kernel.org/linux-mm/[email protected]/)
- Split each region into 4 subregions if possible (Jonathan Cameron)
- Update kunit test for the split code change
Please refer to the v9 patchset to get older history.
SeongJae Park (14):
mm/page_ext: Export lookup_page_ext() to GPL modules
mm: Introduce Data Access MONitor (DAMON)
mm/damon: Implement region based sampling
mm/damon: Adaptively adjust regions
mm/damon: Track dynamic monitoring target regions update
mm/damon: Implement callbacks for the virtual memory address spaces
mm/damon: Implement access pattern recording
mm/damon: Add a tracepoint
mm/damon: Implement a debugfs interface
tools: Introduce a minimal user-space tool for DAMON
Documentation: Add documents for DAMON
mm/damon: Add kunit tests
mm/damon: Add user space selftests
MAINTAINERS: Update for DAMON
Documentation/admin-guide/mm/damon/guide.rst | 157 ++
Documentation/admin-guide/mm/damon/index.rst | 15 +
Documentation/admin-guide/mm/damon/plans.rst | 29 +
Documentation/admin-guide/mm/damon/start.rst | 98 +
Documentation/admin-guide/mm/damon/usage.rst | 298 +++
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/vm/damon/api.rst | 20 +
Documentation/vm/damon/eval.rst | 222 +++
Documentation/vm/damon/faq.rst | 59 +
Documentation/vm/damon/index.rst | 32 +
Documentation/vm/damon/mechanisms.rst | 165 ++
Documentation/vm/index.rst | 1 +
MAINTAINERS | 13 +
include/linux/damon.h | 175 ++
include/trace/events/damon.h | 43 +
mm/Kconfig | 23 +
mm/Makefile | 1 +
mm/damon-test.h | 661 +++++++
mm/damon.c | 1634 +++++++++++++++++
mm/page_ext.c | 1 +
tools/damon/.gitignore | 1 +
tools/damon/_damon.py | 129 ++
tools/damon/_dist.py | 36 +
tools/damon/_recfile.py | 23 +
tools/damon/bin2txt.py | 67 +
tools/damon/damo | 37 +
tools/damon/heats.py | 362 ++++
tools/damon/nr_regions.py | 91 +
tools/damon/record.py | 106 ++
tools/damon/report.py | 45 +
tools/damon/wss.py | 100 +
tools/testing/selftests/damon/Makefile | 7 +
.../selftests/damon/_chk_dependency.sh | 28 +
tools/testing/selftests/damon/_chk_record.py | 108 ++
.../testing/selftests/damon/debugfs_attrs.sh | 139 ++
.../testing/selftests/damon/debugfs_record.sh | 50 +
36 files changed, 4977 insertions(+)
create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
create mode 100644 Documentation/admin-guide/mm/damon/index.rst
create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
create mode 100644 Documentation/admin-guide/mm/damon/start.rst
create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
create mode 100644 Documentation/vm/damon/api.rst
create mode 100644 Documentation/vm/damon/eval.rst
create mode 100644 Documentation/vm/damon/faq.rst
create mode 100644 Documentation/vm/damon/index.rst
create mode 100644 Documentation/vm/damon/mechanisms.rst
create mode 100644 include/linux/damon.h
create mode 100644 include/trace/events/damon.h
create mode 100644 mm/damon-test.h
create mode 100644 mm/damon.c
create mode 100644 tools/damon/.gitignore
create mode 100644 tools/damon/_damon.py
create mode 100644 tools/damon/_dist.py
create mode 100644 tools/damon/_recfile.py
create mode 100644 tools/damon/bin2txt.py
create mode 100755 tools/damon/damo
create mode 100644 tools/damon/heats.py
create mode 100644 tools/damon/nr_regions.py
create mode 100644 tools/damon/record.py
create mode 100644 tools/damon/report.py
create mode 100644 tools/damon/wss.py
create mode 100644 tools/testing/selftests/damon/Makefile
create mode 100644 tools/testing/selftests/damon/_chk_dependency.sh
create mode 100644 tools/testing/selftests/damon/_chk_record.py
create mode 100755 tools/testing/selftests/damon/debugfs_attrs.sh
create mode 100755 tools/testing/selftests/damon/debugfs_record.sh
--
2.17.1
From: SeongJae Park <[email protected]>
DAMON is a data access monitoring framework subsystem for the Linux
kernel. The core mechanisms of DAMON make it
- accurate (the monitoring output is useful enough for DRAM level
memory management; it might not be appropriate for CPU cache levels,
though),
- light-weight (the monitoring overhead is low enough to be applied
online), and
- scalable (the upper bound of the overhead is in a constant range
regardless of the size of the target workloads).
Using this framework, therefore, the kernel's memory management
mechanisms can make advanced decisions. Experimental memory management
optimization works that incurred high data access monitoring overhead
could be implemented again. In user space, meanwhile, users who have
special workloads can write personalized applications for better
understanding and optimizations of their workloads and systems.
This commit implements only the stub for the module load/unload, the
basic data structures, and simple manipulation functions for the
structures, to keep the size of the commit small. The core mechanisms
of DAMON will be implemented one by one by the following commits.
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Varad Gautam <[email protected]>
---
include/linux/damon.h | 63 ++++++++++++++
mm/Kconfig | 12 +++
mm/Makefile | 1 +
mm/damon.c | 188 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 264 insertions(+)
create mode 100644 include/linux/damon.h
create mode 100644 mm/damon.c
diff --git a/include/linux/damon.h b/include/linux/damon.h
new file mode 100644
index 000000000000..c8f8c1c41a45
--- /dev/null
+++ b/include/linux/damon.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON api
+ *
+ * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
+ *
+ * Author: SeongJae Park <[email protected]>
+ */
+
+#ifndef _DAMON_H_
+#define _DAMON_H_
+
+#include <linux/random.h>
+#include <linux/types.h>
+
+/**
+ * struct damon_addr_range - Represents an address region of [@start, @end).
+ * @start: Start address of the region (inclusive).
+ * @end: End address of the region (exclusive).
+ */
+struct damon_addr_range {
+ unsigned long start;
+ unsigned long end;
+};
+
+/**
+ * struct damon_region - Represents a monitoring target region.
+ * @ar: The address range of the region.
+ * @sampling_addr: Address of the sample for the next access check.
+ * @nr_accesses: Access frequency of this region.
+ * @list: List head for siblings.
+ */
+struct damon_region {
+ struct damon_addr_range ar;
+ unsigned long sampling_addr;
+ unsigned int nr_accesses;
+ struct list_head list;
+};
+
+/**
+ * struct damon_task - Represents a monitoring target task.
+ * @pid: Process id of the task.
+ * @regions_list: Head of the monitoring target regions of this task.
+ * @list: List head for siblings.
+ *
+ * If the monitoring target address space is task independent (e.g., physical
+ * memory address space monitoring), @pid should be '-1'.
+ */
+struct damon_task {
+ int pid;
+ struct list_head regions_list;
+ struct list_head list;
+};
+
+/**
+ * struct damon_ctx - Represents a context for each monitoring.
+ * @tasks_list: Head of monitoring target tasks (&damon_task) list.
+ */
+struct damon_ctx {
+ struct list_head tasks_list; /* 'damon_task' objects */
+};
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c35..464e9594dcec 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -867,4 +867,16 @@ config ARCH_HAS_HUGEPD
config MAPPING_DIRTY_HELPERS
bool
+config DAMON
+ tristate "Data Access Monitor"
+ depends on MMU
+ help
+ This feature allows monitoring the access frequency of each memory
+ region. The information can be useful for performance-centric DRAM
+ level memory management.
+
+ See https://damonitor.github.io/doc/html/latest-damon/index.html for
+ more information.
+ If unsure, say N.
+
endmenu
diff --git a/mm/Makefile b/mm/Makefile
index fccd3756b25f..230e545b6e07 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_DAMON) += damon.o
diff --git a/mm/damon.c b/mm/damon.c
new file mode 100644
index 000000000000..5ab13b1c15cf
--- /dev/null
+++ b/mm/damon.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Access Monitor
+ *
+ * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
+ *
+ * Author: SeongJae Park <[email protected]>
+ *
+ * This file is constructed in below parts.
+ *
+ * - Functions and macros for DAMON data structures
+ * - Functions for the module loading/unloading
+ *
+ * The core parts are not implemented yet.
+ */
+
+#define pr_fmt(fmt) "damon: " fmt
+
+#include <linux/damon.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+
+/*
+ * Functions and macros for DAMON data structures
+ */
+
+#define damon_get_task_struct(t) \
+ (get_pid_task(find_vpid(t->pid), PIDTYPE_PID))
+
+#define damon_next_region(r) \
+ (container_of(r->list.next, struct damon_region, list))
+
+#define damon_prev_region(r) \
+ (container_of(r->list.prev, struct damon_region, list))
+
+#define damon_for_each_region(r, t) \
+ list_for_each_entry(r, &t->regions_list, list)
+
+#define damon_for_each_region_safe(r, next, t) \
+ list_for_each_entry_safe(r, next, &t->regions_list, list)
+
+#define damon_for_each_task(t, ctx) \
+ list_for_each_entry(t, &(ctx)->tasks_list, list)
+
+#define damon_for_each_task_safe(t, next, ctx) \
+ list_for_each_entry_safe(t, next, &(ctx)->tasks_list, list)
+
+/* Get a random number in [l, r) */
+#define damon_rand(l, r) (l + prandom_u32() % (r - l))
+
+/*
+ * Construct a damon_region struct
+ *
+ * Returns the pointer to the new struct if success, or NULL otherwise
+ */
+static struct damon_region *damon_new_region(unsigned long start,
+ unsigned long end)
+{
+ struct damon_region *region;
+
+ region = kmalloc(sizeof(*region), GFP_KERNEL);
+ if (!region)
+ return NULL;
+
+ region->ar.start = start;
+ region->ar.end = end;
+ region->nr_accesses = 0;
+ INIT_LIST_HEAD(&region->list);
+
+ return region;
+}
+
+/*
+ * Add a region between two other regions
+ */
+static inline void damon_insert_region(struct damon_region *r,
+ struct damon_region *prev, struct damon_region *next)
+{
+ __list_add(&r->list, &prev->list, &next->list);
+}
+
+static void damon_add_region(struct damon_region *r, struct damon_task *t)
+{
+ list_add_tail(&r->list, &t->regions_list);
+}
+
+static void damon_del_region(struct damon_region *r)
+{
+ list_del(&r->list);
+}
+
+static void damon_free_region(struct damon_region *r)
+{
+ kfree(r);
+}
+
+static void damon_destroy_region(struct damon_region *r)
+{
+ damon_del_region(r);
+ damon_free_region(r);
+}
+
+/*
+ * Construct a damon_task struct
+ *
+ * Returns the pointer to the new struct if success, or NULL otherwise
+ */
+static struct damon_task *damon_new_task(int pid)
+{
+ struct damon_task *t;
+
+ t = kmalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return NULL;
+
+ t->pid = pid;
+ INIT_LIST_HEAD(&t->regions_list);
+
+ return t;
+}
+
+static void damon_add_task(struct damon_ctx *ctx, struct damon_task *t)
+{
+ list_add_tail(&t->list, &ctx->tasks_list);
+}
+
+static void damon_del_task(struct damon_task *t)
+{
+ list_del(&t->list);
+}
+
+static void damon_free_task(struct damon_task *t)
+{
+ struct damon_region *r, *next;
+
+ damon_for_each_region_safe(r, next, t)
+ damon_free_region(r);
+ kfree(t);
+}
+
+static void damon_destroy_task(struct damon_task *t)
+{
+ damon_del_task(t);
+ damon_free_task(t);
+}
+
+static unsigned int nr_damon_tasks(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ unsigned int nr_tasks = 0;
+
+ damon_for_each_task(t, ctx)
+ nr_tasks++;
+
+ return nr_tasks;
+}
+
+static unsigned int nr_damon_regions(struct damon_task *t)
+{
+ struct damon_region *r;
+ unsigned int nr_regions = 0;
+
+ damon_for_each_region(r, t)
+ nr_regions++;
+
+ return nr_regions;
+}
+
+/*
+ * Functions for the module loading/unloading
+ */
+
+static int __init damon_init(void)
+{
+ return 0;
+}
+
+static void __exit damon_exit(void)
+{
+}
+
+module_init(damon_init);
+module_exit(damon_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("SeongJae Park <[email protected]>");
+MODULE_DESCRIPTION("DAMON: Data Access MONitor");
--
2.17.1
From: SeongJae Park <[email protected]>
To flexibly support various address spaces, DAMON separates its high
level logic, which is independent of the monitoring target address
space, from the low level primitives, which depend on the target
address space.
This commit implements DAMON's target address space independent high
level logic for basic access check and region based sampling. Hence,
without an implementation of the target address space specific parts,
this doesn't work alone. A reference implementation of those will be
provided by a later commit.
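The separation can be illustrated with a minimal user space sketch.
This is not the code of this patch: 'struct sketch_ctx',
'sketch_sample_once()' and 'sketch_dummy_check()' are hypothetical
names used only for illustration. The address space independent step
calls whatever primitives the user registered and skips the missing
ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a monitoring context. */
struct sketch_ctx {
	unsigned int nr_checks;	/* how many times the check primitive ran */
	/* address space specific primitives, supplied by the user */
	void (*prepare_access_checks)(struct sketch_ctx *ctx);
	unsigned int (*check_accesses)(struct sketch_ctx *ctx);
};

/* Address space independent logic: one sampling step. */
unsigned int sketch_sample_once(struct sketch_ctx *ctx)
{
	assert(ctx != NULL);
	if (ctx->prepare_access_checks)
		ctx->prepare_access_checks(ctx);
	/* a real monitoring thread would sleep one sampling interval here */
	if (ctx->check_accesses)
		return ctx->check_accesses(ctx);
	return 0;
}

/* Dummy primitive standing in for a real address space implementation. */
unsigned int sketch_dummy_check(struct sketch_ctx *ctx)
{
	return ++ctx->nr_checks;
}
```

With no primitives registered the step does nothing, which mirrors why
the high level logic cannot work alone.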
Basic Access Check
==================
The output of DAMON says which pages are accessed how frequently for a
given duration. The resolution of the access frequency is controlled by
setting ``sampling interval`` and ``aggregation interval``. In detail,
DAMON checks access to each page per ``sampling interval`` and
aggregates the results. In other words, it counts the number of
accesses to each page. After each ``aggregation interval`` passes,
DAMON calls callback functions that were previously registered by users
so that users can read the aggregated results, and then clears the
results. This can be described in the simple pseudo-code below::
    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        sleep(sampling_interval)
The monitoring overhead of this mechanism will arbitrarily increase as
the size of the target workload grows.
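To make the cost concrete, one sampling step of this basic scheme can
be sketched in user space C as below. This is only an illustration
under assumptions: 'basic_sample_once()' and 'even_pages_accessed()'
are hypothetical names, and the 'accessed' function pointer stands in
for the real access check primitive. The number of checks per step
equals the number of pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * One sampling step of the basic (per-page) scheme.
 * Returns the number of access checks done, which equals nr_pages.
 */
size_t basic_sample_once(unsigned int *nr_accesses, size_t nr_pages,
			 bool (*accessed)(size_t page))
{
	size_t page;

	assert(nr_accesses != NULL && accessed != NULL);
	for (page = 0; page < nr_pages; page++) {
		if (accessed(page))
			nr_accesses[page]++;
	}
	return nr_pages;
}

/* Toy access pattern for illustration: only even pages are accessed. */
bool even_pages_accessed(size_t page)
{
	return page % 2 == 0;
}
```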
Region Based Sampling
=====================
To avoid the unbounded increase of the overhead, DAMON groups adjacent
pages that are assumed to have the same access frequencies into a region.
As long as the assumption (pages in a region have the same access
frequencies) is kept, only one page in the region is required to be
checked. Thus, for each ``sampling interval``, DAMON randomly picks one
page in each region, waits for one ``sampling interval``, checks whether
the page is accessed meanwhile, and increases the access frequency of
the region if so. Therefore, the monitoring overhead is controllable by
setting the number of regions. DAMON allows users to set the minimum
and the maximum number of regions for the trade-off.
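The sampling described above can be sketched in user space C as
follows. This is a simplified model, not the code of this patch:
'struct sketch_region', 'sketch_prepare()', 'sketch_check()',
'sketch_low_mem_hot()' and the 4 KiB 'SKETCH_PAGE_SIZE' are assumptions
made for illustration, and rand() stands in for the kernel's random
number source. The number of checks per step equals the number of
regions, regardless of how many pages the regions cover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the example */

struct sketch_region {
	unsigned long start;		/* inclusive */
	unsigned long end;		/* exclusive */
	unsigned long sampling_addr;	/* page picked for the next check */
	unsigned int nr_accesses;
};

/* Pick one random page-aligned address in each region. */
void sketch_prepare(struct sketch_region *regions, size_t nr_regions)
{
	size_t i;

	for (i = 0; i < nr_regions; i++) {
		struct sketch_region *r = &regions[i];
		unsigned long nr_pages = (r->end - r->start) / SKETCH_PAGE_SIZE;

		assert(nr_pages > 0);
		r->sampling_addr = r->start +
			((unsigned long)rand() % nr_pages) * SKETCH_PAGE_SIZE;
	}
}

/* Check only the sampled page of each region; returns the check count. */
size_t sketch_check(struct sketch_region *regions, size_t nr_regions,
		    bool (*accessed)(unsigned long addr))
{
	size_t i;

	for (i = 0; i < nr_regions; i++) {
		if (accessed(regions[i].sampling_addr))
			regions[i].nr_accesses++;
	}
	return nr_regions;
}

/* Toy access pattern: every address below 1 MiB is hot. */
bool sketch_low_mem_hot(unsigned long addr)
{
	return addr < (1UL << 20);
}
```

In this toy setup, a region lying entirely in the hot range is counted
as accessed whichever page is sampled, which is the case where the
assumption holds.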
This scheme, however, cannot preserve the quality of the output if the
assumption does not hold. The next commit will address this problem.
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 80 ++++++++++++-
mm/damon.c | 260 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 337 insertions(+), 3 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index c8f8c1c41a45..7adc7b6b3507 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -11,6 +11,8 @@
#define _DAMON_H_
#include <linux/random.h>
+#include <linux/mutex.h>
+#include <linux/time64.h>
#include <linux/types.h>
/**
@@ -53,11 +55,87 @@ struct damon_task {
};
/**
- * struct damon_ctx - Represents a context for each monitoring.
+ * struct damon_ctx - Represents a context for each monitoring. This is the
+ * main interface that allows users to set the attributes and get the results
+ * of the monitoring.
+ *
+ * @sample_interval: The time between access samplings.
+ * @aggr_interval: The time between monitor results aggregations.
+ * @nr_regions: The number of monitoring regions.
+ *
+ * For each @sample_interval, DAMON checks whether each region is accessed or
+ * not. It aggregates and keeps the access information (number of accesses to
+ * each region) for @aggr_interval time. All time intervals are in
+ * micro-seconds.
+ *
+ * @kdamond: Kernel thread that does the monitoring.
+ * @kdamond_stop: Notifies whether kdamond should stop.
+ * @kdamond_lock: Mutex for the synchronizations with @kdamond.
+ *
+ * For each monitoring request (damon_start()), a kernel thread for the
+ * monitoring is created. The pointer to the thread is stored in @kdamond.
+ *
+ * The monitoring thread sets @kdamond to NULL when it terminates. Therefore,
+ * users can know whether the monitoring is ongoing or terminated by reading
+ * @kdamond. Also, users can ask @kdamond to be terminated by writing non-zero
+ * to @kdamond_stop. Reads and writes to @kdamond and @kdamond_stop from
+ * outside of the monitoring thread must be protected by @kdamond_lock.
+ *
+ * Note that the monitoring thread protects only @kdamond and @kdamond_stop via
+ * @kdamond_lock. Accesses to other fields must be protected by themselves.
+ *
* @tasks_list: Head of monitoring target tasks (&damon_task) list.
+ *
+ * @init_target_regions: Constructs initial monitoring target regions.
+ * @prepare_access_checks: Prepares next access check of target regions.
+ * @check_accesses: Checks the access of target regions.
+ * @sample_cb: Called for each sampling interval.
+ * @aggregate_cb: Called for each aggregation interval.
+ *
+ * DAMON can be extended for various address spaces by users. For this, users
+ * can register the target address space dependent low level functions for
+ * their usecases via the callback pointers of the context. The monitoring
+ * thread calls @init_target_regions before starting the monitoring, and
+ * @prepare_access_checks and @check_accesses for each @sample_interval.
+ *
+ * @init_target_regions should construct proper monitoring target regions and
+ * link those to the DAMON context struct.
+ * @prepare_access_checks should manipulate the monitoring regions to be
+ * prepared for the next access check.
+ * @check_accesses should check the accesses to each region that were made
+ * after the last preparation and update the `->nr_accesses` of each region.
+ *
+ * @sample_cb and @aggregate_cb are called from @kdamond for each of the
+ * sampling intervals and aggregation intervals, respectively. Therefore,
+ * users can safely access the monitoring results via @tasks_list without
+ * additional protection of @kdamond_lock. For this reason, users are
+ * recommended to use these callbacks for the accesses to the results.
*/
struct damon_ctx {
+ unsigned long sample_interval;
+ unsigned long aggr_interval;
+ unsigned long nr_regions;
+
+ struct timespec64 last_aggregation;
+
+ struct task_struct *kdamond;
+ bool kdamond_stop;
+ struct mutex kdamond_lock;
+
struct list_head tasks_list; /* 'damon_task' objects */
+
+ /* callbacks */
+ void (*init_target_regions)(struct damon_ctx *context);
+ void (*prepare_access_checks)(struct damon_ctx *context);
+ unsigned int (*check_accesses)(struct damon_ctx *context);
+ void (*sample_cb)(struct damon_ctx *context);
+ void (*aggregate_cb)(struct damon_ctx *context);
};
+int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long nr_reg);
+int damon_start(struct damon_ctx *ctx);
+int damon_stop(struct damon_ctx *ctx);
+
#endif
diff --git a/mm/damon.c b/mm/damon.c
index 5ab13b1c15cf..29d82c2d65be 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -9,18 +9,27 @@
* This file is constructed in below parts.
*
* - Functions and macros for DAMON data structures
+ * - Functions for DAMON core logics and features
+ * - Functions for the DAMON programming interface
* - Functions for the module loading/unloading
- *
- * The core parts are not implemented yet.
*/
#define pr_fmt(fmt) "damon: " fmt
#include <linux/damon.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
#include <linux/mm.h>
#include <linux/module.h>
+#include <linux/page_idle.h>
+#include <linux/random.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
#include <linux/slab.h>
+/* Minimal region size. Every damon_region is aligned by this. */
+#define MIN_REGION PAGE_SIZE
+
/*
* Functions and macros for DAMON data structures
*/
@@ -167,6 +176,253 @@ static unsigned int nr_damon_regions(struct damon_task *t)
return nr_regions;
}
+/*
+ * Functions for DAMON core logics and features
+ */
+
+/*
+ * damon_check_reset_time_interval() - Check if a time interval is elapsed.
+ * @baseline: the time to check whether the interval has elapsed since
+ * @interval: the time interval (microseconds)
+ *
+ * See whether the given time interval has passed since the given baseline
+ * time. If so, it also updates the baseline to current time for next check.
+ *
+ * Return: true if the time interval has passed, or false otherwise.
+ */
+static bool damon_check_reset_time_interval(struct timespec64 *baseline,
+ unsigned long interval)
+{
+ struct timespec64 now;
+
+ ktime_get_coarse_ts64(&now);
+ if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) <
+ interval * 1000)
+ return false;
+ *baseline = now;
+ return true;
+}
+
+/*
+ * Check whether it is time to flush the aggregated information
+ */
+static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
+{
+ return damon_check_reset_time_interval(&ctx->last_aggregation,
+ ctx->aggr_interval);
+}
+
+/*
+ * Reset the aggregated monitoring results
+ */
+static void kdamond_reset_aggregated(struct damon_ctx *c)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+
+ damon_for_each_task(t, c) {
+ damon_for_each_region(r, t)
+ r->nr_accesses = 0;
+ }
+}
+
+/*
+ * Check whether current monitoring should be stopped
+ *
+ * The monitoring is stopped when either the user requested to stop, or all
+ * monitoring target tasks are dead.
+ *
+ * Returns true if the current monitoring needs to be stopped.
+ */
+static bool kdamond_need_stop(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ struct task_struct *task;
+ bool stop;
+
+ mutex_lock(&ctx->kdamond_lock);
+ stop = ctx->kdamond_stop;
+ mutex_unlock(&ctx->kdamond_lock);
+ if (stop)
+ return true;
+
+ damon_for_each_task(t, ctx) {
+ /* -1 is reserved for non-process bounded monitoring */
+ if (t->pid == -1)
+ return false;
+
+ task = damon_get_task_struct(t);
+ if (task) {
+ put_task_struct(task);
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/*
+ * The monitoring daemon that runs as a kernel thread
+ */
+static int kdamond_fn(void *data)
+{
+ struct damon_ctx *ctx = (struct damon_ctx *)data;
+ struct damon_task *t;
+ struct damon_region *r, *next;
+
+ pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+ if (ctx->init_target_regions)
+ ctx->init_target_regions(ctx);
+ while (!kdamond_need_stop(ctx)) {
+ if (ctx->prepare_access_checks)
+ ctx->prepare_access_checks(ctx);
+ if (ctx->sample_cb)
+ ctx->sample_cb(ctx);
+
+ usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
+
+ if (ctx->check_accesses)
+ ctx->check_accesses(ctx);
+
+ if (kdamond_aggregate_interval_passed(ctx)) {
+ if (ctx->aggregate_cb)
+ ctx->aggregate_cb(ctx);
+ kdamond_reset_aggregated(ctx);
+ }
+
+ }
+ damon_for_each_task(t, ctx) {
+ damon_for_each_region_safe(r, next, t)
+ damon_destroy_region(r);
+ }
+ pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+ mutex_lock(&ctx->kdamond_lock);
+ ctx->kdamond = NULL;
+ mutex_unlock(&ctx->kdamond_lock);
+
+ do_exit(0);
+}
+
+/*
+ * Functions for the DAMON programming interface
+ */
+
+static bool damon_kdamond_running(struct damon_ctx *ctx)
+{
+ bool running;
+
+ mutex_lock(&ctx->kdamond_lock);
+ running = ctx->kdamond != NULL;
+ mutex_unlock(&ctx->kdamond_lock);
+
+ return running;
+}
+
+/**
+ * damon_start() - Starts monitoring with given context.
+ * @ctx: monitoring context
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_start(struct damon_ctx *ctx)
+{
+ int err = -EBUSY;
+
+ mutex_lock(&ctx->kdamond_lock);
+ if (!ctx->kdamond) {
+ err = 0;
+ ctx->kdamond_stop = false;
+ ctx->kdamond = kthread_create(kdamond_fn, ctx, "kdamond");
+ if (IS_ERR(ctx->kdamond))
+ err = PTR_ERR(ctx->kdamond);
+ else
+ wake_up_process(ctx->kdamond);
+ }
+ mutex_unlock(&ctx->kdamond_lock);
+
+ return err;
+}
+
+/**
+ * damon_stop() - Stops monitoring of given context.
+ * @ctx: monitoring context
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_stop(struct damon_ctx *ctx)
+{
+ mutex_lock(&ctx->kdamond_lock);
+ if (ctx->kdamond) {
+ ctx->kdamond_stop = true;
+ mutex_unlock(&ctx->kdamond_lock);
+ while (damon_kdamond_running(ctx))
+ usleep_range(ctx->sample_interval,
+ ctx->sample_interval * 2);
+ return 0;
+ }
+ mutex_unlock(&ctx->kdamond_lock);
+
+ return -EPERM;
+}
+
+/**
+ * damon_set_pids() - Set monitoring target processes.
+ * @ctx: monitoring context
+ * @pids: array of target processes pids
+ * @nr_pids: number of entries in @pids
+ *
+ * This function should not be called while the kdamond is running.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids)
+{
+ ssize_t i;
+ struct damon_task *t, *next;
+
+ damon_for_each_task_safe(t, next, ctx)
+ damon_destroy_task(t);
+
+ for (i = 0; i < nr_pids; i++) {
+ t = damon_new_task(pids[i]);
+ if (!t) {
+ pr_err("Failed to alloc damon_task\n");
+ return -ENOMEM;
+ }
+ damon_add_task(ctx, t);
+ }
+
+ return 0;
+}
+
+/**
+ * damon_set_attrs() - Set attributes for the monitoring.
+ * @ctx: monitoring context
+ * @sample_int: time interval between samplings
+ * @aggr_int: time interval between aggregations
+ * @nr_reg: number of regions
+ *
+ * This function should not be called while the kdamond is running.
+ * Every time interval is in micro-seconds.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long nr_reg)
+{
+ if (nr_reg < 3) {
+ pr_err("nr_regions (%lu) must be at least 3\n",
+ nr_reg);
+ return -EINVAL;
+ }
+
+ ctx->sample_interval = sample_int;
+ ctx->aggr_interval = aggr_int;
+ ctx->nr_regions = nr_reg;
+
+ return 0;
+}
+
/*
* Functions for the module loading/unloading
*/
--
2.17.1
From: SeongJae Park <[email protected]>
This commit implements the recording feature of DAMON. If this feature
is enabled, DAMON writes the monitored access patterns in its binary
format into a file which is specified by the user. Each user could
already implement this using the callbacks. However, as the recording
is expected to be widely used, this commit implements the feature in
DAMON itself, for more convenience.
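As a rough illustration of the binary format, the record file could be read from user space as sketched below. This is only a sketch under assumptions, not part of this patch: it assumes a 64-bit little-endian kernel (16-byte ``timespec64``, 8-byte ``unsigned long``, 4-byte pid) and record format version 1, and the helper names ``read_exact`` and ``parse_records`` are hypothetical.

```python
import io
import struct

# Mark written by the kernel at the start of the record file.
MARK = b'damon_recfmt_ver'

def read_exact(f, n):
    """Read exactly n bytes or raise EOFError (hypothetical helper)."""
    data = f.read(n)
    if len(data) != n:
        raise EOFError('truncated record')
    return data

def parse_records(f):
    """Yield (time_ns, {pid: [(start, end, nr_accesses), ...]}).

    Assumes 64-bit little-endian layout and format version 1.
    """
    if f.read(len(MARK)) != MARK:
        raise ValueError('record format header not found')
    version = struct.unpack('<i', read_exact(f, 4))[0]
    if version != 1:
        raise ValueError('unsupported format version %d' % version)
    while True:
        timebin = f.read(16)
        if len(timebin) != 16:
            break               # end of file
        sec, nsec = struct.unpack('<qq', timebin)
        nr_tasks = struct.unpack('<I', read_exact(f, 4))[0]
        tasks = {}
        for _ in range(nr_tasks):
            # task info: <pid> <number of regions>
            pid, nr_regions = struct.unpack('<iI', read_exact(f, 8))
            regions = []
            for _ in range(nr_regions):
                # region info: <start address> <end address> <nr_accesses>
                regions.append(struct.unpack('<QQI', read_exact(f, 20)))
            tasks[pid] = regions
        yield sec * 1000000000 + nsec, tasks
```

Each yielded item corresponds to one aggregation interval, so iterating over ``parse_records(open(path, 'rb'))`` walks the snapshots in the order they were recorded.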
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 15 +++++
mm/damon.c | 141 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 153 insertions(+), 3 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 310d36d123b3..b0e7e31a22b3 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -72,6 +72,14 @@ struct damon_task {
* in case of virtual memory monitoring) and applies the changes for each
* @regions_update_interval. All time intervals are in micro-seconds.
*
+ * @rbuf: In-memory buffer for monitoring result recording.
+ * @rbuf_len: The length of @rbuf.
+ * @rbuf_offset: The offset for next write to @rbuf.
+ * @rfile_path: Record file path.
+ *
+ * If @rbuf, @rbuf_len, and @rfile_path are set, the monitored results are
+ * automatically stored in @rfile_path file.
+ *
* @kdamond: Kernel thread who does the monitoring.
* @kdamond_stop: Notifies whether kdamond should stop.
* @kdamond_lock: Mutex for the synchronizations with @kdamond.
@@ -129,6 +137,11 @@ struct damon_ctx {
struct timespec64 last_aggregation;
struct timespec64 last_regions_update;
+ unsigned char *rbuf;
+ unsigned int rbuf_len;
+ unsigned int rbuf_offset;
+ char *rfile_path;
+
struct task_struct *kdamond;
bool kdamond_stop;
struct mutex kdamond_lock;
@@ -154,6 +167,8 @@ int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg);
+int damon_set_recording(struct damon_ctx *ctx,
+ unsigned int rbuf_len, char *rfile_path);
int damon_start(struct damon_ctx *ctx);
int damon_stop(struct damon_ctx *ctx);
diff --git a/mm/damon.c b/mm/damon.c
index 386780739007..55ecfab64220 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -58,6 +58,10 @@
#define damon_for_each_task_safe(t, next, ctx) \
list_for_each_entry_safe(t, next, &(ctx)->tasks_list, list)
+#define MIN_RECORD_BUFFER_LEN 1024
+#define MAX_RECORD_BUFFER_LEN (4 * 1024 * 1024)
+#define MAX_RFILE_PATH_LEN 256
+
/* Get a random number in [l, r) */
#define damon_rand(l, r) (l + prandom_u32() % (r - l))
@@ -707,16 +711,88 @@ static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
}
/*
- * Reset the aggregated monitoring results
+ * Flush the content in the result buffer to the result file
+ */
+static void damon_flush_rbuffer(struct damon_ctx *ctx)
+{
+ ssize_t sz;
+ loff_t pos = 0;
+ struct file *rfile;
+
+ if (!ctx->rbuf_offset)
+ return;
+
+ rfile = filp_open(ctx->rfile_path, O_CREAT | O_RDWR | O_APPEND, 0644);
+ if (IS_ERR(rfile)) {
+ pr_err("Cannot open the result file %s\n",
+ ctx->rfile_path);
+ return;
+ }
+
+ while (ctx->rbuf_offset) {
+ sz = kernel_write(rfile, ctx->rbuf, ctx->rbuf_offset, &pos);
+ if (sz < 0)
+ break;
+ ctx->rbuf_offset -= sz;
+ }
+ filp_close(rfile, NULL);
+}
+
+/*
+ * Write data into the result buffer
+ */
+static void damon_write_rbuf(struct damon_ctx *ctx, void *data, ssize_t size)
+{
+ if (!ctx->rbuf_len || !ctx->rbuf || !ctx->rfile_path)
+ return;
+ if (ctx->rbuf_offset + size > ctx->rbuf_len)
+ damon_flush_rbuffer(ctx);
+ if (ctx->rbuf_offset + size > ctx->rbuf_len) {
+ pr_warn("%s: flush failed, or wrong size given(%u, %zu)\n",
+ __func__, ctx->rbuf_offset, size);
+ return;
+ }
+
+ memcpy(&ctx->rbuf[ctx->rbuf_offset], data, size);
+ ctx->rbuf_offset += size;
+}
+
+/*
+ * Flush the aggregated monitoring results to the result buffer
+ *
+ * Stores current tracking results to the result buffer and reset 'nr_accesses'
+ * of each region. The format for the result buffer is as below:
+ *
+ * <time> <number of tasks> <array of task infos>
+ *
+ * task info: <pid> <number of regions> <array of region infos>
+ * region info: <start address> <end address> <nr_accesses>
*/
static void kdamond_reset_aggregated(struct damon_ctx *c)
{
struct damon_task *t;
- struct damon_region *r;
+ struct timespec64 now;
+ unsigned int nr;
+
+ ktime_get_coarse_ts64(&now);
+
+ damon_write_rbuf(c, &now, sizeof(now));
+ nr = nr_damon_tasks(c);
+ damon_write_rbuf(c, &nr, sizeof(nr));
damon_for_each_task(t, c) {
- damon_for_each_region(r, t)
+ struct damon_region *r;
+
+ damon_write_rbuf(c, &t->pid, sizeof(t->pid));
+ nr = nr_damon_regions(t);
+ damon_write_rbuf(c, &nr, sizeof(nr));
+ damon_for_each_region(r, t) {
+ damon_write_rbuf(c, &r->ar.start, sizeof(r->ar.start));
+ damon_write_rbuf(c, &r->ar.end, sizeof(r->ar.end));
+ damon_write_rbuf(c, &r->nr_accesses,
+ sizeof(r->nr_accesses));
r->nr_accesses = 0;
+ }
}
}
@@ -905,6 +981,14 @@ static bool kdamond_need_stop(struct damon_ctx *ctx)
return true;
}
+static void kdamond_write_record_header(struct damon_ctx *ctx)
+{
+ int recfmt_ver = 1;
+
+ damon_write_rbuf(ctx, "damon_recfmt_ver", 16);
+ damon_write_rbuf(ctx, &recfmt_ver, sizeof(recfmt_ver));
+}
+
/*
* The monitoring daemon that runs as a kernel thread
*/
@@ -921,6 +1005,8 @@ static int kdamond_fn(void *data)
ctx->init_target_regions(ctx);
sz_limit = damon_region_sz_limit(ctx);
+ kdamond_write_record_header(ctx);
+
while (!kdamond_need_stop(ctx)) {
if (ctx->prepare_access_checks)
ctx->prepare_access_checks(ctx);
@@ -947,6 +1033,7 @@ static int kdamond_fn(void *data)
sz_limit = damon_region_sz_limit(ctx);
}
}
+ damon_flush_rbuffer(ctx);
damon_for_each_task(t, ctx) {
damon_for_each_region_safe(r, next, t)
damon_destroy_region(r);
@@ -1051,6 +1138,54 @@ int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids)
return 0;
}
+/**
+ * damon_set_recording() - Set attributes for the recording.
+ * @ctx: target kdamond context
+ * @rbuf_len: length of the result buffer
+ * @rfile_path: path to the monitor result files
+ *
+ * Setting 'rbuf_len' to 0 disables recording.
+ *
+ * This function should not be called while the kdamond is running.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_recording(struct damon_ctx *ctx,
+ unsigned int rbuf_len, char *rfile_path)
+{
+ size_t rfile_path_len;
+
+ if (rbuf_len && (rbuf_len > MAX_RECORD_BUFFER_LEN ||
+ rbuf_len < MIN_RECORD_BUFFER_LEN)) {
+ pr_err("result buffer size (%u) is out of [%d,%d]\n",
+ rbuf_len, MIN_RECORD_BUFFER_LEN,
+ MAX_RECORD_BUFFER_LEN);
+ return -EINVAL;
+ }
+ rfile_path_len = strnlen(rfile_path, MAX_RFILE_PATH_LEN);
+ if (rfile_path_len >= MAX_RFILE_PATH_LEN) {
+ pr_err("too long (>%d) result file path %s\n",
+ MAX_RFILE_PATH_LEN, rfile_path);
+ return -EINVAL;
+ }
+ ctx->rbuf_len = rbuf_len;
+ kfree(ctx->rbuf);
+ ctx->rbuf = NULL;
+ kfree(ctx->rfile_path);
+ ctx->rfile_path = NULL;
+
+ if (rbuf_len) {
+ ctx->rbuf = kvmalloc(rbuf_len, GFP_KERNEL);
+ if (!ctx->rbuf)
+ return -ENOMEM;
+ }
+ ctx->rfile_path = kmalloc(rfile_path_len + 1, GFP_KERNEL);
+ if (!ctx->rfile_path)
+ return -ENOMEM;
+ strncpy(ctx->rfile_path, rfile_path, rfile_path_len + 1);
+ return 0;
+}
+
/**
* damon_set_attrs() - Set attributes for the monitoring.
* @ctx: monitoring context
--
2.17.1
From: SeongJae Park <[email protected]>
This commit exports 'lookup_page_ext()' to GPL modules. It will be used
by DAMON in a following commit for the implementation of the region based
sampling.
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
Reviewed-by: Varad Gautam <[email protected]>
---
mm/page_ext.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/page_ext.c b/mm/page_ext.c
index a3616f7a0e9e..9d802d01fcb5 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -131,6 +131,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
MAX_ORDER_NR_PAGES);
return get_entry(base, index);
}
+EXPORT_SYMBOL_GPL(lookup_page_ext);
static int __init alloc_node_page_ext(int nid)
{
--
2.17.1
From: SeongJae Park <[email protected]>
This commit implements a debugfs interface for DAMON. It supports
monitoring of virtual address spaces.
DAMON exports four files, ``attrs``, ``pids``, ``record``, and
``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
Attributes
----------
Users can read and write the ``sampling interval``, ``aggregation
interval``, ``regions update interval``, and min/max number of
monitoring target regions by reading from and writing to the ``attrs``
file. For example, the commands below set those values to 5 ms, 100 ms,
1,000 ms, 10, and 1000, and then check them again::
# cd <debugfs>/damon
# echo 5000 100000 1000000 10 1000 > attrs
# cat attrs
5000 100000 1000000 10 1000
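The five values are plain space-separated integers, with every interval in microseconds. A minimal sketch of composing and parsing that string from a script is below. The helper names ``format_attrs`` and ``parse_attrs`` are hypothetical and not part of this patch.

```python
# Field order of the 'attrs' file, as described above.
ATTR_FIELDS = ('sample_interval', 'aggr_interval',
               'regions_update_interval', 'min_nr_regions',
               'max_nr_regions')

def format_attrs(sample_us, aggr_us, update_us, min_nr, max_nr):
    """Build the string to write to 'attrs' (intervals in microseconds)."""
    return '%d %d %d %d %d' % (sample_us, aggr_us, update_us,
                               min_nr, max_nr)

def parse_attrs(text):
    """Turn the content read from 'attrs' into a dict of integers."""
    values = [int(x) for x in text.split()]
    if len(values) != len(ATTR_FIELDS):
        raise ValueError('expected %d values, got %d' %
                         (len(ATTR_FIELDS), len(values)))
    return dict(zip(ATTR_FIELDS, values))
```

For example, ``format_attrs(5 * 1000, 100 * 1000, 1000 * 1000, 10, 1000)`` gives the same string as the ``echo`` command above.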
Target PIDs
-----------
Users can read and write the pids of current monitoring target processes
by reading from and writing to the ``pids`` file. For example, the
commands below set the processes having pids 42 and 4242 as the processes
to be monitored and then check them again::
# cd <debugfs>/damon
# echo 42 4242 > pids
# cat pids
42 4242
Note that setting the pids doesn't start the monitoring.
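The kernel side of this patch parses the written string as whitespace-separated integers and accepts at most 32 pids. The sketch below approximates that behavior in user space, for example to pre-check input before writing it. It is an approximation only: the function name ``parse_pids`` is hypothetical, and the token matching is stricter than the kernel's ``sscanf()``.

```python
import re

# Upper bound used by the debugfs 'pids' parser in this patch.
MAX_NR_PIDS = 32

def parse_pids(text):
    """Return the list of pids the 'pids' file would likely accept.

    Parsing stops at the first token that is not a decimal integer,
    or once MAX_NR_PIDS pids have been collected.
    """
    pids = []
    for token in text.split():
        if len(pids) >= MAX_NR_PIDS:
            break
        if not re.fullmatch(r'[+-]?[0-9]+', token):
            break
        pids.append(int(token))
    return pids
```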
Record
------
DAMON supports recording the monitoring results directly. The recorded
results are first written to a buffer and then flushed to a file in
batches. Users can read and set the size of the buffer and the path to
the result file by reading from and writing to the ``record`` file. For
example, the commands below set the buffer to be 4 KiB and the result to
be saved in '/damon.data'::
# cd <debugfs>/damon
# echo 4096 /damon.data > record
# cat record
4096 /damon.data
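The recording patch in this series bounds the buffer size to [1 KiB, 4 MiB] (0 disables recording) and rejects paths of 256 bytes or more. A user-space tool could mirror those checks before writing to ``record``, as in the sketch below. The function name ``check_record_setting`` is hypothetical. The whitespace check is an added assumption based on the debugfs parser reading the path with ``%s``.

```python
# Limits mirrored from mm/damon.c in this patchset.
MIN_RECORD_BUFFER_LEN = 1024
MAX_RECORD_BUFFER_LEN = 4 * 1024 * 1024
MAX_RFILE_PATH_LEN = 256

def check_record_setting(rbuf_len, rfile_path):
    """Return the string to write to 'record', or raise ValueError."""
    # A length of 0 is allowed: it disables recording.
    if rbuf_len and not (MIN_RECORD_BUFFER_LEN <= rbuf_len
                         <= MAX_RECORD_BUFFER_LEN):
        raise ValueError('buffer size %d is out of [%d,%d]' %
                         (rbuf_len, MIN_RECORD_BUFFER_LEN,
                          MAX_RECORD_BUFFER_LEN))
    if len(rfile_path.encode()) >= MAX_RFILE_PATH_LEN:
        raise ValueError('result file path is too long')
    # Assumption: a path with white space would be cut at the first
    # space by the kernel-side "%u %s" parsing, so reject it here.
    if not rfile_path or any(c.isspace() for c in rfile_path):
        raise ValueError('result file path must be non-empty, no spaces')
    return '%d %s' % (rbuf_len, rfile_path)
```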
Turning On/Off
--------------
You can check the current status, start, and stop the monitoring by reading
from and writing to the ``monitor_on`` file. Writing ``on`` to the file
makes DAMON start monitoring the target processes with the attributes.
Writing ``off`` to the file stops DAMON. DAMON also stops if every
target process is terminated. The example commands below turn DAMON on
and off, and then check its status::
# cd <debugfs>/damon
# echo on > monitor_on
# echo off > monitor_on
# cat monitor_on
off
Please note that you cannot write to the ``attrs``, ``pids``, and
``record`` files while the monitoring is turned on. If you write to the
files while DAMON is running, an error code (``-EINVAL`` for ``pids``,
``-EBUSY`` for ``attrs`` and ``record``) will be returned.
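A script driving this interface could check ``monitor_on`` before touching the other files, as sketched below. This is a convenience only: the state may change between the check and the write, so the error code returned by the kernel remains the authoritative answer. The function names and the ``damon_dir`` parameter (the ``<debugfs>/damon`` directory) are hypothetical.

```python
import os

def is_damon_running(damon_dir):
    """Return True if 'monitor_on' currently reads 'on'."""
    with open(os.path.join(damon_dir, 'monitor_on'), 'r') as f:
        return f.read().strip() == 'on'

def turn_damon(damon_dir, on):
    """Write 'on' or 'off' to 'monitor_on'."""
    with open(os.path.join(damon_dir, 'monitor_on'), 'w') as f:
        f.write('on' if on else 'off')

def write_attrs_if_stopped(damon_dir, attrs):
    """Write to 'attrs' only while monitoring is off.

    Returns False without writing if monitoring is on.
    """
    if is_damon_running(damon_dir):
        return False
    with open(os.path.join(damon_dir, 'attrs'), 'w') as f:
        f.write(attrs)
    return True
```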
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
mm/damon.c | 381 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 380 insertions(+), 1 deletion(-)
diff --git a/mm/damon.c b/mm/damon.c
index 00df1a4c3d5c..df05bd821ff8 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -14,6 +14,7 @@
* - Functions for the access checking of the regions
* - Functions for DAMON core logics and features
* - Functions for the DAMON programming interface
+ * - Functions for the DAMON debugfs interface
* - Functions for the module loading/unloading
*/
@@ -22,6 +23,7 @@
#define CREATE_TRACE_POINTS
#include <linux/damon.h>
+#include <linux/debugfs.h>
#include <linux/delay.h>
#include <linux/kthread.h>
#include <linux/mm.h>
@@ -68,6 +70,20 @@
/* Get a random number in [l, r) */
#define damon_rand(l, r) (l + prandom_u32() % (r - l))
+/* A monitoring context for debugfs interface users. */
+static struct damon_ctx damon_user_ctx = {
+ .sample_interval = 5 * 1000,
+ .aggr_interval = 100 * 1000,
+ .regions_update_interval = 1000 * 1000,
+ .min_nr_regions = 10,
+ .max_nr_regions = 1000,
+
+ .init_target_regions = kdamond_init_vm_regions,
+ .update_target_regions = kdamond_update_vm_regions,
+ .prepare_access_checks = kdamond_prepare_vm_access_checks,
+ .check_accesses = kdamond_check_vm_accesses,
+};
+
/*
* Construct a damon_region struct
*
@@ -1228,17 +1244,380 @@ int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
return 0;
}
+/*
+ * Functions for the DAMON debugfs interface
+ */
+
+static ssize_t debugfs_monitor_on_read(struct file *file,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char monitor_on_buf[5];
+ bool monitor_on;
+ int len;
+
+ monitor_on = damon_kdamond_running(ctx);
+ len = snprintf(monitor_on_buf, 5, monitor_on ? "on\n" : "off\n");
+
+ return simple_read_from_buffer(buf, count, ppos, monitor_on_buf, len);
+}
+
+/*
+ * Returns non-empty string on success, negative error code otherwise.
+ */
+static char *user_input_str(const char __user *buf, size_t count, loff_t *ppos)
+{
+ char *kbuf;
+ ssize_t ret;
+
+ /* We do not accept continuous write */
+ if (*ppos)
+ return ERR_PTR(-EINVAL);
+
+ kbuf = kmalloc(count + 1, GFP_KERNEL);
+ if (!kbuf)
+ return ERR_PTR(-ENOMEM);
+
+ ret = simple_write_to_buffer(kbuf, count + 1, ppos, buf, count);
+ if (ret != count) {
+ kfree(kbuf);
+ return ERR_PTR(-EIO);
+ }
+ kbuf[ret] = '\0';
+
+ return kbuf;
+}
+
+static ssize_t debugfs_monitor_on_write(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ ssize_t ret = count;
+ char *kbuf;
+ int err;
+
+ kbuf = user_input_str(buf, count, ppos);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ /* Remove white space */
+ if (sscanf(kbuf, "%s", kbuf) != 1)
+ return -EINVAL;
+ if (!strncmp(kbuf, "on", count))
+ err = damon_start(ctx);
+ else if (!strncmp(kbuf, "off", count))
+ err = damon_stop(ctx);
+ else
+ return -EINVAL;
+
+ if (err)
+ ret = err;
+ return ret;
+}
+
+static ssize_t damon_sprint_pids(struct damon_ctx *ctx, char *buf, ssize_t len)
+{
+ struct damon_task *t;
+ int written = 0;
+ int rc;
+
+ damon_for_each_task(t, ctx) {
+ rc = snprintf(&buf[written], len - written, "%d ", t->pid);
+ if (!rc)
+ return -ENOMEM;
+ written += rc;
+ }
+ if (written)
+ written -= 1;
+ written += snprintf(&buf[written], len - written, "\n");
+ return written;
+}
+
+static ssize_t debugfs_pids_read(struct file *file,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ ssize_t len;
+ char pids_buf[320];
+
+ mutex_lock(&ctx->kdamond_lock);
+ len = damon_sprint_pids(ctx, pids_buf, 320);
+ mutex_unlock(&ctx->kdamond_lock);
+ if (len < 0)
+ return len;
+
+ return simple_read_from_buffer(buf, count, ppos, pids_buf, len);
+}
+
+/*
+ * Converts a string into an array of pids
+ *
+ * Returns an array of pids (int) if the conversion succeeds, or
+ * NULL otherwise.
+ */
+static int *str_to_pids(const char *str, ssize_t len, ssize_t *nr_pids)
+{
+ int *pids;
+ const int max_nr_pids = 32;
+ int pid;
+ int pos = 0, parsed, ret;
+
+ *nr_pids = 0;
+ pids = kmalloc_array(max_nr_pids, sizeof(pid), GFP_KERNEL);
+ if (!pids)
+ return NULL;
+ while (*nr_pids < max_nr_pids && pos < len) {
+ ret = sscanf(&str[pos], "%d%n", &pid, &parsed);
+ pos += parsed;
+ if (ret != 1)
+ break;
+ pids[*nr_pids] = pid;
+ *nr_pids += 1;
+ }
+ if (*nr_pids == 0) {
+ kfree(pids);
+ pids = NULL;
+ }
+
+ return pids;
+}
+
+static ssize_t debugfs_pids_write(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char *kbuf;
+ int *targets;
+ ssize_t nr_targets;
+ ssize_t ret = count;
+ int err;
+
+ kbuf = user_input_str(buf, count, ppos);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ targets = str_to_pids(kbuf, ret, &nr_targets);
+ if (!targets) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ mutex_lock(&ctx->kdamond_lock);
+ if (ctx->kdamond) {
+ ret = -EINVAL;
+ goto unlock_out;
+ }
+
+ err = damon_set_pids(ctx, targets, nr_targets);
+ if (err)
+ ret = err;
+unlock_out:
+ mutex_unlock(&ctx->kdamond_lock);
+ kfree(targets);
+out:
+ kfree(kbuf);
+ return ret;
+}
+
+static ssize_t debugfs_record_read(struct file *file,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char record_buf[20 + MAX_RFILE_PATH_LEN];
+ int ret;
+
+ mutex_lock(&ctx->kdamond_lock);
+ ret = snprintf(record_buf, ARRAY_SIZE(record_buf), "%u %s\n",
+ ctx->rbuf_len, ctx->rfile_path);
+ mutex_unlock(&ctx->kdamond_lock);
+ return simple_read_from_buffer(buf, count, ppos, record_buf, ret);
+}
+
+static ssize_t debugfs_record_write(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char *kbuf;
+ unsigned int rbuf_len;
+ char rfile_path[MAX_RFILE_PATH_LEN];
+ ssize_t ret = count;
+ int err;
+
+ kbuf = user_input_str(buf, count, ppos);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ if (sscanf(kbuf, "%u %s",
+ &rbuf_len, rfile_path) != 2) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&ctx->kdamond_lock);
+ if (ctx->kdamond) {
+ ret = -EBUSY;
+ goto unlock_out;
+ }
+
+ err = damon_set_recording(ctx, rbuf_len, rfile_path);
+ if (err)
+ ret = err;
+unlock_out:
+ mutex_unlock(&ctx->kdamond_lock);
+out:
+ kfree(kbuf);
+ return ret;
+}
+
+static ssize_t debugfs_attrs_read(struct file *file,
+ char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char kbuf[128];
+ int ret;
+
+ mutex_lock(&ctx->kdamond_lock);
+ ret = snprintf(kbuf, ARRAY_SIZE(kbuf), "%lu %lu %lu %lu %lu\n",
+ ctx->sample_interval, ctx->aggr_interval,
+ ctx->regions_update_interval, ctx->min_nr_regions,
+ ctx->max_nr_regions);
+ mutex_unlock(&ctx->kdamond_lock);
+
+ return simple_read_from_buffer(buf, count, ppos, kbuf, ret);
+}
+
+static ssize_t debugfs_attrs_write(struct file *file,
+ const char __user *buf, size_t count, loff_t *ppos)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ unsigned long s, a, r, minr, maxr;
+ char *kbuf;
+ ssize_t ret = count;
+ int err;
+
+ kbuf = user_input_str(buf, count, ppos);
+ if (IS_ERR(kbuf))
+ return PTR_ERR(kbuf);
+
+ if (sscanf(kbuf, "%lu %lu %lu %lu %lu",
+ &s, &a, &r, &minr, &maxr) != 5) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ mutex_lock(&ctx->kdamond_lock);
+ if (ctx->kdamond) {
+ ret = -EBUSY;
+ goto unlock_out;
+ }
+
+ err = damon_set_attrs(ctx, s, a, r, minr, maxr);
+ if (err)
+ ret = err;
+unlock_out:
+ mutex_unlock(&ctx->kdamond_lock);
+out:
+ kfree(kbuf);
+ return ret;
+}
+
+static const struct file_operations monitor_on_fops = {
+ .owner = THIS_MODULE,
+ .read = debugfs_monitor_on_read,
+ .write = debugfs_monitor_on_write,
+};
+
+static const struct file_operations pids_fops = {
+ .owner = THIS_MODULE,
+ .read = debugfs_pids_read,
+ .write = debugfs_pids_write,
+};
+
+static const struct file_operations record_fops = {
+ .owner = THIS_MODULE,
+ .read = debugfs_record_read,
+ .write = debugfs_record_write,
+};
+
+static const struct file_operations attrs_fops = {
+ .owner = THIS_MODULE,
+ .read = debugfs_attrs_read,
+ .write = debugfs_attrs_write,
+};
+
+static struct dentry *debugfs_root;
+
+static int __init damon_debugfs_init(void)
+{
+ const char * const file_names[] = {"attrs", "record",
+ "pids", "monitor_on"};
+ const struct file_operations *fops[] = {&attrs_fops, &record_fops,
+ &pids_fops, &monitor_on_fops};
+ int i;
+
+ debugfs_root = debugfs_create_dir("damon", NULL);
+ if (!debugfs_root) {
+ pr_err("failed to create the debugfs dir\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < ARRAY_SIZE(file_names); i++) {
+ if (!debugfs_create_file(file_names[i], 0600, debugfs_root,
+ NULL, fops[i])) {
+ pr_err("failed to create %s file\n", file_names[i]);
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+}
+
+static int __init damon_init_user_ctx(void)
+{
+ int rc;
+
+ struct damon_ctx *ctx = &damon_user_ctx;
+
+ ktime_get_coarse_ts64(&ctx->last_aggregation);
+ ctx->last_regions_update = ctx->last_aggregation;
+
+ rc = damon_set_recording(ctx, 1024 * 1024, "/damon.data");
+ if (rc)
+ return rc;
+
+ mutex_init(&ctx->kdamond_lock);
+
+ INIT_LIST_HEAD(&ctx->tasks_list);
+
+ return 0;
+}
+
/*
* Functions for the module loading/unloading
*/
static int __init damon_init(void)
{
- return 0;
+ int rc;
+
+ rc = damon_init_user_ctx();
+ if (rc)
+ return rc;
+
+ rc = damon_debugfs_init();
+ if (rc)
+ pr_err("%s: debugfs init failed\n", __func__);
+
+ return rc;
}
static void __exit damon_exit(void)
{
+ damon_stop(&damon_user_ctx);
+ debugfs_remove_recursive(debugfs_root);
+
+ kfree(damon_user_ctx.rbuf);
+ kfree(damon_user_ctx.rfile_path);
}
module_init(damon_init);
--
2.17.1
From: SeongJae Park <[email protected]>
This commit introduces a shallow wrapper python script,
``/tools/damon/damo``, that provides a more convenient interface. Note
that it is only intended to serve as a minimal reference for DAMON's
debugfs interface and for debugging DAMON itself.
Signed-off-by: SeongJae Park <[email protected]>
---
tools/damon/.gitignore | 1 +
tools/damon/_damon.py | 129 ++++++++++++++
tools/damon/_dist.py | 36 ++++
tools/damon/_recfile.py | 23 +++
tools/damon/bin2txt.py | 67 +++++++
tools/damon/damo | 37 ++++
tools/damon/heats.py | 362 ++++++++++++++++++++++++++++++++++++++
tools/damon/nr_regions.py | 91 ++++++++++
tools/damon/record.py | 106 +++++++++++
tools/damon/report.py | 45 +++++
tools/damon/wss.py | 100 +++++++++++
11 files changed, 997 insertions(+)
create mode 100644 tools/damon/.gitignore
create mode 100644 tools/damon/_damon.py
create mode 100644 tools/damon/_dist.py
create mode 100644 tools/damon/_recfile.py
create mode 100644 tools/damon/bin2txt.py
create mode 100755 tools/damon/damo
create mode 100644 tools/damon/heats.py
create mode 100644 tools/damon/nr_regions.py
create mode 100644 tools/damon/record.py
create mode 100644 tools/damon/report.py
create mode 100644 tools/damon/wss.py
diff --git a/tools/damon/.gitignore b/tools/damon/.gitignore
new file mode 100644
index 000000000000..96403d36ff93
--- /dev/null
+++ b/tools/damon/.gitignore
@@ -0,0 +1 @@
+__pycache__/*
diff --git a/tools/damon/_damon.py b/tools/damon/_damon.py
new file mode 100644
index 000000000000..2a08468ad27e
--- /dev/null
+++ b/tools/damon/_damon.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Contains core functions for DAMON debugfs control.
+"""
+
+import os
+import subprocess
+
+debugfs_attrs = None
+debugfs_record = None
+debugfs_pids = None
+debugfs_monitor_on = None
+
+def set_target_pid(pid):
+ return subprocess.call('echo %s > %s' % (pid, debugfs_pids), shell=True,
+ executable='/bin/bash')
+
+def turn_damon(on_off):
+ return subprocess.call("echo %s > %s" % (on_off, debugfs_monitor_on),
+ shell=True, executable="/bin/bash")
+
+def is_damon_running():
+ with open(debugfs_monitor_on, 'r') as f:
+ return f.read().strip() == 'on'
+
+class Attrs:
+ sample_interval = None
+ aggr_interval = None
+ regions_update_interval = None
+ min_nr_regions = None
+ max_nr_regions = None
+ rbuf_len = None
+ rfile_path = None
+
+ def __init__(self, s, a, r, n, x, l, f):
+ self.sample_interval = s
+ self.aggr_interval = a
+ self.regions_update_interval = r
+ self.min_nr_regions = n
+ self.max_nr_regions = x
+ self.rbuf_len = l
+ self.rfile_path = f
+
+ def __str__(self):
+ return "%s %s %s %s %s %s %s" % (self.sample_interval,
+ self.aggr_interval, self.regions_update_interval,
+ self.min_nr_regions, self.max_nr_regions, self.rbuf_len,
+ self.rfile_path)
+
+ def attr_str(self):
+ return "%s %s %s %s %s " % (self.sample_interval, self.aggr_interval,
+ self.regions_update_interval, self.min_nr_regions,
+ self.max_nr_regions)
+
+ def record_str(self):
+ return '%s %s ' % (self.rbuf_len, self.rfile_path)
+
+ def apply(self):
+ ret = subprocess.call('echo %s > %s' % (self.attr_str(), debugfs_attrs),
+ shell=True, executable='/bin/bash')
+ if ret:
+ return ret
+ ret = subprocess.call('echo %s > %s' % (self.record_str(),
+ debugfs_record), shell=True, executable='/bin/bash')
+ if ret:
+ return ret
+
+def current_attrs():
+ with open(debugfs_attrs, 'r') as f:
+ attrs = f.read().split()
+ attrs = [int(x) for x in attrs]
+
+ with open(debugfs_record, 'r') as f:
+ rattrs = f.read().split()
+ attrs.append(int(rattrs[0]))
+ attrs.append(rattrs[1])
+
+ return Attrs(*attrs)
+
+def chk_update_debugfs(debugfs):
+ global debugfs_attrs
+ global debugfs_record
+ global debugfs_pids
+ global debugfs_monitor_on
+
+ debugfs_damon = os.path.join(debugfs, 'damon')
+ debugfs_attrs = os.path.join(debugfs_damon, 'attrs')
+ debugfs_record = os.path.join(debugfs_damon, 'record')
+ debugfs_pids = os.path.join(debugfs_damon, 'pids')
+ debugfs_monitor_on = os.path.join(debugfs_damon, 'monitor_on')
+
+ if not os.path.isdir(debugfs_damon):
+ print("damon debugfs dir (%s) not found" % debugfs_damon)
+ exit(1)
+
+ for f in [debugfs_attrs, debugfs_record, debugfs_pids, debugfs_monitor_on]:
+ if not os.path.isfile(f):
+ print("damon debugfs file (%s) not found" % f)
+ exit(1)
+
+def cmd_args_to_attrs(args):
+ "Generate attributes with specified arguments"
+ sample_interval = args.sample
+ aggr_interval = args.aggr
+ regions_update_interval = args.updr
+ min_nr_regions = args.minr
+ max_nr_regions = args.maxr
+ rbuf_len = args.rbuf
+ if not os.path.isabs(args.out):
+ args.out = os.path.join(os.getcwd(), args.out)
+ rfile_path = args.out
+ return Attrs(sample_interval, aggr_interval, regions_update_interval,
+ min_nr_regions, max_nr_regions, rbuf_len, rfile_path)
+
+def set_attrs_argparser(parser):
+ parser.add_argument('-d', '--debugfs', metavar='<debugfs>', type=str,
+ default='/sys/kernel/debug', help='debugfs mounted path')
+ parser.add_argument('-s', '--sample', metavar='<interval>', type=int,
+ default=5000, help='sampling interval')
+ parser.add_argument('-a', '--aggr', metavar='<interval>', type=int,
+ default=100000, help='aggregate interval')
+ parser.add_argument('-u', '--updr', metavar='<interval>', type=int,
+ default=1000000, help='regions update interval')
+ parser.add_argument('-n', '--minr', metavar='<# regions>', type=int,
+ default=10, help='minimal number of regions')
+ parser.add_argument('-m', '--maxr', metavar='<# regions>', type=int,
+ default=1000, help='maximum number of regions')
diff --git a/tools/damon/_dist.py b/tools/damon/_dist.py
new file mode 100644
index 000000000000..9851ec964e5c
--- /dev/null
+++ b/tools/damon/_dist.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import os
+import struct
+import subprocess
+
+def access_patterns(f):
+ nr_regions = struct.unpack('I', f.read(4))[0]
+
+ patterns = []
+ for r in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ nr_accesses = struct.unpack('I', f.read(4))[0]
+ patterns.append([eaddr - saddr, nr_accesses])
+ return patterns
+
+def plot_dist(data_file, output_file, xlabel, ylabel):
+ terminal = output_file.split('.')[-1]
+ if not terminal in ['pdf', 'jpeg', 'png', 'svg']:
+ os.remove(data_file)
+ print("Unsupported plot output type.")
+ exit(-1)
+
+ gnuplot_cmd = """
+ set term %s;
+ set output '%s';
+ set key off;
+ set xlabel '%s';
+ set ylabel '%s';
+ plot '%s' with linespoints;""" % (terminal, output_file, xlabel, ylabel,
+ data_file)
+ subprocess.call(['gnuplot', '-e', gnuplot_cmd])
+ os.remove(data_file)
+
diff --git a/tools/damon/_recfile.py b/tools/damon/_recfile.py
new file mode 100644
index 000000000000..331b4d8165d8
--- /dev/null
+++ b/tools/damon/_recfile.py
@@ -0,0 +1,23 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import struct
+
+fmt_version = 0
+
+def set_fmt_version(f):
+ global fmt_version
+
+ mark = f.read(16)
+ if mark == b'damon_recfmt_ver':
+ fmt_version = struct.unpack('i', f.read(4))[0]
+ else:
+ fmt_version = 0
+ f.seek(0)
+ return fmt_version
+
+def pid(f):
+ if fmt_version == 0:
+ return struct.unpack('L', f.read(8))[0]
+ else:
+ return struct.unpack('i', f.read(4))[0]
diff --git a/tools/damon/bin2txt.py b/tools/damon/bin2txt.py
new file mode 100644
index 000000000000..8b9b57a0d727
--- /dev/null
+++ b/tools/damon/bin2txt.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+import os
+import struct
+import sys
+
+import _recfile
+
+def parse_time(bindat):
+ "bindat should be 16 bytes"
+ sec = struct.unpack('l', bindat[0:8])[0]
+ nsec = struct.unpack('l', bindat[8:16])[0]
+ return sec * 1000000000 + nsec
+
+def pr_region(f):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ nr_accesses = struct.unpack('I', f.read(4))[0]
+ print("%012x-%012x(%10d):\t%d" %
+ (saddr, eaddr, eaddr - saddr, nr_accesses))
+
+def pr_task_info(f):
+ pid = _recfile.pid(f)
+ print("pid: ", pid)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ print("nr_regions: ", nr_regions)
+ for r in range(nr_regions):
+ pr_region(f)
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ file_path = args.input
+
+ if not os.path.isfile(file_path):
+ print('input file (%s) does not exist' % file_path)
+ exit(1)
+
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ time = parse_time(timebin)
+ if not start_time:
+ start_time = time
+ print("start_time: ", start_time)
+ print("rel time: %16d" % (time - start_time))
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ print("nr_tasks: ", nr_tasks)
+ for t in range(nr_tasks):
+ pr_task_info(f)
+ print("")
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/damo b/tools/damon/damo
new file mode 100755
index 000000000000..58e1099ae5fc
--- /dev/null
+++ b/tools/damon/damo
@@ -0,0 +1,37 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+
+import record
+import report
+
+class SubCmdHelpFormatter(argparse.RawDescriptionHelpFormatter):
+ def _format_action(self, action):
+ parts = super(argparse.RawDescriptionHelpFormatter,
+ self)._format_action(action)
+ # skip sub parsers help
+ if action.nargs == argparse.PARSER:
+ parts = '\n'.join(parts.split('\n')[1:])
+ return parts
+
+parser = argparse.ArgumentParser(formatter_class=SubCmdHelpFormatter)
+
+subparser = parser.add_subparsers(title='command', dest='command',
+ metavar='<command>')
+subparser.required = True
+
+parser_record = subparser.add_parser('record',
+ help='record data accesses of the given target processes')
+record.set_argparser(parser_record)
+
+parser_report = subparser.add_parser('report',
+ help='report the recorded data accesses in the specified form')
+report.set_argparser(parser_report)
+
+args = parser.parse_args()
+
+if args.command == 'record':
+ record.main(args)
+elif args.command == 'report':
+ report.main(args)
diff --git a/tools/damon/heats.py b/tools/damon/heats.py
new file mode 100644
index 000000000000..99837083874e
--- /dev/null
+++ b/tools/damon/heats.py
@@ -0,0 +1,362 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Transform binary trace data into human readable text that can be used for
+heatmap drawing, or directly plot the data in a heatmap format.
+
+Format of the text is:
+
+ <time> <space> <heat>
+ ...
+
+"""
+
+import argparse
+import os
+import struct
+import subprocess
+import sys
+import tempfile
+
+import _recfile
+
+class HeatSample:
+ space_idx = None
+ sz_time_space = None
+ heat = None
+
+ def __init__(self, space_idx, sz_time_space, heat):
+ if sz_time_space < 0:
+ raise RuntimeError()
+ self.space_idx = space_idx
+ self.sz_time_space = sz_time_space
+ self.heat = heat
+
+ def total_heat(self):
+ return self.heat * self.sz_time_space
+
+ def merge(self, sample):
+ "sample must have the same space idx as self"
+ heat_sum = self.total_heat() + sample.total_heat()
+ self.heat = heat_sum / (self.sz_time_space + sample.sz_time_space)
+ self.sz_time_space += sample.sz_time_space
+
+def pr_samples(samples, time_idx, time_unit, region_unit):
+ display_time = time_idx * time_unit
+ for idx, sample in enumerate(samples):
+ display_addr = idx * region_unit
+ if not sample:
+ print("%s\t%s\t%s" % (display_time, display_addr, 0.0))
+ continue
+ print("%s\t%s\t%s" % (display_time, display_addr, sample.total_heat() /
+ time_unit / region_unit))
+
+def to_idx(value, min_, unit):
+ return (value - min_) // unit
+
+def read_task_heats(f, pid, aunit, amin, amax):
+ pid_ = _recfile.pid(f)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ if pid_ != pid:
+ f.read(20 * nr_regions)
+ return None
+ samples = []
+ for i in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ eaddr = min(eaddr, amax - 1)
+ heat = struct.unpack('I', f.read(4))[0]
+
+ if eaddr <= amin:
+ continue
+ if saddr >= amax:
+ continue
+ saddr = max(amin, saddr)
+ eaddr = min(amax, eaddr)
+
+ sidx = to_idx(saddr, amin, aunit)
+ eidx = to_idx(eaddr - 1, amin, aunit)
+ for idx in range(sidx, eidx + 1):
+ sa = max(amin + idx * aunit, saddr)
+ ea = min(amin + (idx + 1) * aunit, eaddr)
+ sample = HeatSample(idx, (ea - sa), heat)
+ samples.append(sample)
+ return samples
+
+def parse_time(bindat):
+ sec = struct.unpack('l', bindat[0:8])[0]
+ nsec = struct.unpack('l', bindat[8:16])[0]
+ return sec * 1000000000 + nsec
+
+def apply_samples(target_samples, samples, start_time, end_time, aunit, amin):
+ for s in samples:
+ sample = HeatSample(s.space_idx,
+ s.sz_time_space * (end_time - start_time), s.heat)
+ idx = sample.space_idx
+ if not target_samples[idx]:
+ target_samples[idx] = sample
+ else:
+ target_samples[idx].merge(sample)
+
+def __pr_heats(f, pid, tunit, tmin, tmax, aunit, amin, amax):
+ heat_samples = [None] * ((amax - amin) // aunit)
+
+ start_time = 0
+ end_time = 0
+ last_flushed = -1
+ while True:
+ start_time = end_time
+ timebin = f.read(16)
+ if (len(timebin)) != 16:
+ break
+ end_time = parse_time(timebin)
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ samples_set = {}
+ for t in range(nr_tasks):
+ samples = read_task_heats(f, pid, aunit, amin, amax)
+ if samples:
+ samples_set[pid] = samples
+ if not pid in samples_set:
+ continue
+ if start_time >= tmax:
+ continue
+ if end_time <= tmin:
+ continue
+ start_time = max(start_time, tmin)
+ end_time = min(end_time, tmax)
+
+ sidx = to_idx(start_time, tmin, tunit)
+ eidx = to_idx(end_time - 1, tmin, tunit)
+ for idx in range(sidx, eidx + 1):
+ if idx != last_flushed:
+ pr_samples(heat_samples, idx, tunit, aunit)
+ heat_samples = [None] * ((amax - amin) // aunit)
+ last_flushed = idx
+ st = max(start_time, tmin + idx * tunit)
+ et = min(end_time, tmin + (idx + 1) * tunit)
+ apply_samples(heat_samples, samples_set[pid], st, et, aunit, amin)
+
+def pr_heats(args):
+ binfile = args.input
+ pid = args.pid
+ tres = args.tres
+ tmin = args.tmin
+ ares = args.ares
+ amin = args.amin
+
+ tunit = (args.tmax - tmin) // tres
+ aunit = (args.amax - amin) // ares
+
+ # Compensate the values so that those fit with the resolution
+ tmax = tmin + tunit * tres
+ amax = amin + aunit * ares
+
+ with open(binfile, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ __pr_heats(f, pid, tunit, tmin, tmax, aunit, amin, amax)
+
+class GuideInfo:
+ pid = None
+ start_time = None
+ end_time = None
+ lowest_addr = None
+ highest_addr = None
+ gaps = None
+
+ def __init__(self, pid, start_time):
+ self.pid = pid
+ self.start_time = start_time
+ self.gaps = []
+
+ def regions(self):
+ regions = []
+ region = [self.lowest_addr]
+ for gap in self.gaps:
+ for idx, point in enumerate(gap):
+ if idx == 0:
+ region.append(point)
+ regions.append(region)
+ else:
+ region = [point]
+ region.append(self.highest_addr)
+ regions.append(region)
+ return regions
+
+ def total_space(self):
+ ret = 0
+ for r in self.regions():
+ ret += r[1] - r[0]
+ return ret
+
+ def __str__(self):
+ lines = ['pid:%d' % self.pid]
+ lines.append('time: %d-%d (%d)' % (self.start_time, self.end_time,
+ self.end_time - self.start_time))
+ for idx, region in enumerate(self.regions()):
+ lines.append('region\t%2d: %020d-%020d (%d)' %
+ (idx, region[0], region[1], region[1] - region[0]))
+ return '\n'.join(lines)
+
+def is_overlap(region1, region2):
+ if region1[1] < region2[0]:
+ return False
+ if region2[1] < region1[0]:
+ return False
+ return True
+
+def overlap_region_of(region1, region2):
+ return [max(region1[0], region2[0]), min(region1[1], region2[1])]
+
+def overlapping_regions(regions1, regions2):
+ overlap_regions = []
+ for r1 in regions1:
+ for r2 in regions2:
+ if is_overlap(r1, r2):
+ r1 = overlap_region_of(r1, r2)
+ if r1:
+ overlap_regions.append(r1)
+ return overlap_regions
+
+def get_guide_info(binfile):
+ "Read file, return the set of guide information objects of the data"
+ guides = {}
+ with open(binfile, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ monitor_time = parse_time(timebin)
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ pid = _recfile.pid(f)
+ nr_regions = struct.unpack('I', f.read(4))[0]
+ if not pid in guides:
+ guides[pid] = GuideInfo(pid, monitor_time)
+ guide = guides[pid]
+ guide.end_time = monitor_time
+
+ last_addr = None
+ gaps = []
+ for r in range(nr_regions):
+ saddr = struct.unpack('L', f.read(8))[0]
+ eaddr = struct.unpack('L', f.read(8))[0]
+ f.read(4)
+
+ if not guide.lowest_addr or saddr < guide.lowest_addr:
+ guide.lowest_addr = saddr
+ if not guide.highest_addr or eaddr > guide.highest_addr:
+ guide.highest_addr = eaddr
+
+ if not last_addr:
+ last_addr = eaddr
+ continue
+ if last_addr != saddr:
+ gaps.append([last_addr, saddr])
+ last_addr = eaddr
+
+ if not guide.gaps:
+ guide.gaps = gaps
+ else:
+ guide.gaps = overlapping_regions(guide.gaps, gaps)
+ return sorted(list(guides.values()), key=lambda x: x.total_space(),
+ reverse=True)
+
+def pr_guide(binfile):
+ for guide in get_guide_info(binfile):
+ print(guide)
+
+def region_sort_key(region):
+ return region[1] - region[0]
+
+def set_missed_args(args):
+ if args.pid and args.tmin and args.tmax and args.amin and args.amax:
+ return
+ guides = get_guide_info(args.input)
+ guide = guides[0]
+ if not args.pid:
+ args.pid = guide.pid
+ for g in guides:
+ if g.pid == args.pid:
+ guide = g
+ break
+
+ if not args.tmin:
+ args.tmin = guide.start_time
+ if not args.tmax:
+ args.tmax = guide.end_time
+
+ if not args.amin or not args.amax:
+ region = sorted(guide.regions(), key=lambda x: x[1] - x[0],
+ reverse=True)[0]
+ args.amin = region[0]
+ args.amax = region[1]
+
+def plot_heatmap(data_file, output_file):
+ terminal = output_file.split('.')[-1]
+ if not terminal in ['pdf', 'jpeg', 'png', 'svg']:
+ os.remove(data_file)
+ print("Unsupported plot output type.")
+ exit(-1)
+
+ gnuplot_cmd = """
+ set term %s;
+ set output '%s';
+ set key off;
+ set xrange [0:];
+ set yrange [0:];
+ set xlabel 'Time (ns)';
+ set ylabel 'Virtual Address (bytes)';
+ plot '%s' using 1:2:3 with image;""" % (terminal, output_file, data_file)
+ subprocess.call(['gnuplot', '-e', gnuplot_cmd])
+ os.remove(data_file)
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--pid', metavar='<pid>', type=int,
+ help='pid of target task')
+ parser.add_argument('--tres', metavar='<resolution>', type=int,
+ default=500, help='time resolution of the output')
+ parser.add_argument('--tmin', metavar='<time>', type=lambda x: int(x,0),
+ help='minimal time of the output')
+ parser.add_argument('--tmax', metavar='<time>', type=lambda x: int(x,0),
+ help='maximum time of the output')
+ parser.add_argument('--ares', metavar='<resolution>', type=int, default=500,
+ help='space address resolution of the output')
+ parser.add_argument('--amin', metavar='<address>', type=lambda x: int(x,0),
+ help='minimal space address of the output')
+ parser.add_argument('--amax', metavar='<address>', type=lambda x: int(x,0),
+ help='maximum space address of the output')
+ parser.add_argument('--guide', action='store_true',
+ help='print a guidance for the min/max/resolution settings')
+ parser.add_argument('--heatmap', metavar='<file>', type=str,
+ help='heatmap image file to create')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ if args.guide:
+ pr_guide(args.input)
+ else:
+ set_missed_args(args)
+ orig_stdout = sys.stdout
+ if args.heatmap:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ pr_heats(args)
+
+ if args.heatmap:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ plot_heatmap(tmp_path, args.heatmap)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/nr_regions.py b/tools/damon/nr_regions.py
new file mode 100644
index 000000000000..655ee50a7b8d
--- /dev/null
+++ b/tools/damon/nr_regions.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"Print out distribution of the number of regions in the given record"
+
+import argparse
+import struct
+import sys
+import tempfile
+
+import _dist
+import _recfile
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--range', '-r', type=int, nargs=3,
+ metavar=('<start>', '<stop>', '<step>'),
+ help='range of percentiles to print')
+ parser.add_argument('--sortby', '-s', choices=['time', 'size'],
+ help='the metric to be used for sorting the number of regions')
+ parser.add_argument('--plot', '-p', type=str, metavar='<file>',
+ help='plot the distribution to an image file')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ percentiles = [0, 25, 50, 75, 100]
+
+ file_path = args.input
+ if args.range:
+ percentiles = range(args.range[0], args.range[1], args.range[2])
+ nr_regions_sort = True
+ if args.sortby == 'time':
+ nr_regions_sort = False
+
+ pid_pattern_map = {}
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ pid = _recfile.pid(f)
+ if not pid in pid_pattern_map:
+ pid_pattern_map[pid] = []
+ pid_pattern_map[pid].append(_dist.access_patterns(f))
+
+ orig_stdout = sys.stdout
+ if args.plot:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ print('# <percentile> <# regions>')
+ for pid in pid_pattern_map.keys():
+ # Skip first 20 snapshots as regions may not be adaptively adjusted yet
+ snapshots = pid_pattern_map[pid][20:]
+ nr_regions_dist = []
+ for snapshot in snapshots:
+ nr_regions_dist.append(len(snapshot))
+ if nr_regions_sort:
+ nr_regions_dist.sort(reverse=False)
+
+ print('# pid\t%s' % pid)
+ print('# avr:\t%d' % (sum(nr_regions_dist) / len(nr_regions_dist)))
+ for percentile in percentiles:
+ thres_idx = int(percentile / 100.0 * len(nr_regions_dist))
+ if thres_idx == len(nr_regions_dist):
+ thres_idx -= 1
+ threshold = nr_regions_dist[thres_idx]
+ print('%d\t%d' % (percentile, nr_regions_dist[thres_idx]))
+
+ if args.plot:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ xlabel = 'runtime (percent)'
+ if nr_regions_sort:
+ xlabel = 'percentile'
+ _dist.plot_dist(tmp_path, args.plot, xlabel,
+ 'number of monitoring target regions')
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/record.py b/tools/damon/record.py
new file mode 100644
index 000000000000..44fa3a12af35
--- /dev/null
+++ b/tools/damon/record.py
@@ -0,0 +1,106 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"""
+Record data access patterns of the target process.
+"""
+
+import argparse
+import os
+import signal
+import subprocess
+import time
+
+import _damon
+
+def do_record(target, is_target_cmd, attrs, old_attrs):
+ if os.path.isfile(attrs.rfile_path):
+ os.rename(attrs.rfile_path, attrs.rfile_path + '.old')
+
+ if attrs.apply():
+ print('attributes (%s) failed to be applied' % attrs)
+ cleanup_exit(old_attrs, -1)
+ print('# damon attrs: %s %s' % (attrs.attr_str(), attrs.record_str()))
+ if is_target_cmd:
+ p = subprocess.Popen(target, shell=True, executable='/bin/bash')
+ target = p.pid
+ if _damon.set_target_pid(target):
+ print('pid setting (%s) failed' % target)
+ cleanup_exit(old_attrs, -2)
+ if _damon.turn_damon('on'):
+ print('could not turn on damon')
+ cleanup_exit(old_attrs, -3)
+ while not _damon.is_damon_running():
+ time.sleep(1)
+ print('Press Ctrl+C to stop')
+ if is_target_cmd:
+ p.wait()
+ while True:
+ # damon will turn it off by itself if the target tasks are terminated.
+ if not _damon.is_damon_running():
+ break
+ time.sleep(1)
+
+ cleanup_exit(old_attrs, 0)
+
+def cleanup_exit(orig_attrs, exit_code):
+ if _damon.is_damon_running():
+ if _damon.turn_damon('off'):
+ print('failed to turn damon off!')
+ while _damon.is_damon_running():
+ time.sleep(1)
+ if orig_attrs:
+ if orig_attrs.apply():
+ print('original attributes (%s) restoration failed!' % orig_attrs)
+ exit(exit_code)
+
+def sighandler(signum, frame):
+ print('\nsignal %s received' % signum)
+ cleanup_exit(orig_attrs, signum)
+
+def chk_permission():
+ if os.geteuid() != 0:
+ print("Run as root")
+ exit(1)
+
+def set_argparser(parser):
+ _damon.set_attrs_argparser(parser)
+ parser.add_argument('target', type=str, metavar='<target>',
+ help='the target command or the pid to record')
+ parser.add_argument('-l', '--rbuf', metavar='<len>', type=int,
+ default=1024*1024, help='length of record result buffer')
+ parser.add_argument('-o', '--out', metavar='<file path>', type=str,
+ default='damon.data', help='output file path')
+
+def main(args=None):
+ global orig_attrs
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ chk_permission()
+ _damon.chk_update_debugfs(args.debugfs)
+
+ signal.signal(signal.SIGINT, sighandler)
+ signal.signal(signal.SIGTERM, sighandler)
+ orig_attrs = _damon.current_attrs()
+
+ args.schemes = ''
+ new_attrs = _damon.cmd_args_to_attrs(args)
+ target = args.target
+
+ target_fields = target.split()
+ if not subprocess.call('which %s > /dev/null' % target_fields[0],
+ shell=True, executable='/bin/bash'):
+ do_record(target, True, new_attrs, orig_attrs)
+ else:
+ try:
+ pid = int(target)
+ except ValueError:
+ print('target \'%s\' is neither a command, nor a pid' % target)
+ exit(1)
+ do_record(target, False, new_attrs, orig_attrs)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/report.py b/tools/damon/report.py
new file mode 100644
index 000000000000..c661c7b2f1af
--- /dev/null
+++ b/tools/damon/report.py
@@ -0,0 +1,45 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import argparse
+
+import bin2txt
+import heats
+import nr_regions
+import wss
+
+def set_argparser(parser):
+ subparsers = parser.add_subparsers(title='report type', dest='report_type',
+ metavar='<report type>', help='the type of the report to generate')
+ subparsers.required = True
+
+ parser_raw = subparsers.add_parser('raw', help='human readable raw data')
+ bin2txt.set_argparser(parser_raw)
+
+ parser_heats = subparsers.add_parser('heats', help='heats of regions')
+ heats.set_argparser(parser_heats)
+
+ parser_wss = subparsers.add_parser('wss', help='working set size')
+ wss.set_argparser(parser_wss)
+
+ parser_nr_regions = subparsers.add_parser('nr_regions',
+ help='number of regions')
+ nr_regions.set_argparser(parser_nr_regions)
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ if args.report_type == 'raw':
+ bin2txt.main(args)
+ elif args.report_type == 'heats':
+ heats.main(args)
+ elif args.report_type == 'wss':
+ wss.main(args)
+ elif args.report_type == 'nr_regions':
+ nr_regions.main(args)
+
+if __name__ == '__main__':
+ main()
diff --git a/tools/damon/wss.py b/tools/damon/wss.py
new file mode 100644
index 000000000000..0517c1c57d4d
--- /dev/null
+++ b/tools/damon/wss.py
@@ -0,0 +1,100 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"Print out the distribution of the working set sizes of the given trace"
+
+import argparse
+import struct
+import sys
+import tempfile
+
+import _dist
+import _recfile
+
+def set_argparser(parser):
+ parser.add_argument('--input', '-i', type=str, metavar='<file>',
+ default='damon.data', help='input file name')
+ parser.add_argument('--range', '-r', type=int, nargs=3,
+ metavar=('<start>', '<stop>', '<step>'),
+ help='range of wss percentiles to print')
+ parser.add_argument('--thres', '-t', type=int, default=1,
+ metavar='<# accesses>',
+ help='minimal number of accesses for a region to be treated as working set')
+ parser.add_argument('--sortby', '-s', choices=['time', 'size'],
+ help='the metric to be used for the sort of the working set sizes')
+ parser.add_argument('--plot', '-p', type=str, metavar='<file>',
+ help='plot the distribution to an image file')
+
+def main(args=None):
+ if not args:
+ parser = argparse.ArgumentParser()
+ set_argparser(parser)
+ args = parser.parse_args()
+
+ percentiles = [0, 25, 50, 75, 100]
+
+ file_path = args.input
+ if args.range:
+ percentiles = range(args.range[0], args.range[1], args.range[2])
+ wss_sort = True
+ if args.sortby == 'time':
+ wss_sort = False
+
+ pid_pattern_map = {}
+ with open(file_path, 'rb') as f:
+ _recfile.set_fmt_version(f)
+ start_time = None
+ while True:
+ timebin = f.read(16)
+ if len(timebin) != 16:
+ break
+ nr_tasks = struct.unpack('I', f.read(4))[0]
+ for t in range(nr_tasks):
+ pid = _recfile.pid(f)
+ if not pid in pid_pattern_map:
+ pid_pattern_map[pid] = []
+ pid_pattern_map[pid].append(_dist.access_patterns(f))
+
+ orig_stdout = sys.stdout
+ if args.plot:
+ tmp_path = tempfile.mkstemp()[1]
+ tmp_file = open(tmp_path, 'w')
+ sys.stdout = tmp_file
+
+ print('# <percentile> <wss>')
+ for pid in pid_pattern_map.keys():
+ # Skip first 20 snapshots as regions may not be adjusted yet.
+ snapshots = pid_pattern_map[pid][20:]
+ wss_dist = []
+ for snapshot in snapshots:
+ wss = 0
+ for p in snapshot:
+ # Ignore regions not accessed
+ if p[1] < args.thres:
+ continue
+ wss += p[0]
+ wss_dist.append(wss)
+ if wss_sort:
+ wss_dist.sort(reverse=False)
+
+ print('# pid\t%s' % pid)
+ print('# avr:\t%d' % (sum(wss_dist) / len(wss_dist)))
+ for percentile in percentiles:
+ thres_idx = int(percentile / 100.0 * len(wss_dist))
+ if thres_idx == len(wss_dist):
+ thres_idx -= 1
+ threshold = wss_dist[thres_idx]
+ print('%d\t%d' % (percentile, wss_dist[thres_idx]))
+
+ if args.plot:
+ sys.stdout = orig_stdout
+ tmp_file.flush()
+ tmp_file.close()
+ xlabel = 'runtime (percent)'
+ if wss_sort:
+ xlabel = 'percentile'
+ _dist.plot_dist(tmp_path, args.plot, xlabel,
+ 'working set size (bytes)')
+
+if __name__ == '__main__':
+ main()
--
2.17.1
From: SeongJae Park <[email protected]>
Even if the initial monitoring target regions are well constructed to
fulfill the assumption (pages in the same region have similar access
frequencies), the data access pattern can change dynamically. This
will result in low monitoring quality. To keep the assumption valid as
much as possible, DAMON adaptively merges and splits each region based
on its access frequency.
For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges those if the frequency difference is small.
Then, after it reports and clears the aggregated access frequency of
each region, it splits each region into two or three regions if the
total number of regions will not exceed the user-specified maximum
number of regions after the split.
In this way, DAMON provides its best-effort quality and minimal overhead
while keeping the upper-bound overhead that users set.
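The merge and split steps described above can be illustrated with a small
user-space model. This is only a sketch under stated assumptions, not the
kernel implementation: regions are modeled as hypothetical
``(start, end, nr_accesses)`` tuples, ``MIN_REGION`` is assumed to be 4096
bytes, and the function names are made up for illustration. Splitting is
modeled as happening after the counters are cleared, so new pieces start
with a zero access count.

```python
import random

MIN_REGION = 4096  # assumption: minimum region size of one 4 KiB page


def merge_regions(regions, thres, sz_limit):
    """Merge adjacent regions whose access counts differ by at most thres."""
    merged = []
    for start, end, nr in regions:
        if merged:
            pstart, pend, pnr = merged[-1]
            if (pend == start and abs(pnr - nr) <= thres
                    and end - pstart <= sz_limit):
                psz, sz = pend - pstart, end - start
                # The merged region gets the size-weighted average count.
                merged[-1] = (pstart, end,
                              (pnr * psz + nr * sz) // (psz + sz))
                continue
        merged.append((start, end, nr))
    return merged


def split_regions(regions, max_nr_regions, nr_subs=2, rand=random.randint):
    """Split each region into up to nr_subs pieces if the budget allows."""
    if len(regions) > max_nr_regions // 2:
        return list(regions)
    out = []
    for start, end, _ in regions:
        sz_region = end - start
        tail = []  # pieces cut off the right side, collected right to left
        for _i in range(nr_subs - 1):
            if sz_region <= 2 * MIN_REGION:
                break
            # Left piece takes 10% to 90% of the region, page aligned.
            sz_sub = rand(1, 9) * sz_region // 10 // MIN_REGION * MIN_REGION
            if sz_sub == 0 or sz_sub >= sz_region:
                continue
            tail.append((start + sz_sub, start + sz_region, 0))
            sz_region = sz_sub
        out.append((start, start + sz_region, 0))
        out.extend(reversed(tail))
    return out
```

In this model, a merge that turns out to be wrong is undone by a later split,
and an unnecessary split is undone by a later merge, which is the feedback
loop the commit message relies on.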
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 11 ++-
mm/damon.c | 191 ++++++++++++++++++++++++++++++++++++++++--
2 files changed, 189 insertions(+), 13 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 7adc7b6b3507..97ddc74e207f 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -61,7 +61,8 @@ struct damon_task {
*
* @sample_interval: The time between access samplings.
* @aggr_interval: The time between monitor results aggregations.
- * @nr_regions: The number of monitoring regions.
+ * @min_nr_regions: The minimum number of monitoring regions.
+ * @max_nr_regions: The maximum number of monitoring regions.
*
* For each @sample_interval, DAMON checks whether each region is accessed or
* not. It aggregates and keeps the access information (number of accesses to
@@ -114,7 +115,8 @@ struct damon_task {
struct damon_ctx {
unsigned long sample_interval;
unsigned long aggr_interval;
- unsigned long nr_regions;
+ unsigned long min_nr_regions;
+ unsigned long max_nr_regions;
struct timespec64 last_aggregation;
@@ -133,8 +135,9 @@ struct damon_ctx {
};
int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
-int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
- unsigned long aggr_int, unsigned long min_nr_reg);
+int damon_set_attrs(struct damon_ctx *ctx,
+ unsigned long sample_int, unsigned long aggr_int,
+ unsigned long min_nr_reg, unsigned long max_nr_reg);
int damon_start(struct damon_ctx *ctx);
int damon_stop(struct damon_ctx *ctx);
diff --git a/mm/damon.c b/mm/damon.c
index 29d82c2d65be..02bc7542a76f 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -176,6 +176,26 @@ static unsigned int nr_damon_regions(struct damon_task *t)
return nr_regions;
}
+/* Returns the size upper limit for each monitoring region */
+static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+ unsigned long sz = 0;
+
+ damon_for_each_task(t, ctx) {
+ damon_for_each_region(r, t)
+ sz += r->ar.end - r->ar.start;
+ }
+
+ if (ctx->min_nr_regions)
+ sz /= ctx->min_nr_regions;
+ if (sz < MIN_REGION)
+ sz = MIN_REGION;
+
+ return sz;
+}
+
/*
* Functions for DAMON core logics and features
*/
@@ -226,6 +246,145 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
}
}
+#define sz_damon_region(r) (r->ar.end - r->ar.start)
+
+/*
+ * Merge two adjacent regions into one region
+ */
+static void damon_merge_two_regions(struct damon_region *l,
+ struct damon_region *r)
+{
+ l->nr_accesses = (l->nr_accesses * sz_damon_region(l) +
+ r->nr_accesses * sz_damon_region(r)) /
+ (sz_damon_region(l) + sz_damon_region(r));
+ l->ar.end = r->ar.end;
+ damon_destroy_region(r);
+}
+
+#define diff_of(a, b) (a > b ? a - b : b - a)
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * t task affected by merge operation
+ * thres '->nr_accesses' diff threshold for the merge
+ * sz_limit size upper limit of each region
+ */
+static void damon_merge_regions_of(struct damon_task *t, unsigned int thres,
+ unsigned long sz_limit)
+{
+ struct damon_region *r, *prev = NULL, *next;
+
+ damon_for_each_region_safe(r, next, t) {
+ if (prev && prev->ar.end == r->ar.start &&
+ diff_of(prev->nr_accesses, r->nr_accesses) <= thres &&
+ sz_damon_region(prev) + sz_damon_region(r) <= sz_limit)
+ damon_merge_two_regions(prev, r);
+ else
+ prev = r;
+ }
+}
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * threshold '->nr_accesses' diff threshold for the merge
+ * sz_limit size upper limit of each region
+ *
+ * This function merges monitoring target regions that are adjacent and have
+ * similar access frequencies. This is for minimizing the monitoring
+ * overhead under dynamically changing access patterns. If a merge was
+ * unnecessarily made, a later 'kdamond_split_regions()' will revert it.
+ */
+static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold,
+ unsigned long sz_limit)
+{
+ struct damon_task *t;
+
+ damon_for_each_task(t, c)
+ damon_merge_regions_of(t, threshold, sz_limit);
+}
+
+/*
+ * Split a region in two
+ *
+ * r the region to be split
+ * sz_r size of the first sub-region that will be made
+ */
+static void damon_split_region_at(struct damon_ctx *ctx,
+ struct damon_region *r, unsigned long sz_r)
+{
+ struct damon_region *new;
+
+ new = damon_new_region(r->ar.start + sz_r, r->ar.end);
+ r->ar.end = new->ar.start;
+
+ damon_insert_region(new, r, damon_next_region(r));
+}
+
+/* Split every region in the given task into 'nr_subs' regions */
+static void damon_split_regions_of(struct damon_ctx *ctx,
+ struct damon_task *t, int nr_subs)
+{
+ struct damon_region *r, *next;
+ unsigned long sz_region, sz_sub = 0;
+ int i;
+
+ damon_for_each_region_safe(r, next, t) {
+ sz_region = r->ar.end - r->ar.start;
+
+ for (i = 0; i < nr_subs - 1 &&
+ sz_region > 2 * MIN_REGION; i++) {
+ /*
+ * Randomly select size of left sub-region to be at
+ * least 10 percent and at most 90% of original region
+ */
+ sz_sub = ALIGN_DOWN(damon_rand(1, 10) *
+ sz_region / 10, MIN_REGION);
+ /* Do not allow blank region */
+ if (sz_sub == 0 || sz_sub >= sz_region)
+ continue;
+
+ damon_split_region_at(ctx, r, sz_sub);
+ sz_region = sz_sub;
+ }
+ }
+}
+
+/*
+ * Split every target region into randomly-sized small regions
+ *
+ * This function splits every target region into randomly-sized small regions
+ * if the current total number of regions is equal to or smaller than half of
+ * the user-specified maximum number of regions. This is for maximizing the
+ * monitoring accuracy under the dynamically changeable access patterns. If a
+ * split was unnecessarily made, later 'kdamond_merge_regions()' will revert
+ * it.
+ */
+static void kdamond_split_regions(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ unsigned int nr_regions = 0;
+ static unsigned int last_nr_regions;
+ int nr_subregions = 2;
+
+ damon_for_each_task(t, ctx)
+ nr_regions += nr_damon_regions(t);
+
+ if (nr_regions > ctx->max_nr_regions / 2)
+ return;
+
+ /* Maybe the middle of the region has different access frequency */
+ if (last_nr_regions == nr_regions &&
+ nr_regions < ctx->max_nr_regions / 3)
+ nr_subregions = 3;
+
+ damon_for_each_task(t, ctx)
+ damon_split_regions_of(ctx, t, nr_subregions);
+
+ last_nr_regions = nr_regions;
+}
+
/*
* Check whether current monitoring should be stopped
*
@@ -269,10 +428,14 @@ static int kdamond_fn(void *data)
struct damon_ctx *ctx = (struct damon_ctx *)data;
struct damon_task *t;
struct damon_region *r, *next;
+ unsigned int max_nr_accesses = 0;
+ unsigned long sz_limit = 0;
pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
if (ctx->init_target_regions)
ctx->init_target_regions(ctx);
+ sz_limit = damon_region_sz_limit(ctx);
+
while (!kdamond_need_stop(ctx)) {
if (ctx->prepare_access_checks)
ctx->prepare_access_checks(ctx);
@@ -282,14 +445,16 @@ static int kdamond_fn(void *data)
usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
if (ctx->check_accesses)
- ctx->check_accesses(ctx);
+ max_nr_accesses = ctx->check_accesses(ctx);
if (kdamond_aggregate_interval_passed(ctx)) {
if (ctx->aggregate_cb)
ctx->aggregate_cb(ctx);
+ kdamond_merge_regions(ctx, max_nr_accesses / 10,
+ sz_limit);
kdamond_reset_aggregated(ctx);
+ kdamond_split_regions(ctx);
}
-
}
damon_for_each_task(t, ctx) {
damon_for_each_region_safe(r, next, t)
@@ -400,25 +565,33 @@ int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids)
* @ctx: monitoring context
* @sample_int: time interval between samplings
* @aggr_int: time interval between aggregations
- * @nr_reg: number of regions
+ * @min_nr_reg: minimal number of regions
+ * @max_nr_reg: maximum number of regions
*
* This function should not be called while the kdamond is running.
* Every time interval is in micro-seconds.
*
* Return: 0 on success, negative error code otherwise.
*/
-int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
- unsigned long aggr_int, unsigned long nr_reg)
+int damon_set_attrs(struct damon_ctx *ctx,
+ unsigned long sample_int, unsigned long aggr_int,
+ unsigned long min_nr_reg, unsigned long max_nr_reg)
{
- if (nr_reg < 3) {
- pr_err("nr_regions (%lu) must be at least 3\n",
- nr_reg);
+ if (min_nr_reg < 3) {
+ pr_err("min_nr_regions (%lu) must be at least 3\n",
+ min_nr_reg);
+ return -EINVAL;
+ }
+ if (min_nr_reg > max_nr_reg) {
+ pr_err("invalid nr_regions. min (%lu) > max (%lu)\n",
+ min_nr_reg, max_nr_reg);
return -EINVAL;
}
ctx->sample_interval = sample_int;
ctx->aggr_interval = aggr_int;
- ctx->nr_regions = nr_reg;
+ ctx->min_nr_regions = min_nr_reg;
+ ctx->max_nr_regions = max_nr_reg;
return 0;
}
--
2.17.1
From: SeongJae Park <[email protected]>
The monitoring target address range can be dynamically changed. For
example, virtual memory could be dynamically mapped and unmapped.
Physical memory could be hot-plugged.
As the changes could be quite frequent in some cases, DAMON checks the
dynamic memory mapping changes and applies them to the abstracted
target area only once per user-specified time interval, the ``regions
update interval``.
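The effect of the separate interval can be sketched with a simulated
monitoring loop. This is an illustrative model only: ``IntervalTimer`` and
``run_monitor`` are hypothetical names, the clock is simulated rather than
read from the system, and the elapsed-time check is an assumption about how
such a period could be tracked, loosely mirroring the ``last_regions_update``
timestamp this patch adds to the context.

```python
class IntervalTimer:
    """Tell whether interval_us has passed since the last trigger."""

    def __init__(self, interval_us, now_us=0):
        self.interval_us = interval_us
        self.last_us = now_us

    def passed(self, now_us):
        if now_us - self.last_us < self.interval_us:
            return False
        self.last_us = now_us
        return True


def run_monitor(duration_us, sample_us, update_us, update_target_regions):
    """Simulated monitoring loop; returns the number of sampling steps."""
    update_timer = IntervalTimer(update_us)
    nr_samples = 0
    for now in range(sample_us, duration_us + 1, sample_us):
        nr_samples += 1  # stands in for the per-sample access check
        # The (possibly expensive) mapping re-scan runs far less often.
        if update_timer.passed(now):
            update_target_regions(now)
    return nr_samples
```

With a 5 ms sampling interval and a 100 ms regions update interval, one
simulated second performs 200 sampling steps but calls the update callback
only 10 times.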
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 20 +++++++++++++++-----
mm/damon.c | 23 +++++++++++++++++++++--
2 files changed, 36 insertions(+), 7 deletions(-)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 97ddc74e207f..3c0b92a679e8 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -61,13 +61,16 @@ struct damon_task {
*
* @sample_interval: The time between access samplings.
* @aggr_interval: The time between monitor results aggregations.
+ * @regions_update_interval: The time between monitor regions updates.
* @min_nr_regions: The minimum number of monitoring regions.
* @max_nr_regions: The maximum number of monitoring regions.
*
* For each @sample_interval, DAMON checks whether each region is accessed or
* not. It aggregates and keeps the access information (number of accesses to
- * each region) for @aggr_interval time. All time intervals are in
- * micro-seconds.
+ * each region) for @aggr_interval time. DAMON also checks whether the target
+ * memory regions need update (e.g., by ``mmap()`` calls from the application,
+ * in case of virtual memory monitoring) and applies the changes for each
+ * @regions_update_interval. All time intervals are in micro-seconds.
*
* @kdamond: Kernel thread who does the monitoring.
* @kdamond_stop: Notifies whether kdamond should stop.
@@ -88,6 +91,7 @@ struct damon_task {
* @tasks_list: Head of monitoring target tasks (&damon_task) list.
*
* @init_target_regions: Constructs initial monitoring target regions.
+ * @update_target_regions: Updates monitoring target regions.
* @prepare_access_checks: Prepares next access check of target regions.
* @check_accesses: Checks the access of target regions.
* @sample_cb: Called for each sampling interval.
@@ -96,11 +100,14 @@ struct damon_task {
* DAMON can be extended for various address spaces by users. For this, users
* can register the target address space dependent low level functions for
* their usecases via the callback pointers of the context. The monitoring
- * thread calls @init_target_regions before starting the monitoring, and
+ * thread calls @init_target_regions before starting the monitoring,
+ * @update_target_regions for each @regions_update_interval, and
* @prepare_access_checks and @check_accesses for each @sample_interval.
*
* @init_target_regions should construct proper monitoring target regions and
* link those to the DAMON context struct.
+ * @update_target_regions should update the monitoring target regions to
+ * reflect the current status.
* @prepare_access_checks should manipulate the monitoring regions to be
* prepare for the next access check.
* @check_accesses should check the accesses to each region that made after the
@@ -115,10 +122,12 @@ struct damon_task {
struct damon_ctx {
unsigned long sample_interval;
unsigned long aggr_interval;
+ unsigned long regions_update_interval;
unsigned long min_nr_regions;
unsigned long max_nr_regions;
struct timespec64 last_aggregation;
+ struct timespec64 last_regions_update;
struct task_struct *kdamond;
bool kdamond_stop;
@@ -128,6 +137,7 @@ struct damon_ctx {
/* callbacks */
void (*init_target_regions)(struct damon_ctx *context);
+ void (*update_target_regions)(struct damon_ctx *context);
void (*prepare_access_checks)(struct damon_ctx *context);
unsigned int (*check_accesses)(struct damon_ctx *context);
void (*sample_cb)(struct damon_ctx *context);
@@ -135,8 +145,8 @@ struct damon_ctx {
};
int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
-int damon_set_attrs(struct damon_ctx *ctx,
- unsigned long sample_int, unsigned long aggr_int,
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg);
int damon_start(struct damon_ctx *ctx);
int damon_stop(struct damon_ctx *ctx);
diff --git a/mm/damon.c b/mm/damon.c
index 02bc7542a76f..b844924b9fdb 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -385,6 +385,17 @@ static void kdamond_split_regions(struct damon_ctx *ctx)
last_nr_regions = nr_regions;
}
+/*
+ * Check whether it is time to update the monitoring target regions
+ *
+ * Returns true if it is.
+ */
+static bool kdamond_need_update_regions(struct damon_ctx *ctx)
+{
+ return damon_check_reset_time_interval(&ctx->last_regions_update,
+ ctx->regions_update_interval);
+}
+
/*
* Check whether current monitoring should be stopped
*
@@ -455,6 +466,12 @@ static int kdamond_fn(void *data)
kdamond_reset_aggregated(ctx);
kdamond_split_regions(ctx);
}
+
+ if (kdamond_need_update_regions(ctx)) {
+ if (ctx->update_target_regions)
+ ctx->update_target_regions(ctx);
+ sz_limit = damon_region_sz_limit(ctx);
+ }
}
damon_for_each_task(t, ctx) {
damon_for_each_region_safe(r, next, t)
@@ -564,6 +581,7 @@ int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids)
* damon_set_attrs() - Set attributes for the monitoring.
* @ctx: monitoring context
* @sample_int: time interval between samplings
* @aggr_int: time interval between aggregations
+ * @regions_update_int: time interval between target regions updates
* @min_nr_reg: minimal number of regions
* @max_nr_reg: maximum number of regions
@@ -573,8 +591,8 @@ int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids)
*
* Return: 0 on success, negative error code otherwise.
*/
-int damon_set_attrs(struct damon_ctx *ctx,
- unsigned long sample_int, unsigned long aggr_int,
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+ unsigned long aggr_int, unsigned long regions_update_int,
unsigned long min_nr_reg, unsigned long max_nr_reg)
{
if (min_nr_reg < 3) {
@@ -590,6 +608,7 @@ int damon_set_attrs(struct damon_ctx *ctx,
ctx->sample_interval = sample_int;
ctx->aggr_interval = aggr_int;
+ ctx->regions_update_interval = regions_update_int;
ctx->min_nr_regions = min_nr_reg;
ctx->max_nr_regions = max_nr_reg;
--
2.17.1
From: SeongJae Park <[email protected]>
This commit adds documents for DAMON under
`Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
Signed-off-by: SeongJae Park <[email protected]>
---
Documentation/admin-guide/mm/damon/guide.rst | 157 ++++++++++
Documentation/admin-guide/mm/damon/index.rst | 15 +
Documentation/admin-guide/mm/damon/plans.rst | 29 ++
Documentation/admin-guide/mm/damon/start.rst | 98 ++++++
Documentation/admin-guide/mm/damon/usage.rst | 298 +++++++++++++++++++
Documentation/admin-guide/mm/index.rst | 1 +
Documentation/vm/damon/api.rst | 20 ++
Documentation/vm/damon/eval.rst | 222 ++++++++++++++
Documentation/vm/damon/faq.rst | 59 ++++
Documentation/vm/damon/index.rst | 32 ++
Documentation/vm/damon/mechanisms.rst | 165 ++++++++++
Documentation/vm/index.rst | 1 +
12 files changed, 1097 insertions(+)
create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
create mode 100644 Documentation/admin-guide/mm/damon/index.rst
create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
create mode 100644 Documentation/admin-guide/mm/damon/start.rst
create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
create mode 100644 Documentation/vm/damon/api.rst
create mode 100644 Documentation/vm/damon/eval.rst
create mode 100644 Documentation/vm/damon/faq.rst
create mode 100644 Documentation/vm/damon/index.rst
create mode 100644 Documentation/vm/damon/mechanisms.rst
diff --git a/Documentation/admin-guide/mm/damon/guide.rst b/Documentation/admin-guide/mm/damon/guide.rst
new file mode 100644
index 000000000000..c51fb843efaa
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/guide.rst
@@ -0,0 +1,157 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Optimization Guide
+==================
+
+This document helps you estimate the amount of benefit that you could get
+from DAMON-based optimizations, and describes how you could achieve it. You
+are assumed to have already read :doc:`start`.
+
+
+Check The Signs
+===============
+
+No optimization provides the same extent of benefit in every case. Therefore
+you should first estimate how much improvement you could get using DAMON. If
+some of the conditions below match your situation, consider using DAMON.
+
+- *Low IPC and High Cache Miss Ratios.* Low IPC means most of the CPU time is
+ spent waiting for the completion of time-consuming operations such as memory
+ access, while high cache miss ratios mean the caches don't help it well.
+ DAMON is not for cache level optimization, but DRAM level. However,
+ improving DRAM management will also help this case by reducing the memory
+ operation latency.
+- *Memory Over-commitment and Unknown Users.* If you are doing memory
+ overcommitment and you cannot control every user of your system, a memory
+ bank run could happen at any time. You can estimate when it will happen
+ based on DAMON's monitoring results and act earlier to avoid or deal better
+ with the crisis.
+- *Frequent Memory Pressure.* Frequent memory pressure means your system has
+ wrong configurations or memory hogs. DAMON will help you find the right
+ configuration and/or the criminals.
+- *Heterogeneous Memory System.* If your system is utilizing memory devices
+ that are placed between DRAM and traditional hard disks, such as
+ non-volatile memory or fast SSDs, DAMON could help you utilize the devices
+ more efficiently.
+
+
+Profile
+=======
+
+If you found some positive signals, you could start by profiling your workloads
+using DAMON. Find major workloads on your systems and analyze their data
+access patterns to find something that is wrong or can be improved. The DAMON
+user space tool (``damo``) will be useful for this.
+
+We recommend starting with a working set size distribution check using ``damo
+report wss``. If the distribution is non-uniform or quite different from what
+you estimated, you could consider `Memory Configuration`_ optimization.
+
+Then, review the overall access pattern in heatmap form using ``damo report
+heats``. If it shows a simple pattern consisting of a small number of memory
+regions with a high contrast of access temperature, you could consider manual
+`Program Modification`_.
+
+If you still want to absorb more benefits, you should develop `Personalized
+DAMON Application`_ for your special case.
+
+You are not limited to only one of the above approaches. You could combine
+several of them to maximize the benefit.
+
+
+Optimize
+========
+
+If the profiling result also says it's worth trying some optimization, you
+could consider the approaches below. Note that some of them assume that your
+systems are configured with swap devices or other types of auxiliary memory,
+so that you are not strictly required to accommodate the whole working set
+in the main memory. Most of the detailed optimization should be based on your
+concrete understanding of your memory devices.
+
+
+Memory Configuration
+--------------------
+
+DRAM should be just large enough to accommodate the important working sets,
+no more and no less, because DRAM is highly performance critical but expensive
+and consumes a lot of power. However, knowing the size of the really important
+working sets is difficult. As a consequence, people usually equip
+unnecessarily large or too small DRAM. Many problems stem from such wrong
+configurations.
+
+Using the working set size distribution report provided by ``damo report wss``,
+you can find the appropriate DRAM size for you. For example, roughly speaking,
+if you worry about only 95th percentile latency, you don't need to equip DRAM
+larger than the 95th percentile working set size.
+
+Let's see a real example. This `page
+<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#memory-configuration>`_
+shows the heatmap and the working set size distributions/changes of the
+``freqmine`` workload in the PARSEC3 benchmark suite. The working set size
+spikes up to 180 MiB, but stays below 50 MiB for more than 95% of the time.
+Even if you give only 50 MiB of memory space to the workload, it will work
+well for 95% of the time. Meanwhile, you can save the 130 MiB of memory space.
+
+
+Program Modification
+--------------------
+
+If the data access pattern heatmap plotted by ``damo report heats`` is simple
+enough that you can understand what is going on in the workload with your own
+eyes, you could manually optimize the memory management.
+
+For example, suppose that the workload has two big memory objects but only one
+object is frequently accessed while the other one is only occasionally
+accessed. Then, you could modify the program source code to keep the hot
+object in the main memory by invoking ``mlock()`` or ``madvise()`` with
+``MADV_WILLNEED``. Or, you could proactively evict the cold object using
+``madvise()`` with ``MADV_COLD`` or ``MADV_PAGEOUT``. Using both together
+would also be worthwhile.
+
+A research work [1]_ using ``mlock()`` achieved up to 2.55x performance
+speedup.
+
+Let's see another realistic example access pattern for this kind of
+optimizations. This `page
+<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#program-modification>`_
+shows the visualized access patterns of the ``streamcluster`` workload in the
+PARSEC3 benchmark suite. We can easily identify the 100 MiB sized hot object.
+
+
+Personalized DAMON Application
+------------------------------
+
+The above approaches will work well for many general cases, but would not be
+enough for some special cases.
+
+If this is the case, it might be the time to forget the comfortable use of the
+user space tool and dive into the debugfs interface (refer to :doc:`usage` for
+the detail) of DAMON. Using the interface, you can control DAMON more
+flexibly. Therefore, you can write your personalized DAMON application that
+controls the monitoring via the debugfs interface, analyzes the result, and
+applies complex optimizations itself. Using this, you can make more creative
+and wise optimizations.
+
+If you are a kernel space programmer, writing kernel space DAMON applications
+using the API (refer to the :doc:`/vm/damon/api` for more detail) would be an
+option.
+
+
+Reference Practices
+===================
+
+Referring to previous successful practices could help you get a sense of
+this kind of optimization. There is an academic paper [1]_
+reporting the visualized access pattern and manual `Program
+Modification`_ results for a number of realistic workloads. You can also get
+the visualized access patterns [3]_ [4]_ [5]_ and automated DAMON-based memory
+operations results for other realistic workloads that were collected with the
+latest version of DAMON [2]_.
+
+.. [1] https://dl.acm.org/doi/10.1145/3366626.3368125
+.. [2] https://damonitor.github.io/test/result/perf/latest/html/
+.. [3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+.. [5] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
new file mode 100644
index 000000000000..0baae7a5402b
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/index.rst
@@ -0,0 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Monitoring Data Accesses
+========================
+
+:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+Using this, users can analyze and optimize their systems.
+
+.. toctree::
+ :maxdepth: 2
+
+ start
+ guide
+ usage
diff --git a/Documentation/admin-guide/mm/damon/plans.rst b/Documentation/admin-guide/mm/damon/plans.rst
new file mode 100644
index 000000000000..e3aa5ab96c29
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/plans.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Future Plans
+============
+
+DAMON is still in its first stage. The plans below are under development.
+
+
+Automate Data Access Monitoring-based Memory Operation Schemes Execution
+========================================================================
+
+The ultimate goal of DAMON is to be used as a building block for data
+access pattern aware kernel memory management optimization. It will make
+the system just work efficiently. However, some users having very special
+workloads will want to further do their own optimization. DAMON will automate
+most of the tasks for such manual optimizations in the near future. Users will
+be required only to describe what kind of data access pattern-based operation
+schemes they want in a simple form.
+
+By applying a very simple scheme for THP promotion/demotion with a prototype
+implementation, DAMON reduced 60% of THP memory footprint overhead while
+preserving 50% of the THP performance benefit. The detailed results can be
+seen on an external web page [1]_.
+
+Several RFC patchsets for this plan are available [2]_.
+
+.. [1] https://damonitor.github.io/test/result/perf/latest/html/
+.. [2] https://lore.kernel.org/linux-mm/[email protected]/
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
new file mode 100644
index 000000000000..a6f04d966adc
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -0,0 +1,98 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Getting Started
+===============
+
+This document briefly describes how you can use DAMON by demonstrating its
+default user space tool. Please note that this document describes only a part
+of its features for brevity. Please refer to :doc:`usage` for more details.
+
+
+TL;DR
+=====
+
+Run the 5 commands below to monitor and visualize the access pattern of your
+workload. ::
+
+ $ git clone https://github.com/sjp38/linux -b damon/master
+ /* build the kernel with CONFIG_DAMON=y, install, reboot */
+ $ mount -t debugfs none /sys/kernel/debug/
+ $ cd linux/tools/damon
+ $ ./damo record $(pidof <your workload>)
+ $ ./damo report heats --heatmap access_pattern.png
+
+
+Prerequisites
+=============
+
+Kernel
+------
+
+You should first ensure your system is running on a kernel built with
+``CONFIG_DAMON``. If the value is set to ``m``, load the module first::
+
+ # modprobe damon
+
+
+User Space Tool
+---------------
+
+For the demonstration, we will use the default user space tool for DAMON,
+called DAMON Operator (DAMO). It is located at ``tools/damon/damo`` of the
+kernel source tree. For brevity, the examples below assume you set ``$PATH``
+to point to it. It's not mandatory, though.
+
+Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
+detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
+below::
+
+ # mount -t debugfs none /sys/kernel/debug/
+
+or append the line below to your ``/etc/fstab`` file so that your system
+automatically mounts debugfs on subsequent boots::
+
+ debugfs /sys/kernel/debug debugfs defaults 0 0
+
+
+Recording Data Access Patterns
+==============================
+
+The commands below record the memory access pattern of a program and save
+the monitoring results in a file. ::
+
+ $ git clone https://github.com/sjp38/masim
+ $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ sudo damo record -o damon.data $(pidof masim)
+
+The first two lines of the commands get an artificial memory access generator
+program and run it in the background. It will repeatedly access two 100 MiB
+sized memory regions one by one. You can substitute this with your real
+workload. The last line asks ``damo`` to record the access pattern in the
+``damon.data`` file.
+
+
+Visualizing Recorded Patterns
+=============================
+
+The three commands below visualize the recorded access patterns as three
+image files. ::
+
+ $ damo report heats --heatmap access_pattern_heatmap.png
+ $ damo report wss --range 0 101 1 --plot wss_dist.png
+ $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
+
+- ``access_pattern_heatmap.png`` will show the data access pattern in a
+ heatmap, which shows when (x-axis) what memory region (y-axis) is how
+ frequently accessed (color).
+- ``wss_dist.png`` will show the distribution of the working set size.
+- ``wss_chron_change.png`` will show how the working set size has
+ chronologically changed.
+
+You can view the images on a web page [1]_. Those made with other realistic
+workloads are also available [2]_ [3]_ [4]_.
+
+.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
+.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
new file mode 100644
index 000000000000..971e6b06b4ac
--- /dev/null
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,298 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Detailed Usages
+===============
+
+DAMON provides the three interfaces below for different users.
+
+- *DAMON user space tool.*
+ This is for privileged people such as system administrators who want a
+ just-working human-friendly interface. Using this, users can use DAMON's
+ major features in a human-friendly way. It may not be highly tuned for
+ special cases, though. It supports only virtual address space monitoring.
+- *debugfs interface.*
+ This is for privileged user space programmers who want more optimized use of
+ DAMON. Using this, users can use DAMON's major features by reading
+ from and writing to special debugfs files. Therefore, you can write and use
+ your personalized DAMON debugfs wrapper programs that read and write the
+ debugfs files on your behalf. The DAMON user space tool is also a reference
+ implementation of such programs. It supports only virtual address space
+ monitoring.
+- *Kernel Space Programming Interface.*
+ This is for kernel space programmers. Using this, users can utilize every
+ feature of DAMON most flexibly and efficiently by writing kernel space
+ DAMON application programs for you. You can even extend DAMON for various
+ address spaces.
+
+This document does not describe the kernel space programming interface in
+detail. For that, please refer to the :doc:`/vm/damon/api`.
+
+
+DAMON User Space Tool
+=====================
+
+A reference implementation of the DAMON user space tools which provides a
+convenient user interface is in the kernel source tree. It is located at
+``tools/damon/damo`` of the tree.
+
+The tool provides a subcommand-based interface. Every subcommand provides a
+``-h`` option, which shows its minimal usage. Currently, the tool
+supports two subcommands, ``record`` and ``report``.
+
+The example commands below assume you set ``$PATH`` to point to
+``tools/damon/`` for brevity. It is not mandatory for use of ``damo``, though.
+
+
+Recording Data Access Pattern
+-----------------------------
+
+The ``record`` subcommand records the data access pattern of target workloads
+in a file (``./damon.data`` by default). You can specify the target with 1)
+the command for execution of the monitoring target process, or 2) the pid of
+a running target process. The example below shows command target usage::
+
+ # cd <kernel>/tools/damon/
+ # damo record "sleep 5"
+
+The tool will execute ``sleep 5`` by itself and record the data access patterns
+of the process. The example below shows pid target usage::
+
+ # sleep 5 &
+ # damo record `pidof sleep`
+
+The location of the recorded file can be explicitly set using ``-o`` option.
+You can further tune this by setting the monitoring attributes. To know about
+the monitoring attributes in detail, please refer to the
+:doc:`/vm/damon/mechanisms`.
+
+
+Analyzing Data Access Pattern
+-----------------------------
+
+The ``report`` subcommand reads a data access pattern record file (if not
+explicitly specified using ``-i`` option, reads ``./damon.data`` file by
+default) and generates human-readable reports. You can specify what type of
+report you want using a sub-subcommand to ``report`` subcommand. ``raw``,
+``heats``, and ``wss`` report types are supported for now.
+
+
+raw
+~~~
+
+``raw`` sub-subcommand simply transforms the binary record into a
+human-readable text. For example::
+
+ $ damo report raw
+ start_time: 193485829398
+ rel time: 0
+ nr_tasks: 1
+ pid: 1348
+ nr_regions: 4
+ 560189609000-56018abce000( 22827008): 0
+ 7fbdff59a000-7fbdffaf1a00( 5601792): 0
+ 7fbdffaf1a00-7fbdffbb5000( 800256): 1
+ 7ffea0dc0000-7ffea0dfd000( 249856): 0
+
+ rel time: 100000731
+ nr_tasks: 1
+ pid: 1348
+ nr_regions: 6
+ 560189609000-56018abce000( 22827008): 0
+ 7fbdff59a000-7fbdff8ce933( 3361075): 0
+ 7fbdff8ce933-7fbdffaf1a00( 2240717): 1
+ 7fbdffaf1a00-7fbdffb66d99( 480153): 0
+ 7fbdffb66d99-7fbdffbb5000( 320103): 1
+ 7ffea0dc0000-7ffea0dfd000( 249856): 0
+
+The first line shows the recording started timestamp (nanosecond). Records of
+data access patterns follow. Each record is separated by a blank line. Each
+record first specifies the recorded time (``rel time``) relative to the
+start time and the number of monitored tasks in this record (``nr_tasks``).
+Recorded data access patterns of each task follow. Each data access pattern
+for each task shows the target's pid (``pid``) and the number of monitored
+address regions in this access pattern (``nr_regions``) first. After that,
+each line shows the start/end address, size, and the number of observed
+accesses of each region.
+
+
+heats
+~~~~~
+
+The ``raw`` output is very detailed but hard to manually read. ``heats``
+sub-subcommand plots the data in 3-dimensional form, which represents the time
+in x-axis, address of regions in y-axis, and the access frequency in z-axis.
+Users can set the resolution of the map (``--tres`` and ``--ares``) and
+start/end point of each axis (``--tmin``, ``--tmax``, ``--amin``, and
+``--amax``) via optional arguments. For example::
+
+ $ damo report heats --tres 3 --ares 3
+ 0 0 0.0
+ 0 7609002 0.0
+ 0 15218004 0.0
+ 66112620851 0 0.0
+ 66112620851 7609002 0.0
+ 66112620851 15218004 0.0
+ 132225241702 0 0.0
+ 132225241702 7609002 0.0
+ 132225241702 15218004 0.0
+
+This command shows a recorded access pattern as a heatmap of 3x3 resolution.
+Therefore it shows 9 data points in total. Each line shows one of the data
+points. The three numbers in each line represent time in nanoseconds, address,
+and the observed access frequency.
+
+Users will be able to convert this text output into a heatmap image (represents
+z-axis values with colors) or other 3D representations using various tools such
+as 'gnuplot'. For more convenience, ``heats`` sub-subcommand provides the
+'gnuplot' based heatmap image creation. For this, you can use ``--heatmap``
+option. Also, note that because it uses 'gnuplot' internally, it will fail if
+'gnuplot' is not installed on your system. For example::
+
+ $ ./damo report heats --heatmap heatmap.png
+
+This creates the heatmap image in the ``heatmap.png`` file. It supports
+``pdf``, ``png``, ``jpeg``, and ``svg``.
+
+If the target address space is virtual memory address space and you plot the
+entire address space, the huge unmapped regions will make the picture look
+entirely black. Therefore you should zoom in or out properly using the
+resolution and axis boundary-setting arguments. To make this effort minimal,
+you can use the ``--guide`` option as below::
+
+ $ ./damo report heats --guide
+ pid:1348
+ time: 193485829398-198337863555 (4852034157)
+ region 0: 00000094564599762944-00000094564622589952 (22827008)
+ region 1: 00000140454009610240-00000140454016012288 (6402048)
+ region 2: 00000140731597193216-00000140731597443072 (249856)
+
+The output shows unions of monitored regions (start and end addresses in byte)
+and the union of monitored time duration (start and end time in nanoseconds) of
+each target task. Therefore, it would be wise to plot the data points in each
+union. If no axis boundary option is given, it will automatically find the
+biggest union in ``--guide`` output and set the boundary in it.
+
+
+wss
+~~~
+
+The ``wss`` type extracts the distribution and chronological working set size
+changes from the records. For example::
+
+ $ ./damo report wss
+ # <percentile> <wss>
+ # pid 1348
+ # avr: 66228
+ 0 0
+ 25 0
+ 50 0
+ 75 0
+ 100 1920615
+
+Without any option, it shows the distribution of the working set sizes as
+above. It shows 0th, 25th, 50th, 75th, and 100th percentile and the average of
+the measured working set sizes in the access pattern records. In this case,
+the working set size was zero up to the 75th percentile, but 1,920,615 bytes
+at maximum and 66,228 bytes on average.
+
+By setting the sort key of the percentile using ``--sortby``, you can show how
+the working set size has chronologically changed. For example::
+
+ $ ./damo report wss --sortby time
+ # <percentile> <wss>
+ # pid 1348
+ # avr: 66228
+ 0 0
+ 25 0
+ 50 0
+ 75 0
+ 100 0
+
+The average is still 66,228. And, because the access spiked for a very short
+duration and this command plots only a few data points, we cannot see when the
+access spikes were made. Users can specify the resolution of the distribution
+(``--range``). With a finer resolution, the short duration spikes could
+be found.
+
+Similar to that of ``heats --heatmap``, it also supports 'gnuplot' based simple
+visualization of the distribution via ``--plot`` option.
+
+
+debugfs Interface
+=================
+
+DAMON exports four files, ``attrs``, ``pids``, ``record``, and ``monitor_on``
+under its debugfs directory, ``<debugfs>/damon/``.
+
+
+Attributes
+----------
+
+Users can get and set the ``sampling interval``, ``aggregation interval``,
+``regions update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file. To know about the monitoring
+attributes in detail, please refer to the :doc:`/vm/damon/mechanisms`. For
+example, the commands below set those values to 5 ms, 100 ms, 1,000 ms, 10,
+and 1000, and then check them again::
+
+ # cd <debugfs>/damon
+ # echo 5000 100000 1000000 10 1000 > attrs
+ # cat attrs
+ 5000 100000 1000000 10 1000
+
+
+Target PIDs
+-----------
+
+To monitor the virtual memory address spaces of specific processes, users can
+get and set the pids of monitoring target processes by reading from and writing
+to the ``pids`` file. For example, the commands below set processes having
+pids 42 and 4242 as the processes to be monitored and check them again::
+
+ # cd <debugfs>/damon
+ # echo 42 4242 > pids
+ # cat pids
+ 42 4242
+
+Note that setting the pids doesn't start the monitoring.
+
+
+Record
+------
+
+This debugfs file allows you to record monitored access patterns in a regular
+binary file. The recorded results are first written in an in-memory buffer and
+flushed to a file in batch. Users can get and set the size of the buffer and
+the path to the result file by reading from and writing to the ``record`` file.
+For example, the commands below set the buffer to be 4 KiB and the result to
+be saved in ``/damon.data``. ::
+
+ # cd <debugfs>/damon
+ # echo "4096 /damon.data" > record
+ # cat record
+ 4096 /damon.data
+
+The recording can be disabled by setting the buffer size to zero.
+
+
+Turning On/Off
+--------------
+
+Setting the files as described above has no effect on your system
+unless you explicitly start the monitoring. You can start, stop, and check the
+current status of the monitoring by writing to and reading from the
+``monitor_on`` file. Writing ``on`` to the file starts the monitoring of the
+targets with the attributes. Writing ``off`` to the file stops it. DAMON
+also stops if every target process is terminated. The example commands below
+turn DAMON on and off, and then check its status::
+
+ # cd <debugfs>/damon
+ # echo on > monitor_on
+ # echo off > monitor_on
+ # cat monitor_on
+ off
+
+Please note that you cannot write to the above-mentioned debugfs files while
+the monitoring is turned on. If you write to the files while DAMON is running,
+an error code such as ``-EBUSY`` will be returned.
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 11db46448354..e6de5cd41945 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -27,6 +27,7 @@ the Linux memory management.
concepts
cma_debugfs
+ damon/index
hugetlbpage
idle_page_tracking
ksm
diff --git a/Documentation/vm/damon/api.rst b/Documentation/vm/damon/api.rst
new file mode 100644
index 000000000000..649409828eab
--- /dev/null
+++ b/Documentation/vm/damon/api.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+API Reference
+=============
+
+Kernel space programs can use every feature of DAMON using the APIs below. All
+you need to do is to include ``damon.h``, which is located in
+``include/linux/`` of the source tree.
+
+Structures
+==========
+
+.. kernel-doc:: include/linux/damon.h
+
+
+Functions
+=========
+
+.. kernel-doc:: mm/damon.c
diff --git a/Documentation/vm/damon/eval.rst b/Documentation/vm/damon/eval.rst
new file mode 100644
index 000000000000..b233890b4e45
--- /dev/null
+++ b/Documentation/vm/damon/eval.rst
@@ -0,0 +1,222 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Evaluation
+==========
+
+DAMON is lightweight. It increases system memory usage by only -0.25% and
+consumes less than 1% CPU time in most cases. It slows target workloads down by
+only 0.94%.
+
+DAMON is accurate and useful for memory management optimizations. An
+experimental DAMON-based operation scheme for THP, 'ethp', removes 31.29% of
+THP memory overheads while preserving 60.64% of THP speedup. Another
+experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
+reduces resident set size by 87.95% and system memory footprint by 29.52% while
+incurring only 2.15% runtime overhead in the best case (parsec3/freqmine).
+
+Setup
+=====
+
+On a QEMU/KVM based virtual machine that uses 20GB of RAM, is hosted by an
+Intel i7 machine, and runs a kernel with the v16 DAMON patchset applied, I
+measure runtime and consumed system memory while running various realistic
+workloads with several configurations. I use 13 and 12 workloads from the
+PARSEC3 [3]_ and SPLASH-2X [4]_ benchmark suites, respectively. I also use
+wrapper scripts [5]_ for convenient setup and run of the workloads.
+
+Measurement
+-----------
+
+For the measurement of the amount of consumed memory in system global scope, I
+drop caches before starting each of the workloads and monitor 'MemFree' in the
+'/proc/meminfo' file. To make results more stable, I repeat the runs 5 times
+and average results.
+
+Configurations
+--------------
+
+The configurations I use are as below.
+
+- orig: Linux v5.7 with 'madvise' THP policy
+- rec: 'orig' plus DAMON running with virtual memory access recording
+- prec: 'orig' plus DAMON running with physical memory access recording
+- thp: same as 'orig', but uses the 'always' THP policy
+- ethp: 'orig' plus a DAMON operation scheme, 'efficient THP'
+- prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim [6]_'
+
+I use 'rec' for measurement of DAMON overheads to target workloads and system
+memory. 'prec' is for physical memory monitoring and recording. It monitors
+a 17GB 'System RAM' region. The remaining configs including 'thp', 'ethp',
+and 'prcl' are for measurement of DAMON monitoring accuracy.
+
+'ethp' and 'prcl' are simple DAMON-based operation schemes developed as
+proofs of concept for DAMON. 'ethp' reduces memory space waste of THP by using
+DAMON for the decision of promotions and demotions of huge pages, while 'prcl'
+is similar to the original work. Those are implemented as below::
+
+ # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
+ # ethp: Use huge pages if a region shows >=5% access rate, use regular
+ # pages if a region >=2MB shows <5% access rate for >=13 seconds
+ null null 5 null null null hugepage
+ 2M null null null 13s null nohugepage
+
+ # prcl: If a region >=4KB shows <=5% access rate for >=7 seconds, page out.
+ 4K null null 5 7s null pageout
+
+Note that both 'ethp' and 'prcl' are designed with only my straightforward
+intuition, because they are only for proof of concept and for showing the
+monitoring accuracy of DAMON. In other words, they are not for production.
+For production use, they should be tuned further.
+
+.. [1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
+.. [2] "Disable Transparent Huge Pages (THP)",
+ https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
+.. [3] "The PARSEC Benchmark Suite", https://parsec.cs.princeton.edu/index.htm
+.. [4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
+.. [5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu
+.. [6] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/
+
+Results
+=======
+
+The two tables below show the measurement results. The runtimes are in seconds
+while the memory usages are in KiB. Each configuration except 'orig' shows
+its overhead relative to 'orig' in percent within parentheses::
+
+    runtime                 orig     rec      (overhead) prec     (overhead) thp      (overhead) ethp     (overhead) prcl     (overhead)
+    parsec3/blackscholes    107.228  107.859  (0.59)     108.110  (0.82)     107.381  (0.14)     106.811  (-0.39)    114.766  (7.03)
+    parsec3/bodytrack       79.292   79.609   (0.40)     79.777   (0.61)     79.313   (0.03)     78.892   (-0.50)    80.398   (1.40)
+    parsec3/canneal         148.887  150.878  (1.34)     153.337  (2.99)     127.873  (-14.11)   132.272  (-11.16)   167.631  (12.59)
+    parsec3/dedup           11.970   11.975   (0.04)     12.024   (0.45)     11.752   (-1.82)    11.921   (-0.41)    13.244   (10.64)
+    parsec3/facesim         212.800  215.927  (1.47)     215.004  (1.04)     205.117  (-3.61)    207.401  (-2.54)    220.834  (3.78)
+    parsec3/ferret          190.646  192.560  (1.00)     192.414  (0.93)     190.662  (0.01)     192.309  (0.87)     193.497  (1.50)
+    parsec3/fluidanimate    213.951  216.459  (1.17)     217.578  (1.70)     209.500  (-2.08)    211.826  (-0.99)    218.299  (2.03)
+    parsec3/freqmine        291.050  292.117  (0.37)     293.279  (0.77)     289.553  (-0.51)    291.768  (0.25)     297.309  (2.15)
+    parsec3/raytrace        118.645  119.734  (0.92)     119.521  (0.74)     117.715  (-0.78)    118.844  (0.17)     134.045  (12.98)
+    parsec3/streamcluster   332.843  336.997  (1.25)     337.049  (1.26)     279.716  (-15.96)   290.985  (-12.58)   346.646  (4.15)
+    parsec3/swaptions       155.437  157.174  (1.12)     156.159  (0.46)     155.017  (-0.27)    154.955  (-0.31)    156.555  (0.72)
+    parsec3/vips            59.215   59.426   (0.36)     59.156   (-0.10)    59.243   (0.05)     58.858   (-0.60)    60.184   (1.64)
+    parsec3/x264            67.445   71.400   (5.86)     71.122   (5.45)     64.078   (-4.99)    66.027   (-2.10)    71.489   (6.00)
+    splash2x/barnes         81.826   81.800   (-0.03)    82.648   (1.00)     74.343   (-9.15)    79.063   (-3.38)    103.785  (26.84)
+    splash2x/fft            33.850   34.148   (0.88)     33.912   (0.18)     23.493   (-30.60)   32.684   (-3.44)    48.303   (42.70)
+    splash2x/lu_cb          86.404   86.333   (-0.08)    86.988   (0.68)     85.720   (-0.79)    85.944   (-0.53)    89.338   (3.40)
+    splash2x/lu_ncb         94.908   98.021   (3.28)     96.041   (1.19)     90.304   (-4.85)    93.279   (-1.72)    97.270   (2.49)
+    splash2x/ocean_cp       47.122   47.391   (0.57)     47.902   (1.65)     43.227   (-8.26)    44.609   (-5.33)    51.410   (9.10)
+    splash2x/ocean_ncp      93.147   92.911   (-0.25)    93.886   (0.79)     51.451   (-44.76)   71.107   (-23.66)   112.554  (20.83)
+    splash2x/radiosity      92.150   92.604   (0.49)     93.339   (1.29)     90.802   (-1.46)    91.824   (-0.35)    104.439  (13.34)
+    splash2x/radix          31.961   32.113   (0.48)     32.066   (0.33)     25.184   (-21.20)   30.412   (-4.84)    49.989   (56.41)
+    splash2x/raytrace       84.781   85.278   (0.59)     84.763   (-0.02)    83.192   (-1.87)    83.970   (-0.96)    85.382   (0.71)
+    splash2x/volrend        87.401   87.978   (0.66)     87.977   (0.66)     86.636   (-0.88)    87.169   (-0.26)    88.043   (0.73)
+    splash2x/water_nsquared 239.140  239.570  (0.18)     240.901  (0.74)     221.323  (-7.45)    224.670  (-6.05)    244.492  (2.24)
+    splash2x/water_spatial  89.538   89.978   (0.49)     90.171   (0.71)     89.729   (0.21)     89.238   (-0.34)    99.331   (10.94)
+    total                   3051.620 3080.230 (0.94)     3085.130 (1.10)     2862.320 (-6.20)    2936.830 (-3.76)    3249.240 (6.48)
+
+
+    memused.avg             orig         rec          (overhead) prec         (overhead) thp          (overhead) ethp         (overhead) prcl         (overhead)
+    parsec3/blackscholes    1676679.200  1683789.200  (0.42)     1680281.200  (0.21)     1613817.400  (-3.75)    1835229.200  (9.46)     1407952.800  (-16.03)
+    parsec3/bodytrack       1295736.000  1308412.600  (0.98)     1311988.000  (1.25)     1243417.400  (-4.04)    1435410.600  (10.78)    1255566.400  (-3.10)
+    parsec3/canneal         1004062.000  1008823.800  (0.47)     1000100.200  (-0.39)    983976.000   (-2.00)    1051719.600  (4.75)     993055.800   (-1.10)
+    parsec3/dedup           2389765.800  2393381.000  (0.15)     2366668.200  (-0.97)    2412948.600  (0.97)     2435885.600  (1.93)     2380172.800  (-0.40)
+    parsec3/facesim         488927.200   498228.000   (1.90)     496683.800   (1.59)     476327.800   (-2.58)    552890.000   (13.08)    449143.600   (-8.14)
+    parsec3/ferret          280324.600   282032.400   (0.61)     282284.400   (0.70)     258211.000   (-7.89)    331493.800   (18.25)    265850.400   (-5.16)
+    parsec3/fluidanimate    560636.200   569038.200   (1.50)     565067.400   (0.79)     556923.600   (-0.66)    588021.200   (4.88)     512901.600   (-8.51)
+    parsec3/freqmine        883286.000   904960.200   (2.45)     886105.200   (0.32)     849347.400   (-3.84)    998358.000   (13.03)    622542.800   (-29.52)
+    parsec3/raytrace        1639370.200  1642318.200  (0.18)     1626673.200  (-0.77)    1591284.200  (-2.93)    1755088.400  (7.06)     1410261.600  (-13.98)
+    parsec3/streamcluster   116955.600   127251.400   (8.80)     121441.000   (3.84)     113853.800   (-2.65)    139659.400   (19.41)    120335.200   (2.89)
+    parsec3/swaptions       8342.400     18555.600    (122.43)   16581.200    (98.76)    6745.800     (-19.14)   27487.200    (229.49)   14275.600    (71.12)
+    parsec3/vips            2776417.600  2784989.400  (0.31)     2820564.600  (1.59)     2694060.800  (-2.97)    2968650.000  (6.92)     2713590.000  (-2.26)
+    parsec3/x264            2912885.000  2936474.600  (0.81)     2936775.800  (0.82)     2799599.200  (-3.89)    3168695.000  (8.78)     2829085.800  (-2.88)
+    splash2x/barnes         1206459.600  1204145.600  (-0.19)    1177390.000  (-2.41)    1210556.800  (0.34)     1214978.800  (0.71)     907737.000   (-24.76)
+    splash2x/fft            9384156.400  9258749.600  (-1.34)    8560377.800  (-8.78)    9337563.000  (-0.50)    9228873.600  (-1.65)    9823394.400  (4.68)
+    splash2x/lu_cb          510210.800   514052.800   (0.75)     502735.200   (-1.47)    514459.800   (0.83)     523884.200   (2.68)     367563.200   (-27.96)
+    splash2x/lu_ncb         510091.200   516046.800   (1.17)     505327.600   (-0.93)    512568.200   (0.49)     524178.400   (2.76)     427981.800   (-16.10)
+    splash2x/ocean_cp       3342260.200  3294531.200  (-1.43)    3171236.000  (-5.12)    3379693.600  (1.12)     3314896.600  (-0.82)    3252406.000  (-2.69)
+    splash2x/ocean_ncp      3900447.200  3881682.600  (-0.48)    3816493.200  (-2.15)    7065506.200  (81.15)    4449224.400  (14.07)    3829931.200  (-1.81)
+    splash2x/radiosity      1466372.000  1463840.200  (-0.17)    1438554.000  (-1.90)    1475151.600  (0.60)     1474828.800  (0.58)     496636.000   (-66.13)
+    splash2x/radix          1760056.600  1691719.000  (-3.88)    1613057.400  (-8.35)    1384416.400  (-21.34)   1632274.400  (-7.26)    2141640.200  (21.68)
+    splash2x/raytrace       38794.000    48187.400    (24.21)    46728.400    (20.45)    41323.400    (6.52)     61499.800    (58.53)    68455.200    (76.46)
+    splash2x/volrend        138107.400   148197.000   (7.31)     146223.400   (5.88)     128076.400   (-7.26)    164593.800   (19.18)    140885.200   (2.01)
+    splash2x/water_nsquared 39072.000    49889.200    (27.69)    47548.400    (21.69)    37546.400    (-3.90)    57195.400    (46.38)    42994.200    (10.04)
+    splash2x/water_spatial  662099.800   665964.800   (0.58)     651017.000   (-1.67)    659808.400   (-0.35)    674475.600   (1.87)     519677.600   (-21.51)
+    total                   38991500.000 38895300.000 (-0.25)    37787817.000 (-3.09)    41347200.000 (6.04)     40609600.000 (4.15)     36994100.000 (-5.12)
+
+
+DAMON Overheads
+---------------
+
+In total, DAMON's virtual memory access recording feature ('rec') incurs 0.94%
+runtime overhead and -0.25% memory space overhead. Even though the size of the
+monitoring target region becomes much larger with the physical memory access
+recording ('prec'), it still shows only a modest amount of overhead (1.10% for
+runtime and -3.09% for memory footprint).
+
+For convenient test runs of 'rec' and 'prec', I use a Python wrapper. The
+wrapper constantly consumes about 10-15MB of memory. This becomes a high
+memory overhead if the target workload has a small memory footprint.
+Nonetheless, the overheads are not from DAMON, but from the wrapper, and thus
+should be ignored. This fake memory overhead continues in 'ethp' and 'prcl',
+as those configurations are also using the Python wrapper.
+
+
+Efficient THP
+-------------
+
+The THP 'always' policy achieves 6.20% speedup but incurs 6.04% memory
+overhead. It achieves 44.76% speedup in the best case, but 81.15% memory
+overhead in the worst case. Interestingly, both the best and the worst case
+are with 'splash2x/ocean_ncp'.
+
+The 2-line implementation of the data access monitoring based THP version
+('ethp') shows 3.76% speedup and 4.15% memory overhead. In other words, 'ethp'
+removes 31.29% of THP memory waste while preserving 60.64% of THP speedup in
+total. In the case of 'splash2x/ocean_ncp', 'ethp' removes 82.66% of THP
+memory waste while preserving 52.85% of THP speedup.
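+
+The ratios above appear to follow from the 'total' rows of the two tables. As
+a rough check, the arithmetic can be sketched in Python as below. The helper
+names are made up for this illustration and are not part of DAMON or its
+tools, and the last digit can differ slightly from the numbers quoted above
+because the inputs are already rounded::
+
+    # Percentages taken from the 'total' rows of the tables above.
+    thp_speedup, ethp_speedup = 6.20, 3.76      # runtime reduction vs 'orig'
+    thp_mem, ethp_mem = 6.04, 4.15              # memory overhead vs 'orig'
+
+    def removed_waste(thp, ethp):
+        """Percent of the THP memory overhead that 'ethp' removes."""
+        return (thp - ethp) / thp * 100
+
+    def preserved_speedup(thp, ethp):
+        """Percent of the THP speedup that 'ethp' keeps."""
+        return ethp / thp * 100
+
+    print(removed_waste(thp_mem, ethp_mem))                 # about 31.3
+    print(preserved_speedup(thp_speedup, ethp_speedup))     # about 60.6
+
+The same functions with the 'splash2x/ocean_ncp' row (81.15 and 14.07 for
+memory, 44.76 and 23.66 for speedup) give roughly 82.7 and 52.9.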
+
+
+Proactive Reclamation
+---------------------
+
+Similar to the original work, I use a 4G 'zram' swap device for this
+configuration.
+
+In total, our 1-line implementation of Proactive Reclamation, 'prcl', incurred
+6.48% runtime overhead while achieving 5.12% system memory usage
+reduction.
+
+Nonetheless, as the memory usage is calculated with 'MemFree' in
+'/proc/meminfo', it contains the SwapCached pages. As the swapcached pages can
+be easily evicted, I also measured the resident set size of the workloads::
+
+    rss.avg                 orig         rec          (overhead) prec         (overhead) thp          (overhead) ethp         (overhead) prcl         (overhead)
+    parsec3/blackscholes    590412.200   589991.400   (-0.07)    591716.400   (0.22)     591131.000   (0.12)     591055.200   (0.11)     274623.600   (-53.49)
+    parsec3/bodytrack       32202.200    32297.400    (0.30)     32301.400    (0.31)     32328.000    (0.39)     32169.800    (-0.10)    25311.200    (-21.40)
+    parsec3/canneal         840063.600   839145.200   (-0.11)    839506.200   (-0.07)    835102.600   (-0.59)    839766.000   (-0.04)    833091.800   (-0.83)
+    parsec3/dedup           1185493.200  1202688.800  (1.45)     1204597.000  (1.61)     1238071.400  (4.44)     1201689.400  (1.37)     920688.600   (-22.34)
+    parsec3/facesim         311570.400   311542.000   (-0.01)    311665.000   (0.03)     316106.400   (1.46)     312003.400   (0.14)     252646.000   (-18.91)
+    parsec3/ferret          99783.200    99330.000    (-0.45)    99735.000    (-0.05)    102000.600   (2.22)     99927.400    (0.14)     90967.400    (-8.83)
+    parsec3/fluidanimate    531780.800   531800.800   (0.00)     531754.600   (-0.00)    532009.600   (0.04)     531822.400   (0.01)     479116.000   (-9.90)
+    parsec3/freqmine        551787.600   551550.600   (-0.04)    551950.000   (0.03)     556030.000   (0.77)     553720.400   (0.35)     66480.000    (-87.95)
+    parsec3/raytrace        895247.000   895240.200   (-0.00)    895770.400   (0.06)     895880.200   (0.07)     893516.600   (-0.19)    327339.600   (-63.44)
+    parsec3/streamcluster   110862.200   110840.400   (-0.02)    110878.600   (0.01)     112067.200   (1.09)     112010.800   (1.04)     109763.600   (-0.99)
+    parsec3/swaptions       5630.000     5580.800     (-0.87)    5599.600     (-0.54)    5624.200     (-0.10)    5697.400     (1.20)     3792.400     (-32.64)
+    parsec3/vips            31677.200    31881.800    (0.65)     31785.800    (0.34)     32177.000    (1.58)     32456.800    (2.46)     29692.000    (-6.27)
+    parsec3/x264            81796.400    81918.600    (0.15)     81827.600    (0.04)     82734.800    (1.15)     82854.000    (1.29)     81478.200    (-0.39)
+    splash2x/barnes         1216014.600  1215462.000  (-0.05)    1218535.200  (0.21)     1227689.400  (0.96)     1219022.000  (0.25)     650771.000   (-46.48)
+    splash2x/fft            9622775.200  9511973.400  (-1.15)    9688178.600  (0.68)     9733868.400  (1.15)     9651488.000  (0.30)     7567077.400  (-21.36)
+    splash2x/lu_cb          511102.400   509911.600   (-0.23)    511123.800   (0.00)     514466.800   (0.66)     510462.800   (-0.13)    361014.000   (-29.37)
+    splash2x/lu_ncb         510569.800   510724.600   (0.03)     510888.800   (0.06)     513951.600   (0.66)     509474.400   (-0.21)    424030.400   (-16.95)
+    splash2x/ocean_cp       3413563.600  3413721.800  (0.00)     3398399.600  (-0.44)    3446878.000  (0.98)     3404799.200  (-0.26)    3244787.400  (-4.94)
+    splash2x/ocean_ncp      3927797.400  3936294.400  (0.22)     3917698.800  (-0.26)    7181781.200  (82.85)    4525783.600  (15.22)    3693747.800  (-5.96)
+    splash2x/radiosity      1477264.800  1477569.200  (0.02)     1476954.200  (-0.02)    1485724.800  (0.57)     1474684.800  (-0.17)    230128.000   (-84.42)
+    splash2x/radix          1773025.000  1754424.200  (-1.05)    1743194.400  (-1.68)    1445575.200  (-18.47)   1694855.200  (-4.41)    1769750.000  (-0.18)
+    splash2x/raytrace       23292.000    23284.000    (-0.03)    23292.800    (0.00)     28704.800    (23.24)    26489.600    (13.73)    15753.000    (-32.37)
+    splash2x/volrend        44095.800    44068.200    (-0.06)    44107.600    (0.03)     44114.600    (0.04)     44054.000    (-0.09)    31616.000    (-28.30)
+    splash2x/water_nsquared 29416.800    29403.200    (-0.05)    29406.400    (-0.04)    30103.200    (2.33)     29433.600    (0.06)     24927.400    (-15.26)
+    splash2x/water_spatial  657791.000   657840.400   (0.01)     657826.600   (0.01)     657595.800   (-0.03)    656617.800   (-0.18)    481334.800   (-26.83)
+    total                   28475091.000 28368400.000 (-0.37)    28508700.000 (0.12)     31641800.000 (11.12)    29036000.000 (1.97)     21989800.000 (-22.78)
+
+In total, the resident set size was reduced by 22.78%.
+
+With parsec3/freqmine, 'prcl' reduced the resident set size by 87.95% and
+system memory usage by 29.52% while incurring only 2.15% runtime overhead.
diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst
new file mode 100644
index 000000000000..a15059cfb98a
--- /dev/null
+++ b/Documentation/vm/damon/faq.rst
@@ -0,0 +1,59 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Frequently Asked Questions
+==========================
+
+Why a new module, instead of extending perf or other user space tools?
+======================================================================
+
+First, because it needs to be as lightweight as possible so that it can be
+used online, any unnecessary overhead such as kernel - user space context
+switching cost should be avoided. Second, DAMON aims to be used by other
+programs including the kernel. Therefore, having a dependency on specific
+tools like perf is not desirable. These are the two biggest reasons why DAMON
+is implemented in the kernel space.
+
+
+Can 'idle pages tracking' or 'perf mem' substitute DAMON?
+=========================================================
+
+Idle page tracking is a low level primitive for access check of the physical
+address space. 'perf mem' is similar, though it can use sampling to minimize
+the overhead. On the other hand, DAMON is a higher-level framework for the
+monitoring of various address spaces. It is focused on memory management
+optimization and provides sophisticated accuracy/overhead handling mechanisms.
+Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
+DAMON's output, but cannot substitute DAMON. Rather, those could be
+configured as DAMON's low-level primitives for specific address spaces.
+
+
+How can I optimize my system's memory management using DAMON?
+=============================================================
+
+Because there are several ways for the DAMON-based optimizations, we wrote a
+separate document, :doc:`/admin-guide/mm/damon/guide`. Please refer to that.
+
+
+Does DAMON support virtual memory only?
+=======================================
+
+No. The core of DAMON is address space independent. The address space
+specific low level primitive parts including monitoring target regions
+constructions and actual access checks can be implemented and configured on the
+DAMON core by the users. In this way, DAMON users can monitor any address
+space with any access check technique.
+
+Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
+implementations of the address space dependent functions for the virtual memory
+by default, for reference and convenient use. In the near future, we will
+provide those for the physical memory address space.
+
+
+Can I simply monitor page granularity?
+======================================
+
+Yes. You can do so by setting the ``min_nr_regions`` attribute higher than the
+working set size divided by the page size. Because the monitoring target
+regions size is forced to be ``>=page size``, the region split will have no
+effect.
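+
+As an illustration of the arithmetic only, assuming a 1 GiB working set and
+4 KiB pages (both values are made up for this example), the threshold can be
+sketched in Python as::
+
+    working_set_size = 1 << 30      # 1 GiB, hypothetical
+    page_size = 4096                # 4 KiB, architecture dependent
+
+    # One region per page: ``min_nr_regions`` should be higher than this.
+    nr_pages = working_set_size // page_size
+    print(nr_pages)                 # 262144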
diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst
new file mode 100644
index 000000000000..1ac29c8d9e87
--- /dev/null
+++ b/Documentation/vm/damon/index.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+DAMON is a data access monitoring framework subsystem for the Linux kernel.
+The core mechanisms of DAMON (refer to :doc:`mechanisms` for the detail) make
+it
+
+ - *accurate* (the monitoring output is useful enough for DRAM level memory
+   management; it might not be appropriate for CPU cache levels, though),
+ - *light-weight* (the monitoring overhead is low enough to be applied online),
+   and
+ - *scalable* (the upper-bound of the overhead is in constant range regardless
+   of the size of target workloads).
+
+Using this framework, therefore, the kernel's memory management mechanisms can
+make advanced decisions. Experimental memory management optimization works
+that incurred high data access monitoring overhead could be implemented again.
+In user space, meanwhile, users who have some special workloads can write
+personalized applications for better understanding and optimizations of their
+workloads and systems.
+
+.. toctree::
+ :maxdepth: 2
+
+ faq
+ mechanisms
+ eval
+ api
+ plans
diff --git a/Documentation/vm/damon/mechanisms.rst b/Documentation/vm/damon/mechanisms.rst
new file mode 100644
index 000000000000..56cad258cea1
--- /dev/null
+++ b/Documentation/vm/damon/mechanisms.rst
@@ -0,0 +1,165 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Mechanisms
+==========
+
+Configurable Layers
+===================
+
+DAMON provides data access monitoring functionality while making the accuracy
+and the overhead controllable. The fundamental access monitoring requires
+primitives that are dependent on and optimized for the target address space.
+On the other hand, the accuracy and overhead tradeoff mechanism, which is the
+core of DAMON, is in the pure logic space. DAMON separates the two parts in
+different layers and defines its interface to allow various low level
+primitives implementations to be configured with the core logic.
+
+Due to this separated design and the configurable interface, users can extend
+DAMON for any address space by configuring the core logics with appropriate low
+level primitive implementations. If an appropriate one is not provided, users
+can implement the primitives on their own.
+
+For example, physical memory, virtual memory, swap space, those for specific
+processes, NUMA nodes, files, and backing memory devices would be supportable.
+Also, if some architectures or devices support special optimized access check
+primitives, those will be easily configurable.
+
+
+Reference Implementations of Address Space Specific Primitives
+==============================================================
+
+The low level primitives for the fundamental access monitoring are defined in
+two parts:
+
+1. Identification of the monitoring target address range for the address space.
+2. Access check of specific address range in the target space.
+
+DAMON currently provides the implementation of the primitives for only the
+virtual address spaces. The two subsections below describe how they work.
+
+
+PTE Accessed-bit Based Access Check
+-----------------------------------
+
+The implementation for the virtual address space uses PTE Accessed-bit for
+basic access checks. It finds the relevant PTE Accessed bit from the address
+by walking the page table for the target task of the address. In this way, the
+implementation finds and clears the bit for the next sampling target address
+and checks whether the bit is set again after one sampling period. To avoid
+disturbing other Accessed bit users such as the reclamation logic, the
+implementation adjusts ``PG_idle`` and ``PG_young`` appropriately, in the same
+way as 'Idle Page Tracking' does.
+
+
+VMA-based Target Address Range Construction
+-------------------------------------------
+
+Only small parts in the super-huge virtual address space of the processes are
+mapped to the physical memory and accessed. Thus, tracking the unmapped
+address regions is just wasteful. However, because DAMON can deal with some
+level of noise using the adaptive regions adjustment mechanism, tracking every
+mapping is not strictly required and could even incur a high overhead in some
+cases. That said, too huge unmapped areas inside the monitoring target should
+be removed so that the adaptive mechanism does not spend time on them.
+
+For this reason, this implementation converts the complex mappings to three
+distinct regions that cover every mapped area of the address space. The two
+gaps between the three regions are the two biggest unmapped areas in the given
+address space. In most cases, the two biggest unmapped areas would be the gap
+between the heap and the uppermost mmap()-ed region, and the gap between the
+lowermost mmap()-ed region and the stack. Because these gaps are exceptionally
+huge in usual address spaces, excluding these will be sufficient to make a
+reasonable trade-off. The layout below shows this in detail::
+
+ <heap>
+ <BIG UNMAPPED REGION 1>
+ <uppermost mmap()-ed region>
+ (small mmap()-ed regions and munmap()-ed regions)
+ <lowermost mmap()-ed region>
+ <BIG UNMAPPED REGION 2>
+ <stack>
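+
+A minimal Python sketch of this conversion is below. It is only an
+illustration under the assumption that the mappings are given as a sorted
+list of non-overlapping ``(start, end)`` pairs. The function name
+``three_regions()`` is hypothetical, and this is not the in-kernel code::
+
+    def three_regions(mappings):
+        """Cover sorted (start, end) mappings with three regions."""
+        # Each gap is (size, index of the mapping that precedes the gap).
+        gaps = [(mappings[i + 1][0] - mappings[i][1], i)
+                for i in range(len(mappings) - 1)]
+        if len(gaps) < 2:
+            raise ValueError('need at least three mappings')
+        # Pick the two biggest gaps and put them back in address order.
+        first, second = sorted(i for _, i in sorted(gaps)[-2:])
+        return [(mappings[0][0], mappings[first][1]),
+                (mappings[first + 1][0], mappings[second][1]),
+                (mappings[second + 1][0], mappings[-1][1])]
+
+    # Toy layout: a heap, three mmap()-ed areas, and a stack.
+    maps = [(10, 20), (500, 510), (512, 520), (530, 540), (900, 910)]
+    print(three_regions(maps))  # [(10, 20), (500, 540), (900, 910)]
+
+In the toy layout, the small gaps between the mmap()-ed areas stay inside the
+middle region, which is the noise that the adaptive mechanism is expected to
+handle.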
+
+
+Address Space Independent Core Mechanisms
+=========================================
+
+The four sections below describe each of the DAMON core mechanisms and the five
+monitoring attributes, ``sampling interval``, ``aggregation interval``,
+``regions update interval``, ``minimum number of regions``, and ``maximum
+number of regions``.
+
+
+Access Frequency Monitoring
+---------------------------
+
+The output of DAMON shows which pages are accessed how frequently for a given
+duration. The resolution of the access frequency is controlled by setting
+``sampling interval`` and ``aggregation interval``. In detail, DAMON checks
+access to each page per ``sampling interval`` and aggregates the results. In
+other words, it counts the number of the accesses to each page. After each
+``aggregation interval`` passes, DAMON calls callback functions that were
+previously registered by users so that users can read the aggregated results,
+and then clears the results. This can be described in the pseudo-code below::
+
+    while monitoring_on:
+        for page in monitoring_target:
+            if accessed(page):
+                nr_accesses[page] += 1
+        if time() % aggregation_interval == 0:
+            for callback in user_registered_callbacks:
+                callback(monitoring_target, nr_accesses)
+            for page in monitoring_target:
+                nr_accesses[page] = 0
+        sleep(sampling_interval)
+
+The monitoring overhead of this mechanism will arbitrarily increase as the
+size of the target workload grows.
+
+
+Region Based Sampling
+---------------------
+
+To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
+that are assumed to have the same access frequencies into a region. As long
+as the assumption (pages in a region have the same access frequencies) is
+kept, only one page in the region is required to be checked. Thus, for each
+``sampling interval``, DAMON randomly picks one page in each region, waits for
+one ``sampling interval``, checks whether the page is accessed meanwhile, and
+increases the access frequency of the region if so. Therefore, the monitoring
+overhead is controllable by setting the number of regions. DAMON allows users
+to set the minimum and the maximum number of regions for the trade-off.
+
+This scheme, however, cannot preserve the quality of the output if the
+assumption is not guaranteed.
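+
+The sampling step described in this section could be sketched in Python as
+below. This is a simplified illustration, not the kernel implementation: the
+``Region`` class and the ``accessed()`` and ``clear_accessed()`` callbacks
+are hypothetical stand-ins for the address space specific primitives::
+
+    import random
+    import time
+
+    class Region:
+        def __init__(self, start, end):
+            self.start, self.end = start, end   # [start, end) in pages
+            self.nr_accesses = 0
+
+    def sample_once(regions, accessed, clear_accessed, sampling_interval):
+        # Pick one random page per region and clear its access mark.
+        picks = [random.randrange(r.start, r.end) for r in regions]
+        for page in picks:
+            clear_accessed(page)
+        time.sleep(sampling_interval)
+        # A page accessed meanwhile counts for the whole region.
+        for region, page in zip(regions, picks):
+            if accessed(page):
+                region.nr_accesses += 1
+
+The cost of one call depends only on the number of regions, not on the number
+of pages they cover.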
+
+
+Adaptive Regions Adjustment
+---------------------------
+
+Even if the initial monitoring target regions are somehow well constructed to
+fulfill the assumption (pages in same region have similar access frequencies),
+the data access pattern can change dynamically. This will result in low
+monitoring quality. To keep the assumption as much as possible, DAMON
+adaptively merges and splits each region based on their access frequency.
+
+For each ``aggregation interval``, it compares the access frequencies of
+adjacent regions and merges those if the frequency difference is small. Then,
+after it reports and clears the aggregated access frequency of each region, it
+splits each region into two or three regions if the total number of regions
+will not exceed the user-specified maximum number of regions after the split.
+
+In this way, DAMON provides its best-effort quality and minimal overhead while
+keeping the bounds users set for their trade-off.
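+
+The merge and split steps could be sketched in Python as below. This is only
+an illustration: regions are ``[start, end, nr_accesses]`` lists, and the
+``threshold`` parameter, the size-weighted averaging, and the two-way random
+split are assumptions of this sketch rather than a description of the kernel
+code::
+
+    import random
+
+    def merge_regions(regions, threshold):
+        """Merge adjacent regions with similar access frequencies."""
+        merged = [list(regions[0])]
+        for start, end, nr in regions[1:]:
+            last = merged[-1]
+            if last[1] == start and abs(last[2] - nr) <= threshold:
+                # Size-weighted average keeps the merged count meaningful.
+                size_l, size_r = last[1] - last[0], end - start
+                total = size_l + size_r
+                last[2] = (last[2] * size_l + nr * size_r) // total
+                last[1] = end
+            else:
+                merged.append([start, end, nr])
+        return merged
+
+    def split_regions(regions, max_nr_regions):
+        """Split each region in two unless the limit would be exceeded."""
+        if len(regions) * 2 > max_nr_regions:
+            return regions
+        result = []
+        for start, end, nr in regions:
+            if end - start < 2:
+                result.append([start, end, nr])
+                continue
+            mid = random.randrange(start + 1, end)
+            result += [[start, mid, nr], [mid, end, nr]]
+        return result
+
+    regions = [[0, 4, 10], [4, 8, 11], [8, 16, 0]]
+    print(merge_regions(regions, threshold=2))  # [[0, 8, 10], [8, 16, 0]]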
+
+
+Dynamic Target Space Updates Handling
+-------------------------------------
+
+The monitoring target address range could change dynamically. For example,
+virtual memory could be dynamically mapped and unmapped. Physical memory could
+be hot-plugged.
+
+As the changes could be quite frequent in some cases, DAMON checks the dynamic
+memory mapping changes and applies them to the abstracted target area only
+once per user-specified time interval (``regions update interval``).
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index e8d943b21cf9..30813498c74d 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
active_mm
balance
cleancache
+ damon/index
frontswap
highmem
hmm
--
2.17.1
From: SeongJae Park <[email protected]>
This commit adds simple user space tests for DAMON. The tests use the
kselftest framework.
Signed-off-by: SeongJae Park <[email protected]>
---
tools/testing/selftests/damon/Makefile | 7 +
.../selftests/damon/_chk_dependency.sh | 28 ++++
tools/testing/selftests/damon/_chk_record.py | 108 ++++++++++++++
.../testing/selftests/damon/debugfs_attrs.sh | 139 ++++++++++++++++++
.../testing/selftests/damon/debugfs_record.sh | 50 +++++++
5 files changed, 332 insertions(+)
create mode 100644 tools/testing/selftests/damon/Makefile
create mode 100644 tools/testing/selftests/damon/_chk_dependency.sh
create mode 100644 tools/testing/selftests/damon/_chk_record.py
create mode 100755 tools/testing/selftests/damon/debugfs_attrs.sh
create mode 100755 tools/testing/selftests/damon/debugfs_record.sh
diff --git a/tools/testing/selftests/damon/Makefile b/tools/testing/selftests/damon/Makefile
new file mode 100644
index 000000000000..cfd5393a4639
--- /dev/null
+++ b/tools/testing/selftests/damon/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for damon selftests
+
+TEST_FILES = _chk_dependency.sh _chk_record.py
+TEST_PROGS = debugfs_attrs.sh debugfs_record.sh
+
+include ../lib.mk
diff --git a/tools/testing/selftests/damon/_chk_dependency.sh b/tools/testing/selftests/damon/_chk_dependency.sh
new file mode 100644
index 000000000000..814dcadd5e96
--- /dev/null
+++ b/tools/testing/selftests/damon/_chk_dependency.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+DBGFS=/sys/kernel/debug/damon
+
+if [ $EUID -ne 0 ];
+then
+ echo "Run as root"
+ exit $ksft_skip
+fi
+
+if [ ! -d $DBGFS ]
+then
+ echo "$DBGFS not found"
+ exit $ksft_skip
+fi
+
+for f in attrs record pids monitor_on
+do
+ if [ ! -f "$DBGFS/$f" ]
+ then
+ echo "$f not found"
+ exit 1
+ fi
+done
diff --git a/tools/testing/selftests/damon/_chk_record.py b/tools/testing/selftests/damon/_chk_record.py
new file mode 100644
index 000000000000..5cfcf4161404
--- /dev/null
+++ b/tools/testing/selftests/damon/_chk_record.py
@@ -0,0 +1,108 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+"Check whether the DAMON record file is valid"
+
+import argparse
+import struct
+import sys
+
+fmt_version = 0
+
+def set_fmt_version(f):
+    global fmt_version
+
+    mark = f.read(16)
+    if mark == b'damon_recfmt_ver':
+        fmt_version = struct.unpack('i', f.read(4))[0]
+    else:
+        fmt_version = 0
+        f.seek(0)
+    return fmt_version
+
+def read_pid(f):
+    if fmt_version == 0:
+        return struct.unpack('L', f.read(8))[0]
+    return struct.unpack('i', f.read(4))[0]
+
+def err_percent(val, expected):
+    return abs(val - expected) / expected * 100
+
+def chk_task_info(f):
+    pid = read_pid(f)
+    nr_regions = struct.unpack('I', f.read(4))[0]
+
+    if nr_regions > max_nr_regions:
+        print('too many regions: %d > %d' % (nr_regions, max_nr_regions))
+        exit(1)
+
+    nr_gaps = 0
+    eaddr = 0
+    for r in range(nr_regions):
+        saddr = struct.unpack('L', f.read(8))[0]
+        if eaddr and saddr != eaddr:
+            nr_gaps += 1
+        eaddr = struct.unpack('L', f.read(8))[0]
+        nr_accesses = struct.unpack('I', f.read(4))[0]
+
+        if saddr >= eaddr:
+            print('wrong region [%d,%d)' % (saddr, eaddr))
+            exit(1)
+
+        max_nr_accesses = aint / sint
+        if nr_accesses > max_nr_accesses:
+            if err_percent(nr_accesses, max_nr_accesses) > 15:
+                print('too high nr_access: expected %d but %d' %
+                        (max_nr_accesses, nr_accesses))
+                exit(1)
+    if nr_gaps != 2:
+        print('number of gaps are not two but %d' % nr_gaps)
+        exit(1)
+
+def parse_time_us(bindat):
+    sec = struct.unpack('l', bindat[0:8])[0]
+    nsec = struct.unpack('l', bindat[8:16])[0]
+    return (sec * 1000000000 + nsec) / 1000
+
+def main():
+    global sint
+    global aint
+    global min_nr
+    global max_nr_regions
+
+    parser = argparse.ArgumentParser()
+    parser.add_argument('file', metavar='<file>',
+            help='path to the record file')
+    parser.add_argument('--attrs', metavar='<attrs>',
+            default='5000 100000 1000000 10 1000',
+            help='content of debugfs attrs file')
+    args = parser.parse_args()
+    file_path = args.file
+    attrs = [int(x) for x in args.attrs.split()]
+    sint, aint, rint, min_nr, max_nr_regions = attrs
+
+    with open(file_path, 'rb') as f:
+        set_fmt_version(f)
+        last_aggr_time = None
+        while True:
+            timebin = f.read(16)
+            if len(timebin) != 16:
+                break
+
+            now = parse_time_us(timebin)
+            if not last_aggr_time:
+                last_aggr_time = now
+            else:
+                error = err_percent(now - last_aggr_time, aint)
+                if error > 15:
+                    print('wrong aggr interval: expected %d, but %d' %
+                            (aint, now - last_aggr_time))
+                    exit(1)
+                last_aggr_time = now
+
+            nr_tasks = struct.unpack('I', f.read(4))[0]
+            for t in range(nr_tasks):
+                chk_task_info(f)
+
+if __name__ == '__main__':
+    main()
diff --git a/tools/testing/selftests/damon/debugfs_attrs.sh b/tools/testing/selftests/damon/debugfs_attrs.sh
new file mode 100755
index 000000000000..d5188b0f71b1
--- /dev/null
+++ b/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -0,0 +1,139 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./_chk_dependency.sh
+
+# Test attrs file
+file="$DBGFS/attrs"
+
+ORIG_CONTENT=$(cat $file)
+
+echo 1 2 3 4 5 > $file
+if [ $? -ne 0 ]
+then
+ echo "$file write failed"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo 1 2 3 4 > $file
+if [ $? -eq 0 ]
+then
+ echo "$file write success (should have failed)"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+CONTENT=$(cat $file)
+if [ "$CONTENT" != "1 2 3 4 5" ]
+then
+ echo "$file not written"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo $ORIG_CONTENT > $file
+
+# Test record file
+file="$DBGFS/record"
+
+ORIG_CONTENT=$(cat $file)
+
+echo "4242 foo.bar" > $file
+if [ $? -ne 0 ]
+then
+ echo "$file writing sane input failed"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo abc 2 3 > $file
+if [ $? -eq 0 ]
+then
+ echo "$file writing insane input 1 success (should have failed)"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo 123 > $file
+if [ $? -eq 0 ]
+then
+ echo "$file writing insane input 2 success (should have failed)"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+CONTENT=$(cat $file)
+if [ "$CONTENT" != "4242 foo.bar" ]
+then
+ echo "$file not written"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo "0 null" > $file
+if [ $? -ne 0 ]
+then
+ echo "$file disabling write fail"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+CONTENT=$(cat $file)
+if [ "$CONTENT" != "0 null" ]
+then
+ echo "$file not disabled"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo "4242 foo.bar" > $file
+if [ $? -ne 0 ]
+then
+ echo "$file writing sane data again fail"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo $ORIG_CONTENT > $file
+
+# Test pids file
+file="$DBGFS/pids"
+
+ORIG_CONTENT=$(cat $file)
+
+echo "1 2 3 4" > $file
+if [ $? -ne 0 ]
+then
+ echo "$file write fail"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo "1 2 abc 4" > $file
+if [ $? -ne 0 ]
+then
+ echo "$file write fail"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo abc 2 3 > $file
+if [ $? -eq 0 ]
+then
+ echo "$file write success (should have failed)"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+CONTENT=$(cat $file)
+if [ "$CONTENT" != "1 2" ]
+then
+ echo "$file not written"
+ echo $ORIG_CONTENT > $file
+ exit 1
+fi
+
+echo $ORIG_CONTENT > $file
+
+echo "PASS"
diff --git a/tools/testing/selftests/damon/debugfs_record.sh b/tools/testing/selftests/damon/debugfs_record.sh
new file mode 100755
index 000000000000..fa9e07eea258
--- /dev/null
+++ b/tools/testing/selftests/damon/debugfs_record.sh
@@ -0,0 +1,50 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source ./_chk_dependency.sh
+
+restore_attrs()
+{
+ echo $ORIG_ATTRS > $DBGFS/attrs
+ echo $ORIG_PIDS > $DBGFS/pids
+ echo $ORIG_RECORD > $DBGFS/record
+}
+
+ORIG_ATTRS=$(cat $DBGFS/attrs)
+ORIG_PIDS=$(cat $DBGFS/pids)
+ORIG_RECORD=$(cat $DBGFS/record)
+
+rfile=$PWD/damon.data
+
+rm -f $rfile
+ATTRS="5000 100000 1000000 10 1000"
+echo $ATTRS > $DBGFS/attrs
+echo 4096 $rfile > $DBGFS/record
+sleep 5 &
+echo $(pidof sleep) > $DBGFS/pids
+echo on > $DBGFS/monitor_on
+sleep 0.5
+killall sleep
+echo off > $DBGFS/monitor_on
+
+sync
+
+if [ ! -f $rfile ]
+then
+ echo "record file not made"
+ restore_attrs
+
+ exit 1
+fi
+
+python3 ./_chk_record.py $rfile --attrs "$ATTRS"
+if [ $? -ne 0 ]
+then
+ echo "record file is wrong"
+ restore_attrs
+ exit 1
+fi
+
+rm -f $rfile
+restore_attrs
+echo "PASS"
--
2.17.1
From: SeongJae Park <[email protected]>
This commit introduces a reference implementation of the address space
specific low level primitives for the virtual address space, so that
users of DAMON can easily monitor the data accesses on virtual address
spaces of specific processes by simply configuring the implementation to
be used by DAMON.
The low level primitives for the fundamental access monitoring are
defined in two parts:
1. Identification of the monitoring target address range for the address
space.
2. Access check of specific address range in the target space.
The reference implementation for the virtual address space provided by
this commit is designed as below.
PTE Accessed-bit Based Access Check
-----------------------------------
The implementation uses the PTE Accessed bit for basic access checks. That
is, it clears the bit for the next sampling target page and checks whether
it is set again after one sampling period. To avoid disturbing other
Accessed bit users such as the reclamation logic, the implementation
adjusts the ``PG_idle`` and ``PG_young`` page flags appropriately, in the
same way as 'Idle Page Tracking'.
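The check described above can be illustrated with a small user-space toy
model. This is only a sketch of the idea, not the kernel code: the
``ToyPage`` class and the ``mkold()``, ``touch()`` and ``was_accessed()``
helpers are hypothetical names made up for this illustration, and a Python
attribute stands in for the hardware-managed Accessed bit.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    """Hypothetical stand-in for a page and its PTE (illustration only)."""
    accessed: bool = False  # models the PTE Accessed bit
    young: bool = False     # models the PG_young page flag
    idle: bool = True       # models the PG_idle page flag


def mkold(page):
    """Clear the Accessed bit without losing the fact that it was set.

    If the bit was set, the information is moved to the page flags so that
    other Accessed bit users can still see that the page was referenced.
    """
    if page.accessed:
        page.idle = False
        page.young = True
    page.accessed = False


def touch(page):
    """Model a memory access; real hardware sets the Accessed bit itself."""
    page.accessed = True


def was_accessed(page):
    """Check, one sampling interval later, whether the bit was set again."""
    return page.accessed


page = ToyPage(accessed=True)
mkold(page)                           # prepare the access check
idle_interval = was_accessed(page)    # nothing touched the page
touch(page)                           # the workload accesses the page
busy_interval = was_accessed(page)    # the access is now visible
```

In this model, ``idle_interval`` is false and ``busy_interval`` is true,
while ``page.young`` keeps the earlier reference visible to other users.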
VMA-based Target Address Range Construction
-------------------------------------------
Only small parts of the super-huge virtual address space of a process
are mapped to physical memory and accessed. Thus, tracking the unmapped
address regions is just wasteful. However, because DAMON can deal with
some level of noise using the adaptive regions adjustment mechanism,
tracking every mapping is not strictly required and could even incur a
high overhead in some cases. That said, excessively huge unmapped areas
inside the monitoring target should be removed so that the adaptive
mechanism does not waste time on them.
For this reason, this implementation converts the complex mappings to
three distinct regions that cover every mapped area of the address
space. The two gaps between the three regions are the two biggest
unmapped areas in the given address space. In most cases, these would
be the gap between the heap and the uppermost mmap()-ed region, and the
gap between the lowermost mmap()-ed region and the stack. Because these
gaps are exceptionally huge in usual address spaces, excluding them is
sufficient to make a reasonable trade-off. The layout is shown below::
<heap>
<BIG UNMAPPED REGION 1>
<uppermost mmap()-ed region>
(small mmap()-ed regions and munmap()-ed regions)
<lowermost mmap()-ed region>
<BIG UNMAPPED REGION 2>
<stack>
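The construction of the three regions can be sketched in user space as
below. This is a simplified illustration under stated assumptions, not the
kernel implementation: ``three_regions()`` is a hypothetical helper, the
mappings are given as an address-sorted list of ``(start, end)`` tuples
instead of VMAs, and the alignment to the minimal region size is omitted.

```python
def three_regions(mappings):
    """Return three (start, end) regions split at the two biggest gaps.

    'mappings' is assumed to be a non-overlapping list of (start, end)
    tuples sorted by address.
    """
    # Collect every non-empty unmapped gap between adjacent mappings.
    gaps = []
    for (_, prev_end), (next_start, _) in zip(mappings, mappings[1:]):
        if next_start > prev_end:
            gaps.append((prev_end, next_start))
    if len(gaps) < 2:
        raise ValueError('need at least two unmapped gaps')

    # Pick the two biggest gaps, then order them by address.
    biggest = sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)[:2]
    first_gap, second_gap = sorted(biggest)

    return [
        (mappings[0][0], first_gap[0]),
        (first_gap[1], second_gap[0]),
        (second_gap[1], mappings[-1][1]),
    ]


# Illustrative mappings with gaps of 175, 80 and 2 address units.
example = [(10, 20), (20, 25), (200, 210), (210, 220), (300, 305), (307, 330)]
result = three_regions(example)
```

For these mappings, the small 305-307 gap stays inside the third region,
so ``result`` is ``[(10, 25), (200, 220), (300, 330)]``.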
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/linux/damon.h | 6 +
mm/damon.c | 474 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 480 insertions(+)
diff --git a/include/linux/damon.h b/include/linux/damon.h
index 3c0b92a679e8..310d36d123b3 100644
--- a/include/linux/damon.h
+++ b/include/linux/damon.h
@@ -144,6 +144,12 @@ struct damon_ctx {
void (*aggregate_cb)(struct damon_ctx *context);
};
+/* Reference callback implementations for virtual memory */
+void kdamond_init_vm_regions(struct damon_ctx *ctx);
+void kdamond_update_vm_regions(struct damon_ctx *ctx);
+void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
+unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
+
int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
unsigned long aggr_int, unsigned long regions_update_int,
diff --git a/mm/damon.c b/mm/damon.c
index b844924b9fdb..386780739007 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -9,6 +9,9 @@
* This file is constructed in below parts.
*
* - Functions and macros for DAMON data structures
+ * - Functions for the initial monitoring target regions construction
+ * - Functions for the dynamic monitoring target regions update
+ * - Functions for the access checking of the regions
* - Functions for DAMON core logics and features
* - Functions for the DAMON programming interface
* - Functions for the module loading/unloading
@@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
return sz;
}
+/*
+ * Get the mm_struct of the given task
+ *
+ * Caller _must_ put the mm_struct after use, unless it is NULL.
+ *
+ * Returns the mm_struct of the task on success, NULL on failure
+ */
+static struct mm_struct *damon_get_mm(struct damon_task *t)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ task = damon_get_task_struct(t);
+ if (!task)
+ return NULL;
+
+ mm = get_task_mm(task);
+ put_task_struct(task);
+ return mm;
+}
+
+/*
+ * Functions for the initial monitoring target regions construction
+ */
+
+/*
+ * Size-evenly split a region into 'nr_pieces' small regions
+ *
+ * Returns 0 on success, or negative error code otherwise.
+ */
+static int damon_split_region_evenly(struct damon_ctx *ctx,
+ struct damon_region *r, unsigned int nr_pieces)
+{
+ unsigned long sz_orig, sz_piece, orig_end;
+ struct damon_region *n = NULL, *next;
+ unsigned long start;
+
+ if (!r || !nr_pieces)
+ return -EINVAL;
+
+ orig_end = r->ar.end;
+ sz_orig = r->ar.end - r->ar.start;
+ sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, MIN_REGION);
+
+ if (!sz_piece)
+ return -EINVAL;
+
+ r->ar.end = r->ar.start + sz_piece;
+ next = damon_next_region(r);
+ for (start = r->ar.end; start + sz_piece <= orig_end;
+ start += sz_piece) {
+ n = damon_new_region(start, start + sz_piece);
+ if (!n)
+ return -ENOMEM;
+ damon_insert_region(n, r, next);
+ r = n;
+ }
+ /* complement last region for possible rounding error */
+ if (n)
+ n->ar.end = orig_end;
+
+ return 0;
+}
+
+static unsigned long sz_range(struct damon_addr_range *r)
+{
+ return r->end - r->start;
+}
+
+static void swap_ranges(struct damon_addr_range *r1,
+ struct damon_addr_range *r2)
+{
+ struct damon_addr_range tmp;
+
+ tmp = *r1;
+ *r1 = *r2;
+ *r2 = tmp;
+}
+
+/*
+ * Find three regions separated by two biggest unmapped regions
+ *
+ * vma the head vma of the target address space
+ * regions an array of three address ranges in which results will be saved
+ *
+ * This function receives an address space and finds three regions in it which
+ * are separated by the two biggest unmapped regions in the space. Please refer
+ * to the comment of the 'damon_init_vm_regions_of()' function below to know
+ * why this is necessary.
+ *
+ * Returns 0 on success, or negative error code otherwise.
+ */
+static int damon_three_regions_in_vmas(struct vm_area_struct *vma,
+ struct damon_addr_range regions[3])
+{
+ struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0};
+ struct vm_area_struct *last_vma = NULL;
+ unsigned long start = 0;
+ struct rb_root rbroot;
+
+ /* Find two biggest gaps so that first_gap > second_gap > others */
+ for (; vma; vma = vma->vm_next) {
+ if (!last_vma) {
+ start = vma->vm_start;
+ goto next;
+ }
+
+ if (vma->rb_subtree_gap <= sz_range(&second_gap)) {
+ rbroot.rb_node = &vma->vm_rb;
+ vma = rb_entry(rb_last(&rbroot),
+ struct vm_area_struct, vm_rb);
+ goto next;
+ }
+
+ gap.start = last_vma->vm_end;
+ gap.end = vma->vm_start;
+ if (sz_range(&gap) > sz_range(&second_gap)) {
+ swap_ranges(&gap, &second_gap);
+ if (sz_range(&second_gap) > sz_range(&first_gap))
+ swap_ranges(&second_gap, &first_gap);
+ }
+next:
+ last_vma = vma;
+ }
+
+ if (!sz_range(&second_gap) || !sz_range(&first_gap))
+ return -EINVAL;
+
+ /* Sort the two biggest gaps by address */
+ if (first_gap.start > second_gap.start)
+ swap_ranges(&first_gap, &second_gap);
+
+ /* Store the result */
+ regions[0].start = ALIGN(start, MIN_REGION);
+ regions[0].end = ALIGN(first_gap.start, MIN_REGION);
+ regions[1].start = ALIGN(first_gap.end, MIN_REGION);
+ regions[1].end = ALIGN(second_gap.start, MIN_REGION);
+ regions[2].start = ALIGN(second_gap.end, MIN_REGION);
+ regions[2].end = ALIGN(last_vma->vm_end, MIN_REGION);
+
+ return 0;
+}
+
+/*
+ * Get the three regions in the given task
+ *
+ * Returns 0 on success, negative error code otherwise.
+ */
+static int damon_three_regions_of(struct damon_task *t,
+ struct damon_addr_range regions[3])
+{
+ struct mm_struct *mm;
+ int rc;
+
+ mm = damon_get_mm(t);
+ if (!mm)
+ return -EINVAL;
+
+ down_read(&mm->mmap_sem);
+ rc = damon_three_regions_in_vmas(mm->mmap, regions);
+ up_read(&mm->mmap_sem);
+
+ mmput(mm);
+ return rc;
+}
+
+/*
+ * Initialize the monitoring target regions for the given task
+ *
+ * t the given target task
+ *
+ * Because only a number of small portions of the entire address space
+ * are actually mapped to the memory and accessed, monitoring the unmapped
+ * regions is wasteful. That said, because we can deal with small noise,
+ * tracking every mapping is not strictly required and could even incur a high
+ * overhead if the mapping frequently changes or the number of mappings is
+ * high. The adaptive regions adjustment mechanism will further help to deal
+ * with the noise by simply identifying the unmapped areas as a region that
+ * has no access. Moreover, applying the real mappings that would have many
+ * unmapped areas inside will make the adaptive mechanism quite complex. That
+ * said, too huge unmapped areas inside the monitoring target should be removed
+ * so that the adaptive mechanism does not waste time on them.
+ *
+ * For this reason, we convert the complex mappings to three distinct regions
+ * that cover every mapped area of the address space. Also the two gaps
+ * between the three regions are the two biggest unmapped areas in the given
+ * address space. In detail, this function first identifies the start and the
+ * end of the mappings and the two biggest unmapped areas of the address space.
+ * Then, it constructs the three regions as below:
+ *
+ * [mappings[0]->start, big_two_unmapped_areas[0]->start)
+ * [big_two_unmapped_areas[0]->end, big_two_unmapped_areas[1]->start)
+ * [big_two_unmapped_areas[1]->end, mappings[nr_mappings - 1]->end)
+ *
+ * As usual memory map of processes is as below, the gap between the heap and
+ * the uppermost mmap()-ed region, and the gap between the lowermost mmap()-ed
+ * region and the stack will be two biggest unmapped regions. Because these
+ * gaps are exceptionally huge areas in usual address space, excluding these
+ * two biggest unmapped regions will be sufficient to make a trade-off.
+ *
+ * <heap>
+ * <BIG UNMAPPED REGION 1>
+ * <uppermost mmap()-ed region>
+ * (other mmap()-ed regions and small unmapped regions)
+ * <lowermost mmap()-ed region>
+ * <BIG UNMAPPED REGION 2>
+ * <stack>
+ */
+static void damon_init_vm_regions_of(struct damon_ctx *c, struct damon_task *t)
+{
+ struct damon_region *r;
+ struct damon_addr_range regions[3];
+ unsigned long sz = 0, nr_pieces;
+ int i;
+
+ if (damon_three_regions_of(t, regions)) {
+ pr_err("Failed to get three regions of task %d\n", t->pid);
+ return;
+ }
+
+ for (i = 0; i < 3; i++)
+ sz += regions[i].end - regions[i].start;
+ if (c->min_nr_regions)
+ sz /= c->min_nr_regions;
+ if (sz < MIN_REGION)
+ sz = MIN_REGION;
+
+ /* Set the initial three regions of the task */
+ for (i = 0; i < 3; i++) {
+ r = damon_new_region(regions[i].start, regions[i].end);
+ if (!r) {
+ pr_err("%d'th init region creation failed\n", i);
+ return;
+ }
+ damon_add_region(r, t);
+
+ nr_pieces = (regions[i].end - regions[i].start) / sz;
+ damon_split_region_evenly(c, r, nr_pieces);
+ }
+}
+
+/* Initialize '->regions_list' of every task */
+void kdamond_init_vm_regions(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+
+ damon_for_each_task(t, ctx) {
+ /* the user may set the target regions as they want */
+ if (!nr_damon_regions(t))
+ damon_init_vm_regions_of(ctx, t);
+ }
+}
+
+/*
+ * Functions for the dynamic monitoring target regions update
+ */
+
+/*
+ * Check whether a region is intersecting an address range
+ *
+ * Returns true if it is.
+ */
+static bool damon_intersect(struct damon_region *r, struct damon_addr_range *re)
+{
+ return !(r->ar.end <= re->start || re->end <= r->ar.start);
+}
+
+/*
+ * Update damon regions for the three big regions of the given task
+ *
+ * t the given task
+ * bregions the three big regions of the task
+ */
+static void damon_apply_three_regions(struct damon_ctx *ctx,
+ struct damon_task *t, struct damon_addr_range bregions[3])
+{
+ struct damon_region *r, *next;
+ unsigned int i = 0;
+
+ /* Remove regions which are not in the three big regions now */
+ damon_for_each_region_safe(r, next, t) {
+ for (i = 0; i < 3; i++) {
+ if (damon_intersect(r, &bregions[i]))
+ break;
+ }
+ if (i == 3)
+ damon_destroy_region(r);
+ }
+
+ /* Adjust intersecting regions to fit with the three big regions */
+ for (i = 0; i < 3; i++) {
+ struct damon_region *first = NULL, *last;
+ struct damon_region *newr;
+ struct damon_addr_range *br;
+
+ br = &bregions[i];
+ /* Get the first and last regions which intersect with br */
+ damon_for_each_region(r, t) {
+ if (damon_intersect(r, br)) {
+ if (!first)
+ first = r;
+ last = r;
+ }
+ if (r->ar.start >= br->end)
+ break;
+ }
+ if (!first) {
+ /* no damon_region intersects with this big region */
+ newr = damon_new_region(
+ ALIGN_DOWN(br->start, MIN_REGION),
+ ALIGN(br->end, MIN_REGION));
+ if (!newr)
+ continue;
+ damon_insert_region(newr, damon_prev_region(r), r);
+ } else {
+ first->ar.start = ALIGN_DOWN(br->start, MIN_REGION);
+ last->ar.end = ALIGN(br->end, MIN_REGION);
+ }
+ }
+}
+
+/*
+ * Update regions for current memory mappings
+ */
+void kdamond_update_vm_regions(struct damon_ctx *ctx)
+{
+ struct damon_addr_range three_regions[3];
+ struct damon_task *t;
+
+ damon_for_each_task(t, ctx) {
+ if (damon_three_regions_of(t, three_regions))
+ continue;
+ damon_apply_three_regions(ctx, t, three_regions);
+ }
+}
+
+/*
+ * Functions for the access checking of the regions
+ */
+
+static void damon_mkold(struct mm_struct *mm, unsigned long addr)
+{
+ pte_t *pte = NULL;
+ pmd_t *pmd = NULL;
+ spinlock_t *ptl;
+
+ if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
+ return;
+
+ if (pte) {
+ if (pte_young(*pte)) {
+ clear_page_idle(pte_page(*pte));
+ set_page_young(pte_page(*pte));
+ }
+ *pte = pte_mkold(*pte);
+ pte_unmap_unlock(pte, ptl);
+ return;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (pmd_young(*pmd)) {
+ clear_page_idle(pmd_page(*pmd));
+ set_page_young(pmd_page(*pmd));
+ }
+ *pmd = pmd_mkold(*pmd);
+ spin_unlock(ptl);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
+
+static void damon_prepare_vm_access_check(struct damon_ctx *ctx,
+ struct mm_struct *mm, struct damon_region *r)
+{
+ r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+
+ damon_mkold(mm, r->sampling_addr);
+}
+
+void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ struct mm_struct *mm;
+ struct damon_region *r;
+
+ damon_for_each_task(t, ctx) {
+ mm = damon_get_mm(t);
+ if (!mm)
+ continue;
+ damon_for_each_region(r, t)
+ damon_prepare_vm_access_check(ctx, mm, r);
+ mmput(mm);
+ }
+}
+
+static bool damon_young(struct mm_struct *mm, unsigned long addr,
+ unsigned long *page_sz)
+{
+ pte_t *pte = NULL;
+ pmd_t *pmd = NULL;
+ spinlock_t *ptl;
+ bool young = false;
+
+ if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
+ return false;
+
+ *page_sz = PAGE_SIZE;
+ if (pte) {
+ young = pte_young(*pte);
+ pte_unmap_unlock(pte, ptl);
+ return young;
+ }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ young = pmd_young(*pmd);
+ spin_unlock(ptl);
+ *page_sz = ((1UL) << HPAGE_PMD_SHIFT);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+ return young;
+}
+
+/*
+ * Check whether the region was accessed after the last preparation
+ *
+ * mm 'mm_struct' for the given virtual address space
+ * r the region to be checked
+ */
+static void damon_check_vm_access(struct damon_ctx *ctx,
+ struct mm_struct *mm, struct damon_region *r)
+{
+ static struct mm_struct *last_mm;
+ static unsigned long last_addr;
+ static unsigned long last_page_sz = PAGE_SIZE;
+ static bool last_accessed;
+
+ /* If the region is in the last checked page, reuse the result */
+ if (mm == last_mm && (ALIGN_DOWN(last_addr, last_page_sz) ==
+ ALIGN_DOWN(r->sampling_addr, last_page_sz))) {
+ if (last_accessed)
+ r->nr_accesses++;
+ return;
+ }
+
+ last_accessed = damon_young(mm, r->sampling_addr, &last_page_sz);
+ if (last_accessed)
+ r->nr_accesses++;
+
+ last_mm = mm;
+ last_addr = r->sampling_addr;
+}
+
+unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx)
+{
+ struct damon_task *t;
+ struct mm_struct *mm;
+ struct damon_region *r;
+ unsigned int max_nr_accesses = 0;
+
+ damon_for_each_task(t, ctx) {
+ mm = damon_get_mm(t);
+ if (!mm)
+ continue;
+ damon_for_each_region(r, t) {
+ damon_check_vm_access(ctx, mm, r);
+ max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
+ }
+ mmput(mm);
+ }
+
+ return max_nr_accesses;
+}
+
/*
* Functions for DAMON core logics and features
*/
--
2.17.1
From: SeongJae Park <[email protected]>
This commit adds a tracepoint for DAMON. It traces the monitoring
results of each region for each aggregation interval. Using this, DAMON
can easily be integrated with tracepoint-supporting tools such as perf.
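As a rough illustration of how a user-space tool might consume the event,
the sketch below parses the payload that the tracepoint's TP_printk()
format ('pid=%d nr_regions=%u %lu-%lu: %u') would print. The
``parse_damon_aggregated()`` helper and the sample line, including its
task, CPU and timestamp prefix, are hypothetical and only serve the
example.

```python
import re

# Mirrors the TP_printk() format 'pid=%d nr_regions=%u %lu-%lu: %u'.
DAMON_AGGREGATED = re.compile(
    r'pid=(-?\d+) nr_regions=(\d+) (\d+)-(\d+): (\d+)')


def parse_damon_aggregated(line):
    """Return the event fields as a dict, or None if the line has none."""
    match = DAMON_AGGREGATED.search(line)
    if match is None:
        return None
    pid, nr_regions, start, end, nr_accesses = map(int, match.groups())
    return {
        'pid': pid,
        'nr_regions': nr_regions,
        'start': start,
        'end': end,
        'nr_accesses': nr_accesses,
    }


# Hypothetical trace line; only the part after the event name is relied on.
sample = ('kdamond.0-1234 [000] .... 100.000000: damon_aggregated: '
          'pid=4242 nr_regions=3 4096-8192: 7')
event = parse_damon_aggregated(sample)
unrelated = parse_damon_aggregated('some other trace event')
```

Searching for the payload instead of matching the whole line keeps the
sketch independent of the exact prefix that the tracing tool prints.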
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Leonard Foerster <[email protected]>
---
include/trace/events/damon.h | 43 ++++++++++++++++++++++++++++++++++++
mm/damon.c | 4 ++++
2 files changed, 47 insertions(+)
create mode 100644 include/trace/events/damon.h
diff --git a/include/trace/events/damon.h b/include/trace/events/damon.h
new file mode 100644
index 000000000000..40b249a28b30
--- /dev/null
+++ b/include/trace/events/damon.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM damon
+
+#if !defined(_TRACE_DAMON_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DAMON_H
+
+#include <linux/damon.h>
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(damon_aggregated,
+
+ TP_PROTO(struct damon_task *t, struct damon_region *r,
+ unsigned int nr_regions),
+
+ TP_ARGS(t, r, nr_regions),
+
+ TP_STRUCT__entry(
+ __field(int, pid)
+ __field(unsigned int, nr_regions)
+ __field(unsigned long, start)
+ __field(unsigned long, end)
+ __field(unsigned int, nr_accesses)
+ ),
+
+ TP_fast_assign(
+ __entry->pid = t->pid;
+ __entry->nr_regions = nr_regions;
+ __entry->start = r->ar.start;
+ __entry->end = r->ar.end;
+ __entry->nr_accesses = r->nr_accesses;
+ ),
+
+ TP_printk("pid=%d nr_regions=%u %lu-%lu: %u", __entry->pid,
+ __entry->nr_regions, __entry->start,
+ __entry->end, __entry->nr_accesses)
+);
+
+#endif /* _TRACE_DAMON_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/damon.c b/mm/damon.c
index 55ecfab64220..00df1a4c3d5c 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -19,6 +19,8 @@
#define pr_fmt(fmt) "damon: " fmt
+#define CREATE_TRACE_POINTS
+
#include <linux/damon.h>
#include <linux/delay.h>
#include <linux/kthread.h>
@@ -29,6 +31,7 @@
#include <linux/sched/mm.h>
#include <linux/sched/task.h>
#include <linux/slab.h>
+#include <trace/events/damon.h>
/* Minimal region size. Every damon_region is aligned by this. */
#define MIN_REGION PAGE_SIZE
@@ -791,6 +794,7 @@ static void kdamond_reset_aggregated(struct damon_ctx *c)
damon_write_rbuf(c, &r->ar.end, sizeof(r->ar.end));
damon_write_rbuf(c, &r->nr_accesses,
sizeof(r->nr_accesses));
+ trace_damon_aggregated(t, r, nr);
r->nr_accesses = 0;
}
}
--
2.17.1
From: SeongJae Park <[email protected]>
This commit adds kunit based unit tests for DAMON.
Signed-off-by: SeongJae Park <[email protected]>
Reviewed-by: Brendan Higgins <[email protected]>
---
mm/Kconfig | 11 +
mm/damon-test.h | 661 ++++++++++++++++++++++++++++++++++++++++++++++++
mm/damon.c | 6 +
3 files changed, 678 insertions(+)
create mode 100644 mm/damon-test.h
diff --git a/mm/Kconfig b/mm/Kconfig
index 464e9594dcec..e32761985611 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -879,4 +879,15 @@ config DAMON
more information.
If unsure, say N.
+config DAMON_KUNIT_TEST
+ bool "Test for damon"
+ depends on DAMON=y && KUNIT
+ help
+ This builds the DAMON KUnit test suite.
+
+ For more information on KUnit and unit tests in general, please refer
+ to the KUnit documentation.
+
+ If unsure, say N.
+
endmenu
diff --git a/mm/damon-test.h b/mm/damon-test.h
new file mode 100644
index 000000000000..b31c7fe913ca
--- /dev/null
+++ b/mm/damon-test.h
@@ -0,0 +1,661 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Access Monitor Unit Tests
+ *
+ * Copyright 2019 Amazon.com, Inc. or its affiliates. All rights reserved.
+ *
+ * Author: SeongJae Park <[email protected]>
+ */
+
+#ifdef CONFIG_DAMON_KUNIT_TEST
+
+#ifndef _DAMON_TEST_H
+#define _DAMON_TEST_H
+
+#include <kunit/test.h>
+
+static void damon_test_str_to_pids(struct kunit *test)
+{
+ char *question;
+ int *answers;
+ int expected[] = {12, 35, 46};
+ ssize_t nr_integers = 0, i;
+
+ question = "123";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
+ KUNIT_EXPECT_EQ(test, 123, answers[0]);
+ kfree(answers);
+
+ question = "123abc";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
+ KUNIT_EXPECT_EQ(test, 123, answers[0]);
+ kfree(answers);
+
+ question = "a123";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+ KUNIT_EXPECT_PTR_EQ(test, answers, (int *)NULL);
+
+ question = "12 35";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
+ for (i = 0; i < nr_integers; i++)
+ KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+ kfree(answers);
+
+ question = "12 35 46";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)3, nr_integers);
+ for (i = 0; i < nr_integers; i++)
+ KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+ kfree(answers);
+
+ question = "12 35 abc 46";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
+ for (i = 0; i < 2; i++)
+ KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+ kfree(answers);
+
+ question = "";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+ KUNIT_EXPECT_PTR_EQ(test, (int *)NULL, answers);
+ kfree(answers);
+
+ question = "\n";
+ answers = str_to_pids(question, strnlen(question, 128), &nr_integers);
+ KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+ KUNIT_EXPECT_PTR_EQ(test, (int *)NULL, answers);
+ kfree(answers);
+}
+
+static void damon_test_regions(struct kunit *test)
+{
+ struct damon_region *r;
+ struct damon_task *t;
+
+ r = damon_new_region(1, 2);
+ KUNIT_EXPECT_EQ(test, 1ul, r->ar.start);
+ KUNIT_EXPECT_EQ(test, 2ul, r->ar.end);
+ KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses);
+
+ t = damon_new_task(42);
+ KUNIT_EXPECT_EQ(test, 0u, nr_damon_regions(t));
+
+ damon_add_region(r, t);
+ KUNIT_EXPECT_EQ(test, 1u, nr_damon_regions(t));
+
+ damon_del_region(r);
+ KUNIT_EXPECT_EQ(test, 0u, nr_damon_regions(t));
+
+ damon_free_task(t);
+}
+
+static void damon_test_tasks(struct kunit *test)
+{
+ struct damon_ctx *c = &damon_user_ctx;
+ struct damon_task *t;
+
+ t = damon_new_task(42);
+ KUNIT_EXPECT_EQ(test, 42, t->pid);
+ KUNIT_EXPECT_EQ(test, 0u, nr_damon_tasks(c));
+
+ damon_add_task(&damon_user_ctx, t);
+ KUNIT_EXPECT_EQ(test, 1u, nr_damon_tasks(c));
+
+ damon_destroy_task(t);
+ KUNIT_EXPECT_EQ(test, 0u, nr_damon_tasks(c));
+}
+
+static void damon_test_set_pids(struct kunit *test)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ int pids[] = {1, 2, 3};
+ char buf[64];
+
+ damon_set_pids(ctx, pids, 3);
+ damon_sprint_pids(ctx, buf, 64);
+ KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2 3\n");
+
+ damon_set_pids(ctx, NULL, 0);
+ damon_sprint_pids(ctx, buf, 64);
+ KUNIT_EXPECT_STREQ(test, (char *)buf, "\n");
+
+ damon_set_pids(ctx, (int []){1, 2}, 2);
+ damon_sprint_pids(ctx, buf, 64);
+ KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2\n");
+
+ damon_set_pids(ctx, (int []){2}, 1);
+ damon_sprint_pids(ctx, buf, 64);
+ KUNIT_EXPECT_STREQ(test, (char *)buf, "2\n");
+
+ damon_set_pids(ctx, NULL, 0);
+ damon_sprint_pids(ctx, buf, 64);
+ KUNIT_EXPECT_STREQ(test, (char *)buf, "\n");
+}
+
+static void damon_test_set_recording(struct kunit *test)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ int err;
+
+ err = damon_set_recording(ctx, 42, "foo");
+ KUNIT_EXPECT_EQ(test, err, -EINVAL);
+ damon_set_recording(ctx, 4242, "foo.bar");
+ KUNIT_EXPECT_EQ(test, ctx->rbuf_len, 4242u);
+ KUNIT_EXPECT_STREQ(test, ctx->rfile_path, "foo.bar");
+ damon_set_recording(ctx, 424242, "foo");
+ KUNIT_EXPECT_EQ(test, ctx->rbuf_len, 424242u);
+ KUNIT_EXPECT_STREQ(test, ctx->rfile_path, "foo");
+}
+
+static void __link_vmas(struct vm_area_struct *vmas, ssize_t nr_vmas)
+{
+ int i, j;
+ unsigned long largest_gap, gap;
+
+ if (!nr_vmas)
+ return;
+
+ for (i = 0; i < nr_vmas - 1; i++) {
+ vmas[i].vm_next = &vmas[i + 1];
+
+ vmas[i].vm_rb.rb_left = NULL;
+ vmas[i].vm_rb.rb_right = &vmas[i + 1].vm_rb;
+
+ largest_gap = 0;
+ for (j = i; j < nr_vmas; j++) {
+ if (j == 0)
+ continue;
+ gap = vmas[j].vm_start - vmas[j - 1].vm_end;
+ if (gap > largest_gap)
+ largest_gap = gap;
+ }
+ vmas[i].rb_subtree_gap = largest_gap;
+ }
+ vmas[i].vm_next = NULL;
+ vmas[i].vm_rb.rb_right = NULL;
+ vmas[i].rb_subtree_gap = 0;
+}
+
+/*
+ * Test damon_three_regions_in_vmas() function
+ *
+ * DAMON converts the complex and dynamic memory mappings of each target task
+ * to three discontiguous regions which cover every mapped area. However, the
+ * three regions should not include the two biggest unmapped areas in the
+ * original mapping, because the two biggest areas are normally the areas
+ * between 1) heap and the mmap()-ed regions, and 2) the mmap()-ed regions and
+ * stack. Because these two unmapped areas are very huge but obviously never
+ * accessed, covering them is just a waste.
+ *
+ * 'damon_three_regions_in_vmas()' receives an address space of a process. It
+ * first identifies the start of mappings, end of mappings, and the two biggest
+ * unmapped areas. After that, based on the information, it constructs the
+ * three regions and returns. For more detail, refer to the comment of the
+ * 'damon_init_vm_regions_of()' function definition in 'mm/damon.c'.
+ *
+ * For example, suppose virtual address ranges of 10-20, 20-25, 200-210,
+ * 210-220, 300-305, and 307-330 (Other comments represent these mappings in
+ * a shorter form: 10-20-25, 200-210-220, 300-305, 307-330) of a process are
+ * mapped. To cover every mapping, the three regions should start with 10,
+ * and end with 330. The process also has three unmapped areas, 25-200,
+ * 220-300, and 305-307. Among those, 25-200 and 220-300 are the biggest two
+ * unmapped areas, and thus the mappings should be converted to three regions
+ * of 10-25, 200-220, and 300-330.
+ */
+static void damon_test_three_regions_in_vmas(struct kunit *test)
+{
+ struct damon_addr_range regions[3] = {0,};
+ /* 10-20-25, 200-210-220, 300-305, 307-330 */
+ struct vm_area_struct vmas[] = {
+ (struct vm_area_struct) {.vm_start = 10, .vm_end = 20},
+ (struct vm_area_struct) {.vm_start = 20, .vm_end = 25},
+ (struct vm_area_struct) {.vm_start = 200, .vm_end = 210},
+ (struct vm_area_struct) {.vm_start = 210, .vm_end = 220},
+ (struct vm_area_struct) {.vm_start = 300, .vm_end = 305},
+ (struct vm_area_struct) {.vm_start = 307, .vm_end = 330},
+ };
+
+ __link_vmas(vmas, 6);
+
+ damon_three_regions_in_vmas(&vmas[0], regions);
+
+ KUNIT_EXPECT_EQ(test, 10ul, regions[0].start);
+ KUNIT_EXPECT_EQ(test, 25ul, regions[0].end);
+ KUNIT_EXPECT_EQ(test, 200ul, regions[1].start);
+ KUNIT_EXPECT_EQ(test, 220ul, regions[1].end);
+ KUNIT_EXPECT_EQ(test, 300ul, regions[2].start);
+ KUNIT_EXPECT_EQ(test, 330ul, regions[2].end);
+}
+
+/* Clean up global state of damon */
+static void damon_cleanup_global_state(void)
+{
+ struct damon_task *t, *next;
+
+ damon_for_each_task_safe(t, next, &damon_user_ctx)
+ damon_destroy_task(t);
+
+ damon_user_ctx.rbuf_offset = 0;
+}
+
+/*
+ * Test kdamond_reset_aggregated()
+ *
+ * DAMON checks access to each region and aggregates this information as the
+ * access frequency of each region. In detail, it increases '->nr_accesses' of
+ * regions where an access was confirmed. 'kdamond_reset_aggregated()' flushes
+ * the aggregated information ('->nr_accesses' of each region) to the result
+ * buffer. As a result of the flushing, the '->nr_accesses' of regions are
+ * initialized to zero.
+ */
+static void damon_test_aggregate(struct kunit *test)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ int pids[] = {1, 2, 3};
+ unsigned long saddr[][3] = {{10, 20, 30}, {5, 42, 49}, {13, 33, 55} };
+ unsigned long eaddr[][3] = {{15, 27, 40}, {31, 45, 55}, {23, 44, 66} };
+ unsigned long accesses[][3] = {{42, 95, 84}, {10, 20, 30}, {0, 1, 2} };
+ struct damon_task *t;
+ struct damon_region *r;
+ int it, ir;
+ ssize_t sz, sr, sp;
+
+ damon_set_recording(ctx, 4242, "damon.data");
+ damon_set_pids(ctx, pids, 3);
+
+ it = 0;
+ damon_for_each_task(t, ctx) {
+ for (ir = 0; ir < 3; ir++) {
+ r = damon_new_region(saddr[it][ir], eaddr[it][ir]);
+ r->nr_accesses = accesses[it][ir];
+ damon_add_region(r, t);
+ }
+ it++;
+ }
+ kdamond_reset_aggregated(ctx);
+ it = 0;
+ damon_for_each_task(t, ctx) {
+ ir = 0;
+ /* '->nr_accesses' should be zeroed */
+ damon_for_each_region(r, t) {
+ KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses);
+ ir++;
+ }
+ /* regions should be preserved */
+ KUNIT_EXPECT_EQ(test, 3, ir);
+ it++;
+ }
+ /* tasks also should be preserved */
+ KUNIT_EXPECT_EQ(test, 3, it);
+
+ /* The aggregated information should be written in the buffer */
+ sr = sizeof(r->ar.start) + sizeof(r->ar.end) + sizeof(r->nr_accesses);
+ sp = sizeof(t->pid) + sizeof(unsigned int) + 3 * sr;
+ sz = sizeof(struct timespec64) + sizeof(unsigned int) + 3 * sp;
+ KUNIT_EXPECT_EQ(test, (unsigned int)sz, ctx->rbuf_offset);
+
+ damon_set_recording(ctx, 0, "damon.data");
+ damon_cleanup_global_state();
+}
+
+static void damon_test_write_rbuf(struct kunit *test)
+{
+ struct damon_ctx *ctx = &damon_user_ctx;
+ char *data;
+
+ damon_set_recording(&damon_user_ctx, 4242, "damon.data");
+
+ data = "hello";
+ damon_write_rbuf(ctx, data, strnlen(data, 256));
+ KUNIT_EXPECT_EQ(test, ctx->rbuf_offset, 5u);
+
+ damon_write_rbuf(ctx, data, 0);
+ KUNIT_EXPECT_EQ(test, ctx->rbuf_offset, 5u);
+
+ KUNIT_EXPECT_STREQ(test, (char *)ctx->rbuf, data);
+ damon_set_recording(&damon_user_ctx, 0, "damon.data");
+}
+
+static struct damon_region *__nth_region_of(struct damon_task *t, int idx)
+{
+ struct damon_region *r;
+ int i = 0;
+
+ damon_for_each_region(r, t) {
+ if (i++ == idx)
+ return r;
+ }
+
+ return NULL;
+}
+
+/*
+ * Test 'damon_apply_three_regions()'
+ *
+ * test kunit object
+ * regions an array containing start/end addresses of current
+ * monitoring target regions
+ * nr_regions the number of the addresses in 'regions'
+ * three_regions The three regions that need to be applied now
+ * expected start/end addresses of monitoring target regions after
+ * 'three_regions' are applied
+ * nr_expected the number of addresses in 'expected'
+ *
+ * The memory mapping of the target processes changes dynamically. To follow
+ * the change, DAMON periodically reads the mappings, simplifies them to the
+ * three regions, and updates the monitoring target regions to fit in the three
+ * regions. The update of current target regions is the role of
+ * 'damon_apply_three_regions()'.
+ *
+ * This test passes the given target regions and the new three regions that
+ * need to be applied to the function, and checks whether it updates the
+ * regions as expected.
+ */
+static void damon_do_test_apply_three_regions(struct kunit *test,
+ unsigned long *regions, int nr_regions,
+ struct damon_addr_range *three_regions,
+ unsigned long *expected, int nr_expected)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+ int i;
+
+ t = damon_new_task(42);
+ for (i = 0; i < nr_regions / 2; i++) {
+ r = damon_new_region(regions[i * 2], regions[i * 2 + 1]);
+ damon_add_region(r, t);
+ }
+ damon_add_task(&damon_user_ctx, t);
+
+ damon_apply_three_regions(&damon_user_ctx, t, three_regions);
+
+ for (i = 0; i < nr_expected / 2; i++) {
+ r = __nth_region_of(t, i);
+ KUNIT_EXPECT_EQ(test, r->ar.start, expected[i * 2]);
+ KUNIT_EXPECT_EQ(test, r->ar.end, expected[i * 2 + 1]);
+ }
+
+ damon_cleanup_global_state();
+}
+
+/*
+ * This function tests the most common case where the three big regions are
+ * only slightly changed. Target regions should adjust their boundaries
+ * (10-20-30, 50-55, 70-80, 90-100) to fit the new big regions, and target
+ * regions that are now out of the three regions (55-57, 57-59) are removed.
+ */
+static void damon_test_apply_three_regions1(struct kunit *test)
+{
+ /* 10-20-30, 50-55-57-59, 70-80-90-100 */
+ unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+ 70, 80, 80, 90, 90, 100};
+ /* 5-27, 45-55, 73-104 */
+ struct damon_addr_range new_three_regions[3] = {
+ (struct damon_addr_range){.start = 5, .end = 27},
+ (struct damon_addr_range){.start = 45, .end = 55},
+ (struct damon_addr_range){.start = 73, .end = 104} };
+ /* 5-20-27, 45-55, 73-80-90-104 */
+ unsigned long expected[] = {5, 20, 20, 27, 45, 55,
+ 73, 80, 80, 90, 90, 104};
+
+ damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+ new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test a slightly bigger change. Similar to above, but the second big region
+ * now requires two target regions (50-55, 57-59) to be removed.
+ */
+static void damon_test_apply_three_regions2(struct kunit *test)
+{
+ /* 10-20-30, 50-55-57-59, 70-80-90-100 */
+ unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+ 70, 80, 80, 90, 90, 100};
+ /* 5-27, 56-57, 65-104 */
+ struct damon_addr_range new_three_regions[3] = {
+ (struct damon_addr_range){.start = 5, .end = 27},
+ (struct damon_addr_range){.start = 56, .end = 57},
+ (struct damon_addr_range){.start = 65, .end = 104} };
+ /* 5-20-27, 56-57, 65-80-90-104 */
+ unsigned long expected[] = {5, 20, 20, 27, 56, 57,
+ 65, 80, 80, 90, 90, 104};
+
+ damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+ new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test a big change. The second big region has been totally freed and mapped
+ * to a different area (50-59 -> 61-63). The target regions which were in the
+ * old second big region (50-55-57-59) should be removed and a new target
+ * region covering the second big region (61-63) should be created.
+ */
+static void damon_test_apply_three_regions3(struct kunit *test)
+{
+ /* 10-20-30, 50-55-57-59, 70-80-90-100 */
+ unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+ 70, 80, 80, 90, 90, 100};
+ /* 5-27, 61-63, 65-104 */
+ struct damon_addr_range new_three_regions[3] = {
+ (struct damon_addr_range){.start = 5, .end = 27},
+ (struct damon_addr_range){.start = 61, .end = 63},
+ (struct damon_addr_range){.start = 65, .end = 104} };
+ /* 5-20-27, 61-63, 65-80-90-104 */
+ unsigned long expected[] = {5, 20, 20, 27, 61, 63,
+ 65, 80, 80, 90, 90, 104};
+
+ damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+ new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test another big change. Both the second and third big regions (50-59
+ * and 70-100) have been totally freed and mapped to different areas (30-32
+ * and 65-68). The target regions which were in the old second and third big
+ * regions should now be removed and new target regions covering the new second
+ * and third big regions should be created.
+ */
+static void damon_test_apply_three_regions4(struct kunit *test)
+{
+ /* 10-20-30, 50-55-57-59, 70-80-90-100 */
+ unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+ 70, 80, 80, 90, 90, 100};
+ /* 5-7, 30-32, 65-68 */
+ struct damon_addr_range new_three_regions[3] = {
+ (struct damon_addr_range){.start = 5, .end = 7},
+ (struct damon_addr_range){.start = 30, .end = 32},
+ (struct damon_addr_range){.start = 65, .end = 68} };
+ /* expect 5-7, 30-32, 65-68 */
+ unsigned long expected[] = {5, 7, 30, 32, 65, 68};
+
+ damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+ new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+static void damon_test_split_evenly(struct kunit *test)
+{
+ struct damon_ctx *c = &damon_user_ctx;
+ struct damon_task *t;
+ struct damon_region *r;
+ unsigned long i;
+
+ KUNIT_EXPECT_EQ(test, damon_split_region_evenly(c, NULL, 5), -EINVAL);
+
+ t = damon_new_task(42);
+ r = damon_new_region(0, 100);
+ KUNIT_EXPECT_EQ(test, damon_split_region_evenly(c, r, 0), -EINVAL);
+
+ damon_add_region(r, t);
+ KUNIT_EXPECT_EQ(test, damon_split_region_evenly(c, r, 10), 0);
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 10u);
+
+ i = 0;
+ damon_for_each_region(r, t) {
+ KUNIT_EXPECT_EQ(test, r->ar.start, i++ * 10);
+ KUNIT_EXPECT_EQ(test, r->ar.end, i * 10);
+ }
+ damon_free_task(t);
+
+ t = damon_new_task(42);
+ r = damon_new_region(5, 59);
+ damon_add_region(r, t);
+ KUNIT_EXPECT_EQ(test, damon_split_region_evenly(c, r, 5), 0);
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 5u);
+
+ i = 0;
+ damon_for_each_region(r, t) {
+ if (i == 4)
+ break;
+ KUNIT_EXPECT_EQ(test, r->ar.start, 5 + 10 * i++);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 5 + 10 * i);
+ }
+ KUNIT_EXPECT_EQ(test, r->ar.start, 5 + 10 * i);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 59ul);
+ damon_free_task(t);
+
+ t = damon_new_task(42);
+ r = damon_new_region(5, 6);
+ damon_add_region(r, t);
+ KUNIT_EXPECT_EQ(test, damon_split_region_evenly(c, r, 2), -EINVAL);
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 1u);
+
+ damon_for_each_region(r, t) {
+ KUNIT_EXPECT_EQ(test, r->ar.start, 5ul);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 6ul);
+ }
+ damon_free_task(t);
+}
+
+static void damon_test_split_at(struct kunit *test)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+
+ t = damon_new_task(42);
+ r = damon_new_region(0, 100);
+ damon_add_region(r, t);
+ damon_split_region_at(&damon_user_ctx, r, 25);
+ KUNIT_EXPECT_EQ(test, r->ar.start, 0ul);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 25ul);
+
+ r = damon_next_region(r);
+ KUNIT_EXPECT_EQ(test, r->ar.start, 25ul);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 100ul);
+
+ damon_free_task(t);
+}
+
+static void damon_test_merge_two(struct kunit *test)
+{
+ struct damon_task *t;
+ struct damon_region *r, *r2, *r3;
+ int i;
+
+ t = damon_new_task(42);
+ r = damon_new_region(0, 100);
+ r->nr_accesses = 10;
+ damon_add_region(r, t);
+ r2 = damon_new_region(100, 300);
+ r2->nr_accesses = 20;
+ damon_add_region(r2, t);
+
+ damon_merge_two_regions(r, r2);
+ KUNIT_EXPECT_EQ(test, r->ar.start, 0ul);
+ KUNIT_EXPECT_EQ(test, r->ar.end, 300ul);
+ KUNIT_EXPECT_EQ(test, r->nr_accesses, 16u);
+
+ i = 0;
+ damon_for_each_region(r3, t) {
+ KUNIT_EXPECT_PTR_EQ(test, r, r3);
+ i++;
+ }
+ KUNIT_EXPECT_EQ(test, i, 1);
+
+ damon_free_task(t);
+}
+
+static void damon_test_merge_regions_of(struct kunit *test)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+ unsigned long sa[] = {0, 100, 114, 122, 130, 156, 170, 184};
+ unsigned long ea[] = {100, 112, 122, 130, 156, 170, 184, 230};
+ unsigned int nrs[] = {0, 0, 10, 10, 20, 30, 1, 2};
+
+ unsigned long saddrs[] = {0, 114, 130, 156, 170};
+ unsigned long eaddrs[] = {112, 130, 156, 170, 230};
+ int i;
+
+ t = damon_new_task(42);
+ for (i = 0; i < ARRAY_SIZE(sa); i++) {
+ r = damon_new_region(sa[i], ea[i]);
+ r->nr_accesses = nrs[i];
+ damon_add_region(r, t);
+ }
+
+ damon_merge_regions_of(t, 9, 9999);
+ /* 0-112, 114-130, 130-156, 156-170, 170-230 */
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 5u);
+ for (i = 0; i < 5; i++) {
+ r = __nth_region_of(t, i);
+ KUNIT_EXPECT_EQ(test, r->ar.start, saddrs[i]);
+ KUNIT_EXPECT_EQ(test, r->ar.end, eaddrs[i]);
+ }
+ damon_free_task(t);
+}
+
+static void damon_test_split_regions_of(struct kunit *test)
+{
+ struct damon_task *t;
+ struct damon_region *r;
+
+ t = damon_new_task(42);
+ r = damon_new_region(0, 22);
+ damon_add_region(r, t);
+ damon_split_regions_of(&damon_user_ctx, t, 2);
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 2u);
+ damon_free_task(t);
+
+ t = damon_new_task(42);
+ r = damon_new_region(0, 220);
+ damon_add_region(r, t);
+ damon_split_regions_of(&damon_user_ctx, t, 4);
+ KUNIT_EXPECT_EQ(test, nr_damon_regions(t), 4u);
+ damon_free_task(t);
+}
+
+static struct kunit_case damon_test_cases[] = {
+ KUNIT_CASE(damon_test_str_to_pids),
+ KUNIT_CASE(damon_test_tasks),
+ KUNIT_CASE(damon_test_regions),
+ KUNIT_CASE(damon_test_set_pids),
+ KUNIT_CASE(damon_test_set_recording),
+ KUNIT_CASE(damon_test_three_regions_in_vmas),
+ KUNIT_CASE(damon_test_aggregate),
+ KUNIT_CASE(damon_test_write_rbuf),
+ KUNIT_CASE(damon_test_apply_three_regions1),
+ KUNIT_CASE(damon_test_apply_three_regions2),
+ KUNIT_CASE(damon_test_apply_three_regions3),
+ KUNIT_CASE(damon_test_apply_three_regions4),
+ KUNIT_CASE(damon_test_split_evenly),
+ KUNIT_CASE(damon_test_split_at),
+ KUNIT_CASE(damon_test_merge_two),
+ KUNIT_CASE(damon_test_merge_regions_of),
+ KUNIT_CASE(damon_test_split_regions_of),
+ {},
+};
+
+static struct kunit_suite damon_test_suite = {
+ .name = "damon",
+ .test_cases = damon_test_cases,
+};
+kunit_test_suite(damon_test_suite);
+
+#endif /* _DAMON_TEST_H */
+
+#endif /* CONFIG_DAMON_KUNIT_TEST */
diff --git a/mm/damon.c b/mm/damon.c
index df05bd821ff8..0f906126d21f 100644
--- a/mm/damon.c
+++ b/mm/damon.c
@@ -36,7 +36,11 @@
#include <trace/events/damon.h>
/* Minimal region size. Every damon_region is aligned by this. */
+#ifndef CONFIG_DAMON_KUNIT_TEST
#define MIN_REGION PAGE_SIZE
+#else
+#define MIN_REGION 1
+#endif
/*
* Functions and macros for DAMON data structures
@@ -1626,3 +1630,5 @@ module_exit(damon_exit);
MODULE_LICENSE("GPL");
MODULE_AUTHOR("SeongJae Park <[email protected]>");
MODULE_DESCRIPTION("DAMON: Data Access MONitor");
+
+#include "damon-test.h"
--
2.17.1
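For readers following 'damon_test_three_regions_in_vmas()' in the patch above, the conversion it exercises can be sketched in userspace as below. This is only a minimal sketch, not the code in 'mm/damon.c': 'struct addr_range', 'range_size()' and 'three_regions()' are hypothetical names, and the input mappings are assumed to be sorted by address and non-overlapping.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's 'struct damon_addr_range'. */
struct addr_range {
	unsigned long start;
	unsigned long end;
};

static unsigned long range_size(const struct addr_range *r)
{
	return r->end - r->start;
}

/*
 * Convert 'nr' mappings into three regions separated by the two biggest
 * unmapped gaps.  'maps' is assumed to be sorted by address and
 * non-overlapping.  Returns 0 on success, -1 if there are fewer than two
 * non-empty gaps.
 */
static int three_regions(const struct addr_range *maps, size_t nr,
			 struct addr_range regions[3])
{
	struct addr_range first = {0, 0};	/* biggest gap so far */
	struct addr_range second = {0, 0};	/* second biggest gap so far */
	struct addr_range gap, tmp;
	size_t i;

	for (i = 1; i < nr; i++) {
		gap.start = maps[i - 1].end;
		gap.end = maps[i].start;
		if (range_size(&gap) <= range_size(&second))
			continue;
		second = gap;
		if (range_size(&second) > range_size(&first)) {
			tmp = first;
			first = second;
			second = tmp;
		}
	}
	if (!range_size(&second))
		return -1;

	/* Order the two gaps by address before cutting the address space. */
	if (first.start > second.start) {
		tmp = first;
		first = second;
		second = tmp;
	}

	regions[0].start = maps[0].start;
	regions[0].end = first.start;
	regions[1].start = first.end;
	regions[1].end = second.start;
	regions[2].start = second.end;
	regions[2].end = maps[nr - 1].end;
	return 0;
}
```

With the six mappings used by the test (10-20, 20-25, 200-210, 210-220, 300-305, 307-330), the two biggest gaps are 25-200 and 220-300, so the sketch yields 10-25, 200-220, and 300-330, the same values the KUNIT_EXPECT_EQ() calls check.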
From: SeongJae Park <[email protected]>
This commit updates the MAINTAINERS file for DAMON-related files.
Signed-off-by: SeongJae Park <[email protected]>
---
MAINTAINERS | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 50659d76976b..23348005f5bd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4686,6 +4686,19 @@ F: net/ax25/ax25_out.c
F: net/ax25/ax25_timer.c
F: net/ax25/sysctl_net_ax25.c
+DATA ACCESS MONITOR
+M: SeongJae Park <[email protected]>
+L: [email protected]
+S: Maintained
+F: Documentation/admin-guide/mm/damon/*
+F: Documentation/vm/damon/*
+F: include/linux/damon.h
+F: include/trace/events/damon.h
+F: mm/damon-test.h
+F: mm/damon.c
+F: tools/damon/*
+F: tools/testing/selftests/damon/*
+
DAVICOM FAST ETHERNET (DMFE) NETWORK DRIVER
L: [email protected]
S: Orphan
--
2.17.1
Hi,
On Mon, Jul 13, 2020 at 10:41:31AM +0200, SeongJae Park wrote:
> From: SeongJae Park <[email protected]>
>
> This commit exports 'lookup_page_ext()' to GPL modules. It will be used
> by DAMON in following commit for the implementation of the region based
> sampling.
Maybe I'm missing something, but why is DAMON a module?
> Signed-off-by: SeongJae Park <[email protected]>
> Reviewed-by: Leonard Foerster <[email protected]>
> Reviewed-by: Varad Gautam <[email protected]>
> ---
> mm/page_ext.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index a3616f7a0e9e..9d802d01fcb5 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -131,6 +131,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
> MAX_ORDER_NR_PAGES);
> return get_entry(base, index);
> }
> +EXPORT_SYMBOL_GPL(lookup_page_ext);
>
> static int __init alloc_node_page_ext(int nid)
> {
> --
> 2.17.1
>
--
Sincerely yours,
Mike.
On Mon, 13 Jul 2020 15:08:42 +0300 Mike Rapoport <[email protected]> wrote:
> Hi,
>
> On Mon, Jul 13, 2020 at 10:41:31AM +0200, SeongJae Park wrote:
> > From: SeongJae Park <[email protected]>
> >
> > This commit exports 'lookup_page_ext()' to GPL modules. It will be used
> > by DAMON in following commit for the implementation of the region based
> > sampling.
>
> Maybe I'm missing something, but why is DAMON a module?
I made it loadable just for easier adoption from downstream kernels. I could
drop the module build support if asked.
Thanks,
SeongJae Park
>
> > Signed-off-by: SeongJae Park <[email protected]>
> > Reviewed-by: Leonard Foerster <[email protected]>
> > Reviewed-by: Varad Gautam <[email protected]>
> > ---
> > mm/page_ext.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > index a3616f7a0e9e..9d802d01fcb5 100644
> > --- a/mm/page_ext.c
> > +++ b/mm/page_ext.c
> > @@ -131,6 +131,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
> > MAX_ORDER_NR_PAGES);
> > return get_entry(base, index);
> > }
> > +EXPORT_SYMBOL_GPL(lookup_page_ext);
> >
> > static int __init alloc_node_page_ext(int nid)
> > {
> > --
> > 2.17.1
> >
>
> --
> Sincerely yours,
> Mike.
On Mon, Jul 13, 2020 at 02:21:43PM +0200, SeongJae Park wrote:
> On Mon, 13 Jul 2020 15:08:42 +0300 Mike Rapoport <[email protected]> wrote:
>
> > Hi,
> >
> > On Mon, Jul 13, 2020 at 10:41:31AM +0200, SeongJae Park wrote:
> > > From: SeongJae Park <[email protected]>
> > >
> > > This commit exports 'lookup_page_ext()' to GPL modules. It will be used
> > > by DAMON in following commit for the implementation of the region based
> > > sampling.
> >
> > Maybe I'm missing something, but why is DAMON a module?
>
> I made it loadable just for easier adoption from downstream kernels. I could
> drop the module build support if asked.
Well, exporting core mm symbols to modules should be considered very
carefully.
Why is lookup_page_ext() required for DAMON? It is not used anywhere in
this patchset.
> Thanks,
> SeongJae Park
>
> >
> > > Signed-off-by: SeongJae Park <[email protected]>
> > > Reviewed-by: Leonard Foerster <[email protected]>
> > > Reviewed-by: Varad Gautam <[email protected]>
> > > ---
> > > mm/page_ext.c | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > > index a3616f7a0e9e..9d802d01fcb5 100644
> > > --- a/mm/page_ext.c
> > > +++ b/mm/page_ext.c
> > > @@ -131,6 +131,7 @@ struct page_ext *lookup_page_ext(const struct page *page)
> > > MAX_ORDER_NR_PAGES);
> > > return get_entry(base, index);
> > > }
> > > +EXPORT_SYMBOL_GPL(lookup_page_ext);
> > >
> > > static int __init alloc_node_page_ext(int nid)
> > > {
> > > --
> > > 2.17.1
> > >
> >
> > --
> > Sincerely yours,
> > Mike.
--
Sincerely yours,
Mike.
On Mon, 13 Jul 2020 20:19:09 +0300 Mike Rapoport <[email protected]> wrote:
> On Mon, Jul 13, 2020 at 02:21:43PM +0200, SeongJae Park wrote:
> > On Mon, 13 Jul 2020 15:08:42 +0300 Mike Rapoport <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > On Mon, Jul 13, 2020 at 10:41:31AM +0200, SeongJae Park wrote:
> > > > From: SeongJae Park <[email protected]>
> > > >
> > > > This commit exports 'lookup_page_ext()' to GPL modules. It will be used
> > > > by DAMON in following commit for the implementation of the region based
> > > > sampling.
> > >
> > > Maybe I'm missing something, but why is DAMON a module?
> >
> > I made it loadable just for easier adoption from downstream kernels. I could
> > drop the module build support if asked.
>
> Well, exporting core mm symbols to modules should be considered very
> carefully.
Agreed. I will drop the module support from the next spin.
>
> Why lookup_page_ext() is required for DAMON? It is not used anywhere in
> this patchset.
It's indirectly used. In the 6th patch, DAMON uses 'set_page_young()' to not
interfere with other PTE Accessed bit users. And, 'set_page_young()' uses
'lookup_page_ext()' if !CONFIG_64BIT. That's why I exported it.
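To illustrate the dependency, below is a simplified userspace mock of the two variants of 'set_page_young()'. It is a sketch under assumptions, not the kernel code: every 'mock_' and 'MOCK_' name, the struct layouts, and the bit position are made up for illustration, and the real helpers are in 'include/linux/page_idle.h'.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel types; the layouts are assumptions. */
struct page_ext {
	unsigned long flags;
};

struct page {
	unsigned long flags;
	struct page_ext ext;	/* the kernel keeps page_ext outside struct page */
};

#define MOCK_YOUNG_BIT	(1UL << 0)	/* hypothetical bit position */

static int nr_page_ext_lookups;	/* counts lookups, for illustration only */

#ifdef MOCK_CONFIG_64BIT
/* 64-bit: page flags have room, so the young state lives in page->flags. */
static void mock_set_page_young(struct page *page)
{
	page->flags |= MOCK_YOUNG_BIT;
}

static bool mock_page_is_young(struct page *page)
{
	return page->flags & MOCK_YOUNG_BIT;
}
#else
/* Stand-in for lookup_page_ext(), the symbol this patch exports. */
static struct page_ext *mock_lookup_page_ext(struct page *page)
{
	nr_page_ext_lookups++;
	return &page->ext;
}

/* 32-bit: the young state lives in page_ext, so a lookup is needed. */
static void mock_set_page_young(struct page *page)
{
	struct page_ext *ext = mock_lookup_page_ext(page);

	ext->flags |= MOCK_YOUNG_BIT;
}

static bool mock_page_is_young(struct page *page)
{
	return mock_lookup_page_ext(page)->flags & MOCK_YOUNG_BIT;
}
#endif
```

Built without MOCK_CONFIG_64BIT, every call goes through the lookup helper, which is why a modular DAMON would need that symbol on 32-bit builds.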
Thanks,
SeongJae Park
On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> This commit introduces a reference implementation of the address space
> specific low level primitives for the virtual address space, so that
> users of DAMON can easily monitor the data accesses on virtual address
> spaces of specific processes by simply configuring the implementation to
> be used by DAMON.
>
> The low level primitives for the fundamental access monitoring are
> defined in two parts:
> 1. Identification of the monitoring target address range for the address
> space.
> 2. Access check of specific address range in the target space.
>
> The reference implementation for the virtual address space provided by
> this commit is designed as below.
>
> PTE Accessed-bit Based Access Check
> -----------------------------------
>
> The implementation uses PTE Accessed-bit for basic access checks. That
> is, it clears the bit for the next sampling target page and checks whether
> it is set again after one sampling period. To avoid disturbing other
> Accessed bit users such as the reclamation logic, the implementation
> adjusts ``PG_Idle`` and ``PG_Young`` appropriately, in the same way as
> 'Idle Page Tracking'.
>
> VMA-based Target Address Range Construction
> -------------------------------------------
>
> Only small parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed. Thus, tracking
> the unmapped address regions is just wasteful. However, because DAMON
> can deal with some level of noise using the adaptive regions adjustment
> mechanism, tracking every mapping is not strictly required but could
> even incur a high overhead in some cases. That said, too huge unmapped
> areas inside the monitoring target should be removed to not take the
> time for the adaptive mechanism.
>
> For the reason, this implementation converts the complex mappings to
> three distinct regions that cover every mapped area of the address
> space. Also, the two gaps between the three regions are the two biggest
> unmapped areas in the given address space. The two biggest unmapped
> areas would be the gap between the heap and the uppermost mmap()-ed
> region, and the gap between the lowermost mmap()-ed region and the stack
> in most of the cases. Because these gaps are exceptionally huge in
> usual address spaces, excluding these will be sufficient to make a
> reasonable trade-off. Below shows this in detail::
>
> <heap>
> <BIG UNMAPPED REGION 1>
> <uppermost mmap()-ed region>
> (small mmap()-ed regions and munmap()-ed regions)
> <lowermost mmap()-ed region>
> <BIG UNMAPPED REGION 2>
> <stack>
>
> Signed-off-by: SeongJae Park <[email protected]>
> Reviewed-by: Leonard Foerster <[email protected]>
[snip]
> +
> +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> +{
> + pte_t *pte = NULL;
> + pmd_t *pmd = NULL;
> + spinlock_t *ptl;
> +
> + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> + return;
> +
> + if (pte) {
> + if (pte_young(*pte)) {
Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
DAMON's target applications?
> + clear_page_idle(pte_page(*pte));
> + set_page_young(pte_page(*pte));
> + }
> + *pte = pte_mkold(*pte);
> + pte_unmap_unlock(pte, ptl);
> + return;
> + }
> +
On Thu, 16 Jul 2020 17:46:54 -0700 Shakeel Butt <[email protected]> wrote:
> On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > This commit introduces a reference implementation of the address space
> > specific low level primitives for the virtual address space, so that
> > users of DAMON can easily monitor the data accesses on virtual address
> > spaces of specific processes by simply configuring the implementation to
> > be used by DAMON.
> >
> > The low level primitives for the fundamental access monitoring are
> > defined in two parts:
> > 1. Identification of the monitoring target address range for the address
> > space.
> > 2. Access check of specific address range in the target space.
> >
> > The reference implementation for the virtual address space provided by
> > this commit is designed as below.
> >
> > PTE Accessed-bit Based Access Check
> > -----------------------------------
> >
> > The implementation uses PTE Accessed-bit for basic access checks. That
> > is, it clears the bit for the next sampling target page and checks whether
> > it is set again after one sampling period. To avoid disturbing other
> > Accessed bit users such as the reclamation logic, the implementation
> > adjusts ``PG_Idle`` and ``PG_Young`` appropriately, in the same way as
> > 'Idle Page Tracking'.
> >
> > VMA-based Target Address Range Construction
> > -------------------------------------------
> >
> > Only small parts in the super-huge virtual address space of the
> > processes are mapped to physical memory and accessed. Thus, tracking
> > the unmapped address regions is just wasteful. However, because DAMON
> > can deal with some level of noise using the adaptive regions adjustment
> > mechanism, tracking every mapping is not strictly required but could
> > even incur a high overhead in some cases. That said, too huge unmapped
> > areas inside the monitoring target should be removed to not take the
> > time for the adaptive mechanism.
> >
> > For the reason, this implementation converts the complex mappings to
> > three distinct regions that cover every mapped area of the address
> > space. Also, the two gaps between the three regions are the two biggest
> > unmapped areas in the given address space. The two biggest unmapped
> > areas would be the gap between the heap and the uppermost mmap()-ed
> > region, and the gap between the lowermost mmap()-ed region and the stack
> > in most of the cases. Because these gaps are exceptionally huge in
> > usual address spaces, excluding these will be sufficient to make a
> > reasonable trade-off. Below shows this in detail::
> >
> > <heap>
> > <BIG UNMAPPED REGION 1>
> > <uppermost mmap()-ed region>
> > (small mmap()-ed regions and munmap()-ed regions)
> > <lowermost mmap()-ed region>
> > <BIG UNMAPPED REGION 2>
> > <stack>
> >
> > Signed-off-by: SeongJae Park <[email protected]>
> > Reviewed-by: Leonard Foerster <[email protected]>
> [snip]
> > +
> > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > +{
> > + pte_t *pte = NULL;
> > + pmd_t *pmd = NULL;
> > + spinlock_t *ptl;
> > +
> > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > + return;
> > +
> > + if (pte) {
> > + if (pte_young(*pte)) {
>
> Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
> DAMON's target applications?
Obviously my mistake, thank you for pointing this out! I will add the function
call in the next spin.
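For context, the effect of the missing call can be shown with a toy userspace model. This is only a sketch under assumptions: all 'toy_' names are hypothetical, and the single 'secondary_young' flag merely stands in for the accessed state that a secondary MMU (for example, one backing a VM) keeps and that 'mmu_notifier_clear_young()' and 'mmu_notifier_test_young()' operate on.

```c
#include <stdbool.h>

/* One page that is mapped both by a host PTE and by a secondary MMU. */
struct toy_page {
	bool pte_young;		/* Accessed bit in the host PTE */
	bool secondary_young;	/* accessed state kept by the secondary MMU */
};

/* Like the quoted damon_mkold(): only the host PTE is made old. */
static void toy_mkold_pte_only(struct toy_page *p)
{
	p->pte_young = false;
}

/* With the notifier call added: the secondary state is cleared as well. */
static void toy_mkold_with_notifier(struct toy_page *p)
{
	p->pte_young = false;
	p->secondary_young = false;
}

/* Access check that looks at the host PTE only. */
static bool toy_young_pte_only(const struct toy_page *p)
{
	return p->pte_young;
}

/* Access check that also asks the secondary MMU. */
static bool toy_young(const struct toy_page *p)
{
	return p->pte_young || p->secondary_young;
}

/* An access made through the secondary MMU, e.g., from inside a guest. */
static void toy_guest_access(struct toy_page *p)
{
	p->secondary_young = true;
}
```

In this model, an access made via 'toy_guest_access()' is invisible to 'toy_young_pte_only()', which corresponds to VM workloads looking idle when the notifier is skipped.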
Thanks,
SeongJae Park
>
> > + clear_page_idle(pte_page(*pte));
> > + set_page_young(pte_page(*pte));
> > + }
> > + *pte = pte_mkold(*pte);
> > + pte_unmap_unlock(pte, ptl);
> > + return;
> > + }
> > +
On Mon, 13 Jul 2020 19:38:05 +0200 SeongJae Park <[email protected]> wrote:
> On Mon, 13 Jul 2020 20:19:09 +0300 Mike Rapoport <[email protected]> wrote:
>
> > On Mon, Jul 13, 2020 at 02:21:43PM +0200, SeongJae Park wrote:
> > > On Mon, 13 Jul 2020 15:08:42 +0300 Mike Rapoport <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > On Mon, Jul 13, 2020 at 10:41:31AM +0200, SeongJae Park wrote:
> > > > > From: SeongJae Park <[email protected]>
> > > > >
> > > > > This commit exports 'lookup_page_ext()' to GPL modules. It will be used
> > > > > by DAMON in following commit for the implementation of the region based
> > > > > sampling.
> > > >
> > > > Maybe I'm missing something, but why is DAMON a module?
> > >
> > > I made it loadable just for easier adoption from downstream kernels. I could
> > > drop the module build support if asked.
> >
> > Well, exporting core mm symbols to modules should be considered very
> > carefully.
>
> Agreed. I will drop the module support from the next spin.
>
> >
> > Why lookup_page_ext() is required for DAMON? It is not used anywhere in
> > this patchset.
>
> It's indirectly used. In the 6th patch, DAMON uses 'set_page_young()' to not
> interfere with other PTE Accessed bit users. And, 'set_page_young()' uses
> 'lookup_page_ext()' if !CONFIG_64BIT. That's why I exported it.
This also means that it would make no sense if !64BIT && !PAGE_EXTENSION. In
the next spin, I will update the DAMON Kconfig to select PAGE_EXTENSION if
!64BIT, the same way IDLE_PAGE_TRACKING does.
Thanks,
SeongJae Park
On Thu, Jul 16, 2020 at 11:54 PM SeongJae Park <[email protected]> wrote:
>
> On Thu, 16 Jul 2020 17:46:54 -0700 Shakeel Butt <[email protected]> wrote:
>
> > On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
> > >
> > > From: SeongJae Park <[email protected]>
> > >
> > > This commit introduces a reference implementation of the address space
> > > specific low level primitives for the virtual address space, so that
> > > users of DAMON can easily monitor the data accesses on virtual address
> > > spaces of specific processes by simply configuring the implementation to
> > > be used by DAMON.
> > >
> > > The low level primitives for the fundamental access monitoring are
> > > defined in two parts:
> > > 1. Identification of the monitoring target address range for the address
> > > space.
> > > 2. Access check of specific address range in the target space.
> > >
> > > The reference implementation for the virtual address space provided by
> > > this commit is designed as below.
> > >
> > > PTE Accessed-bit Based Access Check
> > > -----------------------------------
> > >
> > > The implementation uses PTE Accessed-bit for basic access checks. That
> > > is, it clears the bit for the next sampling target page and checks whether
> > > it is set again after one sampling period. To avoid disturbing other
> > > Accessed bit users such as the reclamation logic, the implementation
> > > adjusts ``PG_Idle`` and ``PG_Young`` appropriately, in the same way as
> > > 'Idle Page Tracking'.
> > >
> > > VMA-based Target Address Range Construction
> > > -------------------------------------------
> > >
> > > Only small parts in the super-huge virtual address space of the
> > > processes are mapped to physical memory and accessed. Thus, tracking
> > > the unmapped address regions is just wasteful. However, because DAMON
> > > can deal with some level of noise using the adaptive regions adjustment
> > > mechanism, tracking every mapping is not strictly required but could
> > > even incur a high overhead in some cases. That said, too huge unmapped
> > > areas inside the monitoring target should be removed to not take the
> > > time for the adaptive mechanism.
> > >
> > > For the reason, this implementation converts the complex mappings to
> > > three distinct regions that cover every mapped area of the address
> > > space. Also, the two gaps between the three regions are the two biggest
> > > unmapped areas in the given address space. The two biggest unmapped
> > > areas would be the gap between the heap and the uppermost mmap()-ed
> > > region, and the gap between the lowermost mmap()-ed region and the stack
> > > in most of the cases. Because these gaps are exceptionally huge in
> > > usual address spaces, excluding these will be sufficient to make a
> > > reasonable trade-off. Below shows this in detail::
> > >
> > > <heap>
> > > <BIG UNMAPPED REGION 1>
> > > <uppermost mmap()-ed region>
> > > (small mmap()-ed regions and munmap()-ed regions)
> > > <lowermost mmap()-ed region>
> > > <BIG UNMAPPED REGION 2>
> > > <stack>
> > >
> > > Signed-off-by: SeongJae Park <[email protected]>
> > > Reviewed-by: Leonard Foerster <[email protected]>
> > [snip]
> > > +
> > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + pte_t *pte = NULL;
> > > + pmd_t *pmd = NULL;
> > > + spinlock_t *ptl;
> > > +
> > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > + return;
> > > +
> > > + if (pte) {
> > > + if (pte_young(*pte)) {
> >
> > Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
> > DAMON's target applications?
>
> Obviously my mistake, thank you for pointing this! I will add the function
> call in the next spin.
>
Similarly mmu_notifier_test_young() for the damon_young(). BTW I think
we can combine ctx->prepare_access_checks() and ctx->check_accesses()
into one i.e. get the young state for the previous cycle and mkold for
the next cycle in a single step.
I am wondering if there is any advantage to having "Page Idle
Tracking" beside DAMON. I think we can make them mutually exclusive.
Once we have established that I think DAMON can steal the two page
flag bits from it and can make use of them. What do you think?
On Fri, 17 Jul 2020 08:17:09 -0700 Shakeel Butt <[email protected]> wrote:
> On Thu, Jul 16, 2020 at 11:54 PM SeongJae Park <[email protected]> wrote:
> >
> > On Thu, 16 Jul 2020 17:46:54 -0700 Shakeel Butt <[email protected]> wrote:
> >
> > > On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
> > > >
> > > > From: SeongJae Park <[email protected]>
> > > >
> > > > This commit introduces a reference implementation of the address space
> > > > specific low level primitives for the virtual address space, so that
> > > > users of DAMON can easily monitor the data accesses on virtual address
> > > > spaces of specific processes by simply configuring the implementation to
> > > > be used by DAMON.
> > > >
> > > > The low level primitives for the fundamental access monitoring are
> > > > defined in two parts:
> > > > 1. Identification of the monitoring target address range for the address
> > > > space.
> > > > 2. Access check of specific address range in the target space.
> > > >
> > > > The reference implementation for the virtual address space provided by
> > > > this commit is designed as below.
> > > >
> > > > PTE Accessed-bit Based Access Check
> > > > -----------------------------------
> > > >
> > > > The implementation uses PTE Accessed-bit for basic access checks. That
> > > > is, it clears the bit for next sampling target page and checks whether
> > > > it is set again after one sampling period. To avoid disturbing other
> > > > Accessed bit users such as the reclamation logic, the implementation
> > > > adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the
> > > > 'Idle Page Tracking'.
> > > >
> > > > VMA-based Target Address Range Construction
> > > > -------------------------------------------
> > > >
> > > > Only small parts in the super-huge virtual address space of the
> > > > processes are mapped to physical memory and accessed. Thus, tracking
> > > > the unmapped address regions is just wasteful. However, because DAMON
> > > > can deal with some level of noise using the adaptive regions adjustment
> > > > mechanism, tracking every mapping is not strictly required but could
> > > > even incur a high overhead in some cases. That said, too huge unmapped
> > > > areas inside the monitoring target should be removed to not take the
> > > > time for the adaptive mechanism.
> > > >
> > > > For the reason, this implementation converts the complex mappings to
> > > > three distinct regions that cover every mapped area of the address
> > > > space. Also, the two gaps between the three regions are the two biggest
> > > > unmapped areas in the given address space. The two biggest unmapped
> > > > areas would be the gap between the heap and the uppermost mmap()-ed
> > > > region, and the gap between the lowermost mmap()-ed region and the stack
> > > > in most of the cases. Because these gaps are exceptionally huge in
> > > > usual address spaces, excluding these will be sufficient to make a
> > > > reasonable trade-off. Below shows this in detail::
> > > >
> > > > <heap>
> > > > <BIG UNMAPPED REGION 1>
> > > > <uppermost mmap()-ed region>
> > > > (small mmap()-ed regions and munmap()-ed regions)
> > > > <lowermost mmap()-ed region>
> > > > <BIG UNMAPPED REGION 2>
> > > > <stack>
> > > >
> > > > Signed-off-by: SeongJae Park <[email protected]>
> > > > Reviewed-by: Leonard Foerster <[email protected]>
> > > [snip]
> > > > +
> > > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + pte_t *pte = NULL;
> > > > + pmd_t *pmd = NULL;
> > > > + spinlock_t *ptl;
> > > > +
> > > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > > + return;
> > > > +
> > > > + if (pte) {
> > > > + if (pte_young(*pte)) {
> > >
> > > Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
> > > DAMON's target applications?
> >
> > Obviously my mistake, thank you for pointing this! I will add the function
> > call in the next spin.
> >
>
> Similarly mmu_notifier_test_young() for the damon_young().
Yes, indeed. Thanks for pointing this out, too :)
> BTW I think we can combine ctx->prepare_access_checks() and
> ctx->check_accesses() into one i.e. get the young state for the previous
> cycle and mkold for the next cycle in a single step.
Yes, we could. But I'm unsure what the advantage of doing that would be. First of
all, if a combined implementation is required, people could simply implement
the two pieces of logic in a combined way in one of the callbacks and leave the
other one blank. Also, I'm worried that combining them could make the code a
little harder to read. IMHO, separating them makes the 'kdamond_fn()' code a
little easier to read. Actually, I started from the combined approach but
have separated the two since v7, after Jonathan's comment[1].
[1] https://lore.kernel.org/linux-mm/[email protected]/
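For illustration only, the "combine them in one callback and leave the other
blank" option mentioned above could look roughly like the userspace sketch
below. This is not code from the patchset: 'struct sample', 'struct ctx',
combined_check_and_prepare(), nop() and run_one_cycle() are hypothetical names,
and the 'young' flag merely models the PTE Accessed bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sampled page; not a kernel structure. */
struct sample {
	bool young;			/* models the Accessed bit */
	unsigned int nr_accesses;
};

/* Hypothetical, simplified context holding the two callbacks. */
struct ctx {
	struct sample *samples;
	size_t nr_samples;
	void (*prepare_access_checks)(struct ctx *c);
	void (*check_accesses)(struct ctx *c);
};

/* Combined variant: read the previous cycle's result, then mark old. */
static void combined_check_and_prepare(struct ctx *c)
{
	size_t i;

	assert(c && c->samples);
	for (i = 0; i < c->nr_samples; i++) {
		if (c->samples[i].young)
			c->samples[i].nr_accesses++;
		c->samples[i].young = false;	/* prepare the next cycle */
	}
}

/* The other callback is intentionally left blank. */
static void nop(struct ctx *c)
{
	(void)c;
}

/* One sampling cycle, as a monitoring loop body might drive it. */
static void run_one_cycle(struct ctx *c)
{
	c->prepare_access_checks(c);
	/* (the sampling interval would elapse here) */
	c->check_accesses(c);
}
```

A caller would set 'prepare_access_checks' to nop() and 'check_accesses' to
combined_check_and_prepare(); the loop itself stays unchanged, which is why the
two-callback interface does not prevent a combined implementation.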
>
> I am wondering if there is any advantage to having "Page Idle
> Tracking" beside DAMON. I think we can make them mutually exclusive.
> Once we have established that I think DAMON can steal the two page
> flag bits from it and can make use of them. What do you think?
Again, yes, I think we could. But I don't see a clear advantage in it for now.
Thanks,
SeongJae Park
On Fri, Jul 17, 2020 at 9:24 AM SeongJae Park <[email protected]> wrote:
>
> On Fri, 17 Jul 2020 08:17:09 -0700 Shakeel Butt <[email protected]> wrote:
>
> > On Thu, Jul 16, 2020 at 11:54 PM SeongJae Park <[email protected]> wrote:
> > >
> > > On Thu, 16 Jul 2020 17:46:54 -0700 Shakeel Butt <[email protected]> wrote:
> > >
> > > > On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
> > > > >
> > > > > From: SeongJae Park <[email protected]>
> > > > >
> > > > > This commit introduces a reference implementation of the address space
> > > > > specific low level primitives for the virtual address space, so that
> > > > > users of DAMON can easily monitor the data accesses on virtual address
> > > > > spaces of specific processes by simply configuring the implementation to
> > > > > be used by DAMON.
> > > > >
> > > > > The low level primitives for the fundamental access monitoring are
> > > > > defined in two parts:
> > > > > 1. Identification of the monitoring target address range for the address
> > > > > space.
> > > > > 2. Access check of specific address range in the target space.
> > > > >
> > > > > The reference implementation for the virtual address space provided by
> > > > > this commit is designed as below.
> > > > >
> > > > > PTE Accessed-bit Based Access Check
> > > > > -----------------------------------
> > > > >
> > > > > The implementation uses PTE Accessed-bit for basic access checks. That
> > > > > is, it clears the bit for next sampling target page and checks whether
> > > > > it is set again after one sampling period. To avoid disturbing other
> > > > > Accessed bit users such as the reclamation logic, the implementation
> > > > > adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the
> > > > > 'Idle Page Tracking'.
> > > > >
> > > > > VMA-based Target Address Range Construction
> > > > > -------------------------------------------
> > > > >
> > > > > Only small parts in the super-huge virtual address space of the
> > > > > processes are mapped to physical memory and accessed. Thus, tracking
> > > > > the unmapped address regions is just wasteful. However, because DAMON
> > > > > can deal with some level of noise using the adaptive regions adjustment
> > > > > mechanism, tracking every mapping is not strictly required but could
> > > > > even incur a high overhead in some cases. That said, too huge unmapped
> > > > > areas inside the monitoring target should be removed to not take the
> > > > > time for the adaptive mechanism.
> > > > >
> > > > > For the reason, this implementation converts the complex mappings to
> > > > > three distinct regions that cover every mapped area of the address
> > > > > space. Also, the two gaps between the three regions are the two biggest
> > > > > unmapped areas in the given address space. The two biggest unmapped
> > > > > areas would be the gap between the heap and the uppermost mmap()-ed
> > > > > region, and the gap between the lowermost mmap()-ed region and the stack
> > > > > in most of the cases. Because these gaps are exceptionally huge in
> > > > > usual address spaces, excluding these will be sufficient to make a
> > > > > reasonable trade-off. Below shows this in detail::
> > > > >
> > > > > <heap>
> > > > > <BIG UNMAPPED REGION 1>
> > > > > <uppermost mmap()-ed region>
> > > > > (small mmap()-ed regions and munmap()-ed regions)
> > > > > <lowermost mmap()-ed region>
> > > > > <BIG UNMAPPED REGION 2>
> > > > > <stack>
> > > > >
> > > > > Signed-off-by: SeongJae Park <[email protected]>
> > > > > Reviewed-by: Leonard Foerster <[email protected]>
> > > > [snip]
> > > > > +
> > > > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > > > +{
> > > > > + pte_t *pte = NULL;
> > > > > + pmd_t *pmd = NULL;
> > > > > + spinlock_t *ptl;
> > > > > +
> > > > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > > > + return;
> > > > > +
> > > > > + if (pte) {
> > > > > + if (pte_young(*pte)) {
> > > >
> > > > Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
> > > > DAMON's target applications?
> > >
> > > Obviously my mistake, thank you for pointing this! I will add the function
> > > call in the next spin.
> > >
> >
> > Similarly mmu_notifier_test_young() for the damon_young().
>
> Yes, indeed. Thanks for pointing this, either :)
>
> > BTW I think we can combine ctx->prepare_access_checks() and
> > ctx->check_accesses() into one i.e. get the young state for the previous
> > cycle and mkold for the next cycle in a single step.
>
> Yes, we could. But, I'm unsure what is the advantage of doing that. First of
> all, if the combined implementation is required, people could simply implement
> the two logics in the combined way in one of the callbacks and leave the other
> one blank. Also, I'm worrying if combining those could make the code a little
> bit hard to read. IMHO, I think separating those makes the 'kdamond_fn()' code
> little bit easier to read. Actually, I started from the combined approach but
> separated the two logics since v7 after Jonathan's comment[1].
>
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
>
> >
> > I am wondering if there is any advantage to having "Page Idle
> > Tracking" beside DAMON. I think we can make them mutually exclusive.
> > Once we have established that I think DAMON can steal the two page
> > flag bits from it and can make use of them. What do you think?
>
> Again, yes, I think we could. But I don't see clear advantage of it for now.
>
>
Hmm, I will think more about it. Somehow I feel if we want to monitor
at the page sized region granularity then this will be really helpful.
Anyways, it needs more brainstorming.
BTW I am still going over the series and my humble request would be to
wait till I have gone through the series completely and provided the
feedback then you can send the next version after incorporating the
feedback.
Shakeel
On Mon, Jul 13, 2020 at 1:43 AM SeongJae Park <[email protected]> wrote:
>
> From: SeongJae Park <[email protected]>
>
> DAMON is a data access monitoring framework subsystem for the Linux
> kernel. The core mechanisms of DAMON make it
>
> - accurate (the monitoring output is useful enough for DRAM level
> memory management; it might not be appropriate for CPU cache levels,
> though),
> - light-weight (the monitoring overhead is low enough to be applied
> online), and
> - scalable (the upper-bound of the overhead is in constant range
> regardless of the size of target workloads).
>
> Using this framework, therefore, the kernel's memory management
> mechanisms can make advanced decisions. Experimental memory management
> optimization works that incurred high data access monitoring overhead
> could be implemented again. In user space, meanwhile, users who have some
> special workloads can write personalized applications for better
> understanding and optimizations of their workloads and systems.
>
> This commit is implementing only the stub for the module load/unload,
> basic data structures, and simple manipulation functions of the
> structures to keep the size of commit small. The core mechanisms of
> DAMON will be implemented one by one by following commits.
>
> Signed-off-by: SeongJae Park <[email protected]>
> Reviewed-by: Leonard Foerster <[email protected]>
> Reviewed-by: Varad Gautam <[email protected]>
> ---
> include/linux/damon.h | 63 ++++++++++++++
> mm/Kconfig | 12 +++
> mm/Makefile | 1 +
> mm/damon.c | 188 ++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 264 insertions(+)
> create mode 100644 include/linux/damon.h
> create mode 100644 mm/damon.c
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> new file mode 100644
> index 000000000000..c8f8c1c41a45
> --- /dev/null
> +++ b/include/linux/damon.h
> @@ -0,0 +1,63 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * DAMON api
> + *
> + * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
> + *
> + * Author: SeongJae Park <[email protected]>
> + */
> +
> +#ifndef _DAMON_H_
> +#define _DAMON_H_
> +
> +#include <linux/random.h>
> +#include <linux/types.h>
> +
> +/**
> + * struct damon_addr_range - Represents an address region of [@start, @end).
> + * @start: Start address of the region (inclusive).
> + * @end: End address of the region (exclusive).
> + */
> +struct damon_addr_range {
> + unsigned long start;
> + unsigned long end;
> +};
> +
> +/**
> + * struct damon_region - Represents a monitoring target region.
> + * @ar: The address range of the region.
> + * @sampling_addr: Address of the sample for the next access check.
> + * @nr_accesses: Access frequency of this region.
> + * @list: List head for siblings.
> + */
> +struct damon_region {
> + struct damon_addr_range ar;
> + unsigned long sampling_addr;
> + unsigned int nr_accesses;
> + struct list_head list;
> +};
> +
> +/**
> + * struct damon_task - Represents a monitoring target task.
> + * @pid: Process id of the task.
> + * @regions_list: Head of the monitoring target regions of this task.
> + * @list: List head for siblings.
> + *
> + * If the monitoring target address space is task independent (e.g., physical
> + * memory address space monitoring), @pid should be '-1'.
> + */
> +struct damon_task {
> + int pid;
Storing and accessing pid like this is racy. Why not save the "struct
pid" after getting the reference? I am still going over the usage,
maybe storing mm_struct would be an even better choice.
> + struct list_head regions_list;
> + struct list_head list;
> +};
> +
> +/**
> + * struct damon_ctx - Represents a context for each monitoring.
> + * @tasks_list: Head of monitoring target tasks (&damon_task) list.
> + */
> +struct damon_ctx {
> + struct list_head tasks_list; /* 'damon_task' objects */
> +};
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c35..464e9594dcec 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -867,4 +867,16 @@ config ARCH_HAS_HUGEPD
> config MAPPING_DIRTY_HELPERS
> bool
>
> +config DAMON
> + tristate "Data Access Monitor"
> + depends on MMU
> + help
> + This feature allows to monitor access frequency of each memory
> + region. The information can be useful for performance-centric DRAM
> + level memory management.
> +
> + See https://damonitor.github.io/doc/html/latest-damon/index.html for
> + more information.
> + If unsure, say N.
> +
> endmenu
> diff --git a/mm/Makefile b/mm/Makefile
> index fccd3756b25f..230e545b6e07 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -112,3 +112,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
> obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
> +obj-$(CONFIG_DAMON) += damon.o
> diff --git a/mm/damon.c b/mm/damon.c
> new file mode 100644
> index 000000000000..5ab13b1c15cf
> --- /dev/null
> +++ b/mm/damon.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Access Monitor
> + *
> + * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
> + *
> + * Author: SeongJae Park <[email protected]>
> + *
> + * This file is constructed in below parts.
> + *
> + * - Functions and macros for DAMON data structures
> + * - Functions for the module loading/unloading
> + *
> + * The core parts are not implemented yet.
> + */
> +
> +#define pr_fmt(fmt) "damon: " fmt
> +
> +#include <linux/damon.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +
> +/*
> + * Functions and macros for DAMON data structures
> + */
> +
> +#define damon_get_task_struct(t) \
> + (get_pid_task(find_vpid(t->pid), PIDTYPE_PID))
You need at least rcu lock around find_vpid(). Also you need to be
careful about the context. If you accept my previous suggestion then
you just need to do this in the process context which is registering
the pid (no need to worry about the pid namespace).
I am wondering if there should be an interface to register processes
with DAMON using pidfd instead of integer pid.
> +
> +#define damon_next_region(r) \
> + (container_of(r->list.next, struct damon_region, list))
> +
> +#define damon_prev_region(r) \
> + (container_of(r->list.prev, struct damon_region, list))
> +
> +#define damon_for_each_region(r, t) \
> + list_for_each_entry(r, &t->regions_list, list)
> +
> +#define damon_for_each_region_safe(r, next, t) \
> + list_for_each_entry_safe(r, next, &t->regions_list, list)
> +
> +#define damon_for_each_task(t, ctx) \
> + list_for_each_entry(t, &(ctx)->tasks_list, list)
> +
> +#define damon_for_each_task_safe(t, next, ctx) \
> + list_for_each_entry_safe(t, next, &(ctx)->tasks_list, list)
> +
> +/* Get a random number in [l, r) */
> +#define damon_rand(l, r) (l + prandom_u32() % (r - l))
> +
> +/*
> + * Construct a damon_region struct
> + *
> + * Returns the pointer to the new struct if success, or NULL otherwise
> + */
> +static struct damon_region *damon_new_region(unsigned long start,
> + unsigned long end)
> +{
> + struct damon_region *region;
> +
> + region = kmalloc(sizeof(*region), GFP_KERNEL);
> + if (!region)
> + return NULL;
> +
> + region->ar.start = start;
> + region->ar.end = end;
> + region->nr_accesses = 0;
> + INIT_LIST_HEAD(&region->list);
> +
> + return region;
> +}
> +
> +/*
> + * Add a region between two other regions
> + */
> +static inline void damon_insert_region(struct damon_region *r,
> + struct damon_region *prev, struct damon_region *next)
> +{
> + __list_add(&r->list, &prev->list, &next->list);
> +}
> +
> +static void damon_add_region(struct damon_region *r, struct damon_task *t)
> +{
> + list_add_tail(&r->list, &t->regions_list);
> +}
> +
> +static void damon_del_region(struct damon_region *r)
> +{
> + list_del(&r->list);
> +}
> +
> +static void damon_free_region(struct damon_region *r)
> +{
> + kfree(r);
> +}
> +
> +static void damon_destroy_region(struct damon_region *r)
> +{
> + damon_del_region(r);
> + damon_free_region(r);
> +}
> +
> +/*
> + * Construct a damon_task struct
> + *
> + * Returns the pointer to the new struct if success, or NULL otherwise
> + */
> +static struct damon_task *damon_new_task(int pid)
> +{
> + struct damon_task *t;
> +
> + t = kmalloc(sizeof(*t), GFP_KERNEL);
> + if (!t)
> + return NULL;
> +
> + t->pid = pid;
> + INIT_LIST_HEAD(&t->regions_list);
> +
> + return t;
> +}
> +
> +static void damon_add_task(struct damon_ctx *ctx, struct damon_task *t)
> +{
> + list_add_tail(&t->list, &ctx->tasks_list);
> +}
> +
> +static void damon_del_task(struct damon_task *t)
> +{
> + list_del(&t->list);
> +}
> +
> +static void damon_free_task(struct damon_task *t)
> +{
> + struct damon_region *r, *next;
> +
> + damon_for_each_region_safe(r, next, t)
> + damon_free_region(r);
> + kfree(t);
> +}
> +
> +static void damon_destroy_task(struct damon_task *t)
> +{
> + damon_del_task(t);
> + damon_free_task(t);
> +}
> +
> +static unsigned int nr_damon_tasks(struct damon_ctx *ctx)
> +{
> + struct damon_task *t;
> + unsigned int nr_tasks = 0;
> +
> + damon_for_each_task(t, ctx)
> + nr_tasks++;
> +
> + return nr_tasks;
> +}
> +
> +static unsigned int nr_damon_regions(struct damon_task *t)
> +{
> + struct damon_region *r;
> + unsigned int nr_regions = 0;
> +
> + damon_for_each_region(r, t)
> + nr_regions++;
> +
> + return nr_regions;
> +}
> +
> +/*
> + * Functions for the module loading/unloading
> + */
> +
> +static int __init damon_init(void)
> +{
> + return 0;
> +}
> +
> +static void __exit damon_exit(void)
> +{
> +}
> +
> +module_init(damon_init);
> +module_exit(damon_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("SeongJae Park <[email protected]>");
> +MODULE_DESCRIPTION("DAMON: Data Access MONitor");
> --
> 2.17.1
>
On Fri, 17 Jul 2020 19:23:28 -0700 Shakeel Butt <[email protected]> wrote:
> On Fri, Jul 17, 2020 at 9:24 AM SeongJae Park <[email protected]> wrote:
> >
> > On Fri, 17 Jul 2020 08:17:09 -0700 Shakeel Butt <[email protected]> wrote:
> >
> > > On Thu, Jul 16, 2020 at 11:54 PM SeongJae Park <[email protected]> wrote:
> > > >
> > > > On Thu, 16 Jul 2020 17:46:54 -0700 Shakeel Butt <[email protected]> wrote:
> > > >
> > > > > On Mon, Jul 13, 2020 at 1:44 AM SeongJae Park <[email protected]> wrote:
> > > > > >
> > > > > > From: SeongJae Park <[email protected]>
> > > > > >
> > > > > > This commit introduces a reference implementation of the address space
> > > > > > specific low level primitives for the virtual address space, so that
> > > > > > users of DAMON can easily monitor the data accesses on virtual address
> > > > > > spaces of specific processes by simply configuring the implementation to
> > > > > > be used by DAMON.
> > > > > >
> > > > > > The low level primitives for the fundamental access monitoring are
> > > > > > defined in two parts:
> > > > > > 1. Identification of the monitoring target address range for the address
> > > > > > space.
> > > > > > 2. Access check of specific address range in the target space.
> > > > > >
> > > > > > The reference implementation for the virtual address space provided by
> > > > > > this commit is designed as below.
> > > > > >
> > > > > > PTE Accessed-bit Based Access Check
> > > > > > -----------------------------------
> > > > > >
> > > > > > The implementation uses PTE Accessed-bit for basic access checks. That
> > > > > > is, it clears the bit for next sampling target page and checks whether
> > > > > > it is set again after one sampling period. To avoid disturbing other
> > > > > > Accessed bit users such as the reclamation logic, the implementation
> > > > > > adjusts the ``PG_Idle`` and ``PG_Young`` appropriately, as same to the
> > > > > > 'Idle Page Tracking'.
> > > > > >
> > > > > > VMA-based Target Address Range Construction
> > > > > > -------------------------------------------
> > > > > >
> > > > > > Only small parts in the super-huge virtual address space of the
> > > > > > processes are mapped to physical memory and accessed. Thus, tracking
> > > > > > the unmapped address regions is just wasteful. However, because DAMON
> > > > > > can deal with some level of noise using the adaptive regions adjustment
> > > > > > mechanism, tracking every mapping is not strictly required but could
> > > > > > even incur a high overhead in some cases. That said, too huge unmapped
> > > > > > areas inside the monitoring target should be removed to not take the
> > > > > > time for the adaptive mechanism.
> > > > > >
> > > > > > For the reason, this implementation converts the complex mappings to
> > > > > > three distinct regions that cover every mapped area of the address
> > > > > > space. Also, the two gaps between the three regions are the two biggest
> > > > > > unmapped areas in the given address space. The two biggest unmapped
> > > > > > areas would be the gap between the heap and the uppermost mmap()-ed
> > > > > > region, and the gap between the lowermost mmap()-ed region and the stack
> > > > > > in most of the cases. Because these gaps are exceptionally huge in
> > > > > > usual address spaces, excluding these will be sufficient to make a
> > > > > > reasonable trade-off. Below shows this in detail::
> > > > > >
> > > > > > <heap>
> > > > > > <BIG UNMAPPED REGION 1>
> > > > > > <uppermost mmap()-ed region>
> > > > > > (small mmap()-ed regions and munmap()-ed regions)
> > > > > > <lowermost mmap()-ed region>
> > > > > > <BIG UNMAPPED REGION 2>
> > > > > > <stack>
> > > > > >
> > > > > > Signed-off-by: SeongJae Park <[email protected]>
> > > > > > Reviewed-by: Leonard Foerster <[email protected]>
> > > > > [snip]
> > > > > > +
> > > > > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > > > > +{
> > > > > > + pte_t *pte = NULL;
> > > > > > + pmd_t *pmd = NULL;
> > > > > > + spinlock_t *ptl;
> > > > > > +
> > > > > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > > > > + return;
> > > > > > +
> > > > > > + if (pte) {
> > > > > > + if (pte_young(*pte)) {
> > > > >
> > > > > Any reason for skipping mmu_notifier_clear_young()? Why exclude VMs as
> > > > > DAMON's target applications?
> > > >
> > > > Obviously my mistake, thank you for pointing this! I will add the function
> > > > call in the next spin.
> > > >
> > >
> > > Similarly mmu_notifier_test_young() for the damon_young().
> >
> > Yes, indeed. Thanks for pointing this, either :)
> >
> > > BTW I think we can combine ctx->prepare_access_checks() and
> > > ctx->check_accesses() into one i.e. get the young state for the previous
> > > cycle and mkold for the next cycle in a single step.
> >
> > Yes, we could. But, I'm unsure what is the advantage of doing that. First of
> > all, if the combined implementation is required, people could simply implement
> > the two logics in the combined way in one of the callbacks and leave the other
> > one blank. Also, I'm worrying if combining those could make the code a little
> > bit hard to read. IMHO, I think separating those makes the 'kdamond_fn()' code
> > little bit easier to read. Actually, I started from the combined approach but
> > separated the two logics since v7 after Jonathan's comment[1].
> >
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> >
> > >
> > > I am wondering if there is any advantage to having "Page Idle
> > > Tracking" beside DAMON. I think we can make them mutually exclusive.
> > > Once we have established that I think DAMON can steal the two page
> > > flag bits from it and can make use of them. What do you think?
> >
> > Again, yes, I think we could. But I don't see clear advantage of it for now.
> >
> >
>
> Hmm, I will think more about it. Somehow I feel if we want to monitor
> at the page sized region granularity then this will be really helpful.
> Anyways, it needs more brainstorming.
Ok, I will also think about it from that perspective.
>
> BTW I am still going over the series and my humble request would be to
> wait till I have gone through the series completely and provided the
> feedback then you can send the next version after incorporating the
> feedback.
No problem, just let me know when you have finished. I appreciate your review :)
Thanks,
SeongJae Park
>
> Shakeel
>
On Fri, 17 Jul 2020 19:47:50 -0700 Shakeel Butt <[email protected]> wrote:
> On Mon, Jul 13, 2020 at 1:43 AM SeongJae Park <[email protected]> wrote:
> >
> > From: SeongJae Park <[email protected]>
> >
> > DAMON is a data access monitoring framework subsystem for the Linux
> > kernel. The core mechanisms of DAMON make it
> >
> > - accurate (the monitoring output is useful enough for DRAM level
> > memory management; it might not be appropriate for CPU cache levels,
> > though),
> > - light-weight (the monitoring overhead is low enough to be applied
> > online), and
> > - scalable (the upper-bound of the overhead is in constant range
> > regardless of the size of target workloads).
> >
> > Using this framework, therefore, the kernel's memory management
> > mechanisms can make advanced decisions. Experimental memory management
> > optimization works that incurred high data access monitoring overhead
> > could be implemented again. In user space, meanwhile, users who have some
> > special workloads can write personalized applications for better
> > understanding and optimizations of their workloads and systems.
> >
> > This commit is implementing only the stub for the module load/unload,
> > basic data structures, and simple manipulation functions of the
> > structures to keep the size of commit small. The core mechanisms of
> > DAMON will be implemented one by one by following commits.
> >
> > Signed-off-by: SeongJae Park <[email protected]>
> > Reviewed-by: Leonard Foerster <[email protected]>
> > Reviewed-by: Varad Gautam <[email protected]>
> > ---
> > include/linux/damon.h | 63 ++++++++++++++
> > mm/Kconfig | 12 +++
> > mm/Makefile | 1 +
> > mm/damon.c | 188 ++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 264 insertions(+)
> > create mode 100644 include/linux/damon.h
> > create mode 100644 mm/damon.c
> >
> > diff --git a/include/linux/damon.h b/include/linux/damon.h
> > new file mode 100644
> > index 000000000000..c8f8c1c41a45
> > --- /dev/null
> > +++ b/include/linux/damon.h
> > @@ -0,0 +1,63 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * DAMON api
> > + *
> > + * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
> > + *
> > + * Author: SeongJae Park <[email protected]>
> > + */
> > +
[...]
> > +
> > +/**
> > + * struct damon_task - Represents a monitoring target task.
> > + * @pid: Process id of the task.
> > + * @regions_list: Head of the monitoring target regions of this task.
> > + * @list: List head for siblings.
> > + *
> > + * If the monitoring target address space is task independent (e.g., physical
> > + * memory address space monitoring), @pid should be '-1'.
> > + */
> > +struct damon_task {
> > + int pid;
>
> Storing and accessing pid like this is racy. Why not save the "struct
> pid" after getting the reference? I am still going over the usage,
> maybe storing mm_struct would be an even better choice.
>
> > + struct list_head regions_list;
> > + struct list_head list;
> > +};
> > +
[...]
> > +
> > +#define damon_get_task_struct(t) \
> > + (get_pid_task(find_vpid(t->pid), PIDTYPE_PID))
>
> You need at least rcu lock around find_vpid(). Also you need to be
> careful about the context. If you accept my previous suggestion then
> you just need to do this in the process context which is registering
> the pid (no need to worry about the pid namespace).
>
> I am wondering if there should be an interface to register processes
> with DAMON using pidfd instead of integer pid.
Good points! I will use pidfd for this purpose, instead.
BTW, 'struct damon_task' was introduced while DAMON supported only virtual
address spaces, and was recently extended to support physical memory address
monitoring by defining an exceptional pid (-1) for that case. I think this
doesn't fit smoothly with the design.
Therefore, I would like to replace it with a more generally named struct, e.g.,
struct damon_target {
void *id;
struct list_head regions_list;
struct list_head list;
};
The 'id' field will be able to store or point to a pid_t, struct mm_struct, struct
pid, or anything relevant, depending on the target address space.
Only one part of the address space independent logic of DAMON, namely
'kdamon_need_stop()', uses '->pid' of the 'struct damon_task'. It will be
introduced by the next patch ("mm/damon: Implement region based sampling").
Therefore, the conversion will be easy. For that part, I could add another
callback, e.g.,
struct damon_ctx {
[...]
bool (*is_target_valid)(struct damon_target *t);
};
And let the address space specific primitives implement this.
Then, damon_get_task_struct() and damon_get_mm() will be introduced by the
sixth patch ("mm/damon: Implement callbacks for the virtual memory address
spaces") as a part of the virtual address space specific primitives
implementation.
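To make the proposal above a bit more concrete, below is a minimal user-space
sketch of how the callback could decouple the address space independent stop
check from the meaning of 'id'. Everything in it is an assumption for
illustration only: the list heads are replaced by a plain array, and
'vm_target_valid()', 'pa_target_valid()', and 'need_stop()' are hypothetical
names, not functions from the patchset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mock of the proposed struct; list heads are omitted. */
struct damon_target {
	void *id;	/* pid, struct pid, mm_struct, ... per address space */
};

struct damon_ctx {
	struct damon_target *targets;	/* assumption: array instead of a list */
	size_t nr_targets;
	/* Implemented by the address space specific primitives. */
	bool (*is_target_valid)(struct damon_target *t);
};

/*
 * Hypothetical virtual address space primitive.  Here 'id' points to an int
 * flag that tells whether the target task is still alive.
 */
static bool vm_target_valid(struct damon_target *t)
{
	return t->id && *(int *)t->id;
}

/* Hypothetical physical address space primitive: targets never go away. */
static bool pa_target_valid(struct damon_target *t)
{
	(void)t;
	return true;
}

/* Address space independent: stop only if no valid target remains. */
static bool need_stop(struct damon_ctx *ctx)
{
	size_t i;

	for (i = 0; i < ctx->nr_targets; i++) {
		if (ctx->is_target_valid(&ctx->targets[i]))
			return false;
	}
	return true;
}
```

With this shape, the core logic never interprets 'id' itself, so the
exceptional pid (-1) is no longer needed for the physical address space case.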
I'm going to make the change in the next spin. If you have any opinions on
this, please let me know.
Thanks,
SeongJae Park
On Mon, 13 Jul 2020 10:41:39 +0200 SeongJae Park <[email protected]> wrote:
> From: SeongJae Park <[email protected]>
>
> This commit implements a debugfs interface for DAMON. It works for the
> virtual address spaces monitoring.
>
[...]
> +/*
> + * Converts a string into an array of unsigned long integers
> + *
> + * Returns an array of unsigned long integers if the conversion success, or
> + * NULL otherwise.
> + */
> +static int *str_to_pids(const char *str, ssize_t len, ssize_t *nr_pids)
> +{
> + int *pids;
> + const int max_nr_pids = 32;
> + int pid;
> + int pos = 0, parsed, ret;
> +
> + *nr_pids = 0;
> + pids = kmalloc_array(max_nr_pids, sizeof(pid), GFP_KERNEL);
> + if (!pids)
> + return NULL;
> + while (*nr_pids < max_nr_pids && pos < len) {
> + ret = sscanf(&str[pos], "%d%n", &pid, &parsed);
> + pos += parsed;
> + if (ret != 1)
> + break;
> + pids[*nr_pids] = pid;
> + *nr_pids += 1;
> + }
> + if (*nr_pids == 0) {
> + kfree(pids);
> + pids = NULL;
> + }
Hmm, this means debugfs users cannot make 'target_ids' empty again. I will fix
this in the next spin.
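One possible direction, shown only as a user-space sketch and not as the
actual fix: return a valid (possibly empty) array and reserve NULL for
allocation failure, so that the caller can tell "no pids" apart from an error.
The function name, the use of malloc(), and the assumption that 'str' is
NUL-terminated are mine, for illustration only. The sketch also checks the
sscanf() result before using 'parsed', because '%n' is not stored when the
conversion fails.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

#define MAX_NR_PIDS 32

/*
 * Parses whitespace-separated integers from the NUL-terminated 'str', looking
 * at no more than 'len' bytes as starting positions.
 *
 * Returns a malloc()-ed array and stores the number of parsed pids in
 * '*nr_pids'.  An empty input yields a valid array with '*nr_pids == 0'.
 * NULL is returned only if the allocation failed.
 */
static int *str_to_pids_sketch(const char *str, ssize_t len, ssize_t *nr_pids)
{
	int *pids;
	int pid, parsed;
	ssize_t pos = 0;

	*nr_pids = 0;
	pids = malloc(MAX_NR_PIDS * sizeof(*pids));
	if (!pids)
		return NULL;
	while (*nr_pids < MAX_NR_PIDS && pos < len) {
		/* 'parsed' is valid only if the conversion succeeded. */
		if (sscanf(&str[pos], "%d%n", &pid, &parsed) != 1)
			break;
		pos += parsed;
		pids[(*nr_pids)++] = pid;
	}
	return pids;
}
```

The caller would then clear the target list when '*nr_pids' is zero instead of
treating it as a failure.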
Thanks,
SeongJae Park
SeongJae Park <[email protected]> wrote:
> From: SeongJae Park <[email protected]>
>
> This commit adds documents for DAMON under
> `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
>
> Signed-off-by: SeongJae Park <[email protected]>
> ---
> Documentation/admin-guide/mm/damon/guide.rst | 157 ++++++++++
> Documentation/admin-guide/mm/damon/index.rst | 15 +
> Documentation/admin-guide/mm/damon/plans.rst | 29 ++
> Documentation/admin-guide/mm/damon/start.rst | 98 ++++++
> Documentation/admin-guide/mm/damon/usage.rst | 298 +++++++++++++++++++
> Documentation/admin-guide/mm/index.rst | 1 +
> Documentation/vm/damon/api.rst | 20 ++
> Documentation/vm/damon/eval.rst | 222 ++++++++++++++
> Documentation/vm/damon/faq.rst | 59 ++++
> Documentation/vm/damon/index.rst | 32 ++
> Documentation/vm/damon/mechanisms.rst | 165 ++++++++++
> Documentation/vm/index.rst | 1 +
> 12 files changed, 1097 insertions(+)
> create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
> create mode 100644 Documentation/admin-guide/mm/damon/index.rst
> create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
> create mode 100644 Documentation/admin-guide/mm/damon/start.rst
> create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
> create mode 100644 Documentation/vm/damon/api.rst
> create mode 100644 Documentation/vm/damon/eval.rst
> create mode 100644 Documentation/vm/damon/faq.rst
> create mode 100644 Documentation/vm/damon/index.rst
> create mode 100644 Documentation/vm/damon/mechanisms.rst
>
> diff --git a/Documentation/admin-guide/mm/damon/guide.rst b/Documentation/admin-guide/mm/damon/guide.rst
> new file mode 100644
> index 000000000000..c51fb843efaa
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/guide.rst
> @@ -0,0 +1,157 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==================
> +Optimization Guide
> +==================
> +
> +This document helps you estimate the amount of benefit that you could get
> +from DAMON-based optimizations, and describes how you could achieve it. You
> +are assumed to have already read :doc:`start`.
> +
> +
> +Check The Signs
> +===============
> +
> +No optimization can provide the same extent of benefit in every case.
> +Therefore you should first estimate how much improvement you could get using
> +DAMON. If some of the conditions below match your situation, you could
> +consider using DAMON.
> +
> +- *Low IPC and High Cache Miss Ratios.* Low IPC means most of the CPU time is
> + spent waiting for the completion of time-consuming operations such as memory
> + access, while high cache miss ratios mean the caches don't help it well.
> + DAMON is not for cache level optimization, but DRAM level. However,
> + improving DRAM management will also help this case by reducing the memory
> + operation latency.
> +- *Memory Over-commitment and Unknown Users.* If you are doing memory
> + overcommitment and you cannot control every user of your system, a memory
> + bank run could happen at any time. You can estimate when it will happen
> + based on DAMON's monitoring results and act earlier to avoid or deal better
> + with the crisis.
> +- *Frequent Memory Pressure.* Frequent memory pressure means your system has
> + wrong configurations or memory hogs. DAMON will help you find the right
> + configuration and/or the criminals.
> +- *Heterogeneous Memory System.* If your system is utilizing memory devices
> + that are placed between DRAM and traditional hard disks, such as non-volatile
> + memory or fast SSDs, DAMON could help you utilize the devices more
> + efficiently.
> +
> +
> +Profile
> +=======
> +
> +If you found some positive signals, you could start by profiling your workloads
> +using DAMON. Find major workloads on your systems and analyze their data
> +access patterns to find something that is wrong or can be improved. The DAMON
> +user space tool (``damo``) will be useful for this.
> +
> +We recommend that you start with a working set size distribution check using
> +``damo report wss``. If the distribution is non-uniform or quite different from
> +what you estimated, you could consider `Memory Configuration`_ optimization.
> +
> +Then, review the overall access pattern in heatmap form using ``damo report
> +heats``. If it shows a simple pattern consisting of a small number of memory
> +regions with high contrast in access temperature, you could consider manual
> +`Program Modification`_.
> +
> +If you still want to absorb more benefits, you should develop `Personalized
> +DAMON Application`_ for your special case.
> +
> +You don't need to take only one approach among the above plans, but you could
> +use multiple of the above approaches to maximize the benefit.
> +
> +
> +Optimize
> +========
> +
> +If the profiling result also says it's worth trying some optimization, you
> +could consider the approaches below. Note that some of them assume that your
> +systems are configured with swap devices or other types of auxiliary memory so
> +that you aren't strictly required to accommodate the whole working set in the
> +main memory. Most of the detailed optimization should be based on your
> +concrete understanding of your memory devices.
> +
> +
> +Memory Configuration
> +--------------------
> +
> +DRAM should be just large enough to accommodate the important working sets, no
> +more and no less, because DRAM is highly performance critical but expensive and
> +power hungry. However, knowing the size of the really important working sets
> +is difficult. As a consequence, people usually equip unnecessarily large or
> +too small DRAM. Many problems stem from such wrong configurations.
> +
> +Using the working set size distribution report provided by ``damo report wss``,
> +you can find the appropriate DRAM size for you. For example, roughly speaking,
> +if you worry about only the 95th percentile latency, you don't need to equip
> +DRAM larger than the 95th percentile working set size.
> +
> +Let's see a real example. This `page
> +<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#memory-configuration>`_
> +shows the heatmap and the working set size distributions/changes of
> +``freqmine`` workload in PARSEC3 benchmark suite. The working set size spikes
> +up to 180 MiB, but stays smaller than 50 MiB for more than 95% of the time.
> +Even if you give only 50 MiB of memory space to the workload, it will work
> +well for 95% of the time. Meanwhile, you can save 130 MiB of memory space.
> +
> +
> +Program Modification
> +--------------------
> +
> +If the data access pattern heatmap plotted by ``damo report heats`` is simple
> +enough that you can understand what is going on in the workload by eye, you
> +could manually optimize the memory management.
> +
> +For example, suppose that the workload has two big memory objects but only one
> +object is frequently accessed while the other one is only occasionally
> +accessed. Then, you could modify the program source code to keep the hot
> +object in the main memory by invoking ``mlock()`` or ``madvise()`` with
> +``MADV_WILLNEED``. Or, you could proactively evict the cold object using
> +``madvise()`` with ``MADV_COLD`` or ``MADV_PAGEOUT``. Using both together
> +would also be worthwhile.
> +
> +A research work [1]_ using the ``mlock()`` achieved up to 2.55x performance
> +speedup.
> +
> +Let's see another realistic example access pattern for this kind of
> +optimizations. This `page
> +<https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/guide.html#program-modification>`_
> +shows the visualized access patterns of streamcluster workload in PARSEC3
> +benchmark suite. We can easily identify the 100 MiB sized hot object.
> +
> +
> +Personalized DAMON Application
> +------------------------------
> +
> +The above approaches will work well for many general cases, but would not be
> +enough for some special cases.
> +
> +If this is the case, it might be the time to forget the comfortable use of the
> +user space tool and dive into the debugfs interface (refer to :doc:`usage` for
> +the detail) of DAMON. Using the interface, you can control the DAMON more
> +flexibly. Therefore, you can write your personalized DAMON application that
> +controls the monitoring via the debugfs interface, analyzes the result, and
> +applies complex optimizations itself. Using this, you can make more creative
> +and wise optimizations.
> +
> +If you are a kernel space programmer, writing kernel space DAMON applications
> +using the API (refer to the :doc:`/vm/damon/api` for more detail) would be an
> +option.
> +
> +
> +Reference Practices
> +===================
> +
> +Referring to previous successful practices could help you get a sense for
> +this kind of optimization. There is an academic paper [1]_
> +reporting the visualized access pattern and manual `Program
> +Modification`_ results for a number of realistic workloads. You can also get
> +the visualized access patterns [3]_ [4]_ [5]_ and automated DAMON-based memory
> +operations results for other realistic workloads that were collected with the
> +latest version of DAMON [2]_ .
> +
> +.. [1] https://dl.acm.org/doi/10.1145/3366626.3368125
> +.. [2] https://damonitor.github.io/test/result/perf/latest/html/
> +.. [3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
> +.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
> +.. [5] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
> diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst
> new file mode 100644
> index 000000000000..0baae7a5402b
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/index.rst
> @@ -0,0 +1,15 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +========================
> +Monitoring Data Accesses
> +========================
> +
> +:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
> +Using this, users can analyze and optimize their systems.
> +
> +.. toctree::
> + :maxdepth: 2
> +
> + start
> + guide
> + usage
> diff --git a/Documentation/admin-guide/mm/damon/plans.rst b/Documentation/admin-guide/mm/damon/plans.rst
> new file mode 100644
> index 000000000000..e3aa5ab96c29
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/plans.rst
> @@ -0,0 +1,29 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +============
> +Future Plans
> +============
> +
> +DAMON is still in its first stage. The plans below are still under development.
> +
> +
> +Automate Data Access Monitoring-based Memory Operation Schemes Execution
> +========================================================================
> +
> +The ultimate goal of DAMON is to be used as a building block for data
> +access pattern aware kernel memory management optimization. It will make
> +the system just work efficiently. However, some users having very special
> +workloads will want to further do their own optimization. DAMON will automate
> +most of the tasks for such manual optimizations in the near future. Users will
> +be required only to describe what kind of data access pattern-based operation
> +schemes they want in a simple form.
> +
> +By applying a very simple scheme for THP promotion/demotion with a prototype
> +implementation, DAMON reduced 60% of THP memory footprint overhead while
> +preserving 50% of the THP performance benefit. The detailed results can be
> +seen on an external web page [1]_.
> +
> +Several RFC patchsets for this plan are available [2]_.
> +
> +.. [1] https://damonitor.github.io/test/result/perf/latest/html/
> +.. [2] https://lore.kernel.org/linux-mm/[email protected]/
> diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
> new file mode 100644
> index 000000000000..a6f04d966adc
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/start.rst
> @@ -0,0 +1,98 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +Getting Started
> +===============
> +
> +This document briefly describes how you can use DAMON by demonstrating its
> +default user space tool. Please note that this document describes only a part
> +of its features for brevity. Please refer to :doc:`usage` for more details.
> +
> +
> +TL; DR
> +======
> +
> +Follow below 5 commands to monitor and visualize the access pattern of your
> +workload. ::
> +
> + $ git clone https://github.com/sjp38/linux -b damon/master
> + /* build the kernel with CONFIG_DAMON=y, install, reboot */
> + $ mount -t debugfs none /sys/kernel/debug/
> + $ cd linux/tools/damon
> + $ ./damo record $(pidof <your workload>)
> + $ ./damo report heats --heatmap access_pattern.png
> +
> +
> +Prerequisites
> +=============
> +
> +Kernel
> +------
> +
> +You should first ensure your system is running on a kernel built with
> +``CONFIG_DAMON``. If the value is set to ``m``, load the module first::
> +
> + # modprobe damon
> +
> +
> +User Space Tool
> +---------------
> +
> +For the demonstration, we will use the default user space tool for DAMON,
> +called DAMON Operator (DAMO). It is located at ``tools/damon/damo`` of the
> +kernel source tree. For brevity, the examples below assume you set ``$PATH``
> +to point to it. It's not mandatory, though.
> +
> +Because DAMO is using the debugfs interface (refer to :doc:`usage` for the
> +detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as
> +below::
> +
> + # mount -t debugfs none /sys/kernel/debug/
> +
> +or append below line to your ``/etc/fstab`` file so that your system can
> +automatically mount debugfs from next booting::
> +
> + debugfs /sys/kernel/debug debugfs defaults 0 0
> +
> +
> +Recording Data Access Patterns
> +==============================
> +
> +Below commands record memory access pattern of a program and save the
> +monitoring results in a file. ::
> +
> + $ git clone https://github.com/sjp38/masim
> + $ cd masim; make; ./masim ./configs/zigzag.cfg &
> + $ sudo damo record -o damon.data $(pidof masim)
> +
> +The first two lines of the commands get an artificial memory access generator
> +program and run it in the background. It will repeatedly access two 100 MiB
> +sized memory regions one by one. You can substitute this with your real
> +workload. The last line asks ``damo`` to record the access pattern in the
> +``damon.data`` file.
> +
> +
> +Visualizing Recorded Patterns
> +=============================
> +
> +Below three commands visualize the recorded access patterns into three
> +image files. ::
> +
> + $ damo report heats --heatmap access_pattern_heatmap.png
> + $ damo report wss --range 0 101 1 --plot wss_dist.png
> + $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
> +
> +- ``access_pattern_heatmap.png`` will show the data access pattern in a
> + heatmap, which shows when (x-axis) what memory region (y-axis) is how
> + frequently accessed (color).
> +- ``wss_dist.png`` will show the distribution of the working set size.
> +- ``wss_chron_change.png`` will show how the working set size has
> + chronologically changed.
> +
> +You can show the images in a web page [1]_ . Those made with other realistic
> +workloads are also available [2]_ [3]_ [4]_.
> +
> +.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
> +.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
> +.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
> +.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
> diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
> new file mode 100644
> index 000000000000..971e6b06b4ac
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/damon/usage.rst
> @@ -0,0 +1,298 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +Detailed Usages
> +===============
> +
> +DAMON provides below three interfaces for different users.
> +
> +- *DAMON user space tool.*
> + This is for privileged people such as system administrators who want a
> + just-working human-friendly interface. Using this, users can use the DAMON’s
> + major features in a human-friendly way. It may not be highly tuned for
> + special cases, though. It supports only virtual address spaces monitoring.
> +- *debugfs interface.*
> + This is for privileged user space programmers who want more optimized use of
> + DAMON. Using this, users can use DAMON’s major features by reading
> + from and writing to special debugfs files. Therefore, you can write and use
> + your personalized DAMON debugfs wrapper programs that read/write the
> + debugfs files on your behalf. The DAMON user space tool is also a reference
> + implementation of such programs. It supports only virtual address spaces
> + monitoring.
> +- *Kernel Space Programming Interface.*
> + This is for kernel space programmers. Using this, users can utilize every
> + feature of DAMON most flexibly and efficiently by writing kernel space
> + DAMON application programs for you. You can even extend DAMON for various
> + address spaces.
> +
> +This document does not describe the kernel space programming interface in
> +detail. For that, please refer to the :doc:`/vm/damon/api`.
> +
> +
> +DAMON User Sapce Tool
Space
> +=====================
> +
> +A reference implementation of the DAMON user space tools which provides a
> +convenient user interface is in the kernel source tree. It is located at
> +``tools/damon/damo`` of the tree.
> +
> +The tool provides a subcommands based interface. Every subcommand provides
> +``-h`` option, which provides the minimal usage of it. Currently, the tool
> +supports two subcommands, ``record`` and ``report``.
> +
> +Below example commands assume you set ``$PATH`` to point ``tools/damon/`` for
> +brevity. It is not mandatory for use of ``damo``, though.
> +
> +
> +Recording Data Access Pattern
> +-----------------------------
> +
> +The ``record`` subcommand records the data access pattern of target workloads
> +in a file (``./damon.data`` by default). You can specify the target with 1)
> +the command for execution of the monitoring target process, or 2) pid of
> +running target process. Below example shows a command target usage::
> +
> + # cd <kernel>/tools/damon/
> + # damo record "sleep 5"
> +
> +The tool will execute ``sleep 5`` by itself and record the data access patterns
> +of the process. Below example shows a pid target usage::
> +
> + # sleep 5 &
> + # damo record `pidof sleep`
> +
> +The location of the recorded file can be explicitly set using ``-o`` option.
> +You can further tune this by setting the monitoring attributes. To know about
> +the monitoring attributes in detail, please refer to the
> +:doc:`/vm/damon/mechanisms`.
> +
> +
> +Analyzing Data Access Pattern
> +-----------------------------
> +
> +The ``report`` subcommand reads a data access pattern record file (if not
> +explicitly specified using ``-i`` option, reads ``./damon.data`` file by
> +default) and generates human-readable reports. You can specify what type of
> +report you want using a sub-subcommand to ``report`` subcommand. ``raw``,
> +``heats``, and ``wss`` report types are supported for now.
> +
> +
> +raw
> +~~~
> +
> +``raw`` sub-subcommand simply transforms the binary record into a
> +human-readable text. For example::
> +
> + $ damo report raw
> + start_time: 193485829398
> + rel time: 0
> + nr_tasks: 1
> + pid: 1348
> + nr_regions: 4
> + 560189609000-56018abce000( 22827008): 0
> + 7fbdff59a000-7fbdffaf1a00( 5601792): 0
> + 7fbdffaf1a00-7fbdffbb5000( 800256): 1
> + 7ffea0dc0000-7ffea0dfd000( 249856): 0
> +
> + rel time: 100000731
> + nr_tasks: 1
> + pid: 1348
> + nr_regions: 6
> + 560189609000-56018abce000( 22827008): 0
> + 7fbdff59a000-7fbdff8ce933( 3361075): 0
> + 7fbdff8ce933-7fbdffaf1a00( 2240717): 1
> + 7fbdffaf1a00-7fbdffb66d99( 480153): 0
> + 7fbdffb66d99-7fbdffbb5000( 320103): 1
> + 7ffea0dc0000-7ffea0dfd000( 249856): 0
> +
> +The first line shows the timestamp at which the recording started (in
> +nanoseconds). Records of data access patterns follow. Each record is
> +separated by a blank line. Each record first specifies the recorded time
> +(``rel time``) relative to the start time and the number of monitored tasks in
> +this record (``nr_tasks``). Recorded data access patterns of each task follow.
> +Each data access pattern for each task first shows the target's pid (``pid``)
> +and the number of monitored address regions in this access pattern
> +(``nr_regions``). After that, each line shows the start/end address, size, and
> +the number of observed accesses of each region.
> +
> +
> +heats
> +~~~~~
> +
> +The ``raw`` output is very detailed but hard to manually read. ``heats``
> +sub-subcommand plots the data in 3-dimensional form, which represents the time
> +in x-axis, address of regions in y-axis, and the access frequency in z-axis.
> +Users can set the resolution of the map (``--tres`` and ``--ares``) and
> +start/end point of each axis (``--tmin``, ``--tmax``, ``--amin``, and
> +``--amax``) via optional arguments. For example::
> +
> + $ damo report heats --tres 3 --ares 3
> + 0 0 0.0
> + 0 7609002 0.0
> + 0 15218004 0.0
> + 66112620851 0 0.0
> + 66112620851 7609002 0.0
> + 66112620851 15218004 0.0
> + 132225241702 0 0.0
> + 132225241702 7609002 0.0
> + 132225241702 15218004 0.0
> +
> +This command shows a recorded access pattern in heatmap of 3x3 resolution.
> +Therefore it shows 9 data points in total. Each line shows each of the data
> +points. The three numbers in each line represent time in nanosecond, address,
> +and the observed access frequency.
> +
> +Users will be able to convert this text output into a heatmap image (represents
> +z-axis values with colors) or other 3D representations using various tools such
> +as 'gnuplot'. For more convenience, ``heats`` sub-subcommand provides the
> +'gnuplot' based heatmap image creation. For this, you can use ``--heatmap``
> +option. Also, note that because it uses 'gnuplot' internally, it will fail if
> +'gnuplot' is not installed on your system. For example::
> +
> + $ ./damo report heats --heatmap heatmap.png
> +
> +Creates the heatmap image in ``heatmap.png`` file. It supports ``pdf``,
> +``png``, ``jpeg``, and ``svg``.
> +
> +If the target address space is a virtual memory address space and you plot the
> +entire address space, the huge unmapped regions will make the picture look
> +only black. Therefore you should zoom in / zoom out properly using the
> +resolution and axis boundary-setting arguments. To minimize this effort,
> +you can use the ``--guide`` option as below::
> +
> + $ ./damo report heats --guide
> + pid:1348
> + time: 193485829398-198337863555 (4852034157)
> + region 0: 00000094564599762944-00000094564622589952 (22827008)
> + region 1: 00000140454009610240-00000140454016012288 (6402048)
> + region 2: 00000140731597193216-00000140731597443072 (249856)
> +
> +The output shows unions of monitored regions (start and end addresses in byte)
> +and the union of monitored time duration (start and end time in nanoseconds) of
> +each target task. Therefore, it would be wise to plot the data points in each
> +union. If no axis boundary option is given, it will automatically find the
> +biggest union in ``--guide`` output and set the boundary in it.
> +
> +
> +wss
> +~~~
> +
> +The ``wss`` type extracts the distribution and chronological working set size
> +changes from the records. For example::
> +
> + $ ./damo report wss
> + # <percentile> <wss>
> + # pid 1348
> + # avr: 66228
> + 0 0
> + 25 0
> + 50 0
> + 75 0
> + 100 1920615
> +
> +Without any option, it shows the distribution of the working set sizes as
> +above. It shows the 0th, 25th, 50th, 75th, and 100th percentiles and the
> +average of the measured working set sizes in the access pattern records. In
> +this case, the working set size was zero up to the 75th percentile but
> +1,920,615 bytes at maximum and 66,228 bytes on average.
> +
> +By setting the sort key of the percentile using '--sortby', you can show how
> +the working set size has chronologically changed. For example::
> +
> + $ ./damo report wss --sortby time
> + # <percentile> <wss>
> + # pid 1348
> + # avr: 66228
> + 0 0
> + 25 0
> + 50 0
> + 75 0
> + 100 0
> +
> +The average is still 66,228. And, because the access spiked for a very short
> +duration and this command plots only 4 data points, we cannot show when the
> +access spikes were made. Users can specify the resolution of the distribution
> +(``--range``). By giving a finer resolution, the short duration spikes could
> +be found.
> +
> +Similar to that of ``heats --heatmap``, it also supports 'gnuplot' based simple
> +visualization of the distribution via ``--plot`` option.
> +
> +
> +debugfs Interface
> +=================
> +
> +DAMON exports four files, ``attrs``, ``pids``, ``record``, and ``monitor_on``
> +under its debugfs directory, ``<debugfs>/damon/``.
> +
> +
> +Attributes
> +----------
> +
> +Users can get and set the ``sampling interval``, ``aggregation interval``,
> +``regions update interval``, and min/max number of monitoring target regions by
> +reading from and writing to the ``attrs`` file. To know about the monitoring
> +attributes in detail, please refer to the :doc:`/vm/damon/mechanisms`. For
> +example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
> +1000, and then check it again::
> +
> + # cd <debugfs>/damon
> + # echo 5000 100000 1000000 10 1000 > attrs
> + # cat attrs
> + 5000 100000 1000000 10 1000
> +
> +
> +Target PIDs
> +-----------
> +
> +To monitor the virtual memory address spaces of specific processes, users can
> +get and set the pids of monitoring target processes by reading from and writing
> +to the ``pids`` file. For example, below commands set processes having pids 42
> +and 4242 as the processes to be monitored and check it again::
> +
> + # cd <debugfs>/damon
> + # echo 42 4242 > pids
> + # cat pids
> + 42 4242
> +
> +Note that setting the pids doesn't start the monitoring.
> +
> +
> +Record
> +------
> +
> +This debugfs file allows you to record monitored access patterns in a regular
> +binary file. The recorded results are first written in an in-memory buffer and
> +flushed to a file in batch. Users can get and set the size of the buffer and
> +the path to the result file by reading from and writing to the ``record`` file.
> +For example, below commands set the buffer to be 4 KiB and the result to be
> +saved in ``/damon.data``. ::
> +
> + # cd <debugfs>/damon
> + # echo "4096 /damon.data" > record
> + # cat record
> + 4096 /damon.data
> +
> +The recording can be disabled by setting the buffer size zero.
> +
> +
> +Turning On/Off
> +--------------
> +
> +Setting the files as described above doesn't incur any effect on your system
> +unless you explicitly start the monitoring. You can start, stop, and check the
> +current status of the monitoring by writing to and reading from the
> +``monitor_on`` file. Writing ``on`` to the file starts the monitoring of the
> +targets with the attributes. Writing ``off`` to the file stops those. DAMON
> +also stops if every target process is terminated. Below example commands turn
> +on, off, and check the status of DAMON::
> +
> + # cd <debugfs>/damon
> + # echo on > monitor_on
> + # echo off > monitor_on
> + # cat monitor_on
> + off
> +
> +Please note that you cannot write to the above-mentioned debugfs files while
> +the monitoring is turned on. If you write to the files while DAMON is running,
> +an error code such as ``-EBUSY`` will be returned.
> diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
> index 11db46448354..e6de5cd41945 100644
> --- a/Documentation/admin-guide/mm/index.rst
> +++ b/Documentation/admin-guide/mm/index.rst
> @@ -27,6 +27,7 @@ the Linux memory management.
>
> concepts
> cma_debugfs
> + damon/index
> hugetlbpage
> idle_page_tracking
> ksm
> diff --git a/Documentation/vm/damon/api.rst b/Documentation/vm/damon/api.rst
> new file mode 100644
> index 000000000000..649409828eab
> --- /dev/null
> +++ b/Documentation/vm/damon/api.rst
> @@ -0,0 +1,20 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=============
> +API Reference
> +=============
> +
> +Kernel space programs can use every feature of DAMON using the APIs below. All
> +you need to do is to include ``damon.h``, which is located in
> +``include/linux/`` of the source tree.
> +
> +Structures
> +==========
> +
> +.. kernel-doc:: include/linux/damon.h
> +
> +
> +Functions
> +=========
> +
> +.. kernel-doc:: mm/damon.c
> diff --git a/Documentation/vm/damon/eval.rst b/Documentation/vm/damon/eval.rst
> new file mode 100644
> index 000000000000..b233890b4e45
> --- /dev/null
> +++ b/Documentation/vm/damon/eval.rst
> @@ -0,0 +1,222 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========
> +Evaluation
> +==========
> +
> +DAMON is lightweight. It increases system memory usage by only -0.25% and
> +consumes less than 1% CPU time in most cases. It slows target workloads down by
> +only 0.94%.
> +
> +DAMON is accurate and useful for memory management optimizations. An
> +experimental DAMON-based operation scheme for THP, 'ethp', removes 31.29% of
> +THP memory overheads while preserving 60.64% of THP speedup. Another
> +experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
> +reduces 87.95% of residential sets and 29.52% of system memory footprint while
> +incurring only 2.15% runtime overhead in the best case (parsec3/freqmine).
> +
> +Setup
> +=====
> +
> +On a QEMU/KVM based virtual machine that utilizes 20GB of RAM, is hosted by an
> +Intel i7 machine, and runs a kernel to which the v16 DAMON patchset is applied,
> +I measure runtime and consumed system memory while running various realistic
> +workloads with several configurations. I use 13 and 12 workloads in PARSEC3
> +[3]_ and SPLASH-2X [4]_ benchmark suites, respectively. I use wrapper
> +scripts [5]_ for convenient setup and run of the workloads.
> +
> +Measurement
> +-----------
> +
> +For the measurement of the amount of consumed memory in system global scope, I
> +drop caches before starting each of the workloads and monitor 'MemFree' in the
> +'/proc/meminfo' file. To make results more stable, I repeat the runs 5 times
> +and average results.
> +
> +Configurations
> +--------------
> +
> +The configurations I use are as below.
> +
> +- orig: Linux v5.7 with 'madvise' THP policy
> +- rec: 'orig' plus DAMON running with virtual memory access recording
> +- prec: 'orig' plus DAMON running with physical memory access recording
> +- thp: same as 'orig', but with the 'always' THP policy
> +- ethp: 'orig' plus a DAMON operation scheme, 'efficient THP'
> +- prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim [6]_'
> +
> +I use 'rec' for measurement of DAMON overheads to target workloads and system
> +memory. 'prec' is for physical memory monitoring and recording. It monitors
> +a 17GB sized 'System RAM' region. The remaining configs including 'thp', 'ethp',
> +and 'prcl' are for measurement of DAMON monitoring accuracy.
> +
> +'ethp' and 'prcl' are simple DAMON-based operation schemes developed as
> +proofs of concept for DAMON. 'ethp' reduces memory space waste of THP by using
> +DAMON for the decision of promotions and demotions of huge pages, while 'prcl'
> +is similar to the original work. Those are implemented as below::
> +
> + # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
> + # ethp: Use huge pages if a region shows >=5% access rate, use regular
> + # pages if a region >=2MB shows <5% access rate for >=13 seconds
> + null null 5 null null null hugepage
> + 2M null null null 13s null nohugepage
> +
> + # prcl: If a region >=4KB shows <=5% access rate for >=7 seconds, page out.
> + 4K null null 5 7s null pageout
> +
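
The field order described in the format comment above can be made concrete with a small decoder. This is only a sketch in Python: `parse_scheme` and its helpers are hypothetical names, and the unit handling (K/M/G as powers of 1024, an `s` suffix for seconds, `null` for an unset bound) is an assumption drawn from the three example lines, not the parser of the actual DAMON tooling.

```python
# Hypothetical decoder for one scheme line; NOT the DAMON tool's parser.
SIZE_UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}  # assumed units


def parse_size(tok):
    """Return a size in bytes, or None for 'null'."""
    if tok == "null":
        return None
    if tok[-1] in SIZE_UNITS:
        return int(tok[:-1]) * SIZE_UNITS[tok[-1]]
    return int(tok)


def parse_freq(tok):
    """Return an access frequency in percent (0-100), or None for 'null'."""
    return None if tok == "null" else int(tok)


def parse_age(tok):
    """Return an age in seconds, or None; only the 's' suffix is assumed."""
    if tok == "null":
        return None
    if not tok.endswith("s"):
        raise ValueError("unsupported age unit: " + tok)
    return int(tok[:-1])


def parse_scheme(line):
    """Decode '<min/max size> <min/max freq> <min/max age> <action>'."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return {
        "min_size": parse_size(fields[0]),
        "max_size": parse_size(fields[1]),
        "min_freq": parse_freq(fields[2]),
        "max_freq": parse_freq(fields[3]),
        "min_age": parse_age(fields[4]),
        "max_age": parse_age(fields[5]),
        "action": fields[6],
    }


prcl = parse_scheme("4K null null 5 7s null pageout")
```

Under these assumptions, the 'prcl' line decodes to a 4096-byte minimum size, a maximum access frequency of 5, and a minimum age of 7 seconds.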
> +Note that both 'ethp' and 'prcl' are designed with only my straightforward
> +intuition, because they are only for proof of concept and for showing the
> +monitoring accuracy of DAMON. In other words, they are not for production. For
> +production use, they should be tuned further.
> +
> +.. [1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
> +.. [2] "Disable Transparent Huge Pages (THP)",
> + https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
> +.. [3] "The PARSEC Benchmark Suite", https://parsec.cs.princeton.edu/index.htm
> +.. [4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
> +.. [5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu
> +.. [6] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/
> +
> +Results
> +=======
> +
> +The two tables below show the measurement results. The runtimes are in seconds
> +while the memory usages are in KiB. Each configuration except 'orig' shows
> +its overhead relative to 'orig' in percent within parentheses::
> +
> + runtime orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> + parsec3/blackscholes 107.228 107.859 (0.59) 108.110 (0.82) 107.381 (0.14) 106.811 (-0.39) 114.766 (7.03)
> + parsec3/bodytrack 79.292 79.609 (0.40) 79.777 (0.61) 79.313 (0.03) 78.892 (-0.50) 80.398 (1.40)
> + parsec3/canneal 148.887 150.878 (1.34) 153.337 (2.99) 127.873 (-14.11) 132.272 (-11.16) 167.631 (12.59)
> + parsec3/dedup 11.970 11.975 (0.04) 12.024 (0.45) 11.752 (-1.82) 11.921 (-0.41) 13.244 (10.64)
> + parsec3/facesim 212.800 215.927 (1.47) 215.004 (1.04) 205.117 (-3.61) 207.401 (-2.54) 220.834 (3.78)
> + parsec3/ferret 190.646 192.560 (1.00) 192.414 (0.93) 190.662 (0.01) 192.309 (0.87) 193.497 (1.50)
> + parsec3/fluidanimate 213.951 216.459 (1.17) 217.578 (1.70) 209.500 (-2.08) 211.826 (-0.99) 218.299 (2.03)
> + parsec3/freqmine 291.050 292.117 (0.37) 293.279 (0.77) 289.553 (-0.51) 291.768 (0.25) 297.309 (2.15)
> + parsec3/raytrace 118.645 119.734 (0.92) 119.521 (0.74) 117.715 (-0.78) 118.844 (0.17) 134.045 (12.98)
> + parsec3/streamcluster 332.843 336.997 (1.25) 337.049 (1.26) 279.716 (-15.96) 290.985 (-12.58) 346.646 (4.15)
> + parsec3/swaptions 155.437 157.174 (1.12) 156.159 (0.46) 155.017 (-0.27) 154.955 (-0.31) 156.555 (0.72)
> + parsec3/vips 59.215 59.426 (0.36) 59.156 (-0.10) 59.243 (0.05) 58.858 (-0.60) 60.184 (1.64)
> + parsec3/x264 67.445 71.400 (5.86) 71.122 (5.45) 64.078 (-4.99) 66.027 (-2.10) 71.489 (6.00)
> + splash2x/barnes 81.826 81.800 (-0.03) 82.648 (1.00) 74.343 (-9.15) 79.063 (-3.38) 103.785 (26.84)
> + splash2x/fft 33.850 34.148 (0.88) 33.912 (0.18) 23.493 (-30.60) 32.684 (-3.44) 48.303 (42.70)
> + splash2x/lu_cb 86.404 86.333 (-0.08) 86.988 (0.68) 85.720 (-0.79) 85.944 (-0.53) 89.338 (3.40)
> + splash2x/lu_ncb 94.908 98.021 (3.28) 96.041 (1.19) 90.304 (-4.85) 93.279 (-1.72) 97.270 (2.49)
> + splash2x/ocean_cp 47.122 47.391 (0.57) 47.902 (1.65) 43.227 (-8.26) 44.609 (-5.33) 51.410 (9.10)
> + splash2x/ocean_ncp 93.147 92.911 (-0.25) 93.886 (0.79) 51.451 (-44.76) 71.107 (-23.66) 112.554 (20.83)
> + splash2x/radiosity 92.150 92.604 (0.49) 93.339 (1.29) 90.802 (-1.46) 91.824 (-0.35) 104.439 (13.34)
> + splash2x/radix 31.961 32.113 (0.48) 32.066 (0.33) 25.184 (-21.20) 30.412 (-4.84) 49.989 (56.41)
> + splash2x/raytrace 84.781 85.278 (0.59) 84.763 (-0.02) 83.192 (-1.87) 83.970 (-0.96) 85.382 (0.71)
> + splash2x/volrend 87.401 87.978 (0.66) 87.977 (0.66) 86.636 (-0.88) 87.169 (-0.26) 88.043 (0.73)
> + splash2x/water_nsquared 239.140 239.570 (0.18) 240.901 (0.74) 221.323 (-7.45) 224.670 (-6.05) 244.492 (2.24)
> + splash2x/water_spatial 89.538 89.978 (0.49) 90.171 (0.71) 89.729 (0.21) 89.238 (-0.34) 99.331 (10.94)
> + total 3051.620 3080.230 (0.94) 3085.130 (1.10) 2862.320 (-6.20) 2936.830 (-3.76) 3249.240 (6.48)
> +
> +
> + memused.avg orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> + parsec3/blackscholes 1676679.200 1683789.200 (0.42) 1680281.200 (0.21) 1613817.400 (-3.75) 1835229.200 (9.46) 1407952.800 (-16.03)
> + parsec3/bodytrack 1295736.000 1308412.600 (0.98) 1311988.000 (1.25) 1243417.400 (-4.04) 1435410.600 (10.78) 1255566.400 (-3.10)
> + parsec3/canneal 1004062.000 1008823.800 (0.47) 1000100.200 (-0.39) 983976.000 (-2.00) 1051719.600 (4.75) 993055.800 (-1.10)
> + parsec3/dedup 2389765.800 2393381.000 (0.15) 2366668.200 (-0.97) 2412948.600 (0.97) 2435885.600 (1.93) 2380172.800 (-0.40)
> + parsec3/facesim 488927.200 498228.000 (1.90) 496683.800 (1.59) 476327.800 (-2.58) 552890.000 (13.08) 449143.600 (-8.14)
> + parsec3/ferret 280324.600 282032.400 (0.61) 282284.400 (0.70) 258211.000 (-7.89) 331493.800 (18.25) 265850.400 (-5.16)
> + parsec3/fluidanimate 560636.200 569038.200 (1.50) 565067.400 (0.79) 556923.600 (-0.66) 588021.200 (4.88) 512901.600 (-8.51)
> + parsec3/freqmine 883286.000 904960.200 (2.45) 886105.200 (0.32) 849347.400 (-3.84) 998358.000 (13.03) 622542.800 (-29.52)
> + parsec3/raytrace 1639370.200 1642318.200 (0.18) 1626673.200 (-0.77) 1591284.200 (-2.93) 1755088.400 (7.06) 1410261.600 (-13.98)
> + parsec3/streamcluster 116955.600 127251.400 (8.80) 121441.000 (3.84) 113853.800 (-2.65) 139659.400 (19.41) 120335.200 (2.89)
> + parsec3/swaptions 8342.400 18555.600 (122.43) 16581.200 (98.76) 6745.800 (-19.14) 27487.200 (229.49) 14275.600 (71.12)
> + parsec3/vips 2776417.600 2784989.400 (0.31) 2820564.600 (1.59) 2694060.800 (-2.97) 2968650.000 (6.92) 2713590.000 (-2.26)
> + parsec3/x264 2912885.000 2936474.600 (0.81) 2936775.800 (0.82) 2799599.200 (-3.89) 3168695.000 (8.78) 2829085.800 (-2.88)
> + splash2x/barnes 1206459.600 1204145.600 (-0.19) 1177390.000 (-2.41) 1210556.800 (0.34) 1214978.800 (0.71) 907737.000 (-24.76)
> + splash2x/fft 9384156.400 9258749.600 (-1.34) 8560377.800 (-8.78) 9337563.000 (-0.50) 9228873.600 (-1.65) 9823394.400 (4.68)
> + splash2x/lu_cb 510210.800 514052.800 (0.75) 502735.200 (-1.47) 514459.800 (0.83) 523884.200 (2.68) 367563.200 (-27.96)
> + splash2x/lu_ncb 510091.200 516046.800 (1.17) 505327.600 (-0.93) 512568.200 (0.49) 524178.400 (2.76) 427981.800 (-16.10)
> + splash2x/ocean_cp 3342260.200 3294531.200 (-1.43) 3171236.000 (-5.12) 3379693.600 (1.12) 3314896.600 (-0.82) 3252406.000 (-2.69)
> + splash2x/ocean_ncp 3900447.200 3881682.600 (-0.48) 3816493.200 (-2.15) 7065506.200 (81.15) 4449224.400 (14.07) 3829931.200 (-1.81)
> + splash2x/radiosity 1466372.000 1463840.200 (-0.17) 1438554.000 (-1.90) 1475151.600 (0.60) 1474828.800 (0.58) 496636.000 (-66.13)
> + splash2x/radix 1760056.600 1691719.000 (-3.88) 1613057.400 (-8.35) 1384416.400 (-21.34) 1632274.400 (-7.26) 2141640.200 (21.68)
> + splash2x/raytrace 38794.000 48187.400 (24.21) 46728.400 (20.45) 41323.400 (6.52) 61499.800 (58.53) 68455.200 (76.46)
> + splash2x/volrend 138107.400 148197.000 (7.31) 146223.400 (5.88) 128076.400 (-7.26) 164593.800 (19.18) 140885.200 (2.01)
> + splash2x/water_nsquared 39072.000 49889.200 (27.69) 47548.400 (21.69) 37546.400 (-3.90) 57195.400 (46.38) 42994.200 (10.04)
> + splash2x/water_spatial 662099.800 665964.800 (0.58) 651017.000 (-1.67) 659808.400 (-0.35) 674475.600 (1.87) 519677.600 (-21.51)
> + total 38991500.000 38895300.000 (-0.25) 37787817.000 (-3.09) 41347200.000 (6.04) 40609600.000 (4.15) 36994100.000 (-5.12)
> +
> +
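
For readers checking the parenthesized columns: they are consistent with a plain relative difference against 'orig'. A minimal Python sketch, where `overhead_percent` is a hypothetical helper name and the inputs are the 'total' rows of the two tables above:

```python
def overhead_percent(orig, measured):
    """Relative overhead of 'measured' against 'orig', in percent."""
    return (measured - orig) / orig * 100.0


# 'total' rows for the 'orig' and 'rec' columns.
runtime_overhead = overhead_percent(3051.620, 3080.230)
memused_overhead = overhead_percent(38991500.000, 38895300.000)

print(round(runtime_overhead, 2), round(memused_overhead, 2))
```

A negative value means the configuration used less time or memory than 'orig'.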
> +DAMON Overheads
> +---------------
> +
> +In total, the DAMON virtual memory access recording feature ('rec') incurs 0.94%
> +runtime overhead and -0.25% memory space overhead. Even though the size of the
> +monitoring target region becomes much larger with the physical memory access
> +recording ('prec'), it still shows only a modest amount of overhead (1.10% for
> +runtime and -3.09% for memory footprint).
> +
> +For convenient test runs of 'rec' and 'prec', I use a Python wrapper. The
> +wrapper constantly consumes about 10-15MB of memory. This becomes a high
> +memory overhead if the target workload has a small memory footprint.
> +Nonetheless, the overheads are not from DAMON, but from the wrapper, and thus
> +should be ignored. This fake memory overhead continues in 'ethp' and 'prcl',
> +as those configurations are also using the Python wrapper.
> +
> +
> +Efficient THP
> +-------------
> +
> +The THP 'always' policy achieves 6.20% speedup but incurs 6.04% memory
> +overhead. It achieves 44.76% speedup in the best case, but 81.15% memory
> +overhead in the worst case. Interestingly, both the best and worst cases are
> +with 'splash2x/ocean_ncp'.
> +
> +The 2-line implementation of data access monitoring based THP ('ethp')
> +shows 3.76% speedup and 4.15% memory overhead. In other words, 'ethp' removes
> +31.29% of THP memory waste while preserving 60.64% of THP speedup in total. In
> +the case of the 'splash2x/ocean_ncp', 'ethp' removes 82.66% of THP memory waste
> +while preserving 52.85% of THP speedup.
> +
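
The "removed" and "preserved" figures above appear to follow from the ratios of the 'ethp' and 'thp' columns. A minimal Python sketch of that arithmetic; the two helper names are hypothetical, and since the inputs are the rounded percentages from the tables, the last digit can differ slightly from the quoted figures:

```python
def removed_waste_percent(thp_mem_overhead, ethp_mem_overhead):
    """Share of THP's memory overhead that 'ethp' avoids, in percent."""
    return (1.0 - ethp_mem_overhead / thp_mem_overhead) * 100.0


def preserved_speedup_percent(thp_speedup, ethp_speedup):
    """Share of THP's speedup that 'ethp' keeps, in percent."""
    return ethp_speedup / thp_speedup * 100.0


# Totals: memory overhead 6.04% (thp) vs 4.15% (ethp),
# speedup 6.20% (thp) vs 3.76% (ethp).
total_removed = removed_waste_percent(6.04, 4.15)
total_preserved = preserved_speedup_percent(6.20, 3.76)

# splash2x/ocean_ncp: memory overhead 81.15% vs 14.07%,
# speedup 44.76% vs 23.66%.
ocean_removed = removed_waste_percent(81.15, 14.07)
ocean_preserved = preserved_speedup_percent(44.76, 23.66)
```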
> +
> +Proactive Reclamation
> +---------------------
> +
> +Similar to the original work, I use a 4G 'zram' swap device for this
> +configuration.
> +
> +In total, our 1-line implementation of Proactive Reclamation, 'prcl', incurred
> +6.48% runtime overhead while achieving 5.12% system memory usage
> +reduction.
> +
> +Nonetheless, as the memory usage is calculated with 'MemFree' in
> +'/proc/meminfo', it contains the SwapCached pages. As the SwapCached pages can
> +be easily evicted, I also measured the resident set size of the workloads::
> +
> + rss.avg orig rec (overhead) prec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> + parsec3/blackscholes 590412.200 589991.400 (-0.07) 591716.400 (0.22) 591131.000 (0.12) 591055.200 (0.11) 274623.600 (-53.49)
> + parsec3/bodytrack 32202.200 32297.400 (0.30) 32301.400 (0.31) 32328.000 (0.39) 32169.800 (-0.10) 25311.200 (-21.40)
> + parsec3/canneal 840063.600 839145.200 (-0.11) 839506.200 (-0.07) 835102.600 (-0.59) 839766.000 (-0.04) 833091.800 (-0.83)
> + parsec3/dedup 1185493.200 1202688.800 (1.45) 1204597.000 (1.61) 1238071.400 (4.44) 1201689.400 (1.37) 920688.600 (-22.34)
> + parsec3/facesim 311570.400 311542.000 (-0.01) 311665.000 (0.03) 316106.400 (1.46) 312003.400 (0.14) 252646.000 (-18.91)
> + parsec3/ferret 99783.200 99330.000 (-0.45) 99735.000 (-0.05) 102000.600 (2.22) 99927.400 (0.14) 90967.400 (-8.83)
> + parsec3/fluidanimate 531780.800 531800.800 (0.00) 531754.600 (-0.00) 532009.600 (0.04) 531822.400 (0.01) 479116.000 (-9.90)
> + parsec3/freqmine 551787.600 551550.600 (-0.04) 551950.000 (0.03) 556030.000 (0.77) 553720.400 (0.35) 66480.000 (-87.95)
> + parsec3/raytrace 895247.000 895240.200 (-0.00) 895770.400 (0.06) 895880.200 (0.07) 893516.600 (-0.19) 327339.600 (-63.44)
> + parsec3/streamcluster 110862.200 110840.400 (-0.02) 110878.600 (0.01) 112067.200 (1.09) 112010.800 (1.04) 109763.600 (-0.99)
> + parsec3/swaptions 5630.000 5580.800 (-0.87) 5599.600 (-0.54) 5624.200 (-0.10) 5697.400 (1.20) 3792.400 (-32.64)
> + parsec3/vips 31677.200 31881.800 (0.65) 31785.800 (0.34) 32177.000 (1.58) 32456.800 (2.46) 29692.000 (-6.27)
> + parsec3/x264 81796.400 81918.600 (0.15) 81827.600 (0.04) 82734.800 (1.15) 82854.000 (1.29) 81478.200 (-0.39)
> + splash2x/barnes 1216014.600 1215462.000 (-0.05) 1218535.200 (0.21) 1227689.400 (0.96) 1219022.000 (0.25) 650771.000 (-46.48)
> + splash2x/fft 9622775.200 9511973.400 (-1.15) 9688178.600 (0.68) 9733868.400 (1.15) 9651488.000 (0.30) 7567077.400 (-21.36)
> + splash2x/lu_cb 511102.400 509911.600 (-0.23) 511123.800 (0.00) 514466.800 (0.66) 510462.800 (-0.13) 361014.000 (-29.37)
> + splash2x/lu_ncb 510569.800 510724.600 (0.03) 510888.800 (0.06) 513951.600 (0.66) 509474.400 (-0.21) 424030.400 (-16.95)
> + splash2x/ocean_cp 3413563.600 3413721.800 (0.00) 3398399.600 (-0.44) 3446878.000 (0.98) 3404799.200 (-0.26) 3244787.400 (-4.94)
> + splash2x/ocean_ncp 3927797.400 3936294.400 (0.22) 3917698.800 (-0.26) 7181781.200 (82.85) 4525783.600 (15.22) 3693747.800 (-5.96)
> + splash2x/radiosity 1477264.800 1477569.200 (0.02) 1476954.200 (-0.02) 1485724.800 (0.57) 1474684.800 (-0.17) 230128.000 (-84.42)
> + splash2x/radix 1773025.000 1754424.200 (-1.05) 1743194.400 (-1.68) 1445575.200 (-18.47) 1694855.200 (-4.41) 1769750.000 (-0.18)
> + splash2x/raytrace 23292.000 23284.000 (-0.03) 23292.800 (0.00) 28704.800 (23.24) 26489.600 (13.73) 15753.000 (-32.37)
> + splash2x/volrend 44095.800 44068.200 (-0.06) 44107.600 (0.03) 44114.600 (0.04) 44054.000 (-0.09) 31616.000 (-28.30)
> + splash2x/water_nsquared 29416.800 29403.200 (-0.05) 29406.400 (-0.04) 30103.200 (2.33) 29433.600 (0.06) 24927.400 (-15.26)
> + splash2x/water_spatial 657791.000 657840.400 (0.01) 657826.600 (0.01) 657595.800 (-0.03) 656617.800 (-0.18) 481334.800 (-26.83)
> + total 28475091.000 28368400.000 (-0.37) 28508700.000 (0.12) 31641800.000 (11.12) 29036000.000 (1.97) 21989800.000 (-22.78)
> +
> +In total, resident sets were reduced by 22.78%.
> +
> +With parsec3/freqmine, 'prcl' reduced 87.95% of resident sets and 29.52% of
> +system memory usage while incurring only 2.15% runtime overhead.
> diff --git a/Documentation/vm/damon/faq.rst b/Documentation/vm/damon/faq.rst
> new file mode 100644
> index 000000000000..a15059cfb98a
> --- /dev/null
> +++ b/Documentation/vm/damon/faq.rst
> @@ -0,0 +1,59 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +Frequently Asked Questions
> +==========================
> +
> +Why a new module, instead of extending perf or other user space tools?
> +======================================================================
> +
> +First, because it needs to be as lightweight as possible so that it can be
> +used online, any unnecessary overhead such as kernel - user space context
> +switching cost should be avoided. Second, DAMON aims to be used by other
> +programs including the kernel. Therefore, having a dependency on specific
> +tools like perf is not desirable. These are the two biggest reasons why DAMON
> +is implemented in the kernel space.
> +
> +
> +Can 'idle pages tracking' or 'perf mem' substitute DAMON?
> +=========================================================
> +
> +Idle page tracking is a low level primitive for access check of the physical
> +address space. 'perf mem' is similar, though it can use sampling to minimize
> +the overhead. On the other hand, DAMON is a higher-level framework for the
> +monitoring of various address spaces. It is focused on memory management
> +optimization and provides sophisticated accuracy/overhead handling mechanisms.
> +Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
> +DAMON's output, but cannot substitute DAMON. Rather than that, thouse could be
those?
> +configured as DAMON's low-level primitives for specific address spaces.
> +
> +
> +How can I optimize my system's memory management using DAMON?
> +=============================================================
> +
> +Because there are several ways for the DAMON-based optimizations, we wrote a
> +separate document, :doc:`/admin-guide/mm/damon/guide`. Please refer to that.
> +
> +
> +Does DAMON support virtual memory only?
> +=======================================
> +
> +No. The core of DAMON is address space independent. The address space
> +specific low level primitive parts including monitoring target regions
> +constructions and actual access checks can be implemented and configured on the
> +DAMON core by the users. In this way, DAMON users can monitor any address
> +space with any access check technique.
> +
> +Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
> +implementations of the address space dependent functions for the virtual memory
> +by default, for reference and convenient use. In the near future, we will
> +provide those for the physical memory address space.
> +
> +
> +Can I simply monitor page granularity?
> +======================================
> +
> +Yes. You can do so by setting the ``min_nr_regions`` attribute higher than the
> +working set size divided by the page size. Because the monitoring target
> +region size is forced to be ``>=page size``, the region split will have no
> +effect.
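
As a worked example of that rule, the smallest value that is "higher than the working set size divided by the page size" can be computed as below. This is a Python sketch only; the helper name and the 4 KiB default page size are assumptions for illustration.

```python
def min_nr_regions_for_page_granularity(working_set_bytes, page_size=4096):
    """Smallest integer strictly greater than working_set / page_size."""
    return working_set_bytes // page_size + 1


# A 1 GiB working set with 4 KiB pages.
one_gib = min_nr_regions_for_page_granularity(1 << 30)
print(one_gib)
```

With such a setting every region is already one page, so the adaptive split has nothing left to divide.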
> diff --git a/Documentation/vm/damon/index.rst b/Documentation/vm/damon/index.rst
> new file mode 100644
> index 000000000000..1ac29c8d9e87
> --- /dev/null
> +++ b/Documentation/vm/damon/index.rst
> @@ -0,0 +1,32 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========================
> +DAMON: Data Access MONitor
> +==========================
> +
> +DAMON is a data access monitoring framework subsystem for the Linux kernel.
> +The core mechanisms of DAMON (refer to :doc:`mechanisms` for the detail) make
> +it
> +
> + - *accurate* (the monitoring output is useful enough for DRAM level memory
> + management; it might not be appropriate for CPU Cache levels, though),
> + - *light-weight* (the monitoring overhead is low enough to be applied online),
> + and
> + - *scalable* (the upper-bound of the overhead is in constant range regardless
> + of the size of target workloads).
> +
> +Using this framework, therefore, the kernel's memory management mechanisms can
> +make advanced decisions. Experimental memory management optimization works
> +that incurred high data access monitoring overhead could be implemented again.
> +In user space, meanwhile, users who have some special workloads can write
> +personalized applications for better understanding and optimizations of their
> +workloads and systems.
> +
> +.. toctree::
> + :maxdepth: 2
> +
> + faq
> + mechanisms
> + eval
> + api
> + plans
> diff --git a/Documentation/vm/damon/mechanisms.rst b/Documentation/vm/damon/mechanisms.rst
> new file mode 100644
> index 000000000000..56cad258cea1
> --- /dev/null
> +++ b/Documentation/vm/damon/mechanisms.rst
> @@ -0,0 +1,165 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==========
> +Mechanisms
> +==========
> +
> +Configurable Layers
> +===================
> +
> +DAMON provides data access monitoring functionality while making the accuracy
> +and the overhead controllable. The fundamental access monitoring requires
> +primitives that are dependent on and optimized for the target address space. On
> +the other hand, the accuracy and overhead tradeoff mechanism, which is the core
> +of DAMON, is in the pure logic space. DAMON separates the two parts in
> +different layers and defines its interface to allow various low level
> +primitive implementations to be configured with the core logic.
> +
> +Due to this separated design and the configurable interface, users can extend
> +DAMON for any address space by configuring the core logic with appropriate low
> +level primitive implementations. If an appropriate one is not provided, users
> +can implement the primitives on their own.
> +
> +For example, physical memory, virtual memory, swap space, those for specific
> +processes, NUMA nodes, files, and backing memory devices would be supportable.
> +Also, if some architectures or devices support special optimized access check
> +primitives, those will be easily configurable.
> +
> +
> +Reference Implementations of Address Space Specific Primitives
> +==============================================================
> +
> +The low level primitives for the fundamental access monitoring are defined in
> +two parts:
> +
> +1. Identification of the monitoring target address range for the address space.
> +2. Access check of specific address range in the target space.
> +
> +DAMON currently provides the implementation of the primitives for only the
> +virtual address spaces. The two subsections below describe how they work.
> +
> +
> +PTE Accessed-bit Based Access Check
> +-----------------------------------
> +
> +The implementation for the virtual address space uses PTE Accessed-bit for
> +basic access checks. It finds the relevant PTE Accessed bit from the address
> +by walking the page table for the target task of the address. In this way, the
> +implementation finds and clears the bit for the next sampling target address
> +and checks whether the bit is set again after one sampling period. To avoid
> +disturbing other Accessed bit users such as the reclamation logic, the
> +implementation adjusts ``PG_idle`` and ``PG_young`` appropriately, in the same
> +way as 'Idle Page Tracking'.
> +
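
The clear-then-recheck cycle described above can be modelled in a few lines. This is a toy Python sketch, not kernel code: `ToyPageTable` and the two helper functions are hypothetical, and the model leaves out the page table walk and the page idle/young flag handling entirely.

```python
class ToyPageTable:
    """Toy stand-in for per-page Accessed bits (not kernel code)."""

    def __init__(self, nr_pages):
        self.accessed = [False] * nr_pages

    def touch(self, page):
        """What the hardware does on an access: set the Accessed bit."""
        self.accessed[page] = True


def prepare_access_check(pt, page):
    """Clear the bit so that only accesses from now on are observed."""
    pt.accessed[page] = False


def check_access(pt, page):
    """After one sampling period: was the bit set again?"""
    return pt.accessed[page]


pt = ToyPageTable(4)
pt.touch(2)                      # stale access from before the sampling
pt.touch(3)                      # stale access from before the sampling
prepare_access_check(pt, 2)
prepare_access_check(pt, 3)
pt.touch(2)                      # the workload runs for one sampling period
accessed_2 = check_access(pt, 2)
accessed_3 = check_access(pt, 3)
```

Clearing first is what makes the stale access to page 3 invisible to the check.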
> +
> +VMA-based Target Address Range Construction
> +-------------------------------------------
> +
> +Only small parts in the super-huge virtual address space of the processes are
> +mapped to the physical memory and accessed. Thus, tracking the unmapped
> +address regions is just wasteful. However, because DAMON can deal with some
> +level of noise using the adaptive regions adjustment mechanism, tracking every
> +mapping is not strictly required but could even incur a high overhead in some
> +cases. That said, excessively large unmapped areas inside the monitoring target
> +should be removed so that the adaptive mechanism does not spend time on them.
> +
> +For this reason, this implementation converts the complex mappings to three
> +distinct regions that cover every mapped area of the address space. The two
> +gaps between the three regions are the two biggest unmapped areas in the given
> +address space. The two biggest unmapped areas would be the gap between the
> +heap and the uppermost mmap()-ed region, and the gap between the lowermost
> +mmap()-ed region and the stack in most of the cases. Because these gaps are
> +exceptionally huge in usual address spaces, excluding these will be sufficient
> +to make a reasonable trade-off. Below shows this in detail::
> +
> + <heap>
> + <BIG UNMAPPED REGION 1>
> + <uppermost mmap()-ed region>
> + (small mmap()-ed regions and munmap()-ed regions)
> + <lowermost mmap()-ed region>
> + <BIG UNMAPPED REGION 2>
> + <stack>
> +
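
The three-region construction in the diagram above can be sketched as follows. This is a Python illustration under stated assumptions, not the patch's C code: `three_regions` is a hypothetical name, the input is assumed to be a sorted, non-overlapping list of mapped `(start, end)` ranges, and at least three mappings are required so that two gaps exist.

```python
def three_regions(mappings):
    """Return three ranges separated by the two biggest unmapped gaps.

    mappings: list of (start, end) tuples, sorted by start address.
    """
    if len(mappings) < 3:
        raise ValueError("need at least three mappings")
    # Every gap between two consecutive mappings, as (gap_start, gap_end).
    gaps = [(mappings[i][1], mappings[i + 1][0])
            for i in range(len(mappings) - 1)]
    # Keep the two biggest gaps, then order them by address.
    biggest = sorted(gaps, key=lambda g: g[1] - g[0], reverse=True)[:2]
    first, second = sorted(biggest)
    return [
        (mappings[0][0], first[0]),
        (first[1], second[0]),
        (second[1], mappings[-1][1]),
    ]


# Toy layout: heap, two nearby mmap()-ed areas, and the stack.
layout = [(0x1000, 0x5000), (0x100000, 0x101000),
          (0x102000, 0x104000), (0x900000, 0x910000)]
regions = three_regions(layout)
```

In the toy layout the small hole between the two mmap()-ed areas stays inside the middle region, which is the noise the adaptive adjustment is expected to absorb.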
> +
> +Address Space Independent Core Mechanisms
> +=========================================
> +
> +Below four sections describe each of the DAMON core mechanisms and the five
> +monitoring attributes, ``sampling interval``, ``aggregation interval``,
> +``regions update interval``, ``minimum number of regions``, and ``maximum
> +number of regions``.
> +
> +
> +Access Frequency Monitoring
> +---------------------------
> +
> +The output of DAMON says which pages are accessed how frequently for a given
> +duration. The resolution of the access frequency is controlled by setting
> +``sampling interval`` and ``aggregation interval``. In detail, DAMON checks
> +access to each page per ``sampling interval`` and aggregates the results. In
> +other words, it counts the number of accesses to each page. After each
> +``aggregation interval`` passes, DAMON calls callback functions previously
> +registered by users so that users can read the aggregated results, and then
> +clears the results. This can be described in the simple pseudo-code below::
> +
> +    while monitoring_on:
> +        for page in monitoring_target:
> +            if accessed(page):
> +                nr_accesses[page] += 1
> +        if time() % aggregation_interval == 0:
> +            for callback in user_registered_callbacks:
> +                callback(monitoring_target, nr_accesses)
> +            for page in monitoring_target:
> +                nr_accesses[page] = 0
> +        sleep(sampling_interval)
> +
> +The monitoring overhead of this mechanism will arbitrarily increase as the
> +size of the target workload grows.
> +
> +
> +Region Based Sampling
> +---------------------
> +
> +To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
> +that are assumed to have the same access frequencies into a region. As long as
> +the assumption (pages in a region have the same access frequencies) is kept,
> +only one page in the region needs to be checked. Thus, for each ``sampling
> +interval``, DAMON randomly picks one page in each region, waits for one
> +``sampling interval``, checks whether the page is accessed meanwhile, and
> +increases the access frequency of the region if so. Therefore, the monitoring
> +overhead is controllable by setting the number of regions. DAMON allows users
> +to set the minimum and the maximum number of regions for the trade-off.
> +
> +This scheme, however, cannot preserve the quality of the output if the
> +assumption is not guaranteed.
> +
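
To show why the cost follows the number of regions rather than the number of pages, here is a minimal Python sketch of the sampling step. `Region`, `prepare_checks` and `check_accesses` are hypothetical names, and the set of accessed pages stands in for the real access check primitive.

```python
import random


class Region:
    """A run of pages assumed to share one access frequency."""

    def __init__(self, start, end):
        self.start, self.end = start, end  # page indexes, end exclusive
        self.nr_accesses = 0
        self.sampled_page = None


def prepare_checks(regions, rng):
    """Pick one random page per region as this interval's sample."""
    for r in regions:
        r.sampled_page = rng.randrange(r.start, r.end)


def check_accesses(regions, accessed_pages):
    """Credit the whole region if its sampled page was accessed."""
    for r in regions:
        if r.sampled_page in accessed_pages:
            r.nr_accesses += 1


rng = random.Random(0)
regions = [Region(0, 100), Region(100, 200)]
hot_pages = set(range(0, 100))   # toy workload: touches only the first region
for _ in range(20):              # 20 sampling intervals
    prepare_checks(regions, rng)
    check_accesses(regions, hot_pages)
```

Each interval performs two checks here instead of 200, and the result is exact only because the toy workload matches the region boundaries.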
> +
> +Adaptive Regions Adjustment
> +---------------------------
> +
> +Even if the initial monitoring target regions are somehow well constructed to
> +fulfill the assumption (pages in the same region have similar access
> +frequencies), the data access pattern can change dynamically. This will result
> +in low monitoring quality. To keep the assumption as much as possible, DAMON
> +adaptively merges and splits each region based on their access frequency.
> +
> +For each ``aggregation interval``, it compares the access frequencies of
> +adjacent regions and merges those if the frequency difference is small. Then,
> +after it reports and clears the aggregated access frequency of each region, it
> +splits each region into two or three regions if the total number of regions
> +will not exceed the user-specified maximum number of regions after the split.
> +
> +In this way, DAMON provides its best-effort quality and minimal overhead while
> +keeping the bounds users set for their trade-off.
> +
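
The merge and split steps can be sketched as below. This is a simplified Python model, not the patch's implementation: the function names are hypothetical, the merged count is assumed to be a size-weighted average, and the split only ever produces two pieces although the text allows two or three.

```python
import random


def merge_regions(regions, threshold):
    """Merge adjacent regions whose access counts differ by <= threshold.

    regions: list of (start, end, nr_accesses), sorted by address.
    """
    merged = [list(regions[0])]
    for start, end, nr in regions[1:]:
        last = merged[-1]
        if last[1] == start and abs(last[2] - nr) <= threshold:
            sz_last, sz = last[1] - last[0], end - start
            # Assumption: the merged count is the size-weighted average.
            last[2] = (last[2] * sz_last + nr * sz) // (sz_last + sz)
            last[1] = end
        else:
            merged.append([start, end, nr])
    return [tuple(r) for r in merged]


def split_regions(regions, max_nr_regions, rng):
    """Clear the counts, then split each region in two at a random point
    unless that could exceed max_nr_regions."""
    if len(regions) * 2 > max_nr_regions:
        return [(s, e, 0) for s, e, _ in regions]
    out = []
    for start, end, _ in regions:
        if end - start < 2:
            out.append((start, end, 0))
            continue
        mid = rng.randrange(start + 1, end)
        out += [(start, mid, 0), (mid, end, 0)]
    return out


regions = [(0, 10, 18), (10, 20, 19), (20, 30, 2)]
merged = merge_regions(regions, threshold=2)
split = split_regions(merged, max_nr_regions=10, rng=random.Random(0))
```

The random split point is what lets a later merge pass discover a better boundary when the access pattern has moved.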
> +
> +Dynamic Target Space Updates Handling
> +-------------------------------------
> +
> +The monitoring target address range could be dynamically changed. For example,
> +virtual memory could be dynamically mapped and unmapped. Physical memory could
> +be hot-plugged.
> +
> +As the changes could be quite frequent in some cases, DAMON checks the dynamic
> +memory mapping changes and applies them to the abstracted target area only
> +once per user-specified time interval (``regions update interval``).
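
How the three intervals relate can be shown with a simulated clock. This is a Python sketch only: `run` is a hypothetical name, the interval values are arbitrary examples in microseconds, and the real monitoring thread does actual work where this model merely records an event.

```python
def run(total_us, sampling_us, aggregation_us, update_us):
    """Record which step fires at each tick of a simulated clock."""
    events = []
    now = last_aggregation = last_update = 0
    while now < total_us:
        now += sampling_us               # stands in for sleeping one interval
        events.append("sample")
        if now - last_aggregation >= aggregation_us:
            events.append("aggregate")
            last_aggregation = now
        if now - last_update >= update_us:
            events.append("update_regions")
            last_update = now
    return events


events = run(total_us=1000000, sampling_us=5000,
             aggregation_us=100000, update_us=1000000)
```

The target area is refreshed far less often than it is sampled, which keeps the cost of tracking mapping changes low.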
> diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
> index e8d943b21cf9..30813498c74d 100644
> --- a/Documentation/vm/index.rst
> +++ b/Documentation/vm/index.rst
> @@ -31,6 +31,7 @@ descriptions of data structures and algorithms.
> active_mm
> balance
> cleancache
> + damon/index
> frontswap
> highmem
> hmm
SeongJae Park <[email protected]> wrote:
> From: SeongJae Park <[email protected]>
>
> This commit introduces a reference implementation of the address space
> specific low level primitives for the virtual address space, so that
> users of DAMON can easily monitor the data accesses on virtual address
> spaces of specific processes by simply configuring the implementation to
> be used by DAMON.
>
> The low level primitives for the fundamental access monitoring are
> defined in two parts:
> 1. Identification of the monitoring target address range for the address
> space.
> 2. Access check of specific address range in the target space.
>
> The reference implementation for the virtual address space provided by
> this commit is designed as below.
>
> PTE Accessed-bit Based Access Check
> -----------------------------------
>
> The implementation uses PTE Accessed-bit for basic access checks. That
> is, it clears the bit for the next sampling target page and checks whether
> it is set again after one sampling period. To avoid disturbing other
> Accessed bit users such as the reclamation logic, the implementation
> adjusts ``PG_idle`` and ``PG_young`` appropriately, in the same way as
> 'Idle Page Tracking'.
>
> VMA-based Target Address Range Construction
> -------------------------------------------
>
> Only small parts in the super-huge virtual address space of the
> processes are mapped to physical memory and accessed. Thus, tracking
> the unmapped address regions is just wasteful. However, because DAMON
> can deal with some level of noise using the adaptive regions adjustment
> mechanism, tracking every mapping is not strictly required but could
> even incur a high overhead in some cases. That said, excessively large
> unmapped areas inside the monitoring target should be removed so that the
> adaptive mechanism does not spend time on them.
>
> For this reason, this implementation converts the complex mappings to
> three distinct regions that cover every mapped area of the address
> space. Also, the two gaps between the three regions are the two biggest
> unmapped areas in the given address space. The two biggest unmapped
> areas would be the gap between the heap and the uppermost mmap()-ed
> region, and the gap between the lowermost mmap()-ed region and the stack
> in most of the cases. Because these gaps are exceptionally huge in
> usual address spaces, excluding these will be sufficient to make a
> reasonable trade-off. Below shows this in detail::
>
> <heap>
> <BIG UNMAPPED REGION 1>
> <uppermost mmap()-ed region>
> (small mmap()-ed regions and munmap()-ed regions)
> <lowermost mmap()-ed region>
> <BIG UNMAPPED REGION 2>
> <stack>
>
> Signed-off-by: SeongJae Park <[email protected]>
> Reviewed-by: Leonard Foerster <[email protected]>
> ---
> include/linux/damon.h | 6 +
> mm/damon.c | 474 ++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 480 insertions(+)
>
> diff --git a/include/linux/damon.h b/include/linux/damon.h
> index 3c0b92a679e8..310d36d123b3 100644
> --- a/include/linux/damon.h
> +++ b/include/linux/damon.h
> @@ -144,6 +144,12 @@ struct damon_ctx {
> void (*aggregate_cb)(struct damon_ctx *context);
> };
>
> +/* Reference callback implementations for virtual memory */
> +void kdamond_init_vm_regions(struct damon_ctx *ctx);
> +void kdamond_update_vm_regions(struct damon_ctx *ctx);
> +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx);
> +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx);
> +
> int damon_set_pids(struct damon_ctx *ctx, int *pids, ssize_t nr_pids);
> int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
> unsigned long aggr_int, unsigned long regions_update_int,
> diff --git a/mm/damon.c b/mm/damon.c
> index b844924b9fdb..386780739007 100644
> --- a/mm/damon.c
> +++ b/mm/damon.c
> @@ -9,6 +9,9 @@
> * This file is constructed in below parts.
> *
> * - Functions and macros for DAMON data structures
> + * - Functions for the initial monitoring target regions construction
> + * - Functions for the dynamic monitoring target regions update
> + * - Functions for the access checking of the regions
> * - Functions for DAMON core logics and features
> * - Functions for the DAMON programming interface
> * - Functions for the module loading/unloading
> @@ -196,6 +199,477 @@ static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
> return sz;
> }
>
> +/*
> + * Get the mm_struct of the given task
> + *
> + * Caller _must_ put the mm_struct after use, unless it is NULL.
> + *
> + * Returns the mm_struct of the task on success, NULL on failure
> + */
> +static struct mm_struct *damon_get_mm(struct damon_task *t)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> +
> + task = damon_get_task_struct(t);
> + if (!task)
> + return NULL;
> +
> + mm = get_task_mm(task);
> + put_task_struct(task);
> + return mm;
> +}
> +
> +/*
> + * Functions for the initial monitoring target regions construction
> + */
> +
> +/*
> + * Size-evenly split a region into 'nr_pieces' small regions
> + *
> + * Returns 0 on success, or negative error code otherwise.
> + */
> +static int damon_split_region_evenly(struct damon_ctx *ctx,
> + struct damon_region *r, unsigned int nr_pieces)
> +{
> + unsigned long sz_orig, sz_piece, orig_end;
> + struct damon_region *n = NULL, *next;
> + unsigned long start;
> +
> + if (!r || !nr_pieces)
> + return -EINVAL;
> +
> + orig_end = r->ar.end;
> + sz_orig = r->ar.end - r->ar.start;
> + sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, MIN_REGION);
> +
> + if (!sz_piece)
> + return -EINVAL;
> +
> + r->ar.end = r->ar.start + sz_piece;
> + next = damon_next_region(r);
> + for (start = r->ar.end; start + sz_piece <= orig_end;
> + start += sz_piece) {
> + n = damon_new_region(start, start + sz_piece);
> + if (!n)
> + return -ENOMEM;
> + damon_insert_region(n, r, next);
> + r = n;
> + }
> + /* complement last region for possible rounding error */
> + if (n)
> + n->ar.end = orig_end;
> +
> + return 0;
> +}
> +
> +static unsigned long sz_range(struct damon_addr_range *r)
> +{
> + return r->end - r->start;
> +}
> +
> +static void swap_ranges(struct damon_addr_range *r1,
> + struct damon_addr_range *r2)
> +{
> + struct damon_addr_range tmp;
> +
> + tmp = *r1;
> + *r1 = *r2;
> + *r2 = tmp;
> +}
> +
> +/*
> + * Find three regions separated by two biggest unmapped regions
> + *
> + * vma the head vma of the target address space
> + * regions an array of three address ranges that results will be saved
> + *
> + * This function receives an address space and finds three regions in it which
> + * separated by the two biggest unmapped regions in the space. Please refer to
> + * below comments of 'damon_init_vm_regions_of()' function to know why this is
> + * necessary.
> + *
> + * Returns 0 if success, or negative error code otherwise.
> + */
> +static int damon_three_regions_in_vmas(struct vm_area_struct *vma,
> + struct damon_addr_range regions[3])
> +{
> + struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0};
> + struct vm_area_struct *last_vma = NULL;
> + unsigned long start = 0;
> + struct rb_root rbroot;
> +
> + /* Find two biggest gaps so that first_gap > second_gap > others */
> + for (; vma; vma = vma->vm_next) {
> + if (!last_vma) {
> + start = vma->vm_start;
> + goto next;
> + }
> +
> + if (vma->rb_subtree_gap <= sz_range(&second_gap)) {
> + rbroot.rb_node = &vma->vm_rb;
> + vma = rb_entry(rb_last(&rbroot),
> + struct vm_area_struct, vm_rb);
> + goto next;
> + }
> +
> + gap.start = last_vma->vm_end;
> + gap.end = vma->vm_start;
> + if (sz_range(&gap) > sz_range(&second_gap)) {
> + swap_ranges(&gap, &second_gap);
> + if (sz_range(&second_gap) > sz_range(&first_gap))
> + swap_ranges(&second_gap, &first_gap);
> + }
> +next:
> + last_vma = vma;
> + }
> +
> + if (!sz_range(&second_gap) || !sz_range(&first_gap))
> + return -EINVAL;
> +
> + /* Sort the two biggest gaps by address */
> + if (first_gap.start > second_gap.start)
> + swap_ranges(&first_gap, &second_gap);
> +
> + /* Store the result */
> + regions[0].start = ALIGN(start, MIN_REGION);
> + regions[0].end = ALIGN(first_gap.start, MIN_REGION);
> + regions[1].start = ALIGN(first_gap.end, MIN_REGION);
> + regions[1].end = ALIGN(second_gap.start, MIN_REGION);
> + regions[2].start = ALIGN(second_gap.end, MIN_REGION);
> + regions[2].end = ALIGN(last_vma->vm_end, MIN_REGION);
> +
> + return 0;
> +}
> +
> +/*
> + * Get the three regions in the given task
> + *
> + * Returns 0 on success, negative error code otherwise.
> + */
> +static int damon_three_regions_of(struct damon_task *t,
> + struct damon_addr_range regions[3])
> +{
> + struct mm_struct *mm;
> + int rc;
> +
> + mm = damon_get_mm(t);
> + if (!mm)
> + return -EINVAL;
> +
> + down_read(&mm->mmap_sem);
> + rc = damon_three_regions_in_vmas(mm->mmap, regions);
> + up_read(&mm->mmap_sem);
> +
> + mmput(mm);
> + return rc;
> +}
> +
> +/*
> + * Initialize the monitoring target regions for the given task
> + *
> + * t the given target task
> + *
> + * Because only a number of small portions of the entire address space
> + * is actually mapped to the memory and accessed, monitoring the unmapped
> + * regions is wasteful. That said, because we can deal with small noises,
> + * tracking every mapping is not strictly required but could even incur a high
> + * overhead if the mapping frequently changes or the number of mappings is
> + * high. The adaptive regions adjustment mechanism will further help to deal
> + * with the noise by simply identifying the unmapped areas as a region that
> + * has no access. Moreover, applying the real mappings that would have many
> + * unmapped areas inside will make the adaptive mechanism quite complex. That
> + * said, too huge unmapped areas inside the monitoring target should be removed
> + * to not take the time for the adaptive mechanism.
> + *
> + * For the reason, we convert the complex mappings to three distinct regions
> + * that cover every mapped area of the address space. Also the two gaps
> + * between the three regions are the two biggest unmapped areas in the given
> + * address space. In detail, this function first identifies the start and the
> + * end of the mappings and the two biggest unmapped areas of the address space.
> + * Then, it constructs the three regions as below:
> + *
> + * [mappings[0]->start, big_two_unmapped_areas[0]->start)
> + * [big_two_unmapped_areas[0]->end, big_two_unmapped_areas[1]->start)
> + * [big_two_unmapped_areas[1]->end, mappings[nr_mappings - 1]->end)
> + *
> + * As usual memory map of processes is as below, the gap between the heap and
> + * the uppermost mmap()-ed region, and the gap between the lowermost mmap()-ed
> + * region and the stack will be two biggest unmapped regions. Because these
> + * gaps are exceptionally huge areas in usual address space, excluding these
> + * two biggest unmapped regions will be sufficient to make a trade-off.
> + *
> + * <heap>
> + * <BIG UNMAPPED REGION 1>
> + * <uppermost mmap()-ed region>
> + * (other mmap()-ed regions and small unmapped regions)
> + * <lowermost mmap()-ed region>
> + * <BIG UNMAPPED REGION 2>
> + * <stack>
> + */
> +static void damon_init_vm_regions_of(struct damon_ctx *c, struct damon_task *t)
> +{
> + struct damon_region *r;
> + struct damon_addr_range regions[3];
> + unsigned long sz = 0, nr_pieces;
> + int i;
> +
> + if (damon_three_regions_of(t, regions)) {
> + pr_err("Failed to get three regions of task %d\n", t->pid);
> + return;
> + }
> +
> + for (i = 0; i < 3; i++)
> + sz += regions[i].end - regions[i].start;
> + if (c->min_nr_regions)
> + sz /= c->min_nr_regions;
> + if (sz < MIN_REGION)
> + sz = MIN_REGION;
> +
> + /* Set the initial three regions of the task */
> + for (i = 0; i < 3; i++) {
> + r = damon_new_region(regions[i].start, regions[i].end);
> + if (!r) {
> + pr_err("%d'th init region creation failed\n", i);
> + return;
> + }
> + damon_add_region(r, t);
> +
> + nr_pieces = (regions[i].end - regions[i].start) / sz;
> + damon_split_region_evenly(c, r, nr_pieces);
> + }
> +}
> +
> +/* Initialize '->regions_list' of every task */
> +void kdamond_init_vm_regions(struct damon_ctx *ctx)
> +{
> + struct damon_task *t;
> +
> + damon_for_each_task(t, ctx) {
> + /* the user may set the target regions as they want */
> + if (!nr_damon_regions(t))
> + damon_init_vm_regions_of(ctx, t);
> + }
> +}
> +
> +/*
> + * Functions for the dynamic monitoring target regions update
> + */
> +
> +/*
> + * Check whether a region is intersecting an address range
> + *
> + * Returns true if it is.
> + */
> +static bool damon_intersect(struct damon_region *r, struct damon_addr_range *re)
> +{
> + return !(r->ar.end <= re->start || re->end <= r->ar.start);
> +}
> +
> +/*
> + * Update damon regions for the three big regions of the given task
> + *
> + * t the given task
> + * bregions the three big regions of the task
> + */
> +static void damon_apply_three_regions(struct damon_ctx *ctx,
> + struct damon_task *t, struct damon_addr_range bregions[3])
> +{
> + struct damon_region *r, *next;
> + unsigned int i = 0;
> +
> + /* Remove regions which are not in the three big regions now */
> + damon_for_each_region_safe(r, next, t) {
> + for (i = 0; i < 3; i++) {
> + if (damon_intersect(r, &bregions[i]))
> + break;
> + }
> + if (i == 3)
> + damon_destroy_region(r);
> + }
> +
> + /* Adjust intersecting regions to fit with the three big regions */
> + for (i = 0; i < 3; i++) {
> + struct damon_region *first = NULL, *last;
> + struct damon_region *newr;
> + struct damon_addr_range *br;
> +
> + br = &bregions[i];
> + /* Get the first and last regions which intersects with br */
> + damon_for_each_region(r, t) {
> + if (damon_intersect(r, br)) {
> + if (!first)
> + first = r;
> + last = r;
> + }
> + if (r->ar.start >= br->end)
> + break;
> + }
> + if (!first) {
> + /* no damon_region intersects with this big region */
> + newr = damon_new_region(
> + ALIGN_DOWN(br->start, MIN_REGION),
> + ALIGN(br->end, MIN_REGION));
> + if (!newr)
> + continue;
> + damon_insert_region(newr, damon_prev_region(r), r);
> + } else {
> + first->ar.start = ALIGN_DOWN(br->start, MIN_REGION);
> + last->ar.end = ALIGN(br->end, MIN_REGION);
> + }
> + }
> +}
> +
> +/*
> + * Update regions for current memory mappings
> + */
> +void kdamond_update_vm_regions(struct damon_ctx *ctx)
> +{
> + struct damon_addr_range three_regions[3];
> + struct damon_task *t;
> +
> + damon_for_each_task(t, ctx) {
> + if (damon_three_regions_of(t, three_regions))
> + continue;
> + damon_apply_three_regions(ctx, t, three_regions);
> + }
> +}
> +
> +/*
> + * Functions for the access checking of the regions
> + */
> +
> +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> +{
> + pte_t *pte = NULL;
> + pmd_t *pmd = NULL;
> + spinlock_t *ptl;
> +
> + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> + return;
> +
> + if (pte) {
> + if (pte_young(*pte)) {
> + clear_page_idle(pte_page(*pte));
> + set_page_young(pte_page(*pte));
While this compiles without support for PG_young and PG_idle, I assume
it won't work well because it'd clear pte.young without setting
PG_young. And this would mess with vmscan.
So this code appears to depend on PG_young and PG_idle, which are
currently only available via CONFIG_IDLE_PAGE_TRACKING. DAMON could
depend on CONFIG_IDLE_PAGE_TRACKING via Kconfig. But I assume that
CONFIG_IDLE_PAGE_TRACKING and CONFIG_DAMON cannot be concurrently used
because they'll stomp on each other's use of pte.young, PG_young,
PG_idle.
So I suspect we want:
1. CONFIG_DAMON to depend on !CONFIG_IDLE_PAGE_TRACKING and vice versa.
2. PG_young, PG_idle and related helpers to depend on
CONFIG_DAMON||CONFIG_IDLE_PAGE_TRACKING.
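To make the concern concrete, here is a small user-space model, not kernel code. It assumes the page-flag helpers reduce to empty stubs when the option is off, which is what lets the patch compile in that configuration. All `toy_*` names and the `TOY_PAGE_FLAGS` macro are hypothetical stand-ins invented for this sketch.

```c
#include <stdbool.h>

/* Toy stand-ins for the PG_young page flag and a PTE accessed bit. */
struct toy_page {
	bool pg_young;
};

struct toy_pte {
	bool young;
	struct toy_page *page;
};

/*
 * TOY_PAGE_FLAGS is a hypothetical macro playing the role of
 * CONFIG_IDLE_PAGE_TRACKING. Without it, the helper is an empty stub.
 */
#ifdef TOY_PAGE_FLAGS
static void toy_set_page_young(struct toy_page *p)
{
	p->pg_young = true;
}
#else
static void toy_set_page_young(struct toy_page *p)
{
	(void)p;	/* no page flag available: nothing is recorded */
}
#endif

/* Same shape as the PTE branch of damon_mkold() in the patch. */
static void toy_mkold(struct toy_pte *pte)
{
	if (pte->young)
		toy_set_page_young(pte->page);
	pte->young = false;
}

/* What a reclaim-like reader can still learn about a past access. */
static bool toy_was_referenced(const struct toy_pte *pte)
{
	return pte->young || pte->page->pg_young;
}
```

Built without `TOY_PAGE_FLAGS`, a PTE that was young before `toy_mkold()` is reported as unreferenced afterwards, so the access information is lost. Built with `-DTOY_PAGE_FLAGS`, the same sequence still reports the page as referenced.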
> + }
> + *pte = pte_mkold(*pte);
> + pte_unmap_unlock(pte, ptl);
> + return;
> + }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + if (pmd_young(*pmd)) {
> + clear_page_idle(pmd_page(*pmd));
> + set_page_young(pmd_page(*pmd));
> + }
> + *pmd = pmd_mkold(*pmd);
> + spin_unlock(ptl);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +}
> +
> +static void damon_prepare_vm_access_check(struct damon_ctx *ctx,
> + struct mm_struct *mm, struct damon_region *r)
> +{
> + r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
> +
> + damon_mkold(mm, r->sampling_addr);
> +}
> +
> +void kdamond_prepare_vm_access_checks(struct damon_ctx *ctx)
> +{
> + struct damon_task *t;
> + struct mm_struct *mm;
> + struct damon_region *r;
> +
> + damon_for_each_task(t, ctx) {
> + mm = damon_get_mm(t);
> + if (!mm)
> + continue;
> + damon_for_each_region(r, t)
> + damon_prepare_vm_access_check(ctx, mm, r);
> + mmput(mm);
> + }
> +}
> +
> +static bool damon_young(struct mm_struct *mm, unsigned long addr,
> + unsigned long *page_sz)
> +{
> + pte_t *pte = NULL;
> + pmd_t *pmd = NULL;
> + spinlock_t *ptl;
> + bool young = false;
> +
> + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> + return false;
> +
> + *page_sz = PAGE_SIZE;
> + if (pte) {
> + young = pte_young(*pte);
> + pte_unmap_unlock(pte, ptl);
> + return young;
> + }
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + young = pmd_young(*pmd);
> + spin_unlock(ptl);
> + *page_sz = ((1UL) << HPAGE_PMD_SHIFT);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> + return young;
> +}
> +
> +/*
> + * Check whether the region was accessed after the last preparation
> + *
> + * mm 'mm_struct' for the given virtual address space
> + * r the region to be checked
> + */
> +static void damon_check_vm_access(struct damon_ctx *ctx,
> + struct mm_struct *mm, struct damon_region *r)
> +{
> + static struct mm_struct *last_mm;
> + static unsigned long last_addr;
> + static unsigned long last_page_sz = PAGE_SIZE;
> + static bool last_accessed;
> +
> + /* If the region is in the last checked page, reuse the result */
> + if (mm == last_mm && (ALIGN_DOWN(last_addr, last_page_sz) ==
> + ALIGN_DOWN(r->sampling_addr, last_page_sz))) {
> + if (last_accessed)
> + r->nr_accesses++;
> + return;
> + }
> +
> + last_accessed = damon_young(mm, r->sampling_addr, &last_page_sz);
> + if (last_accessed)
> + r->nr_accesses++;
> +
> + last_mm = mm;
> + last_addr = r->sampling_addr;
> +}
> +
> +unsigned int kdamond_check_vm_accesses(struct damon_ctx *ctx)
> +{
> + struct damon_task *t;
> + struct mm_struct *mm;
> + struct damon_region *r;
> + unsigned int max_nr_accesses = 0;
> +
> + damon_for_each_task(t, ctx) {
> + mm = damon_get_mm(t);
> + if (!mm)
> + continue;
> + damon_for_each_region(r, t) {
> + damon_check_vm_access(ctx, mm, r);
> + max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
> + }
> + mmput(mm);
> + }
> +
> + return max_nr_accesses;
> +}
> +
> /*
> * Functions for DAMON core logics and features
> */
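The gap selection in damon_three_regions_in_vmas() quoted above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: the vma list is replaced by a sorted array of non-overlapping ranges, the rb_subtree_gap shortcut and the MIN_REGION alignment are left out, and `struct addr_range` and `three_regions()` are hypothetical names that do not appear in the patch.

```c
/* Half-open address range [start, end), like struct damon_addr_range. */
struct addr_range {
	unsigned long start;
	unsigned long end;
};

static unsigned long sz_range(const struct addr_range *r)
{
	return r->end - r->start;
}

static void swap_ranges(struct addr_range *a, struct addr_range *b)
{
	struct addr_range tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Hypothetical stand-in for damon_three_regions_in_vmas().
 * 'maps' must be sorted by address and non-overlapping.
 * Returns 0 on success, -1 if there are fewer than two gaps.
 */
static int three_regions(const struct addr_range *maps, int nr_maps,
			 struct addr_range regions[3])
{
	struct addr_range gap, first_gap = {0, 0}, second_gap = {0, 0};
	int i;

	if (nr_maps < 1)
		return -1;

	/* Keep the two biggest gaps so that first_gap >= second_gap */
	for (i = 1; i < nr_maps; i++) {
		gap.start = maps[i - 1].end;
		gap.end = maps[i].start;
		if (sz_range(&gap) > sz_range(&second_gap)) {
			swap_ranges(&gap, &second_gap);
			if (sz_range(&second_gap) > sz_range(&first_gap))
				swap_ranges(&second_gap, &first_gap);
		}
	}
	if (!sz_range(&second_gap) || !sz_range(&first_gap))
		return -1;

	/* Sort the two biggest gaps by address */
	if (first_gap.start > second_gap.start)
		swap_ranges(&first_gap, &second_gap);

	/* The three regions are what remains around the two gaps */
	regions[0].start = maps[0].start;
	regions[0].end = first_gap.start;
	regions[1].start = first_gap.end;
	regions[1].end = second_gap.start;
	regions[2].start = second_gap.end;
	regions[2].end = maps[nr_maps - 1].end;
	return 0;
}
```

For a layout with a heap at [0x1000, 0x5000), two mmap()-ed areas at [0x100000, 0x102000) and [0x103000, 0x104000), and a stack at [0x900000, 0x921000), the sketch yields the heap, the merged mmap area [0x100000, 0x104000), and the stack. The small hole between the two mmap()-ed areas stays inside the middle region, which is the trade-off the patch comment describes.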
On Mon, 27 Jul 2020 00:19:00 -0700 Greg Thelen <[email protected]> wrote:
> SeongJae Park <[email protected]> wrote:
>
> > From: SeongJae Park <[email protected]>
> >
> > This commit adds documents for DAMON under
> > `Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.
> >
> > Signed-off-by: SeongJae Park <[email protected]>
> > ---
> > Documentation/admin-guide/mm/damon/guide.rst | 157 ++++++++++
> > Documentation/admin-guide/mm/damon/index.rst | 15 +
> > Documentation/admin-guide/mm/damon/plans.rst | 29 ++
> > Documentation/admin-guide/mm/damon/start.rst | 98 ++++++
> > Documentation/admin-guide/mm/damon/usage.rst | 298 +++++++++++++++++++
> > Documentation/admin-guide/mm/index.rst | 1 +
> > Documentation/vm/damon/api.rst | 20 ++
> > Documentation/vm/damon/eval.rst | 222 ++++++++++++++
> > Documentation/vm/damon/faq.rst | 59 ++++
> > Documentation/vm/damon/index.rst | 32 ++
> > Documentation/vm/damon/mechanisms.rst | 165 ++++++++++
> > Documentation/vm/index.rst | 1 +
> > 12 files changed, 1097 insertions(+)
> > create mode 100644 Documentation/admin-guide/mm/damon/guide.rst
> > create mode 100644 Documentation/admin-guide/mm/damon/index.rst
> > create mode 100644 Documentation/admin-guide/mm/damon/plans.rst
> > create mode 100644 Documentation/admin-guide/mm/damon/start.rst
> > create mode 100644 Documentation/admin-guide/mm/damon/usage.rst
> > create mode 100644 Documentation/vm/damon/api.rst
> > create mode 100644 Documentation/vm/damon/eval.rst
> > create mode 100644 Documentation/vm/damon/faq.rst
> > create mode 100644 Documentation/vm/damon/index.rst
> > create mode 100644 Documentation/vm/damon/mechanisms.rst
> >+
[...]
> > +===============
> > +Detailed Usages
> > +===============
> > +
> > +DAMON provides below three interfaces for different users.
> > +
> > +- *DAMON user space tool.*
> > + This is for privileged people such as system administrators who want a
> > + just-working human-friendly interface. Using this, users can use the DAMON’s
> > + major features in a human-friendly way. It may not be highly tuned for
> > + special cases, though. It supports only virtual address spaces monitoring.
> > +- *debugfs interface.*
> > + This is for privileged user space programmers who want more optimized use of
> > + DAMON. Using this, users can use DAMON’s major features by reading
> > + from and writing to special debugfs files. Therefore, you can write and use
> > + your personalized DAMON debugfs wrapper programs that reads/writes the
> > + debugfs files instead of you. The DAMON user space tool is also a reference
> > + implementation of such programs. It supports only virtual address spaces
> > + monitoring.
> > +- *Kernel Space Programming Interface.*
> > + This is for kernel space programmers. Using this, users can utilize every
> > + feature of DAMON most flexibly and efficiently by writing kernel space
> > + DAMON application programs for you. You can even extend DAMON for various
> > + address spaces.
> > +
> > +This document does not describe the kernel space programming interface in
> > +detail. For that, please refer to the :doc:`/vm/damon/api`.
> > +
> > +
> > +DAMON User Sapce Tool
>
> Space
Right, thanks!
>
[...]
> > +
> > +Can 'idle pages tracking' or 'perf mem' substitute DAMON?
> > +=========================================================
> > +
> > +Idle page tracking is a low level primitive for access check of the physical
> > +address space. 'perf mem' is similar, though it can use sampling to minimize
> > +the overhead. On the other hand, DAMON is a higher-level framework for the
> > +monitoring of various address spaces. It is focused on memory management
> > +optimization and provides sophisticated accuracy/overhead handling mechanisms.
> > +Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
> > +DAMON's output, but cannot substitute DAMON. Rather than that, thouse could be
>
> those?
Good eye! I will fix both in the next spin.
Thanks,
SeongJae Park
On Mon, 27 Jul 2020 00:34:54 -0700 Greg Thelen <[email protected]> wrote:
> SeongJae Park <[email protected]> wrote:
>
> > From: SeongJae Park <[email protected]>
> >
> > This commit introduces a reference implementation of the address space
> > specific low level primitives for the virtual address space, so that
> > users of DAMON can easily monitor the data accesses on virtual address
> > spaces of specific processes by simply configuring the implementation to
> > be used by DAMON.
[...]
> > diff --git a/mm/damon.c b/mm/damon.c
> > index b844924b9fdb..386780739007 100644
> > --- a/mm/damon.c
> > +++ b/mm/damon.c
> > @@ -9,6 +9,9 @@
[...]
> > +/*
> > + * Functions for the access checking of the regions
> > + */
> > +
> > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > +{
> > + pte_t *pte = NULL;
> > + pmd_t *pmd = NULL;
> > + spinlock_t *ptl;
> > +
> > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > + return;
> > +
> > + if (pte) {
> > + if (pte_young(*pte)) {
> > + clear_page_idle(pte_page(*pte));
> > + set_page_young(pte_page(*pte));
>
> While this compiles without support for PG_young and PG_idle, I assume
> it won't work well because it'd clear pte.young without setting
> PG_young. And this would mess with vmscan.
You're right, thanks for catching this! This definitely needs to be fixed in
the next spin.
>
> So this code appears to depend on PG_young and PG_idle, which are
> currently only available via CONFIG_IDLE_PAGE_TRACKING. DAMON could
> depend on CONFIG_IDLE_PAGE_TRACKING via Kconfig. But I assume that
> CONFIG_IDLE_PAGE_TRACKING and CONFIG_DAMON cannot be concurrently used
> because they'll stomp on each other's use of pte.young, PG_young,
> PG_idle.
> So I suspect we want:
> 1. CONFIG_DAMON to depend on !CONFIG_IDLE_PAGE_TRACKING and vice versa.
> 2. PG_young, PG_idle and related helpers to depend on
> CONFIG_DAMON||CONFIG_IDLE_PAGE_TRACKING.
Awesome insights and suggestions, thanks!
I would like to note that DAMON could be interfered with by IDLE_PAGE_TRACKING
and vmscan, but not vice versa, as DAMON respects PG_idle and PG_young. This
design came from DAMON's relaxed accuracy goal. DAMON aims to provide not
perfect monitoring but only best-effort accuracy that would be sufficient for
performance-centric DRAM-level memory management. So, at that time, I thought
that interference from IDLE_PAGE_TRACKING and the reclaim logic would not be a
real problem, while letting IDLE_PAGE_TRACKING coexist would be somehow
beneficial. That said, I have not found a real benefit of the coexistence yet,
and the interference problem now seems bigger, as we will support more cases
including page granularity.
Maybe we could make IDLE_PAGE_TRACKING and DAMON coexist but be mutually
exclusive at runtime, if the benefit of coexistence turns out to be big.
However, I would like to keep it simple first and optimize the case later if a
real requirement is found.
So, I will implement your suggestions in the next spin. If you have different
opinions, please feel free to comment.
Thanks,
SeongJae Park
>
> > + }
> > + *pte = pte_mkold(*pte);
> > + pte_unmap_unlock(pte, ptl);
> > + return;
> > + }
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > + if (pmd_young(*pmd)) {
> > + clear_page_idle(pmd_page(*pmd));
> > + set_page_young(pmd_page(*pmd));
> > + }
> > + *pmd = pmd_mkold(*pmd);
> > + spin_unlock(ptl);
> > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +}
> > +
[...]
On Mon, Jul 27, 2020 at 2:03 AM SeongJae Park <[email protected]> wrote:
>
> On Mon, 27 Jul 2020 00:34:54 -0700 Greg Thelen <[email protected]> wrote:
>
> > SeongJae Park <[email protected]> wrote:
> >
> > > From: SeongJae Park <[email protected]>
> > >
> > > This commit introduces a reference implementation of the address space
> > > specific low level primitives for the virtual address space, so that
> > > users of DAMON can easily monitor the data accesses on virtual address
> > > spaces of specific processes by simply configuring the implementation to
> > > be used by DAMON.
> [...]
> > > diff --git a/mm/damon.c b/mm/damon.c
> > > index b844924b9fdb..386780739007 100644
> > > --- a/mm/damon.c
> > > +++ b/mm/damon.c
> > > @@ -9,6 +9,9 @@
> [...]
> > > +/*
> > > + * Functions for the access checking of the regions
> > > + */
> > > +
> > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + pte_t *pte = NULL;
> > > + pmd_t *pmd = NULL;
> > > + spinlock_t *ptl;
> > > +
> > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > + return;
> > > +
> > > + if (pte) {
> > > + if (pte_young(*pte)) {
> > > + clear_page_idle(pte_page(*pte));
> > > + set_page_young(pte_page(*pte));
> >
> > While this compiles without support for PG_young and PG_idle, I assume
> > it won't work well because it'd clear pte.young without setting
> > PG_young. And this would mess with vmscan.
>
> You're right, thanks for catching this! This definitely needs to be fixed in
> the next spin.
>
> >
> > So this code appears to depend on PG_young and PG_idle, which are
> > currently only available via CONFIG_IDLE_PAGE_TRACKING. DAMON could
> > depend on CONFIG_IDLE_PAGE_TRACKING via Kconfig. But I assume that
> > CONFIG_IDLE_PAGE_TRACKING and CONFIG_DAMON cannot be concurrently used
> > because they'll stomp on each other's use of pte.young, PG_young,
> > PG_idle.
> > So I suspect we want:
> > 1. CONFIG_DAMON to depend on !CONFIG_IDLE_PAGE_TRACKING and vice versa.
> > 2. PG_young, PG_idle and related helpers to depend on
> > CONFIG_DAMON||CONFIG_IDLE_PAGE_TRACKING.
>
> Awesome insights and suggestions, thanks!
>
> I would like to note that DAMON could be interfered with by
> IDLE_PAGE_TRACKING and vmscan, but not vice versa, as DAMON respects PG_idle
> and PG_young. This design came from DAMON's relaxed accuracy goal. DAMON aims
> to provide not perfect monitoring but only best-effort accuracy that would be
> sufficient for performance-centric DRAM-level memory management. So, at that
> time, I thought that interference from IDLE_PAGE_TRACKING and the reclaim
> logic would not be a real problem, while letting IDLE_PAGE_TRACKING coexist
> would be somehow beneficial. That said, I have not found a real benefit of
> the coexistence yet, and the interference problem now seems bigger, as we
> will support more cases including page granularity.
>
> Maybe we could make IDLE_PAGE_TRACKING and DAMON coexist but be mutually
> exclusive at runtime, if the benefit of coexistence turns out to be big.
> However, I would like to keep it simple first and optimize the case later if
> a real requirement is found.
If you are planning to have support for tracking at page granularity
and physical memory monitoring in DAMON then I don't see any benefit
of coexistence of DAMON with IDLE_PAGE_TRACKING. Though I will not
push you to go that route if the code with coexistence is simple
enough.
On Tue, 28 Jul 2020 10:42:11 -0700 Shakeel Butt <[email protected]> wrote:
> On Mon, Jul 27, 2020 at 2:03 AM SeongJae Park <[email protected]> wrote:
> >
> > On Mon, 27 Jul 2020 00:34:54 -0700 Greg Thelen <[email protected]> wrote:
> >
> > > SeongJae Park <[email protected]> wrote:
> > >
> > > > From: SeongJae Park <[email protected]>
> > > >
> > > > This commit introduces a reference implementation of the address space
> > > > specific low level primitives for the virtual address space, so that
> > > > users of DAMON can easily monitor the data accesses on virtual address
> > > > spaces of specific processes by simply configuring the implementation to
> > > > be used by DAMON.
> > [...]
> > > > diff --git a/mm/damon.c b/mm/damon.c
> > > > index b844924b9fdb..386780739007 100644
> > > > --- a/mm/damon.c
> > > > +++ b/mm/damon.c
> > > > @@ -9,6 +9,9 @@
> > [...]
> > > > +/*
> > > > + * Functions for the access checking of the regions
> > > > + */
> > > > +
> > > > +static void damon_mkold(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + pte_t *pte = NULL;
> > > > + pmd_t *pmd = NULL;
> > > > + spinlock_t *ptl;
> > > > +
> > > > + if (follow_pte_pmd(mm, addr, NULL, &pte, &pmd, &ptl))
> > > > + return;
> > > > +
> > > > + if (pte) {
> > > > + if (pte_young(*pte)) {
> > > > + clear_page_idle(pte_page(*pte));
> > > > + set_page_young(pte_page(*pte));
> > >
> > > While this compiles without support for PG_young and PG_idle, I assume
> > > it won't work well because it'd clear pte.young without setting
> > > PG_young. And this would mess with vmscan.
> >
> > You're right, thanks for catching this! This definitely needs to be fixed in
> > the next spin.
> >
> > >
> > > So this code appears to depend on PG_young and PG_idle, which are
> > > currently only available via CONFIG_IDLE_PAGE_TRACKING. DAMON could
> > > depend on CONFIG_IDLE_PAGE_TRACKING via Kconfig. But I assume that
> > > CONFIG_IDLE_PAGE_TRACKING and CONFIG_DAMON cannot be concurrently used
> > > because they'll stomp on each other's use of pte.young, PG_young,
> > > PG_idle.
> > > So I suspect we want:
> > > 1. CONFIG_DAMON to depend on !CONFIG_IDLE_PAGE_TRACKING and vice versa.
> > > 2. PG_young, PG_idle and related helpers to depend on
> > > CONFIG_DAMON||CONFIG_IDLE_PAGE_TRACKING.
> >
> > Awesome insights and suggestions, thanks!
> >
> > I would like to note that DAMON could be interfered with by
> > IDLE_PAGE_TRACKING and vmscan, but not vice versa, as DAMON respects PG_idle
> > and PG_young. This design came from DAMON's relaxed accuracy goal. DAMON
> > aims to provide not perfect monitoring but only best-effort accuracy that
> > would be sufficient for performance-centric DRAM-level memory management.
> > So, at that time, I thought that interference from IDLE_PAGE_TRACKING and
> > the reclaim logic would not be a real problem, while letting
> > IDLE_PAGE_TRACKING coexist would be somehow beneficial. That said, I have
> > not found a real benefit of the coexistence yet, and the interference
> > problem now seems bigger, as we will support more cases including page
> > granularity.
> >
> > Maybe we could make IDLE_PAGE_TRACKING and DAMON coexist but be mutually
> > exclusive at runtime, if the benefit of coexistence turns out to be big.
> > However, I would like to keep it simple first and optimize the case later if
> > a real requirement is found.
>
> If you are planning to have support for tracking at page granularity
> and physical memory monitoring in DAMON then I don't see any benefit
> of coexistence of DAMON with IDLE_PAGE_TRACKING. Though I will not
> push you to go that route if the code with coexistence is simple
> enough.
Agreed, I don't see the benefit either. I have already chosen the mutually
exclusive approach :)
Thanks,
SeongJae Park
On Sat, Jul 18, 2020 at 6:31 AM SeongJae Park <[email protected]> wrote:
>
> On Fri, 17 Jul 2020 19:47:50 -0700 Shakeel Butt <[email protected]> wrote:
>
> > On Mon, Jul 13, 2020 at 1:43 AM SeongJae Park <[email protected]> wrote:
> > >
> > > From: SeongJae Park <[email protected]>
> > >
> > > DAMON is a data access monitoring framework subsystem for the Linux
> > > kernel. The core mechanisms of DAMON make it
> > >
> > > - accurate (the monitoring output is useful enough for DRAM level
> > > memory management; It might not appropriate for CPU Cache levels,
> > > though),
> > > - light-weight (the monitoring overhead is low enough to be applied
> > > online), and
> > > - scalable (the upper-bound of the overhead is in constant range
> > > regardless of the size of target workloads).
> > >
> > > Using this framework, therefore, the kernel's memory management
> > > mechanisms can make advanced decisions. Experimental memory management
> > > optimization works that incurring high data accesses monitoring overhead
> > > could implemented again. In user space, meanwhile, users who have some
> > > special workloads can write personalized applications for better
> > > understanding and optimizations of their workloads and systems.
> > >
> > > This commit is implementing only the stub for the module load/unload,
> > > basic data structures, and simple manipulation functions of the
> > > structures to keep the size of commit small. The core mechanisms of
> > > DAMON will be implemented one by one by following commits.
> > >
> > > Signed-off-by: SeongJae Park <[email protected]>
> > > Reviewed-by: Leonard Foerster <[email protected]>
> > > Reviewed-by: Varad Gautam <[email protected]>
> > > ---
> > > include/linux/damon.h | 63 ++++++++++++++
> > > mm/Kconfig | 12 +++
> > > mm/Makefile | 1 +
> > > mm/damon.c | 188 ++++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 264 insertions(+)
> > > create mode 100644 include/linux/damon.h
> > > create mode 100644 mm/damon.c
> > >
> > > diff --git a/include/linux/damon.h b/include/linux/damon.h
> > > new file mode 100644
> > > index 000000000000..c8f8c1c41a45
> > > --- /dev/null
> > > +++ b/include/linux/damon.h
> > > @@ -0,0 +1,63 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +/*
> > > + * DAMON api
> > > + *
> > > + * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
> > > + *
> > > + * Author: SeongJae Park <[email protected]>
> > > + */
> > > +
> [...]
> > > +
> > > +/**
> > > + * struct damon_task - Represents a monitoring target task.
> > > + * @pid: Process id of the task.
> > > + * @regions_list: Head of the monitoring target regions of this task.
> > > + * @list: List head for siblings.
> > > + *
> > > + * If the monitoring target address space is task independent (e.g., physical
> > > + * memory address space monitoring), @pid should be '-1'.
> > > + */
> > > +struct damon_task {
> > > + int pid;
> >
> > Storing and accessing pid like this is racy. Why not save the "struct
> > pid" after getting the reference? I am still going over the usage,
> > maybe storing mm_struct would be an even better choice.
> >
> > > + struct list_head regions_list;
> > > + struct list_head list;
> > > +};
> > > +
> [...]
> > > +
> > > +#define damon_get_task_struct(t) \
> > > + (get_pid_task(find_vpid(t->pid), PIDTYPE_PID))
> >
> > You need at least rcu lock around find_vpid(). Also you need to be
> > careful about the context. If you accept my previous suggestion then
> > you just need to do this in the process context which is registering
> > the pid (no need to worry about the pid namespace).
> >
> > I am wondering if there should be an interface to register processes
> > with DAMON using pidfd instead of integer pid.
>
> Good points! I will use pidfd for this purpose, instead.
>
> BTW, 'struct damon_task' was introduced while DAMON supports only virtual
> address spaces and recently extended to support physical memory address
> monitoring case by defining an exceptional pid (-1) for such case. I think it
> doesn't smoothly fit with the design.
>
> Therefore, I would like to change it with more general named struct, e.g.,
>
> struct damon_target {
> void *id;
> struct list_head regions_list;
> struct list_head list;
> };
>
> The 'id' field will be able to store or point pid_t, struct mm_struct, struct
> pid, or anything relevant, depending on the target address space.
>
> Only one part of the address space independent logics of DAMON, namely
> 'kdamon_need_stop()', uses '->pid' of the 'struct damon_task'. It will be
> introduced by the next patch ("mm/damon: Implement region based sampling").
> Therefore, the conversion will be easy. For the part, I could add another
> callback, e.g.,
>
> struct damon_ctx {
> [...]
> bool (*is_target_valid)(struct damon_target *t);
> };
>
> And let the address space specific primitives to implement this.
>
> Then, damon_get_task_struct() and damon_get_mm() will be introduced by the
> sixth patch ("mm/damon: Implement callbacks for the virtual memory address
> spaces") as a part of the virtual address space specific primitives
> implementation.
>
> I gonna make the change in the next spin. If you have some opinions on this,
> please let me know.
>
>
Sorry for the late response. I think the general direction you are
taking is fine but there are still some open questions. I am trying to
reason about whether 'address space' is a general enough abstraction
for different types of monitoring targets. It fits well for 'process'
targets. For physical memory, the monitoring part of the abstraction
(i.e. damon_ctx) seems fine, but I am not sure about the optimization
part (i.e. [merge|split]_regions), which raises the question of whether
the merge/split functionality should be part of the abstraction.

I am also very interested in 'cgroups' as the target, and I am not
sure if 'address space' is the right abstraction for cgroups either.
We can think of a cgroup as a combination of tasks, but a cgroup also
contains unmapped pages. So maybe it is a combination of virtual and
physical address space targets that damon can monitor, but I am still
not clear how to specify that in the abstractions provided by damon.
Anyway, these are questions for later and we can start simple with
just processes, but I would like to not expose these
abstractions/interfaces to userspace; otherwise they would be really
hard to change later.

Another topic I want to discuss is managing/charging the resource
(cpu) usage of monitoring. Yes, damon with optimization has low cpu
cost, but as the number of targets increases the cpu cost will grow
into a range which cannot be ignored as system overhead. At the
moment, it seems like there is one kthread doing all the monitoring.
Since we can control the cpu usage of kthreads, it might make sense to
allow different kthreads for different sets of targets (processes in a
cgroup).
On Wed, 29 Jul 2020 08:31:29 -0700 Shakeel Butt <[email protected]> wrote:
> On Sat, Jul 18, 2020 at 6:31 AM SeongJae Park <[email protected]> wrote:
> >
> > On Fri, 17 Jul 2020 19:47:50 -0700 Shakeel Butt <[email protected]> wrote:
> >
> > > On Mon, Jul 13, 2020 at 1:43 AM SeongJae Park <[email protected]> wrote:
> > > >
> > > > From: SeongJae Park <[email protected]>
> > > >
> > > > DAMON is a data access monitoring framework subsystem for the Linux
> > > > kernel. The core mechanisms of DAMON make it
> > > >
> > > > - accurate (the monitoring output is useful enough for DRAM level
> > > > memory management; It might not appropriate for CPU Cache levels,
> > > > though),
> > > > - light-weight (the monitoring overhead is low enough to be applied
> > > > online), and
> > > > - scalable (the upper-bound of the overhead is in constant range
> > > > regardless of the size of target workloads).
> > > >
> > > > Using this framework, therefore, the kernel's memory management
> > > > mechanisms can make advanced decisions. Experimental memory management
> > > > optimization works that incurring high data accesses monitoring overhead
> > > > could implemented again. In user space, meanwhile, users who have some
> > > > special workloads can write personalized applications for better
> > > > understanding and optimizations of their workloads and systems.
> > > >
> > > > This commit is implementing only the stub for the module load/unload,
> > > > basic data structures, and simple manipulation functions of the
> > > > structures to keep the size of commit small. The core mechanisms of
> > > > DAMON will be implemented one by one by following commits.
> > > >
> > > > Signed-off-by: SeongJae Park <[email protected]>
> > > > Reviewed-by: Leonard Foerster <[email protected]>
> > > > Reviewed-by: Varad Gautam <[email protected]>
> > > > ---
> > > > include/linux/damon.h | 63 ++++++++++++++
> > > > mm/Kconfig | 12 +++
> > > > mm/Makefile | 1 +
> > > > mm/damon.c | 188 ++++++++++++++++++++++++++++++++++++++++++
> > > > 4 files changed, 264 insertions(+)
> > > > create mode 100644 include/linux/damon.h
> > > > create mode 100644 mm/damon.c
> > > >
> > > > diff --git a/include/linux/damon.h b/include/linux/damon.h
> > > > new file mode 100644
> > > > index 000000000000..c8f8c1c41a45
> > > > --- /dev/null
> > > > +++ b/include/linux/damon.h
> > > > @@ -0,0 +1,63 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +/*
> > > > + * DAMON api
> > > > + *
> > > > + * Copyright 2019-2020 Amazon.com, Inc. or its affiliates.
> > > > + *
> > > > + * Author: SeongJae Park <[email protected]>
> > > > + */
> > > > +
> > [...]
> > > > +
> > > > +/**
> > > > + * struct damon_task - Represents a monitoring target task.
> > > > + * @pid: Process id of the task.
> > > > + * @regions_list: Head of the monitoring target regions of this task.
> > > > + * @list: List head for siblings.
> > > > + *
> > > > + * If the monitoring target address space is task independent (e.g., physical
> > > > + * memory address space monitoring), @pid should be '-1'.
> > > > + */
> > > > +struct damon_task {
> > > > + int pid;
> > >
> > > Storing and accessing pid like this is racy. Why not save the "struct
> > > pid" after getting the reference? I am still going over the usage,
> > > maybe storing mm_struct would be an even better choice.
> > >
> > > > + struct list_head regions_list;
> > > > + struct list_head list;
> > > > +};
> > > > +
> > [...]
> > > > +
> > > > +#define damon_get_task_struct(t) \
> > > > + (get_pid_task(find_vpid(t->pid), PIDTYPE_PID))
> > >
> > > You need at least rcu lock around find_vpid(). Also you need to be
> > > careful about the context. If you accept my previous suggestion then
> > > you just need to do this in the process context which is registering
> > > the pid (no need to worry about the pid namespace).
> > >
> > > I am wondering if there should be an interface to register processes
> > > with DAMON using pidfd instead of integer pid.
> >
> > Good points! I will use pidfd for this purpose, instead.
> >
> > BTW, 'struct damon_task' was introduced while DAMON supports only virtual
> > address spaces and recently extended to support physical memory address
> > monitoring case by defining an exceptional pid (-1) for such case. I think it
> > doesn't smoothly fit with the design.
> >
> > Therefore, I would like to change it with more general named struct, e.g.,
> >
> > struct damon_target {
> > void *id;
> > struct list_head regions_list;
> > struct list_head list;
> > };
> >
> > The 'id' field will be able to store or point pid_t, struct mm_struct, struct
> > pid, or anything relevant, depending on the target address space.
> >
> > Only one part of the address space independent logics of DAMON, namely
> > 'kdamon_need_stop()', uses '->pid' of the 'struct damon_task'. It will be
> > introduced by the next patch ("mm/damon: Implement region based sampling").
> > Therefore, the conversion will be easy. For the part, I could add another
> > callback, e.g.,
> >
> > struct damon_ctx {
> > [...]
> > bool (*is_target_valid)(struct damon_target *t);
> > };
> >
> > And let the address space specific primitives to implement this.
> >
> > Then, damon_get_task_struct() and damon_get_mm() will be introduced by the
> > sixth patch ("mm/damon: Implement callbacks for the virtual memory address
> > spaces") as a part of the virtual address space specific primitives
> > implementation.
> >
> > I gonna make the change in the next spin. If you have some opinions on this,
> > please let me know.
> >
> >
>
> Sorry for the late response. I think the general direction you are
> taking is fine but there are still some open questions. I am trying to
> reason if 'address space' is general enough abstraction for different
> types of monitoring targets. It fits well for the 'processes' targets.
Agreed, and that's why I'm planning to make the target a simple
identifier[1] that is independent of the real types of the targets. It will be
interpreted only by the low level primitive implementation. From the core
logic's perspective, it will not be interpreted at all; it is only required to
be unique among the targets of the monitoring context (damon_ctx). The changed
version will be posted in the next spin.
[1] https://damonitor.github.io/doc/html/next/vm/damon/api.html#c.damon_target
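To make the idea a bit more concrete, below is a rough user-space sketch, not
the actual patch. It reuses the 'struct damon_target' and 'is_target_valid'
names proposed earlier in this thread; the singly linked 'next' pointer (a
stand-in for 'struct list_head'), damon_add_target(), need_stop(), and
toy_is_target_valid() are hypothetical names made up for this illustration.
The core-side helpers only compare ids and never dereference them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model; the kernel version would use struct list_head. */
struct damon_target {
	void *id;			/* opaque to the core logic */
	struct damon_target *next;	/* stand-in for the sibling list */
};

struct damon_ctx {
	struct damon_target *targets;
	/* implemented by the address space specific primitives */
	bool (*is_target_valid)(struct damon_target *t);
};

/* Core side: ids are compared for uniqueness, never interpreted. */
static bool target_id_unique(const struct damon_ctx *ctx, const void *id)
{
	const struct damon_target *t;

	for (t = ctx->targets; t; t = t->next)
		if (t->id == id)
			return false;
	return true;
}

/* Hypothetical helper: returns 0 on success, -1 on a duplicated id. */
static int damon_add_target(struct damon_ctx *ctx, struct damon_target *t)
{
	if (!target_id_unique(ctx, t->id))
		return -1;
	t->next = ctx->targets;
	ctx->targets = t;
	return 0;
}

/* Core side: stop once the primitives report no valid target. */
static bool need_stop(const struct damon_ctx *ctx)
{
	struct damon_target *t;

	for (t = ctx->targets; t; t = t->next)
		if (ctx->is_target_valid(t))
			return false;
	return true;
}

/*
 * Toy primitive (assumption): the id points to an int pid, and the
 * target counts as valid while that pid is positive.
 */
static bool toy_is_target_valid(struct damon_target *t)
{
	return *(int *)t->id > 0;
}
```

Only toy_is_target_valid() knows what the id means, so supporting another kind
of target would mean swapping that callback and leaving the core helpers alone.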
> For the physical memory, the monitoring part of the abstraction (i.e.
> damon_ctx) seems fine but I am not sure about the optimization part
> (i.e. [merge|split]_regions) which raises the question that should the
> merge/split functionality be part of the abstraction.
I would like to argue that the optimization is the core part of DAMON that
makes the difference: it provides a _controllable_ tradeoff between accuracy
and overhead while making a best effort for accuracy. Even on the physical
memory space, the logic works well if the memory is not highly segmented.
Even under significant segmentation, it could provide a sign that compaction
is necessary. I also believe compaction and DAMON can complement each other.
Nonetheless, the optimization can be turned off by manipulating DAMON's
attributes if needed.
>
> I am also very interested in the 'cgroups' as the target and I am not
> sure if 'address space' is the right abstraction for the cgroups as
> well. Well we can think of cgroups as a combination of tasks but
> cgroup also contains unmapped pages. So, maybe it is a combination of
> virtual and physical address space targets damon can monitor but I am
> still not clear how to specify that in the abstractions provided by
> damon.
In this case, because DAMON itself provides the low level primitives for the
virtual address space and the physical address space, the user could simply
make two DAMON contexts, one for the virtual address part of the processes in
the cgroup and one for the physical address part of the cgroup, and run those
in parallel.

Or, you could implement another set of low level primitives for the cgroups
usecase. If your usecase is somewhat complex and needs some level of
optimization, I think this is the right way. The low level primitive
implementation should provide methods[1] for 1) identifying the target,
2) constructing the monitoring target address ranges, and 3) checking accesses
to small address ranges in the target ranges. The implementation for cgroups
could 1) use cgroup ids as the target ids, 2) construct the target address
ranges by integrating the address spaces of the processes in the cgroup and
the related physical memory regions (maybe you could convert every address
range to physical addresses), and 3) use page table walk/rmap and the PTE/PMD
Accessed bit for the access checking.
[1] https://damonitor.github.io/doc/html/next/vm/damon/design.html#configurable-layers
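As a rough illustration of those three methods, here is a toy user-space
sketch. Everything in it is an assumption made for this example rather than
DAMON code: 'struct monitoring_primitives', the 'toy_*' functions, the fixed
cgroup id 7, and the bool array that stands in for the PTE/PMD Accessed bits.
A real implementation would walk page tables or rmap instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE	4096UL
#define TOY_NR_PAGES	8UL

struct addr_range {
	unsigned long start;	/* inclusive */
	unsigned long end;	/* exclusive */
};

/* Hypothetical bundle of the three methods described above. */
struct monitoring_primitives {
	/* 1) identify the target */
	bool (*target_valid)(unsigned long target_id);
	/* 2) construct the target ranges; returns the number written */
	size_t (*init_ranges)(unsigned long target_id,
			      struct addr_range *out, size_t max);
	/* 3) check whether any page in the range was accessed */
	bool (*accessed)(unsigned long target_id,
			 const struct addr_range *r);
};

static const unsigned long toy_cgroup_id = 7;
/* Stand-in for the Accessed bits of the toy cgroup's pages. */
static bool toy_accessed_bit[TOY_NR_PAGES];

static bool toy_target_valid(unsigned long target_id)
{
	return target_id == toy_cgroup_id;
}

/* The toy cgroup owns a single range covering all toy pages. */
static size_t toy_init_ranges(unsigned long target_id,
			      struct addr_range *out, size_t max)
{
	if (!toy_target_valid(target_id) || max < 1)
		return 0;
	out[0].start = 0;
	out[0].end = TOY_NR_PAGES * TOY_PAGE_SIZE;
	return 1;
}

static bool toy_accessed(unsigned long target_id, const struct addr_range *r)
{
	unsigned long pfn;

	if (!toy_target_valid(target_id))
		return false;
	for (pfn = r->start / TOY_PAGE_SIZE;
	     pfn < TOY_NR_PAGES && pfn * TOY_PAGE_SIZE < r->end; pfn++)
		if (toy_accessed_bit[pfn])
			return true;
	return false;
}

static const struct monitoring_primitives toy_cgroup_primitives = {
	.target_valid = toy_target_valid,
	.init_ranges = toy_init_ranges,
	.accessed = toy_accessed,
};
```

The core logic would only call through 'toy_cgroup_primitives', so the same
region handling could be reused unchanged on top of a different target type.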
> Anyways these are the questions for later and we can start
> simple with just processes but I would like to not expose these
> abstractions/interfaces to userspace otherwise it would be really hard
> to change later.
Agreed. For now, DAMON provides three interfaces for the user space. The
first and main one is the debugfs interface[1]. I chose debugfs because it's
relatively easy to change the interface. The second one is the tracepoint[2].
It only provides the monitoring time, the target id, and the monitored access
frequency of each region. I believe this is a sufficiently abstracted format.
The final one is the user space tool[3] written in Python. It is only a
wrapper of the debugfs interface. I believe keeping the higher level interface
will not be hard.
[1] 9th patch of this series: https://lore.kernel.org/linux-mm/[email protected]/
[2] 8th patch of this series: https://lore.kernel.org/linux-mm/[email protected]/
[3] 10th patch of this series: https://lore.kernel.org/linux-mm/[email protected]/
>
> Another topic I want to discuss is managing/charging the resource
> (cpu) usage of monitoring. Yes, damon with optimization has low cpu
> cost but as the number of targets increase the cpu cost will increase
> which will be in a range which can not be ignored as system overhead.
> At the moment, it seems like there is one kthread doing all the
> monitoring, since we can control the cpu usage of kthreads, it might
> make sense to allow different kthreads for different sets of targets
> (processes in a cgroup).
Each DAMON context is processed by one kthread. Therefore, running each
target, or each group of targets, in a dedicated kthread is naturally
supported.

Thanks again for the questions. If I misunderstood your points or my answers
are not clear enough, please don't hesitate to let me know :)
Thanks,
SeongJae Park