From:   SeongJae Park <sj38.park@gmail.com>
To:     Shakeel Butt <shakeelb@google.com>
Cc:     Andrew Morton <akpm@linux-foundation.org>,
        SeongJae Park <sj38.park@gmail.com>,
        SeongJae Park <sjpark@amazon.de>, Jonathan.Cameron@huawei.com,
        amit@kernel.org, Jonathan Corbet <corbet@lwn.net>,
        David Hildenbrand <david@redhat.com>, dwmw@amazon.com,
        foersleo@amazon.de, Greg Thelen <gthelen@google.com>,
        jgowans@amazon.com, mheyne@amazon.de,
        David Rientjes <rientjes@google.com>, sieberf@amazon.com,
        Vlastimil Babka <vbabka@suse.cz>, linux-damon@amazon.com,
        Linux MM <linux-mm@kvack.org>, linux-doc@vger.kernel.org,
        LKML <linux-kernel@vger.kernel.org>, Wei Xu <weixugc@google.com>,
        Paul Turner <pjt@google.com>, Yu Zhao <yuzhao@google.com>,
        Dave Hansen <dave.hansen@intel.com>
Subject: Re: [PATCH v34 00/13] Introduce Data Access MONitor (DAMON)
Date:   Mon,  9 Aug 2021 14:07:14 +0000
Message-Id: <20210809140714.34394-1-sjpark@amazon.de>
In-Reply-To: <20210806114801.6958-1-sjpark@amazon.de>
Precedence: bulk

From: SeongJae Park <sjpark@amazon.de>

On Fri,  6 Aug 2021 11:48:01 +0000 SeongJae Park <sj38.park@gmail.com> wrote:

> From: SeongJae Park <sjpark@amazon.de>
> 
> On Thu, 5 Aug 2021 17:03:44 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
[...]
> > 
> > I would like to see more thought/design go into how DAMON could be
> > modified to address Shakeel's other three requirements.  At least to
> > the point where we can confidently say "yes, we will be able to do
> > this".  Are you able to drive this discussion along please?
> 
> Sure.  I will describe my plan for convincing Shakeel's usages in detail as a
> reply to this mail.

Shakeel, I am explaining how DAMON will be extended and how it can be used for
your usages below.  If there is any doubt or question, please feel free to let
me know.

Information DAMON provides (or will provide): contiguity, frequency, recency
----------------------------------------------------------------------------

The DAMON of this patchset tells users how frequently each memory region is
accessed.  A memory region is a set of contiguous pages that have similar
access frequency.  In addition, a following patch[1] will make DAMON track how
long each region has maintained its size and access frequency.  We call this
the 'age' of the region.  That is, DAMON will be extended to provide three
attributes of data access patterns: contiguity (size of each region),
frequency, and recency.
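
To make the three attributes concrete, below is a minimal user-space sketch in
Python.  It is a toy model, not the kernel data structure: the 'Region' class,
the 'update_age()' helper, and the 'tolerance' parameter are all hypothetical
names, and the rule "reset the age when the access frequency changes by more
than a tolerance" is only an assumption about how the age could be maintained.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Toy model of a monitored region (hypothetical, not the kernel struct)."""
    start: int          # start address, inclusive
    end: int            # end address, exclusive
    nr_accesses: int    # access frequency seen in the last aggregation
    age: int = 0        # aggregation intervals the pattern has stayed stable

    @property
    def size(self) -> int:
        # Contiguity: the size of the region in bytes.
        return self.end - self.start


def update_age(region: Region, new_nr_accesses: int, tolerance: int = 1) -> None:
    """Assumed rule: keep aging while the frequency stays similar, else reset."""
    if abs(new_nr_accesses - region.nr_accesses) <= tolerance:
        region.age += 1
    else:
        region.age = 0
    region.nr_accesses = new_nr_accesses


region = Region(start=0x1000, end=0x5000, nr_accesses=0)
for observed in (0, 0, 1, 9, 9):
    update_age(region, observed)
    print(f"nr_accesses={region.nr_accesses} age={region.age}")
```

In this run the age grows to 3 while the region stays nearly idle, resets to 0
when the frequency jumps to 9, and then starts growing again.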

Physical Address Space support
------------------------------

This version of DAMON supports only the virtual address spaces of processes,
but it will be extended to the physical address space[2].  The extension will
be quite simple because DAMON's monitoring primitives layer is separated from
its core logic.

How DAMON can be used for Shakeel's usages
------------------------------------------

The usages described in Shakeel's prior mail[1] are:

    1) Working set estimation: This is used for cluster level scheduling
    and controlling the knobs of memory overcommit.

    2) Proactive reclaim

    3) Balancing between memory tiers: Moving hot pages to fast tiers and
    cold pages to slow tiers

    4) Hugepage optimization: Hot memory backed by hugepages

    In addition, these uses are not happening in isolation. We want a
    combination of these running concurrently on a system. So, it is clear
    that the first version or step of DAMON which only targets virtual
    address space monitoring is not sufficient for these use-cases.

DAMON can serve all of these usages, as described below.

- Working set estimation: This can be done by iterating over the regions and
  checking whether the access frequency of each is higher than a threshold.
  Our user space tool provides an implementation[3] for this.  Below is
  pseudo-code for this:

    def estimate_working_set(regions, threshold):
        workingsets = []
        working_set_size = 0
        for region in regions:
            if region.access_frequency > threshold:
                workingsets.append(region)
                working_set_size += region.end_address - region.start_address
        return workingsets, working_set_size

- Proactive reclaim: This can be done by iterating over the regions, checking
  whether each has zero access frequency and an age higher than a time
  threshold, and reclaiming those that do.  We implemented this as a kernel
  module with only 354 lines of code[4].  Below is pseudo-code for this:

    for region in regions:
        if region.access_frequency == 0 and region.age > threshold:
            reclaim(region)

- Balancing between memory tiers: Because DAMON provides access frequency, we
  can identify not only idle memory regions but also cold, cool, warm, and hot
  ones.  Once the functions for migrating pages from one tier to another
  mature, applying DAMON to this usage will be quite straightforward.  That
  is, for each region, if its access frequency and age are higher than
  thresholds, migrate pages in the region to a faster tier.  If its access
  frequency is lower than a threshold and its age is higher than a threshold,
  migrate pages in the region to a slower tier.  Below is pseudo-code for
  this:

    for region in regions:
        if region.age > age_threshold:
            if region.access_frequency > hot_threshold:
                migrate_to_fast_tier(region)
            elif region.access_frequency < cold_threshold:
                migrate_to_slow_tier(region)

- Hugepage optimization: This will be quite similar to tier balancing, but we
  can also use the size of regions.  That is, we monitor virtual address
  spaces first.  Then, for each region, if its access frequency and age are
  higher than thresholds and its size is at least a size threshold (which
  would be 2MB), make the region backed by huge pages.  If the age and size
  pass their thresholds but the access frequency is lower than a threshold,
  make the huge pages of the region backed by regular pages instead.  We
  evaluated this idea with a prototype[5].  It removed 76.15% of THP memory
  overheads while preserving 51.25% of THP speedup.  Below is pseudo-code for
  this:

    for region in regions:
        if region.age > age_threshold and region.size >= 2 * MB:
            if region.access_frequency > hot_threshold:
                use_thps_for(region)
            elif region.access_frequency < cold_threshold:
                use_regular_pages_for(region)

- Combination of these running concurrently: DAMON will be extended to monitor
  both the physical address space and virtual address spaces simultaneously,
  as below.

    struct damon_ctx *ctx_for_virt = damon_new_ctx();
    struct damon_ctx *ctx_for_phys = damon_new_ctx();
    struct damon_ctx *ctxs[] = {ctx_for_virt, ctx_for_phys};
    [...]
    /* first context for virtual address spaces monitoring */
    damon_va_set_primitives(ctx_for_virt);
    /* second context for physical address space monitoring */
    damon_pa_set_primitives(ctx_for_phys);
    damon_start(ctxs, 2);
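
The per-region policies in the list above can also be tried out in user space.
Below is a self-contained Python sketch that applies them to a hand-made list
of regions.  Everything in it is an assumption made for illustration: the
'Region' class, the helper names, the threshold values, and the units
(accesses and intervals per aggregation) are hypothetical, and the helpers
only return a decision string rather than reclaiming or migrating anything.

```python
from dataclasses import dataclass

MB = 1 << 20


@dataclass
class Region:
    """Toy region record (hypothetical; not a DAMON interface)."""
    start: int
    end: int
    access_frequency: int   # assumed unit: accesses per aggregation
    age: int                # assumed unit: aggregation intervals

    @property
    def size(self) -> int:
        return self.end - self.start


def working_set_size(regions, threshold):
    # Sum the sizes of regions accessed more often than the threshold.
    return sum(r.size for r in regions if r.access_frequency > threshold)


def reclaim_candidates(regions, age_threshold):
    # Regions that were never accessed and have stayed that way long enough.
    return [r for r in regions
            if r.access_frequency == 0 and r.age > age_threshold]


def tier_action(r, age_threshold, hot_threshold, cold_threshold):
    # Only act on regions whose pattern has been stable for a while.
    if r.age <= age_threshold:
        return "keep"
    if r.access_frequency > hot_threshold:
        return "fast"
    if r.access_frequency < cold_threshold:
        return "slow"
    return "keep"


def thp_action(r, age_threshold, hot_threshold, cold_threshold):
    # Same as tiering, but additionally require at least 2MB of contiguity.
    if r.age <= age_threshold or r.size < 2 * MB:
        return "keep"
    if r.access_frequency > hot_threshold:
        return "hugepage"
    if r.access_frequency < cold_threshold:
        return "regular"
    return "keep"


regions = [
    Region(0 * MB, 4 * MB, access_frequency=18, age=12),   # large, hot, stable
    Region(4 * MB, 5 * MB, access_frequency=0, age=30),    # small, idle, old
    Region(5 * MB, 9 * MB, access_frequency=1, age=20),    # large, cold, stable
    Region(9 * MB, 10 * MB, access_frequency=15, age=2),   # small, hot, young
]

wss = working_set_size(regions, threshold=10)
to_reclaim = reclaim_candidates(regions, age_threshold=10)
tiers = [tier_action(r, 10, 10, 2) for r in regions]
thps = [thp_action(r, 10, 10, 2) for r in regions]
print(wss // MB, len(to_reclaim), tiers, thps)
```

With these sample values the working set is 5MB, one region qualifies for
reclaim, and the young hot region is left alone by both the tiering and the
hugepage policies because its age has not passed the threshold yet.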

Extending for page-granularity monitoring
-----------------------------------------

To my understanding, Shakeel wants to do the above with page-granularity
monitoring.  That will inevitably incur high overhead, but for those who can
afford the cost, I will make DAMON support it, as below.

Even with the DAMON of this patchset, users can do page-granularity monitoring
by simply setting the 'min_nr_regions' and 'max_nr_regions' of DAMON to the
number of pages in the target address space (nr_pages).  However, that will
create 'nr_pages' region structs.  Assuming 4K pages, this costs about 1% of
the monitored memory, as each region struct consumes about 44 bytes.  Our plan
for removing this overhead is as below.
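
Before going into that plan, the 1% figure can be checked with simple
arithmetic.  The Python sketch below uses only the two numbers given above (4K
pages and about 44 bytes per region struct); the 'region_overhead()' helper is
a hypothetical name used just for this calculation.

```python
PAGE_SIZE = 4096           # bytes, the 4K pages assumed in the text
REGION_STRUCT_SIZE = 44    # bytes, the approximate figure given in the text


def region_overhead(monitored_bytes: int) -> int:
    """Bytes of region structs needed when every page gets its own region."""
    nr_pages = monitored_bytes // PAGE_SIZE
    return nr_pages * REGION_STRUCT_SIZE


ratio = REGION_STRUCT_SIZE / PAGE_SIZE
gib = 1 << 30
print(f"overhead ratio: {ratio:.2%}")                                  # 1.07%
print(f"per GiB monitored: {region_overhead(gib) / (1 << 20):.1f} MiB")  # 11.0
```

That is, one region struct per page costs about 11 MiB for each GiB of
monitored memory.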

In the future, users will be able to opt out of the regions abstraction
entirely[6].  In that case, no region structs will be allocated, so their
memory overhead will be zero.  However, the user will need to configure DAMON
to use a special monitoring primitive that saves the monitoring results, such
as access frequency and age, somewhere such as their own data structure or
page flags, as the multi-gen LRU patchset does.  If such a data structure is
commonly usable, we can extend the DAMON core to support it.  To show how this
will work, we
implemented a page-granularity idleness monitoring primitive with only 69 lines
of code[7].

Also, if someone has ideas for reducing the page granularity monitoring
overhead, we can put the optimization in the monitoring primitives layer, which
is separated from the core logic.

[1] https://lore.kernel.org/linux-mm/20201216084404.23183-2-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-mm/20201216094221.11898-1-sjpark@amazon.com/
[3] https://github.com/awslabs/damo/blob/master/wss.py
[4] https://lore.kernel.org/linux-mm/20210720131309.22073-15-sj38.park@gmail.com/
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html#efficient-thp
[6] https://github.com/sjp38/linux/commit/9e0cb168d30e
[7] https://lore.kernel.org/linux-mm/20201216094221.11898-14-sjpark@amazon.com/


Thanks,
SeongJae Park