Received: by 2002:a25:6193:0:0:0:0:0 with SMTP id v141csp1959193ybb; Thu, 2 Apr 2020 10:25:15 -0700 (PDT) X-Google-Smtp-Source: APiQypJ7u0vR7URJvZvdnultJqvruBi6RhtELGeJqo/zuAJLXmSz5NQa2xjUVfu99JHu8JGihhUq X-Received: by 2002:a9d:7a45:: with SMTP id z5mr3134501otm.181.1585848314846; Thu, 02 Apr 2020 10:25:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1585848314; cv=none; d=google.com; s=arc-20160816; b=oQUhKuGwGlN9XTb92tj+8ZWGF9Kk5636Plijl9nKEu1r4pp6/UlBgSYnrJkICQQe7/ Qh9s+qldna+jRXlKcAJHLw7KtUNbEroaRwR/Ic0bzxQaKCBGYp6M1kfp8hvfOZgb0ShJ Ze+V+5MuYaK74m/z2SGrQ2kNiIs8QwxmAXkRRoEm6eGr78JTr2IL4RwkTkBqRZsmvyA0 zeWHsmzYSf3UUK3u3/4gIqBxlnO2s5wruvz33AcKw0btY+59bghVGZW8aI28hvEloSx2 +Tspq2bRMjMQnmnmJf8Zthb9FvWyHSeVIy7v/Hqx5ecVDsWkMyU257kXKsalqqMYCAxb YaCA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:mime-version :organization:references:in-reply-to:message-id:subject:cc:to:from :date; bh=VmzJJuW9j/xZWJXn8paUDnROf4fEylzLcD9x1M+/JkU=; b=KL/Ney9GZAoAma13Di7P2xk3MHI2h86ew988i3d6bf3jajE03JhmwPKgn2haAoS05V PU5M1LwQIUU7/dggJhSYgLsdM/BUwnNqICCLoVTHAv428/TTzzgjP1M/Isb3DxnCIo/M 8SR59u+T5X/Vj/cjrqMBXGDplctsEc8eLj3RXJOCC8U65Fu0rR21SFtZeYrfwrdp6pYN cBu0ZaIgGYMDv4KK1Sv8hq/KgiWOvq/MZmbAdiAcjpRfvhQDWat1tsUxDO9fpd1/7am6 5uwPUyftrOgzgHyMJxkVZb5cJDPSK4oJF9NIlCT5sGvPRaGXfTB3jxMZL2PnDUYkWXam FEqg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id l15si2755483otb.8.2020.04.02.10.25.01; Thu, 02 Apr 2020 10:25:14 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2389328AbgDBRYQ (ORCPT + 99 others); Thu, 2 Apr 2020 13:24:16 -0400 Received: from lhrrgout.huawei.com ([185.176.76.210]:2633 "EHLO huawei.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1732754AbgDBRYQ (ORCPT ); Thu, 2 Apr 2020 13:24:16 -0400 Received: from lhreml710-chm.china.huawei.com (unknown [172.18.7.107]) by Forcepoint Email with ESMTP id 744F8DD4AA664FA4A246; Thu, 2 Apr 2020 18:24:14 +0100 (IST) Received: from localhost (10.47.26.250) by lhreml710-chm.china.huawei.com (10.201.108.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.1713.5; Thu, 2 Apr 2020 18:24:13 +0100 Date: Thu, 2 Apr 2020 18:24:01 +0100 From: Jonathan Cameron To: SeongJae Park CC: , SeongJae Park , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: Re: [PATCH v7 04/15] mm/damon: Implement region based sampling Message-ID: <20200402182401.000032a6@Huawei.com> In-Reply-To: <20200402135959.26336-1-sjpark@amazon.com> References: <20200401082222.21242-1-sjpark@amazon.com> <20200402135959.26336-1-sjpark@amazon.com> Organization: Huawei Technologies Research and Development (UK) Ltd. X-Mailer: Claws Mail 3.17.4 (GTK+ 2.24.32; i686-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.47.26.250] X-ClientProxiedBy: lhreml729-chm.china.huawei.com (10.201.108.80) To lhreml710-chm.china.huawei.com (10.201.108.61) X-CFilter-Loop: Reflected Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2 Apr 2020 15:59:59 +0200 SeongJae Park wrote: > On Wed, 1 Apr 2020 10:22:22 +0200 SeongJae Park wrote: > > > On Tue, 31 Mar 2020 17:02:33 +0100 Jonathan Cameron wrote: > > > > > On Wed, 18 Mar 2020 12:27:11 +0100 > > > SeongJae Park wrote: > > > > > > > From: SeongJae Park > > > > > > > > This commit implements DAMON's basic access check and region based > > > > sampling mechanisms. This change would seems make no sense, mainly > > > > because it is only a part of the DAMON's logics. Following two commits > > > > will make more sense. > > > > > > > > Basic Access Check > > > > ------------------ > > > > > > > > DAMON basically reports what pages are how frequently accessed. Note > > > > that the frequency is not an absolute number of accesses, but a relative > > > > frequency among the pages of the target workloads. > > > > > > > > Users can control the resolution of the reports by setting two time > > > > intervals, ``sampling interval`` and ``aggregation interval``. In > > > > detail, DAMON checks access to each page per ``sampling interval``, > > > > aggregates the results (counts the number of the accesses to each page), > > > > and reports the aggregated results per ``aggregation interval``. For > > > > the access check of each page, DAMON uses the Accessed bits of PTEs. > > > > > > > > This is thus similar to common periodic access checks based access > > > > tracking mechanisms, which overhead is increasing as the size of the > > > > target process grows. > > > > > > > > Region Based Sampling > > > > --------------------- > > > > > > > > To avoid the unbounded increase of the overhead, DAMON groups a number > > > > of adjacent pages that assumed to have same access frequencies into a > > > > region. As long as the assumption (pages in a region have same access > > > > frequencies) is kept, only one page in the region is required to be > > > > checked. Thus, for each ``sampling interval``, DAMON randomly picks one > > > > page in each region and clears its Accessed bit. After one more > > > > ``sampling interval``, DAMON reads the Accessed bit of the page and > > > > increases the access frequency of the region if the bit has set > > > > meanwhile. Therefore, the monitoring overhead is controllable by > > > > setting the number of regions. > > > > > > > > Nonetheless, this scheme cannot preserve the quality of the output if > > > > the assumption is not kept. Following commit will introduce how we can > > > > make the guarantee with best effort. > > > > > > > > Signed-off-by: SeongJae Park > > > > > > Hi. > > > > > > A few comments inline. > > > > > > I've still not replicated your benchmarks so may well have some more > > > feedback once I've managed that on one of our servers. > > > > Appreciate your comments. If you need any help for the replication, please let > > me know. I basically use my parsec3 wrapper scripts[1] to run parsec3 and > > splash2x workloads and `damo` tool, which resides in the kernel tree at > > `/tools/damon/`. > > > > For example, below commands will reproduce ethp applied splash2x/fft run. > > > > $ echo "2M null 5 null null null hugepage > > 2M null null 5 1s null nohugepage" > ethp > > $ parsec3_on_ubuntu/run.sh splash2x.fft > > $ linux/tools/damon/damo schemes -c ethp `pidof fft` > > > > [1] https://github.com/sjp38/parsec3_on_ubuntu > > > > > > > > Thanks, > > > > > > Jonathan > > > > > > > --- > > > > include/linux/damon.h | 24 ++ > > > > mm/damon.c | 553 ++++++++++++++++++++++++++++++++++++++++++ > > > > 2 files changed, 577 insertions(+) > > > > > > [...] > > > > diff --git a/mm/damon.c b/mm/damon.c > > > > index d7e6226ab7f1..018016793555 100644 > > > > --- a/mm/damon.c > > > > +++ b/mm/damon.c > > > > @@ -10,8 +10,14 @@ > > > > #define pr_fmt(fmt) "damon: " fmt > > > > > > > > #include > > > > +#include > > > > +#include > > > > #include > > > > #include > > > > +#include > > > > +#include > > > > +#include > > > > +#include > > > > #include > > > > > > [...] > > > > +/* > > > > + * Size-evenly split a region into 'nr_pieces' small regions > > > > + * > > > > + * Returns 0 on success, or negative error code otherwise. > > > > + */ > > > > +static int damon_split_region_evenly(struct damon_ctx *ctx, > > > > + struct damon_region *r, unsigned int nr_pieces) > > > > +{ > > > > + unsigned long sz_orig, sz_piece, orig_end; > > > > + struct damon_region *piece = NULL, *next; > > > > + unsigned long start; > > > > + > > > > + if (!r || !nr_pieces) > > > > + return -EINVAL; > > > > + > > > > + orig_end = r->vm_end; > > > > + sz_orig = r->vm_end - r->vm_start; > > > > + sz_piece = sz_orig / nr_pieces; > > > > + > > > > + if (!sz_piece) > > > > + return -EINVAL; > > > > + > > > > + r->vm_end = r->vm_start + sz_piece; > > > > + next = damon_next_region(r); > > > > + for (start = r->vm_end; start + sz_piece <= orig_end; > > > > + start += sz_piece) { > > > > + piece = damon_new_region(ctx, start, start + sz_piece); > > > piece may be n > > > > Yes, that name is short and more intuitive. I will rename so. > > > > > > + damon_insert_region(piece, r, next); > > > > + r = piece; > > > > + } > > > > + /* complement last region for possible rounding error */ > > > > + if (piece) > > > > + piece->vm_end = orig_end; > > > > > > Update the sampling address to ensure it's in the region? > > > > I think `piece->vm_end` should be equal or smaller than `orig_end` and > > therefore the sampling address of `piece` will be still in the region. > > > > > > > > > + > > > > + return 0; > > > > +} > > > > + > > [...] > > > > +static void damon_pte_pmd_mkold(pte_t *pte, pmd_t *pmd) > > > > +{ > > > > + if (pte) { > > > > + if (pte_young(*pte)) { > > > > + clear_page_idle(pte_page(*pte)); > > > > + set_page_young(pte_page(*pte)); > > > > + } > > > > + *pte = pte_mkold(*pte); > > > > + return; > > > > + } > > > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > > + if (pmd) { > > > > + if (pmd_young(*pmd)) { > > > > + clear_page_idle(pmd_page(*pmd)); > > > > + set_page_young(pmd_page(*pmd)); > > > > + } > > > > + *pmd = pmd_mkold(*pmd); > > > > + } > > > > +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > > > > > No need to flush the TLBs? > > > > Good point! > > > > I have intentionally skipped TLB flushing here to minimize the performance > > effect to the target workload. I also thought this might not degrade the > > monitoring accuracy so much because we are targetting for the DRAM level > > accesses of memory-intensive workloads, which might make TLB flood frequently. > > > > However, your comment makes me thinking differently now. By flushing the TLB > > here, we will increase up to `number_of_regions` TLB misses for sampling > > interval. This might be not a huge overhead. Also, improving the monitoring > > accuracy makes no harm at all. I even didn't measured the overhead. > > > > I will test the overhead and if it is not significant, I will make this code to > > flush TLB, in the next spin. > > Hmm, it seems like 'page_idle.c' is also modifying the Accessed bit but doesn't > flush related TLB entries. If I'm not missing something here, I would like to > leave this part as is to make the behavior consistent. Interesting. In that usecase, the risk is that the MMU believes the page still has the accessed bit set when we have cleared it and hence the accessed bit is not written out to the table in memory. That will give them a wrong decision so not great and would lead to them thinking more pages are idle than are. Here we could have a particular TLB entry for a huge page in which a region lies entirely. Because we don't flush the TLB each time we could end with a count of 0 accesses when it should be the maximum. A very frequently accessed page might well sit in the TLB for a very long time (particularly if the TLB is running a clever eviction strategy). I think we would want to be test this and see if we get that pathological case sometimes. Also worth benchmarking if it actually costs us very much to do the flushes. Jonathan > > > Thanks, > SeongJae Park > > > > > > > > > > +} > > > > + > > [...] > > > > +/* > > > > + * The monitoring daemon that runs as a kernel thread > > > > + */ > > > > +static int kdamond_fn(void *data) > > > > +{ > > > > + struct damon_ctx *ctx = data; > > > > + struct damon_task *t; > > > > + struct damon_region *r, *next; > > > > + struct mm_struct *mm; > > > > + > > > > + pr_info("kdamond (%d) starts\n", ctx->kdamond->pid); > > > > + kdamond_init_regions(ctx); > > > > > > We haven't called mkold on the initial regions so first check will > > > get us fairly random state. > > > > Yes, indeed. However, the early results will not be accurate anyway because > > the adaptive regions adjustment algorithm will not take effect yet. I would > > like to leave this part as is but add some comments about this point to keep > > the code simple. > > > > > > > > > + while (!kdamond_need_stop(ctx)) { > > > > + damon_for_each_task(ctx, t) { > > > > + mm = damon_get_mm(t); > > > > + if (!mm) > > > > + continue; > > > > + damon_for_each_region(r, t) > > > > + kdamond_check_access(ctx, mm, r); > > > > + mmput(mm); > > > > + } > > > > + > > > > + if (kdamond_aggregate_interval_passed(ctx)) > > > > + kdamond_reset_aggregated(ctx); > > > > + > > > > + usleep_range(ctx->sample_interval, ctx->sample_interval + 1); > > > > + } > > > > + damon_for_each_task(ctx, t) { > > > > + damon_for_each_region_safe(r, next, t) > > > > + damon_destroy_region(r); > > > > + } > > > > + pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid); > > > > + mutex_lock(&ctx->kdamond_lock); > > > > + ctx->kdamond = NULL; > > > > + mutex_unlock(&ctx->kdamond_lock); > > > > + > > > > + return 0; > > > > +} > > > > + > > [...] > > > > +/* > > > > + * Start or stop the kdamond > > > > + * > > > > + * Returns 0 if success, negative error code otherwise. > > > > + */ > > > > +static int damon_turn_kdamond(struct damon_ctx *ctx, bool on) > > > > +{ > > > > + int err = -EBUSY; > > > > + > > > > + mutex_lock(&ctx->kdamond_lock); > > > > + if (!ctx->kdamond && on) { > > > > > > Given there is very little shared code between on and off, I would > > > suggest just splitting it into two functions. > > > > Good point, I will do so in next spin. > > > > > > > > > + err = 0; > > > > + ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond"); > > > > + if (IS_ERR(ctx->kdamond)) > > > > + err = PTR_ERR(ctx->kdamond); > > > > + } else if (ctx->kdamond && !on) { > > > > + mutex_unlock(&ctx->kdamond_lock); > > > > + kthread_stop(ctx->kdamond); > > > > + while (damon_kdamond_running(ctx)) > > > > + usleep_range(ctx->sample_interval, > > > > + ctx->sample_interval * 2); > > > > + return 0; > > > > + } > > > > + mutex_unlock(&ctx->kdamond_lock); > > > > + > > > > + return err; > > > > +} > > > > + > > [...] > > > > + > > > > +/* > > > > > > Why not make these actual kernel-doc? That way you can use the > > > kernel-doc scripts to sanity check them. > > > > Oops, I just forgot that it should start with '/**'. Will fix it in next spin. > > > > > > Thanks, > > SeongJae Park > > > > > > > > /** > > > > > > > + * damon_set_attrs() - Set attributes for the monitoring. > > > > + * @ctx: monitoring context > > > > + * @sample_int: time interval between samplings > > > > + * @aggr_int: time interval between aggregations > > > > + * @min_nr_reg: minimal number of regions > > > > + * > > > > + * This function should not be called while the kdamond is running. > > > > + * Every time interval is in micro-seconds. > > > > + * > > > > + * Return: 0 on success, negative error code otherwise. > > > > + */ > > > > +int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int, > > > > + unsigned long aggr_int, unsigned long min_nr_reg) > > > > +{ > > > > + if (min_nr_reg < 3) { > > > > + pr_err("min_nr_regions (%lu) should be bigger than 2\n", > > > > + min_nr_reg); > > > > + return -EINVAL; > > > > + } > > > > + > > > > + ctx->sample_interval = sample_int; > > > > + ctx->aggr_interval = aggr_int; > > > > + ctx->min_nr_regions = min_nr_reg; > > > > + > > > > + return 0; > > > > +} > > > > + > > > > static int __init damon_init(void) > > > > { > > > > return 0; > > >