Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp4153160pxf; Tue, 6 Apr 2021 09:07:32 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyxOpFmEME9gIS9B65Ox/wlT/UqQnPPNGc0Qg6hP25FQiTeqe1MSpNcXRp0iizZzxNnud++ X-Received: by 2002:a05:6402:57:: with SMTP id f23mr38116317edu.323.1617725252721; Tue, 06 Apr 2021 09:07:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617725252; cv=none; d=google.com; s=arc-20160816; b=McZ7l5UMpx0mMoh1fFCMa/WbBhYilcR7ILFUQFgBuFpLmNbrIcA0fb8aYvzOh51SeX +ptf4qPozh6ki7iQjOyEpHfV5HqQw+cxXAgRQM8O8OSvUyoi2ZzRy04mDVTUBq9BACkp C+qb+jjUKgR5/7UAiF6cLS8NLlDXxg5NhtwVDMU9iDIuYPry4xWfxqzYh/oAucYZbSiP /tc31Dc71Hn9kP4H7PfuRt5fovWht5p8QNvbrWyXhuxCFe/r9mpqpZ2TQhsoxlJ3PkIJ pZElSM9WvFT7Kb2ZJP/OKOuHW+pyE8fwBsoDwXn8cX+Ac9O41iAEqF/BpXeCSGkCuaJE G9BA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject:reply-to; bh=gXcUVxXywZxM88umSPIk9XhoEERGpSkMWZnQ3NAjgl0=; b=W1CKFQdYrq+eacIqsl530GzUry2uwiwePfgAyE7TNmjOb84hRugbGc/EFch78kfHKC ZLNc2cuO43MtSyki50sobe1IOr6Wiw9KF9InkRHKQvg0nPTOifEtB4TIjEJR+vRdSo8Y YwTO8wX/CyWjiWOSkt5gVMJcFvmXEKJDevBdW7lZm4FpdqbRuAJcEzhTrWgLs4s03EUb OZNtUaJnKpDEYNUAeapXCYqrpmmHfGQTWYBd1FOVZGJQ+hmBV7grM5Jsv7l+NHgKReOj 98VZnBwGA3BrfzjvblF/TnHTtiugZiy+otGPhf8w81F633W8RiRpliiMETfn4mHwFEWD QIbw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id ko14si17138098ejc.419.2021.04.06.09.07.07; Tue, 06 Apr 2021 09:07:32 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233413AbhDFGxz (ORCPT + 99 others); Tue, 6 Apr 2021 02:53:55 -0400 Received: from out4436.biz.mail.alibaba.com ([47.88.44.36]:19981 "EHLO out4436.biz.mail.alibaba.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230296AbhDFGxz (ORCPT ); Tue, 6 Apr 2021 02:53:55 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R861e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01424;MF=xlpang@linux.alibaba.com;NM=1;PH=DS;RN=7;SR=0;TI=SMTPD_---0UUg-950_1617692015; Received: from xunleideMacBook-Pro.local(mailfrom:xlpang@linux.alibaba.com fp:SMTPD_---0UUg-950_1617692015) by smtp.aliyun-inc.com(127.0.0.1); Tue, 06 Apr 2021 14:53:36 +0800 Reply-To: xlpang@linux.alibaba.com Subject: Re: [PATCH 2/4] mm/page_reporting: Introduce free page reporting factor To: Alexander Duyck , Xunlei Pang Cc: Andrew Morton , Alexander Duyck , Mel Gorman , LKML , linux-mm References: <1616751898-58393-1-git-send-email-xlpang@linux.alibaba.com> <1616751898-58393-3-git-send-email-xlpang@linux.alibaba.com> From: Xunlei Pang Message-ID: <886910ff-f19d-0aa4-e0ad-0aa9072526ee@linux.alibaba.com> Date: Tue, 6 Apr 2021 14:53:35 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:78.0) Gecko/20100101 Thunderbird/78.9.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 4/3/21 2:56 AM, Alexander Duyck wrote: > On Fri, Mar 26, 2021 at 2:45 AM Xunlei Pang wrote: >> >> Add new "/sys/kernel/mm/page_reporting/reporting_factor" >> within [0, 100], and stop page reporting when it reaches >> the configured threshold. Default is 100 which means no >> limitation is imposed. Percentile is adopted to reflect >> the fact that it reports on the per-zone basis. >> >> We can control the total number of reporting pages via >> this knob to avoid EPT violations which may affect the >> performance of the business, imagine the guest memory >> allocation burst or host long-tail memory reclaiming >> really hurt. > > I'm not a fan of the concept as I don't think it really does what it > was meant to do. The way page reporting was meant to work is that when > we have enough free pages we will cycle through memory a few pages at > a time reporting what is unused to the hypervisor. It was meant to be > a scan more than something that just would stop once it touched a > certain part of the memory. > > If you are wanting to truly reserve some amount of memory so that it > is always left held by the guest then it might make more sense to make > the value a fixed amount of memory rather than trying to do it as a > percentage. > > Also we may need to look at adding some sort of > linearization/defragmentation logic for the reported pages. One issue > is that there are several things that will add pages to the end of the > free page lists. One of the reasons why I was processing the entire > list when I was processing reported pages was because the page freeing > functions will normally cause pages to be interleaved with the > reported pages on the end of the list. So if you are wanting to > reserve some pages as being non-reported we may need to add something > sort the lists periodically. Yes, agreed. To make the counter accurate, I also noticed this problem, I'm going to figure out a way to handle it, e.g. maybe adding a new migratetype for reported free_list is a good choice, this also helps reduce the zone lock latency during the report procedue on large VM, which can be in milliseconds. > >> This knob can help make customized control policies according >> to VM priority, it is also useful for testing, gray-release, etc. > > As far as the knob itself it would make sense to combine this with > patch 3 since they are just different versions of the same control > >> --- >> mm/page_reporting.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 59 insertions(+), 1 deletion(-) >> >> diff --git a/mm/page_reporting.c b/mm/page_reporting.c >> index ba195ea..86c6479 100644 >> --- a/mm/page_reporting.c >> +++ b/mm/page_reporting.c >> @@ -11,6 +11,8 @@ >> #include "page_reporting.h" >> #include "internal.h" >> >> +static int reporting_factor = 100; >> + >> #define PAGE_REPORTING_DELAY (2 * HZ) >> static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly; >> >> @@ -134,6 +136,7 @@ void __page_reporting_notify(void) >> struct list_head *list = &area->free_list[mt]; >> unsigned int page_len = PAGE_SIZE << order; >> struct page *page, *next; >> + unsigned long threshold; >> long budget; >> int err = 0; >> >> @@ -144,6 +147,7 @@ void __page_reporting_notify(void) >> if (list_empty(list)) >> return err; >> >> + threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100; > > So at 0 you are setting this threshold to 0, however based on the code > below you are still pulling at least one page. > >> spin_lock_irq(&zone->lock); >> >> /* >> @@ -181,6 +185,8 @@ void __page_reporting_notify(void) >> >> /* Attempt to pull page from list and place in scatterlist */ >> if (*offset) { >> + unsigned long nr_pages; >> + >> if (!__isolate_free_page(page, order)) { >> next = page; >> break; >> @@ -190,6 +196,12 @@ void __page_reporting_notify(void) >> --(*offset); >> sg_set_page(&sgl[*offset], page, page_len, 0); >> >> + nr_pages = (PAGE_REPORTING_CAPACITY - *offset) << order; >> + if (zone->reported_pages + nr_pages >= threshold) { >> + err = 1; >> + break; >> + } >> + > > So here we are checking the threshold after we have already pulled the > page. With this being the case it might make more sense to either > allow for the full capacity of pages to be pulled and then check this > after they have been reported, or to move this check up to somewhere > before you start processing the pages. What you want to avoid is > having to perform this check for every individual page. > >> continue; >> } >> >> @@ -244,9 +256,13 @@ void __page_reporting_notify(void) >> struct scatterlist *sgl, struct zone *zone) >> { >> unsigned int order, mt, leftover, offset = PAGE_REPORTING_CAPACITY; >> - unsigned long watermark; >> + unsigned long watermark, threshold; >> int err = 0; >> >> + threshold = atomic_long_read(&zone->managed_pages) * reporting_factor / 100; >> + if (zone->reported_pages >= threshold) >> + return err; >> + > > Rather than having to calculate the threshold in multiple spots it > might make more sense to move this to the start of > page_reporting_cycle and have it performed again before we reacquire > the spinlock and run page_reporting_drain. > >> /* Generate minimum watermark to be able to guarantee progress */ >> watermark = low_wmark_pages(zone) + >> (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER); >> @@ -267,11 +283,18 @@ void __page_reporting_notify(void) >> >> err = page_reporting_cycle(prdev, zone, order, mt, >> sgl, &offset); >> + /* Exceed threshold go to report leftover */ >> + if (err > 0) { >> + err = 0; >> + goto leftover; >> + } >> + >> if (err) >> return err; >> } >> } >> >> +leftover: >> /* report the leftover pages before going idle */ >> leftover = PAGE_REPORTING_CAPACITY - offset; >> if (leftover) { > > You should optimize for processing full batches rather than chopping > things up into smaller groupings. > >> @@ -435,9 +458,44 @@ static ssize_t refault_kbytes_store(struct kobject *kobj, >> } >> REPORTING_ATTR(refault_kbytes); >> >> +static ssize_t reporting_factor_show(struct kobject *kobj, >> + struct kobj_attribute *attr, char *buf) >> +{ >> + return sprintf(buf, "%u\n", reporting_factor); >> +} >> + >> +static ssize_t reporting_factor_store(struct kobject *kobj, >> + struct kobj_attribute *attr, >> + const char *buf, size_t count) >> +{ >> + int new, old, err; >> + struct page *page; >> + >> + err = kstrtoint(buf, 10, &new); >> + if (err || (new < 0 || new > 100)) >> + return -EINVAL; >> + >> + old = reporting_factor; >> + reporting_factor = new; >> + >> + if (new <= old) >> + goto out; >> + >> + /* Trigger reporting with new larger reporting_factor */ >> + page = alloc_pages(__GFP_HIGHMEM | __GFP_NOWARN, >> + PAGE_REPORTING_MIN_ORDER); >> + if (page) >> + __free_pages(page, PAGE_REPORTING_MIN_ORDER); >> + >> +out: >> + return count; >> +} >> +REPORTING_ATTR(reporting_factor); >> + >> static struct attribute *reporting_attrs[] = { >> &reported_kbytes_attr.attr, >> &refault_kbytes_attr.attr, >> + &reporting_factor_attr.attr, >> NULL, >> }; >> >> -- >> 1.8.3.1 >> >>