From: Waiman Long
To: Tejun Heo, Li Zefan, Johannes Weiner, Jonathan Corbet, Michal Hocko, Vladimir Davydov
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Roman Gushchin, Shakeel Butt, Kirill Tkhai, Aaron Lu, Waiman Long
Subject: [RFC PATCH 1/2] mm/memcontrol: Finer-grained control for subset of allocated memory
Date: Wed, 10 Apr 2019 15:13:20 -0400
Message-Id: <20190410191321.9527-2-longman@redhat.com>
In-Reply-To: <20190410191321.9527-1-longman@redhat.com>
References: <20190410191321.9527-1-longman@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The current control mechanism for memory cgroup v2 lumps all the memory
together irrespective of the type of memory objects. However, there
are cases where users may have more concern about one type of memory
usage than the others.

In order to support finer-grained control of memory usage, the following
two new cgroup v2 control files are added:

 - memory.subset.list
   Either "" (default), "anon" (anonymous memory) or "file" (file
   cache). It specifies the type of memory objects we want to monitor.
 - memory.subset.high
   The high memory limit for the memory type specified in
   "memory.subset.list".

For simplicity, the limit is for memory usage by all the tasks within
the current memory cgroup only. It doesn't include memory usage by
other tasks in child memory cgroups. Hence, we can just check the
corresponding stat[] array entry of the selected memory type to see
if it is above the limit.

We currently don't have the capability to specify the type of memory
objects to reclaim. When memory reclaim is triggered after reaching
the "memory.subset.high" limit, other types of memory objects will also
be reclaimed.

In the future, we may extend this capability to allow even more
fine-grained selection of memory types as well as a combination of them
if the need arises.

A test program was written to allocate 1 GB of memory and then
touch every page of it. This program was then run in a memory cgroup:

  # echo anon > memory.subset.list
  # echo 10485760 > memory.subset.high
  # echo $$ > cgroup.procs
  # ~/touch-1gb

While the test program was running:

  # grep -w anon memory.stat
  anon 10817536

It was a bit higher than the limit (by about 3%), but that should be OK.
Without setting the limit, the output would be

  # grep -w anon memory.stat
  anon 1074335744

Signed-off-by: Waiman Long
---
 Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
 include/linux/memcontrol.h              |  7 ++
 mm/memcontrol.c                         | 96 ++++++++++++++++++++++++-
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..0d5b7c77897d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1080,6 +1080,41 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.

+  memory.subset.high
+	A read-write single value file which exists on non-root cgroups.
+	The default is "max".
+
+	Memory usage throttle limit for a subset of memory objects with
+	types specified in "memory.subset.list". If a cgroup's usage for
+	those memory objects goes over the high boundary, the processes
+	of the cgroup are throttled and put under heavy reclaim pressure.
+
+	This throttle limit is not allowed to go higher than
+	"memory.high" and will be adjusted accordingly when "memory.high"
+	is changed. Because of that, "memory.subset.list" should always
+	be set first before assigning a limit to this file.
+
+	Unlike "memory.high", "memory.subset.high" does not count memory
+	objects usage in child cgroups.
+
+	Going over the high limit never invokes the OOM killer and
+	under extreme conditions the limit may be breached.
+
+  memory.subset.list
+	A read-write single value file which exists on non-root cgroups.
+	The default is "" which means no separate memory subcomponent
+	tracking and throttling.
+
+	Currently, only the following two primary subcomponent types are
+	supported:
+
+	  - anon (anonymous memory)
+	  - file (filesystem cache, including tmpfs and shared memory)
+
+	The value of this file should either be "", "anon" or "file".
+	Changing its value resets "memory.subset.high" to be the same
+	as "memory.high".
+
   memory.oom.group
 	A read-write single value file which exists on non-root cgroups.
 	The default value is "0".
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f3d880b7ca1..1baf3e4a9eeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -212,6 +212,13 @@ struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;

+	/*
+	 * Upper memory consumption bound for a subset of memory object type
+	 * specified in subset_list for the current cgroup only.
+	 */
+	unsigned long subset_high;
+	unsigned long subset_list;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 532e0e2a4817..7e52adea60d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2145,6 +2145,14 @@ static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	int mtype = READ_ONCE(memcg->subset_list);
+
+	/*
+	 * Try memory reclaim if subset_high is exceeded.
+	 */
+	if (mtype && (memcg_page_state(memcg, mtype) > memcg->subset_high))
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+
 	do {
 		if (page_counter_read(&memcg->memory) <= memcg->high)
 			continue;
@@ -2190,6 +2198,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool may_swap = true;
 	bool drained = false;
 	bool oomed = false;
+	bool over_subset_high = false;
 	enum oom_status oom_status;

 	if (mem_cgroup_is_root(memcg))
@@ -2323,6 +2332,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);

+	if (memcg->subset_list &&
+	    (memcg_page_state(memcg, memcg->subset_list) > memcg->subset_high))
+		over_subset_high = true;
+
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland. We can perform reclaim here
@@ -2333,7 +2346,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > memcg->high) {
+		if (page_counter_read(&memcg->memory) > memcg->high ||
+		    over_subset_high) {
 			/* Don't bother a random interrupted task */
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
@@ -2343,6 +2357,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			set_notify_resume(current);
 			break;
 		}
+		over_subset_high = false;
 	} while ((memcg = parent_mem_cgroup(memcg)));

 	return 0;
@@ -4491,6 +4506,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(error);

 	memcg->high = PAGE_COUNTER_MAX;
+	memcg->subset_high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
@@ -5447,6 +5463,13 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,

 	memcg->high = high;

+	/*
+	 * Synchronize subset_high if subset_list not set and lower
+	 * subset_high, if necessary.
+	 */
+	if (!memcg->subset_list || (high < memcg->subset_high))
+		memcg->subset_high = high;
+
 	nr_pages = page_counter_read(&memcg->memory);
 	if (nr_pages > high)
 		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
@@ -5511,6 +5534,65 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+static int memory_subset_high_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+			READ_ONCE(mem_cgroup_from_seq(m)->subset_high));
+}
+
+static ssize_t memory_subset_high_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	if (high > memcg->high)
+		return -EINVAL;
+
+	memcg->subset_high = high;
+	return nbytes;
+}
+
+static int memory_subset_list_show(struct seq_file *m, void *v)
+{
+	unsigned long mtype = READ_ONCE(mem_cgroup_from_seq(m)->subset_list);
+
+	seq_puts(m, (mtype == MEMCG_RSS)   ? "anon\n" :
+		    (mtype == MEMCG_CACHE) ? "file\n" : "\n");
+	return 0;
+}
+
+static ssize_t memory_subset_list_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long mtype;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, "anon"))
+		mtype = MEMCG_RSS;
+	else if (!strcmp(buf, "file"))
+		mtype = MEMCG_CACHE;
+	else if (buf[0] == '\0')
+		mtype = 0;
+	else
+		return -EINVAL;
+
+	if (mtype == memcg->subset_list)
+		return nbytes;
+
+	memcg->subset_list = mtype;
+	/* Reset subset_high */
+	memcg->subset_high = memcg->high;
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -5699,6 +5781,18 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "subset.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_subset_high_show,
+		.write = memory_subset_high_write,
+	},
+	{
+		.name = "subset.list",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_subset_list_show,
+		.write = memory_subset_list_write,
+	},
 	{ }	/* terminate */
 };

-- 
2.18.1