2019-04-10 19:14:46

by Waiman Long

Subject: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

The current control mechanism for memory cgroup v2 lumps all the memory
together irrespective of the type of memory objects. However, there
are cases where users may have more concern about one type of memory
usage than the others.

We have a customer request to limit memory consumption of anonymous memory
only, as they said the feature was available in other OSes like Solaris.

To allow finer-grained control of memory, this patchset adds two new
control knobs to the memory controller:
 - memory.subset.list for specifying the type of memory to be under control.
 - memory.subset.high for the high limit of memory consumption of that
   memory type.

For simplicity, the limit is not hierarchical and applies to only tasks
in the local memory cgroup.

Waiman Long (2):
  mm/memcontrol: Finer-grained control for subset of allocated memory
  mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event

 Documentation/admin-guide/cgroup-v2.rst |  35 +++++++++
 include/linux/memcontrol.h              |   8 ++
 mm/memcontrol.c                         | 100 +++++++++++++++++++++++-
 3 files changed, 142 insertions(+), 1 deletion(-)

--
2.18.1


2019-04-10 19:14:49

by Waiman Long

Subject: [RFC PATCH 1/2] mm/memcontrol: Finer-grained control for subset of allocated memory

The current control mechanism for memory cgroup v2 lumps all the memory
together irrespective of the type of memory objects. However, there
are cases where users may have more concern about one type of memory
usage than the others.

In order to support finer-grained control of memory usage, the following
two new cgroup v2 control files are added:

 - memory.subset.list
   Either "" (default), "anon" (anonymous memory) or "file" (file
   cache). It specifies the type of memory objects we want to monitor.
 - memory.subset.high
   The high memory limit for the memory type specified in
   "memory.subset.list".

For simplicity, the limit is for memory usage by all the tasks within
the current memory cgroup only. It doesn't include memory usage by
other tasks in child memory cgroups. Hence, we can just check the
corresponding stat[] array entry of the selected memory type to see if
it is above the limit.
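
As an illustration only, the local check described above can be modelled in
user space roughly as follows. The struct, enum and function names here are
hypothetical stand-ins, not the kernel's types; the real patch reads the
counter through memcg_page_state().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the stat indices; 0 means "no subset selected". */
enum subset_type { SUBSET_NONE = 0, SUBSET_ANON = 1, SUBSET_FILE = 2, SUBSET_NR };

/* Simplified model of one cgroup: local usage only, no children included. */
struct memcg_model {
	unsigned long stat[SUBSET_NR];	/* local usage in pages, per memory type */
	unsigned long subset_high;	/* limit in pages for the selected type */
	enum subset_type subset_list;	/* which memory type is being limited */
};

/* Return true if the selected memory type is above its local high limit. */
static bool over_subset_high(const struct memcg_model *cg)
{
	assert(cg != NULL);
	if (cg->subset_list == SUBSET_NONE)
		return false;
	return cg->stat[cg->subset_list] > cg->subset_high;
}
```

Because only one array entry of the local cgroup is read, the check is a
single comparison and no walk over the hierarchy is needed.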

We currently don't have the capability to specify the type of memory
objects to reclaim. When memory reclaim is triggered after reaching
the "memory.subset.high" limit, other types of memory objects will also
be reclaimed.

In the future, we may extend this capability to allow even more
fine-grained selection of memory types as well as a combination of them
if the need arises.

A test program was written to allocate 1 GB of memory and then
touch every page of it. This program was then run in a memory cgroup:

# echo anon > memory.subset.list
# echo 10485760 > memory.subset.high
# echo $$ > cgroup.procs
# ~/touch-1gb

While the test program was running:

# grep -w anon memory.stat
anon 10817536

It was a bit higher than the limit, but that should be OK.

Without setting the limit, the output would be

# grep -w anon memory.stat
anon 1074335744
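
The test program itself was not posted. A minimal sketch of what such a
program could look like is given below; the function name touch_pages() and
its structure are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Allocate `bytes` of anonymous memory and write one byte per page so that
 * every page is actually faulted in. Returns the number of pages touched,
 * or 0 if the allocation failed or `bytes` is 0.
 */
static size_t touch_pages(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);

	/* volatile keeps the compiler from optimizing the stores away */
	volatile unsigned char *buf = malloc(bytes);
	if (!buf)
		return 0;

	size_t touched = 0;
	for (size_t off = 0; off < bytes; off += (size_t)page) {
		buf[off] = 1;
		touched++;
	}

	free((void *)buf);
	return touched;
}
```

Calling touch_pages((size_t)1 << 30) from main() would correspond to the
1 GB case above. The buffer would have to stay allocated for as long as
memory.stat is being inspected, so a real test would likely pause before
freeing it.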

Signed-off-by: Waiman Long <[email protected]>
---
 Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
 include/linux/memcontrol.h              |  7 ++
 mm/memcontrol.c                         | 96 ++++++++++++++++++++++++-
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 20f92c16ffbf..0d5b7c77897d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1080,6 +1080,41 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.

+  memory.subset.high
+	A read-write single value file which exists on non-root cgroups.
+	The default is "max".
+
+	Memory usage throttle limit for a subset of memory objects with
+	types specified in "memory.subset.list".  If a cgroup's usage for
+	those memory objects goes over the high boundary, the processes
+	of the cgroup are throttled and put under heavy reclaim pressure.
+
+	This throttle limit is not allowed to go higher than
+	"memory.high" and will be adjusted accordingly when "memory.high"
+	is changed.  Because of that, "memory.subset.list" should always
+	be set first before assigning a limit to this file.
+
+	Unlike "memory.high", "memory.subset.high" does not count memory
+	objects usage in child cgroups.
+
+	Going over the high limit never invokes the OOM killer and
+	under extreme conditions the limit may be breached.
+
+  memory.subset.list
+	A read-write single value file which exists on non-root cgroups.
+	The default is "" which means no separate memory subcomponent
+	tracking and throttling.
+
+	Currently, only the following two primary subcomponent types are
+	supported:
+
+	- anon (anonymous memory)
+	- file (filesystem cache, including tmpfs and shared memory)
+
+	The value of this file should either be "", "anon" or "file".
+	Changing its value resets "memory.subset.high" to be the same
+	as "memory.high".
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups.  The default value is "0".
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1f3d880b7ca1..1baf3e4a9eeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -212,6 +212,13 @@ struct mem_cgroup {
 	/* Upper bound of normal memory consumption range */
 	unsigned long high;

+	/*
+	 * Upper memory consumption bound for a subset of memory object type
+	 * specified in subset_list for the current cgroup only.
+	 */
+	unsigned long subset_high;
+	unsigned long subset_list;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 532e0e2a4817..7e52adea60d9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2145,6 +2145,14 @@ static void reclaim_high(struct mem_cgroup *memcg,
 			 unsigned int nr_pages,
 			 gfp_t gfp_mask)
 {
+	int mtype = READ_ONCE(memcg->subset_list);
+
+	/*
+	 * Try memory reclaim if subset_high is exceeded.
+	 */
+	if (mtype && (memcg_page_state(memcg, mtype) > memcg->subset_high))
+		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+
 	do {
 		if (page_counter_read(&memcg->memory) <= memcg->high)
 			continue;
@@ -2190,6 +2198,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	bool may_swap = true;
 	bool drained = false;
 	bool oomed = false;
+	bool over_subset_high = false;
 	enum oom_status oom_status;

 	if (mem_cgroup_is_root(memcg))
@@ -2323,6 +2332,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);

+	if (memcg->subset_list &&
+	    (memcg_page_state(memcg, memcg->subset_list) > memcg->subset_high))
+		over_subset_high = true;
+
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland. We can perform reclaim here
@@ -2333,7 +2346,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * reclaim, the cost of mismatch is negligible.
 	 */
 	do {
-		if (page_counter_read(&memcg->memory) > memcg->high) {
+		if (page_counter_read(&memcg->memory) > memcg->high ||
+		    over_subset_high) {
 			/* Don't bother a random interrupted task */
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
@@ -2343,6 +2357,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			set_notify_resume(current);
 			break;
 		}
+		over_subset_high = false;
 	} while ((memcg = parent_mem_cgroup(memcg)));

 	return 0;
@@ -4491,6 +4506,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_PTR(error);

 	memcg->high = PAGE_COUNTER_MAX;
+	memcg->subset_high = PAGE_COUNTER_MAX;
 	memcg->soft_limit = PAGE_COUNTER_MAX;
 	if (parent) {
 		memcg->swappiness = mem_cgroup_swappiness(parent);
@@ -5447,6 +5463,13 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,

 	memcg->high = high;

+	/*
+	 * Synchronize subset_high if subset_list not set and lower
+	 * subset_high, if necessary.
+	 */
+	if (!memcg->subset_list || (high < memcg->subset_high))
+		memcg->subset_high = high;
+
 	nr_pages = page_counter_read(&memcg->memory);
 	if (nr_pages > high)
 		try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
@@ -5511,6 +5534,65 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }

+static int memory_subset_high_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+			READ_ONCE(mem_cgroup_from_seq(m)->subset_high));
+}
+
+static ssize_t memory_subset_high_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long high;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &high);
+	if (err)
+		return err;
+
+	if (high > memcg->high)
+		return -EINVAL;
+
+	memcg->subset_high = high;
+	return nbytes;
+}
+
+static int memory_subset_list_show(struct seq_file *m, void *v)
+{
+	unsigned long mtype = READ_ONCE(mem_cgroup_from_seq(m)->subset_list);
+
+	seq_puts(m, (mtype == MEMCG_RSS) ? "anon\n" :
+		    (mtype == MEMCG_CACHE) ? "file\n" : "\n");
+	return 0;
+}
+
+static ssize_t memory_subset_list_write(struct kernfs_open_file *of,
+					char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long mtype;
+
+	buf = strstrip(buf);
+	if (!strcmp(buf, "anon"))
+		mtype = MEMCG_RSS;
+	else if (!strcmp(buf, "file"))
+		mtype = MEMCG_CACHE;
+	else if (buf[0] == '\0')
+		mtype = 0;
+	else
+		return -EINVAL;
+
+	if (mtype == memcg->subset_list)
+		return nbytes;
+
+	memcg->subset_list = mtype;
+	/* Reset subset_high */
+	memcg->subset_high = memcg->high;
+	return nbytes;
+}
+
 static int memory_events_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -5699,6 +5781,18 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "subset.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_subset_high_show,
+		.write = memory_subset_high_write,
+	},
+	{
+		.name = "subset.list",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_subset_list_show,
+		.write = memory_subset_list_write,
+	},
 	{ }	/* terminate */
 };

--
2.18.1

2019-04-10 19:14:59

by Waiman Long

Subject: [RFC PATCH 2/2] mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event

A new MEMCG_SUBSET_HIGH event is added to record the number of times
the memory.subset.high value is exceeded.

Signed-off-by: Waiman Long <[email protected]>
---
 include/linux/memcontrol.h | 1 +
 mm/memcontrol.c            | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1baf3e4a9eeb..4498db61507a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,6 +52,7 @@ enum memcg_memory_event {
 	MEMCG_LOW,
 	MEMCG_HIGH,
 	MEMCG_MAX,
+	MEMCG_SUBSET_HIGH,
 	MEMCG_OOM,
 	MEMCG_OOM_KILL,
 	MEMCG_SWAP_MAX,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e52adea60d9..feba8b9c55b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2150,8 +2150,10 @@ static void reclaim_high(struct mem_cgroup *memcg,
 	/*
 	 * Try memory reclaim if subset_high is exceeded.
 	 */
-	if (mtype && (memcg_page_state(memcg, mtype) > memcg->subset_high))
+	if (mtype && (memcg_page_state(memcg, mtype) > memcg->subset_high)) {
+		memcg_memory_event(memcg, MEMCG_SUBSET_HIGH);
 		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
+	}

 	do {
 		if (page_counter_read(&memcg->memory) <= memcg->high)
@@ -5603,6 +5605,8 @@ static int memory_events_show(struct seq_file *m, void *v)
 		   atomic_long_read(&memcg->memory_events[MEMCG_HIGH]));
 	seq_printf(m, "max %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_MAX]));
+	seq_printf(m, "subset_high %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_SUBSET_HIGH]));
 	seq_printf(m, "oom %lu\n",
 		   atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
 	seq_printf(m, "oom_kill %lu\n",
--
2.18.1

2019-04-10 19:57:07

by Michal Hocko

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On Wed 10-04-19 15:13:19, Waiman Long wrote:
> The current control mechanism for memory cgroup v2 lumps all the memory
> together irrespective of the type of memory objects. However, there
> are cases where users may have more concern about one type of memory
> usage than the others.
>
> We have customer request to limit memory consumption on anonymous memory
> only as they said the feature was available in other OSes like Solaris.

Please be more specific about a usecase.

> To allow finer-grained control of memory, this patchset 2 new control
> knobs for memory controller:
> - memory.subset.list for specifying the type of memory to be under control.
> - memory.subset.high for the high limit of memory consumption of that
> memory type.

Please be more specific about the semantic.

I am really skeptical about this feature to be honest, though.

> For simplicity, the limit is not hierarchical and applies to only tasks
> in the local memory cgroup.

This is a no-go to begin with.

> Waiman Long (2):
> mm/memcontrol: Finer-grained control for subset of allocated memory
> mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event
>
> Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
> include/linux/memcontrol.h | 8 ++
> mm/memcontrol.c | 100 +++++++++++++++++++++++-
> 3 files changed, 142 insertions(+), 1 deletion(-)
>
> --
> 2.18.1

--
Michal Hocko
SUSE Labs

2019-04-10 21:39:20

by Chris Down

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

Hi Waiman,

Waiman Long writes:
>The current control mechanism for memory cgroup v2 lumps all the memory
>together irrespective of the type of memory objects. However, there
>are cases where users may have more concern about one type of memory
>usage than the others.

I have concerns about this implementation, and the overall idea in general. We
had per-class memory limiting in the cgroup v1 API, and it ended up really
poorly, and resulted in a situation where it's really hard to compose a usable
system out of it any more.

A major part of the restructure in cgroup v2 has been to simplify things so
that it's easier to understand for service owners and sysadmins. This was
intentional, because otherwise the system overall is hard to make into
something that does what users *really* want, and users end up with a lot of
confusion, misconfiguration, and generally an inability to produce a coherent
system, because we've made things too hard to piece together.

In general, for purposes of resource control, I'm not convinced that it makes
sense to limit only one kind of memory based on prior experience with v1. Can
you give a production use case where this would be a clear benefit, traded off
against the increase in complexity to the API?

>For simplicity, the limit is not hierarchical and applies to only tasks
>in the local memory cgroup.

We've made an explicit effort to make all things hierarchical -- this confuses
things further. Even if we did have something like this, it would have to
respect the hierarchy, we really don't want to return to the use_hierarchy
days where users, sysadmins, and even ourselves are confused by the resource
control semantics that are supposed to be achieved.

>We have customer request to limit memory consumption on anonymous memory
>only as they said the feature was available in other OSes like Solaris.

What's the production use case where this is demonstrably providing clear
benefits in terms of resource control? How can it compose as part of an easy to
understand, resource controlling system? I'd like to see a lot more information
on why this is needed, and the usability and technical tradeoffs considered.

Thanks,

Chris

2019-04-11 14:03:34

by Waiman Long

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On 04/10/2019 03:54 PM, Michal Hocko wrote:
> On Wed 10-04-19 15:13:19, Waiman Long wrote:
>> The current control mechanism for memory cgroup v2 lumps all the memory
>> together irrespective of the type of memory objects. However, there
>> are cases where users may have more concern about one type of memory
>> usage than the others.
>>
>> We have customer request to limit memory consumption on anonymous memory
>> only as they said the feature was available in other OSes like Solaris.
> Please be more specific about a usecase.

From that customer's point of view, page cache is more like a common good
that can typically be shared by a number of different groups. Depending
on which groups touch the pages first, it is possible that most of those
pages are disproportionately attributed to one group rather than the
others. Anonymous memory, on the other hand, is not shared and so can more
correctly represent the memory footprint of an application. Of course,
there are certainly cases where an application can have large private
files that consume a lot of cache pages. This is probably not the
case for the applications used by that customer.

>
>> To allow finer-grained control of memory, this patchset 2 new control
>> knobs for memory controller:
>> - memory.subset.list for specifying the type of memory to be under control.
>> - memory.subset.high for the high limit of memory consumption of that
>> memory type.
> Please be more specific about the semantic.
>
> I am really skeptical about this feature to be honest, though.
>

Please see patch 1 which has a more detailed description. This is just
an overview for the cover letter.

>> For simplicity, the limit is not hierarchical and applies to only tasks
>> in the local memory cgroup.
> This is a no-go to begin with.

The reason for doing that is to introduce as little overhead as
possible. We can certainly make it hierarchical, but it will complicate
the code and increase runtime overhead. Another alternative is to limit
this feature to only leaf memory cgroups. That should be enough to cover
what the customer is asking for and leave room for future hierarchical
extension, if needed.
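
To make the overhead argument concrete, here is a rough user-space sketch of
what a hierarchical variant of the check would involve. The cg_node struct
and charge_subset() are hypothetical names used only for illustration; they
are not part of the posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: usage here covers this cgroup and all descendants. */
struct cg_node {
	struct cg_node *parent;		/* NULL for the root */
	unsigned long subset_usage;	/* pages of the selected type, hierarchical */
	unsigned long subset_high;	/* limit in pages at this level */
};

/*
 * Charge nr_pages to cg and to every ancestor. Returns the lowest cgroup
 * in the chain that is now above its limit, or NULL if none is.
 */
static struct cg_node *charge_subset(struct cg_node *cg, unsigned long nr_pages)
{
	struct cg_node *over = NULL;

	assert(nr_pages > 0);
	for (; cg; cg = cg->parent) {
		cg->subset_usage += nr_pages;
		if (!over && cg->subset_usage > cg->subset_high)
			over = cg;
	}
	return over;
}
```

Every charge touches one counter per level of the hierarchy, which is the
extra runtime cost referred to above. The non-hierarchical version only
reads the local counter.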

Cheers,
Longman

2019-04-11 14:23:25

by Waiman Long

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On 04/10/2019 05:38 PM, Chris Down wrote:
> Hi Waiman,
>
> Waiman Long writes:
>> The current control mechanism for memory cgroup v2 lumps all the memory
>> together irrespective of the type of memory objects. However, there
>> are cases where users may have more concern about one type of memory
>> usage than the others.
>
> I have concerns about this implementation, and the overall idea in
> general. We had per-class memory limiting in the cgroup v1 API, and it
> ended up really poorly, and resulted in a situation where it's really
> hard to compose a usable system out of it any more.
>
> A major part of the restructure in cgroup v2 has been to simplify
> things so that it's more easy to understand for service owners and
> sysadmins. This was intentional, because otherwise the system overall
> is hard to make into something that does what users *really* want, and
> users end up with a lot of confusion, misconfiguration, and generally
> an inability to produce a coherent system, because we've made things
> too hard to piece together.
>
> In general, for purposes of resource control, I'm not convinced that
> it makes sense to limit only one kind of memory based on prior
> experience with v1. Can you give a production use case where this
> would be a clear benefit, traded off against the increase in
> complexity to the API?
>

As I said in my previous email on this thread, the customer considers
page cache a common good that does not fully represent the "real" memory
footprint used by an application. Depending on the actual mix of
applications running on a system, there are certainly cases where their
view is correct. In fact, what the customer is asking for is not even
provided by the v1 API, even with the many classes of memory that you
can choose from.

>> For simplicity, the limit is not hierarchical and applies to only tasks
>> in the local memory cgroup.
>
> We've made an explicit effort to make all things hierarchical -- this
> confuses things further. Even if we did have something like this, it
> would have to respect the hierarchy, we really don't want to return to
> the use_hierarchy days where users, sysadmins, and even ourselves are
> confused by the resource control semantics that are supposed to be
> achieved.

I see your point. I am now suggesting that this new feature be limited
to just leaf memory cgroups for now. We can extend it to full
hierarchical support in the future if necessary.

>
>> We have customer request to limit memory consumption on anonymous memory
>> only as they said the feature was available in other OSes like Solaris.
>
> What's the production use case where this is demonstrably providing
> clear benefits in terms of resource control? How can it compose as
> part of an easy to understand, resource controlling system? I'd like
> to see a lot more information on why this is needed, and the usability
> and technical tradeoffs considered.

Simply put, the customers want to control and limit memory consumption
based on the anonymous memory (RSS) that is used by the applications.
This was what they were doing in the past and their tooling was based on
this. They want to continue doing that after migrating to Linux. Once
page cache is added into the mix, they don't know how they should handle it.
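
For illustration, tooling of that kind could read the "anon" counter from
the text of a cgroup's memory.stat file along these lines. This is only a
sketch; memcg_stat_lookup() is a hypothetical helper name, and the caller is
assumed to have already read the file contents into a string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "<key> <value>" at the start of a line in memory.stat-style text.
 * Returns 0 and stores the value on success, -1 if the key is not present.
 */
static int memcg_stat_lookup(const char *text, const char *key,
			     unsigned long long *value)
{
	size_t klen = strlen(key);
	const char *line = text;

	assert(value != NULL);
	while (line && *line) {
		/* Require a space after the key so "anon" does not match "anon_thp" */
		if (!strncmp(line, key, klen) && line[klen] == ' ' &&
		    sscanf(line + klen + 1, "%llu", value) == 1)
			return 0;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

The lookup matches the key only at the start of a line, which is similar in
effect to the "grep -w anon memory.stat" used in patch 1.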

Cheers,
Longman

2019-04-11 14:39:08

by Kirill Tkhai

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On 10.04.2019 22:13, Waiman Long wrote:
> The current control mechanism for memory cgroup v2 lumps all the memory
> together irrespective of the type of memory objects. However, there
> are cases where users may have more concern about one type of memory
> usage than the others.
>
> We have customer request to limit memory consumption on anonymous memory
> only as they said the feature was available in other OSes like Solaris.
>
> To allow finer-grained control of memory, this patchset 2 new control
> knobs for memory controller:
> - memory.subset.list for specifying the type of memory to be under control.
> - memory.subset.high for the high limit of memory consumption of that
> memory type.
>
> For simplicity, the limit is not hierarchical and applies to only tasks
> in the local memory cgroup.
>
> Waiman Long (2):
> mm/memcontrol: Finer-grained control for subset of allocated memory
> mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event
>
> Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
> include/linux/memcontrol.h | 8 ++
> mm/memcontrol.c | 100 +++++++++++++++++++++++-
> 3 files changed, 142 insertions(+), 1 deletion(-)

CC Andrey.

In the Virtuozzo kernel we have similar functionality for limiting page cache in a cgroup:

https://github.com/OpenVZ/vzkernel/commit/8ceef5e0c07c7621fcb0e04ccc48a679dfeec4a4

2019-04-11 14:56:17

by Waiman Long

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On 04/11/2019 10:37 AM, Kirill Tkhai wrote:
> On 10.04.2019 22:13, Waiman Long wrote:
>> The current control mechanism for memory cgroup v2 lumps all the memory
>> together irrespective of the type of memory objects. However, there
>> are cases where users may have more concern about one type of memory
>> usage than the others.
>>
>> We have customer request to limit memory consumption on anonymous memory
>> only as they said the feature was available in other OSes like Solaris.
>>
>> To allow finer-grained control of memory, this patchset 2 new control
>> knobs for memory controller:
>> - memory.subset.list for specifying the type of memory to be under control.
>> - memory.subset.high for the high limit of memory consumption of that
>> memory type.
>>
>> For simplicity, the limit is not hierarchical and applies to only tasks
>> in the local memory cgroup.
>>
>> Waiman Long (2):
>> mm/memcontrol: Finer-grained control for subset of allocated memory
>> mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event
>>
>> Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
>> include/linux/memcontrol.h | 8 ++
>> mm/memcontrol.c | 100 +++++++++++++++++++++++-
>> 3 files changed, 142 insertions(+), 1 deletion(-)
> CC Andrey.
>
> In Virtuozzo kernel we have similar functionality for limitation of page cache in a cgroup:
>
> https://github.com/OpenVZ/vzkernel/commit/8ceef5e0c07c7621fcb0e04ccc48a679dfeec4a4

It would be helpful to know the use case where you want to limit page
cache usage. I had anonymous memory in mind when I composed this patch,
but I made the mechanism more generic so that it can apply to other use
cases as well.

Cheers,
Longman

2019-04-11 15:21:50

by Michal Hocko

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On Thu 11-04-19 10:02:16, Waiman Long wrote:
> On 04/10/2019 03:54 PM, Michal Hocko wrote:
> > On Wed 10-04-19 15:13:19, Waiman Long wrote:
> >> The current control mechanism for memory cgroup v2 lumps all the memory
> >> together irrespective of the type of memory objects. However, there
> >> are cases where users may have more concern about one type of memory
> >> usage than the others.
> >>
> >> We have customer request to limit memory consumption on anonymous memory
> >> only as they said the feature was available in other OSes like Solaris.
> > Please be more specific about a usecase.
>
> From that customer's point of view, page cache is more like common goods
> that can typically be shared by a number of different groups. Depending
> on which groups touch the pages first, it is possible that most of those
> pages can be disproportionately attributed to one group than the others.
> Anonymous memory, on the other hand, are not shared and so can more
> correctly represent the memory footprint of an application. Of course,
> there are certainly cases where an application can have large private
> files that can consume a lot of cache pages. These are probably not the
> case for the applications used by that customer.

So you are essentially interested in the page cache limiting, right?
This has been proposed several times already and always rejected because
this is not a good idea.

I would really like to see a more specific example where this makes
sense. False sharing can certainly happen, no question about that,
but then how big of a problem is that? Please give more specifics.

> >> To allow finer-grained control of memory, this patchset 2 new control
> >> knobs for memory controller:
> >> - memory.subset.list for specifying the type of memory to be under control.
> >> - memory.subset.high for the high limit of memory consumption of that
> >> memory type.
> > Please be more specific about the semantic.
> >
> > I am really skeptical about this feature to be honest, though.
> >
>
> Please see patch 1 which has a more detailed description. This is just
> an overview for the cover letter.

No, please describe the whole design at a high level in the cover letter.
I am not going to spend time reviewing specific patches if the whole
idea is not clear beforehand. Design should be clear first before diving
into technical details.

> >> For simplicity, the limit is not hierarchical and applies to only tasks
> >> in the local memory cgroup.
> > This is a no-go to begin with.
>
> The reason for doing that is to introduce as little overhead as
> possible.

We are not going to break the semantics based on very vague hand waving
about overhead.

> We can certainly make it hierarchical, but it will complicate
> the code and increase runtime overhead. Another alternative is to limit
> this feature to only leaf memory cgroups. That should be enough to cover
> what the customer is asking for and leave room for future hierarchical
> extension, if needed.

No, this is a broken design that doesn't fit into the overall cgroup
design.

--
Michal Hocko
SUSE Labs

2019-04-11 15:27:05

by Michal Hocko

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On Thu 11-04-19 17:19:11, Michal Hocko wrote:
> On Thu 11-04-19 10:02:16, Waiman Long wrote:
> > On 04/10/2019 03:54 PM, Michal Hocko wrote:
> > > On Wed 10-04-19 15:13:19, Waiman Long wrote:
> > >> The current control mechanism for memory cgroup v2 lumps all the memory
> > >> together irrespective of the type of memory objects. However, there
> > >> are cases where users may have more concern about one type of memory
> > >> usage than the others.
> > >>
> > >> We have customer request to limit memory consumption on anonymous memory
> > >> only as they said the feature was available in other OSes like Solaris.
> > > Please be more specific about a usecase.
> >
> > From that customer's point of view, page cache is more like common goods
> > that can typically be shared by a number of different groups. Depending
> > on which groups touch the pages first, it is possible that most of those
> > pages can be disproportionately attributed to one group than the others.
> > Anonymous memory, on the other hand, are not shared and so can more
> > correctly represent the memory footprint of an application. Of course,
> > there are certainly cases where an application can have large private
> > files that can consume a lot of cache pages. These are probably not the
> > case for the applications used by that customer.
>
> So you are essentially interested in the page cache limiting, right?
> This has been proposed several times already and always rejected because
> this is not a good idea.

OK, so after reading other responses I've realized that I've
misunderstood your intention. You are really interested in the anon
memory limiting. But my objection still holds! I would like to hear much
more specific usecases. Is the page cache such a precious resource that it
cannot be refaulted? With the storage speed these days I am not quite
sure. Also there is always a way to delegate page cache pre-faulting to a
dedicated cgroup with a low limit protection if _some_ pagecache is
really important.

--
Michal Hocko
SUSE Labs

2019-04-11 15:32:37

by Johannes Weiner

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On Thu, Apr 11, 2019 at 10:02:16AM -0400, Waiman Long wrote:
> On 04/10/2019 03:54 PM, Michal Hocko wrote:
> > On Wed 10-04-19 15:13:19, Waiman Long wrote:
> >> The current control mechanism for memory cgroup v2 lumps all the memory
> >> together irrespective of the type of memory objects. However, there
> >> are cases where users may have more concern about one type of memory
> >> usage than the others.
> >>
> >> We have customer request to limit memory consumption on anonymous memory
> >> only as they said the feature was available in other OSes like Solaris.
> > Please be more specific about a usecase.
>
> From that customer's point of view, page cache is more like common goods
> that can typically be shared by a number of different groups. Depending
> on which groups touch the pages first, it is possible that most of those
> pages can be disproportionately attributed to one group than the others.
>
> Anonymous memory, on the other hand, are not shared and so can more
> correctly represent the memory footprint of an application. Of course,
> there are certainly cases where an application can have large private
> files that can consume a lot of cache pages. These are probably not the
> case for the applications used by that customer.

I don't understand what the goal is. What do you accomplish by only
restricting anon memory? Are you trying to contain malfunctioning
applications? Malicious applications?

Cache can apply as much pressure to the system as anon can. So if you
are in the position to ask your applications to behave wrt cache,
surely you can ask them to behave wrt anon as well...?

This also answers only one narrow question out of the many that arise
when heavily sharing cache. The accounting isn't done right,
memory.current of the participating cgroups will make no sense, IO
read and writeback burden is assigned to random cgroups.

> >> For simplicity, the limit is not hierarchical and applies to only tasks
> >> in the local memory cgroup.
> > This is a no-go to begin with.
>
> The reason for doing that is to introduce as little overhead as
> possible. We can certainly make it hierarchical, but it will complicate
> the code and increase runtime overhead. Another alternative is to limit
> this feature to only leaf memory cgroups. That should be enough to cover
> what the customer is asking for and leave room for future hierarchical
> extension, if needed.

I agree with Michal, this is a no-go. It involves userspace ABI that
we have to maintain indefinitely, so it needs to integrate properly
with the overall model of the cgroup2 interface.

That includes hierarchical support, but as per above it includes wider
questions of how this is supposed to integrate with the concepts of
comprehensive resource control. How it integrates with the accounting
(if you want to support shared pages, they should also be accounted as
shared and not to random groups), the relationships with connected
resources such as IO (in a virtual memory system that can do paging,
memory and IO are fungible, so if you want to be able to share one,
you have to be able to share the other as well to the same extent),
how it integrates with memory.low protection etc.

As it stands, I don't see this patch set addressing any of these.

2019-04-11 15:36:37

by Kirill Tkhai

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On 11.04.2019 17:55, Waiman Long wrote:
> On 04/11/2019 10:37 AM, Kirill Tkhai wrote:
>> On 10.04.2019 22:13, Waiman Long wrote:
>>> The current control mechanism for memory cgroup v2 lumps all the memory
>>> together irrespective of the type of memory objects. However, there
>>> are cases where users may have more concern about one type of memory
>>> usage than the others.
>>>
>>> We have a customer request to limit memory consumption on anonymous memory
>>> only as they said the feature was available in other OSes like Solaris.
>>>
>>> To allow finer-grained control of memory, this patchset adds 2 new control
>>> knobs for memory controller:
>>> - memory.subset.list for specifying the type of memory to be under control.
>>> - memory.subset.high for the high limit of memory consumption of that
>>> memory type.
>>>
>>> For simplicity, the limit is not hierarchical and applies to only tasks
>>> in the local memory cgroup.
>>>
>>> Waiman Long (2):
>>> mm/memcontrol: Finer-grained control for subset of allocated memory
>>> mm/memcontrol: Add a new MEMCG_SUBSET_HIGH event
>>>
>>> Documentation/admin-guide/cgroup-v2.rst | 35 +++++++++
>>> include/linux/memcontrol.h | 8 ++
>>> mm/memcontrol.c | 100 +++++++++++++++++++++++-
>>> 3 files changed, 142 insertions(+), 1 deletion(-)
>> CC Andrey.
>>
>> In Virtuozzo kernel we have similar functionality for limitation of page cache in a cgroup:
>>
>> https://github.com/OpenVZ/vzkernel/commit/8ceef5e0c07c7621fcb0e04ccc48a679dfeec4a4
>
> It will be helpful to know the use case where you want to limit page
> cache usage. I had anonymous memory in mind when I composed this patch,
> but I made the mechanism more generic so that it can apply to other use
> cases as well.

We have distributed storage, and its daemons run on every host.
There is a replication factor of 1:N, so the same block may be duplicated
on different hosts. They produce a lot of pagecache, but it is not reused
often (because of the above 1:N).

So, we want to limit pagecache, but do not limit anon memory. This
prevents global reclaim, and we found this improves our performance tests.

Kirill

2019-04-11 21:23:22

by Roman Gushchin

Subject: Re: [RFC PATCH 0/2] mm/memcontrol: Finer-grained memory control

On Thu, Apr 11, 2019 at 10:22:22AM -0400, Waiman Long wrote:
> On 04/10/2019 05:38 PM, Chris Down wrote:
> > Hi Waiman,
> >
> > Waiman Long writes:
> >> The current control mechanism for memory cgroup v2 lumps all the memory
> >> together irrespective of the type of memory objects. However, there
> >> are cases where users may have more concern about one type of memory
> >> usage than the others.
> >
> > I have concerns about this implementation, and the overall idea in
> > general. We had per-class memory limiting in the cgroup v1 API, and it
> > ended up really poorly, and resulted in a situation where it's really
> > hard to compose a usable system out of it any more.
> >
> > A major part of the restructure in cgroup v2 has been to simplify
> > things so that it's easier to understand for service owners and
> > sysadmins. This was intentional, because otherwise the system overall
> > is hard to make into something that does what users *really* want, and
> > users end up with a lot of confusion, misconfiguration, and generally
> > an inability to produce a coherent system, because we've made things
> > too hard to piece together.
> >
> > In general, for purposes of resource control, I'm not convinced that
> > it makes sense to limit only one kind of memory based on prior
> > experience with v1. Can you give a production use case where this
> > would be a clear benefit, traded off against the increase in
> > complexity to the API?
> >
>
> As I said in my previous email on this thread, the customer considered
> page cache as common goods not fully representing the "real" memory
> footprint used by an application. Depending on the actual mix of
> applications running on a system, there are certainly cases where their
> view is correct. In fact, what the customer is asking for is not even
> provided by the v1 API even with that many classes of memory that you
> can choose from.

Hello Waiman!

If I understand the case correctly, the customer wants to get signaled
when anon memory consumption reaches a certain point, right?

I doubt that the idea is to keep only a certain amount of anon memory
resident and swap out everything else. So, probably, the reaction will
be to kill the application.

If so, do we really need a control?
Maybe polling memory.stat::anon will be enough?

If not, I can imagine some sort of threshold notification mechanism
on top of memory.stat. Similar to what is built on top of psi.

Tracking the size of anon memory is definitely useful for spotting
userspace leaks and spikes, so an ability to set up thresholds and get
events sounds appealing to me.

Thanks!

Roman
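
The polling approach suggested above can be sketched in userspace without any new kernel interface. The following is a minimal illustration and not code from the thread. It assumes the cgroup v2 `memory.stat` file, whose `anon` field reports anonymous memory in bytes. The function names, the cgroup path and the threshold are hypothetical.

```python
from pathlib import Path


def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a plain "<name> <unsigned integer>" pair.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def anon_exceeds(text, threshold_bytes):
    """Return True if the 'anon' counter is strictly above threshold_bytes."""
    return parse_memory_stat(text).get("anon", 0) > threshold_bytes


def poll_once(cgroup_dir, threshold_bytes):
    """Read memory.stat from cgroup_dir once and check the anon threshold."""
    text = (Path(cgroup_dir) / "memory.stat").read_text()
    return anon_exceeds(text, threshold_bytes)
```

A monitoring daemon could call `poll_once("/sys/fs/cgroup/mygroup", 2 << 30)` in a loop with a sleep between iterations and decide what to do, for example log or kill the application, when it returns True. Unlike the threshold notification mechanism imagined above, plain polling can miss short spikes that fall between two samples.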