2015-05-19 10:41:06

by Mel Gorman

Subject: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

memcg was reported years ago to have significant overhead when unused. It
has improved but it's still the case that users that have no knowledge of
memcg pay a performance penalty.

This patch adds a Kconfig that controls whether memcg is enabled by default
and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
oldconfig will get the historical behaviour. It is not an option for most
distributions to simply disable MEMCG as there are users that require it
but they should also be knowledgeable enough to use cgroup_enable=.

This was evaluated using aim9, a page fault microbenchmark and ebizzy
but I'll focus on the page fault microbenchmark. It can be reproduced
using pft from mmtests (https://github.com/gormanm/mmtests). Edit
configs/config-global-dhp__pagealloc-performance and update MMTESTS to
only contain pft. This is the relevant part of the profile summary

/usr/src/linux-4.0-vanilla/mm/memcontrol.c       6.6441   395842
  mem_cgroup_try_charge                          2.950%   175781
  __mem_cgroup_count_vm_event                    1.431%    85239
  mem_cgroup_page_lruvec                         0.456%    27156
  mem_cgroup_commit_charge                       0.392%    23342
  uncharge_list                                  0.323%    19256
  mem_cgroup_update_lru_size                     0.278%    16538
  memcg_check_events                             0.216%    12858
  mem_cgroup_charge_statistics.isra.22           0.188%    11172
  try_charge                                     0.150%     8928
  commit_charge                                  0.141%     8388
  get_mem_cgroup_from_mm                         0.121%     7184

It's showing 6.64% overhead in memcontrol.c when no memcgs are in
use. Applying the patch and disabling memcg reduces this to 0.48%

/usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c  0.4834    27511
  mem_cgroup_page_lruvec                         0.161%     9172
  mem_cgroup_update_lru_size                     0.154%     8794
  mem_cgroup_try_charge                          0.126%     7194
  mem_cgroup_commit_charge                       0.041%     2351

Note that it's not very visible from headline performance figures

pft faults
                                       4.0.0                  4.0.0
                                     vanilla             nomemcg-v1
Hmean    faults/cpu-1  1443258.1051 (  0.00%)  1530574.6033 (  6.05%)
Hmean    faults/cpu-3  1340385.9270 (  0.00%)  1375156.5834 (  2.59%)
Hmean    faults/cpu-5   875599.0222 (  0.00%)   876217.9211 (  0.07%)
Hmean    faults/cpu-7   601146.6726 (  0.00%)   599068.4360 ( -0.35%)
Hmean    faults/cpu-8   510728.2754 (  0.00%)   509887.9960 ( -0.16%)
Hmean    faults/sec-1  1432084.7845 (  0.00%)  1518566.3541 (  6.04%)
Hmean    faults/sec-3  3943818.1437 (  0.00%)  4036918.0217 (  2.36%)
Hmean    faults/sec-5  3877573.5867 (  0.00%)  3922745.9207 (  1.16%)
Hmean    faults/sec-7  3991832.0418 (  0.00%)  3990670.8481 ( -0.03%)
Hmean    faults/sec-8  3987189.8167 (  0.00%)  3978842.8107 ( -0.21%)

Low thread counts get a boost but it's within noise as memcg overhead does
not dominate. It's not obvious at all at higher thread counts as other
factors cause more problems. The overall breakdown of CPU usage looks like

            4.0.0       4.0.0
          vanilla  nomemcg-v1
User        41.45       41.11
System     410.19      404.76
Elapsed    130.33      126.30

Despite the relative unimportance, there is at least some justification
for disabling memcg by default.

Signed-off-by: Mel Gorman <[email protected]>
---
 Documentation/kernel-parameters.txt |  4 ++++
 init/Kconfig                        | 15 +++++++++++++++
 kernel/cgroup.c                     | 20 ++++++++++++++++----
 mm/memcontrol.c                     |  3 +++
 4 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index bfcb1a62a7b4..4f264f906816 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -591,6 +591,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			cut the overhead, others just disable the usage. So
 			only cgroup_disable=memory is actually worthy}
 
+	cgroup_enable=	[KNL] Enable a particular controller
+			Similar to cgroup_disable except that it enables
+			controllers that are disabled by default.
+
 	checkreqprot	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }
 			See security/selinux/Kconfig help text.
diff --git a/init/Kconfig b/init/Kconfig
index f5dbc6d4261b..819b6cc05cba 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -990,6 +990,21 @@ config MEMCG
 	  Provides a memory resource controller that manages both anonymous
 	  memory and page cache. (See Documentation/cgroups/memory.txt)
 
+config MEMCG_DEFAULT_ENABLED
+	bool "Automatically enable memory resource controller"
+	default y
+	depends on MEMCG
+	help
+	  The memory controller has some overhead even if idle as resource
+	  usage must be tracked in case a group is created and a process
+	  migrated. As users may not be aware of this and the cgroup_disable=
+	  option, this config option controls whether it is enabled by
+	  default. It is assumed that someone that requires the controller
+	  can find the cgroup_enable= switch.
+
+	  Say N if unsure. This is default Y to preserve oldconfig and
+	  historical behaviour.
+
 config MEMCG_SWAP
 	bool "Memory Resource Controller Swap Extension"
 	depends on MEMCG && SWAP
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 29a7b2cc593e..0e79db55bf1a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5370,7 +5370,7 @@ out_free:
 	kfree(pathbuf);
 }
 
-static int __init cgroup_disable(char *str)
+static int __init __cgroup_set_state(char *str, bool disabled)
 {
 	struct cgroup_subsys *ss;
 	char *token;
@@ -5382,16 +5382,28 @@ static int __init cgroup_disable(char *str)
 
 		for_each_subsys(ss, i) {
 			if (!strcmp(token, ss->name)) {
-				ss->disabled = 1;
-				printk(KERN_INFO "Disabling %s control group"
-				       " subsystem\n", ss->name);
+				ss->disabled = disabled;
+				printk(KERN_INFO "Setting %s control group"
+				       " subsystem %s\n", ss->name,
+				       disabled ? "disabled" : "enabled");
 				break;
 			}
 		}
 	}
 	return 1;
 }
+
+static int __init cgroup_disable(char *str)
+{
+	return __cgroup_set_state(str, true);
+}
+
+static int __init cgroup_enable(char *str)
+{
+	return __cgroup_set_state(str, false);
+}
 __setup("cgroup_disable=", cgroup_disable);
+__setup("cgroup_enable=", cgroup_enable);
 
 static int __init cgroup_set_legacy_files_on_dfl(char *str)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b34ef4a32a3b..ce171ba16949 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5391,6 +5391,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.dfl_cftypes = memory_files,
 	.legacy_cftypes = mem_cgroup_legacy_files,
 	.early_init = 0,
+#ifndef CONFIG_MEMCG_DEFAULT_ENABLED
+	.disabled = 1,
+#endif
 };
 
 /**


2015-05-19 14:18:26

by Johannes Weiner

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

CC'ing Tejun and cgroups for the generic cgroup interface part

On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> memcg was reported years ago to have significant overhead when unused. It
> has improved but it's still the case that users that have no knowledge of
> memcg pay a performance penalty.
>
> This patch adds a Kconfig that controls whether memcg is enabled by default
> and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
> oldconfig will get the historical behaviour. It is not an option for most
> distributions to simply disable MEMCG as there are users that require it
> but they should also be knowledgeable enough to use cgroup_enable=.
>
> This was evaluated using aim9, a page fault microbenchmark and ebizzy
> but I'll focus on the page fault microbenchmark. It can be reproduced
> using pft from mmtests (https://github.com/gormanm/mmtests). Edit
> configs/config-global-dhp__pagealloc-performance and update MMTESTS to
> only contain pft. This is the relevant part of the profile summary
>
> /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
> mem_cgroup_try_charge 2.950% 175781

Ouch. Do you have a way to get the per-instruction breakdown of this?
This function really isn't doing much. I'll try to reproduce it here
too, I haven't seen such high costs with pft in the past.

> __mem_cgroup_count_vm_event 1.431% 85239
> mem_cgroup_page_lruvec 0.456% 27156
> mem_cgroup_commit_charge 0.392% 23342
> uncharge_list 0.323% 19256
> mem_cgroup_update_lru_size 0.278% 16538
> memcg_check_events 0.216% 12858
> mem_cgroup_charge_statistics.isra.22 0.188% 11172
> try_charge 0.150% 8928
> commit_charge 0.141% 8388
> get_mem_cgroup_from_mm 0.121% 7184
>
> It's showing 6.64% overhead in memcontrol.c when no memcgs are in
> use. Applying the patch and disabling memcg reduces this to 0.48%

The frustrating part is that 4.5% of that is not even coming from the
main accounting and tracking work. I'm looking into getting this
fixed regardless of what happens with this patch.

> /usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c 0.4834 27511
> mem_cgroup_page_lruvec 0.161% 9172
> mem_cgroup_update_lru_size 0.154% 8794
> mem_cgroup_try_charge 0.126% 7194
> mem_cgroup_commit_charge 0.041% 2351
>
> Note that it's not very visible from headline performance figures
>
> pft faults
> 4.0.0 4.0.0
> vanilla nomemcg-v1
> Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1530574.6033 ( 6.05%)
> Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1375156.5834 ( 2.59%)
> Hmean faults/cpu-5 875599.0222 ( 0.00%) 876217.9211 ( 0.07%)
> Hmean faults/cpu-7 601146.6726 ( 0.00%) 599068.4360 ( -0.35%)
> Hmean faults/cpu-8 510728.2754 ( 0.00%) 509887.9960 ( -0.16%)
> Hmean faults/sec-1 1432084.7845 ( 0.00%) 1518566.3541 ( 6.04%)
> Hmean faults/sec-3 3943818.1437 ( 0.00%) 4036918.0217 ( 2.36%)
> Hmean faults/sec-5 3877573.5867 ( 0.00%) 3922745.9207 ( 1.16%)
> Hmean faults/sec-7 3991832.0418 ( 0.00%) 3990670.8481 ( -0.03%)
> Hmean faults/sec-8 3987189.8167 ( 0.00%) 3978842.8107 ( -0.21%)
>
> Low thread counts get a boost but it's within noise as memcg overhead does
> not dominate. It's not obvious at all at higher thread counts as other
> factors cause more problems. The overall breakdown of CPU usage looks like
>
> 4.0.0 4.0.0
> vanilla nomemcg-v1
> User 41.45 41.11
> System 410.19 404.76
> Elapsed 130.33 126.30
>
> Despite the relative unimportance, there is at least some justification
> for disabling memcg by default.

I guess so. The only thing I don't like about this is that it changes
the default of a single controller. While there is some justification
from an overhead standpoint, it's a little weird in terms of interface
when you boot, say, a distribution kernel and it has cgroups with all
but one resource controller available.

Would it make more sense to provide a Kconfig option that disables all
resource controllers by default? There is still value in having only
the generic cgroup part for grouped process monitoring and control.

Thanks,
Johannes

> Signed-off-by: Mel Gorman <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 4 ++++
> init/Kconfig | 15 +++++++++++++++
> kernel/cgroup.c | 20 ++++++++++++++++----
> mm/memcontrol.c | 3 +++
> 4 files changed, 38 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index bfcb1a62a7b4..4f264f906816 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -591,6 +591,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> cut the overhead, others just disable the usage. So
> only cgroup_disable=memory is actually worthy}
>
> + cgroup_enable= [KNL] Enable a particular controller
> + Similar to cgroup_disable except that it enables
> + controllers that are disabled by default.
> +
> checkreqprot [SELINUX] Set initial checkreqprot flag value.
> Format: { "0" | "1" }
> See security/selinux/Kconfig help text.
> diff --git a/init/Kconfig b/init/Kconfig
> index f5dbc6d4261b..819b6cc05cba 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -990,6 +990,21 @@ config MEMCG
> Provides a memory resource controller that manages both anonymous
> memory and page cache. (See Documentation/cgroups/memory.txt)
>
> +config MEMCG_DEFAULT_ENABLED
> + bool "Automatically enable memory resource controller"
> + default y
> + depends on MEMCG
> + help
> + The memory controller has some overhead even if idle as resource
> + usage must be tracked in case a group is created and a process
> + migrated. As users may not be aware of this and the cgroup_disable=
> + option, this config option controls whether it is enabled by
> + default. It is assumed that someone that requires the controller
> + can find the cgroup_enable= switch.
> +
> + Say N if unsure. This is default Y to preserve oldconfig and
> + historical behaviour.
> +
> config MEMCG_SWAP
> bool "Memory Resource Controller Swap Extension"
> depends on MEMCG && SWAP
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 29a7b2cc593e..0e79db55bf1a 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -5370,7 +5370,7 @@ out_free:
> kfree(pathbuf);
> }
>
> -static int __init cgroup_disable(char *str)
> +static int __init __cgroup_set_state(char *str, bool disabled)
> {
> struct cgroup_subsys *ss;
> char *token;
> @@ -5382,16 +5382,28 @@ static int __init cgroup_disable(char *str)
>
> for_each_subsys(ss, i) {
> if (!strcmp(token, ss->name)) {
> - ss->disabled = 1;
> - printk(KERN_INFO "Disabling %s control group"
> - " subsystem\n", ss->name);
> + ss->disabled = disabled;
> + printk(KERN_INFO "Setting %s control group"
> + " subsystem %s\n", ss->name,
> + disabled ? "disabled" : "enabled");
> break;
> }
> }
> }
> return 1;
> }
> +
> +static int __init cgroup_disable(char *str)
> +{
> + return __cgroup_set_state(str, true);
> +}
> +
> +static int __init cgroup_enable(char *str)
> +{
> + return __cgroup_set_state(str, false);
> +}
> __setup("cgroup_disable=", cgroup_disable);
> +__setup("cgroup_enable=", cgroup_enable);
>
> static int __init cgroup_set_legacy_files_on_dfl(char *str)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b34ef4a32a3b..ce171ba16949 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5391,6 +5391,9 @@ struct cgroup_subsys memory_cgrp_subsys = {
> .dfl_cftypes = memory_files,
> .legacy_cftypes = mem_cgroup_legacy_files,
> .early_init = 0,
> +#ifndef CONFIG_MEMCG_DEFAULT_ENABLED
> + .disabled = 1,
> +#endif
> };
>
> /**

2015-05-19 14:43:51

by Mel Gorman

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 10:18:07AM -0400, Johannes Weiner wrote:
> CC'ing Tejun and cgroups for the generic cgroup interface part
>
> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> > memcg was reported years ago to have significant overhead when unused. It
> > has improved but it's still the case that users that have no knowledge of
> > memcg pay a performance penalty.
> >
> > This patch adds a Kconfig that controls whether memcg is enabled by default
> > and a kernel parameter cgroup_enable= to enable it if desired. Anyone using
> > oldconfig will get the historical behaviour. It is not an option for most
> > distributions to simply disable MEMCG as there are users that require it
> > but they should also be knowledgeable enough to use cgroup_enable=.
> >
> > This was evaluated using aim9, a page fault microbenchmark and ebizzy
> > but I'll focus on the page fault microbenchmark. It can be reproduced
> > using pft from mmtests (https://github.com/gormanm/mmtests). Edit
> > configs/config-global-dhp__pagealloc-performance and update MMTESTS to
> > only contain pft. This is the relevant part of the profile summary
> >
> > /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
> > mem_cgroup_try_charge 2.950% 175781
>
> Ouch. Do you have a way to get the per-instruction breakdown of this?

Not that I can upload in a reasonable amount of time. An annotated profile
and vmlinux image for decoding addresses are not small. My expectation is
that it'd be trivially reproducible.

> This function really isn't doing much. I'll try to reproduce it here
> too, I haven't seen such high costs with pft in the past.
>

I don't believe it's the machine that is being particularly stupid. It's
a fairly bog-standard desktop class box.

> > __mem_cgroup_count_vm_event 1.431% 85239
> > mem_cgroup_page_lruvec 0.456% 27156
> > mem_cgroup_commit_charge 0.392% 23342
> > uncharge_list 0.323% 19256
> > mem_cgroup_update_lru_size 0.278% 16538
> > memcg_check_events 0.216% 12858
> > mem_cgroup_charge_statistics.isra.22 0.188% 11172
> > try_charge 0.150% 8928
> > commit_charge 0.141% 8388
> > get_mem_cgroup_from_mm 0.121% 7184
> >
> > It's showing 6.64% overhead in memcontrol.c when no memcgs are in
> > use. Applying the patch and disabling memcg reduces this to 0.48%
>
> The frustrating part is that 4.5% of that is not even coming from the
> main accounting and tracking work. I'm looking into getting this
> fixed regardless of what happens with this patch.
>
> > /usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c 0.4834 27511
> > mem_cgroup_page_lruvec 0.161% 9172
> > mem_cgroup_update_lru_size 0.154% 8794
> > mem_cgroup_try_charge 0.126% 7194
> > mem_cgroup_commit_charge 0.041% 2351
> >
> > Note that it's not very visible from headline performance figures
> >
> > pft faults
> > 4.0.0 4.0.0
> > vanilla nomemcg-v1
> > Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1530574.6033 ( 6.05%)
> > Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1375156.5834 ( 2.59%)
> > Hmean faults/cpu-5 875599.0222 ( 0.00%) 876217.9211 ( 0.07%)
> > Hmean faults/cpu-7 601146.6726 ( 0.00%) 599068.4360 ( -0.35%)
> > Hmean faults/cpu-8 510728.2754 ( 0.00%) 509887.9960 ( -0.16%)
> > Hmean faults/sec-1 1432084.7845 ( 0.00%) 1518566.3541 ( 6.04%)
> > Hmean faults/sec-3 3943818.1437 ( 0.00%) 4036918.0217 ( 2.36%)
> > Hmean faults/sec-5 3877573.5867 ( 0.00%) 3922745.9207 ( 1.16%)
> > Hmean faults/sec-7 3991832.0418 ( 0.00%) 3990670.8481 ( -0.03%)
> > Hmean faults/sec-8 3987189.8167 ( 0.00%) 3978842.8107 ( -0.21%)
> >
> > Low thread counts get a boost but it's within noise as memcg overhead does
> > not dominate. It's not obvious at all at higher thread counts as other
> > factors cause more problems. The overall breakdown of CPU usage looks like
> >
> > 4.0.0 4.0.0
> > vanilla nomemcg-v1
> > User 41.45 41.11
> > System 410.19 404.76
> > Elapsed 130.33 126.30
> >
> > Despite the relative unimportance, there is at least some justification
> > for disabling memcg by default.
>
> I guess so. The only thing I don't like about this is that it changes
> the default of a single controller. While there is some justification
> from an overhead standpoint, it's a little weird in terms of interface
> when you boot, say, a distribution kernel and it has cgroups with all
> but one resource controller available.
>
> Would it make more sense to provide a Kconfig option that disables all
> resource controllers by default? There is still value in having only
> the generic cgroup part for grouped process monitoring and control.
>

A config option per controller seems overkill because AFAIK the other
controllers are harmless in terms of overhead. All enabled or all
disabled has other consequences because AFAIK systemd requires some
controllers to function correctly -- e.g.
https://bugs.freedesktop.org/show_bug.cgi?id=74589

After I wrote the patch, I spotted that Debian apparently already
does something like this and by coincidence they matched the
parameter name and values. See the memory controller instructions on
https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
upstream would match something that at least one distro in the field
already uses.

--
Mel Gorman
SUSE Labs

2015-05-19 14:51:20

by Michal Hocko

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
> CC'ing Tejun and cgroups for the generic cgroup interface part
>
> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
[...]
> > /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
> > mem_cgroup_try_charge 2.950% 175781
>
> Ouch. Do you have a way to get the per-instruction breakdown of this?
> This function really isn't doing much. I'll try to reproduce it here
> too, I haven't seen such high costs with pft in the past.
>
> > try_charge 0.150% 8928
> > get_mem_cgroup_from_mm 0.121% 7184

Indeed! try_charge + get_mem_cgroup_from_mm, which I would expect to be
the biggest consumers here, are below 10% of the mem_cgroup_try_charge
total. Other than that, the function does little more than some flag
queries and css_put...

Do you have the full trace? Sorry for a stupid question but do inlines
from other header files get accounted to memcontrol.c?

[...]
--
Michal Hocko
SUSE Labs

2015-05-19 15:12:40

by Vlastimil Babka

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On 05/19/2015 04:53 PM, Michal Hocko wrote:
> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
>> CC'ing Tejun and cgroups for the generic cgroup interface part
>>
>> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> [...]
>>> /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
>>> mem_cgroup_try_charge 2.950% 175781
>>
>> Ouch. Do you have a way to get the per-instruction breakdown of this?
>> This function really isn't doing much. I'll try to reproduce it here
>> too, I haven't seen such high costs with pft in the past.
>>
>>> try_charge 0.150% 8928
>>> get_mem_cgroup_from_mm 0.121% 7184
>
> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
> the biggest consumers here are below 10% of the mem_cgroup_try_charge.

Note that they don't explain 10% of the mem_cgroup_try_charge. They
*add* their own overhead to the overhead of mem_cgroup_try_charge
itself. Which might be what you meant but I wasn't sure.

> Other than that the function doesn't do much else than some flags
> queries and css_put...
>
> Do you have the full trace?
> Sorry for a stupid question but do inlines
> from other header files get accounted to memcontrol.c?

Yes, perf doesn't know about them, so they are accounted to the function
where the code physically is.

>
> [...]
>

2015-05-19 15:13:10

by Mel Gorman

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 04:53:40PM +0200, Michal Hocko wrote:
> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
> > CC'ing Tejun and cgroups for the generic cgroup interface part
> >
> > On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
> [...]
> > > /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
> > > mem_cgroup_try_charge 2.950% 175781
> >
> > Ouch. Do you have a way to get the per-instruction breakdown of this?
> > This function really isn't doing much. I'll try to reproduce it here
> > too, I haven't seen such high costs with pft in the past.
> >
> > > try_charge 0.150% 8928
> > > get_mem_cgroup_from_mm 0.121% 7184
>
> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
> the biggest consumers here are below 10% of the mem_cgroup_try_charge.
> Other than that the function doesn't do much else than some flags
> queries and css_put...
>
> Do you have the full trace? Sorry for a stupid question but do inlines
> from other header files get accounted to memcontrol.c?
>

The annotations for those functions, with some very basic notes added,
are as follows. Note that I've done almost no research on this. I just
noticed that the memcg overhead was still there when looking for
something else.

ffffffff811c15f0 <mem_cgroup_try_charge>: /* mem_cgroup_try_charge total: 176903 2.9692 */
765 0.0128 :ffffffff811c15f0: callq ffffffff816435e0 <__fentry__>
78 0.0013 :ffffffff811c15f5: push %rbp
1185 0.0199 :ffffffff811c15f6: mov %rsp,%rbp
356 0.0060 :ffffffff811c15f9: push %r14
209 0.0035 :ffffffff811c15fb: push %r13
1599 0.0268 :ffffffff811c15fd: push %r12
320 0.0054 :ffffffff811c15ff: mov %rcx,%r12
305 0.0051 :ffffffff811c1602: push %rbx
325 0.0055 :ffffffff811c1603: sub $0x10,%rsp
878 0.0147 :ffffffff811c1607: mov 0xb7501b(%rip),%ecx # ffffffff81d36628 <memory_cgrp_subsys+0x68>
571 0.0096 :ffffffff811c160d: test %ecx,%ecx

### MEL: Function entry, check for mem_cgroup_disabled()


:ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
:ffffffff811c1611: xor %eax,%eax
:ffffffff811c1613: xor %ebx,%ebx
1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
1211 0.0203 :ffffffff811c161d: pop %rbx
5 8.4e-05 :ffffffff811c161e: pop %r12
5 8.4e-05 :ffffffff811c1620: pop %r13
1249 0.0210 :ffffffff811c1622: pop %r14
7 1.2e-04 :ffffffff811c1624: pop %rbp
5 8.4e-05 :ffffffff811c1625: retq
:ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
160703 2.6973 :ffffffff811c1633: mov %edx,%r13d

#### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
#### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
#### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
#### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
####
#### Everything after here is consistent small amounts of overhead just from
#### being called a lot

179 0.0030 :ffffffff811c1636: test $0x10000,%eax
:ffffffff811c163b: je ffffffff811c1648 <mem_cgroup_try_charge+0x58>
:ffffffff811c163d: xor %eax,%eax
:ffffffff811c163f: xor %ebx,%ebx
:ffffffff811c1641: cmpq $0x0,0x38(%rdi)
:ffffffff811c1646: jne ffffffff811c1615 <mem_cgroup_try_charge+0x25>
1343 0.0225 :ffffffff811c1648: mov (%rdi),%rax
26 4.4e-04 :ffffffff811c164b: mov $0x1,%r14d
24 4.0e-04 :ffffffff811c1651: test $0x40,%ah
:ffffffff811c1654: je ffffffff811c1665 <mem_cgroup_try_charge+0x75>
:ffffffff811c1656: mov (%rdi),%rax
:ffffffff811c1659: test $0x40,%ah
:ffffffff811c165c: je ffffffff811c1665 <mem_cgroup_try_charge+0x75>
:ffffffff811c165e: mov 0x68(%rdi),%rcx
:ffffffff811c1662: shl %cl,%r14d
1225 0.0206 :ffffffff811c1665: mov 0xb74f35(%rip),%eax # ffffffff81d365a0 <do_swap_account>
66 0.0011 :ffffffff811c166b: test %eax,%eax
:ffffffff811c166d: jne ffffffff811c16a8 <mem_cgroup_try_charge+0xb8>
3 5.0e-05 :ffffffff811c166f: mov %rsi,%rdi
22 3.7e-04 :ffffffff811c1672: callq ffffffff811bc920 <get_mem_cgroup_from_mm>
1291 0.0217 :ffffffff811c1677: mov %rax,%rbx
3 5.0e-05 :ffffffff811c167a: mov %r14d,%edx
:ffffffff811c167d: mov %r13d,%esi
10 1.7e-04 :ffffffff811c1680: mov %rbx,%rdi
1380 0.0232 :ffffffff811c1683: callq ffffffff811c0950 <try_charge>
10 1.7e-04 :ffffffff811c1688: testb $0x1,0x74(%rbx)
1235 0.0207 :ffffffff811c168c: je ffffffff811c16d0 <mem_cgroup_try_charge+0xe0>
7 1.2e-04 :ffffffff811c168e: cmp $0xfffffffc,%eax
:ffffffff811c1691: jne ffffffff811c1615 <mem_cgroup_try_charge+0x25>
:ffffffff811c1693: mov 0xb74f0e(%rip),%rbx # ffffffff81d365a8 <root_mem_cgroup>
:ffffffff811c169a: xor %eax,%eax
:ffffffff811c169c: jmpq ffffffff811c1615 <mem_cgroup_try_charge+0x25>
:ffffffff811c16a1: nopl 0x0(%rax)
:ffffffff811c16a8: mov (%rdi),%rax
:ffffffff811c16ab: test $0x10000,%eax
:ffffffff811c16b0: je ffffffff811c166f <mem_cgroup_try_charge+0x7f>
:ffffffff811c16b2: mov %rsi,-0x28(%rbp)
:ffffffff811c16b6: callq ffffffff811c0450 <try_get_mem_cgroup_from_page>
:ffffffff811c16bb: test %rax,%rax
:ffffffff811c16be: mov %rax,%rbx
:ffffffff811c16c1: mov -0x28(%rbp),%rsi
:ffffffff811c16c5: jne ffffffff811c167a <mem_cgroup_try_charge+0x8a>
:ffffffff811c16c7: jmp ffffffff811c166f <mem_cgroup_try_charge+0x7f>
:ffffffff811c16c9: nopl 0x0(%rax)
:ffffffff811c16d0: mov 0x18(%rbx),%rdx
:ffffffff811c16d4: test $0x3,%dl
:ffffffff811c16d7: jne ffffffff811c16df <mem_cgroup_try_charge+0xef>
:ffffffff811c16d9: decq %gs:(%rdx)
:ffffffff811c16dd: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
:ffffffff811c16df: lea 0x10(%rbx),%rdi
:ffffffff811c16e3: lock subq $0x1,0x10(%rbx)
:ffffffff811c16e9: je ffffffff811c16ed <mem_cgroup_try_charge+0xfd>
:ffffffff811c16eb: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
:ffffffff811c16ed: mov %eax,-0x28(%rbp)
:ffffffff811c16f0: callq *0x20(%rbx)
:ffffffff811c16f3: mov -0x28(%rbp),%eax
:ffffffff811c16f6: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
:ffffffff811c16f8: nopl 0x0(%rax,%rax,1)

ffffffff811bc920 <get_mem_cgroup_from_mm>: /* get_mem_cgroup_from_mm total: 7251 0.1217 */
#### MEL: Nothing really big jumped out there at me.
1318 0.0221 :ffffffff811bc920: callq ffffffff816435e0 <__fentry__>
19 3.2e-04 :ffffffff811bc925: push %rbp
42 7.0e-04 :ffffffff811bc926: mov %rsp,%rbp
1278 0.0215 :ffffffff811bc929: jmp ffffffff811bc94b <get_mem_cgroup_from_mm+0x2b>
:ffffffff811bc92b: nopl 0x0(%rax,%rax,1)
1259 0.0211 :ffffffff811bc930: testb $0x1,0x74(%rdx)
161 0.0027 :ffffffff811bc934: jne ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
:ffffffff811bc936: mov 0x18(%rdx),%rax
:ffffffff811bc93a: test $0x3,%al
:ffffffff811bc93c: jne ffffffff811bc985 <get_mem_cgroup_from_mm+0x65>
:ffffffff811bc93e: incq %gs:(%rax)
:ffffffff811bc942: mov $0x1,%eax
:ffffffff811bc947: test %al,%al
:ffffffff811bc949: jne ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
13 2.2e-04 :ffffffff811bc94b: test %rdi,%rdi
:ffffffff811bc94e: je ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
47 7.9e-04 :ffffffff811bc950: mov 0x340(%rdi),%rax
1410 0.0237 :ffffffff811bc957: test %rax,%rax
:ffffffff811bc95a: je ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
26 4.4e-04 :ffffffff811bc95c: mov 0xca0(%rax),%rax
179 0.0030 :ffffffff811bc963: mov 0x70(%rax),%rdx
174 0.0029 :ffffffff811bc967: test %rdx,%rdx
:ffffffff811bc96a: jne ffffffff811bc930 <get_mem_cgroup_from_mm+0x10>
:ffffffff811bc96c: mov 0xb79c35(%rip),%rdx # ffffffff81d365a8 <root_mem_cgroup>
1 1.7e-05 :ffffffff811bc973: testb $0x1,0x74(%rdx)
:ffffffff811bc977: je ffffffff811bc936 <get_mem_cgroup_from_mm+0x16>
:ffffffff811bc979: nopl 0x0(%rax)
1299 0.0218 :ffffffff811bc980: mov %rdx,%rax
4 6.7e-05 :ffffffff811bc983: pop %rbp
21 3.5e-04 :ffffffff811bc984: retq
:ffffffff811bc985: testb $0x2,0x18(%rdx)
:ffffffff811bc989: jne ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
:ffffffff811bc98b: mov 0x10(%rdx),%rcx
:ffffffff811bc98f: test %rcx,%rcx
:ffffffff811bc992: je ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
:ffffffff811bc994: lea 0x1(%rcx),%rsi
:ffffffff811bc998: lea 0x10(%rdx),%r8
:ffffffff811bc99c: mov %rcx,%rax
:ffffffff811bc99f: lock cmpxchg %rsi,0x10(%rdx)
:ffffffff811bc9a5: cmp %rcx,%rax
:ffffffff811bc9a8: mov %rax,%rsi
:ffffffff811bc9ab: jne ffffffff811bc9b4 <get_mem_cgroup_from_mm+0x94>
:ffffffff811bc9ad: mov $0x1,%eax
:ffffffff811bc9b2: jmp ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
:ffffffff811bc9b4: test %rsi,%rsi
:ffffffff811bc9b7: je ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
:ffffffff811bc9b9: lea 0x1(%rsi),%rcx
:ffffffff811bc9bd: mov %rsi,%rax
:ffffffff811bc9c0: lock cmpxchg %rcx,(%r8)
:ffffffff811bc9c5: cmp %rax,%rsi
:ffffffff811bc9c8: je ffffffff811bc9ad <get_mem_cgroup_from_mm+0x8d>
:ffffffff811bc9ca: mov %rax,%rsi
:ffffffff811bc9cd: test %rsi,%rsi
:ffffffff811bc9d0: jne ffffffff811bc9b9 <get_mem_cgroup_from_mm+0x99>
:ffffffff811bc9d2: xor %eax,%eax
:ffffffff811bc9d4: jmpq ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>

--
Mel Gorman
SUSE Labs

2015-05-19 15:13:04

by Michal Hocko

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

[Let's CC Ben here - the email thread started here:
http://marc.info/?l=linux-mm&m=143203206402073&w=2 and it seems Debian
is already disabling the memcg controller so this might be of interest to you]

On Tue 19-05-15 15:43:45, Mel Gorman wrote:
[...]
> After I wrote the patch, I spotted that Debian apparently already
> does something like this and by coincidence they matched the
> parameter name and values. See the memory controller instructions on
> https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
> upstream would match something that at least one distro in the field
> already uses.

I've read through
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964 and it seems
that the primary motivation for the runtime disabling was the _memory_
overhead of the struct page_cgroup
(https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964#152). This is
no longer the case since 1306a85aed3e ("mm: embed the memcg pointer
directly into struct page") merged in 3.19.

I can see some point in disabling memcg due to runtime overhead.
There will always be some, albeit hard to notice. If a user really needs
this to happen there is a command line option for that. The question is
who would do CONFIG_MEMCG && !MEMCG_DEFAULT_ENABLED. Do you expect any
distributions to go that way?
Ben, would you welcome such a change upstream or is there a reason to
change the Debian kernel runtime default now that the memory overhead is
mostly gone (for 3.19+ kernels of course)?
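For reference, the knob under discussion would presumably look something like the following in Kconfig. This is a sketch based only on the option and parameter names mentioned in this thread; the exact help text and placement are guesses:

```
config MEMCG_DEFAULT_ENABLED
	bool "Automatically enable memory resource controller"
	depends on MEMCG
	default y
	help
	  The memory controller has some runtime overhead even when no
	  memcgs are in use. If this option is disabled, memcg must be
	  explicitly activated at boot with cgroup_enable=memory.

	  If unsure, say Y.
```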
--
Michal Hocko
SUSE Labs

2015-05-19 15:25:41

by Vlastimil Babka

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On 05/19/2015 05:13 PM, Mel Gorman wrote:
> On Tue, May 19, 2015 at 04:53:40PM +0200, Michal Hocko wrote:
>> On Tue 19-05-15 10:18:07, Johannes Weiner wrote:
>>> CC'ing Tejun and cgroups for the generic cgroup interface part
>>>
>>> On Tue, May 19, 2015 at 11:40:57AM +0100, Mel Gorman wrote:
>> [...]
>>>> /usr/src/linux-4.0-vanilla/mm/memcontrol.c 6.6441 395842
>>>> mem_cgroup_try_charge 2.950% 175781
>>>
>>> Ouch. Do you have a way to get the per-instruction breakdown of this?
>>> This function really isn't doing much. I'll try to reproduce it here
>>> too, I haven't seen such high costs with pft in the past.
>>>
>>>> try_charge 0.150% 8928
>>>> get_mem_cgroup_from_mm 0.121% 7184
>>
>> Indeed! try_charge + get_mem_cgroup_from_mm which I would expect to be
>> the biggest consumers here are below 10% of the mem_cgroup_try_charge.
>> Other than that the function doesn't do much else than some flags
>> queries and css_put...
>>
>> Do you have the full trace? Sorry for a stupid question but do inlines
>> from other header files get accounted to memcontrol.c?
>>
>
> The annotations for those functions, with some very basic notes added, are
> as follows. Note that I've done almost no research on this; I just noticed
> that the memcg overhead was still there while looking for something else.
>
> ffffffff811c15f0 <mem_cgroup_try_charge>: /* mem_cgroup_try_charge total: 176903 2.9692 */
> 765 0.0128 :ffffffff811c15f0: callq ffffffff816435e0 <__fentry__>
> 78 0.0013 :ffffffff811c15f5: push %rbp
> 1185 0.0199 :ffffffff811c15f6: mov %rsp,%rbp
> 356 0.0060 :ffffffff811c15f9: push %r14
> 209 0.0035 :ffffffff811c15fb: push %r13
> 1599 0.0268 :ffffffff811c15fd: push %r12
> 320 0.0054 :ffffffff811c15ff: mov %rcx,%r12
> 305 0.0051 :ffffffff811c1602: push %rbx
> 325 0.0055 :ffffffff811c1603: sub $0x10,%rsp
> 878 0.0147 :ffffffff811c1607: mov 0xb7501b(%rip),%ecx # ffffffff81d36628 <memory_cgrp_subsys+0x68>
> 571 0.0096 :ffffffff811c160d: test %ecx,%ecx
>
> ### MEL: Function entry, check for mem_cgroup_disabled()
>
>
> :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> :ffffffff811c1611: xor %eax,%eax
> :ffffffff811c1613: xor %ebx,%ebx
> 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> 1211 0.0203 :ffffffff811c161d: pop %rbx
> 5 8.4e-05 :ffffffff811c161e: pop %r12
> 5 8.4e-05 :ffffffff811c1620: pop %r13
> 1249 0.0210 :ffffffff811c1622: pop %r14
> 7 1.2e-04 :ffffffff811c1624: pop %rbp
> 5 8.4e-05 :ffffffff811c1625: retq
> :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> 160703 2.6973 :ffffffff811c1633: mov %edx,%r13d
>
> #### MEL: I was surprised to see this atrocity. It's a PageSwapCache check

Looks like sampling is off by an instruction, because why would a reg->reg
mov take so long? So it's probably a cache miss on struct page, a pointer
to which is in rdi. Which is weird; I would expect memcg to be called on
struct pages that are already hot. It would also mean that if you don't
fetch the struct page from the memcg code, the following code in the
caller will most likely work on the struct page and take the cache miss

> #### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
> #### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
> #### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
> ####
> #### Everything after here is consistent small amounts of overhead just from
> #### being called a lot
>
> 179 0.0030 :ffffffff811c1636: test $0x10000,%eax
> :ffffffff811c163b: je ffffffff811c1648 <mem_cgroup_try_charge+0x58>
> :ffffffff811c163d: xor %eax,%eax
> :ffffffff811c163f: xor %ebx,%ebx
> :ffffffff811c1641: cmpq $0x0,0x38(%rdi)
> :ffffffff811c1646: jne ffffffff811c1615 <mem_cgroup_try_charge+0x25>
> 1343 0.0225 :ffffffff811c1648: mov (%rdi),%rax
> 26 4.4e-04 :ffffffff811c164b: mov $0x1,%r14d
> 24 4.0e-04 :ffffffff811c1651: test $0x40,%ah
> :ffffffff811c1654: je ffffffff811c1665 <mem_cgroup_try_charge+0x75>
> :ffffffff811c1656: mov (%rdi),%rax
> :ffffffff811c1659: test $0x40,%ah
> :ffffffff811c165c: je ffffffff811c1665 <mem_cgroup_try_charge+0x75>
> :ffffffff811c165e: mov 0x68(%rdi),%rcx
> :ffffffff811c1662: shl %cl,%r14d
> 1225 0.0206 :ffffffff811c1665: mov 0xb74f35(%rip),%eax # ffffffff81d365a0 <do_swap_account>
> 66 0.0011 :ffffffff811c166b: test %eax,%eax
> :ffffffff811c166d: jne ffffffff811c16a8 <mem_cgroup_try_charge+0xb8>
> 3 5.0e-05 :ffffffff811c166f: mov %rsi,%rdi
> 22 3.7e-04 :ffffffff811c1672: callq ffffffff811bc920 <get_mem_cgroup_from_mm>
> 1291 0.0217 :ffffffff811c1677: mov %rax,%rbx
> 3 5.0e-05 :ffffffff811c167a: mov %r14d,%edx
> :ffffffff811c167d: mov %r13d,%esi
> 10 1.7e-04 :ffffffff811c1680: mov %rbx,%rdi
> 1380 0.0232 :ffffffff811c1683: callq ffffffff811c0950 <try_charge>
> 10 1.7e-04 :ffffffff811c1688: testb $0x1,0x74(%rbx)
> 1235 0.0207 :ffffffff811c168c: je ffffffff811c16d0 <mem_cgroup_try_charge+0xe0>
> 7 1.2e-04 :ffffffff811c168e: cmp $0xfffffffc,%eax
> :ffffffff811c1691: jne ffffffff811c1615 <mem_cgroup_try_charge+0x25>
> :ffffffff811c1693: mov 0xb74f0e(%rip),%rbx # ffffffff81d365a8 <root_mem_cgroup>
> :ffffffff811c169a: xor %eax,%eax
> :ffffffff811c169c: jmpq ffffffff811c1615 <mem_cgroup_try_charge+0x25>
> :ffffffff811c16a1: nopl 0x0(%rax)
> :ffffffff811c16a8: mov (%rdi),%rax
> :ffffffff811c16ab: test $0x10000,%eax
> :ffffffff811c16b0: je ffffffff811c166f <mem_cgroup_try_charge+0x7f>
> :ffffffff811c16b2: mov %rsi,-0x28(%rbp)
> :ffffffff811c16b6: callq ffffffff811c0450 <try_get_mem_cgroup_from_page>
> :ffffffff811c16bb: test %rax,%rax
> :ffffffff811c16be: mov %rax,%rbx
> :ffffffff811c16c1: mov -0x28(%rbp),%rsi
> :ffffffff811c16c5: jne ffffffff811c167a <mem_cgroup_try_charge+0x8a>
> :ffffffff811c16c7: jmp ffffffff811c166f <mem_cgroup_try_charge+0x7f>
> :ffffffff811c16c9: nopl 0x0(%rax)
> :ffffffff811c16d0: mov 0x18(%rbx),%rdx
> :ffffffff811c16d4: test $0x3,%dl
> :ffffffff811c16d7: jne ffffffff811c16df <mem_cgroup_try_charge+0xef>
> :ffffffff811c16d9: decq %gs:(%rdx)
> :ffffffff811c16dd: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
> :ffffffff811c16df: lea 0x10(%rbx),%rdi
> :ffffffff811c16e3: lock subq $0x1,0x10(%rbx)
> :ffffffff811c16e9: je ffffffff811c16ed <mem_cgroup_try_charge+0xfd>
> :ffffffff811c16eb: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
> :ffffffff811c16ed: mov %eax,-0x28(%rbp)
> :ffffffff811c16f0: callq *0x20(%rbx)
> :ffffffff811c16f3: mov -0x28(%rbp),%eax
> :ffffffff811c16f6: jmp ffffffff811c168e <mem_cgroup_try_charge+0x9e>
> :ffffffff811c16f8: nopl 0x0(%rax,%rax,1)
>
> ffffffff811bc920 <get_mem_cgroup_from_mm>: /* get_mem_cgroup_from_mm total: 7251 0.1217 */
> #### MEL: Nothing really big jumped out there at me.
> 1318 0.0221 :ffffffff811bc920: callq ffffffff816435e0 <__fentry__>
> 19 3.2e-04 :ffffffff811bc925: push %rbp
> 42 7.0e-04 :ffffffff811bc926: mov %rsp,%rbp
> 1278 0.0215 :ffffffff811bc929: jmp ffffffff811bc94b <get_mem_cgroup_from_mm+0x2b>
> :ffffffff811bc92b: nopl 0x0(%rax,%rax,1)
> 1259 0.0211 :ffffffff811bc930: testb $0x1,0x74(%rdx)
> 161 0.0027 :ffffffff811bc934: jne ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
> :ffffffff811bc936: mov 0x18(%rdx),%rax
> :ffffffff811bc93a: test $0x3,%al
> :ffffffff811bc93c: jne ffffffff811bc985 <get_mem_cgroup_from_mm+0x65>
> :ffffffff811bc93e: incq %gs:(%rax)
> :ffffffff811bc942: mov $0x1,%eax
> :ffffffff811bc947: test %al,%al
> :ffffffff811bc949: jne ffffffff811bc980 <get_mem_cgroup_from_mm+0x60>
> 13 2.2e-04 :ffffffff811bc94b: test %rdi,%rdi
> :ffffffff811bc94e: je ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
> 47 7.9e-04 :ffffffff811bc950: mov 0x340(%rdi),%rax
> 1410 0.0237 :ffffffff811bc957: test %rax,%rax
> :ffffffff811bc95a: je ffffffff811bc96c <get_mem_cgroup_from_mm+0x4c>
> 26 4.4e-04 :ffffffff811bc95c: mov 0xca0(%rax),%rax
> 179 0.0030 :ffffffff811bc963: mov 0x70(%rax),%rdx
> 174 0.0029 :ffffffff811bc967: test %rdx,%rdx
> :ffffffff811bc96a: jne ffffffff811bc930 <get_mem_cgroup_from_mm+0x10>
> :ffffffff811bc96c: mov 0xb79c35(%rip),%rdx # ffffffff81d365a8 <root_mem_cgroup>
> 1 1.7e-05 :ffffffff811bc973: testb $0x1,0x74(%rdx)
> :ffffffff811bc977: je ffffffff811bc936 <get_mem_cgroup_from_mm+0x16>
> :ffffffff811bc979: nopl 0x0(%rax)
> 1299 0.0218 :ffffffff811bc980: mov %rdx,%rax
> 4 6.7e-05 :ffffffff811bc983: pop %rbp
> 21 3.5e-04 :ffffffff811bc984: retq
> :ffffffff811bc985: testb $0x2,0x18(%rdx)
> :ffffffff811bc989: jne ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
> :ffffffff811bc98b: mov 0x10(%rdx),%rcx
> :ffffffff811bc98f: test %rcx,%rcx
> :ffffffff811bc992: je ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
> :ffffffff811bc994: lea 0x1(%rcx),%rsi
> :ffffffff811bc998: lea 0x10(%rdx),%r8
> :ffffffff811bc99c: mov %rcx,%rax
> :ffffffff811bc99f: lock cmpxchg %rsi,0x10(%rdx)
> :ffffffff811bc9a5: cmp %rcx,%rax
> :ffffffff811bc9a8: mov %rax,%rsi
> :ffffffff811bc9ab: jne ffffffff811bc9b4 <get_mem_cgroup_from_mm+0x94>
> :ffffffff811bc9ad: mov $0x1,%eax
> :ffffffff811bc9b2: jmp ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
> :ffffffff811bc9b4: test %rsi,%rsi
> :ffffffff811bc9b7: je ffffffff811bc9d2 <get_mem_cgroup_from_mm+0xb2>
> :ffffffff811bc9b9: lea 0x1(%rsi),%rcx
> :ffffffff811bc9bd: mov %rsi,%rax
> :ffffffff811bc9c0: lock cmpxchg %rcx,(%r8)
> :ffffffff811bc9c5: cmp %rax,%rsi
> :ffffffff811bc9c8: je ffffffff811bc9ad <get_mem_cgroup_from_mm+0x8d>
> :ffffffff811bc9ca: mov %rax,%rsi
> :ffffffff811bc9cd: test %rsi,%rsi
> :ffffffff811bc9d0: jne ffffffff811bc9b9 <get_mem_cgroup_from_mm+0x99>
> :ffffffff811bc9d2: xor %eax,%eax
> :ffffffff811bc9d4: jmpq ffffffff811bc947 <get_mem_cgroup_from_mm+0x27>
>

2015-05-19 15:24:35

by Michal Hocko

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue 19-05-15 16:13:02, Mel Gorman wrote:
[...]
> :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> :ffffffff811c1611: xor %eax,%eax
> :ffffffff811c1613: xor %ebx,%ebx
> 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> 1211 0.0203 :ffffffff811c161d: pop %rbx
> 5 8.4e-05 :ffffffff811c161e: pop %r12
> 5 8.4e-05 :ffffffff811c1620: pop %r13
> 1249 0.0210 :ffffffff811c1622: pop %r14
> 7 1.2e-04 :ffffffff811c1624: pop %rbp
> 5 8.4e-05 :ffffffff811c1625: retq
> :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> 160703 2.6973 :ffffffff811c1633: mov %edx,%r13d

Huh, what? Even if this was off by one and the preceding instruction
consumed the time, this would be reading from page->flags, but the page
should be hot by the time we got here, no?

> #### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
> #### /usr/src/linux-4.0-vanilla/./arch/x86/include/asm/bitops.h:311
> #### /usr/src/linux-4.0-vanilla/include/linux/page-flags.h:261
> #### /usr/src/linux-4.0-vanilla/mm/memcontrol.c:5473
--
Michal Hocko
SUSE Labs

2015-05-19 15:41:25

by Mel Gorman

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> [...]
> > :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > :ffffffff811c1611: xor %eax,%eax
> > :ffffffff811c1613: xor %ebx,%ebx
> > 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> > 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> > 1211 0.0203 :ffffffff811c161d: pop %rbx
> > 5 8.4e-05 :ffffffff811c161e: pop %r12
> > 5 8.4e-05 :ffffffff811c1620: pop %r13
> > 1249 0.0210 :ffffffff811c1622: pop %r14
> > 7 1.2e-04 :ffffffff811c1624: pop %rbp
> > 5 8.4e-05 :ffffffff811c1625: retq
> > :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> > 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> > 160703 2.6973 :ffffffff811c1633: mov %edx,%r13d
>
> Huh, what? Even if this was off by one and the preceding instruction has
> consumed the time. This would be reading from page->flags but the page
> should be hot by the time we got here, no?
>

I would have expected so, but it's not the first time I've seen cases where
examining the flags was a costly instruction. I suspect it's due to an
ordering issue or, more likely, a frequent branch mispredict that is being
accounted against this instruction.

--
Mel Gorman
SUSE Labs

2015-05-19 16:04:13

by Mel Gorman

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 04:41:19PM +0100, Mel Gorman wrote:
> On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> > On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> > [...]
> > > :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > > :ffffffff811c1611: xor %eax,%eax
> > > :ffffffff811c1613: xor %ebx,%ebx
> > > 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> > > 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> > > 1211 0.0203 :ffffffff811c161d: pop %rbx
> > > 5 8.4e-05 :ffffffff811c161e: pop %r12
> > > 5 8.4e-05 :ffffffff811c1620: pop %r13
> > > 1249 0.0210 :ffffffff811c1622: pop %r14
> > > 7 1.2e-04 :ffffffff811c1624: pop %rbp
> > > 5 8.4e-05 :ffffffff811c1625: retq
> > > :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> > > 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> > > 160703 2.6973 :ffffffff811c1633: mov %edx,%r13d
> >
> > Huh, what? Even if this was off by one and the preceding instruction has
> > consumed the time. This would be reading from page->flags but the page
> > should be hot by the time we got here, no?
> >
>
> I would have expected so but it's not the first time I've seen cases where
> examining the flags was a costly instruction. I suspect it's due to an
> ordering issue or more likely, a frequent branch mispredict that is being
> accounted for against this instruction.
>

Which is plausible, as forward branches are statically predicted not-taken,
but in this particular load that could be close to a 100% mispredict.
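As an aside, this is the effect the kernel's likely()/unlikely() annotations exist to control. A small standalone sketch (the memcg names below are stand-ins for illustration, not the kernel API, and the kernel implements mem_cgroup_disabled() as a jump label rather than a plain variable):

```c
#include <stdbool.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for mem_cgroup_disabled(). */
static bool memcg_disabled = false;

/* Marking the disabled case unlikely lets the compiler keep the
 * common (enabled) path on the fall-through side of the forward
 * branch, which matches the static not-taken prediction. */
static int charge_sketch(int nr_pages)
{
    if (unlikely(memcg_disabled))
        return 0;            /* fast bail-out, laid out out of line */
    return nr_pages;         /* common path */
}
```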

--
Mel Gorman
SUSE Labs

2015-05-19 16:14:39

by Johannes Weiner

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 05:25:36PM +0200, Vlastimil Babka wrote:
> On 05/19/2015 05:13 PM, Mel Gorman wrote:
> >### MEL: Function entry, check for mem_cgroup_disabled()
> >
> >
> > :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > :ffffffff811c1611: xor %eax,%eax
> > :ffffffff811c1613: xor %ebx,%ebx
> > 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> > 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> > 1211 0.0203 :ffffffff811c161d: pop %rbx
> > 5 8.4e-05 :ffffffff811c161e: pop %r12
> > 5 8.4e-05 :ffffffff811c1620: pop %r13
> > 1249 0.0210 :ffffffff811c1622: pop %r14
> > 7 1.2e-04 :ffffffff811c1624: pop %rbp
> > 5 8.4e-05 :ffffffff811c1625: retq
> > :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> > 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> >160703 2.6973 :ffffffff811c1633: mov %edx,%r13d
> >
> >#### MEL: I was surprised to see this atrocity. It's a PageSwapCache check
>
> Looks like sampling is off by an instruction, because why would a reg->reg mov
> take so long. So it's probably a cache miss on struct page, pointer to which
> is in rdi. Which is weird, I would expect memcg to be called on struct pages
> that are already hot.

Yeah, anonymous faults do __SetPageUptodate() right before passing the
page into mem_cgroup_try_charge(). page->flags should be hot.

> It would also mean that if you don't fetch the struct
> page from the memcg code, then the following code in the caller will most
> likely work on the struct page and get the cache miss anyway?

Which is why the runtime reduction doesn't match the profile
reduction. The cost seems to get shifted somewhere else.

2015-05-19 17:10:12

by Ben Hutchings

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, 2015-05-19 at 17:15 +0200, Michal Hocko wrote:
> [Let's CC Ben here - the email thread has started here:
> http://marc.info/?l=linux-mm&m=143203206402073&w=2 and it seems Debian
> is disabling memcg controller already so this might be of your interest]
>
> On Tue 19-05-15 15:43:45, Mel Gorman wrote:
> [...]
> > After I wrote the patch, I spotted that Debian apparently already
> > does something like this and by coincidence they matched the
> > parameter name and values. See the memory controller instructions on
> > https://wiki.debian.org/LXC#Prepare_the_host . So in this case at least
> > upstream would match something that at least one distro in the field
> > already uses.
>
> I've read through
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964 and it seems
> that the primary motivation for the runtime disabling was the _memory_
> overhead of the struct page_cgroup
> (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=534964#152). This is
> no longer the case since 1306a85aed3e ("mm: embed the memcg pointer
> directly into struct page") merged in 3.19.
>
> I can see some point in disabling the memcg due to runtime overhead.

I was also concerned about runtime overhead.

> There will always be some, albeit hard to notice. If an user really need
> this to happen there is a command line option for that. The question is
> who would do CONFIG_MEMCG && !MEMCG_DEFAULT_ENABLED. Do you expect any
> distributions go that way?
> Ben, would you welcome such a change upstream or is there a reason to
> change the Debian kernel runtime default now that the memory overhead is
> mostly gone (for 3.19+ kernels of course)?

I have been meaning to reevaluate this as I know the overhead has been
reduced. Given Mel's benchmark results, I favour keeping it disabled by
default in Debian. So I would welcome this change.

Ben.

--
Ben Hutchings
I'm not a reverse psychological virus. Please don't copy me into your sig.



2015-05-19 19:32:43

by Mel Gorman

Subject: Re: [PATCH] mm, memcg: Optionally disable memcg by default using Kconfig

On Tue, May 19, 2015 at 05:04:04PM +0100, Mel Gorman wrote:
> On Tue, May 19, 2015 at 04:41:19PM +0100, Mel Gorman wrote:
> > On Tue, May 19, 2015 at 05:27:10PM +0200, Michal Hocko wrote:
> > > On Tue 19-05-15 16:13:02, Mel Gorman wrote:
> > > [...]
> > > > :ffffffff811c160f: je ffffffff811c1630 <mem_cgroup_try_charge+0x40>
> > > > :ffffffff811c1611: xor %eax,%eax
> > > > :ffffffff811c1613: xor %ebx,%ebx
> > > > 1 1.7e-05 :ffffffff811c1615: mov %rbx,(%r12)
> > > > 7 1.2e-04 :ffffffff811c1619: add $0x10,%rsp
> > > > 1211 0.0203 :ffffffff811c161d: pop %rbx
> > > > 5 8.4e-05 :ffffffff811c161e: pop %r12
> > > > 5 8.4e-05 :ffffffff811c1620: pop %r13
> > > > 1249 0.0210 :ffffffff811c1622: pop %r14
> > > > 7 1.2e-04 :ffffffff811c1624: pop %rbp
> > > > 5 8.4e-05 :ffffffff811c1625: retq
> > > > :ffffffff811c1626: nopw %cs:0x0(%rax,%rax,1)
> > > > 295 0.0050 :ffffffff811c1630: mov (%rdi),%rax
> > > > 160703 2.6973 :ffffffff811c1633: mov %edx,%r13d
> > >
> > > Huh, what? Even if this was off by one and the preceding instruction has
> > > consumed the time. This would be reading from page->flags but the page
> > > should be hot by the time we got here, no?
> > >
> >
> > I would have expected so but it's not the first time I've seen cases where
> > examining the flags was a costly instruction. I suspect it's due to an
> > ordering issue or more likely, a frequent branch mispredict that is being
> > accounted for against this instruction.
> >
>
> Which is plausible as forward branches are statically predicted false but
> in this particular load that could be a close to a 100% mispredict.
>

Plausible but wrong. The responsible instruction was too far away, so it
looks more like an ordering issue where the PageSwapCache check must be
ordered against marking the page up to date. __SetPageUptodate is a
barrier that is necessary before the PTE is established and visible, but it
does not have to be ordered against the memcg charging. In fact it makes
sense to do it afterwards in case the charge fails and the page is never
made visible. Just adjusting that reduces the cost to

/usr/src/linux-4.0-chargefirst-v1r1/mm/memcontrol.c 3.8547 228233
__mem_cgroup_count_vm_event 1.172% 69393
mem_cgroup_page_lruvec 0.464% 27456
mem_cgroup_commit_charge 0.390% 23072
uncharge_list 0.327% 19370
mem_cgroup_update_lru_size 0.284% 16831
get_mem_cgroup_from_mm 0.262% 15523
mem_cgroup_try_charge 0.256% 15147
memcg_check_events 0.222% 13120
mem_cgroup_charge_statistics.isra.22 0.194% 11470
commit_charge 0.145% 8615
try_charge 0.139% 8236

Big sinner there is updating per-cpu stats -- root cgroup stats I assume? To
refresh, a complete disable looks like

/usr/src/linux-4.0-nomemcg-v1r1/mm/memcontrol.c 0.4834 27511
mem_cgroup_page_lruvec 0.161% 9172
mem_cgroup_update_lru_size 0.154% 8794
mem_cgroup_try_charge 0.126% 7194
mem_cgroup_commit_charge 0.041% 2351

Still, 6.64% down to 3.85% is better than a kick in the head. Unprofiled
performance looks like

pft faults
4.0.0 4.0.0 4.0.0
vanilla nomemcg-v1 chargefirst-v1
Hmean faults/cpu-1 1443258.1051 ( 0.00%) 1530574.6033 ( 6.05%) 1487623.0037 ( 3.07%)
Hmean faults/cpu-3 1340385.9270 ( 0.00%) 1375156.5834 ( 2.59%) 1351401.2578 ( 0.82%)
Hmean faults/cpu-5 875599.0222 ( 0.00%) 876217.9211 ( 0.07%) 876122.6489 ( 0.06%)
Hmean faults/cpu-7 601146.6726 ( 0.00%) 599068.4360 ( -0.35%) 600944.9229 ( -0.03%)
Hmean faults/cpu-8 510728.2754 ( 0.00%) 509887.9960 ( -0.16%) 510906.3818 ( 0.03%)
Hmean faults/sec-1 1432084.7845 ( 0.00%) 1518566.3541 ( 6.04%) 1475994.2194 ( 3.07%)
Hmean faults/sec-3 3943818.1437 ( 0.00%) 4036918.0217 ( 2.36%) 3973070.2159 ( 0.74%)
Hmean faults/sec-5 3877573.5867 ( 0.00%) 3922745.9207 ( 1.16%) 3891705.1749 ( 0.36%)
Hmean faults/sec-7 3991832.0418 ( 0.00%) 3990670.8481 ( -0.03%) 3989110.4674 ( -0.07%)
Hmean faults/sec-8 3987189.8167 ( 0.00%) 3978842.8107 ( -0.21%) 3981011.2936 ( -0.15%)

Very minor boost. The same reordering looks like it would also suit
do_wp_page. I'll do that, retest, put some lipstick on the patches and
post them tomorrow or the day after. The reordering one probably makes sense
anyway; the default disabling of memcg still has merit, but if that
charging of the root group can be eliminated then it might be pointless.
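The reordering being described can be sketched as follows. Everything here is an illustrative stub of the anonymous fault path, not the kernel code; in particular the real __SetPageUptodate also acts as a write barrier before the PTE is established:

```c
#include <stdbool.h>

struct page_stub { unsigned long flags; };
#define PG_UPTODATE 0x1UL

/* Stub standing in for mem_cgroup_try_charge(); returns false on
 * charge failure. */
static bool try_charge_stub(struct page_stub *page)
{
    (void)page;
    return true;
}

/* Charge first, and only touch page->flags once the charge has
 * succeeded. A failed charge never marks the page up to date because
 * the page is never made visible, and the flags write no longer has
 * to be ordered against the charge path. */
static bool fault_path_sketch(struct page_stub *page)
{
    if (!try_charge_stub(page))
        return false;             /* page stays invisible */
    page->flags |= PG_UPTODATE;   /* __SetPageUptodate equivalent */
    return true;
}
```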

--
Mel Gorman
SUSE Labs