On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> Memory cgroups are using large chunks of percpu memory to store vmstat
> data. Yet this memory is not accounted at all, so in the case when there
> are many (dying) cgroups, it's not exactly clear where all the memory is.
>
> Because the size of memory cgroup internal structures can dramatically
> exceed the size of the object or page which is pinning it in memory, it's
> not a good idea to simply ignore it. It actually breaks the isolation
> between cgroups.
>
> Let's account the consumed percpu memory to the parent cgroup.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Acked-by: Dennis Zhou <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
This makes sense, and the accounting is in line with how we track and
distribute child creation quotas (cgroup.max.descendants and
cgroup.max.depth) up the cgroup tree.
I have one minor comment that isn't a dealbreaker for me:
> @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
> 
> -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
>  		kfree(pn);
>  		return 1;
>  	}
> 
> -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_cpu) {
>  		free_percpu(pn->lruvec_stat_local);
>  		kfree(pn);
> @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
> 
> -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_local)
>  		goto fail;
> 
> -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> +						 GFP_KERNEL_ACCOUNT);
>  	if (!memcg->vmstats_percpu)
>  		goto fail;
> 
> @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  	struct mem_cgroup *memcg;
>  	long error = -ENOMEM;
> 
> +	memalloc_use_memcg(parent);
>  	memcg = mem_cgroup_alloc();
> +	memalloc_unuse_memcg();
The disconnect between 1) requesting accounting and 2) which cgroup to
charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
hand grenade, because accounting to the current task is almost
guaranteed to be wrong if the use_memcg() annotation were to get lost
in a refactor or not make it to a new caller of the function.
The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
used elsewhere. And pretending it's an independent interface would be
overengineering. But how about the following in mem_cgroup_alloc() and
alloc_mem_cgroup_per_node_info() to document that caller relationship:
/* We charge the parent cgroup, never the current task */
WARN_ON_ONCE(!current->active_memcg);
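Schematically, the relationship that comment would document looks like
this (a condensed sketch based on the hunks above, with error handling
and most of the function bodies elided, not the literal code):

static struct mem_cgroup *mem_cgroup_alloc(void)
{
	struct mem_cgroup *memcg;

	/* We charge the parent cgroup, never the current task */
	WARN_ON_ONCE(!current->active_memcg);

	/* ... memcg itself is allocated and initialized here ... */

	/* GFP_KERNEL_ACCOUNT charges current->active_memcg ... */
	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
						GFP_KERNEL_ACCOUNT);
	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
						 GFP_KERNEL_ACCOUNT);

	/* ... */
	return memcg;
}

static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
{
	struct mem_cgroup *parent = mem_cgroup_from_css(parent_css);
	struct mem_cgroup *memcg;

	/* ... which the caller points at the parent cgroup: */
	memalloc_use_memcg(parent);
	memcg = mem_cgroup_alloc();
	memalloc_unuse_memcg();

	/* ... */
}

That way a refactor or a new caller that loses the memalloc_use_memcg()
bracket trips the warning instead of silently charging the current task.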
On Tue, Aug 11, 2020 at 11:27:37AM -0400, Johannes Weiner wrote:
> On Tue, Jun 23, 2020 at 11:45:14AM -0700, Roman Gushchin wrote:
> > Memory cgroups are using large chunks of percpu memory to store vmstat
> > data. Yet this memory is not accounted at all, so in the case when there
> > are many (dying) cgroups, it's not exactly clear where all the memory is.
> >
> > Because the size of memory cgroup internal structures can dramatically
> > exceed the size of the object or page which is pinning it in memory, it's
> > not a good idea to simply ignore it. It actually breaks the isolation
> > between cgroups.
> >
> > Let's account the consumed percpu memory to the parent cgroup.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Acked-by: Dennis Zhou <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
Thank you!
>
> This makes sense, and the accounting is in line with how we track and
> distribute child creation quotas (cgroup.max.descendants and
> cgroup.max.depth) up the cgroup tree.
>
> I have one minor comment that isn't a dealbreaker for me:
>
> > @@ -5069,13 +5069,15 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
> >  	if (!pn)
> >  		return 1;
> >
> > -	pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_local) {
> >  		kfree(pn);
> >  		return 1;
> >  	}
> >
> > -	pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat);
> > +	pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!pn->lruvec_stat_cpu) {
> >  		free_percpu(pn->lruvec_stat_local);
> >  		kfree(pn);
> > @@ -5149,11 +5151,13 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> >  		goto fail;
> >  	}
> >
> > -	memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_local)
> >  		goto fail;
> >
> > -	memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu);
> > +	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
> > +						 GFP_KERNEL_ACCOUNT);
> >  	if (!memcg->vmstats_percpu)
> >  		goto fail;
> >
> > @@ -5202,7 +5206,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >  	struct mem_cgroup *memcg;
> >  	long error = -ENOMEM;
> >
> > +	memalloc_use_memcg(parent);
> >  	memcg = mem_cgroup_alloc();
> > +	memalloc_unuse_memcg();
>
> The disconnect between 1) requesting accounting and 2) which cgroup to
> charge is making me uneasy. It makes mem_cgroup_alloc() a bit of a
> hand grenade, because accounting to the current task is almost
> guaranteed to be wrong if the use_memcg() annotation were to get lost
> in a refactor or not make it to a new caller of the function.
>
> The saving grace is that mem_cgroup_alloc() is pretty unlikely to be
> used elsewhere. And pretending it's an independent interface would be
> overengineering. But how about the following in mem_cgroup_alloc() and
> alloc_mem_cgroup_per_node_info() to document that caller relationship:
>
> /* We charge the parent cgroup, never the current task */
> WARN_ON_ONCE(!current->active_memcg);
I have nothing against it.
Andrew, can you please squash the following diff into the patch?
Thanks!
--
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 130093bdf74b..e25f2db7e61c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return 1;
 
+	/* We charge the parent cgroup, never the current task */
+	WARN_ON_ONCE(!current->active_memcg);
+
 	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
 						 GFP_KERNEL_ACCOUNT);
 	if (!pn->lruvec_stat_local) {
@@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 		goto fail;
 	}
 
+	/* We charge the parent cgroup, never the current task */
+	WARN_ON_ONCE(!current->active_memcg);
+
 	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
 						GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_local)
The following kernel warnings were noticed on linux-next 20200813 while booting
on arm64, arm, x86_64 and i386.
metadata:
git branch: master
git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git commit: e6d113aca646fb6a92b237340109237fd7a9c770
git describe: next-20200813
make_kernelversion: 5.8.0
kernel-config:
https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 130093bdf74b..e25f2db7e61c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5137,6 +5137,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
>  	if (!pn)
>  		return 1;
> 
> +	/* We charge the parent cgroup, never the current task */
> +	WARN_ON_ONCE(!current->active_memcg);
> +
>  	pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
>  						 GFP_KERNEL_ACCOUNT);
>  	if (!pn->lruvec_stat_local) {
> @@ -5219,6 +5222,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
>  		goto fail;
>  	}
> 
> +	/* We charge the parent cgroup, never the current task */
> +	WARN_ON_ONCE(!current->active_memcg);
[ 0.217404] ------------[ cut here ]------------
[ 0.218038] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5226
mem_cgroup_css_alloc+0x680/0x740
[ 0.219188] Modules linked in:
[ 0.219597] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.0-next-20200813 #1
[ 0.220187] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[ 0.221190] EIP: mem_cgroup_css_alloc+0x680/0x740
[ 0.222190] Code: d6 17 5d ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d 74
26 00 b8 58 39 6a d1 e8 fe 94 55 ff 8d 65 f4 89 d8 5b 5e 5f 5d c3 8d
74 26 00 <0f> 0b e9 01 fa ff ff 8d b4 26 00 00 00 00 66 90 bb f4 ff ff
ff ba
[ 0.223188] EAX: 00000000 EBX: d13666c0 ECX: 00000cc0 EDX: 0000ffff
[ 0.224187] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[ 0.225188] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 0.226190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[ 0.227195] Call Trace:
[ 0.227882] ? _cond_resched+0x17/0x30
[ 0.228195] cgroup_init_subsys+0x66/0x12a
[ 0.229193] cgroup_init+0x118/0x323
[ 0.230194] start_kernel+0x43c/0x47d
[ 0.231193] i386_start_kernel+0x48/0x4a
[ 0.232194] startup_32_smp+0x164/0x168
[ 0.233195] ---[ end trace dfcf9be7b40caf05 ]---
[    0.234208] ------------[ cut here ]------------
[ 0.235192] WARNING: CPU: 0 PID: 0 at mm/memcontrol.c:5141
mem_cgroup_css_alloc+0x718/0x740
[ 0.236187] Modules linked in:
[ 0.236590] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W
5.8.0-next-20200813 #1
[ 0.237190] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.12.0-1 04/01/2014
[ 0.238194] EIP: mem_cgroup_css_alloc+0x718/0x740
[ 0.239191] Code: 48 ff e9 7c fd ff ff 8d 76 00 a1 b0 14 40 d1 e9
53 fc ff ff 8d b6 00 00 00 00 0f 0b 8d b6 00 00 00 00 0f 0b 8d b6 00
00 00 00 <0f> 0b e9 df f9 ff ff 90 89 f8 e8 29 0c 5c ff 89 f2 b8 10 f4
40 d1
[ 0.240190] EAX: 00000000 EBX: f4c0c800 ECX: 00000000 EDX: d0eab660
[ 0.241189] ESI: 00000000 EDI: f4c11000 EBP: d1361f50 ESP: d1361f40
[ 0.242189] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
[ 0.243190] CR0: 80050033 CR2: ffd19000 CR3: 115f8000 CR4: 00040690
[ 0.244188] Call Trace:
[ 0.245191] ? _cond_resched+0x17/0x30
[ 0.245686] cgroup_init_subsys+0x66/0x12a
[ 0.246189] cgroup_init+0x118/0x323
[ 0.246654] start_kernel+0x43c/0x47d
[ 0.247189] i386_start_kernel+0x48/0x4a
[ 0.247697] startup_32_smp+0x164/0x168
[ 0.248188] ---[ end trace dfcf9be7b40caf06 ]---
[ 0.248990] Last level iTLB entries: 4KB 512, 2MB 255, 4MB 127
[ 0.249187] Last level dTLB entries: 4KB 512, 2MB 255, 4MB 127, 1GB 0
Full test log,
https://qa-reports.linaro.org/lkft/linux-next-oe/build/next-20200813/testrun/3061112/suite/linux-log-parser/test/check-kernel-warning-1665815/log
--
Linaro LKFT
https://lkft.linaro.org
Hi Naresh,
On Thu, 13 Aug 2020 14:46:51 +0530 Naresh Kamboju <[email protected]> wrote:
>
> The following kernel warnings were noticed on linux-next 20200813 while booting
> on arm64, arm, x86_64 and i386.
>
> metadata:
> git branch: master
> git repo: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
> git commit: e6d113aca646fb6a92b237340109237fd7a9c770
> git describe: next-20200813
> make_kernelversion: 5.8.0
> kernel-config:
> https://builds.tuxbuild.com/YQHc_PpEV-DF8rU7N9tlIQ/kernel.config
Actually, that commit is in Linus' tree.
It has been fixed today. Thanks for reporting.
--
Cheers,
Stephen Rothwell
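For context on the warnings above: both WARN_ON_ONCE() sites fire while
the root memory cgroup is being allocated from cgroup_init_subsys() at
boot (see the call traces). The root cgroup has no parent to charge, so
mem_cgroup_css_alloc() runs with a NULL parent_css, memalloc_use_memcg(NULL)
leaves current->active_memcg NULL, and the new check trips. A guard along
the following lines would keep the warning for child cgroups while skipping
the root; this is only an illustrative sketch, not necessarily the fix that
actually went into Linus' tree:

	/*
	 * Illustrative sketch only: root_mem_cgroup is still NULL while the
	 * root cgroup itself is being allocated at boot, so this warns only
	 * for child cgroups, which do have a parent to charge.
	 */
	WARN_ON_ONCE(root_mem_cgroup && !current->active_memcg);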