SLAB objects which outlive their memcg are moved to their parent
memcg where they may be uncharged. However if they are moved to the
root memcg, uncharging will result in negative page counter values as
root has no page counters.
To prevent this, we check whether we are about to uncharge the root
memcg and skip it if we are. Possibly the obj_cgroups should instead be
removed from their slabs and any per-CPU stocks, rather than being
reparented to root?
The warning can be reproduced, unreliably, with the LTP test
madvise06 if the entire patch series
https://lore.kernel.org/linux-mm/[email protected]/
is present. Although the commit listed in 'Fixes' appears to introduce
the bug, I cannot reproduce it with just that commit, and bisecting
runs into other bugs.
[ 12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029539] Modules linked in:
[ 12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
[ 12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[ 12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
[ 12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
[ 12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
[ 12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
[ 12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
[ 12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
[ 12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
[ 12.030895] FS: 0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
[ 12.031017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
[ 12.031209] Call Trace:
[ 12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[ 12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[ 12.031594] refill_obj_stock (mm/memcontrol.c:3166)
[ 12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
[ 12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
[ 12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
[ 12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
[ 12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
[ 12.032664] __do_softirq (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/irq.h:142 kernel/softirq.c:299)
[ 12.032766] run_ksoftirqd (./arch/x86/include/asm/irqflags.h:54 ./arch/x86/include/asm/irqflags.h:94 kernel/softirq.c:653 kernel/softirq.c:644)
[ 12.032852] smpboot_thread_fn (kernel/smpboot.c:165)
[ 12.032940] ? smpboot_register_percpu_thread (kernel/smpboot.c:108)
[ 12.033059] kthread (kernel/kthread.c:292)
[ 12.033148] ? __kthread_bind_mask (kernel/kthread.c:245)
[ 12.033269] ret_from_fork (arch/x86/entry/entry_64.S:300)
[ 12.033357] ---[ end trace 961dbfc01c109d1f ]---
[ 9.841552] ------------[ cut here ]------------
[ 9.841788] WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.841982] Modules linked in:
[ 9.842072] CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
[ 9.842266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
[ 9.842571] Workqueue: events drain_local_stock
[ 9.842750] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
[ 9.842894] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 4b f9 88 2a 48 8b 17 48 39 d6 72 41 41 54 49 89
[ 9.843438] RSP: 0018:ffffb1c18006be28 EFLAGS: 00010086
[ 9.843585] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: ffff94803bc2cae0
[ 9.843806] RDX: 0000000000000001 RSI: ffffffffffffffff RDI: ffff948007d2b248
[ 9.844026] RBP: ffff948007d2b248 R08: ffff948007c58eb0 R09: ffff948007da05ac
[ 9.844248] R10: 0000000000000018 R11: 0000000000000018 R12: 0000000000000001
[ 9.844477] R13: ffffffffffffffff R14: 0000000000000000 R15: ffff94803bc2cac0
[ 9.844696] FS: 0000000000000000(0000) GS:ffff94803bc00000(0000) knlGS:0000000000000000
[ 9.844915] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.845096] CR2: 00007f0579ee0384 CR3: 000000002cc0a000 CR4: 00000000000006f0
[ 9.845319] Call Trace:
[ 9.845429] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
[ 9.845582] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
[ 9.845684] drain_local_stock (mm/memcontrol.c:2255)
[ 9.845789] process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
[ 9.845898] worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
[ 9.846034] ? process_one_work (kernel/workqueue.c:2358)
[ 9.846162] kthread (kernel/kthread.c:292)
[ 9.846271] ? __kthread_bind_mask (kernel/kthread.c:245)
[ 9.846420] ret_from_fork (arch/x86/entry/entry_64.S:300)
[ 9.846531] ---[ end trace 8b5647c1eba9d18a ]---
Reported-By: [email protected]
Signed-off-by: Richard Palethorpe <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: [email protected]
Cc: [email protected]
Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
---
mm/memcontrol.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6877c765b8d0..214e1fe4e9a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,7 +291,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
spin_lock_irqsave(&css_set_lock, flags);
memcg = obj_cgroup_memcg(objcg);
- if (nr_pages)
+ if (nr_pages && !mem_cgroup_is_root(memcg))
__memcg_kmem_uncharge(memcg, nr_pages);
list_del(&objcg->list);
mem_cgroup_put(memcg);
@@ -3100,6 +3100,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
static void drain_obj_stock(struct memcg_stock_pcp *stock)
{
struct obj_cgroup *old = stock->cached_objcg;
+ struct mem_cgroup *memcg;
if (!old)
return;
@@ -3110,7 +3111,9 @@ static void drain_obj_stock(struct memcg_stock_pcp *stock)
if (nr_pages) {
rcu_read_lock();
- __memcg_kmem_uncharge(obj_cgroup_memcg(old), nr_pages);
+ memcg = obj_cgroup_memcg(old);
+ if (!mem_cgroup_is_root(memcg))
+ __memcg_kmem_uncharge(memcg, nr_pages);
rcu_read_unlock();
}
--
2.28.0
Hi Richard!
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
>
> To prevent this, we check whether we are about to uncharge the root
> memcg and skip it if we are. Possibly instead; the obj_cgroups should
> be removed from their slabs and any per cpu stocks instead of
> reparenting them to root?
It would be really complex. I think your fix is totally fine.
We have similar checks in cancel_charge(), uncharge_batch(),
mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.
>
> The warning can be, unreliably, reproduced with the LTP test
> madvise06 if the entire patch series
> https://lore.kernel.org/linux-mm/[email protected]/
> is present. Although the listed commit in 'fixes' appears to introduce
> the bug, I can not reproduce it with just that commit and bisecting
> runs into other bugs.
>
> [ 12.029417] WARNING: CPU: 2 PID: 21 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029539] Modules linked in:
> [ 12.029611] CPU: 2 PID: 21 Comm: ksoftirqd/2 Not tainted 5.9.0-rc7-22-default #76
> [ 12.029729] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
> [ 12.029908] RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
> [ 12.029991] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 e8 db 47 36 27 48 8b 17 48 39 d6 72 41 41 54 49 89
> [ 12.030258] RSP: 0018:ffffa5d8000efd08 EFLAGS: 00010086
> [ 12.030344] RAX: ffffffffffffffff RBX: ffffffffffffffff RCX: 0000000000000009
> [ 12.030455] RDX: 000000000000000b RSI: ffffffffffffffff RDI: ffff8ef8c7d2b248
> [ 12.030561] RBP: ffff8ef8c7d2b248 R08: ffff8ef8c78b19c8 R09: 0000000000000001
> [ 12.030672] R10: 0000000000000000 R11: ffff8ef8c780e0d0 R12: 0000000000000001
> [ 12.030784] R13: ffffffffffffffff R14: ffff8ef9478b19c8 R15: 0000000000000000
> [ 12.030895] FS: 0000000000000000(0000) GS:ffff8ef8fbc80000(0000) knlGS:0000000000000000
> [ 12.031017] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 12.031104] CR2: 00007f72c0af93ec CR3: 000000005c40a000 CR4: 00000000000006e0
> [ 12.031209] Call Trace:
> [ 12.031267] __memcg_kmem_uncharge (mm/memcontrol.c:3022)
> [ 12.031470] drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
> [ 12.031594] refill_obj_stock (mm/memcontrol.c:3166)
> [ 12.031733] ? rcu_do_batch (kernel/rcu/tree.c:2438)
> [ 12.032075] memcg_slab_free_hook (./include/linux/mm.h:1294 ./include/linux/mm.h:1441 mm/slab.h:368 mm/slab.h:348)
> [ 12.032339] kmem_cache_free (mm/slub.c:3107 mm/slub.c:3143 mm/slub.c:3158)
> [ 12.032464] rcu_do_batch (kernel/rcu/tree.c:2438)
> [ 12.032567] rcu_core (kernel/rcu/tree_plugin.h:2122 kernel/rcu/tree_plugin.h:2157 kernel/rcu/tree.c:2661)
> ...
> Reported-By: [email protected]
> Signed-off-by: Richard Palethorpe <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
Acked-by: Roman Gushchin <[email protected]>
Thanks!
Hello Roman,
Roman Gushchin <[email protected]> writes:
> Hi Richard!
>
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
>>
>> To prevent this, we check whether we are about to uncharge the root
>> memcg and skip it if we are. Possibly instead; the obj_cgroups should
>> be removed from their slabs and any per cpu stocks instead of
>> reparenting them to root?
>
> It would be really complex. I think your fix is totally fine.
> We have similar checks in cancel_charge(), uncharge_batch(),
> mem_cgroup_swapout(), mem_cgroup_uncharge_swap() etc.
>
>
> Acked-by: Roman Gushchin <[email protected]>
>
> Thanks!
Great I will respin.
--
Thank you,
Richard.
Hello Michal,
Michal Koutný <[email protected]> writes:
> Hello.
>
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <[email protected]> wrote:
>> SLAB objects which outlive their memcg are moved to their parent
>> memcg where they may be uncharged. However if they are moved to the
>> root memcg, uncharging will result in negative page counter values as
>> root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)
I traced it and can see they are reparented objects and that the root
group's counters are zero (or negative if I run madvise06 multiple times)
before a drain takes place. I'm guessing this is because the root group
has 'use_hierarchy' set to false, so that the children's page_counter
parents are set to NULL. However, I will check, because I'm not sure
about either.
>
> (But the patch seems justifiable to me as objects (not)charged directly to
> root memcg may be incorrectly uncharged.)
>
> Thanks,
> Michal
--
Thank you,
Richard.
Hello.
On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <[email protected]> wrote:
> SLAB objects which outlive their memcg are moved to their parent
> memcg where they may be uncharged. However if they are moved to the
> root memcg, uncharging will result in negative page counter values as
> root has no page counters.
Why do you think those are reparented objects? If those are originally
charged in a non-root cgroup, then the charge value should be propagated up the
hierarchy, including root memcg, so if they're later uncharged in root
after reparenting, it should still break even. (Or did I miss some stock
imbalance?)
(But the patch seems justifiable to me as objects (not)charged directly to
root memcg may be incorrectly uncharged.)
Thanks,
Michal
On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> Hello.
>
> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <[email protected]> wrote:
> > SLAB objects which outlive their memcg are moved to their parent
> > memcg where they may be uncharged. However if they are moved to the
> > root memcg, uncharging will result in negative page counter values as
> > root has no page counters.
> Why do you think those are reparented objects? If those are originally
> charged in a non-root cgroup, then the charge value should be propagated up the
> hierarchy, including root memcg, so if they're later uncharged in root
> after reparenting, it should still break even. (Or did I miss some stock
> imbalance?)
Looking a bit closer at this code, it's kind of a mess right now.
The central try_charge() function charges recursively all the way up
to and including the root. But not if it's called directly on the
root, in which case it bails and does nothing.
kmem and objcg use try_charge(), so they have the same
behavior. get_obj_cgroup_from_current() does its own redundant
filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
does not, but its callsite __memcg_kmem_charge_page() does.
We should clean this up one way or another: either charge the root or
don't, but do it consistently.
Since we export memory.stat at the root now, we should probably just
always charge the root instead of special-casing it all over the place
and risking bugs.
Indeed, it looks like there is at least one bug where the root-level
memory.stat shows non-root slab objects, but not root ones, whereas it
shows all anon and cache pages, root or no root.
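
For reference, a rough sketch of the asymmetry described above, paraphrased
from the hunks quoted further down in this thread (simplified, not the
complete functions):

/* mm/memcontrol.c: charging bails out early on the root memcg itself */
static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
		      unsigned int nr_pages)
{
	if (mem_cgroup_is_root(memcg))
		return 0;
	/* ... actual charging, including page_counter_charge() ... */
}

/*
 * mm/page_counter.c: but a charge made against any descendant walks the
 * whole parent chain, including the root's page counter.
 */
void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
{
	struct page_counter *c;

	for (c = counter; c; c = c->parent)
		atomic_long_add(nr_pages, &c->usage);
	/* (watermark handling omitted) */
}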
Hello,
Richard Palethorpe <[email protected]> writes:
> Hello Michal,
>
> Michal Koutný <[email protected]> writes:
>
>> Hello.
>>
>> On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <[email protected]> wrote:
>>> SLAB objects which outlive their memcg are moved to their parent
>>> memcg where they may be uncharged. However if they are moved to the
>>> root memcg, uncharging will result in negative page counter values as
>>> root has no page counters.
>> Why do you think those are reparented objects? If those are originally
>> charged in a non-root cgroup, then the charge value should be propagated up the
>> hierarchy, including root memcg, so if they're later uncharged in root
>> after reparenting, it should still break even. (Or did I miss some stock
>> imbalance?)
>
> I traced it and can see they are reparented objects and that the root
> groups counters are zero (or negative if I run madvise06 multiple times)
> before a drain takes place. I'm guessing this is because the root group
> has 'use_hierachy' set to false so that the childs page_counter parents
> are set to NULL. However I will check, because I'm not sure about
> either.
Yes, it appears that use_hierarchy=0, which is probably because the test
mounts cgroup v1, creates a child group within that and does not set
use_hierarchy on the root. On v2 the root always has use_hierarchy enabled.
>
>>
>> (But the patch seems justifiable to me as objects (not)charged directly to
>> root memcg may be incorrectly uncharged.)
>>
>> Thanks,
>> Michal
I don't know if that could happen without reparenting. I suppose if
use_hierarchy=1 then this patch will actually result in root being
overcharged, so perhaps it should also check for use_hierarchy?
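
For context, this is the pre-existing initialisation branch (as visible in
the context lines of the diff Roman posts below): with use_hierarchy == false
the child's page counters get a NULL parent, so charges made in the child
never propagate to the root's counters, while an uncharge issued against root
after objcg reparenting still hits them directly.

	if (parent && parent->use_hierarchy) {
		memcg->use_hierarchy = true;
		page_counter_init(&memcg->memory, &parent->memory);
		page_counter_init(&memcg->swap, &parent->swap);
		page_counter_init(&memcg->kmem, &parent->kmem);
		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
	} else {
		/* use_hierarchy == false: counters are detached from the parent */
		page_counter_init(&memcg->memory, NULL);
		page_counter_init(&memcg->swap, NULL);
		page_counter_init(&memcg->kmem, NULL);
		page_counter_init(&memcg->tcpmem, NULL);
	}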
--
Thank you,
Richard.
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner wrote:
> On Fri, Oct 16, 2020 at 11:47:02AM +0200, Michal Koutný wrote:
> > Hello.
> >
> > On Wed, Oct 14, 2020 at 08:07:49PM +0100, Richard Palethorpe <[email protected]> wrote:
> > > SLAB objects which outlive their memcg are moved to their parent
> > > memcg where they may be uncharged. However if they are moved to the
> > > root memcg, uncharging will result in negative page counter values as
> > > root has no page counters.
> > Why do you think those are reparented objects? If those are originally
> > charged in a non-root cgroup, then the charge value should be propagated up the
> > hierarchy, including root memcg, so if they're later uncharged in root
> > after reparenting, it should still break even. (Or did I miss some stock
> > imbalance?)
>
> Looking a bit closer at this code, it's kind of a mess right now.
>
> The central try_charge() function charges recursively all the way up
> to and including the root. But not if it's called directly on the
> root, in which case it bails and does nothing.
>
> kmem and objcg use try_charge(), so they have the same
> behavior. get_obj_cgroup_from_current() does it's own redundant
> filtering for root_mem_cgroup, whereas get_mem_cgroup_from_current()
> does not, but its callsite __memcg_kmem_charge_page() does.
>
> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
+1
>
> Since we export memory.stat at the root now, we should probably just
> always charge the root instead of special-casing it all over the place
> and risking bugs.
Hm, we export memory.stat but not memory.current. Charging the root memcg
seems to be an extra atomic operation, which can be avoided.
I wonder if we can handle it in page_counter.c, so there will be a single
place where we do the check.
>
> Indeed, it looks like there is at least one bug where the root-level
> memory.stat shows non-root slab objects, but not root ones, whereas it
> shows all anon and cache pages, root or no root.
I'll take a look.
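
(Just to illustrate what a single page_counter.c-level check might look
like -- essentially what the for_each_nonroot_ancestor() macro in the RFC
further down ends up doing -- a hypothetical sketch, not a real patch:)

void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
{
	struct page_counter *c;

	/* skip the root counter, i.e. the one without a parent */
	for (c = counter; c && c->parent; c = c->parent)
		page_counter_cancel(c, nr_pages);
}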
On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
> The central try_charge() function charges recursively all the way up
> to and including the root.
Except for use_hierarchy=0 (which is the case here, as Richard
wrote). The reparenting is hence somewhat incompatible with
new_parent.use_hierarchy=0 :-/
> We should clean this up one way or another: either charge the root or
> don't, but do it consistently.
I agree this'd be good to unify. One upside of excluding root memcg from
charging is that users are spared from the charging overhead when memcg
tree is not created. (Actually, I thought that was the reason for this
exception.)
Michal
On Fri, Oct 16, 2020 at 04:05:21PM +0100, Richard Palethorpe <[email protected]> wrote:
> I'm don't know if that could happen without reparenting. I suppose if
> use_hierarchy=1 then actually this patch will result in root being
> overcharged, so perhaps it should also check for use_hierarchy?
Right, you'd need to distinguish whether the uncharged objcg was
originally (not)charged in the root memcg or it was only reparented to
it. (I originally considered only the genuine root objcgs.)
In this light, homogenous charing to root memcg looks really simpler but
I wonder what other surprises it brings about.
Michal
Hello,
Michal Koutný <[email protected]> writes:
> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
>> The central try_charge() function charges recursively all the way up
>> to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
>
Yes, and it also seems
new_parent.use_hierarchy=0 -> new_child.use_hierarchy=0
and
new_parent.use_hierarchy=0 -> new_child.use_hierarchy=1
are considered valid on cgroups v1. The kernel will also allow more
descendants under new_child.use_hierarchy=0, but sets
broken_hierarchy=1. However, this will not stop the stack trace occurring
(AFAICT) when the reparenting happens between two descendants.
>> We should clean this up one way or another: either charge the root or
>> don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created. (Actually, I thought that was the reason for this
> exception.)
>
> Michal
--
Thank you,
Richard.
On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
> > The central try_charge() function charges recursively all the way up
> > to and including the root.
> Except for use_hiearchy=0 (which is the case here as Richard
> wrote). The reparenting is hence somewhat incompatible with
> new_parent.use_hiearchy=0 :-/
>
> > We should clean this up one way or another: either charge the root or
> > don't, but do it consistently.
> I agree this'd be good to unify. One upside of excluding root memcg from
> charging is that users are spared from the charging overhead when memcg
> tree is not created. (Actually, I thought that was the reason for this
> exception.)
Yeah, I'm completely on the same page. Moving a process to the root memory
cgroup is currently a good way to estimate the memory cgroup overhead.
How about the patch below, which consistently avoids charging the root
memory cgroup? It seems like it doesn't add too many checks.
Thanks!
--
From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Mon, 19 Oct 2020 14:37:35 -0700
Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
Currently the root memory cgroup is never charged directly, but
if an ancestor cgroup is charged, the charge is propagated up to the
root memory cgroup. The root memory cgroup doesn't show the charge
to a user, neither it does allow to set any limits/protections.
So the information about the current charge is completely useless.
Avoiding to charge the root memory cgroup allows to:
1) simplify the model and the code, so, hopefully, fewer bugs will
be introduced in the future;
2) avoid unnecessary atomic operations, which are used to (un)charge
corresponding root page counters.
In the default hierarchy case or if use_hiearchy == true, it's very
straightforward: when the page counters tree is traversed to the root,
the root page counter (the one with parent == NULL), should be
skipped. To avoid multiple identical checks over the page counters
code, for_each_nonroot_ancestor() macro is introduced.
To handle the use_hierarchy == false case without adding custom
checks, let's make page counters of all non-root memory cgroup
direct ascendants of the corresponding root memory cgroup's page
counters. In this case for_each_nonroot_ancestor() will work correctly
as well.
Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
However, it's not based on page counters (refer to mem_cgroup_usage()).
Signed-off-by: Roman Gushchin <[email protected]>
---
mm/memcontrol.c | 21 ++++++++++++++++-----
mm/page_counter.c | 21 ++++++++++++---------
2 files changed, 28 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..34cac7522e74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
memcg->swappiness = mem_cgroup_swappiness(parent);
memcg->oom_kill_disable = parent->oom_kill_disable;
}
- if (parent && parent->use_hierarchy) {
+ if (!parent) {
+ /* root memory cgroup */
+ page_counter_init(&memcg->memory, NULL);
+ page_counter_init(&memcg->swap, NULL);
+ page_counter_init(&memcg->kmem, NULL);
+ page_counter_init(&memcg->tcpmem, NULL);
+ } else if (parent->use_hierarchy) {
memcg->use_hierarchy = true;
page_counter_init(&memcg->memory, &parent->memory);
page_counter_init(&memcg->swap, &parent->swap);
page_counter_init(&memcg->kmem, &parent->kmem);
page_counter_init(&memcg->tcpmem, &parent->tcpmem);
} else {
- page_counter_init(&memcg->memory, NULL);
- page_counter_init(&memcg->swap, NULL);
- page_counter_init(&memcg->kmem, NULL);
- page_counter_init(&memcg->tcpmem, NULL);
+ /*
+ * If use_hierarchy == false, consider all page counters direct
+ * descendants of the corresponding root level counters.
+ */
+ page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
+ page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
+ page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
+ page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
+
/*
* Deeper hierachy with use_hierarchy == false doesn't make
* much sense so let cgroup subsystem know about this
diff --git a/mm/page_counter.c b/mm/page_counter.c
index b24a60b28bb0..8901b184b9d5 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -13,6 +13,9 @@
#include <linux/bug.h>
#include <asm/page.h>
+#define for_each_nonroot_ancestor(c, counter) \
+ for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
+
static void propagate_protected_usage(struct page_counter *c,
unsigned long usage)
{
@@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
unsigned long low, min;
long delta;
- if (!c->parent)
- return;
-
min = READ_ONCE(c->min);
if (min || atomic_long_read(&c->min_usage)) {
protected = min(usage, min);
@@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
{
struct page_counter *c;
- for (c = counter; c; c = c->parent) {
+ for_each_nonroot_ancestor(c, counter) {
long new;
new = atomic_long_add_return(nr_pages, &c->usage);
@@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
{
struct page_counter *c;
- for (c = counter; c; c = c->parent) {
+ for_each_nonroot_ancestor(c, counter) {
long new;
/*
* Charge speculatively to avoid an expensive CAS. If
@@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
return true;
failed:
- for (c = counter; c != *fail; c = c->parent)
+ for_each_nonroot_ancestor(c, counter) {
+ if (c == *fail)
+ break;
page_counter_cancel(c, nr_pages);
+ }
return false;
}
@@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
{
struct page_counter *c;
- for (c = counter; c; c = c->parent)
+ for_each_nonroot_ancestor(c, counter)
page_counter_cancel(c, nr_pages);
}
@@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
WRITE_ONCE(counter->min, nr_pages);
- for (c = counter; c; c = c->parent)
+ for_each_nonroot_ancestor(c, counter)
propagate_protected_usage(c, atomic_long_read(&c->usage));
}
@@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
WRITE_ONCE(counter->low, nr_pages);
- for (c = counter; c; c = c->parent)
+ for_each_nonroot_ancestor(c, counter)
propagate_protected_usage(c, atomic_long_read(&c->usage));
}
--
2.26.2
Hello Roman,
Roman Gushchin <[email protected]> writes:
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>> On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
>> > The central try_charge() function charges recursively all the way up
>> > to and including the root.
>> Except for use_hiearchy=0 (which is the case here as Richard
>> wrote). The reparenting is hence somewhat incompatible with
>> new_parent.use_hiearchy=0 :-/
>>
>> > We should clean this up one way or another: either charge the root or
>> > don't, but do it consistently.
>> I agree this'd be good to unify. One upside of excluding root memcg from
>> charging is that users are spared from the charging overhead when memcg
>> tree is not created. (Actually, I thought that was the reason for this
>> exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
> be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
> corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/memcontrol.c | 21 ++++++++++++++++-----
> mm/page_counter.c | 21 ++++++++++++---------
> 2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> memcg->swappiness = mem_cgroup_swappiness(parent);
> memcg->oom_kill_disable = parent->oom_kill_disable;
> }
> - if (parent && parent->use_hierarchy) {
> + if (!parent) {
> + /* root memory cgroup */
> + page_counter_init(&memcg->memory, NULL);
> + page_counter_init(&memcg->swap, NULL);
> + page_counter_init(&memcg->kmem, NULL);
> + page_counter_init(&memcg->tcpmem, NULL);
> + } else if (parent->use_hierarchy) {
> memcg->use_hierarchy = true;
> page_counter_init(&memcg->memory, &parent->memory);
> page_counter_init(&memcg->swap, &parent->swap);
> page_counter_init(&memcg->kmem, &parent->kmem);
> page_counter_init(&memcg->tcpmem, &parent->tcpmem);
> } else {
> - page_counter_init(&memcg->memory, NULL);
> - page_counter_init(&memcg->swap, NULL);
> - page_counter_init(&memcg->kmem, NULL);
> - page_counter_init(&memcg->tcpmem, NULL);
> + /*
> + * If use_hierarchy == false, consider all page counters direct
> + * descendants of the corresponding root level counters.
> + */
> + page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> + page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> + page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> + page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
> /*
> * Deeper hierachy with use_hierarchy == false doesn't make
> * much sense so let cgroup subsystem know about this
Perhaps in this case, where the hierarchy is broken, objcgs should also
be reparented directly to root? Otherwise it will still be possible to
underflow the counter in a descendant of root which has use_hierarchy=0,
but also has children.
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..8901b184b9d5 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -13,6 +13,9 @@
> #include <linux/bug.h>
> #include <asm/page.h>
>
> +#define for_each_nonroot_ancestor(c, counter) \
> + for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
> +
> static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> @@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long low, min;
> long delta;
>
> - if (!c->parent)
> - return;
> -
> min = READ_ONCE(c->min);
> if (min || atomic_long_read(&c->min_usage)) {
> protected = min(usage, min);
> @@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent) {
> + for_each_nonroot_ancestor(c, counter) {
> long new;
>
> new = atomic_long_add_return(nr_pages, &c->usage);
> @@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent) {
> + for_each_nonroot_ancestor(c, counter) {
> long new;
> /*
> * Charge speculatively to avoid an expensive CAS. If
> @@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
> return true;
>
> failed:
> - for (c = counter; c != *fail; c = c->parent)
> + for_each_nonroot_ancestor(c, counter) {
> + if (c == *fail)
> + break;
> page_counter_cancel(c, nr_pages);
> + }
>
> return false;
> }
> @@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> page_counter_cancel(c, nr_pages);
> }
>
> @@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>
> WRITE_ONCE(counter->min, nr_pages);
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> propagate_protected_usage(c, atomic_long_read(&c->usage));
> }
>
> @@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
>
> WRITE_ONCE(counter->low, nr_pages);
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> propagate_protected_usage(c, atomic_long_read(&c->usage));
> }
--
Thank you,
Richard.
Hello,
Richard Palethorpe <[email protected]> writes:
> Hello Roman,
>
> Roman Gushchin <[email protected]> writes:
>
>> - page_counter_init(&memcg->memory, NULL);
>> - page_counter_init(&memcg->swap, NULL);
>> - page_counter_init(&memcg->kmem, NULL);
>> - page_counter_init(&memcg->tcpmem, NULL);
>> + /*
>> + * If use_hierarchy == false, consider all page counters direct
>> + * descendants of the corresponding root level counters.
>> + */
>> + page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
>> + page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
>> + page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
>> + page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
>> +
>> /*
>> * Deeper hierachy with use_hierarchy == false doesn't make
>> * much sense so let cgroup subsystem know about this
>
> Perhaps in this case, where the hierarchy is broken, objcgs should also
> be reparented directly to root? Otherwise it will still be possible to
> underflow the counter in a descendant of root which has use_hierarchy=0,
> but also has children.
Sorry, ignore me: parent_mem_cgroup() already selects root. So in the case
of a broken hierarchy, objcgs are reparented directly to root.
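
(For reference, parent_mem_cgroup() takes the parent from the css/cgroup
hierarchy -- roughly the helper below, shown for illustration -- so it
returns the root memcg here regardless of use_hierarchy, which is why the
reparented objcgs do end up under root.)

static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
	return mem_cgroup_from_css(memcg->css.parent);
}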
--
Thank you,
Richard.
Hi.
On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <[email protected]> wrote:
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
s/ancestor/descendant/
> The root memory cgroup doesn't show the charge to a user, neither it
> does allow to set any limits/protections.
An appealing claim, I'd like this to be true...
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
...and it almost is. But there are still exposed kmem and tcpmem counters.
> To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
If the assumptions behind this patch's idea were true, I think the
implementation would be simpler by merely (not)connecting the root
counters and keeping the traversal as is.
> direct ascendants of the corresponding root memory cgroup's page
s/asc/desc/ ;-)
Michal
On Tue, Oct 20, 2020 at 06:27:14PM +0200, Michal Koutny wrote:
> Hi.
>
> On Mon, Oct 19, 2020 at 03:28:45PM -0700, Roman Gushchin <[email protected]> wrote:
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> s/ancestor/descendant/
Oops, will fix, thanks!
>
> > The root memory cgroup doesn't show the charge to a user, neither it
> > does allow to set any limits/protections.
> An appealing claim, I'd like this to be true...
>
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> ...and it almost is. But there are still exposed kmem and tcpmem counters.
Hm, I wonder what they show, given that we never set sk->sk_memcg
to the root_mem_cgroup (see mem_cgroup_sk_alloc()) and we never charge
the root_mem_cgroup for !slab kmem allocations (see __memcg_kmem_charge_page()).
So yeah, it's quite a mess now, and it looks like it has been broken
in multiple places and for a while.
If we want these counters to function properly, then we should go in the opposite
direction and remove the special handling of the root memory cgroup in many places.
> > To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> If the assumptions behind this patch's idea were true, I think the
> implementation would be simpler by merely (not)connecting the root
> counters and keep the traversal as is.
We use some fields in root page counters to calculate protections:
see propagate_protected_usage().
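
(A heavily abridged sketch of that function, from memory of the 5.9-era
code -- exact guards differ -- just to show that a child's protected usage
is accumulated into the parent's counter, so the root-level counter's
fields are still used even though root itself is never limited:)

static void propagate_protected_usage(struct page_counter *c,
				      unsigned long usage)
{
	unsigned long protected, old_protected;
	long delta;

	if (!c->parent)
		return;

	protected = min(usage, READ_ONCE(c->min));
	old_protected = atomic_long_xchg(&c->min_usage, protected);
	delta = protected - old_protected;
	if (delta)
		atomic_long_add(delta, &c->parent->children_min_usage);

	/* ... and the same for the 'low' protection ... */
}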
Thanks!
Hello Roman,
Roman Gushchin <[email protected]> writes:
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
> be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
> corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> mm/memcontrol.c | 21 ++++++++++++++++-----
> mm/page_counter.c | 21 ++++++++++++---------
> 2 files changed, 28 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2636f8bad908..34cac7522e74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5339,17 +5339,28 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> memcg->swappiness = mem_cgroup_swappiness(parent);
> memcg->oom_kill_disable = parent->oom_kill_disable;
> }
> - if (parent && parent->use_hierarchy) {
> + if (!parent) {
> + /* root memory cgroup */
> + page_counter_init(&memcg->memory, NULL);
> + page_counter_init(&memcg->swap, NULL);
> + page_counter_init(&memcg->kmem, NULL);
> + page_counter_init(&memcg->tcpmem, NULL);
> + } else if (parent->use_hierarchy) {
> memcg->use_hierarchy = true;
> page_counter_init(&memcg->memory, &parent->memory);
> page_counter_init(&memcg->swap, &parent->swap);
> page_counter_init(&memcg->kmem, &parent->kmem);
> page_counter_init(&memcg->tcpmem, &parent->tcpmem);
> } else {
> - page_counter_init(&memcg->memory, NULL);
> - page_counter_init(&memcg->swap, NULL);
> - page_counter_init(&memcg->kmem, NULL);
> - page_counter_init(&memcg->tcpmem, NULL);
> + /*
> + * If use_hierarchy == false, consider all page counters direct
> + * descendants of the corresponding root level counters.
> + */
> + page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
> + page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
> + page_counter_init(&memcg->kmem, &root_mem_cgroup->kmem);
> + page_counter_init(&memcg->tcpmem, &root_mem_cgroup->tcpmem);
> +
> /*
> * Deeper hierachy with use_hierarchy == false doesn't make
> * much sense so let cgroup subsystem know about this
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index b24a60b28bb0..8901b184b9d5 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -13,6 +13,9 @@
> #include <linux/bug.h>
> #include <asm/page.h>
>
> +#define for_each_nonroot_ancestor(c, counter) \
> + for ((c) = (counter); ((c) && ((c)->parent)); (c) = (c)->parent)
> +
> static void propagate_protected_usage(struct page_counter *c,
> unsigned long usage)
> {
> @@ -20,9 +23,6 @@ static void propagate_protected_usage(struct page_counter *c,
> unsigned long low, min;
> long delta;
>
> - if (!c->parent)
> - return;
> -
> min = READ_ONCE(c->min);
> if (min || atomic_long_read(&c->min_usage)) {
> protected = min(usage, min);
> @@ -68,7 +68,7 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent) {
> + for_each_nonroot_ancestor(c, counter) {
> long new;
>
> new = atomic_long_add_return(nr_pages, &c->usage);
> @@ -97,7 +97,7 @@ bool page_counter_try_charge(struct page_counter *counter,
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent) {
> + for_each_nonroot_ancestor(c, counter) {
> long new;
> /*
> * Charge speculatively to avoid an expensive CAS. If
> @@ -137,8 +137,11 @@ bool page_counter_try_charge(struct page_counter *counter,
> return true;
>
> failed:
> - for (c = counter; c != *fail; c = c->parent)
> + for_each_nonroot_ancestor(c, counter) {
> + if (c == *fail)
> + break;
> page_counter_cancel(c, nr_pages);
> + }
>
> return false;
> }
> @@ -152,7 +155,7 @@ void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages)
> {
> struct page_counter *c;
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> page_counter_cancel(c, nr_pages);
> }
>
> @@ -211,7 +214,7 @@ void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages)
>
> WRITE_ONCE(counter->min, nr_pages);
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> propagate_protected_usage(c, atomic_long_read(&c->usage));
> }
>
> @@ -228,7 +231,7 @@ void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
>
> WRITE_ONCE(counter->low, nr_pages);
>
> - for (c = counter; c; c = c->parent)
> + for_each_nonroot_ancestor(c, counter)
> propagate_protected_usage(c, atomic_long_read(&c->usage));
> }
This certainly prevents the counter underflow reported by madvise06 and
makes my patch redundant. It is perhaps significantly more intrusive
though, so maybe not suitable for the 5.9 stable branch?
Tested-by: Richard Palethorpe <[email protected]>
--
Thank you,
Richard.
On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <[email protected]> wrote:
>
> On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
> > > The central try_charge() function charges recursively all the way up
> > > to and including the root.
> > Except for use_hiearchy=0 (which is the case here as Richard
> > wrote). The reparenting is hence somewhat incompatible with
> > new_parent.use_hiearchy=0 :-/
> >
> > > We should clean this up one way or another: either charge the root or
> > > don't, but do it consistently.
> > I agree this'd be good to unify. One upside of excluding root memcg from
> > charging is that users are spared from the charging overhead when memcg
> > tree is not created. (Actually, I thought that was the reason for this
> > exception.)
>
> Yeah, I'm completely on the same page. Moving a process to the root memory
> cgroup is currently a good way to estimate the memory cgroup overhead.
>
> How about the patch below, which consistently avoids charging the root
> memory cgroup? It seems like it doesn't add too many checks.
>
> Thanks!
>
> --
>
> From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Mon, 19 Oct 2020 14:37:35 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
>
> Currently the root memory cgroup is never charged directly, but
> if an ancestor cgroup is charged, the charge is propagated up to the
> root memory cgroup. The root memory cgroup doesn't show the charge
> to a user, neither it does allow to set any limits/protections.
> So the information about the current charge is completely useless.
>
> Avoiding to charge the root memory cgroup allows to:
> 1) simplify the model and the code, so, hopefully, fewer bugs will
> be introduced in the future;
> 2) avoid unnecessary atomic operations, which are used to (un)charge
> corresponding root page counters.
>
> In the default hierarchy case or if use_hiearchy == true, it's very
> straightforward: when the page counters tree is traversed to the root,
> the root page counter (the one with parent == NULL), should be
> skipped. To avoid multiple identical checks over the page counters
> code, for_each_nonroot_ancestor() macro is introduced.
>
> To handle the use_hierarchy == false case without adding custom
> checks, let's make page counters of all non-root memory cgroup
> direct ascendants of the corresponding root memory cgroup's page
> counters. In this case for_each_nonroot_ancestor() will work correctly
> as well.
>
> Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> However, it's not based on page counters (refer to mem_cgroup_usage()).
>
> Signed-off-by: Roman Gushchin <[email protected]>
This patch is only doing the page counter part of the cleanup (i.e. to
not update root's page counters when a descendant's page counter is
updated) but not the stats part.
For the user memory, we do update the stats for the root memcg and do
set page->mem_cgroup = root_mem_cgroup. However, this is not done for
kmem/obj. I thought this is what Johannes was asking for in the
cleanup.
On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> If we want these counter to function properly, then we should go into the opposite
> direction and remove the special handling of the root memory cgroup in many places.
I suspect this is also by far the most robust solution from a code and
maintenance POV.
I don't recall the page counter at the root level having been a
concern in recent years, even though it's widely used in production
environments. It's lockless and cache compact. It's also per-cpu
batched, which means it isn't actually part of the memcg hotpath.
On Tue, Oct 20, 2020 at 09:55:38AM -0700, Shakeel Butt wrote:
> On Mon, Oct 19, 2020 at 3:28 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Fri, Oct 16, 2020 at 07:15:02PM +0200, Michal Koutny wrote:
> > > On Fri, Oct 16, 2020 at 10:53:08AM -0400, Johannes Weiner <[email protected]> wrote:
> > > > The central try_charge() function charges recursively all the way up
> > > > to and including the root.
> > > Except for use_hiearchy=0 (which is the case here as Richard
> > > wrote). The reparenting is hence somewhat incompatible with
> > > new_parent.use_hiearchy=0 :-/
> > >
> > > > We should clean this up one way or another: either charge the root or
> > > > don't, but do it consistently.
> > > I agree this'd be good to unify. One upside of excluding root memcg from
> > > charging is that users are spared from the charging overhead when memcg
> > > tree is not created. (Actually, I thought that was the reason for this
> > > exception.)
> >
> > Yeah, I'm completely on the same page. Moving a process to the root memory
> > cgroup is currently a good way to estimate the memory cgroup overhead.
> >
> > How about the patch below, which consistently avoids charging the root
> > memory cgroup? It seems like it doesn't add too many checks.
> >
> > Thanks!
> >
> > --
> >
> > From f50ea74d8f118b9121da3754acdde630ddc060a7 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <[email protected]>
> > Date: Mon, 19 Oct 2020 14:37:35 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not charge the root memory cgroup
> >
> > Currently the root memory cgroup is never charged directly, but
> > if an ancestor cgroup is charged, the charge is propagated up to the
> > root memory cgroup. The root memory cgroup doesn't show the charge
> > to a user, neither it does allow to set any limits/protections.
> > So the information about the current charge is completely useless.
> >
> > Avoiding to charge the root memory cgroup allows to:
> > 1) simplify the model and the code, so, hopefully, fewer bugs will
> > be introduced in the future;
> > 2) avoid unnecessary atomic operations, which are used to (un)charge
> > corresponding root page counters.
> >
> > In the default hierarchy case or if use_hiearchy == true, it's very
> > straightforward: when the page counters tree is traversed to the root,
> > the root page counter (the one with parent == NULL), should be
> > skipped. To avoid multiple identical checks over the page counters
> > code, for_each_nonroot_ancestor() macro is introduced.
> >
> > To handle the use_hierarchy == false case without adding custom
> > checks, let's make page counters of all non-root memory cgroup
> > direct ascendants of the corresponding root memory cgroup's page
> > counters. In this case for_each_nonroot_ancestor() will work correctly
> > as well.
> >
> > Please, note, that cgroup v1 provides root level memory.usage_in_bytes.
> > However, it's not based on page counters (refer to mem_cgroup_usage()).
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
>
> This patch is only doing the page counter part of the cleanup (i.e. to
> not update root's page counters when descendent's page counter is
> updated) but not the stats part.
>
> For the user memory, we do update the stats for the root memcg and do
> set page->mem_cgroup = root_mem_cgroup. However this is not done for
> the kmem/obj. I thought this is what Johannes was asking for the
> cleanup.
Yes, it's not the whole story, of course.
Actually, I missed that we do export root kmem and tcpmem counters
on cgroup v1 (thanks to Michal for pointing it out!). If we want them to
function properly, we have to go in the opposite direction and start
charging the root cgroup for all kinds of kernel memory allocations.
We also have the same problem with the root MEMCG_SOCK stats, which seem
to be broken now.
I'll prepare a patch.
Thanks!
On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > If we want these counter to function properly, then we should go into the opposite
> > direction and remove the special handling of the root memory cgroup in many places.
>
> I suspect this is also by far the most robust solution from a code and
> maintenance POV.
>
> I don't recall the page counter at the root level having been a
> concern in recent years, even though it's widely used in production
> environments. It's lockless and cache compact. It's also per-cpu
> batched, which means it isn't actually part of the memcg hotpath.
I agree.
Here is my first attempt. Comments are welcome!
It doesn't solve the original problem, though (use_hierarchy == false and
objcg reparenting); I'll send a separate patch for that.
Thanks!
--
From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
From: Roman Gushchin <[email protected]>
Date: Tue, 20 Oct 2020 18:05:43 -0700
Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
specially
Currently the root memory cgroup is treated in a special way:
it's not charged and uncharged directly (only indirectly via its
descendants), and processes belonging to the root memory cgroup are exempt
from the kernel and socket memory accounting.
At the same time some of root level statistics and data are available
to a user:
- cgroup v2: memory.stat
- cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
Historically the reason for the special treatment was avoiding extra
performance cost, however that is unlikely to be a good reason now:
over the years there has been a significant improvement in the performance
of the memory cgroup code. Also, on a modern system actively using
cgroups (e.g. managed by systemd) there are usually no (significant)
processes in the root memory cgroup.
The special treatment of the root memory cgroups creates a number of
issues visible to a user:
1) slab stats on the root level do not include the slab memory
consumed by processes in the root memory cgroup
2) non-slab kernel memory consumed by processes in the root memory cgroup
is not included into memory.kmem.usage_in_bytes
3) socket memory consumed by processes in the root memory cgroup
is not included into memory.kmem.tcp.usage_in_bytes
It complicates the code and increases the risk of new bugs.
This patch removes a number of exceptions related to the handling of
the root memory cgroup. With this patch applied the root memory cgroup
is treated uniformly to other cgroups in the following cases:
1) the root memory cgroup is charged and uncharged directly, try_charge()
and cancel_charge() do not return immediately if the root memory
cgroup is passed. uncharge_batch() and __mem_cgroup_clear_mc()
do not handle the root memory cgroup specially.
2) per-memcg slab statistics are gathered for the root memory cgroup
3) shrinkers infra treats the root memory cgroup as any other memory
cgroup
4) non-slab kernel memory accounting doesn't exclude pages allocated
by processes belonging to the root memory cgroup
5) if a socket is opened by a process in the root memory cgroup,
the socket memory is accounted
6) root cgroup is charged for the used swap memory.
Signed-off-by: Roman Gushchin <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 3 +-
mm/memcontrol.c | 82 ++++++++++++++------------------------
mm/vmscan.c | 9 +----
3 files changed, 31 insertions(+), 63 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e391e3c56de5..d3653eb5d1b2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -416,8 +416,7 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
{
/*
- * The root memcg doesn't account charges, and doesn't support
- * protection.
+ * The root memcg doesn't support memory protection.
*/
return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2636f8bad908..a8bdca0f58f4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -438,9 +438,6 @@ static void memcg_free_shrinker_maps(struct mem_cgroup *memcg)
struct memcg_shrinker_map *map;
int nid;
- if (mem_cgroup_is_root(memcg))
- return;
-
for_each_node(nid) {
pn = mem_cgroup_nodeinfo(memcg, nid);
map = rcu_dereference_protected(pn->shrinker_map, true);
@@ -455,9 +452,6 @@ static int memcg_alloc_shrinker_maps(struct mem_cgroup *memcg)
struct memcg_shrinker_map *map;
int nid, size, ret = 0;
- if (mem_cgroup_is_root(memcg))
- return 0;
-
mutex_lock(&memcg_shrinker_map_mutex);
size = memcg_shrinker_map_size;
for_each_node(nid) {
@@ -489,8 +483,6 @@ int memcg_expand_shrinker_maps(int new_id)
goto unlock;
for_each_mem_cgroup(memcg) {
- if (mem_cgroup_is_root(memcg))
- continue;
ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
if (ret) {
mem_cgroup_iter_break(NULL, memcg);
@@ -506,7 +498,7 @@ int memcg_expand_shrinker_maps(int new_id)
void memcg_set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id)
{
- if (shrinker_id >= 0 && memcg && !mem_cgroup_is_root(memcg)) {
+ if (shrinker_id >= 0 && memcg) {
struct memcg_shrinker_map *map;
rcu_read_lock();
@@ -868,7 +860,7 @@ void __mod_lruvec_slab_state(void *p, enum node_stat_item idx, int val)
memcg = mem_cgroup_from_obj(p);
/* Untracked pages have no memcg, no lruvec. Update only the node */
- if (!memcg || memcg == root_mem_cgroup) {
+ if (!memcg) {
__mod_node_page_state(pgdat, idx, val);
} else {
lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -2439,8 +2431,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask, true);
psi_memstall_leave(&pflags);
- } while ((memcg = parent_mem_cgroup(memcg)) &&
- !mem_cgroup_is_root(memcg));
+ } while ((memcg = parent_mem_cgroup(memcg)));
return nr_reclaimed;
}
@@ -2532,8 +2523,7 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
overage = calculate_overage(page_counter_read(&memcg->memory),
READ_ONCE(memcg->memory.high));
max_overage = max(overage, max_overage);
- } while ((memcg = parent_mem_cgroup(memcg)) &&
- !mem_cgroup_is_root(memcg));
+ } while ((memcg = parent_mem_cgroup(memcg)));
return max_overage;
}
@@ -2548,8 +2538,7 @@ static u64 swap_find_max_overage(struct mem_cgroup *memcg)
if (overage)
memcg_memory_event(memcg, MEMCG_SWAP_HIGH);
max_overage = max(overage, max_overage);
- } while ((memcg = parent_mem_cgroup(memcg)) &&
- !mem_cgroup_is_root(memcg));
+ } while ((memcg = parent_mem_cgroup(memcg)));
return max_overage;
}
@@ -2686,8 +2675,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool drained = false;
unsigned long pflags;
- if (mem_cgroup_is_root(memcg))
- return 0;
retry:
if (consume_stock(memcg, nr_pages))
return 0;
@@ -2873,9 +2860,6 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
{
- if (mem_cgroup_is_root(memcg))
- return;
-
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -2978,7 +2962,7 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
else
memcg = mem_cgroup_from_task(current);
- for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
objcg = rcu_dereference(memcg->objcg);
if (objcg && obj_cgroup_tryget(objcg))
break;
@@ -3096,15 +3080,16 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
int ret = 0;
memcg = get_mem_cgroup_from_current();
- if (memcg && !mem_cgroup_is_root(memcg)) {
- ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
- if (!ret) {
- page->mem_cgroup = memcg;
- __SetPageKmemcg(page);
- return 0;
- }
- css_put(&memcg->css);
+ if (!memcg)
+ return 0;
+
+ ret = __memcg_kmem_charge(memcg, gfp, 1 << order);
+ if (!ret) {
+ page->mem_cgroup = memcg;
+ __SetPageKmemcg(page);
+ return 0;
}
+ css_put(&memcg->css);
return ret;
}
@@ -3121,7 +3106,6 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
if (!memcg)
return;
- VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
__memcg_kmem_uncharge(memcg, nr_pages);
page->mem_cgroup = NULL;
css_put(&memcg->css);
@@ -5913,8 +5897,7 @@ static void __mem_cgroup_clear_mc(void)
/* we must fixup refcnts and charges */
if (mc.moved_swap) {
/* uncharge swap account from the old cgroup */
- if (!mem_cgroup_is_root(mc.from))
- page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
+ page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
mem_cgroup_id_put_many(mc.from, mc.moved_swap);
@@ -5922,8 +5905,7 @@ static void __mem_cgroup_clear_mc(void)
* we charged both to->memory and to->memsw, so we
* should uncharge to->memory.
*/
- if (!mem_cgroup_is_root(mc.to))
- page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+ page_counter_uncharge(&mc.to->memory, mc.moved_swap);
mc.moved_swap = 0;
}
@@ -6824,14 +6806,12 @@ static void uncharge_batch(const struct uncharge_gather *ug)
{
unsigned long flags;
- if (!mem_cgroup_is_root(ug->memcg)) {
- page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
- if (do_memsw_account())
- page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
- page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
- memcg_oom_recover(ug->memcg);
- }
+ page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
+ if (do_memsw_account())
+ page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
+ page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
+ memcg_oom_recover(ug->memcg);
local_irq_save(flags);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
@@ -7013,8 +6993,6 @@ void mem_cgroup_sk_alloc(struct sock *sk)
rcu_read_lock();
memcg = mem_cgroup_from_task(current);
- if (memcg == root_mem_cgroup)
- goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
goto out;
if (css_tryget(&memcg->css))
@@ -7195,12 +7173,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
page->mem_cgroup = NULL;
- if (!mem_cgroup_is_root(memcg))
- page_counter_uncharge(&memcg->memory, nr_entries);
+ page_counter_uncharge(&memcg->memory, nr_entries);
if (!cgroup_memory_noswap && memcg != swap_memcg) {
- if (!mem_cgroup_is_root(swap_memcg))
- page_counter_charge(&swap_memcg->memsw, nr_entries);
+ page_counter_charge(&swap_memcg->memsw, nr_entries);
page_counter_uncharge(&memcg->memsw, nr_entries);
}
@@ -7249,7 +7225,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
memcg = mem_cgroup_id_get_online(memcg);
- if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
+ if (!cgroup_memory_noswap &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -7281,7 +7257,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
- if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
+ if (!cgroup_memory_noswap) {
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->swap, nr_pages);
else
@@ -7299,7 +7275,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return nr_swap_pages;
- for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
+ for (; memcg; memcg = parent_mem_cgroup(memcg))
nr_swap_pages = min_t(long, nr_swap_pages,
READ_ONCE(memcg->swap.max) -
page_counter_read(&memcg->swap));
@@ -7321,7 +7297,7 @@ bool mem_cgroup_swap_full(struct page *page)
if (!memcg)
return false;
- for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
unsigned long usage = page_counter_read(&memcg->swap);
if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d848c76e035a..fb6b3cbe0764 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -651,14 +651,7 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
unsigned long ret, freed = 0;
struct shrinker *shrinker;
- /*
- * The root memcg might be allocated even though memcg is disabled
- * via "cgroup_disable=memory" boot parameter. This could make
- * mem_cgroup_is_root() return false, then just run memcg slab
- * shrink, but skip global shrink. This may result in premature
- * oom.
- */
- if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
+ if (!mem_cgroup_disabled())
return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
if (!down_read_trylock(&shrinker_rwsem))
--
2.26.2
On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > If we want these counter to function properly, then we should go into the opposite
> > > direction and remove the special handling of the root memory cgroup in many places.
> >
> > I suspect this is also by far the most robust solution from a code and
> > maintenance POV.
> >
> > I don't recall the page counter at the root level having been a
> > concern in recent years, even though it's widely used in production
> > environments. It's lockless and cache compact. It's also per-cpu
> > batched, which means it isn't actually part of the memcg hotpath.
>
>
> I agree.
>
> Here is my first attempt. Comments are welcome!
>
> It doesn't solve the original problem though (use_hierarchy == false and
> objcg reparenting), I'll send a separate patch for that.
>
> Thanks!
>
> --
>
> From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> From: Roman Gushchin <[email protected]>
> Date: Tue, 20 Oct 2020 18:05:43 -0700
> Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> specially
>
> Currently the root memory cgroup is treated in a special way:
> it's not charged and uncharged directly (only indirectly with their
> descendants), processes belonging to the root memory cgroup are exempt
> from the kernel- and the socket memory accounting.
>
> At the same time some of root level statistics and data are available
> to a user:
> - cgroup v2: memory.stat
> - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
>
> Historically the reason for a special treatment was an avoidance
> of extra performance cost, however now it's unlikely a good reason:
> over years there was a significant improvement in the performance
> of the memory cgroup code. Also on a modern system actively using
> cgroups (e.g. managed by systemd) there are usually no (significant)
> processes in the root memory cgroup.
>
> The special treatment of the root memory cgroups creates a number of
> issues visible to a user:
> 1) slab stats on the root level do not include the slab memory
> consumed by processes in the root memory cgroup
> 2) non-slab kernel memory consumed by processes in the root memory cgroup
> is not included into memory.kmem.usage_in_bytes
> 3) socket memory consumed by processes in the root memory cgroup
> is not included into memory.kmem.tcp.usage_in_bytes
>
> It complicates the code and increases a risk of new bugs.
>
> This patch removes a number of exceptions related to the handling of
> the root memory cgroup. With this patch applied the root memory cgroup
> is treated uniformly to other cgroups in the following cases:
> 1) root memory cgroup is charged and uncharged directly, try_charge()
> and cancel_charge() do not return immediately if the root memory
> cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> do not handle the root memory cgroup specially.
> 2) per-memcg slab statistics is gathered for the root memory cgroup
> 3) shrinkers infra treats the root memory cgroup as any other memory
> cgroup
> 4) non-slab kernel memory accounting doesn't exclude pages allocated
> by processes belonging to the root memory cgroup
> 5) if a socket is opened by a process in the root memory cgroup,
> the socket memory is accounted
> 6) root cgroup is charged for the used swap memory.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> Suggested-by: Johannes Weiner <[email protected]>
This looks great.
The try_charge(), cancel_charge() etc. paths are relatively
straightforward and look correct to me.
The swap counters too.
Slab is a bit trickier, but it also looks correct to me.
I'm having some trouble with the shrinkers. Currently, tracked objects
allocated in non-root cgroups live in that cgroup. Tracked objects in
the root cgroup, as well as untracked objects, live in a global pool.
When reclaim iterates all memcgs and calls shrink_slab(), we special
case the root_mem_cgroup and redirect to the global pool.
After your patch, tracked objects allocated in the root cgroup
actually live in the root cgroup. Removing the shrinker special case
is correct in order to shrink those - but it also removes the call to
shrink the global pool of untracked allocation classes.
I think we need to restore the double call to shrink_slab() we had
prior to this:
commit aeed1d325d429ac9699c4bf62d17156d60905519
Author: Vladimir Davydov <[email protected]>
Date: Fri Aug 17 15:48:17 2018 -0700
mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
The patch makes shrink_slab() be called for root_mem_cgroup in the same
way as it's called for the rest of cgroups. This simplifies the logic
and improves the readability.
where we iterate through all cgroups, including the root, to reclaim
objects accounted to those respective groups; and then a call to scan
the global pool of untracked objects in that numa node.
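To make the intended structure concrete, here is a minimal sketch of that
"double call" as seen from the reclaim side. It is illustrative only: the
helper names follow mm/vmscan.c, but shrink_node_slabs() and its body are
assumptions made for the example, not the actual pre-aeed1d325d42 code.
static void shrink_node_slabs(pg_data_t *pgdat, struct scan_control *sc)
{
        struct mem_cgroup *root = sc->target_mem_cgroup;
        struct mem_cgroup *memcg;

        /* Pass 1: tracked objects accounted to each cgroup, root included. */
        memcg = mem_cgroup_iter(root, NULL, NULL);
        do {
                shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
        } while ((memcg = mem_cgroup_iter(root, memcg, NULL)));

        /* Pass 2: one memcg-less call per node for the untracked objects. */
        shrink_slab(sc->gfp_mask, pgdat->node_id, NULL, sc->priority);
}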
For ease of review/verification, it could be helpful to split the
patch and remove the root exception case-by-case (not callsite by
callsite, but e.g. the swap counter, the memory counter etc.).
Hi,
I am sorry, but I was quite busy at the time this was discussed.
Now that I have finally got to this thread, I am wondering whether this
or the other root cgroup exceptions have had any follow-up.
I have seen that the issue is fixed in Linus' tree by 8de15e920dc8 ("mm:
memcg: link page counters to root if use_hierarchy is false"). Thanks
for taking care of that! I do agree that this approach is the most
reasonable immediate fix.
Btw. we have been carrying a warning about non-hierarchical hierarchies
for years in our distribution kernels and haven't heard of any actual
bug reports (except for LTP-driven ones). So we might be really close to
simply dropping this functionality completely. This would simplify the code
and prevent future surprises.
Thanks!
--
Michal Hocko
SUSE Labs
On Tue, Nov 03, 2020 at 02:22:21PM +0100, Michal Hocko wrote:
> Hi,
> I am sorry but I was quite busy at the time this has been discussed.
> Now that I got to this thread finally I am wondering whether this
> resp. other root cgroup exceptions have any follow up.
I'll address the feedback I've got from Johannes and will post an updated
version soon.
>
> I have seen the issue is fixed in Linus tree by 8de15e920dc8 ("mm:
> memcg: link page counters to root if use_hierarchy is false"). Thanks
> for taking care of that! I do agree that this approach is the most
> reasonable immediate fix.
Thanks!
>
> Btw. we have been carrying a warning about non hierarchical hierarchies
> for years in our distribution kernels and haven't heard of any actual
> bug reports (except for LTP driven ones). So we might be really close to
> simply drop this functionality completely. This would simplify the code
> and prevent from future surprises.
Just sent an RFC patchset. Your feedback will be greatly appreciated!
Thanks!
On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > If we want these counter to function properly, then we should go into the opposite
> > > > direction and remove the special handling of the root memory cgroup in many places.
> > >
> > > I suspect this is also by far the most robust solution from a code and
> > > maintenance POV.
> > >
> > > I don't recall the page counter at the root level having been a
> > > concern in recent years, even though it's widely used in production
> > > environments. It's lockless and cache compact. It's also per-cpu
> > > batched, which means it isn't actually part of the memcg hotpath.
> >
> >
> > I agree.
> >
> > Here is my first attempt. Comments are welcome!
> >
> > It doesn't solve the original problem though (use_hierarchy == false and
> > objcg reparenting), I'll send a separate patch for that.
> >
> > Thanks!
> >
> > --
> >
> > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > From: Roman Gushchin <[email protected]>
> > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > specially
> >
> > Currently the root memory cgroup is treated in a special way:
> > it's not charged and uncharged directly (only indirectly with their
> > descendants), processes belonging to the root memory cgroup are exempt
> > from the kernel- and the socket memory accounting.
> >
> > At the same time some of root level statistics and data are available
> > to a user:
> > - cgroup v2: memory.stat
> > - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> >
> > Historically the reason for a special treatment was an avoidance
> > of extra performance cost, however now it's unlikely a good reason:
> > over years there was a significant improvement in the performance
> > of the memory cgroup code. Also on a modern system actively using
> > cgroups (e.g. managed by systemd) there are usually no (significant)
> > processes in the root memory cgroup.
> >
> > The special treatment of the root memory cgroups creates a number of
> > issues visible to a user:
> > 1) slab stats on the root level do not include the slab memory
> > consumed by processes in the root memory cgroup
> > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > is not included into memory.kmem.usage_in_bytes
> > 3) socket memory consumed by processes in the root memory cgroup
> > is not included into memory.kmem.tcp.usage_in_bytes
> >
> > It complicates the code and increases a risk of new bugs.
> >
> > This patch removes a number of exceptions related to the handling of
> > the root memory cgroup. With this patch applied the root memory cgroup
> > is treated uniformly to other cgroups in the following cases:
> > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > and cancel_charge() do not return immediately if the root memory
> > cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > do not handle the root memory cgroup specially.
> > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > 3) shrinkers infra treats the root memory cgroup as any other memory
> > cgroup
> > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > by processes belonging to the root memory cgroup
> > 5) if a socket is opened by a process in the root memory cgroup,
> > the socket memory is accounted
> > 6) root cgroup is charged for the used swap memory.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Suggested-by: Johannes Weiner <[email protected]>
>
> This looks great.
>
> The try_charge(), cancel_charge() etc. paths are relatively
> straight-forward and look correct to me.
>
> The swap counters too.
>
> Slab is a bit trickier, but it also looks correct to me.
>
> I'm having some trouble with the shrinkers. Currently, tracked objects
> allocated in non-root cgroups live in that cgroup. Tracked objects in
> the root cgroup, as well as untracked objects, live in a global pool.
> When reclaim iterates all memcgs and calls shrink_slab(), we special
> case the root_mem_cgroup and redirect to the global pool.
>
> After your patch we have tracked objects allocated in the root cgroup
> actually live in the root cgroup. Removing the shrinker special case
> is correct in order to shrink those - but it removes the call to
> shrink the global pool of untracked allocation classes.
>
> I think we need to restore the double call to shrink_slab() we had
> prior to this:
>
> commit aeed1d325d429ac9699c4bf62d17156d60905519
> Author: Vladimir Davydov <[email protected]>
> Date: Fri Aug 17 15:48:17 2018 -0700
>
> mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
>
> The patch makes shrink_slab() be called for root_mem_cgroup in the same
> way as it's called for the rest of cgroups. This simplifies the logic
> and improves the readability.
>
> where we iterate through all cgroups, including the root, to reclaim
> objects accounted to those respective groups; and then a call to scan
> the global pool of untracked objects in that numa node.
I agree, thank you for pointing at this commit.
>
> For ease of review/verification, it could be helpful to split the
> patch and remove the root exception case-by-case (not callsite by
> callsite, but e.g. the swap counter, the memory counter etc.).
Sorry for the long pause, here's an update. I've split the patch,
fixed a couple of issues and was almost ready to send it upstream,
but then I noticed that on cgroup v1 the kmem and memsw counters
sometimes head into negative territory and generate a warning
in dmesg. It happens for a short time at early stages
of the system uptime. I haven't seen it happening with the memory counter.
My investigation showed that the reason is that the result of a
cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
early stages. Depending on the return value we charge or skip the kmem
counter and also handle the swap/memsw counter differently.
The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
can change at any particular moment. So I don't see how to make all of the
root's counters consistent without tracking them all, no matter which cgroup
version is used, which is obviously overkill and will add overhead that can
hardly be justified.
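For illustration only, the asymmetry looks roughly like this (these are
simplified sketches, not the real mm/memcontrol.c functions): the charge
side and the uncharge side each consult the flag at their own point in
time, so a charge skipped under one answer can be "returned" under the
other.
static void kmem_charge_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        /* On the default hierarchy (v2) the separate kmem counter isn't used. */
        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
                page_counter_charge(&memcg->kmem, nr_pages);
}
static void kmem_uncharge_sketch(struct mem_cgroup *memcg, unsigned int nr_pages)
{
        /*
         * If the answer was "default hierarchy" at charge time, but a v1
         * hierarchy has been mounted since, this runs for pages that never
         * hit the kmem counter and can drive it negative.
         */
        if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
                page_counter_uncharge(&memcg->kmem, nr_pages);
}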
I'd appreciate any ideas, but I don't see a good path forward here
(except fixing the particular issue with the root's slab stats with
Muchun's patch).
Thanks!
On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <[email protected]> wrote:
>
> On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > >
> > > > I suspect this is also by far the most robust solution from a code and
> > > > maintenance POV.
> > > >
> > > > I don't recall the page counter at the root level having been a
> > > > concern in recent years, even though it's widely used in production
> > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > batched, which means it isn't actually part of the memcg hotpath.
> > >
> > >
> > > I agree.
> > >
> > > Here is my first attempt. Comments are welcome!
> > >
> > > It doesn't solve the original problem though (use_hierarchy == false and
> > > objcg reparenting), I'll send a separate patch for that.
> > >
> > > Thanks!
> > >
> > > --
> > >
> > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > From: Roman Gushchin <[email protected]>
> > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > > specially
> > >
> > > Currently the root memory cgroup is treated in a special way:
> > > it's not charged and uncharged directly (only indirectly with their
> > > descendants), processes belonging to the root memory cgroup are exempt
> > > from the kernel- and the socket memory accounting.
> > >
> > > At the same time some of root level statistics and data are available
> > > to a user:
> > > - cgroup v2: memory.stat
> > > - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > > memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > >
> > > Historically the reason for a special treatment was an avoidance
> > > of extra performance cost, however now it's unlikely a good reason:
> > > over years there was a significant improvement in the performance
> > > of the memory cgroup code. Also on a modern system actively using
> > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > processes in the root memory cgroup.
> > >
> > > The special treatment of the root memory cgroups creates a number of
> > > issues visible to a user:
> > > 1) slab stats on the root level do not include the slab memory
> > > consumed by processes in the root memory cgroup
> > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > > is not included into memory.kmem.usage_in_bytes
> > > 3) socket memory consumed by processes in the root memory cgroup
> > > is not included into memory.kmem.tcp.usage_in_bytes
> > >
> > > It complicates the code and increases a risk of new bugs.
> > >
> > > This patch removes a number of exceptions related to the handling of
> > > the root memory cgroup. With this patch applied the root memory cgroup
> > > is treated uniformly to other cgroups in the following cases:
> > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > and cancel_charge() do not return immediately if the root memory
> > > cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > do not handle the root memory cgroup specially.
> > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > > cgroup
> > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > > by processes belonging to the root memory cgroup
> > > 5) if a socket is opened by a process in the root memory cgroup,
> > > the socket memory is accounted
> > > 6) root cgroup is charged for the used swap memory.
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > Suggested-by: Johannes Weiner <[email protected]>
> >
> > This looks great.
> >
> > The try_charge(), cancel_charge() etc. paths are relatively
> > straight-forward and look correct to me.
> >
> > The swap counters too.
> >
> > Slab is a bit trickier, but it also looks correct to me.
> >
> > I'm having some trouble with the shrinkers. Currently, tracked objects
> > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > the root cgroup, as well as untracked objects, live in a global pool.
> > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > case the root_mem_cgroup and redirect to the global pool.
> >
> > After your patch we have tracked objects allocated in the root cgroup
> > actually live in the root cgroup. Removing the shrinker special case
> > is correct in order to shrink those - but it removes the call to
> > shrink the global pool of untracked allocation classes.
> >
> > I think we need to restore the double call to shrink_slab() we had
> > prior to this:
> >
> > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > Author: Vladimir Davydov <[email protected]>
> > Date: Fri Aug 17 15:48:17 2018 -0700
> >
> > mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> >
> > The patch makes shrink_slab() be called for root_mem_cgroup in the same
> > way as it's called for the rest of cgroups. This simplifies the logic
> > and improves the readability.
> >
> > where we iterate through all cgroups, including the root, to reclaim
> > objects accounted to those respective groups; and then a call to scan
> > the global pool of untracked objects in that numa node.
>
> I agree, thank you for pointing at this commit.
>
> >
> > For ease of review/verification, it could be helpful to split the
> > patch and remove the root exception case-by-case (not callsite by
> > callsite, but e.g. the swap counter, the memory counter etc.).
>
> Sorry for a long pause, here's an update. I've split the patch,
> fixed a couple of issues and was almost ready to send it upstream,
> but then I've noticed that on cgroup v1 kmem and memsw counters
> are sometimes heading into a negative territory and generating a warning
> in dmesg. It happens for a short amount of time at early stages
> of the system uptime. I haven't seen it happening with the memory counter.
>
> My investigation showed that the reason is that the result of a
> cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> early stages. Depending on the return value we charge or skip the kmem
> counter and also handle the swap/memsw counter differently.
>
> The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> can change at any particular moment. So I don't see how to make all root's
> counters consistent without tracking them all no matter which cgroup version
> is used. Which is obviously an overkill and will lead to an overhead, which
> unlikely can be justified.
>
> I'll appreciate any ideas, but I don't see a good path forward here
> (except fixing a particular issue with root's slab stats with the
> Muchun's patch).
>
Since commit 0158115f702b0 ("memcg, kmem: deprecate
kmem.limit_in_bytes"), we have been in the process of deprecating the limit
on kmem. If we decide that now is the time to deprecate it, we can
convert the kmem page counter to a memcg stat, update it for both v1
and v2, and serve v1's kmem.usage_in_bytes from that memcg stat. The
memcg stat is more efficient than the page counter, so I don't think
overhead should be an issue. This new memcg stat would represent all types
of kmem memory for a memcg, like slab, stack and no-type memory. What do
you think?
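A rough sketch of what serving v1's kmem.usage_in_bytes from such a stat
could look like (purely illustrative: the MEMCG_KMEM stat item and
kmem_usage_read() below are assumptions made for the example, not existing
kernel symbols):
static u64 kmem_usage_read(struct cgroup_subsys_state *css,
                           struct cftype *cft)
{
        struct mem_cgroup *memcg = mem_cgroup_from_css(css);

        /*
         * Read the per-memcg kernel memory stat instead of a dedicated
         * page counter; stats are per-cpu batched, so keeping them updated
         * on both v1 and v2 should be cheap.
         */
        return (u64)memcg_page_state(memcg, MEMCG_KMEM) * PAGE_SIZE;
}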
On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt wrote:
> On Mon, Nov 9, 2020 at 5:28 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Fri, Oct 23, 2020 at 12:30:53PM -0400, Johannes Weiner wrote:
> > > On Wed, Oct 21, 2020 at 12:33:22PM -0700, Roman Gushchin wrote:
> > > > On Tue, Oct 20, 2020 at 02:18:22PM -0400, Johannes Weiner wrote:
> > > > > On Tue, Oct 20, 2020 at 10:07:17AM -0700, Roman Gushchin wrote:
> > > > > > If we want these counter to function properly, then we should go into the opposite
> > > > > > direction and remove the special handling of the root memory cgroup in many places.
> > > > >
> > > > > I suspect this is also by far the most robust solution from a code and
> > > > > maintenance POV.
> > > > >
> > > > > I don't recall the page counter at the root level having been a
> > > > > concern in recent years, even though it's widely used in production
> > > > > environments. It's lockless and cache compact. It's also per-cpu
> > > > > batched, which means it isn't actually part of the memcg hotpath.
> > > >
> > > >
> > > > I agree.
> > > >
> > > > Here is my first attempt. Comments are welcome!
> > > >
> > > > It doesn't solve the original problem though (use_hierarchy == false and
> > > > objcg reparenting), I'll send a separate patch for that.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > >
> > > > From 9c7d94a3f999447417b02a7100527ce1922bc252 Mon Sep 17 00:00:00 2001
> > > > From: Roman Gushchin <[email protected]>
> > > > Date: Tue, 20 Oct 2020 18:05:43 -0700
> > > > Subject: [PATCH RFC] mm: memcontrol: do not treat the root memory cgroup
> > > > specially
> > > >
> > > > Currently the root memory cgroup is treated in a special way:
> > > > it's not charged and uncharged directly (only indirectly with their
> > > > descendants), processes belonging to the root memory cgroup are exempt
> > > > from the kernel- and the socket memory accounting.
> > > >
> > > > At the same time some of root level statistics and data are available
> > > > to a user:
> > > > - cgroup v2: memory.stat
> > > > - cgroup v1: memory.stat, memory.usage_in_bytes, memory.memsw.usage_in_bytes,
> > > > memory.kmem.usage_in_bytes and memory.kmem.tcp.usage_in_bytes
> > > >
> > > > Historically the reason for a special treatment was an avoidance
> > > > of extra performance cost, however now it's unlikely a good reason:
> > > > over years there was a significant improvement in the performance
> > > > of the memory cgroup code. Also on a modern system actively using
> > > > cgroups (e.g. managed by systemd) there are usually no (significant)
> > > > processes in the root memory cgroup.
> > > >
> > > > The special treatment of the root memory cgroups creates a number of
> > > > issues visible to a user:
> > > > 1) slab stats on the root level do not include the slab memory
> > > > consumed by processes in the root memory cgroup
> > > > 2) non-slab kernel memory consumed by processes in the root memory cgroup
> > > > is not included into memory.kmem.usage_in_bytes
> > > > 3) socket memory consumed by processes in the root memory cgroup
> > > > is not included into memory.kmem.tcp.usage_in_bytes
> > > >
> > > > It complicates the code and increases a risk of new bugs.
> > > >
> > > > This patch removes a number of exceptions related to the handling of
> > > > the root memory cgroup. With this patch applied the root memory cgroup
> > > > is treated uniformly to other cgroups in the following cases:
> > > > 1) root memory cgroup is charged and uncharged directly, try_charge()
> > > > and cancel_charge() do not return immediately if the root memory
> > > > cgroups is passed. uncharge_batch() and __mem_cgroup_clear_mc()
> > > > do not handle the root memory cgroup specially.
> > > > 2) per-memcg slab statistics is gathered for the root memory cgroup
> > > > 3) shrinkers infra treats the root memory cgroup as any other memory
> > > > cgroup
> > > > 4) non-slab kernel memory accounting doesn't exclude pages allocated
> > > > by processes belonging to the root memory cgroup
> > > > 5) if a socket is opened by a process in the root memory cgroup,
> > > > the socket memory is accounted
> > > > 6) root cgroup is charged for the used swap memory.
> > > >
> > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > Suggested-by: Johannes Weiner <[email protected]>
> > >
> > > This looks great.
> > >
> > > The try_charge(), cancel_charge() etc. paths are relatively
> > > straight-forward and look correct to me.
> > >
> > > The swap counters too.
> > >
> > > Slab is a bit trickier, but it also looks correct to me.
> > >
> > > I'm having some trouble with the shrinkers. Currently, tracked objects
> > > allocated in non-root cgroups live in that cgroup. Tracked objects in
> > > the root cgroup, as well as untracked objects, live in a global pool.
> > > When reclaim iterates all memcgs and calls shrink_slab(), we special
> > > case the root_mem_cgroup and redirect to the global pool.
> > >
> > > After your patch we have tracked objects allocated in the root cgroup
> > > actually live in the root cgroup. Removing the shrinker special case
> > > is correct in order to shrink those - but it removes the call to
> > > shrink the global pool of untracked allocation classes.
> > >
> > > I think we need to restore the double call to shrink_slab() we had
> > > prior to this:
> > >
> > > commit aeed1d325d429ac9699c4bf62d17156d60905519
> > > Author: Vladimir Davydov <[email protected]>
> > > Date: Fri Aug 17 15:48:17 2018 -0700
> > >
> > > mm/vmscan.c: generalize shrink_slab() calls in shrink_node()
> > >
> > > The patch makes shrink_slab() be called for root_mem_cgroup in the same
> > > way as it's called for the rest of cgroups. This simplifies the logic
> > > and improves the readability.
> > >
> > > where we iterate through all cgroups, including the root, to reclaim
> > > objects accounted to those respective groups; and then a call to scan
> > > the global pool of untracked objects in that numa node.
> >
> > I agree, thank you for pointing at this commit.
> >
> > >
> > > For ease of review/verification, it could be helpful to split the
> > > patch and remove the root exception case-by-case (not callsite by
> > > callsite, but e.g. the swap counter, the memory counter etc.).
> >
> > Sorry for a long pause, here's an update. I've split the patch,
> > fixed a couple of issues and was almost ready to send it upstream,
> > but then I've noticed that on cgroup v1 kmem and memsw counters
> > are sometimes heading into a negative territory and generating a warning
> > in dmesg. It happens for a short amount of time at early stages
> > of the system uptime. I haven't seen it happening with the memory counter.
> >
> > My investigation showed that the reason is that the result of a
> > cgroup_subsys_on_dfl(memory_cgrp_subsys) call can be misleading at
> > early stages. Depending on the return value we charge or skip the kmem
> > counter and also handle the swap/memsw counter differently.
> >
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment. So I don't see how to make all root's
> > counters consistent without tracking them all no matter which cgroup version
> > is used. Which is obviously an overkill and will lead to an overhead, which
> > unlikely can be justified.
> >
> > I'll appreciate any ideas, but I don't see a good path forward here
> > (except fixing a particular issue with root's slab stats with the
> > Muchun's patch).
> >
>
> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat. The
> memcg stat is more efficient than the page counter, so I don't think
> overhead should be an issue. This new memcg stat represents all types
> of kmem memory for a memcg like slab, stack and no-type. What do you
> think?
Hm, I don't see why it can't work, but it will not solve the memsw problem,
right?
Correct handling of the root's kmem and tcpmem counters was the main reason
to select "always charge the root memory cgroup" over "never charge the
root memory cgroup". If we won't use these counters directly, then it's
probably better to never charge the root memory cgroup, at least from
the performance point of view.
Thanks!
Hi.
On Tue, Nov 10, 2020 at 07:11:28AM -0800, Shakeel Butt <[email protected]> wrote:
> > The problem is that cgroup_subsys_on_dfl(memory_cgrp_subsys)'s return value
> > can change at any particular moment.
The switch can happen only when a singular (i.e. root-only) hierarchy
exists. (Or it could if rebind_subsystems() waited until all memcgs are
completely freed.)
> Since the commit 0158115f702b0 ("memcg, kmem: deprecate
> kmem.limit_in_bytes"), we are in the process of deprecating the limit
> on kmem. If we decide that now is the time to deprecate it, we can
> convert the kmem page counter to a memcg stat, update it for both v1
> and v2 and serve v1's kmem.usage_in_bytes from that memcg stat.
So with the single memcg, it may be possible to reconstruct the
necessary counters in both directions using the statistics (or some
complementarity, without removing the fine-grained counters).
I didn't check all the charging/uncharging places; these are just my 2
cents on the issue.
Michal