LinuxLists.cc - [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > >
> > > rcu: INFO: rcu_sched self-detected stall on CPU
> > > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > <...>
> > > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > <...>
> > > __memcg_kmem_charge+0x55/0x140
> > > __alloc_pages_nodemask+0x267/0x320
> > > pipe_write+0x1ad/0x400
> > > new_sync_write+0x127/0x1c0
> > > __kernel_write+0x4f/0xf0
> > > dump_emit+0x91/0xc0
> > > writenote+0xa0/0xc0
> > > elf_core_dump+0x11af/0x1430
> > > do_coredump+0xc65/0xee0
> > > ? unix_stream_sendmsg+0x37d/0x3b0
> > > get_signal+0x132/0x7c0
> > > do_signal+0x36/0x640
> > > ? recalc_sigpending+0x17/0x50
> > > exit_to_usermode_loop+0x61/0xd0
> > > do_syscall_64+0xd4/0x100
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg. We're iterating over and over in the
> > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > become online and the exiting task won't be migrated to a live memcg.
> > >
> > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > >
> > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > offline, the check is usually useless, except some rare cases
> > > when for example it determines if something should be presented
> > > to a user.
> > >
> > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > Cc: [email protected]
> > > Cc: Tejun Heo <[email protected]>
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > The bug aside, it doesn't matter whether the cgroup is online for the
> > callers. It used to matter when offlining needed to evacuate all
> > charges from the memcg, and so needed to prevent new ones from showing
> > up, but we don't care now.
>
> Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> switched to css_tryget() as well then?

In those case it can't cause a rcu stall, so it's not a so urgent.
But you are right, we should probably do the same here. I'll look
at all remaining callers and prepare the patchset.

I'll also probably rename it to css_tryget_if_online() to make obvious
that it doesn't hold the cgroup from being onlined.

Thanks!

2019-11-07 02:24:36

by Shakeel Butt

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed, Nov 6, 2019 at 5:43 PM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> > On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > >
> > > > rcu: INFO: rcu_sched self-detected stall on CPU
> > > > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > > <...>
> > > > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > > <...>
> > > > __memcg_kmem_charge+0x55/0x140
> > > > __alloc_pages_nodemask+0x267/0x320
> > > > pipe_write+0x1ad/0x400
> > > > new_sync_write+0x127/0x1c0
> > > > __kernel_write+0x4f/0xf0
> > > > dump_emit+0x91/0xc0
> > > > writenote+0xa0/0xc0
> > > > elf_core_dump+0x11af/0x1430
> > > > do_coredump+0xc65/0xee0
> > > > ? unix_stream_sendmsg+0x37d/0x3b0
> > > > get_signal+0x132/0x7c0
> > > > do_signal+0x36/0x640
> > > > ? recalc_sigpending+0x17/0x50
> > > > exit_to_usermode_loop+0x61/0xd0
> > > > do_syscall_64+0xd4/0x100
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg. We're iterating over and over in the
> > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > become online and the exiting task won't be migrated to a live memcg.
> > > >
> > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > >
> > > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > > offline, the check is usually useless, except some rare cases
> > > > when for example it determines if something should be presented
> > > > to a user.
> > > >
> > > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > > >
> > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > Cc: [email protected]
> > > > Cc: Tejun Heo <[email protected]>
> > >
> > > Acked-by: Johannes Weiner <[email protected]>
> > >
> > > The bug aside, it doesn't matter whether the cgroup is online for the
> > > callers. It used to matter when offlining needed to evacuate all
> > > charges from the memcg, and so needed to prevent new ones from showing
> > > up, but we don't care now.
> >
> > Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> > switched to css_tryget() as well then?
>
> In those case it can't cause a rcu stall, so it's not a so urgent.
> But you are right, we should probably do the same here. I'll look
> at all remaining callers and prepare the patchset.
>
> I'll also probably rename it to css_tryget_if_online() to make obvious
> that it doesn't hold the cgroup from being onlined.
>

SGTM

2019-11-07 02:30:47

by Shakeel Butt

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed, Nov 6, 2019 at 6:21 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Nov 6, 2019 at 5:43 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Wed, Nov 06, 2019 at 05:25:26PM -0800, Shakeel Butt wrote:
> > > On Wed, Nov 6, 2019 at 4:22 PM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin wrote:
> > > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > > >
> > > > > rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > > > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > > > <...>
> > > > > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > > > <...>
> > > > > __memcg_kmem_charge+0x55/0x140
> > > > > __alloc_pages_nodemask+0x267/0x320
> > > > > pipe_write+0x1ad/0x400
> > > > > new_sync_write+0x127/0x1c0
> > > > > __kernel_write+0x4f/0xf0
> > > > > dump_emit+0x91/0xc0
> > > > > writenote+0xa0/0xc0
> > > > > elf_core_dump+0x11af/0x1430
> > > > > do_coredump+0xc65/0xee0
> > > > > ? unix_stream_sendmsg+0x37d/0x3b0
> > > > > get_signal+0x132/0x7c0
> > > > > do_signal+0x36/0x640
> > > > > ? recalc_sigpending+0x17/0x50
> > > > > exit_to_usermode_loop+0x61/0xd0
> > > > > do_syscall_64+0xd4/0x100
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > >
> > > > > The problem is caused by an exiting task which is associated with
> > > > > an offline memcg. We're iterating over and over in the
> > > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > > become online and the exiting task won't be migrated to a live memcg.
> > > > >
> > > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > > >
> > > > > As css_tryget_online() cannot guarantee that the memcg won't go
> > > > > offline, the check is usually useless, except some rare cases
> > > > > when for example it determines if something should be presented
> > > > > to a user.
> > > > >
> > > > > A similar problem is described by commit 18fa84a2db0e ("cgroup: Use
> > > > > css_tryget() instead of css_tryget_online() in task_get_css()").
> > > > >
> > > > > Signed-off-by: Roman Gushchin <[email protected]>

Forgot to add:

Reviewed-by: Shakeel Butt <[email protected]>

> > > > > Cc: [email protected]
> > > > > Cc: Tejun Heo <[email protected]>
> > > >
> > > > Acked-by: Johannes Weiner <[email protected]>
> > > >
> > > > The bug aside, it doesn't matter whether the cgroup is online for the
> > > > callers. It used to matter when offlining needed to evacuate all
> > > > charges from the memcg, and so needed to prevent new ones from showing
> > > > up, but we don't care now.
> > >
> > > Should get_mem_cgroup_from_current() and get_mem_cgroup_from_page() be
> > > switched to css_tryget() as well then?
> >
> > In those case it can't cause a rcu stall, so it's not a so urgent.
> > But you are right, we should probably do the same here. I'll look
> > at all remaining callers and prepare the patchset.
> >
> > I'll also probably rename it to css_tryget_if_online() to make obvious
> > that it doesn't hold the cgroup from being onlined.
> >
>
> SGTM

2019-11-07 12:25:33

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> We've encountered a rcu stall in get_mem_cgroup_from_mm():
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> <...>
> RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> <...>
> __memcg_kmem_charge+0x55/0x140
> __alloc_pages_nodemask+0x267/0x320
> pipe_write+0x1ad/0x400
> new_sync_write+0x127/0x1c0
> __kernel_write+0x4f/0xf0
> dump_emit+0x91/0xc0
> writenote+0xa0/0xc0
> elf_core_dump+0x11af/0x1430
> do_coredump+0xc65/0xee0
> ? unix_stream_sendmsg+0x37d/0x3b0
> get_signal+0x132/0x7c0
> do_signal+0x36/0x640
> ? recalc_sigpending+0x17/0x50
> exit_to_usermode_loop+0x61/0xd0
> do_syscall_64+0xd4/0x100
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> The problem is caused by an exiting task which is associated with
> an offline memcg.

Hmm, how can we have a task in an offline memcg? I thought that any
existing task will prevent cgroup removal from proceeding. Is this some
sort of race where the task managed to disassociate from the cgroup
while there is still a binding to a memcg existing? What am I missing?

--
Michal Hocko
SUSE Labs

2019-11-07 15:46:01

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

2019-11-07 15:47:18

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

Hello,

On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> Hmm, how can we have a task in an offline memcg? I thought that any
> existing task will prevent cgroup removal from proceeding. Is this some
> sort of race where the task managed to disassociate from the cgroup
> while there is still a binding to a memcg existing? What am I missing?

The previous cgroup one was from bsd process accounting which happens
after the process is marked head. This one IIUC is from core dumping
path.

Thanks.

--
tejun

2019-11-07 16:44:09

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> >
> > rcu: INFO: rcu_sched self-detected stall on CPU
> > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > <...>
> > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > <...>
> > __memcg_kmem_charge+0x55/0x140
> > __alloc_pages_nodemask+0x267/0x320
> > pipe_write+0x1ad/0x400
> > new_sync_write+0x127/0x1c0
> > __kernel_write+0x4f/0xf0
> > dump_emit+0x91/0xc0
> > writenote+0xa0/0xc0
> > elf_core_dump+0x11af/0x1430
> > do_coredump+0xc65/0xee0
> > ? unix_stream_sendmsg+0x37d/0x3b0
> > get_signal+0x132/0x7c0
> > do_signal+0x36/0x640
> > ? recalc_sigpending+0x17/0x50
> > exit_to_usermode_loop+0x61/0xd0
> > do_syscall_64+0xd4/0x100
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > The problem is caused by an exiting task which is associated with
> > an offline memcg.
>
> Hmm, how can we have a task in an offline memcg? I thought that any
> existing task will prevent cgroup removal from proceeding. Is this some
> sort of race where the task managed to disassociate from the cgroup
> while there is still a binding to a memcg existing? What am I missing?

It's an exiting task with the PF_EXITING flag set and it's in their late stages
of life.

2019-11-07 17:05:13

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
> On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> > On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > >
> > > rcu: INFO: rcu_sched self-detected stall on CPU
> > > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > <...>
> > > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > <...>
> > > __memcg_kmem_charge+0x55/0x140
> > > __alloc_pages_nodemask+0x267/0x320
> > > pipe_write+0x1ad/0x400
> > > new_sync_write+0x127/0x1c0
> > > __kernel_write+0x4f/0xf0
> > > dump_emit+0x91/0xc0
> > > writenote+0xa0/0xc0
> > > elf_core_dump+0x11af/0x1430
> > > do_coredump+0xc65/0xee0
> > > ? unix_stream_sendmsg+0x37d/0x3b0
> > > get_signal+0x132/0x7c0
> > > do_signal+0x36/0x640
> > > ? recalc_sigpending+0x17/0x50
> > > exit_to_usermode_loop+0x61/0xd0
> > > do_syscall_64+0xd4/0x100
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg.
> >
> > Hmm, how can we have a task in an offline memcg? I thought that any
> > existing task will prevent cgroup removal from proceeding. Is this some
> > sort of race where the task managed to disassociate from the cgroup
> > while there is still a binding to a memcg existing? What am I missing?
>
> It's an exiting task with the PF_EXITING flag set and it's in their late stages
> of life.

This is a signal delivery path AFAIU (get_signal) and the coredumping
happens before do_exit. My understanding is that that unlinking
happens from cgroup_exit. So either I am misreading the backtrace or
there is some other way to leave cgroups or there is something more
going on.

JFTR I am not really disputing the patch but I simply do not understand
how the problem really happened.

--
Michal Hocko
SUSE Labs

2019-11-07 22:42:48

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu, Nov 07, 2019 at 06:02:00PM +0100, Michal Hocko wrote:
> On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
> > On Thu, Nov 07, 2019 at 01:21:25PM +0100, Michal Hocko wrote:
> > > On Wed 06-11-19 14:51:30, Roman Gushchin wrote:
> > > > We've encountered a rcu stall in get_mem_cgroup_from_mm():
> > > >
> > > > rcu: INFO: rcu_sched self-detected stall on CPU
> > > > rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
> > > > (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
> > > > <...>
> > > > RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
> > > > <...>
> > > > __memcg_kmem_charge+0x55/0x140
> > > > __alloc_pages_nodemask+0x267/0x320
> > > > pipe_write+0x1ad/0x400
> > > > new_sync_write+0x127/0x1c0
> > > > __kernel_write+0x4f/0xf0
> > > > dump_emit+0x91/0xc0
> > > > writenote+0xa0/0xc0
> > > > elf_core_dump+0x11af/0x1430
> > > > do_coredump+0xc65/0xee0
> > > > ? unix_stream_sendmsg+0x37d/0x3b0
> > > > get_signal+0x132/0x7c0
> > > > do_signal+0x36/0x640
> > > > ? recalc_sigpending+0x17/0x50
> > > > exit_to_usermode_loop+0x61/0xd0
> > > > do_syscall_64+0xd4/0x100
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg.
> > >
> > > Hmm, how can we have a task in an offline memcg? I thought that any
> > > existing task will prevent cgroup removal from proceeding. Is this some
> > > sort of race where the task managed to disassociate from the cgroup
> > > while there is still a binding to a memcg existing? What am I missing?
> >
> > It's an exiting task with the PF_EXITING flag set and it's in their late stages
> > of life.
>
> This is a signal delivery path AFAIU (get_signal) and the coredumping
> happens before do_exit. My understanding is that that unlinking
> happens from cgroup_exit. So either I am misreading the backtrace or
> there is some other way to leave cgroups or there is something more
> going on.

Yeah, you're right. I have no better explanation for this and the similar,
mentioned in the commit bsd accounting issue, than some very rare race condition
that allows cgroups to be offlined with a task inside.

I'll think more about it.

Thanks, it's a really good question.

>
> JFTR I am not really disputing the patch but I simply do not understand
> how the problem really happened.
>
> --
> Michal Hocko
> SUSE Labs

2019-11-08 08:57:10

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu 07-11-19 22:41:13, Roman Gushchin wrote:
> On Thu, Nov 07, 2019 at 06:02:00PM +0100, Michal Hocko wrote:
> > On Thu 07-11-19 16:42:41, Roman Gushchin wrote:
[...]
> > > It's an exiting task with the PF_EXITING flag set and it's in their late stages
> > > of life.
> >
> > This is a signal delivery path AFAIU (get_signal) and the coredumping
> > happens before do_exit. My understanding is that that unlinking
> > happens from cgroup_exit. So either I am misreading the backtrace or
> > there is some other way to leave cgroups or there is something more
> > going on.
>
> Yeah, you're right. I have no better explanation for this and the similar,
> mentioned in the commit bsd accounting issue,

Tejun mentioned bsd accounting issue as well, but I do not see any
explicit reference to it in neither of the two patches.

> than some very rare race condition
> that allows cgroups to be offlined with a task inside.
>
> I'll think more about it.

Thanks a lot. As I've said, I am not opposing this change once we have a
proper changelog but I find the explanation really weak. If there is a
race then it should be fixed as well.

Thanks!
--
Michal Hocko
SUSE Labs

2019-11-13 17:14:49

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

Hi.

On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <[email protected]> wrote:
> Let's fix it by switching from css_tryget_online() to css_tryget().
Is this a safe thing to do? The stack captures a kmem charge path, with
css_tryget() it may happen it gets an offlined memcg and carry out
charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
skipped as a consequence?

> The problem is caused by an exiting task which is associated with
> an offline memcg. We're iterating over and over in the
> do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> become online and the exiting task won't be migrated to a live memcg.
As discussed in other replies, the task is not yet exiting. However, the
access to memcg isn't through `current` but `mm->owner`, i.e. another
task of a threadgroup may have got stuck in an offlined memcg (I don't
have a good explanation for that though).

HTH,
Michal

Attachments:

(No filename) (949.00 B)
signature.asc (849.00 B)
Digital signature Download all attachments

2019-11-13 17:23:22

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutn? wrote:
> Hi.
>
> On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <[email protected]> wrote:
> > Let's fix it by switching from css_tryget_online() to css_tryget().
> Is this a safe thing to do? The stack captures a kmem charge path, with
> css_tryget() it may happen it gets an offlined memcg and carry out
> charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> skipped as a consequence?

The thing here is that css_tryget_online() cannot pin the online state,
so even if returned true, the cgroup can be offline at the return from
the function. So if we rely somewhere on it, it's already broken.

Generally speaking, it's better to reduce it's usage to the bare minimum.

>
> > The problem is caused by an exiting task which is associated with
> > an offline memcg. We're iterating over and over in the
> > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > become online and the exiting task won't be migrated to a live memcg.
> As discussed in other replies, the task is not yet exiting. However, the
> access to memcg isn't through `current` but `mm->owner`, i.e. another
> task of a threadgroup may have got stuck in an offlined memcg (I don't
> have a good explanation for that though).

Yes, it's true, and I've no idea how the memcg can be offline in this case too.
We've seen it only several times in fb production, so it seems to be a really
rare case. Could be anything from a tiny race somewhere to cpu bugs.

Thanks!

2019-11-14 19:18:10

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed 13-11-19 17:08:29, Roman Gushchin wrote:
> On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutn? wrote:
> > Hi.
> >
> > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <[email protected]> wrote:
> > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > Is this a safe thing to do? The stack captures a kmem charge path, with
> > css_tryget() it may happen it gets an offlined memcg and carry out
> > charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> > skipped as a consequence?
>
> The thing here is that css_tryget_online() cannot pin the online state,
> so even if returned true, the cgroup can be offline at the return from
> the function. So if we rely somewhere on it, it's already broken.

Then what is the point of this function and what about all other users?

> Generally speaking, it's better to reduce it's usage to the bare minimum.

If it doesn't have any sensible semantic then I would argue it should go
altogether otherwise we are going to chase new users again and aagain?

> > > The problem is caused by an exiting task which is associated with
> > > an offline memcg. We're iterating over and over in the
> > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > become online and the exiting task won't be migrated to a live memcg.
> > As discussed in other replies, the task is not yet exiting. However, the
> > access to memcg isn't through `current` but `mm->owner`, i.e. another
> > task of a threadgroup may have got stuck in an offlined memcg (I don't
> > have a good explanation for that though).

The trace however points to current->mm or current->active_memcg. Is it
possible that we have a stale active_memcg?
--
Michal Hocko
SUSE Labs

2019-11-14 19:21:25

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

Hello,

On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> Then what is the point of this function and what about all other users?

It is useful for controlling admissions of new userspace visible uses
- e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
which has already been deleted. We're just using it too liberally.

Thanks.

--
tejun

2019-11-14 19:38:46

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu 14-11-19 11:20:18, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> > Then what is the point of this function and what about all other users?
>
> It is useful for controlling admissions of new userspace visible uses
> - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> which has already been deleted.

I am not sure I understand. Roman says that the cgroup can get offline
right after the function returns. How is "already deleted" different
from "just deleted"? I thought that the state is preserved at least
while the rcu lock is held but my memory is dim here.

> We're just using it too liberally.

Can we get a doc update to be explicit about sensible usecases so that
others can be dropped accordingly?
--
Michal Hocko
SUSE Labs

2019-11-14 19:41:07

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

Hello,

On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > It is useful for controlling admissions of new userspace visible uses
> > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > which has already been deleted.
>
> I am not sure I understand. Roman says that the cgroup can get offline
> right after the function returns. How is "already deleted" different
> from "just deleted"? I thought that the state is preserved at least
> while the rcu lock is held but my memory is dim here.

It's the same difference as between "opening a file and deleting it"
and "deleting a file and opening it". We shoud allow the former while
not allowing the latter.

> > We're just using it too liberally.
>
> Can we get a doc update to be explicit about sensible usecases so that
> others can be dropped accordingly?

Yeah, we should audit and convert the uses and update the doc.

Thanks.

--
tejun

2019-11-15 17:44:37

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > It is useful for controlling admissions of new userspace visible uses
> > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > which has already been deleted.
> >
> > I am not sure I understand. Roman says that the cgroup can get offline
> > right after the function returns. How is "already deleted" different
> > from "just deleted"? I thought that the state is preserved at least
> > while the rcu lock is held but my memory is dim here.
>
> It's the same difference as between "opening a file and deleting it"
> and "deleting a file and opening it".

I am sorry but I do not follow. How can css_tryget_online provide the
same semantic when the css can go offline right after the tryget call
returns so it is effectivelly undistinguishable from the case when the
css was already online before the call was made. Or is my understanding
of what Roman's said earlier in the thread?

--
Michal Hocko
SUSE Labs

2019-11-15 17:48:51

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri, Nov 15, 2019 at 06:40:31PM +0100, Michal Hocko wrote:
> I am sorry but I do not follow. How can css_tryget_online provide the
> same semantic when the css can go offline right after the tryget call
> returns so it is effectivelly undistinguishable from the case when the
> css was already online before the call was made. Or is my understanding
> of what Roman's said earlier in the thread?

It's *exactly* the same semantics as opening a file and deleting it
and keeping using it vs. deleting a file and then trying to open it.
You can't give out a new reference when the entity's visibility is
expected to be gone already and that's the exactly the condition that
tryget_online avoids. I don't know how to better explain it if the
file analogy doesn't work.

Thanks.

--
tejun

2019-11-15 17:49:21

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > Hello,
> >
> > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > It is useful for controlling admissions of new userspace visible uses
> > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > which has already been deleted.
> > >
> > > I am not sure I understand. Roman says that the cgroup can get offline
> > > right after the function returns. How is "already deleted" different
> > > from "just deleted"? I thought that the state is preserved at least
> > > while the rcu lock is held but my memory is dim here.
> >
> > It's the same difference as between "opening a file and deleting it"
> > and "deleting a file and opening it".
>
> I am sorry but I do not follow. How can css_tryget_online provide the
> same semantic when the css can go offline right after the tryget call
> returns so it is effectivelly undistinguishable from the case when the
> css was already online before the call was made.

s@online@offline@

And reading after myself it turned out to sound differently than I
meant. What I wanted to say really is, what is the difference that
css_tryget_online really guarantee when the css might go offline right
after the call suceeds so more specifically what is the difference
between
if (css_tryget()) {
if (online)
DO_SOMETHING
}
and
if (css_tryget_online()) {
DO_SOMETHING
}

both of them are racy and do not provide any guarantee wrt. online
state.
--
Michal Hocko
SUSE Labs

2019-11-15 17:50:34

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> s@online@offline@
>
> And reading after myself it turned out to sound differently than I
> meant. What I wanted to say really is, what is the difference that
> css_tryget_online really guarantee when the css might go offline right
> after the call suceeds so more specifically what is the difference
> between
> if (css_tryget()) {
> if (online)
> DO_SOMETHING
> }
> and
> if (css_tryget_online()) {
> DO_SOMETHING
> }
>
> both of them are racy and do not provide any guarantee wrt. online
> state.

It's about not giving new reference when the object is known to be
delted to the user. Can you please think more about how file
deletions work?

Thanks.

--
tejun

2019-11-15 18:04:40

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri 15-11-19 09:48:44, Tejun Heo wrote:
> On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> > s@online@offline@
> >
> > And reading after myself it turned out to sound differently than I
> > meant. What I wanted to say really is, what is the difference that
> > css_tryget_online really guarantee when the css might go offline right
> > after the call suceeds so more specifically what is the difference
> > between
> > if (css_tryget()) {
> > if (online)
> > DO_SOMETHING
> > }
> > and
> > if (css_tryget_online()) {
> > DO_SOMETHING
> > }
> >
> > both of them are racy and do not provide any guarantee wrt. online
> > state.
>
> It's about not giving new reference when the object is known to be
> delted to the user.

This part is clear to me. The failure just says it is already too late
to do anything. I just still struggle why the success is telling me much
more when the state might change before I can do anything on the object.
I could see a usefulness if I've had a guarantee that the object stays
online while I hold a $FOO lock but if there is nothing like that then
we are just having already too late or potentially too late.

Anyway it's been a hard week and the brain is just going for the weekend
so I just might be dense now.

> Can you please think more about how file deletions work?

I will try that with a fresh brain next week.
--
Michal Hocko
SUSE Labs

2019-11-15 18:12:54

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> > On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > > Hello,
> > >
> > > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > > It is useful for controlling admissions of new userspace visible uses
> > > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > > which has already been deleted.
> > > >
> > > > I am not sure I understand. Roman says that the cgroup can get offline
> > > > right after the function returns. How is "already deleted" different
> > > > from "just deleted"? I thought that the state is preserved at least
> > > > while the rcu lock is held but my memory is dim here.
> > >
> > > It's the same difference as between "opening a file and deleting it"
> > > and "deleting a file and opening it".
> >
> > I am sorry but I do not follow. How can css_tryget_online provide the
> > same semantic when the css can go offline right after the tryget call
> > returns so it is effectivelly undistinguishable from the case when the
> > css was already online before the call was made.
>
> s@online@offline@
>
> And reading after myself it turned out to sound differently than I
> meant. What I wanted to say really is, what is the difference that
> css_tryget_online really guarantee when the css might go offline right
> after the call suceeds so more specifically what is the difference
> between
> if (css_tryget()) {
> if (online)
> DO_SOMETHING
> }
> and
> if (css_tryget_online()) {
> DO_SOMETHING
> }
>
> both of them are racy and do not provide any guarantee wrt. online
> state.

Let me step back a little bit.

I think, we all agree that css_tryget_online() has a weird semantics,
in most cases is used only due to historical reasons and clearly asks
for a cleanup. So I suggest to stop arguing about it and wait for the
cleanup patchset. Then we can discuss each remaining use case in details,
if there will be any.

Thanks!

2019-11-15 18:16:37

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Thu, Nov 14, 2019 at 08:16:57PM +0100, Michal Hocko wrote:
> On Wed 13-11-19 17:08:29, Roman Gushchin wrote:
> > On Wed, Nov 13, 2019 at 05:29:34PM +0100, Michal Koutn? wrote:
> > > Hi.
> > >
> > > On Wed, Nov 06, 2019 at 02:51:30PM -0800, Roman Gushchin <[email protected]> wrote:
> > > > Let's fix it by switching from css_tryget_online() to css_tryget().
> > > Is this a safe thing to do? The stack captures a kmem charge path, with
> > > css_tryget() it may happen it gets an offlined memcg and carry out
> > > charge into it. What happens when e.g. memcg_deactivate_kmem_caches is
> > > skipped as a consequence?
> >
> > The thing here is that css_tryget_online() cannot pin the online state,
> > so even if returned true, the cgroup can be offline at the return from
> > the function. So if we rely somewhere on it, it's already broken.
>
> Then what is the point of this function and what about all other users?
>
> > Generally speaking, it's better to reduce it's usage to the bare minimum.
>
> If it doesn't have any sensible semantic then I would argue it should go
> altogether otherwise we are going to chase new users again and aagain?

That's the plan: to audit all use cases and get rid of it where it's possible.

>
> > > > The problem is caused by an exiting task which is associated with
> > > > an offline memcg. We're iterating over and over in the
> > > > do {} while (!css_tryget_online()) loop, but obviously the memcg won't
> > > > become online and the exiting task won't be migrated to a live memcg.
> > > As discussed in other replies, the task is not yet exiting. However, the
> > > access to memcg isn't through `current` but `mm->owner`, i.e. another
> > > task of a threadgroup may have got stuck in an offlined memcg (I don't
> > > have a good explanation for that though).
>
> The trace however points to current->mm or current->active_memcg. Is it
> possible that we have a stale active_memcg?

It wouldn't cause a rcu stall.

Thanks!

2019-11-18 09:45:00

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Fri 15-11-19 18:07:34, Roman Gushchin wrote:
> On Fri, Nov 15, 2019 at 06:47:21PM +0100, Michal Hocko wrote:
> > On Fri 15-11-19 18:40:31, Michal Hocko wrote:
> > > On Thu 14-11-19 11:37:36, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > On Thu, Nov 14, 2019 at 08:33:40PM +0100, Michal Hocko wrote:
> > > > > > It is useful for controlling admissions of new userspace visible uses
> > > > > > - e.g. a tracepoint shouldn't be allowed to be attached to a cgroup
> > > > > > which has already been deleted.
> > > > >
> > > > > I am not sure I understand. Roman says that the cgroup can get offline
> > > > > right after the function returns. How is "already deleted" different
> > > > > from "just deleted"? I thought that the state is preserved at least
> > > > > while the rcu lock is held but my memory is dim here.
> > > >
> > > > It's the same difference as between "opening a file and deleting it"
> > > > and "deleting a file and opening it".
> > >
> > > I am sorry but I do not follow. How can css_tryget_online provide the
> > > same semantic when the css can go offline right after the tryget call
> > > returns so it is effectivelly undistinguishable from the case when the
> > > css was already online before the call was made.
> >
> > s@online@offline@
> >
> > And reading after myself it turned out to sound differently than I
> > meant. What I wanted to say really is, what is the difference that
> > css_tryget_online really guarantee when the css might go offline right
> > after the call suceeds so more specifically what is the difference
> > between
> > if (css_tryget()) {
> > if (online)
> > DO_SOMETHING
> > }
> > and
> > if (css_tryget_online()) {
> > DO_SOMETHING
> > }
> >
> > both of them are racy and do not provide any guarantee wrt. online
> > state.
>
> Let me step back a little bit.
>
> I think, we all agree that css_tryget_online() has a weird semantics,
> in most cases is used only due to historical reasons and clearly asks
> for a cleanup. So I suggest to stop arguing about it and wait for the
> cleanup patchset. Then we can discuss each remaining use case in details,
> if there will be any.

Yes I am all in favor of the clean up patches as well as getting down
to the bottom of the underlying issue (race). Andrew has already sent
these two patches to Linus, unfortunatelly, even though the changelog
is slightly misleading (btw 18fa84a2db0e has the similar incorrect
reasoning).
--
Michal Hocko
SUSE Labs

2019-11-21 15:30:17

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm()

On Wed, Nov 13, 2019 at 05:08:29PM +0000, Roman Gushchin <[email protected]> wrote:
> The thing here is that css_tryget_online() cannot pin the online state,
> so even if returned true, the cgroup can be offline at the return from
> the function.
Adding this for the record.

The actual offlining happens only from css_killed_ref_fn, which is
percpu_ref_kill_and_confirm confirmation callback. That is only called
after RCU grace period. So, css_tryget_online will pin the online state
_inside_ rcu_read_{,un}lock section.

The fact is that many call sites pass the css reference over
rcu_read_unlock boundary allowing potential offlining.

> So if we rely somewhere on it, it's already broken.
Charges into offlined memcg should fine (wrt the original patch). I
didn't check other callers though...

> Generally speaking, it's better to reduce it's usage to the bare minimum.
...I agree as it's confusing and potentially redundant.

Michal

Attachments:

(No filename) (957.00 B)
signature.asc (849.00 B)
Digital signature Download all attachments

2019-11-22 08:22:29