On Tue 18-09-12 18:04:09, Glauber Costa wrote:
> A lot of the initialization we do in mem_cgroup_create() is done with softirqs
> enabled. This includes grabbing a css id, which holds &ss->id_lock->rlock, and
> the per-zone trees, which hold rtpz->lock->rlock. All of those signal to the
> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This
> means that the freeing of memcg structure must happen in a compatible context,
> otherwise we'll get a deadlock.
Maybe I am missing something obvious, but why cannot we simply disable
(soft)irqs in mem_cgroup_create rather than make the free path much more
complicated. It really feels strange to defer everything (e.g. soft
reclaim tree cleanup which should be a no-op at the time because there
shouldn't be any user pages in the group).
> The reference counting mechanism we use allows the memcg structure to be freed
> later and outlive the actual memcg destruction from the filesystem. However, we
> have little, if any, means to guarantee in which context the last memcg_put
> will happen. The best we can do is test it and try to make sure no invalid
> context releases are happening. But as we add more code to memcg, the possible
> interactions grow in number and expose more ways to get context conflicts.
>
> We already moved a part of the freeing to a worker thread to be context-safe
> for the static branches disabling. I see no reason not to do it for the whole
> freeing action. I consider this to be the safe choice.
>
> Signed-off-by: Glauber Costa <[email protected]>
> Tested-by: Greg Thelen <[email protected]>
> CC: KAMEZAWA Hiroyuki <[email protected]>
> CC: Michal Hocko <[email protected]>
> CC: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 66 +++++++++++++++++++++++++++++----------------------------
> 1 file changed, 34 insertions(+), 32 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b05ecac..74654f0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5082,16 +5082,29 @@ out_free:
> }
>
> /*
> - * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> - * but in process context. The work_freeing structure is overlaid
> - * on the rcu_freeing structure, which itself is overlaid on memsw.
> + * At destroying mem_cgroup, references from swap_cgroup can remain.
> + * (scanning all at force_empty is too costly...)
> + *
> + * Instead of clearing all references at force_empty, we remember
> + * the number of reference from swap_cgroup and free mem_cgroup when
> + * it goes down to 0.
> + *
> + * Removal of cgroup itself succeeds regardless of refs from swap.
> */
> -static void free_work(struct work_struct *work)
> +
> +static void __mem_cgroup_free(struct mem_cgroup *memcg)
> {
> - struct mem_cgroup *memcg;
> + int node;
> int size = sizeof(struct mem_cgroup);
>
> - memcg = container_of(work, struct mem_cgroup, work_freeing);
> + mem_cgroup_remove_from_trees(memcg);
> + free_css_id(&mem_cgroup_subsys, &memcg->css);
> +
> + for_each_node(node)
> + free_mem_cgroup_per_zone_info(memcg, node);
> +
> + free_percpu(memcg->stat);
> +
> /*
> * We need to make sure that (at least for now), the jump label
> * destruction code runs outside of the cgroup lock. This is because
> @@ -5110,38 +5123,27 @@ static void free_work(struct work_struct *work)
> vfree(memcg);
> }
>
> -static void free_rcu(struct rcu_head *rcu_head)
> -{
> - struct mem_cgroup *memcg;
> -
> - memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> - INIT_WORK(&memcg->work_freeing, free_work);
> - schedule_work(&memcg->work_freeing);
> -}
>
> /*
> - * At destroying mem_cgroup, references from swap_cgroup can remain.
> - * (scanning all at force_empty is too costly...)
> - *
> - * Instead of clearing all references at force_empty, we remember
> - * the number of reference from swap_cgroup and free mem_cgroup when
> - * it goes down to 0.
> - *
> - * Removal of cgroup itself succeeds regardless of refs from swap.
> + * Helpers for freeing a kmalloc()ed/vzalloc()ed mem_cgroup by RCU,
> + * but in process context. The work_freeing structure is overlaid
> + * on the rcu_freeing structure, which itself is overlaid on memsw.
> */
> -
> -static void __mem_cgroup_free(struct mem_cgroup *memcg)
> +static void free_work(struct work_struct *work)
> {
> - int node;
> + struct mem_cgroup *memcg;
>
> - mem_cgroup_remove_from_trees(memcg);
> - free_css_id(&mem_cgroup_subsys, &memcg->css);
> + memcg = container_of(work, struct mem_cgroup, work_freeing);
> + __mem_cgroup_free(memcg);
> +}
>
> - for_each_node(node)
> - free_mem_cgroup_per_zone_info(memcg, node);
> +static void free_rcu(struct rcu_head *rcu_head)
> +{
> + struct mem_cgroup *memcg;
>
> - free_percpu(memcg->stat);
> - call_rcu(&memcg->rcu_freeing, free_rcu);
> + memcg = container_of(rcu_head, struct mem_cgroup, rcu_freeing);
> + INIT_WORK(&memcg->work_freeing, free_work);
> + schedule_work(&memcg->work_freeing);
> }
>
> static void mem_cgroup_get(struct mem_cgroup *memcg)
> @@ -5153,7 +5155,7 @@ static void __mem_cgroup_put(struct mem_cgroup *memcg, int count)
> {
> if (atomic_sub_and_test(count, &memcg->refcnt)) {
> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> - __mem_cgroup_free(memcg);
> + call_rcu(&memcg->rcu_freeing, free_rcu);
> if (parent)
> mem_cgroup_put(parent);
> }
> --
> 1.7.11.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe cgroups" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Michal Hocko
SUSE Labs
On 10/01/2012 05:27 PM, Michal Hocko wrote:
> On Tue 18-09-12 18:04:09, Glauber Costa wrote:
>> A lot of the initialization we do in mem_cgroup_create() is done with softirqs
>> enabled. This includes grabbing a css id, which holds &ss->id_lock->rlock, and
>> the per-zone trees, which hold rtpz->lock->rlock. All of those signal to the
>> lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This
>> means that the freeing of memcg structure must happen in a compatible context,
>> otherwise we'll get a deadlock.
>
> Maybe I am missing something obvious, but why cannot we simply disable
> (soft)irqs in mem_cgroup_create rather than make the free path much more
> complicated. It really feels strange to defer everything (e.g. soft
> reclaim tree cleanup which should be a no-op at the time because there
> shouldn't be any user pages in the group).
>
Ok.
I was just able to come back to this today - I was mostly working on the
slab feedback over the past few days. I will answer yours and Tejun's
concerns at once:
Here is the situation: the backtrace I get is this one:
[ 124.956725] =================================
[ 124.957217] [ INFO: inconsistent lock state ]
[ 124.957217] 3.5.0+ #99 Not tainted
[ 124.957217] ---------------------------------
[ 124.957217] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[ 124.957217] ksoftirqd/0/3 [HC0[0]:SC1[1]:HE1:SE0] takes:
[ 124.957217] (&(&ss->id_lock)->rlock){+.?...}, at:
[<ffffffff810aa7b2>] spin_lock+0x9/0xb
[ 124.957217] {SOFTIRQ-ON-W} state was registered at:
[ 124.957217] [<ffffffff810996ed>] __lock_acquire+0x31f/0xd68
[ 124.957217] [<ffffffff8109a660>] lock_acquire+0x108/0x15c
[ 124.957217] [<ffffffff81534ec4>] _raw_spin_lock+0x40/0x4f
[ 124.957217] [<ffffffff810aa7b2>] spin_lock+0x9/0xb
[ 124.957217] [<ffffffff810ad00e>] get_new_cssid+0x69/0xf3
[ 124.957217] [<ffffffff810ad0da>] cgroup_init_idr+0x42/0x60
[ 124.957217] [<ffffffff81b20e04>] cgroup_init+0x50/0x100
[ 124.957217] [<ffffffff81b05b9b>] start_kernel+0x3b9/0x3ee
[ 124.957217] [<ffffffff81b052d6>] x86_64_start_reservations+0xb1/0xb5
[ 124.957217] [<ffffffff81b053d8>] x86_64_start_kernel+0xfe/0x10b
So what we learn from it is this: we are acquiring a specific lock (the
css id one) from softirq context. It was previously taken in a
softirq-enabled context, apparently coming directly from get_new_cssid.
Tejun correctly pointed out that we should never acquire that lock from
a softirq context, and he is right.
But the situation changes slightly with kmem. Now, the following excerpt
of a backtrace is possible:
[ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
[ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c
[ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
[ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
[ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
[ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
[ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
[ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277
So as you can see, free_accounted_pages() (which will trigger a
memcg_put() -> mem_cgroup_free()) can now be called from softirq
context, namely from an rcu callback (and I just realized I wrote the
exact opposite in the subj line: man, I really suck at that!!)
As a matter of fact, we could not do the freeing in our own rcu
callback either: we need to move it to a worker thread with the rest.
We already have a worker thread. The reason we have it is not
static_branches: the reason is vfree(), which will BUG_ON(in_interrupt())
and so cannot be called from an rcu callback either. We later moved the
static branch updates in there as well for a similar problem, but they
are not why the worker thread was introduced.
Could we move just part of it to the worker thread? Absolutely yes.
Moving just free_css_id() is enough to make it work. But since this is
not the first context-related problem we have had, I thought: "to hell
with that, let's move everything and be safe".
I am fine moving free_css_id() only if you would prefer.
Can we disable softirqs when we initialize css_id? Maybe. My machine
seems to boot fine and survive the simple workload that would trigger
the bug if I use irqsave spinlocks instead of normal spinlocks. But
this has to be done from the cgroup core: we have no control over css
creation in memcg.
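(For reference, the irqsave variant tried here would look roughly like
this in the cgroup core. This is an illustrative sketch only, with
get_new_cssid's internals abbreviated, not the approach ultimately
taken:)

```c
/* Hypothetical sketch: make ss->id_lock usable from softirq context by
 * disabling interrupts around it, instead of deferring the free path.
 */
static int get_new_cssid(struct cgroup_subsys *ss, int depth)
{
	unsigned long flags;
	int myid, error;

	/* ... idr preloading elided ... */
	spin_lock_irqsave(&ss->id_lock, flags);
	error = idr_get_new_above(&ss->idr, NULL, 1, &myid);
	spin_unlock_irqrestore(&ss->id_lock, flags);
	/* ... error handling elided ... */
	return myid;
}
```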
How would you guys like me to handle this ?
On 10/04/2012 02:53 PM, Glauber Costa wrote:
> [...]
One more thing: As I mentioned in the Changelog,
mem_cgroup_remove_exceeded(), called from mem_cgroup_remove_from_trees()
will lead to the same usage pattern.
On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote:
> [...]
> How would you guys like me to handle this ?
Without the vfree callback, I would have preferred just making the
id_lock softirq safe. But since we have to defer (parts of) the freeing
anyway, I prefer your approach of just deferring the rest as well.
But please add comments why the stuff in there is actually deferred.
Just simple notes like:
"this can be called from atomic contexts, <examples>",
"vfree must run from process context" and "css_id locking is not soft
irq safe",
"to hell with that, let's just do everything from the workqueue and be
safe and simple".
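For instance, the notes could land directly on the deferred path,
something like the following (an illustrative kernel-style sketch, not
part of the patch; the function shape is from the diff above and the
comment text just paraphrases the suggestions):

```c
/*
 * The final memcg_put() can happen from atomic context, e.g. an rcu
 * callback freeing kmem-accounted pages from softirq. vfree() must run
 * from process context, and the css_id locking is not softirq safe, so
 * rather than splitting the teardown across contexts, do everything
 * from the workqueue and be safe and simple.
 */
static void free_work(struct work_struct *work)
{
	struct mem_cgroup *memcg;

	memcg = container_of(work, struct mem_cgroup, work_freeing);
	__mem_cgroup_free(memcg);
}
```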
(And this may be personal preference, but why have free_work call
__mem_cgroup_free()? Does anyone else need to call that code? There
are too many layers already, why not just keep it all in free_work()
and have one less stack frame on your mind? :))
As for the changelog, here is my attempt:
---
mm: memcg: defer whole memcg tear-down to workqueue
The final memcg put can already happen in atomic context and so the
freeing is deferred to a workqueue because it needs to use vfree().
Kmem tracking will add freeing from softirq context, but the id_lock
acquired when destroying the cgroup object is not softirq safe, e.g.:
> [ 48.602775] [<ffffffff81103095>] free_accounted_pages+0x47/0x4c
> [ 48.602775] [<ffffffff81047f90>] free_task+0x31/0x5c
> [ 48.602775] [<ffffffff8104807d>] __put_task_struct+0xc2/0xdb
> [ 48.602775] [<ffffffff8104dfc7>] put_task_struct+0x1e/0x22
> [ 48.602775] [<ffffffff8104e144>] delayed_put_task_struct+0x7a/0x98
> [ 48.602775] [<ffffffff810cf0e5>] __rcu_process_callbacks+0x269/0x3df
> [ 48.602775] [<ffffffff810cf28c>] rcu_process_callbacks+0x31/0x5b
> [ 48.602775] [<ffffffff8105266d>] __do_softirq+0x122/0x277
To avoid making tear-down too complicated - making locks soft irq
safe, having half the cleanup in one function and the other half
somewhere else - just defer everything to the workqueue.
On 10/05/2012 07:31 PM, Johannes Weiner wrote:
> On Thu, Oct 04, 2012 at 02:53:13PM +0400, Glauber Costa wrote:
> [...]
> (And this may be personal preference, but why have free_work call
> __mem_cgroup_free()? Does anyone else need to call that code? There
> are too many layers already, why not just keep it all in free_work()
> and have one less stack frame on your mind? :))
>
It is used on the error path, when mem_cgroup_create() fails.