2018-04-26 11:02:06

by Kirill Tkhai

Subject: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

This function searches for a new mm owner in children and siblings,
and then iterates over all processes in the system in the unlikely case.
Although the case is unlikely, its probability grows with the number
of processes in the system. The time spent on the iterations also grows.
I regularly observe mm_update_next_owner() in crash dumps (not related
to this function) of nodes with many processes (20K+), so it looks
like this case is not so unlikely after all.

The patchset reworks the function's locking and makes it use
rcu_read_lock() for the iterations over all tasks in the system. This is
possible because task_struct::mm may be inherited by child
processes and threads only (except kernel threads), which are added
to the end of the tasks list or threads list on fork(). So, the patchset
uses RCU and memory barriers to make a race-free traversal over the lists [4/4].
Patches [1-3/4] are preparations for that.

Kirill
---

Kirill Tkhai (4):
exit: Move read_unlock() up in mm_update_next_owner()
exit: Use rcu instead of get_task_struct() in mm_update_next_owner()
exit: Rename assign_new_owner label in mm_update_next_owner()
exit: Lockless iteration over task list in mm_update_next_owner()


kernel/exit.c | 58 ++++++++++++++++++++++++++++++++++++++++++---------------
kernel/fork.c | 1 +
kernel/pid.c | 5 ++++-
3 files changed, 48 insertions(+), 16 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2018-04-26 11:02:47

by Kirill Tkhai

Subject: [PATCH 2/4] exit: Use rcu instead of get_task_struct() in mm_update_next_owner()

Since release_task() puts the final task_struct::usage counter
at least one RCU grace period after removing the task from the task list:

__exit_signal()
__unhash_process()
call_rcu(&p->rcu, delayed_put_task_struct)

rcu_read_lock() guarantees that nobody frees the task_struct memory.
So, it's possible to use this primitive instead of get_task_struct().

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/exit.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 9fb7b699bdeb..5c42a9b9c1d7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -464,7 +464,7 @@ void mm_update_next_owner(struct mm_struct *mm)
return;

assign_new_owner:
- get_task_struct(c);
+ rcu_read_lock();
read_unlock(&tasklist_lock);
BUG_ON(c == p);

@@ -475,12 +475,12 @@ void mm_update_next_owner(struct mm_struct *mm)
task_lock(c);
if (c->mm != mm) {
task_unlock(c);
- put_task_struct(c);
+ rcu_read_unlock();
goto retry;
}
mm->owner = c;
task_unlock(c);
- put_task_struct(c);
+ rcu_read_unlock();
}
#endif /* CONFIG_MEMCG */



2018-04-26 11:03:26

by Kirill Tkhai

Subject: [PATCH 3/4] exit: Rename assign_new_owner label in mm_update_next_owner()

The next patch introduces a nolock variant of this label,
so we will have new_owner and new_owner_nolock;
assign_new_owner_nolock would be too long there.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/exit.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 5c42a9b9c1d7..40f734ed1193 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -430,7 +430,7 @@ void mm_update_next_owner(struct mm_struct *mm)
*/
list_for_each_entry(c, &p->children, sibling) {
if (c->mm == mm)
- goto assign_new_owner;
+ goto new_owner;
}

/*
@@ -438,7 +438,7 @@ void mm_update_next_owner(struct mm_struct *mm)
*/
list_for_each_entry(c, &p->real_parent->children, sibling) {
if (c->mm == mm)
- goto assign_new_owner;
+ goto new_owner;
}

/*
@@ -449,7 +449,7 @@ void mm_update_next_owner(struct mm_struct *mm)
continue;
for_each_thread(g, c) {
if (c->mm == mm)
- goto assign_new_owner;
+ goto new_owner;
if (c->mm)
break;
}
@@ -463,7 +463,7 @@ void mm_update_next_owner(struct mm_struct *mm)
mm->owner = NULL;
return;

-assign_new_owner:
+new_owner:
rcu_read_lock();
read_unlock(&tasklist_lock);
BUG_ON(c == p);


2018-04-26 11:03:35

by Kirill Tkhai

Subject: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

This patch finalizes the series and makes mm_update_next_owner()
iterate over the task list using RCU instead of tasklist_lock.
This is possible because of the rules of mm inheritance: an mm may be
propagated to a child only, while only a kernel thread can obtain
someone else's mm via use_mm().

Also, all new tasks are added to the tail of the tasks list or threads list.
The only exception is transfer_pid() in de_thread(), when the group
leader is replaced by another thread. But transfer_pid() is called
only on successful exec, where a new mm is allocated, so it
can't be interesting for mm_update_next_owner().

This patch uses alloc_pid() as a memory barrier, which is possible
since it contains two or more spin_lock()/spin_unlock() pairs.
A single pair does not imply a barrier, while two pairs do.

There are three barriers:

1) for_each_process(g)                copy_process()
                                        p->mm = mm
   smp_rmb();                           smp_wmb() implied by alloc_pid()
   if (g->flags & PF_KTHREAD)           list_add_tail_rcu(&p->tasks, &init_task.tasks)

2) for_each_thread(g, c)              copy_process()
                                        p->mm = mm
   smp_rmb();                           smp_wmb() implied by alloc_pid()
   tmp = READ_ONCE(c->mm)               list_add_tail_rcu(&p->thread_node, ...)

3) for_each_thread(g, c)              copy_process()
                                        list_add_tail_rcu(&p->thread_node, ...)
   p->mm != NULL check                do_exit()
   smp_rmb()                            smp_mb();
   get next thread in loop              p->mm = NULL


This patch may be useful for machines with many processes executing.
I regularly observe mm_update_next_owner() executing on one of the CPUs
in crash dumps (not related to this function) on big machines. Even
if iteration over the task list looks like an unlikely situation, its
regularity grows with the number of containers/processes.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/exit.c | 39 +++++++++++++++++++++++++++++++++++----
kernel/fork.c | 1 +
kernel/pid.c | 5 ++++-
3 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 40f734ed1193..7ce4cdf96a64 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -406,6 +406,8 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
void mm_update_next_owner(struct mm_struct *mm)
{
struct task_struct *c, *g, *p = current;
+ struct mm_struct *tmp;
+ struct list_head *n;

retry:
/*
@@ -440,21 +442,49 @@ void mm_update_next_owner(struct mm_struct *mm)
if (c->mm == mm)
goto new_owner;
}
+ read_unlock(&tasklist_lock);

/*
* Search through everything else, we should not get here often.
*/
+ rcu_read_lock();
for_each_process(g) {
+ /*
+ * g->signal, g->mm and g->flags initialization of a just
+ * created task must not reorder with linking the task to
+ * tasks list. Pairs with smp_mb() implied by alloc_pid().
+ */
+ smp_rmb();
if (g->flags & PF_KTHREAD)
continue;
for_each_thread(g, c) {
- if (c->mm == mm)
- goto new_owner;
- if (c->mm)
+ /*
+ * Make visible mm of iterated thread.
+ * Pairs with smp_mb() implied by alloc_pid().
+ */
+ if (c != g)
+ smp_rmb();
+ tmp = READ_ONCE(c->mm);
+ if (tmp == mm)
+ goto new_owner_nolock;
+ if (likely(tmp))
break;
+ n = READ_ONCE(c->thread_node.next);
+ /*
+ * All mm are NULL, so iterated threads already exited.
+ * Make sure we see their children.
+ * Pairs with smp_mb() in do_exit().
+ */
+ if (n == &g->signal->thread_head)
+ smp_rmb();
}
+ /*
+ * Children of exited thread group are visible due to the above
+ * smp_rmb(). Threads with mm != NULL can't create a child with
+ * the mm we're looking for. So, no additional smp_rmb() needed.
+ */
}
- read_unlock(&tasklist_lock);
+ rcu_read_unlock();
/*
* We found no owner yet mm_users > 1: this implies that we are
* most likely racing with swapoff (try_to_unuse()) or /proc or
@@ -466,6 +496,7 @@ void mm_update_next_owner(struct mm_struct *mm)
new_owner:
rcu_read_lock();
read_unlock(&tasklist_lock);
+new_owner_nolock:
BUG_ON(c == p);

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..2032d4657546 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1805,6 +1805,7 @@ static __latent_entropy struct task_struct *copy_process(
goto bad_fork_cleanup_io;

if (pid != &init_struct_pid) {
+ /* Successfully returned, this function implies smp_mb() */
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
diff --git a/kernel/pid.c b/kernel/pid.c
index 157fe4b19971..cb96473aa058 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -155,7 +155,10 @@ void free_pid(struct pid *pid)

call_rcu(&pid->rcu, delayed_put_pid);
}
-
+/*
+ * This function contains at least two sequential spin_lock()/spin_unlock() pairs,
+ * and together they imply a full memory barrier.
+ */
struct pid *alloc_pid(struct pid_namespace *ns)
{
struct pid *pid;


2018-04-26 11:04:15

by Kirill Tkhai

Subject: [PATCH 1/4] exit: Move read_unlock() up in mm_update_next_owner()

All the places where we update a task's mm do this
under task_lock(). These are exec_mmap(), exit_mm()
and the kernel threads' use_mm() and unuse_mm().
The only exception is copy_mm(), which initializes
a newborn task's mm, but we can't race with it, as
mm_update_next_owner() iterates tasks already linked
to the task list.

get_task_struct() guarantees that the task_struct can't be freed.
exec_mmap() and exit_mm() both call mm_update_next_owner():

task_lock(current);
current->mm = new_mm;
task_unlock(current);
mm_update_next_owner(mm);

and since mm_update_next_owner() checks and assigns mm under
task_lock() too, they can't miss that they became the new mm owner.
So we can relax the locking and release tasklist_lock earlier.

Signed-off-by: Kirill Tkhai <[email protected]>
---
kernel/exit.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..9fb7b699bdeb 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -464,18 +464,15 @@ void mm_update_next_owner(struct mm_struct *mm)
return;

assign_new_owner:
- BUG_ON(c == p);
get_task_struct(c);
+ read_unlock(&tasklist_lock);
+ BUG_ON(c == p);
+
/*
* The task_lock protects c->mm from changing.
* We always want mm->owner->mm == mm
*/
task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
if (c->mm != mm) {
task_unlock(c);
put_task_struct(c);


2018-04-26 12:37:35

by Andrea Parri

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

Hi Kirill,

On Thu, Apr 26, 2018 at 02:01:07PM +0300, Kirill Tkhai wrote:
> The patch finalizes the series and makes mm_update_next_owner()
> to iterate over task list using RCU instead of tasklist_lock.
> This is possible because of rules of inheritance of mm: it may be
> propagated to a child only, while only kernel thread can obtain
> someone else's mm via use_mm().
>
> Also, all new tasks are added to tail of tasks list or threads list.
> The only exception is transfer_pid() in de_thread(), when group
> leader is replaced by another thread. But transfer_pid() is called
> in case of successful exec only, where new mm is allocated, so it
> can't be interesting for mm_update_next_owner().
>
> This patch uses alloc_pid() as a memory barrier, and it's possible
> since it contains two or more spin_lock()/spin_unlock() pairs.
> Single pair does not imply a barrier, while two pairs do imply that.
>
> There are three barriers:
>
> 1)for_each_process(g) copy_process()
> p->mm = mm
> smp_rmb(); smp_wmb() implied by alloc_pid()
> if (g->flags & PF_KTHREAD) list_add_tail_rcu(&p->tasks, &init_task.tasks)
>
> 2)for_each_thread(g, c) copy_process()
> p->mm = mm
> smp_rmb(); smp_wmb() implied by alloc_pid()
> tmp = READ_ONCE(c->mm) list_add_tail_rcu(&p->thread_node, ...)
>
> 3)for_each_thread(g, c) copy_process()
> list_add_tail_rcu(&p->thread_node, ...)
> p->mm != NULL check do_exit()
> smp_rmb() smp_mb();
> get next thread in loop p->mm = NULL
>
>
> This patch may be useful for machines with many processes executing.
> I regulary observe mm_update_next_owner() executing on one of the cpus
> in crash dumps (not related to this function) on big machines. Even
> if iteration over task list looks as unlikely situation, it regularity
> grows with the growth of containers/processes numbers.
>
> Signed-off-by: Kirill Tkhai <[email protected]>
> ---
> kernel/exit.c | 39 +++++++++++++++++++++++++++++++++++----
> kernel/fork.c | 1 +
> kernel/pid.c | 5 ++++-
> 3 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 40f734ed1193..7ce4cdf96a64 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -406,6 +406,8 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
> void mm_update_next_owner(struct mm_struct *mm)
> {
> struct task_struct *c, *g, *p = current;
> + struct mm_struct *tmp;
> + struct list_head *n;
>
> retry:
> /*
> @@ -440,21 +442,49 @@ void mm_update_next_owner(struct mm_struct *mm)
> if (c->mm == mm)
> goto new_owner;
> }
> + read_unlock(&tasklist_lock);
>
> /*
> * Search through everything else, we should not get here often.
> */
> + rcu_read_lock();
> for_each_process(g) {
> + /*
> + * g->signal, g->mm and g->flags initialization of a just
> + * created task must not reorder with linking the task to
> + * tasks list. Pairs with smp_mb() implied by alloc_pid().
> + */
> + smp_rmb();
> if (g->flags & PF_KTHREAD)
> continue;
> for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto new_owner;
> - if (c->mm)
> + /*
> + * Make visible mm of iterated thread.
> + * Pairs with smp_mb() implied by alloc_pid().
> + */
> + if (c != g)
> + smp_rmb();
> + tmp = READ_ONCE(c->mm);
> + if (tmp == mm)
> + goto new_owner_nolock;
> + if (likely(tmp))
> break;
> + n = READ_ONCE(c->thread_node.next);
> + /*
> + * All mm are NULL, so iterated threads already exited.
> + * Make sure we see their children.
> + * Pairs with smp_mb() in do_exit().
> + */
> + if (n == &g->signal->thread_head)
> + smp_rmb();
> }
> + /*
> + * Children of exited thread group are visible due to the above
> + * smp_rmb(). Threads with mm != NULL can't create a child with
> + * the mm we're looking for. So, no additional smp_rmb() needed.
> + */
> }
> - read_unlock(&tasklist_lock);
> + rcu_read_unlock();
> /*
> * We found no owner yet mm_users > 1: this implies that we are
> * most likely racing with swapoff (try_to_unuse()) or /proc or
> @@ -466,6 +496,7 @@ void mm_update_next_owner(struct mm_struct *mm)
> new_owner:
> rcu_read_lock();
> read_unlock(&tasklist_lock);
> +new_owner_nolock:
> BUG_ON(c == p);
>
> /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a5d21c42acfc..2032d4657546 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1805,6 +1805,7 @@ static __latent_entropy struct task_struct *copy_process(
> goto bad_fork_cleanup_io;
>
> if (pid != &init_struct_pid) {
> + /* Successfuly returned, this function imply smp_mb() */
> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 157fe4b19971..cb96473aa058 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -155,7 +155,10 @@ void free_pid(struct pid *pid)
>
> call_rcu(&pid->rcu, delayed_put_pid);
> }
> -
> +/*
> + * This function contains at least two sequential spin_lock()/spin_unlock(),
> + * and together they imply full memory barrier.

Mmh, it's possible that I am misunderstanding this statement but it does
not seem quite correct to me; a counter-example would be provided by the
test at "tools/memory-model/litmus-tests/SB+mbonceonces.litmus" (replace
either of the smp_mb() with the sequence:

spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s); ).

BTW, your commit message suggests that your case would work with "imply
an smp_wmb()". This implication should hold "w.r.t. current implementa-
tions". We (LKMM people) discussed changes to the LKMM to make it hold
in LKMM but such changes are still in our TODO list as of today...

Andrea


> + */
> struct pid *alloc_pid(struct pid_namespace *ns)
> {
> struct pid *pid;
>

2018-04-26 13:08:58

by Michal Hocko

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Thu 26-04-18 14:00:19, Kirill Tkhai wrote:
> This function searches for a new mm owner in children and siblings,
> and then iterates over all processes in the system in unlikely case.
> Despite the case is unlikely, its probability growths with the number
> of processes in the system. The time, spent on iterations, also growths.
> I regulary observe mm_update_next_owner() in crash dumps (not related
> to this function) of the nodes with many processes (20K+), so it looks
> like it's not so unlikely case.

Did you manage to find the pattern that forces mm_update_next_owner into
its slow paths? This really shouldn't trigger very often. If we can hit the
fallback easily then I suspect that we should be better off reconsidering
mm->owner and trying to come up with something more clever. I've had a
patch to remove the owner a few years back. It needed some work to finish, but
maybe that would be better than trying to make a non-scalable thing suck
less.

--
Michal Hocko
SUSE Labs

2018-04-26 13:54:32

by Kirill Tkhai

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On 26.04.2018 15:35, Andrea Parri wrote:
> Hi Kirill,
>
> On Thu, Apr 26, 2018 at 02:01:07PM +0300, Kirill Tkhai wrote:
>> The patch finalizes the series and makes mm_update_next_owner()
>> to iterate over task list using RCU instead of tasklist_lock.
>> This is possible because of rules of inheritance of mm: it may be
>> propagated to a child only, while only kernel thread can obtain
>> someone else's mm via use_mm().
>>
>> Also, all new tasks are added to tail of tasks list or threads list.
>> The only exception is transfer_pid() in de_thread(), when group
>> leader is replaced by another thread. But transfer_pid() is called
>> in case of successful exec only, where new mm is allocated, so it
>> can't be interesting for mm_update_next_owner().
>>
>> This patch uses alloc_pid() as a memory barrier, and it's possible
>> since it contains two or more spin_lock()/spin_unlock() pairs.
>> Single pair does not imply a barrier, while two pairs do imply that.
>>
>> There are three barriers:
>>
>> 1)for_each_process(g) copy_process()
>> p->mm = mm
>> smp_rmb(); smp_wmb() implied by alloc_pid()
>> if (g->flags & PF_KTHREAD) list_add_tail_rcu(&p->tasks, &init_task.tasks)
>>
>> 2)for_each_thread(g, c) copy_process()
>> p->mm = mm
>> smp_rmb(); smp_wmb() implied by alloc_pid()
>> tmp = READ_ONCE(c->mm) list_add_tail_rcu(&p->thread_node, ...)
>>
>> 3)for_each_thread(g, c) copy_process()
>> list_add_tail_rcu(&p->thread_node, ...)
>> p->mm != NULL check do_exit()
>> smp_rmb() smp_mb();
>> get next thread in loop p->mm = NULL
>>
>>
>> This patch may be useful for machines with many processes executing.
>> I regulary observe mm_update_next_owner() executing on one of the cpus
>> in crash dumps (not related to this function) on big machines. Even
>> if iteration over task list looks as unlikely situation, it regularity
>> grows with the growth of containers/processes numbers.
>>
>> Signed-off-by: Kirill Tkhai <[email protected]>
>> ---
>> kernel/exit.c | 39 +++++++++++++++++++++++++++++++++++----
>> kernel/fork.c | 1 +
>> kernel/pid.c | 5 ++++-
>> 3 files changed, 40 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 40f734ed1193..7ce4cdf96a64 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -406,6 +406,8 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
>> void mm_update_next_owner(struct mm_struct *mm)
>> {
>> struct task_struct *c, *g, *p = current;
>> + struct mm_struct *tmp;
>> + struct list_head *n;
>>
>> retry:
>> /*
>> @@ -440,21 +442,49 @@ void mm_update_next_owner(struct mm_struct *mm)
>> if (c->mm == mm)
>> goto new_owner;
>> }
>> + read_unlock(&tasklist_lock);
>>
>> /*
>> * Search through everything else, we should not get here often.
>> */
>> + rcu_read_lock();
>> for_each_process(g) {
>> + /*
>> + * g->signal, g->mm and g->flags initialization of a just
>> + * created task must not reorder with linking the task to
>> + * tasks list. Pairs with smp_mb() implied by alloc_pid().
>> + */
>> + smp_rmb();
>> if (g->flags & PF_KTHREAD)
>> continue;
>> for_each_thread(g, c) {
>> - if (c->mm == mm)
>> - goto new_owner;
>> - if (c->mm)
>> + /*
>> + * Make visible mm of iterated thread.
>> + * Pairs with smp_mb() implied by alloc_pid().
>> + */
>> + if (c != g)
>> + smp_rmb();
>> + tmp = READ_ONCE(c->mm);
>> + if (tmp == mm)
>> + goto new_owner_nolock;
>> + if (likely(tmp))
>> break;
>> + n = READ_ONCE(c->thread_node.next);
>> + /*
>> + * All mm are NULL, so iterated threads already exited.
>> + * Make sure we see their children.
>> + * Pairs with smp_mb() in do_exit().
>> + */
>> + if (n == &g->signal->thread_head)
>> + smp_rmb();
>> }
>> + /*
>> + * Children of exited thread group are visible due to the above
>> + * smp_rmb(). Threads with mm != NULL can't create a child with
>> + * the mm we're looking for. So, no additional smp_rmb() needed.
>> + */
>> }
>> - read_unlock(&tasklist_lock);
>> + rcu_read_unlock();
>> /*
>> * We found no owner yet mm_users > 1: this implies that we are
>> * most likely racing with swapoff (try_to_unuse()) or /proc or
>> @@ -466,6 +496,7 @@ void mm_update_next_owner(struct mm_struct *mm)
>> new_owner:
>> rcu_read_lock();
>> read_unlock(&tasklist_lock);
>> +new_owner_nolock:
>> BUG_ON(c == p);
>>
>> /*
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index a5d21c42acfc..2032d4657546 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1805,6 +1805,7 @@ static __latent_entropy struct task_struct *copy_process(
>> goto bad_fork_cleanup_io;
>>
>> if (pid != &init_struct_pid) {
>> + /* Successfuly returned, this function imply smp_mb() */
>> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
>> if (IS_ERR(pid)) {
>> retval = PTR_ERR(pid);
>> diff --git a/kernel/pid.c b/kernel/pid.c
>> index 157fe4b19971..cb96473aa058 100644
>> --- a/kernel/pid.c
>> +++ b/kernel/pid.c
>> @@ -155,7 +155,10 @@ void free_pid(struct pid *pid)
>>
>> call_rcu(&pid->rcu, delayed_put_pid);
>> }
>> -
>> +/*
>> + * This function contains at least two sequential spin_lock()/spin_unlock(),
>> + * and together they imply full memory barrier.
>
> Mmh, it's possible that I am misunderstanding this statement but it does
> not seem quite correct to me; a counter-example would be provided by the
> test at "tools/memory-model/litmus-tests/SB+mbonceonces.litmus" (replace
> either of the smp_mb() with the sequence:
>
> spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s); ).
>
> BTW, your commit message suggests that your case would work with "imply
> an smp_wmb()". This implication should hold "w.r.t. current implementa-
> tions". We (LKMM people) discussed changes to the LKMM to make it hold
> in LKMM but such changes are still in our TODO list as of today...

I'm not close to LKMM, so the test you referenced is not clear to me.
Does LKMM show the real hardware behavior? Or does it cover most
cases, with work still in progress?

In the patch I used the logic that the below code:

x = A;
spin_lock();
spin_unlock();
spin_lock();
spin_unlock();
y = B;

cannot be reordered more than:

spin_lock();
x = A;     <- this can't become visible later than spin_unlock()
spin_unlock();
spin_lock();
y = B;     <- this can't become visible earlier than spin_lock()
spin_unlock();

Is there a problem?

Kirill

2018-04-26 13:54:37

by Oleg Nesterov

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On 04/26, Michal Hocko wrote:
>
> then I suspect that we should be better off reconsidering
> mm->owner and try to come up with something more clever. I've had a
> patch to remove owner few years back.

Yeees, I do remember we discussed this but I forgot everything.

I agree, it would be nice to turn mm->owner into mm->mem_cgroup and just kill
the ugly mm_update_next_owner/etc.

Oleg.


2018-04-26 14:10:16

by Kirill Tkhai

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On 26.04.2018 16:07, Michal Hocko wrote:
> On Thu 26-04-18 14:00:19, Kirill Tkhai wrote:
>> This function searches for a new mm owner in children and siblings,
>> and then iterates over all processes in the system in unlikely case.
>> Despite the case is unlikely, its probability growths with the number
>> of processes in the system. The time, spent on iterations, also growths.
>> I regulary observe mm_update_next_owner() in crash dumps (not related
>> to this function) of the nodes with many processes (20K+), so it looks
>> like it's not so unlikely case.
>
> Did you manage to find the pattern that forces mm_update_next_owner to
> slow paths? This really shouldn't trigger very often. If we can fallback
> easily then I suspect that we should be better off reconsidering
> mm->owner and try to come up with something more clever. I've had a
> patch to remove owner few years back. It needed some work to finish but
> maybe that would be a better than try to make non-scalable thing suck
> less.

It's not easy to find a pattern with such a big number of processes, especially
because only the final result is visible in the crash dump. I assume this may
be connected with some unexpected signals received by the set of tasks in the
topology, but I'm not sure. Though, even if this is a most unlikely situation
with some small non-zero probability, it still has to be optimized to minimize
unexpected situations.

We can rework this simply by adding a list of tasks to the mm. Also, reuse
the task_lock() of the mm owner for that. Something like:

assign_task_mm(struct mm_struct *mm, struct task_struct *task)
{
	int again;
again:
	again = 0;
	rcu_read_lock();
	task = mm->owner;
	if (!task)
		goto unlock;
	task_lock(task);
	if (mm->owner == task)
		llist_add(&task->mm_list, &mm->task_list);
	else
		again = 1;
	task_unlock(task);
unlock:
	rcu_read_unlock();
	if (again)
		goto again;
}

Kirill

2018-04-26 15:02:43

by Oleg Nesterov

Subject: Re: [PATCH 1/4] exit: Move read_unlock() up in mm_update_next_owner()

On 04/26, Kirill Tkhai wrote:
>
> @@ -464,18 +464,15 @@ void mm_update_next_owner(struct mm_struct *mm)
> return;
>
> assign_new_owner:
> - BUG_ON(c == p);
> get_task_struct(c);
> + read_unlock(&tasklist_lock);
> + BUG_ON(c == p);
> +
> /*
> * The task_lock protects c->mm from changing.
> * We always want mm->owner->mm == mm
> */
> task_lock(c);
> - /*
> - * Delay read_unlock() till we have the task_lock()
> - * to ensure that c does not slip away underneath us
> - */
> - read_unlock(&tasklist_lock);

I think this is correct, but...

Firstly, I agree with Michal, it would be nice to kill mm_update_next_owner()
altogether.

If this is not possible I agree, it needs cleanups and we can change it to
avoid tasklist (although your 4/4 looks overcomplicated to me at first glance).

But in this case I think that whatever we do we should start with something like
the patch below. I wrote it 3 years ago but it still applies.

Oleg.

Subject: [PATCH 1/3] memcg: introduce assign_new_owner()

The code under "assign_new_owner" looks very ugly and suboptimal.

We do not really need get_task_struct/put_task_struct(), we can
simply recheck/change mm->owner under tasklist_lock. And we do not
want to restart from the very beginning if ->mm was changed by the
time we take task_lock(), we can simply continue (if we do not drop
tasklist_lock).

Just move this code into the new simple helper, assign_new_owner().

Signed-off-by: Oleg Nesterov <[email protected]>
---
kernel/exit.c | 56 ++++++++++++++++++++++++++------------------------------
1 files changed, 26 insertions(+), 30 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 22fcc05..4d446ab 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -293,6 +293,23 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}

#ifdef CONFIG_MEMCG
+static bool assign_new_owner(struct mm_struct *mm, struct task_struct *c)
+{
+ bool ret = false;
+
+ if (c->mm != mm)
+ return ret;
+
+ task_lock(c); /* protects c->mm from changing */
+ if (c->mm == mm) {
+ mm->owner = c;
+ ret = true;
+ }
+ task_unlock(c);
+
+ return ret;
+}
+
/*
* A task is exiting. If it owned this mm, find a new owner for the mm.
*/
@@ -300,7 +317,6 @@ void mm_update_next_owner(struct mm_struct *mm)
{
struct task_struct *c, *g, *p = current;

-retry:
/*
* If the exiting or execing task is not the owner, it's
* someone else's problem.
@@ -322,16 +338,16 @@ retry:
* Search in the children
*/
list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
+ if (assign_new_owner(mm, c))
+ goto done;
}

/*
* Search in the siblings
*/
list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
+ if (assign_new_owner(mm, c))
+ goto done;
}

/*
@@ -341,42 +357,22 @@ retry:
if (g->flags & PF_KTHREAD)
continue;
for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
+ if (assign_new_owner(mm, c))
+ goto done;
if (c->mm)
break;
}
}
- read_unlock(&tasklist_lock);
+
/*
* We found no owner yet mm_users > 1: this implies that we are
* most likely racing with swapoff (try_to_unuse()) or /proc or
* ptrace or page migration (get_task_mm()). Mark owner as NULL.
*/
mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
+done:
read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
+ return;
}
#endif /* CONFIG_MEMCG */

--
1.5.5.1




2018-04-26 15:11:55

by Oleg Nesterov

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On 04/26, Kirill Tkhai wrote:
>
> We can rework this simply by adding a list of tasks to mm.

Perhaps, but then I think this list should not depend on mm->owner.

I mean, mm->list_of_group_leaders_which_use_this_mm can be used by coredump
and oom-killer at least. But this is not that simple...

Oleg.


2018-04-26 15:23:05

by Peter Zijlstra

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> In the patch I used the logic, that the below code:
>
> x = A;
> spin_lock();
> spin_unlock();
> spin_lock();
> spin_unlock();
> y = B;
>
> cannot reorder much than:
>
> spin_lock();
> x = A; <- this can't become visible later, that spin_unlock()
> spin_unlock();
> spin_lock();
> y = B; <- this can't become visible earlier, than spin_lock()
> spin_unlock();
>
> Is there a problem?

The two stores will be ordered, but only at the strength of an
smp_wmb(). The above construct does not imply smp_mb(). The difference
is observable on real hardware (Power).

2018-04-26 15:23:27

by Peter Zijlstra

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> >>
> >> 1)for_each_process(g) copy_process()
> >> p->mm = mm
> >> smp_rmb(); smp_wmb() implied by alloc_pid()
> >> if (g->flags & PF_KTHREAD) list_add_tail_rcu(&p->tasks, &init_task.tasks)
> >>
> >> 2)for_each_thread(g, c) copy_process()
> >> p->mm = mm
> >> smp_rmb(); smp_wmb() implied by alloc_pid()
> >> tmp = READ_ONCE(c->mm) list_add_tail_rcu(&p->thread_node, ...)

For these two, what's the purpose of the smp_rmb()? Which loads are
ordered?

2018-04-26 15:31:38

by Andrea Parri

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
> On 26.04.2018 15:35, Andrea Parri wrote:

[...]

> >
> > Mmh, it's possible that I am misunderstanding this statement but it does
> > not seem quite correct to me; a counter-example would be provided by the
> > test at "tools/memory-model/litmus-tests/SB+mbonceonces.litmus" (replace
> > either of the smp_mb() with the sequence:
> >
> > spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s); ).
> >
> > BTW, your commit message suggests that your case would work with "imply
> > an smp_wmb()". This implication should hold "w.r.t. current implementa-
> > tions". We (LKMM people) discussed changes to the LKMM to make it hold
> > in LKMM but such changes are still in our TODO list as of today...
>
> I'm not close to LKMM, so the test you referenced is not clear for me.

The test could be concisely described by:

	{ initially: x = y = 0; }

	Thread0			Thread1

	x = 1;			y = 1;
	MB			MB
	r0 = y;			r1 = x;

Can r0 and r1 both be 0 after joining?

The answer to the question is -No-; however, if you replaced either of the
MBs with the locking sequence described above, then the answer is -Yes-:
full fences on both sides are required to forbid that state, and this is
something that the locking sequences won't be able to provide (think of
the implementation of these primitives on powerpc, for example).
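
For reference, here is a rough sketch of that test in the litmus format
used under "tools/memory-model/litmus-tests/" (reconstructed from memory,
so the exact file contents may differ slightly):

	C SB+mbonceonces

	{}

	P0(int *x, int *y)
	{
		int r0;

		WRITE_ONCE(*x, 1);
		smp_mb();
		r0 = READ_ONCE(*y);
	}

	P1(int *x, int *y)
	{
		int r1;

		WRITE_ONCE(*y, 1);
		smp_mb();
		r1 = READ_ONCE(*x);
	}

	exists (0:r0=0 /\ 1:r1=0)

Replacing either smp_mb() with the double lock/unlock sequence lets the
"exists" clause be satisfied on architectures such as powerpc.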


> Does LKMM show the real hardware behavior? Or there are added the most
> cases, and work is still in progress?

Very roughly speaking, LKMM is an "envelope" of the underlying hardware
memory models/architectures supported by the Linux kernel which in turn
may not coincide with the observable behavior on a given implementation
/processor of that architecture. Also, LKMM doesn't aim to be a "tight"
envelope. I'd refer to the documentation within "tools/memory-model/";
please let me know if I can provide further info.


>
> In the patch I used the logic, that the below code:
>
> x = A;
> spin_lock();
> spin_unlock();
> spin_lock();
> spin_unlock();
> y = B;
>
> cannot reorder much than:
>
> spin_lock();
> x = A; <- this can't become visible later, that spin_unlock()
> spin_unlock();
> spin_lock();
> y = B; <- this can't become visible earlier, than spin_lock()
> spin_unlock();
>
> Is there a problem?

As mentioned in the previous email, if smp_wmb() is what you're looking
for then this should be fine (considering current implementations; LKMM
will likely be there soon...).

BTW, the behavior in question has been recently discussed on the list;
c.f., for example, the test "unlock-lock-write-ordering" described in:

http://lkml.kernel.org/r/[email protected]

as well as

0123f4d76ca63b7b895f40089be0ce4809e392d8
("riscv/spinlock: Strengthen implementations with fences")

Andrea


>
> Kirill

2018-04-26 15:58:32

by Kirill Tkhai

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On 26.04.2018 18:20, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
>> In the patch I used the logic, that the below code:
>>
>> x = A;
>> spin_lock();
>> spin_unlock();
>> spin_lock();
>> spin_unlock();
>> y = B;
>>
>> cannot reorder much than:
>>
>> spin_lock();
>> x = A; <- this can't become visible later, that spin_unlock()
>> spin_unlock();
>> spin_lock();
>> y = B; <- this can't become visible earlier, than spin_lock()
>> spin_unlock();
>>
>> Is there a problem?
>
> The two stores will be ordered, but only at the strength of an
> smp_wmb(). The above construct does not imply smp_mb(). The difference
> is observable on real hardware (Power).

Ah, thanks.

But hopefully, smp_rmb() should be enough here.

2018-04-26 16:05:52

by Kirill Tkhai

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On 26.04.2018 18:20, Peter Zijlstra wrote:
> On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
>>>>
>>>> 1)for_each_process(g) copy_process()
>>>> p->mm = mm
>>>> smp_rmb(); smp_wmb() implied by alloc_pid()
>>>> if (g->flags & PF_KTHREAD) list_add_tail_rcu(&p->tasks, &init_task.tasks)
>>>>
>>>> 2)for_each_thread(g, c) copy_process()
>>>> p->mm = mm
>>>> smp_rmb(); smp_wmb() implied by alloc_pid()
>>>> tmp = READ_ONCE(c->mm) list_add_tail_rcu(&p->thread_node, ...)
>
> For these two; what's the purpose of the smp_rmb()? which loads are
> ordered?

(1) and (2) make g->flags and c->mm visible in case we see the process linked
to the task list.

Otherwise, in mm_update_next_owner() we might see the result of
list_add_tail_rcu(&p->thread_node, ...), but p->mm = mm would not be visible.

Hm, I've also looked into list_add_tail_rcu(), and it contains smp_wmb(). So it seems
we do not need the barrier implied by alloc_pid().
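
For reference, the publishing step in list_add_tail_rcu() looks roughly
like this (paraphrased from include/linux/rculist.h, not a verbatim copy):

	static inline void __list_add_rcu(struct list_head *new,
					  struct list_head *prev,
					  struct list_head *next)
	{
		new->next = next;
		new->prev = prev;
		rcu_assign_pointer(list_next_rcu(prev), new); /* release: publishes the entry */
		next->prev = new;
	}

Because the store that makes the new entry reachable is a release
(rcu_assign_pointer()), the earlier p->mm = mm store in copy_process()
is ordered before the entry becomes visible to a lockless reader.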

Kirill

2018-04-26 16:13:10

by Kirill Tkhai

Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()

On 26.04.2018 18:29, Andrea Parri wrote:
> On Thu, Apr 26, 2018 at 04:52:39PM +0300, Kirill Tkhai wrote:
>> On 26.04.2018 15:35, Andrea Parri wrote:
>
> [...]
>
>>>
>>> Mmh, it's possible that I am misunderstanding this statement but it does
>>> not seem quite correct to me; a counter-example would be provided by the
>>> test at "tools/memory-model/litmus-tests/SB+mbonceonces.litmus" (replace
>>> either of the smp_mb() with the sequence:
>>>
>>> spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s); ).
>>>
>>> BTW, your commit message suggests that your case would work with "imply
>>> an smp_wmb()". This implication should hold "w.r.t. current implementa-
>>> tions". We (LKMM people) discussed changes to the LKMM to make it hold
>>> in LKMM but such changes are still in our TODO list as of today...
>>
>> I'm not close to LKMM, so the test you referenced is not clear for me.
>
> The test could be concisely described by:
>
> {initially: x=y=0; }
>
> Thread0 Thread1
>
> x = 1; y = 1;
> MB MB
> r0 = y; r1 = x;
>
> Can r0,r1 be both 0 after joining?
>
> The answer to the question is -No-; however, if you replaced any of the
> MB with the locking sequence described above, then the answer is -Yes-:
> full fences on both sides are required to forbid that state and this is
> something that the locking sequences won't be able to provide (think at
> the implementation of these primitives for powerpc, for example).

Ah, I see, thanks for clarifying this.

>> Does LKMM show the real hardware behavior? Or there are added the most
>> cases, and work is still in progress?
>
> Very roughly speaking, LKMM is an "envelope" of the underlying hardware
> memory models/architectures supported by the Linux kernel which in turn
> may not coincide with the observable behavior on a given implementation
> /processor of that architecture. Also, LKMM doesn't aim to be a "tight"
> envelope. I'd refer to the documentation within "tools/memory-model/";
> please let me know if I can provide further info.
>
>
>>
>> In the patch I used the logic, that the below code:
>>
>> x = A;
>> spin_lock();
>> spin_unlock();
>> spin_lock();
>> spin_unlock();
>> y = B;
>>
>> cannot reorder much than:
>>
>> spin_lock();
>> x = A; <- this can't become visible later, that spin_unlock()
>> spin_unlock();
>> spin_lock();
>> y = B; <- this can't become visible earlier, than spin_lock()
>> spin_unlock();
>>
>> Is there a problem?
>
> As mentioned in the previous email, if smp_wmb() is what you're looking
> for then this should be fine (considering current implementations; LKMM
> will likely be there soon...).
>
> BTW, the behavior in question has been recently discussed on the list;
> c.f., for example, the test "unlock-lock-write-ordering" described in:
>
> http://lkml.kernel.org/r/[email protected]
>
> as well as
>
> 0123f4d76ca63b7b895f40089be0ce4809e392d8
> ("riscv/spinlock: Strengthen implementations with fences")
>
> Andrea

Yes, I'm looking for smp_wmb(). Read barrier is not required there.

Thanks for referring this.

Kirill

2018-04-26 16:24:51

by Eric W. Biederman

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> I've had a patch to remove owner few years back. It needed some work
> to finish but maybe that would be a better than try to make
> non-scalable thing suck less.

I have a question. Would it be reasonable to just have a mm->memcg?
That would appear to be the simplest solution to the problem.

That would require failing migration between memory cgroups if you are
not moving all of the processes/threads that have a given mm_struct. That
should not be a huge restriction, as typically it is only threads that
share an mm. Further, the check should be straightforward: counting the
number of threads and verifying the count matches the count on the
mm_struct.
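
A minimal sketch of that check (hypothetical helper, just to illustrate
the idea; a real patch would need to decide where exactly to call it):

	/*
	 * Refuse migration unless every user of the mm is a thread of the
	 * group being moved: compare the thread count against mm_users.
	 */
	static bool can_migrate_mm_users(struct task_struct *p, struct mm_struct *mm)
	{
		return atomic_read(&mm->mm_users) == get_nr_threads(p);
	}

(Transient references from get_task_mm() and friends could make such a
check spuriously fail, so this is only a sketch of the idea.)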

Eric







2018-04-26 19:30:58

by Michal Hocko

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Thu 26-04-18 11:19:33, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > I've had a patch to remove owner few years back. It needed some work
> > to finish but maybe that would be a better than try to make
> > non-scalable thing suck less.
>
> I have a question. Would it be reasonable to just have a mm->memcg?
> That would appear to be the simplest solution to the problem.

I do not remember the details. I have to re-read the whole thing again. I hope
to get to this soon, but with the current jet lag and backlog from the
LSFMM I'd rather not promise anything. Going with mm->memcg would be the
simplest of course, but I have a very vague recollection that it was
not possible. Maybe I misremember...

--
Michal Hocko
SUSE Labs

2018-04-27 07:10:17

by Michal Hocko

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Thu 26-04-18 21:28:18, Michal Hocko wrote:
> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote:
> > Michal Hocko <[email protected]> writes:
> >
> > > I've had a patch to remove owner few years back. It needed some work
> > > to finish but maybe that would be a better than try to make
> > > non-scalable thing suck less.
> >
> > I have a question. Would it be reasonable to just have a mm->memcg?
> > That would appear to be the simplest solution to the problem.
>
> I do not remember details. Have to re-read the whole thing again. Hope
> to get to this soon but with the current jet lag and backlog from the
> LSFMM I rather not promis anything. Going with mm->memcg would be the
> most simple of course but I have a very vague recollection that it was
> not possible. Maybe I misremember...

Just for the record, the last version where I've tried to remove owner
was posted here: http://lkml.kernel.org/r/[email protected]

I haven't gotten back to the details yet, but the primary problem was
task migration between cgroups and the nasty case when different thread
groups share the mm. At some point I just suggested not to care
about the semantics of these weird threads all that much. We can either
migrate all tasks sharing the mm struct or just keep the inconsistency.

Anyway, removing this ugliness would be so cool!
--
Michal Hocko
SUSE Labs

2018-04-27 18:07:15

by Eric W. Biederman

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> On Thu 26-04-18 21:28:18, Michal Hocko wrote:
>> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote:
>> > Michal Hocko <[email protected]> writes:
>> >
>> > > I've had a patch to remove owner few years back. It needed some work
>> > > to finish but maybe that would be a better than try to make
>> > > non-scalable thing suck less.
>> >
>> > I have a question. Would it be reasonable to just have a mm->memcg?
>> > That would appear to be the simplest solution to the problem.
>>
>> I do not remember details. Have to re-read the whole thing again. Hope
>> to get to this soon but with the current jet lag and backlog from the
>> LSFMM I rather not promis anything. Going with mm->memcg would be the
>> most simple of course but I have a very vague recollection that it was
>> not possible. Maybe I misremember...
>
> Just for the record, the last version where I've tried to remove owner
> was posted here: http://lkml.kernel.org/r/[email protected]
>
> I didn't get to remember details yet, but the primary problem was the
> task migration between cgroups and the nasty case when different thread
> grounds share the mm. At some point I just suggested to not care
> about semantic of these weird threads all that much. We can either
> migrate all tasks sharing the mm struct or just keep the inconsistency.
>
> Anyway, removing this ugliness would be so cool!

I suspect the only common user of CLONE_VM today is vfork. And I do
think it is crazy to migrate a process that has called vfork before
calling exec. Other uses of CLONE_VM seem even crazier.

I think the easiest change to make in mem_cgroup_can_attach would
be just to change the test for when charges are migrated. AKA

from:

	if (mm->owner == p) {
		....
	}

to:

	if (mem_cgroup_from_task(p) == mm->memcg) {
		...
	}

That allows using mm->memcg with no new limitations on when migration
can be called. In crazy cases that has the potential to change which
memory cgroup the charges are accounted to, but the choice is already
somewhat arbitrary so I don't think that will be a problem. Especially
given that mm_update_next_owner does not migrate charges if the next
owner is in a different memory cgroup. An mm with tasks using it in
two different cgroups is already questionable if not outright
problematic.


Kirill Tkhai, do you think you would be able to adapt Michal Hocko's old
patch at https://marc.info/?l=linux-kernel&m=143635857131756&w=2
that replaces mm->owner with mm->memcg?

We probably want to outlaw migrating an mm where we are not migrating
all of the mm->users eventually. Just because that case is crazy.
But it doesn't look like we need to do that to fix the memory control
group data structures.

Eric

2018-05-01 17:24:15

by Eric W. Biederman

Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

[email protected] (Eric W. Biederman) writes:

> Kirill Tkhai do you think you would be able adapt Michal Hoko's old
> patch at https://marc.info/?l=linux-kernel&m=143635857131756&w=2
> that replaces mm->owner with mm->memcg?

Ouch. At least compared to the current kernel your patch is wide
of the mark of what is needed to use mm->memcg correctly.

I have however written my own and if someone can review it I would
really appreciate it. It looks like we can change mm->owner to
mm->memcg without too much trouble and reduce the amount of code
in the kernel at the same time.

Eric

2018-05-01 17:36:10

by Eric W. Biederman

Subject: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg


Recently it was reported that mm_update_next_owner could get into
cases where it was executing its fallback for_each_process part of
the loop and thus taking up a lot of time.

To deal with this, replace mm->owner with mm->memcg. This just reduces
the complexity of everything. As much as possible I have maintained
the current semantics. There are two significant exceptions. During
fork the memcg of the process calling fork is charged rather than
init_css_set. During memory cgroup migration the charges are migrated
not if the process is the owner of the mm, but if the process being
migrated has the same memory cgroup as the mm.

I believe it was a bug if init_css_set is charged for memory activity
during fork, and the old behavior was simply a consequence of the new
task's tsk->cgroup not yet being initialized to its proper cgroup.

During cgroup migration only thread group leaders are allowed to
migrate. Which means in practice there should only be one. Linux
tasks created with CLONE_VM are the only exception, but the common
cases are already ruled out. Processes created with vfork have a
suspended parent and can do nothing but call exec, so they should never
show up. Threads of the same cgroup are not the thread group leader,
so they also should not show up. That leaves the old LinuxThreads library,
which is probably out of use by now, and someone doing something very
creative with cgroups, rolling their own threads with CLONE_VM.
So in practice I don't think the difference in charge migration will
affect anyone.

To ensure that mm->memcg is updated appropriately I have implemented
cgroup "attach" and "fork" methods. This ensures that at those
points the mm pointed to by the task has the appropriate memory cgroup.

For simplicity, instead of introducing a new mm lock, I simply use an
atomic exchange on the pointer where mm->memcg is updated to get
atomic updates.

Looking at the history, this change is effectively a revert. The
reason given for adding mm->owner was so that multiple cgroups could be
attached to the same mm. In the last 8 years a second user of
mm->owner has not appeared. A feature that has never been used, makes the
code more complicated, and has horrible worst-case performance should
go.

Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
Reported-by: Kirill Tkhai <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 1 -
include/linux/memcontrol.h | 11 ++++--
include/linux/mm_types.h | 12 +------
include/linux/sched/mm.h | 8 -----
kernel/exit.c | 89 ----------------------------------------------
kernel/fork.c | 17 +++++++--
mm/memcontrol.c | 86 ++++++++++++++++++++++++++++++++++----------
7 files changed, 90 insertions(+), 134 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..a8be9318d1a8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1040,7 +1040,6 @@ static int exec_mmap(struct mm_struct *mm)
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
- mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..147e04bfcaee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -341,7 +341,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
@@ -402,6 +401,8 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
}

+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
@@ -409,7 +410,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
bool match = false;

rcu_read_lock();
- task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ task_memcg = rcu_dereference(mm->memcg);
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
@@ -693,7 +694,7 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
return;

rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (likely(memcg)) {
count_memcg_events(memcg, idx, 1);
if (idx == OOM_KILL)
@@ -781,6 +782,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->lruvec;
}

+static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..ea5efd40a5d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,17 +443,7 @@ struct mm_struct {
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ struct mem_cgroup __rcu *memcg;
#endif
struct user_namespace *user_ns;

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c570cd934af..cc8e68d36fc2 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
/* Remove the current tasks stale references to the old mm_struct */
extern void mm_release(struct task_struct *, struct mm_struct *);

-#ifdef CONFIG_MEMCG
-extern void mm_update_next_owner(struct mm_struct *mm);
-#else
-static inline void mm_update_next_owner(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_MEMCG */
-
#ifdef CONFIG_MMU
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..be967d2da0ce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}
}

-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting. If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
- struct task_struct *c, *g, *p = current;
-
-retry:
- /*
- * If the exiting or execing task is not the owner, it's
- * someone else's problem.
- */
- if (mm->owner != p)
- return;
- /*
- * The current owner is exiting/execing and there are no other
- * candidates. Do not leave the mm pointing to a possibly
- * freed task structure.
- */
- if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
- return;
- }
-
- read_lock(&tasklist_lock);
- /*
- * Search in the children
- */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search in the siblings
- */
- list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search through everything else, we should not get here often.
- */
- for_each_process(g) {
- if (g->flags & PF_KTHREAD)
- continue;
- for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
- if (c->mm)
- break;
- }
- }
- read_unlock(&tasklist_lock);
- /*
- * We found no owner yet mm_users > 1: this implies that we are
- * most likely racing with swapoff (try_to_unuse()) or /proc or
- * ptrace or page migration (get_task_mm()). Mark owner as NULL.
- */
- mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -540,7 +452,6 @@ static void exit_mm(void)
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(current);
- mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..f284acf22aad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -868,10 +868,19 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}

-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+static void mm_init_memcg(struct mm_struct *mm)
{
#ifdef CONFIG_MEMCG
- mm->owner = p;
+ struct cgroup_subsys_state *css;
+
+ /* Ensure mm->memcg is initialized */
+ mm->memcg = NULL;
+
+ rcu_read_lock();
+ css = task_css(current, memory_cgrp_id);
+ if (css && css_tryget(css))
+ mm_update_memcg(mm, mem_cgroup_from_css(css));
+ rcu_read_unlock();
#endif
}

@@ -901,7 +910,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
- mm_init_owner(mm, p);
+ mm_init_memcg(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_mm_init(mm);
hmm_mm_init(mm);
@@ -931,6 +940,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
+ mm_update_memcg(mm, NULL);
free_mm(mm);
return NULL;
}
@@ -968,6 +978,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ mm_update_memcg(mm, NULL);
mmdrop(mm);
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..5dce8a7fa65b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -664,20 +664,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
}
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- /*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
- */
- if (unlikely(!p))
- return NULL;
-
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-EXPORT_SYMBOL(mem_cgroup_from_task);
-
static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -692,7 +678,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
memcg = root_mem_cgroup;
}
@@ -1011,7 +997,7 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_task(task);
+ task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4827,15 +4813,16 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
if (!move_flags)
return 0;

- from = mem_cgroup_from_task(p);
+ from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));

VM_BUG_ON(from == memcg);

mm = get_task_mm(p);
if (!mm)
return 0;
+
/* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
+ if (mm->memcg == from) {
VM_BUG_ON(mc.from);
VM_BUG_ON(mc.to);
VM_BUG_ON(mc.precharge);
@@ -4859,6 +4846,59 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
return ret;
}

+/**
+ * mm_update_memcg - Update the memory cgroup of a mm_struct
+ * @mm: mm struct
+ * @new: new memory cgroup value
+ *
+ * Called whenever mm->memcg needs to change. Consumes a reference
+ * to new (unless new is NULL). The reference to the old memory
+ * cgroup is decreased.
+ */
+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+ /* This is the only place where mm->memcg is changed */
+ struct mem_cgroup *old;
+
+ old = xchg(&mm->memcg, new);
+ if (old)
+ css_put(&old->css);
+}
+
+static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
+{
+ struct mm_struct *mm;
+ task_lock(tsk);
+ mm = tsk->mm;
+ if (mm && !(tsk->flags & PF_KTHREAD))
+ mm_update_memcg(mm, new);
+ task_unlock(tsk);
+}
+
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *tsk;
+
+ cgroup_taskset_for_each(tsk, css, tset) {
+ struct mem_cgroup *new = mem_cgroup_from_css(css);
+ css_get(css);
+ task_update_memcg(tsk, new);
+ }
+}
+
+static void mem_cgroup_fork(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = task_css(tsk, memory_cgrp_id);
+ if (css && css_tryget(css))
+ task_update_memcg(tsk, mem_cgroup_from_css(css));
+ rcu_read_unlock();
+}
+
+
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
if (mc.to)
@@ -5027,6 +5067,12 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
return 0;
}
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+static void mem_cgroup_fork(struct task_struct *task)
+{
+}
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
}
@@ -5335,8 +5381,10 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+ .fork = mem_cgroup_fork,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
@@ -5769,7 +5817,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
}

rcu_read_lock();
- memcg = mem_cgroup_from_task(current);
+ memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
--
2.14.1


2018-05-02 13:20:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

Hi Eric,

On Wed, May 02, 2018 at 10:47:08AM +0200, Michal Hocko wrote:
> [CC johannes and Tejun as well. I am sorry but my backlog is so huge I
> will not get to this week.]
>
> On Tue 01-05-18 12:35:16, Eric W. Biederman wrote:
> > Recently it was reported that mm_update_next_owner could get into
> > cases where it was executing it's fallback for_each_process part of
> > the loop and thus taking up a lot of time.
> >
> > To deal with this replace mm->owner with mm->memcg. This just reduces
> > the complexity of everything. As much as possible I have maintained
> > the current semantics. There are two siginificant exceptions. During
> > fork the memcg of the process calling fork is charged rather than
> > init_css_set. During memory cgroup migration the charges are migrated
> > not if the process is the owner of the mm, but if the process being
> > migrated has the same memory cgroup as the mm.
> >
> > I believe it was a bug if init_css_set is charged for memory activity
> > during fork, and the old behavior was simply a consequence of the new
> > task not having tsk->cgroup not initialized to it's proper cgroup.
> >
> > Durhing cgroup migration only thread group leaders are allowed to
> > migrate. Which means in practice there should only be one. Linux
> > tasks created with CLONE_VM are the only exception, but the common
> > cases are already ruled out. Processes created with vfork have a
> > suspended parent and can do nothing but call exec so they should never
> > show up. Threads of the same cgroup are not the thread group leader
> > so also should not show up. That leaves the old LinuxThreads library
> > which is probably out of use by now, and someone doing something very
> > creative with cgroups, and rolling their own threads with CLONE_VM.
> > So in practice I don't think the difference charge migration will
> > affect anyone.
> >
> > To ensure that mm->memcg is updated appropriately I have implemented
> > cgroup "attach" and "fork" methods. This ensures that at those
> > points the mm pointed to the task has the appropriate memory cgroup.
> >
> > For simplicity instead of introducing a new mm lock I simply use
> > exchange on the pointer where the mm->memcg is updated to get
> > atomic updates.
> >
> > Looking at the history effectively this change is a revert. The
> > reason given for adding mm->owner is so that multiple cgroups can be
> > attached to the same mm. In the last 8 years a second user of
> > mm->owner has not appeared. A feature that has never used, makes the
> > code more complicated and has horrible worst case performance should
> > go.
> >
> > Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
> > Reported-by: Kirill Tkhai <[email protected]>
> > Signed-off-by: "Eric W. Biederman" <[email protected]>

I really like this. The multiple css per mm future that was referenced
in cf475ad28ac3's changelog never materialized, so there is no need to
always go through the task just to have the full css_set.

> > @@ -4827,15 +4813,16 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> > if (!move_flags)
> > return 0;
> >
> > - from = mem_cgroup_from_task(p);
> > + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> >
> > VM_BUG_ON(from == memcg);
> >
> > mm = get_task_mm(p);
> > if (!mm)
> > return 0;
> > +
> > /* We move charges only when we move a owner of the mm */
> > - if (mm->owner == p) {
> > + if (mm->memcg == from) {
> > VM_BUG_ON(mc.from);
> > VM_BUG_ON(mc.to);
> > VM_BUG_ON(mc.precharge);

mm->memcg is updated on every single ->attach(), so this can only
happen to a task when a CLONE_VM sibling is moved out of its
group. Essentially, in the new scheme the "owner" is whichever task
with that mm migrated most recently.

I agree that it's hard to conjure up a practical usecase that would
straddle mms like this over multiple cgroups - especially given how
the memory charges themselves can only belong to one cgroup, too. So
in practice task->mm->memcg will always be task->css_set[memory].
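
To make the corner case concrete, here is a minimal user-space sketch
(illustration only; the cgroup mount path and group name are made up,
none of this is from the patch): two tasks share one mm via CLONE_VM
and only one of them is migrated, so the shared mm's memcg ends up
tracking whichever of the two moved most recently.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Idle inside the shared mm so the task can be migrated from outside. */
static int child_fn(void *arg)
{
	for (;;)
		pause();
	return 0;
}

static char child_stack[64 * 1024] __attribute__((aligned(16)));

int main(void)
{
	pid_t child = clone(child_fn, child_stack + sizeof(child_stack),
			    CLONE_VM | SIGCHLD, NULL);
	FILE *f;

	if (child < 0) {
		perror("clone");
		return 1;
	}

	/*
	 * Migrate only the CLONE_VM sibling into another memory cgroup
	 * (hypothetical path).  Once ->attach() has run, the shared mm's
	 * memcg follows the child, not the parent.
	 */
	f = fopen("/sys/fs/cgroup/memory/other/cgroup.procs", "w");
	if (f) {
		fprintf(f, "%d\n", (int)child);
		fclose(f);
	}

	pause();
	return 0;
}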

But could you please update the comment to outline the corner case?
"owner" isn't really a thing anymore after this patch.

Oh, and mm/debug.c::dump_mm() still refers to mm->owner, that needs to
be switched to mm->memcg as well.

Aside from that, this looks great to me. For the fixed version:

Acked-by: Johannes Weiner <[email protected]>

2018-05-02 14:17:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

Johannes Weiner <[email protected]> writes:

> Hi Eric,
>
> On Wed, May 02, 2018 at 10:47:08AM +0200, Michal Hocko wrote:
>> [CC johannes and Tejun as well. I am sorry but my backlog is so huge I
>> will not get to this week.]
>>
>> On Tue 01-05-18 12:35:16, Eric W. Biederman wrote:
>> > Recently it was reported that mm_update_next_owner could get into
>> > cases where it was executing it's fallback for_each_process part of
>> > the loop and thus taking up a lot of time.
>> >
>> > To deal with this replace mm->owner with mm->memcg. This just reduces
>> > the complexity of everything. As much as possible I have maintained
>> > the current semantics. There are two siginificant exceptions. During
>> > fork the memcg of the process calling fork is charged rather than
>> > init_css_set. During memory cgroup migration the charges are migrated
>> > not if the process is the owner of the mm, but if the process being
>> > migrated has the same memory cgroup as the mm.
>> >
>> > I believe it was a bug if init_css_set is charged for memory activity
>> > during fork, and the old behavior was simply a consequence of the new
>> > task not having tsk->cgroup not initialized to it's proper cgroup.
>> >
>> > Durhing cgroup migration only thread group leaders are allowed to
>> > migrate. Which means in practice there should only be one. Linux
>> > tasks created with CLONE_VM are the only exception, but the common
>> > cases are already ruled out. Processes created with vfork have a
>> > suspended parent and can do nothing but call exec so they should never
>> > show up. Threads of the same cgroup are not the thread group leader
>> > so also should not show up. That leaves the old LinuxThreads library
>> > which is probably out of use by now, and someone doing something very
>> > creative with cgroups, and rolling their own threads with CLONE_VM.
>> > So in practice I don't think the difference charge migration will
>> > affect anyone.
>> >
>> > To ensure that mm->memcg is updated appropriately I have implemented
>> > cgroup "attach" and "fork" methods. This ensures that at those
>> > points the mm pointed to the task has the appropriate memory cgroup.
>> >
>> > For simplicity instead of introducing a new mm lock I simply use
>> > exchange on the pointer where the mm->memcg is updated to get
>> > atomic updates.
>> >
>> > Looking at the history effectively this change is a revert. The
>> > reason given for adding mm->owner is so that multiple cgroups can be
>> > attached to the same mm. In the last 8 years a second user of
>> > mm->owner has not appeared. A feature that has never used, makes the
>> > code more complicated and has horrible worst case performance should
>> > go.
>> >
>> > Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
>> > Reported-by: Kirill Tkhai <[email protected]>
>> > Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> I really like this. The multiple css per mm future that was referenced
> in cf475ad28ac3's changelog never materialized, so there is no need to
> always go through the task just to have the full css_set.
>
>> > @@ -4827,15 +4813,16 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
>> > if (!move_flags)
>> > return 0;
>> >
>> > - from = mem_cgroup_from_task(p);
>> > + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>> >
>> > VM_BUG_ON(from == memcg);
>> >
>> > mm = get_task_mm(p);
>> > if (!mm)
>> > return 0;
>> > +
>> > /* We move charges only when we move a owner of the mm */
>> > - if (mm->owner == p) {
>> > + if (mm->memcg == from) {
>> > VM_BUG_ON(mc.from);
>> > VM_BUG_ON(mc.to);
>> > VM_BUG_ON(mc.precharge);
>
> mm->memcg is updated on every single ->attach(), so this can only
> happen to a task when a CLONE_VM sibling is moved out of its
> group. Essentially, in the new scheme the "owner" is whichever task
> with that mm migrated most recently.

Yes. The charges will only fail to be migrated in some CLONE_VM
situations.

> I agree that it's hard to conjure up a practical usecase that would
> straddle mms like this over multiple cgroups - especially given how
> the memory charges themselves can only belong to one cgroup, too. So
> in practice task->mm->memcg will always be task->css_set[memory].
>
> But could you please update the comment to outline the cornercase?
> "owner" isn't really a thing anymore after this patch.

I can. How about:
/* We move charges except for creative uses of CLONE_VM */

> Oh, and mm/debug.c::dump_mm() still refers to mm->owner, that needs to
> be switched to mm->memcg as well.

The kbuild test robot pointed that out as well.

> Aside from that, this looks great to me. For the fixed version:
>
> Acked-by: Johannes Weiner <[email protected]>

Will do and will repost in a bit.

Eric


2018-05-02 19:30:56

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] memcg: Replace mm->owner with mm->memcg


Recently it was reported that mm_update_next_owner could get into
cases where it was executing its fallback for_each_process part of
the loop and thus taking up a lot of time.

To deal with this replace mm->owner with mm->memcg. This just reduces
the complexity of everything. As much as possible I have maintained
the current semantics. There are two significant exceptions. During
fork the memcg of the process calling fork is charged rather than
init_css_set. During memory cgroup migration the charges are migrated
not if the process is the owner of the mm, but if the process being
migrated has the same memory cgroup as the mm.

I believe it was a bug if init_css_set is charged for memory activity
during fork; the old behavior was simply a consequence of the new
task's tsk->cgroup not yet being initialized to its proper cgroup.

During cgroup migration only thread group leaders are allowed to
migrate, which means in practice there should only be one. Linux
tasks created with CLONE_VM are the only exception, but the common
cases are already ruled out. Processes created with vfork have a
suspended parent and can do nothing but call exec, so they should never
show up. Threads that are not the thread group leader also should not
show up. That leaves the old LinuxThreads library, which is probably
out of use by now, and someone doing something very creative with
cgroups, rolling their own threads with CLONE_VM. So in practice I
don't think the difference in charge migration will affect anyone.

To ensure that mm->memcg is updated appropriately I have implemented
cgroup "attach" and "fork" methods. This ensures that at those
points the mm pointed to the task has the appropriate memory cgroup.

For simplicity, instead of introducing a new mm lock, I simply use
exchange on the pointer where mm->memcg is updated to get
atomic updates.

Looking at the history, this change is effectively a revert. The
reason given for adding mm->owner was so that multiple cgroups could
be attached to the same mm. In the last 8 years a second user of
mm->owner has not appeared. A feature that has never been used, makes
the code more complicated, and has horrible worst-case performance
should go.

Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
Reported-by: Kirill Tkhai <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

Fingers crossed that this is the final version of this patch.

fs/exec.c | 1 -
include/linux/memcontrol.h | 11 ++++--
include/linux/mm_types.h | 12 +------
include/linux/sched/mm.h | 8 -----
kernel/exit.c | 89 ----------------------------------------------
kernel/fork.c | 17 +++++++--
mm/debug.c | 4 +--
mm/memcontrol.c | 88 ++++++++++++++++++++++++++++++++++-----------
8 files changed, 93 insertions(+), 137 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..a8be9318d1a8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1040,7 +1040,6 @@ static int exec_mmap(struct mm_struct *mm)
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
- mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..147e04bfcaee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -341,7 +341,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
@@ -402,6 +401,8 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
}

+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
@@ -409,7 +410,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
bool match = false;

rcu_read_lock();
- task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ task_memcg = rcu_dereference(mm->memcg);
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
@@ -693,7 +694,7 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
return;

rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (likely(memcg)) {
count_memcg_events(memcg, idx, 1);
if (idx == OOM_KILL)
@@ -781,6 +782,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->lruvec;
}

+static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..ea5efd40a5d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,17 +443,7 @@ struct mm_struct {
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ struct mem_cgroup __rcu *memcg;
#endif
struct user_namespace *user_ns;

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c570cd934af..cc8e68d36fc2 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
/* Remove the current tasks stale references to the old mm_struct */
extern void mm_release(struct task_struct *, struct mm_struct *);

-#ifdef CONFIG_MEMCG
-extern void mm_update_next_owner(struct mm_struct *mm);
-#else
-static inline void mm_update_next_owner(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_MEMCG */
-
#ifdef CONFIG_MMU
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..be967d2da0ce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}
}

-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting. If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
- struct task_struct *c, *g, *p = current;
-
-retry:
- /*
- * If the exiting or execing task is not the owner, it's
- * someone else's problem.
- */
- if (mm->owner != p)
- return;
- /*
- * The current owner is exiting/execing and there are no other
- * candidates. Do not leave the mm pointing to a possibly
- * freed task structure.
- */
- if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
- return;
- }
-
- read_lock(&tasklist_lock);
- /*
- * Search in the children
- */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search in the siblings
- */
- list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search through everything else, we should not get here often.
- */
- for_each_process(g) {
- if (g->flags & PF_KTHREAD)
- continue;
- for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
- if (c->mm)
- break;
- }
- }
- read_unlock(&tasklist_lock);
- /*
- * We found no owner yet mm_users > 1: this implies that we are
- * most likely racing with swapoff (try_to_unuse()) or /proc or
- * ptrace or page migration (get_task_mm()). Mark owner as NULL.
- */
- mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -540,7 +452,6 @@ static void exit_mm(void)
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(current);
- mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..f284acf22aad 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -868,10 +868,19 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}

-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+static void mm_init_memcg(struct mm_struct *mm)
{
#ifdef CONFIG_MEMCG
- mm->owner = p;
+ struct cgroup_subsys_state *css;
+
+ /* Ensure mm->memcg is initialized */
+ mm->memcg = NULL;
+
+ rcu_read_lock();
+ css = task_css(current, memory_cgrp_id);
+ if (css && css_tryget(css))
+ mm_update_memcg(mm, mem_cgroup_from_css(css));
+ rcu_read_unlock();
#endif
}

@@ -901,7 +910,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
- mm_init_owner(mm, p);
+ mm_init_memcg(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_mm_init(mm);
hmm_mm_init(mm);
@@ -931,6 +940,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
+ mm_update_memcg(mm, NULL);
free_mm(mm);
return NULL;
}
@@ -968,6 +978,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ mm_update_memcg(mm, NULL);
mmdrop(mm);
}

diff --git a/mm/debug.c b/mm/debug.c
index 56e2d9125ea5..3aed6102b8d0 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -116,7 +116,7 @@ void dump_mm(const struct mm_struct *mm)
"ioctx_table %px\n"
#endif
#ifdef CONFIG_MEMCG
- "owner %px "
+ "memcg %px "
#endif
"exe_file %px\n"
#ifdef CONFIG_MMU_NOTIFIER
@@ -147,7 +147,7 @@ void dump_mm(const struct mm_struct *mm)
mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
- mm->owner,
+ mm->memcg,
#endif
mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..a3745ccbd8c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -664,20 +664,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
}
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- /*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
- */
- if (unlikely(!p))
- return NULL;
-
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-EXPORT_SYMBOL(mem_cgroup_from_task);
-
static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -692,7 +678,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
memcg = root_mem_cgroup;
}
@@ -1011,7 +997,7 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_task(task);
+ task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4827,15 +4813,16 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
if (!move_flags)
return 0;

- from = mem_cgroup_from_task(p);
+ from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));

VM_BUG_ON(from == memcg);

mm = get_task_mm(p);
if (!mm)
return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
+
+ /* We move charges except for creative uses of CLONE_VM */
+ if (mm->memcg == from) {
VM_BUG_ON(mc.from);
VM_BUG_ON(mc.to);
VM_BUG_ON(mc.precharge);
@@ -4859,6 +4846,59 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
return ret;
}

+/**
+ * mm_update_memcg - Update the memory cgroup of a mm_struct
+ * @mm: mm struct
+ * @new: new memory cgroup value
+ *
+ * Called whenever mm->memcg needs to change. Consumes a reference
+ * to new (unless new is NULL). The reference to the old memory
+ * cgroup is decreased.
+ */
+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+ /* This is the only place where mm->memcg is changed */
+ struct mem_cgroup *old;
+
+ old = xchg(&mm->memcg, new);
+ if (old)
+ css_put(&old->css);
+}
+
+static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
+{
+ struct mm_struct *mm;
+ task_lock(tsk);
+ mm = tsk->mm;
+ if (mm && !(tsk->flags & PF_KTHREAD))
+ mm_update_memcg(mm, new);
+ task_unlock(tsk);
+}
+
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *tsk;
+
+ cgroup_taskset_for_each(tsk, css, tset) {
+ struct mem_cgroup *new = mem_cgroup_from_css(css);
+ css_get(css);
+ task_update_memcg(tsk, new);
+ }
+}
+
+static void mem_cgroup_fork(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = task_css(tsk, memory_cgrp_id);
+ if (css && css_tryget(css))
+ task_update_memcg(tsk, mem_cgroup_from_css(css));
+ rcu_read_unlock();
+}
+
+
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
if (mc.to)
@@ -5027,6 +5067,12 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
return 0;
}
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+}
+static void mem_cgroup_fork(struct task_struct *task)
+{
+}
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
}
@@ -5335,8 +5381,10 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+ .fork = mem_cgroup_fork,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
@@ -5769,7 +5817,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
}

rcu_read_lock();
- memcg = mem_cgroup_from_task(current);
+ memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
--
2.14.1


2018-05-02 21:05:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On Wed, 02 May 2018 14:21:35 -0500 [email protected] (Eric W. Biederman) wrote:

> Recently it was reported that mm_update_next_owner could get into
> cases where it was executing it's fallback for_each_process part of
> the loop and thus taking up a lot of time.
>
> To deal with this replace mm->owner with mm->memcg. This just reduces
> the complexity of everything. As much as possible I have maintained
> the current semantics. There are two siginificant exceptions. During
> fork the memcg of the process calling fork is charged rather than
> init_css_set. During memory cgroup migration the charges are migrated
> not if the process is the owner of the mm, but if the process being
> migrated has the same memory cgroup as the mm.
>
> I believe it was a bug if init_css_set is charged for memory activity
> during fork, and the old behavior was simply a consequence of the new
> task not having tsk->cgroup not initialized to it's proper cgroup.
>
> Durhing cgroup migration only thread group leaders are allowed to
> migrate. Which means in practice there should only be one. Linux
> tasks created with CLONE_VM are the only exception, but the common
> cases are already ruled out. Processes created with vfork have a
> suspended parent and can do nothing but call exec so they should never
> show up. Threads of the same cgroup are not the thread group leader
> so also should not show up. That leaves the old LinuxThreads library
> which is probably out of use by now, and someone doing something very
> creative with cgroups, and rolling their own threads with CLONE_VM.
> So in practice I don't think the difference charge migration will
> affect anyone.
>
> To ensure that mm->memcg is updated appropriately I have implemented
> cgroup "attach" and "fork" methods. This ensures that at those
> points the mm pointed to the task has the appropriate memory cgroup.
>
> For simplicity instead of introducing a new mm lock I simply use
> exchange on the pointer where the mm->memcg is updated to get
> atomic updates.
>
> Looking at the history effectively this change is a revert. The
> reason given for adding mm->owner is so that multiple cgroups can be
> attached to the same mm. In the last 8 years a second user of
> mm->owner has not appeared. A feature that has never used, makes the
> code more complicated and has horrible worst case performance should
> go.

Cleanliness nit: I'm not sure that the removal and open-coding of
mem_cgroup_from_task() actually improved things. Should we restore it?


--- a/mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg-fix
+++ a/mm/memcontrol.c
@@ -664,6 +664,11 @@ static void memcg_check_events(struct me
}
}

+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+ return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+}
+
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -1011,7 +1016,7 @@ bool task_in_mem_cgroup(struct task_stru
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
+ task_memcg = mem_cgroup_from_task(task);
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4829,7 +4834,7 @@ static int mem_cgroup_can_attach(struct
if (!move_flags)
return 0;

- from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+ from = mem_cgroup_from_task(p);

VM_BUG_ON(from == memcg);

@@ -5887,7 +5892,7 @@ void mem_cgroup_sk_alloc(struct sock *sk
}

rcu_read_lock();
- memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
+ memcg = mem_cgroup_from_task(current);
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
_


2018-05-02 21:43:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Andrew Morton <[email protected]> writes:

> On Wed, 02 May 2018 14:21:35 -0500 [email protected] (Eric W. Biederman) wrote:
>
>> Recently it was reported that mm_update_next_owner could get into
>> cases where it was executing it's fallback for_each_process part of
>> the loop and thus taking up a lot of time.
>>
>> To deal with this replace mm->owner with mm->memcg. This just reduces
>> the complexity of everything. As much as possible I have maintained
>> the current semantics. There are two siginificant exceptions. During
>> fork the memcg of the process calling fork is charged rather than
>> init_css_set. During memory cgroup migration the charges are migrated
>> not if the process is the owner of the mm, but if the process being
>> migrated has the same memory cgroup as the mm.
>>
>> I believe it was a bug if init_css_set is charged for memory activity
>> during fork, and the old behavior was simply a consequence of the new
>> task not having tsk->cgroup not initialized to it's proper cgroup.
>>
>> Durhing cgroup migration only thread group leaders are allowed to
>> migrate. Which means in practice there should only be one. Linux
>> tasks created with CLONE_VM are the only exception, but the common
>> cases are already ruled out. Processes created with vfork have a
>> suspended parent and can do nothing but call exec so they should never
>> show up. Threads of the same cgroup are not the thread group leader
>> so also should not show up. That leaves the old LinuxThreads library
>> which is probably out of use by now, and someone doing something very
>> creative with cgroups, and rolling their own threads with CLONE_VM.
>> So in practice I don't think the difference charge migration will
>> affect anyone.
>>
>> To ensure that mm->memcg is updated appropriately I have implemented
>> cgroup "attach" and "fork" methods. This ensures that at those
>> points the mm pointed to the task has the appropriate memory cgroup.
>>
>> For simplicity instead of introducing a new mm lock I simply use
>> exchange on the pointer where the mm->memcg is updated to get
>> atomic updates.
>>
>> Looking at the history effectively this change is a revert. The
>> reason given for adding mm->owner is so that multiple cgroups can be
>> attached to the same mm. In the last 8 years a second user of
>> mm->owner has not appeared. A feature that has never used, makes the
>> code more complicated and has horrible worst case performance should
>> go.
>
> Cleanliness nit: I'm not sure that the removal and open-coding of
> mem_cgroup_from_task() actually improved things. Should we restore
> it?

While writing the patch itself, removing mem_cgroup_from_task forced
me to think about which places should use mm->memcg and which places
should use an alternative.

If we want to add it back afterwards with a second patch I don't mind.

I just don't want to have that in the same patch, as that loses the
opportunity to look at how the memory cgroup should be derived.

Eric

> --- a/mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg-fix
> +++ a/mm/memcontrol.c
> @@ -664,6 +664,11 @@ static void memcg_check_events(struct me
> }
> }
>
> +static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> + return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> +}
> +
> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> {
> struct mem_cgroup *memcg = NULL;
> @@ -1011,7 +1016,7 @@ bool task_in_mem_cgroup(struct task_stru
> * killed to prevent needlessly killing additional tasks.
> */
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> + task_memcg = mem_cgroup_from_task(task);
> css_get(&task_memcg->css);
> rcu_read_unlock();
> }
> @@ -4829,7 +4834,7 @@ static int mem_cgroup_can_attach(struct
> if (!move_flags)
> return 0;
>
> - from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> + from = mem_cgroup_from_task(p);
>
> VM_BUG_ON(from == memcg);
>
> @@ -5887,7 +5892,7 @@ void mem_cgroup_sk_alloc(struct sock *sk
> }
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
> + memcg = mem_cgroup_from_task(current);
> if (memcg == root_mem_cgroup)
> goto out;
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
> _

2018-05-03 00:01:49

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

On Tue, 01 May 2018 12:35:16 -0500
[email protected] (Eric W. Biederman) wrote:

> Recently it was reported that mm_update_next_owner could get into
> cases where it was executing it's fallback for_each_process part of
> the loop and thus taking up a lot of time.
>
> To deal with this replace mm->owner with mm->memcg. This just reduces
> the complexity of everything. As much as possible I have maintained
> the current semantics. There are two siginificant exceptions. During
> fork the memcg of the process calling fork is charged rather than
> init_css_set. During memory cgroup migration the charges are migrated
> not if the process is the owner of the mm, but if the process being
> migrated has the same memory cgroup as the mm.
>
> I believe it was a bug if init_css_set is charged for memory activity
> during fork, and the old behavior was simply a consequence of the new
> task not having tsk->cgroup not initialized to it's proper cgroup.

That does sound like a bug; I guess we've not seen it because we did
not track any slab allocations initially.

>
> Durhing cgroup migration only thread group leaders are allowed to
> migrate. Which means in practice there should only be one. Linux
> tasks created with CLONE_VM are the only exception, but the common
> cases are already ruled out. Processes created with vfork have a
> suspended parent and can do nothing but call exec so they should never
> show up. Threads of the same cgroup are not the thread group leader
> so also should not show up. That leaves the old LinuxThreads library
> which is probably out of use by now, and someone doing something very
> creative with cgroups, and rolling their own threads with CLONE_VM.
> So in practice I don't think the difference charge migration will
> affect anyone.
>
> To ensure that mm->memcg is updated appropriately I have implemented
> cgroup "attach" and "fork" methods. This ensures that at those
> points the mm pointed to the task has the appropriate memory cgroup.
>
> For simplicity instead of introducing a new mm lock I simply use
> exchange on the pointer where the mm->memcg is updated to get
> atomic updates.
>
> Looking at the history effectively this change is a revert. The
> reason given for adding mm->owner is so that multiple cgroups can be
> attached to the same mm. In the last 8 years a second user of
> mm->owner has not appeared. A feature that has never used, makes the
> code more complicated and has horrible worst case performance should
> go.

The idea was to track the mm to the right cgroup; we did find that
the mm could be confused as belonging to two cgroups. tsk->cgroup is
not sufficient, and when the tgid left, we needed an owner to track
where the current allocations were. But this is from 8-year-old history,
I don't have my notes anymore :)

>
> Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
> Reported-by: Kirill Tkhai <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> fs/exec.c | 1 -
> include/linux/memcontrol.h | 11 ++++--
> include/linux/mm_types.h | 12 +------
> include/linux/sched/mm.h | 8 -----
> kernel/exit.c | 89 ----------------------------------------------
> kernel/fork.c | 17 +++++++--
> mm/memcontrol.c | 86 ++++++++++++++++++++++++++++++++++----------
> 7 files changed, 90 insertions(+), 134 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 183059c427b9..a8be9318d1a8 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1040,7 +1040,6 @@ static int exec_mmap(struct mm_struct *mm)
> up_read(&old_mm->mmap_sem);
> BUG_ON(active_mm != old_mm);
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> - mm_update_next_owner(old_mm);
> mmput(old_mm);
> return 0;
> }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index d99b71bc2c66..147e04bfcaee 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -341,7 +341,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> static inline
> struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> @@ -402,6 +401,8 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
> return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
> }
>
> +void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
> +
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> struct mem_cgroup *memcg)
> {
> @@ -409,7 +410,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
> bool match = false;
>
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + task_memcg = rcu_dereference(mm->memcg);
> if (task_memcg)
> match = mem_cgroup_is_descendant(task_memcg, memcg);
> rcu_read_unlock();
> @@ -693,7 +694,7 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
> return;
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (likely(memcg)) {
> count_memcg_events(memcg, idx, 1);
> if (idx == OOM_KILL)
> @@ -781,6 +782,10 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
> return &pgdat->lruvec;
> }
>
> +static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
> +{
> +}
> +
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> struct mem_cgroup *memcg)
> {
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 21612347d311..ea5efd40a5d1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -443,17 +443,7 @@ struct mm_struct {
> struct kioctx_table __rcu *ioctx_table;
> #endif
> #ifdef CONFIG_MEMCG
> - /*
> - * "owner" points to a task that is regarded as the canonical
> - * user/owner of this mm. All of the following must be true in
> - * order for it to be changed:
> - *
> - * current == mm->owner
> - * current->mm != mm
> - * new_owner->mm == mm
> - * new_owner->alloc_lock is held
> - */
> - struct task_struct __rcu *owner;
> + struct mem_cgroup __rcu *memcg;
> #endif
> struct user_namespace *user_ns;
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 2c570cd934af..cc8e68d36fc2 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
> /* Remove the current tasks stale references to the old mm_struct */
> extern void mm_release(struct task_struct *, struct mm_struct *);
>
> -#ifdef CONFIG_MEMCG
> -extern void mm_update_next_owner(struct mm_struct *mm);
> -#else
> -static inline void mm_update_next_owner(struct mm_struct *mm)
> -{
> -}
> -#endif /* CONFIG_MEMCG */
> -
> #ifdef CONFIG_MMU
> extern void arch_pick_mmap_layout(struct mm_struct *mm,
> struct rlimit *rlim_stack);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index c3c7ac560114..be967d2da0ce 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
> }
> }
>
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting. If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> - struct task_struct *c, *g, *p = current;
> -
> -retry:
> - /*
> - * If the exiting or execing task is not the owner, it's
> - * someone else's problem.
> - */
> - if (mm->owner != p)
> - return;
> - /*
> - * The current owner is exiting/execing and there are no other
> - * candidates. Do not leave the mm pointing to a possibly
> - * freed task structure.
> - */
> - if (atomic_read(&mm->mm_users) <= 1) {
> - mm->owner = NULL;
> - return;
> - }
> -
> - read_lock(&tasklist_lock);
> - /*
> - * Search in the children
> - */
> - list_for_each_entry(c, &p->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search in the siblings
> - */
> - list_for_each_entry(c, &p->real_parent->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search through everything else, we should not get here often.
> - */
> - for_each_process(g) {
> - if (g->flags & PF_KTHREAD)
> - continue;
> - for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - if (c->mm)
> - break;
> - }
> - }
> - read_unlock(&tasklist_lock);
> - /*
> - * We found no owner yet mm_users > 1: this implies that we are
> - * most likely racing with swapoff (try_to_unuse()) or /proc or
> - * ptrace or page migration (get_task_mm()). Mark owner as NULL.
> - */
> - mm->owner = NULL;
> - return;
> -
> -assign_new_owner:
> - BUG_ON(c == p);
> - get_task_struct(c);
> - /*
> - * The task_lock protects c->mm from changing.
> - * We always want mm->owner->mm == mm
> - */
> - task_lock(c);
> - /*
> - * Delay read_unlock() till we have the task_lock()
> - * to ensure that c does not slip away underneath us
> - */
> - read_unlock(&tasklist_lock);
> - if (c->mm != mm) {
> - task_unlock(c);
> - put_task_struct(c);
> - goto retry;
> - }
> - mm->owner = c;
> - task_unlock(c);
> - put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */
> -
> /*
> * Turn us into a lazy TLB process if we
> * aren't already..
> @@ -540,7 +452,6 @@ static void exit_mm(void)
> up_read(&mm->mmap_sem);
> enter_lazy_tlb(mm, current);
> task_unlock(current);
> - mm_update_next_owner(mm);
> mmput(mm);
> if (test_thread_flag(TIF_MEMDIE))
> exit_oom_victim();
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a5d21c42acfc..f284acf22aad 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -868,10 +868,19 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> +static void mm_init_memcg(struct mm_struct *mm)
> {
> #ifdef CONFIG_MEMCG
> - mm->owner = p;
> + struct cgroup_subsys_state *css;
> +
> + /* Ensure mm->memcg is initialized */
> + mm->memcg = NULL;
> +
> + rcu_read_lock();
> + css = task_css(current, memory_cgrp_id);
> + if (css && css_tryget(css))
> + mm_update_memcg(mm, mem_cgroup_from_css(css));
> + rcu_read_unlock();
> #endif
> }
>
> @@ -901,7 +910,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> spin_lock_init(&mm->page_table_lock);
> mm_init_cpumask(mm);
> mm_init_aio(mm);
> - mm_init_owner(mm, p);
> + mm_init_memcg(mm);
> RCU_INIT_POINTER(mm->exe_file, NULL);
> mmu_notifier_mm_init(mm);
> hmm_mm_init(mm);
> @@ -931,6 +940,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> fail_nocontext:
> mm_free_pgd(mm);
> fail_nopgd:
> + mm_update_memcg(mm, NULL);
> free_mm(mm);
> return NULL;
> }
> @@ -968,6 +978,7 @@ static inline void __mmput(struct mm_struct *mm)
> }
> if (mm->binfmt)
> module_put(mm->binfmt->module);
> + mm_update_memcg(mm, NULL);
> mmdrop(mm);
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2bd3df3d101a..5dce8a7fa65b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -664,20 +664,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> }
> }
>
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> - /*
> - * mm_update_next_owner() may clear mm->owner to NULL
> - * if it races with swapoff, page migration, etc.
> - * So this can be called with p == NULL.
> - */
> - if (unlikely(!p))
> - return NULL;
> -
> - return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -EXPORT_SYMBOL(mem_cgroup_from_task);
> -
> static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> {
> struct mem_cgroup *memcg = NULL;
> @@ -692,7 +678,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> if (unlikely(!mm))
> memcg = root_mem_cgroup;
> else {
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (unlikely(!memcg))
> memcg = root_mem_cgroup;
> }
> @@ -1011,7 +997,7 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
> * killed to prevent needlessly killing additional tasks.
> */
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(task);
> + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> css_get(&task_memcg->css);
> rcu_read_unlock();
> }
> @@ -4827,15 +4813,16 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> if (!move_flags)
> return 0;
>
> - from = mem_cgroup_from_task(p);
> + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>
> VM_BUG_ON(from == memcg);
>
> mm = get_task_mm(p);
> if (!mm)
> return 0;
> +
> /* We move charges only when we move a owner of the mm */
> - if (mm->owner == p) {
> + if (mm->memcg == from) {
> VM_BUG_ON(mc.from);
> VM_BUG_ON(mc.to);
> VM_BUG_ON(mc.precharge);
> @@ -4859,6 +4846,59 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> return ret;
> }
>
> +/**
> + * mm_update_memcg - Update the memory cgroup of a mm_struct
> + * @mm: mm struct
> + * @new: new memory cgroup value
> + *
> + * Called whenever mm->memcg needs to change. Consumes a reference
> + * to new (unless new is NULL). The reference to the old memory
> + * cgroup is decreased.
> + */
> +void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
> +{
> + /* This is the only place where mm->memcg is changed */
> + struct mem_cgroup *old;
> +
> + old = xchg(&mm->memcg, new);
> + if (old)
> + css_put(&old->css);
> +}
> +
> +static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
> +{
> + struct mm_struct *mm;
> + task_lock(tsk);
> + mm = tsk->mm;
> + if (mm && !(tsk->flags & PF_KTHREAD))
> + mm_update_memcg(mm, new);
> + task_unlock(tsk);
> +}
> +
> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *css;
> + struct task_struct *tsk;
> +
> + cgroup_taskset_for_each(tsk, css, tset) {
> + struct mem_cgroup *new = mem_cgroup_from_css(css);
> + css_get(css);
> + task_update_memcg(tsk, new);

I'd have to go back and check, and I think your comment refers to this,
but we don't expect non-tgid tasks to show up here? My concern is that I
can't find the guarantee that task_update_memcg(tsk, new) is not

1. duplicated for each thread in the process or attached to the mm, or
2. updating mm->memcg to point to different places, so that the one
   that sticks is the one that updated things last.


Balbir Singh

2018-05-03 10:54:45

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On 27.04.2018 21:05, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
>> On Thu 26-04-18 21:28:18, Michal Hocko wrote:
>>> On Thu 26-04-18 11:19:33, Eric W. Biederman wrote:
>>>> Michal Hocko <[email protected]> writes:
>>>>
>>>>> I've had a patch to remove owner few years back. It needed some work
>>>>> to finish but maybe that would be a better than try to make
>>>>> non-scalable thing suck less.
>>>>
>>>> I have a question. Would it be reasonable to just have a mm->memcg?
>>>> That would appear to be the simplest solution to the problem.
>>>
>>> I do not remember details. Have to re-read the whole thing again. Hope
>>> to get to this soon but with the current jet lag and backlog from the
>>> LSFMM I rather not promis anything. Going with mm->memcg would be the
>>> most simple of course but I have a very vague recollection that it was
>>> not possible. Maybe I misremember...
>>
>> Just for the record, the last version where I've tried to remove owner
>> was posted here: http://lkml.kernel.org/r/[email protected]
>>
>> I didn't get to remember details yet, but the primary problem was the
>> task migration between cgroups and the nasty case when different thread
>> grounds share the mm. At some point I just suggested to not care
>> about semantic of these weird threads all that much. We can either
>> migrate all tasks sharing the mm struct or just keep the inconsistency.
>>
>> Anyway, removing this ugliness would be so cool!
>
> I suspect the only common user of CLONE_VM today is vfork. And I do
> think it is crazy to migrate a process that has called vfork before
> calling exec. Other useses of CLONE_VM seem even crazier.
>
> I think the easiest change to make in mem_cgroup_can_attach would
> be just to change the test for when charges are migrated. AKA
>
> from:
>
> if (mm->owner == p) {
> ....
> }
>
> to
> if (mem_cgroup_from_task(p) == mm->memcg) {
> ...
> }
>
> That allows using mm->memcg with no new limitations on when migration
> can be called. In crazy cases that has the potential to change which
> memcgroup the charges are accounted to, but the choice is already
> somewhat arbitrary so I don't think that will be a problem. Especially
> given that mm_update_next_owner does not migrate charges if the next
> owner is in a different memory cgroup. A mm with tasks using it in
> two different cgroups is already questionable if not outright
> problematic.
>
>
> Kirill Tkhai do you think you would be able adapt Michal Hoko's old
> patch at https://marc.info/?l=linux-kernel&m=143635857131756&w=2
> that replaces mm->owner with mm->memcg?

I was on vacation. Sorry for the late reply.

> We probably want to outlaw migrating an mm where we are not migrating
> all of the mm->users eventually. Just because that case is crazy.
> But it doesn't look like we need to do that to fix the memory control
> group data structures.

Kirill

2018-05-03 13:34:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/02, Eric W. Biederman wrote:
>
> +static void mem_cgroup_fork(struct task_struct *tsk)
> +{
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = task_css(tsk, memory_cgrp_id);
> + if (css && css_tryget(css))
> + task_update_memcg(tsk, mem_cgroup_from_css(css));
> + rcu_read_unlock();
> +}

Why do we need it?

The child's mm->memcg was already initialized by mm_init_memcg(), and we can't
race with a migration until cgroup_threadgroup_change_end()?

Oleg.


2018-05-03 14:41:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/02, Eric W. Biederman wrote:
>>
>> +static void mem_cgroup_fork(struct task_struct *tsk)
>> +{
>> + struct cgroup_subsys_state *css;
>> +
>> + rcu_read_lock();
>> + css = task_css(tsk, memory_cgrp_id);
>> + if (css && css_tryget(css))
>> + task_update_memcg(tsk, mem_cgroup_from_css(css));
>> + rcu_read_unlock();
>> +}
>
> Why do we need it?
>
> The child's mm->memcg was already initialized by mm_init_memcg() and we can't
> race with migrate until cgroup_threadgroup_change_end() ?

I admit I missed the cgroup_threadgroup_change_begin /
cgroup_threadgroup_change_end pair in fork. In this case it doesn't
matter, because mm_init_memcg is called from:

copy_mm
dup_mm
mm_init

And copy_mm is called before we call cgroup_threadgroup_change_begin.
So the race remains.
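
Roughly, the ordering is (a heavily simplified sketch of copy_process(),
not the exact code):

static struct task_struct *copy_process(/* ... */)
{
	/* ... */
	retval = copy_mm(clone_flags, p);	/* dup_mm() -> mm_init() ->
						 * mm_init_memcg(): samples
						 * current's memcg here     */
	/* ... a migration of current can still slip in here ... */
	cgroup_threadgroup_change_begin(current);
	retval = cgroup_can_fork(p);
	/* ... the child is linked into the task lists ... */
	cgroup_post_fork(p);			/* runs the ->fork() callbacks,
						 * i.e. mem_cgroup_fork(), which
						 * re-reads the task's css      */
	cgroup_threadgroup_change_end(current);
	/* ... */
}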

We could move cgroup_threadgroup_change_begin earlier, to remove
the need for mem_cgroup_fork. But I have not analyzed that.

Eric

2018-05-03 15:12:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

Balbir Singh <[email protected]> writes:

> On Tue, 01 May 2018 12:35:16 -0500
> [email protected] (Eric W. Biederman) wrote:
>
>> Recently it was reported that mm_update_next_owner could get into
>> cases where it was executing it's fallback for_each_process part of
>> the loop and thus taking up a lot of time.
>>
>> To deal with this replace mm->owner with mm->memcg. This just reduces
>> the complexity of everything. As much as possible I have maintained
>> the current semantics. There are two siginificant exceptions. During
>> fork the memcg of the process calling fork is charged rather than
>> init_css_set. During memory cgroup migration the charges are migrated
>> not if the process is the owner of the mm, but if the process being
>> migrated has the same memory cgroup as the mm.
>>
>> I believe it was a bug if init_css_set is charged for memory activity
>> during fork, and the old behavior was simply a consequence of the new
>> task not having tsk->cgroup initialized to its proper cgroup.
>
> That does sound like a bug, I guess we've not seen it because we did
> not track any slab allocations initially.


>> During cgroup migration only thread group leaders are allowed to
>> migrate. Which means in practice there should only be one. Linux
>> tasks created with CLONE_VM are the only exception, but the common
>> cases are already ruled out. Processes created with vfork have a
>> suspended parent and can do nothing but call exec so they should never
>> show up. Threads of the same cgroup are not the thread group leader
>> so also should not show up. That leaves the old LinuxThreads library
>> which is probably out of use by now, and someone doing something very
>> creative with cgroups, and rolling their own threads with CLONE_VM.
>> So in practice I don't think the difference in charge migration will
>> affect anyone.
>>
>> To ensure that mm->memcg is updated appropriately I have implemented
>> cgroup "attach" and "fork" methods. This ensures that at those
>> points the mm pointed to by the task has the appropriate memory cgroup.
>>
>> For simplicity instead of introducing a new mm lock I simply use
>> exchange on the pointer where the mm->memcg is updated to get
>> atomic updates.
>>
>> Looking at the history effectively this change is a revert. The
>> reason given for adding mm->owner is so that multiple cgroups can be
>> attached to the same mm. In the last 8 years a second user of
>> mm->owner has not appeared. A feature that has never been used, makes the
>> code more complicated, and has horrible worst-case performance should
>> go.
>
> The idea was to track the mm to the right cgroup, we did find that
> the mm could be confused as belonging to two cgroups. tsk->cgroup is
> not sufficient, and when the tgid left, we needed an owner to track
> where the current allocations were. But this is from 8 year old history,
> I don't have my notes anymore :)

I was referring to the change 8ish years ago where mm->memcg was
replaced with mm->owner. Semantically the two concepts should both
be perfectly capable of resolving which cgroup the mm belongs to.

>> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
>> +{
>> + struct cgroup_subsys_state *css;
>> + struct task_struct *tsk;
>> +
>> + cgroup_taskset_for_each(tsk, css, tset) {
>> + struct mem_cgroup *new = mem_cgroup_from_css(css);
>> + css_get(css);
>> + task_update_memcg(tsk, new);
>
> I'd have to go back and check and I think your comment refers to this,
> but we don't expect non tgid tasks to show up here? My concern is I can't
> find the guarantee that task_update_memcg(tsk, new) is not
>
> 1. Duplicated for each thread in the process or attached to the mm
> 2. Do not update mm->memcg to point to different places, so the one
> that sticks is the one that updated things last.

For cgroupv2 which only operates on processes we have such a guarantee.

There is no such guarantee for cgroupv1. But it would take someone
being crazy to try this.

We can add a guarantee to can_attach that we move all of the threads in
a process, and we probably should. However having mm->memcg is more
important structurally than what crazy we let in. So let's make this
change first as safely as we can, and then we don't lose important
data structure simplifications if it turns out we have to revert a change
to make the memory cgroup per-process in cgroupv1.


There are some serious issues with the interactions between the memory
control group and the concept of thread group leader, that show up when
you consider a zombie thread group leader that has called cgroup_exit.
So I am not anxious to stir in concepts like thread_group_leader into
new code unless there is a very good reason.

We don't expect crazy, but the code allows it, and I have not changed
that.

Eric


2018-05-04 05:01:22

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: Replace mm->owner with mm->memcg

On Fri, May 4, 2018 at 1:11 AM, Eric W. Biederman <[email protected]> wrote:
> Balbir Singh <[email protected]> writes:
>
>> On Tue, 01 May 2018 12:35:16 -0500
>> [email protected] (Eric W. Biederman) wrote:
>>
>>> Recently it was reported that mm_update_next_owner could get into
>>> cases where it was executing its fallback for_each_process part of
>>> the loop and thus taking up a lot of time.
>>>
>>> To deal with this replace mm->owner with mm->memcg. This just reduces
>>> the complexity of everything. As much as possible I have maintained
>>> the current semantics. There are two significant exceptions. During
>>> fork the memcg of the process calling fork is charged rather than
>>> init_css_set. During memory cgroup migration the charges are migrated
>>> not if the process is the owner of the mm, but if the process being
>>> migrated has the same memory cgroup as the mm.
>>>
>>> I believe it was a bug if init_css_set is charged for memory activity
>>> during fork, and the old behavior was simply a consequence of the new
>>> task not having tsk->cgroup initialized to its proper cgroup.
>>
>> That does sound like a bug, I guess we've not seen it because we did
>> not track any slab allocations initially.
>
>
>>> During cgroup migration only thread group leaders are allowed to
>>> migrate. Which means in practice there should only be one. Linux
>>> tasks created with CLONE_VM are the only exception, but the common
>>> cases are already ruled out. Processes created with vfork have a
>>> suspended parent and can do nothing but call exec so they should never
>>> show up. Threads of the same cgroup are not the thread group leader
>>> so also should not show up. That leaves the old LinuxThreads library
>>> which is probably out of use by now, and someone doing something very
>>> creative with cgroups, and rolling their own threads with CLONE_VM.
>>> So in practice I don't think the difference in charge migration will
>>> affect anyone.
>>>
>>> To ensure that mm->memcg is updated appropriately I have implemented
>>> cgroup "attach" and "fork" methods. This ensures that at those
>>> points the mm pointed to by the task has the appropriate memory cgroup.
>>>
>>> For simplicity instead of introducing a new mm lock I simply use
>>> exchange on the pointer where the mm->memcg is updated to get
>>> atomic updates.
>>>
>>> Looking at the history effectively this change is a revert. The
>>> reason given for adding mm->owner is so that multiple cgroups can be
>>> attached to the same mm. In the last 8 years a second user of
>>> mm->owner has not appeared. A feature that has never been used, makes the
>>> code more complicated, and has horrible worst-case performance should
>>> go.
>>
>> The idea was to track the mm to the right cgroup, we did find that
>> the mm could be confused as belonging to two cgroups. tsk->cgroup is
>> not sufficient, and when the tgid left, we needed an owner to track
>> where the current allocations were. But this is from 8 year old history,
>> I don't have my notes anymore :)
>
> I was referring to the change 8ish years ago where mm->memcg was
> replaced with mm->owner. Semantically the two concepts should both
> be perfectly capable of resolving which cgroup the mm belongs to.
>

Yep, agreed.

>>> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
>>> +{
>>> + struct cgroup_subsys_state *css;
>>> + struct task_struct *tsk;
>>> +
>>> + cgroup_taskset_for_each(tsk, css, tset) {
>>> + struct mem_cgroup *new = mem_cgroup_from_css(css);
>>> + css_get(css);
>>> + task_update_memcg(tsk, new);
>>
>> I'd have to go back and check and I think your comment refers to this,
>> but we don't expect non tgid tasks to show up here? My concern is I can't
>> find the guarantee that task_update_memcg(tsk, new) is not
>>
>> 1. Duplicated for each thread in the process or attached to the mm
>> 2. Do not update mm->memcg to point to different places, so the one
>> that sticks is the one that updated things last.
>
> For cgroupv2 which only operates on processes we have such a guarantee.
>
> There is no such guarantee for cgroupv1. But it would take someone
> being crazy to try this.
>
> We can add a guarantee to can_attach that we move all of the threads in
> a process, and we probably should. However having mm->memcg is more
> important structurally than what crazy we let in. So let's make this
> change first as safely as we can, and then we don't lose important
> data structure simplifications if it turns out we have to revert a change
> to make the memory cgroup per-process in cgroupv1.
>
>
> There are some serious issues with the interactions between the memory
> control group and the concept of thread group leader, that show up when
> you consider a zombie thread group leader that has called cgroup_exit.
> So I am not anxious to stir in concepts like thread_group_leader into
> new code unless there is a very good reason.
>
> We don't expect crazy, but the code allows it, and I have not changed
> that.
>

OK, let's stick with this.

Acked-by: Balbir Singh <[email protected]>

Balbir Singh

2018-05-04 11:08:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

I am getting the following for nommu with MEMCG=y randconfig:
kernel/fork.o: In function `mm_init':
fork.c:(.text+0x948): undefined reference to `mm_update_memcg'
kernel/fork.o: In function `mmput':
fork.c:(.text+0x119c): undefined reference to `mm_update_memcg'
make: *** [vmlinux] Error 1

I am sorry but I didn't get to look at it much more but I suspect you
just have to move mm_update_memcg up before mem_cgroup_do_precharge.
--
Michal Hocko
SUSE Labs

2018-05-04 14:21:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/03, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > On 05/02, Eric W. Biederman wrote:
> >>
> >> +static void mem_cgroup_fork(struct task_struct *tsk)
> >> +{
> >> + struct cgroup_subsys_state *css;
> >> +
> >> + rcu_read_lock();
> >> + css = task_css(tsk, memory_cgrp_id);
> >> + if (css && css_tryget(css))
> >> + task_update_memcg(tsk, mem_cgroup_from_css(css));
> >> + rcu_read_unlock();
> >> +}
> >
> > Why do we need it?
> >
> > The child's mm->memcg was already initialized by mm_init_memcg() and we can't
> > race with migrate until cgroup_threadgroup_change_end() ?
>
> I admit I missed the cgroup_threadgroup_change_begin
> cgroup_threadgroup_change_end pair in fs fork. In this case it doesn't
> matter because mm_init_memcg is called from:
>
> copy_mm
> dup_mm
> mm_init
>
> And copy_mm is called before we call cgroup_threadgroup_change_begin.
> So the race remains.

Ah yes, you are right.

> We could move cgroup_threadgroup_change_begin earlier, to remove
> the need for mem_cgroup_fork. But I have not analyzed that.

No, cgroup_threadgroup_change_begin() was called early and this was wrong, see
568ac888215c7fb2fabe8ea739b00ec3c1f5d440. Actually there were more problems, say
copy_net() could deadlock because cleanup_net() does do_wait() with net_mutex held.


OK, what about exec() ? mm_init_memcg() initializes bprm->mm->memcg early in
bprm_mm_init(). What if the execing task migrates before exec_mmap() ?

Oleg.


2018-05-04 14:37:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/03, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > On 05/02, Eric W. Biederman wrote:
>> >>
>> >> +static void mem_cgroup_fork(struct task_struct *tsk)
>> >> +{
>> >> + struct cgroup_subsys_state *css;
>> >> +
>> >> + rcu_read_lock();
>> >> + css = task_css(tsk, memory_cgrp_id);
>> >> + if (css && css_tryget(css))
>> >> + task_update_memcg(tsk, mem_cgroup_from_css(css));
>> >> + rcu_read_unlock();
>> >> +}
>> >
>> > Why do we need it?
>> >
>> > The child's mm->memcg was already initialized by mm_init_memcg() and we can't
>> > race with migrate until cgroup_threadgroup_change_end() ?
>>
>> I admit I missed the cgroup_threadgroup_change_begin
>> cgroup_threadgroup_change_end pair in fs fork. In this case it doesn't
>> matter because mm_init_memcg is called from:
>>
>> copy_mm
>> dup_mm
>> mm_init
>>
>> And copy_mm is called before we call cgroup_threadgroup_change_begin.
>> So the race remains.
>
> Ah yes, you are right.
>
>> We could move cgroup_threadgroup_change_begin earlier, to remove
>> the need for mem_cgroup_fork. But I have not analyzed that.
>
> No, cgroup_threadgroup_change_begin() was called early and this was wrong, see
> 568ac888215c7fb2fabe8ea739b00ec3c1f5d440. Actually there were more problems, say
> copy_net() could deadlock because cleanup_net() does do_wait() with net_mutex held.
>
>
> OK, what about exec() ? mm_init_memcg() initializes bprm->mm->memcg early in
> bprm_mm_init(). What if the execing task migrates before exec_mmap() ?

We need the cgroup when the mm is initialized. That way we have the
cgroup information when initializing the mm.

I don't know if a lock preventing changing the cgroup in exec or just
a little bit of code in exec_mmap to ensure mm->memcg is properly set
is the better approach. I have not analyzed that code path.

This does look like a very good place for an incremental patch to close
that race.

Eric

2018-05-04 14:56:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/04, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > OK, what about exec() ? mm_init_memcg() initializes bprm->mm->memcg early in
> > bprm_mm_init(). What if the execing task migrates before exec_mmap() ?
>
> We need the cgroup when the mm is initialized. That way we have the
> cgroup information when initializing the mm.

Yes, we need to initialize new_mm->memcg but iiuc mostly for the error path,

> I don't know if a lock preventing changing the cgroup in exec or just
> a little bit of code in exec_mmap to ensure mm->memcg is properly set
> is the better approach.

I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
nullify mm->memcg.

> This does look like a very good place for an incremental patch to close
> that race.

Hmm. I think v2 makes more sense, but I won't argue.

Oleg.


2018-05-04 15:50:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/04, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > OK, what about exec() ? mm_init_memcg() initializes bprm->mm->memcg early in
>> > bprm_mm_init(). What if the execing task migrates before exec_mmap() ?
>>
>> We need the cgroup when the mm is initialized. That way we have the
>> cgroup information when initializing the mm.
>
> Yes, we need to initialize new_mm->memcg but iiuc mostly for the error path,
>
>> I don't know if a lock preventing changing the cgroup in exec or just
>> a little bit of code in exec_mmap to ensure mm->memcg is properly set
>> is the better approach.
>
> I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
> nullify mm->memcg.

There is at least one common path where we need the memory control group
properly initialized so memory allocations don't escape the memory
control group.

do_execveat_common
copy_strings
get_arg_page
get_user_pages_remote
__get_user_pages_locked
__get_user_pages
faultin_page
handle_mm_fault
count_memcg_event_mm
__handle_mm_fault
handle_pte_fault
do_anonymous_page
mem_cgroup_try_charge

I am surprised I can't easily find more. Apparently in load_elf_binary
we call elf_mmap after set_new_exec and install_exec_creds, making
a graceful recovery from elf_mmap failures impossible.

In any case we most definitely need the memory control group properly
setup before exec_mmap.
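
For context, mm_init_memcg() in the patch under discussion is assumed to
do roughly the following when bprm->mm is created (a sketch, not the
exact hunk from the patch):

        /* Sketch: called from mm_init(), where mm_init_owner() used to be. */
        static void mm_init_memcg(struct mm_struct *mm)
        {
                struct cgroup_subsys_state *css;

                mm->memcg = NULL;

                /* Charge the nascent mm to the cgroup of the task creating it. */
                rcu_read_lock();
                css = task_css(current, memory_cgrp_id);
                if (css && css_tryget(css))
                        mm->memcg = mem_cgroup_from_css(css);
                rcu_read_unlock();
        }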

Eric

2018-05-04 16:23:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/04, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
> > nullify mm->memcg.
>
> There is at least one common path where we need the memory control group
> properly initialized so memory allocations don't escape the memory
> control group.
>
> do_execveat_common
> copy_strings
> get_arg_page
> get_user_pages_remote
> __get_user_pages_locked
> __get_user_pages
> faultin_page
> handle_mm_fault
> count_memcg_event_mm
> __handle_mm_fault
> handle_pte_fault
> do_anonymous_page
> mem_cgroup_try_charge
>
> I am surprised I can't easily find more. Apparently in load_elf_binary
> we call elf_mmap after set_new_exec and install_exec_creds, making
> a graceful recovery from elf_mmap failures impossible.
>
> In any case we most definitely need the memory control group properly
> setup before exec_mmap.

Confused ...

new_mm->memcg has no effect until exec_mmap(), why it can't be NULL ?

and why do you think mem_cgroup_try_charge() can use the wrong memcg
in this case?

Oleg.


2018-05-04 16:41:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/04, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
>> > nullify mm->memcg.
>>
>> There is at least one common path where we need the memory control group
>> properly initialized so memory allocations don't escape the memory
>> control group.
>>
>> do_execveat_common
>> copy_strings
>> get_arg_page
>> get_user_pages_remote
>> __get_user_pages_locked
>> __get_user_pages
>> faultin_page
>> handle_mm_fault
>> count_memcg_event_mm
>> __handle_mm_fault
>> handle_pte_fault
>> do_anonymous_page
>> mem_cgroup_try_charge
>>
>> I am surprised I can't easily find more. Apparently in load_elf_binary
>> we call elf_mmap after set_new_exec and install_exec_creds, making
>> a graceful recovery from elf_mmap failures impossible.
>>
>> In any case we most definitely need the memory control group properly
>> setup before exec_mmap.
>
> Confused ...
>
> new_mm->memcg has no effect until exec_mmap(), why it can't be NULL ?

new_mm->memcg does have an effect before exec_mmap.

> and why do you think mem_cgroup_try_charge() can use the wrong memcg
> in this case?

It would only use the wrong memcg if you had bprm->mm->memcg == NULL.

mm_init_memcg is at the same point as mm_init_owner. So my change did
not introduce any logic changes on when the memory control group became
valid.

get_user_pages_remote is passed bprm->mm. So all of the above happens
on bprm->mm. Then bprm->mm->memcg is charged and fails if the memory
control group is full.


It would actually be pointless to allocate and initialize bprm->mm as
early as we do if we were not using it before exec_mmap. Using bprm->mm
implies there will be charges to the bprm->mm->memcg.

If bprm->mm was not used until exec_mmap we could just delay the
allocation until there and avoid this kind of challenge.

Eric

2018-05-04 17:27:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 1/2] memcg: Update the mm->memcg maintenance to work when !CONFIG_MMU


Michal Hocko <[email protected]> reported:
> I am getting the following for nommu with MEMCG=y randconfig:
> kernel/fork.o: In function `mm_init':
> fork.c:(.text+0x948): undefined reference to `mm_update_memcg'
> kernel/fork.o: In function `mmput':
> fork.c:(.text+0x119c): undefined reference to `mm_update_memcg'
> make: *** [vmlinux] Error 1

A memory control group that only accounts mm activity on page faults
seems pointless when !MMU. Still, I don't intend to break it, so
move the mm->memcg changes out of the CONFIG_MMU-only section.

Reported-by: Michal Hocko <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/memcontrol.c | 109 ++++++++++++++++++++++++++------------------------------
1 file changed, 51 insertions(+), 58 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3745ccbd8c0..d74aeba7dfed 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4846,59 +4846,6 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
return ret;
}

-/**
- * mm_update_memcg - Update the memory cgroup of a mm_struct
- * @mm: mm struct
- * @new: new memory cgroup value
- *
- * Called whenever mm->memcg needs to change. Consumes a reference
- * to new (unless new is NULL). The reference to the old memory
- * cgroup is decreased.
- */
-void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
-{
- /* This is the only place where mm->memcg is changed */
- struct mem_cgroup *old;
-
- old = xchg(&mm->memcg, new);
- if (old)
- css_put(&old->css);
-}
-
-static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
-{
- struct mm_struct *mm;
- task_lock(tsk);
- mm = tsk->mm;
- if (mm && !(tsk->flags & PF_KTHREAD))
- mm_update_memcg(mm, new);
- task_unlock(tsk);
-}
-
-static void mem_cgroup_attach(struct cgroup_taskset *tset)
-{
- struct cgroup_subsys_state *css;
- struct task_struct *tsk;
-
- cgroup_taskset_for_each(tsk, css, tset) {
- struct mem_cgroup *new = mem_cgroup_from_css(css);
- css_get(css);
- task_update_memcg(tsk, new);
- }
-}
-
-static void mem_cgroup_fork(struct task_struct *tsk)
-{
- struct cgroup_subsys_state *css;
-
- rcu_read_lock();
- css = task_css(tsk, memory_cgrp_id);
- if (css && css_tryget(css))
- task_update_memcg(tsk, mem_cgroup_from_css(css));
- rcu_read_unlock();
-}
-
-
static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
if (mc.to)
@@ -5067,19 +5014,65 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
return 0;
}
-static void mem_cgroup_attach(struct cgroup_taskset *tset)
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
}
-static void mem_cgroup_fork(struct task_struct *task)
+static void mem_cgroup_move_task(void)
{
}
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+#endif
+
+/**
+ * mm_update_memcg - Update the memory cgroup of a mm_struct
+ * @mm: mm struct
+ * @new: new memory cgroup value
+ *
+ * Called whenever mm->memcg needs to change. Consumes a reference
+ * to new (unless new is NULL). The reference to the old memory
+ * cgroup is decreased.
+ */
+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
{
+ /* This is the only place where mm->memcg is changed */
+ struct mem_cgroup *old;
+
+ old = xchg(&mm->memcg, new);
+ if (old)
+ css_put(&old->css);
}
-static void mem_cgroup_move_task(void)
+
+static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
+{
+ struct mm_struct *mm;
+ task_lock(tsk);
+ mm = tsk->mm;
+ if (mm && !(tsk->flags & PF_KTHREAD))
+ mm_update_memcg(mm, new);
+ task_unlock(tsk);
+}
+
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
{
+ struct cgroup_subsys_state *css;
+ struct task_struct *tsk;
+
+ cgroup_taskset_for_each(tsk, css, tset) {
+ struct mem_cgroup *new = mem_cgroup_from_css(css);
+ css_get(css);
+ task_update_memcg(tsk, new);
+ }
+}
+
+static void mem_cgroup_fork(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = task_css(tsk, memory_cgrp_id);
+ if (css && css_tryget(css))
+ task_update_memcg(tsk, mem_cgroup_from_css(css));
+ rcu_read_unlock();
}
-#endif

/*
* Cgroup retains root cgroups across [un]mount cycles making it necessary
--
2.14.1


2018-05-04 17:28:23

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/2] mm->owner to mm->memcg fixes


Andrew can you pick up these two fixes as well.

These address the issues Michal Hocko and Oleg Nesterov noticed.

Eric W. Biederman (2):
memcg: Update the mm->memcg maintenance to work when !CONFIG_MMU
memcg: Close the race between migration and installing bprm->mm as mm

fs/exec.c | 2 +
include/linux/memcontrol.h | 5 ++
mm/memcontrol.c | 111 +++++++++++++++++++++------------------------
3 files changed, 59 insertions(+), 59 deletions(-)




2018-05-04 17:28:50

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm


Oleg pointed out that there is a race at exec time between when
bprm->mm is initialized and the exec'ing task being migrated to a
different memory control group.

Refactor the code in memcontrol so exec_mmap can use the same code as
fork to ensure that task->memcg == task->mm->memcg.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 2 ++
include/linux/memcontrol.h | 5 +++++
mm/memcontrol.c | 4 ++--
3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a8be9318d1a8..32461a1543fc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1044,6 +1044,8 @@ static int exec_mmap(struct mm_struct *mm)
return 0;
}
mmdrop(active_mm);
+ /* The tsk may have migrated before the new mm was attached */
+ mm_sync_memcg_from_task(tsk);
return 0;
}

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 147e04bfcaee..9b68d9f2740e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -402,6 +402,7 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
}

void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
+void mm_sync_memcg_from_task(struct task_struct *tsk);

static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
@@ -786,6 +787,10 @@ static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
{
}

+static inline void mm_sync_memcg_from_task(struct task_struct *tsk)
+{
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d74aeba7dfed..552657613c0b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5063,7 +5063,7 @@ static void mem_cgroup_attach(struct cgroup_taskset *tset)
}
}

-static void mem_cgroup_fork(struct task_struct *tsk)
+void mm_sync_memcg_from_task(struct task_struct *tsk)
{
struct cgroup_subsys_state *css;

@@ -5377,7 +5377,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
- .fork = mem_cgroup_fork,
+ .fork = mm_sync_memcg_from_task,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
--
2.14.1


2018-05-05 16:56:53

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Hi Eric,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.17-rc3]
[cannot apply to next-20180504]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/memcg-Replace-mm-owner-with-mm-memcg/20180503-120054
config: xtensa-nommu_kc705_defconfig (attached as .config)
compiler: xtensa-de212-elf-gcc (crosstool-NG crosstool-ng-1.23.0-307-g452ee331) 7.3.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All errors (new ones prefixed by >>):

kernel/fork.o: In function `sighand_ctor':
>> fork.c:(.text+0x3b0): undefined reference to `mm_update_memcg'
kernel/fork.o: In function `mmdrop_async':
fork.c:(.text+0x43a): undefined reference to `mm_update_memcg'
kernel/fork.o: In function `set_task_stack_end_magic':
fork.c:(.text+0x752): undefined reference to `mm_update_memcg'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (1.32 kB)
.config.gz (10.06 kB)
Download all attachments

2018-05-07 14:44:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/04, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > On 05/04, Eric W. Biederman wrote:
> >>
> >> Oleg Nesterov <[email protected]> writes:
> >>
> >> > I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
> >> > nullify mm->memcg.
> >>
> >> There is at least one common path where we need the memory control group
> >> properly initialized so memory allocations don't escape the memory
> >> control group.
> >>
> >> do_execveat_common
> >> copy_strings
> >> get_arg_page
> >> get_user_pages_remote
> >> __get_user_pages_locked
> >> __get_user_pages
> >> faultin_page
> >> handle_mm_fault
> >> count_memcg_event_mm
> >> __handle_mm_fault
> >> handle_pte_fault
> >> do_anonymous_page
> >> mem_cgroup_try_charge

Ah yes, indeed.

> mm_init_memcg is at the same point as mm_init_owner. So my change did
> not introduce any logic changes on when the memory control group became
> valid.

Not sure, but perhaps I am all confused....

before your patch get_mem_cgroup_from_mm() looks at mm->owner == current
(in this case) and mem_cgroup_from_task() should return the correct memcg
even if execing task migrates after bprm_mm_init(). At least in the common
case when the old mm is not shared.

After your patch the memory allocations in copy_strings() won't be accounted
correctly, bprm->mm->memcg is wrong if this task migrates. And iiuc your recent
"[PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm"
doesn't fix the problem.

No?

Perhaps we can change get_mem_cgroup_from_mm() to use
mem_cgroup_from_css(current, memory_cgrp_id) if mm->memcg == NULL?
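
That is, something along these lines inside the existing rcu_read_lock()
section of get_mem_cgroup_from_mm() (a sketch only, reading the
suggestion as task_css() + mem_cgroup_from_css(); the css_tryget_online()
retry from the current code is elided):

        memcg = rcu_dereference(mm->memcg);
        if (unlikely(!memcg))   /* e.g. the nascent bprm->mm */
                memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));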

Oleg.


2018-05-07 23:18:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On Sun, 6 May 2018 00:54:48 +0800 kbuild test robot <[email protected]> wrote:

> Hi Eric,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.17-rc3]
> [cannot apply to next-20180504]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/memcg-Replace-mm-owner-with-mm-memcg/20180503-120054
> config: xtensa-nommu_kc705_defconfig (attached as .config)
> compiler: xtensa-de212-elf-gcc (crosstool-NG crosstool-ng-1.23.0-307-g452ee331) 7.3.0
> reproduce:
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=xtensa
>
> All errors (new ones prefixed by >>):
>
> kernel/fork.o: In function `sighand_ctor':
> >> fork.c:(.text+0x3b0): undefined reference to `mm_update_memcg'
> kernel/fork.o: In function `mmdrop_async':
> fork.c:(.text+0x43a): undefined reference to `mm_update_memcg'
> kernel/fork.o: In function `set_task_stack_end_magic':
> fork.c:(.text+0x752): undefined reference to `mm_update_memcg'

Due to the combination of CONFIG_MMU=n and CONFIG_MEMCG=y.
mm/memcontrol.c's mm_update_memcg() depends upon CONFIG_MMU=y.

Is this a valid combination? Do we actually support (and test!) memcg
on nommu systems?


2018-05-08 02:18:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Andrew Morton <[email protected]> writes:

> On Sun, 6 May 2018 00:54:48 +0800 kbuild test robot <[email protected]> wrote:
>
>> Hi Eric,
>>
>> Thank you for the patch! Yet something to improve:
>>
>> [auto build test ERROR on linus/master]
>> [also build test ERROR on v4.17-rc3]
>> [cannot apply to next-20180504]
>> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>>
>> url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/memcg-Replace-mm-owner-with-mm-memcg/20180503-120054
>> config: xtensa-nommu_kc705_defconfig (attached as .config)
>> compiler: xtensa-de212-elf-gcc (crosstool-NG crosstool-ng-1.23.0-307-g452ee331) 7.3.0
>> reproduce:
>> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>> chmod +x ~/bin/make.cross
>> # save the attached .config to linux build tree
>> make.cross ARCH=xtensa
>>
>> All errors (new ones prefixed by >>):
>>
>> kernel/fork.o: In function `sighand_ctor':
>> >> fork.c:(.text+0x3b0): undefined reference to `mm_update_memcg'
>> kernel/fork.o: In function `mmdrop_async':
>> fork.c:(.text+0x43a): undefined reference to `mm_update_memcg'
>> kernel/fork.o: In function `set_task_stack_end_magic':
>> fork.c:(.text+0x752): undefined reference to `mm_update_memcg'
>
> Due to the combination of CONFIG_MMU=n and CONFIG_MEMCG=y.
> mm/memcontrol.c's mm_update_memcg() depends upon CONFIG_MMU=y.
>
> Is this a valid combination? Do we actually support (and test!) memcg
> on nommu systems?

So this was spotted by a human before any automated system so I presume
that the combination is actually supported and at least build tested.

In particular the incremental patch series that fixes that issue has
the subject lines below. Andrew, if you could apply them, that would be great.

[PATCH 0/2] mm->owner to mm->memcg fixes
[PATCH 1/2] memcg: Update the mm->memcg maintenance to work when !CONFIG_MMU
[PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm


Andrew, since this is essentially an mm patch, I presume this work should
go through your tree.

Eric


2018-05-08 03:17:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/04, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > On 05/04, Eric W. Biederman wrote:
>> >>
>> >> Oleg Nesterov <[email protected]> writes:
>> >>
>> >> > I'd vote for the change in exec_mmap(). This way mm_init_memcg() can just
>> >> > nullify mm->memcg.
>> >>
>> >> There is at least one common path where we need the memory control group
>> >> properly initialized so memory allocations don't escape the memory
>> >> control group.
>> >>
>> >> do_execveat_common
>> >> copy_strings
>> >> get_arg_page
>> >> get_user_pages_remote
>> >> __get_user_pages_locked
>> >> __get_user_pages
>> >> faultin_page
>> >> handle_mm_fault
>> >> count_memcg_event_mm
>> >> __handle_mm_fault
>> >> handle_pte_fault
>> >> do_anonymous_page
>> >> mem_cgroup_try_charge
>
> Ah yes, indeed.
>
>> mm_init_memcg is at the same point as mm_init_owner. So my change did
>> not introduce any logic changes on when the memory control group became
>> valid.
>
> Not sure, but perhaps I am all confused....

So. Actions that are initiated while a process is calling exec can
logically happen either before or after the exec. That is
the way these races are always resolved. I don't see any reason
for the memory control group to be any different.

Which means it is perfectly valid for a migration that technically
happens in the middle of copy_strings to not show up until some time
later in exec.

This works because there are no user visible effects until exec
completes. If there are user visible effects they are bugs.

That is all it takes to understand why my patch fixes things.

It might make more sense to simply fail migration if task->in_execve.
While perhaps the best solution, that might introduce a user-visible
effect that we don't care about right now.
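
e.g. something along these lines in mem_cgroup_can_attach() (purely a
sketch of that alternative, not part of the posted patches):

        struct cgroup_subsys_state *css;
        struct task_struct *p;

        cgroup_taskset_for_each(p, css, tset) {
                if (p->in_execve)
                        return -EBUSY;  /* refuse to migrate a task mid-exec */
        }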

> before your patch get_mem_cgroup_from_mm() looks at mm->owner == current
> (in this case) and mem_cgroup_from_task() should return the correct memcg
> even if execing task migrates after bprm_mm_init(). At least in the common
> case when the old mm is not shared.
>
> After your patch the memory allocations in copy_strings() won't be accounted
> correctly, bprm->mm->memcg is wrong if this task migrates. And iiuc your recent
> "[PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm"
> doesn't fix the problem.
>
> No?

The patch does solve the issue. There should be nothing a userspace
process can observe that should tell it where in the middle of exec
such a migration happened, so placing the migration at what from the
kernel's perspective might be technically later should not be a problem.

If it is a problem the issue is that there is a way to observe the
difference.

> Perhaps we can change get_mem_cgroup_from_mm() to use
> mem_cgroup_from_css(current, memory_cgrp_id) if mm->memcg == NULL?

Please God no. Having any unnecessary special case is just going to
confuse people and cause bugs.

Plus as I have pointed out above that is not an issue.

Eric

2018-05-09 14:42:18

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> > before your patch get_mem_cgroup_from_mm() looks at mm->owner == current
> > (in this case) and mem_cgroup_from_task() should return the correct memcg
> > even if execing task migrates after bprm_mm_init(). At least in the common
> > case when the old mm is not shared.
> >
> > After your patch the memory allocations in copy_strings() won't be accounted
> > correctly, bprm->mm->memcg is wrong if this task migrates. And iiuc your recent
> > "[PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm"
> > doesn't fix the problem.
> >
> > No?
>
> The patch does solve the issue. There should be nothing a userspace
> process can observe that should tell it where in the middle of exec
> such a migration happened, so placing the migration at what from the
> kernel's perspective might be technically later should not be a problem.
>
> If it is a problem the issue is that there is a way to observe the
> difference.

So. The task migrates from some MEMCG right after bprm_mm_init().

copy_strings() triggers OOM in MEMCG. This is quite possible, it can use a lot
of memory and that is why we have acct_arg_size() to make these allocations
visible to oom killer.

task_in_mem_cgroup(MEMCG) returns false and oom killer has to kill another
innocent process in MEMCG.

Does this look like a way to observe the difference?

> > Perhaps we can change get_mem_cgroup_from_mm() to use
> > mem_cgroup_from_css(current, memory_cgrp_id) if mm->memcg == NULL?
>
> Please God no. Having any unnecessary special case is just going to
> confuse people and cause bugs.

To me the unnecessary special case is the new_mm->memcg which is used for
accounting but doesn't follow migration till exec_mmap(). But I won't argue.

Oleg.


2018-05-09 14:52:58

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm

On 05/04, Eric W. Biederman wrote:
>
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1044,6 +1044,8 @@ static int exec_mmap(struct mm_struct *mm)
> return 0;
> }
> mmdrop(active_mm);
> + /* The tsk may have migrated before the new mm was attached */
> + mm_sync_memcg_from_task(tsk);
> return 0;
> }

call_usermodehelper(). perhaps it makes sense to change flush_old_exec()
to clear PF_KTHREAD/etc before exec_mmap().

Oleg.


2018-05-09 21:00:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On Mon 07-05-18 16:18:01, Andrew Morton wrote:
> On Sun, 6 May 2018 00:54:48 +0800 kbuild test robot <[email protected]> wrote:
>
> > Hi Eric,
> >
> > Thank you for the patch! Yet something to improve:
> >
> > [auto build test ERROR on linus/master]
> > [also build test ERROR on v4.17-rc3]
> > [cannot apply to next-20180504]
> > [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> >
> > url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/memcg-Replace-mm-owner-with-mm-memcg/20180503-120054
> > config: xtensa-nommu_kc705_defconfig (attached as .config)
> > compiler: xtensa-de212-elf-gcc (crosstool-NG crosstool-ng-1.23.0-307-g452ee331) 7.3.0
> > reproduce:
> > wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # save the attached .config to linux build tree
> > make.cross ARCH=xtensa
> >
> > All errors (new ones prefixed by >>):
> >
> > kernel/fork.o: In function `sighand_ctor':
> > >> fork.c:(.text+0x3b0): undefined reference to `mm_update_memcg'
> > kernel/fork.o: In function `mmdrop_async':
> > fork.c:(.text+0x43a): undefined reference to `mm_update_memcg'
> > kernel/fork.o: In function `set_task_stack_end_magic':
> > fork.c:(.text+0x752): undefined reference to `mm_update_memcg'
>
> Due to the combination of CONFIG_MMU=n and CONFIG_MEMCG=y.
> mm/memcontrol.c's mm_update_memcg() depends upon CONFIG_MMU=y.
>
> Is this a valid combination? Do we actually support (and test!) memcg
> on nommu systems?

Well, I was arguing we should mark it BROKEN quite some time ago but
Johannes and you IIRC objected that we shouldn't do that as long as
nobody complains it doesn't work. To be completely honest I am skeptical
anybody uses memcg on nommu, but I might be easily wrong here. I didn't
care enough to pursue back then...

--
Michal Hocko
SUSE Labs

2018-05-10 03:00:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm

Oleg Nesterov <[email protected]> writes:

> On 05/04, Eric W. Biederman wrote:
>>
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1044,6 +1044,8 @@ static int exec_mmap(struct mm_struct *mm)
>> return 0;
>> }
>> mmdrop(active_mm);
>> + /* The tsk may have migrated before the new mm was attached */
>> + mm_sync_memcg_from_task(tsk);
>> return 0;
>> }
>
> call_usermodehelper(). perhaps it makes sense to change flush_old_exec()
> to clear PF_KTHREAD/etc before exec_mmap().

Yes. That does look like something to be fixed. In practice while it
is rare for anything to be migrated during exec I expect it to be even
rarer for kernel threads. But that is a legitimate issue.

I expect all of these lines after exec_mmap actually belong
in exec_mmap. Which would then make rearraning the lines easy to ensure
mm_sync_memcg_from_task sees PF_KTHREAD cleared.
/*
* After clearing bprm->mm (to mark that current is using the
* prepared mm now), we have nothing left of the original
* process. If anything from here on returns an error, the check
* in search_binary_handler() will SEGV current.
*/
bprm->mm = NULL;

set_fs(USER_DS);
current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
PF_NOFREEZE | PF_NO_SETAFFINITY);
flush_thread();
current->personality &= ~bprm->per_clear;
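
i.e. roughly (a sketch of the reordering only, not a tested patch):

        /* At the end of exec_mmap(), once the new mm is installed: */
        tsk->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
                        PF_NOFREEZE | PF_NO_SETAFFINITY);
        /* PF_KTHREAD is now clear, so the memcg sync treats this as a user mm. */
        mm_sync_memcg_from_task(tsk);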

Eric


2018-05-10 03:10:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

Oleg Nesterov <[email protected]> writes:

> On 05/07, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <[email protected]> writes:
>>
>> > before your patch get_mem_cgroup_from_mm() looks at mm->owner == current
>> > (in this case) and mem_cgroup_from_task() should return the correct memcg
>> > even if execing task migrates after bprm_mm_init(). At least in the common
>> > case when the old mm is not shared.
>> >
>> > After your patch the memory allocations in copy_strings() won't be accounted
>> > correctly, bprm->mm->memcg is wrong if this task migrates. And iiuc your recent
>> > "[PATCH 2/2] memcg: Close the race between migration and installing bprm->mm as mm"
>> > doesn't fix the problem.
>> >
>> > No?
>>
>> The patch does solve the issue. There should be nothing a userspace
>> process can observe that should tell it where in the middle of exec
>> such a migration happened, so placing the migration at what from the
>> kernel's perspective might be technically later should not be a problem.
>>
>> If it is a problem the issue is that there is a way to observe the
>> difference.
>
> So. The task migrates from some MEMCG right after bprm_mm_init().
>
> copy_strings() triggers OOM in MEMCG. This is quite possible, it can use a lot
> of memory and that is why we have acct_arg_size() to make these allocations
> visible to oom killer.
>
> task_in_mem_cgroup(MEMCG) returns false and oom killer has to kill another
> innocent process in MEMCG.
>
> Does this look like a way to observe the difference?

Sort of.

I don't know how the memcg gets away without migrating charges
when it migrates a process. With charges not being migrated
I don't think this is observable.

That does look like a real issue however.

>> > Perhaps we can change get_mem_cgroup_from_mm() to use
>> > mem_cgroup_from_css(current, memory_cgrp_id) if mm->memcg == NULL?
>>
>> Please God no. Having any unnecessary special case is just going to
>> confuse people and cause bugs.
>
> To me the unnecessary special case is the new_mm->memcg which is used for
> accounting but doesn't follow migration till exec_mmap(). But I won't
> argue.

Eric

2018-05-10 04:04:57

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH] cgroup: Don't mess with tasks in exec


Semantically exec is supposed to be atomic with no user space visible
intermediate points. Migrating tasks during exec may change that and
lead to all manner of difficult to analyze and maintain corner cases.

So avoid the problems by simply blocking cgroup migration over the
entirety of exec.

Reported-by: Oleg Nesterov <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

Unless this leads to some kind of deadlock
fs/exec.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 32461a1543fc..54bb01cfc635 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1101,7 +1099,6 @@ static int de_thread(struct task_struct *tsk)
struct task_struct *leader = tsk->group_leader;

for (;;) {
- cgroup_threadgroup_change_begin(tsk);
write_lock_irq(&tasklist_lock);
/*
* Do this under tasklist_lock to ensure that
@@ -1112,7 +1109,6 @@ static int de_thread(struct task_struct *tsk)
break;
__set_current_state(TASK_KILLABLE);
write_unlock_irq(&tasklist_lock);
- cgroup_threadgroup_change_end(tsk);
schedule();
if (unlikely(__fatal_signal_pending(tsk)))
goto killed;
@@ -1750,6 +1746,7 @@ static int do_execveat_common(int fd, struct filename *filename,
if (retval)
goto out_free;

+ cgroup_threadgroup_change_begin(current);
check_unsafe_exec(bprm);
current->in_execve = 1;

@@ -1822,6 +1819,7 @@ static int do_execveat_common(int fd, struct filename *filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ cgroup_threadgroup_change_end(current);
membarrier_execve(current);
acct_update_integrals(current);
task_numa_free(current);
@@ -1841,6 +1839,7 @@ static int do_execveat_common(int fd, struct filename *filename,
out_unmark:
current->fs->in_exec = 0;
current->in_execve = 0;
+ cgroup_threadgroup_change_end(current);

out_free:
free_bprm(bprm);
--
2.14.1


2018-05-10 12:14:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Fri 04-05-18 12:26:03, Eric W. Biederman wrote:
>
> Andrew can you pick up these two fixes as well.
>
> These address the issues Michal Hocko and Oleg Nesterov noticed.

I completely got lost in this thread. There are way too many things
discussed at once. Could you repost the full series for a proper review
please?
--
Michal Hocko
SUSE Labs

2018-05-10 12:17:15

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC][PATCH] cgroup: Don't mess with tasks in exec

On 05/09, Eric W. Biederman wrote:
>
> Semantically exec is supposed to be atomic with no user space visible
> intermediate points. Migrating tasks during exec may change that and
> lead to all manner of difficult to analyze and maintain corner cases.

Apart from the race with copy_strings() we discussed in another thread?

> So avoid the problems by simply blocking cgroup migration over the
> entirety of exec.

This patch, even if it were correct, would bring many more problems.

If nothing else exec() is very slow. If it races with migration which needs
this sem for writing the new readers will be blocked. This means that clone(),
exit(), or another exec() will block too.

Now, if some IO path does kthread_stop(), we have a deadlock.

Or request_module() in search_binary_handler(). Deadlock.

Plus this adds a nice security problem: a PTRACE_O_TRACEEXEC'ed task will
sleep in TASK_TRACED with cgroup_threadgroup_rwsem held.

Oleg.

> Reported-by: Oleg Nesterov <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> Unless this leads to some kind of deadlock
> fs/exec.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 32461a1543fc..54bb01cfc635 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1101,7 +1099,6 @@ static int de_thread(struct task_struct *tsk)
> struct task_struct *leader = tsk->group_leader;
>
> for (;;) {
> - cgroup_threadgroup_change_begin(tsk);
> write_lock_irq(&tasklist_lock);
> /*
> * Do this under tasklist_lock to ensure that
> @@ -1112,7 +1109,6 @@ static int de_thread(struct task_struct *tsk)
> break;
> __set_current_state(TASK_KILLABLE);
> write_unlock_irq(&tasklist_lock);
> - cgroup_threadgroup_change_end(tsk);
> schedule();
> if (unlikely(__fatal_signal_pending(tsk)))
> goto killed;
> @@ -1750,6 +1746,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> if (retval)
> goto out_free;
>
> + cgroup_threadgroup_change_begin(current);
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> @@ -1822,6 +1819,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> /* execve succeeded */
> current->fs->in_exec = 0;
> current->in_execve = 0;
> + cgroup_threadgroup_change_end(current);
> membarrier_execve(current);
> acct_update_integrals(current);
> task_numa_free(current);
> @@ -1841,6 +1839,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> out_unmark:
> current->fs->in_exec = 0;
> current->in_execve = 0;
> + cgroup_threadgroup_change_end(current);
>
> out_free:
> free_bprm(bprm);
> --
> 2.14.1
>


2018-05-10 12:19:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu 10-05-18 14:14:18, Michal Hocko wrote:
> On Fri 04-05-18 12:26:03, Eric W. Biederman wrote:
> >
> > Andrew can you pick up these two fixes as well.
> >
> > These address the issues Michal Hocko and Oleg Nesterov noticed.
>
> I completely got lost in this thread. There are way too many things
> discussed at once. Could you repost the full series for a proper review
> please?

Btw. it would be great if you could have a look at http://lkml.kernel.org/r/[email protected]
where I found some issues with mm->memcg approach and proposed some ways
around it.
--
Michal Hocko
SUSE Labs

2018-05-10 12:37:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH] cgroup: Don't mess with tasks in exec

Hello,

On Thu, May 10, 2018 at 02:15:32PM +0200, Oleg Nesterov wrote:
> > So avoid the problems by simply blocking cgroup migration over the
> > entirety of exec.
>
> This patch, even if it was correct, will bring much more problems.
>
> If nothing else exec() is very slow. If it races with migration which needs
> this sem for writing the new readers will be blocked. This means that clone(),
> exit(), or another exec() will block too.

Agreed. I don't think it's a good idea to expand threadgroup coverage
larger than it already is. I'd much prefer either working around the
copy_strings case specifically or just ignoring it.

Thanks.

--
tejun

2018-05-10 12:39:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] memcg: Replace mm->owner with mm->memcg

On 05/09, Eric W. Biederman wrote:
>
> Oleg Nesterov <[email protected]> writes:
>
> >> The patch does solve the issue. There should be nothing a userspace
> >> process can observe that should tell it where in the middle of exec
> >> such a migration happened, so placing the migration at what from the
> >> kernel's perspective might be technically later should not be a problem.
> >>
> >> If it is a problem the issue is that there is a way to observe the
> >> difference.
> >
> > So. The task migrates from some MEMCG right after bprm_mm_init().
> >
> > copy_strings() triggers OOM in MEMCG. This is quite possible, it can use a lot
> > of memory and that is why we have acct_arg_size() to make these allocations
> > visible to oom killer.
> >
> > task_in_mem_cgroup(MEMCG) returns false and oom killer has to kill another
> > innocent process in MEMCG.
> >
> > Does this look like a way to observe the difference?
>
> Sort of.
>
> I don't know how the memcg gets away without migrating charges
> when it migrates a process. With charges not being migrated
> I don't think this is observable.

Not sure I understand how this connects to accounting...

But yes sure, with or without your change, mem_cgroup_move_task() obviously
can't see the nascent bprm->mm. I have no idea if this is important or
not, and iiuc cgroup v2 doesn't even support ->move_charge_at_immigrate.

As for accounting, I still think that it would be better to nullify ->memcg in
mm_init_memcg(), simply because we cannot initialize it properly: we can race
with migration until exec_mmap/cgroup_post_fork, which need to update ->memcg
anyway.

Yes, this means a special case in get_mem_cgroup_from_mm().

Oleg.


> That does look like a real issue however.
>
> >> > Perhaps we can change get_mem_cgroup_from_mm() to use
> >> > mem_cgroup_from_css(current, memory_cgrp_id) if mm->memcg == NULL?
> >>
> >> Please God no. Having any unnecessary special case is just going to
> >> confuse people and cause bugs.
> >
> > To me the unnecessary special case is the new_mm->memcg which is used for
> > accounting but doesn't follow migration till exec_mmap(). But I won't
> > argue.
>
> Eric


2018-05-22 12:59:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu 10-05-18 14:14:18, Michal Hocko wrote:
> On Fri 04-05-18 12:26:03, Eric W. Biederman wrote:
> >
> > Andrew can you pick up these two fixes as well.
> >
> > These address the issues Michal Hocko and Oleg Nesterov noticed.
>
> I completely got lost in this thread. There are way too many things
> discussed at once. Could you repost the full series for a proper review
> please?

ping
--
Michal Hocko
SUSE Labs

2018-05-23 19:47:23

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

Michal Hocko <[email protected]> writes:

> On Thu 10-05-18 14:14:18, Michal Hocko wrote:
>> On Fri 04-05-18 12:26:03, Eric W. Biederman wrote:
>> >
>> > Andrew can you pick up these two fixes as well.
>> >
>> > These address the issues Michal Hocko and Oleg Nesterov noticed.
>>
>> I completely got lost in this thread. There are way too many things
>> discussed at once. Could you repost the full series for a proper review
>> please?
>
> ping

Quick summary of where things stand. (Just getting back from vacation
and catching up with things).

Andrew has picked up the best version of these patches and you can look
at his tree to find the current patches.

Looking at my tree the issues that have been looked at above
the basic patch are:
!CONFIG_MMU support (code motion)
Races during exec. (Roughly solved but I think we can do better by
expanding the scope of
cgroup_threadgroup_change_begin/end in exec and
just making exec atomic wrt cgroup changes)




While I was on vacation you posted an old concern about a condtion
where get_mem_cgroup_from_mm was not guaranteed to complete, and how
that interacts with charge migration.


Looking at your description the concern is that cgroup_rmdir can succeed
when a cgroup has just an mm in it (and no tasks). The result is
that mem_cgroup_try_charge can loop indefinitely in that case.

That is possible with two processes sharing the same mm, but living in
different memory cgroups. That is a silly useless configuration but
something to support at least to the level of making certain kernels
don't wedge when it happens by accident or maliciously.

The suggestion of having cgroup_is_populated take this into account
seems like a good general solution but I don't think that is strictly
necessary.

In the spirit of "let's make this work", how about:

From: "Eric W. Biederman" <[email protected]>
Date: Wed, 23 May 2018 13:48:31 -0500
Subject: [PATCH] memcontrol: Guarantee that get_mem_cgroup_from_mm does not
loop forever

The root memory cgroup should always be online, so when returning it
unconditionally bump its refcount.

The only way for mm->memcg to point to an offline memory cgroup is if
CLONE_VM was used to create two processes that share a mm and those
processes were put in different memory cgroups, and one of those
processes exited and cgroup_rmdir was then called on its memory
cgroup.

As two processes sharing an mm is useless and highly unlikely there is
no need to handle this case well, it just needs to be handled well
enough to prevent an indefinite loop. So when css_tryget_online fails
just treat the mm as belonging to the root memory cgroup.

The cgroup migration and the cgroup remove actions both happen
with the cgroup_mutex held. So there is no need to worry about
an mm that is migrating cgroups not having a live cgroup.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/memcontrol.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d74aeba7dfed..dbb197bfc517 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -666,25 +666,29 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)

static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
- struct mem_cgroup *memcg = NULL;
+ struct mem_cgroup *memcg;

rcu_read_lock();
- do {
- /*
- * Page cache insertions can happen without an
- * actual mm context, e.g. during disk probing
- * on boot, loopback IO, acct() writes etc.
- */
- if (unlikely(!mm))
- memcg = root_mem_cgroup;
- else {
- memcg = rcu_dereference(mm->memcg);
- if (unlikely(!memcg))
- memcg = root_mem_cgroup;
- }
- } while (!css_tryget_online(&memcg->css));
+ /*
+ * Page cache insertions can happen without an
+ * actual mm context, e.g. during disk probing
+ * on boot, loopback IO, acct() writes etc.
+ */
+ if (unlikely(!mm))
+ goto root_memcg;
+
+ memcg = rcu_dereference(mm->memcg);
+ if (unlikely(!memcg))
+ goto root_memcg;
+ if (!css_tryget_online(&memcg->css))
+ goto root_memcg;
rcu_read_unlock();
return memcg;
+
+root_memcg:
+ css_get(&root_mem_cgroup->css);
+ rcu_read_unlock();
+ return root_mem_cgroup;
}

/**
--
2.14.1



2018-05-24 11:11:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Wed 23-05-18 14:46:43, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 10-05-18 14:14:18, Michal Hocko wrote:
> >> On Fri 04-05-18 12:26:03, Eric W. Biederman wrote:
> >> >
> >> > Andrew can you pick up these two fixes as well.
> >> >
> >> > These address the issues Michal Hocko and Oleg Nesterov noticed.
> >>
> >> I completely got lost in this thread. There are way too many things
> >> discussed at once. Could you repost the full series for a proper review
> >> please?
> >
> > ping
>
> Quick summary of where things stand. (Just getting back from vacation
> and catching up with things).
>
> Andrew has picked up the best version of these patches and you can look
> at his tree to find the current patches.

I would really prefer and appreciate a repost with all the fixes folded
in. Wrapping my head around -fix+ is not something I have time for.

> Looking at my tree the issues that have been looked at above
> the basic patch are:
> !CONFIG_MMU support (code motion)
> Races during exec. (Roughly solved but I think we can do better by
> expanding the scope of
> cgroup_threadgroup_change_begin/end in exec and
> just making exec atomic wrt cgroup changes)
>
> While I was on vacation you posted an old concern about a condition
> where get_mem_cgroup_from_mm was not guaranteed to complete, and how
> that interacts with charge migration.
>
>
> Looking at your description the concern is that cgroup_rmdir can succeed
> when a cgroup has just an mm in it (and no tasks). The result is
> that mem_cgroup_try_charge can loop indefinitely in that case.

right

> That is possible with two processes sharing the same mm, but living in
> different memory cgroups.

right

> That is a silly useless configuration but
> something to support at least to the level of making certain the kernel
> doesn't wedge when it happens by accident or maliciously.

absolutely. Processes sharing the mm without being in the same thread
group is something we should have never allowed IMHO. It just generates
a lot of corner cases. I guess this is a relic from old threading
models so here we are.

> The suggestion of having cgroup_is_populated take this into account
> seems like a good general solution but I don't think that is strictly
> necessary.
>
> In the spirit of let's make this work. How about:

Please, repost with the whole series. I really want to see the full
picture.

Thanks!
--
Michal Hocko
SUSE Labs

2018-05-25 02:43:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:

> I would really prefer and appreciate a repost with all the fixes folded
> in.

[1/2]

From: "Eric W. Biederman" <[email protected]>
Subject: memcg: replace mm->owner with mm->memcg

Recently it was reported that mm_update_next_owner could get into cases
where it was executing its fallback for_each_process part of the loop and
thus taking up a lot of time.

To deal with this replace mm->owner with mm->memcg. This just reduces the
complexity of everything. As much as possible I have maintained the
current semantics. There are two siginificant exceptions. During fork
the memcg of the process calling fork is charged rather than init_css_set.
During memory cgroup migration the charges are migrated not if the
process is the owner of the mm, but if the process being migrated has the
same memory cgroup as the mm.

I believe it was a bug if init_css_set is charged for memory activity
during fork, and the old behavior was simply a consequence of the new task
not having tsk->cgroup not initialized to it's proper cgroup.

During cgroup migration only thread group leaders are allowed to migrate.
Which means in practice there should only be one. Linux tasks created
with CLONE_VM are the only exception, but the common cases are already
ruled out. Processes created with vfork have a suspended parent and can
do nothing but call exec so they should never show up. Threads of the
same cgroup are not the thread group leader so also should not show up.
That leaves the old LinuxThreads library which is probably out of use by
now, and someone doing something very creative with cgroups, and rolling
their own threads with CLONE_VM. So in practice I don't think the
difference charge migration will affect anyone.

To ensure that mm->memcg is updated appropriately I have implemented
cgroup "attach" and "fork" methods. This ensures that at those points the
mm pointed to the task has the appropriate memory cgroup.

For simplicity instead of introducing a new mm lock I simply use exchange
on the pointer where the mm->memcg is updated to get atomic updates.

Looking at the history effectively this change is a revert. The reason
given for adding mm->owner is so that multiple cgroups can be attached to
the same mm. In the last 8 years a second user of mm->owner has not
appeared. A feature that has never used, makes the code more complicated
and has horrible worst case performance should go.

[[email protected]: update to work when !CONFIG_MMU]
Link: http://lkml.kernel.org/r/[email protected]
[[email protected]: close race between migration and installing bprm->mm as mm]
Link: http://lkml.kernel.org/r/[email protected]
Link: http://lkml.kernel.org/r/[email protected]
Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
Signed-off-by: "Eric W. Biederman" <[email protected]>
Reported-by: Kirill Tkhai <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

fs/exec.c | 3 -
include/linux/memcontrol.h | 16 +++++-
include/linux/mm_types.h | 12 ----
include/linux/sched/mm.h | 8 ---
kernel/exit.c | 89 -----------------------------------
kernel/fork.c | 17 +++++-
mm/debug.c | 4 -
mm/memcontrol.c | 81 +++++++++++++++++++++++--------
8 files changed, 93 insertions(+), 137 deletions(-)

diff -puN fs/exec.c~memcg-replace-mm-owner-with-mm-memcg fs/exec.c
--- a/fs/exec.c~memcg-replace-mm-owner-with-mm-memcg
+++ a/fs/exec.c
@@ -1040,11 +1040,12 @@ static int exec_mmap(struct mm_struct *m
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
- mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
mmdrop(active_mm);
+ /* The tsk may have migrated before the new mm was attached */
+ mm_sync_memcg_from_task(tsk);
return 0;
}

diff -puN include/linux/memcontrol.h~memcg-replace-mm-owner-with-mm-memcg include/linux/memcontrol.h
--- a/include/linux/memcontrol.h~memcg-replace-mm-owner-with-mm-memcg
+++ a/include/linux/memcontrol.h
@@ -345,7 +345,6 @@ out:
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);

@@ -408,6 +407,9 @@ static inline bool mem_cgroup_is_descend
return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
}

+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
+void mm_sync_memcg_from_task(struct task_struct *tsk);
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
@@ -415,7 +417,7 @@ static inline bool mm_match_cgroup(struc
bool match = false;

rcu_read_lock();
- task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ task_memcg = rcu_dereference(mm->memcg);
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
@@ -699,7 +701,7 @@ static inline void count_memcg_event_mm(
return;

rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (likely(memcg)) {
count_memcg_events(memcg, idx, 1);
if (idx == OOM_KILL)
@@ -787,6 +789,14 @@ static inline struct lruvec *mem_cgroup_
return &pgdat->lruvec;
}

+static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+}
+
+static inline void mm_sync_memcg_from_task(struct task_struct *tsk)
+{
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff -puN include/linux/mm_types.h~memcg-replace-mm-owner-with-mm-memcg include/linux/mm_types.h
--- a/include/linux/mm_types.h~memcg-replace-mm-owner-with-mm-memcg
+++ a/include/linux/mm_types.h
@@ -445,17 +445,7 @@ struct mm_struct {
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ struct mem_cgroup __rcu *memcg;
#endif
struct user_namespace *user_ns;

diff -puN include/linux/sched/mm.h~memcg-replace-mm-owner-with-mm-memcg include/linux/sched/mm.h
--- a/include/linux/sched/mm.h~memcg-replace-mm-owner-with-mm-memcg
+++ a/include/linux/sched/mm.h
@@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struc
/* Remove the current tasks stale references to the old mm_struct */
extern void mm_release(struct task_struct *, struct mm_struct *);

-#ifdef CONFIG_MEMCG
-extern void mm_update_next_owner(struct mm_struct *mm);
-#else
-static inline void mm_update_next_owner(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_MEMCG */
-
#ifdef CONFIG_MMU
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
diff -puN kernel/exit.c~memcg-replace-mm-owner-with-mm-memcg kernel/exit.c
--- a/kernel/exit.c~memcg-replace-mm-owner-with-mm-memcg
+++ a/kernel/exit.c
@@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *t
}
}

-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting. If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
- struct task_struct *c, *g, *p = current;
-
-retry:
- /*
- * If the exiting or execing task is not the owner, it's
- * someone else's problem.
- */
- if (mm->owner != p)
- return;
- /*
- * The current owner is exiting/execing and there are no other
- * candidates. Do not leave the mm pointing to a possibly
- * freed task structure.
- */
- if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
- return;
- }
-
- read_lock(&tasklist_lock);
- /*
- * Search in the children
- */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search in the siblings
- */
- list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search through everything else, we should not get here often.
- */
- for_each_process(g) {
- if (g->flags & PF_KTHREAD)
- continue;
- for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
- if (c->mm)
- break;
- }
- }
- read_unlock(&tasklist_lock);
- /*
- * We found no owner yet mm_users > 1: this implies that we are
- * most likely racing with swapoff (try_to_unuse()) or /proc or
- * ptrace or page migration (get_task_mm()). Mark owner as NULL.
- */
- mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -540,7 +452,6 @@ static void exit_mm(void)
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(current);
- mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
diff -puN kernel/fork.c~memcg-replace-mm-owner-with-mm-memcg kernel/fork.c
--- a/kernel/fork.c~memcg-replace-mm-owner-with-mm-memcg
+++ a/kernel/fork.c
@@ -878,10 +878,19 @@ static void mm_init_aio(struct mm_struct
#endif
}

-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+static void mm_init_memcg(struct mm_struct *mm)
{
#ifdef CONFIG_MEMCG
- mm->owner = p;
+ struct cgroup_subsys_state *css;
+
+ /* Ensure mm->memcg is initialized */
+ mm->memcg = NULL;
+
+ rcu_read_lock();
+ css = task_css(current, memory_cgrp_id);
+ if (css && css_tryget(css))
+ mm_update_memcg(mm, mem_cgroup_from_css(css));
+ rcu_read_unlock();
#endif
}

@@ -912,7 +921,7 @@ static struct mm_struct *mm_init(struct
spin_lock_init(&mm->arg_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
- mm_init_owner(mm, p);
+ mm_init_memcg(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_mm_init(mm);
hmm_mm_init(mm);
@@ -942,6 +951,7 @@ static struct mm_struct *mm_init(struct
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
+ mm_update_memcg(mm, NULL);
free_mm(mm);
return NULL;
}
@@ -979,6 +989,7 @@ static inline void __mmput(struct mm_str
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ mm_update_memcg(mm, NULL);
mmdrop(mm);
}

diff -puN mm/debug.c~memcg-replace-mm-owner-with-mm-memcg mm/debug.c
--- a/mm/debug.c~memcg-replace-mm-owner-with-mm-memcg
+++ a/mm/debug.c
@@ -116,7 +116,7 @@ void dump_mm(const struct mm_struct *mm)
"ioctx_table %px\n"
#endif
#ifdef CONFIG_MEMCG
- "owner %px "
+ "memcg %px "
#endif
"exe_file %px\n"
#ifdef CONFIG_MMU_NOTIFIER
@@ -147,7 +147,7 @@ void dump_mm(const struct mm_struct *mm)
mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
- mm->owner,
+ mm->memcg,
#endif
mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
diff -puN mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg mm/memcontrol.c
--- a/mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg
+++ a/mm/memcontrol.c
@@ -664,20 +664,6 @@ static void memcg_check_events(struct me
}
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- /*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
- */
- if (unlikely(!p))
- return NULL;
-
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-EXPORT_SYMBOL(mem_cgroup_from_task);
-
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -692,7 +678,7 @@ struct mem_cgroup *get_mem_cgroup_from_m
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
memcg = root_mem_cgroup;
}
@@ -1025,7 +1011,7 @@ bool task_in_mem_cgroup(struct task_stru
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_task(task);
+ task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4850,15 +4836,16 @@ static int mem_cgroup_can_attach(struct
if (!move_flags)
return 0;

- from = mem_cgroup_from_task(p);
+ from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));

VM_BUG_ON(from == memcg);

mm = get_task_mm(p);
if (!mm)
return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
+
+ /* We move charges except for creative uses of CLONE_VM */
+ if (mm->memcg == from) {
VM_BUG_ON(mc.from);
VM_BUG_ON(mc.to);
VM_BUG_ON(mc.precharge);
@@ -5058,6 +5045,58 @@ static void mem_cgroup_move_task(void)
}
#endif

+/**
+ * mm_update_memcg - Update the memory cgroup of a mm_struct
+ * @mm: mm struct
+ * @new: new memory cgroup value
+ *
+ * Called whenever mm->memcg needs to change. Consumes a reference
+ * to new (unless new is NULL). The reference to the old memory
+ * cgroup is decreased.
+ */
+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+ /* This is the only place where mm->memcg is changed */
+ struct mem_cgroup *old;
+
+ old = xchg(&mm->memcg, new);
+ if (old)
+ css_put(&old->css);
+}
+
+static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
+{
+ struct mm_struct *mm;
+ task_lock(tsk);
+ mm = tsk->mm;
+ if (mm && !(tsk->flags & PF_KTHREAD))
+ mm_update_memcg(mm, new);
+ task_unlock(tsk);
+}
+
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *tsk;
+
+ cgroup_taskset_for_each(tsk, css, tset) {
+ struct mem_cgroup *new = mem_cgroup_from_css(css);
+ css_get(css);
+ task_update_memcg(tsk, new);
+ }
+}
+
+void mm_sync_memcg_from_task(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ css = task_css(tsk, memory_cgrp_id);
+ if (css && css_tryget(css))
+ task_update_memcg(tsk, mem_cgroup_from_css(css));
+ rcu_read_unlock();
+}
+
/*
* Cgroup retains root cgroups across [un]mount cycles making it necessary
* to verify whether we're attached to the default hierarchy on each mount
@@ -5358,8 +5397,10 @@ struct cgroup_subsys memory_cgrp_subsys
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+ .fork = mm_sync_memcg_from_task,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
@@ -5846,7 +5887,7 @@ void mem_cgroup_sk_alloc(struct sock *sk
}

rcu_read_lock();
- memcg = mem_cgroup_from_task(current);
+ memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
_


2018-05-25 02:43:44

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:

> I would really prefer and appreciate a repost with all the fixes folded
> in.

[2/2]

From: Andrew Morton <[email protected]>
Subject: mm/memcontrol.c: add mem_cgroup_from_task() as a local helper

Factor out some commonly-occurring code.

Cc: "Eric W. Biederman" <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Kirill Tkhai <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/memcontrol.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff -puN mm/memcontrol.c~mm-memcontrolc-add-mem_cgroup_from_task-as-a-local-helper mm/memcontrol.c
--- a/mm/memcontrol.c~mm-memcontrolc-add-mem_cgroup_from_task-as-a-local-helper
+++ a/mm/memcontrol.c
@@ -664,6 +664,11 @@ static void memcg_check_events(struct me
}
}

+static inline struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+ return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+}
+
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -1011,7 +1016,7 @@ bool task_in_mem_cgroup(struct task_stru
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
+ task_memcg = mem_cgroup_from_task(task);
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4836,7 +4841,7 @@ static int mem_cgroup_can_attach(struct
if (!move_flags)
return 0;

- from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+ from = mem_cgroup_from_task(p);

VM_BUG_ON(from == memcg);

@@ -5887,7 +5892,7 @@ void mem_cgroup_sk_alloc(struct sock *sk
}

rcu_read_lock();
- memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
+ memcg = mem_cgroup_from_task(current);
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
_


2018-05-25 02:47:19

by Andrea Parri

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu, May 24, 2018 at 02:16:35PM -0700, Andrew Morton wrote:
> On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:
>
> > I would really prefer and appreciate a repost with all the fixes folded
> > in.
>
> [1/2]
>
> From: "Eric W. Biederman" <[email protected]>
> Subject: memcg: replace mm->owner with mm->memcg
>
> Recently it was reported that mm_update_next_owner could get into cases
> where it was executing its fallback for_each_process part of the loop and
> thus taking up a lot of time.

Reference?


>
> To deal with this replace mm->owner with mm->memcg. This just reduces the
> complexity of everything.

"the complexity of everything"?


> As much as possible I have maintained the
> current semantics.

"As much as possible"?


> There are two siginificant exceptions.

s/siginificant/significant


> During fork
> the memcg of the process calling fork is charged rather than init_css_set.
> During memory cgroup migration the charges are migrated not if the
> process is the owner of the mm, but if the process being migrated has the
> same memory cgroup as the mm.
>
> I believe it was a bug

It was a bug or not??


> if init_css_set is charged for memory activity
> during fork, and the old behavior was simply a consequence of the new task
> not having tsk->cgroup not initialized to it's proper cgroup.
>
> During cgroup migration only thread group leaders are allowed to migrate.
> Which means in practice there should only be one.

"in practice there should"??


> Linux tasks created
> with CLONE_VM are the only exception, but the common cases are already
> ruled out. Processes created with vfork have a suspended parent and can
> do nothing but call exec so they should never show up. Threads of the
> same cgroup are not the thread group leader so also should not show up.
> That leaves the old LinuxThreads library which is probably out of use by

"probably"???


> now, and someone doing something very creative with cgroups,

"very creative"?


> and rolling
> their own threads with CLONE_VM. So in practice I don't think

"in practice I don't think"??

Andrea


> the
> difference charge migration will affect anyone.
>
> To ensure that mm->memcg is updated appropriately I have implemented
> cgroup "attach" and "fork" methods. This ensures that at those points the
> mm pointed to the task has the appropriate memory cgroup.
>
> For simplicity instead of introducing a new mm lock I simply use exchange
> on the pointer where the mm->memcg is updated to get atomic updates.
>
> Looking at the history effectively this change is a revert. The reason
> given for adding mm->owner is so that multiple cgroups can be attached to
> the same mm. In the last 8 years a second user of mm->owner has not
> appeared. A feature that has never used, makes the code more complicated
> and has horrible worst case performance should go.
>
> [[email protected]: update to work when !CONFIG_MMU]
> Link: http://lkml.kernel.org/r/[email protected]
> [[email protected]: close race between migration and installing bprm->mm as mm]
> Link: http://lkml.kernel.org/r/[email protected]
> Link: http://lkml.kernel.org/r/[email protected]
> Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> Reported-by: Kirill Tkhai <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> fs/exec.c | 3 -
> include/linux/memcontrol.h | 16 +++++-
> include/linux/mm_types.h | 12 ----
> include/linux/sched/mm.h | 8 ---
> kernel/exit.c | 89 -----------------------------------
> kernel/fork.c | 17 +++++-
> mm/debug.c | 4 -
> mm/memcontrol.c | 81 +++++++++++++++++++++++--------
> 8 files changed, 93 insertions(+), 137 deletions(-)
>
> diff -puN fs/exec.c~memcg-replace-mm-owner-with-mm-memcg fs/exec.c
> --- a/fs/exec.c~memcg-replace-mm-owner-with-mm-memcg
> +++ a/fs/exec.c
> @@ -1040,11 +1040,12 @@ static int exec_mmap(struct mm_struct *m
> up_read(&old_mm->mmap_sem);
> BUG_ON(active_mm != old_mm);
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> - mm_update_next_owner(old_mm);
> mmput(old_mm);
> return 0;
> }
> mmdrop(active_mm);
> + /* The tsk may have migrated before the new mm was attached */
> + mm_sync_memcg_from_task(tsk);
> return 0;
> }
>
> diff -puN include/linux/memcontrol.h~memcg-replace-mm-owner-with-mm-memcg include/linux/memcontrol.h
> --- a/include/linux/memcontrol.h~memcg-replace-mm-owner-with-mm-memcg
> +++ a/include/linux/memcontrol.h
> @@ -345,7 +345,6 @@ out:
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);
>
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
>
> @@ -408,6 +407,9 @@ static inline bool mem_cgroup_is_descend
> return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
> }
>
> +void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
> +void mm_sync_memcg_from_task(struct task_struct *tsk);
> +
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> struct mem_cgroup *memcg)
> {
> @@ -415,7 +417,7 @@ static inline bool mm_match_cgroup(struc
> bool match = false;
>
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + task_memcg = rcu_dereference(mm->memcg);
> if (task_memcg)
> match = mem_cgroup_is_descendant(task_memcg, memcg);
> rcu_read_unlock();
> @@ -699,7 +701,7 @@ static inline void count_memcg_event_mm(
> return;
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (likely(memcg)) {
> count_memcg_events(memcg, idx, 1);
> if (idx == OOM_KILL)
> @@ -787,6 +789,14 @@ static inline struct lruvec *mem_cgroup_
> return &pgdat->lruvec;
> }
>
> +static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
> +{
> +}
> +
> +static inline void mm_sync_memcg_from_task(struct task_struct *tsk)
> +{
> +}
> +
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> struct mem_cgroup *memcg)
> {
> diff -puN include/linux/mm_types.h~memcg-replace-mm-owner-with-mm-memcg include/linux/mm_types.h
> --- a/include/linux/mm_types.h~memcg-replace-mm-owner-with-mm-memcg
> +++ a/include/linux/mm_types.h
> @@ -445,17 +445,7 @@ struct mm_struct {
> struct kioctx_table __rcu *ioctx_table;
> #endif
> #ifdef CONFIG_MEMCG
> - /*
> - * "owner" points to a task that is regarded as the canonical
> - * user/owner of this mm. All of the following must be true in
> - * order for it to be changed:
> - *
> - * current == mm->owner
> - * current->mm != mm
> - * new_owner->mm == mm
> - * new_owner->alloc_lock is held
> - */
> - struct task_struct __rcu *owner;
> + struct mem_cgroup __rcu *memcg;
> #endif
> struct user_namespace *user_ns;
>
> diff -puN include/linux/sched/mm.h~memcg-replace-mm-owner-with-mm-memcg include/linux/sched/mm.h
> --- a/include/linux/sched/mm.h~memcg-replace-mm-owner-with-mm-memcg
> +++ a/include/linux/sched/mm.h
> @@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struc
> /* Remove the current tasks stale references to the old mm_struct */
> extern void mm_release(struct task_struct *, struct mm_struct *);
>
> -#ifdef CONFIG_MEMCG
> -extern void mm_update_next_owner(struct mm_struct *mm);
> -#else
> -static inline void mm_update_next_owner(struct mm_struct *mm)
> -{
> -}
> -#endif /* CONFIG_MEMCG */
> -
> #ifdef CONFIG_MMU
> extern void arch_pick_mmap_layout(struct mm_struct *mm,
> struct rlimit *rlim_stack);
> diff -puN kernel/exit.c~memcg-replace-mm-owner-with-mm-memcg kernel/exit.c
> --- a/kernel/exit.c~memcg-replace-mm-owner-with-mm-memcg
> +++ a/kernel/exit.c
> @@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *t
> }
> }
>
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting. If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> - struct task_struct *c, *g, *p = current;
> -
> -retry:
> - /*
> - * If the exiting or execing task is not the owner, it's
> - * someone else's problem.
> - */
> - if (mm->owner != p)
> - return;
> - /*
> - * The current owner is exiting/execing and there are no other
> - * candidates. Do not leave the mm pointing to a possibly
> - * freed task structure.
> - */
> - if (atomic_read(&mm->mm_users) <= 1) {
> - mm->owner = NULL;
> - return;
> - }
> -
> - read_lock(&tasklist_lock);
> - /*
> - * Search in the children
> - */
> - list_for_each_entry(c, &p->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search in the siblings
> - */
> - list_for_each_entry(c, &p->real_parent->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search through everything else, we should not get here often.
> - */
> - for_each_process(g) {
> - if (g->flags & PF_KTHREAD)
> - continue;
> - for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - if (c->mm)
> - break;
> - }
> - }
> - read_unlock(&tasklist_lock);
> - /*
> - * We found no owner yet mm_users > 1: this implies that we are
> - * most likely racing with swapoff (try_to_unuse()) or /proc or
> - * ptrace or page migration (get_task_mm()). Mark owner as NULL.
> - */
> - mm->owner = NULL;
> - return;
> -
> -assign_new_owner:
> - BUG_ON(c == p);
> - get_task_struct(c);
> - /*
> - * The task_lock protects c->mm from changing.
> - * We always want mm->owner->mm == mm
> - */
> - task_lock(c);
> - /*
> - * Delay read_unlock() till we have the task_lock()
> - * to ensure that c does not slip away underneath us
> - */
> - read_unlock(&tasklist_lock);
> - if (c->mm != mm) {
> - task_unlock(c);
> - put_task_struct(c);
> - goto retry;
> - }
> - mm->owner = c;
> - task_unlock(c);
> - put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */
> -
> /*
> * Turn us into a lazy TLB process if we
> * aren't already..
> @@ -540,7 +452,6 @@ static void exit_mm(void)
> up_read(&mm->mmap_sem);
> enter_lazy_tlb(mm, current);
> task_unlock(current);
> - mm_update_next_owner(mm);
> mmput(mm);
> if (test_thread_flag(TIF_MEMDIE))
> exit_oom_victim();
> diff -puN kernel/fork.c~memcg-replace-mm-owner-with-mm-memcg kernel/fork.c
> --- a/kernel/fork.c~memcg-replace-mm-owner-with-mm-memcg
> +++ a/kernel/fork.c
> @@ -878,10 +878,19 @@ static void mm_init_aio(struct mm_struct
> #endif
> }
>
> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> +static void mm_init_memcg(struct mm_struct *mm)
> {
> #ifdef CONFIG_MEMCG
> - mm->owner = p;
> + struct cgroup_subsys_state *css;
> +
> + /* Ensure mm->memcg is initialized */
> + mm->memcg = NULL;
> +
> + rcu_read_lock();
> + css = task_css(current, memory_cgrp_id);
> + if (css && css_tryget(css))
> + mm_update_memcg(mm, mem_cgroup_from_css(css));
> + rcu_read_unlock();
> #endif
> }
>
> @@ -912,7 +921,7 @@ static struct mm_struct *mm_init(struct
> spin_lock_init(&mm->arg_lock);
> mm_init_cpumask(mm);
> mm_init_aio(mm);
> - mm_init_owner(mm, p);
> + mm_init_memcg(mm);
> RCU_INIT_POINTER(mm->exe_file, NULL);
> mmu_notifier_mm_init(mm);
> hmm_mm_init(mm);
> @@ -942,6 +951,7 @@ static struct mm_struct *mm_init(struct
> fail_nocontext:
> mm_free_pgd(mm);
> fail_nopgd:
> + mm_update_memcg(mm, NULL);
> free_mm(mm);
> return NULL;
> }
> @@ -979,6 +989,7 @@ static inline void __mmput(struct mm_str
> }
> if (mm->binfmt)
> module_put(mm->binfmt->module);
> + mm_update_memcg(mm, NULL);
> mmdrop(mm);
> }
>
> diff -puN mm/debug.c~memcg-replace-mm-owner-with-mm-memcg mm/debug.c
> --- a/mm/debug.c~memcg-replace-mm-owner-with-mm-memcg
> +++ a/mm/debug.c
> @@ -116,7 +116,7 @@ void dump_mm(const struct mm_struct *mm)
> "ioctx_table %px\n"
> #endif
> #ifdef CONFIG_MEMCG
> - "owner %px "
> + "memcg %px "
> #endif
> "exe_file %px\n"
> #ifdef CONFIG_MMU_NOTIFIER
> @@ -147,7 +147,7 @@ void dump_mm(const struct mm_struct *mm)
> mm->ioctx_table,
> #endif
> #ifdef CONFIG_MEMCG
> - mm->owner,
> + mm->memcg,
> #endif
> mm->exe_file,
> #ifdef CONFIG_MMU_NOTIFIER
> diff -puN mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg mm/memcontrol.c
> --- a/mm/memcontrol.c~memcg-replace-mm-owner-with-mm-memcg
> +++ a/mm/memcontrol.c
> @@ -664,20 +664,6 @@ static void memcg_check_events(struct me
> }
> }
>
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> - /*
> - * mm_update_next_owner() may clear mm->owner to NULL
> - * if it races with swapoff, page migration, etc.
> - * So this can be called with p == NULL.
> - */
> - if (unlikely(!p))
> - return NULL;
> -
> - return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -EXPORT_SYMBOL(mem_cgroup_from_task);
> -
> struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> {
> struct mem_cgroup *memcg = NULL;
> @@ -692,7 +678,7 @@ struct mem_cgroup *get_mem_cgroup_from_m
> if (unlikely(!mm))
> memcg = root_mem_cgroup;
> else {
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (unlikely(!memcg))
> memcg = root_mem_cgroup;
> }
> @@ -1025,7 +1011,7 @@ bool task_in_mem_cgroup(struct task_stru
> * killed to prevent needlessly killing additional tasks.
> */
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(task);
> + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> css_get(&task_memcg->css);
> rcu_read_unlock();
> }
> @@ -4850,15 +4836,16 @@ static int mem_cgroup_can_attach(struct
> if (!move_flags)
> return 0;
>
> - from = mem_cgroup_from_task(p);
> + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>
> VM_BUG_ON(from == memcg);
>
> mm = get_task_mm(p);
> if (!mm)
> return 0;
> - /* We move charges only when we move a owner of the mm */
> - if (mm->owner == p) {
> +
> + /* We move charges except for creative uses of CLONE_VM */
> + if (mm->memcg == from) {
> VM_BUG_ON(mc.from);
> VM_BUG_ON(mc.to);
> VM_BUG_ON(mc.precharge);
> @@ -5058,6 +5045,58 @@ static void mem_cgroup_move_task(void)
> }
> #endif
>
> +/**
> + * mm_update_memcg - Update the memory cgroup of a mm_struct
> + * @mm: mm struct
> + * @new: new memory cgroup value
> + *
> + * Called whenever mm->memcg needs to change. Consumes a reference
> + * to new (unless new is NULL). The reference to the old memory
> + * cgroup is decreased.
> + */
> +void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
> +{
> + /* This is the only place where mm->memcg is changed */
> + struct mem_cgroup *old;
> +
> + old = xchg(&mm->memcg, new);
> + if (old)
> + css_put(&old->css);
> +}
> +
> +static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
> +{
> + struct mm_struct *mm;
> + task_lock(tsk);
> + mm = tsk->mm;
> + if (mm && !(tsk->flags & PF_KTHREAD))
> + mm_update_memcg(mm, new);
> + task_unlock(tsk);
> +}
> +
> +static void mem_cgroup_attach(struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *css;
> + struct task_struct *tsk;
> +
> + cgroup_taskset_for_each(tsk, css, tset) {
> + struct mem_cgroup *new = mem_cgroup_from_css(css);
> + css_get(css);
> + task_update_memcg(tsk, new);
> + }
> +}
> +
> +void mm_sync_memcg_from_task(struct task_struct *tsk)
> +{
> + struct cgroup_subsys_state *css;
> +
> + rcu_read_lock();
> + css = task_css(tsk, memory_cgrp_id);
> + if (css && css_tryget(css))
> + task_update_memcg(tsk, mem_cgroup_from_css(css));
> + rcu_read_unlock();
> +}
> +
> /*
> * Cgroup retains root cgroups across [un]mount cycles making it necessary
> * to verify whether we're attached to the default hierarchy on each mount
> @@ -5358,8 +5397,10 @@ struct cgroup_subsys memory_cgrp_subsys
> .css_free = mem_cgroup_css_free,
> .css_reset = mem_cgroup_css_reset,
> .can_attach = mem_cgroup_can_attach,
> + .attach = mem_cgroup_attach,
> .cancel_attach = mem_cgroup_cancel_attach,
> .post_attach = mem_cgroup_move_task,
> + .fork = mm_sync_memcg_from_task,
> .bind = mem_cgroup_bind,
> .dfl_cftypes = memory_files,
> .legacy_cftypes = mem_cgroup_legacy_files,
> @@ -5846,7 +5887,7 @@ void mem_cgroup_sk_alloc(struct sock *sk
> }
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(current);
> + memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
> if (memcg == root_mem_cgroup)
> goto out;
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
> _
>

2018-05-30 11:54:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Wed 23-05-18 14:46:43, Eric W. Biederman wrote:
[...]
> As two processes sharing an mm is useless and highly unlikely there is
> no need to handle this case well, it just needs to be handled well
> enough to prevent an indefinite loop. So when css_tryget_online fails
> just treat the mm as belonging to the root memory cgroup.

Does that mean that a malicious user can construct such a task and
run away from its limits?

--
Michal Hocko
SUSE Labs

2018-05-30 12:18:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu 24-05-18 14:16:35, Andrew Morton wrote:
> On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:
>
> > I would really prefer and appreciate a repost with all the fixes folded
> > in.
>
> [1/2]

Thanks Andrew and sorry it took so long! This seems to be missing the
fix for the issue I've mentioned in http://lkml.kernel.org/r/[email protected]
and Eric wanted to handle by http://lkml.kernel.org/r/[email protected].
I do not think that this fix is really the correct one though. I will
comment in a reply to that email.

In any case I really think this should better be reposted in one patch
series and restart the discussion. I strongly suggest revisiting my
previous attempt http://lkml.kernel.org/r/[email protected]
including the follow up discussion regarding the unfortunate CLONE_VM
outside of thread group case and potential solution for that.

> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> +static void mm_init_memcg(struct mm_struct *mm)
> {
> #ifdef CONFIG_MEMCG
> - mm->owner = p;
> + struct cgroup_subsys_state *css;
> +
> + /* Ensure mm->memcg is initialized */
> + mm->memcg = NULL;
> +
> + rcu_read_lock();
> + css = task_css(current, memory_cgrp_id);
> + if (css && css_tryget(css))
> + mm_update_memcg(mm, mem_cgroup_from_css(css));
> + rcu_read_unlock();

Is it possible that css_tryget ever fails here? I remember I used to do
plain css_get in my patch.

> @@ -4850,15 +4836,16 @@ static int mem_cgroup_can_attach(struct
> if (!move_flags)
> return 0;
>
> - from = mem_cgroup_from_task(p);
> + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>
> VM_BUG_ON(from == memcg);
>
> mm = get_task_mm(p);
> if (!mm)
> return 0;
> - /* We move charges only when we move a owner of the mm */
> - if (mm->owner == p) {
> +
> + /* We move charges except for creative uses of CLONE_VM */
> + if (mm->memcg == from) {

I must be missing something here but how do you prevent those cases?
AFAICS processes sharing the mm will simply be allowed to migrate.

> VM_BUG_ON(mc.from);
> VM_BUG_ON(mc.to);
> VM_BUG_ON(mc.precharge);

Other than that the patch makes sense to me.
--
Michal Hocko
SUSE Labs

2018-05-31 17:45:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

Michal Hocko <[email protected]> writes:

> On Wed 23-05-18 14:46:43, Eric W. Biederman wrote:
> [...]
>> As two processes sharing an mm is useless and highly unlikely there is
>> no need to handle this case well, it just needs to be handled well
>> enough to prevent an indefinite loop. So when css_tryget_online fails
>> just treat the mm as belonging to the root memory cgroup.
>
> Does that mean that a malicious user can construct such a task and
> run away from its limits?

Unfortunately if the memory cgroup is delegated then yes that can
happen. So removing the loop in get_mem_cgroup_from_mm won't work.

Eric


2018-05-31 18:45:07

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

Michal Hocko <[email protected]> writes:

> On Thu 24-05-18 14:16:35, Andrew Morton wrote:
>> On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:
>>
>> > I would really prefer and appreciate a repost with all the fixes folded
>> > in.
>>
>> [1/2]
>
> Thanks Andrew and sorry it took so long! This seems to be missing the
> fix for the issue I've mentioned in http://lkml.kernel.org/r/[email protected]
> and Eric wanted to handle by http://lkml.kernel.org/r/[email protected].
> I do not think that this fix is really the correct one though. I will
> comment in a reply to that email.

Agreed. That was not a correct fix to the possible infinite loop in
get_mem_cgroup_from_mm. Which in net leaves all of this broken and
not ready to be merged.

> In any case I really think this should better be reposted in one patch
> series and restart the discussion. I strongly suggest revisiting my
> previous attempt http://lkml.kernel.org/r/[email protected]
> including the follow up discussion regarding the unfortunate CLONE_VM
> outside of thread group case and potential solution for that.

With my fix for get_mem_cgroup_from_mm falling flat that limits our
possible future directions:
- Make memory cgroups honest and require all tasks of an mm to be in the
same memory cgroup. [BEST]
- Don't allow memory cgroups to be removed even if they are only populated by
an mm.
- If tryget_online fails in get_mem_cgroup_from_mm find a parent of the
memcg where tryget_online succeeds and use that one.

That should not be possible to abuse, as even if a cgroup subtree is
delegated it should not be possible for the processes to escape the
subtree. So in the worst case the charge would walk back up to the
delegation point; a rough sketch of that third option follows below.
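
Something like this (sketch only, not the respin itself; it assumes the
mm->memcg field from the patch above and that parent_mem_cgroup() and
css_tryget_online() keep their current behavior):

static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	/* Page cache insertions can happen without an actual mm context. */
	memcg = mm ? rcu_dereference(mm->memcg) : NULL;
	if (unlikely(!memcg))
		memcg = root_mem_cgroup;

	/*
	 * Walk up to the nearest ancestor that is still online. The
	 * delegation point stays online while the mm is in use, so this
	 * cannot be used to escape a delegated subtree.
	 */
	while (!css_tryget_online(&memcg->css)) {
		memcg = parent_mem_cgroup(memcg);
		if (!memcg) {
			/* The root memory cgroup is always online. */
			memcg = root_mem_cgroup;
			css_get(&memcg->css);
			break;
		}
	}
	rcu_read_unlock();
	return memcg;
}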

>> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>> +static void mm_init_memcg(struct mm_struct *mm)
>> {
>> #ifdef CONFIG_MEMCG
>> - mm->owner = p;
>> + struct cgroup_subsys_state *css;
>> +
>> + /* Ensure mm->memcg is initialized */
>> + mm->memcg = NULL;
>> +
>> + rcu_read_lock();
>> + css = task_css(current, memory_cgrp_id);
>> + if (css && css_tryget(css))
>> + mm_update_memcg(mm, mem_cgroup_from_css(css));
>> + rcu_read_unlock();
>
> Is it possible that css_tryget ever fails here? I remember I used to do
> plain css_get in my patch.

Short of races with task migration that can't fail, as the current
task will keep the memory cgroup alive.

I will recheck when I refresh my patch.

>> @@ -4850,15 +4836,16 @@ static int mem_cgroup_can_attach(struct
>> if (!move_flags)
>> return 0;
>>
>> - from = mem_cgroup_from_task(p);
>> + from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>>
>> VM_BUG_ON(from == memcg);
>>
>> mm = get_task_mm(p);
>> if (!mm)
>> return 0;
>> - /* We move charges only when we move a owner of the mm */
>> - if (mm->owner == p) {
>> +
>> + /* We move charges except for creative uses of CLONE_VM */
>> + if (mm->memcg == from) {
>
> I must be missing something here but how do you prevent those cases?
> AFAICS processes sharing the mm will simply be allowed to migrate.

The mm will only migrate once per set of tasks. A task that does not
migrate with the mm will not have mm->memcg == from. Which is all I was
referring to. Perhaps I did not say that well.

>> VM_BUG_ON(mc.from);
>> VM_BUG_ON(mc.to);
>> VM_BUG_ON(mc.precharge);
>
> Other than that the patch makes sense to me.

I will take a bit and respin this. Because of the algorithmic bug-fix
nature of where this started I want to avoid depending on fixing the
semantics. With my third option for fixing get_mem_cgroup_from_mm I see
how to do that now. Then I will include a separate patch to fix the
semantics.

Eric

2018-06-01 01:09:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> On Thu 26-04-18 14:00:19, Kirill Tkhai wrote:
>> This function searches for a new mm owner in children and siblings,
>> and then iterates over all processes in the system in unlikely case.
>> Despite the case is unlikely, its probability growths with the number
>> of processes in the system. The time, spent on iterations, also growths.
>> I regulary observe mm_update_next_owner() in crash dumps (not related
>> to this function) of the nodes with many processes (20K+), so it looks
>> like it's not so unlikely case.
>
> Did you manage to find the pattern that forces mm_update_next_owner to
> slow paths? This really shouldn't trigger very often. If we can fallback
> easily then I suspect that we should be better off reconsidering
> mm->owner and try to come up with something more clever. I've had a
> patch to remove owner few years back. It needed some work to finish but
> maybe that would be a better than try to make non-scalable thing suck
> less.

Reading through the code I just found a trivial pattern that triggers
this. Create a multi-threaded process. Have the thread group leader
(the first thread) exit.

This has the potential to be a significant DOS attack if anyone cares.

Eric


2018-06-01 01:59:17

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] memcg: Replace mm->owner with mm->memcg


Recently Kirill Tkhai reported that mm_update_next_owner was executing
its expensive for_each_process fallback. Reading the code reveals
that all that is needed to trigger this is a multi-threaded process
where the thread group leader exits. AKA something anyone can easily
trigger at any time.

Walking all of the processes in the system is quite expensive so
the current mm_update_next_owner needs to be killed.

To deal with this replace mm->owner with mm->memcg. This removes
an unnecessary hop when dealing with mm and the memory control group.
As much as possible I have maintained the current semantics. There
are two siginificant exceptions. During fork the memcg of
the process calling fork is charged rather than init_css_set. During
memory cgroup migration the charges are migrated not if the process
is the owner of the mm, but of the process being migrated has
the same memory cgroup as the mm.

I believe it was a bug if init_css_set is charged for memory activity
during fork, and the old behavior was simply a consequence of the new
task not yet having tsk->cgroup initialized to its proper cgroup.

On a previous reading of mem_cgroup_can_attach I thought it limited
task migration between memory cgroups, but the only limit it applies
is if the target memory cgroup cannot be precharged with the
charges of the mm being moved. Which means that today the memory
cgroup code is supporting processes that share an mm with CLONE_VM
being in different memory cgroups as well as threads of a
multi-threaded process being in different memory cgroups.

There is a minor difference with charge migration. Instead of moving
the charges exactly when the mm->owner is moved, charges are now moved
when any thread group leader that uses the mm and shares the mm memcg
is moved. It requires playing games with vfork or CLONE_VM to set up a
situation with multiple thread group leaders sharing the same mm. As
vfork only lasts a moment and the old LinuxThreads library appears out
of use, I do not expect there to be many occasions for there even to
be a chance for the difference in charge migration to be visible.

To ensure that the mm->memcg is updated appropriately I have implemented
the cgroup "attach" and "fork" methods and a special
mm_sync_memcg_from_task function. These methods and the new function
ensure that at task migration, after fork, and after exec the task
has the appropriate memory cgroup.

The need to call something in exec is unfortunate and likely indicates
a defect in the way control groups and exec interact. As migration
during exec is expected to be rare this interaction can wait to be
improved.

For simplicity instead of introducing a new mm lock I simply use
exchange on the pointer where the mm->memcg is updated to get atomic
updates.

The way the memory cgroup of an mm is reassigned has changed
significantly. Instead of mm_update_next_owner assigning a new owner
and potentially changing the memcg of the mm when a task exits, the
memcg of an mm remains the same unless a task using the memcg migrates
or the memcg is brought offline. If the memcg is brought offline (AKA
rmdir on the memory cgroups directory in cgroupfs) then the next call
of mm_cgroup_from_mm will reassign mm->memcg to the first ancestor
memory cgroup that is still online, and then return that memory cgroup.

By returning the first ancestor memory cgroup that is still online,
this ensures a process that has been delegated a memory cgroup subtree
can not escape from the delegation point. As every living thread of a
process will remain in some memory cgroup at or below the delegation
point, it is guaranteed at least the delegation point will remain
online if the mm is in use. Thus this does not provide a memory
cgroup escape.

Looking at the history effectively this change is a revert. The
reason given for adding mm->owner is so that multiple cgroups can be
attached to the same mm. In the last 8 years a second user of
mm->owner has not appeared. A feature that has never been used, makes
the code more complicated, and has horrible worst-case performance
should go.

Fixes: cf475ad28ac3 ("cgroups: add an owner to the mm_struct")
Reported-by: Kirill Tkhai <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

I believe this is a correct patch but folks please look it over and see
if I am missing something.

fs/exec.c | 3 +-
include/linux/memcontrol.h | 16 ++++--
include/linux/mm_types.h | 12 +----
include/linux/sched/mm.h | 8 ---
kernel/exit.c | 89 -------------------------------
kernel/fork.c | 14 +++--
mm/debug.c | 4 +-
mm/memcontrol.c | 130 +++++++++++++++++++++++++++++++++------------
8 files changed, 126 insertions(+), 150 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 183059c427b9..32461a1543fc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1040,11 +1040,12 @@ static int exec_mmap(struct mm_struct *mm)
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
- mm_update_next_owner(old_mm);
mmput(old_mm);
return 0;
}
mmdrop(active_mm);
+ /* The tsk may have migrated before the new mm was attached */
+ mm_sync_memcg_from_task(tsk);
return 0;
}

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d99b71bc2c66..9b68d9f2740e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -341,7 +341,6 @@ static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct pglist_data *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
@@ -402,6 +401,9 @@ static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
}

+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new);
+void mm_sync_memcg_from_task(struct task_struct *tsk);
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
@@ -409,7 +411,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
bool match = false;

rcu_read_lock();
- task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ task_memcg = rcu_dereference(mm->memcg);
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
@@ -693,7 +695,7 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
return;

rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (likely(memcg)) {
count_memcg_events(memcg, idx, 1);
if (idx == OOM_KILL)
@@ -781,6 +783,14 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &pgdat->lruvec;
}

+static inline void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+}
+
+static inline void mm_sync_memcg_from_task(struct task_struct *tsk)
+{
+}
+
static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 21612347d311..ea5efd40a5d1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -443,17 +443,7 @@ struct mm_struct {
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ struct mem_cgroup __rcu *memcg;
#endif
struct user_namespace *user_ns;

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c570cd934af..cc8e68d36fc2 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -95,14 +95,6 @@ extern struct mm_struct *mm_access(struct task_struct *task, unsigned int mode);
/* Remove the current tasks stale references to the old mm_struct */
extern void mm_release(struct task_struct *, struct mm_struct *);

-#ifdef CONFIG_MEMCG
-extern void mm_update_next_owner(struct mm_struct *mm);
-#else
-static inline void mm_update_next_owner(struct mm_struct *mm)
-{
-}
-#endif /* CONFIG_MEMCG */
-
#ifdef CONFIG_MMU
extern void arch_pick_mmap_layout(struct mm_struct *mm,
struct rlimit *rlim_stack);
diff --git a/kernel/exit.c b/kernel/exit.c
index c3c7ac560114..be967d2da0ce 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -399,94 +399,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}
}

-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting. If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
- struct task_struct *c, *g, *p = current;
-
-retry:
- /*
- * If the exiting or execing task is not the owner, it's
- * someone else's problem.
- */
- if (mm->owner != p)
- return;
- /*
- * The current owner is exiting/execing and there are no other
- * candidates. Do not leave the mm pointing to a possibly
- * freed task structure.
- */
- if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
- return;
- }
-
- read_lock(&tasklist_lock);
- /*
- * Search in the children
- */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search in the siblings
- */
- list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search through everything else, we should not get here often.
- */
- for_each_process(g) {
- if (g->flags & PF_KTHREAD)
- continue;
- for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
- if (c->mm)
- break;
- }
- }
- read_unlock(&tasklist_lock);
- /*
- * We found no owner yet mm_users > 1: this implies that we are
- * most likely racing with swapoff (try_to_unuse()) or /proc or
- * ptrace or page migration (get_task_mm()). Mark owner as NULL.
- */
- mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -540,7 +452,6 @@ static void exit_mm(void)
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(current);
- mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
diff --git a/kernel/fork.c b/kernel/fork.c
index a5d21c42acfc..8d558678e038 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -868,10 +868,16 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}

-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
+static void mm_init_memcg(struct mm_struct *mm)
{
#ifdef CONFIG_MEMCG
- mm->owner = p;
+ struct cgroup_subsys_state *css;
+
+ /* Ensure mm->memcg is initialized */
+ mm->memcg = NULL;
+
+ css = task_get_css(current, memory_cgrp_id);
+ mm_update_memcg(mm, mem_cgroup_from_css(css));
#endif
}

@@ -901,7 +907,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
- mm_init_owner(mm, p);
+ mm_init_memcg(mm);
RCU_INIT_POINTER(mm->exe_file, NULL);
mmu_notifier_mm_init(mm);
hmm_mm_init(mm);
@@ -931,6 +937,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
+ mm_update_memcg(mm, NULL);
free_mm(mm);
return NULL;
}
@@ -968,6 +975,7 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ mm_update_memcg(mm, NULL);
mmdrop(mm);
}

diff --git a/mm/debug.c b/mm/debug.c
index 56e2d9125ea5..3aed6102b8d0 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -116,7 +116,7 @@ void dump_mm(const struct mm_struct *mm)
"ioctx_table %px\n"
#endif
#ifdef CONFIG_MEMCG
- "owner %px "
+ "memcg %px "
#endif
"exe_file %px\n"
#ifdef CONFIG_MMU_NOTIFIER
@@ -147,7 +147,7 @@ void dump_mm(const struct mm_struct *mm)
mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
- mm->owner,
+ mm->memcg,
#endif
mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2bd3df3d101a..21c13f4768eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -664,41 +664,50 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
}
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
+ struct mem_cgroup *memcg, *omemcg;
+
+ rcu_read_lock();
/*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
+ * Page cache insertions can happen without an
+ * actual mm context, e.g. during disk probing
+ * on boot, loopback IO, acct() writes etc.
*/
- if (unlikely(!p))
- return NULL;
-
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-EXPORT_SYMBOL(mem_cgroup_from_task);
+ if (unlikely(!mm))
+ goto root_memcg;
+
+ /* Loop trying to get a reference while mm->memcg is changing */
+ for (omemcg = NULL;; omemcg = memcg) {
+ memcg = rcu_dereference(mm->memcg);
+ if (unlikely(!memcg))
+ goto root_memcg;
+ if (likely(css_tryget_online(&memcg->css)))
+ goto found;
+ if ((memcg == omemcg) && css_tryget(&omemcg->css))
+ break;
+ }

-static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
-{
- struct mem_cgroup *memcg = NULL;
+ /* Walk up and find a live memory cgroup */
+ while (memcg != root_mem_cgroup) {
+ memcg = mem_cgroup_from_css(memcg->css.parent);
+ if (css_tryget_online(&memcg->css))
+ goto found_parent;
+ }
+ css_put(&omemcg->css);
+root_memcg:
+ css_get(&root_mem_cgroup->css);
+ rcu_read_unlock();
+ return root_mem_cgroup;

- rcu_read_lock();
- do {
- /*
- * Page cache insertions can happen withou an
- * actual mm context, e.g. during disk probing
- * on boot, loopback IO, acct() writes etc.
- */
- if (unlikely(!mm))
- memcg = root_mem_cgroup;
- else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!memcg))
- memcg = root_mem_cgroup;
- }
- } while (!css_tryget_online(&memcg->css));
+found_parent:
+ css_get(&memcg->css);
+ mm_update_memcg(mm, memcg);
+ css_put(&omemcg->css);
+found:
rcu_read_unlock();
return memcg;
+
}

/**
@@ -1011,7 +1020,7 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
* killed to prevent needlessly killing additional tasks.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_task(task);
+ task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
css_get(&task_memcg->css);
rcu_read_unlock();
}
@@ -4827,15 +4836,19 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
if (!move_flags)
return 0;

- from = mem_cgroup_from_task(p);
+ from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));

VM_BUG_ON(from == memcg);

mm = get_task_mm(p);
if (!mm)
return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
+
+ /*
+ * Charges are moved only when moving a leader in the same
+ * memcg as the mm.
+ */
+ if (mm->memcg == from) {
VM_BUG_ON(mc.from);
VM_BUG_ON(mc.to);
VM_BUG_ON(mc.precharge);
@@ -5035,6 +5048,55 @@ static void mem_cgroup_move_task(void)
}
#endif

+/**
+ * mm_update_memcg - Update the memory cgroup of a mm_struct
+ * @mm: mm struct
+ * @new: new memory cgroup value
+ *
+ * Called whenever mm->memcg needs to change. Consumes a reference
+ * to new (unless new is NULL). The reference to the old memory
+ * cgroup is decreased.
+ */
+void mm_update_memcg(struct mm_struct *mm, struct mem_cgroup *new)
+{
+ /* This is the only place where mm->memcg is changed */
+ struct mem_cgroup *old;
+
+ old = xchg(&mm->memcg, new);
+ if (old)
+ css_put(&old->css);
+}
+
+static void task_update_memcg(struct task_struct *tsk, struct mem_cgroup *new)
+{
+ struct mm_struct *mm;
+ task_lock(tsk);
+ mm = tsk->mm;
+ if (mm && !(tsk->flags & PF_KTHREAD))
+ mm_update_memcg(mm, new);
+ task_unlock(tsk);
+}
+
+static void mem_cgroup_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css;
+ struct task_struct *tsk;
+
+ cgroup_taskset_for_each(tsk, css, tset) {
+ struct mem_cgroup *new = mem_cgroup_from_css(css);
+ css_get(css);
+ task_update_memcg(tsk, new);
+ }
+}
+
+void mm_sync_memcg_from_task(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ css = task_get_css(tsk, memory_cgrp_id);
+ task_update_memcg(tsk, mem_cgroup_from_css(css));
+}
+
/*
* Cgroup retains root cgroups across [un]mount cycles making it necessary
* to verify whether we're attached to the default hierarchy on each mount
@@ -5335,8 +5397,10 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
+ .attach = mem_cgroup_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.post_attach = mem_cgroup_move_task,
+ .fork = mm_sync_memcg_from_task,
.bind = mem_cgroup_bind,
.dfl_cftypes = memory_files,
.legacy_cftypes = mem_cgroup_legacy_files,
@@ -5769,7 +5833,7 @@ void mem_cgroup_sk_alloc(struct sock *sk)
}

rcu_read_lock();
- memcg = mem_cgroup_from_task(current);
+ memcg = mem_cgroup_from_css(task_css(current, memory_cgrp_id));
if (memcg == root_mem_cgroup)
goto out;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && !memcg->tcpmem_active)
--
2.14.1


2018-06-01 13:58:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Thu 31-05-18 20:07:28, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote:
> >> This function searches for a new mm owner in children and siblings,
> >> and then iterates over all processes in the system in unlikely case.
> >> Despite the case is unlikely, its probability growths with the number
> >> of processes in the system. The time, spent on iterations, also growths.
> >> I regulary observe mm_update_next_owner() in crash dumps (not related
> >> to this function) of the nodes with many processes (20K+), so it looks
> >> like it's not so unlikely case.
> >
> > Did you manage to find the pattern that forces mm_update_next_owner to
> > slow paths? This really shouldn't trigger very often. If we can fallback
> > easily then I suspect that we should be better off reconsidering
> > mm->owner and try to come up with something more clever. I've had a
> > patch to remove owner few years back. It needed some work to finish but
> > maybe that would be a better than try to make non-scalable thing suck
> > less.
>
> Reading through the code I just found a trivial pattern that triggers
> this. Create a multi-threaded process. Have the thread group leader
> (the first thread) exit.

Hmm, I thought that we try to iterate over threads in the same thread
group first. But we are not doing that. Anyway just CLONE_VM without
CLONE_THREAD would achieve the same pathological path but that should be
rare. Group leader exiting early without tearing down the whole thread
group should be quite rare as well. No question that somebody might do
that on purpose though...
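
For reference, that kind of sharing (an mm used by a task outside the
thread group) is easy to set up from userspace. A rough, untested
sketch:

	#define _GNU_SOURCE
	#include <sched.h>
	#include <signal.h>
	#include <stdlib.h>
	#include <unistd.h>

	static int child_fn(void *arg)
	{
		(void)arg;
		pause();	/* holds a reference on the shared mm */
		return 0;
	}

	int main(void)
	{
		size_t sz = 1024 * 1024;
		char *stack = malloc(sz);

		/* Share the mm, but start a new thread group. */
		clone(child_fn, stack + sz, CLONE_VM | SIGCHLD, NULL);
		pause();
		return 0;
	}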

--
Michal Hocko
SUSE Labs

2018-06-01 14:07:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

On Thu 31-05-18 13:41:38, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 24-05-18 14:16:35, Andrew Morton wrote:
> >> On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <[email protected]> wrote:
> >>
> >> > I would really prefer and appreciate a repost with all the fixes folded
> >> > in.
> >>
> >> [1/2]
> >
> > Thanks Andrew and sorry it took so long! This seems to be missing the
> > fix for the issue I've mentioned in http://lkml.kernel.org/r/[email protected]
> > and Eric wanted to handle by http://lkml.kernel.org/r/[email protected].
> > I do not think that this fix is really correct one though. I will
> > comment in a reply to that email.
>
> Agreed. That was not a correct fix to the possible infinite loop in
> get_mem_cgroup_from_mm. Which in net leaves all of this broken and
> not ready to be merged.
>
> > In any case I really think this should better be reposted in one patch
> > series and restart the discussion. I strongly suggest revisiting my
> > previous attempt http://lkml.kernel.org/r/[email protected]
> > including the follow up discussion regarding the unfortunate CLONE_VM
> > outside of thread group case and potential solution for that.
>
> With my fix for get_mem_cgroup_from_mm falling flat that limits our
> possible future directions:
> - Make memory cgroups honest and require all tasks of an mm to be in the
> same memory cgroup. [BEST]

Agreed. This should be possible with the multi-pid migration support.
Just do not allow the migration if the set of pids doesn't contain all
processes sharing the same mm. We can add a special MMF_ flag for mm
that is generated by CLONE_VM without CLONE_THREAD to make the normal
path fast.

Or we can simply fail the migration for these odd cases and implement it
only if somebody actually complains. The flag would be useful for other
cases (e.g. in the oom path where we have to handle these tasks as well).

[...]
> >> - /* We move charges only when we move a owner of the mm */
> >> - if (mm->owner == p) {
> >> +
> >> + /* We move charges except for creative uses of CLONE_VM */
> >> + if (mm->memcg == from) {
> >
> > I must be missing something here but how do you prevent those cases?
> > AFAICS processes sharing the mm will simply allow to migrate.
>
> The mm will only migrate once per set of tasks. A task that does not
> migrate with the mm will not have mm->memcg == from. Which is all I was
> referring to. Perhaps I did not say that well.

OK, I see now (I guess). I would probably just drop the comment. The
code is much more clear when dealing with the memcg than a task...

> >> VM_BUG_ON(mc.from);
> >> VM_BUG_ON(mc.to);
> >> VM_BUG_ON(mc.precharge);
> >
> > Other than that the patch makes sense to me.
>
> I will take a bit and respin this. Because of algorithmic bug-fix
> nature of where this started I want to avoid depending on fixing the
> semantics. With my third option for fixing get_mem_cgroup_from_mm I see
> how to do that now. Then I will include a separate patch to fix the
> semantics.

Thanks!
--
Michal Hocko
SUSE Labs

2018-06-01 14:35:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> On Thu 31-05-18 20:07:28, Eric W. Biederman wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Thu 26-04-18 14:00:19, Kirill Tkhai wrote:
>> >> This function searches for a new mm owner in children and siblings,
>> >> and then iterates over all processes in the system in unlikely case.
>> >> Despite the case is unlikely, its probability growths with the number
>> >> of processes in the system. The time, spent on iterations, also growths.
>> >> I regulary observe mm_update_next_owner() in crash dumps (not related
>> >> to this function) of the nodes with many processes (20K+), so it looks
>> >> like it's not so unlikely case.
>> >
>> > Did you manage to find the pattern that forces mm_update_next_owner to
>> > slow paths? This really shouldn't trigger very often. If we can fallback
>> > easily then I suspect that we should be better off reconsidering
>> > mm->owner and try to come up with something more clever. I've had a
>> > patch to remove owner few years back. It needed some work to finish but
>> > maybe that would be a better than try to make non-scalable thing suck
>> > less.
>>
>> Reading through the code I just found a trivial pattern that triggers
>> this. Create a multi-threaded process. Have the thread group leader
>> (the first thread) exit.
>
> Hmm, I thought that we try to iterate over threads in the same thread
> group first. But we are not doing that. Anyway just CLONE_VM without
> CLONE_THREAD would achieve the same pathological path but that should be
> rare.

Yes, if the child exited. The code searches the children and siblings
but not the parents of the process that exited.

> Group leader exiting early without tearing down the whole thread
> group should be quite rare as well. No question that somebody might do
> that on purpose though...

The group leader exiting early is a completely legitimate and reasonable
thing to do, even if it is rare.

I think all it would take is one program like that in a work-load for
the performance to descend into something unpleasant.

Eric

2018-06-01 14:52:54

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 0/2] memcg: Require every task that uses an mm to migrate together


This is a follow-up to my last patch switching from mm->owner to
mm->memcg, and as it might be significant enough to break userspace, I
am carefully not depending on this change.

Eric W. Biederman (2):
memcg: Ensure every task that uses an mm is in the same memory cgroup
memcg: Remove dead code now that all tasks of an mm share a memcg

mm/memcontrol.c | 94 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 59 insertions(+), 35 deletions(-)



2018-06-01 14:53:59

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup


From a userspace perspective the cgroup of a mm is determined by which
cgroup the task belongs to. As practically an mm can only belong to
a single memory cgroup, having multiple tasks with the same mm in different
memory cgroups is not well defined.

Avoid the difficulties of dealing with crazy semantics and restrict all
tasks that share a single mm to the same memory cgroup.

This is accomplished by adding a new function mem_cgroup_mm_can_attach
that checks this condition with a straightforward algorithm. In the worst
case it is O(N^2). In the common case it should be O(N) in the number of
tasks being migrated. As typically only a single process, and thus a single
mm, is being migrated, it is optimized for that case.

There are users of mmget such as proc that can create references to an
mm that this function cannot find. This results in an unnecessary
migration failure. It does not break the invariant that every task of
an mm stays in the same memory cgroup. So this condition is annoying
but harmless.

This requires multi-threaded mm's to be migrated using the procs file.

This effectively forbids migrating processes whose mm is shared with other
processes, although enabling the control file might work.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21c13f4768eb..078ef562bb90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4798,6 +4798,50 @@ static void mem_cgroup_clear_mc(void)
mmput(mm);
}

+static int mem_cgroup_mm_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *css, *tcss;
+ struct task_struct *p, *t;
+ struct mm_struct *mm = NULL;
+ int ret = -EINVAL;
+
+ /*
+ * Ensure all references for every mm can be accounted for by
+ * tasks that are being migrated.
+ */
+ rcu_read_lock();
+ cgroup_taskset_for_each(p, css, tset) {
+ int mm_users, mm_count;
+
+ /* Does this task have an mm that has not been visited? */
+ if (!p->mm || (p->flags & PF_KTHREAD) || (p->mm == mm))
+ continue;
+
+ mm = p->mm;
+ mm_users = atomic_read(&mm->mm_users);
+ if (mm_users == 1)
+ continue;
+
+ mm_count = 0;
+ cgroup_taskset_for_each(t, tcss, tset) {
+ if (t->mm != mm)
+ continue;
+ mm_count++;
+ }
+ /*
+ * If there are no non-task references mm_users will
+ * be stable as holding cgroup_thread_rwsem for write
+ * blocks fork and exit.
+ */
+ if (mm_count != mm_users)
+ goto out;
+ }
+ ret = 0;
+out:
+ rcu_read_unlock();
+ return ret;
+}
+
static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
struct cgroup_subsys_state *css;
@@ -4806,7 +4850,12 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
struct task_struct *leader, *p;
struct mm_struct *mm;
unsigned long move_flags;
- int ret = 0;
+ int ret;
+
+ /* Is every task of every mm in tset being moved? */
+ ret = mem_cgroup_mm_can_attach(tset);
+ if (ret)
+ return ret;

/* charge immigration isn't supported on the default hierarchy */
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
--
2.14.1


2018-06-01 14:55:09

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH 2/2] memcg: Remove dead code now that all tasks of an mm share a memcg


get_mem_cgroup_from_mm no longer needs to deal with offline memory cgroups.

mem_cgroup_can_attach no longer needs to worry about cgroup leaders
as all tasks of an mm will migrate together.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
mm/memcontrol.c | 43 +++++++++----------------------------------
1 file changed, 9 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 078ef562bb90..a05c1253ad87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -666,7 +666,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)

static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
- struct mem_cgroup *memcg, *omemcg;
+ struct mem_cgroup *memcg;

rcu_read_lock();
/*
@@ -677,37 +677,18 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (unlikely(!mm))
goto root_memcg;

- /* Loop trying to get a reference while mm->memcg is changing */
- for (omemcg = NULL;; omemcg = memcg) {
+ do {
memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
goto root_memcg;
- if (likely(css_tryget_online(&memcg->css)))
- goto found;
- if ((memcg == omemcg) && css_tryget(&omemcg->css))
- break;
- }
+ } while (!css_tryget_online(&memcg->css));
+ rcu_read_unlock();
+ return memcg;

- /* Walk up and find a live memory cgroup */
- while (memcg != root_mem_cgroup) {
- memcg = mem_cgroup_from_css(memcg->css.parent);
- if (css_tryget_online(&memcg->css))
- goto found_parent;
- }
- css_put(&omemcg->css);
root_memcg:
css_get(&root_mem_cgroup->css);
rcu_read_unlock();
return root_mem_cgroup;
-
-found_parent:
- css_get(&memcg->css);
- mm_update_memcg(mm, memcg);
- css_put(&omemcg->css);
-found:
- rcu_read_unlock();
- return memcg;
-
}

/**
@@ -4847,7 +4828,7 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
struct mem_cgroup *from;
- struct task_struct *leader, *p;
+ struct task_struct *p;
struct mm_struct *mm;
unsigned long move_flags;
int ret;
@@ -4863,18 +4844,12 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)

/*
* Multi-process migrations only happen on the default hierarchy
- * where charge immigration is not used. Perform charge
- * immigration if @tset contains a leader and whine if there are
- * multiple.
+ * where charge immigration is not used.
*/
- p = NULL;
- cgroup_taskset_for_each_leader(leader, css, tset) {
- WARN_ON_ONCE(p);
- p = leader;
- memcg = mem_cgroup_from_css(css);
- }
+ p = cgroup_taskset_first(tset, &css);
if (!p)
return 0;
+ memcg = mem_cgroup_from_css(css);

/*
* We are now commited to this value whatever it is. Changes in this
--
2.14.1


2018-06-01 15:04:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
[...]
> > Group leader exiting early without tearing down the whole thread
> > group should be quite rare as well. No question that somebody might do
> > that on purpose though...
>
> The group leader exiting early is a completely legitimate and reasonable
> thing to do, even if it is rare.

I am not saying it isn't legitimate. But the most common case is the
main thread waiting for its threads or calling exit which would tear the
whole group down. Is there any easy way to achieve this other than tkill
to group leader? Calling exit(3) from the leader performs group exit
IIRC.

I am not arguing this is non-issue. And it certainly is a problem once
somebody wants to be nasty... I was more interested how often this
really happens for sane workloads.
--
Michal Hocko
SUSE Labs

2018-06-01 15:26:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
>> Michal Hocko <[email protected]> writes:
> [...]
>> > Group leader exiting early without tearing down the whole thread
>> > group should be quite rare as well. No question that somebody might do
>> > that on purpose though...
>>
>> The group leader exiting early is a completely legitimate and reasonable
>> thing to do, even if it is rare.
>
> I am not saying it isn't legitimate. But the most common case is the
> main thread waiting for its threads or calling exit which would tear the
> whole group down. Is there any easy way to achieve this other than tkill
> to group leader? Calling exit(3) from the leader performs group exit
> IIRC.

pthread_exit from the group leader.
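
A minimal reproducer (untested, build with -pthread): the leader exits
while another thread keeps the process, and therefore the mm, alive:

	#include <pthread.h>
	#include <unistd.h>

	static void *worker(void *arg)
	{
		(void)arg;
		sleep(60);	/* keeps the thread group and its mm alive */
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, worker, NULL);
		/*
		 * The group leader exits early. 'worker' is neither a child
		 * nor a sibling of the leader, so mm_update_next_owner()
		 * falls through to the for_each_process() scan to hand off
		 * mm->owner.
		 */
		pthread_exit(NULL);
	}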

> I am not arguing this is non-issue. And it certainly is a problem once
> somebody wants to be nasty... I was more interested how often this
> really happens for sane workloads.

That is a fair question. All I know for certain is that whatever Kirill
Tkhai's workload was it was triggering this the slow path.

Eric


2018-06-01 16:51:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Hello, Eric.

On Fri, Jun 01, 2018 at 09:53:09AM -0500, Eric W. Biederman wrote:
>
> From a userspace perspective the cgroup of a mm is determined by which
> cgroup the task belongs to. As practically an mm can only belong to
> a single memory cgroup, having multiple tasks with the same mm in different
> memory cgroups is not well defined.
>
> Avoid the difficulties of dealing with crazy semantics and restrict all
> tasks that share a single mm to the same memory cgroup.
>
> This is accomplished by adding a new function mem_cgroup_mm_can_attach
> that checks this condition with a straightforward algorithm. In the worst
> case it is O(N^2). In the common case it should be O(N) in the number of
> tasks being migrated. As typically only a single process, and thus a single
> mm, is being migrated, it is optimized for that case.
>
> There are users of mmget such as proc that can create references to an
> mm that this function cannot find. This results in an unnecessary
> migration failure. It does not break the invariant that every task of
> an mm stays in the same memory cgroup. So this condition is annoying
> but harmless.
>
> This requires multi-threaded mm's to be migrated using the procs file.
>
> This effectively forbids migrating processes whose mm is shared with other
> processes, although enabling the control file might work.

So, I don't think we need to support putting tasks which share a mm in
different cgroups. That said, if we're gonna put in that restriction,
I think it should be in cgroup core rather than memcg can_attach. The
only thing we'd need to do is widening what cgroup migration considers
to be a process.

Thanks.

--
tejun

2018-06-01 18:13:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Tejun Heo <[email protected]> writes:

>> This effectively forbids migrating processes whose mm is shared with other
>> processes, although enabling the control file might work.
>
> So, I don't think we need to support putting tasks which share a mm in
> different cgroups. That said, if we're gonna put in that restriction,
> I think it should be in cgroup core rather than memcg can_attach. The
> only thing we'd need to do is widening what cgroup migration considers
> to be a process.

Widening the definition of a process sounds good. The memory control
group code would still need a way to forbid these in cgroup v1 mode,
when someone uses the task file.

Using widening instead of denying should reduce the risk of introducing
a regression.

The only reason I support the crazy case in my earlier patch is so that
we can have this discussion and in case we do cause a regression with
this change the previous algorithmic cleanup won't have to be reverted
as well.

Eric

2018-06-01 19:18:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Hello,

On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> Widening the definition of a process sounds good. The memory control
> group code would still need a way to forbid these in cgroup v1 mode,
> when someone uses the task file.

Yeap, you're right. We'll need memcg's can_attach rejecting for v1.

> Using widening instead of denying should reduce the risk of introducing
> a regression.
>
> The only reason I support the crazy case in my earlier patch is so that
> we can have this discussion and in case we do cause a regression with
> this change the previous algorithmic cleanup won't have to be reverted
> as well.

Yeah, sure thing.

Thanks a lot.

--
tejun

2018-06-04 06:56:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Fri 01-06-18 10:25:59, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
> >> Michal Hocko <[email protected]> writes:
> > [...]
> >> > Group leader exiting early without tearing down the whole thread
> >> > group should be quite rare as well. No question that somebody might do
> >> > that on purpose though...
> >>
> >> The group leader exiting early is a completely legitimate and reasonable
> >> thing to do, even if it is rare.
> >
> > I am not saying it isn't legitimate. But the most common case is the
> > main thread waiting for its threads or calling exit which would tear the
> > whole group down. Is there any easy way to achieve this other than tkill
> > to group leader? Calling exit(3) from the leader performs group exit
> > IIRC.
>
> pthread_exit from the group leader.

Right, forgot to mention this one but this would be quite exotic, right?


> > I am not arguing this is non-issue. And it certainly is a problem once
> > somebody wants to be nasty... I was more interested how often this
> > really happens for sane workloads.
>
> That is a fair question. All I know for certain is that whatever Kirill
> Tkhai's workload was it was triggering this the slow path.

Yeah, that was exactly why I've asked that originally. It must be
something pretty special ;)
--
Michal Hocko
SUSE Labs

2018-06-04 13:02:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

[dropping Kirill Tkhai from the CC because I get rejection from the mail
server]

On Fri 01-06-18 12:16:52, Tejun Heo wrote:
> Hello,
>
> On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> > Widening the definition of a process sounds good. The memory control
> > group code would still need a way to forbid these in cgroup v1 mode,
> > when someone uses the task file.
>
> Yeap, you're right. We'll need memcg's can_attach rejecting for v1.

Do we really need? I mean, do we know about any existing usecase that
would need this weird threading concept and depend on memory migration
which doesn't really work?
--
Michal Hocko
SUSE Labs

2018-06-04 14:33:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Michal Hocko <[email protected]> writes:

> On Fri 01-06-18 10:25:59, Eric W. Biederman wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
>> >> Michal Hocko <[email protected]> writes:
>> > [...]
>> >> > Group leader exiting early without tearing down the whole thread
>> >> > group should be quite rare as well. No question that somebody might do
>> >> > that on purpose though...
>> >>
>> >> The group leader exiting early is a completely legitimate and reasonable
>> >> thing to do, even if it is rare.
>> >
>> > I am not saying it isn't legitimate. But the most common case is the
>> > main thread waiting for its threads or calling exit which would tear the
>> > whole group down. Is there any easy way to achieve this other than tkill
>> > to group leader? Calling exit(3) from the leader performs group exit
>> > IIRC.
>>
>> pthread_exit from the group leader.
>
> Right, forgot to mention this one but this would be quite exotic,
> right?

Not exotic. It is easy to do and well defined.

It would be easy to do if you are running a thread pool and closing
unnecessary threads. Or frankly anything where the application is not
assigning a special role to the initial thread.

It does seem rare enough that no one has noticed how atrocious
mm_update_next_owner is until now.

My key point is that it is easy to trigger which makes the current
mm_update_next_owner a fundamentally flawed design, and something that
needs to be fixed.

Eric

2018-06-04 18:49:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Hello, Michal.

On Mon, Jun 04, 2018 at 03:01:19PM +0200, Michal Hocko wrote:
> > On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
> > > Widening the definition of a process sounds good. The memory control
> > > group code would still need a way to forbid these in cgroup v1 mode,
> > > when someone uses the task file.
> >
> > Yeap, you're right. We'll need memcg's can_attach rejecting for v1.
>
> Do we really need? I mean, do we know about any existing usecase that
> would need this weird threading concept and depend on memory migration
> which doesn't really work?

I thought the requirement is from the ->owner change so that the
association doesn't become 1:N, right?

Thanks.

--
tejun

2018-06-04 19:12:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Tejun Heo <[email protected]> writes:

> Hello, Michal.
>
> On Mon, Jun 04, 2018 at 03:01:19PM +0200, Michal Hocko wrote:
>> > On Fri, Jun 01, 2018 at 01:11:59PM -0500, Eric W. Biederman wrote:
>> > > Widening the definition of a process sounds good. The memory control
>> > > group code would still need a way to forbid these in cgroup v1 mode,
>> > > when someone uses the task file.
>> >
>> > Yeap, you're right. We'll need memcg's can_attach rejecting for v1.
>>
>> Do we really need? I mean, do we know about any existing usecase that
>> would need this weird threading concept and depend on memory migration
>> which doesn't really work?
>
> I thought the requirement is from the ->owner change so that the
> association doesn't become 1:N, right?

Yes. We need not the existing can_attach, but my new
mem_cgroup_mm_can_attach.

Even if the cgroup notion of a process is extended to be any set of
tasks that shares an mm, on cgroup v1 we still need to fail migration
through the tasks file for processes that are not single-threaded.

Eric


2018-06-05 08:16:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On Mon 04-06-18 09:31:46, Eric W. Biederman wrote:
[...]
> My key point is that it is easy to trigger which makes the current
> mm_update_next_owner a fundamentally flawed design, and something that
> needs to be fixed.

Ohh, absolutely agreed! I was not trying to argue that part of course.

--
Michal Hocko
SUSE Labs

2018-06-05 08:49:45

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

On 01.06.2018 18:25, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
>
>> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
>>> Michal Hocko <[email protected]> writes:
>> [...]
>>>> Group leader exiting early without tearing down the whole thread
>>>> group should be quite rare as well. No question that somebody might do
>>>> that on purpose though...
>>>
>>> The group leader exiting early is a completely legitimate and reasonable
>>> thing to do, even if it is rare.
>>
>> I am not saying it isn't legitimate. But the most common case is the
>> main thread waiting for its threads or calling exit which would tear the
>> whole group down. Is there any easy way to achieve this other than tkill
>> to group leader? Calling exit(3) from the leader performs group exit
>> IIRC.
>
> pthread_exit from the group leader.
>
>> I am not arguing this is non-issue. And it certainly is a problem once
>> somebody wants to be nasty... I was more interested how often this
>> really happens for sane workloads.
>
> That is a fair question. All I know for certain is that whatever Kirill
> Tkhai's workload was it was triggering this the slow path.

It was triggered on a server where many VPSes of many people are hosted.
Sorry, I have no idea what they did.

Kirill

2018-06-05 15:38:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/4] exit: Make unlikely case in mm_update_next_owner() more scalable

Kirill Tkhai <[email protected]> writes:

> On 01.06.2018 18:25, Eric W. Biederman wrote:
>> Michal Hocko <[email protected]> writes:
>>
>>> On Fri 01-06-18 09:32:42, Eric W. Biederman wrote:
>>>> Michal Hocko <[email protected]> writes:
>>> [...]
>>>>> Group leader exiting early without tearing down the whole thread
>>>>> group should be quite rare as well. No question that somebody might do
>>>>> that on purpose though...
>>>>
>>>> The group leader exiting early is a completely legitimate and reasonable
>>>> thing to do, even if it is rare.
>>>
>>> I am not saying it isn't legitimate. But the most common case is the
>>> main thread waiting for its threads or calling exit which would tear the
>>> whole group down. Is there any easy way to achieve this other than tkill
>>> to group leader? Calling exit(3) from the leader performs group exit
>>> IIRC.
>>
>> pthread_exit from the group leader.
>>
>>> I am not arguing this is non-issue. And it certainly is a problem once
>>> somebody wants to be nasty... I was more interested how often this
>>> really happens for sane workloads.
>>
>> That is a fair question. All I know for certain is that whatever Kirill
>> Tkhai's workload was it was triggering this the slow path.
>
> It was triggered on a server where many VPSes of many people are hosted.
> Sorry, I have no idea what they did.

That at least tells us it was naturally occurring. Which makes this a
real problem in the real world.

Eric

2018-06-06 11:14:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

On Fri 01-06-18 09:53:09, Eric W. Biederman wrote:
[...]
> +static int mem_cgroup_mm_can_attach(struct cgroup_taskset *tset)
> +{
> + struct cgroup_subsys_state *css, *tcss;
> + struct task_struct *p, *t;
> + struct mm_struct *mm = NULL;
> + int ret = -EINVAL;
> +
> + /*
> + * Ensure all references for every mm can be accounted for by
> + * tasks that are being migrated.
> + */
> + rcu_read_lock();
> + cgroup_taskset_for_each(p, css, tset) {
> + int mm_users, mm_count;
> +
> + /* Does this task have an mm that has not been visited? */
> + if (!p->mm || (p->flags & PF_KTHREAD) || (p->mm == mm))
> + continue;
> +
> + mm = p->mm;
> + mm_users = atomic_read(&mm->mm_users);
> + if (mm_users == 1)
> + continue;
> +
> + mm_count = 0;
> + cgroup_taskset_for_each(t, tcss, tset) {
> + if (t->mm != mm)
> + continue;
> + mm_count++;
> + }
> + /*
> + * If there are no non-task references mm_users will
> + * be stable as holding cgroup_thread_rwsem for write
> + * blocks fork and exit.
> + */
> + if (mm_count != mm_users)
> + goto out;

Wouldn't it be much simpler to do something like this instead? Sure
it will break migration of non-thread tasks sharing mms but I would like
to see that this actually matters to anybody rather than make the code
more complicated and maintain it forever without a good reason. So yes
this is a bit harsh but considering that the migration succeeded silently
and that was simply broken as well, I am not sure this is any worse.

Btw. MMF_ALIEN_MM could be used in the OOM proper as well.

diff --git a/kernel/fork.c b/kernel/fork.c
index dfe8e14c0fba..285cd0397307 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1303,6 +1303,8 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
+ if (unlikely(!(clone_flags & CLONE_THREAD)))
+ set_bit(MMF_ALIEN_MM, &mm->flags);
goto good_mm;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f5dac193431..fa0248fcdb36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5103,6 +5103,18 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
if (!mm)
return 0;

+ /*
+ * Migrating a task which shares mm with other thread group
+ * has never been really well defined. Shout to the log when
+ * this happens and see whether anybody really complains.
+ * If so make sure to support migration if all processes sharing
+ * this mm are migrating together.
+ */
+ if (WARN_ON_ONCE(test_bit(MMF_ALIEN_MM, &mm->flags))) {
+ mmput(mm);
+ return -EBUSY;
+ }
+
/* We move charges except for creative uses of CLONE_VM */
if (mm->memcg == from) {
VM_BUG_ON(mc.from);

--
Michal Hocko
SUSE Labs

2018-06-07 12:00:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

Michal Hocko <[email protected]> writes:

> On Fri 01-06-18 09:53:09, Eric W. Biederman wrote:
> [...]
>> +static int mem_cgroup_mm_can_attach(struct cgroup_taskset *tset)
>> +{
>> + struct cgroup_subsys_state *css, *tcss;
>> + struct task_struct *p, *t;
>> + struct mm_struct *mm = NULL;
>> + int ret = -EINVAL;
>> +
>> + /*
>> + * Ensure all references for every mm can be accounted for by
>> + * tasks that are being migrated.
>> + */
>> + rcu_read_lock();
>> + cgroup_taskset_for_each(p, css, tset) {
>> + int mm_users, mm_count;
>> +
>> + /* Does this task have an mm that has not been visited? */
>> + if (!p->mm || (p->flags & PF_KTHREAD) || (p->mm == mm))
>> + continue;
>> +
>> + mm = p->mm;
>> + mm_users = atomic_read(&mm->mm_users);
>> + if (mm_users == 1)
>> + continue;
>> +
>> + mm_count = 0;
>> + cgroup_taskset_for_each(t, tcss, tset) {
>> + if (t->mm != mm)
>> + continue;
>> + mm_count++;
>> + }
>> + /*
>> + * If there are no non-task references mm_users will
>> + * be stable as holding cgroup_thread_rwsem for write
>> + * blocks fork and exit.
>> + */
>> + if (mm_count != mm_users)
>> + goto out;
>
> Wouldn't it be much simpler to do something like this instead? Sure
> it will break migration of non-thread tasks sharing mms but I would like
> to see that this actually matters to anybody rather than make the code
> more complicated and maintain it forever without a good reason. So yes
> this is a bit harsh but considering that the migration succeeded silently
> and that was simply broken as well, I am not sure this is any worse.

I definitely agree that there are some other possibilities worth
exploring.

> Btw. MMF_ALIEN_MM could be used in the OOM proper as well.

There are two big issues I see with your suggested alternative.
1) The cgroup v1 task interface. We still need to deny migrating
part of a thread group.

2) vfork. That uses CLONE_VM and it gets used. At a quick look
I am seeing gcc, g++, cpp, emacs24, strace, calendar, nm, telnet, gdb,
and several other programs I don't recognize.
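
For reference (from a quick look at kernel/fork.c, so double-check me),
the vfork() syscall is essentially

	_do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, ...);

i.e. CLONE_VM without CLONE_THREAD, so keying the flag purely on the
absence of CLONE_THREAD tags the mm of every one of those vfork() users.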

I believe your proposal will prevent onlining the memcgroup if there
is a compile running, because of failure to support vfork.

Further I expect we would need a count rather than a bit that gets set
and never gets cleared. Or else even when the mm is no longer shared by
vfork we will still think it is.

Michal do you have an opinion on my previous patch?

I just want to make certain that this fun work does not get all of the
attention, and the bug fix actually gets reviewed.

Eric

> diff --git a/kernel/fork.c b/kernel/fork.c
> index dfe8e14c0fba..285cd0397307 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1303,6 +1303,8 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
> if (clone_flags & CLONE_VM) {
> mmget(oldmm);
> mm = oldmm;
> + if (unlikely(!(clone_flags & CLONE_THREAD)))
> + set_bit(MMF_ALIEN_MM, &mm->flags);
> goto good_mm;
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2f5dac193431..fa0248fcdb36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5103,6 +5103,18 @@ static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> if (!mm)
> return 0;
>
> + /*
> + * Migrating a task which shares mm with other thread group
> + * has never been really well defined. Shout to the log when
> + * this happens and see whether anybody really complains.
> + * If so make sure to support migration if all processes sharing
> + * this mm are migrating together.
> + */
> + if (WARN_ON_ONCE(test_bit(MMF_ALIEN_MM, &mm->flags))) {
> + mmput(mm);
> + return -EBUSY;
> + }
> +
> /* We move charges except for creative uses of CLONE_VM */
> if (mm->memcg == from) {
> VM_BUG_ON(mc.from);

2018-06-07 13:12:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/2] memcg: Ensure every task that uses an mm is in the same memory cgroup

On Thu 07-06-18 06:42:49, Eric W. Biederman wrote:
> Michal Hocko <[email protected]> writes:
[...]
> > Btw. MMF_ALIEN_MM could be used in the OOM proper as well.
>
> There are two big issues I see with your suggested alternative.
> 1) The cgroup v1 task interface. We still need to deny migrating
> part of a thread group.

Hmm. Can we simply check that the migrating tsk is the thread group
leader in can_attach and keep the silent migration skip?

> 2) vfork. That uses CLONE_VM and it gets used.

Right. MMF_ALIEN_MM would need a better check. I'd forgotten about this
case.
+ if (unlikely(!(clone_flags & (CLONE_THREAD|CLONE_VFORK))))
+ set_bit(MMF_ALIEN_MM, &mm->flags);

this would still allow migrating the mm via a vforked task, which is wrong,
so can_attach would have to special-case the vforked task as well. Which is,
ehm, fuggly but that is something we have to handle regardless of
MMF_ALIEN_MM... I will have to check your other patch but I suspect you
haven't done that. Btw. neither did I when trying to work on this
previously.

> At a quick look
> I am seeing gcc, g++, cpp, emacs24, strace, calendar, nm, telnet, gdb,
> and several other programs I don't recognize.
>
> I believe your proposal will prevent onlining the memcgroup if there

did you mean s@onlining@offlining@ ?

> is a compile running, because of failure to support vfork.
>
> Further I expect we would need a count rather than a bit that gets set
> and never gets cleared. Or else even when the mm is no longer shared by
> vfork we will still think it is.

Yes, but does that actually matter all that much to warrant the
additional complexity? If somebody uses the funny threading model and
wants to migrate then the failure will likely get noticed regardless of
whether there is one process or more sharing the mm. One failing while
the other succeeds sounds confusing.

> Michal do you have an opinion on my previous patch?

I guess you mean http://lkml.kernel.org/r/[email protected]
Yes, I definitely plan to review this. I just need to find some time to
focus on this without being interrupted by many other small things.

> I just want to make certain that this fun work does not get all of the
> attention, and the bug fix actually gets reviewed.

Sure thing!
--
Michal Hocko
SUSE Labs