2019-07-27 17:21:50

by Waiman Long

Subject: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

It was found that a dying mm_struct where the owning task has exited
can stay on as the active_mm of kernel threads as long as no other user
tasks run on the CPUs that use it as active_mm. This prolongs the
lifetime of the dying mm, holding up memory and other resources, like
swap space, that cannot be freed.

Fix that by forcing the kernel threads to use init_mm as the active_mm
if the previous active_mm is dying.

The determination of a dying mm is based on the absence of an owning
task. The selection of an owning task only happens when CONFIG_MEMCG is
enabled. Without that, there is no simple way to determine the lifespan
of a given mm, so the code falls back to the old behavior.

Signed-off-by: Waiman Long <[email protected]>
---
 include/linux/mm_types.h | 15 +++++++++++++++
 kernel/sched/core.c      | 13 +++++++++++--
 mm/init-mm.c             |  4 ++++
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3a37a89eb7a7..32712e78763c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
 	return atomic_read(&mm->tlb_flush_pending) > 1;
 }

+#ifdef CONFIG_MEMCG
+/*
+ * A mm is considered dying if there is no owning task.
+ */
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return !mm->owner;
+}
+#else
+static inline bool mm_dying(struct mm_struct *mm)
+{
+	return false;
+}
+#endif
+
 struct vm_fault;

 /**
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..923a63262dfd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 * Both of these contain the full memory barrier required by
 	 * membarrier after storing to rq->curr, before returning to
 	 * user-space.
+	 *
+	 * If mm is NULL and oldmm is dying (!owner), we switch to
+	 * init_mm instead to make sure that oldmm can be freed ASAP.
 	 */
-	if (!mm) {
+	if (!mm && !mm_dying(oldmm)) {
 		next->active_mm = oldmm;
 		mmgrab(oldmm);
 		enter_lazy_tlb(oldmm, next);
-	} else
+	} else {
+		if (!mm) {
+			mm = &init_mm;
+			next->active_mm = mm;
+			mmgrab(mm);
+		}
 		switch_mm_irqs_off(oldmm, mm, next);
+	}

 	if (!prev->mm) {
 		prev->active_mm = NULL;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index a787a319211e..69090a11249c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -5,6 +5,7 @@
 #include <linux/spinlock.h>
 #include <linux/list.h>
 #include <linux/cpumask.h>
+#include <linux/sched/task.h>

 #include <linux/atomic.h>
 #include <linux/user_namespace.h>
@@ -36,5 +37,8 @@ struct mm_struct init_mm = {
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
 	.user_ns	= &init_user_ns,
 	.cpu_bitmap	= { [BITS_TO_LONGS(NR_CPUS)] = 0},
+#ifdef CONFIG_MEMCG
+	.owner		= &init_task,
+#endif
 	INIT_MM_CONTEXT(init_mm)
 };
--
2.18.1



2019-07-29 08:24:21

by Qais Yousef

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 07/27/19 13:10, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as the active_mm of kernel threads as long as no other user
> tasks run on the CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources, like
> swap space, that cannot be freed.
>
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
>
> The determination of a dying mm is based on the absence of an owning
> task. The selection of an owning task only happens when CONFIG_MEMCG is
> enabled. Without that, there is no simple way to determine the lifespan
> of a given mm, so the code falls back to the old behavior.

I don't really know a lot about this code, but does the owner field have
to depend on CONFIG_MEMCG? I.e., can't the owner always be set?

Cheers

--
Qais Yousef

2019-07-29 09:06:25

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as the active_mm of kernel threads as long as no other user
> tasks run on the CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources, like
> swap space, that cannot be freed.

Sure, but this has been so 'forever', why is it a problem now?

> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
>
> The determination of a dying mm is based on the absence of an owning
> task. The selection of an owning task only happens when CONFIG_MEMCG is
> enabled. Without that, there is no simple way to determine the lifespan
> of a given mm, so the code falls back to the old behavior.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
>  include/linux/mm_types.h | 15 +++++++++++++++
>  kernel/sched/core.c      | 13 +++++++++++--
>  mm/init-mm.c             |  4 ++++
>  3 files changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3a37a89eb7a7..32712e78763c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>  }
>
> +#ifdef CONFIG_MEMCG
> +/*
> + * A mm is considered dying if there is no owning task.
> + */
> +static inline bool mm_dying(struct mm_struct *mm)
> +{
> +	return !mm->owner;
> +}
> +#else
> +static inline bool mm_dying(struct mm_struct *mm)
> +{
> +	return false;
> +}
> +#endif
> +
>  struct vm_fault;

Yuck. So people without memcg will still suffer the terrible 'whatever
it is this patch fixes'.

>  /**
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2b037f195473..923a63262dfd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>  	 * Both of these contain the full memory barrier required by
>  	 * membarrier after storing to rq->curr, before returning to
>  	 * user-space.
> +	 *
> +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> +	 * init_mm instead to make sure that oldmm can be freed ASAP.
>  	 */
> -	if (!mm) {
> +	if (!mm && !mm_dying(oldmm)) {
>  		next->active_mm = oldmm;
>  		mmgrab(oldmm);
>  		enter_lazy_tlb(oldmm, next);
> -	} else
> +	} else {
> +		if (!mm) {
> +			mm = &init_mm;
> +			next->active_mm = mm;
> +			mmgrab(mm);
> +		}
>  		switch_mm_irqs_off(oldmm, mm, next);
> +	}
>
>  	if (!prev->mm) {
>  		prev->active_mm = NULL;

Bah, I see we _still_ haven't 'fixed' that code. And you're making an
even bigger mess of it.

Let me go find where that cleanup went.

2019-07-29 09:20:05

by Michal Hocko

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Sat 27-07-19 13:10:47, Waiman Long wrote:
> It was found that a dying mm_struct where the owning task has exited
> can stay on as the active_mm of kernel threads as long as no other user
> tasks run on the CPUs that use it as active_mm. This prolongs the
> lifetime of the dying mm, holding up memory and other resources, like
> swap space, that cannot be freed.

IIRC use_mm doesn't pin the address space. It only pins the mm_struct
itself. So what exactly is the problem here?
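For reference, use_mm() only takes an mm_count reference. A simplified
sketch (paraphrased from mm/mmu_context.c of this era, not the verbatim
kernel code):

	/*
	 * Sketch: note the mmgrab()/mmdrop() pair, which pins only the
	 * struct mm_struct. Nothing here touches mm_users, which is
	 * what pins the address space itself.
	 */
	void use_mm(struct mm_struct *mm)
	{
		struct mm_struct *active_mm;
		struct task_struct *tsk = current;

		task_lock(tsk);
		active_mm = tsk->active_mm;
		if (active_mm != mm) {
			mmgrab(mm);		/* mm_count++ only */
			tsk->active_mm = mm;
		}
		tsk->mm = mm;
		switch_mm(active_mm, mm, tsk);
		task_unlock(tsk);

		if (active_mm != mm)
			mmdrop(active_mm);	/* mm_count-- */
	}

So a kthread borrowing an mm this way keeps the struct alive but not
the address space.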

>
> Fix that by forcing the kernel threads to use init_mm as the active_mm
> if the previous active_mm is dying.
>
> The determination of a dying mm is based on the absence of an owning
> task. The selection of an owning task only happens when CONFIG_MEMCG is
> enabled. Without that, there is no simple way to determine the lifespan
> of a given mm, so the code falls back to the old behavior.

Please don't. We really want to remove mm->owner long term.
--
Michal Hocko
SUSE Labs

2019-07-29 15:06:38

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as the active_mm of kernel threads as long as no other user
> >> tasks run on the CPUs that use it as active_mm. This prolongs the
> >> lifetime of the dying mm, holding up memory and other resources, like
> >> swap space, that cannot be freed.
> > Sure, but this has been so 'forever', why is it a problem now?
>
> I ran into this problem when running a test program that keeps on
> allocating and touching memory until it eventually fails as the swap
> space is full. After the failure, I could not rerun the test program
> again because the swap space remained full. I finally tracked it down
> to the fact that the mm stayed on as the active_mm of kernel threads. I
> had to make sure that all the idle CPUs got a user task to run to bump
> the dying mm off the active_mm of those CPUs, but this is just a
> workaround, not a solution to this problem.

The 'sad' part is that x86 already switches to init_mm on idle and we
only keep the active_mm around for 'stupid'.

Rik and Andy were working on getting that 'fixed' a while ago, not sure
where that went.

2019-07-29 15:29:37

by Rik van Riel

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:

> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.

Wait, where do we do that?

> Rik and Andy were working on getting that 'fixed' a while ago, not
> sure where that went.

My lazy TLB stuff got merged last year.

Did we miss a spot somewhere, where the code still
quietly switches to init_mm for no good reason?

--
All Rights Reversed.



2019-07-29 16:35:19

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
>
> > The 'sad' part is that x86 already switches to init_mm on idle and we
> > only keep the active_mm around for 'stupid'.
>
> Wait, where do we do that?

drivers/idle/intel_idle.c: leave_mm(cpu);
drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id());

> > Rik and Andy were working on getting that 'fixed' a while ago, not
> > sure where that went.
>
> My lazy TLB stuff got merged last year.

Yes, but we never got around to getting rid of active_mm for x86, right?

2019-07-29 16:40:54

by Rik van Riel

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, 2019-07-29 at 11:37 -0400, Waiman Long wrote:
> On 7/29/19 11:03 AM, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
> > > On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> > > > On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> > > > > It was found that a dying mm_struct where the owning task has
> > > > > exited can stay on as the active_mm of kernel threads as long
> > > > > as no other user tasks run on the CPUs that use it as
> > > > > active_mm. This prolongs the lifetime of the dying mm, holding
> > > > > up memory and other resources, like swap space, that cannot be
> > > > > freed.
> > > > Sure, but this has been so 'forever', why is it a problem now?
> > > I ran into this problem when running a test program that keeps on
> > > allocating and touching memory until it eventually fails as the
> > > swap space is full. After the failure, I could not rerun the test
> > > program again because the swap space remained full. I finally
> > > tracked it down to the fact that the mm stayed on as the active_mm
> > > of kernel threads. I had to make sure that all the idle CPUs got a
> > > user task to run to bump the dying mm off the active_mm of those
> > > CPUs, but this is just a workaround, not a solution to this
> > > problem.
> > The 'sad' part is that x86 already switches to init_mm on idle and
> > we only keep the active_mm around for 'stupid'.
> >
> > Rik and Andy were working on getting that 'fixed' a while ago, not
> > sure where that went.
>
> Good, perhaps the right thing to do is for the idle->kernel case to
> keep init_mm as the active_mm instead of reusing whatever was left
> behind the last time around.

Absolutely not. That creates heavy cache line
contention on the mm_cpumask as we switch the
mm out and back in after an idle period.

The cache line contention on the mm_cpumask
alone can take up as much as a percent of
CPU time on a very busy system with a large
multi-threaded application, multiple sockets,
and lots of context switches.
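To illustrate, here is roughly the mm_cpumask dance a real mm switch
does (a simplified sketch modeled on switch_mm_irqs_off() in
arch/x86/mm/tlb.c, not the verbatim code; sketch_switch_mm and
load_new_cr3 are made-up names):

	/*
	 * Sketch: both cpumask writes below dirty a cacheline that is
	 * shared by every CPU running threads of that mm. Bouncing a
	 * CPU through init_mm and back on every idle period would pay
	 * this cost twice each time.
	 */
	static void sketch_switch_mm(struct mm_struct *prev,
				     struct mm_struct *next,
				     unsigned int cpu)
	{
		if (prev != next) {
			/* stop TLB-flush IPIs for the old mm targeting this CPU */
			cpumask_clear_cpu(cpu, mm_cpumask(prev));
			/* make this CPU visible to the new mm's flushers */
			cpumask_set_cpu(cpu, mm_cpumask(next));
			load_new_cr3(next->pgd);	/* hypothetical helper */
		}
	}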

--
All Rights Reversed.



2019-07-29 16:47:08

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, Jul 29, 2019 at 12:10:12PM -0400, Rik van Riel wrote:
> On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote:
> > On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> > > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
> > >
> > > > The 'sad' part is that x86 already switches to init_mm on idle
> > > > and we only keep the active_mm around for 'stupid'.
> > >
> > > Wait, where do we do that?
> >
> > drivers/idle/intel_idle.c: leave_mm(cpu);
> > drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id());
>
> This is only done for deeper c-states, isn't it?

Not C1, but I forever forget where it starts doing that. IIRC it isn't
too hard to hit it often, and I'm fairly sure we always do it when we
hit NOHZ.

> > > > Rik and Andy were working on getting that 'fixed' a while ago,
> > > > not sure where that went.
> > >
> > > My lazy TLB stuff got merged last year.
> >
> > Yes, but we never got around to getting rid of active_mm for x86,
> > right?
>
> True, we still use active_mm. Getting rid of the active_mm refcounting
> altogether did not look entirely worth the hassle.

OK, clearly I forgot some of the details ;-)

2019-07-29 17:45:00

by Peter Zijlstra

Subject: [PATCH] sched: Clean up active_mm reference counting

On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:

> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 2b037f195473..923a63262dfd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
> >  	 * Both of these contain the full memory barrier required by
> >  	 * membarrier after storing to rq->curr, before returning to
> >  	 * user-space.
> > +	 *
> > +	 * If mm is NULL and oldmm is dying (!owner), we switch to
> > +	 * init_mm instead to make sure that oldmm can be freed ASAP.
> >  	 */
> > -	if (!mm) {
> > +	if (!mm && !mm_dying(oldmm)) {
> >  		next->active_mm = oldmm;
> >  		mmgrab(oldmm);
> >  		enter_lazy_tlb(oldmm, next);
> > -	} else
> > +	} else {
> > +		if (!mm) {
> > +			mm = &init_mm;
> > +			next->active_mm = mm;
> > +			mmgrab(mm);
> > +		}
> >  		switch_mm_irqs_off(oldmm, mm, next);
> > +	}
> >
> >  	if (!prev->mm) {
> >  		prev->active_mm = NULL;
>
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
>
> Let me go find where that cleanup went.

---
Subject: sched: Clean up active_mm reference counting
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 29 16:05:15 CEST 2019

The current active_mm reference counting is confusing and sub-optimal.

Rewrite the code to explicitly consider the 4 separate cases:

  user   -> user

	When switching between two user tasks, all we need to consider
	is switch_mm().

  user   -> kernel

	When switching from a user task to a kernel task (which
	doesn't have an associated mm) we retain the last mm in our
	active_mm. Increment a reference count on active_mm.

  kernel -> kernel

	When switching between kernel threads, all we need to do is
	pass along the active_mm reference.

  kernel -> user

	When switching between a kernel and user task, we must switch
	from the last active_mm to the next mm, hoping of course that
	these are the same. Decrement a reference on the active_mm.

The code keeps a different order, because as you'll note, both 'to
user' cases require switch_mm().

And where the old code would increment/decrement for the 'kernel ->
kernel' case, the new code observes this is a neutral operation and
avoids touching the reference count.

Cc: [email protected]
Cc: [email protected]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/core.c | 49 ++++++++++++++++++++++++++++++-------------------
1 file changed, 30 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3214,12 +3214,8 @@ static __always_inline struct rq *
 context_switch(struct rq *rq, struct task_struct *prev,
 	       struct task_struct *next, struct rq_flags *rf)
 {
-	struct mm_struct *mm, *oldmm;
-
 	prepare_task_switch(rq, prev, next);

-	mm = next->mm;
-	oldmm = prev->active_mm;
 	/*
 	 * For paravirt, this is coupled with an exit in switch_to to
 	 * combine the page table reload and the switch backend into
@@ -3228,22 +3224,37 @@ context_switch(struct rq *rq, struct tas
 	arch_start_context_switch(prev);

 	/*
-	 * If mm is non-NULL, we pass through switch_mm(). If mm is
-	 * NULL, we will pass through mmdrop() in finish_task_switch().
-	 * Both of these contain the full memory barrier required by
-	 * membarrier after storing to rq->curr, before returning to
-	 * user-space.
+	 * kernel -> kernel   lazy + transfer active
+	 *   user -> kernel   lazy + mmgrab() active
+	 *
+	 * kernel ->   user   switch + mmdrop() active
+	 *   user ->   user   switch
 	 */
-	if (!mm) {
-		next->active_mm = oldmm;
-		mmgrab(oldmm);
-		enter_lazy_tlb(oldmm, next);
-	} else
-		switch_mm_irqs_off(oldmm, mm, next);
-
-	if (!prev->mm) {
-		prev->active_mm = NULL;
-		rq->prev_mm = oldmm;
+	if (!next->mm) {                                // to kernel
+		enter_lazy_tlb(prev->active_mm, next);
+
+		next->active_mm = prev->active_mm;
+		if (prev->mm)                           // from user
+			mmgrab(prev->active_mm);
+		else
+			prev->active_mm = NULL;
+	} else {                                        // to user
+		/*
+		 * sys_membarrier() requires an smp_mb() between setting
+		 * rq->curr and returning to userspace.
+		 *
+		 * The below provides this either through switch_mm(), or in
+		 * case 'prev->active_mm == next->mm' through
+		 * finish_task_switch()'s mmdrop().
+		 */
+
+		switch_mm_irqs_off(prev->active_mm, next->mm, next);
+
+		if (!prev->mm) {                        // from kernel
+			/* will mmdrop() in finish_task_switch(). */
+			rq->prev_mm = prev->active_mm;
+			prev->active_mm = NULL;
+		}
 	}

 	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

2019-07-29 17:46:21

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
> > It was found that a dying mm_struct where the owning task has exited
> > can stay on as the active_mm of kernel threads as long as no other user
> > tasks run on the CPUs that use it as active_mm. This prolongs the
> > lifetime of the dying mm, holding up memory and other resources, like
> > swap space, that cannot be freed.
>
> Sure, but this has been so 'forever', why is it a problem now?
>
> > Fix that by forcing the kernel threads to use init_mm as the active_mm
> > if the previous active_mm is dying.
> >
> > The determination of a dying mm is based on the absence of an owning
> > task. The selection of an owning task only happens when CONFIG_MEMCG is
> > enabled. Without that, there is no simple way to determine the lifespan
> > of a given mm, so the code falls back to the old behavior.
> >
> > Signed-off-by: Waiman Long <[email protected]>
> > ---
> >  include/linux/mm_types.h | 15 +++++++++++++++
> >  kernel/sched/core.c      | 13 +++++++++++--
> >  mm/init-mm.c             |  4 ++++
> >  3 files changed, 30 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3a37a89eb7a7..32712e78763c 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
> >  	return atomic_read(&mm->tlb_flush_pending) > 1;
> >  }
> >
> > +#ifdef CONFIG_MEMCG
> > +/*
> > + * A mm is considered dying if there is no owning task.
> > + */
> > +static inline bool mm_dying(struct mm_struct *mm)
> > +{
> > +	return !mm->owner;
> > +}
> > +#else
> > +static inline bool mm_dying(struct mm_struct *mm)
> > +{
> > +	return false;
> > +}
> > +#endif
> > +
> >  struct vm_fault;
>
> Yuck. So people without memcg will still suffer the terrible 'whatever
> it is this patch fixes'.

Also, why then not key off that owner tracking to free the resources
(and leave the struct mm around) and avoid touching this scheduling
hot-path?

2019-07-29 18:10:23

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 4:52 AM, Peter Zijlstra wrote:
> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as the active_mm of kernel threads as long as no other user
>> tasks run on the CPUs that use it as active_mm. This prolongs the
>> lifetime of the dying mm, holding up memory and other resources, like
>> swap space, that cannot be freed.
> Sure, but this has been so 'forever', why is it a problem now?

I ran into this problem when running a test program that keeps on
allocating and touching memory until it eventually fails as the swap
space is full. After the failure, I could not rerun the test program
again because the swap space remained full. I finally tracked it down
to the fact that the mm stayed on as the active_mm of kernel threads. I
had to make sure that all the idle CPUs got a user task to run to bump
the dying mm off the active_mm of those CPUs, but this is just a
workaround, not a solution to this problem.

>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of an owning task only happens when CONFIG_MEMCG is
>> enabled. Without that, there is no simple way to determine the lifespan
>> of a given mm, so the code falls back to the old behavior.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>>  include/linux/mm_types.h | 15 +++++++++++++++
>>  kernel/sched/core.c      | 13 +++++++++++--
>>  mm/init-mm.c             |  4 ++++
>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 3a37a89eb7a7..32712e78763c 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>>  }
>>
>> +#ifdef CONFIG_MEMCG
>> +/*
>> + * A mm is considered dying if there is no owning task.
>> + */
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return !mm->owner;
>> +}
>> +#else
>> +static inline bool mm_dying(struct mm_struct *mm)
>> +{
>> +	return false;
>> +}
>> +#endif
>> +
>>  struct vm_fault;
> Yuck. So people without memcg will still suffer the terrible 'whatever
> it is this patch fixes'.
>
That is true.
>>  /**
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 2b037f195473..923a63262dfd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -3233,13 +3233,22 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  	 * Both of these contain the full memory barrier required by
>>  	 * membarrier after storing to rq->curr, before returning to
>>  	 * user-space.
>> +	 *
>> +	 * If mm is NULL and oldmm is dying (!owner), we switch to
>> +	 * init_mm instead to make sure that oldmm can be freed ASAP.
>>  	 */
>> -	if (!mm) {
>> +	if (!mm && !mm_dying(oldmm)) {
>>  		next->active_mm = oldmm;
>>  		mmgrab(oldmm);
>>  		enter_lazy_tlb(oldmm, next);
>> -	} else
>> +	} else {
>> +		if (!mm) {
>> +			mm = &init_mm;
>> +			next->active_mm = mm;
>> +			mmgrab(mm);
>> +		}
>>  		switch_mm_irqs_off(oldmm, mm, next);
>> +	}
>>
>>  	if (!prev->mm) {
>>  		prev->active_mm = NULL;
> Bah, I see we _still_ haven't 'fixed' that code. And you're making an
> even bigger mess of it.
>
> Let me go find where that cleanup went.

It would be nice if there were a better solution.

Cheers,
Longman

2019-07-29 18:32:33

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 5:12 AM, Michal Hocko wrote:
> On Sat 27-07-19 13:10:47, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as the active_mm of kernel threads as long as no other user
>> tasks run on the CPUs that use it as active_mm. This prolongs the
>> lifetime of the dying mm, holding up memory and other resources, like
>> swap space, that cannot be freed.
> IIRC use_mm doesn't pin the address space. It only pins the mm_struct
> itself. So what exactly is the problem here?

As explained in my response to Peter, I found that resources like swap
space were depleted even after the exit of the offending program in a
mostly idle system. This patch is to make sure that those resources get
freed after program exit ASAP.

>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of an owning task only happens when CONFIG_MEMCG is
>> enabled. Without that, there is no simple way to determine the lifespan
>> of a given mm, so the code falls back to the old behavior.
> Please don't. We really want to remove mm->owner long term.

OK, if that is the case, I will need to find an alternative way to
determine if an mm is to be freed soon and perhaps set a flag to
indicate that.

Thanks,
Longman

2019-07-29 18:32:33

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 10:27 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:52:35AM +0200, Peter Zijlstra wrote:
>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>> It was found that a dying mm_struct where the owning task has exited
>>> can stay on as the active_mm of kernel threads as long as no other user
>>> tasks run on the CPUs that use it as active_mm. This prolongs the
>>> lifetime of the dying mm, holding up memory and other resources, like
>>> swap space, that cannot be freed.
>> Sure, but this has been so 'forever', why is it a problem now?
>>
>>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>>> if the previous active_mm is dying.
>>>
>>> The determination of a dying mm is based on the absence of an owning
>>> task. The selection of an owning task only happens when CONFIG_MEMCG is
>>> enabled. Without that, there is no simple way to determine the lifespan
>>> of a given mm, so the code falls back to the old behavior.
>>>
>>> Signed-off-by: Waiman Long <[email protected]>
>>> ---
>>>  include/linux/mm_types.h | 15 +++++++++++++++
>>>  kernel/sched/core.c      | 13 +++++++++++--
>>>  mm/init-mm.c             |  4 ++++
>>>  3 files changed, 30 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 3a37a89eb7a7..32712e78763c 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -623,6 +623,21 @@ static inline bool mm_tlb_flush_nested(struct mm_struct *mm)
>>>  	return atomic_read(&mm->tlb_flush_pending) > 1;
>>>  }
>>>
>>> +#ifdef CONFIG_MEMCG
>>> +/*
>>> + * A mm is considered dying if there is no owning task.
>>> + */
>>> +static inline bool mm_dying(struct mm_struct *mm)
>>> +{
>>> +	return !mm->owner;
>>> +}
>>> +#else
>>> +static inline bool mm_dying(struct mm_struct *mm)
>>> +{
>>> +	return false;
>>> +}
>>> +#endif
>>> +
>>>  struct vm_fault;
>> Yuck. So people without memcg will still suffer the terrible 'whatever
>> it is this patch fixes'.
> Also, why then not key off that owner tracking to free the resources
> (and leave the struct mm around) and avoid touching this scheduling
> hot-path?

The resources are pinned by the reference count. Making a special case
will certainly mess up the existing code.

It is actually a problem for systems that are mostly idle. Only the
kernel->kernel case needs to be updated. If the CPUs aren't busy running
user tasks, a little bit more overhead shouldn't really hurt IMHO.

Cheers,
Longman

2019-07-29 18:37:01

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 11:03 AM, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
>> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
>>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as the active_mm of kernel threads as long as no other user
>>>> tasks run on the CPUs that use it as active_mm. This prolongs the
>>>> lifetime of the dying mm, holding up memory and other resources, like
>>>> swap space, that cannot be freed.
>>> Sure, but this has been so 'forever', why is it a problem now?
>> I ran into this problem when running a test program that keeps on
>> allocating and touching memory until it eventually fails as the swap
>> space is full. After the failure, I could not rerun the test program
>> again because the swap space remained full. I finally tracked it down
>> to the fact that the mm stayed on as the active_mm of kernel threads. I
>> had to make sure that all the idle CPUs got a user task to run to bump
>> the dying mm off the active_mm of those CPUs, but this is just a
>> workaround, not a solution to this problem.
> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.
>
> Rik and Andy were working on getting that 'fixed' a while ago, not sure
> where that went.

Good, perhaps the right thing to do is for the idle->kernel case to keep
init_mm as the active_mm instead of reusing whatever was left behind the
last time around.

Cheers,
Longman

2019-07-29 18:54:35

by Rik van Riel

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, 2019-07-29 at 17:44 +0200, Peter Zijlstra wrote:
> On Mon, Jul 29, 2019 at 11:28:04AM -0400, Rik van Riel wrote:
> > On Mon, 2019-07-29 at 17:03 +0200, Peter Zijlstra wrote:
> >
> > > The 'sad' part is that x86 already switches to init_mm on idle
> > > and we only keep the active_mm around for 'stupid'.
> >
> > Wait, where do we do that?
>
> drivers/idle/intel_idle.c: leave_mm(cpu);
> drivers/acpi/processor_idle.c: acpi_unlazy_tlb(smp_processor_id());

This is only done for deeper c-states, isn't it?

> > > Rik and Andy were working on getting that 'fixed' a while ago,
> > > not sure where that went.
> >
> > My lazy TLB stuff got merged last year.
>
> Yes, but we never got around to getting rid of active_mm for x86,
> right?

True, we still use active_mm. Getting rid of the active_mm refcounting
altogether did not look entirely worth the hassle.

--
All Rights Reversed.



2019-07-29 19:24:08

by Peter Zijlstra

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon, Jul 29, 2019 at 11:22:16AM -0400, Waiman Long wrote:
> On 7/29/19 10:27 AM, Peter Zijlstra wrote:

> > Also, why then not key off that owner tracking to free the resources
> > (and leave the struct mm around) and avoid touching this scheduling
> > hot-path?
>
> The resources are pinned by the reference count. Making a special case
> will certainly mess up the existing code.
>
> It is actually a problem for systems that are mostly idle. Only the
> kernel->kernel case needs to be updated. If the CPUs aren't busy running
> user tasks, a little bit more overhead shouldn't really hurt IMHO.

But when you cannot find a new owner, you can start to strip mm_struct.
That is, what's stopping you from freeing swap reservations when that
happens?

That is, I think the moment mm_users drops to 0, you can destroy the
actual address space. But you have to keep mm_struct around until
mm_count goes to 0.

This is going by the comments with mmget() and mmgrab(); they forever
confuse me.
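For the record, the two counters (as the helpers in
include/linux/sched/mm.h define them, give or take version details):

	/*
	 * mm_users pins the address space: when it drops to zero,
	 * mmput() tears down VMAs, page tables and swap reservations.
	 *
	 * mm_count pins only the struct mm_struct itself; this is what
	 * a kthread's lazy active_mm reference takes. When it drops to
	 * zero, mmdrop() frees the structure.
	 */
	static inline void mmget(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_users);	/* address-space users */
	}

	static inline void mmgrab(struct mm_struct *mm)
	{
		atomic_inc(&mm->mm_count);	/* struct lifetime only */
	}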

2019-07-29 19:25:21

by Andy Lutomirski

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads



> On Jul 29, 2019, at 8:03 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, Jul 29, 2019 at 10:51:51AM -0400, Waiman Long wrote:
>>> On 7/29/19 4:52 AM, Peter Zijlstra wrote:
>>>> On Sat, Jul 27, 2019 at 01:10:47PM -0400, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as the active_mm of kernel threads as long as no other user
>>>> tasks run on the CPUs that use it as active_mm. This prolongs the
>>>> lifetime of the dying mm, holding up memory and other resources, like
>>>> swap space, that cannot be freed.
>>> Sure, but this has been so 'forever', why is it a problem now?
>>
>> I ran into this problem when running a test program that keeps on
>> allocating and touching memory until it eventually fails as the swap
>> space is full. After the failure, I could not rerun the test program
>> again because the swap space remained full. I finally tracked it down
>> to the fact that the mm stayed on as the active_mm of kernel threads. I
>> had to make sure that all the idle CPUs got a user task to run to bump
>> the dying mm off the active_mm of those CPUs, but this is just a
>> workaround, not a solution to this problem.
>
> The 'sad' part is that x86 already switches to init_mm on idle and we
> only keep the active_mm around for 'stupid'.
>
> Rik and Andy were working on getting that 'fixed' a while ago, not sure
> where that went.

I thought the current status was that we don't always switch to init_mm
on idle; instead we use a fancier and actually correct flushing routine
that only flushes idle CPUs when page tables are freed. I still think we
should be able to kill active_mm in favor of explicit refcounting in the
arch code.

2019-07-29 19:31:04

by Michal Hocko

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On Mon 29-07-19 11:27:35, Waiman Long wrote:
> On 7/29/19 5:12 AM, Michal Hocko wrote:
> > On Sat 27-07-19 13:10:47, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as the active_mm of kernel threads as long as no other user
> >> tasks run on the CPUs that use it as active_mm. This prolongs the
> >> lifetime of the dying mm, holding up memory and other resources, like
> >> swap space, that cannot be freed.
> > IIRC use_mm doesn't pin the address space. It only pins the mm_struct
> > itself. So what exactly is the problem here?
>
> As explained in my response to Peter, I found that resources like swap
> space were depleted even after the exit of the offending program in a
> mostly idle system. This patch is to make sure that those resources get
> freed after program exit ASAP.

Could you elaborate more? How can the mm counter (mm_count, not to be
confused with mm_users) prevent the address space from being torn down
on exit?


--
Michal Hocko
SUSE Labs

2019-07-29 21:09:03

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 4:18 AM, Qais Yousef wrote:
> On 07/27/19 13:10, Waiman Long wrote:
>> It was found that a dying mm_struct where the owning task has exited
>> can stay on as the active_mm of kernel threads as long as no other user
>> tasks run on the CPUs that use it as active_mm. This prolongs the
>> lifetime of the dying mm, holding up memory and other resources, like
>> swap space, that cannot be freed.
>>
>> Fix that by forcing the kernel threads to use init_mm as the active_mm
>> if the previous active_mm is dying.
>>
>> The determination of a dying mm is based on the absence of an owning
>> task. The selection of an owning task only happens when CONFIG_MEMCG is
>> enabled. Without that, there is no simple way to determine the lifespan
>> of a given mm, so the code falls back to the old behavior.
> I don't really know a lot about this code, but does the owner field have
> to depend on CONFIG_MEMCG? I.e., can't the owner always be set?
>
Yes, the owner field is only used and defined when CONFIG_MEMCG is on.

Cheers,
Longman

2019-07-29 21:36:16

by Qais Yousef

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 07/29/19 17:06, Waiman Long wrote:
> On 7/29/19 4:18 AM, Qais Yousef wrote:
> > On 07/27/19 13:10, Waiman Long wrote:
> >> It was found that a dying mm_struct where the owning task has exited
> >> can stay on as the active_mm of kernel threads as long as no other user
> >> tasks run on the CPUs that use it as active_mm. This prolongs the
> >> lifetime of the dying mm, holding up memory and other resources, like
> >> swap space, that cannot be freed.
> >>
> >> Fix that by forcing the kernel threads to use init_mm as the active_mm
> >> if the previous active_mm is dying.
> >>
> >> The determination of a dying mm is based on the absence of an owning
> >> task. The selection of an owning task only happens when CONFIG_MEMCG is
> >> enabled. Without that, there is no simple way to determine the lifespan
> >> of a given mm, so the code falls back to the old behavior.
> > I don't really know a lot about this code, but does the owner field have
> > to depend on CONFIG_MEMCG? I.e., can't the owner always be set?
> >
> Yes, the owner field is only used and defined when CONFIG_MEMCG is on.

I guess the simpler answer is that it's too much work to take it out of
CONFIG_MEMCG.

Anyway, given the direction of the thread this is moot now :-)

Thanks!

--
Qais Yousef

2019-07-30 07:16:39

by Waiman Long

Subject: Re: [PATCH v2] sched/core: Don't use dying mm as active_mm of kthreads

On 7/29/19 2:58 PM, Michal Hocko wrote:
> On Mon 29-07-19 11:27:35, Waiman Long wrote:
>> On 7/29/19 5:12 AM, Michal Hocko wrote:
>>> On Sat 27-07-19 13:10:47, Waiman Long wrote:
>>>> It was found that a dying mm_struct where the owning task has exited
>>>> can stay on as the active_mm of kernel threads as long as no other user
>>>> tasks run on the CPUs that use it as active_mm. This prolongs the
>>>> lifetime of the dying mm, holding up memory and other resources, like
>>>> swap space, that cannot be freed.
>>> IIRC use_mm doesn't pin the address space. It only pins the mm_struct
>>> itself. So what exactly is the problem here?
>> As explained in my response to Peter, I found that resources like swap
>> space were depleted even after the exit of the offending program in a
>> mostly idle system. This patch is to make sure that those resources get
>> freed after program exit ASAP.
> Could you elaborate more? How can the mm counter (mm_count, not to be
> confused with mm_users) prevent the address space from being torn down
> on exit?

Many of the resources tied to the mm_struct are indeed freed when
mm_users becomes 0, including the swap space reservation, I think. I was
testing an mm patch that had a missing-mmput bug, which caused mm_users
to never reach 0. I fixed the bug, and with the sched patch to speed up
the release of the mm_struct, everything was fine. I didn't realize that
fixing the mm bug was enough to free the swap space.

Still, there are some resources that are not freed while mm_count is
non-zero. It is certainly less serious than I had thought. Sorry for
the confusion.

Cheers,
Longman