Hi,
there is an mm_struct memory leak when using cpu hotplug. Apparently
start_secondary in smp.c initializes active_mm of the cpu's idle task
and increases init_mm's mm_count. But on cpu_die the idle task's
active_mm doesn't get dropped and therefore on the next cpu_up event
(->start_secondary) it gets overwritten and the result is a forgotten
reference count to whatever mm_struct was active when the cpu
was taken down previously.
The patch below should fix this for s390 (at least it works fine for
me), but I'm not sure if it's ok to call mmdrop from __cpu_die.
This very same leak exists for ppc64 as well.
Any opinions?
Thanks,
Heiko
diff -urN linux-2.6.10/arch/s390/kernel/smp.c linux-2.6.10-patched/arch/s390/kernel/smp.c
--- linux-2.6.10/arch/s390/kernel/smp.c 2004-12-24 22:35:50.000000000 +0100
+++ linux-2.6.10-patched/arch/s390/kernel/smp.c 2005-01-04 13:42:14.000000000 +0100
@@ -728,9 +728,14 @@
void
__cpu_die(unsigned int cpu)
{
+ struct task_struct *p;
+
/* Wait until target cpu is down */
while (!cpu_stopped(cpu));
printk("Processor %d spun down\n", cpu);
+ p = current_set[cpu];
+ mmdrop(p->active_mm);
+ p->active_mm = NULL;
}
void
On Tue, 2005-01-04 at 14:11 +0100, Heiko Carstens wrote:
> Hi,
>
> there is an mm_struct memory leak when using cpu hotplug. Apparently
> start_secondary in smp.c initializes active_mm of the cpu's idle task
> and increases init_mm's mm_count. But on cpu_die the idle task's
> active_mm doesn't get dropped and therefore on the next cpu_up event
> (->start_secondary) it gets overwritten and the result is a forgotten
> reference count to whatever mm_struct was active when the cpu
> was taken down previously.
>
> The patch below should fix this for s390 (at least it works fine for
> me), but I'm not sure if it's ok to call mmdrop from __cpu_die.
>
> This very same leak exists for ppc64 as well.
>
> Any opinions?
Wouldn't it be better to fix this in generic code instead of duplicating
it in each architecture? It looks like the same thing would occur on
ia64 also.
What about something like this? Tested on ppc64.
Index: 2.6.10/kernel/sched.c
===================================================================
--- 2.6.10.orig/kernel/sched.c 2004-12-24 21:35:24.000000000 +0000
+++ 2.6.10/kernel/sched.c 2005-01-05 01:48:47.520250232 +0000
@@ -4088,6 +4088,9 @@
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
+ /* Must manually drop reference to avoid leaking mm_structs. */
+ mmdrop(rq->idle->active_mm);
+
/* No need to migrate the tasks: it was best-effort if
* they didn't do lock_cpu_hotplug(). Just wake up
* the requestors. */
* Nathan Lynch <[email protected]> wrote:
> What about something like this? Tested on ppc64.
> migrate_nr_uninterruptible(rq);
> BUG_ON(rq->nr_running != 0);
>
> + /* Must manually drop reference to avoid leaking mm_structs. */
> + mmdrop(rq->idle->active_mm);
> +
> /* No need to migrate the tasks: it was best-effort if
> * they didn't do lock_cpu_hotplug(). Just wake up
> * the requestors. */
this doesn't look correct to me, because we might end up pulling the rug
(the pagetables) from under the idle task on that CPU. This can happen
in two ways: 1) there's no direct synchronization between a dead CPU
having called into cpu_die() and the downing CPU doing the mmdrop(), so
we might end up dropping it before the idle has entered the final loop
and is still executing kernel code, 2) even when the dead idle task is
already in its final loop there's no generic guarantee that an mmdrop()
can be done - e.g. on x86 the kernel pagetables are mixed up with the
user pagetables and an mmdrop() in case of lazy-TLB might end up zapping
the idle task's pagetables which might break in subtle ways.
the correct solution i think would be to call back into the scheduler
from cpu_die():
void cpu_die(void)
{
if (ppc_md.cpu_die)
ppc_md.cpu_die();
+ idle_task_exit();
local_irq_disable();
for (;;);
}
and then in idle_task_exit(), do something like:
void idle_task_exit(void)
{
struct mm_struct *mm = current->active_mm;
if (mm != &init_mm)
switch_mm(mm, &init_mm, current);
mmdrop(mm);
}
(completely untested.) This makes sure that the idle task uses the
init_mm (which always has valid pagetables), and also ensures correct
reference-counting. Hm?
Ingo
> the correct solution i think would be to call back into the scheduler
> from cpu_die():
>
> void cpu_die(void)
> {
> if (ppc_md.cpu_die)
> ppc_md.cpu_die();
> + idle_task_exit();
> local_irq_disable();
> for (;;);
> }
>
> and then in idle_task_exit(), do something like:
>
> void idle_task_exit(void)
> {
> struct mm_struct *mm = current->active_mm;
>
> if (mm != &init_mm)
> switch_mm(mm, &init_mm, current);
> mmdrop(mm);
> }
>
> (completely untested.) This makes sure that the idle task uses the
> init_mm (which always has valid pagetables), and also ensures correct
> reference-counting. Hm?
Looks good and works for me.
Thanks,
Heiko
On Wed, 2005-01-05 at 12:08 +0100, Ingo Molnar wrote:
> the correct solution i think would be to call back into the scheduler
> from cpu_die():
>
> void cpu_die(void)
> {
> if (ppc_md.cpu_die)
> ppc_md.cpu_die();
> + idle_task_exit();
> local_irq_disable();
> for (;;);
> }
>
> and then in idle_task_exit(), do something like:
>
> void idle_task_exit(void)
> {
> struct mm_struct *mm = current->active_mm;
>
> if (mm != &init_mm)
> switch_mm(mm, &init_mm, current);
> mmdrop(mm);
> }
>
> (completely untested.) This makes sure that the idle task uses the
> init_mm (which always has valid pagetables), and also ensures correct
> reference-counting. Hm?
OK, how's this? I'll submit the sched and ppc64 bits separately if
there are no objections. I assume Heiko can take care of s390.
Note that in the ppc64 cpu_die function we must call idle_task_exit
before calling ppc_md.cpu_die, because the latter does not return.
Nathan
Index: 2.6.10/include/linux/sched.h
===================================================================
--- 2.6.10.orig/include/linux/sched.h 2004-12-24 21:33:59.000000000 +0000
+++ 2.6.10/include/linux/sched.h 2005-01-05 14:30:39.000000000 +0000
@@ -735,6 +735,7 @@
#endif
extern void sched_idle_next(void);
+extern void idle_task_exit(void);
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(const task_t *p);
extern int task_nice(const task_t *p);
Index: 2.6.10/kernel/sched.c
===================================================================
--- 2.6.10.orig/kernel/sched.c 2004-12-24 21:35:24.000000000 +0000
+++ 2.6.10/kernel/sched.c 2005-01-05 14:36:40.000000000 +0000
@@ -3995,6 +3995,20 @@
spin_unlock_irqrestore(&rq->lock, flags);
}
+/* Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
+ */
+void idle_task_exit(void)
+{
+ struct mm_struct *mm = current->active_mm;
+
+ BUG_ON(cpu_online(smp_processor_id()));
+
+ if (mm != &init_mm)
+ switch_mm(mm, &init_mm, current);
+ mmdrop(mm);
+}
+
static void migrate_dead(unsigned int dead_cpu, task_t *tsk)
{
struct runqueue *rq = cpu_rq(dead_cpu);
Index: 2.6.10/arch/ppc64/kernel/setup.c
===================================================================
--- 2.6.10.orig/arch/ppc64/kernel/setup.c 2004-12-24 21:34:44.000000000 +0000
+++ 2.6.10/arch/ppc64/kernel/setup.c 2005-01-05 14:39:43.000000000 +0000
@@ -1337,6 +1337,7 @@
void cpu_die(void)
{
+ idle_task_exit();
if (ppc_md.cpu_die)
ppc_md.cpu_die();
local_irq_disable();
* Nathan Lynch <[email protected]> wrote:
> OK, how's this? I'll submit the sched and ppc64 bits separately if
> there are no objections. I assume Heiko can take care of s390.
>
> Note that in the ppc64 cpu_die function we must call idle_task_exit
> before calling ppc_md.cpu_die, because the latter does not return.
Signed-off-by: Ingo Molnar <[email protected]>
Ingo
Call idle_task_exit from cpu_die to avoid mm_struct leak.
Signed-off-by: Nathan Lynch <[email protected]>
---
diff -puN arch/ppc64/kernel/setup.c~ppc64-call-idle_task_exit-from-cpu_die arch/ppc64/kernel/setup.c
--- linux-2.6.10-bk10/arch/ppc64/kernel/setup.c~ppc64-call-idle_task_exit-from-cpu_die 2005-01-07 15:33:02.000000000 -0600
+++ linux-2.6.10-bk10-nathanl/arch/ppc64/kernel/setup.c 2005-01-07 15:33:23.000000000 -0600
@@ -1345,6 +1345,7 @@ early_param("xmon", early_xmon);
void cpu_die(void)
{
+ idle_task_exit();
if (ppc_md.cpu_die)
ppc_md.cpu_die();
local_irq_disable();
_
On Fri, 2005-01-07 at 05:43, Ingo Molnar wrote:
> * Nathan Lynch <[email protected]> wrote:
>
> > OK, how's this? I'll submit the sched and ppc64 bits separately if
> > there are no objections. I assume Heiko can take care of s390.
> >
> > Note that in the ppc64 cpu_die function we must call idle_task_exit
> > before calling ppc_md.cpu_die, because the latter does not return.
>
> Signed-off-by: Ingo Molnar <[email protected]>
Heiko Carstens figured out that offlining a cpu can leak mm_structs
because the dying cpu's idle task fails to switch to init_mm and
mmdrop its active_mm before the cpu is down. This patch introduces
idle_task_exit, which allows the idle task to do this as Ingo
suggested.
I will follow this up with a patch for ppc64 which calls
idle_task_exit from cpu_die.
Signed-off-by: Nathan Lynch <[email protected]>
---
diff -puN kernel/sched.c~sched-idle_task_exit kernel/sched.c
--- linux-2.6.10-bk10/kernel/sched.c~sched-idle_task_exit 2005-01-07 15:23:35.000000000 -0600
+++ linux-2.6.10-bk10-nathanl/kernel/sched.c 2005-01-07 15:23:35.000000000 -0600
@@ -3996,6 +3996,20 @@ void sched_idle_next(void)
spin_unlock_irqrestore(&rq->lock, flags);
}
+/* Ensures that the idle task is using init_mm right before its cpu goes
+ * offline.
+ */
+void idle_task_exit(void)
+{
+ struct mm_struct *mm = current->active_mm;
+
+ BUG_ON(cpu_online(smp_processor_id()));
+
+ if (mm != &init_mm)
+ switch_mm(mm, &init_mm, current);
+ mmdrop(mm);
+}
+
static void migrate_dead(unsigned int dead_cpu, task_t *tsk)
{
struct runqueue *rq = cpu_rq(dead_cpu);
diff -puN include/linux/sched.h~sched-idle_task_exit include/linux/sched.h
--- linux-2.6.10-bk10/include/linux/sched.h~sched-idle_task_exit 2005-01-07 15:23:35.000000000 -0600
+++ linux-2.6.10-bk10-nathanl/include/linux/sched.h 2005-01-07 15:23:35.000000000 -0600
@@ -757,6 +757,7 @@ extern void sched_exec(void);
#endif
extern void sched_idle_next(void);
+extern void idle_task_exit(void);
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(const task_t *p);
extern int task_nice(const task_t *p);
_