2007-05-11 09:07:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 0/7] Freezer bugfixes

Hi,

The following series of patches is a replacement of the freezer patches
recently dropped by Linus.

Each of the first six patches fixes specific issue related to the freezer, as
explained in the changelogs and in the comments within the patches. The
last patch is a cleanup.

Please consider them for merging.

Greetings,
Rafael


2007-05-11 09:07:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

From: Rafael J. Wysocki <[email protected]>

The reading of PF_BORROWED_MM in is_user_space() without task_lock() is racy.
Fix it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Pavel Machek <[email protected]>
---
kernel/power/process.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:23.000000000 +0200
+++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:28.000000000 +0200
@@ -8,6 +8,7 @@

#undef DEBUG

+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/suspend.h>
#include <linux/module.h>
@@ -88,7 +89,12 @@ static void cancel_freezing(struct task_

static inline int is_user_space(struct task_struct *p)
{
- return p->mm && !(p->flags & PF_BORROWED_MM);
+ int ret;
+
+ task_lock(p);
+ ret = p->mm && !(p->flags & PF_BORROWED_MM);
+ task_unlock(p);
+ return ret;
}

static unsigned int try_to_freeze_tasks(int freeze_user_space)

2007-05-11 09:07:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 2/7] Freezer: Close potential race between refrigerator and thaw_tasks

From: Rafael J. Wysocki <[email protected]>

If the freezing of tasks fails and a task is preempted in refrigerator()
before calling frozen_process(), then thaw_tasks() may run before this task is
frozen. In that case the task will freeze and no one will thaw it.

To fix this race we can call freezing(current) in refrigerator() along with
frozen_process(current) under the task_lock() which also should be taken in
the error path of try_to_freeze_tasks() as well as in thaw_process().
Moreover, if thaw_process() additionally clears TIF_FREEZE for tasks that are
not frozen, we can be sure that all tasks are thawed and there are no pending
"freeze" requests after thaw_tasks() has run.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Pavel Machek <[email protected]>
---
include/linux/freezer.h | 10 ++++++++++
kernel/power/process.c | 12 +++++++++++-
2 files changed, 21 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/freezer.h
===================================================================
--- linux-2.6.orig/include/linux/freezer.h 2007-05-10 21:44:23.000000000 +0200
+++ linux-2.6/include/linux/freezer.h 2007-05-10 21:44:31.000000000 +0200
@@ -37,14 +37,24 @@ static inline void do_not_freeze(struct

/*
* Wake up a frozen process
+ *
+ * task_lock() is taken to prevent the race with refrigerator() which may
+ * occur if the freezing of tasks fails. Namely, without the lock, if the
+ * freezing of tasks failed, thaw_tasks() might have run before a task in
+ * refrigerator() could call frozen_process(), in which case the task would be
+ * frozen and no one would thaw it.
*/
static inline int thaw_process(struct task_struct *p)
{
+ task_lock(p);
if (frozen(p)) {
p->flags &= ~PF_FROZEN;
+ task_unlock(p);
wake_up_process(p);
return 1;
}
+ clear_tsk_thread_flag(p, TIF_FREEZE);
+ task_unlock(p);
return 0;
}

Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:28.000000000 +0200
+++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:31.000000000 +0200
@@ -38,10 +38,18 @@ void refrigerator(void)
/* Hmm, should we be allowed to suspend when there are realtime
processes around? */
long save;
+
+ task_lock(current);
+ if (freezing(current)) {
+ frozen_process(current);
+ task_unlock(current);
+ } else {
+ task_unlock(current);
+ return;
+ }
save = current->state;
pr_debug("%s entered refrigerator\n", current->comm);

- frozen_process(current);
spin_lock_irq(&current->sighand->siglock);
recalc_sigpending(); /* We sent fake signal, clean it up */
spin_unlock_irq(&current->sighand->siglock);
@@ -158,10 +166,12 @@ static unsigned int try_to_freeze_tasks(
if (is_user_space(p) == !freeze_user_space)
continue;

+ task_lock(p);
if (freezeable(p) && !frozen(p))
printk(KERN_ERR " %s\n", p->comm);

cancel_freezing(p);
+ task_unlock(p);
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
}

2007-05-11 09:08:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 5/7] Freezer: Fix kthread_create vs freezer theoretical race

From: Oleg Nesterov <[email protected]>

kthread() sleeps in TASK_INTERRUPTIBLE state waiting for the first wakeup. In
theory, this wakeup may come from freeze_process()->signal_wake_up(), so the
task can disappear even before kthread_create() sets its ->comm.

Change kthread() to use TASK_UNINTERRUPTIBLE.

[[email protected]: s/BUG_ON/WARN_ON+recover]

Signed-off-by: Oleg Nesterov <[email protected]>
Acked-by: "Eric W. Biederman" <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
kernel/kthread.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/kthread.c
===================================================================
--- linux-2.6.orig/kernel/kthread.c 2007-05-10 21:44:23.000000000 +0200
+++ linux-2.6/kernel/kthread.c 2007-05-10 21:44:43.000000000 +0200
@@ -70,7 +70,7 @@ static int kthread(void *_create)
data = create->data;

/* OK, tell user we're spawned, wait for stop or wakeup */
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_UNINTERRUPTIBLE);
complete(&create->started);
schedule();

@@ -162,7 +162,10 @@ EXPORT_SYMBOL(kthread_create);
*/
void kthread_bind(struct task_struct *k, unsigned int cpu)
{
- BUG_ON(k->state != TASK_INTERRUPTIBLE);
+ if (k->state != TASK_UNINTERRUPTIBLE) {
+ WARN_ON(1);
+ return;
+ }
/* Must have done schedule() in kthread() before we set_task_cpu */
wait_task_inactive(k);
set_task_cpu(k, cpu);

2007-05-11 09:08:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 6/7] Freezer: Fix PF_NOFREEZE vs freezeable race

From: Gautham R Shenoy <[email protected]>

This patch fixes the race pointed out by Oleg Nesterov.

* Freezer marks a thread as freezeable.
* The thread now marks itself PF_NOFREEZE, but it will be frozen on
? on calling try_to_freeze(). Thus the task is frozen, even though it doesn't
? want to.
* Subsequent thaw_processes() will also fail to thaw the task since it is
? marked PF_NOFREEZE.

Avoid this problem by checking the task's PF_NOFREEZE status in
frozen_processes() before marking the task as frozen.

Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/freezer.h | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/freezer.h
===================================================================
--- linux-2.6.orig/include/linux/freezer.h 2007-05-10 21:44:36.000000000 +0200
+++ linux-2.6/include/linux/freezer.h 2007-05-10 21:44:47.000000000 +0200
@@ -63,8 +63,10 @@ static inline int thaw_process(struct ta
*/
static inline void frozen_process(struct task_struct *p)
{
- p->flags |= PF_FROZEN;
- wmb();
+ if (!unlikely(p->flags & PF_NOFREEZE)) {
+ p->flags |= PF_FROZEN;
+ wmb();
+ }
clear_tsk_thread_flag(p, TIF_FREEZE);
}


2007-05-11 09:08:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 7/7] Freezer: Move frozen_process to kernel/power/process.c

From: Gautham R Shenoy <[email protected]>

Other than refrigerator, no one else calls frozen_process(). So move it
from include/linux/freezer.h to kernel/power/process.c.

Also, since a task can be marked as frozen by itself, we don't need to pass
the (struct task_struct *p) parameter to frozen_process().

Signed-off-by: Gautham R Shenoy <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/freezer.h | 13 -------------
kernel/power/process.c | 14 +++++++++++++-
2 files changed, 13 insertions(+), 14 deletions(-)

Index: linux-2.6/include/linux/freezer.h
===================================================================
--- linux-2.6.orig/include/linux/freezer.h 2007-05-10 21:44:47.000000000 +0200
+++ linux-2.6/include/linux/freezer.h 2007-05-10 21:44:50.000000000 +0200
@@ -58,18 +58,6 @@ static inline int thaw_process(struct ta
return 0;
}

-/*
- * freezing is complete, mark process as frozen
- */
-static inline void frozen_process(struct task_struct *p)
-{
- if (!unlikely(p->flags & PF_NOFREEZE)) {
- p->flags |= PF_FROZEN;
- wmb();
- }
- clear_tsk_thread_flag(p, TIF_FREEZE);
-}
-
extern void refrigerator(void);
extern int freeze_processes(void);
extern void thaw_processes(void);
@@ -132,7 +120,6 @@ static inline int frozen(struct task_str
static inline int freezing(struct task_struct *p) { return 0; }
static inline void freeze(struct task_struct *p) { BUG(); }
static inline int thaw_process(struct task_struct *p) { return 1; }
-static inline void frozen_process(struct task_struct *p) { BUG(); }

static inline void refrigerator(void) {}
static inline int freeze_processes(void) { BUG(); return 0; }
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:40.000000000 +0200
+++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:50.000000000 +0200
@@ -32,6 +32,18 @@ static inline int freezeable(struct task
return 1;
}

+/*
+ * freezing is complete, mark current process as frozen
+ */
+static inline void frozen_process(void)
+{
+ if (!unlikely(current->flags & PF_NOFREEZE)) {
+ current->flags |= PF_FROZEN;
+ wmb();
+ }
+ clear_tsk_thread_flag(current, TIF_FREEZE);
+}
+
/* Refrigerator is place where frozen processes are stored :-). */
void refrigerator(void)
{
@@ -41,7 +53,7 @@ void refrigerator(void)

task_lock(current);
if (freezing(current)) {
- frozen_process(current);
+ frozen_process();
task_unlock(current);
} else {
task_unlock(current);

2007-05-11 09:09:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 4/7] Freezer: Take kernel_execve into consideration

From: Rafael J. Wysocki <[email protected]>

Kernel threads can become userland processes by calling kernel_execve(). In
particular, this may happen right after
try_to_freeze_tasks(FREEZER_USER_SPACE) has returned, so try_to_freeze_tasks()
needs to take userspace processes into consideration even if it is called with
FREEZER_KERNEL_THREADS.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Pavel Machek <[email protected]>
---
kernel/power/process.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:36.000000000 +0200
+++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:40.000000000 +0200
@@ -126,7 +126,7 @@ static unsigned int try_to_freeze_tasks(
cancel_freezing(p);
continue;
}
- if (is_user_space(p) == !freeze_user_space)
+ if (freeze_user_space && !is_user_space(p))
continue;

freeze_process(p);
@@ -153,7 +153,7 @@ static unsigned int try_to_freeze_tasks(
TIMEOUT / HZ, todo);
read_lock(&tasklist_lock);
do_each_thread(g, p) {
- if (is_user_space(p) == !freeze_user_space)
+ if (freeze_user_space && !is_user_space(p))
continue;

task_lock(p);

2007-05-11 09:09:24

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH 3/7] Freezer: Fix vfork problem

From: Rafael J. Wysocki <[email protected]>

Currently try_to_freeze_tasks() has to wait until all of the vforked processes
exit and for this reason every user can make it fail. To fix this problem we
can introduce the additional process flag PF_FREEZER_SKIP to be used by tasks
that do not want to be counted as freezable by the freezer and want to have
TIF_FREEZE set nevertheless. Then, this flag can be set by tasks using
sys_vfork() before they call wait_for_completion(&vfork) and cleared after
they have woken up. After clearing it, the tasks should call try_to_freeze()
as soon as possible.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
include/linux/freezer.h | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
include/linux/sched.h | 1 +
kernel/fork.c | 3 +++
kernel/power/process.c | 27 ++++++++-------------------
4 files changed, 58 insertions(+), 21 deletions(-)

Index: linux-2.6/include/linux/freezer.h
===================================================================
--- linux-2.6.orig/include/linux/freezer.h 2007-05-10 21:44:31.000000000 +0200
+++ linux-2.6/include/linux/freezer.h 2007-05-10 21:44:36.000000000 +0200
@@ -81,7 +81,49 @@ static inline int try_to_freeze(void)
return 0;
}

-extern void thaw_some_processes(int all);
+/*
+ * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
+ * calls wait_for_completion(&vfork) and reset right after it returns from this
+ * function. Next, the parent should call try_to_freeze() to freeze itself
+ * appropriately in case the child has exited before the freezing of tasks is
+ * complete. However, we don't want kernel threads to be frozen in unexpected
+ * places, so we allow them to block freeze_processes() instead or to set
+ * PF_NOFREEZE if needed and PF_FREEZER_SKIP is only set for userland vfork
+ * parents. Fortunately, in the ____call_usermodehelper() case the parent won't
+ * really block freeze_processes(), since ____call_usermodehelper() (the child)
+ * does a little before exec/exit and it can't be frozen before waking up the
+ * parent.
+ */
+
+/*
+ * If the current task is a user space one, tell the freezer not to count it as
+ * freezable.
+ */
+static inline void freezer_do_not_count(void)
+{
+ if (current->mm)
+ current->flags |= PF_FREEZER_SKIP;
+}
+
+/*
+ * If the current task is a user space one, tell the freezer to count it as
+ * freezable again and try to freeze it.
+ */
+static inline void freezer_count(void)
+{
+ if (current->mm) {
+ current->flags &= ~PF_FREEZER_SKIP;
+ try_to_freeze();
+ }
+}
+
+/*
+ * Check if the task should be counted as freezeable by the freezer
+ */
+static inline int freezer_should_skip(struct task_struct *p)
+{
+ return !!(p->flags & PF_FREEZER_SKIP);
+}

#else
static inline int frozen(struct task_struct *p) { return 0; }
@@ -96,5 +138,7 @@ static inline void thaw_processes(void)

static inline int try_to_freeze(void) { return 0; }

-
+static inline void freezer_do_not_count(void) {}
+static inline void freezer_count(void) {}
+static inline int freezer_should_skip(struct task_struct *p) { return 0; }
#endif
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h 2007-05-10 21:44:23.000000000 +0200
+++ linux-2.6/include/linux/sched.h 2007-05-10 21:44:36.000000000 +0200
@@ -1176,6 +1176,7 @@ static inline void put_task_struct(struc
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
+#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */

/*
* Only the _current_ task can read/write to tsk->flags, but other
Index: linux-2.6/kernel/fork.c
===================================================================
--- linux-2.6.orig/kernel/fork.c 2007-05-10 21:44:23.000000000 +0200
+++ linux-2.6/kernel/fork.c 2007-05-10 21:44:36.000000000 +0200
@@ -45,6 +45,7 @@
#include <linux/acct.h>
#include <linux/tsacct_kern.h>
#include <linux/cn_proc.h>
+#include <linux/freezer.h>
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>
#include <linux/random.h>
@@ -1403,7 +1404,9 @@ long do_fork(unsigned long clone_flags,
}

if (clone_flags & CLONE_VFORK) {
+ freezer_do_not_count();
wait_for_completion(&vfork);
+ freezer_count();
if (unlikely (current->ptrace & PT_TRACE_VFORK_DONE)) {
current->ptrace_message = nr;
ptrace_notify ((PTRACE_EVENT_VFORK_DONE << 8) | SIGTRAP);
Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:31.000000000 +0200
+++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:36.000000000 +0200
@@ -126,22 +126,12 @@ static unsigned int try_to_freeze_tasks(
cancel_freezing(p);
continue;
}
- if (is_user_space(p)) {
- if (!freeze_user_space)
- continue;
-
- /* Freeze the task unless there is a vfork
- * completion pending
- */
- if (!p->vfork_done)
- freeze_process(p);
- } else {
- if (freeze_user_space)
- continue;
+ if (is_user_space(p) == !freeze_user_space)
+ continue;

- freeze_process(p);
- }
- todo++;
+ freeze_process(p);
+ if (!freezer_should_skip(p))
+ todo++;
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
yield(); /* Yield is okay here */
@@ -167,7 +157,8 @@ static unsigned int try_to_freeze_tasks(
continue;

task_lock(p);
- if (freezeable(p) && !frozen(p))
+ if (freezeable(p) && !frozen(p) &&
+ !freezer_should_skip(p))
printk(KERN_ERR " %s\n", p->comm);

cancel_freezing(p);
@@ -216,9 +207,7 @@ static void thaw_tasks(int thaw_user_spa
if (is_user_space(p) == !thaw_user_space)
continue;

- if (!thaw_process(p))
- printk(KERN_WARNING " Strange, %s not stopped\n",
- p->comm );
+ thaw_process(p);
} while_each_thread(g, p);
read_unlock(&tasklist_lock);
}

2007-05-11 19:40:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Fri, 11 May 2007 00:36:25 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> The reading of PF_BORROWED_MM in is_user_space() without task_lock() is racy.
> Fix it.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> Acked-by: Pavel Machek <[email protected]>
> ---
> kernel/power/process.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/power/process.c
> ===================================================================
> --- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:23.000000000 +0200
> +++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:28.000000000 +0200
> @@ -8,6 +8,7 @@
>
> #undef DEBUG
>
> +#include <linux/sched.h>
> #include <linux/interrupt.h>
> #include <linux/suspend.h>
> #include <linux/module.h>
> @@ -88,7 +89,12 @@ static void cancel_freezing(struct task_
>
> static inline int is_user_space(struct task_struct *p)
> {
> - return p->mm && !(p->flags & PF_BORROWED_MM);
> + int ret;
> +
> + task_lock(p);
> + ret = p->mm && !(p->flags & PF_BORROWED_MM);
> + task_unlock(p);
> + return ret;
> }

The whole function is racy, isn't it? I mean, the condition which it is
testing can go from true->false or false->true at any instant after this
function returns its now-wrong value.

iow, callers of this function need to to something to prevent the expression
`p->mm && !(p->flags & PF_BORROWED_MM);' from changing value _anyway_. In
which case the new locking is not needed?

2007-05-11 20:21:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/11, Andrew Morton wrote:
>
> On Fri, 11 May 2007 00:36:25 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > static inline int is_user_space(struct task_struct *p)
> > {
> > - return p->mm && !(p->flags & PF_BORROWED_MM);
> > + int ret;
> > +
> > + task_lock(p);
> > + ret = p->mm && !(p->flags & PF_BORROWED_MM);
> > + task_unlock(p);
> > + return ret;
> > }
>
> The whole function is racy, isn't it? I mean, the condition which it is
> testing can go from true->false or false->true at any instant after this
> function returns its now-wrong value.
>
> iow, callers of this function need to to something to prevent the expression
> `p->mm && !(p->flags & PF_BORROWED_MM);' from changing value _anyway_. In
> which case the new locking is not needed?

freeze_processes() first freezes user-space tasks only, then kernel threads.

Without task_lock() we can miss PF_BORROWED_MM and count the kernel thread
(which is doing use_mm()) as a user space process. This means it will be
frozen prematurely.

true->false means daemonize() or do_exit(), seems harmless.

false->true means exec from kernel space. That is why FREEZER_KERNEL_THREADS
in fact means all tasks, not only kernel threads.

Oleg.

2007-05-11 20:36:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Friday, 11 May 2007 21:39, Andrew Morton wrote:
> On Fri, 11 May 2007 00:36:25 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > The reading of PF_BORROWED_MM in is_user_space() without task_lock() is racy.
> > Fix it.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > Acked-by: Pavel Machek <[email protected]>
> > ---
> > kernel/power/process.c | 8 +++++++-
> > 1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > Index: linux-2.6/kernel/power/process.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/power/process.c 2007-05-10 21:44:23.000000000 +0200
> > +++ linux-2.6/kernel/power/process.c 2007-05-10 21:44:28.000000000 +0200
> > @@ -8,6 +8,7 @@
> >
> > #undef DEBUG
> >
> > +#include <linux/sched.h>
> > #include <linux/interrupt.h>
> > #include <linux/suspend.h>
> > #include <linux/module.h>
> > @@ -88,7 +89,12 @@ static void cancel_freezing(struct task_
> >
> > static inline int is_user_space(struct task_struct *p)
> > {
> > - return p->mm && !(p->flags & PF_BORROWED_MM);
> > + int ret;
> > +
> > + task_lock(p);
> > + ret = p->mm && !(p->flags & PF_BORROWED_MM);
> > + task_unlock(p);
> > + return ret;
> > }
>
> The whole function is racy, isn't it? I mean, the condition which it is
> testing can go from true->false or false->true at any instant after this
> function returns its now-wrong value.
>
> iow, callers of this function need to to something to prevent the expression
> `p->mm && !(p->flags & PF_BORROWED_MM);' from changing value _anyway_. In
> which case the new locking is not needed?

For user space processes this condition is always true.

For kernel threads:
(1) the change of tsk->mm from NULL to a nonzero value is only made in
fs/aio.c:use_mm() along with the setting of PF_BORROWED_MM under
the task_lock(),
(2) the change of tsk->mm from a nonzero value to NULL is only made in
fs/aio.c:unuse_mm() along with the resetting of PF_BORROWED_MM
under the task_lock().
Therefore, by taking the task_lock() here we make sure that the condition
is alyways false when we check it for kernel threads.

2007-05-11 22:57:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way



On Fri, 11 May 2007, Rafael J. Wysocki wrote:
>
> For user space processes this condition is always true.
>
> For kernel threads:
> (1) the change of tsk->mm from NULL to a nonzero value is only made in
> fs/aio.c:use_mm() along with the setting of PF_BORROWED_MM under
> the task_lock(),
> (2) the change of tsk->mm from a nonzero value to NULL is only made in
> fs/aio.c:unuse_mm() along with the resetting of PF_BORROWED_MM
> under the task_lock().
> Therefore, by taking the task_lock() here we make sure that the condition
> is alyways false when we check it for kernel threads.

Why *test* it then and return anything?

Why not just doa "task_lock(p); task_unlock(p);" with no return value?

As it is, it sounds like either the code is buggy, or it's pointless.

Linus

2007-05-11 23:17:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 00:56, Linus Torvalds wrote:
>
> On Fri, 11 May 2007, Rafael J. Wysocki wrote:
> >
> > For user space processes this condition is always true.
> >
> > For kernel threads:
> > (1) the change of tsk->mm from NULL to a nonzero value is only made in
> > fs/aio.c:use_mm() along with the setting of PF_BORROWED_MM under
> > the task_lock(),
> > (2) the change of tsk->mm from a nonzero value to NULL is only made in
> > fs/aio.c:unuse_mm() along with the resetting of PF_BORROWED_MM
> > under the task_lock().
> > Therefore, by taking the task_lock() here we make sure that the condition
> > is alyways false when we check it for kernel threads.
>
> Why *test* it then and return anything?
>
> Why not just doa "task_lock(p); task_unlock(p);" with no return value?
>
> As it is, it sounds like either the code is buggy, or it's pointless.

I'm not sure what you mean.

We use this function (ie. kernel/power/process.c:is_user_space()) to
distinguish kernel threads from user space processes. Therefore we make it
always return true for user space processes and always return false for kernel
threads. In the latter case we need to use the task_lock() to ensure that the
result is as desired (ie. false), because otherwise it might be racing with
either fs/aio.c:use_mm() or fs/aio.c:unuse_mm().

Rafael

2007-05-11 23:20:32

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/11, Linus Torvalds wrote:
>
> On Fri, 11 May 2007, Rafael J. Wysocki wrote:
> >
> > Therefore, by taking the task_lock() here we make sure that the condition
> > is alyways false when we check it for kernel threads.
>
> Why *test* it then and return anything?
>
> Why not just doa "task_lock(p); task_unlock(p);" with no return value?

because we should not freeze a kernel thread at FREEZER_USER_SPACE stage?

without task_lock() we can see "p->mm != NULL" but not PF_BORROWED_MM.

Oleg.

2007-05-11 23:26:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Sat, 12 May 2007 01:22:06 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> On Saturday, 12 May 2007 00:56, Linus Torvalds wrote:
> >
> > On Fri, 11 May 2007, Rafael J. Wysocki wrote:
> > >
> > > For user space processes this condition is always true.
> > >
> > > For kernel threads:
> > > (1) the change of tsk->mm from NULL to a nonzero value is only made in
> > > fs/aio.c:use_mm() along with the setting of PF_BORROWED_MM under
> > > the task_lock(),
> > > (2) the change of tsk->mm from a nonzero value to NULL is only made in
> > > fs/aio.c:unuse_mm() along with the resetting of PF_BORROWED_MM
> > > under the task_lock().
> > > Therefore, by taking the task_lock() here we make sure that the condition
> > > is alyways false when we check it for kernel threads.
> >
> > Why *test* it then and return anything?
> >
> > Why not just doa "task_lock(p); task_unlock(p);" with no return value?
> >
> > As it is, it sounds like either the code is buggy, or it's pointless.
>
> I'm not sure what you mean.
>
> We use this function (ie. kernel/power/process.c:is_user_space()) to
> distinguish kernel threads from user space processes. Therefore we make it
> always return true for user space processes and always return false for kernel
> threads. In the latter case we need to use the task_lock() to ensure that the
> result is as desired (ie. false), because otherwise it might be racing with
> either fs/aio.c:use_mm() or fs/aio.c:unuse_mm().
>

ah, OK.

static void use_mm(struct mm_struct *mm)
{
struct mm_struct *active_mm;
struct task_struct *tsk = current;

task_lock(tsk);
tsk->flags |= PF_BORROWED_MM;
active_mm = tsk->active_mm;
atomic_inc(&mm->mm_count);
tsk->mm = mm;
tsk->active_mm = mm;
/*
* Note that on UML this *requires* PF_BORROWED_MM to be set, otherwise
* it won't work. Update it accordingly if you change it here
*/
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);

So is_user_space() requires that the state of p->mm and p->flags be
consistent: it doesn't want to be looking at those two things in that
three-statement window above.

Good changelogging and commenting save quite a bit of time and email.

2007-05-11 23:30:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way



On Sat, 12 May 2007, Rafael J. Wysocki wrote:
>
> We use this function (ie. kernel/power/process.c:is_user_space()) to
> distinguish kernel threads from user space processes. Therefore we make it
> always return true for user space processes and always return false for kernel
> threads. In the latter case we need to use the task_lock() to ensure that the
> result is as desired (ie. false), because otherwise it might be racing with
> either fs/aio.c:use_mm() or fs/aio.c:unuse_mm().

But there is no race protection in the *caller*, so if it can ever return
one or the other, what protects it from changing once the caller returns?

And if the value can change (because some thread uses "use_mm()"), then
the caller cannot rely on the value that got returned.

So you migt as well not return any value at all, since the returned value
is apparently meaningless once the lock has been released.

In other words: "The lock, it does nothing".

Linus

2007-05-11 23:33:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way



On Sat, 12 May 2007, Oleg Nesterov wrote:
>
> without task_lock() we can see "p->mm != NULL" but not PF_BORROWED_MM.

Let me explain it one more time:
- shouldn't the *caller* protect this?

Afaik, there's two situations:
- either things don't change (in which case you don't need locking at
all, since things are statically one way or the other)
- or things change (in which case the caller can't rely on the return
value anyway, since they might change *after* you release the lock)
ie what's up? Is there a third case?

Linus

2007-05-11 23:48:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/11, Linus Torvalds wrote:
>
> On Sat, 12 May 2007, Oleg Nesterov wrote:
> >
> > without task_lock() we can see "p->mm != NULL" but not PF_BORROWED_MM.
>
> Let me explain it one more time:
> - shouldn't the *caller* protect this?
>
> Afaik, there's two situations:
> - either things don't change (in which case you don't need locking at
> all, since things are statically one way or the other)
> - or things change (in which case the caller can't rely on the return
> value anyway, since they might change *after* you release the lock)

things change, ->mm is not stable if the kernel thread does use_mm/unuse_mm.

However, the return value == 0 does not change in that particular case,
exactly because is_user_space() takes task_lock().

Oleg.

2007-05-11 23:57:09

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 01:29, Linus Torvalds wrote:
>
> On Sat, 12 May 2007, Rafael J. Wysocki wrote:
> >
> > We use this function (ie. kernel/power/process.c:is_user_space()) to
> > distinguish kernel threads from user space processes. Therefore we make it
> > always return true for user space processes and always return false for kernel
> > threads. In the latter case we need to use the task_lock() to ensure that the
> > result is as desired (ie. false), because otherwise it might be racing with
> > either fs/aio.c:use_mm() or fs/aio.c:unuse_mm().
>
> But there is no race protection in the *caller*, so if it can ever return
> one or the other, what protects it from changing once the caller returns?
>
> And if the value can change (because some thread uses "use_mm()"), then
> the caller cannot rely on the value that got returned.

The value cannot change because of that. There only is a small window inside
unuse_mm() (or use_mm()) in which the value may be wrong. Namely:

static void unuse_mm(struct mm_struct *mm)
{
struct task_struct *tsk = current;

task_lock(tsk);
tsk->flags &= ~PF_BORROWED_MM;
---
--- If is_user_space() without the task_lock() is called right here, it will
--- return 'true', although it should return 'false'.
---
tsk->mm = NULL;
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
}

IOW, quoting Andrew, "is_user_space() requires that the state of p->mm and
p->flags be consistent".

> So you migt as well not return any value at all, since the returned value
> is apparently meaningless once the lock has been released.

No, it is not meaningless.

Rafael

2007-05-12 00:05:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

I hope Rafael will correct me if I am wrong,

On 05/12, Oleg Nesterov wrote:
>
> On 05/11, Linus Torvalds wrote:
> >
> > On Sat, 12 May 2007, Oleg Nesterov wrote:
> > >
> > > without task_lock() we can see "p->mm != NULL" but not PF_BORROWED_MM.
> >
> > Let me explain it one more time:
> > - shouldn't the *caller* protect this?
> >
> > Afaik, there's two situations:
> > - either things don't change (in which case you don't need locking at
> > all, since things are statically one way or the other)
> > - or things change (in which case the caller can't rely on the return
> > value anyway, since they might change *after* you release the lock)
>
> things change, ->mm is not stable if the kernel thread does use_mm/unuse_mm.
>
> However, the return value == 0 does not change in that particular case,
> exactly because is_user_space() takes task_lock().

Probably there is some misunderstanding. This patch doesn't claim it solves
all problems. Before this patch we have

static inline int is_user_space(struct task_struct *p)
{
return p->mm && !(p->flags & PF_BORROWED_MM);
}

and this is clearly racy wrt to use_mm() which sets this PF_BORROWED_MM bit.
So this is just a little improvement, nothing more.

Oleg.

2007-05-12 00:09:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way



On Sat, 12 May 2007, Oleg Nesterov wrote:
>
> things change, ->mm is not stable if the kernel thread does use_mm/unuse_mm.

->mm is not stable *regardless*!

Trivial examples:
- kernel thread does execve()
- user thread does exit().

The use "use_mm()" and "unuse_mm()" things are total red herrings.

If the freezer depends on the difference between user and kernel threads,
then THAT PATCH IS BUGGY. It's that simple. It tests something that simply
isn't stable outside the lock, and then returns that value after having
unlocked it.

It might as well return a random number.

> However, the return value == 0 does not change in that particular case,
> exactly because is_user_space() takes task_lock().

As does exit_mm() etc.

That's NOT THE POINT. You cannot use the end result after releasing the
task lock, because the moment you release the task lock, it becomes
totally irrelevant, and may not be true any more.

Example (a):
- you ask "is_user_space(p)", it returns 1.
- before you actually have time to do anything about it, the task exists,
and (since you don't hold the lock any more) will now have a NULL
tsk->mm again (and would now return 0 if you called it again).

Example (b):
- you ask "is_user_space(p)" and it returns 0, because it's a kernel
thread
- before you actually do anything about it (but after you released the
task lock), the kernel thread does an "execve(/sbin/hotplug)" and is no
longer a kernel thread.

In both cases will the caller have a return value THAT IS NO LONGER TRUE.

See? The locking was pointless. Exactly because you release the lock
before the user can actually do anything about the return value!

The fact that the locking protects against the very specific case of AIO
where the threads _stay_ user tasks and don't really change is pretty much
irrelevant, as far as I can see.

Anyway, I think the whole freezer thing is broken. There's no reason to
freeze kernel threads.

If you want to freeze user processes, go ahead. But then you need a lock
to make sure that new processes don't *become* user processes (ie you need
to disable hotplug).

And if you want to protect against cpufreq, do so. But don't try to say
that you want to freeze all kernel threads. Just protect against cpufreq
threads. Don't make all the other threads that have *zero* interest in
freezing have to worry about it.

Linus

2007-05-12 00:12:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 01:25, Andrew Morton wrote:
> On Sat, 12 May 2007 01:22:06 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > On Saturday, 12 May 2007 00:56, Linus Torvalds wrote:
> > >
> > > On Fri, 11 May 2007, Rafael J. Wysocki wrote:
> > > >
> > > > For user space processes this condition is always true.
> > > >
> > > > For kernel threads:
> > > > (1) the change of tsk->mm from NULL to a nonzero value is only made in
> > > > fs/aio.c:use_mm() along with the setting of PF_BORROWED_MM under
> > > > the task_lock(),
> > > > (2) the change of tsk->mm from a nonzero value to NULL is only made in
> > > > fs/aio.c:unuse_mm() along with the resetting of PF_BORROWED_MM
> > > > under the task_lock().
> > > > Therefore, by taking the task_lock() here we make sure that the condition
> > > > is alyways false when we check it for kernel threads.
> > >
> > > Why *test* it then and return anything?
> > >
> > > Why not just doa "task_lock(p); task_unlock(p);" with no return value?
> > >
> > > As it is, it sounds like either the code is buggy, or it's pointless.
> >
> > I'm not sure what you mean.
> >
> > We use this function (ie. kernel/power/process.c:is_user_space()) to
> > distinguish kernel threads from user space processes. Therefore we make it
> > always return true for user space processes and always return false for kernel
> > threads. In the latter case we need to use the task_lock() to ensure that the
> > result is as desired (ie. false), because otherwise it might be racing with
> > either fs/aio.c:use_mm() or fs/aio.c:unuse_mm().
> >
>
> ah, OK.
>
> static void use_mm(struct mm_struct *mm)
> {
> struct mm_struct *active_mm;
> struct task_struct *tsk = current;
>
> task_lock(tsk);
> tsk->flags |= PF_BORROWED_MM;
> active_mm = tsk->active_mm;
> atomic_inc(&mm->mm_count);
> tsk->mm = mm;
> tsk->active_mm = mm;
> /*
> * Note that on UML this *requires* PF_BORROWED_MM to be set, otherwise
> * it won't work. Update it accordingly if you change it here
> */
> switch_mm(active_mm, mm, tsk);
> task_unlock(tsk);
>
> So is_user_space() requires that the state of p->mm and p->flags be
> consistent: it doesn't want to be looking at those two things in that
> three-statement window above.
>
> Good changelogging and commenting save quite a bit of time and email.

Very true.

I have added a comment to the patch, so that we remeber why the task_lock()
is there. Please replace the original patch with this one (unless you think it's
worse ;-)).

---
From: Rafael J. Wysocki <[email protected]>

The reading of PF_BORROWED_MM in is_user_space() without task_lock() is racy.
Fix it.

Signed-off-by: Rafael J. Wysocki <[email protected]>
Acked-by: Pavel Machek <[email protected]>
---
kernel/power/process.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/power/process.c
===================================================================
--- linux-2.6.orig/kernel/power/process.c
+++ linux-2.6/kernel/power/process.c
@@ -8,6 +8,7 @@

#undef DEBUG

+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/suspend.h>
#include <linux/module.h>
@@ -88,7 +89,18 @@ static void cancel_freezing(struct task_

static inline int is_user_space(struct task_struct *p)
{
- return p->mm && !(p->flags & PF_BORROWED_MM);
+ int ret;
+
+ /*
+ * task_lock() is acquired to avoid evaluating the condition while the
+ * state of p->mm and p->flags is not consistent, which may happen,
+ * for example, if this function is executed in parallel with
+ * fs/aio.c:unuse_mm()
+ */
+ task_lock(p);
+ ret = p->mm && !(p->flags & PF_BORROWED_MM);
+ task_unlock(p);
+ return ret;
}

static unsigned int try_to_freeze_tasks(int freeze_user_space)

2007-05-12 00:41:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/11, Linus Torvalds wrote:
>
> On Sat, 12 May 2007, Oleg Nesterov wrote:
> >
> > things change, ->mm is not stable if the kernel thread does use_mm/unuse_mm.
>
> ->mm is not stable *regardless*!
>
> Trivial examples:
> - kernel thread does execve()
> - user thread does exit().

Yes sure. Quoting myself,
>
> true->false means daemonize() or do_exit(), seems harmless.
>
> false->true means exec from kernel space. That is why FREEZER_KERNEL_THREADS
> in fact means all tasks, not only kernel threads.
>

> The use "use_mm()" and "unuse_mm()" things are total red herrings.
>
> If the freezer depends on the difference between user and kernel threads,
> then THAT PATCH IS BUGGY. It's that simple.

This is another story, I can't comment because I am not educated enough.

However, in my opininon THAT PATCH has nothing to do with this problem.
It just improves the code that we already have.

> > However, the return value == 0 does not change in that particular case,
> > exactly because is_user_space() takes task_lock().
>
> As does exit_mm() etc.

Note the "in that particular case".

> See? The locking was pointless. Exactly because you release the lock
> before the user can actually do anything about the return value!

Yes. See the "Quoting myself" above.

> Anyway, I think the whole freezer thing is broken. There's no reason to
> freeze kernel threads.

It is not perfect. Rafael tries to improve it.

Do we need freezer? Should we freeze kernel threads? I can't judge. I tried
to read a long thread about suspend, and failed to understand it.

I personally think we can simplify things if CPU-hotplug use freezer, at least.

Oleg.

2007-05-12 01:01:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/12, Oleg Nesterov wrote:
>
> Do we need freezer? Should we freeze kernel threads? I can't judge. I tried
> to read a long thread about suspend, and failed to understand it.
>
> I personally think we can simplify things if CPU-hotplug use freezer, at least.

Just a small example,

debug_smp_processor_id:

/*
* Kernel threads bound to a single CPU can safely use
* smp_processor_id():
*/
this_mask = cpumask_of_cpu(this_cpu);

if (cpus_equal(current->cpus_allowed, this_mask))
goto out;

This is not true with CONFIG_HOTPLUG_CPU. This becomes true if we freeze
the kernel threads from CPU_DOWN_PREPARE to CPU_DEAD.

Oleg.

2007-05-12 01:07:20

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 02:08, Linus Torvalds wrote:
>
> On Sat, 12 May 2007, Oleg Nesterov wrote:
> >
> > things change, ->mm is not stable if the kernel thread does use_mm/unuse_mm.
>
> ->mm is not stable *regardless*!
>
> Trivial examples:
> - kernel thread does execve()
> - user thread does exit().
>
> The use "use_mm()" and "unuse_mm()" things are total red herrings.
>
> If the freezer depends on the difference between user and kernel threads,
> then THAT PATCH IS BUGGY. It's that simple. It tests something that simply
> isn't stable outside the lock, and then returns that value after having
> unlocked it.
>
> It might as well return a random number.
>
> > However, the return value == 0 does not change in that particular case,
> > exactly because is_user_space() takes task_lock().
>
> As does exit_mm() etc.
>
> That's NOT THE POINT. You cannot use the end result after releasing the
> task lock, because the moment you release the task lock, it becomes
> totally irrelevant, and may not be true any more.
>
> Example (a):
> - you ask "is_user_space(p)", it returns 1.
> - before you actually have time to do anything about it, the task exists,
> and (since you don't hold the lock any more) will now have a NULL
> tsk->mm again (and would now return 0 if you called it again).

In which case we won't be freezing this task at all.

> Example (b):
> - you ask "is_user_space(p)" and it returns 0, because it's a kernel
> thread
> - before you actually do anything about it (but after you released the
> task lock), the kernel thread does an "execve(/sbin/hotplug)" and is no
> longer a kernel thread.

This is a special case that needs special handling.

> In both cases will the caller have a return value THAT IS NO LONGER TRUE.
>
> See? The locking was pointless. Exactly because you release the lock
> before the user can actually do anything about the return value!
>
> The fact that the locking protects against the very specific case of AIO
> where the threads _stay_ user tasks and don't really change is pretty much
> irrelevant, as far as I can see.

Well, I disagree. We need the locking *exactly* to avoid situations in which
the threads don't really change, but we might think that they *have changed*.
More precisely, it's needed, because without it kernel threads which execute
use_mm()/unuse_mm() might be identified as user space processes, and that
would be wrong. The other cases are beyond the scope of this patch.

Rafael

2007-05-12 01:25:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way



On Sat, 12 May 2007, Oleg Nesterov wrote:
>
> However, in my opininon THAT PATCH has nothing to do with this problem.
> It just improves the code that we already have.

Sure.

However, I think it does it THE WRONG WAY, and doesn't actually fix the
much deeper problems with the freezer, as shown by the fact that the lock
is *still* broken for other cases.

So, here's a summary:

- we should not take the lock inside the function, because taking it
there is fundamentally wrong, and leaves all the *other* races in
place.

- if you actually want to solve the other races, the lock needs to be
taken by the caller, in which case taking it in the callee is obviously
(again) wrong.

- or then, we accept that the race wasn't fixed AT ALL, and you add other
code to _other_ places to handle the case where you froze the wrong
thread (or didn't freeze the right one).

And I'm not making that up. Look at most of the other patches in that
series: they are _exactly_ about the scenario I'm outlining.

- the whole "kernel thread vs user thread" thing is the wrong thing to
check in the first place, since we just should never touch kernel
threads in the first place, and anything that wants to freeze user
space should have disabled exec_usermodehelper() at a higher level

That's why I'm so unhappy. The "fix" is going in the wrong direction. Each
fix on their own may be an "improvement", but the end result of many of
the fixes is a total mess!

We can continue to add bandaids to something broken, until it "works". But
the end result, while "working", is not actually any better. Quite the
reverse - the end result of something like that is that you add all these
magic rules and special cases.

So in the end one ugly design decision leads to broken locking, which in
turn leads to other cases where you add more broken code, which just leads
to a situation where nobody actually understands what the *design* is,
because there simply *isn't* any design - it's just a hodge-podge of "but
this fixes a bug" ad-hoc "fixes".

Linus

Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Sat, May 12, 2007 at 02:01:41AM +0200, Rafael J. Wysocki wrote:
>
> > So you migt as well not return any value at all, since the returned value
> > is apparently meaningless once the lock has been released.
>
> No, it is not meaningless.

Right.

Agreed that the returned value might not necessarily reflect the
current state of thread in question. Can it pose any problems?

Case : is_user_space() returns 0:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We think the thread is a kernel thread.
So we don't count this thread when the case is FREEZER_USER_SPACE.

Now this thread can perform an execve() and enter the
userspace. It might not freeze.
So we would be declaring that all the userspace threads have been frozen,
when actually we might have this one li'l unfrozen devil from the
userspace.

Do we care?
For cpu hotplug, I don't think so.

Because, like Oleg pointed out, we know we'll anyway freeze all the
leftover threads while handling the case FREEZER_KERNEL_THREADS.

But I am not sure if this is the case with suspend/hibernate, since we
need to do a sys_sync() between try_freeze_tasks(FREEZE_USER_SPACE) and
try_to_freeze_tasks(FREEZE_KERNEL_THREADS).


Case:is_user_space() returns 1:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We think the thread is from userspace.
We count this thread when the case is FREEZER_USER_SPACE. Now the
thread enters the kernel space by either doing a daemonize() or exit().

The freezer code loops *atleast* twice before declaring the system
frozen. So if this metamorphosis occurs between iteration i and
iteration i+1, in i+1, we note that this thread is now a kernel thread,
and don't count it anymore.

However, between the 'i'th and 'i+1'th interation, if this thread calls
try_to_freeze(), it'll enter the refrigerator. We now have a cold
kernel thread when we were trying to freeze only userspace.

So, the question should be, are we clearing the TIF_FREEZING flag in places
where the thread can become a kernel thread and eventually call try_to_freeze.
I won't expect an exiting thread to call try_to_freeze, but a
daemonize()ed thread can.

So should we perform that check in reparent_to_kthreadd() ?
We are protected by the tasklist_lock there, no?

>
> Rafael

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-05-12 08:57:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 03:24, Linus Torvalds wrote:
>
> On Sat, 12 May 2007, Oleg Nesterov wrote:
> >
> > However, in my opininon THAT PATCH has nothing to do with this problem.
> > It just improves the code that we already have.
>
> Sure.
>
> However, I think it does it THE WRONG WAY, and doesn't actually fix the
> much deeper problems with the freezer, as shown by the fact that the lock
> is *still* broken for other cases.

The other cases don't lead to the specific issue this patch is meant to
prevent. Namely, that if a kernel thread is identified as a user space task by
the freezer it may be frozen prematurely.

And yes, this only is a problem because we freeze kernel threads, which may be
avoidable, but we've been doing that for more than two years and it's a big
change to stop doing so. We can't just say overnight that we won't be freezing
kernel threads from now on without al least checking if that doesn't lead to
user-visible problems.

Thus IMO it's reasonable to fix the potential issue with the current code and
think about changing the design *later*. Still, I'm not so attached to this
patch and if you think that it should be dropped (which IMO is wrong), then so
be it.

That said:

> So, here's a summary:
>
> - we should not take the lock inside the function, because taking it
> there is fundamentally wrong, and leaves all the *other* races in
> place.

The other races don't lead to the same (wrong) result.

> - if you actually want to solve the other races, the lock needs to be
> taken by the caller, in which case taking it in the callee is obviously
> (again) wrong.
>
> - or then, we accept that the race wasn't fixed AT ALL, and you add other
> code to _other_ places to handle the case where you froze the wrong
> thread (or didn't freeze the right one).

There are no other cases, AFAICS, in which we can freeze a wrong thead.
That's, in fact, the point here. The not freezing of the right one (when a
kernel thread does execve() and becomes a user land process) is a totally
separate issue unrelated to this patch.

> And I'm not making that up. Look at most of the other patches in that
> series: they are _exactly_ about the scenario I'm outlining.
>
> - the whole "kernel thread vs user thread" thing is the wrong thing to
> check in the first place, since we just should never touch kernel
> threads in the first place, and anything that wants to freeze user
> space should have disabled exec_usermodehelper() at a higher level

Very well, but then, how are we supposed to know which is a kernel thread so
that we won't touch it?

> That's why I'm so unhappy. The "fix" is going in the wrong direction. Each
> fix on their own may be an "improvement", but the end result of many of
> the fixes is a total mess!
>
> We can continue to add bandaids to something broken, until it "works". But
> the end result, while "working", is not actually any better. Quite the
> reverse - the end result of something like that is that you add all these
> magic rules and special cases.
>
> So in the end one ugly design decision leads to broken locking, which in
> turn leads to other cases where you add more broken code, which just leads
> to a situation where nobody actually understands what the *design* is,
> because there simply *isn't* any design - it's just a hodge-podge of "but
> this fixes a bug" ad-hoc "fixes".

You have already said for several times that in your opinion kernel threads
should not be frozen (for suspend, but I'm not sure about the CPU hotplug).
As far as I'm concerned, that's sufficient. However, the reality is that we do
freeze kernel threads and we can't just stop doing that overnight. At least,
we need to check what problems that will lead to before we decide to do this in
a stable kernel.

Greetings,
Rafael

2007-05-12 09:23:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 10:16, Gautham R Shenoy wrote:
> On Sat, May 12, 2007 at 02:01:41AM +0200, Rafael J. Wysocki wrote:
> >
> > > So you migt as well not return any value at all, since the returned value
> > > is apparently meaningless once the lock has been released.
> >
> > No, it is not meaningless.
>
> Right.
>
> Agreed that the returned value might not necessarily reflect the
> current state of thread in question. Can it pose any problems?
>
> Case : is_user_space() returns 0:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> We think the thread is a kernel thread.
> So we don't count this thread when the case is FREEZER_USER_SPACE.
>
> Now this thread can perform an execve() and enter the
> userspace. It might not freeze.
> So we would be declaring that all the userspace threads have been frozen,
> when actually we might have this one li'l unfrozen devil from the
> userspace.
>
> Do we care?
> For cpu hotplug, I don't think so.
>
> Because, like Oleg pointed out, we know we'll anyway freeze all the
> leftover threads while handling the case FREEZER_KERNEL_THREADS.
>
> But I am not sure if this is the case with suspend/hibernate, since we
> need to do a sys_sync() between try_freeze_tasks(FREEZE_USER_SPACE) and
> try_to_freeze_tasks(FREEZE_KERNEL_THREADS).

>From the point of view of syncing it's only necessary to make sure that we
won't freeze a kernel thread that's needed for the syncing. We can have an
additional user space task running at this point.

> Case:is_user_space() returns 1:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> We think the thread is from userspace.
> We count this thread when the case is FREEZER_USER_SPACE. Now the
> thread enters the kernel space by either doing a daemonize() or exit().
>
> The freezer code loops *atleast* twice before declaring the system
> frozen. So if this metamorphosis occurs between iteration i and
> iteration i+1, in i+1, we note that this thread is now a kernel thread,
> and don't count it anymore.
>
> However, between the 'i'th and 'i+1'th interation, if this thread calls
> try_to_freeze(), it'll enter the refrigerator. We now have a cold
> kernel thread when we were trying to freeze only userspace.

Hmm, I'm not sure if the task can call try_to_freeze() after doing exit().
May it happen?

> So, the question should be, are we clearing the TIF_FREEZING flag in places
> where the thread can become a kernel thread and eventually call try_to_freeze.
> I won't expect an exiting thread to call try_to_freeze, but a
> daemonize()ed thread can.
>
> So should we perform that check in reparent_to_kthreadd() ?
> We are protected by the tasklist_lock there, no?

Yes. Still, I think the daemonize()ed threads should clear their TIF_FREEZE
flag unconditionally right after they have called exit_mm(). So that would be
in daemonize().

Or, perhaps, it's better to clear TIF_FREEZE (unconditionally) in exit_mm(),
after we've done tsk->mm = NULL? Oleg, what do you think?

Greetings,
Rafael

Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Sat, May 12, 2007 at 11:27:36AM +0200, Rafael J. Wysocki wrote:
> On Saturday, 12 May 2007 10:16, Gautham R Shenoy wrote:
> >
> > But I am not sure if this is the case with suspend/hibernate, since we
> > need to do a sys_sync() between try_freeze_tasks(FREEZE_USER_SPACE) and
> > try_to_freeze_tasks(FREEZE_KERNEL_THREADS).
>
> From the point of view of syncing it's only necessary to make sure that we
> won't freeze a kernel thread that's needed for the syncing. We can have an
> additional user space task running at this point.

Ok. Say we're might have an additional user space task which is not
frozen (say A).

> >
> > So should we perform that check in reparent_to_kthreadd() ?
> > We are protected by the tasklist_lock there, no?
>
> Yes. Still, I think the daemonize()ed threads should clear their TIF_FREEZE
> flag unconditionally right after they have called exit_mm(). So that would be
> in daemonize().
>
> Or, perhaps, it's better to clear TIF_FREEZE (unconditionally) in exit_mm(),
> after we've done tsk->mm = NULL? Oleg, what do you think?
>
Is the following scenario possible?

FREEZE_KERNEL_THREADS:
1) Mark all leftover threads as freezeable. That would include 'A'.
2) 'A' is now daemonised and we clear TIF_FREEZE in exit_mm().
3) 'A' calls try_to_freeze() but doesn't enter the refrigerator.

Hmm, on second thought, this shouldn't matter .
The subsequent iteration will set A's TIF_FREEZE flag anyway, right?
So I think it should be ok to unconditionally clear the TIF_FREEZE flag
in exit_mm() after tsk->mm = NULL.


> Greetings,
> Rafael

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-05-12 10:37:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 12:13, Gautham R Shenoy wrote:
> On Sat, May 12, 2007 at 11:27:36AM +0200, Rafael J. Wysocki wrote:
> > On Saturday, 12 May 2007 10:16, Gautham R Shenoy wrote:
> > >
> > > But I am not sure if this is the case with suspend/hibernate, since we
> > > need to do a sys_sync() between try_freeze_tasks(FREEZE_USER_SPACE) and
> > > try_to_freeze_tasks(FREEZE_KERNEL_THREADS).
> >
> > From the point of view of syncing it's only necessary to make sure that we
> > won't freeze a kernel thread that's needed for the syncing. We can have an
> > additional user space task running at this point.
>
> Ok. Say we're might have an additional user space task which is not
> frozen (say A).
>
> > >
> > > So should we perform that check in reparent_to_kthreadd() ?
> > > We are protected by the tasklist_lock there, no?
> >
> > Yes. Still, I think the daemonize()ed threads should clear their TIF_FREEZE
> > flag unconditionally right after they have called exit_mm(). So that would be
> > in daemonize().
> >
> > Or, perhaps, it's better to clear TIF_FREEZE (unconditionally) in exit_mm(),
> > after we've done tsk->mm = NULL? Oleg, what do you think?
> >
> Is the following scenario possible?
>
> FREEZE_KERNEL_THREADS:
> 1) Mark all leftover threads as freezeable. That would include 'A'.
> 2) 'A' is now daemonised and we clear TIF_FREEZE in exit_mm().
> 3) 'A' calls try_to_freeze() but doesn't enter the refrigerator.
>
> Hmm, on second thought, this shouldn't matter .
> The subsequent iteration will set A's TIF_FREEZE flag anyway, right?
> So I think it should be ok to unconditionally clear the TIF_FREEZE flag
> in exit_mm() after tsk->mm = NULL.

Yes.

Still, the following scenario is possible while we're freezing users space
tasks:

(1) user space task calls daemonize()
(2) freezer checks if this is a user space task and the test returns 'true'
(3) task calls exit_mm() and clears its TIF_FREEZE
(4) freezer sets TIF_FREEZE for the task
(5) task calls try_to_freeze() and freezes itself (bad!)

To prevent this from happening, I think, we should acquire task_lock() around
the entire block in which the test is made and TIF_FREEZE is set for the task,
so something more sophisticated than
freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch is needed.

Well, I think we should ask Andrew to drop this patch and try to address the
issue in the next series of patches.

Greetings,
Rafael

2007-05-12 10:40:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 11:01, Rafael J. Wysocki wrote:
> On Saturday, 12 May 2007 03:24, Linus Torvalds wrote:
> >
> > On Sat, 12 May 2007, Oleg Nesterov wrote:
> > >
> > > However, in my opininon THAT PATCH has nothing to do with this problem.
> > > It just improves the code that we already have.
> >
> > Sure.
> >
> > However, I think it does it THE WRONG WAY, and doesn't actually fix the
> > much deeper problems with the freezer, as shown by the fact that the lock
> > is *still* broken for other cases.
>
> The other cases don't lead to the specific issue this patch is meant to
> prevent. Namely, that if a kernel thread is identified as a user space task by
> the freezer it may be frozen prematurely.
>
> And yes, this only is a problem because we freeze kernel threads, which may be
> avoidable, but we've been doing that for more than two years and it's a big
> change to stop doing so. We can't just say overnight that we won't be freezing
> kernel threads from now on without al least checking if that doesn't lead to
> user-visible problems.
>
> Thus IMO it's reasonable to fix the potential issue with the current code and
> think about changing the design *later*. Still, I'm not so attached to this
> patch and if you think that it should be dropped (which IMO is wrong), then so
> be it.

Sorry, I was wrong, because ...

> That said:
>
> > So, here's a summary:
> >
> > - we should not take the lock inside the function, because taking it
> > there is fundamentally wrong, and leaves all the *other* races in
> > place.
>
> The other races don't lead to the same (wrong) result.
>
> > - if you actually want to solve the other races, the lock needs to be
> > taken by the caller, in which case taking it in the callee is obviously
> > (again) wrong.
> >
> > - or then, we accept that the race wasn't fixed AT ALL, and you add other
> > code to _other_ places to handle the case where you froze the wrong
> > thread (or didn't freeze the right one).
>
> There are no other cases, AFAICS, in which we can freeze a wrong thead.

... user space tasks that call deamonize() can also be frozen prematurely.
We didn't take this possibility into consideration before, which was obviously
wrong.

Rafael

Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Sat, May 12, 2007 at 12:41:54PM +0200, Rafael J. Wysocki wrote:
>
> Still, the following scenario is possible while we're freezing users space
> tasks:
>
> (1) user space task calls daemonize()
> (2) freezer checks if this is a user space task and the test returns 'true'
> (3) task calls exit_mm() and clears its TIF_FREEZE
> (4) freezer sets TIF_FREEZE for the task
> (5) task calls try_to_freeze() and freezes itself (bad!)
>
> To prevent this from happening, I think, we should acquire task_lock() around
> the entire block in which the test is made and TIF_FREEZE is set for the task,
> so something more sophisticated than
> freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch is needed.
>

Hmmm, turns out Linus was right, after all! The caller needs to acquire
the task_lock().

> Well, I think we should ask Andrew to drop this patch and try to address the
> issue in the next series of patches.

I think it's a good idea.

I would want to review the patches again. The more I look at them,
the better I seem to understand the subtleties in the freezer code.

>
> Greetings,
> Rafael

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-05-12 11:29:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 12:52, Gautham R Shenoy wrote:
> On Sat, May 12, 2007 at 12:41:54PM +0200, Rafael J. Wysocki wrote:
> >
> > Still, the following scenario is possible while we're freezing users space
> > tasks:
> >
> > (1) user space task calls daemonize()
> > (2) freezer checks if this is a user space task and the test returns 'true'
> > (3) task calls exit_mm() and clears its TIF_FREEZE
> > (4) freezer sets TIF_FREEZE for the task
> > (5) task calls try_to_freeze() and freezes itself (bad!)
> >
> > To prevent this from happening, I think, we should acquire task_lock() around
> > the entire block in which the test is made and TIF_FREEZE is set for the task,
> > so something more sophisticated than
> > freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch is needed.
> >
>
> Hmmm, turns out Linus was right, after all! The caller needs to acquire
> the task_lock().
>
> > Well, I think we should ask Andrew to drop this patch and try to address the
> > issue in the next series of patches.
>
> I think it's a good idea.
>
> I would want to review the patches again. The more I look at them,
> the better I seem to understand the subtleties in the freezer code.

Okay, I'll put the entire series on the web later today and I'll let you know
when it's ready.

Andrew, could you please drop
freezer-read-pf_borrowed_mm-in-a-nonracy-way.patch?

I believe that the other six freezer patches currently in -mm are correct.

Greetings,
Rafael

2007-05-12 14:18:26

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/12, Rafael J. Wysocki wrote:
>
> ... user space tasks that call deamonize() can also be frozen prematurely.
> We didn't take this possibility into consideration before, which was obviously
> wrong.

No, no, sorry for the confusion. User space tasks never call deamonize().

Kernel threads call daemonize, because when we are doing kernel_thread()
on behalf of user-space task, the new kernel thread (child) shares its
->mm with the caller (parent). So it is considered as "is_user_space()"
until it does daemonize().

Definitely, is_user_space() should have another name.

When a user space task exits, it does exit_mm() and becomes "a kernel thread"
from the freezer POV. In its current from, freezer can do nothing with this.
The exiting task won't call try_to_freeze() after that, so try_to_freeze_tasks()
will wait until it dissapears (actually, until it calls exit_notify(), note
the ->exit_state check in freezeable()).

I do not think we can improve things if exit_mm() clears TIF_FREEZING.
We should clear TIF_FREEZING when we set PF_NOFREEZE, I think. This was
discussed before iirc, but I forgot the result.

Oleg.

2007-05-12 14:26:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/12, Rafael J. Wysocki wrote:
>
> Hmm, I'm not sure if the task can call try_to_freeze() after doing exit().
> May it happen?

It can, note the do_exit()->ptrace_notify(). But this doesn't matter, I think.
>From the freezer POV an exiting task has 2 "interesting" events

exit_mm() ->mm becomes NULL

exit_notify() ->exit_state != 0, and perhaps the task disappears
from global process list

> Or, perhaps, it's better to clear TIF_FREEZE (unconditionally) in exit_mm(),
> after we've done tsk->mm = NULL? Oleg, what do you think?

Please see another message.

Oleg.

2007-05-12 14:59:37

by Oleg Nesterov

[permalink] [raw]
Subject: migrate_dead_tasks() vs sleep-after-exit_notify() problems?

On 05/12, Oleg Nesterov wrote:
>
> exit_notify() ->exit_state != 0, and perhaps the task disappears
> from global process list

This, btw, means that do_exit()->__free_pipe_info() is not cpu-hotplug friendly.

The task may sleep on mutex_lock(->i_mutex), and dequeued from rq->arrays[].

The parent reaps this task, or it was TASK_DEAD and reaped itself.

CPU_DEAD comes, migration_call(CPU_DEAD) can do nothing with this task:

- we already did release_task()->__unhash_process(), so
migrate_live_tasks() can't see it

- migrate_dead_tasks() can't see it because it was deactivated.

No?

probably __might_sleep() should also check __exit_state.

Oleg.

2007-05-12 16:30:42

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 16:18, Oleg Nesterov wrote:
> On 05/12, Rafael J. Wysocki wrote:
> >
> > ... user space tasks that call deamonize() can also be frozen prematurely.
> > We didn't take this possibility into consideration before, which was obviously
> > wrong.
>
> No, no, sorry for the confusion. User space tasks never call deamonize().
>
> Kernel threads call daemonize, because when we are doing kernel_thread()
> on behalf of user-space task, the new kernel thread (child) shares its
> ->mm with the caller (parent). So it is considered as "is_user_space()"
> until it does daemonize().

Ah, I see. We spawn a kernel thread from a code path that belongs to a
user space task and we need to call deamonize() to make it become a
'real' kernel thread.

Still, this means that is_user_space() may return 'true' for this thread
before it calls daemonize() and then the scenario described by me in the
previous message may occur. It seems.

> Definitely, is_user_space() should have another name.

Well, I have no (better) idea ...

> When a user space task exits, it does exit_mm() and becomes "a kernel thread"
> from the freezer POV. In its current from, freezer can do nothing with this.
> The exiting task won't call try_to_freeze() after that, so try_to_freeze_tasks()
> will wait until it dissapears (actually, until it calls exit_notify(), note
> the ->exit_state check in freezeable()).
>
> I do not think we can improve things if exit_mm() clears TIF_FREEZING.

I think we can avoid the above scenario (in which a kernel thread that has
called daemonize() "too late" ends up with TIF_FREEZE set and goes to
refrigerator() while we're freezing the user space).

> We should clear TIF_FREEZING when we set PF_NOFREEZE, I think. This was
> discussed before iirc, but I forgot the result.

It's in freezer-fix-pf_nofreeze-vs-freezeable-race.patch (appended for
convenience, white space may be broken).

Greetings,
Rafael

---
--- a/include/linux/freezer.h~freezer-fix-pf_nofreeze-vs-freezeable-race
+++ a/include/linux/freezer.h
@@ -63,8 +63,10 @@ static inline int thaw_process(struct ta
? */
?static inline void frozen_process(struct task_struct *p)
?{
-???????p->flags |= PF_FROZEN;
-???????wmb();
+???????if (!unlikely(p->flags & PF_NOFREEZE)) {
+???????????????p->flags |= PF_FROZEN;
+???????????????wmb();
+???????}
????????clear_tsk_thread_flag(p, TIF_FREEZE);
?}
?

2007-05-12 16:59:30

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On 05/12, Rafael J. Wysocki wrote:
>
> Ah, I see. We spawn a kernel thread from a code path that belongs to a
> user space task and we need to call deamonize() to make it become a
> 'real' kernel thread.
>
> Still, this means that is_user_space() may return 'true' for this thread
> before it calls daemonize() and then the scenario described by me in the
> previous message may occur. It seems.

Yes sure. Probably not so bad in practice. Most likely this fresh thread
is not "important" and could be freezed, I dunno.

> It's in freezer-fix-pf_nofreeze-vs-freezeable-race.patch (appended for
> convenience, white space may be broken).
>
> ---
> --- a/include/linux/freezer.h~freezer-fix-pf_nofreeze-vs-freezeable-race
> +++ a/include/linux/freezer.h
> @@ -63,8 +63,10 @@ static inline int thaw_process(struct ta
> ? */
> ?static inline void frozen_process(struct task_struct *p)
> ?{
> -???????p->flags |= PF_FROZEN;
> -???????wmb();
> +???????if (!unlikely(p->flags & PF_NOFREEZE)) {
> +???????????????p->flags |= PF_FROZEN;
> +???????????????wmb();
> +???????}
> ????????clear_tsk_thread_flag(p, TIF_FREEZE);
> ?}

This is OK if a kernel thread does try_to_freeze() eventually.

But what if it does not, because it marks itself as PF_NOFREEZE?
This means it may run with signal_pending() forever. That is why
I think we should clear TIF_FREEZE when we set PF_NOFREEZE.

Oleg.

2007-05-12 17:12:03

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

On Saturday, 12 May 2007 18:58, Oleg Nesterov wrote:
> On 05/12, Rafael J. Wysocki wrote:
> >
> > Ah, I see. We spawn a kernel thread from a code path that belongs to a
> > user space task and we need to call deamonize() to make it become a
> > 'real' kernel thread.
> >
> > Still, this means that is_user_space() may return 'true' for this thread
> > before it calls daemonize() and then the scenario described by me in the
> > previous message may occur. It seems.
>
> Yes sure. Probably not so bad in practice. Most likely this fresh thread
> is not "important" and could be freezed, I dunno.

I don't know too, and that's why I'd like to prevent this from happening.

> > It's in freezer-fix-pf_nofreeze-vs-freezeable-race.patch (appended for
> > convenience, white space may be broken).
> >
> > ---
> > --- a/include/linux/freezer.h~freezer-fix-pf_nofreeze-vs-freezeable-race
> > +++ a/include/linux/freezer.h
> > @@ -63,8 +63,10 @@ static inline int thaw_process(struct ta
> > ? */
> > ?static inline void frozen_process(struct task_struct *p)
> > ?{
> > -???????p->flags |= PF_FROZEN;
> > -???????wmb();
> > +???????if (!unlikely(p->flags & PF_NOFREEZE)) {
> > +???????????????p->flags |= PF_FROZEN;
> > +???????????????wmb();
> > +???????}
> > ????????clear_tsk_thread_flag(p, TIF_FREEZE);
> > ?}
>
> This is OK if a kernel thread does try_to_freeze() eventually.
>
> But what if it does not, because it marks itself as PF_NOFREEZE?
> This means it may run with signal_pending() forever. That is why
> I think we should clear TIF_FREEZE when we set PF_NOFREEZE.

Yes, we should.

Greetings,
Rafael

2007-05-12 17:44:08

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/7] Freezer: Read PF_BORROWED_MM in a nonracy way

I guess I should just shut up and listen to experts... ok, just my
0.02 roubles.

Again, I can't comment things like "we just should never touch kernel",
because I don't even remotely undestand the things which actually use
freezer. I can only talk about the current implementation of freezer.

On 05/11, Linus Torvalds wrote:
>
> So, here's a summary:
>
> - we should not take the lock inside the function, because taking it
> there is fundamentally wrong, and leaves all the *other* races in
> place.
>
> - if you actually want to solve the other races, the lock needs to be
> taken by the caller, in which case taking it in the callee is obviously
> (again) wrong.
>
> - or then, we accept that the race wasn't fixed AT ALL, and you add other
> code to _other_ places to handle the case where you froze the wrong
> thread (or didn't freeze the right one).
>
> And I'm not making that up. Look at most of the other patches in that
> series: they are _exactly_ about the scenario I'm outlining.
>
> - the whole "kernel thread vs user thread" thing is the wrong thing to
> check in the first place, since we just should never touch kernel
> threads in the first place, and anything that wants to freeze user
> space should have disabled exec_usermodehelper() at a higher level
>
> That's why I'm so unhappy. The "fix" is going in the wrong direction. Each
> fix on their own may be an "improvement", but the end result of many of
> the fixes is a total mess!
>
> We can continue to add bandaids to something broken, until it "works". But
> the end result, while "working", is not actually any better. Quite the
> reverse - the end result of something like that is that you add all these
> magic rules and special cases.
>
> So in the end one ugly design decision leads to broken locking, which in
> turn leads to other cases where you add more broken code, which just leads
> to a situation where nobody actually understands what the *design* is,
> because there simply *isn't* any design - it's just a hodge-podge of "but
> this fixes a bug" ad-hoc "fixes".

Yes, the current design has many limitations. The major problem is that it
can't solve the dependencies: say T1 is frozen, and T2 can't be freezed
because it waits for T1. And I don't see how we can solve this problem
in general. And this is imho the source of most uglifications.

Let's look at the code. try_to_freeze_tasks() which can choose between
user/kernel threads is static. The exported freeze_processes() freeze
all of them. So why do we need is_user_space() at all? I am not sure
I know the right answer, but I think this was done to prevent the case
when user-space needs some kthread to make a progress and notice the
signal (TIF_FREEZE).

So yes, this is ugly, and is_user_space() is not reliable, and most
fixes are "let's fix this particular case", and each fix doesn't make
the code prettier.

But, on the other hand, the freezer is very simple and not intrusive.
When I first looked at the code 2 months ago, I thought we should do
something better. Since then, I have no ideas how to do this.

Oleg.