commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
the nr_uninterruptible accounting on freeze/thaw. On freeze the task
is excluded from accounting with a check for (task->flags &
PF_FROZEN), but that flag is cleared before the task is thawed. So
while we prevent that the freezing task with state
TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
nr_uninterruptible on thaw.
Use a separate flag which is handled by the freezing task itself. Set
it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
clear it after we return from frozen state.
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Nathan Lynch <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nigel Cunningham <[email protected]>
Cc: <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Matt Helsley <[email protected]>
---
include/linux/sched.h | 3 ++-
kernel/freezer.c | 7 +++++++
2 files changed, 9 insertions(+), 1 deletion(-)
Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -209,7 +209,7 @@ extern unsigned long long time_sync_thre
((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
#define task_contributes_to_load(task) \
((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0)
+ (task->flags & PF_FREEZING) == 0)
#define __set_task_state(tsk, state_value) \
do { (tsk)->state = (state_value); } while (0)
@@ -1680,6 +1680,7 @@ extern cputime_t task_gtime(struct task_
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_FLUSHER 0x00001000 /* responsible for disk writeback */
#define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */
+#define PF_FREEZING 0x00004000 /* freeze in progress. do not account to load */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
Index: linux-2.6/kernel/freezer.c
===================================================================
--- linux-2.6.orig/kernel/freezer.c
+++ linux-2.6/kernel/freezer.c
@@ -44,12 +44,19 @@ void refrigerator(void)
recalc_sigpending(); /* We sent fake signal, clean it up */
spin_unlock_irq(¤t->sighand->siglock);
+ /* prevent accounting of that task to load */
+ current->flags |= PF_FREEZING;
+
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!frozen(current))
break;
schedule();
}
+
+ /* Remove the accounting blocker */
+ current->flags &= ~PF_FREEZING;
+
pr_debug("%s left refrigerator\n", current->comm);
__set_current_state(save);
}
On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
> plain text document attachment (freezer-fix-accounting-for-real.patch)
> commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> is excluded from accounting with a check for (task->flags &
> PF_FROZEN), but that flag is cleared before the task is thawed. So
> while we prevent that the freezing task with state
> TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
> nr_uninterruptible on thaw.
>
> Use a separate flag which is handled by the freezing task itself. Set
> it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> clear it after we return from frozen state.
Right, so I'm wondering why we don't fully revert e3c8ca8336 to begin
with.
The changelog reads:
---
commit e3c8ca8336707062f3f7cb1cd7e6b3c753baccdd
Author: Nathan Lynch <[email protected]>
Date: Wed Apr 8 19:45:12 2009 -0500
sched: do not count frozen tasks toward load
Freezing tasks via the cgroup freezer causes the load average to climb
because the freezer's current implementation puts frozen tasks in
uninterruptible sleep (D state).
Some applications which perform job-scheduling functions consult the
load average when making decisions. If a cgroup is frozen, the load
average does not provide a useful measure of the system's utilization
to such applications. This is especially inconvenient if the job
scheduler employs the cgroup freezer as a mechanism for preempting low
priority jobs. Contrast this with using SIGSTOP for the same purpose:
the stopped tasks do not count toward system load.
Change task_contributes_to_load() to return false if the task is
frozen. This results in /proc/loadavg behavior that better meets
users' expectations.
---
It appears to me that a frozen cgroup is a transient state. Either you
would typically do something like:
freeze -> {snapshot, migrate} -> {thaw, destroy}
Therefore a short increase in load doesn't seem like too big a problem,
its going to be gone soon anyway.
Hmm?
On Fri, 17 Jul 2009, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
> > plain text document attachment (freezer-fix-accounting-for-real.patch)
> > commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> > the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> > is excluded from accounting with a check for (task->flags &
> > PF_FROZEN), but that flag is cleared before the task is thawed. So
> > while we prevent that the freezing task with state
> > TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
> > nr_uninterruptible on thaw.
> >
> > Use a separate flag which is handled by the freezing task itself. Set
> > it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> > clear it after we return from frozen state.
>
> Right, so I'm wondering why we don't fully revert e3c8ca8336 to begin
> with.
Fine with me, but it seems that the cgroup folks have some luser space
stuff looking at proc/loadavg which goes berserk when loadavg
increases rapidly due to freezing. OTOH that stuff seems to be
oblivious to the fact that the commit in question brings loadavg
irreversibly to 0 when you do enough freeze/thaw cycles.
So either we revert or apply the fix, which keeps the accounting
straight and solves the freezer loadavg thing. No strong opinion on
that.
Thanks,
tglx
On Fri, Jul 17, 2009 at 12:25:01PM -0000, Thomas Gleixner wrote:
> commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> is excluded from accounting with a check for (task->flags &
> PF_FROZEN), but that flag is cleared before the task is thawed. So
> while we prevent that the freezing task with state
> TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
> nr_uninterruptible on thaw.
>
> Use a separate flag which is handled by the freezing task itself. Set
> it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> clear it after we return from frozen state.
I'm sorry, it's not clear to me based on the description what problem is
being fixed. As far as I can see PF_FROZEN is almost exactly the same as
PF_FREEZING. When a task is being frozen TIF_FREEZE is set. Then the task
enters the refrigerator, sets PF_FROZEN, and schedule()s until PF_FROZEN
is no longer set. The original code with extra comments:
static inline void frozen_process(void)
{
if (!unlikely(current->flags & PF_NOFREEZE)) {
current->flags |= PF_FROZEN;
wmb();
}
clear_freeze_flag(current);
}
/* Refrigerator is place where frozen processes are stored :-). */
void refrigerator(void)
{
/* Hmm, should we be allowed to suspend when there are realtime
processes around? */
long save;
task_lock(current);
if (freezing(current)) {
/* prevent accounting of that task to load */
frozen_process(); /* <-- sets PF_FROZEN */
task_unlock(current);
} else {
task_unlock(current);
return;
}
save = current->state;
pr_debug("%s entered refrigerator\n", current->comm);
spin_lock_irq(¤t->sighand->siglock);
recalc_sigpending(); /* We sent fake signal, clean it up */
spin_unlock_irq(¤t->sighand->siglock);
/* you set PF_FREEZING here */
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
if (!frozen(current)) /* <-- checks PF_FROZEN */
break;
schedule();
}
/* you clear PF_FREEZING here */
pr_debug("%s left refrigerator\n", current->comm);
__set_current_state(save);
}
>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Cc: Nathan Lynch <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Nigel Cunningham <[email protected]>
> Cc: <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Matt Helsley <[email protected]>
>
> ---
> include/linux/sched.h | 3 ++-
> kernel/freezer.c | 7 +++++++
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/include/linux/sched.h
> ===================================================================
> --- linux-2.6.orig/include/linux/sched.h
> +++ linux-2.6/include/linux/sched.h
> @@ -209,7 +209,7 @@ extern unsigned long long time_sync_thre
> ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
> #define task_contributes_to_load(task) \
> ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
> - (task->flags & PF_FROZEN) == 0)
> + (task->flags & PF_FREEZING) == 0)
>
> #define __set_task_state(tsk, state_value) \
> do { (tsk)->state = (state_value); } while (0)
> @@ -1680,6 +1680,7 @@ extern cputime_t task_gtime(struct task_
> #define PF_MEMALLOC 0x00000800 /* Allocating memory */
> #define PF_FLUSHER 0x00001000 /* responsible for disk writeback */
> #define PF_USED_MATH 0x00002000 /* if unset the fpu must be initialized before use */
> +#define PF_FREEZING 0x00004000 /* freeze in progress. do not account to load */
> #define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
> #define PF_FROZEN 0x00010000 /* frozen for system suspend */
> #define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
> Index: linux-2.6/kernel/freezer.c
> ===================================================================
> --- linux-2.6.orig/kernel/freezer.c
> +++ linux-2.6/kernel/freezer.c
> @@ -44,12 +44,19 @@ void refrigerator(void)
> recalc_sigpending(); /* We sent fake signal, clean it up */
> spin_unlock_irq(¤t->sighand->siglock);
>
> + /* prevent accounting of that task to load */
> + current->flags |= PF_FREEZING;
> +
> for (;;) {
> set_current_state(TASK_UNINTERRUPTIBLE);
> if (!frozen(current))
> break;
> schedule();
> }
> +
> + /* Remove the accounting blocker */
> + current->flags &= ~PF_FREEZING;
> +
Hence PF_FREEZING covers slightly less time than PF_FROZEN but
otherwise does not change the way nr_uninterruptible is incremented or
decremented (in (de)activate_task()).
So it's not clear to me how adding PF_FREEZING fixes anything. Am I
missing something?
Cheers,
-Matt Helsley
On Fri, 2009-07-17 at 08:51 -0700, Matt Helsley wrote:
> So it's not clear to me how adding PF_FREEZING fixes anything. Am I
> missing something?
Yep, PF_FROZEN is only observed going into schedule(), on the way back
out thaw_process() will already have cleared the flag.
So we don't increment in deactivate_task(), but we do decrement in
activate_task(), yielding for some very unbalanced accounting.
On Fri, Jul 17, 2009 at 02:31:50PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
> > plain text document attachment (freezer-fix-accounting-for-real.patch)
> > commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> > the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> > is excluded from accounting with a check for (task->flags &
> > PF_FROZEN), but that flag is cleared before the task is thawed. So
> > while we prevent that the freezing task with state
> > TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
> > nr_uninterruptible on thaw.
> >
> > Use a separate flag which is handled by the freezing task itself. Set
> > it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> > clear it after we return from frozen state.
>
> Right, so I'm wondering why we don't fully revert e3c8ca8336 to begin
> with.
>
> The changelog reads:
>
> ---
> commit e3c8ca8336707062f3f7cb1cd7e6b3c753baccdd
> Author: Nathan Lynch <[email protected]>
> Date: Wed Apr 8 19:45:12 2009 -0500
>
> sched: do not count frozen tasks toward load
>
> Freezing tasks via the cgroup freezer causes the load average to climb
> because the freezer's current implementation puts frozen tasks in
> uninterruptible sleep (D state).
>
> Some applications which perform job-scheduling functions consult the
> load average when making decisions. If a cgroup is frozen, the load
> average does not provide a useful measure of the system's utilization
> to such applications. This is especially inconvenient if the job
> scheduler employs the cgroup freezer as a mechanism for preempting low
> priority jobs. Contrast this with using SIGSTOP for the same purpose:
> the stopped tasks do not count toward system load.
>
> Change task_contributes_to_load() to return false if the task is
> frozen. This results in /proc/loadavg behavior that better meets
> users' expectations.
> ---
>
> It appears to me that a frozen cgroup is a transient state. Either you
> would typically do something like:
>
> freeze -> {snapshot, migrate} -> {thaw, destroy}
>
> Therefore a short increase in load doesn't seem like too big a problem,
> its going to be gone soon anyway.
>
> Hmm?
The job scheduler in question does not use FROZEN as a transient state and
does not use checkpoint/restart at all since c/r is still a work in progress.
Even when used for power management it seems wrong to count frozen tasks
towards the loadavg since they aren't using CPU time or waiting for IO.
Cheers,
-Matt Helsley
On Fri, Jul 17, 2009 at 05:55:27PM +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 08:51 -0700, Matt Helsley wrote:
>
> > So it's not clear to me how adding PF_FREEZING fixes anything. Am I
> > missing something?
>
> Yep, PF_FROZEN is only observed going into schedule(), on the way back
> out thaw_process() will already have cleared the flag.
>
> So we don't increment in deactivate_task(), but we do decrement in
> activate_task(), yielding for some very unbalanced accounting.
Ahhh, right!
OK, so the fix looks good then.
Cheers,
-Matt Helsley
On Fri, 2009-07-17 at 08:22 -0700, Matt Helsley wrote:
> The job scheduler in question does not use FROZEN as a transient state and
> does not use checkpoint/restart at all since c/r is still a work in progress.
> Even when used for power management it seems wrong to count frozen tasks
> towards the loadavg since they aren't using CPU time or waiting for IO.
You're abusing it for _WHAT_?
On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
> commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
> the nr_uninterruptible accounting on freeze/thaw. On freeze the task
> is excluded from accounting with a check for (task->flags &
> PF_FROZEN), but that flag is cleared before the task is thawed. So
> while we prevent that the freezing task with state
> TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
> nr_uninterruptible on thaw.
>
> Use a separate flag which is handled by the freezing task itself. Set
> it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
> clear it after we return from frozen state.
Thanks for the fix, and sorry for the regression. I'll test this early
next week but I don't anticipate any problems for the use case the
original patch was addressing.
On Fri, 2009-07-17 at 18:47 +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 08:22 -0700, Matt Helsley wrote:
>
> > The job scheduler in question does not use FROZEN as a transient state and
> > does not use checkpoint/restart at all since c/r is still a work in progress.
Right, the job scheduler uses the cgroup freezer as a mechanism to
preempt a low priority job for a higher priority job. (It had used
SIGSTOP in the past.) So in this scenario a frozen cgroup may remain in
that state for a while. Load average is consulted as a measure of
system utilization.
> > Even when used for power management it seems wrong to count frozen tasks
> > towards the loadavg since they aren't using CPU time or waiting for IO.
>
> You're abusing it for _WHAT_?
I think Matt was referring to system-wide suspend/resume/hibernate, not
a behavior of the job scheduler, if that's your concern.
Hi.
Thomas Gleixner wrote:
> On Fri, 17 Jul 2009, Peter Zijlstra wrote:
>
>> On Fri, 2009-07-17 at 12:25 +0000, Thomas Gleixner wrote:
>>> plain text document attachment (freezer-fix-accounting-for-real.patch)
>>> commit e3c8ca8336 (sched: do not count frozen tasks toward load) broke
>>> the nr_uninterruptible accounting on freeze/thaw. On freeze the task
>>> is excluded from accounting with a check for (task->flags &
>>> PF_FROZEN), but that flag is cleared before the task is thawed. So
>>> while we prevent that the freezing task with state
>>> TASK_UNINTERRUPTIBLE is accounted to nr_uninterruptible we decrement
>>> nr_uninterruptible on thaw.
>>>
>>> Use a separate flag which is handled by the freezing task itself. Set
>>> it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
>>> clear it after we return from frozen state.
>> Right, so I'm wondering why we don't fully revert e3c8ca8336 to begin
>> with.
>
> Fine with me, but it seems that the cgroup folks have some luser space
> stuff looking at proc/loadavg which goes berserk when loadavg
> increases rapidly due to freezing. OTOH that stuff seems to be
> oblivious to the fact that the commit in question brings loadavg
> irreversibly to 0 when you do enough freeze/thaw cycles.
That's right. I don't remember the exact details (it's been ages since
this was first reported to me), but sendmail (IIRC) was looking at the
load average and stopping delivery until it dropped sufficiently.
Historically, my solution has been to save and restore the load average
values (on the theory that hibernation should be transparent to
userspace), but I dropped that in favour of this approach.
Regards,
Nigel
On Fri, 2009-07-17 at 15:55 -0500, Nathan Lynch wrote:
> On Fri, 2009-07-17 at 18:47 +0200, Peter Zijlstra wrote:
> > On Fri, 2009-07-17 at 08:22 -0700, Matt Helsley wrote:
> >
> > > The job scheduler in question does not use FROZEN as a transient state and
> > > does not use checkpoint/restart at all since c/r is still a work in progress.
>
> Right, the job scheduler uses the cgroup freezer as a mechanism to
> preempt a low priority job for a higher priority job. (It had used
> SIGSTOP in the past.) So in this scenario a frozen cgroup may remain in
> that state for a while. Load average is consulted as a measure of
> system utilization.
I think that this is an utterly broken use for it, if you want something
like that make a signal cgroup or something and deliver SIGSTOP to all
of them.
In other words, why is the freezer any better than the SIGSTOP approach?
> > > Even when used for power management it seems wrong to count frozen tasks
> > > towards the loadavg since they aren't using CPU time or waiting for IO.
> >
> > You're abusing it for _WHAT_?
>
> I think Matt was referring to system-wide suspend/resume/hibernate, not
> a behavior of the job scheduler, if that's your concern.
I understood he referred to the crazy use-case you mentioned above, IMHO
frozen should be a temporary state used for things like
snapshot/migrate.
I'm still very tempted to plain simply revert that original patch.
On Sat, 2009-07-18 at 14:56 +0200, Peter Zijlstra wrote:
> On Fri, 2009-07-17 at 15:55 -0500, Nathan Lynch wrote:
> > On Fri, 2009-07-17 at 18:47 +0200, Peter Zijlstra wrote:
> > > On Fri, 2009-07-17 at 08:22 -0700, Matt Helsley wrote:
> > >
> > > > The job scheduler in question does not use FROZEN as a transient state and
> > > > does not use checkpoint/restart at all since c/r is still a work in progress.
> >
> > Right, the job scheduler uses the cgroup freezer as a mechanism to
> > preempt a low priority job for a higher priority job. (It had used
> > SIGSTOP in the past.) So in this scenario a frozen cgroup may remain in
> > that state for a while. Load average is consulted as a measure of
> > system utilization.
>
> I think that this is an utterly broken use for it, if you want something
> like that make a signal cgroup or something and deliver SIGSTOP to all
> of them.
>
> In other words, why is the freezer any better than the SIGSTOP approach?
Documentation/cgroups/freezer-subsystem.txt happens to document this use
case and the disadvantages of SIGSTOP/SIGCONT. Does that change your
opinion at all?
> > > > Even when used for power management it seems wrong to count frozen tasks
> > > > towards the loadavg since they aren't using CPU time or waiting for IO.
> > >
> > > You're abusing it for _WHAT_?
> >
> > I think Matt was referring to system-wide suspend/resume/hibernate, not
> > a behavior of the job scheduler, if that's your concern.
>
> I understood he referred to the crazy use-case you mentioned above, IMHO
> frozen should be a temporary state used for things like
> snapshot/migrate.
But snapshot (or checkpoint) and migration aren't possible with mainline
at this time. As far as I know, the use case to which you object is the
primary use of the cgroup freezer on production systems.