2008-08-11 23:54:16

by Matt Helsley

Subject: [PATCH 0/5] Container Freezer v6: Reuse Suspend Freezer

This patch series introduces a cgroup subsystem that utilizes the swsusp
freezer to freeze a group of tasks. It's immediately useful for batch job
management scripts. It should also be useful in the future for implementing
container checkpoint/restart.

The freezer subsystem in the container filesystem defines a cgroup file named
freezer.state. Reading freezer.state will return the current state of the
cgroup. Writing "FROZEN" to the state file will freeze all tasks in the
cgroup. Subsequently writing "RUNNING" will unfreeze the tasks in the cgroup.

* Examples of usage :

# mkdir /containers
# mount -t cgroup -ofreezer freezer /containers
# mkdir /containers/0
# echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

# cat /containers/0/freezer.state
RUNNING

to freeze all tasks in the container :

# echo FROZEN > /containers/0/freezer.state
# cat /containers/0/freezer.state
FREEZING
# cat /containers/0/freezer.state
FROZEN

to unfreeze all tasks in the container :

# echo RUNNING > /containers/0/freezer.state
# cat /containers/0/freezer.state
RUNNING
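
Because a freeze can pass through the transitional FREEZING state shown
above, a management script will probably want to poll until the cgroup
reports FROZEN. The following is only a sketch under stated assumptions:
freeze_and_wait is a hypothetical helper name, the retry count and the
one-second delay are arbitrary, and rewriting "FROZEN" on every pass
assumes that repeating the request is harmless.

```shell
#!/bin/sh
# freeze_and_wait CGROUP_DIR [TRIES]
# Hypothetical helper: request a freeze, then poll freezer.state until it
# reads FROZEN. Returns 0 on success, 1 if the cgroup never reported FROZEN.
freeze_and_wait() {
    cg=$1
    tries=${2:-10}
    while [ "$tries" -gt 0 ]; do
        # Assumption: a failed or repeated write can simply be retried.
        { echo FROZEN > "$cg/freezer.state"; } 2>/dev/null || true
        state=$(cat "$cg/freezer.state" 2>/dev/null) || state=
        [ "$state" = "FROZEN" ] && return 0
        tries=$((tries - 1))
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1
}
```

With the mount point used above, this could be called as
"freeze_and_wait /containers/0 && echo frozen".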

Andrew, please consider these patches for -mm.

Cheers,
-Matt Helsley

Changes since v5:
v6:
Merged the patch using the cgroups write_string() method since Linus
has merged the patch supporting it.
Moved header file modifications in patch 1 to
arch/$ARCH/include/asm/thread_info.h where appropriate.
Moved cgroup_freezer.h contents into cgroup_freezer.c and freezer.h
Added CONFIG_FREEZER to help conditionally build the freezer code.
This required some simplifications of the second patch.
Fix a lock ordering problem with the freezer->lock reacquire code
Required order: freezer->lock, css_set_lock, task->alloc_lock
Reacquiring: css_set_lock, task->alloc_lock, freezer->lock
Solution: change freezer_fork() to not require any ordering
between task->alloc_lock and freezer->lock
v5:
Split out write_string as a separate patch for easier merging
with trees lacking certain cgroup patches at the time.
Checked use of task alloc lock for races with swsusp freeze/thaw --
looks safe because there are explicit barriers to handle
freeze/thaw races for individual tasks, we explicitly
handle partial group freezing, and partial group thawing
should be resolved without changing swsusp's loop.
Updated the patches to Linus' git tree as of approximately
7/31/2008.
Added Pavel and Serge's Acked-by lines to Acked patches

v4 (Almost all of these changes are confined to patch 3):
Reworked the series to use task_lock() instead of RCU.
Reworked the series to use write_string() and read_seq_string()
cgroup methods.
Fixed the race Paul Menage identified.
Fixed up check_if_frozen() to do more than just test the FROZEN
flag. In some cases tasks could be stopped (T) and marked
FREEZING. When that happens we can safely assume that it
will be frozen immediately upon waking up in the kernel.
Waiting for it to get marked with PF_FROZEN in order to
transition to the FROZEN state would block unnecessarily.
Removed freezer_ prefix from static functions in cgroup_freezer.c.
Simplified STATE_ switch.
Updated the locking comments.

v3:
Ported to 2.6.26-rc5-mm2 with Rafael's freezer patches
Tested on 24 combinations of 3 architectures (x86, x86_64, ppc64)
with 8 different kernel configs varying power management
and cgroup config variables. Each patch builds and boots
in these 24 combinations.
Passes functional testing.

v2 (roughly patches 3 and 5):
Moved the "kill" file into a separate cgroup subsystem (signal) and
its own patch.
Changed the name of the file from freezer.freeze to freezer.state.
Switched from taking 1 and 0 as input to the strings "FROZEN" and
"RUNNING", respectively. This helps keep the interface
human-usable if/when we need more states.
Checked that stopped or interrupted is "frozen enough"
Since try_to_freeze() is called upon wakeup of these tasks
this should be fine. This idea comes from recent changes to
the freezer.
Checked that if (task == current) whilst freezing cgroup we're ok
Fixed bug where -EBUSY would always be returned when freezing
Added code to handle userspace retries for any remaining -EBUSY

--


2008-08-12 22:45:49

by Andrew Morton

Subject: Re: [PATCH 0/5] Container Freezer v6: Reuse Suspend Freezer

On Mon, 11 Aug 2008 16:53:23 -0700
Matt Helsley <[email protected]> wrote:

> This patch series introduces a cgroup subsystem that utilizes the swsusp
> freezer to freeze a group of tasks. It's immediately useful for batch job
> management scripts. It should also be useful in the future for implementing
> container checkpoint/restart.

I don't think that this provides anything like sufficient detail to justify
merging a whole bunch of stuff into Linux.

What does "It's immediately useful for batch job management scripts."
mean? How is it useful? Examples? Why would an operator want this
feature, and how would it be used? _much_ more information is needed!

Once we've actually found out what this work is useful for, we can move
onto identification of and discussion of alternatives. One would be "why not
use plain old SIGSTOP?" Another alternative is, of course "that's not useful
enough to justify merging the code". But we don't know yet, coz you didn't
tell us.

2008-08-13 03:47:27

by Vivek Kashyap

Subject: Re: [linux-pm] [PATCH 0/5] Container Freezer v6: Reuse Suspend Freezer

On Tue, 12 Aug 2008, Andrew Morton wrote:

> On Mon, 11 Aug 2008 16:53:23 -0700
> Matt Helsley <[email protected]> wrote:
>
>> This patch series introduces a cgroup subsystem that utilizes the swsusp
>> freezer to freeze a group of tasks. It's immediately useful for batch job
>> management scripts. It should also be useful in the future for implementing
>> container checkpoint/restart.
>
> I don't think that this provides anything like sufficient detail to justify
> merging a whole bunch of stuff into Linux.
>
> What does "It's immediately useful for batch job management scripts."
> mean? How is it useful? Examples? Why would an operator want this
> feature, and how would it be used? _much_ more information is needed!

A batch manager/job scheduler (such as LoadLeveler) must at times stop all
tasks associated with a workload being run in a container. The workload may
consist of multiple tasks, some of which are in different sessions.
A signal (STOP/CONT) sent to the container's 'init' won't be transmitted to all
the tasks in the container. The 'freezer' mechanism allows this control
to be implemented in a clean way.

Vivek
>
> Once we've actually found out what this work is useful for, we can move
> onto identification of and discussion of alternatives. One would be "why not
> use plain old SIGSTOP?" Another alternative is, of course "that's not useful
> enough to justify merging the code". But we don't know yet, coz you didn't
> tell us.

__

Vivek Kashyap
Linux Technology Center, IBM

2008-08-13 04:09:31

by Andrew Morton

Subject: Re: [linux-pm] [PATCH 0/5] Container Freezer v6: Reuse Suspend Freezer

On Tue, 12 Aug 2008 20:47:10 -0700 (Pacific Daylight Time) Vivek Kashyap <[email protected]> wrote:

> On Tue, 12 Aug 2008, Andrew Morton wrote:
>
> > On Mon, 11 Aug 2008 16:53:23 -0700
> > Matt Helsley <[email protected]> wrote:
> >
> >> This patch series introduces a cgroup subsystem that utilizes the swsusp
> >> freezer to freeze a group of tasks. It's immediately useful for batch job
> >> management scripts. It should also be useful in the future for implementing
> >> container checkpoint/restart.
> >
> > I don't think that this provides anything like sufficient detail to justify
> > merging a whole bunch of stuff into Linux.
> >
> > What does "It's immediately useful for batch job management scripts."
> > mean? How is it useful? Examples? Why would an operator want this
> > feature, and how would it be used? _much_ more information is needed!
>
> A batch-manager/job scheduler (such as loadleveler)

what's that?

> must at times stop all
> tasks associated with a workload being run in a container.

why?

I'm being deliberately obtuse here, but I'm afraid you guys haven't
come anywhere into the vague nearby neighbourhood of adequately describing
this feature.

Please provide proper and full reasons for merging this code into
Linux. If they exist. This shouldn't be too hard.

Please put yourself in my position:

me: [patch] <this stuff>
Linus: why are you sending me this?
me: I have not the faintest idea

trust me - many others will be in my position too.

> The workload may
> constitute of multiple tasks - some of which are in different sessions.
> A signal (STOP/CONT) to the Containers 'init' wont be transmitted to all
> the tasks in the Container. The 'freezer' mechanism allows this control
> to be implemented in a clean way.

So why not implement a send-signal-to-all-tasks-in-a-container
controller?
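
For comparison, a userspace approximation of signalling every task in a
container might look like the sketch below. It is not the controller
asked about: signal_all_tasks is a hypothetical helper name, and it
assumes only that the cgroup's tasks file lists one PID per line, as in
the usage examples earlier in the thread. A task that forks after the
file has been read would be missed, so this is inherently racy.

```shell
#!/bin/sh
# signal_all_tasks CGROUP_DIR SIGNAL
# Hypothetical helper: send SIGNAL to every PID listed in CGROUP_DIR/tasks.
signal_all_tasks() {
    cg=$1
    sig=$2
    while read -r pid; do
        # A task may already have exited; ignore failures for that PID.
        kill -s "$sig" "$pid" 2>/dev/null || true
    done < "$cg/tasks"
}
```

For example, "signal_all_tasks /containers/0 STOP" followed later by
"signal_all_tasks /containers/0 CONT".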

2008-08-15 21:54:56

by Matt Helsley

Subject: Re: [linux-pm] [PATCH 0/5] Container Freezer v6: Reuse Suspend Freezer


On Tue, 2008-08-12 at 21:08 -0700, Andrew Morton wrote:
> On Tue, 12 Aug 2008 20:47:10 -0700 (Pacific Daylight Time) Vivek Kashyap <[email protected]> wrote:
>
> > On Tue, 12 Aug 2008, Andrew Morton wrote:
> >
> > > On Mon, 11 Aug 2008 16:53:23 -0700
> > > Matt Helsley <[email protected]> wrote:
> > >
> > >> This patch series introduces a cgroup subsystem that utilizes the swsusp
> > >> freezer to freeze a group of tasks. It's immediately useful for batch job
> > >> management scripts. It should also be useful in the future for implementing
> > >> container checkpoint/restart.
> > >
> > > I don't think that this provides anything like sufficient detail to justify
> > > merging a whole bunch of stuff into Linux.
> > >
> > > What does "It's immediately useful for batch job management scripts."
> > > mean? How is it useful? Examples? Why would an operator want this
> > > feature, and how would it be used? _much_ more information is needed!
> >
> > A batch-manager/job scheduler (such as loadleveler)
>
> what's that?
>
> > must at times stop all
> > tasks associated with a workload being run in a container.
>
> why?
>
> I'm being deliberately obtuse here, but I'm afraid you guys haven't
> come anywhere into the vague nearby neighbourhood of adequately describing
> this feature.
>
> Please provide proper and full reasons for merging this code into
> Linux. If they exist. This shouldn't be too hard.
>
> Please put yourself in my position:
>
> me: [patch] <this stuff>
> Linus: why are you sending me this?
> me: I have not the faintest idea
>
> trust me - many others will be in my position too.

Hi Andrew,

Sorry for being so quiet. I've been carefully considering your email
and composing what I hope is a much better description of why the code
should eventually be merged:

The cgroup freezer is useful to batch job management systems which start
and stop sets of tasks in order to schedule the resources of a machine
according to the desires of a system administrator. This sort of program
is often used on HPC clusters to schedule access to the cluster as a
whole. The cgroup freezer uses cgroups to describe the set of tasks to
be started/stopped by the batch job management system. It also provides
a means to start and stop the tasks composing the job.

The cgroup freezer will also be useful for checkpointing running groups
of tasks. The freezer allows the checkpoint code to obtain a consistent
image of the tasks by attempting to force the tasks in a cgroup into a
quiescent state. Once the tasks are quiescent another task can
walk /proc or invoke a kernel interface to gather information about the
quiesced tasks. Checkpointed tasks can be restarted later should a
recoverable error occur. This also allows the checkpointed tasks to be
migrated between nodes in a cluster by copying the gathered information
to another node and restarting the tasks there.
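
As a rough illustration of the "walk /proc" step, the sketch below lists
the state letter that /proc reports for each task in a cgroup. It is
not checkpoint code: task_states is a hypothetical helper name, the
optional second argument exists only so the proc root can be overridden,
and it assumes the tasks file holds one PID per line and that
/proc/PID/status contains a "State:" line.

```shell
#!/bin/sh
# task_states CGROUP_DIR [PROC_ROOT]
# Hypothetical helper: print "PID STATE" for each task in CGROUP_DIR/tasks,
# where STATE is the letter from the State: line of PROC_ROOT/PID/status.
task_states() {
    cg=$1
    proc=${2:-/proc}
    while read -r pid; do
        # A task may exit between reading the list and reading /proc.
        [ -r "$proc/$pid/status" ] || continue
        state=$(awk '/^State:/ { print $2; exit }' "$proc/$pid/status")
        echo "$pid $state"
    done < "$cg/tasks"
}
```

Running "task_states /containers/0" after the cgroup reports FROZEN
would give one line per task that a checkpoint tool could start from.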

Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping
and resuming tasks in userspace. Both of these signals are observable
from within the tasks we wish to freeze. While SIGSTOP cannot be caught,
blocked, or ignored, it can be seen by waiting or ptracing parent tasks.
SIGCONT is especially unsuitable since it can be caught by the task. Any
programs designed to watch for SIGSTOP and SIGCONT could be broken by
attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can
demonstrate this problem using nested bash shells:

$ echo $$
16644
$ bash
$ echo $$
16690

From a second, unrelated bash shell:
$ kill -SIGSTOP 16690
$ kill -SIGCONT 16690

<at this point 16690 exits and causes 16644 to exit too>

This happens because bash can observe both signals and choose how it
responds to them.
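
The observability of SIGCONT can also be shown without an interactive
shell. The sketch below is a demonstration only: observe_sigcont, the
on_cont handler, and the LOG variable are hypothetical names, and the
one-second sleeps are arbitrary. A child shell installs a SIGCONT
handler, the parent stops and continues it, and the handler's log line
shows that the task noticed the cycle.

```shell
#!/bin/bash
# observe_sigcont: hypothetical demo helper. Starts a child shell that traps
# SIGCONT, stops and continues it, then prints what the child recorded.
observe_sigcont() {
    local log child i
    log=$(mktemp)

    # The child appends a line whenever its SIGCONT handler runs.
    LOG="$log" bash -c '
        on_cont() { echo saw-SIGCONT >> "$LOG"; }
        trap on_cont CONT
        echo ready >> "$LOG"
        while :; do sleep 1; done
    ' &
    child=$!

    # Wait (up to about 10 seconds) for the handler to be installed.
    i=0
    until grep -q ready "$log" || [ "$i" -ge 10 ]; do
        sleep 1
        i=$((i + 1))
    done

    kill -STOP "$child"
    sleep 1
    kill -CONT "$child"

    # Wait (up to about 10 seconds) for the handler to run.
    i=0
    until grep -q saw-SIGCONT "$log" || [ "$i" -ge 10 ]; do
        sleep 1
        i=$((i + 1))
    done

    kill -TERM "$child" 2>/dev/null || true
    wait "$child" 2>/dev/null || true
    grep saw-SIGCONT "$log"
    rm -f "$log"
}
```

Calling observe_sigcont should print the line written by the child's
handler, which is exactly the kind of side effect a task would not see
if it were frozen and thawed instead.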

Another example of a program which catches and responds to these
signals is gdb. In fact any program designed to use ptrace is likely to
have a problem with this method of stopping and resuming tasks.

In contrast, the cgroup freezer uses the kernel freezer code to
prevent the freeze/unfreeze cycle from becoming visible to the tasks
being frozen. This allows the bash example above and gdb to run as
expected.

> > The workload may
> > constitute of multiple tasks - some of which are in different sessions.
> > A signal (STOP/CONT) to the Containers 'init' wont be transmitted to all
> > the tasks in the Container. The 'freezer' mechanism allows this control
> > to be implemented in a clean way.
>
> So why not implement a send-signal-to-all-tasks-in-a-container
> controller?

I have posted such a controller to the containers list in the past. For
the reasons cited above I don't think it's suitable as a replacement for
the freezer controller.

Please let me know if the reasons for merging this code remain unclear.

Cheers,
-Matt Helsley